Robust and sparse k-means clustering for high-dimensional data

Brodinová, Šárka; Filzmoser, Peter; Ortner, Thomas; Breiteneder, Christian; Rohm, Maia

doi:10.1007/s11634-019-00356-9

Record citation link:

https://resolver.obvsg.at/urn:nbn:at:at-ubtuw:3-7509
http://hdl.handle.net/20.500.12708/687

Title:

Robust and sparse k-means clustering for high-dimensional data

Citation:

Brodinová, Š., Filzmoser, P., Ortner, T., Breiteneder, C., & Rohm, M. (2019). Robust and sparse k-means clustering for high-dimensional data. Advances in Data Analysis and Classification, 905–932. https://doi.org/10.1007/s11634-019-00356-9

Publisher DOI:

10.1007/s11634-019-00356-9

CatalogPlus:

AC15518672

Publication type:

Article - Research article

Language:

English

Authors:

Brodinová, Šárka
Filzmoser, Peter
Ortner, Thomas
Breiteneder, Christian
Rohm, Maia

Organisational unit:

E105 - Institut für Stochastik und Wirtschaftsmathematik

Journal:

Advances in Data Analysis and Classification

ISSN:

1862-5347

Date (published):

2019

Extent:

Publisher:

Springer Nature

Peer Reviewed:

Keywords:

Clusters; Outliers; Noise variables; High-dimensions; Gap statistic

Abstract:

In real-world application scenarios, the identification of groups poses a significant challenge due to possibly occurring outliers and existing noise variables. Therefore, there is a need for a clustering method which is capable of revealing the group structure in data containing both outliers and noise variables without any pre-knowledge. In this paper, we propose a k-means-based algorithm incorporating a weighting function which leads to an automatic weight assignment for each observation. In order to cope with noise variables, a lasso-type penalty is used in an objective function adjusted by observation weights. We finally introduce a framework for selecting both the number of clusters and variables based on a modified gap statistic. The conducted experiments on simulated and real-world data demonstrate the advantage of the method to identify groups, outliers, and informative variables simultaneously.
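
The interplay of observation weights and a lasso-type penalty described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names (`soft_threshold`, `variable_weights`), the fixed threshold `delta`, and the hand-supplied observation weights `obs_w` are assumptions made for illustration. The paper assigns observation weights automatically through a weighting function and tunes the number of clusters and variables with a modified gap statistic, neither of which is reproduced here. The sketch only shows how down-weighting an outlier changes which variables survive soft-thresholding of the per-variable between-cluster sum of squares.

```python
import numpy as np

def soft_threshold(a, delta):
    # Lasso-type shrinkage: move each entry toward zero by delta, clip at zero.
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def variable_weights(X, labels, obs_w, delta):
    """Variable weights from observation-weighted between-cluster sums of squares.

    X: (n, p) data, labels: (n,) cluster labels,
    obs_w: (n,) non-negative observation weights (assumed given here),
    delta: fixed soft-threshold level (an illustrative simplification).
    """
    mu = np.average(X, axis=0, weights=obs_w)        # weighted overall mean
    total = obs_w @ (X - mu) ** 2                    # weighted total SS per variable
    within = np.zeros(X.shape[1])
    for k in np.unique(labels):
        m = labels == k
        mu_k = np.average(X[m], axis=0, weights=obs_w[m])
        within += obs_w[m] @ (X[m] - mu_k) ** 2      # weighted within-cluster SS
    bcss = total - within                            # between-cluster SS per variable
    w = soft_threshold(bcss, delta)
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w               # scale to unit L2 norm

# Toy data: variable 0 separates the two groups, variable 1 is noise.
# The last row is an outlier with an extreme value in the noise variable.
X = np.array([[0, 1], [0, -1], [0, 1], [0, -1],
              [10, 1], [10, -1], [10, 1], [10, -1],
              [5, 100]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0])

naive = variable_weights(X, labels, np.ones(9), delta=1.0)
robust = variable_weights(X, labels, np.r_[np.ones(8), 0.0], delta=1.0)
print("all observations weighted equally:", naive)
print("outlier given weight zero:        ", robust)
```

With equal observation weights the single outlier inflates the between-cluster sum of squares of the noise variable, so that variable receives the larger weight. Once the outlier is given weight zero, the noise variable is shrunk to exactly zero and only the informative variable remains.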

License:

CC BY 4.0