Clustering; Sparse data observers; Unsupervised learning; Data Analysis
Abstract:
Given a set of data points, clustering discovers groups based on pairwise similarities and on the shapes that the data draw in the feature space. In other words, it is a tool to describe data and reveal their intrinsic nature in terms of patterns or groups. In this paper, we review the methodology of clustering when it is used to explore a priori unknown data, i.e., how data spaces are manipulated, how algorithms are tuned, and how results are validated. Under this practical approach, we examine the advantages of SDOclust, a clustering method that stands out for being simple, lightweight, free of any need for parameterization, and not subject to traditional clustering limitations. We test SDOclust and the main established alternatives (HDBSCAN, k-means--, Fuzzy C-means, Hierarchical Clustering, CLASSIX, and N2D Deep Clustering) through extensive experimentation with more than 200 datasets, both real and synthetic, that have been collected from the evaluation literature and represent different data analysis challenges. Only SDOclust is subjected to unfavorable testing conditions, since it is denied a parameter tuning phase. Nevertheless, its overall performance is excellent and positions it as one of the best general-purpose alternatives.
With deep clustering consolidating as a new paradigm, current trends in clustering consist mainly of projecting data into spaces that are easier to dissect. Therefore, in cases where the original space does not show clustering-friendly structures and the transformation costs can be afforded, SDOclust adapts easily and is a natural choice for the partitioning task.
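
As an illustration of how clustering results can be validated against ground-truth labels, the following minimal sketch computes the Adjusted Rand Index (ARI) between a reference partition and a predicted one. The abstract does not state which validation index is used in the experiments, so the choice of ARI and the function name `adjusted_rand_index` are assumptions made only for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat partitions of the same points.

    Illustrative helper (hypothetical name); not taken from the paper.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)

    # Contingency counts: points sharing each (true, predicted) label pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of point pairs grouped together in both / in each partition.
    together_both = sum(comb(c, 2) for c in pair_counts.values())
    together_true = sum(comb(c, 2) for c in true_counts.values())
    together_pred = sum(comb(c, 2) for c in pred_counts.values())

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: partitions trivially agree
    expected = together_true * together_pred / total_pairs
    max_index = (together_true + together_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (together_both - expected) / (max_index - expected)


truth = [0, 0, 0, 1, 1, 1]
renamed = [1, 1, 1, 0, 0, 0]      # same grouping, different label names
one_mistake = [0, 0, 1, 1, 1, 1]  # one point assigned to the wrong group

print(adjusted_rand_index(truth, renamed))
print(adjusted_rand_index(truth, one_mistake))
```

The index is invariant to how clusters are named, equals 1 for identical groupings, and is close to 0 for random assignments, which is why indices of this kind are convenient when comparing algorithms across many labeled datasets.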