Baston, R. (2015). Analysis of hubness and application of reduction methods on high dimensional datasets [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2015.30460
E188 - Institut für Softwaretechnik und Interaktive Systeme
Date (published):
2015
Number of Pages:
57
Keywords:
Hubness; Business Intelligence; Classification
Abstract:
Hubness is a fairly recent issue in the domains of business intelligence and machine learning that originates from an asymmetric distance relationship between points in a dataset. In a high-dimensional dataset with many points, this results in a small number of points that lie at fairly short distance from many neighbors, the so-called 'hub points', and many points that lie far from all of their neighbors, the 'anti-hubs'. Hub points therefore occur frequently as neighbors of other points, while most other points rarely occur as neighbors and consequently have little influence on subsequent classification. A Matlab hubness toolbox exists that can calculate the hubness of a dataset; however, it contains no functions for handling large datasets, and it implements neither a method to compute the distance matrix with different metrics nor a method to compare the resulting hubness values. These functions were added as part of the programming work for this thesis. The thesis aims to answer research questions about the hubness contained in datasets and about the possibility of reducing it with new reduction methods while verifying that the quality of the data is not degraded. Two of the reduction methods proposed here, the exclusion of the largest hub points and the projection onto a hypersphere, were both proposed by Abdel Aziz Taha. The third reduction method applies principal component analysis to find the most important dimensions and to visualize the hub points and the data in a plot. After implementing these reduction methods and applying them to artificial and non-artificial data, it became clear that they work as expected and reduce hubness. The PCA-based method turned out to be a non-viable approach, however, since the quality of the data and of the retrieval system was reduced in the process. The other two methods reduce hubness without damaging the quality of the data or of the retrieval system. A further general observation was that artificial datasets tended to contain more hubness than non-artificial datasets.
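To illustrate the k-occurrence counts that underlie this notion of hubness, the following minimal Python sketch (not taken from the thesis or its Matlab toolbox; the function name k_occurrence, the choice of k, the metric parameter, and the toy data are illustrative assumptions) counts how often each point appears among the k nearest neighbors of the other points. Points with unusually large counts behave as hubs, points that are never selected as anti-hubs.

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_occurrence(X, k=10, metric="euclidean"):
    """Count how often each point appears among the k nearest
    neighbors of the other points (its k-occurrence N_k)."""
    dist = cdist(X, X, metric=metric)         # full pairwise distance matrix
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbor
    counts = np.zeros(len(X), dtype=int)
    for i in range(len(X)):
        nn = np.argpartition(dist[i], k)[:k]  # k nearest neighbors of point i
        counts[nn] += 1
    return counts

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 100))           # high-dimensional toy data
    n_k = k_occurrence(X, k=10)
    # A strongly right-skewed N_k distribution (few points with very large
    # N_k, many with N_k near zero) is the symptom of high hubness.
    print("max N_k:", n_k.max(), "| anti-hubs (N_k = 0):", int((n_k == 0).sum()))
```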
Additional information:
Abstract in German; differing title according to the author's translation