Baston, R. (2015). Analysis of hubness and application of reduction methods on high dimensional datasets [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2015.30460
E188 - Institut für Softwaretechnik und Interaktive Systeme
Date (published):
2015
Number of Pages:
57
Keywords:
Hubness; Business Intelligence; Classification
Abstract:
Hubness is a fairly recent issue in the domains of business intelligence and machine learning that originates from an asymmetric distance relationship between points in a dataset. In a high-dimensional dataset with many points, this results in a small number of points that lie at fairly short distance from many neighbors, the so-called 'hub points', and many points that lie far from all of their neighbors, the 'anti-hubs'. Hub points therefore occur frequently as neighbors of other points, while most other points rarely occur as neighbors and consequently have little influence on subsequent classification. A Matlab hubness toolbox exists that can calculate the hubness of a dataset; however, it contains no functions for handling large datasets, and it implements neither a method to compute the distance matrix with different metrics nor a method to compare the resulting hubness values. These functions were added as part of the programming work for this thesis. The thesis aims to answer research questions about the hubness contained in datasets and about the possibility of reducing it with new reduction methods while verifying that the quality of the data is not degraded. Two of the reduction methods proposed here, the exclusion of the largest hub points and the projection onto a hypersphere, were both proposed by Abdel Aziz Taha. The third reduction method applies principal component analysis to find the most important dimensions and to visualize the hub points and the data in a plot. After implementing these reduction methods and applying them to artificial and non-artificial data, it became clear that they work as expected and reduce hubness. The PCA-based method turned out to be a non-viable approach, however, since the quality of the data and of the retrieval system was reduced in the process. The other two methods reduce hubness without damaging the quality of the data or of the retrieval system. A further general observation was that artificial datasets tended to contain more hubness than non-artificial datasets.
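To illustrate the k-occurrence counts that underlie this notion of hubness, the following minimal Python sketch (not taken from the thesis or its Matlab toolbox; the function name k_occurrence, the choice of k, the metric parameter, and the toy data are illustrative assumptions) counts how often each point appears among the k nearest neighbors of the other points. Points with unusually large counts behave as hubs, points that are never selected as anti-hubs.

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_occurrence(X, k=10, metric="euclidean"):
    """Count how often each point appears among the k nearest
    neighbors of the other points (its k-occurrence N_k)."""
    dist = cdist(X, X, metric=metric)         # full pairwise distance matrix
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbor
    counts = np.zeros(len(X), dtype=int)
    for i in range(len(X)):
        nn = np.argpartition(dist[i], k)[:k]  # k nearest neighbors of point i
        counts[nn] += 1
    return counts

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 100))           # high-dimensional toy data
    n_k = k_occurrence(X, k=10)
    # A strongly right-skewed N_k distribution (few points with very large
    # N_k, many with N_k near zero) is the symptom of high hubness.
    print("max N_k:", n_k.max(), "| anti-hubs (N_k = 0):", int((n_k == 0).sum()))
```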
Additional information:
Abstract in German; differing title according to the author's translation