Priselac, S. (2022). Outlier detection for mixed-attribute data [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2022.99623
Outlier detection is a data mining technique for identifying atypical observations in data, which are called outliers or anomalies. Applications of outlier detection include removing noise from data, which leads to more accurate machine learning models, and identifying interesting observations that may arise from different data generation mechanisms. Although many data sets contain both continuous and categorical attributes, outlier detection techniques for such mixed-attribute data have not been widely used in practice thus far. This thesis examines a selection of available outlier detection techniques for mixed-attribute data and their properties in terms of effectiveness and efficiency. The analysis is limited to unsupervised scoring techniques, in which the true status of the observations is unknown and the method outputs scores rather than just a binary label. A review of the scientific literature resulted in eight methods being selected for analysis, designated by the acronyms POD, ABOD, FAMDAD, SECODA, ZDisc, KMeans, PCAmix, and MIX. Their properties are assessed through extensive simulation experiments and evaluation on real data sets. The performance of the methods for different data structures is investigated by observing the effects of outlier proportion, severity, and type, the correlation between attributes, and data size. The analysis shows that examining outlyingness appears more complex for mixed-attribute data than for data of a homogeneous type and therefore requires more careful consideration. The methods perform differently depending on whether the observations are outlying only in the continuous attribute space, only in the categorical attribute space, or in the entire attribute space. In addition, the efficiency of the methods is strongly influenced by the proportions of the mixed attribute types and the total number of attributes.