Fritz, M. (2023). Decision tree classification with missing values [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2023.105505
E180 - Faculty of Informatics; E105-06 - Research Unit Computational Statistics
-
Date (published):
2023
-
Extent:
82
-
Keywords:
classification tree; random forests
Abstract:
Data analyses play a significant role nowadays, as important decisions are often taken based on them. The continuous growth of data increases the challenges of such analyses, for example data incorrectness, difficult data cleaning and processing, and runtime issues. A common challenge faced by most data sets is incompleteness: more often than not, data sets are incomplete, and appropriate handling of missing values is crucial to achieve reliable and robust results.

The present thesis focuses on the problem of finding appropriate strategies to handle missing values in real data when training a decision tree or random forest classifier. In the theoretical part, it analyzes several strategies applied by decision trees and random forests to handle missing values directly during training, in addition to outlining a single and a multiple imputation approach, namely k-nearest neighbor (kNN) imputation and multivariate imputation by chained equations (MICE).

For the study part, real data sets provided by UNIQA Insurance Group are used. These data sets have been merged, enriched with additional information, pre-processed, and explored. The study part is divided into two sections: a simulation study and a case study. The goal of the simulation study is to identify whether the missing data handling techniques improve the classification accuracy compared to decision trees and random forests that take only complete observations into account. Additionally, it examines how the performance of decision trees and random forests that handle missing values directly compares to that of decision trees and random forests trained on imputed data. For that purpose, different percentages of missing values are artificially created in the complete data set to study the effect of the different methods for handling incomplete data. The aim of the case study is to check the extent to which the missing "occupancy" attributes of buildings in the data, that is, the use or purpose of buildings such as bakery, family house, or hospital, can be predicted using a decision tree or random forest classifier with appropriate missing data handling techniques. To this end, the most promising approach from the simulation study is trained and evaluated on the full provided data set.

The simulation study concludes that the performance of decision trees and random forests is similar for all the missing data handling techniques used. Training decision trees or random forests on imputed data, or using techniques that handle missing values directly during training, does not considerably improve performance compared to simply using the complete observations. Furthermore, simple strategies of decision trees and random forests for handling missing values directly are competitive with complex imputation approaches. These outcomes are primarily explained by the special missingness pattern reflected in the data. However, the choice of decision tree or random forest method does have a significant impact on performance.

The key finding of the case study is that random forests combined with the separate class method, where missing values are treated as a separate category, produce highly satisfactory results in predicting the occupancy of buildings, achieving an accuracy of up to 90%.
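To illustrate the separate class strategy highlighted above, the following Python sketch treats missing values as their own category (and flags missing numeric entries) before fitting a random forest. This is a minimal sketch only, assuming pandas and scikit-learn and using invented column names such as "construction" and "floor_area"; the thesis does not prescribe this implementation, and the imputation alternatives it discusses (kNN imputation, MICE) could instead be approximated with scikit-learn's KNNImputer or IterativeImputer.

# Minimal sketch of the "separate class" strategy for missing values.
# Column names and data are illustrative, not taken from the thesis.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Toy building data with missing entries in a categorical and a numeric feature.
df = pd.DataFrame({
    "construction": ["brick", "wood", None, "brick", "concrete", None,
                     "wood", "brick", None, "concrete"],
    "floor_area": [120.0, np.nan, 85.0, 200.0, np.nan, 60.0,
                   150.0, 95.0, 310.0, np.nan],
    "occupancy": ["family house", "bakery", "family house", "hospital",
                  "bakery", "family house", "hospital", "family house",
                  "hospital", "bakery"],
})

X = df.drop(columns="occupancy")
y = df["occupancy"]

# Separate class method: missing categorical values become an explicit level;
# missing numeric values are flagged with an indicator and filled with a sentinel.
X["construction"] = X["construction"].fillna("missing")
X["floor_area_missing"] = X["floor_area"].isna().astype(int)
X["floor_area"] = X["floor_area"].fillna(-1.0)

# Encode the categorical feature as integers so the forest can split on it.
X[["construction"]] = OrdinalEncoder().fit_transform(X[["construction"]])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))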
This thesis offers significant insight into machine learning technology and its impact on the future of industry, while dealing with real-world data sets and one of their most relevant inherent challenges: incompleteness. It highlights the importance of understanding and properly handling the limitations of real-world data sets, such as missing values and their patterns. Additionally, it shows that existing machine learning models can be successfully applied to make predictions, which can be used, for example, to improve the quality of real-world data and consequently the accuracy of the analyses performed with these data.