Kircher, A. S. (2017). Random forest for unbalanced multiple-class classification [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2017.44844
E105 - Institut für Stochastik und Wirtschaftsmathematik
Date (published):
2017
Number of Pages:
127
Keywords:
Random Forest; Multiple-class classification; Unbalanced data; Error rates
Abstract:
Random Forest is a cutting-edge method for unbalanced multiple-class classification. The main problem with unbalanced data is that the classifier tends to focus more on the bigger classes than on the smaller ones. To overcome this skewness, three sampling methods, namely oversampling, undersampling and a combination of both, are introduced and compared based on the performance of the forest on a highly unbalanced data set with eleven classes. Oversampling appears to improve the performance of the forest dramatically, while undersampling often worsens it compared to classification on the unbalanced data. A combination of both nevertheless seems more adequate for the analysed data set, since the effect of oversampling on accuracy is much smaller on the test data set than the dramatic improvement seen on the training data set. The danger of overfitting is lower if the data set is not only oversampled but retains its original total size, with each class oversampled or undersampled to the same number of observations. Analysing the data showed that there are many noisy variables, which justified raising the number of variables available at each split (mtry) from the default to the value midway between the default for classification, mtry = sqrt(p), and the default value for regression, mtry = 2p/3.
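The combined sampling scheme and the mtry choice described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names `resample_to_equal_classes` and `midpoint_mtry` are hypothetical, the per-class target (total size divided by the number of classes, rounded down) is an assumption about how "the same amount of observations" is chosen, and rounding the midpoint to an integer is likewise an assumption. The two mtry endpoints are taken as stated in the abstract.

```python
import math
import random
from collections import defaultdict


def resample_to_equal_classes(labels, seed=0):
    """Return row indices in which every class has the same number of rows.

    Classes at or above the target size are undersampled without replacement;
    smaller classes keep all their rows and are topped up by sampling with
    replacement. The target is len(labels) // n_classes (an assumption), so
    the resampled data set stays close to its original total size.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    target = len(labels) // len(by_class)
    chosen = []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) >= target:
            # Undersample: draw `target` distinct rows from a large class.
            chosen.extend(rng.sample(members, target))
        else:
            # Oversample: keep every original row, then duplicate at random.
            extra = rng.choices(members, k=target - len(members))
            chosen.extend(members + extra)
    rng.shuffle(chosen)
    return chosen


def midpoint_mtry(p):
    """Midpoint of sqrt(p) and 2p/3 (the values named in the abstract),
    rounded to an integer (the rounding rule is an assumption)."""
    return max(1, round((math.sqrt(p) + 2 * p / 3) / 2))


if __name__ == "__main__":
    # Toy labels with three classes of very different sizes.
    labels = ["a"] * 90 + ["b"] * 8 + ["c"] * 2
    idx = resample_to_equal_classes(labels, seed=1)
    sizes = {c: sum(labels[i] == c for i in idx) for c in set(labels)}
    print(sizes, len(idx))
    print(midpoint_mtry(36))
```

The returned indices would be used to select the training rows only; resampling the test data as well would hide exactly the gap between training and test accuracy that the abstract warns about.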
Additional information:
Title differs from the original; translated by the author