Random forest for unbalanced multiple-class classification

Kircher, Anna Sofia

doi:10.34726/hss.2017.44844

Record link:

https://doi.org/10.34726/hss.2017.44844
http://hdl.handle.net/20.500.12708/6814

Title:

Random forest for unbalanced multiple-class classification

Citation:

Kircher, A. S. (2017). Random forest for unbalanced multiple-class classification [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2017.44844

reposiTUm DOI:

10.34726/hss.2017.44844

CatalogPlus:

AC13734521

Publication Type:

Thesis - Diplomarbeit

Language:

English

Authors:

Kircher, Anna Sofia

Advisor:

Filzmoser, Peter

Organisational Unit:

E105 - Institut für Stochastik und Wirtschaftsmathematik

Date (published):

2017

Number of Pages:

127

Keywords:

Random Forest; Multiple-class classification; Unbalanced data; Error rates

Abstract:

Random Forest is a cutting-edge method for unbalanced multiple-class classification. The main problem with unbalanced data is that the classifier tends to focus on the larger classes at the expense of the smaller ones. To counter this skewness, three sampling methods (oversampling, undersampling and a combination of both) are introduced and compared by the performance of the forest on a highly unbalanced data set with eleven classes. Oversampling appears to improve the performance of the forest dramatically, while undersampling often worsens it relative to classification on the unbalanced data. A combination of both nevertheless seems more adequate for the analysed data set, since the gain in accuracy from oversampling is much smaller on the test data than the dramatic improvement on the training data. The danger of overfitting is lower if the data set is not only oversampled but retains its original total size, with every class oversampled or undersampled to the same number of observations. Analysis of the data showed that there are many noisy variables, which justified raising the number of variables available at each split (mtry) from the default to the midpoint between the default for classification, mtry = sqrt(p), and the default for regression, mtry = 2p/3.
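
The combined sampling scheme and the mtry choice described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names resample_to_equal_classes and midpoint_mtry, the equal per-class target of total // number_of_classes, and the rounding of the midpoint to an integer are all assumptions made for illustration. The two mtry endpoints are the values stated in the abstract.

```python
import math
import random
from collections import defaultdict


def resample_to_equal_classes(labels, seed=0):
    """Return row indices in which every class has the same number of rows.

    Assumption (not from the thesis): the per-class target is
    len(labels) // number_of_classes, so the total size stays roughly
    the same as in the original data set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    per_class = max(1, len(labels) // len(by_class))

    chosen = []
    for rows in by_class.values():
        if len(rows) >= per_class:
            # Large class: undersample without replacement.
            chosen.extend(rng.sample(rows, per_class))
        else:
            # Small class: keep every row, then oversample with replacement.
            extra = rng.choices(rows, k=per_class - len(rows))
            chosen.extend(rows + extra)
    rng.shuffle(chosen)
    return chosen


def midpoint_mtry(p):
    """Midpoint between the two mtry values named in the abstract."""
    low = math.sqrt(p)   # classification default, as stated in the abstract
    high = 2 * p / 3     # regression value, as stated in the abstract
    # Rounding and clamping to [1, p] are assumptions of this sketch.
    return max(1, min(p, round((low + high) / 2)))


# Toy data with three classes (the thesis data set has eleven).
labels = ["a"] * 90 + ["b"] * 8 + ["c"] * 2
indices = resample_to_equal_classes(labels)
balanced_labels = [labels[i] for i in indices]  # each class now has 33 rows
```

The returned indices can be used to subset both the feature matrix and the label vector before fitting a forest. Only the training data should be resampled this way; the test data keeps its original class distribution so that the error rates stay comparable.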

Additional information:

Title differs from the original; translated by the author

License:

In Copyright

Appears in Collections:

Thesis