Random Forest Klassifikation bei unbalancierten Daten

Hackl, Sebastian

doi:10.34726/hss.2015.32902

Record link:

https://doi.org/10.34726/hss.2015.32902
http://hdl.handle.net/20.500.12708/9107

Title:

Random Forest Klassifikation bei unbalancierten Daten

Citation:

Hackl, S. (2015). Random Forest Klassifikation bei unbalancierten Daten [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2015.32902

reposiTUm DOI:

10.34726/hss.2015.32902

CatalogPlus:

AC12661268

Publication Type:

Thesis - Diploma Thesis

Language:

German

Authors:

Hackl, Sebastian

Advisor:

Filzmoser, Peter

Organisational Unit:

E105 - Institut für Stochastik und Wirtschaftsmathematik

Date (published):

2015

Keywords:

random forest

Abstract:

In dieser Arbeit wird das Random Forest Verfahren behandelt. Angefangen bei der Definition eines Entscheidungsbaumes, dem Baustein des Random Forest, werden in dieser Arbeit weitere grundlegende Definitionen angegeben, um dann den Algorithmus vorzustellen. Des Weiteren wird angeführt, wie sich bereits während der Trainingsphase die Merkmalswichtigkeiten, ein Schätzer für den Missklassifikationsfehler und ein Distanzmaß berechnen lassen. Anschließend wird auf die Problematik hingewiesen, die auftritt, wenn man einen Random Forest auf einem unbalancierten Datensatz trainiert. Dazu werden zuerst die Methoden des "Over Sample" bzw. "Under Sample" vorgestellt. An zwei unterschiedlichen Datensätzen werden diese Methoden mit jeweils differenten Parameterwerten angewandt, um diese Ergebnisse dann gegenüberzustellen und zu analysieren. Daraus ist auch zu erkennen, welch wichtiger Faktor der Zufall in einem Random Forest ist. Zuletzt wird in dieser Arbeit auch festgestellt, dass dieses Verfahren auch vom "Overfitting" Problem betroffen sein kann, gerade dann, wenn man die Bäume bis zur maximalen Größe wachsen lässt.

This diploma thesis presents a detailed study of the Random Forest method. The first part gives the basic definitions needed to describe the Random Forest algorithm, starting with the decision tree, the main building block of a Random Forest. It then describes how feature importances, an estimator of the misclassification error, and a distance measure can be computed during the training phase itself. The next part highlights the problems that arise when a Random Forest is trained on an unbalanced dataset. In this context, the "Over Sample" and "Under Sample" methods are introduced and applied to two different datasets with varying parameter values; the results are then compared and analysed. These experiments also show how strongly randomness influences a Random Forest. Finally, the thesis shows that the Random Forest method can also be affected by the "overfitting" problem, particularly when the trees are grown to their maximum size.

Additional information:

Deviating title according to the author's translation

License:

In Copyright

Appears in Collections:

Thesis