E105 - Institut für Stochastik und Wirtschaftsmathematik
-
Date (published):
2022
-
Number of Pages:
62
-
Keywords:
Neural networks; Random forests; Logistic regression
Abstract:
As the amount of data collected by banks grows exponentially, the introduction of sophisticated machine learning models becomes inevitable in order to keep up with the times. The European Banking Authority (EBA) published a discussion paper in November 2021 which might open new possibilities for the estimation of the risk parameters by the internal ratings-based (IRB) approach. This thesis aims to compare the performance of different machine learning algorithms in the field of credit risk and, more specifically, in the discrimination of good and bad customers as part of the probability of default (PD) estimation. The data consists of the corporate customers of a European bank and their balance sheet positions, enriched by region and industry information, with the 12-month default flag as the target variable. The binary classification algorithms are described from a theoretical point of view and then applied using R packages. The data pre-processing pipeline, which includes an extensive missing-data treatment and an outlier detection method, plays a decisive role because of the significant noise level in the sample; it also addresses the problem of imbalanced data through undersampling and overweighting. A cross-validation procedure ensures that adequate out-of-time generalization is achieved. The results show that some of the advanced machine learning techniques outperform ordinary logistic regression and its regularized modifications, while others, such as the support vector machine, deliver comparable performance. A plain neural network with one hidden layer provides the best predictions in terms of the Gini coefficient on the holdout sample when a uniform quantile transformation is used. Random forest achieves the best performance on the untransformed data, although the interpretation of its results and the implementation of the model in a production environment are less straightforward than in the case of logistic regression.
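The abstract ranks the models by their Gini coefficient on the holdout sample. As a minimal sketch, assuming the common credit-scoring convention Gini = 2 · AUC − 1 (the abstract does not state the exact definition used in the thesis), the measure can be computed from default flags and model scores as follows. The function name `gini_coefficient` and the toy data are hypothetical and are not taken from the thesis, which uses R packages rather than Python.

```python
def gini_coefficient(y_true, scores):
    """Gini = 2 * AUC - 1, with AUC obtained from the Mann-Whitney rank-sum.

    y_true: sequence of 0/1 default flags (1 = default).
    scores: sequence of model scores, higher meaning more likely to default.
    Assumption: this is the usual scoring convention, not necessarily the
    exact definition applied in the thesis.
    """
    n = len(y_true)
    order = sorted(range(n), key=lambda i: scores[i])

    # Assign 1-based ranks, giving tied scores their average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(y_true)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # AUC is the normalized Mann-Whitney U statistic of the defaulters.
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    auc = (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return 2 * auc - 1


# Toy example (made-up data): defaulters receive the highest scores.
flags = [0, 0, 1, 1]
model_scores = [0.1, 0.2, 0.8, 0.9]
print(gini_coefficient(flags, model_scores))
```

Under this convention a value of 1 corresponds to perfect separation of good and bad customers, 0 to a model with no discriminatory power, and negative values to a ranking that is worse than random.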
Additional information:
Title differs according to the author's translation