Zeraliu, E. (2021). Comparison of ensemble-based feature selection methods for binary classification of imbalanced data sets [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2021.77227
machine learning; feature selection; imbalanced data
en
Abstract:
Much of the modern world is being digitized and moved online, producing vast amounts of data. This creates a need for faster transmission, robust processing systems, and lower storage requirements, which makes it desirable to reduce the number of features used as input when processing data. This process, known as feature selection, is one of the most valuable steps of data preprocessing. Much of the scientific literature treats feature selection insufficiently, and there is no general agreement on the best technique for ensuring high predictive performance and stability. In this work, we aim to identify the best feature selection method for imbalanced data. We use three different ensemble models in combination with stability selection, a broadly applicable method that is expected to significantly improve the selection approach. We then compare the results with feature selection by neural networks. The experiments are performed on 11 datasets from various domains so that the conclusions generalize broadly. To assess the final performance, we use three metrics and three different classifiers. We report our findings in terms of classification scores and discriminative power. Our results confirm that selecting features not only decreases storage requirements but also improves classification scores. The method that delivers the best feature subsets in our experiments is stability selection with gradient boosting. The neural network approach, on the other hand, shows the best ability to distinguish between relevant and irrelevant features.
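The general idea of stability selection mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the toy data, and all parameter values (`n_rounds`, `sample_frac`, `threshold`, `k`) are assumptions made for illustration, and a simple class-mean-gap scorer stands in for the ensemble models (such as gradient boosting) used in the work. The sketch repeatedly draws stratified subsamples of an imbalanced dataset, runs the base selector on each, and keeps the features that are selected in a large fraction of the rounds.

```python
import random
from collections import Counter


def top_k_by_mean_gap(X, y, k):
    """Hypothetical stand-in base selector (NOT an ensemble model):
    rank features by |mean(class 1) - mean(class 0)| and return the top k."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    n_features = len(X[0])
    gaps = []
    for j in range(n_features):
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        gaps.append(abs(mean_pos - mean_neg))
    ranked = sorted(range(n_features), key=lambda j: gaps[j], reverse=True)
    return set(ranked[:k])


def stability_selection(X, y, base_selector, n_rounds=50, sample_frac=0.5,
                        threshold=0.6, k=2, seed=0):
    """Run base_selector on many subsamples and keep the features whose
    selection frequency reaches the threshold (all defaults are assumed)."""
    rng = random.Random(seed)
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]
    counts = Counter()
    for _ in range(n_rounds):
        # Stratified subsample so the minority class appears in every round.
        sub = (rng.sample(pos_idx, max(1, int(sample_frac * len(pos_idx))))
               + rng.sample(neg_idx, max(1, int(sample_frac * len(neg_idx)))))
        Xs = [X[i] for i in sub]
        ys = [y[i] for i in sub]
        counts.update(base_selector(Xs, ys, k))
    freq = {j: counts[j] / n_rounds for j in range(len(X[0]))}
    selected = sorted(j for j, f in freq.items() if f >= threshold)
    return selected, freq


def make_toy_data(n=200, minority_frac=0.1, n_features=5, seed=1):
    """Synthetic imbalanced data: only features 0 and 1 carry class signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1 if i < int(n * minority_frac) else 0
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        if label == 1:
            row[0] += 3.0
            row[1] += 3.0
        X.append(row)
        y.append(label)
    return X, y


X, y = make_toy_data()
selected, freq = stability_selection(X, y, top_k_by_mean_gap)
print("selected features:", selected)
print("selection frequencies:", freq)
```

To approximate the setting described in the abstract, `base_selector` could be replaced by a function that fits an ensemble model on the subsample and returns its highest-ranked features; the subsampling loop and the frequency threshold would stay the same.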