Predicting bugs in source code : A machine learning approach for predicting faults by utilizing code and change metrics

Felder, Jodok

doi:10.34726/hss.2025.124198

Record link:

https://doi.org/10.34726/hss.2025.124198
http://hdl.handle.net/20.500.12708/213269

Title:

Predicting bugs in source code : A machine learning approach for predicting faults by utilizing code and change metrics

Citation:

Felder, J. (2025). Predicting bugs in source code : A machine learning approach for predicting faults by utilizing code and change metrics [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.124198

reposiTUm DOI:

10.34726/hss.2025.124198

CatalogPlus:

AC17467640

Publication Type:

Thesis - Diploma Thesis

Language:

English

Authors:

Felder, Jodok

Advisor:

Weippl, Edgar

Co-advisor:

Schatten, Alexander

Organisational Unit:

E194 - Institut für Information Systems Engineering

Date (published):

2025

Number of Pages:

Keywords:

Machine Learning; Bug Prediction; Code Metrics; Change Metrics; Gradient Boosting; CatBoost; Classification

Abstract:

Bug detection plays a critical role in software engineering, offering significant time and cost savings for organizations and developers alike. With the exponential growth in code volume and the availability of data surrounding its development, bug prediction has become increasingly important. This thesis focuses on combining code metrics, especially ones based on code changes, and machine learning techniques to address the challenge of identifying buggy software files.

The thesis leverages a dataset comprising 34 open-source projects and utilizes more than 37 code metrics, ranging from basic measures such as Lines of Code to advanced metrics rooted in object-oriented programming principles. A CatBoost classifier was employed to develop a predictive model capable of classifying files as buggy or non-buggy and assigning a corresponding Risk Score, a numerical indicator of the likelihood that a given file contains bugs. The model achieved an average accuracy of 84.1% and a recall rate of 83%, demonstrating its reliability and effectiveness in identifying buggy files.

The analysis further examined the importance of individual code metrics in driving the model's predictions. Feature Importance Analysis identified complexity metrics and the Bus Factor as the most influential in predicting buggy files, offering valuable insights into key contributors to software quality. Additionally, a Logistic Regression-based approach, which achieved an accuracy of 61%, was evaluated to contrast its performance with advanced non-linear models like CatBoost, demonstrating the latter's superior predictive capabilities for bug prediction.

This work contributes to the field of software engineering by demonstrating the efficacy of combining machine learning with metric-driven approaches for bug prediction. The results provide a foundation for future research and practical applications aimed at enhancing software reliability and development efficiency.
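As a reading aid for the accuracy and recall figures quoted in the abstract, the following minimal sketch shows how the two metrics are computed from per-file labels, and how a per-file risk score could be turned into a buggy/non-buggy label. The function names, the 0.5 threshold, and the toy data are assumptions made for illustration only; none of them is taken from the thesis, and the sketch does not reproduce its CatBoost model.

```python
from typing import List, Sequence, Tuple


def labels_from_risk(risk_scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Map each risk score to 1 (buggy) or 0 (non-buggy).

    The 0.5 threshold is an illustrative assumption, not a value from the thesis.
    """
    return [1 if score >= threshold else 0 for score in risk_scores]


def accuracy_and_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, recall), treating label 1 (buggy) as the positive class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    true_neg = sum(1 for t, p in pairs if t == 0 and p == 0)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Accuracy: share of all files that were labelled correctly.
    accuracy = (true_pos + true_neg) / len(pairs)
    # Recall: share of the truly buggy files that the model flagged.
    actual_pos = true_pos + false_neg
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Hypothetical toy data: six files with made-up risk scores and ground truth.
risk_scores = [0.91, 0.12, 0.67, 0.40, 0.08, 0.75]
y_true = [1, 0, 1, 1, 0, 0]

y_pred = labels_from_risk(risk_scores)
accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.3f} recall={recall:.3f}")
```

Recall is the more telling of the two numbers for bug prediction, since a buggy file that goes unflagged (a false negative) lowers recall but may barely move accuracy when most files are non-buggy.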

Additional information:

Summary in German

License:

In Copyright

Appears in Collections:

Thesis