Lee, B. (2025). Survival Analysis Model to Predict Customer Churn in the Edtech Sector [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.124833
Survival Analysis; Customer Churn; Parametric Models; Machine Learning; Random Survival Forests; Cox Proportional Hazards; Time-Variant Data
Abstract:
This thesis explores the application of survival analysis models to predict customer churn in the edtech sector, an area of growing importance for subscription-based businesses. By leveraging statistical and machine learning techniques, the study aims to improve retention models over existing heuristic methods and identify key variables influencing churn behaviour. The research focuses on using survival analysis, a statistical framework adept at handling censored data, to predict customer churn and retention duration, providing more precise and actionable insights.
Drawing from a dataset comprising several hundred thousand customer records with both time-variant and time-invariant features, this study evaluates classical survival models, including Kaplan-Meier and Cox Proportional Hazards models, as well as advanced machine learning techniques like Random Survival Forests and Gradient Boosting Machines. The incorporation of time-variant data, a novel aspect of this study, enhances model sophistication and predictive capability.
Results demonstrate that machine learning models outperform traditional heuristic approaches, achieving a higher concordance index and lower integrated Brier scores. Permutation importance methods highlighted the variables and features that strongly affected survival time and its inverse, customer churn. Time-variant data was found to further improve model performance, although caution must be exercised to ensure correct interpretation of results.
This work contributes to the literature by extending survival analysis applications to the edtech sector, where customer retention is critical for sustainable growth. The developed models provide a testbed for further analysis as newly hypothesised variables become available for testing.
However, the lack of readily available libraries for time-variant analysis, particularly in Python, highlights both the cutting-edge nature of time-variant survival analysis and the risks of productionising time-variant methodologies.
Additional information:
Thesis not yet received by the library; data not verified. Title differs from the original, as translated by the author.