Mühlbacher, T. (2018). Human-oriented statistical modeling: making algorithms accessible through interactive visualization [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2018.60531
E193 - Institut für Visual Computing and Human-Centered Technology
Visual Analytics; Statistical Modeling; Machine Learning; Visual Data Science; Visual Data Mining; Information Visualization; Human-Computer Interaction
Statistical modeling is a key technology for generating business value from data. While the number of available algorithms and the need for them are growing, the number of people with the skills to use such methods effectively lags behind. Many application domain experts find it hard to use and trust algorithms that come as black boxes with insufficient interfaces for adapting them. The field of Visual Analytics aims to solve this problem with a human-oriented approach that puts users in control of algorithms through interactive visual interfaces. However, designing accessible solutions for a broad set of users while re-using existing, proven algorithms poses significant challenges for the design of analytical infrastructures, visualizations, and interactions. This thesis provides multiple contributions towards a more human-oriented modeling process. As a theoretical basis, it investigates how user involvement during the execution of algorithms can be realized from a technical perspective. Based on a characterization of needs regarding intermediate feedback and control, it presents a set of formal strategies for realizing user involvement in algorithms with different characteristics. Guidelines for the design of algorithmic APIs are identified, and requirements for the re-use of algorithms are discussed. From a survey of frequently used algorithms within R, the thesis concludes that a range of pragmatic options for enabling user involvement in new and existing algorithms exists and should be used. After these conceptual considerations, the thesis presents two methodological contributions that demonstrate how even inexperienced modelers can be effectively involved in the modeling process. First, a new technique called TreePOD guides the selection of decision trees along trade-offs between accuracy and other objectives, such as interpretability. Users can interactively explore a diverse set of candidate models generated by sampling the parameters of tree construction algorithms.
Visualizations provide an overview of possible tree characteristics and guide model selection, while details on the underlying machine learning process are only exposed on demand. Real-world evaluation with domain experts in the energy sector suggests that TreePOD enables users with and without a statistical background to confidently identify suitable decision trees. As the second methodological contribution, the thesis presents a framework for interactively building and validating regression models. The framework addresses limitations of automated regression algorithms regarding the incorporation of domain knowledge, the identification of local dependencies, and the building of trust in the models. Candidate variables for model refinement are ranked, and their relationship with the target variable is visualized to support an interactive workflow of building regression models. A real-world case study and feedback from domain experts in the energy sector indicate a significant reduction in effort and increased transparency of the modeling process. All methodological contributions of this work were implemented as part of the commercially distributed Visual Analytics software Visplore. As the last contribution, this thesis reflects upon years of experience in deploying Visplore for modeling-related tasks in the energy sector. Dissemination and adoption are important aspects of making statistical models more accessible to domain experts, which makes this work relevant to practitioners and application-oriented researchers alike.
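
The trade-off-guided selection described for TreePOD can be illustrated with a small Pareto filter over candidate trees. The following is only a sketch under stated assumptions, not the implementation from the thesis: the `Candidate` type, the use of leaf count as a stand-in for interpretability, and all numbers are hypothetical and chosen purely for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """A hypothetical candidate tree summarized by two objectives."""
    name: str
    accuracy: float  # higher is better
    n_leaves: int    # lower is better (assumed proxy for interpretability)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.n_leaves <= b.n_leaves
    strictly_better = a.accuracy > b.accuracy or a.n_leaves < b.n_leaves
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up values for illustration only; these are not results from the thesis.
candidates = [
    Candidate("depth2", 0.81, 4),
    Candidate("depth3", 0.86, 7),
    Candidate("depth3_pruned", 0.84, 9),  # dominated by depth3
    Candidate("depth5", 0.90, 21),
    Candidate("depth8", 0.90, 60),        # dominated by depth5
]

front = pareto_front(candidates)
for c in front:
    print(c.name, c.accuracy, c.n_leaves)
```

With these made-up values, the filter keeps `depth2`, `depth3`, and `depth5`. A user would then choose among the remaining trees according to how much accuracy they are willing to trade for a smaller tree.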