Strümpf, K. (2025). Supporting domain experts develop data exploration and modelling workflows: a ML-based approach [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.101362
E194 - Institut für Information Systems Engineering
Date (published):
2025
Number of Pages:
162
Keywords:
data analysis workflows; low-code systems; graph-based pipeline generation; large language models; AutoML; Monte Carlo Tree Search; data pipeline automation; human-in-the-loop machine learning
Abstract:
Domain experts in fields such as healthcare, marketing, or manufacturing are increasingly expected to engage with data analysis tasks. However, existing tools either require programming knowledge or limit users to predefined operations in graphical interfaces, creating barriers for non-technical users seeking to build meaningful data workflows.
This thesis explores how Data Analysis Workflows (DAWs) can be made more accessible by automating their construction through two complementary approaches: a structured graph-based system and a prompt-driven system based on Large Language Models (LLMs). Both systems are designed to support domain experts in generating DAWs without requiring programming expertise.
The graph-based system represents workflows as Directed Acyclic Graphs (DAGs), where nodes correspond to datasets and operations. It incorporates a Monte Carlo Tree Search (MCTS) strategy for generating synthetic training data and supervised learning models that predict valid pipeline structures. In contrast, the LLM-based system relies on prompt-based interactions to generate executable Python code directly from natural language input, offering greater flexibility and reducing the engineering effort needed to define task-specific logic.
Both systems were implemented as web applications and evaluated across several dimensions, including predictive performance, engineering complexity, and user-facing flexibility. The graph-based system demonstrates higher reproducibility and transparent pipeline construction, while the LLM-based approach offers rapid prototyping and lower development overhead at the cost of increased uncertainty and reduced control.
This comparative analysis reveals key trade-offs in the design of systems for domain-expert-centric DAW generation. It also highlights the potential for combining the strengths of both approaches in future research. Recommendations are provided for improving robustness, expanding system capabilities, and incorporating user feedback to further lower the barriers to accessible and effective data analysis.
Additional information:
Thesis not yet received by the library; data not verified. Alternative title as translated by the author.