Supporting domain experts develop data exploration and modelling workflows : a ML-based approach

Strümpf, Konstantin

doi:10.34726/hss.2025.101362

Record link:

https://doi.org/10.34726/hss.2025.101362
http://hdl.handle.net/20.500.12708/220411

Title:

Supporting domain experts develop data exploration and modelling workflows : a ML-based approach

Citation:

Strümpf, K. (2025). Supporting domain experts develop data exploration and modelling workflows : a ML-based approach [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.101362

reposiTUm DOI:

10.34726/hss.2025.101362

CatalogPlus:

AC17681814

Publication Type:

Thesis - Diplomarbeit

Language:

English

Authors:

Strümpf, Konstantin

Advisor:

Dustdar, Schahram

Co-advisor:

Morichetta, Andrea

Organisational Unit:

E194 - Institut für Information Systems Engineering

Date (published):

2025

Number of Pages:

162

Keywords:

data analysis workflows; low-code systems; graph-based pipeline generation; large language models; AutoML; Monte Carlo Tree Search; data pipeline automation; human-in-the-loop machine learning

Abstract:

Domain experts in fields such as healthcare, marketing, or manufacturing are increasingly expected to engage with data analysis tasks. However, existing tools either require programming knowledge or limit users to predefined operations in graphical interfaces, creating barriers for non-technical users seeking to build meaningful data workflows.

This thesis explores how Data Analysis Workflows (DAWs) can be made more accessible by automating their construction through two complementary approaches: a structured graph-based system and a prompt-driven system based on Large Language Models (LLMs). Both systems are designed to support domain experts in generating DAWs without requiring programming expertise.

The graph-based system represents workflows as Directed Acyclic Graphs (DAGs), where nodes correspond to datasets and operations. It incorporates a Monte-Carlo Tree Search (MCTS) strategy for generating synthetic training data and supervised learning models that predict valid pipeline structures. In contrast, the LLM-based system relies on prompt-based interactions to generate executable Python code directly from natural language input, offering greater flexibility and reducing the engineering effort needed to define task-specific logic.

Both systems were implemented as web applications and evaluated across several dimensions, including predictive performance, engineering complexity, and user-facing flexibility. The graph-based system demonstrates higher reproducibility and transparent pipeline construction, while the LLM-based approach offers rapid prototyping and lower development overhead at the cost of increased uncertainty and reduced control.

This comparative analysis reveals key trade-offs in the design of systems for domain-expert-centric DAW generation. It also highlights the potential for combining the strengths of both approaches in future research. Recommendations are provided for improving robustness, expanding system capabilities, and incorporating user feedback to further lower the barriers to accessible and effective data analysis.
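
To make the DAG representation mentioned in the abstract more concrete, the following is a minimal sketch of a workflow whose nodes are datasets and operations, executed in dependency order. It is not taken from the thesis: the node names, the `workflow` dictionary layout, and the `run_workflow` helper are all hypothetical and only illustrate the general idea.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: node name -> (operation, list of input node names).
# A dataset node has no inputs; an operation node consumes upstream results.
workflow = {
    "raw": (lambda: [3.0, None, 1.0, 2.0], []),
    "cleaned": (lambda rows: [r for r in rows if r is not None], ["raw"]),
    "sorted": (lambda rows: sorted(rows), ["cleaned"]),
    "mean": (lambda rows: sum(rows) / len(rows), ["cleaned"]),
}


def run_workflow(workflow):
    """Run each node after all of its predecessors and return every result."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(inputs) for name, (_, inputs) in workflow.items()}
    results = {}
    # static_order() raises graphlib.CycleError if the graph contains a cycle.
    for name in TopologicalSorter(graph).static_order():
        operation, inputs = workflow[name]
        results[name] = operation(*(results[i] for i in inputs))
    return results


results = run_workflow(workflow)
```

Because the ordering is derived from the graph alone, repeated runs with the same inputs execute the same steps, which is one plausible reading of the reproducibility the abstract attributes to the graph-based system.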

License:

In Copyright

Appears in Collections:

Thesis