<div class="csl-bib-body">
<div class="csl-entry">Chakarov, T. (2026). <i>Performance and Scalability Analysis of Dask Applications on Large Scale Systems</i> [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2026.131836</div>
</div>
-
dc.identifier.uri
https://doi.org/10.34726/hss.2026.131836
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/226951
-
dc.description
Work not yet received at the library - data not verified
-
dc.description
Alternative title as translated by the author
-
dc.description.abstract
This thesis evaluates the scalability and performance characteristics of the Dask framework for large-scale analytical workloads across two contrasting computing environments: a local Kubernetes-based deployment and a high-performance computing (HPC) cluster managed by SLURM. Although Dask is increasingly used as a flexible alternative to frameworks such as Apache Spark, its behavior under compute-intensive workloads remains poorly understood. To address this gap, we conduct extensive strong-scaling and weak-scaling experiments on TPC-H-like datasets of varying scale, analyze Dask's adaptive autoscaling mechanism, and integrate DuckDB as an embedded execution engine within Dask workers.

The results show that the local Kubernetes deployment achieves limited scalability due to shared-resource contention, with performance saturating after only a few workers. In contrast, the HPC system maintains high parallel efficiency across many nodes, particularly when running one worker per node. This demonstrates that distributed memory bandwidth and reduced spill-to-disk activity are essential for scaling shuffle-intensive queries. Weak scaling on the HPC system is more stable than on the local cluster, although absolute runtimes are higher because distributed execution introduces significant network and I/O overheads.

Adaptive scaling performs poorly in both environments. On Kubernetes, the autoscaler behaves unpredictably, frequently terminating workers too aggressively and destabilizing long-running queries. On the HPC system, adaptive scaling consistently underperforms fixed-size clusters because the ramp-up phase forces memory-bound workloads to execute with insufficient resources.

Finally, experiments combining Dask with DuckDB reveal substantial performance improvements. Distributed DuckDB consistently outperforms Dask DataFrame execution, often by a factor of three to four, due to more efficient local processing, reduced I/O per worker, and optimized query execution. Even single-worker DuckDB is faster than Dask-only execution, underscoring the value of leveraging specialized query engines within distributed data-processing frameworks.

Overall, this thesis provides a detailed empirical analysis of Dask's behavior under large-scale analytical workloads and offers practical recommendations for configuring Dask on HPC systems, understanding its scaling limits, and integrating complementary technologies such as DuckDB to improve performance.
en
dc.language
English
-
dc.language.iso
en
-
dc.rights.uri
http://rightsstatements.org/vocab/InC/1.0/
-
dc.subject
Dask
en
dc.subject
Distributed Computing
en
dc.subject
Scalability Analysis
en
dc.subject
High-Performance Computing
en
dc.subject
Kubernetes
en
dc.subject
SLURM
en
dc.subject
Adaptive Autoscaling
en
dc.subject
Weak and Strong Scaling
en
dc.subject
Analytical Workloads
en
dc.subject
DuckDB
en
dc.title
Performance and Scalability Analysis of Dask Applications on Large Scale Systems
en
dc.title.alternative
Analyse der Leistung und Skalierbarkeit von Dask-Anwendungen auf Hochleistungssystemen