Improving HPC Efficiency with Advanced SLURM Modes of Operation

Sattari, Sanaz

doi:10.34726/hss.2026.63081

Record link:

https://doi.org/10.34726/hss.2026.63081
http://hdl.handle.net/20.500.12708/229061

Title:

Improving HPC Efficiency with Advanced SLURM Modes of Operation

Citation:

Sattari, S. (2026). Improving HPC Efficiency with Advanced SLURM Modes of Operation [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2026.63081

reposiTUm DOI:

10.34726/hss.2026.63081

CatalogPlus:

AC17906785

Publication Type:

Thesis - Diplomarbeit

Language:

English

Authors:

Sattari, Sanaz

Advisor:

Rauber, Andreas

Organisational Unit:

E194 - Institut für Information Systems Engineering

Date (published):

2026

Number of Pages:

108

Keywords:

High Performance Computing; Scheduling; SLURM

Abstract:

High Performance Computing (HPC) systems are now fundamental infrastructure for a broad range of activities, including scientific applications, large-scale simulations, and data-intensive processing. As the scale and heterogeneity of HPC systems continue to increase, resource and workload management must be optimized to meet demands for high utilization, low scheduling overhead, and minimal job waiting times. In modern HPC environments, the configuration and operational behavior of workload managers directly influence overall system performance and user experience.
This thesis investigates various operational configurations of the SLURM workload manager, focusing on Single-Cluster, Multi-Cluster, and Federated-Cluster setups. The aim is to examine how these configurations affect scheduling efficiency, scalability, throughput, and latency under different workload characteristics and varying resource availability. Special consideration is given to job queue waiting time, scheduler responsiveness, resource allocation efficiency, and cluster utilization in test and production-scale HPC infrastructures.
To achieve this, an experimental methodology combining testbed-based benchmarking and real-world operational analysis is employed. A dedicated testbed environment is used to evaluate scheduling decisions and resource allocation under varying cluster setups, while experiments conducted on operational systems provide insights into the behavior of SLURM in realistic large-scale deployments. The study further examines the impact of heterogeneous resources, partition configurations, Quality of Service (QoS) policies, and job submission strategies on scheduling efficiency.
Beyond the comparative study of different SLURM operational setups, this work also introduces QOSS, a Python-based submission tool designed for flexible job placement across distributed SLURM clusters. This tool extends conventional submission workflows by enabling adaptive partition selection and cross-cluster job submission when the originally targeted resources are unavailable. The tool aims to improve resource utilization and reduce scheduling delays in situations where SLURM is, by default, restricted to certain partitions or clusters.
The experimental findings demonstrate that, with appropriate settings, advanced SLURM setups can enhance scheduling efficiency and reduce job queue waiting times. The Multi-Cluster setup offers greater flexibility for workload distribution and placement, whereas Federated SLURM allows inter-cluster scheduling and improved throughput for suitable workloads. However, these configurations also add complexity to the SLURM architecture.
The results of this thesis offer practical insights for choosing and tuning SLURM operational setups for modern HPC infrastructures.

Additional information:

Thesis not yet received by the library - data not verified
Title differs from the original, as translated by the author

License:

In Copyright

Appears in Collections:

Thesis