Online algorithm selection of MPI collective communication operations

Steiner, Sebastian

doi:10.34726/hss.2023.105821

Record citation link:

https://doi.org/10.34726/hss.2023.105821
http://hdl.handle.net/20.500.12708/177272

Title:

Online algorithm selection of MPI collective communication operations

Citation:

Steiner, S. (2023). Online algorithm selection of MPI collective communication operations [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2023.105821

reposiTUm-DOI:

10.34726/hss.2023.105821

CatalogPlus:

AC16856127

Publication type:

Thesis - Diploma thesis

Language:

English

Authors:

Steiner, Sebastian

Supervisor:

Hunold, Sascha

Organizational unit:

E191 - Institut für Computer Engineering

Date (published):

2023

Extent:

Keywords:

HPC; Parallel Computing; MPI; Collective Communication; Auto-Tuning; Performance Prediction

Abstract:

The Message Passing Interface (MPI) forms the basis for most of the software written for parallel compute clusters or supercomputers across various disciplines. Collective communication operations account for a large share of the runtime of MPI applications. Most collective operations can be implemented by several algorithms, and as a rule no single algorithm achieves the best performance for every input. Common MPI libraries include a large set of algorithms for each collective operation, together with decision logic for selecting among them. The default selection, however, often performs inadequately and thus leaves room for improvement. Current approaches typically use a tuning step to adapt the decision logic to a specific machine and workload. An offline tuning approach suffers from two drawbacks: 1) a potentially long-running tuning phase and 2) the need to specify in advance which collective parameter cases are to be measured. To address these limitations, this thesis proposes a low-overhead auto-tuner intended to be integrated into the execution of MPI applications. By querying a lightweight algorithm selection model, an algorithm is chosen at random depending on its predicted performance. The runtime of collective operations is recorded and used at periodic intervals to train a runtime prediction model for each algorithm. The algorithm selection models are then updated based on the runtime prediction models and injected into the next round of MPI applications. We demonstrate the applicability of this auto-tuner in a quantitative study and achieve performance gains by tuning ECP proxy applications on two different compute clusters.

The Message Passing Interface (MPI) underlies most software run on large parallel machines or supercomputers, facilitating the development of highly parallel applications across various fields. Collective communication operations make up a large fraction of the runtime of MPI applications. These collective operations can generally be implemented by several different algorithms, of which no single one performs best for all inputs. Common MPI libraries include a large set of algorithms for each collective operation and decision logic for selecting among them. The default selection, however, frequently underperforms, leaving room for performance improvements. State-of-the-art approaches commonly apply an offline tuning step to adapt the selection logic to a specific machine and workload. An offline approach suffers from two main drawbacks: 1) a potentially long-running tuning step and 2) the necessity to predefine the collective cases to be tuned. To address these limitations, this thesis proposes a low-overhead online auto-tuner to be injected into the execution of MPI applications. Algorithms are selected randomly based on their predicted runtime, as indicated by lightweight algorithm selection models. The actual runtime of collective operations is recorded and periodically used to train a runtime prediction model for each algorithm. Subsequently, these prediction models are used to update the algorithm selection models, which are injected into the next round of MPI applications. We demonstrate the feasibility of this auto-tuner in a quantitative study, achieving performance improvements through tuning ECP proxy applications on two distinct compute clusters.
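
The loop described in the abstract (select an algorithm at random according to predicted runtime, record the measured runtime, periodically retrain, then update the selection model) might be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not say how predictions are turned into selection probabilities or which prediction model is trained, so the inverse-runtime weighting, the per-algorithm mean used as a "model", the simulated measurements, and all function and algorithm names here are hypothetical.

```python
import random
from collections import defaultdict


def selection_weights(predicted_runtimes):
    """Turn predicted runtimes (seconds) into selection probabilities.

    Assumption: probability is proportional to 1 / predicted runtime,
    so faster algorithms are chosen more often but slower ones are
    still explored occasionally.
    """
    inverse = {alg: 1.0 / t for alg, t in predicted_runtimes.items()}
    total = sum(inverse.values())
    return {alg: w / total for alg, w in inverse.items()}


def choose_algorithm(weights, rng):
    """Pick one algorithm at random according to the selection weights."""
    algorithms = list(weights)
    probabilities = [weights[a] for a in algorithms]
    return rng.choices(algorithms, weights=probabilities, k=1)[0]


def retrain(samples):
    """Stand-in prediction model: mean recorded runtime per algorithm."""
    return {alg: sum(ts) / len(ts) for alg, ts in samples.items() if ts}


def tuning_round(predicted, true_runtime, calls, rng):
    """Simulate one application run and return updated predictions.

    `true_runtime` replaces real MPI measurements in this sketch; each
    simulated call is perturbed by up to 10% noise.
    """
    weights = selection_weights(predicted)
    samples = defaultdict(list)
    for _ in range(calls):
        alg = choose_algorithm(weights, rng)
        samples[alg].append(true_runtime[alg] * rng.uniform(0.9, 1.1))
    updated = dict(predicted)        # keep old predictions for unsampled algorithms
    updated.update(retrain(samples))
    return updated


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical algorithm labels; initially all predictions are equal.
    predicted = {"binomial_tree": 1.0, "ring": 1.0, "linear": 1.0}
    true_runtime = {"binomial_tree": 0.2, "ring": 0.5, "linear": 1.0}
    for _ in range(3):
        predicted = tuning_round(predicted, true_runtime, calls=300, rng=rng)
    print(selection_weights(predicted))
```

Keeping the selection randomized rather than always picking the predicted-fastest algorithm means every algorithm keeps receiving some measurements, which is what lets later rounds correct an initially poor prediction.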

Further information:

Alternative title as translated by the author

License:

In Copyright

Appears in collections:

Thesis