<div class="csl-bib-body">
<div class="csl-entry">Salimi Beni, M., Hunold, S., & Cosenza, B. (2024). Analysis and prediction of performance variability in large-scale computing systems. <i>Journal of Supercomputing</i>, <i>80</i>(10), 14978–15005. https://doi.org/10.1007/s11227-024-06040-w</div>
</div>
-
dc.identifier.issn
0920-8542
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/200994
-
dc.description.abstract
The development of new exascale supercomputers has dramatically increased the need for fast, high-performance networking technology. Efficient network topologies, such as Dragonfly+, have been introduced to meet the demands of data-intensive applications and to match the massive computing power of GPUs and accelerators. However, these supercomputers still face performance variability, caused mainly by the network, that affects both system and application performance. This study comprehensively analyzes performance variability on a large-scale HPC system with a Dragonfly+ network topology, focusing on factors such as communication patterns, message size, job placement locality, MPI collective algorithms, and overall system workload. The study also proposes an easy-to-measure metric for estimating the network background traffic generated by other users, which can in turn be used to accurately estimate the performance of a user's own job. The insights gained from this study contribute to improving performance predictability, enhancing job placement policies and MPI algorithm selection, and optimizing resource management strategies in supercomputers.
en
dc.language.iso
en
-
dc.publisher
SPRINGER
-
dc.relation.ispartof
Journal of Supercomputing
-
dc.rights.uri
http://creativecommons.org/licenses/by/4.0/
-
dc.subject
Dragonfly+ topology
en
dc.subject
High performance interconnects
en
dc.subject
MPI
en
dc.subject
Performance predictability
en
dc.subject
Performance variability
en
dc.title
Analysis and prediction of performance variability in large-scale computing systems