Analysis and prediction of performance variability in large-scale computing systems

Salimi Beni, Majid; Hunold, Sascha; Cosenza, Biagio

doi:10.1007/s11227-024-06040-w

Record link:

http://hdl.handle.net/20.500.12708/200994

Title:

Analysis and prediction of performance variability in large-scale computing systems

Citation:

Salimi Beni, M., Hunold, S., & Cosenza, B. (2024). Analysis and prediction of performance variability in large-scale computing systems. Journal of Supercomputing, 80(10), 14978–15005. https://doi.org/10.1007/s11227-024-06040-w

Publisher DOI:

10.1007/s11227-024-06040-w

CatalogPlus:

AC17319288

Publication Type:

Article - Original Research Article

Language:

English

Authors:

Salimi Beni, Majid
Hunold, Sascha
Cosenza, Biagio

Organisational Unit:

E191-04 - Forschungsbereich Parallel Computing

Journal:

Journal of Supercomputing

ISSN:

0920-8542

Date (published):

2024

Number of Pages:

28

Publisher:

SPRINGER

Peer reviewed:

Yes

Keywords:

Dragonfly+ topology; High performance interconnects; MPI; Performance predictability; Performance variability

Abstract:

The development of new exascale supercomputers has dramatically increased the need for fast, high-performance networking technology. Efficient network topologies, such as Dragonfly+, have been introduced to meet the demands of data-intensive applications and to match the massive computing power of GPUs and accelerators. However, these supercomputers still face performance variability, caused mainly by the network, that affects both system and application performance. This study comprehensively analyzes performance variability on a large-scale HPC system with a Dragonfly+ network topology, focusing on factors such as communication patterns, message size, job placement locality, MPI collective algorithms, and overall system workload. The study also proposes an easy-to-measure metric for estimating the network background traffic generated by other users, which can in turn be used to accurately estimate the performance of a user's own job. The insights gained from this study contribute to improving performance predictability, enhancing job placement policies and MPI algorithm selection, and optimizing resource management strategies in supercomputers.

Research Areas:

Computer Engineering and Software-Intensive Systems: 90%
Computer Science Foundations: 10%

Science Branch:

1020 - Informatik: 100%

License:

CC BY 4.0