Analysis and prediction of performance variability in large-scale computing systems

Salimi Beni, Majid; Hunold, Sascha; Cosenza, Biagio

doi:10.1007/s11227-024-06040-w

DC Field

Value

Language

dc.contributor.author

Salimi Beni, Majid

dc.contributor.author

Hunold, Sascha

dc.contributor.author

Cosenza, Biagio

dc.date.accessioned

2024-09-30T08:50:37Z

dc.date.available

2024-09-30T08:50:37Z

dc.date.issued

2024

dc.identifier.citation

Salimi Beni, M., Hunold, S., & Cosenza, B. (2024). Analysis and prediction of performance variability in large-scale computing systems. Journal of Supercomputing, 80(10), 14978–15005. https://doi.org/10.1007/s11227-024-06040-w

dc.identifier.issn

0920-8542

dc.identifier.uri

http://hdl.handle.net/20.500.12708/200994

dc.description.abstract

The development of new exascale supercomputers has dramatically increased the need for fast, high-performance networking technology. Efficient network topologies, such as Dragonfly+, have been introduced to meet the demands of data-intensive applications and to match the massive computing power of GPUs and accelerators. However, these supercomputers still suffer from performance variability, caused mainly by the network, that affects both system and application performance. This study comprehensively analyzes performance variability on a large-scale HPC system with a Dragonfly+ network topology, focusing on factors such as communication patterns, message size, job placement locality, MPI collective algorithms, and overall system workload. The study also proposes an easy-to-measure metric for estimating the network background traffic generated by other users, which can be used to accurately estimate the performance of a given job. The insights gained from this study contribute to improving performance predictability, enhancing job placement policies and MPI algorithm selection, and optimizing resource management strategies in supercomputers.

dc.language.iso

dc.publisher

SPRINGER

dc.relation.ispartof

Journal of Supercomputing

dc.rights.uri

http://creativecommons.org/licenses/by/4.0/

dc.subject

Dragonfly+ topology

dc.subject

High performance interconnects

dc.subject

MPI

dc.subject

Performance predictability

dc.subject

Performance variability

dc.title

Analysis and prediction of performance variability in large-scale computing systems

dc.type

Article

dc.rights.license

Creative Commons Attribution 4.0 International

dc.identifier.scopus

2-s2.0-85188893818

dc.identifier.url

https://api.elsevier.com/content/abstract/scopus_id/85188893818

dc.contributor.affiliation

University of Salerno, Italy

dc.contributor.affiliation

University of Salerno, Italy

dc.description.startpage

14978

dc.description.endpage

15005

dc.rights.holder

dc.type.category

Original Research Article

tuw.container.volume

tuw.container.issue

tuw.journal.peerreviewed

true

tuw.peerreviewed

true

wb.publication.intCoWork

International Co-publication

tuw.researchTopic.id

tuw.researchTopic.name

Computer Engineering and Software-Intensive Systems

tuw.researchTopic.name

Computer Science Foundations

tuw.researchTopic.value

dcterms.isPartOf.title

Journal of Supercomputing

tuw.publication.orgunit

E191-04 - Forschungsbereich Parallel Computing

tuw.publisher.doi

10.1007/s11227-024-06040-w

dc.date.onlinefirst

2024-03-28

dc.identifier.eissn

1573-0484

dc.identifier.libraryid

AC17319288

dc.description.numberOfPages

tuw.author.orcid

0000-0002-8634-7712

tuw.author.orcid

0000-0002-5280-3855

tuw.author.orcid

0000-0002-8869-6705

dc.rights.identifier

CC BY 4.0

wb.sci

true

wb.sciencebranch

Computer Science

wb.sciencebranch.oefos

1020

wb.sciencebranch.value

100

item.languageiso639-1

item.openairetype

research article

item.grantfulltext

open

item.fulltext

with Fulltext

item.cerifentitytype

Publications

item.mimetype

application/pdf

item.openairecristype

http://purl.org/coar/resource_type/c_2df8fbb1

item.openaccessfulltext

Open Access

crisitem.author.dept

University of Salerno

crisitem.author.dept

E191-04 - Forschungsbereich Parallel Computing

crisitem.author.orcid

0000-0002-8634-7712

crisitem.author.orcid

0000-0002-5280-3855

crisitem.author.orcid

0000-0002-8869-6705

crisitem.author.parentorg

E191 - Institut für Computer Engineering

Appears in Collections:

Article

Fulltext (Version of Record (published version))

Adobe PDF

(1.78 MB)

CC BY 4.0
