<div class="csl-bib-body">
<div class="csl-entry">Chu, X., Hofstätter, D., Ilager, S. S., Talluri, S., Kampert, D., Podareanu, D., Duplyakin, D., Brandic, I., & Iosup, A. (2024). <i>Generic and ML Workloads in an HPC Datacenter: Node Energy, Job Failures, and Node-Job Analysis</i>. arXiv. https://doi.org/10.48550/arXiv.2409.08949</div>
</div>
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/211385
-
dc.description.abstract
HPC datacenters offer a backbone to the modern digital society. Increasingly, they run Machine Learning (ML) jobs next to generic, compute-intensive workloads, supporting science, business, and other decision-making processes. However, understanding how ML jobs impact the operation of HPC datacenters, relative to generic jobs, remains desirable but understudied. In this work, we leverage long-term operational data, collected from a national-scale production HPC datacenter, and statistically compare how ML and generic jobs can impact the performance, failures, resource utilization, and energy consumption of HPC datacenters. Our study provides key insights, e.g., ML-related power usage causes GPU nodes to run into temperature limitations, median/mean runtime and failure rates are higher for ML jobs than for generic jobs, both ML and generic jobs exhibit highly variable arrival processes and resource demands, significant amounts of energy are spent on unsuccessfully terminating jobs, and concurrent jobs tend to terminate in the same state. We open-source our cleaned-up data traces on Zenodo, and provide our analysis toolkit as software hosted on GitHub. This study offers multiple benefits for data center administrators, who can improve operational efficiency, and for researchers, who can further improve system designs, scheduling techniques, etc.
en
dc.description.sponsorship
FFG - Österr. Forschungsförderungsgesellschaft mbH
-
dc.description.sponsorship
FWF - Österr. Wissenschaftsfonds
-
dc.description.sponsorship
FWF - Österr. Wissenschaftsfonds
-
dc.language.iso
en
-
dc.subject
HPC datacenters
en
dc.subject
Machine Learning (ML)
en
dc.subject
generic & compute-intensive workloads
en
dc.subject
long-term operational data
en
dc.title
Generic and ML Workloads in an HPC Datacenter: Node Energy, Job Failures, and Node-Job Analysis
en
dc.type
Preprint
en
dc.type
Preprint
de
dc.identifier.arxiv
2409.08949
-
dc.contributor.affiliation
TU Wien, Austria
-
dc.contributor.affiliation
Vrije Universiteit Amsterdam, Netherlands (the)
-
dc.contributor.affiliation
SURFsara (Netherlands), Netherlands (the)
-
dc.contributor.affiliation
National Renewable Energy Laboratory, United States of America (the)
-
dc.contributor.affiliation
Vrije Universiteit Amsterdam, Netherlands (the)
-
dc.relation.grantno
45285029
-
dc.relation.grantno
P 36870-N
-
dc.relation.grantno
PAT1668223
-
tuw.project.title
High-Performance integrated Quantum Computing
-
tuw.project.title
Transprecise Edge Computing
-
tuw.project.title
Themis - Vertrauenswürdiges und nachhaltiges Code-Offloading
-
tuw.researchTopic.id
I4
-
tuw.researchTopic.name
Information Systems Engineering
-
tuw.researchTopic.value
100
-
tuw.publication.orgunit
E194-04 - Forschungsbereich Data Science
-
tuw.publication.orgunit
E056-23 - Fachbereich Innovative Combinations and Applications of AI and ML (iCAIML)
-
tuw.publisher.doi
10.48550/arXiv.2409.08949
-
dc.description.numberOfPages
10
-
tuw.author.orcid
0009-0006-0204-2361
-
tuw.author.orcid
0000-0003-1178-6582
-
tuw.author.orcid
0000-0003-3461-4919
-
tuw.author.orcid
0000-0002-4207-8725
-
tuw.author.orcid
0000-0001-5132-0168
-
tuw.author.orcid
0009-0007-0661-5937
-
tuw.author.orcid
0000-0001-8030-9398
-
tuw.publisher.server
arXiv
-
wb.sciencebranch
Informatik
-
wb.sciencebranch
Wirtschaftswissenschaften
-
wb.sciencebranch.oefos
1020
-
wb.sciencebranch.oefos
5020
-
wb.sciencebranch.value
90
-
wb.sciencebranch.value
10
-
item.fulltext
no Fulltext
-
item.openairecristype
http://purl.org/coar/resource_type/c_816b
-
item.grantfulltext
none
-
item.cerifentitytype
Publications
-
item.languageiso639-1
en
-
item.openairetype
preprint
-
crisitem.author.dept
TU Wien
-
crisitem.author.dept
E194-04 - Forschungsbereich Data Science
-
crisitem.author.dept
Vrije Universiteit Amsterdam
-
crisitem.author.dept
SURFsara (Netherlands)
-
crisitem.author.dept
National Renewable Energy Laboratory
-
crisitem.author.dept
E194-04 - Forschungsbereich Data Science
-
crisitem.author.dept
Vrije Universiteit Amsterdam
-
crisitem.author.orcid
0009-0006-0204-2361
-
crisitem.author.orcid
0000-0003-1178-6582
-
crisitem.author.orcid
0000-0003-3461-4919
-
crisitem.author.orcid
0000-0002-4207-8725
-
crisitem.author.orcid
0000-0001-5132-0168
-
crisitem.author.orcid
0009-0007-0661-5937
-
crisitem.author.orcid
0000-0001-8030-9398
-
crisitem.author.parentorg
E194 - Institut für Information Systems Engineering
-
crisitem.author.parentorg
E194 - Institut für Information Systems Engineering
-
crisitem.project.funder
FFG - Österr. Forschungsförderungsgesellschaft mbH