<div class="csl-bib-body">
<div class="csl-entry">Salimi Beni, M., Laso, R., Cosenza, B., Benkner, S., & Hunold, S. (2025). Optimizing Distributed Deep Learning Training by Tuning NCCL. In <i>ASHPC25 : Austrian-Slovenian HPC Meeting 2025 : Rimske Terme, Slovenia : 19-22 May 2025</i> (pp. 38–38). https://doi.org/10.34726/10424</div>
</div>
ASHPC25 – Austrian-Slovenian HPC Meeting 2025, Rimske Toplice, 19–22 May 2025

Optimizing Distributed Deep Learning Training by Tuning NCCL

Majid Salimi Beni (a), Ruben Laso (b), Biagio Cosenza (c), Siegfried Benkner (b), and Sascha Hunold (a)

(a) Faculty of Informatics, TU Wien, Austria
(b) Faculty of Computer Science, University of Vienna, Austria
(c) Department of Computer Science, University of Salerno, Italy

Distributed Deep Learning is essential for training large-scale neural networks when the entire data set or model cannot fit into a single machine. The communication layer of such a deep learning framework is responsible for synchronizing model updates and exchanging gradients between nodes, and the communication operations in that layer must be efficient. The NVIDIA Collective Communications Library (NCCL) is a widely used back-end for communication in GPU-accelerated clusters. Similar to the Message Passing Interface (MPI), NCCL's efficiency depends on its parameter configuration [1, 2], including the choice of communication algorithms, buffer sizes, and network types.
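NCCL exposes these tunables as environment variables that are read when its communicators are created. As a minimal sketch (not the configuration used in this work), the following Python snippet shows how a candidate choice of algorithm, protocol, and buffer size could be applied before Horovod triggers NCCL initialization; the concrete values are illustrative only.

    # Minimal sketch: apply a candidate NCCL configuration before Horovod
    # triggers NCCL initialization. The values below are illustrative, not
    # the tuned settings reported in this abstract.
    import os

    os.environ["NCCL_ALGO"] = "Tree"            # collective algorithm (e.g., Ring, Tree)
    os.environ["NCCL_PROTO"] = "LL128"          # wire protocol (LL, LL128, Simple)
    os.environ["NCCL_BUFFSIZE"] = str(8 << 20)  # per-channel buffer size in bytes

    import horovod.tensorflow as hvd

    hvd.init()  # NCCL reads the environment variables above when it is set up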
NCCL Parameter Tuning: We propose a two-step offline tuner to optimize the NCCL parameter configuration for multi-GPU clusters. First, we profile the training of the models to determine the most relevant message sizes. Second, we employ a Bayesian optimizer to find an efficient parameter configuration.

Experimental Results: Figure 1 compares the performance of two deep learning models (Bert and NasNetMobile) on 2 nodes of the Leonardo supercomputer, using TensorFlow and Horovod. On top, we compare the bandwidth obtained after tuning the collectives for the most frequently used message size of each model. The tuned configurations improved the bandwidths of the respective NCCL operations in the microbenchmarks by 2.26 and 21.02 times. On the bottom, we show an improvement in training performance of 12% and 13% for Bert and NasNetMobile, respectively, when using the tuned configuration. Our experiments highlight the significant performance gains achievable through optimizing NCCL in distributed deep learning training.
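The abstract does not detail the tuner's implementation. The sketch below illustrates how the second step (Bayesian optimization) could look under stated assumptions: the dominant message size from the profiling step is already known, scikit-optimize supplies the Bayesian optimizer, and the nccl-tests all_reduce_perf microbenchmark serves as the objective; the launch command, search space, and output parsing are illustrative assumptions, not the authors' setup.

    # Illustrative sketch of the Bayesian-optimization step (assumptions:
    # scikit-optimize as the optimizer, nccl-tests' all_reduce_perf as the
    # microbenchmark, and a dominant message size taken from profiling).
    import os
    import re
    import subprocess

    from skopt import gp_minimize
    from skopt.space import Categorical
    from skopt.utils import use_named_args

    MSG_SIZE = 64 * 1024 * 1024  # dominant message size in bytes (assumed)

    space = [
        Categorical(["Ring", "Tree"], name="NCCL_ALGO"),
        Categorical(["LL", "LL128", "Simple"], name="NCCL_PROTO"),
        Categorical([str(1 << 21), str(1 << 22), str(1 << 23)], name="NCCL_BUFFSIZE"),
    ]

    @use_named_args(space)
    def objective(**params):
        # Candidate configuration is passed to the benchmark via its environment.
        env = dict(os.environ, **{k: str(v) for k, v in params.items()})
        # Run the microbenchmark at the dominant message size only.
        out = subprocess.run(
            ["mpirun", "-np", "8", "all_reduce_perf",
             "-b", str(MSG_SIZE), "-e", str(MSG_SIZE)],
            env=env, capture_output=True, text=True, check=True,
        ).stdout
        # Parse the summary line printed by nccl-tests.
        busbw = float(re.search(r"Avg bus bandwidth\s*:\s*([\d.]+)", out).group(1))
        return -busbw  # maximize bandwidth by minimizing its negative

    result = gp_minimize(objective, space, n_calls=30, random_state=0)
    print("best configuration:", result.x, "bus bandwidth (GB/s):", -result.fun)

In this kind of setup, the best configuration found by the optimizer would then be exported as environment variables for the actual Horovod training run, as in the snippet above.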
dc.language.iso: en
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: High Performance Computing
dc.subject: GPU-Computing
dc.subject: deep learning
dc.title: Optimizing Distributed Deep Learning Training by Tuning NCCL
dc.type: Inproceedings (en); Konferenzbeitrag (de)
dc.rights.license: In Copyright (en); Urheberrechtsschutz (de)
dc.identifier.doi: 10.34726/10424
dc.contributor.affiliation: University of Vienna, Austria
dc.contributor.affiliation: University of Salerno, Italy
dc.contributor.affiliation: University of Vienna, Austria
dc.description.startpage: 38
dc.description.endpage: 38
dc.type.category: Abstract Book Contribution
tuw.booktitle: ASHPC25 : Austrian-Slovenian HPC Meeting 2025 : Rimske Terme, Slovenia : 19-22 May 2025