Salimi Beni, M., Laso, R., Cosenza, B., Benkner, S., & Hunold, S. (2025). Optimizing Distributed Deep Learning Training by Tuning NCCL. In ASHPC25 : Austrian-Slovenian HPC Meeting 2025 : Rimske Terme, Slovenia : 19-22 May 2025 (p. 38). https://doi.org/10.34726/10424
Date (published):
22-May-2025
Event name:
Austrian-Slovenian HPC Meeting 2025 (ASHPC25)
Event date:
19-May-2025 - 22-May-2025
Event place:
Rimske Toplice, Slovenia
Number of Pages:
1
Peer reviewed:
Yes
Keywords:
High Performance Computing; GPU-Computing; deep learning
Abstract:
Optimizing Distributed Deep Learning Training by Tuning NCCL
Majid Salimi Beni (Faculty of Informatics, TU Wien, Austria), Ruben Laso (Faculty of Computer Science, University of Vienna, Austria), Biagio Cosenza (Department of Computer Science, University of Salerno, Italy), Siegfried Benkner (Faculty of Computer Science, University of Vienna, Austria), and Sascha Hunold (Faculty of Informatics, TU Wien, Austria)

Distributed Deep Learning is essential for training large-scale neural networks when the entire data set or model cannot fit into a single machine. The communication layer of such a deep learning framework is responsible for synchronizing model updates and exchanging gradients between nodes, and the communication operations in that layer must be efficient. The NVIDIA Collective Communications Library (NCCL) is a widely used back-end for communication in GPU-accelerated clusters. Similar to the Message Passing Interface (MPI), NCCL’s efficiency depends on its parameter configuration [1, 2], including the choice of communication algorithms, buffer sizes, and network types.
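For context, NCCL exposes many of these knobs, such as the collective algorithm, wire protocol, buffer size, and channel count, as environment variables that are read when the library initializes. A minimal sketch of applying such a configuration to a Horovod/TensorFlow training process is shown below; the concrete values are illustrative placeholders, not the tuned settings reported in this work.

    import os

    # Illustrative NCCL settings; the values are placeholders, not the
    # tuned configuration reported in the abstract.
    nccl_config = {
        "NCCL_ALGO": "Ring",            # collective algorithm (e.g., Ring, Tree)
        "NCCL_PROTO": "Simple",         # wire protocol (e.g., LL, LL128, Simple)
        "NCCL_BUFFSIZE": str(4 << 20),  # per-channel buffer size in bytes
        "NCCL_MIN_NCHANNELS": "4",      # lower bound on communication channels
    }

    # NCCL reads these variables at initialization, so they must be set
    # before the first collective operation is issued.
    os.environ.update(nccl_config)

    import horovod.tensorflow as hvd
    hvd.init()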
NCCL Parameter Tuning: We propose a two-step offline tuner to optimize the NCCL parameter configuration for multi-GPU clusters. First, we profile the training of the models to determine the most relevant message sizes. Second, we employ a Bayesian optimizer to find an efficient parameter configuration.

Experimental Results: Figure 1 compares the performance of two deep learning models (Bert and NasNetMobile) on 2 nodes of the Leonardo supercomputer, using TensorFlow and Horovod. On top, we compare the bandwidth obtained after tuning the collectives for the most frequently used message size of each model. The tuned configurations improved the bandwidths of the respective NCCL operations in the microbenchmarks by 2.26 and 21.02 times. On the bottom, we show an improvement in training performance of 12% and 13% for Bert and NasNetMobile, respectively, when using the tuned configuration. Our experiments highlight the significant performance gains achievable through optimizing NCCL in distributed deep learning training.
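As a rough illustration of the two-step approach, the sketch below pairs a simple message-size histogram (step 1) with a Bayesian search over a small NCCL parameter space (step 2). It uses scikit-optimize merely as a stand-in for the Bayesian optimizer; the logged message sizes, the parameter ranges, and the run_allreduce_benchmark helper are hypothetical placeholders rather than the actual profiler, tuner, or benchmarks used in the work.

    from collections import Counter

    from skopt import gp_minimize
    from skopt.space import Categorical, Integer

    # Step 1 (profiling): assume the training run was instrumented to log the
    # payload size of every allreduce; pick the most frequent message size.
    logged_message_sizes = [4_194_304, 4_194_304, 1_048_576, 4_194_304]  # bytes, example data
    target_size = Counter(logged_message_sizes).most_common(1)[0][0]

    def run_allreduce_benchmark(message_size, algo, proto, buffsize_mb):
        # Hypothetical helper: in practice this would launch an NCCL allreduce
        # micro-benchmark (e.g., nccl-tests) with the given settings and return
        # the measured bus bandwidth. A synthetic score keeps the sketch runnable.
        return buffsize_mb * (1.5 if proto == "LL128" else 1.0) * (1.2 if algo == "Ring" else 1.0)

    # Step 2 (Bayesian optimization): search a small NCCL configuration space,
    # maximizing bandwidth at the dominant message size (minimize its negative).
    space = [
        Categorical(["Ring", "Tree"], name="algo"),
        Categorical(["LL", "LL128", "Simple"], name="proto"),
        Integer(1, 16, name="buffsize_mb"),
    ]

    def objective(params):
        algo, proto, buffsize_mb = params
        return -run_allreduce_benchmark(target_size, algo, proto, buffsize_mb)

    result = gp_minimize(objective, space, n_calls=30, random_state=0)
    print("best configuration:", result.x)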