Salimi Beni, M., Laso, R., Cosenza, B., Benkner, S., & Hunold, S. (2025). Optimizing Distributed Deep Learning Training by Tuning NCCL. In ASHPC25 : Austrian-Slovenian HPC Meeting 2025 : Rimske Terme, Slovenia : 19-22 May 2025 (p. 38). https://doi.org/10.34726/10424
Date (published):
22-May-2025
Event name:
Austrian-Slovenian HPC Meeting 2025 (ASHPC25)
Event date:
19-May-2025 - 22-May-2025
Event place:
Rimske Toplice, Slovenia
Number of Pages:
1
Peer reviewed:
Yes
Keywords:
High Performance Computing; GPU-Computing; deep learning
Abstract:
Optimizing Distributed Deep Learning Training by Tuning NCCL
Majid Salimi Beni (Faculty of Informatics, TU Wien, Austria), Ruben Laso (Faculty of Computer Science, University of Vienna, Austria), Biagio Cosenza (Department of Computer Science, University of Salerno, Italy), Siegfried Benkner (Faculty of Computer Science, University of Vienna, Austria), and Sascha Hunold (Faculty of Informatics, TU Wien, Austria)

Distributed Deep Learning is essential for training large-scale neural networks when the entire data set or model cannot fit into a single machine. The communication layer of such a deep learning framework is responsible for synchronizing model updates and exchanging gradients between nodes, and the communication operations in that layer must be efficient. The NVIDIA Collective Communications Library (NCCL) is a widely used back-end for communication in GPU-accelerated clusters. Similar to the Message Passing Interface (MPI), NCCL’s efficiency depends on its parameter configuration [1, 2], including the choice of communication algorithms, buffer sizes, and network types.
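For context, NCCL exposes many of these knobs, such as the collective algorithm, wire protocol, buffer size, and channel count, as environment variables that are read when the library initializes. A minimal sketch of applying such a configuration to a Horovod/TensorFlow training process is shown below; the concrete values are illustrative placeholders, not the tuned settings reported in this work.

    import os

    # Illustrative NCCL settings; the values are placeholders, not the
    # tuned configuration reported in the abstract.
    nccl_config = {
        "NCCL_ALGO": "Ring",            # collective algorithm (e.g., Ring, Tree)
        "NCCL_PROTO": "Simple",         # wire protocol (e.g., LL, LL128, Simple)
        "NCCL_BUFFSIZE": str(4 << 20),  # per-channel buffer size in bytes
        "NCCL_MIN_NCHANNELS": "4",      # lower bound on communication channels
    }

    # NCCL reads these variables at initialization, so they must be set
    # before the first collective operation is issued.
    os.environ.update(nccl_config)

    import horovod.tensorflow as hvd
    hvd.init()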
NCCL Parameter Tuning: We propose a two-step offline tuner to optimize the NCCL parameter configuration for multi-GPU clusters. First, we profile the training of the models to determine the most relevant message sizes. Second, we employ a Bayesian optimizer to find an efficient parameter configuration.

Experimental Results: Figure 1 compares the performance of two deep learning models (Bert and NasNetMobile) on 2 nodes of the Leonardo supercomputer, using TensorFlow and Horovod. On top, we compare the bandwidth obtained after tuning the collectives for the most frequently used message size of each model. The tuned configurations improved the bandwidths of the respective NCCL operations in the microbenchmarks by 2.26 and 21.02 times. On the bottom, we show an improvement in training performance of 12% and 13% for Bert and NasNetMobile, respectively, when using the tuned configuration. Our experiments highlight the significant performance gains achievable through optimizing NCCL in distributed deep learning training.
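As a rough illustration of the two-step approach, the sketch below pairs a simple message-size histogram (step 1) with a Bayesian search over a small NCCL parameter space (step 2). It uses scikit-optimize merely as a stand-in for the Bayesian optimizer; the logged message sizes, the parameter ranges, and the run_allreduce_benchmark helper are hypothetical placeholders rather than the actual profiler, tuner, or benchmarks used in the work.

    from collections import Counter

    from skopt import gp_minimize
    from skopt.space import Categorical, Integer

    # Step 1 (profiling): assume the training run was instrumented to log the
    # payload size of every allreduce; pick the most frequent message size.
    logged_message_sizes = [4_194_304, 4_194_304, 1_048_576, 4_194_304]  # bytes, example data
    target_size = Counter(logged_message_sizes).most_common(1)[0][0]

    def run_allreduce_benchmark(message_size, algo, proto, buffsize_mb):
        # Hypothetical helper: in practice this would launch an NCCL allreduce
        # micro-benchmark (e.g., nccl-tests) with the given settings and return
        # the measured bus bandwidth. A synthetic score keeps the sketch runnable.
        return buffsize_mb * (1.5 if proto == "LL128" else 1.0) * (1.2 if algo == "Ring" else 1.0)

    # Step 2 (Bayesian optimization): search a small NCCL configuration space,
    # maximizing bandwidth at the dominant message size (minimize its negative).
    space = [
        Categorical(["Ring", "Tree"], name="algo"),
        Categorical(["LL", "LL128", "Simple"], name="proto"),
        Integer(1, 16, name="buffsize_mb"),
    ]

    def objective(params):
        algo, proto, buffsize_mb = params
        return -run_allreduce_benchmark(target_size, algo, proto, buffsize_mb)

    result = gp_minimize(objective, space, n_calls=30, random_state=0)
    print("best configuration:", result.x)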