<div class="csl-bib-body">
<div class="csl-entry">Salimi Beni, M., Laso, R., Cosenza, B., Benkner, S., & Hunold, S. (2025). Optimizing Distributed Deep Learning Training by Tuning NCCL. In <i>ASHPC25 : Austrian-Slovenian HPC Meeting 2025 : Rimske Terme, Slovenia : 19-22 May 2025</i> (pp. 38–38). https://doi.org/10.34726/10424</div>
</div>
ASHPC25 – Austrian-Slovenian HPC Meeting 2025, Rimske Toplice, 19–22 May 2025

Optimizing Distributed Deep Learning Training by Tuning NCCL

Majid Salimi Beni (a), Ruben Laso (b), Biagio Cosenza (c), Siegfried Benkner (b), and Sascha Hunold (a)

(a) Faculty of Informatics, TU Wien, Austria
(b) Faculty of Computer Science, University of Vienna, Austria
(c) Department of Computer Science, University of Salerno, Italy

Distributed Deep Learning is essential for training large-scale neural networks when the entire data set or model cannot fit into a single machine. The communication layer of such a deep learning framework is responsible for synchronizing model updates and exchanging gradients between nodes, and the communication operations in that layer must be efficient. The NVIDIA Collective Communications Library (NCCL) is a widely used back-end for communication in GPU-accelerated clusters. Similar to the Message Passing Interface (MPI), NCCL's efficiency depends on its parameter configuration [1, 2], including the choice of communication algorithms, buffer sizes, and network types.
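NCCL exposes these tunables as environment variables that are read when its communicators are created. As a minimal sketch (not the configuration used in this work), the following Python snippet shows how a candidate choice of algorithm, protocol, and buffer size could be applied before Horovod triggers NCCL initialization; the concrete values are illustrative only.

    # Minimal sketch: apply a candidate NCCL configuration before Horovod
    # triggers NCCL initialization. The values below are illustrative, not
    # the tuned settings reported in this abstract.
    import os

    os.environ["NCCL_ALGO"] = "Tree"            # collective algorithm (e.g., Ring, Tree)
    os.environ["NCCL_PROTO"] = "LL128"          # wire protocol (LL, LL128, Simple)
    os.environ["NCCL_BUFFSIZE"] = str(8 << 20)  # per-channel buffer size in bytes

    import horovod.tensorflow as hvd

    hvd.init()  # NCCL reads the environment variables above when it is set up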
NCCL Parameter Tuning: We propose a two-step offline tuner to optimize the NCCL parameter configuration for multi-GPU clusters. First, we profile the training of the models to determine the most relevant message sizes. Second, we employ a Bayesian optimizer to find an efficient parameter configuration.

Experimental Results: Figure 1 compares the performance of two deep learning models (Bert and NasNetMobile) on 2 nodes of the Leonardo supercomputer, using TensorFlow and Horovod. On top, we compare the bandwidth obtained after tuning the collectives for the most frequently used message size of each model. The tuned configurations improved the bandwidths of the respective NCCL operations in the microbenchmarks by 2.26 and 21.02 times. On the bottom, we show an improvement in training performance of 12% and 13% for Bert and NasNetMobile, respectively, when using the tuned configuration. Our experiments highlight the significant performance gains achievable through optimizing NCCL in distributed deep learning training.
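The abstract does not detail the tuner's implementation. The sketch below illustrates how the second step (Bayesian optimization) could look under stated assumptions: the dominant message size from the profiling step is already known, scikit-optimize supplies the Bayesian optimizer, and the nccl-tests all_reduce_perf microbenchmark serves as the objective; the launch command, search space, and output parsing are illustrative assumptions, not the authors' setup.

    # Illustrative sketch of the Bayesian-optimization step (assumptions:
    # scikit-optimize as the optimizer, nccl-tests' all_reduce_perf as the
    # microbenchmark, and a dominant message size taken from profiling).
    import os
    import re
    import subprocess

    from skopt import gp_minimize
    from skopt.space import Categorical
    from skopt.utils import use_named_args

    MSG_SIZE = 64 * 1024 * 1024  # dominant message size in bytes (assumed)

    space = [
        Categorical(["Ring", "Tree"], name="NCCL_ALGO"),
        Categorical(["LL", "LL128", "Simple"], name="NCCL_PROTO"),
        Categorical([str(1 << 21), str(1 << 22), str(1 << 23)], name="NCCL_BUFFSIZE"),
    ]

    @use_named_args(space)
    def objective(**params):
        # Candidate configuration is passed to the benchmark via its environment.
        env = dict(os.environ, **{k: str(v) for k, v in params.items()})
        # Run the microbenchmark at the dominant message size only.
        out = subprocess.run(
            ["mpirun", "-np", "8", "all_reduce_perf",
             "-b", str(MSG_SIZE), "-e", str(MSG_SIZE)],
            env=env, capture_output=True, text=True, check=True,
        ).stdout
        # Parse the summary line printed by nccl-tests.
        busbw = float(re.search(r"Avg bus bandwidth\s*:\s*([\d.]+)", out).group(1))
        return -busbw  # maximize bandwidth by minimizing its negative

    result = gp_minimize(objective, space, n_calls=30, random_state=0)
    print("best configuration:", result.x, "bus bandwidth (GB/s):", -result.fun)

In this kind of setup, the best configuration found by the optimizer would then be exported as environment variables for the actual Horovod training run, as in the snippet above.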
dc.language.iso: en
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: High Performance Computing
dc.subject: GPU-Computing
dc.subject: deep learning
dc.title: Optimizing Distributed Deep Learning Training by Tuning NCCL
dc.type: Inproceedings (en); Konferenzbeitrag (de)
dc.rights.license: In Copyright (en); Urheberrechtsschutz (de)
dc.identifier.doi: 10.34726/10424
dc.contributor.affiliation: University of Vienna, Austria
dc.contributor.affiliation: University of Salerno, Italy
dc.contributor.affiliation: University of Vienna, Austria
dc.description.startpage: 38
dc.description.endpage: 38
dc.type.category: Abstract Book Contribution
tuw.booktitle: ASHPC25 : Austrian-Slovenian HPC Meeting 2025 : Rimske Terme, Slovenia : 19-22 May 2025