Salimi Beni, M., Laso, R., Cosenza, B., Benkner, S., & Hunold, S. (2025). Exploring NCCL Tuning Strategies for Distributed Deep Learning. In 2025 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (pp. 59–62). IEEE. https://doi.org/10.1109/IPDPSW66978.2025.00015
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/222800
-
dc.description.abstract
The communication overhead in distributed deep learning caused by the synchronization of model parameters across multiple devices can significantly impact training time. Although powerful GPU-GPU communication libraries, such as NCCL, are available, their default configurations have not been effectively adapted to varying hardware and workloads, which can result in lower performance. In this paper, we explore the tuning potential of NCCL and present an approach to tuning its parameters for distributed deep learning workloads. We identify efficient parameter configurations through an optimization process that explores the solution space defined by performance-related NCCL parameters. Experimental results on the Leonardo supercomputer, utilizing up to 64 GPUs, show significant performance improvements across micro-benchmarks and three deep learning models. For ncclAllReduce and ncclAllGather, we improved the bandwidth by factors of 112× and 36× in micro-benchmarks, respectively. The tuned NCCL parameter configurations reduced the training time of the models by up to 12.5%.
en
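To illustrate the kind of parameter tuning the abstract describes, the following minimal sketch sets a few performance-related NCCL environment variables before a PyTorch NCCL process group is created and then runs a single all-reduce. The specific variables and values shown are illustrative assumptions only; they are not the paper's optimization process or its tuned configurations.

import os

# NCCL reads its performance-related parameters when the communicator is
# initialized, so the environment must be set before init_process_group().
# These values are assumptions for illustration, not the paper's tuned settings.
os.environ.setdefault("NCCL_ALGO", "Ring")        # collective algorithm (assumed)
os.environ.setdefault("NCCL_PROTO", "Simple")     # wire protocol (assumed)
os.environ.setdefault("NCCL_MIN_NCHANNELS", "4")  # lower bound on channels (assumed)
os.environ.setdefault("NCCL_DEBUG", "INFO")       # log NCCL's initialization choices

import torch
import torch.distributed as dist

def main() -> None:
    # Expected to be launched via torchrun, which provides LOCAL_RANK/RANK/WORLD_SIZE.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # A single all-reduce over a 1M-element tensor (ncclAllReduce under the hood).
    payload = torch.ones(1 << 20, device="cuda")
    dist.all_reduce(payload)
    torch.cuda.synchronize()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Run, for example, with torchrun --nproc_per_node=4 on a multi-GPU node; with NCCL_DEBUG=INFO, NCCL prints initialization details such as the channels and transports it selected, which helps when comparing candidate configurations.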
dc.description.sponsorship
FWF - Österr. Wissenschaftsfonds
-
dc.language.iso
en
-
dc.relation.ispartofseries
IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW)
-
dc.subject
Collective Communications
en
dc.subject
Deep Learning
en
dc.subject
Multi-GPU
en
dc.subject
NCCL
en
dc.subject
Parameter Tuning
en
dc.title
Exploring NCCL Tuning Strategies for Distributed Deep Learning
en
dc.type
Inproceedings
en
dc.type
Konferenzbeitrag
de
dc.contributor.affiliation
University of Salerno, Italy
-
dc.contributor.affiliation
University of Vienna, Austria
-
dc.relation.isbn
979-8-3315-2643-6
-
dc.relation.doi
10.1109/IPDPSW66978.2025
-
dc.relation.issn
2639-3867
-
dc.description.startpage
59
-
dc.description.endpage
62
-
dc.relation.grantno
P 33884-N
-
dc.type.category
Full-Paper Contribution
-
dc.relation.eissn
2995-066X
-
tuw.booktitle
2025 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
-
tuw.peerreviewed
true
-
tuw.relation.publisher
IEEE
-
tuw.project.title
Offline- und Online-Autotuning von Parallelen Programmen
-
tuw.researchTopic.id
C4
-
tuw.researchTopic.id
C5
-
tuw.researchTopic.id
C3
-
tuw.researchTopic.name
Mathematical and Algorithmic Foundations
-
tuw.researchTopic.name
Computer Science Foundations
-
tuw.researchTopic.name
Computational System Design
-
tuw.researchTopic.value
25
-
tuw.researchTopic.value
25
-
tuw.researchTopic.value
50
-
tuw.publication.orgunit
E191-04 - Forschungsbereich Parallel Computing
-
tuw.publisher.doi
10.1109/IPDPSW66978.2025.00015
-
dc.description.numberOfPages
4
-
tuw.author.orcid
0000-0002-8634-7712
-
tuw.author.orcid
0000-0003-2574-4025
-
tuw.author.orcid
0000-0002-8869-6705
-
tuw.author.orcid
0000-0002-6520-2047
-
tuw.author.orcid
0000-0002-5280-3855
-
tuw.event.name
2025 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)