Vardas, I., Laso Rodriguez, R., & Salimi Beni, M. (2025). ncclsee: A Lightweight Profiling Tool for NCCL. In ASHPC25: Austrian-Slovenian HPC Meeting 2025, Rimske Terme, Slovenia, 19-22 May 2025 (p. 39). https://doi.org/10.34726/10426
Date (published): 22-May-2025
Event name: Austrian-Slovenian HPC Meeting (ASHPC25)
Event date: 19-May-2025 - 22-May-2025
Event place: Rimske Toplice, Slovenia
Number of Pages: 1
Peer reviewed: Yes
Keywords: GPU acceleration; profiling; High Performance Computing
Abstract:
To achieve scalable and efficient distributed deep learning, optimized GPU communication is paramount. We introduce ncclsee, a lightweight profiler plugin built using version 2 of the profiling interface of NVIDIA's Collective Communication Library (NCCL) [1] and the NVIDIA CUDA Profiling Tools Interface (CUPTI) [3]. ncclsee captures communication patterns in real time, offering insights into GPU communication performance. By focusing on simplicity and efficiency, ncclsee enables users to pinpoint and alleviate bottlenecks in distributed workloads, making it useful for debugging and optimizing large-scale AI training workflows.
NCCL is the de facto library for GPU communication with NVIDIA GPUs in deep learning frameworks such as PyTorch and TensorFlow. It delivers high performance by leveraging advanced technologies, including RDMA (Remote Direct Memory Access) and GPUDirect, which enable direct GPU-to-GPU data transfers over interconnects such as PCIe, InfiniBand, and NVLink with minimal CPU involvement. To optimize collective operations like AllReduce, Broadcast, and AllGather, NCCL automatically selects algorithms such as Ring or Tree depending on message size, topology, and buffer characteristics. Once the parameters are selected, NCCL launches a CUDA kernel that performs the collective operation on the GPUs.
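To make the enqueue-versus-completion distinction concrete, the following minimal sketch (a standard single-process, multi-GPU NCCL usage pattern, not taken from the paper) shows that ncclAllReduce only enqueues a kernel on a CUDA stream and returns immediately; the operation actually completes when the stream is synchronized:

    /* Minimal in-place AllReduce across two GPUs in one process.
       Assumes two visible devices; error checking omitted for brevity. */
    #include <nccl.h>
    #include <cuda_runtime.h>

    int main(void) {
      int ndev = 2, devs[2] = {0, 1};
      ncclComm_t comms[2];
      cudaStream_t streams[2];
      float* buf[2];
      size_t count = 1 << 20;

      ncclCommInitAll(comms, ndev, devs);
      for (int i = 0; i < ndev; i++) {
        cudaSetDevice(devs[i]);
        cudaMalloc((void**)&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
      }

      /* One collective per GPU, grouped so NCCL can launch them together.
         NCCL selects the algorithm (e.g. Ring or Tree) internally and
         enqueues a kernel on each stream; these calls return before the
         kernels finish. */
      ncclGroupStart();
      for (int i = 0; i < ndev; i++)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
      ncclGroupEnd();

      for (int i = 0; i < ndev; i++) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);  /* completion happens here */
        cudaFree(buf[i]);
        cudaStreamDestroy(streams[i]);
      }
      for (int i = 0; i < ndev; i++) ncclCommDestroy(comms[i]);
      return 0;
    }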
Profiling NCCL behavior at scale remains challenging: tools like NVIDIA Nsight can produce highly detailed traces, but the volume of data they generate makes them impractical to analyze on large GPU clusters. ncclsee tackles this problem by offering summary information on NCCL operations. It leverages NCCL's event callbacks, including start and stop events as well as proxy progress activity, to accurately track asynchronous operations. Because stopEvent indicates only that a collective has been enqueued rather than completed, ncclsee utilizes CUPTI to measure the execution time of the corresponding CUDA kernel that performs the NCCL operation on the GPUs. The main challenge in developing ncclsee was creating an efficient interface that associates NCCL's profiling API events with the corresponding CUPTI events.
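The snippet below is an illustrative sketch of the standard CUPTI Activity API flow described above, not ncclsee's actual code. It enables kernel activity records and reads back each kernel's start/end timestamps and correlation ID, which is the kind of key a profiler can use to match a GPU kernel back to the API call that launched it (the record struct version, here CUpti_ActivityKernel4, varies across CUDA toolkits):

    #include <cupti.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BUF_SIZE (8 * 1024 * 1024)

    /* CUPTI asks us for a buffer to fill with activity records. */
    static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                         size_t *maxNumRecords) {
      *buffer = (uint8_t*)malloc(BUF_SIZE);
      *size = BUF_SIZE;
      *maxNumRecords = 0;  /* no record limit */
    }

    /* CUPTI hands back a filled buffer; walk its records and extract
       per-kernel timing. */
    static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                         uint8_t *buffer, size_t size,
                                         size_t validSize) {
      CUpti_Activity *record = NULL;
      while (cuptiActivityGetNextRecord(buffer, validSize, &record)
             == CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
          CUpti_ActivityKernel4 *k = (CUpti_ActivityKernel4*)record;
          printf("kernel %s: %.3f ms (correlation %u)\n",
                 k->name, (k->end - k->start) / 1e6, k->correlationId);
        }
      }
      free(buffer);
    }

    void profiler_init(void) {
      cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
      cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
    }

    void profiler_flush(void) {
      cuptiActivityFlushAll(0);  /* force delivery of completed records */
    }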
ncclsee captures summary data for each NCCL operation, binned by buffer size, and presents it in a concise format (e.g., Operation: ncclAllReduce, Buffer range: 128-4096 Bytes, Calls: 52, Time: 770 ms), allowing users to quickly identify communication patterns and potential bottlenecks.
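One plausible way to produce such per-buffer-range summaries is power-of-two binning; the sketch below is purely illustrative (the bin boundaries and the allreduce_bins/record_op/print_summary names are hypothetical, not ncclsee's internals):

    #include <stdint.h>
    #include <stdio.h>

    #define NBINS 32

    typedef struct { uint64_t calls; double time_ms; } BinStats;
    static BinStats allreduce_bins[NBINS];

    /* Map a buffer size in bytes to a power-of-two bin index. */
    static int size_to_bin(size_t bytes) {
      int bin = 0;
      while ((((size_t)1) << (bin + 1)) <= bytes && bin < NBINS - 1) bin++;
      return bin;
    }

    /* Accumulate one completed operation into its bin. */
    static void record_op(size_t bytes, double elapsed_ms) {
      BinStats *b = &allreduce_bins[size_to_bin(bytes)];
      b->calls++;
      b->time_ms += elapsed_ms;
    }

    /* Emit one summary line per non-empty bin, in the format above. */
    static void print_summary(void) {
      for (int i = 0; i < NBINS - 1; i++) {
        if (allreduce_bins[i].calls == 0) continue;
        printf("Operation: ncclAllReduce, Buffer range: %zu-%zu Bytes, "
               "Calls: %llu, Time: %.0f ms\n",
               (size_t)1 << i, (((size_t)1 << (i + 1)) - 1),
               (unsigned long long)allreduce_bins[i].calls,
               allreduce_bins[i].time_ms);
      }
    }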
Ease of use: Integrating ncclsee into existing workflows is straightforward. After compiling ncclsee to produce libnccl-profiler.so, users simply set the NCCL_PROFILER_PLUGIN environment variable to the path of libnccl-profiler.so. Once enabled, ncclsee records metrics for applications that use NCCL directly or through frameworks that depend on it, such as PyTorch and TensorFlow. ncclsee is actively under development, with a functional version already available on GitHub [3].
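Under these assumptions, a typical session might look as follows (the build step, paths, and launcher command are placeholders, not prescribed by ncclsee):

    # build the plugin; the exact build command depends on the repository
    make                                            # assume it produces libnccl-profiler.so
    # point NCCL at the plugin and run the workload unchanged
    export NCCL_PROFILER_PLUGIN=/path/to/libnccl-profiler.so
    torchrun --nproc_per_node=4 train.py            # any NCCL-backed application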