Vardas, I. (2025). Improving Colocated MPI Application Performance via Process Mapping in HPC Systems. [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.136022
With the rapid growth of data-intensive applications in scientific simulations and artificial intelligence, the demand for high-performance computing (HPC) has increased considerably. Modern HPC systems have evolved into complex architectures, characterized by deep memory hierarchies and numerous computing cores with non-uniform memory access times. While these architectures offer extreme computing power, they present significant challenges for optimizing parallel applications. Resource conflicts over shared elements such as caches or main memory can degrade the performance of parallel applications. Additionally, HPC scheduler allocation policies, which are designed to minimize node usage, can inadvertently increase competition for shared resources among homogeneous processes, negatively affecting overall performance. A critical challenge in optimizing parallel applications is the assignment of processes to computing cores to avoid resource conflicts and maximize performance. This dissertation addresses this challenge through two approaches. The first approach investigates the communication structure of MPI applications to identify performance bottlenecks and optimize process mapping. The Message Passing Interface (MPI) remains the de facto standard for programming parallel applications in HPC. MPI communicators enable process grouping for communication, and understanding these communication patterns is essential for performance optimization. We developed the profiling tool mpisee, which, unlike existing tools, provides detailed information about communication per communicator, revealing MPI collective inefficiencies to guide better algorithm selection.
The second approach examines how multiple parallel applications can efficiently share a common pool of compute nodes on high-performance systems. We developed mapping strategies paired with colocation to place computing processes from different applications onto the cores within each individual node of the shared allocation in ways that utilize shared resources efficiently. Our results demonstrate that colocated execution of multiple parallel applications under these strategies often achieves better runtimes than isolated execution.
en
Additional information:
Thesis not yet received by the library; data not verified