Study of performance and portability of a scientific code on long vector architectures

Pölzl, Paul

doi:10.34726/hss.2026.130548

Record link:

https://doi.org/10.34726/hss.2026.130548
http://hdl.handle.net/20.500.12708/228559

Title:

Study of performance and portability of a scientific code on long vector architectures

Citation:

Pölzl, P. (2026). Study of performance and portability of a scientific code on long vector architectures [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2026.130548

reposiTUm DOI:

10.34726/hss.2026.130548

CatalogPlus:

AC17891812

Publication Type:

Thesis - Diplomarbeit

Language:

English

Authors:

Pölzl, Paul

Advisor:

Jantsch, Axel

Co-advisor:

Maragkou, Sofia

Organisational Unit:

E384 - Institut für Computertechnik

Date (published):

2026

Number of Pages:

109

Keywords:

High Performance Computing; RISCV; Scientific Computing

Abstract:

Long-vector architectures are currently experiencing a resurgence in high performance computing (HPC) as they promise massive parallelism and portable code when combined with auto-vectorization. The European processor accelerators (EPAC) prototype, developed by the European Processor Initiative (EPI), combines this principle with the open-source RISC-V "V" (RVV) extension. While the hardware is actively developed, most HPC software targets heterogeneous host-device platforms and is poorly prepared for long-vector hardware. Current literature specifically lacks full-scale optimization studies of applications dominated by spherical harmonic transforms or similar spectral methods. Therefore, the vectorization challenges and cross-platform performance portability of these workloads remain unclear. To address this gap, this thesis formalizes a reusable workflow based on the software development vehicle (SDV) methodology to systematically optimize the spherical harmonic transforms inside XSHELLS for the EPAC prototype, explicitly comparing the trade-offs between compiler auto-vectorization and architecture-specific vector intrinsics. Performance evaluations reveal that refactoring code to assist auto-vectorization yields overall speedups of up to 1.91×. Explicit vectorization overcomes severe compiler limitations in nested loops and delivers overall gains of up to 2.49×. Although these code adaptations translate effectively to the NEC SX-Aurora (yielding gains up to 20.59× over the scalar baseline), they produce significant loop overhead, degrading performance to 0.63× and 0.87× of the auto-vectorized baseline on the Intel Sapphire Rapids and NVIDIA Grace CPUs, respectively. Ultimately, this research produces an SDV-based optimization blueprint for the EPAC platform and identifies three generalizable code patterns critical for vectorization efficiency.
The results demonstrate that maximizing hardware utilization on current long-vector architectures requires manual intrinsic vectorization, as compiler support for nested multi-dimensional loops remains a critical bottleneck.

Additional information:

Title differs from the original according to the author's own translation

License:

In Copyright

Appears in Collections:

Thesis