Creating a portable programming abstraction for wavefront patterns targeting HPC systems

Date
2019
Publisher
University of Delaware
Abstract
Processor architectures have been rapidly evolving for decades. From the introduction of the first multicore processor by IBM in 2001 [2] to the massively parallel supercomputers of today, the exploitation of parallelism has become increasingly important as the clock rates of a single core have plateaued. Heterogeneity is also on the rise since the discovery that domain-specific hardware (GPUs) could be repurposed for general-purpose parallel computation [10]. This shift has prompted the need to rethink algorithms, languages, and programming models in order to increase parallelism from a programming standpoint and migrate large-scale applications to today's massively powerful platforms. This is not a trivial task, as these architectures and systems are still undergoing constant evolution. More recently, supercomputing centers have been transitioning toward fat nodes (nodes with many more cores due to the presence of multiple accelerators) in order to reduce node count and the overhead associated with cross-node communication. For example, Oak Ridge National Laboratory's TITAN supercomputer (OLCF-3), built in 2011, comprised 18,688 nodes, each containing a single NVIDIA Tesla K20x GPU accelerator [60]. In 2018, less than a decade later, ORNL constructed the Summit supercomputer (OLCF-4), consisting of 4,608 nodes, each equipped with six NVIDIA Tesla V100 GPU accelerators [67]. The trend toward fat-node systems underscores the importance of on-node programming models. Low-level languages like CUDA and OpenCL offer direct control over GPU hardware, but they impose a steep learning curve and lack portability, both serious concerns for application developers. Learning a hardware-specific low-level language, porting an application to it, and then reimplementing the same code when a newer GPU (or non-GPU) architecture emerges is an enormous time sink.
The demand for portable solutions for programming parallel systems with minimal programmer overhead led to the creation of directive-based programming. Directive-based programming models, such as OpenMP [106] and OpenACC [40, 24, 105], allow programmers to simply annotate their existing code with statements that describe the parallelism found within that code. A compiler then translates the annotated code into code that can run on a specified target architecture. This programming approach has become increasingly popular among industry scientists [9]. Although directive-based programming models allow programmers to worry less about programming and more about science, expressing complex parallel patterns in these models can be a daunting task, especially when the goal is to achieve the theoretical maximum performance that today's hardware platforms are ready to offer. One such parallel pattern commonly found in scientific applications is called wavefront. This thesis examines existing state-of-the-art wavefront applications and parallelization strategies, and uses them to create a high-level abstraction of wavefront parallelism and a programming language extension that facilitates an easy adaptation of such applications in order to expose parallelism on existing and future HPC systems. This thesis presents an open-source tool called Wavebench, which uses wavefront algorithms commonly found in real-world scientific applications to model the performance impact of wavefront parallelism on HPC systems. This thesis also uses the insights gained during the creation of this tool to apply the developed high-level abstraction to a real-world case study application called Minisweep: a mini-application representative of the main computational kernel in Oak Ridge National Laboratory's Denovo radiation transport code used for nuclear reactor modeling.
The OpenACC implementation of this abstraction running on NVIDIA's Volta GPU (present in ORNL's Summit supercomputer) achieves an 85.06x speedup over serial code, exceeding CUDA's 83.72x speedup over the same serial implementation. This serves as a proof of concept for the viability of the solutions presented by this thesis.