Abstract. Resilience is a pressing concern for next-generation platforms because node failures are projected to increase. Typically, checkpoint/restart is used to save the intermediate state of a program and to resume from the point of failure. However, adding checkpoint/restart capability to an application is not always trivial for users, because no performance-portable, out-of-the-box checkpoint/restart framework exists across different HPC platforms.
Kokkos is designed to provide a productive yet performance-portable parallel programming model. In this talk, we explore the possibility of constructing a performance-portable checkpoint/restart framework on top of the Kokkos programming system. Our key contributions are twofold. First, we propose a Clang-based source-to-source translator that analyzes the use of Kokkos Views in Kokkos parallel constructs and automatically generates a sequence of checkpoint API calls. Second, we propose a cooperative runtime and its API, which enable the checkpointing frequency to be optimized, thereby avoiding overly fine-grained checkpointing. This synergistic approach lets Kokkos programmers focus on writing standard Kokkos programs. We will discuss the effectiveness and applicability of our approach using different Kokkos applications, such as a 3D heat distribution application and mini-applications from the Mantevo project.
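The division of labor described above, where generated code requests a checkpoint at every opportunity and a cooperative runtime decides how often to actually commit one, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `CheckpointRuntime`, `request_checkpoint`, `restart`, and `run_sketch` are hypothetical names, not the API of the framework; a `std::vector<double>` stands in for a Kokkos View so that the example compiles without Kokkos; and the fixed-interval policy is only the simplest possible stand-in for whatever frequency optimization the actual runtime performs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a Kokkos View (assumption: a real program
// would use Kokkos data structures instead of std::vector).
using FakeView = std::vector<double>;

// Hypothetical cooperative runtime. The application, or code generated by
// a translator, requests a checkpoint every iteration; the runtime decides
// whether the request is actually committed.
class CheckpointRuntime {
public:
    // interval == 0 disables checkpointing entirely.
    explicit CheckpointRuntime(std::size_t interval) : interval_(interval) {}

    // Commits a snapshot only on every interval-th iteration.
    // Returns true if the snapshot was taken, false if it was skipped.
    bool request_checkpoint(std::size_t iteration, const FakeView& view) {
        if (interval_ == 0 || iteration % interval_ != 0) {
            return false;  // too soon: skip to avoid fine-grained checkpoints
        }
        snapshot_ = view;  // in-memory copy; a real runtime would persist it
        last_iteration_ = iteration;
        ++commits_;
        return true;
    }

    // Restores the last committed snapshot into view and returns the
    // iteration from which the computation should resume.
    std::size_t restart(FakeView& view) const {
        view = snapshot_;
        return last_iteration_;
    }

    std::size_t commits() const { return commits_; }

private:
    std::size_t interval_;
    std::size_t last_iteration_ = 0;
    std::size_t commits_ = 0;
    FakeView snapshot_;
};

// Toy 1D relaxation loop standing in for an application time-step loop.
// Returns how many checkpoints the runtime actually committed.
std::size_t run_sketch(std::size_t steps, std::size_t interval) {
    FakeView u(16, 0.0);
    u[0] = 1.0;  // fixed boundary value on the left end
    CheckpointRuntime runtime(interval);
    for (std::size_t it = 1; it <= steps; ++it) {
        FakeView next(u);
        for (std::size_t i = 1; i + 1 < u.size(); ++i) {
            next[i] = 0.5 * (u[i - 1] + u[i + 1]);
        }
        u.swap(next);
        // The kind of call a source-to-source translator might insert.
        runtime.request_checkpoint(it, u);
    }
    return runtime.commits();
}
```

Under these assumptions, calling `run_sketch(100, 10)` issues 100 checkpoint requests but commits only 10 of them, which is the effect the cooperative runtime is meant to achieve: the program text stays a plain time-step loop, while the checkpoint granularity is decided elsewhere.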