Virtual Conference: 2022 SIAM Conference on Parallel Processing for Scientific Computing

Part of MS38 Toward Scalable Resilient and Fault Tolerant Applications for Extreme Scale Computing Systems - Part III of III
A Cooperative Compiler and Runtime Checkpoint/Restart Approach for Kokkos

Abstract. Resilience is a pressing issue for next-generation platforms due to projected increases in node failures. Typically, checkpoint/restart is used to save the intermediate state of a program and to resume from the point of failure. However, adding checkpoint/restart capability to an application is not always trivial for users, because no performance-portable, out-of-the-box checkpoint/restart framework exists across different HPC platforms.

Kokkos is designed to provide a productive yet performance-portable parallel programming model. In this talk, we explore the possibility of constructing a performance-portable checkpoint/restart framework built on top of the Kokkos programming system. Our key contributions are twofold. First, we propose a clang-based source-to-source translator that analyzes the use of Kokkos's View in Kokkos's parallel constructs and automatically generates a sequence of checkpoint API calls. Second, we propose a cooperative runtime and its API, which enables optimizing the frequency of checkpointing, thereby avoiding overly fine-grained checkpointing. Such a synergetic approach enables Kokkos programmers to focus on writing standard Kokkos programs. We will discuss the effectiveness and applicability of our approach using different Kokkos applications, such as a 3D heat distribution application and mini-applications from the Mantevo project.
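To make the division of labor concrete, the following is a minimal sketch of how generated checkpoint requests and a cooperative runtime might interact. It is not the authors' API and does not use Kokkos itself: LabeledBuffer stands in for a Kokkos View, CheckpointRuntime, request_checkpoint, and restart are hypothetical names, and the fixed step interval is an assumed policy chosen only because it is easy to follow. The idea shown is that a request is issued after every parallel region, while the runtime honors only some of them.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Kokkos View: a labeled buffer of doubles.
struct LabeledBuffer {
  std::string label;
  std::vector<double> data;
};

// Hypothetical cooperative runtime. Generated code requests a checkpoint
// after every parallel region; the runtime decides whether to honor it.
class CheckpointRuntime {
 public:
  explicit CheckpointRuntime(std::size_t interval) : interval_(interval) {
    assert(interval > 0 && "checkpoint interval must be positive");
  }

  // Returns true only if the buffers were actually saved on this request.
  bool request_checkpoint(std::size_t step,
                          const std::vector<const LabeledBuffer*>& buffers) {
    ++requests_;
    if (step % interval_ != 0) return false;  // assumed policy: every Nth step
    for (const LabeledBuffer* b : buffers) saved_[b->label] = b->data;
    last_saved_step_ = step;
    has_checkpoint_ = true;
    ++saves_;
    return true;
  }

  // Restores the latest snapshot and returns the next step to execute.
  std::size_t restart(const std::vector<LabeledBuffer*>& buffers) const {
    assert(has_checkpoint_);
    for (LabeledBuffer* b : buffers) b->data = saved_.at(b->label);
    return last_saved_step_ + 1;
  }

  std::size_t requests() const { return requests_; }
  std::size_t saves() const { return saves_; }

 private:
  std::size_t interval_;
  std::size_t requests_ = 0;
  std::size_t saves_ = 0;
  std::size_t last_saved_step_ = 0;
  bool has_checkpoint_ = false;
  std::map<std::string, std::vector<double>> saved_;
};

// One Jacobi-style sweep on a 1D rod with fixed end values; a toy stand-in
// for the body of a parallel loop over a View.
void heat_step(const LabeledBuffer& in, LabeledBuffer& out) {
  const std::size_t n = in.data.size();
  out.data.front() = in.data.front();
  out.data.back() = in.data.back();
  for (std::size_t i = 1; i + 1 < n; ++i)
    out.data[i] = 0.5 * (in.data[i - 1] + in.data[i + 1]);
}

struct RunStats {
  std::size_t requests;
  std::size_t saves;
};

// Runs the toy solver. The request_checkpoint call marks where a translator
// could insert a generated call after the loop that writes the buffer.
RunStats run(std::size_t steps, std::size_t interval) {
  LabeledBuffer temp{"temperature", std::vector<double>(8, 0.0)};
  temp.data.front() = 100.0;  // hot boundary on the left end
  LabeledBuffer next{"scratch", temp.data};
  CheckpointRuntime rt(interval);
  for (std::size_t step = 1; step <= steps; ++step) {
    heat_step(temp, next);
    std::swap(temp.data, next.data);
    rt.request_checkpoint(step, {&temp});
  }
  return {rt.requests(), rt.saves()};
}
```

With this policy, run(10, 4) issues ten requests but saves only twice, at steps 4 and 8. A real runtime would more plausibly base the decision on measured checkpoint cost and expected failure rates than on a fixed step count.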

Authors  
 
 