SC22 BOF

SC22 BOF: The Checkpoint/Restart Interface Standard: Version 1.0

We will host a BOF session at SC22 to introduce version 1.0 of the Checkpoint/Restart standard. The BOF session is an in-person-only event; remote participation is not allowed, per SC22's request. We will make the presentation slides available on this page after the BOF.


Time: Tuesday, November 15, 2022, 12:15pm - 1:15pm CST


Location: C141-143-149 at Kay Bailey Hutchison Convention Center Dallas, Texas, USA


Event Type: In-person only


Schedule:


Presentations (slides)


Time: 12:15pm - 12:19pm CST

Title: Introduction to the CRI standard 1.0

Presenter: Zhengji Zhao, Lawrence Berkeley National Laboratory

Description: This talk will provide an introduction to the CRI standard version 1.0, including the motivation, goals, and background behind developing the Checkpoint/Restart standard, the design of the CRI-1.0 standard, and the roadmap for the CRI standard.


Time: 12:19pm - 12:26pm CST

Title: APIs for Application-Level Checkpointing & I/O strategy management

Presenter: Bogdan Nicolae, Argonne National Laboratory

Description: This talk will cover the APIs proposed for application-level checkpointing and the APIs designed to address the I/O overhead challenges common to all forms of checkpointing.


Time: 12:26pm - 12:31pm CST

Title: APIs and Commands for Transparent Checkpointing

Presenter: Kapil Arya, Azure Systems Research

Description: This talk will cover the APIs and commands defined for transparent checkpointing that are intended for application end users to use externally.


Time: 12:31pm - 12:38pm CST

Title: Hardware-Vendor Support for Checkpointing

Presenter: Gene Cooperman, Northeastern University

Description: This talk will cover a set of standard APIs that hardware vendors are encouraged to provide so that transparent C/R tools can achieve code portability across hardware.


Time: 12:38pm - 12:42pm CST

Title: APIs for Hybrid Checkpointing & Recommendations for Applications

Presenter: Donglai Dai, X-ScaleSolution, Inc

Description: This talk will cover the APIs for applications to invoke transparent checkpointing tools internally or externally to achieve checkpointing efficiency and flexibility. The talk will also present the information/practices that applications can provide/follow to assist transparent checkpointing.


Time: 12:42pm - 12:45pm CST

Title: Batch System - Scheduler

Presenter: Rebecca Hartman-Baker, Lawrence Berkeley National Laboratory

Description: This talk will cover the desired features/functionalities for batch systems to enable C/R.


Q&A Session

12:45pm - 1:15pm CST

SC22 BOF Title: The Checkpoint/Restart Interface Standard: Version 1.0

Abstract

As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) is essential to a wide range of HPC communities. To help the community develop portable C/R codes and harness C/R benefits, which go far beyond resilience, the C/R Standard Forum will release the first version of the C/R interface standard at SC22. In this session, the C/R Standard Forum will present its first release of the C/R interface standard specification and invite feedback from the HPC community on both the features included in the specification and the roadmap for future efforts.


Long Description

As a primary approach to fault-tolerant computing, Checkpoint/Restart (C/R) is essential to a wide range of HPC communities. It has seen increasing adoption in many additional scenarios, including suspend-resume, process migration, and replay debugging. More recently, with the convergence of HPC, big data analytics, and machine learning, checkpointing is becoming an essential pattern in allowing applications to progress with their computations. Because software and hardware are evolving quickly and becoming more complicated and heterogeneous, the development cycles for C/R tools will lengthen as they try to support new hardware and the workloads on it, leaving C/R tools chasing ever-newer hardware and never quite catching up. This cycle has kept HPC communities from reaping the benefits of C/R.

To help the HPC community develop more portable C/R codes and harness C/R benefits that go far beyond resilience, the C/R Standard Forum will release the first version of the C/R interface standard. Achieving portability of C/R codes requires all parties in HPC to work together. In this BOF meeting, the C/R Standard Forum will present the C/R interface standard specification and gather feedback from SC22 attendees (HPC hardware/software vendors, system software developers, C/R tools/libraries developers, applications and other tools/libraries developers, application end users, and HPC practitioners) on the features included in the specification and on the roadmap for future efforts, to help guide future extensions and modifications.

The C/R Standard Forum was formed in January 2022 and began its work by gathering requirements for the C/R standard through a requirements-gathering workshop and bi-weekly meetings. This is the Forum's first time hosting a BoF at SC. The session will be led by Zhengji Zhao, the primary organizer of the C/R Standard Forum, with the help of experts on both transparent and application-initiated checkpointing who are actively working in the C/R Standard Forum. The speakers will be Zhengji Zhao (NERSC), Gene Cooperman (Northeastern Univ.), Bogdan Nicolae (ANL), and Rebecca Hartman-Baker (LBNL). The discussions in this BoF meeting will be summarized and published on the C/R Standard Forum website, and will be used to inform the C/R Forum working groups in their future activities, e.g., developing the next version of the C/R interface standard.


The C/R Standard Forum is open to anyone interested; new members are always welcome. We hope to use this BoF as an opportunity to make the Forum better known and more approachable in the HPC community, and to encourage participation from a broader and more diverse group of interested people. We also aim to inform the community about ongoing activities and future directions, and to expand the adoption and deployment of this important standard, which impacts a wide range of HPC communities.