Checkpoint/Restart Standard Forum

Who are we?

We are a group of people from academia, industry, and government labs working to standardize the Checkpoint/Restart (C/R) APIs for supercomputing. Our goal is to make C/R tools usable on fast-changing HPC platforms with production workloads by working with HPC hardware and software vendors, C/R tool developers, application and library developers, HPC practitioners, and end users. We are developing a C/R interface standard and facilitating its adoption in the HPC community to help harness the full benefits of C/R, which go far beyond fault tolerance.

Steering committee

  • Kapil Arya, Azure Systems Research

  • Gene Cooperman, Northeastern University

  • Donglai Dai, X-ScaleSolutions Inc

  • Doug Fuller, Cornelis Networks

  • Rebecca Hartman-Baker, Berkeley Lab

  • Lena M. Lopatina, Los Alamos National Lab

  • Bogdan Nicolae, Argonne National Lab

  • Sarp Oral, Oak Ridge National Lab

  • Adrian Reber, RedHat Inc

  • David Yat Sin, AMD, Inc

  • Andrey Vagin, Google Inc

  • Patrick Widener, Oak Ridge National Lab

  • Zhengji Zhao, Berkeley Lab

Review committee

  • Tony Skjellum, University of Tennessee, Chattanooga

  • John Shalf, Lawrence Berkeley National Laboratory

  • Eric Roman, Lawrence Berkeley National Laboratory

  • Yves Robert, ENS Lyon

Activities

The C/R Standard Forum was formed in January 2022. Since then, the steering committee has been meeting bi-weekly to gather requirements for the C/R interface standard, and has collected input from 27 code teams so far through the bi-weekly meetings and the C/R Requirements Gathering Workshop held in July 2022. The committee is currently drafting the requirements documents from which the C/R interface specification will be extracted. The steering committee will release the first version of the C/R interface standard specification at SC22 (November 2022). A BOF session has been planned for this release.


The C/R Standard Forum is open to anyone interested. If you would like to participate in the C/R standardization effort, please contact Zhengji Zhao (ZZhao@lbl.gov) or any member of the steering committee.