The Scalable Checkpoint / Restart (SCR) library enables MPI applications to utilize distributed storage on Linux clusters to attain high file I/O bandwidth for checkpointing and restarting large-scale jobs. With SCR, jobs run more efficiently, recompute less work upon a failure, and reduce load on critical shared resources such as the parallel file system.
Detailed usage instructions are provided at SCR.ReadTheDocs.io.
SCR is an open source project, and we welcome contributions via pull requests, as well as questions, feature requests, and bug reports via issues. Please refer to both our code of conduct and our contributing guidelines.
Developer documentation is provided at SCR-dev.ReadTheDocs.io.
SCR uses components from ECP-VeloC, which have their own user and developer documentation.
Numerous people have contributed to the SCR project.
To reference SCR in a publication, please cite the following paper:
- Adam Moody, Greg Bronevetsky, Kathryn Mohror, Bronis R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System," LLNL-CONF-427742, Supercomputing 2010, New Orleans, LA, November 2010.
Additional information and research publications can be found here:
http://computation.llnl.gov/projects/scalable-checkpoint-restart-for-mpi