The Scalable Checkpoint / Restart (SCR) library enables MPI applications to utilize distributed storage on Linux clusters to attain high file I/O bandwidth for checkpointing, restarting, and output in large-scale jobs. With SCR, jobs run more efficiently, recompute less work upon a failure, and reduce load on critical shared resources such as the parallel file system.
Instructions to build and use SCR are hosted at SCR.ReadTheDocs.io.
For new users, the Quick Start guide shows how to build and run an example using SCR.
SCR is an open source project. We welcome contributions via pull requests, as well as questions, feature requests, and bug reports via issues. Please refer to both our code of conduct and our contributing guidelines.
Developer documentation is provided at SCR-dev.ReadTheDocs.io.
SCR uses components from ECP-VeloC, which have their own user and developer docs.
For a development build of SCR and its dependencies on SLURM systems, one can use the bootstrap.sh script:
git clone https://github.com/LLNL/scr.git
cd scr
./bootstrap.sh --dev --debug
cd build
cmake -DCMAKE_INSTALL_PREFIX=../install ..
make install
One can then run a test program:
cd examples
srun -n4 -N4 ./test_api
For developers installing SCR outside of an HPC cluster on Fedora with sudo access, the following steps install and activate most of the necessary base dependencies:
sudo dnf groupinstall "Development Tools"
sudo dnf install cmake gcc-c++ mpi mpi-devel environment-modules zlib-devel pdsh
[restart shell]
module load mpi
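After the steps above, it can help to confirm that the build tools are actually visible on the PATH before running bootstrap.sh. The following is a minimal sketch, not part of SCR: the helper name missing_tools is hypothetical, and the tool names (cmake, g++, mpicc, pdsh) are assumptions about what the packages above provide on a given system.

```shell
# Hypothetical helper (not part of SCR): print each named command
# that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Assumed tool names; adjust to match the packages installed above.
missing=$(missing_tools cmake g++ mpicc pdsh)
if [ -n "$missing" ]; then
    # $missing is left unquoted so the names print on one line.
    echo "Missing tools:" $missing
else
    echo "All listed tools were found on PATH."
fi
```

If mpicc is reported as missing, the MPI module may not be loaded in the current shell; rerunning module load mpi in a fresh shell is the first thing to try.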
Numerous people have contributed to the SCR project.
To reference SCR in a publication, please cite the following paper:
- Adam Moody, Greg Bronevetsky, Kathryn Mohror, Bronis R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System," LLNL-CONF-427742, Supercomputing 2010, New Orleans, LA, November 2010.
Additional information and research publications can be found here:
http://computation.llnl.gov/projects/scalable-checkpoint-restart-for-mpi