Checkpointing (quasi-Newton solver) (#693)
* added notes and some drafty interface

* added draft of the api for checkpointing

* fixed compilation issues

* integrated AXOM

code compiles; does nothing

* added user options for checkpointing

* more work on load checkpoint EOD

* semi-operation checkpointing

checkpoint interface complete
additional states need to be saved

* removed save checkpoint callback from the interface

* fixed typos in comments

* moved sidre-related code from Algorithm class to a "utils" helper

* switched to refs; some testing of options-based checkpointing

* added sidre copy to/from dense matrices

* instrumentation for saving quasi-Newton internals to sidre

* updated iteration counter to keep track of total number over restarts

* updated doc; replace all #

* added example on how to use checkpoint API

* clean up

* added metadata

* testing and clean up

* update user manual with checkpointing

* updated pdf user manual

* fix ci errors (compilation)

* fix adtl compilation issues

* fixed compil error

* addressed reviews
cnpetra authored Sep 27, 2024
1 parent 5bbe218 commit 22efbe8
Showing 16 changed files with 1,187 additions and 1,049 deletions.
16 changes: 15 additions & 1 deletion CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -27,7 +27,8 @@ option(HIOP_USE_EIGEN "Build with Eigen support" ON)
option(HIOP_USE_MPI "Build with MPI support" ON)
option(HIOP_USE_GPU "Build with support for GPUs - CUDA or HIP libraries" OFF)
option(HIOP_TEST_WITH_BSUB "Use `jsrun` instead of `mpirun` commands when running tests" OFF)
option(HIOP_USE_RAJA "Build with portability abstraction library RAJA" OFF)
option(HIOP_USE_RAJA "Build with portability abstraction library RAJA" OFF)
option(HIOP_USE_AXOM "Build with AXOM to use Sidre for scalable checkpointing" OFF)
option(HIOP_DEEPCHECKS "Extra checks and asserts in the code with a high penalty on performance" OFF)
option(HIOP_WITH_KRON_REDUCTION "Build Kron Reduction code (requires UMFPACK)" OFF)
option(HIOP_DEVELOPER_MODE "Build with extended warnings and options" OFF)
@@ -289,6 +290,19 @@ if(HIOP_USE_RAJA)
message(STATUS "Found umpire pkg-config: ${umpire_CONFIG}")
endif()

if(HIOP_USE_AXOM)
if(HIOP_USE_MPI)
find_package(AXOM CONFIG
PATHS ${AXOM_DIR} ${AXOM_DIR}/lib/cmake/
REQUIRED)
target_link_libraries(hiop_tpl INTERFACE axom)
message(STATUS "Found AXOM pkg-config: ${AXOM_CONFIG}")
else()
message(FATAL_ERROR "Error: HIOP_USE_MPI is required when HIOP_USE_AXOM is ON")
endif()
endif()


if(HIOP_WITH_KRON_REDUCTION)
set(HIOP_UMFPACK_DIR CACHE PATH "Path to UMFPACK directory")
include(FindUMFPACK)
Binary file modified doc/hiop_usermanual.pdf
Binary file not shown.
10 changes: 10 additions & 0 deletions doc/src/sections/solver_options.tex
@@ -423,6 +423,16 @@ \subsubsection{Problem preprocessing}
\medskip
\subsubsection{Checkpointing of the solver state and restarting}\label{sec:checkpoint}
As detailed in Section~\ref{sec:checkpoint_API}, \Hi can save/load its internal state to/from disk. All the options in this section require an Axom-enabled build (use ``-DHIOP\_USE\_AXOM=ON'' with cmake) and are supported only by the quasi-Newton IPM solver (\texttt{hiopAlgFilterIPMQuasiNewton} class) for the \texttt{hiopInterfaceDenseConstraints} NLP formulation/interface.
\noindent \textbf{checkpoint\_save}: Save the state of the NLP solver to the file indicated by the value of option ``checkpoint\_file''. String values ``yes'' or ``no'', default ``no''.
\noindent \textbf{checkpoint\_load\_on\_start}: On (re)start, the NLP solver will load the checkpoint file specified by the ``checkpoint\_file'' option. String values ``yes'' or ``no'', default ``no''.
\noindent \textbf{checkpoint\_file}: Path to the checkpoint file to load from or save to. If present, the character ``\#'' is replaced with the iteration number at which the checkpoint is saved (the replacement is \textit{not} performed when loading). \Hi adds a ``.root'' extension internally if the value of the option is a directory. If this option is not specified and loading or saving checkpoints is enabled, \Hi will use a file named ``hiop\_state\_chk''.
\noindent \textbf{checkpoint\_save\_every\_N\_iter}: Iteration frequency of saving checkpoints to disk if ``checkpoint\_save'' is ``yes''. Takes positive integer values, with a default value of $10$.
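As a rough illustration of the naming and frequency rules above, the following C++ sketch mimics how a checkpoint path and a save schedule could be derived from ``checkpoint\_file'' and ``checkpoint\_save\_every\_N\_iter''. The helpers \texttt{expand\_checkpoint\_path} and \texttt{checkpoint\_due} are hypothetical names introduced only for illustration and are not part of \Hi's API; replacing every occurrence of ``\#'' and skipping iteration $0$ are assumptions, not documented behavior.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of HiOp): replaces each '#' in the value of
// the "checkpoint_file" option with the iteration number, as the option's
// description states for saving. Replacing *every* '#' is an assumption.
std::string expand_checkpoint_path(const std::string& pattern, int iter)
{
  std::string out;
  for(char c : pattern) {
    if(c == '#') {
      out += std::to_string(iter);
    } else {
      out += c;
    }
  }
  return out;
}

// Hypothetical helper: tells whether a checkpoint is due at iteration `iter`
// for a save frequency `every_n` ("checkpoint_save_every_N_iter").
// Assumption: iteration 0 is never checkpointed.
bool checkpoint_due(int iter, int every_n)
{
  assert(every_n > 0);  // the option only takes positive integer values
  return iter > 0 && iter % every_n == 0;
}
```

Under these assumptions, a value of ``hiop\_state\_\#'' combined with the default frequency of $10$ would produce names such as ``hiop\_state\_10'' and ``hiop\_state\_20'', before any extension \Hi may append.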
\subsubsection{Miscellaneous options}
27 changes: 25 additions & 2 deletions doc/src/techrep_main.tex
@@ -133,7 +133,7 @@
\vspace{3cm}

{\huge\bfseries \Hi\ -- User Guide} \\[14pt]
{\large\bfseries version 1.03}
{\large\bfseries version 1.1.0}

\vspace{3cm}

@@ -155,7 +155,7 @@
\vspace{4.75cm}

\textcolor{violet}{{\large\bfseries Oct 15, 2017} \\
{\large\bfseries Updated Feb 5, 2024}}
{\large\bfseries Updated Sept 22, 2024}}

\vspace{0.75cm}

@@ -474,6 +474,29 @@ \subsubsection{Calling \Hi for a \texttt{hiopInterfaceDenseConstraints} formulat
\end{lstlisting}
The standalone drivers \texttt{NlpDenseConsEx1}, \texttt{NlpDenseConsEx2}, and \texttt{NlpDenseConsEx3} inside directory \texttt{src/Drivers/} under the \Hi's root directory contain more detailed examples of the use of \Hi.

\subsubsection{Checkpointing}\label{sec:checkpoint_API}
File checkpointing is available for \Hi's quasi-Newton IPM solver, which is used exclusively to solve the \texttt{hiopInterfaceDenseConstraints} formulation. This can be helpful when running a job on
a cluster that enforces limits on the job’s running time.
Later, this feature will also be provided for other solvers, such as the Newton IPM (used exclusively with sparse NLP) and HiOp-PriDec.

The checkpointing I/O is based on Axom's scalable Sidre data manager (see \url{https://axom.readthedocs.io/en/develop/axom/sidre/docs/sphinx/index.html} for more information) and, thus, requires an Axom-enabled build (use ``-DHIOP\_USE\_AXOM=ON'' with cmake).

There are two ways to use \Hi's checkpointing. The first is via the quasi-Newton solver's API, namely, the methods
\begin{lstlisting}
void load_state_from_sidre_group(const ::axom::sidre::Group& group);
void save_state_to_sidre_group(::axom::sidre::Group& group);
\end{lstlisting}
of the \texttt{hiopAlgFilterIPMQuasiNewton} solver class. Sidre views are created (or reused) within the group passed as argument in order to load or save the state variables of the quasi-Newton solver. Alternatively, the \texttt{hiopAlgFilterIPMQuasiNewton} solver class offers similar methods that work directly with a file, namely,
\begin{lstlisting}
bool load_state_from_file(const ::std::string& path) noexcept;
bool save_state_to_file(const ::std::string& path) noexcept;
\end{lstlisting}
These two methods will create the Sidre group internally and checkpoint to/from it using the first two methods.

A second avenue to checkpoint is via user options. This is detailed in Section~\ref{sec:checkpoint}.

\warningcp{Note:} A few particularities stemming from the use of Sidre must be acknowledged. First, a checkpoint file should be loaded using HiOp with the same number of MPI ranks as when it was saved. Second, checkpointing is not available for non-MPI builds because Axom has MPI as a dependency. Finally, when loading from or saving to a checkpoint file, the sizes of the file's variables (Sidre views) must match the sizes of the HiOp variables to which the data is loaded or saved. This means \Hi will throw an exception if an existing file is (re)used to load or save an algorithm state for a problem whose sizes changed since the file was created.
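The size requirement above can be pictured with a small, self-contained C++ sketch. It uses neither Sidre nor \Hi; \texttt{SavedArray}, \texttt{load\_array}, and \texttt{try\_load\_array} are hypothetical names that only mirror the two API flavors described in this section: a call that throws on mismatch (as with the Sidre-group methods) and a \texttt{noexcept} call that reports failure through its \texttt{bool} return value (as with the file-based methods).

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for one checkpointed array (roughly what a Sidre
// view would hold); not an Axom or HiOp type.
struct SavedArray
{
  std::string name;
  std::vector<double> data;
};

// Copies a saved array into a solver-side buffer of fixed size.
// Throws std::runtime_error when the sizes differ, e.g., because the
// problem dimensions changed after the checkpoint was written.
void load_array(const SavedArray& saved, std::vector<double>& dest)
{
  if(saved.data.size() != dest.size()) {
    throw std::runtime_error("size mismatch for '" + saved.name + "': checkpoint has " +
                             std::to_string(saved.data.size()) + " entries, solver expects " +
                             std::to_string(dest.size()));
  }
  std::copy(saved.data.begin(), saved.data.end(), dest.begin());
}

// Non-throwing wrapper: returns true on success and false on any failure,
// leaving `dest` untouched when the sizes do not match.
bool try_load_array(const SavedArray& saved, std::vector<double>& dest) noexcept
{
  try {
    load_array(saved, dest);
    return true;
  } catch(const std::exception&) {
    return false;
  }
}
```

A caller that prefers to stop on a stale checkpoint would use the throwing variant; one that prefers to fall back to a cold start would test the return value of the non-throwing variant.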

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% NLP Sparse
61 changes: 59 additions & 2 deletions src/Drivers/Dense/NlpDenseConsEx1.cpp
@@ -4,6 +4,14 @@
#include <cstdio>
#include <cassert>

#ifdef HIOP_USE_AXOM
#include <axom/sidre/core/DataStore.hpp>
#include <axom/sidre/core/Group.hpp>
#include <axom/sidre/core/View.hpp>
#include <axom/sidre/spio/IOManager.hpp>
using namespace axom;
#endif

using namespace hiop;

Ex1Meshing1D::Ex1Meshing1D(double a, double b, size_type glob_n, double r, MPI_Comm comm_)
@@ -178,10 +186,59 @@ void DiscretizedFunction::setFunctionValue(index_type i_global, const double& va
this->data_[i_local]=value;
}



/* DenseConsEx1 class implementation */

bool DenseConsEx1::iterate_callback(int iter,
double obj_value,
double logbar_obj_value,
int n,
const double* x,
const double* z_L,
const double* z_U,
int m_ineq,
const double* s,
int m,
const double* g,
const double* lambda,
double inf_pr,
double inf_du,
double onenorm_pr,
double mu,
double alpha_du,
double alpha_pr,
int ls_trials)
{
#ifdef HIOP_USE_AXOM
//save state to sidre::Group every 5 iterations if a solver/algorithm object was provided
if(iter > 0 && (iter % 5 == 0) && nullptr!=solver_) {
//
//Example of how to save HiOp state to axom::sidre::Group
//

//We first manufacture a Group. User code supposedly already has one.
sidre::DataStore ds;
sidre::Group* group = ds.getRoot()->createGroup("HiOp quasi-Newton alg state");

//the actual saving of state to group
try {
solver_->save_state_to_sidre_group(*group);
} catch(const std::runtime_error& e) {
//user chooses action when an error occurred in saving the state...
//we choose to stop HiOp
return false;
}

//User code can further inspect the Group or add additional info to the DataStore, with the end goal
//of saving it to file before HiOp starts the next iteration. Here we just save it.
sidre::IOManager writer(comm);
int n_files;
MPI_Comm_size(comm, &n_files);
writer.write(ds.getRoot(), n_files, "hiop_state_ex1", sidre::Group::getDefaultIOProtocol());
}
#endif
return true;
}

/*set c to
* c(t) = 1-10*t, for 0<=t<=1/10,
* 0, for 1/10<=t<=1.
35 changes: 33 additions & 2 deletions src/Drivers/Dense/NlpDenseConsEx1.hpp
@@ -78,7 +78,7 @@ class Ex1Meshing1D
MPI_Comm comm;
int my_rank, comm_size;
index_type* col_partition;

friend class DiscretizedFunction;

private:
@@ -112,7 +112,9 @@ class DenseConsEx1 : public hiop::hiopInterfaceDenseConstraints
{
public:
DenseConsEx1(int n_mesh_elem=100, double mesh_ratio=1.0)
: n_vars(n_mesh_elem), comm(MPI_COMM_WORLD)
: n_vars(n_mesh_elem),
comm(MPI_COMM_WORLD),
solver_(nullptr)
{
//create the members
_mesh = new Ex1Meshing1D(0.0,1.0, n_vars, mesh_ratio, comm);
@@ -218,6 +220,31 @@ class DenseConsEx1 : public hiop::hiopInterfaceDenseConstraints
}
return true;
}

inline void set_solver(hiop::hiopAlgFilterIPM* alg_obj)
{
solver_ = alg_obj;
}

bool iterate_callback(int iter,
double obj_value,
double logbar_obj_value,
int n,
const double* x,
const double* z_L,
const double* z_U,
int m_ineq,
const double* s,
int m,
const double* g,
const double* lambda,
double inf_pr,
double inf_du,
double onenorm_pr,
double mu,
double alpha_du,
double alpha_pr,
int ls_trials);
private:
int n_vars;
MPI_Comm comm;
@@ -228,6 +255,10 @@ class DenseConsEx1 : public hiop::hiopInterfaceDenseConstraints
DiscretizedFunction* c;
DiscretizedFunction* x; //proxy for taking hiop's variable in and working with it as a function

/// Pointer to the solver, to be used to checkpoint
hiop::hiopAlgFilterIPM* solver_;

private:
//populates the linear term c
void set_c();
};
109 changes: 95 additions & 14 deletions src/Drivers/Dense/NlpDenseConsEx1Driver.cpp
@@ -7,10 +7,23 @@
#include <cstdlib>
#include <string>

#ifdef HIOP_USE_AXOM
#include <axom/sidre/core/DataStore.hpp>
#include <axom/sidre/core/Group.hpp>
#include <axom/sidre/core/View.hpp>
#include <axom/sidre/spio/IOManager.hpp>
using namespace axom;
#endif


using namespace hiop;

static bool self_check(size_type n, double obj_value);

#ifdef HIOP_USE_AXOM
static bool do_load_checkpoint_test(const size_type& mesh_size,
const double& ratio,
const double& obj_val_expected);
#endif
static bool parse_arguments(int argc, char **argv, size_type& n, double& distortion_ratio, bool& self_check)
{
n = 20000; distortion_ratio=1.; self_check=false; //default options
@@ -67,24 +80,27 @@ int main(int argc, char **argv)
err = MPI_Init(&argc, &argv); assert(MPI_SUCCESS==err);
err = MPI_Comm_rank(MPI_COMM_WORLD,&rank); assert(MPI_SUCCESS==err);
err = MPI_Comm_size(MPI_COMM_WORLD,&numRanks); assert(MPI_SUCCESS==err);
if(0==rank) printf("Support for MPI is enabled\n");
if(0==rank) {
printf("Support for MPI is enabled\n");
}
#endif
bool selfCheck; size_type mesh_size; double ratio;
if(!parse_arguments(argc, argv, mesh_size, ratio, selfCheck)) { usage(argv[0]); return 1;}

bool selfCheck;
size_type mesh_size;
double ratio;
double objective = 0.;
if(!parse_arguments(argc, argv, mesh_size, ratio, selfCheck)) {
usage(argv[0]);
return 1;
}

DenseConsEx1 problem(mesh_size, ratio);
//if(rank==0) printf("interface created\n");
hiop::hiopNlpDenseConstraints nlp(problem);
//if(rank==0) printf("nlp formulation created\n");

//nlp.options->SetIntegerValue("verbosity_level", 4);
//nlp.options->SetNumericValue("tolerance", 1e-4);
//nlp.options->SetStringValue("duals_init", "zero");
//nlp.options->SetIntegerValue("max_iter", 2);

hiop::hiopAlgFilterIPM solver(&nlp);
problem.set_solver(&solver);

hiop::hiopSolveStatus status = solver.run();
double objective = solver.getObjective();
objective = solver.getObjective();

//this is used for testing when the driver is called with -selfcheck
if(selfCheck) {
@@ -97,7 +113,19 @@ int main(int argc, char **argv)
}
}

if(0==rank) printf("Objective: %18.12e\n", objective);
if(0==rank) {
printf("Objective: %18.12e\n", objective);
}

#ifdef HIOP_USE_AXOM
// example/test for HiOp's load checkpoint API.
if(!do_load_checkpoint_test(mesh_size, ratio, objective)) {
if(rank==0) {
printf("Load checkpoint and restart test failed.\n");
}
return -1;
}
#endif
#ifdef HIOP_USE_MPI
MPI_Finalize();
#endif
@@ -134,3 +162,56 @@ static bool self_check(size_type n, double objval)

return true;
}

#ifdef HIOP_USE_AXOM
/**
 * An illustration of how to use the load_state_from_sidre_group API method of HiOp's algorithm class.
 */
static bool do_load_checkpoint_test(const size_type& mesh_size,
const double& ratio,
const double& obj_val_expected)
{
//Pretend this is a new job and recreate the HiOp objects.
DenseConsEx1 problem(mesh_size, ratio);
hiop::hiopNlpDenseConstraints nlp(problem);

hiop::hiopAlgFilterIPM solver(&nlp);

//
// example of how to use load_state_from_sidre_group to warm-start
//

//Supposedly, the user code should have the group in hand before asking HiOp to load from it.
//We will manufacture it by loading a sidre checkpoint file. Here the checkpoint file
//"hiop_state_ex1.root" was created from the interface class' iterate_callback method
//(saved every 5 iterations)
sidre::DataStore ds;

try {
sidre::IOManager reader(MPI_COMM_WORLD);
reader.read(ds.getRoot(), "hiop_state_ex1.root", false);
} catch(const std::exception& e) {
printf("Failed to read checkpoint file. Error: [%s]\n", e.what());
return false;
}


//the actual API call
try {
const sidre::Group* group = ds.getRoot()->getGroup("HiOp quasi-Newton alg state");
solver.load_state_from_sidre_group(*group);
} catch(const std::runtime_error& e) {
printf("Failed to load from sidre::group. Error: [%s]\n", e.what());
return false;
}

hiop::hiopSolveStatus status = solver.run();
double obj_val = solver.getObjective();
if(obj_val != obj_val_expected) {
return false;
}
return true;
}
#endif // HIOP_USE_AXOM
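
The restart test above compares the two objective values with `!=`. The commit does not say whether a restarted run reproduces the objective bit for bit, so a tolerance-based comparison may be a more forgiving alternative. A minimal sketch, assuming a relative tolerance is acceptable for the test; `nearly_equal` and its default tolerance are hypothetical and not part of HiOp:

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper (not part of HiOp): true when a and b agree to within
// a relative tolerance, with the scale floored at 1 so that values close to
// zero are effectively compared in absolute terms.
bool nearly_equal(double a, double b, double rel_tol = 1e-8)
{
  const double scale = std::max(1.0, std::max(std::fabs(a), std::fabs(b)));
  return std::fabs(a - b) <= rel_tol * scale;
}
```

With such a helper, the final check in the test could read `if(!nearly_equal(obj_val, obj_val_expected)) { return false; }`; the appropriate tolerance depends on the solver's convergence tolerance and is left to the user.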
6 changes: 3 additions & 3 deletions src/Interface/hiopInterface.hpp
@@ -467,8 +467,8 @@ class hiopInterfaceBase
}

/**
* This method is used to provide an user all the hiop iterate
* procedure. @see solution_callback() for an explanation of the parameters.
* This method is used to provide the user with all the internal HiOp iterates. @see solution_callback()
* for an explanation of the parameters.
*
* @param[in] x array of (local) entries of the primal variables (managed by Umpire, see note below)
* @param[in] z_L array of (local) entries of the dual variables for lower bounds (managed by Umpire, see note below)
@@ -496,7 +496,7 @@ class hiopInterfaceBase
{
return true;
}

/**
* A wildcard function used to change the primal variables.
*