Checkpoint Feature
Overview
The checkpoint feature allows users to transparently save the state of containerized applications while they are executing and then restart them from the saved state.
This is valuable for use cases that need to minimize the impact of ephemeral environment failures, random hardware failures or job premption.
Note
This feature is marked as experimental to allow flexibility as community feedback may warrant breaking changes to improve the overall usability for this feature set as it matures.
In order to enable checkpointing as a feature, the DMTCP project has been chosen as the basis for an integration. Apptainer injects an installation of DMTCP from the host into the container at container creation and uses it to wrap the launching of the container process. When launching a process, DMTCP will monitor the spawned application process and its children in order to fully checkpoint the state of the application when triggered.
For interested readers, more detailed information about the technology can be found in the publications listed on the DMTCP website.
Requirements
Due to the architecture of DMTCP, there are a couple requirements that must be met to allow this integration to work correctly:
Applications that are the intended to be checkpointed must be dynamically linked.
dmtcp uses
dladdr1()
which is a GNU extension that requiresglibc
orglibc
compatibility inside the container.
In addition to these requirements, the Apptainer integration with DMTCP requires that either:
The binaries and libraries required by DMTCP are within the user
PATH
and theldconfig
cache respectively.The
dmtcp-conf.yaml
configuration file has been modified to contain the absolute paths of the listed binaries and libraries.
At time of release, the list of libraries and binaries in this configuration is appropriate for the latest master branch of DMTCP. It can be modified by the system administrator to add additional libraries if necessary. See the admin guide for more details.
DMTCP Installation
Installation instructions for DMTCP can be found on their git repository.
In order to maximize the portability of the DMTCP build, we recommend using the
--enable-static-libstdcxx
configure option.
A build and install of DMTCP from source will generally place libraries at
/usr/local/lib/dmtcp
, which is not a standard path for the linker. You can
configure your linker to find these libraries and update its cache with the
following commands:
# echo /usr/local/lib/dmtcp > /etc/ld.so.conf.d/dmtcp.conf
# ldconfig
Verify that the libraries are in the cache by ensuring there is output from the following command:
$ ldconfig -p | grep dmtcp
The DMTCP project is currently working towards a 3.0
release. If you
encounter issues using a 2.x
installation of DMTCP, we recommend trying the
tip of the master
branch to see if the issue has been resolved in the code
that will become the next release.
Example
The following container (referred to as server.sif
below) contains a simple
http server that allows a variable to be read and written with GET
and
POST
requests respectively. This is not typical of applications packaged by
the Apptainer community, but gives us an easy way to check that application
state has been successfully restored upon restart.
Bootstrap: docker
From: python:3.10-buster
%post
mkdir /app
cat > /app/server.py <<EOF
import argparse
from http.server import BaseHTTPRequestHandler, HTTPServer
state = "0"
parser = argparse.ArgumentParser(description='Optional app description')
parser.add_argument('port', type=int, help='A required integer port argument')
args = parser.parse_args()
class handler(BaseHTTPRequestHandler):
def do_GET(self):
self.send_response(200)
self.send_header('Content-type','text/plain')
self.end_headers()
self.wfile.write(bytes(state, "utf8"))
def do_POST(self):
self.send_response(200)
self.send_header('Content-type','text/plain')
self.end_headers()
global state
state = self.rfile.read(1).decode("utf8")
self.wfile.write(bytes(state, "utf8"))
with HTTPServer(('', args.port), handler) as server:
server.serve_forever()
EOF
%startscript
python3 /app/server.py $@
We can build this container using:
$ apptainer build server.sif server.def
First things first, we need to create a checkpoint
using the checkpoint
command group. This initializes a location in your user home directory for
Apptainer to store state related to DMTCP and the checkpoint images it will
generate.
$ apptainer checkpoint create example-checkpoint
INFO: Checkpoint "example-checkpoint" created.
Now we can start an instance of our application with the --dmtcp-launch
flag
naming the checkpoint
we want to use to store state for this instance.
$ apptainer instance start --dmtcp-launch example-checkpoint server.sif server 8888 # this last arg is the port the server will listen to.
INFO: instance started successfully
Once we have our application up and running, we can curl
against it and read
the state of a variable on the server.
$ curl localhost:8888; echo
0
We can see that it is set to 0
by default when this application is started
normally. We can now update the state of the server from 0
to 1
with
the following POST
request:
$ curl -X POST localhost:8888 -d '1'; echo
1
$ curl localhost:8888; echo
1
Now that variable on our server is in a new state, 1
, we can use the
checkpoint instance
command and reference the instance via the
instance://
URI format:
$ apptainer checkpoint instance server
INFO: Using checkpoint "example-checkpoint"
Now that we have checkpointed the state of our application, we can safely stop the instance:
$ apptainer instance stop server
INFO: Stopping server instance of /home/ian/server.sif (PID=209072)
We can restart our server and restore its state by starting a new instance using
the --dmtcp-restart
flag and specifying the checkpoint to be used to restore
our application’s state:
$ apptainer instance start --dmtcp-restart example-checkpoint server.sif restarted-server 8888
INFO: instance started successfully
And now we can verify the variable on the server has been properly restored to
a value of 1
, instead of the default of 0
:
$ curl localhost:8888; echo
1
Finally, we can stop our instance running our restored application and delete our checkpoint if we no longer need it to restart our application from this state:
$ apptainer instance stop restarted-server
INFO: Stopping restarted-server instance of /home/ian/server.sif (PID=247679)
$ apptainer checkpoint delete example-checkpoint
INFO: Checkpoint "example-checkpoint" deleted.