Checkpoint Feature
Overview
The checkpoint feature allows users to transparently save the state of containerized applications while they are executing and then restart them from the saved state.
This is valuable for use cases that need to minimize the impact of ephemeral environment failures, random hardware failures or job premption.
Note
This feature is marked as experimental to allow flexibility as community feedback may warrant breaking changes to improve the overall usability for this feature set as it matures.
In order to enable checkpointing as a feature, the DMTCP project has been chosen as the basis for an integration. Apptainer injects an installation of DMTCP from the host into the container at container creation and uses it to wrap the launching of the container process. When launching a process, DMTCP will monitor the spawned application process and its children in order to fully checkpoint the state of the application when triggered.
For interested readers, more detailed information about the technology can be found in the publications listed on the DMTCP website.
Requirements
Due to the architecture of DMTCP, there are a couple requirements that must be met to allow this integration to work correctly:
Applications that are the intended to be checkpointed must be dynamically linked.
dmtcp uses
dladdr1()which is a GNU extension that requiresglibcorglibccompatibility inside the container.
In addition to these requirements, the Apptainer integration with DMTCP requires that either:
The binaries and libraries required by DMTCP are within the user
PATHand theldconfigcache respectively.The
dmtcp-conf.yamlconfiguration file has been modified to contain the absolute paths of the listed binaries and libraries.
At time of release, the list of libraries and binaries in this configuration is appropriate for the latest master branch of DMTCP. It can be modified by the system administrator to add additional libraries if necessary. See the admin guide for more details.
DMTCP Installation
Installation instructions for DMTCP can be found on their git repository.
In order to maximize the portability of the DMTCP build, we recommend using the
--enable-static-libstdcxx configure option.
A build and install of DMTCP from source will generally place libraries at
/usr/local/lib/dmtcp, which is not a standard path for the linker. You can
configure your linker to find these libraries and update its cache with the
following commands:
# echo /usr/local/lib/dmtcp > /etc/ld.so.conf.d/dmtcp.conf
# ldconfig
Verify that the libraries are in the cache by ensuring there is output from the following command:
$ ldconfig -p | grep dmtcp
The DMTCP project is currently working towards a 3.0 release. If you
encounter issues using a 2.x installation of DMTCP, we recommend trying the
tip of the master branch to see if the issue has been resolved in the code
that will become the next release.
Example
The following container (referred to as server.sif below) contains a simple
http server that allows a variable to be read and written with GET and
POST requests respectively. This is not typical of applications packaged by
the Apptainer community, but gives us an easy way to check that application
state has been successfully restored upon restart.
Bootstrap: docker
From: python:3.10-bookworm
%post
mkdir /app
cat > /app/server.py <<EOF
import socketserver
import argparse
count = 0
parser = argparse.ArgumentParser(description='Optional app description')
parser.add_argument('port', type=int, help='A required integer port argument')
args = parser.parse_args()
class Handler(socketserver.BaseRequestHandler):
def handle(self):
global count
count += 1
response = bytes("request:{}\n".format(count), "ascii")
self.request.sendall(response)
if __name__ == "__main__":
with socketserver.TCPServer(('', args.port), Handler) as server:
server.allow_reuse_address = True
server.serve_forever()
EOF
%startscript
python3 /app/server.py $@
We can build this container using:
$ apptainer build server.sif server.def
First things first, we need to create a checkpoint using the checkpoint
command group. This initializes a location in your user home directory for
Apptainer to store state related to DMTCP and the checkpoint images it will
generate.
$ apptainer checkpoint create example-checkpoint
INFO: Checkpoint "example-checkpoint" created.
Now we can start an instance of our application with the --dmtcp-launch flag
naming the checkpoint we want to use to store state for this instance.
$ apptainer instance start --dmtcp-launch example-checkpoint server.sif server 8888 # this last arg is the port the server will listen to.
INFO: instance started successfully
Once we have our application up and running, we can curl against it and read
the state of a variable on the server.
$ curl --http0.9 localhost:8888
request:1
We can see that the request count value is 1 when this application is started
and accessed via curl. After making another call to the application, we can see that the request
count is 2 as expected.
$ curl --http0.9 localhost:8888
request:2
Now that the request count variable on our server is in a new state, 2, we can use the
checkpoint instance command and reference the instance via the
instance:// URI format:
$ apptainer checkpoint instance server
INFO: Using checkpoint "example-checkpoint"
Now that we have checkpointed the state of our application, we can safely stop the instance:
$ apptainer instance stop server
INFO: Stopping server instance of /home/ian/server.sif (PID=209072)
We can restart our server and restore its state by starting a new instance using
the --dmtcp-restart flag and specifying the checkpoint to be used to restore
our application’s state:
$ apptainer instance start --dmtcp-restart example-checkpoint server.sif restarted-server 8888
INFO: instance started successfully
And now when we get access to the application again, the request count value is 3 as expected,
meaning that the previous request count value was 2.
$ curl --http0.9 localhost:8888
$ request:3
We can repeat the previous two steps, i.e. stop the server instance and restart it via dmtcp to verify the restoration of the value of the request count.
$ apptainer instance stop server
$ apptainer instance start --dmtcp-restart example-checkpoint server.sif restarted-server 8888
Then access the application and see that the request count value is restored as expected.
$ curl --http0.9 localhost:8888
$ request:3
Finally, we can stop our instance running our restored application and delete our checkpoint if we no longer need it to restart our application from this state:
$ apptainer instance stop restarted-server
INFO: Stopping restarted-server instance of /home/ian/server.sif (PID=247679)
$ apptainer checkpoint delete example-checkpoint
INFO: Checkpoint "example-checkpoint" deleted.
Self-Contained Checkpointing in a Container
The built-in checkpoint feature described above requires a DMTCP
installation on your host, glibc compatibility between host and container, and uses the
apptainer instance workflow with --dmtcp-launch / --dmtcp-restart
flags. An alternative approach is to install DMTCP directly inside the
container so that the checkpointing software and application is fully containerized.
This is useful for the following scenarios:
The host does not have DMTCP installed or configured.
Portability across different hosts or clusters is needed.
Automatic periodic checkpointing is preferred over manual
checkpoint instancecommands.The simplicity of the
apptainer runworkflow is preferred over instances.
Note
This approach bypasses the built-in apptainer checkpoint command group
entirely. Checkpointing and restarting are handled by the DMTCP installation
inside the container, driven by the container’s %runscript.
Building the Container
The definition file below installs DMTCP from source and includes a small
Python counter application as an example workload. The key piece is the
%runscript, which automatically detects whether checkpoint files exist and
either restarts from them or launches the application fresh under DMTCP.
Bootstrap: docker
From: rockylinux/rockylinux:10
%post
dnf groupinstall -y "Development Tools"
dnf install -y libatomic which
git clone --branch=4.1.0 https://github.com/dmtcp/dmtcp.git
cd dmtcp
./configure
make
make install
# --- Replace this section with your own application ---
mkdir /app
cat > /app/server.py <<EOF
import time
counter = 0
while True:
print(counter, flush=True)
counter += 1
time.sleep(1)
EOF
%runscript
CKPT_DIR="${DMTCP_CHECKPOINT_DIR:-.}"
RESTART_SCRIPT="$CKPT_DIR/dmtcp_restart_script.sh"
if [ -f "$RESTART_SCRIPT" ]; then
exec "$RESTART_SCRIPT"
else
exec dmtcp_launch \
--interval "${DMTCP_CHECKPOINT_INTERVAL:-10}" \
--coord-port 0 \
--ckpt-open-files \
python3 /app/server.py "$@"
fi
The %runscript logic works as follows:
DMTCP_CHECKPOINT_DIR— if set, DMTCP writes checkpoint files to that directory; otherwise it defaults to the current working directory (.).On startup the script checks for
dmtcp_restart_script.shin the checkpoint directory. If present, the application is restarted from the saved state. If absent, a fresh launch is performed underdmtcp_launch.
The dmtcp_launch flags used here are:
--interval— seconds between automatic checkpoints. Controlled at runtime via theDMTCP_CHECKPOINT_INTERVALenvironment variable (default: 10 seconds).--coord-port 0— tells DMTCP to pick an available port for its coordinator, avoiding conflicts with other processes.--ckpt-open-files— includes open file descriptors in the checkpoint image so that files in use are properly saved and restored.
To adapt this for your own application, replace the %post application
installation section and the python3 /app/server.py "$@" command in the
%runscript with your own application and launch command.
Build the container with:
$ apptainer build myapp.sif myapp.def
Running and Checkpointing
By default, DMTCP writes checkpoint files to the current working directory. It is recommended to create a dedicated directory so checkpoint data is easy to find and manage.
$ mkdir ckpt && cd ckpt
$ apptainer run ../myapp.sif
0
1
2
...
Checkpoints are created automatically at the interval configured in the definition file (every 10 seconds by default). You can override this at runtime:
$ DMTCP_CHECKPOINT_INTERVAL=30 apptainer run ../myapp.sif
After a checkpoint interval elapses, interrupt the program with control-C and verify that checkpoint files have been written:
$ ls
ckpt_python3_*.dmtcp dmtcp_restart_script.sh ...
Restarting from a Checkpoint
To restart the application from its checkpointed state, run the container again from the same directory that contains the checkpoint files:
$ cd ckpt
$ apptainer run ../myapp.sif
10
11
12
...
The %runscript detects dmtcp_restart_script.sh and automatically
restarts the application from where it left off — the counter resumes from its
checkpointed value. Subsequent checkpoints continue to update the files in the
same directory.
Starting Fresh
To discard a previous checkpoint and start the application from scratch, either remove the checkpoint files:
$ rm -f ckpt_*.dmtcp dmtcp_restart_script.sh
Or simply run the container from a different, empty directory:
$ mkdir ckpt-new && cd ckpt-new
$ apptainer run ../myapp.sif
0
1
2
...
Using Exec and Shell
The examples above all use apptainer run, which invokes the
%runscript and its automatic launch/restart logic. You can also use
apptainer exec and apptainer shell to work with DMTCP directly inside
the container.
Launching with exec
Use apptainer exec to call dmtcp_launch directly, bypassing the
%runscript. This is useful when you want to pass different flags or
launch a different command than what the %runscript provides:
$ mkdir ckpt && cd ckpt
$ apptainer exec ../myapp.sif dmtcp_launch \
--interval 60 --coord-port 0 --ckpt-open-files \
python3 /app/server.py
Restarting with exec
Similarly, you can restart from a checkpoint by executing the restart script directly:
$ cd ckpt
$ apptainer exec ../myapp.sif ./dmtcp_restart_script.sh
10
11
12
...
Interactive debugging with shell
Use apptainer shell to open an interactive session inside the container.
This is helpful for inspecting checkpoint files, testing DMTCP commands, or
debugging:
$ cd ckpt
$ apptainer shell ../myapp.sif
> ls ckpt_*.dmtcp
ckpt_python3_*.dmtcp
> dmtcp_restart_script.sh --help
...