Apptainer and MPI applications
The Message Passing Interface (MPI) is a standard used extensively by HPC applications to implement communication across the compute nodes of a single system or across compute platforms. There are currently two main open-source implementations of MPI, Open MPI and MPICH, both of which are supported by Apptainer. The goal of this page is to demonstrate the development and running of MPI programs using Apptainer containers.
There are several ways of carrying this out. The most popular way of executing MPI applications installed in an Apptainer container is to rely on the MPI implementation available on the host. This is called the Host MPI or Hybrid model, since both the MPI implementation provided by system administrators on the host and the one installed in the container are used.
Another approach is to use only the MPI implementation available on the host and not include any MPI in the container. This is called the Bind model, since it requires binding/mounting the MPI version available on the host into the container.
Note
The bind model requires users to be able to mount user-specified files from the host into the container. This ability is sometimes disabled by system administrators for operational reasons. If this is the case on your system, please follow the hybrid approach.
Hybrid model
The basic idea behind the Hybrid Approach is that when you execute an Apptainer container with MPI code, you call mpiexec or a similar launcher on the apptainer command itself. The MPI process outside of the container will then work in tandem with MPI inside the container and the containerized MPI code to instantiate the job.
The Open MPI/Apptainer workflow in detail:
- The MPI launcher (e.g., mpirun, mpiexec) is called by the resource manager or the user directly from a shell.
- Open MPI then calls the process management daemon (ORTED).
- The ORTED process launches the Apptainer container requested by the launcher command.
- Apptainer instantiates the container and namespace environment.
- Apptainer then launches the MPI application within the container.
- The MPI application launches and loads the Open MPI libraries.
- The Open MPI libraries connect back to the ORTED process via the Process Management Interface (PMI).
At this point the processes within the container run as they would normally directly on the host.
The advantages of this approach are:
- Integration with resource managers such as Slurm.
- Simplicity, since it is similar to natively running MPI applications.
The drawbacks are:
- The MPI in the container must be compatible with the version of MPI available on the host.
- The MPI implementation in the container must be carefully configured for optimal use of the hardware if performance is critical.
Since the MPI implementation in the container must be compatible with the version available on the host system, a standard approach is to build your own MPI container, including building from source the same MPI framework that is installed on the host.
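Before writing a definition file, it is worth checking which MPI implementation and version the host provides. On most systems the launcher itself will report this (how MPI is made available, e.g. via modules, is site specific):
$ mpirun --version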
Test Application
To illustrate how Apptainer can be used to execute MPI applications, we will assume for a moment that the application is mpitest.c, a simple Hello World:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv) {
    int rc;
    int size;
    int myrank;

    rc = MPI_Init (&argc, &argv);
    if (rc != MPI_SUCCESS) {
        fprintf (stderr, "MPI_Init() failed");
        return EXIT_FAILURE;
    }

    rc = MPI_Comm_size (MPI_COMM_WORLD, &size);
    if (rc != MPI_SUCCESS) {
        fprintf (stderr, "MPI_Comm_size() failed");
        goto exit_with_error;
    }

    rc = MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
    if (rc != MPI_SUCCESS) {
        fprintf (stderr, "MPI_Comm_rank() failed");
        goto exit_with_error;
    }

    fprintf (stdout, "Hello, I am rank %d/%d\n", myrank, size);

    MPI_Finalize();
    return EXIT_SUCCESS;

exit_with_error:
    MPI_Finalize();
    return EXIT_FAILURE;
}
Note
MPI is an interface to a library, so it consists of function calls and libraries that can be used by many programming languages. It comes with standardized bindings for Fortran and C. However, it can support applications in many languages like Python, R, etc.
The next step is to create the definition file used to build the container, which will depend on the MPI implementation available on the host.
MPICH Hybrid Container
If the host MPI is MPICH, a definition file such as the following example can be used:
Bootstrap: docker
From: ubuntu:22.04
%files
    mpitest.c /opt

%environment
    # Point to MPICH binaries, libraries, man pages
    export MPICH_DIR=/opt/mpich
    export PATH="$MPICH_DIR/bin:$PATH"
    export LD_LIBRARY_PATH="$MPICH_DIR/lib:$LD_LIBRARY_PATH"
    export MANPATH=$MPICH_DIR/share/man:$MANPATH

%post
    echo "Installing required packages..."
    export DEBIAN_FRONTEND=noninteractive
    apt-get update && apt-get install -y wget git bash gcc gfortran g++ make python3-dev

    # Information about the version of MPICH to use
    export MPICH_VERSION=4.1.1
    export MPICH_URL="http://www.mpich.org/static/downloads/$MPICH_VERSION/mpich-$MPICH_VERSION.tar.gz"
    export MPICH_DIR=/opt/mpich

    echo "Installing MPICH..."
    mkdir -p /tmp/mpich
    mkdir -p /opt
    # Download
    cd /tmp/mpich && wget -O mpich-$MPICH_VERSION.tar.gz $MPICH_URL && tar xzf mpich-$MPICH_VERSION.tar.gz
    # Compile and install
    cd /tmp/mpich/mpich-$MPICH_VERSION && ./configure --prefix=$MPICH_DIR && make -j$(nproc) install

    # Set env variables so we can compile our application
    export PATH=$MPICH_DIR/bin:$PATH
    export LD_LIBRARY_PATH=$MPICH_DIR/lib:$LD_LIBRARY_PATH

    echo "Compiling the MPI application..."
    cd /opt && mpicc -o mpitest mpitest.c
Note
The version of MPICH you install in the container must be compatible with the version on the host. It should also be configured to support the same process management mechanism and version, e.g. PMI2 / PMIx, as used on the host.
There are wide variations in MPI configuration across HPC systems. Consult your system documentation, or ask your support staff for details.
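Once the definition file is ready, build the container image. Assuming the definition file above is saved as mpich-hybrid.def (the filename is arbitrary), the hybrid-mpich.sif image used in the examples below can be built with:
$ apptainer build hybrid-mpich.sif mpich-hybrid.def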
Open MPI Hybrid Container
If the host MPI is Open MPI, the definition file looks like:
Bootstrap: docker
From: ubuntu:22.04
%files
    mpitest.c /opt

%environment
    # Point to OMPI binaries, libraries, man pages
    export OMPI_DIR=/opt/ompi
    export PATH="$OMPI_DIR/bin:$PATH"
    export LD_LIBRARY_PATH="$OMPI_DIR/lib:$LD_LIBRARY_PATH"
    export MANPATH="$OMPI_DIR/share/man:$MANPATH"

%post
    echo "Installing required packages..."
    apt-get update && apt-get install -y wget git bash gcc gfortran g++ make file bzip2

    echo "Installing Open MPI"
    export OMPI_DIR=/opt/ompi
    export OMPI_VERSION=4.1.5
    export OMPI_URL="https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-$OMPI_VERSION.tar.bz2"
    mkdir -p /tmp/ompi
    mkdir -p /opt
    # Download
    cd /tmp/ompi && wget -O openmpi-$OMPI_VERSION.tar.bz2 $OMPI_URL && tar -xjf openmpi-$OMPI_VERSION.tar.bz2
    # Compile and install
    cd /tmp/ompi/openmpi-$OMPI_VERSION && ./configure --prefix=$OMPI_DIR && make -j$(nproc) install

    # Set env variables so we can compile our application
    export PATH=$OMPI_DIR/bin:$PATH
    export LD_LIBRARY_PATH=$OMPI_DIR/lib:$LD_LIBRARY_PATH

    echo "Compiling the MPI application..."
    cd /opt && mpicc -o mpitest mpitest.c
Note
The version of Open MPI you install in the container must be compatible with the version on the host. It should also be configured to support the same process management mechanism and version, e.g. PMI2 / PMIx, as used on the host.
There are wide variations in MPI configuration across HPC systems. Consult your system documentation, or ask your support staff for details.
Note
The MPI UCX library has a problem with unprivileged user namespaces. See apptainer issue 769 for details, and for now use the UCX_TLS=sysv,ib transport as a workaround, for example:
mpirun -np 2 -mca pml ucx -x UCX_TLS=sysv,ib apptainer exec $MY_CONTAINER ./a.out
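As with the MPICH example, build the image from the definition file. Assuming it is saved as openmpi-hybrid.def (the filename is arbitrary), the hybrid-openmpi.sif image referenced later on this page can be built with:
$ apptainer build hybrid-openmpi.sif openmpi-hybrid.def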
Running an MPI Application
The standard way to execute MPI applications with hybrid Apptainer containers is to run the native mpirun command from the host, which will start Apptainer containers and ultimately MPI ranks within the containers.
Assuming your container with MPI and your application is already built using the hybrid model, the mpirun command to start your application looks like this:
$ mpirun -n <NUMBER_OF_RANKS> apptainer exec <PATH/TO/MY/IMAGE> </PATH/TO/BINARY/WITHIN/CONTAINER>
In practice, this command first starts mpirun, which then starts Apptainer containers on the compute nodes. Finally, when the containers start, the MPI binary is executed:
$ mpirun -n 8 apptainer run hybrid-mpich.sif /opt/mpitest
Hello, I am rank 3/8
Hello, I am rank 4/8
Hello, I am rank 6/8
Hello, I am rank 2/8
Hello, I am rank 0/8
Hello, I am rank 5/8
Hello, I am rank 1/8
Hello, I am rank 7/8
Bind model
Similar to the Hybrid Approach, the basic idea behind the Bind Approach is to start the MPI application by calling the MPI launcher (e.g., mpirun) from the host. The main difference between the hybrid and bind approach is the fact that with the bind approach, the container usually does not include any MPI implementation. This means that Apptainer needs to mount/bind the MPI available on the host into the container.
Technically this requires two steps:
- Know where the MPI implementation on the host is installed.
- Mount/bind it into the container in a location where the system will be able to find libraries and binaries.
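For example, if the host MPI launcher is already on your PATH, a rough way to locate the installation prefix is to look at where mpirun lives (this is only a heuristic; your site's module files or documentation are the authoritative source):
$ dirname "$(dirname "$(command -v mpirun)")"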
The advantages of this approach are:
- Integration with resource managers such as Slurm.
- Container images are smaller, since there is no need to add an MPI implementation to the containers.
The drawbacks are:
- The MPI used to compile the application in the container must be compatible with the version of MPI available on the host.
- The user must know where the host MPI is installed.
- The user must ensure that binding the directory where the host MPI is installed is possible.
- The user must ensure that the host MPI is compatible with the MPI used to compile and install the application in the container.
The creation of an Apptainer container for the bind model is based on the following steps:
- Compile your application on a system with the target MPI implementation, as you would do to install your application on any system.
- Create a definition file that includes the copy of the application from the host to the container image, as well as all required dependencies.
- Generate the container image.
As already mentioned, the compilation of the application on the host is not different from the installation of your application on any system. Just make sure that the MPI on the system where you create your container is compatible with the MPI available on the platform(s) where you want to run your containers. For example, a container where the application has been compiled with MPICH will not be able to run on a system where only Open MPI is available, even if you mount the directory where Open MPI is installed.
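For the mpitest.c program shown earlier, this could be as simple as using the host's MPI compiler wrapper to place the binary at /tmp/mpitest, which is the path expected by the definition file below:
$ mpicc -o /tmp/mpitest mpitest.c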
Bind Mode Definition File
A definition file for a container in bind mode is fairly straightforward. The following example shows the definition file for the test program, which in this case has been compiled on the host to /tmp/mpitest:
Bootstrap: docker
From: ubuntu:22.04
%files
    /tmp/mpitest /opt/mpitest

%environment
    export PATH="$MPI_DIR/bin:$PATH"
    export LD_LIBRARY_PATH="$MPI_DIR/lib:$LD_LIBRARY_PATH"
In this example, the application mpitest is copied from the host into /opt, so we will need to run it as /opt/mpitest inside our container.
The environment section adds paths for binaries and libraries under $MPI_DIR, which we will need to set when running the container.
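The image used in the examples below can then be built from this definition file. Assuming it is saved as bind.def (the filename is arbitrary):
$ apptainer build bind.sif bind.def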
Running an MPI Application
When running our bind mode container we need to --bind our host’s MPI installation into the container. We also need to set the environment variable $MPI_DIR in the container to point to the location where the MPI installation is bound in.
Setting up the container in this way makes it semi-portable between systems that have a version-compatible MPI installation, but under different installation paths. You can also hard code the MPI path in the definition file if you wish.
$ export MPI_DIR="<PATH/TO/HOST/MPI/DIRECTORY>"
$ mpirun -n <NUMBER_OF_RANKS> apptainer exec --bind "$MPI_DIR" <PATH/TO/MY/IMAGE> </PATH/TO/BINARY/WITHIN/CONTAINER>
On an example system we may be using an Open MPI installation at /cm/shared/apps/openmpi/gcc/64/4.0.5/. This means that the commands to run the container in bind mode are:
$ export MPI_DIR="/cm/shared/apps/openmpi/gcc/64/4.0.5"
$ mpirun -n 8 apptainer exec --bind "$MPI_DIR" bind.sif /opt/mpitest
Hello, I am rank 1/8
Hello, I am rank 2/8
Hello, I am rank 0/8
Hello, I am rank 7/8
Hello, I am rank 5/8
Hello, I am rank 3/8
Hello, I am rank 4/8
Hello, I am rank 6/8
Batch Scheduler / Slurm
If your target system is set up with a batch system such as Slurm, a standard way to execute MPI applications is through a batch script. The following example illustrates a Slurm batch script that starts an Apptainer container on each node allocated to the job. It can easily be adapted for other major batch systems.
$ cat my_job.sh
#!/bin/bash
#SBATCH --job-name apptainer-mpi
#SBATCH -N $NNODES # total number of nodes
#SBATCH --time=00:05:00 # Max execution time
mpirun -n $NP apptainer exec /var/nfsshare/gvallee/mpich.sif /opt/mpitest
The example describes a job that requests the number of nodes specified by the NNODES environment variable and a total number of MPI processes specified by the NP environment variable. It also assumes that the container is based on the hybrid model; if it is based on the bind model, please add the appropriate bind options.
A user can then submit a job by executing the following SLURM command:
$ sbatch my_job.sh
Alternative Launchers
On many systems it is common to use an alternative launcher to start MPI applications, e.g. Slurm’s srun rather than the mpirun provided by the MPI installation. This approach is supported with Apptainer as long as the container MPI version supports the same process management interface (e.g. PMI2 / PMIx) and version as is used by the launcher.
In the bind mode the host MPI is used in the container, and should interact correctly with the same launchers as it does on the host.
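For example, on a Slurm cluster configured for PMI2, a hybrid container might be launched directly with srun instead of mpirun (the appropriate --mpi value depends entirely on your site's Slurm and MPI configuration):
$ srun --mpi=pmi2 -n 8 apptainer exec hybrid-mpich.sif /opt/mpitest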
Interconnects / Networking
High performance interconnects such as Infiniband and Omnipath require that MPI implementations are built to support them. You may need to install or bind Infiniband/Omnipath libraries into your containers when using these interconnects.
By default Apptainer exposes every device in /dev to the container. If you run a container using the --contain or --containall flags, a minimal /dev is used instead. In this case you may need to bind in additional /dev/ entries manually to support the operation of your interconnect drivers in the container.
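As an illustration (the exact device paths needed are system dependent), running a hybrid container with --contain on an InfiniBand system might require binding the InfiniBand device files back in:
$ mpirun -n 8 apptainer exec --contain --bind /dev/infiniband hybrid-mpich.sif /opt/mpitest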
Troubleshooting Tips
If your containers run N rank 0 processes, instead of operating correctly as an MPI application, it is likely that the MPI stack used to launch the containerized application is not compatible with, or cannot communicate with, the MPI stack in the container.
E.g. if we attempt to run the hybrid Open MPI container, but with mpirun from MPICH loaded on the host:
$ module add mpich
$ mpirun -n 8 apptainer run hybrid-openmpi.sif /opt/mpitest
Hello, I am rank 0/1
Hello, I am rank 0/1
Hello, I am rank 0/1
Hello, I am rank 0/1
Hello, I am rank 0/1
Hello, I am rank 0/1
Hello, I am rank 0/1
Hello, I am rank 0/1
If your container starts processes of different ranks, but fails with communications errors there may also be a version incompatibility, or interconnect libraries may not be available or configured properly with the MPI stack in the container.
Please check the following things carefully before asking questions in the Apptainer community:
- For the hybrid mode, is the MPI version on the host compatible with the version in the container? Newer MPI versions can generally tolerate some mismatch in the version number, but it is safest to use identical versions.
- Is the MPI stack in the container configured to support the process management method used on the host? E.g. if you are launching tasks with srun configured for PMIx only, then a containerized MPI supporting PMI2 only will not operate as expected.
- If you are using an interconnect other than standard Ethernet, are any required libraries for it installed or bound into the container? Is the MPI stack in the container configured correctly to use them?
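A quick first check for the hybrid mode is to compare the MPI version reported on the host with the version reported inside the container, for example:
$ mpirun --version
$ apptainer exec hybrid-openmpi.sif mpirun --version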
We recommend using the Apptainer Google Group or Slack Channel to ask for MPI advice from the Apptainer community. HPC cluster configurations vary greatly and most MPI problems are related to MPI / interconnect configuration, and not caused by issues in Apptainer itself.