Singularity and MPI applications¶

The Message Passing Interface (MPI) is a standard extensively used by HPC applications to implement various communication across compute nodes of a single system or across compute platforms. There are two main open-source implementations of MPI at the moment - OpenMPI and MPICH, both of which are supported by Singularity. The goal of this page is to demonstrate the development and running of MPI programs using Singularity containers.

There are several ways of carrying this out, the most popular way of executing MPI applications installed in a Sigularity container is to rely on the MPI implementation available on the host. This is called the Host MPI or the Hybrid model since both the MPI implementations provided by system administrators (on the host) and in the containers will be used.

Another approach is to only use the MPI implementation available on the host and not include any MPI in the container. This is called the Bind model since it requires to bind/mount the MPI version available on the host into the container.

Note

The bind model requires to mount storage volumes into the container to use the host MPI from the containers. This file system sharing between the host and containers is sometimes not an option on high-performance computing platforms. This restriction on some HPC systems is due to the fact that mounting a storage volume would either require the execution of privileged operations, potentially compromise the access restrictions to other users’ data or go against mount options of the parallel/distributed file system where MPI is installed.

Hybrid model¶

The basic idea behind the Hybrid Approach is when you execute a Singularity container with MPI code, you will call mpiexec or a similar launcher on the singularity command itself. The MPI process outside of the container will then work in tandem with MPI inside the container and the containerized MPI code to instantiate the job.

The Open MPI/Singularity workflow in detail:

The MPI launcher (e.g., mpirun, mpiexec) is called by the resource manager or the user directly from a shell.
Open MPI then calls the process management daemon (ORTED).
The ORTED process launches the Singularity container requested by the launcher command.
Singularity instantiates the container and namespace environment.
Singularity then launches the MPI application within the container.
The MPI application launches and loads the Open MPI libraries.
The Open MPI libraries connect back to the ORTED process via the Process Management Interface (PMI).

At this point the processes within the container run as they would normally directly on the host.

The advantages of this approach are:

Integration with resource managers such as Slurm.
Simplicity since similar to natively running MPI applications.

The drawbacks are:

The MPI in the container must be compatible with the version of MPI available on the host.
The configuration of the MPI implementation in the container must be configured for optimal use of the hardware if performance is critical.

Since the MPI implementation in the container must be compliant with the version available on the system, a standard approach is to build your own MPI container, including the target MPI implementation.

To illustrate how Singularity can be used to execute MPI applications, we will assume for a moment that the application is mpitest.c, a simple Hello World:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv) {
        int rc;
        int size;
        int myrank;

        rc = MPI_Init (&argc, &argv);
        if (rc != MPI_SUCCESS) {
                fprintf (stderr, "MPI_Init() failed");
                return EXIT_FAILURE;
        }

        rc = MPI_Comm_size (MPI_COMM_WORLD, &size);
        if (rc != MPI_SUCCESS) {
                fprintf (stderr, "MPI_Comm_size() failed");
                goto exit_with_error;
        }

        rc = MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
        if (rc != MPI_SUCCESS) {
                fprintf (stderr, "MPI_Comm_rank() failed");
                goto exit_with_error;
        }

        fprintf (stdout, "Hello, I am rank %d/%d", myrank, size);

        MPI_Finalize();

        return EXIT_SUCCESS;

 exit_with_error:
        MPI_Finalize();
        return EXIT_FAILURE;
}

Note

MPI is an interface to a library, so it consists of function calls and libraries that can be used my many programming languages. It comes with standardized bindings for Fortran and C. However, it can support applications in many languages like Python, R, etc.

The next step is to build the definition file which will depend on the MPI implementation available on the host.

If the host MPI is MPICH, a definition file such as the following example can be used:

Bootstrap: docker
From: ubuntu:latest

%files
    mpitest.c /opt

%environment
    export MPICH_DIR=/opt/mpich-3.3
    export SINGULARITY_MPICH_DIR=$MPICH_DIR
    export SINGULARITYENV_APPEND_PATH=$MPICH_DIR/bin
    export SINGULAIRTYENV_APPEND_LD_LIBRARY_PATH=$MPICH_DIR/lib

%post
    echo "Installing required packages..."
    apt-get update && apt-get install -y wget git bash gcc gfortran g++ make

    # Information about the version of MPICH to use
    export MPICH_VERSION=3.3
    export MPICH_URL="http://www.mpich.org/static/downloads/$MPICH_VERSION/mpich-$MPICH_VERSION.tar.gz"
    export MPICH_DIR=/opt/mpich

    echo "Installing MPICH..."
    mkdir -p /tmp/mpich
    mkdir -p /opt
    # Download
    cd /tmp/mpich && wget -O mpich-$MPICH_VERSION.tar.gz $MPICH_URL && tar xzf mpich-$MPICH_VERSION.tar.gz
    # Compile and install
    cd /tmp/mpich/mpich-$MPICH_VERSION && ./configure --prefix=$MPICH_DIR && make install
    # Set env variables so we can compile our application
    export PATH=$MPICH_DIR/bin:$PATH
    export LD_LIBRARY_PATH=$MPICH_DIR/lib:$LD_LIBRARY_PATH
    export MANPATH=$MPICH_DIR/share/man:$MANPATH

    echo "Compiling the MPI application..."
    cd /opt && mpicc -o mpitest mpitest.c

If the host MPI is Open MPI, the definition file looks like:

Bootstrap: docker
From: ubuntu:latest

%files
    mpitest.c /opt

%environment
    export OMPI_DIR=/opt/ompi
    export SINGULARITY_OMPI_DIR=$OMPI_DIR
    export SINGULARITYENV_APPEND_PATH=$OMPI_DIR/bin
    export SINGULAIRTYENV_APPEND_LD_LIBRARY_PATH=$OMPI_DIR/lib

%post
    echo "Installing required packages..."
    apt-get update && apt-get install -y wget git bash gcc gfortran g++ make file

    echo "Installing Open MPI"
    export OMPI_DIR=/opt/ompi
    export OMPI_VERSION=4.0.1
    export OMPI_URL="https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-$OMPI_VERSION.tar.bz2"
    mkdir -p /tmp/ompi
    mkdir -p /opt
    # Download
    cd /tmp/ompi && wget -O openmpi-$OMPI_VERSION.tar.bz2 $OMPI_URL && tar -xjf openmpi-$OMPI_VERSION.tar.bz2
    # Compile and install
    cd /tmp/ompi/openmpi-$OMPI_VERSION && ./configure --prefix=$OMPI_DIR && make install
    # Set env variables so we can compile our application
    export PATH=$OMPI_DIR/bin:$PATH
    export LD_LIBRARY_PATH=$OMPI_DIR/lib:$LD_LIBRARY_PATH
    export MANPATH=$OMPI_DIR/share/man:$MANPATH

    echo "Compiling the MPI application..."
    cd /opt && mpicc -o mpitest mpitest.c

Bind model¶

Similarily to the Hybrid Approach, the basic idea behind Bind Approach is to start the MPI application by calling the MPI launcher (e.g., mpirun) from the host. The main difference between the hybrid and bind approach is the fact that with the bind approach, the container usually does not include any MPI implementation. This means that Singularity needs to mount/bind the MPI available on the host into the container.

Technically this requires two steps:

Know where the MPI implementation on the host is installed.
Mount/bind it into the container in a location where the system will be able to find libraries and binaries.

The advantages of this approach are:

Integration with resource managers such as Slurm.
Container images are smaller since there is no need to add an MPI in the containers.

The drawbacks are:

The MPI used to compile the application in the container must be compatible with the version of MPI available on the host.
The user must know where the host MPI is installed.
The user must ensure that binding the directory where the host MPI is installed is possible.
The user must ensure that the host MPI is compatible with the MPI used to compile and install the application in the container.

The creation of a Singularity container based on the bind model is based on the following steps:

Compile your application on a system with the target MPI implementation, as you would do to install your application on any system.
Create a definition file that includes the copy of the application from the host to the container image, as well as all required dependencies.
Generate the container image.

As already mentioned, the compilation of the application on the host is not different from the installation of your application on any system. Just make sure that the MPI on the system where you create your container is compatible with the MPI available on the platform(s) where you want to run your containers. For example, a container where the application has been compiled with MPICH will not be able to run on a system where only Open MPI is available, even if you mount the directory where Open MPI is installed.

A definition file for a container in bind mode is fairly straight forward. The following example shows the definition file for NetPIPE-5.1.4 compiled on the host in /tmp/NetPIPE-5.1.4:

Bootstrap: docker
From: ubuntu:disco

%files
      /tmp/NetPIPE-5.1.4/NPmpi /opt

%environment
      MPI_DIR=/opt/mpi
      export MPI_DIR
      export SINGULARITY_MPI_DIR=$MPI_DIR
      export SINGULARITYENV_APPEND_PATH=$MPI_DIR/bin
      export SINGULARITYENV_APPEND_LD_LIBRARY_PATH=$MPI_DIR/lib

%post
      apt-get update && apt-get install -y wget git bash gcc gfortran g++ make file
      mkdir -p /opt/mpi
      apt-get clean

In this example, the application, NetPIPE-5.1.4, is copied into /opt, as a result, the path to the executable to use on the mpirun command is /opt/NPmpi. Also, this definition file prepares the environment to have the host MPI mounted in /opt/mpi; it sets all the required environment variables (PATH and LD_LIBRARY_PATH) for the system to find all MPI binaries and libraries at run-time.

Execution¶

The standard way to execute MPI applications with hybrid Singularity containers is to run the native mpirun command from the host, which will start Singularity containers and ultimately MPI ranks within the containers.

Assuming your container with MPI and your application is already build, the mpirun command to start your application looks like when your container has been built based on the hybrid model:

$ mpirun -n <NUMBER_OF_RANKS> singularity exec <PATH/TO/MY/IMAGE> </PATH/TO/BINARY/WITHIN/CONTAINER>

Practically, this command will first start a process instantiating mpirun and then Singularity containers on compute nodes. Finally, when the containers start, the MPI binary is executed.

For containers built based on the bind model, the command simply needs to include the appropriate bind option:

$ mpirun -n <NUMBER_OF_RANKS> singularity exec --bind <PATH/TO/HOST/MPI/DIRECTORY>:<PATH/IN/CONTAINER> <PATH/TO/MY/IMAGE> </PATH/TO/BINARY/WITHIN/CONTAINER>

Based on the example presented in the previous sub-section, and assuming MPI is installed in /opt/openmpi on the host, the command will look like:

$ mpirun -n <NUMBER_OF_RANKS> singularity exec --bind /opt/openmpi:/opt/mpi <PATH/TO/MY/IMAGE> /opt/NPmpi

If your target system is setup with a batch system such as SLURM, a standard way to execute MPI applications is through a batch script. The following example illustrates the context of a batch script for Slurm that aims at starting a Singularity container on each node allocated to the execution of the job. It can easily be adapted for all major batch systems available.

$ cat my_job.sh
#!/bin/bash
#SBATCH --job-name singularity-mpi
#SBATCH -N $NNODES # total number of nodes
#SBATCH --time=00:05:00 # Max execution time

mpirun -n $NP singularity exec /var/nfsshare/gvallee/mpich.sif /opt/mpitest

In fact, the example describes a job that requests the number of nodes specified by the NNODES environment variable and a total number of MPI processes specified by the NP environment variable. The example is also assuming that the container is based on the hybrid model; if it is based on the bind model, please add the appropriate bind options.

A user can then submit a job by executing the following SLURM command:

$ sbatch my_job.sh