
MAN03 MPI-Start Installation and Configuration



MPI-Start Installation and Configuration


Title: MPI-Start Installation and Configuration
Document link: https://wiki.egi.eu/wiki/MAN03
Last modified: 19 August 2014
Policy Group Acronym: OMB
Policy Group Name: Operations Management Board
Contact Group: operations-support@mailman.egi.eu
Document Status: Approved
Approved Date:
Procedure Statement: This manual provides information on MPI-Start Installation and Configuration.
Owner: Owner of procedure



This document is intended to help EGI site administrators to properly support MPI deployments using MPI-Start.

Using yaim

A YAIM module for MPI configuration is available as part of the gLite distribution (glite-yaim-mpi). This module performs automatic configuration for PBS/Torque schedulers, enabling the use of submit filters, the definition of environment variables and the configuration of the information system. Information about the yaim module can be found here. Some manual configuration may still be required, so please read the yaim documentation and the information on this page carefully.
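
The exact set of yaim variables depends on the version of the glite-yaim-mpi module; the variable names below are only an illustrative sketch of a typical site-info.def fragment and should be checked against the yaim module documentation before use.

# Sketch of MPI-related yaim variables in site-info.def (names assumed; verify against glite-yaim-mpi docs)
MPI_OPENMPI_ENABLE="yes"           # enable Open MPI support at the site
MPI_OPENMPI_VERSION="1.4"          # version of the Open MPI installation
MPI_SHARED_HOME="yes"              # set if home directories are on a shared filesystem
MPI_SSH_HOST_BASED_AUTH="yes"      # set if host-based ssh authentication is configured between WNs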

Configuration of batch system

The batch system must be ready to execute parallel jobs (i.e. more than one slot is requested for a single job). Each batch system has its own specific ways of configuring such support.

Below you can find instructions for manually configuring different batch systems to execute MPI jobs.

Torque/PBS

PBS-based schedulers such as Torque do not deal properly with CPU allocations, because they assume homogeneous systems with the same number of CPUs on every node (machine). $cpu_per_node can be set in the jobmanager, but it has to be the same for all the machines. Furthermore, PBS does not seem to understand that there may be processes running on only one CPU of each two-CPU machine in a farm, leaving half of the capacity free for more jobs. For these reasons, some special configuration must be added to the batch system.

A way to avoid these Torque limitations is to use a submit filter. This filter rewrites the job submission to adapt the user's request to the nodes available in the cluster, replacing the -l nodes=XX option with specific requests based on the information given by the pbsnodes -a command.
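
To illustrate the idea only (this is a simplified sketch, not the sample filter mentioned below), a minimal submit filter could look like the following; it assumes a fixed number of CPUs per node, whereas a real filter derives this from pbsnodes -a:

#!/bin/bash
# Minimal illustrative submit filter: reads the job script on stdin,
# rewrites '#PBS -l nodes=N' requests and writes the result to stdout.
CPUS_PER_NODE=2   # assumption for this sketch; a real filter queries pbsnodes -a

while IFS= read -r line; do
    if [[ "$line" =~ ^#PBS[[:space:]]+-l[[:space:]]+nodes=([0-9]+)$ ]]; then
        requested=${BASH_REMATCH[1]}
        # round up to the number of whole nodes needed for the requested slots
        nodes=$(( (requested + CPUS_PER_NODE - 1) / CPUS_PER_NODE ))
        echo "#PBS -l nodes=${nodes}:ppn=${CPUS_PER_NODE}"
    else
        printf '%s\n' "$line"
    fi
done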

There is a sample submit filter available here that you can use at your site. It should work for most site configurations without any modification.

In order to configure it, edit your Torque configuration file (/var/torque/torque.cfg or /var/spool/pbs/torque.cfg), creating it if it does not exist, and add a line containing:

SUBMITFILTER /var/torque/Torque_submit_filter.pl

Then download the Torque_submit_filter.pl and put it in the above location. It should have execution rights for the users that are able to submit jobs to the queues.

The submit filter is crucial. Failing to use it results in the job being submitted to only one node, with all the MPI processes allocated there, instead of the job being distributed across several nodes.

Warning: gLite updates tend to rewrite torque.cfg. Check that the submit filter line is still there after performing an update.

Maui

Edit your configuration file (usually under /var/spool/maui/maui.cfg) and check that it contains the following line:

ENABLEMULTIREQJOBS TRUE

The ENABLEMULTINODEJOBS parameter must not be set to FALSE (if not specified, it is TRUE by default). These parameters allow a job to span more than one node and to specify multiple independent resource requests.

The maui version provided as a third-party package in EMI/UMD (maui-3.2.6p21-snap.1234905291.5.el5) has a bug that prevents the use of more than one WN when submitting a parallel job. See https://ggus.eu/ws/ticket_info.php?ticket=57828 for details. It is recommended to use a newer version of maui that does not have this problem.

Testing your configuration

Checking the submit filter

The submit filter can be tested by emulating a request for a number of processes larger than the number of CPUs per WN. For example, at a site with 2-CPU WNs:

$ cat /var/torque/torque.cfg
SUBMITFILTER /var/torque/submit_filter
$ echo "#PBS -l nodes=4" | /var/torque/submit_filter
#PBS -l nodes=2:ppn=2
#PBS -A mpi

If the request is bigger than the site size, the output will not be changed.

Checking submission

The mpi-start installation can be checked by submitting a hello world job that shows the number of processes and the hosts allocated for the execution.

This is a sample script for submission:

#!/bin/bash
#PBS -l nodes=4

MYDIR=`mktemp -d`

cd $MYDIR
cat > test.sh << EOF
#!/bin/sh
echo "Execution from: \`hostname\`"
echo "Slots: \$MPI_START_NSLOTS"
echo "-> WN list"
cat "\$MPI_START_HOSTFILE"
echo "-> WN/Process list"
cat "\$MPI_START_HOST_SLOTS_FILE"
EOF

chmod +x test.sh

mpi-start -v -t dummy -- test.sh

rm -rf $MYDIR

Submit the job (e.g. qsub mpi.sub) and check that it is executed and produces an output similar to this:

=[START]=======================================================================
Execution from: WN02
Slots: 4
-> WN list
WN01
WN02
-> WN/Process list
WN01 2
WN02 2
=[FINISHED]====================================================================

The names and number of WNs depend on your configuration. Make sure that the number of slots is the same as you requested in the submit file.

If there are problems, please set a higher verbosity level (using the -vv or -vvv parameters in the mpi-start call) and check the standard error file. Some help may be available in the mpi-start troubleshooting guide.

Depending on the MPI implementations you have installed at your site, you can submit similar jobs selecting the flavor to be tested instead of "dummy" in the submit file. Please take into account that not all MPI implementations can execute shell scripts, so you may need to write and compile an MPI program. Check the user manual for this.

If local submission works, no problems should appear when submitting through the CE. Nevertheless, it should also be tested; check the user manual for examples.

SGE

Support for parallel jobs under SGE is enabled using Parallel Environments. You will need to configure at least one parallel environment in order to execute the jobs. Check the Parallel Environment documentation for more information. The CREAM Blah scripts will automatically use the available parallel environment if a job that requires more than one CPU is submitted.

In the following example a PE configuration is shown:

[root@ce ~]# qconf -sp mpi
pe_name            mpi
slots              4    
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE   
job_is_first_task  FALSE
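
A parallel environment must also be referenced from the queues that should run MPI jobs. As a hedged sketch (the definition file path and the queue name all.q are assumptions for illustration), the PE above could be created from a file and attached to a queue with:

# create the PE from a definition file containing the settings shown above
$ qconf -Ap /tmp/pe_mpi.conf

# add the 'mpi' PE to the pe_list of an existing queue (here assumed to be all.q)
$ qconf -mattr queue pe_list "mpi" all.q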

Testing your configuration

Submission to SGE can be checked with a job that requires some slots and invokes mpi-start to print the detected slots and hosts:

#!/bin/bash
#$ -S /bin/bash
#$ -pe mpi 16

MYDIR=`mktemp -d`

cd $MYDIR
cat > test.sh << EOF
#!/bin/sh
echo "Execution from: \`hostname\`"
echo "Slots: \$MPI_START_NSLOTS"
echo "-> WN list"
cat "\$MPI_START_HOSTFILE"
echo "-> WN/Process list"
cat "\$MPI_START_HOST_SLOTS_FILE"
EOF

chmod +x test.sh

mpi-start -v -t dummy -- test.sh

rm -rf $MYDIR

Use the appropriate pe name and a number of slots that fits into your cluster. This job will produce an output similar to this:

=[START]=======================================================================
Execution from: gcsic180wn
Slots: 16
-> WN list
gcsic180wn.ifca.es
gcsic032wn.ifca.es
gcsic060wn.ifca.es
gcsic051wn.ifca.es
gcsic040wn.ifca.es
cms10wn.ifca.es
-> WN/Process list
gcsic180wn.ifca.es 1
gcsic032wn.ifca.es 1
gcsic060wn.ifca.es 1
gcsic051wn.ifca.es 7
gcsic040wn.ifca.es 3
cms10wn.ifca.es 3
=[FINISHED]====================================================================

The names and number of WNs depend on your configuration. Make sure that the number of slots is the same as you requested in the submit file.

If there are problems, please set a higher verbosity level (using the -vv or -vvv parameters in the mpi-start call) and check the standard error file. Some help may be available in the mpi-start troubleshooting guide.

Depending on the MPI implementations you have installed at your site, you can submit similar jobs selecting the flavor to be tested instead of "dummy" in the submit file. Please take into account that not all MPI implementations can execute shell scripts, so you may need to write and compile an MPI program. Check the user manual for this.

If local submission works, no problems should appear when submitting through the CE. Take into account that the CREAM support for SGE uses -pe *, so the batch system chooses the most appropriate Parallel Environment for your job, which may not be the one configured previously. Check the user manual for examples of submission through the CE.

Other batch systems

mpi-start supports other batch systems (such as Condor and Slurm). If you need assistance configuring those, please contact the MPI support unit by submitting a ticket to GGUS.

Passwordless ssh (hostbased authentication)

Depending on the MPI implementation used, and if the site does not have a shared file system, passwordless ssh between the WNs may be required. If that is the case, make sure that any pool account can log in from one WN to any other using ssh without a password prompt.
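
A quick way to verify this is to run a non-interactive ssh command from one WN to another as a pool account (the host name below is hypothetical):

# run as a pool account on one WN; BatchMode makes ssh fail instead of prompting for a password
$ ssh -o BatchMode=yes wn02.example.org hostname
wn02.example.org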

Job Limits

The execution of MPI jobs requires that the limits imposed by batch system on the jobs are configured properly. The most relevant are the job size and the time limits. See also the Section on the Information System below for relevant variables on the GlueSchema for publishing these limits.

  • Job size. MPI jobs use more than one slot, hence the batch system must allow the execution of jobs with more than 1 slot. Moreover, an upper limit should also be defined according to the size of the cluster and the type of nodes available at the site. Each site must decide on the most appropriate value for its available resources: large values could reduce the throughput of the site, while values that are too small limit the usage of parallel jobs. It is recommended that this limit be greater than or equal to the number of slots available in two nodes of the cluster, in order to allow the testing of multi-node execution.
  • Time limits. Usually the batch system can limit two different times for job execution: wall clock time and CPU time. The wall clock time is the difference between the end and the start of the job, independently of the resources the job uses. If your batch system is configured correctly, the CPU time of a job accounts for the sum of the CPU time of each of the slots allocated to it, so in a parallel job the CPU time is normally greater than the wall clock time. The batch system must be configured to allow MPI jobs to run for a reasonable time (see the sketch after this list for one way of setting such limits). The recommended values for these limits are:
    • Maximum wall clock time must be bigger than 0 and less than infinity.
    • Maximum CPU time must be bigger than the maximum wall clock time.
    • The maximum CPU time should allow a 4 CPU job to run for the maximum wall clock time, that is, it should be at least 4 times the maximum wall clock time.
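
As a hedged example of how such limits might be set on a Torque/PBS queue (the queue name and the values are assumptions; adapt them to your site policy):

# 48 h wall clock limit and a CPU time limit 4 times larger, on a hypothetical 'mpi' queue
$ qmgr -c "set queue mpi resources_max.walltime = 48:00:00"
$ qmgr -c "set queue mpi resources_max.cput = 192:00:00"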

Installation of MPI implementation

In order to execute MPI jobs, the site must support one of the multiple MPI implementations available. The most widespread are Open MPI and MPICH2. OS distributions provide ready-to-use packages that fit most use cases. SL5 provides the following packages:

  • openmpi and openmpi-devel for Open MPI.
  • mpich2 and mpich2-devel for MPICH2.
  • lam and lam-devel for LAM

Installation of the devel packages for the MPI implementation is recommended, since this allows users to compile their applications at the site. Moreover, the Nagios probes will try to compile a binary, so not having a working compiler will make them fail. Note that the compiler may not be specified as a dependency of the -devel packages; make sure that gcc and related packages are available (see the sketch below).
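
For example, on an SL5 WN the packages listed above could be installed with yum (shown here for Open MPI; this is only a sketch and the exact package names may differ between repositories):

# install Open MPI together with its devel package and the GNU compilers
$ yum install openmpi openmpi-devel gcc gcc-c++ gcc-gfortran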

Note also that not all the implementations support tight integration with the batch system. Tight integration is required for proper accounting numbers.

Open MPI

Open MPI supports tight integration with several batch systems, including Torque/PBS and SGE, although this may require recompilation of the packages. Tight integration allows proper accounting of the resources used (CPU time) and better control of the jobs by the system, avoiding zombie processes if something goes wrong with the application. The following sections describe the SGE and PBS/Torque cases:

SGE

The SGE tight scheduler integration allows Open MPI to start the processes in the worker nodes using the native batch system utilities, thus providing better process control and accounting. SL5 packages already include support for SGE with the openmpi and openmpi-devel rpms. After Open MPI is installed, you should see one component named gridengine in the ompi_info output:

$ ompi_info | grep gridengine
                MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.4)

Check Open MPI FAQ for more information.

Torque/PBS

In the case of Torque/PBS on SL5 you will need to recompile the packages for your site. The Open MPI FAQ includes instructions for doing so. You can adapt the SL5 packages to support Torque/PBS with the following steps:

  • Download and install the Open MPI source RPM:

$ rpm -Uvh http://ftp2.scientificlinux.org/linux/scientific/5x/SRPMS/vendor/openmpi-1.4-4.el5.src.rpm
Retrieving http://ftp2.scientificlinux.org/linux/scientific/5x/SRPMS/vendor/openmpi-1.4-4.el5.src.rpm
warning: /var/tmp/rpm-xfer.DAMscP: Header V3 DSA signature: NOKEY, key ID 192a7d7d
   1:openmpi                ########################################### [100%]
warning: user mockbuild does not exist - using root
warning: group mockbuild does not exist - using root
(the last two warnings are repeated several times and can be safely ignored)
  • Modify the spec file to include Torque/PBS support:
--- openmpi.spec        2010-03-31 23:18:20.000000000 +0200
+++ openmpi.spec        2011-03-07 18:37:11.000000000 +0100
@@ -114,6 +114,7 @@
 ./configure --prefix=%{_libdir}/%{mpidir} --with-libnuma=/usr \
        --with-openib=/usr --enable-mpirun-prefix-by-default \
        --mandir=%{_libdir}/%{mpidir}/man %{?with_valgrind} \
+        --with-tm \
        --enable-openib-ibcm --with-sge \
        CC=%{opt_cc} CXX=%{opt_cxx} \
        LDFLAGS='-Wl,-z,noexecstack' \
  • Install Torque/PBS development libraries:
$ yum install libtorque-devel
  • Build the RPMs
$ rpmbuild -ba /usr/src/redhat/SPECS/openmpi.spec
  • Install the resulting RPMs:
$ yum localinstall --nogpgcheck /usr/src/redhat/RPMS/x86_64/openmpi-*
  • Check that the support for Torque/PBS is enabled:
$ /usr/lib64/openmpi/1.4-gcc/bin/ompi_info | grep tm
              MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.4)
                 MCA ras: tm (MCA v2.0, API v2.0, Component v1.4)
                 MCA plm: tm (MCA v2.0, API v2.0, Component v1.4)

Open MPI without tight integration

Open MPI can use rsh/ssh to start the jobs if no tight integration is available. Jobs will run if you have passwordless ssh enabled between the WNs, but the accounting figures will be incorrect.

MPICH2

MPICH2 can use several launchers for starting the processes:

  • MPD, which uses rsh/ssh for starting the processes, so it will not produce correct accounting numbers.
  • Hydra, which also uses rsh/ssh and should support tight integration with some batch systems.
  • For PBS/Torque, OSC Mpiexec which includes tight integration.

mpi-start is able to select the most appropriate launcher if found (Hydra is preferred over MPD).

MPD

MPD is available in all versions of MPICH2 and uses rsh/ssh to start the processes. It was the default starter for versions earlier than 1.3. It uses a .mpd.conf file in the home directory, so the home directory must be accessible from the WNs (which is the usual case).
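
The ~/.mpd.conf file must be readable only by its owner and contains the MPD secret word; a minimal sketch (with a placeholder secret) is:

# create a minimal ~/.mpd.conf for the account that will run the jobs
$ echo "MPD_SECRETWORD=change_this_secret" > ~/.mpd.conf
$ chmod 600 ~/.mpd.conf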

Hydra

Hydra is the new starter of MPICH2 and the default since version 1.3. It is designed to work natively with multiple launchers such as ssh, rsh, pbs, slurm and sge. Notice that not all versions support all the launchers! The version included with SL5 does NOT support pbs or sge, therefore passwordless rsh/ssh between the nodes is mandatory.

sge support has been included since version 1.3b1, and pbs support since version 1.5a1. If you want tight integration (i.e. correct accounting) with MPICH2 and one of these systems, you may need to download and compile the packages at your site.

OSC Mpiexec

OSC Mpiexec provides tight integration for PBS/Torque systems. In order to use it with mpi-start you will need to define the variable MPI_MPICH2_MPIEXEC pointing to its location.

There are no binary distributions of OSC Mpiexec available. You will need to download and compile the software. In order to compile you need the torque-devel packages installed. Check the documentation included with the software.

MPI-Start installation

MPI-Start is the recommended solution for hiding the implementation details from the submitted jobs. It was developed within the int.eu.grid project and its development now continues in the EMI project. It should be installed on every worker node involved with MPI, and it may also be installed on user interface machines for testing purposes.
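
Assuming the EMI/UMD repositories are configured on the node, installation is typically a single package (the package name mpi-start is assumed here; check your repository):

$ yum install mpi-start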

MPI-Start configuration

MPI-Start is designed to auto detect most of the site configurations without any administrator intervention. The default installation will automatically detect:

  • the batch scheduler at the site: currently PBS/Torque, SGE, LSF, Condor and Slurm are supported.
  • existence of shared file system in the job running directory
  • availability of OSC mpiexec for PBS/Torque systems
  • default MPI installations for SLC5.

If the automatic detection fails for any of these, the administrator can set on the worker nodes the environment variables shown in the table below (this is normally done by adding a script to /etc/profile.d; a sketch of such a script follows the table):

Variable Name | Meaning | Example
MPI_DEFAULT_FLAVOUR | Name of the default flavour for jobs running at the site | MPI_DEFAULT_FLAVOUR=openmpi
MPI_<flavour>_PATH | Path of the bin and lib directories for the MPI flavour | MPI_MPICH_PATH=/opt/mpich-1.2.6
MPI_<flavour>_VERSION | Preferred version of the MPI flavour | MPI_MPICH_VERSION=1.2.6
MPI_<flavour>_MPIEXEC | Path of the MPIEXEC binary for the specific flavour | MPI_MPICH_MPIEXEC=/opt/mpiexec-0.80/bin/mpiexec
MPI_<flavour>_MPIEXEC_PARAMS | Additional parameters for the MPIEXEC of the flavour | MPI_MPICH_MPIEXEC_PARAMS="-l"
MPI_<flavour>_MPIRUN | Path of the MPIRUN binary for the specific flavour | MPI_MPICH_MPIRUN=/opt/mpich-1.2.6/bin/mpirun
MPI_<flavour>_MPIRUN_PARAMS | Additional parameters for the MPIRUN of the flavour | MPI_MPICH_MPIRUN_PARAMS="-l"
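
As an illustration, a profile.d script could export a subset of these variables; the file name and paths below are assumptions and must be adapted to the actual installation locations at the site:

# /etc/profile.d/mpi-start-site.sh (hypothetical file name and example values)
export MPI_DEFAULT_FLAVOUR=openmpi
export MPI_OPENMPI_PATH=/usr/lib64/openmpi/1.4-gcc     # example path from the SL5 packages
export MPI_OPENMPI_VERSION=1.4
export MPI_MPICH2_MPIEXEC=/usr/local/bin/mpiexec       # e.g. an OSC Mpiexec installation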

Distribution of binaries

The MPI binaries that users want to run need to be accessible on every node involved in an MPI computation (it is a parallel job after all). There are three main approaches:

Shared home/other shared area

By far the best option is to provide user homes hosted on a shared filesystem. This could either be a network filesystem (e.g. NFS) or a cluster filesystem (e.g. GPFS or Lustre). Then the MPI binary you transfer in the Sandbox (or compile up) on the first MPI node will automatically be available on all nodes. This is the normal mode of operation for MPI, and what MPI users will probably expect.

MPI-Start checks whether the working directory of the job is on a shared filesystem (nfs, gfs, afs, smb, gpfs and lustre are detected); if it is, MPI-Start assumes that the binaries will be available on all the nodes involved in the execution without any further action. If jobs at your site start on a shared filesystem, you do not need to configure anything else: mpi-start should detect it automatically.

In some cases, there is an available shared filesystem but the job does not start its execution there. Site admins can force MPI-Start to use one directory as shared for transferring the job files to all nodes by setting these variables:

Variable Name | Meaning | Example
MPI_START_SHARED_FS | If undefined, MPI-Start will try to detect a shared file system in the execution directory. If defined and equal to 1, MPI-Start will assume that the execution directory is shared between all hosts and will not try to copy files. Any other value will make MPI-Start assume that the execution directory is not shared. | MPI_START_SHARED_FS=0
MPI_SHARED_HOME | If set to yes, MPI-Start will use the path defined in MPI_SHARED_HOME_PATH for copying the files and executing the application. | MPI_SHARED_HOME="yes"
MPI_SHARED_HOME_PATH | Path to a shared directory (only used if MPI_SHARED_HOME=yes) | MPI_SHARED_HOME_PATH=/path/to/your/shared/area

For example, consider a site that does not have shared home directories (where the job starts its execution), but provides an additional NFS shared filesystem for MPI applications mounted on all WNs at /nfs/shared_home. In this case you should set:

MPI_SHARED_HOME=yes
MPI_SHARED_HOME_PATH=/nfs/shared_home

mpi-start will copy all the files in the sandbox to a temporary directory under /nfs/shared_home/ and start the application from that location. Files will be cleaned up after execution.

Passwordless ssh between WNs

If you configure host-based authentication between worker nodes, then MPI-Start can automatically replicate your binary to nodes involved in the computation. All the files in the working directory will be replicated, however, other needed files (e.g. data) may not be replicated, so this would have to be done manually (and would be slow for large data sets). Also it could open up the potential for users to subvert the normal resource management mechanisms by directly executing commands on nodes not allocated to them.

Use OSC mpiexec to distribute files

This option is for sites with neither a shared filesystem nor passwordless ssh between WNs. If you have an mpiexec that can spawn the remote jobs using the LRMS native interface, you can use it to distribute the files. See this page for the basic idea.

Information System

Job Policies

As described earlier, the execution of MPI jobs requires proper limits implemented in the batch system. These limits must be known to the users and made available through the information system in the following variables (see also the Section above on Job Limits):

  • GlueCEPolicyMaxSlotsPerJob. This variable publishes the maximum number of slots that a single job may use at the site. It is not automatically published by the information provider, so it must be explicitly set by the admin. It must be bigger than 1 for parallel jobs and should be defined according to the size of the cluster and the type of nodes available at the site. By default it is set to 9999999.
  • GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxWallClockTime. These variables publish the job time limits, which must be set in the batch system and are published automatically by the info providers. The following rules should be met by these variables (the published values can be checked with the query sketched after this list):
    • GlueCEPolicyMaxCPUTime > GlueCEPolicyMaxWallClockTime
    • GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime >= 4
    • GlueCEPolicyMaxWallClockTime not equal to 0, 1 or 9999999
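
A hedged way to check what the site actually publishes is to query the site BDII (the host name below is a placeholder):

# query the Glue 1.3 information published by the site BDII on port 2170
$ ldapsearch -x -H ldap://sitebdii.example.org:2170 -b o=grid \
    '(objectClass=GlueCE)' GlueCEPolicyMaxSlotsPerJob GlueCEPolicyMaxCPUTime GlueCEPolicyMaxWallClockTime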

MPI Support

Sites may install different implementations (or flavours) of MPI. It is therefore important that users can use the information system to locate sites with the software they require. You should publish some values to let the world know which flavour of MPI you are supporting, as well as the interconnect and some other things. Everything related to MPI should be published as GlueHostApplicationSoftwareRunTimeEnvironment tags, as described in the corresponding sections below (see the sketch at the end of this section for one way of publishing them).

MPI-start support

GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START

MPI flavour(s)

<MPI flavour>

This is the most basic variable and one should be advertised for each MPI flavour that has been installed and tested. Currently supported flavours are MPICH, MPICH2, LAM and OPENMPI.

Example:

  • GlueHostApplicationSoftwareRunTimeEnvironment: MPICH
  • GlueHostApplicationSoftwareRunTimeEnvironment: MPICH2
  • GlueHostApplicationSoftwareRunTimeEnvironment: LAM
  • GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI

MPI version(s)

<MPI flavour>-<MPI version>

This should be published to allow users with special requirements to locate specific versions of MPI software.

Examples:

  • GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.0.2
  • GlueHostApplicationSoftwareRunTimeEnvironment: MPICH-1.2.7
  • GlueHostApplicationSoftwareRunTimeEnvironment: MPICH-G2-1.2.7
  • GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.0.2-ICC

MPI compiler(s) -- optional

<MPI flavour>-<MPI version>-<Compiler>

If <Compiler> is not published, then the gcc suite is assumed.

Interconnects -- optional

MPI-<interconnect>

Interconnects: Ethernet, Infiniband, SCI, Myrinet

Examples

  • GlueHostApplicationSoftwareRunTimeEnvironment: MPI-Infiniband

Shared homes

If a site has a shared filesystem for home directories it should publish the variable MPI_SHARED_HOME.

  • GlueHostApplicationSoftwareRunTimeEnvironment: MPI_SHARED_HOME
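
With YAIM, these tags are usually published by listing them in the CE_RUNTIMEENV variable of site-info.def. The values below are only an example combination for a hypothetical site with Open MPI 1.4, shared homes and MPI-Start; adapt them to what is actually installed:

# example fragment of site-info.def; tags must match the installed and tested software
CE_RUNTIMEENV="
    MPI-START
    OPENMPI
    OPENMPI-1.4
    MPI_SHARED_HOME
"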

Revision History

Version | Authors | Date | Comments
        | M. Krakowian | 19 August 2014 | Change contact group -> Operations support