MPI User Guide
Introduction
This document is intended to help the EGI user community execute MPI applications on the European Grid Infrastructure. It has been prepared by the EGI-InSPIRE project. Please send feedback to Enol Fernández del Castillo (enolfc_AT_ifca_DOT_unican_DOT_es).
Many of the sites involved in the European Grid Infrastructure support various MPI implementations, such as OpenMPI, MPICH 1/2, and OpenMP. Site administrators may deploy any version of any MPI implementation to serve the needs of their users.
Execution of MPI applications requires sites that properly support the submission and execution of parallel applications, as well as the availability of an MPI implementation. Site administrators should check the gLite MPI v1.0.0 release notes of the EMI 1 Kebnekaise release or the MPI-START v1.0.4 manual for the relevant information about site configuration. Since not all sites have this support enabled, special tags (e.g. 'MPI-START', 'OPENMPI', 'MPICH2') are published via the information system so that users can discover which sites can be used for their executions. Sites may also install different implementations (or flavours) of MPI. It is therefore important that users can use the information system to locate sites with the software they require.
Current status of MPI supporting sites
The Service Availability Monitoring infrastructure of EGI monitors the status and correct configuration of MPI sites and suspends erroneous sites if necessary. This monitoring system tests the MPI-START wrapper and its supported MPI flavours with Nagios probes; standalone MPI implementations are not tested if the site does not support the MPI-START wrapper. You can always check the latest monitoring data for the sites at the central MyEGI webpage.
The execution of parallel applications requires not only middleware support for such jobs, but also a correct configuration of the sites where the jobs actually run. To assure the correct execution of these applications, monitoring probes that check the proper support for such jobs are available.
The monitoring probes are executed at all the sites that publish the MPI-START tag in their information system and consist of the following steps:
- Assure that MPI-Start is actually available.
- Check the information published by the site. This step inspects the announced MPI flavour support and selects the probes that will be run in the next steps.
- For each of the supported MPI flavours, submit a job to the site requesting 2 processes; the job is compiled from source using the MPI-Start hooks. The probe checks that the number of processes used by the application is really the requested number.
Although the probes request a low number of slots (2), they allow the early detection of basic problems. These probes are flagged as critical, thus any failure may cause the site to be suspended from the infrastructure.
Executing MPI applications with MPI-Start
MPI-Start is the recommended way of starting MPI jobs in the infrastructure. The MPI-START v1.0.4 User Guide contains a complete description of how to run MPI jobs in general and on the Grid. The documentation focuses on gLite resources, although MPI-Start can also be used with ARC and UNICORE if installed and configured by the site administrator.
Examples can also be found in the tutorial materials prepared by the EGI-InSPIRE SA3 "Support for parallel computing (MPI)" task:
- MPI hands-on training, Vilnius, April 2011 - Slides in English, gLite
- Parallel Jobs with MPI, Grid and e-CIENCIA, Valencia, July 2010 - Slides in English, gLite
Sites supporting MPI-Start must publish the proper tags in the information system; e.g. the BDII attribute 'GlueHostApplicationSoftwareRunTimeEnvironment' should include the MPI-START tag.
MPI-Start features
- Supported MPI Implementations:
- OpenMP (basic)
- OpenMPI
- MPICH2
- MPICH
- LAM-MPI
- PACX-MPI
- Automatic file distribution (for non-shared file systems)
- Automatic compiler discovery
For more about the MPI-START design, please read here.
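MPI-Start is controlled through I2G_* environment variables. The sketch below shows a direct invocation on a worker node, assuming an OpenMPI-compiled binary ./hello_bin is already in place (the variable names are the documented MPI-Start ones; the binary name and arguments are illustrative, and the call is guarded so the snippet does nothing where MPI-Start is not installed):

```shell
# Select the MPI flavour MPI-Start should use (lowercase flavour name)
export I2G_MPI_TYPE=openmpi
# The application binary and its arguments (illustrative names)
export I2G_MPI_APPLICATION=./hello_bin
export I2G_MPI_APPLICATION_ARGS="arg1 arg2"
# Invoke the wrapper only where it is actually available
if command -v mpi-start >/dev/null 2>&1; then
  mpi-start
fi
```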
Discovery of suitable sites
Discovery of resources is the first step that needs to be accomplished before the execution of applications. This can be done by using the 'GlueHostApplicationSoftwareRunTimeEnvironment' attribute of the BDII service (Berkeley Database Information Index), which should include all the relevant MPI support information and allows users to locate the sites with the adequate software environment. The following sections describe the tags that sites may publish.
For each MPI implementation supported by MPI-Start, sites must publish a variable with the name of the MPI flavour that has been installed and tested. The supported flavours are: MPICH for MPICH, MPICH2 for MPICH2, LAM for LAM-MPI and OPENMPI for Open MPI. The most commonly supported flavours are OpenMPI and MPICH2.
MPI flavours published Tag at BDII example:
GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START
GlueHostApplicationSoftwareRunTimeEnvironment: MPICH
GlueHostApplicationSoftwareRunTimeEnvironment: MPICH2
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMP
More specific version and compiler information can also be defined by the sites using variables of the form:
<MPI flavour>-<MPI version> or <MPI flavour>-<MPI version>-<Compiler>
These are not mandatory, although they should be published to allow users with special requirements to locate specific versions of MPI software. Users should assume gcc compiler suite is used if no other value is specified.
MPI flavours with versions/compilers published Tag at BDII example:
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.4.2
GlueHostApplicationSoftwareRunTimeEnvironment: MPICH-1.2.7
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.3.7-ICC
Sites may publish the network interconnect available for the execution of MPI applications with a variable with the form:
MPI-<interconnect>
Currently the valid interconnects are: Ethernet, Infiniband, SCI, and Myrinet.
Network Interconnections published Tag at BDII example:
GlueHostApplicationSoftwareRunTimeEnvironment: MPI-INFINIBAND
Sites supporting a shared filesystem for the execution of MPI applications publish the MPI_SHARED_HOME variable. If your application needs such a feature, you should check the availability of that variable. Otherwise, you can use the MPI-START hooks framework for automatic detection and distribution of input/output files across the nodes when they do not have a shared home.
Shared home published Tag at BDII example:
GlueHostApplicationSoftwareRunTimeEnvironment: MPI_SHARED_HOME
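When no shared home is available, the hooks framework mentioned above can compile the binary once and let MPI-Start distribute it to the other nodes. Below is a minimal sketch of a hooks file, assuming it is passed to MPI-Start via the I2G_MPI_PRE_RUN_HOOK / I2G_MPI_POST_RUN_HOOK environment variables (the file name and the compile command are illustrative):

```shell
# mpi-hooks.sh: minimal MPI-Start hooks sketch (hypothetical file name).
# MPI-Start sources this file and calls the functions before/after mpirun.

pre_run_hook () {
  # Compile the application from source on the first node; MPI-Start can
  # then copy the resulting binary to the other nodes when there is no
  # shared home. The source file name here is an assumption.
  echo "Compiling ${I2G_MPI_APPLICATION}"
  mpicc -o "${I2G_MPI_APPLICATION}" "${I2G_MPI_APPLICATION}.c" || return 1
  return 0
}

post_run_hook () {
  # Collect or archive output files after the MPI execution finishes.
  echo "Execution finished, gathering results"
  return 0
}
```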
Querying information system
There are several ways to query the information system:
- middleware command line tools - the 'lcg-info' command (high-level tools)
- Linux command line tools - the 'ldapsearch' command (low-level tools)
- GUI for Windows/Linux: http://directory.apache.org/studio/ (low-level tools)
- Other
The preferred way for end users to browse the information system is the 'lcg-info' command from the UI included in the EMI 1 Kebnekaise release, because it provides high-level, human-readable information.
First of all, we need to find out what MPI options are available to us in EGI:
Querying the information system to get the list of MPI tags for the biomed VO in EGI from the sites which support 'MPI-START':
Command:
$ lcg-info --vo biomed --query 'Tag=MPI-START' --attrs Tag --list-ce | egrep -e MPI | sort | uniq
Output:
MPICH MPICH1-1.2.0 MPICH-1.2.6 MPICH-1.2.7 MPICH-1.2.7p1 MPICH2 MPICH2-1.0.4 MPICH2-1.1.1 MPICH2-1.1.1p1 MPICH2-1.2.1 MPICH2-1.4.1 MPICH2-1.6 MPI-Ethernet MPIEXEC MPI_HOME_NOTSHARED MPI-Infiniband MPI-INFINIBAND MPI-Myrinet MPIRUN MPI_SHARED_HOME MPI-START MPI-START-0.0.59 OPENMPI OPENMPI-1.1 OPENMPI-1.2 OPENMPI-1.2.8 OPENMPI-1.3 OPENMPI-1.3.2 OPENMPI-1.3.3 OPENMPI-1.4 OPENMPI-1.4.1 OPENMPI-1.4.3 OPENMPI-1.4-4 OPENMPI-1.4.4 OPENMPI-1.4-4-GCC OPENMPI-GCC OPENMPI-ICC
Note: the query above restricts the output to sites publishing the 'MPI-START' tag (lcg-info parameter: --query 'Tag=MPI-START'). If you want to add additional criteria, for example to get the MPI tags only from the sites which also support 'MPI_SHARED_HOME' or 'MPI-INFINIBAND' (or both), modify the query parameter, e.g. --query 'Tag=MPI-START,Tag=MPI-INFINIBAND'. You will then get the available MPI tags only from the sites which support both 'MPI-START' and 'MPI-INFINIBAND'.
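Such a combined query could look like the following sketch (the VO name is an example, and the call is guarded so the snippet is a no-op on machines where lcg-info is not installed):

```shell
# AND two tags together: sites must publish both MPI-START and MPI-INFINIBAND
QUERY='Tag=MPI-START,Tag=MPI-INFINIBAND'
# Run the query only where the lcg-info client is available
if command -v lcg-info >/dev/null 2>&1; then
  lcg-info --vo biomed --list-ce --query "$QUERY" --attrs Tag \
    | egrep -e MPI | sort | uniq
fi
```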
Job execution
Requirements
When submitting an MPI job via the gLite WMS, you should include in the Requirements expression the MPI-related tags obtained from the information system. The following example shows the Requirements expression for a job that needs OPENMPI and an INFINIBAND interconnect and uses MPI-START for execution:
Requirements = member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
            && member("OPENMPI", other.GlueHostApplicationSoftwareRunTimeEnvironment)
            && member("MPI-INFINIBAND", other.GlueHostApplicationSoftwareRunTimeEnvironment);
For additional information about sending an MPI job to the Grid and the use of MPI-START, please refer to the MPI-START manual.
Nodes/Cores manipulation
SMPGranularity = 4; HostNumber = 8; WholeNodes = true;
Definitions:
- SMPGranularity - The SMPGranularity attribute is an integer greater than 0 specifying the number of cores any host involved in the allocation has to dedicate to the considered job. This attribute can’t be specified along with the HostNumber attribute when WholeNodes is false.
- WholeNodes - The WholeNodes attribute is a boolean that indicates whether whole nodes should be used exclusively or not.
- HostNumber - HostNumber is an integer indicating the number of nodes the user wishes to obtain for his job. This attribute can’t be specified along with the SMPGranularity attribute when WholeNodes is false.
MPI Job debugging
If you want more verbose output from MPI-START for debugging purposes, add the attribute defined below to the JDL file:
Environment = {"I2G_MPI_START_VERBOSE=1", "I2G_MPI_START_DEBUG=1"};
Complete MPI Job description
JobType = "Normal";
CPUNumber = 6;
Executable = "MPI.sh";
Arguments = "OPENMPI hello_bin hello arguments";
InputSandbox = {"MPI.sh", "hello_bin"};
OutputSandbox = {"std.out", "std.err"};
StdOutput = "std.out";
StdError = "std.err";
SMPGranularity = 4;
HostNumber = 8;
WholeNodes = true;
Requirements = member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
            && member("OPENMPI", other.GlueHostApplicationSoftwareRunTimeEnvironment)
            && member("MPI-INFINIBAND", other.GlueHostApplicationSoftwareRunTimeEnvironment);
Environment = {"I2G_MPI_START_VERBOSE=1", "I2G_MPI_START_DEBUG=1"};
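Assuming the JDL above is saved as mpi-job.jdl (an illustrative file name), it could be submitted with the standard gLite WMS client commands; a sketch, guarded so the commands only run where the gLite UI is installed:

```shell
# Illustrative JDL file name for the job description above
JDL_FILE=mpi-job.jdl
if command -v glite-wms-job-submit >/dev/null 2>&1; then
  # -a: automatic proxy delegation; -o: store the job identifier in a file
  glite-wms-job-submit -a -o jobids "$JDL_FILE"
  # Poll the job status until it reaches Done
  glite-wms-job-status -i jobids
  # Retrieve std.out / std.err from the OutputSandbox when finished
  glite-wms-job-output --dir ./results -i jobids
fi
```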
JDL Attributes Specification
https://edms.cern.ch/file/592336/1/CREAM-JDL.pdf
Known issues
- MPI-START wrapper version discovery: there is no support yet for tagging the version of MPI-START in the information system. Currently only the MPI-START tag is published, and the version can be obtained only by sending a grid job and performing the discovery manually. A user following the official manual may hit command line parameters that are not supported by the currently installed MPI-START wrapper version. For the reference RT ticket, click here.
Future plans
The EMI-1 release includes MPI-START v1.0.4. Please read more about the release here.
It includes new MPI features such as:
- JDL attributes: WholeNodes, HostNumber and SMPGranularity
Please have a look at related feature requests:
https://savannah.cern.ch/bugs/?77096
https://savannah.cern.ch/bugs/?76971
https://savannah.cern.ch/bugs/?58878