MPI User Guide



Introduction

This document is intended to help the EGI user community execute MPI applications on the European Grid Infrastructure. It has been prepared by the EGI-InSPIRE project. Please provide feedback to Enol Fernández del Castillo (enolfc_AT_ifca_DOT_unican_DOT_es). If you have any requirement for the further development of the MPI services of EGI, please submit a requirement ticket to the EGI Requirements Tracker system.

Many of the sites involved in the European Grid Infrastructure support various MPI implementations, such as Open MPI and MPICH 1/2. Site administrators may deploy any version of any MPI implementation to serve the needs of their site users.

Execution of MPI applications requires sites that properly support the submission and execution of parallel applications, and the availability of an MPI implementation. Site administrators should check the MPI-Start Installation and Configuration page, which covers the relevant information about site configuration. Sites publish special tags (e.g. 'MPI-START', 'OPENMPI', 'MPICH2') via the information system so that users can discover which sites can be used for their executions. Sites may also install different implementations (or flavours) of MPI; it is therefore important that users query the information system to locate sites with the software they require.

MPI-Start is the recommended way of starting MPI jobs in the infrastructure for gLite-based resources. This guide covers the use of MPI-Start for the execution of jobs and also introduces the execution of jobs with other middleware stacks such as UNICORE and ARC.

Current status of MPI supporting sites

The execution of parallel applications requires not only middleware support for such jobs, but also a correct configuration of the sites where the jobs actually run. To assure the correct execution of these applications, monitoring probes that check the proper support for such jobs are available. The Service Availability Monitoring infrastructure of EGI monitors the status and correct configuration of MPI sites and suspends erroneous sites if necessary. The probe tests the execution of MPI jobs using MPI-Start for each of the flavours published by the site; standalone MPI implementations are not tested if the site does not support MPI-Start. You can always check the latest monitoring data for the sites at the central MyEGI webpage.

New probes that will enhance the level of testing of the MPI support are currently under development; they are described at the VT_MPI_within_EGI:Nagios wiki page.

The current probes are executed at all sites that publish the MPI-START tag in their information system and consist of the following steps:

  1. Assure that MPI-Start is actually available.
  2. Check the information published by the site. This step inspects the supported MPI flavours announced by the site and selects the probes that will be run in the next steps.
  3. For each of the supported MPI flavours, submit a job to the site requesting 2 processes; the job is compiled from source using the MPI-Start hooks framework. The probe checks that the number of processes actually used by the application matches the requested number (2).

Although the probes request a low number of slots (2), they allow early detection of basic problems. These probes are flagged as critical, so any failure may cause the site to be suspended from the infrastructure.

Discovery of suitable sites

Discovery of resources is the first step to accomplish before executing applications. This can be done using the 'GlueHostApplicationSoftwareRunTimeEnvironment' attribute of the BDII service (Berkeley Database Information Index), which should include all the relevant MPI support information that allows users to locate sites with an adequate software environment. The following sections describe the tags that sites may publish.

For each MPI implementation supported by MPI-Start, sites must publish a variable with the name of the MPI flavour that has been installed and tested. The supported flavours are: MPICH for MPICH, MPICH2 for MPICH2, LAM for LAM/MPI and OPENMPI for Open MPI. The most commonly supported flavours are Open MPI and MPICH2.


Example of MPI flavour tags published in the BDII:

GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START
GlueHostApplicationSoftwareRunTimeEnvironment: MPICH
GlueHostApplicationSoftwareRunTimeEnvironment: MPICH2 
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMP


More specific version and compiler information can also be defined by sites using variables of the form:

<MPI flavour>-<MPI version> or <MPI flavour>-<MPI version>-<Compiler>

These are not mandatory, although they should be published to allow users with special requirements to locate specific versions of MPI software. Users should assume the GCC compiler suite is used if no other value is specified.
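A job with such a version requirement can select matching sites through the JDL Requirements expression, following the member() pattern used elsewhere in this guide; a fragment like the following (the version tag shown is illustrative):

```
Requirements = member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
            && member("OPENMPI-1.4.2", other.GlueHostApplicationSoftwareRunTimeEnvironment);
```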


Example of MPI flavour tags with versions/compilers published in the BDII:

GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.4.2 
GlueHostApplicationSoftwareRunTimeEnvironment: MPICH-1.2.7 
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.3.7-ICC


Sites may publish the network interconnect available for the execution of MPI applications with a variable of the form:

MPI-<interconnect>

Currently the valid interconnects are: Ethernet, Infiniband, SCI, and Myrinet.


Example of a network interconnect tag published in the BDII:

GlueHostApplicationSoftwareRunTimeEnvironment: MPI-INFINIBAND


Sites that support a shared file system for the execution of MPI applications publish the MPI_SHARED_HOME variable. If your application needs this feature, you should check the availability of that variable. If no shared home is available, MPI-Start automatically detects the file system and distributes the input/output files. Check the Hooks Framework for more information.
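For instance, a job that depends on a shared home can request it in its JDL Requirements expression, using the same member() pattern used for the other tags in this guide:

```
Requirements = member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
            && member("MPI_SHARED_HOME", other.GlueHostApplicationSoftwareRunTimeEnvironment);
```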


Example of the shared home tag published in the BDII:

GlueHostApplicationSoftwareRunTimeEnvironment: MPI_SHARED_HOME

Querying information system

There are several ways to query the information system:

  • middleware command line tools: the 'lcg-info' command (high-level tool)
  • Linux command line tools: the 'ldapsearch' command (low-level tool)
  • a GUI for Windows/Linux: http://directory.apache.org/studio/ (low-level tool)
  • others

The preferred way for end users to browse the information system is the 'lcg-info' command from a UI included in the EMI 1 (Kebnekaise) release, because it provides high-level, human-readable information.
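For the low-level route, the BDII can be queried directly with ldapsearch. The sketch below shows the command shape (the top-level BDII hostname is a placeholder) and then demonstrates the tag-filtering step on a small sample of GLUE output, so the pipeline can be run anywhere:

```shell
# Low-level alternative to lcg-info: query a top-level BDII directly.
# The hostname below is a placeholder; the command is shown for reference only:
#   ldapsearch -x -LLL -h <top-bdii-host> -p 2170 -b 'o=grid' \
#     '(GlueHostApplicationSoftwareRunTimeEnvironment=MPI-START)' \
#     GlueHostApplicationSoftwareRunTimeEnvironment
#
# The filtering/sorting step, demonstrated on sample GLUE output:
cat > /tmp/glue_sample.txt <<'EOF'
GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.4.2
GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START
GlueHostApplicationSoftwareRunTimeEnvironment: MPI-INFINIBAND
EOF
# Keep only the tag values and deduplicate, as the lcg-info pipeline does
cut -d' ' -f2 /tmp/glue_sample.txt | sort -u
```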


First of all, we need to find out which MPI options are available in EGI.

Querying the information system to get the list of MPI tags for the biomed VO, from the sites that support 'MPI-START':

Command:

$ lcg-info -vo biomed -query 'Tag=MPI-START' -attrs Tag -list-ce | egrep MPI | sort | uniq

Output:

MPICH
MPICH1-1.2.0
MPICH-1.2.6
MPICH-1.2.7
MPICH-1.2.7p1
MPICH2
MPICH2-1.0.4
MPICH2-1.1.1
MPICH2-1.1.1p1
MPICH2-1.2.1
MPICH2-1.4.1
MPICH2-1.6
MPI-Ethernet
MPIEXEC
MPI_HOME_NOTSHARED
MPI-Infiniband
MPI-INFINIBAND
MPI-Myrinet
MPIRUN
MPI_SHARED_HOME
MPI-START
MPI-START-0.0.59
OPENMPI
OPENMPI-1.1
OPENMPI-1.2
OPENMPI-1.2.8
OPENMPI-1.3
OPENMPI-1.3.2
OPENMPI-1.3.3
OPENMPI-1.4
OPENMPI-1.4.1
OPENMPI-1.4.3
OPENMPI-1.4-4
OPENMPI-1.4.4
OPENMPI-1.4-4-GCC
OPENMPI-GCC
OPENMPI-ICC

Note: in the output above, only sites that publish the 'MPI-START' tag are included (lcg-info parameters: -query 'Tag=MPI-START'). If you want to add additional criteria, for example to get the MPI tags only from sites that also support 'MPI_SHARED_HOME' or 'MPI-INFINIBAND', modify the command's query parameter, e.g. -query 'Tag=MPI-START,Tag=MPI-INFINIBAND'. You will then get the available MPI tags only from the sites that support both 'MPI-START' and 'MPI-INFINIBAND'.


(PACX-MPI provides an MPI wrapper that allows a single MPI job to be executed on nodes located at two or more different grid sites.)

Executing applications with MPI-START

MPI-Start is the recommended way of starting MPI jobs in the infrastructure. The User Guide contains a complete description of how to run MPI jobs in general and on the Grid. The documentation focuses on gLite resources, although MPI-Start can be used with ARC and UNICORE if installed and configured by the site administrator.

Examples can also be found in the tutorial materials prepared by the EGI-InSPIRE SA3 'Support for parallel computing (MPI)' task.

Sites supporting MPI-Start must publish the proper tags in the information system, i.e. the BDII attribute 'GlueHostApplicationSoftwareRunTimeEnvironment' should contain the tag MPI-START.

MPI-START features

Supported MPI Implementations:

  • Open MPI
  • MPICH2
  • MPICH
  • LAM-MPI
  • PACX-MPI

Other features:

  • Support for defining the layout of the application on the allocated resources, for OpenMP and hybrid MPI/OpenMP applications
  • Automatic file distribution (for non-shared file systems)
  • Automatic compiler discovery
  • Hooks Framework

For more about the MPI-Start design, please read here.

Job execution

Description of jobs

Whether using direct submission to a CREAM CE or submission via the WMS, the JDL must describe the resources required by the job using several attributes. For jobs that do not require any special allocation, the CPUNumber attribute can be used:

  • CPUNumber: specifies the total number of processes that the job must allocate. These may or may not reside on the same physical host.

Alternatively, if the job needs finer control of the required resources, the following attributes can be used:

  • SMPGranularity - an integer greater than 0 specifying the number of cores that any host involved in the allocation has to dedicate to the job. This attribute cannot be specified along with the HostNumber attribute when WholeNodes is false.
  • WholeNodes - a boolean indicating whether whole nodes should be used exclusively or not.
  • HostNumber - an integer indicating the number of nodes the user wishes to obtain for the job. This attribute cannot be specified along with the SMPGranularity attribute when WholeNodes is false.

Examples

A simple job that just requires 16 processes, independently of their location:

CPUNumber = 16;

JDL attributes for a job that requires 8 hosts, with at least 4 cores on each of them, used exclusively:

SMPGranularity = 4;
HostNumber     = 8;
WholeNodes     = true;

When submitting jobs via the WMS, you should include in the Requirements expression the MPI-related tags obtained from the information system. The following example shows the Requirements expression for a job that needs Open MPI and an Infiniband interconnect, and uses MPI-Start for execution:

Requirements  = member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
             && member("OPENMPI", other.GlueHostApplicationSoftwareRunTimeEnvironment)
             && member("MPI-INFINIBAND", other.GlueHostApplicationSoftwareRunTimeEnvironment);

The MPI-Start documentation and the [#Executing_MPI_applications_with_MPI-START tutorials linked above] include more information about sending jobs to the Infrastructure. A complete description of the JDL can be found in the CREAM JDL specification.

Complete Job Description

Below is an example of a MPI job that uses MPI-Start to execute an Open MPI application on 4 hosts with 2 cores each and with exclusive execution on those hosts:

JobType       = "Normal";
Executable    = "/usr/bin/mpi-start";
Arguments     = "-t openmpi myapp.bin myapp arguments";
InputSandbox  = {"myapp.bin"};
OutputSandbox = {"std.out", "std.err"};
StdOutput     = "std.out";
StdError      = "std.err";
SMPGranularity = 2;
HostNumber     = 4;
WholeNodes     = true;
Requirements  = member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
             && member("OPENMPI", other.GlueHostApplicationSoftwareRunTimeEnvironment)
             && member("MPI-INFINIBAND", other.GlueHostApplicationSoftwareRunTimeEnvironment);

If the site you are submitting jobs to uses an MPI-Start version older than 1.0, you will need a wrapper script that manages the job startup, as shown in the MPI-Start manual.
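Such a wrapper only needs to export the MPI-Start control variables and then invoke MPI-Start itself. A minimal sketch, assuming the I2G_* variable names from the MPI-Start manual and a site-provided $I2G_MPI_START pointing at the mpi-start script (see the manual for the authoritative version):

```shell
#!/bin/sh
# Sketch of an mpi-start wrapper for pre-1.0 MPI-Start versions.
# Usage: mpi-start-wrapper.sh <application> <MPI flavour>
MY_EXECUTABLE=$(pwd)/$1
MPI_FLAVOUR_LOWER=$(echo "$2" | tr '[:upper:]' '[:lower:]')
# MPI-Start reads its configuration from I2G_* environment variables
export I2G_MPI_TYPE=$MPI_FLAVOUR_LOWER
export I2G_MPI_APPLICATION=$MY_EXECUTABLE
export I2G_MPI_APPLICATION_ARGS=""
# The site environment defines I2G_MPI_START as the path to the mpi-start script
$I2G_MPI_START
```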

Hooks

MPI-Start allows the extension of its capabilities with the hooks framework. Hooks are scripts that are executed before and after the parallel application and may be used for:

  • Building your application's execution environment (pre-processing, post-processing)
  • Handling your application's compilation
  • Collecting the output files produced on more than one node
  • Other tasks
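As a sketch of the hooks framework (the variable names I2G_MPI_APPLICATION and MPI_MPICC_OPTS are assumed from the MPI-Start documentation), a file passed to mpi-start via its '-pre' option can define a pre_run_hook function that compiles the application on the execution site:

```shell
#!/bin/sh
# Hypothetical pre-run hook: compile the application before MPI-Start runs it.
# Pass this file to mpi-start with '-pre hooks.sh'.
pre_run_hook () {
    # I2G_MPI_APPLICATION is set by MPI-Start to the application binary name
    echo "Compiling ${I2G_MPI_APPLICATION}"
    mpicc ${MPI_MPICC_OPTS} -o "${I2G_MPI_APPLICATION}" "${I2G_MPI_APPLICATION}.c"
    if [ $? -ne 0 ]; then
        echo "Compilation failed" >&2
        return 1
    fi
    return 0
}
```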

Troubleshooting

In case of problems during execution, you should turn on the debugging output of MPI-Start with the "-v" and "-vv" options, or with the environment variables I2G_MPI_START_VERBOSE and I2G_MPI_START_DEBUG. You can define them in the JDL as shown in the example Environment attribute below:

Environment   = {"I2G_MPI_START_VERBOSE=1", "I2G_MPI_START_DEBUG=1"};

The MPI-Start pages also include a section on troubleshooting common errors. GGUS tickets can also be opened to the MPI User Support unit.

Beyond MPI jobs

MPI-Start can also be used as a starter for launching other parallel jobs in the infrastructure. The Parallel Computing Support User Guide page includes some examples of what can be achieved for generic applications or Charm++. Hybrid MPI/OpenMP applications are easily executed with MPI-Start thanks to its support for specifying process placement on the allocated resources.

Future plans

The development of MPI-Start is driven by user requirements and tracked in the MPI-Start development tracking system and the EMI development tracking system. Please check them for upcoming features and bug fixes that may affect your jobs.

Execution of parallel jobs in ARC resources

ARC uses the concept of Runtime Environments (REs) for the execution of MPI jobs. A Runtime Environment is a set of software, complete with a corresponding setup script that defines the necessary UNIX environment variables, allowing the execution of specific applications. Alternatively, a Runtime Environment can consist of a single (possibly empty) setup script that just serves as a flag to indicate the presence of particular software, data or resources.

The setup scripts are executed by the ARC gatekeeper (a.k.a. the Grid Manager) before and after job execution, so that the job runs in a proper environment, with paths and variables set accordingly. The site administrator configures the REs according to the site configuration and makes them available for users to execute their jobs. The Runtime Environment Registry (RER) keeps a list of available REs.

The user selects the appropriate RE with the 'runtimeenvironment' attribute of xRSL. For example, a 4-CPU application that uses the ENV/MPI/OPENMPI-1.3/GCC64 RE can use the following job description:

&(jobName="openmpi-gcc64")
(count="4")
(wallTime="10 minutes")
(memory="1024")
(executable="runopenmpi.sh")
(executables="hello-ompi-gcc64.exe" "runopenmpi.sh")
(inputfiles=("hello-ompi-gcc64.exe" ""))
(stdout="std.out")
(stderr="std.err")
(gmlog="gmlog")
(runtimeenvironment="ENV/MPI/OPENMPI-1.3/GCC64")

The runopenmpi.sh script starts the application as follows:

#!/bin/sh
# MPIRUN and NSLOTS are provided by the runtime environment and the batch system
echo "MPIRUN is '$MPIRUN'"
echo "NSLOTS is '$NSLOTS'"
$MPIRUN -np $NSLOTS ./hello-ompi-gcc64.exe

Execution of parallel jobs in UNICORE resources

UNICORE uses Execution Environments (EEs), an advanced feature that allows site administrators to configure, in a more detailed and user-friendly fashion, the way an executable is run. A common scenario is configuring an environment for the parallel execution of a program, such as MPI.

Execution Environments are created by the administrator, who knows how to set them up, and they provide users with the adequate options for starting their applications. Users just need to select the EE and choose any parameters and options for their application. In the following job description, an MPI job that uses the OpenMPI Execution Environment is submitted, requesting 2 nodes with 2 CPUs on each of them. There is an additional optional argument in the Execution Environment with the number of processes that the application should use.

{
  Executable: "./hello.mpi",
  Imports: [
   {From: "/myfiles/hello.mpi", To: "hello.mpi" }, 
  ],
  Resources:{ CPUsPerNode: 2, Nodes: 2, },
  Execution environment: {
    Name: OpenMPI,
    Arguments: { Processes: 12, },
  },
}

Users should check the documentation at each site to learn which EEs are available and what their options are, since they are configured locally at each site. See a list of UNICORE MPI examples.

Accounting

There is limited support for the accounting of parallel jobs in the infrastructure. At the CESGA accounting portal, choose "CPU Efficiency"; all parallel job accounting is flagged in blue (eff >= 100% (parallel jobs)). Better support for the accounting of these jobs is expected in the coming months.