
MPI User Guide



Introduction

This document is intended to help the EGI user community execute MPI applications on the European Grid Infrastructure. It has been prepared by the EGI-InSPIRE project. Please provide feedback to Enol Fernández del Castillo (enolfc_AT_ifca_DOT_unican_DOT_es).

Many of the sites involved in the European Grid Infrastructure support various MPI implementations, such as OpenMPI and MPICH 1/2. Site administrators may deploy any version of any MPI implementation to serve the needs of their site users.

Execution of MPI applications requires sites that properly support the submission and execution of parallel applications, as well as the availability of an MPI implementation. Site administrators should check the gLite MPI v1.0.0 release notes of the EMI 1 Kebnekaise release or the MPI-START v1.0.4 manual for the relevant information about site configuration. Since not all sites have this support enabled, special tags (e.g. 'MPI-START', 'OpenMPI', 'MPICH2') are published via the information system so that users can discover which sites can be used for their executions. Sites may also install different implementations (or flavours) of MPI. It is therefore important that users can use the information system to locate sites with the software they require.


Current status of MPI-supporting sites

The Service Availability Monitoring infrastructure of EGI monitors the status and correct configuration of MPI sites, and erroneous sites are suspended if necessary. This monitoring system tests the MPI-START wrapper and its supported MPI flavours with Nagios probes; standalone MPI implementations are not tested if the site does not support the MPI-START wrapper. You can always check the latest monitoring data for the sites at the central MyEGI webpage.

The execution of parallel applications does not only require middleware support for such jobs; it also needs a correct configuration of the sites where the jobs actually run. In order to assure the correct execution of these applications, monitoring probes that check the proper support for such jobs are available.

The monitoring probes are executed at all sites that publish the MPI-START tag in their information system and consist of the following steps:

  1. Assure that MPI-Start is actually available.
  2. Check the information published by the site. This step inspects the announced MPI flavour support and selects the probes that will be run in the next step.
  3. For each of the supported MPI flavours, submit a job to the site that requests 2 processes and is compiled from source using the MPI-Start hooks. The probe checks that the number of processes used by the application is really the requested number.

Although the probes request a low number of slots (2), they allow the early detection of basic problems. These probes are flagged as critical, thus any failure may cause the site to be suspended from the infrastructure.

Executing MPI applications with MPI-START

MPI-Start is the recommended way of starting MPI jobs in the infrastructure. The MPI-START v1.0.4 User Guide contains a complete description of how to run MPI jobs in general and on the Grid. The documentation focuses on gLite resources, although MPI-Start can be used with ARC and UNICORE if installed and configured by the site administrator.
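
As a quick orientation, the following is a minimal sketch of driving MPI-Start by hand on a worker node. It assumes the mpi-start executable is on the PATH and uses the I2G_* environment variable interface of the MPI-START manual (the same prefix as the debugging variables shown later in this guide); adjust to the locally installed version.

#!/bin/bash
# Minimal MPI-Start invocation (sketch; variable names taken from the
# MPI-START v1.0.4 manual)
export I2G_MPI_TYPE=openmpi              # MPI flavour to use
export I2G_MPI_APPLICATION=./myapp.bin   # the application binary
export I2G_MPI_APPLICATION_ARGS="arg1 arg2"
mpi-start                                # MPI-Start selects and calls mpirun/mpiexec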

Examples can also be found in the tutorial materials prepared by the EGI-InSPIRE SA3 'Support for parallel computing (MPI)' task.

Sites supporting MPI-Start must publish the proper tags in the information system; e.g. the BDII attribute 'GlueHostApplicationSoftwareRunTimeEnvironment' should contain the tag MPI-START.

MPI-START features

  • Supported MPI implementations:
    • OpenMPI
    • MPICH2
    • MPICH
    • LAM-MPI
    • PACX-MPI
  • OpenMP support (basic)
  • Automatic file distribution (for non-shared file systems)
  • Automatic compiler discovery

For more about the MPI-START design, please read here.

Discovery of suitable sites

Discovery of resources is the first step that needs to be accomplished before the execution of applications. This can be done by using the 'GlueHostApplicationSoftwareRunTimeEnvironment' attribute of the BDII service (Berkeley Database Information Index), which should include all the relevant MPI support information that allows users to locate the sites with an adequate software environment. The following sections describe the tags that sites may publish.

For each MPI implementation supported by MPI-Start, sites must publish a variable with the name of the MPI flavour that has been installed and tested. The supported flavours are: MPICH for MPICH, MPICH2 for MPICH2, LAM for LAM-MPI and OPENMPI for OpenMPI. The most commonly supported flavours are OpenMPI and MPICH2.


Example of MPI flavour tags published in the BDII:

GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START
GlueHostApplicationSoftwareRunTimeEnvironment: MPICH
GlueHostApplicationSoftwareRunTimeEnvironment: MPICH2 
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMP


More specific version and compiler information can also be published by sites using variables of the form:

<MPI flavour>-<MPI version> or <MPI flavour>-<MPI version>-<Compiler>

These are not mandatory, although they should be published to allow users with special requirements to locate specific versions of MPI software. Users should assume the GCC compiler suite is used if no other value is specified.


Example of MPI flavour tags with versions/compilers published in the BDII:

GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.4.2 
GlueHostApplicationSoftwareRunTimeEnvironment: MPICH-1.2.7 
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.3.7-ICC
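
To find sites that publish a specific version, the same lcg-info query pattern shown later in this guide can be used; a sketch, assuming the biomed VO used in the later examples:

$ lcg-info -vo biomed -query 'Tag=OPENMPI-1.4.2' -attrs Tag -list-ce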


Sites may publish the network interconnect available for the execution of MPI applications with a variable of the form:

MPI-<interconnect>

Currently the valid interconnects are: Ethernet, Infiniband, SCI, and Myrinet.


Example of a network interconnect tag published in the BDII:

GlueHostApplicationSoftwareRunTimeEnvironment: MPI-INFINIBAND


Sites supporting a shared filesystem for the execution of MPI applications publish the MPI_SHARED_HOME variable. If your application needs such a feature, you should check the availability of that variable. Otherwise, you can use the MPI-START hooks framework to automatically detect the lack of a shared home and distribute input/output files across the nodes.


Example of the shared home tag published in the BDII:

GlueHostApplicationSoftwareRunTimeEnvironment: MPI_SHARED_HOME
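
If your application requires a shared home, you can combine this tag with MPI-START in a single query; a sketch, reusing the lcg-info pattern from the next section with the biomed VO:

$ lcg-info -vo biomed -query 'Tag=MPI-START,Tag=MPI_SHARED_HOME' -list-ce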


Querying the information system

There are several ways to query the information system:

  • middleware command line tools - the 'lcg-info' command (high-level tools)
  • Linux command line tools - the 'ldapsearch' command (low-level tools)
  • GUI for Windows/Linux: http://directory.apache.org/studio/ (low-level tools)
  • Other

The preferred way for end users to browse the information system is the 'lcg-info' command from a UI included in the EMI 1 Kebnekaise release, because it provides high-level, human-readable information.
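
For the low-level route, the same information can be pulled with ldapsearch from a top-level BDII; a sketch, assuming the usual BDII defaults (port 2170, base 'o=grid') and with <top-bdii-host> as a placeholder for a real host:

$ ldapsearch -x -LLL -H ldap://<top-bdii-host>:2170 -b 'o=grid' \
    '(GlueHostApplicationSoftwareRunTimeEnvironment=MPI-START)' \
    GlueSubClusterUniqueID GlueHostApplicationSoftwareRunTimeEnvironment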


First of all, we need to find out what MPI options are available at EGI.

Querying the information system to get the list of MPI tags for the biomed VO in EGI, from the sites supporting 'MPI-START':

Command:

$ lcg-info -vo biomed -query 'Tag=MPI-START' -attrs Tag -list-ce|egrep -e MPI|sort|uniq

Output:

MPICH
MPICH1-1.2.0
MPICH-1.2.6
MPICH-1.2.7
MPICH-1.2.7p1
MPICH2
MPICH2-1.0.4
MPICH2-1.1.1
MPICH2-1.1.1p1
MPICH2-1.2.1
MPICH2-1.4.1
MPICH2-1.6
MPI-Ethernet
MPIEXEC
MPI_HOME_NOTSHARED
MPI-Infiniband
MPI-INFINIBAND
MPI-Myrinet
MPIRUN
MPI_SHARED_HOME
MPI-START
MPI-START-0.0.59
OPENMPI
OPENMPI-1.1
OPENMPI-1.2
OPENMPI-1.2.8
OPENMPI-1.3
OPENMPI-1.3.2
OPENMPI-1.3.3
OPENMPI-1.4
OPENMPI-1.4.1
OPENMPI-1.4.3
OPENMPI-1.4-4
OPENMPI-1.4.4
OPENMPI-1.4-4-GCC
OPENMPI-GCC
OPENMPI-ICC

Note: in the output above, only sites publishing the 'MPI-START' tag are included (lcg-info query parameter: -query 'Tag=MPI-START'). If you want to add additional criteria, e.g. to get the MPI tags only from sites that also support 'MPI_SHARED_HOME' or 'MPI-INFINIBAND' or both, modify the command's query parameter, e.g. -query 'Tag=MPI-START,Tag=MPI-INFINIBAND'. You will then get the available MPI tags only from the sites that support both 'MPI-START' and 'MPI-INFINIBAND'.
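
For example, the combined query from the note above becomes:

$ lcg-info -vo biomed -query 'Tag=MPI-START,Tag=MPI-INFINIBAND' -attrs Tag -list-ce|egrep -e MPI|sort|uniq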

Job execution

General Requirements

When submitting an MPI job via the gLite WMS, you should include in the Requirements expression the MPI-related tags obtained from the information system. The following example shows the Requirements expression for a job that needs OPENMPI with an INFINIBAND interconnect and uses MPI-START for execution:

Requirements  = member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
             && member("OPENMPI", other.GlueHostApplicationSoftwareRunTimeEnvironment)
             && member("MPI-INFINIBAND", other.GlueHostApplicationSoftwareRunTimeEnvironment);

For additional information about sending MPI jobs to the Grid and using MPI-START, please refer to the MPI-START manual.
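
For completeness, a sketch of the usual gLite WMS submission cycle for a JDL description such as the complete example shown below (file names are illustrative):

$ glite-wms-job-submit -a -o job.ids myapp.jdl   # submit with automatic proxy delegation, save the job ID
$ glite-wms-job-status -i job.ids                # follow the job state
$ glite-wms-job-output -i job.ids                # retrieve the OutputSandbox once the job is Done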

Nodes and Cores Requirements

The following JDL attributes control how nodes and cores are allocated for the job. This example requests 8 whole nodes, each dedicating at least 4 cores:

SMPGranularity = 4;
HostNumber     = 8;
WholeNodes     = true;

Definitions:

  • SMPGranularity - an integer greater than 0 specifying the number of cores that any host involved in the allocation has to dedicate to the considered job. This attribute cannot be specified along with the HostNumber attribute when WholeNodes is false.
  • WholeNodes - a boolean indicating whether whole nodes should be used exclusively or not.
  • HostNumber - an integer indicating the number of nodes the user wishes to obtain for the job. This attribute cannot be specified along with the SMPGranularity attribute when WholeNodes is false.


MPI Job debugging

If you want more verbose output from MPI-START for debugging purposes, you should add to the JDL file the attribute defined below:

Environment   = {"I2G_MPI_START_VERBOSE=1", "I2G_MPI_START_DEBUG=1"};

Complete MPI Job description

JobType       = "Normal";
CPUNumber     = 6;
Executable    = "myapp.sh";
Arguments     = "OPENMPI myapp.bin myapp arguments";
InputSandbox  = {"myapp.sh", "myapp.bin"};
OutputSandbox = {"std.out", "std.err"};
StdOutput     = "std.out";
StdError      = "std.err";
SMPGranularity = 4;
HostNumber     = 8;
WholeNodes     = true;
Requirements  = member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
             && member("OPENMPI", other.GlueHostApplicationSoftwareRunTimeEnvironment)
             && member("MPI-INFINIBAND", other.GlueHostApplicationSoftwareRunTimeEnvironment);
Environment   = {"I2G_MPI_START_VERBOSE=1", "I2G_MPI_START_DEBUG=1"};
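
The JDL above assumes that myapp.sh is a small wrapper that hands control to MPI-Start. A minimal sketch of such a wrapper, assuming the mpi-start executable and the I2G_* environment variable interface from the MPI-START manual (the argument layout matches the Arguments attribute above):

#!/bin/bash
# myapp.sh - minimal MPI-Start wrapper (sketch, not an official template)
MPI_FLAVOUR=$1                 # first argument, e.g. OPENMPI
shift
export I2G_MPI_TYPE=$(echo $MPI_FLAVOUR | tr '[:upper:]' '[:lower:]')
export I2G_MPI_APPLICATION=$1  # the binary shipped in the InputSandbox
shift
export I2G_MPI_APPLICATION_ARGS="$*"   # remaining arguments for the application
mpi-start                      # MPI-Start performs the actual mpirun/mpiexec call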


For more advanced use of the MPI-START wrapper, please read about the Hooks Framework (a minimal hooks sketch follows the list below). You may benefit from features such as:

  • Ready-made code to reuse for setting up your application's execution environment: pre-processing, post-processing.
  • Handling of your application's compilation.
  • Collecting the output files produced on more than one node.
  • Other
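
As an illustration, a minimal hooks file sketch; the function names pre_run_hook and post_run_hook follow the MPI-START hooks documentation, while the compile and copy commands are assumptions for a simple single-source application:

#!/bin/bash
# hooks.sh - minimal MPI-Start hooks sketch
pre_run_hook () {
  # build the binary from source before execution
  mpicc -o myapp.bin myapp.c
  return $?
}

post_run_hook () {
  # gather the output produced on this node after execution
  cp myapp.out myapp.out.$(hostname)
  return 0
}

The hooks file is then made known to MPI-Start, e.g. via the I2G_MPI_PRE_RUN_HOOK and I2G_MPI_POST_RUN_HOOK environment variables described in the manual.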


JDL Attributes Specification

https://edms.cern.ch/file/592336/1/CREAM-JDL.pdf

Known issues

  • MPI-START wrapper version discovery: support for tagging the version of MPI-START in the information system. Currently only the MPI-START tag is published, and the version may be obtained only by sending a grid job and performing the discovery manually. A user following the official manual may therefore hit command line parameters that are not supported by the currently installed MPI-START wrapper version. For reference, see the RT ticket here.
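
Until version tags are published, a possible workaround is to submit a trivial job that asks the locally installed wrapper for its version; a sketch, assuming mpi-start supports the -V (version) flag of the MPI-START v1.0.4 manual:

#!/bin/bash
# mpi-start-version.sh - submit as a Normal job to the CE of interest
mpi-start -V 2>&1

Target sites that publish the wrapper by using a Requirements expression with member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment), as in the examples above.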

Future plans

User Community Board MPI requirements

MPI-START development tracking system