VT MPI within EGI


General Project Information


Despite a dedicated SA3 activity to support MPI, there still seem to be significant issues in uptake and satisfaction among the user communities. This VT:

  • Works with user communities and projects that use MPI resources (e.g. ITER, MAPPER, A&A, etc) to demonstrate that MPI can work successfully in EGI.
  • Sets up a VO on EGI with sites committed to support MPI jobs.
  • Improves the communication between MPI users and the developers of MPI support within EGI SA3.


The output of this project is a report that describes the work carried out by the project and the achievements of its activities, and captures the issues and actions that have been identified by the project but will be dealt with by EGI members outside of the Virtual Team project.

The project and the report cover six main areas of work to improve MPI within EGI:

  1. Documentation: Improved documentation has been prepared in the EGI wiki for site administrators and for application developers. It provides guidance as to how to configure and use MPI resources correctly.
  2. Nagios probes: New monitoring probes for the EGI Service Availability Monitor (SAM) have been defined. These will be implemented and put into production by the Heavy User Community and Operations teams.
  3. Information system: The typical problems with the registration of MPI resources have been collected and reported to Operations. The Nagios probes have been designed to be able to detect these problems.
  4. Accounting: Issues with collecting accounting information about parallel applications have been collected and reported to the responsible technology developers and providers with a request to address them.
  5. Batch system integration: Issues with interfacing MPI applications with some of the local batch job schedulers of EGI have been collected and addressed.
  6. MPI VO: A new VO which includes only correctly configured MPI sites has been set up on the production infrastructure. The VO can be used to port MPI applications to EGI. During the demo, MPI VT members will show how many MPI resources are available in EGI and how to use them. Real MPI applications will be submitted to demonstrate the capabilities of the VO.

The list of open actions covers those MPI-related issues that have to be followed up by the community outside of this VT project. These actions have already been submitted to the responsible parties in EGI in the form of feedback, recommendations and software bugs. The EGI-InSPIRE SA3 MPI team will supervise overall progress with the actions and will record this in the table as well as in the EGI-InSPIRE project quarterly reports.

Report: MPI within EGI - https://documents.egi.eu/document/1260

Open Actions after MPI VT



Task 1: MPI documentation

  • Assigned to: Enol / Paschalis Korosoglou

This documentation will be reviewed and we will decide what needs updating or extending.

  • Gergely comments:
For users:
Should be merged into a single wiki page.
For site admins:


  • [Done] Action 1.1 (Enol): Check and update the MPI wiki to include Zdenek's comments next week.
  • [Done] Action 1.2 (Alvaro/all): Put the current MPI issues, technical information and mitigation plan into the MPI VT wiki.
  • [Open] Action 1.3 (Enol): Include an MPI users section.

Task 2: Nagios probes

  • Assigned to: Gonçalo Borges / John Walsh / Paschalis Korosoglou

The current Nagios probes should be reviewed to test the EGI MPI infrastructure.

John Walsh comments:

a) A non-critical test that tests MPI scalability above two nodes.
Ideally, I would like to see this test set to ceiling(average number of cores) x 2 + 1.
This should increase the likelihood that the job runs on multiple nodes.
This test should only be run maybe once or twice a week and allow at least a day for scheduling
(so as to be non-intrusive on valuable site resources).

b) Improve the baseline MPI tests.
We should test basic MPI API functionality (scatter, gather, etc), rather than the simpler "hello world".
I will try to see whether I can assemble a basic test-suite.

c) Following up on https://ggus.eu/ws/ticket_info.php?ticket=76755, I have suggested that we may not be using GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxObtainableCPUTime properly, and that site queues may not be correctly set up for MPI jobs
(i.e. not setting the Torque queue resources_max.cput and resources_max.pcput values). Perhaps we can develop a (BDII?) sanity "warning" check for this?
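The slot-count rule proposed in (a) can be sketched as follows (a minimal illustration; `avg_cores_per_node` is assumed to come from the information system, not an attribute the probe already reads):

```python
import math

def scalability_probe_slots(avg_cores_per_node):
    """Slot count for the proposed non-critical scalability test:
    ceiling(average number of cores per node) x 2 + 1.  Two average-sized
    nodes can hold at most 2 x ceiling(avg) slots, so requesting one more
    makes it likely that the job is scheduled across more than two nodes."""
    return math.ceil(avg_cores_per_node) * 2 + 1
```

For example, on a site whose worker nodes average 8 cores, the probe would request 17 slots.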


  • [Done] Action 2.1 (John W./Enol/Paschalis/Alvaro/Gonçalo): Create a new wiki section to include the new MPI Nagios probe specifications to be developed by SA3. Follow the Nagios wiki procedure to include the new probes in production.
  • [In progress] Action 2.2 (Alvaro/Enol): Create a new GOCDB requirement to include an MPI service in GOCDB. Check whether different MPI services (one per flavour) are needed.
  • [Done] Action 2.3 (Alvaro): Submit a Doodle to schedule the Nagios MPI probes meeting.
  • [Done] Action 2.4 (All) Deadline 12/03/12: Review and comment on the new Nagios specifications https://wiki.egi.eu/wiki/VT_MPI_within_EGI:Nagios

Task 3: Information system

  • Assigned to: Gonçalo Borges

Problems detecting MPI resources.

  • Checking for MPI availability -- mostly decided by checking installed applications.
  • Not all sites report their MPI capability correctly.

Ivan comments:

For BDII, the metrics portal checks the GlueHostApplicationSoftwareRunTimeEnvironment property for the *MPI* regular expression.
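As an illustration of that check, a short Python sketch (the tag lists below are made up for illustration, not real BDII output):

```python
import re

def publishes_mpi(runtime_tags):
    """Mimic the metrics-portal test: does any published
    GlueHostApplicationSoftwareRunTimeEnvironment value match *MPI*?"""
    return any(re.search("MPI", tag) for tag in runtime_tags)

# Illustrative tag lists:
#   publishes_mpi(["MPI-START", "OPENMPI-1.4.3"])  -> True
#   publishes_mpi(["VO-biomed-Gate"])              -> False
```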

John Walsh comments:

GGUS ticket: https://ggus.eu/ws/ticket_info.php?ticket=76755
The problem seems to be related to the Torque settings for pcput and cput on each of the queues.

cput = Maximum amount of CPU time used by all processes in the job.
pcput = Maximum amount of CPU time used by any single process in the job.
walltime = Maximum amount of real time during which the job can be in the running state.

So, for example, on one of the "medium" queues on
deimos.htc.biggrid.nl, the config is:
set queue medium resources_max.cput = 24:00:00
set queue medium resources_max.pcput = 24:00:00
set queue medium resources_max.walltime = 36:00:00

This would not be sufficient to allow a 6-core job to run for a full
24 hours; the job is likely to be removed after it has run for 4 hours.
We need to check that these queue settings are sensible for MPI jobs.

This is an interesting ticket that summarizes the handling of the Torque pcput setting (or lack thereof) in the infosys.
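The arithmetic behind that example can be made explicit (a sketch; the 24 h cput limit and 6-core job are the figures quoted above):

```python
def effective_walltime_hours(cput_limit_hours, cores):
    """resources_max.cput caps the CPU time summed over all of a job's
    processes, so a fully loaded N-core job reaches the cap after
    cput / N wall-clock hours, regardless of the walltime limit."""
    return cput_limit_hours / cores

# The "medium" queue above: cput = 24 h, so a 6-core job is removed after
# 24 / 6 = 4 wall-clock hours, far short of the 36 h walltime limit.
```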

Goncalo Borges comments:

The idea of the task was to assess the status of the information published while the new SAM MPI probes are not yet around.
I've developed a simple (non-optimized) Perl script to check the status of the most important variables published
for MPI. The algorithm is the following:
   1) Get certified sites from GOCDB
   2) Get GlueClusterUniqueID for the different sites
   3) Check which GlueClusterUniqueIDs support MPI.
       3.1) Inspect the RunTimeEnvironment
   4) Check which CEs are under a given GlueClusterUniqueID supporting MPI
       4.1) Inspect relevant GlueCE information

The script produces two files:
   - info.txt: with the relevant information for MPI per GlueClusterUniqueID / site 
               and per GlueCEInfoHostName / GlueClusterUniqueID
   - warn.txt: with the issues found per GlueClusterUniqueID / site and per GlueCEInfoHostName / GlueClusterUniqueID.

A warning entry is added to warn.txt following the directives we have agreed for the Nagios probes:
   - MPI-START tag is not published for a given GlueClusterUniqueID
   - One MPI flavour tag (OPENMPI or MPICH(2), following any of the proposed formats) is not present
               <MPI flavour>
               <MPI flavour>-<MPI version>
               <MPI flavour>-<MPI version>-<Compiler>
   - GlueCEPolicyMaxSlotsPerJob is 0 or 1 or the default 9999999 
   - GlueCEPolicyMaxWallClockTime is 0 or 1 or the default 9999999
   - GlueCEPolicyMaxCPUTime < GlueCEPolicyMaxWallClockTime
   - GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime < 4

The script and the produced outputs were sent to the VT-MPI mailing list.
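A compact Python sketch of those directives (this is not the Perl script itself; the function signature and sample values are assumptions for illustration, while the attribute names and thresholds follow the list above):

```python
GLUE_DEFAULT = 9999999  # the unconfigured default for Glue policy values

def warning_entries(tags, max_slots, max_wct, max_cput):
    """Return the warn.txt entries the agreed directives would produce
    for one GlueClusterUniqueID / CE, given its published values."""
    warns = []
    if "MPI-START" not in tags:
        warns.append("MPI-START tag is not published")
    # Flavour tags may be <flavour>, <flavour>-<version> or
    # <flavour>-<version>-<compiler>, so compare the leading token.
    if not any(t.split("-")[0] in ("OPENMPI", "MPICH", "MPICH2") for t in tags):
        warns.append("no MPI flavour tag (OPENMPI or MPICH(2)) is present")
    if max_slots in (0, 1, GLUE_DEFAULT):
        warns.append("GlueCEPolicyMaxSlotsPerJob is 0, 1 or the default 9999999")
    if max_wct in (0, 1, GLUE_DEFAULT):
        warns.append("GlueCEPolicyMaxWallClockTime is 0, 1 or the default 9999999")
    if max_cput < max_wct:
        warns.append("GlueCEPolicyMaxCPUTime < GlueCEPolicyMaxWallClockTime")
    elif max_wct and max_cput / max_wct < 4:
        warns.append("GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime < 4")
    return warns
```

A correctly published CE (e.g. tags including MPI-START and OPENMPI-1.4.3-GCC, a real slot limit, and MaxCPUTime at least 4x MaxWallClockTime) yields no warnings.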

I would say the next steps here are:
   1./ Update MPI Wiki page on what should be published under GlueCEPolicyMaxSlotsPerJob.
   2./ Update MPI wiki page on the recommendation for GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxWallClockTime
        - GlueCEPolicyMaxCPUTime > GlueCEPolicyMaxWallClockTime
        - GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime >= 4
        - GlueCEPolicyMaxWallClockTime not equal to 0, 1 or 9999999
   3./ It seems the right wiki where this information should be available is: 
   4./ Deliver the list of problems to SA1 together with the pointers to documentation. SA1 should then bring
       the issues to the right forum.


  • [DONE] Action 3.1 (John Walsh/Gonçalo Borges): Until we have Nagios probes for this, Gonçalo will contact John to open GGUS tickets to MPI sites that are not publishing their batch system info correctly.
  • [DONE] Action 3.2 (John Walsh/Enol Fernandez): Check whether the current GLUE2 schema includes MPI static values.
    • MaxSlotsPerJobs can be used for MPI jobs: the maximum number of slots which could be allocated to a single job. This value is not filled by the current LRMS information providers.
  • [DONE] Action 3.3 (Roberto Rosende): Raise a request to EMI to include MaxSlotsPerJobs as a new value to be published by the batch system IPs.

Task 4: Accounting system

  • Assigned to: John Gordon, Iván Díaz

Implement an MPI accounting system (JRA1.4).

Ivan comments:

No special accounting support. The only way to recognize MPI jobs is to check for jobs with >100% efficiency.
- Still development to be done.
- APEL needs to give data for each batch system.

Enol Comments:

>100% efficiency may not be true for MPI jobs. What must be checked is the number of slots.
That would also include other parallel jobs, but I don't think that's a major issue.
APEL should already give the number of slots used by the job; this data is easily available for all batch systems.
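Enol's slot-based criterion can be sketched as a simple classifier over usage records (the field names and fallback logic are an assumed shape for illustration, not the actual APEL record format):

```python
def is_parallel_job(slots, cpu_seconds, wall_seconds):
    """Prefer the slot count, as suggested above; fall back to the
    >100% efficiency heuristic only when the slot count is missing,
    since an MPI job with idle ranks can show efficiency below 100%."""
    if slots is not None:
        return slots > 1
    return wall_seconds > 0 and cpu_seconds / wall_seconds > 1.0
```

With slot data present, a single-slot job is classified as serial even if OS threads push its efficiency above 100%, which is exactly the ambiguity the efficiency heuristic suffers from.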

John's Comments:

How many cores/CPUs are used by a job is not under the control of the user. The OS may move a job/process
between CPUs/cores for its own reasons. It may also spawn system threads which run in parallel with the user process.
By these means a superficially serial job could record in its accounting that it used multiple cores/CPUs.
The requirement below about the accounting record containing the serial/parallel nature of the job begs
the question 'How does the accounting parser find this information?' Is this recorded in the batch logs
so that the parser could find it?

Requirement: #3328

"Accounting system should keep track of the type of the job: MPI or serial.
This should be recorded in the Usage Record in order to be easily queried in
the accounting repository."


  • Action 4.1: Create an MPI accounting system (APEL and Accounting Portal).

Task 5: Batch system status

  • Assigned to: Roberto Rosende/Enol Fernandez

All batch systems must support MPI jobs. Check the current batch system status and issues.

Roberto Rosende comments:

Starting work on MPI support for SGE, to be ready for EMI 2.
The main problem with the batch system is that it is not receiving reliable info from the information system (not truly a batch system matter).

Alvaro Simon comments:

Two bugs were found during the first UMD verification of
WN/Torque + EMI-MPI 1.0. It is a Torque/Maui problem that affects all MPI jobs: Maui versions
prior to 3.3.4 do not correctly allocate all the nodes for job execution. GGUS tickets:
- https://ggus.eu/ws/ticket_info.php?ticket=57828
- https://ggus.eu/ws/ticket_info.php?ticket=67870


  • [Done] Action 5.1 (Alvaro): Ask about the batch system support issue in EMI. Raise this issue to EGI SA1/2.

Task 6: Gather information from MPI sites

  • Assigned to: Zdenek Sustr

After establishing the VO and contacting sites for resources, more requests for information can be added.

Zdenek comments:

- MPI VO -- bring together sites and users interested in MPI
- This VO is NOT intended for everyday use by all users wishing to use MPI
- This VO IS intended for users who wish to cooperate with the VT to make MPI support in EGI better
- The main reason for its establishment is to collect experience that will be later adopted by regular VOs

Ivan comments:

A User Community under SA3 would also be a good idea.


  • [Done] Action 6.1 (Zdenek): Distribute the new MPI VO endpoint among MPI VT members and ask MPI sites to support the new VO. Add new VO users to test MPI sites.
    • Participating resource providers:
      • NGI_CZ (20 cores) Status: configured, untested
      • NGI_NL Status: contacted
      • NGI_IT Status: contacted
  • Action 6.2 (Zdenek): Inform OMB about MPI VT status and work progress.
  • [Done] Action 7.1 (Zdenek/Alvaro): Set an estimated end date for MPI VT.


  • NGIs - confirmed:
    • BG: Aneta Karaivanova
    • CZ: Zdenek Sustr (leader)
    • ES/IBERGRID: Alvaro Simon (leader), Enol Fernandez, Iván Díaz, Alvaro Lopez, Pablo Orviz, Isabel Campos, Roberto Rosende Dopazo
    • GR: Dimitris Dellis, Marios Chatziangelou, Paschalis Korosoglou
    • HR: Emir Imamagic, Luko Gjenero
    • IE: John Walsh
    • IT: Daniele Cesini, Alessandro Costantini, Vania Boccia, Marco Bencivenni
    • PT: Gonçalo Borges
    • SK: Viera Sipkova, Viet Tran, Jan Astalos
    • UK: John Gordon
  • EGI.eu: Gergely Sipos, Karolis Eigelis, Tiziana Ferrari, Peter Solagna


VO MPI-Kickstart

The MPI-Kickstart Virtual Organization brings together sites and users interested in improving MPI reliability across EGI.

Useful Links

Note! VO membership needs to be renewed annually. The next renewal will be required at the turn of 2013/2014.

Environment Settings

VO_MPI_VOMSES="'mpi voms1.egee.cesnet.cz 15030 /DC=cz/DC=cesnet-ca/O=CESNET/CN=voms1.egee.cesnet.cz mpi 24'"
VO_MPI_WMS_HOSTS="wms1.egee.cesnet.cz wms2.egee.cesnet.cz" 
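For reference, a sketch of a JDL that an MPI-Kickstart user might submit through the WMS hosts above, following the MPI-START conventions from the documentation task; the executable, arguments, sandbox contents and slot count are illustrative placeholders, not part of the VO configuration:

```
JobType      = "Normal";
CpuNumber    = 4;
Executable   = "mpi-start-wrapper.sh";
Arguments    = "my-mpi-app OPENMPI";
InputSandbox = {"mpi-start-wrapper.sh", "my-mpi-app.c"};
StdOutput    = "std.out";
StdError     = "std.err";
Requirements = Member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
               && Member("OPENMPI", other.GlueHostApplicationSoftwareRunTimeEnvironment);
```

The Requirements expression steers the job to CEs that publish both the MPI-START tag and an MPI flavour tag, which is exactly what the Task 3 checks verify.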


Task 1 (DONE): MPI documentation

Task 2 (DONE): Nagios probes

Task 3 (DONE): Information system

Task 4 (DONE): Accounting system

Task 5 (DONE): Batch system status

Task 6 (DONE): Gather information from MPI sites

  • Created the new MPI-Kickstart VO.
    • CESNET and CESGA are providing resources to test the new VO.
  • Gathered information from NGIs. MPI survey and sites status: