VT MPI within EGI


General Project Information


Despite a dedicated SA3 activity to support MPI, there still seem to be significant issues in uptake and satisfaction among the user communities. This VT:

  • Works with user communities and projects that use MPI resources (e.g. ITER, MAPPER, A&A, etc) to demonstrate that MPI can work successfully in EGI.
  • Sets up a VO on EGI with sites committed to support MPI jobs.
  • Improves the communication between MPI users and the developers of MPI support within EGI SA3.


The output of this project is a report that describes the work carried out by the project and the achievements of its activities, and captures the issues and actions that have been identified by the project but will be dealt with by EGI members outside of the Virtual Team project.

The project and the report cover six main areas of work to improve MPI within EGI:

  1. Documentation: Improved documentation has been prepared in the EGI wiki for site administrators and for application developers. It provides guidance as to how to configure and use MPI resources correctly.
  2. Nagios probes: New monitoring probes for the EGI Service Availability Monitor (SAM) have been defined. These will be implemented and put into production by the Heavy User Community and Operations teams.
  3. Information system: The typical problems with the registration of MPI resources have been collected and reported to Operations. The Nagios probes have been designed to be able to detect these problems.
  4. Accounting: Issues with collecting accounting information about parallel applications have been collected and reported to the responsible technology developers and providers with a request to address them.
  5. Batch system integration: Issues with interfacing MPI applications with some of the local batch job schedulers of EGI have been collected and addressed.
  6. MPI VO: A new VO which includes only correctly configured MPI sites has been set up on the production infrastructure. The VO can be used to port MPI applications to EGI. During the demo, MPI VT members will show how many MPI resources are available in EGI and how to use them. Real MPI applications will be submitted to demonstrate the capabilities of the VO.

The list of open actions covers those MPI-related issues that have to be followed up by the community outside of this VT project. These actions have already been submitted to the responsible parties in EGI in the form of feedback, recommendations and software bugs. The EGI-InSPIRE SA3 MPI team will supervise overall progress with the actions and will record this in the table as well as in the EGI-InSPIRE project quarterly reports.

Report: MPI within EGI - https://documents.egi.eu/document/1260

Open Actions after MPI VT



Task 1: MPI documentation

  • Assigned to: Enol / Paschalis Korosoglou

This documentation will be reviewed and we will decide what needs updating or extending.

  • Gergely comments:
For users:
Should be merged into a single wiki page.
For site admins:


  • [Done] Action 1.1 (Enol): Check and update the MPI wiki to include Zdenek's comments next week.
  • [Done] Action 1.2 (Alvaro/all): Put the current MPI issues, technical information and mitigation plan into the MPI VT wiki.
  • [Open] Action 1.3 (Enol): Include an MPI users section.

Task 2: Nagios probes

  • Assigned to: Gonçalo Borges / John Walsh / Paschalis Korosoglou

The current Nagios probes should be reviewed to test the EGI MPI infrastructure.

John Walsh comments:

a) A non-critical test that tests MPI scalability above two nodes.
Ideally, I would like to see this test set to ceiling(average number of cores) x 2 + 1.
This should increase the likelihood that the job runs on multiple nodes.
This test should only be run maybe once or twice a week and allow at least a day for scheduling
(so as to be non-intrusive on valuable site resources).

b) Improve the baseline MPI tests.
We should test basic MPI API functionality (scatter, gather, etc), rather than the simpler "hello world".
I will try to see whether I can assemble a basic test-suite.

c) Following up on https://ggus.eu/ws/ticket_info.php?ticket=76755, I have suggested that we may not be using GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxObtainableCPUTime properly, and that site queues may not be correctly set up for MPI jobs
(i.e. not setting the Torque queue resources_max.cput and resources_max.pcput values). Perhaps we can develop a (BDII?) sanity "warning" check for this?
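The slot-count rule proposed in (a) can be sketched as follows (a minimal illustration; `avg_cores_per_node` is assumed to come from the information system, not an attribute the probe already reads):

```python
import math

def scalability_probe_slots(avg_cores_per_node):
    """Slot count for the proposed non-critical scalability test:
    ceiling(average number of cores per node) x 2 + 1.  Two average-sized
    nodes can hold at most 2 x ceiling(avg) slots, so requesting one more
    makes it likely that the job is scheduled across more than two nodes."""
    return math.ceil(avg_cores_per_node) * 2 + 1
```

For example, on a site whose worker nodes average 8 cores, the probe would request 17 slots.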


  • [Done] Action 2.1 (John W./Enol/Paschalis/Alvaro/Gonçalo): Create a new wiki section to include the new MPI Nagios probe specifications to be developed by SA3. Follow the Nagios wiki procedure to include the new probes in production.
  • [In progress] Action 2.2 (Alvaro/Enol): Create a new GOCDB requirement to include an MPI service in GOCDB. Check whether different MPI services (one per flavour) are needed.
  • [Done] Action 2.3 (Alvaro): Submit a Doodle to schedule the Nagios MPI probes meeting.
  • [Done] Action 2.4 (All) Deadline 12/03/12: Review and comment on the new Nagios specifications https://wiki.egi.eu/wiki/VT_MPI_within_EGI:Nagios

Task 3: Information system

  • Assigned to: Gonçalo Borges

Problems detecting MPI resources.

  • Checking for MPI availability -- mostly decided by checking installed applications.
  • Not all sites report their MPI capability correctly.

Ivan comments:

For BDII, the metrics portal checks the GlueHostApplicationSoftwareRunTimeEnvironment property for the *MPI* regular expression.
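As an illustration of that check, a short Python sketch (the tag lists below are made up for illustration, not real BDII output):

```python
import re

def publishes_mpi(runtime_tags):
    """Mimic the metrics-portal test: does any published
    GlueHostApplicationSoftwareRunTimeEnvironment value match *MPI*?"""
    return any(re.search("MPI", tag) for tag in runtime_tags)

# Illustrative tag lists:
#   publishes_mpi(["MPI-START", "OPENMPI-1.4.3"])  -> True
#   publishes_mpi(["VO-biomed-Gate"])              -> False
```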

John Walsh comments:

GGUS ticket: https://ggus.eu/ws/ticket_info.php?ticket=76755
The problem seems to be related to the Torque settings for pcput and cput on each of the queues.

cput = Maximum amount of CPU time used by all processes in the job.
pcput = Maximum amount of CPU time used by any single process in the job.
walltime = Maximum amount of real time during which the job can be in the running state.

So, for example, on one of the "medium" queues on
deimos.htc.biggrid.nl, the config is:
set queue medium resources_max.cput = 24:00:00
set queue medium resources_max.pcput = 24:00:00
set queue medium resources_max.walltime = 36:00:00

This would not be sufficient to allow a 6-core job to run for a full
24 hours; the job is likely to be removed after it has run for 4 hours.
We need to check that these queue settings are sensible for MPI jobs.

This is an interesting ticket that summarizes the handling of the Torque pcput setting (or lack thereof) in the infosys.
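The arithmetic behind that example can be made explicit (a sketch; the 24 h cput limit and 6-core job are the figures quoted above):

```python
def effective_walltime_hours(cput_limit_hours, cores):
    """resources_max.cput caps the CPU time summed over all of a job's
    processes, so a fully loaded N-core job reaches the cap after
    cput / N wall-clock hours, regardless of the walltime limit."""
    return cput_limit_hours / cores

# The "medium" queue above: cput = 24 h, so a 6-core job is removed after
# 24 / 6 = 4 wall-clock hours, far short of the 36 h walltime limit.
```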

Goncalo Borges comments:

The idea of the task was to assess the status of the information published while the new SAM MPI probes are not yet around.
I've developed a simple (non-optimized) Perl script to check the status of the most important variables published
for MPI. The algorithm is the following:
   1) Get certified sites from GOCDB
   2) Get GlueClusterUniqueID for the different sites
   3) Check which GlueClusterUniqueIDs support MPI.
       3.1) Inspect the RunTimeEnvironment
   4) Check which CEs are under a given GlueClusterUniqueID supporting MPI
       4.1) Inspect relevant GlueCE information

The script produces two files:
   - info.txt: with the relevant information for MPI per GlueClusterUniqueID / site 
               and per GlueCEInfoHostName / GlueClusterUniqueID
   - warn.txt: with the issues found per GlueClusterUniqueID / site and per GlueCEInfoHostName / GlueClusterUniqueID.

A warning entry is added to warn.txt following the directives we have agreed for the Nagios probes:
   - MPI-START tag is not published for a given GlueClusterUniqueID
   - One MPI flavour tag (OPENMPI or MPICH(2), following any of the proposed formats) is not present
               <MPI flavour>
               <MPI flavour>-<MPI version>
               <MPI flavour>-<MPI version>-<Compiler>
   - GlueCEPolicyMaxSlotsPerJob is 0 or 1 or the default 9999999 
   - GlueCEPolicyMaxWallClockTime is 0 or 1 or the default 9999999
   - GlueCEPolicyMaxCPUTime < GlueCEPolicyMaxWallClockTime
   - GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime < 4

The script and the produced outputs were sent to the VT-MPI mailing list.
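A compact Python sketch of those directives (this is not the Perl script itself; the function signature and sample values are assumptions for illustration, while the attribute names and thresholds follow the list above):

```python
GLUE_DEFAULT = 9999999  # the unconfigured default for Glue policy values

def warning_entries(tags, max_slots, max_wct, max_cput):
    """Return the warn.txt entries the agreed directives would produce
    for one GlueClusterUniqueID / CE, given its published values."""
    warns = []
    if "MPI-START" not in tags:
        warns.append("MPI-START tag is not published")
    # Flavour tags may be <flavour>, <flavour>-<version> or
    # <flavour>-<version>-<compiler>, so compare the leading token.
    if not any(t.split("-")[0] in ("OPENMPI", "MPICH", "MPICH2") for t in tags):
        warns.append("no MPI flavour tag (OPENMPI or MPICH(2)) is present")
    if max_slots in (0, 1, GLUE_DEFAULT):
        warns.append("GlueCEPolicyMaxSlotsPerJob is 0, 1 or the default 9999999")
    if max_wct in (0, 1, GLUE_DEFAULT):
        warns.append("GlueCEPolicyMaxWallClockTime is 0, 1 or the default 9999999")
    if max_cput < max_wct:
        warns.append("GlueCEPolicyMaxCPUTime < GlueCEPolicyMaxWallClockTime")
    elif max_wct and max_cput / max_wct < 4:
        warns.append("GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime < 4")
    return warns
```

A correctly published CE (e.g. tags including MPI-START and OPENMPI-1.4.3-GCC, a real slot limit, and MaxCPUTime at least 4x MaxWallClockTime) yields no warnings.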

I would say the next steps here are:
   1./ Update MPI Wiki page on what should be published under GlueCEPolicyMaxSlotsPerJob.
   2./ Update MPI wiki page on the recommendation for GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxWallClockTime
        - GlueCEPolicyMaxCPUTime > GlueCEPolicyMaxWallClockTime
        - GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime >= 4
        - GlueCEPolicyMaxWallClockTime not equal to 0, 1 or 9999999
   3./ It seems the right wiki where this information should be available is: 
   4./ Deliver the list of problems to SA1 together with the pointers to documentation. SA1 should then bring
       the issues to the right forum.


  • [DONE] Action 3.1 (John Walsh/Gonçalo Borges): Until we have Nagios probes for this, Gonçalo will contact John to open GGUS tickets to MPI sites that are not publishing their batch system info correctly.
  • [DONE] Action 3.2 (John Walsh/Enol Fernandez): Check whether the current GLUE2 schema includes MPI static values.
    • MaxSlotsPerJobs can be used for MPI jobs: the maximum number of slots which could be allocated to a single job. This value is not filled by the current LRMS information providers.
  • [DONE] Action 3.3 (Roberto Rosende): Raise a request to EMI to include MaxSlotsPerJobs as a new value to be published by the batch system IPs.

Task 4: Accounting system

  • Assigned to: John Gordon, Iván Díaz

Implement an MPI accounting system (JRA1.4).

Ivan comments:

No special accounting support. The only way to recognize MPI jobs is to check for jobs with >100% efficiency.
- Still development to be done.
- APEL needs to give data for each batch system.

Enol Comments:

>100% efficiency may not be true for MPI jobs. What must be checked is the number of slots.
That would also include other parallel jobs, but I don't think that's a major issue.
APEL should already give the number of slots used by the job; this data is easily available for all batch systems.
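Enol's slot-based criterion can be sketched as a simple classifier over usage records (the field names and fallback logic are an assumed shape for illustration, not the actual APEL record format):

```python
def is_parallel_job(slots, cpu_seconds, wall_seconds):
    """Prefer the slot count, as suggested above; fall back to the
    >100% efficiency heuristic only when the slot count is missing,
    since an MPI job with idle ranks can show efficiency below 100%."""
    if slots is not None:
        return slots > 1
    return wall_seconds > 0 and cpu_seconds / wall_seconds > 1.0
```

With slot data present, a single-slot job is classified as serial even if OS threads push its efficiency above 100%, which is exactly the ambiguity the efficiency heuristic suffers from.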

John's Comments:

How many cores/CPUs are used by a job is not under the control of the user. The OS may move a job/process
between CPUs/cores for its own reasons. It may also spawn system threads which run in parallel with the user process.
By these means a superficially serial job could record in its accounting that it used multiple cores/CPUs.
The requirement below about the accounting record containing the serial/parallel nature of the job begs
the question 'How does the accounting parser find this information?' Is this recorded in the batch logs
so that the parser could find it?

Requirement: #3328

"Accounting system should keep track of the type of the job: MPI or serial.
This should be recorded in the Usage Record in order to be easily queried in
the accounting repository."


  • Action 4.1: Create an MPI accounting system (APEL and Accounting Portal).

Task 5: Batch system status

  • Assigned to: Roberto Rosende/Enol Fernandez

All batch systems must support MPI jobs. Check the current batch system status and issues.

Roberto Rosende comments:

Starting work on MPI support for SGE, to be ready for EMI 2.
The main problem with the batch system is that it is not receiving reliable info from the information system (not truly a batch system matter).

Alvaro Simon comments:

Two bugs were found during the first UMD verification of
WN/Torque + EMI-MPI 1.0. It is a Torque/Maui problem that affects all MPI jobs: Maui versions
prior to 3.3.4 do not correctly allocate all the nodes for job execution. GGUS tickets:
- https://ggus.eu/ws/ticket_info.php?ticket=57828
- https://ggus.eu/ws/ticket_info.php?ticket=67870


  • [Done] Action 5.1 (Alvaro): Ask about the batch system support issue in EMI. Raise this issue to EGI SA1/2.

Task 6: Gather information from MPI sites

  • Assigned to: Zdenek Sustr

After establishing the VO and contacting sites for resources, more requests for information can be added.

Zdenek comments:

- MPI VO -- bring together sites and users interested in MPI
- This VO is NOT intended for everyday use by all users wishing to use MPI
- This VO IS intended for users who wish to cooperate with the VT to make MPI support in EGI better
- The main reason for its establishment is to collect experience that will be later adopted by regular VOs

Ivan comments:

A User Community under SA3 would also be a good idea.


  • [Done] Action 6.1 (Zdenek): Distribute the new MPI VO endpoint among MPI VT members and ask MPI sites to support the new VO. Add new VO users to test MPI sites.
    • Participating resource providers:
      • NGI_CZ (20 cores) Status: configured, untested
      • NGI_NL Status: contacted
      • NGI_IT Status: contacted
  • Action 6.2 (Zdenek): Inform OMB about MPI VT status and work progress.
  • [Done] Action 7.1 (Zdenek/Alvaro): Set an estimated end date for MPI VT.


  • NGIs - confirmed:
    • BG: Aneta Karaivanova
    • CZ: Zdenek Sustr (leader)
    • ES/IBERGRID: Alvaro Simon (leader), Enol Fernandez, Iván Díaz, Alvaro Lopez, Pablo Orviz, Isabel Campos, Roberto Rosende Dopazo
    • GR: Dimitris Dellis, Marios Chatziangelou, Paschalis Korosoglou
    • HR: Emir Imamagic, Luko Gjenero
    • IE: John Walsh
    • IT: Daniele Cesini, Alessandro Costantini, Vania Boccia, Marco Bencivenni
    • PT: Gonçalo Borges
    • SK: Viera Sipkova, Viet Tran, Jan Astalos
    • UK: John Gordon
  • EGI.eu: Gergely Sipos, Karolis Eigelis, Tiziana Ferrari, Peter Solagna


VO MPI-Kickstart

The MPI-Kickstart Virtual Organization brings together sites and users interested in improving MPI reliability across EGI.

Useful Links

Note! VO membership needs to be renewed annually. The next renewal will be required at the turn of 2013/2014.

Environment Settings

VO_MPI_VOMSES="'mpi voms1.egee.cesnet.cz 15030 /DC=cz/DC=cesnet-ca/O=CESNET/CN=voms1.egee.cesnet.cz mpi 24'"
VO_MPI_WMS_HOSTS="wms1.egee.cesnet.cz wms2.egee.cesnet.cz" 
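For reference, a sketch of a JDL that an MPI-Kickstart user might submit through the WMS hosts above, following the MPI-START conventions from the documentation task; the executable, arguments, sandbox contents and slot count are illustrative placeholders, not part of the VO configuration:

```
JobType      = "Normal";
CpuNumber    = 4;
Executable   = "mpi-start-wrapper.sh";
Arguments    = "my-mpi-app OPENMPI";
InputSandbox = {"mpi-start-wrapper.sh", "my-mpi-app.c"};
StdOutput    = "std.out";
StdError     = "std.err";
Requirements = Member("MPI-START", other.GlueHostApplicationSoftwareRunTimeEnvironment)
               && Member("OPENMPI", other.GlueHostApplicationSoftwareRunTimeEnvironment);
```

The Requirements expression steers the job to CEs that publish both the MPI-START tag and an MPI flavour tag, which is exactly what the Task 3 checks verify.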


Task 1 (DONE): MPI documentation

Task 2 (DONE): Nagios probes

Task 3 (DONE): Information system

Task 4 (DONE): Accounting system

Task 5 (DONE): Batch system status

Task 6 (DONE): Gather information from MPI sites

  • Created the new MPI-Kickstart VO.
    • CESNET and CESGA are providing resources to test the new VO.
  • Gathered information from NGIs. MPI survey and sites status: