VT MPI within EGI

From EGIWiki



General Project Information

Motivation

Despite a dedicated SA3 activity to support MPI, there still appear to be significant issues in uptake of and satisfaction with MPI amongst the user communities. This Virtual Team (VT) was set up to investigate and address these issues.

Output

The output of this project is a report that describes the work carried out by the project and the achievements of its activities, and captures the issues and actions that were identified by the project but will be dealt with by EGI members outside of the Virtual Team project.

The project and the report cover six main areas of work to improve MPI within EGI:

  1. Documentation: Improved documentation has been prepared in the EGI wiki for site administrators and for application developers. It provides guidance as to how to configure and use MPI resources correctly.
  2. Nagios probes: New monitoring probes for the EGI Service Availability Monitor (SAM) have been defined. These will be implemented and put into production by the Heavy User Community and Operations teams.
  3. Information system: The typical problems with the registration of MPI resources have been collected and reported to Operations. The Nagios probes have been designed to be able to detect these problems.
  4. Accounting: Issues with collecting accounting information about parallel applications have been collected and reported to the responsible technology developers and providers, with a request that they be addressed.
  5. Batch system integration: Issues with interfacing MPI applications with some of the local batch job schedulers used in EGI have been collected and addressed.
  6. MPI VO: A new VO that includes only correctly configured MPI sites has been set up on the production infrastructure. The VO can be used to port MPI applications to EGI. During the demo, VT members will show how many MPI resources are available in EGI and how to use them. Real MPI applications will be submitted to demonstrate the capabilities of the VO.

The list of open actions lists those MPI-related issues that have to be followed up by the community outside of this VT project. These actions have already been submitted to the responsible parties in EGI in the form of feedback, recommendations and software bugs. The EGI-InSPIRE SA3 MPI team will supervise overall progress on the actions and will record this in the table as well as in the EGI-InSPIRE project quarterly reports.

Report: MPI within EGI - https://documents.egi.eu/document/1260


Open Actions after MPI VT

Open actions to be followed up after the MPI VT lifetime.

Tasks

Task 1: MPI documentation

This documentation will be reviewed and we will decide what needs updating or extending.

For users:
https://wiki.egi.eu/wiki/MPI_User_Guide
https://wiki.egi.eu/wiki/MPI_User_manual
https://wiki.egi.eu/wiki/Parallel_Computing_Support_User_Guide
These should be merged into a single wiki page.
For site admins:
https://wiki.egi.eu/wiki/MAN03

Actions

Task 2: Nagios probes

The current Nagios probes should be reviewed to test the EGI MPI infrastructure.

John Walsh comments:

a) A non-critical test that checks MPI scalability above two nodes.
Ideally, I would like to see this test set to ceiling(average number of cores) x 2 + 1.
This should increase the likelihood that the job runs on multiple nodes.
This test should only be run maybe once or twice a week and allow at least a day for scheduling
(so as to be non-intrusive on valuable site resources).
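John's sizing rule can be sketched as follows (a minimal illustration; `mpi_probe_slots` is a hypothetical helper, not part of any existing probe):

```python
import math

def mpi_probe_slots(avg_cores_per_node: float) -> int:
    """Slot count for the scalability probe: ceiling(average cores) x 2 + 1.

    Asking for one slot more than two full nodes' worth makes it very
    likely that the scheduler has to spread the job over at least three nodes.
    """
    return math.ceil(avg_cores_per_node) * 2 + 1

# A site whose worker nodes average 7.2 cores would be probed with 17 slots:
print(mpi_probe_slots(7.2))  # -> 17
```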

b) Improve the baseline MPI tests.
We should test basic MPI API functionality (scatter, gather, etc.), rather than the simpler "hello world".
I will try to see whether I can assemble a basic test-suite.
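As an illustration of what such a test-suite would assert, here is the scatter/gather round-trip property in plain Python (no MPI runtime involved; a real test would exercise the MPI collectives themselves):

```python
def scatter(data, nprocs):
    """Root-side view of a scatter: split data into nprocs contiguous chunks."""
    chunk = (len(data) + nprocs - 1) // nprocs  # ceiling division
    return [data[i * chunk:(i + 1) * chunk] for i in range(nprocs)]

def gather(chunks):
    """Root-side view of a gather: reassemble the chunks in rank order."""
    out = []
    for c in chunks:
        out.extend(c)
    return out

# The property the probe would check: scatter followed by gather is the identity.
data = list(range(10))
assert gather(scatter(data, 4)) == data
```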

c) Following up on https://ggus.eu/ws/ticket_info.php?ticket=76755, I have suggested that we may not be using GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxObtainableCPUTime properly, and that site queues may not be correctly set up for MPI jobs
(i.e. not setting the Torque queue resources_max.cput and resources_max.pcput values). Perhaps we can develop a (BDII?) sanity "warning" check for this?

Actions

Task 3: Information system

Problems detecting MPI resources.

Ivan comments:

For BDII, the metrics portal checks the GlueHostApplicationSoftwareRunTimeEnvironment property for the *MPI* regular expression.

John Walsh comments:

GGUS ticket: https://ggus.eu/ws/ticket_info.php?ticket=76755
The problem seems to be related to the Torque settings for pcput and cput on each of the queues.

cput = Maximum amount of CPU time used by all processes in the job.
pcput = Maximum amount of CPU time used by any single process in the job.
walltime = Maximum amount of real time during which the job can be in the running state.

So, for example, on one of the "medium" queues on
deimos.htc.biggrid.nl, the config is:
set queue medium resources_max.cput = 24:00:00
set queue medium resources_max.pcput = 24:00:00
set queue medium resources_max.walltime = 36:00:00

This would not be sufficient to allow a 6-core job to run for a full
24 hours; such a job is likely to be removed after it has run for about 4 hours, once its six processes together exhaust the 24-hour cput limit.
We need to check that these queue settings are sensible for MPI jobs.
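The 4-hour figure follows from the cput limit being an aggregate over all processes: a fully-busy N-process job burns cput N times faster than wallclock. A small sketch of that arithmetic (hypothetical helper):

```python
def effective_max_walltime(ncores: int, cput_h: float, walltime_h: float) -> float:
    """Worst-case wallclock hours available to a fully-busy ncores-process job:
    the shared cput budget runs out after cput/ncores hours, unless the
    walltime limit is lower still."""
    return min(walltime_h, cput_h / ncores)

# The "medium" queue above (cput=24h, walltime=36h) for a 6-core job:
print(effective_max_walltime(6, 24, 36))  # -> 4.0
```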

This is an interesting ticket that summarizes the handling of Torque pcput (or the lack thereof) in the infosys.
https://savannah.cern.ch/bugs/?49653

Goncalo Borges comments:

The idea of the task was to assess the status of the information published while the new SAM MPI probes are not yet available.
I've developed a simple (non-optimized) Perl script to check the status of the most important variables published
for MPI. The algorithm is the following:
   1) Get certified sites from GOCDB
   2) Get GlueClusterUniqueID for the different sites
   3) Check which GlueClusterUniqueIDs support MPI.
       3.1) Inspect the RunTimeEnvironment
   4) Check which CEs are under a given GlueClusterUniqueID supporting MPI
       4.1) Inspect relevant GlueCE information
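The original is a Perl script; selection steps 3 and 4 can be rendered roughly as below, over BDII entries already fetched and parsed into dicts (the attribute names follow the GLUE 1.3 schema mentioned above; the data layout is illustrative):

```python
def mpi_clusters(clusters):
    """Step 3: keep clusters whose RunTimeEnvironment advertises an MPI tag."""
    return [c for c in clusters
            if any("MPI" in tag
                   for tag in c["GlueHostApplicationSoftwareRunTimeEnvironment"])]

def ces_for_cluster(ces, cluster_id):
    """Step 4: the CEs sitting under a given GlueClusterUniqueID."""
    return [ce for ce in ces if ce["GlueClusterUniqueID"] == cluster_id]

clusters = [
    {"GlueClusterUniqueID": "ce01.example.org",
     "GlueHostApplicationSoftwareRunTimeEnvironment": ["MPI-START", "OPENMPI-1.4.3"]},
    {"GlueClusterUniqueID": "ce02.example.org",
     "GlueHostApplicationSoftwareRunTimeEnvironment": ["GLITE-3_2_0"]},
]
print([c["GlueClusterUniqueID"] for c in mpi_clusters(clusters)])  # -> ['ce01.example.org']
```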

The script produces two files:
   - info.txt: with the relevant information for MPI per GlueClusterUniqueID / site 
               and per GlueCEInfoHostName / GlueClusterUniqueID
   - warn.txt: with the issues found per GlueClusterUniqueID / site and per GlueCEInfoHostName / GlueClusterUniqueID.

A warning entry is added to warn.txt following the directives we have agreed for the Nagios probes:
   - MPI-START tag is not published for a given GlueClusterUniqueID
   - No MPI flavour tag (OPENMPI or MPICH(2)) is present in any of the proposed formats:
               <MPI flavour>
               <MPI flavour>-<MPI version>
               <MPI flavour>-<MPI version>-<Compiler>
   - GlueCEPolicyMaxSlotsPerJob is 0 or 1 or the default 9999999 
   - GlueCEPolicyMaxWallClockTime is 0 or 1 or the default 9999999
   - GlueCEPolicyMaxCPUTime < GlueCEPolicyMaxWallClockTime
   - GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime < 4
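A compact Python rendering of these warning rules (the original script is Perl; the regular expression for the flavour-tag formats and the function name are illustrative):

```python
import re

# FLAVOUR, FLAVOUR-VERSION or FLAVOUR-VERSION-COMPILER
FLAVOUR_RE = re.compile(r"^(OPENMPI|MPICH2?)(-[^-]+(-[^-]+)?)?$")

DEFAULT_VALUES = {0, 1, 9999999}

def mpi_warnings(tags, max_slots, max_wct, max_cpt):
    """Return the warning strings for one GlueClusterUniqueID / CE."""
    warns = []
    if "MPI-START" not in tags:
        warns.append("MPI-START tag is not published")
    if not any(FLAVOUR_RE.match(t) for t in tags):
        warns.append("no MPI flavour tag in an accepted format")
    if max_slots in DEFAULT_VALUES:
        warns.append("GlueCEPolicyMaxSlotsPerJob is 0, 1 or the default 9999999")
    if max_wct in DEFAULT_VALUES:
        warns.append("GlueCEPolicyMaxWallClockTime is 0, 1 or the default 9999999")
    if max_cpt < max_wct:
        warns.append("GlueCEPolicyMaxCPUTime < GlueCEPolicyMaxWallClockTime")
    elif max_cpt / max_wct < 4:
        warns.append("GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime < 4")
    return warns

# A cleanly configured CE produces no warnings:
print(mpi_warnings(["MPI-START", "OPENMPI-1.4.3-GCC"], 64, 72, 288))  # -> []
```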

The script and the produced outputs were sent to the VT-MPI mailing list.

I would say the next steps here are:
   1./ Update MPI Wiki page on what should be published under GlueCEPolicyMaxSlotsPerJob.
   2./ Update MPI wiki page on the recommendation for GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxWallClockTime
       -GlueCEPolicyMaxCPUTime > GlueCEPolicyMaxWallClockTime
       -GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime >= 4
       -GlueCEPolicyMaxWallClockTime not equal to 0 , 1 or 9999999
   3./ It seems the right wiki where this information should be available is: 
       https://wiki.egi.eu/wiki/MAN03_MPI-Start_Installation_and_Configuration
   4./ Deliver the list of problems to SA1 together with the pointers to documentation. SA1 should then bring
       the issues to the right forum.

Actions

Task 4: Accounting system

Implement the MPI accounting system (JRA1.4).

Ivan comments:

There is no special accounting support. The only way to recognise MPI jobs is to check for jobs with >100% efficiency.
- There is still development to be done.
- APEL needs to provide data for each batch system.

Enol Comments:

>100% efficiency may not hold for MPI jobs. What must be checked is the number of slots.
That would also include other parallel jobs, but I don't think that's a major issue.
APEL should already give the number of slots used by the job; this data is easily available for all batch systems.
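Enol's check amounts to something like the following over a parsed usage record (the `Processors` field name follows the OGF Usage Record convention; the record layout here is illustrative):

```python
def looks_parallel(record: dict) -> bool:
    """Flag a job as parallel (MPI or otherwise) by its slot count,
    rather than by the unreliable >100% efficiency heuristic."""
    return record.get("Processors", 1) > 1

def efficiency(record: dict) -> float:
    """CPU time divided by wallclock x slots; values near 1.0 mean fully busy."""
    return record["CpuDuration"] / (record["WallDuration"] * record.get("Processors", 1))

job = {"CpuDuration": 7200, "WallDuration": 3600, "Processors": 4}  # seconds
print(looks_parallel(job), efficiency(job))  # -> True 0.5
```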

John's Comments:

> How many cores/CPUs are used by a job is not under the control of the user. The OS may move a job/process
between CPUs/cores for its own reasons. It may also spawn system threads which run in parallel with the user process.
By these means a superficially serial job could record in its accounting that it used multiple cores/CPUs.
The requirement below, that the accounting record contain the serial/parallel nature of the job, begs
the question: how does the accounting parser find this information? Is this recorded in the batch logs
so that the parser could find it?

Requirement: #3328

"Accounting system should keep track of the type of the job: MPI or serial.
This should be recorded in the Usage Record in order to be easily queried in
the accounting repository."

Actions

Task 5: Batch system status

All batch systems must support MPI jobs. Check the current batch system status and issues.

Roberto Rosende comments:

Work is starting on MPI support for SGE, to be ready for EMI 2.
The main problem with the batch system is that it is not receiving reliable information from the information system (not truly a batch system matter).

Alvaro Simon comments:

Two bugs were found during the first UMD verification of
WN/Torque + EMI-MPI 1.0. It is a Torque/Maui problem that affects all MPI jobs: Maui versions
prior to 3.3.4 do not correctly allocate all the nodes for job execution. GGUS tickets:
- https://ggus.eu/ws/ticket_info.php?ticket=57828
- https://ggus.eu/ws/ticket_info.php?ticket=67870

Actions


Task 6: Gather information from MPI sites

After establishing the VO, and contacting sites for resources, more requests for information can be added.

Zdenek comments:

- MPI VO -- bring together sites and users interested in MPI
- This VO is NOT intended for everyday use by all users wishing to use MPI
- This VO IS intended for users who wish to cooperate with the VT to make MPI support in EGI better
- The main reason for its establishment is to collect experience that will be later adopted by regular VOs

Ivan comments:

User Community under SA3 would also be a good idea.

Actions

Members

Resources

VO MPI-Kickstart

The MPI-Kickstart Virtual Organization brings together sites and users interested in improving MPI reliability across EGI.

Useful Links

Note! VO membership needs to be renewed annually. The next renewal will be required at the turn of 2013/2014.

Environment Settings

VO_MPI_VOMS_SERVERS="'vomss://voms1.egee.cesnet.cz:8443/voms/mpi?/mpi'"
VO_MPI_QUEUES=""
VO_MPI_SW_DIR="$VO_SW_DIR/mpi"
VO_MPI_DEFAULT_SE=""
VO_MPI_STORAGE_DIR=""
VO_MPI_VOMSES="'mpi voms1.egee.cesnet.cz 15030 /DC=cz/DC=cesnet-ca/O=CESNET/CN=voms1.egee.cesnet.cz mpi 24'"
VO_MPI_VOMS_POOL_PATH=""
VO_MPI_VOMS_CA_DN="'/DC=cz/DC=cesnet-ca/O=CESNET CA/CN=CESNET CA 3'"
VO_MPI_WMS_HOSTS="wms1.egee.cesnet.cz wms2.egee.cesnet.cz" 

Progress

Task 1(DONE): MPI documentation

Task 2(DONE): Nagios probes

Task 3(DONE): Information system

Task 4(DONE): Accounting system

Task 5(DONE): Batch system status

Task 6(DONE): Gather information from MPI sites
