VT MPI within EGI

General Project Information

Motivation

Despite a dedicated SA3 activity to support MPI, there still seem to be significant issues in uptake and satisfaction amongst the user communities. This VT:

  • Works with user communities and projects that use MPI resources (e.g. ITER, MAPPER, A&A, etc.) to demonstrate that MPI can work successfully in EGI.
  • Sets up a VO on EGI with sites committed to supporting MPI jobs.
  • Improves the communication between MPI users and the developers of MPI support within EGI SA3.

Output

The VT is expected to produce the following outputs:

  • Materials (tutorials, white papers, etc.) about successful use cases of MPI on EGI that new communities can use to get started with MPI on EGI.
  • An MPI VO that provides:
    • dedicated CPUs for MPI jobs - mpi-kickstart.egi.eu VO information
    • MPI-specific test probes that can run on all sites, using the VO monitoring services of Ibergrid (EGI-InSPIRE VO Services group)
    • accounting for MPI jobs
    • user support
  • Improved communication channels with MPI users
  • The above set of resources, plus feedback to resource centers, user communities and technology providers on how to improve MPI within EGI.

Tasks

Task 1: MPI documentation

  • Assigned to: Enol / Paschalis Korosoglou

This documentation will be reviewed and we will decide what needs updating or extending.

  • Gergely comments:
For users, the following pages should be merged into a single wiki page:
https://wiki.egi.eu/wiki/MPI_User_Guide
https://wiki.egi.eu/wiki/MPI_User_manual
https://wiki.egi.eu/wiki/Parallel_Computing_Support_User_Guide
For site admins:
https://wiki.egi.eu/wiki/MAN03

Actions

  • [Done] Action 1.1 (Enol): Check and update the MPI wiki to include Zdenek's comments within the next week.
  • [In progress] Action 1.2 (Alvaro/all): Put the current MPI issues, technical information and mitigation plan into the MPI VT wiki.
  • [Open] Action 1.3 (Enol): Include an MPI users section.

Task 2: Nagios probes

  • Assigned to: Gonçalo Borges / John Walsh / Paschalis Korosoglou

The current Nagios probes should be reviewed so that they properly test the EGI MPI infrastructure.

John Walsh comments:

a) A non-critical test that tests MPI scalability above two nodes.
Ideally, I would like to see this test set to ceiling(average number of cores) x 2 + 1 slots.
This should increase the likelihood that the job runs on multiple nodes.
This test should only be run maybe once or twice a week, allowing at least a day for scheduling
(so as to be non-intrusive on valuable site resources).
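
A rough illustration of the proposed sizing (the 8-core node size below is an assumption):

  # sketch of the proposed probe size: ceiling(average cores) x 2 + 1
  import math

  avg_cores_per_node = 7.3                       # illustrative site average
  slots = math.ceil(avg_cores_per_node) * 2 + 1  # -> 17
  # 17 slots cannot fit on two 8-core nodes, so the job is very
  # likely to be scheduled across more than two nodes
  print(slots)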

b) Improve the baseline MPI tests.
We should test basic MPI API functionality (scatter, gather, etc.), rather than the simpler "hello world".
I will try to see whether I can assemble a basic test suite.
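
A minimal collective smoke test could look like the sketch below (written with mpi4py here for brevity; this is an assumption, and the production probes may well be C programs launched through mpi-start):

  # scatter/gather smoke test (sketch); run with e.g.:
  #   mpirun -np 4 python mpi_collectives_test.py
  from mpi4py import MPI

  comm = MPI.COMM_WORLD
  rank = comm.Get_rank()
  size = comm.Get_size()

  # root scatters one integer to every rank
  data = list(range(size)) if rank == 0 else None
  local = comm.scatter(data, root=0)
  assert local == rank, "scatter delivered the wrong element"

  # each rank transforms its element; root gathers and verifies the result
  result = comm.gather(local * 2, root=0)
  if rank == 0:
      assert result == [2 * i for i in range(size)], "gather mismatch"
      print("scatter/gather OK on %d ranks" % size)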

c) Following up on https://ggus.eu/ws/ticket_info.php?ticket=76755, I have suggested that we may not be using GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxObtainableCPUTime properly, and that site queues may not be correctly set up for MPI jobs
(i.e. not setting the torque queue resources_max.cput and resources_max.pcput values). Perhaps we can develop a (BDII?) sanity "warning" check for this?

Actions

  • [DONE] Action 2.1 (John W./Enol/Paschalis/Alvaro/Gonçalo): Create a new wiki section with the new MPI Nagios probe specifications to be developed by SA3. Follow the Nagios wiki procedure to include the new probes in production.
  • [In progress] Action 2.2 (Alvaro/Enol): Create a new GOCDB requirement to include an MPI service in GOCDB. Check whether different MPI services (one per flavour) are needed or not.
  • [DONE] Action 2.3 (Alvaro): Send out a Doodle poll to schedule the Nagios MPI probes meeting.
  • [Open] Action 2.4 (All), deadline 12/03/12: Review and comment on the new Nagios specifications: https://wiki.egi.eu/wiki/VT_MPI_within_EGI:Nagios

Task 3: Information system

  • Assigned to: Gonçalo Borges

Problems detecting MPI resources.

  • Checking for MPI availability -- mostly decided by checking the installed applications.
  • Not all sites report their MPI capability correctly.

Ivan comments:

For the BDII, the metrics portal checks the GlueHostApplicationSoftwareRunTimeEnvironment attribute against the *MPI* pattern.
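
Illustratively (the tag values below are invented), the check amounts to a substring match over the published tags:

  # sketch: *MPI* wildcard match over RunTimeEnvironment tags
  import re

  tags = ["MPI-START", "OPENMPI-1.4.3", "VO-ops-CE"]  # invented values
  print([t for t in tags if re.search("MPI", t)])     # the two MPI tags match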

John Walsh comments:

GGUS ticket: https://ggus.eu/ws/ticket_info.php?ticket=76755
The problem seems to be related to the torque settings for pcput and cput on each of the queues.

cput = Maximum amount of CPU time used by all processes in the job.
pcput = Maximum amount of CPU time used by any single process in the job.
walltime = Maximum amount of real time during which the job can be in the running state.

So, for example, on the "medium" queue on deimos.htc.biggrid.nl, the config is:
set queue medium resources_max.cput = 24:00:00
set queue medium resources_max.pcput = 24:00:00
set queue medium resources_max.walltime = 36:00:00

This would not be sufficient to allow a 6-core job to run for a full
24 hours; the job is likely to be removed after it has run for 4 hours.
We need to check that these queue settings are sensible for MPI jobs.
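
The arithmetic behind the 4-hour estimate, as a worked example:

  # why a 24 h cput cap stops a busy 6-core job after ~4 h of wall time
  cput_limit_hours = 24.0  # resources_max.cput: cap on the job's summed CPU time
  cores = 6
  # a fully busy 6-core job consumes 6 CPU-hours per wall-clock hour,
  # so it reaches the cap after:
  print(cput_limit_hours / cores)  # 4.0 wall-clock hours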

This is an interesting ticket that summarizes the handling of torque pcput (or the lack thereof) in the infosys:
https://savannah.cern.ch/bugs/?49653

Goncalo Borges comments:

The idea of the task was to assess the status of the published information while the new SAM MPI probes are not yet around.
I've developed a simple (non-optimized) Perl script to check the status of the most important variables published
for MPI. The algorithm is the following:
   1) Get certified sites from GOCDB
   2) Get GlueClusterUniqueID for the different sites
   3) Check which GlueClusterUniqueIDs support MPI.
        3.1) Inspect the RunTimeEnvironment
   4) Check which CEs are under a given GlueClusterUniqueID supporting MPI
       4.1) Inspect relevant GlueCE information

The script produces two files:
   - info.txt: with the relevant information for MPI per GlueClusterUniqueID / site 
               and per GlueCEInfoHostName / GlueClusterUniqueID
   - warn.txt: with the issues found per GlueClusterUniqueID / site and per GlueCEInfoHostName / GlueClusterUniqueID.

A warning entry is added to warn.txt following the directives we agreed on for the Nagios probes:
   - MPI-START tag is not published for a given GlueClusterUniqueID
   - No MPI flavour tag (OPENMPI or MPICH(2)) in any of the proposed formats is present:
               <MPI flavour>
               <MPI flavour>-<MPI version>
               <MPI flavour>-<MPI version>-<Compiler>
   - GlueCEPolicyMaxSlotsPerJob is 0 or 1 or the default 9999999 
   - GlueCEPolicyMaxWallClockTime is 0 or 1 or the default 9999999
   - GlueCEPolicyMaxCPUTime < GlueCEPolicyMaxWallClockTime
   - GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime < 4
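
The actual script is written in Perl; a minimal Python re-rendering of these rules (the input values in the example call are invented) might look like:

  # sketch of the agreed warning rules for one CE / cluster
  import re

  DEFAULT = 9999999  # the "undefined" placeholder value

  def mpi_warnings(rte_tags, max_slots, max_wct, max_cpu):
      warns = []
      if "MPI-START" not in rte_tags:
          warns.append("MPI-START tag not published")
      # flavour tags: <flavour>[-<version>[-<compiler>]]
      flavour = re.compile(r"^(OPENMPI|MPICH2?)(-[^-]+){0,2}$")
      if not any(flavour.match(t) for t in rte_tags):
          warns.append("no MPI flavour tag published")
      if max_slots in (0, 1, DEFAULT):
          warns.append("GlueCEPolicyMaxSlotsPerJob is 0, 1 or the default")
      if max_wct in (0, 1, DEFAULT):
          warns.append("GlueCEPolicyMaxWallClockTime is 0, 1 or the default")
      if max_cpu < max_wct:
          warns.append("MaxCPUTime < MaxWallClockTime")
      if max_wct and float(max_cpu) / max_wct < 4:
          warns.append("MaxCPUTime / MaxWallClockTime < 4")
      return warns

  # a CE publishing correct tags but an insufficient CPU-time budget
  print(mpi_warnings(["MPI-START", "OPENMPI-1.4.3"], 16, 2880, 2880))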

The script and the produced outputs were sent to the VT-MPI mailing list.

I would say the next steps here are:
   1./ Update MPI Wiki page on what should be published under GlueCEPolicyMaxSlotsPerJob.
   2./ Update MPI wiki page on the recommendation for GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxWallClockTime
       -GlueCEPolicyMaxCPUTime > GlueCEPolicyMaxWallClockTime
       -GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime >= 4
       -GlueCEPolicyMaxWallClockTime not equal to 0 , 1 or 9999999
   3./ It seems the right wiki where this information should be available is: 
       https://wiki.egi.eu/wiki/MAN03_MPI-Start_Installation_and_Configuration
   4./ Deliver the list of problems to SA1 together with the pointers to documentation. SA1 should then bring 
       the issues to the right forum.

Actions

  • [DONE] Action 3.1 (John Walsh/Gonçalo Borges): Until we have Nagios probes for this, Gonçalo will contact John to open GGUS tickets for MPI sites that are not publishing their batch system info correctly.
  • [DONE] Action 3.2 (John Walsh/Enol Fernandez): Check whether the current GLUE2 schema includes MPI static values.
    • MaxSlotsPerJob can be used for MPI jobs: the maximum number of slots which could be allocated to a single job. This value is not filled by the current LRMS information providers.
  • [DONE] Action 3.3 (Roberto Rosende): Raise a request to EMI to include MaxSlotsPerJob as a new value to be published by the batch system information providers.

Task 4: Accounting system

  • Assigned to: John Gordon, Iván Díaz

Implement the MPI accounting system (JRA1.4). Ivan comments:

There is no special accounting support; the only way to recognize MPI jobs is to check for jobs with >100% efficiency.
- There is still development to be done.
- APEL needs to give data for each batch system.

Enol Comments:

>100% efficiency may not be true for MPI jobs. What must be checked is the number of slots. 
That would also include other parallel jobs, but I don't think that's a major issue.
APEL should already give the number of slots used by the job; this data is easily available for all batch systems.
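
A small sketch (the records are invented) shows why the slot count is the safer signal:

  # efficiency vs. slot count on invented accounting records
  jobs = [
      {"id": "serial",   "slots": 1, "cpu_s": 3500,  "wall_s": 3600},
      {"id": "mpi-idle", "slots": 8, "cpu_s": 2000,  "wall_s": 3600},
      {"id": "mpi-busy", "slots": 8, "cpu_s": 27000, "wall_s": 3600},
  ]
  for j in jobs:
      eff = float(j["cpu_s"]) / j["wall_s"]  # > 1.0 suggests parallelism...
      parallel = j["slots"] > 1              # ...but slots are definitive
      print("%-8s eff=%4.2f parallel=%s" % (j["id"], eff, parallel))
  # "mpi-idle" is an 8-slot job at ~56% efficiency, which the >100%
  # heuristic would miss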

John's Comments:

> How many cores/CPUs are used by a job is not under the control of the user. The OS may move a job/process 
between CPUs/cores for its own reasons. It may also spawn system threads which run in parallel with the user process. 
By these means a superficially serial job could record in its accounting that it used multiple cores/CPUs. 
The requirement below about the accounting record containing the serial/parallel nature of the job begs 
the question 'How does the accounting parser find this information?' Is this recorded in the batch logs 
so that the parser could find it? 

Requirement: #3328

"Accounting system should keep track of the type of the job: MPI or serial.
This should be recorded in the Usage Record in order to be easily queried in
the accounting repository."

Actions

  • Action 4.1: Create the MPI accounting system (APEL and Accounting Portal).

Task 5: Batch system status

  • Assigned to: Roberto Rosende/Enol Fernandez

All batch systems must support MPI jobs. Check the current batch system status and issues. Roberto Rosende comments:

Starting work on MPI support for SGE, to be ready for EMI2.
The main problem with the batch system is that it is not receiving reliable info from the information system (not truly a batch system matter).

Alvaro Simon comments:

Two bugs were found during the first UMD verification of
WN/Torque + EMI-MPI.1.0. It is a Torque/Maui problem that affects all MPI jobs. Maui versions
prior to 3.3.4 do not correctly allocate all the nodes for the job execution. GGUS tickets:
- https://ggus.eu/ws/ticket_info.php?ticket=57828
- https://ggus.eu/ws/ticket_info.php?ticket=67870

Actions

  • [DONE] Action 5.1 (Alvaro): Ask about the batch system support issue in EMI. Raise this issue to EGI SA1/SA2.


Task 6: Gather information from MPI sites

  • Assigned to: Zdenek Sustr

After establishing the VO and contacting sites for resources, more requests for information can be added. Zdenek comments:

- MPI VO -- bring together sites and users interested in MPI
- This VO is NOT intended for everyday use by all users wishing to use MPI
- This VO IS intended for users who wish to cooperate with the VT to make MPI support in EGI better
- The main reason for its establishment is to collect experience that will be later adopted by regular VOs

Ivan comments:

A user community under SA3 would also be a good idea.

Actions

  • [In progress] Action 6.1 (Zdenek): Distribute the new MPI VO endpoint among the MPI VT members and ask MPI sites to support the new VO. Include new VO users to test MPI sites.
    • Participating resource providers:
      • NGI_CZ (20 cores) Status: configured, untested
      • NGI_NL Status: contacted
      • NGI_IT Status: contacted
  • Action 6.2 (Zdenek): Inform OMB about MPI VT status and work progress.
  • [DONE] Action 7.1 (Zdenek/Alvaro): Set an estimated end date for MPI VT.

Members

  • NGIs - confirmed:
    • BG: Aneta Karaivanova
    • CZ: Zdenek Sustr (leader)
    • ES/IBERGRID: Alvaro Simon (leader), Enol Fernandez, Iván Díaz, Alvaro Lopez, Pablo Orviz, Isabel Campos, Roberto Rosende Dopazo
    • GR: Dimitris Dellis, Marios Chatziangelou, Paschalis Korosoglou
    • HR: Emir Imamagic, Luko Gjenero
    • IE: John Walsh
    • IT: Daniele Cesini, Alessandro Costantini, Vania Boccia, Marco Bencivenni
    • PT: Gonçalo Borges
    • SK: Viera Sipkova, Viet Tran, Jan Astalos
    • UK: John Gordon
  • EGI.eu: Gergely Sipos, Karolis Eigelis, Tiziana Ferrari, Peter Solagna

Resources

VO MPI-Kickstart

The MPI-Kickstart Virtual Organization brings together sites and users interested in improving MPI reliability across EGI.

Useful Links

Environment Settings
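
The following VO variables (in YAIM site-info.def style, an assumption based on the variable naming) are what a site supporting the mpi VO would add to its configuration: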

VO_MPI_VOMS_SERVERS="'vomss://voms1.egee.cesnet.cz:8443/voms/mpi?/mpi'"
VO_MPI_QUEUES=""
VO_MPI_SW_DIR="$VO_SW_DIR/mpi"
VO_MPI_DEFAULT_SE=""
VO_MPI_STORAGE_DIR=""
VO_MPI_VOMSES="'mpi voms1.egee.cesnet.cz 15030 /DC=cz/DC=cesnet-ca/O=CESNET/CN=voms1.egee.cesnet.cz mpi 24'"
VO_MPI_VOMS_POOL_PATH=""
VO_MPI_VOMS_CA_DN="'/DC=cz/DC=cesnet-ca/O=CESNET CA/CN=CESNET CA 3'"
VO_MPI_WMS_HOSTS="wms1.egee.cesnet.cz wms2.egee.cesnet.cz" 

Progress

Task 1 (DONE): MPI documentation

Task 2 (DONE): Nagios probes

Task 3 (DONE): Information system

Task 4 (DONE): Accounting system

Task 5 (DONE): Batch system status

Task 6 (DONE): Gather information from MPI sites

  • Created the new MPI kickstart VO
    • CESNET and CESGA are providing resources to test the new VO.
  • Gathered information from NGIs. MPI survey and site status: