VT MPI within EGI
- 1 General Project Information
- 2 Motivation
- 3 Output
- 4 Tasks
- 5 Members
- 6 Resources
- 7 Progress
General Project Information
- Leader: Alvaro Simon (CESGA, Spain) and Zdenek Sustr (CESNET, Czech Republic)
- Mailing List: vt-mpi at mailman.egi.eu
- Status: Active
- Start Date: 10/Nov/2011
- End Date:
- Meetings: MPI Virtual Team meetings
Motivation
Despite a dedicated SA3 activity to support MPI, there still seem to be significant issues in uptake and satisfaction among the user communities. This VT:
- Works with user communities and projects that use MPI resources (e.g. ITER, MAPPER, A&A) to demonstrate that MPI can work successfully in EGI.
- Sets up a VO on EGI with sites committed to support MPI jobs.
- Improves the communication between MPI users and the developers of MPI support within EGI SA3.
Output
The VT is expected to produce the following outputs:
- Materials (tutorials, white papers, etc.) about successful use cases of MPI on EGI that new communities can draw on to adopt MPI on EGI.
- An MPI VO that provides:
- dedicated CPUs for MPI jobs
- MPI-specific test probes that can run on all sites, using the Ibergrid VO monitoring services (EGI-InSPIRE VO Services group)
- accounting for MPI jobs
- user support
- Improved communication channels with MPI users
- Feedback, based on the above set of resources, to resource centres, user communities and technology providers on how to improve MPI within EGI.
Tasks
Task 1: MPI documentation
- Assigned to: Enol / Paschalis Korosoglou
This documentation will be reviewed, and we will decide what needs updating or extending.
- Gergely comments:
For users: https://wiki.egi.eu/wiki/MPI_User_Guide and https://wiki.egi.eu/wiki/MPI_User_manual should be merged into a single wiki page.
For site admins: https://wiki.egi.eu/wiki/MAN03
- 1/1 Improve and review MPI documentation: the documentation was updated; it will be reviewed by the VT members.
Task 2: Nagios probes
- Assigned to: Emir Imamagic / John Walsh / Paschalis Korosoglou
The current Nagios probes should be reviewed to ensure they properly test the EGI MPI infrastructure.
- New nagios probes requirements: https://wiki.egi.eu/wiki/Nagios-requirements.html
John Walsh comments:
a) A non-critical test that checks MPI scalability above two nodes. Ideally, I would like to see this test set to ceiling(average number of cores) x 2 + 1 slots. This should increase the likelihood that the job runs on multiple nodes. This test should only be run maybe once or twice a week, and should allow at least a day for scheduling (so as to be non-intrusive on valuable site resources).
b) Improve the baseline MPI tests. We should test basic MPI API functionality (scatter, gather, etc.) rather than the simpler "hello world". I will try to see whether I can assemble a basic test suite.
c) Following up on https://ggus.eu/ws/ticket_info.php?ticket=76755, I have suggested that we may not be using GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxObtainableCPUTime properly, and that site queues may not be correctly set up for MPI jobs (i.e. not setting the Torque queue resources_max.cput and resources_max.pcput values). Perhaps we can develop a (BDII?) sanity "warning" check for this?
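The sizing rule from point (a) can be sketched as a small helper. This is an illustration only; the function name is invented and the averaging over site core counts is an assumption, not an existing probe API:

```python
import math


def probe_slot_count(avg_cores_per_node: float) -> int:
    """Slot count for the scalability probe: ceiling(average number of
    cores per node) x 2 + 1.  Requesting one slot more than two full
    nodes can hold forces the job to span at least three nodes, which
    is the point of the test."""
    return math.ceil(avg_cores_per_node) * 2 + 1
```

For example, a site whose worker nodes average 8 cores would get a 17-slot probe job, which cannot be packed onto two 8-core nodes.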
- 2/1 Create and test new MPI nagios probes:
- Check MPI information system sanity.
- Use more than one physical CPU.
Task 3: Information system
- Assigned to: Gonçalo Borges
Problems detecting MPI resources.
- Checking for MPI availability -- mostly decided by checking installed applications.
- Not all sites reporting MPI capability correctly
For BDII, the metrics portal checks the GlueHostApplicationSoftwareRunTimeEnvironment property for the *MPI* regular expression.
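As a rough illustration of that check, the sketch below scans GlueHostApplicationSoftwareRunTimeEnvironment values in an LDIF dump for tags matching the MPI pattern. The sample data and function name are invented for illustration; the actual metrics portal implementation may differ:

```python
import re

# Hypothetical excerpt of a BDII LDIF dump (illustrative values only).
LDIF_SAMPLE = """\
GlueHostApplicationSoftwareRunTimeEnvironment: MPI-START
GlueHostApplicationSoftwareRunTimeEnvironment: OPENMPI-1.4
GlueHostApplicationSoftwareRunTimeEnvironment: VO-ops-CLOUD
"""


def mpi_tags(ldif_text: str) -> list[str]:
    """Return the runtime-environment tags matching the *MPI* pattern,
    mirroring the regular-expression check described above."""
    tags = re.findall(
        r"^GlueHostApplicationSoftwareRunTimeEnvironment:\s*(\S+)",
        ldif_text,
        re.MULTILINE,
    )
    return [t for t in tags if re.search(r"MPI", t)]
```

A site would be flagged as MPI-capable when this list is non-empty.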
John Walsh comments:
GGUS ticket: https://ggus.eu/ws/ticket_info.php?ticket=76755 — the problem seems to be related to the Torque settings for pcput and cput on each of the queues.
- cput: maximum amount of CPU time used by all processes in the job.
- pcput: maximum amount of CPU time used by any single process in the job.
- walltime: maximum amount of real time during which the job can be in the running state.
So, for example, the "medium" queue on deimos.htc.biggrid.nl is configured as:
  set queue medium resources_max.cput = 24:00:00
  set queue medium resources_max.pcput = 24:00:00
  set queue medium resources_max.walltime = 36:00:00
This would not be sufficient to allow a 6-core job to run for a full 24 hours: the job accumulates CPU time on all 6 cores at once, so it hits the 24-hour cput limit and is likely to be removed after only about 4 hours of wall time. We need to check that these queue settings are sensible for MPI jobs.
This is an interesting ticket that summarises how Torque's pcput is handled (or not) in the information system: https://savannah.cern.ch/bugs/?49653
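The sanity "warning" check suggested above could look roughly like this. It is a sketch under the assumption that queue limits arrive as HH:MM:SS strings; the function names are hypothetical:

```python
def hms_to_seconds(hms: str) -> int:
    """Convert a Torque-style HH:MM:SS limit to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s


def cput_limits_ok(ncores: int, walltime: str, cput: str, pcput: str) -> bool:
    """Check that queue limits can accommodate an ncores-way MPI job
    running for the full walltime: such a job accumulates
    ncores x walltime of total CPU time, so resources_max.cput must be
    at least that large, and resources_max.pcput at least one walltime."""
    wall = hms_to_seconds(walltime)
    return (hms_to_seconds(cput) >= ncores * wall
            and hms_to_seconds(pcput) >= wall)
```

Against the example queue above, a 6-core job with a 24-hour walltime fails the check (it would need 144 hours of cput), while the same job capped at 4 hours passes.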
- 3/1 Check information system sanity for MPI sites:
- Open GGUS tickets to fix wrong batch system values.
Task 4: Accounting system
- Assigned to: John Gordon, Iván Díaz
Implement an MPI accounting system (JRA1.4). Iván comments:
There is no special accounting support; the only way to recognise MPI jobs is to look for jobs with >100% efficiency.
- Development still to be done.
- APEL needs to provide data for each batch system.
>100% efficiency may not hold for all MPI jobs. What should be checked instead is the number of slots. That would also catch other parallel jobs, but I don't think that is a major issue. APEL should already provide the number of slots used by a job; this data is readily available for all batch systems.
"Accounting system should keep track of the type of the job: MPI or serial. This should be recorded in the Usage Record in order to be easily queried in the accounting repository."
- 4/1 Create MPI accounting system (APEL and Accounting Portal).
Task 5: Batch system status
- Assigned to: Roberto Rosende/Enol Fernandez
All batch systems must support MPI jobs. Check the current batch system status and known issues. Roberto Rosende comments:
Work is starting on MPI support for SGE, to be ready for EMI-2. The main problem with the batch system is that it does not receive reliable information from the information system (not truly a batch system matter).
Alvaro Simon comments:
Two bugs were found during the first UMD verification of WN/Torque + EMI-MPI 1.0. This is a Torque/Maui problem that affects all MPI jobs: Maui versions prior to 3.3.4 do not correctly allocate all the nodes for the job execution. GGUS tickets:
- https://ggus.eu/ws/ticket_info.php?ticket=57828
- https://ggus.eu/ws/ticket_info.php?ticket=67870
- 5/1 MPI status report for each batch system (TORQUE,GE...)
Task 6: Gather information from MPI sites
- Assigned to: Zdenek Sustr
After establishing the VO, and contacting sites for resources, more requests for information can be added. Zdenek comments:
- MPI VO: bring together sites and users interested in MPI.
- This VO is NOT intended for everyday use by all users wishing to use MPI.
- This VO IS intended for users who wish to cooperate with the VT to make MPI support in EGI better.
- The main reason for its establishment is to collect experience that will later be adopted by regular VOs.
Setting up a user community under SA3 would also be a good idea.
- 6/1 Create and manage the mpi VO: provide MPI resources from different sites.
Members
- NGIs - confirmed:
- CZ: Zdenek Sustr (leader)
- ES/IBERGRID: Alvaro Simon (leader), Enol Fernandez, Iván Díaz, Alvaro Lopez, Pablo Orviz, Isabel Campos, Roberto Rosende Dopazo
- GR: Dimitris Dellis, Marios Chatziangelou, Paschalis Korosoglou
- HR: Emir Imamagic, Luko Gjenero
- IE: John Walsh
- IT: Daniele Cesini, Alessandro Costantini, Vania Boccia, Marco Bencivenni
- PT: Gonçalo Borges
- SK: Viera Sipkova, Viet Tran, Jan Astalos
- UK: John Gordon
- EGI.eu: Gergely Sipos, Karolis Eigelis, Tiziana Ferrari, Peter Solagna
Progress
- Task 1
- Task 2
- Task N