VT MPI within EGI:Nagios


MPI VT Nagios Specifications

The present SAM MPI testing infrastructure is completely dependent on the information published by each individual site: if a site publishes the MPI-START tag, the resource is tested; otherwise it is not. This dependency on the information system makes it impossible to test sites which offer MPI functionality but do not advertise it, or sites which advertise MPI/Parallel support incorrectly. The introduction of a new service type in GOCDB (MPI or Parallel) would break this dependency and would allow the definition of an MPI test profile to probe:

  • GOCDB service
    • org.mpi
  • NAGIOS: The information published by the (MPI or Parallel) service.
    • eu.egi.mpi.EnvSanityCheck
  • NAGIOS: The (MPI or Parallel) functionality offered by the site.
    • eu.egi.mpi.SimpleJob
    • eu.egi.mpi.ComplexJob


eu.egi.mpi.EnvSanityCheck

  • Name: eu.egi.mpi.EnvSanityCheck
  • Requirements: The service should be registered in GOCDB as an MPI (or Parallel) Service Type.
  • Purpose: Test the information published by the (MPI or Parallel) service.
  • Description: The probe should test whether the service (see the sketch after the expected-behaviour list below):
    • Publishes the MPI-START tag under GlueHostApplicationSoftwareRunTimeEnvironment.
    • Publishes an MPI flavour tag under GlueHostApplicationSoftwareRunTimeEnvironment in one of the following formats:
      • <MPI flavour>
      • <MPI flavour>-<MPI version>
      • <MPI flavour>-<MPI version>-<Compiler>
    • Has the GlueCEPolicyMaxSlotsPerJob variable set to a reasonable value (not 0, 1 or 999999999) for the queue where the MPI job is supposed to run.
    • Publishes reasonable GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxWallClockTime values (not 0 nor 999999999), such that GlueCEPolicyMaxCPUTime is large enough for a parallel application requesting 4 slots to run for the full GlueCEPolicyMaxWallClockTime (i.e. GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime >= 4).
  • Dependencies: None.
  • Frequency: Each hour.
  • Time out: 120s (Do you think this is too much for an LDAP query?)
  • Expected behaviour (Use Case → Probe Result):
    • MPI-START tag is not present under GlueHostApplicationSoftwareRunTimeEnvironment → CRITICAL
    • No MPI flavour tag (following any of the proposed formats) is present under GlueHostApplicationSoftwareRunTimeEnvironment → CRITICAL
    • The probe reaches a timeout and the probe execution is canceled → CRITICAL
    • GlueCEPolicyMaxSlotsPerJob is equal to 0, 1 or 999999999 → WARNING
    • GlueCEPolicyMaxCPUTime is equal to 0 or 999999999, GlueCEPolicyMaxWallClockTime is equal to 0 or 999999999, or GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime < 4 → WARNING
    • The MPI-START tag and an MPI flavour tag are present under GlueHostApplicationSoftwareRunTimeEnvironment, GlueCEPolicyMaxSlotsPerJob is not 0, 1 or 999999999, and GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime >= 4 → OK
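
The following Python sketch illustrates how the checks above could be wired together. It is an illustration only, not the actual probe: the ldapsearch invocation, the base DN, the flavour names in the regular expression and the helper names (glue_attrs, check_ce) are all assumptions.

  import re
  import subprocess

  # Standard Nagios exit codes
  OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
  BOGUS = (0, 1, 999999999)   # "unset" sentinel values in the Glue schema

  def glue_attrs(bdii, ce_id):
      """Fetch the relevant Glue attributes for one CE from the BDII."""
      cmd = ["ldapsearch", "-x", "-LLL", "-h", bdii, "-p", "2170",
             "-b", "o=grid", "(GlueCEUniqueID=%s)" % ce_id,
             "GlueHostApplicationSoftwareRunTimeEnvironment",
             "GlueCEPolicyMaxSlotsPerJob",
             "GlueCEPolicyMaxCPUTime",
             "GlueCEPolicyMaxWallClockTime"]
      return subprocess.check_output(cmd, timeout=120).decode()

  def check_ce(bdii, ce_id):
      try:
          out = glue_attrs(bdii, ce_id)
      except subprocess.TimeoutExpired:
          # CRITICAL as in the list above; the comments below argue for UNKNOWN
          return CRITICAL

      tags = re.findall(r"RunTimeEnvironment: (\S+)", out)
      if "MPI-START" not in tags:
          return CRITICAL
      # <MPI flavour>[-<version>[-<compiler>]], e.g. OPENMPI-1.4.3-GCC;
      # the accepted flavour names here are an assumption
      if not any(re.match(r"(OPENMPI|MPICH2?|LAM)(-[\w.]+){0,2}$", t)
                 for t in tags):
          return CRITICAL

      def value(attr):
          m = re.search(r"%s: (\d+)" % attr, out)
          return int(m.group(1)) if m else 0

      slots = value("GlueCEPolicyMaxSlotsPerJob")
      cpu = value("GlueCEPolicyMaxCPUTime")
      wall = value("GlueCEPolicyMaxWallClockTime")
      if slots in BOGUS:
          return WARNING
      if cpu in BOGUS or wall in BOGUS or cpu < 4 * wall:
          return WARNING
      return OK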

Comments

  • Enol:
Timeout of the BDII: I would stay on the safe side and keep the 120s; it is not that much, and there are surely other probes checking the BDII status that may detect if it is too slow. Related to that, the use case "The probe reaches a timeout and the probe execution is canceled" may be a WARNING instead of CRITICAL, because it is unrelated to the objective of the probe. Is there any kind of policy for this kind of thing?
  • Gonçalo:
Not that I'm aware of, but Emir should have a last look when we agree on this.
  • Enol:
I would change "One MPI flavour tag (following any of the proposed formats) is not present under GlueHostApplicationSoftwareRunTimeEnvironment" to "No MPI flavour tag (following any of the proposed formats) is present under..."; it seems easier to understand to me.
  • Enol:
GlueCEPolicyMaxCPUTime could be infinity; at IFCA, at least, we had limits only on the wall clock time. It may be necessary to adjust the use case to allow the 999999999 value.
  • Gonçalo:
Not sure that is really a good policy, but I guess it is OK as long as GlueCEPolicyMaxWallClockTime has a limit! For now, I'll remove this restriction.
  • Enol:
Ideally GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime >= GlueCEPolicyMaxSlotsPerJob, but maybe that's too much. If we had accounting, we could check the typical size of the parallel jobs and use that value. For the time being, a factor of 4 sounds small to me: 4-core machines are not that big anymore, and I guess MPI users would like to go beyond one machine.
  • Gonçalo:
I thought of this, but there is no easy way out. Administrators can set GlueCEPolicyMaxCPUTime = GlueCEPolicyMaxWallClockTime, and that is a valid option too. Imagine that they have
GlueCEPolicyMaxCPUTime=1000
GlueCEPolicyMaxWallClockTime=1000
This allows you to run 100 instances of an MPI job, each of them spending 10 units of wall clock time. That is also a valid approach, and administrators may not like to see a WARNING when that is exactly the setting they want.
The value 4 seemed reasonable (not too high) and coherent with eu.egi.mpi.ComplexJob, where I request 4 slots.
  • Alvaro
About frequencies and timeouts: maybe these values can be tuned during the development process. At this moment we do not know a priori which is the best value, and it depends on each probe.
  • Paschalis
I agree with Enol's comment regarding the GlueCEPolicyMaxCPUTime restriction. We (@HellasGrid) do not use this restriction either.
Regarding "The probe reaches a timeout and the probe execution is canceled", perhaps it makes more sense to mark it as UNKNOWN (not CRITICAL). I think this is a valid Nagios option as well.
  • Emir
Timeout of 120s is fine.
For "The probe reaches a timeout and the probe execution is canceled" I think it is better to report UNKNOWN.
This probe checks the validity of the information in the BDII, not the BDII itself. If the BDII is not working, the probe is unable to check the validity of the information.




eu.egi.mpi.SimpleJob

  • Name: eu.egi.mpi.SimpleJob
  • Requirements: The service should be registered in GOCDB as an MPI (or Parallel) Service Type; job submission requesting two slots on different machines (JobType="Normal"; CpuNumber=2; NodeNumber=2). A hypothetical JDL sketch is given after the expected-behaviour list below.
  • Purpose: Test the MPI functionality with a minimum set of resources.
  • Description: The probe should check that:
    • MPI-START is able to find the type of scheduler.
    • MPI-START is able to determine whether the environment for the MPI flavour under test is correctly set.
    • The application compiles correctly.
    • MPI-START is able to distribute the application binaries.
    • The application executes with the number of requested slots and finishes correctly.
    • MPI-START is able to collect the application results on the master node.
  • Dependencies: Executed after eu.egi.mpi.EnvSanityCheck, and only if that probe exits with WARNING or OK status.
  • Frequency: ? (What is the execution frequency of the current probe? It should be the same.)
  • Timeout: ? (What is the execution timeout of the current probe? It should be the same.)
  • Expected behaviour (Use Case → Probe Result):
    • MPI-START is not able to determine which kind of scheduler is used at the site → WARNING
    • MPI-START is not able to determine whether the environment for the MPI flavour under test is correctly set → WARNING
    • The compilation of the parallel application fails → CRITICAL
    • MPI-START fails to distribute the application binaries → CRITICAL
    • The MPI application execution fails → CRITICAL
    • MPI-START fails to collect the application results on the master node → CRITICAL
    • The application executes successfully with fewer slots than requested → CRITICAL
    • The probe reaches a timeout and the probe execution is canceled → WARNING
    • The probe reaches a timeout in two successive attempts and the probe execution is canceled → CRITICAL
    • The application executes successfully with the requested slots and MPI-START collects the application results on the master node → OK
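
As an illustration of the job request above, a JDL description along these lines could be used. Only JobType, CpuNumber and NodeNumber come from the requirements; the executable, sandboxes and Requirements expression are assumptions about how the probe payload might be packaged.

  // Hypothetical JDL for the eu.egi.mpi.SimpleJob submission.  Only
  // JobType, CpuNumber and NodeNumber are taken from the requirements
  // above; the rest is an illustrative guess at the payload packaging.
  JobType       = "Normal";
  CpuNumber     = 2;
  NodeNumber    = 2;
  Executable    = "mpi-start-wrapper.sh";
  Arguments     = "mpi-test OPENMPI";
  InputSandbox  = {"mpi-start-wrapper.sh", "mpi-test.c"};
  StdOutput     = "std.out";
  StdError      = "std.err";
  OutputSandbox = {"std.out", "std.err"};
  Requirements  = Member("MPI-START",
                  other.GlueHostApplicationSoftwareRunTimeEnvironment);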


Comments

  • Paschalis
Regarding the "MPI-START is not able to determine which kind of scheduler is used at the site." and "MPI-START is not able to determine if the environment for the MPI flavour under test is correctly set." being marked as WARNINGs. 
I am not an expert on the MPI-START intrinsics but would MPI-START work given these circumstances are not met? If yes then OK otherwise I would prefer having these marked as CRITICAL. 
Regarding having "MPI-START fails to distribute the application binaries" marked as critical I suppose for completeness it should read "MPI-START fails to distribute the application binaries if shared filesystem is not found". 
The same comment applies on "MPI-START fails to collect the application results in the master node" as well. 
On "The application executed successfully with less slots than the requested ones" I would suggest changing it to "The application executed successfully with less slots than the requested ones or on less slots than the requested ones"  (i.e. there are two possibilities to consider, 1) the application executes for some reason with 1-3 mpi processes, or 2) the application executes with 4 mpi processes on one node (or CPU) only. 
  • Emir
I would use a different naming scheme, e.g. eu.egi.MPI-SimpleJob.
Regarding "The probe reaches a timeout and the probe execution is canceled": having a first WARNING and then CRITICAL is supported by the org.sam.CE probe, but it has to be configured properly.
  • Goncalo
Some answers to Paschalis. Regarding his first question, MPI-START is not the only way to run MPI applications.
We must also keep in mind that a site may be using a scheduler for which MPI-START has no plugin.
The fact that MPI-START is not able to guess the batch system does not mean that the job will fail.



eu.egi.mpi.ComplexJob

  • Name: eu.egi.mpi.ComplexJob
  • Requirements: The service should be registered in GOCDB as an MPI (or Parallel) Service Type; job submission requesting 4 slots, with 2 instances running on each of 2 different dedicated machines (JobType="Normal"; CpuNumber=4; NodeNumber=2; SMPGranularity=2; WholeNodes=True). A hypothetical JDL sketch is given after the expected-behaviour list below.
  • Purpose: Test the MPI functionality and check that the recommendations from the EGEE MPI WG are being implemented.
  • Description: The probe should check if:
    • MPI-START is able to find the type of scheduler.
    • MPI-START is able to determine whether the environment for the MPI flavour under test is correctly set.
    • The application correctly compiles.
    • MPI-START is able to distribute the application binaries.
    • The application executes with the number and characteristics of requested slots and finishes correctly.
    • MPI-START is able to collect the application results in the master node.
  • Dependencies: Executed after eu.egi.mpi.EnvSanityCheck, and only if that probe exits with WARNING or OK status.
  • Frequency: Once per day?
  • Timeout: One day?
  • Expected behaviour (Use Case → Probe Result):
    • MPI-START is not able to determine which kind of scheduler is used at the site → WARNING
    • MPI-START is not able to determine whether the environment for the MPI flavour under test is correctly set → WARNING
    • The compilation of the parallel application fails → CRITICAL
    • MPI-START fails to distribute the application binaries → CRITICAL
    • The MPI application execution fails → CRITICAL
    • MPI-START fails to collect the application results on the master node → CRITICAL
    • The application executes successfully with fewer slots than requested → CRITICAL
    • The probe reaches a timeout and the probe execution is canceled → WARNING (this should not escalate to CRITICAL, because we do not know how long the job may wait in the queue given that it requests 4 dedicated slots)
    • The application executes successfully with the requested slots and MPI-START collects the application results on the master node → OK
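
For illustration, a hypothetical JDL fragment for this request would combine the attributes from the requirements above as follows; the payload is omitted, as in the SimpleJob sketch.

  // Hypothetical JDL fragment for eu.egi.mpi.ComplexJob: 4 slots in total,
  // 2 processes on each of 2 dedicated (whole) nodes.  Attribute values are
  // taken from the requirements above.
  JobType        = "Normal";
  CpuNumber      = 4;
  NodeNumber     = 2;
  SMPGranularity = 2;
  WholeNodes     = True;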


Comments

  • Enol:
Frequency: there is no need for this to be daily; maybe every two or three days is enough. The probe requests empty nodes, so emptying them means decreasing the throughput of the site. Although here we could argue that a good MPI service means that MPI jobs enter execution more or less promptly.
  • Enol:
Timeout: the easy value is just until the next probe is due to be submitted (1 day if daily, 2 days if every two days).
  • Enol:
One thing that is missing, and was discussed on the EVO, is what the MPI application will be. It is not really important for the definition of the probes and their statuses, but we have to agree on a minimum set of functionality to be tested.
  • Gonçalo:
That is another discussion. In I2G times we used a pi calculation (remember?!). It is simple enough and we can enhance it with MPI directives. Maybe we can reuse it?!
The drawback is that it does not test I/O, only communication and CPU between instances. But do we want to go in that direction?
  • Paschalis
Regarding the application, I suppose it would be good to have something that implements a few basic MPI features (i.e. one point-to-point communication, one collective, and perhaps one MPI I/O operation); see the sketch after these comments.
The pi calculation, if I am not wrong, implements one reduction (collective) operation at the end, right?
  • Emir
Doesn't "WholeNodes=True" mean dedicating complete node no matter how many slots are available on them? What happens if the site is providing large SMPs with 32 slots per node?

It might be a good idea to ask admins to dedicate nodes for this test, or at least dedicate virtual slots. 
We've seen a lot of problems with jobs waiting in production queue with users jobs. What will happen is that this probe will just keep on flapping from WARNING to OK.
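
To make the application discussion above concrete, here is a minimal sketch of a candidate test application of the kind Paschalis and Gonçalo describe: a pi integration with a single collective (reduction) at the end. The use of mpi4py is purely illustrative, an assumption rather than the agreed application.

  # Hypothetical candidate test application along the lines discussed above:
  # midpoint-rule integration of 4/(1+x^2) on [0,1], which converges to pi,
  # with a single collective (reduce) at the end.  mpi4py is an assumption;
  # the actual application was left undecided.
  from mpi4py import MPI

  comm = MPI.COMM_WORLD
  rank = comm.Get_rank()
  size = comm.Get_size()

  n = 10000000                 # number of integration intervals
  h = 1.0 / n
  # each rank sums a strided subset of the midpoints
  local = h * sum(4.0 / (1.0 + ((i + 0.5) * h) ** 2)
                  for i in range(rank, n, size))
  pi = comm.reduce(local, op=MPI.SUM, root=0)   # the single collective step

  if rank == 0:
      print("pi ~= %.12f computed with %d processes" % (pi, size))

Run, for instance, with mpirun -np 4 python mpi-pi.py. Adding one point-to-point exchange and one MPI-IO write, as suggested above, would keep the application small while covering the three feature classes mentioned.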