The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team using operations @ egi.eu.

VT MPI within EGI:Nagios
Latest revision as of 09:52, 12 June 2013

MPI VT Nagios Specifications

The previous SAM MPI testing probes depend on the information published by each individual site: if a site publishes the MPI-START tag, the resource is tested; otherwise it is not. This dependency on the information system makes it impossible to test sites which offer MPI functionality but do not advertise it, or sites which advertise MPI/Parallel support incorrectly. The introduction of a new service type in GOCDB (MPI or Parallel) would break this dependency and would allow the definition of an MPI test profile to probe:

  • GOCDB service
    • eu.egi.MPI
  • NAGIOS: The information published by the (MPI or Parallel) service.
    • eu.egi.mpi.EnvSanityCheck
  • NAGIOS: The (MPI or Parallel) functionality offered by the CREAM-CE.
    • eu.egi.mpi.SimpleJob, that is composed of:
      • eu.egi.mpi.simplejob.WN (MPI job functionality test)
      • eu.egi.mpi.simplejob.CREAMCE-JobSubmit (probe terminal status, and test for allocation support)
    • eu.egi.mpi.ComplexJob, which, similarly to SimpleJob, is composed of:
      • eu.egi.mpi.complexjob.WN (MPI job functionality test)
      • eu.egi.mpi.complexjob.CREAMCE-JobSubmit (probe terminal status, and test for allocation support)

Code for these probes is available at https://github.com/IFCA/eu.egi.mpi. A detailed description of each probe is given in the following sections.

eu.egi.mpi.EnvSanityCheck

  • Name: eu.egi.mpi.EnvSanityCheck
  • Target: CREAM-CE with eu.egi.MPI Service Type in GOCDB
  • Requirements: The service should be registered in GOCDB as an MPI (or Parallel) Service Type.
  • Purpose: Test the information published by the (MPI or Parallel) service.
  • Description: The probe should test if the service:
    • Publishes MPI-START tag under GlueHostApplicationSoftwareRunTimeEnvironment
    • Publishes MPI flavour tag under GlueHostApplicationSoftwareRunTimeEnvironment according to one of the following formats:
      • <MPI flavour>
      • <MPI flavour>-<MPI version>
      • <MPI flavour>-<MPI version>-<Compiler>
    • Has the GlueCEPolicyMaxSlotsPerJob variable set to a reasonable value (neither 0, 1, nor 999999999) for the queue where the MPI job is supposed to run.
    • Publishes reasonable GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxWallClockTime values (neither 0 nor 999999999), such that GlueCEPolicyMaxCPUTime is large enough to execute a parallel application that requests 4 slots and consumes the full GlueCEPolicyMaxWallClockTime of wall-clock time.
  • Dependencies: None.
  • Frequency: Each hour.
  • Time out: 120s
  • Expected behaviour:
Use Case → Probe Result
  • MPI-START tag is not present under GlueHostApplicationSoftwareRunTimeEnvironment → CRITICAL
  • No MPI flavour tag (following any of the proposed formats) is present under GlueHostApplicationSoftwareRunTimeEnvironment → CRITICAL
  • The probe reaches a timeout and the probe execution is canceled → CRITICAL
  • GlueCEPolicyMaxSlotsPerJob is equal to 0, 1, or 999999999 → WARNING
  • GlueCEPolicyMaxCPUTime is equal to 0, or GlueCEPolicyMaxWallClockTime is equal to 0 → WARNING
  • GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxWallClockTime are both equal to 999999999 (i.e. no limits on job duration) → WARNING
  • MPI-START tag and an MPI flavour tag are present under GlueHostApplicationSoftwareRunTimeEnvironment, GlueCEPolicyMaxSlotsPerJob is not 0, 1, or 999999999, and GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime >= 4 → OK
  • There is a timeout or a problem fetching information from LDAP → UNKNOWN
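The decision logic in the table above can be sketched as follows. This is a minimal illustration, not the actual probe code: it assumes the Glue attribute values have already been fetched from the site BDII via LDAP, and the flavour names and function name are illustrative. Timeout and LDAP-failure handling (CRITICAL/UNKNOWN) are omitted.

```python
# Illustrative sketch of the EnvSanityCheck decision logic; the real probe
# queries these Glue attributes from the site information system via LDAP.
def env_sanity_status(tags, max_slots, max_cpu_time, max_wall_time):
    """Map published Glue values to a Nagios status string."""
    if "MPI-START" not in tags:
        return "CRITICAL"  # MPI-START tag missing
    # Assumed flavour names, for illustration only.
    flavours = ("OPENMPI", "MPICH2", "MPICH")
    if not any(t == f or t.startswith(f + "-") for t in tags for f in flavours):
        return "CRITICAL"  # no MPI flavour tag in any accepted format
    if max_slots in (0, 1, 999999999):
        return "WARNING"   # MaxSlotsPerJob not set to a reasonable value
    if max_cpu_time in (0, 999999999) or max_wall_time in (0, 999999999):
        return "WARNING"   # CPU/wall-clock limits unset or meaningless
    if max_cpu_time / max_wall_time < 4:
        return "WARNING"   # not enough CPU time for a 4-slot parallel run
    return "OK"
```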

Possible issues

  • Ideally GlueCEPolicyMaxCPUTime / GlueCEPolicyMaxWallClockTime >= GlueCEPolicyMaxSlotsPerJob. The value 4 used in the comparison seemed reasonable (not too high) and coherent with eu.egi.mpi.ComplexJob, where 4 slots are requested.

eu.egi.mpi.SimpleJob (eu.egi.mpi.simplejob.WN & eu.egi.mpi.simplejob.CREAMCE-JobSubmit)

  • Name: eu.egi.mpi.SimpleJob
  • Probes: eu.egi.mpi.simplejob.WN (MPI job functionality test), eu.egi.mpi.simplejob.CREAMCE-JobSubmit (probe terminal status, and test for allocation support)
  • Target: CREAM-CE with eu.egi.MPI Service Type in GOCDB
  • Requirements: The service should be registered in GOCDB as an MPI (or Parallel) Service Type; Job submission requesting two slots (JobType="Normal"; CpuNumber=2)
  • Purpose: Test the MPI functionality with a minimum set of resources.
  • Description: The probe should check if:
    • MPI-START is able to find the type of scheduler.
    • MPI-START is able to determine if the environment for the MPI flavour under test is correctly set.
    • The application correctly compiles.
    • MPI-START is able to distribute the application binaries.
    • The application executes with the number of requested slots and finishes correctly.
  • Dependencies: Executed after eu.egi.mpi.envsanitycheck, and if it exits with WARNING or OK status.
  • Frequency: ? (What is the execution frequency of the current probe? It should be the same!)
  • Timeout: ? (What is the execution timeout of the current probe? It should be the same!)
  • Expected behaviour:
Use Case → Probe Result
  • MPI-START is not able to determine which kind of scheduler is used at the site → WARNING
  • MPI-START is not able to determine if the environment for the MPI flavour under test is correctly set → WARNING
  • The compilation of the parallel application fails → CRITICAL
  • MPI-START fails to distribute the application binaries (e.g. if a shared filesystem is not found) → CRITICAL
  • The MPI application execution failed → CRITICAL
  • MPI-START fails to collect the application results in the master node → CRITICAL
  • The application executed successfully but with fewer slots than the requested ones → CRITICAL
  • The probe reaches a timeout and the probe execution is canceled → WARNING
  • The probe reaches a timeout in two successive attempts and the probe execution is canceled → CRITICAL
  • The application executed successfully with the requested slots and MPI-START was able to collect the application results in the master node → OK
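The WN-side slot-count check can be illustrated like this. It is a sketch under the assumption that the test application prints one line per MPI rank (e.g. "rank 0 on wn01"); the output format and the function names are hypothetical, not taken from the actual probe.

```python
# Hypothetical check that a 2-slot SimpleJob run used the slots it requested.
# Assumes the test application prints one "rank <i> on <host>" line per rank.
def slots_used(app_output):
    """Count MPI ranks from the application's per-rank output lines."""
    return sum(1 for line in app_output.splitlines() if line.startswith("rank "))

def simplejob_slot_status(app_output, requested=2):
    """CRITICAL if the run used fewer slots than requested, else OK."""
    if slots_used(app_output) < requested:
        return "CRITICAL"
    return "OK"
```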

eu.egi.mpi.ComplexJob (eu.egi.mpi.complexjob.WN & eu.egi.mpi.complexjob.CREAMCE-JobSubmit)

  • Name: eu.egi.mpi.ComplexJob
  • Probes: eu.egi.mpi.complexjob.WN (MPI job functionality test), eu.egi.mpi.complexjob.CREAMCE-JobSubmit (probe terminal status, and test for allocation support)
  • Target: CREAM-CE with eu.egi.MPI Service Type in GOCDB
  • Requirements: The service should be registered in GOCDB as an MPI (or Parallel) Service Type; Job submission requesting 4 slots with 2 instances running in different dedicated machines (JobType="Normal"; HostNumber=2; SMPGranularity=2; WholeNodes=True)
  • Purpose: Test the MPI functionality and check that the recommendations from the EGEE MPI WG are being implemented.
  • Description: The probe should check if:
    • MPI-START is able to find the type of scheduler.
    • MPI-START is able to determine if the environment for the MPI flavour under test is correctly set.
    • The application correctly compiles.
    • MPI-START is able to distribute the application binaries.
    • The application executes with the number and characteristics of requested slots and finishes correctly.
    • MPI-START is able to collect the application results in the master node.
  • Dependencies: Executed after eu.egi.mpi.envsanitycheck, and if it exits with WARNING or OK status.
  • Frequency: once per day
  • Timeout: One day
  • Expected behaviour:
Use Case → Probe Result
  • MPI-START is not able to determine which kind of scheduler is used at the site → WARNING
  • MPI-START is not able to determine if the environment for the MPI flavour under test is correctly set → WARNING
  • The compilation of the parallel application fails → CRITICAL
  • MPI-START fails to distribute the application binaries → CRITICAL
  • The MPI application execution failed → CRITICAL
  • MPI-START fails to collect the application results in the master node → CRITICAL
  • The application executed successfully with fewer slots than the requested ones (for an LRMS supporting node-level allocation) → CRITICAL
  • The probe reaches a timeout and the probe execution is canceled → WARNING (this should not escalate to CRITICAL, because we cannot know how long a job requesting 4 slots may wait in the queue)
  • The application executed successfully with the requested slots and MPI-START was able to collect the application results in the master node → OK
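For the ComplexJob, the allocation check additionally has to verify the layout of the slots across nodes: 2 ranks on each of 2 distinct machines. A sketch under the same assumed per-rank output format as above (names illustrative, not the actual probe):

```python
from collections import Counter

# Hypothetical check of the ComplexJob allocation: 4 slots spread as 2 ranks
# on each of 2 distinct nodes (HostNumber=2; SMPGranularity=2). Assumes the
# application prints one "rank <i> on <host>" line per rank.
def complexjob_allocation_status(app_output, hosts_wanted=2, per_host=2):
    hosts = Counter(line.split(" on ")[1]
                    for line in app_output.splitlines()
                    if line.startswith("rank "))
    if len(hosts) != hosts_wanted or any(n != per_host for n in hosts.values()):
        return "CRITICAL"  # slot layout differs from the requested allocation
    return "OK"
```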


Possible issues

  • "WholeNodes=True" means dedicating complete nodes, no matter how many slots are available on them. What happens if the site provides large SMP machines with 32 slots per node? It might be a good idea to ask admins to dedicate nodes for this test, or at least to dedicate virtual slots. We have seen many problems with probe jobs waiting in production queues behind user jobs; the result is that this probe just keeps flapping between WARNING and OK.
  • The application used in the probe is a simple Pi calculation; more complex MPI functionality testing may be needed.