The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team at operations @ egi.eu.

Resource Centres OLA and Resource infrastructure Provider OLA reports

Introduction

EGI availability and reliability statistics are produced every month for all certified production sites. The current version of the Site-NGI Operational Level Agreement defines the following requirements:

  • minimum tolerated availability: 70%,
  • minimum tolerated reliability: 75%.

For each monthly report, underperforming sites are asked through a GGUS ticket to justify their poor performance.

Suspension procedure: sites which have an availability of less than 50% for three consecutive months will be suspended, i.e. removed from the production infrastructure.
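The thresholds above can be expressed as two simple checks. The following is a minimal sketch, not the logic of any EGI tool: the function names and the input format (monthly percentages ordered oldest to newest) are assumptions made for illustration.

```python
from typing import Sequence

# Thresholds taken from the requirements stated above (percent).
MIN_AVAILABILITY = 70.0
MIN_RELIABILITY = 75.0
SUSPENSION_AVAILABILITY = 50.0
SUSPENSION_MONTHS = 3  # consecutive months below the suspension threshold


def misses_targets(availability: float, reliability: float) -> bool:
    """Return True if a monthly result is below either minimum tolerated value."""
    return availability < MIN_AVAILABILITY or reliability < MIN_RELIABILITY


def eligible_for_suspension(monthly_availability: Sequence[float]) -> bool:
    """Return True if the three most recent months are all below 50% availability.

    `monthly_availability` is assumed to be ordered oldest to newest.
    """
    recent = monthly_availability[-SUSPENSION_MONTHS:]
    return (len(recent) == SUSPENSION_MONTHS
            and all(a < SUSPENSION_AVAILABILITY for a in recent))
```

For example, `misses_targets(72.0, 74.0)` is true because reliability is below 75%, while `eligible_for_suspension([48.0, 55.0, 40.0])` is false because the three months below 50% are not consecutive.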

Performance

Other links

  • SUMMARY TABLE of availability and reliability metrics
  • List of sites for which availability followup procedures were not applicable

Tools and documentation

Relevant procedures

IMPORTANT! EGI sites not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.

Report generator

Availability/Reliability computation

Operational Level Agreement

Description of the process

  • Generation of statistics

Availability and reliability statistics are automatically generated by GridView in PDF format in the first week of the month and placed under [8]. An Excel version is available at [9].

  • Preliminary processing

Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, the statistics are uploaded to the EGI document server. Links to the monthly statistics will be provided on a regular basis on this wiki page.

  • Publication

An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for following up on the statistics: it asks NGIs to chase sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts with a ticket filed to the COD Support Unit. The whole comment-gathering process is handled through tickets.

  • Handling of sites below targets

For a site that misses availability/reliability targets but is not eligible for suspension:

  1. a child ticket is opened by the COD team and assigned to the respective NGI, asking for an explanation
  2. the explanation must be provided within 10 working days of the ticket being received by the site (please see known issues section [10])
  3. if the explanation is found satisfactory, the ticket is closed
  4. conversely, if the explanation is not given in due time or is found inadequate, the COD escalation procedure is followed; this procedure is currently being revised and is tracked in EGI RT [11]
  5. the child ticket can then be closed
  6. the parent ticket will be closed when all child tickets have been closed.
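The deadline and ticket-closure rules above can be sketched as follows. This is an illustration only: the function names are hypothetical, working days are assumed to be Monday to Friday, and public holidays are ignored because the procedure does not say how they are treated.

```python
from datetime import date, timedelta
from typing import Iterable


def add_working_days(start: date, working_days: int) -> date:
    """Return the date that falls `working_days` working days after `start`.

    Assumption: working days are Monday to Friday; holidays are not considered.
    """
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


def parent_can_close(child_ticket_closed: Iterable[bool]) -> bool:
    """The parent ticket may close only once every child ticket is closed."""
    return all(child_ticket_closed)
```

For a ticket received on Monday 3 January 2011, `add_working_days(date(2011, 1, 3), 10)` gives Monday 17 January 2011.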

  • Handling of sites that are eligible for suspension

For a site that is eligible for suspension:

  1. a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying it that the site will be suspended within 10 working days (please see known issues section [12])
  2. after the 10-day period has passed, the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer (COO) objects
  3. in the case of NGI intervention, the site is not suspended if both the COD and the COO agree with the reasoning provided by the NGI
  4. the child ticket is closed either when the site is suspended or when the suspension is canceled
  5. the parent ticket will be closed when all child tickets have been closed
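One possible reading of the suspension steps above, written as a decision function. The function and parameter names are hypothetical, and the handling of a COO objection without NGI intervention is an interpretation of step 2, not something the procedure spells out.

```python
def should_suspend(ngi_intervened: bool,
                   cod_agrees: bool = False,
                   coo_agrees: bool = False,
                   coo_objects: bool = False) -> bool:
    """Decide whether a site is suspended once the 10-working-day period ends.

    Assumed reading of the procedure:
    - a COO objection always prevents suspension;
    - if the NGI intervened, suspension is avoided only when both the COD
      and the COO agree with the NGI's reasoning;
    - otherwise the site is suspended.
    """
    if coo_objects:
        return False
    if ngi_intervened:
        return not (cod_agrees and coo_agrees)
    return True
```

Under this reading, an NGI intervention accepted by the COD but not by the COO still leads to suspension.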

  • Wiki follow up page

Sites that fail to provide an explanation for missing the OLA targets, or whose explanation is found inadequate, as well as sites that are suspended, are recorded in a wiki page [13].

Known issues and recommendations to NGIs

  1. GridView calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. When processing the data to generate the availability/reliability report, GridView uses the certification status of the site at that moment to decide whether the site is certified, and therefore appears in the report, or uncertified, and therefore has to be excluded. As a result, newly certified sites get inaccurate availability/reliability figures for the month in which they were certified and for all months before that. Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as the explanation. More information at [14] and [15].
  2. arcCE tests have been considered critical since mid-July 2010, but sites and RODs are not notified of their results in any operational tool. This is being investigated; in the meantime, sites/NGIs dealing with availability/reliability tickets caused by arcCE issues are advised to solve such tickets, stating in the solution that the problem was due to the arcCE. Some background: [16] and [17].
  3. creamCE tests are critical but are not taken into account in availability/reliability calculations. This has been discussed on various occasions, most recently at the OMB of 26 October: [18]