Difference between revisions of "Resource Centres OLA and Resource infrastructure Provider OLA reports"
== 2011 ==
[https://documents.egi.eu/document/332 Jan]|[https://documents.egi.eu/document/402 Feb]|[https://documents.egi.eu/document/465 Mar]|[https://documents.egi.eu/document/508 Apr]|[https://documents.egi.eu/document/593 May]|[https://documents.egi.eu/document/648 Jun]|[https://documents.egi.eu/document/716 Jul]|[https://documents.egi.eu/document/783 Aug]|[https://documents.egi.eu/document/820 Sep]|[https://documents.egi.eu/document/879 Oct]|[https://documents.egi.eu/document/905 Nov] re-computations are in progress...
== 2010 ==
Revision as of 12:33, 12 December 2011
EGI Performance is measured using two parameters: Availability and Reliability (definition).
Availability/Reliability data is provided by the MyEGI portal. Note: GridView Availability/Reliability views are now obsolete.
Availability/Reliability are measured at a Resource Centre (RC) level and at a Resource infrastructure Provider (RP) level (for NGIs and EIROs).
SAM metric results are used for the calculation of Availability/Reliability:
- Metrics for RC Availability/Reliability computation are those that are part of the WLCG_CREAM_LCGCE_CRITICAL profile
- Metrics for RP Availability/Reliability computation (top-BDII) are those that are part of the ROC profile
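The two parameters can be illustrated with a small calculation. This is only a sketch: it assumes the commonly used definitions (availability is uptime over the time for which the status is known; reliability additionally excludes scheduled downtime from the denominator), which this page links to but does not spell out. The function names and the sample hours are hypothetical.

```python
def availability(up_hours, total_hours, unknown_hours=0.0):
    """Uptime divided by the time for which the status is known (assumed definition)."""
    known = total_hours - unknown_hours
    if known <= 0:
        raise ValueError("no known time in the period")
    return up_hours / known


def reliability(up_hours, total_hours, scheduled_down_hours=0.0, unknown_hours=0.0):
    """Like availability, but scheduled downtime is also excluded (assumed definition)."""
    expected = total_hours - scheduled_down_hours - unknown_hours
    if expected <= 0:
        raise ValueError("no expected uptime in the period")
    return up_hours / expected


# Hypothetical 30-day month: 720 hours in total, 648 hours up,
# 36 hours of scheduled downtime, status always known.
a = availability(648, 720)
r = reliability(648, 720, scheduled_down_hours=36)
print(f"availability={a:.1%} reliability={r:.1%}")
```

Under these assumptions reliability is never lower than availability, because scheduled downtime only shrinks the denominator.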
Service Level Targets
For a Resource Centre (RC)
It is mandatory that EGI certified Resource Centres provide a minimum monthly Availability/Reliability as specified below (see the RC Operational Level Agreement for details). Availability/Reliability statistics (OPS VO) are issued on a monthly basis.
|Condition for suspension||Resource Centres which have an Availability of less than 70% for three consecutive months will be suspended, i.e. removed from the production infrastructure. Note: this suspension policy was introduced in April 2011 and raised the original 50% threshold to 70%.|
|Condition for justification||Resource Centres not providing minimum monthly performance (70% availability, 75% reliability) MUST provide justification through a GGUS ticket.|
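The two Resource Centre conditions above can be sketched as simple checks. This is a minimal illustration, not the tooling used by EGI: the function names are hypothetical, and "three consecutive months" is interpreted here as the three most recent months in the list.

```python
RC_MIN_AVAILABILITY = 70.0  # percent, from the table above
RC_MIN_RELIABILITY = 75.0   # percent, from the table above
SUSPENSION_MONTHS = 3       # consecutive months below the availability threshold


def needs_justification(availability, reliability):
    """True if the monthly figures miss either minimum target."""
    return availability < RC_MIN_AVAILABILITY or reliability < RC_MIN_RELIABILITY


def eligible_for_suspension(monthly_availability):
    """monthly_availability: list of monthly percentages, oldest first.

    Assumption: only the most recent three months are inspected.
    """
    recent = monthly_availability[-SUSPENSION_MONTHS:]
    return len(recent) == SUSPENSION_MONTHS and all(
        a < RC_MIN_AVAILABILITY for a in recent
    )


print(needs_justification(72.0, 74.0))
print(eligible_for_suspension([90.0, 65.0, 60.0, 69.9]))
```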
For a Resource infrastructure Provider (NGI/EIRO)
As of January 2012, it is mandatory that top-BDII services operated by NGIs provide a minimum availability of 99% (see the RP Operational Level Agreement for details). Availability/Reliability NGI reports are distributed monthly.
Note: Service Level Targets specified below will come into force as of January 2011.
|minimum top-BDII Availability||99%|
|minimum top-BDII Reliability||99%|
|Liability||Resource infrastructure Providers not providing the minimum requested monthly performance for one month (99% Availability, 99% Reliability) MUST provide a service improvement plan.|
- See the list of NGIs' Top-BDIIs used for the Availability/Reliability computation.
EGI-wide Availability and Reliability
It is available here (xls file, data from May 01 2010)
- List of underperforming/suspended Resource Centres
- List of Resource Centres to which the Availability follow-up procedure was not applicable
Process for quality verification
- Generation of statistics
Availability and reliability statistics are automatically generated the first week of the month by the Availability Computation Engine (Gridview until May 2011) using the profile in pdf format and placed under . An Excel version is available at 
- Preliminary processing
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded to the EGI document server. Links to monthly statistics will be provided on a regular basis on this wiki page.
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to follow up with sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.
- Handling of sites below targets
For a site that misses availability/reliability targets but is not eligible for suspension:
- a child ticket is opened by the COD team and assigned to the respective NGI, asking for an explanation
- the explanation must be produced within 10 working days of the site receiving the ticket (please see the known issues section). Reminders and escalation are performed in accordance with the COD escalation procedures.
- if the explanation is found satisfactory the ticket is closed
- conversely, if the explanation is not given in due time or is found inadequate, the COD escalation procedure will be followed, with the site being suspended if neither the site nor the NGI replies to the ticket
- the child ticket can then be closed
- the parent ticket will be closed when all child tickets have been closed.
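The 10-working-day deadline and the parent-ticket rule can be sketched as follows. This is an illustration under stated assumptions: working days are taken to be Monday to Friday with no public holidays, and the function names and the ticket representation (a list of booleans) are hypothetical, not part of GGUS.

```python
from datetime import date, timedelta


def add_working_days(start, days):
    """Return the date that falls `days` working days after `start`.

    Assumption: Saturdays and Sundays are skipped; holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0..4 are Monday..Friday
            remaining -= 1
    return current


def parent_can_close(child_closed_flags):
    """The parent ticket closes only when every child ticket is closed."""
    return all(child_closed_flags)


# A ticket received on Monday 12 December 2011 is due on Monday 26 December 2011.
deadline = add_working_days(date(2011, 12, 12), 10)
print(deadline.isoformat())
```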
- Handling of sites that are eligible for suspension
For a site that is eligible for suspension:
- a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying it that the site will be suspended within 10 working days (please see the known issues section)
- after the 10-day period, during which normal COD escalation procedures apply, the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer (COO) objects
- in the case of NGI intervention, the site is not suspended if both the COD and the COO agree with the reasoning provided by the NGI
- the child ticket is closed either when the site is suspended or when the suspension is cancelled
- the parent ticket will be closed when all child tickets have been closed
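The suspension decision at the end of the 10-day period can be written as a small predicate. This is one reading of the rules above, not an official implementation: the function and parameter names are hypothetical, and NGI intervention is assumed to prevent suspension only when both COD and the COO accept the NGI's reasoning.

```python
def should_suspend(ngi_intervened, cod_agrees, coo_agrees, coo_objects=False):
    """Decide whether a site eligible for suspension is actually suspended.

    Assumed reading: an objection by the COO always blocks suspension;
    an NGI intervention blocks it only if COD and COO both agree.
    """
    if coo_objects:
        return False
    if ngi_intervened and cod_agrees and coo_agrees:
        return False
    return True


print(should_suspend(ngi_intervened=False, cod_agrees=False, coo_agrees=False))
print(should_suspend(ngi_intervened=True, cod_agrees=True, coo_agrees=True))
```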
- Wiki follow up page
Sites that fail to provide an explanation for missing OLA targets, or whose explanation is found inadequate, as well as sites that are suspended, will be recorded in a wiki page 
- Recomputation procedure
Should there be doubts about the validity of Availability/Reliability reports, an RC/NGI can request recomputations according to the procedure defined at 
Known issues and recommendations to NGIs
- ACE, like Gridview in the past, always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. When processing the data to generate the availability/reliability report, ACE uses the certification status of the site at that moment to decide whether the site is certified and therefore appears in the report, or is uncertified and must be excluded. Thus newly certified sites will get inaccurate Availability/Reliability figures for the month in which they were certified and for all months before that. Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as the explanation. More information at  and . As of December 2010, Gridview included a snapshot feature so that availability takes into account the topology on the last day of the month. While this does not solve the problem completely, it reduces its impact. However, ACE reports (used since May 2011) do not include the snapshot feature yet.
- The calculations performed by ACE always take into account the information system status and GOCDB information at the time the calculation is performed, not at a checkpoint in the past. As a consequence, any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, complete recalculations are avoided whenever possible, and errors are fixed on a per-site basis for sites whose numbers are lower than they should be.
- Weighted availability is calculated by multiplying the number of logical CPUs a site publishes by its published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example, by mistake), the overall NGI weighted availability will be affected.
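The effect of a wrong HEPSPEC value can be seen in a small example. This is a sketch only: it assumes the NGI figure is a weighted average of site availabilities, with each site's weight equal to its logical CPUs multiplied by its HEPSPEC value. The averaging formula, function names and sample numbers are assumptions for illustration and are not taken from ACE.

```python
def site_weight(logical_cpus, hepspec):
    """Weight of a site: published logical CPUs times published HEPSPEC."""
    return logical_cpus * hepspec


def ngi_weighted_availability(sites):
    """sites: list of (availability_percent, logical_cpus, hepspec) tuples.

    Assumption: the NGI value is the weight-averaged site availability.
    """
    total_weight = sum(site_weight(c, h) for _, c, h in sites)
    if total_weight == 0:
        raise ValueError("total weight is zero")
    return sum(a * site_weight(c, h) for a, c, h in sites) / total_weight


# Two hypothetical sites of equal size publishing correct values.
correct = ngi_weighted_availability([(95.0, 100, 10.0), (60.0, 100, 10.0)])

# Same sites, but the second one publishes HEPSPEC 100 instead of 10 by mistake.
wrong = ngi_weighted_availability([(95.0, 100, 10.0), (60.0, 100, 100.0)])

print(f"correct={correct:.2f} wrong={wrong:.2f}")
```

With the inflated HEPSPEC value the poorly performing site dominates the average, so the NGI figure drops even though no site's availability changed.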
Operational Level Agreements
Resource Centre Operational Level Agreement
Resource infrastructure Provider Operational Level Agreement
- Definition of Availability and Reliability and related computation algorithm (paper)
- NEW! List of Nagios tests used for availability computation
- Availability Computation Engine (ACE) home page
- COD procedure for oversight of availability and reliability performance
- Impact of change of suspension policy for under-performing sites: report