The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team at operations @ egi.eu.

Difference between revisions of "Resource Centres OLA and Resource infrastructure Provider OLA reports"

From EGIWiki
Line 1: Line 1:
{{Template:Op menubar}}
{{Template:Op menubar}} {{TOC_right}}  
[[Category:Procedures]]
[[Category:Service Level Management]]
{{TOC_right}}


EGI Performance is measured using two parameters: Availability and Reliability ([https://tomtools.cern.ch/confluence/download/attachments/2261694/Ace_Service_Availability_Computation.pdf?version=1&modificationDate=1314361543000 definition]).  
EGI Performance is measured using two parameters: Availability and Reliability ([https://tomtools.cern.ch/confluence/download/attachments/2261694/Ace_Service_Availability_Computation.pdf?version=1&modificationDate=1314361543000 definition]).  


Availability/Reliability data is provided by the [https://grid-monitoring.cern.ch/myegi/sa/ MyEGI] portal. Note: GridView Availability/Reliability views are now obsolete.
Availability/Reliability data is provided by the [https://grid-monitoring.cern.ch/myegi/sa/ MyEGI] portal. Note: GridView Availability/Reliability views are now obsolete.  


Availability/Reliability are measured at a Resource Centre (RC) level and at a Resource infrastructure Provider (RP) level (for NGIs and EIROs).
Availability/Reliability are measured at a Resource Centre (RC) level and at a Resource infrastructure Provider (RP) level (for NGIs and EIROs).  


[[SAM_Tests|SAM metric]] results are used for the calculation of Availability/Reliability.
[[SAM Tests|SAM metric]] results are used for the calculation of Availability/Reliability.  


Go to the '''[[Performance|main]]''' page for information on service level targets and related statistics.
Go to the '''[[Performance|main]]''' page for information on service level targets and related statistics.  


= Performance reports=
= Performance reports =


== Resource Centres ==
== Resource Centres ==
[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE)


{| border="1" cellspacing="0" cellpadding="5"  
[https://edms.cern.ch/document/963325 January 2008 - April 2010] (EGEE)
|-<!--style="background-color: lightgray;"-->
 
! Availability/Reliability
{| cellspacing="0" cellpadding="5" border="1"
! Jan
|-
! Feb
! Availability/Reliability  
! Mar
! Jan  
! Apr
! Feb  
! May
! Mar  
! Jun
! Apr  
! Jul
! May  
! Aug
! Jun  
! Sep
! Jul  
! Oct
! Aug  
! Nov
! Sep  
! Oct  
! Nov  
! Dec
! Dec
|-
|-
! 2010
! 2010  
| -
| -  
| -
| -  
| -
| -  
| -
| -  
|[https://documents.egi.eu/document/42 05/10]
| [https://documents.egi.eu/document/42 05/10]  
|[https://documents.egi.eu/document/96 06/10]
| [https://documents.egi.eu/document/96 06/10]  
|[https://documents.egi.eu/document/130 07/10]
| [https://documents.egi.eu/document/130 07/10]  
|[https://documents.egi.eu/document/157 08/10]
| [https://documents.egi.eu/document/157 08/10]  
|[https://documents.egi.eu/document/219 09/10]
| [https://documents.egi.eu/document/219 09/10]  
|[https://documents.egi.eu/document/238 10/10]
| [https://documents.egi.eu/document/238 10/10]  
|[https://documents.egi.eu/document/266 11/10]  
| [https://documents.egi.eu/document/266 11/10]  
|[https://documents.egi.eu/document/299 12/10]
| [https://documents.egi.eu/document/299 12/10]
|-
|-
! 2011
! 2011  
|[https://documents.egi.eu/document/332 01/11]
| [https://documents.egi.eu/document/332 01/11]  
|[https://documents.egi.eu/document/402 02/11]
| [https://documents.egi.eu/document/402 02/11]  
|[https://documents.egi.eu/document/465 03/11]
| [https://documents.egi.eu/document/465 03/11]  
|[https://documents.egi.eu/document/508 04/11]
| [https://documents.egi.eu/document/508 04/11]  
|[https://documents.egi.eu/document/593 05/11]
| [https://documents.egi.eu/document/593 05/11]  
|[https://documents.egi.eu/document/648 06/11]
| [https://documents.egi.eu/document/648 06/11]  
|[https://documents.egi.eu/document/716 07/11]
| [https://documents.egi.eu/document/716 07/11]  
|[https://documents.egi.eu/document/783 08/11]
| [https://documents.egi.eu/document/783 08/11]  
|[https://documents.egi.eu/document/820 09/11]
| [https://documents.egi.eu/document/820 09/11]  
|[https://documents.egi.eu/document/879 10/11]
| [https://documents.egi.eu/document/879 10/11]  
|[https://documents.egi.eu/document/905 11/11]
| [https://documents.egi.eu/document/905 11/11]  
|[https://documents.egi.eu/document/959 12/11]
| [https://documents.egi.eu/document/959 12/11]
|-
|-
! 2012
! 2012  
|[https://documents.egi.eu/document/1000 01/12]
| [https://documents.egi.eu/document/1000 01/12]  
|[https://documents.egi.eu/document/1033 02/12]
| [https://documents.egi.eu/document/1033 02/12]  
|[https://documents.egi.eu/document/1091 03/12]
| [https://documents.egi.eu/document/1091 03/12]  
|[https://documents.egi.eu/document/1117 04/12]
| [https://documents.egi.eu/document/1117 04/12]  
|[https://documents.egi.eu/document/1174 05/12]
| [https://documents.egi.eu/document/1174 05/12]  
|[https://documents.egi.eu/document/1251 06/12]
| [https://documents.egi.eu/document/1251 06/12]  
|[https://documents.egi.eu/document/1307 07/12]
| [https://documents.egi.eu/document/1307 07/12]  
|[https://documents.egi.eu/document/1332 08/12]
| [https://documents.egi.eu/document/1332 08/12]  
|[ 09/12]
| [ 09/12]  
|[ 10/12]
| [ 10/12]  
|[ 11/12]  
| [ 11/12]  
|[ 12/12]
| [ 12/12]
|}
|}


== Resource Infrastructures ==
== Resource Infrastructures ==


{| cellspacing="0" cellpadding="5" border="1"
{| cellspacing="0" cellpadding="5" border="1"
|-
|-
| '''Service Level: '''
| '''Service Level: '''  
'''top-BDII Availability/Reliability'''  
'''top-BDII Availability/Reliability'''  
| '''Jan'''  
| '''Jan'''  
| '''Feb'''  
| '''Feb'''  
Line 97: Line 96:
| '''Dec'''
| '''Dec'''
|-
|-
| '''2011'''
| '''2011'''  
| -
| -  
| -
| -  
| -
| -  
| -
| -  
| -
| -  
| -
| -  
| -
| -  
| -
| -  
| [https://documents.egi.eu/public/RetrieveFile?docid=820&version=5&filename=EGI-core_services_availabilities-per_NGI%20NGIs%20core%20services.pdf 09/11]
| [https://documents.egi.eu/public/RetrieveFile?docid=820&version=5&filename=EGI-core_services_availabilities-per_NGI%20NGIs%20core%20services.pdf 09/11]  
| [https://documents.egi.eu/public/RetrieveFile?docid=879&version=4&filename=EGI-core_services_availabilities-per_NGI-Oct2011-1.pdf 10/11]
| [https://documents.egi.eu/public/RetrieveFile?docid=879&version=4&filename=EGI-core_services_availabilities-per_NGI-Oct2011-1.pdf 10/11]  
| [https://documents.egi.eu/public/RetrieveFile?docid=905&version=3&filename=EGI-core_services_availabilities-per_NGI-Nov2011.pdf 11/11]
| [https://documents.egi.eu/public/RetrieveFile?docid=905&version=3&filename=EGI-core_services_availabilities-per_NGI-Nov2011.pdf 11/11]  
| [https://documents.egi.eu/public/RetrieveFile?docid=959&version=1&filename=EGI-core_services_availabilities-per_NGI-Dec2011.pdf 12/11]
| [https://documents.egi.eu/public/RetrieveFile?docid=959&version=1&filename=EGI-core_services_availabilities-per_NGI-Dec2011.pdf 12/11]
|-
|-
| '''2012'''  
| '''2012'''  
| [https://documents.egi.eu/secure/RetrieveFile?docid=1000&version=1&filename=EGI-core_services_availabilities-per_NGI-Jan2012.pdf 01/12]
| [https://documents.egi.eu/secure/RetrieveFile?docid=1000&version=1&filename=EGI-core_services_availabilities-per_NGI-Jan2012.pdf 01/12]  
| [https://documents.egi.eu/secure/RetrieveFile?docid=1033&version=1&filename=EGI-core_services_availabilities-per_NGI-Feb2012%20Top-BDIIs.pdf 02/12]
| [https://documents.egi.eu/secure/RetrieveFile?docid=1033&version=1&filename=EGI-core_services_availabilities-per_NGI-Feb2012%20Top-BDIIs.pdf 02/12]  
| [https://documents.egi.eu/secure/RetrieveFile?docid=1091&version=2&filename=EGI-core_services_availabilities-per_NGI-Mar2012.pdf 03/12]
| [https://documents.egi.eu/secure/RetrieveFile?docid=1091&version=2&filename=EGI-core_services_availabilities-per_NGI-Mar2012.pdf 03/12]  
| [https://documents.egi.eu/public/RetrieveFile?docid=1117&version=2&filename=EGI-core_services_availabilities-per_NGI-Apr2012%20NGIs%20core%20services.pdf 04/12]
| [https://documents.egi.eu/public/RetrieveFile?docid=1117&version=2&filename=EGI-core_services_availabilities-per_NGI-Apr2012%20NGIs%20core%20services.pdf 04/12]  
| [https://documents.egi.eu/secure/RetrieveFile?docid=1174&version=3&filename=EGI-core_services_availabilities-per_NGI-May2012-1.pdf 05/12]  
| [https://documents.egi.eu/secure/RetrieveFile?docid=1174&version=3&filename=EGI-core_services_availabilities-per_NGI-May2012-1.pdf 05/12]  
| [https://documents.egi.eu/public/RetrieveFile?docid=1251&version=3&filename=EGI-core_services_availabilities-per_NGI-June2012-1.pdf 06/12]  
| [https://documents.egi.eu/public/RetrieveFile?docid=1251&version=3&filename=EGI-core_services_availabilities-per_NGI-June2012-1.pdf 06/12]  
Line 126: Line 125:
|}
|}


<br>


{| cellspacing="0" cellpadding="5" border="1"
{| cellspacing="0" cellpadding="5" border="1"
|-
|-
| '''Service Level: '''  
| '''Service Level: '''  
'''ROD Performance Index'''  
'''ROD Performance Index''' ticket/[https://documents.egi.eu/document/1089 Report]  
ticket/[https://documents.egi.eu/document/1089 Report]
 
| '''Jan'''  
| '''Jan'''  
| '''Feb'''  
| '''Feb'''  
Line 145: Line 145:
| '''Dec'''
| '''Dec'''
|-
|-
| '''2011'''
| '''2011'''  
| -
| -  
| -
| -  
| -
| -  
| -
| -  
| -
| -  
| -
| -  
| -
| -  
| -
| -  
| -
| -  
| <!--Oct-->[https://ggus.eu/ws/ticket_info.php?ticket=76116 76116]  
| <!--Oct-->[https://ggus.eu/ws/ticket_info.php?ticket=76116 76116]  
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2011-10.pdf 10/11]  
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2011-10.pdf 10/11]  
| <!--Nov-->[https://ggus.eu/ws/ticket_info.php?ticket=77235 77235]  
| <!--Nov-->[https://ggus.eu/ws/ticket_info.php?ticket=77235 77235]  
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2011-11.pdf 11/11]  
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2011-11.pdf 11/11]  
| <!--Dec--> [https://ggus.eu/ws/ticket_info.php?ticket=78078 78078]  
| <!--Dec--> [https://ggus.eu/ws/ticket_info.php?ticket=78078 78078]  
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2011-12.pdf 12/11]  
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2011-12.pdf 12/11]  
|-
|-
| '''2012'''  
| '''2012'''  
| [https://ggus.eu/ws/ticket_info.php?ticket=78078 78078]  
| [https://ggus.eu/ws/ticket_info.php?ticket=78078 78078]  
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2012-01.pdf 01/12]
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2012-01.pdf 01/12]  
 
| [https://ggus.eu/ws/ticket_info.php?ticket=79006 79006]  
| [https://ggus.eu/ws/ticket_info.php?ticket=79006 79006]  
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2012-02.pdf 02/12]  
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2012-02.pdf 02/12]  
| [https://ggus.eu/ws/ticket_info.php?ticket=80841 80841]/
 
| [https://ggus.eu/ws/ticket_info.php?ticket=80841 80841]/  
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2012-03.pdf 03/12]  
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2012-03.pdf 03/12]  
| [https://ggus.eu/ws/ticket_info.php?ticket=81998 81998]/  
| [https://ggus.eu/ws/ticket_info.php?ticket=81998 81998]/  
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2012-04.pdf 04/12]
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2012-04.pdf 04/12]  
| [https://ggus.eu/ws/ticket_info.php?ticket=82926 82926]/
 
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2012-05.pdf 05/12]
| [https://ggus.eu/ws/ticket_info.php?ticket=82926 82926]/  
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2012-05.pdf 05/12]  
 
| [https://ggus.eu/ws/ticket_info.php?ticket=84168 84168]/  
| [https://ggus.eu/ws/ticket_info.php?ticket=84168 84168]/  
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2012-06.pdf 06/12]
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2012-06.pdf 06/12]  
| [https://ggus.eu/ws/ticket_info.php?ticket=85127 85127]/
 
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2012-07.pdf 07/12]
| [https://ggus.eu/ws/ticket_info.php?ticket=85127 85127]/  
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2012-07.pdf 07/12]  
 
| [GGUS]/
[https://documents.egi.eu/secure/RetrieveFile?docid=1089&version=1&filename=OlaMetrics_2012-08.pdf 08/12]
 
| [GGUS]/
[09/12]
 
| [GGUS]/
[10/12]
 
| [GGUS]/  
| [GGUS]/  
[08/12] 
| [GGUS]/ 
[09/12] 
| [GGUS]/ 
[10/12] 
| [GGUS]/ 
[11/12]  
[11/12]  
|[GGUS]/  
 
| [GGUS]/  
[12/12]  
[12/12]  
|}
|}


== EGI overall Availability and Reliability ==
== EGI overall Availability and Reliability ==
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (xls file, data from May 01 2010)


== Underperforming/Suspended RCs ==
It is available [https://documents.egi.eu/public/ShowDocument?docid=415 here] (xls file, data from May 01 2010)  
* List of [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics underperforming/suspended Resource Centres ]
* List of [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable Resource Centres] to which the Availability follow-up procedure was not applicable
<!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]-->


=Process for quality verification=
== Underperforming/Suspended RCs  ==


* '''Generation of statistics'''
*List of [https://wiki.egi.eu/wiki/Availability_and_reliability_reports_metrics underperforming/suspended Resource Centres ]
Availability and reliability statistics are automatically generated the first week of the month by the [https://wiki.egi.eu/wiki/External_tools#Availability_Computation_Engine Availability Computation Engine] (Gridview until May 2011) using the profile  in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/].
*List of [https://wiki.egi.eu/wiki/List_of_sites_for_which_the_availability_followup_procedures_were_not_applicable Resource Centres] to which the Availability follow-up procedure was not applicable <!--* [https://twiki.cern.ch/twiki/bin/view/EGEE/SuspendedSites List of suspended sites (2009)]-->


* '''Preliminary processing'''
= Process for quality verification =
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.


* '''Publication'''
*'''Generation of statistics'''
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to follow up with sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts with a ticket filed to the COD Support Unit. The whole comment-gathering process is handled through tickets.
 
Availability and reliability statistics are automatically generated the first week of the month by the [https://wiki.egi.eu/wiki/External_tools#Availability_Computation_Engine Availability Computation Engine] (Gridview until May 2011) using the profile in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/].
 
*'''Preliminary processing'''
 
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.
 
*'''Publication'''
 
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to follow up with sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts with a ticket filed to the COD Support Unit. The whole comment-gathering process is handled through tickets.
 
*'''Handling of sites below targets'''


* '''Handling of sites below targets'''
For a site that misses availability/reliability targets but is not eligible for suspension:  
For a site that misses availability/reliability targets but is not eligible for suspension:  


# a child ticket is opened by the COD team and assigned to the respective NGI, asking for explanation to be given  
#a child ticket is opened by the COD team and assigned to the respective NGI, asking for explanation to be given  
# the explanation must be provided within 10 working days of the site receiving the ticket (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with COD escalation procedures [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].
#the explanation must be provided within 10 working days of the site receiving the ticket (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs]). Reminders and escalation are performed in accordance with COD escalation procedures [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure].
# if the explanation is found satisfactory the ticket is closed  
#if the explanation is found satisfactory the ticket is closed  
# conversely, if the explanation is not given in due time or is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket
#conversely, if the explanation is not given in due time or is found inadequate, the COD escalation procedure will be followed [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], with the site being suspended if neither the site nor the NGI replies to the ticket
# the child ticket can then be closed  
#the child ticket can then be closed  
# the parent ticket will be closed when all child tickets have been closed.
#the parent ticket will be closed when all child tickets have been closed.
 
*'''Handling of sites that are eligible for suspension'''


* '''Handling of sites that are eligible for suspension'''
For a site that is eligible for suspension:  
For a site that is eligible for suspension:  
# a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying it that the site will be suspended within 10 working days (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])
# after the 10-day period, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer (COO) objects.
# in the case of NGI intervention, the site is not suspended if both the COD and the COO agree with the reasoning provided by the NGI
# the child ticket is closed either when the site is suspended or when the suspension is canceled
# the parent ticket will be closed when all child tickets have been closed


* '''Wiki follow up page'''
#a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying it that the site will be suspended within 10 working days (please see known issues section [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics#Known_issues_and_recommendations_to_NGIs])
Sites that fail to explain why they missed OLA targets, or whose explanation is found inadequate, as well as sites that are suspended, are recorded in a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]
#after the 10-day period, during which normal COD escalation procedures apply [https://wiki.egi.eu/wiki/Operations:COD_Escalation_Procedure], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer (COO) objects.
#in the case of NGI intervention, the site is not suspended if both the COD and the COO agree with the reasoning provided by the NGI
#the child ticket is closed either when the site is suspended or when the suspension is canceled
#the parent ticket will be closed when all child tickets have been closed


* '''Recomputation procedure'''
*'''Wiki follow up page'''
Should there be doubts about the validity of Availability/Reliability reports, a RC/NGI can request recomputations according to the procedure defined at [https://wiki.egi.eu/wiki/PROC10]


=Known issues and recommendations to NGIs=
Sites that fail to explain why they missed OLA targets, or whose explanation is found inadequate, as well as sites that are suspended, are recorded in a wiki page [https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD_metrics]
# ACE, like Gridview in the past, calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. When processing the data to generate the availability/reliability report, ACE uses the certification status of the site at that moment to decide whether the site is certified, and therefore shows up in the report, or uncertified, and therefore has to be excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as the explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, Gridview included a snapshot feature so that availability takes into account the topology on the last day of the month. While it does not solve the problem completely, it reduces its impact. However, ACE reports (used since May 2011) do not include the snapshot feature yet.'''
# The calculations performed by ACE always '''take into account the information system status and GOCDB information at the time the calculation is performed, and not those at a certain checkpoint in the past'''. As a result, any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for sites whose figures are lower than they should be.
# Weighted availability uses a per-site weight obtained by multiplying the number of logical CPUs a site publishes by its published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example, by mistake), the overall NGI weighted availability will be affected.


=[[Documentation#OLAs|Operational Level Agreements]]=
*'''Recomputation procedure'''


=Links=
Should there be doubts about the validity of Availability/Reliability reports, a RC/NGI can request recomputations according to the procedure defined at [https://wiki.egi.eu/wiki/PROC10]  
* Definition of Availability and Reliability and related computation algorithm  ([https://tomtools.cern.ch/confluence/download/attachments/2261694/Ace_Service_Availability_Computation.pdf?version=1&modificationDate=1314361543000 paper])


* NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation
= Known issues and recommendations to NGIs =
* [https://tomtools.cern.ch/confluence/display/SAM/ACE Availability Computation Engine (ACE) home page]


*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]
#ACE, like Gridview in the past, calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. When processing the data to generate the availability/reliability report, ACE uses the certification status of the site at that moment to decide whether the site is certified, and therefore shows up in the report, or uncertified, and therefore has to be excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as the explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, Gridview included a snapshot feature so that availability takes into account the topology on the last day of the month. While it does not solve the problem completely, it reduces its impact. However, ACE reports (used since May 2011) do not include the snapshot feature yet.'''
* Impact of change of suspension policy for under-performing sites: [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact report]
#The calculations performed by ACE always '''take into account the information system status and GOCDB information at the time the calculation is performed, and not those at a certain checkpoint in the past'''. As a result, any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for sites whose figures are lower than they should be.
#Weighted availability uses a per-site weight obtained by multiplying the number of logical CPUs a site publishes by its published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example, by mistake), the overall NGI weighted availability will be affected.
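
The effect described in the last item can be illustrated with a short sketch. It assumes that the NGI figure is the weighted mean of the site availabilities, with each weight being the number of logical CPUs multiplied by the HEPSPEC value. The aggregation formula, the function names, and the sample numbers are assumptions made for illustration and are not taken from ACE.

```python
def site_weight(logical_cpus, hepspec):
    """Weight of a site: published logical CPUs times published HEPSPEC."""
    return logical_cpus * hepspec


def ngi_weighted_availability(sites):
    """Weighted mean of site availabilities (assumed aggregation).

    `sites` is a sequence of (availability, logical_cpus, hepspec) tuples,
    with availability expressed as a fraction between 0 and 1.
    """
    sites = list(sites)
    total_weight = sum(site_weight(cpus, hs) for _, cpus, hs in sites)
    if total_weight == 0:
        raise ValueError("total weight is zero; cannot compute a weighted mean")
    weighted_sum = sum(a * site_weight(cpus, hs) for a, cpus, hs in sites)
    return weighted_sum / total_weight


# Hypothetical NGI with one large, well-performing site and one small, poor one.
correct = [(0.99, 1000, 10.0), (0.60, 100, 10.0)]
# Same NGI, but the small site publishes a HEPSPEC value ten times too high.
mistaken = [(0.99, 1000, 10.0), (0.60, 100, 100.0)]

print(ngi_weighted_availability(correct))
print(ngi_weighted_availability(mistaken))
```

Under these assumptions, the inflated HEPSPEC value gives the small site as much weight as the large one, which pulls the NGI figure down noticeably.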
 
= [[Documentation#OLAs|Operational Level Agreements]] =
 
= Links =
 
*Definition of Availability and Reliability and related computation algorithm ([https://tomtools.cern.ch/confluence/download/attachments/2261694/Ace_Service_Availability_Computation.pdf?version=1&modificationDate=1314361543000 paper])
 
*NEW! [https://wiki.egi.eu/wiki/Availability_and_reliability_tests List of Nagios tests] used for availability computation
*[https://tomtools.cern.ch/confluence/display/SAM/ACE Availability Computation Engine (ACE) home page]
 
*[https://wiki.egi.eu/wiki/Availability_and_reliability_internal_procedure_for_COD COD procedure for oversight of availability and reliability performance]  
*Impact of change of suspension policy for under-performing sites: [https://wiki.egi.eu/wiki/Availability_and_reliability_threshold_change_impact report]


<!-- DEPRECATED LINKS
<!-- DEPRECATED LINKS
Line 254: Line 280:
*[http://gvdev.cern.ch/GVPC/Excel/ '''(DEPRECATED)''' GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and  SAM results for VOs)
*[http://gvdev.cern.ch/GVPC/Excel/ '''(DEPRECATED)''' GridView availability/reliability report generator] (providing access to the database including Nagios results for OPS and  SAM results for VOs)
*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]
*[http://gridview015.cern.ch/GVPC/Excel/ACE/ ACE report generator]
*[https://gvdev.cern.ch/ACEVAL/ace_index.php ACE visualization portal]-->
*[https://gvdev.cern.ch/ACEVAL/ace_index.php ACE visualization portal]--> {{Template:Creative_commons}}  
{{Template:Creative_commons}}
 
[[Category:Procedures]] [[Category:Service_Level_Management]]

Revision as of 15:30, 11 September 2012




EGI Performance is measured using two parameters: Availability and Reliability (definition).
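
As a rough illustration of the two parameters, the sketch below assumes the formulation commonly cited for this kind of computation: availability is the up time divided by the known time, and reliability additionally discounts scheduled downtime. The function and parameter names are hypothetical, and the linked definition remains the authoritative reference.

```python
def availability(up_hours, total_hours, unknown_hours=0.0):
    """Up time as a fraction of the period whose status is known."""
    known = total_hours - unknown_hours
    if known <= 0:
        raise ValueError("no known period to compute over")
    return up_hours / known


def reliability(up_hours, total_hours, scheduled_downtime_hours=0.0,
                unknown_hours=0.0):
    """Like availability, but scheduled downtime is not counted against the site."""
    expected = total_hours - scheduled_downtime_hours - unknown_hours
    if expected <= 0:
        raise ValueError("no expected up time to compute over")
    return up_hours / expected


# Hypothetical 720-hour month: 650 h up, 24 h scheduled downtime, 12 h unknown.
print(availability(650, 720, unknown_hours=12))
print(reliability(650, 720, scheduled_downtime_hours=24, unknown_hours=12))
```

With these definitions, reliability is never lower than availability for the same period, because scheduled downtime only shrinks the denominator.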

Availability/Reliability data is provided by the MyEGI portal. Note: GridView Availability/Reliability views are now obsolete.

Availability/Reliability are measured at a Resource Centre (RC) level and at a Resource infrastructure Provider (RP) level (for NGIs and EIROs).

SAM metric results are used for the calculation of Availability/Reliability.

Go to the main page for information on service level targets and related statistics.

Performance reports

Resource Centres

January 2008 - April 2010 (EGEE)

Availability/Reliability Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2010 - - - - 05/10 06/10 07/10 08/10 09/10 10/10 11/10 12/10
2011 01/11 02/11 03/11 04/11 05/11 06/11 07/11 08/11 09/11 10/11 11/11 12/11
2012 01/12 02/12 03/12 04/12 05/12 06/12 07/12 08/12 [ 09/12] [ 10/12] [ 11/12] [ 12/12]

Resource Infrastructures

Service Level:

top-BDII Availability/Reliability

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011 - - - - - - - - 09/11 10/11 11/11 12/11
2012 01/12 02/12 03/12 04/12 05/12 06/12 07/12 08/12 [09/12] [10/12] [11/12] [12/12]


Service Level:

ROD Performance Index ticket/Report

Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec
2011 | - | - | - | - | - | - | - | - | - | 76116 / 10/11 | 77235 / 11/11 | 78078 / 12/11
2012 | 78078 / 01/12 | 79006 / 02/12 | 80841 / 03/12 | 81998 / 04/12 | 82926 / 05/12 | 84168 / 06/12 | 85127 / 07/12 | [GGUS] / 08/12 | [GGUS] / [09/12] | [GGUS] / [10/12] | [GGUS] / [11/12] | [GGUS] / [12/12]

EGI overall Availability and Reliability

It is available here (xls file, data from May 01 2010)

Underperforming/Suspended RCs

Process for quality verification

  • Generation of statistics

Availability and reliability statistics are automatically generated the first week of the month by the Availability Computation Engine (Gridview until May 2011) using the profile in pdf format and placed under [1].

  • Preliminary processing

Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.

  • Publication

An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics: it asks NGIs to follow up with sites that need to provide comments when thresholds are not met, and it identifies sites eligible for suspension. This phase starts with a ticket filed to the COD Support Unit. The whole comment-gathering process is handled through tickets.

  • Handling of sites below targets

For a site that misses availability/reliability targets but is not eligible for suspension:

  1. a child ticket is opened by the COD team and assigned to the respective NGI, asking for an explanation
  2. the explanation must be produced within 10 working days of the ticket being received by the site (please see known issues section [2]). Reminders and escalation are performed in accordance with COD escalation procedures [3].
  3. if the explanation is found satisfactory, the ticket is closed
  4. conversely, if the explanation is not given in due time, or is found inadequate, the COD escalation procedure will be followed [4], with the site being suspended if neither the site nor the NGI replies to the ticket
  5. the child ticket can then be closed
  6. the parent ticket will be closed when all child tickets have been closed.
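The 10-working-day deadline in step 2 can be estimated with a small helper. This is only a sketch: the function name `add_working_days` is hypothetical, the example date is invented, and it assumes that working days are Monday to Friday with no public holidays, which the procedure itself does not specify.

```python
from datetime import date, timedelta


def add_working_days(start: date, working_days: int) -> date:
    """Return the date that lies `working_days` working days after `start`.

    Assumption: working days are Monday to Friday; holidays are ignored.
    """
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# Hypothetical example: ticket received by the site on Monday 3 September 2012.
received = date(2012, 9, 3)
print(add_working_days(received, 10))  # 2012-09-17
```

An NGI with national holidays in the period would need to extend the helper with its own holiday calendar.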
  • Handling of sites that are eligible for suspension

For a site that is eligible for suspension:

  1. a child ticket is opened by the COD team and assigned to the appropriate NGI, notifying it that the site will be suspended within 10 working days (please see known issues section [5])
  2. after the 10-day period has passed, during which normal COD escalation procedures apply [6], the site is suspended by COD unless the NGI has intervened or the EGI Chief Operations Officer (COO) objects.
  3. in the case of NGI intervention, the site is not suspended if both the COD and the COO agree with the reasoning provided by the NGI
  4. the child ticket closes either when the site is suspended or when suspension is canceled
  5. the parent ticket will be closed when all child tickets have been closed
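One possible reading of the decision in steps 2 and 3 is sketched below. The function and parameter names are hypothetical, and the way a COO objection and an NGI intervention are combined is an interpretation of the text above, not an official rule.

```python
def suspension_outcome(
    ngi_intervened: bool,
    cod_agrees: bool,
    coo_agrees: bool,
    coo_objects: bool = False,
) -> str:
    """Outcome of the child ticket once the 10-day period has passed.

    Assumed reading: a COO objection alone cancels the suspension; an NGI
    intervention cancels it only if both COD and COO accept the reasoning.
    """
    if coo_objects:
        return "suspension canceled"
    if ngi_intervened and cod_agrees and coo_agrees:
        return "suspension canceled"
    return "site suspended"


print(suspension_outcome(ngi_intervened=True, cod_agrees=True, coo_agrees=False))
# site suspended
```

Either outcome closes the child ticket, as described in step 4.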
  • Wiki follow up page

Sites that fail to provide an explanation for missing the OLA targets, or whose explanation is found inadequate, as well as sites that are suspended, will be recorded in a wiki page [7]

  • Recomputation procedure

Should there be doubts about the validity of Availability/Reliability reports, an RC/NGI can request recomputations according to the procedure defined at [8]

Known issues and recommendations to NGIs

  1. ACE, like Gridview in the past, calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. When processing the data to generate the availability/reliability report, ACE uses the certification status of the site at that moment to decide whether the site is certified, and therefore appears in the report, or uncertified, and therefore has to be excluded. Thus newly certified sites will get inaccurate Availability/Reliability figures for the month in which they were certified and for all months before that. Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as the explanation. More information at [9] and [10]. As of December 2010, Gridview included a snapshot feature so that availability takes into account the topology on the last day of the month. While it does not solve the problem completely, it reduces its impact. However, ACE reports (used since May 2011) do not include the snapshot feature yet.
  2. The calculations performed by ACE always use the information system status and GOCDB information at the time the calculation is performed, not that of a checkpoint in the past. As a consequence, any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, complete recalculations are avoided whenever possible, and errors are fixed on a per-site basis for those sites whose numbers are lower than they should be.
  3. Weighted availability is calculated by multiplying the number of logical CPUs a site publishes by the published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example, because of a mistake), the overall NGI weighted availability will be affected.
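The effect described in item 3 can be illustrated with a short sketch. Only the weight (logical CPUs multiplied by HEPSPEC) comes from the text above; combining sites into an NGI figure as a weighted average is an assumption, and the `Site` class, the function names and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    logical_cpus: int    # number of logical CPUs published by the site
    hepspec: float       # published HEPSPEC value
    availability: float  # monthly availability as a fraction in [0, 1]


def weight(site: Site) -> float:
    """Weight of a site: logical CPUs multiplied by the published HEPSPEC."""
    return site.logical_cpus * site.hepspec


def ngi_weighted_availability(sites: list[Site]) -> float:
    """Weighted average of site availabilities (assumed aggregation)."""
    total_weight = sum(weight(s) for s in sites)
    if total_weight == 0:
        raise ValueError("no site publishes a non-zero weight")
    return sum(weight(s) * s.availability for s in sites) / total_weight


# Two hypothetical sites with correct published values.
correct = [Site("A", 100, 10.0, 0.9), Site("B", 100, 10.0, 0.5)]
# Same sites, but site B publishes a HEPSPEC value ten times too high.
wrong = [Site("A", 100, 10.0, 0.9), Site("B", 100, 100.0, 0.5)]

print(round(ngi_weighted_availability(correct), 3))  # 0.7
print(round(ngi_weighted_availability(wrong), 3))    # 0.536
```

In this invented example a single wrong HEPSPEC value pulls the NGI figure from 0.7 down to about 0.54, even though no site's measured availability changed.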

Operational Level Agreements

Links

  • Definition of Availability and Reliability and related computation algorithm (paper)