The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can contact the EGI SDIS team at operations@egi.eu.

Difference between revisions of "PROC04 Quality verification of monthly availability and reliability statistics"

From EGIWiki
(24 intermediate revisions by 4 users not shown)
Line 1: Line 1:
{{Template:Op menubar}} {{Template:Doc_menubar}} {{TOC_right}} {{Ops_procedures
{{Template:Op menubar}} {{Template:Doc_menubar}}  
|Doc_title = Quality verification of monthly availability and reliability statistcs
[[Category:Deprecated]]
|Doc_link = https://wiki.egi.eu/wiki/PROC04
{| style="border:1px solid black; background-color:lightgrey; color: black; padding:5px; font-size:140%; width: 90%; margin: auto;"
|Version = 2.0
| style="padding-right: 15px; padding-left: 15px;" |
|[[File:Alert.png]] This page is '''Deprecated'''; the content has been moved to https://confluence.egi.eu/display/EGIPP/PROC04+Quality+verification+of+monthly+availability+and+reliability+statistics 
|}
 
{{TOC_right}}
{{Ops_procedures
|Doc_title = Quality verification of monthly availability and reliability statistics
|Doc_link = [[PROC04|https://wiki.egi.eu/wiki/PROC04]]
|Version = 16th August 2018
|Policy_acronym = OMB
|Policy_acronym = OMB
|Policy_name = Operations Management Board
|Policy_name = Operations Management Board
|Contact_group = operations at mailman.egi.eu  
|Contact_group = operations@egi.eu
|Doc_status = Approved
|Doc_status = Approved
|Approval_date = 30 October 2012
|Approval_date = 30 October 2012
|Procedure_statement = Instructions RODs and Operations Centres on how to handle justification for poor monthly performance  
|Procedure_statement = Instructions RODs and Operations Centres on how to handle justification for poor monthly performance  
|Owner = Alessandro Paolini
}}  
}}  


Line 15: Line 24:
The document describes the process of how to handle justification for poor monthly performance.  
The document describes the process of how to handle justification for poor monthly performance.  


Links to all monthly statistics are provided on a regular basis at [https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics Availability and reliability monthly statistics page].
Links to all monthly statistics are provided on a regular basis at [[Availability_and_reliability_monthly_statistics |Availability and reliability monthly statistics page]].


= Definitions  =
= Definitions  =
Line 25: Line 34:
= Process of handling RC Availability and Reliability  =
= Process of handling RC Availability and Reliability  =


Availability alarms are raised on [https://operations-portal.egi.eu/dashboard Operations Dashboard] and are thought to be a warning for NGI informing about poor performance of site within the last 30 days.  
Availability alarms are raised on the [https://operations-portal.egi.eu/rodDashboard/ngi/any/tab/list/filter/operators/page/list ROD Dashboard] and are thought to be a warning for NGI informing about poor performance of site within the last 30 days.  


== Entities involved in the procedure  ==
== Entities involved in the procedure  ==
Line 31: Line 40:
'''Regional Operator on Duty (ROD)''': team provided by NGIs and responsible for handling RC&nbsp;availability and reliability alarms through Dashboard in Operations Portal. <br>  
'''Regional Operator on Duty (ROD)''': team provided by NGIs and responsible for handling RC&nbsp;availability and reliability alarms through Dashboard in Operations Portal. <br>  


'''Central Operator on Duty (COD)''': team provided by EGI and responsible for handling of underperforming sites which were below the target for 3 consecutive months.  
'''Operations''': team provided by EGI.eu and responsible for handling of underperforming sites which were below the target for 3 consecutive months.  


'''NGI&nbsp;manager:'''&nbsp;person who suspend the underperforming site or provide site justification.
'''NGI&nbsp;manager:'''&nbsp;person who suspend the underperforming site or provide site justification.
Line 37: Line 46:
== Steps  ==
== Steps  ==


When an alarm is raised, it means that the Availability metric has dropped below the threshold of 70% for the last 30 days period.  
When an alarm is raised, it means that the Availability metric has dropped below the threshold of 80% for the last 30 days period.  


'''Handling alarms:'''<br>  
'''Handling alarms:'''<br>  
Line 48: Line 57:
|- valign="top"
|- valign="top"
| 1  
| 1  
| ROD
| ROD  
| Creates a ticket through the dashboard notifying site administrator that the Availability metric has dropped below the threshold of 70% for the last 30 days period.
| Creates a ticket through the dashboard notifying site administrator that the Availability metric has dropped below the threshold of 80% for the last 30 days period.  
'' The expiration date should be set to not later then same date next month minus 1 day.''
''The expiration date should be set to not later then same date next month minus 1 day.''  
 
|- valign="top"
|- valign="top"
| 2  
| 2  
| ROD
| ROD  
| Escalation of the ticket will vary between NGIs. NGIs have freedom to decide if they want to apply any escalation procedure or treat avaliability tickets just as an notification for site administrators.
| Escalation of the ticket will vary between NGIs. NGIs have freedom to decide if they want to apply any escalation procedure or treat availability tickets just as an notification for site administrators.
|- valign="top"
|- valign="top"
| 3  
| 3  
| ROD  
| ROD  
|  
|  
*Ticket can be closed as soon as the alarm goes into OK status (however it is recommended to make sure it is a couple percent above the threshold before closing it. e.g. 80%).  
*Ticket can be closed as soon as the alarm goes into OK status (however it is recommended to make sure it is a couple percent above the threshold before closing it. e.g. 85%).  
*If the problem continues for over 30 days the ticket should be closed otherwise ticket will appear on COD dashboard and affect ROD performance index.  
*If the problem continues for over 30 days the ticket should be closed otherwise ticket will appear on Operations dashboard and affect ROD performance index.
 
|}
|}


Line 72: Line 83:
|- valign="top"
|- valign="top"
| 1  
| 1  
| COD
| Operations
| Creates a GGUS ticket for each underperforming site.
| Creates a GGUS ticket for each underperforming site. [https://wiki.egi.eu/wiki/WI03_RC_and_RP_OLA_violation_report_followup Ticket template].
|- valign="top"
|- valign="top"
| 2  
| 2  
| NGI manager  
| NGI operations manager  
| Within 10 working days NGI manager can suspend the site or ask to not suspend the site by providing adequate explanation<br>
| Within 10 working days NGI operations manager can suspend the site or ask to not suspend the site by providing adequate explanation<br>
|- valign="top"
|- valign="top"
| 3  
| 3
| COD
| Operations
| Send a direct email to NGI and site contact email (in GOC&nbsp;DB) with deadline 2 days for comments
|- valign="top"
| 4
| Operations
|  
|  
*In the case of '''no''' NGI intervention, the site is suspended in GOC DB.  
*In the case of '''no''' NGI intervention, the site is suspended in GOC DB.  
*In the case of NGI intervention:  
*In the case of NGI intervention:  
**non suspension will occur if the COD team agree on the reasoning provided by the NGI (the Chief Operations Officer may be involved)  
**non suspension will occur if the Operations team agree on the reasoning provided by the NGI (the Chief Operations Officer may be involved)  
**if availability shows no improvement COD can suspend the site
**if availability shows no improvement Operations can suspend the site
 
|}
 
'''Recomputation precedure'''
 
In case of doubts about the validity of Availability/Reliability reports, a RC/NGI can request recomputations according to the procedure defined at [[PROC10]]
 
= Process of handling Core services Availability and Reliability  =
 
'''Generation of statistics'''
 
Availability and reliability statistics are automatically generated the first week of the month by the [[External tools#Availability_Computation_Engine|Availability Computation Engine]] using the profile in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/].
 
'''Preliminary processing'''
 
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics are provided on a regular basis at [[Availability and reliability monthly statistics|Availability and reliability monthly statistics page]].<br>
 
'''Publication'''
 
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible of supervising statistics by chasing NGIs to provide comments in case thresholds are not met. This phase starts by filing a ticket to the COD Support Unit. The overall comments gathering process is handled through tickets.
 
== Entities involved in the procedure  ==
 
'''Central Operator on Duty (COD)''': team provided by EGI and responsible for handling Core services availability and reliability reports.
 
'''Core service administrator:'''&nbsp;person responsible for runnig Core service.
 
== Steps  ==
 
'''Handling of Core Services below targets'''<br>
 
{| class="wikitable"
|-
! Step#
! Responsible
! Action
|- valign="top"
| 1
| COD
|
Creates a GGUS ticket and assigned to the respective NGI, asking for explanation to be given.
 
The explanation must be produced within 10 working days since the ticket is received.
 
|- valign="top"
| 2
|
Core service <br>administrator


| Provides explanation and improvement plan to the GGUS ticket.<br>
|- valign="top"
| 3
| COD
|
*If the explanation is found satisfactory the ticket is closed.
*If the explanation is not given in due time, or the explanation is found inadequate, COD team will report it to the Chief Operations Officer.&nbsp;
|- valign="top"
| 4<br>
| COD<br>
| COD close the parent ticket when all child tickets have been closed and provide summary as a solution.
|}
|}


<br>
'''Recomputation procedure'''  
 
'''Recomputation precedure'''  


In case of doubts about the validity of Availability/Reliability reports, a RC/NGI can request recomputations according to the procedure defined at [[PROC10]]
In case of doubts about the validity of Availability/Reliability reports, a RC/NGI can request recomputations according to the procedure defined at [[PROC10]]
Line 157: Line 110:
= Known issues and recommendations to NGIs  =
= Known issues and recommendations to NGIs  =


#'''Newly certified sites''' will get inaccurate Availability/Reliability figures for the month they were certified and all months before that. ACE takes into account the Certification status of the site in GOCDB in order to decide if metrics should be calculated for the site. Because the Certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check if they have sites affected by this issue and report it as explanation. ''More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. ''  
#'''Newly certified sites''' will get inaccurate Availability/Reliability figures for the month they were certified and all months before that. [http://argoeu.github.io/guides/argo-compute-engine/ ARGO Computation Engine] takes into account the Certification status of the site in GOCDB in order to decide if metrics should be calculated for the site. Because the Certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check if they have sites affected by this issue and report it as explanation. ''More information at [https://ggus.eu/index.php?mode=ticket_info&ticket_id=60594] and [https://ggus.eu/index.php?mode=ticket_info&ticket_id=60925]. ''  
#'''Recalculation''' - The calculations performed by ACE always take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past. The implication of this is that any complete recalculation has the risk of altering the results for sites that had correct numbers in the first place. Thus until a solution is found, complete recalculations are avoided whenever possible, and errors are fixed on per site basis for those that have lower number than they should.  
#'''Recalculation''' - The calculations performed by ARGO always take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past. The implication of this is that any complete recalculation has the risk of altering the results for sites that had correct numbers in the first place. Thus until a solution is found, complete recalculations are avoided whenever possible, and errors are fixed on per site basis for those that have lower number than they should.  
#'''Weighted availability''' is calculated by multiplying the number of logical CPUs a site published with the published HEPSPEC value in BDII. It is important that these numbers are correct, if HEPSPEC for a site is too high or too low (for example in case of mistake) the overall NGI wighted availability will be affected.
#'''Weighted availability''' is calculated by multiplying the number of logical CPUs a site published with the published HEPSPEC value in BDII. It is important that these numbers are correct, if HEPSPEC for a site is too high or too low (for example in case of mistake) the overall NGI weighted availability will be affected.


= Revision history  =
= Revision history  =
Line 170: Line 123:
! Comments
! Comments
|-
|-
| <br>  
|  
| <br>  
| M. Krakowian
| <br>
| 19 August 2014
| <br>
| Change contact group -> Operations support
|-
|
| Alessandro Paolini
| 2016-06-08
| Change contact group -> Operations
|-
|  
| Alessandro Paolini
| 2018-08-16
| A/R of NGI Core services no more handled, deleted from procedure; updated some links
|}
|}



Revision as of 09:39, 16 August 2021

This page is Deprecated; the content has been moved to https://confluence.egi.eu/display/EGIPP/PROC04+Quality+verification+of+monthly+availability+and+reliability+statistics


Title Quality verification of monthly availability and reliability statistics
Document link https://wiki.egi.eu/wiki/PROC04
Last modified 16th August 2018
Policy Group Acronym OMB
Policy Group Name Operations Management Board
Contact Group operations@egi.eu
Document Status Approved
Approved Date 30 October 2012
Procedure Statement Instructions for RODs and Operations Centres on how to handle justification for poor monthly performance
Owner Alessandro Paolini


Overview

This document describes how to handle justifications for poor monthly performance.

Links to all monthly statistics are provided on a regular basis on the Availability and reliability monthly statistics page.

Definitions

Please refer to the EGI Glossary for the definitions of the terms used in this procedure.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Process of handling RC Availability and Reliability

Availability alarms are raised on the ROD Dashboard and are intended as a warning to the NGI about poor performance of a site within the last 30 days.

Entities involved in the procedure

Regional Operator on Duty (ROD): team provided by NGIs and responsible for handling RC availability and reliability alarms through the Dashboard in the Operations Portal.

Operations: team provided by EGI.eu and responsible for handling underperforming sites that have been below the target for 3 consecutive months.

NGI manager: person who suspends the underperforming site or provides a justification for it.

Steps

When an alarm is raised, it means that the Availability metric has dropped below the 80% threshold over the last 30-day period.
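
The alarm condition can be written as a simple comparison. The following is a minimal sketch, not the actual logic of the ROD Dashboard; the function name `alarm_raised` and the constant name are hypothetical, and only the 80% value and the 30-day window come from this procedure.

```python
# Alarm threshold stated in the procedure; the names below are illustrative.
AVAILABILITY_THRESHOLD = 80.0  # percent, measured over the last 30 days


def alarm_raised(availability_30d: float,
                 threshold: float = AVAILABILITY_THRESHOLD) -> bool:
    """Return True when the 30-day availability has dropped below the threshold."""
    return availability_30d < threshold


if __name__ == "__main__":
    for value in (92.5, 80.0, 79.9):
        print(value, alarm_raised(value))
```

The comparison is strict because the text speaks of availability dropping below the threshold; a site at exactly 80% is assumed not to raise an alarm.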

Handling alarms:

Step# Responsible Action
1 ROD Creates a ticket through the dashboard notifying the site administrator that the Availability metric has dropped below the 80% threshold over the last 30-day period.

The expiration date should be set no later than the same date of the following month minus 1 day.

2 ROD Escalation of the ticket varies between NGIs. NGIs are free to decide whether to apply an escalation procedure or to treat availability tickets simply as a notification for site administrators.
3 ROD
  • The ticket can be closed as soon as the alarm goes into OK status (however, it is recommended to wait until availability is a few percentage points above the threshold, e.g. 85%, before closing it).
  • If the problem continues for over 30 days, the ticket should be closed; otherwise it will appear on the Operations dashboard and affect the ROD performance index.
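
The two date and margin rules in the table above can be sketched as follows. This is an illustration under stated assumptions, not part of the dashboard: the function names are hypothetical, the 5-point margin is derived from the 85% example, and clamping the day when the following month is shorter is an assumption, since the procedure does not cover that case.

```python
import calendar
from datetime import date, timedelta


def latest_expiration(opened: date) -> date:
    """Latest allowed expiration: same date of the following month, minus 1 day."""
    year = opened.year + (1 if opened.month == 12 else 0)
    month = 1 if opened.month == 12 else opened.month + 1
    # Assumption: clamp to the last day when the following month is shorter.
    day = min(opened.day, calendar.monthrange(year, month)[1])
    return date(year, month, day) - timedelta(days=1)


def closing_recommended(availability_30d: float,
                        threshold: float = 80.0,
                        margin: float = 5.0) -> bool:
    """True when availability is at least `margin` points above the threshold."""
    return availability_30d >= threshold + margin


if __name__ == "__main__":
    print(latest_expiration(date(2018, 8, 16)))
    print(closing_recommended(81.0), closing_recommended(85.0))
```

Under these assumptions, a ticket opened on 16 August 2018 would expire no later than 15 September 2018.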

Handling of underperforming sites (below the target for 3 consecutive months):

Step# Responsible Action
1 Operations Creates a GGUS ticket for each underperforming site. A ticket template is available at https://wiki.egi.eu/wiki/WI03_RC_and_RP_OLA_violation_report_followup.
2 NGI operations manager Within 10 working days, the NGI operations manager can suspend the site or ask that the site not be suspended by providing an adequate explanation.
3 Operations Sends a direct email to the NGI and to the site contact email (in GOC DB) with a 2-day deadline for comments.
4 Operations
  • In the case of no NGI intervention, the site is suspended in GOC DB.
  • In the case of NGI intervention:
    • the site is not suspended if the Operations team agrees with the reasoning provided by the NGI (the Chief Operations Officer may be involved)
    • if availability shows no improvement, Operations can suspend the site
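
The entry condition for this table, being below the target for 3 consecutive months, can be sketched as a check over a site's monthly figures. This is a hypothetical helper rather than an existing tool; the numeric target is not stated in this procedure, so it is left as a required parameter.

```python
from typing import Sequence


def underperforming(monthly_values: Sequence[float],
                    target: float,
                    months: int = 3) -> bool:
    """True when the most recent `months` monthly values are all below `target`.

    `monthly_values` is ordered oldest to newest. `target` is an assumption
    supplied by the caller; this procedure does not give its value.
    """
    if len(monthly_values) < months:
        return False  # not enough history to decide
    return all(value < target for value in monthly_values[-months:])


if __name__ == "__main__":
    history = [91.0, 72.0, 75.5, 60.0]
    print(underperforming(history, target=80.0))
```

Only the most recent months are inspected, so an older run of poor months followed by a month at or above the target does not count.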

Recomputation procedure

In case of doubts about the validity of Availability/Reliability reports, an RC/NGI can request recomputations according to the procedure defined in PROC10.

Known issues and recommendations to NGIs

  1. Newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that. ARGO Computation Engine takes into account the Certification status of the site in GOCDB in order to decide whether metrics should be calculated for the site. Because the Certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as an explanation. More information is available in GGUS tickets 60594 and 60925.
  2. Recalculation - The calculations performed by ARGO always take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past. As a result, any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, complete recalculations are avoided whenever possible, and errors are fixed on a per-site basis for those sites that have lower numbers than they should.
  3. Weighted availability is calculated by multiplying the number of logical CPUs a site publishes by the HEPSPEC value it publishes in BDII. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example, because of a mistake), the overall NGI weighted availability will be affected.
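
The effect described in item 3 can be illustrated with a small calculation. The per-site weight (logical CPUs multiplied by HEPSPEC) comes from the text above; combining sites as a weighted average is an assumption made for this sketch and may differ from what ARGO actually computes. All names and figures are hypothetical.

```python
from typing import Iterable, Tuple

# (logical_cpus, hepspec, availability_percent) for one site
Site = Tuple[int, float, float]


def site_weight(logical_cpus: int, hepspec: float) -> float:
    """Weight of a site: published logical CPUs times published HEPSPEC."""
    return logical_cpus * hepspec


def ngi_weighted_availability(sites: Iterable[Site]) -> float:
    """Weighted average of site availabilities (assumed aggregation)."""
    sites = list(sites)
    total_weight = sum(site_weight(cpus, hs) for cpus, hs, _ in sites)
    if total_weight == 0:
        raise ValueError("total weight is zero; check the published values")
    weighted_sum = sum(site_weight(cpus, hs) * avail for cpus, hs, avail in sites)
    return weighted_sum / total_weight


if __name__ == "__main__":
    correct = [(100, 10.0, 90.0), (100, 10.0, 70.0)]
    mistaken = [(100, 10.0, 90.0), (100, 100.0, 70.0)]  # HEPSPEC ten times too high
    print(ngi_weighted_availability(correct))
    print(ngi_weighted_availability(mistaken))
```

With equal weights the two sites average out evenly; when the second site publishes a HEPSPEC value ten times too high, its lower availability dominates the NGI figure.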

Revision history

Version Authors Date Comments
M. Krakowian 19 August 2014 Change contact group -> Operations support
Alessandro Paolini 2016-06-08 Change contact group -> Operations
Alessandro Paolini 2018-08-16 A/R of NGI Core services no longer handled, deleted from procedure; updated some links