The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed you can get in touch with EGI SDIS team using operations @ egi.eu.

Difference between revisions of "PROC04 Quality verification of monthly availability and reliability statistics"

Line 1: Line 1:
{{Template:Op menubar}}  
{{Template:Op menubar}} {{Template:Doc_menubar}} {{TOC_right}} {{Ops_procedures
{{Template:Doc_menubar}}  
{{TOC_right}}
[[Category:Operations Procedures]]
{{Ops_procedures
|Doc_title = Quality verification of monthly availability and reliability statistcs
|Doc_title = Quality verification of monthly availability and reliability statistcs
|Doc_link = https://wiki.egi.eu/wiki/PROC04
|Doc_link = https://wiki.egi.eu/wiki/PROC04
Line 9: Line 5:
|Policy_acronym = OMB
|Policy_acronym = OMB
|Policy_name = Operations Management Board
|Policy_name = Operations Management Board
|Contact_group =  operations @ egi.eu
|Contact_group =  operations at mailman.egi.eu  
|Doc_status = Approved
|Doc_status = Approved
|Approval_date = 30 October 2012
|Approval_date = 30 October 2012
|Procedure_statement = Instructions RODs and Operations Centres on how to handle justification for poor monthly performance  
|Procedure_statement = Instructions RODs and Operations Centres on how to handle justification for poor monthly performance  
}}
}}  


=Requirements =
= Overview  =


= Process of handling RC Availability and Reliability =
The document describes the process of how to handle justification for poor monthly performance.


==Entities involved in the procedure ==
= Definitions  =


Please refer to the [[Glossary|EGI Glossary]] for the definitions of the terms used in this procedure.<br>


==Steps==
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", “MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.<br>
Availability alarms are handled by ROD teams through Dashboard in Operations Portal. These alarms are thought to be a warning for NGI informing about poor performance of site within the last 30 days.


'''Understanding the alarm:'''
= Process of handling RC Availability and Reliability  =


When an alarm is raised, it means that the Availability metric has dropped below the threshold of 70% for the last 30 days period.
Availability alarms are raised on [https://operations-portal.egi.eu/dashboard Operations Dashboard] and are thought to be a warning for NGI informing about poor performance of site within the last 30 days.  


'''Handling alarms:'''
== Entities involved in the procedure  ==


ROD should treat the alarm as a warning that availability for the period of last 30 days has dropped below 70%.
'''Regional Operator on Duty (ROD)''': team provided by NGIs and responsible for handling RC&nbsp;availability and reliability alarms through Dashboard in Operations Portal. <br>
The alarm is handled identically to other alarms: usually a ticket must be submitted to the site. It can be closed as soon as the alarm goes into OK status (however it is recommended to make sure it is a couple percent above the threshold before closing it. e.g. 80%). If the problem continues for over 30 days the ticket should be closed. If the alarm is raised again, ROD has to open a new ticket. This should motivate the site to work on the problem.


It is up to ROD whether they ask for site's explanation.
'''Central Operator on Duty (COD)''': team provided by EGI and responsible for handling of underperforming sites which were below the target for 3 consecutive months.  


= Process of handling Core services Availability and Reliability  =
'''NGI&nbsp;manager:'''&nbsp;person who suspend the underperforming site or provide site justification.<br>


==Entities involved in the procedure ==
== Steps  ==


When an alarm is raised, it means that the Availability metric has dropped below the threshold of 70% for the last 30 days period.
'''Handling alarms:'''<br>
#ROD should treat the alarm as a warning that availability for the period of last 30 days has dropped below 70%.
#A ticket must be submitted to the site:
#*It can be closed as soon as the alarm goes into OK status (however it is recommended to make sure it is a couple percent above the threshold before closing it. e.g. 80%).
#*If the problem continues for over 30 days the ticket should be closed otherwise ticket will appear on COD dashboard.
#*If the alarm is raised again, ROD has to open a new ticket. This should motivate the site to work on the problem.
'''Handling of underperforming sites (below the target for 3 consecutive months):'''
{| class="wikitable"
|-
! Step#
! Responsible
! Action
|- valign="top"
| 1
| COD
| Creates a GGUS ticket for each underperforming site.
|- valign="top"
| 2
| NGI manager
| Can suspend the site or ask to not suspend the site by providing adequate explanation<br>
|- valign="top"
| 3
| COD
|
*In the case of '''no''' NGI intervention, the site is suspended in GOC DB.
*In the case of NGI intervention:
**non suspension will occur if the COD team agree on the reasoning provided by the NGI (COO may be involved)
**if availability shows no improvement COD can suspend the site
|}
'''Recomputation precedure'''
In case of doubts about the validity of Availability/Reliability reports, a RC/NGI can request recomputations according to the procedure defined at [[PROC10]]
= Process of handling Core services Availability and Reliability  =


==Steps==
*'''Generation of statistics'''
*'''Generation of statistics'''


Availability and reliability statistics are automatically generated the first week of the month by the [[External_tools#Availability_Computation_Engine| Availability Computation Engine]] (Gridview until May 2011) using the profile in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/].  
Availability and reliability statistics are automatically generated the first week of the month by the [[External tools#Availability_Computation_Engine|Availability Computation Engine]] using the profile in pdf format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/].  


*'''Preliminary processing'''
*'''Preliminary processing'''


Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics will be provided on a regular basis at this wiki page.  
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded into the EGI document server. Links to monthly statistics are provided on a regular basis at [[Availability_and_reliability_monthly_statistics|Availability and reliability monthly statistics page]].<br>


*'''Publication'''
*'''Publication'''
Line 54: Line 89:
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible of supervising statistics by chasing NGIs to provide comments in case thresholds are not met. This phase starts by filing a ticket to the COD Support Unit. The overall comments gathering process is handled through tickets.  
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible of supervising statistics by chasing NGIs to provide comments in case thresholds are not met. This phase starts by filing a ticket to the COD Support Unit. The overall comments gathering process is handled through tickets.  


*'''Handling of sites below targets'''
== Entities involved in the procedure  ==
 
'''Central Operator on Duty (COD)''': team provided by EGI and responsible for handling Core services availability and reliability reports.
 
== Steps  ==
 


For a core services that misses availability/reliability targets:
'''Handling of sites below targets'''


#a child ticket is opened by the COD team and assigned to the respective NGI, asking for explanation to be given. [[Grid_operations_oversight/WI04 | Core services report work instruction for COD]]
For a core services that misses availability/reliability targets:
 
#a child ticket is opened by the COD team and assigned to the respective NGI, asking for explanation to be given. [[Grid operations oversight/WI04|Core services report work instruction for COD]]  
#the explanation must be produced within 10 working days since the ticket is received. Reminders and escalation is performed in accordance to COD escalation procedures [[PROC01]].  
#the explanation must be produced within 10 working days since the ticket is received. Reminders and escalation is performed in accordance to COD escalation procedures [[PROC01]].  
#if the explanation is found satisfactory the ticket is closed  
#if the explanation is found satisfactory the ticket is closed  
#* conversely if the explanation is not given in due time, or the explanation is found inadequate, COD escalation procedure will be followed [[PROC01]]
#*conversely if the explanation is not given in due time, or the explanation is found inadequate, COD escalation procedure will be followed [[PROC01]]  
#the child ticket can then be closed  
#the child ticket can then be closed  
#the parent ticket will be closed when all child tickets have been closed.
#the parent ticket will be closed when all child tickets have been closed.


*'''Recomputation precedure'''
'''Recomputation precedure'''


Should there be doubts about the validity of Availability/Reliability reports, a RC/NGI can request recomputations according to the procedure defined at [[PROC10]]
In case of doubts about the validity of Availability/Reliability reports, a RC/NGI can request recomputations according to the procedure defined at [[PROC10]][[PROC10|PROC10]]  


= Known issues and recommendations to NGIs  =
= Known issues and recommendations to NGIs  =


#'''Newly certified sites''' will get inaccurate Availability/Reliability figures for the month they were certified and all months before that. ACE takes into account the Certification status of the site in GOCDB in order to decide if metrics should be calculated for the site. Because the Certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check if they have sites affected by this issue and report it as explanation. ''More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. ''
#'''Newly certified sites''' will get inaccurate Availability/Reliability figures for the month they were certified and all months before that. ACE takes into account the Certification status of the site in GOCDB in order to decide if metrics should be calculated for the site. Because the Certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check if they have sites affected by this issue and report it as explanation. ''More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. ''  
#'''Recalculation''' - The calculations performed by ACE always take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past. The implication of this is that any complete recalculation has the risk of altering the results for sites that had correct numbers in the first place. Thus until a solution is found, complete recalculations are avoided whenever possible, and errors are fixed on per site basis for those that have lower number than they should.  
#'''Recalculation''' - The calculations performed by ACE always take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past. The implication of this is that any complete recalculation has the risk of altering the results for sites that had correct numbers in the first place. Thus until a solution is found, complete recalculations are avoided whenever possible, and errors are fixed on per site basis for those that have lower number than they should.  
#'''Weighted availability''' is calculated by multiplying the number of logical CPUs a site published with the published HEPSPEC value in BDII. It is important that these numbers are correct, if HEPSPEC for a site is too high or too low (for example in case of mistake) the overall NGI wighted availability will be affected.
#'''Weighted availability''' is calculated by multiplying the number of logical CPUs a site published with the published HEPSPEC value in BDII. It is important that these numbers are correct, if HEPSPEC for a site is too high or too low (for example in case of mistake) the overall NGI wighted availability will be affected.
Line 79: Line 121:
{| class="wikitable"
{| class="wikitable"
|-
|-
! Version !! Authors !! Date !! Comments
! Version  
! Authors  
! Date  
! Comments
|-
|-
|  
| <br>
|  
| <br>
|  
| <br>
|  
| <br>
|}
|}
<br>
[[Category:Operations_Procedures]]

Revision as of 17:39, 21 November 2012


Title: Quality verification of monthly availability and reliability statistics
Document link: https://wiki.egi.eu/wiki/PROC04
Last modified: 2.0
Policy Group Acronym: OMB
Policy Group Name: Operations Management Board
Contact Group: operations at mailman.egi.eu
Document Status: Approved
Approved Date: 30 October 2012
Procedure Statement: Instructions for RODs and Operations Centres on how to handle justification for poor monthly performance
Owner: Owner of procedure


Overview

This document describes how to handle justifications for poor monthly performance.

Definitions

Please refer to the EGI Glossary for the definitions of the terms used in this procedure.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Process of handling RC Availability and Reliability

Availability alarms are raised on the Operations Dashboard and are intended as a warning to the NGI about poor performance of a site within the last 30 days.

Entities involved in the procedure

Regional Operator on Duty (ROD): team provided by NGIs and responsible for handling RC availability and reliability alarms through the Dashboard in the Operations Portal.

Central Operator on Duty (COD): team provided by EGI and responsible for handling underperforming sites that have been below the target for 3 consecutive months.

NGI manager: person who suspends the underperforming site or provides a justification for the site.

Steps

When an alarm is raised, it means that the Availability metric has dropped below the threshold of 70% over the last 30-day period.

Handling alarms:

  1. ROD should treat the alarm as a warning that availability over the last 30 days has dropped below 70%.
  2. A ticket must be submitted to the site:
    • It can be closed as soon as the alarm goes into OK status. However, it is recommended to wait until availability is clearly above the threshold (e.g. 80%) before closing it.
    • If the problem continues for over 30 days, the ticket should be closed; otherwise the ticket will appear on the COD dashboard.
    • If the alarm is raised again, ROD has to open a new ticket. This should motivate the site to work on the problem.
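
The alarm-handling rules above can be sketched as a small decision function. This is only an illustration, not part of the Operations Portal: the function name, the action labels and the use of 80% as the "safe" closing level are assumptions made for this example (the procedure gives 80% only as an example value).

```python
# Illustrative sketch only; names and action labels are hypothetical.
ALARM_THRESHOLD = 70.0  # procedure: alarm when 30-day availability drops below 70%
SAFE_CLOSE_LEVEL = 80.0  # assumption: the "e.g. 80%" margin suggested before closing
MAX_TICKET_AGE_DAYS = 30  # procedure: close the ticket if the problem lasts over 30 days


def rod_action(availability_30d: float, ticket_open: bool, ticket_age_days: int = 0) -> str:
    """Return a hypothetical action label for the ROD team."""
    if availability_30d < ALARM_THRESHOLD:
        # The alarm is (still) raised.
        if not ticket_open:
            return "open_ticket"
        if ticket_age_days > MAX_TICKET_AGE_DAYS:
            # Close the old ticket; a new one is opened if the alarm is raised again.
            return "close_ticket_after_30_days"
        return "keep_ticket_open"
    # The alarm is in OK status.
    if ticket_open:
        if availability_30d >= SAFE_CLOSE_LEVEL:
            return "close_ticket"
        # Closing is allowed, but waiting for a safer margin is recommended.
        return "may_close_ticket"
    return "no_action"
```

For instance, under these assumptions a site at 72% with an open ticket yields "may_close_ticket": the alarm is OK, but the recommended margin has not yet been reached.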

Handling of underperforming sites (below the target for 3 consecutive months):

Step# | Responsible | Action
1 | COD | Creates a GGUS ticket for each underperforming site.
2 | NGI manager | Can suspend the site, or ask that the site not be suspended by providing an adequate explanation.
3 | COD |
  • In the case of no NGI intervention, the site is suspended in GOCDB.
  • In the case of NGI intervention:
    • the site is not suspended if the COD team agrees with the reasoning provided by the NGI (COO may be involved)
    • if availability shows no improvement, COD can suspend the site
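
As a rough illustration of how sites "below the target for 3 consecutive months" might be selected, the sketch below checks the three most recent monthly figures of each site. The function names, the input layout and the target value passed by the caller are assumptions; this procedure does not state the target figure.

```python
# Illustrative sketch only; names and data layout are hypothetical.
from typing import Dict, List


def below_target_three_months(monthly_values: List[float], target: float) -> bool:
    """True if the three most recent monthly values are all below the target."""
    recent = monthly_values[-3:]
    return len(recent) == 3 and all(value < target for value in recent)


def underperforming_sites(history: Dict[str, List[float]], target: float) -> List[str]:
    """Return the sorted names of sites COD would open a ticket for.

    history maps a site name to its monthly figures, oldest first.
    """
    return sorted(
        site for site, values in history.items()
        if below_target_three_months(values, target)
    )
```

A site with fewer than three months of data is never selected here, which is a design choice of this sketch rather than a rule from the procedure.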

Recomputation procedure

In case of doubts about the validity of Availability/Reliability reports, an RC/NGI can request recomputations according to the procedure defined in PROC10.

Process of handling Core services Availability and Reliability

  • Generation of statistics

Availability and reliability statistics are automatically generated in the first week of the month by the Availability Computation Engine using the profile in PDF format and placed under http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/.

  • Preliminary processing

Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, statistics are uploaded to the EGI document server. Links to monthly statistics are provided on a regular basis on the Availability and reliability monthly statistics page.

  • Publication

An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics by asking NGIs to provide comments when thresholds are not met. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.

Entities involved in the procedure

Central Operator on Duty (COD): team provided by EGI and responsible for handling Core services availability and reliability reports.

Steps

Handling of sites below targets

For a core service that misses availability/reliability targets:

  1. a child ticket is opened by the COD team and assigned to the respective NGI, asking for an explanation (see the Core services report work instruction for COD)
  2. the explanation must be produced within 10 working days of receiving the ticket. Reminders and escalation are performed in accordance with the COD escalation procedure PROC01.
  3. if the explanation is found satisfactory the ticket is closed
    • conversely, if the explanation is not given in due time, or the explanation is found inadequate, the COD escalation procedure PROC01 will be followed
  4. the child ticket can then be closed
  5. the parent ticket will be closed when all child tickets have been closed.
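
The 10-working-day deadline in step 2 can be estimated with a short helper. This is a minimal sketch under stated assumptions: counting starts on the day after the ticket is received, working days are Monday to Friday, and public holidays are ignored. The function name is hypothetical.

```python
# Illustrative sketch only; assumes Monday-Friday working days and no holidays.
from datetime import date, timedelta


def explanation_due(received: date, working_days: int = 10) -> date:
    """Return the date by which the NGI explanation is due."""
    current = received
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current
```

Under these assumptions, a ticket received on Monday 5 November 2012 would be due on Monday 19 November 2012.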

Recomputation procedure

In case of doubts about the validity of Availability/Reliability reports, an RC/NGI can request recomputations according to the procedure defined in PROC10.

Known issues and recommendations to NGIs

  1. Newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that. ACE takes into account the Certification status of the site in GOCDB in order to decide if metrics should be calculated for the site. Because the Certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check if they have sites affected by this issue and report it as the explanation. More information at https://gus.fzk.de/ws/ticket_info.php?ticket=60594 and https://gus.fzk.de/ws/ticket_info.php?ticket=60925.
  2. Recalculation - The calculations performed by ACE always take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past. As a result, any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, complete recalculations are avoided whenever possible, and errors are fixed on a per-site basis for those sites whose numbers are lower than they should be.
  3. Weighted availability is calculated by multiplying the number of logical CPUs a site publishes by the published HEPSPEC value in BDII. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example, by mistake), the overall NGI weighted availability will be affected.
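
To see why a wrong HEPSPEC value matters, the sketch below computes an NGI-level figure as a weighted mean of site availabilities, with each site's weight taken as logical CPUs multiplied by HEPSPEC. The weighted-mean formula, the function name and the input layout are assumptions made for illustration; the exact computation performed by ACE is not described in this procedure.

```python
# Illustrative sketch only; the weighted-mean formula is an assumption.
from typing import List, Tuple

# Each site is (availability in percent, logical CPUs, published HEPSPEC value).
Site = Tuple[float, int, float]


def ngi_weighted_availability(sites: List[Site]) -> float:
    """Weighted mean of site availabilities, weight = logical CPUs * HEPSPEC."""
    weights = [cpus * hepspec for _, cpus, hepspec in sites]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("total weight is zero; check published CPU/HEPSPEC values")
    weighted_sum = sum(avail * weight for (avail, _, _), weight in zip(sites, weights))
    return weighted_sum / total_weight
```

With two equally sized sites at 90% and 50%, this gives 70%. If the second site published a HEPSPEC value ten times too high, the same call would give roughly 53.6%, so one mistyped value can move the whole NGI figure.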

Revision history

Version Authors Date Comments