Revision as of 12:02, 22 November 2012
Title | Quality verification of monthly availability and reliability statistics |
Document link | https://wiki.egi.eu/wiki/PROC04 |
Last modified | 2.0 |
Policy Group Acronym | OMB |
Policy Group Name | Operations Management Board |
Contact Group | operations at mailman.egi.eu |
Document Status | Approved |
Approved Date | 30 October 2012 |
Procedure Statement | Instructions for RODs and Operations Centres on how to handle justification for poor monthly performance |
Owner | Owner of procedure |
Overview
This document describes how to handle justification for poor monthly performance.
Links to all monthly statistics are provided on a regular basis on the Availability and reliability monthly statistics page.
Definitions
Please refer to the EGI Glossary for the definitions of the terms used in this procedure.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Process of handling RC Availability and Reliability
Availability alarms are raised on the Operations Dashboard and are intended as a warning to the NGI about the poor performance of a site within the last 30 days.
Entities involved in the procedure
Regional Operator on Duty (ROD): team provided by NGIs and responsible for handling RC availability and reliability alarms through the Dashboard in the Operations Portal.
Central Operator on Duty (COD): team provided by EGI and responsible for handling underperforming sites that were below the target for 3 consecutive months.
NGI manager: person who suspends the underperforming site or provides a justification for the site.
Steps
An alarm is raised when the Availability metric has dropped below the threshold of 70% over the last 30-day period.
Handling alarms:
Step# | Responsible | Action |
---|---|---|
1 | ROD | Creates a ticket through the dashboard notifying the site administrator that the Availability metric has dropped below the threshold of 70% over the last 30-day period. The expiration date should be set no later than the same date in the following month minus 1 day. |
2 | ROD | Escalation of the ticket varies between NGIs. NGIs are free to decide whether to apply an escalation procedure or to treat availability tickets simply as a notification for site administrators. |
3 | ROD |
|
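The expiration rule in step 1 ("same date next month minus 1 day") can be illustrated with a short date calculation. This is only a sketch: the function name `latest_expiration` is hypothetical, and the handling of months that lack the same day number (for example a ticket created on 31 January) is an assumption, since the procedure does not define it.

```python
import calendar
from datetime import date, timedelta


def latest_expiration(created: date) -> date:
    """Return the latest expiration date allowed for a ticket created on `created`.

    Rule from the procedure: same date next month, minus 1 day.
    """
    # Move to the following month, rolling over the year after December.
    year = created.year + created.month // 12
    month = created.month % 12 + 1
    # Assumption (not stated in the procedure): if the following month is
    # shorter, clamp to its last day before subtracting one day.
    last_day = calendar.monthrange(year, month)[1]
    same_date_next_month = date(year, month, min(created.day, last_day))
    return same_date_next_month - timedelta(days=1)


if __name__ == "__main__":
    # A ticket created on 22 November 2012 may expire no later than 21 December 2012.
    print(latest_expiration(date(2012, 11, 22)))
```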
Handling of underperforming sites (below the target for 3 consecutive months):
Step# | Responsible | Action |
---|---|---|
1 | COD | Creates a GGUS ticket for each underperforming site. |
2 | NGI manager | Within 10 working days, the NGI manager can suspend the site or ask that it not be suspended by providing an adequate explanation. |
3 | COD |
|
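The two conditions in this table (below the target for 3 consecutive months, and a reply within 10 working days) can be sketched as follows. The function names are hypothetical, the target value is left as a parameter because it is not stated here, and working days are assumed to be Monday to Friday with public holidays ignored.

```python
from datetime import date, timedelta


def is_underperforming(monthly_values, target, months=3):
    """True if the most recent `months` monthly values are all below `target`."""
    recent = monthly_values[-months:]
    return len(recent) == months and all(value < target for value in recent)


def add_working_days(start: date, working_days: int) -> date:
    """Return the date `working_days` working days after `start`.

    Assumption: working days are Monday to Friday; holidays are not considered.
    """
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return current


if __name__ == "__main__":
    # Hypothetical monthly availability figures (in percent), oldest first.
    history = [82.0, 65.0, 61.5, 68.0]
    print(is_underperforming(history, target=70.0))
    # Reply deadline for a ticket created on Thursday 22 November 2012.
    print(add_working_days(date(2012, 11, 22), 10))
```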
Recomputation procedure
In case of doubt about the validity of Availability/Reliability reports, an RC/NGI can request a recomputation according to the procedure defined in PROC10.
Process of handling Core services Availability and Reliability
Generation of statistics
Availability and reliability statistics are generated automatically in PDF format during the first week of each month by the Availability Computation Engine (ACE) using the profile, and are placed under [1].
Preliminary processing
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, the statistics are uploaded to the EGI document server. Links to monthly statistics are provided on a regular basis on the Availability and reliability monthly statistics page.
Publication
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics by chasing NGIs for comments when thresholds are not met. This phase starts with a ticket filed to the COD Support Unit. The whole process of gathering comments is handled through tickets.
Entities involved in the procedure
Central Operator on Duty (COD): team provided by EGI and responsible for handling Core services availability and reliability reports.
Core service administrator: person responsible for running a Core service.
Steps
Handling of Core Services below targets
Step# | Responsible | Action |
---|---|---|
1 | COD | Creates a GGUS ticket, assigned to the respective NGI, asking for an explanation. The explanation must be provided within 10 working days of the ticket being received. |
2 | Core service administrator | Provides an explanation and an improvement plan in the GGUS ticket. |
3 | COD |
|
4 | COD | Closes the parent ticket when all child tickets have been closed and provides a summary as the solution. |
Recomputation procedure
In case of doubt about the validity of Availability/Reliability reports, an RC/NGI can request a recomputation according to the procedure defined in PROC10.
Known issues and recommendations to NGIs
- Newly certified sites will get inaccurate Availability/Reliability figures for the month in which they were certified and for all months before that. ACE takes into account the Certification status of the site in GOCDB to decide whether metrics should be calculated for the site. Because the Certification status history is not currently available in the operations tools, NGIs should, until a solution is implemented, check whether they have sites affected by this issue and report it as the explanation. More information at [2] and [3].
- Recalculation - The calculations performed by ACE always take into account the information system status and GOCDB information at the time the calculation is performed, not that of a checkpoint in the past. As a result, any complete recalculation risks altering the results for sites that had correct numbers in the first place. Until a solution is found, complete recalculations are therefore avoided whenever possible, and errors are fixed on a per-site basis for sites that have lower numbers than they should.
- Weighted availability is calculated by multiplying the number of logical CPUs a site published with the published HEPSPEC value in BDII. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example, because of a mistake), the overall NGI weighted availability will be affected.
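The effect of a wrong HEPSPEC value on the NGI figure can be illustrated with a small calculation. This is a sketch under assumptions: the site weight (logical CPUs multiplied by HEPSPEC) follows the description above, but combining sites as a weight-averaged mean is an assumption about how the NGI figure is aggregated, and all names and numbers are hypothetical.

```python
def site_weight(logical_cpus: int, hepspec: float) -> float:
    """Weight of a site: published logical CPUs multiplied by published HEPSPEC."""
    return logical_cpus * hepspec


def ngi_weighted_availability(sites) -> float:
    """Weight-averaged availability over all sites of an NGI.

    Assumption: the NGI figure is the weighted mean of the site availabilities.
    """
    weights = [site_weight(s["logical_cpus"], s["hepspec"]) for s in sites]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("total weight is zero; check published CPU and HEPSPEC values")
    weighted_sum = sum(w * s["availability"] for w, s in zip(weights, sites))
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical sites with correctly published values.
    correct = [
        {"logical_cpus": 100, "hepspec": 10.0, "availability": 90.0},
        {"logical_cpus": 100, "hepspec": 10.0, "availability": 50.0},
    ]
    # Same sites, but the second one publishes a HEPSPEC value ten times too high.
    mistaken = [
        {"logical_cpus": 100, "hepspec": 10.0, "availability": 90.0},
        {"logical_cpus": 100, "hepspec": 100.0, "availability": 50.0},
    ]
    print(ngi_weighted_availability(correct))
    print(ngi_weighted_availability(mistaken))
```

In this hypothetical case the NGI figure falls from 70 to about 53.6 only because the site with the lower availability published an inflated HEPSPEC value.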
Revision history
Version | Authors | Date | Comments |
---|---|---|---|