The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team using operations @ egi.eu.

Difference between revisions of "PROC04 Quality verification of monthly availability and reliability statistics"

From EGIWiki
(Created page with "= Process of handling RC Availability and Reliability = Availability alarms are handled by ROD teams through Dashboard in Operations Portal. These alarms are thought to be a wa...")
 
(Remove deprecated content)
Tag: Replaced
 
(43 intermediate revisions by 4 users not shown)
Line 1: Line 1:
= Process of handling RC Availability and Reliability =
{{Template:Op menubar}} {{Template:Doc_menubar}}
 
[[Category:Deprecated]]
Availability alarms are handled by ROD teams through the Dashboard in the Operations Portal. These alarms are intended as a warning to the NGI about poor performance of a site within the last 30 days.
{| style="border:1px solid black; background-color:lightgrey; color: black; padding:5px; font-size:140%; width: 90%; margin: auto;"
 
| style="padding-right: 15px; padding-left: 15px;" |
'''Understanding the alarm:'''
|[[File:Alert.png]] This page is '''Deprecated'''; the content has been moved to https://confluence.egi.eu/display/EGIPP/PROC04+Quality+verification+of+monthly+availability+and+reliability+statistics 
 
|}
When an alarm is raised, it means that the Availability metric has dropped below the 70% threshold over the last 30-day period.
 
'''Handling alarms:'''
 
ROD should treat the alarm as a warning that availability over the last 30 days has dropped below 70%.
The alarm is handled identically to other alarms: usually a ticket must be submitted to the site. The ticket can be closed as soon as the alarm goes into OK status; however, it is recommended to make sure availability is a few percent above the threshold (e.g. 80%) before closing it. If the problem continues for over 30 days, the ticket should be closed. If the alarm is raised again, ROD has to open a new ticket. This should motivate the site to work on the problem.
 
It is up to ROD whether to ask the site for an explanation.
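
The alarm-handling rules above can be sketched as a small decision function. This is only an illustrative sketch, not part of the Operations Portal: the function name <code>next_action</code>, its parameters and the returned labels are hypothetical. The 70% threshold, the 80% example closing level and the 30-day limit are taken from the text; the sketch assumes ROD follows the recommendation to wait for the higher level before closing.

```python
# Illustrative sketch only; names and return labels are hypothetical.
ALARM_THRESHOLD = 70.0      # percent; alarm is raised below this (from the text)
SAFE_CLOSE_LEVEL = 80.0     # percent; example "comfortable" closing level (from the text)
MAX_TICKET_AGE_DAYS = 30    # ticket is closed if the problem lasts longer (from the text)


def next_action(availability_30d: float, ticket_open: bool, days_open: int = 0) -> str:
    """Return what ROD would do for a site, given its 30-day availability."""
    if not ticket_open:
        # An alarm (and hence a new ticket) only when below the threshold.
        return "open_ticket" if availability_30d < ALARM_THRESHOLD else "none"
    if days_open > MAX_TICKET_AGE_DAYS:
        # Problem persisted for over 30 days: close; a new alarm opens a new ticket.
        return "close_ticket"
    if availability_30d >= SAFE_CLOSE_LEVEL:
        # Alarm is OK and availability is comfortably above the threshold.
        return "close_ticket"
    return "keep_open"


if __name__ == "__main__":
    print(next_action(65.0, ticket_open=False))
    print(next_action(75.0, ticket_open=True, days_open=5))
    print(next_action(82.0, ticket_open=True, days_open=5))
```

Waiting for the higher closing level avoids closing and immediately reopening tickets for a site whose availability hovers around 70%.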
 
= Process of handling Core services Availability and Reliability  =
 
*'''Generation of statistics'''
 
Availability and reliability statistics are automatically generated in the first week of the month by the [[External_tools#Availability_Computation_Engine| Availability Computation Engine]] (Gridview until May 2011) using the profile in PDF format and placed under [http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/].
 
*'''Preliminary processing'''
 
Once the reports are generated, sanity checks are performed by EGI SA1 (Task TSA1.8). After this step is completed, the statistics are uploaded to the EGI document server. Links to the monthly statistics will be provided on a regular basis on this wiki page.
 
*'''Publication'''
 
An announcement of the new results is distributed by EGI SA1 (TSA1.8) to the NGI Operations Managers mailing list. COD (TSA1.7) is responsible for supervising the statistics by chasing NGIs to provide comments in case thresholds are not met. This phase starts by filing a ticket to the COD Support Unit. The overall comment-gathering process is handled through tickets.
 
*'''Handling of sites below targets'''
 
For a core service that misses availability/reliability targets:
 
#a child ticket is opened by the COD team and assigned to the respective NGI, asking for an explanation. [[Grid_operations_oversight/WI04 | Core services report work instruction for COD]]
#the explanation must be produced within 10 working days of the ticket being received. Reminders and escalation are performed in accordance with the COD escalation procedures [[PROC01]].
#if the explanation is found satisfactory, the ticket is closed
#* conversely, if the explanation is not given in due time, or is found inadequate, the COD escalation procedure will be followed [[PROC01]]
#the child ticket can then be closed
#the parent ticket will be closed when all child tickets have been closed.
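
The 10-working-day deadline in the steps above can be computed as follows. This is a minimal sketch: the helper name <code>add_working_days</code> is hypothetical, and it assumes that working days are Monday to Friday and ignores public holidays, which the procedure does not specify.

```python
from datetime import date, timedelta


def add_working_days(start: date, days: int) -> date:
    """Return the date `days` working days after `start`.

    Assumption: working days are Monday-Friday; holidays are not considered.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


if __name__ == "__main__":
    received = date(2011, 5, 2)  # a Monday, chosen only as an example
    print(add_working_days(received, 10))
```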
 
*'''Recomputation procedure'''
 
Should there be doubts about the validity of Availability/Reliability reports, an RC/NGI can request recomputations according to the procedure defined at [[PROC10]].
 
= Known issues and recommendations to NGIs  =
 
#ACE, like Gridview in the past, always calculates the availability and reliability of a site as soon as it shows up in GOCDB and in the BDIIs, regardless of its certification status. When processing the data to generate the availability/reliability report, ACE uses the certification status of the site at that moment to decide whether the site is certified, and so appears in the report, or uncertified, and so is excluded. '''Thus newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that.''' Because the certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as the explanation. More information at [https://gus.fzk.de/ws/ticket_info.php?ticket=60594] and [https://gus.fzk.de/ws/ticket_info.php?ticket=60925]. '''As of December 2010, Gridview included a snapshot feature so that availability takes into account the topology on the last day of the month. While it does not solve the problem completely, it reduces its impact. However, ACE reports (used since May 2011) do not include the snapshot feature yet.'''
#The calculations performed by ACE always '''take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past'''. The implication is that any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, '''complete recalculations are avoided whenever possible''', and errors are fixed on a per-site basis for those sites that have lower numbers than they should.
#Weighted availability is calculated by multiplying the number of logical CPUs a site published by the published HEPSPEC value. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example because of a mistake), the overall NGI weighted availability will be affected.
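
The effect of a wrong HEPSPEC value on the NGI figure can be illustrated with a short sketch. The site weight (logical CPUs multiplied by HEPSPEC) comes from the text. Aggregating sites into an NGI figure as a weight-normalised mean is an assumption made here for illustration, and the function names and the sample numbers are hypothetical, not ACE output.

```python
def site_weight(logical_cpus: int, hepspec: float) -> float:
    """Weight of a site: published logical CPUs times published HEPSPEC."""
    return logical_cpus * hepspec


def ngi_weighted_availability(sites) -> float:
    """Weight-normalised mean availability over (cpus, hepspec, availability) tuples.

    Assumption: the NGI figure is a weighted mean; the text only defines the weight.
    """
    weights = [site_weight(cpus, hepspec) for cpus, hepspec, _ in sites]
    total = sum(weights)
    if total == 0:
        raise ValueError("total weight is zero; cannot compute weighted availability")
    return sum(w * avail for w, (_, _, avail) in zip(weights, sites)) / total


if __name__ == "__main__":
    # Made-up sites: (logical CPUs, HEPSPEC, availability in percent).
    correct = [(100, 10.0, 90.0), (100, 10.0, 50.0)]
    # Same sites, but the second one publishes a HEPSPEC ten times too high.
    mistaken = [(100, 10.0, 90.0), (100, 100.0, 50.0)]
    print(ngi_weighted_availability(correct))
    print(ngi_weighted_availability(mistaken))
```

With equal weights the two sites average out; once the poorly performing site publishes an inflated HEPSPEC, it dominates the NGI figure and pulls it down.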

Latest revision as of 10:42, 15 April 2022