The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team using operations @ egi.eu.

Difference between revisions of "PROC10 Recomputation of SAM results or availability reliability statistics"

m (Protected "PROC10" ([edit=sysop] (indefinite) [move=sysop] (indefinite)))
(Remove deprecated content)
Tag: Replaced
 
(30 intermediate revisions by 4 users not shown)
{{Template:Op menubar}} {{Template:Doc_menubar}} {{TOC_right}}
[[Category:Deprecated]]
[[Category:Operations Procedures]]
{| style="border:1px solid black; background-color:lightgrey; color: black; padding:5px; font-size:140%; width: 90%; margin: auto;"
| style="padding-right: 15px; padding-left: 15px;" |
|[[File:Alert.png]] This page is '''Deprecated'''; the content has been moved to https://confluence.egi.eu/display/EGIPP/PROC10+Recomputation+of+SAM+results+or+availability+reliability+statistics
|}

= Procedure for the recomputation of SAM results and/or availability/reliability statistics =
 
*'''Title''': Recomputation of SAM results and/or availability/reliability statistics
*'''Document link''': https://wiki.egi.eu/wiki/PROC10
*'''Last modified''': 03 May 2012
*'''Version''': 1.2
*'''Policy Group Acronym''': OMB
*'''Policy Group Name''': Operations Management Board
*'''Contact Person''': George Fergadis/AUTH
*'''Document Status''': APPROVED
*'''Approved Date''': 26 March 2012
*'''Procedure Statement''': This procedure documents the steps for requesting a correction in the SAM test results and in the related availability/reliability statistics.
 
= Overview  =
This procedure documents the steps for requesting a correction in the OPS VO
[[SAM_Instances|SAM test results]] and in the related [[Availability_and_reliability_monthly_statistics|availability/reliability statistics]] if applicable.
<!--A recomputation of these statistics for the affected month is not needed if test results are notified and corrected before the statistics of that month are computed and distributed. Problems with the SAM results should be notified as soon as possible once detected, in order to allow sufficient time for fixing of these and thus to avoid that monthly availability/reliability statistics for the affected month have to be re-computed.-->
 
DISCLAIMER: This procedure applies only to EGI OPS test results. Procedures for computing VO-specific availability reports are defined by each VO and are outside the scope of this document.
 
= Who can submit a request? =
Re-computations can be requested by:
* site administrators
* regional operations staff.
 
= Re-computation policy =
'''Starting from 01 May 2012, monitoring results can be recomputed only in case of problems with the monitoring infrastructure itself. No re-computations will be performed for issues with the deployed middleware (e.g. documented bugs affecting the availability of a production service end-point); such issues will consequently be reflected in lower availability/reliability.'''
 
Some examples of possible issues justifying a re-computation request:
* invalid proxy certificate used for submitting the monitoring probes in a Nagios instance;
* problems with the Storage Element used for replica management tests, resulting in errors in the CE metrics.
 
'''The deadline for requesting re-computations is 10 calendar days after the publication and announcement of the monthly Availability/Reliability reports for a given month X (typically the announcement is distributed on the 1st day of month X+1).'''

'''Based on the re-computation requests received, A/R reports will be regenerated only once for each month, after the 10th of month X+1.'''
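
The deadline rule above can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch assumes that the 10 calendar days are counted from the announcement date and that a request filed on the last day is still accepted; neither detail is spelled out in the policy text.

```python
from datetime import date, timedelta

# Assumption: the window is counted in calendar days from the announcement
# date, and a request filed on the deadline day itself is still on time.
REQUEST_WINDOW_DAYS = 10


def request_deadline(announcement: date) -> date:
    """Return the last day on which a re-computation request can be filed."""
    return announcement + timedelta(days=REQUEST_WINDOW_DAYS)


def is_request_on_time(request_day: date, announcement: date) -> bool:
    """True if the request is filed no later than the deadline."""
    return request_day <= request_deadline(announcement)


if __name__ == "__main__":
    # A/R report for March 2012 (month X), announced on 1 April (month X+1).
    announced = date(2012, 4, 1)
    print(request_deadline(announced))                         # 2012-04-11
    print(is_request_on_time(date(2012, 4, 12), announced))    # False
```

With an announcement on the 1st of month X+1, the computed deadline falls on the 11th, which is consistent with reports being regenerated after the 10th of that month.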
 
= How to request a re-computation of OPS monitoring results =
 
== The request is originated by a site ==
STEP 1 (RC). As soon as the problem is detected, notify your NGI operations centre by opening a [http://helpdesk.egi.eu/ GGUS ticket]. Please address the ticket to your Operations Centre support unit, which is responsible for validating the request.
In the GGUS ticket you must mention:
# the starting and ending time of the problem (including day and hour in UTC)
# the site and the NGI (or federation of NGIs) affected by the problem
# the VO affected by the problem (must be the OPS VO)
# a description of the problem
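
The four items listed above can be checked before a ticket is opened. The following Python sketch is only an illustration: the <tt>RecomputationRequest</tt> class and its field names are hypothetical and are not part of GGUS or SAM.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass(frozen=True)
class RecomputationRequest:
    """Hypothetical container for the items a ticket must mention."""
    start: datetime       # start of the problem, expected in UTC
    end: datetime         # end of the problem, expected in UTC
    site: str             # affected site
    ngi: str              # affected NGI or federation of NGIs
    vo: str               # affected VO, must be the OPS VO
    description: str      # description of the problem

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; empty if the request is complete."""
        issues = []
        times_ok = True
        for label, value in (("start", self.start), ("end", self.end)):
            # Naive datetimes have no offset (None), so they are rejected too.
            if value.utcoffset() != timedelta(0):
                issues.append(f"{label} time must be expressed in UTC")
                times_ok = False
        if times_ok and self.start >= self.end:
            issues.append("start time must be earlier than end time")
        if not self.site.strip():
            issues.append("affected site is missing")
        if not self.ngi.strip():
            issues.append("affected NGI/federation is missing")
        if self.vo.strip().lower() != "ops":
            issues.append("only the OPS VO is in scope")
        if not self.description.strip():
            issues.append("problem description is missing")
        return issues


if __name__ == "__main__":
    request = RecomputationRequest(
        start=datetime(2012, 2, 12, 16, 0, tzinfo=timezone.utc),
        end=datetime(2012, 2, 12, 19, 0, tzinfo=timezone.utc),
        site="EXAMPLE-SITE",          # placeholder site name
        ngi="NGI_EXAMPLE",            # placeholder NGI name
        vo="ops",
        description="Expired proxy used by the Nagios instance",
    )
    print(request.problems())
```

An empty list means that all four items are present and consistent; otherwise each entry names one item to fix before submitting the ticket.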
 
STEP 2 (OC). The NGI operations centre validates the request.
 
STEP 3 (OC). If the request is deemed valid, a GGUS ticket is sent to the [[GGUS:SLM-FAQ|Service Level Management]] (SLM) Support Unit. The SLM support team is responsible for discussing all received requests with the SAM team.
 
STEP 4 (SLM SU). The SLM SU is responsible for:
# validating the reported problems
# discussing the reported problems with the SAM Support Unit if needed
# notifying the SAM SU about the requests received, by submitting a new parent ticket to SAM with the tickets of the validated requests as its children
 
STEP 5 (SAM SU).
# The SAM Support Unit is responsible for checking the requests and for regenerating the results. For accepted requests, all Nagios metric results for any site and service are set to ''unknown'' status from the beginning of the hour reported in the starting time to one hour after the ending time. The extra hour covers results that may have arrived late. The availability and reliability of other sites are not affected, as unknown periods are not considered in the computation.
# Monthly availability statistics are recomputed for that particular period, site, and NGI or federation of NGIs.
# A new report will be made available 10 days after the first publication of the report.
# After publication of the new report, all child GGUS tickets will be closed.
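
The effect of the ''unknown'' window described in STEP 5 can be sketched in Python. This is not the SAM implementation: the hourly status buckets, the status names and the simplified availability formula (OK hours divided by known hours) are assumptions made for illustration, and the reported times are assumed to be already rounded to full hours.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional, Tuple


def unknown_window(start: datetime, end: datetime) -> Tuple[datetime, datetime]:
    """Window from the beginning of the start hour to one hour after the end time."""
    window_start = start.replace(minute=0, second=0, microsecond=0)
    window_end = end + timedelta(hours=1)
    return window_start, window_end


def mask_unknown(hourly_status: Dict[datetime, str],
                 start: datetime, end: datetime) -> Dict[datetime, str]:
    """Return a copy of the hourly results with the window set to UNKNOWN."""
    window_start, window_end = unknown_window(start, end)
    return {
        hour: ("UNKNOWN" if window_start <= hour < window_end else status)
        for hour, status in hourly_status.items()
    }


def availability(hourly_status: Dict[datetime, str]) -> Optional[float]:
    """Simplified availability: OK hours over known hours (UNKNOWN is ignored)."""
    known = [s for s in hourly_status.values() if s != "UNKNOWN"]
    if not known:
        return None  # nothing to compute from
    return sum(s == "OK" for s in known) / len(known)


if __name__ == "__main__":
    day = datetime(2012, 2, 12, tzinfo=timezone.utc)
    # One day of hourly results: failures at 10:00, 11:00 and 20:00 UTC.
    results = {day + timedelta(hours=h): "OK" for h in range(24)}
    for h in (10, 11, 20):
        results[day + timedelta(hours=h)] = "CRITICAL"

    # Accepted request covering 10:00 to 12:00 UTC.
    fixed = mask_unknown(results, day + timedelta(hours=10), day + timedelta(hours=12))
    print(availability(results), availability(fixed))
```

In this example the failures at 10:00 and 11:00 fall inside the window and stop counting against the site, while the unrelated failure at 20:00 still does.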
 
STEP 6 (SLM SU).
# The parent ticket is closed.
 
== The request is originated by an NGI/EIRO operations centre ==
Follow the procedure defined in the section above, starting from STEP 3.
 
<!-- # '''STEP 3''': if the request for recomputation of the test results is accepted, the SAM Support Unit will be reponsible of fixing the results and of triggering a recomputation of the monthly availability statistics if necessary. The following these steps are followed:
## All Nagios metric results for any site and service are set to ''unknown'' status from the beginning of the hour reported in the starting time to one hour after the ending time. This is to cover late results that could have arrived later.
## Availability/reliability are then recomputed for that particular period, Site, NGI/federation of NGIs if necessary. As a consequence, the availability and reliability of other sites won't be affected, as unknown periods are not considered in the computation.
# '''STEP 4''': in case new availability/reliability statistics are computed, when these are ready for distribution, the SAM/Nagios SU reassignes the ticket to the SLM Support Unit, in order to notify that a new set of reports can be re-distributed to EGI.-->
 
= External links =
* [https://tomtools.cern.ch/confluence/display/SAMDOC/Availability+Re-computation+Policy WLCG Availability re-computation policy]
 
= Tips =
* Date formats
You can use the Unix <tt>date</tt> command to convert the start and end time from your time zone to <tt>UTC</tt> using the [http://en.wikipedia.org/wiki/ISO_8601 ISO 8601] format.
 
''The start time must be rounded down to the full hour and the end time must be rounded up to the next full hour.''
 
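Example:
<pre>
$ date --date="12 Feb 2012 17:00 CET" --utc --iso-8601=minutes
</pre>
will give (the UTC offset may be printed as <tt>+0000</tt> or <tt>+00:00</tt>, depending on the version of GNU <tt>date</tt>):
<pre>
2012-02-12T16:00+0000
</pre>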
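
The conversion and rounding can also be done in one step. The Python sketch below is an illustration only; the function name <tt>to_utc_hour_bounds</tt> is hypothetical and CET is modelled as a fixed UTC+1 offset, which ignores daylight saving time.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def to_utc_hour_bounds(start_local: datetime,
                       end_local: datetime) -> Tuple[datetime, datetime]:
    """Convert to UTC, round the start down and the end up to a full hour."""
    if start_local.tzinfo is None or end_local.tzinfo is None:
        raise ValueError("both times must carry a time zone")
    start_utc = start_local.astimezone(timezone.utc)
    end_utc = end_local.astimezone(timezone.utc)
    start_floor = start_utc.replace(minute=0, second=0, microsecond=0)
    end_floor = end_utc.replace(minute=0, second=0, microsecond=0)
    # Only move to the next hour if the end time is not already on the hour.
    end_ceil = end_floor if end_floor == end_utc else end_floor + timedelta(hours=1)
    return start_floor, end_ceil


if __name__ == "__main__":
    # Assumption: CET as a fixed UTC+1 offset (no daylight saving handling).
    cet = timezone(timedelta(hours=1), "CET")
    start, end = to_utc_hour_bounds(
        datetime(2012, 2, 12, 17, 20, tzinfo=cet),
        datetime(2012, 2, 12, 19, 5, tzinfo=cet),
    )
    print(start.strftime("%Y-%m-%dT%H:%M%z"))  # 2012-02-12T16:00+0000
    print(end.strftime("%Y-%m-%dT%H:%M%z"))    # 2012-02-12T19:00+0000
```

The two printed values are the start and end times to report in the GGUS ticket.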
 
= Revision history  =
* 03/05/2012: updated the policy and procedure to reflect the OMB decision of the March 2012 meeting.
* 17/01/2012: revised the text of the procedure to clarify that both RC administrators and regional operations staff can request a re-computation.
* 16/01/2012: revised the text of the procedure to clarify that the recomputation of test results can be requested before the end of the affected month; if sufficient time is allowed for fixing the test results, no re-computation of availability/reliability statistics is needed.

Latest revision as of 09:43, 15 April 2022