PROC10 Recomputation of SAM results or availability reliability statistics

From EGIWiki
Revision as of 15:06, 9 March 2012 by Fergadis (talk | contribs)

Procedure for the recomputation of SAM results and/or availability/reliability statistics

  • Title: Recomputation of SAM results and/or availability/reliability statistics
  • Document link: https://wiki.egi.eu/wiki/PROC10
  • Last modified: 16 Jan 2012
  • Version: 1.1
  • Policy Group Acronym: OMB
  • Policy Group Name: Operations Management Board
  • Contact Person: George Fergadis/AUTH
  • Document Status: APPROVED
  • Approved Date: 17 October 2011
  • Procedure Statement: This procedure documents the steps for requesting a correction to the SAM test results and, when applicable, to the related availability/reliability statistics.

Overview

This procedure documents the steps for requesting a correction to the SAM test results and, if applicable, to the related availability/reliability statistics. The statistics for the affected month do not need to be recomputed if the problem is reported and the test results are corrected before that month's statistics are computed and distributed. Problems with SAM results should therefore be reported as soon as they are detected, to allow enough time to fix them and to avoid recomputing the monthly availability/reliability statistics for the affected month.

DISCLAIMER: This procedure is only applicable to EGI OPS test results. Procedures for the computation of VO-specific availability reports are defined by each VO and are out of scope.

Who can submit a request?

Re-computations can be requested by site administrators and by regional operations staff.

Prerequisites

Fixes to test results are accepted only when the failures were caused by problems in the monitoring infrastructure itself. Some examples:

  • invalid proxy certificate used for submitting the monitoring probes in a Nagios instance;
  • problems with the Storage Element used for replica management tests, resulting in errors in the CE metrics.

Steps

  1. STEP 1: as soon as the problem is detected, report it by opening a GGUS ticket. If the submitter is a Resource Centre administrator, address the ticket to your Operations Centre support unit. If the submitter is a member of the regional operations staff, address the ticket to the Service Level Management support unit. The GGUS ticket must state:
    1. the starting and ending time of the problem (including day and hour in UTC)
    2. the Site, NGI/federation of NGIs affected by the problem
    3. the VO affected by the problem (must be the OPS VO)
    4. a description of the problem
  2. STEP 2: (only applicable if the request was submitted by a Resource Centre administrator) the Operations Centre analyzes the request. If the request is validated, the ticket is re-assigned to the Service Level Management (SLM) Support Unit, which is responsible for (1) collecting all reported problems and (2) discussing them with the SAM Support Unit by re-assigning the ticket to the SAM/Nagios SU.
  3. STEP 3: if the request for recomputation of the test results is accepted, the SAM Support Unit is responsible for fixing the results and, if necessary, for triggering a recomputation of the monthly availability statistics. The following steps are performed:
    1. All Nagios metric results for any site and service are set to unknown status from the beginning of the hour reported in the starting time to one hour after the ending time. The extra hour covers results that may have arrived late.
    2. Availability and reliability are then recomputed, if necessary, for that particular period, Site and NGI/federation of NGIs. The availability and reliability of other sites are not affected, because unknown periods are not considered in the computation.
  4. STEP 4: if new availability/reliability statistics are computed, then once they are ready for distribution the SAM/Nagios SU reassigns the ticket to the SLM Support Unit, to notify it that a new set of reports can be re-distributed to EGI.
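
The window described in step 3.1 and the effect of unknown periods mentioned in step 3.2 can be sketched as follows. This is a minimal Python illustration, not the SAM implementation: the function names unknown_window and availability are hypothetical, and the availability formula (time up divided by known time) is an assumption based only on the statement that unknown periods are not considered in the computation.

```python
from datetime import datetime, timedelta, timezone


def unknown_window(start_utc, end_utc):
    """Return the (begin, end) interval whose results are set to unknown.

    Follows step 3.1: from the beginning of the hour of the reported
    start time to one hour after the reported end time.
    """
    if end_utc < start_utc:
        raise ValueError("end time precedes start time")
    begin = start_utc.replace(minute=0, second=0, microsecond=0)
    end = end_utc + timedelta(hours=1)
    return begin, end


def availability(up_hours, unknown_hours, total_hours):
    """Assumed formula: fraction of the *known* time the service was up."""
    known = total_hours - unknown_hours
    if known <= 0:
        return None  # nothing is known, so no figure can be computed
    return up_hours / known


# Hypothetical problem reported from 16:35 to 19:10 UTC on 12 Feb 2012.
start = datetime(2012, 2, 12, 16, 35, tzinfo=timezone.utc)
end = datetime(2012, 2, 12, 19, 10, tzinfo=timezone.utc)
begin, stop = unknown_window(start, end)
print(begin.isoformat(), stop.isoformat())
```

Under these assumptions, hours marked unknown shrink the denominator rather than counting as downtime, which is why the correction does not lower the figures of sites that fall inside the window.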

Tips

  • Date formats

You can use the date command to convert the start and end time from your time zone to UTC in ISO 8601 format. The long options used in the example are those of GNU date, as shipped with most Linux distributions.

Example:

# date --date="12 Feb 2012 17:35 CET" --utc --iso-8601=minutes

will give:

2012-02-12T16:35+0000
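
Where the date options used above are not available, the same conversion can be sketched in Python. This is an illustrative equivalent and not part of the procedure: the helper name to_utc_iso is hypothetical, and CET is modelled here as a fixed UTC+1 offset, which ignores daylight saving time (CEST).

```python
from datetime import datetime, timedelta, timezone

# Assumption: CET as a fixed UTC+1 offset (no daylight saving handling).
CET = timezone(timedelta(hours=1), "CET")


def to_utc_iso(local_text, tz=CET):
    """Parse e.g. '12 Feb 2012 17:35' in zone tz and format it in UTC."""
    local = datetime.strptime(local_text, "%d %b %Y %H:%M").replace(tzinfo=tz)
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M%z")


print(to_utc_iso("12 Feb 2012 17:35"))  # 2012-02-12T16:35+0000
```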

Revision history

17/01/2012: the text of the procedure was revised to clarify that both RC administrators and regional operations staff can request a re-computation.

16/01/2012: the text of the procedure was revised to clarify that the recomputation of test results can be requested before the end of the affected month; in that case, if sufficient time is allowed for fixing the test results, no re-computation of availability/reliability statistics is needed.