PROC10 Recomputation of SAM results or availability reliability statistics

From EGIWiki
Revision as of 15:33, 6 October 2011 by Dzila

Procedure for the recomputation of SAM results and availability/reliability

  • Title: Recomputation of monitoring results and availability
  • Document link:
  • Last modified:
  • Version: 1.0
  • Policy Group Acronym: OMB
  • Policy Group Name: Operations Management Board
  • Contact Person: Dimitris Zilaskos
  • Document Status: DRAFT
  • Approved Date:
  • Procedure Statement: The purpose of this document is to ...

Overview

This procedure documents the steps for requesting a correction in the SAM test results and in the related availability statistics.

DISCLAIMER: This procedure applies only to EGI OPS test results. Procedures for computing VO-specific availability reports are themselves VO-specific and are out of scope.

Prerequisites

Corrections to test results are accepted only when the failures were caused by problems in the monitoring infrastructure itself. Some examples:

  • invalid proxy certificate used for submitting the monitoring probes in a Nagios instance;
  • problems with the Storage Element used for replica management tests, resulting in errors in the CE metrics.

Steps

  1. STEP 1: notify your Operations Centre by opening a GGUS ticket to be assigned to your Operations Centre Support Unit. In the GGUS ticket you must mention:
    1. the starting and ending time of the problem (including day and hour in UTC)
    2. the Site, ROC or NGI affected by the problem
    3. the VO affected by the problem
    4. a description of the problem
  2. STEP 2: the Operations Centre analyzes the request. If the request is validated, the ticket is re-assigned to the Service Level Management (SLM) Support Unit, which is responsible for (1) collecting all reported problems and (2) discussing them with the SAM Support Unit by re-assigning the ticket to the SAM/Nagios SU.
  3. STEP 3: if the request for recomputation of the test results is accepted, the SAM Support Unit is responsible for triggering a recomputation of the monthly availability statistics. The recomputation is performed as follows:
    1. All Nagios metric results for any site and service are set to unknown status from the beginning of the hour reported in the starting time to one hour after the ending time. The extra hour covers results that may have arrived late.
    2. The period is then recomputed for that particular Site, ROC or NGI. As a consequence, the availability and reliability of other sites won't be affected, as unknown periods are not considered in the computation.
  4. STEP 4: when the new availability statistics are ready for distribution, the SAM/Nagios SU reassigns the ticket to the SLM Support Unit to notify it that a new set of reports can be re-distributed to EGI.
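
The time window used in STEP 3 and its effect on the statistics can be sketched in Python. This is only an illustration under stated assumptions, not the SAM implementation: the class and function names are hypothetical, results are modelled as one status string per hour, an hour is masked if it overlaps the window at all, and availability is taken to be the fraction of known hours with status "ok".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class RecomputationRequest:
    """Information the GGUS ticket must mention (STEP 1). Hypothetical class."""
    start: datetime   # starting time of the problem, UTC
    end: datetime     # ending time of the problem, UTC
    site: str         # Site, ROC or NGI affected
    vo: str           # VO affected
    description: str  # description of the problem


def unknown_window(req):
    """Return (begin, finish) of the period to be set to unknown status.

    begin  = beginning of the hour reported in the starting time
    finish = one hour after the ending time
    """
    if req.start.tzinfo is None or req.end.tzinfo is None:
        raise ValueError("times must be timezone-aware (UTC)")
    if req.end < req.start:
        raise ValueError("ending time precedes starting time")
    start_utc = req.start.astimezone(timezone.utc)
    end_utc = req.end.astimezone(timezone.utc)
    begin = start_utc.replace(minute=0, second=0, microsecond=0)
    finish = end_utc + timedelta(hours=1)
    return begin, finish


def mask_unknown(hourly_status, begin, finish):
    """Return a copy with every hour overlapping [begin, finish) set to 'unknown'.

    hourly_status maps the start of each hour (aware datetime) to a status.
    Masking whole overlapping hours is an assumption of this sketch.
    """
    one_hour = timedelta(hours=1)
    return {
        hour: "unknown" if hour < finish and hour + one_hour > begin else status
        for hour, status in hourly_status.items()
    }


def availability(hourly_status):
    """Fraction of known hours that are 'ok'; unknown hours are not counted."""
    known = [s for s in hourly_status.values() if s != "unknown"]
    if not known:
        return None  # nothing known, so no figure can be computed
    return sum(s == "ok" for s in known) / len(known)
```

For a problem reported from 09:40 to 11:10 UTC, unknown_window returns 09:00 to 12:10 UTC, so in this hourly model the slots starting at 09:00, 10:00, 11:00 and 12:00 are masked and drop out of the availability figure.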
