
Difference between revisions of "PROC08 Management of the EGI OPS Availability and Reliability Profile"

(94 intermediate revisions by 7 users not shown)
Line 1: Line 1:
{{Template: Op menubar}}
{{Template:Op menubar}}
{{Template:Doc_menubar}}
{{Template:Doc_menubar}}
[[Category:Operations Manuals]]
[[Category:Operations Procedures]]
[[Category:Deprecated]]
{| style="border:1px solid black; background-color:lightgrey; color: black; padding:5px; font-size:140%; width: 90%; margin: auto;"
| style="padding-right: 15px; padding-left: 15px;" |
|[[File:Alert.png]] This page is '''Deprecated'''; the content has been moved to https://confluence.egi.eu/display/EGIPP/PROC08+Management+of+the+EGI+OPS+Availability+and+Reliability+Profile 
|}
{{TOC_right}}
{{TOC_right}}


{| border="1"
{{Ops_procedures
|-
|Doc_title = Management of the EGI OPS Availability and Reliability Profile
| '''Title'''
|Doc_link = [[PROC08|https://wiki.egi.eu/wiki/PROC08]]
| ''Modification of the set of AVAILABILITY tests''
|Version = 2020-03-02
|-
|Policy_acronym = OMB
| '''Document link'''
|Policy_name = Operations Management Board
| ''https://wiki.egi.eu/wiki/PROC08_Modification_of_the_set_of_AVAILABILITY_tests''
|Contact_group = operations@egi.eu
|-
|Doc_status = Approved
| '''Last modified'''
|Approval_date = 2020-02-20
| [[User:Tferrari|Tferrari]] 14:08, 8 March 2011 (UTC)
|Procedure_statement = This document specifies the procedure for modifying the EGI OPS Availability and Reliability profile.
|-
|Owner = Alessandro Paolini
| '''Policy Group Acronym'''
}}
| ''OMB''
 
|-
<br>
| '''Policy Group Name'''
 
| ''Operations Management Board''
= Overview  =
|-
 
| '''Contact Person'''
A change in the profile is needed every time a new Nagios test needs to be added/removed to/from the profile, in order to have its results included/removed in/from Availability and Reliability monthly statistics. A change in the OPS Availability and Reliability profile affects the computation of the monthly Availability and Reliability statistics of all EGI Resource Infrastructures and Resource Centres.  
| ''E. Imamagic''
|-
| '''Document Status'''
| ''DRAFT''
|-
| '''Approved Date'''
| ''specify''
|-
| '''Procedure Statement'''
| ''This document specifies the procedure for modifying the set of AVAILABILITY tests, i.e. of those tests whose results affect the computation of the monthly Availability and Reliability statistics.''
|-
|}


----
= Definitions  =


= Overview =
*The key words '''Profile''', '''Metric''', '''Probe''' and '''Test''' are defined in the [https://wiki.egi.eu/wiki/ARGO#ARGO_tests ARGO] page.
*List of Availability and Reliability tests: [https://poem.egi.eu/poem/admin/poem/profile/2/change/ ARGO_MON_CRITICAL].


The purpose of this document is to clearly describe the procedure for modifying the set of AVAILABILITY tests, i.e. of those tests whose results affect the computation of the monthly Availability and Reliability (A/R) statistics.
Please refer to the [[Glossary|EGI Glossary]] for the definitions of the terms used in this procedure.<br>


= Scope =
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.


This procedure applies to the set of AVAILABILITY tests that are run under the OPS VO. Its scope is global: it applies to all Operations Centres in the EGI project. These tests are used in the official EGI ACE profile for generating monthly A/R reports.
= Scope  =


This procedure does not apply to A/R statistics calculated for other VOs (e.g. user communities, national operations VOs).
This procedure is applicable to the EGI OPS Availability and Reliability profile. Any change applied is global, as it has effects on all EGI Resource Centres. The ARGO compute engine (CE) uses profiles to generate monthly Availability and Reliability reports.  


= Prerequirements =
This procedure is NOT applicable to VO-specific Availability and Reliability profiles used by non-OPS VOs (e.g. user communities, national operations VOs, etc.).


This procedure requires usage of ACE system for generating monthly A/R statistics. The procedure does not cover GridView system which is currently used. The critical feature which ACE supports and GridView lacks is definition of multiple profiles for A/R statistics. Detailed information about ACE system can be found on the following link: https://tomtools.cern.ch/confluence/display/SAM/ACE.
= Entities involved in the procedure =


If a modification request includes the addition of new tests, each test MUST first go through the following procedures:
*'''Applicant'''. The Applicant submits a request for changing the EGI OPS profile. Anybody is allowed to submit a request. The request is submitted to the respective Operations Centre, which, after accepting it, forwards it to the Operations Management Board ([[OMB|OMB]]) for discussion.
* [[PROC07_Adding_new_probes_to_SAM|Procedure for adding new probes to SAM release]]
*'''Operations Centre'''. The entity associated with EGI that is responsible for delivering local operational services to a Resource Infrastructure Provider. In order to contribute resources to EGI, a Resource Infrastructure Provider must be associated with an Operations Centre.
* [[Operations:Procedure_for_setting_Nagios_test_an_operations_test|Procedure for setting Nagios test an operations test]]
*'''Resource Infrastructure Operations Manager'''. Represents the respective Resource Infrastructure within the OMB.
The two procedures above ensure that the new tests are included in the SAM release, deployed on all NGI SAM instances and accepted by operators.
*'''EGI&nbsp;Operations team'''. Team coordinating EGI Infrastructure.
*'''ARGO Team'''. The ARGO Team is responsible for scheduling, integrating and releasing probes.


= Request =
= Pre-requirements  =


* Everyone is allowed to submit the request for modifying the set of AVAILABILITY tests.
*This procedure requires the [http://argo.egi.eu ARGO] system for generating monthly availability and reliability statistics. Detailed information about the ARGO system is available at [http://argoeu.github.io/overview/ http://argoeu.github.io/overview/].
* The procedure requires generation of two A/R reports for comparison (find details below). Therefore only one request will be processed at a time. Order of processing requests will be defined by the SA1 activity leader.
*If the change request includes the addition of new tests, each test MUST first be integrated with the Operations Dashboard: being an OPERATIONS test included in the [https://poem.egi.eu/ui/metricprofiles/ARGO_MON_OPERATORS ARGO_MON_OPERATORS] profile is a necessary condition for inclusion in the [https://poem.egi.eu/ui/metricprofiles/ARGO_MON_CRITICAL ARGO_MON_CRITICAL] profile for A/R computation (see procedure [[PROC06|PROC06]]). This should ensure that the new probe has been tested on production sites and rule out anomalies that do not strictly depend on the sites.
*EGI software teams are responsible for the development and testing of new probes.


= Procedure =
= Steps  =


{| border="1" cellspacing="0" cellpadding="5" align="center"
{| class="wikitable"
! Step
|-
! Action on
! Step  
! Action on  
! Action
! Action
|-
|-
| 1
| 1  
| Requester
| Applicant
| Opens a RT ticket in queue '''noc-managers'''.
| Sends a change request to the Applicant's own Operations Centre. The request is submitted through a [https://gus.fzk.de/pages/ticket.php GGUS ticket].
<pre>
Use the "Affected ROC/NGI" field to address the ticket to the appropriate Operations Centre. Template:
Subject: Request for adding/removing XXX(,YYY,...) test(s) from the set of AVAILABILITY tests
<pre>Subject: Request for adding/removing XXX(,YYY,...) test(s) to/from the EGI OPS A/R Profile


We would like to request adding/removing XXX(,YYY,...) test(s) from the set of AVAILABILITY tests
We would like to request adding/removing XXX(,YYY,...) test(s) to/from the EGI OPS Profile


Prerequisite data:
Prerequisite data:
* name of nagios probe:
* name of ARGO test(s):
* name of service on which the test runs:
* name of service on which the test runs:
* link to documentation page:
* link to documentation page:
* motivation (which part of the infrastructure will be improved with the new probe
* motivation (which part of the infrastructure will be improved with the new test
  or description of users' problems which will be avoided in future - provide list
  or description of users' problems which will be avoided in future - provide list
  of GGUS tickets if possible)
  of GGUS tickets if possible)
</pre>
</pre>
|-
|-
| 2
| 2  
| SA1 activity leader
| Operations Centre
| Schedules presentation of the new probe at the next possible OMB meeting.
| The Operations Centre processes the request specified in the GGUS ticket and accepts or rejects it.
The reasons for a rejection must be specified in the GGUS ticket.
 
In case of acceptance, a GGUS ticket is opened to EGI Operations Support Unit to forward the request for discussion in the OMB. Template:
<pre>Subject: Request for adding/removing XXX(,YYY,...) test(s) to/from the EGI OPS Profile
 
We would like to request adding/removing XXX(,YYY,...) test(s) to/from the EGI OPS Profile.
Please see details in GGUS ticket _link to Applicant's GGUS ticket_.
</pre>
|-
|-
| 3
| 3  
| Requester
| EGI Operations team and Resource Infrastructure Operations Manager
| Explains the reason for modifying set of AVAILABILITY tests
| The EGI Operations team schedules a presentation of the requested change at the next possible OMB meeting. The relevant Resource Infrastructure Operations Manager presents the request during the meeting. The Applicant is invited to attend the meeting. Only one request is processed at a time, because the impact of each change needs to be assessed. Requests are processed according to their priority, as agreed by the [[OMB|OMB]].
|-
|-
| 4 (*)
| 4
| SA1 activity leader
| EGI Operations team
| Opens a ticket in JIRA requesting the creation of a new ACE profile with the modified set of AVAILABILITY tests.
| Opens a GGUS ticket to "Monitoring (ARGO)" requesting the addition of the test(s) to (or their removal from) the ARGO_MON_CRITICAL profile.
|-
|-
| 5 (*)
| 5  
| ACE team
| ARGO team  
| ACE team creates the new ACE profile.
| Implements the change on the agreed date (generally, the first day of the month).
|-
|-
| 6
| 6  
| SA1.8 task staff
| EGI Operations team
| For the following '''one month''' two A/R reports are generated. SA1.8 task staff compares the figures and presents them at the next OMB meeting.
| Before closing the ticket, verifies that there are no anomalies in the new A/R reports and reports back to the NGI Managers by email or at the next OMB meeting.
|-
|-
| 7
| 7  
| OMB
| EGI Operations team
| If the A/R statistics generated with the new A/R profile are satisfactory OMB approves the modification
| Broadcasts the modification to all relevant parties (i.e. Operations Centres and Resource Centres) through the next Monthly Broadcast. Closes the GGUS ticket.
|}
 
= Revision History  =
 
{| class="wikitable"
|-
! Version
! Authors
! Date
! Comments
|-
| <br>
| T. Ferrari
| 18/10/2011
| removed obsolete information: "This procedure does not apply to modifications which have already been agreed with the SAM team: including CREAM-CE probes, switching from the old SAM CA probe to the new one, switching from the old SAM ARC probes to Nagios ones."
|-
| <br>
| M. Krakowian
| 19 August 2014
| Change contact group -&gt; Operations support
|-
|-
| 8 (*)
| <br>
| SA1 activity leader
| P. Daoglou and P. Korosoglou
| Opens a ticket in JIRA requesting that the new A/R profile become the official one for EGI. The previous profile is to be removed.
| 30 September 2015
| Updates in procedure. Replacement of ACE references with ARGO.
|-
|-
| 9
|  
| SA1 activity leader
| Alessandro Paolini
| Broadcasts the modification to all relevant parties (i.e. noc-managers, inspire-sa1). Closes the initial RT ticket.
| 2016-06-08
| Change contact group -&gt; Operations
|-
|
| Alessandro Paolini
| 2019-12-17
| updating some old links and names; updating some steps (it seems it is no longer necessary to create a "testing" A/R report before approving the changes)
|}
|}


(*) - These steps depend on the procedure for creating new profiles which will be defined by the ACE team once the ACE is in production. Steps defined here have been provided by the ACE team. This procedure will be updated if any change occurs.
[[Category:Operations_Procedures]]
 
= Revision History =
<!-- this section will track changes introduced in the document AFTER it is officially approved by OMB -->

Revision as of 15:37, 4 August 2021

This page is Deprecated; the content has been moved to https://confluence.egi.eu/display/EGIPP/PROC08+Management+of+the+EGI+OPS+Availability+and+Reliability+Profile


Title Management of the EGI OPS Availability and Reliability Profile
Document link https://wiki.egi.eu/wiki/PROC08
Last modified 2020-03-02
Policy Group Acronym OMB
Policy Group Name Operations Management Board
Contact Group operations@egi.eu
Document Status Approved
Approved Date 2020-02-20
Procedure Statement This document specifies the procedure for modifying the EGI OPS Availability and Reliability profile.
Owner Alessandro Paolini



Overview

A change in the profile is needed every time a new Nagios test needs to be added/removed to/from the profile, in order to have its results included/removed in/from Availability and Reliability monthly statistics. A change in the OPS Availability and Reliability profile affects the computation of the monthly Availability and Reliability statistics of all EGI Resource Infrastructures and Resource Centres.

Definitions

  • The key words Profile, Metric, Probe and Test are defined in the ARGO page.
  • List of Availability and Reliability tests: ARGO_MON_CRITICAL.

Please refer to the EGI Glossary for the definitions of the terms used in this procedure.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Scope

This procedure is applicable to the EGI OPS Availability and Reliability profile. Any change applied is global, as it has effects on all EGI Resource Centres. The ARGO compute engine (CE) uses profiles to generate monthly Availability and Reliability reports.

This procedure is NOT applicable to VO-specific Availability and Reliability profiles used by non-OPS VOs (e.g. user communities, national operations VOs, etc.).

Entities involved in the procedure

  • Applicant. The Applicant submits a request for changing the EGI OPS profile. Anybody is allowed to submit a request. The request is submitted to the respective Operations Centre, which, after accepting it, forwards it to the Operations Management Board (OMB) for discussion.
  • Operations Centre. The entity associated with EGI that is responsible for delivering local operational services to a Resource Infrastructure Provider. In order to contribute resources to EGI, a Resource Infrastructure Provider must be associated with an Operations Centre.
  • Resource Infrastructure Operations Manager. Represents the respective Resource Infrastructure within the OMB.
  • EGI Operations team. Team coordinating EGI Infrastructure.
  • ARGO Team. The ARGO Team is responsible for scheduling, integrating and releasing probes.

Pre-requirements

  • This procedure requires the ARGO system for generating monthly availability and reliability statistics. Detailed information about the ARGO system is available at http://argoeu.github.io/overview/.
  • If the change request includes the addition of new tests, each test MUST first be integrated with the Operations Dashboard: being an OPERATIONS test included in the ARGO_MON_OPERATORS profile is a necessary condition for inclusion in the ARGO_MON_CRITICAL profile for A/R computation (see procedure PROC06). This should ensure that the new probe has been tested on production sites and rule out anomalies that do not strictly depend on the sites.
  • EGI software teams are responsible for the development and testing of new probes.
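
The prerequisite on new tests can be sketched as a simple membership check. This is only an illustration of the rule stated above, not ARGO or POEM code: the function name, the in-memory sets and the test names are hypothetical, and the real profile contents live in POEM.

```python
def check_addition(test, operators_tests, critical_tests):
    """Return (ok, reason) for adding `test` to the critical profile.

    `operators_tests` and `critical_tests` stand in for the contents of
    ARGO_MON_OPERATORS and ARGO_MON_CRITICAL (assumed to be plain sets
    of test names here).
    """
    if test in critical_tests:
        return False, "already in ARGO_MON_CRITICAL"
    if test not in operators_tests:
        # The test must first become an OPERATIONS test (see PROC06).
        return False, "not in ARGO_MON_OPERATORS (see PROC06)"
    return True, "eligible"


# Hypothetical profile contents, for illustration only.
operators = {"example.test-A", "example.test-B"}
critical = {"example.test-A"}

print(check_addition("example.test-B", operators, critical))
print(check_addition("example.test-C", operators, critical))
```

Passing this check is only a necessary condition: the request still has to go through the steps below and be discussed in the OMB.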

Steps

Step Action on Action
1 Applicant Sends a change request to the Applicant's own Operations Centre. The request is submitted through a GGUS ticket.

Use the "Affected ROC/NGI" field to address the ticket to the appropriate Operations Centre. Template:

Subject: Request for adding/removing XXX(,YYY,...) test(s) to/from the EGI OPS A/R Profile

We would like to request adding/removing XXX(,YYY,...) test(s) to/from the EGI OPS Profile

Prerequisite data:
* name of ARGO test(s):
* name of service on which the test runs:
* link to documentation page:
* motivation (which part of the infrastructure will be improved with the new test
 or description of users' problems which will be avoided in future - provide list
 of GGUS tickets if possible)
2 Operations Centre The Operations Centre processes the request specified in the GGUS ticket and accepts or rejects it.

The reasons for a rejection must be specified in the GGUS ticket.

In case of acceptance, a GGUS ticket is opened to EGI Operations Support Unit to forward the request for discussion in the OMB. Template:

Subject: Request for adding/removing XXX(,YYY,...) test(s) to/from the EGI OPS Profile

We would like to request adding/removing XXX(,YYY,...) test(s) to/from the EGI OPS Profile. 
Please see details in GGUS ticket _link to Applicant's GGUS ticket_.
3 EGI Operations team and Resource Infrastructure Operations Manager The EGI Operations team schedules a presentation of the requested change at the next possible OMB meeting. The relevant Resource Infrastructure Operations Manager presents the request during the meeting. The Applicant is invited to attend the meeting. Only one request is processed at a time, because the impact of each change needs to be assessed. Requests are processed according to their priority, as agreed by the OMB.
4 EGI Operations team Opens a GGUS ticket to "Monitoring (ARGO)" requesting the addition of the test(s) to (or their removal from) the ARGO_MON_CRITICAL profile.
5 ARGO team Implements the change on the agreed date (generally, the first day of the month).
6 EGI Operations team Before closing the ticket, verifies that there are no anomalies in the new A/R reports and reports back to the NGI Managers by email or at the next OMB meeting.
7 EGI Operations team Broadcasts the modification to all relevant parties (i.e. Operations Centres and Resource Centres) through the next Monthly Broadcast. Closes the GGUS ticket.
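
As a rough illustration of steps 1 and 5, the sketch below fills in the Applicant's ticket template and computes a candidate implementation date. It is not part of GGUS or ARGO: the function names, argument names and sample values are hypothetical, and taking the first day of the following month is an assumption based on "generally, the first day of the month"; the actual date is whatever is agreed.

```python
from datetime import date


def build_request(action, tests, service, doc_link, motivation):
    """Fill in the Applicant's ticket template from step 1."""
    if action not in ("adding", "removing"):
        raise ValueError("action must be 'adding' or 'removing'")
    direction = "to" if action == "adding" else "from"
    names = ", ".join(tests)
    subject = (
        f"Request for {action} {names} test(s) "
        f"{direction} the EGI OPS A/R Profile"
    )
    body = "\n".join([
        f"We would like to request {action} {names} test(s) "
        f"{direction} the EGI OPS Profile",
        "",
        "Prerequisite data:",
        f"* name of ARGO test(s): {names}",
        f"* name of service on which the test runs: {service}",
        f"* link to documentation page: {doc_link}",
        f"* motivation: {motivation}",
    ])
    return subject, body


def first_day_of_next_month(today):
    """Candidate implementation date (assumption: start of next month)."""
    if today.month == 12:
        return date(today.year + 1, 1, 1)
    return date(today.year, today.month + 1, 1)


# Hypothetical values, for illustration only.
subject, body = build_request(
    "adding",
    ["example.test-B"],
    "example.service",
    "https://example.org/docs/test-b",
    "detects a failure mode reported by users",
)
print(subject)
print(body)
print(first_day_of_next_month(date(2021, 8, 4)))
```

The subject and body would still be pasted into a GGUS ticket by hand; the sketch only shows which fields of the template have to be supplied.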

Revision History

Version Authors Date Comments

T. Ferrari 18/10/2011 removed obsolete information: "This procedure does not apply to modifications which have already been agreed with the SAM team: including CREAM-CE probes, switching from the old SAM CA probe to the new one, switching from the old SAM ARC probes to Nagios ones."

M. Krakowian 19 August 2014 Change contact group -> Operations support

P. Daoglou and P. Korosoglou 30 September 2015 Updates in procedure. Replacement of ACE references with ARGO.
Alessandro Paolini 2016-06-08 Change contact group -> Operations
Alessandro Paolini 2019-12-17 updating some old links and names; updating some steps (it seems it is no longer necessary to create a "testing" A/R report before approving the changes)