The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed you can get in touch with EGI SDIS team using operations @ egi.eu.

Difference between revisions of "PROC06 Setting Nagios test status to operations"

From EGIWiki
(Remove deprecated content)
Tag: Replaced
 
(110 intermediate revisions by 10 users not shown)
{{Template:Op menubar}} {{Template:Doc_menubar}} {{TOC_right}}
[[Category:Deprecated]]
{| style="border:1px solid black; background-color:lightgrey; color: black; padding:5px; font-size:140%; width: 90%; margin: auto;"
| style="padding-right: 15px; padding-left: 15px;" |
|[[File:Alert.png]] This page is '''Deprecated'''; the content has been moved to https://confluence.egi.eu/display/EGIPP/PROC06+Setting+Nagios+test+status+to+operations
* '''Title''': Procedure for setting global Nagios tests critical
* '''Document link''': https://wiki.egi.eu/wiki/Operations:Setting_Nagios_tests_critical_procedure
* '''Last modified''': 23.09.2010
* '''Version''': 0.2
* '''Policy Group Acronym''': GOO/COD
* '''Policy Group Name''': Grid Operations Oversight/Central Operator on Duty
* '''Contact Person''': Małgorzata Krakowian, Marcin Radecki
* '''Document Status''': REVIEW
* '''Approved Date''':
* '''Procedure Statement''': The purpose of this document is to clearly describe the actions and the respective steps to be undertaken for setting Nagios tests critical. A Nagios test is set to critical so that the operations dashboard displays an alarm when the test fails.
 
= Procedure for setting global Nagios tests critical DRAFT =
 
The purpose of this document is to clearly describe the actions and the respective steps to be undertaken for setting Nagios tests critical. A Nagios test is set to critical so that the operations dashboard displays an alarm when the test fails.
 
 
 
'''This procedure applies only to the OPS VO, and its scope is global.'''
 
= Revision history =
{| border="1" cellspacing="0" cellpadding="5" align="center"
! Version
! Authors
! Date
! Comments
|-
| 0.2
| Małgorzata Krakowian
| 23.09.2010
| Added comments from the discussion at the EGI TF in Amsterdam.
|-
| 0.1
| Małgorzata Krakowian
|
| First draft
|-
|}
|}
= Comments (to be removed in final version) =
MR: Helene commented that the Nagios administrators at the NGIs should be allowed to try out the new test before it is made critical. Her point was to check that the test itself does not harm the service instances in the region.
= Setting Nagios tests critical request =
== Request ==
* Everyone is allowed to submit the request.
* The request should be submitted to the Chief Operations Officer.
== Prerequisites ==
The test should be implemented and approved by the Nagios team.
'''Any fundamental change to a probe requires certification.'''
== Political validation ==
* COD has to agree, after checking that the test is safe for the infrastructure.
* Once COD has validated the request, the OMB has to agree that the new Nagios test should become critical for the OPS VO.
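The prerequisites and the validation order above can be read as a sequence of gates that a request passes one at a time. The following is a minimal Python sketch, assuming each condition is tracked as a simple yes/no flag; the `CriticalTestRequest` class, its field names (including the assumed `fundamental_probe_change` flag) and the status strings are hypothetical and not part of any EGI tool.

```python
from dataclasses import dataclass


@dataclass
class CriticalTestRequest:
    """Hypothetical record of a request to make a Nagios test critical for the OPS VO."""
    test_name: str
    implemented_by_nagios_team: bool = False
    approved_by_nagios_team: bool = False
    fundamental_probe_change: bool = False  # assumed flag: was the probe changed in a fundamental way?
    certified: bool = False
    cod_agreed: bool = False  # COD checked that the test is safe for the infrastructure
    omb_agreed: bool = False  # OMB agreed that the test should become critical for OPS


def validation_status(req: CriticalTestRequest) -> str:
    """Return the first unmet condition, checked in the order the procedure lists them."""
    if not (req.implemented_by_nagios_team and req.approved_by_nagios_team):
        return "waiting for Nagios team implementation and approval"
    if req.fundamental_probe_change and not req.certified:
        return "waiting for certification"
    if not req.cod_agreed:
        return "waiting for COD validation"
    # The OMB is asked only after COD has validated the request.
    if not req.omb_agreed:
        return "waiting for OMB agreement"
    return "validated"
```

Returning the first unmet condition, rather than a list of all of them, mirrors the rule that the OMB is consulted only after COD has validated the request.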
== How to start the process ==
* The Chief Operations Officer opens a GGUS ticket to COD to start the process.
* The Central Operator on Duty team, in charge of EGI oversight, is responsible for processing the request ticket.
== Steps for setting Nagios tests critical ==
As a general rule, the tickets for a step must be closed before moving on to the next step.
Steps:
{| border="1" cellspacing="0" cellpadding="5" align="center"
! Step
! Action on
! Action
|-
| 1
| Nagios
| Add the test to the official Nagios package.
|-
| 2
| NGIs
| Nagios update.
|-
| 3
| NGIs
| Request the ROD teams to verify whether the test is acceptable, meaning that 75% of the affected nodes should be OK.
|-
| 4
| COD
| COD broadcasts the information.
The broadcast should be sent to VO managers and NOC/ROC managers.
See the template below for an indication of the message content.
<pre>
Subject: 
Dear All,
We would like to announce that test XXX will become critical XXX
Best regards,
</pre>
|-
| 5
| COD
| Add the test to the critical tests list.
|-
| 6
| COD
| Final check; close the parent ticket.
|-
|}
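The ordering rule and the step 3 acceptance criterion can be modelled as a small state machine. The sketch below is illustrative only: it assumes that each step is tracked under a parent ticket and must be closed in order, reads "75% of the affected nodes should be OK" as "at least 75% of the affected nodes report OK", and uses hypothetical names (`ParentTicket`, `close_step`, `is_acceptable`) that do not correspond to any GGUS or Nagios API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

# Steps 1-6 of the procedure as (actor, action) pairs.
STEPS = [
    ("Nagios", "Add test to official Nagios package"),
    ("NGIs", "Nagios update"),
    ("NGIs", "ROD teams verify that the test is acceptable"),
    ("COD", "Broadcast the announcement"),
    ("COD", "Add test to critical tests list"),
    ("COD", "Final check; close parent ticket"),
]

ACCEPTANCE_THRESHOLD = 0.75  # step 3: share of affected nodes that should be OK


def is_acceptable(node_statuses: list[str]) -> bool:
    """Return True if at least 75% of the affected nodes report OK."""
    if not node_statuses:
        return False  # no results, so nothing to judge the test by
    ok = sum(1 for status in node_statuses if status == "OK")
    return ok / len(node_statuses) >= ACCEPTANCE_THRESHOLD


@dataclass
class ParentTicket:
    """Hypothetical parent ticket tracking which steps have been closed."""
    closed_steps: list[int] = field(default_factory=list)
    closed: bool = False

    def next_step(self) -> Optional[int]:
        """Number of the next step to close, or None once all steps are done."""
        n = len(self.closed_steps) + 1
        return n if n <= len(STEPS) else None

    def close_step(self, step: int, node_statuses: Optional[list[str]] = None) -> None:
        """Close a step; refuse to skip ahead or to pass step 3 without enough OK nodes."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"cannot close step {step}; next open step is {expected}")
        if step == 3 and not is_acceptable(node_statuses or []):
            raise ValueError("step 3: fewer than 75% of affected nodes are OK")
        self.closed_steps.append(step)
        if step == len(STEPS):
            self.closed = True  # step 6 closes the parent ticket
```

For example, with results `["OK", "OK", "OK", "WARNING"]` step 3 can be closed (3 of 4 nodes, exactly 75%), whereas `["OK", "CRITICAL"]` blocks it. Raising an error on an out-of-order close keeps the "close before moving on" rule explicit.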
= Requirements to be implemented =
* COD should be authorized to manage the OPS critical test list from the Operations Portal side: https://forge.in2p3.fr/issues/show/934

Latest revision as of 10:43, 15 April 2022