PROC06 Setting Nagios test status to operations
- Title: Procedure for setting global Nagios tests critical
- Document link: https://wiki.egi.eu/wiki/Operations:Setting_Nagios_tests_critical_procedure
- Last modified: 23.09.2010
- Version: 0.2
- Policy Group Acronym: GOO/COD
- Policy Group Name: Grid Operations Oversight/Central Operator on Duty
- Contact Person: Małgorzata Krakowian, Marcin Radecki
- Document Status: REVIEW
- Approved Date:
- Procedure Statement: The purpose of this document is to describe clearly the actions and related steps to be undertaken to set Nagios tests critical. A Nagios test is set to critical so that the operations dashboard displays an alarm if the test fails.
Procedure for setting global Nagios tests critical DRAFT
The purpose of this document is to describe clearly the actions and related steps to be undertaken to set Nagios tests critical. A Nagios test is set to critical so that the operations dashboard displays an alarm if the test fails.
This procedure applies only to the OPS VO. Its scope is global: it applies to all Operations Centres in the EGI project.
Revision history
Version | Authors | Date | Comments |
---|---|---|---|
0.3 | Małgorzata Krakowian | 9.11.2010 | Add comments from Tiziana Ferrari |
0.2 | Małgorzata Krakowian | 23.09.2010 | Add comments from discussion in Amsterdam EGI TF. |
0.1 | Małgorzata Krakowian | | First draft |
Comments (to be removed in final version)
MR: Helene raised a comment that the Nagios people at the NGIs should be allowed to try out the new test before it is made critical. Her point was to check that the test itself will not do any harm to the service instances in the region.
Setting Nagios tests critical request
Request
- Everyone is allowed to submit the request.
- The request should be submitted to The Chief Operations Officer.
Prerequisites
The Nagios test needs to satisfy quality criteria in agreement with the UMD roadmap. The test needs to be properly documented, and its correct functionality needs to be proven (one month of successful running in the production infrastructure).
The quality criteria need to be provided by SA1.
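The prerequisites above can be sketched as a simple check. This is only an illustration under stated assumptions: `ProbeRecord` and its fields are hypothetical and do not correspond to any Nagios data structure, "one month" is approximated as 30 days, and the quality criteria themselves (still to be provided by SA1) are not modelled.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumption: "one month of successful running" is approximated as 30 days.
REQUIRED_RUN = timedelta(days=30)


@dataclass
class ProbeRecord:
    """Hypothetical summary of a probe's history (not a Nagios structure)."""
    name: str
    documented: bool
    running_since: date            # first day the probe ran in production
    last_malfunction: Optional[date] = None  # last day the probe itself misbehaved


def meets_prerequisites(probe: ProbeRecord, today: date) -> bool:
    """Return True if the probe is documented and has run cleanly long enough."""
    if not probe.documented:
        return False
    # The clean period starts at deployment, or the day after the last malfunction.
    clean_since = probe.running_since
    if probe.last_malfunction is not None:
        clean_since = max(clean_since, probe.last_malfunction + timedelta(days=1))
    return today - clean_since >= REQUIRED_RUN
```

For example, a documented probe deployed on 1 August 2010 with no recorded malfunction would pass this check on 23 September 2010, while the same probe with a malfunction on 10 September would not.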
Validation
The general idea is that tickets must be closed before the process can move on to the next step.
Steps:
Not sure where the Nagios step ("Add the test to the official Nagios package") should be placed. Should it be a prerequisite?
TF comment on step 6: is this (the percentage of passing sites) sufficient as a check? Sites may fail to pass, but the probe could be fine.
Step | Action on | Action |
---|---|---|
1 | COO | Opens a GGUS ticket to COD to start the process. |
2 | COD | Checks the status of the Nagios probe to see whether it meets the specified quality criteria. |
3 | COD | Contacts the OMB to request approval of the new critical test. |
4 | Nagios | Add the test to the official Nagios package. |
5 | NGIs | Update Nagios. |
6 | NGIs | Ask the ROD teams to verify that the test is acceptable, meaning that 75% of affected nodes should be OK. |
7 | COD | Broadcasts the announcement about the new critical test. (This broadcast should be sent to VO managers and NOC/ROC managers.) The following template indicates the message content. Subject: Dear All, We would like to announce that test XXX will become critical XXX Best regards, |
8 | COD | Add the test to critical tests list. https://wiki.egi.eu/wiki/Operations:Operations_tests |
9 | Operational Portal | Mark the test as critical in the Operational Portal. This step will be removed when COD gets access to manage the operations tests list in the Operational Portal. See Requirements to be implemented. |
10 | COD | Final check. Close the parent ticket. |
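The step table, the rule that tickets must be closed before moving on, and the step 6 acceptance check can be sketched together as follows. This is a minimal illustration, not an existing tool: `ProcedureTracker` and `is_test_acceptable` are hypothetical names, treating every step as having one ticket is an assumption, and reading "75% of affected nodes should be OK" as "at least 75%" is also an assumption. As the TF comment notes, this percentage alone cannot tell a faulty probe from failing sites.

```python
# Steps and owners, summarised from the table above.
STEPS = [
    (1, "COO", "Open a GGUS ticket to COD"),
    (2, "COD", "Check the probe against the quality criteria"),
    (3, "COD", "Request OMB approval"),
    (4, "Nagios", "Add the test to the official Nagios package"),
    (5, "NGIs", "Update Nagios"),
    (6, "NGIs", "Ask ROD teams to verify the test is acceptable"),
    (7, "COD", "Broadcast the announcement"),
    (8, "COD", "Add the test to the critical tests list"),
    (9, "Operational Portal", "Mark the test as critical"),
    (10, "COD", "Final check and close the parent ticket"),
]

# Assumption: step 6 is read as "at least 75% of affected nodes are OK".
REQUIRED_OK_PERCENT = 75


def is_test_acceptable(ok_nodes: int, affected_nodes: int) -> bool:
    """Step 6 check, using integer arithmetic to avoid rounding issues."""
    if affected_nodes <= 0:
        return False  # assumption: no affected nodes means nothing was verified
    return ok_nodes * 100 >= REQUIRED_OK_PERCENT * affected_nodes


class ProcedureTracker:
    """Hypothetical tracker: a step can only be left once its ticket is closed."""

    def __init__(self) -> None:
        self._index = 0
        self._ticket_closed = False

    @property
    def current_step(self):
        return STEPS[self._index]

    def close_ticket(self) -> None:
        self._ticket_closed = True

    def advance(self):
        number = self.current_step[0]
        if not self._ticket_closed:
            raise RuntimeError(f"ticket for step {number} is still open")
        if self._index == len(STEPS) - 1:
            raise RuntimeError("procedure is already at the final step")
        self._index += 1
        self._ticket_closed = False  # the next step starts with an open ticket
        return self.current_step
```

A caller would invoke `close_ticket()` and then `advance()` for each step in turn; calling `advance()` while the current ticket is still open raises an error instead of skipping ahead.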
Requirements to be implemented
- COD is responsible for managing the OPS critical test list from the Operational Portal side. https://rt.egi.eu/rt/Ticket/Display.html?id=482