PROC01 EGI Infrastructure Oversight escalation
- Title: EGI Infrastructure Oversight escalation
- Document link: https://wiki.egi.eu/wiki/Operations:COD_Escalation_new
- Last modified: 22.10.2010
- Version: 1.0
- Policy Group Acronym: GOO/COD
- Policy Group Name: Grid Operations Oversight/Central Operator on Duty
- Contact Person: Małgorzata Krakowian, Marcin Radecki
- Document Status: APPROVED
- Approved Date: 26.10.2010
- Procedure Statement: The purpose of this document is to define the escalation procedure for operational problems.
Workflow and escalation procedures
Escalation for operational problem at site
This section introduces a critical part of operations: detecting, identifying and solving problems at sites. ROD must follow the escalation procedure whenever any problem related to a site is detected. The main goal of the procedure is to track the problem follow-up process as a whole and to keep it consistent from the time of detection until the ultimate solution is reached.
Moreover, the procedure is intended to introduce a hierarchical structure and a distribution of responsibility in problem solving, which should lead to a significant improvement in the quality of the production grid service. Consequently, minimizing the delay between the steps of the procedure is of utmost importance. The regular procedure the operators follow can be considered in four phases:
- submitting problems into the problem tracking tool after they are detected using monitoring tools or by a task created by an operations team (COD or ROD);
- updating the task when a site's state changes, which can be detected either by comparing the monitoring information with the current state of the task in the problem tracking tool, or by input from an operations team (COD or ROD);
- closing tickets or escalating outdated tickets when deadlines are reached in the problem tracking tool;
- initiating the last escalation step and/or communication with site administrators and the regional operations team.
Below are the detailed steps of the escalation procedure to follow if no response is received to the notification of a problem or if the problem has been left unattended.
When an alarm appears on the ROD dashboard (>24 hours old):
Step [#] | Max. Duration [work days] | Resp. Unit | Escalation procedure |
1 | 3 | ROD | Send mail to the site administrator with CC to NGI manager and GGUS (an operational ticket is created). Content of the message: |
2 | 3 | ROD | Send mail to the site administrator with CC to NGI manager and GGUS. Optionally, phone the site to make sure that the communication channel is working. At the end of this period, escalate to COD. Content of the message: |
3 | 3 | COD | Send mail to NGI manager with CC to site administrator, ROD and GGUS. Content of the message: |
4 | 1 | COD | If no response is obtained from the site, ROD or NGI manager, send mail to NGI manager with CC to site administrator, ROD and GGUS. If there is no response after 1 working day, COD suspends the site. Content of the message: |
NB: In theory, the whole process can be completed within a 2-week period (10 working days). This is in line with the suspension procedure proposed for sites that do not reply to low availability/reliability figures. Most often, a site solves the problem well before operators need to escalate the issue to COD.
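As a rough illustration of how the step durations add up, the following sketch computes the latest deadline of each step from the date of the first mail. It is not part of the procedure: the function names are hypothetical, working days are assumed to be Monday to Friday with public holidays ignored, and each step is assumed to use its full maximum duration.

```python
from datetime import date, timedelta

# (step, max duration in working days, responsible unit), copied from the table above
SITE_ESCALATION_STEPS = [
    (1, 3, "ROD"),
    (2, 3, "ROD"),
    (3, 3, "COD"),
    (4, 1, "COD"),
]


def add_working_days(start: date, days: int) -> date:
    """Return the date `days` working days after `start`.

    Assumption: working days are Monday to Friday; public holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def escalation_deadlines(first_mail: date):
    """Return (step, unit, deadline) per step, assuming every step runs to its maximum."""
    deadlines = []
    step_start = first_mail
    for step, duration, unit in SITE_ESCALATION_STEPS:
        deadline = add_working_days(step_start, duration)
        deadlines.append((step, unit, deadline))
        step_start = deadline  # the next step starts when this one expires
    return deadlines


# Sum of the maximum durations of all four steps, in working days
total_working_days = sum(duration for _, duration, _ in SITE_ESCALATION_STEPS)

if __name__ == "__main__":
    for step, unit, deadline in escalation_deadlines(date(2010, 11, 1)):
        print(f"Step {step} ({unit}): deadline {deadline.isoformat()}")
```

With a first mail sent on a Monday, the last deadline falls on the Monday two weeks later, which matches the 2-week figure in the note above.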
Escalation for operational problem with ROD
This procedure applies only when ROD is not handling issues on the operational dashboard according to the operational procedures.
Step [#] | Max. Duration [work days] | Resp. Unit | Escalation procedure |
1 | 3 | COD | Contact the ROD with CC to NGI manager. Content of the message: |
2 | 3 | COD | Contact NGI manager with CC to ROD. Content of the message: |
3 | (without delay) | COD | Contact COO with CC to NGI manager. Content of the message: |
The precondition for stopping the escalation is that all issues not handled according to procedure have disappeared from the COD dashboard. All communication is done via a GGUS ticket.
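The ROD escalation table and its stop condition can be read as a simple lookup. The sketch below is only an illustration under stated assumptions: the function name and the `issues_outstanding` flag are hypothetical, and a step is assumed to end as soon as its maximum number of working days has fully elapsed.

```python
# (step, max duration in working days or None for "without delay", contact, CC),
# copied from the table above; COD is the responsible unit for every step
ROD_ESCALATION_STEPS = [
    (1, 3, "ROD", "NGI manager"),
    (2, 3, "NGI manager", "ROD"),
    (3, None, "COO", "NGI manager"),
]


def current_rod_step(working_days_elapsed: int, issues_outstanding: bool):
    """Return (step, contact, cc) for the step COD should act on, or None.

    Assumptions: `working_days_elapsed` counts working days since step 1 began,
    and `issues_outstanding` is True while unhandled issues remain on the
    COD dashboard.
    """
    if not issues_outstanding:
        return None  # stop condition: nothing left on the COD dashboard
    boundary = 0
    for step, duration, contact, cc in ROD_ESCALATION_STEPS:
        if duration is None:
            # "without delay": the final step has no time limit of its own
            return (step, contact, cc)
        boundary += duration
        if working_days_elapsed < boundary:
            return (step, contact, cc)
    return None  # only reached if the table has no final "without delay" step


if __name__ == "__main__":
    for elapsed in (0, 3, 6):
        print(elapsed, current_rod_step(elapsed, issues_outstanding=True))
```

Keeping the steps in a data table rather than in branching logic means a change to a duration or a contact only touches one line.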