Revision as of 09:58, 27 October 2010

Title: Operations Centre decommission procedure
Document link: https://wiki.egi.eu/wiki/Operations:COD_Escalation_new
Last modified: 22.10.2010
Version: 1.0
Policy Group Acronym: GOO/COD
Policy Group Name: Grid Operations Oversight/Central Operator on Duty
Contact Person: Małgorzata Krakowian, Marcin Radecki
Document Status: APPROVED
Approved Date: 26.10.2010
Procedure Statement: The purpose of this document is to define the escalation procedure for operational problems.

Workflow and escalation procedures

Escalation for operational problem at site

This section introduces a critical part of operations: detecting, identifying and solving problems at sites. The escalation procedure is the procedure that ROD must follow whenever any problem related to a site is detected. Its main goal is to track the problem follow-up process as a whole and keep it consistent from the time of detection until the final solution is reached.

Moreover, the procedure introduces a hierarchical structure and a distribution of responsibility in problem solving, which should significantly improve the quality of the production grid service. Consequently, minimizing the delay between the steps of the procedure is of utmost importance. The regular procedure the operators follow can be divided into four phases:

submitting problems into the problem tracking tool after they are detected using monitoring tools or by a task created by an operations team (COD or ROD);
updating the task when a site state changes, which can be detected either by comparing the monitoring information with the current state of the task in the problem tracking tool, or by input from an operations team (COD or ROD);
closing tickets or escalating outdated tickets when deadlines are reached in the problem tracking tool;
initiating the last escalation step and/or communication with site administrators and the regional operations team.

Below are the detailed steps of the escalation procedure to follow if no response is received to the notification of a problem or the problem has been left unattended.

When an alarm appears on the ROD dashboard (>24 hours old):

Step [#]	Max. Duration [work days]	Resp. Unit	Escalation procedure	Content of the message
1	3	ROD	Send mail to the site administrator with CC to NGI/ROC manager and GGUS (an operational ticket is created).	Ask for immediate action; in case of no response within 3 working days ROD will escalate the issue.
2	3	ROD	Send mail to the site administrator with CC to NGI/ROC manager and GGUS (optionally, a phone call to the site to make sure that the communication channel is working). After 3 days with no response from the site administrator, the issue should be escalated to COD.	Ask for immediate action; in case of no response within 3 working days ROD will escalate the issue to COD.
3	3	COD	Send mail to NGI/ROC manager with CC to site administrator, ROD and GGUS.	Inform the NGI/ROC manager that they should make the site react to the ticket or suspend the site within 3 days; if the NGI does not react, COD will suspend the site on the 4th day.
4	1	COD	If no response is obtained from the site, ROD or NGI/ROC manager, send mail to NGI/ROC manager with CC to site administrator, ROD and GGUS. If there is no response after 1 working day, COD performs site suspension.	Ask the NGI/ROC manager to suspend the site; if there is no response after 1 working day, COD will perform site suspension.

NB: When monthly availability/reliability thresholds are not met, a site is requested to provide justification through a COD ticket. In this procedure, such a ticket is equivalent to an alarm appearing on the ROD dashboard.
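The step durations in the site escalation table can be turned into calendar deadlines. The following is a minimal sketch, not part of the procedure itself: the names `SITE_ESCALATION`, `add_working_days` and `escalation_deadlines` are hypothetical, working days are assumed to be Monday to Friday, and public holidays are ignored.

```python
from datetime import date, timedelta

# (step, max duration in working days, responsible unit), copied from the table above.
SITE_ESCALATION = [(1, 3, "ROD"), (2, 3, "ROD"), (3, 3, "COD"), (4, 1, "COD")]


def add_working_days(start: date, days: int) -> date:
    """Return the date `days` working days after `start`.

    Assumption: working days are Monday to Friday; holidays are not considered.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0..4 are Monday..Friday
            remaining -= 1
    return current


def escalation_deadlines(start: date, steps=SITE_ESCALATION):
    """Return (step, responsible unit, deadline) for each step.

    Assumption: each step starts on the day the previous step's deadline expires.
    """
    deadlines = []
    step_start = start
    for number, duration, unit in steps:
        deadline = add_working_days(step_start, duration)
        deadlines.append((number, unit, deadline))
        step_start = deadline
    return deadlines
```

Under these assumptions, if step 1 starts on Monday 1 November 2010 and the site never responds, the step deadlines fall on 4, 9, 12 and 15 November 2010, so the whole procedure takes at most 10 working days before site suspension.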

Escalation for operational problem with ROD

The procedure applies only when ROD is not handling issues on the operational dashboard according to the operational procedures.

Step [#]	Max. Duration [work days]	Resp. Unit	Escalation procedure	Content of the message
1	3	COD	Send mail to the ROD with CC to NGI/ROC and GGUS (an operational ticket is created).	Ask for an explanation of why an issue was not handled according to procedures; ask for immediate action; in case of no response within 3 working days COD will contact the NGI/ROC manager.
2	3	COD	Send mail to NGI/ROC manager with CC to ROD and GGUS.	Report that ROD is not responsive and not handling operational issues according to procedure; in case of no response within 3 working days COD will contact COO.
3	(without delay)	COD	Send mail to COO with CC to NGI/ROC manager and GGUS.	Report that ROD and the NGI/ROC manager are not responsive and not handling operational issues according to procedure.

The precondition for stopping the escalation is that all issues not handled according to procedure have disappeared from the COD dashboard. The communication is done via a GGUS ticket.
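The ROD escalation steps and the stop precondition can be read as a small decision rule. The sketch below is illustrative only: `ROD_ESCALATION` and `next_rod_action` are hypothetical names, and it assumes that a step is left exactly when its full waiting period has elapsed without a response.

```python
# (step, working days to wait for a response, recipient of COD's mail),
# copied from the table above. The last step has no waiting period.
ROD_ESCALATION = [
    (1, 3, "ROD"),
    (2, 3, "NGI/ROC manager"),
    (3, 0, "COO"),  # "(without delay)"
]


def next_rod_action(unhandled_issues: int, working_days_without_response: int) -> str:
    """Return the action COD should take for a ROD that is not following procedure.

    Assumption: `unhandled_issues` counts the issues on the COD dashboard that
    were not handled according to procedure.
    """
    # Stop precondition: nothing left on the COD dashboard.
    if unhandled_issues == 0:
        return "stop escalation"

    elapsed = working_days_without_response
    # Walk through the steps that have a waiting period.
    for step, wait, recipient in ROD_ESCALATION[:-1]:
        if elapsed < wait:
            return f"step {step}: COD mails {recipient}"
        elapsed -= wait

    # All waiting periods are exhausted: final step, taken without delay.
    step, _, recipient = ROD_ESCALATION[-1]
    return f"step {step}: COD mails {recipient}"
```

For example, with two unhandled issues and four working days without a response, the rule points to step 2 (mail to the NGI/ROC manager); once the dashboard is clear it returns "stop escalation" regardless of the elapsed time.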

Difference between revisions of "PROC01 EGI Infrastructure Oversight escalation"
