Revision as of 16:54, 23 November 2012


Title	Grid Oversight escalation
Document link	https://wiki.egi.eu/wiki/PROC01
Last modified	3.0 - 20.11.2012
Policy Group Acronym	COD
Policy Group Name	Central Operator on Duty
Contact Group	manager-central-operator-on-duty@mailman.egi.eu
Document Status	Approved
Approved Date	20.11.2012
Procedure Statement	The purpose of this document is to define the escalation procedure for operational problems.
Owner	Owner of procedure

Overview

The purpose of this document is to define the escalation procedure for operational problems.

Definitions

Please refer to the EGI Glossary for the definitions of the terms used in this procedure.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Steps

Escalation for operational problem at site

This section describes a critical part of operations: detecting, identifying and solving problems at sites. ROD must follow the escalation procedure whenever a problem related to a site is detected. The main goal of the procedure is to track the whole problem follow-up process and keep it consistent from the time of detection until the final solution is reached.

The detailed steps below apply when no response to the notification of a problem is received or the problem has been left unattended.

When an alarm appears on the ROD dashboard, ROD should start the procedure below no later than 24 hours after the problem occurred:

(The Max. Duration column shows the time, in working days, that you have to wait before moving to the next step in the escalation procedure.)

Step [#]	Dashboard step	Max. Duration [work days] (time before moving to next step)	Resp. Unit	Escalation procedure	Content of the message
1	1st step	3	ROD	Send mail to the site administrator with CC to the NGI/ROC operations manager and GGUS (an operational ticket is created).	Ask for immediate action. In case of no response within 3 working days, ROD will escalate the issue.
2	2nd step	3	ROD	Send mail to the site administrator with CC to the NGI/ROC operations manager and GGUS (optionally, phone the site to make sure that the e-mail communication channel is working). After 3 working days with no response from the site administrator, the issue should be escalated to the NGI/ROC operations manager.	Ask for immediate action. In case of no response within 3 working days, ROD will escalate the issue to the NGI/ROC operations manager.
3	NGI step	5	NGI manager	The NGI/ROC operations manager should, at the political level, make the site responsive or suspend the site (this can be done by phone, by mail or at a meeting). If the problem needs to be escalated to EGI level, the NGI/ROC operations manager asks ROD to send a mail to COD with CC to the site administrator, ROD and GGUS (see Content of the message). The ROD team remains responsible for the ticket on the Operations Portal.	Inform COD that the problem cannot be solved at the NGI level and ask for help to resolve it.
4	COD step	1	COD	If the NGI/ROC operations manager has taken no action for 5 working days, COD sends a mail to the NGI/ROC operations manager with CC to the site administrator, ROD and GGUS. If there is no response after 1 working day, COD performs site suspension. If the NGI cannot solve the problem at the NGI level, COD tries to help find a solution.	Ask the NGI/ROC operations manager to suspend the site. If there is no response after 1 working day, COD will perform site suspension.

The communication should be recorded in the GGUS ticket.
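The timing in the table above can be sketched in Python as follows. This is an illustration only, not part of the procedure: the names `Step`, `SITE_ESCALATION`, `add_work_days` and `step_deadlines` are hypothetical, working days are assumed to be Monday to Friday with public holidays ignored, and each step is assumed to start on the day the previous one expires.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Step:
    name: str           # "Dashboard step" column
    max_work_days: int  # "Max. Duration [work days]" column
    unit: str           # "Resp. Unit" column


# Rows of the site escalation table, in order.
SITE_ESCALATION = [
    Step("1st step", 3, "ROD"),
    Step("2nd step", 3, "ROD"),
    Step("NGI step", 5, "NGI manager"),
    Step("COD step", 1, "COD"),
]


def add_work_days(start: date, days: int) -> date:
    """Return the date `days` working days after `start`.

    Assumption: only Saturdays and Sundays are skipped; holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def step_deadlines(first_mail: date) -> list:
    """Return (step name, responsible unit, latest date to wait until) per step."""
    deadlines = []
    started = first_mail
    for step in SITE_ESCALATION:
        due = add_work_days(started, step.max_work_days)
        deadlines.append((step.name, step.unit, due))
        started = due  # assumption: the next step starts when this one expires
    return deadlines
```

Under these assumptions, a first mail sent on Monday 19 November 2012 gives deadlines of 22 November, 27 November, 4 December and 5 December 2012 for the four steps.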

Escalation for operational problem with unsupported MW at site

This escalation process is part of the Service type decommission procedure.

When an alarm appears on the Operations dashboard, ROD should start the procedure below:

Step [#]	Dashboard step	Max. Duration [work days] (time before moving to next step)	Resp. Unit	Escalation procedure	Content of the message
1	1st step	10	ROD	Create a ticket through the Operations Portal with the template ROD_MW_alarm_template. Mail is sent to the site administrator with CC to the NGI/ROC operations manager and GGUS. One ticket can be created for all MW alarms by using the alarm masking feature. ROD should make sure that the site is aware of all raised alarms.	Ask for information about the upgrade plan, with a 2-week deadline. In case of no response or no plan, ROD will escalate the issue to the NGI operations manager.
2	NGI step	5	NGI manager	Escalate the ticket to the NGI manager through the Operations Dashboard. Mail is sent to the site administrator with CC to the NGI/ROC operations manager and GGUS (optionally, phone the site to make sure that the e-mail communication channel is working). The NGI manager should check why the site is unresponsive or why it cannot migrate to a supported software version. The site and the NGI manager should decide on an upgrade plan or on site/endpoint decommission.	Inform the NGI operations manager about the unresponsive site. The site might be suspended by COD, CSIRT or the NGI operations manager after the DEADLINE for the upgrade/decommissioning.
3	COD step	(without delay)	ROD	ROD should escalate the ticket to COD, who should try to help find a solution, in case of: (a) issues which cannot be solved at NGI level; (b) site admins failing to provide feedback about their upgrade plans by a given deadline; (c) the NGI or site failing to put the affected service endpoints in downtime after a given deadline.

The communication should be recorded in the GGUS ticket.
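The conditions for the COD step in the table above can be read as a simple predicate. The Python sketch below is one possible reading, not an official rule: the function name and its parameters are hypothetical, and it assumes that the three conditions are independent and that any one of them is enough to escalate.

```python
def should_escalate_to_cod(
    solvable_at_ngi: bool,
    plan_deadline_passed: bool,
    upgrade_plan_received: bool,
    downtime_deadline_passed: bool,
    endpoints_in_downtime: bool,
) -> bool:
    """Return True if ROD should escalate the ticket to COD.

    Assumption: each condition from the COD step triggers escalation on its own.
    """
    # Issue cannot be solved at NGI level.
    if not solvable_at_ngi:
        return True
    # Site admins gave no feedback about their upgrade plans by the deadline.
    if plan_deadline_passed and not upgrade_plan_received:
        return True
    # Affected service endpoints were not put in downtime after the deadline.
    if downtime_deadline_passed and not endpoints_in_downtime:
        return True
    return False
```

For example, a site that delivered an upgrade plan on time but left the affected endpoints out of downtime after the deadline would still be escalated under this reading.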

Escalation for operational problem with ROD

This section describes a critical part of operations: handling problems with ROD itself. COD must follow the escalation procedure whenever a problem related to ROD work is detected. The main goal of the procedure is to track the whole problem follow-up process and keep it consistent from the time of detection until the final solution is reached.

The procedure applies only when ROD is not handling issues on the operational dashboard according to the operational procedures.

(The Max. Duration column shows the time, in working days, that you have to wait before moving to the next step in the escalation procedure.)

Step [#]	Max. Duration [work days] (time before moving to next step)	Resp. Unit	Escalation procedure	Content of the message
1	3	COD	Send mail to ROD with CC to NGI/ROC, COD and GGUS (an operational ticket is created).	Ask for an explanation of why an issue was not handled according to procedures. Ask for immediate action. In case of no response within 3 working days, COD will contact the NGI/ROC manager.
2	3	COD	Send mail to the NGI/ROC manager with CC to ROD, COD and GGUS.	Report that ROD is not responsive and is not handling operational issues according to procedure. In case of no response within 3 working days, COD will contact the COO.
3	(without delay)	COD	Send mail to the COO with CC to the NGI/ROC manager, COD and GGUS.	Report that ROD and the NGI/ROC manager are not responsive and are not handling operational issues according to procedure.

Escalation stops only once all issues that were not handled according to procedure have disappeared from the COD dashboard.

The communication should be recorded in the GGUS ticket.
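As a rough illustration of the ROD escalation steps and the stop condition, the Python sketch below lists who receives each mail and stops when no unhandled issues remain. The names `ROD_ESCALATION` and `next_mail` are hypothetical, and counting unhandled issues as an integer is an assumption made for the example.

```python
from typing import Optional

# One entry per step of the ROD escalation table; wait_work_days is None
# for the last step, which the table marks as "(without delay)".
ROD_ESCALATION = [
    {"to": "ROD", "cc": ["NGI/ROC", "COD", "GGUS"], "wait_work_days": 3},
    {"to": "NGI/ROC manager", "cc": ["ROD", "COD", "GGUS"], "wait_work_days": 3},
    {"to": "COO", "cc": ["NGI/ROC manager", "COD", "GGUS"], "wait_work_days": None},
]


def next_mail(step_index: int, unhandled_issues: int) -> Optional[dict]:
    """Return the mail COD should send at `step_index` (0-based), or None.

    None means escalation stops: either no unhandled issues remain on the
    COD dashboard, or all steps in the table have already been taken.
    """
    if unhandled_issues == 0:
        return None
    if step_index >= len(ROD_ESCALATION):
        return None
    return ROD_ESCALATION[step_index]
```

Calling `next_mail(1, 0)` returns `None`, reflecting that escalation stops as soon as the unhandled issues are gone, whatever step had been reached.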

Revision history

Version	Authors	Date	Comments

@@ Line 161: / Line 161: @@
 NGI&nbsp;manager should check why site is unresponsive or what is the reason site cannot migrate to supported software version. Site and NGI&nbsp;manager should decide on upgrade plan or site/endpoint decommission.
-In case of
-*issues which cannot be solved on NGI level
-*by a given deadline, site admins fail to provide feedback about their upgrade plans
-*after a given deadline, if the NGI or site fail to put the affected service end points in downtime
-ROD should escalate ticket to COD
 |
@@ Line 180: / Line 172: @@
 | COD step
 | &nbsp; (without delay)
-| COD<br>
+| ROD<br>
 |
-If NGI cannot solve the problem at the NGI level, COD try to help to find the solution.
+In case of
+*issues which cannot be solved on NGI level
+*by a given deadline, site admins fail to provide feedback about their upgrade plans
+*after a given deadline, if the NGI or site fail to put the affected service end points in downtime
+ROD should escalate ticket to COD, who should try to help to find the solution.
 |
 |}

Difference between revisions of "PROC01 EGI Infrastructure Oversight escalation"
