Difference between revisions of "PROC01 EGI Infrastructure Oversight escalation"
(29 intermediate revisions by 4 users not shown) | |||
Line 1: | Line 1: | ||
{{Template:Op menubar}} {{Template:Doc_menubar}} {{TOC_right}} {{Ops_procedures | {{Template:Op menubar}} {{Template:Doc_menubar}} {{TOC_right}} {{Ops_procedures | ||
|Doc_title = | |Doc_title = EGI Infrastructure Oversight escalation | ||
|Doc_link = https://wiki.egi.eu/wiki/PROC01 | |Doc_link = [[PROC01 |https://wiki.egi.eu/wiki/PROC01]] | ||
|Version = | |Version = 3.1 - 8 June 2016 | ||
|Policy_acronym = | |Policy_acronym = OMB | ||
|Policy_name = | |Policy_name = Operations Management Board | ||
|Contact_group = | |Contact_group = operations@egi.eu | ||
|Doc_status = Approved | |Doc_status = Approved | ||
|Approval_date = | |Approval_date = 20.11.2012 | ||
|Procedure_statement = The purpose of this document is to define escalation procedure for operational problems | |Procedure_statement = The purpose of this document is to define escalation procedure for operational problems | ||
|Owner = Matthew Viljoen | |||
}} | }} | ||
= | = Overview = | ||
The purpose of this document is to define escalation procedure for operational problems | |||
= Definitions = | |||
Please refer to the [[Glossary|EGI Glossary]] for the definitions of the terms used in this procedure. | |||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", “MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119. | |||
= Steps = | |||
== Escalation for operational problem at site == | == Escalation for operational problem at site == | ||
Line 30: | Line 41: | ||
<br> When an alarm appears on the ROD dashboard, at most after 24 hours from the problem occurrence ROD should start the procedure below: | <br> When an alarm appears on the ROD dashboard, at most after 24 hours from the problem occurrence ROD should start the procedure below: | ||
(Max Duration | (Max Duration column shows time in working days which you have to wait before you move to next step in the escalation procedure ) | ||
{| border="1" class="wikitable" | {| border="1" class="wikitable" | ||
|- | |- | ||
| '''Step [#]''' | | '''Step [#]''' | ||
| '''Dashboard step''' | |||
| | | | ||
'''Max. Duration [work days]''' | '''Max. Duration [work days]''' | ||
Line 45: | Line 57: | ||
|- | |- | ||
| 1 | | 1 | ||
| 1st step | |||
| 3 | | 3 | ||
| ROD | | ROD | ||
Line 54: | Line 67: | ||
|- | |- | ||
| 2 | | 2 | ||
| 2nd step | |||
| 3 | | 3 | ||
| ROD | | ROD | ||
Line 67: | Line 81: | ||
|- | |- | ||
| 3 | | 3 | ||
| NGI step | |||
| 5 | | 5 | ||
| NGI manager | | NGI manager | ||
Line 72: | Line 87: | ||
NGI/ROC operations manager should at the political level make site responsive or suspend the site. (it can be done by phone, mail or on the meeting)<br> | NGI/ROC operations manager should at the political level make site responsive or suspend the site. (it can be done by phone, mail or on the meeting)<br> | ||
If the problem needs to be escalated to EGI level then NGI/ROC operations manager ask ROD to send an mail to | If the problem needs to be escalated to EGI level then NGI/ROC operations manager ask ROD to send an mail to Operations with CC to site administrator, ROD and GGUS.(see Content of the message) | ||
ROD team is still responsible to take care about the ticket on the Operations Portal. | ROD team is still responsible to take care about the ticket on the Operations Portal. | ||
| | | | ||
*informing | *informing Operations that the problem cannot be solved at the NGI level and ask for help to resolve it<br> | ||
|- | |- | ||
| 4 | | 4 | ||
| Operations step | |||
| 1 | | 1 | ||
| | | Operations | ||
| | | | ||
#If no action was taken by NGI/ROC operations manager for 5 working days | #If no action was taken by NGI/ROC operations manager for 5 working days Operations send an mail to NGI/ROC operations manager with CC to site administrator, ROD and GGUS. If no response after 1 working day Operations performs site suspension. | ||
#If NGI cannot solve the problem at the NGI level, | #If NGI cannot solve the problem at the NGI level, Operations try to help to find the solution. | ||
<br> | <br> | ||
Line 91: | Line 107: | ||
| | | | ||
*asking NGI/ROC operations manager to suspend the site | *asking NGI/ROC operations manager to suspend the site | ||
*if no response after 1 working day | *if no response after 1 working day Operations will perform site suspension. | ||
|} | |} | ||
Line 99: | Line 115: | ||
'''The communication should be recorded in GGUS ticket.''' | '''The communication should be recorded in GGUS ticket.''' | ||
== Escalation for operational problem with unsupported MW at site | == Escalation for operational problem with unsupported MW at site == | ||
'''This Escalation process is a part of ''' [[PROC16|''Decommissioning of unsupported software procedure'']] | |||
'''This Escalation process is a part of ''' [[PROC16|'' | |||
When an alarm appears on the Operations dashboard, ROD should start the procedure below: | When an alarm appears on the Operations dashboard, ROD should start the procedure below: | ||
Line 112: | Line 126: | ||
|- | |- | ||
| '''Step [#]''' | | '''Step [#]''' | ||
| ''' | | '''Dashboard step''' | ||
| | | | ||
'''Max. Duration [work days]''' | '''Max. Duration [work days]''' | ||
Line 124: | Line 138: | ||
| 1 | | 1 | ||
| 1st step | | 1st step | ||
| 10 | | 10<br> | ||
| ROD | | ROD | ||
| | | | ||
'''Create a ticket through Operations Portal | '''Create a ticket through Operations Portal with the template: '''[[ROD MW alarm template|ROD_MW_alarm_template]] | ||
Mail is send to the site administrator with CC to NGI/ROC operations manager and GGUS. | Mail is send to the site administrator with CC to NGI/ROC operations manager and GGUS. | ||
One | One ticket can be created for all MW alarms by using alarm masking feature. ROD should make sure that site is aware of all raised alarms. | ||
| | | | ||
*ask to<span class="solution"> provide information about </span>'''<span class="solution">upgrade plan</span>''' with | *ask to<span class="solution"> provide information about </span>'''<span class="solution">upgrade plan</span>''' with 10 working days deadline<br> | ||
*in case of no response or plan, ROD will escalate the issue to NGI manager<br> | *in case of no response or plan, ROD will escalate the issue to NGI operations manager<br> | ||
|- | |- | ||
| 2 | | 2 | ||
| NGI step | | NGI step | ||
| 5 | | 5<br> | ||
| NGI manager | | NGI manager | ||
| | | | ||
Line 150: | Line 164: | ||
NGI manager should check why site is unresponsive or what is the reason site cannot migrate to supported software version. Site and NGI manager should decide on upgrade plan or site/endpoint decommission. | NGI manager should check why site is unresponsive or what is the reason site cannot migrate to supported software version. Site and NGI manager should decide on upgrade plan or site/endpoint decommission. | ||
| | | | ||
*inform NGI managers about unresponsive site | *inform NGI operations managers about unresponsive site | ||
* | *site might be suspended by Operations, CSIRT or NGI operations manager after DEADLINE for the upgrade/decommissioning | ||
<br> | <br> | ||
Line 167: | Line 173: | ||
|- | |- | ||
| 3 | | 3 | ||
| | | Operations step | ||
| | | (without delay) | ||
| | | ROD<br> | ||
| | | | ||
In case of issues which cannot be solved on NGI level ROD should escalate ticket to Operations, who should try to help to find the solution. | |||
| | | <br> | ||
|} | |} | ||
<br> | <br> | ||
'''The communication should be recorded in GGUS ticket.''' | '''The communication should be recorded in GGUS ticket.''' | ||
== Escalation for operational problem with ROD == | == Escalation for operational problem with ROD == | ||
This section introduces a critical part of operations in terms of problem with ROD. The escalation procedure is a procedure that | This section introduces a critical part of operations in terms of problem with ROD. The escalation procedure is a procedure that Operations must follow whenever any problem related to ROD work is detected. The main goal of the procedure is to track the problem follow-up process as a whole and keep the process consistent from the time of detection until the time when the ultimate solution is reached. | ||
The procedure applies only in case when '''ROD is not handling issues on operational dashboard according to operational procedures'''. | The procedure applies only in case when '''ROD is not handling issues on operational dashboard according to operational procedures'''. | ||
(Max Duration | (Max Duration column shows time in working days which you have to wait before you move to next step in the escalation procedure ) | ||
{| border="1" class="wikitable" | {| border="1" class="wikitable" | ||
Line 202: | Line 208: | ||
| 1 | | 1 | ||
| 3 | | 3 | ||
| | | Operations | ||
| Send mail to the ROD with CC to NGI/ROC, | | Send mail to the ROD with CC to NGI/ROC, Operations and GGUS (operational ticket is being created). | ||
| | | | ||
*ask for explanation why an issue was not handled according to procedures | *ask for explanation why an issue was not handled according to procedures | ||
*ask for immediate action | *ask for immediate action | ||
*in case of no response for 3 working days | *in case of no response for 3 working days Operations will contact NGI/ROC manager | ||
|- | |- | ||
| 2 | | 2 | ||
| 3 | | 3 | ||
| | | Operations | ||
| Send mail to NGI/ROC manager with CC to ROD, | | Send mail to NGI/ROC manager with CC to ROD, Operations and GGUS. | ||
| | | | ||
*Report that ROD is not responsive and not handling operational issues according to procedure | *Report that ROD is not responsive and not handling operational issues according to procedure | ||
*In case of no response for 3 working days | *In case of no response for 3 working days Operations will contact COO | ||
|- | |- | ||
| 3 | | 3 | ||
| (without delay) | | (without delay) | ||
| | | Operations | ||
| Send mail to COO with CC NGI/ROC manager, | | Send mail to COO with CC NGI/ROC manager, Operations and GGUS. | ||
| | | | ||
*Report that ROD and NGI/ROC manager is not responsive and not handling operational issues according to procedure | *Report that ROD and NGI/ROC manager is not responsive and not handling operational issues according to procedure | ||
Line 228: | Line 234: | ||
|} | |} | ||
The precondition to stop escalation is that all issues not handled according to procedure disappeared from | The precondition to stop escalation is that all issues not handled according to procedure disappeared from Operations dashboard. | ||
'''The communication should be recorded in GGUS ticket.''' | '''The communication should be recorded in GGUS ticket.''' | ||
Line 241: | Line 247: | ||
! Comments | ! Comments | ||
|- | |- | ||
| | | 3.1 | ||
| | | Malgorzata | ||
| | | 18 August 2014 | ||
| | | Oversight of ROD work is now under EGI.eu Operations team. Change contact group -> Operations support | ||
|- | |||
| 3.2 | |||
| Alessandro Paolini | |||
| 2016-06-08 | |||
| Changed contact group -> Operations | |||
|} | |} | ||
[[Category:Operations_Procedures]] | [[Category:Operations_Procedures]] |
Revision as of 16:27, 7 January 2019
Title | EGI Infrastructure Oversight escalation |
Document link | https://wiki.egi.eu/wiki/PROC01 |
Last modified | 3.1 - 8 June 2016 |
Policy Group Acronym | OMB |
Policy Group Name | Operations Management Board |
Contact Group | operations@egi.eu |
Document Status | Approved |
Approved Date | 20.11.2012 |
Procedure Statement | The purpose of this document is to define the escalation procedure for operational problems. |
Owner | Matthew Viljoen |
Overview
The purpose of this document is to define the escalation procedure for operational problems.
Definitions
Please refer to the EGI Glossary for the definitions of the terms used in this procedure.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Steps
Escalation for operational problem at site
This section covers a critical part of operations: detecting, identifying and solving problems at sites. ROD must follow the escalation procedure whenever any problem related to a site is detected. The main goal of the procedure is to track the problem follow-up process as a whole and keep it consistent from the time of detection until the final solution is reached. Below are the detailed steps of the escalation procedure to follow if no response is received to the notification of a problem or the problem has been left unattended.
When an alarm appears on the ROD dashboard, ROD should start the procedure below no later than 24 hours after the problem occurred:
(The Max. Duration column shows the time, in working days, that you have to wait before moving to the next step in the escalation procedure.)
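Step [#] | Dashboard step | Max. Duration [work days] (time before moving to next step) | Resp. Unit | Escalation procedure | Content of the message |
---|---|---|---|---|---|
1 | 1st step | 3 | ROD | Send a mail to the site administrator with CC to the NGI/ROC operations manager and GGUS (an operational ticket is created). | |
2 | 2nd step | 3 | ROD | Send a mail to the site administrator with CC to the NGI/ROC operations manager and GGUS (optionally, phone the site to make sure that the e-mail communication channel is working). After a 3-day period with no response from the site administrator, the issue should be escalated to the NGI/ROC operations manager. | |
3 | NGI step | 5 | NGI manager | The NGI/ROC operations manager should, at the political level, make the site responsive or suspend it (this can be done by phone, by mail or at a meeting). If the problem needs to be escalated to EGI level, the NGI/ROC operations manager asks ROD to send a mail to Operations with CC to the site administrator, ROD and GGUS (see Content of the message). The ROD team remains responsible for the ticket on the Operations Portal. | Inform Operations that the problem cannot be solved at the NGI level and ask for help to resolve it. |
4 | Operations step | 1 | Operations | (1) If the NGI/ROC operations manager has taken no action for 5 working days, Operations sends a mail to the NGI/ROC operations manager with CC to the site administrator, ROD and GGUS. If there is no response after 1 working day, Operations performs site suspension. (2) If the NGI cannot solve the problem at the NGI level, Operations tries to help find a solution. | Ask the NGI/ROC operations manager to suspend the site; if there is no response after 1 working day, Operations will perform site suspension. |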
The communication should be recorded in the GGUS ticket.
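The durations in the table above can be turned into calendar deadlines. The following is a minimal sketch, not part of the procedure itself: it assumes that working days are Monday to Friday and ignores public holidays, and the names `SITE_ESCALATION`, `add_working_days` and `escalation_deadlines` are hypothetical.

```python
from datetime import date, timedelta

# (dashboard step, max. duration in working days, responsible unit),
# copied from the site escalation table.
SITE_ESCALATION = [
    ("1st step", 3, "ROD"),
    ("2nd step", 3, "ROD"),
    ("NGI step", 5, "NGI manager"),
    ("Operations step", 1, "Operations"),
]


def add_working_days(start: date, days: int) -> date:
    """Return the date `days` working days after `start`.

    Assumption: working days are Monday to Friday; holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def escalation_deadlines(opened: date, steps=SITE_ESCALATION):
    """List (step, responsible unit, latest date to move on) for each step."""
    deadlines = []
    step_start = opened
    for name, max_days, unit in steps:
        deadline = add_working_days(step_start, max_days)
        deadlines.append((name, unit, deadline))
        step_start = deadline  # the next step starts when this one expires
    return deadlines


if __name__ == "__main__":
    for name, unit, deadline in escalation_deadlines(date(2019, 1, 7)):
        print(f"{name:16} {unit:12} move on by {deadline.isoformat()}")
```

A real deployment would also need the public holidays of the NGI concerned, which this sketch deliberately leaves out.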
Escalation for operational problem with unsupported MW at site
This escalation process is part of the Decommissioning of unsupported software procedure (PROC16).
When an alarm appears on the Operations dashboard, ROD should start the procedure below:
Step [#] | Dashboard step | Max. Duration [work days] (time before moving to next step) | Resp. Unit | Escalation procedure | Content of the message |
---|---|---|---|---|---|
1 | 1st step | 10 | ROD | Create a ticket through the Operations Portal with the template ROD_MW_alarm_template. A mail is sent to the site administrator with CC to the NGI/ROC operations manager and GGUS. One ticket can be created for all MW alarms by using the alarm masking feature. ROD should make sure that the site is aware of all raised alarms. | Ask the site to provide information about its upgrade plan, with a deadline of 10 working days; in case of no response or plan, ROD will escalate the issue to the NGI operations manager. |
2 | NGI step | 5 | NGI manager | Escalate the ticket to the NGI manager through the Operations Dashboard. A mail is sent to the site administrator with CC to the NGI/ROC operations manager and GGUS (optionally, phone the site to make sure that the e-mail communication channel is working). The NGI manager should check why the site is unresponsive or why it cannot migrate to a supported software version. The site and the NGI manager should decide on an upgrade plan or on site/endpoint decommissioning. | Inform the NGI operations managers about the unresponsive site; the site might be suspended by Operations, CSIRT or the NGI operations manager after the DEADLINE for the upgrade/decommissioning. |
3 | Operations step | (without delay) | ROD | In case of issues which cannot be solved at NGI level, ROD should escalate the ticket to Operations, who should try to help find a solution. | |
The communication should be recorded in the GGUS ticket.
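The three steps for unsupported middleware can be read as a small state machine. The sketch below is illustrative only and rests on several assumptions: the names `Step`, `MAX_WORKING_DAYS` and `next_step` are hypothetical, "(without delay)" is modelled as 0 working days, and the move from the NGI step to the Operations step (which the procedure ties to issues that cannot be solved at NGI level) is approximated by "no upgrade or decommissioning plan agreed once the maximum duration has passed".

```python
from enum import Enum


class Step(Enum):
    FIRST = "1st step"
    NGI = "NGI step"
    OPERATIONS = "Operations step"


# Max. duration in working days per step, from the table above.
# Assumption: "(without delay)" is represented as 0.
MAX_WORKING_DAYS = {Step.FIRST: 10, Step.NGI: 5, Step.OPERATIONS: 0}


def next_step(current: Step, working_days_elapsed: int, plan_agreed: bool) -> Step:
    """Return the step the ticket should be in.

    `plan_agreed` means the site has provided an upgrade plan or a
    decommissioning decision has been reached (an assumed simplification).
    """
    if plan_agreed or current is Step.OPERATIONS:
        return current  # nothing further to escalate to
    if working_days_elapsed < MAX_WORKING_DAYS[current]:
        return current  # still within the allowed time for this step
    return Step.NGI if current is Step.FIRST else Step.OPERATIONS


if __name__ == "__main__":
    print(next_step(Step.FIRST, 10, plan_agreed=False).value)
```

Keeping the durations in one dictionary means a change to the table only needs to be mirrored in a single place.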
Escalation for operational problem with ROD
This section covers a critical part of operations: problems with ROD. Operations must follow the escalation procedure whenever any problem related to ROD work is detected. The main goal of the procedure is to track the problem follow-up process as a whole and keep it consistent from the time of detection until the final solution is reached.
The procedure applies only when ROD is not handling issues on the operational dashboard according to operational procedures.
(The Max. Duration column shows the time, in working days, that you have to wait before moving to the next step in the escalation procedure.)
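Step [#] | Max. Duration [work days] (time before moving to next step) | Resp. Unit | Escalation procedure | Content of the message |
---|---|---|---|---|
1 | 3 | Operations | Send a mail to the ROD with CC to NGI/ROC, Operations and GGUS (an operational ticket is created). | Ask for an explanation of why an issue was not handled according to procedures; ask for immediate action; in case of no response for 3 working days, Operations will contact the NGI/ROC manager. |
2 | 3 | Operations | Send a mail to the NGI/ROC manager with CC to ROD, Operations and GGUS. | Report that ROD is not responsive and is not handling operational issues according to procedure; in case of no response for 3 working days, Operations will contact the COO. |
3 | (without delay) | Operations | Send a mail to the COO with CC to the NGI/ROC manager, Operations and GGUS. | Report that ROD and the NGI/ROC manager are not responsive and are not handling operational issues according to procedure. |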
The precondition for stopping the escalation is that all issues not handled according to procedure have disappeared from the Operations dashboard.
The communication should be recorded in the GGUS ticket.
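The addressing rules for the three mails in the ROD escalation can be captured as data. This is a minimal sketch under stated assumptions: recipients are given as role names exactly as they appear in the table (no real addresses), and `Notification`, `ROD_ESCALATION` and `format_headers` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    to: str      # main recipient role
    cc: tuple    # roles kept in copy


# Step number -> who receives the mail, from the ROD escalation table.
ROD_ESCALATION = {
    1: Notification(to="ROD", cc=("NGI/ROC", "Operations", "GGUS")),
    2: Notification(to="NGI/ROC manager", cc=("ROD", "Operations", "GGUS")),
    3: Notification(to="COO", cc=("NGI/ROC manager", "Operations", "GGUS")),
}


def format_headers(step: int) -> str:
    """Render the To/Cc lines for the mail sent at the given step."""
    notification = ROD_ESCALATION[step]
    return f"To: {notification.to}\nCc: {', '.join(notification.cc)}"


if __name__ == "__main__":
    for step in sorted(ROD_ESCALATION):
        print(f"Step {step}\n{format_headers(step)}\n")
```

GGUS appears in copy at every step, which matches the requirement that the communication be recorded in the ticket.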
Revision history
Version | Authors | Date | Comments |
---|---|---|---|
3.1 | Malgorzata | 18 August 2014 | Oversight of ROD work is now under EGI.eu Operations team. Change contact group -> Operations support |
3.2 | Alessandro Paolini | 2016-06-08 | Changed contact group -> Operations |