The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team using operations @ egi.eu.

Difference between revisions of "PROC01 EGI Infrastructure Oversight escalation"

Line 142:
  Below are the detailed steps of the escalation procedure if '''no response is received for the notification of a problem or the problem has been unattended for'''.

- | [[Image:Escalation procedure min.png|right]]<br>
+ | [[Image:Escalation procedure min.png|right|link=https://wiki.egi.eu/w/images/7/75/Escalation_procedure.png]]<br>
  |}



Revision as of 08:19, 12 August 2011


COD escalation procedure

  • Title: COD escalation procedure
  • Document link: https://wiki.egi.eu/wiki/PROC01
  • Last modified: 22.10.2010
  • Version: 1.0
  • Policy Group Acronym: GOO/COD
  • Policy Group Name: Grid Operations Oversight/Central Operator on Duty
  • Contact Person: Małgorzata Krakowian, Marcin Radecki
  • Document Status: APPROVED
  • Approved Date: 26.10.2010
  • Procedure Statement: The purpose of this document is to define the escalation procedure for operational problems.


Workflow and escalation procedures

Escalation for operational problem at site

This section covers a critical part of operations: detecting, identifying and solving problems at sites. ROD must follow the escalation procedure whenever a problem related to a site is detected. The main goal of the procedure is to track the problem follow-up process as a whole and keep it consistent from the time of detection until a final solution is reached.

Below are the detailed steps of the escalation procedure if no response is received for the notification of a problem or the problem has been unattended for.

When an alarm appears on the ROD dashboard (24 hours after the problem occurred):

Step 1 (max. duration: 3 work days; responsible unit: ROD)

Escalation procedure: Send mail to the site administrator with CC to NGI/ROC manager and GGUS (an operational ticket is created).

Content of the message:
  • ask for immediate action
  • in case of no response for 3 working days ROD will escalate the issue

Step 2 (max. duration: 3 work days; responsible unit: ROD)

Escalation procedure: Send mail to the site administrator with CC to NGI/ROC manager and GGUS. Optionally, phone the site to make sure that the e-mail communication channel is working. After 3 days with no response from the site administrator, the issue should be escalated to COD.

Content of the message:
  • ask for immediate action
  • in case of no response for 3 working days ROD will escalate the issue to COD

Step 3 (max. duration: 3 work days; responsible unit: COD)

Escalation procedure: Send mail to NGI/ROC manager with CC to site administrator, ROD and GGUS.

Content of the message:
  • inform the NGI/ROC manager that they should make the site react to the ticket or suspend the site within 3 days
  • if NGI does not react, COD will suspend the site on the 4th day

Step 4 (max. duration: 1 work day; responsible unit: COD)

Escalation procedure: If no response is obtained from the site, ROD or the NGI/ROC manager, send mail to NGI/ROC manager with CC to site administrator, ROD and GGUS. If there is no response after 1 working day, COD performs site suspension.

Content of the message:
  • ask the NGI/ROC manager to suspend the site
  • if no response after 1 working day COD will perform site suspension
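
The timing of the four steps above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the names `SITE_STEPS`, `add_work_days` and `step_deadlines` are hypothetical, only Saturdays and Sundays are treated as non-working days (public holidays are ignored), no response is ever received, and each step is assumed to start on the day the previous step's maximum duration expires.

```python
from datetime import date, timedelta

# (step, max. duration in work days, responsible unit), as listed in the procedure
SITE_STEPS = [(1, 3, "ROD"), (2, 3, "ROD"), (3, 3, "COD"), (4, 1, "COD")]


def add_work_days(start: date, days: int) -> date:
    """Return the date `days` working days after `start`.

    Assumption: Monday to Friday are working days; holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0..4 are Monday..Friday
            remaining -= 1
    return current


def step_deadlines(first_mail: date):
    """Return [(step, unit, deadline)] assuming nobody ever responds."""
    deadlines = []
    start = first_mail
    for step, max_days, unit in SITE_STEPS:
        deadline = add_work_days(start, max_days)
        deadlines.append((step, unit, deadline))
        # Assumption: the next step starts when the previous one expires.
        start = deadline
    return deadlines


if __name__ == "__main__":
    for step, unit, deadline in step_deadlines(date(2010, 11, 1)):
        print(f"step {step} ({unit}) expires on {deadline.isoformat()}")
```

Under these assumptions, the last deadline returned is the day on which COD would perform the site suspension.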

NB: When the monthly availability/reliability thresholds are not met, the site is requested to provide a justification through a COD ticket. For the purposes of this procedure, such a ticket is equivalent to an alarm appearing on the ROD dashboard.

The communication should be recorded in the GGUS ticket.

Escalation for operational problem with ROD

This section covers a critical part of operations: handling problems with ROD. COD must follow the escalation procedure whenever a problem related to ROD work is detected. The main goal of the procedure is to track the problem follow-up process as a whole and keep it consistent from the time of detection until a final solution is reached.

The procedure applies only when ROD is not handling issues on the operational dashboard according to operational procedures.

Step 1 (max. duration: 3 work days; responsible unit: COD)

Escalation procedure: Send mail to the ROD with CC to NGI/ROC, COD and GGUS (an operational ticket is created).

Content of the message:
  • ask for an explanation of why an issue was not handled according to procedures
  • ask for immediate action
  • in case of no response for 3 working days COD will contact NGI/ROC manager

Step 2 (max. duration: 3 work days; responsible unit: COD)

Escalation procedure: Send mail to NGI/ROC manager with CC to ROD, COD and GGUS.

Content of the message:
  • report that ROD is not responsive and not handling operational issues according to procedure
  • in case of no response for 3 working days COD will contact COO

Step 3 (without delay; responsible unit: COD)

Escalation procedure: Send mail to COO with CC to NGI/ROC manager, COD and GGUS.

Content of the message:
  • report that ROD and NGI/ROC manager are not responsive and not handling operational issues according to procedure

The precondition for stopping the escalation is that all issues not handled according to procedure have disappeared from the COD dashboard.
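
As a rough illustration of the addressing and the stop condition described above, the three steps can be written as a lookup table. This is a sketch only: `ROD_ESCALATION` and `next_mail` are hypothetical names, and the count of unhandled issues is assumed to be read from the COD dashboard by some other means.

```python
# step -> (To, CC), copied from the ROD escalation steps above
ROD_ESCALATION = {
    1: ("ROD", ["NGI/ROC", "COD", "GGUS"]),
    2: ("NGI/ROC manager", ["ROD", "COD", "GGUS"]),
    3: ("COO", ["NGI/ROC manager", "COD", "GGUS"]),
}


def next_mail(step, unhandled_issues):
    """Return (to, cc) for the mail COD sends at `step`.

    Returns None when no unhandled issues remain, which is the
    precondition for stopping the escalation.
    """
    if unhandled_issues == 0:
        return None
    if step not in ROD_ESCALATION:
        raise ValueError(f"unknown escalation step: {step}")
    return ROD_ESCALATION[step]


if __name__ == "__main__":
    print(next_mail(2, unhandled_issues=4))
    print(next_mail(3, unhandled_issues=0))
```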

The communication should be recorded in the GGUS ticket.

Drafts

The new procedure will be tentatively adopted starting from 1 October.

Escalation for operational problem at site

This section covers a critical part of operations: detecting, identifying and solving problems at sites. ROD must follow the escalation procedure whenever a problem related to a site is detected. The main goal of the procedure is to track the problem follow-up process as a whole and keep it consistent from the time of detection until a final solution is reached.

Below are the detailed steps of the escalation procedure if no response is received for the notification of a problem or the problem has been unattended for.

[Image: Escalation procedure min.png]

When an alarm appears on the ROD dashboard (24 hours after the problem occurred):

Step 1 (max. duration: 3 work days; responsible unit: ROD)

Escalation procedure: Send mail to the site administrator with CC to NGI/ROC manager and GGUS (an operational ticket is created).

Content of the message:
  • ask for immediate action
  • in case of no response for 3 working days ROD will escalate the issue

Step 2 (max. duration: 3 work days; responsible unit: ROD)

Escalation procedure: Send mail to the site administrator with CC to NGI/ROC manager and GGUS. Optionally, phone the site to make sure that the e-mail communication channel is working. After 3 days with no response from the site administrator, the issue should be escalated to the NGI manager.

Content of the message:
  • ask for immediate action
  • in case of no response for 3 working days ROD will escalate the issue to NGI manager

Step 3 (max. duration: 5 work days; responsible unit: NGI manager/ROD)

Escalation procedure: The NGI manager should, at the political level, make the site responsive or suspend the site. If the problem needs to be escalated to EGI level, the NGI manager asks ROD to send a mail to COD with CC to site administrator, ROD and GGUS.

Content of the message:
  • inform COD that the problem cannot be solved at the NGI level and ask for help to resolve it

Step 4 (max. duration: 1 work day; responsible unit: COD)

Escalation procedure:
  1. If no action was taken by the NGI manager for 5 working days, COD sends a mail to NGI/ROC manager with CC to site administrator, ROD and GGUS. If there is no response after 1 working day, COD performs site suspension.
  2. If NGI cannot solve the problem at the NGI level, COD tries to help find a solution.

Content of the message:
  • ask the NGI/ROC manager to suspend the site
  • if no response after 1 working day COD will perform site suspension
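
The two branches of step 4 in the draft can be sketched as a small decision function. This is an illustrative reading of the draft, not part of the procedure: `CodAction`, `cod_step4` and its parameters are hypothetical names, and "for 5 working days" and "after 1 working day" are interpreted as "at least 5" and "at least 1" elapsed working days.

```python
from enum import Enum
from typing import Optional


class CodAction(Enum):
    WAIT = "wait"
    HELP = "help NGI to find a solution"
    MAIL_NGI = "mail NGI/ROC manager asking to suspend the site"
    SUSPEND = "suspend the site"


def cod_step4(ngi_asked_for_help: bool,
              ngi_idle_work_days: int,
              work_days_since_cod_mail: Optional[int]) -> CodAction:
    """Pick the COD action for draft step 4.

    work_days_since_cod_mail is None while COD has not yet mailed
    the NGI/ROC manager.
    """
    if ngi_asked_for_help:
        # Branch 2: the problem cannot be solved at the NGI level.
        return CodAction.HELP
    if ngi_idle_work_days < 5:
        # Step 3 has not expired yet.
        return CodAction.WAIT
    if work_days_since_cod_mail is None:
        # Branch 1: NGI manager took no action for 5 working days.
        return CodAction.MAIL_NGI
    if work_days_since_cod_mail >= 1:
        return CodAction.SUSPEND
    return CodAction.WAIT


if __name__ == "__main__":
    print(cod_step4(False, 5, None).value)
```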


The communication should be recorded in the GGUS ticket.