
Difference between revisions of "PROC01 EGI Infrastructure Oversight escalation"


Revision as of 15:00, 12 August 2010

6.3.2.1 Workflow and escalation procedure

This section introduces a critical part of operations: detecting, identifying, and solving problems at sites. The escalation procedure must be followed by operators whenever a problem related to a site is detected. Its main goal is to track the whole problem follow-up process and keep it consistent from the time of detection until a final solution is reached.

Moreover, the procedure is intended to introduce a hierarchical structure and a distribution of responsibility in problem solving, which should significantly improve the quality of the production grid service. Minimizing the delay between the steps of the procedure is therefore of utmost importance. The regular procedure that operators follow can be divided into four phases:

  • submitting problems to the problem tracking tool after they are detected by the monitoring tools or raised as a task by an operations team (COD or ROD);
  • updating the task when the site state changes, which can be detected either by comparing the monitoring information with the current state of the task in the problem tracking tool, or from input by an operations team (COD or ROD);
  • closing tickets, or escalating outdated tickets, when deadlines are reached in the problem tracking tool;
  • initiating the last escalation step and/or communication with the site administrators and the regional operations team.

Below are the detailed steps of the escalation procedure. They apply if no response is received to the notification of a problem, or if the problem has been left unattended for the maximum duration given for the step:

Step [#] | Max. duration [work days] | Escalation procedure
1 | 3 | When an alarm appears on the ROD dashboard (>24 hours old): 1st mail to site admin and ROC.
2 | 3 | 2nd mail to site admin and ROC. At the end of this period, escalate to C-COD.
3 | 5 | Ticket escalated to C-COD. Within that week, C-COD should act on the ticket by sending an email to the ROC, ROD and site asking for immediate action and stating that representation at the next weekly operations meeting is requested. The discussion may also include site suspension.
4 | - | (If no response is obtained from either the site or the ROC) C-COD will discuss the ticket at the FIRST Weekly Operations Meeting and involve the Operation and Coordination Center (OCC) in the ticket.
5 | 5 | Discuss at the SECOND weekly operations meeting and assign the ticket to OCC.
6 | - | Where applicable, C-COD will request OCC to approve site suspension.
7 | - | C-COD will ask the ROC to suspend the site.
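
The table above can be read as a small data structure. The following Python sketch is illustrative only and is not part of the procedure: the names `EscalationStep`, `STEPS`, `cumulative_deadlines` and `timed_step_at` are hypothetical, the action texts are abbreviated, and it assumes that steps without a stated duration (4, 6 and 7) add no work days of their own, which the table does not say explicitly. Elapsed time is counted in work days from the start of step 1.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EscalationStep:
    number: int
    max_work_days: Optional[int]  # None: the table gives no duration
    action: str                   # abbreviated from the table


# Transcribed from the escalation table; wording shortened.
STEPS = [
    EscalationStep(1, 3, "1st mail to site admin and ROC"),
    EscalationStep(2, 3, "2nd mail to site admin and ROC, then escalate to C-COD"),
    EscalationStep(3, 5, "C-COD emails ROC, ROD and site"),
    EscalationStep(4, None, "Discuss at FIRST weekly operations meeting, involve OCC"),
    EscalationStep(5, 5, "Discuss at SECOND meeting, assign ticket to OCC"),
    EscalationStep(6, None, "C-COD requests OCC approval for site suspension"),
    EscalationStep(7, None, "C-COD asks ROC to suspend the site"),
]


def cumulative_deadlines(steps: List[EscalationStep]) -> List[Tuple[int, int]]:
    """Return (step number, work days elapsed when the step expires).

    Only timed steps are listed. Assumption: untimed steps add no days.
    """
    total = 0
    deadlines = []
    for step in steps:
        if step.max_work_days is not None:
            total += step.max_work_days
            deadlines.append((step.number, total))
    return deadlines


def timed_step_at(elapsed_work_days: int,
                  steps: List[EscalationStep] = STEPS) -> Optional[int]:
    """Return the timed step an unanswered ticket would be in,
    or None once every timed deadline has passed."""
    for number, deadline in cumulative_deadlines(steps):
        if elapsed_work_days < deadline:
            return number
    return None


if __name__ == "__main__":
    print(cumulative_deadlines(STEPS))
    print(timed_step_at(4))
```

Under these assumptions, a ticket that has been unanswered for 4 work days falls in step 2. Steps 4, 6 and 7 are driven by meetings and approvals rather than by a timer, so the sketch never returns them.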

NB: In theory, the whole process can be completed within 2-3 weeks. Most often, a site is either suspended on the spot for security reasons, or the problem is solved (or the site and the ROC react) well before operators need to escalate the issue to C-COD, which then determines whether to bring it to the Weekly Operations Meetings.

NB: After the first 3 days, at the 2nd escalation step, if the site has not solved its problem, ROD should suggest that the site declare downtime until the problem is solved, and the ROC should be notified. If the site does not accept the downtime, C-COD will proceed with the regular escalation procedure at the agreed deadlines.
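
The decision described in this note can be sketched as a small function. This is a hedged illustration, not an official tool: the name `second_step_actions` and the action strings are hypothetical. The note does not say what follows when the site accepts the downtime, nor what happens here if the problem is already solved, so the sketch adds no further action in those cases.

```python
from typing import List


def second_step_actions(problem_solved: bool,
                        downtime_accepted: bool) -> List[str]:
    """List the actions at the 2nd escalation step (after the first 3 days)."""
    if problem_solved:
        # Assumption: a solved problem needs no action under this note.
        return []
    actions = [
        "ROD suggests that the site declare downtime",
        "notify the ROC",
    ]
    if not downtime_accepted:
        # Stated in the note: escalation continues as usual.
        actions.append(
            "C-COD proceeds with regular escalation at the agreed deadlines"
        )
    return actions


if __name__ == "__main__":
    for action in second_step_actions(problem_solved=False,
                                      downtime_accepted=False):
        print(action)
```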
