The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team using operations @ egi.eu.

Difference between revisions of "PROC01 EGI Infrastructure Oversight escalation"

From EGIWiki
(Remove deprecated content)
Tag: Replaced
 
(152 intermediate revisions by 12 users not shown)
Line 1: Line 1:
==== 6.3.2.1 Workflow and escalation procedure ====
{{Template:Op menubar}} {{Template:Doc_menubar}}
 
[[Category:Deprecated]]
This section introduces a critical part of operations: detecting, identifying and solving site problems. Operators must follow the escalation procedure whenever any problem related to a site is detected. Its main goal is to track the problem follow-up process as a whole and to keep the process consistent from the time of detection until the final solution is reached.
{| style="border:1px solid black; background-color:lightgrey; color: black; padding:5px; font-size:140%; width: 90%; margin: auto;"
| style="padding-right: 15px; padding-left: 15px;" |  
|[[File:Alert.png]] This page is '''Deprecated'''; the content has been moved to https://confluence.egi.eu/display/EGIPP/PROC01+EGI+Infrastructure+Oversight+escalation 
 
Moreover, the procedure is intended to introduce a hierarchical structure and a clear distribution of responsibility in problem solving, which should significantly improve the quality of the production grid service. Consequently, minimizing the delay between the steps of the procedure is of utmost importance. The regular procedure that operators follow can be divided into four phases:
 
* submitting problems to the problem tracking tool after they are detected by monitoring tools or raised in a task created by an operations team (COD or ROD);
* updating the task when a site's state changes, which can be detected either by comparing the monitoring information with the current state of the task in the problem tracking tool, or from input by an operations team (COD or ROD);
* closing tickets, or escalating outdated tickets, when deadlines are reached in the problem tracking tool;
* initiating the last escalation step and/or communicating with the site administrators and the regional operations team.
 
Below are the detailed steps of the '''escalation procedure''', which apply if '''no response''' is received to the notification of a problem, or if the problem has been left unattended for the maximum duration given for each step:
 
{| border="1"
|-
| '''Step [#]''' || '''Max. Duration [work days]''' || '''Resp. Unit''' ||'''Escalation procedure'''
|-
| 1 || 3 || ROD || When an alarm appears on the ROD dashboard (>24 hours old): 1st mail to site admin
|-
| 2 || 3 || ROD || 2nd mail to site admin and '''a phone call to site'''; at the end of this period, escalate to COD
|-
| 3 || 4 || COD || Ticket escalated to COD. Within that week, COD should act on the ticket by sending an email to the NGI manager, ROD and site requesting immediate action; otherwise the site will be suspended.
|-
| 4 || - || COD || If no response is obtained from the site, ROD or NGI manager, COD suspends the site.
|-
|}
|}
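
The table above can be read as a lookup from the number of working days a problem has gone unanswered to the step and responsible unit that applies. The following is only an illustrative sketch, not part of the official procedure: the names <code>Step</code>, <code>STEPS</code> and <code>current_step</code> are hypothetical, and the boundary rule (a step ends once its full maximum duration has elapsed) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    number: int
    max_days: Optional[int]  # maximum duration in working days; None = no limit
    unit: str                # responsible unit (ROD or COD)
    action: str


# Durations and responsible units copied from the escalation table.
STEPS = [
    Step(1, 3, "ROD", "1st mail to site admin"),
    Step(2, 3, "ROD", "2nd mail and phone call to site, then escalate to COD"),
    Step(3, 4, "COD", "mail to NGI manager, ROD and site"),
    Step(4, None, "COD", "site suspension"),
]


def current_step(days_unattended: int) -> Step:
    """Return the escalation step that applies after `days_unattended` working days.

    Day 0 is the day the alarm (already more than 24 hours old) appears on
    the ROD dashboard. Assumption: a step ends when its full duration is used up.
    """
    if days_unattended < 0:
        raise ValueError("days_unattended must be non-negative")
    elapsed = 0
    for step in STEPS:
        if step.max_days is None or days_unattended < elapsed + step.max_days:
            return step
        elapsed += step.max_days
    return STEPS[-1]
```

Under these assumptions, <code>current_step(4)</code> falls in step 2 (ROD), and any value of 10 or more working days falls in step 4 (COD).
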
NB: In theory, the whole process can be completed within a two-week period (3 + 3 + 4 = 10 working days). This is in line with the suspension procedure proposed for not replying to low availability/reliability figures. Most often, a site solves the problem well before operators need to escalate the issue to COD, or the site is suspended on the spot for security reasons.
''MR: The below refers to inability '''to solve''' the issue, not to a failure to reply.'' We do not have to chase sites for not solving the problem: if they cannot solve it, their availability suffers, and low availability is tracked by other means.
NB: After the first 3 days, at the 2nd escalation step, if the site has not solved its problem, ROD should suggest that the site declare a downtime until the problem is solved, and the NGI should be notified. If the site does not accept the downtime, COD will proceed with the regular escalation procedure at the agreed deadlines.
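
To make the two-week estimate concrete, the step durations can be turned into calendar deadlines. This is a minimal sketch under stated assumptions: the helper names <code>add_working_days</code> and <code>escalation_deadlines</code> are hypothetical, working days are taken to be Monday to Friday, and public holidays are ignored.

```python
from datetime import date, timedelta


def add_working_days(start: date, days: int) -> date:
    """Advance `start` by `days` working days, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def escalation_deadlines(start: date) -> dict:
    """Return the latest end date of steps 1 to 3, counted from `start`."""
    # Maximum durations in working days, taken from the escalation table.
    durations = {"step 1 (ROD)": 3, "step 2 (ROD)": 3, "step 3 (COD)": 4}
    deadlines = {}
    current = start
    for name, days in durations.items():
        current = add_working_days(current, days)
        deadlines[name] = current
    return deadlines


# Example: an alarm first handled on Monday 4 April 2022.
deadlines = escalation_deadlines(date(2022, 4, 4))
for name, deadline in deadlines.items():
    print(name, deadline.isoformat())
```

For a start on Monday 4 April 2022, the three steps end on 7, 12 and 18 April respectively, so suspension could follow exactly 14 calendar days after the start, provided no public holidays fall in between.
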

Latest revision as of 10:41, 15 April 2022