The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can contact the EGI SDIS team at operations @ egi.eu.

Difference between revisions of "PROC24"

From EGIWiki
 
(12 intermediate revisions by 2 users not shown)
Line 5: Line 5:


{{Ops_procedures
{{Ops_procedures
|Doc_title = Handling Major Incidents
|Doc_title =  
|Doc_link = https://wiki.egi.eu/wiki/PROC24
|Doc_link = https://wiki.egi.eu/wiki/PROC24
|Version = 0.1
|Version =  
|Policy_acronym = OMB
|Policy_acronym =  
|Policy_name = Operations Management Board
|Policy_name =  
|Contact_group = operations at mailman.egi.eu
|Contact_group =  
|Doc_status = DRAFT
|Doc_status = DRAFT
|Approval_date =  
|Approval_date =  
|Procedure_statement = This document describes the process for handling a major incident within the EGI infrastructure.
|Procedure_statement =
}}
}}


= Overview  =
= Overview  =
Major incidents cause serious interruptions of activities and must be resolved with greater urgency than other incidents. The aim of this procedure is fast recovery of the service, where necessary by means of workarounds and by creating a dynamically established team of experts coordinated by EGI Operations.


= Definitions  =
= Definitions  =
Please refer to the [[Glossary|EGI Glossary]] for the definitions of the terms used in this procedure.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.


= Entities involved in the procedure  =
= Entities involved in the procedure  =
*'''Incident Manager (IM)''': a member of EGI Operations who classifies the incident as a major incident and, if needed, establishes and coordinates a dynamic team of a few experts to address the issue
*'''Customers/Users''': if a significant number of customers/users are unable to use the services, or if the incident may cause them significant loss or cost, they can be provided directly with temporary workarounds and mitigation procedures to use until the incident is resolved
*'''Service providers''': teams responsible for development, release and deployment of the services involved in the incident


= Requirements  =
= Requirements  =
The incident has a major impact on the EGI infrastructure, for instance because it involves a service heavily used by several VOs. This is the case if:
* a significant number of grid and/or cloud resources are not available to EGI users
* central operation tools are not available
* a significant number of users are affected by the unavailability of specific services
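
The conditions listed above could be checked along the following lines. This is only an illustrative sketch, not part of the procedure: the <code>IncidentReport</code> fields, the <code>is_major_incident</code> function and the 25% threshold are all hypothetical, since the procedure does not quantify "significant", and treating any single condition as sufficient is an assumption.

```python
from dataclasses import dataclass


@dataclass
class IncidentReport:
    """Hypothetical summary of an incident; all field names are illustrative."""
    unavailable_resource_fraction: float  # share of grid/cloud resources unavailable (0.0 to 1.0)
    central_tools_unavailable: bool       # True if central operation tools are down
    affected_user_fraction: float         # share of users affected by the outage (0.0 to 1.0)


# Assumed threshold: the procedure says "significant" without giving a number.
SIGNIFICANT_FRACTION = 0.25


def is_major_incident(report: IncidentReport,
                      threshold: float = SIGNIFICANT_FRACTION) -> bool:
    """Return True if at least one of the three listed conditions holds.

    Treating a single condition as sufficient is an assumption of this sketch.
    """
    return (
        report.unavailable_resource_fraction >= threshold
        or report.central_tools_unavailable
        or report.affected_user_fraction >= threshold
    )
```

In practice the Incident Manager makes this classification; a check like this could at most support that decision.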


= Steps  =
= Steps  =
The following table describes the steps of the procedure.
{| class="wikitable"
|-
! Step#
! <br>
! Responsible
! Action
! Prerequisites, if any
|- valign="top"
|
|
|
|
|
|}
= Revision History  =
{| class="wikitable"
|-
! Version !! Authors !! Date !! Comments
|-
|
|
|
|
|}

Latest revision as of 14:40, 29 July 2016



Title
Document link https://wiki.egi.eu/wiki/PROC24
Last modified
Policy Group Acronym
Policy Group Name
Contact Group
Document Status DRAFT
Approved Date
Procedure Statement
Owner Owner of procedure


Overview

Definitions

Entities involved in the procedure

Requirements

Steps