The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team using operations @ egi.eu.

{{Template: Op menubar}}
{{Template:Doc_menubar}}
[[Category:Operations Manuals]]
{{TOC_right}}


{{DeprecatedAndMovedTo|new_location=https://docs.egi.eu/providers/operations-manuals/man04_tool_intervention_management}}
{| border="1"
|-
| '''Title'''
| ''Management of central operational tool unscheduled downtimes''
|-
| '''Document link'''
| ''https://wiki.egi.eu/wiki/MAN03_Tool_Intervention_Management''
|-
| '''Last review'''
| [[User:Tferrari|Tferrari]] 13:55, 8 March 2011 (UTC)
|-
| '''Policy Group Acronym'''
| ''OMB''
|-
| '''Policy Group Name'''
| ''Operations Management Board''
|-
| '''Contact Person'''
| ''E. Imamagic''
|-
| '''Document Status'''
| ''draft''
|-
| '''Approved Date'''
| specify
|-
| '''Procedure Statement'''
| ''This manual provides information on how to manage central operational tool unscheduled downtimes.''
|-
|}


----
 
= Management of central operational tool unscheduled downtimes =
 
The purpose of this document is to describe the intervention procedure in case of an unscheduled failure of a central operational tool.
 
= Scope =
 
This manual applies only to unscheduled downtimes of central operational tools. The list of central operational tools is available [[Tools|here]].

Scheduled downtimes are managed according to existing procedures.
 
= Announcements =
 
All announcements should be sent with the Operations Portal Broadcast tool.
 
When using the Operations Portal Broadcast tool, the following groups should be included:
* LCG Rollout Mailing List
* Operators Mailing lists
* OSG Mailing list
* Tool Admins Mailing List
* WLCG Tier 1 contacts
* NGI managers
* VO managers
* VO users
* Site administrators
* Operation tools
 
'''Notice:''' Individual notification templates, together with their targets, are predefined in the Operations Portal Broadcast tool. Administrators are advised to use the predefined templates.
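
As a minimal sketch, the group list above can be treated as a checklist before a broadcast is sent. The function name <code>missing_groups</code> is hypothetical and is not part of the Operations Portal; the group names are copied from the list above.

```python
# Required target groups, copied from the list in this section.
REQUIRED_GROUPS = {
    "LCG Rollout Mailing List",
    "Operators Mailing lists",
    "OSG Mailing list",
    "Tool Admins Mailing List",
    "WLCG Tier 1 contacts",
    "NGI managers",
    "VO managers",
    "VO users",
    "Site administrators",
    "Operation tools",
}


def missing_groups(selected):
    """Return the required groups absent from `selected`, sorted by name.

    Hypothetical helper for illustration only.
    """
    return sorted(REQUIRED_GROUPS - set(selected))


# Example: a selection that forgot two of the groups.
selected = REQUIRED_GROUPS - {"VO users", "NGI managers"}
print(missing_groups(selected))
```

An empty result means every group from the list is selected.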
 
= Procedure =
 
The following sections cover the relevant scenarios.
 
== Case 1: short "undetected" downtime ==
 
'''Description:''' The service fails and recovers before the administrator manages to react (e.g. a short power or network outage).

'''Action:''' The administrator announces the failure using the following template:
<pre>
Subject:
[SERVICE_NAME] unscheduled downtime
 
Message:
Dear all,
[SERVICE_NAME] experienced unscheduled downtime between [START] and [END].
 
[DETAILED_FAILURE_DESCRIPTION]
 
Apologies for any inconvenience caused.
Best Regards
[SERVICE_TEAM]
</pre>
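
For administrators who prepare the text outside the Broadcast tool, the placeholders in the template above can be filled programmatically. The following is only a sketch: <code>fill_template</code> is a hypothetical helper, and the service name, times and description in the example are invented for illustration.

```python
import re

# Case 1 template, copied from the section above.
CASE1_TEMPLATE = """\
Subject:
[SERVICE_NAME] unscheduled downtime

Message:
Dear all,
[SERVICE_NAME] experienced unscheduled downtime between [START] and [END].

[DETAILED_FAILURE_DESCRIPTION]

Apologies for any inconvenience caused.
Best Regards
[SERVICE_TEAM]
"""

# Matches placeholders such as [SERVICE_NAME] or [START].
PLACEHOLDER = re.compile(r"\[([A-Z_]+)\]")


def fill_template(template, values):
    """Replace every [NAME] placeholder with values["NAME"].

    Raises KeyError if a placeholder has no value, so an announcement
    with an unfilled field is never produced silently.
    """
    def substitute(match):
        return values[match.group(1)]

    return PLACEHOLDER.sub(substitute, template)


# Illustrative values only; none of them come from a real incident.
announcement = fill_template(CASE1_TEMPLATE, {
    "SERVICE_NAME": "Example Service",
    "START": "2011-03-08 10:00 UTC",
    "END": "2011-03-08 10:20 UTC",
    "DETAILED_FAILURE_DESCRIPTION": "A short network outage interrupted the service.",
    "SERVICE_TEAM": "Example Service team",
})
print(announcement)
```

The same helper works for the other templates in this manual, since they use the same bracketed placeholder style.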
 
== Case 2: long "detected" outage ==
 
The service fails and the administrator detects the problem. Recovery takes a longer time (TODO: longer than 1h?). The sections below describe the individual situations.
 
=== 1. Outage ===
 
'''Description:''' A service failure is detected.

'''Action:''' The administrator announces the failure using the following template:
<pre>
Subject:
[SERVICE_NAME] outage
 
Message:
Dear all,
[SERVICE_NAME] is experiencing unscheduled downtime.
[ADDITIONAL_INFO]
Apologies for any inconvenience caused.
Best Regards
[SERVICE_TEAM]
</pre>
 
=== 2. Extended downtime ===
 
'''Description:''' The service recovery is delayed.

'''Action:''' The administrator announces that recovery is taking longer using the following template:
<pre>
Subject:
[SERVICE_NAME] extended outage
 
Message:
Dear all,
Outage of [SERVICE_NAME] is extended.
 
[ADDITIONAL_INFO]
Apologies for any inconvenience caused.
Best Regards
[SERVICE_TEAM]
</pre>
 
'''Note:''' In this template, [ADDITIONAL_INFO] should contain the estimated time of recovery.
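
One possible way to produce the [ADDITIONAL_INFO] text is sketched below. The wording "Estimated time of recovery" and the UTC timestamp format are assumptions, not requirements of this manual, and <code>recovery_estimate_info</code> is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def recovery_estimate_info(eta):
    """Format an estimated recovery time for the [ADDITIONAL_INFO] field.

    `eta` must be a timezone-aware datetime; it is converted to UTC so
    that readers in different time zones see one unambiguous value.
    The sentence wording is an assumption made for this sketch.
    """
    if eta.tzinfo is None or eta.utcoffset() is None:
        raise ValueError("eta must be timezone-aware")
    eta_utc = eta.astimezone(timezone.utc)
    return "Estimated time of recovery: " + eta_utc.strftime("%Y-%m-%d %H:%M UTC")


# Illustrative value: 17:30 in a UTC+2 zone is 15:30 UTC.
local_eta = datetime(2011, 3, 8, 17, 30, tzinfo=timezone(timedelta(hours=2)))
print(recovery_estimate_info(local_eta))
```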
 
=== 3. Recovery ===
 
'''Description:''' The service has recovered.

'''Action:''' At the time of recovery, the administrator announces the recovery using the following template:
<pre>
Subject:
[SERVICE_NAME] recovery
 
Message:
Dear all,
[SERVICE_NAME] is back online.
 
[ADDITIONAL_INFO]
Best Regards
[SERVICE_TEAM]
</pre>
 
=== 4. Post mortem analysis ===
 
'''Description:''' The service failure required further time to investigate. This step applies only if a post mortem analysis is needed.

'''Action:''' The administrator announces the post mortem analysis of the failure using the following notification template:
<pre>
Subject:
[SERVICE_NAME] outage analysis
 
Message:
Dear all,
[SERVICE_NAME] experienced unscheduled downtime between [START] and [END].
 
[DETAILED_FAILURE_DESCRIPTION]
Best Regards
[SERVICE_TEAM]
</pre>
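
The order of announcements in Case 2 can be summarised with a short sketch. It assumes that the extended outage notice is sent only when recovery is delayed and that each notice is sent once; the function name <code>case2_announcements</code> is hypothetical, and the returned strings are the subject suffixes of the templates above.

```python
def case2_announcements(recovery_delayed, post_mortem_needed):
    """Return the Case 2 announcement subjects in the order they are sent.

    Assumption for this sketch: "extended outage" is sent only if the
    recovery is delayed, and "outage analysis" only if a post mortem
    analysis is needed.
    """
    steps = ["outage"]
    if recovery_delayed:
        steps.append("extended outage")
    steps.append("recovery")
    if post_mortem_needed:
        steps.append("outage analysis")
    return steps


# Example: a delayed recovery that also needs a post mortem analysis.
print(case2_announcements(recovery_delayed=True, post_mortem_needed=True))
```

The outage and recovery announcements appear in every sequence; the other two are conditional.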
 
= Revision History =
<!-- to track changes introduced after the document is officially approved -->

Latest revision as of 10:55, 31 August 2021