Revision as of 19:15, 3 August 2011
Title | Tool Intervention Management |
Document link | https://wiki.egi.eu/wiki/MAN04 |
Last review | T. Ferrari, 03 August 2011 |
Policy Group Acronym | OMB |
Policy Group Name | Operations Management Board |
Contact Person | E. Imamagic |
Document Status | Approved, v. 1.0 |
Approved Date | 01 August 2011 |
Procedure Statement | This manual provides information on how to manage central operational tool unscheduled downtimes. |
Tool Intervention Management
This document describes how to handle the unscheduled failure of a central operational tool.
Scope
This manual applies only to unscheduled downtimes of central operational tools. The list of central operational tools is available on the Tools page.
Note: Scheduled downtimes are managed according to existing procedures (MAN02).
Announcements
All announcements should be sent with the Operations Portal Broadcast tool.
When using the Operations Portal Broadcast tool, the following groups should be included:
- LCG Rollout Mailing List
- Operators Mailing lists
- OSG Mailing list
- Tool Admins Mailing List
- WLCG Tier 1 contacts
- NGI managers
- VO managers
- VO users
- Site administrators
- Operation tools
Notice: Individual notification templates, together with their targets, are predefined in the Operations Portal Broadcast tool. Administrators are advised to use these predefined templates.
Procedure
In the following sections several relevant scenarios are covered.
Case 1: short "undetected" downtime
Description: The service fails and recovers before the administrator manages to react (e.g. a short power or network outage).
Action: The administrator announces the failure by using the following template:
Subject: [SERVICE_NAME] unscheduled downtime
Message:
Dear all,
[SERVICE_NAME] experienced unscheduled downtime between [START] and [END].
[DETAILED_FAILURE_DESCRIPTION]
Apologies for any inconvenience caused.
Best Regards
[SERVICE_TEAM]
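As an illustration of how the bracketed placeholders in such a template could be filled in before sending, here is a minimal Python sketch. The function name render_template and the sample values are hypothetical and are not part of the Operations Portal Broadcast tool; the sketch only shows plain text substitution and refuses to produce a message if any placeholder is left without a value.

```python
import re

# Text of the Case 1 template from this manual.
CASE1_TEMPLATE = """Subject: [SERVICE_NAME] unscheduled downtime
Message:
Dear all,
[SERVICE_NAME] experienced unscheduled downtime between [START] and [END].
[DETAILED_FAILURE_DESCRIPTION]
Apologies for any inconvenience caused.
Best Regards
[SERVICE_TEAM]"""


def render_template(template: str, values: dict) -> str:
    """Replace every [PLACEHOLDER] with its value (hypothetical helper).

    Raises KeyError if the template contains a placeholder that has no
    entry in `values`, so an incomplete announcement is never produced.
    """
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder [{name}]")
        return values[name]

    # Placeholders are upper-case names with underscores inside brackets.
    return re.sub(r"\[([A-Z_]+)\]", substitute, template)


if __name__ == "__main__":
    # Sample values are invented for illustration only.
    print(render_template(CASE1_TEMPLATE, {
        "SERVICE_NAME": "Example Tool",
        "START": "2011-08-03 10:00 UTC",
        "END": "2011-08-03 10:20 UTC",
        "DETAILED_FAILURE_DESCRIPTION": "A short network outage occurred.",
        "SERVICE_TEAM": "Example Tool Team",
    }))
```

The same helper would work for the other templates in this manual, since they all use the same bracketed placeholder style.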
Case 2: long "detected" outage
The service fails and the administrator detects the problem. Recovery takes at least 1 hour. The sections below describe the individual stages.
1. Outage
Description: Service failure is detected.
Action: The administrator announces the failure by using the following template:
Subject: [SERVICE_NAME] outage
Message:
Dear all,
[SERVICE_NAME] is experiencing unscheduled downtime.
[ADDITIONAL_INFO]
Apologies for any inconvenience caused.
Best Regards
[SERVICE_TEAM]
2. Extended downtime
Description: Service recovery is delayed. An update should be sent at least every 24 hours.
Action: The administrator announces that recovery is taking longer by using the following template:
Subject: [SERVICE_NAME] extended outage
Message:
Dear all,
Outage of [SERVICE_NAME] is extended.
[ADDITIONAL_INFO]
Apologies for any inconvenience caused.
Best Regards
[SERVICE_TEAM]
Note: In this template [ADDITIONAL_INFO] should indicate the estimated time of recovery.
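The 24-hour update rule for an extended downtime can be checked with simple date arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes the administrator records the time of the last announcement as a timezone-aware timestamp.

```python
from datetime import datetime, timedelta, timezone

# From this manual: during an extended downtime an update
# should be sent at least every 24 hours.
UPDATE_INTERVAL = timedelta(hours=24)


def next_update_deadline(last_announcement: datetime) -> datetime:
    """Latest time by which the next update should be sent (hypothetical helper)."""
    return last_announcement + UPDATE_INTERVAL


def update_overdue(last_announcement: datetime, now: datetime) -> bool:
    """True once 24 hours or more have passed since the last announcement."""
    return now >= next_update_deadline(last_announcement)


if __name__ == "__main__":
    # Example timestamps are invented for illustration only.
    last = datetime(2011, 8, 3, 9, 0, tzinfo=timezone.utc)
    print(next_update_deadline(last))
    print(update_overdue(last, datetime(2011, 8, 4, 8, 0, tzinfo=timezone.utc)))
```

Using timezone-aware timestamps avoids ambiguity when administrators and recipients are in different time zones.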
3. Recovery
Description: Service is recovered.
Action: At the time of recovery, the administrator announces the recovery by using the following template:
Subject: [SERVICE_NAME] recovery
Message:
Dear all,
[SERVICE_NAME] is back online.
[ADDITIONAL_INFO]
Best Regards
[SERVICE_TEAM]
4. Post mortem analysis
Description: Investigating the source of the failure required additional time. This step is required only if a post mortem analysis is needed.
Action: The administrator announces the post mortem analysis of the failure by using the following notification template:
Subject: [SERVICE_NAME] outage analysis
Message:
Dear all,
[SERVICE_NAME] experienced unscheduled downtime between [START] and [END].
[DETAILED_FAILURE_DESCRIPTION]
Best Regards
[SERVICE_TEAM]
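Each Case 2 stage has its own subject line. For anyone scripting their own reminders or logs around these announcements, the mapping could be captured as in the following sketch. The names Stage and subject_for are hypothetical, and the service name is an invented example; only the subject wording is taken from the templates in this manual.

```python
from enum import Enum


class Stage(Enum):
    """Case 2 stages, with the subject suffix used in this manual's templates."""
    OUTAGE = "outage"
    EXTENDED_DOWNTIME = "extended outage"
    RECOVERY = "recovery"
    POST_MORTEM = "outage analysis"  # sent only if a post mortem analysis is needed


def subject_for(stage: Stage, service_name: str) -> str:
    """Build the subject line for a given stage (hypothetical helper)."""
    return f"{service_name} {stage.value}"


if __name__ == "__main__":
    for stage in Stage:
        print(subject_for(stage, "Example Tool"))
```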
Revision History
This is the first release of the manual.