
Difference between revisions of "MAN04 Tool Intervention Management"

Revision as of 11:32, 31 May 2011



Title: Management of central operational tool unscheduled downtimes
Document link: https://wiki.egi.eu/wiki/MAN03_Tool_Intervention_Management
Last review: Tferrari 13:55, 8 March 2011 (UTC)
Policy Group Acronym: OMB
Policy Group Name: Operations Management Board
Contact Person: E. Imamagic
Document Status: draft
Approved Date: specify
Procedure Statement: This manual provides information on how to manage central operational tool unscheduled downtimes.

Management of central operational tool unscheduled downtimes

This document describes the intervention procedure in case of an unscheduled failure of a central operational tool.

Scope

This manual applies only to unscheduled downtimes of central operational tools. The list of central operational tools is here.

Scheduled downtimes are managed according to existing procedures.

Announcements

All announcements should be sent with the Operations Portal Broadcast tool.

When using the Operations Portal Broadcast tool, the following groups should be included:

  • LCG Rollout Mailing List
  • Operators Mailing lists
  • OSG Mailing list
  • Tool Admins Mailing List
  • WLCG Tier 1 contacts
  • NGI managers
  • VO managers
  • VO users
  • Site administrators
  • Operation tools

Notice: Individual notification templates, together with their targets, are predefined in the Operations Portal Broadcast tool. Administrators are advised to use the predefined templates.
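
As a minimal sketch of how the group list above could be checked before an announcement is sent, the following Python snippet compares a selected set of recipients against the required groups. The names `REQUIRED_GROUPS` and `missing_groups` are hypothetical and are not part of the Operations Portal; the predefined templates already carry their targets, so this only illustrates the check.

```python
# Hypothetical helper; the group names are copied from the list above.
REQUIRED_GROUPS = (
    "LCG Rollout Mailing List",
    "Operators Mailing lists",
    "OSG Mailing list",
    "Tool Admins Mailing List",
    "WLCG Tier 1 contacts",
    "NGI managers",
    "VO managers",
    "VO users",
    "Site administrators",
    "Operation tools",
)


def missing_groups(selected):
    """Return the required groups absent from `selected`, in list order."""
    chosen = set(selected)
    return [group for group in REQUIRED_GROUPS if group not in chosen]
```

An empty result means every required group is covered; anything else names the groups still to be added.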

Procedure

The following sections cover the relevant scenarios.

Case 1: short "undetected" downtime

Description: The service fails and recovers before the administrator manages to react (e.g. a short power or network outage).

Action: The administrator announces the failure using the following template:

Subject:
 [SERVICE_NAME] unscheduled downtime

Message:
 Dear all,
 
 [SERVICE_NAME] experienced unscheduled downtime between [START] and [END].

 [DETAILED_FAILURE_DESCRIPTION]
 
 Best Regards
 [SERVICE_TEAM]
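
The bracketed placeholders in the template above can be filled in mechanically. The sketch below is one possible way to do so in Python; the function `fill` and the constant names are hypothetical, and the actual Broadcast tool may handle templates differently.

```python
import re

# Template text copied from Case 1; placeholders are the bracketed names.
SUBJECT = "[SERVICE_NAME] unscheduled downtime"
MESSAGE = """Dear all,

[SERVICE_NAME] experienced unscheduled downtime between [START] and [END].

[DETAILED_FAILURE_DESCRIPTION]

Best Regards
[SERVICE_TEAM]"""

# Matches placeholders such as [SERVICE_NAME] and captures the name.
PLACEHOLDER = re.compile(r"\[([A-Z_]+)\]")


def fill(template, values):
    """Replace each [NAME] with values["NAME"]; raise KeyError if one is missing."""
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], template)
```

Raising `KeyError` on a missing value is a deliberate choice in this sketch: it prevents an announcement from going out with an unfilled placeholder.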

Case 2: long "detected" outage

1. Outage

Description: The service fails and the administrator detects the problem. Recovery takes a long time (TODO: longer than 1h?).

Actions:

  1. The administrator should announce the failure using the Operations Portal Broadcast tool (see notification template 1).

2. Extended downtime

Description: Service recovery is delayed.

Actions:

  1. The administrator should announce that recovery is taking longer using the Operations Portal Broadcast tool (see notification template 2).

3. Recovery without post mortem analysis

Description: The service has recovered and all details about the failure are known.

Actions:

  1. At the time of recovery, the administrator should announce the recovery using the Operations Portal Broadcast tool (see notification template 3).
  2. The administrator should announce the detailed description of the failure using the Operations Portal Broadcast tool (see notification template 4).

4. Recovery with post mortem analysis

Description: The service has recovered and further time is needed to investigate the failure.

Actions:

  1. At the time of recovery, the administrator should announce the recovery using the Operations Portal Broadcast tool (see notification template 3).
  2. The administrator should announce the post mortem analysis of the failure using the Operations Portal Broadcast tool (see notification template 4).
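
The four Case 2 steps each point to one or two notification templates. A minimal sketch of that mapping in Python follows; the stage labels and the function `templates_for` are hypothetical names introduced for illustration, while the template numbers are the ones used in this manual.

```python
# Stage names are hypothetical labels for the four Case 2 steps;
# the numbers refer to the notification templates in this manual.
TEMPLATES_BY_STAGE = {
    "outage": [1],
    "extended_downtime": [2],
    "recovery_without_post_mortem": [3, 4],
    "recovery_with_post_mortem": [3, 4],
}


def templates_for(stage):
    """Return the notification template numbers to send, in order, for a stage."""
    try:
        return list(TEMPLATES_BY_STAGE[stage])
    except KeyError:
        raise ValueError(f"unknown stage: {stage!r}") from None
```

Both recovery stages use templates 3 and 4; they differ only in when the content for template 4 becomes available.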

Notification templates

When using the Operations Portal Broadcast tool, the following groups should be included:

  • LCG Rollout Mailing List
  • Operators Mailing lists
  • OSG Mailing list
  • Tool Admins Mailing List
  • WLCG Tier 1 contacts
  • NGI managers
  • VO managers
  • VO users
  • Site administrators
  • Operation tools

Individual notification templates, together with their targets, are predefined in the Operations Portal Broadcast tool.

1. Service failure notification

Subject:

[SERVICE_NAME] outage

Message:

Dear all,

[SERVICE_NAME] is experiencing unscheduled downtime.

[ADDITIONAL_INFO]

Apologies for any inconvenience caused.

Best Regards
[SERVICE_TEAM]


2. Extended service failure notification

Subject:

[SERVICE_NAME] extended outage

Message:

Dear all,

Outage of [SERVICE_NAME] is extended. 
[ADDITIONAL_INFO]

Apologies for any inconvenience caused.

Best Regards
[SERVICE_TEAM]

In this template, [ADDITIONAL_INFO] should contain the estimated time of recovery.
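
One way to make sure the estimated time of recovery is never left out is to build the [ADDITIONAL_INFO] text from a timestamp. The Python sketch below assumes the estimate is reported in UTC and uses a hypothetical wording and function name (`extended_outage_info`); neither is prescribed by this manual.

```python
from datetime import datetime, timezone


def extended_outage_info(estimated_recovery, reason=""):
    """Build [ADDITIONAL_INFO] text that always states the estimated recovery time."""
    if estimated_recovery.tzinfo is None:
        # Refuse naive timestamps so the announced time is unambiguous.
        raise ValueError("estimated_recovery must be timezone-aware")
    # Assumption: announcements quote times in UTC.
    eta = estimated_recovery.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    text = f"Estimated time of recovery: {eta}."
    return f"{reason} {text}" if reason else text
```

The optional `reason` is placed before the estimate, so a short explanation can precede the recovery time in the message.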

3. Service recovery notification without detailed information

Subject:

[SERVICE_NAME] recovery

Message:

Dear all,

[SERVICE_NAME] is back online. Additional information on the failure will be provided soon.

Best Regards
[SERVICE_TEAM]

4. Post mortem analysis

Subject:

[SERVICE_NAME] unscheduled downtime

Message:

Dear all,

[SERVICE_NAME] experienced unscheduled downtime between [START] and [END].
[DETAILED_FAILURE_DESCRIPTION]

Best Regards
[SERVICE_TEAM]

Revision History