
GGUS:MAN04

From EGIWiki
Revision as of 08:55, 4 August 2011 by Wbuehler (talk | contribs) (based on https://wiki.egi.eu/w/index.php?title=MAN04&oldid=22790)




GGUS – Tool Intervention Management


Title: GGUS Intervention Management
Document link: Based on MAN04
Last Update: 03 August 2011

This document describes the intervention procedure to follow in case of an unscheduled failure of the central operational tool GGUS.

Scope

This manual applies only to unscheduled downtimes of GGUS; for all other cases, see MAN04.

Announcements

All announcements should be sent with the Operations Portal Broadcast tool.

When using the Operations Portal Broadcast tool, the following groups should be included:

  • LCG Rollout Mailing List
  • Operators Mailing lists
  • OSG Mailing list
  • Tool Admins Mailing List
  • WLCG Tier 1 contacts
  • NGI managers
  • VO managers
  • VO users
  • Site administrators
  • Operation tools

Notice: Individual notification templates, together with their targets, are predefined in the Operations Portal Broadcast tool. Administrators are advised to use these predefined templates.
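As a cross-check before sending, the group list above can be compared against the groups actually selected for a broadcast. The following is a minimal sketch only: the helper `missing_groups` is hypothetical and is not part of the Operations Portal Broadcast tool, and the group names are simply copied from the list above.

```python
# Required target groups, copied from the list in this section.
REQUIRED_GROUPS = frozenset({
    "LCG Rollout Mailing List",
    "Operators Mailing lists",
    "OSG Mailing list",
    "Tool Admins Mailing List",
    "WLCG Tier 1 contacts",
    "NGI managers",
    "VO managers",
    "VO users",
    "Site administrators",
    "Operation tools",
})


def missing_groups(selected):
    """Return the required groups absent from `selected`, sorted by name.

    Hypothetical helper for illustration; not an Operations Portal API.
    """
    return sorted(REQUIRED_GROUPS - set(selected))


# Example: a broadcast selection that forgot two groups.
selected = REQUIRED_GROUPS - {"VO users", "NGI managers"}
print(missing_groups(selected))
```

An empty result means every group listed in this section is covered.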

Procedure

The following sections cover the relevant scenarios.

Case 1: short "undetected" downtime

Description: The service fails and recovers before the administrator can react (e.g. a short power or network outage).

Action: The administrator announces the failure by using the following template:

Subject:
 GGUS unscheduled downtime

Message:
 Dear all,
 
 GGUS experienced unscheduled downtime between [START] and [END].

 [DETAILED_FAILURE_DESCRIPTION]

 Apologies for any inconvenience caused.
 
 Best Regards, 
 the GGUS Team
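The bracketed placeholders in the template can be filled in mechanically. The sketch below assumes the announcement text is prepared outside the Broadcast tool; `fill_template` is a hypothetical helper, the timestamp format is an assumption, and the dates and failure description are made-up example values.

```python
import re
from datetime import datetime

# Template text copied from Case 1; placeholders keep their [NAME] form.
SUBJECT = "GGUS unscheduled downtime"
MESSAGE = """Dear all,

GGUS experienced unscheduled downtime between [START] and [END].

[DETAILED_FAILURE_DESCRIPTION]

Apologies for any inconvenience caused.

Best Regards,
the GGUS Team"""


def fill_template(template, values):
    """Replace every [KEY] placeholder in one pass.

    Raises KeyError if the template contains a placeholder with no value,
    so an announcement is never sent with a literal [START] in it.
    """
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value for placeholder [{key}]")
        return values[key]

    return re.sub(r"\[([A-Z_]+)\]", substitute, template)


# Made-up example values; the timestamp format is an assumption.
fmt = "%Y-%m-%d %H:%M UTC"
body = fill_template(MESSAGE, {
    "START": datetime(2011, 8, 3, 9, 15).strftime(fmt),
    "END": datetime(2011, 8, 3, 9, 40).strftime(fmt),
    "DETAILED_FAILURE_DESCRIPTION":
        "A short network outage interrupted access to the service.",
})
print(SUBJECT)
print(body)
```

The same helper works for the other templates in this manual, since they all use the same bracketed placeholder style.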

Case 2: long "detected" outage

The service fails and the administrator detects the problem. Recovery takes at least 1 hour. The individual situations are described in the sections below.

1. Outage

Description: The service failure is detected.

Action: The administrator announces the failure by using the following template:

Subject:
 GGUS outage

Message:
 Dear all,
 
 GGUS is experiencing unscheduled downtime.
 
 [ADDITIONAL_INFO]
 
 Apologies for any inconvenience caused.
 
 Best Regards, 
 the GGUS Team

2. Extended downtime

Description: Service recovery is delayed. An update should be sent at least every 24 hours.

Action: The administrator announces that recovery is taking longer by using the following template:

Subject:
 GGUS extended outage

Message:
 Dear all,
 
 Outage of GGUS is extended. 

 [ADDITIONAL_INFO]
 
 Apologies for any inconvenience caused.
 
 Best Regards, 
 the GGUS Team

Note: In this template [ADDITIONAL_INFO] should indicate the estimated time of recovery.
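The 24-hour update rule for extended downtimes can be tracked with a small deadline calculation. This is an illustrative sketch only; the function names are hypothetical, and the timestamps in the example are invented.

```python
from datetime import datetime, timedelta

# The procedure asks for an update at least every 24h during an extended outage.
MAX_UPDATE_INTERVAL = timedelta(hours=24)


def next_update_deadline(last_announcement):
    """Latest time by which the next 'GGUS extended outage' update is due."""
    return last_announcement + MAX_UPDATE_INTERVAL


def update_overdue(last_announcement, now):
    """True if more than 24h have passed since the last announcement."""
    return now > next_update_deadline(last_announcement)


# Invented example: the last announcement went out on 3 August at 10:00.
last = datetime(2011, 8, 3, 10, 0)
print(next_update_deadline(last))
print(update_overdue(last, datetime(2011, 8, 4, 9, 0)))
print(update_overdue(last, datetime(2011, 8, 4, 11, 30)))
```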

3. Recovery

Description: The service has recovered.

Action: Once the service has recovered, the administrator announces the recovery by using the following template:

Subject:
 GGUS recovery

Message:
 Dear all,
 
 GGUS is back online. 

 [ADDITIONAL_INFO]
 
 Best Regards, 
 the GGUS Team

4. Post mortem analysis

Description: Further time was required to investigate the cause of the service failure. This step is required only if a post mortem analysis is needed.

Action: The administrator announces the post mortem analysis of the failure by using the following notification template:

Subject:
 GGUS outage analysis

Message:
 Dear all,
 
 GGUS experienced unscheduled downtime between [START] and [END].

 [DETAILED_FAILURE_DESCRIPTION]
 
 Best Regards, 
 the GGUS Team
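The announcement types and their subject lines from both cases can be summarised in one lookup. The sketch below is an illustration, not part of any GGUS or Operations Portal tooling; the `Announcement` enum and the `case2_sequence` helper are hypothetical names, while the subject strings are copied from the templates in this manual.

```python
from enum import Enum


class Announcement(Enum):
    """Announcement types from this manual, with their subject lines."""
    SHORT_DOWNTIME = "GGUS unscheduled downtime"   # Case 1
    OUTAGE = "GGUS outage"                         # Case 2, step 1
    EXTENDED_OUTAGE = "GGUS extended outage"       # Case 2, step 2
    RECOVERY = "GGUS recovery"                     # Case 2, step 3
    POST_MORTEM = "GGUS outage analysis"           # Case 2, step 4


def case2_sequence(extended_updates=0, post_mortem=False):
    """Ordered announcements for a detected outage (Case 2).

    `extended_updates` is the number of extended-outage updates sent;
    `post_mortem` says whether a post mortem analysis was needed.
    """
    steps = [Announcement.OUTAGE]
    steps += [Announcement.EXTENDED_OUTAGE] * extended_updates
    steps.append(Announcement.RECOVERY)
    if post_mortem:
        steps.append(Announcement.POST_MORTEM)
    return steps


# Example: an outage with one extended update and a post mortem analysis.
for step in case2_sequence(extended_updates=1, post_mortem=True):
    print(step.value)
```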