Difference between revisions of "ROD Duties"

From EGIWiki
(Created by moving material from Operations/ROD/Draft)
 
 
(19 intermediate revisions by 4 users not shown)
Line 1: Line 1:
All duties listed in this section are mandatory for ROD team. In the case of no explicit 1<sup>st</sup> Line Support team in the NGI, duties of that team must be absorbed by the ROD.
+
{{Template:Op menubar}} {{Template:GO menubar}} {{TOC_right}}
  
A new ROD member needs to follow the procedures in the [[Operations/General/Joining_operations |procedure for joining ROD teams ]].
+
= ROD DUTIES  =
  
== Handling tickets ==
+
:A ROD team's duties can be split into three main areas: handling alarms and tickets, handling downtimes, and communicating urgent issues to the EGI&nbsp;Operations and CSIRT teams.
  
The main responsibility of ROD is to deal with tickets for sites in the region. This includes making sure that the tickets are opened and handled properly. The procedure for handling tickets is described in section [[Operations/ROD#Handling_tickets|Handling tickets]].
+
==== Handling alarms and tickets ====
  
== Putting a site in downtime for urgent matters ==
+
The main responsibility of ROD is to deal with alarms and tickets issued for sites in the region. This includes making sure that the tickets are created and handled properly.
  
In general, ROD can place a site or a service endpoint (i.e., a host) in downtime (in the GOCDB) if it is either requested by the site, or ROD sees an urgent need to put the site into downtime.
+
The ROD on duty is required to:
  
ROD may also suspend a site, under exceptional circumstances, without going through all the steps of the escalation procedure. For example, if a security hazard occurs, ROD must suspend the site on the spot. Note that COD can also suspend a site in an emergency, e.g. after a security incident or lack of response.
+
*check alarm notifications in the Dashboard at least twice a day;
 +
*close alarms which are in the OK state;
 +
*handle non-OK alarms less than 24 hours old (notify the site administrators according to your NGI's procedures);
 +
*create tickets for alarms older than 24 hours that are not in an OK state;
 +
*escalate tickets to ''NGI ''Management/EGI Operations if necessary (in the Dashboard);
 +
*monitor and update any GGUS tickets up to the ''solved'' status (preferably via the Dashboard);
 +
*handle the final state of GGUS tickets not opened from the Dashboard by changing their status to ''verified''.
  
In both scenarios, it is important that communication channels between all parties involved are active.
+
==== Putting a site in downtime for urgent matters  ====
  
== Notify COD and EGI CSIRT about urgent matters ==
+
ROD can place a site or a service endpoint (there can be multiple services running on a single host) in downtime in the GOCDB if it is either requested by the site, or if ROD sees an urgent need to do it. ''Note: This is actually optional; an NGI may decide on a different policy if the site admins are not happy with ROD setting downtimes for them. However, it should be considered mandatory in case of urgent security incidents.''
  
ROD should create tickets to COD in the case of urgent matters. For security-related issues, 1<sup>st</sup> Line Support and/or ROD should also notify the [[EGI_CSIRT:Main_Page|CSIRT]] duty contact.
+
ROD may also suspend a site, under exceptional circumstances, without going through all the steps of the escalation procedure. For example, if a security hazard occurs, ROD must suspend the site on the spot. Note that EGI&nbsp;Operations can also suspend a site in an emergency, for example as a result of a security incident or lack of response.
  
== Summary of ROD duties ==
+
In both scenarios, it is important that ROD communicates their actions to all involved parties.
  
{| border="1"
+
==== Notifying EGI Operations and EGI CSIRT about urgent matters ====
|-
 
| '''Duties of ROD'''
 
| '''Requirements'''
 
|-
 
| Receive incident notification from sites in the scope
 
| Mandatory (if not handled by 1<sup>st</sup> Line Support)
 
|-
 
| Handle incidents less than 24h old
 
| Mandatory (if not handled by 1<sup>st</sup> Line Support)
 
|-
 
| Create tickets for alarms older than 24h that are not in an OK state
 
| Mandatory
 
|-
 
| Escalate tickets to COD if necessary: assignment to COD can be made directly through the dashboard.
 
| Mandatory
 
|-
 
| Propagate actions from COD down to sites
 
| Mandatory
 
|-
 
| Monitor and update any GGUS tickets up to the “solved” status (via the Dashboard)
 
| Mandatory
 
|-
 
| Close alarms for “solved problems”
 
| Mandatory
 
|-
 
| Handle the final state of GGUS tickets not opened from the operations portal by marking them as verified.
 
| Mandatory
 
|-
 
| Put the site in downtime for urgent matters  
 
| Optional
 
|-
 
| Create tickets to COD for urgent matters
 
| Mandatory
 
|}
 
  
(Definitions in the “Requirements” column: ''Mandatory'' – must be covered by either 1<sup>st</sup> Line Support or the ROD team, ''Optional'' – the federation decides how to implement this.)
+
ROD should create tickets to EGI Operations in the case of urgent matters. For security-related issues, ROD should also notify the [[EGI CSIRT:Main Page|CSIRT]] duty contact.
 +
 
 +
ROD is also responsible for propagating actions from EGI Operations / Operations Support down to sites (this occurs rather infrequently, though).  
 +
 
 +
[[Category:Infrastructure_Oversight]]

Latest revision as of 14:58, 23 October 2014




ROD DUTIES

A ROD team's duties can be split into three main areas: handling alarms and tickets, handling downtimes, and communicating urgent issues to the EGI Operations and CSIRT teams.

Handling alarms and tickets

The main responsibility of ROD is to deal with alarms and tickets issued for sites in the region. This includes making sure that the tickets are created and handled properly.

The ROD on duty is required to:

  • check alarm notifications in the Dashboard at least twice a day;
  • close alarms which are in the OK state;
  • handle non-OK alarms less than 24 hours old (notify the site administrators according to your NGI's procedures);
  • create tickets for alarms older than 24 hours that are not in an OK state;
  • escalate tickets to NGI Management/EGI Operations if necessary (in the Dashboard);
  • monitor and update any GGUS tickets up to the solved status (preferably via the Dashboard);
  • handle the final state of GGUS tickets not opened from the Dashboard by changing their status to verified.
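
The alarm-related items in the list above amount to a small decision rule. The following is a minimal sketch of that rule in Python, not an interface of the Dashboard or GGUS: the `Alarm` record, its field names, and the action labels returned by `triage` are all hypothetical and chosen only for illustration. The list does not say what to do with an alarm that is exactly 24 hours old; the sketch assumes it still counts as "less than 24 hours old".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Threshold from the duty list: non-OK alarms older than 24 hours need a ticket.
TICKET_THRESHOLD = timedelta(hours=24)


@dataclass
class Alarm:
    """Hypothetical alarm record; field names are illustrative only."""
    site: str
    status: str          # "OK" or any non-OK state, e.g. "CRITICAL"
    raised_at: datetime  # time the alarm first appeared


def triage(alarm: Alarm, now: datetime) -> str:
    """Return an (illustrative) action label for one alarm."""
    if alarm.status == "OK":
        # Alarms in the OK state are simply closed.
        return "close"
    if now - alarm.raised_at > TICKET_THRESHOLD:
        # Non-OK and older than 24 hours: a ticket must be created.
        return "create_ticket"
    # Non-OK and not yet past the threshold (assumption: exactly 24 h
    # falls here): notify the site admins per the NGI's procedures.
    return "notify_site"
```

A ROD shift could apply `triage` to every open alarm at each of the two daily checks; escalation and ticket follow-up remain manual judgement calls and are not modelled here.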

Putting a site in downtime for urgent matters

ROD can place a site or a service endpoint (there can be multiple services running on a single host) in downtime in the GOCDB if it is either requested by the site, or if ROD sees an urgent need to do it. Note: This is actually optional; an NGI may decide on a different policy if the site admins are not happy with ROD setting downtimes for them. However, it should be considered mandatory in case of urgent security incidents.

ROD may also suspend a site, under exceptional circumstances, without going through all the steps of the escalation procedure. For example, if a security hazard occurs, ROD must suspend the site on the spot. Note that EGI Operations can also suspend a site in an emergency, for example as a result of a security incident or lack of response.

In both scenarios, it is important that ROD communicates their actions to all involved parties.

Notifying EGI Operations and EGI CSIRT about urgent matters

ROD should create tickets to EGI Operations in the case of urgent matters. For security-related issues, ROD should also notify the CSIRT duty contact.

ROD is also responsible for propagating actions from EGI Operations / Operations Support down to sites (this occurs rather infrequently, though).