Difference between revisions of "Operations and Operations Support"

From EGIWiki
 
(45 intermediate revisions by 5 users not shown)
Line 5: Line 5:
 
= Introduction  =
 
= Introduction  =
  
The '''COD team''' is a small team responsible for coordinating RODs at the global level. COD represents the whole ROD structure, both in terms of technical requirements for operations tools and at the political level.
+
'''New version on https://wiki.egi.eu/wiki/EGI_Operations_Team'''
  
The purpose of this page is to collect all materials needed by COD team to perform the Grid operations oversight activities.  
+
This page collects internal materials needed by EGI.eu Operations and EGI Operations Support team to perform the EGI Infrastructure operations oversight activities.
  
= People and contact  =
+
'''NOTE''': on April 30th 2016 the EGI Operations Support activity stopped; all its tasks passed to Operations.
  
The COD team is formed from the Dutch and Polish teams and includes COD managers (people responsible for managerial issues) and COD shifters (people performing day-to-day COD work).
+
= Contact  =
  
;'''COD managers:''' 
+
EGI.eu Operations:  
:Ron Trompert (Chair), Marcin Radecki, Luuk Uljee, Tadeusz Szymocha, Magda Szopa
 
;'''COD shifters:''' 
 
:Tadeusz Szymocha, Magda Szopa, Ron Trompert, Luuk Uljee, Maarten van Ingen, Ernst Pijper, Alexander Verkooijen
 
  
<br> [[Grid operations oversight/Photo|People behind the names]]
+
*GGUS Support Unit:Operations
 +
*operations @ egi.eu
  
<br> There are 2 mailing lists used for different cases:
+
= Actions =
 
 
*'''manager-central-operator-on-duty''' AT mailman.egi.eu - for COD managerial issues like suggesting changes in procedures, tools. '''COD managers''' are recipients of this list.
 
*'''central-operator-on-duty''' AT mailman.egi.eu - for reporting COD day-to-day issues like problems with tools or Nagios tests. '''COD shifters''' are recipients of this list.
 
 
 
= COD Duties  =
 
 
 
*COD managers
 
**'''representing RODs/COD in OTAG, OMB and Operations meetings''' - collecting requirements and improvement proposals from RODs concerning operations tools and procedures
 
**'''suspending Resource Centres''' in case of operational issues
 
**'''taking part in OLA task force'''
 
**'''writing new procedures''' - when needed, COD takes part in the procedure creation process
 
**'''preparing ROD newsletters''' - informing RODs about recent and upcoming developments related to Grid Oversight
 
**'''preparing ROD metrics reports''' - providing an overview of operations support process in grid infrastructure.
 
*COD shifters
 
**'''escalation of operational problems with RODs'''
 
**'''dealing with GGUS tickets assigned to COD'''
 
**'''process coordination''' of:
 
***creation and decommission of Operations Centre
 
***setting a Nagios test to an operations test
 
***getting explanations for low availability and reliability metrics
 
 
 
= COD shifters work instructions =
 
  
 
This section collects all work instructions, each specifying exactly what steps to follow to carry out an activity.
 
This section collects all work instructions, each specifying exactly what steps to follow to carry out an activity.
Line 49: Line 25:
 
|-
 
|-
 
! Action  
 
! Action  
! Description
+
! Responsible<br>
! Related procedures
+
! Procedure
 +
! Instructions and related pages<br>
 
|-
 
|-
| '''GGUS tickets assigned to COD'''  
+
| '''ROD certification'''  
 +
| OS<br>
 
|  
 
|  
COD shifter is obliged to check the current status of all '''GGUS tickets assigned to COD'''
+
*[https://wiki.egi.eu/wiki/PROC02 Operations Centre Creation]
  
*see [http://tinyurl.com/2ws735h Link to all GGUS tickets assigned to COD]  
+
|
*If the ticket is waiting for COD action then he/she should perform the action
+
*[[WI01 ROD certification ticket handling|WI01 - New ROD team certification work instructions]]
  
<br> In case of a request for:  
+
|-
 +
| '''Creation of a new NGI'''
 +
| OS<br>  
 +
|
 +
*[https://wiki.egi.eu/wiki/PROC02 Operations Centre Creation]
  
*'''ROD certification'''
+
|
**see [[Grid operations oversight/WI01|New ROD team certification work instructions]]
+
*[[WI02 Operations centre creation|WI02 - New Operations Centre creation work instruction]]
*'''Creation of a new NGI'''
 
**see [[PROC02|Creation of a new Operations Centre process coordination]]
 
**see [[Grid operations oversight/WI02|work instruction]]  
 
**In cases where COD is also the Integration Process Coordinator, COD is responsible for the whole procedure.
 
*'''Operations Centre decommission'''
 
**see [[PROC03|Operations Centre decommission process coordination]]
 
**COD validates the request and removes ROD information from all-operators mailing list
 
*'''Setting a Nagios test to an operations test'''
 
**see [[PROC06|Procedure for setting a Nagios test to an operations test]]
 
**COD is responsible for coordinating the whole process.
 
  
If the shifter doesn't know what kind of action should be taken, he/she should contact COD managers
+
|-
 +
| '''Monthly operations broadcast'''
 +
| OS
 +
|
 +
|
 +
*[[WI04_Monthly_broadcast| WI04 - Monthly Operations broadcast]]
  
 +
|-
 +
| '''Operations Centre decommission'''
 +
| O<br>
 
|  
 
|  
*[[PROC02|Creation of a new Operations Centre process coordination]]
+
*[https://wiki.egi.eu/wiki/PROC03 Operations Centre decommissioning]
*[[PROC03|Operations Centre decommission process coordination]]
 
*[[PROC06|Procedure for setting a Nagios test to an operations test]]
 
  
<br>  
+
| <br>
 +
|-
 +
| '''Setting a Nagios test to an operations test'''
 +
| O<br>
 +
|
 +
*[https://wiki.egi.eu/wiki/PROC06 Setting a Nagios test status to OPERATIONS]
  
 +
| <br>
 
|-
 
|-
 
| '''Operational portal dashboard issues'''  
 
| '''Operational portal dashboard issues'''  
 +
| O<br>
 
|  
 
|  
*[https://operations-portal.egi.eu/dashboard/ccodView COD dashboard link]
+
*[https://wiki.egi.eu/wiki/PROC01 EGI Infrastructure Oversight Escalation]
  
 
|  
 
|  
*[[PROC01|COD escalation procedure]]
+
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
  
 
|-
 
|-
| '''Handover'''  
+
| '''Availability/reliability followup procedure'''  
 +
| O<br>
 
|  
 
|  
*[https://operations-portal.egi.eu/dashboard/ccodView COD dashboard link]
+
*[https://wiki.egi.eu/wiki/PROC04 Quality verification of monthly availability and reliability statistics]<br>
*At the end of the shift a handover should be submitted (sent to COD) via the Handover tool in the Operational Portal
 
**Problems on the dashboard which will pass to next week: the ggus id of the ticket and when next escalation step should be taken
 
**GGUS tickets assigned to COD: for each ticket its last status and the action taken by the shifter should be provided
 
**Other issues: problems with tools etc.
 
  
| <br>
 
|-
 
| '''Availability/reliability followup procedure'''
 
 
|  
 
|  
*[[Grid operations oversight/WI03|WI03 - Availability and reliability report work instruction]]  
+
*[https://wiki.egi.eu/wiki/PROC10 Recomputation of monitoring results and availability statistics]
 +
*[[WI03 RC and RP OLA violation report followup|WI03 RC and RP OLA violation report followup]]  
 
*[[Underperforming sites and suspensions|Underperforming sites and suspensions]]
 
*[[Underperforming sites and suspensions|Underperforming sites and suspensions]]
 
|
 
*[[Availability and reliability monthly statistics|Availability and reliability monthly statistics procedure]]
 
  
 
|-
 
|-
 
| '''Unknown followup procedure'''  
 
| '''Unknown followup procedure'''  
 +
| O<br>
 
|  
 
|  
*[[WI08 Unknown report followup|WI08 - Unknown report work instruction]]
+
*[https://wiki.egi.eu/wiki/PROC04 Quality verification of monthly availability and reliability statistics]
*[[Grid operations oversight/Unknown issue|UNKNOWN issue ]]
 
  
 
|  
 
|  
*[[Grid operations oversight/WI05|WI05 - Escalation procedure in case of unresponsive NGI]]
+
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
 +
*[[Unknown issue|UNKNOWN issue]]
 +
*[[WI03 RC and RP OLA violation report followup|WI03 RC and RP OLA violation report followup]]
  
 
|-
 
|-
 
| '''Top-level BDII followup procedure'''  
 
| '''Top-level BDII followup procedure'''  
 +
| O<br>
 
|  
 
|  
*[[Grid operations oversight/WI04|WI04 - Core services report work instruction ]]
+
*[https://wiki.egi.eu/wiki/PROC04 Quality verification of monthly availability and reliability statistics]
  
 
|  
 
|  
*[[Grid operations oversight/WI05|WI05 - Escalation procedure in case of unresponsive NGI]]
+
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
 +
*[[WI03 RC and RP OLA violation report followup|WI03 RC and RP OLA violation report followup]]
  
 
|-
 
|-
 
| '''ROD performance index followup procedure'''  
 
| '''ROD performance index followup procedure'''  
 +
| O<br>
 +
| <br>
 
|  
 
|  
*[[Grid operations oversight/WI07|WI07 - ROD Performance Index report work instruction]]  
+
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[Grid operations oversight/ROD performance index|ROD performance index]]
+
*[[WI03 RC and RP OLA violation report followup|WI03 RC and RP OLA violation report followup]]  
 
+
*[[ROD performance index|ROD performance index]]
|
 
*[[Grid operations oversight/WI05|WI05 - Escalation procedure in case of unresponsive NGI]]
 
  
 
|}
 
|}
Line 140: Line 121:
 
== ''Work Instructions''<br>  ==
 
== ''Work Instructions''<br>  ==
  
*[[WI01_ROD_certification_ticket_handling|WI01 - New ROD team certification work instructions]]  
+
*[[WI01 ROD certification ticket handling|WI01 - New ROD team certification work instructions]]  
*[[Grid operations oversight/WI02|WI02 - New Operations Centre creation work instruction]] 
+
*[[WI02 Operations centre creation|WI02 - New Operations Centre creation work instruction]] 
*[[Grid operations oversight/WI03|WI03 - Availability and reliability report work instruction]]  
+
*[[WI03 RC and RP OLA violation report followup|WI03 - RC and RP OLA violation report followup]]  
*[[Grid operations oversight/WI04|WI04 - Core services report work instruction ]]  
+
*[[WI04 Monthly broadcast|WI04 - Monthly Operations broadcast]]  
*[[Grid operations oversight/WI05|WI05 - Escalation procedure in case of unresponsive NGI]]  
+
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
*[[Grid operations oversight/WI06|WI06 - Tickets &gt; 30 days]]
+
*[[WI06_Core_services_process| Core services process]]
*[[Grid operations oversight/WI07|WI07 - ROD Performance Index report work instruction]]
 
*[[Grid operations oversight/WI08|WI08 - Unknown report work instruction]]
 
  
= Events =
+
== Pages listing NGIs<br> ==
  
*[https://www.egi.eu/indico/categoryDisplay.py?categId=11 EGI indico page] with COD meeting agendas.  
+
For EGI&nbsp;Operations:&nbsp;to be updated when an OC&nbsp;is created or decommissioned
*All open actions can be found from [[Grid operations oversight/CODOD actions|COD actions]]
+
 
 +
*[https://wiki.egi.eu/wiki/GOCDB_grouping_action https://wiki.egi.eu/wiki/GOCDB_grouping_action ]<br>
 +
*[https://wiki.egi.eu/wiki/Operations_centres https://wiki.egi.eu/wiki/Operations_centres] <br>
 +
*https://wiki.egi.eu/wiki/Top-BDII_list_for_NGI <br>
 +
*https://goc.egi.eu/portal/index.php?Page_Type=Service_Group&amp;id=1205<br>
 +
*https://goc.egi.eu/portal/index.php?Page_Type=Service_Group&amp;id=1206
 +
*https://goc.egi.eu/portal/index.php?Page_Type=Service_Group&amp;id=1184
 +
*https://docs.google.com/a/egi.eu/spreadsheets/d/1Zsk3ykVllc5GzNG2Hhref7wzTvz_rSKcckV8nnWWZIs/edit#gid=163292516
 +
*folder "08 - sites-history Q"
 +
 
 +
<br>
 +
 
 +
<br>
 +
 
 +
<br>
  
 
= Resources  =
 
= Resources  =
  
*[https://documents.egi.eu/secure/ShowDocument?docid=298 Document server: ROD newsletter]
 
*[https://documents.egi.eu/secure/ShowDocument?docid=155 Document server: Operations Support Metrics]
 
 
*[[Operations Procedures|Operations Procedures]]  
 
*[[Operations Procedures|Operations Procedures]]  
*[http://www.youtube.com/user/EGIGridOversight Youtube channel]
+
*[http://www.youtube.com/user/EGIGridOversight Youtube channel]
*[https://operations-portal.in2p3.fr/dashboard/regionalPreferences Mailing lists for each ROD]
 
  
 
<!--
 
<!--
Line 166: Line 156:
  
 
*[[Grid operations oversight/OperationsSupportMetrics summary|Operations Support Metrics - reports summary]]-->  
 
*[[Grid operations oversight/OperationsSupportMetrics summary|Operations Support Metrics - reports summary]]-->  
 
=== Oct 2011 to date  ===
 
 
*Please provide a link here
 
 
<br>
 
  
 
<br>  
 
<br>  
  
Definition of [[Grid operations oversight/OperationsSupportMetrics|Operations Support metrics]]
+
[[Category:Infrastructure_Oversight]]
 
 
=== May 2010-Sep 2011  ===
 
 
 
*Operations Support [https://documents.egi.eu/document/155 metrics]
 
 
 
=== Until April 2010  ===
 
 
 
*EGEE-III Operations Support [https://documents.egi.eu/document/829 metrics]
 
 
 
== Nagios tests  ==
 
 
 
*[[Operations SAM tests|Operations tests list ]]: list of Nagios probes generating alarms for visualization in the Operations Dashboard
 
*[http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=ROC_CRITICAL Availability and reliability tests list]: list of Nagios probes whose results are used for Availability and Reliability computation
 
 
 
== OTAG topics  ==
 
 
 
=== Operational Portal: Dashboard  ===
 
 
 
*[http://bit.ly/dZ3RWN RT tickets]
 
*[[Grid operations oversight/COD interaction with Dashboard team|COD interactions with Dashboard team (draft)]]
 
*[[Grid operations oversight/COD OTAG topics|COD topics to be discussed on OTAG meeting]]
 
 
 
=== GOC DB  ===
 
 
 
*[[Grid operations oversight/COD GOCDB requirements|Collection of GOC DB requirements regarding COD work (draft)]]
 
 
 
== Pages in draft state  ==
 
 
 
*[[Grid operations oversight/COD Improvements to availability procedure|Improvements to Availability Calculation Procedure (draft)]]
 
*[[Grid operations oversight/A/R fixing procedure|A/R fixing procedure (draft)]][[Grid operations oversight/ROD FAQ|<br>]]
 
*[[Grid operations oversight/CandidateSuspendedSitesList|Candidate Suspended Sites List]]
 
 
 
[[Category:Grid_Oversight]]
 

Latest revision as of 17:17, 28 July 2016





Introduction

New version on https://wiki.egi.eu/wiki/EGI_Operations_Team

This page collects internal materials needed by EGI.eu Operations and EGI Operations Support team to perform the EGI Infrastructure operations oversight activities.

NOTE: on April 30th 2016 the EGI Operations Support activity stopped; all its tasks passed to Operations.

Contact

EGI.eu Operations:

  • GGUS Support Unit:Operations
  • operations @ egi.eu

Actions

This section collects all work instructions, each specifying exactly what steps to follow to carry out an activity.

Action | Responsible | Procedure | Instructions and related pages
ROD certification | OS | |
Creation of a new NGI | OS | |
Monthly operations broadcast | OS | |
Operations Centre decommission | O | |
Setting a Nagios test to an operations test | O | |
Operational portal dashboard issues | O | |
Availability/reliability followup procedure | O | |
Unknown followup procedure | O | |
Top-level BDII followup procedure | O | |
ROD performance index followup procedure | O | |

Work Instructions

Pages listing NGIs

For EGI Operations: to be updated when an OC is created or decommissioned




Resources