Difference between revisions of "Operations and Operations Support"

From EGIWiki
Line 18: Line 18:
 
:Tadeusz Szymocha, Magda Szopa, Ron Trompert, Luuk Uljee, Maarten van Ingen, Ernst Pijper, Alexander Verkooijen
 
:Tadeusz Szymocha, Magda Szopa, Ron Trompert, Luuk Uljee, Maarten van Ingen, Ernst Pijper, Alexander Verkooijen
  
<br> [[Grid operations oversight/Photo|People behind the names]]  
+
<br> [[COD_Photo|People behind the names]]  
  
 
<br> There are 2 mailing lists used for different cases:  
 
<br> There are 2 mailing lists used for different cases:  
Line 62: Line 62:
  
 
*'''ROD certification'''  
 
*'''ROD certification'''  
**see [[Grid operations oversight/WI01|New ROD team certification work instructions]]  
+
**see [[WI01_ROD_certification_ticket_handling|New ROD team certification work instructions]]  
 
*'''Creation of a new NGI'''  
 
*'''Creation of a new NGI'''  
 
**see [[PROC02|Creation of a new Operations Centre process coordination]]  
 
**see [[PROC02|Creation of a new Operations Centre process coordination]]  
**see [[Grid operations oversight/WI02|work instruction]]  
+
**see [[WI02_Operations_centre_creation|work instruction]]  
 
**In case where COD is also the Integration Process Coordinator, COD is responsible for the whole procedure.  
 
**In case where COD is also the Integration Process Coordinator, COD is responsible for the whole procedure.  
 
*'''Operations Centre decommission'''  
 
*'''Operations Centre decommission'''  
Line 104: Line 104:
 
| '''Availability/reliability followup procedure'''  
 
| '''Availability/reliability followup procedure'''  
 
|  
 
|  
*[[WI03_Availability_and_Reliability_report_followup|WI03 - Availability and reliability report work instruction]]  
+
*[[WI03 Availability and Reliability report followup|WI03 - Availability and reliability report work instruction]]  
 
*[[Underperforming sites and suspensions|Underperforming sites and suspensions]]
 
*[[Underperforming sites and suspensions|Underperforming sites and suspensions]]
  
 
|  
 
|  
*[[Availability and reliability monthly statistics|Availability and reliability monthly statistics procedure]]
+
*[[PROC04|Availability and reliability monthly statistics procedure]]
  
 
|-
 
|-
 
| '''Unknown followup procedure'''  
 
| '''Unknown followup procedure'''  
 
|  
 
|  
*[[WI08_Unknown_report_followup|WI08 - Unknown report work instruction]]  
+
*[[WI08 Unknown report followup|WI08 - Unknown report work instruction]]  
*[[Grid operations oversight/Unknown issue|UNKNOWN issue ]]
+
*[[Unknown_issue|UNKNOWN issue ]]
  
 
|  
 
|  
*[[WI05_Unresponsive_NGI_escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
+
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
  
 
|-
 
|-
 
| '''Top-level BDII followup procedure'''  
 
| '''Top-level BDII followup procedure'''  
 
|  
 
|  
*[[WI04_Core_services_report_followup|WI04 - Core services report work instruction ]]
+
*[[WI04 Core services report followup|WI04 - Core services report work instruction ]]
  
 
|  
 
|  
*[[WI05_Unresponsive_NGI_escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
+
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
  
 
|-
 
|-
 
| '''ROD performance index followup procedure'''  
 
| '''ROD performance index followup procedure'''  
 
|  
 
|  
*[[WI07_ROD_performance_index_report_follwup|WI07 - ROD Performance Index report work instruction]]  
+
*[[WI07 ROD performance index report follwup|WI07 - ROD Performance Index report work instruction]]  
*[[Grid operations oversight/ROD performance index|ROD performance index]]
+
*[[ROD_performance_index|ROD performance index]]
  
 
|  
 
|  
*[[WI05_Unresponsive_NGI_escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
+
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]
  
 
|}
 
|}
Line 141: Line 141:
  
 
*[[WI01 ROD certification ticket handling|WI01 - New ROD team certification work instructions]]  
 
*[[WI01 ROD certification ticket handling|WI01 - New ROD team certification work instructions]]  
*[[WI02_Operations_centre_creation|WI02 - New Operations Centre creation work instruction]]  
+
*[[WI02 Operations centre creation|WI02 - New Operations Centre creation work instruction]]  
*[[WI03_Availability_and_Reliability_report_followup|WI03 - Availability and reliability report work instruction]]  
+
*[[WI03 Availability and Reliability report followup|WI03 - Availability and reliability report work instruction]]  
*[[WI04_Core_services_report_followup|WI04 - Core services report work instruction ]]  
+
*[[WI04 Core services report followup|WI04 - Core services report work instruction ]]  
*[[WI05_Unresponsive_NGI_escalation|WI05 - Escalation procedure in case of unresponsive NGI]]  
+
*[[WI05 Unresponsive NGI escalation|WI05 - Escalation procedure in case of unresponsive NGI]]  
*[[WI06_Tickets_older_than_30_days|WI06 - Tickets &gt; 30 days]]  
+
*[[WI06 Tickets older than 30 days|WI06 - Tickets &gt; 30 days]]  
*[[WI07_ROD_performance_index_report_follwup|WI07 - ROD Performance Index report work instruction]]  
+
*[[WI07 ROD performance index report follwup|WI07 - ROD Performance Index report work instruction]]  
*[[WI08_Unknown_report_followup|WI08 - Unknown report work instruction]]
+
*[[WI08 Unknown report followup|WI08 - Unknown report work instruction]]
  
 
= Events  =
 
= Events  =
  
 
*[https://www.egi.eu/indico/categoryDisplay.py?categId=11 EGI indico page] with COD meeting agendas.  
 
*[https://www.egi.eu/indico/categoryDisplay.py?categId=11 EGI indico page] with COD meeting agendas.  
*All open actions can be found from [[Grid operations oversight/CODOD actions|COD actions]]
+
*All open actions can be found from [[COD_actions|COD actions]]
  
 
= Resources  =
 
= Resources  =
Line 175: Line 175:
 
<br>  
 
<br>  
  
Definition of [[Grid operations oversight/OperationsSupportMetrics|Operations Support metrics]]  
+
Definition of [[Operations_support_metrics|Operations Support metrics]]  
  
 
=== May 2010-Sep 2011  ===
 
=== May 2010-Sep 2011  ===
Line 188: Line 188:
  
 
*[[Operations SAM tests|Operations tests list ]]: list of Nagios probes generating alarms for visualization in the Operations Dashboard  
 
*[[Operations SAM tests|Operations tests list ]]: list of Nagios probes generating alarms for visualization in the Operations Dashboard  
*[http://grid-monitoring.cern.ch/myegi/sam-pi/metrics_in_profiles?vo_name=ops&profile_name=ROC_CRITICAL Availability and reliability tests list]: list of Nagios probes whose results are used for Availability and Reliability computation
+
*[[Availability_SAM_tests|Availability and reliability tests list]]: list of Nagios probes whose results are used for Availability and Reliability computation
  
 
== OTAG topics  ==
 
== OTAG topics  ==
Line 195: Line 195:
  
 
*[http://bit.ly/dZ3RWN RT tickets]  
 
*[http://bit.ly/dZ3RWN RT tickets]  
*[[Grid operations oversight/COD interaction with Dashboard team|COD interactions with Dashboard team (draft)]]  
+
*[[COD_Interaction_with_Dashboard_team|COD interactions with Dashboard team (draft)]]  
*[[Grid operations oversight/COD OTAG topics|COD topics to be discussed on OTAG meeting]]
+
*[[COD_OTAG_topics|COD topics to be discussed on OTAG meeting]]
  
 
=== GOC DB  ===
 
=== GOC DB  ===
  
*[[Grid operations oversight/COD GOCDB requirements|Collection of GOC DB requirements regarding COD work (draft)]]
+
*[[COD_GOCDB_requirements|Collection of GOC DB requirements regarding COD work (draft)]]
  
 
== Pages in draft state  ==
 
== Pages in draft state  ==
  
*[[Grid operations oversight/COD Improvements to availability procedure|Improvements to Availability Calculation Procedure (draft)]]  
+
*[[Availability_procedure_improvements|Improvements to Availability Calculation Procedure (draft)]]  
*[[Grid operations oversight/A/R fixing procedure|A/R fixing procedure (draft)]][[Grid operations oversight/ROD FAQ|<br>]]  
+
*[[Availability_Reliability_database_fixing|A/R fixing procedure (draft)]][[Grid operations oversight/ROD FAQ|<br>]]  
*[[Grid operations oversight/CandidateSuspendedSitesList|Candidate Suspended Sites List]]
+
*[[Candidate_or_Suspended_sites|Candidate Suspended Sites List]]
  
 
[[Category:Grid_Oversight]]
 
[[Category:Grid_Oversight]]

Revision as of 12:13, 24 January 2013


Introduction

The COD team is a small team that coordinates the RODs at the global level. COD represents the whole ROD structure, both for technical requirements on operations tools and at the political level.

This page collects all the materials the COD team needs to perform Grid operations oversight activities.

People and contact

The COD team is formed from the Dutch and Polish teams and includes COD managers (people responsible for managerial issues) and COD shifters (people performing the day-to-day COD work).

COD managers: 
Ron Trompert (Chair), Marcin Radecki, Luuk Uljee, Tadeusz Szymocha, Magda Szopa
COD shifters: 
Tadeusz Szymocha, Magda Szopa, Ron Trompert, Luuk Uljee, Maarten van Ingen, Ernst Pijper, Alexander Verkooijen


People behind the names


There are 2 mailing lists used for different cases:

  • manager-central-operator-on-duty AT mailman.egi.eu - for COD managerial issues, such as suggesting changes to procedures or tools. COD managers are the recipients of this list.
  • central-operator-on-duty AT mailman.egi.eu - for reporting COD day-to-day issues, such as problems with tools or Nagios tests. COD shifters are the recipients of this list.

COD Duties

  • COD managers
    • representing RODs/COD in OTAG, OMB and Operations meetings - collecting requirements and improvement proposals from RODs concerning operations tools and procedures
    • suspending Resource Centres in case of operational issues
    • taking part in the OLA task force
    • writing new procedures - when needed, COD takes part in the procedure creation process
    • preparing ROD newsletters - informing RODs about recent and upcoming developments related to Grid Oversight
    • preparing ROD metrics reports - providing an overview of the operations support process in the grid infrastructure
  • COD shifters
    • escalation of operational problems with RODs
    • dealing with GGUS tickets assigned to COD
    • process coordination of:
      • creation and decommissioning of an Operations Centre
      • setting a Nagios test to an operations test
      • getting explanations for low availability and reliability metrics

COD shifters work instructions

This section collects all work instructions, each of which specifies exactly which steps are to be followed to carry out an activity.

Action Description Related procedures
GGUS tickets assigned to COD

The COD shifter must check the current status of all GGUS tickets assigned to COD.


In case of a request for:

If the shifter does not know what action should be taken, he/she should contact the COD managers.


Operational portal dashboard issues
Handover
  • COD dashboard link
  • At the end of the shift a handover should be submitted (sent to COD) via the Handover tool in the Operational Portal
    • Problems on the dashboard which will pass to next week: the GGUS ID of the ticket and when the next escalation step should be taken
    • GGUS tickets assigned to COD: for each ticket its last status and the action taken by the shifter should be provided
    • Other issues: problems with tools etc.
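
The three parts of a handover listed above can be sketched as a small data structure. This is only an illustration of what a shifter is expected to record; it is not the Operational Portal's Handover tool or its API. All class names, field names and ticket numbers below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DashboardProblem:
    """A dashboard problem that passes to next week (hypothetical model)."""
    ggus_id: int          # GGUS ID of the ticket
    next_escalation: str  # when the next escalation step should be taken


@dataclass
class CodTicket:
    """A GGUS ticket assigned to COD (hypothetical model)."""
    ggus_id: int
    last_status: str      # last known status of the ticket
    action_taken: str     # action taken by the shifter


@dataclass
class Handover:
    """End-of-shift handover with the three parts named in the list above."""
    dashboard_problems: List[DashboardProblem] = field(default_factory=list)
    cod_tickets: List[CodTicket] = field(default_factory=list)
    other_issues: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the handover as plain text, one section per part."""
        lines = ["Problems on the dashboard passing to next week:"]
        for p in self.dashboard_problems:
            lines.append(
                f"  - GGUS {p.ggus_id}: next escalation step {p.next_escalation}"
            )
        lines.append("GGUS tickets assigned to COD:")
        for t in self.cod_tickets:
            lines.append(
                f"  - GGUS {t.ggus_id}: status {t.last_status}; action: {t.action_taken}"
            )
        lines.append("Other issues:")
        for issue in self.other_issues:
            lines.append(f"  - {issue}")
        return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    handover = Handover(
        dashboard_problems=[DashboardProblem(12345, "2013-01-28")],
        cod_tickets=[CodTicket(67890, "in progress", "asked the NGI for an update")],
        other_issues=["problems with tools: none observed"],
    )
    print(handover.render())
```

Keeping the three parts as separate lists mirrors the handover checklist, so an empty section still shows up in the rendered text and the next shifter can see that nothing was carried over.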

Availability/reliability followup procedure
Unknown followup procedure
Top-level BDII followup procedure
ROD performance index followup procedure

Work Instructions

Events

Resources


Oct 2011 to date

  • Please provide a link here



Definition of Operations Support metrics

May 2010-Sep 2011

Until April 2010

  • EGEE-III Operations Support metrics

Nagios tests

OTAG topics

Operational Portal: Dashboard

GOC DB

Pages in draft state