
Difference between revisions of "EGI Infrastructure operations oversight"

Line 1: Line 1:
__NOTOC__
{{Template:Op menubar}}
{{Template:Op menubar}}
{{TOC_right}}
 


[[Category:COD]]
[[Category:COD]]
EGI Grid Operations oversight of the e-Infrastructure is a co-ordination task for ensuring that GRID monitoring across EGI runs smoothly.
This team communicates among the 3 groups - Operations and e-Infrastructure Oversight (OE); Operational Documentation (OD);
and "Coordination of interoperations between NGIs and with other Grids".
'''Grid operations oversight''' activities include detecting problems affecting the EGI e-Infrastructure and coordinating their diagnosis throughout the lifecycle until resolution, reporting middleware issues to the developers, running quality checks on the services provided by NGIs, and handling operational problems that cannot be solved at the NGI level. This task coordinates the oversight of the NGI e-Infrastructures (run under the responsibility of the NGIs), which, at the NGI level, includes monitoring the services operated by sites, managing tickets and following them up until problems are resolved, and suspending a site when deemed necessary.


The Operations oversight team works with the Tool Developers (and particularly the [[OTAG]] group), NGIs and their Operations Teams (ROD).
There are regular phone meetings for the co-ordinators and others working in the tasks. The OE co-ordinators also organise face to face meetings for the
ROD teams 3 to 4 times a year.
The Grid operations oversight activities are performed by the COD team at the EGI level and by the ROD teams at the Operations Centre level.


;'''COD managers:''' : Ron Trompert (Chair), Marcin Radecki, Luuk Uljee, Malgorzata Krakowian
;'''COD shifters:''' : Malgorzata Krakowian, Ron Trompert, Luuk Uljee, Maarten van Ingen, Ernst Pijper, Alexander Verkooijen
;'''Contact:''' :
*There are 3 mailing lists used for different cases:
** '''manager-central-operator-on-duty''' AT mailman.egi.eu - for COD managerial issues, such as suggesting changes to procedures or tools. COD managers are recipients of this list.
** '''central-operator-on-duty''' AT mailman.egi.eu - for reporting day-to-day COD issues, such as problems with tools or Nagios tests. COD shifters are recipients of this list.
** '''all-central-operator-on-duty''' AT mailman.egi.eu - for contacting all ROD teams in NGIs. Every ROD team is a recipient of this list.


----
== Information for Regional Operators (ROD) ==
* NEW! [https://documents.egi.eu/secure/ShowDocument?docid=298 ROD newsletter]
* 2011
** [https://documents.egi.eu/secure/RetrieveFile?docid=298&version=1&filename=ROD%20newsletter%201-2011.pdf Jan] |[https://documents.egi.eu/secure/RetrieveFile?docid=298&version=1&filename=ROD%20newsletter%2002-2011.pdf Feb]
* 2010
** [https://documents.egi.eu/secure/RetrieveFile?docid=298&version=1&filename=ROD%20newsletter%2012-2010.pdf Dec]


== ROD and COD Performance ==
{| width="100%"
* [[Operations:OperationsSupportMetrics | Operations Support Metrics]]
| width="50%" style="vertical-align:top" |
* [[Operations:OperationsSupportMetrics_summary | Operations Support Metrics - reports summary]]
= People and contact =


* 2011
The people performing these activities and their contact points can be found:
** [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-January11.ods Jan] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-February11.ods Feb]
* for the COD team, on the COD wiki page: [https://wiki.egi.eu/wiki/Grid_operations_oversight/COD#People_and_contact People_and_contact]
* 2010
* for the ROD teams, on the ROD wiki page: [https://wiki.egi.eu/wiki/Grid_operations_oversight/ROD#People_and_Contact People_and_Contact]
** [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-May10.ods May] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-June10.ods Jun] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-July10.ods Jul] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-August10.ods Aug] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-September10.ods Sep] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-October10.ods Oct] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-November10.ods Nov] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-December10.ods Dec]
* QR
** [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=COD-QR1.ods QR1] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=COD-QR2.ods QR2] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=COD-QR3.ods QR3]


== Nagios tests ==
* [[Operations:Operations_tests| Operations tests list ]]: list of Nagios probes generating alarms for visualization in the Operations Dashboard
* [[Availability_and_reliability_tests| Availability and reliability tests list ]]: list of Nagios probes whose results are used for Availability and Reliability computation


== About Central Operator on Duty (COD) ==
= Duties =


* [[Grid_operations_oversight/CODOD|Phone conference Meetings, Agenda and Actions]]  
The COD team is responsible for global oversight of the whole EGI infrastructure. More details: [https://wiki.egi.eu/wiki/Grid_operations_oversight/COD#COD_Duties COD_Duties]
Each ROD team is responsible for handling operational problems within its own NGI. More details: [https://wiki.egi.eu/wiki/Grid_operations_oversight/ROD#ROD_duties ROD_duties]


=== Duties ===  
| width="50%" style="vertical-align:top" |
* COD managers
= Resources =
** '''representing RODs/COD in OTAG, OMB and Operations meetings''' - collecting requirements and improvement proposals from RODs concerning operations tools and procedures
** '''suspending Resource Centres''' in case of operational issues
* [https://wiki.egi.eu/wiki/Grid_operations_oversight/COD COD wiki page] - detailed information about the COD team
** '''taking part in OLA task force'''
* [https://wiki.egi.eu/wiki/Grid_operations_oversight/ROD ROD wiki page] - detailed information about the ROD teams
** '''writing new procedures''' - when needed, COD takes part in the procedure creation process
** '''preparing ROD newsletters''' - informing RODs about recent and upcoming developments related to Grid Oversight
** '''preparing ROD metrics reports''' - providing an overview of the operations support process in the grid infrastructure
* COD shifters
** '''escalation of operational problems with RODs'''
*** Alarms older than 3 days without an assigned ticket
*** Tickets which expired 3 days ago
*** Tickets which have not been solved for 30 days
** '''escalation of operational problems with sites which cannot be solved on NGI level'''
*** Tickets transferred to C-COD (last escalation step)
*** Sites in downtime for more than 1 month
** '''dealing with GGUS tickets assigned to COD'''
** '''process coordination''' of:
*** creation and decommission of Operations Centre
*** setting a Nagios test to an operations test
*** getting explanations for low availability and reliability metrics
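As an illustration only, the COD shifter escalation thresholds listed above can be sketched in Python. The <code>Alarm</code> and <code>Ticket</code> records and both function names are hypothetical and do not correspond to any Operations Portal or GGUS API. The sketch also assumes that "older than 3 days" means strictly more than 3 days, and that the two ticket rules apply once 3 and 30 full days have passed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Thresholds taken from the COD shifter escalation list.
ALARM_WITHOUT_TICKET = timedelta(days=3)
TICKET_EXPIRED = timedelta(days=3)
TICKET_UNSOLVED = timedelta(days=30)


@dataclass
class Alarm:
    """Hypothetical alarm record (not an Operations Portal type)."""
    raised_at: datetime
    ticket_id: Optional[str] = None


@dataclass
class Ticket:
    """Hypothetical ticket record (not a GGUS type)."""
    opened_at: datetime
    expires_at: datetime
    solved: bool = False


def alarm_needs_escalation(alarm: Alarm, now: datetime) -> bool:
    """True for an alarm older than 3 days with no ticket assigned."""
    return alarm.ticket_id is None and now - alarm.raised_at > ALARM_WITHOUT_TICKET


def ticket_escalation_reasons(ticket: Ticket, now: datetime) -> List[str]:
    """Return which ticket rules apply; an empty list means no escalation."""
    reasons: List[str] = []
    if ticket.solved:
        return reasons
    # Assumption: the rule applies once the expiry date is at least 3 days past.
    if now - ticket.expires_at >= TICKET_EXPIRED:
        reasons.append("expired")
    # Assumption: the rule applies once the ticket has been open for 30 days.
    if now - ticket.opened_at >= TICKET_UNSOLVED:
        reasons.append("unsolved")
    return reasons
```

A shifter-side script could run both checks over the open alarms and tickets and list the items that need an escalation step; how those items are fetched is outside this sketch.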


=== Internal area ===
==== COD shifters daily work instructions ====
This section collects all work instructions, which specify exactly which steps to follow to carry out an activity.


{| border="1" cellspacing="0" cellpadding="5" align="center"
[[Category:COD]]
! Action
! Description
! Related procedures
|-
| '''GGUS tickets assigned to COD'''
|
The COD shifter must check the current status of all '''GGUS tickets assigned to COD'''
* see [http://tinyurl.com/2ws735h Link to all GGUS tickets assigned to COD]
* If a ticket is waiting for COD action, the shifter should perform that action
 
 
In case of a request for:
* '''ROD certification'''
**  see [[Grid_operations_oversight/WI01 | New ROD team certification work instructions]]
* '''Creation of a new NGI'''
** see [[Operations_Centre_creation_process_coordination |  Creation of a new Operations Centre process coordination]]
** If COD is also the Integration Process Coordinator, COD is responsible for the whole procedure.
* '''Operations Centre decommission'''
** see [[Operations:Operations_Centre_decommission|Operations Centre decommission process coordination]]
** COD validates the request and removes the ROD information from the all-operators mailing list
* '''Setting a Nagios test to an operations test'''
** see [[Operations:Procedure_for_setting_Nagios_test_an_operations_test| Procedure for setting a Nagios test to an operations test]]
** COD is responsible for coordinating the whole process.
 
If the shifter is unsure which action to take, they should contact the COD managers via manager-central-operator-on-duty AT mailman.egi.eu
|
* [[Operations_Centre_creation_process_coordination |  Creation of a new Operations Centre process coordination]]
* [[Operations:Operations_Centre_decommission|Operations Centre decommission process coordination]]
* [[Operations:Procedure_for_setting_Nagios_test_an_operations_test| Procedure for setting a Nagios test to an operations test]]
|-
| '''Availability/reliability reports'''
|
* Handling availability/reliability reports: [[Availability_and_reliability_internal_procedure_for_COD | Availability and reliability work instruction]]
** [[Availability_and_reliability_reports_metrics |  AR reports metrics]]
|
* [[Operations:COD_Escalation_Procedure|COD escalation procedure]]
* [[Availability_and_reliability_monthly_statistics | Availability and reliability monthly statistics procedure]]
|-
| '''Operational portal dashboard issues'''
|
*[https://operations-portal.in2p3.fr/dashboard/ccodView COD dashboard link]
*[[Operations:Work_instruction_for_escalating_operational_problems_with_ROD | Escalation for operational problems with ROD - work instruction]]
|
* [[Operations:COD_Escalation_Procedure|COD escalation procedure]]
|-
| '''Handover'''
|
[https://operations-portal.in2p3.fr/dashboard/ccodView COD dashboard link]
* At the end of the shift a handover should be submitted (sent to COD) via the Handover tool in the Operational Portal
** Problems on the dashboard that carry over to the next week: the GGUS ID of the ticket and when the next escalation step should be taken
** GGUS tickets assigned to COD: for each ticket, its last status and the action taken by the shifter should be provided
** Other issues: problems with tools, etc.
|
|-
|}
 
 
''NOTE: all procedures should contain the following template: https://wiki.egi.eu/wiki/PDT:Procedure_Template''
 
=== Procedures ===
 
==== To be approved by OMB ====
 
=== OTAG topics ===
 
==== Operational Portal: Dashboard ====
* [http://bit.ly/dZ3RWN  RT tickets]
* [[Operations:COD_interaction_with_Dashboard_team| COD interactions with Dashboard team (draft)]]
* [[Operations:COD_OTAG_topics| COD topics to be discussed on OTAG meeting]]
 
==== GOC DB ====
* [[Operations:COD_GOCDB_requirements|Collection of GOC DB requirements regarding COD work (draft)]]
 
=== Pages in draft state ===
 
* [[Operations:COD_Improvements_to_availability_procedure|Improvements to Availability Calculation Procedure (draft)]]
 
* [[Operations:A/R_fixing_procedure| A/R fixing procedure (draft)]]
 
* [[Grid_operations_oversight_-_COD| COD wiki page]]
 
* [[Grid_operations_oversight_-_ROD| ROD wiki page]]
 
* [[Grid_operations_oversight_draft| Grid operations oversight - new page draft]]
 
* [[Operations:OperationsSupportMetrics:MetricsDocumentation|Metrics Documentation]]

Revision as of 14:49, 22 March 2011



Grid operations oversight activities include detecting problems affecting the EGI e-Infrastructure and coordinating their diagnosis throughout the lifecycle until resolution, reporting middleware issues to the developers, running quality checks on the services provided by NGIs, and handling operational problems that cannot be solved at the NGI level. This task coordinates the oversight of the NGI e-Infrastructures (run under the responsibility of the NGIs), which, at the NGI level, includes monitoring the services operated by sites, managing tickets and following them up until problems are resolved, and suspending a site when deemed necessary.

The Grid operations oversight activities are performed by the COD team at the EGI level and by the ROD teams at the Operations Centre level.


People and contact

The people performing these activities and their contact points can be found:


Duties

The COD team is responsible for global oversight of the whole EGI infrastructure. More details: COD_Duties

Each ROD team is responsible for handling operational problems within its own NGI. More details: ROD_duties

Resources

  • COD wiki page - detailed information about the COD team
  • ROD wiki page - detailed information about the ROD teams