|
|
| | __NOTOC__ |
| | |
| {{Template:Op menubar}} | | {{Template:Op menubar}} |
| {{TOC_right}}
| | |
|
| |
|
| [[Category:COD]] | | [[Category:COD]] |
| EGI Grid Operations oversight of the e-Infrastructure is a co-ordination task for ensuring that GRID monitoring across EGI runs smoothly.
| | '''Grid operations oversight''' activities include the detection and coordination of the diagnosis of problems affecting the entire EGI e-Infrastructure during the entire lifecycle until |
| This team communicates among the 3 groups - Operations and e-Infrastructure Oversight (OE); Operational Documentation (OD); | | resolution, the reporting of middleware issues to the developers, the execution of quality checks of the services provided by NGIs, and the handling of operational problems that cannot be |
| and "Coordination of interoperations between NGIs and with other Grids". | | solved at the NGI level. This task coordinates the oversight of the NGI e-Infrastructures (run under the responsibility of the NGIs), which – at the NGI level – includes the monitoring of the |
| | services operated by sites, the management of tickets and their follow-up for problem resolution, and the suspension of a site when deemed necessary. |
|
| |
|
| The Operations oversight team works with the Tool Developers (and particularly the [[OTAG]] group), NGIs and their Operations Teams (ROD). | | The Grid operations oversight activities are performed by the COD team at the EGI level and by ROD teams at the Operations Centre level. |
| There are regular phone meetings for the co-ordinators and others working in the tasks. The OE co-ordinators also organise face to face meetings for the
| |
| ROD teams 3 to 4 times a year.
| |
|
| |
|
| ;'''COD managers:''' : Ron Trompert (Chair), Marcin Radecki, Luuk Uljee, Malgorzata Krakowian
| |
| ;'''COD shifters:''' : Malgorzata Krakowian, Ron Trompert, Luuk Uljee, Maarten van Ingen, Ernst Pijper, Alexander Verkooijen
| |
| ;'''Contact:''' :
| |
| *There are 3 mailing lists used for different cases:
| |
| ** '''manager-central-operator-on-duty''' AT mailman.egi.eu - for COD managerial issues like suggesting changes in procedures, tools. COD managers are recipients of this list.
| |
| ** '''central-operator-on-duty''' AT mailman.egi.eu - for reporting COD day-to-day issues like problems with tools or Nagios tests. COD shifters are recipients of this list.
| |
| ** '''all-central-operator-on-duty''' AT mailman.egi.eu - for contacting all ROD teams in NGIs. Every ROD team is a recipient of this list.
| |
|
| |
|
| ----
| |
| == Information for Regional Operators (ROD) ==
| |
| * NEW! [https://documents.egi.eu/secure/ShowDocument?docid=298 ROD newsletter]
| |
| * 2011
| |
| ** [https://documents.egi.eu/secure/RetrieveFile?docid=298&version=1&filename=ROD%20newsletter%201-2011.pdf Jan] |[https://documents.egi.eu/secure/RetrieveFile?docid=298&version=1&filename=ROD%20newsletter%2002-2011.pdf Feb]
| |
| * 2010
| |
| ** [https://documents.egi.eu/secure/RetrieveFile?docid=298&version=1&filename=ROD%20newsletter%2012-2010.pdf Dec]
| |
|
| |
|
| == ROD and COD Performance == | | {| width="100%" |
| * [[Operations:OperationsSupportMetrics | Operations Support Metrics]]
| | | width="50%" style="vertical-align:top" | |
| * [[Operations:OperationsSupportMetrics_summary | Operations Support Metrics - reports summary]]
| | = People and contact = |
|
| |
|
| * 2011
| | The people performing these activities and their contact points are listed: |
| ** [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-January11.ods Jan] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-February11.ods Feb]
| | * for the COD team, on the COD wiki page [https://wiki.egi.eu/wiki/Grid_operations_oversight/COD#People_and_contact People_and_Contact] |
| * 2010 | | * for ROD teams, on the ROD wiki page [https://wiki.egi.eu/wiki/Grid_operations_oversight/ROD#People_and_Contact People_and_Contact] |
| ** [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-May10.ods May] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-June10.ods Jun] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-July10.ods Jul] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-August10.ods Aug] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-September10.ods Sep] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-October10.ods Oct] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-November10.ods Nov] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=EGI-Operations_Support_Metrics-December10.ods Dec]
| |
| * QR | |
| ** [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=COD-QR1.ods QR1] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=COD-QR2.ods QR2] | [https://documents.egi.eu/secure/RetrieveFile?docid=155&version=1&filename=COD-QR3.ods QR3]
| |
|
| |
|
| == Nagios tests ==
| |
| * [[Operations:Operations_tests| Operations tests list ]]: list of Nagios probes generating alarms for visualization in the Operations Dashboard
| |
| * [[Availability_and_reliability_tests| Availability and reliability tests list ]]: list of Nagios probes whose results are used for Availability and Reliability computation
| |
|
| |
|
| == About Central Operator on Duty (COD) == | | = Duties = |
|
| |
|
| * [[Grid_operations_oversight/CODOD|Phone conference Meetings, Agenda and Actions]]
| | The COD team is responsible for global oversight of the whole EGI infrastructure. More details: [https://wiki.egi.eu/wiki/Grid_operations_oversight/COD#COD_Duties COD_Duties] |
| | |
| | Each ROD team is responsible for handling operational problems within its own NGI. More details: [https://wiki.egi.eu/wiki/Grid_operations_oversight/ROD#ROD_duties ROD_duties] |
|
| |
|
| === Duties === | | | width="50%" style="vertical-align:top" | |
| * COD managers
| | = Resources = |
| ** '''representing RODs/COD in OTAG, OMB and Operations meetings''' - collecting requirements and improvements proposals from RODs concerning operations tools and procedures | | |
| ** '''suspending Resource Centres''' in case of operational issues
| | * [https://wiki.egi.eu/wiki/Grid_operations_oversight/COD COD wiki page] - contains detailed information about the COD team |
| ** '''taking part in OLA task force'''
| | * [https://wiki.egi.eu/wiki/Grid_operations_oversight/ROD ROD wiki page] - contains detailed information about the ROD teams |
| ** '''writing new procedures''' - when needed, COD takes part in the procedure creation process
| |
| ** '''preparing ROD newsletters''' - informing RODs about recent and upcoming developments related to Grid Oversight | |
| ** '''preparing ROD metrics reports''' - providing an overview of operations support process in grid infrastructure.
| |
| * COD shifters
| |
| ** '''escalation of operational problems with RODs'''
| |
| *** Alarms older than 3 days without an assigned ticket
| |
| *** Tickets that expired 3 days ago
| |
| *** Tickets which have not been solved for 30 days
| |
| ** '''escalation of operational problems with sites which cannot be solved on NGI level'''
| |
| *** Tickets transferred to C-COD (last escalation step)
| |
| *** Sites in downtime for more than 1 month
| |
| ** '''dealing with GGUS tickets assigned to COD'''
| |
| ** '''process coordination''' of:
| |
| *** creation and decommissioning of an Operations Centre
| |
| *** setting a Nagios test to an operations test
| |
| *** getting explanations for low availability and reliability metrics
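The escalation thresholds listed above for COD shifters can be expressed as simple date checks. The following is a minimal sketch only, not part of any EGI tool: the `Alarm` and `Ticket` classes, their field names and the function names are hypothetical, "1 month" is approximated as 30 days, and treating the ticket thresholds as "at least N days" is an assumption about how the rules are meant to be read.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Thresholds taken from the COD shifter duties list.
ALARM_WITHOUT_TICKET_AGE = timedelta(days=3)
TICKET_EXPIRED_AGE = timedelta(days=3)
TICKET_UNSOLVED_AGE = timedelta(days=30)
SITE_DOWNTIME_AGE = timedelta(days=30)  # assumption: "1 month" taken as 30 days


@dataclass
class Alarm:
    """Hypothetical dashboard alarm record."""
    raised_at: datetime
    ticket_id: Optional[str] = None  # None means no ticket has been assigned


@dataclass
class Ticket:
    """Hypothetical ticket record."""
    opened_at: datetime
    expires_at: datetime
    solved: bool = False


def alarm_needs_escalation(alarm: Alarm, now: datetime) -> bool:
    """Alarm older than 3 days without an assigned ticket."""
    return alarm.ticket_id is None and now - alarm.raised_at > ALARM_WITHOUT_TICKET_AGE


def ticket_escalation_reasons(ticket: Ticket, now: datetime) -> List[str]:
    """Return the reasons (possibly none) why an open ticket should be escalated."""
    reasons: List[str] = []
    if ticket.solved:
        return reasons
    if now - ticket.expires_at >= TICKET_EXPIRED_AGE:
        reasons.append("expired")
    if now - ticket.opened_at >= TICKET_UNSOLVED_AGE:
        reasons.append("unsolved")
    return reasons


def site_downtime_needs_escalation(downtime_start: datetime, now: datetime) -> bool:
    """Site in downtime for more than 1 month (30 days in this sketch)."""
    return now - downtime_start > SITE_DOWNTIME_AGE
```

A shifter-side script could run these checks over the alarms and tickets exported from the dashboard and list only the items that need a follow-up with the ROD or the site.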
| |
|
| |
|
| === Internal area ===
| |
| ==== COD shifters daily work instructions ====
| |
| This section collects all work instructions, each specifying exactly which steps to follow to carry out an activity.
| |
|
| |
|
| {| border="1" cellspacing="0" cellpadding="5" align="center"
| | [[Category:COD]] |
| ! Action
| |
| ! Description
| |
| ! Related procedures
| |
| |-
| |
| | '''GGUS tickets assigned to COD'''
| |
| |
| |
| The COD shifter must check the current status of all '''GGUS tickets assigned to COD'''
| |
| * see [http://tinyurl.com/2ws735h Link to all GGUS tickets assigned to COD]
| |
| * If a ticket is waiting for COD action, the shifter should perform that action
| |
| | |
| | |
| In case of a request for:
| |
| * '''ROD certification'''
| |
| ** see [[Grid_operations_oversight/WI01 | New ROD team certification work instructions]]
| |
| * '''Creation of a new NGI'''
| |
| ** see [[Operations_Centre_creation_process_coordination | Creation of a new Operations Centre process coordination]]
| |
| ** In cases where COD is also the Integration Process Coordinator, COD is responsible for the whole procedure.
| |
| * '''Operations Centre decommission'''
| |
| ** see [[Operations:Operations_Centre_decommission|Operations Centre decommission process coordination]]
| |
| ** COD validates the request and removes the ROD information from the all-operators mailing list
| |
| * '''Setting a Nagios test to an operations test'''
| |
| ** see [[Operations:Procedure_for_setting_Nagios_test_an_operations_test| Procedure for setting a Nagios test to an operations test]]
| |
| ** COD is responsible for coordinating the whole process.
| |
| | |
| If the shifter doesn't know what kind of action should be taken, he/she should contact COD managers via manager-central-operator-on-duty AT mailman.egi.eu
| |
| |
| |
| * [[Operations_Centre_creation_process_coordination | Creation of a new Operations Centre process coordination]]
| |
| * [[Operations:Operations_Centre_decommission|Operations Centre decommission process coordination]]
| |
| * [[Operations:Procedure_for_setting_Nagios_test_an_operations_test| Procedure for setting a Nagios test to an operations test]]
| |
| |-
| |
| | '''Availability/reliability reports'''
| |
| |
| |
| * Handling availability/reliability reports: [[Availability_and_reliability_internal_procedure_for_COD | Availability and reliability work instruction]]
| |
| ** [[Availability_and_reliability_reports_metrics | AR reports metrics]]
| |
| |
| |
| * [[Operations:COD_Escalation_Procedure|COD escalation procedure]]
| |
| * [[Availability_and_reliability_monthly_statistics | Availability and reliability monthly statistics procedure]]
| |
| |-
| |
| | '''Operational portal dashboard issues'''
| |
| |
| |
| *[https://operations-portal.in2p3.fr/dashboard/ccodView COD dashboard link]
| |
| *[[Operations:Work_instruction_for_escalating_operational_problems_with_ROD | Escalation for operational problems with ROD - work instruction]]
| |
| |
| |
| * [[Operations:COD_Escalation_Procedure|COD escalation procedure]]
| |
| |-
| |
| | '''Handover'''
| |
| |
| |
| [https://operations-portal.in2p3.fr/dashboard/ccodView COD dashboard link]
| |
| * At the end of the shift, a handover should be submitted (sent to COD) via the Handover tool in the Operational Portal
| |
| ** Problems on the dashboard that carry over to the next week: the GGUS ID of the ticket and when the next escalation step should be taken
| |
| ** GGUS tickets assigned to COD: for each ticket its last status and the action taken by the shifter should be provided
| |
| ** Other issues: problems with tools etc.
| |
| |
| |
| |-
| |
| |}
| |
| | |
| | |
| ''NOTE: all procedures should contain the following template: https://wiki.egi.eu/wiki/PDT:Procedure_Template''
| |
| | |
| === Procedures ===
| |
| | |
| ==== To be approved by OMB ====
| |
| | |
| === OTAG topics ===
| |
| | |
| ==== Operational Portal: Dashboard ====
| |
| * [http://bit.ly/dZ3RWN RT tickets]
| |
| * [[Operations:COD_interaction_with_Dashboard_team| COD interactions with Dashboard team (draft)]]
| |
| * [[Operations:COD_OTAG_topics| COD topics to be discussed at the OTAG meeting]]
| |
| | |
| ==== GOC DB ====
| |
| * [[Operations:COD_GOCDB_requirements|Collection of GOC DB requirements regarding COD work (draft)]]
| |
| | |
| === Pages in draft state ===
| |
| | |
| * [[Operations:COD_Improvements_to_availability_procedure|Improvements to Availability Calculation Procedure (draft)]]
| |
| | |
| * [[Operations:A/R_fixing_procedure| A/R fixing procedure (draft)]]
| |
| | |
| * [[Grid_operations_oversight_-_COD| COD wiki page]]
| |
| | |
| * [[Grid_operations_oversight_-_ROD| ROD wiki page]]
| |
| | |
| * [[Grid_operations_oversight_draft| Grid operations oversight - new page draft]]
| |
| | |
| * [[Operations:OperationsSupportMetrics:MetricsDocumentation|Metrics Documentation]]
| |