The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other supports; new updates will be ignored and lost.
If needed you can get in touch with EGI SDIS team using operations @ egi.eu.

Difference between revisions of "EGI Infrastructure operations oversight"

= EGI.eu Operations Oversight Pages =
{{DeprecatedAndMovedTo|new_location=https://ims.egi.eu/display/EGIPP/EGI+Infrastructure+operations+oversight}}


{{Template:Op menubar}} {{Template:GO menubar}}


EGI Grid Operations oversight of the e-Infrastructure is a co-ordination task for ensuring that GRID monitoring across EGI runs smoothly.
[[Category:Infrastructure_Oversight|*]]
This team liaises among three groups: Operations and e-Infrastructure Oversight (OE); Operational Documentation (OD);
and "Coordination of interoperations between NGIs and with other Grids".
 
The Operations oversight team works with the Tool Developers (and particularly the [[OTAG]] group), NGIs and their Operations Teams (ROD).
There are regular phone meetings for the co-ordinators and others working on the tasks. The OE co-ordinators also organise face-to-face meetings for the
ROD teams 3 to 4 times a year.
 
;'''Co-ordinators:''' : Ron Trompert (Chair), Marcin Radecki, Luuk Uljee
;'''Deputy:''' : Malgorzata Krakowian
;'''Contact:''' :
*There are 3 mailing lists used for different cases:
** '''manager-central-operator-on-duty''' AT mailman.egi.eu - for COD managerial issues like suggesting changes in procedures, tools. COD managers are recipients of this list.
** '''central-operator-on-duty''' AT mailman.egi.eu - for reporting COD day-to-day issues like problems with tools or Nagios tests. COD shifters are recipients of this list.
** '''all-central-operator-on-duty''' AT mailman.egi.eu - for contacting all ROD teams in NGIs. Each ROD team is a recipient of this list.
 
 
----
== COD official web pages ==
* [[Operations:OperationsSupportMetrics | Operations Support Metrics]]
* [[Operations:Operations_tests| Critical tests list ]]
 
== Procedures used in COD activity ==
 
This section collects all procedures in force for COD:
* [[Operations:CODOPSmanual |  ROD oversight duties]]
* [[Operations:NewNGIs_creation |  New NGI creation process coordination]]
* [[Availability_and_reliability_monthly_statistics]]
* [[Operations:Operations_Centre_decommission|Operations Centre decommission process coordination]]
* [[Operations:Setting_Nagios_tests_critical_procedure| Procedure for setting Nagios tests critical for COD]]
* [[Operations:COD_Escalation_Procedure|COD escalation procedure]]
 
== COD shifters daily work instructions ==
This section collects the work instructions, which specify exactly what steps to follow to carry out an activity.
 
=== Dealing with the GGUS tickets assigned to COD ===
* [http://tinyurl.com/2ws735h Link to all GGUS tickets assigned to COD]
* The COD shifter is obliged to check the current status of all tickets assigned to COD in GGUS
* If a ticket is waiting for COD action, the shifter should perform that action
* In case of a request for:
** ROD certification see [[Procedure_to_handle_new_ROD_certification_GGUS_tickets | New ROD team certification work instruction]]
** New NGI creation see [[Operations:NewNGIs_creation |  New NGI creation process coordination]]
** Operations Centre decommission see [[Operations:Operations_Centre_decommission|Operations Centre decommission process coordination]]
** Setting Nagios test critical see [[Operations:Setting_Nagios_tests_critical_procedure| Procedure for setting Nagios tests critical for COD]]
* If the shifter does not know what kind of action should be taken, he/she should contact the COD managers
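The routing described in the list above can be sketched as a small lookup. This is only an illustration, not an EGI tool: the request-type keys, the function name <code>next_step</code> and the returned strings are hypothetical, and only the page names come from the links above. It also simplifies by sending every request type that is not listed to the COD managers.

```python
from typing import Optional

# Request-type keys are hypothetical; the values are the wiki pages linked above.
PROCEDURE_PAGES = {
    "rod_certification": "Procedure_to_handle_new_ROD_certification_GGUS_tickets",
    "new_ngi_creation": "Operations:NewNGIs_creation",
    "operations_centre_decommission": "Operations:Operations_Centre_decommission",
    "nagios_test_critical": "Operations:Setting_Nagios_tests_critical_procedure",
}


def next_step(waiting_for_cod: bool, request_type: Optional[str]) -> str:
    """Return a short instruction for one GGUS ticket assigned to COD."""
    if not waiting_for_cod:
        # Assumption: a ticket not waiting for COD only needs its status checked.
        return "status checked, no COD action pending"
    page = PROCEDURE_PAGES.get(request_type) if request_type else None
    if page is None:
        # Unknown kind of action: the shifter asks the COD managers.
        return "contact COD managers"
    return "follow " + page
```

For example, <code>next_step(True, "new_ngi_creation")</code> points the shifter to the New NGI creation page, while an unrecognised request type falls back to contacting the COD managers.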
 
=== New ROD team certification work instruction ===
* [[Procedure_to_handle_new_ROD_certification_GGUS_tickets | New ROD team certification work instruction]]
 
=== Availability/reliability reports ===
* Handling availability/reliability reports: [[Availability_and_reliability_internal_procedure_for_COD | Availability and reliability work instruction]]
** [[Availability_and_reliability_reports_metrics |  AR reports metrics]]
** [[Operations:COD_Escalation_Procedure|COD escalation procedure]]
 
=== Issues on Operational portal dashboard ===
*[[Operations:Escalation_for_operational_problem_with_ROD | Escalation for operational problem with ROD - work instruction]]
 
=== Handover ===
* At the end of the shift, a handover should be submitted covering:
** Problems on the dashboard which will pass to the next week: the GGUS ID of the ticket and when the next escalation step should be taken
** GGUS tickets assigned to COD: for each one, the last status and the action taken by the shifter
** Other issues: problems with the tools etc.
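The handover contents listed above map naturally onto a small record. The sketch below is an assumption about how such a record could be structured and checked before submission; the class names, field names and the <code>incomplete_entries</code> helper are hypothetical and do not correspond to any existing COD tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DashboardProblem:
    ggus_id: str          # GGUS ID of the ticket
    next_escalation: str  # when the next escalation step should be taken


@dataclass
class CodTicket:
    ggus_id: str
    last_status: str
    action_taken: str     # action taken by the shifter


@dataclass
class Handover:
    dashboard_problems: List[DashboardProblem] = field(default_factory=list)
    cod_tickets: List[CodTicket] = field(default_factory=list)
    other_issues: List[str] = field(default_factory=list)  # e.g. problems with the tools


def incomplete_entries(handover: Handover) -> List[str]:
    """List entries that lack a field the handover is expected to carry."""
    gaps = []
    for p in handover.dashboard_problems:
        if not p.ggus_id or not p.next_escalation:
            gaps.append("dashboard problem " + (p.ggus_id or "<no id>"))
    for t in handover.cod_tickets:
        if not t.ggus_id or not t.last_status or not t.action_taken:
            gaps.append("COD ticket " + (t.ggus_id or "<no id>"))
    return gaps
```

An empty result from <code>incomplete_entries</code> means every dashboard problem has an escalation time and every COD ticket has both a status and an action recorded.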
 
== Internal area ==
 
* [[Operations:Meetings|Phone confs Meetings and Agenda]]
 
* [[Operations:List_of_typical_problems_with_RODs| List of typical problems with RODs]]
 
 
''NOTE: all procedures should contain the following template: https://wiki.egi.eu/wiki/PDT:Procedure_Template''
 
=== Approved ===
 
* [[Operations:NewNGIs_creation |  New NGI creation process coordination]]
 
* [[Availability_and_reliability_monthly_statistics |  Handling availability/reliability reports]]
 
* [[Operations:COD_Escalation_Procedure|COD escalation procedure]]
 
* [[Operations:Operations_Centre_decommission|Operations Centre decommission process coordination]]
 
* [[Operations:Setting_Nagios_tests_critical_procedure| Procedure for setting Nagios tests critical for COD]]
 
=== To be approved by OMB ===
 
 
 
=== OTAG topics ===
 
==== Operational Portal: Dashboard ====
* [https://rt.egi.eu/rt/Search/Results.html?Format=%27%20%20%20%3Cb%3E%3Ca%20href%3D%22__WebPath__%2FTicket%2FDisplay.html%3Fid%3D__id__%22%3E__id__%3C%2Fa%3E%3C%2Fb%3E%2FTITLE%3A%23%27%2C%0A%27%3Cb%3E%3Ca%20href%3D%22__WebPath__%2FTicket%2FDisplay.html%3Fid%3D__id__%22%3E__Subject__%3C%2Fa%3E%3C%2Fb%3E%2FTITLE%3ASubject%27%2C%0A%27__Status__%27%2C%0A%27__QueueName__%27%2C%0A%27__OwnerName__%27%2C%0A%27__Priority__%27%2C%0A%27__NEWLINE__%27%2C%0A%27%27%2C%0A%27%3Csmall%3E__Requestors__%3C%2Fsmall%3E%27%2C%0A%27%3Csmall%3E__CreatedRelative__%3C%2Fsmall%3E%27%2C%0A%27%3Csmall%3E__ToldRelative__%3C%2Fsmall%3E%27%2C%0A%27%3Csmall%3E__LastUpdatedRelative__%3C%2Fsmall%3E%27%2C%0A%27%3Csmall%3E__TimeLeft__%3C%2Fsmall%3E%27&Order=ASC|ASC|ASC|ASC&OrderBy=id|||&Page=1&Query=Owner%20%3D%20%27mkrakowi%27%20AND%20Queue%20%3D%20%27otag%27&RowsPerPage=50&SavedChartSearchId=new&SavedSearchId=| RT tickets]
* [[Operations:COD_interaction_with_Dashboard_team| COD interaction with Dashboard team (draft)]]
* [[Operations:COD_OTAG_topics| COD topics to be discussed at the OTAG meeting]]
* [[Operations:COD_Dashboard_requirements|Collection of dashboard requirements regarding COD work (draft)]]
 
==== GOC DB ====
* [[Operations:COD_GOCDB_requirements|Collection of GOC DB requirements regarding COD work (draft)]]
 
=== Pages in draft state ===
 
 
* [[Operations:COD_Improvements_to_availability_procedure|Improvements to Availability Calculation Procedure (draft)]]
 
* [[Operations:A/R_fixing_procedure| A/R fixing procedure (draft)]]
 
 
 
 
[[Category:COD]]

Latest revision as of 14:25, 25 October 2021