The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can contact the EGI SDIS team at operations @ egi.eu.

EGI Infrastructure operations oversight



EGI Grid Operations oversight of the e-Infrastructure is a co-ordination task that ensures Grid monitoring across EGI runs smoothly. The team liaises among three groups: Operations and e-Infrastructure Oversight (OE); Operational Documentation (OD); and "Coordination of interoperations between NGIs and with other Grids".

The Operations oversight team works with the Tool Developers (particularly the OTAG group), the NGIs and their Operations Teams (ROD). The co-ordinators and others working on the tasks hold regular phone meetings. The OE co-ordinators also organise face-to-face meetings for the ROD teams 3 to 4 times a year.

COD managers:
Ron Trompert (Chair), Marcin Radecki, Luuk Uljee, Malgorzata Krakowian
COD shifters:
Malgorzata Krakowian, Ron Trompert, Luuk Uljee, Maarten van Ingen, Ernst Pijper, Alexander Verkooijen
Contact:
  • Three mailing lists are used, each for a different purpose:
    • manager-central-operator-on-duty AT mailman.egi.eu - for COD managerial issues, such as suggesting changes to procedures or tools. The COD managers are the recipients of this list.
    • central-operator-on-duty AT mailman.egi.eu - for reporting COD day-to-day issues, such as problems with tools or Nagios tests. The COD shifters are the recipients of this list.
    • all-central-operator-on-duty AT mailman.egi.eu - for contacting all ROD teams in the NGIs. Every ROD team is a recipient of this list.

Information for Regional Operators (ROD)

ROD and COD Performance

Nagios tests

About Central Operator on Duty (COD)

Duties

  • COD managers
    • representing RODs/COD in OTAG, OMB and Operations meetings - collecting requirements and improvement proposals from RODs concerning operations tools and procedures
    • suspending Resource Centres in case of operational issues
    • taking part in the OLA task force
    • writing new procedures - when needed, COD takes part in the procedure creation process
    • preparing ROD newsletters - informing RODs about recent and upcoming developments related to Grid Oversight
    • preparing ROD metrics reports - providing an overview of the operations support process in the grid infrastructure
  • COD shifters
    • escalation of operational problems with RODs
      • Alarms older than 3 days without an assigned ticket
      • Tickets that expired 3 days ago
      • Tickets which have not been solved for 30 days
    • escalation of operational problems with sites which cannot be solved on NGI level
      • Tickets transferred to C-COD (last escalation step)
      • Sites in downtime for more than 1 month
    • dealing with GGUS tickets assigned to COD
    • process coordination of:
      • creation and decommissioning of Operations Centres
      • setting a Nagios test as an operations test
      • getting explanations for low availability and reliability metrics
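
The escalation thresholds listed under the COD shifter duties can be expressed as simple time comparisons. The following is a minimal sketch, not the Operations Portal's actual logic: the `Alarm` and `Ticket` records and all function names are hypothetical, the 1-month downtime limit is approximated as 30 days, and the choice of strict versus inclusive comparisons is an interpretation of the wording in the list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Thresholds taken from the duties list; "1 month" is approximated
# as 30 days (an assumption).
ALARM_WITHOUT_TICKET = timedelta(days=3)
TICKET_EXPIRED = timedelta(days=3)
TICKET_UNSOLVED = timedelta(days=30)
SITE_DOWNTIME = timedelta(days=30)


@dataclass
class Alarm:
    """Hypothetical alarm record, not an Operations Portal type."""
    raised_at: datetime
    ticket_id: Optional[str] = None


@dataclass
class Ticket:
    """Hypothetical ticket record, not a GGUS type."""
    opened_at: datetime
    expires_at: datetime
    solved: bool = False


def alarm_needs_escalation(alarm: Alarm, now: datetime) -> bool:
    # Alarms older than 3 days without an assigned ticket.
    return alarm.ticket_id is None and now - alarm.raised_at > ALARM_WITHOUT_TICKET


def ticket_escalation_reasons(ticket: Ticket, now: datetime) -> List[str]:
    # Returns every rule from the list that an open ticket matches.
    reasons: List[str] = []
    if ticket.solved:
        return reasons
    if now - ticket.expires_at >= TICKET_EXPIRED:
        reasons.append("expired 3 days ago")
    if now - ticket.opened_at >= TICKET_UNSOLVED:
        reasons.append("not solved for 30 days")
    return reasons


def site_downtime_needs_escalation(downtime_start: datetime, now: datetime) -> bool:
    # Sites in downtime for more than 1 month (approximated as 30 days).
    return now - downtime_start > SITE_DOWNTIME
```

A shifter's tooling could call these checks with the current time for each alarm, ticket and site in downtime, and list the items for which a check returns a positive result.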

Internal area

COD shifters daily work instructions

This section collects all work instructions, which specify exactly what steps to follow to carry out an activity.

Action Description Related procedures
GGUS tickets assigned to COD

The COD shifter must check the current status of all GGUS tickets assigned to COD.


In case of a request for:

If the shifter does not know what action to take, they should contact the COD managers via manager-central-operator-on-duty AT mailman.egi.eu

Availability/reliability reports
Operational portal dashboard issues
Handover

COD dashboard link

  • At the end of the shift, a handover should be submitted (sent to COD) via the Handover tool in the Operational Portal
    • Problems on the dashboard that will pass to the next week: the GGUS ID of the ticket and when the next escalation step should be taken
    • GGUS tickets assigned to COD: for each ticket, its last status and the action taken by the shifter should be provided
    • Other issues: problems with tools etc.
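
The three parts of a handover described above can be modelled as a small structured record. This is only an illustrative sketch: the class names, field names and the plain-text layout are assumptions, and it does not reflect the data format of the Handover tool in the Operational Portal.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DashboardProblem:
    """Hypothetical entry for a problem that passes to the next week."""
    ggus_id: str
    next_escalation: str  # when the next escalation step should be taken


@dataclass
class CodTicket:
    """Hypothetical entry for a GGUS ticket assigned to COD."""
    ggus_id: str
    last_status: str
    action_taken: str


@dataclass
class Handover:
    dashboard_problems: List[DashboardProblem] = field(default_factory=list)
    cod_tickets: List[CodTicket] = field(default_factory=list)
    other_issues: List[str] = field(default_factory=list)

    def render(self) -> str:
        # Build a plain-text report with one section per handover part.
        lines = ["Problems on the dashboard passing to next week:"]
        for p in self.dashboard_problems:
            lines.append(f"  - GGUS {p.ggus_id}: next escalation step on {p.next_escalation}")
        lines.append("GGUS tickets assigned to COD:")
        for t in self.cod_tickets:
            lines.append(f"  - GGUS {t.ggus_id}: status '{t.last_status}', action: {t.action_taken}")
        lines.append("Other issues:")
        for issue in self.other_issues:
            lines.append(f"  - {issue}")
        return "\n".join(lines)
```

Keeping the three parts as separate lists mirrors the structure of the handover and makes it easy to spot a section that was left empty.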


NOTE: all procedures should contain the following template: https://wiki.egi.eu/wiki/PDT:Procedure_Template

Procedures

To be approved by OMB

OTAG topics

Operational Portal: Dashboard

GOC DB

Pages in draft state