Regional Operator on Duty welcome

From EGIWiki
Revision as of 13:50, 23 October 2014 by Krakow (talk | contribs)





Welcome on board!

Dear ROD member,

this page was created to help you get started with ROD duties. It briefly shows what you should do to prepare yourself to act as a ROD member.

First, the section "How to become a ROD member" describes the steps which need to be taken before you start your work. The section "ROD duties" lists all the tasks that make up the ROD duties. The section "Important to read" gives a short introduction to the documents which concern this activity and which you are supposed to read at the beginning. The "Tools" section describes the tools used by ROD teams. Finally, the "Contact" section explains how to contact others.

How to become a ROD member

There are a few actions which need to be taken before you start your work:

  1. Get a valid grid certificate issued by a certification authority - this step is important because most of the tools used during the shift require a certificate. To find a certification authority, see the EUGRIDPMA members.
  2. Register in the Dteam VO - Dteam membership will let you test sites and debug problems.
  3. Register in the GGUS tool as support staff - GGUS is the ticketing system used for operational purposes within EGI. With the support staff role you will be able to reply to and update recorded tickets.
  4. Register in the GOC DB tool - GOC DB is a central database which contains all the information about the EGI grid infrastructure (sites and people). To be a ROD member you have to be recorded in this database. It will allow you to perform step 5.
  5. Request the Regional Staff role in the GOC DB - Thanks to this role you will be recognized automatically in operations tools as a ROD member. It gives you several privileges in the database as well as in other tools.
  6. Contact your NGI manager - you need to contact your NGI manager to be approved in GOC DB as Regional Staff and to be added to the ROD mailing list in your NGI (this mailing list is the contact point for the whole ROD team within the NGI).
  7. Get familiar with the ROD wiki page - the ROD wiki page is the single place where you will find all information relevant to your work as a ROD member.
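
The ordering constraints in the list above can be sketched as a small checklist helper. This is only an illustrative sketch, not part of any EGI tool: the names STEPS, PREREQUISITES and next_steps are hypothetical, and the only dependencies encoded are the two stated in the text (step 5 needs the GOC DB registration from step 4, and the approval in step 6 follows the role request in step 5).

```python
# Hypothetical onboarding checklist; step titles paraphrase the list above.
STEPS = {
    1: "Get a valid grid certificate",
    2: "Register in the Dteam VO",
    3: "Register in GGUS as support staff",
    4: "Register in GOC DB",
    5: "Request the Regional Staff role in GOC DB",
    6: "Contact your NGI manager (approval and ROD mailing list)",
    7: "Get familiar with the ROD wiki page",
}

# Assumption: only the dependencies stated in the text are modelled here.
PREREQUISITES = {
    5: {4},  # the role can only be requested once you are recorded in GOC DB
    6: {5},  # the NGI manager approves the role requested in step 5
}


def next_steps(done):
    """Return the step numbers that are not done yet and whose
    prerequisites have all been completed."""
    done = set(done)
    return [
        number
        for number in sorted(STEPS)
        if number not in done and PREREQUISITES.get(number, set()) <= done
    ]


if __name__ == "__main__":
    # Example: steps 1 to 4 are finished, so the role request is now possible.
    for number in next_steps({1, 2, 3, 4}):
        print(number, STEPS[number])
```

A new member could track completed steps in a set and call next_steps to see what can be done next; in practice a certificate is likely needed earlier than this sketch enforces, since most tools require one.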


To see how to perform all these actions, please watch the video "How to become a ROD member" (the 7 steps which should be done to become a ROD member).

ROD duties

The Regional Operations team is responsible for detecting problems, coordinating the diagnosis, and monitoring the problems through to a resolution. It monitors sites in its region, reacts to problems identified by the monitors, either directly or indirectly, provides support to sites as needed, adds to the knowledge base, and provides information flow to oversight bodies in cases of non-reactive or non-responsive sites. ROD is a team responsible for solving problems on the infrastructure according to agreed procedures. It ensures that problems are properly recorded and progress according to specified timelines, and that the necessary information is available to all parties. Each Operation Center provides such a team, and its work requires procedural knowledge of the process rather than technical skills.

All the duties listed below are mandatory for a ROD team:

  • Handling incidents - The main responsibility of ROD is to deal with incidents at sites in the region. This includes making sure that the tickets are opened and handled properly. The procedure for handling tickets is described in the EGI Infrastructure Oversight escalation procedure.
  • Propagating actions from EGI Operations / Operations Support down to sites - ROD is responsible for ensuring that decisions taken at the EGI Operations level are propagated to sites.
  • Putting a site in downtime or suspending it for urgent matters - In general, ROD can place a site in downtime (in the GOC DB) if the site requests it or if ROD sees an urgent need to do so. ROD may also suspend a site, under exceptional circumstances, without going through all the steps of the escalation procedure. For example, if a security hazard occurs, ROD must suspend the site on the spot. It is important to know that EGI Operations / Operations Support can also suspend a site in the case of an emergency, e.g. security incidents or lack of response.
  • Notifying EGI Operations about core or urgent matters - ROD should create tickets to EGI Operations in the case of core or urgent matters.


Important to read

Before you start your duties you should get familiar with the following documents:

  • Grid Oversight Escalation Procedure - this document defines the escalation procedure for operational problems. It describes the steps and timelines which a ROD team should follow.
  • Dashboard HowTOs and Training Guides - a collection of HowTOs and guides for EGI Operations. It includes a Dashboard HOWTO, Training Guides which can be used as a presentation for training staff, and quick sheets.
  • ROD FAQ - Frequently Asked Questions related to ROD work.


It is also important to watch the video tutorials prepared for ROD teams. They will walk you through several topics which are important for your work.

Tools

ROD uses several operations tools to perform its duties (Operations tools video):

  • Operations Portal - The Dashboard tool on the Operations Portal is the main tool used by ROD teams. All actions concerning incidents (alarms and tickets) should be performed using this tool.
  • The Service Availability Monitoring tool (Nagios) - SAM is the official EGI monitoring system, based on Nagios. It checks the availability of grid services and creates alarms visible on the Operations Portal dashboard when a service fails.
  • GGUS - the EGI central helpdesk system designed for problem reporting and tracking. ROD should not handle operations tickets here. ROD teams should use GGUS to report problems with operations tools and middleware.
  • GOC DB - a central database which contains all static information about the grid infrastructure (sites and people).


Links to Operations tools can be found here.

Contact

Each ROD team is supposed to provide its own mailing list as a contact point for the team. The list of people responsible for ROD in a given NGI and their contact points can be found in the Operations Portal.

All ROD mailing lists are also subscribed to the all-central-operator-on-duty AT mailman.egi.eu mailing list, so you can use this list to contact other ROD teams.

To contact the EGI Operations Support team you can:

  • submit a GGUS ticket and assign it to the EGI Operation Support support unit
  • send an email to operations-support AT mailman.egi.eu

You are welcome to send us questions if you have any doubts concerning ROD duties.