The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team at operations@egi.eu.

EGI Operations Start Guide




Introduction

The EGI Operations Start Guide was created to help you get started with EGI Operations duties. It presents the responsibilities of the various parties involved in running the EGI infrastructure and explains how to join operations. As a newcomer, you need to understand the structure of the EGI project and the roles of operators at different levels. Reading the whole document will give you a complete picture of daily operations within EGI.

Roles

The following describes the roles that are commonly found in the EGI Infrastructure and Operations. Other terms and definitions can be found in EGI Glossary.

Site level

Site Administrator

The person responsible for keeping the site operational. In the scope of Operations, site administrators primarily receive and react to notifications of one or more incidents at their site. Site administrators will also need to react to security issues that are at a global level but affect their site.

Site Operations Manager

The person responsible for the site at the political and legal level. S/he is responsible for signing the Operations Level Agreement (OLA) between the Site and the NGI that hosts the site operationally. The Site Operations Manager is also responsible for assigning and approving the other site roles in the GOCDB. Further, s/he should ensure that administrators are subscribed to relevant mailing lists. The Site Operations Manager manages the Resource Center.

Site Security Officer

The person responsible for keeping the site compliant with the Security policies. She/he is also the primary contact for the NGI Security officer and EGI CSIRT. The Site Security Officer deals with security incidents and shall respond to enquiries in a timely fashion as defined in the collection of security procedures and policies.

Regional level

Regional Operator on Duty (ROD)

A team responsible for solving problems/incidents in the infrastructure according to agreed procedures. ROD teams monitor the sites in their region, react to problems identified by the monitoring tools, and oversee problems through to their resolution. They ensure that problems are properly recorded and that the solutions progress according to specified timelines. They also provide support to sites and VOs as needed and keep oversight bodies informed in cases of non-responsive sites. They ensure that all necessary information is available to all parties. The team is provided by each NGI, and its work requires procedural knowledge of the process rather than technical skills. New ROD team members are required to read the ROD Welcome page and be familiar with the ROD wiki page.

NGI Security officer

The member of the EGI-CSIRT IRTF (Incident Response Task Force) currently on shift. Further information can be found at the CSIRT:IRTF page. The role of the IRTF team is to handle day-to-day operational security issues and coordinate computer security incident response across the EGI infrastructure. NGIs and Sites MUST respond in a timely manner to its requests and alerts.

NGI Manager

The NGI Manager is the contact point for all operational matters and represents the NGI within the OMB.

Project level

Chief Operations Officer

The Chief Operations Officer leads EGI Operations and is responsible for coordinating the operations of the infrastructure across the project.

Central Operator on Duty (COD)

The Central Operator on Duty (COD) team is provided at the global level and is responsible for supporting and actively controlling the overall status of Grid services and sites. It coordinates the Regional Operator on Duty (ROD) teams and represents the whole ROD structure, both in terms of technical requirements for operations tools and at the political level.

VO

A Virtual Organisation (VO) is a group of users and, optionally, resources, often not bound to a single institution or national borders, who, by reason of their common membership and in sharing a common goal, are given authority to use a set of resources. Each VO member signs the VO AUP (during registration) which is the policy document describing the goals of the VO thereby defining the expected and acceptable use of the Grid by the users of the VO. User documentation can be found here.

VO manager

An individual responsible for the membership registry of the VO including its accuracy and integrity.

Joining operations

In order to join any of the organisational groups in your NGI, you will need to go through the following steps in order:

Obtain a Grid certificate.

If you do not already have a Grid certificate, this page provides a map of all certification authorities by country (or NGI). Select your country on the map to find your local CA. Follow the procedure for your local CA to request a certificate. When you have received your certificate, install it in your web browser.

CERN provides a webpage for testing your certificate here. Please use this resource and contact your CA if your certificate does not work.

Join Dteam VO

It is recommended to join the dteam VO at the dteam Registration page. You should request group membership for /dteam and /dteam/YOUR_NGI. The dteam group manager will then be notified by the VOMRS software.

Request GOCDB access

In the User status box in the lower left corner, select Register a New Account. In the screen that appears, enter your name, contact information and your DN.
Select Manage Roles in the User status section and, on the next page, select either your NGI or your Site, depending on the level of your role. On the following page, request the role according to the role definition.

All new members then need to notify their NGI manager about their role request, as GOCDB currently does not send any notification about pending requests.

Register into GGUS

To register into GGUS please follow the Central GGUS registration link. GGUS can be accessed with only your certificate, which is adequate for normal users. However, ROD team members must register and apply for the support role. (The GGUS support staff will approve you quickly as they get the notification automatically.)

Some NGIs also have a local helpdesk or a regional GGUS. Ask your NGI manager how to register with them.

Subscribe to mailing lists.

NGIs and Sites have local mailing lists for ROD team members and Site Administrators respectively. Please ensure that you subscribe to them. Depending on your role ask your NGI manager or Site manager to have you included on the necessary mailing lists if there is no automatic subscription process.

NGI managers should contact operations@egi.eu and state that they wish to be subscribed to the noc-managers mailing list (noc-managers@mailman.egi.eu).

Duties

NGI management

  • Contact operations@egi.eu and state that you wish to be subscribed to the noc-managers mailing list (noc-managers@mailman.egi.eu).
  • Register for the regional manager role in the GOCDB. Note: There is some inconsistency with this method as the NGI Manager role is also created with the creation of the NGI. Registering for this role may be necessary if the staff changes.

Resource Management

The NGI manager is responsible for keeping the NGI entry in the GOCDB up to date. They are also responsible for managing the status of all sites under that NGI, and ensuring that that information is also kept current.

Availability/Reliability

The NGI manager is responsible for addressing problems with Site availability or reliability. The reports are issued on a monthly basis and the NGI manager has 10 days to respond to identified problems.
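
As a small illustration of the response window, the sketch below computes the date by which a reply is due. It assumes that the 10 days are calendar days counted from the day the report is issued, which the text above does not specify. The names `response_deadline` and `is_overdue` are hypothetical and do not belong to any EGI tool.

```python
from datetime import date, timedelta

# Assumption: the 10-day window is counted in calendar days from the issue date.
RESPONSE_WINDOW_DAYS = 10


def response_deadline(report_issued: date) -> date:
    """Return the last day on which a response to the report is still on time."""
    return report_issued + timedelta(days=RESPONSE_WINDOW_DAYS)


def is_overdue(report_issued: date, today: date) -> bool:
    """Return True once the response window has passed."""
    return today > response_deadline(report_issued)
```

If the window turns out to be counted in working days, only `response_deadline` would need to change.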

Communication

NGI managers are obliged to attend regular Operations Management Board (OMB) meetings. These occur monthly and are held either by phone conference (EVO) or face to face (about three times a year), co-located with EGI conferences.

The NGI manager(s) have a responsibility to communicate with COD as described below, and as needed with other administrative bodies within EGI.

  • Communication with COD
    • COD and COD management can be contacted according to the information on the COD pages.
    • Use the GGUS ticketing system for
      • Operations Centre (OC) creation/decommissioning processes
      • availability/reliability reports
      • issues regarding site suspension
  • Communication within the NGI (recommendations)
    • Mailing lists for ROD, Sites, other management teams
    • Weekly or biweekly status meetings (EVO/phone/chat)
    • Face to face meetings semi-regularly


ROD team management

  • Organize and manage ROD Teams within the NGI. Optionally, appoint a ROD representative person.
  • Instructions for the requirements of a ROD team are in the Operations Centre creation procedure.
    • Ensure that the ROD team members are members of a mailing list which is forwarded to COD.

Site Management

Resource Management

The NGI manager is responsible for managing the status of all sites under that NGI, and ensuring that that information is also kept current.

The following shows the allowed site status transitions:

[Figure: SiteStatusFlow.png (site status flow diagram)]

Create

The NGI is responsible for Site/Resource Centre management, which covers the creation, certification, suspension, and closure or removal of sites. See the Resource Centre Registration and Certification Procedure document for how to add a new site to your NGI.

Certify

After a Site is fully registered in the GOCDB and all steps in the certification part of the procedure are completed, the NGI Manager should change the site status from "uncertified" to "certified". Monitoring for the site and its nodes will now be switched on, and cannot be switched off.

Suspend

If a site requires suspension, the site certification status in GOCDB should be changed from "production" to "suspended". To bring the site back into production, the certification procedure must be repeated from the beginning.

Close or Remove site

There is a site status in GOCDB called closed. This status is for sites which are no longer in use because they have either been closed or replaced by another site.


To remove a site completely from the GOCDB, which includes removing its history, please contact GOCDB support.
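
The status changes described in the Certify, Suspend and Close paragraphs can be sketched as a small transition table. This is only an illustration based on the text, not the content of the site status diagram and not GOCDB code. It assumes that the "production" state named in the Suspend paragraph corresponds to the "certified" status, that a suspended site returns to "uncertified" when it restarts certification, and that "closed" is final. All names are hypothetical.

```python
# Transitions taken from the text above; the real diagram may allow more.
ALLOWED_TRANSITIONS = {
    "uncertified": {"certified"},
    # Assumption: "production" in the text means the "certified" status.
    "certified": {"suspended", "closed"},
    # Assumption: re-certification "from the beginning" means back to "uncertified".
    "suspended": {"uncertified", "closed"},
    # Assumption: a closed site does not change status again.
    "closed": set(),
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the status change appears in the table above."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def change_status(current: str, target: str) -> str:
    """Return the new status, or raise ValueError for a change not in the table."""
    if not can_transition(current, target):
        raise ValueError(f"transition {current!r} -> {target!r} is not allowed")
    return target
```

For example, `can_transition("suspended", "certified")` is false in this sketch, which mirrors the rule that a suspended site has to go through certification again.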

Managing service interventions (Downtimes)

Downtimes must be handled as described in MAN02. Downtimes affect the monthly availability and reliability statistics.

In specific cases, i.e. when sites can foresee a downtime that would drastically affect the monthly statistics, sites may want to go into voluntary suspension. The NGI manager should be contacted to arrange this. Later, the site will have to go through re-certification (see PROC09) in order to enter normal operation again.

Responsiveness

In the scope of Regional Operations, site administrators primarily receive and react to notifications of one or more incidents. They should also provide information in the site notepad, available on the dashboard. Site administrators can also view their site on the operations dashboard.

Communication Lines - contacting 1st Line Support, ROD

Site administrators may always send a "request for help" to 1st Line Support and/or ROD through appropriate mailing lists. These contact lists should be provided by your ROC. However, in general, communication with COD or ROD shall be via the GGUS system when responding to tickets.

Responding to alarms and tickets

Sites should respond to alarms and tickets in a suitable time frame. Site administrators should be aware of the alarms at their site, e.g. through the dashboard. ROD will create a ticket for any alarm older than 24 hours. Details can be looked up in the DashboardHowto.
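
The 24-hour rule can be illustrated with a short sketch that picks out the alarms old enough for ROD to open a ticket. The data layout (a dictionary mapping an alarm name to the time it was raised) and the function name are assumptions made for this example; the dashboard itself is not queried here.

```python
from datetime import datetime, timedelta, timezone

# From the text above: ROD creates a ticket for any alarm older than 24 hours.
TICKET_THRESHOLD = timedelta(hours=24)


def alarms_needing_ticket(alarms: dict, now: datetime) -> list:
    """Return the names of alarms raised more than 24 hours before `now`.

    `alarms` maps a (hypothetical) alarm name to the datetime it was raised.
    """
    return sorted(
        name for name, raised in alarms.items() if now - raised > TICKET_THRESHOLD
    )


# Usage with made-up alarms: only the 30-hour-old alarm is selected.
now = datetime(2022, 3, 2, 12, 0, tzinfo=timezone.utc)
example = {
    "alarm-a": now - timedelta(hours=30),
    "alarm-b": now - timedelta(hours=3),
}
print(alarms_needing_ticket(example, now))  # prints ['alarm-a']
```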

Modifying/Updating tickets

Site administrators should respond to tickets within the time limit specified in the Resource Centre OLA (currently 8 working hours). Sites can respond to tickets directly via email, or through the GGUS interface. In order to keep the ticket content succinct, email responses should not contain superfluous information.
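
To make the "8 working hours" limit concrete, the sketch below computes when a response is due for a ticket opened at a given time. The working calendar is an assumption: Monday to Friday, 09:00 to 17:00 local time, with no public holidays; the Resource Centre OLA remains the authoritative definition. Naive (timezone-free) datetimes are expected, and `response_due` is a hypothetical name.

```python
from datetime import datetime, time, timedelta

# Assumed working calendar: Monday to Friday, 09:00 to 17:00, no holidays.
WORK_START = time(9, 0)
WORK_END = time(17, 0)
RESPONSE_HOURS = 8  # limit quoted above from the Resource Centre OLA


def response_due(opened: datetime, hours: float = RESPONSE_HOURS) -> datetime:
    """Return the moment at which `hours` working hours have elapsed since `opened`."""
    remaining = timedelta(hours=hours)
    current = opened
    while True:
        day_start = datetime.combine(current.date(), WORK_START)
        day_end = datetime.combine(current.date(), WORK_END)
        # Only weekdays count, and only the part of the day still ahead of us.
        if current.weekday() < 5 and current < day_end:
            start = max(current, day_start)
            available = day_end - start
            if remaining <= available:
                return start + remaining
            remaining -= available
        # Continue from the start of the next calendar day.
        current = datetime.combine(current.date() + timedelta(days=1), WORK_START)
```

With these assumptions, a ticket opened on a Friday at 16:00 is due the following Monday at 16:00, because only one working hour is left on Friday.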

Providing Information in the Dashboard Notepad

Site administrators are encouraged to actively use the site notepad in the dashboard, e.g. to record additional information about the status of incidents and solutions to problems.

Monitoring

By default, the service endpoints of the sites are monitored by dedicated monitoring instances (Nagios boxes). These monitoring instances are provided by the NGI. All services in production are monitored and are thus required to support the ops VO. Site administrators can check the status of their service endpoints with the monitoring tools.

A complete list of tools can be found here.

Recent middleware version

Sites must only operate supported middleware versions. This implies upgrading from time to time. Emergency releases are treated in a special way. See Operations/Sites/Security and EGI CSIRT:Critical Vulnerability Handling.


A complete list of duties can be found in the Resource Centre OLA.


Documentation

Documentation relevant to EGI operations can also be found on the EGI Documentation wiki page.

Tools

A list of tools relevant to EGI operations can also be found on the EGI Tools wiki page.