The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team at operations @ egi.eu.

Difference between revisions of "EGI Operations Start Guide"

From EGIWiki
(43 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{Template:Doc_menubar}} {{Template:Man_menubar}} {{TOC_right}}  
{{Template:Op menubar}} {{TOC_right}}  


== Introduction  ==
== Introduction  ==


This document presents the procedures and responsibilities of the various parties involved in the running of the EGI infrastructure. As a newcomer, you need to understand the structure of the EGI project and the roles of operators at different levels, and read the parts of the manual which apply to you. You are also encouraged to read the other parts of the manual. This is not required, since we strive to keep the individual parts as independent as possible, but reading the whole document will give you a complete overall picture of daily operations within EGI.  
The EGI Operations Start Guide was created to help you start with EGI Operations duties. It presents the responsibilities of the various parties involved in the running of the EGI infrastructure and explains how to join operations. As a newcomer, you need to understand the structure of the EGI project and the roles of operators at different levels. Reading the whole document will give you a complete overall picture of daily operations within EGI.  


== Roles  ==
== Roles  ==


The following describes the roles that are commonly found in the EGI infrastructure and Operations.  
The following describes the roles that are commonly found in the EGI Infrastructure and Operations. Other terms and definitions can be found in [[Glossary|EGI Glossary]].  


==== User ====
=== '''Site level''' ===


:A user needs no complicated preparation: just a Grid certificate and registration in their (preferred) VO. Once this is done, the user is ready to run their application(s) in compliance with grid security and VO policies [https://wiki.egi.eu/wiki/SPG:Documents]. If the user needs help or wishes to report a problem, the information [[TNA3.3 NGI User Support Teams|here]] is useful.
==== Site Administrator  ====


==== Site Manager  ====
The person responsible for keeping the site operational. In the scope of Operations, site administrators primarily receive and react to notifications of incidents at their site. They also need to react to security issues that arise at a global level but affect their site. Site administrators should respond to [http://ggus.eu GGUS tickets] in a suitable time frame and be aware of the alarms at their site, e.g. through the [https://operations-portal.egi.eu operations dashboard]. Sites must only operate supported middleware versions, which implies upgrading the middleware from time to time. Emergency releases are treated in a special way; see [[EGI CSIRT:Critical Vulnerability Handling]].


:The Site Manager manages the Resource Center. <u>More information can be found at </u>[[Operations/Sites/Roles#Site_Manager|<u>Operations/Sites/Roles#Site_Manager</u>]]<u>.</u>
All Site management responsibilities are listed in [https://documents.egi.eu/document/31 RC OLA document].  


==== Site Administrator ====
==== Site Operations Manager ====


:In the scope of Operations, site administrators primarily receive and react to notifications of incidents at their site. A site administrator also needs to react to security issues that arise at a global level but also affect their site.<br>
The person responsible for the site at the political and legal level. S/he is responsible for signing the Operations Level Agreement ([https://documents.egi.eu/public/ShowDocument?docid=31 OLA]) between the Site and the NGI that hosts the site operationally. The Site Operations Manager is also responsible for assigning and approving the other site roles in the [https://goc.egi.eu/ GOCDB]. Further, s/he should ensure that administrators are subscribed to relevant mailing lists.  


==== Site Security Officer  ====
==== Site Security Officer  ====


:The Site Security Officer deals with security incidents and shall respond to enquiries in a timely fashion as defined in the collection of&nbsp; [https://wiki.egi.eu/wiki/EGI_CSIRT:Policies security procedures and policies].
The person responsible for keeping the site compliant with the [[EGI CSIRT:Policies|Security policies]]. She/he is also the primary contact for the NGI Security officer and EGI CSIRT. The Site Security Officer deals with security incidents and shall respond to enquiries in a timely fashion as defined in the collection of [[EGI CSIRT:Policies|security procedures and policies]].  
 
==== National Grid Initiatives  ====
 
:For European members, at the national level, sites (Resource Centres) are organised into ''National Grid Initiatives'' (NGIs). An NGI is run (or operated) by its '''''NGI Manager''''', who has the highest responsibility for the NGI and represents the NGI to the outside world. There are two levels of NGI Manager: one deals with the political side of the NGI and the other with the Operations part. Both may be covered by the same person. Other important persons in an NGI are the '''''Security Officer''''', '''''Regional Operator on Duty''''' ''(ROD)'' (usually a team of people), and, if applicable, a team of ''1st-line supporters''. In general, these teams oversee the sites, monitor their status and help them solve their problems.
 
==== Regional Operator on Duty (ROD)  ====
 
:A team responsible for solving problems/incidents in the infrastructure according to agreed procedures. ROD (teams) monitor the sites in their region, react to problems identified by the monitoring tools, and oversee&nbsp; problems through to their&nbsp; resolution. They ensure that problems are properly recorded and that the solutions progress according to specified time lines.&nbsp; They also provide support to sites and VOs as needed and provide informational flow to oversight bodies in cases of non-responsive sites. They ensure that all necessary information is available to all parties. The team is provided by each NGI and requires procedural knowledge on the process (rather than technical skills) for their work. ROD team members are required to read the [[Operations/ROD|ROD manual pages]].
 
==== Central Operator on Duty (COD)  ====
 
:A small team responsible for the coordination of ROD teams, provided on a global layer by EGI. COD represents the whole ROD structure at the political level. COD&nbsp;is currently run by the Polish and Dutch NGIs.<br>
 
==== Security Officer on Duty  ====
 


=== '''Regional level'''  ===


:The member of EGI-CSIRT IRTF (Incident Response Task Force) currently on shift. Further information can be found at the [https://wiki.egi.eu/wiki/EGI_CSIRT:IRTF%7C CSIRT:IRTF] page. The role of the IRTF team is to handle day-to-day operational security issues and coordinate Computer-Security-Incident-Response across the EGI infrastructure. NGIs and Sites '''MUST''' respond in a timely manner to its requests and alerts.<br>
==== Regional Operator on Duty (ROD)<br> ====


A team responsible for solving problems/incidents in the infrastructure according to agreed procedures. ROD (teams) monitor the sites in their region, react to problems identified by the monitoring tools, and oversee problems through to their resolution. They ensure that problems are properly recorded and that the solutions progress according to specified time lines. They also provide support to sites and VOs as needed and provide information flow to oversight bodies in cases of non-responsive sites. They ensure that all necessary information is available to all parties. The team is provided by each NGI and requires procedural knowledge on the process (rather than technical skills) for their work. New ROD team members are required to read the [[Grid operations oversight/ROD Welcome page|ROD Welcome page]] and be familiar with the [[Grid operations oversight/ROD|ROD wiki page]].


==== NGI Security officer  ====


Operations roles are registered in the GOCDB. Depending on your NGI, you may use a regional version of the GOCDB or the [https://gocdb4.esc.rl.ac.uk/portal/ central instance] thereof. How to register for a role is also described in [[Operations/General/Joining operations|Joining Operations]].  
The member of EGI-CSIRT IRTF (Incident Response Task Force) currently on shift. Further information can be found at the [[EGI CSIRT:IRTF|CSIRT:IRTF]] page. The role of the IRTF team is to handle day-to-day operational security issues and coordinate Computer-Security-Incident-Response across the EGI infrastructure. NGIs and Sites '''MUST''' respond in a timely manner to its requests and alerts.  


The following roles are required at the NGI level:
==== NGI operations manager  ====


==== NGI Manager  ====
The NGI operations manager is the contact point for all operational matters and represents the NGI within the [[OMB|Operations Management Board]].


This is the technical manager for all NGI activities.
S/he is mainly responsible for:


*Contact ''operations@egi.eu'' and state that you wish to be subscribed to noc-managers mailing list ''noc-managers@mailman.egi.eu''.
*keeping the NGI entry in the GOCDB up to date and for managing the status of all sites under that NGI, and ensuring that that information is also kept current
*Register for the ''regional manager'' role in the GOCDB. Note: There is some inconsistency with this method as the NGI Manager role is also created with the creation of the NGI. Registering for this role may be necessary if the staff changes.
*addressing problems with Site availability or reliability. The reports are issued on a monthly basis and the NGI operations manager has 10 days to respond to identified problems
*attending regular [[OMB|Operations-Management-Board (OMB) meetings]]


<br>
All NGI operations management responsibilities are listed in [https://documents.egi.eu/document/463 RP OLA document].


==== Deputy Manager ====
=== '''Project level''' ===


This is a backup role for the technical NGI Manager.
==== Chief Operations Officer  ====


*register for the ''deputy manager'' role in the GOCDB
Chief Operations Officer leads EGI Operations, and is responsible for coordinating the operations of the infrastructure across the project.


<br>
==== EGI CSIRT  ====


==== Security Officer  ====
[[Security|EGI CSIRT]] is an official security team coordinator and contact point at project level.


The security officer is responsible for security management at an NGI level
==== Operations Support  ====


*register the ''security officer'' role in the GOCDB
The Operations Support team is provided on a global layer and is responsible for supporting EGI Operations. Examples of its activities are service level management, service level reporting, service management in general and central technical.


<br>
==== VO  ====


==== Regional operational staff  ====
A Virtual Organisation (VO) is a group of users and, optionally, resources, often not bound to a single institution or national borders, who, by reason of their common membership and in sharing a common goal, are given authority to use a set of resources. Each VO member signs the VO AUP (during registration) which is the policy document describing the goals of the VO thereby defining the expected and acceptable use of the Grid by the users of the VO. User documentation can be found [[User Documentation|here]].


*Depending on the NGI and Operations Centre organisation, there may also be a ROD representative.
==== VO manager  ====
*Other roles in an NGI are the regional operational staff, also known as [[Operations/ROD|ROD]]


==== Other ROLES ====
An individual responsible for the membership registry of the VO including its accuracy and integrity.
 
===== 1st Line Support  =====
 
:&nbsp; 1st Line Support is a small team which redirects GGUS tickets to the appropriate support unit. Under the EGI Operations model this team acts as a First Line Support team.
 
===== VO  =====
 
:A Virtual Organisation (VO) is a grouping of users and, optionally, resources, often not bound to a single institution or national borders, who, by reason of their common membership and in sharing a common goal, are given authority to use a set of resources. Each VO member signs the VO AUP (during registration) which is the policy document describing the goals of the VO thereby defining the expected and acceptable use of the Grid by the users of the VO.
 
===== VO manager  =====
 
:An individual responsible for the membership registry of the VO including its accuracy and integrity.
 
 
 
<br>


== Joining operations  ==
== Joining operations  ==
A minimal list of requirements for joining operations teams.


In order to join any of the organisational groups in your NGI, you will need to go through the following steps in order:  
In order to join any of the organisational groups in your NGI, you will need to go through the following steps in order:  


==== Obtain a Grid certificate.  ====
=== Obtain a Grid certificate.  ===


:If you do not already have a GRID certificate [http://www.eugridpma.org/members/worldmap/ this page] provides a map of all certification authorities according to country (or NGI). Select your country on the map to find out who is your local CA. Follow the procedure for your local CA to request a certificate. When you have received your certificate, install it into your web browser.
If you do not already have a GRID certificate [http://www.eugridpma.org/members/worldmap/ this page] provides a map of all certification authorities according to country (or NGI). Select your country on the map to find out who is your local CA. Follow the procedure for your local CA to request a certificate. When you have received your certificate, install it into your web browser.  


:CERN provides a webpage for testing your certificate [https://grid-deployment.web.cern.ch/grid-deployment/cgi-bin/CertTest/CertTest.cgi here]. Please use this resource and contact your CA if your certificate does not work.
If you are setting up a new Resource Center, please also request a host certificate.  


==== Request GOCDB access. ====
CERN provides a webpage for testing your certificate [https://grid-deployment.web.cern.ch/grid-deployment/cgi-bin/CertTest/CertTest.cgi here]. Please use this resource and contact your CA if your certificate does not work.


:Read [[GOCDB/Input System User Documentation|Accessing GOCDB4 input system ]] first.
=== Join Dteam VO  ===
:Go to the [https://gocdb4.esc.rl.ac.uk/portal/index.php central GOCDB instance] or your regional GOCDB input portal.
::In the lower left corner there is a '''User status''' box, select '''Register a New Account''' there. In the screen that appears, enter your name, contact information and your DN.
:'''ROD team members:''' Select '''Manage Roles''' under '''User status''' section and&nbsp; in the next page, select your NGI.&nbsp; At the next page, request the '''Regional Operations Staff''' role.


:'''Site Administrators:''' Select '''Manage Roles''' under '''User status''' section and&nbsp; in the next page, select your site.&nbsp; At the next page, request the '''Site Administrator''' role.
It is recommended to join the [[Dteam vo|dteam VO]] at the [https://voms.hellasgrid.gr:8443/vo/dteam/vomrs dteam Registration] page. You should request group membership for <tt>/dteam</tt> and <tt>/dteam/YOUR_NGI</tt>. The dteam group manager will then be notified by the vomrs software.  


:'''Security Roles:''' Select '''Manage Roles''' under '''User status''' section and&nbsp; in the next page, select either your NGI or your Site, depending at which level your role is. At the next page, request the '''Security Officer''' role.
=== Request GOCDB access  ===
::Don't forget to select '''''"Submit Query"''''' at each of the last two steps.


:All new members then '''need to notify their NGI manager''' about their role request, as GOCDB currently '''does not '''send any notification about pending requests.
*Read [[GOCDB/Input System User Documentation|Input System User Documentation ]] first.
*Go to the [http://goc.egi.eu/ GOCDB instance] and follow [[GOCDB/Input System User Documentation#Users_and_roles|the instruction]]


==== Register into a VO. ====
All new members '''need to notify their NGI operations manager''' about their role request, as GOCDB currently '''does not '''send any notification about pending requests.


:You may register with a VO relevant to your NGI. You'll need this to be able to submit jobs to sites in your region. Ask your NGI Manager for further information.
=== Register into GGUS  ===


==== Subscribe to mailing lists. ====
To register into GGUS please follow the [https://ggus.eu/?mode=register Central GGUS registration] link. GGUS can be accessed with only your certificate. Do not forget to apply for [https://ggus.eu/?mode=register the support role] as well. (The GGUS support staff will approve you quickly as they get the notification automatically.)


:NGIs and Sites have local mailing lists for ROD team members and Site Administrators respectively. Please ensure that you subscribe to them. Depending on your role ask your NGI manager or Site manager to have you included on the necessary mailing lists if there is no automatic subscription process.
Some NGIs also have a local helpdesk or a regional GGUS. Ask your NGI operations manager how to register with them.


==== Register into GGUS====
=== Subscribe to mailing lists.  ===


:To register into GGUS please follow the [https://ggus.eu/admin/register.php Central GGUS registration] link.
NGIs and Sites have local mailing lists for ROD team members and Site Administrators respectively. Please ensure that you subscribe to them. Depending on your role ask your NGI operations manager or Site operations manager to have you included on the necessary mailing lists if there is no automatic subscription process.  
:GGUS can be accessed with only your certificate, which is adequate for normal users. However, ROD team members must register and apply for the support role. (The GGUS support staff will approve you quickly as they get the notification automatically.)
:Some NGIs also have a local helpdesk or a regional GGUS. Ask your NGI manager how to register with them.
 
==== Accessing the Operations Portal (Dashboard).  ====
 
:Once you are assigned a GOCDB Role, you should be able to access the [https://operations-portal.egi.eu/ Operations Portal Dashboard]. Your view will depend on your role.
:Regional instances of the Operations Portal Dashboard run in a few NGIs. Please contact your NGI manager if you have problems accessing it.
 
==== Request security role in the GOCDB (if relevant).  ====
 
:If you were appointed a security officer for your NGI or site, you need to ask for the '''Security Officer''' role in the scope of your site or your NGI as described above. Your request will be moderated by NGI managers or the NGI Security Officer.
:There is a site [[EGI CSIRT:Main Page|CSIRT]] address in the GOCDB page for your site (''Csirtemail''). This address is usually a mailing list used for incident handling. You need to contact your site manager to be added to the site's CSIRT list. Every site's CSIRT contacts are automatically added to the project-wide CSIRT list.
 
==== Join dteam VO.  ====
 
:It is recommended that ROD members who wish to submit jobs should request to join the [[Dteam vo|dteam VO]] at the [https://voms.hellasgrid.gr:8443/vo/dteam/vomrs dteam Registration] page. You should request group membership for <tt>/dteam</tt> and <tt>/dteam/YOUR_NGI</tt>. The dteam group manager will then be notified by the vomrs software.
 
== Security  ==
 
A brief note on the role the CSIRT and EGI security teams play in operations.
 
 
 
:Operational security is aimed at achieving a ''secure infrastructure'' within EGI and relies on site and NGI security contact information maintained in the GOCDB by each NGI.
:Operational security is coordinated by [https://wiki.egi.eu/wiki/EGI_CSIRT:Main_Page EGI CSIRT]. This team acts as a forum to combine efforts and resources from the NGIs in different areas, including Grid security monitoring, Security training and dissemination, and improvements in responses to security incidents.


The NGI operations manager should contact operations@egi.eu and state that they wish to be subscribed to the noc-managers mailing list, noc-managers@mailman.egi.eu.


<br>


NGIs are required to provide at least one person with the role of security officer. The details regarding the NGI security officer role should be well defined in the GOCDB. He/She is also required to be the representative member for their NGI in [[EGI CSIRT:Main Page|EGI-CSIRT]] forums.
== Documentation  ==


The NGI security officer is responsible for coordinating grid security operations and dissemination in the scope of its NGI. Their duties include but are not limited to:
Documentation relevant to EGI operations can also be found on the [[Documentation|EGI Documentation wiki page]].  
 
*Participation in incident handling according to the EGI incident handling procedure.
*Dealing with security validations included in the new site certification procedure.
 
Details and up-to-date versions of these security procedures can be found on the [https://wiki.egi.eu/wiki/EGI_CSIRT:Policies EGI-CSIRT wiki].
 
==== Security Incident ====
 
:A Security Incident is the actual or suspected violation of an explicit or implied security policy.


== Tools  ==
== Tools  ==


A list of tools relevant to EGI operations. A full list of EGI tools can also be found at https://wiki.egi.eu/wiki/Tools
A list of tools relevant to EGI operations can also be found on the [[Tools|EGI Tools wiki page]].  
 
This section gives a brief overview of the tools ROD (and Site Administrators) use while doing their work. Each tool has its own documentation, so it's important to become acquainted with each of these. There is also a special [[Tools]] page which provides links to most of the tools available for use within the EGI framework.
 
==== Operations Portal (Dashboard)  ====
 
:The [[Operations Portal]], known also as the ''Dashboard,'' is the central tool used by ROD on a day-to-day basis. Site Administrators can and should also access this tool. The Portal can be found at the following address:
 
::https://operations-portal.egi.eu/
 
:There is a HOWTO for the Dashboard found at:
 
::https://documents.egi.eu/public/ShowDocument?docid=301
:Other documentation for this tool is found on the [https://forge.in2p3.fr/projects/opsportaluser/wiki operations portal pages].
 
==== GOCDB  ====
 
:GOCDB is a database of all static grid-related information, such as the names of NGI managers, administrators, and security officers; the list of nodes/services connected to the EGI in respective NGIs and their sites. The ability to enter and view the downtime information is the functionality that ROD and Site Admins will (generally) use the most. There is further information about GOCDB on the [[GOCDB|EGI GOCDB wiki pages]].
 
:[https://goc.egi.eu/ GOCDB] uses two entry points: the read-only instance at the [https://goc.egi.eu/portal/ GOCDB4 Central Visualization Portal] is intended for viewing the information. It is visually distinguishable in ''blue text''. It is a central point for obtaining the information for the whole EGI infrastructure. The central ([https://gocdb4.esc.rl.ac.uk/portal/index.php GOCDB4 Input system], visually distinguishable in ''green text'') allows people to enter or update their information. Some NGIs have chosen to use a regional version of the GOCDB4 Input system, and the information from their instance is accessible for general scrutiny via the [https://goc.egi.eu/portal/ GOCDB4 Central Visualization Portal].
 
==== Nagios  ====
 
:[http://www.nagios.org/ Nagios] is one of the best-known infrastructure monitoring software packages. The core application is easily expandable by plugins. There are standard plugins used in the EGI infrastructure.
 
:The [http://www.nagios.org/documentation original Nagios documentation] is quite long. Luckily, you will probably be using just a small set of Nagios features. Most likely, you will find a display mode that suits you best, bookmark a link to that page and have a look at it a few times a day. The more advanced functionality you use will depend on the role you are assigned in the project.
 
:The following role-related Nagios guides are available: [[Operations/ROD/Using Nagios as ROD|Using Nagios as ROD]]
 
==== MyEGI  ====
 
:MyEGI is a tool built on the top of Nagios; it lets you see
:http://wiki.egi.eu/Operations/Tools/MyEGI
 
==== GGUS  ====
 
 
 
 
:[[GGUS]] is a ticket system used in EGI. You are encouraged to use it for reporting software bugs and operations problems. GGUS tracks the development of an issue until it is resolved and the solution is verified. Remember, if you issue a ticket, you are the only person able to verify the solution when it is actually solved. GGUS allows you to create subtickets, which block the closure of the parent ticket until all of them are resolved. Because of this feature, GGUS is sometimes used to track even ordinary operations issues, should they require more complex workflow.
 
:In order to use GGUS, you must first register [[Here]], either as a user or as a supporter. You can log into GGUS using either username/password pair, or your certificate.
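The ticket rules described above (subtickets block the parent, and only the submitter can verify a solution) can be illustrated with a small model. This is only a sketch of the rules as stated in this section, not GGUS code or its API: the `Ticket` class, its fields and its method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Ticket:
    """Hypothetical in-memory model of a ticket; not the GGUS data model."""
    summary: str
    submitter: str
    solved: bool = False
    verified: bool = False
    subtickets: List["Ticket"] = field(default_factory=list)

    def open_subtickets(self) -> List["Ticket"]:
        # Subtickets that are not yet solved block the parent.
        return [t for t in self.subtickets if not t.solved]

    def solve(self) -> None:
        # The parent cannot be closed until every subticket is resolved.
        blocking = self.open_subtickets()
        if blocking:
            names = ", ".join(t.summary for t in blocking)
            raise RuntimeError(f"blocked by open subtickets: {names}")
        self.solved = True

    def verify(self, user: str) -> None:
        # Only the person who issued the ticket can verify the solution.
        if user != self.submitter:
            raise PermissionError("only the submitter can verify the solution")
        if not self.solved:
            raise RuntimeError("ticket is not solved yet")
        self.verified = True
```

With this model, calling `solve()` on a parent ticket fails until `solve()` has been called on each of its subtickets, and `verify()` succeeds only for the user stored in `submitter`.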
 
 
== Duties ==
 
 
 
== NGI management ==
 
==== Resource Management ====
 
The NGI manager is responsible for keeping the NGI entry in the GOCDB up to date. They are also responsible for managing the status of all sites under that NGI, and ensuring that that information is also kept current.
 
==== Availability/Reliability ====
 
The NGI manager is responsible for addressing problems with Site availability or reliability. The reports are issued on a monthly basis and the NGI manager has 10 days to respond to identified problems.
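As a small illustration of the response window, the deadline can be computed from the date a report is issued. This sketch assumes the 10 days are calendar days counted from the issue date; the text does not say whether working days are meant, and the function names are hypothetical.

```python
from datetime import date, timedelta

# From the text: the NGI manager has 10 days to respond.
# Assumption: calendar days, counted from the day the report is issued.
RESPONSE_WINDOW_DAYS = 10


def response_deadline(report_issued: date) -> date:
    """Return the last day on which a response is still on time."""
    return report_issued + timedelta(days=RESPONSE_WINDOW_DAYS)


def is_overdue(report_issued: date, today: date) -> bool:
    """True once the response window has passed."""
    return today > response_deadline(report_issued)
```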
 
==== Communication ====
 
NGI managers are obliged to attend regular Operations-Management-Board (OMB) meetings. These occur monthly, and are either via phone conference (EVO) or face to face (about 3 times a year) co-located with EGI conferences.
 
The NGI manager(s) have a responsibility to communicate with COD as described below, and as needed with other administrative bodies within EGI.
 
*Communication with COD
**COD and COD management can be contacted according to the information at the [[Grid_operations_oversight/COD|COD pages]].
**Use the GGUS ticketing system for
***dealing with OC creation/decommissioning processes.
***availability/reliability reports
***for issues regarding site suspension
 
*Communication within the NGI (recommendations)
**Mailing lists for ROD, Sites, other management teams
**Weekly or biweekly status meetings (EVO/phone/chat)
**Face to face meetings semi-regularly
 
<br>
 
==== ROD team management ====
 
*Organize and manage ROD Teams within the NGI. Optionally, appoint a ROD representative person.
*Instructions for the requirements of a ROD team are in the [[PROC02|Operations Centre creation]] procedure.
**Ensure that the ROD team members are members of a mailing list which is forwarded to COD.
 
== Site Management ==
 
=== Resource Management ===
 
The NGI manager is responsible for managing the status of all sites under that NGI, and ensuring that that information is also kept current.
 
The following shows the allowed site status transitions:
 
<br><br> [[Image:SiteStatusFlow.png|300px|SiteStatusFlow.png]]
 
==== Create ====
 
The NGI is responsible for Site/Resource Centre management, which covers the creation, certification, suspension and closing/removal of sites. See the [[PROC09|Resource Centre Registration and Certification Procedure]] document for how to add a new site to your NGI.
 
==== Certify ====
 
After a Site is fully registered in the GOCDB and all steps in the certification part of the procedure are completed, the NGI Manager should change the site status from "uncertified" to "certified". Monitoring for the site and its nodes will now be switched on, and cannot be switched off.
 
==== Suspend ====
 
If a site requires suspension, the site certification status in GOCDB should be changed from "production" to "suspended". To bring the site back into production, the certification procedure must be repeated from the beginning.
 
==== Close or Remove site ====
 
There is a site status in GOCDB called closed. This status is for sites which are no longer in use because they have either been closed or replaced by another site.
 
<br> To remove a site completely from the GOCDB, which includes removing its history, please contact GOCDB support.
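The status changes described in this section can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not GOCDB code: the transition table is inferred from the prose only (the SiteStatusFlow diagram remains the authoritative list), "production" is treated as the certified status, and the moves from "suspended" back to "uncertified" and from any status to "closed" are assumptions.

```python
from enum import Enum


class SiteStatus(Enum):
    UNCERTIFIED = "uncertified"
    CERTIFIED = "certified"   # assumed to be what the text calls "production"
    SUSPENDED = "suspended"
    CLOSED = "closed"


# Hypothetical transition table inferred from the prose above.
ALLOWED = {
    SiteStatus.UNCERTIFIED: {SiteStatus.CERTIFIED, SiteStatus.CLOSED},
    SiteStatus.CERTIFIED: {SiteStatus.SUSPENDED, SiteStatus.CLOSED},
    # Assumption: a suspended site restarts certification from the beginning.
    SiteStatus.SUSPENDED: {SiteStatus.UNCERTIFIED, SiteStatus.CLOSED},
    SiteStatus.CLOSED: set(),
}


def change_status(current: SiteStatus, new: SiteStatus) -> SiteStatus:
    """Return the new status, or raise if the transition is not in the table."""
    if new not in ALLOWED[current]:
        raise ValueError(
            f"transition {current.value} -> {new.value} is not allowed"
        )
    return new
```

For example, `change_status(SiteStatus.SUSPENDED, SiteStatus.CERTIFIED)` raises an error in this sketch, reflecting the rule that a suspended site must go through certification again.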
 
 


{| width="100%" style="background: none repeat scroll 0% 0% rgb(249, 249, 249); margin: 1.2em 0px 6px; border: 1px solid rgb(221, 221, 221);"
|-
| style="width:61%; color:#000;" |
{| style="width:280px; border:none; background:none;"
|-
| style="width:280px; text-align:center; white-space:nowrap; color:#000;" | <div style="font-size:162%; border:none; margin:0; padding:.1em; color:#000;">[[Support |Need support?]]<br></div>
|}


| style="width:40%; font-size:95%;" | <br>
|}


[[Category:TODO_DOC]]
[[Category:Operations]]

Revision as of 08:00, 3 May 2017




Introduction

The EGI Operations Start Guide was created to help you start with EGI Operations duties. It presents the responsibilities of the various parties involved in the running of the EGI infrastructure and explains how to join operations. As a newcomer, you need to understand the structure of the EGI project and the roles of operators at different levels. Reading the whole document will give you a complete overall picture of daily operations within EGI.

Roles

The following describes the roles that are commonly found in the EGI Infrastructure and Operations. Other terms and definitions can be found in EGI Glossary.

Site level

Site Administrator

The person responsible for keeping the site operational. In the scope of Operations, site administrators primarily receive and react to notifications of incidents at their site. They also need to react to security issues that arise at a global level but affect their site. Site administrators should respond to GGUS tickets in a suitable time frame and be aware of the alarms at their site, e.g. through the operations dashboard. Sites must only operate supported middleware versions, which implies upgrading the middleware from time to time. Emergency releases are treated in a special way; see EGI CSIRT:Critical Vulnerability Handling.

All Site management responsibilities are listed in RC OLA document.

Site Operations Manager

The person responsible for the site at the political and legal level. S/he is responsible for signing the Operations Level Agreement (OLA) between the Site and the NGI that hosts the site operationally. The Site Operations Manager is also responsible for assigning and approving the other site roles in the GOCDB. Further, s/he should ensure that administrators are subscribed to relevant mailing lists.

Site Security Officer

The person responsible for keeping the site compliant with the Security policies. She/he is also the primary contact for the NGI Security officer and EGI CSIRT. The Site Security Officer deals with security incidents and shall respond to enquiries in a timely fashion as defined in the collection of security procedures and policies.

Regional level

Regional Operator on Duty (ROD)

A team responsible for solving problems and incidents in the infrastructure according to agreed procedures. ROD teams monitor the sites in their region, react to problems identified by the monitoring tools, and oversee problems through to their resolution. They ensure that problems are properly recorded and that solutions progress according to the specified timelines. They also provide support to sites and VOs as needed, and inform oversight bodies when sites are non-responsive. They ensure that all necessary information is available to all parties. The team is provided by each NGI, and its work requires procedural knowledge of the process rather than technical skills. New ROD team members are required to read the ROD Welcome page and be familiar with the ROD wiki page.

NGI Security officer

The member of the EGI-CSIRT IRTF (Incident Response Task Force) currently on shift. Further information can be found at the CSIRT:IRTF page. The role of the IRTF team is to handle day-to-day operational security issues and coordinate computer security incident response across the EGI infrastructure. NGIs and Sites MUST respond in a timely manner to its requests and alerts.

NGI operations manager

The NGI operations manager is the contact point for all operational matters and represents the NGI within the Operations Management Board.

S/he is mainly responsible for:

  • keeping the NGI entry in the GOCDB up to date, managing the status of all sites under that NGI, and ensuring that this information is also kept current
  • addressing problems with Site availability or reliability. Availability and reliability reports are issued on a monthly basis, and the NGI operations manager has 10 days to respond to identified problems
  • attending regular Operations-Management-Board (OMB) meetings

All NGI operations management responsibilities are listed in the RP OLA document.
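The 10-day response window for the monthly reports can be tracked with a small helper. The following is a minimal sketch, not an EGI tool: the function names and the example dates are hypothetical, and it assumes the 10 days are calendar days counted from the date the report is issued, which this guide does not specify.

```python
from datetime import date, timedelta

# 10 days comes from this guide; treating them as calendar days
# counted from the issue date is an assumption.
RESPONSE_WINDOW_DAYS = 10


def response_deadline(report_issued: date) -> date:
    """Return the last day on which a response is still within the window."""
    return report_issued + timedelta(days=RESPONSE_WINDOW_DAYS)


def days_remaining(report_issued: date, today: date) -> int:
    """Days left before the deadline; negative once the deadline has passed."""
    return (response_deadline(report_issued) - today).days


if __name__ == "__main__":
    issued = date(2022, 3, 5)  # hypothetical issue date
    print("Respond by:", response_deadline(issued))
    print("Days left:", days_remaining(issued, date(2022, 3, 12)))
```

If your NGI counts working days instead, the deadline calculation would need to be adjusted accordingly.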

Project level

Chief Operations Officer

The Chief Operations Officer leads EGI Operations and is responsible for coordinating the operations of the infrastructure across the project.

EGI CSIRT

EGI CSIRT is the official security team, acting as coordinator and contact point at project level.

Operations Support

The Operations Support team is provided at the global level and is responsible for supporting EGI Operations. Examples of its activities are service level management, service level reporting, service management in general and central technical.

VO

A Virtual Organisation (VO) is a group of users and, optionally, resources, often not bound to a single institution or national borders, who, by reason of their common membership and shared goal, are given authority to use a set of resources. During registration, each VO member signs the VO AUP, the policy document that describes the goals of the VO and thereby defines the expected and acceptable use of the Grid by the users of the VO. User documentation can be found here.

VO manager

An individual responsible for the membership registry of the VO including its accuracy and integrity.

Joining operations

To join any of the organisational groups in your NGI, complete the following steps in order:

Obtain a Grid certificate

If you do not already have a Grid certificate, this page provides a map of all certification authorities according to country (or NGI). Select your country on the map to find your local CA, then follow that CA's procedure to request a certificate. When you have received your certificate, install it into your web browser.

If you are setting up a new Resource Center, please request a host certificate.

CERN provides a webpage for testing your certificate here. Please use this resource and contact your CA if your certificate does not work.

Join the dteam VO

It is recommended to join the dteam VO at the dteam Registration page. You should request group membership for /dteam and /dteam/YOUR_NGI. The dteam group manager will then be notified by the VOMRS software.
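As a small illustration of the two group paths to request, the sketch below builds them from an NGI name. The function name and the sample value "NGI_XX" are hypothetical; use the NGI name exactly as it is registered for your region.

```python
def dteam_groups(ngi_name: str) -> list[str]:
    """Return the dteam groups to request: /dteam and /dteam/YOUR_NGI."""
    # Tolerate surrounding whitespace or slashes in the supplied name.
    ngi = ngi_name.strip().strip("/")
    if not ngi:
        raise ValueError("NGI name must not be empty")
    return ["/dteam", f"/dteam/{ngi}"]


if __name__ == "__main__":
    # "NGI_XX" is a placeholder, not a real NGI.
    for group in dteam_groups("NGI_XX"):
        print(group)
```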

Request GOCDB access

All new members need to notify their NGI operations manager about their role request, as GOCDB currently does not send any notification about pending requests.

Register into GGUS

To register in GGUS, please follow the Central GGUS registration link. GGUS can be accessed using only your certificate. Do not forget to apply for the support role as well. (The GGUS support staff will approve you quickly, as they are notified automatically.)

Some NGIs also have a local helpdesk or a regional GGUS. Ask your NGI operations manager how to register with them.

Subscribe to mailing lists

NGIs and Sites have local mailing lists for ROD team members and Site Administrators respectively. Please ensure that you subscribe to them. If there is no automatic subscription process, ask your NGI operations manager or Site operations manager, depending on your role, to have you included on the necessary mailing lists.

NGI operations managers should contact operations@egi.eu and state that they wish to be subscribed to the noc-managers mailing list (noc-managers@mailman.egi.eu).


Documentation

Documentation relevant to EGI operations can also be found on the EGI Documentation wiki page.

Tools

A list of tools relevant to EGI operations can also be found on the EGI Tools wiki page.