
Difference between revisions of "PROC09 Resource Centre Registration and Certification"

From EGIWiki
Line 69: Line 69:
== Site Operations Manager ==
# A Resource Infrastructure Provider is responsible for all sites within their jurisdiction (for example, an NGI is the reference entity for each country). For this reason, the Site Operations Manager of a new site needs to contact the respective NGI if in Europe, or a Resource Infrastructure Provider active in a relevant geographical area if outside Europe, about the intention to join the EGI infrastructure. If needed, EGI Operations can assist the Site Operations Manager to get in contact with the relevant partners (see the Contact information section).
# In order to be certified, the Site Operations Manager is responsible for accepting the Resource Centre Operational Level Agreement, which defines the obligations of a Resource Centre and the commitment to deliver a minimum quality of service to its future users. Endorsement of OLA implies - among other things - the acceptance of
# In order to be certified, the Site Operations Manager is responsible for reading, understanding and accepting the [https://documents.egi.eu/document/31 Resource Centre Operational Level Agreement], which defines the obligations of a Resource Centre and the commitment to deliver a minimum quality of service to its future users. Endorsement of OLA implies - among other things - the acceptance of
* the Grid Security Policy
* the [https://documents.egi.eu/document/86 Grid Security Policy]
* the Grid Site Operations Policy
* the [https://documents.egi.eu/document/75 Grid Site Operations Policy]
* the Site Registration Security Policy
* the [https://documents.egi.eu/document/76 Site Registration Security Policy]
* all other policies for all EGI participants from [https://wiki.egi.eu/wiki/SPG:Documents Security Policy Group]  
* all other policies for all EGI participants from [https://wiki.egi.eu/wiki/SPG:Documents Security Policy Group]


== Resource Infrastructure Provider Operations Manager ==

Revision as of 01:28, 11 March 2011


Title: Site Certification Procedure
Document link: to be determined
Last modified:
Policy Group Acronym:
Policy Group Name: Operational Documentation
Contact Person: Vera Hansper
Document Status: DRAFT
Approved Date:
Procedure Statement: A procedure for the steps involved to both register and certify new sites in the EGI infrastructure. The certification step can also be used to re-certify suspended sites.

Introduction

Certification is a prerequisite for a Resource Centre (aka site) to become part of a Resource Infrastructure such as a National Grid Initiative (NGI) or EIRO (in Europe), or a multi-country Resource Infrastructure.

This document describes the steps required:

  1. to register and certify a new site,
  2. to re-certify a site which has been suspended.

Note: A separate document provides the process for decommissioning a site.

Through its parent Resource Infrastructure, a certified Resource Centre becomes a member of the EGI Resource Infrastructure to make resources available to international user communities.

A certified site guarantees a minimum quality of service for these resources (currently expressed in terms of monthly availability and reliability). It must ensure that problems are handled in a timely fashion, and it must understand and adhere to a common set of policies and procedures. By contrast, an uncertified (or test) Resource Centre provides no guarantee on the availability or usability of its resources.

Definitions

The entities involved in this procedure are defined in the EGI Glossary.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Entities involved in the procedure

  • Resource Centre (or Site) Operations Manager, who is responsible for initiating the certification process by applying for membership of a Resource Infrastructure
  • Resource Infrastructure Operations Manager, who is responsible for approving the integration of a new Resource Centre into the respective Infrastructure
  • Operations Centre (ROD), who is technically responsible for carrying out the Resource Centre certification part of the procedure, once the membership is approved

The Resource Infrastructure Operations Manager can determine with the Site Operations Manager the level of involvement of these other actors.

Contact information

  • EGI Operations: operations (at) mailman.egi.eu
  • EGI Resource Infrastructure Providers are listed on the EGI [web site] // provide reference
  • Operations Centre contact information is available on GOCDB // provide link to instructions page

Actions and responsibilities

Site Operations Manager

  1. A Resource Infrastructure Provider is responsible for all sites within their jurisdiction (for example, an NGI is the reference entity for each country). For this reason, the Site Operations Manager of a new site needs to contact the respective NGI if in Europe, or a Resource Infrastructure Provider active in a relevant geographical area if outside Europe, about the intention to join the EGI infrastructure. If needed, EGI Operations can assist the Site Operations Manager to get in contact with the relevant partners (see the Contact information section).
  2. In order to be certified, the Site Operations Manager is responsible for reading, understanding and accepting the Resource Centre Operational Level Agreement, which defines the obligations of a Resource Centre and the commitment to deliver a minimum quality of service to its future users. Endorsement of the OLA implies - among other things - the acceptance of
       • the Grid Security Policy
       • the Grid Site Operations Policy
       • the Site Registration Security Policy
       • all other policies for all EGI participants from the Security Policy Group

Resource Infrastructure Provider Operations Manager

  1. Resource Infrastructure Provider Operations Managers MUST reply to site certification applications and provide feedback in a timely manner to accept or reject the requests.
  2. He/she MUST
  3. In case a request is accepted, Resource Infrastructure Provider Operations Managers MUST contact the relevant Operations Centre ROD team to start the site registration as candidate, and the certification procedure. Registration is only needed in case of new sites.

ROD actions and responsibilities

The steps required of both the NGI manager and the Site Operations Manager are explained in the tables below. For a new site, the first part is the registration process. Note that each site requires a Site Security Officer, and that mailing lists are required both for the site's CSIRT and for contacting the site administrators. Details of the required information, which is entered into the GOCDB, EGI's infrastructure database, are linked from step one of the first table (SiteCertMan/Required_information).

Further, before a site can be certified, it is important that the Site Operations Manager reads and accepts the Grid Site Operations Policy, the Site Registration Security Policy and the NGI/SITE OLA. The links for these are found in the required documentation section.

A number of steps also require the integration of the site with monitoring tools; during the certification process, the site should be registered in the NGI's Nagios instance. Once the site has passed all tests consecutively for 2 to 3 days, it can be marked as certified. The actual certification process, in the second table, is applicable to both new and suspended sites.

The general status flow that a site is allowed to follow is neatly given by the following:

SiteStatusFlow.png
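
As a rough illustration, the status changes that the text of this procedure names explicitly can be written down as a small transition table. This is only a sketch: the names `SiteStatus`, `ALLOWED_TRANSITIONS` and `can_transition` are hypothetical, and the diagram may allow further transitions (for example, how a site becomes suspended) that are not guessed here.

```python
from enum import Enum


class SiteStatus(Enum):
    CANDIDATE = "Candidate"
    UNCERTIFIED = "Uncertified"
    CERTIFIED = "Certified"
    SUSPENDED = "Suspended"


# Only the transitions named in the text of the procedure are listed.
# The diagram may contain others; they are deliberately left out.
ALLOWED_TRANSITIONS = {
    SiteStatus.CANDIDATE: {SiteStatus.UNCERTIFIED},
    SiteStatus.SUSPENDED: {SiteStatus.UNCERTIFIED},
    SiteStatus.UNCERTIFIED: {SiteStatus.CERTIFIED},
    SiteStatus.CERTIFIED: set(),
}


def can_transition(current: SiteStatus, target: SiteStatus) -> bool:
    """Return True if the procedure text names a direct move from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```

With this table, a "Candidate" site cannot be flagged "Certified" directly; it has to pass through "Uncertified" first, which matches the order of the certification steps.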

One final point: It is highly recommended that email contacts for the site's administrators and security officer(s) are mailing lists, and not individuals.

Site registration procedure

These steps describe what a site wishing to join the EGI infrastructure needs to do. They apply to a site not already registered in the GOCDB.

Actions falling on the NGI are the responsibility of the NGI manager. Actions falling on the Site are the responsibility of the Site Manager/Representative.

Note that a site MUST be part of an NGI or a group of NGIs. If there is no suitable NGI for your country, the NGI may first have to be created. In this case, please see NewNGIs_creation for how to create a new NGI.

# Responsible Action
1 Site
  1. Contact your NGI Manager. National contacts are available in http://www.egi.eu/production-infrastructure/Resource-providers/.
  2. Provide your NGI Manager the required information according to the template available in https://wiki.egi.eu/wiki/SiteCertMan/Required_information
2 NGI

The following actions can be done in parallel:

  1. Forward all necessary and required documentation to install and configure the site services to the Site Representative.
  2. Communicate with the Site Representative to clarify any doubts or questions. Include the NGI ROD or help-desk teams in the step if necessary.
3 NGI
  1. Add the site to the GOCDB and flag it as "candidate".
  2. Notify the Site Representative that they should register themselves in the GOCDB and request the Site Administrator role. Approve the request when done.
4 Site
  1. Complete any missing information for the site's entry in the GOCDB, including services that are to be integrated into the infrastructure.
  2. Request in the GOCDB (or ask the relevant site security staff to request) the mandatory Site Security Officer role. A security expert is the most appropriate actor for this role.
  3. Accept or deny all the requested roles under the site scope. (Caveat: If the Site manager cannot approve roles, they should ask the NGI manager to do so. This is a current (20.12.2010) flaw in GOCDB.)
  4. Notify the NGI Manager that the site information update is concluded.
5 Site or NGI
  1. Check whether the site appears in the "Notified Site" field in https://gus.fzk.de/ws/ticket_search.php
  2. Note that this step should happen automatically when the site is correctly entered into the GOCDB. If this is still not visible 2 days after the GOCDB entries have been created, the NGI manager should be informed and should then contact GGUS administrators.
  3. A new site admin should register in GGUS (https://gus.fzk.de/admin/get_account.php?accounttype=support) but not specify any role, unless directed to by the NGI manager.
6 NGI
  1. Check that the site's information is correct. (Site roles and any other additional information.)
  2. Check that contacts receive email (if they are mailing lists, check that outside EGI members are allowed to post there).
  3. Check that the required services for a site are properly registered (CE, siteBDII, SE, APEL).
  4. Check domain names and DNS.
7 NGI
  1. Any other NGI requirements (join a certain VO, join any "ops" mailing list, etc.)
8 NGI
  1. If all previous actions have been completed with success, notify the Site Manager that the Registration is completed.

After the successful completion of all these steps, the site is considered to be in the "Candidate" state and is ready for the certification process.
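
Some of the NGI checks in step 6 of the table above lend themselves to a simple scripted pre-check. The following is a minimal sketch, assuming the site information has already been exported into a plain dictionary. The dictionary keys and the function names are hypothetical and do not mirror the GOCDB schema, and the service labels are taken as written in the table rather than from GOCDB's own service type identifiers.

```python
# Service labels as written in step 6 of the table above. The identifiers
# that GOCDB itself uses for service types may differ; these are placeholders.
REQUIRED_SERVICES = {"CE", "siteBDII", "SE", "APEL"}


def missing_required_services(registered):
    """Return the required service labels absent from `registered`, sorted."""
    return sorted(REQUIRED_SERVICES - set(registered))


def registration_problems(site):
    """Collect human-readable problems for a site record (a plain dict).

    The keys used here ("services", "security_officer", "admin_contact",
    "csirt_contact") are hypothetical and chosen for illustration only.
    """
    problems = []
    for label in missing_required_services(site.get("services", [])):
        problems.append(f"required service not registered: {label}")
    if not site.get("security_officer"):
        problems.append("no Site Security Officer role assigned")
    for key in ("admin_contact", "csirt_contact"):
        if not site.get(key):
            problems.append(f"missing contact: {key}")
    return problems
```

An empty list from `registration_problems` only means that these particular fields are filled in. Checking that the contacts actually receive email, and that domain names and DNS are correct, still has to be done separately.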

Site certification procedure

The Site Certification procedure applies both to new sites which have reached the "Candidate" state and to suspended sites. The following is a detailed description of the steps required for the transition from the "Uncertified" to the "Certified" state of the site.

# Responsible Action
1 Site
  1. Notify the NGI Manager that the site is ready for certification.
2 NGI
  1. If the site is in the "Candidate" or "Suspended" state, then flag the site as "Uncertified". If it was in the "Suspended" state then check that the reason for suspension has been cleared.
3 NGI Check that the GIIS (gLite: BDII) is working, and publishing coherent values, namely:
  1. the correct NGI is being published in GlueSiteOtherInfo.
  2. all services registered in the GOCDB are published and ALSO that services published in the GOCDB are valid.
  3. ops and EGI dteam VOs are configured and supported.
  4. regional VOs are configured and supported.

There are detailed examples for how to do this in SiteCertMan/GIIS_BDII_check.

4 NGI

Check that the registered services are fully functional by performing manual tests, e.g. from the UI or from a dedicated SAM/Nagios testbed infrastructure provided by the NGI. There is an example of how to create a testbed Nagios at this page. Contact the site admins if there are problems, and ensure that they fix them. Include the ROD and help-desk teams if necessary. Iterate this step with the site admins until the tests pass. The prime tests to check are:

  1. network connectivity.
  2. CE job submission.
  3. SE data transfer.

Details for submitting manual tests can be found at SiteCertMan/Grid_manual_tests.

5 NGI
  1. If all preliminary tests are passed for 3 consecutive days, declare an initial maintenance downtime and switch the site status to Certified. This ensures that the site will appear in Nagios and GSTAT.
6 NGI After two days check that the site appears in all operational tools. If there are problems with a specific tool, open GGUS tickets to the relevant Support Units. The major tools that are relevant are:
  1. Regional Nagios (Nagios)
  2. Operations Dashboard (Dashboard-Siteview)
  3. GridView
  4. GSTAT
  5. SAM/SFT
7 NGI Ensure that, before the end of the maintenance downtime:
  1. all Nagios tests (see above) are passed AND
  2. Accounting data is properly published.
  3. GSTAT is not in an error state. CAVEAT: There may be some problems with this tool and ARC sites.
8 NGI
  1. Notify the Site Manager that the site is certified
9 NGI
  1. Add site contact to any regional mailing list and provide access to any regional tool
10 NGI (Optional?) The NGI can broadcast that a new site is now part of the EGI infrastructure.


After the successful completion of these steps, the site is considered "Certified".
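
The condition in step 5 of the table above (all preliminary tests passed for 3 consecutive days) can be sketched as a small helper. This is an illustration under stated assumptions, not part of the procedure: the function name `passed_consecutive_days` and the input format (one pass/fail value per calendar day) are hypothetical, and how the daily results are collected from the monitoring tools is left open.

```python
from datetime import date, timedelta


def passed_consecutive_days(daily_results, required_days=3):
    """Return True if the most recent `required_days` calendar days all passed.

    `daily_results` maps a datetime.date to True (all tests passed that day)
    or False. A day with no entry counts as a failure.
    """
    if not daily_results:
        return False
    latest = max(daily_results)
    return all(
        daily_results.get(latest - timedelta(days=offset), False)
        for offset in range(required_days)
    )
```

Treating a missing day as a failure is a conservative choice: a gap in the results restarts the count instead of being silently skipped.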

Revision history

Version Authors Date Comments
0.7 Vera Hansper 2011-02-02 Updated introduction to include roles, etc. and added required documentation link for policies