
PROC09 Resource Centre Registration and Certification



Title Resource Centre (Site) Registration and Certification Procedure
Document link to be determined
Last modified
Policy Group Acronym OMB
Policy Group Name Operations Management Board
Contact Person Vera Hansper
Document Status DRAFT
Approved Date
Procedure Statement A procedure for the steps involved to both register and certify new Resource Centres (sites) in the EGI infrastructure. The certification step can also be used to re-certify suspended Resource Centres (sites).

Introduction

Certification is a prerequisite for a Resource Centre (aka site) to become part of a Resource Infrastructure such as a National Grid Initiative (NGI) and EIRO (in Europe), or multi-country Resource Infrastructure.

This document describes the steps required

  1. to register and certify a new site,
  2. to re-certify a site which has been suspended.

Note: A separate document provides the process for decommissioning a site.

Through its parent Resource Infrastructure, a certified Resource Centre becomes a member of the EGI Resource Infrastructure in order to make resources available to international user communities.

A certified Resource Centre (site) guarantees a minimum quality of service for these resources (currently expressed in terms of monthly availability and reliability): the site must ensure that problems are handled in a timely fashion, and it must understand and adhere to a common set of policies and procedures. By contrast, an uncertified (or test) Resource Centre does not provide any guarantee on the availability or usability of its resources.

Definitions

The entities involved in this procedure are defined in the EGI Glossary.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Entities involved in the procedure

  • Resource Centre (or Site) Operations Manager, who is responsible for initiating the certification process by applying for membership of a Resource Infrastructure
  • Resource Infrastructure Operations Manager, who is responsible for approving the integration of a new Resource Centre into the respective Infrastructure
  • Operations Centre, the entity that is technically responsible for carrying out the Resource Centre certification part of the procedure, once the membership is approved

The Resource Infrastructure Operations Manager can determine with the Site Operations Manager the level of involvement of other actors.

Contact information

  • EGI Operations: operations (at) mailman.egi.eu
  • EGI Resource Infrastructure Providers are listed on the EGI site
  • Operations Centres and their contact information are available from GOCDB
  • EGI CSIRT: egi-csirt-team (at) mailman.egi.eu

Actions and responsibilities

Site Operations Manager

  1. A Resource Infrastructure Provider is responsible for all sites within the respective jurisdiction (for example, an NGI is responsible for all sites in its country). For this reason, the Site Operations Manager of a new site is REQUIRED to announce the intention to join the EGI infrastructure by contacting
    • the respective NGI, if the Site is located in Europe,
    • the respective Resource Infrastructure Provider active in the relevant geographical area, if the Site is outside Europe.

If needed, EGI Operations can assist the Site Operations Manager in getting in contact with the relevant partners (see the Contact information section).

  2. The Site Operations Manager is REQUIRED to provide the Site information needed to complete the registration process, and is responsible for its accuracy and maintenance.
  3. In order to be certified, the Site Operations Manager is responsible for reading, understanding and accepting the Resource Centre Operational Level Agreement (OLA), which defines the obligations of a Resource Centre and the commitment to deliver a minimum quality of service to its future users. Endorsement of the OLA implies - among other things - the acceptance of:

Resource Infrastructure Operations Manager

  1. A Resource Infrastructure Provider is REQUIRED to be responsible for all sites within the respective jurisdiction. For example, an NGI is responsible for all sites in its respective country.
  2. The Resource Infrastructure Operations Manager MUST attend to site certification applications and MUST provide feedback to the requesting partners in a timely manner, accepting or rejecting the requests received.
  3. If the Site needs to be certified, the Resource Infrastructure Operations Manager MUST provide information to the Site Operations Manager about the Resource Centre OLA, and is responsible for keeping records of the Site Operations Manager's agreement, as deemed suitable by the Resource Infrastructure Provider (for example, through signed e-mail agreement, the collection of signatures on a paper copy of the OLA, or other means).
  4. In case a request is accepted, the Resource Infrastructure Operations Manager MUST contact the relevant Operations Centre to start the site registration as candidate, and the certification procedure. Registration is only needed in case of new sites.

Operations Centre

  1. The Operations Centre is responsible for registering (if applicable) and certifying the site.
  2. The Operations Centre is responsible for registering an accepted site in the EGI configuration repository GOCDB.
  3. The Operations Centre MUST collect the mandatory information specified by the site registration procedure, and MUST accurately input the supplied data into GOCDB.
  4. The Operations Centre MUST integrate site information in all operations tools as needed, such as the local NAGIOS server for monitoring of uncertified sites, the local helpdesk (if available) for the registration of the Resource Centre support staff, etc.
  5. In case of an existing site that is starting certification after suspension for security reasons, the Operations Centre MUST contact the EGI CSIRT to verify that all requested repair operations have been successfully applied to fix the issue.
  6. The Operations Centre is responsible for verifying that all tests are passed successfully throughout the 3-calendar-day certification process. The Operations Centre SHALL proceed with changing the site status in GOCDB to certified only if this condition is met.
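
The "all tests passed for 3 calendar days" condition in the last item can be expressed as a small check. The following is a minimal sketch only: the function name, the dictionary of daily results and the rule that a day without results counts as a failure are assumptions made for illustration, not part of any EGI tool.

```python
from datetime import date, timedelta


def passed_consecutive_days(daily_results, end_day, required_days=3):
    """Return True if each of the last `required_days` calendar days,
    ending on `end_day`, has a recorded result and that result is a pass.

    daily_results maps datetime.date -> bool (True means all tests passed).
    Assumption: a day with no recorded result counts as a failure.
    """
    for offset in range(required_days):
        day = end_day - timedelta(days=offset)
        if not daily_results.get(day, False):
            return False
    return True


# Hypothetical daily test outcomes for a site under certification.
results = {
    date(2011, 3, 8): False,
    date(2011, 3, 9): True,
    date(2011, 3, 10): True,
    date(2011, 3, 11): True,
}

ready = passed_consecutive_days(results, date(2011, 3, 11))
not_ready = passed_consecutive_days(results, date(2011, 3, 10))
```

Here `ready` is true because 9, 10 and 11 March all passed, while `not_ready` is false because the window ending on 10 March includes the failed day.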

Workflow

The various steps then required by both the Resource Infrastructure Operations Manager and the Site Operations Manager are explained in the tables below. The first part for a new site is the registration process. The actual certification process, in the second table, is applicable to both new and suspended Sites.

The general status flow that a Site is allowed to follow is illustrated by the following diagram. Information on site status and on how to manipulate it is available from the GOCDB Documentation.

SiteStatusFlow.png
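
As a rough illustration, the status changes that this procedure names explicitly (Candidate or Suspended to Uncertified, and Uncertified to Certified) can be written as a small transition table. This is a sketch under that assumption only: the diagram may allow further transitions that are not encoded here, and `change_status` is a hypothetical helper, not a GOCDB function.

```python
# Only the transitions spelled out in the text of this procedure.
# The status flow diagram may permit others; they are deliberately omitted.
ALLOWED_TRANSITIONS = {
    ("Candidate", "Uncertified"),
    ("Suspended", "Uncertified"),
    ("Uncertified", "Certified"),
}


def change_status(current, new):
    """Return the new status if the transition is allowed, else raise ValueError."""
    if (current, new) not in ALLOWED_TRANSITIONS:
        raise ValueError(f"transition {current!r} -> {new!r} is not allowed")
    return new


status = "Candidate"
status = change_status(status, "Uncertified")
status = change_status(status, "Certified")
```

A direct jump from "Candidate" to "Certified" raises `ValueError`, which mirrors the rule that a site must pass through the "Uncertified" state during certification.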

Site registration

Requirements

  1. A Site MUST be part of a Resource Infrastructure and MUST be operated by an Operations Centre. If there is no suitable provider for the Site's country, an Operations Centre may first need to be created. A procedure exists for this, and it is documented in the Operations Centre creation procedure.
  2. To satisfy Grid security requirements a Site registration procedure must capture and maintain at least the following information. The comprehensive list of required information is available (here).
    • The full name of the Site.
    • An abbreviated name of the Site, which must be unique within the Grid, and preferably globally unique.
    • The name, email address and telephone number of the Site Operations Manager and Site Security Contact in accordance with the requirements of the Site Operations Policy.
    • The email address of a managed list for contact with Site Administrators at the site.
    • The email address of a managed list for contact with the site security incident response team.
  3. If a Site wishes to leave the Grid or the Grid decides to remove the Site, the registration information MUST be kept by GOCDB for at least the same period defined for logging in the Traceability and Logging Policy. Personal registration information of the Site Operations Manager and Security Contact of the Site leaving the Grid MUST NOT be retained for longer than one year.
  4. It is RECOMMENDED that email contacts for the Site Administrators and Security Officer(s) are mailing lists, and not individuals.
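
The minimum registration information listed above can be pictured as a simple record with a completeness check. This is a sketch for illustration: the class and field names are assumptions and do not reflect the GOCDB schema or the full list of required information.

```python
from dataclasses import dataclass, fields


@dataclass
class Contact:
    name: str
    email: str
    telephone: str


@dataclass
class SiteRegistration:
    full_name: str                # full name of the Site
    short_name: str               # abbreviated name, unique within the Grid
    operations_manager: Contact   # Site Operations Manager
    security_contact: Contact     # Site Security Contact
    admin_list_email: str         # managed list for Site Administrators
    security_list_email: str      # managed list for the incident response team


def missing_fields(registration):
    """Return the names of all fields that are empty or whitespace only."""
    missing = []
    for f in fields(registration):
        value = getattr(registration, f.name)
        if isinstance(value, Contact):
            for cf in fields(value):
                if not getattr(value, cf.name).strip():
                    missing.append(f"{f.name}.{cf.name}")
        elif not value.strip():
            missing.append(f.name)
    return missing


# Hypothetical example data; none of these names or addresses are real.
registration = SiteRegistration(
    full_name="Example University Computing Centre",
    short_name="EXAMPLE-SITE",
    operations_manager=Contact("A. Manager", "manager@example.org", "+00 000 0000"),
    security_contact=Contact("B. Security", "security@example.org", ""),
    admin_list_email="site-admins@example.org",
    security_list_email="",
)

gaps = missing_fields(registration)
```

In this example `gaps` lists `security_contact.telephone` and `security_list_email`, the two entries an Operations Centre would have to ask for before completing the registration.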

<Comment: additional constraints - if any - on information that is registered need to be specified here>

Steps

The following steps are only applicable if the Site is not already registered in GOCDB. They describe the steps for a Site Operations Manager that is requesting the respective Site to join the EGI infrastructure.

  • Actions tagged Site are the responsibility of the Site Operations Manager.
  • Actions tagged RIP are the responsibility of the Resource Infrastructure Operations Manager.
  • Actions tagged OC are the responsibility of the Operations Centre
# Responsible Action
0 Site
  1. Contact your Resource Infrastructure Operations Manager (contact information is available at http://www.egi.eu/production-infrastructure/Resource-providers/).
  2. Provide your Resource Infrastructure Operations Manager the required information according to the template available in https://wiki.egi.eu/wiki/SiteCertMan/Required_information
1 RIP
  1. Parse the site registration request, decide to accept or reject it, and communicate this result back to applicant.
  2. If the Site is accepted, notify the relevant Operations Centre, handle the Site information received, and put the Operations Centre in contact with the Site Operations Manager.
2 OC
  1. The following actions can be done in parallel:
    • Forward all necessary and required documentation to install and configure the site services to the Site Operations Manager.
    • Communicate with the Operations Manager to clarify any doubts or questions. Include the Operations Centre ROD or help-desk teams in the step if necessary.
3 OC
  1. Add the site to the GOCDB and flag it as "candidate". Note that all users with a GOCDB role at regional level can add a site in scope (this includes Operations Manager, deputy and regional staff). Currently, GOCDB applies the same permissions to all of the "regional level roles".
  2. Notify the Site Operations Manager that they should register themselves in the GOCDB and request the Site Administrator role. Approve it when done.
4 Site
  1. Complete any missing information for the Site's entry in the GOCDB, including services that are to be integrated into the infrastructure.
  2. Request in the GOCDB (or ask the relevant site security staff to request) the mandatory Site Security Officer role. A security expert is the most appropriate actor for this role. See the GOCDB Input System User Documentation for more information on roles.
  3. Accept or deny all the requested roles under the site scope. Caveat: If the Site Operations Manager cannot approve roles, they should ask the Operations Centre to do so. This is a current flaw in GOCDB.
  4. Notify the Operations Centre that the site information update is concluded.
5 Site or OC
  1. Check whether the site appears in the "Notified Site" field in https://gus.fzk.de/ws/ticket_search.php
  2. Note that this step should happen automatically when the site is correctly entered into the GOCDB. If this is still not visible 2 days after the GOCDB entries have been created, the Operations Centre should be informed and should then contact GGUS administrators.
  3. A new Site Administrator should register in GGUS (https://gus.fzk.de/admin/get_account.php?accounttype=support) but not specify any role, unless directed to by the Operations Centre.
6 OC
  1. Check that the site's information is correct (site roles and any other additional information.)
  2. Check that contacts receive email (if they are mailing lists, check that outside EGI members are allowed to post there).
  3. Check that the required services for a site are properly registered (CE, siteBDII, SE, APEL). Note that for Sites adopting APEL, by registering a new glite-APEL node in GOCDB as gLite-APEL service including the correct DN, the APEL broker Access Control List gets automatically updated and sites can start publishing usage records in about two hours (for more information see the gLite-APEL documentation).<Tiziana Comment: I think that with the new OLA only the siteBDII is required. I suggest we remove the list in brackets. I would be more specific about this in the certification part of the procedure>
  4. Check domain names and DNS.
7 OC
  1. Any other Operations Centre-specific requirements (e.g. join a certain VO and/or mailing list, etc.)
8 OC
  1. If all previous actions have been completed with success, notify the Site Operations Manager that the Registration is completed, and contact the Resource Infrastructure Operations Manager to notify that a new candidate Site exists and is ready to be certified.

After the successful completion of all these steps, the site is considered to be in the "Candidate" state and is ready for the certification process.

Site certification

Requirements

  1. The Site Certification procedure is applicable only to Sites in the "Candidate" or "Suspended" state.
  2. In order to enter certification the Site Operations Managers SHALL accept the Resource Centre OLA.
  3. A Site can successfully pass certification only if the conditions required by the Resource Centre OLA are met.

Steps

The following is a detailed description of the steps required for the transition from the "Uncertified" to the "Certified" state of the site.

  • Actions tagged Site are the responsibility of the Site Operations Manager.
  • Actions tagged RIP are the responsibility of the Resource Infrastructure Operations Manager.
  • Actions tagged OC are the responsibility of the Operations Centre
# Responsible Action
0 RIP
  1. The Resource Infrastructure Operations Manager contacts the Site Operations Manager to request the subscription of the Resource Centre OLA.
1 Site
  1. The Site Operations Manager notifies the Resource Infrastructure Operations Manager that the Resource Centre OLA is accepted (unless the Site has already endorsed it, for example in the case of a suspended Site) and that the Site is ready to start certification.
2 RIP
  1. The Resource Infrastructure Operations Manager contacts the Operations Centre asking to start the certification process.
3 OC
  1. If the site is in the "Candidate" or "Suspended" state, then flag the site as "Uncertified". If it was in the "Suspended" state, check that the reason for suspension has been cleared. If the suspension cause is a security issue, the EGI CSIRT needs to be contacted to verify that all requested repair operations were successfully applied by the Site Administrators to fix the issue that caused the suspension.
4 OC
  1. Check that the GIIS (gLite: BDII) is working, and publishing coherent values, namely:
    • the correct NGI is being published in GlueSiteOtherInfo (see manual MAN01 How to Publish Site Information).
    • all services are registered in GOCDB according to the requirements of the Resource Centre OLA, these services are published, and the services registered in the GOCDB are valid.
    • the OPS VO (monitoring) and the DTEAM VO (troubleshooting) are configured and supported by the Site.
    • regional VOs are configured and supported as needed by the Operations Centre.
    • the Site is integrated in any regional tool as needed (for example, the regional accounting infrastructure if present).

There are detailed examples for how to do this in SiteCertMan/GIIS_BDII_check.
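
As a rough sketch of the first check in this step, the snippet below scans LDIF text (for example, the saved output of an LDAP query against the site BDII) for `GlueSiteOtherInfo` values. The sample LDIF, the site name and the `EGI_NGI=NGI_EXAMPLE` value are assumptions made up for illustration; MAN01 remains the reference for what must actually be published. The parser is deliberately simple and does not handle folded or base64-encoded LDIF lines.

```python
def glue_site_other_info(ldif_text):
    """Collect all GlueSiteOtherInfo values found in LDIF text.

    Limitation: folded lines and base64 values ('attr:: ...') are not decoded.
    """
    values = []
    for line in ldif_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "gluesiteotherinfo":
            values.append(value.strip())
    return values


def publishes_value(ldif_text, expected):
    """Return True if `expected` is one of the published GlueSiteOtherInfo values."""
    return expected in glue_site_other_info(ldif_text)


# Hypothetical LDIF fragment; a real one would come from the site BDII.
SAMPLE_LDIF = """\
dn: GlueSiteUniqueID=EXAMPLE-SITE,mds-vo-name=EXAMPLE-SITE,o=grid
GlueSiteName: EXAMPLE-SITE
GlueSiteOtherInfo: GRID=EGI
GlueSiteOtherInfo: EGI_NGI=NGI_EXAMPLE
"""

published = glue_site_other_info(SAMPLE_LDIF)
ngi_ok = publishes_value(SAMPLE_LDIF, "EGI_NGI=NGI_EXAMPLE")
```

Keeping the parsing separate from the comparison makes it easy to reuse the same helper for other published values that the Operations Centre wants to verify.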

5 OC
  1. Check that the registered services are fully functional by performing manual tests, e.g. from the UI or from the Operations Centre monitoring infrastructure for uncertified sites (instructions). Contact the site admins if there are problems, and ensure that they fix them. Include the ROD and help-desk teams if necessary. Iterate this step with the site admins until the tests pass. The prime tests to check are:
    • network connectivity.
    • CE job submission.
    • SE data transfer.

Details for submitting manual tests can be found at SiteCertMan/Grid_manual_tests.

6 OC
  1. If all preliminary tests are passed for 3 consecutive calendar days, declare an initial maintenance downtime and switch the site status to Certified. This ensures that the site will appear in NAGIOS and GSTAT.
7 OC
  1. After two days, check that the site appears in all operational tools. If there are problems with a specific tool, open GGUS tickets to the relevant Support Units. The most relevant tools are:
    • Regional NAGIOS (NAGIOS)
    • Operations Dashboard (Dashboard-Siteview)
    • GridView
    • GSTAT
    • SAM/Site Functional Tests <Tiziana comment: what does SAM mean in this context? MyEGI?>
8 OC
  1. Ensure that, before the end of the maintenance downtime,
    • all Nagios tests (see above) are passed,
    • accounting data is properly published, and
    • GSTAT is not in an error state. CAVEAT: There may be some problems with this tool and ARC sites.
9 OC
  1. Notify the Site Operations Manager that the site is certified.
10 OC
  1. Add Site contact information to any regional mailing list and provide access to regional tools as required.
11 OC
  1. The NGI can broadcast that a new Site is now part of the EGI infrastructure. This step is OPTIONAL.

After the successful completion of these steps, the site is considered as "Certified".

Revision history

Version Authors Date Comments
0.8 Tiziana Ferrari 2011-03-11 Updated introduction, adopted MUST SHALL etc. terminology, proposed some changes to terminology, added a section with a list of responsibilities, added a few comments into the text to request clarifications.
0.7 Vera Hansper 2011-02-02 Updated introduction to include roles, etc. and added required documentation link for policies