
EGI-InSPIRE:PY1 periodic report (SA1)



Introduction

Contacts with all the NGIs and integrated Resource Infrastructure Providers were successfully established. The Operations Management Board (OMB) – the policy group that leads the technical development of the operational activities – was subsequently constituted and has met every month since June 2010. The Operations Architecture was approved in January 2011. It defines the players of the EGI operations community (Resource Centres, Resource Infrastructure Providers and EGI.eu), their relationships and governance model, the Resource Infrastructure and the Service Infrastructure.

  • Resource Infrastructure. At the end of the first year of EGI-InSPIRE the resource infrastructure comprised 102 PB of disk space, 89 PB of tape space and 239,840 CPU cores (a 24.9% increase since April 2010). The total rises to 338,895 cores when integrated infrastructures (e.g. Canada, China, South American and Caribbean countries) and peer grids (OSG) are included. Integrated Resource Infrastructure Providers are non-EGI-InSPIRE partners who contribute resources to EGI users and consume EGI operational services.

Resources were distributed amongst 338 Resource Centres (345 including those from integrated infrastructures), of which 96 supported MPI. The Resource Centres span 57 countries: EGI-InSPIRE partners contributing resources are present in 51 countries, while in the remaining 6 countries resources are contributed by integrated Resource Infrastructure Providers.

  • Services Infrastructure. The operations services are provided centrally by EGI.eu in collaboration with the EGI-InSPIRE partners (Global Services) and locally by Resource Infrastructure Providers (Local Services) through their respective Operations Centres. The Global Services were gradually and successfully handed over from EGEE to EGI. The Local Services offered by the Resource Providers also ran satisfactorily and seamlessly. At the local level, even though NGIs are very heterogeneous in size and maturity, Resource Centres were operated reliably and the overall availability of EGI services was not affected by the start of operations of the new NGIs.

At the end of the EGEE-III project the Resource Infrastructure was operated by 14 Regional Operational Centres (ROCs): Asia Pacific, Canada, Central Europe, CERN, France, Germany/Switzerland, IGALC, Italy, Latin America, Northern Europe, Russia, South Eastern Europe, South Western Europe, and United Kingdom/Ireland. This scenario evolved considerably during the first project year of EGI-InSPIRE. The largest ROCs (Central Europe and South Eastern Europe) stopped their operations during PQ2 and PQ3 respectively. The EGEE ROCs have consequently developed into a much larger group of smaller Operations Centres, which typically serve a single country. This transition was successfully completed in January 2011 without affecting the infrastructure availability and reliability. EGI comprises 32 EGI Operations Centres operating 40 European National Grid Initiatives and CERN (European Intergovernmental Research Organization). National Grid Infrastructures from the South East and Baltic regions that were not part of EGEE in April 2010 were subsequently integrated during PQ1 into the South Eastern Europe ROC and NGI_NDGF.

During PQ3 a process for requirements gathering was defined and approved by the OMB for the first time. It consists of a phase of requirements gathering and prioritization involving Resource Centres and coordinated by the respective Resource Infrastructure Provider, followed by discussion and prioritization at the OMB. In the final stage the input is presented to the Technology Providers at the TCB level to drive innovation. This process was adopted in January and February 2011 to provide input to the EMI project for the EMI 2.0 release. This is an important milestone, as for the first time the Operations Community was collectively involved in a structured requirements gathering process. The SA1 activity roadmap for the second year of the project was discussed and approved during the January OMB meeting. During the first year several task forces contributed to the progress of operational activities in various areas: network support, GLOBUS and UNICORE integration, NGI deployment use cases of local operational tools and the enhancement of EGI Operational Level Agreements.

TSA1.2 Security

All resources used by the teams, such as mailing lists, wikis and security monitoring servers, were migrated from EGEE to EGI resources under the egi.eu domain. This transition did not impact the availability of the security monitoring services (Pakiti and Nagios security monitoring) covering the whole EGI infrastructure. The operational security teams, CSIRT and the Software Vulnerability Group (SVG), were successfully established during PQ1. SVG has 15 members and has established contacts with the software developers of the main deployed middleware stacks. SVG, jointly with EMI representatives, produced a security assessment plan which identifies the software components within EMI that are going to be assessed and the related timing. The plan also states which software packages have been assessed so far. SVG will improve the handling of software vulnerabilities in the EGI RT to increase automation, including automatic reminders. Both teams have produced their respective operational security procedures, forming milestone MS405. In addition, the EGI-CSIRT procedure for critical vulnerability handling was defined and approved by the OMB. It describes the procedure for dealing with Critical Security Issues where action needs to be taken by a single site or multiple sites. The first phase of Security Service Challenge 4 (SSC4) was completed by EGI CSIRT. In total 13 sites (including all WLCG Tier1 sites) were tested and the evaluation of site performance was completed. During the second year of the project SSC workflows will be streamlined in order to extend the activity to a larger set of Resource Centres. A ticketing system for incident response (RTIR) was set up and personnel were trained in its use.

Incidents and advisories

  • 9 security incidents were handled during the first year (3/PQ1, 2/PQ2, 1/PQ3, 3/PQ4).
  • Security advisories
    • PQ1: one advisory on a vulnerability found in Intel compiler suite
    • PQ2: six security advisories, of which one was critical, two moderate and three high. To mitigate the risk of the critical vulnerability (CVE-2010-3081), EGI CSIRT imposed a 7-day mandatory patching timescale across EGI sites; all EGI sites applied the patch before the deadline to avoid suspension.
    • PQ3: EGI CSIRT issued three security advisories on Linux vulnerabilities, of which one was “critical” and two were “high risk”. EGI CSIRT also assisted all EGI sites in mitigating the critical vulnerability (CVE-2010-4170) within the 7-day deadline; no site was suspended.
    • PQ4: EGI CSIRT issued two security alerts, one rated “high risk” and the other “critical”.
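
The per-quarter figures above can be cross-checked with a few lines of Python. This is only a bookkeeping sketch: the dictionary layout and variable names are illustrative assumptions, and the counts are copied from the list.

```python
# Counts copied from the incident and advisory list; the data layout is illustrative.
incidents_per_quarter = {"PQ1": 3, "PQ2": 2, "PQ3": 1, "PQ4": 3}

# Advisories (PQ4 calls them "alerts") broken down by the severity given in the text.
# The severity of the PQ1 advisory is not stated, so it is recorded as "unspecified".
advisories_per_quarter = {
    "PQ1": {"unspecified": 1},
    "PQ2": {"critical": 1, "high": 3, "moderate": 2},
    "PQ3": {"critical": 1, "high": 2},
    "PQ4": {"critical": 1, "high": 1},
}

total_incidents = sum(incidents_per_quarter.values())
advisories_by_quarter = {
    quarter: sum(severities.values())
    for quarter, severities in advisories_per_quarter.items()
}
total_advisories = sum(advisories_by_quarter.values())

print(f"incidents handled: {total_incidents}")
print(f"advisories and alerts issued: {total_advisories} {advisories_by_quarter}")
```

The incident total agrees with the nine incidents reported for the year, and each quarter's severity breakdown adds up to the number of advisories quoted for that quarter.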

TSA1.3 Service Deployment

The transition from EGEE to EGI was prepared two months before the start of the project for the handover of coordination activities from CERN to EGI, together with the definition of new procedures for the timely staged rollout of new middleware releases. The EGEE Pre-Production Infrastructure was entirely decommissioned and integrated into the production one where feasible. The procedures and the tools needed for automation of staged rollout were gradually developed and refined during the course of the first year. Staged rollout is also applied to the rollout of Service Availability Monitoring software releases, and it will be extended to other distributed operational tools. Staged rollout has progressively expanded in terms of Early Adopter teams and coverage of middleware products. ARC, gLite and UNICORE are now integrated into the process. A staged rollout manager was appointed in each area. The “New Software Release Workflow” was tested with the dry run of several EMI components (Release Candidate 3). Information on Early Adopters is available on the EGI wiki. Towards the end of the year effort was progressively migrated from the staged rollout of gLite 3.1 and gLite 3.2 to EMI. Fortnightly meetings are organized to update the operations community on the status of staged rollout activities, the schedule of future releases, interoperability issues and middleware deployment problems. During PQ1 contact points were promoted and established with other DCIs (Desktop Grids, StratusLab, PRACE) and activities were carried out in the framework of the Infrastructure Policy Group (IPG) and the Production Grid Infrastructure Working Group (PGI-WG) within OGF (OGF 28 – Munich, and OGF 29 – Chicago). During the first year the integration of ARC into the EGI monitoring infrastructure was completed. This required the re-writing of probes to migrate from the old SAM framework to Nagios, the integration of those into the SAM release, the decommissioning of the old SAM infrastructure operated at NDGF for ARC resources, and the integration of the Nagios probes into the Operations Dashboard. Middleware deployment plans were gathered and, based on these, two task forces were constituted to address the integration of GLOBUS and UNICORE resources.

TSA1.6 and TSA1.7 Help desk & Support Teams

Helpdesk

During PQ1 the support infrastructure adapted to the EGI model, following the work already started in the last year of the EGEE-III project. This involved new NGIs setting up their national support tools and processes, and transferring the operations of the former EGEE ROCs to these NGIs. A workflow for this was defined and used to migrate operations. New NGI support units were implemented for the NGIs that have gone through this process so far. xGUS – the GGUS NGI view – is now deployed in production by several NGIs. A new workflow for middleware-related issues was discussed with the Deployed Middleware Support Unit (DMSU) and the Technology Providers. The workflow entered production in January 2011. During PQ4 one of the main areas of work for GGUS was the definition and implementation of technology-related workflows. Work on the technology support workflow continued and resulted in a few refinements and modifications of the technology helpdesk. The middleware support chain is a staged process organized into 1st line support (TPM), 2nd line support through the Deployed Middleware Support Unit (DMSU) and, finally, 3rd line support involving the Technology Providers. The implementation of the support chain was made available with the January 2011 release of GGUS. In addition, the software provisioning workflow was implemented. Through this workflow the Technology Providers can announce releases by submitting a ticket, which is then routed to the EGI-SA2 activity through an interface to the EGI-RT system. Feedback concerning the release is also handled through this ticket, which is assigned back to the Technology Provider with an "accept" or "reject".
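
The staged middleware support chain can be pictured as an ordered escalation path. The Python sketch below is an illustration only and does not reflect how GGUS implements the workflow; the class and function names (`SupportLine`, `next_line`, `escalation_path`) are hypothetical.

```python
from enum import IntEnum
from typing import List, Optional


class SupportLine(IntEnum):
    """Support lines in the order given in the text (names are hypothetical)."""
    TPM = 1                   # 1st line: Ticket Processing Management
    DMSU = 2                  # 2nd line: Deployed Middleware Support Unit
    TECHNOLOGY_PROVIDER = 3   # 3rd line: Technology Providers


def next_line(current: SupportLine) -> Optional[SupportLine]:
    """Return the next support line, or None when already at the 3rd line."""
    if current is SupportLine.TECHNOLOGY_PROVIDER:
        return None
    return SupportLine(current + 1)


def escalation_path(start: SupportLine = SupportLine.TPM) -> List[SupportLine]:
    """List every line a ticket could pass through, starting from `start`."""
    path = [start]
    while (following := next_line(path[-1])) is not None:
        path.append(following)
    return path


print(" -> ".join(line.name for line in escalation_path()))
```

A ticket that is resolved at an earlier stage simply stops there; the sketch only captures the order in which the three lines are consulted.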

Support teams

  • 1st line support. In the last months of EGEE the new Ticket Processing Management (TPM) model with two teams was implemented, and it has provided 1st line support since then. TPM handles 250 tickets per month on average.
  • Regional Operator on Duty (ROD). The first EGI-InSPIRE Regional Operator on Duty (ROD) team workshop was held in June 2010. The transition from EGEE to EGI-InSPIRE required many changes. In the EGI era, ROD teams monitor the status of Resource Centres in their country or region, whereas the Central Operator on Duty (COD) is responsible for the global oversight of the whole EGI infrastructure. The aim is to provide a high-quality grid infrastructure to the user communities. A ROD Newsletter has been released periodically since December 2010 to consolidate the Grid oversight teams (central and local). The purpose of this newsletter is to inform about recent and upcoming developments related to Grid Oversight and to show the support performance indicators for the month.
  • Central Operator on Duty (COD). Since the beginning of the project COD has been responsible for overseeing infrastructure quality and support activities across the various Resource Providers. COD is now also responsible for handling Resource Centre suspension in case of low performance and for following up issues with underperforming Resource Centres on a monthly basis. A new procedure was defined for this. Several cases of Resource Provider unresponsiveness or lack of compliance with established procedures were escalated. All of these have been handled and the overall quality of ROD support has been improving, especially in some of the newly established NGIs. Training sessions for ROD teams were co-located with the EGI-InSPIRE project conferences. COD contributed effort to the definition of various new procedures.
  • Network Support. A network support questionnaire was distributed at the beginning of the project to gather information about network support contact points for each NGI, and to assess network support problems and the existing relationships with the local National Research and Education Networks (NRENs). Following this, a workshop was organized in January 2011 to gather feedback on network support models and on network monitoring and troubleshooting tools that can be used by Resource Centre administrators. The overall strategy for network support was then defined. The activity is organized into two parts: support for network performance problems, where the GARR team provides contact with the NREN PERT service, and support for the deployment of tools for on-demand troubleshooting and network monitoring (PerfSONAR-Lite-TSS, the Grid Jobs based Network monitoring and DownCollector). A network support unit is now available from the EGI helpdesk.

Grid Management

TSA1.4 Deployment of operational tools

A mailing list involving all NGI operational tool administrators was created in July 2010 to serve as a support, communication and coordination channel. The existing central instances of operational tools were migrated to the egi.eu domain to phase out the EGEE domain gridops.org, whose decommissioning is currently scheduled for June 2011. Various new software releases affecting the central tools (SAM, GOCDB and the Operations Portal) were deployed in production in a timely manner, together with the prototype of a new EGI-wide monitoring portal.

SAM

In June 2010 the old SAM submission framework was finally decommissioned as the last step of a long migration process, started during the last quarter of EGEE, from a centralized system to a fully distributed one. This transition was a major success considering the level of distribution of this system, which is proportional to the number of independent Operations Centres. At the end of the first year the following SAM/Nagios instances were in production:

    • 24 NGI instances covering 35 EGI partners;
    • 3 ROC instances covering 4 EGI partners;
    • 1 project instance covering 1 EGI partner;
    • 3 external ROC instances covering the following regions: Canada, IGALC and LA.

The development of Nagios probes for operational tool monitoring is ongoing. The central tool monitoring server now also monitors GOCDB. An approach for the monitoring of uncertified Resource Centres, requiring the deployment of a dedicated set of services, was discussed and will be put in production with the contribution of TSA1.8 effort during the second year of the project. The latest prototype of the central MyEGI instance was deployed in May 2011. Unfortunately, owing to bugs found in the software release, its availability was not broadcast to a wide audience.

Brokers

The ActiveMQ broker production network – deployed in failover mode and used for the distribution of monitoring information – consists of three brokers deployed at CERN, in Croatia and in Greece. An additional broker is deployed for APEL.

  • GOCDB. GOCDB was migrated from GOCDB3 to GOCDB4 in 2010 and to a new hardware platform in February 2011. Since the migration the GOCDB service has been working without outages. Deployment of a GOCDB failover instance has started.
  • Operations Portal. The regionalized operations portal software was released in June 2010 for the first time, and is now deployed at NGI_BY, NGI_CZ, NGI_GRNET and NGI_IBERGRID. The central instance was regularly updated (7 upgrades during the first year).
  • Network monitoring. The web portal with network tools for troubleshooting and monitoring is now hosted by GARR (Italy), together with the network availability monitoring tool (DownCollector) – developed in the framework of the EGEE-III SA2 activity.

The EGI implementation and policies related to the DTEAM and OPS VOs – necessary for monitoring and troubleshooting – were reviewed: the DTEAM and OPS VOs are “global”, and their support is mandatory in all production Resource Centres to ensure site-level troubleshooting (DTEAM) and to have a running Nagios-based monitoring infrastructure (OPS). The deployment of regional monitoring VOs is limited to the monitoring of non-EGI sites. The DTEAM VOMS service – formerly operated at CERN – was migrated to one of EGI’s core services. In parallel, the VO membership management was handed over by CERN to SRCE and EGI.eu. Nagios test terminology was disambiguated and a set of related procedures (monitoring of non-production sites, downtime management of central tools, changing of the AVAILABILITY and OPERATIONS probes, changing of an existing probe and/or integration of a new test) was approved. A discussion on the needs and mechanisms for operational tools failover configuration started. In addition, a procedure for downtime management of central tools was drafted. The following wiki pages relevant for operational tools were created:

  • Operational tools information – the page contains a brief description of each tool and the main links to the tool interfaces and documentation.
  • Operational tools deployment plans – the page contains NGI plans regarding the deployment of regionalised versions of operations tools.

TSA1.5 Accounting

The release to production of the APEL ActiveMQ client in early June 2010 meant that the APEL central repository was ready to accept records through the new messaging communication bus. After this release the infrastructure progressively migrated from R-GMA to the new APEL client based on ActiveMQ. The R-GMA central infrastructure was decommissioned at the end of February 2011, and at the end of the first year 90% of the production/certified sites (excluding those that we know are not using their own accounting solution) had migrated to ActiveMQ APEL publishing. Of the 10% left, only 4% are sites which were previously using R-GMA; the other 6% are sites that have never published. For the accounting portal, the main improvements were the porting to GOCDBPI-V4, to adapt to the upcoming phasing out of GOCDB3, and enhanced views and reports for WLCG Tier2 sites. The initial version of the regionalized accounting portal is already available for deployment. Several NGIs (Germany, Portugal and Spain) have expressed their interest in deploying a regional instance of the portal. The regionalisation plans as well as the general accounting portal development plans for the first year were defined and documented in "MS703 Operational Tools regionalisation work plan". A dedicated GGUS SU has been created to provide specific support for the accounting portal. The APEL documentation from the old GOC WIKI was migrated to the EGI wiki (APEL). A new ActiveMQ STOMP consumer is ready for external testing, and NGS and Hungary are contributing to testing activities. CERN will soon start using this interface for publishing local jobs.

TSA1.8 Core services and availability

Availability and Reliability

The EGEE SLA document was updated to produce an EGI OLA document covering all the agreed and adopted practices. In parallel, a new process was defined and approved for the management of monthly availability and reliability statistics. A new procedure involving the Central Operator on Duty (COD) was created to obtain explanations from sites whose figures fall below the Operational Level Agreement (OLA) requirements. The new procedure – which was prototyped in May 2010 – was necessary to organize the handover of availability and reliability reporting from CERN to EGI.eu. AUTH is the partner responsible for validating and distributing the performance reports produced monthly. During the second half of the year a new suspension policy for underperforming Resource Centres, requiring an increase of the current availability threshold from 50% to 70%, was assessed. The impact on the production infrastructure was considered to be minimal; for this reason the OMB approved the adoption of the new policy in April 2011. A task force was organized to define the medium-term EGI OLA roadmap. The first result of this task force was the proposal of various changes to the existing OLA, which led to a new Resource Centre OLA approved in May 2011. Implications of OLA extensions on tool development plans were discussed during a set of dedicated meetings. During the second year the task force will focus on the Resource Provider and EGI.eu OLAs.
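
To illustrate how raising the availability threshold from 50% to 70% changes which Resource Centres are flagged, here is a minimal Python sketch. The site names and availability figures are invented for the example, the function name `below_threshold` is hypothetical, and the use of a strict comparison on a single month's figure is an assumption rather than something the policy text specifies.

```python
# Availability thresholds quoted in the text (percent).
OLD_THRESHOLD = 50.0
NEW_THRESHOLD = 70.0


def below_threshold(availability_by_site, threshold):
    """Return the sites whose monthly availability is strictly below `threshold`.

    Whether the real policy uses a strict or inclusive comparison is an assumption.
    """
    return sorted(
        site
        for site, availability in availability_by_site.items()
        if availability < threshold
    )


# Hypothetical sites and figures, for illustration only.
sample_month = {"SITE-A": 98.2, "SITE-B": 64.0, "SITE-C": 45.5, "SITE-D": 70.0}

print("flagged under old threshold:", below_threshold(sample_month, OLD_THRESHOLD))
print("flagged under new threshold:", below_threshold(sample_month, NEW_THRESHOLD))
```

Running the same comparison against real monthly figures under both thresholds is one way the impact of such a policy change could be estimated before adoption.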

Core services

A WMS and a MyProxy service – for the EGI Nagios Security Monitoring tool – were installed and entered production. In parallel a new VOMS/VOMRS server was set up to host the DTEAM VO. Data from the VOMS server at CERN were migrated to the new VOMS server hosted at AUTH. In PQ1 EUGridPMA accredited the SEE-GRID CA to provide EGI “Catch All” CA services to VOs. The current network of Registration Authorities covers Albania, Azerbaijan, Bosnia and Herzegovina, and Georgia. The migration of the DTEAM VOMS service from CERN was finalized. In addition, a procedure was defined for providing “Catch All” VOMS services for newly created VOs.

Documentation

Existing EGEE documentation has been progressively migrated to the EGI wiki and made accessible at Documentation. This is a particularly challenging task, as EGEE documentation is distributed across various document servers (e.g. EDMS and GOC WIKI). Documentation that was migrated was also updated to the new operational environment of EGI. The migration of documentation is still in progress. The EGI documentation page includes pointers to approved manuals, best practices, procedures, FAQs and Training Guides. A series of meetings was organized to decide the set of categories and templates to be used to facilitate navigation and document categorization. During the first year several new documents were produced: 3 manuals, 2 FAQs and 8 new procedures. The following procedures were drafted and approved:

  • COD Escalation Procedure
  • Operations Centre Creation
  • Operations Centre decommissioning
  • Quality verification of monthly availability and reliability statistics
  • Validation of an Operations Centre Nagios
  • Setting a Nagios test status to OPERATIONS
  • Adding new probes to SAM
  • Management of the EGI OPS Availability and Reliability Profile