EGI-InSPIRE:PY1 periodic report (SA1)



Introduction

Contacts with all the NGIs and integrated Resource Infrastructure Providers were successfully established. The Operations Management Board (OMB), the policy group that leads the technical development of the operational activities, was subsequently constituted and has met monthly since June 2010. The Operations Architecture was approved in January 2011. It defines the players of the EGI operations community (Resource Centres, Resource Infrastructure Providers and EGI.eu), their relationships and governance model, the Resource Infrastructure and the Service Infrastructure.

Resources were distributed amongst 338 Resource Centres (345 Resource Centres including those from integrated infrastructures), of which 96 supported MPI. The Resource Centres span 57 countries: EGI-InSPIRE partners contributing resources are present in 51 countries, while in the remaining 6 countries resources are contributed by integrated Resource Infrastructure Providers.

At the end of the EGEE-III project the Resource Infrastructure was operated by 14 Regional Operational Centres (ROCs): Asia Pacific, Canada, Central Europe, CERN, France, Germany/Switzerland, IGALC, Italy, Latin America, Northern Europe, Russia, South Eastern Europe, South Western Europe, and United Kingdom/Ireland. This scenario evolved considerably during the first project year of EGI-InSPIRE. The largest ROCs (Central Europe and South East Europe) stopped their operations during PQ2 and PQ3 respectively. The EGEE ROCs consequently developed into a much larger group of smaller Operations Centres, each of which typically serves a single country. This transition was successfully completed in January 2011 without affecting the availability and reliability of the infrastructure. EGI comprises 32 EGI Operations Centres operating 40 European National Grid Initiatives and CERN (a European Intergovernmental Research Organization). National Grid Infrastructures from the South East and Baltic regions that were not part of EGEE in April 2010 were subsequently integrated during PQ1 into the South East Europe ROC and NGI_NDGF.

During PQ3 a process for requirements gathering was defined and approved by the OMB for the first time. It consists of a phase of requirements gathering and prioritization involving the Resource Centres and coordinated by the respective Resource Infrastructure Provider, followed by discussion and prioritization at the OMB. In the final stage the input is presented to the Technology Providers at the TCB level to drive innovation. This process was used in January and February 2011 to provide input to the EMI project for the EMI 2.0 release. This is an important milestone, as for the first time the Operations Community was collectively involved in a structured requirements gathering process. The SA1 activity roadmap for the second year of the project was discussed and approved during the January OMB meeting. During the first year several task forces contributed to the progress of operational activities in various areas: network support, GLOBUS and UNICORE integration, NGI deployment use cases of local operational tools and the enhancement of EGI Operational Level Agreements.

TSA1.2 Security

All resources used by the teams, such as mailing lists, wikis and security monitoring servers, were migrated from EGEE to EGI resources under the egi.eu domain. This transition did not affect the availability of the security monitoring services (Pakiti and Nagios security monitoring) covering the whole EGI infrastructure. The operational security teams, the CSIRT and the Software Vulnerability Group (SVG), were successfully established during PQ1. SVG has 15 members and has established contacts with the software developers of the main deployed middleware stacks. SVG, jointly with EMI representatives, produced a security assessment plan which identifies the software components within EMI that are going to be assessed and the related timing. The plan also states which software packages have been assessed so far. SVG will improve the handling of software vulnerabilities in the EGI RT to increase automation, including automatic reminders. Both teams have produced their respective operational security procedures, which form milestone MS405. In addition, the EGI CSIRT procedure for critical vulnerability handling was defined and approved by the OMB. It describes how to deal with Critical Security Issues where action needs to be taken by a single site or by multiple sites. The first phase of Security Service Challenge 4 (SSC4) was completed by EGI CSIRT. In total 13 sites (including all WLCG Tier1 sites) were tested and the evaluation of site performance was completed. During the second year of the project SSC workflows will be streamlined in order to extend the activity to a larger set of Resource Centres. A ticketing system for incident response (RTIR) was set up and personnel were trained in its use.

Incidents and advisories

TSA1.3 Service Deployment

Preparation for the transition from EGEE to EGI started two months before the start of the project, covering the handover of coordination activities from CERN to EGI and the definition of new procedures for the timely staged rollout of new middleware releases. The EGEE Pre-Production Infrastructure was entirely decommissioned and integrated into the production one where feasible. The procedures and the tools needed for the automation of staged rollout were gradually developed and refined during the course of the first year. Staged rollout is also applied to Service Availability Monitoring software releases, and it will be extended to other distributed operational tools. Staged rollout has progressively expanded in terms of Early Adopter teams and coverage of middleware products: ARC, gLite and UNICORE are now integrated into the process, and a staged rollout manager was appointed in each area. The “New Software Release Workflow” was tested with the dry run of several EMI components (Release Candidate 3). Information on Early Adopters is available on the EGI wiki. Towards the end of the year effort was progressively migrated from the staged rollout of gLite 3.1 and gLite 3.2 to EMI. Fortnightly meetings are organized to update the operations community on the status of staged rollout activities, the schedule of future releases, interoperability issues and middleware deployment problems.

During PQ1 contact points were promoted and established with other DCIs (Desktop Grids, StratusLab, PRACE), and activities were carried out in the framework of the Infrastructure Policy Group (IPG) and the Production Grid Infrastructure Working Group (PGI-WG) within OGF (OGF 28 – Munich, and OGF 29 – Chicago). During the first year the integration of ARC into the EGI monitoring infrastructure was completed. This required the rewriting of probes to migrate from the old SAM framework to Nagios, the integration of those probes into the SAM release, the decommissioning of the old SAM infrastructure operated at NDGF for ARC resources, and the integration of the Nagios probes into the Operations Dashboard. Middleware deployment plans were gathered and, on the basis of these, two task forces were constituted to address the integration of GLOBUS and UNICORE resources.

TSA1.6 and TSA1.7 Help desk & Support Teams

Helpdesk

During PQ1 the support infrastructure was adapted to the EGI model, continuing work started in the last year of the EGEE-III project. This involved new NGIs setting up their national support tools and processes, and the transfer of operations from the former EGEE ROCs to these NGIs. A workflow for this was defined and used to migrate operations. New NGI support units were implemented for the NGIs that have gone through this process. xGUS, the GGUS NGI view, is now deployed in production by several NGIs. A new workflow for middleware-related issues was discussed with the Deployed Middleware Support Unit (DMSU) and the Technology Providers. The workflow entered production in January 2011. During PQ4 one of the main areas of work for GGUS was the definition and implementation of technology-related workflows. Work on the technology support workflow continued and resulted in a few refinements and modifications of the technology helpdesk. The middleware support chain is a staged process organized into 1st line support (TPM), 2nd line support through the Deployed Middleware Support Unit (DMSU) and, finally, 3rd line support involving the Technology Providers. The implementation of the support chain was made available with the January 2011 release of GGUS. In addition, the software provisioning workflow was implemented. Through this workflow the Technology Providers can announce releases by submitting a ticket, which is then routed to the EGI-SA2 activity through an interface to the EGI-RT system. Feedback concerning the release is handled through the same ticket, which is assigned back to the Technology Provider with an "accept" or "reject".
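The staged support chain described above can be illustrated with a minimal sketch. The `Ticket` class, its fields and the `escalate` method are hypothetical names chosen for illustration; they do not reflect the GGUS data model or API, and only the three support lines themselves come from the text.

```python
from dataclasses import dataclass, field

# The three support lines named in the text; the list itself is illustrative.
SUPPORT_CHAIN = ["TPM", "DMSU", "Technology Provider"]


@dataclass
class Ticket:
    """Hypothetical ticket that moves through the support lines in order."""

    summary: str
    line: int = 0  # index into SUPPORT_CHAIN; every ticket starts at 1st line
    history: list = field(default_factory=list)

    @property
    def assigned_to(self) -> str:
        return SUPPORT_CHAIN[self.line]

    def escalate(self) -> str:
        """Pass the ticket to the next support line, if there is one."""
        if self.line + 1 >= len(SUPPORT_CHAIN):
            raise ValueError("ticket is already at the last support line")
        self.history.append(self.assigned_to)
        self.line += 1
        return self.assigned_to


ticket = Ticket("job submission fails after middleware update")
ticket.escalate()  # TPM -> DMSU
ticket.escalate()  # DMSU -> Technology Provider
print(ticket.assigned_to, ticket.history)
```

The sketch only models forward escalation; reassignment back to an earlier line, as happens with the "accept" or "reject" feedback in the software provisioning workflow, is left out.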

Support teams

Grid Management

TSA1.4 Deployment of operational tools

A mailing list involving all NGI operational tool administrators was created in July 2010 as a support, communication and coordination channel. The existing central instances of operational tools were migrated to the egi.eu domain to phase out the EGEE domain gridops.org, whose decommissioning is currently scheduled for June 2011. Various new software releases of the central tools (SAM, GOCDB and the Operations Portal) were deployed in production in a timely manner, together with the prototype of a new EGI-wide monitoring portal.

SAM

In June 2010 the old SAM submission framework was decommissioned as the last step of a long migration from a centralized system to a fully distributed one, which started during the last quarter of EGEE. This transition was a major success considering the level of distribution of this system, which is proportional to the number of independent Operations Centres. At the end of the first year the following SAM/Nagios instances were in production:

The development of Nagios probes for operational tool monitoring is ongoing. The central tool monitoring server now also monitors GOCDB. An approach for the monitoring of uncertified Resource Centres, which requires the deployment of a dedicated set of services, was discussed and will be put into production with the contribution of TSA1.8 effort during the second year of the project. The latest prototype of the central MyEGI instance was deployed in May 2011. Unfortunately, due to bugs found in the software release, its availability was not broadcast to a wide audience.

Brokers

The ActiveMQ broker production network, which is deployed in failover mode and used for the distribution of monitoring information, consists of three brokers deployed at CERN, in Croatia and in Greece. An additional broker is deployed for APEL.
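One possible reading of "failover mode" from the client side is that a client tries the brokers in order and uses the first one that answers. The sketch below illustrates that idea only; the hostnames, the `connect_with_failover` helper and the `fake_connect` stand-in are assumptions made for the example and are not the production configuration or any ActiveMQ client API.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def connect_with_failover(
    brokers: Sequence[str], connect: Callable[[str], T]
) -> Tuple[str, T]:
    """Try each broker in order and return the first successful connection."""
    last_error = None
    for host in brokers:
        try:
            return host, connect(host)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next broker
    raise ConnectionError("no broker reachable") from last_error


# Hypothetical endpoints standing in for the three production brokers.
BROKERS = ["broker-a.example.org", "broker-b.example.org", "broker-c.example.org"]


def fake_connect(host: str) -> str:
    """Stand-in for a real messaging client: the first broker is down."""
    if host == "broker-a.example.org":
        raise ConnectionError(f"{host} is unreachable")
    return f"session@{host}"


host, session = connect_with_failover(BROKERS, fake_connect)
print(host, session)
```

Passing the connection function as a parameter keeps the failover logic independent of the messaging library actually used.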

The EGI implementation and policies related to the DTEAM and OPS VOs, which are necessary for monitoring and troubleshooting, were reviewed: the DTEAM and OPS VOs are “global”, and their support is mandatory in all production Resource Centres to ensure site-level troubleshooting (DTEAM) and to have a running Nagios-based monitoring infrastructure (OPS). The deployment of regional monitoring VOs is limited to the monitoring of non-EGI sites. The DTEAM VOMS service, formerly operated at CERN, was migrated to one of EGI’s core services. In parallel, the VO membership management was handed over by CERN to SRCE and EGI.eu. Nagios test terminology was disambiguated and a set of related procedures (monitoring of non-production sites, downtime management of central tools, changing of the AVAILABILITY and OPERATIONS probes, changing of an existing probe and/or the integration of new tests) were approved. A discussion on the needs and mechanisms for failover configuration of operational tools started. In addition, a procedure for downtime management of central tools was drafted. The following wiki pages relevant for operational tools were created:

TSA1.5 Accounting

With the release to production of the APEL ActiveMQ client in early June 2010, the APEL central repository was ready to accept records through the new messaging communication bus. After this release the infrastructure progressively migrated from R-GMA to the new APEL client based on ActiveMQ. The R-GMA central infrastructure was decommissioned at the end of February 2011, and at the end of the first year 90% of the production/certified sites (excluding those that we know are not using their own accounting solution) had migrated to ActiveMQ APEL publishing. Of the remaining 10%, only 4% are sites which were previously using R-GMA; the other 6% are sites that have never published. For the accounting portal, the main improvements were the porting to GOCDBPI-V4, in preparation for the upcoming phasing out of GOCDB3, and enhanced views and reports for WLCG Tier2 sites. The initial version of the regionalized accounting portal is already available for deployment. Several NGIs (Germany, Portugal and Spain) have expressed their interest in deploying a regional instance of the portal. The regionalisation plans as well as the general accounting portal development plans for the first year were defined and documented in "MS703 Operational Tools regionalisation work plan". A dedicated GGUS SU has been created to provide specific support for the accounting portal. The APEL documentation from the old GOC WIKI was migrated to the EGI wiki (APEL). A new ActiveMQ STOMP consumer is ready for external testing, and NGS and Hungary are contributing to testing activities. CERN will soon start using this interface to publish local jobs.

TSA1.8 Core services and availability

Availability and Reliability

The EGEE SLA document was updated to produce an EGI OLA document covering all the agreed and adopted practices. In parallel, a new process was defined and approved for the management of monthly availability and reliability statistics. A new procedure involving the Central Operator on Duty (COD) was created to obtain explanations from sites whose figures fall below the Operational Level Agreement (OLA) requirements. The new procedure, which was prototyped in May 2010, was necessary to organize the handover of availability and reliability reporting from CERN to EGI.eu. AUTH is the partner responsible for validating and distributing the performance reports produced monthly. During the second half of the year a new suspension policy for underperforming Resource Centres, which raises the current availability threshold from 50% to 70%, was assessed. The impact on the production infrastructure was considered to be minimal, and for this reason the OMB approved the adoption of the new policy in April 2011. A task force was organized to define the medium-term EGI OLA roadmap. The first result of this task force was the proposal of various changes to the existing OLA, which led to a new Resource Centre OLA that was approved in May 2011. Implications of OLA extensions on tool development plans were discussed during a set of dedicated meetings. During the second year the task force will focus on the Resource Provider and EGI.eu OLAs.
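As an illustration of how monthly figures could be checked against the 70% suspension threshold, the sketch below computes availability and reliability for one Resource Centre. The formulas are an assumption based on a commonly used convention (unknown periods are excluded from both metrics, and scheduled downtime is additionally excluded from reliability); the report itself does not state them. The strict "less than" comparison and all function names are likewise illustrative.

```python
SUSPENSION_THRESHOLD = 0.70  # availability threshold from the new suspension policy


def availability(up: float, total: float, unknown: float = 0.0) -> float:
    """Fraction of known time in which the site was up (assumed definition)."""
    known = total - unknown
    if known <= 0:
        raise ValueError("no known time in the reporting period")
    return up / known


def reliability(
    up: float, total: float, scheduled_down: float = 0.0, unknown: float = 0.0
) -> float:
    """Like availability, but scheduled downtime is not counted against the site."""
    expected = total - scheduled_down - unknown
    if expected <= 0:
        raise ValueError("no expected uptime in the reporting period")
    return up / expected


def below_suspension_threshold(avail: float) -> bool:
    """True if the monthly availability falls under the policy threshold."""
    return avail < SUSPENSION_THRESHOLD


# Example month of 720 hours: 480 h up, 120 h scheduled downtime, no unknown time.
a = availability(up=480, total=720)
r = reliability(up=480, total=720, scheduled_down=120)
print(f"availability={a:.3f} reliability={r:.3f} flagged={below_suspension_threshold(a)}")
```

In this example the site is flagged because 480/720 is below 0.70, even though its reliability (480/600) is 0.80.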

Core services

A WMS and a MyProxy service for the EGI Nagios Security Monitoring tool were installed and entered production. In parallel a new VOMS/VOMRS server was set up to host the DTEAM VO. Data from the VOMS server at CERN were migrated to the new VOMS server hosted at AUTH. In PQ1 EUGridPMA accredited the SEE-GRID CA to provide EGI “Catch All” CA services to VOs. The current network of Registration Authorities covers Albania, Azerbaijan, Bosnia-Herzegovina and Georgia. The migration of the DTEAM VOMS service from CERN was finalized. In addition, a procedure was defined for providing Catch All VOMS services to newly created VOs.

Documentation

Existing EGEE documentation has been progressively migrated to the EGI wiki and made accessible at Documentation. This is a particularly challenging task, as EGEE documentation is distributed across various document servers (e.g. EDMS and GOC WIKI). Documentation that was migrated was also updated to reflect the new operational environment of EGI. The migration of documentation is still in progress. The EGI documentation page includes pointers to approved manuals, best practices, procedures, FAQs and Training Guides. A series of meetings was organized to decide the set of categories and templates to be used to facilitate navigation and document categorization. During the first year several new documents were produced: 3 manuals, 2 FAQs and 8 new procedures were drafted and approved:
