EGI-InSPIRE:PY2 periodic report (SA1)



Executive Summary

SA1 was responsible for the continued operation and expansion of the production infrastructure. The transition begun in PY1, which evolved the EGEE federated Operations Centres into independent NGIs, was completed. The total number of Resource Centres (RCs) in March 2012 amounts to 352 instances (+3.22% yearly increase). The installed capacity grew considerably to comprise 270,800 logical cores (+30.7% yearly increase), 2.96 Million HEP-SPEC 06 (+49.5%), 139 PB of disk space (+31.4%) and 134.3 PB of tape (+50%).

EGI currently comprises 27 national operations centres and 9 federated operations centres encompassing multiple NGIs. Availability and Reliability reached 94.50% and 95.42% respectively (yearly average), a +1% increase in PY2. Overall resource utilization progressed satisfactorily, confirming the trends of PY1. The total number of jobs executed in the infrastructure from May 2011 to April 2012 increased by +46.42% over the workload of May 2010 to April 2011. The overall quantity of EGI computing resources used in PY2 amounts to 10.5 Billion HEP-SPEC 06 hours.

Operational security was run effectively during PY2, ensuring day-by-day security monitoring and timely response in case of incidents. Security in EGI was reviewed following the PY1 reviewers’ suggestions and documented in Deliverable D4.4. The EGI Security Threat Risk Assessment team was formed: 75 threats in 20 categories were identified, and an initial risk assessment and preliminary report were produced describing the assessment process, progress and initial findings. Specialized tools for incident response tracking and for the streamlining of operational security tasks were prototyped and rolled to production.

The Staged Rollout workflow introduced during PY1 is being progressively refined. The Staged Rollout infrastructure has been gradually expanding to reflect the deployment needs of VRCs and NGIs, and resources were reallocated to ensure testing of a broader range of products. The staged rollout infrastructure currently comprises 60 Early Adopter teams.

The operations integration of GLOBUS, UNICORE, QosCosGrid and Desktop Grids was completed, with the exception of accounting, which requires further integration development. Extensions are being implemented in collaboration with the external technology providers.

GGUS was updated to decommission various legacy support units, and to add new ones for VO support, operations support and 3rd level support. A new report generator was designed and prototyped. GGUS FAQs were migrated to the EGI wiki, usability of the system was enhanced and GGUS was interfaced to a new helpdesk system (Service NOW). The GGUS failover configuration was hardened with auto-switching between different front-ends.

VO SAM, the VO Admin Dashboard and LFCBrowseSE are now mature systems supporting VO operations, and are being deployed by interested NGIs and/or VOs to assist them in daily VO operations and management. The first prototype of the VO Operations Portal – released by JRA1 and fully integrated into the Operations Portal – was deployed, and feedback was provided ahead of its rollout to production.

Central Grid Oversight (COD) of EGI was responsible for the certification of new NGIs, created either as a result of legacy EGEE federated operations centres stopping operations or because of new Resource Providers joining the infrastructure. COD was involved in training and dissemination activities, in the follow-up of underperformance at both Resource Centre and Resource Provider level, and in monitoring instabilities of the distributed SAM infrastructure.

The EGI.eu central tools were significantly advanced. The first Metrics Portal was rolled to production in PQ6. The message broker network was repeatedly upgraded to improve the reliability of message delivery, stability, manageability and scalability. The transition of the accounting infrastructure from R-GMA to messaging was completed, and a new central consumer based on ActiveMQ STOMP was deployed in pre-production. The Canopus release of the accounting portal (v4.0) brought, among other things, many bug fixes, extended FQAN-based views and new graphics. GOCDB functionality was also significantly extended with support for virtual sites, new roles and permissions, scoping of Resource Centres and sites, and a hardened DNS-based failover configuration. The Service Availability Monitoring (SAM) framework underwent five different upgrades and is currently the largest and most distributed operational infrastructure, comprising 32 distributed instances. The Operations Portal rolled new major components to production: the VO Dashboard and the Security Dashboard. In addition, the VO management features were greatly enhanced.

The EGI Operations Level Agreement framework was considerably extended in PY2 with the first Resource Centre Operational Level Agreement, defining the target levels of the services provided by sites for resource access, and the Resource infrastructure Provider Operational Level Agreement, defining the target levels of the community services provided by the NGIs, which came into force in January 2012.

A new set of catch-all services for the monitoring of uncertified Resource Centres was rolled to production. Legacy EGEE documentation pages were phased out or updated and migrated to the EGI wiki, three new operational procedures were approved, and training and support pages were improved.

SA1.1 Activity Management

During PY2 SA1 was responsible for the continued operation and expansion of the production infrastructure. The transition begun in PY1, which evolved the EGEE federated Operations Centres into independent NGIs, was completed. The total number of Resource Centres (RCs) in March 2012 amounts to 352 instances (+3.22% yearly increase). The installed capacity grew considerably to comprise 270,800 logical cores (+30.7% yearly increase), 2.96 Million HEP-SPEC 06 (+49.5%), 139 PB of disk space (+31.4%) and 134.3 PB of tape (+50%).

EGI currently comprises 27 national operations centres and 9 federated operations centres encompassing multiple NGIs. Availability and Reliability reached 94.50% and 95.42% respectively (yearly average), a +1% increase in PY2. Overall resource utilization progressed satisfactorily, confirming the trends of PY1. The total number of jobs executed in the infrastructure from May 2011 to April 2012 increased by +46.42% over the workload of May 2010 to April 2011. The overall quantity of EGI computing resources used in PY2 amounts to 10.5 Billion HEP-SPEC 06 hours.

The Operations Management Board (OMB) functioned well: 11 meetings were organized, of which two were co-located with the main project events. Operations liaisons with Open Science Grid were strengthened. Fortnightly operations meetings were run regularly. SA1 ran five task forces and coordinated the TCB task force on accounting.

The OMB contributed to the EGI sustainability workshop (January 2012) and to middleware sustainability discussions by providing information on the status, costs and sustainability plans of NGI operations services, and by collecting priorities for currently deployed software. The operations service portfolio was defined and documented in the EGI Operations Architecture: Grid Service Management Best Practices (Deliverable D4.3). This is important for future sustainability planning and the estimation of operations costs.

SA1.2 Operations Security

Operational security was run effectively during PY2, ensuring day-by-day security monitoring and timely response in case of incidents. Security in EGI was reviewed following the PY1 reviewers’ suggestions and documented in Deliverable D4.4: the scope and aims of EGI security were reviewed, including the assets that EGI security seeks to protect. The work of the various security groups in or associated with EGI was presented. Practices and standards for IT security, their current usage and their possible future usage were analysed, and plans for a security risk assessment were defined.

The EGI Security Threat Risk Assessment team was formed. 75 threats in 20 categories were identified, and an initial risk assessment and preliminary report were produced describing the assessment process, progress and initial findings. The risks were computed by asking members of the group to rate the likelihood and impact of each threat on a scale from 1 to 5, and multiplying the two factors together. The average of these risks across members was taken as the basis for the preliminary report, with a minimal amount of discussion and without any refinement after the initial assessment. From this initial assessment, threats with a risk of 8 or more were reported; there are 13 of these. Issues with a high impact (4 or more) were also reported.
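The scoring scheme just described amounts to a small calculation, sketched below in Python. The sample ratings and function names are invented for illustration; the team's actual tooling is not described in the report.

```python
from statistics import mean

def assess_threat(ratings):
    """ratings: list of (likelihood, impact) pairs, one per team member,
    each rated on a 1-5 scale. Returns (average risk, average impact),
    where one member's risk is likelihood * impact."""
    risks = [likelihood * impact for likelihood, impact in ratings]
    return mean(risks), mean(impact for _, impact in ratings)

def should_report(ratings, risk_threshold=8, impact_threshold=4):
    """A threat is reported if its average risk is 8 or more,
    or its average impact is 4 or more."""
    avg_risk, avg_impact = assess_threat(ratings)
    return avg_risk >= risk_threshold or avg_impact >= impact_threshold

# Three hypothetical assessors rate one threat:
ratings = [(2, 4), (3, 3), (2, 5)]   # per-member risks: 8, 9, 10
print(should_report(ratings))        # True: average risk 9.0 >= 8
```

Averaging after multiplying (rather than multiplying the averages) matches the procedure described above: each member's risk is computed first, then the mean over members is taken.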

Specialized tools for incident response tracking and for the streamlining of operational security tasks were prototyped, and in some cases already rolled to production. These were introduced to support the security drills framework and were extensively used during the Security Service Challenge 5 (SSC5), a cross-NGI security challenge. Various training events were organized, and were well attended. Existing procedures were updated, and new procedures were drafted and approved. As for statistics, during PY2 10 potential software vulnerabilities were reported to SVG, and SVG issued seven advisories. EGI CSIRT reported on seven incidents and issued five security alerts, of which two were critical and three were high risk.

Procedures

The EGI Security Incident Handling Procedure and the EGI Software Vulnerability Issue Handling Procedure were both updated, and a new one was approved: the EGI CSIRT Critical Vulnerability Operational Procedure. Milestone MS412 (Operational Security Procedures) is the entry point to the EGI operational security procedure framework and gives an overview of operational security processes.

Vulnerability Assessment of Grid Middleware

Security Vulnerability Assessment is the proactive examination of software to find vulnerabilities that may exist. A Security Assessment Plan was defined at the beginning of PY2 to identify which software components within EMI would be assessed and when the assessments would take place. The document also presents the status of software packages previously assessed.

Support tools

EGI CSIRT monitors the EGI infrastructure and notifies sites exposing security vulnerabilities, in order to detect weaknesses before security incidents occur. Results detected by security monitoring are made available to the sites and NGIs. The EGI Operations Portal has been extended with a Security Dashboard, which aggregates information from the CSIRT monitoring tools and presents it in a uniform manner to authorized users. The first prototype was made available during PQ7 and was rolled to production in PQ8.

The security probes were extended to improve the accuracy of tests and to limit the number of false positives. The Nagios instance responsible for security monitoring was moved under the egi.eu domain (secmon.egi.eu). Additional improvements were applied to the Pakiti service, which detects known vulnerable packages installed at the EGI sites, mainly to improve its scalability and robustness in handling data from the whole EGI infrastructure. A ticketing system to support incident response – RT for Incident Response (RTIR) – was put into production.

Security Service Challenge (SSC)

The purpose of EGI SSCs is to investigate whether sufficient information is available to conduct an audit trail as part of an incident response, and to ensure that procedures are enforced and that appropriate communication channels are available and used. An SSC comprises activities in three areas: communication, containment and forensics. The performance of Resource Centres in these three areas is recorded. The role of EGI and the NGIs in the SSC is to provide coordination and to be a single point of contact for entities such as VOs and Certification Authorities. Coordination includes making sure that the obtained forensic results are processed and that the resulting information is fed back to the affected sites.

SSC5 was a multi-site incident simulation exercise. A multi-site grid incident affects multiple NGIs and involves the entire EGI ecosystem (Resource Centres, NGIs, Certification Authorities, User Communities, security teams etc.). SSC5 was therefore created as a realistic scenario involving 40 Resource Centres across more than 20 countries, and a VO-specific job submission framework (PANDA) was adopted. A VO job-submission framework adds extra complexity to the containment of the simulated incident at Resource Centre level. SSC5 was successfully completed in June 2011. An SSC framework was developed to scale up to a larger number of sites. This framework allows for job submission (using different methods), storage operations, the definition of a set of tasks (communication, user/process management) with target times, the automated generation of reports and a scoring schema, and the recording of history and monitoring of progress.

SSC5 showed that EGI CSIRT, in collaboration with the VO CSIRT and the Certification Authorities, is able to handle a multi-site incident. For efficiency purposes it is crucial for EGI CSIRT to collaborate closely with the VO CSIRT and the Certification Authorities; this collaboration was very good during this exercise.

Security training

Various training events for Resource Centre administrators were organized, aiming at spreading knowledge about strategies for incident prevention (security monitoring tools, and system hardening and patching), handling and containment (incident response procedures and mechanisms to control access to grid resources), computer forensics, and post-mortem analysis of compromised systems. Two security training sessions were organized at the Technical Forum 2011 and during the EGI CSIRT face-to-face meeting in April. The training was very well received and new training sessions are being prepared for the Technical Forum 2012.

Coordination

EGI CSIRT activities are coordinated through weekly operations meetings and monthly team meetings (two of which are face to face). Since May 2011 the Security Vulnerability Group (SVG) has also been holding monthly meetings to streamline coordination.

SA1.3 Service Deployment

Staged rollout

During PY2 staged rollout activities were reviewed to ensure the early adoption of software releases from EGI's main external technology providers (EMI and IGE). Resources, previously concentrated on the staged rollout of gLite 3.2, were reallocated to the testing of EMI and IGE software releases.

The Staged Rollout workflow introduced during PY1 was refined during the first year of EGI-InSPIRE according to the experience gathered while testing EMI 1. This was done in parallel with the construction of the Staged Rollout infrastructure, which has been gradually expanding to reflect the deployment needs of VRCs and NGIs. The process was documented in Milestone MS409, and a new review cycle started in PQ8 to incorporate additional improvements suggested by experience. A new provisioning procedure is now being discussed, proposing the usage of a new type of UMD repository (“testing”) containing all packages from the moment they are released by the technology providers and allowing more than one version of a given product to be tested at the same time. The new procedure reduces the time lag between the release by a technology provider and the availability in the UMD repositories, as new versions will be immediately available in the UMD testing repository. The testing repository will be disabled by default, and all sites will be able to point to UMD repositories for all services (only the base and updates repositories will be enabled by default). The largest number of products was tested in PQ5 in preparation for the release of UMD 1.0 (81 tests in total, resulting in 2 products rejected). This number gradually decreased in the following quarters, as subsequent UMD updates only included a subset of products being updated. 122 components were tested in total during PY2, of which eight were rejected. 192 staged rollout tests were undertaken, and the number of Early Adopter teams has been progressively increasing to test a growing set of products from EMI, IGE and EGI-InSPIRE JRA1 (operational tools); it currently amounts to 60 teams. IGE 1 product releases underwent software provisioning for UMD 1.2, broadening the range of products released in UMD.
Two EMI products, VOMS/Oracle and FTS, were not included in software provisioning: VOMS/Oracle because of its very limited deployment scope, and FTS because it is specific to one user community, which is already in charge of its testing and staged rollout.

ARC and UNICORE are part of UMD through EMI; extensive work was carried out to achieve fully operational interoperability of sites deploying those products.

Interoperations

The full integration of ARC was accomplished in PY2, while the integration of GLOBUS and UNICORE was almost completed in PY2. Two task forces dedicated to GLOBUS and UNICORE, involving interested NGIs and technology providers, were constituted in PQ5. The GLOBUS and UNICORE integrations are waiting for accounting integration to be completed. The UNICORE integration required extensions in GOCDB to accept all RFC 3986 characters, and both UNICORE and GLOBUS Nagios tests were integrated into Service Availability Monitoring (SAM) Update 13.
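As an illustration of the character-set requirement mentioned above, the sketch below checks whether a string uses only the characters that RFC 3986 permits in a URI (unreserved characters, gen-delims, sub-delims and the percent sign). The function name and validation approach are illustrative assumptions, not GOCDB's actual implementation.

```python
import re

# Character classes defined by RFC 3986 section 2.
UNRESERVED = r"A-Za-z0-9\-._~"
GEN_DELIMS = r":/?#\[\]@"
SUB_DELIMS = r"!$&'()*+,;="
URI_CHARS = re.compile(rf"^[{UNRESERVED}{GEN_DELIMS}{SUB_DELIMS}%]*$")

def is_rfc3986_charset(value: str) -> bool:
    """True if every character in value may appear in an RFC 3986 URI."""
    return bool(URI_CHARS.match(value))

print(is_rfc3986_charset("https://unicore.example.org:8080/SITE?res=a_b~c"))  # True
print(is_rfc3986_charset("https://example.org/a b"))  # False: space not allowed
```

Note that this only validates the character set; it does not check URI structure, and characters outside the set (such as spaces) must be percent-encoded before they can appear in a URI.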

GLOBUS accounting integration depends on the availability of GridSafe, which will be shipped for the first time with the IGE 3.0 release expected in September 2012. UNICORE accounting integration is in progress and depends on support for the Secure Stomp Messenger protocol, which is being integrated and tested. Two new integration activities started in PQ7, concerning the QosCosGrid software and desktop grids. The QosCosGrid (QCG) middleware is an integrated system offering advanced job and resource management capabilities to deliver supercomputer-like performance and structure to end users; it is needed for the advance reservation of compute resources, a feature required for tightly coupled usage of EGI and PRACE resources. Integration activities are being carried out in collaboration with the MAPPER project. QCG integration with GOCDB and SAM is almost completely accomplished to date. Desktop grid integration is a shared effort of EGI and EDGI and aims at the seamless integration of desktop infrastructures into EGI operations. Desktop grid integration into GOCDB and SAM Update 17 is finalized, while accounting integration is still in progress.

Another major technical obstacle concerning integration is the ability to harvest information about deployed resources and their status from the Information Discovery System regardless of the middleware stack supported. A series of workshops was organized involving EGI, EMI and IGE to collect requirements and define a shared implementation roadmap.

SA1.4 Tools

Operations Portal

Seven upgrades were deployed in production during PY2. New major components of the portal were released – the VO Dashboard and the Security Dashboard – and VO management features were greatly enhanced. The Operations Portal infrastructure currently comprises one central instance and four NGI instances (NGI_BY, NGI_CZ, NGI_GRNET and NGI_IBERGRID). The old Operations Portal (cic.egi.eu) was decommissioned in PQ8.

Service Availability Monitoring

Five different SAM releases were deployed in the distributed infrastructure, which comprises 26 NGI instances serving 35 EGI partners, two federated instances serving the Russian and the Asia Pacific federations, one central instance for the monitoring of the EGI.eu operations tools, and three external ROC instances for Canada, IGALC and Latin America.

The installation of a secondary instance is now possible for the implementation of a high-availability configuration. This is an important feature, as SAM is critical for the reliable collection of monitoring results and the computation of availability statistics. Starting with SAM Update 13, SAM uses a new test to check the EGI Trust Anchor versions on worker nodes. The new test is included in the OPERATIONS and AVAILABILITY tests (and consequently, in case of failure, has an impact on the Availability/Reliability monthly reports). The main new feature of the new CA test is that the metadata provided in the CA release is used, so there is no need for manual updates of the CA probe package for each new CA release.

GOCDB

A new release was rolled to production (v4.2) supporting the scoping of Sites and Service Endpoints into EGI and Local categories (Sites and Service Endpoints marked as being part of the infrastructure are exposed to the central operational tools, while local entities are not considered part of EGI). This release also rolled many bug fixes into production, building on the large-scale refactoring of the database carried out as part of the earlier v4.1 release.

The set-up of the failover configuration of the master instance was completed. This includes a two-hourly export and refresh of the secure download of the database, and the testing of the DNS switching mechanism. The secondary instance is hosted by the Fraunhofer Institute. Version 4.3 was released on April 18th. The major change in this version was the introduction of support for service groups (previously known as virtual sites) and of new roles. The GOCDB documentation on the wiki was greatly improved.

Messaging

During PY2 the broker network underwent major software upgrades to improve reliability, scalability and operability; these will continue in PY3. The new version of the messaging broker, ActiveMQ 5.5, was tested in October 2011 and subsequently rolled to production. For testing purposes an additional broker network was set up; the testing network consists of four brokers (two at AUTH, one at CERN and one at SRCE). The purpose of the upgrades was to roll new features into production and to improve the messaging infrastructure in various ways. The reliability and availability of the messaging system were enhanced through the usage of virtual destinations. Scalability was improved by reducing the number of connections to the broker network that are left pending, and the implementation of a test network was completed to try out new software releases. The difference between “camel routes” and “virtual destinations” lies in how data is consumed: with camel routes a message is recorded until it is consumed and then deleted, while with topics a message is published to a consumer without keeping a record. A time-to-live of 3 days is adopted by default. This improves the reliability of message delivery. ActiveMQ 5.5 also supports dynamic failover.
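The queue-versus-topic distinction described above can be illustrated with a small, broker-free Python sketch; the message names are invented, and a real deployment would of course use the ActiveMQ broker network rather than in-process data structures.

```python
import queue

# Queue-style ("camel route") delivery: each message is stored until one
# consumer takes it, then removed; late consumers still get unread messages.
work_queue = queue.Queue()
work_queue.put("availability.result.1")
work_queue.put("availability.result.2")
consumed = work_queue.get()          # message removed once consumed
print(consumed)                      # availability.result.1
print(work_queue.qsize())            # 1 message still waiting

# Topic-style delivery: a message goes to whoever is subscribed at publish
# time and is not retained; a consumer connecting later sees nothing.
def publish(topic_subscribers, message):
    for deliver in topic_subscribers:
        deliver(message)

subscribers = []                                # no one listening yet
publish(subscribers, "availability.result.3")   # dropped: no subscribers
received = []
subscribers.append(received.append)             # late subscriber joins
publish(subscribers, "availability.result.4")
print(received)                      # ['availability.result.4']
```

This is why queue-like semantics (with a bounded time-to-live to avoid unbounded storage) improve delivery reliability: a consumer that is briefly offline does not lose messages.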

Metrics portal

The first production release of the Metrics Portal was opened to the public and used for recording SA1 metrics as of QR6, supporting the preparation of the quarterly reports since then. New metrics, as well as HTML and Excel reports for NGIs, have been developed in response to user needs.

SA1.5 Accounting

The migration from R-GMA to messaging across the entire distributed infrastructure was successfully completed. A new ActiveMQ STOMP consumer was deployed in pre-production for external testing; it is relevant to all the accounting systems that publish summary records directly into the APEL database (NGI_IT, NGI_NDGF, OSG and a few Resource Centres).
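As a sketch of the kind of processing such a consumer performs, the example below parses a summary-record message body in a simple key: value format with records separated by %% lines. The field names and delimiter here are illustrative assumptions rather than the exact APEL wire format, and the database insert step is omitted.

```python
def parse_summary_message(body: str) -> list[dict]:
    """Split a message body into records (separated by '%%' lines)
    and parse each record's 'key: value' lines into a dict.
    The record format shown here is an illustrative assumption."""
    records = []
    for chunk in body.split("%%"):
        fields = {}
        for line in chunk.strip().splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records

message = """\
Site: EXAMPLE-SITE
VO: ops
WallDuration: 3600
%%
Site: EXAMPLE-SITE
VO: dteam
WallDuration: 7200
"""
records = parse_summary_message(message)
print(len(records))                 # 2
print(records[1]["VO"])             # dteam
```

A real consumer would receive each body from a STOMP subscription on the broker network and insert the parsed records into the APEL database instead of printing them.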

The APEL team engaged with developers from EMI (for ARC and UNICORE), IGE (for GLOBUS) and EDGI (for Desktop Grids, which are under integration) in a series of phone calls and meetings to discuss how they will publish accounting data to the central repository. A TCB task force on accounting started in February to coordinate deployment and requirements gathering across different projects and infrastructures. A new accounting portal release, “Canopus” (v4.0), was deployed in production. The new version uses a new codebase and versioning system, with GGUS-referenced commits. URL management was improved and the pChart 2.0 graph engine was introduced. In addition, FQAN-based accounting views were implemented for the “VO manager” and “site” views. Many bugs were fixed and minor enhancements were released.

SA1.6 Helpdesk

A big step was taken towards keeping the documentation about the Support Units (SUs) connected to GGUS permanently up to date. All GGUS FAQs about SUs were updated and migrated from PDF documents stored on the GGUS server to the central EGI wiki, and are now fully searchable. Various legacy SUs were decommissioned, and new SUs for VO support, operations support and 3rd level support were created.

The Technology Helpdesk is well established for specialized software support. Via the established interface to the EGI-RT system the first EMI-1 release and subsequent updates were announced to EGI via the Technology Helpdesk.

In order to properly support monitoring of SLAs established with external technology providers, and to monitor status and progress of NGI support services, requirements were gathered from within the project and from external projects (EMI and WLCG) for the implementation of a report generator, which is expected in PY3. A prototype was presented at the Community Forum 2012.

The usability of GGUS was enhanced and new features were introduced: a GOCDB downtime check notifying the submitter in case of downtime of the Resource Centre specified on the ticket submit form; extended search capabilities by user DN and type of problem; the usage of "Problem Type" values when submitting TEAM or ALARM tickets; the issuing of warnings in case of ticket updates; the automated switching of tickets in status "solved" or "unsolved" to “verified” after a deadline; and the addition of a new ticket field for the affected middleware product.

GGUS was interfaced to Service NOW, the CERN helpdesk system. As many EMI 3rd level middleware support units are located at CERN, processes were defined and implemented for ticket synchronization between the two systems.

GGUS was updated roughly on a monthly basis. Several Remedy updates were deployed and auto-switching between different web front-ends was implemented for an improved availability of the system.

SA1.7 Support

VO support

EGI VO Services aim at supporting VOs in the whole process of start-up, management and operation, pointing to tools, services, documentation and guidelines that maximize the usage of the resources, easing service deployment, and bridging the VO community and the infrastructure. The operations community is in charge of operating VO-specific services (both operational and functional, depending on user needs), and of supporting operations and users through the EGI helpdesk. The VO services are mature enough to be supported by NGI operational teams, and the expertise to operate those services is widely available in the operations community. The infrastructure of VO functional services comprises more than 700 service instances. The overall number of international and national VOs registered in the Operations Portal amounts to 226 (+3.20% yearly increase), including 20,883 registered users (+14.30% increase). High-Energy Physics, Astronomy Astrophysics and Astro-particle Physics, and Life Sciences are the most active disciplines, accounting respectively for 93.6%, 2.25% and 1.30% of the overall normalized CPU time used in EGI during PY2.

VO SAM, the VO Admin Dashboard and LFCBrowseSE are now mature systems supporting VO operations, and are being deployed by interested NGIs and/or VOs to assist them in daily VO operations and management. The first prototype of the VO Operations Portal – released by JRA1 and fully integrated into the Operations Portal – was deployed, and feedback was provided ahead of its rollout to production. The handling of operations issues raised by VOs and VRCs was streamlined, and these are now regularly discussed during OMB meetings. A new procedure for VO decommissioning was drafted and will be finalized by the OMB during PY3.

Grid oversight

Central Grid Oversight (COD) of EGI was responsible for the certification of new NGIs, created either as a result of legacy EGEE federated operations centres stopping operations or because of new Resource Providers joining the infrastructure. COD has been issuing newsletters on a monthly basis for better dissemination of technical information to NGI operations support teams. A change to the COD escalation procedure (procedure PROC01) was discussed and approved at the OMB to streamline oversight activities. A set of tutorial videos was prepared and their publication on the EGI training marketplace is being discussed.

As to the follow-up of underperforming sites, COD investigated the high rate of UNKNOWN monitoring results affecting some NGIs, which could undermine the meaningfulness of the availability/reliability reports. Frequent UNKNOWN results are often an indication of unstable or misconfigured Resource Centres, or of unreliable local monitoring infrastructures. A technical analysis of the problem was conducted, NGIs were contacted monthly in case of too high a percentage, and the issues were successfully resolved by the majority of NGIs.

COD also participated in the definition of the specifications of a Nagios test that will automate the notification of performance issues to sites, so that administrators are proactively warned and can take countermeasures during the course of the month. The Nagios probe measures availability, computing it daily across the last 30 calendar days. It returns WARNING if 70% ≤ availability ≤ 75%, and CRITICAL if availability < 70%. A prototype version (provided by SRCE) will be available in March for testing. This work will be completed in PY2. A new performance indicator (the “ROD performance index”) was prototyped and is now rolled to production to measure NGI support performance on a monthly basis. The ROD performance index is an indicator of the number of tickets and alarms that the NGI had trouble handling in due course: it is the sum of the number of daily tickets “expired” and of daily pending alarms older than 72 hours. NGIs with a ROD performance index larger than 10 are supported to improve their on-duty activities. This activity contributed to reducing the index, and a gradual decrease was observed during the months it was prototyped. As of October 2011, NGIs not meeting the minimum performance threshold are requested to provide explanations in GGUS tickets.
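The probe thresholds and the ROD performance index described above amount to a few lines of logic; the sketch below restates them in Python (the function names are invented for illustration, and the actual probe computes availability from SAM results rather than receiving it as a parameter).

```python
def availability_status(availability: float) -> str:
    """Status of the availability probe described above, given the
    30-day availability percentage: CRITICAL below 70%, WARNING
    between 70% and 75% inclusive, OK above 75%."""
    if availability < 70.0:
        return "CRITICAL"
    if availability <= 75.0:
        return "WARNING"
    return "OK"

def rod_performance_index(expired_tickets: int, old_alarms: int) -> int:
    """Sum of daily tickets 'expired' and daily pending alarms older
    than 72 hours; NGIs with an index above 10 are followed up."""
    return expired_tickets + old_alarms

print(availability_status(65.0))    # CRITICAL
print(availability_status(72.5))    # WARNING
print(availability_status(90.0))    # OK
index = rod_performance_index(expired_tickets=7, old_alarms=6)
print(index, index > 10)            # 13 True
```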

As of January 2012, COD is also responsible for the follow-up of underperforming NGI core services, and for providing support and documentation.

Network support

Network support activities are responsible for disseminating and supporting tools for network monitoring and troubleshooting, and for IPv6 testing of deployed software. HINTS is the tool for the execution of on-demand tests and measurements to facilitate the troubleshooting of network problems. Testing of HINTS was carried out in France and in Italy, and probes were installed in various sites. A development server was set up in Paris and a production one is located in Rome. Network support sessions were organized and co-located with the EGI Technical and User Forums. A task force on IPv6 compliance testing of the deployed technologies (ARC, gLite, GLOBUS, UNICORE) was kicked off. The implementation of the IPv6 testbed is in progress. Activities are run in collaboration with the HEPiX IPv6 working group.

SA1.8 Core services, Availability, Documentation

Core services

The DTEAM VO assists RC administrators and operations teams in troubleshooting. Its support is a mandatory requirement for all RCs that deploy VO-enabled middleware. It is served by two geographically distributed VOMS servers, in Thessaloniki and Athens. During PY2 seven new DTEAM NGI groups were created and three ROC groups were decommissioned (ROC_Italy, SEE and DECH). The DTEAM VO is successfully used by a large number of NGIs, as demonstrated by the accounting figures below, and AUTH – the partner responsible for DTEAM VO management – was responsive and supportive. The EGI Catch-All Certification Authority is an important service for new user communities and for supporting user authentication in the early stages of creating a new grid infrastructure. It currently serves five countries that do not have a nationally accredited Certification Authority (Albania, Azerbaijan, Bosnia and Herzegovina, Georgia and Senegal). A TOP-BDII, a WMS and an LB service were installed as catch-all services for NGIs that do not operate their own services for the site certification process (especially small NGIs). In addition, a portal was built that synchronizes with GOCDB and allows NGI Managers to add and remove uncertified sites from the catch-all TOP-BDII on demand.

Availability

The current OPS availability profile used for the computation of Resource Centre availability and reliability statistics was discussed, and extensions to it were approved. CREAM, ARC-CE and lcg-CE monitoring results are now included in the computations. In collaboration with WLCG, the replacement of GridView with the Availability Calculation Engine (ACE) was approved. From May 2011 a new suspension policy for underperforming sites was introduced, which raises the availability threshold for Resource Centre suspension from 50% to 70%.
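The effect of the policy change above can be illustrated with a short sketch. The site names and availability figures are invented for illustration; only the 50% and 70% thresholds come from the text.

```python
# Illustrative sketch of the PY2 suspension policy change: the monthly
# availability threshold below which a Resource Centre becomes a candidate
# for suspension was raised from 50% to 70%. Site data is hypothetical.

OLD_THRESHOLD = 50.0
NEW_THRESHOLD = 70.0

# Hypothetical monthly availability figures (percent) per Resource Centre.
sites = {"site-a": 45.0, "site-b": 62.0, "site-c": 88.0}

def suspension_candidates(availability_by_site, threshold):
    """Return the sites whose availability falls below the given threshold."""
    return sorted(s for s, a in availability_by_site.items() if a < threshold)

# Under the old policy only site-a is flagged; the stricter 70% threshold
# additionally flags site-b.
flagged_old = suspension_candidates(sites, OLD_THRESHOLD)
flagged_new = suspension_candidates(sites, NEW_THRESHOLD)
```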

The EGI Operations Level Agreement framework was considerably extended in PY2. The Resource Centre Operational Level Agreement (OLA) v. 1.0 was finalized and approved in May, and was subsequently updated in PQ8. The OLA task force was extended to start working on the Resource Provider OLA: the first Resource infrastructure Provider Operational Level Agreement was drafted and approved in PQ6. This is a major accomplishment, as for the first time a document defining NGI responsibilities and services was approved. The OLA will be incrementally expanded, as monitoring and Availability/Reliability reporting evolve, to include additional NGI services. The OLA structure was also reviewed to conform to ITIL best practices.

Following the approval of the OLA, the first NGI Core service Availability/Reliability report was distributed starting in September 2011, and as of January 2012 NGIs are requested to provide a service improvement plan in case of underperformance. The reporting framework was extended to extract results from the MyEGI portal and produce summarized Availability/Reliability reports; for the moment these only cover the top-BDII.

The EGI site availability recalculation procedure (PROC10) was finalized, approved, and subsequently updated in PQ8. To support this procedure, and more generally to support Resource Centres and NGIs in case of problems with the distributed performance reports, the Service Level Management Support Unit was created in GGUS.

Documentation

The EGEE documentation portal (GOCWIKI) was phased out in September 2011, after the relevant pages had been updated and migrated to the EGI wiki. The transfer of material to the EGI wiki is almost complete. The best practices manual is now complete and fully operational, and a new best practice on the management of the top-BDII in failover configuration was approved. Three new procedures were approved: Recomputation of monitoring results and availability statistics (PROC10), Resource Centre Decommissioning Procedure (PROC11), and Production Service Decommissioning Procedure (PROC12). The remaining procedures were periodically updated as needed. In addition, documentation, training and support wiki pages were significantly updated by EGI.eu: