EGI-InSPIRE:PY2 periodic report (SA1)


Executive Summary

SA1 was responsible for the continued operation and expansion of the production infrastructure. The transition started in PY1, through which the EGEE federated Operations Centres evolved into independent NGIs, was completed. The total number of Resource Centres (RCs) in March 2011 amounted to 352 instances (a +3.22% yearly increase). The installed capacity of the Resource Centres grew considerably to comprise 270,800 logical cores (+30.7% yearly increase), 2.96 million HEP-SPEC 06 (+49.5%), 139 PB of disk space (+31.4%) and 134.3 PB of tape (+50%).

EGI currently comprises 27 national operations centres and 9 federated operations centres encompassing multiple NGIs. Availability and Reliability reached 94.50% and 95.42% respectively (yearly averages), a +1% increase in PY2. Overall resource utilization progressed satisfactorily, confirming the trends of PY1. The total number of jobs executed in the infrastructure from May 2011 to April 2012 increased by 46.42% over the yearly job workload executed from May 2010 to April 2011. Overall, 10.5 billion HEP-SPEC 06 hours of EGI computing resources were consumed in PY2.

Operational security was run effectively during PY2, ensuring day-to-day security monitoring and timely response to incidents. Security in EGI was reviewed following the PY1 reviewers’ suggestions and documented in Deliverable D4.4. The EGI Security Threat Risk Assessment team was formed: 75 threats in 20 categories were identified, and an initial risk assessment and preliminary report were produced describing the assessment process, progress and initial findings. Specialized tools for incident response tracking and for streamlining operational security tasks were prototyped and rolled to production.

The Staged Rollout workflow introduced during PY1 is being progressively refined. The Staged Rollout infrastructure has been gradually expanding to reflect the deployment needs of VRCs and NGIs, and resources were reallocated to ensure the testing of a broader range of products. The staged rollout infrastructure currently comprises 60 Early Adopter teams.

The operations integration of GLOBUS, UNICORE, QosCosGrid and Desktop Grids was completed, with the exception of accounting, which requires further integration development. Extensions are being implemented in collaboration with the external technology providers.

GGUS was updated to decommission various legacy support units, and to add new ones for VO support, operations support and 3rd level support. A new report generator was designed and prototyped. GGUS FAQs were migrated to the EGI wiki, usability of the system was enhanced and GGUS was interfaced to a new helpdesk system (Service NOW). The GGUS failover configuration was hardened with auto-switching between different front-ends.

VO SAM, VO Admin Dashboard, and LFCBrowseSE are now mature systems supporting VO operations; they are being deployed by interested NGIs and/or VOs to assist them in daily VO operations and management. The first prototype of the VO Operations Portal, released by JRA1 and fully integrated into the Operations Portal, was deployed, and feedback was provided in preparation for its final rollout to production.

Central Grid Oversight (COD) of EGI was responsible for the certification of new NGIs, created either as a result of legacy EGEE federated operations centres ceasing operations or because of new Resource Providers joining the infrastructure. COD was involved in training and dissemination activities, in the follow-up of underperformance at both the Resource Centre and the Resource Provider level, and in monitoring the instability of the distributed SAM infrastructure.

The EGI.eu central tools were significantly advanced. The first Metrics Portal was rolled to production in PQ6. The message broker network was repeatedly upgraded to improve the reliability of message delivery, stability, manageability and scalability. The transition of the accounting infrastructure from R-GMA to messaging was completed, and a new central consumer based on ActiveMQ STOMP was deployed in pre-production. The Canopus release of the accounting portal (v4.0) brought, among other things, many bug fixes, extended FQAN-based views and new graphics. GOCDB functionality was also significantly extended with support for virtual sites, new roles and permissions, scoping of Resource Centres and sites, and a hardened DNS-based failover configuration. The Service Availability Monitoring (SAM) framework underwent five upgrades and is currently the largest and most distributed operational infrastructure, comprising 32 distributed instances. The Operations Portal rolled new major components to production: the VO Dashboard and the Security Dashboard. In addition, the VO management features were greatly enhanced.

The EGI Operations Level Agreement framework was considerably extended in PY2 with the first Resource Centre Operational Level Agreement, defining the target levels of the services provided by sites for resource access, and the Resource infrastructure Provider Operational Level Agreement, defining the target levels of the community services provided by the NGIs, which came into force in January 2012.

A new set of catch-all services for the monitoring of uncertified Resource Centres was rolled to production. EGEE legacy documentation pages were phased out or updated and migrated to the EGI wiki, three new operational procedures were approved, and training and support pages were improved.

SA1.1 Activity Management

During PY2 SA1 was responsible for the continued operation and expansion of the production infrastructure. The transition started in PY1, through which the EGEE federated Operations Centres evolved into independent NGIs, was completed. The total number of Resource Centres (RCs) in March 2011 amounted to 352 instances (a +3.22% yearly increase). The installed capacity of the Resource Centres grew considerably to comprise 270,800 logical cores (+30.7% yearly increase), 2.96 million HEP-SPEC 06 (+49.5%), 139 PB of disk space (+31.4%) and 134.3 PB of tape (+50%).

EGI currently comprises 27 national operations centres and 9 federated operations centres encompassing multiple NGIs. Availability and Reliability reached 94.50% and 95.42% respectively (yearly averages), a +1% increase in PY2. Overall resource utilization progressed satisfactorily, confirming the trends of PY1. The total number of jobs executed in the infrastructure from May 2011 to April 2012 increased by 46.42% over the yearly job workload executed from May 2010 to April 2011. Overall, 10.5 billion HEP-SPEC 06 hours of EGI computing resources were consumed in PY2.

The Operations Management Board (OMB) functioned well: 11 meetings were organized, of which two were co-located with the main project events. Operations liaisons with the Open Science Grid were strengthened. Fortnightly operations meetings were run regularly. SA1 ran five task forces and coordinated the TCB task force on accounting.

The OMB contributed to the EGI sustainability workshop (January 2012) and to middleware sustainability discussions by providing information on the status, costs and sustainability plans of NGI operations services, and by collecting priorities for currently deployed software. The operations service portfolio was defined and documented in the EGI Operations Architecture: Grid Service Management Best Practices (Deliverable D4.3). This is important for future sustainability planning and for estimating operations costs.

SA1.2 Operations Security

Operational security was run effectively during PY2, ensuring day-to-day security monitoring and timely response to incidents. Security in EGI was reviewed following the PY1 reviewers’ suggestions and documented in Deliverable D4.4: the scope and aims of EGI security were reviewed, including the assets that EGI security seeks to protect. The work of the various security groups in or associated with EGI was presented. Practices and standards for IT security, together with their current and possible future usage, were analysed, and plans for a security risk assessment were defined.

The EGI Security Threat Risk Assessment team was formed. 75 threats in 20 categories were identified, and an initial risk assessment and preliminary report were produced describing the assessment process, progress and initial findings after members of the group had given their initial opinions on the likelihood and impact of the various threats. The risks were computed by asking members of the group to rate the likelihood and the impact of each threat on a scale from 1 to 5 and multiplying these two factors together. The average of these risk scores was taken as the basis for the preliminary report, with a minimal amount of discussion and without any refinement after the initial assessment. Threats with a risk score of 8 or more in this initial assessment were reported; there are 13 of these. Threats with a high impact (4 or more) were also reported.
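
To make the scoring scheme above concrete, the sketch below computes the risk of each threat as the average over assessors of likelihood multiplied by impact, and applies the reporting thresholds (risk of 8 or more, or high impact of 4 or more). The threat names and ratings are invented for illustration and are not taken from the actual assessment.

<syntaxhighlight lang="python">
# Illustrative sketch of the risk-scoring scheme described above.
# Threat names and ratings are hypothetical, not data from the EGI assessment.
from statistics import mean

# Each assessor rates (likelihood, impact) on a 1-5 scale per threat.
ratings = {
    "compromised user credential": [(4, 3), (3, 4), (4, 4)],
    "unpatched worker node":       [(2, 3), (3, 2)],
}

for threat, scores in ratings.items():
    risk = mean(l * i for l, i in scores)        # average of likelihood x impact
    avg_impact = mean(i for _, i in scores)
    if risk >= 8 or avg_impact >= 4:             # reporting thresholds quoted above
        print(f"{threat}: risk={risk:.1f}, impact={avg_impact:.1f} -> reported")
</syntaxhighlight>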

Specialized tools for incident response tracking and for streamlining operational security tasks were prototyped and, in some cases, already rolled to production. These were introduced to support the security drills framework and were extensively used during the Security Service Challenge 5 (SSC5), a cross-NGI security challenge [add text]. Various training events were organized, which were well attended [add text]. Existing procedures were updated, and new procedures were drafted and approved. Statistics: during PY2, 10 potential software vulnerabilities were reported to SVG, and SVG issued seven advisories. EGI CSIRT reported on seven incidents and issued five security alerts, of which two were critical and three were high risk.

Procedures

The EGI Security Incident Handling Procedure and the EGI Software Vulnerability Issue Handling Procedure were both updated, and a new one was approved: the EGI CSIRT Critical Vulnerability Operational Procedure. Milestone MS412 (Operational Security Procedures) is the entry point to the EGI operational security procedure framework and gives an overview of operational security processes.

Vulnerability Assessment of Grid Middleware

Security Vulnerability Assessment is the pro-active examination of software to find vulnerabilities that may exist. A Security Assessment Plan was defined at the beginning of PY2 to identify which software components within EMI would be assessed and when the assessments would take place. The document also presents the status of software packages previously assessed.

Support tools

EGI CSIRT monitors the EGI infrastructure and notifies sites exposing security vulnerabilities, in order to detect weaknesses before security incidents occur. Results detected by security monitoring are made available to the sites and NGIs. The EGI Operations Portal has been extended with a Security Dashboard, which aggregates information from the CSIRT monitoring tools and presents it in a uniform manner to authorized users. The first prototype was made available during PQ7 and was rolled to production in PQ8.

The security probes were extended to improve the accuracy of tests and to limit the number of false positives. The Nagios instance responsible for security monitoring was moved under the egi.eu domain (secmon.egi.eu). Additional improvements were applied to the Pakiti service, which detects known vulnerable packages installed at the EGI sites, mainly to improve its scalability and robustness in handling data from the whole EGI infrastructure. A ticketing system to support incident response, RT for Incident Response (RTIR), was put into production.
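
As a rough illustration of the kind of check Pakiti performs, the sketch below matches the packages reported as installed at a site against a list of versions flagged by security advisories. The package names, versions and matching rule are simplified assumptions and do not reflect Pakiti's actual data model or version-comparison logic.

<syntaxhighlight lang="python">
# Simplified sketch of matching a site's installed packages against
# known-vulnerable versions, in the spirit of what Pakiti does.
# Package names, versions and the advisory list are hypothetical.

installed = {                     # package -> version reported by a site
    "openssl": "0.9.8e-12",
    "kernel": "2.6.18-194",
}

vulnerable = {                    # package -> versions flagged by advisories
    "kernel": {"2.6.18-194", "2.6.18-238"},
    "glibc": {"2.5-49"},
}

findings = [
    (pkg, ver) for pkg, ver in installed.items()
    if ver in vulnerable.get(pkg, set())
]
for pkg, ver in findings:
    print(f"vulnerable package detected: {pkg}-{ver}")
</syntaxhighlight>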

Security Service Challenge (SSC)

The purpose of EGI SSCs is to investigate whether sufficient information is available to conduct an audit trace as part of an incident response, and to ensure that procedures are enforced and that appropriate communication channels are available and used. An SSC comprises activities in three areas: communication, containment and forensics. The performance of Resource Centres in these three areas is recorded. The role of EGI and NGIs in the SSC is to provide coordination and to be a single point of contact for entities like VOs and Certification Authorities. Coordination includes making sure that the obtained forensic results are processed and that the resulting information is fed back to the affected sites.

SSC5 was a multi-site incident simulation exercise. A multi-site grid incident affects multiple NGIs and involves the entire EGI ecosystem (Resource Centres, NGIs, Certification Authorities, User Communities, security teams, etc.). SSC5 was therefore built around a realistic scenario involving 40 Resource Centres across more than 20 countries, and a VO-specific job submission framework (PANDA) was adopted. A VO job-submission framework adds extra complexity to the containment of the simulated incident at the Resource Centre level. SSC5 was successfully completed in June 2011. An SSC framework was developed to scale up to a larger number of sites. This framework allows for job submission (using different methods), storage operations, the definition of a set of tasks (communication, user/process management) with target times, the automated generation of reports and a scoring schema, the recording of history and the monitoring of progress.
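
A minimal sketch of how a scoring schema with target times can work is given below: a site gets full marks for a task completed within its target time and proportionally fewer marks when it overruns. The task names, target times and decay rule are assumptions made for illustration only, not the actual SSC schema.

<syntaxhighlight lang="python">
# Hypothetical SSC-style scoring: full marks within the target time,
# linearly decaying credit up to twice the target. Values are illustrative.
from dataclasses import dataclass

@dataclass
class TaskResult:
    name: str
    target_minutes: float    # target response time for the task
    actual_minutes: float    # time the site actually took

def score(task: TaskResult, max_points: float = 10.0) -> float:
    if task.actual_minutes <= task.target_minutes:
        return max_points
    overshoot = (task.actual_minutes - task.target_minutes) / task.target_minutes
    return max(0.0, max_points * (1.0 - overshoot))

results = [
    TaskResult("acknowledge alert (communication)", 240, 180),
    TaskResult("ban user and kill processes (containment)", 240, 400),
]
total = sum(score(t) for t in results)
print(f"site score: {total:.1f} / {10.0 * len(results):.1f}")
</syntaxhighlight>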

SSC5 showed that EGI CSIRT, in collaboration with the VO CSIRT and the Certification Authorities, is able to handle a multi-site incident. For efficiency it is crucial for EGI CSIRT to collaborate closely with the VO CSIRT and the Certification Authorities, and this collaboration was very good during the exercise. Security training: various training events for Resource Centre administrators were organized, aiming at spreading knowledge about strategies for incident prevention (security monitoring tools, system hardening and patching), handling and containment (incident response procedures and mechanisms to control access to grid resources), computer forensics and post-mortem analysis of compromised systems. Two security training sessions were organized, at the Technical Forum 2011 and during the EGI CSIRT face-to-face meeting in April. The training was very well received and new training sessions are being prepared for the Technical Forum 2012.

Coordination

EGI CSIRT activities are coordinated through weekly operations meetings and monthly team meetings (of which two are face to face). Since May 2011 the Security Vulnerability Group (SVG) has also been holding monthly meetings to streamline coordination.

SA1.3 Service Deployment

Staged rollout

During PY2, staged rollout activities were reviewed to ensure the early adoption of software releases from EGI's main external technology providers (EMI and IGE). Resources previously concentrated on the staged rollout of gLite 3.2 were reallocated to the testing of EMI and IGE software releases.

The Staged Rollout workflow introduced during PY1 was refined during the first year of EGI-InSPIRE according to the experience gathered with the testing of EMI 1. This was done in parallel with the construction of the Staged Rollout infrastructure, which has been gradually expanding to reflect the deployment needs of VRCs and NGIs. The process was documented in Milestone MS409, and a new review cycle started in PQ8 to incorporate further improvements identified through experience.

A new provisioning procedure is now being discussed. It proposes the use of a new type of UMD repository ("testing") containing all packages from the moment they are released by the technology providers, and allows more than one version of a given product to be tested at the same time. The new procedure reduces the time lag between a release by a technology provider and its availability in the UMD repositories, as new versions will be immediately available in the UMD testing repository. The testing repository will be disabled by default, and all sites will be able to point to UMD repositories for all services (only the base and updates repositories will be enabled by default).

The largest number of products was tested in PQ5 in preparation for the release of UMD 1.0 (81 tests in total, resulting in 2 products being rejected). This number was gradually reduced in the following quarters, as subsequent UMD updates only included a subset of products being updated. In total, 122 components were tested during PY2, of which eight were rejected, and 192 staged rollout tests were undertaken. The number of Early Adopter teams has been progressively increasing to test a growing set of products from EMI, IGE and EGI-InSPIRE JRA1 (operational tools), and currently amounts to 60 teams. IGE 1 product releases underwent software provisioning for UMD 1.2, broadening the range of products released in UMD. Two EMI products, VOMS/Oracle and FTS, were not included in software provisioning: the former because of its very limited deployment scope, and the latter because it is specific to one user community, which is already in charge of its testing and staged rollout.
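
To illustrate the repository layout implied by this procedure, the sketch below writes an example YUM repository file in which the base and updates repositories are enabled and the testing repository is present but disabled by default. The repository names and URLs are hypothetical placeholders, not the official UMD ones.

<syntaxhighlight lang="python">
# Sketch of a site-side YUM repo definition matching the layout described
# above: base and updates enabled, testing disabled unless opted in.
# Repository names and URLs are hypothetical, not official UMD endpoints.
import configparser

repos = configparser.ConfigParser()
for flavour, enabled in (("base", 1), ("updates", 1), ("testing", 0)):
    section = f"umd-1-{flavour}"
    repos[section] = {
        "name": f"UMD 1 {flavour} repository (example)",
        "baseurl": f"http://repository.example.org/umd/1/{flavour}/",
        "enabled": str(enabled),     # testing stays off by default
        "gpgcheck": "1",
    }

with open("umd-example.repo", "w") as fh:
    repos.write(fh)
</syntaxhighlight>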

ARC and UNICORE are part of UMD through EMI; extensive work was carried out to achieve fully operational interoperability for sites deploying those products.

Interoperations

The full integration of ARC was accomplished in PY2, while the integration of GLOBUS and UNICORE was almost completed. Two task forces dedicated to GLOBUS and UNICORE, involving interested NGIs and technology providers, were constituted in PQ5. The GLOBUS and UNICORE integrations are waiting for the accounting integration to be completed. The UNICORE integration required GOCDB to be extended to accept all RFC 3986 characters, and both UNICORE and GLOBUS Nagios tests were integrated into the Service Availability Monitoring (SAM) Update 13.
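
For reference, "all RFC 3986 characters" means the unreserved and reserved URI characters plus percent-encoded octets. A minimal validator along those lines is sketched below; it only illustrates the character set and is not GOCDB's actual implementation.

<syntaxhighlight lang="python">
# Minimal check that a string uses only characters permitted in a URI by
# RFC 3986 (unreserved, gen-delims, sub-delims, percent-encoded octets).
# Illustrates the character set only; not GOCDB's actual validator.
import re

RFC3986_CHARS = re.compile(
    r"^(?:[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*$"
)

def is_rfc3986(value: str) -> bool:
    return bool(RFC3986_CHARS.match(value))

assert is_rfc3986("https://example.org/path?x=1&y=%20z")
assert not is_rfc3986("contains a space")   # space is not an allowed character
</syntaxhighlight>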

GLOBUS accounting integration depends on the availability of GridSafe, which will be shipped for the first time with the IGE 3.0 release expected in September 2012. UNICORE accounting integration is in progress and depends on support for the Secure Stomp Messenger protocol, which is being integrated and tested. Two new integration activities started in PQ7, concerning the QosCosGrid software and desktop grids. QosCosGrid middleware is needed for advance reservation of compute resources, a feature required for tightly coupled usage of EGI and PRACE resources. The QosCosGrid (QCG) middleware is an integrated system offering advanced job and resource management capabilities to deliver supercomputer-like performance and structure to end users. Integration activities are being carried out in collaboration with the MAPPER project. QCG integration with GOCDB and SAM is almost completely accomplished to date. Desktop grid integration is a shared effort of EGI and EDGI and aims at the seamless integration of desktop infrastructures into EGI operations. Desktop grid integration into GOCDB and SAM Update 17 is finalized, while accounting integration is still in progress.
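
Accounting integration over the message bus essentially amounts to publishing usage records as messages to a broker via STOMP. The sketch below shows a bare-bones STOMP 1.0 exchange over a raw socket; the broker address, queue name and record fields are invented placeholders, and the real Secure Stomp Messenger adds X.509-based authentication and message signing that are not shown here.

<syntaxhighlight lang="python">
# Bare-bones STOMP 1.0 publisher: CONNECT, SEND one record, DISCONNECT.
# Broker host, credentials, queue and the record body are hypothetical;
# the production Secure Stomp Messenger layers security on top of this.
import socket

def frame(command: str, headers: dict, body: str = "") -> bytes:
    head = "".join(f"{k}:{v}\n" for k, v in headers.items())
    return f"{command}\n{head}\n{body}\x00".encode("utf-8")

record = "Site: EXAMPLE-SITE\nWallDuration: 3600\nProcessors: 1\n"  # made-up fields

with socket.create_connection(("broker.example.org", 61613), timeout=10) as s:
    s.sendall(frame("CONNECT", {"login": "user", "passcode": "secret"}))
    s.recv(4096)                      # expect a CONNECTED frame back
    s.sendall(frame("SEND", {"destination": "/queue/example.accounting"}, record))
    s.sendall(frame("DISCONNECT", {}))
</syntaxhighlight>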

Another major technical obstacle concerning integration is the ability to harvest information about deployed resources and their status from the Information Discovery System regardless of the middleware stack supported. A series of workshops was organized, involving EGI, EMI and IGE, to collect requirements and define a shared implementation roadmap.

SA1.4 Tools

SA1.5 Accounting

SA1.6 Helpdesk

A big step was taken towards permanently keeping the documentation about the Support Units (SUs) connected to GGUS up to date. All GGUS FAQs about SUs were updated and migrated from PDF documents stored on the GGUS server to the central EGI wiki, and are now fully searchable. Various legacy SUs were decommissioned, and new SUs for VO support, operations support and 3rd level support were created.

The Technology Helpdesk is well established for specialized software support. Via the established interface to the EGI-RT system, the first EMI 1 release and subsequent updates were announced to EGI through the Technology Helpdesk.

In order to properly support monitoring of SLAs established with external technology providers, and to monitor status and progress of NGI support services, requirements were gathered from within the project and from external projects (EMI and WLCG) for the implementation of a report generator, which is expected in PY3. A prototype was presented at the Community Forum 2012.

The usability of GGUS was enhanced and new features were introduced: a downtime check against GOC DB, notifying the submitter if the Resource Centre specified on the ticket submit form is in downtime; extended search capabilities by user DN and type of problem; the use of "Problem Type" values when submitting TEAM or ALARM tickets; the issuing of warnings in case of ticket updates; the automated switching of tickets in status "solved" or "unsolved" to "verified" after a deadline; and the addition of new ticket fields for the middleware product affected.
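
The automated status switch mentioned above can be pictured as the small rule sketched below: tickets left in "solved" or "unsolved" beyond a deadline are moved to "verified". The deadline length and the ticket structure are assumptions for illustration, not GGUS internals.

<syntaxhighlight lang="python">
# Illustrative auto-verification rule: tickets sitting in "solved" or
# "unsolved" longer than a deadline are switched to "verified".
# The 10-day deadline and the Ticket class are assumptions, not GGUS code.
from dataclasses import dataclass
from datetime import datetime, timedelta

DEADLINE = timedelta(days=10)

@dataclass
class Ticket:
    ticket_id: int
    status: str
    last_status_change: datetime

def auto_verify(tickets: list[Ticket], now: datetime) -> None:
    for t in tickets:
        if t.status in ("solved", "unsolved") and now - t.last_status_change > DEADLINE:
            t.status = "verified"
            t.last_status_change = now

tickets = [Ticket(74321, "solved", datetime(2012, 5, 1))]
auto_verify(tickets, now=datetime(2012, 5, 20))
print(tickets[0].status)   # -> verified
</syntaxhighlight>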

GGUS was interfaced to Service NOW, the CERN helpdesk system. As many EMI 3rd level middleware support units are located at CERN, processes were defined and implemented for ticket synchronization between the two systems.

GGUS was updated roughly on a monthly basis. Several Remedy updates were deployed, and auto-switching between different web front-ends was implemented to improve the availability of the system.

SA1.7 Support

SA1.8 Core services, Availability, Documentation