Difference between revisions of "EGI-InSPIRE:SA1 EGI Global tasks evolution"

Revision as of 14:52, 21 January 2013

This document, provided by the partners responsible for the EGI operations global tasks, gives information about the current status of these tasks and their envisaged evolution after April 2014.

Human Services

Operation Management Board Coordination

Partner: EGI.eu

Current status

The Operations Management Board (OMB) drives future developments in the operations area by making sure that the infrastructure delivers high availability, is secure and meets the demands of existing user communities, and that infrastructure operations evolve to support the integration of new resource infrastructures. It does this by providing management and developing policies and procedures for the operational services that are integrated into the production infrastructure. The OMB is responsible for technical roadmapping and for the definition and execution of processes for the periodic gathering of requirements.

Foreseen evolution

Impact on funding

Software Support

Current status

EGI.eu provides first and second level user and operations support. This function includes the following tasks:

  • Function coordination (partner: CESNET)
  • Ticket triage and assignment, for dispatching tickets to the appropriate SUs within GGUS (partners: INFN, CESNET)
  • 1st and 2nd level software support, encompassing both grid middleware and operational tools (operational tickets are dispatched to NGI operations SUs, so they are not addressed internally by the software support team). This includes the production of howtos and reporting to operations meetings about critical incidents (partners: CESNET, INFN, JUELICH, LIU and STFC)
  • Ticket oversight and follow-up (partner: KIT): this function includes administrative and reporting functions of the helpdesk infrastructure (e.g. collecting ticket statistics, internal and external reporting of statistics for SLA monitoring and other reporting duties), and follow-up (notifying supporters when the reaction to high-priority tickets is not fast enough, requesting information from ticket submitters when they do not react, ensuring that assigners/resolvers react sufficiently fast when the submitter provides additional information).

More information about this task

Foreseen evolution

Impact on funding

Coordination of Grid Oversight

Partners: SARA, CYFRONET

Current status

Grid Oversight is an activity aimed at controlling the infrastructure and solving the operational issues that arise. These issues vary in complexity and importance, and may have different causes at regional or central level. For scalability reasons, Grid Oversight has a hierarchical structure: teams at regional (ROD) and central (COD) level contribute to it, each solving problems within its scope. The COD part of the function is a global task. In ITSM terms, the processes of most interest to COD are those of the Service Operations area, especially Incident Management and Problem Management. The oversight of Incident Management is organised as an escalation process, and COD is the body to which incidents that cannot be handled at regional level are escalated.

Foreseen evolution

After April 2014, there will be more emphasis on supporting NGIs and on assisting user communities with resource allocation. This will be in addition to what is already being done today.

Impact on funding

This is uncertain.

Coordination of network support, monitoring, troubleshooting

Partner: GARR

Current status

This task provides network support for the resolution of end-to-end network performance issues. EGI is a highly distributed networked infrastructure of grid services that relies on network connectivity for remote job submission, data transfer and data access; hence tools are needed for network troubleshooting and performance monitoring.

Foreseen evolution

Impact on funding

Coordination of Operational interoperation between NGIs and DCIs

Partner: EGI.eu

Current status

EGI coordinates the integration of heterogeneous middleware stacks and Distributed Computing Infrastructures (DCIs) with the EGI operational infrastructures, such as accounting, monitoring, management and support.

Foreseen evolution

Impact on funding

Coordination of documentation

Partner: EGI.eu

Current status

Coordination of the maintenance and development of operational documentation, procedures and best practices.

Foreseen evolution

Impact on funding

Security Operations Coordination

Partners: STFC, NIKHEF

Current status

The security vulnerabilities and risks presented by e-Infrastructures provide a rationale for coordination amongst the EGI participants at various levels. Central coordination groups take care of policies, operational security and maintenance, so that users are guaranteed secure access. In addition, security and incident response is provided through the EGI Computer Security and Incident Response Team, which coordinates activity at the sites across the infrastructure. This coordination ensures that common policies are followed, by providing services such as security monitoring, training and dissemination, with the goal of improving the response to incidents (e.g. through security drills).

Foreseen evolution

Impact on funding

Service Level Management: availability/reliability reports

Partner: AUTH

Current status

This task includes the validation and distribution of monthly availability statistics for Resource Centres, NGIs and EGI.eu, and the coordination of the evolution of the EGI OLA framework and the related reporting tools.
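
As a rough illustration of the kind of monthly figures this task reports, the sketch below computes availability and reliability from hours of uptime, scheduled downtime and unknown status. The formulas are an assumption based on a commonly used definition (availability ignores periods of unknown status; reliability additionally discounts scheduled downtime). This page does not define them, and the function and parameter names are hypothetical.

```python
def availability(up_hours, total_hours, unknown_hours=0.0):
    """Fraction of the known period during which the service was up.

    Assumed definition: uptime / (total time - time with unknown status).
    Returns None when there is no known time to measure against.
    """
    known = total_hours - unknown_hours
    if known <= 0:
        return None
    return up_hours / known


def reliability(up_hours, total_hours, scheduled_downtime_hours=0.0,
                unknown_hours=0.0):
    """Like availability, but scheduled downtime is not counted against the site.

    Assumed definition:
    uptime / (total time - scheduled downtime - time with unknown status).
    """
    expected = total_hours - scheduled_downtime_hours - unknown_hours
    if expected <= 0:
        return None
    return up_hours / expected


# Hypothetical example: a 30-day month (720 hours), 648 hours up,
# 48 hours of announced maintenance, no unknown periods.
month_availability = availability(648, 720)
month_reliability = reliability(648, 720, scheduled_downtime_hours=48)
```

With these assumed definitions, reliability is never lower than availability for the same month, since announced maintenance only shrinks the denominator.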

Foreseen evolution

Impact on funding

Infrastructure Services

Software Rollout

Partner: LIP

Current status

Updates of deployed software need to be gradually adopted in production after internal verification. This process is implemented in EGI through staged rollout, i.e. through the early deployment of a new component by a selected list of candidate Resource Centres. The successful verification of a new component is a precondition for declaring the software ready for deployment. Given the scale of the EGI infrastructure, this process requires careful coordination to ensure that every new capability is verified by a representative pool of candidate sites, to supervise the responsiveness of the candidate sites and ensure that the staged rollout progresses well without introducing unnecessary delays, and to review the reports produced. It also ensures the planning of resources according to the foreseen release schedules from the Technology Providers. EGI.eu coordination is necessary to ensure a successful interoperation of the various stakeholders: Resource Centres, Technology Providers, the EGI.eu Technical Manager and the EGI repository managers.

This activity includes:

  • Definition and adoption of a workflow to automate software deployment
  • Coordination of the staged rollout activities carried out by the NGIs
  • Liaison with the UMD team (EGI-InSPIRE SA2) and the Product Teams

Foreseen evolution

Impact on funding

Monitoring

Central SAM monitoring services

Partner: CERN

Current status

A distributed monitoring framework is necessary to continuously test the level of functionality delivered by each service node instance in the production Resource Centres, to generate alarms and tickets in case of critical failures, to compute monthly availability and reliability statistics, and to monitor and troubleshoot network problems. The Monitoring Infrastructure is a distributed service based on Nagios and messaging. The central services, operated by EGI.eu, include systems such as the MyEGI portal for the visualisation of information, and a set of databases for the persistent storage of information about test results, availability statistics, monitoring profiles and aggregated topology information. The central services need to interact with the local monitoring infrastructures operated by the NGIs. The central monitoring services are critical and need to deliver high availability.

Foreseen evolution

Impact on funding

Broker network

Partners: GRNET (coord), SRCE, CERN

Current status

Foreseen evolution

Impact on funding

Accounting

APEL central DB

Partner: STFC

Current status

The EGI Accounting Infrastructure is distributed. At a central level it includes the repositories for the persistent storage of usage records. The Accounting Infrastructure is essential in a service-oriented business model to record usage information. Accounting data needs to be validated and regularly published centrally. The central databases are populated through individual usage records published by the Resource Centres, or through the publication of summarised usage records.

Foreseen evolution

Impact on funding

Central accounting portal

Partner: CESGA

Current status

The central accounting portal is made available by EGI for the visualisation of accounting information.

Foreseen evolution

Impact on funding

Security Monitoring

  • Security Nagios server. Partner: GRNET
  • CSIRT Pakiti. Partner: CESNET

Current status

The objective of a Security Infrastructure is to protect the infrastructure from threats such as exploitable software vulnerabilities, misuse by authorised users and resource "theft", while keeping information, resources and services accessible and productive for their intended users. A specifically designed set of tools and services helps reduce these risks: monitoring of individual Resource Centres (based on Nagios and Pakiti), a central security dashboard that allows sites, NGIs and EGI Computer Security Incident Response Teams to access security alerts in a controlled manner, and a ticketing system to support coordination efforts.

Foreseen evolution

Impact on funding

Configuration Repository (GOCDB)

Partner: STFC

Current status

EGI relies on a central database (GOCDB) to record static information about different entities such as the Operations Centres, the Resource Centres, and the service instances. It also provides contact, role and status information. GOCDB is a source of information for many other operational tools, such as the broadcast tool, the Aggregated Topology Provider, etc.

Foreseen evolution

1yr Technical Evolution: GOCDB needs to evolve along the following themes to address current and emerging stakeholder requirements:

  • GOCDB v5 (~April/May). Replaces the Oracle PROM database with ORM DB objects. This is needed to support different RDBMSs; it also improves performance and will simplify development. Requires changes to the PI to be accepted by all PTs. See: https://wiki.egi.eu/wiki/Doctrine
  • Update the current mutually exclusive ‘EGI’ and ‘Local’ scope tags to be non-exclusive. This allows sites/services to be tagged multiple times, with project-specific tags (e.g. ‘UK_NES’) as well as the wider ‘EGI’ scope tag. Objects are created once, which maintains the integrity of topology information across different target infrastructures. The PI ‘scope’ parameter value is to support a comma-separated list. Service scope values are chosen from the Site scope values.
  • Render GOCDB data in Glue2 XML and provide new PI method(s) to post downtimes using XML. This is needed to address interoperability and data consistency across different info-systems/infrastructures, and has been requested by different stakeholders.
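
The non-exclusive scope tags described in the second item can be sketched as follows. This is only an illustration under stated assumptions, not GOCDB code: the `Site` class and the function names are hypothetical, and whether a comma-separated ‘scope’ value should match objects carrying all of the listed tags or any of them is not stated here, so the sketch exposes both behaviours.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def parse_scope_param(value: str) -> set[str]:
    """Split a comma-separated 'scope' value into a set of tag names."""
    return {tag.strip() for tag in value.split(",") if tag.strip()}


@dataclass
class Site:
    """Hypothetical stand-in for a site carrying several scope tags."""
    name: str
    scopes: set[str] = field(default_factory=set)


def filter_by_scope(sites, scope_param, match_all=True):
    """Return the sites whose tags match the requested scope list.

    match_all=True keeps sites carrying every requested tag;
    match_all=False keeps sites carrying at least one of them.
    An empty scope list applies no filtering (an assumption).
    """
    wanted = parse_scope_param(scope_param)
    if not wanted:
        return list(sites)
    if match_all:
        return [s for s in sites if wanted <= s.scopes]
    return [s for s in sites if wanted & s.scopes]


def service_scopes_valid(site, service_scopes):
    """Service scope values must be chosen from the parent site's scopes."""
    return set(service_scopes) <= site.scopes


sites = [
    Site("SITE-A", {"EGI"}),
    Site("SITE-B", {"EGI", "UK_NES"}),  # tagged once per infrastructure
    Site("SITE-C", {"Local"}),
]
both = filter_by_scope(sites, "EGI,UK_NES")
either = filter_by_scope(sites, "EGI, UK_NES", match_all=False)
```

Because each object is stored once and simply carries several tags, a single record can be selected by more than one target infrastructure without being duplicated.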

Impact on funding

  • Costs expected to stay constant up to April 2014 (aiming to address these developments and continue ops support at/around the current level).
  • Current level of match funding from GridPP is expected until April 2014.
  • Reducing costs before April 2014 is unrealistic.
  • Cost changes after April 2014 are hard to predict; they depend on subsequent changes to requirements. Current level of match funding from GridPP is expected until 2015.

Operations Portal

Partner: IN2P3

Current status

EGI.eu provides a central portal for the operations community that offers a bundle of different capabilities, such as the broadcast tool, VO management facilities, and a dashboard for grid operators that is used to display information about failing monitoring probes and to open tickets to the Resource Centres affected. The dashboard also supports the central grid oversight activities. It is fully interfaced with the EGI Helpdesk and the monitoring system through message passing. It is a critical component, as it is used by all EGI Operations Centres to provide support to their respective Resource Centres.

Foreseen evolution

Impact on funding

Helpdesk

Partner: KIT

Current status

EGI provides support to users and operators through a distributed helpdesk with central coordination (GGUS). The central helpdesk provides a single interface for support. The central system is interfaced to a variety of other ticketing systems at the NGI level in order to allow a bi-directional exchange of tickets (for example, those opened locally can be passed to the central instance or other areas, while user and operational problem tickets can be opened centrally and subsequently routed to the NGI local support infrastructures).

Foreseen evolution

Impact on funding

Core and Catch-all Services

Partner: GRNET JRU

Current status

Auxiliary core services are needed for the good running of Infrastructure Services. Examples of such services are the VOMS service and VO membership management for infrastructural VOs (DTEAM, OPS), the provisioning of middleware services needed by the monitoring infrastructure (e.g. top-BDII and WMS), the catch-all CA, and other catch-all core services to support small user communities (central catalogues, workflow schedulers, authentication services).

Foreseen evolution

This should include central SAM instances for ad-hoc monitoring objectives (like the middleware monitoring SAM).

Impact on funding

Resources