
Difference between revisions of "EGI-InSPIRE:SA1 EGI Global tasks evolution"

 
(67 intermediate revisions by 12 users not shown)
Line 1: Line 1:
[[Category:EGI-inSPIRE SA1]]
{{Template:EGI-Inspire menubar}}
 
This document, provided by the partners responsible for the EGI operations global tasks, gives information about the current status and the envisaged evolution of these tasks '''after April 2014'''.


Line 10: Line 11:


===Foreseen evolution ===
The need for operations coordination through NGI participation in the Operations Management Board continues.


====Impact on funding ====
Constant funding.


==Software Support ==
Line 26: Line 29:
===Foreseen evolution ===


* ''Ticket triage and assignment'' is an essential function of EGI user support and must be preserved as is. The current work is well stabilised, with a sufficient number of people to run the rotational service, and no major changes are foreseen.
* ''1st and 2nd level software support.'' At least EMI, the major software provider for EGI, will not continue as a single formal project; its software is planned to be supported by the community on a more or less best-effort basis. This carries the risk of insufficient response to software issues critical for EGI, which must be compensated by more effort on the EGI side. In particular, the scenario in which 2nd level software support in EGI produces patched software when a TP fails to deliver a fix, which was foreseen in the EGI-InSPIRE project proposal but never happened, becomes more realistic. Because EMI funding stops in May 2013, the following 12 months will show the real impact.
* ''Ticket oversight and followup.'' Gathering and evaluating TP performance metrics and monitoring SLAs becomes less important because few partners may be able to sign an SLA with EGI. By contrast, the actual follow-up of tickets (ensuring they are not forgotten by supporters, etc.) remains important or even gains priority: without formal mechanisms (SLAs), this is the only way to put pressure on supporters while meeting the same user expectations.


====Impact on funding ====
Given the expected loosening of formal relationships with TPs, the role
of "in house" software support in EGI becomes even more critical.
The expected increase in technical effort can be compensated by the reduced effort
required to track the formal relationships; however, an overall decrease in effort
is not realistic if the task is to remain functional.


=== Coordination of Grid Oversight ===
Line 54: Line 66:


===Foreseen evolution ===
Handover of provisioning to NRENs and DANTE is being investigated.


====Impact on funding ====
Reduce


== Coordination of Operational interoperation between NGIs and DCIs ==
Line 78: Line 92:
== Security Operations Coordination ==
Partners: STFC, NIKHEF
(now including security policy coordination as this is closely related to operations)
===Current status===
<!-- add and modify text as needed. This paragraph should include: EGI CSIRT, security training, security drills, IRTF, SVG and if you want SPG -->
Security vulnerabilities and risks presented by e-Infrastructures provide a rationale for coordination amongst the EGI participants at various levels. Central coordination groups ensure policies, operational security, and maintenance to guarantee secure access to users. In addition, security and incident response is provided through the EGI Computer Security and Incident Response Team by coordinating activity at the sites across the infrastructure. This coordination ensures that common policies are followed by providing services such as security monitoring, training and dissemination with the goal of improving the response to incidents (e.g. security drills).
The inherent value of the e-Infrastructure provides a strong rationale for security coordination amongst the EGI participants at various levels. Central coordination of the security activities ensures that policies, operational security, and maintenance are compatible amongst all partners, improving availability and lowering access barriers for use of the infrastructure. Today, the Security Policy Group (SPG) coordinates a consistent set of security policies, developed in collaboration with all interested NGIs, and provides technical implementations of these policies for simplified use by the NGIs where relevant. In addition, security and incident response is provided through the EGI Computer Security and Incident Response Team (CSIRT) by coordinating activity in the NGIs and at the sites across the infrastructure. This coordination ensures that incidents are promptly and efficiently handled, that common policies are followed by providing services such as security monitoring, and by training and dissemination with the goal of improving the response to incidents. The overall incident response capabilities of the sites, also with respect to new technologies introduced by the user communities (VOs),  such as the VO-Job-submission frameworks, are frequently assessed through the EGI-wide security drills.


===Foreseen evolution ===
Security is an ongoing process. Policies, procedures, operations, technology and trust have to constantly evolve to address new threats and risks. In the security threat risk assessment carried out in 2012 one of the threats highlighted as a high risk issue was “The move to more use of Cloud technologies may lead to security problems”. There is no doubt that there will be many issues to be solved in the provision of secure operations as we deploy new technologies, which we are sure will require at least the current level of effort to manage and co-ordinate. We have decided to request the effort we will need for global coordination of security for EGI, based on what we are currently doing and the foreseen future needs. We are convinced that we will continue to need at least this level of effort and that this will need to be funded somehow.
 
Experience of providing such security coordination over the last two years has shown that this includes multiple aspects that can be more clearly distinguished when evolving the task for the future, as presented in the following sub-sections.
 
====Security Policy Coordination and the support of its implementation====
Security policy development covers diverse aspects, including operational
policies (agreements on vulnerability management, intrusion detection and
prevention, regulation of access, and enforcement), incident response policies
(governing the exchange of information and expected actions), participant
responsibilities (including acceptable use policies, identifying users and
managing user communities), traceability, legal aspects, and the protection
of personal data. In an environment without central control, such as EGI,
common identity management such as provided by the IGTF, is needed to ensure
unique and persistent assignments of rights and privileges. Since research is
global, such policies must be coordinated with peer infrastructures in Europe
and elsewhere, such as PRACE-RI, Open Science Grid, XSEDE, and like efforts in
the Asia Pacific. Coordination mechanisms such as the FIM4R group, TERENA
REFEDS, SCI, Open Grid Forum and the IGTF are employed.
For some elements of these policies (such as the common identity management)
having a central reference implementation for immediate re-use by the NGIs
saves on total effort needed in the long run. The use today of the centrally
produced "EGI trust anchor distribution" is expected to continue.
 
====Incident Response Task Force (IRTF) coordination and advanced incident response====
Experience has shown that the complexity of multi-domain incidents at the
scale of EGI necessitates dedicated experts in incident response and forensics
to deal with global incidents and to provide support to EGI participants to
address localised incidents before they spread across EGI. Experience with the
rotational scheme used today in EGI has shown that it is
very hard to retain unique expertise in a widely distributed community with
high personnel turn-over. In practice, incident response is provided by a
dedicated core team, with specialist forensics support concentrated in just a
few individuals. It is essential that this expertise is available as and when
needed, but it cannot provide global coverage for any EGI site. 
We propose to establish a small core team which holds the coordination
role and provides advanced support in incident response and forensics.
The primary responsibility for basic incident response and forensics
will still lie with each NGI, while the EGI Global IRTF will coordinate
incident response and information exchange. However, for complex
multi-site incidents and in cases where advanced forensics is needed,
the EGI Global IRTF will step in and take an active part, to protect the
continued integrity of the EGI infrastructure as a whole. Investment in
a relatively small amount of global coordination effort removes the
need for each NGI to maintain its own specialist IT security
capability and has the potential to realise cost savings within each NGI.
 
====Software Vulnerability (SVG) coordination====
The Software Vulnerability Group (SVG) aims at eliminating existing vulnerabilities from the deployed infrastructure, primarily from the grid middleware, and avoiding the introduction of new ones, thus preventing security incidents.  This activity will need to continue both to handle new vulnerabilities found in the Grid middleware currently deployed, and to handle vulnerabilities in software used by future technology to facilitate the sharing of distributed resources such as federated clouds. The SVG handles vulnerabilities reported in software used specifically in the EGI infrastructure. This depends on investigation and risk assessment by a collaborative team drawn from technology providers and other security groups, known as the Risk Assessment Team or 'RAT'. Considering the recent number of vulnerabilities detected and the co-ordination effort needed with other entities (Technology providers, EGI software distribution managers and coordinators, and central operations co-ordination) this task needs explicit recognition and assignment of dedicated effort. In particular the SVG also has a role in determining the threat posed by software deployed in the infrastructure independent of specific vulnerability events.  SVG also has a role in the co-ordination and prioritization of 'Vulnerability Assessment' work, which is the examination of software to find whether any vulnerabilities exist. The SVG has also been asked to assess or advise on the assessment of other pieces of software prior to recommending their deployment on the EGI infrastructure, but has insufficient manpower to carry this out.
 
====Security Coordination through Security Service Challenges and Training====
Participating in a global infrastructure is still not a very common task for
some resource centres. Unless specific efforts are made to ensure communication on
incidents is effective between all EGI participants, the 'weakest link'
principle applies and the integrity of the entire infrastructure can
inadvertently be put at risk by a single user or resource. The use of
'security drills', exercising the incident response communications channels,
has proven particularly effective in ensuring open and effective exchange of
information. Additionally, these security drills can be re-used at a site or
national level, where they serve as training in computer security forensics
and in the identification of intrusions and threats.
To be effective, the security drills must be realistic, current with respect
to the software and intrusion vectors used to exercise the site, and be based
on the actual communication infrastructure of EGI. The drills need development
(mainly contributed) and periodic use in realistic tests (the coordination
function included here). Re-using the security drills for training and
national (re)use needs limited 'train the trainer' effort which is best
provided for centrally.
 
====Security Monitoring Coordination====
EGI is an interconnected federation where a single vulnerable place may
have a huge impact on the whole infrastructure. In order to recognize
the risks and to address potential vulnerabilities in a timely manner,
the EGI Security Monitoring provides an oversight of the infrastructure
from the security standpoint. Sites connected to EGI also differ
significantly in their level of security, and detecting weaknesses exposed
by the sites allows EGI security operations to contact them
before an issue leads to an incident. Information produced by security
monitoring is also important when assessing new risks and
vulnerabilities, since it makes it possible to identify the scope and impact of a
potential security incident. The whole activity needs to be closely linked to other
security-related tasks, namely the Incident Response Task Force and SVG,
and to provide reliable and quick support to them (for instance to
introduce new checks or process collected data). The task needs to
cooperate with other activities responsible for general EGI monitoring
and will need to coordinate developments with these activities. Additional
connections need to be maintained to the operations dashboard and
to the activities providing support to sites, to make sure detected security issues
are handled properly.
 
Development/maintenance of security monitoring is described in the dedicated section on [[#Security Monitoring|security monitoring]].
 
===Impact on funding ===
It has already been acknowledged that some areas of security coordination are underfunded today in EGI-InSPIRE. Lack of global effort in the Incident Response Team is a growing problem and the amount of global effort to coordinate SVG (currently 1 PM/year) is way too small. We have therefore decided to give honest estimates of the amount of effort required to perform adequate global coordination of security in EGI.
 
Experience over the last 2 years in EGI-InSPIRE has established that the amount of global coordination effort required to perform these critical duties is as follows.
 
Person-Months per year, total effort: [partners to be assigned]
 
*6+2 PM Security policy coordination and the support of its implementation
*12 PM IRTF coordination and advanced incident response
*6 PM SVG coordination
*6 PM Security coordination through service challenges and training
*4 PM Security monitoring coordination
 
Total effort required 36 PM/year.
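
As a quick cross-check of the figures above, the per-task estimates can be summed programmatically. This is only an illustrative sketch: the dictionary and variable names are hypothetical, and the "6+2" entry is kept as the sum of its two parts exactly as listed.

```python
# Effort figures copied from the list above (person-months per year).
# The "6+2" entry for policy coordination is kept as two parts.
security_effort_pm = {
    "Security policy coordination and implementation support": 6 + 2,
    "IRTF coordination and advanced incident response": 12,
    "SVG coordination": 6,
    "Service challenges and training": 6,
    "Security monitoring coordination": 4,
}

# Sum the individual estimates to obtain the yearly total.
total_pm = sum(security_effort_pm.values())

for task, pm in security_effort_pm.items():
    print(f"{pm:>3} PM  {task}")
print(f"{total_pm:>3} PM  total per year")
```

The computed total agrees with the 36 PM/year stated above.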


==  Service Level Management: availability/reliability reports==
Partner: AUTH  
=== Current Status ===
This task includes the validation of distribution of monthly availability statistics for Resource Centres, NGIs, EGI.eu, and the coordination of the evolution of the EGI OLA framework and the related reporting tools.


===Foreseen evolution ===
The current focus is on finding a sufficiently mature solution to automate the delivery of monthly A/R statistics; this is an ongoing activity, and delivery is expected to become automated in the near future.
Documentation activities that may still be in progress are to be concluded by April 2014.
====Impact on funding ====
Reducing costs after April 2014, by focusing only on having, maintaining and operating an automated service for the delivery of A/R statistics, is reasonable.


=Infrastructure Services=
Line 105: Line 234:
* Coordination of the staged rollout activities carried out by the NGIs
* Liaison with the UMD team (EGI-InSPIRE SA2) and the Product Teams
'''The Staged Rollout has a number of dependencies on other EGI tools:'''
* EGI Single Sign On (EGI-SSO)
* EGI RT
* EGI Wiki
* EGI Repositories
* EGI Mail Managers
* GGUS ticketing system
=== Current Issues ===
Most of the effort for the Staged Rollout comes from the coordination between the several stakeholders involved in the software rollout process: the Technology Providers, the EGI.eu Technical Manager and the EGI tools managers. A key element of this process is the Early Adopter (EA) teams. They are the people who do the actual work and test new releases in production environments; without their commitment, the testing and validation of new releases would be very hard to accomplish.
In recent years, the Staged Rollout coordination team has managed to gather a significant set of Early Adopter teams (63 in total). Nonetheless, even after a large effort, several components from different TPs still do not have EA teams committed to testing them. As a consequence, around 10 components have not been made available in a UMD release, despite all the effort put into their development.


===Foreseen evolution ===
It is foreseen that the internal staged rollout process will not change significantly in the near future; nevertheless, it needs to take into account that the Technology Providers will change, as will the infrastructure.
With the end of two of the main development projects, IGE and EMI, in mid-2013, it is uncertain what will happen to the product teams of many of these components. It is expected that we will move from a well controlled and coordinated number of teams to a much higher number scattered across multiple communities. The infrastructure is also on the verge of a change towards the adoption of cloud computing which, if successful, will bring a burst of new products, directly increasing the SR coordination effort and the number of EAs needed. Furthermore, grid technology has reached a mature state, which means that a much lower rate of new functionality and releases is expected, although some new developments may occur in order to adapt to or adopt new software/service models.
In this scenario it will be harder to track product release dates, but this will not have a direct impact on the SR process. What will change is the way product teams announce a new release: announcements are expected on the respective web sites, RSS feeds or mailing lists.
Based on this assumption we foresee the following changes in the SR process and in the interaction between stakeholders:
* An increased number of TPs will increase the burden of SR coordination, but this will be balanced by a decrease in the release rate of new products and functionality. This is a key issue: the communication channels between SR and the product teams (PTs) will play an even more important role. The added burden could be mitigated by strengthening these relations, through increased dissemination or new channels of communication.
* The number of EAs may decrease as funding for this activity ends; the verification and staged rollout of new releases is therefore expected to take longer.
* The tools used in SR need to be adapted to accommodate the dispersion of product teams, each with different release tools and schedules.
====Impact on funding ====
After 2014 an increase in the number of technology providers is expected, but with a lower release frequency. A strong commitment from the product teams will be required for products to be accepted for SR.
* Costs are expected to stay constant up to April 2014. After April 2014 the availability of national and regional funding is uncertain.
* To ensure the current level of coordination, funding should be maintained at least at the current level.


== Monitoring ==
Line 113: Line 276:
Partner: CERN  
====Current status====
A distributed monitoring framework is necessary to continuously test the level of functionality delivered by each service node instance in the production Resource Centres, to generate alarms and tickets in case of critical failures, to compute monthly availability and reliability statistics, and to monitor and troubleshoot network problems. The Monitoring Infrastructure is a distributed service based on Nagios and messaging. The central services – operated by EGI.eu – include systems such as the MyEGI portal for the visualisation of information, and a set of databases for the persistent storage of information about test results, availability statistics, monitoring profiles and aggregated topology information. The central services need to interact with the local monitoring infrastructures operated by the NGIs. The central monitoring services are critical and need to deliver high availability.
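
As a rough illustration of the monthly availability and reliability statistics mentioned above, the sketch below computes both figures from status durations. The formulas are an assumption based on a commonly used convention (availability excludes time with unknown status; reliability additionally excludes scheduled downtime) and are not taken from this page; the function and parameter names are hypothetical and are not part of SAM.

```python
def availability_reliability(up_h, unscheduled_down_h, scheduled_down_h, unknown_h):
    """Return (availability, reliability) as fractions in [0, 1].

    Assumed convention (not taken from this document):
      availability = up / (total - unknown)
      reliability  = up / (total - scheduled downtime - unknown)
    """
    total = up_h + unscheduled_down_h + scheduled_down_h + unknown_h
    known = total - unknown_h                # hours with a known status
    expected_up = known - scheduled_down_h   # hours the service was meant to be up
    availability = up_h / known if known > 0 else 0.0
    reliability = up_h / expected_up if expected_up > 0 else 0.0
    return availability, reliability


# Hypothetical 30-day month (720 h): 684 h up, 12 h unscheduled downtime,
# 24 h scheduled downtime, no hours with unknown status.
avail, rel = availability_reliability(684, 12, 24, 0)
print(f"availability={avail:.3f} reliability={rel:.3f}")
```

Under this convention, scheduled downtime lowers availability but not reliability, which is why the two figures are reported separately.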


====Foreseen evolution ====
Maintenance and operation of the central SAM services is foreseen to continue after EGI-InSPIRE, covering the operational monitoring infrastructure, service notifications, availability and reliability computation, monthly report generation and user support.
Development will be needed if new functionality is required by EGI.eu/OMB.
=====Impact on funding =====
Experience over the last 2 years in EGI-InSPIRE has shown that some areas of SAM (development, coordination and maintenance in particular) are underfunded. A reasonable amount of effort for these activities, based on the current scenario, is the following, in Person-Months per year:
* 12 PM Project coordination
* 30 PM Maintenance
* 18 PM Operations and support
* 30 PM Development (an estimate; the actual effort could be anywhere between 0 and 50 PM depending on EGI requirements)


===  Broker network===   
Partner: GRNET (coord), SRCE, CERN  
====Current status====
The broker network currently consists of two separate networks of brokers, one for testing and one for production purposes. The production message broker network consists of 4 message brokers in Greece (GRNET/AUTH), Croatia (SRCE) and Switzerland (CERN), and it is used by the EGI operational tools infrastructures (i.e. SAM, APEL). Operation of the service requires applying regular updates and performing maintenance operations in close coordination among the partners running the 4 instances. Development effort is limited to developing monitoring tools dedicated to the broker network, as well as producing best practices for clients of the infrastructure.
====Foreseen evolution ====
The foreseen evolution of the production EGI message broker network includes:
* Security implementation and X.509 authentication for all clients of the PROD MSG network
* Enhancement of the current usage monitoring tools
* Investigation of technical issues affecting the broker network, and service performance optimization
* Provision of support for clients of the service
=====Impact on funding =====


Costs are expected to stay constant up to April 2014 (in order to address the maintenance, support and development effort at the current level). Reducing costs after April 2014, by focusing only on coordination, service maintenance and support, is realistic.


== Accounting ==


=== APEL central DB ===
Partner: STFC


====Current status====
The EGI Accounting Infrastructure is distributed. At a central level it includes the repositories for the persistent storage of usage records. The Accounting Infrastructure is essential in a service-oriented business model to record usage information. Accounting data needs to be validated and regularly published centrally. The central databases are populated through individual usage records published by the Resource Centres, or through the publication of summarised usage records.
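
The difference between individual and summarised usage records can be sketched as follows. This is a minimal illustration only: the record fields (<code>site</code>, <code>vo</code>, <code>month</code>, <code>cpu_hours</code>) and the function name are hypothetical and do not reflect the actual APEL record schema or code.

```python
from collections import defaultdict

# Hypothetical individual usage records; the field names are illustrative
# and do not reflect the actual APEL record schema.
records = [
    {"site": "SITE-A", "vo": "vo1", "month": "2013-05", "cpu_hours": 10.0},
    {"site": "SITE-A", "vo": "vo1", "month": "2013-05", "cpu_hours": 5.5},
    {"site": "SITE-B", "vo": "vo2", "month": "2013-05", "cpu_hours": 3.0},
]


def summarise(records):
    """Aggregate job-level records into one summary per (site, VO, month)."""
    summaries = defaultdict(lambda: {"jobs": 0, "cpu_hours": 0.0})
    for rec in records:
        key = (rec["site"], rec["vo"], rec["month"])
        summaries[key]["jobs"] += 1
        summaries[key]["cpu_hours"] += rec["cpu_hours"]
    return dict(summaries)


summary = summarise(records)
for key, value in sorted(summary.items()):
    print(key, value)
```

Publishing summaries of this kind instead of every individual record reduces the volume of data sent to the central repository, at the cost of per-job detail.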
 
===== Dependencies =====
The Accounting Repository has a number of dependencies on other EGI tools:
* Accounting Portal
* GOCDB
* EGI Message Brokers
* GGUS
* Operations Portal and Operations Dashboard


====Foreseen evolution ====
2013/2014 Technical Evolution:
* Over the final year of EGI-InSPIRE, accounting will evolve by consolidating the new Accounting repository to receive records from the EMI APEL 3 client, from the regional APEL server implementations and from other CPU accounting sources (MAPPER, EDGI, UNICORE etc.). Additionally, the archived data in summary form will be migrated to the new format.
* Other types of accounting (Cloud, Storage, Application etc.) are at an early testing stage for the accounting repository, although the SSM publishing method has proved robust and usable in these new contexts. The next steps for the coming year are to test with multiple clients, ensuring that their representation of data is consistent from one client to another, and to produce summaries for the portal to publish.
By the end of EGI-InSPIRE there will be a pervasive accounting infrastructure collecting data for EGI and other international VOs from a number of e-infrastructures, including EGI, OSG, NorduGrid and PRACE, and from a variety of middleware stacks. The central repository will receive data directly from NGI repositories (some running APEL code), from external infrastructures, and directly from sites running various clients, mainly APEL. The infrastructure should handle at least the accounting of CPU, storage and clouds, but will be extensible to other types of usage.
This puts EGI in a good position to exchange accounting data with other infrastructures and to perform a core accounting role within the ERA.
* Integration of accounting records from different types of accounting is the next step, which would continue after EGI-InSPIRE if funding is available, as this is not yet in development. The implementation of storage accounting integrated with cloud accounting, for example, can only realistically be worked on once there is a body of data from the storage and cloud clients which may be related. The same applies to cloud/CPU and cloud/application integration. The feasibility of this integration requires some research, which could not start until Q3 2013 at the earliest and would be expected to extend beyond EGI-InSPIRE.
=====Impact on funding =====
* Costs are expected to stay constant up to April 2014 (aiming to address the developments set out in the first two bullets above and to continue maintenance and support at the current level).
* The current level of match funding from GridPP is expected until April 2014.
* Reducing costs before April 2014 is unrealistic.
* Post-April 2014: if there are additional development requirements, funding will be required for them. This would cover further developments in other types of accounting (Cloud, Storage, Application etc.). Requirements for these are expected to evolve as their usage increases, and this will require work beyond maintenance and operations. If the evolution of accounting is to include the integration of data from different types of accounting, this would also require further funding.


=== Central accounting portal ===
Line 141: Line 344:


====Current status====
The central accounting portal is made available by EGI for the visualisation of accounting information.  
The EGI accounting infrastructure is a complex system that involves various sensors in different
regions, all publishing data to a central repository. The data is processed, summarized and displayed
in the Accounting Portal, which acts as a common interface to the different accounting record
providers, presents a homogeneous view of the data gathered, and offers a user-friendly way to
understand resource utilisation.
 
The current production version (v4.2 Fomalhaut) of the Accounting Portal is available at https://accounting.egi.eu. The regional Accounting Portal is ready, pending support from APEL regionalization.
 
=====Dependencies=====
The Accounting Portal has functional dependencies on the following tools:
 
* GOCDB: List of sites and NGIs in production, list of available services in production.
 
* CIC Portal: VOMS endpoints and VO list.
 
* Accounting Repository: Accounting records and summarized accounting data.


====Foreseen evolution ====
In the period 2013/2014, the following directions are planned:
* Contributed CPUs by site - Measuring the contribution in CPUs versus the number of hours, enhancing the current SPECInt-based reporting.
* Preliminary support for parallel (MPI) jobs - There have been advances on this since the MPI VT lifespan, but further software development is needed.
* New Accounting - Integration of all advances in Network, Storage and Application accounting from JRA1.4, in addition to possible new developments that may arise.
* EGI User usage accounting - Improvements to the userDN accounting.
* Preliminary cloud support - Display and integration of the experimental sites offering cloud computing capabilities under the fedcloud task.
* Regional portal codebase improvements - Deployment and improvement based on regional experiences, if appropriate.
* XML endpoints generalization and improvement - Implementation of a friendlier XML interface and documentation for users wishing programmatic access.
* Cloud support improvements - Further integration of cloud accounting characteristics.
=====Impact on funding =====
The funding should remain constant until February 2014.
After February 2014 the status, dedication and funding of the staff maintaining the portal are not clear. The availability of national and regional funding is also uncertain; the situation may become clearer next year.
To ensure continued operation, the funding would have to be maintained at the current level.


== Security Monitoring ==
Line 154: Line 385:


===Foreseen evolution ===
Currently EGI CSIRT runs two services implementing security monitoring, the
security Nagios box and the Pakiti service to monitor patch management, which
provide a security-related overview of the infrastructure. Within the scope of
EGI-InSPIRE we plan to increase the coverage of EGI (so-called site-wide
monitoring), start producing metrics summarizing the level of security as seen by
EGI CSIRT, improve the handling of the issues detected (sending notifications,
close integration with ticketing systems), and finish the next generation of
the Pakiti server for monitoring security patches.
====Impact on funding ====
After EGI-InSPIRE we will need to keep at least the operations and maintenance of the key services (Pakiti and security Nagios). Estimated costs are 1 PM for operations (OPS) and 2 PM for maintenance (MAINT) for each service, totalling 6 PM/year for both services.


== Configuration Repository (GOCDB) ==
Line 179: Line 420:
EGI.eu provides a central portal for the operations community that offers a bundle of different capabilities, such as the broadcast tool, VO management facilities, and a dashboard for grid operators that is used to display information about failing monitoring probes and to open tickets to the Resource Centres affected. The dashboard also supports the central grid oversight activities. It is fully interfaced with the EGI Helpdesk and the monitoring system through the message passing. It is a critical component as it is used by all EGI Operations Centres to provide support to the respective Resource Centres.  


=== Foreseen evolution ===
 
The Operations Portal is based on many emerging Web development technologies (frameworks, CSS templating systems, JavaScript libraries, etc.).

The emergence of such technologies provides new opportunities to users in terms of ergonomics, visualisation and services offered, and also brings new usage patterns and new possibilities for features.

The Operations Portal team is following these developments and aims to make these opportunities available to the community where they are useful.

The portal has been designed in a modular way, as an aggregation platform. The flexibility of the architecture allows for a large number of data sources and provides standard access to information.

The core of the data gathering system is the web service facility called Lavoisier.

Lavoisier's flexibility allows us to integrate almost any kind of new information if needed and meaningful. New resource types coming into the EGI production infrastructure, such as HPC systems, virtualized resources and desktop resources, can be integrated via plug-ins inside Lavoisier.
 
==== Impact on funding ====
 
With the current level of funding (PY4) we can ensure daily maintenance and provide some improvements to the existing tool, as long as the development involved is limited.

A decrease in funding after April 2014 would mean that additional development requirements cannot be taken into account, and that further developments to extend the current tool with new technologies and new sources of information (HPC or cloud resources) will not be possible. All new developments would also require new funding.
 
== Metrics Portal ==
Partner: CESGA
===Current status===
The Metrics Portal displays a set of metrics that will be used to monitor the performance of the
infrastructure and the project, and to track their changes over time. The portal automatically collects
all the required data and calculates these metrics before displaying them in the portal. The portal
aggregates information from different sources such as GOCDB, GGUS, etc.
 
The Metrics Portal has been used for the last year to gather metrics from the project tasks. Depending on changes in the structure and scope of the project and its tasks and activities, the portal will be updated while keeping the old metrics for their validity periods.

The Portal also monitors the evolution of the infrastructure month by month, and it is the only way to access historic data on infrastructure evolution (as opposed to real-time data).
 
====Dependencies====
The Metrics Portal has many dependencies. These include:
 
* Accounting Portal: To display accounting metrics, most active VOs, Number of submitted jobs, etc.
 
* BDII: Number of CPUs and cores in production, online and nearline storage, MPI sites.
 
* GGUS: Number of tickets created/closed, ticket response times, number of tickets created by priority, etc.
 
* GOCDB: Sites in production, number of countries and NGIs in EGI.
 
* ACE: Availability and reliability metrics.
 
=== Foreseen evolution  ===
 
* Manual metrics expansion and refinement - Aimed at finer granularity and semantics to enable better user reporting.
* Views enhancement and optimization - Cross-browser integration, possibly AJAX functionality, mobile integration.
* Regional portal codebase improvements - Refactoring, code documentation and quality improvement.
* GGUS metrics improvement and new A/R metrics
* Access Control improvements - Finer grained access mechanism.
* New customized reports with Excel support
 
=== Impact on funding  ===
The funding should remain constant until February 2014. After February 2014 the status, dedication and funding of the staff maintaining the portal are not clear. The availability of national and regional funding is also uncertain; the situation may become clearer next year. To ensure continued operation, the funding would have to be maintained at the current level.


== Helpdesk ==
Line 187: Line 490:
EGI provides support to users and operators through a distributed helpdesk with central coordination (GGUS). The central helpdesk provides a single interface for support. The central system is interfaced to a variety of other ticketing systems at the NGI level in order to allow a bi-directional exchange of tickets (for example, those opened locally can be passed to the central instance or other areas, while user and operational problem tickets can be opened centrally and subsequently routed to the NGI local support infrastructures).


=== Foreseen evolution ===
* GGUS Report Generator - Some fine tuning is still to be done. Another report, [https://rt.egi.eu/guest/Ticket/Display.html?id=4752 ETA accuracy], needs implementation.
* VOMS synchronization - Restructuring VOMS synchronization to make it fail-safe.
* GGUS interfaces with other ticketing systems - Interfaces for PRACE/MAPPER, DANTE, NGI_IBERGRID.
* Alarm process for central operations tools - Implement an alarm process for EGI central operations tools.
* Implementation of specific workflows for CSIRT/Security - The CSIRT team is currently evaluating whether they want to use GGUS. If CSIRT uses GGUS, the permissions and access rights schema needs to be adapted to their needs.
* Web portal - Introduce alternative authentication/authorization processes and provide a specific view for the CMS VO.
* Integration of operations portal in GGUS
* Ticket monitoring in GGUS - Provide input on how to process tickets waiting for the submitter's input and on how to process tickets waiting for action by technology providers after the end of EMI.
 
 
=== Impact on funding ===
With the current level of funding, the planned improvements can be implemented until April 2014.
Reduced funding after April 2014 will make it hard to provide further improvements.
The current level of match funding from KIT is expected to continue until the end of 2014.
For 2015 and later, the KIT funding situation is as yet unclear.
 


== Core and Catch-all Services ==
Line 194: Line 514:


===Current status===
Auxiliary core services are needed for the good operation of Infrastructure Services. Examples of such services are the VOMS service and VO membership management for infrastructural VOs (DTEAM, OPS), the provisioning of middleware services needed by the monitoring infrastructure (e.g. top-BDII and WMS), the catch-all CA and other catch-all core services to support small user communities (central catalogues, workflow schedulers, authentication services), the middleware monitoring SAM instance, and site certification services.


===Foreseen evolution ===
This should include, in addition to the above, the provisioning of central SAM instances for ad-hoc monitoring objectives (like the middleware monitoring SAM instance). In summary, the following are part of the foreseen evolution:
 
* Catch-all Grid Services for small user communities (this includes VOMS service for infrastructural VOs, catch-all CA services, SAM instances, central grid services)
* Tools (Grid Services) for Resource Centre certification (operation of site certification infrastructure including central core services such as WMS and Top-BDII)


====Impact on funding ====
Costs are expected to stay constant up to April 2014. A cost reduction after April 2014, by focusing only on service operation and support, is realistic.


= Resources =

Latest revision as of 15:17, 6 January 2015



This document, provided by the partners responsible for EGI operations global tasks, provides information about the current status and the envisaged evolution of these tasks after April 2014.

Human Services

Operation Management Board Coordination

Partner: EGI.eu

Current status

The Operations Management Board (OMB) drives future developments in the operations area by making sure that the infrastructure delivers high availability, is secure, meets the demand of existing user communities and that infrastructure operations evolve to support the integration of new resource infrastructures. It does this by providing management and developing policies and procedures for the operational services that are integrated into the production infrastructure. The OMB is responsible for technical roadmapping and for the definition and execution of processes for periodic gathering of requirements.

Foreseen evolution

The need for operations coordination through NGI participation in the Operations Management Board continues.

Impact on funding

Constant funding.

Software Support

Current status

EGI.eu provides first and second level user and operations support and this function includes the following tasks:

  • function coordination (partner: CESNET)
  • ticket triage and assignment for dispatching of tickets to the appropriate SUs within GGUS (partners: INFN, CESNET)
  • 1st and 2nd level software support, encompassing both grid middleware and operational tools (operational tickets are dispatched to NGI operations SUs, so are not internally addressed by the software support team). This includes the production of howtos and reporting to operations meetings about critical incidents (partners: CESNET, INFN, JUELICH, LIU and STFC)
  • Ticket oversight and follow-up (partner: KIT): this function includes administrative and reporting functions of the helpdesk infrastructure (e.g. collecting ticket statistics, internal and external reporting of statistics for SLAs monitoring and other reporting duties), and follow-up (notifying supporters when the reaction to high-priority tickets is not fast enough, requesting information from ticket submitters when they do not react, ensuring assigners/resolvers will react sufficiently fast when the submitter provides additional information).

More information about this task

Foreseen evolution

  • Ticket triage and assignment is an essential function of EGI user support and must be preserved as is. The current work is well stabilized, with a sufficient number of people to run the rota-based service; no major changes are foreseen.
  • 1st and 2nd level software support. At least EMI, the major software provider for EGI, is not going to be continued as a single formal project; the software is planned to be supported by the community on a more or less best-effort basis. This carries the risk of insufficient reaction to software issues critical for EGI, which must be compensated by more effort on the EGI side. In particular, the scenario of 2nd level software support in EGI producing patched software when a TP fails to deliver a fix, which was foreseen in the EGI-InSPIRE project proposal but never happened, becomes more realistic. Because EMI funding stops in May 2013, the following 12 months will show the real impact.
  • Ticket oversight and follow-up. Gathering and evaluating TP performance metrics and monitoring SLAs becomes less important because of the lack of partners able to sign an SLA with EGI. By contrast, actual follow-up of the tickets (ensuring they are not forgotten by supporters etc.) remains important or even gains priority: with the lack of formal mechanisms (SLAs), this is the only way to put pressure on supporters while meeting the same user expectations.

Impact on funding

Given the expected loosening of formal relationships with TPs, the role of "in house" software support in EGI becomes even more critical. The expected increase in technical effort can be compensated by less effort required to track the formal relationships; however, an overall decrease in effort is not realistic if the task is to remain functional.

Coordination of Grid Oversight

Partners: SARA, CYFRONET

Current status

Grid Oversight is an activity aimed at controlling the infrastructure and solving arising operational issues. These issues can be of different complexity and importance, and may have various causes at regional or central level. For scalability reasons, Grid Oversight has a hierarchical structure: teams at regional (ROD) and central (COD) level contribute to it, solving problems within their scope. The COD part of the function is a global task. In ITSM terms, the processes in which COD is naturally interested are those of the Service Operations area, especially Incident Management and Problem Management. The oversight of Incident Management is organized as an escalation process, and COD is the body to which incidents that cannot be handled at regional level are escalated.

Foreseen evolution

After April 2014, there will be more emphasis on supporting NGIs and on assisting user communities with respect to resource allocation. This will be in addition to what is already being done today.

Impact on funding

This is uncertain.

Coordination of network support, monitoring, troubleshooting

Partner: GARR

Current status

Provides network support for the resolution of end-to-end network performance issues. EGI is a highly distributed networked infrastructure of grid services using network connectivity for remote job submission, data transfer and data access, hence tools are needed for network troubleshooting and performance monitoring.

Foreseen evolution

Handover of provisioning to NRENs and DANTE is being investigated.

Impact on funding

Reduce

Coordination of Operational interoperation between NGIs and DCIs

Partner: EGI.eu

Current status

EGI coordinates the integration of heterogeneous middleware stacks and Distributed Computing Infrastructures with the EGI operational infrastructures such as accounting, monitoring, management and support.

Foreseen evolution

Impact on funding

Coordination of documentation

Partner: EGI.eu

Current status

Coordination of the maintenance and development of operational documentation, procedures and best practices.

Foreseen evolution

Impact on funding

Security Operations Coordination

Partners: STFC, NIKHEF

(now including security policy coordination as this is closely related to operations)

Current status

The inherent value of the e-Infrastructure provides a strong rationale for security coordination amongst the EGI participants at various levels. Central coordination of the security activities ensures that policies, operational security, and maintenance are compatible amongst all partners, improving availability and lowering access barriers for use of the infrastructure. Today, the Security Policy Group (SPG) coordinates a consistent set of security policies, developed in collaboration with all interested NGIs, and provides technical implementations of these policies for simplified use by the NGIs where relevant. In addition, security and incident response is provided through the EGI Computer Security and Incident Response Team (CSIRT) by coordinating activity in the NGIs and at the sites across the infrastructure. This coordination ensures that incidents are promptly and efficiently handled, that common policies are followed by providing services such as security monitoring, and by training and dissemination with the goal of improving the response to incidents. The overall incident response capabilities of the sites, also with respect to new technologies introduced by the user communities (VOs),  such as the VO-Job-submission frameworks, are frequently assessed through the EGI-wide security drills.

Foreseen evolution

Security is an ongoing process. Policies, procedures, operations, technology and trust have to constantly evolve to address new threats and risks. In the security threat risk assessment carried out in 2012 one of the threats highlighted as a high risk issue was “The move to more use of Cloud technologies may lead to security problems”. There is no doubt that there will be many issues to be solved in the provision of secure operations as we deploy new technologies, which we are sure will require at least the current level of effort to manage and co-ordinate. We have decided to request the effort we will need for global coordination of security for EGI, based on what we are currently doing and the foreseen future needs. We are convinced that we will continue to need at least this level of effort and that this will need to be funded somehow.

Experience of providing such security coordination over the last two years has shown that this includes multiple aspects that can be more clearly distinguished when evolving the task for the future, as presented in the following sub-sections.

Security Policy Coordination and the support of its implementation

Security policy development covers diverse aspects, including operational policies (agreements on vulnerability management, intrusion detection and prevention, regulation of access, and enforcement), incident response policies (governing the exchange of information and expected actions), participant responsibilities (including acceptable use policies, identifying users and managing user communities), traceability, legal aspects, and the protection of personal data. In an environment without central control, such as EGI, common identity management such as provided by the IGTF, is needed to ensure unique and persistent assignments of rights and privileges. Since research is global, such policies must be coordinated with peer infrastructures in Europe and elsewhere, such as PRACE-RI, Open Science Grid, XSEDE, and like efforts in the Asia Pacific. Coordination mechanisms such as the FIM4R group, TERENA REFEDS, SCI, Open Grid Forum and the IGTF are employed. For some elements of these policies (such as the common identity management) having a central reference implementation for immediate re-use by the NGIs saves on total effort needed in the long run. The use today of the centrally produced "EGI trust anchor distribution" is expected to continue.

Incident Response Task Force (IRTF) coordination and advanced incident response

Experience has shown that the complexity of multi-domain incidents at the scale of EGI necessitates dedicated experts in incident response and forensics to deal with global incidents and to provide support to EGI participants to address localised incidents before they spread across EGI. Experience with the rotational scheme used today in EGI has shown that it is very hard to retain unique expertise in a widely distributed community with high personnel turn-over. In practice, incident response is provided by a dedicated core team, with specialist forensics support concentrated in just a few individuals. It is essential that this expertise is available as and when needed, but it cannot provide global coverage for any EGI site. We propose to establish a small core team which holds the coordination role and provides advanced support in incident response and forensics. The primary responsibility for basic incident response and forensics will still lie with each NGI, while the EGI Global IRTF will coordinate incident response and information exchange. However, for complex multi-site incidents and in cases where advanced forensics is needed, the EGI Global IRTF will step in and take an active part, to protect the continued integrity of the EGI infrastructure as a whole. Investment in a relatively small amount of global coordination effort removes the need for each NGI to maintain its own specialist IT security capability and has the potential to realise cost savings within each NGI.

Software Vulnerability (SVG) coordination

The Software Vulnerability Group (SVG) aims at eliminating existing vulnerabilities from the deployed infrastructure, primarily from the grid middleware, and avoiding the introduction of new ones, thus preventing security incidents. This activity will need to continue both to handle new vulnerabilities found in the Grid middleware currently deployed, and to handle vulnerabilities in software used by future technology to facilitate the sharing of distributed resources such as federated clouds. The SVG handles vulnerabilities reported in software used specifically in the EGI infrastructure. This depends on investigation and risk assessment by a collaborative team drawn from technology providers and other security groups, known as the Risk Assessment Team or 'RAT'. Considering the recent number of vulnerabilities detected and the co-ordination effort needed with other entities (Technology providers, EGI software distribution managers and coordinators, and central operations co-ordination) this task needs explicit recognition and assignment of dedicated effort. In particular the SVG also has a role in determining the threat posed by software deployed in the infrastructure independent of specific vulnerability events. SVG also has a role in the co-ordination and prioritization of 'Vulnerability Assessment' work, which is the examination of software to find whether any vulnerabilities exist. The SVG has also been asked to assess or advise on the assessment of other pieces of software prior to recommending their deployment on the EGI infrastructure, but has insufficient manpower to carry this out.

Security Coordination through Security Service Challenges and Training

Participating in a global infrastructure is still not a very common task for some resource centres. Unless specific efforts are made to ensure communication on incidents is effective between all EGI participants, the 'weakest link' principle applies and the integrity of the entire infrastructure can inadvertently be put at risk by a single user or resource. The use of 'security drills', exercising the incident response communications channels, has proven particularly effective in ensuring open and effective exchange of information. Additionally, these security drills can be re-used at a site or national level, where they serve as trainings in computer security forensics and identification of intrusion and threats. To be effective, the security drills must be realistic, current with respect to the software and intrusion vectors used to exercise the site, and be based on the actual communication infrastructure of EGI. The drills need development (mainly contributed) and periodic use in realistic tests (the coordination function included here). Re-using the security drills for training and national (re)use needs limited 'train the trainer' effort which is best provided for centrally.

Security Monitoring Coordination

EGI is an interconnected federation where a single vulnerable place may have a huge impact on the whole infrastructure. In order to recognize the risks and to address potential vulnerabilities in a timely manner, the EGI Security Monitoring provides an oversight of the infrastructure from the security standpoint. Also, sites connected to EGI differ significantly in their level of security, and detecting weaknesses exposed by the sites allows EGI security operations to contact the sites before an issue leads to an incident. Information produced by security monitoring is also important during the assessment of new risks and vulnerabilities, since it makes it possible to identify the scope and impact of a potential security incident. The whole activity needs to be closely linked to other security-related tasks, namely the Incident Response Task Force and SVG, and to provide reliable and quick support to them (for instance to introduce new checks or process collected data). The task needs to cooperate with other activities responsible for general EGI monitoring and will need to coordinate developments among these activities. Additional connections need to be maintained to the operations dashboard and to the common activities supporting sites, to make sure detected security issues are handled properly.

Development/maintenance of security monitoring is described in the dedicated section on security monitoring.

Impact on funding

It has already been acknowledged that some areas of security coordination are underfunded today in EGI-InSPIRE. Lack of global effort in the Incident Response Team is a growing problem and the amount of global effort to coordinate SVG (currently 1 PM/year) is way too small. We have therefore decided to give honest estimates of the amount of effort required to perform adequate global coordination of security in EGI.

Experience over the last 2 years in EGI-InSPIRE has established that the amount of global coordination effort required to perform these critical duties is as follows.

Person-Months per year, total effort: [partners to be assigned]

  • 6+2 PM: Security policy coordination and the support of its implementation
  • 12 PM: IRTF coordination and advanced incident response
  • 6 PM: SVG coordination
  • 6 PM: Security coordination through service challenges and training
  • 4 PM: Security monitoring coordination

Total effort required: 36 PM/year.

Service Level Management: availability/reliability reports

Partner: AUTH

Current Status

This task includes the validation and distribution of monthly availability statistics for Resource Centres, NGIs, EGI.eu, and the coordination of the evolution of the EGI OLA framework and the related reporting tools.

Foreseen evolution

Currently, the focus is on finding a sufficiently mature solution to automate the delivery of monthly A/R statistics; this is an ongoing activity. It is therefore expected that this delivery will become automated at some point in the near future.

Documentation activities that may still be in progress are to be concluded by April 2014.

Impact on funding

A cost reduction after April 2014, by focusing only on having, maintaining and operating an automated service for the delivery of A/R statistics, is reasonable.

Infrastructure Services

Software Rollout

Partner: LIP

Current status

Updates of deployed software need to be gradually adopted in production after internal verification. This process is implemented in EGI through staged rollout, i.e. through the early deployment of a new component by a selected list of candidate Resource Centres. The successful verification of a new component is a precondition for declaring the software ready for deployment. Given the scale of the EGI infrastructure, this process requires careful coordination to ensure that every new capability is verified by a representative pool of candidate sites, to supervise the responsiveness of the candidate sites and ensure that the staged rollout progresses well without introducing unnecessary delays, and to review the reports produced. It also ensures the planning of resources according to the foreseen release schedules from the Technology Providers. EGI.eu coordination is necessary to ensure a successful interoperation of the various stakeholders: Resource Centres, Technology Providers, the EGI.eu Technical Manager and the EGI repository managers.

These activities include:

  • Definition and adoption of a workflow to automate software deployment
  • Coordination of the staged rollout activities carried out by the NGIs
  • Liaison with the UMD team (EGI-InSPIRE SA2) and the Product Teams


The Staged Rollout has a number of dependencies on other EGI tools:

  • EGI Single Sign On (EGI-SSO)
  • EGI RT
  • EGI Wiki
  • EGI Repositories
  • EGI Mail Managers
  • GGUS ticketing system

Current Issues

Most of the effort for the Staged Rollout comes from the coordination between the several stakeholders involved in the software rollout process: the Technology Providers, the EGI.eu Technical Manager and the EGI tools managers. A key element in this process is the Early Adopter teams. They are the human resources that do the actual work and the tests of new releases in production environments; without their commitment the testing and validation of new releases would be very hard to accomplish.

In the last years, the Staged Rollout coordination team managed to gather a significant set of Early adopters teams (63 in total), nonetheless even after a large effort, there are still several components coming from different TP that still do not have EA teams committed to their testing. As a consequence there are still around 10 components that were not made available into UMD release regardless all the effort in there development.

Foreseen evolution

The staged rollout (SR) internal process is not expected to change significantly in the near future. Nevertheless, it needs to take into account that the Technology Providers, as well as the infrastructure, will change.

With the end of two of the main development projects, IGE and EMI, in mid-2013, it is uncertain what will happen to the product teams of many of these components. It is expected that we will move from a well controlled and coordinated number of teams to a much higher number scattered across multiple communities. The infrastructure is also on the verge of a change towards the adoption of cloud computing which, if successful, will bring a burst of new products, with a direct consequence for SR: more coordination and a larger number of EAs will be needed. Furthermore, grid technology has reached a mature state, which means that a much lower rate of new functionalities and releases is expected, while on the other hand some new developments may occur in order to adapt to or adopt new software/service models.

Taking this scenario into account, it will be harder to track product release dates, but this will not have a direct impact on the SR process. What will change is the way product teams announce a new release. The announcements are expected on the respective web sites, RSS feeds or mailing list subscriptions.
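
As an illustration of how release announcements published through RSS feeds could be tracked, the sketch below parses an RSS 2.0 document with the Python standard library and lists the items published after a given date. The feed content, the product name and the `new_releases` helper are hypothetical; a real product team feed may be structured differently.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed, embedded so that the example is self-contained.
SAMPLE_FEED = """
<rss version="2.0">
  <channel>
    <title>Example product team releases</title>
    <item>
      <title>example-product 1.0.0</title>
      <pubDate>Mon, 06 May 2013 10:00:00 +0000</pubDate>
    </item>
    <item>
      <title>example-product 1.1.0</title>
      <pubDate>Mon, 03 Jun 2013 10:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>
"""


def new_releases(feed_xml, since):
    """Return (title, publication date) for feed items published after `since`."""
    root = ET.fromstring(feed_xml)
    found = []
    for item in root.iter("item"):
        published = parsedate_to_datetime(item.findtext("pubDate"))
        if published > since:
            found.append((item.findtext("title"), published))
    return found


last_check = datetime(2013, 5, 15, tzinfo=timezone.utc)
for title, published in new_releases(SAMPLE_FEED, last_check):
    print(title, published.date())
```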


Based on this assumption, we foresee the following changes in the SR process and in the interaction between stakeholders:

  • An increased number of TPs will increase the burden of SR coordination, but this will be balanced by a decrease in the release rate of new products and functionalities. This is a key issue: the communication channels between SR and the product teams (PTs) will play an even more important role, and the added burden could be mitigated by improving that communication. The relations can be strengthened through increased dissemination or new channels of communication.
  • The number of EAs may decrease when the funding for this activity ends; as a result, the verification and staged rollout of new releases are expected to take longer.
  • The tools used in SR need to be adapted to accommodate the dispersion of product teams, each with different release tools and schedules.

Impact on funding

After 2014 an increase in the number of technology providers is expected, but with a lower release frequency. A strong commitment from the product teams will be required for products to be accepted for SR.

  • Costs are expected to stay constant up to April 2014. After April 2014 the availability of national and regional funding is uncertain.
  • To maintain the current level of coordination, the funding should be kept at least at its present level.

Monitoring

Central SAM monitoring services

Partner: CERN

Current status

A distributed monitoring framework is necessary to continuously test the level of functionality delivered by each service node instance in the production Resource Centres, to generate alarms and tickets in case of critical failures, to compute monthly availability and reliability statistics, and to monitor and troubleshoot network problems. The Monitoring Infrastructure is a distributed service based on Nagios and messaging. The central services – operated by EGI.eu – include systems such as the MyEGI portal for the visualisation of information, and a set of databases for the persistent storage of information about test results, availability statistics, monitoring profiles and aggregated topology information. The central services need to interact with the local monitoring infrastructures operated by the NGIs. The central monitoring services are critical and need to deliver high availability.
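
To illustrate the monthly availability and reliability statistics mentioned above, the sketch below assumes the commonly used definitions in which availability excludes periods of unknown status, and reliability additionally excludes scheduled downtime. The function names and the sample figures are hypothetical and are not taken from SAM.

```python
def availability(up_hours, total_hours, unknown_hours=0.0):
    """Fraction of the known time during which the service was up."""
    known = total_hours - unknown_hours
    return up_hours / known if known > 0 else 0.0


def reliability(up_hours, total_hours, scheduled_downtime_hours=0.0, unknown_hours=0.0):
    """Like availability, but scheduled downtime does not count against the service."""
    expected = total_hours - scheduled_downtime_hours - unknown_hours
    return up_hours / expected if expected > 0 else 0.0


# Hypothetical 30-day month: 720 hours in total, 684 hours up,
# of which 24 of the missing hours were a scheduled downtime.
month_hours = 720.0
print("availability: %.3f" % availability(684.0, month_hours))
print("reliability:  %.3f" % reliability(684.0, month_hours, scheduled_downtime_hours=24.0))
```

With these definitions a scheduled downtime lowers availability but not reliability, which is why the two figures are reported separately.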

Foreseen evolution

Maintenance and operation of the central SAM services are foreseen to continue after EGI-InSPIRE in order to sustain the operational monitoring infrastructure, service notifications, availability and reliability computation, monthly report generation and user support. Development will be needed if new functionality is required by EGI.eu/OMB.

Impact on funding

Experience over the last 2 years in EGI-InSPIRE has shown that some areas of SAM (development, coordination and maintenance in particular) are underfunded. A reasonable amount of effort for these activities, based on the current scenario, is the following, in Person-Months (PM) per year:

  • 12 PM Project coordination
  • 30 PM Maintenance
  • 18 PM Operations and support
  • 30 PM Development (this is an estimate; the effort could be as low as 0 or as high as 50 depending on EGI requirements)
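
The figures above can be added up as follows. This is a simple sketch: treating 0 and 50 PM as the lower and upper bounds of the development effort is an interpretation of the estimate, and the `total_effort` helper is introduced only for the example.

```python
# Effort estimates from the list above, in Person-Months (PM) per year.
EFFORT_PM = {
    "Project coordination": 12,
    "Maintenance": 30,
    "Operations and support": 18,
    "Development": 30,
}


def total_effort(effort, development=None):
    """Sum the yearly effort, optionally overriding the Development estimate."""
    adjusted = dict(effort)
    if development is not None:
        adjusted["Development"] = development
    return sum(adjusted.values())


baseline = total_effort(EFFORT_PM)               # 12 + 30 + 18 + 30
low = total_effort(EFFORT_PM, development=0)     # no new development required
high = total_effort(EFFORT_PM, development=50)   # upper development estimate
print("baseline: %d PM, range: %d to %d PM per year" % (baseline, low, high))
```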

Broker network

Partner: GRNET (coord), SRCE, CERN

Current status

Currently the broker network consists of two separate networks of brokers, one for testing and one for production purposes. The production message broker network consists of 4 message brokers in Greece (GRNET/AUTH), Croatia (SRCE) and Switzerland (CERN), and it is used by the EGI operational tools infrastructures (i.e. SAM, APEL). Operation of the service requires applying regular updates and performing maintenance operations in close coordination among the partners running the 4 instances. Development effort is limited to developing monitoring tools designated for the broker network, as well as producing best practices for clients of the infrastructure.

Foreseen evolution

The foreseen evolution regarding the production EGI message broker network includes:

  • Security implementation and X.509 authentication for all clients of the PROD MSG network
  • Enhancement of current usage monitoring tools
  • Investigation of technical issues affecting the broker network and service performance optimization
  • Support provisioning for clients of the service

Impact on funding

Costs are expected to stay constant up to April 2014 (in order to address the maintenance, support and development effort at the current level). Reducing costs after April 2014 by focusing only on coordination, service maintenance and support is realistic.

Accounting

APEL central DB

Partner: STFC

Current status

The EGI Accounting Infrastructure is distributed. At a central level it includes the repositories for the persistent storage of usage records. The Accounting Infrastructure is essential in a service-oriented business model to record usage information. Accounting data needs to be validated and regularly published centrally. The central databases are populated through individual usage records published by the Resource Centres, or through the publication of summarised usage records.
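
The relation between individual usage records and summarised usage records can be sketched as a simple aggregation. The record fields (`site`, `vo`, `month`, `cpu_hours`), the grouping key and the site names are assumptions made for this example and do not reflect the actual APEL record format.

```python
from collections import defaultdict


def summarise(records):
    """Group individual usage records by (site, VO, month) and total them."""
    summary = defaultdict(lambda: {"jobs": 0, "cpu_hours": 0.0})
    for rec in records:
        key = (rec["site"], rec["vo"], rec["month"])
        summary[key]["jobs"] += 1
        summary[key]["cpu_hours"] += rec["cpu_hours"]
    return dict(summary)


# Hypothetical individual usage records, one per job.
records = [
    {"site": "SITE-A", "vo": "ops", "month": "2013-05", "cpu_hours": 1.5},
    {"site": "SITE-A", "vo": "ops", "month": "2013-05", "cpu_hours": 2.5},
    {"site": "SITE-B", "vo": "dteam", "month": "2013-05", "cpu_hours": 4.0},
]

summaries = summarise(records)
for key, totals in sorted(summaries.items()):
    print(key, totals)
```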

Dependencies

The Accounting Repository has a number of dependencies on other EGI tools:

  • Accounting Portal
  • GOCDB
  • EGI Message Brokers
  • GGUS
  • Operations Portal and Operations Dashboard

Foreseen evolution

2013/2014 Technical Evolution:

  • Accounting’s evolution over the final year of EGI-InSPIRE will be to consolidate the new Accounting repository to receive records from the EMI APEL 3 client, from the regional APEL server implementations and from other CPU accounting sources (MAPPER, EDGI, UNICORE etc.). Additionally, the archived data in summary form will be migrated to the new format.
  • Other types of accounting (Cloud, Storage, Application etc.) are at an early testing stage for the accounting repository, although the SSM publishing method has proved robust and usable in these new contexts. Next steps for the coming year are to test with multiple clients, ensuring that their representation of data is consistent from one client to another, and to produce summaries for the portal to publish.

By the end of EGI-InSPIRE there will be a pervasive accounting infrastructure collecting data for EGI and other international VOs from a number of e-infrastructures including EGI, OSG, NorduGrid, PRACE and a variety of middleware stacks. The central repository will receive data directly from NGI repositories (some running APEL code), from external infrastructures, and directly from sites running various clients, mainly APEL. The infrastructure should handle at least accounting of CPU, storage and clouds, but will be extensible to other types of usage.

This puts EGI in a good position to exchange accounting data with other infrastructures and to perform a core accounting role within the ERA.

  • Integration of accounting records from different types of accounting is the next step, which would continue after EGI-InSPIRE if funding is available, as this is not yet in development. The implementation of storage accounting integrated with cloud accounting, for example, can only realistically be worked on once we have a body of data from the storage and cloud clients which may be related. The same applies to cloud/CPU and cloud/application integration. The feasibility of this integration requires some research, which is not envisaged to start before Q3 2013 at the earliest and is expected to extend beyond EGI-InSPIRE.

Impact on funding

  • Costs expected to stay constant up to April 2014 (aiming to address the developments set out in the first two bullets above and continue maintenance and support at the current level).
  • Current level of match funding from GridPP is expected until April 2014.
  • Reducing costs before April 2014 is unrealistic.
  • Post-April 2014: if there are additional development requirements, funding for these will be required. This would cover further developments in other types of accounting (Cloud, Storage, Application etc.). It is expected that requirements for these will evolve as their usage increases, and this will require work beyond maintenance and operations. If the evolution of accounting is to include the integration of data from different types of accounting, this would also require further funding.

Central accounting portal

Partner: CESGA

Current status

The EGI accounting infrastructure is a complex system that involves various sensors in different regions, all publishing data to a central repository. The data is processed, summarized and displayed in the Accounting Portal, which acts as a common interface to the different accounting record providers and presents a homogeneous view of the data gathered and a user-friendly access to understanding resource utilisation.

The current production version (v4.2 Fomalhaut) of the Accounting Portal is available at https://accounting.egi.eu. The regional Accounting Portal is ready, pending support from APEL regionalization.

Dependencies

The Accounting Portal has functional dependencies on the following tools:

  • GOCDB: List of sites and NGIs in production, list of available services in production.
  • CIC Portal: VOMS endpoints and VO list.
  • Accounting Repository: Accounting records and summarized accounting data.

Foreseen evolution

In the period 2013/2014, the following directions are planned:

  • Contributed CPUs by site - Measuring the contribution in CPUs vs the number of hours, enhancing the existing reporting, which is based on SPECInts.
  • Preliminary support for parallel (MPI) jobs - There were advances in this after the MPI VT lifespan, but further software development is needed.
  • New Accounting - Integration of all advances in Network, Storage and Application accounting from JRA1.4, as well as any new developments that may arise.
  • EGI User usage accounting - Improvements to the userDN accounting.
  • Preliminary cloud support - Display and integration of the experimental sites offering cloud computing capabilities under the fedcloud task.
  • Regional portal codebase improvements - Implementation and improvement based on regional experiences if appropriate.
  • XML endpoints generalization and improvement - Implementation of a friendlier XML interface and documentation for users wishing programmatic access.
  • Cloud support improvements - Further integration of cloud accounting characteristics.

Impact on funding

The funding should remain constant until February 2014. After February 2014 the status, dedication and funding of the staff maintaining the portal are not clear. The availability of national and regional funding is also uncertain; perhaps the situation will be clearer next year. To ensure continued operation, the funding would have to be maintained at the current level.

Security Monitoring

  • Security Nagios server. Partner: GRNET
  • CSIRT Pakiti. Partner: CESNET

Current status

The objective of a Security Infrastructure is to protect itself from intrusions such as exploitable software vulnerabilities, misuse by authorised users, resource "theft", etc., while allowing the information, resources and services to remain accessible and productive to its intended users. A specifically designed set of tools and services helps reduce these vulnerabilities: monitoring of individual Resource Centres (based on Nagios and Pakiti), a central security dashboard that allows sites, NGIs and EGI Computer Security Incident Response Teams to access security alerts in a controlled manner, and a ticketing system to support coordination efforts.

Foreseen evolution

Currently EGI CSIRT runs two services implementing security monitoring, the security Nagios box and the Pakiti service for monitoring patch management, which provide a security-related overview of the infrastructure. Within the scope of EGI-InSPIRE we plan to increase the coverage of EGI (so-called site-wide monitoring), start producing metrics summarizing the level of security as seen by EGI CSIRT, improve the handling of detected issues (sending notifications, close integration with ticketing systems), and finish the next generation of the Pakiti server for monitoring security patches.

Impact on funding

After EGI-InSPIRE we will need to keep at least the operations and maintenance of the key services (Pakiti and security Nagios). Estimated costs are 1 PM for operations (OPS) and 2 PM for maintenance (MAINT) for each service, totalling 6 PM per year for both services.
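
The total quoted above follows from the per-service figures, as the short calculation below shows. The variable names are introduced only for this example.

```python
# Yearly effort per service, in Person-Months (PM), as quoted above.
PER_SERVICE_PM = {"OPS": 1, "MAINT": 2}
SERVICES = ["Pakiti", "security Nagios"]

per_service_total = sum(PER_SERVICE_PM.values())   # 1 + 2
yearly_total = per_service_total * len(SERVICES)   # two services
print("%d PM per service, %d PM per year in total" % (per_service_total, yearly_total))
```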

Configuration Repository (GOCDB)

Partner: STFC

Current status

EGI relies on a central database (GOCDB) to record static information about different entities such as the Operations Centres, the Resource Centres, and the service instances. It also provides contact, role and status information. GOCDB is a source of information for many other operational tools, such as the broadcast tool, the Aggregated Topology Provider, etc.

Foreseen evolution

1yr Technical Evolution: GOCDB needs to evolve along the following themes to address current and emerging stakeholder requirements:

  • GOCDB v5 (~April/May). Replaces Oracle PROM database with ORM DB objects. Is needed to support different RDBMSs, improves performance and will simplify development. Requires changes to PI to be accepted by all PTs. See: https://wiki.egi.eu/wiki/Doctrine
  • Update current mutually-exclusive ‘EGI’ and ‘Local’ scope tags to be non-exclusive. Allows sites/services to be tagged multiple times with project-specific tags (e.g. ‘UK_NES’) and wider ‘EGI’ scope tags. Objects are created once. Maintains the integrity of topology information across different target infrastructures. PI ‘scope’ parameter value to support comma-separated list. Service scope values chosen from Site scope values.
  • Render GOCDB data in Glue2 XML and provide new PI method(s) to post downtimes using XML. Needed to address interoperability and data consistency across different info-systems/infrastructures. Has been requested by different stakeholders.
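
The non-exclusive scope tags and the comma-separated ‘scope’ parameter described above can be sketched as follows. This is not GOCDB code: the function names and site names are hypothetical, and the assumption that a comma-separated list matches entities carrying any of the listed tags is made only for the example, since the matching semantics are not specified here.

```python
def parse_scope(param):
    """Split a comma-separated scope parameter into a set of tags."""
    return {tag.strip() for tag in param.split(",") if tag.strip()}


def service_scopes_valid(site_scopes, service_scopes):
    """Service scope values must be chosen from the scope values of the site."""
    return set(service_scopes) <= set(site_scopes)


def filter_by_scope(entities, scope_param):
    """Return the names of entities tagged with at least one requested scope.

    Matching on ANY tag is an assumption made for this sketch.
    """
    wanted = parse_scope(scope_param)
    return [name for name, tags in entities.items() if wanted & tags]


# Hypothetical sites; each object is created once and tagged several times.
sites = {
    "SITE-A": {"EGI"},
    "SITE-B": {"EGI", "UK_NES"},
    "SITE-C": {"Local"},
}

print(filter_by_scope(sites, "EGI, UK_NES"))
print(filter_by_scope(sites, "UK_NES"))
print(service_scopes_valid(sites["SITE-B"], {"UK_NES"}))
```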

Impact on funding

  • Costs expected to stay constant up to April 2014 (aiming to address these developments and continue ops support at/around the current level).
  • Current level of match funding from GridPP is expected until April 2014.
  • Reducing costs before April 2014 is unrealistic.
  • Cost changes post April 2014 are hard to predict; they depend on subsequent changes to requirements. Current level of match funding from GridPP is expected until 2015.

Operations Portal

Partner: IN2P3

Current status

EGI.eu provides a central portal for the operations community that offers a bundle of different capabilities, such as the broadcast tool, VO management facilities, and a dashboard for grid operators that is used to display information about failing monitoring probes and to open tickets to the Resource Centres affected. The dashboard also supports the central grid oversight activities. It is fully interfaced with the EGI Helpdesk and the monitoring system through the message passing. It is a critical component as it is used by all EGI Operations Centres to provide support to the respective Resource Centres.

Foreseen evolution

The Operations Portal is based on many emerging technologies used in Web development (frameworks, CSS templating systems, JavaScript libraries, ...).

The emergence of such technologies provides new opportunities for users in terms of ergonomics, visualisation layer and service offering, and also brings new usages and new possible features.

The Operations Portal team follows these evolutions and aims to provide these opportunities to the community if they are useful.

The portal has been designed in a modular way and as an aggregation platform. The flexibility of the architecture allows for a huge number of data sources and provides standard access to information.

The core of the data gathering system is the web service facility called Lavoisier.

Lavoisier’s flexibility allows us to be ready to integrate almost any kind of new information if needed and meaningful. New resource types coming into the EGI production infrastructure, such as HPC systems, virtualized resources and desktop resources, can be integrated via plug-ins inside Lavoisier.


Impact on funding

With the current level of funding (PY4) we can ensure daily maintenance, and we can provide some improvements to the existing tool only if developments are limited.

A decrease of funding after April 2014 will imply that additional development requirements will not be taken into account, and that further developments to extend the current tool with new technologies and new sources of information (HPC or cloud resources) will not be possible. All new developments would also require new funding.

Metrics Portal

Partner: CESGA

Current status

The Metrics Portal displays a set of metrics that will be used to monitor the performance of the infrastructure and the project, and to track their changes over time. The portal automatically collects all the required data and calculates these metrics before displaying them in the portal. The portal aggregates information from different sources such as GOCDB, GGUS, etc.

The Metrics Portal has been used for the last year to gather metrics from the project tasks. Depending on changes to the structure and scope of the project and its tasks and activities, the portal will be updated while keeping the old metrics for their validity periods.

The Portal also monitors the evolution of the infrastructure month by month, and it is the only way to access historical data on infrastructure evolution (as opposed to real-time data).

Dependencies

The Metrics Portal has many dependencies. These include:

  • Accounting Portal: To display accounting metrics, most active VOs, Number of submitted jobs, etc.
  • BDII: Number of CPUs and Cores in production, online and nearline storage, MPI sites.
  • GGUS: Number of tickets created/closed, ticket response times, number of tickets created by priority, etc.
  • GOCDB: Sites in production, number of countries and NGIs in EGI.
  • ACE: Availability and reliability metrics.

Foreseen evolution

  • Manual metrics expansion and refinement - Aimed at finer granularity and semantics to enable better user reporting.
  • Views enhancement and optimization - Cross-browser integration, possibly AJAX functionality, mobile integration.
  • Regional portal codebase improvements - Refactoring, code documentation and quality improvement.
  • GGUS metrics improvement and new A/R metrics
  • Access Control improvements - Finer grained access mechanism.
  • New customized reports with Excel support

Impact on funding

The funding should remain constant until February 2014. After February 2014 the status, dedication and funding of the staff maintaining the portal are not clear. The availability of national and regional funding is also uncertain; perhaps the situation will be clearer next year. To ensure continued operation, the funding would have to be maintained at the current level.

Helpdesk

Partner: KIT

Current status

EGI provides support to users and operators through a distributed helpdesk with central coordination (GGUS). The central helpdesk provides a single interface for support. The central system is interfaced to a variety of other ticketing systems at the NGI level in order to allow a bi-directional exchange of tickets (for example, those opened locally can be passed to the central instance or other areas, while user and operational problem tickets can be opened centrally and subsequently routed to the NGI local support infrastructures).

Foreseen evolution

  • GGUS Report Generator - Some fine tuning is still to be done. Another report [ETA accuracy] needs implementation
  • VOMS synchronization - Restructuring VOMS synchronization to make it fail-safe.
  • GGUS Interfaces with other ticketing systems - Interfaces for PRACE/MAPPER, DANTE, NGI_IBERGRID
  • Alarm process for central operations tools - Implement alarm process for EGI central operations tools
  • Implementation of specific workflows for CSIRT/Security - The CSIRT team is currently evaluating whether they want to use GGUS. If CSIRT uses GGUS, the permissions and access rights schema needs to be adapted to their needs.
  • Web portal - Introduce alternative authentication/authorization processes and provide specific view for CMS VO
  • Integration of operations portal in GGUS
  • Ticket monitoring in GGUS - Provide input on how to process tickets waiting for the submitter's input, and on how to process tickets waiting for activity from technology providers after the end of EMI.


Impact on funding

With the current level of funding, the planned improvements can be implemented until April 2014. Reduced funding after April 2014 will make it hard to provide further improvements. The current level of match funding from KIT is expected to continue until the end of 2014. For 2015 and later, the KIT funding situation is not yet clear.


Core and Catch-all Services

Partner: GRNET JRU

Current status

Auxiliary core services are needed for the good operation of Infrastructure Services. Examples of such services are VOMS service and VO membership management for infrastructural VOs (DTEAM, OPS), the provisioning of middleware services needed by the monitoring infrastructure (e.g. top-BDII and WMS), the catch-all CA and other catch-all core services to support small user communities (central catalogues, workflow schedulers, authentication services), middleware monitoring SAM instance as well as site certification services.

Foreseen evolution

In addition to the above, this should include the provisioning of central SAM instances for ad-hoc monitoring objectives (like the middleware monitoring SAM instance). In summary, the following are part of the foreseen evolution:

  • Catch-all Grid Services for small user communities (this includes VOMS service for infrastructural VOs, catch-all CA services, SAM instances, central grid services)
  • Tools (Grid Services) for Resource Centre certification (operation of site certification infrastructure including central core services such as WMS and Top-BDII)

Impact on funding

Costs are expected to stay constant up to April 2014. Reducing costs after April 2014 by focusing only on service operation and support is realistic.

Resources