The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team at operations @ egi.eu.

Difference between revisions of "EGI-InSPIRE:Plan 2012 SA1.4"

Latest revision as of 19:31, 24 December 2014



Assessment of progress, 2011

Completed activities

Migration from gridops.org domain of EGI central tools to egi.eu domain

The migration was successfully finalized on July 4th 2011, when the gridops.org domain was decommissioned. The decommissioning did not cause problems for grid or operational tools or for any external system. All EGI central tools now use the egi.eu domain; the list can be found on the following page: Tools.

Definition of procedures relevant for operational tools

Two procedures relevant for operational tools were approved at the Operations Management Board on March 15th 2011:

  • Adding new probes to SAM (PROC07)
  • Management of the EGI OPS Availability and Reliability Profile (PROC08)

One manual relevant for operational tools was accepted at the Operations Management Board on July 26th 2011:

  • Tool Intervention Management (MAN04)

Central MyEGI deployment

The central MyEGI instance (http://grid-monitoring.cern.ch/myegi/) was deployed in February 2011 after the SAM Update-09 release.

Regionalization of OPS VO

It was agreed that CERN will continue running the VOMRS service and that the management of the VO will be transferred to EGI. At the Operations Management Board on January 25th 2011 it was agreed that the VO managers will be Emir Imamagic and Peter Solagna. Furthermore, since each NGI can have only 2 DNs registered in the OPS VO, it was decided that all VO-related operations will be performed by the two VO managers.

Mechanisms for automating maintenance of ActiveMQ brokers

CERN has provided a tool (mbcg) for automatic generation of configuration files for all ActiveMQ brokers. AUTH further modified the generated packages in order to provide a more generic solution. In addition, AUTH provided certificate-protected wiki pages where all the sensitive data and installation instructions are stored.

Additional activities

During 2011 the SAM infrastructure was fully distributed. At the end of the year the following SAM/Nagios instances were in production:

  • 26 NGI instances covering 37 EGI partners
  • 2 ROC instances covering 2 EGI partners
  • 1 project instance covering 1 EGI partner
  • 3 external ROC instances covering the following regions: Canada, IGALC and LA.

A detailed list of SAM/Nagios instances can be found on the following page: SAM Instances.

The Metrics portal reached a stable version and was used in QR6 generation.

Ongoing activities

Monitoring of operations tools

Development of the new SAM instance for operational tools monitoring started in PQ5. The first step was the reorganization of operational tools in the GOCDB.

Additional details can be found in the following slides: https://www.egi.eu/indico/conferenceDisplay.py?confId=549. This reorganization will enable automatic bootstrap of the SAM instance for operational tools and integration with the MyEGI web interface and the ACE system for A/R calculation.

CIC Portal decommission

Decommissioning of the old CIC portal (cic.egi.eu) was postponed and is on the roadmap for 2012. The postponement was caused by the development of CIC features (VO ID cards, broadcast) in the Operations Portal.

High Availability implementation for Operational Tools

GOCDB failover implementation was postponed due to the GOCDB 4.1 development. The task is on the roadmap for Dec 2011 and 2012 (now depending on Fraunhofer Institute).

SAM release Update-13 will provide the functionality of deploying a secondary instance. A secondary SAM instance will be deployed depending on NGI size and resources.

Plans for 2012

Monitoring of operations tools

Deployment of the full-blown operations tools monitoring SAM instance is planned for January/February 2012. The new instance will contain the same components as a SAM NGI instance (MyEGI, databases, message bus integration) and will publish results to the central SAM database.

Security implementation in messaging system

Migration to authenticated-only connections to the PROD message broker network. The infrastructure is ready to use authenticated connections, but the following steps have to be completed first:

  • Identify and register the clients
  • Set up credentials for each client
  • Modify client configurations to use authenticated access

Timelines:

2012/Q1

  • Implementation of credential sharing procedure between brokers
  • First round of communication with clients to set up credentials / request to clients to migrate to authenticated connections

2012/Q2

  • Second round of communication with clients to set up credentials / request to clients to migrate to authenticated connections

2012/Q3

  • Third round of communication with clients to set up credentials / request to clients to migrate to authenticated connections
  • Request to EGI OMB to approve authenticated-only connections to the PROD message broker network

2012/Q4

  • Enforce authentication for all clients (assuming OMB approval)
  • Enforce authorization rules (assuming OMB approval)

CIC Portal decommission

The decommission of the old CIC Portal is planned for May 2012.

High Availability implementation for Operational Tools

The central GOCDB failover is expected to be in place at Fraunhofer at approximately the end of December 2011 or in January 2012. A DNS switch for the 'goc.egi.eu' domain between the production server and the failover server is in place (but not yet tested). Once installed, the failover instance will be read-only in order to prevent data-synchronization problems.

This milestone includes the implementation of a notification system to warn the administrators of the tools in case of failure (for example, through GGUS alarm tickets).
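
The decision to switch to the failover and notify the administrators could be sketched as follows. This is only an illustration: the function name `should_fail_over` and the rule of three consecutive failed checks are assumptions, not part of the GOCDB failover design.

```python
def should_fail_over(check_results, failures_needed=3):
    """Return True when the most recent health checks all failed.

    check_results is a chronological list of booleans (True = service
    answered). The number of consecutive failures is an assumed value.
    """
    recent = check_results[-failures_needed:]
    return len(recent) == failures_needed and not any(recent)


history = [True, False, False, False]
if should_fail_over(history):
    # In a real setup this is where the DNS switch and the
    # notification (for example a GGUS alarm ticket) would be triggered.
    print("primary unavailable: switch to failover and notify administrators")
```

Requiring several consecutive failures avoids switching on a single transient error.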

Test message broker network deployment

Deployment of a TEST message broker network with the same setup as the PROD one. This broker network will be used for:

  • Staging new operations tools to the messaging infrastructure
  • Staging new versions and features of the message broker software

This activity will be done in Q1.

Messaging system reliability and availability improvements

ActiveMQ 5.5 contains a new feature that allows dynamic failover of the servers. It allows the message broker administrator to dynamically add or remove brokers on the network and to request clients to move to another instance. The feature will be tested on the test broker network in Q1 and, if everything goes well, it will be implemented on the production network in Q2.
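
The client side of dynamic failover can be pictured with a small simulation. It is a sketch only: the class `FailoverClient`, its method `on_cluster_update` and the broker host names are hypothetical and do not correspond to any ActiveMQ client API. The client keeps a list of known brokers, receives an updated list from the network, and moves only if its current broker was removed.

```python
class FailoverClient:
    """Toy model of a client that follows broker-list updates."""

    def __init__(self, brokers):
        if not brokers:
            raise ValueError("at least one broker is required")
        self.brokers = list(brokers)
        self.current = self.brokers[0]

    def on_cluster_update(self, brokers):
        """Apply a new broker list pushed by the network.

        Returns True when the client had to move to another broker.
        """
        if not brokers:
            raise ValueError("broker list must not be empty")
        self.brokers = list(brokers)
        if self.current in self.brokers:
            return False
        self.current = self.brokers[0]
        return True


client = FailoverClient(["mb-a.example.org", "mb-b.example.org"])
# A broker is added: the client stays where it is.
moved_on_add = client.on_cluster_update(
    ["mb-a.example.org", "mb-b.example.org", "mb-c.example.org"])
# The client's broker is removed: the client moves to another instance.
moved_on_remove = client.on_cluster_update(
    ["mb-b.example.org", "mb-c.example.org"])
print(moved_on_add, moved_on_remove, client.current)
```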

ActiveMQ 5.5 provides the Virtual Destinations feature, which minimizes the loss of messages caused by disconnections from Topic destinations: unlike topics, queues keep the messages until the consumer receives them. Virtual Destinations will be implemented in production in Q1.
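
The difference between a plain topic and a queue-backed virtual destination can be illustrated with a toy in-memory model. It is a sketch under simplifying assumptions: `ToyBroker` and its methods are hypothetical names, not the ActiveMQ API, and real brokers add persistence, acknowledgements and durable subscriptions that are ignored here.

```python
from collections import deque


class ToyBroker:
    """Toy model contrasting a plain topic with per-consumer queues."""

    def __init__(self):
        self.connected = {}   # consumer name -> list of received messages
        self.queues = {}      # consumer name -> deque of pending messages

    def connect(self, name):
        self.connected.setdefault(name, [])

    def disconnect(self, name):
        self.connected.pop(name, None)

    def publish_topic(self, message):
        """Plain topic: only currently connected consumers get the message."""
        for inbox in self.connected.values():
            inbox.append(message)

    def register_queue(self, name):
        self.queues.setdefault(name, deque())

    def publish_virtual(self, message):
        """Virtual destination: copy the message into every consumer queue."""
        for queue in self.queues.values():
            queue.append(message)

    def drain(self, name):
        """Deliver everything pending in the consumer's queue."""
        queue = self.queues[name]
        delivered = list(queue)
        queue.clear()
        return delivered


broker = ToyBroker()
broker.connect("portal")
broker.publish_topic("alarm-1")
topic_received = broker.connected["portal"]
broker.disconnect("portal")
broker.publish_topic("alarm-2")    # lost: nobody is connected

broker.register_queue("portal")
broker.publish_virtual("alarm-1")
broker.publish_virtual("alarm-2")  # kept in the queue while the consumer is away
queue_received = broker.drain("portal")
print(topic_received, queue_received)
```

In the topic case the second alarm never reaches the disconnected consumer, while the queue holds both alarms until they are read.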

Messaging system scalability improvements

Currently, brokers keep connections alive until the clients close them. This can lead to a large number of idle connections (either from producers that do not send anything or from consumers that listen on queues/topics where there are no messages). Given that each connection to a broker consumes resources of the service, idle connections will be evicted. The automatic eviction mechanism will be implemented in Q1.
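
The eviction rule can be sketched as a simple check on the time of last activity. This is an illustration only: the function `evict_idle`, the connection names and the 300-second limit are assumptions, not the mechanism or the settings used on the broker network.

```python
def evict_idle(last_activity, now, max_idle_seconds):
    """Split connections into kept and evicted by idle time.

    last_activity maps a connection id to the timestamp (in seconds) of
    its last sent or received message.
    """
    kept, evicted = {}, []
    for conn_id, seen in last_activity.items():
        if now - seen > max_idle_seconds:
            evicted.append(conn_id)
        else:
            kept[conn_id] = seen
    return kept, sorted(evicted)


connections = {"producer-1": 100.0, "consumer-1": 950.0, "consumer-2": 400.0}
kept, evicted = evict_idle(connections, now=1000.0, max_idle_seconds=300.0)
print(sorted(kept), evicted)
```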

Staged message broker network software upgrades

After the successful upgrade of the ActiveMQ software from version 5.3 to version 5.5, new updates will be staged on the TEST broker network as they are released and then deployed to the PROD network, minimizing the possible outage duration.

The next update has been scheduled for deployment in the first months of 2012 (Jan/Feb) without disrupting the service operation.

Deployment of new dashboards

The Security Dashboard will be released to production in January 2012. The VO Dashboard will be released to production between March and April 2012.

Deployment of the refactored Operations Dashboard is planned for April / May 2012.

Improvement of synchronization between SAM and Operations Portal

The Operations Portal currently uses a special topic in the messaging system for receiving alarms from the SAM system. A topic in the messaging system does not ensure that a message is delivered to the subscriber. In order to make the synchronization mechanism more reliable, it was proposed to switch from a topic to a Virtual Destination, which ensures that the message is delivered to the subscriber. Development work has already been started by the messaging system and Operations Portal teams. The estimated timeline for this activity is Q2.

Site A/R monitoring

Based on the requirements from the COD activity, we started working on a solution for raising alarms when a site's A/R drops significantly. This mechanism will enable sites to recover before breaking OLA requirements. Besides the probe itself, work is required on the JRA1 side in order to define which component will be responsible for running the probe (SAM or the Operations Portal). Once all the details are agreed upon, the probe will be released into production. The estimated timeline is Q2.
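
The kind of check such a probe might perform can be sketched as follows. Everything here is an assumption made for illustration: the availability and reliability formulas are one common formulation, the thresholds (70% and 75%) and the 5-point warning margin are placeholder values, and the function names are hypothetical. The actual probe logic is still to be agreed.

```python
def availability(up, total, unknown=0.0):
    """Fraction of known time the site was up (assumed definition)."""
    known = total - unknown
    return up / known if known > 0 else None


def reliability(up, total, scheduled_down=0.0, unknown=0.0):
    """Like availability, but scheduled downtime is not counted against the site."""
    expected = total - scheduled_down - unknown
    return up / expected if expected > 0 else None


def ar_alarms(up, total, scheduled_down=0.0, unknown=0.0,
              min_availability=0.70, min_reliability=0.75, margin=0.05):
    """Return the metrics that are within `margin` of their minimum or below it."""
    alarms = []
    a = availability(up, total, unknown)
    r = reliability(up, total, scheduled_down, unknown)
    if a is not None and a < min_availability + margin:
        alarms.append("availability")
    if r is not None and r < min_reliability + margin:
        alarms.append("reliability")
    return alarms


# Hours in a 30-day month; the figures are made up for the example.
at_risk = ar_alarms(up=520, total=720, scheduled_down=20)
healthy = ar_alarms(up=700, total=720)
print(at_risk, healthy)
```

Raising the alarm while the value is still above the minimum is what gives the site time to recover before the requirement is broken.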