EGI-InSPIRE:Plan 2012 SA1.4



Assessment of progress, 2011

Completed activities

Migration from gridops.org domain of EGI central tools to egi.eu domain

The migration was completed on July 4th 2011, when the gridops.org domain was decommissioned. The decommissioning did not cause problems for grid or operational tools or for any external system. All EGI central tools now use the egi.eu domain; the list can be found on the following page: Tools.

Definition of procedures relevant for operational tools

Two procedures relevant for operational tools were approved at the Operations Management Board on March 15th 2011:

One manual relevant for operational tools was accepted at the Operations Management Board on July 26th 2011:

Central MyEGI deployment

The central MyEGI instance (http://grid-monitoring.cern.ch/myegi/) was deployed in February 2011 after the SAM Update-09 release.

Regionalization of OPS VO

It was agreed that CERN will continue running the VOMRS service and that the management of the VO will be transferred to EGI. At the Operations Management Board on January 25th 2011 it was agreed that the VO managers will be Emir Imamagic and Peter Solagna. Furthermore, since each NGI can have only 2 DNs registered in the OPS VO, it was decided that all VO-related operations will be performed by the two VO managers.

Mechanisms for automating maintenance of ActiveMQ brokers

CERN has provided a tool (mbcg) for automatic generation of configuration files for all ActiveMQ brokers. AUTH further modified the generated packages in order to provide a more generic solution. In addition, AUTH provided certificate-protected wiki pages where all the sensitive data and installation instructions are stored.

Additional activities

During 2011 the SAM infrastructure was fully distributed. At the end of the year the following SAM/Nagios instances were in production:

A detailed list of SAM/Nagios instances can be found on the following page: SAM Instances.

The Metrics portal reached a stable version and was used in the generation of QR6.

Ongoing activities

Monitoring of operations tools

Development of the new SAM instance for operational tools monitoring started in PQ5. The first step was reorganization of operational tools in the GOCDB:

Additional details can be found in the following slides: https://www.egi.eu/indico/conferenceDisplay.py?confId=549. This reorganization will enable automatic bootstrap of the SAM instance for operational tools and its integration with the MyEGI web interface and the ACE system for A/R calculation.

CIC Portal decommission

Decommissioning of the old CIC portal (cic.egi.eu) was postponed and is on the roadmap for 2012. The postponement was caused by the development of CIC features (VO ID cards, broadcast) in the Operations Portal.

High Availability implementation for Operational Tools

GOCDB failover implementation was postponed due to the GOCDB 4.1 development. The task is on the roadmap for Dec 2011 and 2012 (now depending on Fraunhofer Institute).

SAM release Update-13 will provide the functionality to deploy a secondary instance. Whether a secondary SAM instance is deployed will depend on NGI size and resources.

Plans for 2012

Monitoring of operations tools

Deployment of the full-blown SAM instance for operations tools monitoring is planned for January / February 2012. The new instance will contain the same components as a SAM NGI instance (MyEGI, databases, message bus integration) and will publish results to the central SAM database.

Security implementation in messaging system

Migration to authenticated-only connections to the PROD message broker network. The infrastructure is ready to use authenticated connections, but the following steps have to be completed:

Timelines:

2012/Q1

2012/Q2

2012/Q3

2012/Q4

CIC Portal decommission

The decommission of the old CIC Portal is planned for May 2012.

High Availability implementation for Operational Tools

A central GOCDB failover is expected to be in place at Fraunhofer around the end of December 2011 or January 2012. A DNS switch for the 'goc.egi.eu' domain between the production server and the failover server is in place (but not yet tested). Once installed, the failover will be read-only in order to prevent data-synchronization problems.

This milestone includes the implementation of a notification system to warn the administrators of the tools in case of failure (for example, through GGUS alarm tickets).

Test message broker network deployment

Deployment of a TEST message broker network with the same setup as the PROD one. This broker network will be used for:

This activity will be done in Q1.

Messaging system reliability and availability improvements

ActiveMQ 5.5 contains a new feature that allows dynamic failover of the servers. It allows the message broker administrator to dynamically add or remove brokers on the network and to request clients to move to another instance. The feature will be tested on the TEST broker network in Q1 and, if the tests are successful, it will be implemented on the production network in Q2.
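The client-side behaviour described above can be illustrated with a toy model. This is only a sketch of the idea, not ActiveMQ code: the `FailoverClient` class, its method names and the broker host names are hypothetical and chosen for illustration.

```python
class FailoverClient:
    """Toy client that follows a broker list pushed by the network (hypothetical model)."""

    def __init__(self, brokers):
        if not brokers:
            raise ValueError("at least one broker is required")
        self.brokers = list(brokers)
        self.current = self.brokers[0]

    def update_brokers(self, brokers):
        # Called when the administrator adds or removes brokers on the network.
        if not brokers:
            raise ValueError("at least one broker is required")
        self.brokers = list(brokers)
        if self.current not in self.brokers:
            # The broker we were using was removed: move to a remaining one.
            self.current = self.brokers[0]

    def rebalance(self, target):
        # Called when the administrator asks clients to move to another instance.
        if target not in self.brokers:
            raise ValueError(f"unknown broker: {target}")
        self.current = target


# Hypothetical host names, for illustration only.
client = FailoverClient(["broker-a.example.org", "broker-b.example.org"])
client.update_brokers(["broker-b.example.org", "broker-c.example.org"])
print(client.current)  # broker-b.example.org
client.rebalance("broker-c.example.org")
print(client.current)  # broker-c.example.org
```

In this model the client never needs a restart or a configuration change: the list of brokers and the instruction to move both arrive from the network side.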

ActiveMQ 5.5 provides the Virtual Destinations feature, which minimizes the loss of messages due to disconnections for Topic destinations: messages are placed in queues, and queues keep the messages until the consumer receives them. Virtual Destinations will be implemented in production in Q1.
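The difference between a plain topic and a queue-backed Virtual Destination can be shown with a small in-memory model. This is a simplified sketch, not the ActiveMQ implementation: the `Topic` and `VirtualDestination` classes and the consumer name are hypothetical, and the topic is assumed to have non-durable subscribers only.

```python
from collections import deque


class Topic:
    """Non-durable topic: a message reaches only currently connected subscribers."""

    def __init__(self):
        self.connected = {}  # subscriber name -> list of received messages

    def publish(self, message):
        for inbox in self.connected.values():
            inbox.append(message)


class VirtualDestination:
    """Topic whose messages are copied into one queue per registered consumer."""

    def __init__(self, consumers):
        self.queues = {name: deque() for name in consumers}

    def publish(self, message):
        for queue in self.queues.values():
            queue.append(message)  # kept even while the consumer is offline

    def receive_all(self, name):
        queue = self.queues[name]
        messages = list(queue)
        queue.clear()
        return messages


topic = Topic()
virtual = VirtualDestination(["consumer-1"])

# "consumer-1" is disconnected while the message is published.
topic.publish("msg-1")
virtual.publish("msg-1")

# The consumer reconnects afterwards.
topic.connected["consumer-1"] = []
print(topic.connected["consumer-1"])      # []
print(virtual.receive_all("consumer-1"))  # ['msg-1']
```

The message published during the disconnection is lost on the plain topic but is still waiting in the consumer's queue on the Virtual Destination.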

Messaging system scalability improvements

Currently, brokers keep connections alive until the clients close them. This can lead to a large number of idle connections (held either by producers that do not send anything or by consumers that listen on queues/topics where there are no messages). Since each connection to a broker consumes resources of the service, idle connections will be evicted. The automatic eviction mechanism will be implemented in Q1.
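A minimal sketch of the eviction idea follows, assuming that a connection counts as idle when it has shown no activity for longer than a fixed timeout. The `ConnectionTable` class, the connection names and the 300-second timeout are hypothetical; the actual broker mechanism and its settings are not described in this plan.

```python
class ConnectionTable:
    """Tracks the last activity time of each connection (illustrative model)."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout  # seconds; assumed value, not from the plan
        self.last_activity = {}           # connection id -> time of last activity

    def touch(self, conn_id, now):
        # Record activity (a message sent or received) on a connection.
        self.last_activity[conn_id] = now

    def evict_idle(self, now):
        # Drop every connection that has been silent longer than the timeout.
        idle = [c for c, t in self.last_activity.items()
                if now - t > self.idle_timeout]
        for conn_id in idle:
            del self.last_activity[conn_id]
        return sorted(idle)


table = ConnectionTable(idle_timeout=300)
table.touch("producer-1", now=0)
table.touch("consumer-1", now=0)
table.touch("consumer-1", now=250)   # consumer-1 receives a message
print(table.evict_idle(now=400))     # ['producer-1']
```

Passing the current time in explicitly keeps the sketch deterministic; a real service would use a monotonic clock and run the eviction periodically.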

Staged message broker network software upgrades

After the successful upgrade of the ActiveMQ software from the version 5.3 to the version 5.5, new updates will be staged as they are released on the TEST broker network and then deployed to the PROD network minimizing the possible outage duration.

The next update is scheduled for deployment in the first months of 2012 (Jan/Feb) without disrupting service operation.

Deployment of new dashboards

The Security Dashboard will be released to production in January 2012. The VO Dashboard will be released to production between March and April 2012.

Deployment of the refactored Operations Dashboard is planned for April / May 2012.

Improvement of synchronization between SAM and Operations Portal

The Operations Portal currently uses a special topic in the messaging system for receiving alarms from the SAM system. A topic does not ensure that a message is delivered to the subscriber. In order to make the synchronization mechanism more reliable, it was proposed to switch from the topic to a Virtual Destination, which ensures that the message is delivered to the subscriber. Development work has already been started by the messaging system and Operations Portal teams. The estimated timeline for this activity is Q2.

Site A/R monitoring

Based on the requirements from the COD activity, we started working on a solution for raising alarms when a site's A/R drops significantly. This mechanism will enable sites to recover before breaking OLA requirements. Besides the probe itself, work is required on the JRA1 side in order to define which component will be responsible for running the probe (SAM or Operations Portal). Once all the details are agreed upon, the probe will be released into production. The estimated timeline is Q2.
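The early-warning logic could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the function names, the 70% OLA threshold, the 10-point warning margin, and the simplification that availability is the fraction of "OK" samples among samples with a known status. The actual probe design and thresholds are still to be agreed.

```python
def availability(samples):
    """Fraction of 'OK' samples among samples whose status is known."""
    known = [s for s in samples if s != "UNKNOWN"]
    if not known:
        return None
    return sum(1 for s in known if s == "OK") / len(known)


def check_site(samples, ola_threshold=0.70, warning_margin=0.10):
    """Classify a site; both thresholds are hypothetical defaults."""
    value = availability(samples)
    if value is None:
        return "NO_DATA"
    if value < ola_threshold:
        return "OLA_VIOLATION"   # too late: the requirement is already broken
    if value < ola_threshold + warning_margin:
        return "ALARM"           # still compliant, but close to the limit
    return "OK"


print(check_site(["OK"] * 19 + ["CRITICAL"] * 1))   # OK
print(check_site(["OK"] * 15 + ["CRITICAL"] * 5))   # ALARM
print(check_site(["OK"] * 5 + ["CRITICAL"] * 5))    # OLA_VIOLATION
```

The "ALARM" band is the point of the mechanism: it is raised while the site still meets the requirement, so the site has time to recover.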
