From EGIWiki
Revision as of 13:22, 20 June 2013 by Eimamagi (Integration with EGI operational tools)


Scenario 5: Reliability/Availability of Resource Providers

Leader: Emir Imamagic, SRCE

Scenario collaborators

Role             Institution  Name
Scenario leader  SRCE         Emir Imamagic
Collaborator     INFN         Daniele Cesini
Collaborator     CESGA        Ivan Diaz
Collaborator     CESGA        Alvaro Simon

What Monitoring means in this context

Monitoring in this context means monitoring the availability and reliability of the cloud resources offered by the resource providers. The test checks whether a hypothetical user can instantiate at least one predefined virtual machine within a given period of time. The monitoring is "external": no data will be collected from inside the VMs or from the underlying virtualization systems. Monitoring the capacity of the cloud resource providers, i.e. how many resources are available, is beyond the scope of this Scenario, at least in its initial phase. Possible extensions of the FedCloud monitoring will be evaluated once the basic monitoring is in place.

The outcome of Scenario 5 will be a system able to run at least one probe on each Resource Provider participating in the FedCloud.
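The external test described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the probe actually used in FedCloud: `instantiate` stands for a hypothetical client call that starts the predefined VM on one RP, and the time limit is a free parameter. Only the OK (0) and CRITICAL (2) states of the usual Nagios plugin convention are used.

```python
import time
from typing import Callable, Dict

# Standard Nagios plugin states; only OK and CRITICAL are used in this sketch.
OK, CRITICAL = 0, 2


def check_vm_instantiation(instantiate: Callable[[], bool],
                           max_seconds: float) -> int:
    """Return a Nagios state for one Resource Provider.

    `instantiate` is a hypothetical callable that asks the RP to start the
    predefined VM, blocks until the VM is running, and returns True on success.
    This sketch only measures the elapsed time; it does not abort a slow call.
    """
    start = time.monotonic()
    try:
        started = instantiate()
    except Exception:
        # Any error raised while talking to the RP counts as a failed test.
        return CRITICAL
    elapsed = time.monotonic() - start
    # Pass only if the VM came up and did so within the allowed period.
    return OK if started and elapsed <= max_seconds else CRITICAL


def run_probes(providers: Dict[str, Callable[[], bool]],
               max_seconds: float) -> Dict[str, int]:
    """Run the same external check once against every participating RP."""
    return {name: check_vm_instantiation(fn, max_seconds)
            for name, fn in providers.items()}
```

For example, `run_probes({"RP-A": start_on_a, "RP-B": start_on_b}, max_seconds=600)` would return one Nagios state per RP; the RP names, the callables and the 600-second limit are placeholders.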

Given the experience accumulated with NAGIOS within the EMI and EGI projects, the monitoring framework will be based on NAGIOS. This also eases the integration of the FedCloud monitoring framework into the SAM monitoring system used by EGI to monitor the production infrastructure.

Integration with EGI operational tools


The overview below shows the current status of the integration of the RPs' main service types in GOCDB. Status values:

  • OK: service properly defined in GOCDB, passing SAM test
  • WARN: service properly defined in GOCDB, failing SAM test, please check output on SAM instance: https://cloudmon.egi.eu/nagios
  • MISSING_INFO: endpoint is defined in GOCDB, but description needs to be improved. Please check the special comments for defining service endpoints in GOCDB (serviceUrl, other attributes)
  • NO_ENDPOINT: endpoint is not defined in GOCDB
Status per RP and service type (eu.egi.cloud. prefix omitted; storage-management.cdmi is optional):

  • BSC-Cloud (CDMI-only): storage-management.cdmi OK
  • FZJ: accounting OK; information.bdii OK; vm-management.occi WARN, pending OCCI probe OpenStack modifications (set ServiceUrl according to the instructions below)
  • GWDG (GoeGrid): accounting OK; information.bdii OK; vm-management.occi WARN, pending OCCI probe OpenStack modifications; storage-management.cdmi OK
  • IN2P3-CC: accounting NO_ENDPOINT; information.bdii NO_ENDPOINT; vm-management.occi WARN, pending OCCI probe OpenStack modifications
  • INFN (INFN-IGI-CNAF-FedCloud): accounting MISSING_INFO (the APEL publisher should be modified to publish as INFN-IGI-CNAF-FedCloud or another name without space characters); information.bdii OK; vm-management.occi WARN, pending OCCI probe WNoDeS modifications
  • KTH (KTH-CLOUD): accounting MISSING_INFO (the APEL publisher should be modified to publish as KTH-CLOUD or another name without space characters); information.bdii OK; vm-management.occi OK
  • LAL (GRIF): accounting OK; information.bdii OK; vm-management.occi WARN (VM INSTANTIATION CRITICAL - HTTP request failed: 403 Forbidden)
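The status labels defined above can be read as a fixed decision order. The sketch below assumes that order (a missing endpoint first, then an incomplete description, then the SAM test result) and takes hypothetical boolean inputs instead of querying GOCDB or SAM.

```python
def integration_status(endpoint_defined: bool,
                       description_complete: bool,
                       sam_test_passed: bool) -> str:
    """Map one RP service type to the status labels used on this page."""
    if not endpoint_defined:
        return "NO_ENDPOINT"
    if not description_complete:
        # e.g. serviceUrl or other required attributes not filled in
        return "MISSING_INFO"
    # Endpoint properly defined: the SAM test result decides the label.
    return "OK" if sam_test_passed else "WARN"
```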


The following service types were added to GOCDB:

  • eu.egi.cloud.accounting
  • eu.egi.cloud.information.bdii
  • eu.egi.cloud.storage-management.cdmi
  • eu.egi.cloud.vm-management.occi
  • eu.egi.cloud.vm-metadata.marketplace

All RPs must register their cloud service endpoints in GOCDB to enable integration with the other operational tools.

The first step is to define the site to which the endpoints will belong. There are two options:

1. Register resources on an existing EGI site

  • prerequisites:
    • the RP is associated with the existing site, and the team handling the existing grid services is the same as, or works very closely with, the cloud team
    • the site's Certification Status is "Certified"

2. Register resources on a new site

In both cases service endpoints should have the following flags set:

  • Production: 'Y' or 'N', depending on the readiness of your resources (in both cases the site's availability/reliability will not be affected and no alarms will be raised in the Operations Portal)
  • Beta: 'N'
  • Monitored: 'Y'
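The site and flag rules above can be expressed as two small checks. This is an illustrative sketch only: the function names and dictionary keys are hypothetical, and it assumes that registering a new site is the fallback whenever the prerequisites for an existing site are not met.

```python
from typing import Dict, List


def choose_site_option(rp_associated_with_site: bool,
                       teams_same_or_close: bool,
                       certification_status: str) -> int:
    """Return 1 (register on the existing EGI site) or 2 (register a new site).

    Assumption: option 2 is used whenever option 1's prerequisites are not met.
    """
    if (rp_associated_with_site and teams_same_or_close
            and certification_status == "Certified"):
        return 1
    return 2


def check_endpoint_flags(flags: Dict[str, str]) -> List[str]:
    """List problems with an endpoint's flags; an empty list means compliant.

    `flags` uses hypothetical keys "production", "beta" and "monitored",
    each holding 'Y' or 'N'.
    """
    problems = []
    # Production may take either value, depending on resource readiness.
    if flags.get("production") not in ("Y", "N"):
        problems.append("Production must be 'Y' or 'N'")
    if flags.get("beta") != "N":
        problems.append("Beta must be 'N'")
    if flags.get("monitored") != "Y":
        problems.append("Monitored must be 'Y'")
    return problems
```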

Special rules apply for the following service types:

  • eu.egi.cloud.vm-management.occi: the Endpoint URL field must contain the following information:

Neither <image_name> nor <network_name> may contain spaces. An example for OpenStack is:


and for OpenNebula (ON):


Note: the parameter platform=openstack should only be set by RPs using OpenStack, while OpenNebula RPs must define the network parameter. More information about the probe used to test OCCI endpoints is available at https://github.com/pkasprzak/FedCloud-probes/.
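The endpoint rules above could be checked before entering the URL in GOCDB with a sketch like the following. Only the `platform` and `network` parameters are named in the text; the sketch assumes that the remaining settings, such as the image name, are also passed as URL query parameters, and it conservatively checks every query value for spaces. It is a hypothetical helper, not part of the OCCI probe.

```python
from typing import List
from urllib.parse import parse_qs, urlsplit


def check_occi_endpoint_url(url: str, stack: str) -> List[str]:
    """Check an OCCI endpoint URL; `stack` is "openstack" or "opennebula".

    Returns a list of problems; an empty list means no rule was violated.
    """
    problems = []
    # parse_qs decodes '+' and '%20' to spaces, so encoded spaces are caught too.
    params = parse_qs(urlsplit(url).query)
    for name, values in params.items():
        if any(" " in value for value in values):
            problems.append(f"parameter '{name}' contains spaces")
    # platform=openstack must only appear on OpenStack RPs.
    if "openstack" in params.get("platform", []) and stack != "openstack":
        problems.append("platform=openstack is only for OpenStack RPs")
    # OpenNebula RPs must define the network parameter.
    if stack == "opennebula" and "network" not in params:
        problems.append("OpenNebula RPs must define the network parameter")
    return problems
```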

Further information about GOCDB can be found on the following page: GOCDB/Input_System_User_Documentation.


A central SAM instance has been deployed to monitor the cloud resources. Once the set of probes is fully defined, the probes will be included in the official SAM release; after that, the central instance will be switched off.

The SAM instance is available at https://cloudmon.egi.eu/nagios.

The list of tests is available at https://cloudmon.egi.eu/poem/admin/poem/profile/1/.


Nagios probes

Who is responsible for developing probes? Following the EGI model, probes are developed by the Technology Providers and integrated into the monitoring framework by EGI-JRA1 staff, who can also provide guidelines and templates during the initial phase of probe development.

Information on how to develop NAGIOS probes can be found in the SAM Development Guide.
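As a rough illustration of the NAGIOS plugin convention that such probes follow (the exit code carries the state: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN; the first output line carries a short summary), the hypothetical sketch below produces output in the same form as the "VM INSTANTIATION CRITICAL" result quoted above for LAL. The rule that any non-2xx HTTP status is CRITICAL is an assumption, not the policy of the actual OCCI probe; the SAM Development Guide remains the reference for the real interface.

```python
from http import HTTPStatus
from typing import Tuple

# Nagios plugin convention: exit code = state, first stdout line = summary.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def format_result(check: str, state: int, message: str) -> str:
    """Build the one-line output, e.g. 'VM INSTANTIATION OK - VM instantiated'."""
    return f"{check} {STATE_NAMES[state]} - {message}"


def evaluate_http_status(status: int) -> Tuple[int, str]:
    """Hypothetical policy: 2xx means success, any other status is CRITICAL."""
    if 200 <= status < 300:
        return OK, "VM instantiated"
    try:
        phrase = HTTPStatus(status).phrase
    except ValueError:
        # Non-standard status codes have no registered reason phrase.
        phrase = "unknown status"
    return CRITICAL, f"HTTP request failed: {status} {phrase}"


def main(status: int) -> int:
    # A real probe would obtain `status` by contacting the service endpoint.
    state, message = evaluate_http_status(status)
    print(format_result("VM INSTANTIATION", state, message))
    return state
```

A standalone probe script would typically end with `sys.exit(main(status))`, so that Nagios receives the state as the process exit code.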

The list of probes available within EGI is given in the SAM Administrator Guide.

The EGI SAM System

The SAM system is basically a framework consisting of:
- Nagios monitoring system (https://www.nagios.org),
- custom databases for topology, probe descriptions and test results,
- web interface MyWLCG/MyEGI (https://tomtools.cern.ch/confluence/display/SAM/MyWLCG)

Probes used to check services are provided by the service developers: for EMI services by the EMI product teams, for the Globus Toolkit by the IGE project, and so on. The SAM team only maintains probes that test internal SAM functions (e.g. communication with the messaging system, database synchronization).

More information on SAM can be found here.


The SAM system EGI wiki pages

File:Flessr nagios probes.pdf (Thanks to David Wallom)