Federated Cloud Monitoring

From EGIWiki

Scope

In this context, monitoring means checking the availability and reliability of the cloud resources offered by the resource providers. The test verifies that a hypothetical user can instantiate at least one predefined virtual machine within a given period of time. The monitoring is "external": no data is collected from inside the VMs or from the underlying virtualization systems. Monitoring the capacity of the cloud resource providers, that is, how many resources are available, is beyond the scope of this Scenario, at least in its initial phase. Possible evolution of the FedCloud monitoring will be evaluated once the basic monitoring is in place.

The outcome of this scenario will be a system that can run at least one probe on each Resource Provider participating in the FedCloud.

Given the experience accumulated with the NAGIOS system within the EMI and EGI projects, the monitoring framework will be based on NAGIOS. This also eases the integration of the FedCloud monitoring framework into the SAM monitoring system used by the EGI project to monitor the production infrastructure.

Members

Role             Institution  Name
Scenario leader  SRCE         Emir Imamagic
Collaborator     INFN         Daniele Cesini
Collaborator     CESGA        Ivan Diaz

Roadmap

Probe: OCCI - extensions
Description: Decided at the F2F to deploy contextualization tests, which are already implemented. Tasks:
  • check if this can be applied to all OCCI endpoints
  • probe configuration
  • define test name.
Responsible: Boris Parak, Emir Imamagic
Deadline: February 2015
Status: DONE

Probe: CDMI
Description: Decided at the F2F to implement a lightweight probe similar to the existing SRM probe.
Responsible: Emir Imamagic
Deadline: February 2015
Status: DONE

Probe: APEL
Description: Decided at the F2F to rewrite APEL tests to be consistent with grid sites (APEL-Pub only). Tasks:
  • implement publishing mechanism on APEL side
  • define topic name & configure cloudmon consumer to listen to it
  • define test name.
Responsible: Stuart Pullinger, Emir Imamagic
Deadline: ?? 2015
Status: IN PROGRESS

Probe: Site-BDII
Description: Decided at the F2F to remove the service type for Cloud BDII, eu.egi.cloud.information.bdii, and to request that OCCI/CDMI specific tests be added to GLUE2-Validator (http://gridinfo.web.cern.ch/glue/glue-validator-guide). Tasks:
  • define list of tests & open a ticket to GLUE2-Validator developers
  • request removal of service type eu.egi.cloud.information.bdii.
See list of tests proposed by Salvatore Pinto below.
Responsible: Peter Solagna, Emir Imamagic
Deadline: ?? 2015
Status: IN PROGRESS

Probe: vmcatcher BDII test
Description: Decided at the F2F to implement a basic BDII vmcatcher test.
See list of tests proposed by Salvatore Pinto below.
Responsible: Emir Imamagic
Deadline: April 2015
Status: IN PROGRESS

Probe: vmcatcher OCCI test
Description: Decided at the F2F to develop a wrapper around the existing OCCI probe that will enable testing of the vmcatcher monitoring image, which is updated every 6h.
Responsible: Boris Parak, Emir Imamagic
Deadline: April 2015
Status: IN PROGRESS

Probe: COMPSs
Description: Tasks:
  • test and integrate COMPSs tests on cloudmon.egi.eu.
Responsible: Emir Imamagic
Deadline: April 2015
Status: IN PROGRESS

Probe: VMDIRAC
Description: Tasks:
  • test and integrate VMDIRAC tests on cloudmon.egi.eu.
Responsible: Emir Imamagic
Deadline: April 2015
Status: IN PROGRESS

Probe: Alternative AuthN Mechanisms
Description: Peter requested enabling alternative AuthN in probes. Need to clarify in which cases this is required and the optimal mechanism for defining credentials.
Responsible: Peter Solagna
Deadline: ??
Status: TBD

Probe: Native cloud middleware tests
Description: Need to provide an estimate of the effort needed to develop probes for native cloud middleware tests. Tasks for concrete cases will be opened upon request.
Responsible: Emir Imamagic
Deadline: April 2015
Status: IN PROGRESS

Examples of BDII tests proposed by Salvatore Pinto:

1. At least one OS_template is published
ldapsearch -x -H ldap://bdii.marie.hellasgrid.gr:2170 -b GLUE2DomainID=PRISMA-INFN-BARI,GLUE2GroupID=grid,o=glue "(objectClass=GLUE2ApplicationEnvironment)" returns numEntries >= 1
2. At least one Resource_template is published
ldapsearch -x -H ldap://bdii.marie.hellasgrid.gr:2170 -b GLUE2DomainID=PRISMA-INFN-BARI,GLUE2GroupID=grid,o=glue "(objectClass=GLUE2ExecutionEnvironment)" returns numEntries >= 1
3. At least one (OCCI or CDMI) endpoint is published
ldapsearch -x -H ldap://bdii.marie.hellasgrid.gr:2170 -b GLUE2DomainID=PRISMA-INFN-BARI,GLUE2GroupID=grid,o=glue "(&(objectClass=GLUE2Endpoint)(|(|(GLUE2EndpointInterfaceName=CDMI))(|(GLUE2EndpointInterfaceName=OCCI))))" returns numEntries >= 1
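
The pass/fail decision in these tests depends only on the entry count that ldapsearch reports. The following is a minimal Python sketch of that decision, not the actual probe. The function names and the sample output are hypothetical, and the threshold is left as a parameter that defaults to the "at least one" reading. A missing "# numEntries" line is treated as zero entries.

```python
import re
import subprocess

# Hypothetical sample of the trailer that ldapsearch prints after the entries.
SAMPLE_OUTPUT = """\
# search result
search: 2
result: 0 Success

# numResponses: 3
# numEntries: 2
"""


def run_ldapsearch(ldap_url: str, base_dn: str, ldap_filter: str) -> str:
    """Run ldapsearch with the same options as the commands above."""
    completed = subprocess.run(
        ["ldapsearch", "-x", "-H", ldap_url, "-b", base_dn, ldap_filter],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


def count_entries(ldapsearch_output: str) -> int:
    """Return N from the '# numEntries: N' line, or 0 if the line is absent."""
    match = re.search(r"^# numEntries: (\d+)\s*$", ldapsearch_output, re.MULTILINE)
    return int(match.group(1)) if match else 0


def enough_entries(ldapsearch_output: str, minimum: int = 1) -> bool:
    """True if the search returned at least `minimum` entries."""
    return count_entries(ldapsearch_output) >= minimum
```

In a real check, the string passed to enough_entries would come from run_ldapsearch, called with the BDII URL, base DN and filter of one of the three tests.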


Documentation

Integration with EGI operational tools

GOCDB

The following service types can be added to GOCDB:

  • eu.egi.cloud.accounting (required)
  • eu.egi.cloud.information.bdii (required)
  • eu.egi.cloud.storage-management.cdmi 
  • eu.egi.cloud.vm-management.occi (required)
  • eu.egi.cloud.vm-metadata.marketplace

All RPs must enter their cloud service endpoints in GOCDB to enable integration with other operational tools.
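
As an illustration only, the "(required)" markers in the list above can be turned into a simple completeness check for one site. This sketch does not query GOCDB. The list of registered service types is assumed to be obtained separately, and the function name is hypothetical.

```python
# Service types marked "(required)" in the list above.
REQUIRED_SERVICE_TYPES = frozenset({
    "eu.egi.cloud.accounting",
    "eu.egi.cloud.information.bdii",
    "eu.egi.cloud.vm-management.occi",
})


def missing_required(registered_service_types):
    """Return the required service types a site has not registered, sorted."""
    return sorted(REQUIRED_SERVICE_TYPES - set(registered_service_types))
```

For example, a site that has registered only the OCCI endpoint would be reported as missing the accounting and BDII service types.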

The first step is to define the site to which the endpoints will belong. There are two possible options:

1. Register resources on an existing EGI site

  • pre-reqs:
    • the RP is associated with the existing site, and the team handling the existing grid services is the same as, or works closely with, the cloud team
    • site's Certification Status is "Certified"

2. Register resources on a new site

In both cases service endpoints should have the following flags set:

  • Production: 'N' (if the site is ready to certify the service, please contact the fedcloud list first)
  • Beta: 'N'
  • Monitored: 'Y'

Special rules apply for the following service types:

eu.egi.cloud.storage-management.cdmi

Endpoint URL field must contain the following info:

http[s]://hostname:port

eu.egi.cloud.vm-management.occi

Endpoint URL field must contain the following info:

https://hostname:port/?image=<image_name>&resource=<resource_name>

Neither <image_name> nor <resource_name> may contain spaces. These attributes map to os_tpl and resource_tpl, respectively.
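
To make the URL convention concrete, here is a hedged Python sketch of how a probe might split such an Endpoint URL into its parts. The function name, the returned dictionary layout and the example hostname are assumptions for illustration, not the actual probe code.

```python
from urllib.parse import parse_qs, urlsplit


def parse_occi_endpoint(url: str) -> dict:
    """Split an OCCI Endpoint URL into host, port, os_tpl and resource_tpl."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("OCCI endpoint URL must use https")
    if not parts.hostname or parts.port is None:
        raise ValueError("both hostname and port are required")

    query = parse_qs(parts.query)
    values = {}
    # GOCDB parameter name -> OCCI attribute it maps to.
    for param, attribute in (("image", "os_tpl"), ("resource", "resource_tpl")):
        found = query.get(param, [])
        if len(found) != 1:
            raise ValueError(f"exactly one '{param}' parameter is required")
        if any(ch.isspace() for ch in found[0]):
            raise ValueError(f"'{param}' must not contain spaces")
        values[attribute] = found[0]

    return {"host": parts.hostname, "port": parts.port, **values}
```

Called with https://occi.example.org:8787/?image=monitoring_image&resource=small (a made-up endpoint), the function returns the host, the port and the two template names. A value that decodes to a string containing a space raises ValueError.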

org.openstack.nova

Endpoint URL field must contain Keystone URL (https://hostname:port/url) with the following additional info:

https://hostname:port/url?image=<image_uuid>&resource=<flavor_name>

Neither <image_uuid> nor <flavor_name> may contain spaces.
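
The following sketch shows one way to assemble the Endpoint URL for this service type from its three parts. It is an illustration under assumptions: the function name and the example Keystone URL, image UUID and flavor name are hypothetical, and the scheme check simply mirrors the https form shown above.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_nova_endpoint(keystone_url: str, image_uuid: str, flavor_name: str) -> str:
    """Append the image and resource parameters to a Keystone URL."""
    for label, value in (("image", image_uuid), ("resource", flavor_name)):
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"'{label}' must be non-empty and contain no spaces")

    parts = urlsplit(keystone_url)
    if parts.scheme != "https" or not parts.hostname:
        raise ValueError("Keystone URL must look like https://hostname:port/url")

    query = urlencode({"image": image_uuid, "resource": flavor_name})
    # Keep the Keystone path as-is and drop any fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
```

For instance, build_nova_endpoint("https://keystone.example.org:5000/v2.0", "3f8e1c2a-0b4d-4e6f-9a7b-5c2d1e0f9a8b", "m1.small") produces the Keystone URL followed by ?image=...&resource=m1.small.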

Further information about GOCDB can be found on the following page: GOCDB/Input System User Documentation.

SAM

A central SAM instance is deployed for monitoring cloud resources. Once the set of probes is fully defined, the probes will be included in the official SAM release. Once they are included in the official release, the central instance will be switched off.

The SAM instance is available at the following address: https://cloudmon.egi.eu/nagios.

Detailed description of tests can be found here.

Technology

Nagios probes

Who is responsible for developing probes? Following the EGI model, probes are developed by the Technology Providers and integrated into the monitoring framework by the EGI-JRA1 staff, who can also provide guidelines and templates during the initial phase of probe development.

Information on how to develop NAGIOS probes can be found in the SAM Development Guide.

The list of probes available within EGI is given in the SAM Administrator Guide.
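
For orientation, a Nagios probe is a program that prints a one-line status message and reports its result through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The skeleton below is a minimal sketch of that convention in Python and is not taken from the SAM guides. The check callable, the time limit and the choice to report an overrun as CRITICAL are assumptions made for illustration.

```python
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_probe(check, time_limit_s, clock=time.monotonic):
    """Run check() and map its outcome to a (exit_code, message) pair.

    check is a hypothetical callable returning True on success, for example
    a function that tries to instantiate a predefined virtual machine.
    The elapsed time is measured after the fact; check() is not interrupted.
    """
    start = clock()
    try:
        succeeded = bool(check())
    except Exception as exc:
        return UNKNOWN, f"UNKNOWN - probe failed to run: {exc}"
    elapsed = clock() - start

    if not succeeded:
        return CRITICAL, f"CRITICAL - check failed after {elapsed:.1f}s"
    if elapsed > time_limit_s:
        return CRITICAL, f"CRITICAL - check took {elapsed:.1f}s (limit {time_limit_s}s)"
    return OK, f"OK - check passed in {elapsed:.1f}s"


def main(check, time_limit_s=600):
    """Print the status line and return the exit code for sys.exit()."""
    code, message = run_probe(check, time_limit_s)
    print(message)
    return code
```

A probe script would end with sys.exit(main(some_check)), so that Nagios reads the printed line as the status text and the exit code as the result.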

The EGI SAM System

The SAM system is a framework consisting of:
- the Nagios monitoring system (https://www.nagios.org),
- custom databases for topology, probe descriptions and test results,
- the web interface MyWLCG/MyEGI (https://tomtools.cern.ch/confluence/display/SAM/MyWLCG).

The probes used to check services are provided by the service developers. For EMI services, probes are provided by the EMI product teams; for the Globus Toolkit, probes are provided by the IGE project; and so on. The SAM team only maintains probes that test internal SAM functions (e.g. communication with the messaging system, database synchronization, etc.).

More information on SAM can be found here.

References

The SAM system EGI wiki pages

File:Flessr nagios probes.pdf (Thanks to David Wallom)