Difference between revisions of "Fedcloud-tf:WorkGroups:Scenario5"

{{Fedcloud-tf:Menu}} {{Fedcloud-tf:WorkGroups:Menu}} {{TOC_right}}
#REDIRECT[[Federated_Cloud_Monitoring]]
 
 
== Scenario 5: Reliability/Availability of Resource Providers  ==
 
 
 
<font color="red">Leader: Emir Imamagic, SRCE</font>
 
 
 
== Scenario collaborators  ==
 
 
 
{| border="1"
 
|-
 
! Role
 
! Institution
 
! Name
 
|-
 
| Scenario leader
 
| SRCE
 
| Emir Imamagic
 
|-
 
| Collaborator
 
| INFN
 
| Daniele Cesini
 
|-
 
| Collaborator
 
| CESGA
 
| Ivan Diaz
 
|-
 
| Collaborator
 
| CESGA
 
| Alvaro Simon
 
|}
 
 
 
<br>
 
 
 
== What Monitoring means in this context  ==
 
 
 
Monitoring in this context means monitoring the availability and reliability of the cloud resources offered by the resource providers. The test checks whether a hypothetical user can instantiate at least one predefined virtual machine within a given period of time. This is "external" monitoring: no data will be collected from inside the VMs or the underlying virtualization systems. Monitoring the capacity of the cloud resource providers, in terms of how many resources are available, is beyond the scope of this Scenario, at least in its initial phase. Possible evolution of the FedCloud monitoring will be evaluated once the basic monitoring is in place.
 
 
 
The outcome of Scenario5 will be a system that is able to run at least one probe on each Resource Provider participating in the FedCloud.
 
 
 
Given the experience accumulated with the [http://www.nagios.org NAGIOS] system within the EMI and EGI projects, the monitoring framework will be based on NAGIOS. This also eases the integration of the FedCloud monitoring framework into the [https://wiki.egi.eu/wiki/SAM SAM] monitoring system used by the EGI project to monitor the production infrastructure.
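
The availability and reliability figures mentioned in this section can be illustrated with a small calculation. The sketch below is not taken from this page: it assumes one commonly used convention, in which availability is the fraction of known time a service was up and reliability additionally excludes scheduled downtime. The function names and the choice of hours as the unit are hypothetical.

```python
def availability(up: float, total: float, unknown: float = 0.0) -> float:
    """Fraction of the known period during which the service was up.

    Assumed convention (not stated on this page): periods with unknown
    status are removed from the denominator.
    """
    known = total - unknown
    if known <= 0:
        raise ValueError("no period with known status")
    return up / known


def reliability(up: float, total: float,
                scheduled_down: float = 0.0, unknown: float = 0.0) -> float:
    """Like availability, but scheduled downtime is also excluded.

    Assumed convention: a provider is not penalised for downtime that was
    announced in advance.
    """
    expected = total - unknown - scheduled_down
    if expected <= 0:
        raise ValueError("no period in which the service was expected to be up")
    return up / expected


# Example: a 24 hour day with 18 hours up and 6 hours of scheduled downtime.
day_availability = availability(up=18, total=24)
day_reliability = reliability(up=18, total=24, scheduled_down=6)
```

With these assumed definitions, scheduled maintenance lowers availability but leaves reliability untouched.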
 
 
 
== Integration with EGI operational tools  ==
 
 
 
=== Status ===
 
 
 
The table below shows the current integration status of the RPs' main service types in GOCDB. Explanation of status:
 
* OK: service properly defined in GOCDB, passing SAM test
 
* WARN: service properly defined in GOCDB, failing SAM test; please check the output on the SAM instance: https://cloudmon.egi.eu/nagios
 
* MISSING_INFO: endpoint is defined in GOCDB, but its description needs to be improved. Please check the special rules for defining service endpoints in GOCDB (serviceUrl, other attributes)
 
* NO_ENDPOINT: endpoint is not defined in GOCDB
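
The four status values above follow a simple order of precedence, which can be sketched as below. This is only an illustration of the definitions in the list; the function name and its three boolean inputs are hypothetical and do not correspond to any GOCDB or SAM interface.

```python
def integration_status(endpoint_defined: bool,
                       info_complete: bool,
                       sam_passing: bool) -> str:
    """Map the state of one service endpoint to a status label.

    The checks are ordered from the most basic problem to the least:
    no endpoint at all, then an incomplete description, then a failing test.
    """
    if not endpoint_defined:
        return "NO_ENDPOINT"
    if not info_complete:
        return "MISSING_INFO"
    if not sam_passing:
        return "WARN"
    return "OK"


# An endpoint that is fully described in GOCDB but fails its SAM test.
status = integration_status(endpoint_defined=True,
                            info_complete=True,
                            sam_passing=False)
```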
 
 
 
{| class="wikitable"
 
|-
 
! RP
 
! eu.egi.cloud.accounting
 
! eu.egi.cloud.information.bdii
 
! eu.egi.cloud.vm-management.occi
 
! eu.egi.cloud.storage-management.cdmi (optional)
 
|-
 
| [https://goc.egi.eu/portal/index.php?Page_Type=View_Object&object_id=120657&grid_id=0 BSC-Cloud] (CDMI-only)
 
| colspan=3|
 
| style="background: green; color: white"| OK
 
|-
 
| [https://goc.egi.eu/portal/index.php?Page_Type=View_Object&object_id=173&grid_id=0 CESGA]
 
| style="background: green; color: white"| OK
 
| style="background: green; color: white"| OK
 
| style="background: green; color: white"| OK
 
|
 
|-
 
| [https://goc.egi.eu/portal/index.php?Page_Type=View_Object&object_id=119894&grid_id=0 CESNET (CESNET-MetaCloud)]
 
| style="background: green; color: white"| OK
 
| style="background: green; color: white"| OK
 
| style="background: green; color: white"| OK
 
| style="background: green; color: white"| OK
 
|-
 
| [https://goc.egi.eu/portal/index.php?Page_Type=View_Object&object_id=17610&grid_id=0 FZJ]
 
| style="background: green; color: white"| OK
 
| style="background: green; color: white"| OK
 
| bgcolor="orange"| WARN (failing due to SAM Perl library issue)
 
|
 
|-
 
| [https://goc.egi.eu/portal/index.php?Page_Type=View_Object&object_id=119660&grid_id=0 GRNET (HG-09-Okeanos-Cloud)]
 
| bgcolor="red"| NO_ENDPOINT
 
| bgcolor="red"| NO_ENDPOINT
 
| bgcolor="orange"| MISSING_INFO:
 
* set ServiceUrl according to instructions [[Fedcloud-tf:WorkGroups:Scenario5#GOCDB|below]]
 
|
 
|-
 
| [https://goc.egi.eu/portal/index.php?Page_Type=View_Object&object_id=124&grid_id=0 GWDG (GoeGrid)]
 
| style="background: green; color: white"| OK
 
| style="background: green; color: white"| OK
 
| style="background: green; color: white"| OK
 
| style="background: green; color: white"| OK
 
|-
 
| [https://goc.egi.eu/portal/index.php?Page_Type=View_Object&object_id=371&grid_id=0 IN2P3-CC]
 
| bgcolor="red"| NO_ENDPOINT
 
| bgcolor="red"| NO_ENDPOINT
 
| bgcolor="orange"| WARN (HTTP request failed: 500 Connect failed: connect: Connection refused; Connection refused)
 
|
 
|-
 
| [https://goc.egi.eu/portal/index.php?Page_Type=View_Object&object_id=120990&grid_id=0 INFN (INFN-IGI-CNAF-FedCloud)]
 
| style="background: green; color: white"| OK
 
| style="background: green; color: white"| OK
 
| style="background: green; color: white"| OK
 
|
 
|-
 
| [https://goc.egi.eu/portal/index.php?Page_Type=View_Object&object_id=60&grid_id=0 LAL (GRIF)]
 
| bgcolor="red"| NO_ENDPOINT
 
| bgcolor="red"| NO_ENDPOINT
 
| bgcolor="orange"| MISSING_INFO:
 
* set Monitored attribute to Y
 
* set ServiceUrl according to instructions [[Fedcloud-tf:WorkGroups:Scenario5#GOCDB|below]]
 
|
 
|}
 
 
 
=== GOCDB ===
 
 
 
The following service types were added to GOCDB:
 
* eu.egi.cloud.accounting
 
* eu.egi.cloud.information.bdii
 
* eu.egi.cloud.storage-management.cdmi
 
* eu.egi.cloud.vm-management.occi
 
* eu.egi.cloud.vm-metadata.marketplace
 
 
 
All RPs must enter their cloud service endpoints into GOCDB in order to enable integration with other operational tools.
 
 
 
The first step is to define the site to which the endpoints will belong. There are two possible options:
 
 
 
'''1.''' Register resources on an existing EGI site
 
* pre-reqs:
 
** the RP is associated with the existing site, and the team handling the existing grid services is the same as, or works closely with, the cloud team
 
** site's Certification Status is "Certified"
 
'''2.''' Register resources on a new site
 
* new site should have the following settings:
 
** Infrastructure: 'Test'
 
** Certification Status: 'Candidate'
 
* see the NGI_GRNET site for an example: https://goc.egi.eu/portal/index.php?Page_Type=View_Object&object_id=119660&grid_id=0
 
 
 
In both cases service endpoints should have the following flags set:
 
* Production: 'Y' or 'N', depending on the readiness of your resources (in both cases the site's availability/reliability will not be affected and no alarms will be raised in the Operations Portal)
 
* Beta: 'N'
 
* Monitored: 'Y'
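
The flag rules above can be checked mechanically before an endpoint is submitted. The following is a minimal sketch; the dictionary keys (<code>production</code>, <code>beta</code>, <code>monitored</code>) and the function name are hypothetical and are not GOCDB field names or API calls.

```python
def check_endpoint_flags(flags: dict) -> list:
    """Return a list of problems with the endpoint flags (empty if none).

    Rules taken from the list above: Production may be 'Y' or 'N',
    Beta must be 'N', Monitored must be 'Y'.
    """
    problems = []
    if flags.get("production") not in ("Y", "N"):
        problems.append("Production must be 'Y' or 'N'")
    if flags.get("beta") != "N":
        problems.append("Beta must be 'N'")
    if flags.get("monitored") != "Y":
        problems.append("Monitored must be 'Y'")
    return problems


# An endpoint that is not yet in production but is otherwise set up correctly.
problems = check_endpoint_flags({"production": "N", "beta": "N", "monitored": "Y"})
```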
 
 
 
Special rules apply for the following service types:
 
* eu.egi.cloud.accounting: serviceUrl field must contain the name of the site as defined on http://goc-accounting.grid-support.ac.uk/cloudtest/cloudsites.html (e.g. CESNET)
 
* eu.egi.cloud.vm-management.occi: serviceUrl field must contain the following info:
 
https://hostname:port/?image=<image_name>[&platform=openstack][&network=<network_name>]
 
Neither <image_name> nor <network_name> may contain spaces. An example for OpenStack is:
 
https://egi-cloud.zam.kfa-juelich.de:8788/?image=EGI-Demo&platform=openstack
 
and for OpenNebula (ON):
 
https://carach5.ics.muni.cz:10443/?image=EGI-Demo&network=EGI-Demo-Net
 
* eu.egi.cloud.storage-management.cdmi: serviceUrl field must contain the following info:
 
hostname:port
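
To show how the OCCI serviceUrl format above breaks down into its parts, here is a minimal parsing sketch using only the Python standard library. The function name and the shape of the returned dictionary are hypothetical; this is not how the SAM probes themselves read the field. It assumes the scheme is always <code>https</code> and the port is always given, as in the template above.

```python
from urllib.parse import urlsplit, parse_qs


def parse_occi_service_url(service_url: str) -> dict:
    """Split an OCCI serviceUrl into host, port, image, platform and network."""
    parts = urlsplit(service_url)
    if parts.scheme != "https" or not parts.hostname or parts.port is None:
        raise ValueError("expected https://hostname:port/?image=<image_name>")

    query = parse_qs(parts.query)
    image = query.get("image", [None])[0]
    platform = query.get("platform", [None])[0]  # optional, e.g. "openstack"
    network = query.get("network", [None])[0]    # optional

    if not image:
        raise ValueError("the image parameter is required")
    # parse_qs decodes '+' and '%20' to spaces, so encoded spaces are caught too.
    for name, value in (("image", image), ("network", network)):
        if value is not None and " " in value:
            raise ValueError(f"{name} must not contain spaces")

    return {
        "host": parts.hostname,
        "port": parts.port,
        "image": image,
        "platform": platform,
        "network": network,
    }


# The two examples given above.
openstack = parse_occi_service_url(
    "https://egi-cloud.zam.kfa-juelich.de:8788/?image=EGI-Demo&platform=openstack")
opennebula = parse_occi_service_url(
    "https://carach5.ics.muni.cz:10443/?image=EGI-Demo&network=EGI-Demo-Net")
```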
 
 
 
Further information about GOCDB can be found on the following page: [[GOCDB/Input_System_User_Documentation]].
 
 
 
=== SAM ===
 
 
 
A central SAM instance is deployed for monitoring cloud resources. Once the set of probes is fully defined, the probes will be included in the official SAM release. Once they are included in the official release, the central instance will be switched off.
 
 
 
The SAM instance is available at the following address: https://cloudmon.egi.eu/nagios.
 
 
 
The list of tests can be found here: https://cloudmon.egi.eu/poem/admin/poem/profile/1/.
 
 
 
== Technology ==
 
 
 
 
 
=== Nagios probes  ===
 
 
 
'''Who has the responsibility to develop probes?''' Following the EGI model, probes are developed by the Technology Providers and integrated into the monitoring framework by the EGI-JRA1 staff, who can also provide guidelines and templates during the initial phase of probe development.
 
 
 
'''Information''' on how to develop NAGIOS probes is available in the [https://tomtools.cern.ch/confluence/display/SAMDOC/Probes+Development SAM Development Guide].
 
 
 
'''The list of available probes''' within EGI is given in the [https://tomtools.cern.ch/confluence/display/SAMDOC/Released+Probes SAM Administrator Guide].
 
 
 
=== The EGI SAM&nbsp;System  ===
 
 
 
The SAM system is a framework consisting of:

* the Nagios monitoring system ([http://www.nagios.org/ www.nagios.org])
* custom databases for topology, probe descriptions and storing test results
* the MyWLCG/MyEGI web interface ([https://tomtools.cern.ch/confluence/display/SAM/MyWLCG https://tomtools.cern.ch/confluence/display/SAM/MyWLCG])

The probes used to check services are provided by the service developers. For EMI services, probes are provided by the EMI product teams; for Globus Toolkit, probes are provided by the IGE project, and so on. The SAM team only maintains the probes which test internal SAM functions (e.g. communication with the messaging system, database synchronization).
 
 
 
More information [[SAM|here]]
 
 
 
 
 
== The old DOC ==
 
 
 
=== The proposed approach  ===
 
 
 
The proposed approach is to have a central monitoring instance that runs probes on all the FedCloud resources and collects their output.
 
 
 
The central instance could be a full-blown SAM&nbsp;instance like those available through EGI ([https://wiki.egi.eu/wiki/SAM_Instances https://wiki.egi.eu/wiki/SAM_Instances]) or a simple NAGIOS box. This has to be decided.
 
 
 
<br>
 
 
 
The following steps need to be completed in order to have the approach established.
 
 
 
1. '''Identify the Central instance:''' One resource or technology provider needs to provide a machine (virtual or real) where the testing instance will be deployed. Based on our experience, 1 GB RAM, 1 CPU/core and at least 10 GB of disk are sufficient. This instance will be used to monitor all the clouds provided by resource providers. EGI-JRA1 will help with the installation of this SAM instance.
 
 
 
<br>
 
 
 
2. '''Creation of basic probes:''' for each technology/Resource provider a basic probe should be created to test:
 
 
 
*login functionality to the remote VM web interface (assuming that such web interface is available)
 
*pre-defined VM instantiation
 
 
 
The idea is to start defining a skeleton probe with one of the technology providers that will be used to create all other probes.
 
 
 
The characteristics of the pre-defined VM are to be discussed with the TPs.
 
 
 
This step can be split into sub-steps:
 
 
 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.a Identify a technology provider that volunteers to create the skeleton probe with the help of EGI-JRA1
 
 
 
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.b Use the skeleton probe to create probes for all the other TPs
 
 
 
<br>
 
 
 
3. '''Integration of probes:''' Once all the probes are created, they can be integrated into the NAGIOS or SAM&nbsp;instance, which can then start collecting status data about the FedCloud.
 
 
 
<br>
 
 
 
We estimate that these 3 initial steps can be accomplished before the '''end of February 2012'''.
 
 
 
<br>
 
 
 
This approach was implemented using pure Nagios. This Nagios instance is still available at the address: [https://test30.egi.cesga.es/nagios]. As soon as the SAM integration is finalized, the Nagios instance will be switched off.
 
 
 
 
 
=== Operative Steps ===
 
 
 
==== Step 1: Set up the Scenario5 group ====
 
 
 
Identify the group leader and collaborators: Done<br>
 
 
 
Send a kickoff mail to the fedcloud mailing list: Done
 
 
 
==== Step 2: Agree on the proposed approach ====
 
 
 
To be done in the coming meetings<br>
 
 
 
==== Step 3: Identify a Resource Provider that will host the central Nagios instance ====
 
 
 
Currently we have two volunteers: CESGA (Ivan and Alvaro) and GWDG (Kasprzak, Piotr). Both are to be contacted for confirmation.<br>
 
 
 
==== Step 4: Definition of probe tests  ====
 
 
 
*Ping of management interface
 
*Login into management interface
 
*Instantiation of VM
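
The three tests above could be combined into one probe along the following lines. This is a minimal sketch, not the skeleton probe discussed on this page. It relies on the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and implements the "ping" test as a plain TCP connection attempt, which is an assumption. The login and VM instantiation tests depend on each provider's interface, so they would have to be supplied as additional callables. All function names are hypothetical.

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def tcp_reachable(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(checks):
    """Run (name, callable) pairs in order and stop at the first failure.

    Each callable returns True on success. A failed check yields CRITICAL,
    an unexpected exception yields UNKNOWN.
    """
    for name, check in checks:
        try:
            passed = check()
        except Exception as exc:
            return UNKNOWN, f"UNKNOWN - {name} raised {exc!r}"
        if not passed:
            return CRITICAL, f"CRITICAL - {name} failed"
    return OK, "OK - all checks passed"


def main(argv):
    """Entry point: argv is [host, port]. Returns the Nagios exit code."""
    host, port = argv[0], int(argv[1])
    checks = [
        ("ping of management interface", lambda: tcp_reachable(host, port)),
        # Provider-specific login and VM instantiation checks would be
        # appended here in the same (name, callable) form.
    ]
    code, message = run_checks(checks)
    print(message)
    return code
```

A script built on this sketch would end with <code>sys.exit(main(sys.argv[1:]))</code> so that Nagios receives the exit code. The ordering matters because there is no point in attempting a login or a VM instantiation if the management interface cannot be reached.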
 
 
 
==== Step 5: Create a skeleton probe with a volunteer TP ====
 
 
 
GWDG (Kasprzak, Piotr) showed interest in developing the probe for their system <br>
 
 
 
==== Step 6: Advertise the skeleton probe and use it as a template to develop all other probes ====
 
 
 
{| cellspacing="1" cellpadding="1" border="1" style="width: 541px; height: 150px;"
 
|+ Available probes and info
 
|-
 
! scope="col" | VM Management
 
! scope="col" | Probe Avail.
 
! scope="col" | Probe link
 
! scope="col" | Notes
 
|-
 
| OpenNebula
 
| No
 
| -
 
| -
 
|-
 
| CloudSigma APIs
 
| No
 
| -
 
| -
 
|-
 
| OpenStack
 
| No
 
| -
 
| -
 
|-
 
| StratusLab
 
| No
 
| -
 
| -
 
|-
 
| Okeanos
 
| No
 
| -
 
| -
 
|-
 
| WNoDES
 
| No
 
| -
 
| -
 
|}
 
 
 
==== Step 7: Integrate the probes into the central NAGIOS system ====
 
 
 
== Further Resources  ==
 
 
 
[[SAM|The SAM system EGI wiki pages]]
 
 
 
[[Image:Flessr nagios probes.pdf|FleSSR Nagios Cloud Probes Document]] (Thanks to David Wallom)
 

Latest revision as of 12:09, 8 June 2015