The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can contact the EGI SDIS team at operations @ egi.eu.

Difference between revisions of "VO Service Availability Monitoring"

 
(109 intermediate revisions by 3 users not shown)
Line 1: Line 1:
__TOC__
{{Template:Op menubar}}
= Introduction =
{{Template:Doc_menubar}}
{{TOC_right}}
{{Template:Deprecated}}
<br />
 
= Introduction to VO SAM =


[[File:VOServicesWikiFig3.png|right]]
[[File:VOServicesWikiFig3.png|right]]


The current operations model forces that each NGI must deploy and operate their own [https://wiki.egi.eu/wiki/SAM '''Service Availability Monitoring (SAM)'''] system to monitor the fraction of EGI production infrastructure under their scope. SAM currently includes the following components:
== SAM as an NGI instance ==
The current operations model forces that each NGI must deploy and operate their own [[SAM |'''Service Availability Monitoring (SAM)''']] system to monitor the fraction of EGI production infrastructure under their scope. SAM currently includes the following components:
* a test execution framework (probes) based on the open source monitoring framework Nagios, and the Nagios Configuration Generator (NCG)
* a test execution framework (probes) based on the open source monitoring framework Nagios, and the Nagios Configuration Generator (NCG)
* the Aggregated Topology Provider (ATP), the Metrics Description Database (MDDB), and the Metrics Results Database (MRDB)
* the Aggregated Topology Provider (ATP), the Metrics Description Database (MDDB), and the Metrics Results Database (MRDB)
* the message bus to publish results and a programmatic interface
* the message bus to publish results and a programmatic interface
* the visualization portal (MyEGI)
* the visualization portal (MyEGI)
The full list of SAM instances across the EGI infrastructure can be consulted [https://wiki.egi.eu/wiki/SAM_Instances here])
The full list of SAM instances across the EGI infrastructure can be consulted [[SAM_Instances| here]])


<br />
<br />


Each SAM instance triggers the execution of probes in grid sites under their scope. The present [https://tomtools.cern.ch/confluence/display/SAMDOC/Grid+probes list of probes] includes:
Each SAM instance triggers the execution of probes in grid sites under their scope. The present [https://tomtools.cern.ch/confluence/display/SAM/Probes list of probes] includes:
* Job submission testing via [https://tomtools.cern.ch/confluence/display/SAMDOC/CE CE probe/metrics]: full job submission chain is exercised - job submission, states monitoring, output sand-box retreival, ...
* Job submission testing via [https://tomtools.cern.ch/confluence/display/SAM/CE CE probe/metrics]: full job submission chain is exercised - job submission, states monitoring, output sand-box retreival, ...
* Data managements testing via [https://tomtools.cern.ch/confluence/display/SAMDOC/SRM SRM probe/metrics]: get full SRM endpoint(s) and storage areas from BDII, copy a local file to the SRM into default space area(s), ...
* Data managements testing via [https://tomtools.cern.ch/confluence/display/SAM/SRM SRM probe/metrics]: get full SRM endpoint(s) and storage areas from BDII, copy a local file to the SRM into default space area(s), ...
* WN testing via [https://tomtools.cern.ch/confluence/display/SAMDOC/WN WN probe/metrics]: replica management tests (WN<->SE communication), ...
* WN testing via [https://tomtools.cern.ch/confluence/display/SAM/WN WN probe/metrics]: replica management tests (WN<->SE communication), ...
* WMS testing via [https://tomtools.cern.ch/confluence/display/SAMDOC/WMS WMS probe/metrics] using submissions to predefined CEs.
* WMS testing via [https://tomtools.cern.ch/confluence/display/SAM/WMS WMS probe/metrics] using submissions to predefined CEs.
* LFC testing via [https://tomtools.cern.ch/confluence/display/SAMDOC/LFC LFC probe/metrics]: read and update catalogue entries, ...  
* LFC testing via [https://tomtools.cern.ch/confluence/display/SAM/LFC LFC probe/metrics]: read and update catalogue entries, ...  
<br />
<br />


= VO monitoring =
<br />


[[File:VOServicesWikiFig4.png|right]]
[[File:VOServicesWikiFig4.png|right]]
== SAM as a VO instance ==


The main idea is to profit from the SAM service in use for EGI operations and adapted it to monitor the resources from a VO, or from multiple VOs, using the same SAM instance.  
The main idea is to profit from the SAM service in use for EGI operations and adapted it to monitor the resources from a VO, or from multiple VOs, using the same SAM instance.  
Line 43: Line 51:
<br />
<br />


= Instalation guide =
= General Information =
== General Information ==
== What is the VO SAM? ==
=== What is the VO SAM? ===
The VO SAM service is an adaptation of the operation SAM service used by NGIs. It is useful to monitor VO/VRC infrastructures within a given NGI or groups of NGIs.
The VO SAM service is an adaptation of the operation SAM service used by NGIs. It is useful to monitor VO/VRC infrastructures within a given NGI or groups of NGIs.


=== Who can run the VO SAM? ===
<br/>
 
== Who can run the VO SAM? ==
The VO SAM was delivered so that VRCs (or VOs associated to a VRC) could assume the operation of the service. This should be the most optimal scenario since it provides full independency for the VRC / VO to configure the service according to VO needs (as for example, integration of new VO specific probes). Nevertheless, the delegation of the service operation to a third party entity is not excluded.
The VO SAM was delivered so that VRCs (or VOs associated to a VRC) could assume the operation of the service. This should be the most optimal scenario since it provides full independency for the VRC / VO to configure the service according to VO needs (as for example, integration of new VO specific probes). Nevertheless, the delegation of the service operation to a third party entity is not excluded.


=== Which services interoperate with VO SAM?===
<br/>
 
== Which services interoperate with VO SAM?==
The VO SAM interoperates with the following middleware components which have to be declared at configuration time:
The VO SAM interoperates with the following middleware components which have to be declared at configuration time:
# WMS: For job submission tests
# WMS: For job submission tests
Line 67: Line 78:
<br />
<br />


= Instalation guide =
== System requirements ==
== System requirements ==
The VO SAM will be a service with high load peaks specially at job submission times. The system requirements have to be chosen according to the number of sites to be monitored under each VO, and the number of VO to be included in the same VO SAM box. Therefore, as a minimum requirement, we suggest:
The VO SAM will be a service with high load peaks specially at job submission times. The system requirements have to be chosen according to the number of sites to be monitored under each VO, and the number of VO to be included in the same VO SAM box. Therefore, as a minimum requirement, we suggest:
Line 83: Line 95:
* open ports needed by the services
* open ports needed by the services
* cron jobs scheduled to execution in the SAM box
* cron jobs scheduled to execution in the SAM box
please consult the <big>[https://tomtools.cern.ch/confluence/display/SAMDOC/Service+reference+card+-+egee-NAGIOS '''SAM service reference card''']</big>
please consult the <big>[https://tomtools.cern.ch/confluence/display/SAMDOC/Nagios+Reference+Card'''SAM service reference card''']</big>




Line 91: Line 103:
=== General guidelines===
=== General guidelines===


Follow the instalation / configuration guidelines depicted in the [https://tomtools.cern.ch/confluence/display/SAMDOC/Installing+SAM+with+YAIM <big>'''Installing SAM with YAIM documentation'''</big>]
* For a fresh installation, please check the instalation / configuration guidelines depicted in the [https://tomtools.cern.ch/confluence/display/SAMDOC/Installing+SAM <big>'''Installing SAM '''</big>] web page.
<!-- Now this file was update and shows only for update 10 which is in SR -->
* For upgrades, also check the [https://tomtools.cern.ch/confluence/display/SAMDOC/Release+Notes SAM Release Notes] web page.
* Before actually onfiguring the service with YAIM, you should apply the VO specific guidelines presented in the next sections.
* A brief explanation of all the YAIM variables used to configure SAM is available in [https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgYaim#YAIM_s_site_info_def_File SAM YAIM variable explanations].


SAM production release is presently on Update 7. To understand which YAIM configuration variables one has to set in the YAIM site-info.def file, please check:
<br />


# [https://tomtools.cern.ch/confluence/display/SAMDOC/Update-05 SAM release Update 05 Release notes]
=== Installation notes ===
# [https://tomtools.cern.ch/confluence/display/SAMDOC/Update-06 SAM release Update 06 Release notes]
* During installation, you need to use yum-priorities in order to download the proper package version from the correct repository. Please check carefully the repositories priorities according to SAM version you are using.
# [https://tomtools.cern.ch/confluence/display/SAMDOC/Update-07 SAM release Update 07 Release notes]


# [https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgYaim#YAIM_s_site_info_def_File SAM YAIM variable explanations]
* Package installation sequence:
<!--  
$ yum install lcg-CA
# [https://tomtools.cern.ch/confluence/display/SAMDOC/Update-09 SAM release Update 09 Release notes]
$ yum install httpd
# [https://tomtools.cern.ch/confluence/display/SAMDOC/Update-08 SAM release Update 08 Release notes] -->
$ yum --exclude=\*saga\* --exclude=\*SAGA\* groupinstall 'glite-UI (production - x86_64)'
$ yum install egee-NAGIOS


Before configuring the service with YAIM, you should also apply the VO specific guidelines presented in the next sections.
<br />


=== Installation guidelines===
=== Upgrade notes ===
* Note that after upgrade and before YAIM execution, atp_synchro.conf.rpmnew and glite-info-service-nagios.conf.rpmnew configuration files should replace the existing ones.
$ mv /etc/atp/atp_synchro.conf.rpmnew /etc/atp/atp_synchro.conf
$ mv /opt/glite/etc/glite-info-service-nagios.conf.rpmnew /opt/glite/etc/glite-info-service-nagios.conf


* In the glite-info-service-nagios.conf.rpmnew, one must ensure the the new file does include the following line
get_data = echo -e "Role=${NAGIOS_ROLE}\nMsgNagiosDestination="$(. /etc/sysconfig/msg-to-queue && echo $MSG_TO_QUEUE_DESTINATION)"\nVersion="$(cat /etc/sam-release)
<br />


=== VO specific guidelines ===
=== VO specific guidelines ===
To monitor all the resources under the scope of a VO, you have to properly set the following variables in your YAIM configuration file:
==== VO Topology generation: ATP vs LDAP ====
* '''General VO definitions'''
There are two alternative ways to configure a VO SAM.
# Use [http://grid-monitoring.cern.ch/atp Aggregated Topology Provider] as an input for Nagios Configuration Generator(NCG) to get the sites and services. ATP is part of the ROC/NGI nagios package which aggregate information from GOCDB, Top BDII and VO feed; and it is single authoritative information source with topology information.
# Use LDAP as a topology provider for Nagios Configuration Generator(NCG)
<br />
 
* ATP is now the preferred option for the topology generatation, and it can be switch on defining
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=false
NCG_TOPOLOGY_USE_SAM=false
NCG_TOPOLOGY_USE_ATP=true
 
After running yaim, your /etc/ncg/ncg.conf should contain the following entries under the SiteInfo block:
<NCG::SiteInfo>
  <ATP>
    ATP_ROOT_URL="http://grid-monitoring.cern.ch/atp"
  </ATP>
  (...)
</NCG::SiteInfo>
 
* To use LDAP, you should switch off all topology providers except for LDAP, and be sure that the SiteInfo block in
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=true
NCG_TOPOLOGY_USE_SAM=false
NCG_TOPOLOGY_USE_ATP=false
# if LDAP topology is used, control adding hosts (see SAM-1470) (optional variable)
NCG_LDAP_ADD_HOSTS=1
and ensure that similar entries are defined in /etc/ncg/ncg.conf
<NCG::SiteInfo>
  <LDAP>
    LDAP_ADDRESS=<Your TopBDII you want to use>
    ADD_HOSTS=1
    VO_FILTER=<Your VO>
  </LDAP>
  (...)
</NCG::SiteInfo>
Afterwards restart ncg service.
 
==== Yaim configuration file example (for phys.vo.ibergrid.eu VO) ====
###
### Generic definitions for some of the core services
SITE_NAME=NCG-INGRID-PT
SITE_BDII_HOST=sbdii01.ncg.ingrid.pt
PX_HOST=px01.ncg.ingrid.pt
BDII_HOST=topbdii01.ncg.ingrid.pt
###
### VO Definitions
  # List of VOs to support  
  # List of VOs to support  
  VOS="vo1"
  VOS="phys.vo.ibergrid.eu"
  # VOMS server definition for vo1
  # VOMS server definition for vo1
  VO_vo1_VOMS_SERVERS="'vomss://voms.my.domain:8443/voms/vo1?/vo1/'"
  VO_PHYS_VO_IBERGRID_EU_VOMS_SERVERS="'vomss://voms01.ncg.ingrid.pt:8443/voms/phys.vo.ibergrid.eu?/phys.vo.ibergrid.eu/'"
  # VOMSES server definition for vo1
  # VOMSES server definition for vo1
  VO_vo1_VOMSES="'vo1 voms.my.domain 15001 /C=Country/O=Ca/O=Institution/OU=Department/CN=voms.my.domain vo1'"
  VO_PHYS_VO_IBERGRID_EU_VOMSES="'phys.vo.ibergrid.eu voms01.ncg.ingrid.pt 40007 /C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=voms01.ncg.ingrid.pt phys.vo.ibergrid.eu'"
# DN of the CA which issued the VOMS Certificate
VO_PHYS_VO_IBERGRID_EU_VOMS_CA_DN="/C=PT/O=LIPCA/CN=LIP Certification Authority"
# WMS used to submit jobs to vo1
VO_PHYS_VO_IBERGRID_EU_WMS_HOSTS="wms01.ncg.ingrid.pt"
   
   
  # DN of the CA which issued the VOMS Certificate
  ###
  VO_vo1_VOMS_CA_DN="/C=Country/O=CA/CN=Certification Authority"
### LFC and SE definitions for the data management tests
  JOBSUBMIT_WN_LFC=lfc01.ncg.ingrid.pt
JOBSUBMIT_WN_SE_REP=se01-tic.ciemat.es
   
   
  # WMS used to submit jobs to vo1
  ###
  VO_vo1_WMS_HOSTS="wms01.ncg.ingrid.pt"
  ### Nagios
 
NAGIOS_ADMIN_DNS="/C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=Goncalo Borges"
* '''Specific SAM VO definitions '''
NCG_NAGIOS_ADMIN=goncalo@lip.pt
NAGIOS_HOST=nagios02.ncg.ingrid.pt
  # Nagios is acting on a VO  
  # Nagios is acting on a VO  
  NAGIOS_ROLE=vo
  NAGIOS_ROLE=vo
   
  NCG_PROBES_TYPE=local
  # List of VOs the tests should run as. You must have a member of each VO willing to store a proxy for your retrieval.  
  # List of VOs the tests should run as.  
  NCG_VO="vo1"
  NCG_VO="phys.vo.ibergrid.eu"
  # Do not show hosts without services associated  
  # Do not show hosts without services associated  
  NCG_INCLUDE_EMPTY_HOSTS=0
  NCG_INCLUDE_EMPTY_HOSTS=0
NAGIOS_HTTPD_ENABLE_CONFIG=true
NAGIOS_NCG_ENABLE_CONFIG=true
NAGIOS_SUDO_ENABLE_CONFIG=true
NAGIOS_NAGIOS_ENABLE_CONFIG=true
NAGIOS_CGI_ENABLE_CONFIG=true
NAGIOS_NSCA_PASS="xxxxxx"
NAGIOS_NCG_ENABLE_CRON=true
COUNTRY_NAME=Portugal
   
   
###
### NCG configurations
  # list all the NGI/ROCs
  # list all the NGI/ROCs
  NCG_GOCDB_ROC_NAME="NGI_IBERGRID NGI_NL ..."
  NCG_GOCDB_ROC_NAME=NGI_IBERGRID
 
NCG_TOPOLOGY_USE_SAM=false
* '''Monitor more than one VO'''
NCG_TOPOLOGY_USE_GOCDB=false
Include a white space separated VO list for
NCG_TOPOLOGY_USE_ENOC=false
  VO="vo1 vo2 vo3"
NCG_TOPOLOGY_USE_LDAP=false
NCG_TOPOLOGY_USE_ATP=true
NCG_TOPOLOGY_ATP_ROOT_URL="http://grid-monitoring.cern.ch/atp"
NCG_REMOTE_USE_SAM=false
NCG_REMOTE_USE_NAGIOS=false
NCG_REMOTE_USE_ENOC=false
NCG_MDDB_SUPPORTED_PROFILES="ROC,ROC_CRITICAL,ROC_OPERATORS"
###
### DB data
MYSQL_ADMIN="xxxxxx"
DB_PASS="xxxxxx"
DB_TYPE=mysql
DB_USER=mrs
DB_NAME=mrs
DB_HOST="localhost"
###
### MyEGI
MYEGI_ADMIN_NAME="Goncalo Borges"
MYEGI_ADMIN_EMAIL="goncalo@lip.pt"
MYEGI_DEFAULT_PROFILE="ROC"
MYEGI_DEBUG="true"
MYEGI_DATABASE_PASSWORD="xxxxxx"
  MYEGI_REGION="NGI_IBERGRID"
   
   
NCG_VO="vo1 vo2 vo3"
and insert the information depicted in the previous '''General VO definition''' section for all VOs.
<br />
<br />


== VO specific probes ==
= VO specific probes =
Right now it is pretty tricky to include VO specific probes. The first step is to learn how a nagios probe must be developed. A tutorial is available here:
Right now it is pretty tricky to include VO specific probes. The first step is to learn how a nagios probe must be developed. A tutorial is available here:
* [https://svnweb.cern.ch/trac/sam/browser/trunk/demo-probe <big>'''Tutorial to build nagios probes'''</big>]
* [https://svnweb.cern.ch/trac/sam/browser/trunk/demo-probe <big>'''Tutorial to build nagios probes'''</big>]


Afterwards, the VO would have to define their own profile in /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash_local.pm in the same way as other profiles exist in /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash.pm.  
Afterwards, the VO would have to define their own profile in '''/usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash_local.pm''' in the same way as other profiles exist in /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash.pm. To understand the atributes, flags and static file rules which are used in previous perl module, please consult:
* [https://tomtools.cern.ch/confluence/display/SAM/NCG <big>'''SAM NCG wiki page'''</big>]


Finally, in the Yaim configuration you would need to point to the new profile:
Finally, in the Yaim configuration you would need to point to the new profile:
Line 164: Line 270:
  NCG_PROFILE_FQAN_customprofileA=vo2
  NCG_PROFILE_FQAN_customprofileA=vo2
  NCG_PROFILE_FQAN_customprofileB=vo2
  NCG_PROFILE_FQAN_customprofileB=vo2
<br />
== Frequently Asked Questions & Troubleshooting ==
===Can I start 2 different proxies to submit jobs to the different VOs?===
Yes. You can have different proxy for each VO. Just use different user certificate when creating MyProxy credential. For example:
# For vo1
$ export X509_USER_CERT=~/.globus/usercert-vo1.pem
$ export X509_USER_KEY=~/.globus/userkey-vo1.pem
$ myproxy-init -l nagios -s $PX_HOST -k NagiosRetrieve-NAGIOS_HOSTNAME-vo1 -c 1000 -x -Z "NAGIOS_HOSTNAME_DN"
# For vo2
$ export X509_USER_CERT=~/.globus/usercert-vo2.pem
$ export X509_USER_KEY=~/.globus/userkey-vo2.pem
$ myproxy-init -l nagios -s $PX_HOST -k NagiosRetrieve-NAGIOS_HOSTNAME-vo2 -c 1000 -x -Z "NAGIOS_HOSTNAME_DN"
Same principle applies to any other VO supported by that instance. Of course you can use the same user cert if it is member of multiple VOs. Easier solution would of course be to use robot certificates. Robot certs will be supported in the next release (https://tomtools.cern.ch/jira/browse/SAM-952).
=== Can a catch-all VO SAM provide a dedicated VO view? ===
Nagios web interface was never about obvious presentation. However, there is the service group view where NCG generates service group aggregating all VO dependent checks for each VO. For example:
*[https://nagios01.ncg.ingrid.pt/nagios/cgi-bin/status.cgi?servicegroup=phys.vo.ibergrid.eu&style=detail <big>'''phys.vo.ibergrid.eu view in nagios01.ncg.ingrid.pt'''</big>]
=== Can I configure VO SAM to use a unique LFC and central SE for all VOs? ===
Yes. Include the following definitions in your YAIM configuration variables. Implicitly there is the assumption that the unique LFC and central SE do support all monitored VOs.
# LFC and SE definitions
JOBSUBMIT_WN_LFC=lfc-allvos.my.domain
JOBSUBMIT_WN_SE_REP=se-allvos.my.domain
=== Can I configure VO dependent LFCs and central SEs in a VO SAM? ===
There is a way to do this, though slightly more complicated. Make sure that you don't have line like anywhere in localdb:
/etc/ncg/ncg-localdb.d/jobsubmit:MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--wn-lfc!lfc.my.domain
/etc/ncg/ncg-localdb.d/jobsubmit:MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--wn-lfc!lfc.my.domain
/etc/ncg/ncg-localdb.d/jobsubmit:MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--wn-se-rep!se.my.domain
/etc/ncg/ncg-localdb.d/jobsubmit:MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--wn-se-rep!se.my.domain
anywhere in /etc/ncg/*localdb*, and put the following instead:
VO_ATTRIBUTE!vo1!WN_SE_REP!se-vo1.my.domain
VO_ATTRIBUTE!vo2!WN_SE_REP!se-vo2.my.domain
MODIFY_METRIC_ATTRIBUTE!org.sam.CE-JobState!WN_SE_REP!--wn-se-rep
VO_ATTRIBUTE!vo1!WN_LFC!lfc-vo1.my.domain
VO_ATTRIBUTE!vo2!WN_LFC!lfc-vo2.my.domain
MODIFY_METRIC_ATTRIBUTE!org.sam.CE-JobState!WN_LFC!--wn-lfc
=== How can I provide access to a VO member? ===
At configuration time, and dependending of the NCG_ROLE selected, different users may not have the same permissions to access to the VO SAM services. To enable permission of a given user,  one can add the user DN to /etc/voms2htpasswd-static.d/YAIM-ops-monitor.conf
=== Additional guidance ===
For further questions/problems not addressed in this document, please consult:
*[https://tomtools.cern.ch/confluence/display/SAMDOC/SAM+Administrators+FAQ '''<big>SAM Administrator FAQs</big>''']
*[https://tomtools.cern.ch/confluence/display/SAMDOC/Installing+SAM+with+YAIM#InstallingSAMwithYAIM-Problems <big>'''Problems installing SAM with YAIM</big>''']
*[https://tomtools.cern.ch/confluence/display/SAMDOC/Troubleshooting+Guide <big>'''SAM Troubleshooting guide'''</big>]
<br />
== A working example ==
* [https://nagios01.ncg.ingrid.pt/nagios <big>'''phys.vo.ibergrid.eu NAGIOS instance'''</big>]
<br />
== Future improvements ==
* In the next SAM release, ncg will automatically go over the whole infrastructure and look for nodes which support defined VOs.
* Define a VO profile with VO dependent only metrics. Please check [https://tomtools.cern.ch/jira/browse/SAM-1178 SAM-1178]
<br />
== Known issues ==
* MyEGI is not working properly using the NCG_ROLE=vo.
* If you use the ATP for topology generation as recomended in UPDATE 07, you will get some problems. Set up your YAIM configuration files to use SAM as topology generator.
NCG_TOPOLOGY_USE_SAM=true
However, in this case, you will have to ask for permission to access the information produced by SAM PI. See for example [https://gus.fzk.de/ws/ticket_info.php?ticket=66314 GGUS ticket 66314].


<br />
<br />
Line 245: Line 282:
= Additional references =
= Additional references =
* [http://www.nagios.org/ <big>'''Nagios'''</big>]
* [http://www.nagios.org/ <big>'''Nagios'''</big>]
<br />
=[[FAQ_VO_Service_Availability_Monitoring|FAQ VO Service Availability Monitoring]]=
[[Category:Operations Documentation]]
[[Category:SAM]]

Latest revision as of 07:47, 13 July 2016




This article is deprecated and should no longer be used, but remains available for reference.



Introduction to VO SAM

[Figure: VOServicesWikiFig3.png]

SAM as an NGI instance

The current operations model requires each NGI to deploy and operate its own Service Availability Monitoring (SAM) system to monitor the part of the EGI production infrastructure under its scope. SAM currently includes the following components:

  • a test execution framework (probes) based on the open source monitoring framework Nagios, and the Nagios Configuration Generator (NCG)
  • the Aggregated Topology Provider (ATP), the Metrics Description Database (MDDB), and the Metrics Results Database (MRDB)
  • the message bus to publish results and a programmatic interface
  • the visualization portal (MyEGI)

The full list of SAM instances across the EGI infrastructure is available on the SAM Instances wiki page.


Each SAM instance triggers the execution of probes at the grid sites under its scope. The current list of probes includes:

  • Job submission testing via CE probe/metrics: the full job submission chain is exercised (job submission, state monitoring, output sandbox retrieval, ...)
  • Data management testing via SRM probe/metrics: get the full SRM endpoint(s) and storage areas from the BDII, copy a local file to the SRM into the default space area(s), ...
  • WN testing via WN probe/metrics: replica management tests (WN<->SE communication), ...
  • WMS testing via WMS probe/metrics, using submissions to predefined CEs.
  • LFC testing via LFC probe/metrics: read and update catalogue entries, ...



[Figure: VOServicesWikiFig4.png]

SAM as a VO instance

The main idea is to take advantage of the SAM service in use for EGI operations and adapt it to monitor the resources of a VO, or of multiple VOs, using the same SAM instance.


One of the most obvious advantages of this service is that a VO can develop and integrate its own probes. EGI operations teams test and monitor the status of resources through generic tests, which may be insufficient for certain communities. This approach allows those communities to define custom test suites and insert them in their SAM system.


In order to accomplish this multi-VO monitoring role, the SAM service instance has to be properly adapted:

  • the topology generation has to change so that the resources to be tested are properly configured. The difference with respect to the service used in operations is that VO resources may not be restricted to a single region, and may be spread across the whole EGI infrastructure.
  • the services which interoperate with the SAM services (the WMS used to submit jobs, the default SRM used to replicate files, ...) have to be properly configured to support those VOs


The following sections describe how to install and properly configure a Service Availability Monitoring system for VOs.


General Information

What is the VO SAM?

The VO SAM service is an adaptation of the operations SAM service used by NGIs. It is useful for monitoring VO/VRC infrastructures within a given NGI or group of NGIs.


Who can run the VO SAM?

The VO SAM was delivered so that VRCs (or VOs associated with a VRC) could take over the operation of the service. This is the preferred scenario since it gives the VRC / VO full independence to configure the service according to VO needs (for example, integrating new VO specific probes). Nevertheless, delegating the operation of the service to a third party is not excluded.


Which services interoperate with VO SAM?

The VO SAM interoperates with the following middleware components which have to be declared at configuration time:

  1. WMS: For job submission tests
  2. central SE: for data replica tests
  3. MyProxy service: For renewal of VO credentials

The WMS and the central SE must support the VOs in question. The best case scenario is that the VO uses dedicated instances of these services for its SAM system, since Nagios tests will induce high load peaks. The alternative is to use services at the disposal of the VO but shared by all VO users (and probably by other VOs as well). The endpoints of those services can be obtained through information system queries from any user interface:

# lcg-infosites --vo <VO NAME> wms
# lcg-infosites --vo <VO NAME> se
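
When several VOs are monitored by the same instance, the two queries above have to be repeated for each of them. The following is a minimal sketch only: `print_vo_service_queries` is a hypothetical helper name (not part of SAM), and `vo1` / `vo2` are placeholder VO names. It prints the commands so that they can be reviewed before being run on a user interface.

```shell
# Hypothetical helper (not part of SAM): print the information system
# queries shown above for each VO given as an argument.
print_vo_service_queries() {
    for vo in "$@"; do
        for service in wms se; do
            echo "lcg-infosites --vo $vo $service"
        done
    done
}

# Example: review the queries for two placeholder VO names.
print_vo_service_queries vo1 vo2
```

On a configured user interface, the printed lines could then be executed one by one.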

The MyProxy service is used to store and renew the credential of the user sending the Nagios jobs. Starting from Update-09 (still under Staged Rollout), SAM supports the use of robot certificates instead of MyProxy credentials. This is an optional feature which can be used only if your CA and VO support robot certificates. If they do, we suggest switching to robot certificates, as they are easier to maintain. Robot certificates also provide better availability, as SAM then does not depend on the availability of the MyProxy server.

If the VO is unable to provide its own instances of those services, the VRC may rely on the formal links established between VRCs and NGIs, which are contemplated by the VRC approval process and agreed at VRC approval time. This provides direct opportunities to find service providers for VRCs, and for any VO associated with a VRC.


Installation guide

System requirements

The VO SAM is a service with high load peaks, especially at job submission times. The system requirements have to be chosen according to the number of sites to be monitored under each VO and the number of VOs included in the same VO SAM box. As a minimum requirement, we suggest:

  1. Scientific Linux 5 64 bit OS
  2. 4 GB RAM
  3. a quad-core processor, recommended to better handle parallel submissions

The system requirements may increase according to the VO infrastructure to be monitored.


SAM service reference card

To get an overview of the service, in terms of

  • individual services running on the SAM box
  • configuration files for each service
  • log files for each service
  • open ports needed by the services
  • cron jobs scheduled for execution on the SAM box

please consult the SAM service reference card (https://tomtools.cern.ch/confluence/display/SAMDOC/Nagios+Reference+Card).



Install and configure SAM with YAIM

General guidelines

  • For a fresh installation, please check the installation / configuration guidelines described in the Installing SAM web page.
  • For upgrades, also check the SAM Release Notes web page.
  • Before actually configuring the service with YAIM, you should apply the VO specific guidelines presented in the next sections.
  • A brief explanation of all the YAIM variables used to configure SAM is available in SAM YAIM variable explanations.


Installation notes

  • During installation, you need to use yum-priorities in order to download the proper package version from the correct repository. Please check the repository priorities carefully according to the SAM version you are using.
  • Package installation sequence:
$ yum install lcg-CA
$ yum install httpd
$ yum --exclude=\*saga\* --exclude=\*SAGA\* groupinstall 'glite-UI (production - x86_64)'
$ yum install egee-NAGIOS
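
Before running the installation sequence above, it can help to review which priorities are currently set in the repository files. The following is a minimal sketch, assuming the repository definitions live in `/etc/yum.repos.d` and set their priority with a `priority=` line; `show_repo_priorities` is a hypothetical helper name, not a SAM or yum tool.

```shell
# Hypothetical helper: list the "priority=" settings found in the yum
# repository files of a directory (default: /etc/yum.repos.d).
# Output format: <repo file>:priority=<N>
show_repo_priorities() {
    repo_dir="${1:-/etc/yum.repos.d}"
    grep -H '^priority' "$repo_dir"/*.repo 2>/dev/null
}

# Usage on the SAM box:
#   show_repo_priorities
```

Repository files that do not appear in the output have no explicit priority line and are worth checking against the SAM release notes.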


Upgrade notes

  • Note that after the upgrade and before running YAIM, the atp_synchro.conf.rpmnew and glite-info-service-nagios.conf.rpmnew configuration files should replace the existing ones.
$ mv /etc/atp/atp_synchro.conf.rpmnew /etc/atp/atp_synchro.conf
$ mv /opt/glite/etc/glite-info-service-nagios.conf.rpmnew /opt/glite/etc/glite-info-service-nagios.conf
  • Make sure that the new glite-info-service-nagios.conf (delivered as glite-info-service-nagios.conf.rpmnew) includes the following line:
get_data = echo -e "Role=${NAGIOS_ROLE}\nMsgNagiosDestination="$(. /etc/sysconfig/msg-to-queue && echo $MSG_TO_QUEUE_DESTINATION)"\nVersion="$(cat /etc/sam-release)
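
The replacement steps above can be wrapped so that the previous file is kept and the new file is checked before it is moved into place. This is a sketch under stated assumptions: `install_rpmnew` is a hypothetical helper, and the `.bak` backup copy is an addition for safety that the original procedure does not mention.

```shell
# Hypothetical helper: replace a configuration file with its .rpmnew
# counterpart, keeping a backup (<file>.bak) of the previous version.
# Optional second argument: a grep pattern that must match in the new file.
install_rpmnew() {
    conf="$1"
    required="${2:-}"
    new="$conf.rpmnew"
    [ -f "$new" ] || { echo "no $new, nothing to do"; return 0; }
    if [ -n "$required" ] && ! grep -q "$required" "$new"; then
        echo "$new does not contain: $required" >&2
        return 1
    fi
    # Keep the old file (assumption: a backup is wanted), then replace it.
    [ -f "$conf" ] && cp -p "$conf" "$conf.bak"
    mv "$new" "$conf"
}

# Usage on the SAM box (paths taken from the upgrade notes above):
#   install_rpmnew /etc/atp/atp_synchro.conf
#   install_rpmnew /opt/glite/etc/glite-info-service-nagios.conf '^get_data = '
```

If the required line is missing, the helper leaves both files untouched so the new file can be corrected by hand first.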


VO specific guidelines

VO Topology generation: ATP vs LDAP

There are two alternative ways to generate the topology for a VO SAM.

  1. Use the Aggregated Topology Provider (ATP) as input for the Nagios Configuration Generator (NCG) to get the sites and services. ATP is part of the ROC/NGI Nagios package; it aggregates information from GOCDB, the Top BDII and VO feeds, and is the single authoritative source of topology information.
  2. Use LDAP as the topology provider for the Nagios Configuration Generator (NCG).


  • ATP is now the preferred option for topology generation. It can be switched on by defining:
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=false
NCG_TOPOLOGY_USE_SAM=false
NCG_TOPOLOGY_USE_ATP=true

After running yaim, your /etc/ncg/ncg.conf should contain the following entries under the SiteInfo block:

<NCG::SiteInfo>
  <ATP>
    ATP_ROOT_URL="http://grid-monitoring.cern.ch/atp"
  </ATP>
  (...)
</NCG::SiteInfo>
  • To use LDAP, switch off all topology providers except LDAP in the YAIM configuration:
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=true
NCG_TOPOLOGY_USE_SAM=false
NCG_TOPOLOGY_USE_ATP=false
# if LDAP topology is used, control adding hosts (see SAM-1470) (optional variable)
NCG_LDAP_ADD_HOSTS=1

and ensure that similar entries are defined in /etc/ncg/ncg.conf

<NCG::SiteInfo>
 <LDAP>
   LDAP_ADDRESS=<Your TopBDII you want to use>
   ADD_HOSTS=1
   VO_FILTER=<Your VO>
 </LDAP>
  (...)
</NCG::SiteInfo>

Afterwards, restart the ncg service.
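
Both topology settings can be checked quickly before and after running YAIM. The sketch below assumes the YAIM variables sit one per line in a plain text file; `enabled_topology_providers` and `ncg_conf_has_block` are hypothetical helper names, not SAM tools.

```shell
# Hypothetical helpers for checking the topology settings.

# Print the name of every NCG_TOPOLOGY_USE_* variable set to "true"
# in a YAIM configuration file (for example: ATP or LDAP).
enabled_topology_providers() {
    grep -E '^[[:space:]]*NCG_TOPOLOGY_USE_[A-Z]+=true' "$1" |
        sed -e 's/^[[:space:]]*NCG_TOPOLOGY_USE_//' -e 's/=true.*$//'
}

# Succeed if the given block (ATP or LDAP) is present in an ncg.conf file.
ncg_conf_has_block() {
    grep -q "<$2>" "$1"
}

# Usage (the path to site-info.def depends on the local setup):
#   enabled_topology_providers /path/to/site-info.def
#   ncg_conf_has_block /etc/ncg/ncg.conf ATP && echo "ATP block present"
```

Exactly one provider name is expected in the output of the first helper, matching the block found in /etc/ncg/ncg.conf.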

Yaim configuration file example (for phys.vo.ibergrid.eu VO)

###
### Generic definitions for some of the core services
SITE_NAME=NCG-INGRID-PT
SITE_BDII_HOST=sbdii01.ncg.ingrid.pt
PX_HOST=px01.ncg.ingrid.pt
BDII_HOST=topbdii01.ncg.ingrid.pt

###
### VO Definitions
# List of VOs to support 
VOS="phys.vo.ibergrid.eu"
# VOMS server definition for phys.vo.ibergrid.eu
VO_PHYS_VO_IBERGRID_EU_VOMS_SERVERS="'vomss://voms01.ncg.ingrid.pt:8443/voms/phys.vo.ibergrid.eu?/phys.vo.ibergrid.eu/'"
# VOMSES server definition for phys.vo.ibergrid.eu
VO_PHYS_VO_IBERGRID_EU_VOMSES="'phys.vo.ibergrid.eu voms01.ncg.ingrid.pt 40007 /C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=voms01.ncg.ingrid.pt phys.vo.ibergrid.eu'"
# DN of the CA which issued the VOMS Certificate
VO_PHYS_VO_IBERGRID_EU_VOMS_CA_DN="/C=PT/O=LIPCA/CN=LIP Certification Authority"
# WMS used to submit jobs for phys.vo.ibergrid.eu
VO_PHYS_VO_IBERGRID_EU_WMS_HOSTS="wms01.ncg.ingrid.pt"

###
### LFC and SE definitions for the data management tests
JOBSUBMIT_WN_LFC=lfc01.ncg.ingrid.pt
JOBSUBMIT_WN_SE_REP=se01-tic.ciemat.es

###
### Nagios
NAGIOS_ADMIN_DNS="/C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=Goncalo Borges"
NCG_NAGIOS_ADMIN=goncalo@lip.pt
NAGIOS_HOST=nagios02.ncg.ingrid.pt
# Nagios is acting on a VO 
NAGIOS_ROLE=vo
NCG_PROBES_TYPE=local
# List of VOs the tests should run as. 
NCG_VO="phys.vo.ibergrid.eu"
# Do not show hosts without services associated 
NCG_INCLUDE_EMPTY_HOSTS=0
NAGIOS_HTTPD_ENABLE_CONFIG=true
NAGIOS_NCG_ENABLE_CONFIG=true
NAGIOS_SUDO_ENABLE_CONFIG=true
NAGIOS_NAGIOS_ENABLE_CONFIG=true
NAGIOS_CGI_ENABLE_CONFIG=true
NAGIOS_NSCA_PASS="xxxxxx" 
NAGIOS_NCG_ENABLE_CRON=true

COUNTRY_NAME=Portugal

###
### NCG configurations 
# list all the NGI/ROCs
NCG_GOCDB_ROC_NAME=NGI_IBERGRID
NCG_TOPOLOGY_USE_SAM=false
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=false
NCG_TOPOLOGY_USE_ATP=true
NCG_TOPOLOGY_ATP_ROOT_URL="http://grid-monitoring.cern.ch/atp"
NCG_REMOTE_USE_SAM=false
NCG_REMOTE_USE_NAGIOS=false
NCG_REMOTE_USE_ENOC=false
NCG_MDDB_SUPPORTED_PROFILES="ROC,ROC_CRITICAL,ROC_OPERATORS"

###
### DB data
MYSQL_ADMIN="xxxxxx"
DB_PASS="xxxxxx"
DB_TYPE=mysql
DB_USER=mrs
DB_NAME=mrs
DB_HOST="localhost"

###
### MyEGI
MYEGI_ADMIN_NAME="Goncalo Borges"
MYEGI_ADMIN_EMAIL="goncalo@lip.pt"
MYEGI_DEFAULT_PROFILE="ROC"
MYEGI_DEBUG="true"
MYEGI_DATABASE_PASSWORD="xxxxxx"
MYEGI_REGION="NGI_IBERGRID"
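
To monitor more than one VO, the per-VO variables in the example above have to be repeated for each VO, and VOS and NCG_VO become white space separated lists. The following sketch prints empty stubs to fill in. It assumes that the variable prefix is obtained by upper-casing the VO name and replacing dots with underscores, which is inferred only from the phys.vo.ibergrid.eu example; `yaim_vo_prefix` and `print_vo_stubs` are hypothetical helper names.

```shell
# Hypothetical helper: derive the YAIM variable prefix for a VO name by
# upper-casing it and replacing dots with underscores, as in the example
# above (phys.vo.ibergrid.eu -> VO_PHYS_VO_IBERGRID_EU).
yaim_vo_prefix() {
    echo "VO_$(echo "$1" | tr '[:lower:]' '[:upper:]' | tr '.' '_')"
}

# Print the VO lists and empty per-VO variable stubs for each VO given
# as an argument. The values still have to be filled in by hand.
print_vo_stubs() {
    echo "VOS=\"$*\""
    echo "NCG_VO=\"$*\""
    for vo in "$@"; do
        prefix="$(yaim_vo_prefix "$vo")"
        for suffix in VOMS_SERVERS VOMSES VOMS_CA_DN WMS_HOSTS; do
            echo "${prefix}_${suffix}=\"\""
        done
    done
}

# Usage with placeholder VO names:
#   print_vo_stubs vo1 vo2
```

The generated names should be compared with the SAM YAIM variable explanations before use, since only the four per-VO variables shown in the example are covered.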


VO specific probes

Including VO specific probes is currently not straightforward. The first step is to learn how a Nagios probe must be developed. A tutorial is available here:

  • Tutorial to build nagios probes (https://svnweb.cern.ch/trac/sam/browser/trunk/demo-probe)

Afterwards, the VO has to define its own profile in /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash_local.pm, in the same way as other profiles are defined in /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash.pm. To understand the attributes, flags and static file rules used in that Perl module, please consult:

  • SAM NCG wiki page (https://tomtools.cern.ch/confluence/display/SAM/NCG)

Finally, point to the new profile in the YAIM configuration:

NCG_HASH_CONFIG_PROFILES=customprofile

For a multi-VO instance where each VO has its own profile, the configuration would be:

NCG_VO="vo1 vo2"
NCG_HASH_CONFIG_PROFILES=customprofileA,customprofileB
NCG_PROFILE_FQAN_customprofileA=vo2
NCG_PROFILE_FQAN_customprofileB=vo2
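
In the multi-VO case, a profile listed in NCG_HASH_CONFIG_PROFILES without a matching NCG_PROFILE_FQAN_<profile> variable is easy to overlook. The sketch below checks for that. It assumes the variables have already been set in the current shell (for example by sourcing the YAIM file) and that profile names contain only letters, digits and underscores; `check_profile_fqans` is a hypothetical helper name.

```shell
# Hypothetical helper: for every profile in the comma separated list
# NCG_HASH_CONFIG_PROFILES, check that NCG_PROFILE_FQAN_<profile> is set.
# Prints one line per missing variable and returns 1 if any is missing.
check_profile_fqans() {
    missing=0
    for profile in $(echo "$NCG_HASH_CONFIG_PROFILES" | tr ',' ' '); do
        # Indirect lookup of NCG_PROFILE_FQAN_<profile>
        eval "fqan=\${NCG_PROFILE_FQAN_${profile}:-}"
        if [ -z "$fqan" ]; then
            echo "missing NCG_PROFILE_FQAN_${profile}"
            missing=1
        fi
    done
    return $missing
}

# Usage, after the YAIM variables have been loaded into the shell:
#   check_profile_fqans && echo "all profiles have a FQAN mapping"
```

The check only applies to the multi-VO layout; the single-profile example above does not define a FQAN variable.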


Contacts

  • GGUS: VO Services Support Unit
  • Mailing list: vo-services@mailman.egi.eu
  • SAM mailing list: tool-admins@mailman.egi.eu


Additional references

  • Nagios (http://www.nagios.org/)


FAQ VO Service Availability Monitoring