
VO Service Availability Monitoring

Revision as of 15:51, 29 April 2011

Introduction

[Figure: VOServicesWikiFig3.png]

The current operations model requires each NGI to deploy and operate its own Service Availability Monitoring (SAM) system to monitor the part of the EGI production infrastructure under its scope. SAM currently includes the following components:

  • a test execution framework (probes) based on the open source monitoring framework Nagios, and the Nagios Configuration Generator (NCG)
  • the Aggregated Topology Provider (ATP), the Metrics Description Database (MDDB), and the Metrics Results Database (MRDB)
  • the message bus to publish results and a programmatic interface
  • the visualization portal (MyEGI)

The full list of SAM instances across the EGI infrastructure can be consulted here.


Each SAM instance triggers the execution of probes at the grid sites under its scope. The present list of probes includes:

  • Job submission testing via CE probe/metrics: the full job submission chain is exercised - job submission, state monitoring, output sandbox retrieval, ...
  • Data management testing via SRM probe/metrics: get full SRM endpoint(s) and storage areas from BDII, copy a local file to the SRM into default space area(s), ...
  • WN testing via WN probe/metrics: replica management tests (WN<->SE communication), ...
  • WMS testing via WMS probe/metrics using submissions to predefined CEs.
  • LFC testing via LFC probe/metrics: read and update catalogue entries, ...


VO monitoring

[Figure: VOServicesWikiFig4.png]

The main idea is to take advantage of the SAM service already in use for EGI operations and adapt it to monitor the resources of a VO, or of multiple VOs, using the same SAM instance.


One of the most obvious advantages of this service is that a VO can develop and integrate its own probes. EGI operations teams test and monitor the status of resources through generic tests, which certain communities may consider insufficient. This approach allows those communities to define custom test suites and insert them in their SAM system.


In order to accomplish this multi-VO monitoring role, the SAM service instance has to be properly adapted:

  • the topology generation has to change so that the resources to be tested are properly configured. The difference with respect to the service used in operations is that VO resources may not be restricted to a single region, and may be spread across the whole EGI infrastructure.
  • the services which interoperate with the SAM services (the WMS used to submit jobs, the default SRM used to replicate files, ...) have to be properly configured to support those VOs


The following section describes how to install and properly configure a Service Availability Monitoring system for VOs.


Installation guide

General Information

What is the VO SAM?

The VO SAM service is an adaptation of the operational SAM service used by NGIs. It is useful for monitoring VO/VRC infrastructures within a given NGI or group of NGIs.

Who can run the VO SAM?

The VO SAM was delivered so that VRCs (or VOs associated with a VRC) could take over the operation of the service. This should be the best scenario, since it gives the VRC / VO full independence to configure the service according to VO needs (for example, integration of new VO specific probes). Nevertheless, delegating the operation of the service to a third party is not excluded.

Which services interoperate with VO SAM?

The VO SAM interoperates with the following middleware components which have to be declared at configuration time:

  1. WMS: for job submission tests
  2. central SE: for data replica tests
  3. MyProxy service: for renewal of VO credentials

The WMS and the central SE must support the VOs in question. Ideally, the VO uses dedicated instances of these services for its SAM system, since the Nagios tests induce high load peaks. The alternative is to use services that are available to the VO but shared by all VO users (and probably by other VOs as well). Information about the endpoints of those services is available through information system queries from any user interface:

# lcg-infosites --vo <VO NAME> wms
# lcg-infosites --vo <VO NAME> se
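
When several VOs are to be monitored, the two queries above can be repeated per VO. The following is a minimal sketch only: `list_vo_endpoints` and the `DRY_RUN` switch are hypothetical helpers introduced here for illustration, and the VO names are placeholders. By default it only prints the commands so that they can be reviewed first.

```shell
#!/bin/sh
# Sketch: query WMS and SE endpoints for each VO given as an argument.
# list_vo_endpoints and DRY_RUN are illustrative names, not part of SAM or gLite.
list_vo_endpoints() {
    for vo in "$@"; do
        for kind in wms se; do
            if [ "${DRY_RUN:-1}" = "1" ]; then
                # Default: only show what would be executed.
                echo "lcg-infosites --vo $vo $kind"
            else
                # Real query; requires a configured user interface.
                lcg-infosites --vo "$vo" "$kind"
            fi
        done
    done
}
```

Calling `list_vo_endpoints vo1 vo2` prints four commands; setting `DRY_RUN=0` on a user interface runs them instead.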

The MyProxy service is used to store and renew the credential of the user sending the Nagios jobs. Starting from Update-09 (still under Staged Rollout), SAM supports the use of robot certificates instead of MyProxy credentials. This is an optional feature which can be used only if your CA and VO support robot certificates. If they do, we suggest switching to robot certificates, as they are easier to maintain. Robot certificates also provide better availability, since SAM no longer depends on the availability of the MyProxy server.

If the VO is unable to provide its own instances of those services, the VRC may make use of the formal links established between VRCs and NGIs, which are foreseen by the VRC approval process and agreed at VRC approval time. This provides direct opportunities to find service providers for VRCs, and for any VO associated with a VRC.


System requirements

The VO SAM is a service with high load peaks, especially at job submission times. The system requirements have to be chosen according to the number of sites to be monitored for each VO, and the number of VOs to be included in the same VO SAM box. As a minimum, we suggest:

  1. Scientific Linux 5, 64-bit
  2. 4 GB of RAM
  3. a quad-core processor is recommended to better handle parallel submissions

The system requirements may increase according to the size of the VO infrastructure to be monitored.


SAM service reference card

To get an overview of the service, in terms of

  • individual services running on the SAM box
  • configuration files for each service
  • log files for each service
  • open ports needed by the services
  • cron jobs scheduled on the SAM box

please consult the SAM service reference card.



Install and configure SAM with YAIM

General guidelines

Follow the installation / configuration guidelines described in the Installing SAM with YAIM documentation.

The SAM production release is currently at Update 07. To find out which YAIM configuration variables have to be set in the YAIM site-info.def file, please check:

  1. SAM release Update 05 Release notes
  2. SAM release Update 06 Release notes
  3. SAM release Update 07 Release notes
  4. SAM YAIM variable explanations

Before configuring the service with YAIM, you should also apply the VO specific guidelines presented in the next sections.

Installation guidelines

VO specific guidelines

To monitor all the resources under the scope of a VO, you have to properly set the following variables in your YAIM configuration file:

  • General VO definitions
# List of VOs to support 
VOS="vo1"

# VOMS server definition for vo1
VO_vo1_VOMS_SERVERS="'vomss://voms.my.domain:8443/voms/vo1?/vo1/'"

# VOMSES server definition for vo1
VO_vo1_VOMSES="'vo1 voms.my.domain 15001 /C=Country/O=Ca/O=Institution/OU=Department/CN=voms.my.domain vo1'"

# DN of the CA which issued the VOMS Certificate
VO_vo1_VOMS_CA_DN="/C=Country/O=CA/CN=Certification Authority"

# WMS used to submit jobs to vo1
VO_vo1_WMS_HOSTS="wms01.ncg.ingrid.pt"

  • Specific SAM VO definitions
# Nagios is acting on a VO 
NAGIOS_ROLE=vo

# List of VOs the tests should run as. You must have a member of each VO willing to store a proxy for your retrieval. 
NCG_VO="vo1"

# Do not show hosts without services associated 
NCG_INCLUDE_EMPTY_HOSTS=0

# list all the NGI/ROCs
NCG_GOCDB_ROC_NAME="NGI_IBERGRID NGI_NL ..."

  • Monitor more than one VO

Set both of the following variables to a whitespace-separated list of VOs:

VOS="vo1 vo2 vo3"

NCG_VO="vo1 vo2 vo3"

and add the definitions described in the General VO definitions item above for every VO in the list.
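
As an illustration, a site-info.def fragment for two VOs might look like the sketch below. It only repeats the variables already shown above for a second VO. All host names (voms.my.domain, wms.my.domain), the port 15002, and the DNs are placeholders chosen for this example and must be replaced by the real values for each VO.

```shell
# Sketch of a two-VO YAIM configuration; every value below is a placeholder.
VOS="vo1 vo2"

# vo1 definitions
VO_vo1_VOMS_SERVERS="'vomss://voms.my.domain:8443/voms/vo1?/vo1/'"
VO_vo1_VOMSES="'vo1 voms.my.domain 15001 /C=Country/O=Ca/O=Institution/OU=Department/CN=voms.my.domain vo1'"
VO_vo1_VOMS_CA_DN="/C=Country/O=CA/CN=Certification Authority"
VO_vo1_WMS_HOSTS="wms.my.domain"

# vo2 definitions (same pattern, VO name and port changed)
VO_vo2_VOMS_SERVERS="'vomss://voms.my.domain:8443/voms/vo2?/vo2/'"
VO_vo2_VOMSES="'vo2 voms.my.domain 15002 /C=Country/O=Ca/O=Institution/OU=Department/CN=voms.my.domain vo2'"
VO_vo2_VOMS_CA_DN="/C=Country/O=CA/CN=Certification Authority"
VO_vo2_WMS_HOSTS="wms.my.domain"

# SAM-specific settings: run Nagios in the VO role for both VOs
NAGIOS_ROLE=vo
NCG_VO="vo1 vo2"
NCG_INCLUDE_EMPTY_HOSTS=0
```

Keeping the VO list identical in both variables avoids configuring a VO for Nagios that YAIM knows nothing about.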


VO specific probes

At the moment, including VO specific probes is not straightforward. The first step is to learn how a Nagios probe is developed. A tutorial is available here:

Afterwards, the VO would have to define their own profile in /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash_local.pm in the same way as other profiles exist in /usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/Hash.pm.

Finally, in the YAIM configuration you need to point to the new profile:

NCG_HASH_CONFIG_PROFILES=customprofile

For a multi-VO instance where each VO has its own profile, the configuration would be:

NCG_VO="vo1 vo2"
NCG_HASH_CONFIG_PROFILES=customprofileA,customprofileB
NCG_PROFILE_FQAN_customprofileA=vo1
NCG_PROFILE_FQAN_customprofileB=vo2
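
Since each profile name has to be repeated in a matching NCG_PROFILE_FQAN_ variable, a small consistency check can catch omissions before YAIM is run. This is a sketch under assumptions: `check_profile_fqans` is a hypothetical helper, not part of SAM or YAIM, and it only checks that the variables exist, not that their values are correct. It assumes profile names are valid shell identifier fragments.

```shell
#!/bin/sh
# Sketch: verify that every profile in NCG_HASH_CONFIG_PROFILES has a
# corresponding NCG_PROFILE_FQAN_<profile> variable set.
# check_profile_fqans is an illustrative name, not a SAM or YAIM tool.
check_profile_fqans() {
    rc=0
    # The profile list is comma-separated; turn it into words.
    for profile in $(echo "$NCG_HASH_CONFIG_PROFILES" | tr ',' ' '); do
        # Indirect lookup of NCG_PROFILE_FQAN_<profile>.
        eval "fqan=\${NCG_PROFILE_FQAN_${profile}:-}"
        if [ -z "$fqan" ]; then
            echo "missing NCG_PROFILE_FQAN_${profile}" >&2
            rc=1
        else
            echo "${profile} -> ${fqan}"
        fi
    done
    return "$rc"
}
```

After sourcing the YAIM configuration file in a shell, calling `check_profile_fqans` lists the profile mappings and returns a non-zero status if one is missing.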


Frequently Asked Questions & Troubleshooting

Can I start 2 different proxies to submit jobs to the different VOs?

Yes. You can have a different proxy for each VO. Just use a different user certificate when creating each MyProxy credential. For example:

# For vo1
$ export X509_USER_CERT=~/.globus/usercert-vo1.pem
$ export X509_USER_KEY=~/.globus/userkey-vo1.pem
$ myproxy-init -l nagios -s $PX_HOST -k NagiosRetrieve-NAGIOS_HOSTNAME-vo1 -c 1000 -x -Z "NAGIOS_HOSTNAME_DN"

# For vo2
$ export X509_USER_CERT=~/.globus/usercert-vo2.pem
$ export X509_USER_KEY=~/.globus/userkey-vo2.pem
$ myproxy-init -l nagios -s $PX_HOST -k NagiosRetrieve-NAGIOS_HOSTNAME-vo2 -c 1000 -x -Z "NAGIOS_HOSTNAME_DN"

The same principle applies to any other VO supported by that instance. You can also use the same user certificate if it is a member of multiple VOs. An easier solution is to use robot certificates. Robot certificates will be supported in the next release (https://tomtools.cern.ch/jira/browse/SAM-952).
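
For an instance with many VOs, the per-VO commands above can be generated rather than typed. The sketch below only prints the commands, following the same certificate naming pattern (usercert-<vo>.pem, userkey-<vo>.pem) as the example; `print_myproxy_cmds` is a hypothetical helper, and the MyProxy host, Nagios host name, and host DN are passed in as arguments rather than assumed.

```shell
#!/bin/sh
# Sketch: print the export and myproxy-init commands for each VO.
# Usage: print_myproxy_cmds PX_HOST NAGIOS_HOSTNAME NAGIOS_HOSTNAME_DN vo...
# print_myproxy_cmds is an illustrative name, not part of SAM or MyProxy.
print_myproxy_cmds() {
    px_host=$1
    nagios_host=$2
    nagios_dn=$3
    shift 3
    for vo in "$@"; do
        printf 'export X509_USER_CERT=~/.globus/usercert-%s.pem\n' "$vo"
        printf 'export X509_USER_KEY=~/.globus/userkey-%s.pem\n' "$vo"
        printf 'myproxy-init -l nagios -s %s -k NagiosRetrieve-%s-%s -c 1000 -x -Z "%s"\n' \
            "$px_host" "$nagios_host" "$vo" "$nagios_dn"
    done
}
```

Printing instead of executing keeps the passphrase prompts of myproxy-init under manual control; the output can be reviewed and pasted one VO at a time.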

Can a catch-all VO SAM provide a dedicated VO view?

The Nagios web interface was never designed for clear presentation. However, there is a service group view in which NCG generates, for each VO, a service group aggregating all the checks that depend on that VO. For example:

Can I configure VO SAM to use a unique LFC and central SE for all VOs?

Yes. Include the following definitions in your YAIM configuration variables. This assumes that the single LFC and central SE support all monitored VOs.

# LFC and SE definitions
JOBSUBMIT_WN_LFC=lfc-allvos.my.domain
JOBSUBMIT_WN_SE_REP=se-allvos.my.domain

Can I configure VO dependent LFCs and central SEs in a VO SAM?

There is a way to do this, though it is slightly more complicated. Make sure that you don't have lines like the following:

/etc/ncg/ncg-localdb.d/jobsubmit:MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--wn-lfc!lfc.my.domain
/etc/ncg/ncg-localdb.d/jobsubmit:MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--wn-lfc!lfc.my.domain
/etc/ncg/ncg-localdb.d/jobsubmit:MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--wn-se-rep!se.my.domain
/etc/ncg/ncg-localdb.d/jobsubmit:MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--wn-se-rep!se.my.domain

anywhere in /etc/ncg/*localdb*, and put the following instead:

VO_ATTRIBUTE!vo1!WN_SE_REP!se-vo1.my.domain
VO_ATTRIBUTE!vo2!WN_SE_REP!se-vo2.my.domain
MODIFY_METRIC_ATTRIBUTE!org.sam.CE-JobState!WN_SE_REP!--wn-se-rep
VO_ATTRIBUTE!vo1!WN_LFC!lfc-vo1.my.domain
VO_ATTRIBUTE!vo2!WN_LFC!lfc-vo2.my.domain
MODIFY_METRIC_ATTRIBUTE!org.sam.CE-JobState!WN_LFC!--wn-lfc
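
The two steps above (looking for global overrides, then writing per-VO attributes) can be sketched in shell. Both function names are hypothetical helpers introduced for this example. The host naming scheme `se-<vo>.<domain>` and `lfc-<vo>.<domain>` simply mirrors the placeholder names used above and is an assumption; real host names will differ per VO.

```shell
#!/bin/sh
# Sketch: helpers for VO-dependent LFC and SE settings in the NCG localdb.
# Both function names are illustrative, not part of SAM.

# Search the given files or directories for global --wn-lfc / --wn-se-rep
# overrides on the CE and CREAMCE JobState metrics.
find_global_overrides() {
    grep -rE 'MODIFY_METRIC_PARAMETER!org\.sam\.(CREAM)?CE-JobState!--wn-(lfc|se-rep)!' "$@"
}

# Print per-VO localdb entries.
# Usage: gen_vo_localdb DOMAIN vo...
gen_vo_localdb() {
    domain=$1
    shift
    for vo in "$@"; do
        printf 'VO_ATTRIBUTE!%s!WN_SE_REP!se-%s.%s\n' "$vo" "$vo" "$domain"
    done
    echo 'MODIFY_METRIC_ATTRIBUTE!org.sam.CE-JobState!WN_SE_REP!--wn-se-rep'
    for vo in "$@"; do
        printf 'VO_ATTRIBUTE!%s!WN_LFC!lfc-%s.%s\n' "$vo" "$vo" "$domain"
    done
    echo 'MODIFY_METRIC_ATTRIBUTE!org.sam.CE-JobState!WN_LFC!--wn-lfc'
}
```

For example, `find_global_overrides /etc/ncg/*localdb*` lists any conflicting lines, and `gen_vo_localdb my.domain vo1 vo2` prints the six entries shown above, ready to be reviewed before they are added to the localdb.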

How can I provide access to a VO member?

At configuration time, and depending on the NCG_ROLE selected, different users may not have the same permissions to access the VO SAM services. To grant access to a given user, add the user DN to /etc/voms2htpasswd-static.d/YAIM-ops-monitor.conf.

Additional guidance

For further questions/problems not addressed in this document, please consult:


A working example


Future improvements

  • In the next SAM release, NCG will automatically scan the whole infrastructure and look for nodes which support the defined VOs.
  • Define a VO profile containing only VO-dependent metrics. Please check SAM-1178.


Known issues

  • MyEGI does not work properly with NCG_ROLE=vo.
  • If you use the ATP for topology generation as recommended in Update 07, you will run into problems. Set up your YAIM configuration files to use SAM as the topology generator.
NCG_TOPOLOGY_USE_SAM=true

However, in this case, you will have to ask for permission to access the information produced by SAM PI. See for example GGUS ticket 66314.


Contacts

  • GGUS: VO Services Support Unit
  • Mailing list: vo-services@mailman.egi.eu
  • SAM mailing list: tool-admins@mailman.egi.eu


Additional references