The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team at operations@egi.eu.

VO Service Availability Monitoring

From EGIWiki
Revision as of 19:09, 28 February 2011 by Goncalo

SAM Introduction

[Figure: VOServicesWikiFig3.png]

The current operations model requires each NGI to deploy and operate its own Service Availability Monitoring (SAM) system to monitor the fraction of the EGI production infrastructure under its scope. SAM currently includes the following components:

  • a test execution framework (probes) based on the open source monitoring framework Nagios, and the Nagios Configuration Generator (NCG)
  • the Aggregated Topology Provider (ATP), the Metrics Description Database (MDDB), and the Metrics Results Database (MRDB)
  • the message bus to publish results and a programmatic interface
  • the visualization portal (MyEGI)

The full list of SAM instances across the EGI infrastructure can be consulted here.


Each SAM instance triggers the execution of probes in the grid sites under its scope. The present list of probes includes:

  • Job submission testing via CE probe/metrics: the full job submission chain is exercised - job submission, state monitoring, output sandbox retrieval, ...
  • Data management testing via SRM probe/metrics: get the full SRM endpoint(s) and storage areas from the BDII, copy a local file to the SRM into the default space area(s), ...
  • WN testing via WN probe/metrics: replica management tests (WN<->SE communication), ...
  • WMS testing via WMS probe/metrics using submissions to predefined CEs.
  • LFC testing via LFC probe/metrics: read and update catalogue entries, ...


SAM for VO monitoring

[Figure: VOServicesWikiFig4.png]

The main idea is to take advantage of the SAM service already in use for EGI operations and adapt it to monitor the resources of a VO, or of multiple VOs, using the same SAM instance.


One of the most obvious advantages of this service is that a VO can then develop and integrate its own probes. EGI operation teams test and monitor the status of resources through generic tests, but these can be insufficient for certain communities. This approach allows those communities to define custom test suites and insert them into their SAM system.


In order to accomplish this multi-VO monitoring role, the SAM service instance has to be properly adapted:

  • the topology generation has to change so that the resources to be tested are properly configured. The difference with respect to the service used in operations is that VO resources may not be restricted to a single region, and may be spread across the whole EGI infrastructure.
  • the services which interoperate with the SAM services (the WMS which is used to submit jobs, the default SRM used to replicate files, ...) have to be properly configured to support those VOs


The following section will depict how to install and properly configure a Service Availability Monitoring System for VOs.


Installation guide

General Information

What is the VO SAM

The VO SAM service is an adaptation of the operation SAM service used by NGIs. It is useful to monitor VO/VRC infrastructures within a given NGI or groups of NGIs.

Who can run the VO SAM

The VO SAM was delivered so that VRCs (or VOs associated with a VRC) could operate the service themselves. This is the preferred scenario, since it gives the VRC / VO full independence to configure the service according to VO needs (for example, integrating new VO-specific probes). Nevertheless, delegating the operation of the service to a third party is not excluded.

Which services interoperate with VO SAM

The VO SAM interoperates with the following middleware components which have to be declared at configuration time:

  1. WMS: For job submission tests
  2. central SE: for data replica tests
  3. MyProxy service: For renewal of VO credentials

The WMS and the central SE must support the VOs in question. The best-case scenario is that the VO uses dedicated instances of these services for its SAM system, since Nagios tests will induce high load peaks. The alternative is to use services at the disposal of the VO but shared by all VO users (and probably by other VOs as well). The endpoints of those services can be obtained through information system queries from any user interface:

# lcg-infosites --vo <VO NAME> wms
# lcg-infosites --vo <VO NAME> se

The MyProxy service is used to store and renew the credential of the user sending the Nagios jobs. Starting from Update-09 (still under Staged Rollout), SAM supports the use of robot certificates instead of MyProxy credentials. This is an optional feature which can be used only if your CA and VO support robot certificates. If they do, we suggest switching to robot certificates, as they are easier to maintain. Robot certificates also provide better availability, since SAM then no longer depends on the availability of a MyProxy server.

If the VO is unable to provide its own instances of those services, the VRC may rely on the formal links established between VRCs and NGIs, which are foreseen by the VRC approval process and agreed at VRC approval time. This provides direct opportunities to find service providers for VRCs, and for any VO which is associated with a VRC.


System requirements

The VO SAM will be a service with high load peaks, especially at job submission times. The system requirements have to be chosen according to the number of sites to be monitored under each VO, and the number of VOs to be included in the same VO SAM box. Therefore, as a minimum requirement, we suggest:

  1. Scientific Linux 5, 64-bit
  2. 4 GB of RAM
  3. a quad-core processor, recommended to better handle parallel submissions

The system requirements may increase according to the size of the VO infrastructure to be monitored.
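
As a rough illustration of the minimum requirements above, the following Python sketch compares a host's memory and core count against the suggested figures. The function names, the 10% tolerance on reported memory, and the use of `os.sysconf` (Linux-style hosts) are assumptions of this sketch, not part of SAM or YAIM.

```python
import os

MIN_RAM_GB = 4   # suggested minimum from the list above
MIN_CORES = 4    # quad-core recommendation from the list above
RAM_TOLERANCE = 0.9  # assumption: the OS reports slightly less than nominal RAM


def check_requirements(ram_gb, cores):
    """Return a list of warnings; an empty list means the host qualifies."""
    warnings = []
    if ram_gb < MIN_RAM_GB * RAM_TOLERANCE:
        warnings.append(f"RAM: {ram_gb:.1f} GB is below the suggested {MIN_RAM_GB} GB")
    if cores < MIN_CORES:
        warnings.append(f"CPU: {cores} core(s) is below the suggested {MIN_CORES}")
    return warnings


def detect_host():
    """Read physical RAM (in GB) and core count from the local host."""
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return ram_bytes / 1024 ** 3, os.cpu_count() or 1
```

Calling `check_requirements(*detect_host())` on the candidate SAM box would list any shortfall. The check does not cover the operating system requirement, and larger VO infrastructures may need more than these minimums.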


SAM service reference card

To get an overview of the service, in terms of

  • individual services running on the SAM box
  • configuration files for each service
  • log files for each service
  • open ports needed by the services
  • cron jobs scheduled for execution on the SAM box

please consult the SAM service reference card.



Install and configure SAM with YAIM

General guidelines

Follow the general SAM installation / configuration guidelines.

The SAM production release is presently on Update 7. To understand which YAIM configuration variables have to be set in the YAIM site-info.def file, please check:

  1. SAM release Update 05 Release notes
  2. SAM release Update 06 Release notes
  3. SAM release Update 07 Release notes
  4. SAM YAIM variable explanations

Before configuring the service with YAIM, you should also apply the VO specific guidelines presented in the next sections.

VO specific guidelines

To monitor all the resources under the scope of a VO, you have to properly set the following variables in your YAIM configuration file:

  • General VO definitions
# List of VOs to support 
VOS="vo1"

# VOMS server definition for vo1
VO_vo1_VOMS_SERVERS="'vomss://voms.my.domain:8443/voms/vo1?/vo1/'"

# VOMSES server definition for vo1
VO_vo1_VOMSES="'vo1 voms.my.domain 15001 /C=Country/O=Ca/O=Institution/OU=Department/CN=voms.my.domain vo1'"

# DN of the CA which issued the VOMS Certificate
VO_vo1_VOMS_CA_DN="/C=Country/O=CA/CN=Certification Authority"

# WMS used to submit jobs to vo1
VO_vo1_WMS_HOSTS="wms01.ncg.ingrid.pt"
  • Specific SAM VO definitions
# Nagios is acting on a VO 
NAGIOS_ROLE=vo

# List of VOs the tests should run as. You must have a member of each VO willing to store a proxy for your retrieval. 
NCG_VO="vo1"

# Do not show hosts without services associated 
NCG_INCLUDE_EMPTY_HOSTS=0

# list all the NGI/ROCs
NCG_GOCDB_ROC_NAME="NGI_IBERGRID NGI_NL ..."
  • Monitor more than one VO

Set both of the following variables to a whitespace-separated list of VOs:

VOS="vo1 vo2 vo3"

NCG_VO="vo1 vo2 vo3"

and insert the information shown in the General VO definitions above for every VO.
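
The multi-VO setup amounts to repeating the per-VO variables for every entry of the VO list. The Python sketch below generates such a fragment. The function name, the dictionary keys and the example host names are hypothetical, and the sketch assumes VO names that are valid inside shell variable names, following the `VO_vo1_...` pattern used above.

```python
import re

# Per-VO variables, following the naming pattern of the example above.
PER_VO_TEMPLATE = '''\
# Definitions for {vo}
VO_{vo}_VOMS_SERVERS="'{voms_servers}'"
VO_{vo}_VOMSES="'{vomses}'"
VO_{vo}_VOMS_CA_DN="{voms_ca_dn}"
VO_{vo}_WMS_HOSTS="{wms_hosts}"
'''


def yaim_multi_vo_fragment(vos):
    """Build YAIM lines for several VOs.

    `vos` maps each VO name to a dict with the (hypothetical) keys
    'voms_servers', 'vomses', 'voms_ca_dn' and 'wms_hosts'.
    """
    for vo in vos:
        # Assumption of this sketch: only simple names are handled.
        if not re.fullmatch(r"[A-Za-z0-9_]+", vo):
            raise ValueError(f"VO name not usable in a variable name: {vo!r}")
    names = " ".join(vos)
    parts = [f'VOS="{names}"', f'NCG_VO="{names}"', ""]
    parts += [PER_VO_TEMPLATE.format(vo=vo, **cfg) for vo, cfg in vos.items()]
    return "\n".join(parts)


example = {
    "voms_servers": "vomss://voms.my.domain:8443/voms/vo1?/vo1/",
    "vomses": "vo1 voms.my.domain 15001 /C=Country/O=Ca/CN=voms.my.domain vo1",
    "voms_ca_dn": "/C=Country/O=CA/CN=Certification Authority",
    "wms_hosts": "wms.my.domain",
}
print(yaim_multi_vo_fragment({"vo1": example}))
```

The output is meant to be reviewed and pasted into the YAIM configuration file by hand; the remaining SAM-specific variables (such as `NAGIOS_ROLE` and `NCG_GOCDB_ROC_NAME`) still have to be set as described above.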


VO SAM Frequently Asked Questions

Can I start 2 different proxies to submit jobs to the different VOs?

Yes. You can have a different proxy for each VO. Just use a different user certificate when creating each MyProxy credential. For example:

# For vo1
$ export X509_USER_CERT=~/.globus/usercert-vo1.pem
$ export X509_USER_KEY=~/.globus/userkey-vo1.pem
$ myproxy-init -l nagios -s $PX_HOST -k NagiosRetrieve-NAGIOS_HOSTNAME-vo1 -c 1000 -x -Z "NAGIOS_HOSTNAME_DN"

# For vo2
$ export X509_USER_CERT=~/.globus/usercert-vo2.pem
$ export X509_USER_KEY=~/.globus/userkey-vo2.pem
$ myproxy-init -l nagios -s $PX_HOST -k NagiosRetrieve-NAGIOS_HOSTNAME-vo2 -c 1000 -x -Z "NAGIOS_HOSTNAME_DN"

The same principle applies to any other VO supported by that instance. You can of course use the same user certificate if it is a member of multiple VOs. An easier solution would be to use robot certificates, which will be supported in the next release (https://tomtools.cern.ch/jira/browse/SAM-952).
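
When many VOs are involved, the per-VO commands can be generated rather than typed. The Python sketch below only prints the commands, using the same options and file naming as the example above; it does not run `myproxy-init`. The function name and the example host names and DN are hypothetical.

```python
import shlex


def myproxy_init_command(vo, nagios_hostname, nagios_host_dn, px_host,
                         cert_dir="~/.globus"):
    """Return (environment, argv) for one VO, mirroring the example above."""
    env = {
        "X509_USER_CERT": f"{cert_dir}/usercert-{vo}.pem",
        "X509_USER_KEY": f"{cert_dir}/userkey-{vo}.pem",
    }
    argv = [
        "myproxy-init", "-l", "nagios", "-s", px_host,
        "-k", f"NagiosRetrieve-{nagios_hostname}-{vo}",
        "-c", "1000", "-x", "-Z", nagios_host_dn,
    ]
    return env, argv


# Hypothetical host names and DN, for illustration only.
for vo in ("vo1", "vo2"):
    env, argv = myproxy_init_command(
        vo, "sam.my.domain", "/C=Country/O=Ca/CN=sam.my.domain", "px.my.domain")
    for name, value in env.items():
        print(f"export {name}={value}")
    print(shlex.join(argv))
```

Printing instead of executing keeps the passphrase prompt of each credential under the operator's control.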


Can a catch-all VO SAM provide a dedicated VO view?

The Nagios web interface was never designed for clear presentation. However, there is the service group view, for which NCG generates one service group per VO, aggregating all the VO-dependent checks for that VO.


Can I configure VO SAM to use a unique LFC and central SE for all VOs?

Yes. Include the following definitions in your YAIM configuration variables. Implicitly there is the assumption that the unique LFC and central SE do support all monitored VOs.

# LFC and SE definitions
JOBSUBMIT_WN_LFC=lfc-allvos.my.domain
JOBSUBMIT_WN_SE_REP=se-allvos.my.domain


Can I configure VO dependent LFCs and central SEs in a VO SAM?

There is a way to do this, though it is slightly more complicated. Make sure that you don't have lines like

/etc/ncg/ncg-localdb.d/jobsubmit:MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--wn-lfc!lfc.my.domain
/etc/ncg/ncg-localdb.d/jobsubmit:MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--wn-lfc!lfc.my.domain
/etc/ncg/ncg-localdb.d/jobsubmit:MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--wn-se-rep!se.my.domain
/etc/ncg/ncg-localdb.d/jobsubmit:MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--wn-se-rep!se.my.domain

anywhere in /etc/ncg/*localdb*, and put the following instead:

VO_ATTRIBUTE!vo1!WN_SE_REP!se-vo1.my.domain
VO_ATTRIBUTE!vo2!WN_SE_REP!se-vo2.my.domain
MODIFY_METRIC_ATTRIBUTE!org.sam.CE-JobState!WN_SE_REP!--wn-se-rep
VO_ATTRIBUTE!vo1!WN_LFC!lfc-vo1.my.domain
VO_ATTRIBUTE!vo2!WN_LFC!lfc-vo2.my.domain
MODIFY_METRIC_ATTRIBUTE!org.sam.CE-JobState!WN_LFC!--wn-lfc
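
The replacement entries follow a regular pattern: one `VO_ATTRIBUTE` line per VO and attribute, plus one `MODIFY_METRIC_ATTRIBUTE` line binding the attribute to the probe option. A minimal Python sketch that generates them is shown below; the function name is hypothetical, and only the `org.sam.CE-JobState` metric from the example above is handled by default.

```python
def localdb_lines(se_by_vo, lfc_by_vo, metric="org.sam.CE-JobState"):
    """Build localdb entries for per-VO central SEs and LFCs.

    `se_by_vo` and `lfc_by_vo` map VO names to host names.
    """
    lines = [f"VO_ATTRIBUTE!{vo}!WN_SE_REP!{se}" for vo, se in se_by_vo.items()]
    lines.append(f"MODIFY_METRIC_ATTRIBUTE!{metric}!WN_SE_REP!--wn-se-rep")
    lines += [f"VO_ATTRIBUTE!{vo}!WN_LFC!{lfc}" for vo, lfc in lfc_by_vo.items()]
    lines.append(f"MODIFY_METRIC_ATTRIBUTE!{metric}!WN_LFC!--wn-lfc")
    return lines


print("\n".join(localdb_lines(
    {"vo1": "se-vo1.my.domain", "vo2": "se-vo2.my.domain"},
    {"vo1": "lfc-vo1.my.domain", "vo2": "lfc-vo2.my.domain"},
)))
```

With the two VOs shown, the output reproduces the six lines above; adding a VO only requires extending the two dictionaries.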


Further questions

For further questions not addressed in this document, please consult:

  1. SAM Administrator FAQs
  2. The SAM mailing list (tool-admins@mailman.egi.eu), where you can ask for help


VO specific probes

On 14.1.2011. 19:26, Gonçalo Borges wrote:
> This could be quite a long work since metrics could depend on the VO,
> right?! And if a VO wants to include specific probes? How can they do that?

Right now it is pretty tricky. The VO would need to define its own profile in Hash_local.pm, in the same way as the other profiles are defined in Hash.pm. The file Hash_local.pm should be stored here:

/usr/lib/perl5/vendor_perl/5.8.5/NCG/LocalMetrics/

In the YAIM configuration you would need to point to the new profile:

NCG_HASH_CONFIG_PROFILES=customprofile

In the case of a multiple-VO instance where each VO has its own profile, the configuration would be:

NCG_VO="VO1 VO2"
NCG_HASH_CONFIG_PROFILES=customprofileA,customprofileB
NCG_PROFILE_FQAN_customprofileA=VO1
NCG_PROFILE_FQAN_customprofileB=VO2

>> - completely clearing out hosts which don't support any of the VOs
>> - gathering info over the whole EGI infrastructure, without need to
>> define list of NGI/ROCs
>
> When do you think this would be available?

Was initially planning to do this for the next release. Unfortunately it might be pushed to the February one as I need to clear up some details and was busy with all the SA1 planning.

Cheers

Future improvements

  • In the next SAM release, NCG will automatically go over the whole infrastructure and look for nodes which support the defined VOs. In this sense, it will no longer be necessary to define all regions under the VO scope in the YAIM variable NCG_GOCDB_ROC_NAME.

Next steps, which should be in the next release:

  • defining the list of metrics for VO instances (Christine provided some info; I'm sorry I didn't find time to answer)
  • completely clearing out hosts which don't support any of the VOs
  • gathering info over the whole EGI infrastructure, without the need to define a list of NGI/ROCs

{CE} =
['hr.srce.GRAM-Auth','org.sam.CE-JobState','org.sam.CE-JobSubmit','org.sam.WN-Bi','org.sam.WN-Csh','org.sam.WN-SoftVer','org.sam.WN-Rep','org.sam.WN-RepISenv','org.sam.WN-RepFree','org.sam.WN-RepCr','org.sam.WN-RepGet','org.sam.WN-RepRep','org.sam.WN-RepDel'];

{'CREAM-CE'} =
['org.sam.CREAMCE-JobState','org.sam.CREAMCE-JobSubmit','org.sam.WN-Bi','org.sam.WN-Csh','org.sam.WN-SoftVer','org.sam.WN-Rep','org.sam.WN-RepISenv','org.sam.WN-RepFree','org.sam.WN-RepCr','org.sam.WN-RepGet','org.sam.WN-RepRep','org.sam.WN-RepDel'];

{'MPI'} =
['org.sam.mpi.CE-JobState','org.sam.mpi.CE-JobSubmit','org.sam.WN-MPI'];

{SRM} =
['org.sam.SRM-All','org.sam.SRM-GetSURLs','org.sam.SRM-LsDir','org.sam.SRM-Put','org.sam.SRM-Ls','org.sam.SRM-GetTURLs','org.sam.SRM-Get','org.sam.SRM-Del'];

{'Central-LFC'} =
['ch.cern.LFC-Write','ch.cern.LFC-Read','ch.cern.LFC-Readdir','ch.cern.LFC-ReadDli','ch.cern.LFC-Ping']; 

{'Local-LFC'} =
['ch.cern.LFC-Read','ch.cern.LFC-Readdir','ch.cern.LFC-Ping'];

{WMS} = ['org.sam.WMS-JobState','org.sam.WMS-JobSubmit'];

{VOMS} = ['hr.srce.VOMS-ServiceStatus'];

{FTS} = ['ch.cern.FTS-ChannelList','ch.cern.FTS-InfoSites'];

{'VO-box'} = ['org.nagios.gsissh-Check'];


Known issues

  • MyEGI does not work properly when NAGIOS_ROLE=vo is used.


Contacts

  • GGUS: VO Services Support Unit
  • Mailing list: vo-services@mailman.egi.eu
  • SAM mailing list: tool-admins@mailman.egi.eu


Additional references