
GOCDB/PI/Technical Documentation

Revision as of 10:13, 10 November 2010 by Aesch (talk | contribs)


GOCDB Programmatic Interface

Important information about using the GOCDBPI

Releases, changes and announcements

Every client tool that uses GOCDB should have someone registered on the GOCDB discussion mailing list: gocdb-discuss_at_mailtalk.ac.uk. Releases, new features, changes in methods and general announcements related to this interface are made on this list.

If your application uses the GOCDB Programmatic Interface, please request to join this mailing list by sending a mail to gocdb-admins@mailtalk.ac.uk.


Interface description

Interface type

The GOCDB Programmatic Interface is a REST (Representational State Transfer) interface served over https. Using https means that requests, including their URLs, are encrypted in transit. Some of the methods are nonetheless public and do not require client-side authentication (see "Data protection and access").
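As a rough illustration of how a client might address such an interface, the sketch below builds a query URL with the Python standard library. The base URL, the `?method=...` query layout, the method name `get_site` and the `sitename` parameter are all placeholders assumed for the example; none of them is specified on this page.

```python
from urllib.parse import urlencode, urlsplit

# Hypothetical endpoint: replace with the real base URL of the interface.
BASE_URL = "https://gocdb.example.org/gocdbpi/public/"


def build_query_url(method, **params):
    """Return an https URL asking the interface to run `method` with `params`."""
    if urlsplit(BASE_URL).scheme != "https":
        raise ValueError("the interface is only served over https")
    # The method name goes first, followed by its parameters, all URL-encoded.
    query = urlencode({"method": method, **params})
    return f"{BASE_URL}?{query}"


# Assumed method and parameter names, for illustration only.
url = build_query_url("get_site", sitename="EXAMPLE-SITE")
print(url)
```

Keeping URL construction in one helper makes it easy to swap in the real endpoint and method names once they are known.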

Data protection and access


Probes

   * Added by Konstantin Skaburskas, last edited by Konstantin Skaburskas on Nov 09, 2010

Comment:

Under construction. Please refer to:

https://twiki.cern.ch/twiki/bin/view/LCG/SAMProbesMetrics

https://twiki.cern.ch/twiki/bin/view/LCG/SAMToNagios

   * python-GridMon RPM
   * grid-monitoring-probes-org.sam RPM
         o RPM structure and dependencies
         o Content of the RPM
   * CE
         o Metrics
               + org.sam.CE-JobState
               + org.sam.CE-JobSubmit
               + org.sam.CE-JobMonit
         o Job submission
         o Troubleshooting
   * CREAM-CE
         o Metrics
               + org.sam.CREAMCE-JobState
         o Job submission via WMS
         o Direct job submission
   * WN
         o Metrics
         o Execution on WN
         o Troubleshooting
               + Increasing debugging on WN
               + Getting logs from WN
               + Missing attributes in message body
   * SRM
         o Metrics
         o Execution
               + Locking of working directory
         o Troubleshooting
   * General Troubleshooting
         o "Return code of 139 is out of bounds" and metrics in CRITICAL

python-GridMon RPM

Temporary here.

python-GridMon - a library for developing and running Nagios grid probes & metrics.

   * /etc/gridmon/ - configuration directory:
         o org.sam.errdb - collection of common gLite m/w error messages and their mapping to Nagios statuses

This page describes the probes and metrics which are part of the grid-monitoring-probes-org.sam RPM (as of rel. 0.1.16).

grid-monitoring-probes-org.sam RPM

The grid-monitoring-probes-org.sam RPM is available through the EGEE SA1 repository http://www.sysadmin.hep.ac.uk/rpms/egee-SA1/ (and via the egee-NAGIOS meta RPM).

RPM structure and dependencies

   * Structure
     /etc/gridmon/
     /usr/lib/python2.4/site-packages/gridmetrics/
     /usr/libexec/grid-monitoring/probes/org.sam/
     /usr/libexec/grid-monitoring/probes/org.sam/wnjob
     /usr/libexec/grid-monitoring/probes/org.sam/wnjob/nagios.d/{bin/,etc/,lib/,plugins/,probes/,tmp/,var/}
     /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam/{etc/wn.d/org.sam/,probes/org.sam/}
   * Dependencies
     python >= 2.4
     python-GridMon >= 1.1.3
     python-ldap
     python-suds >= 0.3.5
     grid-monitoring-probes-hr.srce >= 0.20.1

python-GridMon is a Python helper library for grid monitoring tools. It contains helper routines for Nagios and Grid Security, and was used in developing the probes and metrics.

Content of the RPM

   * SAM Nagios probes (in /usr/libexec/grid-monitoring/probes/org.sam/):
         o CE-probe - CE probe containing a number of CE tests (metrics) for job submission via WMS
         o CREAMCE-probe - as above, but for CREAM CEs
         o CREAMCEDJS-probe - direct job submission to CREAM CEs (asynchronous)
         o SRM-probe - SRM probe containing a number of metrics for SRM service
         o T-probe - template probe, which serves as an example for writing your own probes based on the Python framework currently provided by the package (see Writing a probe under "Python based probes using org.sam's 'gridmonsam' module" section on the same page)
         o WN-probe - WN probe containing a number of metrics to be run on WNs
         o WMS-probe - metrics to test whether job submission through WMS works (asynchronous)
   * wrapper checks (in /usr/libexec/grid-monitoring/probes/org.sam/):
         o samtest-run - to run "native" SAM tests (see link)
         o nagtest-run - to run "semi"-Nagios checks (see link)
   * /usr/libexec/grid-monitoring/probes/org.sam/wnjob - directory containing
         o nagios.d/ - directory with Nagios used as checks' scheduler on WNs
         o nagrun.sh - wrapper script to be launched on WNs (sets up required environment, launches and monitors Nagios, periodically sends WN metrics results to Message Bus)
         o org.sam/ -
               + probes/ - directory with SAM WN probes/tests ("new and old" ones), samtest-run and nagtest-run wrappers
               + etc/ - WN Nagios configuration for the above checks
   * gridmetrics Python package (in /usr/lib/python2.4/site-packages/):
         o used by the above SAM probes.
   * /etc/gridmon/ - configuration directory:
         o org.sam.conf - main configuration file

Source code can be browsed here: https://svnweb.cern.ch/trac/sam/browser/trunk/probes, http://svnweb.cern.ch/guest/sam/trunk/probes

CE

CE-probe

Metrics

Metric Name            Metric Description                                        Metric Locality
org.sam.CE-JobState    Active. Submits grid job to CE.                           remote
org.sam.CE-JobMonit    Active. Monitors grid jobs submitted to CEs.              remote
org.sam.CE-JobSubmit   Passive. Holds terminal status of job submission to CE.   local

org.sam.CE-JobState

Active + Passive check. By default the active check is assumed to run hourly.

   * submits a grid job to the CE
         o stores active job attributes in /<metric work dir>/activejob.map. The file acts as a lock and prevents submission of further jobs. The activejob.map file is removed in two cases: 1) by org.sam.CE-JobMonit when the job enters a terminal state; 2) by org.sam.CE-JobState when the global timeout for a running job is exceeded (--timeout-job-discard, which defaults to 6h). To force the next submission, simply remove the file.
   * accepts passive check results (from org.sam.CE-JobMonit) for the submitted grid job
         o holds the status of the grid job. When the grid job enters a terminal state (as seen by org.sam.CE-JobMonit), its status is passively updated by org.sam.CE-JobMonit according to the job's state.
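The locking behaviour of activejob.map can be sketched roughly as follows. This is not the probe's code: it assumes that the age of the file (its modification time) stands in for the age of the job, and the function name `may_submit` is made up for the example.

```python
import os
import time

# Seconds; mirrors the 6h default of --timeout-job-discard.
DISCARD_TIMEOUT = 6 * 3600


def may_submit(work_dir, now=None):
    """Return True if a new job may be submitted from `work_dir`."""
    now = time.time() if now is None else now
    lock = os.path.join(work_dir, "activejob.map")
    try:
        # Assumption: file mtime approximates the job submission time.
        age = now - os.path.getmtime(lock)
    except FileNotFoundError:
        return True       # no active job recorded, free to submit
    if age > DISCARD_TIMEOUT:
        os.remove(lock)   # job outlived the global timeout: discard it
        return True
    return False          # a job is still active, do not submit another
```

Removing activejob.map by hand has the same effect as the first branch: the next call finds no file and allows a submission.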

org.sam.CE-JobSubmit

Passive check.

   * holds terminal status of job submission to CE (mapping from gLite job terminal states ['Done','Aborted','Canceled'] to Nagios status [OK,WARNING,CRITICAL,UNKNOWN])
   * passively updated by org.sam.CE-JobMonit.
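A state-to-status mapping of this kind might look like the sketch below. The numeric codes are the standard Nagios plugin return codes, but which gLite state maps to which status is an assumption made for illustration; the probe's actual mapping may differ.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical mapping of gLite terminal job states to Nagios statuses.
TERMINAL_STATE_TO_NAGIOS = {
    "Done": OK,
    "Aborted": CRITICAL,
    "Canceled": WARNING,
}


def nagios_status(job_state):
    """Map a gLite terminal job state to a Nagios status code."""
    # Anything that is not a known terminal state is reported as UNKNOWN.
    return TERMINAL_STATE_TO_NAGIOS.get(job_state, UNKNOWN)
```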

org.sam.CE-JobMonit

Active check. By default runs every 5 min.

   * monitors the status of all submitted jobs (as defined in activejob.map files) and updates the states of the org.sam.CE-JobState and org.sam.CE-JobSubmit metrics. Acts as a babysitter for all grid jobs submitted by org.sam.CE-JobState. org.sam.CE-JobState and org.sam.CE-JobSubmit are updated (as passive checks) either via the Nagios command file or NSCA.
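For the command-file route, a passive service result is one line in the Nagios external command format PROCESS_SERVICE_CHECK_RESULT. The sketch below only formats such a line; the host name and output text are made-up examples, and how the probe itself builds or delivers the line is not shown on this page.

```python
import time


def passive_result_line(host, service, status, output, timestamp=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    ts = int(time.time() if timestamp is None else timestamp)
    # Fields are separated by semicolons; the line ends with a newline.
    return (
        f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{status};{output}\n"
    )


# Example values only: a passive OK result for org.sam.CE-JobSubmit.
line = passive_result_line(
    "ce.example.org", "org.sam.CE-JobSubmit", 0, "OK: job Done",
    timestamp=1288940000,
)
print(line, end="")
```

Writing the resulting line to the Nagios command file (a named pipe whose path is site-specific) submits the passive result.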

Job submission

Troubleshooting

CREAM-CE

CREAMCE-probe

Metrics

Metric Name                 Metric Description                    Metric Locality
org.sam.CREAMCE-JobState    Submits grid job to CE                remote
org.sam.CREAMCE-JobMonit    Monitors grid jobs submitted to CEs   remote
org.sam.CREAMCE-JobSubmit   Passive check.                        local

org.sam.CREAMCE-JobState

See Job submission via WMS

See Direct job submission

WN

WN-probe samtest-run wrapper and standard SAM tests

Metrics

Metrics               Description
org.sam.WN-Rep        Wrapper check to launch the replica management checks and publish passive check results to Nagios.
org.sam.WN-RepISenv   Check if LCG_GFAL_INFOSYS variable is set.
org.sam.WN-RepFree    Check if Close (or VO default) SE has any free space left according to the information system.
org.sam.WN-RepCr      Copy and register a file to the Close (or default) SE into default space area. Retrieve list of replicas.
org.sam.WN-RepGet     Copy the file back from Close SE to the WN. Compare the files.
org.sam.WN-RepRep     Replicate the file from close SE to a chosen 'central' SE.
org.sam.WN-RepDel     Delete given file(s) from SRM.
org.sam.WN-PyVer      Check version of Python installed on WN.

Execution on WN

Executed by Nagios.

Troubleshooting

Increasing debugging on WN

TODO.

Getting logs from WN

Logs and debugging messages from Nagios on WN can be found in /<>/ TODO.

Missing attributes in message body

This applies when there are no results from WNs and you are sure that the problem is not with the brokers.

Symptom. WN checks are in PENDING. Messages reach the Nagios box, and you can see that they are consumed by msg-to-handler from the destinations

$ grep destination /etc/msg-to-handler.d/*.conf

but all or some of them do not get into the respective local directory queues

$ grep CACHE_DIR /etc/msg-to-handler.d/*.conf

and, as a consequence, Nagios passive results do not reach the Nagios command file.

You may see similar messages in /var/log/messages:

Oct 31 07:00:17 samnag013 msg-to-handler[23485]: [WARNING] msg-to-handler: could not handle message ID:gridmsg002.cern.ch-41281-1288255386367-4:1287764:-1:1:1: handler warning: Got error creating Nagios passive result: Nagios Parser ERROR: Missing attribute hostname. .

This message comes from one of the handlers defined for msg-to-handler in /etc/msg-to-handler.d/. NB! In this case [as of Friday, November 05 2010] the message handler does not print its name as given in the respective /etc/msg-to-handler.d/*.conf, so you will have to "map" it yourself.
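To see which attributes are reported missing, and for which messages, the warnings can be pulled out of the log with a short script. The sketch below is written against the single sample line shown above, so the regular expression is an assumption about the log format rather than a documented one.

```python
import re

# Matches the msg-to-handler warning shown above and captures the message ID
# and the name of the attribute the Nagios parser could not find.
WARNING_RE = re.compile(
    r"could not handle message (ID:\S+): .*Missing attribute (\w+)"
)


def find_missing_attributes(log_lines):
    """Return (message_id, attribute) pairs for every matching log line."""
    hits = []
    for line in log_lines:
        match = WARNING_RE.search(line)
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits
```

It could be fed with the lines of /var/log/messages, for example `find_missing_attributes(open("/var/log/messages"))`.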

msg-to-handler subscribes with auto-acknowledge, and on such a failure it does not re-send the messages to a "dead-queue".

The best approach is to consume a couple of messages from a destination you suspect of carrying bad messages.

SRM

Metrics                Description
org.sam.SRM-All        Wrapper metric to launch the other metrics and publish passive checks results to Nagios.
org.sam.SRM-GetSURLs   Get full SRM endpoint(s) and storage areas from BDII.
org.sam.SRM-LsDir      List content of VO's top level space area(s) in SRM.
org.sam.SRM-Put        Copy a local file to the SRM into default space area(s).
org.sam.SRM-Ls         List (previously copied) file(s) on the SRM.
org.sam.SRM-GetTURLs   Get Transport URLs for the file copied to storage.
org.sam.SRM-Get        Copy given remote file(s) from SRM to a local file.
org.sam.SRM-Del        Delete given file(s) from SRM.

Metrics

Execution

Locking of working directory

Time-limited locking of the metric working directory was implemented to eliminate race conditions when metrics are executed too close together in time (clashes in scheduling or manual invocations). The lock is a file named lock, and the time limit is 5 min. The lock is created by org.sam.SRM-GetSURLs and removed by org.sam.SRM-Del. If for any reason the file was not removed, the directory is considered locked until 5 min have passed since the file was created.

Troubleshooting

General Troubleshooting

"Return code of 139 is out of bounds" and metrics in CRITICAL

This most probably means that a check is segfaulting, in which case Nagios marks the tested host/service as CRITICAL. Check /var/log/messages:

$ egrep "kernel.*segfault" /var/log/messages
...
Nov 5 08:33:19 samnag016 kernel: python[28524]: segfault at 0000000000000071 rip 00002b9960e16390 rsp 00007fffee35cd20 error 4
...
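The number 139 itself points at a segfault: by shell convention a return code above 128 means the process was killed by signal (code - 128), and 139 - 128 = 11 is SIGSEGV on Linux. Nagios expects plugin return codes 0 to 3, hence "out of bounds". A small helper, named here purely for illustration, decodes such codes:

```python
import signal


def describe_return_code(code):
    """Describe a shell-style return code; codes above 128 mean a signal."""
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by unknown signal {signum}"
    return f"exited with status {code}"


print(describe_return_code(139))
```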

NB! A Nagios check that parses logs was planned, to detect such errors and notify Nagios instance admins. It is not there yet as of [Friday, November 05 2010].

   * Example and the problem resolution. The above case was due to a segfault in lcg_util API http://savannah.cern.ch/bugs/?74459

To resolve the problem on the probe side, org.sam.SRM-GetTURLs was modified to use the CLI instead of the Python API. This "shifts" the segfault to the CLI ("wraps" it in a forked sub-shell), where it can be caught. CPython therefore no longer crashes, and Nagios does not erroneously mark the tested service as CRITICAL (the default behavior of Nagios when a check segfaults).
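The general idea of isolating a crash-prone call in a child process can be sketched with the standard subprocess module. This is not the probe's code: the function name `run_cli` is invented, and reporting a crash as Nagios UNKNOWN is an assumption made for the example.

```python
import signal
import subprocess

UNKNOWN = 3  # Nagios status code, used here for "the tool itself crashed"


def run_cli(argv):
    """Run a command in a child process and survive its crash.

    Returns (status, detail): status is UNKNOWN if the child was killed
    by a signal, otherwise None, with the child's stdout as detail.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode < 0:
        # A negative return code means the child was killed by a signal,
        # e.g. -11 for SIGSEGV; the calling interpreter keeps running.
        name = signal.Signals(-proc.returncode).name
        return UNKNOWN, f"{argv[0]} died with {name}"
    return None, proc.stdout
```

Because the segfault happens in the child, the check can still exit with a valid status and a readable message instead of dying with return code 139.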
