The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other supports, new updates will be ignored and lost.
If needed you can get in touch with EGI SDIS team using operations @

NGI DE CH Operations Center:Monitoring

From EGIWiki

NGI-DE NGI-CH Monitoring


Participants and Shifts

Dimitri Nilsen (KIT)
Foued Jrad (KIT)
Pavel Weber (KIT)
Alessandro Usai (SWITCH)
Simon Leinen (SWITCH)


Even weeks: KIT; odd weeks: SWITCH

Reports on Fridays to

Plan for ARC Testing set up in Nagios 15.9.11

  1. Customize the file /etc/grid-monitoring/org.ndgf.conf with the NGI services.
  2. NorduGrid Logging Improvement:
    1. Edit the xRSL template files (in /usr/share/grid-monitoring/org.ndgf/<SERVICE>/xrsl) for all services and add (gmlog = "gmlog") to them, e.g.
more /usr/share/grid-monitoring/org.ndgf/lfc/xrsl

(executable = "")
(jobname = "lfc")
(stdout = "testjob.out")
(gmlog = "gmlog")
(stderr = "testjob.err")
(inputfiles = ("" "/usr/share/grid-monitoring/org.ndgf/lfc/")
              ("file" "%LFC_TESTFILE%"))
(outputfiles = ("testjob.out" "")("testjob.err" "")
               ("outfile" "%LFC_STORAGE_W%/%HOST%-lfc-%TIME%"))
(walltime = "15 min")
(memory = "256")

This ensures that, in case of an error, the gmlog (useful for debugging) is sent back as part of the output sandbox. The template files in /usr/share/grid-monitoring/org.ndgf are: gridftp/xrsl, jobsubmit/xrsl, lfc/xrsl, rls/xrsl, srm/xrsl
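Since the gmlog line has to be added to every template by hand, a quick check for templates that were missed can help. The following is a minimal sketch, not part of the original procedure: the function name missing_gmlog is hypothetical, and it assumes the attribute is written exactly as (gmlog = "gmlog"), so a template with different spacing would be reported as missing.

```shell
#!/bin/sh
# Hypothetical helper: print every <SERVICE>/xrsl template under a base
# directory that does not contain the gmlog attribute.
# The default base directory is the one named in the text above.
missing_gmlog() {
    base="${1:-/usr/share/grid-monitoring/org.ndgf}"
    for f in "$base"/*/xrsl; do
        # Skip the unexpanded glob when no template exists
        [ -f "$f" ] || continue
        # -q: no output, -F: fixed string, so the parentheses are literal
        grep -qF '(gmlog = "gmlog")' "$f" || printf '%s\n' "$f"
    done
}
```

Calling missing_gmlog without arguments inspects the default directory; an empty result means every template already carries the attribute.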

  3. Data management requirements for NorduGrid: the test file used for the LFC/SRM/GridFTP tests must be created and managed manually.

Notice (19.10.2011): It is important that the file/LFC entry be created with the same credentials used by the Nagios monitoring node! A robot certificate will be used in the near future; checks with it against the dCache and LFC nodes used by NGI_DE are still to be carried out. For the time being, for the ARC tests in the test system (this only affects NGI_CH!), (DPM) and (LFC) are used instead.

Current Status 19.10.2011

On, File /etc/grid-monitoring/org.ndgf.conf has been customized to enable the ARC tests. For the time being (DPM) and (LFC) are used.

Dimitri will provide access to the official LFC and SRM services by the production probes.

Migration to ARC 1:

1) Because of further changes had to be made by hand in files




to ensure LFC tests run successfully (this is backward compatible with ARC 0.8). Currently working fine on

2) (ARC) passive tests always pending and flagged as "Service is not scheduled to be checked"

This is a limitation of the probe, which forms the name of these services only based on the VO name and not the VO FQAN. The only way to fix this right now is to NOT use FQAN: is VO_OPS_NCG_DEFAULT_VO_FQAN="/ops/NGI/Germany" in the site-info.def the problem? (to be checked)

To be done:

1) access to the production system to be granted to Alessandro (before Monday October the 24th). (update Nov 11, 2011: DONE)

2) change of the site-info.def file in both the test and production systems, to grant admin rights to Alessandro (Alessandro) (update Nov 11, 2011: DONE)

3) update 14 on test and production systems: to be discussed. (update Nov 11, 2011: DONE)

4) ARC enabling on the production system (Alessandro). (update Nov 11, 2011: POSTPONED)

5) VO_OPS_NCG_DEFAULT_VO_FQAN check (Alessandro) (update Nov 11, 2011: POSTPONED)

6) dCache access: once the robot certificate is used, this should not be a problem any more. (update Nov 11, 2011: SHOULD BE FIXED)

UPDATE 14, Nov. 8, 2011

copy robot-cert to /etc/nagios/globus/
chown -R nagios.nagios /etc/nagios/globus/
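
After copying the robot credentials it can be worth confirming that both files are present and that the private key is not readable by group or others. The sketch below is an illustration only: the function name check_robot_creds is hypothetical, the file names robot-cert.pem and robot-key.pem are taken from the site-info.def diff further down, the 400/600 mode requirement is the usual Globus expectation rather than something stated on this page, and stat -c assumes GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper: verify the robot certificate and key in a directory.
# Returns 0 when both files are readable and the key mode is 400 or 600.
check_robot_creds() {
    dir="${1:-/etc/nagios/globus}"
    rc=0
    for f in "$dir/robot-cert.pem" "$dir/robot-key.pem"; do
        if [ ! -r "$f" ]; then
            echo "missing or unreadable: $f"
            rc=1
        fi
    done
    key="$dir/robot-key.pem"
    if [ -f "$key" ]; then
        # GNU stat: %a prints the permission bits in octal
        mode=$(stat -c %a "$key")
        case "$mode" in
            400|600) ;;
            *) echo "key mode $mode is too open: $key"; rc=1 ;;
        esac
    fi
    return $rc
}
```

Run as check_robot_creds with no argument to inspect /etc/nagios/globus; any printed line points at a file that needs attention.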

Change in yum, sl-security.repo ->


update packages and apply patch
yum update
reboot  (for the kernel change)
mysql -u root -D mrs -p < SAM_2009_patch.sql
note: mddb-parser missing
[root@ngi-de-nagios ~]# diff site-info.def site-info.def.11082011.SAM13
< VOS="ops"
> VOS="ops dteam"
> # ======== DTEAM ========
> 'dteam 15004 \
> /DC=ch/DC=cern/OU=computers/ dteam 24' \
> 'dteam 15004 \
> /DC=ch/DC=cern/OU=computers/ dteam 24' \
> 'dteam 15004 \
> /C=GR/O=HellasGrid/ dteam 24' \
> 'dteam 15004 \
> /C=GR/O=HellasGrid/ dteam 24'"
> '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' \
> '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' \
> '/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006' \
> '/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'"
> VO_DTEAM_WMS_HOSTS="ngi-de-monitoring-wms.$MY_DOMAIN ngi-de-monitoring-2-wms.$MY_DOMAIN"
> #
< NAGIOS_ADMIN_DNS="/C=DE/O=GermanGrid/OU=FZK/CN=Foued Jrad,/DC=com/DC=quovadisglobal/DC=grid/DC=switch/DC=users/C=CH/O=SWITCH/CN=Alessandro Usai,/C=DE/O=GermanGrid/OU=KIT/CN=Pavel Weber,/O=GermanGrid/OU=FZK/CN=Dimitri Nilsen"
> NAGIOS_ADMIN_DNS="/C=DE/O=GermanGrid/OU=KIT/CN=Angela Poschlad,/C=DE/O=GermanGrid/OU=FZK/CN=Foued Jrad,/C=DE/O=GridGermany/OU=Leibniz-Rechenzentrum/CN=Ilya Saverchenko"
< # ARC related variable.
< VO_OPS_ROBOT_CERT=/etc/nagios/globus/robot-cert.pem
< VO_OPS_ROBOT_KEY=/etc/nagios/globus/robot-key.pem
> #VO_OPS_ROBOT_CERT=/etc/nagios/globus/robot-cert.pem
> #VO_OPS_ROBOT_KEY=/etc/nagios/globus/robot-key.pem
< #====================== Unicore ===============================
< #Password used for protecting user credential keystore
< #Password used for protecting truststore (if not defined
< #UNICORE_KEYSTORE_PASS will be used)
< #Alias of user credential (default: mon-agent)

run yaim
/opt/glite/yaim/bin/yaim -s /root/site-info.def -c -n glite-UI -n glite-NAGIOS 

(note: warnings such as "Warning: Unable to load '/usr/share/zoneinfo//posix/Mideast/Riyadh87' as time zone. Skipping it." are normal)

reboot (to fix the lcg-util environment problem for the gLite UI configuration of the node)

old (stale) jobs were removed manually (using find /var/lib/gridprobes/ops.NGI.Germany/org.sam/ -name -mmin +180 | xargs rm -rf)
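
Before deleting anything it is safer to list the stale entries first and review them. The following sketch is an assumption-laden variant of the cleanup above, not the command that was actually run: the function name list_stale_jobs is hypothetical, it looks only at the top-level entries of the given directory instead of using the name pattern (which is missing from the command as recorded), and -mindepth/-maxdepth require GNU find.

```shell
#!/bin/sh
# Hypothetical helper: list top-level entries of a job directory whose
# modification time is older than a given number of minutes (default 180).
list_stale_jobs() {
    dir="$1"
    minutes="${2:-180}"
    # -mindepth 1 keeps the base directory itself out of the result;
    # -maxdepth 1 avoids descending into the job directories.
    find "$dir" -mindepth 1 -maxdepth 1 -mmin "+$minutes" -print
}
```

Once the list looks right, the same find expression can be combined with -print0 and xargs -0 rm -rf so that unusual file names are handled safely.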
rpm differences

in production and test system because of dependency problems in the test system


UPDATE 15, Jan. 31, 2012

Changes in site-info.def:


Changes in /etc/ncg/ncg.conf in section:


the variable for UNICORE probes is set to true globally, BUT:

ENABLE_UNICORE_PROBES=0 i.e. disabled for NGI_CH (while still true for NGI-DE).

ARC probes remain disabled until release 17: they are included in the release but do not work properly.

On ngi-de-nagios the kernel was also updated to 2.6.18-274.17.1.el5.

The host was rebooted and yaim was run. Afterwards npcd and msg-to-handler were not running, so they were started manually:

/etc/init.d/npcd start
/etc/init.d/msg-to-handler start
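
Because these two daemons did not come back on their own after the reboot and yaim run, a small post-update check may be useful. This is only a sketch: the function name ensure_running and the INIT_DIR variable are hypothetical, and it assumes that both init scripts implement a status action whose exit code is 0 when the service is running, which is conventional for init scripts but not confirmed on this page.

```shell
#!/bin/sh
# Directory holding the init scripts; overridable for testing (assumption).
INIT_DIR="${INIT_DIR:-/etc/init.d}"

# Hypothetical helper: for each named service, ask the init script for its
# status and start the service only when the status check fails.
ensure_running() {
    for svc in "$@"; do
        if "$INIT_DIR/$svc" status >/dev/null 2>&1; then
            echo "$svc: running"
        else
            echo "$svc: not running, starting"
            "$INIT_DIR/$svc" start
        fi
    done
}
```

After an update this would be called as ensure_running npcd msg-to-handler.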


  • A problem with the EGI message broker caused JobSubmit errors for all CREAM CEs in the NGI, starting on 19.11 at about 18:20. Corrected with a patch:
[root@rnagios]# cat /etc/ncg/ncg.localdb
# Local Rules file to modify NCG configuration



  • All CREAM CEs are green again. Open question: should we keep the changes in ncg.localdb, or remove them so that other brokers can be used?


  • The problem with UNIGE-DPNC was fixed by adding and changing the site BDII name in /opt/glite/var/tmp/gip/ngi/ngi-urls on the ngi-de-monitoring-bdii host. To be discussed: how to upgrade the NGI BDII and how to automate updating the BDII list for NGI-DE and NGI-CH only.


  • The UNI-DORTMUND site BDII was changed. The list in /opt/glite/var/tmp/gip/ngi/ngi-urls on ngi-de-monitoring was updated:
#UNI-DORTMUND ldap://,o=grid
UNI-DORTMUND ldap://,o=grid

and bdii restarted. Ticket
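
The manual edit above (comment out the old entry, add the new one) is the kind of step that could be scripted as a first move towards the automation mentioned earlier. The sketch below is illustrative only: the function name replace_site_url is hypothetical, and it assumes each active line of ngi-urls has the form "SITE URL" with the site name as the first whitespace-separated field, as in the UNI-DORTMUND entry. It does not restart the BDII.

```shell
#!/bin/sh
# Hypothetical helper: comment out the active entry for a site in an
# ngi-urls style file and append a new entry with the given URL.
replace_site_url() {
    file="$1"; site="$2"; url="$3"
    tmp=$(mktemp) || return 1
    # Prefix the matching active line with '#'; copy all other lines as is.
    awk -v site="$site" '$1 == site { print "#" $0; next } { print }' \
        "$file" > "$tmp" || return 1
    printf '%s %s\n' "$site" "$url" >> "$tmp"
    # Write back through cat so the original file keeps its owner and mode.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

A call would look like replace_site_url /opt/glite/var/tmp/gip/ngi/ngi-urls UNI-DORTMUND followed by the new LDAP URL of the site BDII; the BDII still has to be restarted afterwards, as described above.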


Update ngi-de-nagios to release 19 29.01.2013

  • yum update --exclude sam-gridmon
  • Problems during the update appear to be related to the patches applied earlier: the PI interface is missing again

  • After yaim, the patches listed in the ticket were applied again. Let us wait a few hours to see if it helps

  • Related ticket before for rocmon: