
Difference between revisions of "NGI DE CH Operations Center:Monitoring"

*Installed http://www.sysadmin.hep.ac.uk/rpms/egee-SA1/centos5/x86_64/grid-monitoring-probes-org.sam-0.5.7-1.el5.noarch.rpm on the production system. This RPM (which will be distributed with update 19) was released to fix the SL5/SL6 64-bit WN tests problem, which affected DESY; see ticket https://helpdesk.ngi-de.eu/?mode=ticket_info&ticket_id=2730.
The same patch was previously installed and tested successfully on ROCMON.

Revision as of 10:49, 21 November 2012


NGI-DE NGI-CH Monitoring

Mailing list

ngi-de-monitoring@lists.kit.edu

Participants and Shifts

Dimitri Nilsen (KIT)
Foued Jrad (KIT)
Pavel Weber (KIT)
Alessandro Usai (SWITCH)
Simon Leinen (SWITCH)

Shifts

Even weeks: KIT
Odd weeks: SWITCH

Reports are sent on Fridays to ngi-de-monitoring@lists.kit.edu
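
The even/odd rule can be checked from the command line. The following is a minimal sketch, assuming that "week" means the ISO week number as printed by date +%V (the page does not say which numbering is used); the function name shift_team is made up for illustration.

```shell
# Print which team is on shift for a given week number.
# Even weeks: KIT, odd weeks: SWITCH.
shift_team() {
    # Strip one leading zero so that "08" is not parsed as an invalid octal number.
    week=${1#0}
    if [ $(( week % 2 )) -eq 0 ]; then
        echo "KIT"
    else
        echo "SWITCH"
    fi
}

# Team for the current ISO week (assumption: ISO numbering applies).
shift_team "$(date +%V)"
```

Calling shift_team 47 prints SWITCH, and shift_team 08 prints KIT.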

Plan for ARC Testing set up in Nagios 15.9.11

  1. Customize the file /etc/grid-monitoring/org.ndgf.conf with the NGI services.
  2. NorduGrid Logging Improvement:
    1. Edit the xRSL template files (in /usr/share/grid-monitoring/org.ndgf/<SERVICE>/xrsl) for all the services and add (gmlog = "gmlog") to them, e.g.
more /usr/share/grid-monitoring/org.ndgf/lfc/xrsl

(executable = "testjob.sh")
(jobname = "lfc")
(stdout = "testjob.out")
(gmlog = "gmlog")
(stderr = "testjob.err")
(inputfiles = ("testjob.sh" "/usr/share/grid-monitoring/org.ndgf/lfc/testjob.sh")
              ("file" "%LFC_TESTFILE%"))
(outputfiles = ("testjob.out" "")("testjob.err" "")
               ("outfile" "%LFC_STORAGE_W%/%HOST%-lfc-%TIME%"))
(walltime = "15 min")
(memory = "256")

This ensures that, in case of error, the gmlog (useful for debugging) is sent back as part of the output sandbox. The affected files in /usr/share/grid-monitoring/org.ndgf are: gridftp/xrsl, jobsubmit/xrsl, lfc/xrsl, rls/xrsl, srm/xrsl
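
The template edit could be scripted roughly as follows. This is a minimal sketch, assuming the templates are flat lists of xRSL relations like the lfc example, so that appending one more relation at the end is equivalent to inserting it in the middle. The helper name add_gmlog is made up for illustration, and it is worth backing up the templates first.

```shell
# Append (gmlog = "gmlog") to an xRSL template unless it is already there.
add_gmlog() {
    file="$1"
    if grep -q '^(gmlog *= *"gmlog")' "$file"; then
        return 0    # relation already present, leave the file untouched
    fi
    printf '%s\n' '(gmlog = "gmlog")' >> "$file"
}

# Apply to the five templates named in the text; missing files are skipped.
for svc in gridftp jobsubmit lfc rls srm; do
    f="/usr/share/grid-monitoring/org.ndgf/$svc/xrsl"
    if [ -f "$f" ]; then
        add_gmlog "$f"
    fi
done
```

Running it twice is harmless, because the grep check makes the edit idempotent.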

  3. Data management requirements for NorduGrid: the testfile used for the LFC/SRM/GridFTP tests must be created/managed manually.

Notice (19.10.2011): It is important that the file/LFC entry be created with the same credentials used by the Nagios monitoring node! A robot certificate will be used in the near future; checks against the dCache and LFC nodes used by NGI_DE are still to be carried out with it. For the time being, for the ARC tests in the test system (this only affects NGI_CH!), feronia.switch.ch (DPM) and lodur.switch.ch (LFC) are used instead.

Current Status 19.10.2011

On rocmon-fzk.gridka.de, the file /etc/grid-monitoring/org.ndgf.conf has been customized to enable the ARC tests. For the time being feronia.switch.ch (DPM) and lodur.switch.ch (LFC) are used.

Dimitri will provide access to the official LFC and SRM services for the production probes.

Migration to ARC 1:

1) Because of https://ggus.eu/tech/ticket_show.php?ticket=72260 further changes had to be made by hand in files

/etc/grid-monitoring/org.ndgf.conf

/usr/libexec/grid-monitoring/probes/org.ndgf/ARCCE-lfc

/usr/share/grid-monitoring/org.ndgf/lfc/xrsl

to ensure LFC tests run successfully (this is backward compatible with ARC 0.8). Currently working fine on rocmon-fzk.gridka.de.

2) (ARC) passive tests always pending and flagged as "Service is not scheduled to be checked"

This is a limitation of the probe, which forms the names of these services from the VO name only, not from the VO FQAN. The only way to fix this right now is NOT to use an FQAN. Is VO_OPS_NCG_DEFAULT_VO_FQAN="/ops/NGI/Germany" in site-info.def the problem? (to be checked)


To be done:

1) access to the production system ngi-de-nagios.gridka.de to be granted to Alessandro (before Monday October the 24th). (update Nov 11, 2011: DONE)

2) change of the site-info.def file in both the test and production systems, to grant admin rights to Alessandro (Alessandro) (update Nov 11, 2011: DONE)

3) update 14 on test and production systems: to be discussed. (update Nov 11, 2011: DONE)

4) ARC enabling on the production system (Alessandro). (update Nov 11, 2011: POSTPONED)

5) VO_OPS_NCG_DEFAULT_VO_FQAN check (Alessandro) (update Nov 11, 2011: POSTPONED)

6) dCache access: once the robot certificate is used, this should not be a problem any more. (update Nov 11, 2011: SHOULD BE FIXED)

UPDATE 14, Nov. 8, 2011

https://tomtools.cern.ch/confluence/display/SAMDOC/Update-14

copy robot-cert to /etc/nagios/globus/
chown -R nagios.nagios /etc/nagios/globus/


Change in yum, sl-security.repo ->

priority=1

update packages and apply patch
yum update
reboot  (for the kernel change)
mysql -u root -D mrs -p < SAM_2009_patch.sql
note: mddb-parser missing
site-info.def
[root@ngi-de-nagios ~]# diff site-info.def site-info.def.11082011.SAM13
46c46
< VOS="ops"
---
> VOS="ops dteam"
58a59,76
> # ======== DTEAM ========
> VO_DTEAM_VOMS_SERVERS='vomss://voms.hellasgrid.gr:8443/voms/dteam?/dteam/'
> VO_DTEAM_VOMSES="\
> 'dteam lcg-voms.cern.ch 15004 \
> /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch dteam 24' \
> 'dteam voms.cern.ch 15004 \
> /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch dteam 24' \
> 'dteam voms.hellasgrid.gr 15004 \
> /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr dteam 24' \
> 'dteam voms2.hellasgrid.gr 15004 \
> /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr dteam 24'"
> VO_DTEAM_VOMS_CA_DN="\
> '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' \
> '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' \
> '/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006' \
> '/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'"
> VO_DTEAM_WMS_HOSTS="ngi-de-monitoring-wms.$MY_DOMAIN ngi-de-monitoring-2-wms.$MY_DOMAIN"
> #
68c86
< NAGIOS_ADMIN_DNS="/C=DE/O=GermanGrid/OU=FZK/CN=Foued Jrad,/DC=com/DC=quovadisglobal/DC=grid/DC=switch/DC=users/C=CH/O=SWITCH/CN=Alessandro Usai,/C=DE/O=GermanGrid/OU=KIT/CN=Pavel Weber,/O=GermanGrid/OU=FZK/CN=Dimitri Nilsen"
---
> NAGIOS_ADMIN_DNS="/C=DE/O=GermanGrid/OU=KIT/CN=Angela Poschlad,/C=DE/O=GermanGrid/OU=FZK/CN=Foued Jrad,/C=DE/O=GridGermany/OU=Leibniz-Rechenzentrum/CN=Ilya Saverchenko"
128d145
<
130,133c147,148
< NCG_HASH_CONFIG_PROFILES=ngi
< #NCG_HASH_CONFIG_PROFILES=ngi, arc
<
< NCG_PROFILE_FQAN_ngi=ops
---
> #NCG_HASH_CONFIG_PROFILES=ngi,arc
> #NCG_PROFILE_FQAN_ngi=ops
135d149
<
143,145c157
<
< # ARC related variable.
< #VO_OPS_NCG_DEFAULT_VO_FQAN="/ops/NGI/Germany"
---
> VO_OPS_NCG_DEFAULT_VO_FQAN="/ops/NGI/Germany"
170c182
< NCG_USE_ROBOT_CERT=true
---
> #NCG_USE_ROBOT_CERT=true
173,174c185,186
< VO_OPS_ROBOT_CERT=/etc/nagios/globus/robot-cert.pem
< VO_OPS_ROBOT_KEY=/etc/nagios/globus/robot-key.pem
---
> #VO_OPS_ROBOT_CERT=/etc/nagios/globus/robot-cert.pem
> #VO_OPS_ROBOT_KEY=/etc/nagios/globus/robot-key.pem
180,192d191
<
< #====================== Unicore ===============================
<
< #ENABLE_UNICORE_PROBES=true
< #Password used for protecting user credential keystore
< #UNICORE_KEYSTORE_PASS=applePear
<
< #Password used for protecting truststore (if not defined
< #UNICORE_KEYSTORE_PASS will be used)
< #UNICORE_TRUSTSTORE_PASS=melonPeach
<
< #Alias of user credential (default: mon-agent)
< #UNICORE_KEYSTORE_ALIAS=mon-agent


run yaim
/opt/glite/yaim/bin/yaim -s /root/site-info.def -c -n glite-UI -n glite-NAGIOS 

(note: warnings such as "Warning: Unable to load '/usr/share/zoneinfo//posix/Mideast/Riyadh87' as time zone. Skipping it." are normal)

reboot (to fix the lcg-util environment problem for the gLite UI configuration of the node)


old (stale) jobs were removed manually (using find /var/lib/gridprobes/ops.NGI.Germany/org.sam/ -name activejob.map -mmin +180 | xargs rm -rf)
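
A slightly more cautious variant of the cleanup is sketched below: list the stale files first and delete only after checking the list. It assumes GNU find (the -mmin test, which the original command also uses, is not POSIX); the function name list_stale_maps is made up for illustration.

```shell
# List activejob.map files last modified more than N minutes ago.
list_stale_maps() {
    dir="$1"
    minutes="$2"
    find "$dir" -type f -name activejob.map -mmin +"$minutes" -print
}

# Dry run against the directory named in the text, if it exists.
JOBDIR=/var/lib/gridprobes/ops.NGI.Germany/org.sam
if [ -d "$JOBDIR" ]; then
    list_stale_maps "$JOBDIR" 180
    # After reviewing the list, the same test can delete the files:
    # find "$JOBDIR" -type f -name activejob.map -mmin +180 -delete
fi
```

Restricting the match with -type f avoids handing directories to the delete step, which the original pipe into xargs rm -rf would not prevent.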
rpm differences

between the production and test systems, because of dependency problems on the test system:

kernel-module-openafs-*


UPDATE 15, Jan. 31, 2012

Changes in site-info.def:

  • ENABLE_UNICORE_PROBES=true

Changes in /etc/ncg/ncg.conf, in the NCG::SiteSet section:

ENABLE_UNICORE_PROBES=true
<NCG::SiteSet>
  <GOCDB>
    ROC=NGI_DE
    GOCDB_ROOT_URL=https://goc.egi.eu/gocdbpi/
    ENABLE_UNICORE_PROBES=$ENABLE_UNICORE_PROBES
  </GOCDB>
  <GOCDB>
    ROC=NGI_CH
    GOCDB_ROOT_URL=https://goc.egi.eu/gocdbpi/
    ENABLE_UNICORE_PROBES=0
  </GOCDB>
  <File>
      DB_FILE=/etc/ncg/ncg.localdb
      DB_DIRECTORY=/etc/ncg/ncg-localdb.d
  </File>
</NCG::SiteSet>

The variable for UNICORE probes is set to true globally, BUT:

ENABLE_UNICORE_PROBES=0 in the NGI_CH block, i.e. the probes are disabled for NGI_CH (while still enabled for NGI_DE).
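
To confirm which ROC ends up with which setting, the GOCDB blocks can be summarised with a short awk filter. This is a minimal sketch, assuming the file keeps one KEY=VALUE per line inside each GOCDB block as in the fragment shown here; the function name unicore_per_roc is made up for illustration.

```shell
# Print "ROC value" for every <GOCDB> block in an ncg.conf-style file.
unicore_per_roc() {
    awk '
        /<GOCDB>/ { roc = ""; val = "(unset)" }      # a new block starts
        /^[ \t]*ROC=/ { sub(/^[ \t]*ROC=/, ""); roc = $0 }
        /^[ \t]*ENABLE_UNICORE_PROBES=/ {
            sub(/^[ \t]*ENABLE_UNICORE_PROBES=/, ""); val = $0
        }
        /<\/GOCDB>/ { print roc, val }               # block ends, report it
    ' "$1"
}

if [ -r /etc/ncg/ncg.conf ]; then
    unicore_per_roc /etc/ncg/ncg.conf
fi
```

For the fragment shown here it prints the literal values as written in the file: NGI_DE $ENABLE_UNICORE_PROBES and NGI_CH 0. The variable reference is not expanded by this filter.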

ARC probes remain disabled until release 17: they are included in the release but do not work properly.

On ngi-de-nagios the kernel was also updated, to 2.6.18-274.17.1.el5.

The host was rebooted and yaim was run. Afterwards npcd and msg-to-handler were not running and had to be started by hand:

/etc/init.d/npcd start
/etc/init.d/msg-to-handler start
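
A check of this kind can be repeated after every reboot. The sketch below assumes that both init scripts accept a status argument and exit with 0 when the service is running (the usual LSB convention, not confirmed on this page). INIT_DIR and ensure_running are names made up for illustration.

```shell
# Directory holding the init scripts; overridable, e.g. for a dry run elsewhere.
INIT_DIR="${INIT_DIR:-/etc/init.d}"

# Start a service only if its init script reports that it is not running.
ensure_running() {
    svc="$1"
    if "$INIT_DIR/$svc" status >/dev/null 2>&1; then
        echo "$svc: running"
    else
        echo "$svc: not running, starting"
        "$INIT_DIR/$svc" start
    fi
}

for s in npcd msg-to-handler; do
    if [ -x "$INIT_DIR/$s" ]; then
        ensure_running "$s"
    fi
done
```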

20.11.12

  • A problem with the EGI message broker caused JobSubmit errors for all CREAM CEs in the NGI, starting on 19.11 at about 18:20. Corrected with the following patch:
[root@rnagios]# cat /etc/ncg/ncg.localdb
#
# Local Rules file to modify NCG configuration
#
MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--mb-uri!stomp://egi-1.msg.cern.ch:6163/

followed by running ncg.reload.sh
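
The rule line consists of fields separated by "!", which in this case appear to be the action, the metric name, the parameter and its new value. The sketch below adds such a rule only if it is not already in the file, so the step can be repeated safely. The function name add_localdb_rule is made up for illustration, and the reload still has to be run afterwards as described here.

```shell
# Append a rule to an ncg.localdb-style file unless the exact line exists.
add_localdb_rule() {
    file="$1"
    rule="$2"
    # -F: fixed string, -x: whole-line match, so "!" and "/" need no escaping.
    grep -qxF -- "$rule" "$file" 2>/dev/null || printf '%s\n' "$rule" >> "$file"
}

# Example with the rule from the text (uncomment on the Nagios host):
# add_localdb_rule /etc/ncg/ncg.localdb \
#     'MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--mb-uri!stomp://egi-1.msg.cern.ch:6163/'
```

Removing the rule later, for example to allow other brokers again, is then a matter of deleting that single line and reloading.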

21.11.12

  • All CREAM CEs are green again. Open question: should we keep the changes in localdb, or remove them so that other brokers can be used?
