Difference between revisions of "NGI DE CH Operations Center:Monitoring"

Revision as of 15:53, 19 October 2011


NGI-DE NGI-CH Monitoring

Mailinglist

ngi-de-monitoring@lists.kit.edu

Participants

Dimitri Nilsen (KIT) Foued Jrad (KIT) Alessandro Usai (SWITCH) Andres Aeschlimann (SWITCH)

Plan for ARC Testing set up in Nagios 15.9.11

  1. Customize the file /etc/grid-monitoring/org.ndgf.conf with the NGI services.
  2. NorduGrid Logging Improvement:
    1. Edit the xRSL template files (/usr/share/grid-monitoring/org.ndgf/<SERVICE>/xrsl) for all services and add (gmlog = "gmlog") to each of them, e.g.
more /usr/share/grid-monitoring/org.ndgf/lfc/xrsl

(executable = "testjob.sh")
(jobname = "lfc")
(stdout = "testjob.out")
(gmlog = "gmlog")
(stderr = "testjob.err")
(inputfiles = ("testjob.sh" "/usr/share/grid-monitoring/org.ndgf/lfc/testjob.sh")
              ("file" "%LFC_TESTFILE%"))
(outputfiles = ("testjob.out" "")("testjob.err" "")
               ("outfile" "%LFC_STORAGE_W%/%HOST%-lfc-%TIME%"))
(walltime = "15 min")
(memory = "256")

This ensures that, in case of an error, the gmlog (useful for debugging) is sent back as part of the output sandbox. The template files in /usr/share/grid-monitoring/org.ndgf are: gridftp/xrsl, jobsubmit/xrsl, lfc/xrsl, rls/xrsl, srm/xrsl
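The manual edit described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names add_gmlog and patch_templates are hypothetical, the attribute is placed right after the stdout line only because the lfc example above does so, and the script should be tried on a copy of the templates first.

```python
import re
from pathlib import Path

GMLOG_ATTR = '(gmlog = "gmlog")'
# Template directories listed in the text above.
SERVICES = ("gridftp", "jobsubmit", "lfc", "rls", "srm")


def add_gmlog(xrsl_text):
    """Return the xRSL text with (gmlog = "gmlog") added, unless already present."""
    if re.search(r'\(\s*gmlog\s*=', xrsl_text, re.IGNORECASE):
        return xrsl_text  # nothing to do
    lines = xrsl_text.splitlines()
    for i, line in enumerate(lines):
        # Assumption: put the attribute after the stdout line, as in the lfc example.
        if re.match(r'\s*\(\s*stdout\s*=', line, re.IGNORECASE):
            lines.insert(i + 1, GMLOG_ATTR)
            break
    else:
        # No stdout line found: append the attribute at the end instead.
        lines.append(GMLOG_ATTR)
    return "\n".join(lines) + "\n"


def patch_templates(base_dir):
    """Patch <base_dir>/<service>/xrsl in place; return the paths that were changed."""
    changed = []
    for service in SERVICES:
        path = Path(base_dir) / service / "xrsl"
        if not path.is_file():
            continue  # skip services that are not installed
        original = path.read_text()
        patched = add_gmlog(original)
        if patched != original:
            path.write_text(patched)
            changed.append(str(path))
    return changed
```

Calling patch_templates("/usr/share/grid-monitoring/org.ndgf") would then touch only the templates that do not yet contain a gmlog attribute, so running it again is harmless.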

  3. Data management requirements for NorduGrid: the test file used for the LFC/SRM/GridFTP tests must be created and managed manually.

Notice (19.10.2011): The file and its LFC entry must be created with the same credentials as those used by the Nagios monitoring node. A robot certificate will be used in the near future; it still has to be checked against the dCache and LFC nodes used by NGI_DE. For the time being, the ARC tests in the test system (this only affects NGI_CH) use feronia.switch.ch (DPM) and lodur.switch.ch (LFC) instead.

Current Status 19.10.2011

On rocmon-fzk.gridka.de, the file /etc/grid-monitoring/org.ndgf.conf has been customized to enable the ARC tests. For the time being, feronia.switch.ch (DPM) and lodur.switch.ch (LFC) are used.

Dimitri will give the production probes access to the official LFC and SRM services.

Migration to ARC 1:

1) Because of https://ggus.eu/tech/ticket_show.php?ticket=72260, further changes had to be made by hand in the following files:

/etc/grid-monitoring/org.ndgf.conf

/usr/libexec/grid-monitoring/probes/org.ndgf/ARCCE-lfc

/usr/share/grid-monitoring/org.ndgf/lfc/xrsl

to ensure that the LFC tests run successfully (the changes are backward compatible with ARC 0.8). This currently works fine on rocmon-fzk.gridka.de.

2) The ARC passive tests are always pending and flagged as "Service is not scheduled to be checked".

This is a limitation of the probe, which builds the names of these services from the VO name only, not from the VO FQAN. The only fix right now is not to use an FQAN. Is VO_OPS_NCG_DEFAULT_VO_FQAN="/ops/NGI/Germany" in site-info.def the cause? (to be checked)
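As an aid for the check mentioned above, the following sketch reads the current value of the variable from the contents of a site-info.def file. It assumes the file uses plain shell-style NAME=value assignments, one per line; the function name read_setting is hypothetical, and the sketch does not confirm whether the FQAN is actually the cause of the pending tests.

```python
import shlex

FQAN_KEY = "VO_OPS_NCG_DEFAULT_VO_FQAN"


def read_setting(site_info_text, key=FQAN_KEY):
    """Return the value assigned to key, or None if the key is not set.

    Assumes simple shell-style NAME=value lines; the last assignment wins,
    as it would when the file is sourced by a shell.
    """
    value = None
    for raw in site_info_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, rest = line.partition("=")
        if sep and name.strip() == key:
            # shlex removes the surrounding quotes and any trailing comment.
            parts = shlex.split(rest, comments=True)
            value = parts[0] if parts else ""
    return value
```

For example, read_setting(open("site-info.def").read()) returns the configured FQAN string if the variable is set, and None if it is absent, which is the state the text suggests trying.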


To be done:

1) access to the production system ngi-de-nagios.gridka.de to be granted to Alessandro (before Monday, October 24th).

2) change of the site-info.def file in both the test and production systems, to grant admin rights to Alessandro (Alessandro)

3) update 14 on test and production systems: to be discussed

4) ARC enabling on the production system (Alessandro)

5) VO_OPS_NCG_DEFAULT_VO_FQAN check (Alessandro)

6) dCache access: once the robot certificate is used, this should not be a problem any more.