DMSU topics gridops meeting
Back to EGI_DMSU_Documentation
Grid Operations Meeting 28 January 2013
LFC-Oracle problem
Details GGUS #90701
The error is occurring with EMI2 emi-lfc_oracle-1.8.5-1.el5 and Oracle 11:
# lfc-ls lfc-1-kit:/grid
send2nsd: NS002 - send error : client_establish_context: The server had a problem while authenticating our connection
lfc-1-kit:/grid: Could not secure the connection
Experts suspect it is due to the use of the Oracle 11 client while the LFC code has been compiled against the Oracle 10 API. The LFC developers expect to provide rpms built against Oracle 11 shortly.
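As a quick check on an affected node, one could list the installed LFC and Oracle client packages; this is only a sketch, and the package name patterns are an assumption that may differ between installations:

# list installed LFC and Oracle instant client packages; the name patterns are a guess
rpm -qa | egrep -i 'lfc|oracle-instantclient' | sort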
list-match problem with EMI2 WMS
Details GGUS #90240
Some CEs have enabled only a group or a role in their queues, not the entire VO:
GlueCEAccessControlBaseRule: VOMS:/gridit/ansys
GlueCEAccessControlBaseRule: VOMS:/gridit/ansys/Role=SoftwareManager
so, when your primary attribute is:
attribute : /gridit/ansys/Role=NULL/Capability=NULL
if you use an EMI-2 WMS, you cannot match those resources (whereas you can if you use an EMI-1 WMS)
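To verify which access control rules a CE actually publishes, its resource BDII can be queried directly; a minimal sketch, assuming the standard resource BDII port 2170 and using a placeholder hostname:

# list the GlueCEAccessControlBaseRule values published by a CE (replace the placeholder hostname)
ldapsearch -x -LLL -H ldap://ce.example.org:2170 -b o=grid '(objectClass=GlueCE)' GlueCEUniqueID GlueCEAccessControlBaseRule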
It seems that the problem lies in the value of WmsRequirements contained in the file /etc/glite-wms/glite_wms.conf: the filter set in that variable is different from the one used in the EMI-1 WMS. The developers are investigating it.
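As a quick way to compare the two behaviours, the expression can be printed on both an EMI-1 and an EMI-2 WMS and the outputs compared, for example:

# print the WmsRequirements expression from the WMS configuration (run on both flavours and diff the output)
grep 'WmsRequirements' /etc/glite-wms/glite_wms.conf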
proxy renewal problems on EMI1 WMS
Details GGUS #89801
Under some circumstances, ICE cannot renew the user credentials because of hanging glite-wms-ice-proxy-renew processes. It is believed that the culprit is this Savannah bug. The bug is already solved in EMI2.
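On an affected EMI-1 WMS, one possible way to spot the hanging processes is to list them together with their elapsed time, e.g.:

# list glite-wms-ice-proxy-renew processes with their elapsed time; long-running entries are candidates for the hang
ps -eo pid,etime,cmd | grep '[g]lite-wms-ice-proxy-renew'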
Problems with aliased DNS names of myproxy
Details GGUS #89105
DNS aliases of a myproxy server (e.g. those used to implement round-robin load balancing and/or high availability) may cause proxy-renewal problems when not all the DNS aliases, including the canonical name, are included in the SubjectAltName extension of the myproxy server's host certificate.
The failure does not always appear (it depends on multiple conditions, e.g. the Globus version); nevertheless, sites are encouraged to use certificates that cover all the DNS aliases.
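To check whether a myproxy host certificate covers all the aliases, its SubjectAltName extension can be inspected directly; a sketch, assuming the default myproxy port 7512 and a placeholder hostname:

# dump the certificate presented on the myproxy port and show its SubjectAltName entries
echo | openssl s_client -connect myproxy.example.org:7512 2>/dev/null | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'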
EMI-2 WN: yaim bug for cleanup-grid-accounts
Details GGUS #90486
Due to a bug, the cleanup-grid-accounts procedure does not work properly, so the disk space occupied on the WNs may grow.
The yaim function config_lcgenv unsets $INSTALL_ROOT, so the path used by the cleanup-grid-accounts cron job is not valid:
# cat /etc/cron.d/cleanup-grid-accounts
PATH=/sbin:/bin:/usr/sbin:/usr/bin
36 3 * * * root /sbin/cleanup-grid-accounts.sh -v >> /var/log/cleanup-grid-accounts.log 2>&1
# tail /var/log/cleanup-grid-accounts.log
/bin/sh: /sbin/cleanup-grid-accounts.sh: No such file or directory
# ls -l /sbin/cleanup-grid-accounts.sh
ls: /sbin/cleanup-grid-accounts.sh: No such file or directory
# ls -l /usr/sbin/cleanup-grid-accounts.sh
-rwxr-xr-x 1 root root 6747 May 16 2012 /usr/sbin/cleanup-grid-accounts.sh
Until the fix is released in production, a workaround can be applied by editing the cleanup-grid-accounts cron file to use the correct path, e.g.:
# cat /etc/cron.d/cleanup-grid-accounts
PATH=/sbin:/bin:/usr/sbin:/usr/bin
16 3 * * * root /usr/sbin/cleanup-grid-accounts.sh -v >> /var/log/cleanup-grid-accounts.log 2>&1
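Where many WNs are involved, the same correction could also be applied in place with a one-liner like the following (a sketch, to be run as root on each affected WN):

# point the cron job at /usr/sbin instead of /sbin; the leading space in the pattern keeps the edit idempotent
sed -i 's| /sbin/cleanup-grid-accounts.sh| /usr/sbin/cleanup-grid-accounts.sh|' /etc/cron.d/cleanup-grid-accounts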
Another workaround is also possible:
/opt/glite/yaim/bin/yaim -r -s -n WN -n TORQUE_client -n GLEXEC_wn -f config_users
This does not execute config_lcgenv, therefore $INSTALL_ROOT is set correctly (to /usr).
Grid Operations Meeting 19th November 2012
FTS jobs abort with "No site found for host xxx.yyy" error
Details GGUS #87929
From time to time, some FTS transfers fail with the message above. The problem was reported at CNAF, IN2P3, and GRIDKA, and was noticed by the ATLAS, CMS, and LHCb VOs. It appears and disappears in rather short and unpredictable intervals.
The exact reasons are not yet understood; we keep investigating. Reports from sites affected by a similar problem would be appreciated.
Update Nov 20: The user reports that both problems disappeared, probably fixed together.
LCMAPS-plugins-c-pep in glexec fails on RH6-based WNs
Details GGUS #88520
Due to the replacement of OpenSSL with NSS in RH6-based distributions, LCMAPS-plugins-c-pep invoked from glexec fails when talking to the Argus PEP via curl.
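On an RH6-based WN it is easy to verify which SSL backend curl was built against; if the version line mentions NSS, the node is subject to this issue:

# the first line of 'curl -V' shows the SSL library curl was built with (NSS on stock RHEL6/SL6)
curl -V | head -1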
This is a known issue, mentioned in the EMI glexec release notes; however, the workaround is not described there in a usable way.
Once we are sure that we understand it properly and that the fix works, it will be documented properly at the UMD pages and passed to the developers to:
- fix the documentation
- try to deploy the workaround automatically when an NSS-poisoned system is detected
UPDATE Nov 19th: the fix is now well explained in the known issues section and it will be included in a future yaim update
WMS does not work with ARC CE 2.0
Details GGUS #88630, further info Condor ticket #3062
The format of the jobid changed in ARC CE release 12. This is not recognised by Condor prior to version 7.8.3; however, the current EMI-1 WMS uses Condor 7.8.0. This breaks submission from the WMS to ARC CEs.
The problem therefore affects CMS SAM tests as well as CMS production jobs.
Updates to ARC CE 12 should therefore be done carefully until the Condor update is available from EMI.
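To check which Condor version a given WMS actually runs, something like the following can be used (the package name 'condor' is an assumption and may differ between installations):

# check the installed Condor version on the WMS; it should be at least 7.8.3 for ARC CE release 12
rpm -q condor
condor_version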
UPDATE Nov 26th: Condor 7.8.6 was installed on a test WMS, and submission to ARC seemed to work fine; since this WMS is no longer available, further, deeper tests should be performed, perhaps using the EMI-TESTBED infrastructure.
UPDATE Jan 24th: Job submission to ARC 2.0 now works using a WMS with Condor 7.8.6. The EMI-3 WMS will follow the EMI release schedule.
Grid Operations Meeting 5th November 2012
problem in retrieving the job output after EMI2 update 4 (WMS)
Details GGUS #87802
If a user is not using the myproxy service, she can retrieve the output without any problem. Otherwise, the problem occurs (see the details in that ticket).
- The user proxy is usually stored in the Sandboxdir, but if the user is using the myproxy service, that file is a symlink to the real file stored in the proxy renewal directory (/var/glite/spool/glite-renewd). When the job ends, that proxy is purged, so the user no longer has the permissions needed to retrieve the output.
- For the moment, a simple workaround is to submit a new job and, before it ends, retrieve the output of any previous job.
In order not to use the myproxy server, a user should specify the following empty parameter in the JDL file:
MyProxyServer = "";
otherwise the UI default setting is used.
UPDATE Nov 13th: there is a workaround, also reported in the known issues section of the "EMI2 update 4" page. Besides, this update is currently in the UMD staged rollout phase.