Grid Operations Meeting 2 June 2014
ARGUS/WMS Certificate Chain Mixups
- Affecting several sites, where the WMS is unable to make an SSL connection to ARGUS.
- In all probability this is a combination of using curl from the SL6 distribution, which is built with NSS rather than OpenSSL and, as such, does not really support proxy certificates, and a bug in Java, hopefully fixed since Java 7 Update 60 (see the quick check after this list).
- Related issues:
- This issue is already being investigated at 3rd level, but the PTs cannot decide who is responsible and the DMSU is overseeing.
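A quick way to check which SSL backend a host's curl is built against, and which Java update is installed (a minimal sketch; output formats vary by distribution):
curl --version | head -1        # the first line lists NSS/... or OpenSSL/... among the libraries
java -version 2>&1 | head -1    # compare against Java 7 Update 60, i.e. 1.7.0_60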
CREAM CLI/GridSite Segfaults with Long-Lived Proxies
- glite-ce-job-submit crashes if the user's proxy certificate has a lifetime exceeding 240 hours (10 days); a quick lifetime check is sketched below
- Cause tracked down to GridSite and forwarded to the GridSite PT to fix
- Related issue:
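A simple way to check whether an existing proxy exceeds the problematic lifetime, and to create one that stays below it (a minimal sketch; the 240-hour threshold is the one reported above):
voms-proxy-info -timeleft          # remaining lifetime in seconds; 240 hours is 864000
voms-proxy-init --valid 24:00      # request a 24-hour proxy, well below the threshold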
Grid Operations Meeting 5th May 2014
Issue with eu.egi.sec.DPM-GLUE2-EMI-1 probe
- The nagios probe eu.egi.sec.DPM-GLUE2-EMI-1 should be modified because it tries to detect some information that the new version of DPM no longer publishes (a way to inspect what a DPM publishes is sketched below)
- references: GGUS #104943, GGUS #105143
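To see what a given DPM head node actually publishes in GLUE2, its resource BDII can be queried directly (a minimal sketch; the hostname is a placeholder and the attribute list is just an example):
ldapsearch -x -LLL -H ldap://dpm-head.example.org:2170 -b 'o=glue' '(objectClass=GLUE2Endpoint)' GLUE2EndpointInterfaceName GLUE2EndpointImplementationVersion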
Grid Operations Meeting 28th October 2013
Unable to retrieve the output sandbox in a custom dir from a CREAM CE
see details in GGUS #98368
The command glite-ce-job-output with the --dir option fails when the specified path contains 'special' characters, such as '=' or '.':
$ glite-ce-job-output --dir /var/lib/gridprobes/lhcb.Role=production/emi.cream/CREAMCEDJS/ce208.cern.ch/jobOutput https://ce206.cern.ch:8443/CREAM023339075
2013-10-24 10:10:04,637 FATAL - Failed creation of directory [/var/lib/gridprobes/lhcb.Role=production/emi.cream/CREAMCEDJS/ce208.cern.ch/jobOutput/ce206.cern.ch_8443_CREAM023339075]: boost::filesystem::path: invalid name "lhcb.Role=production" in path: "/var/lib/gridprobes/lhcb.Role=production/emi.cream/CREAMCEDJS/ce208.cern.ch/jobOutput/ce206.cern.ch_8443_CREAM023339075"
This bug is present on the EMI-2 UI and the SL5 EMI-3 UI; the developers are investigating (#CREAM-128). A possible interim workaround is sketched below.
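A rough sketch of an interim workaround: retrieve the output into a directory without special characters and move it afterwards (untested; the paths reuse those from the example above):
glite-ce-job-output --dir /tmp/cream-out https://ce206.cern.ch:8443/CREAM023339075
mv /tmp/cream-out/* "/var/lib/gridprobes/lhcb.Role=production/emi.cream/CREAMCEDJS/ce208.cern.ch/jobOutput/"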
Grid Operations Meeting 23rd September 2013
Jobs aborted with the error "CREAM'S database has been scratched and all its jobs have been lost"
see details in GGUS #95559
Since Sep 13th (at least with the WMS servers at CNAF) almost all production jobs have been failing, mainly due to two bugs: for the first one, (almost) all the jobs in the ICE DB are marked with DB_ID=0; for the second one, a particular CE (prod-ce-01.pd.infn.it) was triggering the deletion of the jobs with DB_ID=0. All the WMS servers which contacted that CE are affected by this issue.
It was found that the CREAM CEs have been sending (since a certain date) an empty DB_ID as a result of an interoperability problem (missing SOAP_HEADER) between gSOAP and Axis2 (ICE uses gSOAP, CREAM uses Axis2 as SOAP framework).
The fix (CREAM-125) has already been committed: with the new version of glite-ce-cream-client-api-c, CREAM again sends a non-empty DB_ID to ICE in the JobRegister query.
Other tickets opened for this issue: GGUS #97360, GGUS #97402, GGUS #97420, GGUS #97453
Grid Operations Meeting 22nd July 2013
gridsite causes WMS delegation problems
see details in GGUS #95559
The latest release of gridsite (1.7.26 in EMI-2 and 2.1.0-1 in EMI-3) doesn't allow the '-' character to be used in the delegation of proxies when submitting through a WMS server.
This causes many intermittent errors in the delegation of proxies on both EMI-2 and EMI-3 WMS servers:
Warning - Unable to delegate the credential to the endpoint: https://wms014.cnaf.infn.it:7443/glite_wms_wmproxy_server Unknown Soap fault Method: Soap Error
and in the wmproxy.log there is the following error message:
09 Jul, 15:35:33 -I- PID: 982 - "wmpgsoapoperations::delegationns__putProxy": ------------------------------- Fault description --------------------------------
09 Jul, 15:35:33 -I- PID: 982 - "wmpgsoapoperations::delegationns__putProxy": Method: putProxy
09 Jul, 15:35:33 -I- PID: 982 - "wmpgsoapoperations::delegationns__putProxy": Code: 60
09 Jul, 15:35:33 -I- PID: 982 - "wmpgsoapoperations::delegationns__putProxy": Description: Proxy exception: Unable to store client Proxy
09 Jul, 15:35:33 -D- PID: 982 - "wmpgsoapoperations::delegationns__putProxy": Stack:
09 Jul, 15:35:33 -D- PID: 982 - "wmpgsoapoperations::delegationns__putProxy": ProxyOperationException: Proxy exception: Unable to store client Proxy at putProxy()[../../../src/security/delegation.cpp:176] at putProxy()[../../../src/security/delegation.cpp:153] at putProxy()[../../../src/server/operations.cpp:605]
09 Jul, 15:35:33 -I- PID: 982 - "wmpgsoapoperations::delegationns__putProxy": ----------------------------------------------------------------------------------
09 Jul, 15:35:33 -D- PID: 982 - "wmpgsoapoperations::delegationns__putProxy": putProxy operation completed
The developers are preparing the fix. In the meantime, as a temporary workaround, downgrade to the previous version of gridsite:
- On a WMS: yum downgrade gridsite gridsite-libs
- On a UI: yum downgrade gridsite-commands gridsite-libs
StoRM 1.11.1 performance problems
see details in GGUS #95700
Several load issues were found in the latest StoRM EMI-3/SL6 version: in particular, a lot of threads are generated, reaching the limit of 1024 set in /etc/security/limits.conf. Even after increasing the limit, the number of threads grows accordingly, and the service performance is very poor. The issue is under investigation.
It seems there are two problems. The main one is related to the interaction with MySQL: slow queries make the latency of transfer operations steadily increase, and because of the latency StoRM receives a lot of abort requests. This triggers the second problem: every abort command creates a new thread that stays alive until garbage collection. The fix for both problems is under testing.
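A quick way to see whether the backend is approaching the thread limit on an affected node (a minimal sketch; the process match storm-backend is an assumption, adjust it to the local installation):
for pid in $(pgrep -f storm-backend); do echo "PID $pid: $(ps -o nlwp= -p $pid) threads"; done
grep -r nproc /etc/security/limits.conf /etc/security/limits.d/ 2>/dev/null   # the effective per-user process/thread limit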
Grid Operations Meeting 24th June 2013
EMI-3/SL6 voms-proxy-info ignores X509_USER_PROXY
see details in GGUS #94878
The "voms-proxy-info" command provided in EMI-3 by the voms-clients3 rpm ignores the X509_USER_PROXY environment variable, and the following error is returned:
# export X509_USER_PROXY=/home/caifti/x509up_u3812
# echo $X509_USER_PROXY
/home/caifti/x509up_u3812
# voms-proxy-info -all
Proxy not found: /tmp/x509up_u500 (No such file or directory)
It works only when the proxy file is specified directly:
# voms-proxy-info -file x509up_u3812
subject   : /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Cristina Aiftimiei/CN=proxy
issuer    : /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Cristina Aiftimiei
identity  : /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Cristina Aiftimiei
type      : full legacy globus proxy
strength  : 1024
path      : /home/caifti/x509up_u3812
timeleft  : 02:20:30
key usage : Digital Signature, Key Encipherment, Data Encipherment
The fix will be released by the end of June; in the meantime, as a workaround, it is possible either to downgrade to the previous voms-clients version 2 present in the EMI-3 repo:
# rpm -e --nodeps voms-clients3
# yum install voms-clients
or to use the "--file" option.
Grid Operations Meeting 3rd April 2013
problems with EMI-3 StoRM
see details in GGUS #92819
For the moment the StoRM upgrade to EMI-3 (or an installation from scratch) is not recommended, because consecutive lcg-gt calls fail with a "Requested file is busy" error. For example:
$ lcg-cp -b file:/home/enol/std.out -U srmv2 srm://test27.egi.cesga.es:8444/srm/managerv2?SFN=/dteam/t1
$ lcg-gt -b -T srmv2 srm://test27.egi.cesga.es:8444/srm/managerv2?SFN=/dteam/t1 gsiftp
gsiftp://test27.egi.cesga.es:2811//storage/dteam/t1
ee152752-020f-4598-b19e-a4bc56dcb5b8
$ lcg-gt -b -T srmv2 srm://test27.egi.cesga.es:8444/srm/managerv2?SFN=/dteam/t1 gsiftp
srm://test27.egi.cesga.es:8444/srm/managerv2?SFN=/dteam/t1: [SE][StatusOfGetRequest][SRM_FILE_BUSY] Requested file is busy (in an incompatible state with PTG)
lcg_gt: Invalid argument
For each protocol published by the SE, the nagios SRM probes perform an lcg-gt, which will return a CRITICAL state.
WMS on sl6 doesn't work with ARGUS
see details in GGUS #92773
WMS on SL6 cannot use ARGUS as its authorization system: because SL6 uses the NSS library instead of OpenSSL, proxies are not handled correctly. It is a known problem in NSS and will not be corrected soon.
critical issues on EMI-3 VOMS: wait for the release of EMI-3 first update
We have found some issues that affect the service functionality and, in particular, make the installation and operation of the service problematic for deployments with a large number of VOs:
VOMSES startup script fails in restarting VOMSES web app
VOMS Admin service incorrectly parses truststore refresh period from configuration
These issues have already been acknowledged by the developers and are currently being fixed. A new version of voms-admin-service fixing the above issues is scheduled for release on April 18th. For this reason, and given the stability of the EMI-2 VOMS services, we recommend NOT upgrading now and waiting for the version that will be released in the first EMI-3 update.
Grid Operations Meeting 20th March 2013
Done Jobs are Purged Prematurely from L&B Proxy
Details GGUS 90930 and GGUS 92288
A bug in the L&B proxy causes a zero default timeout on purging jobs in state done. As a result, the regular nightly purge removes from the proxy any jobs which happen to be in state 'done' at that moment. Bookkeeping information is not lost, because the jobs are kept in the server, but some actions that depend on talking to the proxy fail. In collocated L&B proxy/server scenarios, the bug has no effect.
All current releases are affected. The cause, however, is known and has already been fixed in the codebase.
It is possible to work around the issue by setting GLITE_LB_EXPORT_PURGE_ARGS explicitly in the YAIM configuration file:
GLITE_LB_EXPORT_PURGE_ARGS="--cleared 2d --aborted 15d --cancelled 15d --done 60d --other 60d"
as opposed to the current:
GLITE_LB_EXPORT_PURGE_ARGS="--cleared 2d --aborted 15d --cancelled 15d --other 60d"
some issues preventing the CREAM update to EMI-2 (sl6)
Details GGUS 92492
Issues found by two resource centres (INFN-LNL-2 and INFN-T1) in NGI_IT:
1) After configuring with yaim, many tomcat6 errors are logged in catalina.out:
java.lang.IllegalArgumentException: Document base /usr/share/tomcat6/webapps/ce-cream-es does not exist
SEVERE: A web application appears to have started a thread named [Timer-4] but has failed to stop it. This is very likely to create a memory leak.
SEVERE: A web application created a ThreadLocal with key of type [null] (value [org.apache.axiom.util.UIDGenerator$1@4a88e4c0]) and a value of type [long[]] (value [[J@24edb15c]) but failed to remove it when the web application was stopped. To prevent a memory leak, the ThreadLocal has been forcibly removed.
After a while the CE starts swapping and becomes unhealthy.
WORKAROUND:
rm -f /usr/share/tomcat6/conf/Catalina/localhost/ce-cream-es.xml
/etc/init.d/tomcat6 stop && /etc/init.d/glite-ce-blah-parser stop && sleep 3 && /etc/init.d/glite-ce-blah-parser start && /etc/init.d/tomcat6 start
SOLUTION: Have this fixed in the next update
2)
[root@ce01-lcg ~]# cat /etc/glite-ce-cream/log4j.properties | egrep 'MaxFileSize|MaxBackupIndex'
log4j.appender.fileout.MaxFileSize=1000KB
log4j.appender.fileout.MaxBackupIndex=20
These values are too small for a production environment: an entire job lifecycle doesn't fit in 20 MB of logs. Furthermore, any run of yaim restores the too-small values.
WORKAROUND: modify /etc/glite-ce-cream/log4j.properties :
log4j.appender.fileout.MaxFileSize=10M
and
chattr +i /etc/glite-ce-cream/log4j.properties
SOLUTION: Have this fixed in the next update
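To confirm the immutable bit is set on the file (and to remove it once a fixed package restores sensible defaults), a quick check:
lsattr /etc/glite-ce-cream/log4j.properties    # the 'i' flag should appear in the attribute list
chattr -i /etc/glite-ce-cream/log4j.properties   # undo the workaround when it is no longer needed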
3) After configuring with yaim, services are up, but the CE remains unresponsive:
[sdalpra@ui01-ad32 ~]$ glite-ce-job-submit -a -r ce01-lcg.cr.cnaf.infn.it:8443/cream-lsf-dteam my.jdl
2013-03-14 14:41:23,596 FATAL - Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
[sdalpra@ui01-ad32 ~]$ glite-ce-job-submit -a -r ce01-lcg.cr.cnaf.infn.it:8443/cream-lsf-dteam my.jdl
2013-03-14 14:43:10,813 FATAL - Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
Tomcat is actually in an ill state:
[root@ce01-lcg ~]# service tomcat6 status
tomcat6 (pid 20389) is running... [ OK ]
[root@ce01-lcg ~]# service tomcat6 stop
Stopping tomcat6: [FAILED]
WORKAROUND:
service glite-ce-blah-parser stop
service tomcat6 stop && service glite-ce-blah-parser stop && sleep 3 && service glite-ce-blah-parser start && service tomcat6 start
Then it works:
[sdalpra@ui01-ad32 ~]$ glite-ce-job-submit -a -r ce01-lcg.cr.cnaf.infn.it:8443/cream-lsf-dteam my.jdl
https://ce01-lcg.cr.cnaf.infn.it:8443/CREAM691020988
SOLUTION: Have this fixed in the next update
Issues Upgrading Products Depending on GridSite from EMI-2 to EMI-3
Details GGUS 92620
Products with complex (often transitive) dependencies on multiple GridSite packages (such as a simultaneous dependency on gridsite and gridsite-libs) can experience problems upgrading from EMI-2 to EMI-3.
Products known to be affected:
- DPM
Products known not to be affected:
- stand-alone GridSite
- L&B
It is possible to work around the issue by uninstalling the problematic packages and installing afresh (a rough sketch follows).
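A possible shape of that workaround on a DPM head node (a sketch only; the metapackage name emi-dpm_mysql is an assumption, and yum remove may pull in dependent packages, so review the transaction before confirming):
rpm -qa 'gridsite*'                  # see which GridSite packages are currently installed
yum remove gridsite gridsite-libs    # remove the conflicting packages
yum install emi-dpm_mysql            # reinstall the product metapackage to pull everything back in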
The GridSite PT is preparing a fix introducing a gridsite-compat1.7 package to overcome the issue.
EMI-3: vomses start-up script doesn't work properly
Details GGUS ID 92666
We noticed that the vomses start-up script doesn't work properly because the PID of the process is handled incorrectly.
[root@cert-14 ~]# service vomses status
Checking vomses status: (not running) [FAILED]
Actually the service is running with PID 3206:
# ps auxfwww | grep java
root 23227 0.0 0.0 103236 836 pts/0 S+ 10:06 0:00 \_ grep java
voms 3206 0.2 7.7 1708284 305284 ? Sl Mar19 2:14 java -Xmx256m -cp //var/lib/voms-admin/lib/activati....
but the PID file contains a different number:
[root@cert-14 ~]# cat /var/lock/subsys/vomses.pid
3203
so this doesn't work either:
[root@cert-14 ~]# service vomses stop
Stopping vomses: (not running) [FAILED]
When you need to restart the vomses service, you have to kill the process, delete the PID file, and then start the service again:
[root@cert-14 ~]# kill -9 3206
[root@cert-14 ~]# cat /var/lock/subsys/vomses.pid
3203
[root@cert-14 ~]# rm /var/lock/subsys/vomses.pid
rm: remove regular file `/var/lock/subsys/vomses.pid'? y
[root@cert-14 ~]# service vomses start
Starting vomses: [ OK ]
[root@cert-14 ~]# service vomses status
Checking vomses status: (not running) [FAILED]
[root@cert-14 ~]# cat /var/lock/subsys/vomses.pid
25425
[root@cert-14 ~]# ps auxfwww | grep java
root 25586 0.0 0.0 103236 836 pts/0 S+ 10:10 0:00 \_ grep java
voms 25428 20.7 4.2 1543472 167168 ? Sl 10:10 0:06 java -Xmx256m -cp //var/lib/voms-admin/lib/activation-1.1.ja
The developers are already aware of this issue.
Grid Operations Meeting 28 January 2013
lcg-gt problems with dCache
Details GGUS #90807
The current version of lcg-util and gfal (1.13.9-0) return the following error, apparently only when using dCache SEs:
$ lcg-gt -D srmv2 -T srmv2 srm://srm.triumf.ca/dteam/generated/2013-01-25/filed56b1d3e-76f8-4f5a-9b32-94e6d038ab4b gsiftp
gsiftp://dpool13.triumf.ca:2811/generated/2013-01-25/filed56b1d3e-76f8-4f5a-9b32-94e6d038ab4b
[ERROR] No request token returned with SRMv2
Instead, using an older version of lcg-utils, like the one deployed in gLite, the command lcg-gt works fine. Indeed, nagios doesn't detect this problem because it is still gLite-based (lcg_util-1.11.16-2 and GFAL-client-1.11.16-2).
The developers are investigating this issue.
LFC-Oracle problem
Details GGUS #90701
The error is occurring with EMI2 emi-lfc_oracle-1.8.5-1.el5 and Oracle 11:
# lfc-ls lfc-1-kit:/grid
send2nsd: NS002 - send error : client_establish_context: The server had a problem while authenticating our connection
lfc-1-kit:/grid: Could not secure the connection
Experts suspect it is due to the use of Oracle 11 client when the LFC code has been compiled against the Oracle 10 API. The LFC developers expect to provide rpms built against Oracle 11 shortly.
UPDATE Feb 5th: solved by applying the following workaround:
Add to /etc/sysconfig/lfcdaemon:
export LD_PRELOAD=/usr/lib64/libssl.so:/usr/lib64/libglobus_gssapi_gsi.so.4
The Oracle 11 build wasn't tested because the service is in production.
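After restarting the daemon, one way to confirm the preload is actually in effect (a minimal sketch; the process name lfcdaemon is assumed):
service lfcdaemon restart
cat /proc/$(pgrep lfcdaemon | head -1)/environ | tr '\0' '\n' | grep LD_PRELOAD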
list-match problem with EMI2 WMS
Details GGUS #90240
Some CEs have enabled only a group or a role in their queues, not the entire VO:
GlueCEAccessControlBaseRule: VOMS:/gridit/ansys
GlueCEAccessControlBaseRule: VOMS:/gridit/ansys/Role=SoftwareManager
so, when your primary attribute is:
attribute : /gridit/ansys/Role=NULL/Capability=NULL
if you use an EMI-2 WMS you cannot match those resources (whereas you can with an EMI-1 WMS).
It seems that the problem is in the value of WmsRequirements contained in the file /etc/glite-wms/glite_wms.conf: the filter set in that variable is different from the one used in the EMI-1 WMS. The developers are investigating it.
UPDATE Jan 31st: the fix will be released in EMI-3. However, the developers provided us an rpm, glite-wms-classad_plugin-3.4.99-0.sl5.x86_64.rpm, which we have installed on our EMI-2 WMS servers, and the issue has been fixed.
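A quick way to reproduce the matchmaking behaviour from a UI (a minimal sketch; the FQAN /gridit/ansys is the one from the example above and my.jdl is any valid JDL file):
voms-proxy-init --voms gridit:/gridit/ansys
glite-wms-job-list-match -a my.jdl
Comparing the list returned by an EMI-1 and an EMI-2 WMS endpoint should show the missing CEs.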
proxy renewal problems on EMI1 WMS
Details GGUS #89801
Under some circumstances, ICE cannot renew the user credentials due to hanging glite-wms-ice-proxy-renew processes. It is believed that the culprit is this Savannah bug. The bug is already solved in EMI2.
Problems with aliased DNS names of myproxy
Details GGUS #89105
DNS aliases of a myproxy server (e.g. used to implement round-robin load balancing and/or high availability) may cause problems for proxy renewal when not all DNS aliases, including the canonical name, are included in the SubjectAltName extension of the myproxy server's host certificate.
The failure may not always appear (it depends on multiple conditions, like the version of globus, etc.); however, sites are encouraged to use certificates which cover all the DNS aliases thoroughly.
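One way to inspect which names a myproxy host certificate actually covers (a minimal sketch; the hostname is a placeholder and 7512 is the default MyProxy port):
echo | openssl s_client -connect myproxy.example.org:7512 2>/dev/null | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
Every DNS alias used by clients, as well as the canonical name, should appear in the output.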
EMI-2 WN: yaim bug for cleanup-grid-accounts
Details GGUS #90486
Due to a bug, the cleanup-grid-accounts procedure doesn't work properly, so the occupied space on WNs may increase.
The yaim function config_lcgenv unsets $INSTALL_ROOT, so the path used by the cleanup-grid-accounts cron job is not valid:
# cat /etc/cron.d/cleanup-grid-accounts
PATH=/sbin:/bin:/usr/sbin:/usr/bin
36 3 * * * root /sbin/cleanup-grid-accounts.sh -v >> /var/log/cleanup-grid-accounts.log 2>&1
# tail /var/log/cleanup-grid-accounts.log
/bin/sh: /sbin/cleanup-grid-accounts.sh: No such file or directory
# ls -l /sbin/cleanup-grid-accounts.sh
ls: /sbin/cleanup-grid-accounts.sh: No such file or directory
# ls -l /usr/sbin/cleanup-grid-accounts.sh
-rwxr-xr-x 1 root root 6747 May 16 2012 /usr/sbin/cleanup-grid-accounts.sh
The function belongs to glite-yaim-core-5.1.0-1. Until the fix is released in production, a workaround can be applied by changing the cleanup-grid-accounts cron job to use the correct path, like:
# cat /etc/cron.d/cleanup-grid-accounts
PATH=/sbin:/bin:/usr/sbin:/usr/bin
16 3 * * * root /usr/sbin/cleanup-grid-accounts.sh -v >> /var/log/cleanup-grid-accounts.log 2>&1
Another workaround is also possible:
/opt/glite/yaim/bin/yaim -r -s -n WN -n TORQUE_client -n GLEXEC_wn -f config_users
This does not execute config_lcgenv, therefore $INSTALL_ROOT is set correctly (to /usr).
Grid Operations Meeting 19th November 2012
FTS jobs abort with "No site found for host xxx.yyy" error
Details GGUS #87929
From time to time, some FTS transfers fail with the message above. The problem was reported at CNAF, IN2P3, and GRIDKA, noticed by Atlas, CMS, and LHCb VOs. The problem is appearing and disappearing in rather short and unpredictable intervals.
The exact reasons are not yet understood; we keep investigating. Reports from sites affected by a similar problem will be appreciated.
Update Nov 20: the user reports that both problems disappeared, probably fixed together.
LCMAPS-plugins-c-pep in glexec fails on RH6-based WNs
Details GGUS #88520
Due to the replacement of OpenSSL with NSS in the RH6-based distributions, LCMAPS-plugins-c-pep invoked from glexec fails when talking to the Argus PEP via curl.
This is a known issue, as mentioned in the EMI glexec release notes; however, the workaround is not described in a usable way there.
Once we make sure we understand it properly and that the fix works, it will be documented properly on the UMD pages and passed to the developers to:
- fix the documentation
- try to deploy the workaround automatically when an NSS-poisoned system is detected
UPDATE Nov 19th: the fix is now well explained in the known issues section and it will be included in a future yaim update
WMS does not work with ARC CE 2.0
Details GGUS #88630, further info Condor ticket #3062
The format of the jobid changed in ARC CE release 12. This is not recognised by Condor prior to version 7.8.3; however, the current EMI-1 WMS uses Condor 7.8.0. This breaks submission from WMS to ARC CE.
The problem hence affects CMS SAM tests as well as their production jobs.
Updates to ARC CE 12 should therefore be done carefully until the Condor update is available from EMI.
UPDATE Nov 26th: Condor 7.8.6 was installed on a test WMS, and submission to ARC seemed to work fine; since this WMS isn't available any more, further and deeper tests should be performed, perhaps using the EMI-TESTBED infrastructure.
UPDATE Jan 24th: job submission to ARC 2.0 now works using a WMS with Condor 7.8.6. The EMI-3 WMS will follow the EMI release schedule.
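To check which Condor version a given WMS ships before pointing it at an updated ARC CE (a quick check on the WMS host):
condor_version
The reported version should be at least 7.8.3.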
Grid Operations Meeting 5th November 2012
problem in retrieving the job output after EMI2 update 4 (WMS)
for details see GGUS 87802
If a user isn't using the myproxy service, she can retrieve the output without any problem. Otherwise, the problem occurs (see the details in that ticket).
- the user proxy is usually stored in the Sandboxdir, but if the user is using the myproxy service, that file is a symlink to the real file stored in the proxy renewal directory (/var/glite/spool/glite-renewd). When the job ends, that proxy is purged, so the user no longer has the permissions to retrieve the output
- For the moment a simple workaround is to submit a new job and, before it ends, retrieve the output of any previous job.
In order not to use the myproxy server, a user should specify the following empty parameter in the JDL file:
MyProxyServer = "";
otherwise the UI default setting is considered.
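A minimal JDL fragment showing the empty parameter in context (the executable and sandbox entries are only placeholders):
Executable = "/bin/hostname";
StdOutput = "std.out";
OutputSandbox = {"std.out"};
MyProxyServer = "";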
UPDATE Nov 13th: there is a workaround, also reported in the known issues section of the "EMI2 update 4" page. Besides, this update is currently in the UMD staged rollout phase.