Middleware issues and solutions
Latest revision as of 13:10, 23 November 2012
Purpose of this page is to document recurring middleware issues with broad impact and the respective solutions and/or workarounds.
This page is maintained by the Distributed Middleware Support Unit of EGI.
BDII
BDII: make size of tmpfs for /var/run/bdii/db configurable
Under certain conditions, an EMI top-BDII can hang when updating the information: the current default size of /var/run/bdii/db, 1500 MB, is not enough to hold the current BDII database plus the LDIF files and other data needed for the update. A workaround is to increase the tmpfs size, for example to 2.4 GB. See details in GGUS ticket [79393]
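The resize can be done on the fly with a remount; a sketch, assuming the default mount point named above (2400M is an example size, and the change should also be made persistent in /etc/fstab as appropriate for your distribution):

```shell
# Enlarge the tmpfs holding the BDII database without unmounting it
# (2400M is an example; adjust to the actual database size)
mount -o remount,size=2400M /var/run/bdii/db
# Verify the new size
df -h /var/run/bdii/db
```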
CREAM
GlueCEStateWaitingJobs: 444444 and WallTime workaround
If the queues publish:
GlueCEStateWaitingJobs: 444444
and in the log /var/log/bdii/bdii-update.log you notice errors like the following:
Traceback (most recent call last):
  File "/usr/libexec/lcg-info-dynamic-scheduler", line 435, in ?
    wrt = qwt * nwait
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
then the queues probably have no "resources_default.walltime" parameter configured.
So define it for each queue by launching, for example:
# qmgr -c "set queue prod resources_default.walltime = 01:00:00"
# qmgr -c "set queue cert resources_default.walltime = 01:00:00"
# qmgr -c "set queue cloudtf resources_default.walltime = 01:00:00"
CREAM refuses Terena-signed VOMS proxies
The TERENA eScience SSL CA sets the pathlen attribute among its basic constraints (see http://www.openssl.org/docs/apps/x509v3_config.html).
This triggers a bug in the VOMS Java API (the API incorrectly applies this policy when checking the chain of the attribute certificate), which is used by CREAM, resulting in errors on job submission:
FATAL - User <user's DN> not authorized for operation {http://www.gridsite.org/namespaces/delegation-2}getProxyReq
A workaround is not to use TERENA eScience SSL CA-signed certificates as host certificates of VOMS servers.
The problem is fixed in voms-api-java-2.0.5, released in EMI. A backport to gLite 3.2 is unofficially available with unreleased patch [#4997] (no gLite 3.2 update is expected).
GGUS ticket [#76129]
/tmp fills up with x509up*glexec files
Nov 27, 2011
The CREAM CE uses glexec to obtain the right local user identity for the job. As a side effect, a matching X.509 proxy file is created in /tmp. The result is /tmp filling up with these files.
Starting with the EMI-1 CREAM release, which uses glexec 0.8, the problem is avoided by the setting
create_target_proxy = no
in the glexec.conf file.
In gLite CREAM releases it is worked around by putting
<parameter name="glexec_probe_cmd" value="/opt/glite/bin/glexecprobe" />
<parameter name="methods" value="JobRegister, putProxy" />
into the section
<authzchain name="chain-1">
    <plugin name="localuserpip"
of /opt/glite/etc/glite-ce-cream/cream-config.xml. This is done automatically by YAIM but not by Quattor.
GGUS ticket [[1]]
VOMS
VOMS server fails with high number of VOs
The VOMS server of gLite 3.2 is memory-hungry: it starts failing when configured to serve more than approximately 10 VOs.
Change the -XX:MaxPermSize parameter of CATALINA_OPTS in /etc/tomcat5/tomcat5.conf to at least 512m
CATALINA_OPTS="-Xmx1508M -server -Dsun.net.client.defaultReadTimeout=240000 -XX:MaxPermSize=512m"
and add
* soft nofile 2048
* hard nofile 2048
into /etc/security/limits.conf.
GGUS ticket #72136
NOTE (2012-02-29): in EMI VOMS, the value of -XX:MaxPermSize is set automatically by YAIM based on the number of VOs enabled on the server
WMS and LB
Job submission breaks when VOMS server certificate is renewed
When the VOMS server certificate is renewed, retaining the same DN, and both the old and the new certificate of this VOMS server are left in /etc/grid-security/vomsdir on the WMS machine, a GridSite bug is exposed and jobs get refused with the ambiguous error
Unable to delegate the credential to the endpoint: https://wms.your.domain.eu:7443/glite_wms_wmproxy_server
Other symptoms are messages "Remote GRST CRED: Not Available" in wmproxy.log.
The workaround is to manually remove the old VOMS server certificate from /etc/grid-security/vomsdir (it is no longer used anyway).
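To tell the old certificate apart from the new one, the validity dates can be inspected; a sketch (the voms.example.org subdirectory and *.pem naming are illustrative, adapt them to your vomsdir layout):

```shell
# List subject and expiry date of each VOMS server certificate,
# then delete the file with the older notAfter date.
for f in /etc/grid-security/vomsdir/voms.example.org/*.pem; do
    echo "== $f"
    openssl x509 -in "$f" -noout -subject -enddate
done
```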
GGUS ticket [#77256]
LB breaks when MySQL 5.1 is installed
Both LB and WMS nodes are affected by this problem.
Due to imperfect packaging of LB, all RPM dependencies appear satisfied when MySQL 5.1 is installed first and LB and/or WMS afterwards. This happens quite easily because the EGI SAM repository provides MySQL 5.1.
LB dies silently shortly after startup because of a missing shared library (no glite-lb-bkserverd processes run), and WMS does not accept jobs, logging
(edg_wll_RegisterJobMaster(): unable to register job Resource temporarily unavailable;; Logging library ERROR: Resource temporarily unavailable;; edg_wll_DoLogEventServer(): edg_wll_log_proxy_connect error Connection refused;; edg_wll_log_proxy_connect())
into /var/log/wms/wmproxy.log.
The workaround is installing MySQL-shared-compat package.
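For example, on a yum-managed node (repository availability assumed):

```shell
# Pull in the compatibility shared libraries needed by LB/WMS
yum install -y MySQL-shared-compat
```

Restart the affected LB/WMS services afterwards.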
GGUS ticket [#79624].
Changing WMS hostname
If you have to change the WMS hostname without reinstalling the service, the old hostname will continue to appear in some paths, so after changing the network configuration, also perform the following steps:
- uninstall condor (emi-wms will be uninstalled too because of the dependencies)
- remove the directory /opt/condor*
- reinstall emi-wms
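The steps above can be sketched as follows, assuming a yum-based EMI installation:

```shell
# Uninstall condor; emi-wms is removed as well through the dependencies
yum remove condor
# Remove the leftover directories that still carry the old hostname
rm -rf /opt/condor*
# Reinstall the WMS metapackage
yum install emi-wms
```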
GGUS ticket [#80622]
Failure loading GSI credentials
Symptoms:
FATAL CONTROL - Failed to get GSI credentials. Exiting.
SECURITY - Failed to load GSI credential: edg_wll_gss_acquire_cred_gsi(): GSS Major Status: General failure
(GSS Minor Status Error Chain:
globus_gsi_gssapi: Unable to read credential for import
OpenSSL Error: tasn_dec.c:749: in library: asn1 encoding routines, function ASN1_TEMPLATE_NOEXP_D2I: nested asn1 error Field=n, Type=RSA
OpenSSL Error: tasn_dec.c:830: in library: asn1 encoding routines, function ASN1_D2I_EX_PRIMITIVE: nested asn1 error
OpenSSL Error: tasn_dec.c:1306: in library: asn1 encoding routines, function ASN1_CHECK_TLEN: wrong tag
), exiting.
This is actually an OpenSSL/Globus problem; see Lack of support for PKCS#8 below.
ICE dies soon after starting
(copied from WMS wiki)
ICE version 3.3.5-3 has a bug that is triggered in particular circumstances. Some background first: there is a piece of code that checks the myproxy server address for correctness; this check is performed by means of a rather complicated regular expression that matches the address against the FQDN format. The complexity of this regex causes heavy usage of Boost's internal memory buffer, which can run out of memory while matching certain addresses.
Now, the bug: the addresses found to trigger this problem look like "ed8ac012f7da92dd487bc8d3edc4a49b" (or even shorter; not every alphanumeric combination is problematic, though). Some VOs use addresses like that (LHCb for example). We also noted that simply appending the domain name to those addresses bypasses Boost's memory exhaustion.
To check if ICE keeps crashing for this problem you can follow these steps:
- login as root in the WMS node
- stop the daemon that automatically restarts WMS services (if any)
- stop ICE
- su - glite
- execute (as glite) /usr/bin/glite-wms-ice --conf glite_wms.conf
After a little while you should receive on the console an error like this:
/usr/bin/glite-wms-ice --conf glite_wms.conf
Logfile is [/var/log/wms/ice.log]
terminate called after throwing an instance of 'std::runtime_error'
  what():  Memory exhausted
Aborted
If this is the case, you should modify the myproxy addresses stored in ICE's database:
- go back to the root user and change to ICE's persist directory:
cd /var/ice/persist_dir
- put all the myproxy URLs without a dot into the file "file.txt" by executing:
sqlite3 ice.db "select myproxyurl from delegation where myproxyurl not like '%.%';" > file.txt
- remove empty lines:
grep -v ^$ file.txt > file1.txt
- generate the instructions to update the ICE DB, and put them in a script
cat file1.txt | gawk '{print "sqlite3 /var/ice/persist_dir/ice.db \"update delegation set myproxyurl=\x27"$0".desy.de\x27 where myproxyurl=\x27"$0"\x27;\""}' > script
(Substitute the example desy.de domain name with yours).
- execute the script:
chmod +x script
./script
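The per-row update generated above can also be expressed as a single SQL statement; a sketch on a throwaway database with the same table and column names (on the real node, point sqlite3 at /var/ice/persist_dir/ice.db and substitute the example desy.de domain with yours):

```shell
# Create a throwaway database mimicking ice.db's delegation table
DB=$(mktemp)
sqlite3 "$DB" "create table delegation (myproxyurl text);
               insert into delegation values
                 ('ed8ac012f7da92dd487bc8d3edc4a49b'), ('px.grid.example.org');"
# Append the site domain to every myproxyurl that contains no dot,
# which is what the generated per-row script does
sqlite3 "$DB" "update delegation set myproxyurl = myproxyurl || '.desy.de'
               where myproxyurl not like '%.%';"
# Show the result: only the dotless address gains the domain suffix
sqlite3 "$DB" "select myproxyurl from delegation;"
rm -f "$DB"
```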
GGUS ticket 84948
Storage Element
BDII does not start at SE node
The BDII daemon does not start correctly on the SE node, with the result that the service is not published to the GOC, etc. Symptoms are error messages:
# service ldap restart
Stopping slapd:                           [ OK ]
Checking configuration files for slapd:
bdb_db_open: Warning - No DB_CONFIG file found in directory /var/lib/ldap: (2)
Expect poor performance for suffix dc=my-domain,dc=com.
config file testing succeeded             [ OK ]
Starting slapd:                           [ OK ]
The problem is caused by setting the BDII_USER variable in site-info.def, which results in incorrect permissions on some files slapd uses. This variable should not be set on SE nodes; it is intended for the BDII node only.
GGUS ticket [#73086]
Globus
Lack of support for PKCS#8
Globus does not support the PKCS#8 format of private keys. However, this is the default for OpenSSL 1.x. Therefore, key-certificate pairs generated by OpenSSL 1.x in the default way are not directly usable with Globus-based services (many former gLite ones), yielding errors like
globus_gsi_gssapi: Unable to read credential for import
globus_gsi_gssapi: Error with GSI credential
globus_credential: Error reading proxy credential: Unhandled PEM sequence: PRIVATE KEY
In cases where this issue affects a WMS/LB node, the related messages in syslog look like this:
FATAL CONTROL - Failed to get GSI credentials. Exiting.
SECURITY - Failed to load GSI credential: edg_wll_gss_acquire_cred_gsi(): GSS Major Status: General failure
(GSS Minor Status Error Chain:
globus_gsi_gssapi: Unable to read credential for import
OpenSSL Error: tasn_dec.c:749: in library: asn1 encoding routines, function ASN1_TEMPLATE_NOEXP_D2I: nested asn1 error Field=n, Type=RSA
OpenSSL Error: tasn_dec.c:830: in library: asn1 encoding routines, function ASN1_D2I_EX_PRIMITIVE: nested asn1 error
OpenSSL Error: tasn_dec.c:1306: in library: asn1 encoding routines, function ASN1_CHECK_TLEN: wrong tag
), exiting.
The problem can be worked around by converting the PKCS#8 key to the traditional RSA format:
openssl pkcs8 -in key.pk8 -out key-temp.pem
openssl rsa -in key-temp.pem -out key.pem
rm key-temp.pem