



Revision as of 10:24, 20 March 2013




Detailed agenda: Grid Operations Meeting 20 March 2013

Audio conference link: Conference system is Adobe Connect, no password required.
Audio conference details: Indico page


1. Middleware releases and staged rollout

1.1. Update on the status of EMI updates

1.2. Staged Rollout

2. Operational Issues

2.2 Updates from DMSU

'done' Jobs are Purged Prematurely from L&B Proxy

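Details: GGUS 90930 (https://ggus.eu/tech/ticket_show.php?ticket=90930) and GGUS 92288 (https://ggus.eu/tech/ticket_show.php?ticket=92288)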

A bug in the L&B proxy sets the default purge timeout for jobs in state 'done' to zero. As a result, the regular nightly purge removes from the proxy any jobs that happen to be in state 'done' at that moment. Bookkeeping information is not lost because the jobs are kept in the server, but some actions that depend on talking to the proxy fail. In collocated L&B Proxy/Server scenarios, the bug has no effect.

All current releases are affected. The cause, however, is known and has been fixed in the codebase.

It is possible to work around the issue by setting GLITE_LB_EXPORT_PURGE_ARGS explicitly in the YAIM configuration file:

GLITE_LB_EXPORT_PURGE_ARGS="--cleared 2d --aborted 15d --cancelled 15d --done 60d --other 60d"

instead of the current setting, which lacks the --done option:

GLITE_LB_EXPORT_PURGE_ARGS="--cleared 2d --aborted 15d --cancelled 15d --other 60d"
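
To check whether a given value already carries an explicit timeout for jobs in state 'done', a small test along the following lines could be used. This is only a sketch: the helper name purge_args_has_done is hypothetical, and it inspects a string passed to it rather than reading any particular YAIM configuration file.

```shell
# Hypothetical helper: succeeds if the given purge argument string
# contains an explicit --done option.
purge_args_has_done() {
    case " $1 " in
        *" --done "*) return 0 ;;
        *) return 1 ;;
    esac
}

# The two values quoted above.
current='--cleared 2d --aborted 15d --cancelled 15d --other 60d'
fixed='--cleared 2d --aborted 15d --cancelled 15d --done 60d --other 60d'

for args in "$current" "$fixed"; do
    if purge_args_has_done "$args"; then
        echo "explicit --done timeout present: $args"
    else
        echo "no --done timeout, workaround needed: $args"
    fi
done
```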

Some issues preventing the CREAM update to EMI-2 (SL6)

Details: GGUS 92492 (https://gus.fzk.de/ws/ticket_info.php?ticket=92492)

Issues found by two resource centres (INFN-LNL-2 and INFN-T1) in NGI_IT:

1) After configuring with YAIM, many tomcat6 errors are logged in catalina.out:

java.lang.IllegalArgumentException: Document base /usr/share/tomcat6/webapps/ce-cream-es does not exist

SEVERE: A web application appears to have started a thread named [Timer-4] but has failed to stop it. This is very likely to create a memory leak.

SEVERE: A web application created a ThreadLocal with key of type [null] (value [org.apache.axiom.util.UIDGenerator$1@4a88e4c0]) and a value of type [long[]] (value [[J@24edb15c]) but failed to remove it when the web application was stopped. To prevent a memory leak, the ThreadLocal has been forcibly removed.

After a while the CE starts swapping and its health degrades.

WORKAROUND:

rm -f /usr/share/tomcat6/conf/Catalina/localhost/ce-cream-es.xml

/etc/init.d/tomcat6 stop && /etc/init.d/glite-ce-blah-parser stop && sleep 3 && /etc/init.d/glite-ce-blah-parser start && /etc/init.d/tomcat6 start

SOLUTION: Have this fixed in the next update
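
To see whether a CE shows the first symptom, one could count the "Document base ... does not exist" entries in catalina.out. The sketch below is illustrative only: the helper name count_missing_docbase is hypothetical, and it runs against a temporary sample file built from the messages quoted above, since the location of catalina.out on a given host is not stated here.

```shell
# Hypothetical helper: print the number of lines in the given log file
# that report a missing ce-cream-es document base.
count_missing_docbase() {
    # grep -c exits non-zero when nothing matches; the count is still printed.
    grep -c 'Document base .*ce-cream-es does not exist' "$1" || true
}

# Demonstration on a sample file containing two of the messages quoted above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
java.lang.IllegalArgumentException: Document base /usr/share/tomcat6/webapps/ce-cream-es does not exist
SEVERE: A web application appears to have started a thread named [Timer-4] but has failed to stop it. This is very likely to create a memory leak.
EOF
count_missing_docbase "$sample"
rm -f "$sample"
```

On a real host the function would be pointed at the actual catalina.out; a non-zero count suggests the workaround above is still needed.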

2)

[root@ce01-lcg ~]# cat /etc/glite-ce-cream/log4j.properties | egrep 'MaxFileSize|MaxBackupIndex'
log4j.appender.fileout.MaxFileSize=1000KB
log4j.appender.fileout.MaxBackupIndex=20

These values are too small for a production environment: an entire job lifecycle does not fit in 20 MB of logs. Furthermore, any run of YAIM restores the too-small values.

WORKAROUND: modify /etc/glite-ce-cream/log4j.properties:

log4j.appender.fileout.MaxFileSize=10M

and make the file immutable, so that YAIM cannot restore the old values:

chattr +i /etc/glite-ce-cream/log4j.properties

SOLUTION: Have this fixed in the next update
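
The 20 MB figure follows from the two settings. As a rough sketch, assuming the fileout appender is a log4j rolling file appender that keeps one active file plus MaxBackupIndex rotated files, and assuming the workaround value 10M is read as 10 MB, the total capacity can be estimated as follows (the helper name log_capacity_kb is hypothetical):

```shell
# Hypothetical helper: total log capacity in KB, assuming one active
# file plus MaxBackupIndex rotated files of MaxFileSize each.
log_capacity_kb() {
    max_file_kb=$1
    max_backup_index=$2
    echo $(( max_file_kb * (max_backup_index + 1) ))
}

# Shipped values: MaxFileSize=1000KB, MaxBackupIndex=20
log_capacity_kb 1000 20

# Workaround: MaxFileSize taken as 10 MB (10 * 1024 KB), MaxBackupIndex unchanged
log_capacity_kb $(( 10 * 1024 )) 20
```

Under these assumptions the shipped values allow about 21000 KB (roughly 20 MB) of logs, while the workaround allows about 215040 KB (roughly 210 MB).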

3) After configuring with YAIM, services are up, but the CE remains unresponsive:

[sdalpra@ui01-ad32 ~]$ glite-ce-job-submit -a -r ce01-lcg.cr.cnaf.infn.it:8443/cream-lsf-dteam my.jdl
2013-03-14 14:41:23,596 FATAL - Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
[sdalpra@ui01-ad32 ~]$ glite-ce-job-submit -a -r ce01-lcg.cr.cnaf.infn.it:8443/cream-lsf-dteam my.jdl
2013-03-14 14:43:10,813 FATAL - Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]

Tomcat is actually in a bad state: it reports as running but cannot be stopped:

[root@ce01-lcg ~]# service tomcat6 status
tomcat6 (pid 20389) is running... [ OK ]

[root@ce01-lcg ~]# service tomcat6 stop
Stopping tomcat6: [FAILED]

WORKAROUND:

service glite-ce-blah-parser stop

service tomcat6 stop && service glite-ce-blah-parser stop && sleep 3 && service glite-ce-blah-parser start && service tomcat6 start

Then it works:

[sdalpra@ui01-ad32 ~]$ glite-ce-job-submit -a -r ce01-lcg.cr.cnaf.infn.it:8443/cream-lsf-dteam my.jdl
https://ce01-lcg.cr.cnaf.infn.it:8443/CREAM691020988

SOLUTION: Have this fixed in the next update

Issues Upgrading Products Depending on GridSite from EMI-2 to EMI-3

Details: GGUS 92620 (https://ggus.eu/tech/ticket_show.php?ticket=92620)

Products with complex (often transitive) dependencies on multiple GridSite packages (such as a simultaneous dependency on gridsite and gridsite-libs) can experience problems upgrading from EMI-2 to EMI-3.

Products known to be affected:

  • DPM

Products known not to be affected:

  • stand-alone GridSite
  • L&B

It is possible to work around the issue by uninstalling the problematic packages and installing afresh.

The GridSite PT is preparing a fix (https://savannah.cern.ch/bugs/?100916) that introduces a gridsite-compat1.7 package to overcome the issue.

3. AOB

3.2 Next meeting

4. Minutes