Difference between revisions of "Agenda-20-03-2013"
3. AOB
4. Minutes
Minutes available online: https://indico.egi.eu/indico/materialDisplay.py?materialId=minutes&confId=1364
Latest revision as of 17:49, 26 March 2013
Detailed agenda: Grid Operations Meeting 20 March 2013
Audio conference link: Conference system is Adobe Connect, no password required.
Audio conference details: Indico page
1. Middleware releases and staged rollout
1.1. Update on the status of EMI updates
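Cristina Aiftimiei reports on the EMI release activities.
- Report on twiki: https://twiki.cern.ch/twiki/bin/view/EMI/EmiEgiGOM#EMI_1_EMI_2_EMI_3_Updates_status
- EMI 3 highlights: http://indico.cern.ch/materialDisplay.py?contribId=3&materialId=slides&confId=197801
- Reference for Major & Backward (In)Compatibilities Changes & Upgrade Paths: https://twiki.cern.ch/twiki/pub/EMI/EMT/emi_3_Changes_Upgrades.pdf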
1.2. Staged Rollout
- UMD-2 released:
  - EMI-2 update 9 (http://www.eu-emi.eu/emi-2-matterhorn/updates/-/asset_publisher/9AgN/content/update-9-21-02-2013-v-2-6-1-1)
    - Still under SR: Cream-torque
    - Ready for production: Cream 1.14.3; Wms 3.4.1
  - IGE 3.1 and 3.0:
    - Gsisshterm - 1.3.4
    - Gram5 - 5.2.3
- Staged rollout tests still continue for the EMI WN tarball (RT ticket: https://rt.egi.eu/rt/Ticket/Display.html?id=4927)
  - Please contact me if you are willing to participate
- UMD-3 release:
  - Contains all the EMI-3 products (50 products) released on 11 March (release page: http://www.eu-emi.eu/emi-3-montebianco)
  - The campaign for Early Adopters is still under way.
  - Some products are already under SR:
    - Bdii-site - 1.2.0
- Preparing a wiki with the issues found during SR activities regarding the upgrades from EMI-2 to EMI-3
2. Operational Issues
2.2 Updates from DMSU
done Jobs are Purged Prematurely from L&B Proxy
Details: GGUS 90930 (https://ggus.eu/tech/ticket_show.php?ticket=90930) and GGUS 92288 (https://ggus.eu/tech/ticket_show.php?ticket=92288)
A bug in the L&B proxy sets the default timeout for purging jobs in state done to zero. As a result, the regular nightly purge removes from the proxy any jobs that happen to be in state done at that moment. Bookkeeping information is not lost because the jobs are kept in the server, but some actions that depend on talking to the proxy fail. In collocated L&B Proxy/Server scenarios, the bug has no effect.
All current releases are affected. The cause, however, is known and has been fixed in the codebase.
It is possible to work around the issue by setting GLITE_LB_EXPORT_PURGE_ARGS explicitly in the YAIM configuration file:
GLITE_LB_EXPORT_PURGE_ARGS="--cleared 2d --aborted 15d --cancelled 15d --done 60d --other 60d"
as opposed to the current:
GLITE_LB_EXPORT_PURGE_ARGS="--cleared 2d --aborted 15d --cancelled 15d --other 60d"
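To confirm that a given setting carries an explicit timeout for done jobs, it may help to scan the option string for --done. The following is a minimal sketch; the function name has_done_timeout is hypothetical and not part of YAIM or L&B, and it only inspects the string without validating the timeout value itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of YAIM or L&B): succeeds when the purge
# argument string contains --done followed by a value.
has_done_timeout() {
    # $1: the value of GLITE_LB_EXPORT_PURGE_ARGS
    prev=""
    for word in $1; do
        if [ "$prev" = "--done" ]; then
            # A value follows --done, so the workaround is in place.
            return 0
        fi
        prev=$word
    done
    return 1
}

# The two settings quoted above.
fixed="--cleared 2d --aborted 15d --cancelled 15d --done 60d --other 60d"
current="--cleared 2d --aborted 15d --cancelled 15d --other 60d"

if has_done_timeout "$fixed"; then echo "fixed: --done is set"; fi
if ! has_done_timeout "$current"; then echo "current: --done is missing"; fi
```

In practice the string would come from the site's YAIM configuration file rather than from literals in the script.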
Some issues preventing the CREAM update to EMI-2 (SL6)
Details GGUS 92492
Issues found by two resource centres (INFN-LNL-2 and INFN-T1) in NGI_IT:
1) After configuring with yaim, many tomcat6 errors are logged in catalina.out:
java.lang.IllegalArgumentException: Document base /usr/share/tomcat6/webapps/ce-cream-es does not exist
SEVERE: A web application appears to have started a thread named [Timer-4] but has failed to stop it. This is very likely to create a memory leak.
SEVERE: A web application created a ThreadLocal with key of type [null] (value [org.apache.axiom.util.UIDGenerator$1@4a88e4c0]) and a value of type [long[]] (value [[J@24edb15c]) but failed to remove it when the web application was stopped. To prevent a memory leak, the ThreadLocal has been forcibly removed.
After a while the CE starts swapping and becomes unhealthy.
WORKAROUND:
rm -f /usr/share/tomcat6/conf/Catalina/localhost/ce-cream-es.xml
/etc/init.d/tomcat6 stop && /etc/init.d/glite-ce-blah-parser stop && sleep 3 && /etc/init.d/glite-ce-blah-parser start && /etc/init.d/tomcat6 start
SOLUTION: Have this fixed in the next update
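Before removing the descriptor, a site may want to check that it really points at a missing application. The sketch below assumes, based only on the error message above, that the descriptor ce-cream-es.xml refers to a webapps directory of the same name; the function is_stale_context is hypothetical and not part of CREAM or Tomcat.

```shell
#!/bin/sh
# Hypothetical check: a context descriptor is treated as stale when it exists
# but the web application directory of the same name does not.
# $1: Tomcat base directory (in the case above, /usr/share/tomcat6)
# $2: application name (in the case above, ce-cream-es)
is_stale_context() {
    [ -f "$1/conf/Catalina/localhost/$2.xml" ] && [ ! -e "$1/webapps/$2" ]
}

if is_stale_context /usr/share/tomcat6 ce-cream-es; then
    echo "stale context descriptor found for ce-cream-es"
fi
```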
2)
[root@ce01-lcg ~]# cat /etc/glite-ce-cream/log4j.properties | egrep 'MaxFileSize|MaxBackupIndex'
log4j.appender.fileout.MaxFileSize=1000KB
log4j.appender.fileout.MaxBackupIndex=20
These values are too small for a production environment: an entire job lifecycle does not fit in 20 MB of logs. Furthermore, any run of yaim restores the small values.
WORKAROUND: modify /etc/glite-ce-cream/log4j.properties :
log4j.appender.fileout.MaxFileSize=10M
and
chattr +i /etc/glite-ce-cream/log4j.properties
SOLUTION: Have this fixed in the next update
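The 20 MB figure follows from the two settings: each file holds at most MaxFileSize, and MaxBackupIndex rotated files are kept. A small sketch of the arithmetic, assuming one active file plus MaxBackupIndex backups; the function name log_capacity_kb is hypothetical, and the second call treats 10M as 10240 KB, which is also an assumption.

```shell
#!/bin/sh
# Hypothetical helper: upper bound, in KB, on the log data kept on disk,
# assuming one active file plus MaxBackupIndex rotated backups.
# $1: MaxFileSize in KB   $2: MaxBackupIndex
log_capacity_kb() {
    echo $(( $1 * ($2 + 1) ))
}

log_capacity_kb 1000 20     # shipped values: prints 21000
log_capacity_kb 10240 20    # MaxFileSize raised to 10240 KB: prints 215040
```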
3) After configuring with yaim, services are up, but the ce remains unresponsive:
[sdalpra@ui01-ad32 ~]$ glite-ce-job-submit -a -r ce01-lcg.cr.cnaf.infn.it:8443/cream-lsf-dteam my.jdl
2013-03-14 14:41:23,596 FATAL - Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
[sdalpra@ui01-ad32 ~]$ glite-ce-job-submit -a -r ce01-lcg.cr.cnaf.infn.it:8443/cream-lsf-dteam my.jdl
2013-03-14 14:43:10,813 FATAL - Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
Tomcat is actually in an ill state:
[root@ce01-lcg ~]# service tomcat6 status
tomcat6 (pid 20389) is running... [ OK ]
[root@ce01-lcg ~]# service tomcat6 stop
Stopping tomcat6: [FAILED]
WORKAROUND:
service glite-ce-blah-parser stop
service tomcat6 stop && service glite-ce-blah-parser stop && sleep 3 && service glite-ce-blah-parser start && service tomcat6 start
Then it works:
[sdalpra@ui01-ad32 ~]$ glite-ce-job-submit -a -r ce01-lcg.cr.cnaf.infn.it:8443/cream-lsf-dteam my.jdl
https://ce01-lcg.cr.cnaf.infn.it:8443/CREAM691020988
SOLUTION: Have this fixed in the next update
Issues Upgrading Products Depending on GridSite from EMI-2 to EMI-3
Details GGUS 92620
Products with complex (often transitive) dependencies on multiple GridSite packages (such as a simultaneous dependency on gridsite and gridsite-libs) can experience problems upgrading from EMI-2 to EMI-3.
Products known to be affected:
- DPM
Products known not to be affected:
- stand-alone GridSite
- L&B
It is possible to work around the issue by uninstalling the problematic packages and installing afresh.
The GridSite PT is preparing a fix (https://savannah.cern.ch/bugs/?100916) introducing a gridsite-compat1.7 package to overcome the issue.
EMI-3: vomses start-up script doesn't work properly
Details: GGUS ID 92666 (https://ggus.eu/tech/ticket_show.php?ticket=92666)
We noticed that the vomses start-up script doesn't work properly because the PID of the process is handled incorrectly.
[root@cert-14 ~]# service vomses status
Checking vomses status: (not running) [FAILED]
In fact, the service is running with PID 3206:
# ps auxfwww | grep java
root 23227 0.0 0.0 103236 836 pts/0 S+ 10:06 0:00 \_ grep java
voms 3206 0.2 7.7 1708284 305284 ? Sl Mar19 2:14 java -Xmx256m -cp //var/lib/voms-admin/lib/activati....
but the pid file contains a different number:
[root@cert-14 ~]# cat /var/lock/subsys/vomses.pid
3203
so stopping the service doesn't work either:
[root@cert-14 ~]# service vomses stop
Stopping vomses: (not running) [FAILED]
When you need to restart the vomses service, you have to kill the process, delete the pid file, and then start the service. Even then, the status check still fails, and the new pid file again differs from the PID of the running process:
[root@cert-14 ~]# kill -9 3206
[root@cert-14 ~]# cat /var/lock/subsys/vomses.pid
3203
[root@cert-14 ~]# rm /var/lock/subsys/vomses.pid
rm: remove regular file `/var/lock/subsys/vomses.pid'? y
[root@cert-14 ~]# service vomses start
Starting vomses: [ OK ]
[root@cert-14 ~]# service vomses status
Checking vomses status: (not running) [FAILED]
[root@cert-14 ~]# cat /var/lock/subsys/vomses.pid
25425
[root@cert-14 ~]# ps auxfwww | grep java
root 25586 0.0 0.0 103236 836 pts/0 S+ 10:10 0:00 \_ grep java
voms 25428 20.7 4.2 1543472 167168 ? Sl 10:10 0:06 java -Xmx256m -cp //var/lib/voms-admin/lib/activation-1.1.ja
The developers are already aware of the issue.
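The mismatch shown in the session can be checked mechanically by comparing the number stored in the pid file with the PID seen in the process list. The following is a minimal sketch; the function pidfile_matches is hypothetical and not part of the vomses init script, and the demonstration writes the observed value to a temporary file instead of reading /var/lock/subsys/vomses.pid.

```shell
#!/bin/sh
# Hypothetical check: compare the PID recorded in a pid file with the PID of
# the process that is actually running.
# $1: path to the pid file   $2: PID seen in the process list
# Returns 0 on match, 1 on mismatch, 2 if the pid file cannot be read.
pidfile_matches() {
    [ -r "$1" ] || return 2
    recorded=$(cat "$1")
    [ "$recorded" = "$2" ]
}

# Values from the session above: the file recorded 25425, ps showed 25428.
tmp=$(mktemp)
echo 25425 > "$tmp"
if pidfile_matches "$tmp" 25428; then
    echo "pid file matches the running process"
else
    echo "pid file does not match the running process"
fi
rm -f "$tmp"
```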