
Difference between revisions of "Agenda-03-04-2013"

From EGIWiki
(Created page with " {| |- | [http://connect.ct.infn.it/egi-inspire-sa1/ Audio conference link] | ''Conference system is Adobe Connect, no password required.'' |- | [https://indico.egi.eu/indico/ma...")
 
 
(8 intermediate revisions by 3 users not shown)
Line 1: Line 1:
{|
{|
|-
|-
Line 5: Line 4:
| ''Conference system is Adobe Connect, no password required.''
| ''Conference system is Adobe Connect, no password required.''
|-
|-
| [https://indico.egi.eu/indico/materialDisplay.py?materialId=0&confId=1364 Audio conference details]  
| [https://indico.egi.eu/indico/materialDisplay.py?materialId=0&confId=1382 Audio conference details]  
| [ Indico page]
| [https://indico.egi.eu/indico/conferenceDisplay.py?confId=1382 Indico page]
|}
|}
   
   
 
{{TOC right}}
=== 1. Middleware releases and staged rollout ===
=== 1. Middleware releases and staged rollout ===


==== 1.1. Update on the status of EMI updates  ====
==== 1.1. Update on the status of EMI updates  ====
Cristina Aiftimiei reports about the EMI releasing activities.
 
* [https://twiki.cern.ch/twiki/bin/view/EMI/EmiEgiGOM#EMI_1_EMI_2_EMI_3_Updates_status Report on twiki]
* [http://indico.cern.ch/materialDisplay.py?contribId=3&materialId=slides&confId=197801 EMI 3 highlights]
* [https://twiki.cern.ch/twiki/pub/EMI/EMT/emi_3_Changes_Upgrades.pdf Reference for Major & Backward (In)Compatibilities Changes & Upgrade Paths]


==== 1.2. Staged Rollout  ====
==== 1.2. Staged Rollout  ====
* UMD-1 releases:
** EMI-1  (28.03.2013) [http://www.eu-emi.eu/emi-1-kebnekaise-updates/-/asset_publisher/Ir6q/content/update-24-28-03-2013-v-1-14-2-2 update 24]
*** Contains security updates for CREAM


* UMD-2 released:
* UMD-2 releases:
** EMI-2 [http://www.eu-emi.eu/emi-2-matterhorn/updates/-/asset_publisher/9AgN/content/update-9-21-02-2013-v-2-6-1-1 update 9]
** EMI-2 (21.02.2013) [http://www.eu-emi.eu/emi-2-matterhorn/updates/-/asset_publisher/9AgN/content/update-9-21-02-2013-v-2-6-1-1 update 9]
*** Still under SR: Cream-torque,  
*** Still under SR: Cream-torque,  
*** Ready for production: Cream  1.14.3 ; Wms    3.4.1.
*** Ready for production: Cream  1.14.3 ; Wms    3.4.1.
** IGE 3.1 and - 3.0:
** IGE 3.1 and - 3.0: Gsisshterm  - 1.3.4; Gram5 - 5.2.3  
*** Gsisshterm  - 1.3.4
** EMI-2 (02.04.2013) [http://www.eu-emi.eu/emi-2-matterhorn/updates/-/asset_publisher/9AgN/content/update-10-02-04-2013-v-2-6-2-1 update 10]
*** Gram5 - 5.2.3  
*** Contains security updates for CREAM


* Staged rollout test still continue for EMI WN tarball [https://rt.egi.eu/rt/Ticket/Display.html?id=4927 RT]  
* Staged rollout finished for EMI WN tarball [https://rt.egi.eu/rt/Ticket/Display.html?id=4927 RT]  
** Please contact me if you are willing to participate
** emi-wn-2.6.0-1_v1 [http://www.sysadmin.hep.ac.uk/wiki/EMI2Tarball EMI2 tarball]


* UMD-3 release:
* UMD-3 releases:
** Contains all the EMI-3 products (50 products) released on the 11th March [http://www.eu-emi.eu/emi-3-montebianco Release]
** Contains all the EMI-3 products (50 products) released on the 11th March [http://www.eu-emi.eu/emi-3-montebianco Release]
** Still undergoing campaign for Early Adopters.
*** Priorities according to [https://wiki.egi.eu/wiki/Agenda-04-03-2013 table]
** Already some products under SR
*** Bdii-site - 1.2.0


* Preparing a wiki with the issues found during SR activities regarding the upgrades from EMI-2 to EMI-3
* Wiki with the issues found during SR activities regarding the upgrades from EMI-2 to EMI-3


=== 2. Operational Issues  ===
=== 2. Operational Issues  ===
Line 43: Line 40:
==== 2.2 Updates from DMSU ====
==== 2.2 Updates from DMSU ====


===== ''done'' Jobs are Purged Prematurely from L&B Proxy =====
===== problems with EMI-3 STORM =====
 
Details [https://ggus.eu/tech/ticket_show.php?ticket=90930 GGUS 90930] and [https://ggus.eu/tech/ticket_show.php?ticket=92288 GGUS 92288]
 
A bug in the L&B proxy causes zero default timeout on purging jobs in state ''done''. As a result, the regular nightly purge removes jobs which happen to be in state 'done' at the moment from the proxy. Bookkeeping information is not lost because jobs are kept in the server. But some actions depending on talking to the proxy fail. In collocated L&B Proxy/Server scenarios, the bug takes no effect.
 
All current releases are affected. The cause, however, is known and currently fixed in the codebase.
 
It is possible to '''work around''' the issue by setting GLITE_LB_EXPORT_PURGE_ARGS explicitly in the YAIM configuration file:
 
GLITE_LB_EXPORT_PURGE_ARGS="--cleared 2d --aborted 15d --cancelled 15d --done 60d --other 60d"
 
as opposed to the current:
 
GLITE_LB_EXPORT_PURGE_ARGS="--cleared 2d --aborted 15d --cancelled 15d --other 60d"
 
===== some issues preventing the CREAM update to EMI-2 (sl6) =====
 
Details [https://gus.fzk.de/ws/ticket_info.php?ticket=92492 GGUS 92492]
 
Issues found by two resource centres (INFN-LNL-2 and INFN-T1) in NGI_IT:


1)
see details in [https://ggus.eu/ws/ticket_info.php?ticket=92819 GGUS #92819]
After configuring with yaim, many tomcat6 errors are logged in catalina.out:


java.lang.IllegalArgumentException: Document base /usr/share/tomcat6/webapps/ce-cream-es does not exist
For the moment '''we recommend NOT upgrading STORM to EMI-3''' (or performing a clean installation), because consecutive lcg-gt calls fail with a ''Requested file is busy'' error.
For example:


SEVERE: A web application appears to have started a thread named [Timer-4] but has failed to stop it. This is very likely to create a memory leak.
$ lcg-cp -b file:/home/enol/std.out -U srmv2 srm://test27.egi.cesga.es:8444/srm/managerv2?SFN=/dteam/t1
$ lcg-gt -b -T srmv2 srm://test27.egi.cesga.es:8444/srm/managerv2?SFN=/dteam/t1 gsiftp
gsiftp://test27.egi.cesga.es:2811//storage/dteam/t1
ee152752-020f-4598-b19e-a4bc56dcb5b8
$ lcg-gt -b -T srmv2 srm://test27.egi.cesga.es:8444/srm/managerv2?SFN=/dteam/t1 gsiftp
srm://test27.egi.cesga.es:8444/srm/managerv2?SFN=/dteam/t1: [SE][StatusOfGetRequest][SRM_FILE_BUSY] Requested file is busy (in an incompatible state with PTG)
lcg_gt: Invalid argument


SEVERE: A web application created a ''ThreadLocal with key of type [null] (value [org.apache.axiom.util.UIDGenerator$1@4a88e4c0])'' and a value of type ''[long[]] (value [[J@24edb15c])'' but failed to remove it when the web application was stopped. To prevent a memory leak, the ThreadLocal has been forcibly removed.
For each protocol published by the SE, the Nagios SRM probes run ''lcg-gt'', so the probes will return a CRITICAL state.


After a while the ce starts swapping and runs out of health.
===== WMS on sl6 doesn't work with ARGUS =====


WORKAROUND:
see details in [https://ggus.eu/tech/ticket_show.php?ticket=92773 GGUS #92773]
rm -f /usr/share/tomcat6/conf/Catalina/localhost/ce-cream-es.xml
   
   
/etc/init.d/tomcat6 stop && /etc/init.d/glite-ce-blah-parser stop && sleep 3 && /etc/init.d/glite-ce-blah-parser start && /etc/init.d/tomcat6 start
WMS on sl6 cannot use ARGUS as its authorization system: because SL6 uses the NSS library instead of OpenSSL, proxies are not handled correctly. It is a known problem in NSS, and will not be corrected soon.
 
SOLUTION: Have this fixed in the next update
 
2)
[root@ce01-lcg ~]# cat /etc/glite-ce-cream/log4j.properties | egrep 'MaxFileSize|MaxBackupIndex'
log4j.appender.fileout.MaxFileSize=1000KB
log4j.appender.fileout.MaxBackupIndex=20
 
These values are too small for a production environment: an entire job lifecycle doesn't fit in 20 MB of logs. Furthermore, any run of yaim restores the too-small values.
 
WORKAROUND:
modify /etc/glite-ce-cream/log4j.properties :
 
log4j.appender.fileout.MaxFileSize=10M
and
chattr +i /etc/glite-ce-cream/log4j.properties
 
SOLUTION: Have this fixed in the next update
 
3)
After configuring with yaim, services are up, but the ce remains unresponsive:
 
[sdalpra@ui01-ad32 ~]$ glite-ce-job-submit -a -r ce01-lcg.cr.cnaf.infn.it:8443/cream-lsf-dteam my.jdl
2013-03-14 14:41:23,596 FATAL - Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]


[sdalpra@ui01-ad32 ~]$ glite-ce-job-submit -a -r ce01-lcg.cr.cnaf.infn.it:8443/cream-lsf-dteam my.jdl
===== critical issues on EMI-3 VOMS: wait for the release of EMI-3 first update =====
2013-03-14 14:43:10,813 FATAL - Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]


Tomcat is actually in an ill state:
We have found some issues that affect the service functionality and, in particular, make the installation and operation of the
service problematic for deployments with a large number of VOs:


[root@ce01-lcg ~]# service tomcat6 status
[https://issues.infn.it:8443/browse/VOMS-229 VOMSES startup script fails in restarting VOMSES web app]
tomcat6 (pid 20389) is running... [ OK ]
[root@ce01-lcg ~]# service tomcat6 stop
Stopping tomcat6: [FAILED]


WORKAROUND:
[https://issues.infn.it:8443/browse/VOMS-230 VOMS Admin service incorrectly parses truststore refresh period from configuration]
service glite-ce-blah-parser stop
service tomcat6 stop && service glite-ce-blah-parser stop && sleep 3 && service glite-ce-blah-parser start && service tomcat6 start


Then it works:
[https://issues.infn.it:8443/browse/VOMS-231 Standalone VOMS Admin service deployment is not suitable for deployments with large number of VOs (> 10)]
[sdalpra@ui01-ad32 ~]$ glite-ce-job-submit -a -r ce01-lcg.cr.cnaf.infn.it:8443/cream-lsf-dteam my.jdl
https://ce01-lcg.cr.cnaf.infn.it:8443/CREAM691020988
 
SOLUTION: Have this fixed in the next update
 
===== Issues Upgrading Products Depending on GridSite from EMI-2 to EMI-3 =====
 
Details [https://ggus.eu/tech/ticket_show.php?ticket=92620 GGUS 92620]
 
Products with complex (often transitive) dependencies on multiple GridSite packages (such as a simultaneous dependency on <code>gridsite</code> '''and''' <code>gridsite-libs</code>) can experience problems upgrading from EMI-2 to EMI-3.
 
Products known to be affected:
* DPM
 
Products known not to be affected:
* stand-alone GridSite
* L&B
 
It is possible to '''work around''' the issue by uninstalling the problematic packages and installing afresh.
 
The GridSite PT is [https://savannah.cern.ch/bugs/?100916 preparing a fix] introducing a <code>gridsite-compat1.7</code> package to overcome the issue.
 
===== EMI-3: vomses start-up script doesn't properly work =====
 
Details [https://ggus.eu/tech/ticket_show.php?ticket=92666 GGUS ID 92666]
 
We noticed that the vomses start-up script doesn't properly work because the pid of the process is wrongly handled.
 
[root@cert-14 ~]# service vomses status
Checking vomses status: (not running) [FAILED]
 
actually the service is running with the pid 3206
 
# ps auxfwww | grep java
root 23227 0.0 0.0 103236 836 pts/0 S+ 10:06 0:00 \_ grep java
voms 3206 0.2 7.7 1708284 305284 ? Sl Mar19 2:14 java -Xmx256m -cp //var/lib/voms-admin/lib/activati....
 
but here there is a different number:
 
[root@cert-14 ~]# cat /var/lock/subsys/vomses.pid
3203
 
so also this doesn't work
[root@cert-14 ~]# service vomses stop
Stopping vomses: (not running) [FAILED]
 
when you need to restart the vomses service, you have to kill the process and delete the pid file, and then start it.
 
[root@cert-14 ~]# kill -9 3206
[root@cert-14 ~]# cat /var/lock/subsys/vomses.pid
3203
[root@cert-14 ~]# rm /var/lock/subsys/vomses.pid
rm: remove regular file `/var/lock/subsys/vomses.pid'? y
[root@cert-14 ~]# service vomses start
Starting vomses: [ OK ]
[root@cert-14 ~]# service vomses status
Checking vomses status: (not running) [FAILED]
[root@cert-14 ~]# cat /var/lock/subsys/vomses.pid
25425
[root@cert-14 ~]# ps auxfwww | grep java
root 25586 0.0 0.0 103236 836 pts/0 S+ 10:10 0:00 \_ grep java
voms 25428 20.7 4.2 1543472 167168 ? Sl 10:10 0:06 java -Xmx256m -cp //var/lib/voms-admin/lib/activation-1.1.ja


the developers are already aware of it
These issues have already been acknowledged by the developers and are currently being fixed. '''A new version of voms-admin-service'''
'''fixing the above issues is scheduled for release on April 18th'''. For this reason, and given the stability of the EMI 2 VOMS services,
'''we recommend NOT to upgrade now''' and wait for the version that will be released in the first EMI-3 update.


=== 3. AOB  ===
=== 3. AOB  ===

Latest revision as of 13:49, 3 April 2013

Audio conference link Conference system is Adobe Connect, no password required.
Audio conference details Indico page


1. Middleware releases and staged rollout

1.1. Update on the status of EMI updates

1.2. Staged Rollout

  • UMD-1 releases:
    • EMI-1 (28.03.2013) update 24
      • Contains security updates for CREAM
  • UMD-2 releases:
    • EMI-2 (21.02.2013) update 9
      • Still under SR: Cream-torque,
      • Ready for production: CREAM 1.14.3; WMS 3.4.1.
    • IGE 3.1 and 3.0: Gsisshterm 1.3.4; Gram5 5.2.3
    • EMI-2 (02.04.2013) update 10
      • Contains security updates for CREAM
  • Staged rollout finished for EMI WN tarball RT
  • UMD-3 releases:
    • Contains all the EMI-3 products (50 products) released on the 11th March Release
      • Priorities according to table
  • Wiki with the issues found during SR activities regarding the upgrades from EMI-2 to EMI-3

2. Operational Issues

2.2 Updates from DMSU

problems with EMI-3 STORM

see details in GGUS #92819

For the moment we recommend NOT upgrading STORM to EMI-3 (or performing a clean installation), because consecutive lcg-gt calls fail with a "Requested file is busy" error. For example:

$ lcg-cp -b file:/home/enol/std.out -U srmv2 srm://test27.egi.cesga.es:8444/srm/managerv2?SFN=/dteam/t1
$ lcg-gt -b -T srmv2 srm://test27.egi.cesga.es:8444/srm/managerv2?SFN=/dteam/t1 gsiftp
gsiftp://test27.egi.cesga.es:2811//storage/dteam/t1
ee152752-020f-4598-b19e-a4bc56dcb5b8
$ lcg-gt -b -T srmv2 srm://test27.egi.cesga.es:8444/srm/managerv2?SFN=/dteam/t1 gsiftp
srm://test27.egi.cesga.es:8444/srm/managerv2?SFN=/dteam/t1: [SE][StatusOfGetRequest][SRM_FILE_BUSY] Requested file is busy (in an incompatible state with PTG)
lcg_gt: Invalid argument

For each protocol published by the SE, the Nagios SRM probes run lcg-gt, so the probes will return a CRITICAL state.
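
A site that wants to check whether its own endpoint shows the behaviour in the transcript above could script the two consecutive calls. The following is only a minimal sketch: the function names is_srm_file_busy and check_consecutive_gt are hypothetical helpers, not part of lcg-utils, and the lcg-gt options are simply the ones used in the transcript. It assumes a valid proxy and a test file that already exists on the SE.

```shell
#!/bin/sh
# Hypothetical helper (not part of lcg-utils): succeeds when the text on
# stdin contains the SRM_FILE_BUSY status seen in the transcript.
is_srm_file_busy() {
    grep -q 'SRM_FILE_BUSY'
}

# Hypothetical wrapper: run lcg-gt twice on the same SURL, with the same
# options as in the transcript, and report whether the second call is busy.
check_consecutive_gt() {
    surl=$1
    # First call: result ignored, it only sets up the state for the second.
    lcg-gt -b -T srmv2 "$surl" gsiftp >/dev/null 2>&1 || true
    # Second call: inspect stdout and stderr for the busy status.
    if lcg-gt -b -T srmv2 "$surl" gsiftp 2>&1 | is_srm_file_busy; then
        echo "AFFECTED: second lcg-gt returned SRM_FILE_BUSY"
        return 1
    fi
    echo "OK: second lcg-gt did not report SRM_FILE_BUSY"
}
```

It would be called with a SURL on a test file, for example check_consecutive_gt 'srm://test27.egi.cesga.es:8444/srm/managerv2?SFN=/dteam/t1'. Keeping the text match in its own function lets it be tried against saved output without contacting the SE.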

WMS on sl6 doesn't work with ARGUS

see details in GGUS #92773

WMS on sl6 cannot use ARGUS as its authorization system: because SL6 uses the NSS library instead of OpenSSL, proxies are not handled correctly. It is a known problem in NSS, and will not be corrected soon.

critical issues on EMI-3 VOMS: wait for the release of EMI-3 first update

We have found some issues that affect the service functionality and, in particular, make the installation and operation of the service problematic for deployments with a large number of VOs:

VOMSES startup script fails in restarting VOMSES web app

VOMS Admin service incorrectly parses truststore refresh period from configuration

Standalone VOMS Admin service deployment is not suitable for deployments with large number of VOs (> 10)

These issues have already been acknowledged by the developers and are currently being fixed. A new version of voms-admin-service fixing the above issues is scheduled for release on April 18th. For this reason, and given the stability of the EMI 2 VOMS services, we recommend NOT to upgrade now and wait for the version that will be released in the first EMI-3 update.
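
To judge whether a server falls into the "large number of VOs (> 10)" case mentioned above, the configured VOs can be counted. This is a rough sketch under stated assumptions: it treats every sub-directory of a configuration directory given as an argument as one VO, and the function names count_vos and exceeds_vo_limit are hypothetical. The actual layout of a given VOMS installation should be checked before relying on the result.

```shell
#!/bin/sh
# Hypothetical helper: count the sub-directories of a configuration
# directory, treating each one as a configured VO (an assumption).
count_vos() {
    conf_dir=$1
    n=0
    for d in "$conf_dir"/*/; do
        # An unmatched glob stays literal and is skipped by the -d test.
        if [ -d "$d" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Succeeds when the VO count is above the limit (default 10, the
# threshold quoted in the issue title above).
exceeds_vo_limit() {
    [ "$(count_vos "$1")" -gt "${2:-10}" ]
}
```

For example, exceeds_vo_limit /path/to/voms/config && echo "wait for the first EMI-3 update", where the path is a placeholder for the site's own configuration directory.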

3. AOB

3.2 Next meeting

4. Minutes

Minutes available online