Difference between revisions of "Agenda-03-12-2012"

From EGIWiki
(1.1. Update on the status of EMI updates)
Line 1: Line 1:
{{Template:Op menubar}}
+
{{Template:Op menubar}}  
[[Category:Grid Operations Meetings]]
 
  
=Detailed agenda: Grid Operations Meeting 03 December 2012=
+
= Detailed agenda: Grid Operations Meeting 03 December 2012 =
  
 
{|
 
{|
Line 13: Line 12:
 
|}
 
|}
  
 +
<br>
  
=== 1. Middleware releases and staged rollout===
+
=== 1. Middleware releases and staged rollout ===
====1.1. Update on the status of EMI updates ====
+
 
Cristina Aiftimiei (EMI) reports on the EMI updates. [https://twiki.cern.ch/twiki/bin/view/EMI/EmiEgiGOM#Status_03_12_2012 Twiki page with more information]
+
==== 1.1. Update on the status of EMI updates ====
 +
 
 +
Cristina Aiftimiei (EMI) reports on the EMI updates. [https://twiki.cern.ch/twiki/bin/view/EMI/EmiEgiGOM#Status_03_12_2012 Twiki page with more information]  
  
 
==== 1.2. Staged Rollout  ====
 
==== 1.2. Staged Rollout  ====
  
===2. Operational Issues ===
+
=== 2. Operational Issues ===
==== 2.1 Unsupported middleware update ====
+
 
===== Middleware services planned to be upgraded by end of November =====
+
==== 2.1 Unsupported middleware update ====
There are currently (last check Dec 1st) 28 sites, who declared a plan to upgrade their services by the end of November, still with unsupported middleware, without a downtime on those services.<br>
+
 
By today EGI Operations will open a new batch of NGI GGUS tickets, asking:
+
===== Middleware services planned to be upgraded by end of November =====
* To open a downtime for the unsupported services '''by Friday COB'''
+
 
* Sites with late plans (beyond November) should be already in downtime, if any of these sites have not done so they must open the downtime '''as soon as possible''', possibly today COB
+
There are currently (last check Dec 1st) 28 sites, who declared a plan to upgrade their services by the end of November, still with unsupported middleware, without a downtime on those services.<br> By today EGI Operations will open a new batch of NGI GGUS tickets, asking:  
* Sites with ''CLASSIC SE'' service types registered in GOCDB will be asked to remove those services.
+
 
 +
*To open a downtime for the unsupported services '''by Friday COB'''  
 +
*Sites with late plans (beyond November) should be already in downtime, if any of these sites have not done so they must open the downtime '''as soon as possible''', possibly today COB  
 +
*Sites with ''CLASSIC SE'' service types registered in GOCDB will be asked to remove those services.
 +
 
 +
===== Unsupported VOMS  =====
 +
 
 +
VOMS is a critical services for the VOs, VOMS tickets status will be assessed one by one. Never the less sites deploying unsupported VOMS '''must provide an upgrade plans, or the technical reasons to delay the upgrade'''.
 +
 
 +
===== DPM LFC and WN  =====
 +
 
 +
The middleware services that are unsupported since the end of November will raise ''critical alarms'' on the ROD dashboard by the end of this week. The probes are ready, currently the testing is being finalized, and Operations portal team is working for their integration in the operational dashboard.
 +
 
 +
ROD teams have to follow the following [https://wiki.egi.eu/wiki/PROC01#Escalation_for_operational_problem_with_unsupported_MW_at_site.C2.A0 escalation procedure], to follow up with the unsupported middleware alarms. The overall procedure for the unsupported middleware decommissioning is [[PROC16]].
 +
 
 +
==== 2.2 Updates from DMSU  ====
  
===== Unsupported VOMS =====
+
===== FTS jobs abort with "No site found for host xxx.yyy" error  =====
  
VOMS is a critical services for the VOs, VOMS tickets status will be assessed one by one. Never the less sites deploying unsupported VOMS '''must provide an upgrade plans, or the technical reasons to delay the upgrade'''.
+
Details [https://ggus.eu/tech/ticket_show.php?ticket=87929 GGUS #87929]
  
===== DPM LFC and WN =====
+
From time to time, some FTS transfers fail with the message above. The problem was reported at CNAF, IN2P3, and GRIDKA, noticed by Atlas, CMS, and LHCb VOs. The problem is appearing and disappearing in rather short and unpredictable intervals.
  
The middleware services that are unsupported since the end of November will raise ''critical alarms'' on the ROD dashboard by the end of this week. The probes are ready, currently the testing is being finalized, and Operations portal team is working for their integration in the operational dashboard.
+
Exact reasons are not yet understood, we keep investigating. Reports from sites affected by similar problem will be appreciated.  
  
ROD teams have to follow the following [https://wiki.egi.eu/wiki/PROC01#Escalation_for_operational_problem_with_unsupported_MW_at_site.C2.A0 escalation procedure], to follow up with the unsupported middleware alarms.
+
''Update Nov 20: The user reports that both problem disappeared, probably fixed together.''
The overall procedure for the unsupported middleware decommissioning is [[PROC16]].
 
  
==== 2.2 Updates from DMSU ====
+
===== LCMAPS-plugins-c-pep in glexec fails at RH6 based WNs  =====
  
===== FTS jobs abort with "No site found for host xxx.yyy" error =====
+
Details [https://ggus.eu/tech/ticket_show.php?ticket=88520 GGUS #88520]
  
Details [https://ggus.eu/tech/ticket_show.php?ticket=87929 GGUS #87929]
+
Due to replacement of OpenSSL with NSS in the RH6 based distributions, LCMAPS-plugins-c-pep invoked from glexec fails on talking to Argus PEP via curl.  
  
From time to time, some FTS transfers fail with the message above.
+
This is a known issue, as mentioned in [http://www.eu-emi.eu/products/-/asset_publisher/1gkD/content/glexec-wn EMI glexec release notes] however, the workaround is not described in a usable way there.  
The problem was reported at CNAF, IN2P3, and GRIDKA, noticed by Atlas, CMS,
 
and LHCb VOs. The problem is appearing and disappearing in rather short
 
and unpredictable intervals.
 
  
Exact reasons are not yet understood, we keep investigating.
+
Once we make sure we understand it properly and that the fix works, it will be documented properly at UMD pages and passed to the developers to
Reports from sites affected by similar problem will be appreciated.
 
  
''Update Nov 20: The user reports that both problem disappeared, probably fixed together.''
+
#fix the documentation
 +
#try to deploy the workaround automatically when NSS-poisoned system is detected
  
===== LCMAPS-plugins-c-pep in glexec fails at RH6 based WNs =====
+
'''UPDATE Nov 19th''': the fix is now well explained in the [http://www.eu-emi.eu/products/-/asset_publisher/1gkD/content/glexec-wn#Known_issues known issues section] and it will be included in a future yaim update
  
Details [https://ggus.eu/tech/ticket_show.php?ticket=88520 GGUS #88520]
+
===== WMS does not work with ARC CE 2.0.1  =====
  
Due to replacement of OpenSSL with NSS in the RH6 based distributions,
+
Details [https://ggus.eu/tech/ticket_show.php?ticket=88630 GGUS #88630], further info [https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3062 Condor ticket #3062]
LCMAPS-plugins-c-pep invoked from glexec fails on talking to Argus PEP
 
via curl.
 
  
This is a known issue, as mentioned in
+
The format of jobid changed in in the ARC CE release 12. This is not recognised by Condor prior to version 7.8.3. However, current EMI-1 WMS uses Condor 7.8.0. This breaks submission from WMS to ARC CE.  
[http://www.eu-emi.eu/products/-/asset_publisher/1gkD/content/glexec-wn EMI glexec release notes]
 
however, the workaround is not described in a usable way there.
 
  
Once we make sure we understand it properly and that the fix works,
+
The problem hence affects CMS SAM tests as well as their production jobs.  
it will be documented properly at UMD pages and passed to the developers
 
to
 
# fix the documentation
 
# try to deploy the workaround automatically when NSS-poisoned system is detected
 
'''UPDATE Nov 19th''': the fix is now well explained in the [http://www.eu-emi.eu/products/-/asset_publisher/1gkD/content/glexec-wn#Known_issues known issues section] and it will be included in a future yaim update
 
  
===== WMS does not work with ARC CE 2.0.1 =====
+
Hence updates to ARC CE 12 should be done carefully before the Condor update is available from EMI.  
  
Details [https://ggus.eu/tech/ticket_show.php?ticket=88630 GGUS #88630],
+
'''UPDATE Nov 26th''': on a test WMS it was installed Condor 7.8.6, and the submission to ARC seemed to work fine; since this WMS isn't available any more, further deeper tests should be performed, perhaps using the EMI-TESTBED infrastructure
further info [https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3062 Condor ticket #3062]
 
  
The format of jobid changed in in the ARC CE release 12.
+
=== 3. AOB  ===
This is not recognised by Condor prior to version 7.8.3.
 
However, current EMI-1 WMS uses Condor 7.8.0.
 
This breaks submission from WMS to ARC CE.
 
  
The problem hence affects CMS SAM tests as well as their production jobs.
+
==== 3.1 Next meeting  ====
  
Hence updates to ARC CE 12 should be done carefully before the Condor update
+
2 weeks time would be Dec 17, the day before OMB.  
is available from EMI.
 
  
'''UPDATE Nov 26th''': on a test WMS it was installed Condor 7.8.6, and the submission to ARC seemed to work fine; since this WMS isn't available any more, further deeper tests should be performed, perhaps using the EMI-TESTBED infrastructure
+
*We would need to skip to January 7th
 +
*Intermediate proposal: '''Friday Dec 14th'''
  
===3. AOB ===
+
=== 4. Minutes  ===
==== 3.1 Next meeting ====
 
2 weeks time would be Dec 17, the day before OMB.
 
* We would need to skip to January 7th
 
* Intermediate proposal: '''Friday Dec 14th'''
 
  
=== 4. Minutes ===
+
[[Category:Grid_Operations_Meetings]]

Revision as of 13:52, 3 December 2012




Detailed agenda: Grid Operations Meeting 03 December 2012

EVO direct link Pwd: gridops
EVO details Indico page


1. Middleware releases and staged rollout

1.1. Update on the status of EMI updates

Cristina Aiftimiei (EMI) reports on the EMI updates. Twiki page with more information

1.2. Staged Rollout

2. Operational Issues

2.1 Unsupported middleware update

Middleware services planned to be upgraded by end of November

As of the last check on Dec 1st, 28 sites that declared a plan to upgrade their services by the end of November are still running unsupported middleware, without a downtime declared on those services.
By the end of today, EGI Operations will open a new batch of NGI GGUS tickets, asking:

  • To open a downtime for the unsupported services by Friday COB
  • Sites with late plans (beyond November) should already be in downtime; any of these sites that have not done so must open the downtime as soon as possible, preferably by COB today
  • Sites with CLASSIC SE service types registered in GOCDB will be asked to remove those services.

Unsupported VOMS

VOMS is a critical service for the VOs, so the status of VOMS tickets will be assessed one by one. Nevertheless, sites deploying unsupported VOMS must provide an upgrade plan, or the technical reasons for delaying the upgrade.

DPM, LFC and WN

Middleware services that have been unsupported since the end of November will raise critical alarms on the ROD dashboard by the end of this week. The probes are ready, testing is being finalized, and the Operations Portal team is working on integrating them into the operational dashboard.

ROD teams must use the escalation procedure to follow up on the unsupported middleware alarms. The overall procedure for decommissioning unsupported middleware is PROC16.

2.2 Updates from DMSU

FTS jobs abort with "No site found for host xxx.yyy" error

Details GGUS #87929

From time to time, some FTS transfers fail with the message above. The problem was reported at CNAF, IN2P3, and GRIDKA, and was noticed by the ATLAS, CMS, and LHCb VOs. It appears and disappears at rather short and unpredictable intervals.

The exact reasons are not yet understood; we keep investigating. Reports from sites affected by a similar problem will be appreciated.

Update Nov 20: The user reports that both problems disappeared, probably fixed together.

LCMAPS-plugins-c-pep in glexec fails at RH6 based WNs

Details GGUS #88520

Due to the replacement of OpenSSL with NSS in RH6-based distributions, LCMAPS-plugins-c-pep invoked from glexec fails when talking to the Argus PEP via curl.

This is a known issue, as mentioned in the EMI glexec release notes; however, the workaround is not described in a usable way there.

Once we are sure that we understand it properly and that the fix works, it will be documented on the UMD pages and passed to the developers to:

  1. fix the documentation
  2. try to deploy the workaround automatically when an NSS-poisoned system is detected

UPDATE Nov 19th: the fix is now well explained in the known issues section, and it will be included in a future yaim update.

WMS does not work with ARC CE 2.0.1

Details GGUS #88630, further info Condor ticket #3062

The format of the jobid changed in ARC CE release 12. The new format is not recognised by Condor versions prior to 7.8.3. However, the current EMI-1 WMS uses Condor 7.8.0. This breaks submission from WMS to ARC CE.

The problem therefore affects CMS SAM tests as well as CMS production jobs.

Updates to ARC CE 12 should therefore be done with care until the Condor update is available from EMI.
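
The version constraint described above can be illustrated with a short check. This is only a sketch: the helper names (parse_version, condor_handles_new_jobid) are hypothetical and not part of WMS, Condor or ARC, and the only facts used are the version numbers quoted in this agenda (7.8.3 as the first Condor version recognising the new jobid format, 7.8.0 in the current EMI-1 WMS, 7.8.6 on the test WMS).

```python
def parse_version(text):
    """Turn a dotted version string such as '7.8.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# First Condor version reported in this agenda to recognise the new jobid format.
MIN_CONDOR_FOR_NEW_JOBID = parse_version("7.8.3")


def condor_handles_new_jobid(condor_version):
    """Return True if the given Condor version is at least 7.8.3."""
    return parse_version(condor_version) >= MIN_CONDOR_FOR_NEW_JOBID


# Versions mentioned in the agenda: EMI-1 WMS (7.8.0), threshold (7.8.3), test WMS (7.8.6).
for version in ("7.8.0", "7.8.3", "7.8.6"):
    status = "recognises new jobid" if condor_handles_new_jobid(version) else "too old"
    print(f"Condor {version}: {status}")
```

Comparing tuples of integers rather than raw strings avoids mis-ordering versions such as 7.8.10 and 7.8.3.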

UPDATE Nov 26th: Condor 7.8.6 was installed on a test WMS, and submission to ARC seemed to work fine; since this WMS is no longer available, further, deeper tests should be performed, perhaps using the EMI-TESTBED infrastructure.

3. AOB

3.1 Next meeting

In two weeks' time it would be Dec 17, the day before the OMB.

  • We would need to skip to January 7th
  • Intermediate proposal: Friday Dec 14th
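
The date arithmetic behind these options can be checked with a few lines of Python. This is a sketch under stated assumptions: the variable names are illustrative, and the year 2013 for "January 7th" is inferred from the meeting date, not stated in the agenda.

```python
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

this_meeting = date(2012, 12, 3)                    # date of this agenda
regular_slot = this_meeting + timedelta(weeks=2)    # the usual two-week interval
fallback_slot = date(2013, 1, 7)                    # "January 7th"; year assumed
proposal = date(2012, 12, 14)                       # intermediate proposal

for label, day in (("regular slot", regular_slot),
                   ("fallback", fallback_slot),
                   ("proposal", proposal)):
    gap = (day - this_meeting).days
    print(f"{label}: {day.isoformat()} ({WEEKDAYS[day.weekday()]}), "
          f"{gap} days after this meeting")
```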

4. Minutes