
Difference between revisions of "Agenda-16-04-2012"

(Created page with "{{Template:Op menubar}} = Detailed agenda: Grid Operations Meeting 23 March 2012 14h00 Amsterdam time = '''Note:''' This meeting is on Friday, due of clashing with the Communi...")
 
 
(22 intermediate revisions by 3 users not shown)
Line 1: Line 1:
{{Template:Op menubar}}  
{{Template:Op menubar}}  
[[Category:Grid Operations Meetings]]
= Detailed agenda: Grid Operations Meeting 16 April 2012 14h00 Amsterdam time  =


= Detailed agenda: Grid Operations Meeting 23 March 2012 14h00 Amsterdam time  =
'''Note:''' This meeting is on Friday because of a clash with the Community Forum 2012.
{|
{|
|-
|-
| [http://evo.caltech.edu/evoNext/koala.jnlp?meeting=MIM9Ms2a22DID89D9lDa9I EVO direct link]  
| [https://www.egi.eu/indico/materialDisplay.py?materialId=1&confId=1014 EVO direct link]  
| Pwd: '''gridops'''<br>
| Pwd: '''gridops'''<br>
|-
|-
| [https://www.egi.eu/indico/getFile.py/access?resId=0&materialId=1&confId=958 EVO details]  
| [https://www.egi.eu/indico/materialDisplay.py?materialId=0&confId=1014 EVO details]  
| [https://www.egi.eu/indico/conferenceDisplay.py?confId=958 Indico page]
| [https://www.egi.eu/indico/conferenceDisplay.py?confId=1014 Indico page]
|}
|}


<br>  
<br>  


== 1. Actions from previous meeting ==
== 1. Middleware releases and staged rollout  ==
* Create a wiki page to collect EMI WN and UI issues:[[EMI WN and UI known issues]]
 
* Check the sites deploying gLite3.1 WNs: [[Worker Nodes gLite3.1]]
=== 1.1 EMI release status ===
 
=== 1.2 Staged Rollout ===
 
*Currently no products in staged rollout.
*LFC_oracle 1.8.2 in the final stage of Verification by one Tier1.
*IGE SAGA adaptors (http://www.saga-project.org/) are currently under verification, but we need to know whether any sites are interested in taking part in the staged rollout. One site has already expressed interest.


== 2. Middleware releases and staged rollout  ==
Upcoming EMI update 15 (see Cristina's presentation)


=== 2.1 EMI-1 release status ===
EMI2 is scheduled for the 7th of May.


=== 2.2 Staged Rollout  ===
== 2. Operational Issues  ==
===2.1 BDII instability ===


==== UMD 1.6.0 ====
*On April 12th several Site-BDIIs in Ibergrid and NGI_IT were malfunctioning:
** [https://ggus.eu/tech/ticket_show.php?ticket=81235 Ibergrid ticket]
** [https://ggus.eu/ws/ticket_info.php?ticket=81225 example of NGI_IT alarm ticket]
*Symptoms:
** Site-BDIIs failing SAM probes; attempts to restart them also failed
** High CPU usage of LDAP services, even after the restart
** gLite3.2 and EMI services affected
** Similar problems may have affected the Top-BDIIs: the logs show that during the same period some of the Top-BDIIs in the Ibergrid HA cluster were removed because they were not responsive.
* During the same hours a GEANT network problem was reported, caused in particular by a router in Geneva (12 April 2012, around 16:00 CET)


Real-time provisioning status for UMD 1.6.0 is here: {{UMDReleasePlan|version=1.6.0}}
Possible connection between the problems (G.Borges):
* GEANT problem reported as intermittent
* Connections with clients, broken by the network problem, remained pending (default timeout: 60 seconds)
* Clients reconnected to the BDIIs, multiplying the number of connections to the server and producing an effect similar to a DoS attack


Some comments follow:
For all NGIs:
*IGE2 (Globus 5.2): most components are expected to be included in the UMD1.6 release. Presently just waiting for the finalization of the staged rollout reports.
* Assess the status of Site-BDIIs and Top-BDIIs and report similar problems
*IGE2 - gridftp server issue, see below.
* If possible, upgrade BDII instances to BDII Site 1.1.0
*VOMS 2.0.7 and Gridsite 1.7.19, which were impacting the WMS, are now being fast-verified. EA teams have already tested them, so software provisioning will be quick and they can be included in the UMD release.
** This version adds a dependency on OpenLDAP 2.4 (increased stability), and reduces memory and disk usage
*BDII core 1.3.0 was also tested by EA teams and should supersede BDII core 1.2.0. The latest version fixes, among other things, some memory leaks.
** '''Note:''' This may not solve this specific issue, but it will improve general BDII performance
*IGE SAGA adaptors - mails have been sent (reminder sent yesterday) to get sites and/or users interested in this component, to come forward and do early adoption.


===== GTK 5.2 and globus-gridftp patch  =====
=== 2.2 VOMS Admin fails to notify about expiring membership ===
==== 2.2.1 Description of the problem ====
A [https://savannah.cern.ch/bugs/?93255 bug] was found affecting the VOMS Admin process of sending warning messages when user membership is about to expire. The VOMS Admin versions affected by the bug are:
* gLite 3.2: versions 2.5.3-1 and 2.5.5-1
* EMI 1: VOMS Admin 2.6.1
These VOMS Admin versions enforce user membership expiration (by default every 12 months). The bug prevents the warning email from being sent before a user's membership expires, so VO managers are notified only once the user has been suspended, after the membership has expired.


Cristina Aiftimiei wrote in an Email on 16 March 2012:  
==== 2.2.2 Possible workarounds ====
# Extension of VO membership up to '''30 September 2012'''
#* For users whose membership is expiring before 30 Sep 2012
#* This will allow VOMS administrators to upgrade their services once the bug fix is available
#* [https://wiki.italiangrid.it/twiki/bin/view/VOMS/KnownIssues#How_to_extend_membership_for_a_l instructions]
# Manual notification of the list of users with expiring membership
#* Every month, VOMS server administrators should produce, for every VO affected by the bug, a report of the users whose memberships expire during that month, and communicate those lists to the VO managers.


If you include GTK 5.2 -> you can remove uberftp, but this will break DPM & StoRM production.
==== 2.2.3 AuP grace period ====
If you include GTK 5.2 + the latest globus-gridftp-server -> you can remove uberftp, and DPM production should work (Mattias had the confirmation)
For the VOs using the VOMS releases listed above, users are asked to re-sign the VO AuP as an additional step to renew their membership. The default grace period is currently 24 hours; VOMS administrators are strongly advised to extend it to 7 days. A warning email is sent to users at the beginning of the grace period, so this configuration change gives them more time to act (''users who do not re-sign the AuP are suspended at the end of the grace period'').
but for StoRM we are still waiting for such a confirmation (today).
* [https://wiki.italiangrid.it/twiki/bin/view/VOMS/KnownIssues#VOMS_Admin_Sign_AUP_default_grac Instructions to perform this configuration]


<br> The currently provisioned IGE Globus GridFTP contains the patch mentioned above. DPM and StoRM in production have been tested against it and work fine.
=== 2.3 New GOCDB roles in production ===


An older version of Uberftp is currently in our production repository to protect EMI-UI from the version in EPEL, which is compiled against the backwards-incompatible version of Globus GridFTP. As soon as the patched version of Globus GridFTP is published, this old version of Uberftp will be removed.
New roles were rolled into production on April 10th.
* [[GOCDB/Release4/Development/NewRoles| New Roles detailed description]]
GOCDB users were automatically assigned to the new roles:
* At Site level
** Site Administrator -> Site Operations Manager ''(in order to preserve the users' previous permissions)''
** Security Officer -> Site Security Officer
*At Regional level
**Regional Operations Staff -> "Regional Staff (ROD)"
** "Deputy Regional Manager" -> "NGI Operations Deputy Manager"
** "Regional Manager" -> "NGI Operations Manager"
** "Security Officer" -> "NGI Security Officer"
** No users have been automatically assigned to the new role "Regional First Line Support"


== 3. Operational Issues  ==
'''Important:''' Nothing has changed for the operations tools after the switch to the new roles. If you notice any different behavior in the tools, please report it in a GGUS ticket.
=== TMPDIR policy proposal ===
 
Follow up of the discussion in the previous meeting: [[Agenda-12-03-2012]]
==== 2.3.1 Problem reported and solved ====
* The location for the jobs to write large files (where the storage space required in the VO ID card is available) is stored in the ''$TMPDIR'' environment variable.
* ROD teams were not able to access the ROD dashboard
** Jobs' wrappers should be aware of this, and should not alter this variable (it is supposed to be set up by the Site Administrator)
** Bug fixed and rolled into production on Friday April 13th by the Operations Portal team
* If the variable is not set in the environment, the jobs should assume that the needed disk space is available in the job's ''WORKDIR''.
* New roles were not showing in the User details view of the GOCDB GUI
* For now, consider this ''proposal'' limited to the gLite middleware. It needs to be evaluated for other middleware.
** Bug fixed and rolled into production on Friday April 13th by the GOCDB team
** In principle, it would be reasonable to have a uniform behavior.


== 4. AOB  ==
== 4. AOB  ==


=== 4.1 Next meetings  ===
=== 4.1 Next meetings  ===
[[Category:GridOpsMeeting]]

Latest revision as of 17:18, 29 November 2012


Detailed agenda: Grid Operations Meeting 16 April 2012 14h00 Amsterdam time

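EVO direct link: https://www.egi.eu/indico/materialDisplay.py?materialId=1&confId=1014 (Pwd: gridops)
EVO details: https://www.egi.eu/indico/materialDisplay.py?materialId=0&confId=1014
Indico page: https://www.egi.eu/indico/conferenceDisplay.py?confId=1014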


1. Middleware releases and staged rollout

1.1 EMI release status

1.2 Staged Rollout

  • Currently no products in staged rollout.
  • LFC_oracle 1.8.2 in the final stage of Verification by one Tier1.
  • IGE SAGA adaptors (http://www.saga-project.org/) are currently under verification, but we need to know whether any sites are interested in taking part in the staged rollout. One site has already expressed interest.

Upcoming EMI update 15 (see Cristina's presentation)

EMI2 is scheduled for the 7th of May.

2. Operational Issues

2.1 BDII instability

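  • On April 12th several Site-BDIIs in Ibergrid and NGI_IT were malfunctioning:
    • Ibergrid ticket: https://ggus.eu/tech/ticket_show.php?ticket=81235
    • Example of an NGI_IT alarm ticket: https://ggus.eu/ws/ticket_info.php?ticket=81225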
  • Symptoms:
    • Site-BDIIs failing SAM probes; attempts to restart them also failed
    • High CPU usage of LDAP services, even after the restart
    • gLite3.2 and EMI services affected
    • Similar problems may have affected the Top-BDIIs: the logs show that during the same period some of the Top-BDIIs in the Ibergrid HA cluster were removed because they were not responsive.
  • During the same hours a GEANT network problem was reported, caused in particular by a router in Geneva (12 April 2012, around 16:00 CET)

Possible connection between the problems (G.Borges):

  • GEANT problem reported as intermittent
  • Connections with clients, broken by the network problem, remained pending (default timeout: 60 seconds)
  • Clients reconnected to the BDIIs, multiplying the number of connections to the server and producing an effect similar to a DoS attack
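
The mechanism suggested above can be illustrated with a rough model. This is only a sketch under stated assumptions, not a measurement of what happened: the number of clients and the retry interval are hypothetical values chosen for illustration, and only the 60-second timeout comes from the notes above. It counts how many broken connections a server would still be holding if every client opened a fresh connection at a fixed interval while the old ones waited for the timeout.

```python
def pending_connections(clients, retry_interval_s, timeout_s=60, elapsed_s=60):
    """Count broken connections still held by the server after elapsed_s seconds.

    Assumptions (hypothetical, for illustration only): every client opens a
    new connection at t = 0, r, 2r, ... where r = retry_interval_s, and each
    broken connection stays pending on the server for timeout_s seconds.
    """
    # Connections opened so far by one client (attempts at 0, r, 2r, ...).
    opened = elapsed_s // retry_interval_s + 1
    # Connections opened at or before (elapsed_s - timeout_s) have timed out.
    cutoff = elapsed_s - timeout_s
    timed_out = 0 if cutoff < 0 else cutoff // retry_interval_s + 1
    return clients * (opened - timed_out)


if __name__ == "__main__":
    # 100 hypothetical clients retrying every 10 seconds.
    for t in (30, 60, 300):
        print(t, pending_connections(clients=100, retry_interval_s=10, elapsed_s=t))
```

Under these assumptions the server ends up holding about timeout / retry interval connections per client, which is why a long timeout combined with quick reconnects can look like a DoS attack.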

For all NGIs:

  • Assess the status of Site-BDIIs and Top-BDIIs and report similar problems
  • If possible, upgrade BDII instances to BDII Site 1.1.0
    • This version adds a dependency on OpenLDAP 2.4 (increased stability), and reduces memory and disk usage
    • Note: This may not solve this specific issue, but it will improve general BDII performance

2.2 VOMS Admin fails to notify about expiring membership

2.2.1 Description of the problem

A bug was found affecting the VOMS Admin process of sending warning messages when user membership is about to expire. The VOMS Admin versions affected by the bug are:

  • gLite 3.2: versions 2.5.3-1 and 2.5.5-1
  • EMI 1: VOMS Admin 2.6.1

These VOMS Admin versions enforce user membership expiration (by default every 12 months). The bug prevents the warning email from being sent before a user's membership expires, so VO managers are notified only once the user has been suspended, after the membership has expired.

2.2.2 Possible workarounds

  1. Extension of VO membership up to 30 September 2012
    • For users whose membership is expiring before 30 Sep 2012
    • This will allow VOMS administrators to upgrade their services once the bug fix is available
    • Instructions: https://wiki.italiangrid.it/twiki/bin/view/VOMS/KnownIssues#How_to_extend_membership_for_a_l
  2. Manual notification of the list of users with expiring membership
    • Every month, VOMS server administrators should produce, for every VO affected by the bug, a report of the users whose memberships expire during that month, and communicate those lists to the VO managers.
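
Both workarounds come down to filtering the membership list by expiry date. The sketch below shows that filtering step only. It assumes the administrator has already exported the members of a VO as (DN, expiry date) pairs; how that export is obtained from VOMS Admin is not covered here, and the function names and sample DNs are hypothetical.

```python
from datetime import date


def expiring_in_month(members, year, month):
    """Workaround 2: DNs whose membership expires during the given month."""
    return sorted(dn for dn, expires in members
                  if (expires.year, expires.month) == (year, month))


def needing_extension(members, cutoff=date(2012, 9, 30)):
    """Workaround 1: DNs whose membership expires before the cutoff date."""
    return sorted(dn for dn, expires in members if expires < cutoff)


# Hypothetical sample data standing in for an export from the VOMS server.
members = [
    ("/DC=example/CN=Alice", date(2012, 5, 14)),
    ("/DC=example/CN=Bob", date(2012, 5, 30)),
    ("/DC=example/CN=Carol", date(2012, 11, 2)),
]

if __name__ == "__main__":
    print("Expiring in May 2012:", expiring_in_month(members, 2012, 5))
    print("Expiring before 30 Sep 2012:", needing_extension(members))
```

The monthly list would be sent to the VO managers; the second list identifies the users whose membership would be extended under the first workaround.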

2.2.3 AuP grace period

For the VOs using the VOMS releases listed above, users are asked to re-sign the VO AuP as an additional step to renew their membership. The default grace period is currently 24 hours; VOMS administrators are strongly advised to extend it to 7 days. A warning email is sent to users at the beginning of the grace period, so this configuration change gives them more time to act (users who do not re-sign the AuP are suspended at the end of the grace period).

2.3 New GOCDB roles in production

New roles were rolled into production on April 10th.

GOCDB users were automatically assigned to the new roles:

  • At Site level
    • Site Administrator -> Site Operations Manager (in order to preserve the users' previous permissions)
    • Security Officer -> Site Security Officer
  • At Regional level
    • Regional Operations Staff -> "Regional Staff (ROD)"
    • "Deputy Regional Manager" -> "NGI Operations Deputy Manager"
    • "Regional Manager" -> "NGI Operations Manager"
    • "Security Officer" -> "NGI Security Officer"
    • No users have been automatically assigned to the new role "Regional First Line Support"
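
For scripts that still refer to the old role names, the automatic assignment listed above can be written down as a lookup table. This is an illustrative sketch, not GOCDB's implementation: the "site"/"region" scope labels and the function name are assumptions introduced here. The scope is part of the key because "Security Officer" existed at both levels and maps to a different new role in each.

```python
# Old role -> new role, keyed by (scope, old role name) as listed above.
ROLE_MIGRATION = {
    ("site", "Site Administrator"): "Site Operations Manager",
    ("site", "Security Officer"): "Site Security Officer",
    ("region", "Regional Operations Staff"): "Regional Staff (ROD)",
    ("region", "Deputy Regional Manager"): "NGI Operations Deputy Manager",
    ("region", "Regional Manager"): "NGI Operations Manager",
    ("region", "Security Officer"): "NGI Security Officer",
}


def migrate_role(scope, old_role):
    """Return the new role name, or None if no automatic mapping is listed."""
    return ROLE_MIGRATION.get((scope, old_role))


if __name__ == "__main__":
    print(migrate_role("site", "Security Officer"))
    print(migrate_role("region", "Security Officer"))
```

"Regional First Line Support" does not appear as a target in the table because no users were assigned to it automatically.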

Important: Nothing has changed for the operations tools after the switch to the new roles. If you notice any different behavior in the tools, please report it in a GGUS ticket.

2.3.1 Problem reported and solved

  • ROD teams were not able to access the ROD dashboard
    • Bug fixed and rolled into production on Friday April 13th by the Operations Portal team
  • New roles were not showing in the User details view of the GOCDB GUI
    • Bug fixed and rolled into production on Friday April 13th by the GOCDB team

4. AOB

4.1 Next meetings