Detailed agenda: Grid Operations Meeting 22 October 2012
|EVO direct link|| Pwd: gridops|
|EVO details||Indico page|
1. Middleware releases and staged rollout
1.1. Update on the status of EMI updates
Cristina (EMI) sent apologies for this meeting. The products listed in the previous meeting are all confirmed and have passed EMI certification.
1.2. Staged Rollout
UMD 1.9.0 release date is 29 October (next week), and freeze date is today (allow until tomorrow for arrival of late staged rollout reports):
- Ready for the release:
- BDII Core 1.4.0
- CREAM 1.13.5 (note that due to mismatch in version this corresponds to EMI release 1.13.4)
- WMS 3.3.8
- GFAL/lcg_utils 1.13.0
- IGE Gridway 5.10.2
- Waiting for verification and/or staged rollout reports that should arrive between today and tomorrow:
- ARC 1.1.1 (all components)
- IGE SAGA 1.6.1
The UMD 2.3.0 release date is mid November. It will include several leftovers from the initial EMI2 release as well as updates. We also aim to include some components from this week's update and from the IGE 3.0 release. The current status is the following:
- Products in staged rollout that already have reports completed:
- AMGA 2.3.0
- CREAM 1.14.1
- EMIR 1.2.0
- Products under verification:
- FTS 2.2.8
- dCache 2.2.4
- Products to be verified:
- All components from IGE 3.0
The UNICOREX/6 is actually only the unicore nagios probe, to be tested with collaboration from the SAM/Nagios teams.
Expected from this week's EMI update: clients (UI and WN) containing the new GFAL/lcg_utils 1.13.9, WMS 3.4 and LB 3.2.9.
We will also try to include other components from this update in the next UMD, since most of the process can run in parallel.
2. Operational Issues
2.1 Monitoring of unsupported middleware
COD is currently opening GGUS tickets against sites deploying unsupported gLite middleware. Tickets have been opened against sites with critical alarms in the custom security dashboard.
2.1.1 The timeline of the process is the following
- Between October 8th and October 10th COD opened the first batch of tickets
- Some sites solved the problem generating the critical alarms within a few days (by upgrading or decommissioning the service), so the ticket was closed by COD
- On October 15th a new probe was put into production to monitor unsupported CREAM services (some CREAM instances did not publish the version correctly). COD opened tickets against sites with new alarms but without a ticket in an open status.
- This unfortunately meant a new ticket submitted for some sites with the previous one closed.
- On October 19th a new probe to check WMS instances was released in production
- The COD team will update the tickets already opened to warn site managers that a new problem has been detected at the site
- On October 19th the false positives caused by CONDOR installations were removed
- Today the GGUS ticket template is being updated; tickets opened from now on will contain more information about the workflow expected for these tickets
- In the coming days a probe for dCache is expected
2.1.2 Additional information about the tickets
- Software version information for the site is extracted from BDII, hence the accuracy of the monitoring infrastructure depends on it. False positives may be detected in case of erroneous publishing of gLite 3.1/3.2 information by service end-points. If this applies to you, you are kindly requested to report the problem in this ticket. Probes will be fixed accordingly.
- Please provide information about your upgrade plans following the template provided in the ticket
- Site administrators who provide information about their upgrade plans MUST NOT close a ticket until the alarm disappears from the site Security Dashboard. Please keep it in status "in progress" until all unsupported products are either decommissioned or upgraded.
2.1.3 Unresponsive sites
On this separate wiki page NGIs can find the Unresponsive sites on Oct 22: sites that have not answered the ticket opened by COD.
Note: unresponsive sites are eligible for suspension after November 1st.
2.1.4 Decommissioning of lcg-ce
Reminder: Decommissioned lcg-ce instances have to be removed from GOCDB and site BDII!
2.2 Dependency problem with gridsite-apache and globus
The following UMD2 products:
have dependencies on gridsite-apache, while the latest update of gridsite obsoletes gridsite-apache. Both the EMI and UMD updates repositories contain the latest version of gridsite, which has no gridsite-apache package; this breaks the lcgdm-dav-server that comes with DPM and LFC.
- Disable the UMD update repository before installing DPM or LFC.
- If a site manager wants to install the latest libraries from the UMD update repository, another workaround could be:
- Exclude the gridsite package by adding the following to /etc/yum.conf: exclude=gridsite
- yum update should now work
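The exclude workaround above can be sketched in Python as follows. This is only an illustration: the helper name `add_yum_exclude` is hypothetical, it assumes the `exclude` option belongs in the `[main]` section of /etc/yum.conf as a space-separated list, and it works on the file content as a string rather than editing the file in place.

```python
def add_yum_exclude(conf_text, package="gridsite"):
    """Return yum.conf text with `package` listed in exclude= under [main].

    Hypothetical helper: it only transforms a string, so the caller decides
    whether and how to write the result back to /etc/yum.conf.
    """
    lines = conf_text.splitlines()
    in_main = False
    main_idx = None
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Track which INI section we are currently in.
        if stripped.startswith("[") and stripped.endswith("]"):
            in_main = stripped == "[main]"
            if in_main:
                main_idx = i
            continue
        if in_main and stripped.startswith("exclude"):
            key, sep, value = stripped.partition("=")
            if key.strip() == "exclude" and sep:
                # Extend an existing exclude list without duplicating entries.
                patterns = value.split()
                if package not in patterns:
                    patterns.append(package)
                lines[i] = "exclude=" + " ".join(patterns)
                return "\n".join(lines) + "\n"
    # No exclude line found: create [main] if needed and add the option.
    if main_idx is None:
        lines.insert(0, "[main]")
        main_idx = 0
    lines.insert(main_idx + 1, "exclude=" + package)
    return "\n".join(lines) + "\n"


sample = "[main]\ncachedir=/var/cache/yum\n"
updated = add_yum_exclude(sample)
print(updated)
```

Running the helper twice on the same text leaves it unchanged, so it is safe to apply from a configuration script that runs repeatedly.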
The following UMD-2 packages:
have a dependency on the globus-gass-copy-progs package, which is not in the UMD repositories but comes from EPEL. The globus-gass-copy-progs package in turn depends on other Globus libraries, which are part of the IGE components currently released in UMD.
A recent EPEL upgrade released a new globus-gass-copy-progs package with updated dependencies on newer libraries (also released in EPEL). Unfortunately, the Globus libraries in UMD are too old. The UMD repositories are protected against EPEL, so yum cannot download the newer libraries and the installation fails.
globus-gass-copy-progs requires globus-gass-copy(x86-64) = 8.6-1.el6, while the UMD repository contains globus-gass-copy-8.4-1.el6.x86_64.
- SA2 is evaluating whether to include globus-gass-copy(x86-64) = 8.6-1.el6 in UMD, after proper testing.
- The next UMD update will have to include the new products released by IGE also in order to inject the correct dependencies.
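To illustrate why the dependency cannot be resolved, the following minimal Python sketch compares the exact version required by globus-gass-copy-progs with the one available in UMD. The function names are hypothetical, and the numeric comparison is a deliberate simplification of the real RPM version-comparison algorithm; it only covers dotted numeric versions such as those quoted above.

```python
def split_version_release(vr):
    """Split a 'version-release' string such as '8.6-1.el6'."""
    version, _, release = vr.partition("-")
    return version, release


def numeric_key(version):
    """Turn '8.6' into (8, 6) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def check_exact_requirement(required, available):
    """Check an '=' dependency: only an identical version-release satisfies it."""
    if required == available:
        return True, "exact match"
    req_version, _ = split_version_release(required)
    avail_version, _ = split_version_release(available)
    if numeric_key(avail_version) < numeric_key(req_version):
        return False, "available package is older than required"
    return False, "available package differs from required"


# Values taken from the text: required by the EPEL package vs. shipped in UMD.
result = check_exact_requirement("8.6-1.el6", "8.4-1.el6")
print(result)
```

Because the requirement uses `=`, any version other than 8.6-1.el6 leaves it unsatisfied, which is why the fix has to come from updating the Globus libraries in UMD.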
2.3 Monitoring issues with WN SL6
SAM Update 17.1 fails to monitor some sites that are deploying UMD2 WN on SL6. The problems (segfault) are described in these tickets:
Emir produced a new build of the WN probe, and the patch will soon be made available to the NGI Nagios administrators. This patch fixes the problem for SL6 WNs, but it makes it impossible to monitor lcg-CE and, in general, 32-bit CEs.
2.4 New service types in GOCDB in production
The following service types will be put in production in GOCDB today:
- CUSTOM.pl.plgrid.Bazaar - SLA negotiation system between users and resource providers from NGI_PL grid
- CUSTOM.pl.plgrid.BazaarSAT - Bazaar Site Admin Toolkit from NGI_PL grid
- CUSTOM.pl.plgrid.BAT.agent - Service for collecting accounting data from NGI_PL grid
- CUSTOM.pl.plgrid.QStorMan.UserInterface - A service to provide a user of the grid system with a certain level of quality, from NGI_PL grid
- CUSTOM.pl.plgrid.KeyFS - Key File System service, installed on UI machines to provide a user with the grid credentials, from NGI_PL grid
- pl.cyfronet.InSilicoLab - InSilicoLab portal instance.
These service types had previously been removed from GOCDB because ATP was not able to handle them (the Nagios configuration returned errors). The SAM 17.1 update patches the problem, and ATP now handles the new service types properly. Please upgrade your SAM instance if your NGI is still deploying an older version.
3.1 UserDN publication
There are still sites not publishing the UserDN in the usage records: Missing UserDN 22 Oct 2012 (83 sites). Please follow up with these sites to fix their APEL configuration.
3.2 Next meeting
Proposal: November 5th 2012 14:00 Amsterdam time
Minutes are available here.