Difference between revisions of "NGI DE CH Operations Center:Operations Meeting:11112011"

Latest revision as of 15:54, 17 November 2011

Introduction

Minutes of last meeting

No comments

Announcements

Meetings/conferences

no news

Availability/reliability statistics

https://wiki.egi.eu/wiki/Availability_and_reliability_monthly_statistics

Oct 2011 not yet published

Last: Sep 2011
Av:97 % Re:98 %

UNI-KARLSRUHE N/A N/A N/A 68 % 68 %
Severe Lustre FS problems on the cluster over 3 weeks prevented stable operation of the storage element.
https://helpdesk.ngi-de.eu/index.php?mode=ticket_info&ticket_id=1622
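
As a reminder of how the two figures differ, here is a minimal sketch. It assumes the commonly used definitions (availability = uptime / (total time - unknown time), reliability = uptime / (total time - scheduled downtime - unknown time)); the function names and the hour values are hypothetical and are not taken from the published statistics.

```python
def availability(up_hours, total_hours, unknown_hours=0.0):
    """Fraction of the known time during which the site was up."""
    known = total_hours - unknown_hours
    if known <= 0:
        raise ValueError("no known time in the reporting period")
    return up_hours / known


def reliability(up_hours, total_hours, scheduled_downtime_hours=0.0, unknown_hours=0.0):
    """Like availability, but scheduled downtime is not counted against the site."""
    expected = total_hours - scheduled_downtime_hours - unknown_hours
    if expected <= 0:
        raise ValueError("no expected uptime in the reporting period")
    return up_hours / expected


if __name__ == "__main__":
    # Hypothetical month: 720 hours in total, 7.2 hours of scheduled downtime.
    print(round(availability(698.4, 720.0), 3))
    print(round(reliability(698.4, 720.0, scheduled_downtime_hours=7.2), 3))
```

Under these assumed definitions a site that announces its downtimes in advance keeps a higher reliability than availability, which matches the Av/Re pattern in the summary above.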

Monitoring

Update 14.
Robot certificates are in use now.

Staged rollout/updates

UPDATE 32 for gLite 3.2 is now ready for production use.
The priority of the updates is: Normal
Comment from DESY: there is a major change in Torque. If you really want to update, you have to drain your batch system and
update the batch system, the WNs and the CEs at the same time. We are therefore surprised by the normal priority of this
update. Recommendation to all: be careful with this update!

Round the sites

NGI-DE

BMRZ-FRANKFURT (Uni Frankfurt)
DESY-HH (Andreas Gellrich, Dmitri Ozerov)

- Normal operations. We will not install gLite 3.2 update 32, see comment above.
- In the meantime we set up some EMI services, such as BDII, Logging and Bookkeeping, and 6 WMSs. The WMS structure has changed.
With the EMI WMS we had a problem with the ext3 file system, which only allows 32 000 subdirectories per directory, a limit that
user proxies might exhaust. Recommendation: use the ext4 file system for the EMI WMS. The EMI WMS performs much better than the
old gLite WMS (32-bit, SL4).
The EMI BDIIs work very well. Recommendation: for the BDII use the LDAP 2.4 version as described in the BDII installation
guide.
We also tried the UMD releases, but the UMD update cycles are much too long, and another problem we see is the inconsistency
between UMD and EMI. Maybe this will change in a few years.
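
The ext3 subdirectory limit mentioned in the WMS item can be watched with a small check like the following. This is only a sketch: the limit value is the one quoted in the minutes, the function names and the 80 % warning threshold are assumptions, and the directory that actually holds the per-proxy subdirectories is not named in the minutes, so the demo runs on a temporary directory.

```python
import os
import tempfile

# Assumption: the per-directory limit quoted in the minutes for ext3.
EXT3_SUBDIR_LIMIT = 32000


def count_subdirs(path):
    """Count the immediate subdirectories of path (symlinks are not followed)."""
    with os.scandir(path) as entries:
        return sum(1 for entry in entries if entry.is_dir(follow_symlinks=False))


def subdir_usage(path, limit=EXT3_SUBDIR_LIMIT, warn_fraction=0.8):
    """Return (count, warn); warn is True once count reaches warn_fraction of limit."""
    count = count_subdirs(path)
    return count, count >= warn_fraction * limit


if __name__ == "__main__":
    # Demo on a throwaway directory with a deliberately tiny limit.
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(5):
            os.mkdir(os.path.join(tmp, f"proxy_{i}"))
        print(subdir_usage(tmp, limit=6))
```

Run periodically against the relevant directory, such a check would give a warning before directory creation starts to fail, independent of the recommended move to ext4.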

DESY-ZN
FZJuelich
Goegrid
GSI
ITWM (Martin Braun)

- WN staged rollout: we are currently testing the latest gLite 3.2 DPM version and the update of the lcg utils.
- We also have a problem with Torque. It started earlier because we are also running an EMI CREAM CE and there was no "Torque
warning".
- Another problem, as long as the update is not installed on the WNs, is that the server runs "out of file handles" and
stops working. A ticket has been filed.
- Problem with the sBDII. We tried to move to EMI and OpenLDAP.

KIT (GridKa, FZK-LCG2)

15-11-2011 08:00 -> 15-11-2011 18:00 	
Upgrade of dCache to 1.9.12-x	
Affected node: atlassrm-fzk.gridka.de

15-11-2011 08:00 -> 16-11-2011 13:00
ATLAS LFC migration to CERN
Note: putting DE cloud offline in DDM/Panda and queues draining on Nov 14 already.
Affected: All ATLAS users (Local-LFC, atlas-lfc-fzk.gridka.de). Possibly COMPASS.

KIT (Uni Karlsruhe)
LRZ
MPI-K
MPPMU (Cesare Delle Fratte)

 - Completed the downtime and the gLite CREAM update (UPDATE 32 for gLite 3.2).
 - nothing important to report

RWTH Aachen
SCAI
Uni Bonn
Uni Dortmund
Uni Dresden (Ralph Müller-Pfefferkorn)

- We updated the last nodes of our cluster to SL. We are now running 512 cores.
- No plan to update to EMI releases. We will wait.

Uni Freiburg (Anton Gamel)

- Downtime next week for maintenance of the local air conditioning. We will take the opportunity to upgrade our local Lustre system.

Uni Mainz-Maigrid
Uni Siegen
Uni Wuppertal

SwiNG

CSCS (Miguel Gila via Email)

- This week we performed major maintenance on the cluster: we migrated the shared FS from Lustre to GPFS. So far, the performance
of GPFS is much better than that of Lustre thanks to the use of SSDs for the metadata. We also updated the firmware of the
controllers and systems running GPFS.
- Update of Argus packages to latest versions.
- Update of CREAM-CEs to latest versions.
- Upgraded 10 Supermicro WNs to the new AMD 16-core CPU. So, in total, we have 10 32-core machines plus all the old Sun WNs.
Unfortunately we were unable to boot them with the SL5.5 kernel, so we are upgrading them to SL5.7 with the latest kernel (this
means recompiling the InfiniBand and GPFS drivers). It is taking more time than expected to deploy these servers, but once done,
this will mean 80 more cores with approximately the same power consumption.
- Fixed some issues with the site BDII publication system (outdated publishing).

Unfortunately we are still seeing issues with the CREAM services (mostly failed scp transfers and problems with all pool accounts being in use).

PSI
Switch

Note: please update your entry at https://wiki.egi.eu/wiki/NGI_DE:Sites if needed.

Status ROD

October: complaints about ROD metrics
No. Tickets expired: 19
No. Alarms older than 72h: 0

In this case, we should pay more attention to ticket expiration dates and update tickets in time.
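
One way a ROD shifter could keep an eye on expiration dates is sketched below. The function name, the one-day warning window, and the ticket IDs and dates are all hypothetical; the sketch only shows the comparison itself and does not talk to GGUS or the EGI Operations Portal.

```python
from datetime import date, timedelta


def tickets_needing_update(tickets, today, warn_days=1):
    """Return IDs of tickets that have expired or expire within warn_days of today."""
    cutoff = today + timedelta(days=warn_days)
    return sorted(tid for tid, expires in tickets.items() if expires <= cutoff)


if __name__ == "__main__":
    # Hypothetical ticket IDs and expiration dates, for illustration only.
    sample = {
        "T-1": date(2011, 11, 10),
        "T-2": date(2011, 11, 12),
        "T-3": date(2011, 11, 20),
    }
    print(tickets_needing_update(sample, today=date(2011, 11, 11)))
```

With such a list at the start of each shift day, tickets can be updated or their expiration dates extended before they count against the ROD metrics.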

Any problematic tickets?

MaiGrid has very slow response time, see comment below

Report from CSCS (Miguel Gila via Email)

- During this week we were the ROD shifters. It was difficult to contact MaiGrid regarding their open tickets, and this caused us to
miss an expiration date for the ticket (75221). COD contacted us for an explanation (76046), but when we asked COD for help, they
were unable to help us.
This coincided with MaiGrid removing the problematic machine from production, which led us to close the ticket.
- Besides that, all other sites are okay with their tickets. UNI-BONN has a ticket on a machine that is in downtime (76078), so I
extended the expiration date.

As a side note, we would like to express our dissatisfaction with MaiGrid for not answering the tickets when they were supposed to.
We (I) made a few errors in handling the issue that will not happen again in the future, but MaiGrid should have been more responsive.
In addition, we want to point out that we believe the EGI Operations Portal is not very clear when dealing with tickets:
not all updates are shown in the interface, and you need to go to GGUS to see everything. It would also be desirable for the
portal to warn shifters when they open a ticket for a site that is in downtime.

Handover of the ROD shift

45	07.11 	13.11 	Team6, CSCS/NGI_CH 	
46	14.11 	20.11 	Team1, DESY	
47	21.11 	27.11 	Team2, FhG (SCAI) 	
48	28.11 	04.12 	Team3, KIT
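
The date ranges of the shifts can be derived from the week numbers. The sketch below assumes that the first column of the table is the ISO calendar week of 2011 and that each shift runs from Monday to Sunday; the function name is hypothetical.

```python
from datetime import date, timedelta


def iso_week_range(year, week):
    """Return (Monday, Sunday) of the given ISO calendar week."""
    monday = date.fromisocalendar(year, week, 1)  # requires Python 3.8 or later
    return monday, monday + timedelta(days=6)


if __name__ == "__main__":
    for week in (45, 46, 47, 48):
        start, end = iso_week_range(2011, week)
        print(week, start.strftime("%d.%m"), end.strftime("%d.%m"))
```

Under these assumptions, week 48 of 2011 runs from 28.11 to 04.12.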

ROD shift schedule https://wiki.egi.eu/wiki/NGI_DE_CH_Operations_Center:Operations_Teams#Shifts_rotation_table

LRZ management will be contacted by Dimitri (KIT) because LRZ is currently not participating in the ROD shifts.

AOB

If you have additional topics to be discussed during the meeting, please submit them in advance via our email list.
