Difference between revisions of "NGI DE CH Operations Center:Operations Meeting:01072011"

Revision as of 13:51, 7 July 2011

Introduction

Minutes of last meeting

no comments

Announcements

Meetings/conferences

The Grid educational event organized by Yves Kemp (DESY-HH) is currently running at the University of Berlin. The test
user certificates have been delivered and no problems are known.

Availability/reliability statistics

NGI-DE on average: 97%/98% availability/reliability
Problem site: BMRZ-FRANKFURT (ticket 1304), 73%/73% availability/reliability
Cause: the gLite 3.2 migration, misconfigurations and some disk space issues. The problems have been solved.
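
The minutes do not state how the two figures are computed. As a rough illustration only, the sketch below assumes the definitions commonly used for such reports: availability is uptime divided by the period minus time with unknown status, and reliability additionally excludes scheduled downtime from the denominator. The function names and the example hours are hypothetical and are not taken from the meeting.

```python
def availability(up_hours, total_hours, unknown_hours=0.0):
    """Assumed definition: uptime / (total time - time with unknown status)."""
    known = total_hours - unknown_hours
    if known <= 0:
        raise ValueError("no known time in the reporting period")
    return up_hours / known


def reliability(up_hours, total_hours, scheduled_down_hours=0.0, unknown_hours=0.0):
    """Assumed definition: like availability, but scheduled downtime
    is also removed from the denominator."""
    expected = total_hours - scheduled_down_hours - unknown_hours
    if expected <= 0:
        raise ValueError("no expected uptime in the reporting period")
    return up_hours / expected


if __name__ == "__main__":
    # Hypothetical 30-day month (720 h) with 36 h of scheduled downtime
    # and 612 h of measured uptime.
    print(f"availability: {availability(612, 720):.1%}")
    print(f"reliability:  {reliability(612, 720, scheduled_down_hours=36):.1%}")
```

Under these assumed definitions the two figures coincide whenever no scheduled downtime is counted in the period.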

Sites certification procedure (Foued)

Sites UNI-SIEGEN-HEP and UNI-BONN are fully recertified and back in production.

Monitoring

The regional NGI-DE monitoring and the ARC monitoring were updated this week. The regional monitoring now supports Globus 5. Each
Globus site should register its Globus 5 resources in the GOCDB. We continue the monitoring together with the Swiss sites. Res
has to ask his people.

Question from Florian (LRZ): Which Globus services will be monitored? Answer: GSI-SSH, GridFTP and GRAM 5 are monitored. The GSI-SSH test
checks the connection to the ports and the authorization, provided the test DN is supported at the site (non-D-Grid sites have to
enter the DN manually; for D-Grid sites this is done automatically if they support DG ops or DG test).

Florian: We (LRZ) also have problems registering our Globus 5 resources in the GOCDB (invalid service type). Answer: Please open a ticket. Foued will check
this after the meeting.
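
The port-connection part of such a test can be sketched as below. This is not the NGI-DE probe itself, only a minimal illustration: the port numbers are the usual defaults for these services and are assumptions here, the names `ASSUMED_PORTS`, `port_is_open` and `probe_site` are hypothetical, and the authorization step with the test DN is not covered.

```python
import socket

# Assumed default ports; a site may run these services on other ports.
ASSUMED_PORTS = {"gsissh": 2222, "gridftp": 2811, "gram5": 2119}


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def probe_site(host, ports=ASSUMED_PORTS, timeout=5.0):
    """Check every service port and return a {service: reachable} map."""
    return {name: port_is_open(host, port, timeout) for name, port in ports.items()}


if __name__ == "__main__":
    # "localhost" is a placeholder; replace it with the host to be checked.
    for service, reachable in probe_site("localhost", timeout=2.0).items():
        print(f"{service}: {'reachable' if reachable else 'not reachable'}")
```

A reachable port only shows that something is listening; whether the test DN is actually authorized has to be checked with a real client session.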

Staged rollout/updates

ntr from KIT
Martin/ITWM: EMI WN, EMI DPM staged rollout: We have some problems; the corresponding tickets have not been solved yet and we are
still waiting for solutions.

Other

Sites UNI-SIEGEN-HEP and UNI-BONN are fully recertified and back in production.

Round the sites

NGI-DE
- BMRZ-FRANKFURT (Uni Frankfurt)
- DESY-HH
  - Good load (4000 jobs) and good job efficiency, with 10 VOs in parallel
  - Network interruption last Wednesday: the firewall settings and the DFN connection were reconfigured. Another interruption is needed.
  - Updated two of three dCache instances successfully to 1.9.12 (second golden release)
  - Concerned about the migration from gLite to the EMI release. People from CERN (LCG rollout list) recommend not using the EMI release, and we do not know when the release will be available. Currently we only apply the recommended high-risk security updates for WNs. Dimitri (KIT) does not consider it necessary to install the EMI release at the moment; we have to wait until the first EMI release is officially available.
- DESY-ZN
- FZJuelich
- Goegrid
- GSI
- ITWM
  - Taking part in the staged rollout for WNs / EMI test: SAM tests failed and a DT was needed; a rollback to gLite 3.2 was required to resolve the incident
  - SW areas ran full -> upgraded some WNs to SL5.6, but the CREAM CEs want an older version; excluded some packages (e.g. the CArgus package) that caused the problem
  - Switched next week's ROD shift with CSCS
- KIT (GridKa, FZK-LCG2)
  - Over the last weeks LHCb put a very high load on the dCache Urprod instance; as a consequence the system crashed. The dCache instance had to be optimized (additional servers etc.).
  - Problems with one tape library (stuck tapes, errors in reading labels)
  - CREAM CE instability still present
- KIT (Uni Karlsruhe)
- LRZ
  - Systems up and running
  - No planned DT within the next weeks
  - Plan to evaluate CernVMFS
- MPI-K
- MPPMU
  - CREAM CEs are up and running
  - Next week we will have a DT to upgrade the BDII and SRM to the latest gLite 3.2 and to add new WNs. The old LCG CE and the old farm will be decommissioned.
  - Good HammerCloud tests
  - Problems with the BDII are still under investigation
- RWTH Aachen
- SCAI
  - ntr
- Uni Bonn
- Uni Dortmund
- Uni Dresden
  - Longer DT to update the operating system of several components (WNs and CREAM CEs). Because of limited manpower, vendor support was needed. Everything has been working again since yesterday.
- Uni Freiburg

From Anton Gamel:

Hello Tobias,

Unfortunately I cannot attend the meeting today. For the minutes: Site UNI-FREIBURG is running smoothly. We had some short downtimes (at risk) for servicing the water cooling system.

Cheers!

Anton

- Uni Mainz-Maigrid
- Uni Siegen
- Uni Wuppertal
SwiNG
- CSCS

Gila Arrondo Miguel Angel

Hi all,

Unfortunately I cannot make it to the meeting today. This is the report for CSCS:

- Last week there was a problem with our scratch Lustre filesystem. One of the RAIDs lost 3 disks while rebuilding after a previous disk failure and we lost scratch data. Those jobs that were reading/writing to that part of the FS failed (74 out of ~1400). Since then, we disabled the failed storage server and all works fine.

- Also last week we had two tickets:

a) https://ggus.eu/ws/ticket_info.php?ticket=71814 "APEL Consumer did not respond to a request for confirmation of pre-published data": It has been solved; the problem was present in both the gLite 3.2 and EMI1 releases of APEL.

b) https://ggus.eu/ws/ticket_info.php?ticket=71436 "hone jobs are aborted immediately after submission into cream02.lcg.cscs.ch:8443/cream-pbs-other queue": It has also been solved, and the changes will be applied when we upgrade our EA EMI 1 CREAM CE.

- We are suffering from random segfaults of the Torque pbs_mom service on the WNs. This means that, occasionally and for short periods of time (a few hours max.), there might be fewer slots available for jobs. Running jobs are not affected. The Adaptive Computing support team is working on it, but they're having a hard time reproducing the issues we see here.

- Next Wednesday is CSCS maintenance day and we've been informed that there might be a network downtime of about 30 min. It is not yet certain whether it will affect us, but if the network team confirms it, we'll declare the downtime in GOCDB according to the rules. Besides that, there is no planned action for CSCS-LCG2 production systems on maintenance day.

- Starting on Monday July 4 we will be on duty with ROD shift. Our previous problems accessing the Operations Portal have been fixed.

That's all.

Best regards, Miguel

- PSI
- Switch
  - We have to check the ROD responsibility for next week. There seems to be some confusion, and the split of responsibility between ITWM and CSCS is unclear. Note added after the meeting: ITWM is on duty for the ROD shift in CW 27.

Note: please update your entry at https://wiki.egi.eu/wiki/NGI_DE:Sites if needed.

Status ROD

Week	From	To	Team	Remark
24	13.06	19.06	Team3, KIT	
25	20.06	26.06	Team4, JUELICH	
26	27.06	03.07	Team5, BADW-LRZ	
27	04.07	10.07	Team2, FhG (ITWM)	switched wk 27/29
28	11.07	17.07	Team1, DESY	
29	18.07	24.07	Team6, CSCS/NGI_CH	switched wk 27/29

Any problematic tickets?
- Problems with University of Karlsruhe (T2 of KIT): the site was down for one week and did not respond.
- At one site job submission failed for a short time. The test itself seems to be unstable. Recommendation: Please open a ticket to notify the site.

Handover of the ROD shift
ROD shift schedule https://wiki.egi.eu/wiki/NGI_DE:ROD-DE

AOB

If you have additional topics to be discussed during the meeting, please submit them in advance via our email list.
