
NGI DE CH Operations Center:Operations Meeting:01072011


Operations Meeting Main

Introduction

  • Minutes of last meeting
no comments

Announcements

  • Meetings/conferences
A Grid educational event organized by Yves Kemp (DESY-HH) is currently running at the University of Berlin. The test
user certificates have been delivered and no problems are known. Everything is running fine.
  • Availability/reliability statistics
NGI-DE on average: 97%/98% availability/reliability
Problem site: BMRZ-FRANKFURT (ticket 1304), 73%/73% availability/reliability
Cause: gLite 3.2 migration, misconfigurations and some disk space issues. The problems have been solved.
  • Sites certification procedure
Sites UNI-SIEGEN-HEP and UNI-BONN are fully recertified and back in production.
  • Monitoring
The regional NGI-DE monitoring and the ARC monitoring were updated this week. The regional monitoring now supports Globus 5.
Each Globus site should register its Globus 5 resources in the GOCDB. We continue the monitoring together with the Swiss sites.
Res has to ask his people whether they want this.
Question from Florian (LRZ): Which Globus services will be monitored? Answer: GSI-SSH, GridFTP and GRAM 5 are monitored. The
GSI-SSH test checks the connection to the ports and the authorization, in particular whether the test DN is supported at the
site (non-D-Grid sites have to enter the DN manually; for D-Grid sites this is done automatically if they support the DG ops or
DG test).
Florian (LRZ): We also have problems registering our Globus 5 resources in the GOCDB (invalid service type). Answer: Please open
a ticket. Foued will check this after the meeting.
  • Staged rollout/updates
ntr from KIT
Martin/ITWM: EMI WN, EMI DPM staged rollout: We have some problems; the corresponding tickets have not been solved yet and we
are still waiting for solutions.
  • Other
Sites UNI-SIEGEN-HEP and UNI-BONN are fully recertified and back in production.
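
The port-connection part of the Globus 5 checks described under Monitoring can be approximated locally before registering a service in the GOCDB. The following is only a minimal sketch, not the probe used by the regional monitoring: the function name check_ports is hypothetical, and the port numbers (2222 for GSI-SSH, 2811 for GridFTP, 2119 for the GRAM 5 gatekeeper) are commonly used defaults assumed here, not values taken from these minutes. It tests plain TCP reachability only and does not cover the authorization or test-DN part.

```python
import socket

# Assumed default ports (not from the minutes); sites may use different ones.
ASSUMED_PORTS = {"gsissh": 2222, "gridftp": 2811, "gram5": 2119}


def check_ports(host, ports=None, timeout=5.0):
    """Return {service: bool}: True if a TCP connection to the port succeeds.

    Only reachability is tested; no GSI handshake or DN authorization is done.
    """
    if ports is None:
        ports = ASSUMED_PORTS
    results = {}
    for service, port in ports.items():
        try:
            # create_connection raises OSError on refusal, timeout or DNS failure
            with socket.create_connection((host, port), timeout=timeout):
                results[service] = True
        except OSError:
            results[service] = False
    return results
```

A site admin could call check_ports with the site's own host name and, if needed, a dictionary of the ports actually configured there; a False entry points to a firewall or service problem to fix before the monitoring picks the service up.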

Round the sites

NGI-DE
  • BMRZ-FRANKFURT (Uni Frankfurt)
  • DESY-HH
Andreas, Dmitri:
Good load: 4000 jobs
Job efficiency: 10 VOs in parallel
Network interruption last Wednesday: the firewall settings and the DFN connection were reconfigured. Another interruption is needed.
Updated two of three dCache instances successfully to 1.9.12 (second golden release)
Concerned about the migration from gLite to the EMI release. People from CERN (LCG rollout list) recommend not using the EMI
release, and we do not know when the release will be available. Currently we only apply the recommended high-risk security updates
for the WNs. Dimitri (KIT) does not see a need to install the EMI release at the moment; we have to wait until the first EMI
release is officially available.
  • DESY-ZN
  • FZJuelich
  • Goegrid
  • GSI
  • ITWM
Martin:
Took part in the staged rollout for WNs / EMI test: the SAM test failed, so a DT was needed; to resolve the incident, a rollback
to gLite 3.2 was needed.
SW areas ran full. Upgraded some WNs to SL5.6, but the CREAM CEs want an older version; excluded some packages (e.g. the CArgus
package) that caused the problem.
Switched next week's ROD shift with CSCS
  • KIT (GridKa, FZK-LCG2)
Dimitri:
Over the last weeks LHCb put a very high load on the dCache Urprod instance; as a consequence the system crashed. The dCache
instance had to be optimized (additional servers etc.).
Problems with one tape library (stuck tapes, errors reading labels)
CREAM CE instability is still present
  • KIT (Uni Karlsruhe)
  • LRZ
Florian:
Systems up and running
No planned DT within the next weeks
Plan to evaluate CernVMFS
  • MPI-K
  • MPPMU
Cesare:
CREAM CEs are up and running
Next week we will have a DT to upgrade the BDII and SRM to the latest gLite 3.2, and for new WNs. The old LCG CE and the old farm will be decommissioned.
Good HammerCloud tests
Problems with the BDII are still under investigation
  • RWTH Aachen
  • SCAI
Oliver:
ntr
  • Uni Bonn
  • Uni Dortmund
  • Uni Dresden
Ralph:
Longer DT to update the operating system of several components (WNs and CREAM CEs). Because of limited manpower, vendor support
was needed. Everything has been working again since yesterday.
  • Uni Freiburg
Anton via email: 
Hello Tobias,
unfortunately I cannot attend the meeting today.
For the minutes: Site UNI-FREIBURG is running smoothly. We had some small downtimes (at risk) for service of the water cooling  
system.
Cheers!
Anton
  • Uni Mainz-Maigrid
  • Uni Siegen
  • Uni Wuppertal
SwiNG
  • CSCS
Miguel via email:
Hi all,
Unfortunately I cannot make it to the meeting today. This is the report for CSCS:
Last week there was a problem with our scratch Lustre filesystem. One of the RAIDs lost 3 disks while rebuilding after a previous 
disk failure and we lost scratch data. Those jobs that were reading/writing to that part of the FS failed  (74 out of ~1400). 
Since then, we disabled the failed storage server and all works fine.
Also last week we had two tickets:
a) https://ggus.eu/ws/ticket_info.php?ticket=71814 APEL Consumer did not respond to a request for confirmation of pre-published 
data
It has been solved as it was a problem present in both gLite 3.2 and EMI1 releases of APEL.
b) https://ggus.eu/ws/ticket_info.php?ticket=71436 hone jobs are aborted immediately after submission into 
cream02.lcg.cscs.ch:8443/cream-pbs-other queue
It has also been solved and the changes will be applied when we upgrade our EA EMI 1 CREAM CE.
We are suffering from random segfaults of Torque pbs_mom service in WNs. It means that, eventually and for small periods of time  
(a few hours max.), there might be less slots available for jobs. Running jobs are not affected. Adaptive Computing support team 
is working on it, but they're having a hard time repeating the issues we see here.
Next Wednesday is CSCS maintenance day and we've been informed that there might be a network downtime of about 30min. It is unsure 
yet whether it would affect us, but if network team confirms it, we'll establish the downtime in gocdb according to rules. 
Besides that, there is no planned action for CSCS-LCG2 production systems on maintenance day.
Starting on Monday July 4 we will be on duty with ROD shift. Our previous problems accessing the Operations Portal have been fixed.
That's all.
Best regards,
Miguel
  • PSI
  • Switch
Res:
We have to check the ROD responsibility for next week. There seems to be some confusion, and the split of responsibility between
ITWM and CSCS is unclear. Note, added after the meeting: For the ROD shift in CW 27, ITWM is on duty.

Note: please update your entry at https://wiki.egi.eu/wiki/NGI_DE:Sites if needed.

Status ROD

Week  From    To      Team                 Note
24    13.06   19.06   Team3, KIT
25    20.06   26.06   Team4, JUELICH
26    27.06   03.07   Team5, BADW-LRZ
27    04.07   10.07   Team2, FhG (ITWM)    switched wk 27/29
28    11.07   17.07   Team1, DESY
29    18.07   24.07   Team6, CSCS/NGI_CH   switched wk 27/29
  • Any problematic tickets?
Problems with University of Karlsruhe (T2 of KIT): the site was down for one week and did not respond.
The site was contacted by phone. Now solved.
At one site job submission failed for a short time. The test seems to be unstable. Recommendation: Please open a
ticket to notify the site.
  • Handover of the ROD shift
Res:
We have to check the ROD responsibility for next week. There seems to be some confusion, and the split of responsibility between
ITWM and CSCS is unclear. Note, added after the meeting: For the ROD shift in CW 27, ITWM is on duty.

AOB

@ Henry Jonker / Frankfurt: At GridKa we enabled the life sciences project e-NMR (enmr.eu) on one of our CEs to use our computing
resources. From our site it is working. Since yesterday some jobs have been running. Note: The ticket was reopened.
Important: For the next meeting we will use the DFN telephone conferencing system and the new documentation page
https://wiki.egi.eu/wiki/NGI_DE_CH_Operatons_Center . We want to track this meeting there so that the sites can put their
information in the wiki. Connection details and the link to the webpage will follow with the next reminder.


If you have additional topics to be discussed during the meeting, please submit them in advance via our email list.