NGI DE CH Operations Center:Operations Meeting:02092011

Latest revision as of 09:41, 7 September 2011

How to connect

Connect Via Phone:

The following DFN ISDN gateways are available: Germany - Berlin: +49-30-2541080; Germany - Stuttgart: +49-711-6330190

If requested, please enter the following conference number: 97922688

Introduction

  • Minutes of last meeting

Announcements

  • Meetings/conferences
5-9 September 2011: GridKa School, Karlsruhe, Germany (http://gridka-school.scc.kit.edu/2011)
8-9 September 2011: Paradiso workshop, Brussels, Belgium (http://www.paradiso-fp7.eu)
19-23 September 2011: EGI Technical Forum, Lyon, France (http://tf2011.egi.eu/)
  • Availability/reliability statistics
John Alan Kennedy (MPPMU) asked why MPPMU's scheduled downtime [08-07-2011 16:00 to 15-07-2011 16:00 UTC] was not taken
into account. Explanation from Dimitris Zilaskos: the downtime was only marked as 'at warning'.
For the problem at UNI-Siegen-HEP we will wait until the August statistics are available and then see whether the situation has improved.
  • Monitoring
Our Nagios box was updated. Last week there were some problems with notifications, and the tests were not up to date. The problem
was caused by the central instance at CERN. The EGI monitoring team is involved.
  • Staged rollout/updates
ntr
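
As background to the availability/reliability question above, here is a minimal sketch of how the two figures differ. It assumes the commonly used definitions (availability = uptime / (total time - unknown time); reliability = uptime / (total time - scheduled downtime - unknown time)). The function names are hypothetical, and the example assumes, purely for illustration, that a site was up for all of July except for a 7-day (168-hour) downtime.

```python
def availability(up_hours, total_hours, unknown_hours=0.0):
    """Fraction of the known time during which the site was up."""
    known = total_hours - unknown_hours
    if known <= 0:
        raise ValueError("no known time in the reporting period")
    return up_hours / known


def reliability(up_hours, total_hours, scheduled_down_hours=0.0, unknown_hours=0.0):
    """Like availability, but scheduled downtime is excluded from the period."""
    expected = total_hours - scheduled_down_hours - unknown_hours
    if expected <= 0:
        raise ValueError("no expected uptime in the reporting period")
    return up_hours / expected


# Illustrative assumption: July has 744 hours, the site is down for 168 hours
# and up for the remaining 576 hours.
TOTAL, DOWN = 744, 168
UP = TOTAL - DOWN

# Downtime counted as scheduled: reliability is not penalised.
print(round(reliability(UP, TOTAL, scheduled_down_hours=DOWN), 3))
# Downtime not counted as scheduled (e.g. only 'at warning'): reliability drops.
print(round(reliability(UP, TOTAL, scheduled_down_hours=0), 3))
```

Under these assumptions, a downtime that is not counted as scheduled lowers the reliability figure just as it lowers availability.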

Round the sites

NGI-DE
  • BMRZ-FRANKFURT (Uni Frankfurt)
  • DESY-HH
New WNs (replacing the old hardware); we now have 37 kHS and ~4800 job slots at DESY-HH.
100% occupancy (mostly CMS and ATLAS, but also ILC and HERA).
  • DESY-ZN
  • FZJuelich (Rebecca Breu)
ntr
  • Goegrid
  • GSI
  • ITWM (Martin Braun)
ntr
Announcement: short downtime of 1 hour next Monday for maintenance of the internet connectivity.
  • KIT (GridKa, FZK-LCG2, Dimitri Nilsen, Foued Jrad, Tobias Koenig)
Business as usual.

We tried to install UMD release 1.1 of the gLite Logging and Bookkeeping (LB) package. The test was not successful; the information
was put on the rollout board, but there is still no answer. The LB package is actually not very useful.

GridKa School next week.
  • KIT (Uni Karlsruhe)
  • LRZ
  • MPI-K
  • MPPMU (Cesare Delle Fratte)
Update in July.

Since last weekend, the information system has been a big problem.
Another big problem was the CREAM CE, which got stuck because of lack of memory; it is now green again.
-> These two main problems influenced the reliability/availability statistics for August, but we are concerned about the two
different numbers on the operations portal and the GridView portal.
Recommendation: we will wait until the official numbers for August are published.
  • RWTH Aachen
During the last two weeks: problems with dCache, probably caused by a RAID controller; 4000 files with wrong checksums; still under
investigation.
  • SCAI
  • Uni Bonn
  • Uni Dortmund
  • Uni Dresden (Ralph Mueller Pfefferkorn)
Production runs quite well.
Problems with the data published via gstat/BDII were caused by a wrong Torque queue configuration; the problem has been fixed this week.
  • Uni Freiburg (Anton Gamel)
On Wednesday evening we had a downtime because of an air conditioning failure; some pools were down for some hours.
Tickets: decommissioning of CE1. The CE was drained and taken out of production in the GOCDB, but some PanDA jobs are still being
submitted outside the WMS. One reason could be hard-coded CEs in user jobs. Is there any advice on how to proceed?
CREAM is "eating" memory, caused by the BLAH daemon. Our CREAM currently has 8 GB of memory and 1 GB of swap, and it is using all the swap.
Recommendation: better to add memory or ask on the rollout list/board. So far there is no recommendation on the rollout list/board.
At KIT we have 16 GB.
Two recommendations from John Alan Kennedy (MPPMU): We at MPPMU had the same problem. I read on the rollout list that the problem is
the SG helper. Regarding the PanDA jobs: ask the ATLAS people. Torsten Harrenberg is on vacation, but the ROD is back from vacation.
  • Uni Mainz-Maigrid
  • Uni Siegen
  • Uni Wuppertal
SwiNG
  • CSCS
Business as usual, despite some issues with the Lustre filesystem (scratch), which we hope to fix in the coming weeks by replacing it
with GPFS with SSDs for metadata.
  • PSI
  • Switch
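
Regarding the dCache files with wrong checksums reported by RWTH Aachen above, a rough sketch of how stored files could be re-checked against a list of expected checksums. It assumes the expected values are Adler-32 checksums written as 8 hex digits (Adler-32 is the usual dCache default, but this should be confirmed for the pools in question). The function names and the `expected` mapping are hypothetical.

```python
import zlib


def adler32_hex(path, chunk_size=1024 * 1024):
    """Compute the Adler-32 checksum of a file, reading it in chunks."""
    value = 1  # Adler-32 initial value
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xFFFFFFFF:08x}"


def find_mismatches(expected):
    """Return the paths whose current checksum differs from the expected one.

    expected: dict mapping file path -> expected checksum (8 hex digits).
    """
    return [path for path, want in expected.items()
            if adler32_hex(path) != want.lower()]
```

Reading in chunks keeps memory use constant for large files; the list returned by `find_mismatches` is the set of files to investigate or restore.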

Note: please update your entry at https://wiki.egi.eu/wiki/NGI_DE:Sites if needed.

Status ROD

  • Any problematic tickets?
  • Handover of the ROD shift
(from DESY-HH, week before) Nothing serious to report, but it is worth mentioning that
in NGI_DE there is a discrepancy in the number of published job slots (GoeGrid: twice as many as
in reality) and in the APEL statistics (MPPMU: wrong SI2K). Proposal: can we have a table in the NGI_DE
wiki with the actual number of job slots and the total HS for all sites in NGI_DE, so that we can compare
it with the information from BDII, gstat and GridView and contact sites in case of disagreement?

This week:

Week 35, 29.08 - 04.09: Team2, FhG (SCAI)

Next weeks:

Week 36, 05.09 - 11.09: Team3, KIT

Week 37, 12.09 - 18.09: Team4, JUELICH

Week 38, 19.09 - 25.09: Team5, BADW-LRZ

  • ROD Workshop @ EGI TechForum.
EGI Technical Forum: the ROD team's session is on Thursday, September 22nd, at 4 in the afternoon. Have a
look at the agenda at: https://www.egi.eu/indico/contributionDisplay.py?contribId=35&confId=452.
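
The comparison proposed in the ROD handover above (a reference table in the wiki versus the published values) could be sketched roughly as follows. The function name, the 5% tolerance, the site names and all numbers are placeholders invented for illustration; how the published values are obtained from BDII, gstat or GridView is not covered here.

```python
def find_discrepancies(reference, published, tolerance=0.05):
    """Compare published job-slot counts against a reference table.

    reference, published: dicts mapping site name -> number of job slots.
    Returns {site: (reference_value, published_value)} for every site whose
    published value is missing or deviates by more than `tolerance`
    (relative to the reference value).
    """
    result = {}
    for site, ref in reference.items():
        pub = published.get(site)
        if pub is None or abs(pub - ref) > tolerance * ref:
            result[site] = (ref, pub)
    return result


# Placeholder data, not real site figures.
reference = {"SITE-A": 1000, "SITE-B": 500}
published = {"SITE-A": 2000, "SITE-B": 510}

for site, (ref, pub) in find_discrepancies(reference, published).items():
    print(f"{site}: reference {ref}, published {pub}")
```

The same check would work for total HS values; sites flagged this way are the ones to contact.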

AOB

For write access to this wiki page please contact wilhelm.buehleraddkit.edu
One participant: the sound quality was very good during this telephone conference, and the wiki page is very useful.

If you have additional topics to be discussed during the meeting, please submit them in advance via our email list.