NGI DE CH Operations Center:Operations Meeting:01072011

Latest revision as of 16:18, 5 September 2011

Introduction

  • Minutes of last meeting
no comments

Announcements

  • Meetings/conferences
A Grid educational event organized by Yves Kemp (DESY-HH) is currently running at the University of Berlin. The test
user certificates have been delivered and no problems are known. All is running fine.
  • Availability/reliability statistics
NGI-DE on average: 97%/98% availability/reliability
Problem site (BMRZ-FRANKFURT), ticket 1304: 73%/73% availability/reliability
Cause: gLite 3.2 migration, misconfigurations and some disk space issues. The problems have been solved.
  • Sites certification procedure
Site UNI-SIEGEN-HEP and site UNI-BONN are fully recertified and back in production.
  • Monitoring
The regional NGI-DE monitoring and the ARC monitoring were updated this week. The regional monitoring now supports Globus 5. Each
Globus site should register its Globus 5 resources in the GOCDB. We continue the monitoring together with the Swiss sites; Res
has to ask his people whether they want this.
Question from Florian (LRZ): Which Globus services will be monitored? Answer: GSI ssh, gridftp and gram 5 are monitored. The Gsissh
test checks the connection to the ports and the authorization, in particular whether the test DN is supported at the site (non-D-Grid
sites have to enter the DN manually; for D-Grid sites this is done automatically if they support the DG ops or DG test). We (LRZ)
also have problems registering our Globus 5 resources in the GOCDB (invalid service type). Answer: Please open a ticket. Foued will
check this after the meeting.
  • Staged rollout/updates
ntr from KIT
Martin/ITWM: EMI WN and EMI DPM staged rollout: we have some problems; the corresponding tickets have not been solved yet and we are
still waiting for solutions.
  • Other
Site UNI-SIEGEN-HEP and site UNI-BONN are fully recertified and back in production.
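
The availability/reliability pairs quoted above (97%/98% for NGI-DE, 73%/73% for BMRZ-FRANKFURT) can be reproduced with a small sketch. It assumes the commonly used definitions: availability is uptime divided by the time with known status, and reliability additionally excludes scheduled downtime. The function names and the hour values in the example are hypothetical and do not come from the monitoring data discussed in the meeting.

```python
def availability(up_hours, total_hours, unknown_hours=0.0):
    """Fraction of time the site was up, ignoring periods with unknown status."""
    known = total_hours - unknown_hours
    if known <= 0:
        raise ValueError("no period with known status")
    return up_hours / known


def reliability(up_hours, total_hours, scheduled_downtime_hours=0.0, unknown_hours=0.0):
    """Like availability, but scheduled downtime is not counted against the site."""
    expected = total_hours - scheduled_downtime_hours - unknown_hours
    if expected <= 0:
        raise ValueError("no period in which the site was expected to be up")
    return up_hours / expected


if __name__ == "__main__":
    # Hypothetical 30-day month (720 h) with 525.6 h of uptime and no declared downtime.
    a = availability(525.6, 720.0)
    r = reliability(525.6, 720.0)
    print(f"{a:.0%}/{r:.0%}")  # prints 73%/73%
```

Under these definitions the two numbers are equal only when no scheduled downtime was counted, so identical values would be consistent with outages that were not covered by a declared DT.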

Round the sites

NGI-DE
  • BMRZ-FRANKFURT (Uni Frankfurt)
  • DESY-HH
Andreas, Dmitri:
Good load: 4000 jobs
Job efficiency: 10 VOs in parallel
Network interruption last Wednesday: the firewall settings and the DFN connection were reconfigured. Another interruption is needed.
Two of three dCache instances were successfully updated to 1.9.12 (second golden release).
Concerned about the migration from gLite to the EMI release. People from CERN (LCG rollout list) recommend not using the EMI release,
and we do not know when the release will be available. Currently we only apply the recommended high-risk security updates for WNs.
Dimitri (KIT) does not see a need to install the EMI release at the moment; we have to wait until the first EMI release is
officially available.
  • DESY-ZN
  • FZJuelich
  • Goegrid
  • GSI
  • ITWM
Martin:
Taking part in the staged rollout for WNs / EMI test: a SAM test failed, so a DT was needed; to resolve the incident, a rollback to
gLite 3.2 was needed.
SW areas ran full -> some WNs were upgraded to SL5.6, but the CREAM CEs want an older version; some packages that caused the problem
(e.g. the CArgus package) were excluded.
Switched the ROD shift next week with CSCS.
  • KIT (GridKa, FZK-LCG2)
Dimitri:
Over the last weeks LHCb put a very high load on the dCache Urprod instance; as a consequence the system crashed. The dCache instance
had to be optimized (additional servers etc.).
Problems with one tape library (stuck tapes, errors in reading labels)
CREAM CE instability is still there
  • KIT (Uni Karlsruhe)
  • LRZ
Florian:
Systems up and running
No planned DT within the next weeks
Plan to evaluate CernVMFS
  • MPI-K
  • MPPMU
Cesare:
CREAM CEs are up and running
Next week we will have a DT to upgrade the BDII and SRM to the latest gLite 3.2, and for new WNs. The old LCG CE and the old farm
will be decommissioned.
Good HammerCloud tests
Problems with the BDII are still under investigation
  • RWTH Aachen
  • SCAI
Oliver:
ntr
  • Uni Bonn
  • Uni Dortmund
  • Uni Dresden
Ralph:
Longer DT to update the operating system of several components (WNs and CREAM CEs). Because of limited manpower, vendor support was
needed. Everything has been working again since yesterday.
  • Uni Freiburg
Anton via email: 
Hello Tobias,
unfortunately I cannot attend the meeting today.
For the minutes: Site UNI-FREIBURG is running smoothly. We had some small downtimes (at risk) for service of the water cooling  
system.
Cheers!
Anton
  • Uni Mainz-Maigrid
  • Uni Siegen
  • Uni Wuppertal
SwiNG
  • CSCS
Miguel via email:
Hi all,
Unfortunately I cannot make it to the meeting today. This is the report for CSCS:
Last week there was a problem with our scratch Lustre filesystem. One of the RAIDs lost 3 disks while rebuilding after a previous 
disk failure and we lost scratch data. Those jobs that were reading/writing to that part of the FS failed  (74 out of ~1400). 
Since then, we disabled the failed storage server and all works fine.
Also last week we had two tickets:
a) https://ggus.eu/ws/ticket_info.php?ticket=71814 APEL Consumer did not respond to a request for confirmation of pre-published 
data
It has been solved as it was a problem present in both gLite 3.2 and EMI1 releases of APEL.
b) https://ggus.eu/ws/ticket_info.php?ticket=71436 hone jobs are aborted immediately after submission into 
cream02.lcg.cscs.ch:8443/cream-pbs-other queue
It has also been solved and the changes will be applied when we upgrade our EA EMI 1 CREAM CE.
We are suffering from random segfaults of Torque pbs_mom service in WNs. It means that, eventually and for small periods of time  
(a few hours max.), there might be less slots available for jobs. Running jobs are not affected. Adaptive Computing support team 
is working on it, but they're having a hard time repeating the issues we see here.
Next Wednesday is CSCS maintenance day and we've been informed that there might be a network downtime of about 30min. It is unsure 
yet whether it would affect us, but if network team confirms it, we'll establish the downtime in gocdb according to rules. 
Besides that, there is no planned action for CSCS-LCG2 production systems on maintenance day.
Starting on Monday July 4 we will be on duty with ROD shift. Our previous problems accessing the Operations Portal have been fixed.
That's all.
Best regards,
Miguel
  • PSI
  • Switch
Res:
We have to check the ROD responsibility for next week. There seems to be some confusion, and the split of responsibility between
ITWM and CSCS is unclear. Note, added after the meeting: For the ROD shift in CW 27 ITWM is on duty.

Note: please update your entry at https://wiki.egi.eu/wiki/NGI_DE:Sites if needed.

Status ROD

CW	From	To	Team	Note
24	13.06	19.06	Team3, KIT	
25	20.06	26.06	Team4, JUELICH	
26	27.06	03.07	Team5, BADW-LRZ	
27	04.07	10.07	Team2, FhG (ITWM)	switched wk 27/29
28	11.07	17.07	Team1, DESY	
29	18.07	24.07	Team6, CSCS/NGI_CH	switched wk 27/29
  • Any problematic tickets?
Problems with University of Karlsruhe (T2 of KIT): the site was down for one week and did not respond.
The site was contacted by phone. Now solved.
At one site job submission failed for a short time. The test seems to be unstable. Recommendation: Please open a
ticket to notify the site.
  • Handover of the ROD shift
Res:
We have to check the ROD responsibility for next week. There seems to be some confusion, and the split of responsibility between
ITWM and CSCS is unclear. Note, added after the meeting: For the ROD shift in CW 27 ITWM is on duty.
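
To avoid this kind of confusion, the ROD table above can be kept as a small lookup. The following is only a sketch: it assumes that the week numbers in the table are ISO calendar weeks of 2011 (the listed dates are consistent with that), and the names `ROD_SCHEDULE_2011`, `week_range` and `team_on_duty` are made up for illustration.

```python
from datetime import date, timedelta

# ROD shifts by calendar week, copied from the table above (after the wk 27/29 switch).
ROD_SCHEDULE_2011 = {
    24: "Team3, KIT",
    25: "Team4, JUELICH",
    26: "Team5, BADW-LRZ",
    27: "Team2, FhG (ITWM)",
    28: "Team1, DESY",
    29: "Team6, CSCS/NGI_CH",
}


def week_range(year, week):
    """Return (Monday, Sunday) of the given ISO calendar week (Python 3.8+)."""
    monday = date.fromisocalendar(year, week, 1)
    return monday, monday + timedelta(days=6)


def team_on_duty(day):
    """Return the team on ROD duty on the given day, or None if the week is not listed."""
    iso_year, iso_week, _ = day.isocalendar()
    if iso_year != 2011:
        return None
    return ROD_SCHEDULE_2011.get(iso_week)


if __name__ == "__main__":
    start, end = week_range(2011, 27)
    print(start.strftime("%d.%m"), end.strftime("%d.%m"), team_on_duty(start))
```

For CW 27 this gives 04.07 to 10.07 and ITWM, which matches the note added after the meeting.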

AOB

@ Henry Jonker / Frankfurt: At GridKa we enabled the life sciences project e-NMR (enmr.eu) on one of our CEs to use our computing
resources. From our site it is working. Since yesterday some jobs have been running. Note: Ticket was reopened.
Important: For the next meeting we will use the DFN telephone conferencing system and the new documentation page
https://wiki.egi.eu/wiki/NGI_DE_CH_Operatons_Center . We want to track this meeting there so that the sites can put their
information in the wiki. Connection details and the link to the web page will follow with the next reminder.


If you have additional topics to be discussed during the meeting, please submit them in advance via our email list.