NGI DE CH Operations Center:Operations Meeting:01072011
Revision as of 14:25, 7 July 2011

Introduction

  • Minutes of last meeting
No comments.

Announcements

  • Meetings/conferences
A Grid educational event organized by Yves Kemp (DESY-HH) is currently running at the University of Berlin. The test user certificates have been delivered and no problems are known. All is fine.
  • Availability/reliability statistics
NGI-DE on average: 97% availability / 98% reliability
Problem site (BMRZ-FRANKFURT), ticket 1304: 73%/73% availability/reliability
Solution: caused by the gLite 3.2 migration, misconfigurations and some disk space issues. The problems have been solved.
  • Sites certification procedure (Foued)
Sites UNI-SIEGEN-HEP and UNI-BONN are fully recertified and back in production.
  • Monitoring
The regional NGI-DE monitoring and the ARC monitoring were updated this week. The regional monitoring now supports Globus 5; each Globus site should register its Globus 5 resources in the GOCDB. We continue the monitoring together with the Swiss sites; Res has to ask his people.
Question from Florian (LRZ): Which Globus services will be monitored? GSISSH, GridFTP and GRAM 5 are monitored. The GSISSH test checks the connection to the ports and the authorization, i.e. whether the test DN is supported at the site (non-D-Grid sites have to enter the DN manually; for D-Grid sites this is done automatically if they support DG ops or DG test). We (LRZ) also have problems registering our Globus 5 resources in the GOCDB (invalid service type). Answer: please open a ticket; Foued will check this after the meeting.
  • Staged rollout/updates
ntr from KIT
Martin (ITWM): EMI WN and EMI DPM staged rollout: we have some problems; the corresponding tickets have not been solved yet and we are still waiting for solutions.
  • Other
Sites UNI-SIEGEN-HEP and UNI-BONN are fully recertified and back in production.
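The availability/reliability percentages quoted above follow the usual EGI-style accounting, in which availability is measured against all known time and reliability additionally excludes scheduled downtime. A minimal sketch of that arithmetic (function names and the time bookkeeping are illustrative assumptions, not the official algorithm):

```python
# Sketch of EGI-style availability/reliability accounting.
# Assumptions: hours-based bookkeeping; "unknown" time (monitoring gaps)
# is excluded from both figures, scheduled downtime only from reliability.

def availability(up_hours: float, total_hours: float,
                 unknown_hours: float = 0.0) -> float:
    """Fraction of known time the site was up."""
    known = total_hours - unknown_hours
    return up_hours / known if known > 0 else 0.0

def reliability(up_hours: float, total_hours: float,
                scheduled_downtime_hours: float = 0.0,
                unknown_hours: float = 0.0) -> float:
    """Like availability, but scheduled downtime is not counted against the site."""
    known = total_hours - unknown_hours - scheduled_downtime_hours
    return up_hours / known if known > 0 else 0.0

# Example: a 720-hour month with 36 hours down, 12 of them scheduled.
month = 720.0
up = month - 36.0
print(round(availability(up, month), 3))                                # 0.95
print(round(reliability(up, month, scheduled_downtime_hours=12.0), 3))  # 0.966
```

This illustrates why reliability is typically slightly higher than availability, as in the 97%/98% NGI-DE figure above.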

Round the sites

  • NGI-DE
    • BMRZ-FRANKFURT (Uni Frankfurt)
    • DESY-HH
Good load: 4000 jobs
Good job efficiency: 10 VOs in parallel
Network interruption last Wednesday; firewall settings and the DFN connection were reconfigured. Another interruption will be needed.
Updated two of the three dCache instances successfully to 1.9.12 (the second golden release)
Concerned about the migration from gLite to the EMI release. People from CERN (LCG rollout list) recommend not using the EMI release yet, and we do not know when the release will be available. Currently we only apply the recommended high-risk WN security updates. Dimitri (KIT) does not consider it necessary to install the EMI release at the moment; we have to wait until the first EMI release is officially available.
    • DESY-ZN
    • FZJuelich
    • Goegrid
    • GSI
    • ITWM
Took part in the staged rollout for WNs / EMI test: a SAM test failed, so a DT was needed; to solve the incident a rollback to gLite 3.2 was required.
The SW areas ran full; upgraded some WNs to SL5.6, but the CREAM CEs want an older version. Excluded some packages (e.g. the CArgus package) that had caused the problem.
Swapped next week's ROD shift with CSCS.
    • KIT (GridKa, FZK-LCG2)
In recent weeks LHCb put a very high load on the dCache Urprod instance; as a consequence the system crashed. The dCache instance had to be optimized (additional servers etc.).
Problems with one tape library (stuck tapes, errors reading labels)
CREAM CE instability is still present
    • KIT (Uni Karlsruhe)
    • LRZ
Systems are up and running
No DT planned within the next weeks
Plan to evaluate CernVM-FS
    • MPI-K
    • MPPMU
CREAM CEs are up and running
Next week we will have a DT to upgrade the BDII and the SRM to the latest gLite 3.2 and to add new WNs; the old LCG CE and the old farm will be decommissioned.
Good HammerCloud test results
Problems with the BDII are still under investigation
    • RWTH Aachen
    • SCAI
ntr
    • Uni Bonn
    • Uni Dortmund
    • Uni Dresden
Longer DT to update the operating system of several components (WNs and CREAM CEs). Because of limited manpower, vendor support was needed. Everything has been working again since yesterday.
    • Uni Freiburg
From Anton Gamel:
Hello Tobias,
unfortunately I cannot attend the meeting today.
For the minutes: site UNI-FREIBURG is running smoothly. We had some small downtimes (at risk) for servicing of the water cooling system.
Cheers!
Anton

    • Uni Mainz-Maigrid
    • Uni Siegen
    • Uni Wuppertal
  • SwiNG
    • CSCS
From Gila Arrondo Miguel Angel:
Hi all,
Unfortunately I cannot make it to the meeting today. This is the report for CSCS:
- Last week there was a problem with our scratch Lustre filesystem. One of the RAIDs lost 3 disks while rebuilding after a previous disk failure, and we lost scratch data. The jobs that were reading from or writing to that part of the FS failed (74 out of ~1400). Since then we have disabled the failed storage server and everything works fine.
- Also last week we had two tickets:
a) https://ggus.eu/ws/ticket_info.php?ticket=71814 APEL Consumer did not respond to a request for confirmation of pre-published data. It has been solved; it was a problem present in both the gLite 3.2 and EMI 1 releases of APEL.
b) https://ggus.eu/ws/ticket_info.php?ticket=71436 hone jobs are aborted immediately after submission into the cream02.lcg.cscs.ch:8443/cream-pbs-other queue. It has also been solved, and the changes will be applied when we upgrade our EA EMI 1 CREAM CE.
- We are suffering from random segfaults of the Torque pbs_mom service on WNs. This means that, occasionally and for short periods of time (a few hours at most), there may be fewer slots available for jobs. Running jobs are not affected. The Adaptive Computing support team is working on it, but they are having a hard time reproducing the issues we see here.
- Next Wednesday is CSCS maintenance day, and we have been informed that there might be a network downtime of about 30 min. It is not yet clear whether it will affect us, but if the network team confirms it, we will declare the downtime in GOCDB according to the rules. Besides that, there is no planned action for CSCS-LCG2 production systems on maintenance day.
- Starting on Monday, July 4, we will be on duty with the ROD shift. Our previous problems accessing the Operations Portal have been fixed.
That's all.
Best regards,
Miguel
    • PSI
    • Switch
We have to check the ROD responsibility for next week. There seems to be some confusion, and the division of responsibility between ITWM and CSCS is unclear. Note added after the meeting: for the ROD shift in CW 27, ITWM is on duty.

Note: please update your entry at https://wiki.egi.eu/wiki/NGI_DE:Sites if needed.
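Several reports above mention declaring downtimes in GOCDB (e.g. the possible CSCS network outage). Declared downtimes can also be checked through the public GOCDB programmatic interface. A hedged sketch follows: the `get_downtime` method and `topentity` parameter follow the public GOCDB PI, but the XML sample is invented for illustration and the real reply contains more fields.

```python
# Sketch: building a GOCDB PI downtime query and parsing a (fabricated)
# sample reply. The endpoint and parameter names are assumptions based on
# the public GOCDB programmatic interface.
import urllib.parse
import xml.etree.ElementTree as ET

GOCDB_PI = "https://goc.egi.eu/gocdbpi/public/"

def downtime_query_url(site: str) -> str:
    """Build a GOCDB PI query URL for a site's declared downtimes."""
    return GOCDB_PI + "?" + urllib.parse.urlencode(
        {"method": "get_downtime", "topentity": site})

def parse_downtimes(xml_text: str) -> list:
    """Extract severity/description pairs from a GOCDB downtime reply."""
    root = ET.fromstring(xml_text)
    return [{"severity": d.findtext("SEVERITY"),
             "description": d.findtext("DESCRIPTION")}
            for d in root.iter("DOWNTIME")]

# Illustrative sample only -- not real GOCDB data.
sample = """<results>
  <DOWNTIME>
    <SEVERITY>OUTAGE</SEVERITY>
    <DESCRIPTION>Network maintenance, ~30 min</DESCRIPTION>
  </DOWNTIME>
</results>"""

print(downtime_query_url("CSCS-LCG2"))
print(parse_downtimes(sample))
```

In practice the URL would be fetched over HTTPS (e.g. with urllib) and the reply fed to the parser; ROD teams can use such a query to cross-check that an announced maintenance window was actually registered.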

Status ROD

CW  From   To     Team                Note
24  13.06  19.06  Team3, KIT
25  20.06  26.06  Team4, JUELICH
26  27.06  03.07  Team5, BADW-LRZ
27  04.07  10.07  Team2, FhG (ITWM)   switched wk 27/29
28  11.07  17.07  Team1, DESY
29  18.07  24.07  Team6, CSCS/NGI_CH  switched wk 27/29
  • Any problematic tickets?
    • Problems with the University of Karlsruhe (T2 of KIT): the site was down for one week and did not respond.
    • At one site, job submission failed for a short time; the test seems to be unstable. Recommendation: please open a ticket to notify the site.

AOB

  • @ Henry Jonker / Frankfurt: At GridKa we enabled the life sciences project e-NMR (enmr.eu) on one of our CEs to use our compute resources. On our side it is working; since yesterday some jobs have been running. Note: the ticket was reopened.
  • Important: for the next meeting we will use the DFN telephone conferencing system and the new documentation page https://wiki.egi.eu/wiki/NGI_DE_CH_Operatons_Center . We want to minute this meeting in the wiki so that the sites can enter their information there. Connection details and the link to the webpage will follow with the next reminder.


If you have additional topics to be discussed during the meeting, please submit them in advance via our email list.