EGI-InSPIRE:Switzerland-QR9

From EGIWiki
Revision as of 18:13, 27 July 2012 by Leinen (talk | contribs)
Quarterly Report Number NGI Name Partner Name Author


1. MEETINGS AND DISSEMINATION

1.1. CONFERENCES/WORKSHOPS ORGANISED

Date Location Title Participants Outcome (Short report & Indico URL)

1.2. OTHER CONFERENCES/WORKSHOPS ATTENDED

Date Location Title Participants Outcome (Short report & Indico URL)


1.3. PUBLICATIONS

Publication title Journal / Proceedings title Journal references (volume number, issue, pages from - to) Authors (1., 2., 3., et al.?)

2. ACTIVITY REPORT

2.1. Progress Summary

CSCS

The entire site moved from its old location in Manno to a newly built centre in Lugano-Cornaredo. The move was used as an opportunity to upgrade the T2 cluster; see "Main Achievements" below.

2.2. Main Achievements

CSCS

  1. Full network redesign, from a mixed environment with 2 separate networks (InfiniBand and Ethernet) to a fully integrated InfiniBand (IB) and 10G Ethernet network with Voltaire bridges.
  2. Replacement of the remaining old Sun Thumper storage servers with newer IBM-based storage (DS 3500).
  3. Replacement of the remaining old Sun worker nodes (96 blade machines) with new Sandy Bridge systems providing 32 job slots per host.
  4. Reorganization and simplification of the GPFS-based scratch filesystem on which the jobs at CSCS run.
  5. Increased the bandwidth available to VM-based grid services from 1G Ethernet to 10G Ethernet.
  6. Replaced 2 VM-based CREAM-CE machines with 3 physical hosts. These systems now run the latest EMI UMD1 update.
  7. Replaced 2 VM-based ARGUS servers with 3 physical hosts, integrated with the CREAM-CE on the same nodes. These systems now run the latest EMI UMD1 update.
  8. Installed CernVMFS for all the VOs supported by CSCS. It is currently used only by ATLAS and LHCb.
  9. General upgrade of service nodes from SL 5.4/5 to SL 5.7 and the latest kernel available at the time (308).

PSI

  1. Added 11 worker nodes (2 * 8-core Xeon "Sandy Bridge" E5-2670 2.6 GHz, 48 GB DDR3) and put them into production.

UNIGE

  1. Stable operation
  2. Upgraded network hardware
  3. Preparation of next upgrade: replacement of the oldest CPU nodes.

2.3. Issues and mitigation

CSCS

  1. We are seeing CPU soft-lockup errors on our old Sun Thors that cause the systems (dCache pools) to block the available software RAIDs. This appears to be related to a known Sun bug that is not going to be fixed. A replacement plan is being drafted.
  2. Initially we saw ARP problems with the Ethernet/IB bridge. These now seem to have been fixed.
  3. CSCS status on the WLCG Dashboard is not always green. This seems to be caused by issues related to user jobs. We think it may have to do with CVMFS, but it is difficult to be sure at this point. Debugging is in progress.

UNIGE

  1. Some disk space management: identifying and removing data we no longer need
  2. Solaris is no longer supported by the DPM software team, and our SE includes Solaris machines. The mitigation is to refrain from updating DPM and hope that the current installation will last another year or two.
  3. The DPM software on the Solaris disk servers does not support RFC proxies. For this reason our SE is no longer a data source for NorduGrid jobs on other sites.
