EGI-InSPIRE:Switzerland-QR15
Latest revision as of 10:28, 8 January 2015

Quarterly Report Number | NGI Name | Partner Name | Author
QR 15 | NGI-CH | Switzerland | Sigve Haug


1. MEETINGS AND DISSEMINATION

1.1. CONFERENCES/WORKSHOPS ORGANISED

Date | Location | Title | Participants | Outcome (Short report & Indico URL)
(no entries reported for this quarter)

1.2. OTHER CONFERENCES/WORKSHOPS ATTENDED

Date | Location | Title | Participants | Outcome (Short report & Indico URL)
2013-12-13 | Edinburgh | DPM workshop | 25 | Detailed updates on the status of the DPM middleware, future developments, and detailed technical information for an optimal deployment and operation of existing and new components
2013-12-04 - 06 | Amsterdam | EGI Workshop | 100 | H2020 discussions


1.3. PUBLICATIONS

Publication title | Journal / Proceedings title | Journal references (Volume, Issue, Pages from - to) | Authors
(no publications reported for this quarter)

2. ACTIVITY REPORT

CSCS

  • 2.1. Progress Summary
- Testing removal of the /expirment_software mount on the WNs
  • 2.2. Main Achievements
- Completed migration to SLURM on all CREAM CEs, ARC CEs, and WNs
- Completed migration to dCache 2.6
- Upgraded to PostgreSQL 9.3
- Moved NFS mounts to a new NAS managed by the CSCS storage team
- Allowed file deletion over the /pnfs mount
  • 2.3. Issues and Mitigation
- Working on publishing accounting to the new APEL server. Currently working on issues relating to
  CREAM/SLURM accounting as well as JURA.
- An InfiniBand switch died and was replaced; no major issues apart from the failed jobs.

"PSI"

  • 2.1. Progress Summary
- Configuration management: (Puppet) repositories migrated from Subversion to Git
- Upgraded to Nagios 4.0.2
- UI upgraded to emi-ui-3.0.2-1.el5
- Upgraded CMS Frontier to 2.7.STABLE9-16.1
- Upgraded PostgreSQL to 9.3
  • 2.2. Main Achievements
- Migrated from dCache 2.2 to 2.6
- SE management improvements:
    * Access permissions for the SE improved: users are assigned to 10 groups, and the dCache
      directory tree allows write access only to the appropriate user and group areas. This limits
      the risk of erroneously producing files in the wrong locations or deleting other users'/groups' files.
    * dCache access permission rules are created by scripts that are integrated with user management.
    * To speed up per-user and per-group space accounting, persistent ("materialized") views were
      created in the underlying PostgreSQL DB (materialized views became available with PostgreSQL 9.3).
      The code implementing the views is available under https://bitbucket.org/fabio79ch/v_pnfs
      (a minimal sketch follows this list).
    * Xrootd service available
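
As an illustration of the materialized-view approach, here is a minimal sketch in Python with
psycopg2. It is not the PSI production code: the view name, the connection parameters, and the
Chimera table and column names (t_inodes, iuid, isize) are assumptions based on a typical dCache
Chimera layout; the actual view definitions are the ones published in the Bitbucket repository above.

    # Minimal sketch, not the PSI production code: a persistent ("materialized") view
    # that pre-aggregates per-user space usage so accounting queries avoid scanning
    # the whole file table. Table/column names (t_inodes, iuid, isize) and the DSN
    # are assumptions; the real views live in the Bitbucket repository cited above.
    import psycopg2

    CREATE_VIEW = """
    CREATE MATERIALIZED VIEW v_space_per_user AS
        SELECT iuid AS uid, count(*) AS n_files, sum(isize) AS bytes_used
        FROM t_inodes
        GROUP BY iuid
    """  # requires PostgreSQL >= 9.3

    def create_view(dsn="dbname=chimera user=dcache"):
        """Create the view once; afterwards only refreshes are needed."""
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute(CREATE_VIEW)

    def report_top_users(dsn="dbname=chimera user=dcache", limit=10):
        """Refresh the view (e.g. from a nightly cron job) and print the top consumers."""
        with psycopg2.connect(dsn) as conn:
            with conn.cursor() as cur:
                cur.execute("REFRESH MATERIALIZED VIEW v_space_per_user")
                cur.execute("SELECT uid, n_files, bytes_used FROM v_space_per_user "
                            "ORDER BY bytes_used DESC LIMIT %s", (limit,))
                for uid, n_files, bytes_used in cur.fetchall():
                    print("uid=%s  files=%s  bytes=%s" % (uid, n_files, bytes_used))

    if __name__ == "__main__":
        report_top_users()
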
  • 2.3. Issues and Mitigation
- Did not upgrade the SL5 WNs to UMD3 because the SL5 UMD3 tarball was not yet available and we want
  to keep our shared file system deployment. The maintainer told us that the tarball will become
  available within January.
- dCache 2.6.19 still does not update the atime field of a file. We would like to use the atime
  information for locating files that have not been accessed in a long time (a sketch of the intended
  scan follows below).
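
To make the intended use concrete, the following is a minimal sketch (not an existing PSI tool) of the
kind of scan we would like to run once atime is maintained: walk an NFS-mounted /pnfs tree and list
files whose atime is older than a cutoff. The mount point and the one-year cutoff are illustrative.

    # Minimal sketch of an atime-based "cold file" scan over an NFS-mounted /pnfs tree.
    # Only meaningful once the underlying dCache actually maintains atime; the mount
    # point and cutoff are illustrative, not a description of a real tool.
    import os
    import sys
    import time

    def cold_files(root="/pnfs", days=365):
        """Yield (path, age_in_days) for files not accessed within `days`."""
        now = time.time()
        cutoff = now - days * 86400
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    st = os.stat(path)
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                if st.st_atime < cutoff:
                    yield path, (now - st.st_atime) / 86400.0

    if __name__ == "__main__":
        root = sys.argv[1] if len(sys.argv) > 1 else "/pnfs"
        for path, age in cold_files(root):
            print("%8.0f days  %s" % (age, path))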

UZH

Nothing reported.

UNIBE-ID

  • Progress Summary
- Stable operations, with minor issues reported below.
  • Main Achievements
- UBELIX relocation: at the end of August the whole cluster was relocated to a new server room at the
  von Roll complex with only minor problems. After 8.5 days of downtime the cluster was fully
  operational again. Although there were no hardware defects right after the relocation, over the next
  few weeks three hard disks and one mainboard broke.
  • Issues and Mitigation
- ARC CE usage record registration to the SMSCG database stopped working due to duplicated job IDs
  occurring in the SMSCG DB. After deduplication with a small batch script, usage record delivery
  works again, and the accumulated usage records were delivered as well.
- At the new cluster location brand-new Brocade switches were installed. Since the relocation we have
  occasionally been facing very short network link-downs on two of those switches. Cases have been
  opened with Brocade and the problem is being investigated.

UNIGE-DPNC

  • Progress Summary
- Major update of the DPM SE, ongoing migration to SLC6, virtualization of
  central services, maintenance and stable operation.
  • Main Achievements
- Upgrade of the DPM SE
- 4 new disk servers (IBM x3630 M4, 43 TB for data) running SLC6 and added to the DPM
- 6 old Solaris disk servers drained of data and retired (2 reused for NFS)
- DPM software upgraded to 1.8.7
- WebDAV and xrootd interfaces added
- Data access via xrootd tested and documented for users (a usage sketch follows this list)
- Reorganization of data in the SE for the new ATLAS DDM 'Rucio':
    the renaming process was run by Cedric Serfon for DDM ops using WebDAV;
    two attempts failed in Dec 2013 with 'Too many connections' errors, and help from the DPM experts
    was not on target; success in Jan 2014, with local jobs not running
- Preparing a funding request
- Yearly review of accounts
- Change of nearly all IP numbers, making room for growth
- A new web server running SLC6
- Upgrade of Ganglia monitoring to version 3.1.7 (the one shipped with SLC6), compiled from source on SLC5
- Preparing virtual machines for more central services (ARC CE and the batch server)
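
As a usage illustration only (the endpoint and file path below are placeholders, not the UNIGE
production names), reading a file over xrootd with the XRootD Python bindings looks roughly like this:

    # Rough usage sketch of xrootd data access with the XRootD Python bindings.
    # The host name and file path are placeholders; consult the site documentation
    # for the real DPM endpoint and namespace.
    from XRootD import client
    from XRootD.client.flags import OpenFlags

    HOST = "root://dpm.example.ch:1094"
    PATH = "/dpm/example.ch/home/atlas/user/somefile.root"

    # Query file metadata through the storage endpoint.
    fs = client.FileSystem(HOST)
    status, info = fs.stat(PATH)
    if status.ok:
        print("size: %d bytes" % info.size)

    # Open the file remotely and read the first kilobyte.
    with client.File() as f:
        status, _ = f.open(HOST + "/" + PATH, OpenFlags.READ)
        if status.ok:
            status, chunk = f.read(offset=0, size=1024)
            print("read %d bytes" % len(chunk))
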
  • Issues and Mitigation
- Federated Data Access using Xrootd (FAX) is not working yet. We lack support.
- Ganglia 3.1.7 does not compile on Solaris 10. No solution yet.
- One hardware problem: an overheating hardware RAID.


UNIBE-LHEP

  • Progress Summary
- New cluster with ~800 cores fully commissioned, fronted by an ARC CE with ARC 3.0.3-1.el6.x86_64 (EMI-3)
- Both production clusters commissioned for ATLAS analysis payloads and multi-core Athena workloads
- Started work on migrating to ARC native accounting reporting to APEL (JURA)
- DPM SE re-configured for WebDAV data access for ATLAS, required for the new ATLAS DDM 'Rucio'
  (upgraded all head and pool nodes to the latest DPM versions on EMI-3: 1.8.7-3.el6.x86_64)
- Integrated the VO t2k.org on both clusters and the SE
- Installed national instances of VOMS and GIIS to replace the current instances that will be retired
  by SWITCH. Commissioning in progress.
- Gathering quotes for cluster expansion
- Quite stable operation of the production cluster (ce.lhep). Preparation for migration from CentOS 5
  to SLC6 and expansion with nodes obtained from CERN/ATLAS.
- A new cluster with ~1500 cores (ce01.lhep) has been commissioned and operation stabilised. Some
  outstanding issues (details below).
  • Main Achievements
- Operation stability improved considerably

  • Issues and Mitigation
- Network issues on UIs and one DPM pool node. Not easily understood. Eventually, after workarounds
  were in place, understood to be caused by stray Ganglia processes, yet the exact mechanism causing
  the upset is unknown.
- One of the two clusters not publishing to the SGAS national instance: an intervention on the SGAS
  server was required (permission problem).
- Both clusters (and also others in CH) not being accounted for in APEL since July 2013. Not detected
  until January 2014 (no GGUS ticket issued). Currently working with SWITCH in order to bring the APEL
  accounting repository up to date.