EGI-InSPIRE:Switzerland-QR15

Quarterly Report Number | NGI Name | Partner Name | Author
QR 15 | NG-CH | Switzerland | Sigve Haug


1. MEETINGS AND DISSEMINATION

1.1. CONFERENCES/WORKSHOPS ORGANISED

Date | Location | Title | Participants | Outcome (Short report & Indico URL)

1.2. OTHER CONFERENCES/WORKSHOPS ATTENDED

Date | Location | Title | Participants | Outcome (Short report & Indico URL)
2013-12-13 | Edinburgh | DPM workshop | 25 | Detailed updates on the status of the DPM middleware and its future developments; detailed technical information for optimal deployment and operation of existing and new components.
2013-12-04 to 2013-12-06 | Amsterdam | EGI Workshop | 100 | H2020 discussions


1.3. PUBLICATIONS

Publication title | Journal / Proceedings title | Journal references (Volume number, Issue, Pages from - to) | Authors (1., 2., 3., et al.)

2. ACTIVITY REPORT

CSCS

- Testing removal of the /experiment_software mount on the worker nodes (WNs)
- Completed migration to SLURM on all CREAM CEs, ARC CEs, and WNs
- Completed migration to dCache 2.6
- Upgraded to PostgreSQL 9.3
- Moved NFS mounts to a new NAS managed by the CSCS storage team
- Enabled file deletion over the /pnfs mount
- Working on publishing accounting data to the new APEL server; currently resolving issues with CREAM/SLURM accounting as well as with Jura
- An InfiniBand switch failed and was replaced; no major issues apart from the failed jobs

PSI

- Configuration management: Puppet repositories migrated from Subversion to Git
- Upgraded to Nagios 4.0.2
- UI upgraded to emi-ui-3.0.2-1.el5
- Upgraded CMS Frontier to 2.7.STABLE9-16.1
- Upgraded PostgreSQL to 9.3
- Migrated from dCache 2.2 to 2.6
- SE management improvements:
   * Improved access permissions on the SE. Users are assigned to 10 groups, and the dCache directory tree allows write access only to the appropriate user and group areas. This limits the risk of files being written to wrong locations and of users deleting other users' or groups' files.
   * dCache access permission rules are generated by scripts that are integrated with user management.
   * To speed up per-user and per-group space accounting, persistent ("materialized") views were created in the underlying PostgreSQL DB (materialized views became available with PostgreSQL 9.3). The code implementing the views is available at https://bitbucket.org/fabio79ch/v_pnfs
   * Xrootd service available
- SL5 WNs were not upgraded to UMD3 because the SL5 UMD3 tarball was not yet available and we want to keep our shared-file-system deployment. The maintainer told us that the tarball will become available in January.
- dCache 2.6.19 still does not update the atime field of a file. We would like to use the atime information to locate files that have not been accessed for a long time.
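A materialized view trades freshness for query speed. Per-user and per-group totals are computed once and refreshed periodically, so a request no longer has to scan the whole namespace. The sketch below illustrates this idea with Python's built-in sqlite3 module. It is not the PSI implementation (that is the v_pnfs code linked above). Several parts are assumptions for illustration only:
- SQLite has no materialized views, so summary tables rebuilt by a hypothetical `refresh_usage_views` function stand in for PostgreSQL's `CREATE MATERIALIZED VIEW` / `REFRESH MATERIALIZED VIEW`.
- The `files` table is a simplified, hypothetical schema, not the real dCache namespace schema.
- The helpers `user_usage` and `group_usage` are likewise hypothetical.

```python
import sqlite3

# Hypothetical, simplified namespace table: one row per file with owner, group
# and size. The real dCache namespace schema on PostgreSQL looks different.
SCHEMA = "CREATE TABLE files (path TEXT PRIMARY KEY, uid TEXT, gid TEXT, size INTEGER);"


def refresh_usage_views(conn: sqlite3.Connection) -> None:
    """Recompute per-user and per-group totals in one pass over `files`.

    SQLite has no materialized views, so the summary tables rebuilt here
    stand in for PostgreSQL's CREATE/REFRESH MATERIALIZED VIEW.
    """
    conn.executescript(
        """
        DROP TABLE IF EXISTS usage_by_user;
        DROP TABLE IF EXISTS usage_by_group;
        CREATE TABLE usage_by_user AS
            SELECT uid, COUNT(*) AS n_files, SUM(size) AS bytes
            FROM files GROUP BY uid;
        CREATE TABLE usage_by_group AS
            SELECT gid, COUNT(*) AS n_files, SUM(size) AS bytes
            FROM files GROUP BY gid;
        """
    )


def _lookup(conn, table, key_col, key):
    # Reads only the small precomputed table, never the full `files` table.
    # `table` and `key_col` are fixed internal names, not user input.
    row = conn.execute(
        f"SELECT n_files, bytes FROM {table} WHERE {key_col} = ?", (key,)
    ).fetchone()
    return row if row is not None else (0, 0)


def user_usage(conn, uid):
    return _lookup(conn, "usage_by_user", "uid", uid)


def group_usage(conn, gid):
    return _lookup(conn, "usage_by_group", "gid", gid)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO files VALUES (?, ?, ?, ?)",
        [("/pnfs/u/alice/a", "alice", "cms", 100),
         ("/pnfs/u/alice/b", "alice", "cms", 50),
         ("/pnfs/u/bob/c", "bob", "cms", 30)],
    )
    refresh_usage_views(conn)
    print(user_usage(conn, "alice"))   # (2, 150)
    print(group_usage(conn, "cms"))    # (3, 180)
```

In this sketch the totals are stale between refreshes: a file written after the last `refresh_usage_views` call is not counted until the next refresh. For periodic space accounting that trade-off is usually acceptable, since each lookup then reads one row instead of the whole namespace.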

UZH

Nothing reported.

UNIBE-ID


UNIGE-DPNC

- Major update of the DPM SE, ongoing migration to SLC6, virtualization of
  central services, maintenance and stable operation.
- Upgrade of the DPM SE:
   * 4 new disk servers (IBM x3630 M4, 43 TB for data) running SLC6 added to the DPM
   * 6 old Solaris disk servers drained of data and retired (2 reused for NFS)
   * DPM software upgraded to 1.8.7
   * WebDAV and xrootd interfaces added
   * Data access via xrootd tested and documented for users
- Reorganization of data in the SE for the new ATLAS DDM 'Rucio':
   * Renaming process run by Cedric Serfon for DDM ops, using WebDAV
   * Two failed attempts in Dec 2013 with 'Too many connections' errors; help from the DPM experts did not address the problem
   * Succeeded in Jan 2014, with local jobs not running
- Preparing a funding request
- Yearly review of accounts
- Changed nearly all IP addresses, making room for growth
- A new web server running SLC6
- Upgraded Ganglia monitoring to version 3.1.7 (the version shipped with SLC6), compiled from source on SLC5
- Preparing virtual machines for more central services (ARC CE and the batch server)
- Federated Data Access using Xrootd (FAX) not working yet; we lack support.
- Ganglia 3.1.7 does not compile on Solaris 10. No solution yet.
- One hardware problem: overheating hardware RAID.


UNIBE-LHEP

* Progress summary

- New cluster with ~800 cores fully commissioned, fronted by an ARC CE with ARC 3.0.3-1.el6.x86_64 (EMI-3)
- Both production clusters commissioned for ATLAS analysis payloads and multi-core ATHENA workloads
- Started migrating to ARC-native accounting reporting to APEL (Jura)
- DPM SE reconfigured for WebDAV data access for ATLAS, as required for the new ATLAS DDM 'Rucio' (all head and pool nodes upgraded to the latest DPM version on EMI-3: 1.8.7-3.el6.x86_64)
- Integrated VO t2k.org on both clusters and the SE
- Installed national instances of VOMS and GIIS to replace the current instances, which will be retired by SWITCH. Commissioning in progress.
- Gathering quotes for cluster expansion
- Quite stable operation of the production cluster (ce.lhep). Preparing migration from CentOS5 to SLC6 and expansion with nodes obtained from CERN/ATLAS
- A new cluster with ~1500 cores (ce01.lhep) has been commissioned and its operation stabilised. Some issues are still outstanding (details below)
* Main achievements
- Operation stability improved considerably

* Issues and mitigations
- Network issues on the UIs and one DPM pool node, not easily understood. After workarounds were in place, they were eventually traced to stray Ganglia processes, but the exact mechanism causing the upset is still unknown.
- One of the two clusters was not publishing to the SGAS national instance: an intervention on the SGAS server was required (permission problem)
- Both clusters (and also others in CH) have not been accounted for in APEL since July 2013. This was not detected until January 2014 (no GGUS ticket was issued).
  Currently working with SWITCH to bring the APEL accounting repository up to date