EGI-InSPIRE:Switzerland-QR10

Latest revision as of 09:22, 8 January 2015

{| border="1" cellspacing="0" cellpadding="2"
!scope="col"|Quarterly Report Number||NGI Name||Partner Name||Author
|}


1. MEETINGS AND DISSEMINATION

1.1. CONFERENCES/WORKSHOPS ORGANISED

{| border="1" cellspacing="0" cellpadding="2"
!scope="col"|Date||Location||Title||Participants||Outcome (Short report & Indico URL)
|-
|September 2012||Prague||EGI Technical Forum||UNIBE-LHEP (Gianfranco Sciacca), UZH (Sergio Maffioletti, Tyanko Alexiev), SWITCH (Alessandro Usai, Simon Leinen, Valery Tschopp)||Important know-how built, accounting problem discussed, OMB and OTAG
|}

1.2. OTHER CONFERENCES/WORKSHOPS ATTENDED

{| border="1" cellspacing="0" cellpadding="2"
!scope="col"|Date||Location||Title||Participants||Outcome (Short report & Indico URL)
|-
|October 2012||Lugano (CH)||Swiss High Performance Computing Forum||UNIGE (Szymon Gadomski), UNIBE-LHEP (Gianfranco Sciacca and Sigve Haug), ETH/CSCS (Christoph Grab and Pablo Fernandez)||Talk: 'Disk Pool Manager Storage Systems at the Universities of Bern and Geneva' by S. Gadomski (UNIGE-DPNC) and G. Sciacca (UNIBE-LHEP)
|-
|August 2012||Karlsruhe (DE)||GridKa school||G. Sciacca (UNIBE), Sergio Maffioletti (UZH)||Various topics of interest for Grid site administrators, community networking
|}


1.3. PUBLICATIONS

{| border="1" cellspacing="0" cellpadding="2"
!scope="col"|Publication title||Journal / Proceedings title||Journal references (volume number, issue, pages from - to)||Authors (1., 2., 3., et al.?)
|}

2. ACTIVITY REPORT

2.1. Progress Summary

Progress on accounting and monitoring.

Transition to UMD releases almost complete.

UNIBE-ID certified (work on its accounting problem is in progress).

Review/reshuffle of the tasks among the partners.

2.2. Main Achievements

A solution to the UNIBE accounting problem is finally in place. CSCS is still waiting for a definitive solution for accounting in a mixed CREAM CE + ARC CE environment.

ARC monitoring is to be enabled on the production Nagios system. The ARC probes are NOT of production quality, so some sites reserve the right to decide whether to accept monitoring of their ARC services (flagged as non-production in GOCDB).

Nagios ARC probes review: ongoing. The ARC gridftp probe was decommissioned by OMB at the request of NGI_CH.

Upgrade to UMD1 and UMD2 complete, with one exception: Geneva (UNIGE) will soon upgrade its site BDII and DPM from gLite 3.2 to UMD (see GGUS ticket). Apart from UNIGE, there are no gLite 3.1 or gLite 3.2 services left in NGI_CH.

Security effort to be reviewed by NGI_CH. Some manager tasks to be reassigned from SWITCH to SWING.

UNIBE-ID certified. It will start being monitored by the production Nagios system as soon as the probes are enabled (it was certified using the Nagios test system). Note: the ARC 2 BDII problem has been escalated with a GGUS ticket.

In general, it has been remarked that ARC support is lacking and could be improved.

2.3. Issues and mitigation

CSCS

  1. Some storage problems at CSCS caused by hardware failures; they have been tackled with emergency disks.
  2. CPU expansion at CSCS (the cluster now has 21800 HS06 of computing power).
  3. All compute nodes upgraded to UMD1.
  4. All VOBOX middleware services (gsissh and bdii) were removed from the VO-specific machines.


PSI

  1. For over a year, 1 TB disk failures in our old SUN X4540 systems; no data loss thanks to the resilient ZFS RAID6. These systems need to stay in operation for some more time.
  2. Introduction of a 3 GB limit on RAM usage for jobs in the SGE configuration -> needed to switch from srmcp (Java) based tools to lcg-tools, because srmcp caused short-lived peaks in memory consumption high enough for SGE to kill the job.
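
Why a short-lived peak matters can be illustrated with a small Python sketch. It assumes, as item 2 implies, that the limit applies to peak usage, so a job is killed as soon as its memory use exceeds 3 GB even briefly. The function name and the memory profiles are hypothetical and do not reflect how SGE measures memory internally.

```python
LIMIT_GB = 3.0  # per-job RAM limit from the SGE configuration


def exceeds_limit(samples_gb, limit_gb=LIMIT_GB):
    """Return True if any memory sample is above the limit."""
    return max(samples_gb) > limit_gb


# Hypothetical memory profiles in GB, sampled over the life of a job.
spiky_job = [0.4, 0.5, 3.4, 0.5, 0.4]   # mostly low, one brief spike
steady_job = [0.6, 0.7, 0.7, 0.6, 0.6]  # never close to the limit

# The average hides the spike: it stays well below the limit.
mean_spiky = sum(spiky_job) / len(spiky_job)

print(exceeds_limit(spiky_job), exceeds_limit(steady_job), mean_spiky < LIMIT_GB)
```

In this sketch the spiky job would be killed although its average consumption is low, which is the behaviour that motivated moving away from srmcp.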


UNIGE

  1. Issue: Space token UNIGE-DPNC_LOCALGROUPDISK went over 90% full a few times.
  2. Mitigation: Identifying replicas that are no longer needed or less needed.
  3. Methods: statistics of age and last access times (available from the DPM), lists of datasets to keep (maintained by analysis project leaders), evaluation of total data size by project based on those lists, negotiation in the group meetings.
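
The selection method listed above can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used at UNIGE: the `Replica` record, the `select_candidates` function, the 80% target, and all dataset names and sizes are hypothetical, and the per-replica statistics are assumed to have been exported from the DPM beforehand.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Replica:
    dataset: str           # dataset name (hypothetical values below)
    size_tb: float         # replica size in TB
    last_access: datetime  # last access time, assumed exported from the DPM


def select_candidates(replicas, keep, used_tb, capacity_tb, target_fraction=0.8):
    """Pick least-recently-accessed replicas that are not on the keep list,
    stopping once projected usage is at or below the target."""
    target_tb = target_fraction * capacity_tb
    candidates = []
    # Oldest last access first: these are the "less needed" replicas.
    for r in sorted(replicas, key=lambda r: r.last_access):
        if used_tb <= target_tb:
            break
        if r.dataset in keep:
            continue  # protected by an analysis project leader
        candidates.append(r)
        used_tb -= r.size_tb
    return candidates, used_tb


replicas = [
    Replica("data12.A", 6.0, datetime(2012, 3, 1)),
    Replica("data12.B", 7.0, datetime(2012, 9, 15)),
    Replica("mc11.C", 5.0, datetime(2011, 11, 20)),
    Replica("mc12.D", 2.0, datetime(2012, 10, 2)),
]
keep = {"mc11.C"}  # datasets that project leaders asked to keep

candidates, projected_tb = select_candidates(
    replicas, keep, used_tb=92.0, capacity_tb=100.0
)
print([r.dataset for r in candidates], projected_tb)
```

The output of such a script would only be a starting point; according to the list above, the final decision was negotiated in the group meetings.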


UNIBE-LHEP

Main achievements

  1. Stable operation, no downtime.
  2. Completed electrical and cooling upgrades of the server room in order to accommodate the hardware moved from CSCS.
  3. Rolling move (without downtime) of all production worker nodes and Lustre servers to new water-cooled racks.
  4. Build of new ARC CE as front-end of new cluster.
  5. Progress in commissioning the new cluster: WN image customised for ATLAS and Infiniband interconnect.
  6. Pledged CPU and disk resources to ATLAS as Tier-2 centre for April 2013.
  7. Site accounting records now appear in the central EGI accounting portal (backdated to 2010).

Issues and mitigations

  1. Issue: bugs in current version of ARC middleware used to build the new CE
  2. Mitigation: adopt code hacks developed elsewhere in order to circumvent the bugs
  3. Issue: EGI Nagios monitoring not yet in production or of production quality
  4. Mitigation: increase levels of local monitoring at the cost of manpower and rely on operational monitoring by ATLAS

UNIBE-ID

Site certified (ARC 2 CE, no gLite/storage). Accounting to be fixed soon.

{| border="1" cellspacing="0" cellpadding="2"
!scope="col"|Issue Description||Mitigation Description
|}