Difference between revisions of "EGI-InSPIRE:Switzerland-QR10"

Revision as of 16:26, 1 November 2012


Quarterly Report Number | NGI Name | Partner Name | Author


1. MEETINGS AND DISSEMINATION

1.1. CONFERENCES/WORKSHOPS ORGANISED

Date | Location | Title | Participants | Outcome (Short report & Indico URL)
September 2012 | Prague | EGI Technical Forum | UNIBE-LHEP (Gianfranco Sciacca), UZH (Sergio Maffioletti, Tyanko Alexiev), SWITCH (Alessandro Usai, Simon Leinen, Valery Tschopp) | Important know-how built; accounting problem discussed; OMB and OTAG

1.2. OTHER CONFERENCES/WORKSHOPS ATTENDED

Date | Location | Title | Participants | Outcome (Short report & Indico URL)
October 2012 | Lugano (CH) | Swiss High Performance Computing Forum | UNIGE (Szymon Gadomski) and UNIBE-LHEP (Gianfranco Sciacca) | Talk: 'Disk Pool Manager Storage Systems at the Universities of Bern and Geneva', S. Gadomski (UNIGE-DPNC) and G. Sciacca (UNIBE-LHEP)


1.3. PUBLICATIONS

Publication title | Journal / Proceedings title | Journal references (Volume number, Issue, Pages from - to) | Authors (1., 2., 3., Et al?)

2. ACTIVITY REPORT

2.1. Progress Summary

Progress on accounting and monitoring.

Transition to UMD releases almost complete.

UNIBE-ID certified.

Review/reshuffle of the tasks among the partners.

2.2. Main Achievements

A solution to the UNIBE accounting problem is finally in place. CSCS is still waiting for a definitive solution for accounting in a mixed CREAM CE + ARC CE environment.

ARC monitoring is to be enabled on the prod Nagios system. The probes are NOT production quality, so some sites reserve the right to decide whether to accept monitoring of their ARC services (flagged as non-production in GOCDB).

Nagios ARC probes review: ongoing; the ARC gridftp probe was decommissioned by the OMB following an NGI_CH request.

Upgrade to UMD1 and UMD2 complete, with one exception: Geneva will soon upgrade its site BDII and DPM from gLite 3.2 to UMD (see GGUS ticket). Apart from UNIGE, there are no gLite 3.1 or gLite 3.2 services left in NGI_CH.

Security effort to be reviewed by NGI_CH. Some manager tasks are to be reassigned from SWITCH to SWING.

UNIBE-ID certified; it will start being monitored by the Nagios prod system as soon as the probes are enabled (it was certified using the Nagios test system). Note: an ARC 2 BDII problem has been escalated with a GGUS ticket.

In general, it has been remarked that ARC support is lacking and could be improved.

2.3. Issues and mitigation

CSCS: Some storage problems caused by hardware failures have been tackled with emergency disks. CPU expansion: the cluster now has 21800 HS06 of computing power. All compute nodes have been upgraded to UMD1, and all VOBOX middleware services (gsissh and BDII) have been removed from the VO-specific machines.


PSI: For over a year, 1 TB disks have been failing in our old SUN X4540 systems; there has been no data loss thanks to the resilient ZFS RAID6. We need to keep operating these systems for some more time. A limit of 3 GB on RAM usage for jobs was introduced in the SGE configuration. This required switching from srmcp (Java) based tools to lcg-tools, because srmcp caused short-lived memory peaks high enough for SGE to kill the job.

UNIGE: Issue: The space token UNIGE-DPNC_LOCALGROUPDISK went over 90% full a few times. Mitigation: Identifying replicas that are no longer needed or are less needed. Methods: statistics on age and last access times (available from the DPM), lists of datasets to keep (maintained by analysis project leaders), evaluation of total data size per project based on those lists, and negotiation in the group meetings.
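
The UNIGE selection method can be illustrated with a minimal sketch. It is not the tooling used at the site. It assumes that replica records (dataset name, size, last access time) have already been exported from the DPM in some form, and that the keep lists are available as sets of dataset names per project. All names here (`Replica`, `is_over_threshold`, `size_by_project`, `cleanup_candidates`) are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Replica:
    """One replica record; the fields are assumed, not a real DPM schema."""
    dataset: str
    size_bytes: int
    last_access: datetime


def is_over_threshold(used_bytes, capacity_bytes, threshold=0.9):
    """True if the space token is filled beyond the given fraction."""
    return used_bytes / capacity_bytes > threshold


def size_by_project(replicas, keep_lists):
    """Total size of the data each project asks to keep.

    keep_lists maps a project name to the set of dataset names to keep.
    """
    totals = {project: 0 for project in keep_lists}
    for replica in replicas:
        for project, datasets in keep_lists.items():
            if replica.dataset in datasets:
                totals[project] += replica.size_bytes
    return totals


def cleanup_candidates(replicas, keep_lists, now, max_idle):
    """Replicas on no keep list and not accessed within max_idle.

    The result is sorted with the least recently accessed replica first.
    """
    kept = set().union(*keep_lists.values())
    cutoff = now - max_idle
    stale = [
        r for r in replicas
        if r.dataset not in kept and r.last_access < cutoff
    ]
    return sorted(stale, key=lambda r: r.last_access)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        Replica("dataset_a", 100, datetime(2012, 10, 20)),
        Replica("dataset_b", 200, datetime(2011, 1, 1)),
    ]
    keep = {"project_1": {"dataset_a"}}
    for r in cleanup_candidates(sample, keep, datetime(2012, 11, 1),
                                timedelta(days=180)):
        print(r.dataset, r.size_bytes)
```

A list produced this way would only be a starting point. According to the report, the final decision is negotiated in the group meetings.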

UNIBE: A second cluster with ARC2 has been installed; it will eventually replace the current (somewhat old) cluster. ARC support is problematic. Accounting is finally working.

Issue Description | Mitigation Description