
Agenda-09-10-2017
{{Template:Op menubar}} {{Template:Doc_menubar}} {{TOC_right}}
[[Category:Grid Operations Meetings]]


= General information  =

= Middleware =
== UMD/CMD ==


* Next UMD 4.6.0 regular release is planned for mid-November 2017 (UI, CREAM, ARGUS): https://wiki.egi.eu/wiki/UMD_Release_Schedule
* UMD 4.5.1 planned to release ARGUS 1.7.2, CVMFS server, FTS 3.6.8/C7 (and possibly ARC)
* CMD-ONE 1.0.0 (Cloud Middleware Distribution for OpenNebula 5, for CentOS 7) released: http://repository.egi.eu/2017/09/21/release-cmd-one-1-0-0/
** oneacct 0.4.6-1 - tool for exporting accounting data from OpenNebula into APEL
** rOCCI-server 1.1.9 - server implementation of the Open Cloud Computing Interface
** APEL-SSM 2.1.7 - Secure STOMP Messenger
** cloudkeeper 1.5.0 - tool to synchronize cloud appliances between AppDB and cloud platforms
** cloudkeeper-one 1.2.4 - OpenNebula backend for cloudkeeper
** gridsite 2.3.3 - set of extensions to the Apache web server and a toolkit for Grid credentials, GACL access control lists and HTTP(S) protocol operations
** cloud-bdii-infoprovider 0.8.3 - generates a representation of cloud resources that can be published in a BDII or in other components, such as the INDIGO Configuration Management Database (CMDB)
** rOCCI-client 4.3.9 - client implementation of the Open Cloud Computing Interface
* CMD-OS 1.1.3 (revision update of the Cloud Middleware Distribution for OpenStack Mitaka, on CentOS 7 and Ubuntu Xenial) released: http://repository.egi.eu/2017/09/21/release-cmd-os-1-1-3/
** cASO 1.1.1, providing important fixes
** cloud-info-provider 0.8.3
** ooi 1.1.2


== Preview repository  ==

Released on 2017-09-22:
* '''[[Preview 1.14.0]]''' [https://appdb.egi.eu/store/software/preview.repository/releases/1.0/1.14.0/ AppDB info] (sl6): ARC 15.03 update 16, CVMFS 2.4.1, DMLite 0.8.8, gfal2 2.14.2, gfal2-python 1.9.3, gfal2-utils 1.5.1, STORM 1.11.12, VOMS Admin 3.7.0, XRootD 4.7.0
* '''[[Preview 2.14.0]]''' [https://appdb.egi.eu/store/software/preview.repository/releases/2.0/2.14.0/ AppDB info] (CentOS 7): ARC 15.03 update 16, ARGUS 1.7.2, CVMFS 2.4.1, DMLite 0.8.8, gfal2 2.14.2, gfal2-python 1.9.3, gfal2-utils 1.5.1, UI 4.0.3, XRootD 4.7.0
 
'''IMPORTANT''': the latest version of XRootD (4.7.0) has broken the xrdcp command towards dCache:

https://github.com/xrootd/xrootd/issues/593

The issue seems to be server-side (a missing field in the response, which is now enforced by the client); a fix will be provided in the upcoming XRootD 4.7.1.


= Operations  =

== ARGO/SAM ==

<br>


== FedCloud ==
* cASO upgrade campaign ongoing: https://wiki.egi.eu/wiki/Federated_Cloud_siteconf#cASO_upgrade


== Feedback from Helpdesk  ==
NTR


== Monthly Availability/Reliability ==
*Underperformed sites in the past A/R reports with issues not yet fixed:
** '''AsiaPacific'''
*** TW-NCUHEP: underperforming in the past months because of a network issue with one of the Nagios servers; it now seems solved: https://ggus.eu/index.php?mode=ticket_info&ticket_id=128083
*** KR-UOS-SSCC: there were SRM problems and now also CREAM failures; the suspension has been proposed: https://ggus.eu/index.php?mode=ticket_info&ticket_id=127024
*** T2-TH-SUT: CAs upgraded, A/R figures are improving: https://ggus.eu/index.php?mode=ticket_info&ticket_id=130558
** '''NGI_AEGIS''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=130560
*** AEGIS04-KG: the JobPurge probe is failing
** '''NGI_BG''': BG05-SUGrid SE put out of production https://ggus.eu/index.php?mode=ticket_info&ticket_id=130561
** '''NGI_DE''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=130563
*** mainzgrid
*** INFN-ROMA1-CMS: still underperforming, although the bug in the Nagios probes for CREAM (GGUS ticket 128151) has disappeared
** '''NGI_GRNET''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=130565 (QoS)
** '''NGI_NL''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=130567 (QoS)
*Underperformed sites after 3 consecutive months (July, August, September), underperformed NGIs, QoS violations:
** '''NGI_GRNET''': https://ggus.eu/index.php?mode=ticket_info&ticket_id=130908
*** HG-05-FORTH
*Suspended sites (in August): UA-IFBG, KR-UOS-SSCC
 
== New weights for the NGIs average A/R values, based on Computation Power ==

We would like to implement a new way of computing the weights for the NGIs average A/R values, introducing the concept of a CE's "'''computation power'''":

 computation power = hep-spec06 * LogicalCPUs

This quantity can be summed over the CEs of a site (and over the sites of an NGI). Until now, the CEs' hep-spec values were simply summed to obtain a site-wide value, but this is not correct: each hep-spec value refers to a particular CE (to the cluster behind that CE) and cannot be meaningfully added up.
That is why, first of all, we asked VAPOR to implement the "computation power" as well as the site/NGI "average hep-spec". Have a look, for example, at the "figures" section: http://operations-portal.egi.eu/vapor/resources/GL2ResSummary

In the ARGO development instance the new weights have been used for computing the September average A/R values: http://web-egi-devel.argo.grnet.gr/lavoisier/ngi_reports?accept=html

We made a comparison between these values and the official ones: http://argo.egi.eu/lavoisier/ngi_reports?accept=html

As expected, there were some improvements and some worsenings, perhaps more accentuated for NGIs with few sites. With the new method, sites providing more than one CE (with the same or different hep-spec values) weigh less than before (for better or for worse), because we compute an average hep-spec, not a simple sum over the benchmark values.
Moreover, several sites are still missing the information needed to compute the weights with either method: please check on VAPOR the values published by your sites, and make sure the number of logical CPUs and the HEP-SPEC06 benchmark are properly published in the GLUE2 schema.
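The proposed weighting can be sketched in a few lines. This is an illustrative example only: the site names and figures are made up, not taken from production data, and the formulas simply follow the definition above.

```python
# Sketch of the proposed "computation power" weighting (illustrative only).
# Per the definition above: computation power = hep-spec06 * LogicalCPUs,
# which, unlike hep-spec06 itself, can be summed over the CEs of a site.

def computation_power(ces):
    """Sum hep-spec06 * logical_cpus over the CEs of a site."""
    return sum(hepspec * cpus for hepspec, cpus in ces)

# site -> (list of (hep-spec06, LogicalCPUs) per CE, availability)
ngi_sites = {
    "SITE-A": ([(12.0, 100), (13.0, 200)], 0.99),  # two CEs, different benchmarks
    "SITE-B": ([(10.0, 50)], 0.90),
}

# NGI average A/R weighted by each site's computation power
total_power = sum(computation_power(ces) for ces, _ in ngi_sites.values())
weighted_ar = sum(
    computation_power(ces) * avail for ces, avail in ngi_sites.values()
) / total_power
print(round(weighted_ar, 4))  # 0.9795
```

Note how SITE-A's two CEs contribute through their combined capacity (12.0*100 + 13.0*200 = 3800) rather than through the sum of their benchmark values.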


*'''Example of ldap query for checking if a site is publishing the HEP-SPEC06 benchmark''':
$ ldapsearch -x -LLL -H ldap://egee-bdii.cnaf.infn.it:2170 -b "GLUE2DomainID=pic,GLUE2GroupID=grid,o=glue" '(&(objectClass=GLUE2Benchmark)(GLUE2BenchmarkType=hep-spec06))'
dn: GLUE2BenchmarkID=ce07.pic.es_hep-spec06,GLUE2ResourceID=ce07.pic.es,GLUE2ServiceID=ce07.pic.es_ComputingElement,GLUE2GroupID=resource,GLUE2DomainID=pic,GLUE2GroupID=grid,o=glue
GLUE2BenchmarkExecutionEnvironmentForeignKey: ce07.pic.es
GLUE2BenchmarkID: ce07.pic.es_hep-spec06
GLUE2BenchmarkType: hep-spec06
objectClass: GLUE2Entity
objectClass: GLUE2Benchmark
GLUE2BenchmarkValue: 12.1205
GLUE2EntityOtherInfo: InfoProviderName=glite-ce-glue2-benchmark-static
GLUE2EntityOtherInfo: InfoProviderVersion=1.1
GLUE2EntityOtherInfo: InfoProviderHost=ce07.pic.es
GLUE2BenchmarkComputingManagerForeignKey: ce07.pic.es_ComputingElement_Manager
GLUE2EntityName: Benchmark hep-spec06
GLUE2EntityCreationTime: 2017-06-20T16:50:48Z
dn: GLUE2BenchmarkID=ce01.pic.es_hep-spec06,GLUE2ResourceID=ce01.pic.es,GLUE2ServiceID=ce01.pic.es_ComputingElement,GLUE2GroupID=resource,GLUE2DomainID=pic,GLUE2GroupID=grid,o=glue
GLUE2BenchmarkExecutionEnvironmentForeignKey: ce01.pic.es
GLUE2BenchmarkID: ce01.pic.es_hep-spec06
GLUE2BenchmarkType: hep-spec06
objectClass: GLUE2Entity
objectClass: GLUE2Benchmark
GLUE2BenchmarkValue: 13.4856
GLUE2EntityOtherInfo: InfoProviderName=glite-ce-glue2-benchmark-static
GLUE2EntityOtherInfo: InfoProviderVersion=1.1
GLUE2EntityOtherInfo: InfoProviderHost=ce01.pic.es
GLUE2BenchmarkComputingManagerForeignKey: ce01.pic.es_ComputingElement_Manager
GLUE2EntityName: Benchmark hep-spec06
GLUE2EntityCreationTime: 2017-09-05T07:34:26Z
 
*'''Example of ldap query for getting the number of LogicalCPUs published by an ARC-CE (due to a bug in the info-provider, CREAM-CEs publish the total number under the ExecutionEnvironment class)''':
$ ldapsearch -x -LLL -H ldap://egee-bdii.cnaf.infn.it:2170 -b "GLUE2DomainID=UA_ILTPE_ARC,GLUE2GroupID=grid,o=glue" 'objectClass=GLUE2ComputingManager' GLUE2ComputingManagerTotalLogicalCPUs
dn: GLUE2ManagerID=urn:ogf:ComputingManager:ds4.ilt.kharkov.ua:pbs,GLUE2ServiceID=urn:ogf:ComputingService:ds4.ilt.kharkov.ua:arex,GLUE2GroupID=services,GLUE2DomainID=UA_ILTPE_ARC,GLUE2GroupID=grid,o=glue
GLUE2ComputingManagerTotalLogicalCPUs: 168
 
*'''Example of ldap query for getting the number of LogicalCPUs published by a CREAM-CE''':
$ ldapsearch -x -LLL -H ldap://egee-bdii.cnaf.infn.it:2170 -b "GLUE2DomainID=UKI-SOUTHGRID-SUSX,GLUE2GroupID=grid,o=glue" 'objectClass=GLUE2ExecutionEnvironment' GLUE2ExecutionEnvironmentLogicalCPUs GLUE2ExecutionEnvironmentPhysicalCPUs GLUE2ExecutionEnvironmentTotalInstances
dn: GLUE2ResourceID=grid-cream-02.hpc.susx.ac.uk,GLUE2ServiceID=grid-cream-02.hpc.susx.ac.uk_ComputingElement,GLUE2GroupID=resource,GLUE2DomainID=UKI-SOUTHGRID-SUSX,GLUE2GroupID=grid,o=glue
GLUE2ExecutionEnvironmentTotalInstances: 71
GLUE2ExecutionEnvironmentLogicalCPUs: 568
GLUE2ExecutionEnvironmentPhysicalCPUs: 71
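LDIF output like the above can be reduced to the numbers needed for the weights with a few lines of script. This is a hedged sketch, not an official tool: it assumes the short, unfolded attribute lines shown in these examples (real LDIF may fold long lines), and the sample data is abbreviated from the query output above.

```python
# Minimal LDIF attribute extraction: collect the values of one attribute
# from ldapsearch output and sum the logical-CPU counts.
# Assumes unfolded attribute lines, as in the examples above.

def attr_values(ldif, attr):
    """Collect the values of one attribute from unfolded LDIF text."""
    prefix = attr + ": "
    return [line[len(prefix):].strip()
            for line in ldif.splitlines() if line.startswith(prefix)]

sample = """\
dn: GLUE2ResourceID=grid-cream-02.hpc.susx.ac.uk,o=glue
GLUE2ExecutionEnvironmentTotalInstances: 71
GLUE2ExecutionEnvironmentLogicalCPUs: 568
GLUE2ExecutionEnvironmentPhysicalCPUs: 71
"""

total = sum(int(v) for v in attr_values(sample, "GLUE2ExecutionEnvironmentLogicalCPUs"))
print(total)  # 568
```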
 
* Manual for [[HEP SPEC06|Hepspec06 benchmark]].
 
'''In November the new method will be moved to production, so if many sites fix their published information during October, the new NGI A/R average values will improve'''.


== Decommissioning EMI WMS  ==

As discussed at the February and April/May OMBs, we are making plans for decommissioning the WMS and moving to DIRAC.

NGIs provided WMS usage statistics; in general the usage is relatively low, mainly for local testing.

Moderate usage by a few VOs:
*NGI_CZ: eli-beams.eu
*NGI_GRNET: see
*NGI_IT: calet.org, compchem, theophys, virgo
*NGI_PL: gaussian, vo.plgrid.pl, vo.nedm.cyfronet
*NGI_UK: mice, t2k.org

EGI contacted these VOs to agree on a smooth migration of their activities to DIRAC; only some of them have replied so far:
*compchem is already testing DIRAC
*calet.org: discussing the migration to DIRAC with the users; interested in a webinar on DIRAC
*mice: enabled on the GridPP DIRAC server

We need the VOs' feedback to better define the technical details and the timeline:
*NGIs with VOs using WMS (not necessarily limited to the VOs above), please contact them to ensure that these VOs have a back-up plan.

WMS servers can be decommissioned as soon as the supported VOs do not need them any more. The proposal is:
*'''WMS will be removed from production starting from 1st January 2018'''.
**VOs have '''3 months''' to find alternatives or migrate to DIRAC
*Considering that this is not an update, the decommissioning can be performed in a few weeks.

'''2017-08-21 UPDATE''': eli-beams.eu is interested in testing DIRAC; the process for enabling the VO on the DIRAC4EGI server has started.


== IPv6 readiness plans ==

*Resource Centres: assess the IPv6 readiness of the site infrastructure (real machines, cloud managers)
**NGIs/ROCs: please start discussing with sites and provide suggestions for the overall plan


== Decommissioning of dCache 2.10 and 2.13 ==

*support for '''dCache 2.10''' ended in December 2016; tickets opened by EGI Operations to track the decommissioning
** still left: CA-ALBERTA-WESTGRID-T2, CA-TRIUMF-T2K
*'''decommissioning campaign started by EGI Operations''': http://go.egi.eu/decommdcache213
** still left: CA-VICTORIA-WESTGRID-T2, UNI-FREIBURG, CA-SCINET-T2, INFN-ROMA1-CMS


== webdav probes in production  ==

The webdav probes have been deployed in production. Some sites have already enabled the monitoring of their webdav endpoints: CYFRONET-LCG2, egee.irb.hr, GRIF, IGI-BOLOGNA, NCG-INGRID-PT, SARA-MATRIX, UKI-NORTHGRID-LIV-HEP, UNI-BONN.


*webdav endpoints registered in GOC-DB: https://goc.egi.eu/gocdbpi/public/?method=get_service&&service_type=webdav
*link to nagios results: https://argo-mon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_webdav&style=detail


Several sites are publishing the webdav endpoints in the BDII:
*'''AsiaPacific''': JP-KEK-CRC-02
*'''NGI_AEGIS''': AEGIS01-IPB-SCL
*'''NGI_CH''': UNIGE-DPNC, UNIBE-LHEP
*'''NGI_DE''': UNI-SIEGEN-HEP
*'''NGI_GRNET''': GR-01-AUTH, HG-03-AUTH
*'''NGI_HR''': egee.irb.hr, egee.srce.hr
*'''NGI_IBERGRID''': CETA-GRID, IFIC-LCG2, NCG-INGRID-PT
*'''NGI_FRANCE''': GRIF-IPNO, GRIF-LAL, GRIF-LPNHE
*'''NGI_IL''': IL-TAU-HEP, TECHNION-HEP, WEIZMANN-LCG2
*'''NGI_IT''': IGI-BOLOGNA, INFN-GENOVA, INFN-MILANO-ATLASC, INFN-ROMA3, INFN-T1
*'''NGI_PL''': CYFRONET-LCG2, WUT
*'''NGI_UK''': UKI-NORTHGRID-LIV-HEP, UKI-NORTHGRID-MAN-HEP
*'''ROC_CANADA''': CA-MCGILL-CLUMEQ-T2

Checked with:
 $ ldapsearch -x -LLL -H ldap://egee-bdii.cnaf.infn.it:2170 -b "GLUE2GroupID=grid,o=glue" '(&(objectClass=GLUE2Endpoint)(GLUE2EndpointInterfaceName=webdav))' GLUE2EndpointImplementationName GLUE2EndpointURL


'''ACTIONS for NGIs and sites''':
EGI Operations is going to open GGUS tickets asking the sites to enable the monitoring of their webdav endpoints (after verifying that the protocol is really provided).
*The webdav service endpoint should be registered in GOC-DB to be properly monitored: '''the nagios probes are executed using the ops VO''', so please ensure that the protocol is enabled for the ops VO as well.
*The webdav probes are harmless: they are not in any critical profile, they do not raise any alarm in the operations dashboard, and the A/R figures are not affected. We need time and more sites to gather statistics on their results before making them critical.

To register the webdav service endpoint on GOC-DB, follow [[HOWTO21]] to fill in the proper information. '''In particular''':
*register a new service endpoint, separate from the SRM one;
* on GOC-DB fill in the webdav URL including the ops VO folder, for example: https://darkstorm.cnaf.infn.it:8443/webdav/ops or https://hepgrid11.ph.liv.ac.uk/dpm/ph.liv.ac.uk/home/ops/
** it corresponds to the value of the GLUE2 attribute GLUE2EndpointURL (containing the used port and without the VO folder);
* verify that the webdav URL (for example: https://darkstorm.cnaf.infn.it:8443/webdav ) is properly accessible.
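The relationship between the two URLs above (the GOC-DB entry carries the ops VO folder, while GLUE2EndpointURL is the same endpoint without it) can be sketched as follows. The helper function is hypothetical, written only to illustrate the convention; the hostnames are the examples from the bullets.

```python
# Hypothetical helper illustrating the URL convention described above:
# the GOC-DB webdav URL includes the ops VO folder, while the GLUE2
# attribute GLUE2EndpointURL is the same endpoint without the VO folder.
from urllib.parse import urlparse

def endpoint_without_vo_folder(gocdb_url, vo="ops"):
    """Strip the trailing VO folder from a GOC-DB webdav URL."""
    u = urlparse(gocdb_url.rstrip("/"))
    path = u.path
    suffix = "/" + vo
    if path.endswith(suffix):
        path = path[: -len(suffix)]
    return f"{u.scheme}://{u.netloc}{path}"

print(endpoint_without_vo_folder("https://darkstorm.cnaf.infn.it:8443/webdav/ops"))
# https://darkstorm.cnaf.infn.it:8443/webdav
```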
 
== Storage accounting deployment  ==

During the [https://indico.egi.eu/indico/event/3241/ September meeting], the OMB approved the full-scale deployment of storage accounting. The APEL team [[Storage accounting testing|has tested it with a group of early-adopter sites]], and the results show that storage accounting is now production-ready.

Storage accounting is currently supported '''only for the DPM and dCache storage elements''', therefore only the resource centres deploying these kinds of storage elements are requested to publish storage accounting data.

In order to properly install and configure the storage accounting scripts, please follow the instructions reported in the wiki: https://wiki.egi.eu/wiki/APEL/Storage

After setting up a daily cron job and running the accounting software, look for your data in the Accounting Portal: http://accounting-devel.egi.eu/storage.php. If it does not appear within 24 hours, or there are other errors, please open a GGUS ticket to APEL, who will help debug the process.

Please enable storage accounting on your resources '''by Oct 30th''': after this date, EGI Operations will open a GGUS ticket to all RCs that have not started the deployment yet.

The list of sites already publishing is '''[[Storage accounting deployment|here]]'''.

'''PLEASE NOTE''' (as in a broadcast circulated on Oct 4th):
*There is currently an issue affecting the display of storage accounting data in the Accounting Portal.
*Records are being received and loaded at the Accounting Repository and will appear in the Portal once the issue is resolved, so for the moment please only raise tickets if you have errors appearing in your logs.


= AOB  =
== Next meeting  ==

*'''Nov 13th, 2017''' https://indico.egi.eu/indico/event/3354/

Latest revision as of 14:24, 25 October 2017
