Difference between revisions of "Agenda-12-06-2017"

From EGIWiki
(Testing the new webdav probes)
 
(23 intermediate revisions by 3 users not shown)
Line 1: Line 1:
{{TOC right}}  
+
{{Template:Op menubar}} {{Template:Doc_menubar}} {{TOC_right}}
 +
[[Category:Grid Operations Meetings]]
  
 
= General information  =
 
= General information  =
Line 8: Line 9:
  
 
== CMD ==
 
== CMD ==
 +
 +
CMD-OS 1.1.0 RC ready http://repository.egi.eu/sw/production/cmd-os/candidate/1/
  
 
== UMD ==
 
== UMD ==
  
 
== Preview repository  ==
 
== Preview repository  ==
 +
 +
Released on 2017-06-02:
 +
* '''[[Preview 1.12.0]]''' [https://appdb.egi.eu/store/software/preview.repository/releases/1.0/1.12.0/ AppDB info] (sl6): ARC 15.03 u14, davix 0.6.6, DMLite 0.8.6, dpm-dsi 1.9.13, FTS 3.6.8, XRootD 4.6.1
 +
* '''[[Preview 2.12.0]]''' [https://appdb.egi.eu/store/software/preview.repository/releases/2.0/2.12.0/ AppDB info] (CentOS 7): ARC 15.03 u14, davix 0.6.6, DMLite 0.8.6, dpm-dsi 1.9.13, FTS 3.6.8, XRootD 4.6.1, WN 4.0.5
  
 
= Operations  =
 
= Operations  =
 +
 +
== ARGO/SAM ==
 +
 +
* ARC-CE probes are updated in order to mitigate the issue with missing jobs (https://ggus.eu/index.php?mode=ticket_info&ticket_id=126724)
 +
* FTS default port changed to 8446 and it is extracted from GOCDB service URL (https://ggus.eu/index.php?mode=ticket_info&ticket_id=128154)
 +
* New probes:
 +
** AAI CheckIn: HTTP checks of all URLs in GOCDB
 +
** NGI Argus: https://sccsec-egi-git.scc.kit.edu/EGI-CSIRT/nagios-plugins-egi.argus-ngi
 +
** WebDAV: https://gitlab.cern.ch/lcgdm/nagios-plugins-webdav
 +
** Internal ARGO probes: API queries, Nagios & ARC-CE monitor test freshness, Consumer & connectors, AMS
 +
* ARGO MON switched from UMD-3 to UMD-4
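
As a rough illustration of the FTS item above (port taken from the GOCDB service URL, with 8446 as the default), the following Python sketch shows one way such a lookup could work. It is not the ARGO probe code: the names fts_port and DEFAULT_FTS_PORT and the host fts.example.org are hypothetical, and falling back to 8446 when the URL carries no port is an assumption about the intended behaviour.

```python
from urllib.parse import urlsplit

# Default FTS port quoted in the agenda item above.
DEFAULT_FTS_PORT = 8446


def fts_port(service_url, default=DEFAULT_FTS_PORT):
    """Return the port found in a service URL, or the default if none is given.

    Hypothetical helper: the real probe may read the URL differently.
    """
    if not service_url:
        # No service URL registered: use the default port.
        return default
    port = urlsplit(service_url).port
    return port if port is not None else default


if __name__ == "__main__":
    # Hypothetical endpoints, used only to exercise the helper.
    print(fts_port("https://fts.example.org:8449/"))
    print(fts_port("https://fts.example.org/"))
```

A URL with an explicit port yields that port; a URL without one, or a missing URL, yields the default.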
  
 
== Testing FedCloud sites  ==
 
== Testing FedCloud sites  ==
Line 21: Line 39:
 
== yearly review of the information registered into GOC-DB  ==
 
== yearly review of the information registered into GOC-DB  ==
  
== Failures with the updated CREAM probes  ==
+
'''2017-04-07'''
  
After the release of the updated CREAM probes on May 4th, several sites are failing the JobCancel and/or JobPurge ones ([https://ggus.eu/index.php?mode=ticket_info&ticket_id=128151 GGUS 128151]):  
+
On a yearly basis, the information registered in GOC-DB needs to be verified. NGIs and RCs have been asked to check it. In particular:
  
*the error message is: "'''Received timeout while fetching results'''".
+
#'''NGI managers should review the people registered and the roles assigned to them, and in particular check the following information:'''  
 +
#*E-Mail
 +
#*ROD E-Mail
 +
#*Security E-Mail
  
The main reason is that in those CEs there isn't a job slot reserved for the ops tests.  
+
:NGI managers should also review the status of the "not certified" RCs, according to the [https://wiki.egi.eu/wiki/PROC09#Resource_Center_status_Workflow RC Status Workflow];
  
As explained in the [https://wiki.italiangrid.it/twiki/bin/view/CREAM/DjsCreamProbeNew CREAM probes wiki]:  
+
#'''RC administrators should review the people registered and the roles assigned to them, and in particular check the following information:'''
 +
#*E-Mail
 +
#*telephone numbers
 +
#*CSIRT E-Mail
  
*JobCancel: cancel an active job
+
:RC administrators should also review the information related to the registered service endpoints.
**This metric submits a job directly to the selected CREAM CE, waits until the job gains the IDLE, RUNNING or REALLY-RUNNING state and then tries to cancel it. Finally, it checks whether the job has been correctly cancelled.

*JobPurge: purge a terminated job

**This metric is analogous to cream_jobCancel.py. It submits a short job (e.g. hostname.jdl), waits for its termination (e.g. DONE-OK) and then tries to purge it. Finally, to verify that the purge succeeded, the probe checks the job status by executing the glite-ce-job-status command, which in this scenario must fail because the job no longer exists.
 
  
Both have a timeout of 15 minutes, so if the test job is not executed within that time, the probes return a failure. '''Please assign the ops jobs a higher priority and reserve 1 job slot for them; they only require a few seconds to execute'''.
+
'''The process should be completed by Apr 28th.'''  
  
These failures didn't occur before May 4th because in the first version of the probes the returned status was "''UNKNOWN''" instead of the more appropriate "''CRITICAL''".
+
To track the process, a [https://wiki.egi.eu/wiki/Verify_Configuration_Records series of tickets] has been opened.
  
List of [https://argo-mon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_CREAM-CE&style=detail&servicestatustypes=16&hoststatustypes=15&serviceprops=0&hostprops=0 failing CREAM-CEs] from nagios (not all of them are affected by this problem):  
+
'''2017-06-12 UPDATE''':
 
+
*no feedback yet from: AfricaArabia, NGI_DE, NGI_FI, NGI_IL, NGI_NL, NGI_UA;
* 45 CREAM-CEs affected (13% of the total ones)
+
*still reviewing: NGI_IBERGRID, NGI_IT, ROC_LA.
* sites can request a recomputation of the May statistics
 
  
 
== Monthly Availability/Reliability  ==
 
== Monthly Availability/Reliability  ==
  
== Proposal to modify the declaration of scheduled interventions ==
+
*Underperformed sites in the past A/R reports with issues not yet fixed:
 +
**AfricaArabia: https://ggus.eu/index.php?mode=ticket_info&ticket_id=127502 ZA-UCT-ICTS no improvement, no feedback, will be suspended after the meeting
 +
** '''AsiaPacific'''
 +
*** TW-NCUHEP: site-bdii unstable for network issues with ARGO https://ggus.eu/index.php?mode=ticket_info&ticket_id=128083
 +
***KR-UOS-SSCC: there were srm problems, now also CREAM failures https://ggus.eu/index.php?mode=ticket_info&ticket_id=127024
 +
** '''NGI_DE''' [https://ggus.eu/index.php?mode=ticket_info&ticket_id=125430 GGUS 125430]
 +
***LRZ https://ggus.eu/index.php?mode=ticket_info&ticket_id=128087 site-bdii unreachable, GRAM5 failures; improving
 +
***UNI-SIEGEN-HEP: the fix for the CREAM probes solved the issues; waiting for the end of the month to close the ticket
 +
***wuppertalprod: https://ggus.eu/index.php?mode=ticket_info&ticket_id=127026 the patch to the ARC-CE probes has been applied, the situation is improving
 +
**NGI_FI: https://ggus.eu/index.php?mode=ticket_info&ticket_id=127505 ARC-CE nagios probes bug
 +
**NGI_UA: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=125839 GGUS 125839]
 +
***UA-NSCMBR: bug in the ARC-CE probes
 +
**ROC_Canada: https://ggus.eu/index.php?mode=ticket_info&ticket_id=128097
 +
***CA-MCGILL-CLUMEQ-T2: new problems regarding ssl on CREAM, they were solved, situation is improving
 +
*Underperformed sites after 3 consecutive months and underperformed NGIs:
 +
**AsiaPacific (MY-USM-GCL): https://ggus.eu/index.php?mode=ticket_info&ticket_id=128880
 +
**NGI_CHINA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=128881 QoS violation (SOLVED)
 +
**NGI_FI (CSC) https://ggus.eu/index.php?mode=ticket_info&ticket_id=128883 (SOLVED)
 +
**NGI_FRANCE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=128884 QoS violation
 +
**NGI_IBERGRID (UNICAN) https://ggus.eu/index.php?mode=ticket_info&ticket_id=128885 the site has just been decommissioned
 +
**NGI_IL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=128886 QoS violation
 +
**NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=128887 QoS violation (SOLVED)
 +
**NGI_PL (IFJ-PAN-BG) https://ggus.eu/index.php?mode=ticket_info&ticket_id=128889
 +
**NGI_RO (RO-11-NIPNE, RO-14-ITIM) https://ggus.eu/index.php?mode=ticket_info&ticket_id=128890
 +
**NGI_UA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=128891 QoS violation (SOLVED)
 +
**ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=128892 QoS violation (SOLVED)
  
 
== Decommissioning EMI WMS  ==
 
== Decommissioning EMI WMS  ==
Line 84: Line 129:
 
***'''NGIs/ROCs please start discussing with sites and provide suggestions for the overall plan'''
 
***'''NGIs/ROCs please start discussing with sites and provide suggestions for the overall plan'''
  
== Decommissioning of dCache 2.10 and 2.13 (to modify) ==
+
== Decommissioning of dCache 2.10 and 2.13 ==
  
 
* support for '''dCache 2.10''' ended in December 2016; tickets have been opened by EGI Operations to track decommissioning
 
* support for '''dCache 2.10''' ended in December 2016; tickets have been opened by EGI Operations to track decommissioning
Line 142: Line 187:
 
** follow the [[HOWTO21]] for filling in the information on GOC-DB
 
** follow the [[HOWTO21]] for filling in the information on GOC-DB
 
* verify that the webdav url (for example:  https://darkstorm.cnaf.infn.it:8443/webdav ) is properly accessible
 
* verify that the webdav url (for example:  https://darkstorm.cnaf.infn.it:8443/webdav ) is properly accessible
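
The accessibility check in the last item could be approximated with a short Python sketch like the one below. It is only a minimal reachability test and not the nagios-plugins-webdav probe: the function name webdav_reachable, the OPTIONS method, the 10-second timeout and the injectable opener parameter are all assumptions made for illustration. Endpoints that require an X.509 client credential will typically answer an anonymous request with an error, which this sketch reports as not reachable.

```python
import urllib.request


def webdav_reachable(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if an HTTP OPTIONS request to `url` gets a non-error reply.

    Hypothetical helper: `opener` exists only so the function can be
    exercised without network access.
    """
    request = urllib.request.Request(url, method="OPTIONS")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError and HTTPError are subclasses of OSError, so connection
        # failures, timeouts and HTTP error statuses all end up here.
        return False


if __name__ == "__main__":
    # Example URL quoted in the agenda; the result depends on the live endpoint.
    print(webdav_reachable("https://darkstorm.cnaf.infn.it:8443/webdav"))
```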
 +
 +
== Testing of the storage accounting  ==
 +
 +
As discussed during the [https://indico.egi.eu/indico/event/3233/ January OMB], the APEL team would need one site per NGI for testing the storage accounting. The eligible sites are the ones providing either dCache or DPM storage elements.
 +
 +
More information can be found in the following wiki: https://wiki.egi.eu/wiki/APEL/Storage
 +
 +
[[Storage accounting testing|List of sites]] available for test.
 +
 +
'''2017-06-12 UPDATE''':
 +
 +
*26 sites are sending storage accounting data (only from dCache and DPM SEs). The data has to be verified before deploying the script in production.
 +
*After the discussion at the March [https://indico.egi.eu/indico/event/3235/ OMB], we are evaluating the creation of a new service type on GOC-DB that will be used for:
 +
** authorising the site/SE to publish the accounting data
 +
** making the site/SE appear in the portal
 +
** monitoring that the accounting data are regularly published
 +
 +
Currently the accounting service types are:
 +
 +
#glite-APEL: for [https://wiki.egi.eu/wiki/APEL/UsingAuth authorizing] the sending of the messages
 +
#APEL: to [https://wiki.egi.eu/wiki/APEL/Tests monitor] the accounting data publication
 +
 +
The proposed name is "APEL-SE"
  
 
= AOB  =
 
= AOB  =

Latest revision as of 14:26, 25 October 2017



General information

Middleware

CMD

CMD-OS 1.1.0 RC ready http://repository.egi.eu/sw/production/cmd-os/candidate/1/

UMD

Preview repository

Released on 2017-06-02:

  • Preview 1.12.0 AppDB info (sl6): ARC 15.03 u14, davix 0.6.6, DMLite 0.8.6, dpm-dsi 1.9.13, FTS 3.6.8, XRootD 4.6.1
  • Preview 2.12.0 AppDB info (CentOS 7): ARC 15.03 u14, davix 0.6.6, DMLite 0.8.6, dpm-dsi 1.9.13, FTS 3.6.8, XRootD 4.6.1, WN 4.0.5

Operations

ARGO/SAM

Testing FedCloud sites

Feedback from Helpdesk

yearly review of the information registered into GOC-DB

2017-04-07

On a yearly basis, the information registered in GOC-DB needs to be verified. NGIs and RCs have been asked to check it. In particular:

  1. NGI managers should review the people registered and the roles assigned to them, and in particular check the following information:
    • E-Mail
    • ROD E-Mail
    • Security E-Mail
    NGI managers should also review the status of the "not certified" RCs, according to the RC Status Workflow;
  2. RC administrators should review the people registered and the roles assigned to them, and in particular check the following information:
    • E-Mail
    • telephone numbers
    • CSIRT E-Mail
    RC administrators should also review the information related to the registered service endpoints.

The process should be completed by Apr 28th.

To track the process, a series of tickets has been opened.

2017-06-12 UPDATE:

  • no feedback yet from: AfricaArabia, NGI_DE, NGI_FI, NGI_IL, NGI_NL, NGI_UA;
  • still reviewing: NGI_IBERGRID, NGI_IT, ROC_LA.

Monthly Availability/Reliability

Decommissioning EMI WMS

As discussed at the February and April/May OMBs, we are making plans for decommissioning the WMS and moving to DIRAC.

NGIs provided WMS usage statistics; in general the usage is relatively low and mainly for local testing.

Moderate usage by a few VOs:

  • NGI_CZ: eli-beams.eu
  • NGI_GRNET: see
  • NGI_IT: calet.org, compchem, theophys, virgo
  • NGI_PL: gaussian, vo.plgrid.pl, vo.nedm.cyfronet
  • NGI_UK: mice, t2k.org

EGI contacted these VOs to agree on a smooth migration of their activities to DIRAC; only some of them have replied so far:

  • compchem is already testing DIRAC
  • calet.org: discussing with the users the migration to DIRAC. Interested in a webinar on DIRAC.
  • mice: enabled on the GridPP DIRAC server

We need feedback from the VOs to better define the technical details and timeline:

  • NGIs with VOs using WMS (not necessarily limited to the VOs above), please contact them to ensure that these VOs have a back-up plan.

WMS servers can be decommissioned as soon as the supported VOs do not need them any more. The proposal is:

  • WMS will be removed from production starting from 1st January 2018.
    • VOs have 8 months to find alternatives or migrate to DIRAC
  • Considering that this is not an update, the decommissioning can be performed in a few weeks.

IPv6 readiness plans

    • Resource Centres: assess the IPv6 readiness of the site infrastructure (real machines, cloud managers)
      • NGIs/ROCs please start discussing with sites and provide suggestions for the overall plan

Decommissioning of dCache 2.10 and 2.13

  • support for dCache 2.10 ended in December 2016; tickets have been opened by EGI Operations to track decommissioning
  • the dCache 2.13 decommissioning procedure has started: in June the probes will return CRITICAL, support from dCache ends in July, and upgrades are to be performed by August
  • please upgrade to 2.16, whose support ends in May 2018, or to 3.0
    • note that the dCache team does not support upgrading from 2.10 directly to 2.16; only the 2.10->2.13 and 2.13->2.16 transitions are supported.
  • a decommissioning campaign will be started by EGI Operations to monitor the upgrade of the dCache 2.13 instances and follow up with the NGIs/sites at the beginning of August
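
The supported upgrade chain above can be sketched in Python as follows. This only illustrates the two transitions listed for the dCache team (2.10->2.13 and 2.13->2.16); the names SUPPORTED_STEPS and upgrade_path are hypothetical, and the sketch does not cover upgrades to 3.0, for which no path is given here.

```python
# Supported single-step transitions, as listed in the agenda item above.
SUPPORTED_STEPS = {"2.10": "2.13", "2.13": "2.16"}


def upgrade_path(current, target="2.16"):
    """Return the list of versions to pass through, starting at `current`.

    Raises ValueError when no chain of supported steps reaches `target`.
    """
    path = [current]
    while path[-1] != target:
        next_version = SUPPORTED_STEPS.get(path[-1])
        if next_version is None:
            raise ValueError(
                f"no supported upgrade step from {path[-1]} towards {target}"
            )
        path.append(next_version)
    return path


if __name__ == "__main__":
    # A 2.10 instance has to go through 2.13 before reaching 2.16.
    print(" -> ".join(upgrade_path("2.10")))
```

A site on 2.13 needs a single step, while a site on 2.10 needs two.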

Testing the new webdav probes

Site Host GGUSID note
CYFRONET-LCG2 se01.grid.cyfronet.pl https://ggus.eu/index.php?mode=ticket_info&ticket_id=128325 SOLVED
GRIF node12.datagrid.cea.fr https://ggus.eu/index.php?mode=ticket_info&ticket_id=128329
IGI-BOLOGNA darkstorm.cnaf.infn.it https://ggus.eu/index.php?mode=ticket_info&ticket_id=127930 SOLVED
INFN-T1 removed https://ggus.eu/index.php?mode=ticket_info&ticket_id=128326 SOLVED
NCG-INGRID-PT gftp01.ncg.ingrid.pt https://ggus.eu/index.php?mode=ticket_info&ticket_id=128327 SOLVED
UKI-NORTHGRID-LIV-HEP hepgrid11.ph.liv.ac.uk https://ggus.eu/index.php?mode=ticket_info&ticket_id=128328 SOLVED
egee.irb.hr lorienmaster.irb.hr

Missing steps:

Testing of the storage accounting

As discussed during the January OMB, the APEL team would need one site per NGI for testing the storage accounting. The eligible sites are the ones providing either dCache or DPM storage elements.

More information can be found in the following wiki: https://wiki.egi.eu/wiki/APEL/Storage

List of sites available for test.

2017-06-12 UPDATE:

  • 26 sites are sending storage accounting data (only from dCache and DPM SEs). The data has to be verified before deploying the script in production.
  • After the discussion at the March OMB, we are evaluating the creation of a new service type on GOC-DB that will be used for:
    • authorising the site/SE to publish the accounting data
    • making the site/SE appear in the portal
    • monitoring that the accounting data are regularly published

Currently the accounting service types are:

  1. glite-APEL: for authorizing the sending of the messages
  2. APEL: to monitor the accounting data publication

The proposed name is "APEL-SE"

AOB

Next meeting