The wiki is deprecated and due to be decommissioned by the end of September 2022.
The content is being migrated to other platforms; new updates will be ignored and lost.
If needed, you can get in touch with the EGI SDIS team at operations @ egi.eu.

Difference between revisions of "Agenda-14-03-2016"

(36 intermediate revisions by 3 users not shown)


* A critical bug that causes file loss has been discovered in the new drain command of the DPM dmlite-shell, released in DPM 1.8.10. One site in production has been affected. https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm/Dev/Dmlite/Shell#Newfunctionality:Drain
** broadcast sent on March 10th
** if you have run the new drain commands at your site, contact the DPM Development team through GGUS (a data consistency check is needed)
** '''DO NOT use the new drain commands''' (documented at https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm/Dev/Dmlite/Shell#Newfunctionality:Drain); until the fixed components are released, please continue to use the old dpm-drain command
** UMD 3.14.1 release notes updated
* Preparation of the UMD-4 SL6/CentOS7 ongoing
* Preparation of the UMD for Cloud


== Staged rollout updates  ==
* frontier-squid 2.7.24.2 (centos7)
* voms-admin 3.4.1 (sl6)
* storm 1.8.10 (sl6)


== Next releases  ==
= Preview repository =
The first update of Preview was released on March 9th:
* STORM 1.11.10
* VOMS Admin server 3.4.2
* VOMS Server 2.0.13
* VOMS API Java 3.0.6, 3.1.0, 3.2.0
See details at https://wiki.egi.eu/wiki/Preview_1.1.0
Generic information about the Preview repository: https://wiki.egi.eu/wiki/Preview_Repository
'''Note:''' ''EGI provides the preview repository without any additional quality assurance process: the products are released as they are provided by the product teams. EGI recommends the use of the UMD repositories, which contain software verified through the UMD quality assurance process.''


= Operational issues  =
== Globus GSI clients moving to STRICT_RFC2818 by default ==
* the release of the update that will change the default name compatibility mode from "HYBRID" to "STRICT_RFC2818" is '''planned for April 1, 2016'''.
* an EGI broadcast sent in August already warned about the change and advised "'''site managers to make sure that all the hostnames and aliases used to connect to a service are included in its host certificate Subject Alternative Name field''', at the latest by the end of the year"
* sites that could be affected by this future change are the ones running services whose clients may use globus-gssapi-gsi for authentication (CE, FTS, SRM, GridFTP, MyProxy, WMS) and using DNS aliases which are not included within the SAN (Subject Alternative Name) field of the certificate (including the host name itself)


== Aligning Fedcloud sites to the A/R procedures ==
** EGI Operations will check the status of the production cloud services in order to understand which issues (if any) the site has and provide help to NGIs and sites;
** Start of the integration of cloud probes in the EGI CRITICAL profile (current set + openstack): to be agreed with the ARGO team; [https://wiki.egi.eu/wiki/PROC08 PROC08] will be followed
*** https://ggus.eu/index.php?mode=ticket_info&ticket_id=119628
* June 2016:
** '''Starting notification of sites eligible for suspension'''


== FedCloud status ==

=== Old issues ===

Grouped by NGI, please follow up with sites.
* NGI_UK
** 100IT (OpenStack)
*** vmcatcher issues https://ggus.eu/index.php?mode=ticket_info&ticket_id=116358#update#19 '''IN PROGRESS'''
*** BDII and GOCDB have different Endpoint URLs https://ggus.eu/index.php?mode=ticket_info&ticket_id=119002#update#5 '''FIXED'''

* NGI_PL
** CYFRONET-CLOUD (OpenStack)
*** VMCatcher https://ggus.eu/index.php?mode=ticket_info&ticket_id=116363#update#29 '''IN PROGRESS'''

* NGI_DE
** GoeGrid (OpenNebula)
*** OCCI, VMCatcher https://ggus.eu/index.php?mode=ticket_info&ticket_id=119003 https://ggus.eu/index.php?mode=ticket_info&ticket_id=116365 '''IN PROGRESS'''

* NGI_GRNET
** HG-09-Okeanos-Cloud (Synnefo)
*** VMCatcher, issue with large metadata, on hold (it requires some development) https://ggus.eu/index.php?mode=ticket_info&ticket_id=116368 '''ON HOLD'''

* NGI_IBERGRID
** IFCA-LCG2 (OpenStack)
*** OCCI, endpoint published on sBDII is missing "/occi1.1/" https://ggus.eu/index.php?mode=ticket_info&ticket_id=119004

* NGI_TR
** TR-FC1-ULAKBIM (OpenStack)
*** Missing GLUE2DomainID and image description looks wrong https://ggus.eu/index.php?mode=ticket_info&ticket_id=119005#update#15 '''IN PROGRESS'''


=== New issues ===

* New tickets opened to track issues in publishing appliances on AppDB for fedcloud.egi.eu: https://ggus.eu/index.php?mode=ticket_info&ticket_id=120010
* Issue with OCCI and fedcloud.egi.eu VO at MK-04-FINKICLOUD (NGI_MARGI): https://ggus.eu/index.php?mode=ticket_info&ticket_id=120027

=== Actions ===

* EGI Operations have been asked by user support to contact '''sites with long-standing unresolved technical problems in the support of the fedcloud.egi.eu''' VO
** if issues cannot be fixed quickly, '''sites will be asked to remove the support for fedcloud.egi.eu'''
** they will re-enable the VO support as soon as they are able to fix the issues
** sites will be contacted directly by EGI Operations

=== Getting help ===

* the whole FedCloud wiki has been reviewed, removing redundancies and updating links and instructions
* from the operations point of view: https://wiki.egi.eu/wiki/Federated_Cloud_resource_providers_support
** see in particular the [https://wiki.egi.eu/wiki/MAN10 manual for the installation of a cloud site]
** TBD: review the support units associated with FedCloud (in progress)

=== Getting help on issues ===

* VMcatcher issues
** [https://appdb.egi.eu/browse/sites/cloud This page] shows a small number at the bottom right of each site, giving the number of images available at the site. If it is missing, it is very likely that the site has issues with vmcatcher.
** '''ACTION''': Please check this documentation: https://wiki.egi.eu/wiki/MAN10#EGI_Image_Management_2 and https://github.com/hepix-virtualisation/vmcatcher. '''If you cannot figure it out, please contact EGI Operations through the ticket; we will forward it to the vmcatcher developers.'''

=== Updating Federated_Cloud_Operation wiki ===

* Review your site's information on the [https://wiki.egi.eu/wiki/Federated_Cloud_Operation Federated_Cloud_Operation] wiki; sites, please reply asap!
** GoeGrid https://ggus.eu/?mode=ticket_info&ticket_id=118882
** MK-04-FINKICLOUD https://ggus.eu/?mode=ticket_info&ticket_id=118890
** CYFRONET-CLOUD https://ggus.eu/?mode=ticket_info&ticket_id=118878


== Decommissioning Debian ==  

* Support for Debian squeeze (6.0) has ended (Feb 2016) https://www.debian.org/News/2016/20160212
* only one service is published on BDII as running Debian and is in production in GOCDB, although GOCDB lists it as SL5.8; the site is UA-MHI (NGI_UA)
 
<pre>
 
dn: GlueSubClusterUniqueID=arc.hpc-mhi.org,GlueClusterUniqueID=arc.hpc-mhi.org
 ,Mds-Vo-name=UA-MHI,Mds-Vo-name=local,o=grid
GlueHostOperatingSystemName: Debian
GlueHostOperatingSystemRelease: 0
GlueHostOperatingSystemVersion: 0
 
</pre>


== Decommissioning SL5 ==
* Tracked on [https://wiki.egi.eu/wiki/SL5_retirement SL5_retirement wiki]
* Tests available https://midmon.egi.eu/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28
** eu.egi.sec.Argus-SL5
** eu.egi.sec.CREAM-SL5
** eu.egi.sec.LB-SL5
** eu.egi.sec.LFC-SL5
** eu.egi.sec.MyProxy-SL5
** eu.egi.sec.QCG.Computing-SL5
** eu.egi.sec.QCG.Notification-SL5
** eu.egi.sec.Site-BDII-SL5
** eu.egi.sec.Top-BDII-SL5
** eu.egi.sec.VOMS-SL5
** eu.egi.sec.WMS-SL5
** eu.egi.sec.StoRM-SL5
* No checks for dCache, DPM, ARC, UNICORE --> '''Action on NGIs/ROCs to follow up directly with sites'''
* Documentation https://wiki.egi.eu/wiki/MW_SAM_tests#SL5_tests


== Decommissioning dCache 2.6 ==

* DONE.

= AOB  =

== Monthly Availability/Reliability ==

A/R report on ARGO: http://argo.egi.eu/lavoisier/ngi_reports?accept=html

List of the underperforming RCs for (at least) 3 consecutive months:

* AfricaArabia: https://ggus.eu/?mode=ticket_info&ticket_id=117094
** EG-ZC-T3: unresponsive for months, must be suspended
** ZA-UJ
* AsiaPacific:
** MY-UM-SIFIR
* NGI_DE:
** LRZ-LMU
* NGI_GRNET:
** GR-04-FORTH-ICS
* NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=118846
** INFN-NAPOLI-PAMELA: in decommissioning
* NGI_MARGI: https://ggus.eu/index.php?mode=ticket_info&ticket_id=118465 (no monitoring data since January)
* ROC_LA:
** UFAL: new site but the monitoring data are missing

* Last three months report available on [http://argo.egi.eu/lavoisier/ngi_reports?month=2016-01 ARGO]
* Problems follow-up:
** AfricaArabia: [https://ggus.eu/?mode=ticket_info&ticket_id=117094 ticket]
*** Overall A/R: 12.67/12.67
*** RCs eligible for suspension: EG-ZC-T3, ZA-CHPC, ZA-UJ
** CERN: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=118843 ticket]
*** Overall A/R: 33.22/33.22
*** there were problems on the regional SAM instances, solved in January
** NGI_ARMGRID: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=119415 ticket]
*** Overall A/R: 77.43/77.43
** NGI_DE: [https://ggus.eu/?mode=ticket_info&ticket_id=117099 ticket]
*** the underperforming RCs (SCAI, UNI-DORTMUND) are recovering from the issues
** NGI_GRNET: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=119414 ticket]
*** RC eligible for suspension: GR-04-FORTH-ICS
** NGI_IT: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=118846 ticket]
*** the underperforming RC INFN-NAPOLI-PAMELA seems to be recovering, waiting for a confirmation
** NGI_MARGI: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=118465 ticket]
*** no monitoring data available since January
*** RC eligible for suspension: MK-03-FINKI
** NGI_MD:
*** Overall A/R: 61.89/61.89
*** the underperforming RC MD-02-IMI is recovering
** ROC_LA: [https://ggus.eu/index.php?mode=ticket_info&ticket_id=119416 ticket]
*** no monitoring data available for CBPF
*** RC eligible for suspension: UFAL


== Next meeting ==


* '''14 Mar 2016''' https://indico.egi.eu/indico/conferenceDisplay.py?confId=2736
* '''11 Apr 2016''' https://indico.egi.eu/indico/event/2738/

Latest revision as of 16:42, 14 March 2016


General information

News from URT

Staged rollout updates

  • frontier-squid 2.7.24.2 (centos7)
  • voms-admin 3.4.1 (sl6)
  • storm 1.8.10 (sl6)

Next releases

Preview repository

The first update of Preview was released on March 9th:

  • STORM 1.11.10
  • VOMS Admin server 3.4.2
  • VOMS Server 2.0.13
  • VOMS API Java 3.0.6, 3.1.0, 3.2.0

See details at https://wiki.egi.eu/wiki/Preview_1.1.0

Generic information about the Preview repository: https://wiki.egi.eu/wiki/Preview_Repository

Note: EGI provides the preview repository without any additional quality assurance process: the products are released as they are provided by the product teams. EGI recommends the use of the UMD repositories, which contain software verified through the UMD quality assurance process.

Operational issues

Globus GSI clients moving to STRICT_RFC2818 by default

  • the release of the update that will change the default name compatibility mode from "HYBRID" to "STRICT_RFC2818" is planned for April 1, 2016.
  • an EGI broadcast sent in August already warned about the change and advised "site managers to make sure that all the hostnames and aliases used to connect to a service are included in its host certificate Subject Alternative Name field, at the latest by the end of the year"
  • sites that could be affected by this future change are the ones running services whose clients may use globus-gssapi-gsi for authentication (CE, FTS, SRM, GridFTP, MyProxy, WMS) and using DNS aliases which are not included within the SAN (Subject Alternative Name) field of the certificate (including the host name itself)
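
As an illustration of the check site managers are asked to make, the following minimal Python sketch compares the names used to reach a service with the dNSName entries of its host certificate. It is a simplified RFC 2818-style match, not the Globus implementation; the function names and the example.org host names are hypothetical, and the SAN list is assumed to have been read from the certificate beforehand.

```python
def san_matches(hostname, san_dns_names):
    """Return True if hostname is covered by one of the SAN dNSName entries."""
    host = hostname.lower().rstrip(".")
    for pattern in san_dns_names:
        pat = pattern.lower().rstrip(".")
        if pat == host:
            return True
        if pat.startswith("*."):
            # Simplification: a wildcard covers exactly one leftmost label.
            suffix = pat[1:]  # e.g. ".example.org"
            label = host[: -len(suffix)] if host.endswith(suffix) else ""
            if label and "." not in label:
                return True
    return False


def names_missing_from_san(names_in_use, san_dns_names):
    """List the host names and aliases that the certificate does not cover."""
    return [name for name in names_in_use if not san_matches(name, san_dns_names)]


if __name__ == "__main__":
    # Hypothetical service: reached via its host name and one DNS alias,
    # but only the host name is present in the certificate SAN.
    san = ["se01.example.org"]
    in_use = ["se01.example.org", "srm.example.org"]
    print(names_missing_from_san(in_use, san))  # ['srm.example.org']
```

The SAN entries of a certificate can be inspected, for example, with `openssl x509 -in hostcert.pem -noout -text`, where hostcert.pem stands for the path to the host certificate. Any name reported as missing would have to be added to the certificate before the default changes.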

Aligning Fedcloud sites to the A/R procedures

  • EGI Operations proposes to align Fedcloud sites with the A/R-related procedures used for grid sites
    • based on the availability and reliability of the services monitored in cloudmon, EGI Operations will start following up with underperforming sites, as is already done for grid sites
    • sites will NOT be suspended for A/R performance at least until the end of May
  • in parallel, EGI Operations will start PROC08 to include cloud probes in the EGI_CRITICAL and EGI profiles used for A/R computations (IN PROGRESS)

The proposed timeline is:

  • February 2016:
    • EGI Operations will check the status of the production cloud services in order to understand which issues (if any) the site has and provide help to NGIs and sites;
    • Start of the integration of cloud probes in the EGI CRITICAL profile (current set + openstack): to be agreed with the ARGO team; PROC08 will be followed
  • June 2016:
    • Starting notification of sites eligible for suspension

FedCloud status

Old issues

Grouped by NGI, please follow up with sites.

New issues

Actions

  • EGI Operations have been asked by user support to contact sites with long-standing unresolved technical problems in the support of the fedcloud.egi.eu VO
    • if issues cannot be fixed quickly, sites will be asked to remove the support for fedcloud.egi.eu
    • they will re-enable the VO support as soon as they are able to fix the issues
    • sites will be contacted directly by EGI Operations

Getting help

Decommissioning Debian

  • Support for Debian squeeze (6.0) has ended (Feb 2016) https://www.debian.org/News/2016/20160212
  • only one service is published on BDII as running Debian and is in production in GOCDB, although GOCDB lists it as SL5.8; the site is UA-MHI (NGI_UA)

dn: GlueSubClusterUniqueID=arc.hpc-mhi.org,GlueClusterUniqueID=arc.hpc-mhi.org
 ,Mds-Vo-name=UA-MHI,Mds-Vo-name=local,o=grid
GlueHostOperatingSystemName: Debian
GlueHostOperatingSystemRelease: 0
GlueHostOperatingSystemVersion: 0
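
An entry like the one above can be checked programmatically. The following Python sketch is a minimal illustration that assumes the entry has already been retrieved from the BDII as LDIF text: it unfolds continuation lines (which start with a single space) and reads the published operating system name. The function name is hypothetical, and base64-encoded values (`attr:: value`) are not handled.

```python
# The BDII entry shown above, as LDIF text; the second line is a
# continuation of the dn and therefore starts with a single space.
SAMPLE = (
    "dn: GlueSubClusterUniqueID=arc.hpc-mhi.org,GlueClusterUniqueID=arc.hpc-mhi.org\n"
    " ,Mds-Vo-name=UA-MHI,Mds-Vo-name=local,o=grid\n"
    "GlueHostOperatingSystemName: Debian\n"
    "GlueHostOperatingSystemRelease: 0\n"
    "GlueHostOperatingSystemVersion: 0\n"
)


def parse_ldif_entry(text):
    """Parse one LDIF entry into a dict mapping attribute name to a list of values."""
    unfolded = []
    for line in text.splitlines():
        if line.startswith(" ") and unfolded:
            # Continuation line: drop the leading space and append to the previous line.
            unfolded[-1] += line[1:]
        elif line.strip():
            unfolded.append(line)
    entry = {}
    for line in unfolded:
        key, _, value = line.partition(":")
        entry.setdefault(key.strip(), []).append(value.strip())
    return entry


if __name__ == "__main__":
    entry = parse_ldif_entry(SAMPLE)
    os_name = entry.get("GlueHostOperatingSystemName", [""])[0]
    if os_name == "Debian":
        print("Debian published by:", entry["dn"][0])
```

Note that this entry publishes 0 for both the OS release and version, so the Debian release cannot be derived from the BDII alone and has to be confirmed with the site.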

Decommissioning SL5

  • Tracked on SL5_retirement wiki
  • No checks for dCache, DPM, ARC, UNICORE --> Action on NGIs/ROCs to follow up directly with sites

AOB

Monthly Availability/Reliability

A/R report on ARGO: http://argo.egi.eu/lavoisier/ngi_reports?accept=html

List of the underperforming RCs for (at least) 3 consecutive months:

Next meeting