Difference between revisions of "Agenda-11-04-2016"
Revision as of 10:55, 15 April 2016
General information
- the Operations meeting will be on the 2nd Monday of the month
- the EGI Operations Meeting schedule for the first half of 2016 is available on Indico: https://indico.egi.eu/indico/categoryDisplay.py?categId=32 and on the new summary page: https://wiki.egi.eu/wiki/Operations_Meeting
News from URT
Staged rollout updates
Next releases
Preview repository
Operational issues
Aligning Fedcloud sites to the A/R procedures
- EGI Operations proposal to align Fedcloud sites to the A/R related procedures used for the grid sites
- based on the availability and reliability of the services monitored in cloudmon, EGI Operations will start following up with underperforming sites, as is already done for all grid sites
- sites will NOT be suspended for A/R performance at least until the end of May
- in parallel EGI Operations will start PROC08 to include cloud probes in the EGI_CRITICAL and EGI profiles used for A/R computations (IN PROGRESS)
The proposed timeline is:
- February 2016:
- EGI Operations will check the status of the production cloud services in order to understand which issues (if any) the site has and provide help to NGIs and sites;
- Start of the integration of cloud probes in the EGI_CRITICAL profile (current set + OpenStack): to be agreed with the ARGO team; PROC08 will be followed
- June 2016:
- Starting notification of sites eligible for suspension
FedCloud status
Old issues
Grouped by NGI, please follow up with sites.
- NGI_UK
- 100IT (OpenStack)
- vmcatcher issues https://ggus.eu/index.php?mode=ticket_info&ticket_id=116358#update#19 IN PROGRESS
- BDII and GOCDB have different Endpoint URLs https://ggus.eu/index.php?mode=ticket_info&ticket_id=119002#update#5 FIXED
- NGI_PL
- CYFRONET-CLOUD (OpenStack)
- VMCatcher https://ggus.eu/index.php?mode=ticket_info&ticket_id=116363#update#29 IN PROGRESS
- NGI_DE
- GoeGrid (OpenNebula)
- NGI_GRNET
- HG-09-Okeanos-Cloud (Synnefo)
- VMCatcher, issue with large metadata, on hold (it requires some development) https://ggus.eu/index.php?mode=ticket_info&ticket_id=116368 ON HOLD
- NGI_TR
- TR-FC1-ULAKBIM (OpenStack)
- Missing GLUE2DomainID and image description looks wrong https://ggus.eu/index.php?mode=ticket_info&ticket_id=119005#update#15 IN PROGRESS
- New tickets opened to track issues in publishing appliances on AppDB for fedcloud.egi.eu: https://ggus.eu/index.php?mode=ticket_info&ticket_id=120010
- Issue with OCCI and fedcloud.egi.eu VO at MK-04-FINKICLOUD (NGI_MARGI): https://ggus.eu/index.php?mode=ticket_info&ticket_id=120027
New issues
Actions
- EGI Operations have been asked by user support to contact sites that have had unresolved technical problems supporting the fedcloud.egi.eu VO for a long time
- if the issues cannot be fixed quickly, sites will be asked to remove support for fedcloud.egi.eu
- they will re-enable the VO support as soon as they are able to fix the issues
- sites will be contacted directly by EGI Operations
Getting help
- the whole FedCloud wiki has been reviewed, removing redundancies, updating links and instructions
- from the operations point of view: https://wiki.egi.eu/wiki/Federated_Cloud_resource_providers_support
- see in particular the manual for the installation of a cloud site
- TBD: review the support units associated with FedCloud (in progress)
Decommissioning SL5
- Tracked on SL5_retirement wiki
- No checks for dCache, DPM, ARC, UNICORE --> Action on NGIs/ROCs to follow up directly with sites
NGIs argus server not properly configured
Some time ago (more than a year, I think), EGI ran a campaign to have NGIs run an "NGI Argus" service. This campaign resulted in new services being added to GOCDB for each NGI.
Unfortunately, as explained in the OMB in February, our monitoring is currently unable to check the deployment of these services:
- For 6 services, our monitoring cannot contact the NGI Argus
- For 18 services, our monitoring is not authorized to get the right information from the NGI Argus
- For 1 service, our monitoring indicates that the NGI Argus is not properly configured and does not pull the rules from argus.cern.ch
In the end, only 5 services are properly configured and monitored!
The changes are rather easy:
- If we can't contact them, the site needs to make sure that there is no firewall blocking 195.251.55.111 from accessing the argus 'pap' port
- If we are not authorized, the site needs to add the right ACE to their Argus authorization with: pap-admin add-ace 'CN=srv-111.afroditi.hellasgrid.gr,OU=afroditi.hellasgrid.gr,O=HellasGrid, C=GR' 'POLICY_READ_LOCAL|POLICY_READ_REMOTE|CONFIGURATION_READ'
- If the argus server is not properly configured (no rule pulled), the site has to follow http://wiki.nikhef.nl/grid/Argus_Global_Banning_Setup_Overview#NGI_Argus
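For the first case above, a site can at least confirm that the PAP port accepts TCP connections at all. The following is a minimal sketch in Python, not an official EGI tool. It assumes the PAP listens on port 8150 (the usual Argus default; check your own configuration), and the function name is made up for illustration. It only tests the path from the machine it runs on, so a success does not prove that 195.251.55.111 is allowed through the firewall.

```python
import socket

# Assumption: 8150 is the usual default PAP port; adjust if your site differs.
PAP_PORT = 8150


def pap_port_reachable(host: str, port: int = PAP_PORT, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Calling `pap_port_reachable("argus.example.org")`, with your own Argus host name in place of the placeholder, returns `False` when the port is filtered or closed from where the script runs.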
The current status of the infrastructure can be found:
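- In the secmon nagios (not sure you have access to this):
https://secmon.egi.eu/nagios/cgi-bin/status.cgi?servicegroup=SERVICE_ngi.ARGUS&style=detail&sorttype=1&sortoption=3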
- On the security dashboard:
https://operations-portal.egi.eu/csiDashboard/ngi/any/tab/list/filter/monitoring/page/list?tsid=4
On the security dashboard, each NGI should have an "argus-ban" result:
- "Ok" means ok
- "Unknown" means that we can't contact them
- "High" means that we are not authorized
- "Critical" means that Argus is not pulling rules from argus.cern.ch
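The result values and the fixes listed earlier can be put side by side as a lookup table. This is only an illustrative Python sketch: the dictionary and function names are hypothetical, and the action texts paraphrase this agenda rather than any dashboard output.

```python
# Mapping of "argus-ban" results to the follow-up described in this agenda.
# Names below are illustrative; they are not part of any EGI tool.
ARGUS_BAN_ACTIONS = {
    "ok": "Nothing to do.",
    "unknown": "Monitoring cannot contact the NGI Argus: check that no "
               "firewall blocks 195.251.55.111 on the PAP port.",
    "high": "Monitoring is not authorized: add the ACE with pap-admin add-ace.",
    "critical": "Argus is not pulling rules from argus.cern.ch: follow the "
                "Nikhef NGI Argus setup page.",
}


def action_for(result: str) -> str:
    """Return the suggested follow-up for a dashboard result (case-insensitive)."""
    key = result.strip().lower()
    return ARGUS_BAN_ACTIONS.get(key, f"Unrecognised result: {result!r}")
```

For example, `action_for("High")` returns the text about adding the ACE.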
AOB
Monthly Availability/Reliability
A/R report on ARGO: http://argo.egi.eu/lavoisier/ngi_reports?accept=html
List of the underperforming RCs for (at least) 3 consecutive months:
- AfricaArabia https://ggus.eu/?mode=ticket_info&ticket_id=117094:
- DZ-01-ARN
- EG-ZC-T3: unresponsive for two months, must be suspended
- ZA-UJ
- AsiaPacific: (since February) https://ggus.eu/index.php?mode=ticket_info&ticket_id=120180
- MY-UM-SIFIR: network and power failure
- NGI_DE: (since February) https://ggus.eu/index.php?mode=ticket_info&ticket_id=120181
- LRZ-LMU no feedback
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=120577
- FBF-Brescia-IT: working on improving the behaviour
- NGI_MARGI https://ggus.eu/index.php?mode=ticket_info&ticket_id=118465 no monitoring data since January
- NGI_MD: https://ggus.eu/index.php?mode=ticket_info&ticket_id=120578
- the only site, MD-02-IMI, was suspended in March for security reasons; news has been requested
- ROC_LA
- UFAL: suspended by the NGI manager
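The "at least 3 consecutive months" criterion used for the list above can be expressed as a short check. This Python sketch is an illustration only: the function names are hypothetical, the threshold is a parameter because the agenda does not state the target value, and it is not how the ARGO report is actually produced.

```python
from typing import Sequence


def underperforming_streak(monthly_values: Sequence[float], threshold: float) -> int:
    """Count the below-threshold months in a row, ending at the latest month."""
    streak = 0
    # Walk backwards from the most recent month until a month meets the target.
    for value in reversed(monthly_values):
        if value < threshold:
            streak += 1
        else:
            break
    return streak


def is_flagged(monthly_values: Sequence[float], threshold: float, months: int = 3) -> bool:
    """True if the site missed the target for at least `months` consecutive months."""
    return underperforming_streak(monthly_values, threshold) >= months
```

With made-up monthly figures `[90, 70, 60, 50]` (oldest first) and a hypothetical threshold of 80, the last three months are below target, so `is_flagged` returns `True`.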
Next meeting
- 9 May 2016 https://indico.egi.eu/indico/event/2739/