SAM Nagios probes refactoring TF
Latest revision as of 07:33, 13 July 2016
This article is deprecated and should no longer be used, but remains available for reference.
Mandate
- assess the support status of the various Nagios probes available
  - recommend removal or replacement of unsupported probes from the SAM Nagios framework
  - request improvement/correction of supported ones
  - ease their maintenance and support by:
    - reducing the number of probes
    - reducing the dependencies between services (a reported service error should not be caused by the problems of another service at the same site)
- improve documentation:
  - make references to the individual Nagios probe/test descriptions available in a central place
  - update known documentation web pages with proper references (avoid broken links)
  - improve developer guides
  - require a change to SAM to remove hardcoded documentation URLs in metrics configuration
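Reference: OMB, June 26, 2014 - Re-factoring SAM probes (https://indico.egi.eu/indico/getFile.py/access?contribId=3&resId=0&materialId=slides&confId=2190)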
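Tools
- TF Indico Agendas: https://indico.egi.eu/indico/categoryDisplay.py?categId=89
- Mailing-List Archive (members only): https://mailman.egi.eu/mailman/private/sam-probe-wg/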
Tasks
Documentation
Generic improvements
- eliminate Broken Links
- collect all documentation links in a central place
- create a unique page collecting descriptions of all available tests/probes
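As a starting point for the link clean-up tasks above, the external links of a page can be listed before they are checked or moved to a central place. The following is a minimal sketch, assuming the pages are available as MediaWiki source text; the function name `extract_external_links` is hypothetical and not part of any SAM or EGI tool.

```python
import re

# External links in MediaWiki markup look like: [https://example.org label]
# Internal links such as [[SAM]] do not match, because "http" must follow "[".
_EXTERNAL_LINK_RE = re.compile(r"\[(https?://[^\s\]]+)(?:\s+([^\]]*))?\]")


def extract_external_links(wikitext):
    """Return (url, label) pairs for every bracketed external link."""
    return [
        (match.group(1), (match.group(2) or "").strip())
        for match in _EXTERNAL_LINK_RE.finditer(wikitext)
    ]


if __name__ == "__main__":
    sample = "* [https://ggus.eu/?mode=ticket_info&ticket_id=108242 GGUS #108242] see [[SAM]]"
    for url, label in extract_external_links(sample):
        print(url, "-", label)
```

The resulting list could then be fed to whatever link checker the TF prefers; the sketch deliberately performs no network access.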
Developers Documentation
- collect requirements and suggestions from Probes Developers
Probes
Identified Issues
- List of all GGUS tickets assigned to ARGO/SAM SU: http://bit.ly/sam_open
- NGI SAM Nagioses have documentation URL hardcoded in metric configuration - GGUS #108242 (https://ggus.eu/?mode=ticket_info&ticket_id=108242)
  - changing the URLs requires a SAM update
  - Solution:
    - Plan the Change
- org.gstat.SanityCheck - not maintained anymore - GGUS #108243 (https://ggus.eu/index.php?mode=ticket_info&ticket_id=108243)
  - checks a small subset of BDII GLUE 1.3 data
  - (add reference to profiles where it is enabled)
  - Solution:
    - replace with org.bdii.GLUE2-Validate - thorough validation of GLUE 2 data
- SAM requires unsupported software
  - UMD 2 middleware:
    - Solution: migration to UMD-3 planned in September - SAM/ARGO task #6 (https://github.com/ARGOeu/sam-probes/issues/6)
  - CentOS/SL5
    - Solution: migration to CentOS/SL6 planned within EGI InSPIRE JRA2 activity
- org.sam.SRM-GetTURLs fails if webdav is published
  - the probe takes from the BDII the list of protocols published in the GlueSEAccessProtocol object and tries to access them using SRM. This fails for webdav.
  - Solution:
    - CERN DPM PT is developing a webdav SAM probe - GGUS #108571 (https://ggus.eu/index.php?mode=ticket_info&ticket_id=108571)
    - improve org.sam.SRM-GetTURLs, under dCache PT maintenance, to not ask/test SRM for a webdav TURL
- MPI Nagios probe issues for GE batch system - GGUS #108443 (https://ggus.eu/?mode=ticket_info&ticket_id=108443)
  - GGUS #101406 (https://ggus.eu/index.php?mode=ticket_info&ticket_id=101406) - WARNING: Publishes GlueCEPolicyMaxCPUTime (30) / GlueCEPolicyMaxWallClockTime (30) < 4
  - Solution:
    - rewrite the test to take into account that the GE provider now also publishes the correct per-core CPU limit in the BDII, and include it in the next SAM update
- remove/replace FTS(2) probes - GGUS #108458 (https://ggus.eu/index.php?mode=ticket_info&ticket_id=108458)
  - FTS2 has been decommissioned since the 1st of August;
  - the Nagios FTS probes are not suited for FTS3 and so they can be removed.
  - Solution:
    - follow up with TP (OliverK & MaiteBL) on the development of FTS3 probes and their integration in SAM
- SAM CE Nagios framework is unsupported
  - used for WN* tests
  - Solution: replace them
- LFC decommissioning - with implications for the tests depending on it
  - org.sam.WN-Rep* (CREAM-CE)
  - Possible Solutions (under evaluation):
    - deploy dedicated LFC and reconfigure all NGI SAM Nagioses
- SRM tests (add reference) cause false alarms on new DPM versions (the tests try, through the SRM API, all the interfaces published in the Top-BDII) (add link to GGUS tkt)
  - Solution: ??
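The second solution proposed above for org.sam.SRM-GetTURLs amounts to filtering the published protocol list before asking SRM for TURLs. The following is a minimal sketch of that idea, not the probe's actual code: it assumes the GlueSEAccessProtocol types arrive as plain strings, and the names `srm_testable_protocols` and `NON_SRM_PROTOCOLS` are hypothetical.

```python
# Hypothetical sketch: not the actual org.sam.SRM-GetTURLs implementation.
# Assumption: protocol types read from GlueSEAccessProtocol are plain strings.

# Protocols that may be published in the BDII but should not be requested
# as an SRM TURL. Only webdav is named in the issue; extend as needed.
NON_SRM_PROTOCOLS = frozenset({"webdav"})


def srm_testable_protocols(published, excluded=NON_SRM_PROTOCOLS):
    """Return the published protocols that should be tested through SRM.

    Names are normalised to lower case, duplicates are dropped, and the
    original publication order is preserved.
    """
    kept = []
    for proto in published:
        name = proto.strip().lower()
        if name and name not in excluded and name not in kept:
            kept.append(name)
    return kept


if __name__ == "__main__":
    print(srm_testable_protocols(["gsiftp", "rfio", "WebDAV", "gsiftp"]))
```

Keeping the exclusion list as a parameter means other non-SRM protocols could be skipped later without touching the filtering logic.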
TF Activity
Description | Status |
---|---|
Replace all references to "grid-monitoring.egi.eu" with "mon.egi.eu" | DONE |
Provide lists of all ROC SAM tests, in ROC_SAM_Tests | DONE |
Update description of ROC SAM tests, in ROC_SAM_Tests | In Progress |
Update all broken links in SAM | In Progress |
Include info from SAM_Tests in SAM, and obsolete it | In Progress |
etc | ToDo |
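The first activity in the table, replacing "grid-monitoring.egi.eu" with "mon.egi.eu", can be illustrated with a small text-rewriting helper. This is only a sketch under the assumption that page sources are processed as plain strings; the function name `replace_host` is hypothetical and the TF may well have done the replacement by hand.

```python
import re

OLD_HOST = "grid-monitoring.egi.eu"
NEW_HOST = "mon.egi.eu"

# Match the old host only when it is not embedded in a longer name:
# no word character, dot or hyphen directly before it, and no word
# character or hyphen directly after it.
_OLD_HOST_RE = re.compile(r"(?<![\w.-])" + re.escape(OLD_HOST) + r"(?![\w-])")


def replace_host(text):
    """Return (new_text, number_of_replacements) for one page source."""
    return _OLD_HOST_RE.subn(NEW_HOST, text)


if __name__ == "__main__":
    page = "See https://grid-monitoring.egi.eu/nagios and [[SAM]]"
    new_page, count = replace_host(page)
    print(count, new_page)
```

Returning the replacement count makes it easy to report which pages were actually changed.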
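SAM/ARGO Roadmap
- SAM-ARGO-OPS-Roadmap.pdf
- ARGOeu/sam-probes - github issues: https://github.com/ARGOeu/sam-probes/issues?q=is%3Aissue+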
People
- Cristina Aiftimiei
- Emir Imamagic
- Peter Solagna
- David Crooks
- Tiziana Ferrari
- Paloma Fuente
- Kashif Mohammad
- Stuart Pullinger
- Marcin Radecki
- Ievgen Sliusar
- Ulf Tigerstedt
- Petter Urkedal