Difference between revisions of "SAM Nagios probes refactoring TF"
Revision as of 18:05, 16 September 2014
Mandate
- assess the support status of various Nagios probes available
- recommend removal or replacement of unsupported probes from the SAM Nagios framework
- request improvement/correction of supported ones
- ease their maintenance and support by:
- reducing the number of probes
- reducing the dependencies between services (a reported service error should not be caused by the problems of another service in the same site)
- improve documentation:
- availability of references to the individual nagios probes/tests descriptions in a central place
- update known documentation web pages with proper references (avoid broken links)
- improve developers guides
- require a change to SAM to remove hardcoded documentation URLs in the metrics configuration
Reference: OMB, June 26, 2014 - Re-factoring SAM probes
Tools
Tasks
Documentation
Generic improvements
- eliminate Broken Links
- collect all documentation links in a central place
- create a unique page collecting descriptions of all available tests/probes
Developers Documentation
- collect requirements and suggestions from probe developers
Probes
Identified Issues
- NGI SAM Nagioses have documentation URLs hardcoded in the metric configuration - GGUS #108242
- changing the URLs requires a SAM update
- Solution:
- Plan the Change
- SAM CE Nagios framework is unsupported
- used for WN* tests
- Solution: replace them
- SRM tests (add reference) cause false alarms on new DPM versions (the tests try the SRM API on all the interfaces published in the Top-BDII) (add link to GGUS ticket)
- Solution: ??
- org.gstat.SanityCheck - not maintained anymore - GGUS #108243
- checks small subset of BDII GLUE 1.3 data
- (add reference to profiles where it is enabled)
- Solution:
- replace with org.bdii.GLUE2-Validate - thorough validation of GLUE 2 data
- LFC decommissioning - with implications for the tests depending on it
- org.sam.WN-Rep* (CREAM-CE)
- Solutions:
- remove all LFC-dependent tests, or find replacement
- deploy dedicated LFC and reconfigure all NGI SAM Nagioses
- SAM requires unsupported software
- UMD 2 middleware:
- Solution: migration to UMD-3 planned in September - SAM/ARGO task #6
- CentOS/SL5
- Solution: migration to CentOS/SL6 planned within the EGI InSPIRE JRA2 activity
- org.sam.SRM-GetTURLs fails if webdav is published
- the probe reads from the BDII the list of protocols published in the GlueSEAccessProtocol objects and tries to access each of them through SRM. This fails for webdav.
- Solution:
- CERN DPM PT is developing a webdav SAM probe - GGUS #108571
- improve org.sam.SRM-GetTURLs, under dCache PT maintenance, so that it does not ask/test SRM for a webdav TURL
- MPI Nagios probe issues for GE batch system
- GGUS #101406 - WARNING: Publishes GlueCEPolicyMaxCPUTime (30) / GlueCEPolicyMaxWallClockTime (30) < 4
- Solution
- rewrite the test to take into account that the GE provider now also publishes the correct per-core CPU limit in the BDII, and include it in the next SAM update - GGUS #108443
- remove/replace FTS(2) probes - GGUS #108458
- FTS2 has been decommissioned since the 1st of August;
- the Nagios FTS probes are not suited for FTS3 and so they can be removed.
- Solution:
- follow up with TP (OliverK & MaiteBL) on the development of FTS3 probes and their integration in SAM - GGUS #108458
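The improvement proposed above for org.sam.SRM-GetTURLs (do not ask SRM for a webdav TURL) could be sketched roughly as follows. This is only an illustration, not the probe's actual code: the function name `srm_testable_protocols` and the `skip` parameter are hypothetical, and the input is assumed to be the list of protocol names already read from the GlueSEAccessProtocol objects in the BDII.

```python
# Illustrative sketch only; not taken from org.sam.SRM-GetTURLs.
# Assumption: "published" holds the protocol names found in the BDII.

def srm_testable_protocols(published, skip=("webdav",)):
    """Return the published protocols that should still be tested through SRM.

    Names are compared case-insensitively, duplicates are dropped,
    and the original publication order is preserved.
    """
    skipped = {name.lower() for name in skip}
    seen = set()
    result = []
    for proto in published:
        name = proto.strip().lower()
        if not name or name in skipped or name in seen:
            continue
        seen.add(name)
        result.append(name)
    return result


if __name__ == "__main__":
    # webdav is filtered out before any TURL request would be made
    print(srm_testable_protocols(["gsiftp", "WebDAV", "rfio", "gsiftp"]))
```

Keeping the skip list as a parameter leaves room for excluding other protocols later without changing the filtering logic.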
TF Activity
Task No. | Description | Status |
---|---|---|
1 | Replace all references to "grid-monitoring.egi.eu" with "mon.egi.eu" | DONE |
2 | Update all broken links in SAM | In Progress |
4 | provide lists of all ROC SAM tests, in ROC_SAM_Tests | DONE |
4.1 | update description of ROC SAM tests, in ROC_SAM_Tests | In Progress |
5 | include info from SAM_Tests in SAM, and obsolete it | In Progress |
x | etc | ToDo |
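Task 1 in the table above (replacing "grid-monitoring.egi.eu" with "mon.egi.eu") amounts to a host rewrite on documentation URLs. A minimal sketch of such a rewrite is given below, assuming the links are available as plain URL strings; the function name `migrate_url` and the example path are hypothetical, and this is not the procedure the TF actually used.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_HOST = "grid-monitoring.egi.eu"
NEW_HOST = "mon.egi.eu"


def migrate_url(url):
    """Rewrite the host of a URL from OLD_HOST to NEW_HOST.

    Only the host part is compared, so a URL that merely mentions the old
    name in its path or query string is returned unchanged.
    Assumption: the URLs carry no user:password@ prefix (it would be dropped).
    """
    parts = urlsplit(url)
    if parts.hostname != OLD_HOST:
        return url
    # Keep an explicit port if one was given
    netloc = NEW_HOST if parts.port is None else f"{NEW_HOST}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


if __name__ == "__main__":
    # Hypothetical example path, for illustration only
    print(migrate_url("https://grid-monitoring.egi.eu/docs/probe"))
```

Matching on the parsed host rather than doing a blind string replacement avoids touching links to other sites that only contain the old name as a parameter.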
SAM/ARGO Roadmap
People
- Cristina Aiftimiei
- Emir Imamagic
- Peter Solagna