Unknown issue
Present situation
This page collects information about the UNKNOWN status issue.
Availability and Reliability calculation formulas:
Availability = Uptime / (Total time - Time_status_was_UNKNOWN)
Reliability = Uptime / (Total time - Scheduled Downtime - Time_status_was_UNKNOWN)
How to read these in the context of the UNKNOWN status:
- Periods in which a site is in UNKNOWN status are excluded from the calculation.
- During such periods EGI does not know what is happening with the infrastructure.
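The two formulas above can be sketched in code as follows. This is a minimal illustration, not the actual EGI computation: the function names, the use of hours as the unit, and the choice to return None when the whole period is UNKNOWN are all assumptions made for this example.

```python
def availability(uptime, total_time, unknown_time):
    """Availability = Uptime / (Total time - Time_status_was_UNKNOWN)."""
    known_time = total_time - unknown_time
    if known_time <= 0:
        # Assumption: if the status was UNKNOWN for the whole period,
        # there is nothing to compute, so no value is reported.
        return None
    return uptime / known_time


def reliability(uptime, total_time, scheduled_downtime, unknown_time):
    """Reliability = Uptime / (Total time - Scheduled Downtime - Time_status_was_UNKNOWN)."""
    denominator = total_time - scheduled_downtime - unknown_time
    if denominator <= 0:
        # Same assumption as above: no usable time left in the period.
        return None
    return uptime / denominator


# Hypothetical 720 h month: 600 h up, 24 h scheduled downtime, 72 h UNKNOWN.
avail = availability(600, 720, 72)    # 600 / 648
rel = reliability(600, 720, 24, 72)   # 600 / 624
```

Because the UNKNOWN time is subtracted from the denominator, a site with many UNKNOWN hours is judged only on the remaining hours, which is why a large UNKNOWN share can hide real problems.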
Problems
- There is no policy telling test developers when a test should return the UNKNOWN status. What does the UNKNOWN status mean?
- Some NGIs show ~0% UNKNOWN for all their sites, while others reach as much as ~40%; sometimes there are disproportions even within a single NGI. What causes such high values and disproportions, and where do they originate?
Solution proposal
Strict policy for developers on how to use the UNKNOWN status
Advantage: we will be sure that all problems are properly reported as ERROR rather than UNKNOWN
Disadvantage: someone has to write the policy and check that it is respected
Alarms for UNKNOWN status should be raised when the UNKNOWN status lasts longer than 4 h
Advantage: we will be notified if the UNKNOWN status lasts too long
Disadvantage: it means extra work for ROD, which will have to look after not only ERRORs but also UNKNOWNs
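A minimal sketch of the alarm rule, assuming the monitoring results are available as a time-ordered list of (timestamp, status) pairs. The function name unknown_alarms and the input format are hypothetical; only the 4 h limit comes from the proposal itself.

```python
from datetime import datetime, timedelta

UNKNOWN_ALARM_AFTER = timedelta(hours=4)  # limit taken from the proposal


def unknown_alarms(samples, limit=UNKNOWN_ALARM_AFTER):
    """Return the start time of every UNKNOWN run lasting longer than `limit`.

    `samples` is a list of (timestamp, status) pairs sorted by timestamp
    (hypothetical input format).
    """
    alarms = []
    run_start = None   # start of the current UNKNOWN run, if any
    alarmed = False    # whether the current run already raised an alarm
    for timestamp, status in samples:
        if status == "UNKNOWN":
            if run_start is None:
                run_start = timestamp
                alarmed = False
            if not alarmed and timestamp - run_start > limit:
                alarms.append(run_start)
                alarmed = True
        else:
            run_start = None  # any other status ends the run
    return alarms


# Hypothetical hourly samples: a 5 h UNKNOWN run, then a 2 h UNKNOWN run.
t0 = datetime(2012, 1, 1, 0, 0)
statuses = ["OK"] + ["UNKNOWN"] * 6 + ["OK"] + ["UNKNOWN"] * 3
samples = [(t0 + timedelta(hours=i), s) for i, s in enumerate(statuses)]
alarms = unknown_alarms(samples)  # only the first run exceeds 4 h
```

Each run raises at most one alarm, which keeps the extra load on ROD proportional to the number of long UNKNOWN periods rather than to the number of samples.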
Threshold for UNKNOWN status
Advantage: it is easy and fast to implement and automate
Disadvantage: there is a possibility of overlooking an important problem
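The proposal does not say what the threshold applies to. One possible reading, assumed here, is a limit on the fraction of time a site spends in UNKNOWN status over a reporting period. The 10% value, the function name, and the site names below are hypothetical.

```python
UNKNOWN_THRESHOLD = 0.10  # hypothetical value; the proposal does not specify one


def sites_over_threshold(unknown_fraction_by_site, threshold=UNKNOWN_THRESHOLD):
    """Return the sorted names of sites whose UNKNOWN fraction exceeds `threshold`.

    `unknown_fraction_by_site` maps a site name to the fraction (0.0 to 1.0)
    of the period during which its status was UNKNOWN.
    """
    return sorted(
        site
        for site, fraction in unknown_fraction_by_site.items()
        if fraction > threshold
    )


# Hypothetical fractions for three sites.
flagged = sites_over_threshold({"SITE-A": 0.0, "SITE-B": 0.4, "SITE-C": 0.1})
```

Under this reading, anything at or below the threshold is ignored, which illustrates the disadvantage stated above: a short but important problem reported as UNKNOWN would not be flagged.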