Difference between revisions of "Unknown issue"
Revision as of 12:04, 12 October 2011
This page collects information about the UNKNOWN status issue.
Present situation
Availability and Reliability calculation formulas:
Availability = Uptime / (Total time - Time_status_was_UNKNOWN)
Reliability = Uptime / (Total time - Scheduled Downtime - Time_status_was_UNKNOWN)
How to read the formulas in the context of UNKNOWN status:
- Any period in which a site is in status UNKNOWN is excluded from the calculation.
- During such a period EGI does not know what is happening with the infrastructure.
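The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names, the choice of hours as the unit, and the sample figures are assumptions made for this example and do not come from any EGI tool.

```python
def availability(uptime, total_time, unknown_time):
    """Availability = Uptime / (Total time - Time_status_was_UNKNOWN)."""
    known_time = total_time - unknown_time
    if known_time <= 0:
        # The status was UNKNOWN for the whole period: nothing to compute.
        return None
    return uptime / known_time


def reliability(uptime, total_time, scheduled_downtime, unknown_time):
    """Reliability = Uptime / (Total time - Scheduled Downtime - Time_status_was_UNKNOWN)."""
    expected_time = total_time - scheduled_downtime - unknown_time
    if expected_time <= 0:
        return None
    return uptime / expected_time


# Hypothetical 720-hour month: 600 h up, 20 h scheduled downtime, 100 h UNKNOWN.
a = availability(uptime=600, total_time=720, unknown_time=100)
r = reliability(uptime=600, total_time=720, scheduled_downtime=20, unknown_time=100)
print(f"availability={a:.3f} reliability={r:.3f}")
```

With these made-up figures the 100 hours of UNKNOWN are removed from the denominator, so availability comes out as 600/620 rather than 600/720. This is why a large share of UNKNOWN time can hide real problems.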
Problems
- There is no policy telling test developers when a test should return UNKNOWN status. What does UNKNOWN status mean?
- Some NGIs reach ~0% UNKNOWN for all their sites while others reach as much as ~40%; sometimes there are disproportions even within a single NGI. What is the reason for such high values and disproportions, and where does it lie?
Solution proposal
Strict policy for developers on how to use UNKNOWN status
Advantage: we can be sure that all problems are properly reported as ERROR, not UNKNOWN
Disadvantage: someone has to write the policy and check whether it is respected
Alarms should be created when UNKNOWN status lasts longer than 4h
Advantage: we will be notified if the UNKNOWN status lasts too long
Disadvantage: it means extra work for ROD, which will have to look after not only ERRORs but also UNKNOWNs
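A minimal sketch of how the 4h rule could be checked. It assumes the monitoring results are available as a time-ordered list of (timestamp, status) samples, where each status holds until the next sample. The data layout and the function name `unknown_alarms` are hypothetical; only the 4h limit comes from the proposal.

```python
from datetime import datetime, timedelta

UNKNOWN_ALARM_AFTER = timedelta(hours=4)  # limit taken from the proposal


def unknown_alarms(samples, now, limit=UNKNOWN_ALARM_AFTER):
    """Return (start, end) pairs for UNKNOWN periods longer than `limit`.

    `samples` is a list of (datetime, status) tuples sorted by time;
    a status is assumed to hold until the next sample (or until `now`).
    """
    alarms = []
    start = None  # start of the current UNKNOWN period, if any
    for timestamp, status in samples:
        if status == "UNKNOWN":
            if start is None:
                start = timestamp
        else:
            if start is not None and timestamp - start > limit:
                alarms.append((start, timestamp))
            start = None
    # An UNKNOWN period that is still open is measured up to `now`.
    if start is not None and now - start > limit:
        alarms.append((start, now))
    return alarms


t0 = datetime(2011, 10, 12, 0, 0)
samples = [
    (t0, "OK"),
    (t0 + timedelta(hours=1), "UNKNOWN"),
    (t0 + timedelta(hours=3), "UNKNOWN"),
    (t0 + timedelta(hours=6), "OK"),       # closes a 5h UNKNOWN period
    (t0 + timedelta(hours=7), "UNKNOWN"),  # only 1h old at `now`
]
alarms = unknown_alarms(samples, now=t0 + timedelta(hours=8))
```

In this example only the first UNKNOWN period (5 hours) raises an alarm; the second one is still below the limit.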
Threshold for UNKNOWN status
Advantage: it is easy and fast to implement and automate
Disadvantage: an important problem may be overlooked
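One possible reading of the threshold proposal is to flag every site whose share of time in UNKNOWN exceeds a fixed limit. The sketch below follows that reading. The 10% limit, the site names, and the function name are all assumptions for illustration; the proposal does not state a threshold value.

```python
def sites_over_threshold(unknown_fraction_by_site, threshold=0.10):
    """Return the names of sites whose UNKNOWN fraction exceeds `threshold`.

    `unknown_fraction_by_site` maps a site name to the fraction of the
    period (0.0 to 1.0) that the site spent in UNKNOWN status.
    The default threshold of 10% is a placeholder, not an agreed value.
    """
    return sorted(
        site
        for site, fraction in unknown_fraction_by_site.items()
        if fraction > threshold
    )


# Hypothetical sites and fractions.
flagged = sites_over_threshold({"SITE-A": 0.0, "SITE-B": 0.4, "SITE-C": 0.1})
```

A check like this is cheap to automate, but, as noted above, a site that stays just under the limit is not flagged even if its UNKNOWN periods hide a real problem.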