The Four Golden Signals
- Latency
- It’s important to distinguish between the latency of successful requests and the latency of failed requests.
- Should we count client timeouts as successful requests? They may be requests that would have succeeded but took too long; on the other hand, the client may simply have closed the connection early.
- A slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.
- Traffic
- Normalized request rate, for example HTTP requests per second.
- Errors
- Rate of requests that fail, either explicitly (e.g. HTTP 500s), implicitly (e.g. a success response with the wrong content), or by policy (e.g. a response that misses its latency target).
- Saturation
- Emphasize the resources that are most constrained (e.g. memory or I/O).
- Systems degrade before they achieve 100% utilization, so having a utilization target is essential.
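As a rough illustration of the four signals, the sketch below derives them from a window of request records. Everything in it is an assumption made for the example: the `Request` record, the `golden_signals` function, the choice of memory as the constrained resource, the 80% utilization target, and treating only HTTP 5xx as failures (implicit and policy failures are ignored for brevity).

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def golden_signals(requests, window_s, mem_used, mem_total, utilization_target=0.8):
    """Summarize one observation window into the four golden signals."""
    # Keep successful and failed latencies apart so slow errors stay visible.
    ok = [r.latency_ms for r in requests if r.status < 500]
    failed = [r.latency_ms for r in requests if r.status >= 500]
    utilization = mem_used / mem_total
    return {
        "latency_ok_ms": median(ok) if ok else None,
        "latency_failed_ms": median(failed) if failed else None,
        "traffic_rps": len(requests) / window_s,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        "saturation": utilization,
        # Compare against a target below 100%, since degradation starts earlier.
        "over_target": utilization > utilization_target,
    }
```

Reporting `latency_failed_ms` separately is what makes a slow error show up instead of being filtered out or blended into the success latency.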
Metric Aggregation
Most metrics are better thought of as distributions rather than averages. A high-order percentile, such as the 99th or 99.9th, shows you a plausible worst-case value, while the 50th percentile (the median) emphasizes the typical case.
For example, sampling CPU usage every second and aggregating the samples into a histogram once per minute gives good resolution (granularity) at a much lower collection and retention cost than storing every sample.
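The per-second CPU example above could be sketched roughly as follows. The 5% bucket width and the function names `to_histogram` and `percentile_upper_bound` are assumptions chosen for illustration, and the percentile is only an estimate: it returns the upper edge of the bucket in which the requested percentile falls.

```python
BUCKET_WIDTH = 5  # percent per bucket; an assumed granularity


def to_histogram(samples):
    """Count per-second CPU samples (0-100 %) into fixed-width buckets."""
    counts = [0] * (100 // BUCKET_WIDTH)
    for s in samples:
        # A reading of exactly 100% is placed in the last bucket.
        idx = min(int(s // BUCKET_WIDTH), len(counts) - 1)
        counts[idx] += 1
    return counts


def percentile_upper_bound(counts, pct):
    """Upper edge (in %) of the bucket that contains the pct-th percentile."""
    total = sum(counts)
    if total == 0:
        return None
    threshold = total * pct / 100
    seen = 0
    for i, c in enumerate(counts):
        seen += c
        if seen >= threshold:
            return (i + 1) * BUCKET_WIDTH
    return 100


# One minute of samples: mostly idle, with a 6-second burst near saturation.
minute = [12] * 54 + [97] * 6
hist = to_histogram(minute)
typical = percentile_upper_bound(hist, 50)
worst = percentile_upper_bound(hist, 99)
```

For this minute the mean is 20.5%, which hides the burst, while the histogram places the median in the 10-15% bucket and the 99th percentile in the 95-100% bucket. Only 20 counters are retained per minute instead of 60 raw samples.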
Black-Box Versus White-Box
We should aim for heavy use of white-box monitoring with modest but critical uses of black-box monitoring. White-box monitoring helps identify the root cause.
Black-box monitoring is symptom-oriented and represents active (not predicted) problems. For not-yet-occurring but imminent problems, black-box monitoring is fairly useless.