Based on NGINX 1.27.4
Passive health checks
`peer->accessed` represents the last failure time. It is initialized to 0, and set to the current time (`ngx_time()`), along with `peer->checked`, whenever a request to the peer fails.
`peer->checked` represents the start of the current error window. It is set to the current time when the peer is selected as best, updated again if the peer is selected after `fail_timeout` has passed (i.e. the time since the last `checked` exceeds `fail_timeout`), and also whenever a request to the peer fails.
`peer->fails` represents the number of failures observed during the current error window. It is incremented on each failure, and reset to 0 when a request passes the check AND `peer->accessed < peer->checked`, i.e. `peer->checked` was refreshed but `peer->accessed` was not. Given that `peer->accessed` is a moving goalpost that is bumped on every failure, `fails` can accumulate across multiple `fail_timeout` spans if failures continue to happen.
If `max_fails` is not 0, the failure condition is met (`peer->fails >= peer->max_fails`), and we are still within the error window (`now - peer->checked <= peer->fail_timeout`), the peer is skipped during peer selection.
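To make the interplay concrete, here is a minimal, self-contained C model of this bookkeeping. The struct and function names are ours; the real logic lives in `src/http/ngx_http_upstream_round_robin.c`, interleaved with locking and the weighted round-robin scan.

```c
#include <time.h>

typedef struct {
    time_t    accessed;      /* last failure time */
    time_t    checked;       /* start of the current error window */
    unsigned  fails;         /* failures observed in the window */
    unsigned  max_fails;     /* 0 disables the check */
    time_t    fail_timeout;
} peer_t;

/* NGX_PEER_FAILED branch of ngx_http_upstream_free_round_robin_peer():
 * every failure bumps fails and moves both timestamps forward. */
void peer_failed(peer_t *p)
{
    time_t  now = time(NULL);

    p->fails++;
    p->accessed = now;
    p->checked = now;
}

/* Success branch: fails is reset only if checked was refreshed after
 * the last failure, i.e. accessed < checked. */
void peer_succeeded(peer_t *p)
{
    if (p->accessed < p->checked) {
        p->fails = 0;
    }
}

/* Selection-time test from ngx_http_upstream_get_peer(). In NGINX the
 * refresh of checked happens only for the peer finally chosen as best;
 * it is folded in here for brevity. */
int peer_usable(peer_t *p)
{
    time_t  now = time(NULL);

    if (p->max_fails
        && p->fails >= p->max_fails
        && now - p->checked <= p->fail_timeout)
    {
        return 0;            /* skipped: still inside a failed window */
    }

    if (now - p->checked > p->fail_timeout) {
        p->checked = now;    /* start a new error window */
    }

    return 1;
}
```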
The combination of the conditions above raises an interesting question: if there is a continuous stream of failures arriving at intervals shorter than `fail_timeout`, will `fails` accumulate until the peer failure condition triggers, or will it be hard-reset after each `fail_timeout`?
An interesting behavior of NGINX is that it decrements `effective_weight` by `peer->weight / peer->max_fails` (integer division!) on each failure, and increments it by 1, up to `weight`, whenever the peer is considered for use, so there is an effective-weight penalty for failures whenever `peer->weight / peer->max_fails` is not 0. This penalty reduces the number of requests the peer receives, which in turn makes a continuous stream of failures less likely to sustain itself.
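Condensed into the same toy style (again, names ours), the penalty mechanics look roughly like this:

```c
typedef struct {
    int       weight;            /* configured weight */
    int       effective_weight;  /* weight after the failure penalty */
    unsigned  max_fails;
} wpeer_t;

/* On each failure, ngx_http_upstream_free_round_robin_peer() docks the
 * effective weight by weight / max_fails -- integer division, so the
 * penalty is 0 whenever weight < max_fails. */
void weight_penalize(wpeer_t *p)
{
    if (p->max_fails) {
        p->effective_weight -= p->weight / (int) p->max_fails;
    }

    if (p->effective_weight < 0) {
        p->effective_weight = 0;
    }
}

/* On every selection pass, ngx_http_upstream_get_peer() lets the
 * effective weight recover by 1, capped at the configured weight. */
void weight_recover(wpeer_t *p)
{
    if (p->effective_weight < p->weight) {
        p->effective_weight++;
    }
}
```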
And even while a peer is temporarily disabled, no new requests select it, so it is safe to assume that after `fail_timeout` the counter will be reset, as the error window passes and new requests pass the check. If the first requests completed after that do not pass the check, `peer->checked` is bumped along with `peer->accessed` and the peer is disabled again immediately.
Nonetheless, this imperfection makes the algorithm nonconforming to the NGINX documentation, which states that `fail_timeout` is “the time during which the specified number of unsuccessful attempts to communicate with the server should happen to consider the server unavailable.”
NOTE
Takeaway from this: always set `weight` to an integer greater than or equal to `max_fails`; otherwise `peer->weight / peer->max_fails` rounds down to 0 and the weight penalty is ineffective (e.g. with `weight=1` and `max_fails=5`, `1 / 5 == 0` in integer division).
History
From Maxim Dounin (https://www.mail-archive.com/nginx-devel@nginx.org/msg00514.html):

> This is expected behaviour. Documentation is a bit simplified here, and fail_timeout is used like session time limit - the peer->fails counter is reset once there are no failures within fail_timeout.
>
> While this might be non-ideal for some use cases, it’s certainly not a bug.
A proposed fix was rejected because:

> Such algorithm forget everything about previous failures once per fail_timeout, and won’t detect bursts of failures split across two fail_timeout intervals.
and in a later exchange of emails:

> Well, in normal world if an upstream constantly fails ~1% of requests - it’s not healthy and should not be used. I understand that your use case is a bit special though.

> Yes, I know this case, sorry, forgot to mention. However, I think it will extend detection period to 2-3 fail_timeouts in real life (in theory up to max_fails fail_timeouts, yes, but it’s almost improbable). If we want correct implementation we need per-second array (with fail_timeout elements), that’s an overkill in my opinion.

> Sure, per-second array isn’t a solution.

> By the way, leaky bucket approach (like limit_req but with fails per second) might work well here, what do you think?

> Yes, leaky/token bucket should work. That’s actually what I think about if I think about changing the above algorithm to something strictly bound to fail_timeout period.
So Maxim is against the per-second array solution.
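For reference, the leaky-bucket variant floated in that exchange might look something like the sketch below. This is purely our illustration (nothing like it exists in NGINX): failures fill a bucket that drains at a fixed rate, so the failure state stays strictly bound to a time window no matter how the failures are spread out.

```c
#include <time.h>

typedef struct {
    double  level;   /* accumulated failures */
    double  rate;    /* failures drained per second */
    double  burst;   /* threshold: trip when exceeded */
    time_t  last;    /* time of the last update */
} fail_bucket_t;

/* Record one failure; returns 1 if the peer should be disabled. */
int bucket_failed(fail_bucket_t *b)
{
    time_t  now = time(NULL);

    /* drain at the configured rate, then add the new failure */
    b->level -= (double) (now - b->last) * b->rate;
    if (b->level < 0) {
        b->level = 0;
    }
    b->last = now;

    b->level += 1.0;

    return b->level > b->burst;
}
```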
A few years later (https://www.mail-archive.com/nginx-devel@nginx.org/msg09804.html):

> Documentation somewhat oversimplifies things. The fail_timeout setting is essentially a session timeout, and things work as follows:
>
> As long as there are failures, the fails counter is incremented. If fail_timeout passes since last failure, the fails counter is reset to 0 on the next successful request.
>
> If the fails counter reaches max_fails, no more requests are routed to the peer for fail_timeout time. After fail_timeout passes, one request is allowed. If the request is successful, the fails counter is reset to 0, and further requests to the peer are allowed without any limits.
References:
- https://www.mail-archive.com/search?l=nginx-devel@nginx.org&q=subject:“%5C[BUG%5C?%5C]fail_timeout%5C/max_fails%5C:+code+doesn’t+do+what+doc+says”&o=newest&f=1
  - also on nginx.org as https://mailman.nginx.org/pipermail/nginx-devel/2013-May/003753.html
- https://www.mail-archive.com/search?l=nginx-devel@nginx.org&q=subject:“incorrect+upstream+max_fails+behaviour”&o=newest&f=1
  - also on nginx.org as https://mailman.nginx.org/pipermail/nginx-devel/2020-March/013070.html
Demo
Set up NGINX with 1 worker process, two local servers listening on ports 3000 and 3001 that always return 503, and an upstream that points to both servers with `max_fails=120 fail_timeout=3s`. Then set up a third server listening on port 3002 that has `proxy_pass` pointing to this upstream and `proxy_next_upstream error timeout http_503` set. Finally, request the third server at a roughly 0.5s interval and observe whether the error log contains “no live upstreams”.
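A configuration along these lines reproduces the setup (the always-503 backends are sketched with NGINX's own `return` directive, though any server that fails the check would do):

```nginx
worker_processes 1;

events {}

http {
    upstream demo {
        server 127.0.0.1:3000 max_fails=120 fail_timeout=3s;
        server 127.0.0.1:3001 max_fails=120 fail_timeout=3s;
    }

    # The two backends that always fail.
    server { listen 3000; return 503; }
    server { listen 3001; return 503; }

    # The front server under test.
    server {
        listen 3002;
        location / {
            proxy_pass http://demo;
            proxy_next_upstream error timeout http_503;
        }
    }
}
```

Driving it with something like `while sleep 0.5; do curl -s -o /dev/null http://127.0.0.1:3002/; done` while tailing the error log completes the experiment.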
The result from NGINX 1.27.4 confirms our assumption. About 2 minutes later, which is enough time for both upstream servers to reach 60 fails because each request is retried against both servers, NGINX logs “no live upstreams” for all subsequent requests.
We also observed a weird behavior: occasionally the error log would go quiet for a while after 6 requests, then return to either the same pattern of continuous errors or another burst of 6 log lines. If the pause is about 2 minutes, the `peer->fails` counter might have been reset for both peers, so it is likely that `state & NGX_PEER_FAILED` was somehow `false` when the used peer was freed.
Backup servers
NGINX only considers the backup servers stored in `peers->next` if either `ngx_http_upstream_get_peer()` did not return a peer (returned `NULL`) from the primary servers, or the single primary server is down or has reached `max_conns`.
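A toy rendering of that branch structure, condensed from `ngx_http_upstream_get_round_robin_peer()` (names ours; locking and retry bookkeeping omitted):

```c
#include <stddef.h>

typedef struct peer_s   peer_t;
typedef struct peers_s  peers_t;

struct peer_s {
    unsigned  down;        /* marked down in the configuration */
    unsigned  conns;       /* current number of connections */
    unsigned  max_conns;   /* 0 = unlimited */
};

struct peers_s {
    unsigned  single;      /* exactly one primary server */
    peer_t   *peer;        /* the peer list */
    peers_t  *next;        /* backup peer set, or NULL */
};

/* Stand-in for ngx_http_upstream_get_peer(); the real function runs
 * the weighted round-robin scan sketched earlier and returns NULL when
 * every primary peer is skipped. */
static peer_t *get_primary_peer(peers_t *peers)
{
    (void) peers;
    return NULL;           /* stub: pretend all primaries were skipped */
}

/* Backups in peers->next are consulted only when the primaries yield
 * nothing. */
static peer_t *select_peer(peers_t *peers)
{
    peer_t  *peer;

    if (peers->single) {
        peer = peers->peer;

        if (peer->down
            || (peer->max_conns && peer->conns >= peer->max_conns))
        {
            peer = NULL;
        }

    } else {
        peer = get_primary_peer(peers);
    }

    if (peer == NULL && peers->next) {
        return select_peer(peers->next);   /* fall back to backups */
    }

    return peer;
}
```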