Based on NGINX 1.27.4

Passive health checks

peer->accessed represents the last failure time. It is initialized to 0, and set to the current time ngx_time(), along with peer->checked, whenever a request to the peer fails.

peer->checked represents the start of the current error window. It is set to the current time when the peer is selected as best, and subsequently updated to the current time either when the peer is selected again after fail_timeout has passed (i.e. the time since it was last checked exceeds fail_timeout), or when any request to the peer fails.

peer->fails represents the number of failures observed during the current error window. It is incremented on each failure, and reset to 0 when a request passes the check AND peer->accessed < peer->checked, i.e. peer->checked was refreshed but peer->accessed was not. Since peer->accessed is a moving goalpost that is bumped on every failure, fails can accumulate across multiple fail_timeout spans if failures keep happening.

If max_fails is not 0, the failure condition is met (peer->fails >= peer->max_fails), and we are still within the error window (now - peer->checked <= peer->fail_timeout), the peer is skipped during peer selection.

Combining the conditions above raises an interesting question. If there is a continuous stream of failures arriving at intervals shorter than fail_timeout, will fails accumulate until the peer failure condition triggers, or will it be hard reset after each fail_timeout?
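The bookkeeping described above can be simulated to answer this. Below is a minimal Python sketch, not NGINX's actual C code; the field and method names are chosen to mirror the struct fields, and max_fails is lowered to 3 (the nginx default is 1) so the accumulation is visible:

```python
FAIL_TIMEOUT = 10  # seconds (the nginx default)
MAX_FAILS = 3      # lowered from the nginx default of 1 for illustration

class Peer:
    """Toy model of the per-peer fields described above."""

    def __init__(self):
        self.accessed = 0  # time of the last failure
        self.checked = 0   # start of the current error window
        self.fails = 0     # failures observed in the current window

    def try_select(self, now):
        """Return True if the peer may be used at time `now`."""
        if (MAX_FAILS != 0 and self.fails >= MAX_FAILS
                and now - self.checked <= FAIL_TIMEOUT):
            return False  # failure condition met, still inside the window
        if now - self.checked > FAIL_TIMEOUT:
            self.checked = now  # refresh the window; accessed is untouched
        return True

    def report(self, now, failed):
        """Record the outcome of a request to this peer."""
        if failed:
            self.fails += 1
            self.accessed = now
            self.checked = now
        elif self.accessed < self.checked:
            # window was refreshed with no failure since: reset the counter
            self.fails = 0

# Failures every second, well under FAIL_TIMEOUT: every failure bumps
# checked, so the window never expires and fails accumulate.
peer = Peer()
for t in range(3):                 # t = 0, 1, 2
    assert peer.try_select(t)
    peer.report(t, failed=True)
assert not peer.try_select(3)      # skipped: 3 fails within the window
assert peer.try_select(13)         # window expired, one request allowed
peer.report(13, failed=False)
assert peer.fails == 0             # success after expiry resets the counter
```

The final assertions show both halves of the answer: fails accumulate as long as failures keep arriving within fail_timeout of each other, and the counter is only reset by a successful request after a quiet fail_timeout.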

An interesting behavior of NGINX is that on each failure it decrements effective_weight by peer->weight / peer->max_fails (integer division!), and increments it by 1, up to weight, whenever the peer is considered for use. So there is an effective-weight penalty for failures whenever peer->weight / peer->max_fails is not 0. This penalty reduces the number of requests the peer receives, which in turn makes a continuous stream of failures less likely to hold.
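Because the penalty uses C integer division, it truncates toward zero. A quick check (the helper name is ours, not NGINX's):

```python
def failure_penalty(weight, max_fails):
    # mirrors C's peer->weight / peer->max_fails (integer division)
    return weight // max_fails

# With the default weight=1 and max_fails=1, each failure costs the
# full weight:
assert failure_penalty(1, 1) == 1
# But if weight < max_fails, the division truncates to 0 and failures
# never reduce effective_weight at all:
assert failure_penalty(1, 120) == 0
assert failure_penalty(120, 120) == 1
```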

And even while a peer is temporarily disabled, no new requests select that peer, so it is safe to assume that after fail_timeout the counter will be reset once the error window passes and new requests pass the check. If the first request completed after re-enabling does not pass the check, peer->checked is bumped along with peer->accessed and the peer is disabled again immediately.

Nonetheless, this imperfection makes the algorithm nonconformant with the NGINX documentation, which states that fail_timeout is “the time during which the specified number of unsuccessful attempts to communicate with the server should happen to consider the server unavailable.”

NOTE

Takeaway from this: always set weight to an integer greater than or equal to max_fails; otherwise weight / max_fails truncates to 0 and the effective-weight penalty never applies.

History

From Maxim Dounin, https://www.mail-archive.com/nginx-devel@nginx.org/msg00514.html

This is expected behaviour. Documentation is a bit simplified here, and fail_timeout is used like session time limit - the peer->fails counter is reset once there are no failures within fail_timeout.

While this might be non-ideal for some use cases, it’s certainly not a bug.

A proposed fix was rejected because

Such algorithm forget everything about previous failures once per fail_timeout, and won’t detect bursts of failures split across two fail_timeout intervals.

and in a later email,

Well, in normal world if an upstream constantly fails ~1% of requests - it’s not healthy and should not be used. I understand that your use case is a bit special though.

Yes, I know this case, sorry, forgot to mention. However, I think it will extend detection period to 2-3 fail_timeouts in real life (in theory up to max_fails fail_timeouts, yes, but it’s almost improbable). If we want correct implementation we need per-second array (with fail_timeout elements), that’s an overkill in my opinion.

Sure, per-second array isn’t a solution.

By the way, leaky bucket approach (like limit_req but with fails per second) might work well here, what do you think?

Yes, leaky/token bucket should work. That’s actually what I think about if I think about changing the above algorithm to something strictly bound to fail_timeout period.

So Maxim is against the per-second array solution.

A few years later, https://www.mail-archive.com/nginx-devel@nginx.org/msg09804.html

Documentation somewhat oversimplifies things. The fail_timeout setting is essentially a session timeout, and things work as follows:

  1. As long as there are failures, the fails counter is incremented. If fail_timeout passes since last failure, the fails counter is reset to 0 on the next successful request.

  2. If the fails counter reaches max_fails, no more requests are routed to the peer for fail_timeout time. After fail_timeout passes, one request is allowed. If the request is successful, the fails counter is reset to 0, and further requests to the peer are allowed without any limits.

Demo

Set up NGINX with 1 worker process and two local servers listening on ports 3000 and 3001 that always return 503, plus an upstream that points to both servers with max_fails=120 fail_timeout=3s. Then set up a third server listening on port 3002 that proxy_passes to this upstream with proxy_next_upstream error timeout http_503 set. Finally, request the server at roughly 0.5s intervals and observe whether the error log contains “no live upstreams”.
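An nginx.conf along these lines reproduces the setup described above (an untested sketch; ports and parameters are taken from the description):

```nginx
worker_processes 1;

events {}

http {
    upstream backend {
        server 127.0.0.1:3000 max_fails=120 fail_timeout=3s;
        server 127.0.0.1:3001 max_fails=120 fail_timeout=3s;
    }

    # Two upstream servers that always fail.
    server { listen 3000; return 503; }
    server { listen 3001; return 503; }

    # Front server that proxies and retries against both peers on 503.
    server {
        listen 3002;
        location / {
            proxy_pass http://backend;
            proxy_next_upstream error timeout http_503;
        }
    }
}
```

Requests can then be driven with something like `while true; do curl -s localhost:3002; sleep 0.5; done` while tailing the error log.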

The result from NGINX 1.27.4 confirms our assumption. About 2 minutes later, which is enough time for both upstream servers to reach 60 fails because each request is retried against both servers, NGINX logs “no live upstreams” for all subsequent requests.

We also observed a weird behavior: occasionally, the error log would stop for a while after 6 requests, then return to the same pattern of continuous errors or another 6 lines of logs. If the pause is about 2 minutes, the peer->fails counter might have been reset for both peers, so it is likely that state & NGX_PEER_FAILED was somehow false when the used peer was freed.

Backup servers

NGINX only considers the backup servers stored in peers->next if either ngx_http_upstream_get_peer() did not return a peer (returned NULL) from the primary servers, or the single primary server is down or has reached max_conns.