failure detection
timeouts, heartbeats, health checks, and why none of them are free.
failure detection is the problem of determining whether a remote component is functioning correctly. it sounds simple. it is not. the fundamental difficulty is that in an asynchronous distributed system, there is no way to distinguish a crashed node from a very slow node from a network partition.
every failure detection mechanism is an approximation that trades false positives (declaring a healthy node failed) against false negatives (failing to detect an actual failure), and fast detection against certainty.
a timeout is the most primitive form of failure detection. if a request does not return within some period, declare it failed.
the problem is choosing the value. too short: healthy requests time out, false positive rate goes up, you start marking healthy services as failed. too long: you hold threads waiting on actually-failed services, detection is slow, cascades develop before you respond.
the right timeout is not derivable from first principles. it depends on the p99 latency of the service under normal conditions, with some headroom. a service that normally responds in 50ms and occasionally hits 200ms might need a 500ms or 1-second timeout. a service that sometimes does expensive work and takes up to 2 seconds needs a longer timeout, but you are accepting slower detection in exchange.
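the "p99 plus headroom" heuristic can be sketched as a small function. this is an illustrative sketch, not a library API; the function name `suggest_timeout` and the headroom factor are assumptions chosen for the example:

```python
import math

def suggest_timeout(latencies_ms, percentile=0.99, headroom=2.5):
    """Suggest a timeout as a high percentile of observed latency times a
    headroom factor that absorbs GC pauses and occasional slow requests.

    latencies_ms: latencies observed under normal conditions.
    """
    ordered = sorted(latencies_ms)
    # nearest-rank percentile: smallest value covering `percentile` of samples
    rank = max(0, math.ceil(percentile * len(ordered)) - 1)
    return ordered[rank] * headroom

# a service that normally responds in ~50ms with an occasional 200ms spike
samples = [50] * 98 + [200, 200]
print(suggest_timeout(samples))  # → 500.0 (200ms p99 × 2.5 headroom)
```

the headroom factor is a judgment call: larger values mean fewer false positives but slower detection, which is the same tradeoff in numeric form.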
timeouts do not tell you why the request failed. the service might be crashed, overloaded, partitioned, or just temporarily slow. you know only that it did not respond within your deadline.
a heartbeat is a periodic "I am alive" message sent from a component to its monitor, or from a monitor to a component that must respond. if the heartbeat stops for a configured interval, the component is declared failed.
heartbeats detect process failure and connectivity failure. if the process crashes, it stops sending heartbeats. if the network partitions, heartbeats stop arriving even if the process is healthy.
the tradeoff is interval vs certainty. a 1-second heartbeat with a 3-missed-heartbeats threshold detects failure in ~3 seconds. a 10-second heartbeat with the same threshold detects failure in ~30 seconds. faster detection means smaller intervals, which means more network traffic and more CPU overhead for the health-checking infrastructure.
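the interval-times-misses arithmetic can be made concrete with a minimal monitor. this is a sketch under assumptions (the class name `HeartbeatMonitor` and the injectable clock are inventions for the example, not a real library):

```python
import time

class HeartbeatMonitor:
    """Declares a peer failed after `misses` consecutive missed heartbeats."""

    def __init__(self, interval_s=1.0, misses=3, clock=time.monotonic):
        self.interval_s = interval_s
        self.misses = misses
        self.clock = clock
        self.last_seen = clock()

    def beat(self):
        """Record a heartbeat arrival."""
        self.last_seen = self.clock()

    def is_failed(self):
        # detection latency is roughly interval * misses: with a 1-second
        # interval and 3 misses, failure is declared ~3s after the last beat.
        return self.clock() - self.last_seen > self.interval_s * self.misses
```

the clock is injected so the threshold logic can be tested without real sleeps; production failure detectors do the same to keep the policy deterministic.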
heartbeats also have false positive problems. a temporarily overloaded process might miss a heartbeat while it is garbage collecting or handling a spike. a flapping network link might cause intermittent heartbeat loss. declaring the component failed in these cases triggers unnecessary failovers and recovery operations.
adaptive phi accrual failure detection (used in systems like Akka and Cassandra) addresses this by outputting a suspicion score rather than a binary alive/dead signal. the score increases continuously when heartbeats are late, based on historical heartbeat arrival timing. callers decide on their own threshold for treating the score as "failed." this is more nuanced than binary detection but adds complexity.
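the core of phi accrual is a suspicion score derived from the distribution of past inter-arrival times. a minimal sketch, assuming inter-arrival times are modeled as a normal distribution (real implementations estimate the mean and variance from a sliding window of observed arrivals):

```python
import math

def phi(time_since_last, mean, stddev):
    """Suspicion score in the style of the phi accrual failure detector.

    Returns -log10 of the probability that a heartbeat this late (or later)
    would still arrive, given normally distributed inter-arrival times.
    phi = 1 means roughly a 10% chance the node is alive, phi = 2 means
    roughly 1%, and so on; callers pick their own threshold.
    """
    z = (time_since_last - mean) / stddev
    # tail probability P(X > t) for a normal distribution
    p_later = 0.5 * math.erfc(z / math.sqrt(2))
    # clamp to avoid log(0) when the heartbeat is extremely late
    return -math.log10(max(p_later, 1e-12))

# heartbeats normally arrive every 1.0s with 0.1s of jitter
phi(1.0, 1.0, 0.1)  # on time: p_later is 0.5, phi ≈ 0.30, low suspicion
phi(1.5, 1.0, 0.1)  # 0.5s late: phi > 5, very suspicious
```

the continuous score is what makes the detector tunable: a failover system might act at phi = 8 while a load balancer deprioritizes an instance at phi = 3, both from the same stream of heartbeats.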
a health check is an endpoint (usually /health or /healthz) that the component exposes to report its own status. load balancers, orchestration systems, and monitoring tools poll this endpoint to decide whether to route traffic to the instance.
health checks are only as good as what they check. a health check that always returns 200 OK is worse than no health check at all, it actively misleads the load balancer. a health check that exercises the actual dependencies (can we reach the database? can we make a basic query?) is more informative.
the risk of deep health checks is that they can cause correlated failures. if every instance of a service runs a health check that connects to the database, and the database gets a spike in connections from all those health checks simultaneously, you have turned your health checking into a denial-of-service attack on your own database. shallow health checks (is the process running, can it accept connections) are safer to run frequently. deeper checks should be rate-limited.
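one way to get the benefit of deep checks without the correlated-failure risk is to answer every poll from a cached deep result that is refreshed at most once per interval. a sketch under assumptions, with `check_database` standing in for a real dependency probe:

```python
import time

def check_database():
    """Hypothetical dependency probe; stands in for a real database ping."""
    return True

class HealthCheck:
    """Shallow check on every poll; expensive deep check at most once per minute."""

    def __init__(self, deep_interval_s=60.0, clock=time.monotonic):
        self.deep_interval_s = deep_interval_s
        self.clock = clock
        self.last_deep = float("-inf")
        self.last_deep_ok = True

    def status(self):
        now = self.clock()
        if now - self.last_deep >= self.deep_interval_s:
            # rate-limit the dependency check so a fleet of instances being
            # polled on /health does not hammer the database in lockstep
            self.last_deep_ok = check_database()
            self.last_deep = now
        return {"process": "ok", "database": "ok" if self.last_deep_ok else "failing"}
```

in a real fleet you would also jitter the deep-check interval per instance, so that refreshes do not synchronize across the fleet.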
liveness vs readiness is a useful distinction from kubernetes. a liveness probe asks "should this process be restarted?" — failing it gets the container killed and replaced. a readiness probe asks "should this instance receive traffic?" — failing it removes the instance from load balancing without restarting it. conflating the two is a common mistake: an instance that is temporarily not ready (warming a cache, waiting on a dependency) should stop receiving traffic, not be restarted in a loop.
all failure detection mechanisms have an inherent latency. there is always a window between when a component fails and when its failure is detected and acted upon. during that window, traffic continues to be sent to the failed component.
this window is not eliminable. you can reduce it by making detection faster, but faster detection increases false positive rates. every failure detection system is choosing a point on that tradeoff curve.
a second limit: failure detection can only detect failure at the granularity of what it checks. a health check that returns 200 OK does not tell you the service is degraded, overloaded, or returning wrong data for some subset of requests. it only tells you that the health check endpoint responded successfully. partial failures, as discussed in the previous page, often go undetected by standard health checks.
this is why failure detection is not a complete solution to the failure problem, it is one tool in a larger set. circuit breakers track success rates over a window of actual requests rather than relying on health check endpoints. latency tracking catches slow services that pass health checks. distributed tracing catches failures that only occur on specific code paths.
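the circuit-breaker signal mentioned above is just a success rate computed over a window of real requests. a minimal sketch (the class name `SuccessRateWindow` and the threshold default are inventions for the example):

```python
from collections import deque

class SuccessRateWindow:
    """Tracks success rate over the last N real requests — the signal a
    circuit breaker uses instead of a separate health check endpoint."""

    def __init__(self, window=100, threshold=0.5):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold

    def record(self, ok):
        self.results.append(ok)

    def should_open(self):
        # open the circuit when the observed success rate over the window
        # drops below the threshold; an empty window gives no signal
        if not self.results:
            return False
        return sum(self.results) / len(self.results) < self.threshold
```

because the window is fed by real traffic on real code paths, it catches the degraded-but-health-check-passing cases that endpoint polling misses.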
failure detection tells you when something is definitely broken. it does not tell you when something is subtly wrong.