cascading failures
how one slow service takes down everything upstream.
a cascading failure is when the failure of one component causes failures in the components that depend on it, which cause failures in their dependents, until large parts of a system are down. the original failure is often small. the cascade is what makes it catastrophic.
the mechanism is not mysterious. it follows directly from how systems behave under load when a downstream dependency slows down.
assume service A depends on service B. under normal operation, A sends requests to B, B responds quickly, A's threads are held briefly and released.
now B starts responding slowly. not crashing, just slow. A's threads are held longer as they block waiting for B's responses. the thread pool that handles requests to A starts filling up. new incoming requests to A find no available threads. they queue. the queue fills up. A stops responding.
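this sequence is easy to reproduce. a minimal sketch in python, with hypothetical pool sizes and latencies (4 worker threads standing in for A's pool, a sleep standing in for the call to B):

```python
import concurrent.futures
import time

POOL_SIZE = 4  # illustrative: A's request-handling thread pool

def call_b(latency):
    # the thread blocks for the full downstream latency, doing no other work
    time.sleep(latency)
    return "ok"

def serve_burst(pool, n_requests, b_latency):
    # submit n_requests; anything beyond POOL_SIZE waits in the executor's queue
    start = time.monotonic()
    futures = [pool.submit(call_b, b_latency) for _ in range(n_requests)]
    for f in futures:
        f.result()
    return time.monotonic() - start

with concurrent.futures.ThreadPoolExecutor(POOL_SIZE) as pool:
    fast = serve_burst(pool, 8, 0.01)  # B healthy: 10ms per call
    slow = serve_burst(pool, 8, 0.5)   # B degraded: 500ms per call
```

the same 8 requests that drained in a couple of hundredths of a second now take around a second, because every thread spends its time blocked on B. scale the latency gap up and the queue behind the pool grows without bound.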
from the perspective of A's callers, A has failed. A's callers (service C and service D) are now in the same position A was in. their threads fill with slow requests to A. they start queuing. they become unresponsive.
the original problem was B being slow. the outcome is A, C, and D all down, plus B. one degradation has propagated through the call graph.
a crashed service fails fast. requests to it get immediate connection refused errors. callers detect the failure quickly, increment error counters, and stop sending traffic (if they have circuit breakers) or at least fail fast enough that their own thread pools do not fill up.
a slow service holds connections open. requests do not error. they wait. callers' threads sit blocking on I/O. the thread pool drains slowly. the caller starts queuing requests. by the time the caller detects something is wrong, its own capacity is already compromised.
this is why latency is a more dangerous failure mode than a crash. it is insidious. it spreads before it is detected.
the concrete mechanism in most cascading failures is resource exhaustion. the resources that get exhausted:
threads: a thread waiting on a slow downstream call is not processing new requests. if your service has 100 threads and calls to B hold each one for 10 seconds instead of 10ms, capacity drops by a factor of 1000.
connection pools: database connections, http client connections, anything that limits how many outstanding requests you can have. a slow service holds connections open, pool exhaustion prevents new requests.
memory: request queues consume memory. a service that queues incoming requests while its thread pool is exhausted will eventually run out of heap if the queue is unbounded.
file descriptors: open sockets consume file descriptors. a leak of open connections to a slow service accumulates until the OS limit is hit.
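the thread arithmetic above is worth making explicit. a back-of-the-envelope sketch, assuming steady load (this is little's law applied to a thread pool: sustainable throughput equals threads divided by how long each request holds a thread):

```python
threads = 100

healthy_hold = 0.010   # each request holds a thread for 10 ms when B is fast
degraded_hold = 10.0   # each request holds a thread for 10 s when B is slow

healthy_capacity = threads / healthy_hold    # 10,000 requests/sec
degraded_capacity = threads / degraded_hold  # 10 requests/sec

capacity_drop = healthy_capacity / degraded_capacity  # 1000x
```

nothing about the service changed. the same 100 threads went from sustaining 10,000 requests per second to 10, purely because each request holds its thread 1000 times longer.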
when callers detect latency or errors, the natural response is to retry. this is usually the right behavior in isolation. but under a cascade, retries increase load on an already stressed system.
service B is struggling with 1000 requests per second. B gets slow. B's callers retry their timed-out requests. now B is handling 2000 or 3000 requests per second (plus the original 1000 that are still in flight and timing out). B gets slower. more retries. the load multiplies.
exponential backoff and jitter exist specifically to prevent retry storms. but they require implementation discipline across every client. a single service that retries aggressively can trigger a cascade even when everyone else backs off correctly.
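a common variant is "full jitter": each retry waits a random duration between zero and an exponentially growing cap, so synchronized clients spread out instead of retrying in lockstep. a minimal sketch (function name and defaults are illustrative assumptions):

```python
import random

def backoff_with_jitter(attempt, base=0.1, cap=10.0):
    """return how long to sleep before retry number `attempt` (0-indexed).

    the ceiling doubles each attempt (base * 2**attempt) up to `cap`,
    and the actual wait is drawn uniformly from [0, ceiling) so that
    clients that failed at the same instant do not retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

the randomness is the point: exponential backoff alone reduces total retry load, but without jitter every client that failed together retries together, hitting the recovering service in synchronized waves.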
the patterns that prevent cascading failures are covered in the resilience chapter, but the principles follow from the mechanism:
timeouts: bound how long a request can wait. do not let threads hold indefinitely. the tradeoff is choosing the right timeout: too short and you get spurious failures, too long and you get slow cascades instead of fast ones.
circuit breakers: when a downstream service is failing, stop sending it traffic. fail fast at the circuit breaker rather than queueing up more work behind an already-struggling service. give the downstream service time to recover.
bulkheads: isolate resource pools for different dependencies. if calls to B exhaust a dedicated thread pool, the threads handling other traffic are unaffected. the failure is contained to the service-B portion of your capacity.
back-pressure: instead of queueing indefinitely, signal to upstream services that you are overloaded. let the pressure propagate up the call chain so each layer can make an informed decision about what to do, rather than queuing until memory is exhausted.
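of these patterns, the circuit breaker is the most mechanical to show in code. a minimal sketch in python (the class name, thresholds, and three-state handling are illustrative assumptions, not a production implementation or any particular library's API):

```python
import time

class CircuitBreaker:
    """closed: requests flow through, consecutive failures are counted.
    open: requests fail fast without touching the downstream service.
    half-open: after a cooldown, one trial request is allowed through."""

    def __init__(self, failure_threshold=5, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                # fail fast: no thread held, no queue behind a struggling service
                raise RuntimeError("circuit open: failing fast")
            # cooldown elapsed: half-open, let this one trial request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit again
        return result
```

the half-open state is what gives the downstream service room to recover: instead of the full retry load arriving the moment the cooldown expires, a single probe decides whether the circuit closes or trips again.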
cascading failures are predictable. the conditions that create them (slow dependencies, unbounded queues, shared resource pools, missing timeouts) are diagnosable before they cause incidents. the investment in preventing them is less than the investment in recovering from them.