status
this chapter is in active development
expect live edits and rapid iteration (except for when i am really busy with other stuff) while this material is written.
partial failure
what makes distributed failure different from a crashed process.
in a single process, failure is usually binary. the process is running or it is not. an exception is thrown or it is not. the function returned or it did not. recovery is clear: restart, retry, catch and handle.
distributed systems do not work this way. a partial failure is when some components continue functioning while others fail, and the system as a whole continues operating in some degraded, inconsistent, or unpredictable state. this is the defining characteristic of distributed failure, and it makes everything harder.
a database replica falls behind but is still accepting reads. queries return stale data. nothing errors. everything appears normal to the caller.
a service is accepting connections but its connection pool to downstream dependencies is exhausted. some requests complete immediately; others block until timeout. latency goes through the roof for some percentage of traffic.
one availability zone loses connectivity to another. half your nodes can reach the cache; half cannot. the half that cannot start hammering the origin. the half that can look fine.
a disk is failing but not yet failed. writes succeed. reads from some sectors succeed; reads from others hang for seconds before returning errors.
in all these cases, the system is partially functional. it is not down. there is no alarm going off. the partial failure is invisible to observers who are only looking at the success path.
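the stale-replica scenario can be made concrete with a small sketch. the `Replica` class here is purely illustrative, an in-memory stand-in for a real replication setup:

```python
# hypothetical in-memory "replica" whose replication stream has stalled.
# a stale read returns success -- nothing distinguishes it from a fresh one.
class Replica:
    def __init__(self):
        self.data = {}
        self.applied_up_to = 0  # last replicated log position

    def apply(self, pos, key, value):
        self.data[key] = value
        self.applied_up_to = pos

    def read(self, key):
        # succeeds even though replication has stopped
        return self.data.get(key)

primary = {"balance": 100}
replica = Replica()
replica.apply(1, "balance", 100)

# the primary takes a write; replication stalls, the replica never sees it
primary["balance"] = 50

# the replica read "works": no exception, no error code, just stale data
print(replica.read("balance"))  # → 100, while the primary says 50
```

the caller gets a successful response either way; only by comparing against the primary (or tracking replication lag) would anyone notice.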
the challenge is observability. a crashed process emits a clear signal: connection refused, process not found, immediate error. a partially failed process may still accept connections, may still return responses, may look healthy to a health check endpoint that does not exercise the degraded path.
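a sketch of that gap between shallow and deep health checks. the names here (`check_shallow`, `check_deep`, `Pool`) are illustrative, not any real framework's API:

```python
# a shallow health check only proves the process is up; a deep check
# exercises the dependency path that can be partially failed.
class Pool:
    def __init__(self, exhausted):
        self.exhausted = exhausted

    def ping_downstream(self):
        if self.exhausted:
            raise TimeoutError("no free connections")
        return True

def check_shallow(pool):
    # the process is alive, so this always reports healthy
    return "ok"

def check_deep(pool):
    try:
        pool.ping_downstream()
        return "ok"
    except TimeoutError:
        return "degraded"

pool = Pool(exhausted=True)
print(check_shallow(pool))  # → ok       (partial failure invisible)
print(check_deep(pool))     # → degraded (the real path is exercised)
```

the tradeoff is that deep checks are more expensive and can themselves flap under load, which is why many systems run both at different intervals.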
the other challenge is ambiguity from the caller's perspective. if you send a request and do not get a response, you do not know whether:
the request was lost before it ever arrived.
the remote process crashed before handling it.
the remote process is still working on it and will respond late.
the response was sent and lost on the way back.
this is the fundamental problem of distributed systems: you cannot distinguish between "the request failed" and "the response is delayed." in a single process, every function call returns (eventually). in a distributed system, a request can disappear into the network and never come back.
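this indistinguishability can be simulated in a few lines. here a "remote call" passes messages over queues; when the caller's wait times out, all four explanations look identical from its side:

```python
import queue, threading, time

def call(server_inbox, reply_box, payload, timeout):
    server_inbox.put(payload)
    try:
        return ("ok", reply_box.get(timeout=timeout))
    except queue.Empty:
        # not "failed" -- genuinely unknown. the server may be dead,
        # slow, or the reply may be lost; the caller cannot tell.
        return ("unknown", None)

inbox, replies = queue.Queue(), queue.Queue()

def slow_server():
    req = inbox.get()
    time.sleep(0.2)           # slower than the caller's timeout
    replies.put(req.upper())  # the work actually happens, just late

threading.Thread(target=slow_server, daemon=True).start()
status, result = call(inbox, replies, "ping", timeout=0.05)
print(status)  # → unknown: the work may still complete after we gave up
```

note that the server here does finish its work after the caller has given up, which is exactly why blind retries of non-idempotent requests are dangerous.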
it helps to think about failure as a spectrum rather than binary:
healthy: processing requests correctly, at normal latency.
degraded: processing requests, but slowly, or only some requests successfully. the system is not down but it is not right either.
overloaded: processing a fraction of normal load, dropping or queuing the rest. accepting connections but building up backlogs.
unresponsive: not returning responses, but not sending errors either. connections hang open. timeouts pile up on the callers' side.
crashed: not accepting connections at all. the clearest failure mode, and often the easiest to recover from.
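the spectrum above can be rendered as a rough classifier over two observable signals, success rate and tail latency. the thresholds are illustrative, not prescriptive:

```python
def classify(success_rate, p99_latency_ms, accepting_connections=True):
    """map coarse metrics onto the failure spectrum (illustrative cutoffs)."""
    if not accepting_connections:
        return "crashed"
    if success_rate == 0.0:
        return "unresponsive"   # accepting connections, answering nothing
    if success_rate < 0.5:
        return "overloaded"     # shedding or queuing most of the load
    if success_rate < 0.99 or p99_latency_ms > 500:
        return "degraded"       # up, but not right
    return "healthy"

print(classify(0.999, 40))   # → healthy
print(classify(0.95, 1200))  # → degraded
print(classify(0.3, 4000))   # → overloaded
print(classify(0.0, 0))      # → unresponsive
print(classify(0.0, 0, accepting_connections=False))  # → crashed
```

the point is less the exact thresholds than the shape: only the last state is visible to a check that merely opens a connection.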
most production incidents involve the middle three states, not the last one. a crashed service triggers alerts and on-call rotations. a degraded service erodes reliability silently until a threshold is breached or a user complains.
building systems that handle partial failure means:
bounding every remote call with a timeout, so ambiguity resolves in finite time instead of hanging forever.
health checks that exercise the real dependency path, not just process liveness.
designing for the middle states, degraded and overloaded, not just up and down.
retrying carefully, so recovery traffic does not push an already degraded system into collapse.
partial failure is not a bug in distributed systems. it is a structural property of running software across multiple machines connected by a network. designing around it is what distributed systems engineering is largely about.