status
this chapter is in active development
expect live edits and rapid iteration (except for when i am really busy with other stuff) while this material is written.
partial failure
what makes distributed failure different from a crashed process.
in a single process, failure is usually binary. the process is running or it is not. an exception is thrown or it is not. the function returned or it did not. recovery is clear: restart, retry, catch and handle.
distributed systems do not work this way. a partial failure is when some components continue functioning while others fail, and the system as a whole continues operating in some degraded, inconsistent, or unpredictable state. this is the defining characteristic of distributed failure, and it makes everything harder.
a database replica falls behind but is still accepting reads. queries return stale data. nothing errors. everything appears normal to the caller.
a service is accepting connections but its connection pool to downstream dependencies is exhausted. some requests complete immediately; others block until timeout. latency goes through the roof for some percentage of traffic.
one availability zone loses connectivity to another. half your nodes can reach the cache; half cannot. the half that cannot start hammering the origin. the half that can look fine.
a disk is failing but not yet failed. writes succeed. reads from some sectors succeed; reads from others hang for seconds before returning errors.
in all these cases, the system is partially functional. it is not down. there is no alarm going off. the partial failure is invisible to observers who are only looking at the success path.
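the stale-replica scenario can be made concrete with a small sketch. the `Replica` class here is purely illustrative, an in-memory stand-in for a real replication setup:

```python
# hypothetical in-memory "replica" whose replication stream has stalled.
# a stale read returns success -- nothing distinguishes it from a fresh one.
class Replica:
    def __init__(self):
        self.data = {}
        self.applied_up_to = 0  # last replicated log position

    def apply(self, pos, key, value):
        self.data[key] = value
        self.applied_up_to = pos

    def read(self, key):
        # succeeds even though replication has stopped
        return self.data.get(key)

primary = {"balance": 100}
replica = Replica()
replica.apply(1, "balance", 100)

# the primary takes a write; replication stalls, the replica never sees it
primary["balance"] = 50

# the replica read "works": no exception, no error code, just stale data
print(replica.read("balance"))  # → 100, while the primary says 50
```

the caller gets a successful response either way; only by comparing against the primary (or tracking replication lag) would anyone notice.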
the challenge is observability. a crashed process emits a clear signal: connection refused, process not found, immediate error. a partially failed process may still accept connections, may still return responses, may look healthy to a health check endpoint that does not exercise the degraded path.
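a sketch of that gap between shallow and deep health checks. the names here (`check_shallow`, `check_deep`, `Pool`) are illustrative, not any real framework's API:

```python
# a shallow health check only proves the process is up; a deep check
# exercises the dependency path that can be partially failed.
class Pool:
    def __init__(self, exhausted):
        self.exhausted = exhausted

    def ping_downstream(self):
        if self.exhausted:
            raise TimeoutError("no free connections")
        return True

def check_shallow(pool):
    # the process is alive, so this always reports healthy
    return "ok"

def check_deep(pool):
    try:
        pool.ping_downstream()
        return "ok"
    except TimeoutError:
        return "degraded"

pool = Pool(exhausted=True)
print(check_shallow(pool))  # → ok       (partial failure invisible)
print(check_deep(pool))     # → degraded (the real path is exercised)
```

the tradeoff is that deep checks are more expensive and can themselves flap under load, which is why many systems run both at different intervals.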
the other challenge is ambiguity from the caller's perspective. if you send a request and do not get a response, you do not know whether:
the request was lost before it ever arrived.
the remote process crashed before handling it.
the remote process is still working on it and will respond late.
the response was sent and lost on the way back.
this is the fundamental problem of distributed systems: you cannot distinguish between "the request failed" and "the response is delayed." in a single process, every function call returns (eventually). in a distributed system, a request can disappear into the network and never come back.
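this indistinguishability can be simulated in a few lines. here a "remote call" passes messages over queues; when the caller's wait times out, all four explanations look identical from its side:

```python
import queue, threading, time

def call(server_inbox, reply_box, payload, timeout):
    server_inbox.put(payload)
    try:
        return ("ok", reply_box.get(timeout=timeout))
    except queue.Empty:
        # not "failed" -- genuinely unknown. the server may be dead,
        # slow, or the reply may be lost; the caller cannot tell.
        return ("unknown", None)

inbox, replies = queue.Queue(), queue.Queue()

def slow_server():
    req = inbox.get()
    time.sleep(0.2)           # slower than the caller's timeout
    replies.put(req.upper())  # the work actually happens, just late

threading.Thread(target=slow_server, daemon=True).start()
status, result = call(inbox, replies, "ping", timeout=0.05)
print(status)  # → unknown: the work may still complete after we gave up
```

note that the server here does finish its work after the caller has given up, which is exactly why blind retries of non-idempotent requests are dangerous.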
it helps to think about failure as a spectrum rather than binary:
healthy: processing requests correctly, at normal latency.
degraded: processing requests, but slowly, or only some requests successfully. the system is not down but it is not right either.
overloaded: processing a fraction of normal load, dropping or queuing the rest. accepting connections but building up backlogs.
unresponsive: not returning responses, but not sending errors either. connections hang open. timeouts pile up on the callers' side.
crashed: not accepting connections at all. the clearest failure mode, and often the easiest to recover from.
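the spectrum above can be rendered as a rough classifier over two observable signals, success rate and tail latency. the thresholds are illustrative, not prescriptive:

```python
def classify(success_rate, p99_latency_ms, accepting_connections=True):
    """map coarse metrics onto the failure spectrum (illustrative cutoffs)."""
    if not accepting_connections:
        return "crashed"
    if success_rate == 0.0:
        return "unresponsive"   # accepting connections, answering nothing
    if success_rate < 0.5:
        return "overloaded"     # shedding or queuing most of the load
    if success_rate < 0.99 or p99_latency_ms > 500:
        return "degraded"       # up, but not right
    return "healthy"

print(classify(0.999, 40))   # → healthy
print(classify(0.95, 1200))  # → degraded
print(classify(0.3, 4000))   # → overloaded
print(classify(0.0, 0))      # → unresponsive
print(classify(0.0, 0, accepting_connections=False))  # → crashed
```

the point is less the exact thresholds than the shape: only the last state is visible to a check that merely opens a connection.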
most production incidents involve the middle three states, not the last one. a crashed service triggers alerts and on-call rotations. a degraded service erodes reliability silently until a threshold is breached or a user complains.
building systems that handle partial failure means:
bounding every remote call with a timeout, so ambiguity resolves in finite time instead of hanging forever.
health checks that exercise the real dependency path, not just process liveness.
designing for the middle states, degraded and overloaded, not just up and down.
retrying carefully, so recovery traffic does not push an already degraded system into collapse.
partial failure is not a bug in distributed systems. it is a structural property of running software across multiple machines connected by a network. designing around it is what distributed systems engineering is largely about.