failure
distributed systems fail differently. understanding how is the foundation for everything else.
a single-process program fails in a comprehensible way: it crashes, throws an exception, or returns wrong output. you can tell. you can react. the failure is visible and usually total.
distributed systems fail differently. components fail partially. the network lies. a service is reachable but slow. a dependency is accepting requests but not completing them. from the outside, you cannot always tell whether you are talking to a healthy system or a dying one.
this chapter is the foundation for the rest of the series. you cannot understand circuit breakers without knowing what cascading failures look like. you cannot design retry logic without understanding the difference between a slow service and a crashed one. you cannot build a reliable system without knowing what you are defending against.
partial failure is the defining characteristic of distributed failure. not crashes, partial function. some requests succeed while others hang. one replica is healthy; another is not. the system as a whole is neither fully up nor fully down.
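a minimal sketch of what that looks like from the caller's side. the replica names and their behaviors are invented for illustration; the hung replica is simulated with an exception where a real call would block until a deadline fired:

```python
# three replicas of the same hypothetical service. two answer; one has
# silently stopped responding.

def replica_a(request):
    return "ok"

def replica_b(request):
    return "ok"

def replica_c(request):
    # stands in for a hung replica: a real call would block here
    # until the caller's timeout expired
    raise TimeoutError("no response within deadline")

def call_all(request):
    results = {}
    for name, replica in [("a", replica_a), ("b", replica_b), ("c", replica_c)]:
        try:
            results[name] = replica(request)
        except TimeoutError:
            results[name] = "timed out"
    return results

print(call_all("GET /users/42"))
# two replicas succeed, one does not: the system as a whole is
# neither fully up nor fully down
```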
network partitions is what actually happens when nodes cannot reach each other. not the theorem, the mechanics. what the system sees, what it cannot determine, and what it has to decide under uncertainty.
cascading failures is how one slow service takes down everything upstream. the mechanism is specific and predictable. knowing it lets you break the chain before it starts.
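the mechanism can be shown with a toy model, assuming an upstream service with a fixed pool of workers and a hard timeout (both numbers invented). when the downstream is slow enough that workers never free up, the pool fills, and the upstream starts rejecting everything, including requests that never needed the slow dependency:

```python
# toy model of the cascade: each request that hits a too-slow
# downstream parks a worker; once the pool is exhausted, the upstream
# itself looks down to its own callers.

POOL_SIZE = 10

def simulate(requests, downstream_latency_s, timeout_s=1.0):
    busy_workers = 0
    rejected = 0
    for _ in range(requests):
        if busy_workers >= POOL_SIZE:
            rejected += 1      # no worker free: upstream appears down
        elif downstream_latency_s >= timeout_s:
            busy_workers += 1  # worker stuck waiting on the downstream
    return rejected

# healthy downstream: workers never pile up, nothing is rejected
assert simulate(100, downstream_latency_s=0.05) == 0
# slow downstream: the first 10 requests exhaust the pool, the other 90 fail
assert simulate(100, downstream_latency_s=5.0) == 90
```

the chain-breaking tools covered later, circuit breakers and bounded pools among them, all target the same step in this model: stopping workers from parking on a dependency that will not answer in time.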
latency vs outage covers a practical distinction that matters for how you respond. a service that is down and a service that is very slow look different, are detected differently, and need different responses.
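a sketch of why the two look different to a caller, with the failure modes simulated: a crashed service commonly refuses connections almost instantly, while a slow one holds the connection open and spends the caller's entire timeout budget before failing:

```python
import time

def call_crashed():
    # a dead process typically fails fast: connection refused
    raise ConnectionRefusedError

def call_slow(timeout_s=0.2):
    # a slow process fails late: the caller waits out the full deadline
    time.sleep(timeout_s)
    raise TimeoutError

def classify(call):
    start = time.monotonic()
    try:
        call()
    except ConnectionRefusedError:
        return ("down", time.monotonic() - start)
    except TimeoutError:
        return ("slow", time.monotonic() - start)

print(classify(call_crashed))              # detected almost immediately
print(classify(lambda: call_slow(0.05)))   # detection costs the full deadline
```

the asymmetry in elapsed time is the practical point: a dead dependency is cheap to detect, while a slow one ties up your resources for the whole deadline on every attempt, which is part of why slow is often more dangerous than down.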
failure detection is timeouts, heartbeats, and health checks, and why none of them are free. they are all approximations of the same unsolvable problem: determining whether a remote system is alive.
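a minimal heartbeat-based detector, sketched to show the approximation: all it can measure is silence, and the threshold is a choice it has to make. the values here are illustrative, and note what the detector cannot do, which is distinguish dead from slow from partitioned:

```python
class HeartbeatDetector:
    def __init__(self, suspect_after_s):
        # the threshold trades false alarms (too low) against
        # slow detection (too high); there is no free setting
        self.suspect_after_s = suspect_after_s
        self.last_seen = {}  # peer -> timestamp of last heartbeat

    def heartbeat(self, peer, now):
        self.last_seen[peer] = now

    def is_suspect(self, peer, now):
        last = self.last_seen.get(peer)
        if last is None:
            return True  # never heard from: suspect by definition
        return (now - last) > self.suspect_after_s

d = HeartbeatDetector(suspect_after_s=3.0)
d.heartbeat("node-1", now=10.0)
d.is_suspect("node-1", now=12.0)  # False: heard from it recently
d.is_suspect("node-1", now=14.0)  # True: silent too long -- but is it
                                  # dead, slow, or cut off by a partition?
```

the last line is the unsolvable part: "suspect" is the strongest verdict the detector can honestly give, which is why the mechanisms built on top of it have to be designed to survive being wrong.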