a network partition is when a group of nodes in a distributed system cannot communicate with another group. not slowness, not packet loss, complete inability to exchange messages. the network has split the cluster into isolated islands.
partitions are not exotic failure modes. they happen because of misconfigured switches, overloaded routers, BGP misconfigurations, cloud provider maintenance, physical cable failures, firewall rule changes. in large enough deployments, they are a regular occurrence.
imagine a three-node cluster: A, B, and C. A network failure occurs between A and the {B, C} pair. A can no longer reach B or C. B and C can still reach each other.
from A's perspective: it sends messages to B and C and gets no response. it does not know if:
- B and C are crashed
- B and C are alive but A's network path to them is broken
- the partition is affecting only outgoing traffic from A, and B/C are still receiving writes from clients
from B and C's perspective: A has gone silent. A might be crashed. A might still be receiving and processing writes from clients that happen to route to A. they do not know.
this is the key property of a partition: the isolated nodes cannot distinguish partition from crash. the observable symptoms are identical. the difference is that in a crash, the node eventually comes back and there is no state to reconcile. in a partition, the node may still be processing requests, and when the partition heals, the system has two diverged histories to merge.
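the ambiguity can be made concrete with a minimal health probe (a hypothetical helper, not from any particular system): all the caller ever observes is a boolean, and a crashed peer, a partitioned peer, and a dropped packet all produce the same one.

```python
import socket

def probe(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """attempt a TCP connection to a peer; False means only 'unreachable'."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # a crashed process, a network partition, and a lost packet
        # all land here. nothing in this signal distinguishes them.
        return False
```

everything a failure detector builds on top of this (timeouts, retry counts, gossip) is a heuristic layered over the same ambiguous signal.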
a partitioned node has to decide what to do:
keep serving requests (availability over consistency): the partitioned node continues accepting reads and writes. it cannot coordinate with the other partition, so it may serve stale data, accept writes that conflict with writes on the other side, or both. when the partition heals, the system has to reconcile diverged state.
stop serving requests (consistency over availability): the partitioned node refuses requests it cannot confirm with a quorum of nodes. reads return errors. writes are rejected. the system is unavailable for that partition until it can communicate again, but there is no diverged state to reconcile.
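the consistency-over-availability choice can be sketched as a quorum gate. this is a minimal illustration assuming a hypothetical `send_to_replica` transport; the names are illustrative, not any real database's API.

```python
def quorum_write(key, value, replicas, send_to_replica) -> bool:
    """accept a write only if a majority of replicas acknowledge it."""
    needed = len(replicas) // 2 + 1
    acks = 0
    for node in replicas:
        try:
            if send_to_replica(node, key, value):
                acks += 1
        except TimeoutError:
            continue  # an unreachable node simply contributes no ack
    return acks >= needed
```

during a partition, the minority side can never gather a majority, so its writes are rejected and no state diverges. the cost is that the minority side is unavailable for writes until the partition heals.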
serve reads, reject writes: a hybrid. reads might return stale data (an acknowledged tradeoff), but writes that could create conflicts are refused. some systems use this as a middle ground.
which choice is correct depends on what the system is for. a key-value cache for user sessions can serve stale reads. the user's session still works, just possibly with stale preferences. a financial ledger cannot accept writes on a partitioned node. the risk of writing a debit that the other partition never sees is too high.
when the network recovers, the partitioned nodes can communicate again. what happens next depends on what the system did during the partition.
if both sides rejected writes, reconciliation is straightforward: the nodes synchronize and continue.
if both sides accepted writes, reconciliation requires conflict resolution. techniques include:
- last-write-wins: whichever write has the later timestamp wins. fast, simple, lossy. clocks in distributed systems are not synchronized precisely, so "later timestamp" is unreliable.
- vector clocks: track causality explicitly. each write carries a version vector. conflicting writes are those where neither happens-before the other; those conflicts are surfaced to the application for resolution.
- CRDTs (conflict-free replicated data types): data structures designed so all concurrent operations can be merged without conflicts. elegant for specific use cases (counters, sets, maps with specific semantics), not universally applicable.
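the vector-clock test for "neither happens-before the other" is mechanical. a sketch, assuming version vectors stored as per-node counter dicts (the standard comparison, not any particular database's implementation):

```python
def happens_before(a: dict, b: dict) -> bool:
    """True if version vector a causally precedes b."""
    keys = set(a) | set(b)
    return (all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))

def concurrent(a: dict, b: dict) -> bool:
    """neither precedes the other: a genuine conflict to surface."""
    if a == b:
        return False  # identical vectors are the same version, not a conflict
    return not happens_before(a, b) and not happens_before(b, a)
```

two writes that each saw the other side's counter at 1 but incremented different slots compare as concurrent, which is exactly the case that must go back to the application.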
most databases with eventual consistency have already chosen a conflict resolution strategy. it is worth knowing which one, because the failure mode (lost writes, unexpected overwrites) is invisible until it bites you.
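the CRDT approach mentioned above is easiest to see in its simplest instance, a grow-only counter. a minimal sketch: each node increments only its own slot, and merge takes the per-node maximum, so concurrent increments from both sides of a partition always converge to the same total.

```python
class GCounter:
    """grow-only counter CRDT: per-node counts merged by element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        # a node only ever touches its own slot
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # element-wise max is commutative, associative, and idempotent,
        # so merge order does not matter when the partition heals
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)
```

the catch is in the name: it only grows. supporting decrements, removals, or arbitrary data requires more elaborate CRDTs with their own semantic constraints, which is why the technique is elegant but not universally applicable.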
a partition is the extreme case of packet loss. in practice, networks often exhibit partial packet loss: some messages get through, some do not, with non-deterministic patterns.
this is harder to reason about than a clean partition. a node might receive 80% of messages from its neighbors. is it partitioned? is it not? timeouts and quorum calculations get complicated. most distributed system design reasons about clean partitions as a simplifying assumption, knowing that partial loss is worse and less tractable.
the question is not whether your system will experience a partition. it is whether your system has a defined behavior when it does.
systems that do not handle partitions explicitly still have behavior during a partition. they just did not design it. usually that behavior is: all nodes try to operate independently, state diverges, and the system is in an undefined state when the partition heals.
defining partition behavior explicitly, even if that definition is "go into read-only mode," is better than letting the system figure it out on its own.