a reliable system does what it is supposed to do, at the level of performance expected, even when things go wrong. the phrase "when things go wrong" is doing a lot of work here. things go wrong constantly in production.
hardware fails. disks have a mean time between failures measurable in years. in a cluster with hundreds of disks, each with an mtbf of a few years, that works out to several failures per week at steady state. cpu errors, memory bit flips, network hardware failing silently. these are not rare events. they are background noise.
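the arithmetic behind that claim is worth seeing. a rough back-of-envelope sketch, assuming independent failures at a constant rate (the fleet size and mtbf below are illustrative, not from any specific vendor):

```python
# expected disk failures per week in a fleet, assuming independent
# failures at a constant rate of 1 / mtbf per disk
def expected_failures_per_week(num_disks: int, mtbf_years: float) -> float:
    weeks_per_year = 52
    failure_rate_per_week = 1 / (mtbf_years * weeks_per_year)
    return num_disks * failure_rate_per_week

# 500 disks with a 4-year mtbf: a couple of failures every week
print(expected_failures_per_week(500, 4))
```

the point is not the exact number; it is that a rate you can ignore on one machine becomes routine maintenance at fleet scale.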
software has bugs. tested, reviewed, shipped software has defects that only surface under specific conditions. race conditions show up under load. off-by-one errors appear at edge cases. integration assumptions turn out to be wrong.
humans make mistakes. misconfigured deployments, botched migrations, wrong firewall rules, accidentally deleted records. operator error is one of the most common causes of production incidents. a system that cannot survive human error in any form is not a reliable system.
a fault is a component deviating from its specification. a failure is when the system as a whole stops providing the required service. the goal of reliability engineering is to build systems where faults do not become failures.
you achieve this with redundancy. if one disk can fail without data loss, you have multiple copies. if one server can fail without downtime, you have replicas. redundancy does not eliminate faults. it contains them.
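a minimal sketch of why redundancy contains faults, assuming each replica fails independently with the same probability (independence is optimistic in practice; correlated failures such as a shared rack or a bad deploy are the real risk):

```python
# probability that all n replicas are down at the same time,
# given each is down with probability p and failures are independent
def all_replicas_down(p: float, n: int) -> float:
    return p ** n

# each replica down 1% of the time; three replicas → roughly one in a million
print(all_replicas_down(0.01, 3))
```

note what this does not buy you: the individual faults still happen at the same rate. redundancy only lowers the odds that they coincide.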
fault tolerance means anticipating faults and designing in mechanisms to handle them. this is different from fault avoidance, which tries to prevent faults from happening at all. avoidance has limits; tolerance is how you handle what avoidance misses.
mtbf (mean time between failures) measures how long a component typically operates before failing. longer is better, but it is a distribution, not a promise. a disk with a 3-year mtbf does not survive for exactly 3 years; the mean is an average across a population, and for typical skewed failure distributions many drives fail well before the mean while others last much longer.
mttr (mean time to repair) measures how long it takes to restore service after a failure. this is often more important than mtbf. a system that fails frequently but recovers in seconds is more reliable in practice than one that fails rarely but takes an hour to restore.
a common simplification: higher mtbf and lower mttr both move you toward more reliable systems, but reducing mttr is often more tractable and has more immediate impact.
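the relationship between the two can be sketched with the standard steady-state availability formula, availability = mtbf / (mtbf + mttr). the numbers below are illustrative, chosen to match the frequent-but-fast vs rare-but-slow comparison above:

```python
# steady-state availability from mtbf and mttr (same time units):
# the fraction of time the system is up, averaged over many
# fail-and-repair cycles
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours)

# fails ~weekly but recovers in ~36 seconds → ≈ 0.9999
print(availability(100, 0.01))
# fails ~10x less often but takes an hour to restore → ≈ 0.999
print(availability(1000, 1.0))
```

the frequently-failing system with fast recovery comes out ahead, which is the quantitative version of the claim above: mttr is usually the more tractable lever.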
a system is only as reliable as its weakest dependency. if your service calls three other services and each has 99.9% availability, your composite availability is 99.9% × 99.9% × 99.9% ≈ 99.7%, assuming all three must succeed for your service to succeed.
add a database, a cache, and a message queue, and that number keeps falling. this is why reliability engineering cannot stop at your service boundary. the whole call graph matters.
this is also why graceful degradation matters. if your service can still function (partially, with reduced features) when a dependency is down, you have changed a failure into a degraded state. degraded is better than down.
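a minimal sketch of that pattern: catch the dependency failure and serve a reduced response instead of an error. all names here are illustrative, and the stub client simulates an outage:

```python
# precomputed fallback content (illustrative)
POPULAR_ITEMS = ["a", "b", "c"]

class RecommendationClient:
    """stand-in for a real dependency; raises to simulate an outage."""
    def fetch(self, user_id: str) -> list[str]:
        raise ConnectionError("recommendation service unreachable")

recommendation_service = RecommendationClient()

def get_recommendations(user_id: str) -> dict:
    try:
        items = recommendation_service.fetch(user_id)
        return {"items": items, "personalized": True}
    except ConnectionError:
        # dependency is down: fall back to a static list.
        # degraded (not personalized), but not down.
        return {"items": POPULAR_ITEMS, "personalized": False}

print(get_recommendations("u1"))
```

the important design choice is that the fallback path is exercised deliberately: the caller gets a well-formed response either way, and the degradation is visible in the payload rather than surfacing as an error.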
a batch job that runs once per hour, completes correctly every time, and takes 30 minutes to run is 50% available (offline half the time) but could be considered highly reliable (never produces wrong output).
a cache that is always reachable but occasionally returns stale data is highly available but has reliability issues if correctness is important.
the distinction matters when you are deciding what to optimize. sometimes you need the system to be correct. sometimes you need it to be reachable. often you need both. they require different techniques.