latency vs outage
they look different, are detected differently, and need different responses.
when a downstream service starts misbehaving, you need to know what kind of misbehavior you are dealing with. a service that is down and a service that is very slow produce different symptoms, are detected by different signals, and require different responses. conflating them leads to the wrong fix.
an outage means the service is not accepting connections or is immediately returning errors. connection refused. 500 in 2ms. tcp reset. from the caller's perspective, the request fails fast. the failure is clear and immediate.
latency degradation means the service is accepting connections and processing requests, but slowly. the request does not fail. it hangs. the caller's thread waits. the response eventually arrives (maybe) but takes 2 seconds instead of 20ms. or it times out.
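the fast-vs-slow distinction shows up at the connection level. a minimal probe sketch (the function name, return strings, and timeout are mine, not from any tooling):

```python
import socket
import time

def probe(host: str, port: int, timeout_s: float = 5.0) -> str:
    """classify one connection attempt: fast failure, slow failure, or healthy."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            elapsed_ms = (time.monotonic() - start) * 1000
    except ConnectionRefusedError:
        # the os answered immediately: nothing is listening. outage-like.
        return "outage: connection refused (fast failure)"
    except socket.timeout:
        # no answer at all within the budget. slow failure.
        return "slow failure: connect timed out"
    return f"healthy: connected in {elapsed_ms:.0f} ms"
```

a refused connection returns in microseconds; a timed-out one costs you the full timeout budget, which is exactly why the two failure modes feel so different to the caller.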
the symptoms differ, and so do the right responses:
if a service is down, the right response is to stop sending it traffic, fail fast, and fall back to cached data or a degraded mode. circuit breakers open. load balancers pull the unhealthy instance. the system acknowledges the failure and works around it.
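the fail-fast machinery can be sketched as a toy circuit breaker. this is illustrative only, not a production library; the threshold, cooldown, and method names are arbitrary assumptions:

```python
import time

class CircuitBreaker:
    """toy breaker: opens after n consecutive failures, probes again after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows
        # open: only let a probe through once the cooldown has elapsed (half-open)
        return (time.monotonic() - self.opened_at) >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

the caller checks `allow()` before each request and records the outcome; when the breaker is open, the caller skips the request and goes straight to its fallback.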
if a service is slow, the picture is murkier. it might recover on its own. it might be processing a large batch job and will return to normal speed shortly. aggressively cutting traffic to a slow-but-recovering service can prevent it from recovering, especially if the slowness is due to a temporary spike and the service needs time to drain its queue.
the response also depends on cause. latency degradation can come from resource contention, gc pressure, a slow downstream dependency, a load spike, or database lock contention. an outage can come from a crash, a restart loop, or a loss of network connectivity.
understanding the cause shapes the response. a gc pause that lasts 200ms does not need a circuit breaker to open. a hung database query that blocks everything does.
for outages: error rates are the primary signal. connection refused, tcp resets, explicit 500 errors. these rise immediately when the service is down. health check endpoints return 500 or time out.
for latency degradation: latency percentiles are the primary signal: p50, p95, p99. the median might still look acceptable while the p99 is catastrophic. histogram data is more informative than average latency, which gets dragged around by outliers in misleading ways.
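a small worked example of why the median hides the tail. the nearest-rank helper and the sample numbers are made up for illustration:

```python
def percentile(samples, p):
    """nearest-rank percentile over raw latency samples (illustrative helper)."""
    ordered = sorted(samples)
    idx = round(p / 100 * (len(ordered) - 1))
    return ordered[idx]

# 980 requests at 20 ms and 20 at 2000 ms: 2% of traffic is catastrophic
latencies_ms = [20] * 980 + [2000] * 20

print(percentile(latencies_ms, 50))           # 20 -- median looks healthy
print(percentile(latencies_ms, 99))           # 2000 -- the tail is the problem
print(sum(latencies_ms) / len(latencies_ms))  # 59.6 -- the average is dragged up
```

the average lands at 59.6 ms, three times the typical request, while the p50 insists everything is fine. only the high percentiles tell the truth here.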
a useful heuristic: if your error rate is high but your latency looks normal, something is failing fast (outage-like). if your latency is high but your error rate is normal, something is slow (latency degradation). if both are high, you probably have latency degradation that has progressed to timeouts.
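that heuristic is mechanical enough to write down. the thresholds and labels below are arbitrary placeholders, not recommendations:

```python
def classify(error_rate: float, p99_ms: float,
             error_threshold: float = 0.05,
             latency_threshold_ms: float = 500.0) -> str:
    """triage a service from two signals: error rate and tail latency."""
    errors_high = error_rate > error_threshold
    latency_high = p99_ms > latency_threshold_ms
    if errors_high and latency_high:
        return "latency degradation progressing to timeouts"
    if errors_high:
        return "outage-like: failing fast"
    if latency_high:
        return "latency degradation: slow"
    return "healthy"
```

the point is not the exact cutoffs but the shape of the decision: the two signals together say more than either alone.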
timeouts convert latency into errors. set a 5-second timeout and requests that would have hung for 60 seconds now fail after 5 seconds. this is good for the caller's thread pool and for cascade prevention.
the side effect is that latency degradation starts looking like an outage once timeouts start firing. the service is slow, not down, but the caller sees error rates rising as timeouts occur. this can trigger circuit breakers and other outage-response mechanisms that were not designed for a slow service.
this is not always wrong. a service that responds in 30 seconds is functionally unavailable for most use cases, and treating it as down is reasonable. but a service that responds in 6 seconds with a 5-second timeout set by an overly cautious caller is not actually misbehaving. the distinction matters for diagnosing the problem and deciding whether to investigate the service or the client.
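the conversion is easy to demonstrate. a sketch using a caller-side timeout (delays scaled down from the 5 s / 6 s example so it runs quickly):

```python
import concurrent.futures
import time

def call_with_timeout(delay_s: float, timeout_s: float) -> str:
    """run a slow call under a caller-side timeout and report what the caller sees."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(time.sleep, delay_s)  # stand-in for a slow service call
        try:
            future.result(timeout=timeout_s)
            return "success"
        except concurrent.futures.TimeoutError:
            # the service was slow, not down, but the caller records an error
            return "error: timed out"

# a response that would arrive at 0.6 s fails against a 0.2 s caller timeout
print(call_with_timeout(0.6, 0.2))
```

from the service's metrics this request succeeded; from the caller's metrics it failed. that gap is what makes slow services look like outages.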
when an incident starts, the first diagnostic question is: are we seeing fast failures or slow failures?
fast failures → look for crashes, restarts, connectivity issues, error responses from the service itself.
slow failures → look for resource contention, slow dependencies, load spikes, gc pressure, database lock issues.
the answer determines where to look, what to fix, and how urgently to cut traffic to the affected service. getting this wrong wastes time in an incident and can make the underlying problem worse.