status
this chapter is in active development
expect live edits and rapid iteration (except for when i am really busy with other stuff) while this material is written.
flow control protects the receiver. congestion control protects the network.
the sender has no direct visibility into the network's capacity. it cannot ask routers how full their queues are. it can only send data and observe what happens. did the acks come back on time, or did packets vanish?
every congestion control algorithm is a feedback loop: probe for more bandwidth, detect congestion, back off, repeat. the differences are in how they probe and what signals they react to.
the congestion window (cwnd) limits how many bytes the sender can have in flight. the actual send rate is bounded by min(cwnd, rwnd).
cwnd lives in the sender's kernel, not on the wire. the receiver never sees it.
ss --info '( dport = :443 )' | grep cwnd
# cwnd:10 ssthresh:20
cwnd is measured in segments (typically mss-sized, 1460 bytes on ethernet). cwnd:10 means roughly 14600 bytes in flight.
a new connection has no idea how much bandwidth is available. slow start probes: for every ack received, cwnd increases by one segment. cwnd doubles every rtt. exponential growth.
RTT 1: cwnd=1 → send 1 segment → 1 ack → cwnd=2
RTT 2: cwnd=2 → send 2 segments → 2 acks → cwnd=4
RTT 3: cwnd=4 → send 4 segments → 4 acks → cwnd=8
RTT 4: cwnd=8 → send 8 segments → 8 acks → cwnd=16
"slow start" is a misnomer. the name refers to starting from 1 rather than blasting at full speed, but the growth is exponential.
slow start continues until cwnd reaches ssthresh, a loss signals congestion, or the receive window becomes the limit.
linux sets initcwnd to 10 segments by default (ip route show to check). google's research showed bumping initcwnd from 3 to 10 significantly reduced page load times because most web responses fit in 10 segments.
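to raise it, set initcwnd on the route itself. the gateway and device below are placeholders for whatever ip route show reports on your box, and initcwnd only appears in that output once it has been set explicitly.
ip route show
ip route change default via 192.168.1.1 dev eth0 initcwnd 20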
once cwnd reaches ssthresh, growth goes linear: cwnd increases by roughly one segment per rtt. the sender carefully pushes upward, testing whether the network can handle more. when it pushes too far, packets drop and the algorithm reacts.
two signals:
packet loss. either a retransmission timeout fires or three duplicate acks trigger fast retransmit. loss-based algorithms treat this as the primary signal.
ecn (explicit congestion notification). routers mark packets when their queues fill up instead of dropping them. the receiver echoes the mark back in acks. the sender reacts as if a loss occurred but avoids the actual retransmission. enable with net.ipv4.tcp_ecn=1; both endpoints have to negotiate it, and it only helps if the bottleneck router actually marks (middleboxes that mangled the ecn bits are a big part of why deployment took so long).
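checking and flipping it is one sysctl. 2 (the usual default) means accept ecn when the peer requests it but do not request it ourselves; 1 also requests it on outgoing connections.
sysctl net.ipv4.tcp_ecn
# net.ipv4.tcp_ecn = 2
sysctl -w net.ipv4.tcp_ecn=1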
reno is the textbook algorithm. on loss: cut cwnd in half, set ssthresh to the new cwnd, enter congestion avoidance. simple, conservative, and terrible on high-bandwidth high-latency links because it takes forever to recover.
cubic (RFC 9438) is linux's default since 2.6.19. instead of linear growth in congestion avoidance, cubic uses a cubic function that grows slowly near the last known good cwnd (where loss happened) and more aggressively when far from it. much faster at utilizing fat pipes.
sysctl net.ipv4.tcp_congestion_control
# net.ipv4.tcp_congestion_control = cubic
sysctl net.ipv4.tcp_available_congestion_control
cubic's weakness: it interprets any packet loss as congestion, even random loss on a wireless link. 1% loss on wifi and cubic backs off constantly even if the link has plenty of bandwidth.
loss-based algorithms wait until the network is already congested (queues full, packets dropped) before reacting. delay-based algorithms detect congestion earlier by watching rtt. a rising rtt means queues are filling.
vegas monitors rtt and reduces cwnd when it increases beyond a threshold. avoids filling queues entirely, which is great for latency but means it loses bandwidth fights with cubic. cubic fills queues, vegas backs off, cubic takes the capacity.
bbr (bottleneck bandwidth and round-trip propagation time) is google's answer. it models the connection's bottleneck bandwidth and minimum rtt, then paces packets to match. does not react to loss directly. does not need to fill queues to find capacity.
# enable bbr (kernel 4.9+)
sysctl -w net.ipv4.tcp_congestion_control=bbr
# bbr needs the fq packet scheduler
tc qdisc replace dev eth0 root fq
bbr excels on long-fat networks and networks with random non-congestion loss. keeps queues shallow, which is great for latency. bbrv1 can be unfair to cubic flows; bbrv2 (newer kernels) addresses this.
bbr is not magic. on a shared link with mixed cubic and bbr flows, the interactions are complex and not always fair. test before deploying.
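one low-risk way to test: congestion control can be set per route, so bbr can be limited to traffic toward a particular network while everything else stays on cubic. the prefix and gateway below are placeholders; the congctl route attribute needs a reasonably recent kernel and iproute2.
# bbr only toward one remote network, cubic everywhere else
ip route add 203.0.113.0/24 via 192.168.1.1 dev eth0 congctl bbr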
when cubic detects a loss, it multiplies cwnd by 0.7 (reno cuts to 0.5). the result is the classic sawtooth: cwnd ramps up, a loss knocks it down, it ramps up again. average throughput is a function of loss frequency and recovery speed.
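a rough feel for that relationship is the classic mathis approximation: throughput ≈ mss / (rtt × sqrt(loss)). illustrative numbers only:
# 1460-byte segments, 50 ms rtt, 1% loss
echo "1460 / (0.05 * sqrt(0.01))" | bc -l
# ≈ 292000 bytes/s, about 2.3 mbit/s, no matter how fat the link is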
on a short request (typical web page), the connection might never leave slow start. on a large download, you spend most of the time in congestion avoidance watching the sawtooth.
raw congestion control says "you can have cwnd bytes in flight." without pacing, the sender dumps cwnd bytes into the network as fast as the nic can push them. creates bursts that fill router queues and cause loss even when average throughput is fine.
pacing spreads packets evenly across the rtt. bbr depends on pacing (the fq qdisc above provides it; newer kernels can also pace inside tcp itself). cubic benefits from it.
tc qdisc show dev eth0
ss --info '( dport = :443 )'
# rtt:25.3/1.2 cwnd:42 ssthresh:38 bytes_acked:1284032 retrans:0/2
cwnd:42 ssthresh:38 means congestion avoidance (cwnd > ssthresh). retrans:0/2 means 0 currently unacked retransmits, 2 total since the connection opened.
cwnd stuck at a low value with low ssthresh means recent loss and slow recovery. cwnd equals initcwnd on a fresh connection means slow start has not had time to ramp.
nstat -az TcpRetransSegs
system-wide retransmit count. if this is climbing, something on your network is hurting.
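a raw count means little without a denominator. compare retransmitted segments against total segments sent:
nstat -az TcpRetransSegs TcpOutSegs
# retransmits above a fraction of a percent of out-segments is worth digging into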
short connections suffer most. a connection transferring 50 KB might finish entirely during slow start with initcwnd=10 (14 KB on the wire in the first rtt). the congestion algorithm never gets a chance to ramp. http/2 multiplexing and connection reuse help because one warmed-up connection with a large cwnd serves many requests. quic with 0-rtt goes further: it can reuse the previous connection's congestion state.
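the arithmetic for that 50 KB case, assuming 1460-byte segments and initcwnd=10:
echo $(( 50 * 1024 / 1460 ))
# ~35 segments: 10 in the first rtt, 20 in the second, the last 5 in the third,
# plus one rtt for the handshake. the transfer finishes before slow start ever ends.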
buffer sizes limit throughput. if SO_SNDBUF is smaller than the bandwidth-delay product, the sender cannot fill the pipe. if SO_RCVBUF is smaller, the receive window clamps throughput before cwnd gets a chance.
measure before tuning. switching to bbr on a server handling only lan traffic will not help. bumping initcwnd to 40 when your clients are on the same rack is pointless overhead. know the rtt and bandwidth of your typical path, calculate the bdp, and tune accordingly.
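a worked bdp number, assuming a 100 mbit/s path with a 50 ms rtt:
echo $(( 100 * 1000 * 1000 / 8 * 50 / 1000 ))
# 625000 bytes. socket buffers (and the max values in tcp_rmem/tcp_wmem) need at least this much
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem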
flow control protects the receiver. congestion control protects the network. udp is what happens when you opt out of all of this.