status
this chapter is in active development
expect live edits and rapid iteration (except for when i am really busy with other stuff) while this material is written.
the sender can push data as fast as the network and cpu allow. the receiver might be a raspberry pi running a python script. if the sender does not care, the receiver's buffers overflow and packets get dropped.
every tcp ack carries a window size field: the number of bytes the receiver is willing to accept beyond the acknowledged sequence number. the sender must not have more unacknowledged data in flight than this window allows.
sender receiver
|--- seq=1000 len=1000 ------------>| rwnd=4000
|--- seq=2000 len=1000 ------------>|
|--- seq=3000 len=1000 ------------>|
|--- seq=4000 len=1000 ------------>| sender has 4000 bytes in flight
| | (cannot send more until ack)
  |<-- ack=2000 rwnd=4000 ------------| "got 1000, app read it, room for 4000"
|--- seq=5000 len=1000 ------------>| window slides forward
the window slides forward as the receiver acks data and the application reads from the socket buffer, hence the name "sliding window."
the effective send window is min(cwnd, rwnd): the smaller of the congestion window (cwnd, covered under congestion control) and the receive window. most of the time cwnd is the bottleneck. when rwnd is what limits throughput, the receiver is too slow and the fix belongs in the application.
the window size field is 16 bits, maxing out at 65535 bytes. on a 1 Gbps transatlantic pipe with 100ms rtt, you need about 12.5 MB in flight to saturate the link. 64 KB is laughable.
window scaling (RFC 7323) multiplies the window size by a power of two. both sides negotiate a scale factor during the handshake:
options [mss 1460,sackOK,TS val ... ecr 0,nop,wscale 7]
wscale 7 means the actual window is the advertised value shifted left by 7 (multiplied by 128). a 16-bit window of 512 with scale factor 7 is really 65536 bytes. scale factor 14 (the maximum) gives a window up to 1 GB.
linux negotiates window scaling automatically. if you see poor throughput on a high-bandwidth link, check whether the handshake includes wscale. some ancient firewalls strip tcp options and silently cap your window at 64 KB.
# check whether window scaling was negotiated
ss --info '( dport = :443 )' | grep -i wscale
when the receive buffer fills completely, the receiver advertises a window of zero:
sender receiver
|--- seq=5000 len=1000 ------------>|
|<-- ack=6000 rwnd=0 ---------------| "stop sending"
| (sender pauses) |
the sender stops transmitting and starts a persist timer. it periodically sends window probes (1-byte segments) to check whether the window opened:
|--- window probe (1 byte) -------->|
|<-- ack=6000 rwnd=0 ---------------| still full
| (wait, probe again) |
|--- window probe (1 byte) -------->|
|<-- ack=6000 rwnd=8192 ------------| application read, room opened
|--- seq=6000 len=1000 ------------>| resume
probe interval doubles each time, starting at rto and capping at 60 seconds.
zero windows are not errors. they are the protocol working correctly. but a connection sitting at zero window for minutes means something is wrong with the receiving application. it is not reading from the socket fast enough. check for blocking I/O, gc pauses, or a saturated thread pool.
# connections with unread data piling up in the receive buffer
# (a full buffer is what drives the advertised window toward zero)
ss -tan state established | awk '$1 > 0'
the kernel manages the receive buffer. data arrives, goes into the buffer. the application calls read() or recv(), data comes out. the advertised window reflects how much space remains.
a slow reader means a shrinking window: your application's read loop directly controls network throughput, and a slow consumer throttles the entire connection.
the common mistake: doing heavy processing inside the read loop instead of reading into a queue and processing asynchronously. the socket buffer fills while you compute, the window closes, and the sender stalls.
linux auto-tunes socket buffers. the defaults handle most cases:
# receive buffer: min default max (bytes)
sysctl net.ipv4.tcp_rmem
# 4096 131072 6291456
# send buffer: min default max
sysctl net.ipv4.tcp_wmem
# 4096 16384 4194304
auto-tuning adjusts the buffer between minimum and maximum based on memory pressure and connection needs. the default is the starting point for new connections.
override per-socket with SO_RCVBUF:
int size = 2 * 1024 * 1024;
setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));
setting SO_RCVBUF disables auto-tuning for that socket. the kernel doubles the value you request (half for internal bookkeeping) and the buffer stays fixed. only do this if auto-tuning is provably wrong for your workload.
for high bandwidth-delay-product links (datacenter to datacenter across an ocean), raise the maximums so auto-tuning has room:
sysctl -w net.ipv4.tcp_rmem="4096 131072 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
ss -tan '( dport = :443 )'
Recv-Q in ESTABLISHED state shows bytes in the receive buffer waiting for the application to read. consistently growing means the application is not keeping up.
Send-Q shows bytes in the send buffer waiting for acknowledgment. growing while Recv-Q on the remote is zero means congestion on the network, not a slow receiver.
flow control keeps the receiver from drowning. congestion control keeps the network from drowning. they constrain the sender independently. the slower one wins.