9  Transport and Congestion Control


9.1 The Anchor: IP Datagrams and Decentralized Authority

Transport is the endpoint component responsible for end-to-end delivery across a network the sender lacks control over. The engineering problem: how to deliver data reliably and efficiently when the network offers only best-effort, unordered datagrams and no entity has global authority over traffic.

The binding constraint is the IP interface, inherited from Cerf and Kahn’s 1974 design (Cerf and Kahn 1974). IP provides best-effort delivery: datagrams are routed independently, with no guarantee of delivery, ordering, or timing. Routers keep no per-flow state — they forward packets based on destination address alone, maintain only forwarding tables, and remain silent about congestion. A packet is delivered, lost, or delayed; the sender observes only the presence or absence of acknowledgments.

This constraint is architectural, not accidental. Clark’s design philosophy embedded administrative decentralization as a founding principle: the Internet operates across independent domains with no entity possessing global visibility (Clark 1988). Congestion control must operate without a central scheduler, a shared clock, or routers capable of universal rate coordination.

“The goal of the architecture is to knit together a collection of networks… using gateways which implement a standard internetwork protocol.” — Cerf & Kahn, 1974 (Cerf and Kahn 1974)

The IP interface locks every other design choice in cascade. Transport must answer four decision problems using only endpoint-local observations:

  1. When to send? Absent a shared clock or prescribed schedule, the sender must infer when the network can accept data.
  2. What to send — new data or retransmission? The sender must distinguish successful delivery from loss, using only ACK signals.
  3. How much to send? The sender must estimate available capacity using only indirect signals, since link rates and queue depths are hidden.
  4. How to infer network state from endpoint signals? ACK arrivals and their timing are the only measurement channel. Every belief about the network — capacity, delay, congestion — must be reconstructed from this single, noisy signal.

Saltzer, Reed, and Clark formalized this placement retrospectively: the end-to-end argument places intelligence at endpoints because the network interior is blind to application requirements (Saltzer et al. 1984). This decomposition choice defines TCP’s entire design space — and constrains every variant that followed.

The dependency chain for transport follows directly from the binding constraint:

  • Interface (IP datagrams) is inherited and locked.
  • Coordination (distributed endpoints) is forced: no entity has authority to allocate bandwidth centrally.
  • State (inferred from local signals) is forced: without coordination, the sender must build its own belief model from ACK signals.
  • Time (estimated via RTT) is forced: without a shared clock, timing must be inferred from the only available signal — round-trip acknowledgment delay.

9.2 Act 1: “It’s 1974. TCP Is a Reliability Protocol.”

Cerf and Kahn designed TCP as a protocol for packet network intercommunication — a system to deliver bytes correctly across heterogeneous networks connected by stateless gateways (Cerf and Kahn 1974). The protocol unified host-to-host communication, process-to-process multiplexing, and reliability into a single monolithic design. By 1981, RFC 793 formalized the connection state machine: three-way handshake, sequence numbers, cumulative acknowledgments, and flow control via a receiver-advertised window (Postel 1981). The receiver window told the sender how much buffer space the receiver had available — a capacity limit at the far end. This was the only rate-control mechanism: if the receiver could accept 32 KB, the sender could send 32 KB. The network between them was assumed to have sufficient capacity.

What the pioneers saw: A small ARPANET with fewer than 200 hosts, administered cooperatively by researchers who knew each other. Links operated at 50 kbps. Traffic consisted of terminal sessions and file transfers. The engineering challenge was reliability — ensuring bytes arrived correctly despite unreliable packet delivery.

What remained invisible from the pioneers’ vantage point: The network would grow from 200 hosts to millions. Traffic would increase by orders of magnitude. Multiple independent administrative domains would replace cooperative research governance. The implicit assumption — that the network has enough capacity for all senders — would fail catastrophically within a decade.

9.2.1 Invariant Analysis: TCP (1974–1981)

| Invariant | TCP Answer (1974–1981) | Gap? |
|---|---|---|
| State | Sequence numbers, connection FSM, receiver window | Path-blind: zero belief about network capacity or delay |
| Time | Fixed retransmission timers (“adaptive” but vague) | RTT estimation absent; timers fire too early or too late |
| Coordination | Distributed endpoints; receiver flow control only | Congestion coordination absent; each sender independent |
| Interface | Byte stream above, IP datagrams below | Clean — and permanent. This becomes the binding constraint |

The State gap is the critical one: TCP tracks connection state (what bytes have been sent, acknowledged, received) but maintains zero path state. The sender is blind to network load, bottleneck capacity, and competing traffic. The congestion window is absent from the design. The sender transmits as fast as the receiver window permits — which is often far faster than the network absorbs.

9.2.2 Environment, Measurement, Belief

| Layer | What TCP Has | What’s Missing |
|---|---|---|
| Environment | True queue occupancy, competing traffic, link rates | Everything — fully opaque to the sender |
| Measurement | ACK arrivals (binary: arrived or not) | Timing and congestion signals both absent |
| Belief | “Receiver can accept this much data” (receiver window) | Network capacity belief entirely absent |

The E-M gap is total. The measurement channel (ACK arrivals) carries information about the receiver but nothing about the path. The sender’s belief covers only “the receiver has buffer space” — a statement about the far end, silent on the middle. The gap between Environment and Measurement is not noise or estimation error; it is structural absence. The sensor for network capacity simply does not exist. This is the most extreme form of the E-M gap.

9.2.3 “The Gaps Didn’t Matter… Yet”

On the 1974 ARPANET, they were invisible. With fewer than 200 hosts and 50 kbps links, aggregate demand rarely exceeded capacity. Congestion occurred locally and resolved quickly because traffic was light. The receiver window was the practical bottleneck, not the network.

But by 1986, the Internet had grown to over 5,000 hosts. Links had scaled to 56 kbps and 1.5 Mbps, but aggregate demand scaled faster. Multiple independent organizations connected their networks. The assumption “enough capacity” broke.


9.3 Act 2: “It’s 1986. The Network Is Collapsing.”

9.3.1 Which Invariant Broke?

In October 1986, throughput on the NSFnet backbone between LBL and UC Berkeley — a path that should have sustained 32 kbps — dropped to 40 bits per second. A factor of 800 reduction. The network was physically intact; every link was operational. Every router was forwarding. The collapse was behavioral: senders were transmitting faster than the network could deliver, packets were being dropped, senders retransmitted (because their timers fired), creating more traffic, causing more drops. The retransmissions themselves became the dominant source of traffic — a positive feedback loop where the system’s response to failure amplified the failure.

This was a stable, self-reinforcing failure: the network settled into an equilibrium where nearly all traffic consisted of retransmissions and useful throughput approached zero. The system had converged — to the wrong operating point.

“The network is now subject to a form of congestion collapse… If the timers are set too short, more retransmissions than necessary will be performed, and the network will be further congested.” — Nagle, 1984 (Nagle 1984)

Nagle diagnosed the problem in RFC 896 two years before the collapse became acute: senders lacked any mechanism to sense or respond to congestion. His fix operated at the Interface level — coalesce small packets to reduce packet count — but it treated a symptom, leaving the cause intact. Senders still lacked a feedback loop.

Three researchers, working independently over three years, diagnosed three distinct invariant failures:

| Year | Pioneer | Invariant Diagnosed | Gap Identified |
|---|---|---|---|
| 1984 | Nagle | State | Sender lacks network load model |
| 1986 | Zhang | Time | RTT estimate is too crude; timers fire incorrectly |
| 1987 | Karn & Partridge | State | Ambiguous RTT samples from retransmissions pollute estimator |

Zhang showed that TCP’s retransmission timers systematically failed because the RTT estimator was too primitive (Zhang 1986). RFC 793 tracked only a smoothed mean RTT with a fixed multiplier — variance tracking absent. When network conditions changed, the timer was either too aggressive (causing spurious retransmissions that worsened congestion) or too conservative (waiting too long to detect loss).

Karn and Partridge identified a subtler problem: when a packet is retransmitted, the sender is unable to determine whether the returning ACK acknowledges the original or the retransmission (Karn and Partridge 1987). Using this ambiguous sample corrupts the RTT estimate. Their fix: discard RTT samples from retransmitted segments and back off the timer exponentially on each retransmission.
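Karn and Partridge’s two rules are simple enough to state in a few lines. A minimal sketch — the class and method names are illustrative, not from any real TCP stack:

```python
# Sketch of the Karn/Partridge timer discipline.
# Rule 1: never take an RTT sample from a segment that was retransmitted,
#         because the returning ACK could match either transmission.
# Rule 2: back off the retransmission timeout exponentially on each retransmission.

class KarnTimer:
    def __init__(self, initial_rto=1.0, max_rto=60.0):
        self.rto = initial_rto      # seconds
        self.max_rto = max_rto

    def on_retransmit(self):
        # Exponential backoff: the timer doubles on every retransmission.
        self.rto = min(self.rto * 2, self.max_rto)

    def accept_rtt_sample(self, segment_was_retransmitted):
        # Ambiguous samples are discarded outright (Rule 1).
        return not segment_was_retransmitted
```

The two rules interact: backoff keeps the timer sane while the estimator is starved of samples; once an ACK for a segment sent only once arrives, normal RTT sampling resumes.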

In parallel, Jain and Chiu were developing the theoretical foundation: how should distributed agents adjust rates to converge to fairness? Their work would prove that additive increase / multiplicative decrease (AIMD) is the unique linear control policy that converges to both efficiency and fairness — a result Jacobson would implement independently of the formal proof.

Three researchers. Three years. No coordination. Yet they converged on the same decomposition — each diagnosing a different invariant’s failure. This convergence is evidence that the four-invariant structure is something the pioneers discovered through debugging, not something imposed retrospectively.

9.3.2 Jacobson’s Synthesis: Four Fixes in One System (1988)

Jacobson synthesized all four invariant fixes into a single coherent system — the most consequential transport paper ever written (Jacobson 1988).

“In October ’86, the Internet had its first ‘congestion collapse’… the throughput from LBL to UC Berkeley… dropped from 32 Kbps to 40 bps.” — Jacobson, 1988 (Jacobson 1988)

Jacobson framed TCP congestion control as a feedback control problem, drawing explicitly on control theory. His core principle: conservation of packets — a new packet should enter the network only when an old one leaves. Three violations of this principle cause collapse: (1) the connection starts too fast, (2) bad timers inject packets prematurely, (3) resource limits change mid-connection.

Applied closed-loop reasoning by creating TCP’s first real feedback loop:

  • Slow start: cwnd (congestion window) begins at 1 MSS (Maximum Segment Size) and doubles each RTT (exponential growth to equilibrium). The sender injects data at no more than twice the network’s current delivery rate.
  • Congestion avoidance (AIMD): After reaching ssthresh, increase cwnd by 1 MSS per RTT. On loss, halve cwnd. Additive increase probes; multiplicative decrease retreats aggressively.
  • RTT estimator: SRTT and RTTVAR with exponentially weighted moving averages:
RTTVAR = (1 - beta) * RTTVAR + beta * |SRTT - RTT_sample|
SRTT   = (1 - alpha) * SRTT  + alpha * RTT_sample
RTO    = SRTT + 4 * RTTVAR

RFC 6298 (Paxson et al. 2011) uses alpha = 1/8 and beta = 1/4. The smoothed RTT weights history 7/8 and the new sample 1/8 — gradual adaptation without oscillation. The multiplier 4 on RTTVAR covers the 95th percentile of RTT variation, chosen specifically to prevent spurious timeouts during slow start when RTT can double in one round.
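The estimator above is only a few lines of code. A minimal sketch of the RFC 6298 computation (the class name and structure are illustrative):

```python
# RTT estimator from the equations above, with RFC 6298 constants.
# SRTT and RTTVAR are exponentially weighted moving averages;
# RTO = SRTT + 4 * RTTVAR.

ALPHA, BETA, K = 1 / 8, 1 / 4, 4

class RttEstimator:
    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def update(self, sample):
        """Fold in one RTT sample (seconds); return the resulting RTO."""
        if self.srtt is None:
            # First sample: RFC 6298 initialization.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            # Order matters: RTTVAR is updated using the old SRTT.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - sample)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * sample
        return self.rto()

    def rto(self):
        return self.srtt + K * self.rttvar
```

With a steady 100 ms RTT, the first sample yields RTO = 300 ms; repeated identical samples shrink RTTVAR, pulling the RTO down toward SRTT — the “gradual adaptation without oscillation” described above.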

  • Fast retransmit: Three duplicate ACKs trigger retransmission without waiting for timeout — a faster path to loss recovery. The sender infers loss (rather than delay) because the receiver has received three subsequent packets and is signaling a gap.

Concrete example: Starting cwnd = 1 MSS (1,500 bytes), RTT = 100 ms. Slow start: cwnd = 1, 2, 4, 8, 16, 32 (5 RTTs = 500 ms). If loss occurs at cwnd = 32, ssthresh is set to 16 and cwnd resets to 1. Slow start back to 16 takes 4 RTTs (400 ms); congestion avoidance from 16 to 32 takes 16 RTTs (1.6 s). Total recovery: 2.0 seconds. This was fast enough for 56 kbps links where BDP (Bandwidth-Delay Product) = 700 bytes. It would prove inadequate for 10 Gbps links where BDP = 125 MB.
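The recovery arithmetic can be checked mechanically. A sketch assuming only the textbook rules stated above — doubling below ssthresh, +1 MSS per RTT above it:

```python
# Count RTTs for cwnd to grow from 1 MSS to `target` MSS, using
# slow start (doubling) below ssthresh and additive increase above it.

def rtts_to_reach(target, ssthresh):
    cwnd, rtts = 1, 0
    while cwnd < target:
        if cwnd < ssthresh:
            cwnd *= 2        # slow start: exponential growth
        else:
            cwnd += 1        # congestion avoidance: +1 MSS per RTT
        rtts += 1
    return rtts
```

`rtts_to_reach(32, ssthresh=64)` gives the initial ramp (5 RTTs = 500 ms at RTT = 100 ms); `rtts_to_reach(32, ssthresh=16)` gives the post-timeout recovery (20 RTTs = 2.0 s under these rules).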

Applied disaggregation by separating per-flow state (each sender tracks its own cwnd, SRTT, RTTVAR) from network state (routers remain stateless).

Decision placement remained fully distributed: every sender makes rate decisions independently, using only local ACK signals. No central authority allocates bandwidth. Convergence emerges from the interaction of independent AIMD loops.

Chiu and Jain proved mathematically that AIMD is the only linear control law that converges to fairness from any starting point (Chiu and Jain 1989). The proof is elegant: additive increase moves all flows toward the efficiency line (total rate = capacity); multiplicative decrease moves all flows toward the fairness line (equal rates). The combination oscillates around the intersection — the fair, efficient operating point.
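Chiu and Jain’s geometric argument can be watched numerically. A toy simulation of two AIMD flows, under idealized assumptions (fluid rates, perfectly synchronized loss events, arbitrary units):

```python
# Two AIMD flows sharing a bottleneck. Additive increase moves the pair
# toward the efficiency line (x1 + x2 = capacity); multiplicative
# decrease moves it toward the fairness line (x1 = x2). The rate
# difference shrinks only on decreases, so it halves each cycle.

def aimd_two_flows(x1, x2, capacity, rounds):
    for _ in range(rounds):
        if x1 + x2 > capacity:
            # Multiplicative decrease: both flows observe the loss and halve.
            x1, x2 = x1 / 2, x2 / 2
        else:
            # Additive increase: both flows probe by one unit per round.
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2
```

Starting from a grossly unfair split (40 vs 4), the flows converge toward equal rates while the total oscillates just below capacity — the sawtooth around the fair, efficient operating point.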

The beauty of Jacobson’s synthesis: four independent diagnoses (Nagle on State, Zhang on Time, Karn on measurement integrity, Jain on Coordination) were unified into a single deployable system. The interface remained unchanged — no router modifications, no new header fields, no coordination between senders. This is why RFC 1122 (Braden 1989) could mandate the algorithms for all Internet hosts within one year: the fix operated entirely within the existing protocol boundary.

9.3.3 Invariant Analysis: TCP + Jacobson (1988)

| Invariant | TCP Answer After Jacobson (1988) | Gap? |
|---|---|---|
| State | cwnd (capacity belief) + SRTT/RTTVAR (delay belief) | Loss is binary and delayed; no congestion gradient |
| Time | EWMA estimator: RTO = SRTT + 4*RTTVAR | Adapts slowly to sudden RTT changes (8 samples to converge) |
| Coordination | AIMD: emergent fairness without communication | RTT bias: low-RTT flows get more bandwidth |
| Interface | Unchanged — deployable via host update only | No router participation; routers remain passive |

The State gap narrowed dramatically while remaining open. Loss is now the congestion signal, but loss is binary (happened or didn’t) and delayed (detected after one RTT at best, after a timeout at worst). The sender is blind to congestion severity — a single dropped packet and a completely saturated link produce the same signal.

9.3.4 Environment, Measurement, Belief After the Fix

| Layer | What TCP Has | What’s Missing |
|---|---|---|
| Environment | True bottleneck rate, queue depth, competing flows | Still opaque to sender |
| Measurement | ACK timing (RTT samples) + loss events (timeout or 3 dup-ACKs) | Direct capacity signal absent; loss is noisy |
| Belief | cwnd (capacity proxy), SRTT (delay estimate) | cwnd is a proxy — an estimate rather than a measurement of true capacity |

The E-M gap is accidentally noisy: the ACK signal is honest but imperfect. Jacobson’s estimator extracts what information exists, but the measurement channel is fundamentally limited to binary loss events and timing. Better estimators can improve accuracy; they are bounded by the information the signal carries.

9.3.5 “The Gaps Didn’t Matter… Yet”

Jacobson’s algorithms were mandated for all Internet hosts within one year, by RFC 1122 in 1989 (Braden 1989). They worked extraordinarily well for the 1988 Internet: moderate link speeds, small buffers, homogeneous RTTs. The loss signal was tightly coupled to congestion because buffers were small — when congestion occurred, packets were dropped quickly, and the sender learned within a few RTTs.

Jacobson knew the endpoint-only solution was incomplete:

“There are two network problems, congestion avoidance and congestion recovery… This paper describes the approach to the first problem… The second problem requires router changes.” — Jacobson, 1988 (Jacobson 1988)

The “gateway fix” was RED (Floyd and Jacobson 1993) — five years later. Jacobson shipped what was deployable first. This is the meta-constraint of incremental deployability: host changes can be mandated via RFC; router changes require voluntary upgrades across administratively independent networks.

9.3.6 The AIMD Closed Loop in Steady State

The sawtooth pattern that emerges from AIMD is the system reaching equilibrium. The sender increases cwnd linearly (1 MSS per RTT) until loss occurs, then halves cwnd instantly. The cycle repeats.

The loop delay is one RTT. This is consequential: a flow with 100 ms RTT takes at least 100 ms to react to congestion. During those 100 ms, if the sender was transmitting near bottleneck rate, congestion has deepened. By contrast, a datacenter flow with 1 ms RTT reacts 100x faster. This RTT coupling is ineliminable without changing the time model.

The loop is stabilizing. After loss, cwnd halves, reducing sending rate by 50%. The sender then probes slowly upward. Over many cycles, the system converges to a rate where loss probability is low but not zero. The equilibrium loss probability p satisfies a balance: the rate of additive increase (1 MSS/RTT) equals the rate of multiplicative decrease (cwnd/2 per loss event). For N flows sharing bottleneck capacity C, the equilibrium rates satisfy N * (cwnd / RTT) = C, and loss occurs when any flow overshoots.

But the loop has failure modes visible at scale:

  • RTT-biased fairness: A 1 ms flow increases cwnd 1,000 times per second; a 100 ms flow only 10 times per second — a 100x gap in probing rate. At equilibrium, low-RTT flows consume disproportionate bandwidth. This follows directly from the mathematics of AIMD — it is structural.
  • TCP incast: When many senders share a bottleneck and all transmit simultaneously (partition-aggregate workloads), all flows experience synchronized loss. All halve cwnd. When the buffer drains, all increase in lockstep. Utilization collapses to 10%.
  • Slow convergence: On a link with equilibrium cwnd = 83,000 MSS (10 Gbps, 100 ms RTT), recovering from a halving takes 41,500 RTTs = 69 minutes at 1 MSS per RTT.
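The numbers behind these failure modes follow from a few lines of arithmetic, using the values assumed in the text (10 Gbps, 100 ms RTT, 1,500-byte MSS):

```python
# Back-of-envelope check on the AIMD failure modes above.

MSS = 1500                          # bytes
RTT = 0.100                         # seconds

bdp_bytes = 10e9 / 8 * RTT          # 10 Gbps * 100 ms = 125 MB
cwnd_mss = bdp_bytes / MSS          # equilibrium window, ~83,333 MSS

# After a multiplicative decrease, AIMD must regain cwnd/2 at 1 MSS per RTT.
recovery_rtts = cwnd_mss / 2
recovery_minutes = recovery_rtts * RTT / 60

# RTT bias: cwnd increases per second scale as 1/RTT.
increases_per_sec_1ms = 1 / 0.001   # 1,000 per second
increases_per_sec_100ms = 1 / 0.100 # 10 per second
```

The computed recovery time is about 69 minutes, matching the bullet above, and the probing-rate ratio between a 1 ms and a 100 ms flow is exactly 100x.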

9.4 Act 3: “It’s 2008. The Pipe Got Wider.”

9.4.1 Which Invariant Broke?

Link speeds scaled from 10 Mbps to 10 Gbps. Propagation delays remained governed by the speed of light. The bandwidth-delay product (BDP) — the amount of data that fits “in the pipe” — grew by three orders of magnitude.

Figure 9.1: The bandwidth-delay product (BDP) is the fundamental bottleneck-sizing metric: BDP = bottleneck bandwidth × round-trip propagation delay. It represents the amount of data (in bits or packets) that can be in flight on a path while the sender waits for the first ACK — equivalently, the minimum congestion window needed to saturate the link. This metric becomes critical on high-BDP paths: a 100 Gbps link with 100 milliseconds round-trip time has a BDP of 1.25 gigabytes, requiring a congestion window of roughly 800,000 packets (at 1,500 bytes per packet). For comparison, a LAN path (10 Gbps, 0.1 ms RTT) has a BDP of only about 83 packets, while a satellite link (10 Mbps, 600 ms RTT) requires about 500 packets. TCP’s slow-start phase becomes a fundamental bottleneck on high-BDP paths. Slow start doubles the congestion window every round-trip time: starting from 1 packet, reaching 1,000 packets takes log₂(1000) ≈ 10 RTTs, or 1 second on a 100 ms path. On a 50 millisecond coast-to-coast WAN path with 10 Gbps bandwidth (BDP ≈ 42,000 packets), slow start to saturation takes about 16 RTTs = 800 milliseconds, during which the sender transmits at an average rate far below link capacity. This convergence delay is economically wasteful on premium, high-capacity links where saturation should occur in milliseconds, not seconds. Modern congestion control variants (CUBIC, BBR) address this by accelerating convergence for high-BDP paths, but the fundamental constraint remains: the window must scale with the BDP to saturate the link.
| Scenario | Bandwidth | RTT | BDP | Time to Fill (AIMD) |
|---|---|---|---|---|
| 1988 ARPANET | 56 kbps | 100 ms | 700 bytes | < 1 RTT |
| 2000 WAN | 100 Mbps | 50 ms | 625 KB | ~400 RTTs (20 s) |
| 2008 fiber | 10 Gbps | 100 ms | 125 MB | ~83,000 RTTs (~2.3 hours) |
| Datacenter | 100 Gbps | 0.5 ms | 6.25 MB | ~4,200 RTTs (2.1 s) |
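The BDP column is a single formula. A quick check, with bandwidth in bits per second and RTT in seconds:

```python
# BDP in bytes = bandwidth (bits/s) * RTT (s) / 8.

def bdp_bytes(bandwidth_bps, rtt_s):
    return bandwidth_bps * rtt_s / 8

assert bdp_bytes(56e3, 0.100) == 700            # 1988 ARPANET
assert bdp_bytes(100e6, 0.050) == 625_000       # 2000 WAN: 625 KB
assert bdp_bytes(10e9, 0.100) == 125_000_000    # 2008 fiber: 125 MB
assert bdp_bytes(100e9, 0.0005) == 6_250_000    # datacenter: 6.25 MB
```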

Reno’s linear increase (1 MSS per RTT) was too slow for high-BDP pipes. After a loss event on a 10 Gbps WAN path, cwnd drops to half. Recovering to full utilization at 1 MSS per RTT takes roughly 41,500 RTTs — over an hour, during which the link runs well below capacity.

9.4.2 Ha, Rhee, and Xu’s Redesign: CUBIC (2008)

CUBIC replaced Reno’s linear congestion avoidance with a cubic function of elapsed time since the last loss event (Ha et al. 2008):

cwnd(t) = C * (t - K)^3 + W_max

where t is wall-clock time since last loss, W_max is the pre-loss window, K is the time to reach W_max under cubic growth, and C is a scaling constant.
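The window curve is easy to probe directly. A sketch using the constants from the CUBIC paper and RFC 8312 (C = 0.4, multiplicative decrease beta = 0.7); K is derived so that cwnd(0) equals the post-loss window beta * W_max:

```python
# CUBIC window from the formula above: cwnd(t) = C * (t - K)^3 + W_max.
# K = cbrt(W_max * (1 - beta) / C), so that at t = 0 (just after loss)
# cwnd equals beta * W_max, and at t = K cwnd returns to W_max.

def cubic_cwnd(t, w_max, c=0.4, beta=0.7):
    k = (w_max * (1 - beta) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max
```

For W_max = 100, the curve starts at 70 (the 0.7x decrease), is concave approaching W_max at t = K, and convex beyond it — aggressive probing for new capacity.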

Applied closed-loop reasoning with a structurally different growth curve:

  • Near W_max: Growth is concave (slowing down) — cautious near the previous congestion point.
  • Beyond W_max: Growth is convex (speeding up) — aggressive probing for new capacity.
  • Decrease: 0.7x (less aggressive than Reno’s 0.5x), preserving more throughput after loss.

The critical innovation: growth rate is a function of wall-clock time, not RTT. A 10 ms flow and a 100 ms flow increase cwnd at the same rate in real time. This partially fixes the RTT-bias that plagued Reno: under AIMD, a 1 ms flow increases cwnd 1,000 times per second while a 100 ms flow increases only 10 times per second. CUBIC decouples growth from ACK frequency.

Concrete example: W_max = 83,000 MSS (10 Gbps link, 100 ms RTT, 1,500-byte packets). After loss, cwnd drops to 58,100 (0.7x). Under Reno, recovering 24,900 MSS at 1 per RTT takes 24,900 RTTs = 41.5 minutes. Under CUBIC, the cubic function is tuned so recovery takes ~400 RTTs = 40 seconds — a 60x improvement.

CUBIC also introduced a TCP-friendly mode: when the cubic function would grow slower than standard AIMD, CUBIC switches to AIMD-compatible growth. This ensures backward compatibility — CUBIC always matches or exceeds Reno’s bandwidth in the same conditions.

Figure 9.2: TCP’s congestion control has evolved through multiple generations, each restructuring the belief system about network state in response to environmental pressures. TCP Reno (1990) introduced slow-start with exponential growth in the early phase, then switched to linear additive-increase congestion avoidance. Loss detection (via duplicate ACKs or timeout) triggers a reset to slow-start threshold at half the current window. This linear increase model worked well for moderate-bandwidth, moderate-latency links, but becomes a bottleneck on high-bandwidth, high-latency paths where the bandwidth-delay product is enormous. CUBIC (2008) addresses this bottleneck by replacing linear increase with a cubic function, allowing faster recovery toward higher equilibrium windows on high-BDP paths. BBR represents a different paradigm: instead of inferring network state from loss, BBR explicitly models the network as a pipe with bounded bandwidth and round-trip delay, measuring these parameters directly and maintaining a window size equal to bandwidth × delay. DCTCP uses ECN marks (explicit congestion notification) proportionally: when congestion increases, ECN marks rise in frequency, allowing DCTCP to reduce the window proportionally rather than dramatically. Each variant embodies a different belief system: Reno assumes loss is the primary congestion signal and reacts sharply; CUBIC assumes high-BDP networks need gentler recovery; BBR assumes direct bandwidth measurement is possible; DCTCP assumes dense feedback via ECN marks is available. The choice of variant reflects assumptions about the network environment, feedback frequency, and the tradeoff between responsiveness and stability.

9.4.3 Before/After: Reno to CUBIC

| What Changed | Reno (1990) | CUBIC (2008) |
|---|---|---|
| Growth curve | Linear: +1 MSS/RTT | Cubic: C*(t-K)^3 + W_max |
| Growth rate depends on | RTT (ACK frequency) | Wall-clock time (RTT-independent) |
| Decrease factor | 0.5x | 0.7x |
| Recovery time (10 Gbps, 100 ms) | ~41,500 RTTs | ~400 RTTs |
| RTT fairness | Severe bias: 100x RTT difference = 100x bandwidth difference | Partially fixed: growth rate independent of RTT |
| Congestion signal | Loss | Loss (unchanged) |

The progression from Reno to CUBIC illustrates a recurring pattern: the controller (how the sender responds) improved dramatically, but the sensor (what the sender observes) remained the same. CUBIC is a better feedback controller operating on the same noisy, binary measurement signal. When the signal itself fails — as it does with bufferbloat — the controller’s sophistication is irrelevant.

9.4.4 Environment, Measurement, Belief: CUBIC

| Layer | What CUBIC Has | What’s Missing |
|---|---|---|
| Environment | True bottleneck rate, competing traffic, queue depth | Still opaque to sender |
| Measurement | Loss events + ACK timing | Direct capacity measurement absent; loss is binary |
| Belief | cwnd via cubic function; W_max as memory of previous equilibrium | Still interprets loss as sole congestion signal |

The E-M gap quality is unchanged from Jacobson: accidentally noisy. CUBIC improves the controller (cubic growth curve) but uses the same sensor (loss events). When the sensor fails — as it does with deep buffers — the controller’s sophistication is irrelevant.

9.4.5 “The Gaps Didn’t Matter… Yet”

CUBIC became the Linux default in 2006 (kernel 2.6.19) and remains the dominant WAN congestion control algorithm (Ha et al. 2008). It solved the high-BDP scaling problem elegantly. But CUBIC retained the fundamental assumption inherited from Jacobson: loss is the signal for congestion.

This assumption held as long as buffers were small enough that loss and congestion were tightly coupled. When deep buffers appeared — in home routers, cable modems, and carrier-grade equipment — the loss signal decoupled from congestion. Buffers absorbed thousands of packets without dropping. The sender saw ACKs arriving on time, believed the network was healthy, and kept increasing cwnd. RTT inflated from 5 ms to 500 ms before any packet was lost. This is bufferbloat: the belief (cwnd) diverged from reality (the queue was full) because the measurement signal (loss) was delayed by the buffer.

The bufferbloat problem revealed a deeper insight about the E-M-B decomposition: CUBIC improved the controller (how the sender responds to loss) but used the same sensor (binary loss events). When the sensor fails — when the signal decouples from the underlying reality — no amount of controller sophistication can compensate. The fix required either a better sensor (DCTCP’s ECN (Explicit Congestion Notification) marking) or a different measurement strategy entirely (BBR’s model-based probing).


9.5 Act 4: “It’s 2017. Loss Is the Wrong Signal.”

9.5.1 Which Invariant Broke?

The State invariant’s measurement channel failed. Loss-based CC assumes loss indicates congestion. In networks with deep buffers, loss occurs only after the buffer is completely full — potentially hiding congestion for thousands of RTTs. The sender’s belief (cwnd = safe capacity) diverges from reality (the path is saturated, latency has exploded).

A concrete example: a home broadband link with 100 Mbps downstream and a 10 MB buffer at the cable modem. A saturated flow hides congestion for 10 MB / (100 Mbps / 8) = 0.8 seconds. During those 800 ms, the sender increases cwnd, the buffer fills, and latency climbs from 5 ms to 800 ms. A video call that needs < 100 ms RTT becomes unusable. A web page that would load in 200 ms now takes 2 seconds. Every interactive application on the same access link suffers, because the buffer is shared.
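The hiding time in this example is just buffer size divided by drain rate:

```python
# Bufferbloat arithmetic from the example above: a 10 MB buffer in front
# of a 100 Mbps link absorbs packets (hiding congestion, inflating RTT)
# for buffer / drain-rate seconds before the first drop.

buffer_bytes = 10e6                              # 10 MB modem buffer
link_bytes_per_s = 100e6 / 8                     # 100 Mbps = 12.5 MB/s
drain_time_s = buffer_bytes / link_bytes_per_s   # 0.8 s of hidden congestion
```

A full buffer also means every packet waits behind 10 MB of queue — the same 0.8 s appears as added latency for every flow sharing the link.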

The diagnosis through the E-M-B framework is precise: the Environment changed (deep buffers deployed for throughput), the Measurement signal (loss) decoupled from congestion (buffers absorb packets instead of dropping), and the Belief (cwnd = safe capacity) diverged from reality (the path is saturated). The sender’s estimator is functioning correctly — it just has the wrong sensor.

9.5.2 Cardwell’s Redesign: BBR (2017)

BBR (Bottleneck Bandwidth and Round-trip propagation time) replaced the cwnd belief system entirely (Cardwell et al. 2017). Instead of a proxy for capacity (cwnd), BBR maintains an explicit network model:

  • BtlBw: Bottleneck bandwidth — the maximum delivery rate observed over the last 10 RTTs.
  • RTprop: Round-trip propagation delay — the minimum RTT observed over the last 10 seconds.
  • Target: Send at rate = BtlBw, keeping in-flight bytes at BtlBw * RTprop (the BDP).

Applied closed-loop reasoning with a fundamentally different measurement interpretation: loss is no longer the congestion verdict. BBR increases its rate periodically, observes the resulting delivery rate, and updates its model.

BBR operates in four phases:

  1. STARTUP: Exponential increase until BtlBw estimate stops growing for 3 RTTs. Goal: find bottleneck bandwidth quickly.
  2. DRAIN: Reduce in-flight packets to the BDP. Goal: drain queues built during startup.
  3. PROBE_BW: Every 10 RTTs, increase rate by 25% to probe for more bandwidth. If delivery rate drops, reduce back.
  4. PROBE_RTT: Every 10 seconds, reduce in-flight to 4 MSS for 200 ms. Goal: refresh RTprop estimate with minimal queueing.
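The core of BBR’s model is two windowed filters over per-ACK measurements. A simplified sketch — the class is illustrative and windows by sample count, whereas real BBR windows BtlBw and RTprop by time, as described above:

```python
from collections import deque

# Simplified BBR model state: a windowed-max filter for bottleneck
# bandwidth and a windowed-min filter for propagation delay.

class BbrModel:
    def __init__(self, bw_window=10, rtt_window=10):
        self.bw_samples = deque(maxlen=bw_window)    # recent delivery rates
        self.rtt_samples = deque(maxlen=rtt_window)  # recent RTT samples

    def on_ack(self, delivery_rate, rtt):
        """Record one ACK's measurements (bytes/s, seconds)."""
        self.bw_samples.append(delivery_rate)
        self.rtt_samples.append(rtt)

    def btlbw(self):
        return max(self.bw_samples)    # max filter: queues only lower rates

    def rtprop(self):
        return min(self.rtt_samples)   # min filter: queues only raise RTT

    def bdp(self):
        # Target in-flight data: BtlBw * RTprop.
        return self.btlbw() * self.rtprop()
```

The max/min asymmetry encodes the model’s logic: queueing can only reduce observed delivery rate and inflate observed RTT, so the best recent sample of each is the best estimate of the underlying path.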
Figure 9.3: The sawtooth pattern visualizes the closed-loop dynamics of TCP’s AIMD (Additive Increase, Multiplicative Decrease) congestion control, where the sender probes available capacity by increasing the congestion window (cwnd) at a linear rate until loss is detected, then halves cwnd in response. Observe how the additive increase phase (roughly 1 packet per RTT) gradually fills the pipe and queues, while loss events (marked by red dots) trigger multiplicative decreases that sharply reduce cwnd by 50%. The green dashed line at cwnd=32 shows the slow-start threshold (ssthresh), which separates exponential probing from linear probing. This equilibrium—where the rate of additive gains roughly balances multiplicative losses over many cycles—emerges not as a bug, but as the system reaching a stable point where loss probability is low but not zero, achieving high utilization without overwhelming the bottleneck. The loop delay is RTT, which governs how quickly the sender can react to congestion. A datacenter flow with 1 ms RTT reacts 100x faster than a wide-area flow with 100 ms RTT, creating a fundamental coupling between the time model and the convergence properties of the adaptation loop. Flows with longer RTTs reach equilibrium slowly and are biased against in competing for capacity, illustrating how the time model directly shapes fairness.

Concrete comparison: On a 10 Gbps link with 100 ms RTT and a 10 MB buffer:

  • CUBIC: Fills the buffer before detecting loss. Queue grows to 10 MB, adding 8 ms of queuing delay. RTT inflates from 100 ms to 108 ms. After loss, cwnd drops to 0.7x and slowly recovers. Average queue depth: 5 MB. Average queuing delay: 4 ms.
  • BBR: Measures BtlBw = 10 Gbps and RTprop = 100 ms. Sends at BDP = 125 MB in flight. Queue target: zero. Average queuing delay: < 1 ms.

The difference is structural: CUBIC fills the pipe and the buffer before backing off. BBR fills the pipe and drains the buffer by design. For latency-sensitive applications (video conferencing, interactive web), this difference is transformative.

9.5.3 Invariant Analysis: BBR (2017)

| Invariant | BBR Answer (2017) | Gap? |
|---|---|---|
| State | (BtlBw, RTprop) — explicit model, not proxy | Model assumes constant bottleneck; wrong in multi-tenant networks |
| Time | Phase-based probing on scheduled intervals | RTprop stale for up to 10 seconds after routing changes |
| Coordination | Model-based; probes past loss | Coexistence failure: BBR starves CUBIC flows (takes 95% capacity) |
| Interface | Unchanged (ACK-based) | Interprets ACKs differently but uses same signal |

The Coordination gap is BBR’s most consequential failure. When BBR and CUBIC share a bottleneck, they interpret loss differently. BBR probes past loss; CUBIC treats that same loss as a congestion signal and backs cwnd off. BBR then fills the capacity that CUBIC vacates. Over time, BBR can consume 95% of bottleneck bandwidth while CUBIC gets 5%. BBRv2 and BBRv3 add loss-aware fairness constraints, but the fundamental tension persists: a protocol that probes past loss will dominate a protocol that fears loss.

The model assumptions create additional gaps. BBR assumes bottleneck bandwidth is constant, which fails in multi-tenant environments where neighbors’ traffic changes. It also assumes some packets occasionally see an empty queue: on paths with persistent queueing, such as a datacenter carrying constant background traffic, RTprop samples systematically overestimate the true propagation delay.

9.5.4 Environment, Measurement, Belief

| Layer | What BBR Has | What’s Missing |
|---|---|---|
| Environment | True bottleneck rate, queue depth | Still opaque; measured indirectly |
| Measurement | Delivery rate (from ACK spacing) + min RTT | Delivery rate noisy; RTprop estimate goes stale between probes |
| Belief | (BtlBw, RTprop) = explicit model | BtlBw assumes stationarity; multi-tenant traffic violates this |

The E-M gap quality changed. Jacobson’s gap was accidentally noisy (honest but imperfect ACK signal). BBR’s gap is structurally filtered: the delivery rate measurement reflects not just the bottleneck but also competing traffic, receiver-side buffering, and link-layer retransmissions. BBR smooths this noise with windowed max/min filters, but some information is permanently lost.
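The windowed-max idea can be sketched in a few lines (a simplified, index-windowed filter for illustration only; BBR proper windows BtlBw samples over roughly ten RTTs, and the sample values here are invented):

```python
from collections import deque

def windowed_max(samples, window):
    """Windowed-max filter in the spirit of BBR's BtlBw estimator:
    report the largest delivery-rate sample seen in the last `window`
    samples. Uses a monotonic deque of (index, value) pairs."""
    out, dq = [], deque()
    for i, v in enumerate(samples):
        while dq and dq[-1][1] <= v:   # drop dominated samples
            dq.pop()
        dq.append((i, v))
        if dq[0][0] <= i - window:     # expire samples outside the window
            dq.popleft()
        out.append(dq[0][1])
    return out

# Delivery-rate samples (Gbps): noise plus a dip from a competing flow.
rates = [9.8, 9.9, 10.0, 6.0, 6.1, 9.7, 9.9]
print(windowed_max(rates, window=3))
# -> [9.8, 9.9, 10.0, 10.0, 10.0, 9.7, 9.9]
```

Note how the filter rides through the two-sample dip but eventually forgets the 10.0 peak once it ages out of the window — the same mechanism that lets a stale BtlBw estimate persist for a while after conditions change.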

9.5.5 “The Gaps Didn’t Matter… Yet”

Despite the fairness concerns, BBR achieved lower latency and higher throughput than CUBIC on Google’s global network. For a single operator controlling both endpoints (Google servers to user browsers via QUIC), the coexistence problem is muted — Google’s traffic competes primarily with itself. BBR is deployed at scale in Google infrastructure: YouTube, Cloud, and all QUIC connections.

The gap that matters is deployment scope. BBR works well when one operator controls the sender (and can choose the CC algorithm). It works poorly on shared infrastructure where heterogeneous CC algorithms compete. As BBR adoption grows beyond Google, the fairness problem becomes increasingly visible. BBRv3 (2023) addresses this with explicit loss-rate monitoring and proportional backoff — but the tension between model-based probing and loss-based conservation remains fundamental.


9.6 Act 5: “It’s 2010. The Datacenter Changes Everything.”

9.6.1 Which Invariant Broke?

Datacenters introduced a qualitatively different environment: sub-millisecond RTTs, single administrative authority, commodity switches with shallow buffers, and workloads mixing latency-sensitive “mice” flows (short queries) with throughput-sensitive “elephant” flows (bulk transfers).

The critical failure was incast collapse: in partition-aggregate workloads (MapReduce, search), dozens or hundreds of workers respond simultaneously to a single aggregator. Their synchronized traffic overwhelms the switch buffer. All flows experience loss simultaneously. All halve cwnd. When the buffer drains, all increase together — synchronized oscillation with utilization collapsing to 10%.

Standard TCP (Reno or CUBIC) in the datacenter produces two failures: (1) elephant flows fill buffers, inflating latency for mice flows, and (2) incast causes synchronized loss and throughput collapse. Both failures trace to the same State gap: loss is too coarse a signal for sub-millisecond environments.

Concrete incast example: A MapReduce job with 100 workers, each holding 1 MB of results. The aggregator sends a request; all 100 workers respond simultaneously over 10 Gbps links to a top-of-rack switch with a 1 MB buffer. Total arriving data: 100 MB. Buffer capacity: 1 MB. The buffer overflows within microseconds, and the bulk of the remaining 99 MB is dropped. All 100 flows detect loss and halve cwnd. When retransmissions begin, they are again synchronized. Throughput oscillates between near-zero and momentary bursts. Effective utilization: 10-20% of link capacity.
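A quick back-of-envelope check of the burst sizes (illustrative numbers from the scenario above, not a simulation):

```python
# Incast scenario: synchronized responses into one shallow-buffered port.
workers = 100
per_worker_bytes = 1_000_000     # 1 MB of results per worker
buffer_bytes = 1_000_000         # 1 MB top-of-rack buffer
link_bps = 10e9                  # 10 Gbps egress toward the aggregator

offered = workers * per_worker_bytes
print(f"Offered burst: {offered / 1e6:.0f} MB into a "
      f"{buffer_bytes / 1e6:.0f} MB buffer")   # 100 MB into 1 MB

# While the burst arrives, the egress link drains the buffer at line rate:
drain_s = buffer_bytes * 8 / link_bps
print(f"One full buffer drains in {drain_s * 1e6:.0f} us")   # 800 us
```

The 100:1 mismatch between offered burst and buffer capacity is what makes the collapse synchronized: no per-flow backoff schedule can help when every flow sees loss in the same microsecond window.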

9.6.2 Alizadeh’s Redesign: DCTCP (2010)

DCTCP (Data Center TCP) extracted multi-bit feedback from the single-bit ECN field (Alizadeh et al. 2010; Ramakrishnan et al. 2001):

  • Switches mark the Congestion Experienced (CE) bit when queue depth exceeds threshold K.
  • The sender tracks alpha = EWMA of the fraction of marked packets in each window.
  • Decrease: cwnd = cwnd * (1 - alpha/2).
  • If 10% of packets are marked: 5% decrease. If 90% are marked: 45% decrease.
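The update rules above can be sketched directly (g = 1/16 is the EWMA gain used in the DCTCP paper; the marked fractions are the illustrative values above):

```python
def dctcp_update(alpha, marked_frac, g=1/16):
    """One-window DCTCP update: EWMA of the fraction of marked packets."""
    return (1 - g) * alpha + g * marked_frac

def dctcp_decrease(cwnd, alpha):
    """Proportional decrease: cwnd *= (1 - alpha/2)."""
    return cwnd * (1 - alpha / 2)

# At 10% steady-state marking the decrease is 5%; at 90% it is 45%.
print(dctcp_decrease(100, 0.10))   # 95.0
print(dctcp_decrease(100, 0.90))   # 55.0  (a 45% decrease)
```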

Applied closed-loop reasoning with a richer measurement signal: instead of binary loss (all-or-nothing), DCTCP receives a continuous congestion gradient. The marking rate is proportional to queue depth, so the sender learns not just “congested or not” but “how congested.”

Applied disaggregation by separating the measurement signal (ECN from switches) from the control response (proportional decrease at endpoints). Switches provide signals; endpoints interpret them. This preserves endpoint autonomy while adding router participation — the “gateway fix” that Jacobson called for in 1988.

Decision placement shifted subtly: switches now actively participate in congestion signaling (marking packets), but endpoints retain all rate-control authority. This is the first change to the measurement interface since Jacobson 1988.

9.6.3 Invariant Analysis: DCTCP (2010)

| Invariant | DCTCP Answer (2010) | Gap? |
|---|---|---|
| State | cwnd + alpha (fraction marked); continuous signal | Requires ECN-capable switches; single-admin only |
| Time | Same RTT estimator; ECN marks arrive within 1 RTT | Marking threshold K must be tuned per deployment |
| Coordination | Proportional decrease (gentler than AIMD halving) | Cannot coexist with standard TCP (too gentle, gets starved) |
| Interface | ECN from switches — new measurement signal | Requires coordinated deployment across all switches |

The deployment constraint is DCTCP’s binding limitation: it requires every switch on the path to support ECN marking, and sharing links with standard TCP leads to starvation. This restricts DCTCP to single-admin environments — datacenters where one operator controls all switches and can mandate the protocol.

Concrete example: A datacenter switch with 10 Gbps links and 1,500-byte packets. Marking threshold K = 20 packets. Queue sojourn time at K: 20 * 1,500 * 8 / 10 Gbps = 24 microseconds. In steady state, queue depth oscillates around K. Average queuing delay: ~24 microseconds. Compare with CUBIC on the same switch: the queue grows until packets are dropped, potentially filling a 10 MB buffer and causing 8 ms of queuing delay — roughly 300x worse.
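The marking-threshold arithmetic, checked in code (same illustrative switch parameters):

```python
# Queue sojourn time when the queue sits at the marking threshold K.
link_bps = 10e9      # 10 Gbps link
pkt_bytes = 1500     # packet size
K_pkts = 20          # DCTCP marking threshold, in packets

sojourn_s = K_pkts * pkt_bytes * 8 / link_bps
print(f"Delay at threshold K: {sojourn_s * 1e6:.0f} us")   # 24 us

# Versus a loss-based sender filling a 10 MB buffer on the same switch:
full_s = 10e6 * 8 / link_bps
print(f"Full 10 MB buffer: {full_s * 1e3:.0f} ms "
      f"({full_s / sojourn_s:.0f}x worse)")                # 8 ms, ~333x
```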

9.6.4 Environment, Measurement, Belief: DCTCP

| Layer | What DCTCP Has | What’s Missing |
|---|---|---|
| Environment | True queue depth at each switch | Sender sees only aggregate marking rate |
| Measurement | Fraction of ECN-marked packets per window | Continuous signal, but marking threshold K is static |
| Belief | alpha = EWMA of marked fraction | Proportional to congestion severity; much richer than binary loss |

The E-M gap quality fundamentally changed. DCTCP’s measurement signal is physically richer than loss: ECN marks arrive before packets are dropped, and their frequency encodes congestion severity. This is not a better estimator for the same signal (as CUBIC was for Reno) — it is a different sensor entirely. The diagnostic: DCTCP moved from accidentally noisy measurement (loss) to designed measurement (ECN marking). The cost: the new sensor requires coordinated deployment.

9.6.5 Before/After: Standard TCP to DCTCP

| What Changed | Standard TCP (CUBIC) | DCTCP (2010) |
|---|---|---|
| Congestion signal | Loss (binary) | ECN marks (proportional) |
| Decrease rule | Halve cwnd (0.5x or 0.7x) | cwnd * (1 - alpha/2), proportional |
| Queue target | Buffer fills before reaction | Queue stays at threshold K |
| Incast handling | Synchronized collapse | Proportional marking prevents overflow |
| Deployment scope | Global Internet | Single-admin datacenter only |

9.6.6 “The Gaps Didn’t Matter… Yet”

DCTCP’s single-admin constraint was tolerable because the target environment — hyperscale datacenters — is precisely where single-admin deployment is feasible. Google, Microsoft, and Amazon all deployed DCTCP or DCTCP-inspired protocols within their datacenters. The constraint excluded the wide-area Internet, but that was acceptable because the datacenter workloads (partition-aggregate, storage replication) that benefit most from DCTCP exist only within single-admin domains.

The gap that would matter next was different: even inside the datacenter, DCTCP’s measurement signal (ECN) provides only queue-depth information. It is silent on link utilization, the number of competing flows, and the arrival rate of new traffic. Later datacenter protocols — HPCC (Li et al. 2019) and Swift (Kumar et al. 2020) — added richer in-network telemetry: INT (In-band Network Telemetry) headers carrying precise queue depth, link utilization, and timestamp information per hop. This is the E-M-B progression: from binary loss (TCP) to proportional marking (DCTCP) to per-hop telemetry (HPCC/Swift).


9.7 Act 6: “It’s 2017. TCP Cannot Evolve.”

9.7.1 Which Invariant Broke?

The Interface invariant — TCP’s permanent binding constraint — became a liability. TCP is implemented in operating system kernels: deploying a new congestion control algorithm requires an OS update, which takes months to years. Worse, middleboxes (NATs, firewalls, intrusion detection systems) inspect TCP headers and block packets with unfamiliar options. Honda et al. measured that 35% of Internet paths drop packets carrying unknown TCP options (Honda et al. 2011) — meaning TCP extensions are effectively undeployable across much of the Internet.

This is protocol ossification: the loss of protocol flexibility caused by deployed infrastructure that enforces expectations about protocol behavior. TCP’s header format, option space, and behavior became frozen by the accumulated assumptions of millions of middleboxes.

The consequences were concrete. TCP Fast Open (RFC 7413 (Cheng et al. 2014), 2014) enabled 0-RTT connection setup by caching a cryptographic cookie — a simple, backward-compatible extension. Deployment stalled because middleboxes dropped SYN packets containing unfamiliar options. Multipath TCP (MPTCP, RFC 6824 (Ford et al. 2013), 2013) added subflow management to existing connections. Again, middleboxes interfered. The pattern: any TCP extension that modified the header format or option space was blocked by infrastructure that the protocol designers lacked control over.

9.7.2 Langley’s Redesign: QUIC (2017)

QUIC escaped TCP’s ossification by building a new transport protocol over UDP (Langley et al. 2017):

  • UDP as substrate: Middleboxes already pass UDP traffic. By wrapping transport logic in UDP, QUIC avoids TCP’s frozen header format.
  • User-space implementation: QUIC runs in application libraries, not OS kernels. Updates deploy with application releases, not OS patches.
  • Encrypted headers: QUIC encrypts all transport metadata (connection state, stream IDs, sequence numbers). Middleboxes see only opaque bytes, preventing future ossification.
  • Stream multiplexing: Multiple independent streams within one connection. Loss on one stream leaves others unblocked — fixing TCP’s head-of-line blocking.
  • 0-RTT connection setup: Combined cryptographic and transport handshake. Repeat connections can send data immediately, eliminating the latency cost of TCP’s three-way handshake plus TLS setup.
  • Connection migration: Connections identified by Connection IDs, not 4-tuples. A mobile device switching from WiFi to cellular keeps its QUIC connection alive.

Applied disaggregation by separating transport logic from the OS kernel and separating per-stream reliability from per-connection ordering. TCP bundles reliability, ordering, and congestion control into one byte stream; QUIC disaggregates them. This disaggregation enables independent loss recovery per stream: if stream A loses a packet, stream B continues delivering data. In TCP with HTTP/2, a single lost packet blocks all multiplexed streams — even those with no data in the lost packet.
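Head-of-line blocking can be illustrated with a toy delivery model (an entirely hypothetical packet trace; `deliverable` is an invented helper for this sketch, not a QUIC API):

```python
def deliverable(packets, lost, per_stream):
    """Which payloads reach the application before a loss is repaired?
    packets: (stream, seq) pairs in send order; lost: set of packet indices.
    per_stream=False models TCP's single ordered byte stream (HTTP/2 over
    TCP); per_stream=True models QUIC's independently ordered streams.
    Toy rule: delivery within an ordering domain stops behind its first hole."""
    delivered = []
    blocked = set()  # ordering domains currently stuck behind a hole
    for i, (stream, seq) in enumerate(packets):
        domain = stream if per_stream else "whole-connection"
        if i in lost:
            blocked.add(domain)
        elif domain not in blocked:
            delivered.append((stream, seq))
    return delivered

pkts = [("A", 1), ("B", 1), ("A", 2), ("B", 2)]
print(deliverable(pkts, lost={0}, per_stream=False))  # [] -- all streams stall
print(deliverable(pkts, lost={0}, per_stream=True))   # [('B', 1), ('B', 2)]
```

One lost packet on stream A blocks nothing on stream B under QUIC’s model, but blocks the entire connection under TCP’s single byte stream.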

Decision placement for congestion control remained endpoint-local: QUIC runs the same CC algorithms (BBR, CUBIC) internally. The innovation lies in Interface, not CC — how transport is packaged and deployed. This is a critical distinction: QUIC changes where the code runs (user-space vs kernel) and what it encrypts (everything), but it preserves the fundamental CC architecture (endpoint-driven, ACK-based, distributed).

Applied closed-loop reasoning for connection setup: QUIC’s 0-RTT handshake combines cryptographic key exchange and transport parameter negotiation into a single round trip. For repeat connections, the client caches session parameters and can send data immediately — zero round trips of setup latency. This reduces page-load time by 1–2 RTTs compared to TCP + TLS 1.3, which is significant on high-RTT paths (50–100 ms WAN) and transformative on satellite links (300+ ms).
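The setup-latency difference can be put in a simplified model (assumes a full TLS 1.3 handshake over TCP and ignores TLS session resumption; `setup_latency_rtts` is an invented helper for this sketch):

```python
def setup_latency_rtts(protocol, repeat):
    """Connection-setup cost in RTTs before application data can flow.
    Simplified: TCP handshake (1 RTT) + full TLS 1.3 handshake (1 RTT)
    vs. QUIC's combined handshake (1 RTT, or 0 RTT on repeat visits)."""
    if protocol == "tcp+tls13":
        return 2                      # SYN/SYN-ACK, then TLS 1.3
    if protocol == "quic":
        return 0 if repeat else 1     # cached parameters enable 0-RTT
    raise ValueError(protocol)

for rtt_ms in (50, 300):              # WAN path vs. satellite path
    saved = (setup_latency_rtts("tcp+tls13", repeat=True)
             - setup_latency_rtts("quic", repeat=True)) * rtt_ms
    print(f"RTT {rtt_ms} ms: QUIC 0-RTT saves {saved} ms of setup")
```

On the 300 ms satellite path the two saved round trips amount to 600 ms before the first byte of application data — the “transformative” case noted above.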

9.7.3 Invariant Analysis: QUIC (2017)

| Invariant | QUIC Answer (2017) | Gap? |
|---|---|---|
| State | Per-stream state; Connection ID enables migration | CC algorithms inherited from TCP (same State gaps) |
| Time | 0-RTT/1-RTT handshake; same CC timing internally | Same RTT estimation challenges |
| Coordination | Same distributed CC as TCP | Same fairness problems (BBR vs CUBIC) |
| Interface | UDP wrapper; encrypted headers; stream multiplexing | Middleboxes that block UDP also block QUIC |

QUIC’s contribution is primarily to the Interface invariant. It demonstrates that when the binding constraint itself ossifies, the solution is to change the deployment layer — move the interface to a substrate that middleboxes cannot inspect.

9.7.4 Environment, Measurement, Belief: QUIC

| Layer | What QUIC Has | What’s Missing |
|---|---|---|
| Environment | Same network as TCP | Same invisibility |
| Measurement | Same ACK-based signals (per-stream) | Same noise, but per-stream loss isolation |
| Belief | Same CC models (BBR/CUBIC internally) | Same CC gaps; innovation is in deployment, not CC |

The E-M gap is identical to TCP’s for congestion control. QUIC’s innovation is orthogonal: it changes how transport is packaged and deployed, while congestion sensing and control remain unchanged. The lesson: when the Interface invariant ossifies, restructuring the other invariants (State, Time, Coordination) is insufficient — the interface itself must change.

Concrete deployment impact: By 2017, Google reported that 35% of its egress traffic used QUIC. HTTP/3 (RFC 9114 (Bishop 2022)) standardized on QUIC as its transport. IETF QUIC (RFC 9000 (Iyengar and Thomson 2021), 2021) diverged from Google’s original design but preserved the core architectural choices: UDP substrate, mandatory encryption, stream multiplexing, connection migration.


9.8 The Grand Arc: From Cerf/Kahn to QUIC

9.8.1 The Evolving Anchor

The binding constraint — the IP datagram interface — remained constant across all six acts. Every innovation operated within the same Interface and Coordination constraints. What changed was the State invariant (what the sender believes about the network) and the Time invariant (how timing decisions are made), restructured generation after generation as the environment evolved.

This constancy distinguishes transport from routing (where the binding constraint shifts across eras). TCP’s meta-constraint — IP datagrams exist, and no entity has central authority — never changed. The dependency chain is stable: Interface (locked) forces Coordination (distributed), which forces State (inferred from local signals), which forces Time (estimated from ACK stream). Every TCP variant navigates within this chain. The variation is in the quality of the estimator and the richness of the signal.

| Era | Environment | State (Belief) | Time | What Broke |
|---|---|---|---|---|
| 1974 | Small ARPANET, < 200 hosts | Connection state only | Fixed timers | “Enough capacity” assumption |
| 1988 | 5,000+ hosts, congestion collapse | cwnd + SRTT/RTTVAR | EWMA estimator | Loss signal delayed by deep buffers |
| 2008 | 10 Gbps fiber, high-BDP | Cubic function of wall-clock time | RTT-independent growth | Loss still sole signal; bufferbloat |
| 2017 (BBR) | Deep buffers, bufferbloat | (BtlBw, RTprop) explicit model | Phase-based probing | Fairness with loss-based flows |
| 2010 (DCTCP) | Datacenter: 1 ms RTT, ECN | alpha (ECN fraction) | ECN faster than loss | Single-admin only; cannot coexist |
| 2017 (QUIC) | Ossified TCP; middlebox interference | Per-stream state; Connection ID | 0-RTT handshake | UDP blocking in some networks |

9.8.2 Three Design Principles Applied Across the Arc

Disaggregation was applied repeatedly, each time separating a new concern:

  • Cerf/Kahn applied disaggregation by separating endpoint state (TCP) from network forwarding (IP). Routers are stateless; endpoints maintain all connection semantics.
  • Jacobson applied disaggregation by separating per-flow state (cwnd, SRTT) from network state. Each sender tracks its own capacity belief independently.
  • QUIC applied disaggregation by separating per-stream reliability from per-connection ordering, and separating transport logic from the OS kernel.

Closed-loop reasoning was the dominant design strategy, refined across every generation:

  • Jacobson applied closed-loop reasoning with the AIMD feedback loop: measure loss, adjust cwnd, send, repeat. Loop period = RTT.
  • BBR applied closed-loop reasoning with a model-based loop: measure delivery rate and min RTT, update (BtlBw, RTprop) model, set rate to match model.
  • DCTCP applied closed-loop reasoning with a richer measurement signal: ECN marks provide a continuous congestion gradient instead of binary loss.

Decision placement remained consistently distributed across all generations. Every TCP variant keeps rate-control authority at endpoints. DCTCP added switch-side measurement (ECN marking) but preserved endpoint-side decision-making. This consistency follows from the binding constraint: no entity has global authority over Internet traffic, so rate decisions must be local.

The one context where decision placement shifted was the datacenter. Under single administrative authority, DCTCP coordinates switches (marking) and endpoints (proportional response). HPCC goes further: switches provide per-hop telemetry, and endpoints compute rate adjustments from the full path signal. This is possible only because the datacenter operator controls both switches and hosts — the Coordination invariant’s answer changes when the meta-constraint changes from “no central authority” to “single administrator.”

The transport lineage demonstrates a general pattern: when the binding constraint is constant (IP interface, distributed authority), innovation concentrates in the State and Time invariants. Designers restructure what the system believes and how it measures, because the interface and coordination model are frozen. The two exceptions — DCTCP (new measurement interface via ECN) and QUIC (new transport interface via UDP) — both required environments where the deployment barrier was lower: single-admin datacenters and user-space libraries, respectively.

9.8.3 The Dependency Chain

```mermaid
flowchart TD
    A[IP datagrams: unreliable, unordered, no feedback]:::constraint --> B[Coordination: distributed endpoints, no central authority]:::constraint
    B --> C[State: must infer capacity from local signals]:::constraint
    C --> D[Time: must estimate RTT from ACK stream]:::constraint

    E[1986: Congestion collapse]:::failure --> F[Jacobson 1988: cwnd + AIMD + SRTT]:::fix
    F --> G[Loss = congestion signal assumption]:::constraint
    G --> H[High-BDP: linear increase too slow]:::failure
    H --> I[CUBIC 2008: cubic growth, RTT-independent]:::fix
    G --> J[Bufferbloat: loss delayed by deep buffers]:::failure
    J --> K[BBR 2017: model-based, BtlBw + RTprop]:::fix
    G --> L[Datacenter: incast, shallow buffers, low RTT]:::failure
    L --> M[DCTCP 2010: proportional ECN marking]:::fix

    N[TCP ossification: middleboxes block extensions]:::failure --> O[QUIC 2017: transport over UDP, encrypted]:::fix

    classDef constraint fill:#4a90d9,stroke:#2c5f8a,color:white
    classDef failure fill:#d94a4a,stroke:#8a2c2c,color:white
    classDef fix fill:#4ad94a,stroke:#2c8a2c,color:white
```

9.8.4 Pioneer Diagnosis Table

| Year | Pioneer | Invariant | Diagnosis | Contribution |
|---|---|---|---|---|
| 1974 | Cerf & Kahn | Interface | Heterogeneous networks need common protocol | TCP/IP: byte stream over datagrams |
| 1984 | Nagle | State | Senders flood network without congestion awareness | RFC 896: named congestion collapse |
| 1986 | Zhang | Time | RTT estimation too crude for reliable timers | Identified timer failure modes |
| 1987 | Karn & Partridge | State | Retransmit ambiguity corrupts RTT samples | Karn’s algorithm: discard ambiguous samples |
| 1988 | Jacobson | State + Time | Conservation of packets violated three ways | Slow start, AIMD, SRTT/RTTVAR, fast retransmit |
| 1989 | Chiu & Jain | Coordination | Distributed rate control needs convergence proof | AIMD convergence to fairness |
| 2008 | Ha, Rhee, Xu | State | Linear increase too slow for high-BDP | CUBIC: cubic growth, RTT-independent |
| 2010 | Alizadeh et al. | State | Loss too coarse for datacenter | DCTCP: proportional ECN decrease |
| 2017 | Cardwell et al. | State | Loss is the wrong signal (bufferbloat) | BBR: model-based CC (BtlBw, RTprop) |
| 2017 | Langley et al. | Interface | TCP ossified by middleboxes and kernels | QUIC: transport over UDP, encrypted |

9.8.5 The General Principle

The tighter the coupling between measurement and environment, the better the closed-loop performance. ECN is tighter than loss-based signaling. Loss is tighter than timeout. (BtlBw, RTprop) is tighter than cwnd. Per-hop INT telemetry is tighter than endpoint-only ACK signals. The cost of tighter coupling is always higher per-packet overhead and stricter deployment requirements. The design trade-off is eternal: tighter coupling enables higher utilization and lower latency, but at the cost of complexity and deployment friction.

9.8.6 Innovation Timeline

```mermaid
flowchart TD
    subgraph sg1["Origins"]
        A1["1974 — Cerf/Kahn: TCP/IP"]
        A2["1981 — RFC 793"]
        A1 --> A2
    end
    subgraph sg2["Diagnosis"]
        B1["1984 — Nagle: RFC 896"]
        B2["1986 — Zhang: timer critique"]
        B3["1987 — Karn/Partridge: fix"]
        B1 --> B2 --> B3
    end
    subgraph sg3["The Jacobson Era"]
        C1["1988 — Jacobson: slow start + AIMD"]
        C2["1989 — RFC 1122 mandate"]
        C3["1990 — Reno fast recovery"]
        C1 --> C2 --> C3
    end
    subgraph sg4["Refinement"]
        D1["1992 — RFC 1323: window scaling"]
        D2["1993 — Floyd: RED"]
        D3["1996 — SACK (RFC 2018)"]
        D1 --> D2 --> D3
    end
    subgraph sg5["Scaling"]
        E1["2008 — Ha/Rhee/Xu: CUBIC"]
        E2["2010 — Alizadeh: DCTCP"]
        E1 --> E2
    end
    subgraph sg6["Model-Based"]
        F1["2017 — Cardwell: BBR"]
        F2["2017 — Langley: QUIC"]
        F3["2021 — RFC 9000: IETF QUIC"]
        F1 --> F2 --> F3
    end
    sg1 --> sg2 --> sg3 --> sg4 --> sg5 --> sg6
```

Transport and Congestion Control: 1974–2021


9.9 Generative Exercises

Exercise 1: TCP for Mars

Mars rovers communicate with Earth at 20-minute one-way latency. RTT = 2,400 seconds.

  1. Predict the closed-loop failure: Slow start from cwnd = 1 doubles once per RTT, so reaching cwnd = 100 takes ceil(log_2(100)) = 7 RTTs, or 7 * 2,400 = ~16,800 seconds (~4.7 hours). AIMD then adds 1 MSS per 2,400-second RTT. Loss recovery takes one RTO (~3,600 seconds = 1 hour). What happens to throughput for a file transfer?

  2. Diagnose through invariants: Which invariant is under greatest pressure? (Time: the feedback loop is 2,400 seconds. The sender is blind for 40 minutes between sending and learning the result.) Can you keep the IP interface but change the Time model? What if you replace inferred timing with prescribed timing based on orbital mechanics?

  3. Design a fix: Propose a preloading mechanism — before losing connectivity, precompute a cwnd schedule from ground-truth Earth-Mars link models. Upload the schedule to the rover. The rover increases cwnd on wall-clock time, not RTT. What invariant answers change? What new failure modes appear if the precomputed model is wrong? What if a solar event disrupts the link mid-schedule?

  4. Alternative architecture: Propose a delay-tolerant transport protocol that does not use ACKs for reliability. Instead, use forward error correction (FEC) to encode redundancy at the sender. How does removing the ACK-based feedback loop change all four invariant answers? What is the cost in bandwidth efficiency?

Exercise 2: Heterogeneous CC Coexistence

A campus network carries traffic from three populations: students using CUBIC (browser default), a streaming service using BBR, and a research cluster using DCTCP with ECN. All share a single 10 Gbps backbone link with a 10 MB buffer.

  1. Predict the interaction: BBR probes past loss. CUBIC halves cwnd on loss. DCTCP responds to ECN marks. Who dominates? Who starves?

  2. Diagnose through the Coordination invariant: AIMD converges to fairness only when all flows use the same algorithm. When flows use different CC algorithms, what happens to the convergence proof? Can you define “fairness” when flows have different loss responses?

  3. Design a solution: Propose a mechanism that achieves some notion of fairness across heterogeneous CC algorithms. Where would you place this mechanism — at endpoints, at switches, or at both? What are the deployment constraints?

  4. L4S prediction: L4S (Low Latency, Low Loss, Scalable Throughput, RFC 9332 (De Schepper and Briscoe 2023)) proposes separating traffic into two queues: a “classic” queue for legacy TCP and an “L4S” queue for scalable CC algorithms (like DCTCP-style). Analyze this through the Coordination invariant: does separating queues solve the fairness problem, or does it create a new one? What happens when 90% of traffic is L4S and 10% is classic?

Exercise 3: Transport for a Satellite Constellation

LEO satellite constellations (Starlink, Kuiper) introduce an environment with properties of both WAN and datacenter: variable RTTs, single operator, frequent path changes.

A LEO satellite constellation (Starlink-like) has variable RTTs: 20 ms when the satellite is overhead, 80 ms at low elevation angles, with handoffs every 15 seconds as satellites move. The path changes completely every 90 seconds.

  1. Predict the measurement failure: SRTT with alpha = 1/8 tracks RTT changes slowly. How many samples to converge after a 20 ms to 80 ms jump? What happens to RTO during the transition?

  2. BBR vs CUBIC: BBR’s RTprop estimate uses minimum RTT over 10 seconds. What happens when the path’s true propagation delay increases by 4x every 15 seconds? How does CUBIC’s loss-based approach handle the same scenario?

  3. Design the right CC: This environment has properties of both WAN (variable RTT, multi-hop) and datacenter (single operator controls all satellites). Which existing CC algorithm is closest to correct? What modification would you make, and which invariant does it restructure?

  4. Connection migration: A user drives along a highway while on a video call. The ground terminal hands off between satellite beams every 15 seconds. TCP connections using 4-tuple identification break on every handoff. QUIC’s Connection ID survives handoffs. What other transport-layer changes are needed for seamless satellite handoff? Consider: does the path’s BDP change during handoff? Should the CC state (cwnd or BtlBw estimate) be preserved across handoffs or reset?
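As a starting point for question 1 of Exercise 3, a minimal EWMA sketch (RFC 6298-style smoothing with alpha = 1/8; the 20 ms to 80 ms jump is the scenario above):

```python
def srtt_trace(samples, alpha=1/8, srtt0=None):
    """EWMA RTT estimator (RFC 6298 smoothing). Returns the SRTT value
    after each sample, starting from srtt0 (or the first sample)."""
    srtt = srtt0 if srtt0 is not None else samples[0]
    out = []
    for r in samples:
        srtt = (1 - alpha) * srtt + alpha * r
        out.append(srtt)
    return out

# Path RTT jumps from 20 ms to 80 ms at a handoff:
trace = srtt_trace([80.0] * 30, srtt0=20.0)
within_10pct = next(i for i, s in enumerate(trace) if s > 0.9 * 80)
print(f"{within_10pct + 1} samples to reach 90% of the new RTT")  # 16 samples
```

With handoffs every 15 seconds and one sample per RTT, it is worth asking whether the estimator ever converges before the path changes again.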


This chapter is part of “A First-Principles Approach to Networked Systems” by Arpit Gupta, UC Santa Barbara, licensed under CC BY-NC-SA 4.0.


  1. Maximum Segment Size (MSS) is the largest payload TCP sends in a single segment, typically 1460 bytes (1500-byte Ethernet MTU minus 20-byte IP header minus 20-byte TCP header).↩︎

  2. The SRTT and RTTVAR formulas are Exponentially Weighted Moving Averages (EWMAs). An EWMA gives recent samples more weight than older ones, with the weight decaying exponentially. Alpha = 1/8 means each new RTT sample contributes 12.5% to the smoothed estimate — the estimator adapts to changes within roughly 8 samples. Beta = 1/4 weights the deviation estimator more aggressively so RTTVAR tracks variance changes faster than the mean. These values were chosen empirically by Jacobson (Jacobson 1988) and codified in RFC 6298 as powers of two for efficient bit-shift implementation.↩︎

  3. Bandwidth-Delay Product (BDP) = link bandwidth × round-trip time. For a 1 Gbps link with 100 ms RTT, BDP = 10^9 × 0.1 = 10^8 bits = 12.5 MB. This is the amount of data “in flight” needed to fully utilize the link. TCP’s congestion window must reach BDP for full utilization.↩︎

  4. CUBIC’s window growth is W(t) = C(t - K)^3 + W_max, where C is a scaling constant, K = (W_max × (1 − β)/C)^{1/3} is the time to reach the previous maximum, and β = 0.7 is the multiplicative decrease factor (after loss, cwnd drops to β × W_max). The cubic function grows slowly near W_max (probing carefully) and aggressively far from it.↩︎