9  Transport and Congestion Control


9.1 The Anchor: IP Datagrams and Decentralized Authority

Transport is the endpoint component responsible for end-to-end delivery across a network the sender lacks control over. The engineering problem: how to deliver data reliably and efficiently when the network offers only best-effort, unordered datagrams and no entity has global authority over traffic.

The binding constraint is the IP interface, inherited from Cerf and Kahn’s 1974 design (Cerf and Kahn 1974). IP provides best-effort delivery: datagrams are routed independently, with no guarantee of delivery, ordering, or timing. Routers keep no per-flow state — they forward packets based on destination address alone, maintain only forwarding tables, and remain silent about congestion. A packet is delivered, lost, or delayed; the sender observes only the presence or absence of acknowledgments.

This constraint is architectural, not accidental. Clark’s design philosophy embedded administrative decentralization as a founding principle: the Internet operates across independent domains with no entity possessing global visibility (Clark 1988). Congestion control must operate without a central scheduler, a shared clock, or routers capable of universal rate coordination.

“The goal of the architecture is to knit together a collection of networks… using gateways which implement a standard internetwork protocol.” — Cerf & Kahn, 1974 (Cerf and Kahn 1974)

The IP interface locks every other design choice in cascade. Transport must answer four decision problems using only endpoint-local observations:

  1. When to send? Absent a shared clock or prescribed schedule, the sender must infer when the network can accept data.
  2. What to send — new data or retransmission? The sender must distinguish successful delivery from loss, using only ACK signals.
  3. How much to send? The sender must estimate available capacity using only indirect signals, since link rates and queue depths are hidden.
  4. How to infer network state from endpoint signals? ACK arrivals and their timing are the only measurement channel. Every belief about the network — capacity, delay, congestion — must be reconstructed from this single, noisy signal.

Saltzer, Reed, and Clark formalized this placement retrospectively: the end-to-end argument places intelligence at endpoints because the network interior is blind to application requirements (Saltzer et al. 1984). This decomposition choice defines TCP’s entire design space — and constrains every variant that followed.

The dependency chain for transport follows directly from the binding constraint:

  • Interface (IP datagrams) is inherited and locked.
  • Coordination (distributed endpoints) is forced: no entity has authority to allocate bandwidth centrally.
  • State (inferred from local signals) is forced: without coordination, the sender must build its own belief model from ACK signals.
  • Time (estimated via RTT) is forced: without a shared clock, timing must be inferred from the only available signal — round-trip acknowledgment delay.

9.2 Act 1: “It’s 1974. TCP Is a Reliability Protocol.”

Cerf and Kahn designed TCP as a protocol for packet network intercommunication — a system to deliver bytes correctly across heterogeneous networks connected by stateless gateways (Cerf and Kahn 1974). The protocol unified host-to-host communication, process-to-process multiplexing, and reliability into a single monolithic design. By 1981, RFC 793 formalized the connection state machine: three-way handshake, sequence numbers, cumulative acknowledgments, and flow control via a receiver-advertised window (Postel 1981). The receiver window told the sender how much buffer space the receiver had available — a capacity limit at the far end. This was the only rate-control mechanism: if the receiver could accept 32 KB, the sender could send 32 KB. The network between them was assumed to have sufficient capacity.

What the pioneers saw: A small ARPANET with fewer than 200 hosts, administered cooperatively by researchers who knew each other. Links operated at 50 kbps. Traffic consisted of terminal sessions and file transfers. The engineering challenge was reliability — ensuring bytes arrived correctly despite unreliable packet delivery.

What remained invisible from the pioneers’ vantage point: The network would grow from 200 hosts to millions. Traffic would increase by orders of magnitude. Multiple independent administrative domains would replace cooperative research governance. The implicit assumption — that the network has enough capacity for all senders — would fail catastrophically within a decade.

9.2.1 Invariant Analysis: TCP (1974–1981)

| Invariant | TCP Answer (1974–1981) | Gap? |
|---|---|---|
| State | Sequence numbers, connection FSM, receiver window | Path-blind: zero belief about network capacity or delay |
| Time | Fixed retransmission timers (“adaptive” but vague) | RTT estimation absent; timers fire too early or too late |
| Coordination | Distributed endpoints; receiver flow control only | Congestion coordination absent; each sender independent |
| Interface | Byte stream above, IP datagrams below | Clean — and permanent. This becomes the binding constraint |

The State gap is the critical one: TCP tracks connection state (what bytes have been sent, acknowledged, received) but maintains zero path state. The sender is blind to network load, bottleneck capacity, and competing traffic. The congestion window is absent from the design. The sender transmits as fast as the receiver window permits — which is often far faster than the network absorbs.

9.2.2 Environment, Measurement, Belief

| Layer | What TCP Has | What’s Missing |
|---|---|---|
| Environment | True queue occupancy, competing traffic, link rates | Everything — fully opaque to the sender |
| Measurement | ACK arrivals (binary: arrived or not) | Timing and congestion signals both absent |
| Belief | “Receiver can accept this much data” (receiver window) | Network capacity belief entirely absent |

The E-M gap is total. The measurement channel (ACK arrivals) carries information about the receiver but nothing about the path. The sender’s belief covers only “the receiver has buffer space” — a statement about the far end, silent on the middle. The gap between Environment and Measurement is not noise or estimation error; it is structural absence. The sensor for network capacity simply does not exist. This is the most extreme form of the E-M gap.

9.2.3 “The Gaps Didn’t Matter… Yet”

On the 1974 ARPANET, they were invisible. With fewer than 200 hosts and 50 kbps links, aggregate demand rarely exceeded capacity. Congestion occurred locally and resolved quickly because traffic was light. The receiver window was the practical bottleneck, not the network.

But by 1986, the Internet had grown to over 5,000 hosts. Links had scaled to 56 kbps and 1.5 Mbps, but aggregate demand scaled faster. Multiple independent organizations connected their networks. The assumption “enough capacity” broke.


9.3 Act 2: “It’s 1986. The Network Is Collapsing.”

9.3.1 Which Invariant Broke?

In October 1986, throughput on the NSFnet backbone between LBL and UC Berkeley — a path that should have sustained 32 kbps — dropped to 40 bits per second. A factor of 800 reduction. The network was physically intact; every link was operational. Every router was forwarding. The collapse was behavioral: senders were transmitting faster than the network could deliver, packets were being dropped, senders retransmitted (because their timers fired), creating more traffic, causing more drops. The retransmissions themselves became the dominant source of traffic — a positive feedback loop where the system’s response to failure amplified the failure.

This was a stable, self-reinforcing failure: the network settled into an equilibrium where nearly all traffic consisted of retransmissions and useful throughput approached zero. The system had converged — to the wrong operating point.

“The network is now subject to a form of congestion collapse… If the timers are set too short, more retransmissions than necessary will be performed, and the network will be further congested.” — Nagle, 1984 (Nagle 1984)

Nagle diagnosed the problem in RFC 896 two years before the collapse became acute: senders lacked any mechanism to sense or respond to congestion. His fix operated at the Interface level — coalesce small packets to reduce packet count — but it treated a symptom, leaving the cause intact. Senders still lacked a feedback loop.

Three researchers, working independently over three years, diagnosed three distinct invariant failures:

| Year | Pioneer | Invariant Diagnosed | Gap Identified |
|---|---|---|---|
| 1984 | Nagle | State | Sender lacks network load model |
| 1986 | Zhang | Time | RTT estimate is too crude; timers fire incorrectly |
| 1987 | Karn & Partridge | State | Ambiguous RTT samples from retransmissions pollute estimator |

Zhang showed that TCP’s retransmission timers systematically failed because the RTT estimator was too primitive (Zhang 1986). RFC 793 tracked only a smoothed mean RTT with a fixed multiplier — variance tracking absent. When network conditions changed, the timer was either too aggressive (causing spurious retransmissions that worsened congestion) or too conservative (waiting too long to detect loss).

Karn and Partridge identified a subtler problem: when a packet is retransmitted, the sender is unable to determine whether the returning ACK acknowledges the original or the retransmission (Karn and Partridge 1987). Using this ambiguous sample corrupts the RTT estimate. Their fix: discard RTT samples from retransmitted segments and back off the timer exponentially on each retransmission.
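Karn and Partridge’s two rules are simple enough to state in a few lines. A minimal sketch — the class and method names are illustrative, not from any real TCP stack:

```python
# Sketch of the Karn/Partridge timer discipline.
# Rule 1: never take an RTT sample from a segment that was retransmitted,
#         because the returning ACK could match either transmission.
# Rule 2: back off the retransmission timeout exponentially on each retransmission.

class KarnTimer:
    def __init__(self, initial_rto=1.0, max_rto=60.0):
        self.rto = initial_rto      # seconds
        self.max_rto = max_rto

    def on_retransmit(self):
        # Exponential backoff: the timer doubles on every retransmission.
        self.rto = min(self.rto * 2, self.max_rto)

    def accept_rtt_sample(self, segment_was_retransmitted):
        # Ambiguous samples are discarded outright (Rule 1).
        return not segment_was_retransmitted
```

The two rules interact: backoff keeps the timer sane while the estimator is starved of samples; once an ACK for a segment sent only once arrives, normal RTT sampling resumes.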

In parallel, Jain and Chiu were developing the theoretical foundation: how should distributed agents adjust rates to converge to fairness? Their work would prove that additive increase / multiplicative decrease (AIMD) is the unique linear control policy that converges to both efficiency and fairness — a result Jacobson would implement independently of the formal proof.

Three researchers. Three years. No coordination. Yet they converged on the same decomposition — each diagnosing a different invariant’s failure. This convergence is evidence that the four-invariant structure is something the pioneers discovered through debugging, not something imposed retrospectively.

9.3.2 Jacobson’s Synthesis: Four Fixes in One System (1988)

Jacobson synthesized all four invariant fixes into a single coherent system — the most consequential transport paper ever written (Jacobson 1988).

“In October ’86, the Internet had its first ‘congestion collapse’… the throughput from LBL to UC Berkeley… dropped from 32 Kbps to 40 bps.” — Jacobson, 1988 (Jacobson 1988)

Jacobson framed TCP congestion control as a feedback control problem, drawing explicitly on control theory. His core principle: conservation of packets — a new packet should enter the network only when an old one leaves. Three violations of this principle cause collapse: (1) the connection starts too fast, (2) bad timers inject packets prematurely, (3) resource limits change mid-connection.

Applied closed-loop reasoning by creating TCP’s first real feedback loop:

  • Slow start: cwnd (congestion window) begins at 1 MSS (Maximum Segment Size) and doubles each RTT (exponential growth to equilibrium). The sender injects data at no more than twice the network’s current delivery rate.
  • Congestion avoidance (AIMD): After reaching ssthresh, increase cwnd by 1 MSS per RTT. On loss, halve cwnd. Additive increase probes; multiplicative decrease retreats aggressively.
  • RTT estimator: SRTT and RTTVAR with exponentially weighted moving averages:
RTTVAR = (1 - beta) * RTTVAR + beta * |SRTT - RTT_sample|
SRTT   = (1 - alpha) * SRTT  + alpha * RTT_sample
RTO    = SRTT + 4 * RTTVAR

RFC 6298 (Paxson et al. 2011) uses alpha = 1/8 and beta = 1/4. The smoothed RTT weights history 7/8 and the new sample 1/8 — gradual adaptation without oscillation. The multiplier 4 on RTTVAR covers the 95th percentile of RTT variation, chosen specifically to prevent spurious timeouts during slow start when RTT can double in one round.
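The estimator above is only a few lines of code. A minimal sketch of the RFC 6298 computation (the class name and structure are illustrative):

```python
# RTT estimator from the equations above, with RFC 6298 constants.
# SRTT and RTTVAR are exponentially weighted moving averages;
# RTO = SRTT + 4 * RTTVAR.

ALPHA, BETA, K = 1 / 8, 1 / 4, 4

class RttEstimator:
    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def update(self, sample):
        """Fold in one RTT sample (seconds); return the resulting RTO."""
        if self.srtt is None:
            # First sample: RFC 6298 initialization.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            # Order matters: RTTVAR is updated using the old SRTT.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - sample)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * sample
        return self.rto()

    def rto(self):
        return self.srtt + K * self.rttvar
```

With a steady 100 ms RTT, the first sample yields RTO = 300 ms; repeated identical samples shrink RTTVAR, pulling the RTO down toward SRTT — the “gradual adaptation without oscillation” described above.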

  • Fast retransmit: Three duplicate ACKs trigger retransmission without waiting for timeout — a faster path to loss recovery. The sender infers loss (rather than delay) because the receiver has received three subsequent packets and is signaling a gap.

Concrete example: Starting cwnd = 1 MSS (1,500 bytes), RTT = 100 ms. Slow start: cwnd = 1, 2, 4, 8, 16, 32 (5 RTTs = 500 ms). If loss occurs at cwnd = 32, ssthresh is set to 16 and cwnd resets to 1. Slow start back to 16 takes 4 RTTs (400 ms); congestion avoidance from 16 to 32 takes 16 RTTs (1.6 s). Total recovery: 2.0 seconds. This was fast enough for 56 kbps links where BDP (Bandwidth-Delay Product) = 700 bytes. It would prove inadequate for 10 Gbps links where BDP = 125 MB.
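The recovery arithmetic can be checked mechanically. A sketch assuming only the textbook rules stated above — doubling below ssthresh, +1 MSS per RTT above it:

```python
# Count RTTs for cwnd to grow from 1 MSS to `target` MSS, using
# slow start (doubling) below ssthresh and additive increase above it.

def rtts_to_reach(target, ssthresh):
    cwnd, rtts = 1, 0
    while cwnd < target:
        if cwnd < ssthresh:
            cwnd *= 2        # slow start: exponential growth
        else:
            cwnd += 1        # congestion avoidance: +1 MSS per RTT
        rtts += 1
    return rtts
```

`rtts_to_reach(32, ssthresh=64)` gives the initial ramp (5 RTTs = 500 ms at RTT = 100 ms); `rtts_to_reach(32, ssthresh=16)` gives the post-timeout recovery (20 RTTs = 2.0 s under these rules).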

Applied disaggregation by separating per-flow state (each sender tracks its own cwnd, SRTT, RTTVAR) from network state (routers remain stateless).

Decision placement remained fully distributed: every sender makes rate decisions independently, using only local ACK signals. No central authority allocates bandwidth. Convergence emerges from the interaction of independent AIMD loops.

Chiu and Jain proved mathematically that AIMD is the only linear control law that converges to fairness from any starting point (Chiu and Jain 1989). The proof is elegant: additive increase moves all flows toward the efficiency line (total rate = capacity); multiplicative decrease moves all flows toward the fairness line (equal rates). The combination oscillates around the intersection — the fair, efficient operating point.
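Chiu and Jain’s geometric argument can be watched numerically. A toy simulation of two AIMD flows, under idealized assumptions (fluid rates, perfectly synchronized loss events, arbitrary units):

```python
# Two AIMD flows sharing a bottleneck. Additive increase moves the pair
# toward the efficiency line (x1 + x2 = capacity); multiplicative
# decrease moves it toward the fairness line (x1 = x2). The rate
# difference shrinks only on decreases, so it halves each cycle.

def aimd_two_flows(x1, x2, capacity, rounds):
    for _ in range(rounds):
        if x1 + x2 > capacity:
            # Multiplicative decrease: both flows observe the loss and halve.
            x1, x2 = x1 / 2, x2 / 2
        else:
            # Additive increase: both flows probe by one unit per round.
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2
```

Starting from a grossly unfair split (40 vs 4), the flows converge toward equal rates while the total oscillates just below capacity — the sawtooth around the fair, efficient operating point.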

The beauty of Jacobson’s synthesis: four independent diagnoses (Nagle on State, Zhang on Time, Karn on measurement integrity, Jain on Coordination) were unified into a single deployable system. The interface remained unchanged — no router modifications, no new header fields, no coordination between senders. This is why RFC 1122 (Braden 1989) could mandate the algorithms for all Internet hosts within one year: the fix operated entirely within the existing protocol boundary.

9.3.3 Invariant Analysis: TCP + Jacobson (1988)

| Invariant | TCP Answer After Jacobson (1988) | Gap? |
|---|---|---|
| State | cwnd (capacity belief) + SRTT/RTTVAR (delay belief) | Loss is binary and delayed; no congestion gradient |
| Time | EWMA estimator: RTO = SRTT + 4*RTTVAR | Adapts slowly to sudden RTT changes (8 samples to converge) |
| Coordination | AIMD: emergent fairness without communication | RTT bias: low-RTT flows get more bandwidth |
| Interface | Unchanged — deployable via host update only | No router participation; routers remain passive |

The State gap narrowed dramatically while remaining open. Loss is now the congestion signal, but loss is binary (happened or didn’t) and delayed (detected after one RTT at best, after a timeout at worst). The sender is blind to congestion severity — a single dropped packet and a completely saturated link produce the same signal.

9.3.4 Environment, Measurement, Belief After the Fix

| Layer | What TCP Has | What’s Missing |
|---|---|---|
| Environment | True bottleneck rate, queue depth, competing flows | Still opaque to sender |
| Measurement | ACK timing (RTT samples) + loss events (timeout or 3 dup-ACKs) | Direct capacity signal absent; loss is noisy |
| Belief | cwnd (capacity proxy), SRTT (delay estimate) | cwnd is a proxy — an estimate rather than a measurement of true capacity |

The E-M gap is accidentally noisy: the ACK signal is honest but imperfect. Jacobson’s estimator extracts what information exists, but the measurement channel is fundamentally limited to binary loss events and timing. Better estimators can improve accuracy; they are bounded by the information the signal carries.

9.3.5 “The Gaps Didn’t Matter… Yet”

Jacobson’s algorithms were mandated for all Internet hosts within one year, by RFC 1122 in 1989 (Braden 1989). They worked extraordinarily well for the 1988 Internet: moderate link speeds, small buffers, homogeneous RTTs. The loss signal was tightly coupled to congestion because buffers were small — when congestion occurred, packets were dropped quickly, and the sender learned within a few RTTs.

Jacobson knew the endpoint-only solution was incomplete:

“There are two network problems, congestion avoidance and congestion recovery… This paper describes the approach to the first problem… The second problem requires router changes.” — Jacobson, 1988 (Jacobson 1988)

The “gateway fix” was RED (Floyd and Jacobson 1993) — five years later. Jacobson shipped what was deployable first. This is the meta-constraint of incremental deployability: host changes can be mandated via RFC; router changes require voluntary upgrades across administratively independent networks.

9.3.6 The AIMD Closed Loop in Steady State

The sawtooth pattern that emerges from AIMD is the system reaching equilibrium. The sender increases cwnd linearly (1 MSS per RTT) until loss occurs, then halves cwnd instantly. The cycle repeats.

The loop delay is one RTT. This is consequential: a flow with 100 ms RTT takes at least 100 ms to react to congestion. During those 100 ms, if the sender was transmitting near bottleneck rate, congestion has deepened. By contrast, a datacenter flow with 1 ms RTT reacts 100x faster. This RTT coupling is ineliminable without changing the time model.

The loop is stabilizing. After loss, cwnd halves, reducing sending rate by 50%. The sender then probes slowly upward. Over many cycles, the system converges to a rate where loss probability is low but not zero. The equilibrium loss probability p satisfies a balance: the rate of additive increase (1 MSS/RTT) equals the rate of multiplicative decrease (cwnd/2 per loss event). For N flows sharing bottleneck capacity C, the equilibrium rates satisfy N * (cwnd / RTT) = C, and loss occurs when any flow overshoots.

But the loop has failure modes visible at scale:

  • RTT-biased fairness: A 1 ms flow increases cwnd 1,000 times per second; a 100 ms flow only 10 times per second — a 100x gap in probing rate. At equilibrium, low-RTT flows consume disproportionate bandwidth. This follows directly from the mathematics of AIMD — it is structural.
  • TCP incast: When many senders share a bottleneck and all transmit simultaneously (partition-aggregate workloads), all flows experience synchronized loss. All halve cwnd. When the buffer drains, all increase in lockstep. Utilization collapses to 10%.
  • Slow convergence: On a link with equilibrium cwnd = 83,000 MSS (10 Gbps, 100 ms RTT), recovering from a halving takes 41,500 RTTs = 69 minutes at 1 MSS per RTT.
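The numbers behind these failure modes follow from a few lines of arithmetic, using the values assumed in the text (10 Gbps, 100 ms RTT, 1,500-byte MSS):

```python
# Back-of-envelope check on the AIMD failure modes above.

MSS = 1500                          # bytes
RTT = 0.100                         # seconds

bdp_bytes = 10e9 / 8 * RTT          # 10 Gbps * 100 ms = 125 MB
cwnd_mss = bdp_bytes / MSS          # equilibrium window, ~83,333 MSS

# After a multiplicative decrease, AIMD must regain cwnd/2 at 1 MSS per RTT.
recovery_rtts = cwnd_mss / 2
recovery_minutes = recovery_rtts * RTT / 60

# RTT bias: cwnd increases per second scale as 1/RTT.
increases_per_sec_1ms = 1 / 0.001   # 1,000 per second
increases_per_sec_100ms = 1 / 0.100 # 10 per second
```

The computed recovery time is about 69 minutes, matching the bullet above, and the probing-rate ratio between a 1 ms and a 100 ms flow is exactly 100x.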

9.4 Act 3: “It’s 2008. The Pipe Got Wider.”

9.4.1 Which Invariant Broke?

Link speeds scaled from 10 Mbps to 10 Gbps. Propagation delays remained governed by the speed of light. The bandwidth-delay product (BDP) — the amount of data that fits “in the pipe” — grew by three orders of magnitude.

Figure 9.1: The bandwidth-delay product (BDP) is the fundamental bottleneck-sizing metric: BDP = bottleneck bandwidth × round-trip propagation delay. It represents the amount of data (in bits or packets) that can be in flight on a path while the sender waits for the first ACK — equivalently, the minimum congestion window needed to saturate the link. This metric becomes critical on high-BDP paths: a 100 Gbps link with 100 milliseconds round-trip time has a BDP of 1.25 gigabytes, requiring a congestion window of roughly 800,000 packets (at 1,500 bytes per packet). For comparison, a LAN path (10 Gbps, 0.1 ms RTT) has a BDP of only about 83 packets, while a satellite link (10 Mbps, 600 ms RTT) requires about 500 packets. TCP’s slow-start phase becomes a fundamental bottleneck on high-BDP paths. Slow start doubles the congestion window every round-trip time: starting from 1 packet, reaching 1,000 packets takes log₂(1000) ≈ 10 RTTs, or 1 second on a 100 ms path. On a 50 millisecond coast-to-coast WAN path with 10 Gbps bandwidth (BDP ≈ 42,000 packets), slow start to saturation takes about 16 RTTs = 800 milliseconds, during which the sender transmits at an average rate far below link capacity. This convergence delay is economically wasteful on premium, high-capacity links where saturation should occur in milliseconds, not seconds. Modern congestion control variants (CUBIC, BBR) address this by accelerating convergence for high-BDP paths, but the fundamental constraint remains: the window must scale with the BDP to saturate the link.
| Scenario | Bandwidth | RTT | BDP | Time to Fill (AIMD) |
|---|---|---|---|---|
| 1988 ARPANET | 56 kbps | 100 ms | 700 bytes | < 1 RTT |
| 2000 WAN | 100 Mbps | 50 ms | 625 KB | ~400 RTTs (20 s) |
| 2008 fiber | 10 Gbps | 100 ms | 125 MB | ~83,000 RTTs (~2.3 hours) |
| Datacenter | 100 Gbps | 0.5 ms | 6.25 MB | ~4,200 RTTs (2.1 s) |
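The BDP column is a single formula. A quick check, with bandwidth in bits per second and RTT in seconds:

```python
# BDP in bytes = bandwidth (bits/s) * RTT (s) / 8.

def bdp_bytes(bandwidth_bps, rtt_s):
    return bandwidth_bps * rtt_s / 8

assert bdp_bytes(56e3, 0.100) == 700            # 1988 ARPANET
assert bdp_bytes(100e6, 0.050) == 625_000       # 2000 WAN: 625 KB
assert bdp_bytes(10e9, 0.100) == 125_000_000    # 2008 fiber: 125 MB
assert bdp_bytes(100e9, 0.0005) == 6_250_000    # datacenter: 6.25 MB
```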

Reno’s linear increase (1 MSS per RTT) was too slow for high-BDP pipes. After a loss event on a 10 Gbps WAN path, cwnd drops to half. Recovering to full utilization at 1 MSS per RTT takes roughly 41,500 RTTs — over an hour, during which the link runs well below capacity.

9.4.2 Ha, Rhee, and Xu’s Redesign: CUBIC (2008)

CUBIC replaced Reno’s linear congestion avoidance with a cubic function of elapsed time since the last loss event (Ha et al. 2008):

cwnd(t) = C * (t - K)^3 + W_max

where t is wall-clock time since last loss, W_max is the pre-loss window, K is the time to reach W_max under cubic growth, and C is a scaling constant.
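The window curve is easy to probe directly. A sketch using the constants from the CUBIC paper and RFC 8312 (C = 0.4, multiplicative decrease beta = 0.7); K is derived so that cwnd(0) equals the post-loss window beta * W_max:

```python
# CUBIC window from the formula above: cwnd(t) = C * (t - K)^3 + W_max.
# K = cbrt(W_max * (1 - beta) / C), so that at t = 0 (just after loss)
# cwnd equals beta * W_max, and at t = K cwnd returns to W_max.

def cubic_cwnd(t, w_max, c=0.4, beta=0.7):
    k = (w_max * (1 - beta) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max
```

For W_max = 100, the curve starts at 70 (the 0.7x decrease), is concave approaching W_max at t = K, and convex beyond it — aggressive probing for new capacity.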

Applied closed-loop reasoning with a structurally different growth curve:

  • Near W_max: Growth is concave (slowing down) — cautious near the previous congestion point.
  • Beyond W_max: Growth is convex (speeding up) — aggressive probing for new capacity.
  • Decrease: 0.7x (less aggressive than Reno’s 0.5x), preserving more throughput after loss.

The critical innovation: growth rate is a function of wall-clock time, not RTT. A 10 ms flow and a 100 ms flow increase cwnd at the same rate in real time. This partially fixes the RTT-bias that plagued Reno: under AIMD, a 1 ms flow increases cwnd 1,000 times per second while a 100 ms flow increases only 10 times per second. CUBIC decouples growth from ACK frequency.

Concrete example: W_max = 83,000 MSS (10 Gbps link, 100 ms RTT, 1,500-byte packets). After loss, cwnd drops to 58,100 (0.7x). Under Reno, recovering 24,900 MSS at 1 per RTT takes 24,900 RTTs = 41.5 minutes. Under CUBIC, the cubic function is tuned so recovery takes ~400 RTTs = 40 seconds — a 60x improvement.

CUBIC also introduced a TCP-friendly mode: when the cubic function would grow slower than standard AIMD, CUBIC switches to AIMD-compatible growth. This ensures backward compatibility — CUBIC always matches or exceeds Reno’s bandwidth in the same conditions.

Figure 9.2: TCP’s congestion control has evolved through multiple generations, each restructuring the belief system about network state in response to environmental pressures. TCP Reno (1990) introduced slow-start with exponential growth in the early phase, then switched to linear additive-increase congestion avoidance. Loss detection (via duplicate ACKs or timeout) triggers a reset to slow-start threshold at half the current window. This linear increase model worked well for moderate-bandwidth, moderate-latency links, but becomes a bottleneck on high-bandwidth, high-latency paths where the bandwidth-delay product is enormous. CUBIC (2008) addresses this bottleneck by replacing linear increase with a cubic function, allowing faster recovery toward higher equilibrium windows on high-BDP paths. BBR represents a different paradigm: instead of inferring network state from loss, BBR explicitly models the network as a pipe with bounded bandwidth and round-trip delay, measuring these parameters directly and maintaining a window size equal to bandwidth × delay. DCTCP uses ECN marks (explicit congestion notification) proportionally: when congestion increases, ECN marks rise in frequency, allowing DCTCP to reduce the window proportionally rather than dramatically. Each variant embodies a different belief system: Reno assumes loss is the primary congestion signal and reacts sharply; CUBIC assumes high-BDP networks need gentler recovery; BBR assumes direct bandwidth measurement is possible; DCTCP assumes dense feedback via ECN marks is available. The choice of variant reflects assumptions about the network environment, feedback frequency, and the tradeoff between responsiveness and stability.

9.4.3 Before/After: Reno to CUBIC

| What Changed | Reno (1990) | CUBIC (2008) |
|---|---|---|
| Growth curve | Linear: +1 MSS/RTT | Cubic: C*(t-K)^3 + W_max |
| Growth rate depends on | RTT (ACK frequency) | Wall-clock time (RTT-independent) |
| Decrease factor | 0.5x | 0.7x |
| Recovery time (10 Gbps, 100 ms) | ~41,500 RTTs | ~400 RTTs |
| RTT fairness | Severe bias: 100x RTT difference = 100x bandwidth difference | Partially fixed: growth rate independent of RTT |
| Congestion signal | Loss | Loss (unchanged) |

The progression from Reno to CUBIC illustrates a recurring pattern: the controller (how the sender responds) improved dramatically, but the sensor (what the sender observes) remained the same. CUBIC is a better feedback controller operating on the same noisy, binary measurement signal. When the signal itself fails — as it does with bufferbloat — the controller’s sophistication is irrelevant.

9.4.4 Environment, Measurement, Belief: CUBIC

| Layer | What CUBIC Has | What’s Missing |
|---|---|---|
| Environment | True bottleneck rate, competing traffic, queue depth | Still opaque to sender |
| Measurement | Loss events + ACK timing | Direct capacity measurement absent; loss is binary |
| Belief | cwnd via cubic function; W_max as memory of previous equilibrium | Still interprets loss as sole congestion signal |

The E-M gap quality is unchanged from Jacobson: accidentally noisy. CUBIC improves the controller (cubic growth curve) but uses the same sensor (loss events). When the sensor fails — as it does with deep buffers — the controller’s sophistication is irrelevant.

9.4.5 “The Gaps Didn’t Matter… Yet”

CUBIC became the Linux default in 2006 (kernel 2.6.19) and remains the dominant WAN congestion control algorithm (Ha et al. 2008). It solved the high-BDP scaling problem elegantly. But CUBIC retained the fundamental assumption inherited from Jacobson: loss is the signal for congestion.

This assumption held as long as buffers were small enough that loss and congestion were tightly coupled. When deep buffers appeared — in home routers, cable modems, and carrier-grade equipment — the loss signal decoupled from congestion. Buffers absorbed thousands of packets without dropping. The sender saw ACKs arriving on time, believed the network was healthy, and kept increasing cwnd. RTT inflated from 5 ms to 500 ms before any packet was lost. This is bufferbloat: the belief (cwnd) diverged from reality (the queue was full) because the measurement signal (loss) was delayed by the buffer.

The bufferbloat problem revealed a deeper insight about the E-M-B decomposition: CUBIC improved the controller (how the sender responds to loss) but used the same sensor (binary loss events). When the sensor fails — when the signal decouples from the underlying reality — no amount of controller sophistication can compensate. The fix required either a better sensor (DCTCP’s ECN (Explicit Congestion Notification) marking) or a different measurement strategy entirely (BBR’s model-based probing).


9.5 Act 4: “It’s 2017. Loss Is the Wrong Signal.”

9.5.1 Which Invariant Broke?

The State invariant’s measurement channel failed. Loss-based CC assumes loss indicates congestion. In networks with deep buffers, loss occurs only after the buffer is completely full — potentially hiding congestion for thousands of RTTs. The sender’s belief (cwnd = safe capacity) diverges from reality (the path is saturated, latency has exploded).

A concrete example: a home broadband link with 100 Mbps downstream and a 10 MB buffer at the cable modem. A saturated flow hides congestion for 10 MB / (100 Mbps / 8) = 0.8 seconds. During those 800 ms, the sender increases cwnd, the buffer fills, and latency climbs from 5 ms to 800 ms. A video call that needs < 100 ms RTT becomes unusable. A web page that would load in 200 ms now takes 2 seconds. Every interactive application on the same access link suffers, because the buffer is shared.
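The hiding time in this example is just buffer size divided by drain rate:

```python
# Bufferbloat arithmetic from the example above: a 10 MB buffer in front
# of a 100 Mbps link absorbs packets (hiding congestion, inflating RTT)
# for buffer / drain-rate seconds before the first drop.

buffer_bytes = 10e6                              # 10 MB modem buffer
link_bytes_per_s = 100e6 / 8                     # 100 Mbps = 12.5 MB/s
drain_time_s = buffer_bytes / link_bytes_per_s   # 0.8 s of hidden congestion
```

A full buffer also means every packet waits behind 10 MB of queue — the same 0.8 s appears as added latency for every flow sharing the link.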

The diagnosis through the E-M-B framework is precise: the Environment changed (deep buffers deployed for throughput), the Measurement signal (loss) decoupled from congestion (buffers absorb packets instead of dropping), and the Belief (cwnd = safe capacity) diverged from reality (the path is saturated). The sender’s estimator is functioning correctly — it just has the wrong sensor.

9.5.2 Cardwell’s Redesign: BBR (2017)

BBR (Bottleneck Bandwidth and Round-trip propagation time) replaced the cwnd belief system entirely (Cardwell et al. 2017). Instead of a proxy for capacity (cwnd), BBR maintains an explicit network model:

  • BtlBw: Bottleneck bandwidth — the maximum delivery rate observed over the last 10 RTTs.
  • RTprop: Round-trip propagation delay — the minimum RTT observed over the last 10 seconds.
  • Target: Send at rate = BtlBw, keeping in-flight bytes at BtlBw * RTprop (the BDP).

Applied closed-loop reasoning with a fundamentally different measurement interpretation: loss is no longer the congestion verdict. BBR increases its rate periodically, observes the resulting delivery rate, and updates its model.

BBR operates in four phases:

  1. STARTUP: Exponential increase until BtlBw estimate stops growing for 3 RTTs. Goal: find bottleneck bandwidth quickly.
  2. DRAIN: Reduce in-flight packets to the BDP. Goal: drain queues built during startup.
  3. PROBE_BW: Every 10 RTTs, increase rate by 25% to probe for more bandwidth. If delivery rate drops, reduce back.
  4. PROBE_RTT: Every 10 seconds, reduce in-flight to 4 MSS for 200 ms. Goal: refresh RTprop estimate with minimal queueing.
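The core of BBR’s model is two windowed filters over per-ACK measurements. A simplified sketch — the class is illustrative and windows by sample count, whereas real BBR windows BtlBw and RTprop by time, as described above:

```python
from collections import deque

# Simplified BBR model state: a windowed-max filter for bottleneck
# bandwidth and a windowed-min filter for propagation delay.

class BbrModel:
    def __init__(self, bw_window=10, rtt_window=10):
        self.bw_samples = deque(maxlen=bw_window)    # recent delivery rates
        self.rtt_samples = deque(maxlen=rtt_window)  # recent RTT samples

    def on_ack(self, delivery_rate, rtt):
        """Record one ACK's measurements (bytes/s, seconds)."""
        self.bw_samples.append(delivery_rate)
        self.rtt_samples.append(rtt)

    def btlbw(self):
        return max(self.bw_samples)    # max filter: queues only lower rates

    def rtprop(self):
        return min(self.rtt_samples)   # min filter: queues only raise RTT

    def bdp(self):
        # Target in-flight data: BtlBw * RTprop.
        return self.btlbw() * self.rtprop()
```

The max/min asymmetry encodes the model’s logic: queueing can only reduce observed delivery rate and inflate observed RTT, so the best recent sample of each is the best estimate of the underlying path.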
Figure 9.3: The sawtooth pattern visualizes the closed-loop dynamics of TCP’s AIMD (Additive Increase, Multiplicative Decrease) congestion control, where the sender probes available capacity by increasing the congestion window (cwnd) at a linear rate until loss is detected, then halves cwnd in response. Observe how the additive increase phase (roughly 1 packet per RTT) gradually fills the pipe and queues, while loss events (marked by red dots) trigger multiplicative decreases that sharply reduce cwnd by 50%. The green dashed line at cwnd=32 shows the slow-start threshold (ssthresh), which separates exponential probing from linear probing. This equilibrium—where the rate of additive gains roughly balances multiplicative losses over many cycles—emerges not as a bug, but as the system reaching a stable point where loss probability is low but not zero, achieving high utilization without overwhelming the bottleneck. The loop delay is RTT, which governs how quickly the sender can react to congestion. A datacenter flow with 1 ms RTT reacts 100x faster than a wide-area flow with 100 ms RTT, creating a fundamental coupling between the time model and the convergence properties of the adaptation loop. Flows with longer RTTs reach equilibrium slowly and are biased against in competing for capacity, illustrating how the time model directly shapes fairness.

Concrete comparison: On a 10 Gbps link with 100 ms RTT and a 10 MB buffer:

  • CUBIC: Fills the buffer before detecting loss. Queue grows to 10 MB, adding 8 ms of queuing delay. RTT inflates from 100 ms to 108 ms. After loss, cwnd drops to 0.7x and slowly recovers. Average queue depth: 5 MB. Average queuing delay: 4 ms.
  • BBR: Measures BtlBw = 10 Gbps and RTprop = 100 ms. Sends at BDP = 125 MB in flight. Queue target: zero. Average queuing delay: < 1 ms.

The difference is structural: CUBIC fills the pipe and the buffer before backing off. BBR fills the pipe and drains the buffer by design. For latency-sensitive applications (video conferencing, interactive web), this difference is transformative.

9.5.3 Invariant Analysis: BBR (2017)

| Invariant | BBR Answer (2017) | Gap? |
|---|---|---|
| State | (BtlBw, RTprop) — explicit model, not proxy | Model assumes constant bottleneck; wrong in multi-tenant networks |
| Time | Phase-based probing on scheduled intervals | RTprop stale for up to 10 seconds after routing changes |
| Coordination | Model-based; probes past loss | Coexistence failure: BBR starves CUBIC flows (takes 95% capacity) |
| Interface | Unchanged (ACK-based) | Interprets ACKs differently but uses same signal |

The Coordination gap is BBR’s most consequential failure. When BBR and CUBIC share a bottleneck, they interpret loss differently. BBR probes past loss; CUBIC treats that same loss as a congestion signal and backs cwnd off. BBR then fills the capacity that CUBIC vacates. Over time, BBR can consume 95% of bottleneck bandwidth while CUBIC gets 5%. BBRv2 and BBRv3 add loss-aware fairness constraints, but the fundamental tension persists: a protocol that probes past loss will dominate a protocol that fears loss.

The model assumptions create additional gaps. BBR assumes bottleneck bandwidth is constant, which fails in multi-tenant environments where neighbors’ traffic changes. It also assumes some packets occasionally see an empty queue: on paths with persistent queueing, such as a datacenter carrying constant background traffic, RTprop samples systematically overestimate the true propagation delay.

9.5.4 Environment, Measurement, Belief

| Layer | What BBR Has | What’s Missing |
|---|---|---|
| Environment | True bottleneck rate, queue depth | Still opaque; measured indirectly |
| Measurement | Delivery rate (from ACK spacing) + min RTT | Delivery rate noisy; RTprop estimate goes stale between probes |
| Belief | (BtlBw, RTprop) = explicit model | BtlBw assumes stationarity; multi-tenant traffic violates this |

The E-M gap quality changed. Jacobson’s gap was accidentally noisy (honest but imperfect ACK signal). BBR’s gap is structurally filtered: the delivery rate measurement reflects not just the bottleneck but also competing traffic, receiver-side buffering, and link-layer retransmissions. BBR smooths this noise with windowed max/min filters, but some information is permanently lost.
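The windowed-max idea can be sketched in a few lines (a simplified, index-windowed filter for illustration only; BBR proper windows BtlBw samples over roughly ten RTTs, and the sample values here are invented):

```python
from collections import deque

def windowed_max(samples, window):
    """Windowed-max filter in the spirit of BBR's BtlBw estimator:
    report the largest delivery-rate sample seen in the last `window`
    samples. Uses a monotonic deque of (index, value) pairs."""
    out, dq = [], deque()
    for i, v in enumerate(samples):
        while dq and dq[-1][1] <= v:   # drop dominated samples
            dq.pop()
        dq.append((i, v))
        if dq[0][0] <= i - window:     # expire samples outside the window
            dq.popleft()
        out.append(dq[0][1])
    return out

# Delivery-rate samples (Gbps): noise plus a dip from a competing flow.
rates = [9.8, 9.9, 10.0, 6.0, 6.1, 9.7, 9.9]
print(windowed_max(rates, window=3))
# -> [9.8, 9.9, 10.0, 10.0, 10.0, 9.7, 9.9]
```

Note how the filter rides through the two-sample dip but eventually forgets the 10.0 peak once it ages out of the window — the same mechanism that lets a stale BtlBw estimate persist for a while after conditions change.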

9.5.5 “The Gaps Didn’t Matter… Yet”

Despite the fairness concerns, BBR achieved lower latency and higher throughput than CUBIC on Google’s global network. For a single operator controlling both endpoints (Google servers to user browsers via QUIC), the coexistence problem is muted — Google’s traffic competes primarily with itself. BBR is deployed at scale in Google infrastructure: YouTube, Cloud, and all QUIC connections.

The gap that matters is deployment scope. BBR works well when one operator controls the sender (and can choose the CC algorithm). It works poorly on shared infrastructure where heterogeneous CC algorithms compete. As BBR adoption grows beyond Google, the fairness problem becomes increasingly visible. BBRv3 (2023) addresses this with explicit loss-rate monitoring and proportional backoff — but the tension between model-based probing and loss-based conservation remains fundamental.


9.6 Act 5: “It’s 2010. The Datacenter Changes Everything.”

9.6.1 Which Invariant Broke?

Datacenters introduced a qualitatively different environment: sub-millisecond RTTs, single administrative authority, commodity switches with shallow buffers, and workloads mixing latency-sensitive “mice” flows (short queries) with throughput-sensitive “elephant” flows (bulk transfers).

The critical failure was incast collapse: in partition-aggregate workloads (MapReduce, search), dozens or hundreds of workers respond simultaneously to a single aggregator. Their synchronized traffic overwhelms the switch buffer. All flows experience loss simultaneously. All halve cwnd. When the buffer drains, all increase together — synchronized oscillation with utilization collapsing to 10%.

Standard TCP (Reno or CUBIC) in the datacenter produces two failures: (1) elephant flows fill buffers, inflating latency for mice flows, and (2) incast causes synchronized loss and throughput collapse. Both failures trace to the same State gap: loss is too coarse a signal for sub-millisecond environments.

Concrete incast example: A MapReduce job with 100 workers, each holding 1 MB of results. The aggregator sends a request; all 100 workers respond simultaneously over 10 Gbps links to a top-of-rack switch with a 1 MB buffer. Total arriving data: 100 MB. Buffer capacity: 1 MB. The buffer overflows within microseconds, and the bulk of the remaining 99 MB is dropped. All 100 flows detect loss and halve cwnd. When retransmissions begin, they are again synchronized. Throughput oscillates between near-zero and momentary bursts. Effective utilization: 10-20% of link capacity.
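A quick back-of-envelope check of the burst sizes (illustrative numbers from the scenario above, not a simulation):

```python
# Incast scenario: synchronized responses into one shallow-buffered port.
workers = 100
per_worker_bytes = 1_000_000     # 1 MB of results per worker
buffer_bytes = 1_000_000         # 1 MB top-of-rack buffer
link_bps = 10e9                  # 10 Gbps egress toward the aggregator

offered = workers * per_worker_bytes
print(f"Offered burst: {offered / 1e6:.0f} MB into a "
      f"{buffer_bytes / 1e6:.0f} MB buffer")   # 100 MB into 1 MB

# While the burst arrives, the egress link drains the buffer at line rate:
drain_s = buffer_bytes * 8 / link_bps
print(f"One full buffer drains in {drain_s * 1e6:.0f} us")   # 800 us
```

The 100:1 mismatch between offered burst and buffer capacity is what makes the collapse synchronized: no per-flow backoff schedule can help when every flow sees loss in the same microsecond window.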

9.6.2 Alizadeh’s Redesign: DCTCP (2010)

DCTCP (Data Center TCP) extracted multi-bit feedback from the single-bit ECN field (Alizadeh et al. 2010; Ramakrishnan et al. 2001):

  • Switches mark the Congestion Experienced (CE) bit when queue depth exceeds threshold K.
  • The sender tracks alpha = EWMA of the fraction of marked packets in each window.
  • Decrease: cwnd = cwnd * (1 - alpha/2).
  • If 10% of packets are marked: 5% decrease. If 90% are marked: 45% decrease.
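The update rules above can be sketched directly (g = 1/16 is the EWMA gain used in the DCTCP paper; the marked fractions are the illustrative values above):

```python
def dctcp_update(alpha, marked_frac, g=1/16):
    """One-window DCTCP update: EWMA of the fraction of marked packets."""
    return (1 - g) * alpha + g * marked_frac

def dctcp_decrease(cwnd, alpha):
    """Proportional decrease: cwnd *= (1 - alpha/2)."""
    return cwnd * (1 - alpha / 2)

# At 10% steady-state marking the decrease is 5%; at 90% it is 45%.
print(dctcp_decrease(100, 0.10))   # 95.0
print(dctcp_decrease(100, 0.90))   # 55.0  (a 45% decrease)
```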

Applied closed-loop reasoning with a richer measurement signal: instead of binary loss (all-or-nothing), DCTCP receives a continuous congestion gradient. The marking rate is proportional to queue depth, so the sender learns not just “congested or not” but “how congested.”

Applied disaggregation by separating the measurement signal (ECN from switches) from the control response (proportional decrease at endpoints). Switches provide signals; endpoints interpret them. This preserves endpoint autonomy while adding router participation — the “gateway fix” that Jacobson called for in 1988.

Decision placement shifted subtly: switches now actively participate in congestion signaling (marking packets), but endpoints retain all rate-control authority. This is the first change to the measurement interface since Jacobson 1988.

9.6.3 Invariant Analysis: DCTCP (2010)

| Invariant | DCTCP Answer (2010) | Gap? |
|---|---|---|
| State | cwnd + alpha (fraction marked); continuous signal | Requires ECN-capable switches; single-admin only |
| Time | Same RTT estimator; ECN marks arrive within 1 RTT | Marking threshold K must be tuned per deployment |
| Coordination | Proportional decrease (gentler than AIMD halving) | Cannot coexist with standard TCP (too gentle, gets starved) |
| Interface | ECN from switches — new measurement signal | Requires coordinated deployment across all switches |

The deployment constraint is DCTCP’s binding limitation: it requires every switch on the path to support ECN marking, and sharing links with standard TCP leads to starvation. This restricts DCTCP to single-admin environments — datacenters where one operator controls all switches and can mandate the protocol.

Concrete example: A datacenter switch with 10 Gbps links and 1,500-byte packets. Marking threshold K = 20 packets. Queue sojourn time at K: 20 * 1,500 * 8 / 10 Gbps = 24 microseconds. In steady state, queue depth oscillates around K. Average queuing delay: ~24 microseconds. Compare with CUBIC on the same switch: the queue grows until packets are dropped, potentially filling a 10 MB buffer and causing 8 ms of queuing delay — roughly 300x worse.
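The marking-threshold arithmetic, checked in code (same illustrative switch parameters):

```python
# Queue sojourn time when the queue sits at the marking threshold K.
link_bps = 10e9      # 10 Gbps link
pkt_bytes = 1500     # packet size
K_pkts = 20          # DCTCP marking threshold, in packets

sojourn_s = K_pkts * pkt_bytes * 8 / link_bps
print(f"Delay at threshold K: {sojourn_s * 1e6:.0f} us")   # 24 us

# Versus a loss-based sender filling a 10 MB buffer on the same switch:
full_s = 10e6 * 8 / link_bps
print(f"Full 10 MB buffer: {full_s * 1e3:.0f} ms "
      f"({full_s / sojourn_s:.0f}x worse)")                # 8 ms, ~333x
```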

9.6.4 Environment, Measurement, Belief: DCTCP

| Layer | What DCTCP Has | What’s Missing |
|---|---|---|
| Environment | True queue depth at each switch | Sender sees only aggregate marking rate |
| Measurement | Fraction of ECN-marked packets per window | Continuous signal, but marking threshold K is static |
| Belief | alpha = EWMA of marked fraction | Proportional to congestion severity; much richer than binary loss |

The E-M gap quality fundamentally changed. DCTCP’s measurement signal is physically richer than loss: ECN marks arrive before packets are dropped, and their frequency encodes congestion severity. This is not a better estimator for the same signal (as CUBIC was for Reno) — it is a different sensor entirely. The diagnostic: DCTCP moved from accidentally noisy measurement (loss) to designed measurement (ECN marking). The cost: the new sensor requires coordinated deployment.

9.6.5 Before/After: Standard TCP to DCTCP

| What Changed | Standard TCP (CUBIC) | DCTCP (2010) |
|---|---|---|
| Congestion signal | Loss (binary) | ECN marks (proportional) |
| Decrease rule | Halve cwnd (0.5x or 0.7x) | cwnd * (1 - alpha/2), proportional |
| Queue target | Buffer fills before reaction | Queue stays at threshold K |
| Incast handling | Synchronized collapse | Proportional marking prevents overflow |
| Deployment scope | Global Internet | Single-admin datacenter only |

9.6.6 “The Gaps Didn’t Matter… Yet”

DCTCP’s single-admin constraint was tolerable because the target environment — hyperscale datacenters — is precisely where single-admin deployment is feasible. Google, Microsoft, and Amazon all deployed DCTCP or DCTCP-inspired protocols within their datacenters. The constraint excluded the wide-area Internet, but that was acceptable because the datacenter workloads (partition-aggregate, storage replication) that benefit most from DCTCP exist only within single-admin domains.

The gap that would matter next was different: even inside the datacenter, DCTCP’s measurement signal (ECN) provides only queue-depth information. It is silent on link utilization, the number of competing flows, and the arrival rate of new traffic. Later datacenter protocols — HPCC (Li et al. 2019) and Swift (Kumar et al. 2020) — added richer in-network telemetry: INT (In-band Network Telemetry) headers carrying precise queue depth, link utilization, and timestamp information per hop. This is the E-M-B progression: from binary loss (TCP) to proportional marking (DCTCP) to per-hop telemetry (HPCC/Swift).


9.7 Act 6: “It’s 2017. TCP Cannot Evolve.”

9.7.1 Which Invariant Broke?

The Interface invariant — TCP’s permanent binding constraint — became a liability. TCP is implemented in operating system kernels: deploying a new congestion control algorithm requires an OS update, which takes months to years. Worse, middleboxes (NATs, firewalls, intrusion detection systems) inspect TCP headers and block packets with unfamiliar options. Honda et al. measured that 35% of Internet paths drop packets carrying unknown TCP options (Honda et al. 2011) — meaning TCP extensions are effectively undeployable across much of the Internet.

This is protocol ossification: the loss of protocol flexibility caused by deployed infrastructure that enforces expectations about protocol behavior. TCP’s header format, option space, and behavior became frozen by the accumulated assumptions of millions of middleboxes.

The consequences were concrete. TCP Fast Open (RFC 7413 (Cheng et al. 2014), 2014) enabled 0-RTT connection setup by caching a cryptographic cookie — a simple, backward-compatible extension. Deployment stalled because middleboxes dropped SYN packets containing unfamiliar options. Multipath TCP (MPTCP, RFC 6824 (Ford et al. 2013), 2013) added subflow management to existing connections. Again, middleboxes interfered. The pattern: any TCP extension that modified the header format or option space was blocked by infrastructure that the protocol designers lacked control over.

9.7.2 Langley’s Redesign: QUIC (2017)

QUIC escaped TCP’s ossification by building a new transport protocol over UDP (Langley et al. 2017):

  • UDP as substrate: Middleboxes already pass UDP traffic. By wrapping transport logic in UDP, QUIC avoids TCP’s frozen header format.
  • User-space implementation: QUIC runs in application libraries, not OS kernels. Updates deploy with application releases, not OS patches.
  • Encrypted headers: QUIC encrypts all transport metadata (connection state, stream IDs, sequence numbers). Middleboxes see only opaque bytes, preventing future ossification.
  • Stream multiplexing: Multiple independent streams within one connection. Loss on one stream leaves others unblocked — fixing TCP’s head-of-line blocking.
  • 0-RTT connection setup: Combined cryptographic and transport handshake. Repeat connections can send data immediately, eliminating the latency cost of TCP’s three-way handshake plus TLS setup.
  • Connection migration: Connections identified by Connection IDs, not 4-tuples. A mobile device switching from WiFi to cellular keeps its QUIC connection alive.

Applied disaggregation by separating transport logic from the OS kernel and separating per-stream reliability from per-connection ordering. TCP bundles reliability, ordering, and congestion control into one byte stream; QUIC disaggregates them. This disaggregation enables independent loss recovery per stream: if stream A loses a packet, stream B continues delivering data. In TCP with HTTP/2, a single lost packet blocks all multiplexed streams — even those with no data in the lost packet.
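Head-of-line blocking can be illustrated with a toy delivery model (an entirely hypothetical packet trace; `deliverable` is an invented helper for this sketch, not a QUIC API):

```python
def deliverable(packets, lost, per_stream):
    """Which payloads reach the application before a loss is repaired?
    packets: (stream, seq) pairs in send order; lost: set of packet indices.
    per_stream=False models TCP's single ordered byte stream (HTTP/2 over
    TCP); per_stream=True models QUIC's independently ordered streams.
    Toy rule: delivery within an ordering domain stops behind its first hole."""
    delivered = []
    blocked = set()  # ordering domains currently stuck behind a hole
    for i, (stream, seq) in enumerate(packets):
        domain = stream if per_stream else "whole-connection"
        if i in lost:
            blocked.add(domain)
        elif domain not in blocked:
            delivered.append((stream, seq))
    return delivered

pkts = [("A", 1), ("B", 1), ("A", 2), ("B", 2)]
print(deliverable(pkts, lost={0}, per_stream=False))  # [] -- all streams stall
print(deliverable(pkts, lost={0}, per_stream=True))   # [('B', 1), ('B', 2)]
```

One lost packet on stream A blocks nothing on stream B under QUIC’s model, but blocks the entire connection under TCP’s single byte stream.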

Decision placement for congestion control remained endpoint-local: QUIC runs the same CC algorithms (BBR, CUBIC) internally. The innovation lies in Interface, not CC — how transport is packaged and deployed. This is a critical distinction: QUIC changes where the code runs (user-space vs kernel) and what it encrypts (everything), but it preserves the fundamental CC architecture (endpoint-driven, ACK-based, distributed).

Applied closed-loop reasoning for connection setup: QUIC’s 0-RTT handshake combines cryptographic key exchange and transport parameter negotiation into a single round trip. For repeat connections, the client caches session parameters and can send data immediately — zero round trips of setup latency. This reduces page-load time by 1–2 RTTs compared to TCP + TLS 1.3, which is significant on high-RTT paths (50–100 ms WAN) and transformative on satellite links (300+ ms).
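The setup-latency difference can be put in a simplified model (assumes a full TLS 1.3 handshake over TCP and ignores TLS session resumption; `setup_latency_rtts` is an invented helper for this sketch):

```python
def setup_latency_rtts(protocol, repeat):
    """Connection-setup cost in RTTs before application data can flow.
    Simplified: TCP handshake (1 RTT) + full TLS 1.3 handshake (1 RTT)
    vs. QUIC's combined handshake (1 RTT, or 0 RTT on repeat visits)."""
    if protocol == "tcp+tls13":
        return 2                      # SYN/SYN-ACK, then TLS 1.3
    if protocol == "quic":
        return 0 if repeat else 1     # cached parameters enable 0-RTT
    raise ValueError(protocol)

for rtt_ms in (50, 300):              # WAN path vs. satellite path
    saved = (setup_latency_rtts("tcp+tls13", repeat=True)
             - setup_latency_rtts("quic", repeat=True)) * rtt_ms
    print(f"RTT {rtt_ms} ms: QUIC 0-RTT saves {saved} ms of setup")
```

On the 300 ms satellite path the two saved round trips amount to 600 ms before the first byte of application data — the “transformative” case noted above.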

9.7.3 Invariant Analysis: QUIC (2017)

| Invariant | QUIC Answer (2017) | Gap? |
|---|---|---|
| State | Per-stream state; Connection ID enables migration | CC algorithms inherited from TCP (same State gaps) |
| Time | 0-RTT/1-RTT handshake; same CC timing internally | Same RTT estimation challenges |
| Coordination | Same distributed CC as TCP | Same fairness problems (BBR vs CUBIC) |
| Interface | UDP wrapper; encrypted headers; stream multiplexing | Middleboxes that block UDP also block QUIC |

QUIC’s contribution is primarily to the Interface invariant. It demonstrates that when the binding constraint itself ossifies, the solution is to change the deployment layer — move the interface to a substrate that middleboxes cannot inspect.

9.7.4 Environment, Measurement, Belief: QUIC

| Layer | What QUIC Has | What’s Missing |
|---|---|---|
| Environment | Same network as TCP | Same invisibility |
| Measurement | Same ACK-based signals (per-stream) | Same noise, but per-stream loss isolation |
| Belief | Same CC models (BBR/CUBIC internally) | Same CC gaps; innovation is in deployment, not CC |

The E-M gap is identical to TCP’s for congestion control. QUIC’s innovation is orthogonal: it changes how transport is packaged and deployed, while congestion sensing and control remain unchanged. The lesson: when the Interface invariant ossifies, restructuring the other invariants (State, Time, Coordination) is insufficient — the interface itself must change.

Concrete deployment impact: By 2017, Google reported that 35% of its egress traffic used QUIC. HTTP/3 (RFC 9114 (Bishop 2022)) standardized on QUIC as its transport. IETF QUIC (RFC 9000 (Iyengar and Thomson 2021), 2021) diverged from Google’s original design but preserved the core architectural choices: UDP substrate, mandatory encryption, stream multiplexing, connection migration.


9.8 The Grand Arc: From Cerf/Kahn to QUIC

9.8.1 The Evolving Anchor

The binding constraint — the IP datagram interface — remained constant across all six acts. Every innovation operated within the same Interface and Coordination constraints. What changed was the State invariant (what the sender believes about the network) and the Time invariant (how timing decisions are made), restructured generation after generation as the environment evolved.

This constancy distinguishes transport from routing (where the binding constraint shifts across eras). TCP’s meta-constraint — IP datagrams exist, and no entity has central authority — never changed. The dependency chain is stable: Interface (locked) forces Coordination (distributed), which forces State (inferred from local signals), which forces Time (estimated from ACK stream). Every TCP variant navigates within this chain. The variation is in the quality of the estimator and the richness of the signal.

| Era | Environment | State (Belief) | Time | What Broke |
|---|---|---|---|---|
| 1974 | Small ARPANET, < 200 hosts | Connection state only | Fixed timers | “Enough capacity” assumption |
| 1988 | 5,000+ hosts, congestion collapse | cwnd + SRTT/RTTVAR | EWMA estimator | Loss signal delayed by deep buffers |
| 2008 | 10 Gbps fiber, high-BDP | Cubic function of wall-clock time | RTT-independent growth | Loss still sole signal; bufferbloat |
| 2017 (BBR) | Deep buffers, bufferbloat | (BtlBw, RTprop) explicit model | Phase-based probing | Fairness with loss-based flows |
| 2010 (DCTCP) | Datacenter: 1 ms RTT, ECN | alpha (ECN fraction) | ECN faster than loss | Single-admin only; cannot coexist |
| 2017 (QUIC) | Ossified TCP; middlebox interference | Per-stream state; Connection ID | 0-RTT handshake | UDP blocking in some networks |

9.8.2 Three Design Principles Applied Across the Arc

Disaggregation was applied repeatedly, each time separating a new concern:

  • Cerf/Kahn applied disaggregation by separating endpoint state (TCP) from network forwarding (IP). Routers are stateless; endpoints maintain all connection semantics.
  • Jacobson applied disaggregation by separating per-flow state (cwnd, SRTT) from network state. Each sender tracks its own capacity belief independently.
  • QUIC applied disaggregation by separating per-stream reliability from per-connection ordering, and separating transport logic from the OS kernel.

Closed-loop reasoning was the dominant design strategy, refined across every generation:

  • Jacobson applied closed-loop reasoning with the AIMD feedback loop: measure loss, adjust cwnd, send, repeat. Loop period = RTT.
  • BBR applied closed-loop reasoning with a model-based loop: measure delivery rate and min RTT, update (BtlBw, RTprop) model, set rate to match model.
  • DCTCP applied closed-loop reasoning with a richer measurement signal: ECN marks provide a continuous congestion gradient instead of binary loss.

Decision placement remained consistently distributed across all generations. Every TCP variant keeps rate-control authority at endpoints. DCTCP added switch-side measurement (ECN marking) but preserved endpoint-side decision-making. This consistency follows from the binding constraint: no entity has global authority over Internet traffic, so rate decisions must be local.

The one context where decision placement shifted was the datacenter. Under single administrative authority, DCTCP coordinates switches (marking) and endpoints (proportional response). HPCC goes further: switches provide per-hop telemetry, and endpoints compute rate adjustments from the full path signal. This is possible only because the datacenter operator controls both switches and hosts — the Coordination invariant’s answer changes when the meta-constraint changes from “no central authority” to “single administrator.”

The transport lineage demonstrates a general pattern: when the binding constraint is constant (IP interface, distributed authority), innovation concentrates in the State and Time invariants. Designers restructure what the system believes and how it measures, because the interface and coordination model are frozen. The two exceptions — DCTCP (new measurement interface via ECN) and QUIC (new transport interface via UDP) — both required environments where the deployment barrier was lower: single-admin datacenters and user-space libraries, respectively.

9.8.3 The Dependency Chain

```mermaid
flowchart TD
    A[IP datagrams: unreliable, unordered, no feedback]:::constraint --> B[Coordination: distributed endpoints, no central authority]:::constraint
    B --> C[State: must infer capacity from local signals]:::constraint
    C --> D[Time: must estimate RTT from ACK stream]:::constraint

    E[1986: Congestion collapse]:::failure --> F[Jacobson 1988: cwnd + AIMD + SRTT]:::fix
    F --> G[Loss = congestion signal assumption]:::constraint
    G --> H[High-BDP: linear increase too slow]:::failure
    H --> I[CUBIC 2008: cubic growth, RTT-independent]:::fix
    G --> J[Bufferbloat: loss delayed by deep buffers]:::failure
    J --> K[BBR 2017: model-based, BtlBw + RTprop]:::fix
    G --> L[Datacenter: incast, shallow buffers, low RTT]:::failure
    L --> M[DCTCP 2010: proportional ECN marking]:::fix

    N[TCP ossification: middleboxes block extensions]:::failure --> O[QUIC 2017: transport over UDP, encrypted]:::fix

    classDef constraint fill:#4a90d9,stroke:#2c5f8a,color:white
    classDef failure fill:#d94a4a,stroke:#8a2c2c,color:white
    classDef fix fill:#4ad94a,stroke:#2c8a2c,color:white
```

9.8.4 Pioneer Diagnosis Table

| Year | Pioneer | Invariant | Diagnosis | Contribution |
|---|---|---|---|---|
| 1974 | Cerf & Kahn | Interface | Heterogeneous networks need common protocol | TCP/IP: byte stream over datagrams |
| 1984 | Nagle | State | Senders flood network without congestion awareness | RFC 896: named congestion collapse |
| 1986 | Zhang | Time | RTT estimation too crude for reliable timers | Identified timer failure modes |
| 1987 | Karn & Partridge | State | Retransmit ambiguity corrupts RTT samples | Karn’s algorithm: discard ambiguous samples |
| 1988 | Jacobson | State + Time | Conservation of packets violated three ways | Slow start, AIMD, SRTT/RTTVAR, fast retransmit |
| 1989 | Chiu & Jain | Coordination | Distributed rate control needs convergence proof | AIMD convergence to fairness |
| 2008 | Ha, Rhee, Xu | State | Linear increase too slow for high-BDP | CUBIC: cubic growth, RTT-independent |
| 2010 | Alizadeh et al. | State | Loss too coarse for datacenter | DCTCP: proportional ECN decrease |
| 2017 | Cardwell et al. | State | Loss is the wrong signal (bufferbloat) | BBR: model-based CC (BtlBw, RTprop) |
| 2017 | Langley et al. | Interface | TCP ossified by middleboxes and kernels | QUIC: transport over UDP, encrypted |

9.8.5 The General Principle

The tighter the coupling between measurement and environment, the better the closed-loop performance. ECN is tighter than loss-based signaling. Loss is tighter than timeout. (BtlBw, RTprop) is tighter than cwnd. Per-hop INT telemetry is tighter than endpoint-only ACK signals. The cost of tighter coupling is always higher per-packet overhead and stricter deployment requirements. The design trade-off is eternal: tighter coupling enables higher utilization and lower latency, but at the cost of complexity and deployment friction.

9.8.6 Innovation Timeline

```mermaid
flowchart TD
    subgraph sg1["Origins"]
        A1["1974 — Cerf/Kahn: TCP/IP"]
        A2["1981 — RFC 793"]
        A1 --> A2
    end
    subgraph sg2["Diagnosis"]
        B1["1984 — Nagle: RFC 896"]
        B2["1986 — Zhang: timer critique"]
        B3["1987 — Karn/Partridge: fix"]
        B1 --> B2 --> B3
    end
    subgraph sg3["The Jacobson Era"]
        C1["1988 — Jacobson: slow start + AIMD"]
        C2["1989 — RFC 1122 mandate"]
        C3["1990 — Reno fast recovery"]
        C1 --> C2 --> C3
    end
    subgraph sg4["Refinement"]
        D1["1992 — RFC 1323: window scaling"]
        D2["1993 — Floyd: RED"]
        D3["1996 — SACK (RFC 2018)"]
        D1 --> D2 --> D3
    end
    subgraph sg5["Scaling"]
        E1["2008 — Ha/Rhee/Xu: CUBIC"]
        E2["2010 — Alizadeh: DCTCP"]
        E1 --> E2
    end
    subgraph sg6["Model-Based"]
        F1["2017 — Cardwell: BBR"]
        F2["2017 — Langley: QUIC"]
        F3["2021 — RFC 9000: IETF QUIC"]
        F1 --> F2 --> F3
    end
    sg1 --> sg2 --> sg3 --> sg4 --> sg5 --> sg6
```

Transport and Congestion Control: 1974–2021


9.9 Generative Exercises

Exercise 1: TCP for Mars

Mars rovers communicate with Earth at 20-minute one-way latency. RTT = 2,400 seconds.

  1. Predict the closed-loop failure: Slow start from cwnd = 1 doubles once per RTT, so reaching cwnd = 100 takes ceil(log_2(100)) = 7 RTTs, or 7 * 2,400 = ~16,800 seconds (~4.7 hours). AIMD then adds 1 MSS per 2,400-second RTT. Loss recovery takes one RTO (~3,600 seconds = 1 hour). What happens to throughput for a file transfer?

  2. Diagnose through invariants: Which invariant is under greatest pressure? (Time: the feedback loop is 2,400 seconds. The sender is blind for 40 minutes between sending and learning the result.) Can you keep the IP interface but change the Time model? What if you replace inferred timing with prescribed timing based on orbital mechanics?

  3. Design a fix: Propose a preloading mechanism — before losing connectivity, precompute a cwnd schedule from ground-truth Earth-Mars link models. Upload the schedule to the rover. The rover increases cwnd on wall-clock time, not RTT. What invariant answers change? What new failure modes appear if the precomputed model is wrong? What if a solar event disrupts the link mid-schedule?

  4. Alternative architecture: Propose a delay-tolerant transport protocol that does not use ACKs for reliability. Instead, use forward error correction (FEC) to encode redundancy at the sender. How does removing the ACK-based feedback loop change all four invariant answers? What is the cost in bandwidth efficiency?

Exercise 2: Heterogeneous CC Coexistence

A campus network carries traffic from three populations: students using CUBIC (browser default), a streaming service using BBR, and a research cluster using DCTCP with ECN. All share a single 10 Gbps backbone link with a 10 MB buffer.

  1. Predict the interaction: BBR probes past loss. CUBIC halves cwnd on loss. DCTCP responds to ECN marks. Who dominates? Who starves?

  2. Diagnose through the Coordination invariant: AIMD converges to fairness only when all flows use the same algorithm. When flows use different CC algorithms, what happens to the convergence proof? Can you define “fairness” when flows have different loss responses?

  3. Design a solution: Propose a mechanism that achieves some notion of fairness across heterogeneous CC algorithms. Where would you place this mechanism — at endpoints, at switches, or at both? What are the deployment constraints?

  4. L4S prediction: L4S (Low Latency, Low Loss, Scalable Throughput, RFC 9332 (De Schepper and Briscoe 2023)) proposes separating traffic into two queues: a “classic” queue for legacy TCP and an “L4S” queue for scalable CC algorithms (like DCTCP-style). Analyze this through the Coordination invariant: does separating queues solve the fairness problem, or does it create a new one? What happens when 90% of traffic is L4S and 10% is classic?

Exercise 3: Transport for a Satellite Constellation

LEO satellite constellations (Starlink, Kuiper) introduce an environment with properties of both WAN and datacenter: variable RTTs, single operator, frequent path changes.

A LEO satellite constellation (Starlink-like) has variable RTTs: 20 ms when the satellite is overhead, 80 ms at low elevation angles, with handoffs every 15 seconds as satellites move. The path changes completely every 90 seconds.

  1. Predict the measurement failure: SRTT with alpha = 1/8 tracks RTT changes slowly. How many samples to converge after a 20 ms to 80 ms jump? What happens to RTO during the transition?

  2. BBR vs CUBIC: BBR’s RTprop estimate uses minimum RTT over 10 seconds. What happens when the path’s true propagation delay increases by 4x every 15 seconds? How does CUBIC’s loss-based approach handle the same scenario?

  3. Design the right CC: This environment has properties of both WAN (variable RTT, multi-hop) and datacenter (single operator controls all satellites). Which existing CC algorithm is closest to correct? What modification would you make, and which invariant does it restructure?

  4. Connection migration: A user drives along a highway while on a video call. The ground terminal hands off between satellite beams every 15 seconds. TCP connections using 4-tuple identification break on every handoff. QUIC’s Connection ID survives handoffs. What other transport-layer changes are needed for seamless satellite handoff? Consider: does the path’s BDP change during handoff? Should the CC state (cwnd or BtlBw estimate) be preserved across handoffs or reset?
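As a starting point for question 1 of Exercise 3, a minimal EWMA sketch (RFC 6298-style smoothing with alpha = 1/8; the 20 ms to 80 ms jump is the scenario above):

```python
def srtt_trace(samples, alpha=1/8, srtt0=None):
    """EWMA RTT estimator (RFC 6298 smoothing). Returns the SRTT value
    after each sample, starting from srtt0 (or the first sample)."""
    srtt = srtt0 if srtt0 is not None else samples[0]
    out = []
    for r in samples:
        srtt = (1 - alpha) * srtt + alpha * r
        out.append(srtt)
    return out

# Path RTT jumps from 20 ms to 80 ms at a handoff:
trace = srtt_trace([80.0] * 30, srtt0=20.0)
within_10pct = next(i for i, s in enumerate(trace) if s > 0.9 * 80)
print(f"{within_10pct + 1} samples to reach 90% of the new RTT")  # 16 samples
```

With handoffs every 15 seconds and one sample per RTT, it is worth asking whether the estimator ever converges before the path changes again.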


This chapter is part of “A First-Principles Approach to Networked Systems” by Arpit Gupta, UC Santa Barbara, licensed under CC BY-NC-SA 4.0.


  1. Maximum Segment Size (MSS) is the largest payload TCP sends in a single segment, typically 1460 bytes (1500-byte Ethernet MTU minus 20-byte IP header minus 20-byte TCP header).↩︎

  2. The SRTT and RTTVAR formulas are Exponentially Weighted Moving Averages (EWMAs). An EWMA gives recent samples more weight than older ones, with the weight decaying exponentially. Alpha = 1/8 means each new RTT sample contributes 12.5% to the smoothed estimate — the estimator adapts to changes within roughly 8 samples. Beta = 1/4 weights the deviation estimator more aggressively so RTTVAR tracks variance changes faster than the mean. These values were chosen empirically by Jacobson (Jacobson 1988) and codified in RFC 6298 as powers of two for efficient bit-shift implementation.↩︎

  3. Bandwidth-Delay Product (BDP) = link bandwidth × round-trip time. For a 1 Gbps link with 100 ms RTT, BDP = 10^9 × 0.1 = 10^8 bits = 12.5 MB. This is the amount of data “in flight” needed to fully utilize the link. TCP’s congestion window must reach BDP for full utilization.↩︎

  4. CUBIC’s window growth is W(t) = C(t - K)^3 + W_max, where C is a scaling constant, K = (W_max × (1 − β)/C)^{1/3} is the time to reach the previous maximum, and β = 0.7 is the multiplicative decrease factor (after loss, cwnd drops to β × W_max). The cubic function grows slowly near W_max (probing carefully) and aggressively far from it.↩︎