5  Transport and Congestion Control


5.1 The Anchor: Unreliability and Decentralization

TCP operates at the convergence of two hard constraints inherited from below: the IP interface and the Internet’s administrative decentralization. IP offers best-effort delivery—datagrams routed without guarantee of arrival, order, or timing. More fundamentally, IP provides no feedback to the sender about network state. A packet is delivered, lost, or delayed; the sender observes only arrival of acknowledgments or absence thereof. No router signals congestion explicitly. No path tells the sender what capacity remains.

This is not incidental. It is architectural policy. Clark’s [1988] design philosophy embedded administrative decentralization as a founding principle: the Internet operates across multiple independent domains with no entity possessing global visibility. Congestion control cannot depend on a central scheduler, a shared clock, or routers capable of universal rate coordination. It must emerge from endpoint-local observations and endpoint-local decisions.

Transport is the endpoint component responsible for end-to-end delivery. Its engineering question: how do I deliver data reliably and efficiently across a path I don’t control? The end-to-end argument places intelligence at endpoints because the network interior cannot know application requirements — a decomposition choice that defines TCP’s entire design space.

The IP interface is thus TCP’s anchor. Everything else cascades from this constraint: TCP must infer the state of the network using only local signals (ACK arrivals, loss events, RTT samples). TCP must coordinate with other flows without seeing them. TCP must make sending-rate decisions autonomously, producing convergence not through a global algorithm but through the interaction of independent closed loops.


5.2 TCP’s Invariant Answers

5.2.1 State: Belief about Capacity from Noisy Measurement

Environment state is the true bottleneck queue occupancy, competing traffic, link rate, and propagation delay—none of which the sender can observe directly. The sender’s measurement signals are ACK arrivals and loss events. ACKs arrive when the receiver acknowledges in-order delivery; loss is inferred either from timeout (no ACK for an expected packet) or from duplicate ACKs (receiver signaling out-of-order arrival).

TCP’s internal belief about available capacity is encoded in the congestion window (cwnd), measured in maximum segment size (MSS) units. The cwnd represents how many bytes the sender is willing to have in flight—not queued at the receiver, but in transit through the network, occupying buffer and link capacity somewhere on the path.

The measurement-to-belief gap is vast. The ACK arrival rate depends on bottleneck link rate, propagation delay, and receiver ack frequency (delayed ACK may group acknowledgments). The loss signal is binary and delayed—when loss occurs, the sender does not learn until either a timeout fires (RTT + variance estimate, typically 200 ms or more in WAN conditions) or it receives three duplicate ACKs (RTT delay at minimum). The congestion window is a proxy for available capacity, but the proxy is noisy, delayed, and subject to systematic errors.

This gap is where bufferbloat lives. Large buffers (gigabyte-scale packet buffers in modern switches) delay the loss signal. When congestion occurs, the bottleneck link becomes full, but the buffer absorbs packets instead of dropping them. The sender observes ACKs arriving on time (because buffer drains into the link at line rate) and believes the network is healthy. The loop takes many RTTs to detect congestion—so many that the buffer is completely full by the time cwnd backs off. The belief is stale; latency spikes. A datacenter with RTT = 1 ms and a 10 Gbps link has a bandwidth-delay product (BDP) of 10 Gbps × 0.001 s = 10 megabits = 1.25 megabytes. A single switch buffer of 10 GB can hide congestion for 8,000 RTTs. During those RTTs, the sender has no signal that the network state has changed.

The three-layer decomposition of state illuminates the diagnosis: environment (true queue, true link rate), measurement (ACK timing, loss event), belief (cwnd). Failures live in the gaps between layers.

5.2.2 Time: Inferred from Local Observation via Jacobson’s Algorithm

TCP has no access to a shared, synchronized clock. No NTP signal arrives at the sender to announce “the network is congested.” Instead, TCP estimates round-trip time by measuring the delay between sending a packet and receiving its acknowledgment.

Jacobson’s [1988] RTT estimation algorithm underpins all TCP timing. The algorithm maintains two state variables: smoothed RTT (SRTT) and RTT variance (RTTVAR). When a new RTT sample arrives from the network, both are updated:

RTTVAR = (1 - β) * RTTVAR + β * |SRTT - RTT_sample|
SRTT = (1 - α) * SRTT + α * RTT_sample
RTO = SRTT + 4 * RTTVAR

The RFC 6298 standard uses α = 1/8 and β = 1/4. These constants are tuned empirically: α = 1/8 means the smoothed RTT is weighted heavily toward history (7/8 old, 1/8 new), allowing gradual adaptation without oscillation. RTTVAR is similarly conservative, ensuring the RTO timeout is rarely triggered by normal variance. The multiplier 4 on RTTVAR is chosen to cover roughly the 95th percentile of RTT samples under normal conditions.
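These update rules are short enough to state directly in code. A minimal sketch of the RFC 6298 estimator follows; the class name `RttEstimator` and its per-sample interface are illustrative, not taken from any particular TCP stack:

```python
# Sketch of the RFC 6298 estimator. Variable names (srtt, rttvar, rto)
# follow the RFC. The RFC also specifies a 1-second floor on RTO, which
# real stacks often lower; it is omitted here to keep the math visible.
class RttEstimator:
    ALPHA = 1 / 8   # weight of the new sample in SRTT
    BETA = 1 / 4    # weight of the new deviation in RTTVAR

    def __init__(self):
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # initial RTO in seconds, per RFC 6298

    def on_sample(self, rtt):
        if self.srtt is None:
            # First measurement: RFC 6298 bootstrap values.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Order matters: RTTVAR is updated using the *old* SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar \
                          + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = self.srtt + 4 * self.rttvar
        return self.rto

est = RttEstimator()
for _ in range(50):
    est.on_sample(0.050)            # steady 50 ms samples
print(f"SRTT={est.srtt*1e3:.1f} ms, RTO={est.rto*1e3:.1f} ms")  # SRTT=50.0 ms, RTO=50.0 ms
```

Note that with perfectly stable samples RTTVAR decays toward zero, so RTO converges to SRTT; it is jitter that keeps the 4 × RTTVAR safety margin alive.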

The consequences of these choices are profound. If RTT is 50 ms with a few milliseconds of jitter, RTTVAR settles around 2.5 ms and RTO becomes 50 + 4 × 2.5 = 60 ms. (With perfectly stable samples, RTTVAR decays toward zero.) But if RTT suddenly spikes to 500 ms (a routing change, a congestion event on the path), the adaptation is glacial. The first sample updates SRTT by 1/8 × (500 - 50) = 56.25 ms, so SRTT becomes 106.25 ms. Even after 8 samples at the new RTT, SRTT has risen only to about 345 ms, still well short of the true value. During those samples, the sender’s picture of the path’s latency is badly stale.

A datacenter network with RTT = 1 ms exhibits this problem acutely. Suppose the retransmission timeout starts at RTO = 1 + 4 × 0.02 ≈ 1.08 ms. If a path failure causes RTT to spike to 10 ms, SRTT needs roughly 8 samples to cover two-thirds of the gap, and each sample now takes 10 ms to collect, so adaptation costs on the order of 80 ms of wall-clock time (ignoring RTTVAR). In the interim, the stale 1.08 ms RTO fires before the 10 ms ACKs can arrive, producing spurious retransmissions; once RTTVAR inflates in response, the timeout can swing so conservative that genuine losses are detected slowly.

Loss detection—whether through timeout or duplicate ACK—is TCP’s only explicit time signal for congestion. When loss occurs, the sender assumes the bottleneck has been exceeded and congestion backpressure is warranted. This inference is not always correct. Wireless links lose packets to corruption, not congestion. Receiver buffers drop packets, not because the link is full, but because the receiver has run out of application-level read buffers. In both cases, the sender’s time signal (loss) is decoupled from the actual congestion event.

5.2.3 Coordination: Distributed AIMD without Central Authority

Coordination is fully distributed and asynchronous. The sender adjusts cwnd based only on local signals. The receiver plays no role in rate control; it merely acknowledges delivery. No central arbiter assigns rates. Competing flows do not know of each other and do not exchange messages about congestion.

The coordination rule is additive-increase, multiplicative-decrease (AIMD). In the absence of loss, the sender increases cwnd by one MSS for every RTT of ACKs received (additive increase). Upon loss, the sender decreases cwnd by half (multiplicative decrease).

The mathematics of AIMD converges provably. Suppose N flows share a bottleneck of capacity C. All flows are identical, with current cwnd at equilibrium. When a packet is lost (from some flow), that flow detects loss and halves its window. Other flows do not reduce their windows; they continue increasing. Over time, the flow that halved its window increases back to equilibrium, while other flows overshoot again. This oscillation around an equilibrium point emerges naturally without any global state.

The convergence argument is elementary. A flow increases by 1 MSS per RTT and halves cwnd on loss, so climbing back from cwnd/2 to cwnd takes cwnd/2 RTTs: at equilibrium, the additive gains over one sawtooth cycle exactly replace the cwnd/2 lost at each decrease. For identical RTTs and synchronized behavior, all flows converge to a cwnd such that N * cwnd ≈ C. The equilibrium is stable and fair: a flow above its fair share is more exposed to loss, and multiplicative decrease removes more window from large flows than from small ones, pulling shares together.

But convergence is not instant, and it is not perfectly fair. Fast networks (high capacity, low loss rate) take many RTTs to converge. Flows with different RTTs have different convergence speeds: a 10 ms RTT flow increases faster (more increase events per second) than a 100 ms RTT flow, because it receives ACKs 10x more frequently. This RTT bias is a long-standing fairness problem in TCP. Low-RTT flows get more capacity because they pack more additive increases into the same wall-clock time: a 1 ms flow performs 1,000 additive increases per second, a 100 ms flow only 10. Under synchronized losses, the resulting throughput gap can approach the square of the RTT ratio, not because of network dynamics, but because of how AIMD samples time.
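The bias can be reproduced with a toy simulation. The sketch below makes simplifying assumptions that real TCP does not (fluid windows, millisecond ticks, perfectly synchronized losses), and every parameter is invented for illustration, but it shows two AIMD flows sharing a 200-MSS bottleneck, with the 10 ms flow ending up far ahead:

```python
# Illustrative toy model, not a packet-level simulation: two AIMD flows
# share a bottleneck that holds 200 MSS in flight. Each flow adds 1 MSS
# per *its own* RTT; when the combined window exceeds capacity, both see
# loss and halve (synchronized loss is a simplifying assumption).
def simulate(rtt_a_ms=10, rtt_b_ms=100, capacity=200, duration_ms=200_000):
    cwnd_a, cwnd_b = 1.0, 1.0
    sum_a = sum_b = 0.0
    for t in range(duration_ms):
        if t % rtt_a_ms == 0:
            cwnd_a += 1            # additive increase, once per RTT_a
        if t % rtt_b_ms == 0:
            cwnd_b += 1            # additive increase, once per RTT_b
        if cwnd_a + cwnd_b > capacity:
            cwnd_a /= 2            # synchronized multiplicative decrease
            cwnd_b /= 2
        # Instantaneous throughput is cwnd/RTT; accumulate MSS per ms.
        sum_a += cwnd_a / rtt_a_ms
        sum_b += cwnd_b / rtt_b_ms
    return sum_a / sum_b           # ratio of average throughputs

ratio = simulate()
print(f"throughput ratio (10 ms vs 100 ms RTT): ~{ratio:.0f}x")
```

Under these assumptions the ratio lands near 100, the square of the 10x RTT ratio: the low-RTT flow both grows its window 10x faster and drains each window 10x more often.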

5.2.4 Interface: Reliable Byte Stream over Unreliable Datagrams

TCP presents a clean abstraction to applications: a reliable, in-order byte stream. Applications write bytes to a socket without awareness of packets, loss, or network state. TCP handles all reordering, retransmission, and flow control internally.

Below, TCP uses the unreliable IP datagram interface. TCP segments (encapsulated in IP packets) are routed individually, with no guarantee of delivery or order. IP is stateless—routers do not maintain per-connection state. Each packet is forwarded based on its destination IP address alone. This architectural decision (stateless routers, endpoint state) forced all connection semantics (sequence numbers, retransmission, ordering) to move to TCP, the endpoints.

The ACK interface is lightweight but information-rich. The receiver acknowledges the highest sequence number received in order. TCP uses selective acknowledgment (SACK) to allow receivers to report gaps in the received sequence (RFC 2018)—a refinement that allows the sender to retransmit only missing segments, not entire windows. But the core ACK signal remains the same: a number representing the highest contiguous byte delivered, sent back to the sender with its own sequence number.


5.3 The Closed Loop: AIMD Dynamics and Failure Modes

TCP’s closed loop is the simplest and most elegant in the networked systems canon. Pseudocode:

while(sender_has_data):
    if(ACKs_received_without_loss):
        cwnd += MSS  // Increase additive per RTT of ACKs
    else:
        cwnd /= 2    // Decrease multiplicative
    send_up_to_cwnd()
    measure_RTT()

AIMD Dynamics: In the absence of loss, the sender increases cwnd by one maximum segment size (MSS) for every round-trip time (RTT) of acknowledgments received—the “additive increase” phase probes for available capacity. Upon packet loss, detected either via timeout or duplicate ACKs, the sender immediately decreases cwnd by half—the “multiplicative decrease” phase backs off aggressively to relieve congestion.

As Figure 5.1 shows, the sawtooth pattern that emerges is not a bug—it is the system reaching equilibrium.

TCP AIMD Congestion Control Dynamics

The sawtooth pattern visualizes the closed-loop dynamics of TCP’s AIMD (Additive Increase, Multiplicative Decrease) congestion control, where the sender probes available capacity by increasing the congestion window (cwnd) at a linear rate until loss is detected, then halves cwnd in response. Observe how the additive increase phase (roughly 1 packet per RTT) gradually fills the pipe and queues, while loss events (marked by red dots) trigger multiplicative decreases that sharply reduce cwnd by 50%. The green dashed line at cwnd=32 shows the slow-start threshold (ssthresh), which separates exponential probing from linear probing. This equilibrium—where the rate of additive gains roughly balances multiplicative losses over many cycles—emerges not as a bug, but as the system reaching a stable point where loss probability is low but not zero, achieving high utilization without overwhelming the bottleneck.

Figure 5.1: The loop delay is RTT, which governs how quickly the sender can react to congestion. A datacenter flow with 1 ms RTT reacts 100x faster than a wide-area flow with 100 ms RTT, creating a fundamental coupling between the time model and the convergence properties of the adaptation loop. Flows with longer RTTs reach equilibrium slowly and are biased against in competing for capacity, illustrating how the time model directly shapes fairness.

The sawtooth is equilibrium in action. The rate at which additive increase gains capacity (1 MSS per RTT) roughly balances the rate at which multiplicative decrease loses it (cwnd/2 per loss event). The pipe model shows data in flight through the network at each point: when cwnd is large, more data fills the path and queues (increasing latency); when cwnd shrinks, the network drains and latency drops. Over many cycles, the system converges to a sending rate where loss probability is low but not zero, achieving utilization without overwhelming the bottleneck.

The loop delay is RTT. This is consequential. A flow with 100 ms RTT takes at least 100 ms to react to congestion. During those 100 ms, if the sender was transmitting at near-bottleneck rate, congestion has deepened. By contrast, a datacenter flow with 1 ms RTT reacts 100x faster. This RTT coupling—present in every adaptation loop—is ineliminable without changing the time model.

The loop is stabilizing. After loss, cwnd halves instantly, cutting the sending rate by 50%. The sender then slowly increases again, probing the network’s capacity, and backs off each time loss recurs. The equilibrium loss probability p satisfies a balance condition: over one sawtooth cycle, the additive gains (1 packet per RTT) exactly replace the cwnd/2 surrendered at each loss event.
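The sawtooth can be sketched in a few lines. This is a deliberately crude model (one flow, units of MSS and RTTs, loss declared exactly when in-flight data exceeds BDP plus buffer; all parameter values are invented for illustration):

```python
# Minimal sawtooth sketch: one AIMD flow against a bottleneck that drops
# when in-flight data exceeds BDP + buffer. Units are MSS and RTTs.
def sawtooth(bdp=32, buffer=8, rtts=100):
    cwnd, trace = 1.0, []
    for _ in range(rtts):
        if cwnd > bdp + buffer:   # queue overflowed last round: loss seen
            cwnd /= 2             # multiplicative decrease
        else:
            cwnd += 1             # additive increase, 1 MSS per RTT
        trace.append(cwnd)
    return trace

t = sawtooth()
print(min(t[50:]), max(t[50:]))   # oscillates roughly between 20 and 41 MSS
```

The steady-state window cycles between roughly (BDP + buffer)/2 and BDP + buffer, which is the sawtooth of Figure 5.1 in miniature.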

But the loop has failure modes, visible at scale:

TCP Incast (data center) occurs when many senders (e.g., 100 servers) send to a single receiver over a single bottleneck link. All senders transmit at line rate initially. The receiver buffer fills instantly. Synchronous loss affects all flows simultaneously. All flows halve cwnd. When the receiver drains the buffer, all flows increase again in lockstep. The result is periodic synchronous loss, not smooth utilization. Utilization may drop to 10% despite ample capacity. Cause: synchronized response to a shared bottleneck. Fix: smaller timeout values to detect loss faster (TCP_MINRTO = 1 ms instead of the standard 200 ms), or active queue management (CoDel) to provide earlier feedback.

Bufferbloat occurs when queues at the bottleneck are large. This is not a failure of TCP per se, but a coupling failure between TCP (which keeps increasing cwnd until it sees loss) and queue management (which should be dropping packets early, not buffering endlessly). The sender’s belief (the network is healthy) diverges from the environment (the link is saturated and the queue is deep). Latency becomes the measurement signal’s victim: the queue grows, TCP sees a steady stream of ACKs and no loss (the queue just drains slowly), and RTT climbs. Jitter explodes. A 1 Gbps link with a 1 GB buffer can hide congestion for 8 seconds once the bottleneck is saturated: eight thousand milliseconds of RTT inflation.

RTT-Biased Fairness: AIMD with different RTTs does not converge to equal-bandwidth fairness. A flow with RTT = 1 ms increases cwnd 100x per second; a flow with RTT = 100 ms increases 1x per second. At equilibrium, the low-RTT flow has 100x larger cwnd. This is not a bug in AIMD—it follows directly from the mathematics—but it violates the fairness assumption that all flows should share capacity equally.

Slow Convergence to Capacity: On a link with loss rate p, a flow’s cwnd at equilibrium is roughly sqrt(2 / p). For p = 0.01 (1% loss), cwnd ≈ 14 packets, reached from cwnd = 1 in log_2(14) ≈ 4 RTTs of slow-start. For p = 0.001 (0.1% loss), equilibrium cwnd ≈ 45, taking about 6 RTTs. For p = 0.0001 (0.01% loss, typical on fiber), equilibrium cwnd ≈ 140, taking about 8 RTTs. For high-capacity, low-loss networks (modern fiber, datacenters), this is both too slow and too small. A 100 Gbps WAN link with 100 ms RTT has BDP = 100 Gbps × 0.1 s = 10 gigabits = 1.25 GB, about 833,000 segments of 1500 bytes; the loss-limited equilibrium cwnd of ≈ 140 segments fills only a tiny fraction of that pipe, and even the slow-start ramp to 140 takes 8 RTTs = 800 ms. Modern variants address this.
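The arithmetic above is easy to reproduce. A short sketch using the text’s sqrt(2/p) approximation for the equilibrium window:

```python
import math

# Reproduces the arithmetic in the text: equilibrium cwnd ≈ sqrt(2/p),
# and the slow-start RTTs needed to double from cwnd = 1 up to it.
for p in (0.01, 0.001, 0.0001):
    w = math.sqrt(2 / p)
    ramp = math.ceil(math.log2(w))
    print(f"p={p}: equilibrium cwnd ≈ {w:.0f} MSS, slow-start ramp ≈ {ramp} RTTs")
```

The equilibrium window grows only as 1/sqrt(p), which is why halving the loss rate buys so little capacity for a loss-based sender.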


5.4 Loss-Based Evolution: From Reno to CUBIC

The steady state of transport research for 25 years: how to improve TCP’s state representation (the cwnd belief model) without changing the interface (ACK-based feedback)? TCP’s congestion control has evolved through four generations, each restructuring the state invariant in response to environmental pressure. The anchor constraint—IP’s best-effort, decentralized interface—remains constant. But as bandwidth scaled (10 Mbps → 100+ Gbps), RTT diversity increased (1 ms datacenters vs. 200 ms WAN), and queue management improved (ECN availability), the optimal belief system (how to interpret network state) shifted. Figure 5.2 shows how Reno’s loss-based approach worked for moderate speeds, CUBIC’s cubic growth accelerates recovery for high-BDP paths, BBR decouples bandwidth measurement from loss interpretation, and DCTCP uses ECN marks for fine-grained feedback.

TCP Variants: Reno, CUBIC, BBR, and DCTCP Congestion Control

TCP’s congestion control has evolved through multiple generations, each restructuring the belief system about network state in response to environmental pressures. TCP Reno (1990) introduced slow-start with exponential growth in the early phase, then switched to linear additive-increase congestion avoidance. Loss detection (via duplicate ACKs or timeout) triggers a reset to slow-start threshold at half the current window. This linear increase model worked well for moderate-bandwidth, moderate-latency links, but becomes a bottleneck on high-bandwidth, high-latency paths where the bandwidth-delay product is enormous.

Figure 5.2: CUBIC (2008) addresses this bottleneck by replacing linear increase with a cubic function, allowing faster recovery toward higher equilibrium windows on high-BDP paths. BBR represents a different paradigm: instead of inferring network state from loss, BBR explicitly models the network as a pipe with bounded bandwidth and round-trip delay, measuring these parameters directly and maintaining a window size equal to bandwidth × delay. DCTCP uses ECN marks (explicit congestion notification) proportionally: when congestion increases, ECN marks rise in frequency, allowing DCTCP to reduce the window proportionally rather than dramatically. Each variant embodies a different belief system: Reno assumes loss is the primary congestion signal and reacts sharply; CUBIC assumes high-BDP networks need gentler recovery; BBR assumes direct bandwidth measurement is possible; DCTCP assumes dense feedback via ECN marks is available. The choice of variant reflects assumptions about the network environment, feedback frequency, and the tradeoff between responsiveness and stability.

Tahoe and Reno [1990]: TCP Tahoe introduces slow-start (exponential increase early) and additive-increase congestion avoidance (1 MSS per RTT late). Loss triggers reset to cwnd = 1 (slow-start threshold reset). Reno refines this with fast recovery: upon 3 duplicate ACKs (loss without timeout), reduce cwnd by half but enter congestion avoidance immediately, not slow-start. This allows the sender to continue sending (at half rate) while waiting for lost segments to retransmit—better than idling.

Slow-start mechanics: cwnd starts at one MSS and doubles every RTT until ssthresh is reached. Doubling means cwnd = 2 × cwnd each RTT: cwnd = 1, 2, 4, 8, 16, 32… On loss, ssthresh is set to cwnd/2, and cwnd resets to 1. Then congestion avoidance begins: cwnd increases by 1 MSS per RTT (linear instead of exponential).

Example: Starting cwnd = 1 MSS, RTT = 100 ms, target = 50 MSS. Slow-start: 1 → 2 → 4 → 8 → 16 → 32 (5 RTTs = 500 ms). Congestion avoidance: 32 → 33 → 34 → … → 50 (18 RTTs = 1800 ms). Total time to reach 50 MSS: 2.3 seconds. With MSS = 1500 bytes, a cwnd of 50 MSS at 100 ms RTT sustains only 50 × 1500 bytes per RTT ≈ 6 Mbps, so even on a 10 Gbps link this modest rate takes 2.3 seconds of warm-up.
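The warm-up arithmetic generalizes to any target. A small sketch (the function name and parameters are illustrative, not from any stack):

```python
# RTTs to grow from cwnd = 1 MSS to a target: double while below ssthresh
# (slow start), then add 1 MSS per RTT (congestion avoidance).
def time_to_target(target_mss, ssthresh_mss, rtt_s):
    cwnd, rtts = 1, 0
    while cwnd < target_mss:
        cwnd = cwnd * 2 if cwnd < ssthresh_mss else cwnd + 1
        rtts += 1
    return rtts, rtts * rtt_s

rtts, secs = time_to_target(50, 32, 0.100)
print(f"{rtts} RTTs ≈ {secs:.1f} s")   # 23 RTTs ≈ 2.3 s
```

The split matches the worked example: 5 doubling RTTs to reach ssthresh = 32, then 18 additive RTTs to reach 50.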

CUBIC [Ha et al., 2008]: Designed for high-bandwidth, high-latency (high-BDP) networks. As fiber deployment increased, Reno’s slow-start phase became a bottleneck. CUBIC replaces the linear increase of congestion avoidance with a cubic function:

cwnd(t) = C * (t - K)^3 + W_max

where t is time in seconds since last loss, W_max is the cwnd at the time of loss (before halving), K is the time to reach W_max under cubic growth, and C is a scaling constant. After loss, cwnd decreases to 70% of pre-loss value (less aggressive than Reno’s 50%). The cubic function allows cwnd to probe aggressively when RTTs are large, accelerating the approach to fair-share capacity.

Example: W_max = 200 MSS (300 KB in flight with 1500-byte segments). After loss, cwnd = 0.7 × 200 = 140 MSS. The cubic function is tuned so that it returns to W_max on a timescale comparable to linear AIMD (for backward compatibility). But because the curve is steep just after the loss and flattens as it approaches 200, CUBIC regains most of the lost window quickly, catching up to capacity sooner.

The trade-off: CUBIC is more aggressive than Reno, which can cause problems when competing with Reno flows on low-capacity links (fairness oscillations). But on high-BDP links, CUBIC converges much faster. Modern datacenter links (100 Gbps, 1 ms RTT, BDP = 12.5 MB) benefit significantly from CUBIC’s aggressive probing.
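The cubic curve itself is one line of code. A sketch using the constants from Ha et al. [2008] (C = 0.4, post-loss multiplier 0.7); the helper name `cubic_cwnd` is invented:

```python
# Sketch of the CUBIC window curve: cwnd(t) = C*(t - K)^3 + W_max,
# where K is the time at which the curve returns to W_max.
def cubic_cwnd(t, w_max, c=0.4, beta=0.7):
    k = ((w_max * (1 - beta)) / c) ** (1 / 3)   # time to regain W_max
    return c * (t - k) ** 3 + w_max

w_max = 200  # MSS at the last loss
print(f"{cubic_cwnd(0, w_max):.0f}")  # 140: restarts at 70% of W_max
```

At t = 0 the curve starts at beta × W_max; it then rises steeply, plateaus around W_max (cautious probing near the last known loss point), and accelerates again beyond it.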

Tahoe, Reno, and CUBIC all share a fundamental assumption: loss is the signal for congestion. When cwnd grows and loss occurs, reduce cwnd. When ACKs arrive without loss, increase cwnd. The belief system is implicit in the control law: cwnd as a proxy for safe capacity.

The refinement is gradual, each generation addressing a specific failure of its predecessor:

  • Reno fixes Tahoe’s slow slow-start recovery.
  • CUBIC fixes Reno’s slow growth in high-BDP networks.
  • Westwood [Casetti et al., 2001] estimates available bandwidth from ACK rate. The idea: instead of halving cwnd on loss, estimate what bandwidth was available just before loss, then set cwnd to match that bandwidth. More graceful than a hard halving.
  • Vegas [Brakmo et al., 1994] measures RTT and backs off before loss. If RTT rises, it assumes congestion and reduces cwnd proactively (delay-based, not loss-based). But Vegas is rarely deployed because it loses to aggressive loss-based flows.

All share an implicit contract: senders using different algorithms must coexist. A sender cannot be so aggressive that it starves others. This backward-compatibility constraint limits how far any variant can deviate from AIMD. BBR broke this constraint (or changed the covenant), with consequences.

Each variant represents a different answer to the closed-loop problem. Reno assumes loss is rare, with loss probability falling as the inverse square of cwnd. CUBIC retains this assumption but accelerates recovery with a cubic curve, making it backward-compatible while scaling better. BBR abandons the loss assumption entirely: it measures bandwidth and propagation directly, then probes explicitly to find the boundary. DCTCP uses frequent ECN marks (not loss) to make fine-grained rate adjustments. The progression shows that when constraints shift, the invariant under greatest pressure (State) restructures first, cascading through the dependency graph: new state models require different closed-loop dynamics, creating new failure modes, driving new meta-constraints. Wide-area networks still use Reno/CUBIC; datacenters deploy DCTCP/HPCC; user-space protocols (QUIC) use BBR.


5.5 Model-Based Transport: BBR as State Redesign

BBR (Bottleneck Bandwidth and RTT) [Cardwell et al., 2017] does not replace cwnd directly. Instead, it replaces the belief system entirely. BBR maintains an explicit model of the network:

  • BtlBw: bottleneck bandwidth (maximum throughput the link can deliver).
  • RTprop: round-trip propagation delay (minimum RTT observed, excluding queueing).
  • Target: send at rate = BtlBw, keeping in-flight bytes ≈ BtlBw * RTprop (the bandwidth-delay product).

BBR decouples measurement of bandwidth from interpretation of loss. It estimates BtlBw directly: the maximum rate of packet delivery observed over the last 10 RTTs. It estimates RTprop as the minimum RTT observed over the last 10 seconds. Neither estimate requires loss.
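The two estimates are just windowed filters over local measurements. A minimal sketch (class and method names are invented; the window sizes follow the text):

```python
from collections import deque

# Sketch of BBR's two estimators: a windowed max of per-round delivery-rate
# samples (BtlBw) and a windowed min of RTT samples over recent seconds
# (RTprop). Names follow the BBR paper; the interface is illustrative.
class BbrModel:
    def __init__(self, bw_window_rounds=10, rtt_window_s=10.0):
        self.bw_samples = deque(maxlen=bw_window_rounds)  # rates, per round
        self.rtt_samples = deque()                        # (time, rtt) pairs
        self.rtt_window_s = rtt_window_s

    def on_round(self, delivery_rate):
        self.bw_samples.append(delivery_rate)

    def on_rtt(self, now, rtt):
        self.rtt_samples.append((now, rtt))
        while self.rtt_samples[0][0] < now - self.rtt_window_s:
            self.rtt_samples.popleft()   # expire samples outside the window

    @property
    def btlbw(self):
        return max(self.bw_samples)      # max filter: bottleneck bandwidth

    @property
    def rtprop(self):
        return min(r for _, r in self.rtt_samples)  # min filter: propagation

    @property
    def bdp(self):
        return self.btlbw * self.rtprop  # target in-flight bytes
```

The max filter rides over transient dips in delivery rate; the min filter discards RTT samples inflated by queueing, so long as at least one packet per window crosses the path with an empty queue.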

The key insight: loss is not congestion, it is probing. BBR increases sending rate periodically (the “PROBE_BW” phase), even without ACK-driven additive increase. If the probing causes loss, BBR treats it as information about the network model, not as a signal to back off. It updates estimates and continues probing. Only when the model suggests the network is saturated does BBR reduce rate.

BBR’s state machine: BBR operates in four phases:

  1. STARTUP: Exponential increase (similar to TCP slow-start) until BtlBw estimate stops growing for three RTTs. Goal: find the bottleneck bandwidth quickly.

  2. DRAIN: Reduce in-flight packets from their peak back to BtlBw × RTprop. Goal: drain queues built during startup.

  3. PROBE_BW: Cyclical rate probing. BBR cycles through eight pacing-gain phases of roughly one RTprop each: one phase at 1.25× the estimated BtlBw to probe for more bandwidth, one at 0.75× to drain any queue the probe created, then six at 1.0×. If the probe raises the delivery rate, the BtlBw estimate updates; otherwise the drain phase restores in-flight data to the BDP target.

  4. PROBE_RTT: Every 10 seconds, reduce in-flight packets to 4 × MSS for 200 ms. Goal: force a round-trip with minimal queueing, allowing RTprop estimate to refresh. This is critical: without RTprop probing, the estimate becomes stale as network conditions change.

This restructures the State invariant. Before (Reno, CUBIC): belief is cwnd, interpreted as “safe window.” After (BBR): belief is (BtlBw, RTprop), interpreted as “network model.”

But this redesign has consequences:

Coexistence Problem: When BBR and CUBIC share a bottleneck, they interpret loss differently. BBR probes → causes loss → CUBIC halves cwnd → recovers slowly → BBR dominates. The unfairness is profound. BBR can starve CUBIC flows, taking 95% of capacity while CUBIC takes 5%. Subsequent revisions (BBRv2) add fairness constraints: BBR monitors the loss rate of other flows and backs off if loss is too high. But the fundamental tension remains: a protocol that probes past loss will overwhelm a protocol that fears loss.

Model Assumptions: BBR assumes the bottleneck bandwidth is constant and the RTprop estimate reflects true propagation (not queueing). In real networks, both assumptions are wrong. Multi-tenant data centers have variable bottleneck rates (neighbors’ traffic changes). Queueing can hide in the RTprop estimate if no packet ever travels the path without queuing. BBRv2 handles some of these cases, but perfect robustness is impossible.

Measurement Noise: ACK-based delivery rate is noisier than explicit link-layer rate feedback would be. Bursts in competing traffic, link-layer retransmissions (WiFi), and receiver-side buffering all add noise to the delivery rate signal. BBR filters using medians and exponential smoothing, but some noise is unavoidable. A sudden 1000x increase in cross-traffic on a shared link will take multiple RTTs for BBR’s BtlBw estimate to converge.

Despite these challenges, BBR is deployed at scale in Google infrastructure and user-space implementations (QUIC). It achieves lower latency than Reno or CUBIC on high-BDP paths, because it does not wait for loss to react.


5.6 RTT Estimation and Its Consequences

Jacobson’s algorithm (Section 5.2.2) is elegant but sensitive to environment changes. The exponential weighting with α = 1/8 is a compromise: fast enough to track genuine changes, but slow enough to ignore transient spikes.

Concrete example: a datacenter RTT spike. A datacenter path has typical RTT = 0.5 ms (fiber cross-building latency). A routing change forces traffic through a remote switch, causing RTT to spike to 5 ms. With α = 1/8:

  • Sample 1 (5 ms): SRTT = 7/8 × 0.5 + 1/8 × 5 = 0.4375 + 0.625 = 1.0625 ms
  • Sample 2 (5 ms): SRTT = 7/8 × 1.0625 + 1/8 × 5 = 0.9297 + 0.625 = 1.5547 ms
  • Sample 3 (5 ms): SRTT = 7/8 × 1.5547 + 1/8 × 5 = 1.3604 + 0.625 = 1.9854 ms
  • Sample 8 (5 ms): SRTT ≈ 3.5 ms

It takes about 5 samples for SRTT to get halfway to the true value, and even after 8 it remains a third of the gap short. Each sample now takes roughly 5 ms to collect (one round trip at the new RTT), so the adaptation costs about 40 ms of wall-clock time, during which the timeout is computed from a stale estimate and can misjudge the actual path condition.

Noisy RTT environments (WiFi, cellular) make this worse. If RTT samples vary between 10 ms and 50 ms, RTTVAR grows large, and RTO becomes very conservative (up to 100 ms or more). This delays loss detection.

RTprop estimation in BBR: BBR tracks the minimum RTT over the last 10 seconds. This is robust to queuing in the forward direction: the minimum RTT sample includes at least one path with minimal queueing. But the estimate is stale—if the path’s true propagation delay changes (routing change), BBR doesn’t know for 10 seconds. Most datacenter routes are static, so this is acceptable. But in wide-area networks with dynamic routing, the RTprop estimate can be systematically wrong.


5.7 Explicit Congestion Notification (ECN) and DCTCP

TCP’s loss-based feedback is coarse: loss is all-or-nothing. Either a packet arrives or it doesn’t. But routers know about congestion before they drop packets—they can observe queue buildup or buffer occupancy rising.

Explicit Congestion Notification (RFC 3168) [Ramakrishnan et al., 2001] allows routers to mark packets instead of dropping them. The sender sees the mark in the ECN field of the IP header, interprets it as a congestion signal, and reduces cwnd. The advantage: no packet loss, so no retransmission overhead.

DCTCP (Datacenter TCP) [Alizadeh et al., 2010] leverages ECN more aggressively than standard TCP. Instead of halving cwnd on loss, DCTCP reduces cwnd proportionally to the fraction of packets marked:

cwnd_new = cwnd * (1 - α/2)

where α is the fraction of packets marked in a window. If 10% of packets are marked, cwnd is reduced by 5%. If 90% are marked, cwnd is reduced by 45%.

The insight: ECN marks provide a continuous signal about congestion level, not a binary signal (loss = yes/no). DCTCP uses this richer signal to maintain higher utilization while keeping queueing low.
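The DCTCP update can be sketched directly. One refinement the formula above elides: real DCTCP smooths the marked fraction with gain g = 1/16 (Alizadeh et al. [2010]); setting g = 1 reproduces the text’s simpler arithmetic. Class and method names here are illustrative:

```python
# Sketch of the DCTCP control law: alpha is an EWMA of the fraction of
# packets marked per window; on marks, cwnd shrinks by alpha/2.
class Dctcp:
    def __init__(self, cwnd, g=1 / 16):
        self.cwnd = cwnd
        self.alpha = 0.0
        self.g = g          # gain for smoothing the marked fraction

    def on_window(self, acked, marked):
        f = marked / acked                       # marked fraction this window
        self.alpha = (1 - self.g) * self.alpha + self.g * f
        if marked:
            self.cwnd *= 1 - self.alpha / 2      # proportional decrease
        else:
            self.cwnd += 1                       # standard additive increase

flow = Dctcp(cwnd=100, g=1.0)          # g = 1 matches the text's arithmetic
flow.on_window(acked=100, marked=10)   # 10% marked -> cut cwnd by 5%
print(f"{flow.cwnd:.1f}")              # 95.0
```

With the default g = 1/16, a single heavily marked window moves alpha only slightly, which is what keeps DCTCP’s response smooth rather than jumpy.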

Concrete example: datacenter with DCTCP. Target queue depth: 20 packets. Link capacity: 10 Gbps. Packet size: 1500 bytes. Per-packet transmission time: 1500 bytes × 8 / 10 Gbps = 1.2 μs. Queue sojourn time at 20 packets: 20 × 1.2 = 24 μs.

A switch marks packets when queue depth exceeds 5 packets (6 μs sojourn time). In steady state, the queue oscillates around this target. DCTCP senders receive ECN marks and reduce cwnd proportionally. The result: average queue depth stays low (high throughput is achieved without persistent queueing), and RTT remains stable near its propagation value (1 ms cross-datacenter, 0.5 ms intra-rack).

By contrast, a loss-based sender such as CUBIC would let the queue grow until packets are dropped (potentially gigabytes of buffer), causing RTT to spike to 100+ ms and creating jitter.


5.8 The Environment-Measurement-Belief Gap in TCP

TCP’s fundamental diagnostic question: where do failures live? The answer is always in the gap between environment state (true network condition) and internal belief (TCP’s model).

Bufferbloat is a paradigm case. The environment is a high-capacity, low-loss link with large queues. The measurement signal is ACK arrivals, which arrive on time because the queue is just draining into the link. The belief is “the network is healthy, let me increase cwnd.” In truth, latency is high and variance is extreme. The measurement signal—ACK timing—reflects only the link’s behavior, not the queue’s fullness. TCP needs an earlier signal to close the loop.

Concrete example: home broadband bufferbloat. A residential cable link has 100 Mbps capacity, and the CMTS (cable modem termination system) buffers up to 10 MB of data (a common configuration). With a 10 MB buffer and 100 Mbps capacity, a single saturated flow can hide congestion for 10 MB / (100 Mbps ÷ 8) = 0.8 seconds = 800 ms. During this time, the sender sees ACKs arriving (the buffer is draining into the link), so it believes the network is healthy and keeps increasing cwnd. By the time the buffer actually fills and loss occurs, cwnd is grossly oversized, and the data that caused the overflow now sits in the buffer, creating 800 ms of queueing latency. An interactive application (video call, web browsing) that assumes RTT < 100 ms now experiences 800 ms latency: a catastrophic failure of responsiveness.
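The arithmetic generalizes: a full buffer adds buffer_size / link_rate of queueing latency. A few lines reproduce the 800 ms figure (function name illustrative):

```python
# Queueing latency added by a full buffer: delay = buffer_size / link_rate.
# Numbers match the bufferbloat example above (10 MB buffer, 100 Mbps link).

def queue_delay_s(buffer_bytes: float, link_bps: float) -> float:
    """Time to drain a full buffer into the link, in seconds."""
    return (buffer_bytes * 8) / link_bps

delay = queue_delay_s(10e6, 100e6)   # 10 MB buffer, 100 Mbps link
print(f"{delay * 1000:.0f} ms")      # 800 ms
```

The same formula explains why buffer sizing matters more than raw capacity for latency: doubling the buffer doubles the worst-case queueing delay without adding any throughput.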

Solutions: Active Queue Management (CoDel, PIE, fq_codel) provides an earlier signal. The router marks or drops packets based on sojourn time (time spent in the queue), not queue length. The marking or dropping happens when packets have lingered for a few milliseconds (CoDel’s default target is 5 ms), not when the queue is full. TCP sees loss or ECN marks earlier, before latency has exploded. The belief now tracks the environment more tightly.
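The sojourn-time idea can be sketched in a few lines (a heavily simplified illustration, not the full CoDel algorithm: real CoDel also shortens its interval by a square-root law on successive drops; all names and constants here are illustrative):

```python
# Simplified sojourn-time AQM in the spirit of CoDel.

TARGET_S = 0.005     # 5 ms sojourn-time target (CoDel's default)
INTERVAL_S = 0.100   # sojourn must exceed target for this long before dropping

class SojournAQM:
    def __init__(self):
        self.first_above = None  # time when sojourn first exceeded target

    def should_drop(self, sojourn_s: float, now_s: float) -> bool:
        if sojourn_s < TARGET_S:
            self.first_above = None   # queue recovered; reset
            return False
        if self.first_above is None:
            self.first_above = now_s  # start the grace interval
        # Drop only after the queue has stayed above target for a full interval.
        return now_s - self.first_above >= INTERVAL_S

aqm = SojournAQM()
print(aqm.should_drop(0.001, 0.00))  # below target: False
print(aqm.should_drop(0.008, 0.05))  # above target, but not for long: False
print(aqm.should_drop(0.008, 0.20))  # above target for >100 ms: True
```

The key design choice is visible in the code: the trigger is time spent in the queue, so the same policy works unchanged whether the link is 10 Mbps or 10 Gbps, which is exactly what queue-length thresholds fail to do.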

Wireless Loss decouples loss from congestion. A WiFi link loses packets to corruption (interference, fading) at rates of 5-10% even when the link is not congested. TCP interprets loss as congestion, halves cwnd, and throughput plummets. The belief (network is congested) is wrong; the environment (link is lossy but not saturated) tells a different story. Solutions: link-layer retransmission (WiFi does this: it retries lost frames automatically) hides corruption loss from TCP, so TCP only sees congestion loss. Or ECN-based signaling, where routers mark packets (rather than dropping them) when congestion occurs, so congestion can be signaled without loss and corruption loss alone need not trigger TCP’s full congestion response.

Datacenter RTT Collapse: In a datacenter, RTT can be 1 ms. Suddenly, a routing change (or a neighbor’s burst) causes RTT to spike to 10 ms. SRTT adapts slowly (exponential smoothing takes many RTTs to track the change), so the timeout, computed from the stale SRTT, is briefly too short for the new RTT, triggering spurious retransmissions. More importantly, the time model (RTT-proportional reaction time) becomes a limit: a 10x RTT change means a 10x slower reaction time. BBR’s RTprop estimate also lags. Explicit congestion notification (ECN, DCTCP) allows a datacenter switch to signal congestion without loss, permitting tighter coupling between measurement and belief.
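The lag is easy to see with RFC 6298’s SRTT smoothing (gain 1/8); the step from 1 ms to 10 ms below is the scenario above, and the function name is illustrative:

```python
# RFC 6298-style RTT smoothing: srtt <- (1 - a) * srtt + a * sample, a = 1/8.
# Shows how slowly SRTT tracks a step from 1 ms to 10 ms.

ALPHA = 1.0 / 8  # RFC 6298 smoothing gain for SRTT

def smooth(srtt_ms: float, sample_ms: float) -> float:
    return (1 - ALPHA) * srtt_ms + ALPHA * sample_ms

srtt = 1.0  # converged at the old 1 ms RTT
for rtt_count in range(1, 21):
    srtt = smooth(srtt, 10.0)  # every new sample is 10 ms
    if rtt_count in (1, 5, 10, 20):
        # creeps toward 10 ms: roughly 2.1, 5.4, 7.6, 9.4 ms
        print(f"after {rtt_count:2d} RTTs: srtt = {srtt:.2f} ms")
```

After 20 samples, SRTT is still below the true RTT; a timeout computed from it during the transient fires early, which is the spurious-retransmission failure mode described above.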

Competing Traffic Asynchrony: In wide-area paths, the bottleneck is rarely a single shared link. More often, it is a cascade of competing flows with different start times. A flow that joins a congested path initially sees high loss. After backing off, it sees low loss and increases. But in the meantime, other flows have come and gone. The equilibrium is a constantly-shifting target, not a fixed point. The measurement signal (loss) is noisy, and the belief (equilibrium capacity) is unstable. BBR’s continuous probing is better suited to this environment than AIMD, because it does not wait for convergence—it adapts continuously.

The general principle: The tighter the coupling between measurement and environment, the better the closed-loop performance. ECN is tighter than loss-based signaling. Loss is tighter than timeout. RTprop + BtlBw is tighter than cwnd. Explicit dataplane feedback (DCTCP signals) is tighter than application-layer RTT. The cost of tighter coupling is higher per-packet overhead (bandwidth consumed by signaling) and reduced deployability (new network equipment required). The design trade-off is eternal: tighter coupling enables higher utilization and lower latency, at the cost of complexity and deployment friction.


5.9 Bandwidth-Delay Product (BDP) and Capacity Planning

The bandwidth-delay product is the fundamental bottleneck-sizing metric:

BDP = Bottleneck_bandwidth × Round_trip_propagation_delay

BDP is the amount of data that can be “in flight” on a path while the sender waits for the first ACK; equivalently, it is the minimum cwnd needed to saturate the link. For high-BDP paths (long-haul fiber, satellite), saturating the pipe requires a large window, making TCP slow-start a bottleneck. Figure 5.3 visualizes the relationship across different scenarios.

Bandwidth-Delay Product: Pipe Model and In-Flight Data

The bandwidth-delay product (BDP) is the fundamental bottleneck-sizing metric: BDP = bottleneck bandwidth × round-trip propagation delay. It represents the amount of data (in bits or packets) that can be in flight on a path while the sender waits for the first ACK, equivalently the minimum congestion window needed to saturate the link. This metric becomes critical on high-BDP paths: a 100 Gbps link with 100 milliseconds round-trip time has a BDP of 1.25 gigabytes, requiring a congestion window of roughly 800,000 packets (at 1500 bytes per packet). For comparison, LAN paths (10 Gbps, 0.1 ms RTT) have a BDP of only about 83 packets, while satellite links (10 Mbps, 600 ms RTT) need only about 500 packets, each of which takes 600 ms to be acknowledged.

Figure 5.3: TCP’s slow-start phase becomes a fundamental bottleneck on high-BDP paths. Slow-start doubles the congestion window every round-trip time: starting from 1 packet, reaching 1000 packets takes log₂(1000) ≈ 10 RTTs, or 1 second on a 100 ms path. On a 50 millisecond coast-to-coast WAN path with 10 Gbps bandwidth, slow-start to saturation (cwnd ≈ 42,000 packets) takes about 16 RTTs = 800 milliseconds, during which the sender transmits at an average rate far below link capacity. This convergence delay is economically wasteful on premium, high-capacity links where saturation should occur in milliseconds, not seconds. Modern congestion control variants (CUBIC, BBR) address this by accelerating convergence for high-BDP paths, but the fundamental constraint remains: the window must scale with the BDP to saturate the link.

Concrete examples:

Scenario            Bandwidth  RTT     BDP                  Min cwnd (1500 B)  Notes
LAN                 10 Gbps    0.1 ms  1 Mbit = 125 KB      ~83 pkts           Modern datacenter intra-rack
Metro               10 Gbps    1 ms    10 Mbit = 1.25 MB    ~830 pkts          Cross-city datacenter
WAN coast-to-coast  10 Gbps    50 ms   500 Mbit = 62.5 MB   ~42,000 pkts       Slow-start takes many RTTs
Satellite           10 Mbps    600 ms  6 Mbit = 750 KB      ~500 pkts          600 ms per round trip dominates
Mobile 4G           50 Mbps    100 ms  5 Mbit = 625 KB      ~417 pkts          RTT includes cellular delay

For high-BDP paths, slow-start is the dominant cost. Reaching cwnd = 42,000 takes log₂(42,000) ≈ 16 RTTs = 800 ms just for slow-start. For satellite (RTT = 600 ms), even the modest target of 500 packets takes log₂(500) ≈ 9 RTTs ≈ 5.4 seconds, and after any loss, additive increase recovers at 1 packet per RTT, i.e., one packet every 600 ms. This is why satellite links do not use standard TCP.
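These figures can be reproduced with a short script (a sketch: the 1500-byte packet size and the ceil(log₂) count of doubling rounds follow the examples above; function names are illustrative):

```python
import math

# BDP and slow-start cost for the scenarios in the table above (1500 B packets).

def min_cwnd_pkts(bandwidth_bps: float, rtt_s: float, pkt_bytes: int = 1500) -> float:
    """BDP expressed in packets: the minimum cwnd that saturates the link."""
    return (bandwidth_bps * rtt_s) / (pkt_bytes * 8)

def slowstart_time_s(bandwidth_bps: float, rtt_s: float) -> float:
    """RTTs to double cwnd from 1 packet up to the BDP, times the RTT."""
    return math.ceil(math.log2(min_cwnd_pkts(bandwidth_bps, rtt_s))) * rtt_s

# WAN coast-to-coast: 10 Gbps, 50 ms RTT
print(round(min_cwnd_pkts(10e9, 0.050)))           # ~41667 packets
print(round(slowstart_time_s(10e9, 0.050), 3))     # 0.8 s (16 RTTs)
# Satellite: 10 Mbps, 600 ms RTT
print(round(min_cwnd_pkts(10e6, 0.600)))           # ~500 packets
print(round(slowstart_time_s(10e6, 0.600), 3))     # 5.4 s (9 RTTs)
```

Note what the two cases show: the WAN path is slow to saturate because its window is huge, while the satellite path is slow even for a small window because every doubling costs a 600 ms round trip.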

BDP is not just an academic curiosity: it directly impacts convergence speed and network utilization. High-BDP paths are the reason CUBIC exists: its cubic growth curve accelerates window growth past standard TCP’s additive increase on high-BDP networks. BBR avoids the problem entirely by measuring bandwidth directly and sending at the measured rate without waiting for loss-driven convergence. On modern high-bandwidth paths, BDP is large enough that TCP’s behavior shifts: for all but the largest transfers, slow-start dominates transfer time, not congestion avoidance. Capacity planning must account for BDP: a 10 Gbps coast-to-coast path with 50 ms RTT has BDP = 500 Mbit = 62.5 MB. A switch buffer of 10 GB is 160× the BDP, enabling queuing to hide congestion signals for hundreds of RTTs.


5.10 Dependency Graph: The Closed-Loop Restructuring

TCP’s redesign history is a cascade of closed-loop pressure restructuring the state invariant:

Anchor: IP interface (unreliable datagrams, decentralized)
    |
    v
Forced choices:
    - State: inferred from local signals only
    - Time: inferred RTT (no shared clock)
    - Coordination: distributed AIMD (no central arbiter)
    - Interface: byte stream above, datagrams below
    |
    v
Early implementation (Tahoe/Reno):
    - State belief: cwnd (crude, loss-based)
    - Closed-loop: measure loss → back off or increase
    - Dynamics: slow convergence on high-BDP, RTT-biased fairness
    |
    v
Environmental pressure: Bandwidth scaling (10 Mbps → 100+ Gbps)
    - Slow-start phase takes too long
    - CUBIC refinement: cubic growth curve (faster probe of capacity)
    |
    v
Environmental pressure: Datacenter operations
    - RTT collapse (1 ms), large buffers (gigabytes)
    - Bufferbloat: large RTT variance
    - DCTCP refinement: ECN-based signaling (earlier feedback)
    |
    v
Environmental pressure: Model-based operation
    - Probing past loss (not fearing loss) works better in controlled environments
    - BBR redesign: replace cwnd with (BtlBw, RTprop) model
    |
    v
Result: Different algorithms for different environments
    - Wide-area: CUBIC (good fairness, slower probing)
    - Datacenter: DCTCP (fast, low latency, ECN-dependent)
    - User-space (QUIC): BBR (better latency, less fair to non-BBR flows)
    - Future: Machine learning congestion control (learning the model from environment)

The pattern: when environmental constraints shift (higher bandwidth, different RTT, different loss characteristics), the invariant under greatest pressure (State, the belief model) restructures first. The restructuring cascades through the dependency graph: new state representations require different loop dynamics, which create new failure modes, which force new meta-constraints.


5.11 TCP as the Canonical Distributed System

TCP is the richest application of all four invariants and all three design principles in the book:

Invariants: TCP answers all four, with a distributed, inferred, and decoupled approach:

  • State: Belief (cwnd, or BBR’s model) built from measurement (ACKs) of environment (network capacity).
  • Time: Inferred (RTT estimation) rather than prescribed (no shared clock).
  • Coordination: Fully distributed—no central scheduler, no explicit messages, only implicit convergence.
  • Interface: Clean abstraction (byte stream) hiding implementation complexity (datagrams, retransmission).

Principles: TCP demonstrates all three:

  • Disaggregation: Separation of per-flow state (each sender tracks its own cwnd) from network state (routers are stateless). Separation of transport policy (how to adjust rate) from forwarding mechanism (IP routing). Modern refinements disaggregate delivery-rate measurement from loss interpretation (BBR), allowing different policies to coexist.
  • Closed-loop reasoning: The entire algorithm is a feedback loop—measure → believe → decide → send → repeat. Loop stability is the designing constraint. Every TCP variant inherits this structure; disagreement is only about what to measure and how to believe the measurement.
  • Decision placement: Endpoints make all decisions locally, using only local information. No central authority is needed. This is why TCP scaled to the global Internet. Even DCTCP and ECN-based schemes preserve endpoint autonomy—switches provide signals, but endpoints interpret them.

Evolution: TCP’s history is a laboratory for understanding how systems redesign under shifting constraints. Each variant (Reno, CUBIC, BBR, DCTCP) keeps the anchor (IP interface, endpoint-driven) and coordination (distributed or variants) but restructures State (belief model) and Time (measurement) to match new environments. The evolution will continue: as links scale to 400 Gbps, as RTT becomes more variable, as machine learning enters the stack, new state representations will emerge. The framework predicts this: when environment changes, the invariant under greatest pressure restructures first.


5.12 Generative Exercises

Exercise 1: TCP for Mars

Mars rovers must communicate with Earth at 20 minute one-way latency. RTT ≈ 2400 seconds.

Given TCP’s architecture, what happens to the closed-loop dynamics?

  • Slow-start from cwnd = 1: takes log₂(final_cwnd) × 2400 seconds to reach equilibrium. For cwnd = 100, this is 7 × 2400 ≈ 4.7 hours.
  • AIMD additive increase: 1 packet per RTT. Each RTT is 2400 seconds.
  • Loss recovery: timeout is RTT + variance ≈ 3600 seconds. On loss, the sender waits 1 hour before retransmitting.
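The slow-start arithmetic above generalizes to any RTT (a small sketch; the function name is illustrative):

```python
import math

# How long slow-start takes as a function of RTT and target cwnd.
# Earth-Mars RTT ~2400 s, matching Exercise 1.

def slowstart_s(rtt_s: float, target_cwnd_pkts: int) -> float:
    """Doubling from cwnd = 1 to the target takes ceil(log2(target)) RTTs."""
    return math.ceil(math.log2(target_cwnd_pkts)) * rtt_s

hours = slowstart_s(2400, 100) / 3600
print(f"{hours:.1f} hours")  # 4.7 hours just to reach cwnd = 100
```

The same function applied to a terrestrial 100 ms RTT gives 0.7 seconds for the same target, which is why the Mars case forces a redesign of the Time invariant rather than a parameter tweak.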

The closed-loop is too slow. Redesign:

  1. Can you keep the IP interface (datagrams, decentralized) but change the Time model? Propose a preloading mechanism: before losing connectivity, precompute a sequence of cwnd values based on predicted path capacity (from ground-truth Earth-Mars models). Upload this sequence to the rover. The rover then increases cwnd on a wall-clock schedule, not RTT schedule.

  2. How does this change the State invariant? The belief (cwnd) is no longer inferred from ACKs; it is prescribed. What are the failure modes if the precomputed schedule is wrong?

  3. Alternatively: propose a delay-tolerant transport protocol (not TCP). How would you rethink Coordination, given that every exchange of ACKs takes 2400 seconds? What if ACKs are not needed for reliability—what if the rover could rely on application-level redundancy (forward error correction)?

Exercise 2: TCP in a Datacenter (CUBIC vs. DCTCP vs. BBR)

A datacenter network has:

  • Propagation RTT: 1 ms (speed of light over fiber).
  • Bottleneck bandwidth: 10 Gbps.
  • Queuing delay when congested: 100 microseconds.
  • ECN support: all switches can mark packets when queues are above 5 μs depth.

Traditional CUBIC TCP (at 1 ms RTT, CUBIC operates in its Reno-like TCP-friendly mode):

  • Additive increase: 1 MSS per RTT = 1 MSS per 1 ms = 1000 MSS per second. For MSS = 1500 bytes, this is a 1500 KB/s increase rate.
  • Multiplicative decrease: halve cwnd on loss.
  • Startup: slow-start from cwnd = 1 takes ~10 RTTs to reach 1000 MSS ≈ 10 ms.

DCTCP (ECN-enabled):

  • Instead of waiting for loss, respond to ECN marks.
  • Decrease cwnd by much less: (α/2) × cwnd, where α is the fraction of packets marked (e.g., α = 0.5 gives a 25% reduction).
  • Never stop sending while ECN marks are low (no full loss recovery).

Compare:

  1. How does the closed-loop latency differ? CUBIC’s minimum RTT for loss detection is 1 ms (timeout > 1 ms always). DCTCP’s is 100 microseconds (one queue sojourn time). What is the practical impact on application response time?

  2. What is the fairness difference? CUBIC has RTT-biased fairness (low-RTT flows win). Does DCTCP fix this? Why or why not?

  3. Design a scenario where a datacenter operator tries to run CUBIC and DCTCP on the same switches with ECN disabled for non-DCTCP traffic. What goes wrong?

  4. Now add BBR to the mix (same datacenter, ECN enabled). BBR probes past loss. DCTCP backs off on ECN mark. What is the interaction? Who dominates?

Exercise 3: Redesigning TCP for High-Loss Wireless

A WiFi network has 5% packet loss due to interference (not congestion). Standard TCP halves cwnd on loss, leading to throughput collapse.

  1. Propose a measurement signal that can distinguish congestion loss from corruption loss. Hint: think about the pattern of losses (bursty vs. isolated), or the timing of losses (clustered in time or spread out).

  2. How would you modify the State invariant (belief model) to handle this? Should TCP reduce cwnd differently for corruption loss?

  3. DCTCP uses ECN marks as a signal. How could ECN help distinguish loss types?

  4. What invariant changes would a delay-based congestion control algorithm (like Vegas) require? Would Vegas work better on WiFi than CUBIC?


5.13 References

  • Allman, M., Paxson, V., and Blanton, E. (2009). “TCP Congestion Control.” RFC 5681.
  • Alizadeh, M., Greenberg, A., Maltz, D.A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M. (2010). “Data Center TCP (DCTCP).” Proc. ACM SIGCOMM.
  • Brakmo, L.S., O’Malley, S.W., and Peterson, L.L. (1994). “TCP Vegas: New Techniques for Congestion Detection and Avoidance.” Proc. ACM SIGCOMM.
  • Cardwell, N., Cheng, Y., Gunn, C.S., Yeganeh, S.H., and Jacobson, V. (2017). “BBR: Congestion-Based Congestion Control.” ACM Queue, 15(5).
  • Clark, D.D. (1988). “The Design Philosophy of the DARPA Internet Protocols.” Proc. ACM SIGCOMM.
  • Ha, S., Rhee, I., and Xu, L. (2008). “CUBIC: A New TCP-Friendly High-Speed TCP Variant.” ACM SIGOPS Operating Systems Review, 42(5).
  • Jacobson, V. (1988). “Congestion Avoidance and Control.” Proc. ACM SIGCOMM.
  • Nichols, K. and Jacobson, V. (2012). “Controlling Queue Delay.” ACM Queue, 10(5).
  • Ramakrishnan, K., Floyd, S., and Black, D. (2001). “The Addition of Explicit Congestion Notification (ECN) to IP and Related Protocols.” RFC 3168.

This chapter is part of “A First-Principles Approach to Networked Systems” by Arpit Gupta, UC Santa Barbara, licensed under CC BY-NC-SA 4.0.