10  System Composition — Transport Meets Queue Management


10.1 The Anchor: Two Closed Loops, One Shared Signal

Transport (Chapter 4) decides how fast to send. Queue management (Chapter 5) decides which packets to drop. Each is a closed loop. Each converges when left alone. But they live at the same bottleneck, reading and writing the same scalar signal. Transport consumes what the queue produces: a loss event, a delay reading, an ECN mark. Each loop is blind to the other’s control law, and each must take the semantics of the shared signal as given.

This is the binding constraint of system composition: two independently designed closed loops share a single bottleneck interface, and the semantics of that interface are negotiated across administrative boundaries that no single party controls.

“The congestion control algorithm at the endpoint and the queue discipline at the router together determine network performance. Neither is sufficient alone. Both are necessary. The interface between them — the signal that crosses from router to endpoint — is the contract that makes the system work, or fail.” — De Schepper & Briscoe, 2023 (De Schepper and Briscoe 2023a)

10.1.1 The Decision Problems

The composed system must continuously answer four questions:

  1. How do transport and AQM coordinate through shared signals? What does a loss mean? What does a mark mean? What does increased delay mean?
  2. When does their interaction oscillate? Under what conditions do the two loops couple destructively — synchronized backoff, persistent queuing, throughput collapse?
  3. How can explicit signaling break the coupling deadlock? When a loss-only signal is too sparse to carry enough information, what richer interface can both sides agree on?
  4. How do ISP policies (shaping, boosting) invisibly break transport assumptions? What happens when an unannounced third control loop sits between transport and the queue it believes it is talking to?

The binding constraint is the Interface invariant: the semantics of the shared signal are fixed by standards, legacy deployment, and middlebox behavior. Every act in this chapter is an attempt to enrich that signal without breaking the composition that already works.


10.2 Act 1: The Coupling Problem — Loss as the Only Signal (1988)

It’s 1988. Van Jacobson has just rescued the ARPANET from congestion collapse with AIMD. Every TCP flow infers congestion from the absence of an ACK. When a packet is dropped, the sender halves its window. When ACKs return, the sender grows its window linearly. The loop closes through the router’s tail-drop behavior: the queue fills, the queue overflows, the dropped packet becomes the signal.

“We use ‘packet loss’ as a congestion signal. The network tells us to slow down by dropping packets. There is no other signal available.” — Jacobson, 1988 (Jacobson 1988)

What the pioneers saw: A simple interface. Routers are dumb; they drop when full. Endpoints are smart; they infer. The signal is binary — either an ACK arrives or it doesn’t. The simplicity is the point: no cross-layer cooperation needed, no new header bits, no middlebox upgrades. TCP works over any IP network that can drop packets.

What remained invisible from the pioneers’ vantage point: Two pathologies that would emerge only at scale. First, global synchronization — when a tail-drop queue overflows, it drops from every flow simultaneously, and every flow halves its window at the same RTT (Floyd and Jacobson 1993). Utilization collapses to ~50%, then recovers together, then collapses again. The loops are not independent; they are phase-locked by the shared drop event. Second, bufferbloat (Gettys and Nichols 2012): if the buffer is large enough, the loss signal arrives too late. Transport sends at its believed rate for many RTTs while the queue silently grows to hundreds of milliseconds. By the time loss arrives, latency has already inflated.

10.2.1 The Solution: AIMD + Tail Drop

Transport applies closed-loop reasoning: sensor = missing ACK, estimator = duplicate-ACK counter, controller = halve cwnd (congestion window), actuator = reduced send rate. Queue management applies decision placement at the distributed extreme: every router decides independently, no coordination. The interface between them is deliberately minimal — one bit per packet, delivered by packet death.
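Both sides of that minimal interface fit in a few lines. This is a toy sketch, assuming idealized per-RTT steps; names like `aimd_step` and the unit-per-RTT increase are illustrative, not taken from any real stack:

```python
def aimd_step(cwnd, loss_detected, add=1.0, mult=0.5):
    """One per-RTT AIMD update: halve on loss, grow linearly otherwise."""
    if loss_detected:
        return max(1.0, cwnd * mult)   # multiplicative decrease, floor of 1 segment
    return cwnd + add                  # additive increase

def tail_drop(queue_len, capacity):
    """The router's entire 1988 'AQM': drop iff the FIFO is full."""
    return queue_len >= capacity
```

The whole contract between the two functions is the boolean that `tail_drop` produces and `aimd_step` consumes: one bit, delivered by packet death.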

10.2.2 Invariant Analysis: Loss-Based Coupling (1988)

Invariant | Answer (1988) | Gap?
State | Loss events = congestion | Sparse at high speed
Time | Per-RTT feedback | Lags buffer growth
Coordination | Distributed, uncoordinated | Global synchronization
Interface | 1 bit (drop or deliver) | Destructive signal

The State gap is the critical one. At 100 Mbps with 50 ms RTT, a standard-MSS TCP flow needs a loss rate of only ~0.001 % to sustain full utilization — transport sees 1 signal per ~10⁵ packets. At 100 Gbps, the signal is effectively silent until the queue overflows catastrophically. Transport’s belief about capacity drifts arbitrarily far from the environment’s actual state before a loss correction arrives.
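That sparsity can be checked by inverting the well-known 1/√p steady-state TCP throughput relation. A rough model only, assuming a 1500-byte MSS; the function name is illustrative:

```python
import math

def required_loss_rate(link_bps, rtt_s, mss_bytes=1500):
    """Invert the 1/sqrt(p) throughput model:
    BW = (MSS / RTT) * sqrt(3/2) / sqrt(p), with BW and MSS in bytes,
    so p = (MSS * sqrt(3/2) / (RTT * BW))**2."""
    bw_Bps = link_bps / 8.0
    return (mss_bytes * math.sqrt(1.5) / (rtt_s * bw_Bps)) ** 2

p_fast = required_loss_rate(100e6, 0.05)   # ~8.6e-6: 1 loss per ~10^5 packets
p_huge = required_loss_rate(100e9, 0.05)   # ~8.6e-12: effectively silent
```

Because p falls with the square of bandwidth, every thousandfold speed increase makes the loss signal a millionfold sparser.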

10.2.3 Environment → Measurement → Belief

Layer | What Transport Has | What’s Missing
Environment | Actual queue depth, link rate | Not observable to endpoint
Measurement | Missing ACKs, RTT samples | No queue-depth signal
Belief | cwnd = estimate of capacity | Lags reality by ~buffer/link

The E→M gap is physically limited, not accidentally noisy: the IP interface exposes only two events (deliver, drop), and the endpoint sees only those. Fixing this requires changing the interface — adding a new event type.

10.2.4 “The Gaps Didn’t Matter… Yet”

In 1988, buffers were small (tens of packets), links were slow (1.5–45 Mbps), and synchronization produced tolerable 50 % utilization. The gap between belief and reality was bounded by small buffers. The next decade would change both.


10.3 Act 2: ECN — Adding the Third Event (2001)

It’s 2001. Backbone links are hitting gigabit speeds. RED (Floyd and Jacobson 1993) has been deployed in some routers to probabilistically drop early and break synchronization, but RED still uses loss as the signal. Sally Floyd, K. K. Ramakrishnan, and David Black propose a richer interface: give the router a way to say “slow down” without killing a packet.

“ECN allows the router to signal congestion by marking rather than dropping packets, preserving the data stream while still conveying the congestion signal.” — RFC 3168 (Ramakrishnan et al. 2001)

What the pioneers saw: The loss signal is destructive — every congestion signal costs a retransmission. At high speed this is wasteful. Two unused bits in the IP header (the TOS/DS byte’s low-order bits) can carry an explicit signal: ECT(0)/ECT(1) (ECN-Capable Transport) says “I understand ECN”; CE (Congestion Experienced) is the router’s mark; the receiver echoes it in the ACK, and the sender responds exactly as to a loss.

What remained invisible: Two deployment failures. First, middlebox bleaching — firewalls, NATs, and old routers would see unfamiliar bits and clear them. The signal disappeared mid-path, without warning. Second, asymmetric fairness — an ECN sender in a shared queue backs off at shallow queue depth (early marks), while a loss-based sender in the same queue continues until the queue overflows. The ECN flow gets less bandwidth because it’s more polite, which creates a disincentive to deploy ECN.

10.3.1 The Solution: ECT/CE in the IP Header

The IP interface applies disaggregation: the ToS byte’s low-order bits now separate “drop on congestion” from “mark on congestion.” The AQM interface remains RED-shaped (mark probability rises with queue depth), but the action is a bit-flip, not a packet kill. Transport’s response is unchanged: one CE mark = treat as one loss. This preserves backward compatibility — any sender that doesn’t know ECN still gets dropped.
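The codepoint logic is small enough to sketch. The constants follow RFC 3168’s assignment of the two low-order ToS/DS-byte bits; the router-side decision function is illustrative (a real AQM adds RED-style probability logic):

```python
# The four codepoints in the two low-order bits of the ToS/DS byte (RFC 3168).
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11

def on_congestion(ecn_bits):
    """When the AQM decides to signal: mark if the transport declared ECN
    capability (or the packet is already marked), otherwise drop."""
    if ecn_bits == NOT_ECT:
        return "drop"       # a non-ECT sender still gets the 1988 signal
    return "mark-CE"        # set the field to 0b11; the packet survives
```

The fallback branch is what preserves backward compatibility: the 1988 interface remains the default for any sender that never opted in.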

10.3.2 Invariant Analysis: ECN (2001)

Invariant | Answer (2001) | Gap?
State | Loss OR mark = congestion | Marks still treated as binary
Time | Per-RTT, more frequent | Same response magnitude
Coordination | Still distributed | Asymmetric fairness in shared queue
Interface | 2 bits (ECT, CE) | Middleboxes bleach

The Interface gap is the killer. Honda et al. (2011) measured that ~20 % of Internet paths had middleboxes that cleared ECN bits (Honda et al. 2011). Deployment stalled: the chicken-and-egg problem was fatal. A sender who turned on ECN gained no benefit unless both the router and receiver supported it and no middlebox rewrote the bits.

10.3.3 Environment → Measurement → Belief

Layer | What Transport Has | What’s Missing
Environment | Queue depth at bottleneck | Still not direct
Measurement | CE marks + missing ACKs | Mark count per RTT not exposed
Belief | cwnd, halved on any mark | No proportional response

The E→M gap shrank from 1 bit to 2 bits, but the signal semantics remained binary (“halve on anything”). To extract real information, the sender needs to know how many marks arrived in the last RTT, not just whether one did.

10.3.4 “The Gaps Didn’t Matter… Yet”

In 2001, typical RTTs were 50+ ms, speeds 100 Mbps, and marking rates low. Binary response to marks was fine. The gap became catastrophic when datacenters arrived: 10 Gbps, 100 µs RTT, marks every few packets.


10.4 Act 3: L4S — Dual-Queue Coupled AQM (2013–2023)

It’s 2013. DCTCP (2010) has demonstrated that in a controlled datacenter, a scalable congestion controller responding proportionally to mark fractions can achieve sub-millisecond queuing at 10 Gbps (Alizadeh et al. 2010). But DCTCP is unfair to classic TCP in a shared queue — its shallow-queue operation starves Reno. Bob Briscoe and Koen De Schepper pose the challenge: can we deploy DCTCP-style transport on the public Internet without breaking classic TCP?

“The root cause is not ECN itself, but that classic TCP and scalable TCP cannot share one queue. Separate them, couple them, and both can coexist.” — De Schepper & Briscoe, RFC 9330 (De Schepper and Briscoe 2023a)

What the pioneers saw: The problem is not the ECN bit — it’s that one queue serves two incompatible control laws. Classic TCP expects infrequent, deep-queue marks (mirroring loss). Scalable TCP (DCTCP, Prague) expects frequent, shallow-queue marks. Sharing a single FIFO starves one or the other. The solution is two queues with a coupling law that enforces fair bandwidth sharing.

What remained invisible: Deployment would still hinge on the same middlebox ecosystem that killed RFC 3168. L4S repurposes ECT(1) as a classifier (“this flow is scalable-CC”), which requires middleboxes to leave ECT(1) alone — a hope, not a guarantee.

10.4.1 The Solution: DualQ + ECT(1) + AccECN

L4S applies disaggregation at the queue (Briscoe, De Schepper, Bagnulo, et al. 2023; De Schepper and Briscoe 2023b): one queue for scalable flows (L queue, shallow target, frequent marks), one for classic flows (C queue, deep target, drops or rare marks). A coupling law links their marking probabilities: p_C = (p_L / k)². This mathematical coupling makes classic TCP’s sqrt-loss response match scalable TCP’s linear mark response at the same bandwidth share.
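The coupling law itself is a one-line computation. A sketch, with the coupling factor k = 2 assumed as an example value:

```python
def coupled_mark_prob(p_L, k=2.0):
    """Given the L queue's marking probability p_L, derive the classic
    queue's drop/mark probability via the coupling law p_C = (p_L / k)**2.
    Squaring compensates for classic TCP's 1/sqrt(p) rate law, so both
    flow classes converge to comparable per-flow throughput."""
    return (p_L / k) ** 2

p_C = coupled_mark_prob(0.10)   # 10% L-queue marks -> 0.25% classic-queue drops
```

The asymmetry in the example is the point: the scalable flow can be marked frequently and respond gently, while the classic flow sees the rare, heavy signal its control law expects, and the square keeps their bandwidth shares aligned.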

Transport applies closed-loop reasoning with a new sensor: Accurate ECN (AccECN, RFC 9341 (Briscoe, De Schepper, Bondarenko, et al. 2023)) lets the receiver feed back the count of CE marks in the last RTT, not just presence. The sender updates cwnd ← cwnd × (1 - α/2), where α is an EWMA of the mark fraction — exactly DCTCP’s rule, now deployable end-to-end.

Classifier: a flow sets ECT(1) to declare “I will respond proportionally.” The DualQ router reads ECT(1) to steer the packet into the L queue.

Figure 10.1: L4S (Low Latency Low Loss Scalable) addresses the composition failure between transport and queue management by redesigning the feedback interface. The core insight is that loss-based feedback is fundamentally broken at high speeds: drops are too sparse to provide meaningful control signals. L4S solves this by replacing loss with ECN marks (explicit congestion notification bits in the IP header) and introducing a dual-queue architecture that maintains backward compatibility with legacy TCP. The dual-queue architecture separates flows into two classes: the L queue (for ECN-capable, L4S-compliant flows) and the C queue (for legacy TCP). The L queue maintains a shallow target depth—sub-millisecond latency objectives—and marks packets with ECN when approaching this shallow threshold. ECN-capable transport protocols (like DCTCP) receive these marks at high frequency and respond proportionally, reducing the window slightly rather than sharply. The C queue, serving legacy TCP, only drops packets when it reaches deep queue depths, providing backward compatibility: legacy TCP sees packet loss (the signal it expects) only under severe congestion. A scheduler maintains fairness and isolation between the two queues, preventing L queue traffic from starving legacy flows. This design achieves three properties simultaneously: ECN-capable flows enjoy low latency and proportional control via dense marking signals; legacy TCP flows continue to function with occasional loss-based feedback; and the system maintains stability by separating the control timescales for modern and legacy protocols.

10.4.2 Invariant Analysis: L4S (2023)

Invariant | Answer (2023) | Gap?
State | Mark fraction α per RTT | Requires AccECN deployment
Time | Sub-ms queue target in L queue | Classic queue still slow
Coordination | DualQ scheduler + coupling law | Cross-AS deployment fragile
Interface | ECT(1) classifier + multi-bit feedback | Middlebox bleaching risk

The Coordination gap is the remaining one. L4S works when every bottleneck on the path is DualQ-capable. On a path that traverses three ASes and one of them runs classic FIFO, scalable flows lose their latency benefit on that hop.

10.4.3 Comparison: Before and After

What Changed | Before L4S (2001 ECN) | After L4S (2023)
Queue structure | Single FIFO | Dual queue, coupled
Mark semantics | Binary per RTT | Fractional (counted)
Response to marks | Halve cwnd | Proportional reduction
Classic-TCP fairness | Broken (ECN starves) | Preserved by coupling law
Target queuing delay | Tens of ms | Sub-millisecond (L queue)

10.4.4 Environment → Measurement → Belief

Layer | What Transport Has | What’s Missing
Environment | Per-queue depth at DualQ | Depth on non-DualQ hops
Measurement | Mark fraction α, RTT, loss | Per-hop diagnostics
Belief | Capacity + queue proximity estimate | Assumes path is DualQ throughout

The E→M gap is now structurally filtered rather than physically limited: the signal is honest and frequent, but it is filtered through the deployment state of the path. Measurement quality tracks administrative consolidation.

10.4.5 “The Gaps Didn’t Matter… Yet”

L4S works in datacenter-like conditions and in controlled access deployments (Nokia DualQ at cable modems, Apple scalable CC in iOS 16). The remaining gap — partial deployment — matters only for flows crossing non-L4S ASes. For intra-AS traffic, the binding constraint has shifted from “interface is too narrow” to “interface is not yet universal.”


10.5 Act 4: PowerBoost — Invisible Policy Breaks Transport (2008–2016)

It’s 2011. Srikanth Sundaresan and collaborators, instrumenting home gateways for the FCC broadband measurement study, observe anomalous transport behavior. Comcast cable customers see throughput of 25 Mbps for the first ~8 seconds of an upload, then an abrupt cliff to 12 Mbps — with no loss event, no marking, nothing transport can see (Sundaresan et al. 2011).

“PowerBoost allows ISPs to advertise peak rates they cannot sustain. The mechanism is a token bucket invisible to transport: packets flow unmolested during the burst, then the bucket drains and the shaper buffers indefinitely — or the policer drops without feedback.” — Sundaresan et al., 2011 (Sundaresan et al. 2011)

What the ISPs saw: A marketing tool. Advertise 25 Mbps peak, provision 12 Mbps sustained, statistical multiplexing absorbs the gap. For web browsing and small file transfers, users see the peak. For sustained uploads, they see the sustained rate. The token bucket is transparent — it’s just a shaper.

What remained invisible to the ISPs: TCP’s closed loop assumes the bottleneck is a FIFO queue with loss-based or mark-based signaling. A token bucket is a third control loop invisible to TCP. Bauer et al. (2011) and Flach et al. (2016) quantified the damage: policer-induced loss was 6× higher than shaped-traffic loss, and TCP retransmission overhead climbed sharply (Flach et al. 2016).

Figure 10.2: PowerBoost implements a token bucket rate limiter that presents two tiers to applications: an initial burst rate (20 Mbps) for a deterministic window (typically several seconds), followed by a sustained rate (6 Mbps) indefinitely. The left panel shows the bucket metaphor: tokens accumulate up to capacity C at rate r tokens per second. When the application transmits, it consumes tokens at a rate proportional to bytes sent. While tokens exist, transmission proceeds at the burst rate; when the bucket empties, transmission drops to the sustained rate (governed by replenishment rate). The timeline clearly shows the rate cliff: at approximately 2 seconds, the transmission rate drops sharply from 20 Mbps to 6 Mbps, a 70% decrease. ISPs deploy PowerBoost to advertise high burst speeds (“get 25 Mbps burst!”) while capping sustained capacity (“but only 12 Mbps long-term”), presenting a favorable marketing message despite underlying sustained capacity constraints. The critical problem is invisibility at the transport layer. TCP observes the bandwidth timeline but interprets the rate cliff incorrectly: the sender’s TCP congestion window is sized for the burst rate (assuming packets flow freely at 20 Mbps), but the modem dequeues at only 6 Mbps. Packets accumulate in the modem’s internal buffer, creating growing queues and inflating RTT from 50 ms to 150+ ms—a 3× increase. TCP interprets queue growth as network congestion (the traditional signal for loss or ECN marks), reduces its congestion window via AIMD (additive-increase, multiplicative-decrease), and backs off. But the queue continues growing because the modem’s sustained rate limit—not network congestion—is the bottleneck. TCP backs off further, progressively reducing its window to match the sustained rate, but only after dozens of RTT cycles. The result is that a simple file upload triggers extensive buffering, latency inflation, and TCP misbehavior—all because the rate limit is invisible. 
PowerBoost violates the Coordination invariant: two control loops (TCP’s congestion control and PowerBoost’s rate limiting) operate on the same queue with no communication between them, leading to misinterpretation of signals and suboptimal adaptation.

10.5.1 The Mechanism

A token bucket has depth PBS and refill rate MSTR. When the bucket is full, the sender transmits at its offered rate R. When R > MSTR, the bucket depletes in time D = PBS / (R − MSTR). After D seconds, one of two things happens:

  • Shaper mode: excess packets queue inside the modem. Transport sees inflated RTT but no loss. The belief model mistakes the shaper for a congested bottleneck. Bufferbloat.
  • Policer mode: excess packets are dropped without backpressure. Transport sees loss and halves cwnd. But the drops are deterministic (every packet above MSTR), and TCP’s stochastic backoff misses the true available rate. Utilization collapses.
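The depletion time is easy to make concrete. The bucket size and rates below are illustrative values chosen to reproduce a cliff on the order of the one observed, not any ISP’s actual configuration:

```python
def depletion_time_s(pbs_bytes, offered_bps, sustained_bps):
    """D = PBS / (R - MSTR): seconds until the token bucket empties when
    the offered rate R exceeds the refill rate MSTR (rates in bits/s)."""
    if offered_bps <= sustained_bps:
        return float("inf")            # the bucket never drains; no cliff
    return pbs_bytes / ((offered_bps - sustained_bps) / 8.0)

# Illustrative numbers only: a 13 MB bucket drained at 25 Mbps offered
# over a 12 Mbps sustained rate empties in ~8 s.
D = depletion_time_s(13e6, 25e6, 12e6)
```

Note what is absent from the function signature: nothing in it is observable to transport. D is deterministic from the ISP’s side and pure surprise from TCP’s.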

10.5.2 Invariant Analysis: PowerBoost (as a hidden shim)

Invariant | PowerBoost Answer | Gap from Transport’s View
State | Token count in ISP bucket | Not exposed to endpoint
Time | Deterministic rate cliff at t = D | Invisible in transport’s RTT estimate
Coordination | ISP-only, unilateral | No signal to transport
Interface | None (transparent to IP) | Transport’s belief model broken

The Coordination invariant is the core failure. The ISP has inserted a third control loop between transport and the AQM, with no communication to either. Transport’s belief model says “the only thing between me and the destination is a FIFO queue.” The actual environment contains a token bucket whose state is opaque to transport.

10.5.3 Environment → Measurement → Belief

Layer | What Transport Has | What’s Missing
Environment | Token bucket state + queue state | Token count never exposed
Measurement | RTT, loss, throughput samples | Shaper and congestion look identical
Belief | “Network is congested at 12 Mbps” | Wrong — network is rate-limited

The E→M gap here is structurally filtered by policy: the signal is shaped by a party whose incentives diverge from both transport and AQM. This is the “tussle” (Clark et al. 2005) playing out in the measurement layer.

10.5.4 Why the Fix Is Hard

Three fixes exist, none widely deployed:

  • Expose the rate limit via a socket API: requires ISP cooperation.
  • Deploy CoDel at the token bucket: keeps the queue small and lets transport see the delay signal honestly.
  • Mark via ECN at bucket depletion: requires ECN everywhere.

The political economy of ISP-as-adversary-to-TCP prevents all three.

10.5.5 “The Gaps Didn’t Matter… Yet”

Until sustained uploads became common (video conferencing, cloud backup, live streaming), PowerBoost was invisible to most users. By 2020, with everyone working from home, the cliff was hitting tens of millions of users daily.


10.6 Act 5: QoS vs QoE — The Application-Layer Mismatch (2011+)

It’s 2011. Conviva and Akamai are instrumenting millions of video streaming sessions. Florin Dobrian and collaborators publish the first large-scale study linking network-layer metrics to user behavior (Dobrian et al. 2011). They find that a 1 % increase in buffer-stall ratio reduces average viewing time by ~3 minutes. No transport-layer metric predicts this directly.

“Users do not experience loss rate. They experience stalls, startup delay, and visual quality. These are application-layer constructs built from, but not reducible to, network-layer signals.” — Dobrian et al., 2011 (Dobrian et al. 2011)

What the pioneers saw: Network operators measure what is easy — packet loss, latency, jitter, throughput. These are QoS (Quality of Service) metrics. User engagement depends on QoE (Quality of Experience): stall frequency, startup time, bitrate stability, VMAF (Video Multimethod Assessment Fusion)-scored visual quality. The mapping from QoS to QoE is non-linear, application-specific, and often non-monotonic.

What remained invisible: The mapping is also user-population-specific. A 720p→480p bitrate switch annoys a 4K-TV viewer but is invisible to a phone viewer. A 200 ms startup delay is fine for Netflix binge-watching but fatal for live sports. No single QoS target optimizes QoE across populations.

Figure 10.3: The left panel shows the objective network-layer metrics that operators can measure directly: throughput (100–200 Mbps range), latency (5–10 ms), jitter (1–5 ms variance), and packet loss (0.1–0.5%). These QoS (Quality of Service) metrics are easy to quantify—any packet analyzer can count them. But they do not directly determine user satisfaction (QoE—Quality of Experience). The right panel shows the perceptual outcomes users actually care about: video streaming quality (MOS scores), voice clarity, page load time, and gaming frame rate. The colored arrows reveal the non-linear, application-specific mappings. Throughput improvement from 100 to 200 Mbps dramatically improves video quality (strong correlation) but has minimal effect on gaming response (weak correlation). Latency reduction from 10 to 5 milliseconds significantly improves gaming (strong correlation) but barely affects video quality. Jitter reduction correlates strongly with voice clarity but weakly with video. Loss rate improves page load time but is nearly invisible to video (FEC handles losses). The same network metric means opposite things to different applications. This fundamental measurement gap exists because network operators measure what is easy (packet-level counters) while applications care about what is user-visible. One application’s irrelevant metric (loss, handled by TCP retransmission for file transfers) is another’s critical signal (loss, intolerable in voice calls). At the application layer, measurement must be application-specific: video streaming measures buffer stalls (the death of engagement, worse than any bitrate reduction), DASH players estimate available bandwidth by sampling segment download times, VoIP measures codec MOS (Mean Opinion Score) which correlates with perceived voice quality. Network layer cannot expose these signals directly through standard APIs. 
The socket API provides bytes received per unit time (bandwidth), not “will the video stall in 2 seconds?” or “is this loss due to wireless fading or network congestion?” Until the measurement interface is relaxed—either by exposing QoS signals to applications (ECN marks, loss information, latency samples) or by moving measurement into the application layer—the coupling between network state and user satisfaction remains implicit and application-specific.

10.6.1 Concrete Mappings

Application | Dominant QoE Metric | QoS Signal Used
VoIP | MOS (loss-sensitive) | Loss + one-way delay
DASH video | Stall ratio + VMAF + switch count | Throughput estimate
Video conferencing | End-to-end latency | RTT, jitter
Web | Time-to-first-byte | RTT, loss
Gaming | Jitter + tail latency | RTT variance

The same 50 ms RTT is excellent for DASH (irrelevant, buffering absorbs it), tolerable for VoIP, and unacceptable for competitive gaming.

10.6.2 Invariant Analysis: QoE at the Application Layer

Invariant | App-Layer Answer | Gap?
State | Buffer level, bitrate history | Indirect QoS access (throughput only)
Time | Per-chunk or per-frame decisions | Lags network state by chunks
Coordination | App-only (ABR algorithm) | No network cooperation
Interface | Socket API (throughput + RTT) | No native QoE exposure

The Interface gap is structural: the socket API exposes bytes-per-second, not queue depth or loss semantics. The application must reinvent measurement at the application layer — DASH players estimate bandwidth from segment download time; VoIP codecs infer loss patterns from jitter buffer underruns. This is redundant with what transport already measures, but unavoidable on the current API.
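What that layer-7 re-measurement looks like can be sketched in a few lines. The function names, the EWMA gain, and the 0.8 safety factor are all assumptions for illustration, not any shipping player’s algorithm:

```python
def ewma_throughput_bps(samples, g=0.3):
    """Estimate bandwidth from (segment_bytes, download_seconds) pairs,
    which is all the socket API exposes, smoothed with an EWMA."""
    est = 0.0
    for seg_bytes, dl_s in samples:
        sample = seg_bytes * 8 / dl_s          # bits per second
        est = sample if est == 0.0 else (1 - g) * est + g * sample
    return est

def pick_bitrate(est_bps, ladder_bps, safety=0.8):
    """Choose the highest ladder rung that fits under safety * estimate;
    fall back to the lowest rung when nothing fits."""
    feasible = [b for b in ladder_bps if b <= safety * est_bps]
    return max(feasible) if feasible else min(ladder_bps)
```

Everything here is inference from side effects: the player never learns queue depth or loss semantics, only how long its own downloads took.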

10.6.3 Environment → Measurement → Belief

Layer | What the Application Has | What’s Missing
Environment | Network capacity, path quality, user display | Only partially observable
Measurement | Chunk download times, stall events, playback buffer | No direct QoS signals
Belief | “Safe bitrate is X Mbps for the next N seconds” | Prediction-heavy, noisy

The E→M gap is structurally filtered by the API: the application can only measure what the socket layer exposes. Bridging requires either new APIs (SCReAM, L4S-aware endpoints) or external measurement infrastructure (Conviva, M-Lab).

10.6.4 “The Gaps Didn’t Matter… Yet”

When video was 480p and populations were tolerant, simple throughput-based adaptation was enough. 4K streaming, live sports, and cloud gaming have shrunk the tolerance, and the gap between QoS signals and QoE targets now dominates user experience.


10.7 Act 6: Datacenter Co-Design — DCTCP, HPCC, Swift (2010–2020)

It’s 2010. Mohammad Alizadeh and collaborators at Microsoft Bing control every switch, every NIC, every kernel in their datacenter. There is no middlebox. There is no cross-AS deployment. There is no ISP policy shim. The Coordination invariant is relaxed: one administrator owns both sides of every interface. What can the composition look like when co-design is permitted?

“Because we control both the switch and the endpoint, we can mandate ECN, we can require a specific congestion control algorithm, and we can measure outcomes in microseconds.” — Alizadeh et al., 2010 (Alizadeh et al. 2010)

10.7.1 DCTCP (2010): Scalable CC via Mark Fraction

DCTCP applies closed-loop reasoning with a richer sensor: the receiver echoes the count of CE marks in the last RTT, and the sender computes α = (1 − g) × α + g × F where F is the current RTT’s marking fraction. Window update: cwnd ← cwnd × (1 − α/2). When marking is 1 %, the cut is 0.5 %, not 50 %. The loop runs at full speed with tiny perturbations — the scalable-CC template (Alizadeh et al. 2010).
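A minimal sketch of that per-RTT rule, using g = 1/16 as in the DCTCP paper (the function shape is illustrative):

```python
def dctcp_update(cwnd, alpha, marked, acked, g=1.0 / 16):
    """One per-RTT DCTCP step: alpha <- (1 - g) * alpha + g * F, where F is
    this RTT's marked fraction, then cwnd <- cwnd * (1 - alpha / 2) if any
    packet was marked. Returns the new (cwnd, alpha)."""
    F = marked / acked if acked else 0.0
    alpha = (1 - g) * alpha + g * F
    if marked:
        cwnd *= 1 - alpha / 2
    return cwnd, alpha
```

Contrast the window cut with AIMD: a light marking round trims cwnd by a fraction of a percent instead of halving it, which is what lets the loop hold a shallow queue at line rate.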

10.7.2 TIMELY (2015): RTT as the Signal

Radhika Mittal and collaborators at Google argue that RTT, not ECN, is the most actionable signal: every endpoint already measures it, no switch changes needed (Mittal et al. 2015). TIMELY tracks RTT gradient (dRTT/dt) and reduces cwnd when the gradient turns positive. Pure endpoint control. No switch cooperation required — inverting DCTCP’s strategy.
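The gradient test reduces to a few lines. A toy version only: TIMELY’s actual controller adds low/high RTT thresholds, a hyperactive-increase mode, and rate pacing:

```python
def timely_decision(rtt_samples_s):
    """Endpoint-only sensing: average the RTT gradient across consecutive
    samples; a positive gradient means queues are building somewhere."""
    grads = [b - a for a, b in zip(rtt_samples_s, rtt_samples_s[1:])]
    avg_gradient = sum(grads) / len(grads)
    return "decrease" if avg_gradient > 0 else "increase"
```

The sensor is nothing but timestamps the endpoint already collects, which is precisely the deployment argument.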

10.7.3 HPCC (2019): In-Band Network Telemetry

Yuliang Li and collaborators at Alibaba push further: every switch stamps its queue depth and link utilization into the packet header (Li et al. 2019). The sender receives per-hop state at every ACK. The CC algorithm becomes simple: compute path utilization U from the max over hops, adjust W ← W_base/U + W_AI. Convergence in one RTT. The signal has grown from 1 bit (ECN) to ~42 bytes per switch (INT metadata).
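Once INT supplies per-hop state, the control law is almost trivial. A simplification of the published algorithm (HPCC’s real update also normalizes per-hop in-flight bytes, which this sketch omits):

```python
def hpcc_window(w_base, hop_utilizations, w_ai=1.0):
    """Take U as the most-loaded hop reported by INT; dividing the
    reference window by U drives path utilization toward 1 in about one
    RTT, plus a small additive term for fairness."""
    U = max(hop_utilizations)
    return w_base / U + w_ai
```

A hop at 2x utilization halves the window in a single step; a half-loaded path doubles it. No probing, no EWMA warm-up.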

10.7.4 Swift (2020): Delay Decomposition

Gautam Kumar and collaborators at Google take delay-based CC into production at fleet scale (Kumar et al. 2020). Swift decomposes RTT into fabric delay (switch queues) and endpoint delay (NIC queues at hosts), each with its own target. When NIC delay rises, the fabric is fine — the host is overloaded, and backpressure is applied at the endpoint, not the switch.
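The attribution logic can be sketched as follows. The target values are invented for illustration; Swift in production scales its fabric target with hop count and flow count:

```python
def swift_verdict(rtt_s, nic_delay_s, fabric_target_s=50e-6, host_target_s=20e-6):
    """Attribute measured delay: subtract endpoint (NIC) delay from RTT to
    get fabric delay, then flag whichever side exceeds its own target."""
    fabric_delay_s = rtt_s - nic_delay_s
    if nic_delay_s > host_target_s:
        return "endpoint-congested"    # throttle at the host, not the switch
    if fabric_delay_s > fabric_target_s:
        return "fabric-congested"
    return "ok"
```

The decomposition matters because the two verdicts demand opposite remedies: fabric congestion calls for slowing the flow, endpoint congestion for fixing or pacing the receiving host.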

10.7.5 Invariant Analysis: Datacenter Co-Design

Invariant | Datacenter Answer | Gap?
State | Mark fraction (DCTCP) → INT (HPCC) → delay decomposition (Swift) | Grows with signal richness
Time | Per-ACK or per-RTT, µs scale | Limited by clock accuracy (Swift)
Coordination | Single admin, co-designed | Not portable to public Internet
Interface | Mandated end-to-end (ECN, INT, timestamps) | Cannot deploy across AS

10.7.6 Environment → Measurement → Belief

Layer | What Transport Has | What’s Missing
Environment | Switch queues, NIC queues, link rates | Fully observable within DC
Measurement | ECN marks / INT / RTT decomposition | Nothing material
Belief | Precise path-utilization estimate | Matches environment closely

The E→M gap is nearly closed within the datacenter. That is the return on administrative consolidation: relaxing the Coordination invariant (one admin, co-designed stack) enables the State invariant to be saturated (signal richness limited only by engineering choices, not middlebox policy).


10.8 The Grand Arc: From Loss to Co-Design

10.8.1 The Evolving Interface

Era | Signal | Interface Width | Who Owns Semantics
1988 | Loss | 1 bit (dropped / delivered) | Router (drop policy) + Endpoint (interpretation)
2001 | Loss + ECN | 2 bits (ECT/CE) | Standards-body agreement
2010 (DC) | ECN mark fraction | EWMA over RTT | Single admin
2015 (DC) | RTT gradient | Continuous | Endpoint-only
2019 (DC) | Per-hop INT | ~42 B / switch | Single admin
2020 (DC) | Delay decomposition | Continuous + host state | Single admin
2023 | L4S (DualQ + AccECN) | Counted marks + classifier | Opt-in across admins

10.8.2 Three Design Principles Applied Across the Arc

Disaggregation: Every redesign separates previously merged concerns. L4S disaggregates one queue into two. Swift disaggregates RTT into fabric and endpoint components. HPCC disaggregates “congestion” into per-hop utilization fields. Each separation creates a new interface; each interface is an opportunity for tighter control and a risk of ossification.

Closed-loop reasoning: Every step enriches the sensor side of the loop. AIMD’s sensor is “did the ACK arrive?” DCTCP’s sensor is “what fraction was marked?” HPCC’s sensor is “what is each hop’s utilization?” As the sensor grows richer, the control law becomes more precise and the loop gain becomes proportional rather than binary. The diagnostic — does it converge, how fast, where does it oscillate — is answered better at every stage.

Decision placement: The public Internet is locked into distributed decision-making — the ASes have no incentive to coordinate. The datacenter moves all the way to co-designed decisions: one admin decides both the CC algorithm and the AQM policy. L4S is the attempt to keep distributed decisions while still enriching the shared signal — a compromise that works only where both sides opt in.

10.8.3 The Dependency Chain

```mermaid
flowchart TD
    A[Interface: IP best-effort]:::constraint --> B[Coordination: distributed, no admin]:::constraint
    B --> C[State: loss-only signal]:::failure
    C --> D[Bufferbloat + synchronization]:::failure
    D --> E[Fix: ECN marks]:::fix
    E --> F[New Gap: middlebox bleaching + fairness]:::failure
    F --> G[Fix: L4S DualQ + AccECN]:::fix
    G --> H[New Gap: partial deployment]:::failure
    B --> I[ISP inserts token bucket]:::failure
    I --> J[Transport belief broken]:::failure
    J --> K[Fix: expose bucket OR CoDel at bucket]:::fix
    A --> L[Socket API hides QoS]:::constraint
    L --> M[App re-measures at layer 7]:::failure
    M --> N[DASH, VMAF, Conviva]:::fix
    B --> O[Relax coordination: single DC admin]:::fix
    O --> P[DCTCP -> HPCC -> Swift]:::fix
    classDef constraint fill:#cfe2ff,stroke:#084298,color:#000
    classDef failure fill:#f8d7da,stroke:#842029,color:#000
    classDef fix fill:#d1e7dd,stroke:#0f5132,color:#000
```

10.8.4 Pioneer Diagnosis Table

| Year | Pioneer | Invariant | Diagnosis | Contribution |
|---|---|---|---|---|
| 1988 | Jacobson | Interface | Loss is the only signal | AIMD, packet conservation |
| 1993 | Floyd, Jacobson | Coordination | Tail-drop synchronizes flows | RED (randomize) |
| 2001 | Ramakrishnan, Floyd, Black | Interface | Loss signal is destructive | RFC 3168 ECN (Ramakrishnan et al. 2001) |
| 2010 | Alizadeh et al. | State | Binary mark too coarse at DC speeds | DCTCP proportional response |
| 2011 | Sundaresan et al. | Coordination | Token buckets invisible to TCP | PowerBoost measurement |
| 2011 | Dobrian et al. | Interface | QoS fails to predict QoE | Stall-engagement model |
| 2012 | Gettys, Nichols | Time | Large buffers delay signal | Bufferbloat named (Gettys and Nichols 2012) + CoDel (Nichols and Jacobson 2012) |
| 2015 | Mittal et al. | State | ECN needs switch support; RTT doesn't | TIMELY (delay-based DC CC) |
| 2016 | Flach et al. | Coordination | Policers cause 6× loss | Internet-wide policer study |
| 2019 | Li et al. | State | 1-bit ECN loses information | HPCC (per-hop INT) |
| 2020 | Kumar et al. | State | NIC queues can be bottleneck | Swift (fabric/endpoint split) |
| 2023 | De Schepper, Briscoe | Interface | Shared queue breaks fairness | L4S DualQ + AccECN |

10.8.5 Innovation Timeline

```mermaid
flowchart TD
    subgraph sg1["Loss-Only Era"]
        A1["1988 — Jacobson: AIMD"]
        A2["1993 — RED"]
        A1 --> A2
    end
    subgraph sg2["Explicit Signals"]
        B1["2001 — RFC 3168: ECN"]
        B2["2010 — DCTCP (DC)"]
        B3["2012 — CoDel + Bufferbloat named"]
        B1 --> B2 --> B3
    end
    subgraph sg3["Datacenter Co-Design"]
        C1["2015 — TIMELY"]
        C2["2019 — HPCC (INT)"]
        C3["2020 — Swift"]
        C1 --> C2 --> C3
    end
    subgraph sg4["Public Internet Revival"]
        D1["2023 — L4S RFCs 9330/9331/9332"]
        D2["2023 — AccECN RFC 9341"]
        D1 --> D2
    end
    sg1 --> sg2 --> sg3 --> sg4
```


10.8.6 The Bidirectional Coupling Picture

Figure 10.4: Two independently designed systems meet at a link and couple bidirectionally: transport sends packets at a rate determined by its congestion window (a belief about available capacity), while queue management absorbs packets into a buffer, decides which packets get serviced, and signals back to transport via loss or delay. When the sender's rate R exceeds link capacity C, the queue fills; when the queue is full, packets drop. Transport observes loss via missing ACKs, infers congestion, and reduces its sending rate via AIMD (additive-increase, multiplicative-decrease). The loop closes: higher loss → lower R → queue drains → loss decreases. In principle, this control loop stabilizes the system at an equilibrium sending rate. But it carries a critical temporal assumption: the loss signal must reach transport quickly enough for control to remain stable. At high speeds (100 Gbps), loss is vanishingly sparse — and because every loss event halves the window, classic TCP can only sustain such rates at loss rates on the order of 0.01% or less. With large buffers, the queue grows for many RTTs before the first drop ever fires: the measurement signal (loss) arrives so sparsely and so late that the queue reaches enormous size before transport reacts. This is the pathology called bufferbloat: standing queues inflate latency dramatically, and the feedback mechanism designed to stabilize the loop destabilizes it instead. The resolution requires redesigning one or both sides of the interface: over-provision so that R < C always, use active queue management to keep queues small, or replace loss-based feedback with denser signals like ECN marks.
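The temporal failure in Figure 10.4 can be reproduced in a few lines. This is a deliberately minimal discrete-time sketch — all parameters (initial rate, capacity, horizon) are illustrative, and the model abstracts away everything except the race between queue growth and the arrival of the loss signal:

```python
# Minimal discrete-time sketch of the loop in Figure 10.4: the sender
# ramps additively each RTT, the queue absorbs the excess, and the only
# feedback is a drop when the buffer overflows. With a larger buffer,
# the drop signal arrives later, so the standing queue (and latency)
# grows larger before the sender ever reacts.

def peak_standing_queue(buffer_pkts, capacity=100, rtts=200):
    rate, queue, peak = capacity // 2, 0, 0
    for _ in range(rtts):
        queue = max(queue + rate - capacity, 0)   # excess arrivals queue up
        if queue > buffer_pkts:                   # tail drop: loss signal fires
            queue = buffer_pkts
            rate = max(rate // 2, 1)              # multiplicative decrease
        else:
            rate += 1                             # additive increase
        peak = max(peak, queue)
    return peak

# The peak standing queue scales with the buffer, not with the load:
assert peak_standing_queue(1000) > peak_standing_queue(100)
```

The single assert at the end is the whole point: enlarging the buffer tenfold enlarges the standing queue (and hence the queuing delay) accordingly, because the loss signal is gated on buffer overflow.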

Transport and AQM read and write one shared signal at the bottleneck. The signal is produced by one loop and consumed by the other. Across the six acts, the signal has grown from 1 bit to per-hop structured telemetry. The composition has moved from “independent loops that happen to share a queue” to “co-designed loops with a shared state representation.”


10.9 Generative Exercises

Exercise 9.1: A New Signal Space

Suppose a future router can stamp a single 16-bit field into every packet, chosen by the operator. You may not change transport or AQM algorithms — only the signal semantics. What 16 bits maximize the information available to transport? Consider: queue depth (how many bits?), time-since-last-drain (how many bits?), link utilization (how many bits?), per-flow fairness hint (how many bits?). Justify your bit budget.
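As a starting point, one possible (non-canonical) bit budget can be expressed as a packing function. The field layout below — 6 bits of log-scaled queue depth, 5 bits of utilization, 3 bits of log-scaled time-since-last-drain, 2 bits of fairness hint — is an illustrative answer, not the intended solution:

```python
# One illustrative 16-bit budget for Exercise 9.1: 6 + 5 + 3 + 2 bits.
# Log-scaling queue depth and drain time trades resolution for range.

def pack_signal(qdepth_log2, util_5bit, drain_log2, fairness_hint):
    assert 0 <= qdepth_log2 < 64       # 6 bits: queue depth, log2-scaled
    assert 0 <= util_5bit < 32         # 5 bits: link utilization
    assert 0 <= drain_log2 < 8         # 3 bits: time since last drain, log2
    assert 0 <= fairness_hint < 4      # 2 bits: per-flow fairness hint
    return (qdepth_log2 << 10) | (util_5bit << 5) | (drain_log2 << 2) | fairness_hint

def unpack_signal(field):
    return ((field >> 10) & 0x3F, (field >> 5) & 0x1F,
            (field >> 2) & 0x7, field & 0x3)
```

Writing the layout as code makes the trade-off explicit: every bit given to one quantity is a bit of resolution taken from another.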

Exercise 9.2: L4S on a Non-DualQ Hop

A flow crosses four hops: the first and last are DualQ (L4S-capable); the middle two are classic FIFO. Predict what the sender observes. Does ECT(1) survive? Are marks frequent or sparse? Does the scalable CC algorithm still work, or does it collapse to DCTCP-in-a-shared-queue fairness? Design a detection heuristic that lets the sender fall back to classic TCP behavior when it suspects a non-DualQ hop.
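One shape a detection heuristic could take (thresholds here are illustrative assumptions, not measured values): a DualQ hop marks early and often while keeping delay low, so frequent CE marks together with low queuing delay is consistent with L4S end-to-end, while sparse marks plus high queuing delay suggests a classic FIFO hop is the bottleneck:

```python
# Hedged sketch of a fallback heuristic for Exercise 9.2.
# min_l4s_mark_rate and max_l4s_delay_ms are illustrative thresholds.

def suspect_non_dualq(ce_mark_rate, queuing_delay_ms,
                      min_l4s_mark_rate=0.02, max_l4s_delay_ms=5.0):
    """Return True when the observed signal pattern is inconsistent
    with a DualQ bottleneck: marks too sparse AND delay too high."""
    sparse_marks = ce_mark_rate < min_l4s_mark_rate
    high_delay = queuing_delay_ms > max_l4s_delay_ms
    return sparse_marks and high_delay
```

A real sender would evaluate this over a sliding window and hysterically (with hysteresis) switch to classic AIMD behavior, since flapping between the two control laws is itself destabilizing.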

Exercise 9.3: Reverse-Engineering an ISP Shim

You have a cable modem connection. You run iperf and observe throughput of 50 Mbps for 12 seconds, then a drop to 20 Mbps, then (8 seconds later) a further drop to 10 Mbps. RTT climbs from 15 ms to 180 ms during the second drop. Construct the simplest token-bucket model that explains this trace. What are PBS, MSTR, and R for each stage? If you could add one ECN-marking point at the ISP edge, where would you place it to let TCP adapt smoothly?
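The bucket depths implied by the trace follow from one identity: a bucket's depth equals the excess rate integrated over the time the excess lasted. A sketch of that arithmetic (one consistent two-stage fit; the stage structure is an assumption about the ISP's shaper, and PBS/MSTR follow the exercise's terminology):

```python
# Two-stage token-bucket fit for the Exercise 9.3 trace.
# Depth (in Mbit) = (observed rate - sustained rate) * duration of the excess.

def bucket_depth_mbit(observed_mbps, sustained_mbps, duration_s):
    return (observed_mbps - sustained_mbps) * duration_s

# Stage 1: 50 Mbps observed for 12 s before falling to 20 Mbps.
pbs_stage1 = bucket_depth_mbit(50, 20, 12)   # 360 Mbit = 45 MB boost bucket
# Stage 2: 20 Mbps observed for 8 s before falling to the 10 Mbps floor.
pbs_stage2 = bucket_depth_mbit(20, 10, 8)    # 80 Mbit = 10 MB second bucket
```

Under this model the long-term committed rate R is 10 Mbps, the mid-tier MSTR is 20 Mbps, and the RTT spike during the second drop is the signature of queuing at the now-binding 20 Mbps stage rather than of the buckets themselves.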


10.10 References


  1. Accurate ECN (AccECN, RFC 9341) extends the original ECN mechanism by feeding back the exact number of CE-marked packets (rather than a single binary ECN-Echo bit). This gives the sender a proportional congestion signal — enabling fine-grained rate adjustments instead of the blunt halving that standard ECN triggers.↩︎

  2. A token bucket accumulates tokens at a constant rate (the committed information rate). Each packet consumes tokens proportional to its size. If tokens are available, the packet passes; if not, it is queued or dropped. The bucket depth (burst size) controls how much traffic can exceed the committed rate in a burst.↩︎
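The mechanism in footnote 2 fits in a dozen lines. A minimal sketch (one token per byte; class and method names are illustrative):

```python
# Minimal token bucket matching footnote 2: tokens accumulate at the
# committed rate, capped at the bucket depth (the burst size); a packet
# passes only if enough tokens are available.

class TokenBucket:
    def __init__(self, rate_bytes_per_s, depth_bytes):
        self.rate = rate_bytes_per_s
        self.depth = depth_bytes
        self.tokens = depth_bytes      # bucket starts full
        self.last = 0.0

    def admit(self, pkt_bytes, now):
        # Refill at the committed rate, capped at the bucket depth.
        self.tokens = min(self.depth,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= pkt_bytes:   # tokens available: packet passes
            self.tokens -= pkt_bytes
            return True
        return False                   # otherwise: queue or drop
```

The two parameters map directly onto the trace behavior discussed in this chapter: `rate` is the long-term ceiling transport eventually discovers, and `depth` is the burst allowance that makes the first seconds of a transfer look deceptively fast.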