CS176C — Advanced Topics in Internet Computing
2026-05-21
L13 asked: who goes first? Four scheduling disciplines — FIFO, priority, round-robin, WFQ.
Each was a different answer to the Coordination invariant: given packets in the queue, which one transmits next?
Today’s question is more dangerous:
What happens when there is no room?
The simplest answer — tail drop — discards the newcomer. Problem solved?
Not even close.
How Much Latency Is Hiding in Your Router?
Consider a home WiFi access point, the one on your desk right now, with a 100-packet buffer of 1,500-byte packets feeding a 10 Mbps uplink.
Question: How much delay does this buffer introduce when full?
\[\frac{100 \times 1{,}500 \times 8}{10{,}000{,}000} = \frac{1{,}200{,}000}{10{,}000{,}000} = 0.12 \text{ s} = \mathbf{120 \text{ ms}}\]
120 milliseconds of pure queueing delay. Not propagation. Not processing. Just packets waiting in line.
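The arithmetic above generalizes to any buffer and link speed. A minimal sketch, where the function name `full_buffer_delay_s` is made up for illustration:

```python
def full_buffer_delay_s(buffer_packets: int, packet_bytes: int, link_bps: float) -> float:
    """Worst-case queueing delay: time to drain a full buffer at the link rate."""
    return buffer_packets * packet_bytes * 8 / link_bps


delay = full_buffer_delay_s(buffer_packets=100, packet_bytes=1500, link_bps=10_000_000)
print(f"{delay * 1000:.0f} ms")  # prints "120 ms"
```

Doubling the buffer or halving the link rate doubles the delay, which is why a buffer sized for a fast link becomes bloated on a slow one.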
Compare a speed test on an idle network (no other traffic) with the same speed test while your roommate downloads a file:
| Metric | Idle | Under load |
|---|---|---|
| Bandwidth | 10 Mbps | 10 Mbps |
| Latency | 15 ms | 135+ ms |
Bufferbloat lives in the gap between idle and working latency. The speed test measures the wrong thing.
Remember L12: the 150 ms wall. Human conversation collapses above 150 ms one-way delay.
| Component | Typical delay |
|---|---|
| Encoding | ~20 ms |
| Propagation (US coast-to-coast) | ~30 ms |
| Jitter buffer | 50–100 ms |
| Decoding | ~5–10 ms |
| Total without queueing | ~105–160 ms |
Now add 120 ms of bufferbloat: total becomes 225–280 ms.
The VoIP call is not degraded — it is destroyed. You talk over each other, both pause, both start again. Walkie-talkie mode.
And the user has no idea why. Their speed test said “10 Mbps.”
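The delay budget can be checked mechanically. The sketch below sums the component ranges from the table and adds the 120 ms of queueing delay; the dictionary layout and the name `total_range_ms` are illustrative choices, not part of the lecture.

```python
BUDGET_MS = 150  # one-way delay "wall" for conversation

# (low, high) delay in ms for each component, taken from the table
COMPONENTS_MS = {
    "encoding": (20, 20),
    "propagation": (30, 30),
    "jitter_buffer": (50, 100),
    "decoding": (5, 10),
}


def total_range_ms(components, queueing_ms=0):
    """Return (best case, worst case) one-way delay in ms."""
    low = sum(lo for lo, _ in components.values()) + queueing_ms
    high = sum(hi for _, hi in components.values()) + queueing_ms
    return low, high


print(total_range_ms(COMPONENTS_MS))                   # (105, 160)
print(total_range_ms(COMPONENTS_MS, queueing_ms=120))  # (225, 280)
```

Without queueing, the best case fits inside the budget. With bufferbloat, even the best case is 75 ms over.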
The E-M-B Decomposition
TCP is a feedback control system. It sends packets, observes what happens, adjusts.
Question: What signal does TCP use to detect congestion?
Loss. When a packet is lost (timeout or duplicate ACKs), TCP interprets it as: “the network is full — slow down.” TCP halves cwnd.
When no loss occurs: “the network has headroom — speed up.” TCP increases cwnd.
This loop works beautifully when buffers are small. Send → buffer absorbs a few → buffer fills → tail drop → TCP sees loss → TCP backs off → buffer drains → TCP ramps up. Fast, responsive sawtooth.
Now make the buffer large — 100 packets, 120 ms worth.
TCP sends packets. The buffer absorbs them. And absorbs more. And more. No packets are lost.
The buffer holds 100 packets before anything drops. TCP sees zero loss. TCP’s interpretation: “no congestion — send faster.”
cwnd grows. Queue fills from 10 ms to 50 ms to 100 ms to 120 ms.
But TCP does not measure delay. TCP measures loss. And there is no loss.
| | Reality | TCP’s belief |
|---|---|---|
| Queue | 100 packets, 120 ms delay | — |
| Congestion | Link saturated | “No congestion” |
| Action | Should back off | Increases cwnd |
The queue is lying to TCP. The buffer creates the appearance of a healthy network while the reality is a congested one.
Same E-M-B diagnostic from L2 (TCP over satellite), L3 (DV routing), and the midterm:
Environment state: The true queue occupancy. 100 packets. 120 ms of delay.
Measurement signal: Loss. But the large buffer degrades this signal. Packets queue for 120 ms without dropping. Loss arrives only after overflow — late, sudden, in a burst.
Internal belief: cwnd. The sender’s estimate of how fast it is safe to send. With no loss signal, cwnd keeps growing until the buffer finally overflows.
The gap between environment and belief — that gap IS bufferbloat. It lives entirely in the measurement signal failure. Loss is a proxy for congestion. With large buffers, it is a degraded proxy.
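A toy model makes the delayed signal concrete. It is a deliberately crude sketch, not a TCP implementation: cwnd grows by one packet per round, the path is assumed to hold 12 packets in flight before any queue forms (`PIPE_PACKETS` is an assumed value), and the first loss occurs only when the queue exceeds the buffer.

```python
PACKET_TIME_MS = 1.2  # 1,500-byte packet at 10 Mbps
PIPE_PACKETS = 12     # assumed: packets in flight that fit without queueing


def rounds_until_first_loss(buffer_packets: int, cwnd: int = 10):
    """Grow cwnd by one packet per round until tail drop finally signals loss.

    Returns (rounds without any loss signal, peak queueing delay in ms).
    """
    rounds = 0
    peak_queue = 0
    while True:
        queue = max(0, cwnd - PIPE_PACKETS)  # packets that must wait
        if queue > buffer_packets:           # overflow: first loss signal
            return rounds, peak_queue * PACKET_TIME_MS
        peak_queue = max(peak_queue, queue)
        cwnd += 1                            # no loss, so "speed up"
        rounds += 1


for size in (10, 100):
    rounds, delay_ms = rounds_until_first_loss(size)
    print(f"buffer={size:3d}  loss-free rounds={rounds:3d}  peak delay={delay_ms:.0f} ms")
```

In this model the 10-packet buffer delivers its loss signal after 13 rounds with 12 ms of queueing. The 100-packet buffer stays silent for 103 rounds while delay climbs to 120 ms.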
Rational Decisions, Perverse Outcomes
Burst absorption. TCP’s initial window = 10 packets as a burst. Small buffers → immediate loss → connection never ramps up.
Speed mismatch. 1 Gbps Ethernet in → 10 Mbps cable uplink out. That’s 100:1. Without a buffer, every packet arriving during busy periods is dropped.
Cheap memory. DRAM prices fell exponentially. 128 MB of buffer costs pennies. Why not?
Misaligned incentives. Manufacturers optimized for throughput: “99.9% link utilization!” Throughput is easy to measure, easy to market. Latency is harder to measure, invisible on spec sheets, and not part of the conversation.
Jim Gettys (2011) diagnosed this and named it bufferbloat — dark buffers hiding latency throughout the Internet.
Redesigning the Measurement Signal
Tail drop sends one signal: “the buffer overflowed.”
Question: What if the queue could signal congestion before it fills?
AQM: Drop packets early — before the buffer is full — so TCP receives a congestion signal while there is still time to react.
The queue is no longer passive storage. It actively monitors its own congestion and proactively generates signals for transport.
The evolution of AQM is a series of refinements to one question:
What should the queue measure to detect congestion?
Three generations answered differently. Each generation solved the previous one’s failure.
RED (Random Early Detection), Floyd and Jacobson: compute an EWMA of queue length. Drop probabilistically between two thresholds.
if EWMA < min_threshold:
    drop_prob = 0 # accept all
else if EWMA > max_threshold:
    drop_prob = 1 # drop all
else:
    drop_prob = max_p × (EWMA - min_threshold) / (max_threshold - min_threshold)
Why average? Instantaneous queue length is noisy — a burst spikes then drains. EWMA smooths transients.
Why probabilistic? Deterministic drops synchronize TCP flows — all back off together, all ramp up together. Random drops desynchronize them (avoids global synchronization).
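The RED rule can be written as a runnable sketch. The function names and the EWMA weight of 0.002 are illustrative assumptions, and the code assumes `min_th < max_th`. It follows the simplified rule on this slide and leaves out the refinements in the published algorithm.

```python
import random


def update_ewma(avg: float, queue_len: int, weight: float = 0.002) -> float:
    """Exponentially weighted moving average of the queue length."""
    return (1 - weight) * avg + weight * queue_len


def red_drop_probability(avg: float, min_th: float, max_th: float, max_p: float) -> float:
    """Drop probability as a function of the averaged queue length."""
    if avg < min_th:
        return 0.0  # accept all
    if avg > max_th:
        return 1.0  # drop all
    # linear ramp from 0 at min_th up to max_p at max_th
    return max_p * (avg - min_th) / (max_th - min_th)


def red_should_drop(avg, min_th, max_th, max_p, rng=random.random) -> bool:
    """Randomized decision: each arrival is dropped independently."""
    return rng() < red_drop_probability(avg, min_th, max_th, max_p)
```

Because each arriving packet faces an independent coin flip, different flows see their first drop at different times, which is what breaks the synchronization.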
Question: A queue of 30 packets — is that congestion?
It depends! Two scenarios on a 10 Mbps link:
| | Scenario 1 | Scenario 2 |
|---|---|---|
| Cause | Video burst: 100 packets | Two TCP flows at 5 Mbps each |
| Queue | Spikes to 20, drains in 12 ms | Persistent at 40 packets |
| EWMA | ~30 | ~30 |
| Correct action | Do nothing (burst clears) | Drop (reduce sending rate) |
RED cannot tell them apart. Queue length is the wrong measurement signal. It confuses good queues (burst absorption) with bad queues (persistent overload).
And RED has 7+ tuning parameters (min/max thresholds, max_p, EWMA weight, …). Optimal values differ for every deployment. Network operators found RED impossible to configure.
CoDel (Controlled Delay): Nichols and Jacobson asked: instead of how many packets, why not how long each packet waits?
Sojourn time = time a packet spends in the queue.
| Queue state | Queue length | Sojourn time |
|---|---|---|
| Transient burst | High | Short (draining) |
| Persistent overload | High | Long (stuck) |
Queue length can’t distinguish these. Sojourn time can.
Track the minimum sojourn time over the last 100 ms:
On each dequeue:
    sojourn = now - packet.enqueue_time
    Track min_sojourn over last 100 ms
    if min_sojourn > 5 ms: # persistent congestion
        drop_count += 1
        next_drop = now + interval / sqrt(drop_count)
        DROP at next_drop
    else: # queue is healthy
        drop_count = 0
        do not drop
Why the minimum? If any packet in the window got through quickly, the queue is still draining — congestion is transient. Only when the minimum exceeds 5 ms does CoDel act.
Why 5 ms? VoIP frame = 20 ms. A 5 ms target = 25% of a frame period. Jitter buffer handles it easily. Interactive web requests add 5 ms — imperceptible.
Parameters: ONE. The target delay (5 ms). Compare RED’s seven.
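A simplified, runnable sketch of the CoDel control law follows. It is not the reference implementation. It tracks "minimum sojourn above target for a full interval" by remembering when sojourn time first went above target, and it resets as soon as any packet gets through quickly. The class name `CoDelSketch` and the driver `run` are hypothetical.

```python
import math


class CoDelSketch:
    def __init__(self, target: float = 0.005, interval: float = 0.100):
        self.target = target      # acceptable standing delay (s)
        self.interval = interval  # window over which the minimum is judged (s)
        self.first_above = None   # time sojourn first went above target
        self.dropping = False
        self.drop_count = 0
        self.next_drop = 0.0

    def on_dequeue(self, now: float, enqueue_time: float) -> bool:
        """Return True if the packet being dequeued should be dropped."""
        sojourn = now - enqueue_time
        if sojourn < self.target:
            # A packet got through quickly, so the window minimum is below target.
            self.first_above = None
            self.dropping = False
            self.drop_count = 0
            return False
        if self.first_above is None:
            self.first_above = now
            return False
        if not self.dropping:
            if now - self.first_above < self.interval:
                return False  # not yet persistent
            self.dropping = True
            self.drop_count = 0
        elif now < self.next_drop:
            return False
        # Drop, then schedule the next drop sooner: interval / sqrt(count).
        self.drop_count += 1
        self.next_drop = now + self.interval / math.sqrt(self.drop_count)
        return True


def run(sojourn_s: float, n_packets: int = 400):
    """Dequeue one packet per ms, all with the same sojourn time; return drop indices."""
    q = CoDelSketch()
    return [i for i in range(n_packets) if q.on_dequeue(i / 1000, i / 1000 - sojourn_s)]
```

With a steady 1 ms sojourn time, `run(0.001)` drops nothing. With a steady 20 ms sojourn time, the first drop comes only after a full 100 ms interval. After that the gap between drops shrinks, so pressure on the sender rises until the queue drains.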
The problem: CoDel requires per-packet timestamping. At 10 Gbps with minimum-size frames, that’s ~15 million stamps/second. DOCSIS cable modem silicon was already fabricated, with no room for timestamps.
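The "~15 million" figure can be reproduced under one assumption: every packet is a minimum-size Ethernet frame, which occupies 84 bytes on the wire (64-byte frame plus 20 bytes of preamble and inter-frame gap). The helper name is illustrative.

```python
def packets_per_second(link_bps: float, wire_bytes_per_packet: int) -> float:
    """Packet rate when the link is kept full of equal-size packets."""
    return link_bps / (wire_bytes_per_packet * 8)


MIN_FRAME_ON_WIRE = 64 + 20  # assumed: minimum Ethernet frame + preamble + inter-frame gap

rate = packets_per_second(10e9, MIN_FRAME_ON_WIRE)
print(f"{rate / 1e6:.2f} million packets/s")  # about 14.9 million
```

With full-size 1,500-byte packets the rate is well under a million per second, so the 15 million figure is a worst case that line-rate hardware still has to handle.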
PIE (Proportional Integral controller Enhanced), Pan et al. (Cisco): estimate sojourn time without touching individual packets:
\[\text{delay estimate} = \frac{\text{queue length}}{\text{departure rate}}\]
A PI controller (proportional-integral) drives the drop probability toward a target delay:
Every 10 ms:
    delay_est = queue_length / departure_rate
    error = delay_est - target
    drop_prob += alpha × error + beta × (error - prev_error)
    prev_error = error
No per-packet timestamps. Queue length and departure rate are statistics routers already track. The control loop runs once every 10 ms — not 15 million times per second.
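The periodic update can be sketched as a small class. The default `target_s`, `alpha`, and `beta` values are illustrative assumptions. Clamping to [0, 1] and the zero-rate guard are additions to keep the sketch well-behaved, and the refinements of deployed PIE are omitted.

```python
class PieSketch:
    def __init__(self, target_s: float = 0.015, alpha: float = 0.125, beta: float = 1.25):
        self.target = target_s  # desired queueing delay (s), assumed value
        self.alpha = alpha      # weight on the current error (integral-like term)
        self.beta = beta        # weight on the change in error (trend term)
        self.prev_error = 0.0
        self.drop_prob = 0.0

    def update(self, queue_bytes: float, departure_rate_bytes_per_s: float) -> float:
        """Run one periodic control step and return the new drop probability."""
        if departure_rate_bytes_per_s <= 0:
            return self.drop_prob  # no rate estimate yet, leave state unchanged
        delay_est = queue_bytes / departure_rate_bytes_per_s
        error = delay_est - self.target
        self.drop_prob += self.alpha * error + self.beta * (error - self.prev_error)
        self.drop_prob = min(1.0, max(0.0, self.drop_prob))  # keep it a probability
        self.prev_error = error
        return self.drop_prob


pie = PieSketch()
LINK_BYTES_PER_S = 10_000_000 / 8  # 10 Mbps
for queue_bytes in (150_000, 150_000, 0):  # full 100-packet buffer twice, then drained
    print(f"{pie.update(queue_bytes, LINK_BYTES_PER_S):.4f}")
```

The drop probability jumps when delay first exceeds the target, keeps creeping upward while the error persists, and falls sharply once the queue drains.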
PIE became the AQM standard in DOCSIS 3.1 — shipping in millions of cable modems worldwide.
| Generation | Year | Measures | Signal | Tuning | Deployed? |
|---|---|---|---|---|---|
| Tail drop | — | Nothing | Buffer overflow | None | Everywhere |
| RED | 1993 | Queue length (EWMA) | Space (indirect) | 7+ params | Rarely |
| CoDel | 2012 | Sojourn time (min) | Time (direct) | 1 param | Linux default |
| PIE | 2013 | Estimated delay (Q/R) | Time (approx.) | 2 params | DOCSIS 3.1 |
Each step is a redesign of the State invariant: what information should the queue maintain to make good control decisions?
The pattern: as the measurement signal improves, the feedback loop becomes more stable and responsive.
Per-Flow Isolation and Explicit Signals
RED, CoDel, and PIE all operate on a single shared queue. A greedy flow monopolizes the buffer. AQM solves when to drop but not who gets served.
| Scheme | Key idea | Improvement |
|---|---|---|
| FQ-CoDel | Hash flows to ~1024 queues, run CoDel on each. DRR scheduler from L13 serves queues fairly. New flows get priority. | Per-flow isolation: torrent packets in their own queue, VoIP in another. Linux default since 2013. |
| CAKE | Extends FQ-CoDel: per-host fairness (not just per-flow), DiffServ-aware priority, ingress shaping. | One host with 20 connections no longer starves others. Standard in OpenWrt. |
| L4S | Dual-queue coupled AQM. ECN marks instead of drops. Separates classic (Reno/Cubic) from scalable (DCTCP/Prague) transport. | No data destroyed. Richer signal. Current IETF frontier (RFC 9332). |
The arc: measurement signal quality improves (space → time → per-flow time) and interface richness grows (drop → mark → per-class mark).
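The per-flow isolation idea can be sketched in two pieces: hash each flow to a queue, then serve the queues with deficit round-robin (DRR). This sketch omits the per-queue CoDel instance and the new-flow priority that FQ-CoDel adds. `zlib.crc32` stands in for the kernel's hash, and the quantum of 1,514 bytes and all names are illustrative assumptions.

```python
import zlib
from collections import deque

NUM_QUEUES = 1024
QUANTUM = 1514  # assumed: bytes of credit added per round


def queue_index(src: str, dst: str, sport: int, dport: int, proto: str) -> int:
    """Map a flow's 5-tuple to one of NUM_QUEUES queues."""
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    return zlib.crc32(key) % NUM_QUEUES


class DrrScheduler:
    def __init__(self, quantum: int = QUANTUM):
        self.quantum = quantum
        self.queues = {}       # queue index -> deque of packet sizes (bytes)
        self.deficit = {}      # queue index -> unused byte credit
        self.active = deque()  # backlogged queues in round-robin order

    def enqueue(self, idx: int, size: int) -> None:
        q = self.queues.setdefault(idx, deque())
        if not q:  # queue becomes backlogged: joins the round with no credit
            self.deficit[idx] = 0
            self.active.append(idx)
        q.append(size)

    def dequeue(self):
        """Return (queue index, packet size) for the next packet, or None if idle."""
        while self.active:
            idx = self.active[0]
            q = self.queues[idx]
            if self.deficit[idx] < q[0]:
                # Out of credit: top up and move to the back of the round.
                self.deficit[idx] += self.quantum
                self.active.rotate(-1)
                continue
            size = q.popleft()
            self.deficit[idx] -= size
            if not q:
                self.active.popleft()
            return idx, size
        return None
```

If a bulk flow enqueues twenty 1,500-byte packets and a VoIP flow then enqueues one 200-byte packet, the VoIP packet is served second, not twenty-first.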
Bufferbloat = the canonical E-M-B failure. Large buffers degrade the measurement signal (loss), causing TCP’s belief (cwnd) to diverge from reality.
120 ms of queueing delay from a 100-packet buffer at 10 Mbps, added to the ~105–160 ms a call already spends, pushes VoIP far past its 150 ms budget. Idle latency misses it; working latency reveals it.
AQM = drop early so transport gets warning while there is still time.
RED → queue length → wrong signal, 7 parameters, rarely deployed.
CoDel → sojourn time → good vs. bad queue, 1 parameter, Linux default.
PIE → estimated delay → no per-packet timestamps, millions of DOCSIS devices.
FQ-CoDel / CAKE / L4S → per-flow isolation + richer signals.
The design lesson: When you measure the right thing, you do not need knobs to compensate for measuring the wrong thing.
Today: what the queue should measure to detect congestion.
L15 goes further: what the network should measure to detect performance problems across an entire path — not just one queue.
We move from a single router’s buffer to end-to-end network measurement: how do you know what’s happening between two endpoints?
Traceroute, ping, RTT estimation, and the measurement infrastructure that makes the Internet debuggable.