When the Queue Lies

CS176C — Advanced Topics in Internet Computing

Arpit Gupta

2026-05-21

Where We Left Off

L13 asked: who goes first? Four scheduling disciplines — FIFO, priority, round-robin, WFQ.

Each was a different answer to the Coordination invariant: given packets in the queue, which one transmits next?

Today’s question is more dangerous:

What happens when there is no room?

The simplest answer — tail drop — discards the newcomer. Problem solved?

Not even close.

The Arithmetic of Delay

How Much Latency Is Hiding in Your Router?

A Concrete Router

Consider a home WiFi access point — the one on your desk right now.

  • Link capacity: 10 Mbps (a slow cable modem uplink)
  • Packet size: 1,500 bytes (standard Ethernet MTU)
  • Buffer depth: 100 packets

Question: How much delay does this buffer introduce when full?

\[\frac{100 \times 1{,}500 \times 8}{10{,}000{,}000} = \frac{1{,}200{,}000}{10{,}000{,}000} = 0.12 \text{ s} = \mathbf{120 \text{ ms}}\]

120 milliseconds of pure queueing delay. Not propagation. Not processing. Just packets waiting in line.
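The calculation above can be reproduced with a small helper. This is a minimal sketch; the function name `full_buffer_delay_s` is made up for illustration, and it assumes every packet in the buffer is full-size.

```python
def full_buffer_delay_s(buffer_packets: int, packet_bytes: int, link_bps: float) -> float:
    """Time to drain a full buffer: the queueing delay seen by the last packet in line."""
    return buffer_packets * packet_bytes * 8 / link_bps


# The access point from the slide: 100 packets of 1,500 bytes on a 10 Mbps uplink.
delay_s = full_buffer_delay_s(100, 1500, 10_000_000)
print(f"{delay_s * 1000:.0f} ms of queueing delay")
```

Doubling the buffer depth or halving the link rate doubles the delay, which is why deep buffers hurt most on slow uplinks.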

Idle Latency vs. Working Latency

Speed test on an idle network (no other traffic):

  • Latency: ~15 ms (propagation + processing)
  • Bandwidth: 10 Mbps
  • Verdict: “My connection is fine!”

Same speed test while your roommate downloads a file:

  • Buffer fills → latency jumps to 135+ ms (15 ms idle + 120 ms queueing)
  • Bandwidth: still 10 Mbps
  • Verdict: “…so why is my Zoom call terrible?”

| Metric | Idle | Under load |
|---|---|---|
| Bandwidth | 10 Mbps | 10 Mbps |
| Latency | 15 ms | 135+ ms |

Bufferbloat lives in the gap between idle and working latency. The speed test measures the wrong thing.

Why 120 ms Destroys Interactive Applications

Remember L12: the 150 ms wall. Human conversation collapses above 150 ms one-way delay.

| Component | Typical delay |
|---|---|
| Encoding | ~20 ms |
| Propagation (US coast-to-coast) | ~30 ms |
| Jitter buffer | 50–100 ms |
| Decoding | ~5–10 ms |
| Total without queueing | ~105–160 ms |

Now add 120 ms of bufferbloat: total becomes 225–280 ms.

The VoIP call is not degraded — it is destroyed. You talk over each other, both pause, both start again. Walkie-talkie mode.

And the user has no idea why. Their speed test said “10 Mbps.”
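The delay budget above is simple enough to check in a few lines. This sketch takes the per-component ranges straight from the table; the dictionary and function names are illustrative only.

```python
# One-way delay per component as (low, high) in milliseconds, from the table above.
COMPONENTS_MS = {
    "encoding": (20, 20),
    "propagation": (30, 30),
    "jitter_buffer": (50, 100),
    "decoding": (5, 10),
}
BUDGET_MS = 150  # the one-way "wall" for conversation


def one_way_delay_ms(queueing_ms=0):
    """Return the (low, high) total one-way delay including queueing."""
    low = sum(lo for lo, _ in COMPONENTS_MS.values()) + queueing_ms
    high = sum(hi for _, hi in COMPONENTS_MS.values()) + queueing_ms
    return low, high


for label, queueing in (("idle", 0), ("bufferbloat", 120)):
    low, high = one_way_delay_ms(queueing)
    print(f"{label}: {low}-{high} ms, best case over budget: {low > BUDGET_MS}")
```

Without queueing, the best case sits inside the budget. With 120 ms of queueing, even the best case is well past it.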

Why TCP Cannot See the Queue

The E-M-B Decomposition

TCP’s Feedback Signal

TCP is a feedback control system. It sends packets, observes what happens, adjusts.

Question: What signal does TCP use to detect congestion?

Loss. When a packet is lost (timeout or duplicate ACKs), TCP interprets it as: “the network is full — slow down.” TCP halves cwnd.

When no loss occurs: “the network has headroom — speed up.” TCP increases cwnd.

This loop works beautifully when buffers are small. Send → buffer absorbs a few → buffer fills → tail drop → TCP sees loss → TCP backs off → buffer drains → TCP ramps up. Fast, responsive sawtooth.

Now make the buffer large — 100 packets, 120 ms worth.

The Lie

TCP sends packets. The buffer absorbs them. And absorbs more. And more. No packets are lost.

The buffer holds 100 packets before anything drops. TCP sees zero loss. TCP’s interpretation: “no congestion — send faster.”

cwnd grows. Queue fills from 10 ms to 50 ms to 100 ms to 120 ms.

But TCP does not measure delay. TCP measures loss. And there is no loss.

| | Reality | TCP’s belief |
|---|---|---|
| Queue | 100 packets, 120 ms delay | |
| Congestion | Link saturated | “No congestion” |
| Action | Should back off | Increases cwnd |

The queue is lying to TCP. The buffer creates the appearance of a healthy network while the reality is a congested one.

E-M-B: The Cleanest Instance in the Course

Same E-M-B diagnostic from L2 (TCP over satellite), L3 (DV routing), and the midterm:

Environment state: The true queue occupancy. 100 packets. 120 ms of delay.

Measurement signal: Loss. But the large buffer degrades this signal. Packets queue for 120 ms without dropping. Loss arrives only after overflow — late, sudden, in a burst.

Internal belief: cwnd. The sender’s estimate of how fast it is safe to send. With no loss, cwnd keeps growing until the buffer finally overflows.

The gap between environment and belief — that gap IS bufferbloat. It lives entirely in the measurement signal failure. Loss is a proxy for congestion. With large buffers, it is a degraded proxy.
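The gap between belief and environment can be illustrated with a toy model. This is a deliberately crude sketch, not real TCP: it assumes one flow, one round per RTT, additive increase of one packet per round, halving on overflow, and a pipe that holds about 12 packets in flight (roughly 15 ms of base RTT at 1.2 ms per packet). All names and default values are illustrative assumptions.

```python
def simulate(buffer_packets, bdp_packets=12, rounds=500, pkt_time_ms=1.2):
    """Toy loss-based sender behind a tail-drop buffer.

    Returns (peak queueing delay in ms, number of loss events).
    """
    cwnd = 1
    peak_queue = 0
    losses = 0
    for _ in range(rounds):
        # Packets beyond what the pipe itself holds must sit in the buffer.
        queue = max(0, cwnd - bdp_packets)
        if queue > buffer_packets:
            # Tail drop: the only moment the sender learns anything.
            losses += 1
            cwnd = max(1, cwnd // 2)
        else:
            # No loss, so the sender believes there is headroom.
            peak_queue = max(peak_queue, queue)
            cwnd += 1
    return peak_queue * pkt_time_ms, losses


for size in (10, 100):
    delay_ms, losses = simulate(size)
    print(f"buffer={size:3d} packets: peak queueing delay {delay_ms:.0f} ms, {losses} loss events")
```

In this model the large buffer produces fewer loss events but ten times the standing delay: the sender hears less often and later, while the queue keeps growing in the meantime.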

Why the Buffers Got So Large

Rational Decisions, Perverse Outcomes

Question: If Large Buffers Cause Bufferbloat, Why Did Manufacturers Install Them?

Burst absorption. TCP’s initial window = 10 packets as a burst. Small buffers → immediate loss → connection never ramps up.

Speed mismatch. 1 Gbps Ethernet in → 10 Mbps cable uplink out. That’s 100:1. Without a buffer, every packet arriving during busy periods is dropped.

Cheap memory. DRAM prices fell exponentially. 128 MB of buffer costs pennies. Why not?

Misaligned incentives. Manufacturers optimized for throughput: “99.9% link utilization!” Throughput is easy to measure, easy to market. Latency is harder to measure, invisible on spec sheets, and not part of the conversation.

Jim Gettys (2011) diagnosed this and named it bufferbloat — dark buffers hiding latency throughout the Internet.

Active Queue Management

Redesigning the Measurement Signal

The AQM Insight

Tail drop sends one signal: “the buffer overflowed.”

Question: What if the queue could signal congestion before it fills?

AQM: Drop packets early — before the buffer is full — so TCP receives a congestion signal while there is still time to react.

The queue is no longer passive storage. It actively monitors its own congestion and proactively generates signals for transport.

The evolution of AQM is a series of refinements to one question:

What should the queue measure to detect congestion?

Three generations answered differently. Each generation solved the previous one’s failure.

RED: Measuring Queue Length (1993)

Floyd and Jacobson: compute an EWMA of queue length. Drop probabilistically between two thresholds.

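if EWMA < min_threshold:
    drop_prob = 0       # below the lower threshold: accept all
elif EWMA > max_threshold:
    drop_prob = 1       # above the upper threshold: drop all
else:
    # between the thresholds: ramp linearly from 0 up to max_p
    drop_prob = max_p * (EWMA - min_threshold) / (max_threshold - min_threshold)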

Why average? Instantaneous queue length is noisy — a burst spikes then drains. EWMA smooths transients.

Why probabilistic? Deterministic drops synchronize TCP flows — all back off together, all ramp up together. Random drops desynchronize them (avoids global synchronization).
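The two ideas above (average, then drop probabilistically) can be put together in a small runnable sketch. It is a simplification under stated assumptions: the class name and the default threshold, `max_p`, and weight values are illustrative, and the inter-drop count adjustment used by full RED is omitted.

```python
import random


class RedQueue:
    """Simplified RED decision logic (illustrative parameter values)."""

    def __init__(self, min_threshold=5, max_threshold=15, max_p=0.1, weight=0.002, rng=None):
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold
        self.max_p = max_p
        self.weight = weight          # EWMA weight given to the newest sample
        self.avg = 0.0                # smoothed queue length
        self.rng = rng or random.Random()

    def drop_probability(self):
        if self.avg < self.min_threshold:
            return 0.0
        if self.avg > self.max_threshold:
            return 1.0
        span = self.max_threshold - self.min_threshold
        return self.max_p * (self.avg - self.min_threshold) / span

    def on_arrival(self, queue_len):
        """Update the EWMA with the current queue length; return True to drop."""
        self.avg = (1 - self.weight) * self.avg + self.weight * queue_len
        return self.rng.random() < self.drop_probability()


red = RedQueue()
red.avg = 10.0  # halfway between the thresholds
print(red.drop_probability())
```

With a small weight, a single burst barely moves `avg`, so short spikes pass through untouched; only a sustained backlog raises the drop probability.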

Why RED Failed

Question: A queue of 30 packets — is that congestion?

It depends! Two scenarios on a 10 Mbps link:

| | Scenario 1 | Scenario 2 |
|---|---|---|
| Cause | Video burst: 100 packets | Two TCP flows at 5 Mbps each |
| Queue | Spikes to 30, drains in 36 ms | Persistent at 30 packets |
| EWMA | ~30 | ~30 |
| Correct action | Do nothing (burst clears) | Drop (reduce sending rate) |

RED cannot tell them apart. Queue length is the wrong measurement signal. It confuses good queues (burst absorption) with bad queues (persistent overload).

And RED has 7+ tuning parameters (min/max thresholds, max_p, EWMA weight, …). Optimal values differ for every deployment. Network operators found RED impossible to configure.

CoDel: Measuring Sojourn Time (2012)

Nichols and Jacobson asked: instead of how many packets, why not how long each packet waits?

Sojourn time = time a packet spends in the queue.

| Queue state | Queue length | Sojourn time |
|---|---|---|
| Transient burst | High | Short (draining) |
| Persistent overload | High | Long (stuck) |

Queue length can’t distinguish these. Sojourn time can.

  • 80 packets, 2 ms min sojourn → good queue (packets flowing)
  • 20 packets, 50 ms min sojourn → bad queue (packets stuck)

CoDel: The Algorithm

Track the minimum sojourn time over the last 100 ms:

On each dequeue:
    sojourn = now - packet.enqueue_time

    Track min_sojourn over last 100 ms

    if min_sojourn > 5 ms:            # persistent congestion
        drop_count += 1
        next_drop = now + interval / sqrt(drop_count)
        DROP at next_drop
    else:                              # queue is healthy
        drop_count = 0
        do not drop

Why the minimum? If any packet in the window got through quickly, the queue is still draining — congestion is transient. Only when the minimum exceeds 5 ms does CoDel act.

Why 5 ms? VoIP frame = 20 ms. A 5 ms target = 25% of a frame period. Jitter buffer handles it easily. Interactive web requests add 5 ms — imperceptible.

Parameters: ONE. The target delay (5 ms). Compare RED’s seven.
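The core test in the pseudocode, a minimum over a sliding window compared against the target, can be sketched as runnable code. This is not the full CoDel state machine (it never actually drops, and a real implementation avoids storing every sample); the class and function names are illustrative assumptions.

```python
import math
from collections import deque

TARGET_S = 0.005     # 5 ms target delay, from the text
INTERVAL_S = 0.100   # 100 ms window, from the text


class SojournWindow:
    """Tracks the minimum sojourn time over the last INTERVAL_S seconds."""

    def __init__(self):
        self.samples = deque()  # (dequeue_time, sojourn_time)

    def record(self, now, enqueue_time):
        """Record one dequeue and return the current windowed minimum."""
        self.samples.append((now, now - enqueue_time))
        # Expire samples older than the window; the newest one always stays.
        while self.samples[0][0] < now - INTERVAL_S:
            self.samples.popleft()
        return min(sojourn for _, sojourn in self.samples)


def is_bad_queue(min_sojourn):
    """Persistent congestion: even the luckiest packet waited longer than the target."""
    return min_sojourn > TARGET_S


def drop_spacing(drop_count):
    """Time until the next drop; shrinks as consecutive drops accumulate."""
    return INTERVAL_S / math.sqrt(drop_count)


window = SojournWindow()
for i in range(10):
    t = i * 0.01
    current_min = window.record(t, t - 0.050)  # every packet waited 50 ms
print(is_bad_queue(current_min))
```

A single fast packet inside the window pulls the minimum below the target and marks the queue healthy again, which is the burst tolerance described above.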

PIE: Estimating Delay Without Timestamps (2013)

The problem: CoDel requires per-packet timestamping. At 10 Gbps that’s ~15 million stamps/second. DOCSIS cable modem silicon was already fabricated — no room for timestamps.
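One way to arrive at a figure of about 15 million per second is the worst-case packet rate of the link. The sketch below assumes minimum-size 64-byte Ethernet frames plus 20 bytes of per-frame overhead on the wire (preamble and inter-frame gap); that framing assumption is mine, not stated in the slide.

```python
LINK_BPS = 10e9        # 10 Gbps
MIN_FRAME_BYTES = 64   # minimum Ethernet frame (assumed worst case)
WIRE_OVERHEAD_BYTES = 20  # 8 bytes preamble/start delimiter + 12 bytes inter-frame gap


def max_packets_per_second(link_bps, frame_bytes=MIN_FRAME_BYTES, overhead_bytes=WIRE_OVERHEAD_BYTES):
    """Upper bound on packet rate when every frame is as small as possible."""
    return link_bps / ((frame_bytes + overhead_bytes) * 8)


pps = max_packets_per_second(LINK_BPS)
print(f"{pps / 1e6:.2f} million packets per second")
```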

Pan et al. (Cisco): Estimate sojourn time without touching individual packets:

\[\text{delay estimate} = \frac{\text{queue length}}{\text{departure rate}}\]

A PI controller (proportional-integral) drives the drop probability toward a target delay:

Every 10 ms:
    delay_est = queue_length / departure_rate
    error = delay_est - target
    drop_prob += alpha × error + beta × (error - prev_error)

No per-packet timestamps. Queue length and departure rate are statistics routers already track. The control loop runs once every 10 ms — not 15 million times per second.
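The control loop above translates almost line for line into runnable code. This is a minimal sketch: the class name is made up, the target delay and the `alpha` and `beta` gains are illustrative assumptions, and refinements found in deployed PIE (burst allowance, gain scaling) are left out.

```python
class PieController:
    """Simplified PIE drop-probability update, meant to run once per update period."""

    def __init__(self, target_s=0.015, alpha=0.125, beta=1.25):
        self.target_s = target_s   # assumed target delay
        self.alpha = alpha         # weight on the current error (assumed value)
        self.beta = beta           # weight on the change in error (assumed value)
        self.drop_prob = 0.0
        self.prev_error = 0.0

    def update(self, queue_bytes, departure_rate_bytes_per_s):
        delay_est = queue_bytes / departure_rate_bytes_per_s
        error = delay_est - self.target_s
        self.drop_prob += self.alpha * error + self.beta * (error - self.prev_error)
        # Keep the result a valid probability.
        self.drop_prob = min(1.0, max(0.0, self.drop_prob))
        self.prev_error = error
        return self.drop_prob


pie = PieController()
rate = 10_000_000 / 8                 # 10 Mbps expressed in bytes per second
print(pie.update(150_000, rate))      # 100 full-size packets queued: delay far above target
print(pie.update(150_000, rate))      # still above target: probability keeps rising
print(pie.update(0, rate))            # queue drained: probability falls back
```

The `alpha` term reacts to how far the delay is from the target; the `beta` term reacts to whether the delay is getting better or worse, which damps oscillation.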

PIE became the AQM standard in DOCSIS 3.1 — shipping in millions of cable modems worldwide.

The Measurement Signal Evolution

| Generation | Year | Measures | Signal | Tuning | Deployed? |
|---|---|---|---|---|---|
| Tail drop | | Nothing | Buffer overflow | None | Everywhere |
| RED | 1993 | Queue length (EWMA) | Space (indirect) | 7+ params | Rarely |
| CoDel | 2012 | Sojourn time (min) | Time (direct) | 1 param | Linux default |
| PIE | 2013 | Estimated delay (Q/R) | Time (approx.) | 2 params | DOCSIS 3.1 |

Each step is a redesign of the State invariant: what information should the queue maintain to make good control decisions?

  • RED: EWMA of queue length → indirect, ambiguous, requires manual tuning
  • CoDel: minimum sojourn time → direct, unambiguous, self-tuning
  • PIE: estimated delay from rate measurements → nearly as good, vastly cheaper

The pattern: as the measurement signal improves, the feedback loop becomes more stable and responsive.

Beyond Single-Queue AQM

Per-Flow Isolation and Explicit Signals

FQ-CoDel, CAKE, and L4S

RED, CoDel, and PIE all operate on a single shared queue. A greedy flow monopolizes the buffer. AQM solves when to drop but not who gets served.

| Scheme | Key idea | Improvement |
|---|---|---|
| FQ-CoDel | Hash flows to ~1024 queues, run CoDel on each. DRR scheduler from L13 serves queues fairly. New flows get priority. | Per-flow isolation: torrent packets in their own queue, VoIP in another. Linux default since 2013. |
| CAKE | Extends FQ-CoDel: per-host fairness (not just per-flow), DiffServ-aware priority, ingress shaping. | One host with 20 connections no longer starves others. Standard in OpenWrt. |
| L4S | Dual-queue coupled AQM. ECN marks instead of drops. Separates classic (Reno/Cubic) from scalable (DCTCP/Prague) transport. | No data destroyed. Richer signal. Current IETF frontier (RFC 9332). |

The arc: measurement signal quality improves (space → time → per-flow time) and interface richness grows (drop → mark → per-class mark).
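Per-flow isolation, the first row of the table, can be sketched in a few lines. This is an illustration of the hashing idea only: it uses CRC32 as a stand-in hash, serves one packet per queue per round instead of a byte-based DRR quantum, and runs no CoDel on the individual queues. The class name, addresses, and ports are all hypothetical.

```python
import zlib
from collections import deque

NUM_QUEUES = 1024  # number of flow queues, as in the table above


def queue_index(flow):
    """Map a 5-tuple to a queue. CRC32 is a stand-in for the real hash function."""
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % NUM_QUEUES


class FlowQueues:
    def __init__(self):
        self.queues = {}  # queue index -> deque of packets (only non-empty queues)

    def enqueue(self, flow, packet):
        self.queues.setdefault(queue_index(flow), deque()).append(packet)

    def dequeue_round(self):
        """Serve one packet from every non-empty queue (simplified round-robin)."""
        sent = []
        for idx in list(self.queues):
            queue = self.queues[idx]
            sent.append(queue.popleft())
            if not queue:
                del self.queues[idx]
        return sent


torrent = ("10.0.0.2", "203.0.113.9", 51000, 6881, "tcp")
voip = ("10.0.0.3", "198.51.100.7", 40000, 5004, "udp")

fq = FlowQueues()
for i in range(50):
    fq.enqueue(torrent, f"torrent-{i}")
fq.enqueue(voip, "voip-0")

first_round = fq.dequeue_round()
print(first_round)
```

Unless the two flows happen to hash to the same queue, the VoIP packet leaves in the first round instead of waiting behind 50 torrent packets, which is the isolation a single shared queue cannot provide.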

Summary: The Story of Bufferbloat and AQM

  1. Bufferbloat = the canonical E-M-B failure. Large buffers degrade the measurement signal (loss), causing TCP’s belief (cwnd) to diverge from reality.

  2. 120 ms of queueing delay from a 100-packet buffer at 10 Mbps, added to a ~105–160 ms baseline, pushes a VoIP call far past its 150 ms budget. Idle latency misses it; working latency reveals it.

  3. AQM = drop early so transport gets warning while there is still time.

  4. RED → queue length → wrong signal, 7 parameters, rarely deployed.

  5. CoDel → sojourn time → good vs. bad queue, 1 parameter, Linux default.

  6. PIE → estimated delay → no per-packet timestamps, millions of DOCSIS devices.

  7. FQ-CoDel / CAKE / L4S → per-flow isolation + richer signals.

The design lesson: When you measure the right thing, you do not need knobs to compensate for measuring the wrong thing.

Bridge to L15

Today: what the queue should measure to detect congestion.

L15 goes further: what the network should measure to detect performance problems across an entire path — not just one queue.

We move from a single router’s buffer to end-to-end network measurement: how do you know what’s happening between two endpoints?

Traceroute, ping, RTT estimation, and the measurement infrastructure that makes the Internet debuggable.