When Buffering Is Forbidden

CS176C — Advanced Topics in Internet Computing

Arpit Gupta

2026-05-14

Take Away the Buffer

L10–L11 relied on large buffers (60–200 seconds), TCP retransmission, and 10-second ABR control loops.

Today we take that buffer away.

You’re on a Zoom call. Your colleague speaks. You respond. The round trip must feel instantaneous.

Question: Why can’t a Zoom call tolerate a 2-second buffer like Netflix?

Because conversation is interactive. A 2-second delay means your response arrives 4 seconds after they stopped talking. They’ve already started a new sentence. Turn-taking collapses.

Where Does the Deadline Come From?

Human Biology, Not Engineering

The 150-Millisecond Wall

Question: How fast do humans take turns in conversation? What’s the gap between one speaker finishing and the next starting?

~200 milliseconds. Your brain starts formulating a response before the other person finishes — you predict when their sentence ends from pitch and cadence.

ITU-T G.114 (decades of telephone research):

Below 150 ms one-way delay → conversation feels natural
150–400 ms → strained, speakers hesitate and overlap
Above 400 ms → walkie-talkie style, natural dialogue impossible

This is biology, not engineering. You cannot negotiate with human perception. The 150 ms bound is as fixed as the speed of light.

The Delay Budget: What’s Left for the Buffer?

Question: A voice packet must get from speaker’s mouth to listener’s ear in 150 ms. Where does the time go?

Component	Delay	Can you reduce it?
Encoding	~20 ms	No — codec needs 160 samples
Network (one-way)	30–80 ms	No — speed of light + queuing
Decoding	~5–10 ms	No — reconstruct audio
Subtotal (fixed)	~55–110 ms	These are physics + computation

Budget remaining for everything else: 150 − 110 = 40 ms in the worst case (best case: 150 − 55 = 95 ms)

The jitter buffer gets whatever is left. Its size is not designed; it is the scraps after physics and computation take their share. On an intercontinental call (80 ms network), only about 40 ms remains for the jitter buffer.
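The budget arithmetic can be checked with a short sketch. The component ranges are taken from the table above; the function and variable names are illustrative only.

```python
# Delay budget sketch using the component ranges from the table above.
BUDGET_MS = 150  # ITU-T G.114 one-way bound for natural conversation

# (low, high) estimates in milliseconds for each fixed component
fixed_components = {
    "encoding": (20, 20),
    "network": (30, 80),
    "decoding": (5, 10),
}

def remaining_budget(components, budget_ms=BUDGET_MS):
    """Return (worst_case, best_case) milliseconds left for the jitter buffer."""
    fixed_low = sum(low for low, _ in components.values())
    fixed_high = sum(high for _, high in components.values())
    return budget_ms - fixed_high, budget_ms - fixed_low

worst, best = remaining_budget(fixed_components)
print(f"Jitter buffer budget: {worst} ms (worst case) to {best} ms (best case)")
```

The jitter buffer never appears as an input: its size falls out of the subtraction.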

What the Budget Eliminates

Question: What tools from L10–L11 can we still use under this budget?

TCP retransmission? Adds 60–160 ms (one RTT). Eliminated. By the time retransmit arrives, conversation has moved on.

Large buffers? Netflix’s 60 seconds = 60 seconds of conversational delay. Even 2 seconds = satellite phone. Eliminated.

Complex codecs? AV1 spends hours per frame. VoIP has 20 ms. Eliminated.

B-frames? Require future frames (lookahead). No time. Eliminated.

Almost nothing from L10–L11 survives. Same Time invariant, same question, completely different answer.

Jitter: The Real Enemy

Regular Sending, Irregular Arrivals

What Is Jitter?

Sender transmits one packet every 20 ms — perfectly regular.

The network delivers them irregularly:

Packet	Sent at	Arrives after	Gap from previous
1	0 ms	50 ms	—
2	20 ms	53 ms	23 ms (expected 20)
3	40 ms	48 ms	15 ms (early!)
4	60 ms	72 ms	44 ms (very late)
5	80 ms	49 ms	−3 ms (arrives at 129 ms, before packet 4 at 132 ms: reordered)

Jitter = the variation in arrival times. The ear expects perfectly steady playback — 8,000 samples per second, no gaps, no bunching.
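The arrival gaps follow directly from the send times and one-way delays. A minimal sketch, assuming the 20 ms send interval and the five delays from the example (variable names are illustrative):

```python
SEND_INTERVAL_MS = 20
one_way_delays_ms = [50, 53, 48, 72, 49]  # packets 1..5 from the example

# Packet i is sent at (i - 1) * 20 ms and arrives one-way-delay later.
send_times = [i * SEND_INTERVAL_MS for i in range(len(one_way_delays_ms))]
arrival_times = [s + d for s, d in zip(send_times, one_way_delays_ms)]

# Gap between consecutive arrivals; a negative gap means the packet
# overtook its predecessor (reordering).
gaps = [later - earlier for earlier, later in zip(arrival_times, arrival_times[1:])]

for seq, gap in enumerate(gaps, start=2):
    if gap < 0:
        note = "reordered"
    else:
        note = f"{gap - SEND_INTERVAL_MS:+d} ms vs. the send interval"
    print(f"packet {seq}: gap {gap} ms ({note})")
```

A single late packet distorts two gaps: the one before it stretches and the one after it shrinks.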

The Playout Tradeoff: Pick a Delay d

The receiver schedules packet i, sent at time S_i and received at time R_i, for playout at time S_i + d. The packet arrives on time if R_i ≤ S_i + d, i.e., if d is at least its one-way delay R_i − S_i.

From our example: one-way delays are 50, 53, 48, 72, 49 ms. To get all 5: d ≥ 72 ms.

Playout delay d	On time	Total delay (enc + net + d + dec)	Within 150 ms?
50 ms	3/5 (60%)	~130 ms	✓
60 ms	4/5 (80%)	~140 ms	✓
72 ms	5/5 (100%)	~152 ms	✗

No d gives both 100% delivery AND stays within the 150 ms budget. The jitter buffer gets whatever the delay budget leaves after fixed costs — and on this path, it’s not enough.
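The on-time fractions can be reproduced with a few lines. This sketch applies the on-time rule (one-way delay ≤ d) to the five delays from the example; the function name is an illustrative choice.

```python
one_way_delays_ms = [50, 53, 48, 72, 49]

def on_time_fraction(delays_ms, playout_delay_ms):
    """Fraction of packets whose one-way delay fits within the playout delay."""
    on_time = sum(1 for delay in delays_ms if delay <= playout_delay_ms)
    return on_time / len(delays_ms)

for d in (50, 60, 72):
    fraction = on_time_fraction(one_way_delays_ms, d)
    print(f"d = {d} ms: {fraction:.0%} of packets on time")
```

Raising d only helps until it covers the slowest packet; every millisecond beyond that is pure added latency.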

The Jitter Buffer: Fixed vs. Adaptive

Fixed delay: Pick 80 ms at call start. Never change it.

→ Works if jitter stays below 80 ms. Fails if network conditions change mid-call.

Adaptive delay: Observe actual jitter and adjust the playout delay during silence gaps (speakers are silent ~40–50% of the time).

Track average delay + how much it varies
Playout delay = average + safety margin (4× the variation)
Stable network → delay shrinks → less latency
Volatile network → delay grows → fewer gaps

Same EWMA pattern as TCP RTT estimation (L3) and DASH throughput smoothing (L11). Observe → estimate → decide → act. Decision placed at the receiver.
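A minimal sketch of the adaptive estimator described above: an EWMA of the delay, an EWMA of its deviation, and a playout delay of average + 4× variation. The smoothing weight alpha = 0.125 and the class name are assumptions for illustration; the lecture does not specify them.

```python
class AdaptivePlayout:
    """EWMA-based playout delay estimator (illustrative sketch)."""

    def __init__(self, alpha=0.125, safety_factor=4.0):
        self.alpha = alpha                # assumed smoothing weight
        self.safety_factor = safety_factor  # the 4x margin from the slide
        self.avg_delay = None
        self.variation = 0.0

    def observe(self, delay_ms):
        """Update the estimates with one packet's measured delay."""
        if self.avg_delay is None:
            self.avg_delay = float(delay_ms)
            return
        deviation = abs(delay_ms - self.avg_delay)
        self.variation = (1 - self.alpha) * self.variation + self.alpha * deviation
        self.avg_delay = (1 - self.alpha) * self.avg_delay + self.alpha * delay_ms

    def playout_delay(self):
        """Delay to apply at the next silence gap: average + margin."""
        if self.avg_delay is None:
            raise ValueError("no packets observed yet")
        return self.avg_delay + self.safety_factor * self.variation

stable = AdaptivePlayout()
for _ in range(20):
    stable.observe(50)              # constant 50 ms delay

volatile = AdaptivePlayout()
for i in range(20):
    volatile.observe(40 if i % 2 == 0 else 80)  # alternating delays

print(f"stable network:   {stable.playout_delay():.1f} ms")
print(f"volatile network: {volatile.playout_delay():.1f} ms")
```

In a real receiver the new value would only take effect at a talkspurt boundary, so that the adjustment is hidden inside a silence gap.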

The Protocol Problem

What Does the Jitter Buffer Need to Work?

Question: What Information Does the Receiver Need?

The jitter buffer needs three things to function. What are they?

Which packets are missing? → To detect loss and trigger concealment
When was each packet created? → To reconstruct the sender’s timing
What codec was used? → To decode the audio

Does UDP provide any of these?

No. UDP is a bare envelope: source port, destination port, length, checksum, payload. No sequence numbers. No timestamps. No codec identification.

This is why RTP exists.

RTP: Three Problems Solved in 12 Bytes

RTP (Schulzrinne, 1996): a thin header on top of UDP:

Field	Size	Solves
Sequence number	16 bits	“Which packets are missing?” — gaps = loss
Timestamp	32 bits	“When was this created?” — reconstruct sender’s timing
Payload type	7 bits	“What codec?” — decode correctly

Concrete packet: 160 B audio (20 ms of G.711) + 12 B RTP + 8 B UDP + 20 B IP = 200 bytes

50 packets/sec = 80 kbps (64 kbps audio + 16 kbps overhead)
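The 12-byte header and the 200-byte packet can be reproduced with Python's struct module. This sketch follows the fixed RTP header layout from RFC 3550, which also carries a 32-bit SSRC (source identifier) not listed in the table above. The sequence number, timestamp, and SSRC values are arbitrary examples, and the function name is illustrative.

```python
import struct

RTP_VERSION = 2
PT_PCMU = 0  # static payload type for G.711 mu-law (RFC 3551)

def build_rtp_header(seq, timestamp, ssrc, payload_type=PT_PCMU, marker=False):
    """Pack the 12-byte fixed RTP header (no padding, extension, or CSRCs)."""
    byte0 = RTP_VERSION << 6                       # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack(
        "!BBHII",                                  # network byte order
        byte0,
        byte1,
        seq & 0xFFFF,                              # 16-bit sequence number
        timestamp & 0xFFFFFFFF,                    # 32-bit timestamp
        ssrc & 0xFFFFFFFF,                         # 32-bit source identifier
    )

header = build_rtp_header(seq=1, timestamp=160, ssrc=0x12345678)
payload = bytes(160)                # 20 ms of G.711: 160 samples, 1 byte each
UDP_HEADER_BYTES = 8
IPV4_HEADER_BYTES = 20

packet_bytes = len(header) + len(payload) + UDP_HEADER_BYTES + IPV4_HEADER_BYTES
bitrate_bps = packet_bytes * 8 * 50  # 50 packets per second

print(f"RTP header: {len(header)} bytes, packet: {packet_bytes} bytes")
print(f"On-the-wire rate: {bitrate_bps / 1000:.0f} kbps")
```

The timestamp advances by 160 per packet because it counts samples, not milliseconds, which is what lets the receiver rebuild the sender's timing.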

Question: How Does the Sender Know Things Are Going Wrong?

RTP is sender → receiver. But what if the receiver is losing 5% of packets? What if jitter is spiking? The sender has no idea.

RTCP (RTP Control Protocol): receiver sends a report back every ~5 seconds:

Fraction of packets lost
Interarrival jitter (how variable are arrivals?)
Timestamps that let the sender estimate round-trip time

The 5% bandwidth rule: RTCP must not exceed 5% of session bandwidth. At 64 kbps: ~3.2 kbps → one report every ~5 seconds.

Consequence: The feedback loop is slow. TCP reacts in milliseconds. RTCP reacts in seconds. A 500 ms burst of loss comes and goes before RTCP reports it.
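The interarrival jitter value in a receiver report is itself a running estimate. The sketch below follows the RFC 3550 formula J = J + (|D| − J)/16, where D is the change in transit time between consecutive packets. It works in milliseconds for readability, whereas real implementations use RTP timestamp units; the function name is illustrative.

```python
def interarrival_jitter(send_times_ms, arrival_times_ms):
    """Running jitter estimate in the style of RFC 3550 (gain 1/16)."""
    jitter = 0.0
    prev_transit = None
    for sent, arrived in zip(send_times_ms, arrival_times_ms):
        transit = arrived - sent
        if prev_transit is not None:
            d = abs(transit - prev_transit)   # change in transit time
            jitter += (d - jitter) / 16.0     # smooth with gain 1/16
        prev_transit = transit
    return jitter

send_times_ms = [0, 20, 40, 60, 80]
steady_arrivals_ms = [50, 70, 90, 110, 130]   # constant 50 ms delay
bursty_arrivals_ms = [50, 73, 88, 132, 129]   # delays 50, 53, 48, 72, 49

print(interarrival_jitter(send_times_ms, steady_arrivals_ms))
print(interarrival_jitter(send_times_ms, bursty_arrivals_ms))
```

The 1/16 gain smooths out single spikes, which is one more reason short bursts are poorly reflected in what the sender eventually sees.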

When Packets Are Lost: Conceal or Add Redundancy

TCP retransmits. VoIP can’t. But 1–5% of packets ARE lost. At 50 pkt/sec, 2% loss = one gap per second.

PLC (Packet Loss Concealment) — reactive, at receiver:

Repeat last packet (crude but effective — frames are highly correlated)
Interpolate using pitch and spectral envelope (Opus does this well)
At 1% loss: nearly indistinguishable. At 5%: noticeable. Above 10%: unintelligible.

FEC (Forward Error Correction) — proactive, from sender:

Send a copy of packet n−1 alongside packet n
Cost: ~1.5× bandwidth + 20 ms extra delay (the receiver must wait for packet n to recover a lost packet n−1)

Both consume the already-tight delay budget.
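How the two techniques combine at the receiver can be sketched as follows: use the packet if it arrived, otherwise use the redundant copy carried by the next packet, otherwise repeat the last frame played. The packet representation (a dict mapping sequence number to a payload and a copy of the previous payload) is a simplification invented for this example, not a real wire format.

```python
def recover_stream(received, total_packets):
    """Rebuild the playout sequence from whatever arrived.

    received maps seq -> (payload, copy_of_previous_payload or None).
    Returns a list of (seq, how, frame) tuples.
    """
    output = []
    last_played = b""
    for seq in range(total_packets):
        if seq in received:
            frame, how = received[seq][0], "received"
        elif seq + 1 in received and received[seq + 1][1] is not None:
            # FEC: the next packet carries a copy of this one.
            frame, how = received[seq + 1][1], "fec"
        else:
            # PLC fallback: repeat the last frame that was played.
            frame, how = last_played, "plc-repeat"
        output.append((seq, how, frame))
        last_played = frame
    return output

payloads = [b"A", b"B", b"C", b"D", b"E"]
sent = {i: (payloads[i], payloads[i - 1] if i > 0 else None)
        for i in range(len(payloads))}

# Simulate loss of packets 1, 3, and 4.
received = {seq: pkt for seq, pkt in sent.items() if seq not in (1, 3, 4)}

for seq, how, frame in recover_stream(received, len(payloads)):
    print(seq, how, frame)
```

Packet 1 is recovered exactly because packet 2 arrived with its copy. Packets 3 and 4 are lost back to back, so no copy survives and concealment is all that remains. This is why single-copy FEC handles isolated loss but not bursts.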

The NAT Problem

Why RTP Stopped Working — and What Replaced It

Question: RTP Assumes Direct UDP. What Broke?

RTP (1996) assumed endpoints have reachable IP addresses and UDP flows freely.

By mid-2000s:

~80% of devices behind NATs (your IP: 192.168.1.42 — unreachable from outside)
Corporate firewalls block most non-HTTP traffic, including UDP

Two users behind NATs can’t exchange UDP packets. Neither knows the other’s public address. Neither NAT has a mapping for the other’s traffic.

Question: How do you establish a real-time connection when neither side can receive incoming packets?

WebRTC: NAT Traversal Solved (2011–2021)

ICE (Interactive Connectivity Establishment): gather every candidate path, test each one, and use the most preferred path that works:

Direct P2P: possible if both have public IPs (rare today). Cheapest, fastest.
STUN-assisted: ask a public server “what public IP address and port does my NAT give me?” Then try P2P using that address.
TURN relay: if all else fails, both send to a relay server that forwards. Works almost everywhere, at the highest latency and cost.
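The selection logic can be sketched as a preference order over the three path types. This is a toy model, not the actual ICE algorithm: the preference ranks are illustrative, the reachable flag stands in for the result of a real connectivity check, and the addresses are documentation-range examples.

```python
from dataclasses import dataclass

# Illustrative ranks: direct beats STUN-derived beats relayed.
PREFERENCE = {"host": 3, "srflx": 2, "relay": 1}

@dataclass
class Candidate:
    kind: str        # "host" (direct), "srflx" (via STUN), or "relay" (via TURN)
    address: str
    reachable: bool  # stand-in for the outcome of a connectivity check

def select_path(candidates):
    """Return the most preferred candidate that passed its check, or None."""
    working = [c for c in candidates if c.reachable]
    if not working:
        return None
    return max(working, key=lambda c: PREFERENCE[c.kind])

candidates = [
    Candidate("host", "192.168.1.42:5000", reachable=False),   # private address
    Candidate("srflx", "203.0.113.7:40000", reachable=True),   # learned via STUN
    Candidate("relay", "198.51.100.9:3478", reachable=True),   # TURN fallback
]

chosen = select_path(candidates)
print(f"Using {chosen.kind} path via {chosen.address}")
```

The relay is always gathered as a fallback, but it is chosen only when nothing cheaper passes its check.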

Plus:

DTLS/SRTP: End-to-end encryption. Even the relay can’t decrypt.
Browser APIs: getUserMedia + RTCPeerConnection → video call in ~50 lines of JavaScript.

Scaling: SFU (Selective Forwarding Unit) — each participant sends ONE stream, SFU forwards selectively. No transcoding. Preserves encryption. Simulcast: encode at multiple resolutions, SFU picks per receiver.
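Per-receiver layer selection in an SFU can be sketched as picking the highest simulcast layer that fits the receiver's estimated bandwidth. The layer names and bitrates below are hypothetical values chosen for illustration, and the function name is not taken from any real SFU.

```python
# Hypothetical simulcast layers: (label, bitrate in kbps)
SIMULCAST_LAYERS = [
    ("1080p", 3000),
    ("720p", 1200),
    ("360p", 500),
    ("180p", 150),
]

def pick_layer(layers, receiver_kbps):
    """Forward the highest-bitrate layer the receiver can sustain.

    Falls back to the lowest layer if even that exceeds the estimate.
    """
    affordable = [layer for layer in layers if layer[1] <= receiver_kbps]
    if not affordable:
        return min(layers, key=lambda layer: layer[1])
    return max(affordable, key=lambda layer: layer[1])

for estimate_kbps in (5000, 800, 100):
    label, kbps = pick_layer(SIMULCAST_LAYERS, estimate_kbps)
    print(f"receiver at {estimate_kbps} kbps gets {label} ({kbps} kbps)")
```

The SFU only chooses which already-encoded stream to forward, which is why it needs no transcoding.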

Zoom: Adaptation at Every Dimension

Question: Netflix adapts ONE variable (bitrate) every 10 seconds. What does Zoom adapt, and how fast?

Dimension	How Zoom adapts
Codec	VP9 (3 Mbps) → H.264 (500 kbps) → low-bitrate + FEC
Jitter buffer	50–200 ms, driven by observed loss and jitter
Frame rate	30 → 15 → 7 → 1 fps; audio always prioritized
Path	Global relay network; reroute if path degrades

Four dimensions, adapted per packet arrival — milliseconds, not seconds.

150 ms budget → you cannot wait 10 seconds to react.

Cloud gaming pushes further: <50 ms round-trip. Essentially zero buffer. The most demanding real-time application on the Internet.

The Grand Arc: Three Control Loops (L10–L12)

	Netflix	Zoom	Cloud gaming
Buffer	60–200 s	50–200 ms	~0 ms
Loop	~10 s	ms–seconds	Per frame (~16 ms)
Transport	HTTP/TCP	RTP/UDP	RTP/custom
Loss	Retransmit	FEC + conceal	FEC + quality drop
UX killer	Stall	Delay > 150 ms	Input lag > 50 ms

The pattern: Tighten the time constraint → buffer shrinks → loop speeds up → system loses degrees of freedom.

Time is not a label. It is a force that reshapes architecture. Three lectures, one invariant, three completely different systems.

Bridge to L13: What If the Network Could Help?

For three lectures, applications coped with a best-effort network.

What if the network itself could help?

Routers distinguish VoIP from Netflix?
Signal the sender before packets are lost?
Reduce jitter inside the network instead of at endpoints?

L13 goes inside the router — scheduling, queuing, active queue management.

The applications have done all they can. Time to ask what the network owes them.