CS176C — Advanced Topics in Internet Computing
2026-05-14
L10–L11: large buffers (60–200 seconds), TCP retransmission, 10-second ABR loops.
Today we take that buffer away.
You’re on a Zoom call. Your colleague speaks. You respond. The round trip must feel instantaneous.
Question: Why can’t a Zoom call tolerate a 2-second buffer like Netflix?
Because conversation is interactive. A 2-second delay means your response arrives 4 seconds after they stopped talking. They’ve already started a new sentence. Turn-taking collapses.
Human Biology, Not Engineering
Question: How fast do humans take turns in conversation? What’s the gap between one speaker finishing and the next starting?
~200 milliseconds. Your brain starts formulating a response before the other person finishes — you predict when their sentence ends from pitch and cadence.
ITU-T G.114 (decades of telephone research):
This is biology, not engineering. You cannot negotiate with human perception. The 150 ms bound is as fixed as the speed of light.
Question: A voice packet must get from speaker’s mouth to listener’s ear in 150 ms. Where does the time go?
| Component | Delay | Can you reduce it? |
|---|---|---|
| Encoding | ~20 ms | No — codec needs 160 samples |
| Network (one-way) | 30–80 ms | No — speed of light + queuing |
| Decoding | ~5–10 ms | No — reconstruct audio |
| Subtotal (fixed) | ~55–110 ms | These are physics + computation |
Budget remaining for everything else: 150 − 110 = 40 ms (best case: 150 − 55 = 95 ms)
The jitter buffer gets whatever is left. Its size is not designed; it is the scraps after physics and computation take their share. On an intercontinental call (80 ms network), only about 40 ms remain for the jitter buffer.
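The budget arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `jitter_budget` and its parameters are illustrative, and the inputs are the low and high ends of each row in the table.

```python
# Delay-budget arithmetic from the table above (all values in milliseconds).
BUDGET_MS = 150  # one-way mouth-to-ear bound used in these notes


def jitter_budget(encode_ms, network_ms, decode_ms, budget_ms=BUDGET_MS):
    """Return what is left for the jitter buffer after the fixed costs."""
    return budget_ms - (encode_ms + network_ms + decode_ms)


# Low end of every row: 20 + 30 + 5 = 55 ms of fixed delay.
best_case = jitter_budget(encode_ms=20, network_ms=30, decode_ms=5)
# High end of every row: 20 + 80 + 10 = 110 ms of fixed delay.
worst_case = jitter_budget(encode_ms=20, network_ms=80, decode_ms=10)

print(best_case, worst_case)  # 95 40
```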
Question: What tools from L10–L11 can we still use under this budget?
Almost nothing from L10–L11 survives. Same Time invariant, same question, completely different answer.
Regular Sending, Irregular Arrivals
Sender transmits one packet every 20 ms — perfectly regular.
The network delivers them irregularly:
| Packet | Sent at | One-way delay | Arrives at | Gap from previous arrival |
|---|---|---|---|---|
| 1 | 0 ms | 50 ms | 50 ms | n/a |
| 2 | 20 ms | 53 ms | 73 ms | 23 ms (expected 20) |
| 3 | 40 ms | 48 ms | 88 ms | 15 ms (early!) |
| 4 | 60 ms | 72 ms | 132 ms | 44 ms (very late) |
| 5 | 80 ms | 49 ms | 129 ms | −3 ms (arrives before packet 4) |
Jitter = the variation in arrival times. The ear expects perfectly steady playback — 8,000 samples per second, no gaps, no bunching.
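As a quick check on the example, the arrival times and inter-arrival gaps can be recomputed from the send times and per-packet delays. This sketch treats the per-packet delay column as the one-way delay; the variable names are illustrative. Note that packet 5 reaches the receiver slightly before packet 4, so its gap comes out negative.

```python
# Send times and one-way delays from the five-packet example (milliseconds).
sent_ms = [0, 20, 40, 60, 80]
delay_ms = [50, 53, 48, 72, 49]

# Arrival time = send time + one-way delay.
arrival_ms = [s + d for s, d in zip(sent_ms, delay_ms)]

# Gap between consecutive packets, in sequence order (not arrival order).
gaps_ms = [later - earlier for earlier, later in zip(arrival_ms, arrival_ms[1:])]

print(arrival_ms)  # [50, 73, 88, 132, 129]
print(gaps_ms)     # [23, 15, 44, -3]
```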
The receiver schedules packet i for playout at time S_i + d. Packet arrives on time if R_i ≤ S_i + d, i.e., d ≥ one-way delay.
From our example: one-way delays are 50, 53, 48, 72, 49 ms. To get all 5: d ≥ 72 ms.
| Playout delay d | On time | Total delay (enc + net + d + dec) | Within 150 ms? |
|---|---|---|---|
| 50 ms | 3/5 (60%) | ~130 ms | ✓ |
| 60 ms | 4/5 (80%) | ~140 ms | ✓ |
| 72 ms | 5/5 (100%) | ~152 ms | ✗ |
No d gives both 100% delivery AND stays within the 150 ms budget. The jitter buffer gets whatever the delay budget leaves after fixed costs — and on this path, it’s not enough.
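The on-time counts follow directly from the rule R_i ≤ S_i + d. A minimal sketch, with an illustrative helper name `on_time`, counts how many of the five example packets meet each candidate playout delay:

```python
# One-way delays from the five-packet example (milliseconds).
delay_ms = [50, 53, 48, 72, 49]


def on_time(delays, d):
    """Count packets whose one-way delay is at most the playout delay d."""
    return sum(1 for x in delays if x <= d)


results = {d: on_time(delay_ms, d) for d in (50, 60, 72)}
for d, n in results.items():
    print(f"d = {d} ms: {n}/{len(delay_ms)} on time")
```

Running it prints 3/5, 4/5, and 5/5 for d = 50, 60, and 72 ms.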
Fixed delay: Pick 80 ms at call start. Never change it.
→ Works while the one-way delay stays at or below 80 ms. Fails if network conditions change mid-call.
Adaptive delay: Observe actual jitter, adjust during silence gaps (humans are ~40–50% silent).
Same EWMA pattern as TCP RTT estimation (L3) and DASH throughput smoothing (L11). Observe → estimate → decide → act. Decision placed at the receiver.
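One way the observe → estimate → decide loop might look in code is sketched below. It keeps an EWMA of the one-way delay and of its deviation, then sets the playout delay to the mean plus a multiple of the deviation, mirroring TCP's RTO formula. The class name, the gain `alpha = 0.125`, and the multiplier `k = 4` are assumptions chosen for illustration and are not taken from the lecture. The sketch also assumes sender and receiver timestamps are on comparable clocks.

```python
class PlayoutEstimator:
    """EWMA-based playout delay estimate (illustrative sketch)."""

    def __init__(self, alpha=0.125, k=4.0):
        self.alpha = alpha  # EWMA gain (assumed value)
        self.k = k          # safety multiplier on the deviation (assumed value)
        self.mean = None    # smoothed one-way delay, ms
        self.dev = 0.0      # smoothed absolute deviation, ms

    def observe(self, sent_ms, recv_ms):
        """Fold one packet's measured delay into the running estimates."""
        delay = recv_ms - sent_ms
        if self.mean is None:
            self.mean = float(delay)  # first sample seeds the estimate
            return
        a = self.alpha
        self.dev = (1 - a) * self.dev + a * abs(delay - self.mean)
        self.mean = (1 - a) * self.mean + a * delay

    def playout_delay(self):
        """Delay to apply at the start of the next talkspurt."""
        return self.mean + self.k * self.dev


est = PlayoutEstimator()
for sent, recv in [(0, 50), (20, 73), (40, 88), (60, 132), (80, 129)]:
    est.observe(sent, recv)
print(round(est.mean, 1), round(est.playout_delay(), 1))
```

The estimate is updated on every packet, but the new playout delay would only be applied during a silence gap, so that speech inside a talkspurt is not stretched or compressed.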
What Does the Jitter Buffer Need to Work?
The jitter buffer needs three things to function. What are they?
Does UDP provide any of these?
No. UDP is a bare envelope: source port, destination port, length, checksum, payload. No sequence numbers. No timestamps. No codec identification.
This is why RTP exists.
RTP (Schulzrinne, 1996) adds a thin header on top of UDP:
| Field | Size | Solves |
|---|---|---|
| Sequence number | 16 bits | “Which packets are missing?” — gaps = loss |
| Timestamp | 32 bits | “When was this created?” — reconstruct sender’s timing |
| Payload type | 7 bits | “What codec?” — decode correctly |
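To make the three fields concrete, the sketch below unpacks them from the 12-byte fixed RTP header using Python's `struct` module. The remaining header fields (version, marker bit, SSRC) follow the RFC 3550 layout and are not listed in the table above. The helper names `parse_rtp_header` and `missing_between` are illustrative, and the sample packet is synthetic.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the 12-byte fixed RTP header (RFC 3550 layout)."""
    if len(packet) < 12:
        raise ValueError("RTP fixed header is 12 bytes")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,         # top 2 bits of the first byte
        "marker": b1 >> 7,          # top bit of the second byte
        "payload_type": b1 & 0x7F,  # low 7 bits: which codec
        "sequence": seq,            # 16 bits: detect gaps
        "timestamp": ts,            # 32 bits: sender's media clock
        "ssrc": ssrc,               # 32 bits: stream identifier
    }


def missing_between(prev_seq: int, cur_seq: int) -> int:
    """Packets missing between two received sequence numbers.

    Handles 16-bit wraparound; assumes no reordering.
    """
    return ((cur_seq - prev_seq) & 0xFFFF) - 1


# Synthetic packet: version 2, payload type 0 (G.711 mu-law), 160 B of audio.
sample = struct.pack("!BBHII", 0x80, 0, 1000, 160000, 0x1234) + bytes(160)
hdr = parse_rtp_header(sample)
print(hdr["payload_type"], hdr["sequence"], hdr["timestamp"])
print(missing_between(1000, 1003))  # two packets (1001, 1002) are missing
```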
Concrete packet: 160 B audio (20 ms of G.711) + 12 B RTP + 8 B UDP + 20 B IP = 200 bytes
50 packets/sec = 80 kbps (64 kbps audio + 16 kbps overhead)
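The packet-size and bandwidth figures can be reproduced directly. The constant names are illustrative; the values are the ones given above.

```python
# Per-packet sizes in bytes, from the concrete packet above.
PAYLOAD_B, RTP_B, UDP_B, IP_B = 160, 12, 8, 20
PACKETS_PER_SEC = 50  # one packet every 20 ms

packet_bytes = PAYLOAD_B + RTP_B + UDP_B + IP_B
total_kbps = packet_bytes * 8 * PACKETS_PER_SEC / 1000
audio_kbps = PAYLOAD_B * 8 * PACKETS_PER_SEC / 1000
overhead_kbps = total_kbps - audio_kbps

print(packet_bytes, total_kbps, audio_kbps, overhead_kbps)  # 200 80.0 64.0 16.0
```

Headers account for 40 of every 200 bytes, so a fifth of the bandwidth carries no audio.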
RTP is sender → receiver. But what if the receiver is losing 5% of packets? What if jitter is spiking? The sender has no idea.
RTCP (RTP Control Protocol): receiver sends a report back every ~5 seconds:
The 5% bandwidth rule: RTCP must not exceed 5% of session bandwidth. At 64 kbps: ~3.2 kbps → one report every ~5 seconds.
Consequence: The feedback loop is slow. TCP reacts in milliseconds. RTCP reacts in seconds. A 500 ms burst of loss comes and goes before RTCP reports it.
TCP retransmits. VoIP can’t. But 1–5% of packets ARE lost. At 50 pkt/sec, 2% loss = one gap per second.
PLC (Packet Loss Concealment) — reactive, at receiver:
FEC (Forward Error Correction) — proactive, from sender:
Both consume the already-tight delay budget.
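As one illustration of proactive recovery, the sketch below uses simple XOR parity: the sender transmits one extra packet per group, and the receiver can rebuild any single lost packet in that group. This is an assumed scheme chosen for brevity and may not be the one used in the lecture or in any particular product. The function names and the group size of 4 are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets):
    """Sender side: XOR all packets in the group into one parity packet."""
    return reduce(xor_bytes, packets)


def recover(received, parity):
    """Receiver side: rebuild the single missing packet (marked None)."""
    present = [p for p in received if p is not None]
    return reduce(xor_bytes, present, parity)


# Four synthetic 160-byte audio packets (one 80 ms group at 20 ms each).
group = [bytes([i]) * 160 for i in (1, 2, 3, 4)]
parity = make_parity(group)

damaged = [group[0], None, group[2], group[3]]  # packet 2 lost in transit
restored = recover(damaged, parity)
print(restored == group[1])  # True
```

The cost shows up in both budgets: one parity packet per four adds 25% bandwidth, and the receiver may have to wait for the rest of the group before it can repair a loss, which eats into the playout delay.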
Why RTP Stopped Working — and What Replaced It
RTP (1996) assumed endpoints have reachable IP addresses and UDP flows freely.
By mid-2000s:
Two users behind NATs can’t exchange UDP packets. Neither knows the other’s public address. Neither NAT has a mapping for the other’s traffic.
Question: How do you establish a real-time connection when neither side can receive incoming packets?
ICE (Interactive Connectivity Establishment): try every possible path, pick the first that works:
Plus:
getUserMedia + RTCPeerConnection → video call in ~50 lines of JavaScript.
Scaling: SFU (Selective Forwarding Unit). Each participant sends ONE stream, SFU forwards selectively. No transcoding. Preserves encryption. Simulcast: encode at multiple resolutions, SFU picks per receiver.
Question: Netflix adapts ONE variable (bitrate) every 10 seconds. What does Zoom adapt, and how fast?
| Dimension | How Zoom adapts |
|---|---|
| Codec | VP9 (3 Mbps) → H.264 (500 kbps) → low-bitrate + FEC |
| Jitter buffer | 50–200 ms, driven by observed loss and jitter |
| Frame rate | 30 → 15 → 7 → 1 fps; audio always prioritized |
| Path | Global relay network; reroute if path degrades |
Four dimensions, adapted per packet arrival — milliseconds, not seconds.
150 ms budget → you cannot wait 10 seconds to react.
Cloud gaming pushes further: <50 ms round-trip. Essentially zero buffer. The most demanding real-time application on the Internet.
| | Netflix | Zoom | Cloud gaming |
|---|---|---|---|
| Buffer | 60–200 s | 50–200 ms | ~0 ms |
| Loop | ~10 s | ms–seconds | Per frame (~16 ms) |
| Transport | HTTP/TCP | RTP/UDP | RTP/custom |
| Loss | Retransmit | FEC + conceal | FEC + quality drop |
| UX killer | Stall | Delay > 150 ms | Input lag > 50 ms |
The pattern: Tighten the time constraint → buffer shrinks → loop speeds up → system loses degrees of freedom.
Time is not a label. It is a force that reshapes architecture. Three lectures, one invariant, three completely different systems.
For three lectures, applications coped with a best-effort network.
What if the network itself could help?
L13 goes inside the router — scheduling, queuing, active queue management.
The applications have done all they can. Time to ask what the network owes them.