CS176C — Advanced Topics in Internet Computing
2026-05-07
L1–L9 built the network stack from the bottom up: medium access, wireless architecture, transport, disaggregation.
The question was always: how does the network move bits?
Now we flip. Suppose the network works — packets go from A to B. A user opens Netflix, joins a Zoom call, watches a Twitch stream.
What goes wrong, and what did engineers build to fix it — generation by generation?
Why Is Multimedia Hard?
A video frame is a grid of pixels. Each pixel: 8 bits × 3 colors (RGB) = 24 bits.
1080p at 60 fps, the arithmetic: 1920 × 1080 pixels × 24 bits × 60 frames/s ≈ 3 Gbps. The same calculation at other resolutions:
| Format | Raw bitrate |
|---|---|
| 320×240 @ 15 fps (1990s) | 27.6 Mbps |
| 1080p @ 60 fps | ~3 Gbps |
| 4K @ 60 fps | ~12 Gbps |
A 1995 modem: 28.8 kbps. The video: 27.6 Mbps. The gap: roughly 1,000× (27.6 Mbps ÷ 28.8 kbps = 960).
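The raw-bitrate numbers can be reproduced with a few lines of Python. This is a minimal sketch: the function name `raw_bitrate_bps` is made up for illustration, and the 4K row assumes 3840×2160 pixels, which the slide does not state.

```python
def raw_bitrate_bps(width, height, fps, bits_per_pixel=24):
    """Uncompressed video bitrate in bits per second."""
    return width * height * bits_per_pixel * fps

# Pixel dimensions for "1080p" and "4K" are assumptions (1920x1080, 3840x2160).
formats = {
    "320x240 @ 15 fps": (320, 240, 15),
    "1080p @ 60 fps": (1920, 1080, 60),
    "4K @ 60 fps": (3840, 2160, 60),
}
for name, (w, h, fps) in formats.items():
    print(f"{name}: {raw_bitrate_bps(w, h, fps) / 1e6:,.1f} Mbps")

modem_bps = 28_800  # 28.8 kbps modem
gap = raw_bitrate_bps(320, 240, 15) / modem_bps
print(f"gap vs. modem: {gap:.0f}x")
```

Even the smallest 1990s format overshoots the modem by about three orders of magnitude, which is why compression comes next.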
Exploiting What’s Redundant and What’s Imperceptible
Spatial (within a frame): The whiteboard behind me is one color. Encode “200×300 block = white” instead of 60,000 individual pixels.
Temporal (across frames): Two consecutive lecture frames — only the speaker’s hand moved. 95% identical. Encode only the difference.
Perceptual (what humans won’t notice): A loud sound masks a quiet one nearby — like trying to hear a whisper next to a jackhammer. Codecs discard what you can’t perceive.
Result: 3 Gbps → 5 Mbps. A 600× compression ratio. Now the video fits on a network.
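To make spatial and temporal redundancy concrete, here is a toy sketch on a one-dimensional "frame" of grayscale pixels. It uses run-length encoding for the spatial case and a list of changed pixels for the temporal case. Real codecs work on 2-D blocks with transforms and motion compensation, so treat this only as an illustration of the idea; all names are hypothetical.

```python
def run_length_encode(pixels):
    """Spatial redundancy: collapse runs of equal pixels into (value, count)."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1
        else:
            runs.append([p, 1])
    return [(value, count) for value, count in runs]

def encode_delta(prev, curr):
    """Temporal redundancy: keep only (index, new_value) for changed pixels."""
    return [(i, c) for i, (p, c) in enumerate(zip(prev, curr)) if p != c]

def apply_delta(prev, delta):
    """Rebuild the current frame from the previous frame plus the delta."""
    frame = list(prev)
    for i, value in delta:
        frame[i] = value
    return frame

frame1 = [255] * 100        # 100 white pixels (the whiteboard)
frame2 = list(frame1)
frame2[40:45] = [0] * 5     # 5 pixels changed (the moving hand)

print(run_length_encode(frame1))          # one run instead of 100 values
delta = encode_delta(frame1, frame2)
print(len(delta), "of", len(frame2), "pixels stored for frame 2")
```

In this toy case the second frame is 95% identical to the first, so the delta carries 5 entries instead of 100.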
An I-frame is a self-contained photograph: all the info needed to display it. Large (roughly 10× the size of a P-frame).
A P-frame stores only what changed since the last frame. Tiny — but depends on its predecessor.
Question: What happens if one P-frame’s reference is lost?
Errors cascade. Each P-frame builds on the corrupted reconstruction. Video garbles progressively until the next I-frame resets everything.
How often to insert I-frames?
| Application | I-frame frequency | Why |
|---|---|---|
| Netflix | Every 4–8 seconds | TCP recovers errors; maximize compression |
| Twitch | Every 1–2 seconds | Viewers join mid-stream; need fast sync |
| Zoom | Every frame | No time for error recovery |
Same technique, three parameter choices — driven by how much time the application can afford to wait.
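The link between I-frame spacing and error cascade can be sketched with a small simulation. It assumes a stream with only I- and P-frames, one I-frame every `gop_length` frames, and a single lost frame; the function names and the 30 fps figure are illustrative assumptions, not taken from any real encoder.

```python
def frame_types(n_frames, gop_length):
    """An I-frame every gop_length frames, P-frames in between."""
    return ["I" if i % gop_length == 0 else "P" for i in range(n_frames)]

def corrupted_frames(n_frames, gop_length, lost):
    """Indices of frames that decode incorrectly when frame `lost` is dropped."""
    bad = []
    broken = False
    for i, kind in enumerate(frame_types(n_frames, gop_length)):
        if i == lost:
            broken = True       # the frame itself is missing
        elif kind == "I":
            broken = False      # self-contained frame resets the decoder
        if broken:
            bad.append(i)       # P-frames inherit the corrupted reference
    return bad

fps = 30  # assumed frame rate
for label, seconds in [("4 s spacing", 4), ("2 s spacing", 2)]:
    bad = corrupted_frames(300, seconds * fps, lost=10)
    print(f"{label}: {len(bad)} frames ({len(bad) / fps:.1f} s) garbled")
print("every frame:", len(corrupted_frames(300, 1, lost=10)), "frame garbled")
```

Longer spacing compresses better but lets one loss garble seconds of video; shorter spacing bounds the damage at the cost of more large frames.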
The First Attempt — and Why It Broke
Problem: TCP retransmits lost packets and holds back everything behind them until the retransmission arrives. A voice packet arriving 200 ms late is worse than silence.
But raw UDP has nothing: no loss detection, no timing, no codec info.
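RTP (Schulzrinne, 1996) is a thin layer on UDP that solves these three problems: sequence numbers for loss detection, timestamps for timing, and a payload type field for codec info.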
RTCP — companion reverse channel. Receiver reports loss rate, jitter, and RTT every few seconds.
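The RTP fixed header defined in RFC 3550 is 12 bytes and carries a sequence number, a timestamp, and a payload type, which map onto the three things raw UDP lacks. The sketch below packs and parses that header with Python's `struct` module and finds gaps in received sequence numbers. The helper names are hypothetical, and the gap finder ignores reordering and 16-bit wraparound to stay short.

```python
import struct

# Fixed RTP header: V/P/X/CC byte, M/PT byte, 16-bit seq, 32-bit timestamp, 32-bit SSRC.
RTP_HEADER = struct.Struct("!BBHII")

def pack_rtp_header(payload_type, seq, timestamp, ssrc, marker=False):
    byte0 = 2 << 6  # version 2, no padding, no extension, zero CSRCs
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return RTP_HEADER.pack(byte0, byte1, seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)

def parse_rtp_header(packet):
    byte0, byte1, seq, timestamp, ssrc = RTP_HEADER.unpack_from(packet)
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,   # tells the receiver which codec to use
        "seq": seq,                     # gaps reveal loss
        "timestamp": timestamp,         # drives playout timing
        "ssrc": ssrc,
    }

def missing_sequence_numbers(received_seqs):
    """Gaps in an in-order list of sequence numbers (no wraparound handling)."""
    missing = []
    for prev, curr in zip(received_seqs, received_seqs[1:]):
        missing.extend(range(prev + 1, curr))
    return missing

header = pack_rtp_header(payload_type=0, seq=7, timestamp=1120, ssrc=0x1234)
print(len(header), parse_rtp_header(header))
print(missing_sequence_numbers([1, 2, 5, 6]))
```

Payload type 0 is the static assignment for PCMU audio; the SSRC and timestamp values here are arbitrary.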
RTP assumed:
Both assumptions broke.
This was an open-loop system. The server sent at a fixed rate and never checked if the receiver was keeping up. No feedback, no adaptation.
Deployability as the Binding Constraint
YouTube (2005): stream video as an HTTP download — the same protocol that serves web pages.
Solves both of RTP’s problems at once:
The client-side buffer: download ahead of playback. Buffer fills during fast periods, drains during slow ones. As long as it doesn’t empty → smooth playback.
YouTube’s key decision: start playback in 1–2 seconds (not 10–30). Fast start, even at the risk of freezing later.
But: single bitrate. If the network drops below it, the buffer drains, and the video stalls — a mid-stream freeze. Users tolerate startup delay, but hate mid-stream freezes.
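The fill-and-drain behavior of the client buffer, and the stall that a single bitrate produces, can be sketched as a per-second simulation. The throughput trace, the 2000 kbps video bitrate, and the 2-second startup threshold are all made-up inputs for illustration.

```python
def simulate_playback(throughput_kbps, bitrate_kbps, startup_buffer_s=2.0):
    """Step through a per-second throughput trace; return seconds spent stalled."""
    buffer_s = 0.0
    playing = False
    stalled_s = 0
    for rate in throughput_kbps:
        buffer_s += rate / bitrate_kbps   # seconds of video downloaded this second
        if not playing:
            if buffer_s >= startup_buffer_s:
                playing = True            # enough buffered: playback starts next second
            continue
        if buffer_s >= 1.0:
            buffer_s -= 1.0               # play one second of video
        else:
            stalled_s += 1                # buffer ran dry: mid-stream freeze
    return stalled_s

# 5 s fast, 10 s slow, 5 s fast (hypothetical trace, in kbps)
trace = [4000] * 5 + [500] * 10 + [4000] * 5
print("stalled seconds:", simulate_playback(trace, bitrate_kbps=2000))
```

The buffer built up during the fast period absorbs most of the slow period, but with a fixed bitrate it eventually empties and playback freezes until throughput recovers.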
DASH and the Client-Driven Control Loop
Netflix encodes every movie at 6–7 quality levels:
| Bitrate | Quality |
|---|---|
| 145 kbps | Watchable on a phone — blocky |
| 771 kbps | Medium — acceptable |
| 2,358 kbps | 1080p — standard HD |
| 5,800 kbps | 1080p high — high quality |
Why multiple? The network varies. Campus WiFi: 5.8 Mbps works. On a train: 500 kbps is all you get.
The ladder gives the player options to adapt — graceful degradation instead of a freeze.
The video is chopped into segments (2–10 seconds each). Each available at every bitrate. A manifest file lists all options.
Three components, completely separated:
This is disaggregation — the same principle from Chapter 1. Encoding, delivery, and adaptation evolve independently.
The client runs a feedback loop — one decision per segment:
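One simple way to write such a loop is a throughput-based rule: estimate throughput from the last download, then pick the highest rung that fits under a safety margin. This is a sketch of that generic rule, not Netflix's actual algorithm; the 0.8 safety factor, the 4-second segment length, and the throughput trace are assumptions.

```python
LADDER_KBPS = [145, 771, 2358, 5800]   # the bitrate ladder from the table above

def choose_bitrate(estimated_kbps, ladder=LADDER_KBPS, safety=0.8):
    """Highest rung that fits under a safety fraction of estimated throughput."""
    budget = estimated_kbps * safety
    fitting = [b for b in ladder if b <= budget]
    return max(fitting) if fitting else min(ladder)

def stream(segment_throughputs_kbps, segment_s=4):
    """One decision per segment: choose, download, measure, repeat."""
    estimate = min(LADDER_KBPS)         # no measurement yet: start at the lowest rung
    choices = []
    for actual_kbps in segment_throughputs_kbps:
        bitrate = choose_bitrate(estimate)
        choices.append(bitrate)
        download_s = bitrate * segment_s / actual_kbps   # time to fetch this segment
        estimate = bitrate * segment_s / download_s      # measured throughput
    return choices

# Hypothetical network: fast, then a sudden drop, then fast again (kbps per segment)
print(stream([6000, 6000, 500, 500, 8000, 8000]))
```

The third segment is requested at 2,358 kbps because the estimate still reflects the fast network; at 500 kbps that 4-second segment takes about 19 seconds to download. The loop always reacts one segment late.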
Question: What can go wrong with this loop?
These failure modes → L11’s topic (BBA, MPC, neural ABR).
| Act | System | What it solved | What broke |
|---|---|---|---|
| 0 | Raw media | — | 1,000× too large |
| 1 | Compression | Shrinks 3 Gbps → 5 Mbps | — |
| 2 | RTP (1996) | Real-time transport | NATs; no adaptation |
| 3 | HTTP download (2005) | Firewalls; TCP reliability | Single bitrate → stalls |
| 4 | DASH/HLS (2009–11) | Adaptive per-segment | Prediction fails → L11 |
Each generation solved the previous one’s failure. Same arc as medium access: ALOHA → CSMA → CSMA/CA → OFDMA.
One Question, Three Architectures
Netflix: no deadline. Buffer holds 60–200 seconds. TCP retransmit is fine.
Zoom: within 150 ms. Large buffers destroy conversation. TCP retransmit exceeds the budget. Must use RTP/UDP.
Twitch: 5–60 second offset. Two-tier: RTMP ingest → HTTP/DASH to viewers.
| | Netflix | Zoom | Twitch |
|---|---|---|---|
| Time | No deadline | 150 ms | 5–60 s offset |
| Buffer | 60–200 s | 50–200 ms | Server + viewer |
| Transport | HTTP/TCP | RTP/UDP | RTMP + HTTP |
| Who decides? | Client (ABR) | Both endpoints | Source → server |
| What kills UX? | Freeze | Latency | Chat lag |
Time is the anchor — it constrains State, Coordination, Interface. Same cascade as medium access (L5–L8) and cellular (L8).
L11: The streaming control loop — BBA (2014), MPC (2015), neural ABR (2017). Each solving the previous one’s failure.
L12: When buffering is forbidden — VoIP, the 150ms wall, jitter buffers, RTP/RTCP, WebRTC, Zoom.
Then: what if the network itself could help?
L13–L16 go inside the router — scheduling, queuing, active queue management.