The Perceptual Anchor: Why Multimedia Is Different

CS176C — Advanced Topics in Internet Computing

Arpit Gupta

2026-05-07

Where We Are

L1–L9 built the network stack from the bottom up: medium access, wireless architecture, transport, disaggregation.

The question was always: how does the network move bits?

Now we flip. Suppose the network works — packets go from A to B. A user opens Netflix, joins a Zoom call, watches a Twitch stream.

What goes wrong, and what did engineers build to fix it — generation by generation?

Act 0: The Bandwidth Problem

Why Is Multimedia Hard?

What Does Video Look Like as Bits?

A video frame is a grid of pixels. Each pixel: 8 bits × 3 color channels (RGB) = 24 bits.

1080p at 60 fps — the arithmetic:

Pixels per frame: 1920 × 1080 = 2,073,600
Bits per frame: 2,073,600 × 24 ≈ 50 Mbits
Frames per second: 60
Total: ~3 Gbps

Resolution @ frame rate	Raw bitrate
320×240 @ 15fps (1990s)	27.6 Mbps
1080p @ 60fps	~3 Gbps
4K @ 60fps	~12 Gbps

A 1995 modem: 28.8 kbps. The video: 27.6 Mbps. The gap: roughly 1,000× (27.6 Mbps ÷ 28.8 kbps = 960).
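The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name raw_bitrate_bps and the FORMATS table are illustrative, and "4K" is assumed to mean 3840×2160.

```python
def raw_bitrate_bps(width, height, fps, bits_per_pixel=24):
    """Uncompressed video bitrate in bits per second."""
    return width * height * bits_per_pixel * fps

# (width, height, fps); 4K is assumed to be 3840x2160
FORMATS = {
    "320x240 @ 15fps": (320, 240, 15),
    "1080p @ 60fps": (1920, 1080, 60),
    "4K @ 60fps": (3840, 2160, 60),
}

MODEM_BPS = 28_800  # the 28.8 kbps modem from the slide

for name, (w, h, fps) in FORMATS.items():
    bps = raw_bitrate_bps(w, h, fps)
    print(f"{name}: {bps / 1e6:,.1f} Mbps, {bps / MODEM_BPS:,.0f}x the modem")
```

The first row comes out at 27.6 Mbps and 960× the modem rate, which is where the "roughly 1,000×" gap comes from.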

Act 1: Compression

Exploiting What’s Redundant and What’s Imperceptible

Three Kinds of Redundancy

Spatial (within a frame): The whiteboard behind me is one color. Encode “200×300 block = white” instead of 60,000 individual pixels.

Temporal (across frames): Two consecutive lecture frames — only the speaker’s hand moved. 95% identical. Encode only the difference.

Perceptual (what humans won’t notice): A loud sound masks a quiet one nearby — like trying to hear a whisper next to a jackhammer. Codecs discard what you can’t perceive.

Result: 3 Gbps → 5 Mbps. A 600× compression ratio. Now the video fits on a network.
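A toy illustration of temporal redundancy, assuming a frame is just a flat list of pixel values. Real codecs use motion-compensated block prediction and transform coding rather than per-pixel differences; the helper names encode_delta and apply_delta are made up for this sketch.

```python
def encode_delta(prev, curr):
    """Return the (index, new_value) pairs where curr differs from prev."""
    return [(i, c) for i, (p, c) in enumerate(zip(prev, curr)) if p != c]

def apply_delta(prev, delta):
    """Rebuild the current frame from the previous frame plus the changes."""
    frame = list(prev)
    for i, value in delta:
        frame[i] = value
    return frame

# Toy 100-pixel grayscale frames: only 5 pixels change between them.
frame_a = [255] * 100
frame_b = list(frame_a)
for i in range(40, 45):
    frame_b[i] = 0

delta = encode_delta(frame_a, frame_b)
assert apply_delta(frame_a, delta) == frame_b   # the decoder recovers frame_b exactly
print(f"{len(delta)} of {len(frame_b)} pixels changed")
```

Sending 5 changes instead of 100 pixels is the same idea as "95% identical, encode only the difference."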

I-Frames and P-Frames: The Dependency Tradeoff

An I-frame is a self-contained photograph: it carries all the information needed to display it. Large (roughly 10× the size of a P-frame).

A P-frame stores only what changed since the last frame. Tiny — but depends on its predecessor.

Question: What happens if one P-frame’s reference is lost?

Errors cascade. Each P-frame builds on the corrupted reconstruction. Video garbles progressively until the next I-frame resets everything.
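The cascade can be seen in a toy model where each "frame" is a single number, an I-frame carries the value itself, and a P-frame carries only the change. The decoder is assumed to conceal a lost frame by repeating the last picture; make_stream and decode are hypothetical helpers, not a real codec API.

```python
def make_stream(values, gop):
    """Encode values as ('I', value) every `gop` frames, ('P', delta) otherwise."""
    stream = []
    for n, v in enumerate(values):
        if n % gop == 0:
            stream.append(("I", v))
        else:
            stream.append(("P", v - values[n - 1]))
    return stream

def decode(stream, lost):
    """Decode, repeating the last picture when a frame index in `lost` is missing."""
    out, state = [], 0
    for n, (kind, payload) in enumerate(stream):
        if n not in lost:
            state = payload if kind == "I" else state + payload
        out.append(state)
    return out

original = list(range(10, 22))           # 12 frames, each differs from the last
stream = make_stream(original, gop=6)    # I-frames at frames 0 and 6
decoded = decode(stream, lost={2})       # lose a single P-frame
bad = [n for n, (a, b) in enumerate(zip(original, decoded)) if a != b]
print("corrupted frames:", bad)          # -> corrupted frames: [2, 3, 4, 5]
```

One lost P-frame corrupts every frame up to the next I-frame at frame 6, which is why the I-frame interval bounds how long an error persists.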

How often to insert I-frames?

Application	I-frame frequency	Why
Netflix	Every 4–8 seconds	TCP recovers errors; maximize compression
Twitch	Every 1–2 seconds	Viewers join mid-stream; need fast sync
Zoom	Every frame	No time for error recovery

Same technique, three parameter choices — driven by how much time the application can afford to wait.

Act 2: Fixed-Rate Streaming (1996)

The First Attempt — and Why It Broke

RTP: Real-Time Transport Over UDP

Problem: TCP retransmits lost packets, and later data waits behind the retransmission. A voice packet arriving 200 ms late has missed its playback slot and is worse than silence.

But raw UDP has nothing: no loss detection, no timing, no codec info.

RTP (Schulzrinne, 1996) — a thin layer on UDP solving three problems:

Sequence numbers — detect lost packets (gaps)
Timestamps — reconstruct correct playback timing
Payload type — identify the codec

RTCP — companion reverse channel. Receiver reports loss rate, jitter, and RTT every few seconds.
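A sketch of the three RTP fields in code. The 12-byte layout follows the fixed header in the RTP specification (version 2, no padding, extension, or CSRC entries); the function names are illustrative, and the gap detector assumes packets arrive in order with no duplicates.

```python
import struct

RTP_HEADER = struct.Struct("!BBHII")  # 12-byte fixed header, network byte order

def pack_rtp(seq, timestamp, payload_type, ssrc, marker=0):
    """Build the fixed RTP header: V=2, P=0, X=0, CC=0."""
    byte0 = 2 << 6
    byte1 = (marker << 7) | (payload_type & 0x7F)
    return RTP_HEADER.pack(byte0, byte1, seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)

def unpack_rtp(header):
    """Parse the first 12 bytes of an RTP packet into a dict."""
    byte0, byte1, seq, timestamp, ssrc = RTP_HEADER.unpack(header[:12])
    return {"version": byte0 >> 6, "payload_type": byte1 & 0x7F,
            "seq": seq, "timestamp": timestamp, "ssrc": ssrc}

def missing_seqs(received):
    """Sequence numbers skipped between consecutive arrivals (16-bit wraparound)."""
    missing = []
    for prev, curr in zip(received, received[1:]):
        gap = (curr - prev) & 0xFFFF
        missing.extend((prev + k) & 0xFFFF for k in range(1, gap))
    return missing

# Payload type 0 is PCMU audio in the standard static table; the SSRC is arbitrary.
hdr = pack_rtp(seq=65535, timestamp=160, payload_type=0, ssrc=0x1234)
print(unpack_rtp(hdr))
print(missing_seqs([65534, 65535, 2, 3]))   # -> [0, 1]
```

The sequence number wraps at 16 bits, so the gap check works modulo 65536: after 65535 the next expected packet is 0.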

What Broke: NATs, Firewalls, No Adaptation

RTP assumed:

Endpoints have reachable IP addresses (UDP flows freely)
The network can sustain the encoded bitrate

Both assumptions broke.

NATs blocked incoming UDP. Corporate firewalls dropped non-HTTP traffic.
RealPlayer streamed at one fixed bitrate regardless of conditions — stuttering and freezing when the network dipped.

This was an open-loop system. The server sent at a fixed rate and never checked if the receiver was keeping up. No feedback, no adaptation.

Act 3: The HTTP Revolution (2005)

Deployability as the Binding Constraint

YouTube’s Insight: Use HTTP

YouTube (2005): stream video as an HTTP download — the same protocol that serves web pages.

This fixes RTP's reachability problem and adds reliable delivery:

HTTP passes every firewall. No special ports, no UDP blocking.
TCP retransmits lost packets. No corrupted frames.

The client-side buffer: download ahead of playback. Buffer fills during fast periods, drains during slow ones. As long as it doesn’t empty → smooth playback.

YouTube’s key decision: start playback in 1–2 seconds (not 10–30). Fast start, even at the risk of freezing later.

But: single bitrate. If the network drops below it, the buffer drains, and the video stalls — a mid-stream freeze. Users tolerate startup delay, but hate mid-stream freezes.
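A minimal simulation of the single-bitrate buffer, stepping one second at a time. It assumes the client downloads ahead without limit and resumes as soon as one second of video is buffered; real players rebuffer to a threshold before resuming. The name simulate_playback and the throughput trace are made up for illustration.

```python
def simulate_playback(throughput_kbps, bitrate_kbps, startup_s=2):
    """Return the seconds (after startup) in which playback stalled.

    throughput_kbps: network throughput for each one-second interval.
    bitrate_kbps: the single encoded bitrate of the video.
    startup_s: seconds of video buffered before playback begins.
    """
    buffered_kb, playing, stalls = 0, False, []
    for t, rate in enumerate(throughput_kbps):
        buffered_kb += rate                       # kilobits downloaded this second
        if not playing and buffered_kb >= startup_s * bitrate_kbps:
            playing = True
        if playing:
            if buffered_kb >= bitrate_kbps:
                buffered_kb -= bitrate_kbps       # play one second of video
            else:
                stalls.append(t)                  # buffer empty: freeze
    return stalls

# 1 Mbps video: fast network for 5 s, a 12 s dip to 400 kbps, then recovery.
trace = [2000] * 5 + [400] * 12 + [2000] * 3
print(simulate_playback(trace, bitrate_kbps=1000))   # -> [13, 15, 16]
```

The buffer built during the fast period absorbs the first eight seconds of the dip; once it runs out, the player freezes because it has no lower bitrate to fall back on.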

Act 4: Adaptive Streaming (2009–11)

DASH and the Client-Driven Control Loop

The Bitrate Ladder: Why One Encoding Isn’t Enough

Netflix encodes every movie at 6–7 quality levels. Four representative rungs:

Bitrate	Quality
145 kbps	Watchable on a phone — blocky
771 kbps	Medium — acceptable
2,358 kbps	1080p — standard HD
5,800 kbps	1080p high — high quality

Why multiple? The network varies. Campus WiFi: 5.8 Mbps works. On a train: 500 kbps is all you get.

The ladder gives the player options to adapt — graceful degradation instead of a freeze.

DASH: Radical Disaggregation

The video is chopped into segments (2–10 seconds each). Each available at every bitrate. A manifest file lists all options.

Three components, completely separated:

Server: Stateless. Holds segments + manifest. Doesn’t track clients.
Client: Stateful. Tracks buffer, estimates throughput, chooses bitrate per segment.
Network: Oblivious. Standard HTTP. Existing CDNs work without modification.

This is disaggregation — the same principle from Chapter 1. Encoding, delivery, and adaptation evolve independently.

The ABR Control Loop

The client runs a feedback loop — one decision per segment:

Observe: download segment, measure throughput
Estimate: smooth the measurement (like TCP’s RTT estimation)
Decide: pick a bitrate the network can sustain
Act: HTTP GET the next segment
Repeat every 2–10 seconds
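The loop above can be sketched as a throughput-based rule. This is one simple policy under stated assumptions, not the algorithm any particular player uses: the EWMA weight (0.5), the safety factor (0.8), and the function names are all illustrative choices, and the ladder reuses the four bitrates from the table.

```python
LADDER_KBPS = [145, 771, 2358, 5800]   # the four rungs from the bitrate table

def ewma(prev_estimate, sample, alpha=0.5):
    """Exponentially weighted moving average, the same idea as TCP's smoothed RTT."""
    if prev_estimate is None:
        return sample
    return (1 - alpha) * prev_estimate + alpha * sample

def choose_bitrate(estimate_kbps, ladder=LADDER_KBPS, safety=0.8):
    """Highest rung at or below safety * estimate; the lowest rung if none fits."""
    budget = safety * estimate_kbps
    fitting = [r for r in ladder if r <= budget]
    return max(fitting) if fitting else min(ladder)

def run_abr(measured_kbps):
    """One decision per segment: observe a throughput sample, estimate, decide."""
    estimate, choices = None, []
    for sample in measured_kbps:
        estimate = ewma(estimate, sample)
        choices.append(choose_bitrate(estimate))
    return choices

# Throughput drops from 8 Mbps to 1 Mbps after the second segment.
print(run_abr([8000, 8000, 1000, 1000, 1000]))   # -> [5800, 5800, 2358, 771, 771]
```

Note the third decision: the smoothed estimate still remembers the fast period, so the player picks 2,358 kbps on a 1,000 kbps link. Smoothing trades responsiveness for stability.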

Question: What can go wrong with this loop?

Too aggressive → buffer drains → stall
Too conservative → quality worse than needed
Oscillation → quality swings high ↔︎ low every few seconds
Multiple clients → synchronized quality drops

These failure modes → L11’s topic (BBA, MPC, neural ABR).

The Generational Arc

Act	System	What it solved	What broke
0	Raw media	—	1,000× too large
1	Compression	Shrinks 3 Gbps → 5 Mbps	—
2	RTP (1996)	Real-time transport	NATs; no adaptation
3	HTTP download (2005)	Firewalls; TCP reliability	Single bitrate → stalls
4	DASH/HLS (2009–11)	Adaptive per-segment	Prediction fails → L11

Each generation solved the previous one’s failure. Same arc as medium access: ALOHA → CSMA → CSMA/CA → OFDMA.

The Time Invariant

One Question, Three Architectures

When Must the Data Arrive?

Netflix: no deadline. Buffer holds 60–200 seconds. TCP retransmit is fine.

Zoom: within 150 ms. Large buffers destroy conversation. TCP retransmit exceeds the budget. Must use RTP/UDP.

Twitch: 5–60 second offset. Two-tier: RTMP ingest → HTTP/DASH to viewers.

	Netflix	Zoom	Twitch
Time	No deadline	150 ms	5–60 s offset
Buffer	60–200 s	50–200 ms	Server + viewer
Transport	HTTP/TCP	RTP/UDP	RTMP + HTTP
Who decides?	Client (ABR)	Both endpoints	Source → server
What kills UX?	Freeze	Latency	Chat lag

Time is the anchor — it constrains State, Coordination, Interface. Same cascade as medium access (L5–L8) and cellular (L8).

What Comes Next

L11: The streaming control loop — BBA (2014), MPC (2015), neural ABR (2017). Each solving the previous one’s failure.

L12: When buffering is forbidden — VoIP, the 150ms wall, jitter buffers, RTP/RTCP, WebRTC, Zoom.

Then: what if the network itself could help?

L13–L16 go inside the router — scheduling, queuing, active queue management.