The Perceptual Anchor: Why Multimedia Is Different

CS176C — Advanced Topics in Internet Computing

Arpit Gupta

2026-05-07

Where We Are

L1–L9 built the network stack from the bottom up: medium access, wireless architecture, transport, disaggregation.

The question was always: how does the network move bits?

Now we flip. Suppose the network works — packets go from A to B. A user opens Netflix, joins a Zoom call, watches a Twitch stream.

What goes wrong, and what did engineers build to fix it — generation by generation?

Act 0: The Bandwidth Problem

Why Is Multimedia Hard?

What Does Video Look Like as Bits?

A video frame is a grid of pixels. Each pixel: 8 bits × 3 colors (RGB) = 24 bits.

1080p at 60 fps — the arithmetic:

  • Pixels per frame: 1920 × 1080 = 2,073,600
  • Bits per frame: 2,073,600 × 24 ≈ 50 Mbits
  • Frames per second: 60
  • Total: ~3 Gbps

Resolution                 Raw bitrate
320×240 @ 15 fps (1990s)   27.6 Mbps
1080p @ 60 fps             ~3 Gbps
4K @ 60 fps                ~12 Gbps

A 1995 modem: 28.8 kbps. The video: 27.6 Mbps. The gap: roughly 1,000× (27.6 Mbps ÷ 28.8 kbps ≈ 960).
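
The arithmetic above can be reproduced in a few lines. This is a minimal sketch assuming 24 bits per pixel and no chroma subsampling; the function name `raw_bitrate_bps` is made up for illustration.

```python
def raw_bitrate_bps(width, height, fps, bits_per_pixel=24):
    """Uncompressed video bitrate in bits per second."""
    return width * height * bits_per_pixel * fps

MODEM_BPS = 28_800  # the 28.8 kbps modem from the slide

formats = {
    "320x240 @ 15 fps": (320, 240, 15),
    "1080p @ 60 fps": (1920, 1080, 60),
    "4K @ 60 fps": (3840, 2160, 60),
}

for name, (w, h, fps) in formats.items():
    bps = raw_bitrate_bps(w, h, fps)
    print(f"{name}: {bps / 1e6:,.1f} Mbps")

# How many times larger the 1990s video is than the modem's capacity.
gap = raw_bitrate_bps(320, 240, 15) / MODEM_BPS
print(f"gap vs. modem: {gap:.0f}x")
```

The exact gap is 960×, which the slide rounds to 1,000×.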

Act 1: Compression

Exploiting What’s Redundant and What’s Imperceptible

Three Kinds of Redundancy

Spatial (within a frame): The whiteboard behind me is one color. Encode “200×300 block = white” instead of 60,000 individual pixels.

Temporal (across frames): Two consecutive lecture frames — only the speaker’s hand moved. 95% identical. Encode only the difference.

Perceptual (what humans won’t notice): A loud sound masks a quiet one nearby — like trying to hear a whisper next to a jackhammer. Codecs discard what you can’t perceive.

Result: 3 Gbps → 5 Mbps. A 600× compression ratio. Now the video fits on a network.
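
A toy sketch of the first two kinds of redundancy, assuming a frame is just a flat list of pixel values. Real codecs work on blocks with motion compensation and transforms, not per-pixel comparison, so the helpers here (`run_length`, `encode_delta`, `apply_delta`) are illustrative stand-ins only.

```python
from itertools import groupby

def run_length(pixels):
    """Spatial redundancy: collapse runs of identical pixels into (value, count)."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]

def encode_delta(prev, curr):
    """Temporal redundancy: keep only (index, new_value) for pixels that changed."""
    return [(i, v) for i, (p, v) in enumerate(zip(prev, curr)) if p != v]

def apply_delta(prev, delta):
    """Rebuild the current frame from the previous frame plus the changes."""
    frame = list(prev)
    for i, v in delta:
        frame[i] = v
    return frame

frame1 = [255] * 100        # toy 100-pixel frame: an all-white whiteboard
frame2 = list(frame1)
frame2[40:45] = [0] * 5     # the speaker's hand covers five pixels

print(run_length(frame1))   # one run describes the whole frame
delta = encode_delta(frame1, frame2)
assert apply_delta(frame1, delta) == frame2
print(len(delta), "of", len(frame2), "pixels need to be sent")
```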

I-Frames and P-Frames: The Dependency Tradeoff

An I-frame is a self-contained photograph: it carries all the information needed to display it. Large (roughly 10× the size of a P-frame).

A P-frame stores only what changed since the last frame. Tiny — but depends on its predecessor.

Question: What happens if one P-frame’s reference is lost?

Errors cascade. Each P-frame builds on the corrupted reconstruction. Video garbles progressively until the next I-frame resets everything.
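
The cascade can be seen in a toy model, assuming each "frame" is a single number, an I-frame carries the absolute value, and a P-frame carries the difference from the previous frame. The names `encode` and `decode` and the concealment rule (repeat the last picture) are assumptions for illustration, not any real codec's behavior.

```python
def encode(values, gop):
    """Every `gop`-th frame is an I-frame (absolute); the rest are P-frames (delta)."""
    frames = []
    for n, v in enumerate(values):
        if n % gop == 0:
            frames.append(("I", v))
        else:
            frames.append(("P", v - values[n - 1]))
    return frames

def decode(frames, lost=frozenset()):
    """Decode, treating a lost frame as 'keep showing the last picture'."""
    out, current = [], 0
    for n, (kind, payload) in enumerate(frames):
        if n in lost:
            pass                  # conceal: reuse the previous reconstruction
        elif kind == "I":
            current = payload     # an I-frame resets the decoder state
        else:
            current += payload    # a P-frame builds on whatever was reconstructed
        out.append(current)
    return out

values = [10, 11, 13, 16, 20, 25, 31, 38]
frames = encode(values, gop=4)          # I-frames at positions 0 and 4
decoded = decode(frames, lost={1})      # lose a single P-frame
wrong = [n for n, (a, b) in enumerate(zip(values, decoded)) if a != b]
print(wrong)
```

Losing frame 1 corrupts frames 1, 2, and 3; the I-frame at position 4 resets the decoder. In this toy the error stays constant rather than growing, but the point is the same: one loss persists until the next I-frame.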

How often to insert I-frames?

Application   I-frame frequency    Why
Netflix       Every 4–8 seconds    TCP recovers errors; maximize compression
Twitch        Every 1–2 seconds    Viewers join mid-stream; need fast sync
Zoom          Every frame          No time for error recovery

Same technique, three parameter choices — driven by how much time the application can afford to wait.

Act 2: Fixed-Rate Streaming (1996)

The First Attempt — and Why It Broke

RTP: Real-Time Transport Over UDP

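Problem: TCP retransmits lost packets, and everything behind the lost packet waits for the retransmission. A voice packet arriving 200 ms late has missed its playback slot; it is worse than silence.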

But raw UDP has nothing: no loss detection, no timing, no codec info.

RTP (Schulzrinne, 1996) — a thin layer on UDP solving three problems:

  1. Sequence numbers — detect lost packets (gaps)
  2. Timestamps — reconstruct correct playback timing
  3. Payload type — identify the codec
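
The three fields above sit in RTP's 12-byte fixed header. Below is a minimal sketch of packing and parsing that header and of spotting gaps in the sequence numbers. The helper names and the SSRC value are made up, header extensions and CSRC lists are ignored, and the gap detector assumes packets arrive in order.

```python
import struct

# Fixed RTP header: V/P/X/CC byte, M/PT byte, 16-bit seq, 32-bit timestamp, 32-bit SSRC.
RTP_HEADER = struct.Struct("!BBHII")

def pack_rtp(seq, timestamp, payload_type, ssrc=0x1234ABCD, marker=0):
    byte0 = 2 << 6                                   # version 2, no padding/extension/CSRCs
    byte1 = (marker << 7) | (payload_type & 0x7F)    # marker bit + 7-bit payload type
    return RTP_HEADER.pack(byte0, byte1, seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)

def unpack_rtp(packet):
    byte0, byte1, seq, timestamp, ssrc = RTP_HEADER.unpack(packet[:RTP_HEADER.size])
    return {"version": byte0 >> 6, "payload_type": byte1 & 0x7F,
            "seq": seq, "timestamp": timestamp, "ssrc": ssrc}

def missing_seqs(received):
    """Sequence numbers skipped between consecutive arrivals (handles 16-bit wrap).

    Assumes in-order arrival; a reordered packet would be misread as a huge gap.
    """
    missing = []
    for prev, curr in zip(received, received[1:]):
        gap = (curr - prev) & 0xFFFF
        missing.extend((prev + k) & 0xFFFF for k in range(1, gap))
    return missing

# Payload type 0 is PCMU audio; 160 timestamp ticks = 20 ms at 8 kHz.
packets = [pack_rtp(seq, seq * 160, payload_type=0) for seq in (7, 8, 10, 11)]
seqs = [unpack_rtp(p)["seq"] for p in packets]
print(missing_seqs(seqs))   # packet 9 never arrived
```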

RTCP: the companion reverse channel. Every few seconds the receiver reports loss rate and jitter, plus timing fields from which the sender computes RTT.

What Broke: NATs, Firewalls, No Adaptation

RTP assumed:

  1. Endpoints have reachable IP addresses (UDP flows freely)
  2. The network can sustain the encoded bitrate

Both assumptions broke.

  • NATs blocked incoming UDP. Corporate firewalls dropped non-HTTP traffic.
  • RealPlayer streamed at one fixed bitrate regardless of conditions — stuttering and freezing when the network dipped.

This was an open-loop system. The server sent at a fixed rate and never checked if the receiver was keeping up. No feedback, no adaptation.

Act 3: The HTTP Revolution (2005)

Deployability as the Binding Constraint

YouTube’s Insight: Use HTTP

YouTube (2005): stream video as an HTTP download — the same protocol that serves web pages.

Solves two problems at once:

  • HTTP passes through virtually every firewall and NAT. No special ports, no UDP blocking.
  • TCP retransmits lost packets. No corrupted frames.

The client-side buffer: download ahead of playback. Buffer fills during fast periods, drains during slow ones. As long as it doesn’t empty → smooth playback.

YouTube’s key decision: start playback in 1–2 seconds (not 10–30). Fast start, even at the risk of freezing later.

But: single bitrate. If the network drops below it, the buffer drains, and the video stalls — a mid-stream freeze. Users tolerate startup delay, but hate mid-stream freezes.
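
The fill-and-drain behavior, and the stall when throughput falls below the single bitrate, can be sketched with a second-by-second simulation. This is a simplified model: `simulate_playback`, the 2-second startup threshold, and the throughput trace are all assumptions chosen for illustration.

```python
def simulate_playback(throughput_kbps, video_kbps, startup_s=2.0):
    """Return per-second buffer levels (seconds of video) and the number of stalls.

    Each wall-clock second the client downloads throughput / video_kbps seconds
    of video; while playing, it also consumes one second of video.
    """
    buffer_s, playing, stalls, levels = 0.0, False, 0, []
    for rate in throughput_kbps:
        buffer_s += rate / video_kbps          # download ahead of playback
        if not playing and buffer_s >= startup_s:
            playing = True                     # enough buffered to start or resume
        elif playing:
            if buffer_s >= 1.0:
                buffer_s -= 1.0                # play one second of video
            else:
                playing = False                # buffer ran dry: mid-stream freeze
                stalls += 1
        levels.append(round(buffer_s, 2))
    return levels, stalls

# 1,000 kbps video; the network delivers 2,000 kbps for 4 s, then drops to 250 kbps.
levels, stalls = simulate_playback([2000] * 4 + [250] * 8, video_kbps=1000)
print(levels)
print("stalls:", stalls)
```

The buffer climbs to 5 seconds during the fast period, then drains by 0.75 seconds per second until playback freezes. With a single bitrate the player has no other option.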

Act 4: Adaptive Streaming (2009–11)

DASH and the Client-Driven Control Loop

The Bitrate Ladder: Why One Encoding Isn’t Enough

Netflix encodes every movie at 6–7 quality levels. Four representative rungs:

Bitrate       Quality
145 kbps      Watchable on a phone, but blocky
771 kbps      Medium, acceptable
2,358 kbps    1080p, standard HD
5,800 kbps    1080p high, high quality

Why multiple? The network varies. Campus WiFi: 5.8 Mbps works. On a train: 500 kbps is all you get.

The ladder gives the player options to adapt — graceful degradation instead of a freeze.

DASH: Radical Disaggregation

The video is chopped into segments (2–10 seconds each). Each available at every bitrate. A manifest file lists all options.

Three components, completely separated:

  • Server: Stateless. Holds segments + manifest. Doesn’t track clients.
  • Client: Stateful. Tracks buffer, estimates throughput, chooses bitrate per segment.
  • Network: Oblivious. Standard HTTP. Existing CDNs work without modification.
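
One way to picture the separation: the manifest is plain data, and a segment is fully identified by (index, bitrate). The sketch below is a simplified stand-in. Real DASH manifests are XML (MPD) files, and the `Manifest` class, its fields, and the URL layout are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Manifest:
    segment_seconds: int
    segment_count: int
    bitrates_kbps: tuple
    url_template: str    # hypothetical layout, not a real CDN path

    def segment_url(self, index, bitrate_kbps):
        """Map (segment index, bitrate) to a plain HTTP path."""
        if bitrate_kbps not in self.bitrates_kbps or not 0 <= index < self.segment_count:
            raise ValueError("no such segment")
        return self.url_template.format(bitrate=bitrate_kbps, index=index)

manifest = Manifest(
    segment_seconds=4,
    segment_count=900,
    bitrates_kbps=(145, 771, 2358, 5800),
    url_template="/video/{bitrate}k/seg{index:05d}.mp4",
)

# The client, not the server, decides which bitrate to request for each segment.
print(manifest.segment_url(12, 771))
```

Because the URL alone determines the bytes returned, the server needs no per-client state and any HTTP cache can serve the segment.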

This is disaggregation — the same principle from Chapter 1. Encoding, delivery, and adaptation evolve independently.

The ABR Control Loop

The client runs a feedback loop — one decision per segment:

  1. Observe: download segment, measure throughput
  2. Estimate: smooth the measurement (like TCP’s RTT estimation)
  3. Decide: pick a bitrate the network can sustain
  4. Act: HTTP GET the next segment
  5. Repeat every 2–10 seconds
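
The loop can be sketched as a throughput-based rule: smooth the samples with an exponentially weighted moving average, then pick the highest rung that fits under a safety margin. This is one simple policy, not the algorithm any particular player uses; `alpha=0.25`, `safety=0.8`, and the throughput trace are assumptions.

```python
LADDER_KBPS = [145, 771, 2358, 5800]    # four rungs of a bitrate ladder

class ThroughputEstimator:
    """EWMA smoothing, in the spirit of TCP's smoothed RTT."""

    def __init__(self, alpha=0.25):
        self.alpha = alpha
        self.estimate = None

    def update(self, sample_kbps):
        if self.estimate is None:
            self.estimate = sample_kbps          # first sample seeds the estimate
        else:
            self.estimate = (1 - self.alpha) * self.estimate + self.alpha * sample_kbps
        return self.estimate

def choose_bitrate(estimate_kbps, ladder=LADDER_KBPS, safety=0.8):
    """Highest rung within a safety fraction of the estimate; lowest rung as fallback."""
    affordable = [b for b in ladder if b <= safety * estimate_kbps]
    return max(affordable) if affordable else min(ladder)

estimator = ThroughputEstimator()
choices = []
for measured in [8000, 8000, 900, 900, 900, 900]:   # one sample per downloaded segment
    estimate = estimator.update(measured)           # Observe + Estimate
    choices.append(choose_bitrate(estimate))        # Decide (Act = fetch at this rate)
print(choices)
```

After the network drops to 900 kbps, the smoothed estimate lags, and the rule keeps requesting 2,358 kbps for several segments. A throughput-only rule reacts slowly to sudden drops, so the buffer drains in the meantime.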

Question: What can go wrong with this loop?

  • Too aggressive → buffer drains → stall
  • Too conservative → quality worse than needed
  • Oscillation → quality swings high ↔︎ low every few seconds
  • Multiple clients → synchronized quality drops

These failure modes → L11’s topic (BBA, MPC, neural ABR).

The Generational Arc

Act   System                 What it solved               What broke
0     Raw media              -                            1,000× too large
1     Compression            Shrinks 3 Gbps → 5 Mbps      -
2     RTP (1996)             Real-time transport          NATs; no adaptation
3     HTTP download (2005)   Firewalls; TCP reliability   Single bitrate → stalls
4     DASH/HLS (2009–11)     Adaptive per-segment         Prediction fails → L11

Each generation solved the previous one’s failure. Same arc as medium access: ALOHA → CSMA → CSMA/CA → OFDMA.

The Time Invariant

One Question, Three Architectures

When Must the Data Arrive?

Netflix: no deadline. Buffer holds 60–200 seconds. TCP retransmit is fine.

Zoom: within 150 ms. Large buffers destroy conversation. TCP retransmit exceeds the budget. Must use RTP/UDP.

Twitch: 5–60 second offset. Two-tier: RTMP ingest → HTTP/DASH to viewers.

                 Netflix        Zoom             Twitch
Time             No deadline    150 ms           5–60 s offset
Buffer           60–200 s       50–200 ms        Server + viewer
Transport        HTTP/TCP       RTP/UDP          RTMP + HTTP
Who decides?     Client (ABR)   Both endpoints   Source → server
What kills UX?   Freeze         Latency          Chat lag

Time is the anchor — it constrains State, Coordination, Interface. Same cascade as medium access (L5–L8) and cellular (L8).

What Comes Next

L11: The streaming control loop — BBA (2014), MPC (2015), neural ABR (2017). Each solving the previous one’s failure.

L12: When buffering is forbidden — VoIP, the 150ms wall, jitter buffers, RTP/RTCP, WebRTC, Zoom.

Then: what if the network itself could help?

L13–L16 go inside the router — scheduling, queuing, active queue management.