The Perceptual Anchor: Why Multimedia Is Different

For nine lectures, we built the network stack from the bottom up — first principles and design methodology, then medium access from ALOHA to OFDMA, then wireless architecture from GSM to 5G. The question was always: how does the network move bits?

Now we flip the question. Suppose the network works — packets go from A to B. A user opens Netflix, or joins a Zoom call, or watches a Twitch stream. What goes wrong, and what did engineers build to fix it — generation by generation?

The next three lectures trace the evolution of multimedia networking. Like the medium access arc (L5–L8), each generation solved a problem the previous one couldn’t. Today’s lecture covers five acts of that evolution, numbered 0 through 4: the raw bandwidth problem, compression, fixed-rate streaming and its failure, progressive download, and adaptive streaming over HTTP. Along the way, we will see a single invariant, Time, emerge as the organizing principle that separates Netflix, Zoom, and Twitch into fundamentally different architectures.

Act 0: The bandwidth problem — why multimedia is hard

Before we can discuss how multimedia is delivered, we need to understand what multimedia data actually looks like as bits on a wire — and why it overwhelmed every network of its era.

What does audio look like digitally?

Sound is a continuous wave of air pressure. To transmit it over a digital network, we must convert it to a sequence of numbers. The standard method — Pulse Code Modulation, or PCM — works in two steps [1][7]:

1. Sample the analog signal at regular intervals. How often? Think of a sine wave that goes up and down. To capture its shape faithfully, you need at least two measurements per cycle: one for the peak and one for the trough. This is the Nyquist theorem: to capture frequencies up to F Hz, you must sample at least 2F times per second. Telephone-quality speech occupies roughly 300 Hz to 3,400 Hz, so telephone systems take 8,000 samples per second (8 kHz), comfortably above 2 × 3,400 = 6,800 and enough to capture that voice band.
2. Quantize each sample to a fixed number of levels. With 8 bits per sample, there are 2^8 = 256 possible values. The continuous amplitude is rounded to the nearest level.

The result: 8,000 samples/sec × 8 bits/sample = 64,000 bits/sec = 64 kbps. This is the G.711 codec, the universal standard for telephone audio: it was standardized in 1972, building on PCM telephony deployed in the 1960s. There is no perceptual compression, just digitization. Every telephone call you’ve ever made used something derived from this [1].
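The two PCM steps reduce to a pair of multiplications, which a short Python sketch can check. The function names are illustrative only; the inputs are the figures quoted above (a 3,400 Hz upper voice frequency, 8,000 samples per second, 8 bits per sample).

```python
def nyquist_rate_hz(max_frequency_hz: float) -> float:
    """Minimum sampling rate needed to capture frequencies up to max_frequency_hz."""
    return 2 * max_frequency_hz


def pcm_bitrate_bps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> int:
    """Bitrate of uncompressed PCM audio, in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


# The telephone voice band tops out near 3,400 Hz, so 8,000 samples/s clears the minimum.
minimum_rate = nyquist_rate_hz(3_400)
telephone_bps = pcm_bitrate_bps(sample_rate_hz=8_000, bits_per_sample=8)

print(f"Nyquist minimum: {minimum_rate:.0f} samples/s")
print(f"Telephone PCM:   {telephone_bps / 1000:.0f} kbps")
```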

64 kbps was expensive on early networks but manageable. Audio is not the bandwidth problem. Video is.

What does video look like digitally?

Video is a sequence of images (frames) displayed rapidly enough that the eye perceives smooth motion — typically 24 to 60 frames per second. Each frame is a grid of pixels. Each pixel stores color information. Human vision perceives three primary colors (red, green, blue), so each pixel uses 8 bits per color channel × 3 channels = 24 bits per pixel.

Consider a modest video from the mid-1990s — 320×240 pixels at 15 frames per second, the kind you might try to watch on a dial-up modem. Let’s walk through the arithmetic:

Pixels per frame: 320 × 240 = 76,800
Bits per frame: 76,800 × 24 = 1,843,200 bits (~1.8 Mbits)
Frames per second: 15
Total: 1,843,200 × 15 = 27,648,000 bits/sec ≈ 27.6 Mbps

A 1995 modem delivered 28.8 kbps at best. The raw video needs roughly 1,000 times the bandwidth the connection provides (27,648 kbps ÷ 28.8 kbps = 960) [7].

At modern resolutions, the problem scales up:

Resolution	Frame rate	Raw bitrate
320×240 (1990s web)	15 fps	27.6 Mbps
720p HD	30 fps	664 Mbps
1080p Full HD	60 fps	~3 Gbps
4K UHD	60 fps	~12 Gbps

Almost no residential internet connection carries 3 Gbps for a single video stream, and storing a 2-hour movie as raw 1080p would take about 2.7 terabytes. Raw video is far too large to transmit as-is. You cannot send video over any practical network without shrinking it first.
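The raw-bitrate figures can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the pixel dimensions for 720p, 1080p, and 4K UHD (1280×720, 1920×1080, 3840×2160) are the conventional ones and are not given in the text, and the function name is illustrative.

```python
def raw_video_bitrate_bps(width: int, height: int, fps: int, bits_per_pixel: int = 24) -> int:
    """Uncompressed video bitrate: pixels per frame x bits per pixel x frames per second."""
    return width * height * bits_per_pixel * fps


# Assumed pixel dimensions for each named format.
formats = {
    "320x240 @ 15 fps": (320, 240, 15),
    "720p @ 30 fps": (1280, 720, 30),
    "1080p @ 60 fps": (1920, 1080, 60),
    "4K UHD @ 60 fps": (3840, 2160, 60),
}

for name, (width, height, fps) in formats.items():
    mbps = raw_video_bitrate_bps(width, height, fps) / 1e6
    print(f"{name:18s} {mbps:10,.1f} Mbps")

# Storage for a 2-hour raw 1080p movie at 60 fps, in terabytes (10^12 bytes).
movie_terabytes = raw_video_bitrate_bps(1920, 1080, 60) * 2 * 3600 / 8 / 1e12
print(f"2-hour raw 1080p movie: {movie_terabytes:.2f} TB")

# How far the 1990s clip exceeds a 28.8 kbps modem.
modem_gap = raw_video_bitrate_bps(320, 240, 15) / 28_800
print(f"Raw 320x240 video vs. 28.8 kbps modem: {modem_gap:.0f}x")
```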

This is the first problem of multimedia networking: the data is too large. Everything that follows — every codec, every streaming protocol, every adaptation algorithm — exists because of this gap between raw media and network capacity.

Act 1: Compression — exploiting what’s redundant and what’s imperceptible

Compression (encoding) bridges the gap between raw media and network capacity. The key insight: natural audio and video contain enormous amounts of redundancy that can be removed without destroying the content [7].

Three kinds of redundancy

Spatial redundancy (within a single frame). Look at any frame of a video lecture. The whiteboard is one color. The wall is one color. The speaker’s shirt is one color. Large regions are nearly identical pixels. Instead of encoding each pixel individually, the encoder describes regions: “this 200×300 block is all white.” A lecture frame might compress 50× just from spatial redundancy. A basketball game frame, with thousands of individual spectators, compresses far less — perhaps 5× [7].

A compressed frame that contains all the information needed to display it — requiring no reference to any other frame — is called an I-frame (intra-coded frame). Think of it as a self-contained photograph. It is the largest type of encoded frame.

Temporal redundancy (across consecutive frames). Consider two consecutive frames of that lecture video. The speaker’s hand moved slightly. Everything else — whiteboard, wall, desk — is identical. Roughly 95% of the frame hasn’t changed. Why re-encode all of it?

A P-frame (predicted frame) encodes only what changed relative to the previous frame. Instead of storing the full image, it stores a compact set of instructions: “this 16×16 block moved 3 pixels left and 2 pixels down” [7]. P-frames are roughly one-tenth the size of I-frames.

The tradeoff is dependency. A P-frame is meaningless without its reference frame. If the reference is lost — due to a network error or a dropped packet — the P-frame’s instructions (“move this block 3 pixels left”) point to the wrong starting position. The decoder reconstructs a corrupted image. Worse, the next P-frame builds on this corrupted reconstruction, amplifying the error. And the next P-frame amplifies it further. Errors cascade forward, producing increasingly garbled video, until the next I-frame arrives and resets everything to a clean state.
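The cascade can be made concrete with a toy model. The sketch below assumes every P-frame references the frame immediately before it, and it treats a frame as either clean or corrupted; real decoders degrade more gradually. The function name and the string encoding of frame types are illustrative choices.

```python
def decodes_cleanly(frame_types: str, lost: set[int]) -> list[bool]:
    """For each frame, report whether it decodes without corruption.

    frame_types: one character per frame, "I" or "P" (for example "IPPPPIPPPP").
    lost: indexes of frames that never arrived.
    """
    results = []
    reference_ok = False  # nothing to predict from until an I-frame arrives
    for index, kind in enumerate(frame_types):
        arrived = index not in lost
        if kind == "I":
            ok = arrived                   # self-contained: needs no reference
        else:
            ok = arrived and reference_ok  # a P-frame is only as good as its reference
        results.append(ok)
        reference_ok = ok
    return results


stream = "IPPPP" * 2                     # two groups of five frames
outcome = decodes_cleanly(stream, lost={2})
picture = "".join("." if ok else "X" for ok in outcome)
print(stream)
print(picture)                           # prints "..XXX....."
```

One lost P-frame (index 2) corrupts every frame after it until the next I-frame at index 5 resets the decoder.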

How often should the encoder insert I-frames to limit this error propagation? This is the GOP (Group of Pictures) structure decision — and the answer depends on the application:

Netflix (pre-recorded): I-frame every 4–8 seconds. Errors are recovered by TCP retransmission, so error propagation is rare. Infrequent I-frames maximize compression.
Twitch (live): I-frame every 1–2 seconds. Viewers join mid-stream and need a fresh I-frame to start decoding. More frequent resets, higher bitrate.
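Zoom (real-time): no fixed interval. There is no time to wait for retransmission, so in the limit every frame would be an independently decodable I-frame. Because that gives up all temporal compression, real-time encoders typically send P-frames and insert a fresh I-frame on demand when the receiver reports loss.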

The same physics (temporal redundancy), the same technique (I/P frames), but three different parameter choices — driven entirely by how much time the application can afford to wait. We will return to this observation.
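The cost of inserting I-frames more often can be estimated with a back-of-the-envelope model. It is a sketch, assuming a P-frame is one-tenth the size of an I-frame (the rough figure quoted earlier), one I-frame per group, and 30 frames per second; the frame rate and function name are assumptions made for illustration.

```python
def relative_bitrate(gop_frames: int, p_to_i_ratio: float = 0.1) -> float:
    """Average frame size relative to an I-frame, for one I-frame followed by P-frames."""
    if gop_frames < 1:
        raise ValueError("a group of pictures needs at least one frame")
    return (1 + (gop_frames - 1) * p_to_i_ratio) / gop_frames


FPS = 30  # assumed frame rate

for label, seconds_between_i_frames in [("I-frame every 4 s", 4), ("I-frame every 2 s", 2)]:
    cost = relative_bitrate(seconds_between_i_frames * FPS)
    print(f"{label}: {cost:.3f} of the all-I-frame bitrate")

print(f"I-frame every frame: {relative_bitrate(1):.3f} of the all-I-frame bitrate")
```

Under these assumptions, halving the I-frame interval from 4 s to 2 s costs only a few percent more bitrate, while making every frame an I-frame costs roughly nine times as much. The worst-case duration of error propagation, by contrast, is exactly the I-frame interval.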

Perceptual redundancy (what humans won’t notice). The human ear cannot hear frequencies above roughly 20 kHz. And it has a blind spot: if a loud sound is playing, you cannot hear a quieter sound at a nearby frequency — like trying to hear a whisper next to a jackhammer. This is called auditory masking [1]. The human eye is far less sensitive to color detail than to brightness — you can encode color at one-quarter the resolution of brightness and nobody notices.

Codecs discard the information humans won’t miss. MP3 exploits auditory masking to compress CD-quality audio (1.41 Mbps) to 128 kbps — an 11× reduction — with quality most listeners cannot distinguish from the original [1]. Modern codecs like Opus, used in Zoom and Discord, compress speech to as low as 6 kbps while remaining intelligible [7].
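The saving from encoding color at one-quarter resolution is simple to quantify. The sketch below assumes the picture is split into one brightness channel and two color channels of 8 bits each, which is how video codecs commonly represent pixels but is not spelled out above; the function name is illustrative.

```python
def average_bits_per_pixel(
    brightness_bits: int = 8,
    color_bits: int = 8,
    color_resolution_fraction: float = 1.0,
) -> float:
    """Average bits per pixel with one brightness channel and two color channels.

    color_resolution_fraction is the share of pixel positions that carry color
    samples: 1.0 for full resolution, 0.25 for one-quarter resolution.
    """
    return brightness_bits + 2 * color_bits * color_resolution_fraction


full = average_bits_per_pixel(color_resolution_fraction=1.0)
quarter = average_bits_per_pixel(color_resolution_fraction=0.25)

print(f"Full-resolution color:    {full:.0f} bits/pixel")
print(f"Quarter-resolution color: {quarter:.0f} bits/pixel")
print(f"Saving before any other compression: {1 - quarter / full:.0%}")
```

Under these assumptions, half of the raw bits disappear before spatial or temporal compression has done anything.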

What compression achieves

Together, these three kinds of redundancy achieve compression ratios of 100× to 600× for video, and smaller but still substantial ratios for audio:

Content	Raw	Compressed	Ratio
Telephone voice (G.711)	64 kbps	64 kbps (uncompressed — telephone systems predated compression; they just digitize)	1×
CD-quality music (MP3)	1.41 Mbps	128 kbps	11×
Voice call (Opus)	64 kbps (G.711 baseline)	6–32 kbps	2–10× further compression
1080p video (H.264)	~3 Gbps	5–8 Mbps	400–600×

A 3 Gbps raw stream becomes a 5 Mbps encoded stream — watchable HD video over a residential internet connection.
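The ratios in the table follow directly from the bitrates. A minimal Python check, assuming CD audio is 44.1 kHz, 16-bit, stereo (the standard parameters behind the 1.41 Mbps figure) and using the rounded 3 Gbps value for raw 1080p:

```python
def compression_ratio(raw_bps: float, compressed_bps: float) -> float:
    """How many times smaller the compressed stream is than the raw one."""
    return raw_bps / compressed_bps


cd_audio_bps = 44_100 * 16 * 2  # assumed CD parameters: 44.1 kHz, 16 bits, 2 channels
mp3_ratio = compression_ratio(cd_audio_bps, 128_000)

raw_1080p_bps = 3e9             # rounded figure from the text
h264_low = compression_ratio(raw_1080p_bps, 8e6)
h264_high = compression_ratio(raw_1080p_bps, 5e6)

print(f"CD audio: {cd_audio_bps / 1e6:.2f} Mbps raw, MP3 ratio {mp3_ratio:.1f}x")
print(f"1080p H.264: {h264_low:.0f}x to {h264_high:.0f}x")
```

With the rounded 3 Gbps input, the lower end of the video range comes out slightly under 400×, which is consistent with the table's approximate figures.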

Compression solved the first problem. Multimedia data now fits on a network. But when people tried to actually stream compressed video over the internet, a new problem emerged.

Act 2: Fixed-rate streaming — and why it broke

The RTP era (1996): the first real-time transport

In the early 1990s, researchers on the MBone (Multicast Backbone) were experimenting with streaming audio and video conferences over the internet. They quickly discovered that TCP, the reliable transport protocol, was wrong for real-time media. TCP retransmits lost packets, which adds delay. A retransmitted voice packet arriving 200 ms late is worse than silence: the conversation has already moved on, and TCP’s in-order delivery holds back every packet behind it [4].

But raw UDP offered nothing: no loss detection, no timing information, no way to identify what codec the data was encoded with. Every application reinvented these primitives. In 1996, Henning Schulzrinne and colleagues standardized the solution: RTP (Real-time Transport Protocol) [4].

RTP is a thin layer on top of UDP that solves three specific problems:

Loss detection — a 16-bit sequence number increments with each packet. Gaps indicate lost packets.
Timing reconstruction — a 32-bit timestamp encodes the sampling time of each packet’s data. The receiver uses timestamps to play audio/video at the correct pace, regardless of when packets actually arrive.
Codec identification — a payload type field tells the receiver how to decode the data (voice? video? which codec?).
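The three fields above sit in a fixed 12-byte header defined by RFC 3550 [4]. The sketch below packs and parses that header with Python's `struct` module and uses the sequence number to find gaps. It is a minimal illustration rather than a full RTP implementation: it ignores padding, header extensions, and contributing-source lists; it also fills in the SSRC (sender identifier) field, which the header contains but the list above does not discuss. The helper names are hypothetical, and the gap check assumes packets arrive in order.

```python
import struct

# Fixed RTP header: 2 flag/type bytes, 16-bit sequence, 32-bit timestamp, 32-bit SSRC.
RTP_HEADER = struct.Struct("!BBHII")  # "!" = network byte order, no padding (12 bytes)


def pack_rtp_header(payload_type: int, sequence: int, timestamp: int, ssrc: int) -> bytes:
    first_byte = 2 << 6                  # version 2; no padding, extension, or CSRCs
    second_byte = payload_type & 0x7F    # marker bit left at 0
    return RTP_HEADER.pack(first_byte, second_byte,
                           sequence & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def parse_rtp_header(packet: bytes) -> dict:
    first_byte, second_byte, sequence, timestamp, ssrc = RTP_HEADER.unpack_from(packet)
    return {
        "version": first_byte >> 6,
        "payload_type": second_byte & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


def missing_sequence_numbers(previous: int, current: int) -> list[int]:
    """Sequence numbers skipped between two packets received in order (16-bit wraparound)."""
    gap = (current - previous) & 0xFFFF
    return [(previous + step) & 0xFFFF for step in range(1, gap)]


# Payload type 0 is G.711 (mu-law) audio in the standard RTP audio/video profile.
header = pack_rtp_header(payload_type=0, sequence=7, timestamp=1120, ssrc=0x1234)
fields = parse_rtp_header(header)
print(len(header), fields)
print(missing_sequence_numbers(65534, 1))   # wraps around: 65535 and 0 are missing
```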

RTP also included a companion: RTCP (RTP Control Protocol), which provides a reverse feedback channel. Periodically (every few seconds), each receiver sends a report back to the sender describing reception quality: the fraction of packets lost, the observed jitter (variability in arrival times), and timestamps from which the sender can estimate round-trip time [4].
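The jitter figure in a receiver report is a running estimate, defined in RFC 3550 [4] as a smoothed average of how much the spacing between arrivals differs from the spacing between sends. A minimal Python sketch of that estimator follows. It works in milliseconds for readability, whereas the RFC works in RTP timestamp units, and the function name is illustrative.

```python
def interarrival_jitter(send_times_ms: list[float], arrival_times_ms: list[float]) -> float:
    """Running jitter estimate in the style of RFC 3550: J += (|D| - J) / 16.

    D is the change in transit time between consecutive packets, that is,
    (arrival spacing) minus (send spacing).
    """
    jitter = 0.0
    for i in range(1, len(send_times_ms)):
        arrival_gap = arrival_times_ms[i] - arrival_times_ms[i - 1]
        send_gap = send_times_ms[i] - send_times_ms[i - 1]
        transit_change = arrival_gap - send_gap
        jitter += (abs(transit_change) - jitter) / 16
    return jitter


sent = [0, 20, 40, 60]           # one packet every 20 ms
steady = [50, 70, 90, 110]       # constant 50 ms transit: no jitter
uneven = [50, 70, 95, 110]       # third packet arrives 5 ms late

print(interarrival_jitter(sent, steady))
print(interarrival_jitter(sent, uneven))
```

The divisor of 16 makes the estimate slow to react, so a single late packet nudges the reported jitter rather than spiking it.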

What assumption did RTP make — and what broke?

RTP assumed two things [4]:

Endpoints have reachable IP addresses. The sender and receiver can exchange UDP packets directly.
The network is reasonably cooperative. Bandwidth is sufficient if the codec is chosen appropriately.

Both assumptions broke within a few years.

NATs and firewalls killed UDP reachability. By the late 1990s, most home and corporate networks sat behind Network Address Translators (NATs) that dropped unsolicited incoming UDP packets. Many corporate firewalls blocked everything except HTTP traffic. An RTP stream that worked on a university network failed for most real users [4].

The network was not cooperative. On dial-up and early broadband, bandwidth fluctuated wildly. Services like RealPlayer encoded their video at a single fixed bitrate and used RTP to deliver it, regardless of network conditions. When the network couldn’t sustain the rate, frames arrived late or not at all — the experience was constant stuttering and freezing [1].

This was effectively an open-loop system: the server transmitted without adjusting to whether the receiver was keeping up. RTCP reports could describe the damage after the fact, but a stream encoded at one fixed bitrate had no lower rate to fall back to. There was no adaptation to measured network conditions. The server just sent at a fixed rate and hoped for the best.

Act 3: The HTTP revolution — deployability as the binding constraint

Progressive download (mid-2000s): YouTube’s insight

Around 2005, YouTube took a radically different approach. Instead of RTP over UDP with specialized streaming servers, YouTube used plain HTTP over TCP — the same protocol that serves web pages [7].

Why HTTP? Because it addresses both of RTP’s practical failures at once:

Firewalls. Every firewall passes HTTP traffic. A video delivered via HTTP looks like a large web page to the network. No special ports, no UDP blocking, no corporate firewall issues.
Reliability. TCP automatically retransmits lost packets. No more corrupted frames or skipped video.

The architecture was simple: the server stored the video as a file. The client downloaded it progressively via HTTP GET requests, starting playback as soon as enough data had arrived, while the download continued in the background.

YouTube’s critical design decision: minimize startup delay. Earlier streaming services (RealPlayer, Windows Media) accumulated 10–30 seconds of data before starting playback, building a large reserve. YouTube started playback within 1–2 seconds — fast and responsive, even at the risk of freezing later if the network slowed down [7].

This introduced the concept of the client-side buffer — the reserve of downloaded-but-not-yet-played video that decouples the network’s variability from the constant-rate playback the user sees. The buffer fills during fast network periods and drains during slow ones. As long as it doesn’t empty, playback is smooth.

But progressive download had a critical limitation: single bitrate. The video was encoded once, at one quality level. If the network could sustain that rate, everything worked. If the network dropped below it, the buffer drained, and the video eventually froze — a stall (also called a rebuffering event). And research consistently shows: users tolerate a brief startup delay (they expect a loading indicator), but a mid-stream freeze — even a 1-second stall at minute 20 of a movie — dramatically increases abandonment [2][7].

What assumption did progressive download make — and what broke?

Progressive download assumed the network could sustain the video’s single bitrate for the duration of playback. On a stable wired connection, this worked. On WiFi, cellular, or shared broadband (your roommate starts a download), it failed. The network’s capacity fluctuated, but the video’s bitrate was fixed. The buffer was the only shock absorber, and for sustained network drops, no buffer was large enough.
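A coarse simulation shows the buffer acting as a shock absorber and then running out. It is a sketch in one-second steps, assuming playback starts once 2 seconds of video are buffered and that a second with less than one full second of video in the buffer counts as a stall; the throughput trace and all names are invented for illustration.

```python
def simulate_playback(throughput_kbps: list[float], video_kbps: float,
                      startup_buffer_s: float = 2.0) -> tuple[int, float]:
    """Return (stall_seconds, final_buffer_s) for a single-bitrate download.

    throughput_kbps holds one network measurement per second of wall-clock time.
    """
    buffer_s = 0.0
    started = False
    stall_seconds = 0
    for rate in throughput_kbps:
        buffer_s += rate / video_kbps        # seconds of video downloaded this second
        if not started:
            started = buffer_s >= startup_buffer_s
            continue                         # playback begins on the next tick
        if buffer_s >= 1.0:
            buffer_s -= 1.0                  # one second of video played
        else:
            stall_seconds += 1               # buffer ran dry: the picture freezes
    return stall_seconds, buffer_s


VIDEO_KBPS = 1000
stable = [1000] * 25
short_dip = [2000] * 5 + [500] * 10 + [2000] * 5
long_dip = [2000] * 5 + [500] * 15 + [2000] * 5

for name, trace in [("stable", stable), ("10 s dip", short_dip), ("15 s dip", long_dip)]:
    stalls, _ = simulate_playback(trace, VIDEO_KBPS)
    print(f"{name}: {stalls} s of stalling")
```

In this trace the buffer built up during the fast period absorbs a 10-second dip to half the video bitrate, but a 15-second dip drains it and playback stalls.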

Act 4: Adaptive streaming — DASH and the client-driven control loop

The key idea: multiple bitrates, per-segment choice

The solution arrived in 2009–2011, from multiple directions simultaneously. Apple created HLS (HTTP Live Streaming) [8]. Microsoft created Smooth Streaming. Adobe created HDS. In 2011, MPEG standardized the approach as DASH (Dynamic Adaptive Streaming over HTTP) [3][9].

The architecture rests on two ideas:

1. Encode the video at multiple bitrates — a bitrate ladder.

Netflix, for example, encodes every movie at 6–7 quality levels [2][7]:

Bitrate	Quality
145 kbps	Watchable on a phone — blocky, blurry
356 kbps	Low — visible artifacts
771 kbps	Medium — acceptable on a laptop
1,418 kbps	720p — good
2,358 kbps	1080p — standard HD
5,800 kbps	1080p high — high-quality HD

Why this specific spacing? Quality improvement follows a pattern of diminishing returns: each additional kbps buys less improvement. The jump from 145 to 771 kbps transforms barely watchable into acceptable. The jump from 2,358 to 5,800 kbps is a modest refinement. The ladder is designed so that each step offers a perceptible quality improvement worth the bandwidth cost [2].

2. Divide the video into short segments and let the client choose per segment.

The video is chopped into segments — typically 2 to 10 seconds of playback each. Every segment is available at every bitrate level. A manifest file (called an MPD in DASH or an M3U8 in HLS) lists all segments, their durations, and the URLs for each bitrate version [3][8].

The client downloads one segment at a time. For each segment, it independently decides which bitrate to request based on current network conditions.
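The information a manifest carries can be sketched as a small data structure. This is a simplified stand-in, not the MPD or M3U8 syntax: the class, the base URL, the 4-second segment length, and the URL naming scheme are all hypothetical, while the bitrate ladder is the one from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    base_url: str             # hypothetical CDN location
    segment_duration_s: float
    segment_count: int
    bitrates_kbps: tuple      # the bitrate ladder

    def segment_url(self, bitrate_kbps: int, index: int) -> str:
        """URL of one segment at one quality level (naming scheme is made up)."""
        if bitrate_kbps not in self.bitrates_kbps:
            raise ValueError(f"no {bitrate_kbps} kbps version in this manifest")
        if not 0 <= index < self.segment_count:
            raise IndexError(f"segment {index} is out of range")
        return f"{self.base_url}/{bitrate_kbps}k/seg{index:05d}.m4s"


manifest = Manifest(
    base_url="https://cdn.example.com/movie",
    segment_duration_s=4.0,
    segment_count=1800,                      # 1800 x 4 s = 2 hours
    bitrates_kbps=(145, 356, 771, 1418, 2358, 5800),
)

# The client may pick a different rung of the ladder for every segment.
print(manifest.segment_url(771, 3))
print(manifest.segment_url(2358, 4))
```

Because every segment exists at every bitrate, switching quality is nothing more than requesting a different URL for the next segment.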

The architecture: radical disaggregation

DASH separates three concerns that earlier systems entangled [3][9]:

Server: Stateless content store. It holds encoded segments and the manifest file. It does not track client state, does not measure network conditions, does not make quality decisions. It simply responds to HTTP GET requests.
Client: Stateful controller. It tracks its own buffer level (seconds of video buffered), estimates network throughput from recent segment downloads, and makes per-segment bitrate decisions using an ABR (Adaptive Bitrate) algorithm.
Network: Oblivious. Standard HTTP traffic over TCP. No special routers, no QoS, no video-awareness. Existing CDN infrastructure (content delivery networks that cache popular content near users) works without modification.

This is the same disaggregation principle from Chapter 1: encoder, storage, delivery, and adaptation are separated. Each can evolve independently. Netflix can change its ABR algorithm without modifying any server or network equipment. CDN providers can optimize caching without understanding video. The manifest file is the narrow-waist interface — the rendezvous point between encoder and player [3].

The ABR control loop: a feedback system over HTTP

The client’s ABR algorithm is a closed-loop control system — structurally identical to the feedback loops we studied in L3 (TCP congestion control) and L6 (CSMA/CA’s ACK-based feedback) [5][7]:

Observe: Download segment N. Measure how long it took. Compute throughput: bytes received ÷ download time.
Estimate: Smooth the measurement by averaging recent segments, giving more weight to the most recent ones (a technique called exponential moving average — the same smoothing TCP uses for RTT estimation). This avoids overreacting to a single unusually fast or slow download.
Decide: Compare estimated throughput and current buffer level to the available bitrate options. If throughput is high and buffer is growing, request a higher bitrate. If throughput dropped or buffer is draining, request a lower bitrate.
Act: Send HTTP GET for the next segment at the selected bitrate.
Repeat: Every segment (2–10 seconds).

The loop period is one segment — deliberately slow. Frequent bitrate changes are visually jarring (the picture quality oscillates). Slow adaptation smooths quality changes but sacrifices responsiveness to sudden network drops [5].
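The observe, estimate, decide, act cycle can be sketched in a few functions. This is one simple throughput-based policy under stated assumptions, not the algorithm of any particular player: the smoothing weight (0.3), the safety factor (0.8), the low-buffer threshold (10 s), the 4-second segments, and the fixed segment sizes in the example are all illustrative choices.

```python
from typing import Optional

LADDER_KBPS = [145, 356, 771, 1418, 2358, 5800]  # bitrate ladder from the table above


def measure_throughput_kbps(segment_bytes: int, download_seconds: float) -> float:
    """Observe: bytes received divided by download time, expressed in kbps."""
    return segment_bytes * 8 / 1000 / download_seconds


def smooth(previous_kbps: Optional[float], sample_kbps: float, alpha: float = 0.3) -> float:
    """Estimate: exponential moving average that weights recent samples more."""
    if previous_kbps is None:
        return sample_kbps
    return alpha * sample_kbps + (1 - alpha) * previous_kbps


def choose_bitrate_kbps(estimate_kbps: float, buffer_s: float,
                        safety: float = 0.8, low_buffer_s: float = 10.0) -> int:
    """Decide: highest rung that fits a safety-discounted estimate."""
    budget = estimate_kbps * safety
    if buffer_s < low_buffer_s:
        budget *= 0.5                        # be more cautious when the buffer is thin
    affordable = [rate for rate in LADDER_KBPS if rate <= budget]
    return affordable[-1] if affordable else LADDER_KBPS[0]


def run_abr(downloads, buffer_s: float = 20.0, segment_s: float = 4.0) -> list[int]:
    """downloads: (segment_bytes, download_seconds) pairs, one per segment."""
    estimate = None
    decisions = []
    for segment_bytes, download_seconds in downloads:
        sample = measure_throughput_kbps(segment_bytes, download_seconds)
        estimate = smooth(estimate, sample)
        # Playback drained the buffer during the download; the new segment refills it.
        buffer_s = max(0.0, buffer_s - download_seconds) + segment_s
        decisions.append(choose_bitrate_kbps(estimate, buffer_s))  # Act: next GET
    return decisions


# Two fast downloads (3,000 kbps) followed by a slow one (1,000 kbps).
print(run_abr([(1_500_000, 4.0), (1_500_000, 4.0), (1_500_000, 12.0)]))
```

In this run the smoothed estimate falls only from 3,000 to 2,400 kbps after the slow download, so the client steps down one rung rather than collapsing to the bottom of the ladder.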

What assumption does DASH make — and what breaks?

DASH assumes that past throughput predicts future throughput well enough to make good bitrate decisions. This works on stable networks. It breaks on volatile ones — WiFi handoffs, cellular bandwidth fluctuations, shared home broadband.

When the prediction is wrong:

Too aggressive: Client requests high bitrate → download takes longer than expected → buffer drains → if buffer empties, the video stalls. Worst case: the client panics, drops to the lowest bitrate, the user sees a jarring quality collapse.
Too conservative: Client requests low bitrate when the network could sustain higher → quality is worse than necessary → available bandwidth goes unused.
Oscillation: Client alternates between high and low bitrates — quality swings visibly every few seconds. This happens when the ABR algorithm uses the same threshold for increasing and decreasing bitrate — it keeps crossing the boundary in both directions. The fix is hysteresis: use a higher threshold to increase quality than to decrease it, so the algorithm doesn’t flip-flop [5][7].

And when multiple clients share the same bottleneck link, each running its own ABR loop independently, their decisions interact: one client’s increase triggers congestion, causing all clients to reduce quality simultaneously — synchronized quality drops [5].
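The hysteresis fix for oscillation described above can be demonstrated with a throughput estimate that hovers around one rung of the ladder. The sketch assumes a one-rung-at-a-time policy and a 30% margin for stepping up; both are illustrative choices, as are the function names.

```python
LADDER_KBPS = [145, 356, 771, 1418, 2358, 5800]  # bitrate ladder from the table above


def next_bitrate(current_kbps: int, estimate_kbps: float, up_margin: float) -> int:
    """Move at most one rung. Stepping up requires the estimate to exceed the
    next rung by up_margin; stepping down happens when it falls below the current rung."""
    index = LADDER_KBPS.index(current_kbps)
    if index + 1 < len(LADDER_KBPS) and estimate_kbps >= LADDER_KBPS[index + 1] * up_margin:
        return LADDER_KBPS[index + 1]
    if index > 0 and estimate_kbps < current_kbps:
        return LADDER_KBPS[index - 1]
    return current_kbps


def run(estimates_kbps, start_kbps: int, up_margin: float) -> list[int]:
    current = start_kbps
    choices = []
    for estimate in estimates_kbps:
        current = next_bitrate(current, estimate, up_margin)
        choices.append(current)
    return choices


# The estimate wobbles just below and just above the 2,358 kbps rung.
wobble = [2300, 2400, 2300, 2400]

same_threshold = run(wobble, start_kbps=1418, up_margin=1.0)
with_hysteresis = run(wobble, start_kbps=1418, up_margin=1.3)

print("same threshold: ", same_threshold)    # flips between 1418 and 2358
print("with hysteresis:", with_hysteresis)   # stays at 1418
```

The price of stability is visible in the second run: the client stays one rung lower than the network could, on average, almost sustain.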

These failure modes — and the algorithmic solutions (buffer-based ABR, model-predictive control, neural ABR) — are the subject of L11.

The generational arc

Act	System	What it solved	What broke (motivating the next act)
0	Raw media	—	Data is 1,000× too large for the network
1	Compression (I/P frames, perceptual coding)	Shrinks 3 Gbps to 5 Mbps	— (now it fits; the delivery problem begins)
2	RTP fixed-rate streaming (1996)	Real-time transport over UDP	NATs blocked UDP; no adaptation to network variability
3	Progressive HTTP download (2005)	Bypasses firewalls; TCP reliability	Single bitrate; stalls when network drops below it
4	DASH/HLS adaptive streaming (2009–11)	Multiple bitrates; per-segment client-driven adaptation	Throughput prediction fails; oscillation; cross-client interference (→ L11)

Each generation solved a problem the previous one couldn’t. Each solution introduced a new failure mode that motivated the next. The arc mirrors medium access: ALOHA (no coordination) → CSMA (sense first) → CSMA/CA (ACK feedback) → OFDMA (centralized scheduling). Here: raw → compress → fixed-stream → buffer → adapt. More information at each step, better control, higher quality.

The Time invariant: why Netflix, Zoom, and Twitch are different systems

Everything so far has been about stored video — Netflix, YouTube, pre-recorded content where there is no deadline on when data must arrive. But multimedia includes systems with radically different time constraints. Consider three applications:

Netflix — pre-recorded movie. The user presses play whenever they want.
Zoom — live voice and video call. Speakers respond in real time.
Twitch — live gameplay broadcast. Viewers see it 5–10 seconds after it happens.

All three use compression. All three face network variability. But their architectures are fundamentally different — and the reason is a single question: when must the data arrive?

Netflix (stored streaming): no deadline. The movie was recorded months ago. The buffer can hold 60–200 seconds of video [7]. Codecs can spend hours optimizing compression. TCP retransmission is fine — the buffer absorbs the delay. The client runs the ABR control loop from Act 4, choosing bitrates per segment. The metric that matters: avoid mid-stream freezes.

Zoom (conversational): within 150 milliseconds. Natural conversation requires one-way (mouth-to-ear) delay below roughly 150 ms [6]. Beyond that, speakers talk over each other; turn-taking rhythm breaks. This single constraint eliminates most of Netflix’s toolkit: large buffers add seconds of delay (impossible), each TCP retransmission costs at least a round trip, typically 50–100 ms per retry (which blows the budget), and expensive codecs that take seconds to encode are unusable. The protocol must be lightweight: RTP over UDP, the very protocol from Act 2 that HTTP replaced for streaming. Lost packets are concealed (the codec interpolates the gap), not retransmitted. The metric that matters: latency, not throughput.
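A delay budget makes the argument against retransmission concrete. The sketch below uses the 150 ms limit and the 50 ms lower bound on retry cost from the paragraph above; the encoding time (20 ms), one-way network delay (40 ms), and jitter-buffer depth (50 ms) are assumed values chosen only to illustrate the arithmetic.

```python
BUDGET_MS = 150  # conversational one-way delay limit


def one_way_delay_ms(encode_ms: float, network_ms: float, jitter_buffer_ms: float,
                     retransmissions: int = 0, retry_cost_ms: float = 50) -> float:
    """Sum the delay components between capture at the sender and playout at the receiver."""
    return encode_ms + network_ms + jitter_buffer_ms + retransmissions * retry_cost_ms


# Assumed component delays for illustration.
components = dict(encode_ms=20, network_ms=40, jitter_buffer_ms=50)

no_retry = one_way_delay_ms(**components)
one_retry = one_way_delay_ms(**components, retransmissions=1)

for label, delay in [("no retransmission", no_retry), ("one retransmission", one_retry)]:
    verdict = "within budget" if delay <= BUDGET_MS else "over budget"
    print(f"{label}: {delay:.0f} ms ({verdict})")
```

With these numbers a single retry, even at the optimistic 50 ms cost, pushes the total past the limit, which is why loss is concealed rather than repaired.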

Twitch (live streaming): 5–60 second offset. The content is live, but viewers accept being a few seconds behind. This enables a two-tier architecture: the broadcaster sends one stream to the server via RTMP (low-latency ingest), and the server distributes to thousands of viewers via HTTP/DASH (the scalable infrastructure from Act 4) [7]. YouTube Live chooses 30–60 seconds of offset (cheaper CDN caching). Twitch chooses 5–10 seconds (viewers need near-real-time chat interaction with the streamer).

One question, three architectures

	Netflix (stored)	Zoom (conversational)	Twitch (live)
When must data arrive?	No deadline	Within 150 ms	5–60 s offset
Buffer	60–200 s	50–200 ms (jitter only)	Server 2–60 s + viewer 10–60 s
Encoding	Offline, hours per title	Real-time, 20 ms per frame	Real-time encode, server transcode
Transport	HTTP/TCP	RTP/UDP	RTMP ingest + HTTP/DASH egress
Who decides quality?	Client (ABR)	Both endpoints	Source → server → viewers
Loss handling	Retransmit (TCP)	Conceal (interpolate)	Retransmit (TCP, viewer-side)
What kills the experience?	Mid-stream freeze	Latency > 150 ms	Chat lag vs. viewer experience

Time is the anchor. It constrains State (what must the system track?), Coordination (who decides?), and Interface (what protocol?) — the same cascade we saw in L5–L8, where the shared broadcast medium was the anchor for medium access and licensed spectrum was the anchor for cellular. Different anchor, same framework, same cascade.

What comes next

Today traced five acts of multimedia evolution, from the raw bandwidth problem through compression, fixed-rate streaming, progressive download, and adaptive HTTP streaming, and identified Time as the invariant that separates stored, conversational, and live multimedia into different architectures.

Two problems remain open:

L11: The streaming control loop. DASH’s ABR algorithm must predict an unpredictable network. When the prediction fails, the result is oscillation or stalling. We will trace the next generation of the pioneer arc: buffer-based ABR (Huang 2014) [5], model-predictive control (Yin 2015) [5], and neural ABR (Mao 2017) — each solving a failure in the previous approach, each grounded in the closed-loop reasoning from L3.

L12: When buffering is forbidden. Zoom’s 150 ms deadline eliminates large buffers, TCP, and complex codecs. How does the receiver reconstruct smooth audio from irregularly arriving packets? How does the sender learn about conditions when there is no time for feedback? We will study jitter buffers, RTP/RTCP in depth, and modern real-time systems — WebRTC (2021) [10] and Zoom — that push the time constraint to its extreme.

And after three lectures of watching applications cope with the network’s limitations, we will ask: what if the network itself could help? That question takes us inside the router — scheduling, queuing, and active queue management — starting in L13.

References

[1] Kurose, J. F. and Ross, K. W. (2021). Computer Networking: A Top-Down Approach, 8th Edition. Pearson.

[2] Dobrian, F. et al. (2011). “Understanding the Impact of Video Quality on User Engagement.” Proc. ACM SIGCOMM.

[3] Sodagar, I. (2011). “The MPEG-DASH Standard for Multimedia Streaming over the Internet.” IEEE MultiMedia.

[4] Schulzrinne, H., Casner, S., Frederick, R., and Jacobson, V. (2003). “RTP: A Transport Protocol for Real-Time Applications.” RFC 3550.

[5] Huang, T.-Y. et al. (2014). “A Buffer-Based Approach to Rate Adaptation: Evidence from a Large Video Streaming Service.” Proc. ACM SIGCOMM; Yin, X. et al. (2015). “A Control-Theoretic Approach for Dynamic Adaptive Video Streaming over HTTP.” Proc. ACM SIGCOMM.

[6] ITU-T Recommendation G.114 (2003). “One-way transmission time.” International Telecommunication Union.

[7] Gupta, A. (2026). A First-Principles Approach to Networked Systems, Ch. 7: Multimedia Applications. UC Santa Barbara.

[8] Pantos, R. and May, W. (2017). “HTTP Live Streaming.” RFC 8216.

[9] ISO/IEC 23009-1 (2019). “Dynamic Adaptive Streaming over HTTP (DASH) — Part 1: Media Presentation Description and Segment Formats.”

[10] Alvestrand, H. (2021). “Overview: Real-Time Protocols for Browser-Based Applications.” RFC 8825.