flowchart TD
A[Perceptual deadline <150ms] -->|forces| B[Cannot retransmit]
B -->|forces| C[Loss detection + concealment at endpoint]
C -->|1996: RTP| D[Sequence + timestamp over UDP]
D -->|NAT/firewall break UDP reachability| E[Must move to HTTP]
E -->|2009: HLS| F[Chunks over HTTP + client-driven ABR]
F -->|throughput signal is gamed by controller| G[Need honest sensor]
G -->|2014: BBA| H[Buffer-based ABR]
F -->|myopic decisions waste quality| I[Need horizon]
I -->|2015: MPC| J[Model predictive control]
F -->|chunk cadence floors live latency| K[Need sub-chunk delivery]
K -->|2019: LL-HLS| L[Partial segments + HTTP/2 push]
D -->|browsers need plugin-free RTC| M[Browser-native stack]
M -->|2021: WebRTC| N[ICE + GCC + SFU]
classDef constraint fill:#e3f2fd,stroke:#1565c0;
classDef failure fill:#ffebee,stroke:#c62828;
classDef fix fill:#e8f5e9,stroke:#2e7d32;
class A,B,C,E,G,I,K,M constraint;
class D,F,H,J,L,N fix;
12 Multimedia Applications
12.1 The Anchor: Human Perception as System Constraint
Multimedia applications sit at an unusual place in the stack. Every other application negotiates with time — file transfer tolerates seconds of variance, email tolerates minutes, web browsing tolerates hundreds of milliseconds of jitter so long as the page eventually arrives. Multimedia is bound by biology, with no room to negotiate. The human ear detects voice gaps shorter than 50 milliseconds. The eye perceives a video freeze lasting a quarter-second. At 150 milliseconds of one-way conversational latency, talkers start interrupting each other; beyond 400 milliseconds, conversation collapses (Schulzrinne et al. 2003). These are not engineering targets. They are biological facts, inherited from human perception, that cascade through every layer of the design.
The binding constraint is therefore human perceptual requirements meeting a best-effort network. The application inherits two fixed realities: (1) perceptual deadlines set by biology, and (2) variable bandwidth, jitter, and loss from the IP substrate below. Reliability trades against latency (a retransmitted voice packet arriving 200ms late is worse than silence). Latency trades against quality below the perceptual floor (a crystal-clear video that rebuffers every ten seconds is unwatchable). The application must adapt — continuously, in real time — to whatever the network delivers, while staying inside a deadline the user imposes.
“Real-time services have requirements that are different from traditional data traffic… end-to-end delay and delay jitter are the primary concerns, not throughput or reliable delivery.” — Schulzrinne et al., 1996 (Schulzrinne et al. 2003)
The four decision problems every multimedia system must continuously answer:
- What bitrate to choose for the next chunk or frame — higher means better quality but longer downloads, risking stalls.
- How to estimate network conditions from delivery signals — chunk download time, RTT, loss rate, jitter.
- How to handle packet loss under a real-time deadline — retransmit, conceal, or encode redundantly.
- Where to place the decision — client, server, or network.
Yin et al. (Yin et al. 2015) retrospectively framed the ABR version of this as an optimization: maximize quality minus rebuffering time minus switch magnitude, over a planning horizon. That framing came later; the pioneers each saw only the local piece they could name.
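One common way to write that retrospective framing compactly (with q(·) a perceptual quality map, T_k the stall time induced by chunk k, and λ, μ tunable penalty weights):

```latex
\max_{R_1,\dots,R_K}\;
\underbrace{\sum_{k=1}^{K} q(R_k)}_{\text{quality}}
\;-\;
\underbrace{\lambda \sum_{k=1}^{K} T_k}_{\text{rebuffering}}
\;-\;
\underbrace{\mu \sum_{k=1}^{K-1} \bigl|\, q(R_{k+1}) - q(R_k) \,\bigr|}_{\text{switch magnitude}}
```

Each of the chapter's acts can be read as fixing one term of this objective, or one sensor feeding it, at a time.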
12.2 Act 1: “It’s 1996. Voice and Video Over the MBone.”
It’s 1996. The MBone is carrying IETF meeting audio live across the internet. Researchers are stringing together UDP-based tools — vat, vic, sdr — each reinventing the primitives needed to turn a UDP flow into a timed media stream. Henning Schulzrinne, Stephen Casner, Ron Frederick, and Van Jacobson consolidate those primitives into a standard.
“RTP does not provide any mechanism to ensure timely delivery or provide other quality-of-service guarantees, but relies on lower-layer services to do so. It does not guarantee delivery or prevent out-of-order delivery.” — Schulzrinne et al., 1996 (Schulzrinne et al. 2003)
What the pioneers saw: Endpoints with global IP addresses, willing to cooperate, speaking UDP. Bandwidth was scarce but honest — you got what you got. The job was to package media so the receiver could detect loss, reconstruct timing, and identify the codec — without adding the latency that TCP’s retransmission would impose.
What remained invisible: NATs would proliferate within five years, breaking the assumption of reachable endpoints. Firewalls would block UDP at the enterprise edge, pushing real-time media onto TCP and HTTP. Scale would demand that content be cacheable by the network — a capability RTP’s per-session model lacked.
RTP embodies an “application-level framing” design philosophy: rather than providing a complete, rigid protocol, it offers a thin framework on UDP that lets each application define its own payload format, packetization, and error-recovery strategy. This deliberate thinness is the Interface invariant answer — the protocol specifies only what information applications must carry, leaving everything else to the application.
Schulzrinne applied disaggregation by separating data transport (RTP) from control feedback (RTCP), and separating loss detection (sequence numbers) from timing reconstruction (timestamps). He applied closed-loop reasoning with RTCP as the feedback channel — receivers periodically report fraction-lost, jitter estimate, and interarrival variance back to the sender, which can adapt bitrate or switch codecs in response. The interaction between the forward data channel and the reverse feedback channel is shown in Figure 12.1.
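The fixed RTP header carries exactly the primitives the section names — sequence number for loss detection, timestamp for timing reconstruction, SSRC for stream identity, payload type for the codec. A minimal parser of the 12-byte fixed header (field layout per RFC 3550; the sample packet values are synthetic):

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the 12-byte fixed RTP header (RFC 3550, Section 5.1)."""
    if len(packet) < 12:
        raise ValueError("packet shorter than fixed RTP header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,            # always 2 for RTP
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),     # e.g. marks the end of a video frame
        "payload_type": b1 & 0x7F,     # identifies the codec
        "sequence": seq,               # loss detection
        "timestamp": timestamp,        # playout-clock reconstruction
        "ssrc": ssrc,                  # stream identity
    }

# Synthetic packet: version 2, payload type 96, seq 7, ts 160, SSRC 0xDEADBEEF
pkt = struct.pack("!BBHII", 0x80, 96, 7, 160, 0xDEADBEEF)
hdr = parse_rtp_header(pkt)
```

Note how little is here: no delivery guarantee, no congestion signal — just enough for the receiver to detect loss and rebuild timing, exactly the "thin framework" the text describes.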
12.2.1 Invariant Analysis: RTP/RTCP (1996)
| Invariant | RTP’s Answer (1996) | Gap? |
|---|---|---|
| State | 16-bit sequence number detects loss; SSRC identifies stream | No belief about the network — only about packets |
| Time | 32-bit timestamp reconstructs playout clock; NTP wallclock in RTCP Sender Reports enables inter-media synchronization (lip-sync); RTCP reports every few seconds | Timestamp accuracy depends on sender clock; RTCP feedback too slow for reactive congestion control |
| Coordination | End-to-end; sender adapts from RTCP reports | Periodic (seconds) — misses transient spikes |
| Interface | Thin layer on UDP; payload type identifies codec | Middleboxes (NAT, firewall) do not understand RTP |
The State gap is structural: a sequence number tells you a packet is missing but nothing about why — was the buffer full, was the radio in a fade, did a router reboot? The sender sees only “packet missing” — congestion loss and channel loss look identical, leaving the choice between “slow down” and “add FEC” unresolved. The Time gap is operational: RTCP reports arriving every five seconds tell the sender about conditions that are already five seconds stale. For a VoIP call that tolerates 150ms delay, a five-second feedback loop cannot track fast fades. One critical Time contribution, however, is inter-media synchronization: the NTP wallclock timestamp carried in each RTCP Sender Report allows a receiver to align audio and video streams from different RTP sessions onto a common timeline — the mechanism behind lip-sync.
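The jitter estimate that receivers report back in RTCP is itself a small feedback computation. A sketch of the RFC 3550 interarrival-jitter estimator, fed with hypothetical (send-timestamp, arrival-time) samples in the same units:

```python
def update_jitter(jitter: float, transit_prev: float, transit_now: float) -> float:
    """One step of the RFC 3550 interarrival-jitter estimator.

    transit = arrival_time - rtp_timestamp; D is the change in transit
    between consecutive packets, and the estimate is an exponential
    average with gain 1/16.
    """
    d = abs(transit_now - transit_prev)
    return jitter + (d - jitter) / 16.0

# Hypothetical stream: steady 160-unit send spacing, jittery arrivals.
samples = [(0, 10), (160, 172), (320, 330), (480, 495)]
jitter = 0.0
prev_transit = samples[0][1] - samples[0][0]
for send, arrive in samples[1:]:
    transit = arrive - send
    jitter = update_jitter(jitter, prev_transit, transit)
    prev_transit = transit
```

The 1/16 gain makes the estimate deliberately sluggish — consistent with the Time gap above: RTCP conveys a smoothed belief, not a reactive signal.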
12.2.2 Environment → Measurement → Belief
| Layer | What RTP Has | What’s Missing |
|---|---|---|
| Environment | True loss rate, jitter, available bandwidth along path | Path conditions vary per-hop; RTP sees only endpoints |
| Measurement | RTCP receiver reports (fraction lost, jitter, last-SR delay) | Sampling period seconds; no per-packet RTT |
| Belief | Average recent loss and jitter on this session | Belief lags environment by one RTCP interval |
The E→M gap is accidentally noisy: receiver reports are honest but too infrequent. Better estimators leave the gap intact because the measurement rate itself is the constraint — and RTCP’s bandwidth cap (5% of session) is deliberate, to prevent feedback from consuming the media budget.
12.2.3 “The Gaps Didn’t Matter… Yet.”
In 1996 the MBone was a cooperative research network. Endpoints were workstations with public IPs. Bandwidth was low but predictable. Sessions were small enough that RTCP scaled. Slow feedback was acceptable because sender adaptations were coarse (switch codecs, add redundancy) and the consequence of slow adaptation was modest (a noisy few seconds, then recovery).
The gaps would matter when two environmental shifts collided: commercial deployment at internet scale (1999-2005) introduced NATs and firewalls that broke UDP reachability, and cellular networks (2007 onward) introduced bandwidth that varied by an order of magnitude over seconds. RTP’s UDP packets were blocked at NATs and firewalls, and when they did arrive, RTCP’s feedback loop was too slow for cellular variability.
12.3 Act 2: “It’s 2009. The iPhone Won’t Talk RTP.”
It’s 2009. Apple ships iOS 3.0 and the iPhone is the first consumer device carrying HD video over cellular. The network bandwidth between the tower and the phone varies from 500 kbps to 5 Mbps within a minute. Corporate firewalls block UDP outright. Content delivery networks understand one protocol: HTTP. Apple needs a streaming format that reaches every user, through every network, over the existing CDN fabric.
Roger Pantos’ answer was as simple as it was blunt: chop the video into short .ts segments, list them in a .m3u8 text file, serve both over HTTP. The client downloads the playlist, then fetches segments one at a time, choosing a bitrate based on how fast previous segments arrived. HTTP Live Streaming (HLS) became an RFC in 2017 (Pantos and May 2017), and DASH standardized the same model as an ISO standard (ISO/IEC 2019), but the architecture crystallized at launch.
“A client obtains the Playlist and then accesses each Media Segment in the Playlist in order… The client presents the Media Segments so as to give the end user a continuous experience of the Presentation.” — Pantos & May, 2017 (Pantos and May 2017)
What Pantos saw: HTTP everywhere. CDNs that cache files for free. Firewalls that pass port 443. A client (iPhone) with enough CPU to manage its own state. The encoder’s job was to produce the bitrate ladder once; the client’s job was to choose.
What remained invisible: A 10-second chunk cadence put a 30-second floor on live latency (segment, playlist refresh, client buffer). Real-world networks experience wild throughput fluctuations — drops from 17 Mbps to 500 kbps within seconds due to WiFi interference or mobile congestion — and throughput estimation from chunk download time would oscillate when bandwidth varied faster than chunk duration. Worse, throughput-based adjustments often fall into a “conservative trap”: after a bandwidth dip, the estimator becomes overly cautious and picks a low bitrate even when capacity has already recovered, wasting quality the user could have enjoyed. Millions of clients sharing a bottleneck would synchronize bitrate decisions, creating thrash.
The protocol flow begins with the client requesting a manifest file — an M3U8 playlist in HLS (Pantos and May 2017), a Media Presentation Description (MPD) in DASH (Sodagar 2011) — that lists every available bitrate, resolution, and codec alongside the URLs for each segment at each quality level. The manifest is the client’s map of the bitrate ladder: it tells the client what choices exist before a single byte of video is fetched.
Pantos applied decision placement at the client: the client owns buffer state, observes network conditions, and picks bitrate. The server is stateless — it just serves files. The network is oblivious — it just passes HTTP. He applied disaggregation between content production (encoder ladder) and delivery policy (client ABR). He applied closed-loop reasoning with chunk-download-time as the throughput sensor, smoothed by an exponential moving average. The resulting feedback loop is shown in Figure 12.2, for DASH, which standardized the same architecture two years later (Sodagar 2011).
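The client-side loop described here fits in a few lines. A sketch of a throughput-based ABR controller in the style of early HLS clients — the ladder values, EWMA gain, and safety fraction are illustrative, not from any spec:

```python
LADDER_KBPS = [235, 750, 1750, 4300, 8000]  # hypothetical bitrate ladder

class ThroughputABR:
    """Estimate bandwidth from chunk download times, smooth with an EWMA,
    and pick the highest ladder rung below a safety fraction of the
    estimate."""

    def __init__(self, alpha: float = 0.3, safety: float = 0.8):
        self.alpha = alpha            # EWMA gain
        self.safety = safety          # headroom against estimate error
        self.estimate_kbps = None

    def observe(self, chunk_kbits: float, download_s: float) -> None:
        sample = chunk_kbits / download_s
        if self.estimate_kbps is None:
            self.estimate_kbps = sample
        else:
            self.estimate_kbps += self.alpha * (sample - self.estimate_kbps)

    def choose(self) -> int:
        if self.estimate_kbps is None:
            return LADDER_KBPS[0]     # no signal yet: start low
        budget = self.safety * self.estimate_kbps
        eligible = [r for r in LADDER_KBPS if r <= budget]
        return eligible[-1] if eligible else LADDER_KBPS[0]

abr = ThroughputABR()
abr.observe(chunk_kbits=10_000, download_s=2.0)   # one ~5000 kbps sample
rate = abr.choose()
```

Note the structural flaw Act 3 will expose: `observe` measures a download whose duration was shaped by the previous `choose` — the sensor is filtered by the controller.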
12.3.1 Invariant Analysis: HLS (2009) / DASH (2011)
| Invariant | HLS/DASH Answer (2009-11) | Gap? |
|---|---|---|
| State | Client tracks buffer level + chunk download time | One throughput sample per chunk — noisy, lagged |
| Time | Chunk cadence (6-10s); playlist refresh cadence | Chunk duration floors live latency (~30s) |
| Coordination | Client-driven; stateless server | No cross-client coordination — synchronized oscillation |
| Interface | HTTP GET of segments + text manifest | Manifest polling is verbose; no server push |
The State gap is accidentally noisy: one chunk per 10 seconds is a single throughput sample against a bandwidth process that varies every 100ms on cellular links. The variance is large, and increasing the sampling rate requires shrinking chunks, which inflates HTTP overhead. The Coordination gap produced the famous “bitrate thrash” — multiple clients on the same bottleneck see identical high throughput, all request 4K chunks, the link saturates, all see identical low throughput, all drop to 480p together.
12.3.2 “The Gaps Didn’t Matter… Yet.”
For on-demand Netflix-style streaming, a 30-second startup was tolerable (users expect a loading screen) and rebuffering was rare once the buffer filled. Throughput variance mattered less than average, because a 60-second buffer absorbs seconds of bandwidth dips. Synchronized oscillation across clients was tolerable as long as absolute quality stayed above perceptual thresholds.
Two shifts broke this tolerance: live sports (2013 onward) demanded sub-10-second latency because viewers with notifications on their phones saw goals scored before their stream showed the kick, and mobile bandwidth variability (LTE handovers, congestion) outpaced the throughput estimator’s smoothing window, producing visible quality oscillations that users hated more than occasional rebuffering.
12.4 Act 3: “It’s 2014. Throughput Estimation Oscillates — Drop It.”
It’s 2014. Huang, Johari, McKeown and colleagues run a large-scale study of ABR on a real streaming service. Their finding is counterintuitive: throughput-based ABR algorithms oscillate badly because the throughput signal is itself the output of the client’s prior decisions. A client that chose a low bitrate gets an artificially high throughput estimate (downloads finish fast), which tempts it to jump high, which then fails. Throughput and bitrate are coupled — the sensor lies.
“Our experiments … show that a primary cause of rebuffering is the instability of the throughput estimator itself. We propose a buffer-based approach that avoids throughput estimation entirely.” — Huang et al., 2014 (Huang et al. 2014)
What Huang saw: The buffer level is a direct measurement of the gap between network delivery and player consumption. If the buffer is growing, the network is fast enough; if it’s shrinking, it isn’t. The buffer itself is the estimator.
BBA (Buffer-Based Algorithm) is almost embarrassingly simple: define a piecewise-linear function from buffer level to chosen bitrate, with a reservoir (buffer below which you pick the lowest bitrate) and a cushion (buffer above which you pick the highest). Ignore throughput entirely. Huang applied closed-loop reasoning with a different sensor — buffer level, which is structurally honest because the controller’s own decisions leave it unaffected.
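The piecewise-linear map is short enough to show whole. A sketch with illustrative reservoir/cushion values (the paper tunes these per deployment):

```python
def bba_bitrate(buffer_s: float, ladder_kbps: list,
                reservoir_s: float = 5.0, cushion_s: float = 30.0) -> int:
    """Buffer-to-bitrate map in the spirit of Huang's BBA: lowest rung
    below the reservoir, highest above the cushion, linear interpolation
    across the ladder in between."""
    lo, hi = ladder_kbps[0], ladder_kbps[-1]
    if buffer_s <= reservoir_s:
        return lo
    if buffer_s >= cushion_s:
        return hi
    # Map buffer position within (reservoir, cushion) onto the rate
    # range, then snap down to the nearest available rung.
    frac = (buffer_s - reservoir_s) / (cushion_s - reservoir_s)
    target = lo + frac * (hi - lo)
    return max(r for r in ladder_kbps if r <= target)

ladder = [235, 750, 1750, 4300, 8000]
```

No throughput term appears anywhere — the buffer is the only input, which is the whole point.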
What remained invisible: Buffer level is slow — a sensor that lags by the chunk duration. On fast-varying networks BBA underreacts. And it leaves a lot of quality on the table: when the network is stable and abundant, BBA’s conservatism wastes bandwidth that a smarter algorithm could spend on higher bitrates. There is also a startup paradox — a “broken invariant” at session start: an empty buffer contains zero information about network conditions, so BBA’s core principle (let the buffer decide) is vacuous. During startup, BBA reverts to throughput estimation to ramp up — the very signal it was designed to replace. The buffer-based invariant only holds once the buffer has accumulated enough history to be informative.
12.4.1 Invariant Analysis: BBA (2014)
| Invariant | BBA’s Answer | Gap? |
|---|---|---|
| State | Buffer level only — direct and honest | Slow sensor; lags by chunk duration |
| Time | Per-chunk decision | Same as HLS/DASH |
| Coordination | Client-driven, same interface | — |
| Interface | Standard DASH/HLS | — |
BBA’s key contribution is measurement-quality reasoning: it identifies that chunk-throughput is structurally filtered by the controller (the sender’s own decisions shape what the sensor sees), and switches to a sensor free of that filtering. This is the same reasoning pattern as BGP route advertisements (where stability comes from structural constraints rather than better estimation, per Ch09).
12.5 Act 4: “It’s 2015. Combine Buffer and Throughput — Optimally.”
It’s 2015. BBA is stable but leaves quality on the table. Throughput-based ABR is responsive but oscillates. Yin, Jindal, Sekar, and Sinopoli ask: what does the optimal policy look like? They frame ABR as a finite-horizon constrained optimization — choose a sequence of bitrates to maximize sum-of-quality minus a rebuffering penalty minus a switch-magnitude penalty, subject to buffer dynamics.
“We present a principled understanding of bitrate adaptation and analyze several practical adaptation algorithms through a common control-theoretic framework. We propose a novel model predictive control algorithm.” — Yin et al., 2015 (Yin et al. 2015)
What Yin saw: ABR is textbook model predictive control. Predict future throughput over a planning horizon (harmonic mean of the last 5 chunks is a robust predictor). Model buffer evolution as a function of chosen bitrate and predicted throughput. Solve the optimization (small enough to be tractable per-chunk). Execute the first decision. Re-solve next chunk. This is the classic MPC loop: predict → optimize → act → measure → re-predict.
Yin applied closed-loop reasoning with an explicit optimization horizon — the loop is no longer myopic (one chunk ahead) but plans over 5 chunks (~50 seconds). He combined both sensors: throughput prediction feeds the optimizer, buffer level constrains it. The principles of optimization and feedback control, decades old in systems engineering, were finally named for ABR.
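The predict → optimize → act loop can be sketched directly: harmonic-mean prediction plus exhaustive search over the horizon (feasible here because the plan space is only |ladder|^horizon). The penalty weights and ladder are hypothetical; Yin's paper tunes them against a QoE model:

```python
from itertools import product

LADDER = [235, 750, 1750, 4300]   # kbps (hypothetical)
CHUNK_S = 4.0                     # chunk duration, seconds
LAMBDA = 4300.0                   # stall penalty per second, in ladder units
MU = 1.0                          # switch-magnitude penalty weight

def harmonic_mean(samples):
    return len(samples) / sum(1.0 / s for s in samples)

def mpc_choose(throughputs_kbps, buffer_s, last_rate, horizon=5):
    """Yin-style MPC sketch: predict throughput, simulate buffer dynamics
    for every bitrate sequence over the horizon, score QoE, and return
    only the first decision (the loop re-solves next chunk)."""
    pred = harmonic_mean(throughputs_kbps[-5:])
    best_score, best_first = float("-inf"), LADDER[0]
    for plan in product(LADDER, repeat=horizon):
        buf, prev, score = buffer_s, last_rate, 0.0
        for rate in plan:
            download = rate * CHUNK_S / pred          # seconds to fetch
            rebuf = max(download - buf, 0.0)          # stall if buffer drains
            buf = max(buf - download, 0.0) + CHUNK_S  # buffer dynamics
            score += rate - LAMBDA * rebuf - MU * abs(rate - prev)
            prev = rate
        if score > best_score:
            best_score, best_first = score, plan[0]
    return best_first
```

With abundant predicted throughput and a full buffer the optimizer climbs the ladder; with a starved buffer and low throughput it stays at the bottom — the two behaviors throughput-only and buffer-only ABR each get half right.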
What remained invisible: The prediction horizon grew the state invariant (now we predict 5 chunks of throughput) and added prediction error as a new failure mode. When networks are non-stationary (LTE handover, WiFi congestion), the harmonic-mean predictor is systematically wrong. Mao et al. with Pensieve (Mao et al. 2017) replaced the hand-tuned predictor with a learned policy trained via reinforcement learning two years later, implicitly acknowledging that static predictors fail on the tails.
12.5.1 Invariant Analysis: MPC (2015)
| Invariant | MPC’s Answer | Gap? |
|---|---|---|
| State | Predicted throughput (harmonic mean, 5 samples) + buffer level | Prediction error compounds over horizon |
| Time | 5-segment optimization horizon (~50s) | Horizon too long for live streaming |
| Coordination | Client-driven, same as DASH | — |
| Interface | Same MPD; modified client only | — |
The prediction horizon is the crux. Too short and MPC collapses to myopic throughput-based ABR. Too long and prediction errors dominate the optimum. Yin’s 5-chunk horizon was a sweet spot for on-demand streaming — but it places a 50-second floor on responsiveness, which is incompatible with live. For live streaming, playback latency becomes a critical additional state variable: the optimizer must now minimize not just rebuffering and quality switches but also the gap between the live edge and the playback position. This adds a skip penalty to the QoE function — the cost of dropping frames to catch up when the player falls behind the live edge — turning the two-dimensional (bitrate, buffer) optimization into a three-dimensional one (bitrate, buffer, latency).
12.5.2 Environment → Measurement → Belief After MPC
| Layer | What MPC Has | What’s Missing |
|---|---|---|
| Environment | True throughput process + buffer state | — |
| Measurement | Per-chunk throughput samples + current buffer | Same sampling rate as DASH |
| Belief | Explicit predictor + horizon-optimal bitrate plan | Prediction residual grows with horizon |
The E→M gap is still accidentally noisy (same chunk-rate samples as DASH), but MPC partially closes it by combining two complementary sensors. The new failure mode is prediction bias: the predictor assumes stationarity that mobile networks violate.
12.6 Act 5: “It’s 2019. Live Latency Hits the Chunk Floor.”
It’s 2019. Twitch, YouTube Live, and sports streaming services push the latency bar below 10 seconds, then below 5, then below 2. HLS’s 10-second chunks floor the latency at ~30 seconds (segment + playlist refresh + 3-chunk client buffer). DASH faces the same floor. A football fan with the game on their phone hears the neighbor’s TV cheer before their stream shows the goal.
Apple’s answer is Low-Latency HLS (LL-HLS) (Pantos 2019): keep HTTP, keep chunks, but subdivide each chunk into partial segments of 200-500 ms, and use HTTP/2 push to deliver partials as they are produced. The client polls a blocking manifest that returns the instant a new partial becomes available. CMAF (ISO/IEC 2018) chunked transfer is the container-level analogue: a single fragment that is transmitted progressively as it is encoded, allowing the player to begin decoding before the fragment ends.
What the designers saw: Chunks need not be downloaded atomically. If the server produces the first 200ms of a chunk, it can start pushing it while encoding the rest. If the player can begin decoding 200ms into a chunk, its effective latency is 200ms of content plus RTT plus whatever safety buffer it maintains.
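Back-of-envelope arithmetic makes the payoff concrete. A sketch assuming the buffer term dominates the classic floor, with illustrative RTT and safety-buffer values:

```python
def hls_latency_floor(segment_s: float, buffered_chunks: int) -> float:
    """Classic HLS floor, dominated by the multi-chunk client buffer:
    nothing plays until several whole segments have been encoded,
    listed, and downloaded."""
    return buffered_chunks * segment_s

def ll_hls_latency_floor(part_s: float, rtt_s: float, safety_s: float) -> float:
    """LL-HLS floor: one partial of content, plus network RTT, plus
    whatever safety buffer the player keeps."""
    return part_s + rtt_s + safety_s

classic = hls_latency_floor(segment_s=10.0, buffered_chunks=3)       # tens of seconds
low = ll_hls_latency_floor(part_s=0.3, rtt_s=0.05, safety_s=1.0)     # order of 1-2 s
```

The order-of-magnitude gap between the two floors is the entire motivation for partial segments; the price, as the next paragraph notes, is a far shallower buffer to absorb error.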
What remains a live tension: Partial segments fragment CDN caching (many small objects instead of a few large ones). ABR algorithms tuned for 10-second chunks oscillate wildly at 200-ms granularity because they re-decide too often. The throughput estimator sees noisier samples. The buffer is shallower, so any congestion burst immediately stalls playback.
LL-HLS applied disaggregation between the segment (the unit of CDN caching and ABR decision) and the partial (the unit of delivery). The segment remains the ABR decision point; partials keep the pipeline fed between decisions. This resolves one tension by introducing another: the player now balances a segment-level bitrate choice against partial-level delivery noise.
12.6.1 Invariant Analysis: LL-HLS / CMAF (2019)
| Invariant | LL-HLS Answer | Gap? |
|---|---|---|
| State | Buffer + partial-segment progress + blocking manifest | Shallower buffer = less error absorption |
| Time | Partial cadence 200-500ms; segment cadence unchanged | Sub-2s latency achievable, but fragile |
| Coordination | Client-driven with HTTP/2 push from server | Cache hierarchy sees many small objects |
| Interface | CMAF chunked transfer; blocking playlist reload | Incompatible with some legacy CDN edge behaviors |
12.7 Act 6: “It’s 2021. WebRTC Puts Real-Time in the Browser.”
It’s 2021. Zoom, Google Meet, Microsoft Teams have made real-time video conferencing a daily fact of life. The pandemic has compressed a decade of adoption into twelve months. The underlying technology — WebRTC, standardized in RFC 8825 (Alvestrand 2021) — descended from Skype’s hybrid P2P architecture (analyzed by Baset and Schulzrinne (Baset and Schulzrinne 2006)) and from a decade of work on getting RTP (Schulzrinne et al. 2003) through NATs.
“WebRTC is a set of protocols that enables real-time communications … directly between web browsers, without the need for specialized plug-ins.” — Alvestrand, 2021 (Alvestrand 2021)
What the WebRTC designers saw: Real-time media has to work from inside a browser, without plugins, through NATs, with end-to-end encryption. Signaling (who is calling whom, what codecs they support) is application-specific and can flow through any channel the app provides. Media has to go peer-to-peer when possible (lowest latency) and through a relay when not.
The stack is disaggregated into layers:
- ICE (Interactive Connectivity Establishment): probes every candidate address pair (local, reflexive via STUN, relayed via TURN) and picks the one that works.
- DTLS-SRTP (Datagram Transport Layer Security – Secure Real-time Transport Protocol): encrypts media end-to-end on the wire.
- GCC (Google Congestion Control): runs a per-packet loss/delay-based congestion control loop, reacting every RTT — the fastest loop in the multimedia stack.
- getStats API: exposes per-stream loss, RTT, jitter, bitrate to the application for introspection.
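GCC's key sensor is the delay gradient: how inter-arrival spacing changes relative to inter-departure spacing. A simplified sketch — real GCC groups packets, filters the trend (Kalman or trendline), and adapts the threshold, where this version uses raw gradients and a fixed threshold for readability:

```python
def delay_gradient(groups):
    """Per-pair delay gradient from (departure_ms, arrival_ms) samples:
    positive values mean arrivals are spreading out relative to
    departures, i.e. a queue is building along the path."""
    grads = []
    for (d_prev, a_prev), (d_now, a_now) in zip(groups, groups[1:]):
        grads.append((a_now - a_prev) - (d_now - d_prev))
    return grads

def overuse_signal(grads, threshold=2.0):
    """Classify the recent trend: 'overuse' (queues growing, back off
    before loss appears), 'underuse' (queues draining), or 'hold'."""
    trend = sum(grads) / len(grads)
    if trend > threshold:
        return "overuse"
    if trend < -threshold:
        return "underuse"
    return "hold"

# Hypothetical trace: packets sent every 20 ms, arrival spacing growing.
groups = [(0, 10), (20, 33), (40, 58), (60, 85)]
grads = delay_gradient(groups)
state = overuse_signal(grads)
```

The point is the early warning: the gradient goes positive while the queue is still filling, RTTs before the first packet is dropped — which is why GCC can react inside the perceptual deadline where loss-based control cannot.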
For many-party conferencing, WebRTC by itself is peer-to-peer and scales poorly (each of N participants must upload its stream to the other N-1). The conferencing architecture design space spans two server models:
- Multipoint Control Unit (MCU): The server decodes every incoming stream, composites them into a single mixed layout, re-encodes, and sends one stream to each participant. This minimizes client-side bandwidth and CPU (each client sends one stream and receives one), but maximizes server cost — and critically, the MCU must decrypt media to transcode it, breaking end-to-end encryption.
- Selective Forwarding Unit (SFU): The server terminates ICE/DTLS from each peer, receives their RTP streams, and forwards a subset to each recipient without decoding. Because the SFU never touches the media payload, it preserves end-to-end encryption. Server cost is lower (no transcoding), but each client receives multiple streams and must decode them locally.
Production conferencing overwhelmingly uses SFUs, accepting higher client complexity in exchange for preserved encryption and lower server cost. The SFU’s forwarding decision — which streams to send to which recipient, and at what quality — is the core coordination problem.
The SFU’s quality selection relies on one of two encoding strategies. With simulcast, each sender independently encodes and transmits 2-3 separate streams at different resolutions (e.g., 1080p, 360p, 180p). The SFU selects which stream to forward to each receiver based on their available bandwidth and display size — high resolution for the active speaker’s large tile, low resolution for thumbnails. With SVC (Scalable Video Coding) (Schwarz et al. 2007; Wiegand et al. 2003), the sender produces a single layered bitstream: a base layer that decodes independently, plus enhancement layers that add resolution or frame rate. The SFU can drop enhancement layers for bandwidth-constrained receivers without re-encoding. Simulcast is simpler to implement and codec-agnostic but wastes sender upload bandwidth on redundant encodes; SVC is more bandwidth-efficient but requires codec support (VP9 SVC, AV1 SVC (Chen et al. 2018)) and adds encoding complexity.
To further conserve bandwidth in large meetings, SFUs implement “Last N” forwarding: only the video streams of the active speaker and the N most recent speakers are forwarded; all other participants receive audio only. This reduces the forwarding fan-out from O(N²) toward O(N), making 100-participant meetings feasible without saturating either server or client bandwidth.
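The two mechanisms combine into a per-receiver forwarding plan. A sketch under stated assumptions — the simulcast rungs, tile policy, and budget arithmetic are illustrative, not from any particular SFU:

```python
LADDER = {"180p": 150, "360p": 600, "1080p": 2500}  # hypothetical simulcast rungs, kbps

def forward_plan(speakers, viewer_budget_kbps, last_n=3):
    """One SFU receiver's plan: high-resolution rung for the active
    speaker, low-resolution for up to last_n recent speakers, audio only
    beyond the Last-N cutoff or when the downlink budget runs out.
    `speakers` is ordered by speaking recency, most recent first."""
    plan, budget = {}, viewer_budget_kbps
    for i, who in enumerate(speakers):
        if i >= last_n:
            plan[who] = "audio-only"            # Last-N cutoff
            continue
        want = "1080p" if i == 0 else "180p"    # speaker tile vs thumbnail
        if LADDER[want] <= budget:
            plan[who] = want
            budget -= LADDER[want]
        else:
            plan[who] = "audio-only"            # budget exhausted
    return plan

plan = forward_plan(["alice", "bob", "carol", "dave"], viewer_budget_kbps=3000)
```

Crucially, nothing here inspects the media payload — the SFU chooses among encrypted streams by their declared rung, which is how end-to-end encryption survives.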
WebRTC applied decision placement adaptively — distributed (P2P) when possible, centralized (SFU) when scale demands. It applied closed-loop reasoning at two timescales: GCC per-packet (~RTT), and application-layer stream selection per-second. It applied disaggregation of signaling from media, of transport from security, of media distribution from media transport.
12.7.1 Invariant Analysis: WebRTC (2021)
| Invariant | WebRTC’s Answer | Gap? |
|---|---|---|
| State | getStats: per-stream loss, RTT, jitter, bitrate | Browser opacity; app cannot see per-packet |
| Time | GCC adapts per-RTT (~10-100ms); frame cadence 30-60 fps | GCC tuning opaque to application |
| Coordination | ICE chooses P2P or TURN; SFU for scale (preserves E2E encryption, unlike MCU which must decrypt to transcode) | SFU centralizes routing but not trust; MCU centralizes both |
| Interface | PeerConnection API + SDP (Session Description Protocol) offer/answer | SDP legacy baggage; no native many-to-many |
12.7.2 Environment → Measurement → Belief in WebRTC
| Layer | What WebRTC Has | What’s Missing |
|---|---|---|
| Environment | True path loss, RTT, available bandwidth | — |
| Measurement | Per-packet RTT, loss, delay gradient (GCC sensors) | Still no view into router queues or competing flows |
| Belief | GCC bandwidth estimate, updated per RTT | Estimate can lag during congestion onset or recovery |
The E→M gap is accidentally noisy but narrowed dramatically compared to RTCP: GCC samples per-packet, not per-report, and uses delay-gradient (the rate of change of one-way delay) as an early-warning signal for congestion. This is the same principle as BBR’s RTT-probing in transport (Ch08) — closing the measurement loop by measuring faster.
12.8 The Grand Arc: From Schulzrinne to WebRTC
12.8.1 The Evolving Anchor
| Era | Year | Binding Constraint | Invariant Locked First |
|---|---|---|---|
| RTP/RTCP | 1996 | Perceptual deadline + best-effort IP | Time (timestamps) |
| HLS / DASH | 2009-11 | Deployability via HTTP/CDN/firewalls | Interface (HTTP) |
| BBA | 2014 | Throughput signal is gamed by controller | State (buffer as honest sensor) |
| MPC | 2015 | QoE as joint function of bitrate, rebuffer, switches | State (predictive model) + Time (horizon) |
| LL-HLS / CMAF | 2019 | Live latency floor <2s | Time (sub-chunk cadence) |
| WebRTC | 2021 | Browser sandbox + NAT + E2E encryption | Interface (browser API) + Coordination (ICE) |
12.8.2 Three Design Principles Applied Across the Arc
Disaggregation appears at every generation. RTP separates data from control; HLS separates content production (encoder ladder) from delivery policy (client ABR); MPC separates prediction from optimization; WebRTC separates signaling, transport, security, and media distribution. Each separation creates an interface — HTTP chunks, MPD manifest, PeerConnection API — and each interface has the potential to ossify. HTTP as the multimedia interface ossified hard enough that every subsequent innovation (LL-HLS, CMAF, Media over QUIC) has had to either preserve HTTP semantics or build parallel infrastructure.
Closed-loop reasoning is the through-line. Every system has a sensor, an estimator, a controller, and an actuator. What changes is the sensor: RTCP receiver reports (slow, periodic), chunk download time (noisy, filtered), buffer level (honest, lagged), predicted throughput (extrapolated, biased), per-packet delay gradient (fast, noisy). The measurement-quality vocabulary from Ch01 tracks the arc: RTCP and chunk-throughput are accidentally noisy (fix with better estimators); chunk-throughput when used by an ABR controller is structurally filtered (fix with a different sensor like buffer); GCC’s delay-gradient is physically bounded (as close to the true path state as the endpoint can see).
Decision placement shifted once and then oscillated. RTP placed decisions at endpoints. HLS crystallized decisions at the client. WebRTC started at peers and reintroduced centralization via SFU when scale demanded it. The spectrum from distributed to centralized oscillates — each generation re-chooses based on the binding constraint of its moment.
12.8.3 The Dependency Chain
The chapter-opening flowchart traces the full dependency chain: each binding constraint forces a design, and each design’s gaps become the next era’s constraint.
12.8.4 Pioneer Diagnosis Table
| Year | Pioneer | Invariant | Diagnosis | Contribution |
|---|---|---|---|---|
| 1996 | Schulzrinne | Time + State | Media needs loss detection + timing over UDP | RTP/RTCP |
| 2009 | Pantos | Interface | Only HTTP reaches every user | HLS |
| 2011 | Sodagar | Interface | Need vendor-interoperable format | DASH / MPD standard |
| 2014 | Huang | State | Throughput sensor is gamed by controller | BBA |
| 2015 | Yin | State + Time | Myopic ABR cannot optimize QoE | MPC for ABR |
| 2017 | Mao | State | Hand-tuned predictors fail on tails | Pensieve (neural ABR) |
| 2019 | Apple / MPEG | Time | Chunk cadence floors live latency | LL-HLS / CMAF |
| 2021 | Alvestrand et al. | Interface + Coordination | Browser RTC needs native stack + NAT traversal | WebRTC / ICE / SFU |
12.8.5 Innovation Timeline
flowchart TD
subgraph sg1["Encoding & Transport"]
A1["1948 — Shannon: rate-distortion"]
A2["1994 — MPEG-2"]
A3["1996 — Schulzrinne: RTP/RTCP"]
A4["2003 — H.264/AVC"]
A5["2013 — H.265/HEVC, VP9"]
A6["2018 — AV1"]
A1 --> A2 --> A3 --> A4 --> A5 --> A6
end
subgraph sg2["ABR Streaming"]
B1["2009 — Pantos/Apple: HLS"]
B2["2011 — Sodagar/MPEG: DASH"]
B3["2014 — Huang: BBA"]
B4["2015 — Yin: MPC"]
B5["2017 — Mao: Pensieve"]
B1 --> B2 --> B3 --> B4 --> B5
end
subgraph sg3["Low-Latency & Real-Time"]
C1["2006 — Baset: Skype analysis"]
C2["2019 — LL-HLS / CMAF"]
C3["2021 — WebRTC RFC 8825"]
C1 --> C2 --> C3
end
sg1 --> sg2 --> sg3
12.9 Voice Over IP: The Tightest Budget
VoIP operates under the most aggressive version of the binding constraint. The end-to-end delay budget decomposes into:
| Component | Typical Budget | Constraint |
|---|---|---|
| Encoding (Opus/G.711, 20ms frames) | ~20 ms | Frame size × compute |
| One-way propagation (continental) | ~50 ms | Speed of light |
| Queueing + transmission | 0-100 ms | Network congestion |
| Jitter buffer | 50-100 ms | Arrival variance |
| Decoding + playout | ~10 ms | Codec compute |
The sum approaches 200ms on a good path and exceeds the 150ms conversational bound on congested paths. This tight budget has three consequences illustrated in Figure 12.3. First, retransmission is infeasible — an RTT round trip at 100ms alone eats the jitter budget. Second, loss concealment replaces retransmission — codecs interpolate missing frames. Third, adaptive playout adjusts buffer depth at talkspurt boundaries (silence intervals) to match actual jitter without adding fixed overhead.
The playout delay is estimated as p = d̂ + k·v̂, where d̂ is the EWMA of one-way delay and v̂ is the EWMA of deviation, with k ≈ 3-4. Larger k adds robustness at the cost of latency. Talkspurt-boundary adjustment exploits human perception: adding 20ms of delay during silence is imperceptible, while adding it mid-word is audible.
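A sketch of that estimator, using the section's formula p = d̂ + k·v̂; the gain α and the sample values are illustrative (classic adaptive-playout schemes use α close to 1):

```python
def update_playout(d_hat, v_hat, delay_sample, alpha=0.998, k=4.0):
    """One step of the adaptive playout estimator from the text:
    d_hat tracks mean one-way delay (EWMA), v_hat tracks mean deviation,
    and p = d_hat + k * v_hat is the candidate playout delay, applied
    only at the next talkspurt boundary."""
    d_hat = alpha * d_hat + (1 - alpha) * delay_sample
    v_hat = alpha * v_hat + (1 - alpha) * abs(delay_sample - d_hat)
    return d_hat, v_hat, d_hat + k * v_hat

# Hypothetical one-way delay samples in milliseconds.
d, v = 100.0, 5.0
for sample in [102, 98, 110, 95, 104]:
    d, v, p = update_playout(d, v, sample)
```

With k = 4, roughly four deviations of headroom, late packets are rare but every millisecond of headroom is a millisecond of conversational latency — the k knob is the robustness/latency tradeoff stated above.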
12.10 Generative Exercises
A user watching a 4K DASH stream walks from 5G macro coverage into a WiFi-only zone. Bandwidth drops from 100 Mbps to 10 Mbps within 200ms. The MPC predictor uses a 5-chunk harmonic mean. Predict what happens to bitrate selection over the next 60 seconds. At what point does MPC recover? What would Pensieve (neural ABR) do differently?
Design a conferencing system that preserves end-to-end encryption (no server sees cleartext media) and supports stream selection (send hi-res to active speaker, lo-res to thumbnails) and scales to 100 participants. Which invariant must you relax? Where does the decision about stream selection move, and at what cost?
A Mars-to-Earth video link has 8-20 minute one-way latency and ~1 Mbps bandwidth. Apply the four invariants: what does State look like? Can you run any closed loop? What sensor would you use for ABR? Is conversational video possible? If not, what is the best achievable interaction model?
Simulcast transmits multiple independent encodings of the same source at different resolutions simultaneously. Unlike scalable coding, each encoding is self-contained — the SFU selects which to forward without needing to understand the codec’s layer structure.
A jitter buffer accumulates arriving packets for a fixed duration (typically 20-60 ms) before passing them to the decoder. This absorbs inter-arrival time variations (jitter) at the cost of adding fixed delay. Larger buffers tolerate more jitter but add more latency — a direct tradeoff between State (buffer depth) and Time (end-to-end delay).