When Buffering Is Forbidden: VoIP, Jitter, and Modern Real-Time

Last lecture we spent an hour inside the ABR control loop — the feedback system that lets Netflix and YouTube adapt video quality to a network they cannot control. The loop was elegant: observe throughput, smooth the estimate, predict the next segment, choose a bitrate, repeat every ten seconds. When it worked, viewers saw seamless HD. When it failed — oscillation, stalling, cross-client interference — researchers responded with buffer-based control, model-predictive optimization, and eventually learned policies. But through all of it, one luxury was always present: a large buffer. Netflix holds 60 to 200 seconds of video in reserve. That buffer absorbs network variability the way a shock absorber absorbs potholes — the viewer never feels the bumps.

Today we take that buffer away.

Suppose you are on a Zoom call. Your colleague says something. You respond. The total round trip — their voice to your ears, your voice back to theirs — must feel instantaneous, or the conversation breaks. You start talking over each other. You both stop. Awkward silence. You both start again. The rhythm of human conversation has a hard deadline: roughly 150 milliseconds of one-way delay, end to end [1][6]. Beyond that, turn-taking collapses.

150 milliseconds. Netflix’s buffer alone is roughly 400 to 1,300 times longer. Netflix adapts once per segment, every 10 seconds. Zoom must adapt on every packet arrival: milliseconds, not seconds. Why? Because with only 150 ms of total budget, you cannot wait 10 seconds to react to a network change. Everything we built in L10 and L11 (large buffers, TCP retransmission, complex offline codecs, ten-second adaptation loops) is unusable under this constraint. We need a completely different architecture for delivering media over a best-effort network, one that makes no guarantees about delay, loss, or bandwidth: the internet as we know it.

This lecture traces how engineers built systems that deliver smooth audio and video under that 150-millisecond wall, from the protocol layer (RTP/RTCP) through receiver-side jitter management to modern platforms (WebRTC, Zoom) and the extreme frontier of cloud gaming. Along the way we will see the same design principles — disaggregation, closed-loop reasoning, decision placement — reappear in a radically different context.

Act 1: The 150-millisecond wall

Where does the deadline come from?

The 150-millisecond bound is not an engineering target. It is a biological fact about human conversation. ITU-T Recommendation G.114, based on decades of telephone research, established that one-way delays below 150 ms produce “acceptable” conversational quality [6]. Between 150 and 400 ms, conversations become strained — speakers notice the gap, hesitate, talk over each other. Above 400 ms, natural dialogue is effectively impossible; you are reduced to walkie-talkie-style alternation.

Why is human conversation so sensitive to delay? Because turn-taking is a prediction game. When you hear someone’s sentence ending — their pitch drops, their cadence slows — your brain starts formulating a response before they finish. The gap between speakers in natural conversation averages about 200 milliseconds [7]. If the network adds 300 ms of one-way delay, your response arrives 600 ms after their sentence ended (300 ms for their voice to reach you, 300 ms for yours to reach them). That 600 ms feels like an eternity — the other person has already started talking again, because they interpreted your silence as hesitation.

This is fundamentally different from Netflix. A Netflix viewer does not interact with the content. A 2-second buffer delay is invisible. But in a phone call, delay is the enemy, not bandwidth. A voice call at 64 kbps barely registers on a modern network. The problem is not how many bits per second — it is how many milliseconds per bit.

The delay budget

Where do those 150 milliseconds go? Let us decompose the pipeline from the speaker’s mouth to the listener’s ear [7]:

Component	Typical delay	Notes
Encoding	~20 ms	Codec accumulates one frame of audio (160 samples at 8 kHz = 20 ms for G.711) before encoding
Packetization	~0 ms	Negligible — just wrapping encoded audio in RTP/UDP/IP headers
Network (one-way)	30–80 ms	Propagation + queuing. US coast-to-coast ~30 ms propagation; intercontinental ~80 ms
Jitter buffer	50–100 ms	Absorbs variability in packet arrival times — packets may be sent every 20 ms but arrive at irregular intervals due to network queuing. This component is the focus of Act 2.
Decoding	~5–10 ms	Codec reconstructs audio waveform
Total	~105–210 ms	Already brushing or exceeding the 150 ms bound

The arithmetic is merciless. Encoding eats 20 ms. Propagation across the US eats 30 ms. The jitter buffer — which we have not even discussed yet — eats 50 to 100 ms. Decoding eats 10 ms. That is 110 to 160 ms with nothing going wrong. There is almost no slack in the budget [7].
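The budget arithmetic can be checked mechanically. The sketch below is only an illustration: the component ranges are copied from the table above, and the names BUDGET_MS, WALL_MS, and total_range are made up for this example.

```python
# Delay budget from the table above, as (low, high) in milliseconds.
# The dictionary and helper names are illustrative, not from any library.
BUDGET_MS = {
    "encoding": (20, 20),
    "packetization": (0, 0),
    "network": (30, 80),
    "jitter_buffer": (50, 100),
    "decoding": (5, 10),
}
WALL_MS = 150  # one-way conversational bound discussed above


def total_range(budget):
    """Sum the best-case and worst-case delays over all components."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


low, high = total_range(BUDGET_MS)
print(f"total: {low}-{high} ms, best-case slack: {WALL_MS - low} ms")
```

Even in the best case the slack is a few tens of milliseconds, and the jitter buffer is the largest single term that the receiver actually controls.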

This tight budget eliminates options that stored streaming takes for granted:

TCP retransmission is out. A retransmitted packet adds at least one round-trip time — 60 to 160 ms — to the delay. By the time it arrives, the conversation has moved on. The retransmitted audio is not just late; it is harmful if played, because it belongs to a moment that has already passed [4].
Large buffers are out. Netflix’s 60-second buffer would add 60 seconds of conversational delay. Even a 2-second buffer would make the call feel like a satellite phone from the 1990s.
Complex codecs are out. Netflix can spend hours of offline compute encoding each title with H.265 or AV1. VoIP must encode in real time: the codec has 20 ms before the next frame arrives. This limits algorithmic complexity and compression efficiency [7].
B-frames are out. B-frames reference future frames, requiring lookahead that adds latency. Real-time video uses I-frames and P-frames only.

Every component of the stored-streaming architecture — the very tools we spent two lectures studying — is either forbidden or severely constrained.

Pause and reflect

This is the Time invariant at its most extreme. In L10, we identified three application categories defined by when data must arrive: stored (no deadline), live (5–60 s offset), and conversational (< 150 ms). We have now spent two lectures building the stored-streaming architecture. Today’s constraint — 150 ms — forces a fundamentally different design at every layer. Same invariant, same analytical question, completely different answer.

Act 2: Jitter — the real enemy

What is jitter?

The sender transmits voice packets at perfectly regular intervals — one every 20 ms (for G.711 at 64 kbps: 160 samples per packet, 8,000 samples per second) [4]. But the network does not deliver them at regular intervals. Each packet takes a different path through router queues. Packet 1 might arrive in 50 ms. Packet 2 in 53 ms. Packet 3 in 48 ms. Packet 4 in 72 ms. Packet 5 in 49 ms.

The variation in those arrival times is called jitter [4][7]. The sender’s clock is steady; the receiver’s arrivals are not. And the human ear expects steady playback — audio samples delivered at exactly 8,000 per second, with no gaps and no bunching.

Here is the question: if packets arrive irregularly but must be played back regularly, what do you do?

You could play each packet the instant it arrives. But then packet 3 (which arrived early) would play before it should, and packet 4 (which arrived late) would create a gap. The audio would sound choppy and uneven.

You could wait until all packets have arrived, then play them back at the correct rate. But “waiting until all packets have arrived” means waiting until the call is over. That is a buffer. And we just said large buffers are forbidden.

The answer is a compromise: you wait just long enough to absorb the jitter, and no longer. Let us work through the example concretely to see why.

Worked example: computing jitter and the playout tradeoff

Let us put precise numbers on the five packets above. The sender transmits one packet every 20 ms. Each packet experiences a different one-way delay through the network:

Packet	Sent at (S)	One-way delay	Arrives at (R = S + delay)
1	0 ms	50 ms	50 ms
2	20 ms	53 ms	73 ms
3	40 ms	48 ms	88 ms
4	60 ms	72 ms	132 ms
5	80 ms	49 ms	129 ms

Computing jitter (RFC 3550 definition). Jitter measures how much the inter-arrival gap differs from the inter-send gap [4]. If the network added a constant delay to every packet, arrivals would be spaced exactly 20 ms apart (matching the send spacing). Any deviation from 20 ms is jitter:

D(i) = (R_i − R_(i−1)) − (S_i − S_(i−1))

Pair	Inter-arrival gap	Expected gap	D (jitter sample)	Absolute D
1→2	73 − 50 = 23 ms	20 ms	+3 ms (slightly late)	3 ms
2→3	88 − 73 = 15 ms	20 ms	−5 ms (arrived early)	5 ms
3→4	132 − 88 = 44 ms	20 ms	+24 ms (very late)	24 ms
4→5	129 − 132 = −3 ms	20 ms	−23 ms (arrived before previous!)	23 ms

Packet 4 is the outlier: it arrived 44 ms after packet 3, when the receiver expected only 20 ms. Packet 5 actually arrived before packet 4 (reordering) — the inter-arrival gap is negative.
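The jitter samples in the table can be reproduced in a few lines. This is a minimal sketch of the D(i) formula applied to the five example packets; the function name jitter_samples is made up for illustration.

```python
def jitter_samples(send_ms, recv_ms):
    """D(i) = (R_i - R_(i-1)) - (S_i - S_(i-1)) for each consecutive pair."""
    return [
        (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        for i in range(1, len(send_ms))
    ]


send = [0, 20, 40, 60, 80]      # S: send times of packets 1..5 (ms)
recv = [50, 73, 88, 132, 129]   # R: arrival times of packets 1..5 (ms)

samples = jitter_samples(send, recv)
print(samples)                    # signed samples, one per pair
print([abs(d) for d in samples])  # absolute values, as in the last column
```

A negative inter-arrival gap (the 4→5 pair) needs no special case: the formula simply yields a large negative sample.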

Now: what playout delay should the receiver choose? The receiver picks a fixed offset d from the sender’s timeline. Packet i is scheduled for playout at time S_i + d. For a packet to arrive on time: R_i ≤ S_i + d, which means d ≥ one-way delay of that packet.

Packet	One-way delay	Need d ≥
1	50 ms	50 ms
2	53 ms	53 ms
3	48 ms	48 ms
4	72 ms	72 ms
5	49 ms	49 ms

To get ALL five packets on time: d ≥ max(50, 53, 48, 72, 49) = 72 ms.

The tradeoff in numbers. Let us try three different playout delays and see what happens:

d = 50 ms (aggressive — minimize delay):

Packet	Arrives at	Playout time (S + 50)	On time?
1	50 ms	50 ms	✓ (just barely)
2	73 ms	70 ms	✗ LATE by 3 ms → discarded
3	88 ms	90 ms	✓
4	132 ms	110 ms	✗ LATE by 22 ms → discarded
5	129 ms	130 ms	✓

Result: 3 out of 5 on time (60%). Two audible gaps. But only 50 ms added to conversation delay. Total budget: encoding (20) + network (~50 avg) + jitter buffer (50) + decoding (10) = ~130 ms — within the 150 ms wall.

d = 60 ms (moderate):

Packet	Arrives at	Playout time (S + 60)	On time?
1	50 ms	60 ms	✓ (10 ms early)
2	73 ms	80 ms	✓ (7 ms early)
3	88 ms	100 ms	✓ (12 ms early)
4	132 ms	120 ms	✗ LATE by 12 ms → discarded
5	129 ms	140 ms	✓ (11 ms early)

Result: 4 out of 5 on time (80%). One audible gap (packet 4). Total budget: ~140 ms — within the wall.

d = 72 ms (conservative — zero loss):

Packet	Arrives at	Playout time (S + 72)	On time?
1	50 ms	72 ms	✓
2	73 ms	92 ms	✓
3	88 ms	112 ms	✓
4	132 ms	132 ms	✓ (just barely)
5	129 ms	152 ms	✓

Result: 5 out of 5 on time (100%). No gaps. But 72 ms added to conversation. Total budget: encoding (20) + network (~50) + jitter buffer (72) + decoding (10) = ~152 ms — over the 150 ms wall.

The summary:

Playout delay d	Packets on time	Total delay	Within 150 ms?
50 ms	3/5 (60%)	~130 ms	✓
60 ms	4/5 (80%)	~140 ms	✓
72 ms	5/5 (100%)	~152 ms	✗
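The on-time counts in the three tables can be verified with a short simulation. This sketch assumes the same five packets and the rule stated earlier (a packet is on time when R_i ≤ S_i + d); the names are illustrative.

```python
SEND_MS = [0, 20, 40, 60, 80]
DELAY_MS = [50, 53, 48, 72, 49]   # one-way delay of each packet


def on_time_flags(send_ms, delay_ms, d):
    """True where packet i arrives no later than its playout time S_i + d."""
    return [s + delay <= s + d for s, delay in zip(send_ms, delay_ms)]


for d in (50, 60, 72):
    flags = on_time_flags(SEND_MS, DELAY_MS, d)
    print(f"d = {d} ms: {sum(flags)}/{len(flags)} packets on time")

# Smallest fixed d with no late packets is the largest one-way delay.
print("zero-loss d:", max(DELAY_MS), "ms")
```

Sweeping d over a finer range shows the step-like shape of the tradeoff: each late packet is rescued only when d crosses that packet's own one-way delay.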

There is no playout delay that gives both 100% packet delivery AND stays within the 150 ms budget on this particular network path. This is the fundamental tension of real-time audio: you cannot simultaneously have zero gaps and zero excess delay. The jitter buffer size is not a design parameter you choose freely — it is whatever the delay budget leaves after the fixed costs (encoding, network, decoding) consume their share. And on a high-jitter path, there may not be enough left.

This is why adaptive playout exists — instead of picking one d for the entire call, you adjust d at every silence boundary based on recently observed jitter. When the network calms down, d shrinks (less delay). When jitter spikes, d grows (fewer gaps). The goal: stay as close to the 150 ms wall as possible without crossing it too often.

The jitter buffer

A jitter buffer (also called a playout buffer or de-jitter buffer) sits at the receiver. It holds arriving packets for a short time before playing them out, smoothing the irregular arrivals into a regular stream [4][7].

The idea is simple. When the first packet of a talkspurt arrives at time r_1 with timestamp t_1, the receiver schedules it for playout at time r_1 + d, where d is the playout delay. Every subsequent packet i in the same talkspurt is scheduled for playout at time r_1 + d + (t_i - t_1) — that is, at the correct offset from the first packet’s playout time, based on the sender’s timestamps. The receiver reconstructs the sender’s timing, shifted forward by d milliseconds.

The playout delay d is the shock absorber. If d is large enough, every packet arrives before its scheduled playout time. If d is too small, some packets arrive after their playout deadline — they are late and must be discarded, creating audible gaps [7].
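The scheduling rule can be sketched directly. Note that d here is measured from the arrival of the first packet, as in the paragraph above. For simplicity this sketch assumes timestamps are already in milliseconds (real RTP timestamps count samples), and the function names are made up for illustration.

```python
def playout_schedule(arrival_ms, timestamp_ms, d):
    """Playout time of packet i: r_1 + d + (t_i - t_1)."""
    r1, t1 = arrival_ms[0], timestamp_ms[0]
    return [r1 + d + (t - t1) for t in timestamp_ms]


def late_packets(arrival_ms, timestamp_ms, d):
    """0-based indices of packets that arrive after their playout time."""
    schedule = playout_schedule(arrival_ms, timestamp_ms, d)
    return [i for i, (r, p) in enumerate(zip(arrival_ms, schedule)) if r > p]


arrivals = [50, 73, 88, 132, 129]    # r_i for one talkspurt (ms)
timestamps = [0, 20, 40, 60, 80]     # t_i from the sender (ms)

print(playout_schedule(arrivals, timestamps, d=10))
print(late_packets(arrivals, timestamps, d=10))   # the fourth packet misses
print(late_packets(arrivals, timestamps, d=22))   # nothing is late
```

The receiver never needs the sender's absolute clock: only timestamp differences appear in the schedule.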

This is a tradeoff with no free lunch:

Large d → fewer late packets → smoother audio → but more delay added to the conversation, pushing toward the 150 ms wall.
Small d → lower delay → more conversational → but more late packets → more gaps in the audio.

How should the receiver choose d?

Strategy 1: Fixed playout delay

The simplest approach: pick a fixed delay d at the start of the call and never change it. If you estimate that network jitter will stay below 80 ms, set d = 80 ms [7].

This works if your estimate is correct. But networks are not stationary. During a 30-minute call, jitter changes as routes shift, queues fill and drain, and other traffic competes for bandwidth. A fixed delay that is comfortable at minute 1 may be too small at minute 15 (causing gaps) or too large at minute 20 (wasting precious latency budget when the network has calmed down).

Strategy 2: Adaptive playout delay

The better approach: observe the actual jitter and adjust the playout delay over time [7]. But when can you adjust? You cannot change the playout schedule in the middle of a word — that would cause audible glitches (speedup or slowdown of speech). You can only adjust during silence — the gaps between talkspurts.

Human conversation is roughly 40–50% silence [7]. These talkspurt boundaries are natural adjustment points. At the start of each new talkspurt, the receiver computes a fresh playout delay based on recently observed network conditions.

Why not just average all recent packets equally? Because network conditions change: a simple average gives equal weight to a measurement from 10 seconds ago and one from right now. We want recent observations to count more. The solution is an exponentially weighted moving average (EWMA), the same smoothing technique TCP uses for RTT estimation, and the same one DASH’s ABR uses for throughput estimation (L11). EWMA keeps a running estimate that gradually forgets old data [7]:

Delay estimate for packet i (where r_i = receive time, t_i = send timestamp):

d_i = (1 - α) × d_(i-1) + α × (r_i - t_i)

where r_i is the receive time, t_i is the send timestamp, and α ≈ 0.01 (small, so the estimate changes slowly).

Jitter deviation for packet i:

v_i = (1 - β) × v_(i-1) + β × |r_i - t_i - d_i|

where β ≈ 0.01.

Playout delay for the first packet of a new talkspurt:

p = d_i + K × v_i

where K = 3 or 4, providing a safety margin of several times the estimated deviation [7].

The intuition: d_i tracks the average network delay, v_i tracks how much the delay varies, and K × v_i provides headroom for the worst case. When the network is stable (low v_i), the playout delay shrinks, reducing conversational latency. When the network is volatile (high v_i), the playout delay grows, protecting against late arrivals.
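The three update rules fit in a small class. This is a sketch under stated assumptions, not a production jitter buffer: seeding d with the first sample and starting v at zero are choices made here for illustration, and the class name AdaptivePlayout is hypothetical.

```python
class AdaptivePlayout:
    """EWMA delay and deviation estimates for choosing a playout delay."""

    def __init__(self, alpha=0.01, beta=0.01, k=4):
        self.alpha, self.beta, self.k = alpha, beta, k
        self.d = None   # smoothed delay estimate d_i
        self.v = 0.0    # smoothed deviation estimate v_i

    def observe(self, recv_ms, send_ms):
        """Update the estimates with one packet's (r_i, t_i)."""
        sample = recv_ms - send_ms
        if self.d is None:
            self.d = float(sample)   # assumption: seed with the first sample
            return
        # d_i = (1 - alpha) * d_(i-1) + alpha * sample, in incremental form
        self.d += self.alpha * (sample - self.d)
        # v_i = (1 - beta) * v_(i-1) + beta * |sample - d_i|
        self.v += self.beta * (abs(sample - self.d) - self.v)

    def playout_delay(self):
        """p = d_i + K * v_i, applied at the start of the next talkspurt."""
        return self.d + self.k * self.v


est = AdaptivePlayout()
for i in range(100):                  # a calm stretch: constant 50 ms delay
    est.observe(recv_ms=20 * i + 50, send_ms=20 * i)
print(est.playout_delay())
est.observe(recv_ms=2150, send_ms=2000)   # one 150 ms outlier
print(est.playout_delay())
```

With α and β this small, a single outlier moves the playout delay by only a few milliseconds; a sustained shift is needed to move it substantially.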

Pause and notice

This adaptive playout algorithm is a closed-loop control system — the same structure we keep encountering. TCP’s congestion control observes loss and adjusts the sending rate. DASH’s ABR observes throughput and adjusts the bitrate. VoIP’s jitter buffer observes arrival-time variability and adjusts the playout delay. The sensor is different, the actuator is different, the timescale is different, but the control structure is identical: observe → estimate → decide → act → repeat.

And notice where the decision is placed: entirely at the receiver. The sender does not know the receiver’s jitter buffer depth. The receiver makes this decision unilaterally, using only the timestamps the sender embeds in each packet. This is the same decision-placement pattern as DASH: the endpoint with the most information (the one experiencing the network conditions) makes the call.

Act 3: From sound wave to packet — the digitization chain

Before we discuss protocols, we need to understand exactly how a sound wave becomes a packet with a specific size, data rate, and timing. Every number in VoIP — 64 kbps, 160 bytes, 50 packets per second, 20 ms per frame — derives from a chain of first-principles decisions. None of these numbers are arbitrary.

Step 1: Sampling — how often to measure the wave

Sound is a continuous analog signal. To transmit it digitally, we must convert it to a sequence of discrete numbers. The fundamental question: how often must we sample the signal to faithfully capture it?

The answer comes from the Nyquist-Shannon sampling theorem (Shannon, 1949) [11]: a continuous signal containing no frequencies higher than F Hz can be perfectly reconstructed from samples taken at a rate of at least 2F samples per second. Below this rate, the reconstruction loses information (aliasing). At or above this rate, the original signal can be recovered exactly.

Human speech contains frequencies from roughly 300 Hz to 3,400 Hz. This range — the telephone voice band — was standardized by the ITU as the passband for telephony (ITU-T P.310, ITU-T G.712) [12][13]. Frequencies below 300 Hz (chest resonance, room rumble) and above 3,400 Hz (sibilants, harmonics) are filtered out because they contribute relatively little to intelligibility while consuming bandwidth.

Applying Nyquist: to capture frequencies up to 3,400 Hz, we need at least 2 × 3,400 = 6,800 samples per second. The ITU-T G.711 standard [14] specifies a sampling rate of 8,000 samples per second (8 kHz) — slightly above the Nyquist minimum, providing a guard band for the practical roll-off of anti-aliasing filters.

Step 2: Quantization — how precisely to record each sample

Each sample captures the amplitude of the wave at one instant. We must represent this amplitude as a fixed number of bits. G.711 uses 8 bits per sample, giving 2^8 = 256 possible amplitude levels [14]. The continuous amplitude is rounded to the nearest level — this introduces a small quantization error, but 256 levels are sufficient for intelligible speech.

(G.711 actually uses logarithmic companding — μ-law in North America/Japan, A-law elsewhere — which allocates more levels to quiet sounds and fewer to loud ones, matching human perception. But the key number is 8 bits per sample.)
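To see what companding buys, here is a sketch of the continuous μ-law curve followed by uniform 8-bit quantization. This is the textbook formula with μ = 255, not a bit-exact G.711 encoder (the standard uses a segmented approximation and its own bit layout); the function names are made up for illustration.

```python
import math

MU = 255  # mu-law parameter


def mu_law_compress(x):
    """Map a sample in [-1, 1] to [-1, 1] using the continuous mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def quantize_8bit(y):
    """Round a companded value in [-1, 1] to one of 256 integer codes."""
    return max(0, min(255, round((y + 1.0) / 2.0 * 255)))


for amplitude in (0.001, 0.01, 0.1, 1.0):
    y = mu_law_compress(amplitude)
    print(f"amplitude {amplitude:>5}: companded {y:.3f}, code {quantize_8bit(y)}")
```

A sample at 1% of full scale lands more than a fifth of the way up the companded range, which is how quiet speech ends up with far more than 1% of the available levels.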

Step 3: Data rate — the arithmetic

8,000 samples/sec × 8 bits/sample = 64,000 bits/sec = 64 kbps

This is the G.711 data rate. Every second of speech produces exactly 8,000 bytes of digitized audio [14].

Step 4: Packetization — how much audio per packet

We do not send one sample per packet — that would be 8,000 packets per second, each carrying only 1 byte of audio and 40 bytes of headers (IP + UDP + RTP). The overhead would be catastrophic: 40/41 = 98% headers, 2% audio.

Instead, we batch samples into frames. The standard frame size is 20 ms of audio (RFC 3551, Section 4.2) [15]:

In 20 ms, the number of samples: 8,000 samples/sec × 0.020 sec = 160 samples
At 1 byte per sample (8-bit G.711): 160 bytes of audio payload per packet
Packet rate: 1,000 ms / 20 ms = 50 packets per second

Why 20 ms? It is an engineering tradeoff between packetization delay (time to fill a frame) and header overhead [15]:

Frame size	Samples	Payload	Packets/sec	Overhead ratio	Packetization delay
10 ms	80	80 bytes	100	40/(40+80) = 33%	10 ms
20 ms	160	160 bytes	50	40/(40+160) = 20%	20 ms
30 ms	240	240 bytes	33	40/(40+240) = 14%	30 ms
40 ms	320	320 bytes	25	40/(40+320) = 11%	40 ms

At 10 ms frames, one-third of all transmitted data is headers. At 20 ms, overhead drops to 20% — a significant improvement. Going to 30 ms saves only 6 more percentage points but adds 10 ms of irrecoverable delay (and each lost packet destroys 50% more speech). The 20 ms default balances efficiency against the tight 150 ms delay budget: 20 ms of packetization consumes only 13% of the total budget, leaving room for network propagation and jitter absorption [6][15].
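The frame-size table follows from three constants, so it is easy to regenerate or extend to other frame sizes. A minimal sketch, assuming G.711 (1 byte per sample) and 40 bytes of IP + UDP + RTP headers; the function name frame_stats is illustrative.

```python
SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 1      # 8-bit G.711
HEADER_BYTES = 40         # IP 20 + UDP 8 + RTP 12


def frame_stats(frame_ms):
    """Return (samples, payload bytes, packets/sec, overhead ratio, wire kbps)."""
    samples = SAMPLE_RATE_HZ * frame_ms // 1000
    payload = samples * BYTES_PER_SAMPLE
    packets_per_sec = 1000 / frame_ms
    overhead = HEADER_BYTES / (HEADER_BYTES + payload)
    wire_kbps = packets_per_sec * (HEADER_BYTES + payload) * 8 / 1000
    return samples, payload, packets_per_sec, overhead, wire_kbps


for ms in (10, 20, 30, 40):
    samples, payload, pps, overhead, kbps = frame_stats(ms)
    print(f"{ms} ms: {payload} B payload, {pps:.0f} pkt/s, "
          f"{overhead:.0%} overhead, {kbps:.1f} kbps on the wire")
```

The last column makes the cost of small frames concrete: the audio is 64 kbps in every row, and only the header share of the wire rate changes.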

Step 5: Total packet size and wire rate

Each packet carries audio plus protocol headers:

Component	Size	Purpose
Audio payload (G.711)	160 bytes	20 ms of speech (160 samples × 1 byte)
RTP header	12 bytes	Sequence number, timestamp, payload type
UDP header	8 bytes	Source/dest port, length, checksum
IP header	20 bytes	Source/dest IP, TTL, protocol, etc.
Total	200 bytes	Per packet

Total bandwidth on the wire: 50 packets/sec × 200 bytes × 8 bits/byte = 80,000 bits/sec = 80 kbps. Of this, 64 kbps is audio and 16 kbps (20%) is protocol overhead [4][14][15].

The complete chain

Continuous sound wave (analog)
  → Filter to 300–3,400 Hz voice band [12][13]
    → Sample at 8,000 Hz (Nyquist for 3.4 kHz) [11][14]
      → Quantize each sample to 8 bits (256 levels, μ-law/A-law) [14]
        → 8,000 × 8 = 64,000 bits/sec = 64 kbps
          → Batch 160 samples (20 ms) into one frame [15]
            → 160 bytes audio + 40 bytes headers = 200 bytes/packet
              → 50 packets/sec × 200 bytes × 8 bits/byte = 80 kbps on the wire

Every number in VoIP derives from this chain. The 8 kHz sampling rate follows from the 3.4 kHz voice band via Nyquist. The 64 kbps data rate follows from 8 kHz × 8 bits. The 160-byte payload follows from 20 ms × 8,000 samples/sec. The 50 packets/sec follows from 1,000 ms / 20 ms. The 80 kbps wire rate follows from 50 × 200 × 8. None of these are design choices that could easily be different — they are consequences of human hearing, sampling theory, and the delay budget.

Act 4: RTP/RTCP — the protocol layer

Why not just use UDP?

The jitter buffer needs three pieces of information to work:

Which packets are missing? (To detect loss and trigger concealment.)
When was each packet created? (To reconstruct the sender’s timing.)
What codec was used? (To decode the audio.)

Raw UDP provides none of these. UDP is a bare envelope: source port, destination port, length, checksum, payload. No sequence numbers, no timestamps, no codec identification [4].

In L10, Sanjay introduced RTP as the protocol that replaced TCP for real-time media. Today we go deeper into how these header fields actually work and why RTCP completes the feedback loop.

In the early 1990s, researchers on the MBone (Multicast Backbone) were streaming audio and video conferences over the Internet. Each application reinvented these primitives independently. In 1996, Henning Schulzrinne and colleagues standardized the solution: RTP (Real-time Transport Protocol) [4].

RTP: three problems solved

RTP adds a thin header — just 12 bytes — on top of each UDP packet [4]:

Sequence number (16 bits): Increments by one for each packet. Gaps in the sequence indicate lost packets. The receiver does not request retransmission (that would add delay) — it simply notes the gap and triggers concealment.
Timestamp (32 bits): Encodes the sampling instant of the first sample in the packet. For G.711 at 8 kHz, the timestamp increments by 160 every 20 ms (160 samples × 125 μs per sample = 20 ms). The receiver uses these timestamps to schedule playout — independent of when the packet actually arrived on the network.
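Payload type (7 bits): Identifies the codec. For statically assigned types such as G.711 or G.729, the receiver knows how to decode the payload from this field alone. Codecs such as Opus use dynamic payload types (96 to 127), whose meaning is agreed during session setup.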

The concrete G.711 packet we derived above — 200 bytes total, 50 packets/sec, 80 kbps on the wire — is the canonical example of an RTP audio stream [4][14][15].
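The fixed 12-byte header is simple enough to pack by hand. The sketch below builds and parses only the fields discussed above plus the SSRC, with padding, extension, CSRC count, and marker all left at zero; the helper names are made up for illustration and this is not a complete RTP implementation.

```python
import struct

RTP_VERSION = 2
PT_PCMU = 0  # static payload type for G.711 mu-law


def build_rtp_packet(seq, timestamp, ssrc, payload, payload_type=PT_PCMU):
    """Prefix a payload with a minimal 12-byte RTP header."""
    byte0 = RTP_VERSION << 6        # V=2, P=0, X=0, CC=0
    byte1 = payload_type & 0x7F     # M=0, 7-bit payload type
    header = struct.pack("!BBHII", byte0, byte1,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
    return header + payload


def parse_rtp_header(packet):
    """Decode the fixed header fields from the first 12 bytes."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {"version": byte0 >> 6, "payload_type": byte1 & 0x7F,
            "seq": seq, "timestamp": timestamp, "ssrc": ssrc}


# Packet number 7 of a G.711 stream: timestamp advances 160 per packet.
packet = build_rtp_packet(seq=7, timestamp=7 * 160, ssrc=0x1234ABCD,
                          payload=bytes(160))
print(len(packet), parse_rtp_header(packet))
```

Adding the 8-byte UDP header and 20-byte IP header to these 172 bytes gives the 200-byte packet from Act 3.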

RTCP: the feedback channel

RTP is one-directional: sender to receiver. But the sender needs to know how the receiver is doing. Is the network losing packets? Is jitter increasing? Is the call degrading?

RTCP (RTP Control Protocol) provides the reverse channel [4]. Periodically — every few seconds — each receiver sends a report back to the sender containing:

Fraction lost: What percentage of packets were lost since the last report? Above roughly 5% loss, degradation becomes noticeable even with concealment, and above 10% intelligibility collapses [7].
Interarrival jitter: How variable are the packet arrival times? High jitter means the jitter buffer is under stress.
Round-trip time: Computed from timestamps in the sender’s and receiver’s reports. The sender learns the path delay.

The sender uses these reports to adapt: if loss is rising, it might add Forward Error Correction (FEC). If jitter is extreme, it might reduce the sending rate to ease congestion. If the path has degraded beyond repair, it might reroute through a relay server.
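Two of the report fields are easy to compute at the receiver. The sketch below follows the RTP specification's running jitter estimate (each new |D| sample is blended in with gain 1/16) and its 8-bit fixed-point encoding of the loss fraction. Working in milliseconds rather than RTP timestamp units is a simplification made here, and the function names are illustrative.

```python
def interarrival_jitter(send, recv):
    """Running jitter J, updated per packet as J += (|D| - J) / 16."""
    j = 0.0
    for i in range(1, len(send)):
        d = (recv[i] - recv[i - 1]) - (send[i] - send[i - 1])
        j += (abs(d) - j) / 16
    return j


def fraction_lost(expected, received):
    """Loss since the last report as an 8-bit fixed-point fraction (x/256)."""
    lost = max(0, expected - received)
    return (lost << 8) // expected if expected else 0


send = [0, 20, 40, 60, 80]
recv = [50, 73, 88, 132, 129]
print(f"jitter estimate: {interarrival_jitter(send, recv):.2f} ms")

code = fraction_lost(expected=50, received=49)   # one second at 2% loss
print(f"fraction lost field: {code} (about {code / 256:.1%})")
```

The 1/16 gain makes the reported jitter a heavily smoothed figure: after the five example packets it is still only a few milliseconds, even though individual samples exceeded 20 ms.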

The RTCP bandwidth constraint

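RTCP has a strict rule: control traffic must not exceed 5% of the session bandwidth [4]. For a 64 kbps audio call, that is ~3.2 kbps of RTCP, or about 400 bytes per second. At typical report sizes (~100 bytes), the cap alone would permit several reports per second, but the RTP specification also recommends a minimum reporting interval of 5 seconds, so in a small session each participant reports roughly once every 5 seconds.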

This means the feedback loop is slow: the sender learns about problems seconds after they occur. Compare this to TCP, which detects a lost segment from duplicate ACKs within roughly one RTT (tens of milliseconds). RTCP’s adaptation timescale is orders of magnitude slower. This is a deliberate tradeoff: fast feedback would consume too much bandwidth on low-bitrate calls. But it means that rapid transient congestion, such as a 500 ms burst of packet loss, may come and go before RTCP even reports it [4].

Four-invariant reading

Through the lens of our framework [4]:

State: RTP maintains minimal state — sequence numbers for loss detection, timestamps for timing reconstruction, SSRC identifiers for stream demultiplexing. No belief about the network; just facts about packets.
Time: The timestamp is the core mechanism — it decouples media timing from network timing. The playout clock runs on sender timestamps, not receiver arrival times.
Coordination: End-to-end. No router participates. The sender encodes, the receiver reconstructs, and RTCP carries periodic feedback. The network is treated as a black box.
Interface: Thin layer on UDP. RTP does not provide reliability, does not reserve resources, does not guarantee QoS. It provides the minimal metadata for applications to cope with a best-effort network.

Key takeaway — RTP in one sentence: RTP is deliberately not TCP for audio. It solves exactly three problems (loss detection via sequence numbers, timing via timestamps, codec identification via payload type) and leaves everything else — adaptation, concealment, jitter management — to the application. The cost of this minimalism: RTCP feedback arrives every few seconds (5% bandwidth cap), so the sender adapts slowly. But for voice at 64 kbps, slow adaptation is acceptable — the constraint is latency, not throughput.

Act 5: Loss without retransmission

The problem

In stored streaming, lost packets are retransmitted by TCP. In real-time audio, retransmission is forbidden — it takes too long. But packets do get lost: typical Internet paths lose 1–5% of packets under moderate congestion [7]. At 50 packets per second, 2% loss means one packet lost per second — a gap every second in the audio. What do you do?

Packet loss concealment (PLC)

The simplest technique: when the receiver detects a missing packet (gap in sequence numbers), the codec interpolates from surrounding packets [7]. Options range from crude to sophisticated:

Repeat the last packet. Play the previous 20 ms of audio again. Crude but effective for speech — consecutive frames are highly correlated (temporal redundancy, the same property codecs exploit for compression). Listeners barely notice a single repeated frame. Multiple consecutive lost packets become audible as a “stuck” sound.
Interpolate. Use the pitch and spectral envelope of surrounding frames to synthesize a plausible replacement. Modern codecs like Opus do this well — a single lost frame is nearly inaudible [7].
Comfort noise. During silence periods, the codec generates artificial background noise matching the ambient characteristics. Without this, the line “goes dead” during silence — disconcerting for the listener, who may think the call dropped.

PLC works remarkably well at low loss rates. At 1% loss, most listeners cannot distinguish a concealed call from a lossless one. At 5%, degradation becomes noticeable. Above 10%, intelligibility collapses regardless of concealment quality [7].
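The crudest strategy, repeating the last frame, takes only a few lines. This sketch assumes frames are keyed by sequence number, ignores sequence-number wraparound, and uses an all-zero placeholder for silence (real codecs define their own silence pattern); the names and the max_repeats cutoff are illustrative choices.

```python
FRAME_BYTES = 160
SILENCE = bytes(FRAME_BYTES)   # placeholder silence frame (an assumption)


def conceal(frames, first_seq, last_seq, max_repeats=2):
    """Build the playout stream for first_seq..last_seq.

    Short gaps are filled by repeating the last good frame; once a gap
    exceeds max_repeats frames, fall back to silence rather than a
    "stuck" sound.
    """
    out, last_good, repeats = [], None, 0
    for seq in range(first_seq, last_seq + 1):
        if seq in frames:
            last_good, repeats = frames[seq], 0
            out.append(frames[seq])
        elif last_good is not None and repeats < max_repeats:
            repeats += 1
            out.append(last_good)
        else:
            out.append(SILENCE)
    return out


received = {1: b"a" * 160, 2: b"b" * 160, 4: b"d" * 160}   # packet 3 lost
stream = conceal(received, first_seq=1, last_seq=4)
print([frame[:1] for frame in stream])   # first byte of each played frame
```

Interpolating codecs replace the repeat step with synthesis from pitch and spectral envelope, but the control flow (detect the gap, fill it, give up after a few frames) is the same.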

Forward Error Correction (FEC)

PLC is reactive — it patches over loss after it happens. FEC is proactive — it sends redundant data so the receiver can reconstruct lost packets without retransmission [7].

The simplest FEC scheme for VoIP: with every packet n, include a lower-quality copy of packet n - 1. If packet n - 1 was lost, the receiver recovers it from the redundant copy embedded in packet n. The cost: increased bandwidth (sending ~1.5x the data) and one packet of additional delay (the redundant copy of packet n - 1 arrives with packet n, which is 20 ms later) [7].

More sophisticated schemes use XOR-based parity: every k packets, send one parity packet that is the XOR of the previous k data packets. Any single lost packet among the k can be reconstructed from the parity and the other k - 1 data packets. The tradeoff: larger k means less bandwidth overhead but requires that at most one packet in every k is lost (burst loss defeats this scheme).
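The XOR scheme is compact enough to show in full. A minimal sketch, assuming equal-length packets and a group of k = 4; the function names are made up for illustration.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets):
    """Parity packet: XOR of all k data packets in the group."""
    return reduce(xor_bytes, packets)


def recover(received, parity):
    """Rebuild the single missing packet (marked None) in a group.

    Returns (index, payload). Raises ValueError unless exactly one
    packet is missing, since one parity packet repairs only one loss.
    """
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        raise ValueError("XOR parity repairs exactly one loss per group")
    present = [p for p in received if p is not None]
    return missing[0], reduce(xor_bytes, present, parity)


group = [bytes([i]) * 160 for i in (1, 2, 3, 4)]   # k = 4 data packets
parity = make_parity(group)

damaged = list(group)
damaged[2] = None                                  # third packet lost
index, payload = recover(damaged, parity)
print(index, payload == group[2])
```

With k = 4 the bandwidth overhead is 25%, but the receiver may have to wait for the rest of the group and the parity packet before it can repair a loss, which is where the extra delay comes from.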

The delay-bandwidth tradeoff is visible again: FEC adds both bandwidth overhead and delay, but it recovers from loss without retransmission. In the tight 150 ms budget, the 20 ms added by simple redundancy may be acceptable; more complex schemes may not be.

Act 6: Modern real-time systems

The NAT problem — and why RTP’s assumptions broke

RTP assumed two things in 1996 [4]: endpoints have reachable IP addresses, and UDP travels freely between them. Both assumptions shattered within a few years.

NATs (Network Address Translators) became ubiquitous by the early 2000s. Your laptop’s IP address — 192.168.1.42 — is not reachable from the public Internet. The NAT device maps your outgoing packets to a public address, but incoming packets from unknown sources are dropped. Two users behind NATs cannot send each other UDP packets directly — neither knows the other’s public address and port, and neither NAT has a mapping for the other’s traffic.

Firewalls compound the problem. Many corporate and institutional networks block all UDP traffic that is not DNS. An RTP stream is dead on arrival.

By the mid-2000s, an estimated 80% of Internet-connected devices sat behind at least one NAT. RTP’s architecture — designed for the open MBone research network — could not reach most users.

WebRTC: real-time in the browser (2011–2021)

WebRTC (Web Real-Time Communication) was designed to solve exactly this problem: bring Zoom-quality audio and video to any web browser, without plugins, without special software, through NATs and firewalls [5][10].

The architecture rests on three pillars:

1. ICE (Interactive Connectivity Establishment): A systematic protocol for discovering how two endpoints can communicate [5]. When you join a WebRTC call, your browser:

Gathers candidates: your local IP, your NAT’s public IP (discovered via STUN), and a relay server’s IP (via TURN).
STUN (Session Traversal Utilities for NAT): Your browser sends a packet to a public STUN server. The server replies with your NAT’s public IP and port — the address the outside world sees. Now you know what address to advertise.
TURN (Traversal Using Relays around NAT): If direct communication fails (symmetric NATs, firewalls blocking UDP), a TURN server acts as a relay — both endpoints send media to the TURN server, which forwards it. This always works but adds latency and server cost.

ICE tries candidates in order of preference: direct peer-to-peer first (lowest latency, zero server cost), then STUN-assisted (still peer-to-peer, slightly more setup), then TURN relay (always works, highest latency and cost) [5]. The first path that succeeds wins.
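The preference order comes from a numeric priority attached to each candidate. The sketch below uses the priority formula and recommended type preferences from the ICE specification (RFC 8445). The candidate addresses are illustrative placeholders, and real ICE goes further by forming candidate pairs and running connectivity checks on them, which is not shown here.

```python
# Recommended type preferences: host, peer-reflexive, server-reflexive, relay.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(cand_type, local_pref=65535, component_id=1):
    """priority = 2^24 * type_pref + 2^8 * local_pref + (256 - component_id)."""
    return ((TYPE_PREFERENCE[cand_type] << 24)
            + (local_pref << 8)
            + (256 - component_id))


# Hypothetical candidates gathered by one browser.
candidates = [
    ("relay", "198.51.100.9:50000"),   # allocated on a TURN server
    ("host", "192.168.1.42:54321"),    # local interface
    ("srflx", "203.0.113.7:61000"),    # public mapping learned via STUN
]

ordered = sorted(candidates, key=lambda c: candidate_priority(c[0]), reverse=True)
for cand_type, address in ordered:
    print(f"{candidate_priority(cand_type):>10}  {cand_type:<5}  {address}")
```

Because the type preference occupies the top bits, a relayed candidate can never outrank a direct one, no matter how the lower fields are set.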

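2. DTLS (Datagram TLS) and SRTP (Secure RTP) for security: A DTLS handshake between the endpoints establishes the keys, and SRTP uses them to encrypt every media packet. In a peer-to-peer call the media is therefore encrypted end to end: even a TURN relay on the path cannot decrypt the audio or video, because it is forwarding opaque encrypted packets. This is a fundamental design commitment: real-time communication must be private by default [5].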

3. Browser APIs for simplicity: The getUserMedia API accesses the camera and microphone. The RTCPeerConnection API manages the entire connection lifecycle: ICE negotiation, DTLS handshake, RTP/RTCP media flow. A developer can build a basic video call in about 50 lines of JavaScript [5]. The complexity of NAT traversal, codec negotiation, jitter buffering, and encryption is hidden behind these two APIs.

Scaling WebRTC: the SFU architecture

Peer-to-peer works for two-person calls. What about a meeting with 20 participants? In a mesh topology, each participant must encode and upload their stream N - 1 times and download N - 1 streams. At N = 20, that is 19 uploads per participant — impossible on a typical home connection [5].

The solution is a Selective Forwarding Unit (SFU) [5]. Each participant sends one stream to the SFU. The SFU forwards each incoming stream to the other participants — no transcoding, no decoding, just packet forwarding. The “selective” part: if a participant has limited bandwidth, the SFU can skip forwarding some streams or choose lower-resolution versions (via simulcast — see below), rather than blindly flooding all streams to all recipients. Upload cost per participant: one stream. Download cost: N - 1 streams (but download bandwidth is typically abundant).
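The stream counts for the two topologies are easy to check. The sketch below counts uploads and downloads per participant; the 1.5 Mbps per-stream bitrate is an illustrative assumption, not a figure from the text.

```python
def mesh_streams(n: int) -> dict:
    """Full mesh: every participant sends to, and receives from, every other one."""
    return {"uploads": n - 1, "downloads": n - 1}


def sfu_streams(n: int) -> dict:
    """SFU: one upload to the server; the server forwards the other n - 1 streams."""
    return {"uploads": 1, "downloads": n - 1}


STREAM_MBPS = 1.5  # assumed bitrate of one video stream (illustrative only)

for n in (2, 5, 20):
    mesh, sfu = mesh_streams(n), sfu_streams(n)
    print(f"N={n:2d}  mesh upload={mesh['uploads'] * STREAM_MBPS:5.1f} Mbps  "
          f"SFU upload={sfu['uploads'] * STREAM_MBPS:4.1f} Mbps  "
          f"download (both)={sfu['downloads'] * STREAM_MBPS:5.1f} Mbps")
```

Under that assumed bitrate, a 20-person mesh needs 28.5 Mbps of upload per participant, while the SFU keeps upload at a single stream no matter how large the meeting grows.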

Key SFU optimizations:

Simulcast: The sender simultaneously encodes at multiple resolutions (e.g., 720p, 360p, 180p). The SFU chooses which resolution to forward to each receiver based on their bandwidth. A participant on fiber gets 720p; a participant on cellular gets 180p [5].
Last-N forwarding: Only forward video for the active speaker and the N most recent speakers. Other participants receive audio only, saving massive bandwidth [5].
RTCP-based bitrate adaptation: The SFU monitors RTCP feedback from receivers and instructs senders to lower their bitrate when congestion is detected — closing the feedback loop at millisecond-to-second timescales [5].
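The first two optimizations reduce to small selection rules inside the SFU. The sketch below is one plausible way to write them, not a description of any production SFU: the layer bitrates, the 0.8 headroom factor, and both function names are assumptions made for illustration.

```python
from typing import List, Optional

# Hypothetical simulcast ladder: (label, bitrate in kbps), highest quality first.
LAYERS = [("720p", 1500), ("360p", 500), ("180p", 150)]


def select_layer(receiver_kbps: float, headroom: float = 0.8) -> Optional[str]:
    """Pick the best layer that fits within a fraction of the receiver's bandwidth.
    Returns None when even the lowest layer does not fit (audio only)."""
    budget = receiver_kbps * headroom
    for label, kbps in LAYERS:
        if kbps <= budget:
            return label
    return None


def last_n_video(speaker_history: List[str], n: int) -> List[str]:
    """speaker_history is ordered oldest to newest; the last entry is the
    active speaker. Returns the active speaker plus the n most recent others."""
    recent: List[str] = []
    for speaker in reversed(speaker_history):
        if speaker not in recent:
            recent.append(speaker)
        if len(recent) == n + 1:
            break
    return recent


print(select_layer(5000), select_layer(800), select_layer(300), select_layer(100))
print(last_n_video(["ann", "bob", "cat", "bob", "dan"], n=2))
```

Both rules run per receiver and per forwarding decision, and neither touches the media payload, which is what keeps the SFU cheap compared with a unit that decodes and re-encodes.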

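The SFU is not an MCU (Multipoint Control Unit). An MCU decodes all incoming streams, composites them into a single view, and re-encodes. That is computationally expensive, and because it requires access to the unencrypted media, it rules out end-to-end encryption. An SFU forwards packets without decoding the media. It avoids the transcoding cost, scales better, and remains compatible with end-to-end encryption of the media payload [5].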

Zoom: building on and extending WebRTC concepts

WebRTC provides the open standard; production platforms like Zoom extend these ideas with proprietary codecs, global relay networks, and millisecond-scale adaptation that go beyond what a generic WebRTC implementation offers. Zoom is not open-source, but its publicly described architecture reveals several such design choices [7]:

Adaptive codec switching. Zoom does not just adapt the bitrate — it switches codecs entirely based on network conditions. Good network: VP9 at 3 Mbps (high quality, high complexity). Moderate congestion: H.264 at 500 kbps (lower quality, lower complexity). Severe congestion: low-bitrate codec with aggressive FEC. The switching happens in real time, driven by RTCP-like feedback [7].

Adaptive jitter buffer. Zoom adjusts the jitter buffer between 50 and 200 ms based on observed network conditions [7]. If loss is detected, the buffer grows (accepting slightly more latency to prevent gaps). If the network improves, the buffer shrinks (reducing latency). This is the same adaptive playout algorithm from Act 2, but integrated with codec and FEC decisions — a multi-dimensional control loop.
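A minimal sketch of such a buffer follows. It is not Zoom’s algorithm, which is not public. It assumes an exponentially weighted estimate of delay variation, sizes the buffer as a multiple of that estimate, and clamps the result to the 50–200 ms range quoted above; the class name, the gain of 0.1, and the multiplier of 4 are illustrative choices.

```python
from typing import Optional


class AdaptiveJitterBuffer:
    """Sizes the playout buffer from a smoothed estimate of delay variation."""

    def __init__(self, min_ms: float = 50.0, max_ms: float = 200.0,
                 gain: float = 0.1, k: float = 4.0) -> None:
        self.min_ms, self.max_ms = min_ms, max_ms
        self.gain, self.k = gain, k
        self.avg_delay: Optional[float] = None  # smoothed one-way delay (ms)
        self.deviation = 0.0                    # smoothed |delay - average| (ms)

    def on_packet(self, delay_ms: float) -> float:
        """Update the estimates with one packet's delay; return the new target."""
        if self.avg_delay is None:
            self.avg_delay = delay_ms
        else:
            self.avg_delay += self.gain * (delay_ms - self.avg_delay)
            error = abs(delay_ms - self.avg_delay)
            self.deviation += self.gain * (error - self.deviation)
        return self.target_ms()

    def target_ms(self) -> float:
        """Buffer depth: k deviations, clamped to [min_ms, max_ms]."""
        return min(self.max_ms, max(self.min_ms, self.k * self.deviation))


buf = AdaptiveJitterBuffer()
for _ in range(100):                 # calm network: constant delay
    buf.on_packet(80.0)
print("calm:", buf.target_ms())
for i in range(200):                 # jittery network: delay swings widely
    buf.on_packet(20.0 if i % 2 == 0 else 220.0)
print("jittery:", buf.target_ms())
```

The clamp is where the latency budget enters: however bad the jitter gets, the buffer never spends more than 200 ms of the conversational budget, and on a calm path it never drops below its 50 ms floor.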

Video frame-rate adaptation. Under severe congestion, Zoom progressively reduces video frame rate: 30 fps → 15 fps → 7 fps → 1 fps (effectively a slideshow). Audio is always prioritized — the voice never drops out even when video is frozen [7].

Global relay network. Zoom operates a worldwide network of data centers. If the direct path between two participants is congested or unreliable, Zoom automatically reroutes media through the nearest relay — similar to WebRTC’s TURN, but proprietary, globally distributed, and continuously optimized. Users never configure this; the system detects path quality and switches transparently [7].

The adaptation timescale. This is the critical difference from DASH. Netflix adapts once per segment, roughly every 10 seconds. Zoom adapts on every packet arrival, at timescales of milliseconds to hundreds of milliseconds. The loop is: receive packet → update jitter estimate → update loss estimate → decide whether to adjust buffer, codec, FEC, or bitrate → apply. The faster loop matches the tighter deadline: with 150 ms of total budget, you cannot wait 10 seconds to react.
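The per-packet loop can be sketched in a few lines. The jitter update below is the interarrival jitter estimator from RFC 3550 (gain 1/16); everything else is an assumption made for illustration: the loss estimator, the thresholds in `decide`, and the action names. The sketch also assumes in-order arrival and ignores sequence-number wraparound.

```python
from typing import List, Optional


class ReceiverStats:
    """Estimators updated on every packet arrival."""

    def __init__(self, loss_gain: float = 0.1) -> None:
        self.jitter_ms = 0.0
        self.loss_rate = 0.0              # smoothed fraction of packets lost
        self._loss_gain = loss_gain
        self._prev_transit: Optional[float] = None
        self._prev_seq: Optional[int] = None

    def on_packet(self, seq: int, send_ms: float, recv_ms: float) -> None:
        # Jitter: J += (|D| - J) / 16, where D is the change in transit time.
        # Clocks need not be synchronized; a constant offset cancels in D.
        transit = recv_ms - send_ms
        if self._prev_transit is not None:
            d = abs(transit - self._prev_transit)
            self.jitter_ms += (d - self.jitter_ms) / 16.0
        self._prev_transit = transit

        # Loss: each missing sequence number counts as 1, each arrival as 0.
        lost = 0 if self._prev_seq is None else max(0, seq - self._prev_seq - 1)
        for _ in range(lost):
            self.loss_rate += self._loss_gain * (1.0 - self.loss_rate)
        self.loss_rate += self._loss_gain * (0.0 - self.loss_rate)
        self._prev_seq = seq


def decide(stats: ReceiverStats) -> List[str]:
    """Map the estimates to actions (thresholds are hypothetical)."""
    actions: List[str] = []
    if stats.loss_rate > 0.05:
        actions.append("add_fec")
    if stats.jitter_ms > 30.0:
        actions.append("grow_buffer")
    elif stats.jitter_ms < 10.0 and stats.loss_rate < 0.01:
        actions.append("shrink_buffer")
    return actions


stats = ReceiverStats()
for seq, send_ms, recv_ms in [(0, 0.0, 50.0), (1, 20.0, 86.0), (4, 80.0, 130.0)]:
    stats.on_packet(seq, send_ms, recv_ms)
    print(seq, round(stats.jitter_ms, 4), round(stats.loss_rate, 4), decide(stats))
```

Each arrival costs a handful of arithmetic operations, which is what makes it feasible to run the loop per packet rather than per segment.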

Act 7: Cloud gaming — the extreme frontier

Below 50 milliseconds

Cloud gaming pushes the time constraint even further than VoIP. In a cloud gaming system (Xbox Cloud Gaming, NVIDIA GeForce NOW), the game runs on a remote server. The server renders each video frame, encodes it, and streams it to the player. The player’s controller inputs are sent back to the server. The loop is:

1. Player presses button
2. Input travels to server (~20–40 ms)
3. Server processes input and renders next frame (~16 ms at 60 fps)
4. Frame is encoded (~5 ms with hardware encoder)
5. Encoded frame travels to player (~20–40 ms)
6. Frame is decoded and displayed (~5 ms)

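Total: ~66–106 ms of input-to-display delay. Gamers perceive input lag above 50–100 ms as sluggish and above 150 ms as unplayable for fast-paced games [7], so this loop already sits in the range where lag becomes noticeable, with little margin left for network variability.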
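The arithmetic behind that total, and what it implies for the network, can be checked directly. The stage figures below are the estimates listed above; the helper names and the idea of splitting the leftover budget evenly between the two network directions are assumptions made for illustration.

```python
# (stage, low ms, high ms, crosses the network?)
STAGES = [
    ("input travels to server", 20, 40, True),
    ("server renders next frame", 16, 16, False),
    ("frame is encoded", 5, 5, False),
    ("frame travels to player", 20, 40, True),
    ("frame is decoded and displayed", 5, 5, False),
]


def total_range_ms(stages) -> tuple:
    """Best-case and worst-case input-to-display delay."""
    return (sum(s[1] for s in stages), sum(s[2] for s in stages))


def one_way_network_allowance_ms(budget_ms: float, stages) -> float:
    """One-way network delay left after the fixed processing stages,
    assuming the remainder is split evenly between the two directions."""
    fixed = sum(s[2] for s in stages if not s[3])
    return (budget_ms - fixed) / 2


print("total:", total_range_ms(STAGES))
for budget in (50, 100, 150):
    allowance = one_way_network_allowance_ms(budget, STAGES)
    print(f"budget {budget} ms -> {allowance} ms one-way network delay")
```

With 26 ms consumed by rendering, encoding, and decoding, a 50 ms input-to-display target leaves only 12 ms for each network direction under these figures, well below the 20–40 ms assumed in the loop.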

Why cloud gaming is harder than VoIP

The differences are structural:

	VoIP	Cloud gaming
Latency budget	~150 ms one-way	~50 ms round-trip (input-to-display)
Data rate	64 kbps (audio)	5–15 Mbps (video)
Loss tolerance	1–5% acceptable	Near-zero (visual artifacts are immediately visible)
Adaptation	Codec switch, FEC	Resolution/frame-rate drop
Buffer	Jitter buffer (50–200 ms)	Essentially zero (every frame must display immediately)

Cloud gaming has VoIP’s latency constraint and streaming video’s bandwidth requirement. It cannot use a jitter buffer (any buffering adds visible input lag). It cannot use TCP (retransmission adds lag). It cannot use B-frames (lookahead adds lag). And it is sending video, not audio — orders of magnitude more data.

The result: cloud gaming is among the most demanding real-time applications on the Internet. It requires fast, stable networks with low jitter, hardware-accelerated encoding and decoding, and edge servers geographically close to users. This is why cloud gaming services restrict availability to regions with adequate infrastructure, and why the experience degrades dramatically on congested or high-latency connections.

The grand arc: three control loops

Step back and compare the three systems we have studied across L10–L12:

	Netflix (stored)	Zoom (conversational)	Cloud gaming (extreme real-time)
Buffer	60–200 seconds	50–200 ms (jitter buffer)	~0 ms
Loop period	~10 seconds (per segment)	~ms to seconds (per packet/report)	Per frame (~16 ms)
Transport	HTTP/TCP	RTP/UDP	RTP/UDP or custom
Loss handling	TCP retransmission	FEC + concealment	FEC + quality reduction
Adaptation variable	Bitrate (via ABR)	Codec + FEC + buffer + bitrate	Resolution + frame rate
What kills UX	Rebuffering stall	Delay > 150 ms	Input lag > 50 ms
Feedback sensor	Chunk download time	RTCP reports + packet arrival	Controller input latency

The pattern: as the time constraint tightens, the buffer shrinks, the loop speeds up, and the system loses degrees of freedom. Netflix has many choices and seconds to make them. Zoom has fewer choices and milliseconds. Cloud gaming has almost no choices and must act per frame.

This is the Time invariant in action — not as a static label (“this is conversational”), but as a force that propagates through every design decision. Tighten the deadline and watch the architecture reshape itself.

The pioneer arc — multimedia complete

Across L10–L12, we traced the full evolution:

Shannon          — source coding theorem (the theoretical floor)
Schulzrinne      — RTP/RTCP (real-time transport over UDP)
YouTube          — progressive HTTP download (firewalls solved)
Apple HLS        — adaptive HTTP streaming (chunks + manifest)
MPEG DASH        — open standard for ABR
Huang BBA        — buffer-based ABR (reject throughput prediction)
Yin MPC          — model-predictive ABR (optimize over horizon)
Mao Pensieve     — learned ABR (reinforcement learning)
WebRTC RFC 8825  — browser-native real-time (NAT, encryption, SFU)

Each generation solved the problem the previous one could not. Each solution introduced new failure modes. The arc follows the same pattern as medium access (L5–L8): from uncoordinated transmission to feedback-driven adaptation to centralized optimization. More information → better control → tighter deadlines met.

Bridge to L13: what if the network could help?

For three lectures, we watched applications cope with a best-effort network. The network delivers packets when it can, drops them when it must, and offers no guarantees about delay, jitter, or loss. Every technique we studied — buffers, ABR, jitter management, FEC, concealment — is the application compensating for the network’s indifference.

But what if the network itself could help? What if routers could distinguish a VoIP packet from a Netflix chunk and treat them differently? What if a router, detecting its queue filling up, could signal the sender before packets are lost, rather than after? What if queue management could reduce jitter at the source — inside the network — rather than forcing every endpoint to absorb it?

That question takes us inside the router. L13 begins the final arc of the course: scheduling, queuing, and active queue management. We will open the router’s packet queue and ask: how should packets be ordered, when should packets be dropped, and can the network’s internal behavior make every application we have studied — Netflix, Zoom, cloud gaming — work better?

The applications have done all they can. It is time to ask what the network owes them.

References

[1] Kurose, J. F. and Ross, K. W. (2021). Computer Networking, 8th Edition. Pearson.

[2] Dobrian, F. et al. (2011). “Understanding the Impact of Video Quality on User Engagement.” Proc. ACM SIGCOMM.

[3] Sodagar, I. (2011). “The MPEG-DASH Standard for Multimedia Streaming over the Internet.” IEEE MultiMedia.

[4] Schulzrinne, H., Casner, S., Frederick, R., and Jacobson, V. (2003). “RTP: A Transport Protocol for Real-Time Applications.” RFC 3550.

[5] Alvestrand, H. (2021). “Overview: Real-Time Protocols for Browser-Based Applications.” RFC 8825; Jennings, C. et al. (2021). “WebRTC Architecture.”

[6] ITU-T Recommendation G.114 (2003). “One-way transmission time.” International Telecommunication Union.

[7] Gupta, A. (2026). A First-Principles Approach to Networked Systems, Ch. 7: Multimedia Applications. UC Santa Barbara.

[8] Pantos, R. and May, W. (2017). “HTTP Live Streaming.” RFC 8216.

[9] ISO/IEC 23009-1 (2019). “Dynamic Adaptive Streaming over HTTP (DASH) — Part 1: Media Presentation Description and Segment Formats.”

[10] Rescorla, E. (2021). “WebRTC Security Architecture.” RFC 8827.

[11] Shannon, C. E. (1949). “Communication in the Presence of Noise.” Proceedings of the IRE, vol. 37, no. 1, pp. 10–21. (The sampling theorem: a signal band-limited to F Hz can be reconstructed from samples at 2F Hz.)

[12] ITU-T Recommendation G.712 (2001). “Transmission performance characteristics of pulse code modulation channels.” (Defines the 300–3,400 Hz voice-frequency passband for PCM telephony.)

[13] ITU-T Recommendation P.310 (1996). “Transmission characteristics for telephone-band (300–3400 Hz) digital telephones.”

[14] ITU-T Recommendation G.711 (1988; originally 1972). “Pulse code modulation (PCM) of voice frequencies.” (Specifies 8 kHz sampling, 8-bit μ-law/A-law quantization, 64 kbps.)

[15] Schulzrinne, H. and Casner, S. (2003). “RTP Profile for Audio and Video Conferences with Minimal Control.” RFC 3551. (Section 4.2: default packetization interval of 20 ms; payload type 0 = PCMU, payload type 8 = PCMA.)