11  Application Protocols and Content Delivery


11.1 The Anchor: Latency, Distribution, and the Silent Network

Applications deliver content to users, and users judge applications by how fast the content arrives. But the content sits on an origin server, potentially continents away, reachable only through a network the application does not own, cannot schedule, and cannot reconfigure. The engineering problem is to make distant content feel local — to meet a latency budget that physics, distance, and congestion all conspire to exceed.

The binding constraint is the intersection of three inherited realities. Users demand low latency and high throughput: a page that loads in two seconds retains readers, one that loads in five seconds loses them. Content is geographically distributed: origins centralize (a single S3 bucket, a single data center) while users globalize (they live everywhere). And the application runs over TCP/IP: it lacks the ability to install routers, rewrite the congestion-control algorithm in the kernel, or demand that a transit AS prioritize its packets. Every latency improvement must come from what the application layer can do with the signals and interfaces it already has.

“Almost every resource I want is located somewhere far away… The goal was to be able to pass a reference to anything, and have it retrieved reliably, no matter where it was.” — Tim Berners-Lee, on the origin of HTTP (Berners-Lee 1991)

The binding constraint locks four decision problems the application layer must continuously answer:

  1. How to request content with low overhead? Every request costs at least one round-trip. Requests must be batched, reused, or shortened.
  2. How to place content near users? If the origin is far, a copy closer to the user reduces the RTT (Round-Trip Time) budget dramatically. Where do the copies live, and which one answers?
  3. How to multiplex independent requests over shared connections? A modern page has dozens of objects. Serial retrieval multiplies RTT; parallel retrieval multiplies connection setup cost. One connection must carry many requests without head-of-line blocking.
  4. How to escape protocol ossification? Middleboxes on the path inspect and mangle packets based on assumptions about TCP and TLS. A new transport protocol must deploy without requiring middlebox permission.

These problems were invisible to Berners-Lee in 1991. They were discovered, one at a time, as the Web grew from a CERN research tool to the dominant application of the Internet. Each generation diagnosed a different invariant failure and patched it without breaking the layers below.

The dependency chain for applications runs in reverse of transport’s chain. Transport inherits Interface from IP and forces State, Time, Coordination downward. Applications start from Interface (HTTP semantics) and push outward: State (where content lives), Coordination (who serves which user), Time (RTT budget). The chain:

  • Interface (HTTP request-response semantics) is the application’s contract with the user and with all intermediaries. Changing it breaks the Web.
  • Coordination (client-server, but with optional caches and CDN (Content Delivery Network) intermediaries) is forced by the desire to keep servers stateless.
  • State (distributed caches, TTLs, consistency) is forced by coordination choices and by the latency budget.
  • Time (RTT budgets, connection reuse, 0-RTT resumption) is the metric the user experiences and the constraint that keeps redesigning the stack.

11.2 Act 1: “It’s 1991. A Physicist Wants to Share Documents.”

Tim Berners-Lee is a computer scientist at CERN. Physicists collaborate across institutions but cannot easily share documents: each lab runs different systems, different databases, different document formats. He proposes a system where any document can reference any other document through a uniform identifier, and any browser can retrieve any document using the same minimal protocol. The protocol must run over TCP (the only universally available transport) and must be simple enough that a graduate student can implement it in an afternoon.

“The HTTP protocol was designed to be extensible… In its simplest form, a client sends a single line, ‘GET /path’, and the server responds with the content of the document.” — Berners-Lee, 1991 (Berners-Lee 1991)

What the pioneers saw: A small academic community exchanging static hypertext documents over reliable intra-institutional LANs. Documents were small (a few KB of text). Pages referenced only a handful of other documents. Latency was dominated by TCP connection setup, which was acceptable because requests were infrequent and sequential (the user clicked a link, waited, read, clicked again).

What remained invisible from the pioneers’ vantage point: The Web would eclipse every other Internet application within a decade. Pages would grow from single documents to assemblages of dozens (later hundreds) of objects: images, stylesheets, scripts, fonts. Users would demand sub-second load times. The “one connection per object” model would become the Web’s first performance bottleneck.

HTTP/0.9 and HTTP/1.0 (Berners-Lee et al. 1996) applied disaggregation by separating document identification (URL) from document retrieval (GET request) from document format (MIME types in HTTP/1.0). They applied decision placement by keeping the server stateless: each request carried all context, each connection was independent. The design was minimal on purpose — extensibility came from headers, not from complex state machines.

11.2.1 Invariant Analysis: HTTP/0.9 and HTTP/1.0 (1991-1996)

Invariant | HTTP/1.0 Answer | Gap?
State | Stateless per request; each connection independent | Blind to client cache state; every reference re-fetched
Time | TCP handshake + slow-start per object | 3-RTT minimum per object; dominant cost for small objects
Coordination | Client-server, anonymous client | Sessionless; caches are transparent and uncoordinated
Interface | Text headers + URL + method; extensible | Per-connection overhead is architectural, beyond mere inefficiency

The Time gap is the killer. Every object fetched requires a fresh TCP handshake (1 RTT), followed by slow-start ramp-up, followed by the actual request and response. A page with 20 objects, fetched serially, needs 20 connection setups. Over a 100ms RTT link, that is two full seconds of connection overhead before any content arrives. The sender’s belief about the network is reset on every object — slow-start restarts, RTT estimates restart, the congestion window restarts.
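The arithmetic above can be checked directly. A rough model, using the section's numbers (100ms RTT, 20 objects, roughly 3 RTTs per object) as hypothetical parameters:

```python
# Rough model of HTTP/1.0 page-load overhead (hypothetical parameters).
# Per the text: each object pays roughly 3 RTTs, all serial -- a TCP
# handshake (1 RTT), the request/response (1 RTT), and slow-start ramp-up.

RTT_MS = 100         # round-trip time in milliseconds
OBJECTS = 20         # objects on the page
RTTS_PER_OBJECT = 3  # handshake + request/response + slow-start

def total_overhead_ms(rtt_ms: int, objects: int) -> int:
    """Total serial round-trip cost for the whole page."""
    return objects * RTTS_PER_OBJECT * rtt_ms

def setup_overhead_ms(rtt_ms: int, objects: int) -> int:
    """Connection-setup cost alone: one handshake RTT per object."""
    return objects * rtt_ms

print(total_overhead_ms(RTT_MS, OBJECTS))  # -> 6000 (the "60 RTT" of Act 2)
print(setup_overhead_ms(RTT_MS, OBJECTS))  # -> 2000 (the "two full seconds")
```

The model ignores server processing and transfer time entirely; its point is that connection overhead alone already blows the latency budget.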

11.2.2 Environment, Measurement, Belief

Layer | What HTTP/1.0 Has | What’s Missing
Environment | User has a browser; server has documents; TCP is in the middle | Caches that already hold the object; other users fetching the same content
Measurement | Request line, status code, content length | Silent on cache state, RTT, and server load
Belief | “This URL maps to this document, fetch it now” | Pages are assemblages, belief is per-object only

The E-M gap is structural: HTTP/1.0’s measurement channel (the request-response pair) says nothing about reuse opportunities. Caches existed (Web proxies appeared by 1994), but HTTP/1.0 gave them only crude hints (Expires headers, If-Modified-Since). Freshness was an estimate, with no validation path.

11.2.3 “The Gaps Didn’t Matter… Yet.”

In 1991, pages were single documents. In 1993, pages had a logo and three hyperlinks. A user clicked, waited half a second, read for a minute, clicked again. The 3-RTT connection overhead disappeared into human reading time.

By 1996, Netscape pages had 20 inline images. Users clicked, waited ten seconds, stared at a half-loaded page, and clicked Reload. The connection-per-object model had become the dominant performance cost, and every subsequent HTTP generation would attack it.


11.3 Act 2: “It’s 1997. Every Page Has Twenty Objects.”

11.3.1 Which Invariant Broke?

Invariant | What Broke | Concrete Consequence
Time | Connection-per-object is RTT-bound | 20 objects × 3 RTT = 60 RTT before page completes
State | Each connection restarts slow-start and RTT estimation | TCP never reaches its fair share before the object ends
Coordination | Browsers open 6+ parallel connections to compensate | Server connection-table bloat; unfair to other users

The Time invariant broke most visibly. Users experienced it as “the Web is slow.” Operators experienced it as connection-table exhaustion on servers. Browsers had tried to compensate by opening multiple parallel TCP connections per server (Netscape defaulted to 4, later 6), but this was a brute-force fix: more connections meant more handshakes, more slow-starts, more server state, and more unfairness to other users sharing the bottleneck link. The 6 parallel connections also negatively impacted congestion control: each connection ran its own independent slow-start and maintained its own congestion window, with no shared view of the bottleneck. A single browser effectively claimed 6x its fair share of bandwidth during ramp-up, starving other flows at the same bottleneck.

11.3.2 Fielding’s Redesign: HTTP/1.1 (RFC 2616, 1999)

Roy Fielding (Fielding 2000) and the HTTP Working Group designed HTTP/1.1 around a single architectural shift: the persistent connection. A TCP connection is opened once and reused for many requests. The client sends a request, the server sends a response, and both keep the connection open for the next request. The TCP connection outlives the object.

“Persistent connections provide a mechanism by which a client and a server can signal the close of a TCP connection. This signaling takes place using the Connection header field.” — Fielding et al., RFC 2616 (Fielding et al. 1999)

HTTP/1.1 also introduced pipelining: the client may send multiple requests back-to-back without waiting for each response. And it mandated the Host header, enabling virtual hosting (many websites on one IP address) — a coordination fix that let the Web’s naming system scale.

HTTP/1.1 applied closed-loop reasoning through cache validation: the ETag header let a client ask “is my cached copy still valid?” and receive a cheap 304 Not Modified if so, cutting bandwidth without cutting correctness. The closed loop was server → validator → client → conditional request, tracking cache freshness through explicit measurement rather than Expires-based guessing.
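The validation loop can be sketched as server-side logic, a minimal illustration (the handler and helper names are hypothetical, not a real framework API):

```python
# Sketch of ETag-based conditional GET handling: return 304 Not Modified
# when the client's cached copy matches the current validator.
import hashlib
from typing import Optional

def make_etag(body: bytes) -> str:
    # Strong validator derived from content; real servers often derive
    # ETags from file metadata (inode/mtime/size) instead.
    return '"%s"' % hashlib.sha256(body).hexdigest()[:16]

def respond(body: bytes, if_none_match: Optional[str]) -> tuple:
    """Return (status, payload) for a GET carrying If-None-Match."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, b""   # validator matched: no body, bandwidth saved
    return 200, body      # full response; client caches body + ETag

body = b"<html>hello</html>"
first = respond(body, None)                 # first fetch: full response
revalidated = respond(body, make_etag(body))  # later fetch: cheap 304
print(first[0], revalidated[0])  # -> 200 304
```

The correctness property is the point: the 304 path saves the body transfer only when the validator proves the cached copy is identical, so bandwidth is cut without guessing.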

11.3.3 HTTP/1.0 → HTTP/1.1 Comparison

What Changed | HTTP/1.0 | HTTP/1.1
Connection lifetime | One request per TCP connection | Persistent: many requests per connection
Request ordering | Serial, wait for each response | Pipelined (in spec; rarely used)
Cache validation | Expires header (time-based guess) | ETag conditional requests (explicit)
Host disambiguation | One site per IP | Host header enables virtual hosting

11.3.4 Environment, Measurement, Belief After HTTP/1.1

Layer | What HTTP/1.1 Has | What’s Missing
Environment | Persistent connections amortize TCP setup | Bottleneck bandwidth and RTT still unknown to application
Measurement | ETag validation gives cache-hit signals | Reordering invisible; pipelined responses must arrive in order
Belief | “This connection is warm; reuse it” | Per-request order is still a straight line

The gap that HTTP/1.1 closed was the connection-setup cost. The gap it introduced was subtle: pipelining required responses to arrive in the order requests were sent. If the first response was slow, every subsequent response waited behind it. This is head-of-line blocking at the HTTP layer: one slow object stalls the entire connection’s pipeline. In practice, browsers left pipelining disabled by default, because servers and proxies handled it inconsistently. The spec existed; the benefit remained unrealized.
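The cost of in-order delivery can be made concrete with a toy simulation. Assuming hypothetical per-response readiness times, ordered delivery forces every response to wait for its slowest predecessor, while unordered delivery (what later multiplexing provides) does not:

```python
# Toy model of HTTP-layer head-of-line blocking under pipelining.
# ready_ms[i] = when response i is ready at the server (hypothetical).

def inorder_delivery(ready_ms: list) -> list:
    """Pipelining: response i cannot be delivered before response i-1."""
    out, latest = [], 0
    for r in ready_ms:
        latest = max(latest, r)  # stuck behind the slowest predecessor
        out.append(latest)
    return out

def unordered_delivery(ready_ms: list) -> list:
    """No ordering constraint: each response delivered when ready."""
    return list(ready_ms)

# One slow first object (500 ms) followed by four fast ones (10 ms each):
ready = [500, 10, 10, 10, 10]
print(inorder_delivery(ready))   # -> [500, 500, 500, 500, 500]
print(unordered_delivery(ready)) # -> [500, 10, 10, 10, 10]
```

One slow head response delays every response behind it by the same amount, which is exactly why browsers never trusted pipelining in production.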

11.3.5 “The Gaps Didn’t Matter… Yet.”

For a decade, HTTP/1.1 was enough. Browsers opened 6 parallel connections per origin, each reusing its connection serially (request, then response, then the next request), and the Web grew from static pages to JavaScript applications. But by 2010, pages had hundreds of objects (trackers, ads, fonts, analytics), served by dozens of origins, and the “6 connections per origin” ceiling was a hard throughput cap no matter how fast the link.


11.4 Act 3: “It’s 2000. The Origin Is in Boston. The User Is in Tokyo.”

11.4.1 Which Invariant Broke?

The Time invariant broke at a different layer: the speed of light. No HTTP/1.1 optimization could overcome a 150ms trans-Pacific RTT. A user in Tokyo fetching a Boston origin waited at least 150ms per round-trip, and every handshake, every TLS negotiation, every TCP slow-start cycle consumed several RTTs. Worse, the trans-Pacific path was congested, lossy, and routed through multiple ASes, each with its own failure modes.

The fix lived outside the protocol; it lay in where the content was. If content moved closer to the user — if a copy of the origin lived in Tokyo — the RTT dropped from 150ms to 10ms, a 15× reduction that every HTTP optimization then compounded. This was a decision placement problem: who decides which copy answers which user?

11.4.2 Dilley and the Akamai Team’s Redesign: The CDN (2002)

Akamai, founded in 1998 by MIT researchers, built a globally distributed network of edge servers — eventually tens of thousands of caches in thousands of locations worldwide (Nygren et al. 2010). Content providers gave Akamai their content; Akamai served it from whichever edge was closest to each user. The question shifted from “how fast can we fetch from Boston?” to “how do we route each user to the right edge?”

“The key challenge is how to distribute content to hundreds of thousands of servers distributed across thousands of networks in ways that maximize performance, reliability, and cost-effectiveness.” — Dilley et al., 2002 (Dilley et al. 2002)

Akamai’s architecture solved the decision-placement problem through DNS-based request routing organized as a two-tier name server hierarchy. Top-Level Name Servers (TLNS) handle the initial DNS delegation globally, directing each query to a region. Low-Level Name Servers (LLNS) within each region make the fine-grained decision: which specific edge server in which cluster should answer this user, based on the user’s resolver location, current edge load, and real-time network conditions. When a user’s browser resolved www.example.com, the authoritative DNS delegation flowed through this TLNS → LLNS chain, returning the IP address of the optimal edge server. DNS became the control plane for content placement.

Within each cluster, Akamai uses consistent hashing to map content URLs to specific cache machines, ensuring that requests for the same object always land on the same server — maximizing local cache hit rates without centralized coordination. For rare or unpopular content absent from the local cluster’s cache, Akamai employs a distributed hash table (DHT) to locate which edge in the broader footprint holds a warm copy, avoiding unnecessary origin fetches for cold content.
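A minimal consistent-hash ring illustrates the intra-cluster mapping described above. This is an illustrative sketch, not Akamai's implementation; the machine names and virtual-node count are hypothetical:

```python
# Minimal consistent-hash ring: URLs hash to points on a ring, and each
# URL is served by the first cache machine at or after its point. Adding
# or removing one machine remaps only the keys in that machine's arc.
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, machines, vnodes=50):
        # Virtual nodes smooth load distribution across machines.
        self._points = sorted(
            (_h("%s#%d" % (m, i)), m) for m in machines for i in range(vnodes)
        )
        self._hashes = [p for p, _ in self._points]

    def lookup(self, url: str) -> str:
        i = bisect.bisect(self._hashes, _h(url)) % len(self._hashes)
        return self._points[i][1]

ring = Ring(["cache-a", "cache-b", "cache-c"])
# Same URL always lands on the same machine -> high local hit rate,
# with no coordination between requests.
print(ring.lookup("/img/logo.png") == ring.lookup("/img/logo.png"))  # -> True
```

The design choice worth noting: the mapping is a pure function of the URL and the machine set, so every front end in the cluster computes the same answer independently.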

Akamai applied disaggregation by splitting the system into three planes: a data plane (edge caches serving content), a control plane (mapping engine deciding which edge answers which user), and a measurement plane (continuous probes of edge-to-user and edge-to-origin latency and loss). The control plane was centralized (for global consistency) while the data plane was distributed (for latency). This is the same disaggregation pattern SDN would apply to routing a decade later.

Akamai applied closed-loop reasoning through the measurement plane: every edge continuously reported health, load, and network conditions to the mapping engine; the mapping engine continuously updated DNS responses; users continuously resolved names. The loop period was seconds to minutes (DNS TTLs set the minimum belief lifetime). The loop’s goal was to keep Belief (which edge is best for this user right now) aligned with Environment (which edge actually has the content, is lightly loaded, and has a good path).

11.4.3 Invariant Analysis: Akamai CDN (2002-2010)

Invariant | Akamai Answer | Gap?
State | Distributed edge caches; TTL-bounded freshness | Cache coherence across edges is best-effort
Time | DNS TTLs control belief lifetime (~seconds) | TTL tradeoff: short = responsive but expensive; long = stale
Coordination | Centralized mapping; distributed delivery | Mapping engine is a global dependency
Interface | DNS redirection; HTTP transparent to client | Edge identity is opaque to clients

The State gap is the tradeoff at the heart of every CDN: TTLs determine both cache hit rate and belief staleness. Long TTLs mean edges serve more requests without contacting the origin (good for latency, bad for freshness). Short TTLs mean the system reacts quickly to origin updates and failures (good for correctness, bad for cost and latency). Every CDN tunes this tradeoff differently per content type: images get hours-long TTLs, HTML gets minutes, API responses get seconds or bypass caching entirely.
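The per-content-type tuning can be sketched as a freshness check. The TTL values below are hypothetical, chosen to echo the text's examples (hours for images, minutes for HTML, seconds for API responses):

```python
# Sketch of TTL-bounded freshness with per-content-type policies.

TTL_POLICY = {          # seconds of allowed staleness per content type
    "image": 4 * 3600,  # images: hours
    "html": 300,        # HTML: minutes
    "api": 5,           # API responses: seconds
}

def is_fresh(content_type: str, fetched_at: float, now: float) -> bool:
    """Serve from the edge cache iff the copy's age is within its TTL."""
    ttl = TTL_POLICY.get(content_type, 0)  # unknown types: no staleness budget
    return (now - fetched_at) <= ttl

# A 10-minute-old image is still servable; 10-minute-old API data is not.
print(is_fresh("image", 0, 600))  # -> True
print(is_fresh("api", 0, 600))    # -> False
```

The policy table is the tradeoff made explicit: each entry trades origin load and latency against tolerated staleness for one content class.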

11.4.4 Environment, Measurement, Belief: Akamai Mapping

Layer | What Akamai Has | What’s Missing
Environment | Actual user location, actual network paths, actual edge loads | True user identity (only resolver IP is visible)
Measurement | Edge health probes; passive latency samples; active probes to users’ resolvers | Precise user geolocation (resolver ≠ user)
Belief | “This user is near edge E; route them there” | Users behind public resolvers (8.8.8.8) look identical to users worldwide

The E-M gap here is physically limited: the signal Akamai has (DNS resolver IP) is a proxy for user location, an indirect measurement. EDNS Client Subnet (a later extension) partially fixed this by letting recursive resolvers forward a prefix of the user’s IP, but privacy-preserving resolvers (8.8.8.8, 1.1.1.1) deliberately omit this. The CDN’s belief about the user is necessarily coarse.

11.4.5 “The Gaps Didn’t Matter… Yet.”

For static content (images, video, CSS) CDNs were a transformative win: end-to-end page loads dropped by factors of 5-10x for global users. Measured more precisely, caching overlays themselves provide speedups of 1.7x to 4.3x compared to direct-from-origin delivery (Nygren et al. 2010) — the larger 5-10x figures reflect full-page improvements where CDN caching compounds with TCP connection reuse and reduced origin load. But CDNs left half the page load cost untouched: the HTML itself, the TLS handshake, and the fundamental HOL blocking of HTTP/1.1. A CDN edge still spoke HTTP/1.1, still opened 6 connections, still slow-started each one. The protocol layer remained a bottleneck that geography alone failed to resolve.


11.5 Act 4: “It’s 2009. A Google Engineer Is Tired of Pipelining Not Working.”

11.5.1 Which Invariant Broke?

Invariant | What Broke | Concrete Consequence
Interface | HTTP/1.1 semantics tie one request to one response on the wire | 6 parallel connections is a hard cap per origin
State | Each connection has independent TCP state | Slow-start repeats 6 times; no shared congestion view
Time | HTTP-layer HOL blocking: slow object stalls the pipeline | Head request delays every subsequent response

Mike Belshe at Google measured real page loads and found that connection count, not bandwidth, was the bottleneck (Belshe 2010). Doubling a user’s link speed yielded only marginal page-load improvement; halving the RTT cut page-load time nearly in half. The protocol itself was the bottleneck.

11.5.2 Belshe and Peon’s Redesign: SPDY and HTTP/2 (2012-2015)

Belshe and Peon designed SPDY (prototyped 2009, IETF proposal 2012) to multiplex many independent requests over a single TCP connection. SPDY became the basis for HTTP/2 (RFC 7540, 2015) (Belshe et al. 2015). The core insight: HTTP’s request-response semantics can be preserved while changing the wire format entirely.

“SPDY’s goal is to reduce web page load time… Multiple concurrent HTTP requests can run across a single SPDY session.” — Belshe and Peon, 2012 (Belshe and Peon 2012)

HTTP/2 applied disaggregation by separating HTTP semantics from wire encoding: the same GET/POST/headers/status codes, but now framed as streams over a single connection. Each stream is independent; responses can arrive in any order, interleaved frame by frame. A slow stream leaves fast streams unblocked.

HTTP/2 added three mechanisms:

  • Stream multiplexing: many concurrent requests share one connection; each request is a stream with an ID, and frames interleave.
  • HPACK (Header Compression for HTTP/2): repeated headers (Cookie, User-Agent, Accept) are compressed via a shared dynamic table, reducing request overhead from KB to bytes.
  • Server push: the server sends resources the client will need, before the client asks (retired in practice — caching interaction was too complex to get right).
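Stream multiplexing can be sketched as a toy frame interleaver. This is a simplified illustration (fixed frame size, plain round-robin scheduling, no priorities or flow control, all of which real HTTP/2 has):

```python
# Toy round-robin frame interleaver: many streams share one connection,
# so no single large response monopolizes the wire.
from collections import deque

def interleave(streams: dict, frame_size: int = 4) -> list:
    """Split each stream's body into frames; emit frames round-robin.
    streams maps stream id -> response body bytes."""
    queues = {
        sid: deque(body[i:i + frame_size] for i in range(0, len(body), frame_size))
        for sid, body in streams.items()
    }
    wire = []  # sequence of (stream_id, frame) as sent on the connection
    while any(queues.values()):
        for sid in sorted(queues):
            if queues[sid]:
                wire.append((sid, queues[sid].popleft()))
    return wire

# An 8-byte response on stream 1 and a 2-byte response on stream 3:
frames = interleave({1: b"AAAAAAAA", 3: b"BB"})
print(frames)  # stream 3 completes after one frame instead of waiting
```

Interleaving is what removes the HTTP-layer ordering constraint: stream 3's short response finishes on the wire before stream 1's long one, even though both share a single connection.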

11.5.3 HTTP/1.1 → HTTP/2 Comparison

What Changed | HTTP/1.1 | HTTP/2
Concurrency | 6 parallel TCP connections | 1 TCP connection, N multiplexed streams
Framing | Text, per-response | Binary, per-frame
Header overhead | Repeated on every request | HPACK compressed
HOL blocking | At HTTP layer (per connection) | At TCP layer (per connection)

11.5.4 The Gap HTTP/2 Created

HTTP/2 fixed HTTP-layer HOL blocking but created a new failure mode: TCP-layer HOL blocking. TCP’s “belief” in a monolithic, ordered byte stream was false state for multiplexed HTTP — TCP treated all bytes as a single ordered sequence, oblivious to stream boundaries. Because all streams share one TCP connection, a single dropped packet stalls every stream until the packet is retransmitted. TCP delivers bytes in order, not by stream. On a lossy network (mobile, Wi-Fi with interference), a single loss could stall 50 concurrent streams for an RTT. HTTP/1.1 with 6 connections isolated losses; HTTP/2 with 1 connection coupled them.
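The loss-coupling difference can be shown with a toy model. Packets on the wire are labeled by the stream whose data they carry; under in-order byte-stream delivery, a loss stalls every stream with data at or after the lost packet, while a per-stream transport (what Act 5 introduces) stalls only the owning stream:

```python
# Toy model of loss coupling: pkts[i] = id of the stream whose data is
# in packet i; `lost` = index of the dropped packet.

def stalled_streams_inorder(pkts: list, lost: int) -> set:
    """Single ordered byte stream (TCP): nothing after the hole can be
    delivered, so every stream with later data stalls."""
    return {sid for i, sid in enumerate(pkts) if i >= lost}

def stalled_streams_per_stream(pkts: list, lost: int) -> set:
    """Per-stream delivery: only the stream in the lost packet stalls."""
    return {pkts[lost]}

pkts = [1, 2, 3, 1, 2, 3]  # three streams interleaved on one connection
print(stalled_streams_inorder(pkts, lost=1))     # -> {1, 2, 3}
print(stalled_streams_per_stream(pkts, lost=1))  # -> {2}
```

With 50 streams multiplexed, the in-order set is almost always "every stream", which is the coupling HTTP/1.1's 6 separate connections accidentally avoided.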

11.5.5 Environment, Measurement, Belief After HTTP/2

Layer | What HTTP/2 Has | What’s Missing
Environment | Single warm TCP connection; multiplexed streams | Loss patterns: which stream’s data was in the lost packet?
Measurement | HTTP/2 frames per stream; connection-level flow control | TCP conflates “stream 3 is stalled” with “all streams are stalled”
Belief | “All streams progress in parallel” | True when no loss; false on every retransmission

The gap is accidentally noisy: TCP’s in-order delivery was designed when there was one stream per connection. Multiplexing many streams over one TCP connection exposes the in-order-delivery assumption as a tax every stream pays for any stream’s loss. The fix required changing the transport layer itself.


11.6 Act 5: “It’s 2013. Transport Is Ossified. Google Ships a New One Anyway.”

11.6.1 Which Invariant Broke?

Invariant | What Broke | Concrete Consequence
Interface | TCP in-order delivery creates head-of-line blocking for HTTP/2 streams | One lost packet stalls all streams for 1+ RTT
Interface | TCP and TLS handshakes are serial (3+ RTTs for first request) | First-byte latency is 3× worse than the protocol requires
Interface | Middleboxes inspect TCP options and reject anything unfamiliar | New TCP extensions cannot deploy (Honda et al. 2011)

Honda et al. (Honda et al. 2011) measured middlebox behavior across hundreds of paths and found that TCP was ossified: middleboxes dropped or mangled packets carrying unfamiliar options. Any new TCP feature (like MPTCP) had to pretend to be old TCP to survive deployment. The kernel was locked, and the path was locked.

11.6.2 Roskind and Langley’s Redesign: QUIC and HTTP/3 (2017-2022)

Jim Roskind and Adam Langley at Google designed QUIC to rebuild the transport layer from scratch — but over UDP, not by replacing TCP. Middleboxes forward UDP as opaque datagrams, so QUIC could evolve freely without middlebox permission. QUIC reached RFC status in 2021 (Iyengar and Thomson 2021), and HTTP/3 (Bishop 2022) maps HTTP semantics onto QUIC streams.

“QUIC’s design is… motivated by a desire to remove head-of-line blocking, reduce connection establishment latency, and enable continued transport evolution.” — Langley et al., 2017 (Langley et al. 2017)

QUIC applied disaggregation by moving transport into user space. The QUIC library lives inside the application (or alongside it), not inside the kernel. This means QUIC can be updated as fast as applications can be updated — monthly, not decade-by-decade. Google deploys QUIC changes to Chrome and their servers simultaneously; the protocol evolves continuously.

QUIC made three changes that TCP’s ossified deployment path blocked:

  • Streams as a transport primitive: QUIC multiplexes streams natively. A lost packet affects only the streams whose data it carried. TCP’s HOL blocking disappears.
  • Encryption is mandatory and integrated: QUIC handshake combines TLS 1.3 and transport setup into a single 1-RTT exchange (or 0-RTT for resumption). TCP + TLS requires 2-3 RTTs; QUIC requires 1. However, 0-RTT data is vulnerable to replay attacks: an attacker who captures the initial flight can replay it verbatim. Applications using 0-RTT must guarantee idempotency — a replayed request must not debit an account twice or create duplicate records. The Time optimization (saving 1 RTT) forces a State burden (the application must track whether a request has already been processed).
  • Connection IDs separate identity from address: a mobile device can switch from Wi-Fi to cellular without dropping the connection. TCP ties identity to the (IP, port) 4-tuple, which breaks on IP changes.
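The 0-RTT idempotency burden described above can be sketched as application-level replay defense. The handler and key names are hypothetical; real systems also bound and expire the key store:

```python
# Sketch of replay protection for 0-RTT early data: track an idempotency
# key per request so a replayed flight is not processed twice.

class IdempotentHandler:
    def __init__(self):
        self._seen = {}  # idempotency key -> cached result

    def handle(self, key: str, amount: int) -> tuple:
        """Return (result, was_replay). The side effect runs at most
        once per key, even if the 0-RTT flight is replayed verbatim."""
        if key in self._seen:
            return self._seen[key], True
        result = self._debit(amount)
        self._seen[key] = result
        return result, False

    def _debit(self, amount: int) -> int:
        # Stand-in for the real side effect (e.g., charging an account).
        return amount

h = IdempotentHandler()
print(h.handle("req-42", 100))  # -> (100, False)  first delivery
print(h.handle("req-42", 100))  # -> (100, True)   replayed 0-RTT flight
```

This is the State burden made concrete: the transport saved one RTT, and in exchange the application now carries a table of already-processed requests.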

11.6.3 TCP + HTTP/2 → QUIC + HTTP/3 Comparison

What Changed | TCP + HTTP/2 | QUIC + HTTP/3
HOL blocking | TCP layer (all streams) | Per-stream only
First-byte latency | 2-3 RTT (TCP + TLS) | 1 RTT (integrated) or 0 RTT (resumption)
Deployment path | Kernel TCP, middlebox-aware | User-space UDP, middlebox-opaque
Connection migration | Fails on IP change | Survives via connection ID
Evolution pace | Decade-scale (kernel + middleboxes) | Months (library upgrade)

11.6.4 Environment, Measurement, Belief: QUIC

Layer | What QUIC Has | What’s Missing
Environment | Per-stream independent delivery; encrypted transport metadata (including packet numbers and ACK frames, not just payload) opaque to path | Middleboxes are excluded from assisting (TCP-level optimizations inapplicable)
Measurement | Per-packet encrypted; loss signals per-stream | Path-level visibility is reduced for operators
Belief | “Each stream progresses independently; loss doesn’t cascade” | True; but user-space CPU cost rises for crypto on every packet. On stable high-bandwidth links, HTTP/3 can suffer throughput reductions of up to 45% compared to kernel-optimized TCP due to user-space packet processing overhead (Langley et al. 2017)

The gap QUIC creates is operational: operators lose path-level transport visibility (packet traces are encrypted, packet-by-packet state machines live in endpoints). This is a deliberate tradeoff: ossification cost visibility, so QUIC buys evolvability by paying with opacity.

11.6.5 “The Gaps Didn’t Matter… Yet.”

By 2026, HTTP/3 serves ~35% of top websites (W3Techs). QUIC ships in Chrome, Safari, Firefox, and Edge. The ossification escape succeeded. But the latency budget keeps tightening: the next constraint shifts from “how fast can we fetch content” to “how fast can we compute a response.” When RTTs drop below 10ms, the bottleneck shifts from the network to the server.


11.7 Act 6: “It’s 2018. The Bottleneck Is Not the Network. It Is the Origin.”

11.7.1 Which Invariant Broke?

With HTTP/3 and CDNs, static content fetch is as fast as physics allows. But dynamic content — personalized pages, API responses, real-time data — still requires round-tripping to an origin. If the origin is 80ms away, every dynamic request pays 80ms no matter what the transport does. A second force amplified this pressure: bandwidth gravity. IoT proliferation generates massive volumes of raw sensor data at the edge — camera feeds, telemetry streams, environmental monitors — making it economically and technically infeasible to ship all of it to a centralized cloud for processing. The fix was to move computation to the edge, not just content.

11.7.2 Edge Compute: Netflix Open Connect, Google Global Cache, Cloudflare Workers

Three lineages converged on edge computation in the 2010s. Netflix Open Connect (2012+) (Böttger et al. 2018) deployed custom cache appliances inside ISPs, serving video bytes from within the user’s access network — the ultimate latency minimization. Google Global Cache did the same for YouTube and Google services. Cloudflare Workers (2017) and AWS Lambda@Edge generalized the model: run arbitrary user code at hundreds of global PoPs, within milliseconds of any user.

Edge compute applied decision placement at the finest granularity: per-request, per-user compute runs at the location that minimizes total latency, including compute time. The closed loop is no longer just content placement (where does the data live?) but execution placement (where does the code run?). Akamai’s SureRoute exemplifies closed-loop path optimization at the edge: edge servers periodically “race” packets along multiple paths between edge and origin, measuring real-time latency and loss, then route subsequent requests along the fastest surviving path — a concrete application of closed-loop reasoning to overlay routing.
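The SureRoute-style race can be sketched as a selection function. The path names and RTT figures are hypothetical measurements; a lost probe is modeled as None:

```python
# Sketch of closed-loop path racing: probe candidate edge-to-origin
# paths, then pin subsequent requests to the fastest path that answered.
from typing import Optional

def pick_path(probe_results: dict) -> str:
    """probe_results: path name -> measured RTT in ms, or None if the
    probe was lost. Returns the surviving path with the lowest RTT."""
    survivors = {p: rtt for p, rtt in probe_results.items() if rtt is not None}
    if not survivors:
        raise RuntimeError("no path answered; fall back to direct-to-origin")
    return min(survivors, key=survivors.get)

probes = {"direct": 180.0, "via-pop-fra": 95.0, "via-pop-lhr": None}
print(pick_path(probes))  # -> "via-pop-fra"
```

Re-running the race periodically is what closes the loop: the routing belief is only as old as the last probe round, not as old as a DNS TTL.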

11.7.3 Invariant Analysis: Edge Compute (2018-present)

Invariant | Edge-Compute Answer | Gap?
State | Per-request compute; ephemeral; some edge KV stores | Consistency across PoPs is eventual only
Time | Compute latency budget in milliseconds | Cold-start latency dominates for serverless
Coordination | PoPs execute independently; origin is fallback | Global state updates have high tail latency
Interface | HTTP request in → HTTP response out; code inside | Limited runtime (WASM, V8 isolates); constrained OS access

The cold-start problem is the new bottleneck: spinning up a function on demand ranges from sub-5ms (pre-warmed V8 isolates, as in Akamai EdgeWorkers) to several seconds (cold container launches in general-purpose serverless platforms), often exceeding the network latency it was meant to eliminate. The fix is pre-warming, persistent isolates, and lighter-weight runtimes (WebAssembly). The Time gap shifted from RTT to spin-up time — a different layer of the stack, but still the user’s latency budget.
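The shift of the Time gap can be made concrete with a back-of-envelope model. The startup and execution costs below are hypothetical, chosen within the ranges the text cites (sub-5ms warm isolates, seconds-scale cold containers):

```python
# Rough model of per-request latency at the edge (hypothetical costs):
# a warm isolate pays only its startup residue plus execution time; a
# cold start pays full spin-up first.

COLD_START_MS = 2000  # cold container launch (general-purpose serverless)
WARM_START_MS = 5     # pre-warmed V8 isolate
EXEC_MS = 3           # the function's own run time

def request_latency_ms(pool_has_warm_isolate: bool) -> int:
    startup = WARM_START_MS if pool_has_warm_isolate else COLD_START_MS
    return startup + EXEC_MS

print(request_latency_ms(True))   # -> 8
print(request_latency_ms(False))  # -> 2003
```

With these numbers the cold path is hundreds of times slower than the work itself, which is why pre-warming, not network placement, becomes the optimization target.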

11.7.4 Environment, Measurement, Belief: Edge Compute

Layer | What Edge Compute Has | What’s Missing
Environment | Request from a specific user, code, some local state | Global state that was updated elsewhere milliseconds ago
Measurement | Request headers, cached responses, local KV reads | Cross-PoP consistency without round-trip to origin
Belief | “Serve this user from here with this code” | True for stateless compute; fragile for stateful

The E-M gap is structurally limited for stateful compute: the edge learns global state only by querying, and querying defeats the latency purpose. This is why edge compute works best for read-heavy, cache-friendly, or stateless workloads — exactly the patterns CDNs were already good at, now extended with code.


11.8 The Grand Arc: From Documents to Edge Execution

11.8.1 The Evolving Anchor

Era | Binding Constraint | What Locks | Interface Cascade
1991 | Simplicity (one-person implementable) | GET-and-done | Stateless, per-object TCP
1999 | RTT × object count | Persistent connections, pipelining (spec) | ETags, Host header, 6 parallel conns
2002 | Speed of light + origin location | DNS-mediated redirection to edges | Distributed caches, TTL-bounded belief
2015 | HTTP/1.1 connection ceiling | Multiplexed streams over 1 TCP | HPACK, binary framing, TCP HOL blocking
2022 | TCP ossification + HOL blocking | UDP + integrated crypto | Per-stream delivery, 0-RTT, connection migration
2024+ | Origin round-trip cost | Edge compute, serverless | Cold-start becomes the budget

11.8.2 Three Design Principles Applied Across the Arc

Disaggregation. Each act introduced a new separation. HTTP/1.0 separated identification from retrieval from format. HTTP/1.1 separated connection lifetime from object lifetime. Akamai separated the control plane (mapping) from the data plane (delivery) from the measurement plane (probes). HTTP/2 separated streams from connections. QUIC separated transport from the kernel. Edge compute separated execution placement from origin location. Each separation created an interface; each interface enabled parallel evolution; and several interfaces (TCP, HTTP/1.1 pipelining) eventually ossified, requiring the next act’s redesign to route around them.

Closed-loop reasoning. Cache validation (ETags), DNS-based edge selection, QUIC congestion control per stream, edge compute load balancing — each is a feedback loop whose period matches the constraint it tracks. DNS TTLs run at seconds because edge load shifts at seconds. Cache validation runs at minutes because content change rates are hours. QUIC’s loss loop runs at RTTs because packet loss feedback is the only signal.
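The cache-validation loop can be sketched as a conditional GET, with the TTL bounding how long the cache trusts its belief before re-checking. The classes below are toy models, not any particular cache implementation:

```python
class Origin:
    """Toy origin: a body plus an ETag that changes when the body changes."""
    def __init__(self, body, etag):
        self.body, self.etag = body, etag

    def get(self, if_none_match=None):
        # Conditional GET: a matching ETag yields 304 with no body.
        if if_none_match == self.etag:
            return 304, None, self.etag
        return 200, self.body, self.etag

class Cache:
    """Toy cache with a TTL-bounded belief, revalidated via If-None-Match."""
    def __init__(self, origin, ttl):
        self.origin, self.ttl = origin, ttl
        self.body = self.etag = None
        self.fresh_until = 0.0

    def serve(self, now):
        if now < self.fresh_until and self.body is not None:
            return self.body                      # belief still trusted
        status, body, etag = self.origin.get(if_none_match=self.etag)
        if status == 200:                         # belief was wrong: refill
            self.body, self.etag = body, etag
        self.fresh_until = now + self.ttl         # renew the belief window
        return self.body
```

Within the TTL window the cache answers from belief alone; only at expiry does the loop close, and a 304 response closes it at header cost rather than body cost.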

Decision placement. The application layer’s central question is “where does each decision live?” HTTP/0.9: everything at endpoints. HTTP/1.1: endpoints plus transparent caches. Akamai: centralized mapping, distributed delivery. HTTP/2: still endpoints, but now a single connection per origin. QUIC: endpoints again, but in user space. Edge compute: per-request placement at whichever PoP minimizes total latency. The arc oscillates between centralization (CDN mapping) and distribution (endpoint-only HTTP/2) as constraints shift. A direct line connects Act 3’s gap to Act 5’s fix: DNS-based redirection is blind to mid-session network changes, because once a DNS response is cached, the user is committed to that edge for the TTL duration, even if conditions shift. QUIC’s Connection ID addresses exactly this gap: because connection identity is decoupled from the IP address, a mobile user can migrate a connection seamlessly without re-resolving DNS, closing the mid-session adaptability gap that DNS redirection structurally cannot close.
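The migration mechanism reduces to a demultiplexing difference, sketched below with toy dictionaries (the addresses and Connection ID bytes are illustrative):

```python
# Toy demux: TCP keys connections by 4-tuple; QUIC keys them by Connection ID.
# When the client's address changes (Wi-Fi to cellular), the TCP lookup misses
# while the QUIC lookup still finds the same connection state.

tcp_conns = {}   # (src_ip, src_port, dst_ip, dst_port) -> state
quic_conns = {}  # connection_id -> state

def tcp_lookup(src, sport, dst, dport):
    return tcp_conns.get((src, sport, dst, dport))

def quic_lookup(conn_id):
    return quic_conns.get(conn_id)

# Connection established from the Wi-Fi address.
tcp_conns[("10.0.0.5", 4433, "203.0.113.9", 443)] = "session-state"
quic_conns[b"\x7a\x01"] = "session-state"

# Client migrates to a cellular address: packets now carry a new source IP
# but the same Connection ID.
assert tcp_lookup("100.64.0.8", 4433, "203.0.113.9", 443) is None  # TCP: lost
assert quic_lookup(b"\x7a\x01") == "session-state"                 # QUIC: found
```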

11.8.3 The Dependency Chain

```mermaid
flowchart TD
    C0[Constraint: low latency + distributed content]:::constraint
    F1[Failure: 3-RTT per object]:::failure
    X1[Fix: persistent connections]:::fix
    F2[Failure: pipelining HOL blocking]:::failure
    X2[Fix: multiplexed streams HTTP/2]:::fix
    F3[Failure: TCP HOL blocking]:::failure
    X3[Fix: QUIC per-stream transport]:::fix
    F4[Failure: speed of light to origin]:::failure
    X4[Fix: CDN edge caching]:::fix
    F5[Failure: dynamic origin round-trip]:::failure
    X5[Fix: edge compute]:::fix
    F6[Failure: serverless cold-start]:::failure

    C0 --> F1 --> X1 --> F2 --> X2 --> F3 --> X3
    C0 --> F4 --> X4 --> F5 --> X5 --> F6

    classDef constraint fill:#dbeafe,stroke:#1e40af,color:#1e3a8a
    classDef failure fill:#fecaca,stroke:#991b1b,color:#7f1d1d
    classDef fix fill:#bbf7d0,stroke:#166534,color:#14532d
```

11.8.4 Pioneer Diagnosis Table

| Year | Pioneer | Invariant | Diagnosis | Contribution |
|---|---|---|---|---|
| 1991 | Berners-Lee | Interface | Document retrieval needs a universal protocol | HTTP/0.9, URL, hypertext |
| 1999 | Fielding | Time | Connection setup dominates per-object cost | Persistent connections, ETags, Host |
| 2002 | Dilley et al. | State | Geographic distance is the latency floor | DNS-mediated CDN, edge caches |
| 2012 | Belshe | Interface | HTTP/1.1 connection limit caps throughput | SPDY / HTTP/2 stream multiplexing |
| 2017 | Langley | Interface | TCP ossification prevents transport evolution | QUIC over UDP, user-space transport |
| 2018+ | (Cloudflare, AWS) | Coordination | Dynamic content still round-trips to origin | Edge compute, WASM at edge |

11.8.5 Innovation Timeline

```mermaid
flowchart TD
    subgraph sg1["Protocol Origins"]
        A1["1991 — Berners-Lee: HTTP/0.9"]
        A2["1996 — HTTP/1.0 (RFC 1945)"]
        A3["1999 — HTTP/1.1 (RFC 2616)"]
        A1 --> A2 --> A3
    end
    subgraph sg2["Content Distribution"]
        B1["1998 — Akamai founded"]
        B2["2002 — Dilley: CDN architecture"]
        B3["2010 — Nygren: Akamai overview"]
        B1 --> B2 --> B3
    end
    subgraph sg3["Protocol Multiplexing"]
        C1["2009 — Google: SPDY prototype"]
        C2["2012 — SPDY draft"]
        C3["2015 — HTTP/2 (RFC 7540)"]
        C1 --> C2 --> C3
    end
    subgraph sg4["Ossification Escape"]
        D1["2013 — Google: QUIC prototype"]
        D2["2017 — QUIC SIGCOMM paper"]
        D3["2021 — QUIC RFC 9000"]
        D4["2022 — HTTP/3 RFC 9114"]
        D1 --> D2 --> D3 --> D4
    end
    subgraph sg5["Edge Execution"]
        E1["2012 — Netflix Open Connect"]
        E2["2017 — Cloudflare Workers"]
        E3["2018 — Lambda@Edge"]
        E1 --> E2 --> E3
    end
    sg1 --> sg2 --> sg3 --> sg4 --> sg5
```



11.9 Generative Exercises

Exercise 1: The Ossification Budget

Suppose a new transport-layer feature (e.g., per-packet ECN marking with multi-bit signals) would reduce page load time by 20% if universally deployed, but middleboxes drop 5% of packets containing the new marking. Design a deployment strategy. Which invariant are you optimizing? Which are you sacrificing? Hint: consider whether the feature lives in TCP, QUIC, or HTTP semantics.
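One way to start the analysis: quantify the fallback strategy. Under the simplifying assumption that the 5% packet-drop rate translates into 5% of paths breaking the feature outright, and with a hypothetical fallback-detection penalty, the expected page-load time is:

```python
def expected_plt_ms(base_ms, speedup=0.20, broken_paths=0.05,
                    fallback_penalty_ms=300.0):
    """Expected page-load time for deploy-with-fallback.
    On a clean path the feature delivers its full speedup; on a broken
    path the client detects middlebox interference, pays a detection
    penalty, and retries over the baseline protocol."""
    fast = base_ms * (1 - speedup)          # feature works end to end
    slow = base_ms + fallback_penalty_ms    # feature blocked, fall back
    return (1 - broken_paths) * fast + broken_paths * slow
```

For a 1000 ms baseline this yields 0.95 × 800 + 0.05 × 1300 = 825 ms: the feature pays off on average only if breakage detection is cheap, which is why deployments that race the new variant against the baseline (as early QUIC rollouts did against TCP) shrink the penalty term.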

Exercise 2: CDN Belief Staleness

A CDN serves a breaking-news article with a 60-second TTL. A correction is pushed to the origin at time T. Users in Tokyo, Sydney, and São Paulo request the article at times T+5, T+30, T+45. What does each see? Now shorten the TTL to 10 seconds — what is the cost in origin load and user latency? Construct the closed loop: what are the sensor, estimator, controller, actuator? Where does the E-M gap sit?
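A sketch for exploring the scenario. Whether each city sees the correction depends on when its cache last refilled, not only on when the user asks, so the fill times below are illustrative assumptions:

```python
def serve(cache, origin_version_at, now, ttl):
    """Return the version a PoP cache serves at time `now`.
    The cache refetches from the origin once its copy is older than ttl.
    cache: dict with 'version' and 'fetched_at' keys."""
    if not cache or now - cache["fetched_at"] >= ttl:
        cache.update(version=origin_version_at(now), fetched_at=now)
    return cache["version"]
```

With the correction pushed at T = 100 and a PoP that refilled at t = 90, a request at T+5 still sees the original under a 60-second TTL but triggers a refetch under a 10-second TTL; the expiry check is the sensor, the cached copy the estimator, the refetch decision the controller, and the origin fetch the actuator.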

Exercise 3: The Edge Cold-Start Problem

A serverless edge function takes 80ms to cold-start and 5ms per warm invocation. Requests arrive as a Poisson process at rate λ per PoP. For what λ does pre-warming pay for itself, assuming idle instance cost dominates? How does your answer change if the function handles user-specific state that must be loaded on first invocation? Which invariant does pre-warming change, and which does it preserve?
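A starting point for the break-even analysis, under the assumption that an instance stays warm for a keep-alive window of W seconds after each request: a request then goes cold exactly when the preceding inter-arrival gap exceeds W, which for Poisson arrivals happens with probability e^(−λW).

```python
import math

def cold_fraction(lam, keep_alive_s):
    """P(inter-arrival gap > keep_alive) for Poisson arrivals at rate lam."""
    return math.exp(-lam * keep_alive_s)

def mean_latency_ms(lam, keep_alive_s, cold_ms=80.0, warm_ms=5.0):
    """Expected per-request latency without pre-warming."""
    p = cold_fraction(lam, keep_alive_s)
    return p * cold_ms + (1 - p) * warm_ms
```

Pre-warming pins latency at the warm cost; it pays for itself when the latency it saves, valued against the idle-instance cost, exceeds that cost. At λ = 0.01 req/s with a 60-second window, over half of requests still go cold (e^(−0.6) ≈ 0.55), while at λ = 1 req/s cold starts all but vanish.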


  1. TLS 1.2 over TCP requires 3 round trips before application data: TCP handshake (1 RTT) + TLS 1.2 handshake (2 RTT). TLS 1.3 cuts the TLS handshake to 1 RTT (2 RTT total), QUIC merges transport and crypto setup into a single RTT, and QUIC’s 0-RTT mode eliminates even that for returning connections, at the cost of replay vulnerability.↩︎