flowchart TD
A["Shared coaxial bus<br/>(physical constraint)"]:::constraint --> B["CSMA/CD works<br/>(100 stations)"]:::fix
B --> C["Bridges extend reach<br/>(campus-wide L2)"]:::fix
C --> D["Loops: broadcast storms<br/>(frames cycle forever)"]:::failure
D --> E["STP: prune to tree<br/>(Perlman 1985)"]:::fix
E --> F["Half the links blocked<br/>(wasted bandwidth)"]:::failure
E --> G["30-50s convergence<br/>(all forwarding halts)"]:::failure
F --> H["Switches: full-duplex<br/>(no CSMA/CD needed)"]:::fix
H --> I["Broadcast domain grows<br/>(thousands of stations)"]:::failure
I --> J["VLANs: 802.1Q<br/>(12-bit, 4094 limit)"]:::fix
G --> K["Data center scale<br/>(100-1000 switches)"]:::constraint
F --> K
K --> L["Fat-tree + ECMP<br/>(Al-Fares 2008)"]:::fix
L --> M["No L2 mobility<br/>(IP fabric, no VLANs)"]:::failure
J --> M
M --> N["VXLAN: L2 over L3<br/>(24-bit VNI)"]:::fix
N --> O["Flood-and-learn<br/>(broadcast storms in overlay)"]:::failure
O --> P["EVPN: BGP control plane<br/>(MAC/IP distribution)"]:::fix
classDef constraint fill:#4a90d9,stroke:#2c5f8a,color:#fff
classDef failure fill:#e74c3c,stroke:#a93226,color:#fff
classDef fix fill:#27ae60,stroke:#1a7a42,color:#fff
3 Wired Link Layer
3.2 Act 1: “It’s 1976. A Building Full of Personal Computers Needs a Network.”
Xerox PARC in Palo Alto was unlike any computing environment of its era. By 1973, the Alto — one of the first personal workstations — was sitting on dozens of desks. These machines needed to share a laser printer, access file servers, and exchange electronic mail. The available networking options were inadequate: ARPANET was for inter-site WAN communication, and serial point-to-point links could not connect a building full of machines. Robert Metcalfe, a recent Harvard PhD who had studied Norman Abramson’s ALOHAnet (Abramson 1970), saw the opportunity to adapt packet radio concepts to a local wired medium.
The experimental Ethernet ran at 2.94 Mbit/s over 75-ohm coaxial cable with a reach of approximately 1 km. These numbers are coupled through physics: collision detection requires that a transmitter hear a collision before it finishes sending its shortest frame. The round-trip propagation time across 1 km of cable is approximately 10 µs; at 2.94 Mbit/s, the minimum frame takes roughly 50 µs to transmit — safely longer than the round-trip, so collisions are always detected in time.1 Frames used 8-bit addresses, sufficient for 256 stations in one building.
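The coupling between cable length, bit rate, and minimum frame size can be checked numerically. A minimal sketch, assuming a typical coax propagation speed of about 2×10^8 m/s (the speed itself is an assumption, not stated in the text):

```python
# Why Ethernet couples reach, bit rate, and minimum frame size:
# a transmitter must still be sending when a collision echo returns.
PROP_SPEED = 2e8      # m/s, assumed signal speed in coaxial cable
CABLE_LEN = 1_000     # m, maximum reach of the experimental Ethernet
BIT_RATE = 2.94e6     # bit/s

round_trip = 2 * CABLE_LEN / PROP_SPEED     # worst-case collision window
min_frame_bits = round_trip * BIT_RATE      # smallest frame still on the
                                            # wire when the echo arrives

print(f"round trip: {round_trip * 1e6:.1f} us")      # ~10 us
print(f"minimum detectable frame: {min_frame_bits:.0f} bits")
```

With a ~50 µs minimum frame time, the design leaves a wide safety margin over this ~10 µs bound.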
When DEC, Intel, and Xerox formed the DIX consortium to standardize Ethernet for commercial use (1980), they confronted a scope change: Ethernet would span multiple buildings, campuses, and eventually organizations. An 8-bit address space — adequate for one lab — was insufficient.2
“Ether is a passive broadcast medium with no central control, and with stations that have equal access. Coordination of access to the Ether for packet transmission is distributed among the contending transmitting stations using controlled statistical arbitration.” — Metcalfe and Boggs, 1976 (Metcalfe and Boggs 1976)
What the pioneers saw: A single building with 100–200 workstations. A shared coaxial cable snaking through the ceiling, physically accessible to every machine. Traffic was bursty — print jobs, file transfers, electronic mail — not continuous streams. The wire was idle most of the time. The workstations were cooperative (all owned by Xerox PARC researchers, no adversarial traffic). The cable was short enough that propagation delay was negligible relative to packet transmission time.
What remained invisible from the pioneers’ vantage point: Networks would scale from hundreds to millions of endpoints. Traffic would shift from bursty office applications to continuous video streams. Organizations would need to connect buildings, campuses, and eventually data centers containing hundreds of thousands of servers. A single shared cable was inherently limited to building scale.
3.2.2 Invariant Analysis: Ethernet CSMA/CD (1976)
| Invariant | Ethernet Answer (1976) | Gap? |
|---|---|---|
| State | Carrier sense + collision detection: rich local knowledge | Limited to one cable segment; no knowledge of stations beyond the wire |
| Time | Binary exponential backoff (BEB): randomized, doubles per collision | No adaptation to offered load; 16-collision cap means arbitrary frame loss |
| Coordination | Fully distributed: each station decides independently | Fair under light load; under heavy load, BEB penalizes unlucky stations |
| Interface | Ethernet frame: 6-byte DA, 6-byte SA, 2-byte Type, 46–1500 byte payload, 4-byte FCS | Frame format ossifies immediately — unchanged for 50 years |
The State answer is remarkably good for a local protocol. Carrier sense gives each station real-time knowledge of whether another station is currently transmitting. Collision detection gives instantaneous feedback when two transmissions overlap. The measurement is honest and accurate — on a wire, signal interference is unambiguous.
The Time answer is adequate but imprecise. BEB tracks only collision count as an indirect proxy for offered load. A station experiencing its 5th collision could be competing with 2 other stations (bad luck) or 200 (genuine overload). BEB conflates both cases. This is an accidentally noisy measurement — the signal (collision count) is honest but imprecise as an estimator of contention.
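The BEB rule described above can be sketched in a few lines. The 51.2 µs slot time is the 10 Mbps figure and the 1024-slot ceiling is the standard cap; both are assumptions drawn from the later IEEE specification rather than from this text:

```python
import random

def beb_backoff(collisions: int, slot_time_us: float = 51.2):
    """Binary exponential backoff: after the n-th collision, wait a random
    number of slots in [0, 2**min(n, 10) - 1]. Returns the wait in
    microseconds, or None once the 16-collision cap is hit and the frame
    is discarded. Note the rule sees only the collision count, not the
    actual number of contending stations."""
    if collisions >= 16:
        return None                    # give up: arbitrary frame loss
    k = min(collisions, 10)            # exponent capped at 10
    return random.randrange(2 ** k) * slot_time_us
```

A station on its 5th collision draws uniformly from 32 slots whether it faces 2 competitors or 200, which is exactly the imprecision described above.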
The Interface answer proves prophetic: the Ethernet frame format defined in 1976 — destination MAC, source MAC, type/length, payload, FCS — survives essentially unchanged into the 2020s. This is the most successful ossified interface in networking history.
3.2.3 Environment → Measurement → Belief
| Layer | What Ethernet Has | What’s Missing |
|---|---|---|
| Environment | All stations on the cable, their traffic patterns, queue depths | — |
| Measurement | Carrier sense (is wire busy?), collision detect (did signals interfere?) | Station count, offered load, queue depths of competitors |
| Belief | “Wire is idle, safe to transmit” or “collision occurred, back off” | No model of contention level; each collision is treated identically |
The E→M gap is accidentally noisy: the wire provides an honest signal (busy/idle, collision/clean), but the station has no way to estimate how many other stations are waiting. Better estimation algorithms (adaptive backoff based on observed collision rate) were proposed but never standardized — the simple BEB was good enough for the environments Ethernet actually served.
3.2.4 “The Gaps Didn’t Matter… Yet.”
At Xerox PARC with 100 workstations and bursty traffic, utilization rarely exceeded 10%. Collisions were infrequent. BEB resolved them quickly. The single cable was short enough that propagation delay was negligible, making collision detection nearly instantaneous. Metcalfe’s efficiency model — \(E = 1/(1 + 5a)\)5, where \(a\) is the ratio of propagation delay to frame transmission time — showed that short cables with long frames yield efficiency above 95%.
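Metcalfe's model is easy to evaluate directly. A minimal sketch, assuming a coax propagation speed of ~2×10^8 m/s and a 12,000-bit (1500-byte) frame as illustrative inputs:

```python
def ethernet_efficiency(cable_m: float, bit_rate: float, frame_bits: int,
                        prop_speed: float = 2e8) -> float:
    """Metcalfe's simplified efficiency model E = 1 / (1 + 5a), where
    a = (propagation delay) / (frame transmission time)."""
    a = (cable_m / prop_speed) / (frame_bits / bit_rate)
    return 1 / (1 + 5 * a)

# PARC-style numbers: a short cable and long frames keep a tiny,
# so efficiency stays above 95%.
print(ethernet_efficiency(1_000, 2.94e6, 12_000))
```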
The design worked brilliantly. So brilliantly that DEC, Intel, and Xerox standardized it as the DIX Ethernet specification in 1980, and IEEE adopted it as 802.3 in 1983 (IEEE Standard for Ethernet 1985). The speed was fixed at 10 Mbps. The frame format was locked. Ethernet escaped the lab and entered the enterprise.
But enterprises had more than one building. And each building had more than one cable segment.
3.3 Act 2: “It’s 1985. Bridges Connect Cable Segments — and Create Loops.”
As Ethernet spread through campuses in the early 1980s, organizations needed to connect multiple cable segments. A building typically had one segment per floor. Connecting floors required a device that could forward frames between segments: a bridge. Bridges are transparent — end stations do not know they exist. A bridge learns which MAC addresses are reachable through each port by observing source addresses, and forwards frames only to the port where the destination lives. If the destination is unknown, the bridge floods the frame to all ports except the incoming one.
The problem emerged when network administrators installed redundant bridges for reliability. Two bridges connecting the same pair of segments create a forwarding loop: a broadcast frame enters bridge A, exits to segment 2, enters bridge B, exits back to segment 1, enters bridge A again — forever. Each copy generates more copies. Within seconds, the network saturates with duplicate frames. This is a broadcast storm, and it can bring down an entire campus network.
Radia Perlman at Digital Equipment Corporation saw that bridges needed to self-organize a loop-free topology without any human configuration (Perlman 1985).
“I was working at DEC and they said, ‘we’ve got this bridging thing and it kind of works but it has this problem with loops.’ So I figured out the spanning tree algorithm.” — Perlman
What the pioneers saw: Campus networks with a handful of bridges (typically 5–20), connecting a modest number of segments. Redundancy was desirable for reliability, but loops were catastrophic. The bridges needed to agree on a loop-free subset of links — a spanning tree — automatically, without configuration, and without end stations knowing anything about bridges.
What remained invisible from the pioneers’ vantage point: Data center networks would eventually contain thousands of switches with hundreds of redundant paths. Blocking half the links to create a spanning tree would waste enormous capacity. And the 30–50 second convergence time would be intolerable for applications requiring sub-second failover.
3.3.1 Which Invariant Broke?
| What Changed | Before | After |
|---|---|---|
| Topology | Single cable segment | Multiple segments connected by bridges |
| Path diversity | One path (the cable) | Multiple paths through redundant bridges |
| Failure mode | Cable break isolates stations | Loop: broadcast storm, network-wide saturation |
The Coordination invariant broke. On a single cable, there was no coordination problem — all stations shared one medium, and CSMA/CD handled access. With bridges connecting multiple segments, the network gained path diversity but no mechanism to prevent loops. Each bridge forwarded independently, following its MAC table. Each bridge saw only its own ports. The result was uncontrolled positive feedback: broadcast frames cycling through loops, multiplying with each pass.
3.3.2 Perlman’s Redesign: The Spanning Tree Protocol (STP)
Perlman applied disaggregation by separating two concerns that bridges had previously conflated: forwarding (delivering frames to the correct port) and loop prevention (ensuring the forwarding topology is acyclic). STP handles loop prevention as a distinct protocol layer, running in the background while forwarding operates normally on the resulting tree.
STP elects a root bridge (the bridge with the lowest bridge ID) and computes a shortest-path tree from the root to every other bridge. Each bridge determines its root port (the port closest to the root) and each network segment determines its designated port (the port closest to the root for that segment). All other ports are placed in blocking state — they neither forward nor learn. The result is a spanning tree: a loop-free subset of the network topology that reaches every segment.
Bridges exchange Bridge Protocol Data Units (BPDUs) every 2 seconds6. Each BPDU carries the root bridge ID, the sender’s distance to the root, and the sender’s bridge ID. When a bridge receives a BPDU advertising a shorter path to the root, it updates its own state. When a bridge stops receiving BPDUs on a port (indicating a failure), it begins reconvergence.
Perlman applied decision placement as distributed but with elected authority — every bridge participates in the election, but the root bridge becomes the single authority that defines the tree. This is a middle point on the distributed-centralized spectrum: no bridge has global knowledge, but they collectively agree on a single tree through the BPDU exchange protocol.
3.3.3 Invariant Analysis: STP (1985)
| Invariant | STP Answer (1985) | Gap? |
|---|---|---|
| State | Each bridge knows: root ID, own distance to root, port roles | No global topology view; each bridge sees only its neighbors’ BPDUs |
| Time | Max-age (20s) + listening (15s) + learning (15s) = 50s convergence | All forwarding halts during reconvergence — packets silently dropped |
| Coordination | Distributed root election via BPDUs | Single root bridge is a logical bottleneck; all traffic follows root-centric paths |
| Interface | BPDU messages on reserved multicast (01:80:C2:00:00:00) | BPDUs are the control plane interface; invisible to end stations |
The Time gap is the most consequential. When a bridge fails or a link goes down, STP must reconverge: detect the failure (up to 20 seconds for BPDUs to age out), then transition ports through listening and learning states (15 seconds each)7. Total: 30–50 seconds during which the network drops all traffic on affected segments. The user-visible consequence is concrete: a laptop plugged into a switchport cannot obtain a DHCP lease until STP finishes converging — DHCP’s 4-second timeout expires repeatedly while ports sit in the listening state, and the user sees “no network” for nearly a minute. This was the price of conservative loop prevention — STP ensures no forwarding occurs until the new tree is guaranteed to be loop-free.
The Coordination gap is structural: all traffic in the spanning tree flows through the root bridge and follows root-centric paths. If stations A and B are on adjacent segments, their frames often traverse the entire tree to reach each other, even though a direct path exists (but is blocked by STP). Redundant links sit idle, wasting half the network’s capacity.
3.3.4 Environment → Measurement → Belief
| Layer | What STP Has | What’s Missing |
|---|---|---|
| Environment | Full physical topology: all bridges, all links, all failures | — |
| Measurement | BPDUs from neighbors: root ID, cost, bridge ID | Direct failure detection — must wait for BPDU timeout |
| Belief | Root bridge identity, shortest path to root, port roles | Global topology; alternative paths; traffic demand patterns |
The E→M gap is structurally filtered: STP deliberately limits each bridge’s view to its immediate neighbors’ BPDUs. This is not an accident — it is a design choice. Perlman’s algorithm works with only local information, which makes it simple and self-organizing. The cost is slow failure detection: a bridge discovers a link failure only when BPDUs stop arriving, which takes up to 20 seconds (the max-age timer).
RSTP (IEEE Standard for Local and Metropolitan Area Networks 2001) would later fix the Time gap by replacing passive timers with an active proposal/agreement handshake. When a bridge detects a topology change, it proposes a new port role to its neighbor; the neighbor immediately agrees or counter-proposes. Convergence drops from 30–50 seconds to sub-second. The lesson is a general Time invariant pattern: timer-based coordination is simple but slow; active negotiation trades complexity for speed. The same tradeoff reappears in BGP (Chapter 6) and OSPF convergence.
3.3.5 “The Gaps Didn’t Matter… Yet.”
Campus networks in the 1980s had modest requirements. A few hundred stations, a handful of bridges, bursty office traffic. Users could tolerate 30–50 seconds of outage during a rare topology change. The capacity wasted by blocking redundant links was invisible — networks were lightly loaded. STP’s guarantees — automatic, configuration-free, loop-free topology — were worth the cost.
But two forces were converging that would shatter these assumptions. First, Ethernet was about to move from shared bus to dedicated switch ports, multiplying the number of links in the network. Second, organizations were about to discover that blocking half of those links was an unacceptable waste.
3.5 Act 4: “It’s 2008. Data Centers Expose STP’s Fatal Flaw.”
The mid-2000s saw an explosion in data center construction. Companies like Google, Amazon, and Microsoft were building warehouse-scale computing facilities with tens of thousands of servers. The traditional data center network was a tree topology: servers connected to access switches, access switches connected to aggregation switches, aggregation switches connected to core switches. Each level had fewer, more expensive switches with higher bandwidth.
This tree had two fatal properties. First, oversubscription: the links between aggregation and core switches carried traffic from hundreds of servers, creating bottlenecks. The ratio of server bandwidth to core bandwidth — the oversubscription ratio — was typically 4:1 or worse, meaning servers could achieve only a fraction of their link speed when communicating across the data center. Second, STP blocked redundant links: even when administrators installed parallel paths between switches for reliability, STP pruned them to a single tree. Half the purchased bandwidth sat idle.
Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat at UC San Diego diagnosed the architectural failure precisely:
“We argue that data center architects should move away from the traditional tree-based architecture and toward the folded Clos topology, where the entire network can be built from commodity switches.” — Al-Fares, Loukissas, and Vahdat, 2008 (Al-Fares et al. 2008)
What the pioneers saw: Data center traffic patterns that were fundamentally different from campus traffic. In a campus, most traffic flows north-south (client to server, through the core). In a data center, most traffic flows east-west (server to server, within the facility). MapReduce, distributed storage, and search index updates generated massive cross-rack traffic that the tree topology was structurally incapable of serving.
What remained invisible from the pioneers’ vantage point: Even fat-tree fabrics would struggle with tenant isolation at cloud scale. The 4,094-VLAN limit of 802.1Q would become a hard barrier. And applications would demand the ability to move virtual machines across the data center without changing their IP addresses — a form of L2 mobility that no L3 fabric natively supports.
3.5.1 Which Invariant Broke?
| What Changed | Before (Campus) | After (Data Center) |
|---|---|---|
| Scale | 10–50 switches | 100–1,000+ switches |
| Traffic pattern | North-south (client → server) | East-west (server → server) |
| Bandwidth demand | Oversubscription tolerable | Full bisection bandwidth required |
| STP impact | Blocks a few redundant links | Blocks half the network’s bandwidth |
The Coordination invariant broke again, but for a different reason than in Act 2. STP prevented loops, but at the cost of blocking all redundant paths. In a tree topology with \(k\) parallel links between levels, STP activates exactly 1 and blocks \(k-1\). For a data center with hundreds of parallel paths, this means STP wastes the majority of the network’s capacity.
3.5.2 Al-Fares’s Redesign: Fat-Tree Topology with ECMP
The insight was architectural: replace the tree with a folded Clos topology (also called a fat-tree). In a \(k\)-port fat-tree, every switch is the same commodity device with \(k\) ports. The topology has three tiers: \(k\) pods of \((k/2)^2\) servers each, connected through edge switches, aggregation switches, and core switches. Every pair of servers has \(k/2\) equal-cost paths for intra-pod traffic and \((k/2)^2\) equal-cost paths for inter-pod traffic.
The fat-tree eliminates STP entirely. Because the topology is symmetric and all paths between tiers have equal cost, Equal-Cost Multi-Path (ECMP) routing distributes traffic across all available paths. No links are blocked. Full bisection bandwidth is achieved: any server can communicate with any other server at full line rate.
Al-Fares applied disaggregation by separating forwarding from loop prevention in a different way than Perlman. STP prevented loops by blocking links. The fat-tree prevents loops by running L3 (IP routing) at every switch — routing protocols inherently prevent loops through shortest-path computation and TTL fields. There is no spanning tree, no root election, no blocked ports.
Al-Fares applied decision placement as distributed — each switch runs its own routing instance and makes independent forwarding decisions based on its routing table. ECMP is stateless: the switch hashes each flow’s 5-tuple to select among equal-cost next hops. No central controller is needed.
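The stateless hash-based selection can be sketched as follows. MD5 here is purely illustrative; hardware switches use much simpler hash functions, and the exact hash is vendor-specific:

```python
import hashlib

def ecmp_next_hop(src_ip: str, dst_ip: str, proto: int,
                  src_port: int, dst_port: int,
                  next_hops: list[str]) -> str:
    """Stateless ECMP: hash the flow 5-tuple to pick one of the
    equal-cost next hops. Every packet of a flow hashes identically,
    so the flow stays on one path and is never reordered."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.md5(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]
```

The determinism that prevents reordering is also the mechanism's blind spot: two large flows that hash to the same next hop will collide there no matter how idle the other paths are.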
3.5.3 Invariant Analysis: Fat-Tree with ECMP (2008)
| Invariant | Fat-Tree Answer (2008) | Gap? |
|---|---|---|
| State | IP routing table: full topology via link-state IGP | L3 state is accurate; tracks topology only, omits L2 host location |
| Time | IGP convergence: sub-second with BFD | Vast improvement over STP's 30–50 seconds |
| Coordination | Distributed ECMP: hash-based per-flow load splitting | Static hashing — blind to congestion and flow size |
| Interface | Standard IP forwarding; no STP, no VLANs needed within the fabric | Still bounded by 802.1Q’s 4,094-VLAN limit at the edge |
The Time improvement is dramatic: from 30–50 seconds (STP) to sub-second (IGP with BFD failure detection). This alone transformed data center reliability. The Coordination improvement is equally significant: instead of blocking redundant links, ECMP activates all of them. The fat-tree delivers full bisection bandwidth using commodity switches.
The Interface gap persists: even though the fabric runs IP internally, tenant-facing edge ports still use 802.1Q VLANs for isolation. A cloud provider hosting 10,000 tenants, each needing isolated L2 domains, exhausts 4,094 VLANs immediately.
The fat-tree also provides predictable latency: every inter-pod path traverses the same number of hops, regardless of which pods the communicating servers sit in (intra-pod paths are shorter still). This uniformity — absent from the tree topology, where cross-pod paths traverse more levels — is as valuable as the bandwidth gain for latency-sensitive data center applications.
Two alternatives to STP competed during this period. TRILL (Transparent Interconnection of Lots of Links) (Perlman et al. 2011) ran IS-IS (Intermediate System to Intermediate System) at L2 to provide shortest-path forwarding with multi-pathing — using hop counts to mitigate temporary loops during convergence, a fundamentally different approach from STP’s blocking strategy. SPB (Shortest Path Bridging) (IEEE Standard for Local and Metropolitan Area Networks 2012) used IS-IS to compute shortest-path trees for each bridge. Both were technically sound but lost to the simpler approach: run IP everywhere and use VXLAN for L2 overlay. TRILL and SPB attempted to fix L2 from within; the industry chose to abandon L2 in the fabric and emulate it at the edge.
3.5.4 “The Gaps Didn’t Matter… Yet.”
Fat-tree fabrics with ECMP solved the bandwidth and convergence problems within a single data center. But cloud computing was creating a new demand: virtual machine mobility. A VM running a web application needs to keep its IP address when it migrates from one physical server to another — otherwise, TCP connections break and DNS caches go stale. IP address preservation requires L2 adjacency: the source and destination must be in the same broadcast domain. In an IP-routed fat-tree, there is no L2 domain that spans the fabric.
3.6 Act 5: “It’s 2014. Applications Need L2 Semantics Across an L3 Fabric.”
Cloud data centers host thousands of tenants, each requiring isolated network domains. Virtual machines migrate across racks for load balancing and maintenance. Container orchestration systems like Kubernetes attach and detach virtual network interfaces in milliseconds. All of these operations assume L2 connectivity: the ability to send Ethernet frames between endpoints regardless of their physical location in the data center.
The fat-tree fabric runs IP, which provides L3 reachability but not L2 adjacency. The 802.1Q VLAN mechanism provides L2 isolation but is limited to 4,094 segments and stops at L3 boundaries. The industry needed a way to deliver L2 semantics — Ethernet frame forwarding, broadcast domains, MAC-based addressing — over an L3 IP infrastructure.
3.6.1 The Solution: VXLAN and EVPN
VXLAN (Virtual eXtensible LAN) (Mahalingam et al. 2014) solves the data plane problem. It encapsulates a complete Ethernet frame inside a UDP packet, adding a 24-bit VXLAN Network Identifier (VNI) that identifies the virtual L2 segment. The encapsulated frame travels across the IP fabric as a normal UDP packet, routed by standard L3 forwarding. At the destination switch (the VXLAN Tunnel Endpoint, or VTEP), the outer headers are stripped and the inner Ethernet frame is delivered to the destination host.
VXLAN’s 24-bit VNI supports \(2^{24} \approx 16\) million virtual segments, obliterating the 4,094-VLAN limit. Each tenant gets its own VNI. VMs in the same VNI can communicate via L2 regardless of their physical location. VMs in different VNIs are completely isolated.
EVPN (Ethernet VPN) (Sajassi et al. 2015) solves the control plane problem. In the original VXLAN specification, VTEPs discovered remote MAC addresses through flood-and-learn: when a switch received a frame for an unknown destination, it flooded the encapsulated frame to all VTEPs. This replicated the inefficiency of L2 flooding — broadcast storms in an overlay network.
EVPN replaces flood-and-learn with BGP-based MAC/IP advertisement. When a host connects to a VTEP, the VTEP advertises the host’s MAC and IP addresses to all other VTEPs using BGP EVPN route types. Remote VTEPs learn MAC-to-VTEP mappings from BGP updates, eliminating the need to flood. ARP requests can be answered locally by the VTEP (ARP suppression), further reducing broadcast traffic. Multi-tenancy is handled through Route Distinguishers (RDs) and Route Targets (RTs): each tenant’s MAC/IP routes carry a unique RD so they remain distinct within BGP, and RTs control which VTEPs import which tenant’s routes. These are the coordination primitives that make EVPN work for cloud providers hosting thousands of isolated tenants on shared physical infrastructure.
VXLAN/EVPN applied disaggregation by separating the L2 forwarding plane (Ethernet frames) from the L2 control plane (MAC learning) and from the physical transport (IP routing). Three independent layers, each optimized separately: the underlay (IP fabric with ECMP) handles packet delivery; the overlay (VXLAN) handles L2 encapsulation; the control plane (BGP EVPN) handles endpoint discovery.
VXLAN/EVPN applied decision placement as centralized control with distributed forwarding. BGP route reflectors distribute MAC/IP bindings to all VTEPs. Each VTEP makes local forwarding decisions based on these bindings. This mirrors the routing architecture: centralized information distribution, distributed packet forwarding.
3.6.2 Invariant Analysis: VXLAN/EVPN (2014)
| Invariant | VXLAN/EVPN Answer (2014) | Gap? |
|---|---|---|
| State | BGP-distributed MAC/IP bindings: global, verified | Control plane overhead scales with number of hosts |
| Time | BGP update propagation: seconds | Slower than hardware-based L2 learning for local events |
| Coordination | BGP route reflectors: centralized distribution | Route reflector is a single logical point of failure |
| Interface | 24-bit VNI (16M segments); MAC-in-UDP encapsulation | Encapsulation overhead: 50-byte VXLAN header per frame |
The Interface answer resolves the 4,094-VLAN bottleneck that persisted through Acts 3 and 4. The State answer is the most significant improvement: instead of per-switch flood-and-learn (local, unverified, slow), every VTEP has a BGP-distributed global view of all MAC/IP bindings. The Coordination answer mirrors BGP’s architecture in the WAN: route reflectors aggregate and distribute information, individual routers make local decisions.
The remaining gaps are operational rather than architectural. VXLAN header overhead (50 bytes9 per frame) reduces the effective MTU — jumbo frames (9,000 bytes) absorb this cost, but standard 1,500-byte frames lose over 3% of payload capacity. BGP EVPN control plane scales linearly with the number of hosts, which can reach hundreds of thousands in a large data center.
3.6.3 Environment → Measurement → Belief After the Fix
| Layer | What VXLAN/EVPN Has | What’s Missing |
|---|---|---|
| Environment | All hosts, their MAC/IP addresses, their VTEP locations | — |
| Measurement | BGP advertisements: MAC, IP, VNI, VTEP IP | Real-time host liveness — must wait for BGP withdrawal on failure |
| Belief | Global MAC/IP → VTEP mapping, per-VNI isolation | Assumes BGP is timely; stale entries possible during convergence |
3.7 Act 6: Access Technologies — DOCSIS and PON
The wired link layer story extends beyond LANs and data centers to access networks — the “last mile” connecting subscribers to their service providers. Two dominant technologies share the wire (or fiber) among multiple subscribers using fundamentally different physical media.
3.7.1 DOCSIS: Broadband Over Cable
DOCSIS (Data Over Cable Service Interface Specification) (CableLabs 1997) delivers broadband over the hybrid fiber-coax (HFC) plants originally built for cable television. The downstream direction uses frequency-division multiplexing: the cable operator allocates specific frequency channels for data, separate from television channels. The upstream direction faces a coordination challenge: multiple cable modems share a single coaxial segment back to the CMTS (Cable Modem Termination System).
DOCSIS upstream uses a hybrid contention + scheduled model, not pure TDMA. Modems compete using a contention-based request phase (similar to ALOHA with exponential backoff) to request upstream bandwidth. The CMTS receives these requests and responds with explicit grants — scheduled time slots for actual data transmission. Contention exists for the request, but data delivery is collision-free. This two-phase design balances responsiveness (modems request immediately when data arrives) with efficiency (data slots are non-overlapping). The cable plant is highly asymmetric: up to 1.2 GHz of downstream spectrum vs. 5–204 MHz upstream, yielding a ~10:1 bandwidth ratio favoring download. Active electronics (amplifiers) in the field maintain signal strength over distances up to 100 miles.
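The grant phase of the two-phase model above can be sketched as a capacity-limited scheduler. Real CMTS schedulers weigh service flows and QoS classes; this first-come-first-served version, with illustrative names, shows only the non-overlapping-grants property:

```python
def schedule_upstream(requests: list[tuple[str, int]],
                      slots_per_map: int) -> dict[str, int]:
    """CMTS grant allocation: modems have already contended to submit
    (modem_id, minislots_wanted) requests; the CMTS issues explicit,
    non-overlapping grants up to the MAP's capacity, so the data phase
    is collision-free even though the request phase was not."""
    grants: dict[str, int] = {}
    remaining = slots_per_map
    for modem, wanted in requests:
        granted = min(wanted, remaining)
        if granted:
            grants[modem] = granted
            remaining -= granted
    return grants
```

A modem whose request loses the contention phase simply backs off and re-requests; only the small request messages ever risk collision, never the data.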
DOCSIS has evolved through five generations: 1.0 (1997, 40 Mbps downstream), 2.0 (2001, 30 Mbps upstream), 3.0 (2006, channel bonding for 1 Gbps), 3.1 (2013, OFDM for 10 Gbps downstream) (CableLabs 2013), and 4.0 (2020, 10 Gbps downstream and 6 Gbps upstream). Each generation applied closed-loop reasoning by tightening the feedback between the CMTS and modems: more precise timing, finer-grained scheduling, and adaptive modulation based on measured channel quality.
3.7.3 Invariant Analysis: Access Networks (DOCSIS and PON)
| Invariant | DOCSIS Answer | PON Answer |
|---|---|---|
| State | CMTS tracks per-modem SNR, queue depth, grant schedule | OLT tracks per-ONU distance (ranging), queue depth, burst schedule |
| Time | Contention slots + scheduled grants; MAP messages every ~2ms | 125-microsecond framing cycle; DBA (Dynamic Bandwidth Allocation) per cycle |
| Coordination | Hybrid: contention for requests, centralized for data | Fully centralized: OLT schedules all upstream access |
| Interface | DOCSIS MAC frames over RF; CMTS as L2/L3 gateway | GEM (GPON Encapsulation Method) frames; OLT as L2 gateway |
The key difference is in the Coordination invariant: DOCSIS retains contention for the request phase (inheriting from its cable-TV broadcast origins), while PON eliminates contention entirely through OLT scheduling. Both converge on centralized data-plane scheduling, but DOCSIS carries a residual collision risk that PON avoids.
3.7.4 Environment → Measurement → Belief: Access Networks
| Layer | What the Access Network Has | What’s Missing |
|---|---|---|
| Environment | True subscriber demand, channel conditions per modem/ONU | — |
| Measurement | CMTS/OLT: per-subscriber SNR reports, queue requests | Per-application demand; subscriber intent |
| Belief | “Subscriber X needs Y bandwidth this cycle” | Long-term demand prediction; no awareness of application type |
The E→M gap is accidentally noisy: both CMTS and OLT receive honest per-subscriber measurements (SNR, buffer occupancy), but the measurements lag real-time demand by one scheduling cycle. Better prediction algorithms (anticipating demand patterns) would improve utilization, but the fundamental mechanism is sound.
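One hedged sketch of such a prediction: grant against a smoothed estimate of demand rather than the last report alone. The EWMA below is a generic technique, not anything specified by DOCSIS or GPON, and the smoothing factor `ALPHA` is an arbitrary illustrative choice.

```python
# Hedged sketch: narrowing the one-cycle measurement lag by granting on a
# short-horizon prediction. ALPHA is illustrative, not from any spec.

ALPHA = 0.3

def predict_next_demand(history, prev_estimate=0.0):
    """Exponentially weighted moving average over per-cycle queue reports."""
    estimate = prev_estimate
    for report in history:
        estimate = ALPHA * report + (1 - ALPHA) * estimate
    return estimate

# A subscriber ramping up: the estimate trails the raw reports but tracks
# the trend, so grants arrive closer to actual demand.
reports = [0, 0, 1000, 2000, 4000, 8000]
est = predict_next_demand(reports)
assert 0 < est < 8000
```

The estimate still lags a sharp burst — closing the gap entirely would require the subscriber to signal intent, which is exactly the "What's Missing" column above.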
Both DOCSIS and PON illustrate how the binding constraint shapes the coordination answer. When one entity (the service provider) controls the medium and the other entities (subscribers) have no need to communicate directly, centralized scheduling is the natural and efficient solution. Distributed contention (CSMA/CD) solves a different problem: peer stations with equal authority and no coordinator.
3.9 Generative Exercises
A switched Ethernet network uses a switch with a MAC address table that can hold 8,192 entries. An attacker sends frames with randomly generated source MAC addresses at a rate of 1,000 frames per second.
Using the four invariants: (a) Which invariant does this attack exploit? (b) What happens to the switch’s forwarding behavior when the MAC table is full? (c) How does this convert a switched network back into a shared network? (d) Propose a defense mechanism and identify which design principle it applies.
A data center runs VXLAN for L2 overlay but uses the original flood-and-learn mechanism instead of EVPN. The data center has 500 VTEPs and 50,000 virtual machines.
Using the E-M-B decomposition: (a) What is the Environment, Measurement, and Belief at each VTEP? (b) Where is the E→M gap, and what type is it (accidentally noisy, physically limited, or structurally filtered)? (c) Calculate the broadcast amplification factor when a single ARP request is flooded to all 500 VTEPs. (d) Explain how EVPN closes this gap and identify which invariant it fixes.
A network architect proposes using STP in a 48-switch data center arranged as a fat-tree topology. Each switch has 48 ports, and there are 576 inter-switch links.
Using the dependency chain: (a) How many links will STP block? (b) What fraction of the total inter-switch bandwidth is wasted? (c) What is the maximum convergence time if the root bridge fails? (d) The architect argues that RSTP solves the convergence problem. Explain why faster convergence leaves the bandwidth problem intact — identify which invariant RSTP fixes and which it leaves broken.
The ratio of propagation delay to frame time, \(a\), must be much less than 1 for CSMA/CD to work efficiently. At 2.94 Mbit/s and 1 km, \(a \approx 0.2\). When the DIX consortium raised the speed to 10 Mbps, they shortened the maximum cable segment to 500 m and imposed a 64-byte minimum frame size to preserve \(a < 1\). Every Ethernet speed increase since has required either shorter cables or longer minimum frames — the same physics constraint, applied repeatedly.↩︎
The consortium chose 48-bit MAC addresses with a hierarchical structure: the first 24 bits are an Organizationally Unique Identifier (OUI) assigned by IEEE to each manufacturer; the remaining 24 bits are assigned by the manufacturer. This yields \(2^{48} \approx 281\) trillion globally unique addresses — enough that every network interface card ever manufactured receives a unique address with no coordination beyond the OUI registry. The 48-bit choice was an Interface invariant decision: make the address space large enough that it never needs revisiting. It succeeded — 48-bit MACs remain unchanged 45 years later.↩︎
At 10 Mbps, a 64-byte (512-bit) frame takes 512 / 10^7 = 51.2 µs. This equals the round-trip propagation time across a maximum 2.5 km cable segment (2500 m / 200 m/µs = 12.5 µs one-way, ~25 µs round-trip, plus repeater and transceiver delays). The minimum frame size and maximum cable length are locked together through this constraint.↩︎
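The arithmetic in this footnote can be checked directly. The helper below uses the ~200 m/µs propagation speed quoted above; the function name is ours, not from any standard.

```python
# Checking the frame-size/cable-length coupling from the footnotes.
# Assumes ~200 m/µs propagation speed in coax, as quoted above.

def min_frame_bits(rate_bps, cable_m, prop_m_per_us=200.0):
    """Bits a transmitter must still be sending when a worst-case
    collision signal returns from the far end of the cable."""
    round_trip_us = 2 * cable_m / prop_m_per_us
    return rate_bps * round_trip_us / 1e6

# 10 Mbps over 2.5 km: propagation alone demands >= 250 bits. The actual
# 512-bit (64-byte) minimum leaves budget for repeater and transceiver
# delays and the jam signal, as the footnote notes.
assert min_frame_bits(10e6, 2500) == 250.0
```

The same function applied to the experimental parameters (2.94 Mbit/s, 1 km) gives about 29 bits — why the original design had ample margin.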
The cap at 10 doublings (2^10 - 1 = 1023 slots) limits maximum backoff to ~52 ms. The discard after 16 collisions is an engineering choice: beyond 16 consecutive collisions, the probability of successful delivery is negligible and the frame is better retried by a higher layer.↩︎
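The truncated backoff rule reads naturally as code. This is a sketch of the algorithm as just described, using the 51.2 µs slot time from the 10 Mbps parameters; the function name and exception are ours.

```python
import random

# Sketch of truncated binary exponential backoff as described above.
# Slot time of 51.2 µs assumed, per the 10 Mbps parameters.

SLOT_US = 51.2

def backoff_slots(attempt):
    """After the nth consecutive collision, wait a random number of slots
    drawn from [0, 2^min(n,10) - 1]; give up after 16 attempts."""
    if attempt > 16:
        raise RuntimeError("frame discarded: 16 consecutive collisions")
    k = min(attempt, 10)
    return random.randrange(2 ** k)

# Worst-case wait: 1023 slots ≈ 52.4 ms, matching the figure above.
max_wait_ms = (2 ** 10 - 1) * SLOT_US / 1000
assert abs(max_wait_ms - 52.4) < 0.1
```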
The factor 5 arises from Metcalfe’s analysis assuming the average collision involves about 2.5 stations and the jam signal adds overhead. See Metcalfe and Boggs (1976) for the full derivation.↩︎
The 2-second hello timer is a conservative default, chosen to balance timely failure detection against control-plane bandwidth consumption. Faster hellos detect failures sooner but increase BPDU traffic.↩︎
Max-age = 20s (10 × hello interval) allows for up to 10 missed BPDUs before declaring a link dead. The 15s listening + 15s learning states prevent transient loops during reconvergence by ensuring the bridge has heard from all neighbors before forwarding.↩︎
The 12-bit VLAN ID occupies a 4-byte 802.1Q tag inserted between the source MAC and Type fields. Two values are reserved (0 and 4095), leaving 4,094 usable VLANs.↩︎
Outer Ethernet (14 bytes) + outer IP (20 bytes) + outer UDP (8 bytes) + VXLAN header (8 bytes) = 50 bytes of encapsulation overhead per frame.↩︎
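A quick check of this arithmetic, plus its operational consequence: a full-size 1500-byte inner payload needs a 1550-byte underlay IP packet, which is why VXLAN fabrics typically run jumbo-frame MTUs on the physical network.

```python
# The footnote's 50-byte encapsulation overhead, worked out, and the
# underlay MTU it implies for a standard 1500-byte inner payload.

OUTER_ETH, OUTER_IP, OUTER_UDP, VXLAN_HDR = 14, 20, 8, 8
INNER_ETH = 14  # the encapsulated frame's own Ethernet header

overhead = OUTER_ETH + OUTER_IP + OUTER_UDP + VXLAN_HDR
assert overhead == 50  # matches the footnote

# Underlay IP MTU needed to carry a 1500-byte inner payload without
# fragmentation (outer Ethernet header is not counted against IP MTU):
required_underlay_mtu = 1500 + INNER_ETH + VXLAN_HDR + OUTER_UDP + OUTER_IP
assert required_underlay_mtu == 1550
```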
125 µs = 1/8000 seconds, inherited from the 8 kHz sampling rate of the telephone network. This framing period predates GPON — it was already the standard TDM frame period in SONET/SDH.↩︎