flowchart TD
A["Shared coaxial bus<br/>(physical constraint)"]:::constraint --> B["CSMA/CD works<br/>(100 stations)"]:::fix
B --> C["Bridges extend reach<br/>(campus-wide L2)"]:::fix
C --> D["Loops: broadcast storms<br/>(frames cycle forever)"]:::failure
D --> E["STP: prune to tree<br/>(Perlman 1985)"]:::fix
E --> F["Half the links blocked<br/>(wasted bandwidth)"]:::failure
E --> G["30-50s convergence<br/>(all forwarding halts)"]:::failure
F --> H["Switches: full-duplex<br/>(no CSMA/CD needed)"]:::fix
H --> I["Broadcast domain grows<br/>(thousands of stations)"]:::failure
I --> J["VLANs: 802.1Q<br/>(12-bit, 4094 limit)"]:::fix
G --> K["Data center scale<br/>(100-1000 switches)"]:::constraint
F --> K
K --> L["Fat-tree + ECMP<br/>(Al-Fares 2008)"]:::fix
L --> M["No L2 mobility<br/>(IP fabric, no VLANs)"]:::failure
J --> M
M --> N["VXLAN: L2 over L3<br/>(24-bit VNI)"]:::fix
N --> O["Flood-and-learn<br/>(broadcast storms in overlay)"]:::failure
O --> P["EVPN: BGP control plane<br/>(MAC/IP distribution)"]:::fix
classDef constraint fill:#4a90d9,stroke:#2c5f8a,color:#fff
classDef failure fill:#e74c3c,stroke:#a93226,color:#fff
classDef fix fill:#27ae60,stroke:#1a7a42,color:#fff
3 Wired Link Layer
3.2 Act 1: “It’s 1976. A Building Full of Personal Computers Needs a Network.”
Xerox PARC in Palo Alto was unlike any computing environment of its era. By 1973, the Alto — one of the first personal workstations — was sitting on dozens of desks. These machines needed to share a laser printer, access file servers, and exchange electronic mail. The available networking options were inadequate: ARPANET was for inter-site WAN communication, and serial point-to-point links could not connect a building full of machines. Robert Metcalfe, a recent Harvard PhD who had studied Norman Abramson’s ALOHAnet (Abramson 1970), saw the opportunity to adapt packet radio concepts to a local wired medium.
The experimental Ethernet ran at 2.94 Mbit/s over 75-ohm coaxial cable with a reach of approximately 1 km. These numbers are coupled through physics: collision detection requires that a transmitter hear a collision before it finishes sending its shortest frame. The round-trip propagation time across 1 km of cable is approximately 10 µs; at 2.94 Mbit/s, the minimum frame takes roughly 50 µs to transmit — safely longer than the round-trip, so collisions are always detected in time.1 Frames used 8-bit addresses, sufficient for 256 stations in one building.
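The coupling between cable length, bit rate, and minimum frame size can be checked numerically. A minimal sketch, assuming a typical coax propagation speed of about 2×10^8 m/s (the speed itself is an assumption, not stated in the text):

```python
# Why Ethernet couples reach, bit rate, and minimum frame size:
# a transmitter must still be sending when a collision echo returns.
PROP_SPEED = 2e8      # m/s, assumed signal speed in coaxial cable
CABLE_LEN = 1_000     # m, maximum reach of the experimental Ethernet
BIT_RATE = 2.94e6     # bit/s

round_trip = 2 * CABLE_LEN / PROP_SPEED     # worst-case collision window
min_frame_bits = round_trip * BIT_RATE      # smallest frame still on the
                                            # wire when the echo arrives

print(f"round trip: {round_trip * 1e6:.1f} us")      # ~10 us
print(f"minimum detectable frame: {min_frame_bits:.0f} bits")
```

With a ~50 µs minimum frame time, the design leaves a wide safety margin over this ~10 µs bound.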
When DEC, Intel, and Xerox formed the DIX consortium to standardize Ethernet for commercial use (1980), they confronted a scope change: Ethernet would span multiple buildings, campuses, and eventually organizations. An 8-bit address space — adequate for one lab — was insufficient.2
“Ether is a passive broadcast medium with no central control, and with stations that have equal access. Coordination of access to the Ether for packet transmission is distributed among the contending transmitting stations using controlled statistical arbitration.” — Metcalfe and Boggs, 1976 (Metcalfe and Boggs 1976)
What the pioneers saw: A single building with 100–200 workstations. A shared coaxial cable snaking through the ceiling, physically accessible to every machine. Traffic was bursty — print jobs, file transfers, electronic mail — not continuous streams. The wire was idle most of the time. The workstations were cooperative (all owned by Xerox PARC researchers, no adversarial traffic). The cable was short enough that propagation delay was negligible relative to packet transmission time.
What remained invisible from the pioneers’ vantage point: Networks would scale from hundreds to millions of endpoints. Traffic would shift from bursty office applications to continuous video streams. Organizations would need to connect buildings, campuses, and eventually data centers containing hundreds of thousands of servers. A single shared cable was inherently limited to building scale.
3.2.2 Invariant Analysis: Ethernet CSMA/CD (1976)
| Invariant | Ethernet Answer (1976) | Gap? |
|---|---|---|
| State | Carrier sense + collision detection: rich local knowledge | Limited to one cable segment; no knowledge of stations beyond the wire |
| Time | Binary exponential backoff (BEB): randomized, doubles per collision | No adaptation to offered load; 16-collision cap means arbitrary frame loss |
| Coordination | Fully distributed: each station decides independently | Fair under light load; under heavy load, BEB penalizes unlucky stations |
| Interface | Ethernet frame: 6-byte DA, 6-byte SA, 2-byte Type, 46–1500 byte payload, 4-byte FCS | Frame format ossifies immediately — unchanged for 50 years |
The State answer is remarkably good for a local protocol. Carrier sense gives each station real-time knowledge of whether another station is currently transmitting. Collision detection gives instantaneous feedback when two transmissions overlap. The measurement is honest and accurate — on a wire, signal interference is unambiguous.
The Time answer is adequate but imprecise. BEB tracks only collision count as an indirect proxy for offered load. A station experiencing its 5th collision could be competing with 2 other stations (bad luck) or 200 (genuine overload). BEB conflates both cases. This is an accidentally noisy measurement — the signal (collision count) is honest but imprecise as an estimator of contention.
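The BEB rule described above can be sketched in a few lines. The 51.2 µs slot time is the 10 Mbps figure and the 1024-slot ceiling is the standard cap; both are assumptions drawn from the later IEEE specification rather than from this text:

```python
import random

def beb_backoff(collisions: int, slot_time_us: float = 51.2):
    """Binary exponential backoff: after the n-th collision, wait a random
    number of slots in [0, 2**min(n, 10) - 1]. Returns the wait in
    microseconds, or None once the 16-collision cap is hit and the frame
    is discarded. Note the rule sees only the collision count, not the
    actual number of contending stations."""
    if collisions >= 16:
        return None                    # give up: arbitrary frame loss
    k = min(collisions, 10)            # exponent capped at 10
    return random.randrange(2 ** k) * slot_time_us
```

A station on its 5th collision draws uniformly from 32 slots whether it faces 2 competitors or 200, which is exactly the imprecision described above.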
The Interface answer proves prophetic: the Ethernet frame format defined in 1976 — destination MAC, source MAC, type/length, payload, FCS — survives essentially unchanged into the 2020s. This is the most successful ossified interface in networking history.
3.2.3 Environment → Measurement → Belief
| Layer | What Ethernet Has | What’s Missing |
|---|---|---|
| Environment | All stations on the cable, their traffic patterns, queue depths | — |
| Measurement | Carrier sense (is wire busy?), collision detect (did signals interfere?) | Station count, offered load, queue depths of competitors |
| Belief | “Wire is idle, safe to transmit” or “collision occurred, back off” | No model of contention level; each collision is treated identically |
The E→M gap is accidentally noisy: the wire provides an honest signal (busy/idle, collision/clean), but the station has no way to estimate how many other stations are waiting. Better estimation algorithms (adaptive backoff based on observed collision rate) were proposed but never standardized — the simple BEB was good enough for the environments Ethernet actually served.
3.2.4 “The Gaps Didn’t Matter… Yet.”
At Xerox PARC with 100 workstations and bursty traffic, utilization rarely exceeded 10%. Collisions were infrequent. BEB resolved them quickly. The single cable was short enough that propagation delay was negligible, making collision detection nearly instantaneous. Metcalfe’s efficiency model — \(E = 1/(1 + 5a)\)5, where \(a\) is the ratio of propagation delay to frame transmission time — showed that short cables with long frames yield efficiency above 95%.
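Metcalfe's model is easy to evaluate directly. A minimal sketch, assuming a coax propagation speed of ~2×10^8 m/s and a 12,000-bit (1500-byte) frame as illustrative inputs:

```python
def ethernet_efficiency(cable_m: float, bit_rate: float, frame_bits: int,
                        prop_speed: float = 2e8) -> float:
    """Metcalfe's simplified efficiency model E = 1 / (1 + 5a), where
    a = (propagation delay) / (frame transmission time)."""
    a = (cable_m / prop_speed) / (frame_bits / bit_rate)
    return 1 / (1 + 5 * a)

# PARC-style numbers: a short cable and long frames keep a tiny,
# so efficiency stays above 95%.
print(ethernet_efficiency(1_000, 2.94e6, 12_000))
```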
The design worked brilliantly. So brilliantly that DEC, Intel, and Xerox standardized it as the DIX Ethernet specification in 1980, and IEEE adopted it as 802.3 in 1983 (IEEE Standard for Ethernet 1985). The speed was fixed at 10 Mbps. The frame format was locked. Ethernet escaped the lab and entered the enterprise.
But enterprises had more than one building. And each building had more than one cable segment.
3.3 Act 2: “It’s 1985. Bridges Connect Cable Segments — and Create Loops.”
As Ethernet spread through campuses in the early 1980s, organizations needed to connect multiple cable segments. A building typically had one segment per floor. Connecting floors required a device that could forward frames between segments: a bridge. Bridges are transparent — end stations do not know they exist. A bridge learns which MAC addresses are reachable through each port by observing source addresses, and forwards frames only to the port where the destination lives. If the destination is unknown, the bridge floods the frame to all ports except the incoming one.
The problem emerged when network administrators installed redundant bridges for reliability. Two bridges connecting the same pair of segments create a forwarding loop: a broadcast frame enters bridge A, exits to segment 2, enters bridge B, exits back to segment 1, enters bridge A again — forever. Each copy generates more copies. Within seconds, the network saturates with duplicate frames. This is a broadcast storm, and it can bring down an entire campus network.
Radia Perlman at Digital Equipment Corporation saw that bridges needed to self-organize a loop-free topology without any human configuration (Perlman 1985).
“I was working at DEC and they said, ‘we’ve got this bridging thing and it kind of works but it has this problem with loops.’ So I figured out the spanning tree algorithm.” — Perlman
What the pioneers saw: Campus networks with a handful of bridges (typically 5–20), connecting a modest number of segments. Redundancy was desirable for reliability, but loops were catastrophic. The bridges needed to agree on a loop-free subset of links — a spanning tree — automatically, without configuration, and without end stations knowing anything about bridges.
What remained invisible from the pioneers’ vantage point: Data center networks would eventually contain thousands of switches with hundreds of redundant paths. Blocking half the links to create a spanning tree would waste enormous capacity. And the 30–50 second convergence time would be intolerable for applications requiring sub-second failover.
3.3.1 Which Invariant Broke?
| What Changed | Before | After |
|---|---|---|
| Topology | Single cable segment | Multiple segments connected by bridges |
| Path diversity | One path (the cable) | Multiple paths through redundant bridges |
| Failure mode | Cable break isolates stations | Loop: broadcast storm, network-wide saturation |
The Coordination invariant broke. On a single cable, there was no coordination problem — all stations shared one medium, and CSMA/CD handled access. With bridges connecting multiple segments, the network gained path diversity but no mechanism to prevent loops. Each bridge forwarded independently, following its MAC table. Each bridge saw only its own ports. The result was uncontrolled positive feedback: broadcast frames cycling through loops, multiplying with each pass.
3.3.2 Perlman’s Redesign: The Spanning Tree Protocol (STP)
Perlman applied disaggregation by separating two concerns that bridges had previously conflated: forwarding (delivering frames to the correct port) and loop prevention (ensuring the forwarding topology is acyclic). STP handles loop prevention as a distinct protocol layer, running in the background while forwarding operates normally on the resulting tree.
STP elects a root bridge (the bridge with the lowest bridge ID) and computes a shortest-path tree from the root to every other bridge. Each bridge determines its root port (the port closest to the root) and each network segment determines its designated port (the port closest to the root for that segment). All other ports are placed in blocking state — they neither forward nor learn. The result is a spanning tree: a loop-free subset of the network topology that reaches every segment.
Bridges exchange Bridge Protocol Data Units (BPDUs) every 2 seconds6. Each BPDU carries the root bridge ID, the sender’s distance to the root, and the sender’s bridge ID. When a bridge receives a BPDU advertising a shorter path to the root, it updates its own state. When a bridge stops receiving BPDUs on a port (indicating a failure), it begins reconvergence.
Perlman applied decision placement as distributed but with elected authority — every bridge participates in the election, but the root bridge becomes the single authority that defines the tree. This is a middle point on the distributed-centralized spectrum: no bridge has global knowledge, but they collectively agree on a single tree through the BPDU exchange protocol.
3.3.3 Invariant Analysis: STP (1985)
| Invariant | STP Answer (1985) | Gap? |
|---|---|---|
| State | Each bridge knows: root ID, own distance to root, port roles | No global topology view; each bridge sees only its neighbors’ BPDUs |
| Time | Max-age (20s) + listening (15s) + learning (15s) = 50s convergence | All forwarding halts during reconvergence — packets silently dropped |
| Coordination | Distributed root election via BPDUs | Single root bridge is a logical bottleneck; all traffic follows root-centric paths |
| Interface | BPDU messages on reserved multicast (01:80:C2:00:00:00) | BPDUs are the control plane interface; invisible to end stations |
The Time gap is the most consequential. When a bridge fails or a link goes down, STP must reconverge: detect the failure (up to 20 seconds for BPDUs to age out), then transition ports through listening and learning states (15 seconds each)7. Total: 30–50 seconds during which the network drops all traffic on affected segments. The user-visible consequence is concrete: a laptop plugged into a switchport cannot obtain a DHCP lease until STP finishes converging — DHCP’s 4-second timeout expires repeatedly while ports sit in the listening state, and the user sees “no network” for nearly a minute. This was the price of conservative loop prevention — STP ensures no forwarding occurs until the new tree is guaranteed to be loop-free.
The Coordination gap is structural: all traffic in the spanning tree flows through the root bridge and follows root-centric paths. If stations A and B are on adjacent segments, their frames often traverse the entire tree to reach each other, even though a direct path exists (but is blocked by STP). Redundant links sit idle, wasting half the network’s capacity.
3.3.4 Environment → Measurement → Belief
| Layer | What STP Has | What’s Missing |
|---|---|---|
| Environment | Full physical topology: all bridges, all links, all failures | — |
| Measurement | BPDUs from neighbors: root ID, cost, bridge ID | Direct failure detection — must wait for BPDU timeout |
| Belief | Root bridge identity, shortest path to root, port roles | Global topology; alternative paths; traffic demand patterns |
The E→M gap is structurally filtered: STP deliberately limits each bridge’s view to its immediate neighbors’ BPDUs. This is not an accident — it is a design choice. Perlman’s algorithm works with only local information, which makes it simple and self-organizing. The cost is slow failure detection: a bridge discovers a link failure only when BPDUs stop arriving, which takes up to 20 seconds (the max-age timer).
RSTP (IEEE Standard for Local and Metropolitan Area Networks 2001) would later fix the Time gap by replacing passive timers with an active proposal/agreement handshake. When a bridge detects a topology change, it proposes a new port role to its neighbor; the neighbor immediately agrees or counter-proposes. Convergence drops from 30–50 seconds to sub-second. The lesson is a general Time invariant pattern: timer-based coordination is simple but slow; active negotiation trades complexity for speed. The same tradeoff reappears in BGP (Chapter 6) and OSPF convergence.
3.3.5 “The Gaps Didn’t Matter… Yet.”
Campus networks in the 1980s had modest requirements. A few hundred stations, a handful of bridges, bursty office traffic. Users could tolerate 30–50 seconds of outage during a rare topology change. The capacity wasted by blocking redundant links was invisible — networks were lightly loaded. STP’s guarantees — automatic, configuration-free, loop-free topology — were worth the cost.
But two forces were converging that would shatter these assumptions. First, Ethernet was about to move from shared bus to dedicated switch ports, multiplying the number of links in the network. Second, organizations were about to discover that blocking half of those links was an unacceptable waste.
3.5 Act 4: “It’s 2008. Data Centers Expose STP’s Fatal Flaw.”
The mid-2000s saw an explosion in data center construction. Companies like Google, Amazon, and Microsoft were building warehouse-scale computing facilities with tens of thousands of servers. The traditional data center network was a tree topology: servers connected to access switches, access switches connected to aggregation switches, aggregation switches connected to core switches. Each level had fewer, more expensive switches with higher bandwidth.
This tree had two fatal properties. First, oversubscription: the links between aggregation and core switches carried traffic from hundreds of servers, creating bottlenecks. The ratio of server bandwidth to core bandwidth — the oversubscription ratio — was typically 4:1 or worse, meaning servers could achieve only a fraction of their link speed when communicating across the data center. Second, STP blocked redundant links: even when administrators installed parallel paths between switches for reliability, STP pruned them to a single tree. Half the purchased bandwidth sat idle.
Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat at UC San Diego diagnosed the architectural failure precisely:
“We argue that data center architects should move away from the traditional tree-based architecture and toward the folded Clos topology, where the entire network can be built from commodity switches.” — Al-Fares, Loukissas, and Vahdat, 2008 (Al-Fares et al. 2008)
What the pioneers saw: Data center traffic patterns that were fundamentally different from campus traffic. In a campus, most traffic flows north-south (client to server, through the core). In a data center, most traffic flows east-west (server to server, within the facility). MapReduce, distributed storage, and search index updates generated massive cross-rack traffic that the tree topology was structurally incapable of serving.
What remained invisible from the pioneers’ vantage point: Even fat-tree fabrics would struggle with tenant isolation at cloud scale. The 4,094-VLAN limit of 802.1Q would become a hard barrier. And applications would demand the ability to move virtual machines across the data center without changing their IP addresses — a form of L2 mobility that no L3 fabric natively supports.
3.5.1 Which Invariant Broke?
| What Changed | Before (Campus) | After (Data Center) |
|---|---|---|
| Scale | 10–50 switches | 100–1,000+ switches |
| Traffic pattern | North-south (client → server) | East-west (server → server) |
| Bandwidth demand | Oversubscription tolerable | Full bisection bandwidth required |
| STP impact | Blocks a few redundant links | Blocks half the network’s bandwidth |
The Coordination invariant broke again, but for a different reason than in Act 2. STP prevented loops, but at the cost of blocking all redundant paths. In a tree topology with \(k\) parallel links between levels, STP activates exactly 1 and blocks \(k-1\). For a data center with hundreds of parallel paths, this means STP wastes the majority of the network’s capacity.
3.5.2 Al-Fares’s Redesign: Fat-Tree Topology with ECMP
The insight was architectural: replace the tree with a folded Clos topology (also called a fat-tree). In a \(k\)-port fat-tree, every switch is the same commodity device with \(k\) ports. The topology has three tiers: \(k\) pods of \((k/2)^2\) servers each, connected through edge switches, aggregation switches, and core switches. Every pair of servers has \(k/2\) equal-cost paths for intra-pod traffic and \((k/2)^2\) equal-cost paths for inter-pod traffic.
The fat-tree eliminates STP entirely. Because the topology is symmetric and all paths between tiers have equal cost, Equal-Cost Multi-Path (ECMP) routing distributes traffic across all available paths. No links are blocked. Full bisection bandwidth is achieved: any server can communicate with any other server at full line rate.
Al-Fares applied disaggregation by separating forwarding from loop prevention in a different way than Perlman. STP prevented loops by blocking links. The fat-tree prevents loops by running L3 (IP routing) at every switch — routing protocols inherently prevent loops through shortest-path computation and TTL fields. There is no spanning tree, no root election, no blocked ports.
Al-Fares applied decision placement as distributed — each switch runs its own routing instance and makes independent forwarding decisions based on its routing table. ECMP is stateless: the switch hashes each flow’s 5-tuple to select among equal-cost next hops. No central controller is needed.
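The stateless hash-based selection can be sketched as follows. MD5 here is purely illustrative; hardware switches use much simpler hash functions, and the exact hash is vendor-specific:

```python
import hashlib

def ecmp_next_hop(src_ip: str, dst_ip: str, proto: int,
                  src_port: int, dst_port: int,
                  next_hops: list[str]) -> str:
    """Stateless ECMP: hash the flow 5-tuple to pick one of the
    equal-cost next hops. Every packet of a flow hashes identically,
    so the flow stays on one path and is never reordered."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.md5(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]
```

The determinism that prevents reordering is also the mechanism's blind spot: two large flows that hash to the same next hop will collide there no matter how idle the other paths are.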
3.5.3 Invariant Analysis: Fat-Tree with ECMP (2008)
| Invariant | Fat-Tree Answer (2008) | Gap? |
|---|---|---|
| State | IP routing table: full topology via link-state IGP | L3 state is accurate; tracks topology only, omits L2 host location |
| Time | IGP convergence: sub-second with BFD | Vast improvement over STP's 30–50 seconds |
| Coordination | Distributed ECMP: hash-based per-flow load splitting | Static hashing — blind to congestion and flow size |
| Interface | Standard IP forwarding; no STP, no VLANs needed within the fabric | Still bounded by 802.1Q’s 4,094-VLAN limit at the edge |
The Time improvement is dramatic: from 30–50 seconds (STP) to sub-second (IGP with BFD failure detection). This alone transformed data center reliability. The Coordination improvement is equally significant: instead of blocking redundant links, ECMP activates all of them. The fat-tree delivers full bisection bandwidth using commodity switches.
The Interface gap persists: even though the fabric runs IP internally, tenant-facing edge ports still use 802.1Q VLANs for isolation. A cloud provider hosting 10,000 tenants, each needing isolated L2 domains, exhausts 4,094 VLANs immediately.
The fat-tree also provides predictable latency: every inter-pod path traverses the same number of hops, regardless of which pods the communicating servers sit in (intra-pod paths are shorter still). This uniformity — absent from the tree topology, where cross-pod paths traverse more levels — is as valuable as the bandwidth gain for latency-sensitive data center applications.
Two alternatives to STP competed during this period. TRILL (Transparent Interconnection of Lots of Links) (Perlman et al. 2011) ran IS-IS (Intermediate System to Intermediate System) at L2 to provide shortest-path forwarding with multi-pathing — using hop counts to mitigate temporary loops during convergence, a fundamentally different approach from STP’s blocking strategy. SPB (Shortest Path Bridging) (IEEE Standard for Local and Metropolitan Area Networks 2012) used IS-IS to compute shortest-path trees for each bridge. Both were technically sound but lost to the simpler approach: run IP everywhere and use VXLAN for L2 overlay. TRILL and SPB attempted to fix L2 from within; the industry chose to abandon L2 in the fabric and emulate it at the edge.
3.5.4 “The Gaps Didn’t Matter… Yet.”
Fat-tree fabrics with ECMP solved the bandwidth and convergence problems within a single data center. But cloud computing was creating a new demand: virtual machine mobility. A VM running a web application needs to keep its IP address when it migrates from one physical server to another — otherwise, TCP connections break and DNS caches go stale. IP address preservation requires L2 adjacency: the source and destination must be in the same broadcast domain. In an IP-routed fat-tree, there is no L2 domain that spans the fabric.
3.6 Act 5: “It’s 2014. Applications Need L2 Semantics Across an L3 Fabric.”
Cloud data centers host thousands of tenants, each requiring isolated network domains. Virtual machines migrate across racks for load balancing and maintenance. Container orchestration systems like Kubernetes attach and detach virtual network interfaces in milliseconds. All of these operations assume L2 connectivity: the ability to send Ethernet frames between endpoints regardless of their physical location in the data center.
The fat-tree fabric runs IP, which provides L3 reachability but not L2 adjacency. The 802.1Q VLAN mechanism provides L2 isolation but is limited to 4,094 segments and stops at L3 boundaries. The industry needed a way to deliver L2 semantics — Ethernet frame forwarding, broadcast domains, MAC-based addressing — over an L3 IP infrastructure.
3.6.1 The Solution: VXLAN and EVPN
VXLAN (Virtual eXtensible LAN) (Mahalingam et al. 2014) solves the data plane problem. It encapsulates a complete Ethernet frame inside a UDP packet, adding a 24-bit VXLAN Network Identifier (VNI) that identifies the virtual L2 segment. The encapsulated frame travels across the IP fabric as a normal UDP packet, routed by standard L3 forwarding. At the destination switch (the VXLAN Tunnel Endpoint, or VTEP), the outer headers are stripped and the inner Ethernet frame is delivered to the destination host.
VXLAN’s 24-bit VNI supports \(2^{24} \approx 16\) million virtual segments, obliterating the 4,094-VLAN limit. Each tenant gets its own VNI. VMs in the same VNI can communicate via L2 regardless of their physical location. VMs in different VNIs are completely isolated.
EVPN (Ethernet VPN) (Sajassi et al. 2015) solves the control plane problem. In the original VXLAN specification, VTEPs discovered remote MAC addresses through flood-and-learn: when a switch received a frame for an unknown destination, it flooded the encapsulated frame to all VTEPs. This replicated the inefficiency of L2 flooding — broadcast storms in an overlay network.
EVPN replaces flood-and-learn with BGP-based MAC/IP advertisement. When a host connects to a VTEP, the VTEP advertises the host’s MAC and IP addresses to all other VTEPs using BGP EVPN route types. Remote VTEPs learn MAC-to-VTEP mappings from BGP updates, eliminating the need to flood. ARP requests can be answered locally by the VTEP (ARP suppression), further reducing broadcast traffic. Multi-tenancy is handled through Route Distinguishers (RDs) and Route Targets (RTs): each tenant’s MAC/IP routes carry a unique RD so they remain distinct within BGP, and RTs control which VTEPs import which tenant’s routes. These are the coordination primitives that make EVPN work for cloud providers hosting thousands of isolated tenants on shared physical infrastructure.
VXLAN/EVPN applied disaggregation by separating the L2 forwarding plane (Ethernet frames) from the L2 control plane (MAC learning) and from the physical transport (IP routing). Three independent layers, each optimized separately: the underlay (IP fabric with ECMP) handles packet delivery; the overlay (VXLAN) handles L2 encapsulation; the control plane (BGP EVPN) handles endpoint discovery.
VXLAN/EVPN applied decision placement as centralized control with distributed forwarding. BGP route reflectors distribute MAC/IP bindings to all VTEPs. Each VTEP makes local forwarding decisions based on these bindings. This mirrors the routing architecture: centralized information distribution, distributed packet forwarding.
3.6.2 Invariant Analysis: VXLAN/EVPN (2014)
| Invariant | VXLAN/EVPN Answer (2014) | Gap? |
|---|---|---|
| State | BGP-distributed MAC/IP bindings: global, verified | Control plane overhead scales with number of hosts |
| Time | BGP update propagation: seconds | Slower than hardware-based L2 learning for local events |
| Coordination | BGP route reflectors: centralized distribution | Route reflector is a single logical point of failure |
| Interface | 24-bit VNI (16M segments); MAC-in-UDP encapsulation | Encapsulation overhead: 50-byte VXLAN header per frame |
The Interface answer resolves the 4,094-VLAN bottleneck that persisted through Acts 3 and 4. The State answer is the most significant improvement: instead of per-switch flood-and-learn (local, unverified, slow), every VTEP has a BGP-distributed global view of all MAC/IP bindings. The Coordination answer mirrors BGP’s architecture in the WAN: route reflectors aggregate and distribute information, individual routers make local decisions.
The remaining gaps are operational rather than architectural. VXLAN header overhead (50 bytes9 per frame) reduces the effective MTU — jumbo frames (9,000 bytes) absorb this cost, but standard 1,500-byte frames lose over 3% of payload capacity. BGP EVPN control plane scales linearly with the number of hosts, which can reach hundreds of thousands in a large data center.
3.6.3 Environment → Measurement → Belief After the Fix
| Layer | What VXLAN/EVPN Has | What’s Missing |
|---|---|---|
| Environment | All hosts, their MAC/IP addresses, their VTEP locations | — |
| Measurement | BGP advertisements: MAC, IP, VNI, VTEP IP | Real-time host liveness — must wait for BGP withdrawal on failure |
| Belief | Global MAC/IP → VTEP mapping, per-VNI isolation | Assumes BGP is timely; stale entries possible during convergence |
3.7 Act 6: Access Technologies — DOCSIS and PON
The wired link layer story extends beyond LANs and data centers to access networks — the “last mile” connecting subscribers to their service providers. Two dominant technologies share the wire (or fiber) among multiple subscribers using fundamentally different physical media.
3.7.1 DOCSIS: Broadband Over Cable
DOCSIS (Data Over Cable Service Interface Specification) (CableLabs 1997) delivers broadband over the hybrid fiber-coax (HFC) plants originally built for cable television. The downstream direction uses frequency-division multiplexing: the cable operator allocates specific frequency channels for data, separate from television channels. The upstream direction faces a coordination challenge: multiple cable modems share a single coaxial segment back to the CMTS (Cable Modem Termination System).
DOCSIS upstream uses a hybrid contention + scheduled model, not pure TDMA. Modems compete using a contention-based request phase (similar to ALOHA with exponential backoff) to request upstream bandwidth. The CMTS receives these requests and responds with explicit grants — scheduled time slots for actual data transmission. Contention exists for the request, but data delivery is collision-free. This two-phase design balances responsiveness (modems request immediately when data arrives) with efficiency (data slots are non-overlapping). The cable plant is highly asymmetric: up to 1.2 GHz of downstream spectrum vs. 5–204 MHz upstream, yielding a ~10:1 bandwidth ratio favoring download. Active electronics (amplifiers) in the field maintain signal strength over distances up to 100 miles.
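The grant phase of the two-phase model above can be sketched as a capacity-limited scheduler. Real CMTS schedulers weigh service flows and QoS classes; this first-come-first-served version, with illustrative names, shows only the non-overlapping-grants property:

```python
def schedule_upstream(requests: list[tuple[str, int]],
                      slots_per_map: int) -> dict[str, int]:
    """CMTS grant allocation: modems have already contended to submit
    (modem_id, minislots_wanted) requests; the CMTS issues explicit,
    non-overlapping grants up to the MAP's capacity, so the data phase
    is collision-free even though the request phase was not."""
    grants: dict[str, int] = {}
    remaining = slots_per_map
    for modem, wanted in requests:
        granted = min(wanted, remaining)
        if granted:
            grants[modem] = granted
            remaining -= granted
    return grants
```

A modem whose request loses the contention phase simply backs off and re-requests; only the small request messages ever risk collision, never the data.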
DOCSIS has evolved through five generations: 1.0 (1997, 40 Mbps downstream), 2.0 (2001, 30 Mbps upstream), 3.0 (2006, channel bonding for 1 Gbps), 3.1 (2013, OFDM for 10 Gbps downstream) (CableLabs 2013), and 4.0 (2020, 10 Gbps downstream and 6 Gbps upstream). Each generation applied closed-loop reasoning by tightening the feedback between the CMTS and modems: more precise timing, finer-grained scheduling, and adaptive modulation based on measured channel quality.
3.7.3 Invariant Analysis: Access Networks (DOCSIS and PON)
| Invariant | DOCSIS Answer | PON Answer |
|---|---|---|
| State | CMTS tracks per-modem SNR, queue depth, grant schedule | OLT tracks per-ONU distance (ranging), queue depth, burst schedule |
| Time | Contention slots + scheduled grants; MAP messages every ~2ms | 125-microsecond framing cycle; DBA (Dynamic Bandwidth Allocation) per cycle |
| Coordination | Hybrid: contention for requests, centralized for data | Fully centralized: OLT schedules all upstream access |
| Interface | DOCSIS MAC frames over RF; CMTS as L2/L3 gateway | GEM (GPON Encapsulation Method) frames; OLT as L2 gateway |
The key difference is in the Coordination invariant: DOCSIS retains contention for the request phase (inheriting from its cable-TV broadcast origins), while PON eliminates contention entirely through OLT scheduling. Both converge on centralized data-plane scheduling, but DOCSIS carries a residual collision risk that PON avoids.
3.7.4 Environment → Measurement → Belief: Access Networks
| Layer | What the Access Network Has | What’s Missing |
|---|---|---|
| Environment | True subscriber demand, channel conditions per modem/ONU | — |
| Measurement | CMTS/OLT: per-subscriber SNR reports, queue requests | Per-application demand; subscriber intent |
| Belief | “Subscriber X needs Y bandwidth this cycle” | Long-term demand prediction; no awareness of application type |
The E→M gap is accidentally noisy: both CMTS and OLT receive honest per-subscriber measurements (SNR, buffer occupancy), but the measurements lag real-time demand by one scheduling cycle. Better prediction algorithms (anticipating demand patterns) would improve utilization, but the fundamental mechanism is sound.
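One hedged sketch of such a prediction: grant against a smoothed estimate of demand rather than the last report alone. The EWMA below is a generic technique, not anything specified by DOCSIS or GPON, and the smoothing factor `ALPHA` is an arbitrary illustrative choice.

```python
# Hedged sketch: narrowing the one-cycle measurement lag by granting on a
# short-horizon prediction. ALPHA is illustrative, not from any spec.

ALPHA = 0.3

def predict_next_demand(history, prev_estimate=0.0):
    """Exponentially weighted moving average over per-cycle queue reports."""
    estimate = prev_estimate
    for report in history:
        estimate = ALPHA * report + (1 - ALPHA) * estimate
    return estimate

# A subscriber ramping up: the estimate trails the raw reports but tracks
# the trend, so grants arrive closer to actual demand.
reports = [0, 0, 1000, 2000, 4000, 8000]
est = predict_next_demand(reports)
assert 0 < est < 8000
```

The estimate still lags a sharp burst — closing the gap entirely would require the subscriber to signal intent, which is exactly the "What's Missing" column above.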
Both DOCSIS and PON illustrate how the binding constraint shapes the coordination answer. When one entity (the service provider) controls the medium and the other entities (subscribers) have no need to communicate directly, centralized scheduling is the natural and efficient solution. Distributed contention (CSMA/CD) solves a different problem: peer stations with equal authority and no coordinator.
3.9 Generative Exercises
A switched Ethernet network uses a switch with a MAC address table that can hold 8,192 entries. An attacker sends frames with randomly generated source MAC addresses at a rate of 1,000 frames per second.
Using the four invariants: (a) Which invariant does this attack exploit? (b) What happens to the switch’s forwarding behavior when the MAC table is full? (c) How does this convert a switched network back into a shared network? (d) Propose a defense mechanism and identify which design principle it applies.
A data center runs VXLAN for L2 overlay but uses the original flood-and-learn mechanism instead of EVPN. The data center has 500 VTEPs and 50,000 virtual machines.
Using the E-M-B decomposition: (a) What is the Environment, Measurement, and Belief at each VTEP? (b) Where is the E→M gap, and what type is it (accidentally noisy, physically limited, or structurally filtered)? (c) Calculate the broadcast amplification factor when a single ARP request is flooded to all 500 VTEPs. (d) Explain how EVPN closes this gap and identify which invariant it fixes.
A network architect proposes using STP in a 48-switch data center arranged as a fat-tree topology. Each switch has 48 ports, and there are 576 inter-switch links.
Using the dependency chain: (a) How many links will STP block? (b) What fraction of the total inter-switch bandwidth is wasted? (c) What is the maximum convergence time if the root bridge fails? (d) The architect argues that RSTP solves the convergence problem. Explain why faster convergence leaves the bandwidth problem intact — identify which invariant RSTP fixes and which it leaves broken.
The ratio of propagation delay to frame time, \(a\), must be much less than 1 for CSMA/CD to work efficiently. At 2.94 Mbit/s and 1 km, \(a \approx 0.2\). When the DIX consortium raised the speed to 10 Mbps, they shortened the maximum cable segment to 500 m and imposed a 64-byte minimum frame size to preserve \(a < 1\). Every Ethernet speed increase since has required either shorter cables or longer minimum frames — the same physics constraint, applied repeatedly.↩︎
The consortium chose 48-bit MAC addresses with a hierarchical structure: the first 24 bits are an Organizationally Unique Identifier (OUI) assigned by IEEE to each manufacturer; the remaining 24 bits are assigned by the manufacturer. This yields \(2^{48} \approx 281\) trillion globally unique addresses — enough that every network interface card ever manufactured receives a unique address with no coordination beyond the OUI registry. The 48-bit choice was an Interface invariant decision: make the address space large enough that it never needs revisiting. It succeeded — 48-bit MACs remain unchanged 45 years later.↩︎
At 10 Mbps, a 64-byte (512-bit) frame takes 512 / 10^7 = 51.2 µs. This equals the round-trip propagation time across a maximum 2.5 km cable segment (2500 m / 200 m/µs = 12.5 µs one-way, ~25 µs round-trip, plus repeater and transceiver delays). The minimum frame size and maximum cable length are locked together through this constraint.↩︎
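The arithmetic in this footnote can be checked directly. The helper below uses the ~200 m/µs propagation speed quoted above; the function name is ours, not from any standard.

```python
# Checking the frame-size/cable-length coupling from the footnotes.
# Assumes ~200 m/µs propagation speed in coax, as quoted above.

def min_frame_bits(rate_bps, cable_m, prop_m_per_us=200.0):
    """Bits a transmitter must still be sending when a worst-case
    collision signal returns from the far end of the cable."""
    round_trip_us = 2 * cable_m / prop_m_per_us
    return rate_bps * round_trip_us / 1e6

# 10 Mbps over 2.5 km: propagation alone demands >= 250 bits. The actual
# 512-bit (64-byte) minimum leaves budget for repeater and transceiver
# delays and the jam signal, as the footnote notes.
assert min_frame_bits(10e6, 2500) == 250.0
```

The same function applied to the experimental parameters (2.94 Mbit/s, 1 km) gives about 29 bits — why the original design had ample margin.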
The cap at 10 doublings (2^10 - 1 = 1023 slots) limits maximum backoff to ~52 ms. The discard after 16 collisions is an engineering choice: beyond 16 consecutive collisions, the probability of successful delivery is negligible and the frame is better retried by a higher layer.↩︎
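The truncated backoff rule reads naturally as code. This is a sketch of the algorithm as just described, using the 51.2 µs slot time from the 10 Mbps parameters; the function name and exception are ours.

```python
import random

# Sketch of truncated binary exponential backoff as described above.
# Slot time of 51.2 µs assumed, per the 10 Mbps parameters.

SLOT_US = 51.2

def backoff_slots(attempt):
    """After the nth consecutive collision, wait a random number of slots
    drawn from [0, 2^min(n,10) - 1]; give up after 16 attempts."""
    if attempt > 16:
        raise RuntimeError("frame discarded: 16 consecutive collisions")
    k = min(attempt, 10)
    return random.randrange(2 ** k)

# Worst-case wait: 1023 slots ≈ 52.4 ms, matching the figure above.
max_wait_ms = (2 ** 10 - 1) * SLOT_US / 1000
assert abs(max_wait_ms - 52.4) < 0.1
```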
The factor 5 arises from Metcalfe’s analysis assuming the average collision involves about 2.5 stations and the jam signal adds overhead. See Metcalfe and Boggs (1976) for the full derivation.↩︎
The 2-second hello timer is a conservative default, chosen to balance timely failure detection against control-plane bandwidth consumption. Faster hellos detect failures sooner but increase BPDU traffic.↩︎
Max-age = 20s (10 × hello interval) allows for up to 10 missed BPDUs before declaring a link dead. The 15s listening + 15s learning states prevent transient loops during reconvergence by ensuring the bridge has heard from all neighbors before forwarding.↩︎
The 12-bit VLAN ID occupies a 4-byte 802.1Q tag inserted between the source MAC and Type fields. Two values are reserved (0 and 4095), leaving 4,094 usable VLANs.↩︎
Outer Ethernet (14 bytes) + outer IP (20 bytes) + outer UDP (8 bytes) + VXLAN header (8 bytes) = 50 bytes of encapsulation overhead per frame.↩︎
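A quick check of this arithmetic, plus its operational consequence: a full-size 1500-byte inner payload needs a 1550-byte underlay IP packet, which is why VXLAN fabrics typically run jumbo-frame MTUs on the physical network.

```python
# The footnote's 50-byte encapsulation overhead, worked out, and the
# underlay MTU it implies for a standard 1500-byte inner payload.

OUTER_ETH, OUTER_IP, OUTER_UDP, VXLAN_HDR = 14, 20, 8, 8
INNER_ETH = 14  # the encapsulated frame's own Ethernet header

overhead = OUTER_ETH + OUTER_IP + OUTER_UDP + VXLAN_HDR
assert overhead == 50  # matches the footnote

# Underlay IP MTU needed to carry a 1500-byte inner payload without
# fragmentation (outer Ethernet header is not counted against IP MTU):
required_underlay_mtu = 1500 + INNER_ETH + VXLAN_HDR + OUTER_UDP + OUTER_IP
assert required_underlay_mtu == 1550
```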
125 µs = 1/8000 seconds, inherited from the 8 kHz sampling rate of the telephone network. This framing period predates GPON — it was already the standard TDM frame period in SONET/SDH.↩︎