This is an exciting time to be in networking. The teams that have led network innovation for the last half decade have been hyperscale data center architects, not the traditional networking vendors, and their innovations are increasingly open to the broader community.
What can we all learn from Facebook’s Altoona network design? Petr Lapukhov, one of the Facebook architects, joined me to author an InfoWorld article on Core and Pod designs before the Altoona details were public. Building on that article, there are a few lessons that the broader enterprise and service provider community can learn from Altoona.
1. Core and Pod Design
Hyperscale data centers, like most data centers, are not built out all at once. Unlike most data centers, however, they are designed with incremental growth over multiple technology cycles in mind. They are divided into two separable areas: a data center core and a series of pods.
Core: The data center network core may range from a pair of large routers to an entire scale-out L3 fabric. The implementation matters less than the long design timeframe (7-10 years) and the core-to-pod technology choices (more below).
Pods: Pods are modular units of compute, storage, and networking that are designed, procured, and automated together as a unit. Most hyperscale data centers introduce a new generation of pod design every 18-24 months, so a large data center may have many instances of pod generation 1 sitting next to pod generation 2 and generation 3. The pod implementation may be a single switching/routing layer, though in more modern designs (like the one implied by our Big Cloud Fabric) it is a full leaf-spine Clos that may be L2, L3, or a hybrid of the two.
Core-Pod Connectivity: In the Altoona blog post and excellent video, architects repeatedly mention that the protocols between core and pod are very, very simple (just L3 / BGP / ECMP). This is critical for enterprise and service provider architects. The technology used in pods will change over the lifetime of the data center. Keeping the core<>pod connectivity requirements as generic as possible ensures design freedom for future pod generations.
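To make the ECMP part of that L3 / BGP / ECMP contract concrete, here is a minimal Python sketch of how a pod-edge device might spread flows across several equal-cost uplinks to the core. It is illustrative only: real switches do this in hardware with vendor-specific hash functions, and the uplink names, addresses, and the `pick_next_hop` helper are all hypothetical.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Pick one equal-cost next hop for a flow (illustrative ECMP sketch).

    flow is a (src_ip, dst_ip, protocol, src_port, dst_port) tuple.
    Hashing the whole tuple keeps every packet of a flow on one path.
    """
    if not next_hops:
        raise ValueError("no next hops available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical uplinks from a pod to the core, all advertised at equal cost.
core_uplinks = ["core-1", "core-2", "core-3", "core-4"]
flow = ("10.0.1.5", "10.0.9.7", "tcp", 49152, 443)

chosen = pick_next_hop(flow, core_uplinks)
print(f"flow {flow} -> {chosen}")
```

Nothing in the sketch depends on what sits inside the pod, which is the point: the core only needs to see routes and equal-cost paths.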
For more on evolving traditional enterprise and service provider networks to Core and Pod designs, we put together a whitepaper here.
2. When It Comes To Pod Sizing, Small Is Beautiful
The pod size for Altoona is only 48 racks. While that may seem enormous to most architects, keep in mind that the previous Facebook four-post design had a pod size of 255 racks. Facebook has increased data center size while reducing pod size. Among the leading enterprise and data center architects I work with regularly, pod sizes of 12-16 racks (and sometimes as few as 4 racks) have often proven adequate.
Blast Radius: While not explicitly discussed in the Altoona post, pods in these designs are synonymous with failure domains, sometimes called the “blast radius” of the design. When thinking through the core<>pod connectivity and pod sizing, a design tenet is that an unexpected, cascading failure be contained within the pod.
Simplified Automation: The architects I work with generally name two contributors to code complexity in network automation above all others: a) variation in topology, and b) testability. By keeping pods small and automating each one as a unit, automation developers know *exactly* which topology the code will assume, as opposed to automating an entire data center or a very large pod whose topology may evolve over time. A small pod is also vastly more practical to replicate in the lab, so automation code can be tested before it runs in production.
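As a rough illustration of what "knowing exactly the topology" buys you, the Python sketch below generates the expected cabling for one small leaf-spine pod from a fixed template and compares it with what was actually observed. The one-leaf-per-rack layout, the 16-leaf / 4-spine sizing, the device naming scheme, and both helper functions are assumptions made for the example, not details from the Altoona design.

```python
from itertools import product


def build_pod_links(pod_id, num_leaves, num_spines):
    """Return every leaf-to-spine link a pod built from the template should have."""
    leaves = [f"pod{pod_id}-leaf{i}" for i in range(1, num_leaves + 1)]
    spines = [f"pod{pod_id}-spine{j}" for j in range(1, num_spines + 1)]
    return list(product(leaves, spines))


def diff_pod(pod_id, num_leaves, num_spines, observed_links):
    """Compare observed links with the template; return (missing, unexpected)."""
    expected = set(build_pod_links(pod_id, num_leaves, num_spines))
    observed = set(observed_links)
    return expected - observed, observed - expected


# Hypothetical 16-rack pod: one leaf per rack, four spines.
links = build_pod_links(pod_id=1, num_leaves=16, num_spines=4)
print(f"expected links in pod 1: {len(links)}")

# Pretend discovery found everything except the first link.
missing, unexpected = diff_pod(1, 16, 4, links[1:])
print(f"missing: {sorted(missing)}, unexpected: {sorted(unexpected)}")
```

Because the template is small and fixed, the same few lines can run against a lab copy of the pod and against production without change.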
3. Simple Hardware Made Highly Resilient
Altoona is the latest in a line of hyperscale data center designs that eschew chassis switches in favor of simple 1RU switches in a fabric formation. The hardware is far simpler (and less expensive), yet with clever use of multi-pathing in the design and automated provisioning/management workflows, the fabric as a whole is highly resilient. The end result is well summarized in their post: “Individual components in a fabric are virtually unimportant, and things that hard-fail don’t require immediate fixing.”
We have found similar reliability in our Big Cloud Fabric product, which uses this same design philosophy. In a recent test using the Hadoop Terasort workload as a benchmark, we were able to fail over the controller every 30 seconds, a switch every 8 seconds, and a link every 4 seconds across a 32 leaf / 6 spine network carrying traffic for 42,000 MAC addresses, without seeing any impact on application performance.
For more details on that test effort, check out the whitepaper we wrote on it here.
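A quick back-of-the-envelope calculation shows why a single failed box matters so little in a fabric like this. The Python sketch below assumes each leaf has one equal-speed uplink to every spine and that multi-pathing spreads traffic evenly across them; both are simplifying assumptions, and the `remaining_uplink_capacity` helper is hypothetical. The six-spine figure matches the test network described above.

```python
from fractions import Fraction


def remaining_uplink_capacity(num_spines, failed_spines):
    """Fraction of each leaf's uplink capacity left after spine failures.

    Assumes one equal-speed uplink from every leaf to every spine.
    """
    if not 0 <= failed_spines <= num_spines:
        raise ValueError("failed_spines must be between 0 and num_spines")
    return Fraction(num_spines - failed_spines, num_spines)


for failed in range(3):
    share = remaining_uplink_capacity(6, failed)
    print(f"{failed} failed spine(s): {float(share):.0%} of leaf uplink capacity remains")
```

Under those assumptions, losing one of six spines leaves each leaf with five-sixths of its uplink capacity, which is why a hard failure can wait for a scheduled fix rather than an emergency one.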
Many enterprise and service provider architects will read about the Altoona design and quickly dismiss it as specific to Facebook, and certainly there are aspects of the design that work better for one massive-scale workload than for traditional enterprise and SP requirements. A 48-rack pod is, for most who read this, far too large to be agile and practical. A scale-out L3 core (in Altoona the “Fabric Spine Planes” and their connection to the Pods) is, for many, overkill for inter-pod traffic. There are simpler approaches that impose fewer constraints on future pod generations when you assume that most workloads will be contained within a single pod. The approach to scaling out the internet-facing edge may be unique to a consumer web company.
If you can look past some of these Facebook-specific aspects, you can see that there are principles embedded here that are diametrically opposed to traditional three-tier data center design yet very much applicable to data centers worldwide: Core and Pod Design, Pod Sizing, and Simple Hardware in Resilient Configurations.
This is an exciting time to be in networking.
-Kyle Forster, Co-Founder