How can Facebook's data center design apply to your data center plans?
Over the past year, Facebook has thrown some interesting wrenches into the gears of the traditional networking industry. While mainstream thinking is to keep most details of your network operations under wraps, Facebook has been freely sharing its innovations. For a company whose business model is built on people sharing personal information, I suppose this makes perfect sense.
What makes even more sense is the return Facebook gets on its openness. Infrastructure VP Jason Taylor estimates that over the past three years Facebook has saved some $2 billion by letting the members of its Open Compute Project have a go at its design specifications.
But what really turned heads was last year's announcement of Wedge, an open top-of-rack switch developed with the OCP community. Wedge was followed eight months later by 6-Pack, a modular version of Wedge purposed for the network core. Added to these bare-metal switches is FBOSS, an open Linux-based network operating system (well, not exactly an operating system – more on that in a later post), and OpenBMC for system management.
This openness matters to the rest of us because none of this is just a mad-science project within Facebook's lair. You can soon buy Wedge through Taiwanese switch manufacturer Accton, bringing switches into your data center for a fraction of the cost of proprietary switches with integrated operating systems. And you're not locked in to running FBOSS on the switch either. You can shop around, choosing the NOS that makes the most sense to you, such as Open Network Linux, Cumulus Linux, Big Switch Switch Light, and possibly others such as Pica8's PicOS or even Juniper's JUNOS. If you have an intrepid team of developers with time on their hands you can even build your own.
I'll write more about open switches and open software in subsequent articles, but for now I want to focus on what Facebook has been sharing about their innovations in data center network design and what it means for you. Last November, between the announcements of Wedge and 6-Pack, Facebook opened its newest data center in Altoona, Iowa. And as it has done with its other network innovations, Facebook openly shared its new design.
It turns out that there are some valuable takeaways from the Altoona design that can be applied to data centers of any size.
Say "hyperscale data center" to most anyone who keeps up with such things, and they'll reflexively name Facebook, Google, and Amazon. And because of this association, people think of hyperscale as something that applies only to mammoth data centers supported by an army of developers.
In reality, hyperscale just means the ability to scale out very rapidly. A hyperscale data center network might be small, but it can grow exponentially larger without changing the fundamental components and structures of the network. You should be able to use the same switches and the same interconnect patterns as you grow – just more of them. You do not need to throw out one class of switches for another just to accommodate growth.
Another misconception about hyperscale data centers is that they are optimized for one or a relatively few applications at massive scale across the entire data center. This stems particularly from the Facebook and Google associations. Hyperscale designs are in fact ideal for very heavy east-west workloads, but hyperscale design principles can apply to an average enterprise data center, supporting hundreds of business applications just as easily as it supports a single social media, big data, or search app.
Hyperscale also conjures up images of do-it-yourself networks built from the silicon up by a cadre of brilliant young architects commanding salaries far out of reach of the average network operator. That might be true of the innovators, but because Facebook has laid its work right out on the table, mere mortals like you and me can put their design principles to work in our own data centers.
To appreciate the significance of the Altoona network, let's first have a look at the network architecture Facebook is using in its earlier data centers.
Good is not good enough: Facebook's cluster design
Figure 1 shows Facebook's pre-Altoona aggregated cluster design, which they call the "4-post" architecture. Up to 255 server cabinets are connected through ToR switches (RSW) to high-density cluster switches (CSW). The RSWs have up to 44 10G downlinks and four or eight 10G uplinks. Four CSWs and their connected RSWs make up a cluster.
Four "FatCat" (FC) aggregation switches interconnect the clusters. Each CSW has a 40G connection to each of the four FCs. An 80G protection ring connects the CSWs within each cluster, and the FCs are connected to a 160G protection ring.
This is a good design in several ways. Redundancy is good; oversubscription is good (generally 10:1 between RSWs and CSWs, 4:1 between CSWs and FCs); the topology is reasonably flat with no routers interconnecting clusters; and growth is managed simply, at least up to the 40G port capacity of the FCs, by adding new clusters.
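The RSW oversubscription figure can be checked with a little arithmetic. The sketch below is mine, not Facebook's: it assumes a fully populated RSW (all 44 downlinks in use) and simply divides total downlink bandwidth by total uplink bandwidth. The function name is hypothetical.

```python
def oversubscription(downlinks: int, down_gbps: float,
                     uplinks: int, up_gbps: float) -> float:
    """Ratio of total downstream bandwidth to total upstream bandwidth."""
    return (downlinks * down_gbps) / (uplinks * up_gbps)


# RSW figures from the text: up to 44 x 10G downlinks, four or eight 10G uplinks.
# Assumption: every downlink is populated, which is the worst case.
rsw_four_uplinks = oversubscription(44, 10, 4, 10)
rsw_eight_uplinks = oversubscription(44, 10, 8, 10)

print(f"RSW with 4 uplinks: {rsw_four_uplinks:.1f}:1")
print(f"RSW with 8 uplinks: {rsw_eight_uplinks:.1f}:1")
```

With four uplinks the worst case works out to 11:1, which is in line with the "generally 10:1" figure for a rack that is not quite full.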
But Facebook found that good is not good enough.
Most of the problems with this architecture stem from the necessity of very large switches for the CSWs and FCs:
- With just four boxes handling all intra-cluster traffic and four boxes handling all inter-cluster traffic, a switch failure has a serious impact. One CSW failure reduces intra-cluster capacity by 25%, and one FC failure reduces inter-cluster capacity by 25%.
- Very large switches restrict vendor choice – there are only a few "big iron" manufacturers. And because these few vendors sell relatively few big boxes, the per-port CapEx and OpEx are disproportionately high when compared to smaller switches offered by a larger number of vendors.
- The proprietary internals of these big switches prevent customization, complicate management, and extend waits for bug fixes to months or even years.
- Large switches tend to have oversubscribed switching fabrics, so all ports cannot be used simultaneously.
- The cluster switches' port densities limit the scale and bandwidth of these topologies, and make transitions to next-generation port speeds too slow.
- Facebook's distributed application creates machine-to-machine traffic that is difficult to manage within an aggregated network design.
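The 25% figure in the first bullet is just one over the number of switches sharing the load. As a rough sketch, assuming traffic is spread evenly across the switches in a tier (a simplification on my part), here is how the share of capacity lost to a single failure shrinks as the tier gets wider. Only the four-switch case comes from the cluster design; the other counts are illustrative.

```python
def capacity_lost_per_failure(switch_count: int) -> float:
    """Fraction of a tier's capacity lost when one of its switches fails.

    Assumes load is balanced evenly across all switches in the tier.
    """
    if switch_count < 1:
        raise ValueError("a tier needs at least one switch")
    return 1 / switch_count


# 4 matches the cluster design; 8, 16, and 48 are hypothetical wider tiers.
for count in (4, 8, 16, 48):
    lost = capacity_lost_per_failure(count)
    print(f"{count:>2} switches: one failure removes {lost:.1%} of capacity")
```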
Altoona: Hyperscale Insights
Altoona's next-generation architecture, then, must fundamentally correct the problems of the cluster architecture while retaining its best features. Specifically:
- Rather than a few large switches, use lots of small switches. That way each switch is responsible for a smaller percentage of the load, and a switch failure takes a smaller bite out of overall capacity.
- Port density is distributed across many switches rather than condensed in one switch, easing transition to higher-bandwidth ports and reducing internal oversubscription.
- The internal switch architecture should be open, non-blocking, and built with merchant silicon, encouraging customization, simplifying management and troubleshooting, and shortening wait times for bug fixes.
- Find a modular unit smaller than a cluster that can be replicated over a wide range of uses, and be economically deployed to all corners of the data center.
- Reduce capital and operational expense.
- And, of course, accommodate any rate of growth quickly, simply, and cheaply.
What Facebook came up with is a disaggregated core-and-pod design that creates a single high-performance fabric spanning the entire data center. The pod, the basic building block (or standard "unit of network," as Facebook calls it), consists of 48 ToR switches connected via 40G uplinks to four fabric switches. Looking at this topology in Figure 2, you can readily recognize a folded 3-stage Clos fabric – or taking away the geekspeak, a leaf-and-spine topology. Rather than the hundreds of server racks in the cluster design, these pods each contain only 48 server racks. As a result both ToR and fabric switches can remain relatively small in terms of port density. And assuming that each ToR switch has 48 10G downlinks, the pod has 3:1 oversubscription – a notable improvement over the 10:1 cluster oversubscription.
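The 3:1 figure is easy to reproduce. This is a minimal sketch under the same assumption the paragraph makes (48 10G downlinks per ToR switch), with one 40G uplink from each ToR to each of the pod's four fabric switches. The constant names are mine.

```python
# Per-ToR figures: the downlink count is the assumption stated in the text,
# and the uplinks are one 40G link to each of the four fabric switches.
TOR_DOWNLINKS = 48
DOWNLINK_GBPS = 10
TOR_UPLINKS = 4
UPLINK_GBPS = 40

tor_down_gbps = TOR_DOWNLINKS * DOWNLINK_GBPS  # 480G toward the servers
tor_up_gbps = TOR_UPLINKS * UPLINK_GBPS        # 160G toward the fabric
pod_ratio = tor_down_gbps / tor_up_gbps

print(f"ToR downlink capacity: {tor_down_gbps}G")
print(f"ToR uplink capacity:   {tor_up_gbps}G")
print(f"Pod oversubscription:  {pod_ratio:.0f}:1")
```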
The individual pods are connected via 40G uplinks to four spine planes, as shown in Figure 3. Each spine plane can have up to 48 switches. Key to this topology is that the fabric switches each have an equal number of 40G downlinks and uplinks – maxing out at 48 down and 48 up – so the fabric is non-blocking and there is no oversubscription between pods. Bisection bandwidth, running to multiple petabits, is consistent throughout the data center.
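To see why the fabric switches are non-blocking, compare what goes in with what comes out. The sketch below uses the maximum figures from the text (48 ToR-facing 40G downlinks and 48 spine-facing 40G uplinks per fabric switch, four fabric switches per pod); the variable names are my own, and a partially built-out fabric would have fewer uplinks.

```python
LINK_GBPS = 40
TORS_PER_POD = 48                   # one 40G downlink per ToR on each fabric switch
FABRIC_SWITCHES_PER_POD = 4
MAX_UPLINKS_PER_FABRIC_SWITCH = 48  # the fully built-out case

downlink_gbps = TORS_PER_POD * LINK_GBPS
uplink_gbps = MAX_UPLINKS_PER_FABRIC_SWITCH * LINK_GBPS
ratio = downlink_gbps / uplink_gbps  # 1.0 means no oversubscription

# Number of fabric-to-spine links a single fully built-out pod needs.
pod_uplinks = FABRIC_SWITCHES_PER_POD * MAX_UPLINKS_PER_FABRIC_SWITCH
pod_uplink_tbps = pod_uplinks * LINK_GBPS / 1000

print(f"Fabric switch: {downlink_gbps}G down, {uplink_gbps}G up ({ratio:.0f}:1)")
print(f"Per pod: {pod_uplinks} uplinks, {pod_uplink_tbps:.2f} Tbps toward the spine")
```

The per-pod link count also gives a feel for how much cabling and optics a full build-out involves.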
The diagram in Figure 3 shows the color-coded connections between fabric switches and their corresponding spine planes, but doesn't do justice to how it all ties together. And something that surely strikes you is that there are a lot of links between fabric switches and spine switches. Optics and cables can become expensive, so it's important to manage the distances between pods and spine planes. (If you're interested in learning more about Facebook's architectures, here are the source documents I used for cluster architecture (PDF) and the Altoona architecture.)
If you rotate the pods and line them up, the way the 48 racks of each pod would be arranged into rows in the data center, and then do the same with the spine planes – but lining them up perpendicular to the pods – you get the three-dimensional diagram shown in Figure 4, with the fabric switches becoming part of the spine planes. Distances between fabric switches and spine switches are reduced. Note that there are also edge pods, which provide external connectivity to the fabric.
Facebook network engineer Alexey Andreyev describes the fabric this way: "This highly modular design allows us to quickly scale capacity in any dimension, within a simple and uniform framework. When we need more compute capacity, we add server pods. When we need more intra-fabric network capacity, we add spine switches on all planes. When we need more extra-fabric connectivity, we add edge pods or scale uplinks on the existing edge switches."
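The two main scaling dimensions Andreyev describes can be modeled in a few lines. This is a toy sketch, not Facebook's tooling: the `Fabric` class and its methods are hypothetical, edge pods are left out, and it assumes each fabric switch has one 40G uplink to every spine switch in its plane. The starting count of 12 spine switches per plane is an arbitrary illustration.

```python
from dataclasses import dataclass


@dataclass
class Fabric:
    """Toy model of a core-and-pod fabric (hypothetical, for illustration)."""
    pods: int = 1
    spine_switches_per_plane: int = 12      # arbitrary starting point
    planes: int = 4                         # from the text
    racks_per_pod: int = 48                 # from the text
    link_gbps: int = 40                     # from the text
    max_spine_switches_per_plane: int = 48  # from the text

    def add_pods(self, count: int) -> None:
        """More compute capacity: add server pods."""
        self.pods += count

    def add_spine_switches(self, per_plane: int) -> None:
        """More intra-fabric capacity: add spine switches on all planes."""
        new_total = self.spine_switches_per_plane + per_plane
        if new_total > self.max_spine_switches_per_plane:
            raise ValueError("spine plane is full")
        self.spine_switches_per_plane = new_total

    @property
    def racks(self) -> int:
        return self.pods * self.racks_per_pod

    @property
    def uplink_gbps_per_pod(self) -> int:
        # Assumption: one uplink per spine switch, on each of the planes.
        return self.planes * self.spine_switches_per_plane * self.link_gbps


fabric = Fabric()
fabric.add_pods(3)             # grow compute
fabric.add_spine_switches(12)  # grow inter-pod bandwidth
print(fabric.racks, "racks,", fabric.uplink_gbps_per_pod, "Gbps of uplink per pod")
```

Note that neither operation changes the type of switch or the wiring pattern; each just adds more of the same.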
If you want to hear Andreyev describe the Altoona architecture himself, here's an excellent video:
You might be wondering by now what any of this has to do with you and your data center. After all, Facebook is supporting more or less a single distributed application generating machine-to-machine traffic spanning its entire data center. You probably aren't. And while a 48-rack pod is a scale-down from their earlier clusters, most enterprise data centers in their entirety are smaller than 48 server racks.
So why should you care? Because it's not the scale. It's the scalability.
The fundamental takeaways from the Altoona design are the advantages of building your data center network using small open switches, in an architecture that enables you to scale to any size without changing the basic building blocks. First look at the switches. You don't have to wait for Wedge or 6-Pack to go on the market (Accton will be selling Wedge soon). You can pick up bare-metal switches from Accton, Quanta, Celestica, Dell, and others for a fraction of the cost a big-name vendor will charge. For example, a Quanta switch with 32 40G ports lists for $7,495. A Juniper QFX5100 with 24 40G ports lists for a little under $30,000. Is that a fair comparison? That JUNOS premium gives you a pretty awesome operating system, but the bare-metal switch gives you a bunch of options for loading an OS of your choice.
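To put those list prices on a per-port footing, here is a quick calculation. It is a rough sketch: the Juniper price is rounded up to $30,000 since the text says only "a little under," and it compares hardware list prices alone, ignoring the cost of a NOS license, support, or discounts.

```python
# List prices and port counts quoted above.
QUANTA_PRICE_USD = 7_495
QUANTA_40G_PORTS = 32
JUNIPER_PRICE_USD = 30_000  # assumption: rounded up from "a little under $30,000"
JUNIPER_40G_PORTS = 24

quanta_per_port = QUANTA_PRICE_USD / QUANTA_40G_PORTS
juniper_per_port = JUNIPER_PRICE_USD / JUNIPER_40G_PORTS
multiple = juniper_per_port / quanta_per_port

print(f"Quanta:  ${quanta_per_port:,.2f} per 40G port")
print(f"Juniper: ${juniper_per_port:,.2f} per 40G port")
print(f"Roughly {multiple:.1f}x per port, before adding an OS to the bare-metal box")
```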
As for the pod and core design, that can be adjusted to your own needs. The pod can be whatever size you want; while the "unit of network" is a wonderful concept, it's not a rule. You can create a number of pod designs to fit specific workflow needs, or just to start a migration away from older architectures. Pods can also be application specific. As your data center network grows, or you adopt newer technologies, you can non-disruptively "plug in" new pods.
The same goes for the core part. You can build it at layer 2, or at layer 3. It all depends on the workflows you're supporting. Using a simple pod and core design you can manageably grow your data center network at whatever rate makes sense to you, from a new pod every few years to an explosive growth of new pods every few months.