Achieving DC-Class Resiliency and “Headless” Recovery
In the second part of this technical blog series, I talk about how forwarding state in modern OpenFlow is communicated from SDN controllers down to switches. While this level of detail is well beyond what a typical administrator needs to understand about networking products, it is key to SDN’s scalability and reliability. Moreover, this issue gets to the heart of one of SDN’s critical value propositions: the single logical point of management. While an SDN network presents administrators with a single, “logically centralized” view of the network, under the covers lurk some hard distributed-systems problems. In this article, I talk about how the solutions to these problems differ between the original academic OpenFlow implementations we built at Stanford and the production-grade OpenFlow that Big Switch Networks ships today. I’ll end by relating the discussion to “headless” mode and showing that SDN networks can continue running even when all controllers have failed.
Early-Days Academic OpenFlow Was Reactive
OpenFlow gives programmers an unprecedented degree of control over low-level packet-forwarding memory, in a way that earlier protocols did not. Given this control, researchers (myself included) cooked up all sorts of new ways of managing forwarding memory and, as a result, all aspects of network packet forwarding. One of the most popular and simplest approaches was to manage forwarding memory “reactively”, i.e., to treat it like a cache. Initially, the memory contains no forwarding rules and, much like a cache miss, the controller populates it dynamically as packets arrive. In practice, this meant that any packet that did not match a forwarding rule on the switch was sent to the controller, and the controller reactively installed the corresponding forwarding rules back into the switch. Subsequent packets then matched the newly populated forwarding entry and were transmitted at full line rate.
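The cache-like behavior can be sketched in a few lines of Python. This is an illustrative model, not a real OpenFlow API: the `Controller` and `Switch` classes, the `routes` policy, and the addresses are all hypothetical, but the miss-then-install pattern mirrors how early reactive deployments worked.

```python
class Controller:
    """Computes the forwarding rule for a flow when a switch reports a table miss."""
    def __init__(self, routes):
        self.routes = routes  # hypothetical policy: destination -> output port

    def packet_in(self, dst):
        # On a miss, the controller "reactively" returns the rule to install.
        return self.routes.get(dst, "drop")

class Switch:
    def __init__(self, controller):
        self.flow_table = {}         # forwarding memory, initially empty
        self.controller = controller
        self.misses = 0              # each miss is a controller round trip

    def forward(self, dst):
        if dst not in self.flow_table:              # "cache miss"
            self.misses += 1
            self.flow_table[dst] = self.controller.packet_in(dst)
        return self.flow_table[dst]                 # subsequent packets hit at line rate

ctl = Controller({"10.0.0.2": "port1", "10.0.0.3": "port2"})
sw = Switch(ctl)
print(sw.forward("10.0.0.2"))   # miss: controller consulted -> port1
print(sw.forward("10.0.0.2"))   # hit: no controller round trip
print(sw.misses)                # 1
```

The first packet of every flow pays the controller round trip; only repeats are fast. That round trip is exactly the bottleneck discussed next.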
While this was simple to program and enabled a number of interesting flow-based use cases, reactive flow control also created significant scaling problems. The network or bus between the data plane and the control plane is a serious bottleneck and can handle only a small fraction of the total data-plane traffic. As I mentioned in my last blog (http://bigswitch.com/blog/2014/05/02/modern-openflow-and-sdn), reactive flow control was an effort to work around the limitations of early OpenFlow implementations, not an inherent aspect of the protocol itself. Because the forwarding memory exposed by those early implementations was so small, SDN controllers could not program all necessary forwarding rules at once and were in effect forced to page them in dynamically. Much as page faults and swapping hurt performance and scale in the compute world, the same was true in networking. The reactive model ran into bottlenecks both inside the switch -- between the ASIC and the switch’s CPU -- and on the controller itself. As a result, the reactive approach neither scaled nor was resilient enough for production networks.
Modern, Production OpenFlow is Proactive
By design, this reactive, flow-based way of programming forwarding memory was never an inherent part of OpenFlow. Much like traditional networking, modern productized versions of OpenFlow proactively write forwarding rules into the corresponding forwarding memory. That is, as soon as a policy is configured (e.g., an ACL is added from the CLI) or a network change occurs (e.g., a link comes up), the controller proactively updates all of the necessary switches and forwarding memory with the new forwarding rules. As a result, common-case packet forwarding does not involve the controller at all, and the data-plane/control-plane channel is no longer a bottleneck.
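The proactive pattern inverts the sketch above: rules are pushed on configuration events, not on packet arrival. Again a hypothetical model, not a vendor API; the class names, the ACL match tuple, and the switch names are invented for illustration.

```python
class Switch:
    def __init__(self, name):
        self.name = name
        self.flow_table = {}  # stands in for hardware forwarding memory

class ProactiveController:
    def __init__(self, switches):
        self.switches = switches

    def on_policy_change(self, match, action):
        # e.g., an ACL added from the CLI: push the rule to every switch now,
        # before any matching packet arrives
        for sw in self.switches:
            sw.flow_table[match] = action

switches = [Switch("leaf1"), Switch("leaf2"), Switch("spine1")]
ctl = ProactiveController(switches)
ctl.on_policy_change(("10.0.1.0/24", "tcp/22"), "deny")

# Every switch can now enforce the policy at line rate, with no
# controller involvement on the packet path.
print(all(("10.0.1.0/24", "tcp/22") in sw.flow_table for sw in switches))  # True
```

Note that the controller does work only at configuration time; the per-packet path never leaves the switch.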
Proactive forwarding is only feasible with switches that expose multiple tables of forwarding memory. In terms of the cache analogy above, multiple tables increase the amount and flexibility of forwarding memory, making it possible to pre-cache all possible forwarding rules rather than dynamically page them in. The only messages sent from the data plane to the controller are control messages (e.g., an LACP packet) or actual faults (e.g., a link-down message). As a result, the proactive approach scales to the size of modern hyperscale networks.
Improved Reliability with Headless Mode
In my time working with SDN, I’ve been asked countless times: “Isn’t the controller a single point of failure?” The answer is of course no: much as traditional chassis-based switches have redundant supervisor modules, SDN has redundant controllers. When the active controller goes down, the standby controller takes over, and packets continue flowing without loss because the underlying forwarding memory does not need to be updated.
However, as the new kid on the block, it’s not enough for SDN to be “good enough” on reliability; it has to show value and be better. To that end, modern SDN implementations support “headless” mode: even if both the primary and the backup controller go down, all packet forwarding continues without loss. This is a direct result of proactive forwarding. Once the controller has fully populated the forwarding memory of every switch, it is needed only to handle error conditions and changes in policy -- and many error conditions can themselves be handled proactively, e.g., with preconfigured backup paths. So even in this pathological “headless” failure case, where the system has already suffered two or more failures, SDN networks continue to function. We designed the architecture with a loose coupling between the control and data planes: unlike the traditional supervisor/line-card relationship, and closer to the relationship between ESX and vCenter or KVM and Nova, the system continues to function even if the orchestration/control plane has completely failed. This level of fault isolation and resiliency is achievable only because of the hierarchical logical centralization in the controller, and it is not available in even the most “production-ready” traditional networking gear. When was the last time you saw a traditional switch keep forwarding packets after it lost both/all of its supervisor modules?
--Rob Sherwood, Big Switch CTO