DC operators have been overwhelmed by network complexity over the last 20 years, which is why our sign no. 4 focuses on Network Automation. Because traditional DC networks operated on a box-by-box basis, networking tasks had to be performed with great care by skilled personnel, causing long delays. Also, because the network is the lifeline of applications, it’s critical to avoid a network outage, which would impact a large set of applications. These factors made the network the slowest path in the data center, and hence a roadblock to data center innovation and digital transformation.
Cloud Giants of course had to address these challenges, given that they need service velocity and uptime. They forced networking vendors to provide programmable interfaces, and invested in internal software teams to build controller software that automated many of the human tasks. They have proven that network infrastructure can operate at the speed of VMs and containers. Not only are Cloud Giants able to rapidly scale out their own applications, they are also able to offer network-as-a-service to end users (enterprise IT organizations).
As Google’s Site Reliability Engineering (SRE) practices point out, the value of automation is multi-fold: consistency, faster actions, faster repairs, time savings, etc. In a recent talk, Google mentioned handling 30,000 config changes in a ~10,000 switch data center fabric. Moreover, 70% of failures happen while a (change-)management operation, such as an upgrade, rollout or configuration change, is underway (see diagram below).
While automation and simplification are often used interchangeably, it is important to focus on complexities related to a variety of networking tasks and highlight how automation addresses these challenges. Specifically:
- Installation Complexities: Adding leaf/spine switches, ensuring cabling correctness, fabric formation, initial switch on-boarding and configuration, MLAG configuration for server connectivity to leaf switches, admin access control, etc.
- Operational Complexities: Troubleshooting packet drops, network congestion, network outages, application connectivity issues, application performance, software upgrades, failure management, etc.
- App Provisioning Complexities: Adding/removing compute hosts, adding/removing application endpoints (VMs, containers), network policies (L2/L3), security policies, etc.
IT research analysts have also been pointing out these customer pain points to networking vendors. For example, Gartner’s report "No More Box Hugging: Network Vendors Must Drive NetOps Innovation to Remain Relevant" points out that customers’ top two challenges are: we are trapped in a 1999 mindset, and we just can’t make network changes fast enough.
How do you address installation complexities?
There are two options for addressing these complexities through network automation:
- Internal automation tools such as Chef, Puppet, Ansible: NetDevOps-centric teams leverage these tools to address certain of these complexities, though programming skills are required
- Policy controller, typically provided by the network supplier: the benefit here is that no programming skills are required
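To give a feel for the programming effort the first option implies, here is a minimal Python sketch of one installation task: deriving MLAG leaf pairs from a switch inventory. It is not tied to Chef, Puppet, Ansible or any vendor API. The `Leaf` type, the `build_mlag_pairs` function and the rule that each rack holds exactly two leaves are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    """Hypothetical inventory record for one leaf switch."""
    name: str
    rack: str


def build_mlag_pairs(leaves):
    """Group leaves by rack and pair them for MLAG.

    Assumption: every rack should contain exactly two leaf switches.
    Returns (pairs, errors) so the operator can review problems
    before anything is pushed to the fabric.
    """
    by_rack = {}
    for leaf in leaves:
        by_rack.setdefault(leaf.rack, []).append(leaf.name)

    pairs = {}
    errors = []
    for rack, names in sorted(by_rack.items()):
        if len(names) == 2:
            pairs[rack] = tuple(sorted(names))
        else:
            errors.append(f"rack {rack}: expected 2 leaves, found {len(names)}")
    return pairs, errors
```

Even a small helper like this has to be written, tested and maintained by the network team, which is the trade-off against a supplier-provided policy controller.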
What about eliminating operational complexities?
Here a network controller is a must. It can provide built-in tools for fabric tracing, failure conditions, upgrade workflows, etc. It can significantly improve service velocity and network uptime. NetOps teams can manage larger infrastructure with the same resources, and don’t have to spend nights and weekends on menial operational tasks.
Is it possible to automate network changes as applications are deployed?
Certainly. But this requires integration into SDDC/private cloud orchestrators, such as VMware, Nutanix, OpenStack and Containers/Kubernetes. There are two approaches: direct switch integration, and controller integration as shown in the diagram below.
The legacy approach, in which the orchestrator interacts programmatically with each switch individually, tends to be limited because the scope of automation is confined to that particular switch. Also, even in a medium-size fabric (a few dozen switches), API scaling and latency become a major bottleneck, sometimes causing race conditions that lead to network instability.
The cloud-inspired approach is to leverage an external (software) controller for integration with the SDDC orchestrator. This single API interaction leads to predictable and scalable automation. Also, network changes can be automated fabric-wide, instead of being limited to a specific switch, which makes the approach useful in a broad set of provisioning workflows. As VMs/containers are on-boarded, elastically scaled or removed, full fabric-wide L2 and L3 network configurations can be automated, along with server-to-leaf MLAG configurations. No more weeks/months of network delay in on-boarding an application. The network is no longer in the way of application deployments.
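The difference between the two integration styles can be sketched with a toy model in Python. This is an illustration under stated assumptions, not the API of any real controller or orchestrator: `FabricController`, `add_segment` and `direct_call_count` are hypothetical names, and a "segment" simply stands in for an L2 network that must exist on several switches.

```python
class FabricController:
    """Hypothetical controller holding intended state for the whole fabric."""

    def __init__(self, switch_names):
        self.segments_by_switch = {name: set() for name in switch_names}
        self.northbound_calls = 0

    def add_segment(self, segment, switch_names):
        """One call from the orchestrator applies a change across many switches."""
        self.northbound_calls += 1
        unknown = [s for s in switch_names if s not in self.segments_by_switch]
        if unknown:
            # Reject the whole request so the fabric is never half-configured.
            raise ValueError(f"unknown switches: {unknown}")
        for name in switch_names:
            self.segments_by_switch[name].add(segment)


def direct_call_count(num_changes, switches_per_change):
    """Per-switch integration: the orchestrator calls every affected switch."""
    return num_changes * switches_per_change
```

In this model, adding one segment on two leaves costs the orchestrator a single call to the controller, while `direct_call_count(1, 2)` shows it would need two separate switch calls, each of which can fail or arrive out of order on its own.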
Can you trust network automation?
The “trust but verify” rule is necessary when it comes to network automation. There needs to be continuous validation to ensure that automation adheres to provisioning policy during and after its instantiation. Any non-compliance is flagged to the operator for remediation.
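A minimal sketch of such a validation pass, in Python, might compare the intended policy against what the fabric actually reports and list every deviation. The data model (a mapping from switch name to a set of segment names) and the function name `find_violations` are assumptions for illustration only.

```python
def find_violations(intended, observed):
    """Return readable differences between intended policy and observed state.

    Both arguments map a switch name to the set of segments on that switch.
    An empty result means the fabric complies with the policy.
    """
    violations = []
    for switch in sorted(set(intended) | set(observed)):
        want = intended.get(switch, set())
        have = observed.get(switch, set())
        for segment in sorted(want - have):
            violations.append(f"{switch}: missing segment {segment}")
        for segment in sorted(have - want):
            violations.append(f"{switch}: unexpected segment {segment}")
    return violations
```

Running a check like this on a schedule, and again after every automated change, is one way to surface non-compliance to the operator rather than assuming the automation succeeded.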
How does Big Switch enable network automation?
Big Switch’s Big Cloud Fabric (BCF) controller provides built-in automation tools for Day0/Day1/Day2 workflows, to automate installation and operations tasks (see details in the SW Controls blog). A unique advantage of BCF is its integration with SDDC orchestrators: VMware vSphere/NSX/vSAN, Nutanix Enterprise Cloud HCI, Kubernetes and Red Hat OpenStack. Specifically:
- Host add/remove: MLAG to leaf switch automated
- VM/Container add/remove: auto-configure L2 network
- Application connectivity check: auto fabric trace, auto ACL trace
Once BCF automation is enabled by the network admin, the server team can keep adding new servers at will, and the virtualization/cloud team can deploy applications (via VMs or containers) at will. No interaction with the network team is required.
How is Big Switch network automation with SDDC orchestrators helping customers?
- "BCF enabled us to automate the network underlay for our VMware environment, including NSX overlay. BCF decreased TCO by about 90% compared to our legacy fabric solution." -- Senior IT Architect, Large Enterprise Financial Services Company
- "Big Cloud Fabric is a key enabler of our business. We needed a modern, robust solution that would integrate with OpenStack, scale and be free of legacy vendor lock-in." -- Christian Serrasin, Founder, CleanSafeCloud
- "If You have a vSAN cluster, BCF is the best choice for network management. It has opened the network "blackbox" for our vSAN admins." -- Zhen Jia, IT Manager, TianJin Broadcast & TV Network
How much faster with Big Switch network automation?
Over two-thirds of the Big Switch customers surveyed found that network automation delivered 50% to 100% higher speed than legacy box-based networks. Indeed, the DC network CAN operate at the speed of SDDC/cloud!
Care to guess the next sign (hint: security)?
VP & Chief Product Officer
Prashant is responsible for Big Switch's Cloud-First Networking portfolio and strategy, including product management, product marketing, technology partnerships/solutions and technical marketing. Prashant has been instrumental in the product strategy and development of the Big Cloud Fabric and Big Monitoring Fabric products. Additionally, Prashant is responsible for the Big Switch-led open-source initiative, Open Network Linux (ONL), which aims to accelerate adoption of open networking and HW vendor choice. You can connect with Prashant on LinkedIn.