Hands on with Fedora, KVM and Cumulus VX

Cumulus Linux is a network operating system based on Debian that runs on top of industry standard networking hardware. By providing a software-only solution, Cumulus is enabling disaggregation of data center switches similar to the x86 server hardware/software disaggregation. In addition to the networking features you would expect from a network operating system like L2 bridging, Spanning Tree Protocol, LLDP, bonding/LAG, L3 routing, and so on, it enables users to take advantage of the latest Linux applications and automation tools, which is in my opinion its true power.

Cumulus VX is a community-supported virtual appliance that enables network engineers to preview and test Cumulus Networks technology. The appliance is available in different formats (for VMware, VirtualBox, KVM, and Vagrant environments), and since I am running Fedora on my laptop the easiest thing for me was to use the KVM qcow2 image to try it out.

My goal is to build a four node leaf/spine topology. To form the fabric, each leaf will be connected to each spine, so we will end up with two “fabric facing” interfaces on each switch. In addition, I want to have a separate management interface on each device I can use for SSH access as well as automation purposes (Ansible being an immediate suspect), and a loopback interface to be used as the router-id.

base_topology

Prerequisites

  • Install KVM and related virtualization packages. I am running Fedora 22 and used yum groupinstall “Virtualization*” to obtain the latest versions of libvirt, virt-manager, qemu-kvm and associated dependencies.
  • From the Virtual Machine Manager, create four basic isolated networks (without IP, DHCP or NAT settings). Those will serve as transport for the point-to-point links between our switches. I named them as follows:
    • net1
    • net2
    • net3
    • net4
  • Download the KVM qcow2 image from the Cumulus website. At the time of writing the image is based on Cumulus Linux v2.5.5. You would want to copy it four times, and name them as follows:
    • leaf1.qcow2
    • leaf2.qcow2
    • spine3.qcow2
    • spine4.qcow2

Creating the VMs

While creating each VM you will need to specify the network settings, in particular what interfaces you want to be created, what networks they should be part of, and what is their L2 (MAC) information. To ease troubleshooting, I came out with my own convention for the interfaces MAC addresses.

Leaf1:

  • Leaf1 should have three interfaces:
    • One belonging to the “default” network – a network created by virt-manager with DHCP and NAT enabled, and will be used for the management access.
    • One belonging to net1, which is going to be used for the connection between leaf1 and spine3. Behind the scenes, virt-manager created a Linux bridge for this network.
    • One belonging to net2, which is going to be used for the connection between leaf1 and spine4. Behind the scenes, virt-manager created a Linux bridge for this network.
  • Make sure to adjust the path to specify the location of the image.
sudo virt-install --os-variant=generic --ram=256 --vcpus=1 --network=default,model=virtio,mac=00:00:00:00:00:11 --network network=net1,model=virtio,mac=00:00:01:00:00:13 --network network=net2,model=virtio,mac=00:00:01:00:00:14 --boot hd --disk path=/home/nyechiel/Downloads/VX/leaf1.qcow2,format=qcow2 --name=leaf1

Leaf2:

  • Leaf2 should have three interfaces:
    • One belonging to the “default” network – a network created by virt-manager with DHCP and NAT enabled, and will be used for the management access.
    • One belonging to net3, which is going to be used for the connection between leaf2 and spine3. Behind the scenes, virt-manager created a Linux bridge for this network.
    • One belonging to net4, which is going to be used for the connection between leaf2 and spine4. Behind the scenes, virt-manager created a Linux bridge for this network.
  • Make sure to adjust the path to specify the location of the image.
sudo virt-install --os-variant=generic --ram=256 --vcpus=1 --network=default,model=virtio,mac=00:00:00:00:00:22 --network network=net3,model=virtio,mac=00:00:02:00:00:23 --network network=net4,model=virtio,mac=00:00:02:00:00:24 --boot hd --disk path=/home/nyechiel/Downloads/VX/leaf2.qcow2,format=qcow2 --name=leaf2

Spine3:

  • Spine3 should have three interfaces:
    • One belonging to the “default” network – a network created by virt-manager with DHCP and NAT enabled, and will be used for the management access.
    • One belonging to net1, which is going to be used for the connection between leaf1 and spine3. Behind the scenes, virt-manager created a Linux bridge for this network.
    • One belonging to net3, which is going to be used for the connection between leaf2 and spine3. Behind the scenes, virt-manager created a Linux bridge for this network.
  • Make sure to adjust the path to specify the location of the image.
sudo virt-install --os-variant=generic --ram=256 --vcpus=1 --network=default,model=virtio,mac=00:00:00:00:00:33 --network network=net1,model=virtio,mac=00:00:03:00:00:31 --network network=net3,model=virtio,mac=00:00:03:00:00:32 --boot hd --disk path=/home/nyechiel/Downloads/VX/spine3.qcow2,format=qcow2 --name=spine3

Spine4:

  • Spine4 should have three interfaces:
    • One belonging to the “default” network – a network created by virt-manager with DHCP and NAT enabled, and will be used for the management access.
    • One belonging to net2, which is going to be used for the connection between leaf1 and spine4. Behind the scenes, virt-manager created a Linux bridge for this network.
    • One belonging to net4, which is going to be used for the connection between leaf2 and spine4. Behind the scenes, virt-manager created a Linux bridge for this network.
  • Make sure to adjust the path to specify the location of the image.
sudo virt-install --os-variant=generic --ram=256 --vcpus=1 --network=default,model=virtio,mac=00:00:00:00:00:44 --network network=net2,model=virtio,mac=00:00:04:00:00:41 --network network=net4,model=virtio,mac=00:00:04:00:00:42 --boot hd --disk path=/home/nyechiel/Downloads/VX/spine4.qcow2,format=qcow2 --name=spine4

Verifying the hypervisor topology

Before we log in to any of the newly created VMs, I first would like to verify the configuration and make sure that we have got the right connectivity in place. Using ifconfig on my Fedora system, and by looking into the MAC addresses, I correlated between the Linux bridges created by virt-manager (virbr0, virbr1, virbr2, virbr3, virbr4) and the virtual Ethernet devices (vnet). This is giving me the hypervisor point of view, and going to be really useful for troubleshooting purposes. I came up with this topology:

hypervisor_view

Useful commands to use here are brctl show and brctl showmacs. For example, let’s examine the link between leaf1 and spine3 (note that libvirt based the MAC on the configured guest MAC address with high byte set to 0xFE):

$ ip link show vnet1 | grep link
   link/ether fe:00:01:00:00:13 brd ff:ff:ff:ff:ff:ff
$ ip link show vnet10 | grep link
   link/ether fe:00:03:00:00:31 brd ff:ff:ff:ff:ff:ff
$ brctl show virbr1
bridge name      bridge id      STP enabled     interfacesvirbr1       8000.525400d32feb      yes         virbr1-nic                                                vnet1
                                                vnet10
$ brctl showmacs virbr1
port no   mac addr        is local?   ageing timer          2   fe:00:01:00:00:13    yes        18.34
  3   fe:00:03:00:00:31    yes        24.61
  1   52:54:00:d3:2f:eb    yes        0.00

Verifying the fabric topology

Now that we have the basic networking setup between the VMs and we understand the topology, we can jump into the switches and confirm their view. The switches can be accessed with the username “cumulus” and the password “CumulusLinux!”. This is also the password for root.

Using console access to the VMs and the ifconfig command we can learn a couple of things:

  1. eth0 is the base interface on each switch used for management purposes. It picked up an address from the 192.168.122.0/22 range, which is what virt-manager used to setup the “default” network. SSH to this address is enabled by default with standard TCP port 22.
  2. The “fabric” interfaces are swp1 and swp2.

Based on this information we can build up our final topology, which is a representation of the actual fabric:

fabric_topology

Now what?

Now that we have the basic topology setup and the right diagrams to support us, we can go on and configure things. Cumulus has got some good level of documentation so I will let you take it from here. You can configure things manually using the CLI (which is really a bash system with standard Linux commands) or use automation tools to control the switch.  

Using the CLI and following the documentation, it was pretty straightforward to me to configure hostnames, IP addresses and bring up OSPF and BFD (using Quagga) between the switches. Next I plan to play with the more advanced stuff (personally I want to test out BGP and IPv6 configurations), and try to automate things using Ansible. Happy testing!

 

Reflections on the networking industry, part 1: Welcome to vendor land

I have been involved with networking for quite some time now; I have had the opportunity to design, implement and operate different networks across different environments such as enterprise, data-center, and service provider – which inspired me to create this series of short blog posts exploring the computer networking industry. My view on the history, challenges, hype and reality, and most importantly – what’s next and how we can do better.

Part 1: Welcome to vendor land

Protocols and standards were always a key part of networking and were born out of necessity: we need different systems to be able to talk to each other.

Modern networking suite is built around Ethernet and TCP/IP stack, including TCP, UDP, and ICMP – all riding on top of IPv4 or IPv6. There is a general consensus that Ethernet and TCP/IP won the race against the other alternatives. This is great, right? Well, the problem is not with Ethernet or the TCP/IP stack, but with their “ecosystem”: a long list of complementary technologies and protocols.

Getting the industry to agree on the base layer 2, layer 3 and layer 4 protocols and their header format was indeed a big thing, but we kind of stopped there. Say you have got a standard-based Ethernet link. How would you bring it up and negotiate its speed? And what about monitoring, loop prevention, or neighbor discovery? Except for the very basic, common denominator functionality, vendors came out with their own set of proprietary protocols for solving these issues. Just from the top of my mind: ISL, VTP, DTP, UDLD, PAgP, CDP, and PVST are all examples of the “Ethernet ecosystem” from one (!) vendor.

True, you can find standard alternatives for the mentioned protocols today. Vendors are embracing open standards and tend to replace their proprietary implementation with a standard one if available. But why not to start with the standard one to begin with?

If you think that these are just historical examples from a different era, think again. Even in the 2010s decade, more and more protocols are being developed and/or adopted by single vendors only. I usually like to point out MC-LAG as an example of a fairly recent and very common architecture with no standard-based implementation. This feature alone can lead you to choose one vendor (or even one specific hardware model from one vendor) across your entire network, resulting in a perfect vendor lock-in.   

An Overview of Link Aggregation and LACP

The concept of Link Aggregation (LAG) is well known in the networking industry by now, and people usually consider it as a basic functionality that just works out of the box. With all of the SDN hype that’s going on out there, I sometimes feel that we tend to neglect some of the more “traditional” stuff like this one. As with many networking technologies and protocols, things may not just work out of the box, and it’s important to master the details to be able to design things properly, know what to expect to (i.e., what the normal behavior is) and ultimately being able to troubleshoot in case of a problem.

The basic concept of LAG is that multiple physical links are combined into one logical bundle. This provides two major benefits, depending on the LAG configuration:

  1. Increased capacity – traffic may be balanced across the member links to provide aggregated throughput
  2. Redundancy – the LAG bundle can survive the loss of one or more member links

LAG is defined by the IEEE 802.1AX-2008 standard, which states, “Link Aggregation allows one or more links to be aggregated together to form a Link Aggregation Group, such that a MAC client can treat the Link Aggregation Group as if it were a single link”. This layer 2 transparency is achieved by the LAG using a single MAC address for all the device’s ports in the LAG group. The individual port members must be of the same speed, so you cannot bundle for example a 1G and 10G interfaces. The ports should also have the same duplex settings, encapsulation type (i.e., access/untagged or 802.1q tagged with the exact same number of VLANs) as well as MTU.

LAG can be configured as either static (manually) or dynamic by using a protocol to negotiate the LAG formation, with LACP being the standard-based one. There is also the Port Aggregation Protocol (PAgP), which is similar in many regards to LACP, but is Cisco proprietary and not in common usage anymore.

LAG blog (1)

Wait… LAG, bond, bundle, team, trunk, EtherChannel, Port Channel?

Let’s clear this right away – there are several acronyms used to describe LAG which are sometimes used interchangeably. While LAG is the standard name defined by the IEEE specification, different vendors and operating systems came up with their own implementation and terminology. Bond, for example, is really known on Linux-based systems, following the name of the kernel driver. Team (or NIC teaming) is also pretty common across Windows systems, and lately Linux systems as well. EtherChannel is one of the famous terms, being used on Cisco’s IOS. Interesting enough, Cisco have changed the term in their IOS-XR software to bundles, and in their NX-OS systems to Port Channels. Oh… I love the standardization out there!

LAG can also be used as a general term to describe link aggregation with different technologies (such as MLPPP for PPP links) which can cause some confusion, while Ethernet is the de facto standard and the focus of the IEEE spec.

Use cases

Today, Link Aggregations can be found in many network designs, and across different portions of the network. LAG can be found across the Enterprise, Data Center, and Service Provider networks. In the cloud and virtualization space, it’s also common to want to use multiple network connections in your hypervisors to support Virtual Machine traffic. So you can have LAG configured between different network devices (for e.g., switch to switch, router to router), or between an end host or hypervisor and the upstream network device (usually some sort of a ToR switch).

L2 LAG and STP

From Spanning Tree Protocol (STP) perspective, no matter how many physical ports are being used to form the LAG, there is going to be only one logical interface representing each LAG bundle. The individual ports are not part of the STP topology, but only the one logical interface. STP is still going to be active on the LAG interface and should not be turned off, so that if there are multiple LAGs configured between two adjacent nodes, STP will block one of them.

LAG blog (2)

LAG blog (3)

L3 LAG

While LAG is extremely common across L2 network designs, and sometimes even seen as a partial replacement for Spanning Tree Protocol (STP), it is important to mention that LAG can also operate at L3, i.e, by assigning an IPv4 or IPv6 subnet to the aggregated link. You can then setup static or dynamic routing over the LAG like any other routed interface.

LAG versus MC-LAG

By definition, LAG is formed across two adjacent nodes which are directly connected to each other. The two nodes must be configured properly to form the LAG, so that traffic would be transferred properly between the nodes without a fear of creating traffic loops between the individual members for example.

MC-LAG, or Multi-Chassis Link Aggregation Group, is a type of LAG with constituent ports that terminate on separate chassis, thereby providing node-level redundancy. Unlike link aggregation in general, MC-LAG is not covered under IEEE standard, and its implementation varies by vendor. Cisco’s vPC is a good example for a MC-LAG implementation. The real challenge with MC-LAG is to maintain a consistent control plane state across the LAG setup, which is why the various multi-chassis mechanisms insist on countermeasures such as peer links or out of band connectivity between the redundant chassis.

LAG blog (4)

Load sharing operation

Traffic is not randomly placed across the LAG members, but instead shared using a deterministic hash algorithm. Depending on the platform and the configuration, a number of parameters may feed into the algorithm, including for example the ingress interface, source and/or destination MAC address, source and/or destination IP address, source and/or destination L4 (TCP/UDP) port numbers, MPLS labels, and so on.

Ultimately the hash will take in some combination of parameters to identify a flow and decide to which member link the frame should be placed in. It is important to note that all traffic for a particular flow will always be placed on the same link. That’s also means that traffic for a single flow (e.g., source and destination MAC) cannot exceed the bandwidth of a single member link. It is also important to note that each node (or chassis) performs the hash calculations locally itself, so that upstream and downstream traffic for a single flow will not necessarily traverse the same link.

Static configuration

The basic way to form a LAG is to simply specify the member ports on each node manually. This method does not involve any protocols to negotiate and form the LAG. Depending on the platform, the user can also control the hash algorithm on each side. As soon as a port becomes physically up it becomes a member of the LAG bundle. The major advantage of this is that the configuration is very simple. The disadvantage is that there is no method to detect any kind of cabling or configuration errors, which is most vendors would recommend a LACP configuration instead.

LACP configuration

LACP is the standards based protocol used to signal LAGs. It detects and protects the network from a variety of misconfiguration, ensuring that links are only aggregated into a bundle if they are consistently configured and cabled. LACP can be configured in one of two modes:

  • Active mode – the device immediately sends LACP messages (LACP PDUs) when the port comes up
  • Passive mode – Places a port into a passive negotiating state, in which the port only responds to LACP PDUs it receives but does not initiate LACP negotiation

If both sides are configured as active, LAG can be formed assuming successful negotiation of the other parameters. If one side is configured as active and the other one as passive, LAG can be formed as the passive port will respond to the LACP PDUs received from the active side. If both sides are passive, LACP will fail to negotiate the bundle. In practice it is rare to find passive mode used as it should be clearly and consistently defined which links will use LACP/LAG ahead of deployment. There are even vendors who does not offer the passive mode option at all.

With LACP, you can also control the timeout interval in which LACP PDUs will be sent. The standard defines two intervals: fast (1 second) and slow (30 seconds). Note that the timeout value does not have to agree between peers. While it is not a recommended configuration, it is possible to bring up a LAG with one end sending every 1 second and the other sending every 30 seconds. Depending on the platform and configuration, it is also possible to use Bidirectional Forwarding Detection (BFD) for fast detection of link failures.

Red Hat Enterprise Linux OpenStack Platform 6: SR-IOV Networking – Part II: Walking Through the Implementation

Second part of the SR-IOV networking post I wrote for the Red Hat Stack blog.

Red Hat Stack

In the previous blog post in this series we looked at what single root I/O virtualization (SR-IOV) networking is all about and we discussed why it is an important addition to Red Hat Enterprise Linux OpenStack Platform. In this second post we would like to provide a more detailed overview of the implementation, some thoughts on the current limitations, as well as what enhancements are being worked on in the OpenStack community.

Note: this post does not intend to provide a full end to end configuration guide. Customers with an active subscription are welcome to visit the official article covering SR-IOV Networking in Red Hat Enterprise Linux OpenStack Platform 6 for a complete procedure.

Setting up the Environment

In our small test environment we used two physical nodes: one serves as a Compute node for hosting virtual machine (VM) instances, and the other serves as both the OpenStack Controller and…

View original post 2,177 more words

Red Hat Enterprise Linux OpenStack Platform 6: SR-IOV Networking – Part I: Understanding the Basics

Check out this blog post I wrote for Red Hat Stack on SR-IOV networking support introduced in RHEL OpenStack Platfrom 6. This is based on the Nova and Neutron work done at the upstream community for the OpenStack Juno release.

Red Hat Stack

Red Hat Enterprise Linux OpenStack Platform 6: SR-IOV Networking – Part I: Understanding the Basics

Red Hat Enterprise Linux OpenStack Platform 6 introduces support for single root I/O virtualization (SR-IOV) networking. This is done through a new SR-IOV mechanism driver for the OpenStack Networking (Neutron) Modular Layer 2 (ML2) plugin, as well as necessary enhancements for PCI support in the Compute service (Nova).

In this blog post I would like to provide an overview of SR-IOV, and highlight why SR-IOV networking is an important addition to RHEL OpenStack Platform 6. We will also follow up with a second blog post going into the configuration details, describing the current implementation, and discussing some of the current known limitations and expected enhancements going forward.

PCI Passthrough: The Basics

PCI Passthrough allows direct assignment of a PCI device into a guest operating system (OS). One prerequisite for doing this is that the hypervisor…

View original post 866 more words

The need for Network Overlays – part II

In the previous post, I covered some of the basic concepts behind network overlays, primarily highlighting the need to move into a more robust, L3 based, network environments. In this post I would like to cover network overlays in more detail, going over the different encapsulation options and highlighting some of the key points to consider when deploying an overlay-based solution.

Underlying fabric considerations

While network overlays give you the impression that networks are suddenly all virtualized, we still need to consider the physical underlying network. No matter what overlay solution you might pick, it’s still going to be the job of the underlying transport network to switch or route the traffic from source to destination (and vice versa).

Like any other network design, there are several options to choose from when building the underlying network. Before picking up a solution, it’s important to analyze the requirements – namely the scale, amount of virtual machines (VMs), size of the network as well as the amount of traffic. Yes, there are some fancy network fabric solutions out there from any of your favorite vendors, but simple L3 Clos network will do just fine. The big news here is that the underlying network should no longer be a L2 bridged network, but can be configured as a L3 routed network. Clos topology with ECMP routing can provide efficient non-blocking forwarding with a quick convergence time in a case of a failure. Known protocols such as OSPF, IS-IS, and BGP, with the addition of a protocol like BFD, can provide a good standard-based foundation for such a network. One thing I do want to highlight when it comes to the underlying network, is the requirement to support Jumbo frames. No matter what overlay encapsulation you may choose to implement, extra bytes of header will be added to the frames, resulting in a need for high MTU support from the physical network.

For the virtualization/cloud admin, with overlay networks, the data network used to carry the overly traffic is no longer a special network that requires careful VLAN configuration. It is now just one more infrastructure network used to provide simple TCP/IP connectivity.

Encapsulation

When it comes to the overlay data-plane encapsulation, the amount of discussions, comparisons and debate out there is amazing. There are several options and standards available, all of them have the same goal: provide an emulated L2 networks over IP infrastructure. The main difference between them is the encapsulation format itself and their approach to the control plane – which is essentially the way to obtain MAC-to-IP mapping information for the tunnel end-points.

It all started with the well-known Generic Routing Encapsulation (GRE) protocol that was rebranded as NVGRE. GRE is a simple point-to-point tunneling protocol which is being used in todays networks to solve various design challenges and therfore is well understood by many network engineers. With NVGRE, the inner frame is being encapsulated with GRE encapsulation as specified in RFC 2784 and RFC 2890. The Key field (32 bits) in the GRE header is used to carry the Tenant Network Identifier (TNI) and is used to isolate different logical segments. One thing to note about GRE is the fact that it uses IP protocol number 47 for communication, i.e., it does not use TCP or UDP – which make it hard to create header entropy. Header entropy is something that you really want to have if you are using a flow-based ECMP network to carry the overlay traffic. Interesting enough, the authors of NVGRE do not cover the control plane part but only the data-plane considerations.

Other option would be Virtual Extensible LAN (VXLAN). Unlike NVGRE, VXLAN is a new protocol that was designed to solve the overlay networks use case. It uses UDP for communication (port 4789) and a 24-bit segment ID known as the VXLAN network identifier (VNID). With VXLAN, a hash of the inner frame’s header is used as the VXLAN source UDP port. As a result, a VXLAN flow can be unique, with the IP addresses and UDP ports combination in its outer header while traversing the underlay physical network. Therefore, the hashed source UDP port introduces a desirable level of entropy for ECMP load balancing. When it comes to the control plane, VXLAN does not provide any solution, but instead relies on flooding emulated with IP multicast. The original standard recommends to create an IP multicast group per VNI to handle broadcast traffic within a segment. This requires support for IP multicast on the underlying physical network as well as proper configuration and maintenance of the various multicast trees. This approach may work for small scale environments, but for large environments with good number of logical VXLAN segments this is probably not a good idea. It also important to note here that while IP multicast is a clever way to handle IP traffic, it is not commonly implemented in Data Center networks today, and the requirement to deploy an IP multicast network (which can be fairly complex) just to introduce VXLAN is not something that is accepted in most cases. These days, it is common to see “unicast mode” VXLAN implementations that do not require any kind of multicast support.

You may also have heard about Stateless Transport Tunneling Protocol (STT) which was originally introduced by Nicira (now VMware NSX). The main reason I decided to mention STT here is one of its benefits: the ability to leverage TCP offloading capabilities from existing physical NICs, resulting in improved performance. STT uses a header that looks just like the TCP header to the NIC. The NIC is thus able to perform Large Segment Offload (LSO) on what it thinks is a simple TCP datagram. That said, new generation NICs also offer offload capabilities for NVGRE and VXLAN, so this is not a unique benefit of STT anymore.

Last but not least, I would also like to introduce Geneve: Generic Network Virtualization Encapsulation, which looks to take a more holistic view of tunneling. From a first look, Geneve looks pretty much similar to VXLAN. It uses a UDP-based header and a 24 bit Virtual Network Identifier. So what is unique about Geneve? The fact that it uses an extendable header format, similar to (long-living) protocols such as BGP, LLDP, and IS-IS. The idea is that Geneve can evolve over time with new capabilities, not by revising the base protocol, but by adding new optional capabilities. The protocol has a set of fixed header, parameters and values, but then leave room for non-defined optional fields. New fields can be added to the protocol by simply defining and publishing them. The protocol is created in such a way that implementations know there may be optional fields that they may or may not understand. Although the protocol is new, there is some work to enable Open vSwitch support as well as NIC vendors announcing support for offloading capabilities.

I also want to leave room here for some other protocols that can be used as an encapsulation option. There is nothing wrong with MPLS for example, other than the fact that it requires to be enabled throughout the underlying transport network.

So should I pick a winner? probably not. As you can see you have got some options to choose from, but let’s make it clear: all protocols discussed above are ignoring the real problems (hint: control-plane operations) and providing a nice framework for data-plane encapsulation, which is just part of the deal. If I need to pick one, I would say that it looks like VXLAN and Geneve are here to stay (but we should let the market decide).

Tunnel End Point

I have already mentioned the term tunnel end-point, sometimes refer to as VTEP, earlier. But what is this end-point, and more importantly, where is it located? The function of VTEP is to encapsulate the VM traffic within an IP header to send across the underlying IP network. With the most common implementations, the VMs are unaware of the VTEP. They just send untagged or VLAN-tagged traffic that needs to classified and associated with a VTEP. Initial designs (which are still the most common ones) implemented the VTEP functionality within the hypervisor which houses the VMs, usually in the software vSwitch. While this is a valid solution that is probably here to stay, it also worth mentioning an alternative design in which the VTEP functionality is implemented in hardware, for e.g., within a top-of-rack (ToR) switch. This makes sense is some environments, especially where performance and throughput is critical.

Control plane or flooding

Probably the most interesting question to ask when picking an overlay network solution is what’s going on with the control plane and how the network is going to handle Broadcast, Unknown unicast and Multicast traffic (sometimes refer to as BUM traffic). I am not to going to provide easy answers here, simply because of the fact that there are plenty of solutions out there, each addresses this problem differently. I just want to emphasize that the protocol you are going to use to form the overly network (e.g., NVGRE, VXLAN, or what have you) is essentially taking care only for the data-plane encapsulation. For control plane you will need to rely either on flooding (basically continue to learn MAC addresses via the “flood and learn” method to ensure that the packet reaches all the other tunnel end-points), or consulting some sort of database which includes the MAC to IP bindings in the network (e.g., an SDN controller).

Connectivity with the outside world

Another factor to consider is the connectivity with the outside world – or how can a VM within an overlay network communicate with a device resides outside of the network. No matter how much overlays would be popular throughout the network, there are still going to be devices inside and outside of the Data Center that speaks only native IP or understand just 802.1Q VLANs. In order to communicate with those the overlay packet will need to get into some kind of a gateway that is capable of bridging or routing the traffic correctly. This gateway should handle the encapsulation/decapsulation function and provide the required connectivity. As with the control plane considerations, this part is not really covered in any of the encapsulation standards. Common ways to solve this challenge is by using virtual gateways, essentially logical routers/switches implemented in software (take a look on Neutron’s l3-agent to see how OpenStack handle this), or by introducing dedicated physical gateway devices.


Are overlays the only option?

I would like to summarize this post by emphasizing that overlays are an exciting technology which probably makes sense in certain environments. As you saw, an overly-based solution needs to be carefully designed, and as always depends on your business and network requirements. I also would like to emphasize that overlays are not the only option to scale-out networking, and I have seen some cool proposals lately which are probably deserve their own post.

The need for Network Overlays – part I

The IT industry has gained significant efficiency and flexibility as a direct result of virtualization. Organizations are moving toward a virtual datacenter model, and flexibility, speed, scale and automation are central to their success. While compute, memory resources and operating systems were successfully virtualized in the last decade, primarily due to the x86 server architecture, networks and network services have not kept pace.

The traditional solution: VLANs

Way before the era of server virtualization, Virtual LANs (or 802.1q VLANs) were used to partition different logical networks (or broadcast domains) over the same physical fabric. Instead of wiring a separate physical infrastructure for each group, VLANs were used efficiently to isolate the traffic from different groups or applications based on the business needs, with a unique identifier allocated to each logical network. For years, a physical server represented one end-point from the network perspective and was attached to an “access” (i.e., untagged) port in the network switch. The access switch was responsible to enforce the VLAN ID as well as other security and network settings (e.g., quality of service). The VLAN ID is a 12-bit field allowing a theoretical limit of 4096 unique logical networks. In practice though, most switch vendors support much lower number to be configured. You should remember that for each active VLAN in a switch, a VLAN database need to be maintained for proper mapping of the physical interfaces and the MAC addresses associated with the VLAN. Furthermore, some vendors would also create a different spanning-tree (STP) instance for each active VLAN on the switch which require additional memory cycles.

VLANs are a perfect solution for small-scale environments, where the number of end-points (and MAC addresses respectively) is small and controlled. With virtualization though, one server, now called hypervisor, can host many virtual machines and many network end-points. As I stated before, the networks have not kept pace, and the easiest (and also rational) thing to do was to reuse the good-old VLANs. We were essentially adding an additional layer of software access switch in the hypervisor to link the different virtual machines on the host, and those server “access” ports in the physical switch that traditionally were untagged, are now expecting tagged traffic with different VLAN IDs differentiating between the virtual machine networks. The main issue here is the fact that the virtual machines MAC addresses must be visible end-to-end throughout the network core. Reminder: VLANs must be properly configured on each switch along the path, as well as on the appropriate interfaces to get end-to-end MAC learning and connectivity.

In a virtualized world, where the number of end-points is constantly increasing and can be very high, VLANs is a limited solution that does not follow one of the main participles beyond virtualization: use of software application to divide one physical resource into multiple isolated virtual environments. Yes, VLANs does offer segmentation of different logical networks (or broadcast domains) over the same physical fabric, but you still need to manually provision the network and make sure the VLANs are properly configured across the network devices. This start to become a management and configuration nightmare and simply does not scale.

Where network vendors started to be (really) creative

 

At this point, when there was no doubt that VLANs and traditional L2 based networks are not suitable for large virtualized environments, plenty of network solutions were raised. I don’t really want to go into detail on any of those, but you can look for 802.1Qbg, VM-FEX, FabricPath, TRILL, 802.1ad (QinQ), and 802.1ah (PBB) to name a few. In my view, these are over complicating the network while ignoring the main problem – L2-based solution is a bad thing to begin with, and we should have looked for something completely different (hint: L3 routing is your friend).

Overlays to the rescue

 

L3 routing is a scalable and well-known solution (it runs the Internet, isn’t it?). With proper planning, routing domains can handle massive number of routes/networks, keeping the broadcast (and failure) domains small. Furthermore, most modern routing operating systems can utilize equal-cost multi-path (ECMP) routing, effectively load-sharing the traffic across all available routed links. In contrast, by default spanning-tree protocol (STP) blocks redundant L2 switched links to avoid switching loops, simply because that there is no way to handle loops within a switched environment (there is no “time-to-live” field within an Ethernet frame).

Routing sounds a lot better, but note that L2 adjacency is required by most applications running inside the virtual machines. L2 connectivity between the virtual machines is also required for virtual machine mobility (e.g., Live Migration in VMware terminology). This is where overlay networks enter the picture; an overlay network is a computer network which is built on top of another network. Using an overlay, we can build a L2 switched network on top of a L3 routed network. Don’t get it wrong – overlays are not a new networking concept and are already used extensively to solve many network challenges (see GRE tunneling and MPLS L2/L3 VPNs for some examples and use cases).

In the next post I will bring the second part of this article, diving into the theory behind network overlays and the way they tend to solve the network virtualization case.