An Overview of Link Aggregation and LACP

The concept of Link Aggregation (LAG) is well known in the networking industry by now, and people usually consider it a basic functionality that just works out of the box. With all of the SDN hype going on out there, I sometimes feel that we tend to neglect some of the more “traditional” stuff like this. As with many networking technologies and protocols, things may not just work out of the box, and it’s important to master the details to be able to design things properly, know what to expect (i.e., what the normal behavior is), and ultimately be able to troubleshoot in case of a problem.

The basic concept of LAG is that multiple physical links are combined into one logical bundle. This provides two major benefits, depending on the LAG configuration:

  1. Increased capacity – traffic may be balanced across the member links to provide aggregated throughput
  2. Redundancy – the LAG bundle can survive the loss of one or more member links

LAG is defined by the IEEE 802.1AX-2008 standard, which states, “Link Aggregation allows one or more links to be aggregated together to form a Link Aggregation Group, such that a MAC client can treat the Link Aggregation Group as if it were a single link”. This layer 2 transparency is achieved by the LAG using a single MAC address for all the device’s ports in the LAG group. The individual member ports must be of the same speed, so you cannot, for example, bundle a 1G and a 10G interface. The ports should also have the same duplex settings, encapsulation type (i.e., access/untagged or 802.1Q tagged with the exact same set of VLANs), and MTU.

LAG can be configured either statically (manually) or dynamically by using a protocol to negotiate the LAG formation, with LACP being the standards-based one. There is also the Port Aggregation Protocol (PAgP), which is similar in many regards to LACP, but is Cisco proprietary and not in common usage anymore.


Wait… LAG, bond, bundle, team, trunk, EtherChannel, Port Channel?

Let’s clear this up right away – there are several terms used to describe LAG which are sometimes used interchangeably. While LAG is the standard name defined by the IEEE specification, different vendors and operating systems came up with their own implementations and terminology. Bond, for example, is well known on Linux-based systems, following the name of the kernel driver. Team (or NIC teaming) is also pretty common across Windows systems, and lately Linux systems as well. EtherChannel is one of the famous terms, used on Cisco’s IOS. Interestingly enough, Cisco changed the term in their IOS-XR software to bundles, and in their NX-OS systems to Port Channels. Oh… I love the standardization out there!

LAG can also be used as a general term to describe link aggregation with different technologies (such as MLPPP for PPP links), which can cause some confusion; Ethernet is the de facto standard and the focus of the IEEE spec.

Use cases

Today, Link Aggregation can be found in many network designs, and across different portions of the network: Enterprise, Data Center, and Service Provider networks alike. In the cloud and virtualization space, it’s also common to want multiple network connections on your hypervisors to support virtual machine traffic. So you can have LAG configured between different network devices (e.g., switch to switch, router to router), or between an end host or hypervisor and the upstream network device (usually some sort of ToR switch).

L2 LAG and STP

From a Spanning Tree Protocol (STP) perspective, no matter how many physical ports are used to form the LAG, there is going to be only one logical interface representing each LAG bundle. The individual ports are not part of the STP topology; only the one logical interface is. STP is still going to be active on the LAG interface and should not be turned off, so that if there are multiple LAGs configured between two adjacent nodes, STP will block one of them and prevent a loop.




While LAG is extremely common across L2 network designs, and is sometimes even seen as a partial replacement for Spanning Tree Protocol (STP), it is important to mention that LAG can also operate at L3, i.e., by assigning an IPv4 or IPv6 subnet to the aggregated link. You can then set up static or dynamic routing over the LAG like any other routed interface.

LAG versus MC-LAG

By definition, LAG is formed across two adjacent nodes which are directly connected to each other. The two nodes must be configured properly to form the LAG, so that traffic is forwarded correctly between them without the risk of, for example, creating traffic loops across the individual member links.

MC-LAG, or Multi-Chassis Link Aggregation Group, is a type of LAG with constituent ports that terminate on separate chassis, thereby providing node-level redundancy. Unlike link aggregation in general, MC-LAG is not covered by an IEEE standard, and its implementation varies by vendor. Cisco’s vPC is a good example of an MC-LAG implementation. The real challenge with MC-LAG is maintaining a consistent control plane state across the LAG setup, which is why the various multi-chassis mechanisms rely on measures such as peer links or out-of-band connectivity between the redundant chassis.


Load sharing operation

Traffic is not randomly placed across the LAG members, but instead shared using a deterministic hash algorithm. Depending on the platform and the configuration, a number of parameters may feed into the algorithm, including for example the ingress interface, source and/or destination MAC address, source and/or destination IP address, source and/or destination L4 (TCP/UDP) port numbers, MPLS labels, and so on.

Ultimately the hash takes some combination of parameters to identify a flow and decides on which member link the frame should be placed. It is important to note that all traffic for a particular flow will always be placed on the same link. That also means that traffic for a single flow (e.g., a given source and destination MAC pair) cannot exceed the bandwidth of a single member link. It is also important to note that each node (or chassis) performs the hash calculation locally, so upstream and downstream traffic for a single flow will not necessarily traverse the same link.
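As a rough illustration of the behavior described above, here is a minimal Python sketch. The field selection and the CRC-based hash are assumptions for illustration only; real platforms use vendor-specific hardware hash functions and configurable field sets.

```python
import zlib

def select_member(members, src_mac, dst_mac, src_ip="", dst_ip=""):
    """Deterministically map a flow's header fields to one LAG member link."""
    key = f"{src_mac}|{dst_mac}|{src_ip}|{dst_ip}".encode()
    return members[zlib.crc32(key) % len(members)]

links = ["xe-0/0/0", "xe-0/0/1", "xe-0/0/2", "xe-0/0/3"]
flow = ("aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02", "10.0.0.1", "10.0.0.2")

# The same flow always hashes to the same member link
assert select_member(links, *flow) == select_member(links, *flow)
```

Note that a single flow never spreads across links, which is exactly why one flow cannot exceed the bandwidth of one member.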

Static configuration

The basic way to form a LAG is to simply specify the member ports on each node manually. This method does not involve any protocol to negotiate and form the LAG. Depending on the platform, the user can also control the hash algorithm on each side. As soon as a port comes physically up it becomes a member of the LAG bundle. The major advantage of this is that the configuration is very simple. The disadvantage is that there is no mechanism to detect cabling or configuration errors, which is why most vendors recommend an LACP configuration instead.

LACP configuration

LACP is the standards-based protocol used to signal LAGs. It detects and protects the network from a variety of misconfigurations, ensuring that links are only aggregated into a bundle if they are consistently configured and cabled. LACP can be configured in one of two modes:

  • Active mode – the device immediately sends LACP messages (LACP PDUs) when the port comes up
  • Passive mode – Places a port into a passive negotiating state, in which the port only responds to LACP PDUs it receives but does not initiate LACP negotiation

If both sides are configured as active, the LAG can be formed assuming successful negotiation of the other parameters. If one side is configured as active and the other as passive, the LAG can be formed as the passive port will respond to the LACP PDUs received from the active side. If both sides are passive, LACP will fail to negotiate the bundle. In practice it is rare to find passive mode used, as it should be clearly and consistently defined ahead of deployment which links will use LACP/LAG. There are even vendors who do not offer the passive mode option at all.
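The negotiation rules above boil down to a simple predicate, sketched here in Python:

```python
def lacp_negotiates(mode_a: str, mode_b: str) -> bool:
    """A bundle can form as long as at least one end actively sends LACP PDUs."""
    return "active" in (mode_a, mode_b)

print(lacp_negotiates("active", "active"))    # True
print(lacp_negotiates("active", "passive"))   # True
print(lacp_negotiates("passive", "passive"))  # False
```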

With LACP, you can also control the interval at which LACP PDUs are sent. The standard defines two intervals: fast (1 second) and slow (30 seconds). Note that the interval does not have to agree between peers. While it is not a recommended configuration, it is possible to bring up a LAG with one end sending every 1 second and the other sending every 30 seconds. Depending on the platform and configuration, it is also possible to use Bidirectional Forwarding Detection (BFD) for fast detection of link failures.
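Since LACP declares a partner expired after three missed PDUs, the two standard intervals translate into the following detection times, shown as a tiny arithmetic sketch:

```python
INTERVALS = {"fast": 1, "slow": 30}  # LACP PDU transmit interval, in seconds

def lacp_timeout(rate: str) -> int:
    """Seconds until the partner is declared expired (three missed PDUs)."""
    return 3 * INTERVALS[rate]

print(lacp_timeout("fast"))  # 3
print(lacp_timeout("slow"))  # 90
```

This is why fast rate is usually preferred on links where BFD is not available: 3 seconds of detection time versus 90.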


The need for Network Overlays – part II

In the previous post, I covered some of the basic concepts behind network overlays, primarily highlighting the need to move to more robust, L3-based network environments. In this post I would like to cover network overlays in more detail, going over the different encapsulation options and highlighting some of the key points to consider when deploying an overlay-based solution.

Underlying fabric considerations

While network overlays give you the impression that networks are suddenly all virtualized, we still need to consider the physical underlying network. No matter what overlay solution you might pick, it’s still going to be the job of the underlying transport network to switch or route the traffic from source to destination (and vice versa).

Like any other network design, there are several options to choose from when building the underlying network. Before picking a solution, it’s important to analyze the requirements – namely the scale: the number of virtual machines (VMs), the size of the network, and the amount of traffic. Yes, there are some fancy network fabric solutions out there from any of your favorite vendors, but a simple L3 Clos network will do just fine. The big news here is that the underlying network no longer has to be an L2 bridged network, but can be configured as an L3 routed network. A Clos topology with ECMP routing can provide efficient non-blocking forwarding with quick convergence in case of a failure. Well-known protocols such as OSPF, IS-IS, and BGP, with the addition of a protocol like BFD, can provide a good standards-based foundation for such a network. One thing I do want to highlight when it comes to the underlying network is the requirement to support jumbo frames. No matter what overlay encapsulation you choose to implement, extra header bytes will be added to the frames, resulting in a need for a higher MTU on the physical network.
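To make the jumbo-frame point concrete, here is a back-of-the-envelope calculation. The byte counts assume an outer IPv4 header with no options and no VLAN tags; real deployments may add a few more bytes.

```python
# Per-packet bytes added on top of the original (inner) Ethernet frame.
OVERHEAD = {
    "vxlan": 20 + 8 + 8 + 14,  # outer IPv4 + UDP + VXLAN header + inner Ethernet
    "nvgre": 20 + 8 + 14,      # outer IPv4 + GRE (with Key field) + inner Ethernet
}

def required_underlay_mtu(tenant_mtu: int, encap: str) -> int:
    """Minimum underlay IP MTU so tenant traffic is never fragmented."""
    return tenant_mtu + OVERHEAD[encap]

print(required_underlay_mtu(1500, "vxlan"))  # 1550
print(required_underlay_mtu(1500, "nvgre"))  # 1542
```

In practice most operators simply enable 9000-byte jumbo frames end to end rather than computing an exact minimum.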

For the virtualization/cloud admin, with overlay networks, the data network used to carry the overlay traffic is no longer a special network that requires careful VLAN configuration. It is now just one more infrastructure network used to provide simple TCP/IP connectivity.


When it comes to the overlay data-plane encapsulation, the amount of discussion, comparison, and debate out there is amazing. There are several options and standards available, all of which share the same goal: provide emulated L2 networks over an IP infrastructure. The main difference between them is the encapsulation format itself and their approach to the control plane – which is essentially the way tunnel end-points obtain MAC-to-IP mapping information.

It all started with the well-known Generic Routing Encapsulation (GRE) protocol that was rebranded as NVGRE. GRE is a simple point-to-point tunneling protocol which is used in today’s networks to solve various design challenges and is therefore well understood by many network engineers. With NVGRE, the inner frame is encapsulated with GRE as specified in RFC 2784 and RFC 2890. The Key field (32 bits) in the GRE header carries the Tenant Network Identifier (TNI) and is used to isolate different logical segments. One thing to note about GRE is that it uses IP protocol number 47, i.e., it does not use TCP or UDP – which makes it hard to create header entropy. Header entropy is something you really want if you are using a flow-based ECMP network to carry the overlay traffic. Interestingly enough, the authors of NVGRE do not cover the control plane but only the data-plane considerations.

Another option is Virtual Extensible LAN (VXLAN). Unlike NVGRE, VXLAN is a new protocol that was designed specifically for the overlay networks use case. It uses UDP for communication (port 4789) and a 24-bit segment ID known as the VXLAN network identifier (VNID). With VXLAN, a hash of the inner frame’s headers is used as the VXLAN source UDP port. As a result, each VXLAN flow can be unique in the IP address and UDP port combination of its outer header while traversing the underlay physical network; the hashed source UDP port therefore introduces a desirable level of entropy for ECMP load balancing. When it comes to the control plane, VXLAN does not provide any solution, but instead relies on flooding emulated with IP multicast. The original standard recommends creating an IP multicast group per VNI to handle broadcast traffic within a segment. This requires support for IP multicast on the underlying physical network, as well as proper configuration and maintenance of the various multicast trees. This approach may work for small-scale environments, but for large environments with a good number of logical VXLAN segments it is probably not a good idea. It is also important to note that while IP multicast is a clever way to handle such traffic, it is not commonly implemented in Data Center networks today, and the requirement to deploy an IP multicast network (which can be fairly complex) just to introduce VXLAN is not something that is accepted in most cases. These days, it is common to see “unicast mode” VXLAN implementations that do not require any kind of multicast support.
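The VXLAN header itself is only 8 bytes, and the source-port entropy trick is straightforward. Here is a minimal sketch; the hash choice and the ephemeral port range are illustrative assumptions, not mandated by the spec.

```python
import struct
import zlib

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header: flags (VNI-valid bit), 24-bit VNI, reserved zero."""
    assert 0 <= vni < 2 ** 24
    return struct.pack("!II", 0x08 << 24, vni << 8)

def entropy_src_port(inner_frame: bytes) -> int:
    """Derive the outer UDP source port from a hash of the inner headers,
    staying in the ephemeral range so ECMP sees per-flow entropy."""
    return 49152 + zlib.crc32(inner_frame[:34]) % (65535 - 49152)

hdr = vxlan_header(5000)
print(len(hdr))                         # 8
print(int.from_bytes(hdr[4:7], "big"))  # 5000
```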

You may also have heard about Stateless Transport Tunneling Protocol (STT) which was originally introduced by Nicira (now VMware NSX). The main reason I decided to mention STT here is one of its benefits: the ability to leverage TCP offloading capabilities from existing physical NICs, resulting in improved performance. STT uses a header that looks just like the TCP header to the NIC. The NIC is thus able to perform Large Segment Offload (LSO) on what it thinks is a simple TCP datagram. That said, new generation NICs also offer offload capabilities for NVGRE and VXLAN, so this is not a unique benefit of STT anymore.

Last but not least, I would also like to introduce Geneve: Generic Network Virtualization Encapsulation, which takes a more holistic view of tunneling. At first look, Geneve looks pretty similar to VXLAN. It uses a UDP-based header and a 24-bit Virtual Network Identifier. So what is unique about Geneve? The fact that it uses an extensible header format, similar to (long-lived) protocols such as BGP, LLDP, and IS-IS. The idea is that Geneve can evolve over time with new capabilities, not by revising the base protocol, but by adding new optional capabilities. The protocol has a fixed set of header fields and values, but then leaves room for optional, not-yet-defined fields. New fields can be added to the protocol simply by defining and publishing them. The protocol is designed in such a way that implementations know there may be optional fields that they may or may not understand. Although the protocol is new, there is already work to enable Open vSwitch support, as well as NIC vendors announcing support for offloading capabilities.

I also want to leave room here for some other protocols that can be used as an encapsulation option. There is nothing wrong with MPLS, for example, other than the fact that it needs to be enabled throughout the underlying transport network.

So should I pick a winner? Probably not. As you can see, you have some options to choose from, but let’s make it clear: all the protocols discussed above ignore the real problems (hint: control-plane operations) and provide a nice framework for data-plane encapsulation, which is just part of the deal. If I had to pick, I would say that VXLAN and Geneve look like they are here to stay (but we should let the market decide).

Tunnel End Point

I have already mentioned the term tunnel end-point, sometimes referred to as VTEP, earlier. But what is this end-point, and more importantly, where is it located? The function of the VTEP is to encapsulate the VM traffic within an IP header to send it across the underlying IP network. In the most common implementations, the VMs are unaware of the VTEP. They just send untagged or VLAN-tagged traffic that needs to be classified and associated with a VTEP. Initial designs (which are still the most common ones) implemented the VTEP functionality within the hypervisor hosting the VMs, usually in the software vSwitch. While this is a valid solution that is probably here to stay, it is also worth mentioning an alternative design in which the VTEP functionality is implemented in hardware, e.g., within a top-of-rack (ToR) switch. This makes sense in some environments, especially where performance and throughput are critical.

Control plane or flooding

Probably the most interesting question to ask when picking an overlay network solution is what’s going on with the control plane and how the network is going to handle Broadcast, Unknown unicast, and Multicast traffic (sometimes referred to as BUM traffic). I am not going to provide easy answers here, simply because there are plenty of solutions out there, each addressing this problem differently. I just want to emphasize that the protocol you are going to use to form the overlay network (e.g., NVGRE, VXLAN, or what have you) essentially takes care only of the data-plane encapsulation. For the control plane you will need to rely either on flooding (basically continuing to learn MAC addresses via the “flood and learn” method to ensure that the packet reaches all the other tunnel end-points), or on consulting some sort of database which holds the MAC-to-IP bindings in the network (e.g., an SDN controller).
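The “flood and learn” approach can be sketched in a few lines of Python. This is a toy model of one VXLAN-style segment, not any particular vendor's implementation:

```python
class VtepFdb:
    """Toy flood-and-learn forwarding table for one overlay segment."""

    def __init__(self, remote_vteps):
        self.remote_vteps = set(remote_vteps)
        self.fdb = {}  # inner MAC address -> remote VTEP IP

    def learn(self, src_mac, vtep_ip):
        # Bindings are learned from the outer source IP of decapsulated traffic
        self.fdb[src_mac] = vtep_ip

    def next_hops(self, dst_mac):
        if dst_mac in self.fdb:
            return [self.fdb[dst_mac]]      # known unicast: one tunnel
        return sorted(self.remote_vteps)    # BUM traffic: replicate to every peer

vtep = VtepFdb(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
vtep.learn("aa:bb:cc:00:00:01", "10.0.0.2")
print(vtep.next_hops("aa:bb:cc:00:00:01"))       # ['10.0.0.2']
print(len(vtep.next_hops("ff:ff:ff:ff:ff:ff")))  # 3
```

An SDN controller replaces the `learn` step with a pre-populated database, so unknown destinations never need to be flooded.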

Connectivity with the outside world

Another factor to consider is connectivity with the outside world – how can a VM within an overlay network communicate with a device that resides outside of it? No matter how popular overlays become throughout the network, there are still going to be devices inside and outside of the Data Center that speak only native IP or understand just 802.1Q VLANs. To communicate with those, the overlay packet will need to pass through some kind of gateway that is capable of bridging or routing the traffic correctly. This gateway should handle the encapsulation/decapsulation function and provide the required connectivity. As with the control plane considerations, this part is not really covered in any of the encapsulation standards. Common ways to solve this challenge are to use virtual gateways, essentially logical routers/switches implemented in software (take a look at Neutron’s l3-agent to see how OpenStack handles this), or to introduce dedicated physical gateway devices.

Are overlays the only option?

I would like to summarize this post by emphasizing that overlays are an exciting technology which probably makes sense in certain environments. As you saw, an overlay-based solution needs to be carefully designed, and as always it depends on your business and network requirements. I also would like to emphasize that overlays are not the only option for scale-out networking, and I have seen some cool proposals lately which probably deserve their own post.

IPv6 address assignment – stateless, stateful, DHCP… oh my!

People don’t like changes. IPv6 could have helped solve a lot of the burden in networks deployed today, which are still mostly based on the original version of the Internet Protocol, aka version 4. But the time has come, and even the old tricks like throwing network address translation (NAT) everywhere are not going to help anymore, simply because we are out of IP addresses. It may take some more time, and people will do everything they can to (continue to) delay it, but believe me – there is no other way around it – IPv6 is here to replace IPv4. IPv6 is also a critical part of the promise of the cloud and the Internet of Things (IoT). If you want to connect everything to the network, you had better plan for massive scale and have enough addresses to use.

One of the trickiest things with IPv6, though, is that it’s pretty different from IPv4. While some of the concepts remain the same, there are some fundamental differences between IPv4 and IPv6, and it definitely takes some time to get used to the IPv6 basics, including the terms being used. Experienced IPv4 engineers will probably need to change their mindset, and as I stated before, people don’t really like changes…

In this post, I want to highlight the address assignment options available with IPv6, which is in my view one of the most fundamental things in IP networking, and where things are pretty different compared to IPv4. I am going to assume you have some basic background on IPv6; while I will cover the theory, I will also show the command-line interface and demonstrate some of the configuration options, focusing on SLAAC and stateless DHCPv6. I am going to use a simple topology with two Cisco routers directly connected to each other via their GigabitEthernet 1/0 interfaces. Both routers are running IOS 15.2(4).

Let’s get the party started

With IPv6 an interface can have multiple prefixes and IP addresses, and unlike IPv4, all of them are primary. Every interface will have a Link-Local address, which is the address used to implement many of the control plane functions. If you don’t manually set the Link-Local address, one will automatically be generated for you. Note that the IPv6 protocol stack will not become operational on an interface until a Link-Local address has been assigned or generated and has passed Duplicate Address Detection (DAD) verification. In Cisco IOS, we first need to enable IPv6 routing globally using the ipv6 unicast-routing command. We then enable IPv6 on the interface using the ipv6 enable command:

ipv6 unicast-routing
interface GigabitEthernet1/0
 ipv6 enable

Now IPv6 is enabled on the interface, and we should get a Link-Local address assigned automatically:

show ipv6 interface g1/0 | include link

IPv6 is enabled, link-local address is FE80::C800:51FF:FE2F:1C

IPv6 address assignment options

A little bit of theory as promised. When it comes to IPv6 address assignment there are several options you can use:

  • Static (manual) address assignment – exactly like with IPv4, you can go ahead and apply the address yourself. I believe this is straightforward and therefore I am not going to demonstrate it.
  • Stateless Address Auto Configuration (SLAAC) – nodes listen for ICMPv6 Router Advertisement (RA) messages periodically sent by routers on the local link, or request one using a Router Solicitation message. A node can then create a Global unicast IPv6 address by combining its interface EUI-64 identifier (based on the MAC address on Ethernet interfaces) with the link prefix obtained via the Router Advertisement. This feature, unique to IPv6, provides simple “plug & play” networking. By default, SLAAC does not provide anything to the client beyond an IPv6 address and a default gateway. SLAAC is discussed in detail in RFC 4862.
  • Stateless DHCPv6 – with this option SLAAC is still used to get the IP address, but DHCP is used to obtain “other” configuration options, usually things like DNS, NTP, etc. The advantage here is that the DHCP server is not required to store any dynamic state about individual clients. In large networks with a huge number of end points, implementing stateless DHCPv6 greatly reduces the number of DHCPv6 messages needed for address state refreshment.
  • Stateful DHCPv6 – functions much like IPv4 DHCP, in which hosts receive both their IPv6 address and additional parameters from the DHCP server. Like DHCP for IPv4, a DHCPv6 infrastructure consists of DHCPv6 clients that request configuration, DHCPv6 servers that provide it, and DHCPv6 relay agents that convey messages between clients and servers when clients are on subnets without a DHCPv6 server. You can learn more about DHCP for IPv6 in RFC 3315.

NOTE: The only way to get a default gateway in IPv6 is via an RA message. DHCPv6 does not carry default route information at this time.
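The EUI-64 derivation mentioned under SLAAC above is mechanical enough to sketch in a few lines of Python. The example MAC is inferred from the link-local address shown earlier (FE80::C800:51FF:FE2F:1C), since EUI-64 is reversible:

```python
def eui64_interface_id(mac: str) -> str:
    """Derive the 64-bit SLAAC interface identifier from a 48-bit MAC."""
    raw = bytearray(bytes.fromhex(mac.replace(":", "").replace(".", "").replace("-", "")))
    raw[0] ^= 0x02                          # flip the universal/local (U/L) bit
    eui = raw[:3] + b"\xff\xfe" + raw[3:]   # insert FF:FE between OUI and NIC bytes
    # Group into four hextets, suppressing leading zeros per hextet
    return ":".join(f"{(eui[i] << 8) | eui[i + 1]:x}" for i in range(0, 8, 2))

print(eui64_interface_id("ca00.512f.001c"))  # c800:51ff:fe2f:1c
```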

Putting it all together

An IPv6 host performs stateless address autoconfiguration (SLAAC) by default, and decides whether to also use a configuration protocol such as DHCPv6 based on the following flags in the Router Advertisement message sent by a neighboring router:

  • Managed Address Configuration Flag, the ‘M’ flag. When set to 1, this flag instructs the host to use a configuration protocol to obtain stateful IPv6 addresses
  • Other Stateful Configuration Flag, the ‘O’ flag. When set to 1, this flag instructs the host to use a configuration protocol to obtain other configuration settings, e.g., DNS, NTP, etc.

Combining the values of the M and O flags can yield the following:

  • Both M and O flags are set to 0. This combination corresponds to a network without a DHCPv6 infrastructure. Hosts use Router Advertisements for non-link-local addresses and other methods (such as manual configuration) to configure other parameters.
  • Both M and O flags are set to 1. DHCPv6 is used for both addresses and other configuration settings, aka stateful DHCPv6.
  • The M flag is set to 0 and the O flag is set to 1. DHCPv6 is not used to assign addresses, only to assign other configuration settings. Neighboring routers are configured to advertise non-link-local address prefixes from which IPv6 hosts derive stateless addresses. This combination is known as stateless DHCPv6.
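The flag combinations above map neatly to code; here is a tiny Python summary:

```python
def dhcpv6_mode(m_flag: int, o_flag: int) -> str:
    """Map RA M/O flag combinations to the address-assignment model."""
    if m_flag:
        return "stateful DHCPv6: addresses and other settings via DHCPv6"
    if o_flag:
        return "stateless DHCPv6: SLAAC addresses, other settings via DHCPv6"
    return "SLAAC only: no DHCPv6 infrastructure"

print(dhcpv6_mode(0, 0))  # SLAAC only: no DHCPv6 infrastructure
print(dhcpv6_mode(1, 1))  # stateful DHCPv6: addresses and other settings via DHCPv6
```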

Examining the configuration


Client configuration:

interface GigabitEthernet1/0
 ipv6 address autoconfig
 ipv6 enable

Server configuration:

interface GigabitEthernet1/0
 ipv6 address 2001:1111:1111::1/64
 ipv6 enable

We can see the server sending the RA message with the prefix that was configured:

ICMPv6-ND: Request to send RA for FE80::C801:51FF:FE2F:1C
ICMPv6-ND: Setup RA from FE80::C801:51FF:FE2F:1C to FF02::1 on GigabitEthernet1/0 
ICMPv6-ND: MTU = 1500
ICMPv6-ND: prefix = 2001:1111:1111::/64 onlink autoconfig 
ICMPv6-ND: 2592000/604800 (valid/preferred)

And the client receiving the message and calculating an address using EUI-64:

ICMPv6-ND: Received RA from FE80::C801:51FF:FE2F:1C on GigabitEthernet1/0 
ICMPv6-ND: Prefix : 2001:1111:1111::
ICMPv6-ND: Update on-link prefix 2001:1111:1111::/64 on GigabitEthernet1/0 
IPV6ADDR: Generating IntfID for 'eui64', prefix 2001:1111:1111::/64 
ICMPv6-ND: IPv6 Address Autoconfig 2001:1111:1111:0:C800:51FF:FE2F:1C 

R1#show ipv6 interface brief
GigabitEthernet1/0 [up/up]
    FE80::C800:51FF:FE2F:1C
    2001:1111:1111:0:C800:51FF:FE2F:1C

Stateless DHCP

Client configuration:

No changes are required on the client side. The client is still configured to use SLAAC via the autoconfig option.

interface GigabitEthernet1/0
 ipv6 address autoconfig
 ipv6 enable

Server configuration:

ipv6 dhcp pool STATELESS_DHCP
 dns-server 2001:1111:1111::10
interface GigabitEthernet1/0
 ipv6 address 2001:1111:1111::1/64
 ipv6 enable
 ipv6 nd other-config-flag
 ipv6 dhcp server STATELESS_DHCP

We can see the client keeping the same IP address, but now obtaining DNS settings through DHCP:

IPv6 DHCP: Adding server FE80::C801:51FF:FE2F:1C
IPv6 DHCP: Processing options
IPv6 DHCP: Configuring DNS server 2001:1111:1111::10
IPv6 DHCP: Configuring domain name

The need for Network Overlays – part I

The IT industry has gained significant efficiency and flexibility as a direct result of virtualization. Organizations are moving toward a virtual datacenter model, and flexibility, speed, scale and automation are central to their success. While compute, memory resources and operating systems were successfully virtualized in the last decade, primarily due to the x86 server architecture, networks and network services have not kept pace.

The traditional solution: VLANs

Way before the era of server virtualization, Virtual LANs (or 802.1Q VLANs) were used to partition different logical networks (or broadcast domains) over the same physical fabric. Instead of wiring a separate physical infrastructure for each group, VLANs were used to efficiently isolate the traffic of different groups or applications based on the business needs, with a unique identifier allocated to each logical network. For years, a physical server represented one end-point from the network perspective and was attached to an “access” (i.e., untagged) port on the network switch. The access switch was responsible for enforcing the VLAN ID as well as other security and network settings (e.g., quality of service). The VLAN ID is a 12-bit field, allowing a theoretical limit of 4096 unique logical networks. In practice, though, most switch vendors support configuring a much lower number. Remember that for each active VLAN in a switch, a VLAN database needs to be maintained to properly map the physical interfaces and the MAC addresses associated with the VLAN. Furthermore, some vendors also create a separate spanning-tree (STP) instance for each active VLAN on the switch, which requires additional memory and CPU cycles.

VLANs are a perfect solution for small-scale environments, where the number of end-points (and MAC addresses, respectively) is small and controlled. With virtualization, though, one server, now called a hypervisor, can host many virtual machines and thus many network end-points. As I stated before, the networks have not kept pace, and the easiest (and also rational) thing to do was to reuse the good old VLANs. We essentially added an additional layer of software access switch in the hypervisor to link the different virtual machines on the host, and those server “access” ports on the physical switch that traditionally were untagged now expect tagged traffic, with different VLAN IDs differentiating between the virtual machine networks. The main issue here is that the virtual machines’ MAC addresses must be visible end-to-end throughout the network core. Reminder: VLANs must be properly configured on each switch along the path, as well as on the appropriate interfaces, to get end-to-end MAC learning and connectivity.

In a virtualized world, where the number of end-points is constantly increasing and can be very high, VLANs are a limited solution that does not follow one of the main principles behind virtualization: using software to divide one physical resource into multiple isolated virtual environments. Yes, VLANs do offer segmentation of different logical networks (or broadcast domains) over the same physical fabric, but you still need to manually provision the network and make sure the VLANs are properly configured across the network devices. This starts to become a management and configuration nightmare and simply does not scale.

Where network vendors started to be (really) creative


At this point, when there was no doubt that VLANs and traditional L2-based networks are not suitable for large virtualized environments, plenty of network solutions were proposed. I don’t really want to go into detail on any of those, but you can look up 802.1Qbg, VM-FEX, FabricPath, TRILL, 802.1ad (QinQ), and 802.1ah (PBB), to name a few. In my view, these overcomplicate the network while ignoring the main problem – an L2-based solution is a bad thing to begin with, and we should have looked for something completely different (hint: L3 routing is your friend).

Overlays to the rescue


L3 routing is a scalable and well-known solution (it runs the Internet, doesn’t it?). With proper planning, routing domains can handle a massive number of routes/networks, keeping the broadcast (and failure) domains small. Furthermore, most modern routing operating systems can utilize equal-cost multi-path (ECMP) routing, effectively load-sharing the traffic across all available routed links. In contrast, by default the spanning-tree protocol (STP) blocks redundant L2 switched links to avoid switching loops, simply because there is no way to handle loops within a switched environment (there is no “time-to-live” field within an Ethernet frame).

Routing sounds a lot better, but note that L2 adjacency is required by many applications running inside the virtual machines. L2 connectivity between the virtual machines is also required for virtual machine mobility (e.g., Live Migration in VMware terminology). This is where overlay networks enter the picture; an overlay network is a computer network built on top of another network. Using an overlay, we can build an L2 switched network on top of an L3 routed network. Don’t get me wrong – overlays are not a new networking concept and are already used extensively to solve many network challenges (see GRE tunneling and MPLS L2/L3 VPNs for some examples and use cases).

In the next post I will bring the second part of this article, diving into the theory behind network overlays and the way they tend to solve the network virtualization case.