Reflections on the networking industry, part 1: Welcome to vendor land

I have been involved with networking for quite some time now; I have had the opportunity to design, implement, and operate networks across different environments – enterprise, data center, and service provider – which inspired me to create this series of short blog posts exploring the computer networking industry: my view on the history, the challenges, the hype and the reality, and most importantly, what’s next and how we can do better.

Part 1: Welcome to vendor land

Protocols and standards were always a key part of networking and were born out of necessity: we need different systems to be able to talk to each other.

The modern networking suite is built around Ethernet and the TCP/IP stack, including TCP, UDP, and ICMP – all riding on top of IPv4 or IPv6. There is a general consensus that Ethernet and TCP/IP won the race against the alternatives. This is great, right? Well, the problem is not with Ethernet or the TCP/IP stack itself, but with their “ecosystem”: a long list of complementary technologies and protocols.

Getting the industry to agree on the base layer 2, layer 3, and layer 4 protocols and their header formats was indeed a big thing, but we kind of stopped there. Say you have got a standards-based Ethernet link. How would you bring it up and negotiate its speed? And what about monitoring, loop prevention, or neighbor discovery? Except for the very basic, common-denominator functionality, vendors came up with their own sets of proprietary protocols to solve these issues. Just off the top of my head: ISL, VTP, DTP, UDLD, PAgP, CDP, and PVST are all examples of the “Ethernet ecosystem” from one (!) vendor.

True, you can find standard alternatives to the mentioned protocols today. Vendors are embracing open standards and tend to replace their proprietary implementations with a standard one if available. But why not start with the standard one to begin with?

If you think that these are just historical examples from a different era, think again. Even in the 2010s, more and more protocols are being developed and/or adopted by single vendors only. I usually like to point out MC-LAG as an example of a fairly recent and very common architecture with no standards-based implementation. This feature alone can lead you to choose one vendor (or even one specific hardware model from one vendor) across your entire network, resulting in a perfect vendor lock-in.


Neutron networking with Red Hat Enterprise Linux OpenStack Platform

(This is a summary version of a talk I gave at Red Hat Summit on June 25th, 2015. Slides are available here)

I was honored to speak for the second time in a row at Red Hat Summit, the premier open source technology event, hosted in Boston this year. As I am now focusing on product management for networking in Red Hat Enterprise Linux OpenStack Platform, I presented Red Hat’s approach to Neutron, the OpenStack networking service.

Since OpenStack is a fairly new project and a new product in Red Hat’s portfolio, I was not sure what level of knowledge to expect from my audience. I therefore started with a fairly basic overview of Neutron – what it is and some of the common features you can get from its API today. I was very happy to see that most of the people in the audience seemed to be already familiar with OpenStack and with Neutron, so the overview part was quick.

The next part of my presentation was a deep dive into Neutron when deployed with the ML2/Open vSwitch (OVS) plugin. This is our default configuration when deploying Red Hat Enterprise Linux OpenStack Platform today, and like any other Red Hat product, it is based on fully open-source components. Since there is so much to cover here (and I only had one hour for the entire talk), I focused on the core elements of the solution and the common features we see customers using today: L2 connectivity, L3 routing and NAT for IPv4, and DHCP for IP address assignment. I explained the theory of operation and used some graphics to describe the backend implementation and how things look on the OpenStack nodes.

The OVS-based solution is our default, but we are also working with a very large number of leading vendors in the industry who provide their own solutions through the use of Neutron plugins. I spent some time describing the various plugins out there, our current partner ecosystem, and Red Hat’s certification program for 3rd party software.

I then covered some of the major recent enhancements introduced in Red Hat Enterprise Linux OpenStack Platform 6 based on the upstream Juno code base: IPv6 support, L3 HA, and distributed virtual router (DVR) – which is still a Technology Preview feature, yet very interesting to our customers.

Overall, I was very happy with this talk and with the number of questions I got at the end. It looks like OpenStack is happening, and more and more customers are interested in finding out more about it. See you next year in San Francisco for Red Hat Summit 2016!

IPv6 Prefix Delegation – what is it and how is it going to help OpenStack?

IPv6 offers several ways to assign IP addresses to end hosts. Some of them (SLAAC, stateful DHCPv6, stateless DHCPv6) were already covered in this post. The IPv6 Prefix Delegation mechanism (described in RFC 3769 and RFC 3633) provides “a way of automatically configuring IPv6 prefixes and addresses on routers and hosts” – which sounds like yet another IP assignment option. How does it differ from the other methods? And why do we need it? Let’s try to figure it out.

Understanding the problem

I know that you still find it hard to believe… but IPv6 is here, and with IPv6 there are enough addresses. That means that we can finally design our networks properly and avoid using different kinds of network address translation (NAT) in different places across the network. Clean IPv6 design will use addresses from the Global Unicast Address (GUA) range, which are routable in the public Internet. Since these are globally routed, care needs to be taken to ensure that prefixes configured by one customer do not overlap with prefixes chosen by another.

While SLAAC and DHCPv6 enable simple and automatic host configuration, they do not provide a mechanism to automatically delegate a prefix to a customer site. With IPv6, there is a need for a hierarchical model in which the service provider allocates prefixes from a set of pools to the customer. The customer then assigns addresses to its end systems out of the predefined pool. This is powerful, as it provides the service provider with control over IPv6 prefix assignment, and could eliminate potential conflicts in prefix selection.

How does it work?

With Prefix Delegation, a delegating router (Prefix Delegation Server) delegates IPv6 prefixes to a requesting router (Prefix Delegation Client). The requesting router then uses the prefixes to assign global IPv6 addresses to the devices on its internal interfaces. Prefix Delegation is useful when the delegating router does not have information about the topology of the networks in which the requesting router is located; it requires only the identity of the requesting router to choose a prefix for delegation. Prefix Delegation is not a new protocol: it uses DHCPv6 messages as defined in RFC 3633, and is thus sometimes referred to as DHCPv6 Prefix Delegation.

DHCPv6 prefix delegation operates as follows:

  1. A delegating router (Server) is provided with IPv6 prefixes to be delegated to requesting routers.
  2. A requesting router (Client) requests one or more prefixes from the delegating router.
  3. The delegating router (Server) chooses prefixes for delegation, and responds with prefixes to the requesting router (Client).
  4. The requesting router (Client) is then responsible for the delegated prefixes.
  5. The final address allocation mechanism in the local network can be performed with SLAAC or stateful/stateless DHCPv6, based on the customer’s preference. At this step the key thing is the IPv6 prefix itself, not how it is delivered to end systems.
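The steps above can be sketched in a few lines of Python. This is an illustrative model of the delegation logic only, not a real DHCPv6 implementation; the class name, pool, and delegated prefix length are all hypothetical.

```python
# Toy model of DHCPv6 Prefix Delegation: a delegating router (server)
# carves /56 prefixes for requesting routers (clients) out of a /48 pool.
import ipaddress

class DelegatingRouter:
    def __init__(self, pool: str, delegated_len: int = 56):
        # Step 1: the server is provided with prefixes to delegate
        self._free = ipaddress.ip_network(pool).subnets(new_prefix=delegated_len)
        self._leases = {}  # requesting-router identity -> delegated prefix

    def request(self, client_id: str) -> ipaddress.IPv6Network:
        # Steps 2-3: choose a prefix and hand it to the requesting router;
        # only the client's identity is needed, not its internal topology
        if client_id not in self._leases:
            self._leases[client_id] = next(self._free)
        return self._leases[client_id]

server = DelegatingRouter("2001:db8:1000::/48")
p1 = server.request("cpe-a")
p2 = server.request("cpe-b")
# Steps 4-5: each client is now responsible for its own non-overlapping
# prefix and can run SLAAC or DHCPv6 inside it
assert p1 != p2 and p1.prefixlen == 56
```

Note how the server hands each requesting router a distinct prefix from the pool, which is exactly what eliminates the overlap problem described earlier.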

IPv6 in OpenStack Neutron

Back in the Icehouse development cycle, the Neutron “subnet” API was enhanced to support IPv6 address assignment options. A reference implementation followed in the Juno cycle, where the dnsmasq and radvd processes were chosen to serve the subnets with RAs, SLAAC, or DHCPv6.

In the current Neutron implementation, tenants must supply a prefix when creating subnets. This is not a big deal for IPv4, as tenants are expected to pick private IPv4 subnets for their networks and NAT is going to take place anyway when reaching external public networks. For IPv6 subnets that use Global Unicast Address (GUA) format, addresses are globally routable and cannot overlap. There is no NAT or floating IP model for IPv6 in Neutron. And if you ask me, there should not be one. GUA is the way to go. But can we just trust the tenants to configure their IPv6 prefixes correctly? Probably not, and that’s why Prefix Delegation is an important feature for OpenStack.
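To make the overlap problem concrete, here is a minimal sketch using Python’s standard ipaddress module. This is not Neutron code; the function and the example prefixes are made up for illustration.

```python
# With tenant-supplied GUA prefixes, someone has to verify that a new
# subnet does not overlap any existing globally-routable allocation.
import ipaddress

existing = [ipaddress.ip_network("2001:db8:a::/64"),
            ipaddress.ip_network("2001:db8:b::/64")]

def prefix_is_usable(candidate: str) -> bool:
    # A candidate prefix is usable only if it overlaps none of the
    # prefixes already allocated to other tenants
    net = ipaddress.ip_network(candidate)
    return not any(net.overlaps(other) for other in existing)

assert prefix_is_usable("2001:db8:c::/64")       # free prefix: OK
assert not prefix_is_usable("2001:db8:a::/64")   # collides with a tenant
```

Prefix Delegation moves this responsibility away from the tenants entirely: when prefixes are handed out centrally from pre-configured pools, collisions cannot happen in the first place.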

An OpenStack administrator may want to simplify the process of subnet prefix selection for the tenants by automatically supplying prefixes for IPv6 subnets from one or more large pools of pre-configured IPv6 prefixes. The tenant would not need to specify any prefix configuration. Prefix Delegation will take care of the address assignment.

The code is expected to land in OpenStack Liberty based on this specification. Other than REST API changes, a PD client would need to run in the Neutron router’s network namespace whenever a subnet attached to that router requires prefix delegation. Dibbler is an open-source utility that supports a PD client mode and can be used to provide the required functionality.

OpenStack Networking with Neutron: What Plugin Should I Deploy?

(This is a summary version of a talk I gave at OpenStack Israel event on June 15th, 2015. Slides are available here)

Neutron is probably one of the most pluggable projects in OpenStack today. The theory is very simple and goes like this: Neutron provides just an API layer, and you get to choose the backend implementation you want. But in reality, there are plenty of plugins (or drivers) to choose from, and the plugin architecture is not always so clear.

The plugin is a critical piece of the deployment and directly affects the feature set you are going to get, as well as the scale, performance, high availability, and supported network topologies. In addition, different plugins offer different approaches for managing and operating the networks.

So what is a Neutron plugin?

The Neutron API exposed via the Neutron server is split into two buckets: the core (L2) API and the API extensions. While the core API consists only of the fundamental Neutron definitions (Network, Subnet, Port), the API extensions are where the interesting stuff gets defined, and where you can deal with constructs like the L3 router, provider networks, or L4-L7 services such as FWaaS, LBaaS, or VPNaaS.

In order to match this design, the plugin architecture is built out of a “core” plugin (which implements the core API) and one or more “service” plugins (to implement additional “advanced” services defined in the API extensions). To make things more interesting, these advanced network services can also be provided by the core plugin by implementing the relevant extensions.

What plugins are out there?

There are many plugins out there, each with its own approach. But when trying to categorize them, I found that it usually boils down to “software-centric” plugins versus “hardware-centric” plugins.

With the software-centric ones, the assumption is that the network hardware is general-purpose, and the functionality is offered, as the name implies, with software only. This is where we get to see most of the overlay networking approaches, with the virtual tunnel end-points (VTEPs) implemented in the Compute/Hypervisor nodes. The only requirement from the physical fabric is to provide basic IP routing/switching. The plugin can use an SDN approach to provision the overlay tunnels in an optimal manner and handle broadcast, unknown unicast, and multicast (BUM) traffic efficiently.

With the hardware-centric ones, the assumption is that dedicated network hardware is in place. This is where the traditional network vendors usually offer a combined software/hardware solution taking advantage of their network gear. The advantages of this design are better performance (if you offload certain network functions to the hardware) and the promise of better manageability and control of the physical fabric.

And what is there by default?    

There are efforts in the Neutron community to completely separate the API (or control-plane components) from the plugin, or actual implementation. The vision is to position Neutron as a platform, not as any specific implementation. That being said, Neutron really developed out of the Open vSwitch plugin, and a good amount of the upstream development today is still focused around it. Open vSwitch (with the OVS ML2 driver) is what you get by default, and this is by far the most common plugin deployed in production (see the recent user survey here). This solution is not perfect and has pros and cons like any other solution out there.

While Open vSwitch is used on the Compute nodes to provide connectivity for VM instances, some of the key components of this solution are actually not related to Open vSwitch. L3 routing, DHCP, and other services are implemented using dedicated software agents built on Linux tools such as network namespaces (ip netns), dnsmasq, and iptables.

So how should one choose a plugin?

I am sorry, but there is no easy answer here. From my experience, the best way is to develop a methodical approach:

1. Evaluate the default Open vSwitch based solution first. Even if you end up not choosing it for your production environment, it should at least get you familiar with the Neutron constructs, definitions, and concepts.

2. Get to know your business needs, and collect technical requirements. Some key questions to answer:

  • Are you building a greenfield deployment?
  • What level of interaction is expected with your existing network?
  • What type of applications are going to run in your cloud?
  • Is self-service required?
  • Who are the end-users?
  • What level of isolation and security is required?
  • What level of QoS is expected?
  • Are you building a multi-cloud/multi-data-center or a hybrid deployment?

3. Test things out yourself. Don’t rely on vendor presentations and other marketing materials.

What’s Coming in OpenStack Networking for the Kilo Release

A post I wrote for the Red Hat Stack blog on what’s coming in OpenStack Networking for the Kilo release


OpenStack Kilo, the 11th release of the open source project, was officially released in April, and now is a good time to review some of the changes we saw in the OpenStack Networking (Neutron) community during this cycle, as well as some of the key new networking features introduced in the project.

Scaling the Neutron development community

The Kilo cycle brings two major efforts which are meant to better expand and scale the Neutron development community: core plugin decomposition and advanced services split. These changes should not directly impact OpenStack users but are expected to reduce code footprint, improve feature velocity, and ultimately bring faster innovation speed. Let’s take a look at each individually:

Neutron core plugin decomposition

Neutron, by design, has a pluggable architecture which offers a custom backend implementation of the Networking API. The plugin is a core piece of the deployment and acts as the “glue”…


An Overview of Link Aggregation and LACP

The concept of Link Aggregation (LAG) is well known in the networking industry by now, and people usually consider it a basic functionality that just works out of the box. With all of the SDN hype that’s going on out there, I sometimes feel that we tend to neglect some of the more “traditional” stuff like this one. As with many networking technologies and protocols, things may not just work out of the box, and it’s important to master the details to be able to design things properly, know what to expect (i.e., what the normal behavior is), and ultimately be able to troubleshoot in case of a problem.

The basic concept of LAG is that multiple physical links are combined into one logical bundle. This provides two major benefits, depending on the LAG configuration:

  1. Increased capacity – traffic may be balanced across the member links to provide aggregated throughput
  2. Redundancy – the LAG bundle can survive the loss of one or more member links

LAG is defined by the IEEE 802.1AX-2008 standard, which states, “Link Aggregation allows one or more links to be aggregated together to form a Link Aggregation Group, such that a MAC client can treat the Link Aggregation Group as if it were a single link”. This layer 2 transparency is achieved by the LAG using a single MAC address for all the device’s ports in the LAG group. The individual member ports must be of the same speed, so you cannot, for example, bundle a 1G and a 10G interface. The ports should also have the same duplex settings, encapsulation type (i.e., access/untagged or 802.1q tagged with the exact same set of VLANs), as well as MTU.
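The member-port constraints can be expressed as a toy consistency check. This is illustrative only; real platforms enforce these rules inside their LAG implementation, and the field names here are made up.

```python
# Candidate member ports must agree on speed, duplex, VLAN encapsulation,
# and MTU before they can join the same bundle.
from dataclasses import dataclass

@dataclass(frozen=True)
class Port:
    name: str
    speed_mbps: int
    duplex: str
    vlans: frozenset   # tagged VLAN set (empty set = access/untagged)
    mtu: int

def can_aggregate(ports) -> bool:
    # Every attribute except the port name must be identical across members
    profiles = {(p.speed_mbps, p.duplex, p.vlans, p.mtu) for p in ports}
    return len(profiles) == 1

ge1 = Port("ge-0/0/1", 1000, "full", frozenset({10, 20}), 9000)
ge2 = Port("ge-0/0/2", 1000, "full", frozenset({10, 20}), 9000)
xe1 = Port("xe-0/0/3", 10000, "full", frozenset({10, 20}), 9000)

assert can_aggregate([ge1, ge2])
assert not can_aggregate([ge1, xe1])  # 1G and 10G cannot be bundled
```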

LAG can be configured either statically (manually) or dynamically by using a protocol to negotiate the LAG formation, with LACP being the standards-based one. There is also the Port Aggregation Protocol (PAgP), which is similar in many regards to LACP, but is Cisco proprietary and not in common usage anymore.


Wait… LAG, bond, bundle, team, trunk, EtherChannel, Port Channel?

Let’s clear this up right away – there are several terms used to describe LAG which are sometimes used interchangeably. While LAG is the name defined by the IEEE specification, different vendors and operating systems came up with their own implementations and terminology. Bond, for example, is well known on Linux-based systems, following the name of the kernel driver. Team (or NIC teaming) is also pretty common across Windows systems, and lately Linux systems as well. EtherChannel is one of the famous terms, being used in Cisco’s IOS. Interestingly enough, Cisco has changed the term in its IOS-XR software to bundles, and in its NX-OS systems to Port Channels. Oh… I love the standardization out there!

LAG can also be used as a general term to describe link aggregation with different technologies (such as MLPPP for PPP links) which can cause some confusion, while Ethernet is the de facto standard and the focus of the IEEE spec.

Use cases

Today, Link Aggregation can be found in many network designs and across different portions of the network: Enterprise, Data Center, and Service Provider. In the cloud and virtualization space, it’s also common to want to use multiple network connections in your hypervisors to support Virtual Machine traffic. So you can have LAG configured between different network devices (e.g., switch to switch, router to router), or between an end host or hypervisor and the upstream network device (usually some sort of a ToR switch).

L2 LAG and STP

From a Spanning Tree Protocol (STP) perspective, no matter how many physical ports are used to form the LAG, there is going to be only one logical interface representing each LAG bundle. Only that logical interface, not the individual ports, is part of the STP topology. STP is still going to be active on the LAG interface and should not be turned off, so that if there are multiple LAGs configured between two adjacent nodes, STP will block one of them.



While LAG is extremely common across L2 network designs, and is sometimes even seen as a partial replacement for Spanning Tree Protocol (STP), it is important to mention that LAG can also operate at L3, i.e., by assigning an IPv4 or IPv6 subnet to the aggregated link. You can then set up static or dynamic routing over the LAG like any other routed interface.

LAG versus MC-LAG

By definition, LAG is formed across two adjacent nodes which are directly connected to each other. The two nodes must be configured properly to form the LAG, so that traffic is transferred properly between the nodes without the fear of creating traffic loops across the individual members, for example.

MC-LAG, or Multi-Chassis Link Aggregation Group, is a type of LAG with constituent ports that terminate on separate chassis, thereby providing node-level redundancy. Unlike link aggregation in general, MC-LAG is not covered by an IEEE standard, and its implementation varies by vendor. Cisco’s vPC is a good example of an MC-LAG implementation. The real challenge with MC-LAG is to maintain a consistent control-plane state across the LAG setup, which is why the various multi-chassis mechanisms insist on countermeasures such as peer links or out-of-band connectivity between the redundant chassis.


Load sharing operation

Traffic is not randomly placed across the LAG members, but instead shared using a deterministic hash algorithm. Depending on the platform and the configuration, a number of parameters may feed into the algorithm, including for example the ingress interface, source and/or destination MAC address, source and/or destination IP address, source and/or destination L4 (TCP/UDP) port numbers, MPLS labels, and so on.

Ultimately the hash takes some combination of parameters to identify a flow and decides which member link the frame should be placed on. It is important to note that all traffic for a particular flow will always be placed on the same link. That also means that traffic for a single flow (e.g., one source and destination MAC pair) cannot exceed the bandwidth of a single member link. It is also important to note that each node (or chassis) performs the hash calculation locally, so upstream and downstream traffic for a single flow will not necessarily traverse the same link.
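A minimal model of this behavior, assuming a hypothetical four-member bundle and using CRC32 as a stand-in for a platform’s (typically proprietary) hash function:

```python
# Sketch of deterministic flow hashing: a set of header fields is hashed
# and mapped onto one member link, so all frames of a flow stay together.
import zlib

MEMBERS = ["member-0", "member-1", "member-2", "member-3"]

def pick_member(src_mac, dst_mac, src_ip, dst_ip, src_port, dst_port):
    # Concatenate the flow-identifying fields into one hash key
    key = f"{src_mac}{dst_mac}{src_ip}{dst_ip}{src_port}{dst_port}".encode()
    # crc32 stands in for the platform's hash; modulo spreads flows
    # across the member links
    return MEMBERS[zlib.crc32(key) % len(MEMBERS)]

flow = ("aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02",
        "192.0.2.1", "198.51.100.1", 49152, 443)
# Deterministic: the same flow always lands on the same member link
assert pick_member(*flow) == pick_member(*flow)
```

Because the mapping is per flow rather than per frame, frame ordering within a flow is preserved, which is exactly why a single flow can never exceed the bandwidth of one member link.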

Static configuration

The basic way to form a LAG is to simply specify the member ports on each node manually. This method does not involve any protocol to negotiate and form the LAG. Depending on the platform, the user can also control the hash algorithm on each side. As soon as a port becomes physically up, it becomes a member of the LAG bundle. The major advantage of this is that the configuration is very simple. The disadvantage is that there is no method to detect any kind of cabling or configuration error, which is why most vendors would recommend an LACP configuration instead.

LACP configuration

LACP is the standards-based protocol used to signal LAGs. It detects and protects the network from a variety of misconfigurations, ensuring that links are only aggregated into a bundle if they are consistently configured and cabled. LACP can be configured in one of two modes:

  • Active mode – the device immediately sends LACP messages (LACP PDUs) when the port comes up
  • Passive mode – the device places the port into a passive negotiating state, in which the port only responds to LACP PDUs it receives but does not initiate LACP negotiation

If both sides are configured as active, the LAG can be formed, assuming successful negotiation of the other parameters. If one side is configured as active and the other as passive, the LAG can be formed, as the passive port will respond to the LACP PDUs received from the active side. If both sides are passive, LACP will fail to negotiate the bundle. In practice it is rare to find passive mode used, as it should be clearly and consistently defined ahead of deployment which links will use LACP/LAG. There are even vendors who do not offer the passive mode option at all.
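The negotiation rule above boils down to a one-line truth table; here is a sketch:

```python
# Minimal model of the active/passive rule: a bundle forms unless both
# ends are passive (neither initiates LACP PDUs).
def lacp_forms(side_a: str, side_b: str) -> bool:
    return "active" in (side_a, side_b)

assert lacp_forms("active", "active")        # both initiate
assert lacp_forms("active", "passive")       # passive end responds
assert lacp_forms("passive", "active")
assert not lacp_forms("passive", "passive")  # nobody initiates: no bundle
```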

With LACP, you can also control the timeout interval in which LACP PDUs will be sent. The standard defines two intervals: fast (1 second) and slow (30 seconds). Note that the timeout value does not have to agree between peers. While it is not a recommended configuration, it is possible to bring up a LAG with one end sending every 1 second and the other sending every 30 seconds. Depending on the platform and configuration, it is also possible to use Bidirectional Forwarding Detection (BFD) for fast detection of link failures.

Red Hat Enterprise Linux OpenStack Platform 6: SR-IOV Networking – Part II: Walking Through the Implementation

Second part of the SR-IOV networking post I wrote for the Red Hat Stack blog.


In the previous blog post in this series we looked at what single root I/O virtualization (SR-IOV) networking is all about and we discussed why it is an important addition to Red Hat Enterprise Linux OpenStack Platform. In this second post we would like to provide a more detailed overview of the implementation, some thoughts on the current limitations, as well as what enhancements are being worked on in the OpenStack community.

Note: this post does not intend to provide a full end-to-end configuration guide. Customers with an active subscription are welcome to visit the official article covering SR-IOV Networking in Red Hat Enterprise Linux OpenStack Platform 6 for a complete procedure.

Setting up the Environment

In our small test environment we used two physical nodes: one serves as a Compute node for hosting virtual machine (VM) instances, and the other serves as both the OpenStack Controller and…
