vSphere Metro Storage Cluster Networking: Part 2

In vSphere Metro Storage Cluster Networking: Part 1, I wrote about considerations for the Data Center Interconnect (DCI) within a vSphere Metro Storage Cluster (vMSC) environment. If you haven’t already read that, you might want to start with that post first. In Part 2, we’ll cover one of the key network considerations for workloads within vMSC, which is ingress/egress routing.

Within most data center network designs, IP ranges are location specific. It is usually the case that each data center will have its own IP range, and the routing infrastructure for a site is responsible only for data traffic to and from that IP range.

Conventional data center network design

With the advent of stretched Layer 2 networks across data centers, now one or more IP subnets originally belonging to the primary site (we’ll call it DC 1) need to be stretched across to the secondary site (we’ll call it DC 2). This breaks the traditional network paradigm, which is a 1:1 mapping of IP range to location, since now DC 1 and DC 2 both share some IP subnets.

Of course, we’ll need to be sure that the rest of the corporate network (or the Internet) knows how to get to the stretched subnet through that alternate connection at DC 2. This is especially the case if we want to provide link and data center resiliency, so that the rest of the world can access the workloads residing within the stretched subnet when there are component or site failures.

To do that, we’ll need to make some changes so the network at DC 2 knows and actively advertises that it is also providing routing for the stretched network with IP subnet 10.1.1.0/24, which originated from DC 1. Within the corporate network, this is usually easily done by inserting routes via interior routing protocols like EIGRP or OSPF. If Internet connectivity is also desired, then the stretched route will also need to be inserted via exterior routing protocols like BGP. This is usually not easily done, so YMMV. In some cases, the upstream ISPs may not allow this, or may charge extra.

What about egress traffic from the VM workload? Since there will be multiple gateways within the stretched cluster, how does the VM know which gateway to send its traffic to? The answer is that it actually doesn’t. Using a First Hop Redundancy Protocol (FHRP) like Hot Standby Router Protocol (HSRP) and a technique called FHRP isolation, we’ll set things up so that the same gateway IP is presented at both data centers. So regardless of which side the VM is on, it’ll forward its outbound traffic to the common gateway IP, and the local gateway will take care of forwarding the packets outbound according to its local routing table. We won’t go into the gory details here, since that would take a whole post; if you’re interested, read up on FHRP isolation and how to configure it.

Stretched network design with routes
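To make the egress behaviour concrete, here is a minimal Python sketch of what FHRP isolation achieves from the VM's point of view. It is only a model, not a router configuration: the gateway address 10.1.1.1 and the names `Gateway` and `egress_gateway` are assumptions made up for illustration (the only value taken from this post is the stretched subnet 10.1.1.0/24).

```python
from dataclasses import dataclass

# Hypothetical gateway address inside the stretched subnet 10.1.1.0/24.
VIRTUAL_GATEWAY_IP = "10.1.1.1"


@dataclass(frozen=True)
class Gateway:
    site: str        # data center where this physical gateway lives
    virtual_ip: str  # gateway IP presented to the VMs


# With FHRP isolation, each site has a local gateway answering for the
# same virtual IP, and the FHRP hellos are kept from crossing the DCI.
GATEWAYS = {
    "DC1": Gateway("DC1", VIRTUAL_GATEWAY_IP),
    "DC2": Gateway("DC2", VIRTUAL_GATEWAY_IP),
}


def egress_gateway(vm_site: str) -> Gateway:
    """Return the gateway that actually forwards a VM's outbound traffic.

    The VM always targets the same IP; which physical gateway answers
    depends only on the site the VM is currently running in.
    """
    return GATEWAYS[vm_site]


if __name__ == "__main__":
    for site in ("DC1", "DC2"):
        gw = egress_gateway(site)
        print(f"VM at {site} sends to {gw.virtual_ip}, answered by the gateway in {gw.site}")
```

The point of the sketch is that the VM's configuration never changes when it moves; only the physical device answering for the gateway IP does.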

Now, the problem with providing multiple routes is that the inbound traversal path then depends on the location of the user relative to that of the workload. Without any route optimizations, network traffic will most likely take the shortest (more accurately, the least costly) route. The return path, on the other hand, depends on where the VM resides at that point in time.

This could very well result in non-deterministic and asymmetric traffic paths, as shown in the diagram below. Ingress/egress traffic for User 1 (in green) goes only through network links at DC 1, and so does not cause unnecessary traffic across the DCI link. At the same time, any service appliances (firewalls, load-balancers) in the path between User 1 and VM1 see a consistent network connection, and can thus operate without problems.

If you look at User 2 (red traffic path), the situation changes. Because the network determines that it is least costly (and therefore preferred) to route traffic to 10.1.1.0/24 via DC 2, traffic has to traverse the DCI link before reaching VM1 at DC 1. VM1, being in DC 1, would send return traffic out through its local gateway, which would then reach User 2 via a different path. Here, not only is the DCI link not being used optimally, but the asymmetric ingress/egress paths could also confuse some load-balancers and firewalls, since the appliances at DC 1 would not see the inbound traffic, and the appliances at DC 2 would not see the outbound traffic. It is possible for connections to end up being dropped because of this inconsistency.

Non-deterministic traffic paths
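The User 1 and User 2 cases can be modelled in a few lines of Python. This is a sketch under stated assumptions: the route costs below are invented numbers chosen so that User 1 is "closer" to DC 1 and User 2 is "closer" to DC 2, and the function names are hypothetical. Ingress is decided by least cost from the user's side, while egress is decided by wherever the VM happens to be.

```python
# Hypothetical route costs from each user to the two sites that
# advertise the stretched subnet 10.1.1.0/24.
ROUTE_COST = {
    "User 1": {"DC1": 10, "DC2": 30},
    "User 2": {"DC1": 30, "DC2": 10},
}


def ingress_site(user: str) -> str:
    """Inbound traffic enters at the site with the least costly route."""
    costs = ROUTE_COST[user]
    return min(costs, key=costs.get)


def egress_site(vm_site: str) -> str:
    """Outbound traffic leaves through the gateway local to the VM."""
    return vm_site


def describe_flow(user: str, vm_site: str) -> dict:
    """Summarise one user-to-VM flow."""
    inbound = ingress_site(user)
    outbound = egress_site(vm_site)
    return {
        "ingress": inbound,
        "egress": outbound,
        # Inbound packets must cross the DCI if they enter at the other site.
        "crosses_dci": inbound != vm_site,
        # The two directions use different sites, so appliances see half a flow.
        "asymmetric": inbound != outbound,
    }


if __name__ == "__main__":
    for user in ("User 1", "User 2"):
        print(user, describe_flow(user, vm_site="DC1"))
```

With VM1 at DC 1, the model reports a symmetric flow for User 1 and an asymmetric, DCI-crossing flow for User 2, which matches the two paths described above.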

And so, the alternative is to manually tune the routes by creating a preferred route with a lower cost, so that an optimized path is ALWAYS taken through DC 1. The higher-cost, non-preferred route would still be advertised by DC 2, but would be used only if the network link to DC 1 goes down. With that, you would, fingers crossed, get optimized network routing to the workloads with all network service state kept intact…

Deterministic traffic paths after route tuning
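As a rough illustration of the tuning logic, the Python sketch below picks the ingress site from a set of advertised costs and falls back to the non-preferred route only when the preferred link is down. The cost values and the names `TUNED_COSTS` and `preferred_ingress` are assumptions for illustration; real routing protocols express preference through their own metrics and attributes.

```python
# Hypothetical tuned costs: DC 1 is made the preferred (lower-cost) path,
# while DC 2 still advertises the subnet at a higher cost.
TUNED_COSTS = {"DC1": 10, "DC2": 100}


def preferred_ingress(costs: dict, link_up: dict) -> str:
    """Pick the lowest-cost site whose upstream link is still up."""
    reachable = {
        site: cost for site, cost in costs.items() if link_up.get(site, False)
    }
    if not reachable:
        raise RuntimeError("no reachable site advertises the stretched subnet")
    return min(reachable, key=reachable.get)


if __name__ == "__main__":
    # Normal operation: every user enters through DC 1.
    print(preferred_ingress(TUNED_COSTS, {"DC1": True, "DC2": True}))
    # Link to DC 1 fails: the higher-cost route via DC 2 takes over.
    print(preferred_ingress(TUNED_COSTS, {"DC1": False, "DC2": True}))
```

Note that the choice here depends only on cost and link state. Nothing in it knows where the VM is running.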

… At least, of course, until the VM migrates to DC 2, which results in fully asymmetric ingress/egress paths. Oops…

Fully asymmetric paths after route tuning, and VM migration

With fully asymmetric traffic paths, we end up with the least desirable state. Firstly, significant additional load could be placed on the DCI, reducing available bandwidth. Additionally, the asymmetric flow means that network services such as firewalls/load-balancers at both sides may only see half of the network connection states, and could drop these connections as being invalid.
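The "half a connection" problem can be sketched with a toy stateful firewall in Python. This is a deliberately simplified model built on two assumptions that may not hold for your appliances: the firewalls at the two sites do not synchronise session state with each other, and a reply for which no session was recorded is dropped. The class and function names are hypothetical.

```python
class StatefulFirewall:
    """Toy firewall that only allows replies for sessions it saw being opened."""

    def __init__(self, site: str) -> None:
        self.site = site
        self.sessions = set()

    def inbound(self, flow: tuple) -> bool:
        """A client request enters through this site; record the session."""
        self.sessions.add(flow)
        return True

    def outbound(self, flow: tuple) -> bool:
        """A server reply leaves through this site; allow it only if known."""
        return flow in self.sessions


def reply_allowed(ingress_site: str, vm_site: str) -> bool:
    """Check whether the VM's reply survives, given where the request entered."""
    firewalls = {site: StatefulFirewall(site) for site in ("DC1", "DC2")}
    flow = ("User 1", "VM1")
    firewalls[ingress_site].inbound(flow)
    # Egress always uses the gateway (and firewall) local to the VM.
    return firewalls[vm_site].outbound(flow)


if __name__ == "__main__":
    # Tuned routes force ingress via DC 1 in both cases.
    print("VM at DC 1:", reply_allowed(ingress_site="DC1", vm_site="DC1"))
    print("VM at DC 2:", reply_allowed(ingress_site="DC1", vm_site="DC2"))
```

In this model the reply passes while the VM sits at DC 1, and fails once the VM has migrated to DC 2, because the DC 2 firewall never saw the session being opened.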

In summary, simply by creating a stretched network for vMSC, the following challenges become apparent once routing comes into play:

  • Asymmetric traffic flow across DC sites
  • Lack of VM site-awareness for optimized routing
  • Inability of network service appliances to handle asymmetric traffic flow
  • Inefficient use of the DCI

So this post has proven to be a little longer than expected. In the interest of readability, I’ll continue discussing some measures which can be used to address the challenges in Part 3.
