BGP Multipath

BGP multipath allows multiple BGP routes to be used simultaneously to reach the same destination. Understanding how BGP multipath works and the tradeoffs involved in its implementation can help you improve your network engineering skill set. 

This article will review the principles, benefits, and potential drawbacks of BGP multipath with a focus on equal-cost multipathing (ECMP), which uses multiple identically-performing paths. We will also discuss the more advanced unequal-cost multipathing (UCMP) and its caveats.

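Before we dive in, note that in Junos, which we use throughout, a static route or an IGP route (OSPF, IS-IS) is preferred over a BGP route by default. Other platforms order protocols differently (Cisco IOS, for example, prefers eBGP routes over OSPF by default), so check your vendor’s defaults. Whatever the platform, there is a default hierarchy that you generally should not modify. We’ll focus only on BGP here. However, if you’re not familiar with protocol hierarchies, see Juniper’s Understanding Route Preference Values article.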

Throughout this article, we will use Juniper configuration and operational show command outputs. The syntax for other vendors is very similar. See the Cisco and Nokia documentation for more information on their syntax. 

Vanilla BGP: non-multipath BGP

Before we dive into the specifics of BGP multipath, let’s review how traditional BGP works. 

The BGP routing protocol receives and advertises routes to destinations (prefixes) on the network, e.g., a route for the destination 10.10.10.0/24. When a router receives two or more BGP routes to the same destination, the best path selection process chooses the single best route.

If a router receives routes A and B, pointing to 10.10.10.0/24 as the destination, it must choose the best between them. The router evaluates multiple attributes carried in the route advertisement, such as AS path length and BGP metric. The best route is found and installed as the active route in the routing table (Routing Information Base, RIB). 

Next, the active route is added to the Forwarding Information Base (FIB) and programmed in the hardware ASICs. On a chassis-based router, each line card is programmed with this active route.
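
The selection can be pictured as an ordered tie-break. The Python sketch below is a simplified illustration only: the BgpRoute class, the select_best function, and all attribute values are hypothetical, and real implementations evaluate more steps (origin, eBGP versus iBGP, IGP cost to the next hop, router ID) in a vendor-specific order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BgpRoute:
    """Hypothetical, heavily simplified view of a received BGP route."""
    name: str
    local_pref: int
    as_path: tuple
    med: int
    age_seconds: int


def preference_key(route):
    # Smaller tuples win: higher local-pref, then shorter AS path,
    # then lower MED, then the older (longer-established) route.
    return (-route.local_pref, len(route.as_path), route.med, -route.age_seconds)


def select_best(routes):
    """Return the single route that would be installed as active."""
    return min(routes, key=preference_key)


# Two illustrative routes to 10.10.10.0/24 (all values are assumptions).
route_a = BgpRoute("A", local_pref=100, as_path=(65000,), med=0, age_seconds=5000)
route_b = BgpRoute("B", local_pref=100, as_path=(65001, 65000), med=0, age_seconds=9000)

best = select_best([route_a, route_b])
print(best.name)  # A: same local-pref as B, but a shorter AS path
```

Only the winner reaches the FIB in this model, which is exactly the limitation multipath removes.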

Benefits of BGP multipath

The chief benefits of BGP multipath compared to vanilla BGP are:

  • The ability to load-balance traffic across multiple links. 
  • Reduced impact in the event of a BGP session or link failure. 

With a single active path, a BGP session or link failure means BGP has to reconverge before traffic can be forwarded again. With multiple paths already installed, forwarding continues over the surviving paths while the failed one is removed.

When a failure occurs while multiple paths are in active use, the router need only remove the failed forwarding next hop, rather than wait for the RIB best path selection, FIB programming, and ASIC programming process to occur. Instead of all traffic to that destination being affected, only the traffic on the failed path is impacted. 

In the case of two active multipath links in use, approximately half of the traffic is affected. With four links, approximately one quarter is affected, and so on.
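
As a rough illustration of that arithmetic, the sketch below computes the share of traffic affected, assuming flows are spread evenly across the active paths. The function name affected_share is hypothetical.

```python
from fractions import Fraction


def affected_share(active_paths, failed_paths=1):
    """Fraction of traffic hit when failed_paths of active_paths go down.

    Assumes traffic is spread evenly across all active paths.
    """
    if active_paths < 1 or not 0 <= failed_paths <= active_paths:
        raise ValueError("need active_paths >= 1 and 0 <= failed_paths <= active_paths")
    return Fraction(failed_paths, active_paths)


for paths in (1, 2, 4):
    print(paths, "paths:", affected_share(paths), "of traffic affected by one failure")
```

In practice the split is only approximately even, because it depends on how individual flows hash onto the links.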

Considerations when implementing BGP multipath

The BGP protocol is agnostic to link capacity and load. In a well-designed and maintained network, there is sufficient link capacity on backup paths to ensure continuity of service without congestion or loss during a failure of the primary BGP session or link.

When implementing BGP multipath, be aware of the impact of traffic shifts. Consider your current links’ capacity and utilization patterns and how implementing BGP multipath will affect the state of your network.

Equal-cost multipathing (ECMP)

ECMP uses multiple routes whose paths have very similar or identical characteristics in terms of link capacity and latency, and identical BGP attributes such as AS path and BGP metric.

Let’s look at a basic use case to illustrate the concept. Below, an enterprise router connects to two ISP routers via two independent links and BGP sessions. Instead of configuring the links in a primary/backup failover configuration, the goal is to load-balance traffic across them. 

A dual-homed single router (source)

The ISP advertises a default route to us via each BGP session and both routes are accepted and installed in our RIB. Without enabling BGP multipath, our RIB looks like this:

user@router> show route 0.0.0.0/0 exact
...
0.0.0.0/0         *[BGP/170] 1w5d 18:04:33, MED 0, localpref 100
                     AS path: 65000 I, validation-state: unknown
                     >  to 172.16.1.1 via xe-0/0/1.0
                   [BGP/170] 0w6d 22:42:11, MED 0, localpref 100
                      AS path: 65000 I, validation-state: unknown
                     >  to 172.16.2.1 via xe-0/0/2.0

We see two routes to 0.0.0.0/0. The route learned via the BGP session over xe-0/0/1.0 with next hop 172.16.1.1 is the preferred, active path, marked with an asterisk. Because all other attributes are equal, the tie is broken in favor of the oldest path, i.e., the one that has been up and stable the longest. 

Immediately you can see all outbound traffic will use this path. The other path is available for failover but is currently unused.

You can confirm this by viewing the forwarding table. Here, we see a single next hop MAC address and a single next-hop interface. Outgoing packets towards 0.0.0.0/0 use this interface and destination MAC address.

user@router> show route forwarding-table destination 0.0.0.0/0
Routing table: default.inet
Internet:
Enabled protocols: Bridging,
Destination        Type RtRef Next hop           Type Index    NhRef Netif
default            user     5 172.16.1.1
                                                 ucst    16975    11 xe-0/0/1.0
                                                 

In contrast, if we enable BGP multipath, then both routes can be used. In the next section, we’ll configure and verify it.

BGP multipath configuration

To configure BGP peers to use multipath we must take two steps.

  1. Enable multipath on the relevant BGP sessions.
    Let’s add the multipath knob to our ISP peerings:
set protocols bgp group MYISP neighbor 172.16.1.1 description "MYISP Link#1 Circuit ID: 11111"
set protocols bgp group MYISP neighbor 172.16.1.1 import IMPORT_MYISP
set protocols bgp group MYISP neighbor 172.16.1.1 export EXPORT_MYISP
set protocols bgp group MYISP neighbor 172.16.1.1 peer-as 65000
set protocols bgp group MYISP neighbor 172.16.1.1 multipath

set protocols bgp group MYISP neighbor 172.16.2.1 description "MYISP Link#2 Circuit ID: 22222"
set protocols bgp group MYISP neighbor 172.16.2.1 import IMPORT_MYISP
set protocols bgp group MYISP neighbor 172.16.2.1 export EXPORT_MYISP
set protocols bgp group MYISP neighbor 172.16.2.1 peer-as 65000
set protocols bgp group MYISP neighbor 172.16.2.1 multipath

To verify multipath is working, you can view the BGP neighborship status. Before adding multipath:

user@router> show bgp neighbor 172.16.1.1 | match options
  Options: < Preference AuthKey LogUpDown AddressFamily PeerAS Rib-group Refresh>

After adding multipath:

user@router> show bgp neighbor 172.16.1.1 | match options
  Options: < Preference AuthKey LogUpDown AddressFamily PeerAS Multipath Rib-group Refresh>
  2. Enable load-balancing in our forwarding table.
    We will selectively enable it for our ISP’s AS number only. It is best to introduce new features selectively and specifically to limit their scope and prevent undesired effects on other routes, as we may not wish to use ECMP globally. This most-specific style of configuration makes troubleshooting easier. 
#
# define as-path for MYISP, matching all routes received
#
set policy-options as-path MYISP "65000 .*"

#
# define policy-statement matching this AS path
# then load-balancing across routes which match
#
set policy-options policy-statement ECMP-FIB from as-path MYISP
set policy-options policy-statement ECMP-FIB then load-balance per-packet



#
# apply this policy to the forwarding table (FIB)
#
set routing-options forwarding-table export ECMP-FIB

With this simple policy we are matching routes whose AS path begins with 65000, i.e., routes learned from our ISP, and enabling per-flow load-balancing across the available equal-cost routes. 

The mention of per-flow here is not a typographical error. The Junos syntax ‘load-balance per-packet’ is a misnomer retained for legacy compatibility reasons. For more information on this ‘per-packet’ syntax, please see Juniper’s Configuring Per-Packet Load article.

Almost all modern routers will load-balance per flow, ensuring that flows (which are similar to conversations) from one IP to another IP will always traverse the same path. Routers consistently bind flows to links to avoid the side effects of round-robin or random-spray per-packet load-balancing. Spraying packets would split each flow across multiple links, and packets would arrive out of order at the destination IP, giving rise to TCP performance issues.

TCP reassembles out-of-order segments using sequence numbers, but reordering negatively impacts performance and throughput. The receiver answers out-of-order segments with duplicate acknowledgments, which the sender can interpret as packet loss. The effect is therefore similar to real packet loss: the sender retransmits unnecessarily and reduces its congestion window, which slows down the transfer rate. 

Best practice is to ensure that multipath/ECMP is configured with per-flow load-balancing.
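
To make the idea of per-flow hashing concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not how any particular router works: the function name pick_next_hop is hypothetical, SHA-256 stands in for the vendor-specific hardware hash, and the flow fields are example values.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Map a flow to one next hop, so every packet of the flow takes the same link.

    flow is a tuple (src_ip, dst_ip, protocol, src_port, dst_port).
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Use the first four bytes of the digest as an integer hash value.
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


next_hops = ["172.16.1.1", "172.16.2.1"]
flow = ("10.0.0.5", "198.51.100.7", 6, 40000, 443)

# Repeated lookups for the same flow always return the same next hop.
print(pick_next_hop(flow, next_hops))
print(pick_next_hop(flow, next_hops))
```

Because the choice depends only on the flow's header fields, packets within a flow are never reordered by the load-balancer, while different flows are spread across the links.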

Junos offers some alternative load-balancing methods, beyond the scope of this article. Please review the Junos documentation to understand the different behaviors, as may be applicable if you have a specific use case:

  consistent-hash      Give a prefix consistent load-balancing
  destination-ip-only  Give a destination based ip load-balancing
  per-packet           Load balance on a per-packet basis
  random               Load balance using packet random spray
  source-ip-only       Give a source based ip load-balancing
  

BGP multipath verification

When we view this route in the RIB it now looks like:

0.0.0.0/0     *[BGP/170] 1w5d 18:04:33, localpref 100
                      AS path: 65000 I, validation-state: unknown
                       to 172.16.1.1 via xe-0/0/1.0
                    >  to 172.16.2.1 via xe-0/0/2.0
                    

Notice that while both paths are shown, only one path is showing as active, indicated by “>”; this is a quirk of Junos. To see what’s happening at the packet forwarding level, we must check our forwarding table again:

user@router> show route forwarding-table destination 0.0.0.0/0
Routing table: default.inet
Internet:
Enabled protocols: Bridging,
Destination        Type RtRef Next hop           Type Index    NhRef Netif
0.0.0.0/0          user     0                    ulst  1048662    32
                              172.16.1.1         ucst     1396     4 xe-0/0/1.0
                              172.16.2.1         ucst     1850     4 xe-0/0/2.0
                              

Here we can see that both next hops and interfaces are listed against this destination 0.0.0.0/0. Traffic will now be load-balanced across these links according to the ‘load-balance per-packet’ policy, which, as explained above, balances per flow. 

Congratulations! You’ve now configured and verified per-flow ECMP load-balancing with BGP routes. That’s all there is to enabling this on Junos devices. You can now benefit from load-balancing and faster failover.

Link aggregation as an alternative or complement to BGP multipath

Link aggregation is a feasible alternative or complementary option to multipathing in some circumstances. Combining multiple similar or identical links into a single logical link provides physical redundancy and increased capacity while managing only a single logical link and routing protocol. 

Such aggregated links are commonly known as Link Aggregation Groups (LAGs) and Aggregated Ethernet interfaces (AggEth or ae interfaces in Junos). Expanding on our previous example, we now have connectivity to two different ISPs. We also have two physical links to each ISP. 

We can aggregate these links to simplify configuration and management instead of establishing four BGP sessions (one session per link).

Dual-multihomed single router (Source)

On our side, we can create a single logical ae interface per ISP, each ae interface having two physical member links. The same configuration is used on the ISP side. Each ae link behaves like any standard link, having a single IP and MAC address assigned. We also enable an additional protocol called Link Aggregation Control Protocol (LACP) which manages the state and membership of physical links inside the logical one.

Configuration

#
# first enable the creation of aggregate ethernet interfaces
#
set chassis aggregated-devices ethernet device-count 8

#
# then create our new logical interfaces, MYISP1 first
#
set interfaces ae1 description "MYISP1 Circuit ID: 11111"
set interfaces ae1 aggregated-ether-options lacp active
set interfaces ae1 aggregated-ether-options lacp periodic fast
set interfaces ae1 unit 0 family inet address 172.16.1.1/30

#
# then configure physical interfaces, adding to the ae
#
set interfaces xe-0/0/0 description "MYISP1 Link #1"
set interfaces xe-0/0/0 gigether-options 802.3ad ae1

set interfaces xe-0/0/1 description "MYISP1 Link #2"
set interfaces xe-0/0/1 gigether-options 802.3ad ae1

#
# repeat for MYISP2
#
set interfaces ae2 description "MYISP2 Circuit ID: 22222"
set interfaces ae2 aggregated-ether-options lacp active
set interfaces ae2 aggregated-ether-options lacp periodic fast
set interfaces ae2 unit 0 family inet address 172.16.2.1/30

set interfaces xe-0/0/2 description "MYISP2 Link #1"
set interfaces xe-0/0/2 gigether-options 802.3ad ae2

set interfaces xe-0/0/3 description "MYISP2 Link #2"
set interfaces xe-0/0/3 gigether-options 802.3ad ae2

Verification

Confirm the LACP adjacency has come up on each new ae interface:

user@router> show lacp interfaces ae1
Aggregated interface: ae1
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      xe-0/0/0       Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      xe-0/0/0     Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
      xe-0/0/1       Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      xe-0/0/1     Partner    No    No   Yes  Yes  Yes   Yes     Fast    Active
    LACP protocol:        Receive State  Transmit State          Mux State
      xe-0/0/0                  Current   Fast periodic Collecting distributing
      xe-0/0/1                  Current   Fast periodic Collecting distributing
      

With this configuration we have avoided the need to run one BGP session per link, while still retaining link redundancy and making use of the additional capacity. 

Suppose in the future our demand for capacity grows. We can add additional links without altering our routing with these commands:

set interfaces xe-0/0/4 description "MYISP2 Link #3"
set interfaces xe-0/0/4 gigether-options 802.3ad ae2

While BGP multipath pertains to load-balancing at the IP packet layer (Layer 3), Junos has an adaptive load-balancing feature for the continual rebalancing of frames (Layer 2) traversing a LAG. It is recommended to enable this feature if links will run at high utilization or where large flows traverse the links (e.g., nightly backups). 

Combining LAGs and ECMP

It’s possible to combine LACP LAGs and ECMP. Suppose we have configured one LAG per ISP and receive default routes (0.0.0.0/0) from each one. 

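With a small modification to the previous load-balancing configuration, you can balance traffic across both ISPs. Note that Junos multipath by default only considers paths learned from the same neighboring AS; if the two ISPs have different AS numbers, the multipath statement also needs the multiple-as option.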

In this scenario, we have link redundancy with each ISP as well as redundancy between ISPs. Additionally, we have fast failover with less impact on active traffic as only approximately half is affected if one ISP fails.

Configuration

#
# define policy-statement matching each ISP’s route
# then load-balancing across them
#
set policy-options policy-statement ECMP-FIB from next-hop 172.16.1.2
set policy-options policy-statement ECMP-FIB from next-hop 172.16.2.2
set policy-options policy-statement ECMP-FIB then load-balance per-packet

Verification

We can check the RIB and FIB to confirm that both 0.0.0.0/0 routes are active. Observe two paths in the RIB; though only one is marked active, you will see two active next hops in the FIB output.

show route 0.0.0.0/0 exact
show route forwarding-table destination 0.0.0.0/0

Best practices

You might have observed one drawback of LAGs and one advantage of BGP multipath. Consider what happens when we introduce a failure inside a LAG, reducing its capacity from 20 Gbps to 10 Gbps. Here is our network in a steady state with all links up.

ISP     Capacity            Utilization
ISP1    20 Gbps (2x10G)     12.2 Gbps
ISP2    20 Gbps (2x10G)     13.5 Gbps

Now, what happens when Link#2 inside ae2 facing ISP2 fails, perhaps due to a fiber or pluggable optic fault at the ISP end? Optical loss of signal or LACP will detect the link failure and remove the link from the ae bundle, reducing capacity to 10 Gbps. 

ISP     Capacity            Utilization
ISP1    20 Gbps (2x10G)     12.2 Gbps
ISP2    10 Gbps (1x10G)     ~9.9 Gbps

However, neither BGP nor our load-balancing configuration will react to this new condition. Traffic will still be shared approximately evenly between links ae1 and ae2, but the 13.5 Gbps traffic demand cannot fit on a 10 Gbps link; approximately 3.6 Gbps of packets must be dropped. TCP will detect this packet loss and compensate to a limited extent, but the user experience will suffer significantly.
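
The shortfall can be estimated with a few lines of Python. This is a back-of-the-envelope sketch with a hypothetical function name; it uses the nominal 10 Gbps line rate, so it yields 3.5 Gbps of excess, in line with the approximate figure above.

```python
def lag_overload(offered_gbps, member_gbps, members_up):
    """Return (capacity, excess) in Gbps for a LAG with members_up working links."""
    capacity = member_gbps * members_up
    # Traffic beyond the remaining capacity cannot be carried and is dropped.
    excess = max(0.0, offered_gbps - capacity)
    return capacity, excess


# ae2 towards ISP2: 13.5 Gbps offered, before and after losing one 10G member.
print(lag_overload(13.5, 10, 2))
print(lag_overload(13.5, 10, 1))
```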

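This example highlights two issues: how LAGs and upper-layer protocols behave when capacity degrades, and the challenges of capacity management. 

The table below shows utilization levels at which the network can still absorb a failure. Once traffic grows beyond these levels, the links should be upgraded. 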

ISP                            Capacity                   Utilization
ISP1                           20 Gbps (2x10G)            7.5 Gbps, one link can fail
ISP2                           20 Gbps (2x10G)            7.5 Gbps, one link can fail
ISP1 + ISP2                    40 Gbps (4x10G)            15.0 Gbps, one ISP can fail
Total failure of single ISP    20 Gbps (2x10G) remains    15 Gbps (5 Gbps headroom remains)
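
The planning rule behind this table can be expressed as a small check. The sketch below is illustrative only: the function name and parameters are hypothetical, it assumes identical LAGs, and it assumes that traffic does not rebalance between ISPs when a single member link fails.

```python
def survives_failures(isp_loads_gbps, members_per_lag=2, member_gbps=10):
    """Return (one_link_ok, one_isp_ok) for the given per-ISP traffic loads."""
    lag_capacity = members_per_lag * member_gbps
    # One member link fails: each LAG's own load must fit on its remaining members.
    one_link_ok = all(
        load <= (members_per_lag - 1) * member_gbps for load in isp_loads_gbps
    )
    # One ISP fails entirely: all traffic must fit on the remaining LAGs.
    one_isp_ok = sum(isp_loads_gbps) <= (len(isp_loads_gbps) - 1) * lag_capacity
    return one_link_ok, one_isp_ok


print(survives_failures([7.5, 7.5]))    # the planning levels from the table
print(survives_failures([12.2, 13.5]))  # the steady-state example from earlier
```

With the earlier utilization figures, neither failure scenario can be absorbed, which is why the link failure in the example led to packet loss.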

This brings us to the LAG configuration option minimum-links. Whether connected to an ISP, another internal router or datacenter switch, you can configure your LAGs to completely fail if they no longer have sufficient operable links. 

For instance, you may have a 2x10G LAG which you deem operable only if both links remain up, or a 4x10G LAG which you can comfortably tolerate running at half capacity. The configuration is straightforward:

set interfaces xe-3/1/1 gigether-options 802.3ad ae5
set interfaces xe-3/1/2 gigether-options 802.3ad ae5
set interfaces xe-3/1/3 gigether-options 802.3ad ae5
set interfaces xe-3/1/4 gigether-options 802.3ad ae5
set interfaces ae5 aggregated-ether-options minimum-links 2
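
The effect of minimum-links can be modeled as a simple threshold. The Python sketch below uses a hypothetical function name and assumes a default threshold of one link when the option is not configured.

```python
def lag_is_up(members_up, minimum_links=1):
    """A LAG stays up only while enough member links are operational."""
    return members_up >= minimum_links


# The 4x10G bundle ae5 with minimum-links 2, losing one member at a time.
for members_up in (4, 3, 2, 1):
    state = "up" if lag_is_up(members_up, minimum_links=2) else "down"
    print(f"{members_up} members operational -> ae5 {state}")
```

Once the bundle is declared down, the BGP session running over it also fails, so traffic moves to the remaining paths instead of being squeezed onto an undersized LAG.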

Unequal-cost multipathing (UCMP)

The question, “Can I load-balance traffic across links of unequal capacity?” brings us to the more advanced topic of UCMP. You can employ it, but it is generally discouraged for several reasons:

  • Link failures in a LAG can adversely affect a network. UCMP increases complexity in managing capacity as network planning on day one might not account for routing changes and traffic shifts over time.
  • The inequality between paths may not simply pertain to differences in capacity but differences in latency as well. While routers will try to consistently ‘pin’ flows to one path or the other, differences in latency may manifest in poor user experience as flows between the user and a given destination begin and end, taking a different path each time. This could result in users complaining of inconsistent network performance and prove very difficult to troubleshoot due to the ephemeral nature of the flow state.
  • The client-to-server path may be the lower-latency one, whereas the bulk of traffic may flow server-to-client on the higher-latency path. When you consider how webpages are rendered from multiple resources requiring multiple flows between multiple end-points, the inconsistent delivery and rendering may be noticeable.
  • It only partially addresses the inequality. While the BGP protocol has been extended to signal path bandwidth, it does not account for latency differences. UCMP is therefore only a partial mitigation, and its use increases complexity on the network, which creates risk.
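
For readers curious how traffic can be split unevenly while still pinning flows to paths, here is a minimal Python sketch of weighted flow hashing. It is an illustration under stated assumptions, not a vendor implementation: the function name is hypothetical, SHA-256 stands in for the hardware hash, and the weights are example bandwidths in Gbps.

```python
import hashlib


def pick_weighted_next_hop(flow, weighted_next_hops):
    """Pick a next hop for a flow in proportion to integer weights.

    weighted_next_hops is a list of (next_hop, weight) pairs,
    e.g. weights equal to link bandwidth in Gbps.
    """
    if any(not isinstance(w, int) or w <= 0 for _, w in weighted_next_hops):
        raise ValueError("weights must be positive integers")
    total = sum(weight for _, weight in weighted_next_hops)
    key = "|".join(str(field) for field in flow).encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % total
    # Walk the weight ranges until the bucket falls inside one of them.
    for next_hop, weight in weighted_next_hops:
        if bucket < weight:
            return next_hop
        bucket -= weight


paths = [("172.16.1.1", 20), ("172.16.2.1", 10)]  # a 20G path and a 10G path
flows = [("10.0.0.1", "198.51.100.7", 6, 1024 + i, 443) for i in range(3000)]
picks = [pick_weighted_next_hop(flow, paths) for flow in flows]
print(picks.count("172.16.1.1") / len(picks))  # close to two thirds
```

Note that the weights capture bandwidth only. Nothing in this scheme accounts for the latency differences described in the list above.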

For more insights on the potential problems of UCMP, see IP Space’s Does Unequal-Cost Multipathing Make Sense article. 

Conclusion: Understand BGP multipath use cases

In the right use cases, BGP multipath is an effective tool available across router operating systems. It is relatively simple to implement, and ECMP applications are usually straightforward. However, UCMP can rapidly become complex and challenging to maintain. 

In general, avoid UCMP when you can and always consider:

  • When and where to use BGP multipath. Multipath is useful when you have multiple BGP routes to the same destination and wish to actively load-balance across them. Multipath should not be used where a primary/backup failover arrangement is more suitable. For instance, in a dual-ISP scenario, ISP1 may be connected at higher capacity, or offer faster or more reliable service, and therefore be preferred as the primary connection.
  • When and where to use LAGs. LAGs are beneficial in simplifying BGP routing configuration, monitoring, and troubleshooting. Consider LAGs when connecting two internal devices or where multiple BGP sessions would otherwise be configured between the same pair of devices. In some cases, LAGs are possible but not ideal. One example is when a LAG terminates on a single device locally but connects to two devices on the remote end. At the remote end, terminating on two devices requires MC-LAG and ICCP, which increases complexity and poses a risk due to misconfiguration or software defects.
  • When and where to combine BGP multipath and LAGs. Good use cases for combining both techniques are found in data center fabrics, the dual ISP example, or in connecting your multihomed customers if you are operating a transit network. 

Whatever the case, always carefully consider (and lab test) the implications of any new feature you deploy on your network, especially in terms of capacity management, failover scenarios, and operational maintenance.