In this article, we will look at how BGP chooses the best route to use for forwarding traffic. As with other routing protocols, discriminating between paths and selecting the best ones helps ensure optimal network utilization and good performance. Other protocols, such as OSPF or the Spanning Tree protocols, optimize at the link level, favoring the lowest-cost path, which usually means the fastest links. In BGP, however, the emphasis is on policy-based path selection. BGP optimizes for shorter paths at the autonomous system level rather than the link level, and it selects the best path according to the relationship with the neighboring autonomous system.
BGP is neither a true link-state nor a distance-vector protocol; it is known as a path vector protocol. It was designed from the ground up to offer fine-grained administrative policy control over path selection. Baked into the BGP protocol and its implementations are sensible defaults ensuring reasonably optimal path selection even without a policy being applied. Most implementations will, by default, select a single best path to a given IP destination.
We will use Juniper Junos examples throughout this article to show you how the administrator can influence BGP path selection. We’ll cover some implementation details and typical use cases with policy examples. Note that we will focus on path selection between BGP routes installed in the routing table; BGP routes that are filtered out by policy or deemed invalid due to, for example, AS path loops are out of scope. We also won’t cover vendor-specific path selection features, such as Cisco’s “weight” attribute, which is not part of the BGP standard.
We will explain the main factors influencing BGP path selection, which are the following.
BGP Path Selection
Each router vendor implements its own BGP path selection defaults. Notably, Cisco includes a vendor-specific parameter called Weight, which is evaluated before any other attribute: the path with the highest weight is always chosen.
As mentioned, in this article, we will use Juniper Junos for our explanation and examples. Juniper documents its default selection process on this page, where you can read about the selection process and how various tiebreakers are evaluated.
It’s essential to note that routes are compared and selected at each step of the algorithm using sorting and tiebreaking. For example, let’s say we start with two eBGP routes to the same destination, and one has a higher BGP local preference than the other. At step two of the process, the route with the higher local preference will be chosen, and the process stops (for that destination). If, instead, these two routes had the same local preference, but one had a shorter AS path than the other, it would be chosen at step five of the process.
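The sequential tiebreaking described above can be sketched in a few lines of Python. This is a simplified illustration, not Juniper's implementation: it models only two of the attributes (local preference, then AS path length), and the `Route` class, the route names, and the AS numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str          # hypothetical label for the example
    local_pref: int    # higher is preferred
    as_path: tuple     # AS numbers traversed; shorter is preferred


def best_route(routes):
    # Tuples compare element by element, so AS path length is only
    # consulted when local preference ties. local_pref is negated
    # because min() picks the smallest key and higher local_pref wins.
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path)))


r1 = Route("via-peer-1", 110, (65001, 65002, 65003))
r2 = Route("via-peer-2", 100, (65004,))
r3 = Route("via-peer-3", 100, (65005, 65006))

# r1 wins on local preference even though its AS path is longer.
print(best_route([r1, r2]).name)
# Local preference ties, so the shorter AS path (r2) wins.
print(best_route([r2, r3]).name)
```

A real implementation evaluates more steps than these two, but the principle is the same: a later criterion only matters when every earlier one is tied.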
Quoting from Juniper’s article, and simplifying some details, the full algorithm looks essentially like this:
The algorithm in the previous section explains how the best route is selected from a set of routes with the same prefix length, but it’s crucial to note that the most specific prefix, i.e., the longest subnet mask, will always be chosen first. Let’s consider four example routes: route A for 192.168.0.0/16, routes B and C for 192.168.1.0/24, and route D for 192.168.1.128/25.
Routes A and D will be selected as active routes for their given prefix lengths, but a selection process will occur between routes B and C because they have the same prefix length.
Route A covers the range 192.168.0.0 through 192.168.255.255.
Route D covers the range 192.168.1.128 through 192.168.1.255.
Routes B and C cover the range 192.168.1.0 through 192.168.1.255.
Let’s consider the implications of the different prefix lengths when a packet forwarding decision is made:
- When a packet is forwarded to 192.168.1.120, route B or C will be chosen.
- When a packet is forwarded to 192.168.1.130, route D will be chosen, because it has a longer prefix.
- When a packet is forwarded to anything inside the range 192.168.0.0 - 192.168.255.255 that does not match the more specific /25 and /24 routes, route A will be chosen.
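The forwarding decisions above can be reproduced with Python's standard `ipaddress` module. This is a minimal sketch of longest-prefix matching, assuming the prefix lengths implied by the address ranges given for routes A through D; the `ROUTES` table and `lookup` function are illustrative names, and routes B and C share one entry because the choice between them is made by BGP path selection, not by the lookup.

```python
import ipaddress

# Prefixes derived from the address ranges described for each route.
ROUTES = {
    "A": ipaddress.ip_network("192.168.0.0/16"),
    "B/C": ipaddress.ip_network("192.168.1.0/24"),
    "D": ipaddress.ip_network("192.168.1.128/25"),
}


def lookup(destination):
    """Return the name of the most specific route covering destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net.prefixlen, name) for name, net in ROUTES.items() if addr in net]
    if not matches:
        return None  # no route covers this destination
    # The longest prefix length is the most specific match.
    return max(matches)[1]


for dest in ("192.168.1.120", "192.168.1.130", "192.168.200.1"):
    print(dest, "->", lookup(dest))
```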
Let’s move on to policy. As mentioned, BGP offers fine-grained policy controls for path selection. The specific ordering in the algorithm is intended to give sensible, and often near-optimal, results even without a specific policy being applied.
However, it is often the case that a given BGP peering should be preferred over another for reasons of bandwidth cost per Mbps, or latency, or relationship to that peer. BGP itself is not aware of peer relationships, path latency, link or path bandwidth, or the financial cost of that bandwidth. This is where you, as the administrator, come in. A very typical arrangement is to have a tiered local preference structure based on relationship, for example:
- Customer routes are most preferred, with a higher-than-default local preference, e.g., 110.
- Private or internet exchange routes are preferred next, with a default local preference, e.g., 100.
- Transit routes are the least preferred, with a lower-than-default preference, e.g., 90.
A tiered local preference structure like this is used because the best path is usually the most direct one. You wouldn’t route traffic out of your network to take a scenic route via the internet transit providers to reach your customer. Accordingly, you want to always prefer your direct links to customers to reach their prefixes, and direct or exchange links to peers to reach peers, and then use transit for everything else. On the financial side, customers pay you, peers generally exchange traffic settlement-free, and you pay your transit providers.
The main attributes worthy of consideration when creating policy and understanding how route selection will work in operation are the following:
- Local preference
- AS path length
- MED (metric)
Note: As described earlier, prefix length is not a BGP attribute but supersedes BGP attribute route selection criteria.
When crafting policy adjustments to local preference or MED, also consider the role AS path length plays and how route selection determines the way traffic routes and reroutes as links fail, routers go down for maintenance, and so on.
Typical Use Cases
Let’s get a little more granular with the local preference policy outlined previously. If we add some distinctions, such as offering customers primary and backup connections, we might set the primary to local preference 110 and the backup to local preference 105. This ensures that no matter the value of any other route attributes, primary connections remain primary and active if available, and backup connections remain as idle backups, only becoming active if no primary is available.
With peering connections, we might differentiate between directly connected peers, which cross-connect directly to our router in the data center, and indirectly connected peers, which meet at an internet exchange point such as a shared LAN switch.
If we had two transit providers and one was more expensive than the other but otherwise similarly performing, we might elect to use local preference 90 for the cheaper transit and local preference 85 for the more expensive one. Again, this would ensure that the cheaper transit is used when available, and the expensive transit is only used when the primary is down. Our local preference schema would then look like this:
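As a rough illustration, the local preference values mentioned in this section can be collected into a lookup table and used to show the failover behavior. The relationship labels and the `active_path` helper are hypothetical. The text gives no separate values for directly connected and exchange peers, so this sketch assumes a single peer tier at the default of 100.

```python
# Local preference per relationship, using the values from the text.
# The "peer" entry is an assumption: one tier at the default of 100.
LOCAL_PREF = {
    "customer-primary": 110,
    "customer-backup": 105,
    "peer": 100,
    "transit-cheaper": 90,
    "transit-expensive": 85,
}


def active_path(available):
    """Return the available relationship with the highest local preference."""
    if not available:
        return None
    return max(available, key=LOCAL_PREF.__getitem__)


# Both transits are up: the cheaper one carries the traffic.
print(active_path(["transit-expensive", "transit-cheaper"]))
# The cheaper transit fails: the expensive one takes over.
print(active_path(["transit-expensive"]))
```

Because local preference is compared before AS path length and MED, the ordering in this table holds regardless of those attributes.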
RPKI ROA Considerations
Harking back to the introduction, which explains that filtered or otherwise invalid routes are out of scope for route selection, it’s important to note that in a global network of BGP sessions, RPKI ROA policies play an important part in route selection.
This article primarily outlines how you, as an administrator, can influence and manage route selection in your network. However, be aware that networks implementing ROA validation policies use RPKI ROA attributes to determine whether to accept a route advertisement or reject it, and if accepting it, whether to use the standard or a lowered local preference value.
A quick summary is that every holder of IP address space on the internet has the option to create ROA records detailing which origin AS numbers and prefix lengths are valid for its IP space. Networks evaluating ROAs as part of their BGP import policies will check to see if a route they receive has RPKI ROA status valid, invalid, or unknown.
ROAs look like this (simplified): a prefix of 192.168.0.0/23, a maximum prefix length of /24, and an origin of AS65000.
This means any BGP route seen on the internet for the IP range 192.168.0.0 - 192.168.1.255 must originate from AS65000 and must be in the form of a route prefix 192.168.0.0/23, 192.168.0.0/24, or 192.168.1.0/24. Any other route referencing this IP range will be deemed invalid because it clashes with this ROA specification.
Therefore, a valid route is one that matches a ROA record, while an unknown route is one that has no ROA covering the IP space to which it pertains. Most networks accept valid and unknown routes because, at the time of writing, only one-quarter to one-third of the internet routing table is covered by ROAs. Valid routes are accepted, and so are routes with no ROA; the alternative would be to drop connectivity to most of the internet! RPKI ROA adoption is growing, but it will be some time before most internet prefixes are covered. Until then, the only sane policy is to accept unknowns.
As a network operator or engineer, it’s important for you to know that invalid routes are those that clash with a ROA. Many large transit providers now drop invalid routes; some will accept them, but with a lower local preference value. The RPKI ROA system is designed to catch misconfigurations in route announcements, with the key implication being that it helps prevent BGP route hijacking.
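The three validation states can be sketched in Python using the example ROA for 192.168.0.0/23 (maximum length /24, origin AS65000). This is a simplified model of origin validation, not a real RPKI validator: the `Roa` class and `validation_state` function are hypothetical names, and certificate handling is ignored entirely.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Roa:
    prefix: str      # covered IP space
    max_length: int  # longest prefix length allowed
    origin_as: int   # AS permitted to originate the route


# The example ROA described in the text.
ROAS = [Roa("192.168.0.0/23", 24, 65000)]


def validation_state(route_prefix, origin_as, roas=ROAS):
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # A ROA covers the route only if the route sits inside its prefix.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if route.prefixlen <= roa.max_length and origin_as == roa.origin_as:
            return "valid"
    # Covered by at least one ROA but matching none: the route clashes.
    return "invalid" if covered else "unknown"


print(validation_state("192.168.1.0/24", 65000))  # matches the ROA
print(validation_state("192.168.1.0/25", 65000))  # longer than max length
print(validation_state("192.168.0.0/24", 65001))  # wrong origin AS
print(validation_state("10.0.0.0/8", 65000))      # no covering ROA
```

An import policy would then act on the returned state, for example accepting valid and unknown routes and either rejecting invalid ones or lowering their local preference.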
The takeaway is that if you or your customers create ROAs for your prefixes, you absolutely must take great care to maintain accuracy. Otherwise, you may find yourself cut off from the internet as your BGP announcements are dropped!
You should now understand the basic considerations for default path selection, how you can change this behavior using policy, and why you might want to do so. We also touched on RPKI ROA considerations, including the benefits and risks of deploying them. Equipped with this knowledge, you are hopefully now empowered to begin traffic engineering your own BGP sessions for best effect.