This blog is the second installment of a 5-part blog series about the Border Gateway Protocol (BGP). You can download the full series in The Comprehensive Guide to BGP, or view individual installments below.
The Internet is a global infrastructure formed today by more than 65k Autonomous Systems (AS’s), most of them running local or country-level businesses. Given the Internet’s worldwide distribution, AS’s cannot all be directly connected to each other. Instead, they form a global peer-to-peer distributed network where each player knows how to directly reach only a very limited number of routes. Global reachability is achieved by exchanging these routes among players via the exterior Border Gateway Protocol (eBGP).
Two AS’s exchange routing information by establishing BGP session(s) between pairs of routers running a BGP daemon, namely BGP speakers. After establishing a BGP session, the BGP speakers start exchanging the set of network prefixes that they either received from other AS’s or that they already possessed. Each network destination exchanged is paired with a set of attributes that describes the characteristics of the path to reach that destination, forming what is called a route. Eventually, a BGP speaker will receive the routes related to all the Internet destinations. With these routes, the BGP speaker can forward traffic towards any intended destination.
The set of attributes associated with each route enables a BGP speaker to implement routing policies, which may reflect either commercial agreements it has with its neighbors or technical considerations. This flexibility is one of the key factors that allowed BGP to become the de facto standard routing protocol of the Internet.
BGP is quite unique in the family of routing protocols. Its most relevant peculiarity is that it relies on TCP (port 179) to guarantee the ordered and reliable exchange of protocol messages. Unlike other routing protocols, BGP has no peer discovery process: each peer is statically configured by the network administrator. BGP was conceived as an inter-AS protocol whose peers are expected to be quite stable, which makes a discovery process unnecessary. IP reachability between two BGP speakers is therefore a sine qua non for establishing a session.
The process of establishing a BGP session between peers is driven by a simple finite state machine (FSM), which is described in the original RFC and can be summarized in the figure below:
The process can be simplified as follows:
- Each BGP speaker starts in the Idle state. This is a transient state where the BGP speaker initializes the resources required for the connection, starts to listen for TCP connection attempts on port 179 and, at the same time, attempts to connect to the other BGP speaker via TCP on port 179.
- Once these steps are performed, the BGP speaker moves to the Connect state, where it sets a timer and waits for the TCP connection to be completed. If the ConnectRetryTimer expires, the pending TCP connection is dropped and the timer is restarted, while the speaker keeps listening for incoming TCP connection attempts. If the TCP connection fails, the BGP speaker moves to the Active state.
- The Active state is another transient state where the BGP speaker stops actively attempting to connect to the other party and just listens for incoming TCP connection attempts. Once the ConnectRetryTimer expires, the BGP speaker goes back to the Connect state.
- If the TCP connection is successful either in Connect or Active states, the BGP speaker must send an OPEN message containing a list of its capabilities, moving to the OpenSent state.
- Once in OpenSent state, a BGP speaker waits for an OPEN message from the other party, and if no error occurs, sends a KEEPALIVE message, moving then to OpenConfirm state. Otherwise, it sends a NOTIFICATION message and goes back to Idle state.
- Finally, in the OpenConfirm state the BGP speaker waits for a KEEPALIVE message or a NOTIFICATION message from the other party. If it receives a KEEPALIVE message, it moves to the Established state; otherwise, it moves back to Idle. In general, any error in any state causes the BGP speaker to move to the Idle state.
- Once in the Established state, each BGP speaker announces routes via UPDATE messages to allow the other party to reach the corresponding destinations. The number of destinations announced strongly depends on the agreement between the network operators.
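The steps above can be sketched as a small transition table. This is a minimal illustration rather than a complete implementation: the event names (`start`, `tcp_ok`, `tcp_fail`, and so on) are hypothetical labels chosen for this sketch, timers are reduced to a single event, and any event not listed is treated as an error that sends the speaker back to Idle.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CONNECT = auto()
    ACTIVE = auto()
    OPEN_SENT = auto()
    OPEN_CONFIRM = auto()
    ESTABLISHED = auto()


# Hypothetical event names, used only in this sketch.
TRANSITIONS = {
    (State.IDLE, "start"): State.CONNECT,
    (State.CONNECT, "tcp_ok"): State.OPEN_SENT,          # OPEN is sent here
    (State.CONNECT, "tcp_fail"): State.ACTIVE,
    (State.CONNECT, "connect_retry_expired"): State.CONNECT,
    (State.ACTIVE, "tcp_ok"): State.OPEN_SENT,           # OPEN is sent here
    (State.ACTIVE, "connect_retry_expired"): State.CONNECT,
    (State.OPEN_SENT, "open_received"): State.OPEN_CONFIRM,  # KEEPALIVE is sent
    (State.OPEN_CONFIRM, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "update_received"): State.ESTABLISHED,
}


def next_state(state, event):
    """Return the next state; unlisted (state, event) pairs fall back to Idle."""
    return TRANSITIONS.get((state, event), State.IDLE)


def run(events, state=State.IDLE):
    """Feed a sequence of events through the table and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state
```

For example, `run(["start", "tcp_ok", "open_received", "keepalive_received"])` walks the successful path from Idle to Established, while a NOTIFICATION received in OpenConfirm is not in the table and therefore leads back to Idle.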
The evolution of the FSM described above is regulated via the exchange of BGP messages. Each BGP message starts with a common BGP header of 19 bytes, encoded as follows:
The Marker is a 16-byte field with all bits set to one. This field is included for compatibility with older BGP versions and has no specific semantics in the current BGP version. The Length field is 2 bytes long and contains the length of the BGP message, header included. The original RFC specifies that the maximum length of a BGP message is 4096 bytes, even though the field could express larger values. This value was considered to be more than enough for the protocol requirements. Finally, the 1-byte Type field contains the type of the message. The four most common BGP message types are OPEN (1), UPDATE (2), NOTIFICATION (3), and KEEPALIVE (4).
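As a rough illustration of the header layout, the sketch below packs and parses the 19-byte header with Python's `struct` module. The function names `pack_header` and `parse_header` are hypothetical helpers introduced here, and the type codes are the ones assigned in RFC 4271.

```python
import struct

HEADER_LEN = 19           # 16-byte Marker + 2-byte Length + 1-byte Type
MAX_MESSAGE_LEN = 4096    # upper bound set by the original RFC
MARKER = b"\xff" * 16     # all bits set to one
MESSAGE_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION", 4: "KEEPALIVE"}


def pack_header(msg_type, payload_len=0):
    """Build a BGP header for a message carrying payload_len bytes."""
    length = HEADER_LEN + payload_len
    if length > MAX_MESSAGE_LEN:
        raise ValueError("BGP message exceeds 4096 bytes")
    # "!" selects network byte order, H is the 2-byte Length, B the 1-byte Type.
    return MARKER + struct.pack("!HB", length, msg_type)


def parse_header(data):
    """Return (total_length, type_name) for the header at the start of data."""
    if len(data) < HEADER_LEN:
        raise ValueError("truncated header")
    if data[:16] != MARKER:
        raise ValueError("invalid marker")
    length, msg_type = struct.unpack("!HB", data[16:HEADER_LEN])
    if not HEADER_LEN <= length <= MAX_MESSAGE_LEN:
        raise ValueError("invalid length")
    return length, MESSAGE_TYPES.get(msg_type, "UNKNOWN")
```

Since a KEEPALIVE has no payload, `pack_header(4)` already yields a complete 19-byte KEEPALIVE message.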
1. The most important messages in the initial setup phase of a BGP session are the Open messages. As detailed in the FSM, the two BGP speakers use these messages to tell each other which parameters they propose for the BGP session and which capabilities they support. Capabilities are Optional Parameters describing which BGP extensions each speaker supports, such as four-octet AS numbers, multiple protocols in BGP (e.g. IPv6), and multiple paths. If both BGP speakers advertise the same capability, they can use the corresponding features in the BGP session, such as announcing IPv6 routes to each other.
In addition to capabilities, Open messages carry the current Version of BGP (set to 4 since 1994) and some information about the sending peer, like the self-explanatory My Autonomous System field and the BGP Identifier field, which encodes one of the IPv4 addresses belonging to the announcing BGP speaker. Another important mandatory parameter found in these messages is the Hold Time, which regulates how long the BGP session can stay up without the exchange of any protocol message. This parameter is crucial to avoid resetting the session upon a temporary network failure. However, its value cannot be too high, otherwise it could take too long for a speaker to realize that the session is no longer available. The BGP specification suggests using 90 seconds for that value.
2. Notification messages are quite the opposite of Open messages. They are triggered whenever one of the BGP speakers runs into an error (for any reason). In these cases, the BGP speaker notifies the other party about the type of error it has experienced, just before tearing down the BGP session. The possibility was recently introduced to add some text to this message to show human-readable information in the log of the other BGP speaker.
3. Keepalive messages are empty BGP messages, composed of the header only and without any payload. These simple messages are used to acknowledge decisions (like in the setup phase) and to keep the session up when the two BGP speakers have no routing information to exchange.
4. Finally, there are the Update messages. These are the messages that carry the routing information that AS’s exchange with each other. An Update message can be thought of as composed of three main parts: Withdrawn Routes, Path Attributes and Network Layer Reachability Information (NLRI).
The Withdrawn Routes and NLRI fields are quite straightforward. They contain the subnets that are the subject of the routes carried by the UPDATE message. If a subnet is among the Withdrawn Routes, the BGP speaker no longer has a route for that specific subnet. If a subnet is among the NLRI, the BGP speaker has found a new route for that specific subnet, whose characteristics are described in the Path Attributes field. This can mean either that a new subnet has been announced in the Internet, or that an already existing subnet can now be reached via a different route.
The Path Attributes field contains a set of attributes which describe the path toward the destinations contained in the NLRI field. There are many different attributes, each with its own role and format. The original RFC mandates that the Path Attributes field must be present if the NLRI field contains at least one destination, but only a few Path Attributes are required to be present:
- AS_PATH: the list of AS’s that must be traversed to reach the destination via the given route
- ORIGIN: the origin of path information
- NEXT_HOP: the IP address of the router that should be used as next hop towards the given destination
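To make the three-part structure of an Update message more concrete, here is a minimal sketch that splits an UPDATE body (the bytes following the 19-byte header) into its parts. It assumes plain IPv4 prefixes, each encoded as a one-byte length in bits followed by just enough bytes to hold the prefix, and it leaves the Path Attributes undecoded. The helper names `decode_prefixes` and `parse_update_body` are hypothetical.

```python
import ipaddress
import struct


def decode_prefixes(data):
    """Decode a run of (length-in-bits, prefix bytes) entries into IPv4 networks."""
    prefixes = []
    offset = 0
    while offset < len(data):
        bits = data[offset]
        nbytes = (bits + 7) // 8                 # bytes needed to hold the prefix
        raw = data[offset + 1 : offset + 1 + nbytes]
        padded = raw + b"\x00" * (4 - nbytes)    # pad to a full IPv4 address
        # strict=False masks any trailing bits beyond the prefix length.
        prefixes.append(ipaddress.IPv4Network((padded, bits), strict=False))
        offset += 1 + nbytes
    return prefixes


def parse_update_body(body):
    """Split an UPDATE body into withdrawn routes, raw path attributes and NLRI."""
    (withdrawn_len,) = struct.unpack_from("!H", body, 0)
    withdrawn_end = 2 + withdrawn_len
    (attrs_len,) = struct.unpack_from("!H", body, withdrawn_end)
    attrs_start = withdrawn_end + 2
    attrs_end = attrs_start + attrs_len
    return {
        "withdrawn": decode_prefixes(body[2:withdrawn_end]),
        "path_attributes": body[attrs_start:attrs_end],  # left undecoded here
        "nlri": decode_prefixes(body[attrs_end:]),       # everything that remains
    }
```

Note that the NLRI field has no explicit length: it simply occupies whatever is left of the message after the Path Attributes.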
Once the session is established, the two BGP speakers send Update messages to each other to advertise their reachability to the other party. Since a BGP speaker could be connected to multiple BGP speakers, possibly belonging to different AS’s, and thus could receive multiple routes toward the very same destination, each BGP speaker must run a decision process to select the best route for each subnet received. This is called the BGP decision process, and it is exactly where BGP’s flexibility comes into play.
In more detail, when a BGP speaker receives a BGP Update message from a neighbor, it stores each advertised route in a table dedicated to that neighbor, called the Adj-RIB-In.
Once a route has been installed into the Adj-RIB-In, it is checked against a filter to decide whether it can be accepted or not. The ingress filter is completely customizable by the network administrator, who can decide to discard a route for a plethora of reasons. For example, a route could be discarded if the destination network was not expected to be received from that specific neighbor, or if it contains a specific path attribute value.
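The two example reasons above can be sketched as a small filter function. This is only an illustration of the idea: the route representation (a dict with `prefix` and `as_path` keys) and the `make_ingress_filter` helper are assumptions made for this sketch, not how any particular router expresses its policies, and it only handles IPv4 prefixes.

```python
import ipaddress


def make_ingress_filter(expected_prefixes, rejected_as=frozenset()):
    """Build an accept(route) predicate for one neighbor (hypothetical helper).

    expected_prefixes: networks this neighbor is expected to announce.
    rejected_as: AS numbers that must not appear in the AS_PATH.
    """
    expected = [ipaddress.ip_network(p) for p in expected_prefixes]

    def accept(route):
        prefix = ipaddress.ip_network(route["prefix"])
        # Discard destinations not expected from this neighbor.
        if not any(prefix.subnet_of(net) for net in expected):
            return False
        # Discard routes carrying an unwanted path attribute value.
        if rejected_as.intersection(route["as_path"]):
            return False
        return True

    return accept
```

A filter built with `make_ingress_filter(["192.0.2.0/24"], rejected_as={64666})` would accept `192.0.2.0/25` announced with AS_PATH `[64500, 64501]`, and discard both an unexpected `198.51.100.0/24` and any route whose AS_PATH contains AS 64666.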
If a route is accepted, it participates in the BGP decision process together with all the accepted routes toward the same destination learned from other neighbors. This process is conceptually composed of three phases, with some slight differences from router vendor to router vendor.
The first phase is triggered whenever an Update message has been accepted, and consists of calculating a degree of preference for each route advertised in the message.
In the second phase, which starts once the first is completed, the BGP speaker chooses the best route among all the routes available for each distinct destination in the message and installs each best route in the Local Routing Information Base (Loc-RIB). The Loc-RIB is the table containing all the routes that the BGP speaker is using to route the traffic received from its neighbors. The third phase is triggered once the Loc-RIB has been modified. In this phase, each route that caused a change in the Loc-RIB is checked against neighbor-specific output filters. If it passes, it is installed into the neighbor’s Adj-RIB-Out and becomes ready to be advertised. Like the ingress filters, the output filters are completely customizable by the network administrator, who could decide, for example, to prevent a neighbor from receiving a route towards a specific destination.
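The three phases can be sketched as follows. This is a deliberately simplified model under stated assumptions: the degree of preference is reduced to a policy-assigned `local_pref` value followed by AS_PATH length, whereas real implementations apply a longer, vendor-specific list of tie-breakers. The `Route` class and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    neighbor: str
    as_path: tuple
    local_pref: int = 100   # assumed policy-assigned preference value


def preference(route):
    """Phase 1 (simplified): higher local_pref wins, then shorter AS_PATH."""
    return (route.local_pref, -len(route.as_path))


def select_best(adj_ribs_in):
    """Phase 2: build the Loc-RIB with the most preferred route per prefix."""
    loc_rib = {}
    for routes in adj_ribs_in.values():
        for route in routes:
            best = loc_rib.get(route.prefix)
            if best is None or preference(route) > preference(best):
                loc_rib[route.prefix] = route
    return loc_rib


def disseminate(loc_rib, egress_filters):
    """Phase 3: apply each neighbor's output filter to build its Adj-RIB-Out."""
    return {
        neighbor: [route for route in loc_rib.values() if allow(route)]
        for neighbor, allow in egress_filters.items()
    }
```

Given two accepted routes for `203.0.113.0/24`, one from `peer_a` with a three-hop AS_PATH and one from `peer_b` with a two-hop AS_PATH, `select_best` installs the route from `peer_b` in the Loc-RIB. An output filter such as `lambda r: r.prefix != "203.0.113.0/24"` then keeps that route out of the corresponding neighbor's Adj-RIB-Out.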