Authors: Nikita Shirokov and Ranjeeth Dasineni
Billions of people worldwide use Facebook services, and our infrastructure engineers have built a series of systems to optimize traffic and ensure fast, reliable access for everyone. Today we took another step down the open source road: we released Katran, the forwarding-plane software library that underpins the network load balancer used in Facebook's infrastructure. Katran offers a software-based solution to load balancing, with a completely re-engineered forwarding plane that takes advantage of two recent innovations in kernel engineering: the eXpress Data Path (XDP) and the eBPF virtual machine. Katran is deployed today on a large number of backend servers at Facebook's points of presence (PoPs), and it has helped us improve the performance and scalability of network load balancing and reduce inefficiencies such as busy loops when there are no incoming packets. By sharing Katran with the open source community, we hope others can improve the performance of their load balancers as well and use Katran as a foundation for future work.
The challenge: Serving requests at Facebook scale
To manage traffic at Facebook's scale, we have deployed a globally distributed network of points of presence (PoPs) to act as proxies for our data centers. Given the extremely high volume of requests, both PoPs and data centers confront the same daunting challenge: making a large number of (backend) servers appear as a single virtual unit to the outside world, and distributing the workload efficiently among those backend servers.
Such challenges are typically addressed by advertising a virtual IP address (VIP) to the internet at each location. Packets destined for the VIP are then seamlessly distributed among the backend servers. The distribution algorithm, however, needs to account for the fact that the backend servers typically operate at the application layer and terminate the TCP connections themselves. This responsibility falls on a network load balancer (often called a layer 4 load balancer, or L4LB, because it operates on packets rather than serving application-level requests). Figure 1 illustrates the role of an L4LB relative to the other network components.
Figure 1: A network load balancer fronts several backend servers running a backend application and consistently sends all packets from each client connection to the same backend server.
Requirements for a high-performance load balancer
The performance of the L4LB is especially important for managing latency and scaling the number of backend servers, because the L4LB is on the path of every inbound packet. Performance is usually measured in the peak packets per second (pps) that the L4LB can handle. Traditionally, engineers have preferred hardware-based solutions for this task because they typically use accelerators, such as application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs), to reduce the burden on the main CPU. However, one of the drawbacks of a hardware approach is that it limits the system's flexibility. To meet Facebook's needs effectively, a network load balancer must meet the following requirements:
· Run on commodity Linux servers. This lets us run the load balancer on part or all of the large fleet of currently deployed servers. A software-based load balancer satisfies this criterion.
· Coexist with other services on a given server. This removes the need for dedicated servers that run only the load balancer, thereby increasing fault tolerance.
· Allow low-disruption maintenance. Facebook's software must be able to evolve quickly to support new or improved products and services. For the load balancer and the backend tier, maintenance and upgrades are the norm rather than the exception. Minimizing disruption during these tasks lets us iterate faster.
· Be easy to instrument and debug. All large distributed infrastructures must contend with anomalies and unexpected events, so reducing the time needed to debug and troubleshoot problems is essential. The load balancer needs to be easy to instrument and compatible with standard tools such as tcpdump.
To satisfy these requirements, we designed a high-performance software network load balancer. The first generation of our L4LB was based on the IPVS kernel module and served Facebook's needs for more than four years. However, it fell short of the goal of coexisting with other services, in particular the backends. In the second iteration, we leveraged the eXpress Data Path (XDP) framework and the extended BPF virtual machine (eBPF) to let the software load balancer run alongside backends on a large number of machines. Figure 2 shows the key differences between the two generations of L4LB.
Figure 2: The differences between the two generations of L4LB. Note that both are software load balancers running on backend servers. Katran (right) lets us colocate the load balancer with the backend application, increasing the load balancer's capacity.
First-generation L4LB: Built on OSS software
With our first-generation L4LB, we leaned heavily on existing open source components to implement most of the functionality. This approach helped us replace a hardware-based solution across a large deployment in only a few months. The design had four major components:
VIP announcement: This component announces to the outside world the virtual IP addresses that the L4LB is responsible for, by peering with the network element (usually a switch) in front of the L4LB. The switch then uses an equal-cost multipath (ECMP) mechanism to distribute packets among the L4LBs announcing the VIP. We used ExaBGP for the VIP announcement because of its lightweight, flexible design.
Backend server selection: To send all packets from a client to the same backend, the L4LBs use a consistent hash that depends on the 5-tuple of the inbound packet (source address, source port, destination address, destination port, and protocol). The consistent hash ensures that all packets belonging to a transport connection are sent to the same backend regardless of which L4LB receives the packet, without requiring any state synchronization across the L4LBs. Consistent hashing also guarantees minimal disruption to existing connections when a backend leaves or joins the backend pool.
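The two properties described above (coordination-free agreement among L4LBs and minimal disruption when the pool changes) can be illustrated with rendezvous (highest-random-weight) hashing, which shares them. This is a stand-in sketch, not the actual hash used by the first-generation L4LB:

```python
import hashlib

def rendezvous_pick(backends, five_tuple):
    """Pick a backend via rendezvous (highest-random-weight) hashing.

    Any L4LB computing this over the same backend list reaches the same
    decision with no shared state, and removing one backend remaps only
    the flows that hashed to it.
    """
    key = "|".join(map(str, five_tuple)).encode()

    def score(backend):
        h = hashlib.sha256(key + b"@" + backend.encode())
        return int.from_bytes(h.digest()[:8], "big")

    return max(backends, key=score)

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
flow = ("203.0.113.7", 40212, "198.51.100.1", 443, "tcp")
chosen = rendezvous_pick(backends, flow)
# Removing any *other* backend leaves this flow's mapping untouched,
# because the chosen backend still has the highest score for this key.
```

Note that the selection is a pure function of the backend list and the 5-tuple, which is exactly why no state synchronization between L4LBs is needed.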
Forwarding plane: Once the L4LB picks the appropriate backend, the packet needs to be forwarded to that host. To avoid restrictions such as keeping the L4LB and the backend hosts in the same L2 domain, we used a simple IP-in-IP encapsulation. This allowed us to place L4LB and backend hosts in different racks. We used the IPVS kernel module for the encapsulation. Every backend is configured to have the corresponding VIP on its loopback interface, which lets the backend send packets on the return path directly to the client (instead of back through the L4LB). This optimization, often called direct server return (DSR), leaves the L4LB's capacity constrained only by the number of inbound packets.
Control plane: This component performs a variety of functions, including health checks on the backend servers, providing a simple interface (via a configuration file) to add or remove VIPs, and providing simple APIs to examine the state of the L4LB and the backend servers. We developed this component in-house.
Each L4LB also stores the backend choice for each 5-tuple as a lookup table to avoid recomputing the hash on future packets. This state is purely an optimization and is not required for correctness. This design met several of the requirements for Facebook's workload listed above, but it had one major drawback: colocating the L4LB and a backend on a single host increased the chance of device failure. Even with the local state, the L4LB was a CPU-intensive component. To separate the failure domains, we ran the L4LBs and the backend servers on disjoint sets of machines. There were fewer L4LBs than backend servers in this scheme, which made the L4LBs more vulnerable to sudden surges in load. The fact that packets traversed the regular Linux networking stack before being handled by the L4LB aggravated the problem.
Figure 3: Overview of our first-generation L4LB. Note that the load balancer and the backend application run on different machines. The different load balancers make consistent decisions without any state synchronization. Packet encapsulation lets the servers running the load balancer and the servers running the backend application be placed in different racks. In a typical deployment, the ratio of L4LBs to backend application servers is very small.
Katran: Redesigning the forwarding plane
Katran, our second-generation L4LB, uses a completely re-engineered forwarding plane that improves significantly on the previous version. Two recent innovations from the kernel community made the new design possible:
XDP provides a fast, programmable network data path without resorting to a full kernel bypass, and it works in conjunction with the Linux networking stack. (A detailed introduction to XDP is available at https://www.iovisor.org/technology/xdp.)
The eBPF virtual machine runs user-space-supplied programs at specific points inside the kernel, providing a flexible, efficient, and more reliable way to interact with the Linux kernel and to extend its functionality. eBPF has already brought dramatic improvements to several areas, including tracing and filtering. (More details are available at https://www.iovisor.org/technology/ebpf.)
The overall architecture of the system is similar to that of the first-generation L4LB: first, ExaBGP announces to the outside world which VIPs a given Katran instance is responsible for; second, packets destined for a VIP are sent to a Katran instance via the ECMP mechanism; finally, Katran picks a backend and forwards the packet to the correct backend server. The main differences are in the last step.
Early and efficient packet handling: Katran uses XDP together with a BPF program to forward packets. When XDP is enabled in driver mode, the packet-handling routine (the BPF program) runs immediately after a packet is received by the network interface card (NIC) and before the kernel intercepts it. XDP invokes the BPF program on every incoming packet. If the NIC has multiple queues, the program is invoked in parallel for each of them. Because it uses per-CPU BPF maps, the packet-handling BPF program is lock-free, so performance scales linearly with the number of the NIC's RX queues. Katran also supports the "generic XDP" mode of operation (instead of driver mode), at a performance cost.
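Conceptually, the program inspects each packet and returns a verdict: packets addressed to a VIP are encapsulated and retransmitted out the same NIC (XDP_TX), while everything else is handed to the normal kernel stack (XDP_PASS). The following is a conceptual model in Python; Katran's real data path is a BPF program in C, and the function names here are illustrative:

```python
# XDP verdicts relevant to a forwarding plane (modeled as strings here;
# in a real BPF program they are integer return codes).
XDP_PASS, XDP_TX = "XDP_PASS", "XDP_TX"

def handle_packet(dst_ip, vips, pick_backend):
    """Return an XDP-style verdict for one inbound packet.

    `vips` is the set of VIPs this instance announces; `pick_backend`
    stands in for the consistent-hash backend selection.
    """
    if dst_ip not in vips:
        return XDP_PASS, None        # not for us: hand off to the kernel stack
    backend = pick_backend(dst_ip)   # choose a backend for this flow
    return XDP_TX, backend           # encapsulate and retransmit immediately
```

Because non-VIP traffic takes the XDP_PASS branch, any other application on the host keeps receiving its packets through the regular stack, which is what allows colocation without a kernel bypass.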
Inexpensive and more stable hashing: Katran uses an extended version of the Maglev hash to select a backend server. Features of the extended hash include fast recovery after a backend server failure, a more uniform distribution of load, and the ability to set unequal weights for different backend servers. This last point is an important feature that lets us easily handle hardware refreshes in our PoPs and data centers: we can absorb a newer generation of hardware simply by setting appropriate weights. Even though the code that computes this hash is more involved, it is small enough to fit entirely in the L1 cache.
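The Maglev idea is to precompute a lookup table in which each backend claims slots along its own pseudorandom permutation, so selection at packet time is a single array index. A simplified, weighted table builder is sketched below; the hash functions, and the way weights are folded in (extra turns per round), are illustrative rather than Katran's actual implementation:

```python
import hashlib

def _h(s, seed):
    # Illustrative hash; any well-distributed 64-bit hash works here.
    d = hashlib.md5(f"{seed}:{s}".encode()).digest()
    return int.from_bytes(d[:8], "big")

def maglev_table(backends, size=101):
    """Build a Maglev-style lookup table.

    `backends` maps backend name -> positive integer weight; a higher
    weight claims proportionally more slots. `size` should be prime so
    every skip value walks the whole table.
    """
    offsets = {b: _h(b, 0) % size for b in backends}
    skips = {b: _h(b, 1) % (size - 1) + 1 for b in backends}
    nxt = dict.fromkeys(backends, 0)
    table = [None] * size
    filled = 0
    while filled < size:
        for b, weight in backends.items():
            for _ in range(weight):          # weighted: more turns per round
                if filled == size:
                    return table
                # Walk b's permutation until a free slot is found.
                while True:
                    slot = (offsets[b] + nxt[b] * skips[b]) % size
                    nxt[b] += 1
                    if table[slot] is None:
                        table[slot] = b
                        filled += 1
                        break
    return table
```

For example, `maglev_table({"old_gen": 1, "new_gen": 2})` gives the newer hardware roughly twice as many slots, which is how a weight change absorbs a hardware refresh.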
More resilient local state: Katran's efficiency at handling packets and computing the hash leads to an interesting interaction with the local state table. We observed that, for the backend server selection component, computing the hash is often computationally cheaper than looking up the 5-tuple in the local state table. This is even more pronounced when a local state table lookup goes all the way to the shared last-level cache. To take advantage of this phenomenon in a natural way, we implemented the lookup table as an LRU-evicting cache. The LRU cache size is configurable at startup and acts as a tunable parameter that trades off computation against lookups; we chose the values empirically, optimizing for pps. In addition, Katran provides a runtime "compute only" switch that ignores the LRU cache entirely in the event of catastrophic memory pressure on the host.
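The shape of that table can be sketched as follows (an illustrative Python model; Katran's real table lives in a BPF map in the kernel, and the class and parameter names here are invented for the sketch). The key property is that the cache is never load-bearing: a miss, an eviction, or the "compute only" switch all fall back to recomputing the hash.

```python
from collections import OrderedDict

class FlowTable:
    """5-tuple -> backend lookup table with LRU eviction.

    Purely an optimization: on a miss, or when `compute_only` is set,
    the backend is recomputed from the consistent hash, so correctness
    never depends on the cache contents.
    """
    def __init__(self, pick_backend, capacity=4096):
        self.pick_backend = pick_backend   # the consistent-hash selection
        self.capacity = capacity           # tunable at startup
        self.cache = OrderedDict()
        self.compute_only = False          # runtime escape hatch

    def lookup(self, five_tuple):
        if self.compute_only:
            return self.pick_backend(five_tuple)
        if five_tuple in self.cache:
            self.cache.move_to_end(five_tuple)      # mark as recently used
            return self.cache[five_tuple]
        backend = self.pick_backend(five_tuple)
        self.cache[five_tuple] = backend
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)          # evict least recently used
        return backend
```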
RSS-friendly encapsulation: Receive side scaling (RSS) is an important optimization in NICs that aims to spread load evenly across CPUs by steering the packets of each flow to a separate CPU. Katran crafts its encapsulation to work in concert with RSS: rather than using the same outer source IP for every IP-in-IP packet, packets in different flows (e.g., with different 5-tuples) are encapsulated with different outer source IPs, while packets in the same flow are always assigned the same outer source IP.
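One way to get this behavior is to derive the outer source address from a hash of the inner 5-tuple, picked from some reserved range. A sketch under that assumption (the prefix below is a made-up example, and the derivation is illustrative, not Katran's exact scheme):

```python
import hashlib
import ipaddress

def outer_source_ip(five_tuple, prefix="10.100.0.0/24"):
    """Derive the outer (encapsulation) source IP from the flow's 5-tuple.

    Distinct flows hash to distinct outer sources (so the backend NIC's
    RSS spreads them across CPUs), while every packet of one flow gets
    the same outer source (so the flow stays on one CPU).
    """
    net = ipaddress.ip_network(prefix)
    key = "|".join(map(str, five_tuple)).encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return str(net[h % net.num_addresses])
```

This works because the NIC's RSS hash on the backend is computed over the outer header, which is all it can see before decapsulation.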
Figure 4: Katran provides a fast path for processing packets at high speed without resorting to a full kernel bypass. Note that packets cross the kernel/user-space boundary only once. This lets us colocate the L4LB with the backend application without sacrificing performance.
These features dramatically enhance the L4LB's performance, flexibility, and scalability. Katran's design also eliminates busy loops on the receive path, so it consumes almost no CPU when there are no incoming packets. In contrast to a full kernel bypass solution (such as DPDK), using XDP lets us run Katran alongside any application on the same host without any performance penalty. Katran now runs alongside the backend servers in our PoPs, with an improved ratio of L4LBs to backends. This has also increased our resilience to load surges, host failures, and maintenance. The re-engineered forwarding plane was central to this shift. We believe other systems can benefit from our forwarding plane as well, so we are open-sourcing the code along with examples of how to use it to craft an L4LB.
Katran achieves its performance improvements by operating under certain assumptions and constraints. In practice, we found the constraints to be quite reasonable, and they did not block our deployment. We believe that most users of our library will find them easy to satisfy. We list them below:
· Katran works only in direct server return (DSR) mode.
· Katran is the component that decides the final destination of a packet addressed to a VIP, so the network needs to route packets to Katran first. This requires the network topology to be L3-based; for example, packets are routed by IP rather than by MAC address.
· Katran cannot forward fragmented packets, nor can it fragment packets itself. This can be mitigated by increasing the maximum transmission unit (MTU) inside the network or by adjusting the advertised TCP MSS from the backends. (The latter measure is recommended even if the MTU is increased.)
· Katran doesn't support packets with IP options set. The maximum packet size cannot exceed 3.5 KB.
· Katran was built on the assumption that it would be used in a "load balancer on a stick" scenario, in which a single interface carries both "from user to L4LB (ingress)" traffic and "from L4LB to L7LB (egress)" traffic.
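The MSS mitigation mentioned in the fragmentation constraint above amounts to simple header arithmetic: the advertised MSS must leave room for the encapsulation overhead so the encapsulated segment still fits the link MTU. A sketch, assuming IPv4 without options (the 20-byte figures are standard header sizes, not values taken from Katran's sources):

```python
def clamped_tcp_mss(link_mtu, ipip_overhead=20, ip_header=20, tcp_header=20):
    """TCP MSS to advertise so an IP-in-IP encapsulated segment fits the MTU.

    The outer IPv4 header adds `ipip_overhead` bytes on top of the inner
    IP and TCP headers; advertising this reduced MSS avoids fragmentation.
    """
    return link_mtu - ipip_overhead - ip_header - tcp_header

# On a standard 1500-byte Ethernet MTU, this yields an MSS of 1440
# (versus the usual 1460 without encapsulation).
```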
Despite these limitations, we believe Katran offers an excellent forwarding plane to users and organizations who intend to leverage XDP and eBPF, two exciting innovations in systems engineering, to build efficient load balancers. We will be happy to answer questions from prospective adopters in our GitHub repository (https://github.com/facebookincubator/katran), and pull requests are always welcome!