
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Wed, 15 Apr 2026 05:18:25 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Production ready eBPF, or how we fixed the BSD socket API]]></title>
            <link>https://blog.cloudflare.com/tubular-fixing-the-socket-api-with-ebpf/</link>
            <pubDate>Thu, 17 Feb 2022 17:02:54 GMT</pubDate>
            <description><![CDATA[ We are open sourcing the production tooling we’ve built for the sk_lookup hook we contributed to the Linux kernel, called tubular ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3qt80mUTCxJp6nLkenADAL/29eb329be3097752997f3ac4b9a00f25/tubular-1.png" />
            
            </figure><p>As we develop new products, we often push our operating system - Linux - beyond what is commonly possible. A common theme has been relying on <a href="https://ebpf.io/what-is-ebpf/">eBPF</a> to build technology that would otherwise have required modifying the kernel. For example, we’ve built <a href="/l4drop-xdp-ebpf-based-ddos-mitigations/">DDoS mitigation</a> and a <a href="/unimog-cloudflares-edge-load-balancer/">load balancer</a> and use it to <a href="/introducing-ebpf_exporter/">monitor our fleet of servers</a>.</p><p>This software usually consists of a small-ish eBPF program written in C, executed in the context of the kernel, and a larger user space component that loads the eBPF into the kernel and manages its lifecycle. We’ve found that the ratio of eBPF code to userspace code differs by an order of magnitude or more. We want to shed some light on the issues that a developer has to tackle when dealing with eBPF and present our solutions for building rock-solid production ready applications which contain eBPF.</p><p>For this purpose we are open sourcing the production tooling we’ve built for the <a href="https://www.kernel.org/doc/html/latest/bpf/prog_sk_lookup.html">sk_lookup hook</a> we contributed to the Linux kernel, called <b>tubular</b>. It exists because <a href="/its-crowded-in-here/">we’ve outgrown the BSD sockets API</a>. To deliver some products we need features that are just not possible using the standard API.</p><ul><li><p>Our services are available on millions of IPs.</p></li><li><p>Multiple services using the same port on different addresses have to coexist, e.g. 
<a href="https://1.1.1.1/">1.1.1.1</a> resolver and our authoritative DNS.</p></li><li><p>Our Spectrum product <a href="/how-we-built-spectrum/">needs to listen on all 2^16 ports</a>.</p></li></ul><p>The source code for tubular is at <a href="https://github.com/cloudflare/tubular">https://github.com/cloudflare/tubular</a>, and it allows you to do all the things mentioned above. Maybe the most interesting feature is that you can change the addresses of a service on the fly.</p>
    <div>
      <h2>How tubular works</h2>
    </div>
    <p><code>tubular</code> sits at a critical point in the Cloudflare stack, since it has to inspect every connection terminated by a server and decide which application should receive it.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ZFfhU9ui5dR4KpbOcofqr/c61fde7a87d189167a3b72ca90b61e20/unnamed.png" />
            
</figure><p>Failure to do so will drop or misdirect connections hundreds of times per second. So it has to be incredibly robust during day-to-day operations. We had the following goals for tubular:</p><ul><li><p><b>Releases must be unattended and happen online.</b> tubular runs on thousands of machines, so we can’t babysit the process or take servers out of production.</p></li><li><p><b>Releases must fail safely.</b> A failure in the process must leave the previous version of tubular running; otherwise we may drop connections.</p></li><li><p><b>Reduce the impact of (userspace) crashes.</b> When the inevitable bug comes along we want to minimise the blast radius.</p></li></ul><p>In the past we had built a proof-of-concept control plane for sk_lookup called <a href="https://github.com/majek/inet-tool">inet-tool</a>, which proved that we could get away without a persistent service managing the eBPF. Similarly, tubular has <code>tubectl</code>: short-lived invocations make the necessary changes, and persistent state is handled by the kernel in the form of <a href="https://www.kernel.org/doc/html/latest/bpf/maps.html">eBPF maps</a>. Following this design gave us crash resiliency by default, but left us with the task of mapping the user interface we wanted to the tools available in the eBPF ecosystem.</p>
    <div>
      <h2>The tubular user interface</h2>
    </div>
    <p>tubular consists of a BPF program that attaches to the sk_lookup hook in the kernel and userspace Go code which manages the BPF program. The <code>tubectl</code> command wraps both in a way that is easy to distribute.</p><p><code>tubectl</code> manages two kinds of objects: bindings and sockets. A binding encodes a rule against which an incoming packet is matched. A socket is a reference to a TCP or UDP socket that can accept new connections or packets.</p><p>Bindings and sockets are "glued" together via arbitrary strings called labels. Conceptually, a binding assigns a label to some traffic. The label is then used to find the correct socket.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3G1ZM7iKvkGurKMBYqonW5/f6612068b8fb404a8cc0af1d02108c43/unnamed--2-.png" />
            
            </figure>
    <div>
      <h3>Adding bindings</h3>
    </div>
    <p>To create a binding that steers port 80 (aka HTTP) traffic destined for 127.0.0.1 to the label “foo” we use <code>tubectl bind</code>:</p>
            <pre><code>$ sudo tubectl bind "foo" tcp 127.0.0.1 80</code></pre>
            <p>sk_lookup enables much more powerful constructs than the BSD sockets API. For example, we can redirect connections to all IPs in 127.0.0.0/24 to a single socket:</p>
            <pre><code>$ sudo tubectl bind "bar" tcp 127.0.0.0/24 80</code></pre>
            <p>A side effect of this power is that it's possible to create bindings that "overlap":</p>
            <pre><code>1: tcp 127.0.0.1/32 80 -&gt; "foo"
2: tcp 127.0.0.0/24 80 -&gt; "bar"</code></pre>
            <p>The first binding says that HTTP traffic to localhost should go to “foo”, while the second asserts that HTTP traffic in the localhost subnet should go to “bar”. This creates a contradiction: which binding should we choose? tubular resolves this by defining precedence rules for bindings:</p><ol><li><p>A prefix with a longer mask is more specific, e.g. 127.0.0.1/32 wins over 127.0.0.0/24.</p></li><li><p>A port is more specific than the port wildcard, e.g. port 80 wins over "all ports" (0).</p></li></ol><p>Applying this to our example, HTTP traffic to all IPs in 127.0.0.0/24 will be directed to bar, except for 127.0.0.1 which goes to foo.</p>
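            <p>The two precedence rules can be sketched in a few lines of Go. This is purely illustrative, not tubular's actual code; the <code>binding</code> struct and its field names are hypothetical:</p>

```go
package main

import "fmt"

// binding is a simplified, hypothetical view of a tubular binding.
type binding struct {
	label     string
	prefixLen int    // mask length of the address prefix, in bits
	port      uint16 // 0 means "all ports"
}

// moreSpecific implements the two precedence rules from the post:
// a longer prefix wins; with equal prefixes, a concrete port beats
// the wildcard port 0.
func moreSpecific(a, b binding) bool {
	if a.prefixLen != b.prefixLen {
		return a.prefixLen > b.prefixLen
	}
	return a.port != 0 && b.port == 0
}

func main() {
	foo := binding{"foo", 32, 80} // tcp 127.0.0.1/32 80
	bar := binding{"bar", 24, 80} // tcp 127.0.0.0/24 80
	fmt.Println(moreSpecific(foo, bar)) // true: /32 beats /24
}
```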
    <div>
      <h3>Getting ahold of sockets</h3>
    </div>
    <p><code>sk_lookup</code> needs a reference to a TCP or a UDP socket to redirect traffic to it. However, a socket is usually accessible only by the process which created it with the socket syscall. For example, an HTTP server creates a TCP listening socket bound to port 80. How can we gain access to the listening socket?</p><p>A fairly well-known solution is to make processes cooperate by passing socket file descriptors via <a href="/know-your-scm_rights/">SCM_RIGHTS</a> messages to a tubular daemon. That daemon can then take the necessary steps to hook up the socket with <code>sk_lookup</code>. This approach has several drawbacks:</p><ol><li><p>Requires modifying processes to send SCM_RIGHTS</p></li><li><p>Requires a tubular daemon, which may crash</p></li></ol><p>There is another way of getting at sockets by using systemd, provided <a href="https://www.freedesktop.org/software/systemd/man/systemd.socket.html">socket activation</a> is used. It works by creating an additional service unit with the correct <a href="https://www.freedesktop.org/software/systemd/man/systemd.service.html#Sockets=">Sockets</a> setting. In other words: we can add a oneshot service that systemd executes once the socket unit has been created, and which registers the socket with tubular. For example:</p>
            <pre><code>[Unit]
Requisite=foo.socket

[Service]
Type=oneshot
Sockets=foo.socket
ExecStart=tubectl register "foo"</code></pre>
            <p>Since we can rely on systemd to execute <code>tubectl</code> at the correct times we don't need a daemon of any kind. However, the reality is that a lot of popular software doesn't use systemd socket activation. Dealing with systemd sockets is complicated and doesn't invite experimentation. Which brings us to the final trick: <a href="https://www.man7.org/linux/man-pages/man2/pidfd_getfd.2.html">pidfd_getfd</a>:</p><blockquote><p>The <code>pidfd_getfd()</code> system call allocates a new file descriptor in the calling process. This new file descriptor is a duplicate of an existing file descriptor, targetfd, in the process referred to by the PID file descriptor pidfd.</p></blockquote><p>We can use it to iterate all file descriptors of a foreign process, and pick the socket we are interested in. To return to our example, we can use the following command to find the TCP socket bound to 127.0.0.1 port 8080 in the httpd process and register it under the "foo" label:</p>
            <pre><code>$ sudo tubectl register-pid "foo" $(pidof httpd) tcp 127.0.0.1 8080</code></pre>
            <p>It's easy to wire this up using systemd's <a href="https://www.freedesktop.org/software/systemd/man/systemd.service.html#ExecStartPost=">ExecStartPost</a> if the need arises.</p>
            <pre><code>[Service]
# Type=notify works as well
Type=forking
ExecStart=/path/to/some/command
ExecStartPost=tubectl register-pid $MAINPID foo tcp 127.0.0.1 8080</code></pre>
            
    <div>
      <h2>Storing state in eBPF maps</h2>
    </div>
    <p>As mentioned previously, tubular relies on the kernel to store state, using <a href="https://prototype-kernel.readthedocs.io/en/latest/bpf/ebpf_maps.html">BPF key / value data structures also known as maps</a>. Using the <a href="https://www.kernel.org/doc/html/latest/userspace-api/ebpf/syscall.html">BPF_OBJ_PIN syscall</a> we can persist them in /sys/fs/bpf:</p>
            <pre><code>/sys/fs/bpf/4026532024_dispatcher
├── bindings
├── destination_metrics
├── destinations
├── sockets
└── ...</code></pre>
            <p>The way the state is structured differs from how the command line interface presents it to users. Labels like “foo” are convenient for humans, but they are of variable length. Dealing with variable length data in BPF is cumbersome and slow, so the BPF program never references labels at all. Instead, the user space code allocates numeric IDs, which are then used in the BPF. Each ID represents a (<code>label</code>, <code>domain</code>, <code>protocol</code>) tuple, internally called <code>destination</code>.</p><p>For example, adding a binding for "foo" <code>tcp 127.0.0.1</code> ... allocates an ID for ("<code>foo</code>", <code>AF_INET</code>, <code>TCP</code>). Including domain and protocol in the destination allows simpler data structures in the BPF. Each allocation also tracks how many bindings reference a destination so that we can recycle unused IDs. This data is persisted into the destinations hash table, which is keyed by (Label, Domain, Protocol) and contains (ID, Count). Metrics for each destination are tracked in destination_metrics in the form of per-CPU counters.</p>
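            <p>The ID allocation and reference counting described above can be sketched in Go. The types and names here are hypothetical, chosen only to illustrate the recycling scheme:</p>

```go
package main

import "fmt"

// destination mirrors the (label, domain, protocol) tuple from the post.
type destination struct {
	label    string
	domain   string // e.g. "ipv4"
	protocol string // e.g. "tcp"
}

// allocator hands out small numeric IDs for destinations and
// reference-counts them so that unused IDs can be recycled.
type allocator struct {
	ids   map[destination]int
	count map[destination]int
	free  []int // recycled IDs
	next  int   // next never-used ID
}

func newAllocator() *allocator {
	return &allocator{ids: map[destination]int{}, count: map[destination]int{}}
}

// acquire returns the ID for d, allocating one if necessary, and
// increments its reference count.
func (a *allocator) acquire(d destination) int {
	id, ok := a.ids[d]
	if !ok {
		if n := len(a.free); n > 0 {
			id, a.free = a.free[n-1], a.free[:n-1]
		} else {
			id = a.next
			a.next++
		}
		a.ids[d] = id
	}
	a.count[d]++
	return id
}

// release drops one reference; the ID goes back on the free list
// once the last binding referencing the destination is gone.
func (a *allocator) release(d destination) {
	if a.count[d]--; a.count[d] == 0 {
		a.free = append(a.free, a.ids[d])
		delete(a.ids, d)
		delete(a.count, d)
	}
}

func main() {
	a := newAllocator()
	foo := destination{"foo", "ipv4", "tcp"}
	fmt.Println(a.acquire(foo)) // 0
	fmt.Println(a.acquire(foo)) // 0, refcount is now 2
	a.release(foo)
	a.release(foo)
	bar := destination{"bar", "ipv4", "tcp"}
	fmt.Println(a.acquire(bar)) // 0 again: the ID was recycled
}
```

Keeping IDs small and dense is what makes them usable as array indices in the BPF sockmap mentioned below.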
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5JI7BzZmFdOS5DO6n2cpjh/3bf1a320954e9de2e4b60e64e0a3b375/unnamed--1--5.png" />
            
            </figure><p><code>bindings</code> is a <a href="https://en.wikipedia.org/wiki/Trie">longest prefix match (LPM) trie</a> which stores a mapping from (<code>protocol</code>, <code>port</code>, <code>prefix</code>) to (<code>ID</code>, <code>prefix length</code>). The ID is used as a key to the sockets map which contains pointers to kernel socket structures. IDs are allocated in a way that makes them suitable as an array index, which allows using the simpler BPF sockmap (an array) instead of a socket hash table. The prefix length is duplicated in the value to work around shortcomings in the BPF API.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4FJXwiooeaLRbRriCrETia/656cc5c0f78ca393627335cec1064755/unnamed--3-.png" />
            
            </figure>
    <div>
      <h2>Encoding the precedence of bindings</h2>
    </div>
    <p>As discussed, bindings have a precedence associated with them. To repeat the earlier example:</p>
            <pre><code>1: tcp 127.0.0.1/32 80 -&gt; "foo"
2: tcp 127.0.0.0/24 80 -&gt; "bar"</code></pre>
            <p>The first binding should be matched before the second one. We need to encode this in the BPF somehow. One idea is to generate some code that executes the bindings in order of specificity, a technique we’ve used to great effect in <a href="/l4drop-xdp-ebpf-based-ddos-mitigations/">l4drop</a>:</p>
            <pre><code>1: if (mask(ip, 32) == 127.0.0.1) return "foo"
2: if (mask(ip, 24) == 127.0.0.0) return "bar"
...</code></pre>
            <p>This has the downside that the program gets longer the more bindings are added, which slows down execution. It's also difficult to introspect and debug such long programs. Instead, we use a specialised BPF longest prefix match (LPM) map to do the hard work. This allows inspecting the contents from user space to figure out which bindings are active, which is very difficult if we had compiled bindings into BPF. The LPM map uses a trie behind the scenes, so <a href="https://en.wikipedia.org/wiki/Trie#Searching">lookup has complexity proportional to the length of the key</a> instead of linear complexity for the “naive” solution.</p><p>However, using a map requires a trick for encoding the precedence of bindings into a key that we can look up. Here is a simplified version of this encoding, which ignores IPv6 and uses labels instead of IDs. To insert the binding <code>tcp 127.0.0.0/24 80</code> into a trie we first convert the IP address into a number.</p>
            <pre><code>127.0.0.0    = 0x7f 00 00 00</code></pre>
            <p>Since we're only interested in the first 24 bits of the address, we can write the whole prefix as</p>
            <pre><code>127.0.0.0/24 = 0x7f 00 00 ??</code></pre>
            <p>where “?” means that the value is not specified. We choose the number 0x01 to represent TCP and prepend it and the port number (80 decimal is 0x50 hex) to create the full key:</p>
            <pre><code>tcp 127.0.0.0/24 80 = 0x01 50 7f 00 00 ??</code></pre>
            <p>Converting <code>tcp 127.0.0.1/32 80</code> happens in exactly the same way. Once the converted values are inserted into the trie, the LPM trie conceptually contains the following keys and values.</p>
            <pre><code>LPM trie:
        0x01 50 7f 00 00 ?? = "bar"
        0x01 50 7f 00 00 01 = "foo"</code></pre>
            <p>To find the binding for a TCP packet destined for 127.0.0.1:80, we again encode a key and perform a lookup.</p>
            <pre><code>input:  0x01 50 7f 00 00 01   TCP packet to 127.0.0.1:80
---------------------------
LPM trie:
        0x01 50 7f 00 00 ?? = "bar"
           y  y  y  y  y
        0x01 50 7f 00 00 01 = "foo"
           y  y  y  y  y  y
---------------------------
result: "foo"

y = byte matches</code></pre>
            <p>The trie returns “foo” since its key shares the longest prefix with the input. Note that we stop comparing keys once we reach unspecified “?” bytes, but conceptually “bar” is still a valid result. The distinction becomes clear when looking up the binding for a TCP packet to 127.0.0.255:80.</p>
            <pre><code>input:  0x01 50 7f 00 00 ff   TCP packet to 127.0.0.255:80
---------------------------
LPM trie:
        0x01 50 7f 00 00 ?? = "bar"
           y  y  y  y  y
        0x01 50 7f 00 00 01 = "foo"
           y  y  y  y  y  n
---------------------------
result: "bar"

n = byte doesn't match</code></pre>
            <p>In this case "foo" is discarded since the last byte doesn't match the input. However, "bar" is returned since its last byte is unspecified and therefore considered to be a valid match.</p>
    <div>
      <h2>Observability with minimal privileges</h2>
    </div>
    <p>Linux has the powerful <code>ss</code> tool (part of iproute2) available to inspect socket state:</p>
            <pre><code>$ ss -tl src 127.0.0.1
State      Recv-Q      Send-Q           Local Address:Port           Peer Address:Port
LISTEN     0           128                  127.0.0.1:ipp                 0.0.0.0:*</code></pre>
            <p>With tubular in the picture this output is not accurate anymore. <code>tubectl bindings</code> makes up for this shortcoming:</p>
            <pre><code>$ sudo tubectl bindings tcp 127.0.0.1
Bindings:
 protocol       prefix port label
      tcp 127.0.0.1/32   80   foo</code></pre>
            <p>Running this command requires super-user privileges, despite in theory being safe for any user to run. While this is acceptable for casual inspection by a human operator, it's a dealbreaker for observability via pull-based monitoring systems like Prometheus. The usual approach is to expose metrics via an HTTP server, which would have to run with elevated privileges and be accessible to the Prometheus server somehow. Instead, BPF gives us the tools to enable read-only access to tubular state with minimal privileges.</p><p>The key is to carefully set file ownership and mode for state in /sys/fs/bpf. Creating and opening files in /sys/fs/bpf uses <a href="https://www.kernel.org/doc/html/latest/userspace-api/ebpf/syscall.html#bpf-subcommand-reference">BPF_OBJ_PIN and BPF_OBJ_GET</a>. Calling BPF_OBJ_GET with BPF_F_RDONLY is roughly equivalent to open(O_RDONLY) and allows accessing state in a read-only fashion, provided the file permissions are correct. tubular gives the owner full access but restricts read-only access to the group:</p>
            <pre><code>$ sudo ls -l /sys/fs/bpf/4026532024_dispatcher | head -n 3
total 0
-rw-r----- 1 root root 0 Feb  2 13:19 bindings
-rw-r----- 1 root root 0 Feb  2 13:19 destination_metrics</code></pre>
            <p>It's easy to choose which user and group should own state when loading tubular:</p>
            <pre><code>$ sudo -u root -g tubular tubectl load
created dispatcher in /sys/fs/bpf/4026532024_dispatcher
loaded dispatcher into /proc/self/ns/net
$ sudo ls -l /sys/fs/bpf/4026532024_dispatcher | head -n 3
total 0
-rw-r----- 1 root tubular 0 Feb  2 13:42 bindings
-rw-r----- 1 root tubular 0 Feb  2 13:42 destination_metrics</code></pre>
            <p>There is one more obstacle: <a href="https://github.com/systemd/systemd/blob/b049b48c4b6e60c3cbec9d2884f90fd4e7013219/src/shared/mount-setup.c#L111-L112">systemd mounts /sys/fs/bpf</a> in a way that makes it inaccessible to anyone but root. Adding the executable bit to the directory fixes this.</p>
            <pre><code>$ sudo chmod -v o+x /sys/fs/bpf
mode of '/sys/fs/bpf' changed from 0700 (rwx------) to 0701 (rwx-----x)</code></pre>
            <p>Finally, we can export metrics without privileges:</p>
            <pre><code>$ sudo -u nobody -g tubular tubectl metrics 127.0.0.1 8080
Listening on 127.0.0.1:8080
^C</code></pre>
            <p>There is a caveat, unfortunately: truly unprivileged access requires unprivileged BPF to be enabled. Many distros have taken to disabling it via the unprivileged_bpf_disabled sysctl, in which case scraping metrics does require CAP_BPF.</p>
    <div>
      <h2>Safe releases</h2>
    </div>
    <p>tubular is distributed as a single binary, but really consists of two pieces of code with widely differing lifetimes. The BPF program is loaded into the kernel once and then may be active for weeks or months, until it is explicitly replaced. In fact, a reference to the program (and link, see below) is persisted into /sys/fs/bpf:</p>
            <pre><code>/sys/fs/bpf/4026532024_dispatcher
├── link
├── program
└── ...</code></pre>
            <p>The user space code is executed for seconds at a time and is replaced whenever the binary on disk changes. This means that user space has to be able to deal with an "old" BPF program in the kernel somehow. The simplest way to achieve this is to compare what is loaded into the kernel with the BPF shipped as part of tubectl. If the two don't match we return an error:</p>
            <pre><code>$ sudo tubectl bind foo tcp 127.0.0.1 80
Error: bind: can't open dispatcher: loaded program #158 has differing tag: "938c70b5a8956ff2" doesn't match "e007bfbbf37171f0"</code></pre>
            <p><code>tag</code> is the truncated hash of the instructions making up a BPF program, which the kernel makes available for every loaded program:</p>
            <pre><code>$ sudo bpftool prog list id 158
158: sk_lookup  name dispatcher  tag 938c70b5a8956ff2
...</code></pre>
            <p>By comparing the tag, tubular asserts that it is dealing with a supported version of the BPF program. Of course, just returning an error isn't enough. There needs to be a way to update the kernel program so that it's once again safe to make changes. This is where the persisted link in /sys/fs/bpf comes into play. <code>bpf_links</code> are used to attach programs to various BPF hooks. "Enabling" a BPF program is a two-step process: first load the BPF program, then attach it to a hook using a bpf_link. Afterwards the program will execute the next time the hook fires. By updating the link we can change the program on the fly, in an atomic manner.</p>
            <pre><code>$ sudo tubectl upgrade
Upgraded dispatcher to 2022.1.0-dev, program ID #159
$ sudo bpftool prog list id 159
159: sk_lookup  name dispatcher  tag e007bfbbf37171f0
…
$ sudo tubectl bind foo tcp 127.0.0.1 80
bound foo#tcp:[127.0.0.1/32]:80</code></pre>
            <p>Behind the scenes the upgrade procedure is slightly more complicated, since we have to update the pinned program reference in addition to the link. We pin the new program into /sys/fs/bpf:</p>
            <pre><code>/sys/fs/bpf/4026532024_dispatcher
├── link
├── program
├── program-upgrade
└── ...</code></pre>
            <p>Once the link is updated we <a href="https://www.man7.org/linux/man-pages/man2/rename.2.html">atomically rename</a> program-upgrade to replace program. In the future we may be able to <a href="https://lkml.kernel.org/netdev/20211028094724.59043-5-lmb@cloudflare.com/t/">use RENAME_EXCHANGE</a> to make upgrades even safer.</p>
    <div>
      <h2>Preventing state corruption</h2>
    </div>
    <p>So far we’ve completely neglected the fact that multiple invocations of <code>tubectl</code> could modify the state in /sys/fs/bpf at the same time. It’s very hard to reason about what would happen in this case, so in general it’s best to prevent this from ever occurring. A common solution to this is <a href="https://gavv.github.io/articles/file-locks/#differing-features">advisory file locks</a>. Unfortunately it seems like BPF maps don't support locking.</p>
            <pre><code>$ sudo flock /sys/fs/bpf/4026532024_dispatcher/bindings echo works!
flock: cannot open lock file /sys/fs/bpf/4026532024_dispatcher/bindings: Input/output error</code></pre>
            <p>This led to a bit of head scratching on our part. Luckily it is possible to flock the directory instead of individual maps:</p>
            <pre><code>$ sudo flock --exclusive /sys/fs/bpf/foo echo works!
works!</code></pre>
    <p>Each <code>tubectl</code> invocation likewise invokes <a href="https://www.man7.org/linux/man-pages//man2/flock.2.html"><code>flock()</code></a>, thereby guaranteeing that only a single process is ever making changes.</p>
    <div>
      <h2>Conclusion</h2>
    </div>
    <p>tubular is in production at Cloudflare today and has simplified the deployment of <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum</a> and our <a href="https://www.cloudflare.com/dns/">authoritative DNS</a>. It allowed us to leave behind limitations of the BSD socket API. However, its most powerful feature is that <a href="https://research.cloudflare.com/publications/Fayed2021/">the addresses a service is available on can be changed on the fly</a>. In fact, we have built tooling that automates this process across our global network. Need to listen on another million IPs on thousands of machines? No problem, it’s just an HTTP POST away.</p><p><i>Interested in working on tubular and our L4 load balancer</i> <a href="/unimog-cloudflares-edge-load-balancer/"><i>unimog</i></a><i>? We are</i> <a href="https://boards.greenhouse.io/cloudflare/jobs/3232234?gh_jid=3232234"><i>hiring in our European offices</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Go]]></category>
            <guid isPermaLink="false">7ofIShaWHxqlp4ZmHyNRs</guid>
            <dc:creator>Lorenz Bauer</dc:creator>
        </item>
        <item>
            <title><![CDATA[Graceful upgrades in Go]]></title>
            <link>https://blog.cloudflare.com/graceful-upgrades-in-go/</link>
            <pubDate>Thu, 11 Oct 2018 14:30:20 GMT</pubDate>
            <description><![CDATA[ The idea behind graceful upgrades is to swap out the configuration and code of a process while it is running, without anyone noticing it. If this sounds error-prone, dangerous, undesirable and in general a bad idea – I’m with you. ]]></description>
            <content:encoded><![CDATA[ <p><sup><i>Dingle Dangle! by </i></sup><a href="https://www.flickr.com/photos/grant_subaru/14175646490"><sup><i>Grant C.</i></sup></a><sup><i> (CC-BY 2.0)</i></sup></p><p>The idea behind graceful upgrades is to swap out the configuration and code of a process while it is running, without anyone noticing it. If this sounds error-prone, dangerous, undesirable and in general a bad idea – I’m with you. However, sometimes you really need them. Usually this happens in an environment where there is no load balancing layer. We have these at Cloudflare, which led to us investigating and implementing various solutions to this problem.</p><p>Coincidentally, implementing graceful upgrades involves some fun low-level systems programming, which is probably why there are already a bajillion options out there. Read on to learn what trade-offs there are, and why you should really use the Go library we are about to open source. For the impatient, the code is on <a href="https://github.com/cloudflare/tableflip">GitHub</a> and you can read the <a href="https://godoc.org/github.com/cloudflare/tableflip">documentation on godoc</a>.</p>
    <div>
      <h3>The basics</h3>
    </div>
    <p>So what does it mean for a process to perform a graceful upgrade? Let’s use a web server as an example: we want to be able to fire HTTP requests at it, and never see an error because a graceful upgrade is happening.</p><p>We know that HTTP uses TCP under the hood, and that we interface with TCP using the BSD socket API. We have told the OS that we’d like to receive connections on port 80, and the OS has given us a listening socket, on which we call <code>Accept()</code> to wait for new clients.</p><p>A new client will be refused if the OS doesn’t know of a listening socket for port 80, or nothing is calling <code>Accept()</code> on it. The trick of a graceful upgrade is to make sure that neither of these two things occur while we somehow restart our service. Let’s look at all the ways we could achieve this, from simple to complex.</p>
    <div>
      <h3>Just <code>Exec()</code></h3>
    </div>
    <p>Ok, how hard can it be? Let’s just <code>Exec()</code> the new binary (without doing a fork first). This does exactly what we want, by replacing the currently running code with the new code from disk.</p>
            <pre><code>// The following is pseudo-Go.

func main() {
	var ln net.Listener
	if isUpgrade {
		ln = net.FileListener(os.NewFile(uintptr(fdNumber), "listener"))
	} else {
		ln = net.Listen(network, address)
	}
	
	go handleRequests(ln)

	&lt;-waitForUpgradeRequest

	syscall.Exec(os.Args[0], os.Args, os.Environ())
}</code></pre>
            <p>Unfortunately this has a fatal flaw since we can’t “undo” the exec. Imagine a configuration file with too much white space in it or an extra semicolon. The new process would try to read that file, get an error and exit.</p><p>Even if the exec call works, this solution assumes that initialisation of the new process is practically instantaneous. We can get into a situation where the kernel refuses new connections because the <a href="https://veithen.github.io/2014/01/01/how-tcp-backlog-works-in-linux.html">listen queue is overflowing</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/46DpPmhtCjeHZmXVkySQTP/e644645ace6a5b37c1e0c4c09fc86814/Example1-1.png" />
            
            </figure><p><i>New connections may be dropped if </i><code><i>Accept()</i></code><i> is not called regularly enough</i></p><p>Specifically, the new binary is going to spend some time after <code>Exec()</code> to initialise, which delays calls to <code>Accept()</code>. This means the backlog of new connections grows until some are dropped. Plain exec is out of the game.</p>
    <div>
      <h3><code>Listen()</code> all the things</h3>
    </div>
    <p>Since just using exec is out of the question, we can try the next best thing. Let’s fork and exec a new process which then goes through its usual start up routine. At some point it will create a few sockets by listening on some addresses, except that won’t work out-of-the-box due to <code>EADDRINUSE</code>, otherwise known as Address Already In Use. The kernel is preventing us from listening on the address and port combination used by the old process.</p><p>Of course, there is a flag to fix that: <code>SO_REUSEPORT</code>. This tells the kernel to ignore the fact that there is already a listening socket for a given address and port, and just allocate a new one.</p>
            <pre><code>func main() {
	// net.ListenWithReusePort is a fictional helper that sets SO_REUSEPORT before binding
	ln := net.ListenWithReusePort(network, address)

	go handleRequests(ln)

	&lt;-waitForUpgradeRequest

	cmd := exec.Command(os.Args[0], os.Args[1:]...)
	cmd.Start()

	&lt;-waitForNewProcess
}</code></pre>
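            <p>The fictional <code>net.ListenWithReusePort</code> above can be approximated with real API: <code>net.ListenConfig</code> takes a <code>Control</code> function that runs on the raw socket before <code>bind(2)</code>, which is the right place to set <code>SO_REUSEPORT</code>. A minimal sketch, assuming Linux (where <code>syscall.SO_REUSEPORT</code> is defined):</p>
            <pre><code>package main

import (
	"context"
	"fmt"
	"net"
	"syscall"
)

// reusePort runs on the raw socket before bind(2) and sets SO_REUSEPORT,
// which lets a second socket listen on the same address and port.
func reusePort(network, address string, c syscall.RawConn) error {
	var sockErr error
	err := c.Control(func(fd uintptr) {
		sockErr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_REUSEPORT, 1)
	})
	if err != nil {
		return err
	}
	return sockErr
}

func main() {
	lc := net.ListenConfig{Control: reusePort}

	// Two listening sockets on the same address and port: without
	// SO_REUSEPORT the second Listen would fail with EADDRINUSE.
	ln1, err := lc.Listen(context.Background(), "tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	ln2, err := lc.Listen(context.Background(), "tcp", ln1.Addr().String())
	fmt.Println(err == nil &amp;&amp; ln2.Addr().String() == ln1.Addr().String())
}</code></pre>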
            <p>Now both processes have working listening sockets and the upgrade works. Right?</p><p><code>SO_REUSEPORT</code> is a little bit peculiar in what it does inside the kernel. As systems programmers, we tend to think of a socket as the file descriptor that is returned by the socket call. The kernel however makes a distinction between the data structure of a socket, and one or more file descriptors pointing at it. It creates a separate socket structure if you bind using <code>SO_REUSEPORT</code>, not just another file descriptor. The old and the new process are thus referring to two separate sockets, which happen to share the same address. This leads to an unavoidable race condition: new-but-not-yet-accepted connections on the socket used by the old process will be orphaned and terminated by the kernel. GitHub wrote <a href="https://githubengineering.com/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/#haproxy-almost-safe-reloads">an excellent blog post about this problem</a>.</p><p>The engineers at GitHub solved the problem with <code>SO_REUSEPORT</code> by using an obscure feature of the sendmsg syscall <a href="http://man7.org/linux/man-pages/man0/sys_socket.h.0p.html">called ancillary data</a>. It turns out that ancillary data can include file descriptors. Using this API made sense for GitHub, since it allowed them to integrate elegantly with HAProxy. Since we have the luxury of changing the program, we can use simpler alternatives.</p>
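            <p>For illustration, here is roughly what passing a file descriptor via <code>SCM_RIGHTS</code> ancillary data looks like from Go. This is a self-contained sketch, not GitHub’s actual implementation: a socketpair stands in for the Unix socket that would connect the old and new process.</p>
            <pre><code>package main

import (
	"fmt"
	"net"
	"os"
	"syscall"
)

func main() {
	// A connected socketpair stands in for the Unix socket between the
	// old and the new process.
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		panic(err)
	}
	oldEnd, _ := net.FileConn(os.NewFile(uintptr(fds[0]), "old"))
	newEnd, _ := net.FileConn(os.NewFile(uintptr(fds[1]), "new"))

	// The "old process" owns a listening socket...
	ln, _ := net.Listen("tcp", "127.0.0.1:0")
	lnFile, _ := ln.(*net.TCPListener).File()

	// ...and sends its file descriptor as SCM_RIGHTS ancillary data.
	rights := syscall.UnixRights(int(lnFile.Fd()))
	oldEnd.(*net.UnixConn).WriteMsgUnix([]byte("x"), rights, nil)

	// The "new process" receives the descriptor and rebuilds a listener.
	buf, oob := make([]byte, 1), make([]byte, 128)
	_, oobn, _, _, _ := newEnd.(*net.UnixConn).ReadMsgUnix(buf, oob)
	msgs, _ := syscall.ParseSocketControlMessage(oob[:oobn])
	passed, _ := syscall.ParseUnixRights(&amp;msgs[0])
	inherited, _ := net.FileListener(os.NewFile(uintptr(passed[0]), "inherited"))

	// Both file descriptors now refer to the very same listening socket.
	fmt.Println(inherited.Addr().String() == ln.Addr().String())
}</code></pre>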
    <div>
      <h3>NGINX: share sockets via fork and exec</h3>
      <a href="#nginx-share-sockets-via-fork-and-exec">
        
      </a>
    </div>
    <p>NGINX is the tried and trusted workhorse of the Internet, and happens to support graceful upgrades. As a bonus we also use it at Cloudflare, so we were confident in its implementation.</p><p>It is written in a process-per-core model, which means that instead of spawning a bunch of threads NGINX runs a process per logical CPU core. Additionally, there is a primary process which orchestrates graceful upgrades.</p><p>The primary is responsible for creating all listen sockets used by NGINX and sharing them with the workers. This is fairly straightforward: first, the <code>FD_CLOEXEC</code> bit is cleared on all listen sockets. This means that they are not closed when the <code>exec()</code> syscall is made. The primary then does the customary <code>fork()</code> / <code>exec()</code> dance to spawn the workers, passing the file descriptor numbers as an environment variable.</p><p>Graceful upgrades make use of the same mechanism. We can spawn a new primary process (PID 1176) by <a href="http://nginx.org/en/docs/control.html#upgrade">following the NGINX documentation</a>. This inherits the existing listeners from the old primary process (PID 1017) just like workers do. The new primary then spawns its own workers:</p>
            <pre><code> CGroup: /system.slice/nginx.service
       	├─1017 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
       	├─1019 nginx: worker process
       	├─1021 nginx: worker process
       	├─1024 nginx: worker process
       	├─1026 nginx: worker process
       	├─1027 nginx: worker process
       	├─1028 nginx: worker process
       	├─1029 nginx: worker process
       	├─1030 nginx: worker process
       	├─1176 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
       	├─1187 nginx: worker process
       	├─1188 nginx: worker process
       	├─1190 nginx: worker process
       	├─1191 nginx: worker process
       	├─1192 nginx: worker process
       	├─1193 nginx: worker process
       	├─1194 nginx: worker process
       	└─1195 nginx: worker process</code></pre>
            <p>At this point there are two completely independent NGINX processes running. PID 1176 might be a new version of NGINX, or could use an updated config file. When a new connection arrives for port 80, one of the 16 worker processes is chosen by the kernel.</p><p>After executing the remaining steps, we end up with a fully replaced NGINX:</p>
            <pre><code>   CGroup: /system.slice/nginx.service
       	├─1176 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
       	├─1187 nginx: worker process
       	├─1188 nginx: worker process
       	├─1190 nginx: worker process
       	├─1191 nginx: worker process
       	├─1192 nginx: worker process
       	├─1193 nginx: worker process
       	├─1194 nginx: worker process
       	└─1195 nginx: worker process</code></pre>
            <p>Now, when a request arrives, the kernel chooses one of the eight remaining worker processes.</p><p>This upgrade procedure is rather delicate, so NGINX has a safeguard in place. Try requesting a second upgrade while the first hasn’t finished, and you’ll find the following message in the error log:</p>
            <pre><code>[crit] 1176#1176: the changing binary signal is ignored: you should shutdown or terminate before either old or new binary's process</code></pre>
            <p>This is very sensible: there is no good reason why there should be more than two processes at any given point in time. Ideally, we want this behaviour from our Go solution as well.</p>
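            <p>The inheritance mechanism NGINX uses translates quite directly to Go: <code>exec.Cmd.ExtraFiles</code> hands file descriptors to a child process (starting at fd 3) with <code>FD_CLOEXEC</code> cleared, and an environment variable can tell the child what it inherited. A self-spawning sketch of the same idea, not NGINX’s actual code; the <code>INHERITED_FD</code> variable name is made up:</p>
            <pre><code>package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"os/exec"
)

func main() {
	if os.Getenv("INHERITED_FD") != "" {
		// Child: entries in ExtraFiles start at fd 3 in the new process.
		ln, err := net.FileListener(os.NewFile(3, "inherited"))
		if err != nil {
			panic(err)
		}
		conn, _ := ln.Accept()
		io.WriteString(conn, "hello from child")
		conn.Close()
		return
	}

	// Parent: create the listener and hand it to a child process,
	// just like the NGINX primary does for its workers.
	ln, _ := net.Listen("tcp", "127.0.0.1:0")
	lnFile, _ := ln.(*net.TCPListener).File()

	cmd := exec.Command(os.Args[0])
	cmd.Env = append(os.Environ(), "INHERITED_FD=3")
	cmd.ExtraFiles = []*os.File{lnFile} // becomes fd 3 in the child
	cmd.Start()

	// The parent never calls Accept, so the connection goes to the child.
	conn, _ := net.Dial("tcp", ln.Addr().String())
	msg, _ := io.ReadAll(conn)
	fmt.Println(string(msg))
	cmd.Wait()
}</code></pre>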
    <div>
      <h3>Graceful upgrade wishlist</h3>
      <a href="#graceful-upgrade-wishlist">
        
      </a>
    </div>
    <p>The way NGINX has implemented graceful upgrades is very nice. There is a clear life cycle which determines valid actions at any point in time:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/71WT8Hg8Bi4AAgf3yGdN8I/e30876bbf1c15d1e68143278b52f4bba/upgrade-lifecycle.svg" />
            
            </figure><p>It also solves the problems we’ve identified with the other approaches:</p><ul><li><p>No old code keeps running after a successful upgrade</p></li><li><p>The new process can crash during initialisation, without bad effects</p></li><li><p>Only a single upgrade is active at any point in time</p></li></ul><p>Really, we’d like NGINX-style graceful upgrades as a Go library. Of course, the Go community has produced some fine libraries just for this occasion. We looked at</p><ul><li><p><a href="https://github.com/alext/tablecloth">github.com/alext/tablecloth</a> (hat tip for the great name)</p></li><li><p><a href="https://godoc.org/github.com/astaxie/beego/grace">github.com/astaxie/beego/grace</a></p></li><li><p><a href="https://github.com/facebookgo/grace">github.com/facebookgo/grace</a></p></li><li><p><a href="https://github.com/crawshaw/littleboss">github.com/crawshaw/littleboss</a></p></li></ul><p>just to name a few. Each of them is different in its implementation and trade-offs, but none of them ticked all of our boxes. The most common problem is that they are designed to gracefully upgrade an HTTP server. This makes their API much nicer, but removes the flexibility we need to support other socket-based protocols. So really, there was absolutely no choice but to write our own library, called tableflip. Having fun was not part of the equation.</p>
    <div>
      <h3>tableflip</h3>
      <a href="#tableflip">
        
      </a>
    </div>
    <p>tableflip is a Go library for NGINX-style graceful upgrades. Here is what using it looks like:</p>
            <pre><code>upg, _ := tableflip.New(tableflip.Options{})
defer upg.Stop()

// Do an upgrade on SIGHUP
go func() {
    sig := make(chan os.Signal, 1)
    signal.Notify(sig, syscall.SIGHUP)
    for range sig {
        _ = upg.Upgrade()
    }
}()

// Start a HTTP server
ln, _ := upg.Fds.Listen("tcp", "localhost:8080")
server := http.Server{}
go server.Serve(ln)

// Tell the parent we are ready
_ = upg.Ready()

// Wait to be replaced with a new process
&lt;-upg.Exit()

// Wait for connections to drain.
server.Shutdown(context.TODO())</code></pre>
            <p>Calling <code>Upgrader.Upgrade</code> spawns a new process with the necessary net.Listeners, and waits for the new process to signal that it has finished initialising, to die, or to time out. Calling it while an upgrade is ongoing returns an error.</p><p><code>Upgrader.Fds.Listen</code> is inspired by <code>facebookgo/grace</code> and makes inheriting a net.Listener easy. Behind the scenes, <code>Fds</code> makes sure that unused inherited sockets are cleaned up. This includes UNIX sockets, which are tricky due to <a href="https://golang.org/pkg/net/#UnixListener.SetUnlinkOnClose">UnlinkOnClose</a>. You can also pass plain <code>*os.File</code>s to the new process if you desire.</p><p>Finally, <code>Upgrader.Ready</code> cleans up unused fds and signals the parent process that initialisation is done. The parent can then exit, which completes the graceful upgrade cycle.</p>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">7qaSxnaXNYj34tyA0WUanS</guid>
            <dc:creator>Lorenz Bauer</dc:creator>
        </item>
    </channel>
</rss>