
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Tue, 07 Apr 2026 22:58:58 GMT</lastBuildDate>
        <item>
            <title><![CDATA[connect() - why are you so slow?]]></title>
            <link>https://blog.cloudflare.com/linux-transport-protocol-port-selection-performance/</link>
            <pubDate>Thu, 08 Feb 2024 14:00:27 GMT</pubDate>
            <description><![CDATA[ This is our story of what we learned about the connect() implementation for TCP in Linux: both its strong and weak points, how connect() latency changes under pressure, and how to open connections so that the syscall latency is deterministic and time-bound. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>It is no secret that Cloudflare is encouraging companies to deprecate their use of IPv4 addresses and move to IPv6 addresses. We have a couple of articles on the subject from this year:</p><ul><li><p><a href="/amazon-2bn-ipv4-tax-how-avoid-paying/">Amazon’s $2bn IPv4 tax – and how you can avoid paying it</a></p></li><li><p><a href="/ipv6-from-dns-pov/">Using DNS to estimate worldwide state of IPv6 adoption</a></p></li></ul><p>And many more in our <a href="/searchresults#q=IPv6&amp;sort=date%20descending&amp;f:@customer_facing_source=[Blog]&amp;f:@language=[English]">catalog</a>. To help with this, we spent time this last year investigating and implementing infrastructure to reduce our internal and egress use of IPv4 addresses. We prefer to re-allocate our addresses rather than purchase more, due to increasing costs. In this effort we discovered that our cache service is one of our bigger consumers of IPv4 addresses. Before we remove IPv4 addresses for our cache services, we first need to understand how cache works at Cloudflare.</p>
    <div>
      <h2>How does cache work at Cloudflare?</h2>
      <a href="#how-does-cache-work-at-cloudflare">
        
      </a>
    </div>
    <p>Describing the full <a href="https://developers.cloudflare.com/reference-architecture/cdn-reference-architecture/#cloudflare-cdn-architecture-and-design">architecture</a> is beyond the scope of this article; however, we can provide a basic outline:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/70ULgxsqU4zuyWYVrNn6et/8c80079d6dd93083059a875bbf48059d/image1-2.png" />
            
            </figure><ol><li><p>Internet User makes a request to pull an asset</p></li><li><p>Cloudflare infrastructure routes that request to a handler</p></li><li><p>Handler machine returns the cached asset, or on a miss,</p></li><li><p>Handler machine reaches out to the origin server (owned by a customer) to pull the requested asset</p></li></ol><p>The particularly interesting part is the cache miss case. When a website suddenly becomes very popular, many uncached assets may need to be fetched all at once. Hence we may make upwards of 50k TCP unicast connections to a single destination.</p><p>That is a lot of connections! We have strategies in place to limit the impact of this or avoid the problem altogether. But in the rare cases when it occurs, we balance these connections over two source IPv4 addresses.</p><p>Our goal is to remove the load balancing and prefer one IPv4 address. To do that, we need to understand the performance impact of two IPv4 addresses vs one.</p>
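<p>Some back-of-the-envelope arithmetic shows why one source address is tight at that scale. As a sketch, assume Linux's default ephemeral port range of 32768-60999 (an illustrative assumption, not our actual configuration):</p>

```python
import math

# Each TCP connection to a fixed (dest ip, dest port) consumes one
# (src ip, src port) pair, so the ephemeral range caps connections per IP.
lo, hi = 32768, 60999             # Linux default ip_local_port_range
ports_per_ip = hi - lo + 1        # 28,232 usable source ports per address
peak_connections = 50_000         # worst-case burst to a single destination

ips_needed = math.ceil(peak_connections / ports_per_ip)
print(ports_per_ip, ips_needed)   # 28232 2
```

<p>Two addresses comfortably cover the burst; squeezing it onto one address means driving the ephemeral range close to exhaustion, which is exactly where port selection gets slow.</p>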
    <div>
      <h2>TCP connect() performance of two source IPv4 addresses vs one IPv4 address</h2>
      <a href="#tcp-connect-performance-of-two-source-ipv4-addresses-vs-one-ipv4-address">
        
      </a>
    </div>
    <p>We leveraged a tool called <a href="https://github.com/wg/wrk">wrk</a>, and modified it to distribute connections over multiple source IP addresses. Then we ran a workload of 70k connections over 48 threads for a period of time.</p><p>During the test we measured the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/tcp_ipv4.c#L201">tcp_v4_connect()</a> with the BCC libbpf-tools <a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/funclatency.c">funclatency</a> tool to gather latency metrics as time progresses.</p><p>Note that throughout the rest of this article, all the numbers are specific to a single machine with no production traffic. We are making the assumption that if we can improve a worst-case scenario in an algorithm on a best-case machine, the results can be extrapolated to production. Lock contention was specifically taken out of the equation, but it will have production implications.</p>
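<p>funclatency traces the kernel function directly. For intuition, here is a rough userspace approximation (our own sketch, not the tooling used above) that times each connect() end-to-end and buckets the latencies by powers of ten, mirroring how the charts are read:</p>

```python
import math
import socket
import time

def timed_connects(dest, n):
    """Time n TCP connect() calls end-to-end; return latencies in ns."""
    samples = []
    for _ in range(n):
        sk = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        t0 = time.perf_counter_ns()
        sk.connect(dest)
        samples.append(time.perf_counter_ns() - t0)
        sk.close()
    return samples

def bucket_by_power_of_ten(samples_ns):
    """Histogram latencies into powers-of-ten buckets (1us, 10us, ...)."""
    hist = {}
    for ns in samples_ns:
        bucket = 10 ** int(math.log10(ns)) if ns > 0 else 0
        hist[bucket] = hist.get(bucket, 0) + 1
    return dict(sorted(hist.items()))
```

<p>Pointed at a local listener, a healthy workload lands in one or two adjacent buckets; a bimodal workload shows up as two well-separated clusters.</p>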
    <div>
      <h3>Two IPv4 addresses</h3>
      <a href="#two-ipv4-addresses">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1q7v3WNgI5X3JQg5ua0B8g/5b557ca762a08422badae379233dee76/image6.png" />
            
            </figure><p>The y-axis shows buckets of nanoseconds in powers of ten. The x-axis represents the number of connections made per bucket. Therefore, more connections in the lower powers-of-ten buckets is better.</p><p>We can see that the majority of the connections occur in the fast case, with roughly 20k in the slow case. We should expect this bimodal distribution to grow over time as wrk continuously closes and establishes connections.</p><p>Now let us look at the performance of one IPv4 address under the same conditions.</p>
    <div>
      <h3>One IPv4 address</h3>
      <a href="#one-ipv4-address">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6kpueuXS3SbBTIig306IDN/b27ab899656fbfc0bf3c885a44fb04a4/image8.png" />
            
            </figure><p>In this case, the bimodal distribution is even more pronounced. Over half of the total connections fall in the slow case rather than the fast! We may conclude that simply switching to one IPv4 address for cache egress is going to introduce significant latency on our connect() syscalls.</p><p>The next logical step is to figure out where this bottleneck is happening.</p>
    <div>
      <h2>Port selection is not what you think it is</h2>
      <a href="#port-selection-is-not-what-you-think-it-is">
        
      </a>
    </div>
    <p>To investigate this, we first took a flame graph of a production machine:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1tFwadYDdC5UVK78j4yKsv/64aca09189acba5bf3dab2e043265e0f/image7.png" />
            
            </figure><p>Flame graphs depict a run-time function call stack of a system. The y-axis depicts call-stack depth, and the x-axis depicts each function as a horizontal bar whose width represents the number of times the function was sampled. Check out this in-depth <a href="https://www.brendangregg.com/flamegraphs.html">guide</a> about flame graphs for more details.</p><p>Most of the samples are taken in the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L1000"><code>__inet_hash_connect()</code></a>. We can see that there are also many samples for <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L544"><code>__inet_check_established()</code></a>, with some lock contention sampled in between. We have a better picture of a potential bottleneck, but we do not have a consistent test to compare against.</p><p>Wrk introduces a bit more variability than we would like to see. Still focusing on the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/tcp_ipv4.c#L201"><code>tcp_v4_connect()</code></a>, we performed another synthetic test with a homegrown benchmark tool to test one IPv4 address. A tool such as <a href="https://github.com/ColinIanKing/stress-ng">stress-ng</a> may also be used, but some modification is necessary to implement the socket option <a href="https://man7.org/linux/man-pages/man7/ip.7.html"><code>IP_LOCAL_PORT_RANGE</code></a>. There is more about that socket option later.</p><p>We are now going to ensure a deterministic number of connections, and remove lock contention from the problem. The result is something like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5d6tJum5BBe3jsLRqhXtFN/7952fb3d0a3da761de158fae4f925eb5/Screenshot-2024-02-07-at-15.54.29.png" />
            
            </figure><p>On the y-axis we measured the latency between the start and end of a connect() syscall. The x-axis denotes when a connect() was called. Green dots are even numbered ports, and red dots are odd numbered ports. The orange line is a linear regression on the data.</p><p>The disparity in average port allocation time between even and odd ports provides us with a major clue: connections with odd ports are established significantly slower than those with even ports. Further, odd ports are not interleaved with earlier connections. This implies we exhaust our even ports before attempting the odd ones. The chart also confirms our bimodal distribution.</p>
    <div>
      <h3>__inet_hash_connect()</h3>
      <a href="#__inet_hash_connect">
        
      </a>
    </div>
    <p>At this point we wanted to understand this split a bit better. We know from the flame graph and the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L1000"><code>__inet_hash_connect()</code></a> that this holds the algorithm for port selection. For context, this function is responsible for associating the socket to a source port in a late bind. If a port was previously provided with bind(), the algorithm just tests for a unique TCP 4-tuple (src ip, src port, dest ip, dest port) and ignores port selection.</p><p>Before we dive in, there is a little bit of setup work that happens first. Linux first generates a time-based hash that is used as the basis for the starting port, then adds randomization, and then puts that information into an offset variable. This is always set to an even integer.</p><p><a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L1043">net/ipv4/inet_hashtables.c</a></p>
            <pre><code>   offset &amp;= ~1U;
    
other_parity_scan:
    port = low + offset;
    for (i = 0; i &lt; remaining; i += 2, port += 2) {
        if (unlikely(port &gt;= high))
            port -= remaining;

        inet_bind_bucket_for_each(tb, &amp;head-&gt;chain) {
            if (inet_bind_bucket_match(tb, net, port, l3mdev)) {
                if (!check_established(death_row, sk, port, &amp;tw))
                    goto ok;
                goto next_port;
            }
        }
    }

    offset++;
    if ((offset &amp; 1) &amp;&amp; remaining &gt; 1)
        goto other_parity_scan;</code></pre>
            <p>Then in a nutshell: loop through one half of ports in our range (all even or all odd ports) before looping through the other half of ports (all odd or all even ports respectively) for each connection. Specifically, this is a variation of the <a href="https://datatracker.ietf.org/doc/html/rfc6056#section-3.3.4">Double-Hash Port Selection Algorithm</a>. We will ignore the bind bucket functionality since that is not our main concern.</p><p>Depending on your port range, you either start with an even port or an odd port. In our case, our low port, 9024, is even. Then the port is picked by adding the offset to the low port:</p><p><a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L1045">net/ipv4/inet_hashtables.c</a></p>
            <pre><code>port = low + offset;</code></pre>
            <p>If low was odd, we will have an odd starting port because odd + even = odd.</p><p>There is a bit too much going on in the loop to explain in text. I have an example instead:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6uVqtAUR07epRRKqQbHWkp/2a5671b1dd3c68c012e7171b8103a53e/image5.png" />
            
            </figure><p>This example is bound by 8 ports and 8 possible connections. All ports start unused. As a port is used up, it is grayed out. Green boxes represent the next chosen port. All other colors represent open ports. Blue arrows are even port iterations of offset, and red are the odd port iterations of offset. Note that the offset is randomly picked, and once we cross over to the odd range, the offset is incremented by one.</p><p>For each selection of a port, the algorithm then makes a call to the function <code>check_established()</code> which dereferences <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L544"><code>__inet_check_established()</code></a>. This function loops over sockets to verify that the TCP 4-tuple is unique. The takeaway is that the socket list in the function is usually short, but it grows as more unique TCP 4-tuples are introduced to the system. Longer socket lists may eventually slow down port selection. We have a blog post on <a href="/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/">ephemeral port exhaustion</a> that dives into the socket list and port uniqueness criteria.</p><p>At this point, we can summarize that the odd/even port split is what is causing our performance bottleneck. During the investigation, it was not obvious to me (or maybe even you) why the offset was initially calculated the way it was, and why the odd/even port split was introduced. After some git-archaeology, the decisions became clearer.</p>
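<p>To make the scan order concrete, here is a small Python model of the loop (our simplification of <code>__inet_hash_connect()</code>: bind buckets and the <code>check_established()</code> test are ignored, and every port is assumed free):</p>

```python
import random

def scan_order(low, high, offset=None):
    """Return ports in the order the parity scan visits them: one parity
    half of the range first, then the other, each half wrapping at the top."""
    remaining = high - low + 1
    if offset is None:
        offset = random.randrange(remaining) & ~1  # forced even, as in the kernel
    order = []
    for _ in range(2):                 # one parity half, then the other
        port = low + offset
        for _ in range(0, remaining, 2):
            if port > high:
                port -= remaining      # wrap around within the range
            order.append(port)
            port += 2
        offset += 1                    # odd offset selects the other parity
    return order

# With an even low port (like our 9024) and an even offset, every even
# port is visited before any odd one:
scan_order(9024, 9031, offset=2)
# → [9026, 9028, 9030, 9024, 9027, 9029, 9031, 9025]
```

<p>A connect()-heavy workload therefore burns through the entire even half of the range before the first odd port is ever tried, which is the bimodal split visible in the charts.</p>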
    <div>
      <h3>Security considerations</h3>
      <a href="#security-considerations">
        
      </a>
    </div>
    <p>Port selection has been shown to be usable for device <a href="https://lwn.net/Articles/910435/">fingerprinting</a> in the past. This led the authors to introduce more randomization into the initial port selection. Previously, ports were picked predictably, based solely on their initial hash and a salt value that does not change often. This helps explain the offset, but does not explain the split.</p>
    <div>
      <h3>Why the even/odd split?</h3>
      <a href="#why-the-even-odd-split">
        
      </a>
    </div>
    <p>Prior to this <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=07f4c90062f8fc7c8c26f8f95324cbe8fa3145a5">patch</a> and that <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1580ab63fc9a03593072cc5656167a75c4f1d173">patch</a>, connect()-heavy and bind()-heavy workloads could conflict over the same ports. Thus, to avoid those conflicts, the split was added: an even offset was chosen for the connect() workloads, and an odd offset for the bind() workloads. However, the split only works well for connect() workloads that do not exceed one half of the allotted port range.</p><p>Now we have an explanation for the flame graph and charts. So what can we do about this?</p>
    <div>
      <h2>User space solution (kernel &lt; 6.8)</h2>
      <a href="#user-space-solution-kernel-6-8">
        
      </a>
    </div>
    <p>We have a couple of strategies that would work best for us. Infrastructure or architectural strategies are not considered due to the significant development effort. Instead, we prefer to tackle the problem where it occurs.</p><h3>Select, test, repeat</h3><p>For the “select, test, repeat” approach, you may have code that ends up looking like this:</p>
            <pre><code>sys = get_ip_local_port_range()  # system ephemeral range: (sys.lo, sys.hi)
estab = 0                        # connections established so far
wanted = sys.hi - sys.lo + 1     # try to fill the whole range

while estab &lt; wanted:
    # pick a candidate source port uniformly at random
    random_port = random.randint(sys.lo, sys.hi)
    connection = attempt_connect(random_port)
    if connection is None:
        # port already in use: rinse and repeat
        continue
    estab += 1</code></pre>
            <p>The algorithm simply loops, randomly picking a port from the system port range each iteration and testing that the connect() worked. If not, rinse and repeat until range exhaustion.</p><p>This approach is good for up to ~70-80% port range utilization, but may take roughly eight to twelve attempts per connection as we approach exhaustion. The major downside to this approach is the extra syscall overhead on conflict. To reduce this overhead, we can consider another approach that lets the kernel still select the port for us.</p><h3>Select port by random shifting range</h3><p>This approach leverages the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=91d0b78c5177f3e42a4d8738af8ac19c3a90d002"><code>IP_LOCAL_PORT_RANGE</code></a> socket option. With it, we were able to achieve performance like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Uz8whp12VuvqvKTDnE1u9/4701177d739bdffe2a2399213cf72941/Screenshot-2024-02-07-at-16.00.22.png" />
            
            </figure><p>That is much better! The chart also introduces black dots that represent errored connections. However, they have a tendency to clump at the very end of our port range as we approach exhaustion. This is not dissimilar to what we may see in “<a href="#selecttestrepeat">select, test, repeat</a>”.</p><p>The way this solution works is something like:</p>
            <pre><code>IP_BIND_ADDRESS_NO_PORT = 24
IP_LOCAL_PORT_RANGE = 51

sys = get_local_port_range()      # system ephemeral range: (sys.lo, sys.hi)
size = 1000                       # width of our custom window
# randomly shift the window inside the system range
offset = randint(sys.lo, sys.hi - size)
window.lo = offset
window.hi = offset + size

sk = socket(AF_INET, SOCK_STREAM)
sk.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
# low bound in the lower 16 bits, high bound in the upper 16 bits
encoded = pack("@I", window.lo | (window.hi &lt;&lt; 16))
sk.setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE, encoded)
sk.bind((src_ip, 0))              # port 0: let the kernel pick at connect()
sk.connect((dest_ip, dest_port))</code></pre>
            <p>We first fetch the system's local port range, define a custom window size, and then randomly shift the window within the system range. Introducing this randomization lets the kernel start port selection randomly at either an odd or an even port, and reduces the loop's search space to the custom window.</p><p>We tested with a few different window sizes, and determined that a window of five hundred or one thousand ports works fairly well for our port range:</p>
<table>
<thead>
  <tr>
    <th><span>Window size</span></th>
    <th><span>Errors</span></th>
    <th><span>Total test time</span></th>
    <th><span>Connections/second</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>500</span></td>
    <td><span>868</span></td>
    <td><span>~1.8 seconds</span></td>
    <td><span>~30,139</span></td>
  </tr>
  <tr>
    <td><span>1,000</span></td>
    <td><span>1,129</span></td>
    <td><span>~2 seconds</span></td>
    <td><span>~27,260</span></td>
  </tr>
  <tr>
    <td><span>5,000</span></td>
    <td><span>4,037</span></td>
    <td><span>~6.7 seconds</span></td>
    <td><span>~8,405</span></td>
  </tr>
  <tr>
    <td><span>10,000</span></td>
    <td><span>6,695</span></td>
    <td><span>~17.7 seconds</span></td>
    <td><span>~3,183</span></td>
  </tr>
</tbody>
</table><p>As the window size increases, the error rate increases. That is because a larger window leaves fewer possible random offsets. A max window size of 56,512 is no different from using the kernel's default behavior. Therefore, a smaller window size works better, but you do not want it too small either: a window size of one is no different from “<a href="#selecttestrepeat">select, test, repeat</a>”.</p><p>In kernels &gt;= 6.8, we can do even better.</p><h2>Kernel solution (kernel &gt;= 6.8)</h2><p>A new <a href="https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=207184853dbd">patch</a> was introduced that eliminates the need for the window shifting. This solution will be available in the 6.8 kernel.</p><p>Instead of picking a random window offset for <code>setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE</code>, …), as in the previous solution, we just pass the full system port range to activate the new behavior. The code may look something like this:</p>
            <pre><code>IP_BIND_ADDRESS_NO_PORT = 24
IP_LOCAL_PORT_RANGE = 51

sys = get_local_port_range()      # system ephemeral range: (sys.lo, sys.hi)

sk = socket(AF_INET, SOCK_STREAM)
sk.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
# pass the full system range: lo in the lower 16 bits, hi in the upper 16
encoded = pack("@I", sys.lo | (sys.hi &lt;&lt; 16))
sk.setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE, encoded)
sk.bind((src_ip, 0))
sk.connect((dest_ip, dest_port))</code></pre>
            <p>Setting the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=91d0b78c5177f3e42a4d8738af8ac19c3a90d002"><code>IP_LOCAL_PORT_RANGE</code></a> option tells the kernel to use an approach similar to “<a href="#random">select port by random shifting range</a>”: the start offset is randomized to be even or odd, but the search then loops incrementally rather than skipping every other port. We end up with results like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ttWStZgNYfwftr71r8Vrt/7c333411ef01b674cc839f27ae4cbbbf/Screenshot-2024-02-07-at-16.04.24.png" />
            
            </figure><p>The performance of this approach is quite comparable to our user space implementation, albeit a little faster, due in part to general improvements and to the fact that the algorithm can always find a port given the full search space of the range. No cycles are wasted on a potentially filled sub-range.</p><p>These results are great for TCP, but what about other protocols?</p>
    <div>
      <h2>Other protocols &amp; connect()</h2>
      <a href="#other-protocols-connect">
        
      </a>
    </div>
    <p>It is worth mentioning at this point that the algorithms used for the protocols are <i>mostly</i> the same for IPv4 &amp; IPv6. Typically, the key difference is how the sockets are compared to determine uniqueness and where the port search happens. We did not compare performance for all protocols. But it is worth mentioning some similarities and differences with TCP and a couple of others.</p>
    <div>
      <h3>DCCP</h3>
      <a href="#dccp">
        
      </a>
    </div>
    <p>The DCCP protocol leverages the same port selection <a href="https://elixir.bootlin.com/linux/v6.6/source/net/dccp/ipv4.c#L115">algorithm</a> as TCP. Therefore, this protocol benefits from the recent kernel changes. It is also possible the protocol could benefit from our user space solution, but that is untested. We will let the reader exercise DCCP use-cases.</p>
    <div>
      <h3>UDP &amp; UDP-Lite</h3>
      <a href="#udp-udp-lite">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/">UDP</a> leverages a different algorithm found in the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/udp.c#L239"><code>udp_lib_get_port()</code></a>. Similar to TCP, the algorithm will loop over the whole port range space incrementally, but only if the port is not already supplied in the bind() call. The key difference between UDP and TCP is that a random number is generated as a step variable. Then, once a first port is identified, the algorithm steps from that port by the random number, relying on a uint16_t overflow to eventually loop back to the chosen port. If all ports are used, increment the port by one and repeat. There is no port splitting between even and odd ports.</p><p>The best comparison to the TCP measurements is a UDP setup similar to:</p>
            <pre><code>sk = socket(AF_INET, SOCK_DGRAM)
sk.bind((src_ip, 0))
sk.connect((dest_ip, dest_port))</code></pre>
            <p>And the results should be unsurprising with one IPv4 source address:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4UM5d0RBTgqADgVLbbqMlQ/940306c90767ba4b5e3762c6467b71ed/Screenshot-2024-02-07-at-16.06.27.png" />
            
            </figure><p>UDP fundamentally behaves differently from TCP, and there is less work overall for port lookups. The outliers in the chart represent a worst-case scenario, when we hit a fairly bad random number collision. In that case, we need to loop over more of the ephemeral range to find a port.</p><p>UDP has another problem. Given the socket option <code>SO_REUSEADDR</code>, the port you get back may conflict with another UDP socket. This is in part due to the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/udp.c#L141"><code>udp_lib_lport_inuse()</code></a> ignoring the UDP 2-tuple (src ip, src port) check given the socket option. When this happens, a new socket may overwrite a previous one. Extra care is needed in that case. We wrote in more depth about these cases in a previous <a href="/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/">blog post</a>.</p>
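<p>The overwrite hazard is easy to demonstrate. A short sketch (assuming Linux behavior): two unicast UDP sockets that both set <code>SO_REUSEADDR</code> may bind the exact same source address and port:</p>

```python
import socket

def udp_with_reuseaddr(addr, port):
    sk = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sk.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sk.bind((addr, port))
    return sk

a = udp_with_reuseaddr("127.0.0.1", 0)     # kernel picks an ephemeral port
port = a.getsockname()[1]
b = udp_with_reuseaddr("127.0.0.1", port)  # binding the SAME 2-tuple succeeds
assert b.getsockname() == a.getsockname()  # two sockets, one (src ip, src port)
```

<p>Which of the two sockets receives a given datagram is then ambiguous, hence the extra care mentioned above.</p>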
    <div>
      <h2>In summary</h2>
      <a href="#in-summary">
        
      </a>
    </div>
            <p>Cloudflare can make a lot of unicast egress connections to origin servers with popular uncached assets. To avoid port-resource exhaustion, we balance the load over a couple of IPv4 source addresses during those peak times. Then we asked: “what is the performance impact of one IPv4 source address for our connect()-heavy workloads?”. Port selection is not only difficult to get right, but is also a performance bottleneck. This is evidenced by measuring connect() latency with flame graphs and synthetic workloads. That led us to discovering TCP’s quirky port selection process, which loops over one half of your ephemeral ports before the other half for each connect().</p><p>We then proposed three solutions to solve the problem outside of adding more IP addresses or other architectural changes: “<a href="#selecttestrepeat">select, test, repeat</a>”, “<a href="#random">select port by random shifting range</a>”, and an <a href="https://man7.org/linux/man-pages/man7/ip.7.html"><code>IP_LOCAL_PORT_RANGE</code></a> socket option <a href="#kernel">solution</a> in newer kernels. And finally we closed out with honorable mentions of other protocols and their quirks.</p><p>Do not just take our numbers! Please explore and measure your own systems. With a better understanding of your workloads, you can make a good decision on which strategy works best for your needs. Even better if you come up with your own strategy!</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Protocols]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[Network]]></category>
            <guid isPermaLink="false">1C6z0btasEsz1cmdmoug0m</guid>
            <dc:creator>Frederick Lawler</dc:creator>
        </item>
        <item>
            <title><![CDATA[CVE-2022-47929: traffic control noqueue no problem?]]></title>
            <link>https://blog.cloudflare.com/cve-2022-47929-traffic-control-noqueue-no-problem/</link>
            <pubDate>Tue, 31 Jan 2023 14:00:00 GMT</pubDate>
            <description><![CDATA[ In the Linux kernel before 6.1.6, a NULL pointer dereference bug in the traffic control subsystem allows an unprivileged user to trigger a denial of service (system crash) via a crafted traffic control configuration that is set up with "tc qdisc" and "tc class" commands. ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Kt5g4yfw3QI3Gu8UzcclV/da58a3de4dc53ef2ff7130e27cbb0bf4/image1-56.png" />
            
            </figure><p>USER namespaces power the functionality of our favorite tools such as docker and podman. <a href="/live-patch-security-vulnerabilities-with-ebpf-lsm/">We wrote about Linux namespaces back in June</a> and explained them like this:</p><blockquote><p>Most of the namespaces are uncontroversial, like the UTS namespace which allows the host system to hide its hostname and time. Others are complex but straightforward - NET and NS (mount) namespaces are known to be hard to wrap your head around. Finally, there is this very special, very curious USER namespace. USER namespace is special since it allows the typically unprivileged owner to operate as "root" inside it. It's a foundation to having tools like Docker not operate as true root, and things like rootless containers.</p></blockquote><p>Due to its nature, allowing unprivileged users access to USER namespaces has always carried a great security risk. With its help the unprivileged user can in fact run code that typically requires root. This code is often under-tested and buggy. Today we will look into one such case where USER namespaces are leveraged to exploit a kernel bug that can result in an unprivileged denial of service attack.</p>
    <div>
      <h3>Enter Linux Traffic Control queue disciplines</h3>
      <a href="#enter-linux-traffic-control-queue-disciplines">
        
      </a>
    </div>
    <p>In 2019, we were exploring leveraging <a href="https://man7.org/linux/man-pages/man8/tc.8.html#DESCRIPTION">Linux Traffic Control's</a> <a href="https://tldp.org/HOWTO/Traffic-Control-HOWTO/components.html#c-qdisc">queue discipline</a> (qdisc) to schedule packets for one of our services with the <a href="https://man7.org/linux/man-pages/man8/tc-htb.8.html">Hierarchy Token Bucket</a> (HTB) <a href="https://tldp.org/HOWTO/Traffic-Control-HOWTO/classful-qdiscs.html">classful qdisc</a> strategy. Linux Traffic Control is a user-configured system to schedule and filter network packets. Queue disciplines are the strategies in which packets are scheduled. In particular, we wanted to filter and schedule certain packets from an interface, and drop others into the <a href="https://linux-tc-notes.sourceforge.net/tc/doc/sch_noqueue.txt">noqueue</a> qdisc.</p><p>noqueue is a special case qdisc, such that packets are supposed to be dropped when scheduled into it. In practice, this is not the case. Linux handles noqueue such that packets are passed through and not dropped (for the most part). The <a href="https://linux-tc-notes.sourceforge.net/tc/doc/sch_noqueue.txt">documentation</a> states as much. It also states that “It is not possible to assign the noqueue queuing discipline to physical devices or classes.” So what happens when we assign noqueue to a class?</p><p>Let's write some shell commands to show the problem in action:</p>
            <pre><code>1. $ sudo -i
2. # dev=enp0s5
3. # tc qdisc replace dev $dev root handle 1: htb default 1
4. # tc class add dev $dev parent 1: classid 1:1 htb rate 10mbit
5. # tc qdisc add dev $dev parent 1:1 handle 10: noqueue</code></pre>
            <ol><li><p>First we need to log in as root because that gives us <a href="https://man7.org/linux/man-pages/man7/capabilities.7.html#DESCRIPTION">CAP_NET_ADMIN</a>, allowing us to configure traffic control.</p></li><li><p>We then assign a network interface to a variable. These can be found with <code>ip a</code>. Virtual interfaces can be located by calling <code>ls /sys/devices/virtual/net</code>. These will match the output from <code>ip a</code>.</p></li><li><p>Our interface is currently assigned to the <a href="https://man7.org/linux/man-pages/man8/tc-pfifo_fast.8.html">pfifo_fast</a> qdisc, so we replace it with the HTB classful qdisc and assign it the handle of <code>1:</code>. We can think of this as the root node in a tree. The “default 1” configures this such that unclassified traffic will be routed directly through this qdisc, which falls back to pfifo_fast queuing (more on this later).</p></li><li><p>Next we add a class to our root qdisc <code>1:</code>, assign it to the first leaf node 1 of root 1: <code>1:1</code>, and give it some reasonable configuration defaults.</p></li><li><p>Lastly, we add the noqueue qdisc to our first leaf node in the hierarchy: <code>1:1</code>. This effectively means traffic routed here will be scheduled to noqueue.</p></li></ol><p>Assuming our setup executed without a hitch, we will receive something similar to this kernel panic:</p>
            <pre><code>BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
...
Call Trace:
&lt;TASK&gt;
htb_enqueue+0x1c8/0x370
dev_qdisc_enqueue+0x15/0x90
__dev_queue_xmit+0x798/0xd00
...
&lt;/TASK&gt;
</code></pre>
            <p>We know that only the root user can set a qdisc on an interface, so if root can crash the kernel, so what? The answer is simply never to apply the noqueue qdisc to a class id of an HTB qdisc:</p>
            <pre><code># dev=enp0s5
# tc qdisc replace dev $dev root handle 1: htb default 1
# tc class add dev $dev parent 1: classid 1:2 htb rate 10mbit // A
// B is missing, so anything not filtered into 1:2 will be pfifo_fast</code></pre>
            <p>Here, we leveraged the default case of HTB where we assign a class id 1:2 to be rate-limited (A), and implicitly did not set a qdisc to another class such as id 1:1 (B). Packets queued to (A) will be filtered to <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_htb.c#L620">HTB_DIRECT</a> and packets queued to (B) will be filtered into pfifo_fast.</p><p>Because we were not familiar with this part of the codebase, we <a href="https://lore.kernel.org/all/CALrw=nEdA0asN4n7B3P2TyHKJ+UBPvoAiMrwkT42=fqp2-CPiw@mail.gmail.com/">notified</a> the mailing lists and created a ticket. The bug did not seem all that important to us at the time.</p><p>Fast-forward to 2022: we were <a href="https://lwn.net/Articles/903580/">pushing</a> USER namespace creation hardening. We extended the Linux LSM framework with a new LSM hook, <a href="https://lore.kernel.org/all/20220815162028.926858-1-fred@cloudflare.com/">userns_create</a>, to leverage <a href="/live-patch-security-vulnerabilities-with-ebpf-lsm/">eBPF LSM</a> for our protections, and encouraged others to do so as well. Recently, while combing our ticket backlog, we rethought this bug and asked ourselves, “can we leverage USER namespaces to trigger the bug?” The short answer is yes!</p>
    <div>
      <h3>Demonstrating the bug</h3>
      <a href="#demonstrating-the-bug">
        
      </a>
    </div>
    <p>The exploit can be performed with any classful qdisc that assumes the <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/include/net/sch_generic.h#L73">struct Qdisc.enqueue</a> function is not NULL (more on this later), but in this case we are demonstrating it with just HTB.</p>
            <pre><code>$ unshare -rU --net
$ dev=lo
$ tc qdisc replace dev $dev root handle 1: htb default 1
$ tc class add dev $dev parent 1: classid 1:1 htb rate 10mbit
$ tc qdisc add dev $dev parent 1:1 handle 10: noqueue
$ ping -I $dev -w 1 -c 1 1.1.1.1</code></pre>
            <p>We use the “lo” interface to demonstrate that this bug is triggerable with a virtual interface. This is important for containers because they are fed virtual interfaces most of the time, and not the physical interface. Because of that, we can use a container to crash the host as an unprivileged user, and thus perform a denial of service attack.</p>
    <div>
      <h3>Why does that work?</h3>
      <a href="#why-does-that-work">
        
      </a>
    </div>
    <p>To understand the problem a bit better, we need to look back to the original <a href="https://lore.kernel.org/all/1440703299-21243-1-git-send-email-phil@nwl.cc/#t">patch series</a>, and specifically the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d66d6c3152e8d5a6db42a56bf7ae1c6cae87ba48">commit</a> that introduced the bug. Before this series, achieving noqueue on interfaces relied on a hack that would set a device qdisc to noqueue if the device had <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_api.c#L1263">tx_queue_len = 0</a>. The commit d66d6c3152e8 (“net: sched: register noqueue qdisc”) circumvents this by explicitly allowing noqueue to be added with the <code>tc</code> command, without needing to get around that limitation.</p><p>The kernel decides whether it is in a noqueue case simply by checking whether a qdisc has a <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/core/dev.c#L4214">NULL enqueue()</a> function. Recall from earlier that noqueue does not necessarily drop packets in practice: when that check fails, the logic that follows implements the noqueue behavior. To make the check fail, the author had to <i>cheat</i> a reassignment from <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_generic.c#L628">noop_enqueue()</a> to <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_api.c#L142">NULL</a> by setting <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_generic.c#L683">enqueue = NULL</a> in the init function, which runs <i>well after</i> <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_api.c#L131">register_qdisc()</a> at runtime.</p><p>Here is where classful qdiscs come into play. In this call path the enqueue function is no longer NULL: it is set to HTB's (in our example), so the kernel is allowed to enqueue the struct skb to a <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/core/dev.c#L3778">queue</a> by calling <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_htb.c#L612">htb_enqueue()</a>. Once there, HTB performs a <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_htb.c#L216">lookup</a> to pull in the qdisc assigned to a leaf node, and eventually attempts to <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_htb.c#L635">queue</a> the struct skb to that qdisc, which ultimately reaches this function:</p><p><i>include/net/sch_generic.h</i></p>
            <pre><code>static inline int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
				struct sk_buff **to_free)
{
	qdisc_calculate_pkt_len(skb, sch);
	return sch-&gt;enqueue(skb, sch, to_free); // sch-&gt;enqueue == NULL
}</code></pre>
            <p>We can see that the enqueueing process is fairly agnostic to physical/virtual interfaces. The permission and validation checks are done when adding a queue to an interface, which is why the classful qdiscs assume enqueue is not NULL. This knowledge leads us to a few solutions to consider.</p>
    <div>
      <h3>Solutions</h3>
      <a href="#solutions">
        
      </a>
    </div>
    <p>We had a few solutions, ranging from what we thought was best to worst:</p><ol><li><p>Follow the tc-noqueue documentation and do not allow noqueue to be assigned to a classful qdisc</p></li><li><p>Instead of checking for <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/core/dev.c#L4214">NULL</a>, check for <a href="https://elixir.bootlin.com/linux/v6.2-rc1/source/net/sched/sch_generic.c#L687">struct noqueue_qdisc_ops</a>, and reset noqueue back to noop_enqueue</p></li><li><p>For each classful qdisc, check for NULL and fall back</p></li></ol><p>While we ultimately went for the first option: <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=96398560f26aa07e8f2969d73c8197e6a6d10407">"disallow noqueue for qdisc classes"</a>, the third option creates a lot of churn in the code and does not solve the problem completely: future qdisc implementations could forget that important check, and so could the maintainers. The reason for passing on the second option, however, is a bit more interesting.</p><p>We did not follow that approach because we first needed to answer these questions:</p><p>Why not allow noqueue for classful qdiscs?</p><p>This contradicts the documentation. The documentation does have some precedent for not being totally followed in practice, but we would need to update it to reflect the current state. That is fine to do, but other than removing the NULL dereference bug, it does not address the behavior change problem.</p><p>What behavior changes if we do allow noqueue for qdiscs?</p><p>This is harder to answer, because we need to determine what that behavior should be. Currently, when noqueue is applied as the root qdisc for an interface, packets are essentially passed through. Claiming a fallback for classes is a different matter. Each class may have its own fallback rules, so how do we know what the right fallback is? Sometimes in HTB the fallback is pass-through with HTB_DIRECT, sometimes it is pfifo_fast. What about the other classes? Perhaps instead we should fall back to the default noqueue behavior, as for root qdiscs?</p><p>We felt that going down this route would only add confusion and additional complexity to queuing. One could also argue that such a change is a feature addition rather than a bug fix. Suffice it to say, adhering to the current documentation seemed the more appealing approach: it prevents the vulnerability now, while something else can be worked out later.</p>
    <div>
      <h3>Takeaways</h3>
      <a href="#takeaways">
        
      </a>
    </div>
    <p>First and foremost, apply this <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=96398560f26aa07e8f2969d73c8197e6a6d10407">patch</a> as soon as possible. And consider hardening USER namespaces on your systems by setting <code>sysctl -w</code> <a href="https://sources.debian.org/patches/linux/3.16.56-1+deb8u1/debian/add-sysctl-to-disallow-unprivileged-CLONE_NEWUSER-by-default.patch/"><code>kernel.unprivileged_userns_clone</code></a><code>=0</code>, which only lets root create USER namespaces in Debian kernels, or setting <code>sysctl -w</code> <a href="https://docs.kernel.org/admin-guide/sysctl/user.html?highlight=max_user_namespaces"><code>user.max_user_namespaces</code></a><code>=[number]</code> for a process hierarchy. You could also backport these two patches: <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7cd4c5c2101cb092db00f61f69d24380cf7a0ee8"><code>security_create_user_ns()</code></a> and the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed5d44d42c95e8a13bb54e614d2269c8740667f9">SELinux implementation</a> (now in Linux 6.1.x), which allow you to protect your systems with either eBPF or SELinux. In extreme cases, if you are sure you are not using USER namespaces, you might consider turning the feature off with <code>CONFIG_USERNS=n</code>. This is just one example of many where namespaces are leveraged to perform an attack, and more are sure to crop up in varying levels of severity in the future.</p><p>Special thanks to Ignat Korchagin and Jakub Sitnicki for code reviews and helping demonstrate the bug in practice.</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[CVE]]></category>
            <guid isPermaLink="false">KiLg1KENXvFAT8ADCa4hN</guid>
            <dc:creator>Frederick Lawler</dc:creator>
        </item>
        <item>
            <title><![CDATA[Live-patching security vulnerabilities inside the Linux kernel with eBPF Linux Security Module]]></title>
            <link>https://blog.cloudflare.com/live-patch-security-vulnerabilities-with-ebpf-lsm/</link>
            <pubDate>Wed, 29 Jun 2022 11:45:00 GMT</pubDate>
            <description><![CDATA[ Learn how to patch Linux security vulnerabilities without rebooting the hardware and how to tighten the security of your Linux operating system with eBPF Linux Security Module ]]></description>
            <content:encoded><![CDATA[ <p></p><p><a href="https://www.kernel.org/doc/html/latest/admin-guide/LSM/index.html">Linux Security Modules</a> (LSM) is a hook-based framework for implementing security policies and Mandatory Access Control in the Linux kernel. Until recently users looking to implement a security policy had just two options. Configure an existing LSM module such as AppArmor or SELinux, or write a custom kernel module.</p><p><a href="https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.7">Linux 5.7</a> introduced a third way: <a href="https://docs.kernel.org/bpf/prog_lsm.html">LSM extended Berkeley Packet Filters (eBPF)</a> (LSM BPF for short). LSM BPF allows developers to write granular policies without configuration or loading a kernel module. LSM BPF programs are verified on load, and then executed when an LSM hook is reached in a call path.</p>
    <div>
      <h2>Let’s solve a real-world problem</h2>
      <a href="#lets-solve-a-real-world-problem">
        
      </a>
    </div>
    <p>Modern operating systems provide facilities allowing "partitioning" of kernel resources. For example FreeBSD has "jails", Solaris has "zones". Linux is different - it provides a set of seemingly independent facilities each allowing isolation of a specific resource. These are called "namespaces" and have been growing in the kernel for years. They are the base of popular tools like Docker, lxc or firejail. Many of the namespaces are uncontroversial, like the UTS namespace, which isolates the hostname and NIS domain name. Others are more complex - NET and NS (mount) namespaces are known to be hard to wrap your head around. Finally, there is the very special, very curious USER namespace.</p><p>The USER namespace is special, since it allows the owner to operate as "root" inside it. How it works is beyond the scope of this blog post; suffice to say it is the foundation that lets tools like Docker avoid operating as true root, and that enables things like rootless containers.</p><p>Due to its nature, allowing unprivileged users access to the USER namespace has always carried a great security risk. One such risk is privilege escalation.</p><p>Privilege escalation is a <a href="https://www.cloudflare.com/learning/security/what-is-an-attack-surface/">common attack surface</a> for operating systems. One way users may gain privilege is by mapping their namespace to the root namespace via the unshare <a href="https://en.wikipedia.org/wiki/System_call">syscall</a>, specifying the <i>CLONE_NEWUSER</i> flag. This tells unshare to create a new user namespace with full permissions, and maps the new user and group ID to the previous namespace. You can use the <a href="https://man7.org/linux/man-pages/man1/unshare.1.html">unshare(1)</a> program to map root to our original namespace:</p>
            <pre><code>$ id
uid=1000(fred) gid=1000(fred) groups=1000(fred) …
$ unshare -rU
# id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
# cat /proc/self/uid_map
         0       1000          1</code></pre>
            <p>In most cases using unshare is harmless, and it is intended to run with lower privileges. However, this syscall has been known to be used to <a href="https://nvd.nist.gov/vuln/detail/CVE-2022-0492">escalate privileges</a>.</p><p>The <i>clone</i> and <i>clone3</i> syscalls are worth looking into as well, since they can also pass <i>CLONE_NEWUSER</i>. However, for this post we’re going to focus on unshare.</p><p>Debian solved this problem with the <a href="https://sources.debian.org/patches/linux/3.16.56-1+deb8u1/debian/add-sysctl-to-disallow-unprivileged-CLONE_NEWUSER-by-default.patch/">"add sysctl to disallow unprivileged CLONE_NEWUSER by default"</a> patch, but it was not mainlined. Another similar patch, <a href="https://lore.kernel.org/all/1453502345-30416-3-git-send-email-keescook@chromium.org/">"sysctl: allow CLONE_NEWUSER to be disabled"</a>, attempted to mainline this behavior and was met with pushback. One critique was the <a href="https://lore.kernel.org/all/87poq5y0jw.fsf@x220.int.ebiederm.org/">inability to toggle this feature</a> for specific applications. In the article “<a href="https://lwn.net/Articles/673597/">Controlling access to user namespaces</a>” the author wrote: “... the current patches do not appear to have an easy path into the mainline.” And as we can see, the patches were ultimately not included in the vanilla kernel.</p>
    <div>
      <h2>Our solution - LSM BPF</h2>
      <a href="#our-solution-lsm-bpf">
        
      </a>
    </div>
    <p>Since upstreaming code that restricts USER namespaces seemed not to be an option, we decided to use LSM BPF to work around these issues. This requires no modifications to the kernel and allows us to express complex rules guarding access.</p>
    <div>
      <h3>Track down an appropriate hook candidate</h3>
      <a href="#track-down-an-appropriate-hook-candidate">
        
      </a>
    </div>
    <p>First, let us track down the syscall we’re targeting. We can find the prototype in the <a href="https://elixir.bootlin.com/linux/v5.18/source/include/linux/syscalls.h#L608"><i>include/linux/syscalls.h</i></a> file. From there the implementation is not as obvious to track down, but the line:</p>
            <pre><code>/* kernel/fork.c */</code></pre>
            <p>Gives us a clue of where to look next in <a href="https://elixir.bootlin.com/linux/v5.18/source/kernel/fork.c#L3201"><i>kernel/fork.c</i></a>. There a call to <a href="https://elixir.bootlin.com/linux/v5.18/source/kernel/fork.c#L3082"><i>ksys_unshare()</i></a> is made. Digging through that function, we find a call to <a href="https://elixir.bootlin.com/linux/v5.18/source/kernel/fork.c#L3129"><i>unshare_userns()</i></a>. This looks promising.</p><p>Up to this point, we’ve identified the syscall implementation, but the next question to ask is what hooks are available for us to use? Because we know from the <a href="https://man7.org/linux/man-pages/man2/unshare.2.html">man-pages</a> that unshare is used to mutate tasks, we look at the task-based hooks in <a href="https://elixir.bootlin.com/linux/v5.18/source/include/linux/lsm_hooks.h#L605"><i>include/linux/lsm_hooks.h</i></a>. Back in the function <a href="https://elixir.bootlin.com/linux/v5.18/source/kernel/user_namespace.c#L171"><i>unshare_userns()</i></a> we saw a call to <a href="https://elixir.bootlin.com/linux/v5.18/source/kernel/cred.c#L252"><i>prepare_creds()</i></a>. This looks very familiar to the <a href="https://elixir.bootlin.com/linux/v5.18/source/include/linux/lsm_hooks.h#L624"><i>cred_prepare</i></a> hook. To verify we have our match via <a href="https://elixir.bootlin.com/linux/v5.18/source/kernel/cred.c#L291"><i>prepare_creds()</i></a>, we see a call to the security hook <a href="https://elixir.bootlin.com/linux/v5.18/source/security/security.c#L1706"><i>security_prepare_creds()</i></a> which ultimately calls the hook:</p>
            <pre><code>…
rc = call_int_hook(cred_prepare, 0, new, old, gfp);
…</code></pre>
            <p>Without going much further down this rabbit hole, we know this is a good hook to use because <i>prepare_creds()</i> is called right before <i>create_user_ns()</i> in <a href="https://elixir.bootlin.com/linux/v5.18/source/kernel/user_namespace.c#L181"><i>unshare_userns()</i></a>, which is the operation we’re trying to block.</p>
    <div>
      <h3>LSM BPF solution</h3>
      <a href="#lsm-bpf-solution">
        
      </a>
    </div>
    <p>We’re going to compile with the <a href="https://nakryiko.com/posts/bpf-core-reference-guide/#defining-own-co-re-relocatable-type-definitions">eBPF Compile Once - Run Everywhere (CO-RE)</a> approach. This allows us to compile on one architecture and load on another. But we’re going to target x86_64 specifically. LSM BPF for ARM64 is still in development, and the following code will not run on that architecture. Watch the <a href="https://lore.kernel.org/bpf/">BPF mailing list</a> to follow the progress.</p><p>This solution was tested on kernel versions &gt;= 5.15 configured with the following:</p>
            <pre><code>BPF_EVENTS
BPF_JIT
BPF_JIT_ALWAYS_ON
BPF_LSM
BPF_SYSCALL
BPF_UNPRIV_DEFAULT_OFF
DEBUG_INFO_BTF
DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT
DYNAMIC_FTRACE
FUNCTION_TRACER
HAVE_DYNAMIC_FTRACE</code></pre>
            <p>A boot option <code>lsm=bpf</code> may be necessary if <code>CONFIG_LSM</code> does not contain “bpf” in the list.</p><p>Let’s start with our preamble:</p><p><i>deny_unshare.bpf.c</i>:</p>
            <pre><code>#include &lt;linux/bpf.h&gt;
#include &lt;linux/capability.h&gt;
#include &lt;linux/errno.h&gt;
#include &lt;linux/sched.h&gt;
#include &lt;linux/types.h&gt;

#include &lt;bpf/bpf_tracing.h&gt;
#include &lt;bpf/bpf_helpers.h&gt;
#include &lt;bpf/bpf_core_read.h&gt;

#define X86_64_UNSHARE_SYSCALL 272
#define UNSHARE_SYSCALL X86_64_UNSHARE_SYSCALL</code></pre>
            <p>Next we set up our necessary structures for CO-RE relocation in the following way:</p><p><i>deny_unshare.bpf.c</i>:</p>
            <pre><code>…

typedef unsigned int gfp_t;

struct pt_regs {
	long unsigned int di;
	long unsigned int orig_ax;
} __attribute__((preserve_access_index));

typedef struct kernel_cap_struct {
	__u32 cap[_LINUX_CAPABILITY_U32S_3];
} __attribute__((preserve_access_index)) kernel_cap_t;

struct cred {
	kernel_cap_t cap_effective;
} __attribute__((preserve_access_index));

struct task_struct {
    unsigned int flags;
    const struct cred *cred;
} __attribute__((preserve_access_index));

char LICENSE[] SEC("license") = "GPL";

…</code></pre>
            <p>We don’t need to fully flesh out the structs; we just need the absolute minimum information a program needs to function. CO-RE will do whatever is necessary to perform the relocations for your kernel. This makes writing LSM BPF programs easy!</p><p><i>deny_unshare.bpf.c</i>:</p>
            <pre><code>SEC("lsm/cred_prepare")
int BPF_PROG(handle_cred_prepare, struct cred *new, const struct cred *old,
             gfp_t gfp, int ret)
{
    struct pt_regs *regs;
    struct task_struct *task;
    kernel_cap_t caps;
    int syscall;
    unsigned long flags;

    // If previous hooks already denied, go ahead and deny this one
    if (ret) {
        return ret;
    }

    task = bpf_get_current_task_btf();
    regs = (struct pt_regs *) bpf_task_pt_regs(task);
    // In x86_64 orig_ax has the syscall interrupt stored here
    syscall = regs-&gt;orig_ax;
    caps = task-&gt;cred-&gt;cap_effective;

    // Only process UNSHARE syscall, ignore all others
    if (syscall != UNSHARE_SYSCALL) {
        return 0;
    }

    // PT_REGS_PARM1_CORE pulls the first parameter passed into the unshare syscall
    flags = PT_REGS_PARM1_CORE(regs);

    // Ignore any unshare that does not have CLONE_NEWUSER
    if (!(flags &amp; CLONE_NEWUSER)) {
        return 0;
    }

    // Allow tasks with CAP_SYS_ADMIN to unshare (already root)
    if (caps.cap[CAP_TO_INDEX(CAP_SYS_ADMIN)] &amp; CAP_TO_MASK(CAP_SYS_ADMIN)) {
        return 0;
    }

    return -EPERM;
}</code></pre>
            <p>Creating the program is the first step; the second is loading and attaching the program to our desired hook. There are several ways to do this: the <a href="https://github.com/cilium/ebpf">Cilium ebpf</a> project, <a href="https://github.com/libbpf/libbpf-rs">Rust bindings</a>, and others on the <a href="https://ebpf.io/projects/">ebpf.io</a> project landscape page. We’re going to use native libbpf.</p><p><i>deny_unshare.c</i>:</p>
            <pre><code>#include &lt;bpf/libbpf.h&gt;
#include &lt;unistd.h&gt;
#include "deny_unshare.skel.h"

static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
    return vfprintf(stderr, format, args);
}

int main(int argc, char *argv[])
{
    struct deny_unshare_bpf *skel;
    int err;

    libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
    libbpf_set_print(libbpf_print_fn);

    // Loads and verifies the BPF program
    skel = deny_unshare_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "failed to load and verify BPF skeleton\n");
        goto cleanup;
    }

    // Attaches the loaded BPF program to the LSM hook
    err = deny_unshare_bpf__attach(skel);
    if (err) {
        fprintf(stderr, "failed to attach BPF skeleton\n");
        goto cleanup;
    }

    printf("LSM loaded! ctrl+c to exit.\n");

    // The BPF link is not pinned, therefore exiting will remove program
    for (;;) {
        fprintf(stderr, ".");
        sleep(1);
    }

cleanup:
    deny_unshare_bpf__destroy(skel);
    return err;
}</code></pre>
            <p>Lastly, to compile, we use the following Makefile:</p><p><i>Makefile</i>:</p>
            <pre><code>CLANG ?= clang-13
LLVM_STRIP ?= llvm-strip-13
ARCH := x86
INCLUDES := -I/usr/include -I/usr/include/x86_64-linux-gnu
LIBS_DIR := -L/usr/lib/lib64 -L/usr/lib/x86_64-linux-gnu
LIBS := -lbpf -lelf

.PHONY: all clean run

all: deny_unshare.skel.h deny_unshare.bpf.o deny_unshare

run: all
	sudo ./deny_unshare

clean:
	rm -f *.o
	rm -f deny_unshare.skel.h

#
# BPF is kernel code. We need to pass -D__KERNEL__ to refer to fields present
# in the kernel version of pt_regs struct. uAPI version of pt_regs (from ptrace)
# has different field naming.
# See: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fd56e0058412fb542db0e9556f425747cf3f8366
#
deny_unshare.bpf.o: deny_unshare.bpf.c
	$(CLANG) -g -O2 -Wall -target bpf -D__KERNEL__ -D__TARGET_ARCH_$(ARCH) $(INCLUDES) -c $&lt; -o $@
	$(LLVM_STRIP) -g $@ # Removes debug information

deny_unshare.skel.h: deny_unshare.bpf.o
	sudo bpftool gen skeleton $&lt; &gt; $@

deny_unshare: deny_unshare.c deny_unshare.skel.h
	$(CC) -g -Wall -c $&lt; -o $@.o
	$(CC) -g -o $@ $(LIBS_DIR) $@.o $(LIBS)

.DELETE_ON_ERROR:</code></pre>
            
    <div>
      <h3>Result</h3>
      <a href="#result">
        
      </a>
    </div>
    <p>In a new terminal window run:</p>
            <pre><code>$ make run
…
LSM loaded! ctrl+c to exit.</code></pre>
            <p>In another terminal window, we’re successfully blocked!</p>
            <pre><code>$ unshare -rU
unshare: unshare failed: Cannot allocate memory
$ id
uid=1000(fred) gid=1000(fred) groups=1000(fred) …</code></pre>
            <p>The policy has an additional feature: it always allows privileged users to pass through:</p>
            <pre><code>$ sudo unshare -rU
# id
uid=0(root) gid=0(root) groups=0(root)</code></pre>
            <p>In the unprivileged case, the syscall aborts early. What is the performance impact in the privileged case?</p>
    <div>
      <h3>Measure performance</h3>
      <a href="#measure-performance">
        
      </a>
    </div>
    <p>We’re going to use a one-line unshare that’ll map the user namespace, and execute a command within for the measurements:</p>
            <pre><code>$ unshare -frU --kill-child -- bash -c "exit 0"</code></pre>
            <p>With a resolution of CPU cycles for the unshare syscall enter/exit, we’ll measure the following as the root user:</p><ol><li><p>Command run without the policy</p></li><li><p>Command run with the policy</p></li></ol><p>We’ll record the measurements with <a href="https://docs.kernel.org/trace/ftrace.html">ftrace</a>:</p>
            <pre><code>$ sudo su
# cd /sys/kernel/debug/tracing
# echo 1 &gt; events/syscalls/sys_enter_unshare/enable ; echo 1 &gt; events/syscalls/sys_exit_unshare/enable</code></pre>
            <p>At this point, we’re enabling tracing for the syscall enter and exit for unshare specifically. Now we set the time-resolution of our enter/exit calls to count CPU cycles:</p>
            <pre><code># echo 'x86-tsc' &gt; trace_clock </code></pre>
            <p>Next we begin our measurements:</p>
            <pre><code># unshare -frU --kill-child -- bash -c "exit 0" &amp;
[1] 92014</code></pre>
            <p>Run the policy in a new terminal window, and then run our next syscall:</p>
            <pre><code># unshare -frU --kill-child -- bash -c "exit 0" &amp;
[2] 92019</code></pre>
            <p>Now we have our two calls for comparison:</p>
            <pre><code># cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 4/4   #P:8
#
#                                _-----=&gt; irqs-off
#                               / _----=&gt; need-resched
#                              | / _---=&gt; hardirq/softirq
#                              || / _--=&gt; preempt-depth
#                              ||| / _-=&gt; migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
         unshare-92014   [002] ..... 762950852559027: sys_unshare(unshare_flags: 10000000)
         unshare-92014   [002] ..... 762950852622321: sys_unshare -&gt; 0x0
         unshare-92019   [007] ..... 762975980681895: sys_unshare(unshare_flags: 10000000)
         unshare-92019   [007] ..... 762975980752033: sys_unshare -&gt; 0x0
</code></pre>
            <p>unshare-92014 used 63,294 cycles. unshare-92019 used 70,138 cycles.</p><p>We have a 6,844 cycle (~10%) penalty between the two measurements. Not bad!</p><p>These numbers are for a single syscall, and they add up the more frequently the code is called. Unshare is typically called at task creation, and not repeatedly during normal execution of a program. Careful consideration and measurement is needed for your use case.</p>
    <div>
      <h2>Outro</h2>
      <a href="#outro">
        
      </a>
    </div>
    <p>We learned a bit about what LSM BPF is, how unshare is used to map a user to root, and how to solve a real-world problem by implementing a solution in eBPF. Tracking down the appropriate hook is not an easy task, and requires a bit of playing and a lot of kernel code. Fortunately, that’s the hard part. Because the policy is written in C, we can tweak it granularly to fit our problem. This means one may extend this policy with an allow-list so that certain programs or users can continue to use an unprivileged unshare. Finally, we looked at the performance impact of this program, and saw that the overhead is well worth it to block the attack vector.</p><p>“Cannot allocate memory” is not a clear error message for denying permissions. We proposed a <a href="https://lore.kernel.org/all/20220608150942.776446-1-fred@cloudflare.com/">patch</a> to propagate error codes from the <i>cred_prepare</i> hook up the call stack. Ultimately we came to the conclusion that a new hook is better suited to this problem. Stay tuned!</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">2AGA68zpZ0kGK4kfyvQ5Fa</guid>
            <dc:creator>Frederick Lawler</dc:creator>
        </item>
    </channel>
</rss>