
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Tue, 14 Apr 2026 15:21:16 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Introducing the p0f BPF compiler]]></title>
            <link>https://blog.cloudflare.com/introducing-the-p0f-bpf-compiler/</link>
            <pubDate>Tue, 02 Aug 2016 14:01:15 GMT</pubDate>
            <description><![CDATA[ Two years ago we blogged about our love of BPF (Berkeley Packet Filter) bytecode. Today we are very happy to open source another component of the bpftools: our p0f BPF compiler! ]]></description>
            <content:encoded><![CDATA[ <p>Two years ago we blogged about our love of <a href="/bpf-the-forgotten-bytecode/">BPF (Berkeley Packet Filter)</a> bytecode.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7pVXfGj2thB7lm7mXwf75k/86067d4726e37e8bddb37ea1e07fe6e3/13488404_e45bf52f98_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/rocketjim54/13488404/in/photolist-6UV6AL-cNg6EC-bCTyGc-pHpUZt-8mgA4n-pZV4Qq-dJTkkr-ckZqwA-dJTkhB-q2Kyas-cLvVP1-2c8CG-a5JDy8-6NSXFW-73SFAD-9JGikG-6NNM2t-6mGTaN-eHCuuX-6NSXLC-6mH8fQ">image</a> by <a href="https://www.flickr.com/photos/rocketjim54/">jim simonson</a></p><p>Then we published a set of utilities we are using to generate the BPF rules for our production iptables: <a href="/introducing-the-bpf-tools/">the bpftools</a>.</p><p>Today we are very happy to open source another component of the bpftools: our <b>p0f BPF compiler</b>!</p>
    <div>
      <h3>Meet p0f</h3>
      <a href="#meet-the-p0f">
        
      </a>
    </div>
    <p><a href="http://lcamtuf.coredump.cx/p0f3/">p0f</a> is a tool written by superhuman <a href="https://en.wikipedia.org/wiki/Micha%C5%82_Zalewski">Michal Zalewski</a>. The main purpose of p0f is to passively analyze and categorize arbitrary network traffic. You can feed p0f any packet and in return it will derive knowledge about the operating system that sent the packet.</p><p>One of the features that caught our attention was the concise yet explanatory signature format used to describe TCP SYN packets.</p><p>The p0f SYN signature is a simple string of colon-separated values. It cleanly describes a SYN packet in a human-readable way. The format is pretty smart: it skips the varying TCP fields and focuses only on the essence of the SYN packet, extracting the interesting bits from it.</p><p>We use this on a daily basis to categorize the packets that we, at CloudFlare, see when we are the target of a SYN flood. To defeat SYN attacks we want to discriminate the packets that are part of an attack from legitimate traffic. One of the ways we do this uses p0f.</p><p>We want to rate limit attack packets, and in effect prioritize processing of <i>other</i>, hopefully legitimate, ones. The p0f SYN signatures give us a language to describe and distinguish different types of SYN packets.</p><p>For example, here is a typical p0f SYN signature of a Linux SYN packet:</p>
            <pre><code>4:64:0:*:mss*10,6:mss,sok,ts,nop,ws:df,id+:0</code></pre>
            <p>while this is a Windows 7 one:</p>
            <pre><code>4:128:0:*:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0</code></pre>
            <p>Without getting into the details yet, you can clearly see that there are differences between these operating systems. Over time we noticed that the attack packets are often different. Here are two examples of attack SYN packets:</p>
            <pre><code>4:255:0:0:*,0::ack+,uptr+:0
4:64:0:*:65535,*:mss,nop,ws,nop,nop,sok:df,id+:0</code></pre>
            <p>You can have a look at more signatures in p0f's <a href="https://github.com/p0f/p0f/blob/master/docs/README">README</a> and <a href="https://github.com/p0f/p0f/blob/master/p0f.fp">signatures database</a>.</p><p>It's not <i>always</i> possible to perfectly distinguish an attack from valid packets, but very often it is. This realization led us to develop an attack mitigation tool based on p0f SYN signatures. With this we can ask <code>iptables</code> to rate limit only the selected attack signatures.</p><p>But before we discuss the mitigations, let's explain the signature format.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3UNvxctXaoSVPeP4dICkXM/83267ae5e26c6ecca088facfb127a80d/640px-12-8_equals_4-4_drum_pattern.png" />
            
            </figure><p><a href="http://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0</a> <a href="https://commons.wikimedia.org/w/index.php?curid=17871154">image</a> by <a href="//en.wikipedia.org/wiki/User:Hyacinth">Hyacinth</a> at the <a href="//en.wikipedia.org/wiki/">English language Wikipedia</a></p>
    <div>
      <h2>Signature</h2>
      <a href="#signature">
        
      </a>
    </div>
    <p>As mentioned, the p0f SYN signature is a colon-separated string with the following parts:</p><ul><li><p><b>IP version</b>: the first field carries the IP version. Allowed values are <code>4</code> and <code>6</code>.</p></li><li><p><b>Initial TTL</b>: assuming that realistically a packet will not traverse more than 35 hops, we can specify an initial TTL <i>ittl</i> (usual values are <code>255</code>, <code>128</code>, <code>64</code> and <code>32</code>) and check whether the packet's TTL falls in the range (<i>ittl</i> - 35, <i>ittl</i>].</p></li><li><p><b>IP options length</b>: length of the IP options. Although it's not that common to see options in the IP header (so <code>0</code> is the typical value you would see in a signature), the standard defines a variable-length field before the IP payload where options can be specified. A <code>*</code> value is allowed too, which means "not specified".</p></li><li><p><b>MSS</b>: maximum segment size specified in the TCP options. Can be a constant or <code>*</code>.</p></li><li><p><b>Window Size</b>: window size specified in the TCP header. It can be expressed as:</p><ul><li><p>a constant <code>c</code>, like 8192</p></li><li><p>a multiple of the MSS, in the <code>c*mss</code> format</p></li><li><p>a multiple of a constant, in the <code>%c</code> format</p></li><li><p>any value, as <code>*</code></p></li></ul></li><li><p><b>Window Scale</b>: window scale specified during the three-way handshake. Can be a constant or <code>*</code>.</p></li><li><p><b>TCP options layout</b>: list of TCP options in the order they appear in the packet.</p></li><li><p><b>Quirks</b>: comma-separated list of unusual (e.g. ACK number set in a non-ACK packet) or incorrect (e.g. malformed TCP options) characteristics of a packet.</p></li><li><p><b>Payload class</b>: TCP payload size. Can be <code>0</code> (no data), <code>+</code> (1 or more bytes of data) or <code>*</code>.</p></li></ul>
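<p>To make the colon-separated layout concrete, here is a minimal sketch (our own illustration, not p0f or bpftools code) that cuts a signature into its eight fields in place. Note that a field may be empty, like the options layout in <code>4:255:0:0:*,0::ack+,uptr+:0</code> above, so <code>strtok()</code>, which collapses empty tokens, would not do:</p>

```c
#include <string.h>

/* Cut a p0f SYN signature into its eight colon-separated fields,
 * in place. Illustrative helper only; it does not validate the
 * contents of the individual fields. Returns 8 on success, -1 if
 * the string does not contain exactly eight fields. */
static int p0f_split(char *sig, char *fields[8])
{
    char *p = sig;
    for (int n = 0; n < 8; n++) {
        fields[n] = p;
        char *colon = strchr(p, ':');
        if (n == 7)
            return colon ? -1 : 8; /* nothing may follow the last field */
        if (!colon)
            return -1;             /* too few fields */
        *colon = '\0';             /* terminate this field, move on */
        p = colon + 1;
    }
    return -1; /* not reached */
}
```

<p>A real parser would then interpret each field (constants, <code>*</code>, <code>c*mss</code> and so on) separately.</p>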
    <div>
      <h4>TCP Options format</h4>
      <a href="#tcp-options-format">
        
      </a>
    </div>
    <p>The following common TCP options are recognised:</p><ul><li><p><b>nop</b>: no-operation</p></li><li><p><b>mss</b>: maximum segment size</p></li><li><p><b>ws</b>: window scaling</p></li><li><p><b>sok</b>: selective ACK permitted</p></li><li><p><b>sack</b>: selective ACK</p></li><li><p><b>ts</b>: timestamp</p></li><li><p><b>eol+x</b>: end of options followed by <code>x</code> bytes of padding</p></li></ul>
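<p>Each of these names stands for a standard TCP option and, in the compiled filter, is matched by its option "kind" byte; the annotated <code>bpfgen -s</code> output later in the post checks <code>tcp[20] == 2</code> for <code>mss</code>, for instance. A small lookup sketch of that mapping (our own illustration, not code from bpftools):</p>

```c
#include <string.h>

/* Map a p0f option-layout name to the TCP option "kind" byte
 * (RFC 793 / RFC 1323 values). Illustrative helper only; "eol"
 * is kind 0, and p0f writes it as eol+x to record the padding.
 * Returns -1 for an unrecognised name. */
static int olayout_kind(const char *name)
{
    static const struct { const char *name; int kind; } tab[] = {
        { "eol", 0 }, { "nop", 1 },  { "mss", 2 }, { "ws", 3 },
        { "sok", 4 }, { "sack", 5 }, { "ts", 8 },
    };
    for (int i = 0; i < (int)(sizeof(tab) / sizeof(tab[0])); i++)
        if (strcmp(name, tab[i].name) == 0)
            return tab[i].kind;
    return -1;
}
```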
    <div>
      <h4>Quirks</h4>
      <a href="#quirks">
        
      </a>
    </div>
    <p>p0f describes a number of quirks:</p><ul><li><p><b>df</b>: don't fragment bit is set in the IP header</p></li><li><p><b>id+</b>: df bit is set and the IP identification field is non-zero</p></li><li><p><b>id-</b>: df bit is not set and the IP identification field is zero</p></li><li><p><b>ecn</b>: explicit congestion notification flag is set</p></li><li><p><b>0+</b>: reserved ("must be zero") field in the IP header is not actually zero</p></li><li><p><b>flow</b>: flow label in the IPv6 header is non-zero</p></li><li><p><b>seq-</b>: sequence number is zero</p></li><li><p><b>ack+</b>: ACK field is non-zero but the ACK flag is not set</p></li><li><p><b>ack-</b>: ACK field is zero but the ACK flag is set</p></li><li><p><b>uptr+</b>: URG field is non-zero but the URG flag is not set</p></li><li><p><b>urgf+</b>: URG flag is set</p></li><li><p><b>pushf+</b>: PUSH flag is set</p></li><li><p><b>ts1-</b>: timestamp 1 is zero</p></li><li><p><b>ts2+</b>: timestamp 2 is non-zero in a SYN packet</p></li><li><p><b>opt+</b>: non-zero data in the options segment</p></li><li><p><b>exws</b>: excessive window scaling factor (window scale greater than 14)</p></li><li><p><b>linux</b>: match a packet sent from the Linux network stack (<code>IP.id</code> field equal to <code>TCP.ts1</code> xor <code>TCP.seq_num</code>). Note that this quirk is not part of the original p0f signature format; we decided to add it since we found it useful.</p></li><li><p><b>bad</b>: malformed TCP options</p></li></ul>
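<p>The <b>linux</b> quirk can be sketched as a one-line check. Since the IP ID field is only 16 bits wide while the timestamp and sequence number are 32, we assume the comparison is against the low 16 bits of the xor; that truncation is our assumption for this sketch, not something spelled out above:</p>

```c
#include <stdint.h>

/* "linux" quirk: IP.id equals TCP.ts1 xor TCP.seq_num.
 * IP ID is 16 bits, so we compare against the low 16 bits of
 * the xor; the truncation is this sketch's assumption. */
static int quirk_linux(uint16_t ip_id, uint32_t ts1, uint32_t seq)
{
    return ip_id == (uint16_t)(ts1 ^ seq);
}
```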
    <div>
      <h2>Mitigating attacks</h2>
      <a href="#mitigating-attacks">
        
      </a>
    </div>
    <p>Given a p0f SYN signature, we want to pass it to <code>iptables</code> for mitigation. It's not obvious how to do so, but fortunately we have experience with BPF bytecode, since we already use it to block DNS DDoS attacks.</p><p>We decided to extend our BPF infrastructure to support p0f as well, by building a tool that compiles a p0f SYN signature into a BPF bytecode blob. It is incorporated into the bpftools project.</p><p>This allows us to use a simple, human-readable syntax for the mitigations - the p0f signature - and compile it to a very efficient BPF form that iptables can use.</p><p>With a p0f signature running as BPF in iptables we're able to identify attack packets at very high speed and react accordingly. We can either hard <code>-j DROP</code> them or rate limit them if we wish.</p>
    <div>
      <h2>How to compile p0f to BPF</h2>
      <a href="#how-to-compile-p0f-to-bpf">
        
      </a>
    </div>
    <p>First you need to clone the <code>cloudflare/bpftools</code> GitHub repository:</p>
            <pre><code>$ git clone https://github.com/cloudflare/bpftools.git</code></pre>
            <p>Then compile it:</p>
            <pre><code>$ cd bpftools
$ make</code></pre>
            <p>With this you can run <code>bpfgen p0f</code> to generate a BPF filter that matches a p0f signature.</p><p>Here's an example where we take the p0f signature of a Linux TCP SYN packet (the one we introduced before) and use <code>bpftools</code> to generate the BPF bytecode that matches this category of packets:</p>
            <pre><code>$ ./bpfgen p0f -- 4:64:0:*:mss*10,6:mss,sok,ts,nop,ws:df,id+:0
56,0 0 0 0,48 0 0 8,37 52 0 64,37 0 51 29,48 0 0 0,
84 0 0 15,21 0 48 5,48 0 0 9,21 0 46 6,40 0 0 6,
69 44 0 8191,177 0 0 0,72 0 0 14,2 0 0 8,72 0 0 22,
36 0 0 10,7 0 0 0,96 0 0 8,29 0 36 0,177 0 0 0,
80 0 0 39,21 0 33 6,80 0 0 12,116 0 0 4,21 0 30 10,
80 0 0 20,21 0 28 2,80 0 0 24,21 0 26 4,80 0 0 26,
21 0 24 8,80 0 0 36,21 0 22 1,80 0 0 37,21 0 20 3,
48 0 0 6,69 0 18 64,69 17 0 128,40 0 0 2,2 0 0 1,
48 0 0 0,84 0 0 15,36 0 0 4,7 0 0 0,96 0 0 1,
28 0 0 0,2 0 0 5,177 0 0 0,80 0 0 12,116 0 0 4,
36 0 0 4,7 0 0 0,96 0 0 5,29 0 1 0,6 0 0 65536,
6 0 0 0,</code></pre>
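            <p>The blob above is a flat encoding of classic BPF: the first number is the instruction count, and each comma-separated quadruple that follows is one instruction's <code>code</code>, <code>jt</code>, <code>jf</code> and <code>k</code> fields - the same string format the iptables <code>bpf</code> match accepts. As an illustrative sketch (our own, not code from bpftools), a decoder into a struct mirroring the kernel's <code>struct sock_filter</code> could look like:</p>

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Mirrors the layout of the kernel's struct sock_filter. */
struct bpf_insn_raw { uint16_t code; uint8_t jt, jf; uint32_t k; };

/* Decode a bpfgen blob: "<count>,<code jt jf k>,<code jt jf k>,...".
 * Returns the number of instructions decoded, or -1 on a malformed
 * blob or a count exceeding `max`. */
static int parse_bpf_blob(const char *blob, struct bpf_insn_raw *out, int max)
{
    int count, n;
    if (sscanf(blob, "%d", &count) != 1 || count < 1 || count > max)
        return -1;
    const char *p = strchr(blob, ',');
    for (n = 0; n < count && p; n++) {
        unsigned code, jt, jf, k;
        if (sscanf(++p, "%u %u %u %u", &code, &jt, &jf, &k) != 4)
            return -1;
        out[n].code = (uint16_t)code;
        out[n].jt   = (uint8_t)jt;
        out[n].jf   = (uint8_t)jf;
        out[n].k    = k;
        p = strchr(p, ','); /* advance to the next quadruple */
    }
    return n == count ? count : -1;
}
```

<p>The decoded array has the shape a program would hand to the kernel (e.g. via <code>SO_ATTACH_FILTER</code>), though here it is only parsed.</p>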
            <p>If this looks magical, use the <code>-s</code> flag to see an explanation of what's going on:</p>
            <pre><code>$ ./bpfgen -s p0f -- 4:64:0:*:mss*10,6:mss,sok,ts,nop,ws:df,id+:0
; ip: ip version
; (ip[8] &lt;= 64): ttl &lt;= 64
; (ip[8] &gt; 29): ttl &gt; 29
; ((ip[0] &amp; 0xf) == 5): IP options len == 0
; (tcp[14:2] == (tcp[22:2] * 10)): win size == mss * 10
; (tcp[39:1] == 6): win scale == 6
; ((tcp[12] &gt;&gt; 4) == 10): TCP data offset
; (tcp[20] == 2): olayout mss
; (tcp[24] == 4): olayout sok
; (tcp[26] == 8): olayout ts
; (tcp[36] == 1): olayout nop
; (tcp[37] == 3): olayout ws
; ((ip[6] &amp; 0x40) != 0): df set
; ((ip[6] &amp; 0x80) == 0): mbz zero
; ((ip[2:2] - ((ip[0] &amp; 0xf) * 4) - ((tcp[12] &gt;&gt; 4) * 4)) == 0): payload len == 0
;
; ipver=4
; ip and (ip[8] &lt;= 64) and (ip[8] &gt; 29) and ((ip[0] &amp; 0xf) == 5) and (tcp[14:2] == (tcp[22:2] * 10)) and (tcp[39:1] == 6) and ((tcp[12] &gt;&gt; 4) == 10) and (tcp[20] == 2) and (tcp[24] == 4) and (tcp[26] == 8) and (tcp[36] == 1) and (tcp[37] == 3) and ((ip[6] &amp; 0x40) != 0) and ((ip[6] &amp; 0x80) == 0) and ((ip[2:2] - ((ip[0] &amp; 0xf) * 4) - ((tcp[12] &gt;&gt; 4) * 4)) == 0)

l000:
    ld       #0x0
l001:
    ldb      [8]
l002:
    jgt      #0x40, l055, l003
l003:
    jgt      #0x1d, l004, l055
l004:
    ldb      [0]
l005:
    and      #0xf
l006:
    jeq      #0x5, l007, l055
l007:
    ldb      [9]
l008:
    jeq      #0x6, l009, l055
l009:
    ldh      [6]
l010:
    jset     #0x1fff, l055, l011
l011:
    ldxb     4*([0]&amp;0xf)
l012:
    ldh      [x + 14]
l013:
    st       M[8]
l014:
    ldh      [x + 22]
l015:
    mul      #10
l016:
    tax
l017:
    ld       M[8]
l018:
    jeq      x, l019, l055
l019:
    ldxb     4*([0]&amp;0xf)
l020:
    ldb      [x + 39]
l021:
    jeq      #0x6, l022, l055
l022:
    ldb      [x + 12]
l023:
    rsh      #4
l024:
    jeq      #0xa, l025, l055
l025:
    ldb      [x + 20]
l026:
    jeq      #0x2, l027, l055
l027:
    ldb      [x + 24]
l028:
    jeq      #0x4, l029, l055
l029:
    ldb      [x + 26]
l030:
    jeq      #0x8, l031, l055
l031:
    ldb      [x + 36]
l032:
    jeq      #0x1, l033, l055
l033:
    ldb      [x + 37]
l034:
    jeq      #0x3, l035, l055
l035:
    ldb      [6]
l036:
    jset     #0x40, l037, l055
l037:
    jset     #0x80, l055, l038
l038:
    ldh      [2]
l039:
    st       M[1]
l040:
    ldb      [0]
l041:
    and      #0xf
l042:
    mul      #4
l043:
    tax
l044:
    ld       M[1]
l045:
    sub      x
l046:
    st       M[5]
l047:
    ldxb     4*([0]&amp;0xf)
l048:
    ldb      [x + 12]
l049:
    rsh      #4
l050:
    mul      #4
l051:
    tax
l052:
    ld       M[5]
l053:
    jeq      x, l054, l055
l054:
    ret      #65536
l055:
    ret      #0</code></pre>
            
    <div>
      <h2>Example run</h2>
      <a href="#example-run">
        
      </a>
    </div>
    <p>For example, suppose we want to block SYN packets generated by the <code>hping3</code> tool.</p><p>First, we need to determine the p0f SYN signature. Here it is (we know this one off the top of our heads):</p>
            <pre><code>4:64:0:0:*,0::ack+:0</code></pre>
            <p>(Notice: unless you use the <code>-L 0</code> option, <code>hping3</code> will send SYN packets with the ACK number set. Interesting, isn't it?)</p><p>Now we can use bpftools to get BPF bytecode that will match the naughty packets:</p>
            <pre><code>$ ./bpfgen p0f -- 4:64:0:0:*,0::ack+:0
39,0 0 0 0,48 0 0 8,37 35 0 64,37 0 34 29,48 0 0 0,
84 0 0 15,21 0 31 5,48 0 0 9,21 0 29 6,40 0 0 6,
69 27 0 8191,177 0 0 0,80 0 0 12,116 0 0 4,
21 0 23 5,48 0 0 6,69 21 0 128,80 0 0 13,
69 19 0 16,64 0 0 8,21 17 0 0,40 0 0 2,2 0 0 3,
48 0 0 0,84 0 0 15,36 0 0 4,7 0 0 0,96 0 0 3,
28 0 0 0,2 0 0 7,177 0 0 0,80 0 0 12,116 0 0 4,
36 0 0 4,7 0 0 0,96 0 0 7,29 0 1 0,6 0 0 65536,
6 0 0 0,</code></pre>
            <p>This bytecode can then be passed to iptables:</p>
            <pre><code>$ sudo iptables -A INPUT -p tcp --dport 80 -m bpf --bytecode "39,0 0 0 0,48 0 0 8,37 35 0 64,37 0 34 29,48 0 0 0,84 0 0 15,21 0 31 5,48 0 0 9,21 0 29 6,40 0 0 6,69 27 0 8191,177 0 0 0,80 0 0 12,116 0 0 4,21 0 23 5,48 0 0 6,69 21 0 128,80 0 0 13,69 19 0 16,64 0 0 8,21 17 0 0,40 0 0 2,2 0 0 3,48 0 0 0,84 0 0 15,36 0 0 4,7 0 0 0,96 0 0 3,28 0 0 0,2 0 0 7,177 0 0 0,80 0 0 12,116 0 0 4,36 0 0 4,7 0 0 0,96 0 0 7,29 0 1 0,6 0 0 65536,6 0 0 0," -j DROP</code></pre>
            <p>And here's how it would look in iptables:</p>
            <pre><code>$ sudo iptables -L INPUT -v
Chain INPUT (policy DROP 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    6   240            tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:80 match bpf 0 0 0 0,48 0 0 8,37 35 0 64,37 0 34 29,48 0 0 0,84 0 0 15,21 0 31 5,48 0 0 9,21 0 29 6,40 0 0 6,69 27 0 8191,177 0 0 0,80 0 0 12,116 0 0 4,21 0 23 5,48 0 0 6,69 21 0 128,80 0 0 13,69 19 0 16,64 0 0 8,21 17 0 0,40 0 0 2,2 0 0 3,48 0 0 0,84 0 0 15,36 0 0 4,7 0 0 0,96 0 0 3,28 0 0 0,2 0 0 7,177 0 0 0,80 0 0 12,116 0 0 4,36 0 0 4,7 0 0 0,96 0 0 7,29 0 1 0,6 0 0 65536,6 0 0 0</code></pre>
            
    <div>
      <h4>Closing words</h4>
      <a href="#closing-words">
        
      </a>
    </div>
    <p>While defending from DDoS attacks is sometimes fun, most often it's a mundane, repetitive job. We are constantly working on improving our automatic DDoS mitigation system, but we do not believe there is a strong reason to keep it all secret. We want to help others fight attacks. Maybe, if we all work together, one day we can solve the DDoS problem for everyone.</p><p>Releasing our code as <a href="https://cloudflare.github.io/">open source</a> is an important part of how we work at CloudFlare. This blog post and the p0f BPF compiler are part of our effort to open source our DDoS mitigations. We hope others affected by SYN floods will find it useful.</p><p><i>Do you enjoy playing with low-level networking bits? Are you interested in dealing with some of the largest DDoS attacks ever seen?</i> <i>If so you should definitely have a look at the </i><a href="https://www.cloudflare.com/join-our-team/"><i>open positions</i></a><i> in our London, San Francisco, Singapore, Champaign (IL) and Austin (TX) offices!</i></p> ]]></content:encoded>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Best Practices]]></category>
            <guid isPermaLink="false">a4eKkNiCIb7ugKBTDIZQv</guid>
            <dc:creator>Gilberto Bertin</dc:creator>
        </item>
        <item>
            <title><![CDATA[Partial kernel bypass merged into netmap main]]></title>
            <link>https://blog.cloudflare.com/partial-kernel-bypass-merged-netmap/</link>
            <pubDate>Thu, 17 Dec 2015 14:15:37 GMT</pubDate>
            <description><![CDATA[ In a previous post we described our work on a new netmap mode called single-rx-queue. After submitting the pull request, the netmap maintainers told us that the patch was interesting, but they would prefer something more configurable instead of a tailored custom mode. ]]></description>
            <content:encoded><![CDATA[ <p>In <a href="/single-rx-queue-kernel-bypass-with-netmap/">a previous post</a> we described our work on a new netmap mode called <i>single-rx-queue</i>.</p><p>After submitting the pull request, the netmap maintainers told us that the patch was interesting, but they would prefer something more configurable instead of a tailored custom mode.</p><p>After an exchange of ideas and some more work, our patch has just been merged into mainline netmap.</p>
    <div>
      <h4>Meet the new netmap</h4>
      <a href="#meet-the-new-netmap">
        
      </a>
    </div>
    <p>Before our patch, netmap was an all-or-nothing deal. That is: there was no way to put a network adapter only partially in netmap mode; all of the queues had to be detached from the host network stack. Even a netmap mode called "single ring pair" didn't help.</p><p>Our final patch is extended and more generic, while still supporting the simple functionality of our original single-rx-queue mode.</p><p>First we modified netmap to leave queues that are not explicitly requested in netmap mode attached to the host stack. This way, if a user requests a pair of rings (for example using <code>nm_open("netmap:eth0-4")</code>) it will actually get a reference to both the number 4 RX and TX rings, while keeping the other rings attached to the kernel stack.</p><p>But since the NIC is still partially connected to the host stack, a new problem arises: what should we do with packets that the operating system wants to transmit on a TX ring which is in netmap mode? The solution is simple: just move them to the RX host ring. This way we can access these packets from netmap simply by opening the interface again in netmap mode and asking for its software ring pair.</p><p>Lastly, for simpler use cases we needed a way to ask for only the RX rings, without the TX counterpart - we do not need TX rings for our specific use case. To achieve this we introduced a couple of flags, <code>NR_TX_RINGS_ONLY</code> and <code>NR_RX_RINGS_ONLY</code> (which translate to <code>/T</code> and <code>/R</code> when using <code>nm_open()</code>), to request only TX or RX rings.</p><p>With these changes, the only line we needed to edit in our code was the netmap interface name passed to <code>nm_open()</code>. This:</p>
            <pre><code>snprintf(nm_if, sizeof(nm_if), "netmap:%s~%d", if_name, ring_nr);</code></pre>
            <p>becomes this:</p>
            <pre><code>snprintf(nm_if, sizeof(nm_if), "netmap:%s-%d/R", if_name, ring_nr);</code></pre>
            <p>and everything kept working as expected!</p>
    <div>
      <h4>Try it out</h4>
      <a href="#try-it-out">
        
      </a>
    </div>
    <p>You can follow these instructions to build a test program under Linux. In this example we are using the ixgbe driver.</p><p>The test program source code is available on GitHub:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2015-12-nm-single-rx-queue/main.c">2015-12-nm-single-rx-queue/main.c</a></p></li></ul><p>First clone the test application and the netmap repository:</p>
            <pre><code>$ git clone https://github.com/cloudflare/cloudflare-blog
$ cd cloudflare-blog/2015-12-nm-single-rx-queue
$ git clone https://github.com/luigirizzo/netmap deps</code></pre>
            <p>build it:</p>
            <pre><code>$ make</code></pre>
            <p>build and load netmap:</p>
            <pre><code>$ cd deps/netmap/LINUX
$ ./configure --kernel-sources=/path/to/kernel/sources --driver=ixgbe
$ make
$ sudo insmod netmap.ko
$ sudo insmod ixgbe/ixgbe.ko</code></pre>
            <p>and launch the application:</p>
            <pre><code>$ sudo ./nm-single-rx-queue eth0 1</code></pre>
            
    <div>
      <h4>Thanks</h4>
      <a href="#thanks">
        
      </a>
    </div>
    <p>We would like to thank Luigi and Giuseppe for their great help in shaping the final patch and for their work on netmap.</p> ]]></content:encoded>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Linux]]></category>
            <guid isPermaLink="false">dQwro6ueWQLKrOocSFwHC</guid>
            <dc:creator>Gilberto Bertin</dc:creator>
        </item>
        <item>
            <title><![CDATA[Single RX queue kernel bypass in Netmap for high packet rate networking]]></title>
            <link>https://blog.cloudflare.com/single-rx-queue-kernel-bypass-with-netmap/</link>
            <pubDate>Fri, 09 Oct 2015 10:26:42 GMT</pubDate>
            <description><![CDATA[ In a previous post we discussed the performance limitations of the Linux kernel network stack. We detailed the available kernel bypass techniques allowing user space programs to receive packets with high throughput.  ]]></description>
            <content:encoded><![CDATA[ <p>In <a href="/kernel-bypass/">a previous post</a> we discussed the performance limitations of the Linux kernel network stack. We detailed the available kernel bypass techniques allowing user space programs to receive packets with high throughput. Unfortunately, none of the discussed open source solutions supported our needs. To improve the situation we decided to contribute to the <a href="http://info.iet.unipi.it/~luigi/netmap">Netmap project</a>. In this blog post we'll describe our proposed changes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6wUCqE4w1PE0nbGFJLjIX3/8e30b5ce1a117131d4929b29bf536fc1/122715232_32da8cd353_o-1.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/binary_koala/122715232">image</a> by Binary Koala</p>
    <div>
      <h3>Our needs</h3>
      <a href="#our-needs">
        
      </a>
    </div>
    <p>At CloudFlare we are constantly dealing with large packet floods. Our network constantly receives a large volume of packets, often coming from many simultaneous attacks. In fact, it is entirely possible that the server which just served you this blog post is dealing with a flood of many millions of packets per second <i>right now</i>.</p><p>Since the Linux kernel can't really handle a large volume of packets, we need to work around it. During packet floods we offload selected network flows (belonging to a flood) to a user space application. This application filters the packets at very high speed. Most of the packets are dropped, as they belong to a flood. The small number of "valid" packets are injected back into the kernel and handled in the same way as usual traffic.</p><p>It’s important to emphasize that the kernel bypass is enabled only for selected flows, which means that all other packets go to the kernel as usual.</p><p>This setup works perfectly on our servers with Solarflare network cards - we can use the <code>ef_vi</code> API to achieve the kernel bypass. Unfortunately, we don’t have this functionality on our servers with Intel IXGBE NICs.</p><p>This is when <a href="http://info.iet.unipi.it/~luigi/netmap/">Netmap</a> comes in.</p>
    <div>
      <h4>Netmap</h4>
      <a href="#netmap">
        
      </a>
    </div>
    <p>Over the last few months we’ve been thinking hard about how to achieve bypass for selected flows (aka: a bifurcated driver) on non-Solarflare network cards.</p><p>We’ve considered PF_RING, DPDK and other custom solutions, but sadly all of them take over the whole network card. Eventually we decided that the best way would be to patch Netmap with the functionality we need.</p><p>We chose Netmap because:</p><ul><li><p>It’s fully open source and released under a BSD license.</p></li><li><p>It has a great NIC-agnostic API.</p></li><li><p>It’s very fast: it can easily reach line rate.</p></li><li><p>The project is well maintained and reasonably mature.</p></li><li><p>The code is very high quality.</p></li><li><p>The driver-specific modifications are trivial: most of the magic happens in the shared Netmap module. It’s easy to add support for new hardware.</p></li></ul>
    <div>
      <h3>Introducing the single RX queue mode</h3>
      <a href="#introducing-the-single-rx-queue-mode">
        
      </a>
    </div>
    <p>Usually, when a network card goes into Netmap mode, all the RX queues get disconnected from the kernel and become available to Netmap applications.</p><p>We don't want that. We want to keep most of the RX queues attached to the kernel, and enable Netmap mode only on selected RX queues. We call this functionality the "single RX queue mode".</p><p>The intention was to expose a minimal API which could:</p><ul><li><p>Open a network interface in "single RX queue mode", allowing Netmap applications to receive packets from that specific RX queue while leaving all the other queues attached to the host network stack.</p></li><li><p>On demand add or remove RX queues from the "single RX queue mode".</p></li><li><p>Eventually remove the interface from Netmap mode and reattach the RX queues to the host stack.</p></li></ul><p>The patch to Netmap is awaiting code review and is available here:</p><ul><li><p><a href="https://github.com/luigirizzo/netmap/pull/87">https://github.com/luigirizzo/netmap/pull/87</a></p></li></ul><p>A minimal program receiving packets from <code>eth3</code> RX queue #4 would look like:</p>
            <pre><code>d = nm_open("netmap:eth3~4", NULL, 0, 0);
while (1) {
    fds = {fds: d-&gt;fd, events: POLLIN};
    poll(&amp;fds, 1, -1);

    ring = NETMAP_RXRING(d-&gt;nifp, 4);
    while (!nm_ring_empty(ring)) {
        i   = ring-&gt;cur;
        buf = NETMAP_BUF(ring, ring-&gt;slot[i].buf_idx);
        len = ring-&gt;slot[i].len;
        //process(buf, len)
        ring-&gt;head = ring-&gt;cur = nm_ring_next(ring, i);
    }
}</code></pre>
            <p>This code is very close to a Netmap example program. Indeed the only difference is the <code>nm_open()</code> call, which uses the new syntax <code>netmap:ifname~queue_number</code>.</p><p>Once again, when running this code only packets arriving on the RX queue #4 will go to the netmap program. All other RX and TX queues will be handled by the Linux kernel network stack.</p><p>You can find a more complete example here:</p><ul><li><p><a href="https://github.com/jibi/nm-single-rx-queue">https://github.com/jibi/nm-single-rx-queue</a></p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2a0bUYxkJn3js1Nnve8vV6/cabf675555d51efb16d90372405aa07e/RX_bypass.png" />
            
            </figure>
    <div>
      <h4>Isolating a queue</h4>
      <a href="#isolating-a-queue">
        
      </a>
    </div>
    <p>In multiqueue network cards, any packet can end up in almost any RX queue due to RSS. This is why, before enabling the single RX queue mode, it is necessary to make sure only the selected flow goes to the Netmap queue.</p><p>To do so it is necessary to:</p><ul><li><p>Modify the <b>indirection table</b> to ensure no new RSS-hashed packets will go there.</p></li><li><p>Use <b>flow steering</b> to specifically direct some flows to the isolated queue.</p></li><li><p>Work around <b>RFS</b> - make sure no other application is running on the CPU Netmap will run on.</p></li></ul><p>For example:</p>
            <pre><code>$ ethtool -X eth3 weight 1 1 1 1 0 1 1 1 1 1
$ ethtool -K eth3 ntuple on
$ ethtool -N eth3 flow-type udp4 dst-port 53 action 4</code></pre>
            <p>Here we are setting the indirection table to prevent traffic from going to RX queue #4. Then we are enabling flow steering to enqueue all UDP traffic with destination port 53 into queue #4.</p>
    <div>
      <h4>Trying it out</h4>
      <a href="#trying-it-out">
        
      </a>
    </div>
    <p>Here's how to run it with the IXGBE NIC. First grab the sources:</p>
            <pre><code>$ git clone https://github.com/jibi/netmap.git
$ cd netmap
$ git checkout -B single-rx-queue-mode
$ ./configure --drivers=ixgbe --kernel-sources=/path/to/kernel</code></pre>
            <p>Load the netmap-patched modules and setup the interface:</p>
            <pre><code>$ insmod ./LINUX/netmap.ko
$ insmod ./LINUX/ixgbe/ixgbe.ko
$ # Distribute the interrupts:
$ (let CPU=0; cd /sys/class/net/eth3/device/msi_irqs/; for IRQ in *; do \
  echo $CPU &gt; /proc/irq/$IRQ/smp_affinity_list; let CPU+=1
         done)
$ # Enable flow steering (ntuple filters):
$ ethtool -K eth3 ntuple on</code></pre>
            <p>At this point we started flooding the interface with 6M short UDP packets per second. <code>htop</code> shows the server being totally busy handling the flood:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7etmtoEPnxIeh6CPP8mrJd/ef05cbd9ff6626ab3adf6d4ded361877/htop1-1.png" />
            
            </figure><p>To counter the flood we started Netmap. First, we needed to edit the indirection table, to isolate the RX queue #4:</p>
            <pre><code>$ ethtool -X eth3 weight 1 1 1 1 0 1 1 1 1 1
$ ethtool -N eth3 flow-type udp4 dst-port 53 action 4</code></pre>
            <p>This caused all the flood packets to go to RX queue #4.</p><p>Before putting an interface in Netmap mode it is necessary to turn off hardware offload features:</p>
            <pre><code>$ ethtool -K eth3 lro off gro off</code></pre>
            <p>Finally we launched the netmap offload:</p>
            <pre><code>$ sudo taskset -c 15 ./nm_offload eth3 4
[+] starting test02 on interface eth3 ring 4
[+] UDP pps: 5844714
[+] UDP pps: 5996166
[+] UDP pps: 5863214
[+] UDP pps: 5986365
[+] UDP pps: 5867302
[+] UDP pps: 5964911
[+] UDP pps: 5909715
[+] UDP pps: 5865769
[+] UDP pps: 5906668
[+] UDP pps: 5875486</code></pre>
            <p>As you can see, the netmap program on a single RX queue was able to receive about 5.8M packets per second.</p><p>For completeness, here's an <code>htop</code> showing only a single core being busy with Netmap:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/72C2hRcoZTkoaTJZLaqgoi/5563f1256fe254287e9d77272da0ea97/htop2-1.png" />
            
            </figure>
    <div>
      <h4>Thanks</h4>
      <a href="#thanks">
        
      </a>
    </div>
    <p>We would like to thank Pavel Odintsov who suggested the possibility of using Netmap this way. He even prepared <a href="http://www.stableit.ru/2015/06/how-to-run-netmap-on-single-queue-and.html">the initial hack</a> we based our work on.</p><p>We would also like to thank Luigi Rizzo, for his Netmap work and great feedback on our patches.</p>
    <div>
      <h4>Final words</h4>
      <a href="#final-words">
        
      </a>
    </div>
    <p>At CloudFlare our application stack is based on open source software. We’re grateful to so many open source programmers for their awesome work. Whenever we can we try to contribute back to the community - we hope "the single RX Netmap mode" will be useful to others.</p><p>You can find more CloudFlare open source <a href="https://cloudflare.github.io/">here</a>.</p> ]]></content:encoded>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">4Sezx7V7TGi5C7AwEQSlvL</guid>
            <dc:creator>Gilberto Bertin</dc:creator>
        </item>
    </channel>
</rss>