
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Mon, 13 Apr 2026 18:39:24 GMT</lastBuildDate>
        <item>
            <title><![CDATA[How we scaled nginx and saved the world 54 years every day]]></title>
            <link>https://blog.cloudflare.com/how-we-scaled-nginx-and-saved-the-world-54-years-every-day/</link>
            <pubDate>Tue, 31 Jul 2018 15:00:00 GMT</pubDate>
            <description><![CDATA[ 10 million websites, apps and APIs use Cloudflare to give their users a speed boost. At peak we serve more than 10 million requests a second across our 151 data centers. Over the years we’ve made many modifications to our version of NGINX to handle our growth. This is blog post is about one of them. ]]></description>
            <content:encoded><![CDATA[ <blockquote><p>The <a href="https://twitter.com/Cloudflare?ref_src=twsrc%5Etfw">@Cloudflare</a> team just pushed a change that improves our network's performance significantly, especially for particularly slow outlier requests. How much faster? We estimate we're saving the Internet ~54 years *per day* of time we'd all otherwise be waiting for sites to load.</p><p>— Matthew Prince (@eastdakota) <a href="https://twitter.com/eastdakota/status/1012420672352542720?ref_src=twsrc%5Etfw">June 28, 2018</a></p></blockquote><p>10 million websites, apps and APIs use Cloudflare to give their users a speed boost. At peak we serve more than 10 million requests a second across our 151 data centers. Over the years we’ve made many modifications to our version of NGINX to handle our growth. This is blog post is about one of them.</p>
    <div>
      <h3>How NGINX works</h3>
      <a href="#how-nginx-works">
        
      </a>
    </div>
    <p>NGINX is one of the programs that popularized using event loops to solve <a href="http://www.kegel.com/c10k.html">the C10K problem</a>. Every time a network event comes in (a new connection, a request, or a notification that we can send more data, etc.) NGINX wakes up, handles the event, and then goes back to do whatever it needs to do (which may be handling other events). When an event arrives, data associated with the event is already ready, which allows NGINX to efficiently handle many requests simultaneously without waiting.</p>
            <pre><code>num_events = epoll_wait(epfd, /*returned=*/events, events_len, /*timeout=*/-1);
// events is list of active events
// handle event[0]: incoming request GET http://example.com/
// handle event[1]: send out response to GET http://cloudflare.com/</code></pre>
            <p>For example, here's what a piece of code could look like to read data from a file descriptor:</p>
            <pre><code>// we got a read event on fd
while (buf_len &gt; 0) {
    ssize_t n = read(fd, buf, buf_len);
    if (n &lt; 0) {
        if (errno == EWOULDBLOCK || errno == EAGAIN) {
            // try later when we get a read event again
        }
        if (errno == EINTR) {
            continue;
        }
        return total;
    }
    buf_len -= n;
    buf += n;
    total += n;
}</code></pre>
            <p>When fd is a network socket, this will return the bytes that have already arrived. The final call will return <code>EWOULDBLOCK</code> which means we have drained the local read buffer, so we should not read from the socket again until more data becomes available.</p>
    <div>
      <h3>Disk I/O is not like network I/O</h3>
      <a href="#disk-i-o-is-not-like-network-i-o">
        
      </a>
    </div>
    <p>When <code>fd</code> is a regular file on Linux, <code>EWOULDBLOCK</code> and <code>EAGAIN</code> never happens, and read always waits to read the entire buffer. This is true even if the file was opened with <code>O_NONBLOCK</code>. Quoting <a href="http://man7.org/linux/man-pages/man2/open.2.html"><code>open(2)</code></a>:</p><blockquote><p>Note that this flag has no effect for regular files and block devices</p></blockquote><p>In other words, the code above basically reduces to:</p>
            <pre><code>if (read(fd, buf, buf_len) &gt; 0) {
   return buf_len;
}</code></pre>
            <p>Which means that if an event handler needs to read from disk, it will block the event loop until the entire read is finished, and subsequent event handlers are delayed.</p><p>This ends up being fine for most workloads, because reading from disk is usually fast enough, and much more predictable compared to waiting for a packet to arrive from network. That's especially true now that everyone has an SSD, and our cache disks are all SSDs. Modern SSDs have very low latency, typically in 10s of µs. On top of that, we can run NGINX with multiple worker processes so that a slow event handler does not block requests in other processes. Most of the time, we can rely on NGINX's event handling to service requests quickly and efficiently.</p>
    <div>
      <h3>SSD performance: not always what’s on the label</h3>
      <a href="#ssd-performance-not-always-whats-on-the-label">
        
      </a>
    </div>
    <p>As you might have guessed, these rosy assumptions aren’t always true. If each read always takes 50µs then it should only take 2ms to read 0.19MB in 4KB blocks (and we read in larger blocks). But our own measurements showed that our time to first byte is sometimes much worse, particularly at 99th and 999th percentile. In other words, the slowest read per 100 (or per 1000) reads often takes much longer.</p><p>SSDs are very fast but they are also notoriously complicated. Inside them are computers that queue up and re-order I/O, and also perform various background tasks like garbage collection and defragmentation. Once in a while, a request gets slowed down enough to matter. My colleague <a href="https://twitter.com/ibobrik?lang=en">Ivan Babrou</a> ran some I/O benchmarks and saw read spikes of up to 1 second. Moreover, some of our SSDs have more performance outliers than others. Going forward we will consider performance consistency in our SSD purchases, but in the meantime we need to have a solution for our existing hardware.</p>
    <div>
      <h3>Spreading the load evenly with <code>SO_REUSEPORT</code></h3>
      <a href="#spreading-the-load-evenly-with-so_reuseport">
        
      </a>
    </div>
    <p>An individual slow response once in a blue moon is difficult to avoid, but what we really don't want is a 1 second I/O blocking 1000 other requests that we receive within the same second. Conceptually NGINX can handle many requests in parallel but it only runs 1 event handler at a time. So I added a metric that measures this:</p>
            <pre><code>gettimeofday(&amp;start, NULL);
num_events = epoll_wait(epfd, /*returned=*/events, events_len, /*timeout=*/-1);
// events is list of active events
// handle event[0]: incoming request GET http://example.com/
gettimeofday(&amp;event_start_handle, NULL);
// handle event[1]: send out response to GET http://cloudflare.com/
timersub(&amp;event_start_handle, &amp;start, &amp;event_loop_blocked);</code></pre>
            <p>p99 of <code>event_loop_blocked</code> turned out to be more than 50% of our TTFB. Which is to say, half of the time it takes to serve a request is a result of the event loop being blocked by other requests. <code>event_loop_blocked</code> only measures about half of the blocking (because delayed calls to <code>epoll_wait()</code> are not measured) so the actual ratio of blocked time is much higher.</p><p>Each of our machines run NGINX with 15 worker processes, which means one slow I/O should only block up to 6% of the requests. However, the events are not evenly distributed, with the top worker taking 11% of the requests (or twice as many as expected).</p><p><code>SO_REUSEPORT</code> can solve the uneven distribution problem. Marek Majkowski has previously written about <a href="/the-sad-state-of-linux-socket-balancing/">the downside</a> in the context of other NGINX instances, but that downside mostly doesn't apply in our case since upstream connections in our cache process are long-lived, so a slightly higher latency in opening the connection is negligible. This single configuration change to enable <code>SO_REUSEPORT</code> improved peak p99 by 33%.</p>
    <div>
      <h3>Moving read() to thread pool: not a silver bullet</h3>
      <a href="#moving-read-to-thread-pool-not-a-silver-bullet">
        
      </a>
    </div>
    <p>A solution to this is to make read() not block. In fact, this is a feature that's <a href="https://www.nginx.com/blog/thread-pools-boost-performance-9x/">implemented in upstream NGINX</a>! When the following configuration is used, <code>read()</code> and <code>write()</code> are done in a thread pool and won't block the event loop:</p>
            <pre><code>aio         threads;
aio_write   on;</code></pre>
            <p>However when we tested this, instead of 33x response time improvement, we actually saw a slight increase in p99. The difference was within margin of error but we were quite discouraged by the result and stopped pursuing this option for a while.</p><p>There are a few reasons why we didn’t see the level of improvements that NGINX saw. In their test, they were using 200 concurrent connections to request files that were 4MB in size, which were residing on spinning disks. Spinning disks increase I/O latency so it makes sense that an optimization that helps latency would have more dramatic effect.</p><p>We are also mostly concerned with p99 (and p999) performance. Solutions that help the average performance don't necessarily help with outliers.</p><p>Finally, in our environment, typical file sizes are much smaller. 90% of our cache hits are for files smaller than 60KB. Smaller files mean fewer occasions to block (we typically read the entire file in 2 reads).</p><p>If we look at the disk I/O that a cache hit has to do:</p>
            <pre><code>// we got a request for https://example.com which has cache key 0xCAFEBEEF
fd = open("/cache/prefix/dir/EF/BE/CAFEBEEF", O_RDONLY);
// read up to 32KB for the metadata as well as the headers
// done in thread pool if "aio threads" is on
read(fd, buf, 32*1024);</code></pre>
            <p>32KB isn't a static number, if the headers are small we need to read just 4KB (we don't use direct IO so kernel will round up to 4KB). The <code>open()</code> seems innocuous but it's actually not free. At a minimum the kernel needs to check if the file exists and if the calling process has permission to open it. For that it would have to find the inode of <code>/cache/prefix/dir/EF/BE/CAFEBEEF</code>, and to do that it would have to look up <code>CAFEBEEF</code> in <code>/cache/prefix/dir/EF/BE/</code>. Long story short, in the worst case the kernel has to do the following lookups:</p>
            <pre><code>/cache
/cache/prefix
/cache/prefix/dir
/cache/prefix/dir/EF
/cache/prefix/dir/EF/BE
/cache/prefix/dir/EF/BE/CAFEBEEF</code></pre>
            <p>That's 6 separate reads done by <code>open()</code> compared to just 1 read done by <code>read()</code>! Fortunately, most of the time lookups are serviced by <a href="https://www.kernel.org/doc/Documentation/filesystems/vfs.txt">the dentry cache</a> and don't require trips to the SSDs. But clearly having <code>read()</code> done in thread pool is only half of the picture.</p>
    <div>
      <h3>The coup de grâce: non-blocking open() in thread pools</h3>
      <a href="#the-coup-de-grace-non-blocking-open-in-thread-pools">
        
      </a>
    </div>
    <p>So I modified NGINX to do most of <code>open()</code> inside the thread pool as well so it won't block the event loop. And the result (both non-blocking open and non-blocking read):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7zUNNTAUAfVrlcTLWqp0Xc/aaff6458aa54f35f8b7835cb19679fd3/Screenshot-from-2018-07-17-12-16-27.png" />
            
            </figure><p>On June 26 we deployed our changes to 5 of our busiest data centers, followed by world wide roll-out the next day. Overall peak p99 TTFB improved by a factor of 6. In fact, adding up all the time from processing 8 million requests per second, we saved the Internet 54 years of wait time every day.</p><p>We've submitted our work to <a href="http://mailman.nginx.org/pipermail/nginx-devel/2018-August/011354.html">upstream</a>. Interested parties can follow along.</p><p>Our event loop handling is still not completely non-blocking. In particular, we still block when we are caching a file for the first time (both the <code>open(O_CREAT)</code> and <code>rename()</code>), or doing revalidation updates. However, those are rare compared to cache hits. In the future we will consider moving those off of the event loop to further improve our p99 latency.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>NGINX is a powerful platform, but scaling extremely high I/O loads on linux can be challenging. Upstream NGINX can offload reads in separate threads, but at our scale we often need to go one step further. If working on challenging performance problems sounds exciting to you, <a href="https://www.cloudflare.com/careers/jobs/?department=Engineering">apply to join our team</a> in San Francisco, London, Austin or Champaign.</p> ]]></content:encoded>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Cache]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">3uAC7i358jXCZ2M7Cjj9Pn</guid>
            <dc:creator>Ka-Hing Cheung</dc:creator>
        </item>
        <item>
            <title><![CDATA[Web Cache Deception Attack revisited]]></title>
            <link>https://blog.cloudflare.com/web-cache-deception-attack-revisited/</link>
            <pubDate>Fri, 19 Jan 2018 17:38:00 GMT</pubDate>
            <description><![CDATA[ In April, we wrote about Web Cache Deception attacks, and how our customers can avoid them using origin configuration.  Since our previous blog post, we have looked for but have not seen any large scale attacks like this in the wild. ]]></description>
            <content:encoded><![CDATA[ <p>In April, we wrote about <a href="/understanding-our-cache-and-the-web-cache-deception-attack/">Web Cache Deception attacks</a>, and how our customers can avoid them using origin configuration.</p><p>Read that blog post to learn about how to configure your website, and for those who are not able to do that, how to disable caching for certain URIs to prevent this type of attacks. Since our previous blog post, we have looked for but have not seen any large scale attacks like this in the wild.</p><p>Today, we have released a tool to help our customers make sure only assets that should be cached are being cached.</p>
    <div>
      <h3>A brief re-introduction to Web Cache Deception attack</h3>
      <a href="#a-brief-re-introduction-to-web-cache-deception-attack">
        
      </a>
    </div>
    <p>Recall that the Web Cache Deception attack happens when an attacker tricks a user into clicking a link in the format of <code>http://www.example.com/newsfeed/foo.jpg</code>, when <code>http://www.example.com/newsfeed</code> is the location of a dynamic script that returns different content for different users. For some website configurations (default in Apache but not in nginx), this would invoke <code>/newsfeed</code> with <a href="https://tools.ietf.org/html/rfc3875#section-4.1.5"><code>PATH_INFO</code></a> set to <code>/foo.jpg</code>. If <code>http://www.example.com/newsfeed/foo.jpg</code> does not return the proper <code>Cache-Control</code> headers to tell a web cache not to cache the content, web caches may decide to cache the result based on the extension of the URL. The attacker can then visit the same URL and retrieve the cached content of a private page.</p><p>The proper fix for this is to configure your website to either reject requests with the extra <code>PATH_INFO</code> or to return the proper <code>Cache-Control</code> header. Sometimes our customers are not able to do that (maybe the website is running third-party software that they do not fully control), and they can apply a Bypass Cache Page Rule for those script locations.</p>
    <div>
      <h3>Cache Deception Armor</h3>
      <a href="#cache-deception-armor">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/KOEBIR0VfOtdSOjvueaLv/7898ccfe2589400017b74e2a928d4e20/photo-1460194436988-671f763436b7" />
            
            </figure><p>Photo by <a href="https://unsplash.com/@enzo74?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Henry Hustava</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></p><p>The new Cache Deception Armor Page Rule protects customers from Web Cache Deception attacks while still allowing static assets to be cached. It verifies that the URL's extension matches the returned <code>Content-Type</code>. In the above example, if <code>http://www.example.com/newsfeed</code> is a script that outputs a web page, the <code>Content-Type</code> is <code>text/html</code>. On the other hand, <code>http://www.example.com/newsfeed/foo.jpg</code> is expected to have <code>image/jpeg</code> as <code>Content-Type</code>. When we see a mismatch that could result in a Web Cache Deception attack, we will not cache the response.</p><p>There are some exceptions to this. For example if the returned <code>Content-Type</code> is <code>application/octet-stream</code> we don't care what the extension is, because that's typically a signal to instruct the browser to save the asset instead of to display it. We also allow <code>.jpg</code> to be served as <code>image/webp</code> or <code>.gif</code> as <code>video/webm</code> and other cases that we think are unlikely to be attacks.</p><p>This new Page Rule depends upon <a href="https://support.cloudflare.com/hc/en-us/articles/115003206852s">Origin Cache Control</a>. A <code>Cache-Control</code> header from the origin or Edge Cache TTL Page Rule will override this protection.</p> ]]></content:encoded>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[Page Rules]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Best Practices]]></category>
            <guid isPermaLink="false">f3uB5nHEVeLV7BA4Q2wzt</guid>
            <dc:creator>Ka-Hing Cheung</dc:creator>
        </item>
    </channel>
</rss>