The Cloudflare Blog

How we scaled nginx and saved the world 54 years every day

Ka-Hing Cheung — Tue, 31 Jul 2018 15:00:00 GMT

The @Cloudflare team just pushed a change that improves our network's performance significantly, especially for particularly slow outlier requests. How much faster? We estimate we're saving the Internet ~54 years *per day* of time we'd all otherwise be waiting for sites to load.
— Matthew Prince (@eastdakota) June 28, 2018

10 million websites, apps and APIs use Cloudflare to give their users a speed boost. At peak we serve more than 10 million requests a second across our 151 data centers. Over the years we’ve made many modifications to our version of NGINX to handle our growth. This is blog post is about one of them.

How NGINX works

NGINX is one of the programs that popularized using event loops to solve the C10K problem. Every time a network event comes in (a new connection, a request, or a notification that we can send more data, etc.) NGINX wakes up, handles the event, and then goes back to do whatever it needs to do (which may be handling other events). When an event arrives, data associated with the event is already ready, which allows NGINX to efficiently handle many requests simultaneously without waiting.

num_events = epoll_wait(epfd, /*returned=*/events, events_len, /*timeout=*/-1);
// events is list of active events
// handle event[0]: incoming request GET http://example.com/
// handle event[1]: send out response to GET http://cloudflare.com/

For example, here's what a piece of code could look like to read data from a file descriptor:

// we got a read event on fd
while (buf_len > 0) {
    ssize_t n = read(fd, buf, buf_len);
    if (n < 0) {
        if (errno == EWOULDBLOCK || errno == EAGAIN) {
            // try later when we get a read event again
        }
        if (errno == EINTR) {
            continue;
        }
        return total;
    }
    buf_len -= n;
    buf += n;
    total += n;
}

When fd is a network socket, this will return the bytes that have already arrived. The final call will return EWOULDBLOCK which means we have drained the local read buffer, so we should not read from the socket again until more data becomes available.

Disk I/O is not like network I/O

When fd is a regular file on Linux, EWOULDBLOCK and EAGAIN never happens, and read always waits to read the entire buffer. This is true even if the file was opened with O_NONBLOCK. Quoting open(2):

Note that this flag has no effect for regular files and block devices

In other words, the code above basically reduces to:

if (read(fd, buf, buf_len) > 0) {
   return buf_len;
}

Which means that if an event handler needs to read from disk, it will block the event loop until the entire read is finished, and subsequent event handlers are delayed.

This ends up being fine for most workloads, because reading from disk is usually fast enough, and much more predictable compared to waiting for a packet to arrive from network. That's especially true now that everyone has an SSD, and our cache disks are all SSDs. Modern SSDs have very low latency, typically in 10s of µs. On top of that, we can run NGINX with multiple worker processes so that a slow event handler does not block requests in other processes. Most of the time, we can rely on NGINX's event handling to service requests quickly and efficiently.

SSD performance: not always what’s on the label

As you might have guessed, these rosy assumptions aren’t always true. If each read always takes 50µs then it should only take 2ms to read 0.19MB in 4KB blocks (and we read in larger blocks). But our own measurements showed that our time to first byte is sometimes much worse, particularly at 99th and 999th percentile. In other words, the slowest read per 100 (or per 1000) reads often takes much longer.

SSDs are very fast but they are also notoriously complicated. Inside them are computers that queue up and re-order I/O, and also perform various background tasks like garbage collection and defragmentation. Once in a while, a request gets slowed down enough to matter. My colleague Ivan Babrou ran some I/O benchmarks and saw read spikes of up to 1 second. Moreover, some of our SSDs have more performance outliers than others. Going forward we will consider performance consistency in our SSD purchases, but in the meantime we need to have a solution for our existing hardware.

Spreading the load evenly with `SO_REUSEPORT`

An individual slow response once in a blue moon is difficult to avoid, but what we really don't want is a 1 second I/O blocking 1000 other requests that we receive within the same second. Conceptually NGINX can handle many requests in parallel but it only runs 1 event handler at a time. So I added a metric that measures this:

gettimeofday(&start, NULL);
num_events = epoll_wait(epfd, /*returned=*/events, events_len, /*timeout=*/-1);
// events is list of active events
// handle event[0]: incoming request GET http://example.com/
gettimeofday(&event_start_handle, NULL);
// handle event[1]: send out response to GET http://cloudflare.com/
timersub(&event_start_handle, &start, &event_loop_blocked);

p99 of event_loop_blocked turned out to be more than 50% of our TTFB. Which is to say, half of the time it takes to serve a request is a result of the event loop being blocked by other requests. event_loop_blocked only measures about half of the blocking (because delayed calls to epoll_wait() are not measured) so the actual ratio of blocked time is much higher.

Each of our machines run NGINX with 15 worker processes, which means one slow I/O should only block up to 6% of the requests. However, the events are not evenly distributed, with the top worker taking 11% of the requests (or twice as many as expected).

SO_REUSEPORT can solve the uneven distribution problem. Marek Majkowski has previously written about the downside in the context of other NGINX instances, but that downside mostly doesn't apply in our case since upstream connections in our cache process are long-lived, so a slightly higher latency in opening the connection is negligible. This single configuration change to enable SO_REUSEPORT improved peak p99 by 33%.

Moving read() to thread pool: not a silver bullet

A solution to this is to make read() not block. In fact, this is a feature that's implemented in upstream NGINX! When the following configuration is used, read() and write() are done in a thread pool and won't block the event loop:

aio         threads;
aio_write   on;

However when we tested this, instead of 33x response time improvement, we actually saw a slight increase in p99. The difference was within margin of error but we were quite discouraged by the result and stopped pursuing this option for a while.

There are a few reasons why we didn’t see the level of improvements that NGINX saw. In their test, they were using 200 concurrent connections to request files that were 4MB in size, which were residing on spinning disks. Spinning disks increase I/O latency so it makes sense that an optimization that helps latency would have more dramatic effect.

We are also mostly concerned with p99 (and p999) performance. Solutions that help the average performance don't necessarily help with outliers.

Finally, in our environment, typical file sizes are much smaller. 90% of our cache hits are for files smaller than 60KB. Smaller files mean fewer occasions to block (we typically read the entire file in 2 reads).

If we look at the disk I/O that a cache hit has to do:

// we got a request for https://example.com which has cache key 0xCAFEBEEF
fd = open("/cache/prefix/dir/EF/BE/CAFEBEEF", O_RDONLY);
// read up to 32KB for the metadata as well as the headers
// done in thread pool if "aio threads" is on
read(fd, buf, 32*1024);

32KB isn't a static number, if the headers are small we need to read just 4KB (we don't use direct IO so kernel will round up to 4KB). The open() seems innocuous but it's actually not free. At a minimum the kernel needs to check if the file exists and if the calling process has permission to open it. For that it would have to find the inode of /cache/prefix/dir/EF/BE/CAFEBEEF, and to do that it would have to look up CAFEBEEF in /cache/prefix/dir/EF/BE/. Long story short, in the worst case the kernel has to do the following lookups:

/cache
/cache/prefix
/cache/prefix/dir
/cache/prefix/dir/EF
/cache/prefix/dir/EF/BE
/cache/prefix/dir/EF/BE/CAFEBEEF

That's 6 separate reads done by open() compared to just 1 read done by read()! Fortunately, most of the time lookups are serviced by the dentry cache and don't require trips to the SSDs. But clearly having read() done in thread pool is only half of the picture.

The coup de grâce: non-blocking open() in thread pools

So I modified NGINX to do most of open() inside the thread pool as well so it won't block the event loop. And the result (both non-blocking open and non-blocking read):

On June 26 we deployed our changes to 5 of our busiest data centers, followed by world wide roll-out the next day. Overall peak p99 TTFB improved by a factor of 6. In fact, adding up all the time from processing 8 million requests per second, we saved the Internet 54 years of wait time every day.

We've submitted our work to upstream. Interested parties can follow along.

Our event loop handling is still not completely non-blocking. In particular, we still block when we are caching a file for the first time (both the open(O_CREAT) and rename()), or doing revalidation updates. However, those are rare compared to cache hits. In the future we will consider moving those off of the event loop to further improve our p99 latency.

Conclusion

NGINX is a powerful platform, but scaling extremely high I/O loads on linux can be challenging. Upstream NGINX can offload reads in separate threads, but at our scale we often need to go one step further. If working on challenging performance problems sounds exciting to you, apply to join our team in San Francisco, London, Austin or Champaign.

Web Cache Deception Attack revisited

Ka-Hing Cheung — Fri, 19 Jan 2018 17:38:00 GMT

In April, we wrote about Web Cache Deception attacks, and how our customers can avoid them using origin configuration.

Read that blog post to learn about how to configure your website, and for those who are not able to do that, how to disable caching for certain URIs to prevent this type of attacks. Since our previous blog post, we have looked for but have not seen any large scale attacks like this in the wild.

Today, we have released a tool to help our customers make sure only assets that should be cached are being cached.

A brief re-introduction to Web Cache Deception attack

Recall that the Web Cache Deception attack happens when an attacker tricks a user into clicking a link in the format of http://www.example.com/newsfeed/foo.jpg, when http://www.example.com/newsfeed is the location of a dynamic script that returns different content for different users. For some website configurations (default in Apache but not in nginx), this would invoke /newsfeed with PATH_INFO set to /foo.jpg. If http://www.example.com/newsfeed/foo.jpg does not return the proper Cache-Control headers to tell a web cache not to cache the content, web caches may decide to cache the result based on the extension of the URL. The attacker can then visit the same URL and retrieve the cached content of a private page.

The proper fix for this is to configure your website to either reject requests with the extra PATH_INFO or to return the proper Cache-Control header. Sometimes our customers are not able to do that (maybe the website is running third-party software that they do not fully control), and they can apply a Bypass Cache Page Rule for those script locations.

Cache Deception Armor

Photo by Henry Hustava / Unsplash

The new Cache Deception Armor Page Rule protects customers from Web Cache Deception attacks while still allowing static assets to be cached. It verifies that the URL's extension matches the returned Content-Type. In the above example, if http://www.example.com/newsfeed is a script that outputs a web page, the Content-Type is text/html. On the other hand, http://www.example.com/newsfeed/foo.jpg is expected to have image/jpeg as Content-Type. When we see a mismatch that could result in a Web Cache Deception attack, we will not cache the response.

There are some exceptions to this. For example if the returned Content-Type is application/octet-stream we don't care what the extension is, because that's typically a signal to instruct the browser to save the asset instead of to display it. We also allow .jpg to be served as image/webp or .gif as video/webm and other cases that we think are unlikely to be attacks.

This new Page Rule depends upon Origin Cache Control. A Cache-Control header from the origin or Edge Cache TTL Page Rule will override this protection.