
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Fri, 03 Apr 2026 23:19:14 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Keepalives considered harmful]]></title>
            <link>https://blog.cloudflare.com/keepalives-considered-harmful/</link>
            <pubDate>Thu, 19 Mar 2020 12:57:35 GMT</pubDate>
            <description><![CDATA[ You’d think keepalives would always be helpful, but it turns out reality isn’t always what you expect. It really helps if you read Why does one NGINX worker take all the load? first. ]]></description>
            <content:encoded><![CDATA[ <p>This may sound like a weird title, but hear me out. You’d think keepalives would always be helpful, but it turns out reality isn’t always what you expect. It really helps if you read <a href="/the-sad-state-of-linux-socket-balancing/">Why does one NGINX worker take all the load?</a> first. This post is an adaptation of a rather old post on Cloudflare’s internal blog, so not all details are exactly as they are in production today, but the lessons are still valid.</p><p>This is a story about how we saw some complaints about sporadic latency spikes, made some unconventional changes, and slashed the 99.9th latency percentile by 4x!</p>
    <div>
      <h3>Request flow on Cloudflare edge</h3>
      <a href="#request-flow-on-cloudflare-edge">
        
      </a>
    </div>
    <p>I'm going to focus on only two parts of our edge stack: FL and SSL.</p><ul><li><p>FL accepts plain HTTP connections and does the main request logic, including our WAF</p></li><li><p>SSL terminates SSL and passes connections to FL over a local Unix socket</p></li></ul><p>Here’s a diagram:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3UdCg5j3wBN8mKvkAEjDJ9/7b8133b0be95571376050695fcddcd16/Screen-Shot-2020-03-19-at-12.22.41.png" />
            
            </figure><p>These days we route all traffic through SSL for simplicity, but in the grand scheme of things it’s not going to matter much.</p><p>Each of these processes is not itself a single process, but rather a main process and a collection of workers that do actual processing. Another diagram for you to make this more visual:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1OI9zf8X8wBnNTaE0j94M2/1dc9fe6fcb05de08a90e81bc98747aeb/Screen-Shot-2020-03-19-at-12.22.55.png" />
            
            </figure>
    <div>
      <h3>Keepalives</h3>
      <a href="#keepalives">
        
      </a>
    </div>
    <p>Requests come in over connections that are reused for performance reasons, which is sometimes referred to as “keepalive”. It's generally expensive for a client to open a new TCP connection, and our servers keep memory pools associated with each connection that need to be recycled. In fact, one of the mitigations we use against attack traffic is disabling keepalives for abusive clients, forcing them to reopen connections, which slows them down considerably.</p><p>To illustrate the usefulness of keepalives, here's me requesting <a href="http://example.com/">http://example.com/</a> twice from curl over the same connection:</p>
            <pre><code>$ curl -s -w "Time to connect: %{time_connect} Time to first byte: %{time_starttransfer}\n" -o /dev/null http://example.com/ -o /dev/null http://example.com/

Time to connect: 0.012108 Time to first byte: 0.018724
Time to connect: 0.000019 Time to first byte: 0.007391</code></pre>
            <p>The first request took 18.7ms, and 12.1ms of that went into establishing a new connection, which is not even a TLS one. When I sent another request over the same connection, I didn't need to pay that cost again and it took just 7.4ms to service my request. That's a big win, which gets even bigger if you need to establish a brand new connection (especially if it’s not <a href="https://www.cloudflare.com/learning-resources/tls-1-3/">TLSv1.3</a> or <a href="/http3-the-past-present-and-future/">QUIC</a>). DNS was also cached in the example above; an uncached lookup would be another source of latency for domains with a low TTL.</p><p>Keepalives tend to be used extensively because they seem beneficial, and for this exact reason they are generally enabled by default.</p>
    <div>
      <h3>Taking all the load</h3>
      <a href="#taking-all-the-load">
        
      </a>
    </div>
    <p>Due to how Linux works (see the first link in this blog post), most of the load goes to a few workers out of the pool. When such a worker is not able to accept() a pending connection because it's busy processing requests, the connection spills over to another worker. The process cascades until some worker is ready to pick it up.</p><p>This leaves us with a ladder of load spread between workers:</p>
            <pre><code>               CPU%                                       Runtime
nobody    4254 51.2  0.8 5655600 1093848 ?     R    Aug23 3938:34 nginx: worker process
nobody    4257 47.9  0.8 5615848 1071612 ?     S    Aug23 3682:05 nginx: worker process
nobody    4253 43.8  0.8 5594124 1069424 ?     R    Aug23 3368:27 nginx: worker process
nobody    4255 39.4  0.8 5573888 1070272 ?     S    Aug23 3030:01 nginx: worker process
nobody    4256 36.2  0.7 5556700 1052560 ?     R    Aug23 2784:23 nginx: worker process
nobody    4251 33.1  0.8 5563276 1063700 ?     S    Aug23 2545:07 nginx: worker process
nobody    4252 29.2  0.8 5561232 1058748 ?     S    Aug23 2245:59 nginx: worker process
nobody    4248 26.7  0.8 5554652 1057288 ?     S    Aug23 2056:19 nginx: worker process
nobody    4249 24.5  0.7 5537276 1043568 ?     S    Aug23 1883:18 nginx: worker process
nobody    4245 22.5  0.7 5552340 1048592 ?     S    Aug23 1736:37 nginx: worker process
nobody    4250 20.7  0.7 5533728 1038676 ?     R    Aug23 1598:16 nginx: worker process
nobody    4247 19.6  0.7 5547548 1044480 ?     S    Aug23 1507:27 nginx: worker process
nobody    4246 18.4  0.7 5538104 1043452 ?     S    Aug23 1421:23 nginx: worker process
nobody    4244 17.5  0.7 5530480 1035264 ?     S    Aug23 1345:39 nginx: worker process
nobody    4243 16.6  0.7 5529232 1024268 ?     S    Aug23 1281:55 nginx: worker process
nobody    4242 16.6  0.7 5537956 1038408 ?     R    Aug23 1278:40 nginx: worker process</code></pre>
            <p>The third column is instantaneous CPU%; the time after the date is total on-CPU time. If you look at the same processes and count their open sockets, you'll see this (using the same order of processes as above):</p>
            <pre><code>4254 2357
4257 1833
4253 2180
4255 1609
4256 1565
4251 1519
4252 1175
4248 1065
4249 1056
4245 1201
4250 886
4247 908
4246 968
4244 1180
4243 867
4242 884</code></pre>
            <p>More load corresponds to more open sockets. More open sockets generate more load to serve these connections. It’s a vicious circle.</p>
    <div>
      <h3>The twist</h3>
      <a href="#the-twist">
        
      </a>
    </div>
    <p>Now that we have all these workers holding onto these connections, requests over those connections are also, in a way, pinned to workers:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3kvLGjxkG4Rb5l5n7cyawp/12bcf6fc7e9a06967bdbb0323019a5f9/Screen-Shot-2020-03-19-at-12.23.16.png" />
            
            </figure><p>In FL we're doing some things that are <i>somewhat</i> compute intensive, which means that some workers can be busy for a <i>somewhat</i> prolonged period of time, while other workers sit idle, increasing latency for requests. Ideally we want to always hand over a request to a worker that's idle, because that maximizes our chances of not being blocked, even if it takes a few event loop iterations to serve a request (meaning that we may still block down the road).</p><p>One part of dealing with the issue is <a href="/the-problem-with-event-loops/">offloading some of the compute intensive parts</a> of the request processing into a thread pool, but that hadn’t happened at that point. We had to do something else in the meantime.</p><p>Clients have no way of knowing that their worker is busy and that they should probably ask another worker to serve the connection. For clients over a real network this doesn't even make sense, since by the time their request reaches NGINX up to tens of milliseconds later, the situation will probably be different.</p><p>But our clients are not on a long haul network! They are local: they are SSL, connecting over a Unix socket. It's not that expensive to reopen a new connection for each request, and what it gives us is the ability to pass a fully formed request that's buffered in SSL into FL:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3LnukCCSMEeSewRhJdI6aA/f11fac45b1b251f21511a53796c96321/Screen-Shot-2020-03-19-at-12.23.28.png" />
            
            </figure><p>Two key points here:</p><ul><li><p>The request is always picked up by an idle worker</p></li><li><p>The request is readily available for potentially compute intensive processing</p></li></ul><p>The former is the most important part.</p>
    <div>
      <h3>Validating the hypothesis</h3>
      <a href="#validating-the-hypothesis">
        
      </a>
    </div>
    <p>To validate this assumption, I wrote the following program:</p>
            <pre><code>package main
 
import (
    "flag"
    "io/ioutil"
    "log"
    "net/http"
    "os"
    "time"
)
 
func main() {
    u := flag.String("url", "", "url to request")
    p := flag.Duration("pause", 0, "pause between requests")
    t := flag.Duration("threshold", 0, "threshold for reporting")
    c := flag.Int("count", 10000, "number of requests to send")
    n := flag.Int("close", 1, "close connection after every that many requests")
 
    flag.Parse()
 
    if *u == "" {
        flag.PrintDefaults()
        os.Exit(1)
    }
 
    client := http.Client{}
 
    for i := 0; i < *c; i++ {
        started := time.Now()
 
        request, err := http.NewRequest("GET", *u, nil)
        if err != nil {
            log.Fatalf("Error constructing request: %s", err)
        }
 
        if i%*n == 0 {
            request.Header.Set("Connection", "Close")
        }
 
        response, err := client.Do(request)
        if err != nil {
            log.Fatalf("Error performing request: %s", err)
        }
 
        _, err = ioutil.ReadAll(response.Body)
        if err != nil {
            log.Fatalf("Error reading request body: %s", err)
        }
 
        response.Body.Close()
 
        elapsed := time.Since(started)
        if elapsed > *t {
            log.Printf("Request %d took %dms", i, int(elapsed.Seconds()*1000))
        }
 
        time.Sleep(*p)
    }
}</code></pre>
            <p>The program requests a given URL in a loop and closes the connection after every N requests (the <code>-close</code> flag). We also pause for a short time between requests.</p><p>If we close a connection after every request:</p>
            <pre><code>$ go run /tmp/main.go -url http://test.domain/cdn-cgi/trace -count 10000 -pause 5ms -threshold 20ms -close 1
2018/08/24 23:42:34 Request 453 took 32ms
2018/08/24 23:42:38 Request 1044 took 24ms
2018/08/24 23:43:00 Request 4106 took 83ms
2018/08/24 23:43:12 Request 5778 took 27ms
2018/08/24 23:43:16 Request 6292 took 27ms
2018/08/24 23:43:20 Request 6856 took 21ms
2018/08/24 23:43:32 Request 8578 took 45ms
2018/08/24 23:43:42 Request 9938 took 22ms</code></pre>
            <p>We request an endpoint that's served in FL by Lua, so seeing any slow requests is unfortunate. There's an element of luck in this game, and our program gets no cooperation from eyeballs or SSL, so some slow requests are expected.</p><p>Now, if we start closing the connection only after every other request, the situation immediately gets a lot worse:</p>
            <pre><code>$ go run /tmp/main.go -url http://teste1.cfperf.net/cdn-cgi/trace -count 10000 -pause 5ms -threshold 20ms -close 2
2018/08/24 23:43:51 Request 162 took 22ms
2018/08/24 23:43:51 Request 220 took 21ms
2018/08/24 23:43:53 Request 452 took 23ms
2018/08/24 23:43:54 Request 540 took 41ms
2018/08/24 23:43:54 Request 614 took 23ms
2018/08/24 23:43:56 Request 900 took 40ms
2018/08/24 23:44:02 Request 1705 took 21ms
2018/08/24 23:44:03 Request 1850 took 27ms
2018/08/24 23:44:03 Request 1878 took 36ms
2018/08/24 23:44:08 Request 2470 took 21ms
2018/08/24 23:44:11 Request 2926 took 22ms
2018/08/24 23:44:14 Request 3350 took 37ms
2018/08/24 23:44:14 Request 3404 took 21ms
2018/08/24 23:44:16 Request 3598 took 32ms
2018/08/24 23:44:16 Request 3606 took 22ms
2018/08/24 23:44:19 Request 4026 took 33ms
2018/08/24 23:44:20 Request 4250 took 74ms
2018/08/24 23:44:22 Request 4483 took 20ms
2018/08/24 23:44:23 Request 4572 took 21ms
2018/08/24 23:44:23 Request 4644 took 23ms
2018/08/24 23:44:24 Request 4758 took 63ms
2018/08/24 23:44:25 Request 4808 took 39ms
2018/08/24 23:44:30 Request 5496 took 28ms
2018/08/24 23:44:31 Request 5736 took 88ms
2018/08/24 23:44:32 Request 5845 took 43ms
2018/08/24 23:44:33 Request 5988 took 52ms
2018/08/24 23:44:34 Request 6042 took 26ms
2018/08/24 23:44:34 Request 6049 took 23ms
2018/08/24 23:44:40 Request 6872 took 86ms
2018/08/24 23:44:40 Request 6940 took 23ms
2018/08/24 23:44:40 Request 6964 took 23ms
2018/08/24 23:44:44 Request 7532 took 32ms
2018/08/24 23:44:49 Request 8224 took 22ms
2018/08/24 23:44:49 Request 8234 took 29ms
2018/08/24 23:44:51 Request 8536 took 24ms
2018/08/24 23:44:55 Request 9028 took 22ms
2018/08/24 23:44:55 Request 9050 took 23ms
2018/08/24 23:44:55 Request 9092 took 26ms
2018/08/24 23:44:57 Request 9330 took 25ms
2018/08/24 23:45:01 Request 9962 took 48ms</code></pre>
            <p>If we close only after every 5 requests, the number of slow responses almost doubles. This is counter-intuitive: keepalives are supposed to help with latency, not make it worse!</p>
    <div>
      <h3>Trying this in the wild</h3>
      <a href="#trying-this-in-the-wild">
        
      </a>
    </div>
    <p>To see how this works out in the real world, we disabled keepalives between SSL and FL in one location, forcing SSL to send every request over a separate connection (remember: cheap local Unix socket). Here’s how our probes toward that location reacted to this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6sRZtAWZEJFfbhGrKwPHdO/4b6a3029b777679a7c99f3ab7c1a274c/image5-6.png" />
            
            </figure><p>Here’s cumulative wait time between SSL and FL:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2GJzrOfymF20pcJJ3Ey41a/1f39c8467dea590feebcb218b4814d7c/image3-3.png" />
            
            </figure><p>And finally 99.9th percentile of the same measurement:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2zP0HjPLzGKswV6w4CP2S/cb415b27016522a94a12195a5fcf8c54/image1-8.png" />
            
            </figure><p>This is a huge win.</p><p>Another telling graph is comparing our average “edge processing time” (which includes WAF) in a test location to a global value:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4n6MYPoNl0vo1y5BVKIKcp/0a1a906052ca51ff19363b3ade6bb8fc/image2-5.png" />
            
            </figure><p>We reduced unnecessary wait time due to an imbalance without increasing the CPU load, which directly translates into improved user experience and lower resource consumption for us.</p>
    <div>
      <h3>The downsides</h3>
      <a href="#the-downsides">
        
      </a>
    </div>
    <p>There has to be a downside to this, right? The problem we introduced is that the CPU imbalance between individual CPU cores went up:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3MSawWBrDEpUIbWHgWpGut/72484ee745fd1620f617d5cf8c124a68/image4-3.png" />
            
            </figure><p>Overall CPU usage did not change, just its distribution. We already know of a way of dealing with this: SO_REUSEPORT. Either that or <a href="https://lkml.org/lkml/2015/2/17/518">EPOLLROUNDROBIN</a>, which doesn’t have some of the drawbacks of SO_REUSEPORT (which does not work for a Unix socket, for example), but requires a patched kernel. If we combine disabled keepalives with the EPOLLROUNDROBIN change, we can see the CPUs allocated to FL converge nicely in their utilization:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/zN1XbJxUeCeVx2fyDYljN/fdef9e0f2810338b90ea094b1050a879/image6-3.png" />
            
            </figure><p>We’ve tried different combinations, and having both EPOLLROUNDROBIN and disabled keepalives worked best. Having just one of them is not as beneficial for lowering latency.</p>
    <div>
      <h3>Conclusions</h3>
      <a href="#conclusions">
        
      </a>
    </div>
    <p>We disabled keepalives between SSL and FL running on the same box, and this greatly improved our tail latency caused by requests landing on non-optimal FL workers. This was an unexpected fix, but it worked, and we are able to explain it.</p><p>This doesn’t mean that you should go and disable keepalives everywhere. Generally keepalives are great and should stay enabled, but in our case the latency of local connection establishment is much lower than the delay we can get from landing on a busy worker.</p><p>In practice this means that we can run our machines hotter and not see latency rise as much as it did before. Imagine moving the CPU cap from 50% to 80% with no effect on latency. The numbers are arbitrary, but the idea holds. Running hotter allows fewer machines to serve the same amount of traffic, reducing our overall footprint.</p> ]]></content:encoded>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Performance]]></category>
            <guid isPermaLink="false">5TXV22jMUsRHWon2QS3kJV</guid>
            <dc:creator>Ivan Babrou</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing ebpf_exporter]]></title>
            <link>https://blog.cloudflare.com/introducing-ebpf_exporter/</link>
            <pubDate>Fri, 24 Aug 2018 15:11:53 GMT</pubDate>
            <description><![CDATA[ Here at Cloudflare we use Prometheus to collect operational metrics. We run it on hundreds of servers and ingest millions of metrics per second to get insight into our network and provide the best possible service to our customers. ]]></description>
            <content:encoded><![CDATA[ <p><i>This is an adapted transcript of a talk I gave at Promcon 2018. You can find slides with additional information on our Prometheus deployment and presenter notes </i><a href="https://docs.google.com/presentation/d/1420m05QANTxrCPvwsLZSHWRTvqOsxCVRzJsVp9awoFg/edit?usp=sharing"><i>here</i></a><i>. There's also a </i><a href="https://www.youtube.com/watch?v=VvJx0WTiGcA&amp;feature=youtu.be&amp;t=1h15m40s"><i>video</i></a><i>.</i></p><p><i>Tip: you can click on the image to see the original large version.</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/18mND4203diqo8PdjA7mRW/3da65a7279fc7f0d8a172a7b59f9d7c1/1-1.jpeg.jpeg" />
            
            </figure><p>Here at Cloudflare we use <a href="https://prometheus.io/">Prometheus</a> to collect operational metrics. We run it on hundreds of servers and ingest millions of metrics per second to get insight into our network and provide the best possible service to our customers.</p><p>The Prometheus metric format is popular enough that it's now being standardized as <a href="https://openmetrics.io/">OpenMetrics</a> under the Cloud Native Computing Foundation. It's exciting to see convergence in the long fragmented metrics landscape.</p><p>In this blog post we'll talk about how we measure low level metrics and share a tool that can help you get a similar understanding of your systems.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3T3BNATGvX6eNAUsUKkIPZ/284180e45e5a01a38ab1f0ba3724180d/2.jpg" />
            
            </figure><p>There are two main exporters one can use to get some insight into the performance of a Linux system.</p><p>The first one is <a href="https://github.com/prometheus/node_exporter">node_exporter</a>, which gives you information about basics like CPU usage breakdown by type, memory usage, disk IO stats, filesystem and network usage.</p><p>The second one is <a href="https://github.com/google/cadvisor">cAdvisor</a>, which gives similar metrics, but drills down to the container level. Instead of just total CPU usage, you can see how much of the global resources each container uses (systemd units also count as containers for <code>cAdvisor</code>).</p><p>This is the absolute minimum of what you should know about your systems. If you don’t have these two, you should start there.</p><p>Let’s look at the graphs you can get out of this.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3mGAxXcMQepPR40GK4F89O/a0eed601cd7899c5cd6ca16691676d30/3.jpg" />
            
            </figure><p>I should mention that every screenshot in this post is from a real production machine doing something useful. We have different generations of hardware, so don’t try to draw any conclusions.</p><p>Here you can see the basics you get from <code>node_exporter</code> for CPU and memory. You can see the utilization and how much slack resources you have.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4MqpWuek8lQlFfLsT6pft9/f13a1444f393e1d3836539c3bbe49f05/4.jpg" />
            
            </figure><p>Some more metrics from <code>node_exporter</code>, this time for disk IO. There are similar panels for network as well.</p><p>At the basic level you can do some correlation to explain why CPU went up if you see higher network and disk activity.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eZyn1IguxbhJje2vUCnnr/5b6b77671ffdeb8baba4ee9b196e32ca/5.jpg" />
            
            </figure><p>With <code>cAdvisor</code> this gets more interesting, since you can now see how global resources like CPU are sliced between services. If CPU or memory usage grows, you can pinpoint the exact service that is responsible, and you can also see how it affects other services.</p><p>If global CPU numbers do not change much, you can still see shifts between services.</p><p>All of this information comes from the simplest counters and first derivatives (rates) on them.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6MQ3kdmLlcoNM7cmHewIAK/ce9f27bc965f08b46409d84a8b6e1eb8/6.jpg" />
            
            </figure><p>Counters are great, but they lack detail about individual events. Let’s take disk io for example. We get device io time from <code>node_exporter</code>, and the derivative of that cannot exceed one second per second of real time, which means we can draw a bold red line at 1s and see what kind of utilization we get from our disks.</p><p>We get one number that characterizes our workload, but that’s simply not enough to understand it. Are we doing many fast IO operations? Are we doing a few slow ones? What kind of mix of slow vs fast do we get? How are writes and reads different?</p>
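            <p>The derivative step above is simple enough to show directly. This is a sketch, not node_exporter's code, and the sample numbers are made up: taking a rate turns two samples of a counter like device io time into per-second utilization:</p>

```go
package main

import "fmt"

// rate computes the per-second rate between two samples of a
// monotonically increasing counter taken t seconds apart — the
// "first derivative" that turns a raw counter into utilization.
func rate(prev, cur, t float64) float64 {
	return (cur - prev) / t
}

func main() {
	// An "io time" counter advanced 0.45s over a 15s scrape interval:
	// ≈0.03s of io time per second, i.e. the device was ~3% utilized.
	fmt.Println(rate(1234.00, 1234.45, 15))
}
```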
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3YIVj2gffdTs2JG1SSFX3R/719fd6f86f9b25305f19769add917873/7.jpg" />
            
            </figure><p>These questions beg for a histogram. What we have is a counter above, but what we want is a histogram below.</p><p>Keep in mind that Prometheus histograms are cumulative and the <code>le</code> label counts all events lower than or equal to the value of the label.</p>
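            <p>To illustrate the cumulative <code>le</code> convention (a sketch with made-up bucket bounds, not ebpf_exporter's actual code), here's how raw event latencies turn into Prometheus-style bucket counts:</p>

```go
package main

import "fmt"

// cumulative turns raw event latencies (in seconds) into Prometheus-style
// cumulative bucket counts: each "le" bucket counts every event less than
// or equal to its upper bound, so counts never decrease across buckets.
func cumulative(events, bounds []float64) []int {
	counts := make([]int, len(bounds))
	for _, v := range events {
		for i, le := range bounds {
			if v <= le {
				counts[i]++
			}
		}
	}
	return counts
}

func main() {
	bounds := []float64{0.001, 0.01, 0.1, 1} // le="0.001", le="0.01", ...
	events := []float64{0.0005, 0.02, 0.02, 0.5}
	fmt.Println(cumulative(events, bounds)) // [1 1 3 4]
}
```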
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1t0yc5ybRD1PQG0tjchlIA/73b00b5fc13ab7892c32fcc4ad0e605d/8.jpg" />
            
            </figure><p>Okay, so imagine we have that histogram. To visualize the difference you get between the two types of metrics here are two screenshots of the same event, which is an SSD replacement in production. We replaced a disk and this is how it affected the line and the histogram. In the new Grafana 5 you can plot histograms as heatmaps and it gives you a lot more detail than just one line.</p><p>The next slide has a bigger screenshot, on this slide you should just see the shift in detail.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Zz5RAoWPkT1BrcHHuW01k/447effb0dbd21c298b3c590997778a84/9.jpg" />
            
            </figure><p>Here you can see buckets color coded and each timeslot has its own distribution in a tooltip. It can definitely get easier to understand with bar highlights and double Y axis, but it’s a big step forward from just one line nonetheless.</p><p>In addition to nice visualizations, you can also plot number of events above Xms and have alerts and SLOs on that. For example, you can alert if your read latency for block devices exceeds 10ms.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/45nmrTPwGnckajjF5H0aul/3fb2dee2217f4c4b10a79bc6723db492/10.jpg" />
            
            </figure><p>And if you were looking closely at these histograms, you may have noticed values on the Y axis are kind of high. Before the replacement you can see values in 0.5s to 1.0s bucket.</p><p>Tech specs for the left disk give you 50 microsecond read/write latency and on the right you get a slight decrease to 36 microseconds. That’s not what we see on the histogram at all. Sometimes you can spot this with fio in testing, but production workloads may have patterns that are difficult to replicate and have very different characteristics. Histograms show how it is.</p><p>Even a few slow requests can hurt overall numbers if you're not careful with IO. We've blogged <a href="/how-we-scaled-nginx-and-saved-the-world-54-years-every-day/">how this affected our cache latency and how we worked around this</a> recently.</p><p>By now you should be convinced that histograms are clearly superior for events where you care about timings of individual events.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/18YOVcmR0ijgFssnU0OkjL/5db42ab73a2fbced8042aa3fb5ab4caa/11.jpg" />
            
            </figure><p>And if you wanted to switch all your storage latency measurements to histograms, I have tough news for you: the Linux kernel only provides counters for <code>node_exporter</code> to export.</p><p>You can try to mess with <code>blktrace</code> for this, but it doesn’t seem practical or efficient.</p><p>Let’s switch gears a little and see how summary stats from counters can be deceiving.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3CUg2pMqu8tqkMyAKZ7gcz/faad533c3b378a1b83b23fae6dc48065/12.gif" />
            
            </figure><p>This is <a href="https://www.autodeskresearch.com/publications/samestats">research from Autodesk</a>. The creature on the left is called the Datasaurus. Then there are target figures in the middle and animations on the right.</p><p>The amazing part is that the shapes on the right, and all intermediate animation frames for them, have the same values for mean and standard deviation on both the X and Y axes.</p><p>From a summary statistic’s point of view every box on the right is the same, but if you plot individual events, a very different picture emerges.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6CH5ajyCIJd50LBZcm3rbi/b66fcdace86472d4e19dd7a182c6f82b/13.gif" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ZHKVCNSAcWE98AHBHyWiq/2f510c3ac23236bcb170576feafa88a2/14.gif" />
            
            </figure><p>These animations are from the same research; here you can clearly see how box plots (mean + stddev) are not representative of the raw events.</p><p>Histograms, on the contrary, give an accurate picture.</p><p>We established that histograms are what you want, but you need individual events to make those histograms. What are the requirements for a system that would handle this task, assuming that we want to measure things like io operations in the Linux kernel?</p><ul><li><p>It has to be low overhead, otherwise we can’t run it in production</p></li><li><p>It has to be universal, so we are not locked into just io tracing</p></li><li><p>It has to be supported out of the box; third party kernel modules and patches are not very practical</p></li><li><p>And finally it has to be safe; we don’t want to crash a large chunk of the Internet we're responsible for just to get some metrics, even if they are interesting</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7LXeOhEUMR6KeP1lBkP0vy/1ef1af93f93773f486b1fca187f05165/16.jpg" />
            
            </figure><p>And it turns out there’s a solution called eBPF. It’s low overhead, sandboxed, user-defined bytecode running in the kernel. It can never crash, hang, or negatively interfere with the kernel. That sounds kind of vague, but here are <a href="http://www.brendangregg.com/ebpf.html">two</a> <a href="https://cilium.readthedocs.io/en/v1.1/bpf/">links</a> that dive into the details explaining how this works.</p><p>The main part is that it’s already included with the Linux kernel. It’s used in the networking subsystem and things like seccomp rules, but as a general “run safe code in the kernel” mechanism it has many uses beyond that.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/IJuUL6fQtULz0BDkn8G9q/002a9dd408b3709441004a34682fa3ba/17.jpg" />
            
            </figure><p>We said it’s a bytecode and this is how it looks. The good part is that you never have to write this by hand.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6fsvOUVCsMd6akmeoJXxQM/645d06bfd40b00a1e36f63f3f47e035d/18.jpg" />
            
            </figure><p>To use eBPF you write small C programs that attach to kernel functions and run before or after them. Your C code is then compiled into bytecode, verified, and loaded into the kernel, where a JIT compiler translates it into native opcodes. The constraints on the code are enforced by the verifier; you are guaranteed to not be able to write an infinite loop or allocate tons of memory.</p><p>To share data with eBPF programs, you can use maps. In terms of metric collection, maps are updated by eBPF programs in the kernel and only accessed by userspace during scraping.</p><p>It is critical for performance that the eBPF VM runs in the kernel and does not cross the userspace boundary.</p><p>You can see the workflow on the image above.</p><p>Having to write C is not an eBPF constraint; you can produce the bytecode in any way you want. There are other alternatives like lua and ply, and sysdig is adding a backend to run via eBPF too. Maybe someday people will be writing <i>safe JavaScript</i> that runs in the kernel.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2sHoct70uBGUaPjVewdhbe/6e6caadf654056ff64fb32f4d4d69248/19.jpg" />
            
            </figure><p>Just like GCC compiles C code into machine code, BCC compiles C code into eBPF opcodes. BCC is a rewriter plus an LLVM-based compiler, and you can use it as a library in your code. There are bindings for C, C++, Python and Go.</p><p>In this example we have a simple function that runs after the <code>d_lookup</code> kernel function, which is responsible for directory cache lookups. It doesn’t look complicated and the basics should be understandable to people familiar with C-like languages.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1HWxAcgNhOgNZOHMHpfUnE/2bbf6966e5cd15a5d75a22a187cd535d/20.jpg" />
            
            </figure><p>In addition to a compiler, BCC includes some tools that can be used for debugging production systems with the low overhead eBPF provides.</p><p>Let’s quickly go over a few of them.</p><p>The one above is <code>biolatency</code>, which shows you a histogram of disk IO operations. That’s exactly what we started with, and it’s already available, just as a script instead of an exporter for Prometheus.</p>
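<p>To make the histogram idea concrete, here’s a rough Python sketch of how such a log2-bucketed histogram can be rendered in the style of BCC’s output. This is an illustration only, not the actual <code>biolatency</code> code, and the counts are made up.</p>

```python
# Render a log2-bucketed histogram (bucket index -> event count) in the
# style BCC tools like biolatency use. Bucket i covers values in
# [2**i, 2**(i+1) - 1] usecs (bucket 0 covers 0..1). Illustrative sketch,
# not the real biolatency implementation.

def render_log2_hist(counts, width=40):
    peak = max(counts.values())
    lines = []
    for i in range(max(counts) + 1):
        n = counts.get(i, 0)
        low = 0 if i == 0 else 2 ** i
        high = 2 ** (i + 1) - 1
        stars = "*" * (n * width // peak)
        lines.append(f"{low:>8} -> {high:<8} : {n:<6} |{stars:<{width}}|")
    return lines

# Made-up disk IO latency counts:
for line in render_log2_hist({0: 3, 1: 12, 2: 41, 3: 27, 4: 5}):
    print(line)
```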
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2zNgKf8E4AYTGH037ukLtv/5e519a923f5b6dbb3d011cd8f3ed4420/21.jpg" />
            
            </figure><p>Here’s <code>execsnoop</code>, which lets you see which commands are being launched on the system. This is often useful if you want to catch quickly terminating programs that do not hang around long enough to be observed in <code>ps</code> output.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2YYpWFjqFQ7VfdIz7QnD63/f4a8a5955b95ea121a2b644f57bcee1e/22.jpg" />
            
            </figure><p>There’s also <code>ext4slower</code>, which shows slow ext4 filesystem operations rather than slow IO operations. One might think these two map fairly closely, but one filesystem operation does not necessarily map to one disk IO operation:</p><ul><li><p>Writes can go into the writeback cache and not touch the disk until later</p></li><li><p>Reads can involve multiple IOs to the physical device</p></li><li><p>Reads can also be blocked behind async writes</p></li></ul><p>The more you know about disk IO, the more you want to run stateless, really. Sadly, RAM prices are not going down.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1aeArzz4a7mCh4YCzdFyLe/f7fbc1fcd278cd33367a3b69287df44d/23.jpg" />
            
            </figure><p>Okay, now to the main idea of this post. We have all these primitives, so we should be able to tie them together into an <a href="https://github.com/cloudflare/ebpf_exporter">ebpf_exporter</a> and get metrics into Prometheus where they belong. Many BCC tools already have the kernel side ready, reviewed and battle tested, so the hardest part is covered.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/xt1nT5FC3EUK1vwQrd81d/6dcc656a8629685a5e38e8947c85f6e5/24.jpg" />
            
            </figure><p>Let’s look at a simple example to get counters for timers fired in the kernel. This example is from the <code>ebpf_exporter</code> repo, where you can find a few more complex ones.</p><p>On the BCC code side we define a hash and attach to a tracepoint. When the tracepoint fires, we increment our hash, where the key is the address of the function that fired the tracepoint.</p><p>On the exporter side we say that we want to take the hash and transform 8 byte keys with the <code>ksym</code> decoder. The kernel keeps a map of function addresses to their names in <code>/proc/kallsyms</code> and we just use that. We also say that we want to attach our function to the <code>timer:timer_start</code> tracepoint.</p><p>This is objectively quite verbose and perhaps some things could be inferred instead of being explicit.</p>
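<p>The <code>ksym</code> decoding step can be illustrated in a few lines of Python. This is a simplified sketch of the idea (the real exporter is written in Go); it assumes an exact address match, which works for timer callbacks since the map key is the function’s own address, and the kallsyms excerpt below is made up.</p>

```python
# Decode function addresses to names using the /proc/kallsyms format:
# each line is "<hex address> <type> <name> [module]". Simplified sketch
# of the ksym decoder idea; not the actual ebpf_exporter code.

def parse_kallsyms(text):
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            table[int(fields[0], 16)] = fields[2]
    return table

def ksym(table, addr):
    return table.get(addr, f"unknown_0x{addr:x}")

# Made-up kallsyms excerpt for illustration:
sample = """\
ffffffff810fbd10 t tcp_keepalive_timer
ffffffff810fc0a0 t delayed_work_timer_fn"""

syms = parse_kallsyms(sample)
print(ksym(syms, 0xFFFFFFFF810FBD10))  # tcp_keepalive_timer
```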
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/26VdDuQstj0hI5fnTpGyJV/e76c62c263fa228444ff5f596d7c155d/25.jpg" />
            
            </figure><p>This is an example graph we can get out of this exporter config.</p><p>Why can this be useful? You may remember our <a href="/tracing-system-cpu-on-debian-stretch/">blog post about tracing a weird bug during our OS upgrade from Debian Jessie to Stretch</a>.</p><p>TL;DR is that a systemd bug broke TCP segmentation offload on a vlan interface, which increased CPU load 5x and introduced lots of interesting side effects, up to memory allocation stalls. You do not want to see allocating one page take 12 seconds, but that's exactly what we were seeing.</p><p>If we had timer metrics enabled, we would have seen a clear change in metrics sooner.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1etzKv3o5BpyvcStqvPthZ/b84a5cf27397dbb82bbb98f861b971e8/27.jpg" />
            
            </figure><p>Another bundled example can give you IPC, or instructions per cycle, metrics for each CPU. On this production system we can see a few interesting things:</p><ul><li><p>Not all cores are equal, with two cores being outliers (green on the top is core zero, yellow on the bottom is core one)</p></li><li><p>Something happened that dropped the IPC of what looks like half of the cores</p></li><li><p>There’s some variation in daily load cycles that affects IPC</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6KvhwqZATxvQzEmG4gL92j/1d6870183040fb7e23b280d1875f6fac/28.jpg" />
            
            </figure><p>Why can this be useful? Check out <a href="http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html">Brendan Gregg’s somewhat controversial blog post about CPU utilization</a>.</p><p>TL;DR is that CPU% does not include memory stalls that do not perform any useful compute work. IPC helps to understand that factor better.</p>
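<p>As a sketch of the arithmetic involved: IPC is simply the delta of the instructions counter divided by the delta of the cycles counter between two scrapes. The counter values below are made up for illustration.</p>

```python
# Compute instructions-per-cycle from two snapshots of per-CPU hardware
# counters, the way a metrics scrape would. Values are made up.

def ipc(prev, curr):
    dcycles = curr["cycles"] - prev["cycles"]
    dinstr = curr["instructions"] - prev["instructions"]
    return dinstr / dcycles if dcycles else 0.0

prev = {"cycles": 1_000_000, "instructions": 900_000}
curr = {"cycles": 3_000_000, "instructions": 2_500_000}
print(f"IPC: {ipc(prev, curr):.2f}")  # 1.6M instructions over 2M cycles = 0.80
```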
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Qr8BgK8JBbfpt0LaEjFHt/ecaaca96e411f8c32f0cd9bba3ec0dde/29.jpg" />
            
            </figure><p>Another bundled example is LLC, or L3 CPU cache, hit rate. This is from the same machine as the IPC metrics, and you can see some major event affecting not just IPC, but also the hit rate.</p><p>Why can this be useful? You can see how your CPU cache is doing, and how it may be affected by a bigger L3 cache or by more cores sharing the same cache.</p><p>LLC hit rate usually goes hand in hand with IPC patterns as well.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/QoFIgsgz0EfXBDqdQdK5E/7f5b29176f3271b693e9bf95f9f4d84a/31.jpg" />
            
            </figure><p>Now to the fun part. We started with IO latency histograms, and they are bundled with the <code>ebpf_exporter</code> examples.</p><p>This is also a histogram, but now for run queue latency. When a process is woken up in the kernel, it’s ready to run. If the CPU is idle, it can run immediately and the scheduling delay is zero. If the CPU is busy doing other work, the process is put on a run queue and the CPU picks it up when it’s available. That delay between being ready to run and actually running is the scheduling delay, and it affects latency for interactive applications.</p><p>In <code>cAdvisor</code> you have a counter with this delay; here you have a histogram of the actual values of that delay.</p><p>Understanding how you can be delayed is important for understanding the causes of externally visible latency. Check out <a href="http://www.brendangregg.com/blog/2017-03-16/perf-sched.html">another blog post by Brendan Gregg</a> to see how you can further trace and understand scheduling delays with the Linux perf tool. It's quite surprising how high the delay can be even on a lightly loaded machine.</p><p>It also helps to have a metric you can observe if you change any scheduling sysctls in the kernel. The law is that you can never trust the internet, or even your own judgement, when changing sysctls. If you can’t measure the effects, you are lying to yourself.</p><p>There’s also <a href="https://www.scylladb.com/2016/06/10/read-latency-and-scylla-jmx-process/">another great post from the Scylla people</a> about their findings. We were able to apply their sysctls and observe the results, which was quite satisfying.</p><p>Things like pinning processes to CPUs also affect scheduling, so this metric is invaluable for such experiments.</p>
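<p>To expose such a kernel-side log2 histogram to Prometheus, the per-bucket counts have to be turned into cumulative buckets with <code>le</code> upper bounds. A rough sketch of that conversion follows (simplified; the real exporter is written in Go, and the counts here are made up):</p>

```python
# Convert a log2-bucketed histogram (bucket index -> count), as produced
# by the kernel-side eBPF code, into cumulative Prometheus-style buckets.
# Bucket i covers values up to 2**(i+1) - 1. Simplified sketch only.

def to_prometheus_buckets(counts):
    buckets, total = [], 0
    for i in range(max(counts) + 1):
        total += counts.get(i, 0)
        buckets.append((float(2 ** (i + 1) - 1), total))
    buckets.append((float("inf"), total))  # the +Inf bucket
    return buckets

# Made-up run queue latency counts in log2 usec buckets:
for le, count in to_prometheus_buckets({0: 10, 1: 5, 2: 3}):
    print(f'le="{le}" {count}')
```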
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6HUpT97jGZskdDVvit0v9/618108b4c883b925232342200d99d88d/33.jpg" />
            
            </figure><p>The examples we gave are not the only ones possible. In addition to IPC and LLC metrics there are around 500 hardware metrics you can get on a typical server.</p><p>There are around 2000 tracepoints with stable ABI you can use on many kernel subsystems.</p><p>And you can always trace any non-inlined kernel function with kprobes and kretprobes, but nobody guarantees binary compatibility between releases for those. Some are stable, others not so much.</p><p>We don’t have support for user statically defined tracepoints or uprobes, which means you cannot trace userspace applications. This is something we can reconsider in the future, but in the meantime you can always add metrics to your apps by regular means.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5gux5cEXorn2uWkSHRVJNU/a85579d75fe935fb68b322d1d7984cf4/34.jpg" />
            
            </figure><p>This image shows the tools that BCC provides; you can see they cover many aspects.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5BfGSsDTA2kyslIA1E6Nef/9b6092913c922a6427e86848a07c2491/35.jpg" />
            
            </figure><p>Nothing in this life is free, and even low overhead still means some overhead. You should measure this in your own environment, but this is <a href="https://github.com/cloudflare/ebpf_exporter/tree/master/benchmark">what we’ve seen</a>.</p><p>For a fast <code>getpid()</code> syscall you get around 34% overhead if you count calls by PID. 34% sounds like a lot, but it’s the simplest syscall and 100ns of overhead is the cost of one memory reference.</p><p>For a complex case where we mix in the command name to copy some memory and add some randomness, the number jumps to 330ns, or 105% overhead. We can still do 1.55M ops/s instead of 3.16M ops/s per logical core. If you measure something that doesn’t happen as often on each core, you’re probably not going to notice it as much.</p><p>We've seen the run queue latency histogram add 300ms of system time on a machine with 40 logical cores and 250K scheduling events per second.</p><p>Your mileage may vary.</p>
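<p>As a sanity check, the 330ns and 105% figures follow directly from the quoted throughputs:</p>

```python
# Recompute the per-call tracing overhead from the quoted throughputs above.
base_ops, traced_ops = 3.16e6, 1.55e6  # getpid ops/s per core, without/with tracing
base_ns = 1e9 / base_ops               # ~316 ns per call untraced
traced_ns = 1e9 / traced_ops           # ~645 ns per call traced
print(f"added cost: {traced_ns - base_ns:.0f} ns")  # ~329 ns, quoted as 330 ns
print(f"overhead: {traced_ns / base_ns - 1:.0%}")   # ~104%, quoted as 105%
```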
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2hrJTvorfU26Ee1sLmPByS/0b468fc0f031055d0871fae1f13e7e68/36.jpg" />
            
            </figure><p>So where should you run the exporter then? The answer is anywhere you feel comfortable.</p><p>If you run simple programs, it doesn’t hurt to run them anywhere. If you are concerned about overhead, you can do it only on canary instances and hope for the results to be representative. We do exactly that, but instead of canary instances we have the luxury of a couple of canary datacenters with live customers. They get updates after our offices do, so it’s not as bad as it sounds.</p><p>For some upgrade events you may want to enable more extensive metrics. An example would be a major distro or kernel upgrade.</p><p>The last thing we wanted to mention is that <code>ebpf_exporter</code> is open source and we encourage you to try it, and maybe contribute interesting examples that may be useful to others.</p><p>GitHub: <a href="https://github.com/cloudflare/ebpf_exporter">https://github.com/cloudflare/ebpf_exporter</a></p><p>Reading materials on eBPF:</p><ul><li><p><a href="https://iovisor.github.io/bcc/">https://iovisor.github.io/bcc/</a></p></li><li><p><a href="https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md">https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md</a></p></li><li><p><a href="http://www.brendangregg.com/ebpf.html">http://www.brendangregg.com/ebpf.html</a></p></li><li><p><a href="http://docs.cilium.io/en/latest/bpf/">http://docs.cilium.io/en/latest/bpf/</a></p></li></ul><p><i>While this post was in drafts, we added another example for </i><a href="https://github.com/cloudflare/ebpf_exporter/pull/35"><i>tracing port exhaustion issues</i></a><i> and that took under 10 minutes to write.</i></p> ]]></content:encoded>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">4qKVSXkNajj8QJP6vnQdnc</guid>
            <dc:creator>Ivan Babrou</dc:creator>
        </item>
        <item>
            <title><![CDATA[Tracing System CPU on Debian Stretch]]></title>
            <link>https://blog.cloudflare.com/tracing-system-cpu-on-debian-stretch/</link>
            <pubDate>Sun, 13 May 2018 16:00:00 GMT</pubDate>
            <description><![CDATA[ How an innocent OS upgrade triggered a cascade of issues and forced us into tracing Linux networking internals. ]]></description>
            <content:encoded><![CDATA[ <p><i>This is a heavily truncated version of an internal blog post from August 2017. For more recent updates on Kafka, check out </i><a href="/squeezing-the-firehose/"><i>another blog post on compression</i></a><i>, where we optimized throughput 4.5x for both disks and network.</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6o0r4Jk1oqG6ncMv8xWNb3/08bd0256a7509447b87aa08a7e7305f5/photo-1511971523672-53e6411f62b9" />
            
            </figure><p>Photo by <a href="https://unsplash.com/@alex_povolyashko?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Alex Povolyashko</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></p>
    <div>
      <h3>Upgrading our systems to Debian Stretch</h3>
      <a href="#upgrading-our-systems-to-debian-stretch">
        
      </a>
    </div>
    <p>For quite some time we've been rolling out Debian Stretch, to the point where we have reached ~10% adoption in our core datacenters. As part of upgrading the underlying OS, we also evaluate the higher level software stack, e.g. taking a look at our ClickHouse and Kafka clusters.</p><p>During our upgrade of Kafka, we successfully migrated two smaller clusters, <code>logs</code> and <code>dns</code>, but ran into issues when attempting to upgrade one of our larger clusters, <code>http</code>.</p><p>Thankfully, we were able to roll back the <code>http</code> cluster upgrade relatively easily, due to heavy versioning of both the OS and the higher level software stack. If there's one takeaway from this blog post, it's to take advantage of consistent versioning.</p>
    <div>
      <h3>High level differences</h3>
      <a href="#high-level-differences">
        
      </a>
    </div>
    <p>We upgraded one Kafka <code>http</code> node, and it did not go as planned:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2hbPsT1ahBYgS806ztIpG3/070d402c9a5c1d38f257d65d87252f6c/1.png" />
            </figure><p>Having 5x CPU usage was definitely an unexpected outcome. For control datapoints, we compared to a node where no upgrade happened, and an intermediary node that received a software stack upgrade, but not an OS upgrade. Neither of these two nodes experienced the same CPU saturation issues, even though their setups were practically identical.</p><p>For debugging CPU saturation issues, we call on <code>perf</code> to fish out details:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2FsdAiwWMDh5t6VTTLuSV9/14c09b1b8fc3053a3dd4d49ff467f19a/2-3.png" />
            </figure><p><i>The command used was: </i><code><i>perf top -F 99</i></code><i>.</i></p>
    <div>
      <h3>RCU stalls</h3>
      <a href="#rcu-stalls">
        
      </a>
    </div>
    <p>In addition to higher system CPU usage, we found secondary slowdowns, including <a href="http://www.rdrop.com/~paulmck/RCU/whatisRCU.html">read-copy update (RCU)</a> stalls:</p>
            <pre><code>[ 4909.110009] logfwdr (26887) used greatest stack depth: 11544 bytes left
[ 4909.392659] oom_reaper: reaped process 26861 (logfwdr), now anon-rss:8kB, file-rss:0kB, shmem-rss:0kB
[ 4923.462841] INFO: rcu_sched self-detected stall on CPU
[ 4923.462843]  13-...: (2 GPs behind) idle=ea7/140000000000001/0 softirq=1/2 fqs=4198
[ 4923.462845]   (t=8403 jiffies g=110722 c=110721 q=6440)</code></pre>
            <p>We've seen RCU stalls before, and our (suboptimal) solution was to reboot the machine.</p><p>However, one can only handle so many reboots before the problem becomes severe enough to warrant a deep dive. During our deep dive, we noticed in <code>dmesg</code> that we had issues allocating memory while trying to write errors:</p>
            <pre><code>Aug 15 21:51:35 myhost kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Aug 15 21:51:35 myhost kernel:         26-...: (1881 ticks this GP) idle=76f/140000000000000/0 softirq=8/8 fqs=365
Aug 15 21:51:35 myhost kernel:         (detected by 0, t=2102 jiffies, g=1837293, c=1837292, q=262)
Aug 15 21:51:35 myhost kernel: Task dump for CPU 26:
Aug 15 21:51:35 myhost kernel: java            R  running task    13488  1714   1513 0x00080188
Aug 15 21:51:35 myhost kernel:  ffffc9000d1f7898 ffffffff814ee977 ffff88103f410400 000000000000000a
Aug 15 21:51:35 myhost kernel:  0000000000000041 ffffffff82203142 ffffc9000d1f78c0 ffffffff814eea10
Aug 15 21:51:35 myhost kernel:  0000000000000041 ffffffff82203142 ffff88103f410400 ffffc9000d1f7920
Aug 15 21:51:35 myhost kernel: Call Trace:
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff814ee977&gt;] ? scrup+0x147/0x160
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff814eea10&gt;] ? lf+0x80/0x90
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff814eecb5&gt;] ? vt_console_print+0x295/0x3c0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff810b1193&gt;] ? call_console_drivers.isra.22.constprop.30+0xf3/0x100
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff810b1f51&gt;] ? console_unlock+0x281/0x550
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff810b2498&gt;] ? vprintk_emit+0x278/0x430
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff810b27ef&gt;] ? vprintk_default+0x1f/0x30
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff811588df&gt;] ? printk+0x48/0x50
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff810b30ee&gt;] ? dump_stack_print_info+0x7e/0xc0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8142d41f&gt;] ? dump_stack+0x44/0x65
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff81162e64&gt;] ? warn_alloc+0x124/0x150
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff81163842&gt;] ? __alloc_pages_slowpath+0x932/0xb80
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff81163c92&gt;] ? __alloc_pages_nodemask+0x202/0x250
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff811ae9c2&gt;] ? alloc_pages_current+0x92/0x120
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff81159d2f&gt;] ? __page_cache_alloc+0xbf/0xd0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8115cdfa&gt;] ? filemap_fault+0x2ea/0x4d0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8136dc95&gt;] ? xfs_filemap_fault+0x45/0xa0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8118b3eb&gt;] ? __do_fault+0x6b/0xd0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff81190028&gt;] ? handle_mm_fault+0xe98/0x12b0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8110756b&gt;] ? __seccomp_filter+0x1db/0x290
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8104fa5c&gt;] ? __do_page_fault+0x22c/0x4c0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8104fd10&gt;] ? do_page_fault+0x20/0x70
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff819bea02&gt;] ? page_fault+0x22/0x30</code></pre>
            <p>This suggested that we were logging too many errors, and the actual failure may be earlier in the process. Armed with this hypothesis, we looked at the very beginning of the error chain:</p>
            <pre><code>Aug 16 01:14:51 myhost systemd-journald[13812]: Missed 17171 kernel messages
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff81171754&gt;] shrink_inactive_list+0x1f4/0x4f0
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff8117234b&gt;] shrink_node_memcg+0x5bb/0x780
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff811725e2&gt;] shrink_node+0xd2/0x2f0
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff811728ef&gt;] do_try_to_free_pages+0xef/0x310
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff81172be5&gt;] try_to_free_pages+0xd5/0x180
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff811632db&gt;] __alloc_pages_slowpath+0x31b/0xb80</code></pre>
            <p>As much as <code>shrink_node</code> may scream "NUMA issues", you're looking primarily at:</p>
            <pre><code>Aug 16 01:14:51 myhost systemd-journald[13812]: Missed 17171 kernel messages</code></pre>
            <p>In addition, we also found memory allocation issues:</p>
            <pre><code>[78972.506644] Mem-Info:
[78972.506653] active_anon:3936889 inactive_anon:371971 isolated_anon:0
[78972.506653]  active_file:25778474 inactive_file:1214478 isolated_file:2208
[78972.506653]  unevictable:0 dirty:1760643 writeback:0 unstable:0
[78972.506653]  slab_reclaimable:1059804 slab_unreclaimable:141694
[78972.506653]  mapped:47285 shmem:535917 pagetables:10298 bounce:0
[78972.506653]  free:202928 free_pcp:3085 free_cma:0
[78972.506660] Node 0 active_anon:8333016kB inactive_anon:989808kB active_file:50622384kB inactive_file:2401416kB unevictable:0kB isolated(anon):0kB isolated(file):3072kB mapped:96624kB dirty:3422168kB writeback:0kB shmem:1261156kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:15744 all_unreclaimable? no
[78972.506666] Node 1 active_anon:7414540kB inactive_anon:498076kB active_file:52491512kB inactive_file:2456496kB unevictable:0kB isolated(anon):0kB isolated(file):5760kB mapped:92516kB dirty:3620404kB writeback:0kB shmem:882512kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:9080974 all_unreclaimable? no
[78972.506671] Node 0 DMA free:15900kB min:100kB low:124kB high:148kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15900kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
** 9 printk messages dropped ** [78972.506716] Node 0 Normal: 15336*4kB (UMEH) 4584*8kB (MEH) 2119*16kB (UME) 775*32kB (MEH) 106*64kB (UM) 81*128kB (MH) 29*256kB (UM) 25*512kB (M) 19*1024kB (M) 7*2048kB (M) 2*4096kB (M) = 236080kB
[78972.506725] Node 1 Normal: 31740*4kB (UMEH) 3879*8kB (UMEH) 873*16kB (UME) 353*32kB (UM) 286*64kB (UMH) 62*128kB (UMH) 28*256kB (MH) 20*512kB (UMH) 15*1024kB (UM) 7*2048kB (UM) 12*4096kB (M) = 305752kB
[78972.506726] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[78972.506727] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[78972.506728] 27531091 total pagecache pages
[78972.506729] 0 pages in swap cache
[78972.506730] Swap cache stats: add 0, delete 0, find 0/0
[78972.506730] Free swap  = 0kB
[78972.506731] Total swap = 0kB
[78972.506731] 33524975 pages RAM
[78972.506732] 0 pages HighMem/MovableOnly
[78972.506732] 546255 pages reserved
[78972.620129] ntpd: page allocation stalls for 272380ms, order:0, mode:0x24000c0(GFP_KERNEL)
[78972.620132] CPU: 16 PID: 13099 Comm: ntpd Tainted: G           O    4.9.43-cloudflare-2017.8.4 #1
[78972.620133] Hardware name: Quanta Computer Inc D51B-2U (dual 1G LoM)/S2B-MB (dual 1G LoM), BIOS S2B_3A21 10/01/2015
[78972.620136]  ffffc90022f9b6f8 ffffffff8142d668 ffffffff81ca31b8 0000000000000001
[78972.620138]  ffffc90022f9b778 ffffffff81162f14 024000c022f9b740 ffffffff81ca31b8
[78972.620140]  ffffc90022f9b720 0000000000000010 ffffc90022f9b788 ffffc90022f9b738
[78972.620140] Call Trace:
[78972.620148]  [&lt;ffffffff8142d668&gt;] dump_stack+0x4d/0x65
[78972.620152]  [&lt;ffffffff81162f14&gt;] warn_alloc+0x124/0x150
[78972.620154]  [&lt;ffffffff811638f2&gt;] __alloc_pages_slowpath+0x932/0xb80
[78972.620157]  [&lt;ffffffff81163d42&gt;] __alloc_pages_nodemask+0x202/0x250
[78972.620160]  [&lt;ffffffff811aeae2&gt;] alloc_pages_current+0x92/0x120
[78972.620162]  [&lt;ffffffff8115f6ee&gt;] __get_free_pages+0xe/0x40
[78972.620165]  [&lt;ffffffff811e747a&gt;] __pollwait+0x9a/0xe0
[78972.620168]  [&lt;ffffffff817c9ec9&gt;] datagram_poll+0x29/0x100
[78972.620170]  [&lt;ffffffff817b9d48&gt;] sock_poll+0x48/0xa0
[78972.620172]  [&lt;ffffffff811e7c35&gt;] do_select+0x335/0x7b0</code></pre>
            <p>This specific error message did seem fun:</p>
            <pre><code>[78991.546088] systemd-network: page allocation stalls for 287000ms, order:0, mode:0x24200ca(GFP_HIGHUSER_MOVABLE)</code></pre>
            <p>You don't want your page allocations to stall for 5 minutes, especially when it's an order zero allocation (the smallest allocation: one 4 KiB page).</p><p>Comparing to our control nodes, there were only two possible explanations: the kernel upgrade, or the switch from Debian Jessie to Debian Stretch. We suspected the former, since high system CPU usage implies a kernel issue. However, just to be safe, we both rolled the kernel back to 4.4.55 and downgraded the affected nodes back to Debian Jessie. This was a reasonable compromise, since we needed to minimize downtime on production nodes.</p>
    <div>
      <h3>Digging a bit deeper</h3>
      <a href="#digging-a-bit-deeper">
        
      </a>
    </div>
    <p>Keeping servers running on an older kernel and distribution is not a viable long term solution. Through bisection, we found that the issue lay in the Jessie to Stretch upgrade, contrary to our initial hypothesis.</p><p>Now that we knew what the problem was, we proceeded to investigate why. With help from existing automation around <code>perf</code> and Java, we generated the following flamegraphs:</p><ul><li><p>Jessie</p></li></ul>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3fhMCSmQj4IC8MLxPN2d1V/60a107967bdede0ba8c4465090fb6ec4/9.png" />
            </figure><ul><li><p>Stretch</p></li></ul>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6b523AB48TNUF2jj6OYhxi/9cadde05d8cf89187f182b56c48b3c1b/10.png" />
            </figure><p>At first it looked like Jessie was doing <code>writev</code> instead of <code>sendfile</code>, but the full flamegraphs revealed that Stretch was executing <code>sendfile</code> a lot more slowly.</p><p>If you highlight <code>sendfile</code>:</p><ul><li><p>Jessie</p></li></ul>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6XXts4Hvy58nwa8FG3ZNfT/2e36ce3aa2b111059bcff6a21e3da712/11.png" />
            </figure><ul><li><p>Stretch</p></li></ul>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/32BB3vFsnbl7ul6b6Aa5MP/10788cfa90962c2034e7be7fc6b76a1f/12.png" />
            </figure><p>And zoomed in:</p><ul><li><p>Jessie</p></li></ul>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/75TU9Q58iCRcCKAt6eZxf3/4cdf2f2038bb4e7ef813ba7e21562121/13.png" />
            </figure><ul><li><p>Stretch</p></li></ul>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/fgUsgL2hUrhJHg3ns5HeE/3b733ecc2ed4ebee2a751394a81804fb/14.png" />
            </figure><p>These two look very different.</p><p>Some colleagues suggested that the differences in the graphs may be due to TCP offload being disabled, but upon checking our NIC settings, we found that the feature flags were identical.</p><p>We'll dive into the differences in the next section.</p>
    <div>
      <h3>And deeper</h3>
      <a href="#and-deeper">
        
      </a>
    </div>
    <p>To trace latency distributions of <code>sendfile</code> syscalls between Jessie and Stretch, we used <a href="https://github.com/iovisor/bcc/blob/master/tools/funclatency_example.txt"><code>funclatency</code></a> from <a href="https://iovisor.github.io/bcc/">bcc-tools</a>:</p><ul><li><p>Jessie</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funclatency -uTi 1 do_sendfile
Tracing 1 functions for "do_sendfile"... Hit Ctrl-C to end.
23:27:25
     usecs               : count     distribution
         0 -&gt; 1          : 9        |                                        |
         2 -&gt; 3          : 47       |****                                    |
         4 -&gt; 7          : 53       |*****                                   |
         8 -&gt; 15         : 379      |****************************************|
        16 -&gt; 31         : 329      |**********************************      |
        32 -&gt; 63         : 101      |**********                              |
        64 -&gt; 127        : 23       |**                                      |
       128 -&gt; 255        : 50       |*****                                   |
       256 -&gt; 511        : 7        |                                        |</code></pre>
            <ul><li><p>Stretch</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funclatency -uTi 1 do_sendfile
Tracing 1 functions for "do_sendfile"... Hit Ctrl-C to end.
23:27:28
     usecs               : count     distribution
         0 -&gt; 1          : 1        |                                        |
         2 -&gt; 3          : 20       |***                                     |
         4 -&gt; 7          : 46       |*******                                 |
         8 -&gt; 15         : 56       |********                                |
        16 -&gt; 31         : 65       |**********                              |
        32 -&gt; 63         : 75       |***********                             |
        64 -&gt; 127        : 75       |***********                             |
       128 -&gt; 255        : 258      |****************************************|
       256 -&gt; 511        : 144      |**********************                  |
       512 -&gt; 1023       : 24       |***                                     |
      1024 -&gt; 2047       : 27       |****                                    |
      2048 -&gt; 4095       : 28       |****                                    |
      4096 -&gt; 8191       : 35       |*****                                   |
      8192 -&gt; 16383      : 1        |                                        |</code></pre>
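            <p>As an aside, the log2 histograms that <code>funclatency</code> prints are easy to reproduce. Here's a minimal Python sketch of the same power-of-two bucketing (our own toy version, not bcc's actual implementation):</p>

```python
from collections import Counter

def log2_histogram(samples_us):
    """Bucket latency samples (in microseconds) into the power-of-two
    ranges that funclatency prints: 0-1, 2-3, 4-7, 8-15, ..."""
    buckets = Counter()
    for us in samples_us:
        if us <= 1:
            buckets[(0, 1)] += 1
        else:
            k = us.bit_length() - 1              # floor(log2(us))
            buckets[(2 ** k, 2 ** (k + 1) - 1)] += 1
    return buckets

def render(buckets, width=40):
    """Print star bars scaled to the largest bucket, funclatency-style."""
    peak = max(buckets.values())
    for (lo, hi) in sorted(buckets):
        count = buckets[(lo, hi)]
        bar = "*" * (count * width // peak)
        print("%10d -> %-10d : %-8d |%s|" % (lo, hi, count, bar.ljust(width)))

render(log2_histogram([0, 1, 2, 3, 5, 9, 12, 20, 300]))
```

            <p>Logarithmic buckets are what make this kind of tool cheap enough to run in production: a fixed, small number of counters regardless of how spread out the latencies are.</p>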
            <p>In the flamegraphs, you can see timers being set at the tip (<code>mod_timer</code> function), with these timers taking locks. On Stretch we installed 3x more timers, resulting in 10x the amount of contention:</p><ul><li><p>Jessie</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 mod_timer
Tracing 1 functions for "mod_timer"... Hit Ctrl-C to end.
00:33:36
FUNC                                    COUNT
mod_timer                               60482
00:33:37
FUNC                                    COUNT
mod_timer                               58263
00:33:38
FUNC                                    COUNT
mod_timer                               54626</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 lock_timer_base
Tracing 1 functions for "lock_timer_base"... Hit Ctrl-C to end.
00:32:36
FUNC                                    COUNT
lock_timer_base                         15962
00:32:37
FUNC                                    COUNT
lock_timer_base                         16261
00:32:38
FUNC                                    COUNT
lock_timer_base                         15806</code></pre>
            <ul><li><p>Stretch</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 mod_timer
Tracing 1 functions for "mod_timer"... Hit Ctrl-C to end.
00:33:28
FUNC                                    COUNT
mod_timer                              149068
00:33:29
FUNC                                    COUNT
mod_timer                              155994
00:33:30
FUNC                                    COUNT
mod_timer                              160688</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 lock_timer_base
Tracing 1 functions for "lock_timer_base"... Hit Ctrl-C to end.
00:32:32
FUNC                                    COUNT
lock_timer_base                        119189
00:32:33
FUNC                                    COUNT
lock_timer_base                        196895
00:32:34
FUNC                                    COUNT
lock_timer_base                        140085</code></pre>
            <p>The Linux kernel includes debugging facilities for timers, which <a href="https://elixir.bootlin.com/linux/v4.9.43/source/kernel/time/timer.c#L1010">call</a> the <code>timer:timer_start</code> <a href="https://elixir.bootlin.com/linux/v4.9.43/source/include/trace/events/timer.h#L44">tracepoint</a> on every timer start. This allowed us to pull up timer names:</p><ul><li><p>Jessie</p></li></ul>
            <pre><code>$ sudo perf record -e timer:timer_start -p 23485 -- sleep 10 &amp;&amp; sudo perf script | sed 's/.* function=//g' | awk '{ print $1 }' | sort | uniq -c
[ perf record: Woken up 54 times to write data ]
[ perf record: Captured and wrote 17.778 MB perf.data (173520 samples) ]
      6 blk_rq_timed_out_timer
      2 clocksource_watchdog
      5 commit_timeout
      5 cursor_timer_handler
      2 dev_watchdog
     10 garp_join_timer
      2 ixgbe_service_timer
     36 reqsk_timer_handler
   4769 tcp_delack_timer
    171 tcp_keepalive_timer
 168512 tcp_write_timer</code></pre>
            <ul><li><p>Stretch</p></li></ul>
            <pre><code>$ sudo perf record -e timer:timer_start -p 3416 -- sleep 10 &amp;&amp; sudo perf script | sed 's/.* function=//g' | awk '{ print $1 }' | sort | uniq -c
[ perf record: Woken up 671 times to write data ]
[ perf record: Captured and wrote 198.273 MB perf.data (1988650 samples) ]
      6 clocksource_watchdog
      4 commit_timeout
     12 cursor_timer_handler
      2 dev_watchdog
     18 garp_join_timer
      4 ixgbe_service_timer
      1 neigh_timer_handler
      1 reqsk_timer_handler
   4622 tcp_delack_timer
      1 tcp_keepalive_timer
1983978 tcp_write_timer
      1 writeout_period</code></pre>
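            <p>The <code>sed</code>/<code>awk</code>/<code>sort</code>/<code>uniq</code> pipeline above just tallies the <code>function=</code> field of each <code>timer:timer_start</code> event. The same aggregation in a few lines of Python (the sample lines here are invented, but follow the <code>function=&lt;name&gt;</code> shape that <code>perf script</code> prints for this tracepoint):</p>

```python
import re
from collections import Counter

def count_timer_functions(perf_script_output):
    """Tally timer callback names from `perf script` output of
    timer:timer_start events, like the sed/awk/sort/uniq pipeline."""
    counts = Counter()
    for line in perf_script_output.splitlines():
        m = re.search(r"function=(\S+)", line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Made-up sample lines in the perf script shape, for illustration.
sample = """\
nginx  3416 [012] 123.456: timer:timer_start: timer=0xffff function=tcp_write_timer expires=...
nginx  3416 [012] 123.457: timer:timer_start: timer=0xffff function=tcp_write_timer expires=...
nginx  3416 [012] 123.458: timer:timer_start: timer=0xffff function=tcp_delack_timer expires=...
"""
print(count_timer_functions(sample).most_common())
```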
            <p>So Stretch installs roughly 12x more <code>tcp_write_timer</code> timers, resulting in higher kernel CPU usage.</p><p>Taking flamegraphs of the timers specifically revealed the differences in their operation:</p><ul><li><p>Jessie</p></li></ul>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4PJjYK3FzgAeQxpbHPGn5i/06f546c8ea1cda3d58c4c54dd3618a15/15.png" />
            </figure><ul><li><p>Stretch</p></li></ul>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7M8XyRvy7vHDytdpWJQXAr/784aa92acf4f92c8896d08e2fede9bcd/16.png" />
            </figure><p>We then traced the functions that were different:</p><ul><li><p>Jessie</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
03:33:33
FUNC                                    COUNT
tcp_sendmsg                             21166
03:33:34
FUNC                                    COUNT
tcp_sendmsg                             21768
03:33:35
FUNC                                    COUNT
tcp_sendmsg                             21712</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
03:37:14
FUNC                                    COUNT
tcp_push_one                              496
03:37:15
FUNC                                    COUNT
tcp_push_one                              432
03:37:16
FUNC                                    COUNT
tcp_push_one                              495</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/trace -p 23485 'tcp_sendmsg "%d", arg3' -T -M 100000 | awk '{ print $NF }' | sort | uniq -c | sort -n | tail
   1583 4
   2043 54
   3546 18
   4016 59
   4423 50
   5349 8
   6154 40
   6620 38
  17121 51
  39528 44</code></pre>
            <ul><li><p>Stretch</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
03:33:30
FUNC                                    COUNT
tcp_sendmsg                             53834
03:33:31
FUNC                                    COUNT
tcp_sendmsg                             49472
03:33:32
FUNC                                    COUNT
tcp_sendmsg                             51221</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
03:37:10
FUNC                                    COUNT
tcp_push_one                            64483
03:37:11
FUNC                                    COUNT
tcp_push_one                            65058
03:37:12
FUNC                                    COUNT
tcp_push_one                            72394</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/trace -p 3416 'tcp_sendmsg "%d", arg3' -T -M 100000 | awk '{ print $NF }' | sort | uniq -c | sort -n | tail
    396 46
    409 4
   1124 50
   1305 18
   1547 40
   1672 59
   1729 8
   2181 38
  19052 44
  64504 4096</code></pre>
            <p>The traces showed a huge difference in how <code>tcp_sendmsg</code> and <code>tcp_push_one</code> behave inside <code>sendfile</code>.</p><p>To introspect further, we leveraged a kernel feature available since 4.9: the ability to count stack traces. This let us measure what hits <code>tcp_push_one</code>:</p><ul><li><p>Jessie</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/stackcount -i 10 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
  tcp_push_one
  inet_sendmsg
  sock_sendmsg
  sock_write_iter
  do_iter_readv_writev
  do_readv_writev
  vfs_writev
  do_writev
  SyS_writev
  do_syscall_64
  return_from_SYSCALL_64
    1
  tcp_push_one
  inet_sendpage
  kernel_sendpage
  sock_sendpage
  pipe_to_sendpage
  __splice_from_pipe
  splice_from_pipe
  generic_splice_sendpage
  direct_splice_actor
  splice_direct_to_actor
  do_splice_direct
  do_sendfile
  sys_sendfile64
  do_syscall_64
  return_from_SYSCALL_64
    4950</code></pre>
            <ul><li><p>Stretch</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/stackcount -i 10 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
  tcp_push_one
  inet_sendmsg
  sock_sendmsg
  sock_write_iter
  do_iter_readv_writev
  do_readv_writev
  vfs_writev
  do_writev
  SyS_writev
  do_syscall_64
  return_from_SYSCALL_64
    123
  tcp_push_one
  inet_sendmsg
  sock_sendmsg
  sock_write_iter
  __vfs_write
  vfs_write
  SyS_write
  do_syscall_64
  return_from_SYSCALL_64
    172
  tcp_push_one
  inet_sendmsg
  sock_sendmsg
  kernel_sendmsg
  sock_no_sendpage
  tcp_sendpage
  inet_sendpage
  kernel_sendpage
  sock_sendpage
  pipe_to_sendpage
  __splice_from_pipe
  splice_from_pipe
  generic_splice_sendpage
  direct_splice_actor
  splice_direct_to_actor
  do_splice_direct
  do_sendfile
  sys_sendfile64
  do_syscall_64
  return_from_SYSCALL_64
    735110</code></pre>
            <p>If you diff the most popular stacks, you'll get:</p>
            <pre><code>--- jessie.txt  2017-08-16 21:14:13.000000000 -0700
+++ stretch.txt 2017-08-16 21:14:20.000000000 -0700
@@ -1,4 +1,9 @@
 tcp_push_one
+inet_sendmsg
+sock_sendmsg
+kernel_sendmsg
+sock_no_sendpage
+tcp_sendpage
 inet_sendpage
 kernel_sendpage
 sock_sendpage</code></pre>
            <p>Let's look closer at <a href="https://elixir.bootlin.com/linux/v4.9.43/source/net/ipv4/tcp.c#L1012"><code>tcp_sendpage</code></a>:</p>
            <pre><code>int tcp_sendpage(struct sock *sk, struct page *page, int offset,
         size_t size, int flags)
{
    ssize_t res;

    if (!(sk-&gt;sk_route_caps &amp; NETIF_F_SG) ||
        !sk_check_csum_caps(sk))
        return sock_no_sendpage(sk-&gt;sk_socket, page, offset, size,
                    flags);

    lock_sock(sk);

    tcp_rate_check_app_limited(sk);  /* is sending application-limited? */

    res = do_tcp_sendpages(sk, page, offset, size, flags);
    release_sock(sk);
    return res;
}</code></pre>
            <p>It looks like on Stretch we take the early return inside the <code>if</code> and fall back to <code>sock_no_sendpage</code>, which matches the extra frames in the stacks above. We looked up what <a href="https://elixir.bootlin.com/linux/v4.9.43/source/include/linux/netdev_features.h#L115">NETIF_F_SG</a> does: <a href="https://en.wikipedia.org/wiki/Large_send_offload">segmentation offload</a>. This difference is peculiar, since both OS'es should have this enabled.</p>
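            <p>The dispatch boils down to a capability check on the socket. A toy Python model of the branch (the flag value mirrors bit 0 from <code>netdev_features.h</code>; treat the function itself as illustrative, not the kernel's code):</p>

```python
# Illustrative model of the branch at the top of tcp_sendpage().
NETIF_F_SG = 0b1  # scatter-gather capability bit

def tcp_sendpage_path(route_caps, csum_ok=True):
    """Without scatter-gather (or usable checksum offload) on the route,
    the kernel falls back to sock_no_sendpage(), which copies the data
    instead of passing pages along."""
    if not (route_caps & NETIF_F_SG) or not csum_ok:
        return "sock_no_sendpage"    # the copying path we saw on Stretch
    return "do_tcp_sendpages"        # the page-passing path we saw on Jessie

print(tcp_sendpage_path(0))
print(tcp_sendpage_path(NETIF_F_SG))
```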
    <div>
      <h3>Even deeper, to the crux</h3>
      <a href="#even-deeper-to-the-crux">
        
      </a>
    </div>
    <p>It turned out that we had segmentation offload enabled for only a few of our NICs: <code>eth2</code>, <code>eth3</code>, and <code>bond0</code>. Our network setup can be described as follows:</p>
            <pre><code>eth2 --&gt;|              |--&gt; vlan10
        |---&gt; bond0 --&gt;|
eth3 --&gt;|              |--&gt; vlan100</code></pre>
            <p><b>The missing piece: segmentation offload was disabled on the VLAN interfaces, where the actual IPs live.</b></p><p>Here's the diff of <code>ethtool -k vlan10</code> between the two systems:</p>
            <pre><code>$ diff -rup &lt;(ssh jessie sudo ethtool -k vlan10) &lt;(ssh stretch sudo ethtool -k vlan10)
--- /dev/fd/63  2017-08-16 21:21:12.000000000 -0700
+++ /dev/fd/62  2017-08-16 21:21:12.000000000 -0700
@@ -1,21 +1,21 @@
 Features for vlan10:
 rx-checksumming: off [fixed]
-tx-checksumming: off
+tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
-       tx-checksum-ip-generic: off
+       tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off
        tx-checksum-sctp: off
-scatter-gather: off
-       tx-scatter-gather: off
+scatter-gather: on
+       tx-scatter-gather: on
        tx-scatter-gather-fraglist: off
-tcp-segmentation-offload: off
-       tx-tcp-segmentation: off [requested on]
-       tx-tcp-ecn-segmentation: off [requested on]
-       tx-tcp-mangleid-segmentation: off [requested on]
-       tx-tcp6-segmentation: off [requested on]
-udp-fragmentation-offload: off [requested on]
-generic-segmentation-offload: off [requested on]
+tcp-segmentation-offload: on
+       tx-tcp-segmentation: on
+       tx-tcp-ecn-segmentation: on
+       tx-tcp-mangleid-segmentation: on
+       tx-tcp6-segmentation: on
+udp-fragmentation-offload: on
+generic-segmentation-offload: on
 generic-receive-offload: on
 large-receive-offload: off [fixed]
 rx-vlan-offload: off [fixed]</code></pre>
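            <p>For fleet-wide spot checks like this, it helps to parse <code>ethtool -k</code> output programmatically. A small sketch (it assumes the <code>name: on/off [fixed]</code> shape shown above; the sample interface text is invented):</p>

```python
def parse_ethtool_features(text):
    """Parse `ethtool -k iface` output into {feature: "on"/"off"}."""
    features = {}
    for line in text.splitlines():
        if ":" not in line or line.startswith("Features for"):
            continue
        name, _, state = line.partition(":")
        features[name.strip()] = state.split()[0]  # drop "[fixed]" etc.
    return features

def disabled_offloads(text, wanted=("scatter-gather", "tx-checksumming",
                                    "tcp-segmentation-offload")):
    """Return the offload features we care about that are switched off."""
    feats = parse_ethtool_features(text)
    return [f for f in wanted if feats.get(f) == "off"]

# Made-up sample in the ethtool -k output shape.
vlan10_sample = """\
Features for vlan10:
rx-checksumming: off [fixed]
tx-checksumming: off
scatter-gather: off
tcp-segmentation-offload: off
generic-receive-offload: on
"""
print(disabled_offloads(vlan10_sample))
```

            <p>A check like this, wired into monitoring, would have flagged the disabled offloads on the VLAN interfaces long before anyone had to read flamegraphs.</p>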
            <p>So we enthusiastically enabled segmentation offload:</p>
            <pre><code>$ sudo ethtool -K vlan10 sg on</code></pre>
            <p>And it didn't help! Will the suffering ever end? Let's also enable TCP transmission checksumming offload:</p>
            <pre><code>$ sudo ethtool -K vlan10 tx on
Actual changes:
tx-checksumming: on
        tx-checksum-ip-generic: on
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on
        tx-tcp-mangleid-segmentation: on
        tx-tcp6-segmentation: on
udp-fragmentation-offload: on</code></pre>
            <p>Nothing. The diff is essentially empty now:</p>
            <pre><code>$ diff -rup &lt;(ssh jessie sudo ethtool -k vlan10) &lt;(ssh stretch sudo ethtool -k vlan10)
--- /dev/fd/63  2017-08-16 21:31:27.000000000 -0700
+++ /dev/fd/62  2017-08-16 21:31:27.000000000 -0700
@@ -4,11 +4,11 @@ tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
-       tx-checksum-fcoe-crc: off [requested on]
-       tx-checksum-sctp: off [requested on]
+       tx-checksum-fcoe-crc: off
+       tx-checksum-sctp: off
 scatter-gather: on
        tx-scatter-gather: on
-       tx-scatter-gather-fraglist: off [requested on]
+       tx-scatter-gather-fraglist: off
 tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on</code></pre>
            <p>The last missing piece we found was that offload settings are picked up only when a connection is established, so we restarted Kafka and immediately saw a performance improvement (green line):</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/73uyXt5y4F1L6AUULX8S9g/e80494d09daf7d0b87884c62fd5341e6/17.png" />
            </figure><p>Not enabling offload features when possible seems like a pretty bad regression, so we filed a ticket for <code>systemd</code>:</p><ul><li><p><a href="https://github.com/systemd/systemd/issues/6629">https://github.com/systemd/systemd/issues/6629</a></p></li></ul><p>In the meantime, we worked around the upstream issue by automatically enabling offload features on boot if they are disabled on VLAN interfaces.</p><p>With the fix in place, we rebooted our <code>logs</code> Kafka cluster to upgrade to the latest kernel, and our 5-day CPU usage history showed clear improvement:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/VBuiRySNEfFiN8LQ9nUg5/a5a1881b229cb1e173663af52f3eb136/18.png" />
            </figure><p>The DNS cluster also yielded positive results, with just 2 nodes rebooted (purple line going down):</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4CWuJQCmMt7QarvAdU0b3g/c35ad9f7a9ab6113614f736f0e682d64/19.png" />
            </figure>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>It was an error on our part to ship a performance regression without a good framework in place to catch it. Luckily, because our configuration is heavily version controlled, we managed to bisect the issue rather quickly and kept a temporary rollback in place while root-causing the problem.</p><p>In the end, enabling offload also removed the RCU stalls. It's not entirely clear whether the missing offload was the cause or just a catalyst, but the end result speaks for itself.</p><p>On the bright side, we dug pretty deep into Linux kernel internals, and although there were fleeting moments of wanting to give up and move to the woods to become a park ranger, we persevered and came out of the forest successful.</p><hr /><p><i>If deep diving from high-level symptoms to kernel/OS issues makes you excited, </i><a href="https://www.cloudflare.com/careers/"><i>drop us a line</i></a><i>.</i></p><hr /> ]]></content:encoded>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <guid isPermaLink="false">29dWe9XJa54DvzHbTBAzEk</guid>
            <dc:creator>Ivan Babrou</dc:creator>
        </item>
        <item>
            <title><![CDATA[Squeezing the firehose: getting the most from Kafka compression]]></title>
            <link>https://blog.cloudflare.com/squeezing-the-firehose/</link>
            <pubDate>Mon, 05 Mar 2018 16:17:03 GMT</pubDate>
            <description><![CDATA[ How Cloudflare was able to save hundreds of gigabits of network bandwidth and terabytes of storage from Kafka. ]]></description>
            <content:encoded><![CDATA[ <p>We at Cloudflare are long-time <a href="https://kafka.apache.org/">Kafka</a> users; the first mentions of it date back to the beginning of 2014, when the most recent version was 0.8.0. We use Kafka as a log to power analytics (both HTTP and DNS), <a href="https://www.cloudflare.com/learning/ddos/ddos-mitigation/">DDoS mitigation</a>, logging and metrics.</p><p>While the idea of the log as a unifying abstraction has remained the same since then (<a href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying">read this classic blog post</a> from Jay Kreps if you haven't), Kafka has evolved in other areas. One of these improved areas was compression support. Back in the old days we tried enabling it a few times and ultimately gave up on the idea because of <a href="https://github.com/Shopify/sarama/issues/805">unresolved</a> <a href="https://issues.apache.org/jira/browse/KAFKA-1718">issues</a> in the protocol.</p>
    <div>
      <h3>Kafka compression overview</h3>
      <a href="#kafka-compression-overview">
        
      </a>
    </div>
    <p>Just last year Kafka 0.11.0 came out with the new improved protocol and log format.</p><p>The naive approach to compression would be to compress messages in the log individually:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4hIZ5PFDUxsm8R48jv6TfL/2d066d744d9e89775c35424db5b9f6d5/Screen-Shot-2018-03-05-at-12.10.00-PM.png" />
            
            </figure><p>Edit: originally we said this is how Kafka worked before 0.11.0, but that appears to be false.</p><p>Compression algorithms work best when they have more data to work with, so in the new log format messages (now called records) are packed back to back and compressed in batches. In the old log format, messages were recursive (a compressed set of messages was itself a message); the new format makes things more straightforward: a compressed batch of records is just a batch.</p>
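            <p>The win from batching is easy to demonstrate with any general-purpose compressor. A quick sketch using Python's zlib over synthetic, structurally similar records (zlib stands in for Kafka's gzip codec here, and the records are invented; exact numbers will vary):</p>

```python
import zlib

# Synthetic records that share common structure, like entries in one topic.
records = [
    ('{"host":"example.com","status":200,"path":"/asset/%d.js"}' % i).encode()
    for i in range(1000)
]

# Naive scheme: compress every record on its own.
per_record = sum(len(zlib.compress(r)) for r in records)

# New log format: pack records back to back, compress the whole batch.
batched = len(zlib.compress(b"".join(records)))

print("per-record: %d bytes, batched: %d bytes" % (per_record, batched))
```

            <p>Compressing each tiny record alone barely pays for the per-record overhead, while compressing the packed batch lets the compressor exploit redundancy across records.</p>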
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6WMDU1akGFipMtPWFPXXJB/e0bc3c5bffc3bb215251c4bc33598fda/Screen-Shot-2018-03-05-at-12.10.13-PM.png" />
            
            </figure><p>Now compression has a lot more space to do its job. There's a high chance that records in the same Kafka topic share common parts, which means they can be compressed better. At the scale of thousands of messages, the difference becomes enormous. The downside is that if you want to read record3 in the example above, you have to fetch records 1 and 2 as well, whether the batch is compressed or not. In practice this doesn't matter too much, because consumers usually read all records sequentially, batch after batch.</p><p>The beauty of compression in Kafka is that it lets you trade CPU for disk and network usage. The protocol itself is designed to minimize overhead as well, by requiring decompression in only a few places:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/572hmHvuR97iuBVjOqD5SS/a5a10aceffd450e8b6563e94966cc53c/Screen-Shot-2018-03-05-at-12.10.19-PM.png" />
            
            </figure><p>On the receiving side of the log only consumers need to decompress messages:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Jyhzp3FFqxOEUML1Zw3Cv/85d1d134fd87dc74820da7afe99c090e/Screen-Shot-2018-03-05-at-12.10.25-PM.png" />
            
            </figure><p>In reality, if you don't use encryption, data can be copied between NIC and disks with <a href="https://www.ibm.com/developerworks/linux/library/j-zerocopy/">zero copies to user space</a>, lowering the cost to some degree.</p>
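            <p>That zero-copy path is what <code>sendfile(2)</code> gives user space. A minimal, Linux-only Python sketch of moving bytes between descriptors without a user-space buffer (illustrative only; Kafka itself does this through Java NIO's <code>transferTo</code>):</p>

```python
import os
import tempfile

def copy_zero_copy(src_path, dst_path):
    """Copy a file with sendfile(2): the kernel moves the data directly,
    with no read()/write() round-trip through a user-space buffer.
    Linux-only; out_fd may be a regular file since kernel 2.6.33."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        remaining = os.fstat(src.fileno()).st_size
        offset = 0
        while remaining > 0:
            sent = os.sendfile(dst.fileno(), src.fileno(), offset, remaining)
            if sent == 0:
                break
            offset += sent
            remaining -= sent

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "segment.log")
    dst = os.path.join(d, "out.log")
    with open(src, "wb") as f:
        f.write(b"record" * 1000)
    copy_zero_copy(src, dst)
    print(os.path.getsize(dst))
```

            <p>Once payloads have to pass through user space for encryption (or recompression, as we'll see later), this shortcut disappears and every byte is copied at least twice more.</p>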
    <div>
      <h3>Kafka bottlenecks at Cloudflare</h3>
      <a href="#kafka-bottlenecks-at-cloudflare">
        
      </a>
    </div>
    <p>Having less network and disk usage was a big selling point for us. Back in 2014 we started with spinning disks under Kafka and never had issues with disk space. However, at some point we started having issues with random I/O. Most of the time, consumers and replicas (which are just another type of consumer) read from the very tip of the log, and that data resides in the page cache, meaning you don't need to read from disks at all:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/LFhjE9d3QMnMvDRS4mtlJ/3e47d9695821f3374c4ad326cb32cce3/Screen-Shot-2018-03-01-at-13.59.06.png" />
            
            </figure><p>In this case the only time you touch the disk is during writes, and sequential writes are cheap. However, things start to fall apart when you have multiple lagging consumers:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2b3aYKJupVoKGdx3tLmTjq/1b740bb0dda4e74025d8299b9d808abc/Screen-Shot-2018-03-01-at-13.59.29.png" />
            
            </figure><p>Each consumer wants to read a different part of the log from the physical disk, which means seeking back and forth. One lagging consumer was okay to have, but multiple of them would start fighting for disk I/O and just increase the lag for all of them. To work around this problem we upgraded to SSDs.</p><p>Consumers were no longer fighting for disk time, but it felt terribly wasteful: most of the time consumers were not lagging and there was zero read I/O. We were not bored for too long, as other problems emerged:</p><ul><li><p>Disk space became a problem. SSDs are much more expensive, and usable disk space shrank by a lot.</p></li><li><p>As we grew, we started saturating the network. We used 2x10Gbit NICs, and imperfect balance meant that we sometimes saturated individual links.</p></li></ul><p>Compression promised to solve both of these problems, so we were eager to try again with improved support from Kafka.</p>
    <div>
      <h3>Performance testing</h3>
      <a href="#performance-testing">
        
      </a>
    </div>
    <p>At Cloudflare, we use Go extensively, which means that a lot of our Kafka consumers and producers are in Go. This means we couldn't just take the off-the-shelf Java client that the Kafka team ships with every server release and start enjoying the benefits of compression. We had to get support in our Kafka client library first (we use <a href="https://github.com/Shopify/sarama">sarama from Shopify</a>). Luckily, support was added at the end of 2017, and with more fixes from our side we were able to get a test setup working.</p><p>Kafka supports 4 compression codecs: <code>none</code>, <code>gzip</code>, <code>lz4</code> and <code>snappy</code>. To figure out how these would work for our topics, we wrote a simple producer that copied data from an existing topic into four destination topics, one per compression type, and collected the following numbers.</p><p>Each destination topic was getting roughly the same number of messages:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7MOMWEQrQVLcd9DeOpignN/81ee9337ba0266c03770c6684237e476/1.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/48H1RMFi0oltn0zAHDtmzD/4e5dab19dc2d95d53563e331bcf60923/2.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2w92h2jHYHdce63bWv0gJR/2c810303800bbd073569c3f086326c31/3.png" />
          </figure><p>To make it even more obvious, this was the disk usage of these topics:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/FQ5TGkcRqPR63TBPiLW85/cc2c905abd0af052b3db6b4df2660b89/4.png" />
          </figure><p>This looked amazing, but it was a rather low-throughput nginx errors topic, containing literal error message strings from nginx. Our main target was the <code>requests</code> HTTP log topic, with <a href="https://capnproto.org/">capnp</a>-encoded messages that are much harder to compress. Naturally, we moved on to try one partition of the <code>requests</code> topic. The first results were insanely good:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ziqP8gINUJ6THlURi989m/28ecfd3e722da5fdbe2cb5ab048cb80c/5.png" />
            </figure><p>They were so good because they were lies. With nginx error logs we were pushing under 20Mbps of uncompressed data; here we jumped 30x to 600Mbps, and compression wasn't able to keep up. Still, as a starting point, this experiment gave us some expectations for compression ratios on the main target.</p><table><tr><td><p><b>Compression</b></p></td><td><p><b>Messages consumed</b></p></td><td><p><b>Disk usage</b></p></td><td><p><b>Average message size</b></p></td></tr><tr><td><p>None</p></td><td><p>30.18M</p></td><td><p>48106MB</p></td><td><p>1594B</p></td></tr><tr><td><p>Gzip</p></td><td><p>3.17M</p></td><td><p>1443MB</p></td><td><p>455B</p></td></tr><tr><td><p>Snappy</p></td><td><p>20.99M</p></td><td><p>14807MB</p></td><td><p>705B</p></td></tr><tr><td><p>LZ4</p></td><td><p>20.93M</p></td><td><p>14731MB</p></td><td><p>703B</p></td></tr></table><p>Gzip sounded too expensive from the beginning (especially in Go), but Snappy should have been able to keep up. We profiled our producer, and it was spending just 2.4% of CPU time in Snappy compression, never saturating a single core:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6X1yjXYICcPKzHxKKdQ76K/73278df1829440a7456f32d574d93f1a/6.png" />
            </figure>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57E71JTk6DPA6oo8SOLghQ/83edd6f2db81a83db4bd8762117eeb1d/7.png" />
            </figure><p>For Snappy we were able to get the following thread stacktrace from Kafka with <code>jstack</code>:</p>
            <pre><code>"kafka-request-handler-3" #87 daemon prio=5 os_prio=0 tid=0x00007f80d2e97800 nid=0x1194 runnable [0x00007f7ee1adc000]
   java.lang.Thread.State: RUNNABLE
    at org.xerial.snappy.SnappyNative.rawCompress(Native Method)
    at org.xerial.snappy.Snappy.rawCompress(Snappy.java:446)
    at org.xerial.snappy.Snappy.compress(Snappy.java:119)
    at org.xerial.snappy.SnappyOutputStream.compressInput(SnappyOutputStream.java:376)
    at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:130)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    - locked &lt;0x00000007a74cc8f0&gt; (a java.io.DataOutputStream)
    at org.apache.kafka.common.utils.Utils.writeTo(Utils.java:861)
    at org.apache.kafka.common.record.DefaultRecord.writeTo(DefaultRecord.java:203)
    at org.apache.kafka.common.record.MemoryRecordsBuilder.appendDefaultRecord(MemoryRecordsBuilder.java:622)
    at org.apache.kafka.common.record.MemoryRecordsBuilder.appendWithOffset(MemoryRecordsBuilder.java:409)
    at org.apache.kafka.common.record.MemoryRecordsBuilder.appendWithOffset(MemoryRecordsBuilder.java:442)
    at org.apache.kafka.common.record.MemoryRecordsBuilder.appendWithOffset(MemoryRecordsBuilder.java:595)
    at kafka.log.LogValidator$.$anonfun$buildRecordsAndAssignOffsets$1(LogValidator.scala:336)
    at kafka.log.LogValidator$.$anonfun$buildRecordsAndAssignOffsets$1$adapted(LogValidator.scala:335)
    at kafka.log.LogValidator$$$Lambda$675/1035377790.apply(Unknown Source)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:52)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at kafka.log.LogValidator$.buildRecordsAndAssignOffsets(LogValidator.scala:335)
    at kafka.log.LogValidator$.validateMessagesAndAssignOffsetsCompressed(LogValidator.scala:288)
    at kafka.log.LogValidator$.validateMessagesAndAssignOffsets(LogValidator.scala:71)
    at kafka.log.Log.liftedTree1$1(Log.scala:654)
    at kafka.log.Log.$anonfun$append$2(Log.scala:642)
    - locked &lt;0x0000000640068e88&gt; (a java.lang.Object)
    at kafka.log.Log$$Lambda$627/239353060.apply(Unknown Source)
    at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
    at kafka.log.Log.append(Log.scala:624)
    at kafka.log.Log.appendAsLeader(Log.scala:597)
    at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:499)
    at kafka.cluster.Partition$$Lambda$625/1001513143.apply(Unknown Source)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:217)
    at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:223)
    at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:487)
    at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$2(ReplicaManager.scala:724)
    at kafka.server.ReplicaManager$$Lambda$624/2052953875.apply(Unknown Source)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$Lambda$12/187472540.apply(Unknown Source)
    at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:138)
    at scala.collection.mutable.HashMap$$Lambda$25/1864869682.apply(Unknown Source)
    at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:236)
    at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:229)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:138)
    at scala.collection.TraversableLike.map(TraversableLike.scala:234)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:227)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:708)
    at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:459)
    at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:466)
    at kafka.server.KafkaApis.handle(KafkaApis.scala:99)
    at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:65)
    at java.lang.Thread.run(Thread.java:748)</code></pre>
            <p>This pointed us to <a href="https://github.com/apache/kafka/blob/1.0.0/core/src/main/scala/kafka/log/LogValidator.scala#L70-L71">this piece of code</a> in the Kafka repository.</p><p>There wasn't enough logging to figure out why Kafka was recompressing, but we were able to get this information out with a patched Kafka broker:</p>
            <pre><code>diff --git a/core/src/main/scala/kafka/log/LogValidator.scala b/core/src/main/scala/kafka/log/LogValidator.scala
index 15750e9cd..5197d0885 100644
--- a/core/src/main/scala/kafka/log/LogValidator.scala
+++ b/core/src/main/scala/kafka/log/LogValidator.scala
@@ -21,6 +21,7 @@ import java.nio.ByteBuffer
 import kafka.common.LongRef
 import kafka.message.{CompressionCodec, NoCompressionCodec}
 import kafka.utils.Logging
+import org.apache.log4j.Logger
 import org.apache.kafka.common.errors.{InvalidTimestampException, UnsupportedForMessageFormatException}
 import org.apache.kafka.common.record._
 import org.apache.kafka.common.utils.Time
@@ -236,6 +237,7 @@ private[kafka] object LogValidator extends Logging {
   
       // No in place assignment situation 1 and 2
       var inPlaceAssignment = sourceCodec == targetCodec &amp;&amp; toMagic &gt; RecordBatch.MAGIC_VALUE_V0
+      logger.info("inPlaceAssignment = " + inPlaceAssignment + ", condition: sourceCodec (" + sourceCodec + ") == targetCodec (" + targetCodec + ") &amp;&amp; toMagic (" + toMagic + ") &gt; RecordBatch.MAGIC_VALUE_V0 (" + RecordBatch.MAGIC_VALUE_V0 + ")")
   
       var maxTimestamp = RecordBatch.NO_TIMESTAMP
       val expectedInnerOffset = new LongRef(0)
@@ -250,6 +252,7 @@ private[kafka] object LogValidator extends Logging {
         // Do not compress control records unless they are written compressed
         if (sourceCodec == NoCompressionCodec &amp;&amp; batch.isControlBatch)
           inPlaceAssignment = true
+          logger.info("inPlaceAssignment = " + inPlaceAssignment + ", condition: sourceCodec (" + sourceCodec + ") == NoCompressionCodec (" + NoCompressionCodec + ") &amp;&amp; batch.isControlBatch (" + batch.isControlBatch + ")")
   
         for (record &lt;- batch.asScala) {
           validateRecord(batch, record, now, timestampType, timestampDiffMaxMs, compactedTopic)
@@ -261,21 +264,26 @@ private[kafka] object LogValidator extends Logging {
           if (batch.magic &gt; RecordBatch.MAGIC_VALUE_V0 &amp;&amp; toMagic &gt; RecordBatch.MAGIC_VALUE_V0) {
             // Check if we need to overwrite offset
             // No in place assignment situation 3
-            if (record.offset != expectedInnerOffset.getAndIncrement())
+            val off = expectedInnerOffset.getAndIncrement()
+            if (record.offset != off)
               inPlaceAssignment = false
+              logger.info("inPlaceAssignment = " + inPlaceAssignment + ", condition: record.offset (" + record.offset + ") != expectedInnerOffset.getAndIncrement() (" + off + ")")
             if (record.timestamp &gt; maxTimestamp)
               maxTimestamp = record.timestamp
           }
   
           // No in place assignment situation 4
-          if (!record.hasMagic(toMagic))
+          if (!record.hasMagic(toMagic)) {
+            logger.info("inPlaceAssignment = " + inPlaceAssignment + ", condition: !record.hasMagic(toMagic) (" + !record.hasMagic(toMagic) + ")")
             inPlaceAssignment = false
+          }
   
           validatedRecords += record
         }
       }
   
       if (!inPlaceAssignment) {
+        logger.info("inPlaceAssignment = " + inPlaceAssignment + "; recompressing")
         val (producerId, producerEpoch, sequence, isTransactional) = {
           // note that we only reassign offsets for requests coming straight from a producer. For records with magic V2,
           // there should be exactly one RecordBatch per request, so the following is all we need to do. For Records</code></pre>
            <p>And the output was:</p>
            <pre><code>Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = true, condition: sourceCodec (SnappyCompressionCodec) == targetCodec (SnappyCompressionCodec) &amp;&amp; toMagic (2) &gt; RecordBatch.MAGIC_VALUE_V0 (0) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = true, condition: sourceCodec (SnappyCompressionCodec) == NoCompressionCodec (NoCompressionCodec) &amp;&amp; batch.isControlBatch (false) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = true, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (0) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (1) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (2) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (3) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (4) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (5) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (6) (kafka.log.LogValidator$)</code></pre>
            <p>We promptly <a href="https://github.com/Shopify/sarama/pull/1015">fixed the issue</a> and resumed the testing. These were the results:</p><table><tr><td><p><b>Compression</b></p></td><td><p><b>User time</b></p></td><td><p><b>Messages</b></p></td><td><p><b>Time per 1m</b></p></td><td><p><b>CPU ratio</b></p></td><td><p><b>Disk usage</b></p></td><td><p><b>Avg. message size</b></p></td><td><p><b>Compression ratio</b></p></td></tr><tr><td><p>None</p></td><td><p>209.67s</p></td><td><p>26.00M</p></td><td><p>8.06s</p></td><td><p>1x</p></td><td><p>41448MB</p></td><td><p>1594B</p></td><td><p>1x</p></td></tr><tr><td><p>Gzip</p></td><td><p>570.56s</p></td><td><p>6.98M</p></td><td><p>81.74s</p></td><td><p>10.14x</p></td><td><p>3111MB</p></td><td><p>445B</p></td><td><p>3.58x</p></td></tr><tr><td><p>Snappy</p></td><td><p>337.55s</p></td><td><p>26.02M</p></td><td><p>12.97s</p></td><td><p>1.61x</p></td><td><p>17675MB</p></td><td><p>679B</p></td><td><p>2.35x</p></td></tr><tr><td><p>LZ4</p></td><td><p>525.82s</p></td><td><p>26.01M</p></td><td><p>20.22s</p></td><td><p>2.51x</p></td><td><p>22922MB</p></td><td><p>881B</p></td><td><p>1.81x</p></td></tr></table><p>Now we were able to keep up with both Snappy and LZ4. Gzip was still out of the question and LZ4 had incompatibility issues between Kafka versions and our Go client, which left us with Snappy. This was a winner in terms of compression ratio and speed too, so we were not very disappointed by the lack of choice.</p>
    <div>
      <h3>Deploying into production</h3>
      <a href="#deploying-into-production">
        
      </a>
    </div>
    <p>In production, we started small with Java-based consumers and producers. Our first production topic was just 1Mbps and 600rps of nginx error logs. Messages there were very repetitive, and we were able to get a whopping 8x decrease in size by batching records for just 1 second across 2 partitions.</p><p>This gave us some confidence to move on to the next topic, with <code>journald</code> logs encoded as JSON. Here we were able to reduce ingress from 300Mbps to just 50Mbps (yellow line on the graph):</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2MBwqqBM9fKfK5etewI3K8/a6c3e5154769721ace34e35448e0e9d8/8.png" />
            </figure>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1VRIDnquD9cRDxx9EpNliX/abed839321dde455854bea1c93a0fd76/10.png" />
            </figure>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1S1GiVnOPIEaZi20vDdu7B/4e0b4a064af5e576f615b18618e22a14/11.png" />
            </figure>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6F0xiuQrUHNiU2Q0VgaYtC/2176de2b39c4ec3493f06dae8a995ff2/12.png" />
            </figure>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/lvIkKRy1yrsTJNe4n4DEP/d2da438e3fbe1d9f40eb2e1347b684a7/13.png" />
            </figure><p>With all major topics in the DNS cluster switched to Snappy, we saw an even better picture of broker CPU usage:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ERRHln9aNu0pSoKSFFQkA/197e14bebbe7642b690f23363a5cdf1e/14.png" />
            </figure>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7cfmeyd2Jch2ZfTquu14rW/63fd83ad46a6484dd8b7ebe446b21e09/15.png" />
            </figure><p>On the next graph you can see Kafka CPU usage as the purple line and producer CPU usage as the green line:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5QWGl1NhwkYWMHl5Q5Lius/a4085495612071ae2a16a8aa9a675a05/16.png" />
            </figure><p>CPU usage of the producer did not go up substantially, which means most of its time is spent on tasks unrelated to compression. Consumers did not see any increase in CPU usage either, which means we got our 2.6x decrease in size practically for free.</p><p>It was time to hunt the biggest beast of all: the <code>requests</code> topic with HTTP access logs. There we were doing up to 100Gbps and 7.5Mrps of ingress at peak (a lot more when big attacks happen, but this was a quiet week):</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/695gJ4I1AIqDHKIH5phRKh/36940318a0b87aa698b4e088c60fc0c2/17.png" />
            </figure><p>With many smaller topics switched to Snappy already, we did not need to do anything special here. This is how it went:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2b39JEnQP0w68PhPDkGjOz/b4f74fb90ed73a44aa8b05fca03a9de3/18.png" />
            </figure>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7KvLGzI0Rnk44SXDJb6JXT/cdb6ee681318809a65b8ef8a61229589/19.png" />
            </figure><p>That's a 2.25x decrease in ingress bandwidth and average message size. We have multiple replicas and consumers, which means egress is a multiple of ingress. We were able to reduce internal in-DC traffic by hundreds of gigabits per second and save terabytes of flash storage. With network and disks being the bottlenecks, this meant we'd need less than half of the hardware we had. Kafka was one of the main hardware hogs in this datacenter, so this was a large scale win.</p><p>Yet, 2.25x seemed a bit on the low side.</p>
    <div>
      <h3>Looking for more</h3>
      <a href="#looking-for-more">
        
      </a>
    </div>
    <p>We wanted to see if we could do better. To do that, we extracted one batch of records from Kafka and ran some benchmarks on it. Each batch was around 1MB uncompressed, with 600 records on average.</p><p>To run the benchmarks we used <a href="https://github.com/inikep/lzbench">lzbench</a>, which runs lots of different compression algorithms and provides a summary. Here's what we saw, with results sorted by compression ratio (heavily filtered list):</p>
            <pre><code>lzbench 1.7.3 (64-bit MacOS)   Assembled by P.Skibinski
Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                  33587 MB/s 33595 MB/s      984156 100.00
...
lz4 1.8.0                 594 MB/s  2428 MB/s      400577  40.70
...
snappy 1.1.4              446 MB/s  1344 MB/s      425564  43.24
...
zstd 1.3.3 -1             409 MB/s   844 MB/s      259438  26.36
zstd 1.3.3 -2             303 MB/s   889 MB/s      244650  24.86
zstd 1.3.3 -3             242 MB/s   899 MB/s      232057  23.58
zstd 1.3.3 -4             240 MB/s   910 MB/s      230936  23.47
zstd 1.3.3 -5             154 MB/s   891 MB/s      226798  23.04</code></pre>
            <p>This looked too good to be true. <a href="https://facebook.github.io/zstd/">Zstandard</a> is a fairly new compression algorithm from Facebook, released about 1.5 years earlier. The benchmarks on the project's home page show this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4RmnIQ4k2nqI5jabUBQLMq/035633fbc37b15c7fb80d2ed72de01f4/zstd.png" />
            
            </figure><p>In our case we were getting this:</p><table><tr><td><p><b>Compressor name</b></p></td><td><p><b>Ratio</b></p></td><td><p><b>Compression</b></p></td><td><p><b>Decompression</b></p></td></tr><tr><td><p>zstd</p></td><td><p>3.794</p></td><td><p>409 MB/s</p></td><td><p>844 MB/s</p></td></tr><tr><td><p>lz4</p></td><td><p>2.457</p></td><td><p>594 MB/s</p></td><td><p>2428 MB/s</p></td></tr><tr><td><p>snappy</p></td><td><p>2.313</p></td><td><p>446 MB/s</p></td><td><p>1344 MB/s</p></td></tr></table><p>Clearly, results are very dependent on the kind of data you are trying to compress. For our data, zstd was giving amazing results even on the lowest compression level. The compression ratio was better than even gzip's at maximum compression level, while throughput was a lot higher. For posterity, this is how DNS logs compressed (HTTP logs compressed similarly):</p>
            <pre><code>$ ./lzbench -ezstd/zlib rrdns.recordbatch
lzbench 1.7.3 (64-bit MacOS)   Assembled by P.Skibinski
Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                  33235 MB/s 33502 MB/s      927048 100.00 rrdns.recordbatch
zstd 1.3.3 -1             430 MB/s   909 MB/s      226298  24.41 rrdns.recordbatch
zstd 1.3.3 -2             322 MB/s   878 MB/s      227271  24.52 rrdns.recordbatch
zstd 1.3.3 -3             255 MB/s   883 MB/s      217730  23.49 rrdns.recordbatch
zstd 1.3.3 -4             253 MB/s   883 MB/s      217141  23.42 rrdns.recordbatch
zstd 1.3.3 -5             169 MB/s   869 MB/s      216119  23.31 rrdns.recordbatch
zstd 1.3.3 -6             102 MB/s   939 MB/s      211092  22.77 rrdns.recordbatch
zstd 1.3.3 -7              78 MB/s   968 MB/s      208710  22.51 rrdns.recordbatch
zstd 1.3.3 -8              65 MB/s  1005 MB/s      204370  22.05 rrdns.recordbatch
zstd 1.3.3 -9              59 MB/s  1008 MB/s      204071  22.01 rrdns.recordbatch
zstd 1.3.3 -10             44 MB/s  1029 MB/s      202587  21.85 rrdns.recordbatch
zstd 1.3.3 -11             43 MB/s  1054 MB/s      202447  21.84 rrdns.recordbatch
zstd 1.3.3 -12             32 MB/s  1051 MB/s      201190  21.70 rrdns.recordbatch
zstd 1.3.3 -13             31 MB/s  1050 MB/s      201190  21.70 rrdns.recordbatch
zstd 1.3.3 -14             13 MB/s  1074 MB/s      200228  21.60 rrdns.recordbatch
zstd 1.3.3 -15           8.15 MB/s  1171 MB/s      197114  21.26 rrdns.recordbatch
zstd 1.3.3 -16           5.96 MB/s  1051 MB/s      190683  20.57 rrdns.recordbatch
zstd 1.3.3 -17           5.64 MB/s  1057 MB/s      191227  20.63 rrdns.recordbatch
zstd 1.3.3 -18           4.45 MB/s  1166 MB/s      187967  20.28 rrdns.recordbatch
zstd 1.3.3 -19           4.40 MB/s  1108 MB/s      186770  20.15 rrdns.recordbatch
zstd 1.3.3 -20           3.19 MB/s  1124 MB/s      186721  20.14 rrdns.recordbatch
zstd 1.3.3 -21           3.06 MB/s  1125 MB/s      186710  20.14 rrdns.recordbatch
zstd 1.3.3 -22           3.01 MB/s  1125 MB/s      186710  20.14 rrdns.recordbatch
zlib 1.2.11 -1             97 MB/s   301 MB/s      305992  33.01 rrdns.recordbatch
zlib 1.2.11 -2             93 MB/s   327 MB/s      284784  30.72 rrdns.recordbatch
zlib 1.2.11 -3             74 MB/s   364 MB/s      265415  28.63 rrdns.recordbatch
zlib 1.2.11 -4             68 MB/s   342 MB/s      269831  29.11 rrdns.recordbatch
zlib 1.2.11 -5             48 MB/s   367 MB/s      258558  27.89 rrdns.recordbatch
zlib 1.2.11 -6             32 MB/s   376 MB/s      247560  26.70 rrdns.recordbatch
zlib 1.2.11 -7             24 MB/s   409 MB/s      244623  26.39 rrdns.recordbatch
zlib 1.2.11 -8           9.67 MB/s   429 MB/s      239659  25.85 rrdns.recordbatch
zlib 1.2.11 -9           3.63 MB/s   446 MB/s      235604  25.41 rrdns.recordbatch</code></pre>
            <p>For our purposes we picked level 6 as the compromise between compression ratio and CPU cost. It is possible to be even more aggressive, as real-world usage later proved.</p><p>One great property of zstd is that decompression speed is more or less the same across levels, which means you only have one knob connecting the CPU cost of compression to the compression ratio.</p><p>Armed with this knowledge, we dug up a forgotten <a href="https://issues.apache.org/jira/browse/KAFKA-4514">Kafka ticket</a> to add zstd, along with a <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-110%3A+Add+Codec+for+ZStandard+Compression">KIP</a> (Kafka Improvement Proposal) and even a <a href="https://github.com/apache/kafka/pull/2267">PR on GitHub</a>. Sadly, these did not get traction back in the day, but this work saved us a lot of time.</p><p>We <a href="https://github.com/bobrik/kafka/commit/8b17836efda64dba1ebdc080e30ee2945793aef3">ported</a> the patch to the Kafka 1.0.0 release and pushed it to production. After another round of smaller scale testing and with <a href="https://github.com/bobrik/sarama/commit/c36187fbafab5afe5c152d2012b05b9306196cdb">patched</a> clients, we pushed Zstandard into production for the requests topic.</p><p>The graphs below include the switch from no compression (before 2/9) to Snappy (2/9 to 2/17) to Zstandard (after 2/17):</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/466acpqo8HHXc9mml41SGp/178fefc55aa2e9d29a484169ee47c0ed/20.png" />
            </figure>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6HSguckI7mJln65401u5Y2/cd9e39e86898c214637e0cb0c865ae02/21.png" />
            </figure><p>The decrease in size was <b>4.5x</b> compared to no compression at all. On next-generation hardware with 2.4x more storage and 2.5x higher network throughput, we suddenly made our bottleneck more than 10x wider and shifted it from storage and network to CPU. We even got to cancel a pending hardware order for Kafka expansion because of this.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Zstandard is a great modern compression algorithm, promising high compression ratio and throughput, tunable in small increments. Whenever you consider using compression, you should check zstd. If you haven't considered compression, it's worth seeing whether you can benefit from it. Run benchmarks with your data in either case.</p><p>Testing in a real-world scenario showed how benchmarks, even those coming from zstd itself, can be misleading. Going beyond the codecs built into Kafka allowed us to improve our compression ratio 2x at very low cost.</p><p>We hope that the data we gathered can be a catalyst to making Zstandard an official compression codec in Kafka, to the benefit of other people. There are 3 bits allocated for the codec type and only 4 values are used so far, which means there are 4 vacant slots.</p><p>If you were skeptical of compression benefits in Kafka because of old flaws in the Kafka protocol, this may be the time to reconsider.</p><p>If you enjoy benchmarking, profiling and optimizing large scale services, come <a href="https://www.cloudflare.com/careers/">join us</a>.</p> ]]></content:encoded>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Kafka]]></category>
            <guid isPermaLink="false">50nksAmOM8KO1t9ihhnJxe</guid>
            <dc:creator>Ivan Babrou</dc:creator>
        </item>
        <item>
            <title><![CDATA[Manage Cloudflare records with Salt]]></title>
            <link>https://blog.cloudflare.com/manage-cloudflare-records-with-salt/</link>
            <pubDate>Wed, 14 Dec 2016 14:25:52 GMT</pubDate>
            <description><![CDATA[ We use Salt to manage our ever growing global fleet of machines. Salt is great for managing configurations and being the source of truth. We use it for remote command execution and for network automation tasks. ]]></description>
            <content:encoded><![CDATA[ <p>We use <a href="https://github.com/saltstack/salt">Salt</a> to manage our ever growing global fleet of machines. Salt is great for managing configurations and being the source of truth. We use it for remote command execution and for <a href="/the-internet-is-hostile-building-a-more-resilient-network/">network automation tasks</a>. It allows us to grow our infrastructure quickly with minimal human intervention.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5HTPX88whBMGq7gOYzrjhw/1c30f21eeaae6e03d229aa8e548313dc/grains.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC-BY 2.0</a> <a href="https://secure.flickr.com/photos/pagedooley/2769134850/">image</a> by <a href="https://secure.flickr.com/photos/pagedooley/">Kevin Dooley</a></p><p>We got to thinking. Are DNS records not just another piece of configuration? We concluded that they are, and decided to manage our own records from Salt too.</p><p>We are strong believers in <a href="https://en.wikipedia.org/wiki/Eating_your_own_dog_food">eating our own dog food</a>, so we make our employees use the next version of our service before rolling it out to everyone else. That way, if there's a problem visiting one of the 5 million websites that use Cloudflare, it'll get spotted quickly internally. This is also why we keep our own DNS records on Cloudflare itself.</p><p>Cloudflare has an <a href="https://api.cloudflare.com/">API</a> that allows you to manage your zones programmatically without ever logging into the dashboard. Until recently, we were using handcrafted scripts to manage our own DNS records via our API. These scripts were written in exotic languages like PHP for historical reasons and had interesting behavior that not everybody enjoyed. While we were dogfooding our own APIs, these scripts were pushing the source of truth for DNS out of Salt.</p><p>When we decided to move some zones to Salt, we had a few key motivations:</p><ol><li><p>Single source of truth</p></li><li><p>Peer-reviewed, audited and versioned changes</p></li><li><p>Making things that our customers would want to use</p></li></ol><p>Points 1 and 2 were achieved by having DNS records in a Salt repo. The Salt configuration is itself in git, so we get peer review and an audit trail for free. We think we made progress on point 3 as well.</p><p>After extensive internal testing and finding a few bugs in our API (that's what we wanted!), we are happy to announce the public availability of the <a href="https://github.com/cloudflare/salt-cloudflare">Cloudflare Salt module</a>.</p><p>If you are familiar with Salt, configuring your records should be straightforward. All you need is the following:</p><p>Create the state <code>cloudflare</code> to deploy your zones:</p>
            <pre><code>example.com:
  cloudflare.manage_zone_records:
    - zone: {{ pillar["cloudflare_zones"]["example.com"]|yaml }}</code></pre>
            <p>Add a pillar to configure your zone:</p>
            <pre><code>cloudflare_zones:
  example.com:
    auth_email: ivan@example.com
    auth_key: auth key goes here
    zone_id: 0101deadbeefdeadbeefdeadbeefdead
    records:
      - name: blog.example.com
        content: 93.184.216.34
        proxied: true</code></pre>
            <p>Here we configure the zone <code>example.com</code> to have only one record, <code>blog.example.com</code>, pointing to <code>93.184.216.34</code> behind Cloudflare.</p><p>You can test your changes before you deploy:</p>
            <pre><code>salt-call state.apply cloudflare test=true</code></pre>
            <p>And then deploy if you are happy with the dry run:</p>
            <pre><code>salt-call state.apply cloudflare</code></pre>
            <p>After the initial setup, all you need to do is change the <code>records</code> array in the pillar and re-deploy the state. See the <a href="https://github.com/cloudflare/salt-cloudflare">README</a> for more details.</p><p>DNS records are only one part of the configuration you may want to change for your Cloudflare domain. We have plans to "saltify" other settings like WAF, caching, page rules and more.</p><p><a href="https://www.cloudflare.com/join-our-team/">Come work with us</a> if you want to help!</p> ]]></content:encoded>
            <category><![CDATA[GitHub]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">7cpWAbZAOswJhWNOR3oex8</guid>
            <dc:creator>Ivan Babrou</dc:creator>
        </item>
        <item>
            <title><![CDATA[Debugging war story: the mystery of NXDOMAIN]]></title>
            <link>https://blog.cloudflare.com/debugging-war-story-the-mystery-of-nxdomain/</link>
            <pubDate>Wed, 07 Dec 2016 14:11:49 GMT</pubDate>
            <description><![CDATA[ The following blog post describes a debugging adventure on Cloudflare's Mesos-based cluster. This internal cluster is primarily used to process log file information so that Cloudflare customers have analytics, and for our systems that detect and respond to attacks. ]]></description>
            <content:encoded><![CDATA[ <p>The following blog post describes a debugging adventure on Cloudflare's <a href="https://mesos.apache.org/">Mesos</a>-based cluster. This internal cluster is primarily used to process log file information so that Cloudflare customers have analytics, and for our systems that detect and respond to attacks.</p><p>The problem encountered didn't have any effect on our customers, but did have engineers scratching their heads...</p>
    <div>
      <h3>The Problem</h3>
      <a href="#the-problem">
        
      </a>
    </div>
    <p>At some point in one of our clusters we started seeing errors like this (an NXDOMAIN for an existing domain on our internal DNS):</p>
            <pre><code>lookup some.existing.internal.host on 10.1.0.9:53: no such host</code></pre>
            <p>This seemed very weird, since the domain did indeed exist. It was one of our internal domains! Engineers had mentioned that they'd seen this behaviour, so we decided to investigate deeper. Queries triggering this error were varied and ranged from dynamic SRV records managed by mesos-dns to external domains looked up from inside the cluster.</p><p>Our first naive attempt was to run the following in a loop:</p>
            <pre><code>while true; do dig some.existing.internal.host &gt; /tmp/dig.txt || break; done</code></pre>
            <p>Running this for a while on one server did not reproduce the problem: all the lookups were successful. Then we took our service logs for a day and did a grep for “no such host” and similar messages. Errors were happening sporadically. There were hours between errors and no obvious pattern that could lead us to any conclusion. Our investigation discarded the possibility that the error lay in Go, which we use for lots of our services, since errors were coming from Java services too.</p>
    <div>
      <h3>Into the rabbit hole</h3>
      <a href="#into-the-rabbit-hole">
        
      </a>
    </div>
    <p>We ran <a href="https://www.unbound.net/">Unbound</a> on a single IP across a few machines as our cluster DNS resolver, with BGP responsible for announcing internal routes from the machines to the router. We decided to try to find a pattern by sending lots of requests from different machines and recording errors. Here’s what our load testing program looked like at first:</p>
            <pre><code>package main

import (
	"flag"
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	n := flag.String("n", "", "domain to look up")
	p := flag.Duration("p", time.Millisecond*10, "pause between lookups")

	flag.Parse()

	if *n == "" {
		flag.PrintDefaults()
		os.Exit(1)
	}

	for {
		_, err := net.LookupHost(*n)
		if err != nil {
			fmt.Println("error:", err)
		}

		time.Sleep(*p)
	}
}</code></pre>
            <p>We run <code>net.LookupHost</code> in a loop with small pauses and log errors; that’s it. Packaging this into a <a href="https://www.docker.com/">Docker</a> container and running on Marathon was an obvious choice for us, since that is how we run other services anyway. Logs get shipped to <a href="https://kafka.apache.org/">Kafka</a> and then to <a href="https://www.elastic.co/products/kibana">Kibana</a>, where we can analyze them. Running this program on 65 machines doing lookups every 50ms shows the following error distribution (from high to low) across hosts:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/sAJzwdZ6928uieteOs3if/5cbfd0ace54627052214ccf1439c9418/image01.png" />
            </figure><p>We saw no strong correlation to racks or specific machines. Errors happened on many hosts, but not on all of them, and in different time windows errors happened on different machines. Putting time on the X axis and the number of errors on the Y axis showed the following:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6dtzcTKyCg5QdgJLJaRLFT/cf51638f85c1b4c2a52f00b95fc84281/image00.png" />
            </figure><p>To see if some particular DNS recursor had gone crazy, we stopped all load generators on regular machines and started the load generation tool on the recursors themselves. There were no errors in a few hours, which suggested that Unbound was perfectly healthy.</p><p>We started to suspect that packet loss was the issue, but why would “no such host” occur? It should only happen when an NXDOMAIN error is in a DNS response, but our theory was that replies didn’t come back at all.</p>
    <div>
      <h3>The Missing</h3>
      <a href="#the-missing">
        
      </a>
    </div>
    <p>To test the hypothesis that losing packets can lead to a “no such host” error, we first tried blocking outgoing traffic on port 53:</p>
            <pre><code>sudo iptables -A OUTPUT -p udp --dport 53 -j DROP</code></pre>
            <p>In this case, dig and similar tools just time out, but don’t return “no such host”:</p>
            <pre><code>; &lt;&lt;&gt;&gt; DiG 9.9.5-9+deb8u3-Debian &lt;&lt;&gt;&gt; cloudflare.com
;; global options: +cmd
;; connection timed out; no servers could be reached</code></pre>
            <p>Go is a bit smarter and tells you more about what’s going on, but doesn't return “no such host” either:</p>
            <pre><code>error: lookup cloudflare.com on 10.1.0.9:53: write udp 10.1.14.20:47442-&gt;10.1.0.9:53: write: operation not permitted</code></pre>
            <p>Since the Linux kernel tells the sender that it dropped packets, we had to point the nameserver to some black hole in the network that does nothing with packets to mimic packet loss. Still no luck:</p>
            <pre><code>error: lookup cloudflare.com on 10.1.2.9:53: read udp 10.1.14.20:39046-&gt;10.1.2.9:53: i/o timeout</code></pre>
            <p>To continue <i>blaming the network</i> we had to support our assumptions somehow, so we added timing information to our lookups:</p>
            <pre><code>s := time.Now()
_, err := net.LookupHost(*n)
e := time.Now().Sub(s).Seconds()
if err != nil {
    log.Printf("error after %.4fs: %s", e, err)
} else if e &gt; 1 {
    log.Printf("success after %.4fs", e)
}</code></pre>
            <p>To be honest, we started by timing errors and only added success timing later. Errors were happening after 10s, and comparatively many successful responses were arriving after 5s. It did look like packet loss, but it still didn’t tell us why “no such host” happens.</p><p>Now that we knew which hosts were more likely to be affected, we ran the following two commands in parallel in two <code>screen</code> sessions:</p>
            <pre><code>while true; do dig cloudflare.com &gt; /tmp/dig.log || break; done; date; sudo killall tcpdump
sudo tcpdump -w /state/wtf.cap port 53</code></pre>
            <p>The point was to get a network dump with failed resolves. In there, we saw the following queries:</p>
            <pre><code>00.00s A cloudflare.com
05.00s A cloudflare.com
10.00s A cloudflare.com.in.our.internal.domain</code></pre>
            <p>Two queries time out without any answer, but the third one gets lucky and succeeds. Naturally, we don’t have cloudflare.com in our internal domain, so Unbound rightfully gives NXDOMAIN in reply, 10s after the lookup was initiated.</p>
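            <p>The sequence in the capture is the stub resolver expanding the query with its configured search list. As a rough illustration, here is a minimal sketch of that expansion, assuming the default <code>ndots:1</code> behaviour of a glibc-style resolver (the helper name <code>candidates</code> is ours, not part of any library):</p>
            <pre><code>package main

import "fmt"

// candidates returns the names a glibc-style stub resolver tries for a
// query, in order. With the default ndots:1, a name that already
// contains a dot is tried as-is first; the suffixes from the search
// list in resolv.conf are only appended after that query fails.
func candidates(name string, search []string) []string {
	out := []string{name}
	for _, suffix := range search {
		out = append(out, name+"."+suffix)
	}
	return out
}

func main() {
	fmt.Println(candidates("cloudflare.com", []string{"in.our.internal.domain"}))
	// [cloudflare.com cloudflare.com.in.our.internal.domain]
}</code></pre>
            <p>With our resolv.conf, that ordering is exactly what the packet capture shows.</p>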
    <div>
      <h3>Bingo</h3>
      <a href="#bingo">
        
      </a>
    </div>
    <p>Let’s look at /etc/resolv.conf to understand more:</p>
            <pre><code>nameserver 10.1.0.9
search in.our.internal.domain</code></pre>
            <p>Using the <code>search</code> keyword allows us to use short hostnames instead of FQDNs, making <code>myhost</code> transparently equivalent to <code>myhost.in.our.internal.domain</code>.</p><p>For the DNS resolver this means the following: for any DNS query, ask the nameserver 10.1.0.9; if that fails, append <code>.in.our.internal.domain</code> to the query and retry. It doesn’t matter what kind of failure occurs for the original DNS query. Usually it is NXDOMAIN, but in our case it’s a read timeout due to packet loss.</p><p>The following events all have to occur for a “no such host” error to appear:</p><ol><li><p>The original DNS request has to be lost</p></li><li><p>The retry that is sent after 5 seconds has to be lost</p></li><li><p>The subsequent query for the internal domain (caused by the <code>search</code> option) has to succeed and return NXDOMAIN</p></li></ol><p>On the other hand, to observe a timed-out DNS query instead of an NXDOMAIN, you have to lose four packets sent 5 seconds apart (2 for the original query and 2 for the internal version of your domain), which is a much smaller probability. In fact, we only saw an NXDOMAIN after 15s once and never saw an error after 20s.</p><p>To validate that assumption, we built a proof-of-concept DNS server that drops all requests for cloudflare.com, but sends an NXDOMAIN for existing domains:</p>
            <pre><code>package main

import (
	"github.com/miekg/dns"
	"log"
)

func main() {
	server := &amp;dns.Server{Addr: ":53", Net: "udp"}

	// Drop queries for cloudflare.com to simulate packet loss, and
	// reply with NXDOMAIN to everything else.
	dns.HandleFunc(".", func(w dns.ResponseWriter, r *dns.Msg) {
		for _, q := range r.Question {
			log.Printf("checking %s", q.Name)
			if q.Name == "cloudflare.com." {
				log.Printf("ignoring %s", q.Name)
				// just ignore: no reply is ever sent
				return
			}
		}

		m := &amp;dns.Msg{}
		m.SetRcode(r, dns.RcodeNameError) // NXDOMAIN
		w.WriteMsg(m)
	})

	log.Printf("listening..")

	if err := server.ListenAndServe(); err != nil {
		log.Fatalf("error listening: %s", err)
	}
}</code></pre>
            <p>Finally, we found what was going on and had a way of reliably replicating that behaviour.</p>
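            <p>To point a Go client at the proof-of-concept server without touching /etc/resolv.conf, the pure Go resolver can be given a custom <code>Dial</code> function. This is a sketch under the assumption that the server above is listening on 127.0.0.1:53; the helper name <code>newLocalResolver</code> is ours:</p>
            <pre><code>package main

import (
	"context"
	"log"
	"net"
	"time"
)

// newLocalResolver returns a resolver that ignores /etc/resolv.conf and
// sends every query to addr. PreferGo forces the pure Go resolver so
// that the custom Dial function is actually used.
func newLocalResolver(addr string) *net.Resolver {
	r := new(net.Resolver)
	r.PreferGo = true
	r.Dial = func(ctx context.Context, network, address string) (net.Conn, error) {
		d := net.Dialer{Timeout: time.Second}
		return d.DialContext(ctx, "udp", addr)
	}
	return r
}

func main() {
	r := newLocalResolver("127.0.0.1:53")
	_, err := r.LookupHost(context.Background(), "cloudflare.com")
	log.Printf("lookup result: %v", err)
}</code></pre>
            <p>Against the proof-of-concept server, the query for cloudflare.com is dropped and the lookup has to ride out its timeouts, reproducing the behaviour we saw in production.</p>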
    <div>
      <h3>Solutions</h3>
      <a href="#solutions">
        
      </a>
    </div>
    <p>Let's think about how we can make our client more resilient to these transient network issues. The man page for resolv.conf gives you two knobs: the <code>timeout</code> and <code>attempts</code> options (set via the <code>options</code> keyword, e.g. <code>options timeout:1 attempts:5</code>). The default values are 5 and 2 respectively.</p><p>Unless you keep your DNS server very busy, it is very unlikely to take more than 1 second to reply. In fact, even a network device on the Moon could reply in about 3 seconds. If your nameserver lives in the next rack and is reachable over a high-speed network, you can safely assume that if there is no reply after 1 second, your DNS server did not get your query. If you want fewer head-scratching “no such host” errors, lower the timeout and raise the number of attempts: the more attempts you make under transient packet loss, the lower the chance of failure, and the shorter the timeout, the faster you finish.</p><p>Imagine that you have truly random 1% packet loss.</p><ul><li><p>2 attempts, 5s timeout: at most 10s of waiting before an error, 0.01% chance of failure</p></li><li><p>5 attempts, 1s timeout: at most 5s of waiting before an error, 0.00000001% chance of failure</p></li></ul><p>In real life the distribution would differ, because packet loss is not random, but you can expect to wait much less for DNS to reply with this kind of change.</p><p>Many of the system components that provide DNS resolution, such as glibc's resolver, nscd, and systemd-resolved, are not hardened to handle being on the open Internet or in an environment with packet loss. As we have grown, we have faced the challenge of creating a reliable and fast DNS resolution environment a number of times, only to later discover that the solution is not perfect.</p>
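    <p>Under the idealized assumption of truly independent packet loss, the failure chance for one lookup is simply the loss rate raised to the number of attempts, since every attempt must be lost. A quick sketch of that arithmetic (the function name <code>failureChance</code> is ours):</p>
            <pre><code>package main

import (
	"fmt"
	"math"
)

// failureChance is the probability that all attempts for one lookup are
// lost, assuming each packet is dropped independently with probability loss.
func failureChance(loss float64, attempts int) float64 {
	return math.Pow(loss, float64(attempts))
}

func main() {
	// 1% random packet loss, as in the example above.
	fmt.Printf("2 attempts: %.6f%%\n", failureChance(0.01, 2)*100)  // 0.010000%
	fmt.Printf("5 attempts: %.10f%%\n", failureChance(0.01, 5)*100) // 0.0000000100%
}</code></pre>
            <p>Real packet loss is bursty rather than independent, so treat these numbers as a best case; the direction of the trade-off still holds.</p>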
    <div>
      <h3>Over to you</h3>
      <a href="#over-to-you">
        
      </a>
    </div>
    <p>Given what you have read in this article about packet loss and split-DNS/private-namespace setups, how would you design a fast and reliable resolution setup? What software would you use and why? What tuning changes would you make to the standard configuration?</p><p>We'd love to hear your ideas in the comments. And if you like working on problems like this, send us your resume. We are <a href="https://www.cloudflare.com/join-our-team/">hiring</a>.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Reliability]]></category>
            <guid isPermaLink="false">7pVrAGPHLOk4Jdv1aScEyo</guid>
            <dc:creator>Ivan Babrou</dc:creator>
        </item>
    </channel>
</rss>