
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Tue, 14 Apr 2026 22:45:01 GMT</lastBuildDate>
        <item>
            <title><![CDATA[New standards for a faster and more private Internet]]></title>
            <link>https://blog.cloudflare.com/new-standards/</link>
            <pubDate>Wed, 25 Sep 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare's customers can now take advantage of Zstandard (zstd) compression, offering 42% faster compression than Brotli and 11.3% more efficiency than GZIP. We're further optimizing performance for our customers with HTTP/3 prioritization and BBR congestion control, and enhancing privacy through Encrypted Client Hello (ECH). ]]></description>
            <content:encoded><![CDATA[ <p>As the Internet grows, so do the demands for speed and security. At Cloudflare, we’ve spent the last 14 years simplifying the adoption of the latest web technologies, ensuring that our users stay ahead without the complexity. From being the first to offer <a href="https://www.cloudflare.com/application-services/products/ssl/">free SSL certificates</a> through <a href="https://blog.cloudflare.com/introducing-universal-ssl/"><u>Universal SSL</u></a> to quickly supporting innovations like <a href="https://blog.cloudflare.com/introducing-tls-1-3"><u>TLS 1.3</u></a>, <a href="https://blog.cloudflare.com/introducing-cloudflares-automatic-ipv6-gatewa/"><u>IPv6</u></a>, and <a href="https://blog.cloudflare.com/http-3-from-root-to-tip/"><u>HTTP/3</u></a>, we've consistently made it easy for everyone to harness cutting-edge advancements.</p><p>One of the most exciting recent developments in web performance is Zstandard (zstd) — a new compression algorithm that we have found compresses data 42% faster than Brotli while maintaining almost the same compression levels. Not only that, but Zstandard reduces file sizes by 11.3% compared to GZIP, all while maintaining comparable speeds. As compression speed and efficiency directly impact latency, this is a game changer for improving user experiences across the web.</p><p>We’re also re-starting the <a href="https://blog.cloudflare.com/announcing-encrypted-client-hello/"><u>rollout of Encrypted Client Hello (ECH)</u></a>, a <a href="https://datatracker.ietf.org/doc/draft-ietf-tls-esni/"><u>new proposed standard </u></a>that prevents networks from snooping on which websites a user is visiting. <a href="https://blog.cloudflare.com/encrypted-client-hello/"><u>Encrypted Client Hello (ECH) is a successor to ESNI</u></a> and masks the <a href="https://www.cloudflare.com/en-gb/learning/ssl/what-is-sni/"><u>Server Name Indication (SNI)</u></a> that is used to negotiate a TLS handshake. This means that whenever a user visits a website on Cloudflare that has ECH enabled, no one except for the user, Cloudflare, and the website owner will be able to determine which website was visited. Cloudflare is a big proponent of privacy for everyone and is excited about the prospects of bringing this technology to life.</p><p>In this post, we also further explore our work measuring the impact of HTTP/3 prioritization, and the development of Bottleneck Bandwidth and Round-trip propagation time (BBR) congestion control to further optimize network performance.</p>
    <div>
      <h2>Introducing Zstandard compression</h2>
      <a href="#introducing-zstandard-compression">
        
      </a>
    </div>
    <p><a href="https://github.com/facebook/zstd"><u>Zstandard</u></a>, an advanced compression algorithm, was developed by <a href="https://engineering.fb.com/2018/12/19/core-infra/zstandard/"><u>Yann Collet at Facebook</u></a> and open sourced in August 2016 to manage large-scale data processing.  It has gained popularity in recent years due to its impressive compression ratios and speed. The protocol was included in <a href="https://chromestatus.com/feature/6186023867908096"><u>Chromium-based browsers</u></a> and <a href="https://connect.mozilla.org/t5/ideas/add-support-for-zstd-compression/idi-p/52155"><u>Firefox</u></a> in March 2024 as a <a href="https://caniuse.com/zstd"><u>supported</u></a> compression algorithm. </p><p>Today, we are excited to announce that Zstandard compression between Cloudflare and browsers is now available to everyone. </p><p>Our testing shows that Zstandard compresses data up to 42% faster than <a href="https://github.com/google/brotli"><u>Brotli</u></a> while achieving nearly equivalent data compression. Additionally, Zstandard outperforms <a href="https://datatracker.ietf.org/doc/html/rfc1952"><u>GZIP</u></a> by approximately 11.3% in compression efficiency, all while maintaining similar compression speeds. This means Zstandard can compress files to the same size as Brotli but in nearly half the time, speeding up your website without sacrificing performance.

This is exciting because compression speed and file size directly impact latency. When a browser requests a resource from the origin server, the server needs time to compress the data before it’s sent over the network. A faster compression algorithm, like Zstandard, reduces this initial processing time. By also reducing the size of files transmitted over the Internet, better compression means downloads take less time to complete, websites load quicker, and users ultimately get a better experience.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6HSjbMJGtBI4GJlBp2Jf35/e2f971157f078636c6702f40f2c03a70/image2.png" />
            
            </figure>
    <div>
      <h3>Why is compression so important?</h3>
      <a href="#why-is-compression-so-important">
        
      </a>
    </div>
    <p>Website performance is crucial to the success of online businesses. <a href="https://www2.deloitte.com/content/dam/Deloitte/ie/Documents/Consulting/Milliseconds_Make_Millions_report.pdf"><u>Study</u></a> after <a href="https://www.thinkwithgoogle.com/_qs/documents/4290/c676a_Google_MobileSiteSpeed_Playbook_v2.1_digital_4JWkGQT.pdf"><u>study</u></a> has shown that increased load times <a href="https://www.cloudflare.com/learning/performance/more/website-performance-conversion-rates/"><u>directly affect sales</u></a>. Just like a physical shop situated in a remote area faces challenges in attracting customers, a slow website in a highly competitive market faces similar difficulties in attracting traffic.

Think about buying a piece of flat-pack furniture such as a bookshelf. Instead of receiving the bookshelf fully assembled, which would be expensive and cumbersome to transport, you receive it in a compact, flat box with all the components neatly organized, ready for assembly. The parts are carefully arranged to take up the least amount of space, making the package much smaller and easier to handle. When you get the item, you simply follow the instructions to assemble it to its proper state.</p><p>This is similar to how data compression works. The data is “disassembled” and packed tightly to reduce its size before being transmitted. Once it reaches its destination, it’s “reassembled” to its original form. This compression process reduces the amount of data that needs to be sent, saving bandwidth, reducing costs, and speeding up the transfer, just like how flat-pack furniture reduces shipping costs and simplifies delivery logistics.</p><p>However, with compression, there is a tradeoff: time to compress versus the overall compression ratio. A compression ratio is a measure of how much a file's size is reduced during compression. For example, a 10:1 compression ratio means that the compressed file is one-tenth the size of the original. Just like assembling flat-pack furniture takes time and effort, achieving higher compression ratios often requires more processing time. While a higher compression ratio significantly reduces file size — making data transmission faster and more efficient — it may take longer to compress and decompress the data. Conversely, quicker compression methods might produce larger files, leading to faster processing but at the cost of greater bandwidth usage. Balancing these factors is key to optimizing performance in data transmission.</p><p><a href="https://w3techs.com/technologies/details/ce-compression"><u>W3Techs</u></a> reports that as of September 12, 2024, 88.6% of websites rely on compression to optimize speed and reduce bandwidth usage. <a href="https://datatracker.ietf.org/doc/html/rfc1952"><u>GZIP</u></a>, introduced in 1996, remains the default algorithm for many, used by 57.0% of sites due to its reasonable compression ratios and fast compression speeds. <a href="https://datatracker.ietf.org/doc/html/rfc7932"><u>Brotli</u></a>, released by Google in 2016, delivers better <a href="https://blog.cloudflare.com/results-experimenting-brotli/"><u>compression ratios</u></a>, leading to smaller file sizes, especially for static assets like JavaScript and CSS, and is used by 45.5% of websites. However, this also means that 11.4% of websites still operate without any compression, missing out on crucial performance improvements.</p><p>As the Internet and its supporting infrastructure have evolved, so have user demands for faster, more efficient performance. This growing need for higher efficiency without compromising speed is where Zstandard comes into play.</p>
    <div>
      <h3>Enter Zstandard</h3>
      <a href="#enter-zstandard">
        
      </a>
    </div>
    <p>Zstandard offers compression ratios higher than GZIP’s and close to Brotli’s, with significantly faster compression and decompression speeds. This makes it ideal for real-time applications that require both speed and relatively high compression ratios.</p><p>To understand Zstandard's advantages, it's helpful to know about <a href="https://blog.cloudflare.com/cloudflare-fights-cancer/"><u>Zlib</u></a>. Zlib was developed in the mid-1990s based on the <a href="https://en.wikipedia.org/wiki/DEFLATE"><u>DEFLATE</u></a> compression algorithm, which combines <a href="https://www.cloudflare.com/en-gb/learning/performance/glossary/what-is-image-compression/"><u>LZ77 and Huffman coding</u></a> to reduce file sizes. While Zlib remains a compression standard and is used in Cloudflare’s <a href="https://blog.cloudflare.com/cloudflare-fights-cancer/"><u>open-source</u></a> GZIP implementation, its design is limited by a 32 KB sliding window — a constraint from the memory limitations of that era. This makes Zlib less efficient on modern hardware, which can access far more memory.</p><p>Zstandard improves on Zlib by leveraging modern innovations and hardware capabilities. Unlike Zlib’s fixed 32 KB window, Zstandard has no strict memory constraints and can theoretically address terabytes of memory. However, in practice, it typically uses much less, around 1 MB at lower compression levels. This flexibility allows Zstandard to buffer large amounts of data, enabling it to identify and compress repeating patterns more effectively. Zstandard also employs <a href="https://engineering.fb.com/2016/08/31/core-infra/smaller-and-faster-data-compression-with-zstandard/#:~:text=Repcode%20modeling,within%20zlib/gzip."><u>repcode modeling</u></a> to efficiently compress structured data with repetitive sequences, further reducing file sizes and enhancing its suitability for modern compression needs.</p><p>Zstandard is optimized for modern CPUs, which can execute multiple tasks simultaneously using multiple Arithmetic Logic Units (ALUs) to perform mathematical operations. Zstandard achieves this by processing data in parallel streams, dividing it into multiple parts that are processed concurrently. <a href="https://chromium.googlesource.com/external/github.com/klauspost/compress/+/refs/heads/master/huff0/"><u>The Huffman decoder, Huff0</u></a>, can decode multiple symbols in parallel on a single CPU core, and when combined with multi-threading, this leads to substantial speed improvements during both compression and decompression.</p><p>Zstandard’s branchless design is a crucial innovation that enhances CPU efficiency, especially in modern processors. To understand its significance, consider how CPUs execute instructions.</p><p>Modern CPUs use pipelining, where different stages of an instruction are processed simultaneously — like a production line — keeping all parts of the processor busy. However, when CPUs encounter a branch, such as an 'if-else' decision, they must make a <a href="https://blog.cloudflare.com/branch-predictor/"><u>branch prediction</u></a> to guess the next step. If the prediction is wrong, the pipeline must be cleared and restarted, causing slowdowns.</p><p>Zstandard avoids this issue by eliminating conditional branching.
Without relying on branch predictions, it ensures the CPU can execute instructions continuously, keeping the pipeline full and avoiding performance bottlenecks.</p><p>A key feature of Zstandard is its use of <a href="https://www.rfc-editor.org/rfc/rfc8478.html#section-4.1"><u>Finite State Entropy (FSE)</u></a>, an advanced compression method that encodes data more efficiently based on probability. FSE, built on the <a href="https://en.wikipedia.org/wiki/Asymmetric_numeral_systems"><u>Asymmetric Numeral System (ANS)</u></a>, allows Zstandard to use fractional bits for encoding, unlike traditional Huffman coding, which only uses whole bits. This allows heavily repeated data to be compressed more tightly without sacrificing efficiency.</p>
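            <p>If you want to experiment with Zstandard yourself, the open-source libzstd library exposes a simple one-shot API. Below is a minimal, illustrative C sketch (separate from our Nginx module) that compresses an in-memory buffer at level 3, the default level discussed later in this post:</p>
            <pre><code>#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;zstd.h&gt;

int main(void) {
    const char *src = "Example payload that would normally be an HTTP response body.";
    size_t src_size = strlen(src);

    /* Worst-case compressed size, used to size the destination buffer. */
    size_t dst_capacity = ZSTD_compressBound(src_size);
    void *dst = malloc(dst_capacity);

    /* Level 3 is zstd's default compression level. */
    size_t written = ZSTD_compress(dst, dst_capacity, src, src_size, 3);
    if (ZSTD_isError(written)) {
        fprintf(stderr, "zstd: %s\n", ZSTD_getErrorName(written));
        free(dst);
        return 1;
    }

    printf("compressed %zu bytes to %zu bytes\n", src_size, written);
    free(dst);
    return 0;
}</code></pre>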
    <div>
      <h3>Zstandard findings</h3>
      <a href="#zstandard-findings">
        
      </a>
    </div>
    <p>In the third quarter of 2024, we conducted extensive tests on our new Zstandard compression module, focusing on a 24-hour period during which we switched the default compression algorithm from Brotli to Zstandard across our Free plan traffic. This experiment spanned billions of requests, covering a wide range of file types and sizes, including HTML, CSS, and JavaScript. The results were very promising, with significant improvements in both compression speed and file size reduction, leading to faster load times and more efficient bandwidth usage.</p>
    <div>
      <h4>Compression ratios</h4>
      <a href="#compression-ratios">
        
      </a>
    </div>
    <p>In terms of compression efficiency, Zstandard delivers impressive results. Below are the average compression ratios we observed during our testing.</p><table><tr><td><p><b>Compression Algorithm</b></p></td><td><p><b>Average Compression Ratio</b></p></td></tr><tr><td><p>GZIP</p></td><td><p>2.56</p></td></tr><tr><td><p>Zstandard</p></td><td><p>2.86</p></td></tr><tr><td><p>Brotli</p></td><td><p>3.08</p></td></tr></table>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2EA8KkP7M3j4KiEArzmXVT/2a93b972f531f02f6b253e231f73ff40/image5.png" />
            
            </figure><p>As the table shows, Zstandard achieves an average compression ratio of <b>2.86:1</b>, which is notably higher than GZIP's <b>2.56:1</b> and close to Brotli’s <b>3.08:1</b>. While Brotli slightly edges out Zstandard in terms of pure compression ratio, what is particularly exciting is that we are only using Zstandard’s default compression level of 3 (out of 22) on our traffic. In the fourth quarter of 2024, we plan to experiment with higher compression levels and multithreading capabilities to further enhance Zstandard’s performance and optimize results even more.</p>
    <div>
      <h4>Compression speeds</h4>
      <a href="#compression-speeds">
        
      </a>
    </div>
    <p>What truly sets Zstandard apart is its speed. Below are the average times to compress data from our traffic-based tests, measured in milliseconds:</p><table><tr><td><p><b>Compression Algorithm</b></p></td><td><p><b>Average Time to Compress (ms)</b></p></td></tr><tr><td><p>GZIP</p></td><td><p>0.872</p></td></tr><tr><td><p>Zstandard</p></td><td><p>0.848</p></td></tr><tr><td><p>Brotli</p></td><td><p>1.544</p></td></tr></table>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15AxPyO1PmyV6hDRRjBznu/9cf16cfdc146afddbf9a3332da29629a/image10.png" />
            
            </figure><p>Zstandard not only compresses data efficiently, but it also does so <b>42% faster</b> than Brotli, with an average compression time of <b>0.848 ms</b> compared to Brotli’s <b>1.544 ms</b>. It even outperforms GZIP, which compresses at <b>0.872 ms</b> on average.</p><p>From our results, we have found Zstandard strikes an excellent balance between achieving a high compression ratio and maintaining fast compression speed, making it particularly well-suited for dynamic content such as HTML and non-cacheable sensitive data. Zstandard can compress these responses from the origin quickly and efficiently, saving time compared to Brotli while providing better compression ratios than GZIP.</p>
    <div>
      <h3>Implementing Zstandard at Cloudflare</h3>
      <a href="#implementing-zstandard-at-cloudflare">
        
      </a>
    </div>
    <p>To implement Zstandard compression at Cloudflare, we needed to build it into our Nginx-based service, which already handles GZIP and Brotli compression. Nginx is modular by design, with each module performing a specific function, such as compressing a response. Our custom Nginx module leverages Nginx's function 'hooks' — specifically, the header filter and body filter — to implement Zstandard compression.</p>
    <div>
      <h4>Header filter</h4>
      <a href="#header-filter">
        
      </a>
    </div>
    <p>The header filter allows us to access and modify response headers. For example, Cloudflare only compresses responses above a certain size (50 bytes for Zstandard), which is enforced with this code:</p>
            <pre><code>if (r-&gt;headers_out.content_length_n != -1 &amp;&amp;
    r-&gt;headers_out.content_length_n &lt; conf-&gt;min_length) {
    return ngx_http_next_header_filter(r);
}</code></pre>
            <p>Here, we check the "Content-Length" header. If the content length is less than our minimum threshold, we skip compression and let Nginx execute the next module.</p><p>We also need to ensure the content is not already compressed by checking the "Content-Encoding" header:</p>
            <pre><code>if (r-&gt;headers_out.content_encoding &amp;&amp;
    r-&gt;headers_out.content_encoding-&gt;value.len) {
    return ngx_http_next_header_filter(r);
}</code></pre>
            <p>If the content is already compressed, the module is bypassed, and Nginx proceeds to the next header filter.</p>
    <div>
      <h4>Body filter</h4>
      <a href="#body-filter">
        
      </a>
    </div>
    <p>The body filter hook is where the actual processing of the response body occurs. In our case, this involves compressing the data with the Zstandard encoder and streaming the compressed data back to the client. Since responses can be very large, it's not feasible to buffer the entire response in memory, so we carefully manage a bounded set of internal buffers to avoid exhausting memory.</p><p>The Zstandard library is well-suited for streaming compression and provides the <code>ZSTD_compressStream2</code> function:</p>
            <pre><code>ZSTDLIB_API size_t ZSTD_compressStream2(ZSTD_CCtx* cctx,
                                        ZSTD_outBuffer* output,
                                        ZSTD_inBuffer* input,
                                        ZSTD_EndDirective endOp);</code></pre>
            <p>This function can be called repeatedly with chunks of input data to be compressed. It accepts input and output buffers and an "operation" parameter (<code>ZSTD_EndDirective endOp</code>) that controls whether to continue feeding data, flush the data, or finalize the compression process.</p><p>Nginx uses a "flush" flag on memory buffers to indicate when data can be sent. Our module uses this flag to set the appropriate Zstandard operation:</p>
            <pre><code>switch (zstd_operation) {
    case ZSTD_e_continue: {
        if (flush) {
            zstd_operation = ZSTD_e_flush;
        }
    }
}
</code></pre>
            <p>This logic allows us to switch from the "ZSTD_e_continue" operation, which feeds more input data into the encoder, to "ZSTD_e_flush", which extracts compressed data from the encoder.</p>
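            <p>Outside of Nginx, the same continue/flush pattern appears in any standalone streaming loop. The C sketch below is a simplified illustration of driving <code>ZSTD_compressStream2</code> with <code>ZSTD_e_continue</code> and <code>ZSTD_e_flush</code>; it is not our production module, and <code>read_chunk</code> and <code>send_to_client</code> are hypothetical stand-ins for Nginx's buffer chains:</p>
            <pre><code>#include &lt;zstd.h&gt;

/* Hypothetical I/O helpers standing in for Nginx's buffer chains. */
extern size_t read_chunk(void *buf, size_t len, int *flush);
extern void send_to_client(const void *buf, size_t len);

void compress_stream(void) {
    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 3);

    char in[16 * 1024], out[16 * 1024];
    int flush = 0;
    size_t n;

    while ((n = read_chunk(in, sizeof(in), &amp;flush)) &gt; 0) {
        ZSTD_inBuffer input = { in, n, 0 };
        /* Keep feeding input; switch to flushing when the caller asks. */
        ZSTD_EndDirective op = flush ? ZSTD_e_flush : ZSTD_e_continue;
        int done = 0;

        do {
            ZSTD_outBuffer output = { out, sizeof(out), 0 };
            size_t remaining = ZSTD_compressStream2(cctx, &amp;output, &amp;input, op);
            if (ZSTD_isError(remaining)) break;
            send_to_client(out, output.pos);
            /* Done when the input is consumed and, when flushing, the
               encoder reports no bytes left in its internal buffers. */
            done = (input.pos == input.size) &amp;&amp;
                   (op != ZSTD_e_flush || remaining == 0);
        } while (!done);
    }
    /* A full implementation would finish the frame with ZSTD_e_end. */
    ZSTD_freeCCtx(cctx);
}</code></pre>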
    <div>
      <h4>Compression cycle</h4>
      <a href="#compression-cycle">
        
      </a>
    </div>
    <p>The compression module operates in the following cycle:</p><ol><li><p>Receive uncompressed data.</p></li><li><p>Locate an internal buffer to store compressed data.</p></li><li><p>Compress the data with Zstandard.</p></li><li><p>Send the compressed data back to the client.</p></li></ol><p>Once a buffer is filled with compressed data, it’s passed to the next Nginx module and eventually sent to the client. When the buffer is no longer in use, it can be recycled, avoiding unnecessary memory allocation. This process is managed as follows:</p>
            <pre><code>if (free) {
    // A free buffer is available, so use it
    buffer = free;
} else if (buffers_used &lt; maximum_buffers) {
    // No free buffers, but we're under the limit, so allocate a new one
    buffer = create_buf();
} else {
    // No free buffers and can't allocate more
    err = no_memory;
}
</code></pre>
            
    <div>
      <h4>Handling backpressure</h4>
      <a href="#handling-backpressure">
        
      </a>
    </div>
    <p>If no buffers are available, it can lead to backpressure — a situation where the Zstandard module generates compressed data faster than the client can receive it. This causes data to become "stuck" inside Nginx, halting further compression due to memory constraints. In such cases, we stop compression and send an empty buffer to the next Nginx module, allowing Nginx to attempt to send the data to the client again. When successful, this frees up memory buffers that our module can reuse, enabling continued streaming of the compressed response without buffering the entire response in memory.</p>
    <div>
      <h3>What's next? Compression dictionaries</h3>
      <a href="#whats-next-compression-dictionaries">
        
      </a>
    </div>
    <p>The future of Internet compression lies in the use of <a href="https://datatracker.ietf.org/doc/draft-ietf-httpbis-compression-dictionary/"><u>compression dictionaries</u></a>. Both Brotli and Zstandard support dictionaries, offering up to a <a href="https://developer.chrome.com/blog/shared-dictionary-compression"><u>90% improvement</u></a> in compression compared to using the standard static dictionaries.</p><p>Compression dictionaries store common patterns or sequences of data, allowing algorithms to compress information more efficiently by referencing these patterns rather than repeating them. This concept is akin to how an iPhone's predictive text feature works. For example, if you frequently use the phrase "On My Way," you can customize your iPhone’s dictionary to recognize the abbreviation "OMW" and automatically expand it to "On My Way" when you type it, saving you from typing six extra letters.</p><table><tr><td><p>O</p></td><td><p>M</p></td><td><p>W</p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td><td><p></p></td></tr><tr><td><p>O</p></td><td><p>n</p></td><td><p>
</p></td><td><p>M</p></td><td><p>y</p></td><td><p>
</p></td><td><p>W</p></td><td><p>a</p></td><td><p>y</p></td></tr></table><p>Traditionally, compression algorithms use a static dictionary defined by its RFC that is shared between clients and origin servers. This static dictionary is designed to be broadly applicable, balancing size and compression effectiveness for general use. However, Zstandard and Brotli support custom dictionaries, tailored specifically to the content being sent to the client. For example, Cloudflare could create a specialized dictionary that focuses on frequently used terms like “Cloudflare”. This custom dictionary would compress these terms more efficiently, and a browser using the same dictionary could decode them accurately, leading to significant improvements in compression and performance.</p><p>In the future, we will enable users to leverage origin-generated dictionaries for Zstandard and Brotli to enhance compression. Another exciting area we're exploring is the use of AI to create these dictionaries dynamically without them needing to be generated at the origin. By analyzing data streams in real-time, Cloudflare could develop context-aware dictionaries tailored to the specific characteristics of the data being processed. This approach would allow users to significantly improve both compression ratios and processing speed for their applications.</p>
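            <p>To make the dictionary idea concrete, here is a rough, illustrative C sketch using libzstd's training and dictionary APIs. It trains a dictionary from hypothetical sample payloads and reuses it across compressions; it is a local experiment, not how dictionaries will be exchanged between Cloudflare and browsers:</p>
            <pre><code>#include &lt;zdict.h&gt;  /* dictionary training (ZDICT_trainFromBuffer) */
#include &lt;zstd.h&gt;

/* 'samples' is a concatenation of representative payloads and
   'sample_sizes' holds the length of each one (hypothetical inputs). */
size_t compress_with_dict(const void *samples, const size_t *sample_sizes,
                          unsigned num_samples,
                          const void *src, size_t src_size,
                          void *dst, size_t dst_capacity) {
    char dict[16 * 1024];

    /* Train a 16 KB dictionary from the sample payloads. */
    size_t dict_size = ZDICT_trainFromBuffer(dict, sizeof(dict), samples,
                                             sample_sizes, num_samples);
    if (ZDICT_isError(dict_size)) return 0;

    /* Digest the dictionary once, then reuse it across many responses. */
    ZSTD_CDict *cdict = ZSTD_createCDict(dict, dict_size, 3);
    ZSTD_CCtx *cctx = ZSTD_createCCtx();

    size_t written = ZSTD_compress_usingCDict(cctx, dst, dst_capacity,
                                              src, src_size, cdict);

    ZSTD_freeCCtx(cctx);
    ZSTD_freeCDict(cdict);
    return ZSTD_isError(written) ? 0 : written;
}</code></pre>
            <p>The receiving side needs the same dictionary to decode the result (via <code>ZSTD_createDDict</code> and <code>ZSTD_decompress_usingDDict</code>), which is exactly the distribution problem the proposed compression dictionary standard addresses.</p>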
    <div>
      <h3>Compression Rules for everyone</h3>
      <a href="#compression-rules-for-everyone">
        
      </a>
    </div>
    <p>Today we’re also excited to announce the introduction of <a href="https://developers.cloudflare.com/rules/compression-rules/"><u>Compression Rules</u></a> for all our customers. By default, Cloudflare will automatically compress certain content types based on their <a href="https://developers.cloudflare.com/speed/optimization/content/brotli/content-compression"><u>Content-Type headers</u></a>. Customers can use compression rules to optimize how and what Cloudflare compresses. This feature was previously exclusive to our Enterprise plans.

Compression Rules is built on the same robust framework as our other rules products, such as Origin Rules, Custom Firewall Rules, and Cache Rules, with additional fields for Media Type and Extension Type. This allows you to easily specify the content you wish to compress, providing granular control over your site’s performance optimization.</p><p>Compression Rules are now available on all our pay-as-you-go plans and will be added to Free plans in October 2024. In the table below, you’ll find the updated limits, including an increase to 125 Compression Rules for Enterprise plans, aligning with our other rule products' quotas.</p><table><tr><td><p><b>Plan Type</b></p></td><td><p><b>Free*</b></p></td><td><p><b>Pro</b></p></td><td><p><b>Business</b></p></td><td><p><b>Enterprise</b></p></td></tr><tr><td><p>Available Compression Rules</p></td><td><p>10</p></td><td><p>25</p></td><td><p>50</p></td><td><p>125</p></td></tr></table>
    <div>
      <h3>Using Compression Rules to enable Zstandard</h3>
      <a href="#using-compression-rules-to-enable-zstandard">
        
      </a>
    </div>
    <p>To integrate our Zstandard module into our platform, we also added support for it within our Compression Rules framework. This means that customers can now specify Zstandard as their preferred compression method, and our systems will automatically enable the Zstandard module in Nginx, disabling other compression modules when necessary.</p><p>The <code>Accept-Encoding</code> header determines which compression algorithms a client supports. If a browser supports Zstandard (<code>zstd</code>), and both Cloudflare and the zone have enabled the feature, then Cloudflare will return a Zstandard compressed response.</p>
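            <p>As an illustrative example of this negotiation (with example.com standing in for a zone that has Zstandard enabled via a compression rule), the request and response headers might look like this:</p>
            <pre><code>GET /style.css HTTP/1.1
Host: example.com
Accept-Encoding: zstd, br, gzip

HTTP/1.1 200 OK
Content-Type: text/css
Content-Encoding: zstd
Vary: Accept-Encoding</code></pre>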
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/tRIwu8JItGU0zyVmeTp2e/232af3ea43893022e5879e8361ed42b7/image4.png" />
            
            </figure><p>If the client does not support Zstandard, then Cloudflare will automatically fall back to Brotli or GZIP, or serve the content uncompressed if no compression algorithm is supported, ensuring compatibility.

To enable Zstandard for your entire site or specifically filter on certain file types, all Cloudflare users can deploy a simple compression rule.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2WiodFBfD02mlPASLABkz5/e9a66b552ba46e423f352e502792b398/image3.png" />
            
            </figure><p>Further details and examples of what can be accomplished with Compression Rules can be found in our <a href="https://developers.cloudflare.com/rules/compression-rules/"><u>developer documentation</u></a>.</p><p>Currently, we support Zstandard, Brotli, and GZIP as compression algorithms for traffic sent to clients, and support receiving GZIP- and Brotli-compressed data from the origin (Brotli since <a href="https://blog.cloudflare.com/this-is-brotli-from-origin/"><u>2023</u></a>). We plan to implement full end-to-end support for Zstandard in 2025, offering customers another effective way to reduce their egress costs.</p><p>Once Zstandard is enabled, you can view your browser’s <a href="https://developer.chrome.com/docs/devtools/network"><u>Network Activity</u></a> log to check the content-encoding headers of the response.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1CujaUsXiEFee79Ny27Zks/1f7ef23910d4bad47c203ad311866951/image11.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5JIvK4tNjlxd7uFlWyEJoc/e426053d2c895c980f4c1370379c7b2e/image1.png" />
            
            </figure>
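            <p>You can also run the same check from the command line with curl by advertising Zstandard support and inspecting the response headers. The URL below is a placeholder for your own zone:</p>
            <pre><code>curl -s -o /dev/null -D - -H "Accept-Encoding: zstd" https://example.com/ | grep -i content-encoding

content-encoding: zstd</code></pre>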
    <div>
      <h3>Enable Zstandard now!</h3>
      <a href="#enable-zstandard-now">
        
      </a>
    </div>
    <p>Zstandard is now available to all Cloudflare customers through <a href="https://dash.cloudflare.com/?to=/:account/:zone/rules/compression-rules"><u>Compression Rules</u></a> on our Enterprise and pay-as-you-go plans, with Free plans gaining access in October 2024. Whether you're optimizing for speed or aiming to reduce bandwidth, <a href="https://dash.cloudflare.com/?to=/:account/:zone/rules/compression-rules"><u>Compression Rules</u></a> give all customers granular control over their site's performance.</p>
    <div>
      <h2>Encrypted Client Hello (ECH)</h2>
      <a href="#encrypted-client-hello-ech">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4BFVDlJK6Mxn0P1yk5UdHK/e348084642480eeaec9861a76e32d5f2/image9.png" />
          </figure><p>While performance is crucial for delivering a fast user experience, ensuring privacy is equally important in today’s Internet landscape. As we optimize for speed with Zstandard, Cloudflare is also working to protect users' sensitive information from being exposed during data transmission. With web traffic growing more complex and interconnected, it's critical to keep both performance and privacy in balance. This is where technologies like Encrypted Client Hello (ECH) come into play, securing connections without sacrificing speed.</p><p>Ten years ago, we embarked on a mission to create a more secure and encrypted web. At the time, much of the Internet remained unencrypted, leaving user data vulnerable to interception. On September 27, 2014, we took a major step forward by enabling HTTPS for free for all Cloudflare customers. Overnight, we doubled the size of the encrypted web. This set the stage for a more secure Internet, ensuring that encryption was not a privilege limited by budget but a right accessible to everyone.</p><p>Since then, both Cloudflare and the broader community have helped encrypt more of the Internet. Projects like <a href="https://letsencrypt.org/"><u>Let's Encrypt</u></a> launched to make certificates free for everyone. Cloudflare invested to encrypt more of the connection, and to future-proof that encryption against emerging technologies like <a href="https://blog.cloudflare.com/post-quantum-for-all/"><u>quantum computers</u></a>. We've always believed that it is everyone's right, regardless of budget, to have an encrypted Internet at no cost.</p><p>One of the last major challenges has been securing the SNI (Server Name Indication), which remains exposed in plaintext during the TLS handshake. This is where Encrypted Client Hello (ECH) comes in, and today, we are proud to announce that we're closing that gap.</p><p>Cloudflare announced support for <a href="https://blog.cloudflare.com/announcing-encrypted-client-hello/"><u>Encrypted Client Hello (ECH)</u></a> in 2023 and has continued to enhance its implementation in collaboration with our Internet browser partners. During a TLS handshake, one of the key pieces of information exchanged is the <a href="https://www.cloudflare.com/learning/ssl/what-is-sni/"><u>Server Name Indication (SNI)</u></a>, which is used to initiate a secure connection. Unfortunately, the SNI is sent in plaintext, meaning anyone can read it. Imagine hand-delivering a letter — anyone following you can see where you're delivering it, even if they don’t know the contents. With ECH, it is like sending that same confidential letter through a P.O. Box. You place your sensitive letter in a sealed inner envelope with the actual address. Then, you put that envelope into a larger, standard envelope addressed to a public P.O. Box that is trusted to securely forward it to your intended recipient. The larger envelope containing the non-sensitive information is visible to everyone, while the inner envelope holds the confidential details, such as the actual address and recipient. Just as the P.O. Box maintains the anonymity of the true recipient’s address, ECH ensures that the SNI remains protected.</p><p>While encrypting the SNI is a primary motivation for ECH, its benefits extend further. ECH encrypts the entire Client Hello, ensuring user privacy and enabling TLS to evolve without exposing sensitive connection data.
By securing the full handshake, ECH allows for flexible, future-proof encryption designs that safeguard privacy as the Internet continues to grow.</p>
    <div>
      <h3>How ECH works</h3>
      <a href="#how-ech-works">
        
      </a>
    </div>
    <p>Encrypted Client Hello (ECH) introduces a layer of privacy by dividing the ClientHello message into two distinct parts: a ClientHelloOuter and a ClientHelloInner.</p><ul><li><p><b>ClientHelloOuter</b>: This part remains unencrypted and contains innocuous values for sensitive TLS extensions. It sets the SNI to Cloudflare’s public name, currently cloudflare-ech.com. Cloudflare manages this domain and possesses the necessary certificates to handle TLS negotiations for it.</p></li><li><p><b>ClientHelloInner</b>: This part is encrypted with a public key and includes the actual server name the client wants to visit, along with other sensitive TLS extensions. The encryption scheme ensures that this sensitive data can only be decrypted by the client-facing server, which in our case is Cloudflare.</p></li></ul><p>During the TLS handshake, the ClientHelloOuter reveals only the public name (e.g., cloudflare-ech.com), while the encrypted ClientHelloInner carries the real server name. As a result, intermediaries observing the traffic will only see cloudflare-ech.com in plaintext, concealing the actual destination.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5C7EUMmlYp3gmvfvnbeKul/ea62a3c648cd859bbb46bb6fe0761645/image13.png" />
          </figure><p>The design of ECH effectively addresses many challenges in securely deploying handshake encryption, thanks to the collaborative efforts within the <a href="https://datatracker.ietf.org/doc/draft-ietf-tls-esni/"><u>IETF community</u></a>. The key to ECH’s success is its integration with other IETF standards, including the new <a href="https://datatracker.ietf.org/doc/html/rfc9460"><u>HTTPS DNS resource record</u></a>, which enables HTTPS endpoints to advertise different TLS capabilities and simplifies key distribution. By using <a href="https://blog.cloudflare.com/dns-encryption-explained/"><u>Encrypted DNS</u></a> methods, browsers and clients can anonymously query these HTTPS records. These records contain the ECH parameters needed to initiate a secure connection. </p><p>ECH leverages the <a href="https://blog.cloudflare.com/hybrid-public-key-encryption/"><u>Hybrid Public Key Encryption (HPKE)</u></a> standard, which streamlines the handshake encryption process, making it more secure and easier to implement. Before initiating a layer 4 connection, the user’s browser makes a DNS request for an HTTPS record, and zones with ECH enabled will include an ECH configuration in the HTTPS record containing an encryption public key and some associated metadata. For example, looking at the zone cloudflare-ech.com, you can see the following record returned:</p>
            <pre><code>dig cloudflare-ech.com https +short


1 . alpn="h3,h2" ipv4hint=104.18.10.118,104.18.11.118 ech=AEX+DQBB2gAgACD1W1B+GxY3nZ53Rigpsp0xlL6+80qcvZtgwjsIs4YoOwAEAAEAAQASY2xvdWRmbGFyZS1lY2guY29tAAA= ipv6hint=2606:4700::6812:a76,2606:4700::6812:b76</code></pre>
            <p>Aside from the public key used by the client to encrypt ClientHelloInner and other <a href="https://www.ietf.org/archive/id/draft-ietf-tls-esni-20.html#name-encrypted-clienthello-confi"><u>parameters</u></a> that specify the ECH configuration, the configured public name is also present.</p>
            <pre><code>Y2xvdWRmbGFyZS1lY2guY29t</code></pre>
            <p>Decoding this Base64 string (for example, with <code>echo Y2xvdWRmbGFyZS1lY2guY29t | base64 -d</code>) reveals:</p>
            <pre><code>cloudflare-ech.com</code></pre>
            <p>This public name is then used by the client in the ClientHelloOuter.</p>
    <div>
      <h3>Practical implications</h3>
      <a href="#practical-implications">
        
      </a>
    </div>
    <p>With ECH, any observer monitoring the traffic between the client and Cloudflare will see only uniform TLS handshakes that appear to be directed towards <code>cloudflare-ech.com</code>, regardless of the actual website being accessed. For instance, if a user visits <code>example.com</code>, intermediaries will not discern this specific destination but will only see <code>cloudflare-ech.com</code> in the visible handshake data. </p>
    <div>
      <h3>The problem with middleboxes</h3>
      <a href="#the-problem-with-middleboxes">
        
      </a>
    </div>
    <p>In a basic HTTPS connection, a browser (client) establishes a TLS connection directly with an origin server to send requests and download content. However, many connections on the Internet do not go directly from a browser to the server but instead pass through some form of proxy or middlebox (often referred to as a "monster-in-the-middle" or MITM). This routing through intermediaries can occur for various reasons, both benign and malicious.</p><p>One common type of HTTPS interceptor is the TLS-terminating forward proxy. This proxy sits between the client and the destination server, transparently forwarding and potentially modifying traffic. To perform this task, the proxy terminates the TLS connection from the client, decrypts the traffic, and then re-encrypts and forwards it to the destination server over a new TLS connection. To avoid browser certificate validation errors, these forward proxies typically require users to install a root certificate on their devices. This root certificate allows the proxy to generate and present a trusted certificate for the destination server, a process often managed by network administrators in corporate environments, as seen with <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-devices/warp/"><u>Cloudflare WARP</u></a>. These services can help prevent sensitive company data from being transmitted to unauthorized destinations, safeguarding confidentiality.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2xfA7PB61KxnbZ96XRLahh/975153cfcd831f67f47f638ec9578cdb/image8.png" />
          </figure><p>However, TLS-terminating forward proxies may not be equipped to handle Encrypted Client Hello (ECH) correctly, especially if the MITM proxy and the client-facing ECH server belong to different entities. Because the MITM proxy will terminate the TLS connection without being ECH-aware, it may provide a valid certificate for the public name (in our case, cloudflare-ech.com) without being able to decrypt the ClientHelloInner or provide a new public key for the client to use. In this case, the client considers ECH to be disabled, which means you both lose the benefits of ECH and pay the cost of an extra round trip.</p><p>We also observed that specific Cloudflare setups, such as <a href="https://developers.cloudflare.com/dns/cname-flattening/"><u>CNAME Flattening</u></a> and <a href="https://developers.cloudflare.com/cloudflare-for-platforms/cloudflare-for-saas/saas-customers/how-it-works/"><u>Orange-to-Orange configurations</u></a>, could cause ECH to break. This issue arose because the end destination for these connections did not support TLS 1.3, preventing ECH from being processed correctly. Fortunately, in close collaboration with our browser partners, we implemented a fallback in our <a href="https://boringssl.googlesource.com/boringssl/+/d274b1bacdca36f3941bf78e43dc38acf676a1a8"><u>BoringSSL</u></a> implementation that handles TLS terminations. This fallback allows browsers to retry connections over TLS 1.2 without ECH, ensuring that a connection can still be established.</p><p>As a result of these improvements, we have enabled ECH by default for all Free plans, while all other plan types can manually enable it through their <a href="https://dash.cloudflare.com/?to=/:account/:zone/ssl-tls/edge-certificates#ech-card"><u>Cloudflare dashboard</u></a> or via the API. We are excited to support ECH at scale, enhancing the privacy and security of users' browsing activities. ECH plays a crucial role in safeguarding online interactions from potential eavesdroppers and maintaining the confidentiality of web activities.</p>
    <div>
      <h2>HTTP/3 Prioritization and QUIC congestion control</h2>
      <a href="#http-3-prioritization-and-quic-congestion-control">
        
      </a>
    </div>
    <p>Two other areas we are investing in to improve performance for all our customers are <a href="https://blog.cloudflare.com/better-http-3-prioritization-for-a-faster-web/"><u>HTTP/3 Prioritization</u></a> and QUIC congestion control.</p><p>HTTP/3 Prioritization focuses on efficiently managing the order in which web assets are loaded, thereby improving web performance by ensuring critical assets are delivered faster. It uses Extensible Priorities, which reduce prioritization to two parameters: urgency (ranging from 0 to 7) and a true/false value indicating whether the resource can be processed progressively. This allows resources like HTML, CSS, and images to be prioritized based on importance.</p><p>On the other hand, QUIC congestion control aims to optimize the flow of data, preventing network bottlenecks and ensuring smooth, reliable transmission even under heavy traffic conditions.</p><p>Both of these improvements significantly impact how Cloudflare’s network serves requests to clients. Before deploying these technologies across our global network, which handles peak traffic volumes of over 80 million requests per second, we first developed a reliable method to measure their impact through rigorous experimentation.</p>
    <div>
      <h3>Measuring impact</h3>
      <a href="#measuring-impact">
        
      </a>
    </div>
    <p>Accurately measuring the impact of features implemented by Cloudflare for our customers is crucial for several reasons. These measurements ensure that optimizations related to performance, security, or reliability deliver the intended benefits without introducing new issues. Precise measurement validates the effectiveness of these changes, allowing Cloudflare to assess improvements in metrics such as load times, user experience, and overall site security. One of the best ways to measure performance changes is through aggregated real-world data.</p><p><a href="https://developers.cloudflare.com/pages/how-to/web-analytics/"><u>Cloudflare Web Analytics</u></a> offers free, privacy-first analytics for your website, helping you understand the performance of your web pages as experienced by your visitors. Real User Metrics (RUM) is a vital tool in web performance optimization, capturing data from real users interacting with a website, providing insights into site performance under real-world conditions. RUM tracks various metrics directly from the user's device, including load times, resource usage, and user interactions. This data is essential for understanding the actual user experience, as it reflects the diverse environments and conditions under which the site is accessed.</p><p>A key performance indicator measured through RUM is <a href="https://web.dev/articles/vitals#core-web-vitals"><u>Core Web Vitals (CWV)</u></a>, a set of metrics defined by Google that quantify crucial aspects of user experience on the web. CWV focuses on three main areas: loading performance, interactivity, and visual stability. The specific metrics include Largest Contentful Paint (LCP), which measures loading performance; First Input Delay (FID), which gauges interactivity; and Cumulative Layout Shift (CLS), which assesses visual stability. By using the CWV measurement in RUM, developers can monitor and optimize their applications to ensure a smoother, faster, and more stable user experience and track the impact of any changes they release.</p><p>Over the last three months we have developed the capability to include valuable information in Server-Timing response headers. When a page that uses Cloudflare Web Analytics is loaded in a browser, the privacy-first client-side script from Web Analytics collects browser metrics and server-timing headers, then sends back this performance data. This data is ingested, aggregated, and made available for querying. The server-timing header includes Layer 4 information, such as Round-Trip Time (RTT) and protocol type (TCP or QUIC). Combined with Core Web Vitals data, this allows us to determine whether an optimization has positively impacted a request compared to a control sample. This capability enables us to release large-scale changes such as HTTP/3 Prioritization or BBR with a clear understanding of their impact across our global network.</p><p>An example of this header contains several key properties that provide valuable information about the network performance as observed by the server:</p>
            <pre><code>server-timing: cfL4;desc="?proto=TCP&amp;rtt=7337&amp;sent=8&amp;recv=8&amp;lost=0&amp;retrans=0&amp;sent_bytes=3419&amp;recv_bytes=832&amp;delivery_rate=548023&amp;cwnd=25&amp;unsent_bytes=0&amp;cid=94dae6b578f91145&amp;ts=225"</code></pre>
            <ul><li><p><b>proto</b>: Indicates the transport protocol used</p></li><li><p><b>rtt</b>: Round-Trip Time (RTT), representing the duration of the network round trip as measured by the layer 4 connection using a smoothing algorithm.</p></li><li><p><b>sent</b>: Number of packets sent.</p></li><li><p><b>recv</b>: Number of packets received.</p></li><li><p><b>lost</b>: Number of packets lost.</p></li><li><p><b>retrans</b>: Number of retransmitted packets.</p></li><li><p><b>sent_bytes</b>: Total number of bytes sent.</p></li><li><p><b>recv_bytes</b>: Total number of bytes received.</p></li><li><p><b>delivery_rate</b>: Rate of data delivery, an instantaneous measurement in bytes per second.</p></li><li><p><b>cwnd</b>: Congestion Window, an instantaneous measurement of packet or byte count depending on the protocol.</p></li><li><p><b>unsent_bytes</b>: Number of bytes not yet sent.</p></li><li><p><b>cid</b>: A 16-byte hexadecimal opaque connection ID.</p></li><li><p><b>ts</b>: Timestamp in milliseconds, representing when the data was captured.</p></li></ul><p>This real-time collection of performance data via RUM and Server-Timing headers allows Cloudflare to make data-driven decisions that directly enhance user experience. By continuously analyzing these detailed network and performance insights, we can ensure that future optimizations, such as HTTP/3 Prioritization or BBR deployment, are delivering tangible benefits for our customers.</p>
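            <p>You can inspect this header yourself with curl on a zone where the feature is active (the URL below is a placeholder); the <code>cfL4</code> entry appears alongside any other Server-Timing values:</p>
            <pre><code>curl -s -o /dev/null -D - https://example.com/ | grep -i server-timing</code></pre>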
    <div>
      <h3>Enabling HTTP/3 Prioritization for all plans</h3>
      <a href="#enabling-http-3-prioritization-for-all-plans">
        
      </a>
    </div>
    <p>As part of our focus on improving <a href="https://www.cloudflare.com/learning/performance/what-is-observability/">observability</a> through the integration of the server-timing header, we implemented several minor changes to optimize QUIC handshakes. Notably, we observed clear improvements in our telemetry thanks to the Layer 4 observability enhancements this header provides. These internal findings coincided with third-party measurements, which showed similar improvements in handshake performance.</p><p>In the fourth quarter of 2024, we will apply the same experimental methodology to the HTTP/3 Prioritization support announced during Speed Week 2023. <a href="https://blog.cloudflare.com/better-http-3-prioritization-for-a-faster-web/"><u>HTTP/3 Prioritization</u></a> is designed to enhance the efficiency and speed of loading web pages by intelligently managing the order in which web assets are delivered to users. This is crucial because modern web pages are composed of numerous elements — such as images, scripts, and stylesheets — that vary in importance. Proper prioritization ensures that critical elements, like primary content and layout, load first, delivering a faster and more seamless browsing experience.</p><p>We will use this testing framework to measure performance improvements before enabling the feature across all plan types. This process allows us not only to quantify the benefits but, most importantly, to ensure there are no performance regressions.</p>
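            <p>For a sense of what is actually signaled on the wire, Extensible Priorities (<a href="https://datatracker.ietf.org/doc/html/rfc9218"><u>RFC 9218</u></a>) carries the two parameters in a compact priority header. In this illustrative example, a browser marks a render-blocking stylesheet as maximally urgent and an image as lower urgency but incrementally processable:</p>
            <pre><code>GET /app.css
priority: u=0

GET /hero.jpg
priority: u=5, i</code></pre>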
    <div>
      <h3>Congestion control</h3>
      <a href="#congestion-control">
        
      </a>
    </div>
    <p>Following the completion of the HTTP/3 Prioritization experiments, we will begin testing different congestion control algorithms, specifically focusing on <a href="https://cloud.google.com/blog/products/networking/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster"><u>BBR</u></a> (Bottleneck Bandwidth and Round-trip propagation time) version 3. Congestion control is a crucial mechanism in network communication that aims to optimize data transfer rates while avoiding network congestion. When too much data is sent too quickly over a network, it can lead to congestion, causing packet loss, delays, and reduced overall performance. Think of a busy highway during rush hour. If too many cars (data packets) flood the highway at once, traffic jams occur, slowing everyone down.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1kR4Ekkg2eUPzO4Lj8CrjY/d002875b8f77f782d13bc0ef199ba931/image6.png" />
          </figure><p>Congestion control algorithms act like traffic managers, regulating the flow of data to prevent these “traffic jams,” ensuring that data moves smoothly and efficiently across the network. Each side of a connection runs an algorithm in real time, dynamically adjusting the flow of data based on the current and predicted network conditions.

BBR is an advanced congestion control algorithm initially developed by Google. BBR seeks to estimate the actual available bandwidth and the minimum round-trip time (RTT) to determine the optimal data flow. This approach allows BBR to maintain high throughput while minimizing latency, leading to more efficient and stable network performance.</p><p><a href="https://github.com/google/bbr/blob/v3/README.md"><u>BBRv3</u></a>, the latest iteration, builds on the strengths of its predecessors, BBRv1 and BBRv2, by further refining its bandwidth estimation techniques and enhancing its adaptability to varying network conditions. We found BBRv3 to be faster in several cases than our previous implementation of <a href="https://datatracker.ietf.org/doc/html/rfc8312"><u>CUBIC</u></a>. Most importantly, it reduced loss and retransmission rates in our <a href="https://blog.cloudflare.com/introducing-oxy/"><u>Oxy</u></a> proxy implementation.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/61VpPQXJWTHnrnlb7dz2um/f1aecca8fea1bbeced3074d46389c3db/image7.png" />
            
            </figure><p>With these promising results, we are excited to test various congestion control algorithms including BBRv3 for <a href="https://github.com/cloudflare/quiche"><u>quiche</u></a>, our QUIC implementation, across our HTTP/3 traffic. Combining the layer 4 server-timing information with experiments in this area will enable us to explicitly control and measure the impact on real-world metrics.</p>
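            <p>Our experiments target BBRv3 inside quiche at the QUIC layer, but you can get a feel for BBR on any stock Linux machine, where BBRv1 has shipped with the kernel's TCP stack since version 4.9:</p>
            <pre><code># List the congestion control algorithms this kernel can use
sysctl net.ipv4.tcp_available_congestion_control

# Switch TCP to BBR (mainline Linux ships BBRv1 and requires the
# tcp_bbr module; BBRv3 lives in Google's development branch)
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
sudo sysctl -w net.core.default_qdisc=fq</code></pre>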
    <div>
      <h2>The future</h2>
      <a href="#the-future">
        
      </a>
    </div>
    <p>The future of the Internet relies on continuous innovation to meet the growing demands for speed, security, and scalability. Technologies like Zstandard for compression, BBR for congestion control, HTTP/3 prioritization, and Encrypted Client Hello are setting new standards for performance and privacy. By implementing these protocols, web services can achieve faster page load times, more efficient bandwidth usage, and stronger protections for user data.</p><p>These advancements don't just offer incremental improvements; they provide a significant leap forward in optimizing the user experience and safeguarding online interactions. At Cloudflare, we are committed to making these technologies accessible to everyone, empowering businesses to deliver better, faster, and more secure services.</p><p>Stay tuned for more developments as we continue to push the boundaries of what's possible on the web, and if you’re passionate about building and implementing the latest Internet innovations, we’re <a href="https://www.cloudflare.com/careers/jobs/"><u>hiring</u></a>!</p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Privacy]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[TLS]]></category>
            <guid isPermaLink="false">6YLU1FUKD4lioSrbpOnb5r</guid>
            <dc:creator>Matt Bullock</dc:creator>
            <dc:creator>Maciej Lechowski</dc:creator>
            <dc:creator>Rushil Mehra</dc:creator>
        </item>
        <item>
            <title><![CDATA[All the way up to 11: Serve Brotli from origin and Introducing Compression Rules]]></title>
            <link>https://blog.cloudflare.com/this-is-brotli-from-origin/</link>
            <pubDate>Fri, 23 Jun 2023 13:01:00 GMT</pubDate>
            <description><![CDATA[ Today, we're enhancing our support for Brotli compression, enabling end-to-end Brotli compression for web content. Compression plays a vital role in reducing bytes during transfers, ensuring quicker downloads and seamless browsing ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5eudI9Wu9FybnliXMNdIib/eacd7d29901473846554ed5c678e7e8c/image8-2.png" />
            
            </figure><p>Throughout Speed Week, we have talked about the importance of <a href="https://www.cloudflare.com/learning/performance/speed-up-a-website/">optimizing performance</a>. Compression plays a crucial role by reducing file sizes transmitted over the Internet. Smaller file sizes lead to faster downloads, quicker website loading, and an improved user experience.</p><p>Take household cleaning products as a real-world example. It is <a href="https://ellenmacarthurfoundation.org/circular-examples/replenish">estimated</a> that “a typical bottle of cleaner is 90% water and less than 10% actual valuable ingredients”. Removing 90% of a typical 500ml bottle of household cleaner reduces the weight from 600g to 60g. This reduction means only a 60g parcel, with instructions to rehydrate on receipt, needs to be sent. Extrapolated across gallons of product, this weight reduction soon becomes a huge shipping saving for businesses, not to mention the environmental impact.</p><p>This is how compression works. The sender compresses the file to its smallest possible size, and then sends the smaller file with instructions on how to handle it when received. By reducing the size of the files sent, compression ensures the amount of bandwidth needed to send files over the Internet is a lot less. Where files are stored in <a href="/aws-egregious-egress/">expensive cloud providers like AWS</a>, reducing the size of files sent can directly equate to significant cost savings on bandwidth.</p><p>Smaller file sizes are also particularly beneficial for end users with limited Internet connections, such as mobile devices on cellular networks or users in areas with slow network speeds.</p><p>Cloudflare has always supported compression in the form of Gzip. Gzip is a widely used compression algorithm that has been around since 1992 and provides file compression for all Cloudflare users. However, in 2013 Google introduced Brotli, which supports higher compression levels and better performance overall. Switching from Gzip to Brotli results in smaller file sizes and faster load times for web pages. We have supported Brotli since 2017 for the connection between Cloudflare and client browsers. Today we are announcing end-to-end Brotli support for web content: support for Brotli compression, at the highest possible levels, from the origin server to the client.</p><p>If your origin server supports Brotli, turn it on, crank up the compression level, and enjoy the performance boost.</p>
    <div>
      <h3>Brotli compression to 11</h3>
      <a href="#brotli-compression-to-11">
        
      </a>
    </div>
    <p>Brotli has 12 levels of compression, ranging from 0 to 11, with 0 providing the fastest compression speed but the lowest compression ratio, and 11 offering the highest compression ratio but requiring more computational resources and time. During our initial implementation of Brotli five years ago, we identified that <a href="/results-experimenting-brotli/">compression level 4</a> offered the best balance between bytes saved and compression time, without compromising performance.</p><p>Since 2017, Cloudflare has used a maximum Brotli compression level of 4 for all compressible assets, based on the end user's "accept-encoding" header. However, one issue was that Cloudflare only requested Gzip compression from the origin, even if the origin supported Brotli. Furthermore, Cloudflare would always decompress the content received from the origin before compressing and sending it to the end user, resulting in additional processing time. As a result, customers were unable to fully leverage the benefits offered by Brotli compression.</p><p><b>Old world</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4oE6hwo4gOyCbDmcN9yd9P/18d7e83b965bbb83190039aa2013f480/Flow_how-CF-compresses_1.png" />
            
            </figure><p>With Cloudflare now fully supporting Brotli end to end, customers will start seeing our updated accept-encoding header arriving at their origins. Once available, customers can transfer, cache, and serve heavily compressed Brotli files directly to us, all the way up to the maximum level of 11. This will help reduce <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-latency/">latency</a> and bandwidth consumption. If the end user's device does not support Brotli compression, we will automatically decompress the file and serve it either in its decompressed form or as a Gzip-compressed file, depending on the Accept-Encoding header.</p><p><b>Full end-to-end Brotli compression support</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7pQQ1tDxSU00iu1BARptlf/a448f89051e2480f2b11fe51b02d669a/Flow_how-CF-compresses_2.png" />
            
            </figure><p><b>End user cannot support Brotli compression</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4YLEb3iqcTCqQu3GdUIwzo/3913ea436fb7b81ababe03f07c82cd6d/Flow_how-CF-compresses_3.png" />
            
            </figure><p>Customers can implement Brotli compression at their origin by referring to the appropriate online materials. For example, customers that are using NGINX can implement Brotli by following this <a href="https://github.com/google/ngx_brotli#installation">tutorial</a> and setting compression at level 11 within the <code>nginx.conf</code> configuration file as follows:</p>
            <pre><code>brotli on;
brotli_comp_level 11;
brotli_static on;
brotli_types text/plain text/css application/javascript application/x-javascript text/xml 
application/xml application/xml+rss text/javascript image/x-icon 
image/vnd.microsoft.icon image/bmp image/svg+xml;</code></pre>
            <p>Cloudflare will then serve these assets to the client at the exact same compression level (11) for files matching the configured <code>brotli_types</code>. This means any SVG or BMP images will be sent to the client compressed at Brotli level 11.</p>
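            <p>One way to confirm the end-to-end behavior is to request an asset while advertising Brotli support and inspect the <code>Content-Encoding</code> response header. Below is a minimal, illustrative sketch in Go; the URL is a placeholder for an asset on your own zone:</p>
            <pre><code>package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical asset URL; substitute one from your own zone.
	req, err := http.NewRequest("GET", "https://example.com/styles.css", nil)
	if err != nil {
		panic(err)
	}
	// Advertise Brotli support, as a browser would.
	req.Header.Set("Accept-Encoding", "br")

	// Disable Go's transparent gzip handling so the raw header is visible.
	client := &amp;http.Client{Transport: &amp;http.Transport{DisableCompression: true}}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Expect "br" if Brotli was negotiated end to end.
	fmt.Println("Content-Encoding:", resp.Header.Get("Content-Encoding"))
}</code></pre>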
    <div>
      <h3>Testing</h3>
      <a href="#testing">
        
      </a>
    </div>
    <p>We applied compression to a simple CSS file, measuring the impact of various compression algorithms and levels. Our goal was to identify potential improvements that users could experience by optimizing compression techniques. The results can be seen in the following table:</p>
<table>
<thead>
  <tr>
    <th><span>Test</span></th>
    <th><span>Size (bytes)</span></th>
    <th><span>% Reduction of original file (Higher % better)</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>Uncompressed response (no compression used)</span></td>
    <td><span>2,747</span></td>
    <td><span>-</span></td>
  </tr>
  <tr>
    <td><span>Cloudflare default Gzip compression (level 8)</span></td>
    <td><span>1,121</span></td>
    <td><span>59.21%</span></td>
  </tr>
  <tr>
    <td><span>Cloudflare default Brotli compression (level 4)</span></td>
    <td><span>1,110</span></td>
    <td><span>59.58%</span></td>
  </tr>
  <tr>
    <td><span>Compressed with max Gzip level (level 9)</span></td>
    <td><span>1,121</span></td>
    <td><span>59.21%</span></td>
  </tr>
  <tr>
    <td><span>Compressed with max Brotli level (level 11)</span></td>
    <td><span>909</span></td>
    <td><span>66.94%</span></td>
  </tr>
</tbody>
</table><p>By compressing with Brotli at level 11, users can reduce their file sizes by 19% compared to the best Gzip compression level. Additionally, the strongest Brotli compression level produces files around 18% smaller than the default level used by Cloudflare. This highlights the significant size reduction achieved by Brotli compression, particularly at its highest levels, which can lead to improved website performance, faster page load times, and an overall reduction in egress fees.</p><p>To take advantage of higher end-to-end compression rates, the following Cloudflare proxy features need to be disabled:</p><ul><li><p>Email Obfuscation</p></li><li><p>Rocket Loader</p></li><li><p>Server Side Excludes (SSE)</p></li><li><p>Mirage</p></li><li><p>HTML Minification (JavaScript and CSS minification can be left enabled)</p></li><li><p>Automatic HTTPS Rewrites</p></li></ul><p>This is because Cloudflare needs to decompress and access the body to apply the requested settings. Alternatively, a customer can disable these features for specific paths using Configuration Rules.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/JgawIZ58Bp2hB133zV09d/3c4d5d9cc6b7fc6a558393032d124c3c/pasted-image-0-3.png" />
            
            </figure><p>If any of these rewrite features are enabled, your origin can still send Brotli-compressed responses at higher levels. However, we will decompress, apply the enabled Cloudflare feature(s), and recompress on the fly using Cloudflare’s default Brotli level 4 or Gzip level 8, depending on the user's accept-encoding header.</p><p>For browsers that do not accept Brotli compression, we will continue to decompress and send responses either Gzip-compressed or uncompressed.</p>
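            <p>To get a feel for the gap between our on-the-fly default (level 4) and the maximum level an origin can serve (level 11), here is a small sketch using the community Go port of Brotli (github.com/andybalholm/brotli); the repeated CSS rule is just a stand-in for a real asset:</p>
            <pre><code>package main

import (
	"bytes"
	"fmt"
	"strings"

	"github.com/andybalholm/brotli"
)

// compressedSize returns the Brotli-compressed size of data at a given level.
func compressedSize(data []byte, level int) int {
	var buf bytes.Buffer
	w := brotli.NewWriterLevel(&amp;buf, level)
	w.Write(data)
	w.Close()
	return buf.Len()
}

func main() {
	// Stand-in for a compressible asset such as a CSS file.
	data := []byte(strings.Repeat(".header { color: #fff; margin: 0 auto; }\n", 100))
	fmt.Printf("original: %d bytes\n", len(data))
	fmt.Printf("level 4:  %d bytes\n", compressedSize(data, 4))
	fmt.Printf("level 11: %d bytes\n", compressedSize(data, 11))
}</code></pre>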
    <div>
      <h3>Implementation</h3>
      <a href="#implementation">
        
      </a>
    </div>
    <p>The initial step towards implementing Brotli from the origin involved constructing a decompression module that could be integrated into Cloudflare's software stack. It allows us to efficiently convert the compressed bits received from the origin into the original, uncompressed file. This step was crucial, as numerous <a href="/rust-nginx-module/">features</a>, such as Email Obfuscation and Cloudflare Workers, rely on accessing the body of a response to apply customizations.</p><p>We integrated the decompressor into the core <a href="https://www.cloudflare.com/learning/cdn/glossary/reverse-proxy/">reverse web proxy</a> of Cloudflare. This integration ensured that all Cloudflare products and features could access Brotli decompression effortlessly. It also allowed our Cloudflare Workers team to incorporate Brotli directly into Cloudflare Workers, so that Workers customers can interact with responses returned in Brotli or pass them through to the end user unmodified.</p>
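    <p>Conceptually, such a module is a streaming transform from compressed bits back to the original body. As an illustration only (this is not Cloudflare's actual module), a minimal sketch using the community Go port github.com/andybalholm/brotli:</p>
            <pre><code>package main

import (
	"io"
	"os"

	"github.com/andybalholm/brotli"
)

func main() {
	// Read a Brotli-compressed body on stdin and stream the
	// decompressed bytes to stdout, without buffering the whole file.
	r := brotli.NewReader(os.Stdin)
	if _, err := io.Copy(os.Stdout, r); err != nil {
		panic(err)
	}
}</code></pre>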
    <div>
      <h3>Introducing Compression Rules - granular control of compression for end users</h3>
      <a href="#introducing-compression-rules-granular-control-of-compression-to-end-users">
        
      </a>
    </div>
    <p>By default, Cloudflare compresses <a href="https://developers.cloudflare.com/support/speed/optimization-file-size/what-will-cloudflare-compress/">certain content types</a> based on the Content-Type header of the file. Today we are also announcing the introduction of Compression Rules for our Enterprise customers. With Compression Rules, you gain enhanced control over Cloudflare's compression capabilities, enabling you to customize how and which content Cloudflare compresses to <a href="https://www.cloudflare.com/learning/performance/why-site-speed-matters/">optimize your website's performance</a>.</p><p>For example, by using Cloudflare's Compression Rules for .ktx files, customers can optimize the delivery of textures in WebGL applications, enhancing the overall user experience. Enabling compression minimizes bandwidth usage and ensures that WebGL applications load quickly and smoothly, even when dealing with large and detailed textures.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/KkfvWdXDQauzqaRSxLInu/e97a02af6927b7c2a405184b7204d7dc/pasted-image-0--1--2.png" />
            
            </figure><p>Alternatively, customers can disable compression or specify a preference for how we compress. For example, an infrastructure company might want to support only Gzip for its IoT devices, while allowing Brotli compression for all other hostnames.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4SHkVVBlGpGDOstaY6r0wQ/c9aabf11c022b67402b8fdcaa4304636/pasted-image-0--2--1.png" />
            
            </figure><p>Compression Rules use the same filters that our other <a href="https://developers.cloudflare.com/rules/">rules</a> products are built on, with the added fields of Media Type and File Extension, allowing you to easily specify the content you wish to compress.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7K6o1RMDHunR0xDyl61sdI/1f51fa7fcc17d70da034831b5461988d/pasted-image-0--3--1.png" />
            
            </figure>
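            <p>As an illustration, a rule like the KTX example above boils down to a filter expression plus a compression preference. A hypothetical expression (the exact field names may differ in the shipped product) could look like <code>http.request.uri.path.extension eq "ktx"</code>, with the rule's action set to prefer Brotli.</p>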
    <div>
      <h3>Deprecating the Brotli toggle</h3>
      <a href="#deprecating-the-brotli-toggle">
        
      </a>
    </div>
    <p>Some <a href="https://caniuse.com/brotli">web browsers</a> have supported Brotli since 2016, and Cloudflare added Brotli support in 2017. As with any new web technology, Brotli was initially an unknown quantity, so we gave customers the ability to selectively enable or disable Brotli via the API and our UI.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3e78OtHsTlLbAeCYGpgfjf/f4ade0204cbcf04021a54d5f63583afc/pasted-image-0--4-.png" />
            
            </figure><p>Now that Brotli has matured and is supported by all major browsers, we plan to enable Brotli on all zones by default in the coming months, mirroring the Gzip behavior we currently support, and to remove the toggle from our dashboard. If a browser does not support Brotli, Cloudflare will continue to honor its accepted encoding types, such as Gzip or uncompressed, and Enterprise customers will still be able to use Compression Rules to granularly control how we compress data towards their users.</p>
    <div>
      <h3>The future of web compression</h3>
      <a href="#the-future-of-web-compression">
        
      </a>
    </div>
    <p>We've seen great adoption and great performance for Brotli as the new compression technique for the web. Looking forward, we are closely following trends and new algorithms such as <a href="https://www.rfc-editor.org/rfc/rfc8478">zstd</a> as a possible next-generation compression algorithm.</p><p>At the same time, we're looking to improve Brotli directly where we can. One development that we're particularly focused on is shared dictionaries with Brotli. Whenever you compress an asset, you use a "dictionary" that helps the compression be more efficient. A simple analogy is typing OMW into an iPhone message: the iPhone will automatically translate it into On My Way using its own internal dictionary.</p>
<table>
<thead>
  <tr>
    <th><span>O</span></th>
    <th><span>M</span></th>
    <th><span>W</span></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>O</span></td>
    <td><span>n</span></td>
    <td></td>
    <td><span>M</span></td>
    <td><span>y</span></td>
    <td></td>
    <td><span>W</span></td>
    <td><span>a</span></td>
    <td><span>y</span></td>
  </tr>
</tbody>
</table><p>The internal dictionary has taken three characters and expanded them into nine characters (including spaces), saving the user six characters of typing. It is this kind of saving that translates into performance benefits for users.</p><p>By default, the <a href="https://www.rfc-editor.org/rfc/rfc7932#page-28">Brotli RFC</a> defines a static dictionary that both clients and origin servers use. The static dictionary was designed to be general purpose and apply to everyone, with its size kept small enough to be practical while still generating good compression results. However, what if an origin could generate a bespoke dictionary tailored to a specific website? For example, a Cloudflare-specific dictionary would allow us to compress the words and phrases that appear repeatedly on our site, such as the word “Cloudflare”. The bespoke dictionary would be designed to compress these as heavily as possible, and a browser using the same dictionary would be able to translate them back.</p><p>A <a href="https://github.com/wicG/compression-dictionary-transport">new proposal</a> by the Web Incubator CG aims to do just that, allowing you to specify your own dictionaries that browsers can use, so that websites can optimize compression further. We're excited about contributing to this proposal and plan on publishing our research soon.</p>
    <div>
      <h3>Try it now</h3>
      <a href="#try-it-now">
        
      </a>
    </div>
    <p>Compression Rules are available now, with end-to-end Brotli being rolled out over the coming weeks, allowing you to improve performance, reduce bandwidth, and granularly control how Cloudflare handles compression for your end users.</p>
 ]]></content:encoded>
            <category><![CDATA[Speed Week]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Connectivity Cloud]]></category>
            <guid isPermaLink="false">4AAiHlaL3F2QHTZdgtQ95r</guid>
            <dc:creator>Matt Bullock</dc:creator>
        </item>
        <item>
            <title><![CDATA[My internship: Brotli compression using a reduced dictionary]]></title>
            <link>https://blog.cloudflare.com/brotli-compression-using-a-reduced-dictionary/</link>
            <pubDate>Wed, 11 Nov 2020 16:32:39 GMT</pubDate>
            <description><![CDATA[ Using the brotli dictionary to improve compression of web content without sacrificing performance. ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://github.com/google/brotli">Brotli</a> is a state-of-the-art lossless compression format, supported by all major browsers. It is capable of achieving considerably better compression ratios than the ubiquitous gzip, and is rapidly gaining in popularity. Cloudflare uses the Google brotli library to dynamically compress web content whenever possible. In 2015, we took <a href="/results-experimenting-brotli/">an in-depth look at how brotli works</a> and its compression advantages.</p><p>One of the more interesting features of the <a href="https://tools.ietf.org/html/rfc7932">brotli file format</a>, in the context of textual web content compression, is the inclusion of a built-in static dictionary. The dictionary is quite large, and in addition to containing various strings in multiple languages, it also supports the option to apply multiple transformations to those words, increasing its versatility.</p><p>The <a href="https://github.com/google/brotli">open-source brotli library</a>, which implements an encoder and decoder for brotli, has 12 predefined quality levels for the encoder (0 through 11), with higher quality levels demanding more CPU in exchange for a better compression ratio. The static dictionary feature is used to a limited extent starting with level 5, and to the full extent only at levels 10 and 11, due to the high CPU cost of this feature.</p><p>We improved on the limited dictionary use approach, adding optimizations that improve compression at levels 5 through 9 with negligible performance impact when compressing web content.</p>
    <div>
      <h3>Brotli Static Dictionary</h3>
      <a href="#brotli-static-dictionary">
        
      </a>
    </div>
    <p>Brotli primarily uses the LZ77 algorithm to compress its data. Our previous blog post about <a href="/results-experimenting-brotli/">brotli compression</a> provides an introduction.</p><p>To improve compression of text files and web content, brotli also includes a static, predefined dictionary. If a byte sequence cannot be matched with an earlier sequence using LZ77, the encoder will try to match the sequence with a reference to the static dictionary, possibly using one of the multiple transforms. For example, every HTML file contains the opening &lt;html&gt; tag, which cannot be compressed with LZ77 as it is unique within the file, but it is contained in the brotli static dictionary and will be replaced by a reference to it. The reference generally takes less space than the sequence itself, which decreases the compressed file size.</p><p>The dictionary contains 13,504 words in six languages, with lengths from 4 to 24 characters. To improve the compression of real-world text and web data, some dictionary words are common phrases ("The current") or strings common in web content (‘type=”text/javascript”’). Unlike usual LZ77 compression, a word from the dictionary can only be matched as a whole. Starting a match in the middle of a dictionary word, ending it before the end of a word, or even extending into the next word is not supported by the brotli format.</p><p>Instead, the dictionary supports 120 transforms of dictionary words to enable a larger number of matches and find longer matches. The transforms include adding suffixes (“work” becomes “working”), adding prefixes (“book” =&gt; “ the book”), making the first character uppercase ("process" =&gt; "Process"), or converting the whole word to uppercase (“html” =&gt; “HTML”). In addition to transforms that make words longer or capitalize them, the cut transform allows a shortened match (“consistently” =&gt; “consistent”), which makes it possible to find even more matches.</p>
    <div>
      <h3>Methods</h3>
      <a href="#methods">
        
      </a>
    </div>
    <p>With the transforms included, the static dictionary contains 1,633,984 different words – too many for an exhaustive search, except at the slow brotli compression levels 10 and 11. When used at a lower compression level, brotli either disables the dictionary or only searches through a subset of roughly 5,500 words to find matches in an acceptable time frame. It also only considers matches at positions where no LZ77 match can be found, and only uses the cut transform.</p><p>Our approach to the brotli dictionary uses a larger, but more specialized, subset of the dictionary than the default, using more aggressive heuristics to improve the compression ratio at negligible cost to performance. In order to provide a more specialized dictionary, we give the compressor a content type hint from our servers, relying on the Content-Type header to tell the compressor whether it should use a dictionary for HTML, JavaScript, or CSS. The dictionaries could be further refined by colocation language in the future.</p>
    <div>
      <h3>Fast dictionary lookup</h3>
      <a href="#fast-dictionary-lookup">
        
      </a>
    </div>
    <p>To improve compression without sacrificing performance, we needed a fast way to find matches when searching the dictionary more thoroughly than brotli does by default. Our approach uses three data structures to find a matching word directly: the radix trie is responsible for finding the word, while the hash table and bloom filter are used to speed up the radix trie and quickly eliminate the many positions that can’t be matched using the dictionary.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3JdGRN9pPZ0GPTRvecnh0P/a3a4bb1289e25385b098c719ad865a1e/image1.png" />
            
            </figure><p>Lookup for a position starting with “type”</p><p>The <b>radix trie</b> easily finds the longest matching word without having to try matching several words. To find the match, we traverse the graph based on the text at the current position and remember the last node with a matching word. The radix trie supports compressed nodes (having more than one character as an edge label), which greatly reduces the number of nodes that need to be traversed for typical dictionary words.</p><p>The radix trie is slowed down by the large number of positions where no match can be found. An important finding is that most mismatching strings have a mismatching character in the first four bytes. Even for positions where a match exists, a lot of time is spent traversing nodes for the first four bytes, since the nodes close to the tree root usually have many children.</p><p>Luckily, we can use a <b>hash table</b> to look up the node equivalent to the first four bytes, matching the node if it exists or rejecting the possibility of a match. We thus look up the first four bytes of the string; if there is a matching node, we traverse the trie from there, which is fast because each four-byte prefix usually has only a few corresponding dictionary words. If there is no matching node, there will not be a matching word at this position and we do not need to consider it further.</p><p>While the hash table is designed to reject mismatches quickly and avoid cache misses and high search costs in the trie, it still suffers from similar problems: we might search through several 4-byte prefixes with the hash value of the given position, only to learn that no match can be found. Additionally, hash lookups can be expensive due to cache misses.</p><p>To quickly reject words that do not match the dictionary, but might still cause cache misses, we use a <b>k=1 bloom filter</b> to rule out most non-matching positions. In the k=1 case, the filter is simply a lookup table with one bit indicating whether any matching 4-byte prefixes exist for a given hash value. If the bit for the given hash value is 0, there won’t be a match. Since the bloom filter uses at most one bit for each four-byte prefix while the hash table requires 16 bytes, cache misses are much less likely. (The actual size of the structures is a bit different, since there are many empty spaces in both structures and the bloom filter has twice as many elements to reject more non-matching positions.)</p><p>This is very useful for performance, as a bloom filter lookup requires a single memory access. The bloom filter is designed to be fast and simple, but it still rejects more than half of all non-matching positions and thus allows us to save a full hash lookup, which would often mean a cache miss.</p>
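    <p>To make the interplay of the three structures concrete, here is a heavily simplified sketch in Go. Everything here is a hypothetical stand-in, not the actual implementation: trie edges carry one byte each (unlike the compressed nodes described above), and the packed four-byte prefix doubles as the hash value:</p>
            <pre><code>package main

import "fmt"

type trieNode struct {
	children map[byte]*trieNode
	word     string // non-empty if a dictionary word ends at this node
}

type dictIndex struct {
	bloom    []bool               // k=1 bloom filter: one bit per hash bucket
	byPrefix map[uint32]*trieNode // 4-byte prefix -&gt; trie node to resume from
}

// hash4 packs the first four bytes into a uint32 (it doubles as the hash here).
func hash4(b []byte) uint32 {
	return uint32(b[0]) | uint32(b[1])&lt;&lt;8 | uint32(b[2])&lt;&lt;16 | uint32(b[3])&lt;&lt;24
}

func (d *dictIndex) longestMatch(text []byte) string {
	if len(text) &lt; 4 {
		return ""
	}
	h := hash4(text[:4])
	// Stage 1: a single memory access rejects most non-matching positions.
	if !d.bloom[h%uint32(len(d.bloom))] {
		return ""
	}
	// Stage 2: jump straight to the node equivalent to the first four bytes.
	node, ok := d.byPrefix[h]
	if !ok {
		return ""
	}
	// Stage 3: walk the trie, remembering the last node that ends a word.
	best := node.word
	for i := 4; i &lt; len(text); i++ {
		node = node.children[text[i]]
		if node == nil {
			break
		}
		if node.word != "" {
			best = node.word
		}
	}
	return best
}

func main() {
	// Tiny index containing just the dictionary word "type".
	leaf := &amp;trieNode{word: "type"}
	idx := &amp;dictIndex{
		bloom:    make([]bool, 1&lt;&lt;16),
		byPrefix: map[uint32]*trieNode{hash4([]byte("type")): leaf},
	}
	idx.bloom[hash4([]byte("type"))%uint32(len(idx.bloom))] = true
	fmt.Println(idx.longestMatch([]byte(`type="text/javascript"`))) // prints "type"
}</code></pre>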
    <div>
      <h2>Heuristics</h2>
      <a href="#heuristics">
        
      </a>
    </div>
    <p>To improve the compression ratio without sacrificing performance, we employed a number of heuristics:</p><p><b>Only search the dictionary at some positions.</b> This is also done using the stock dictionary, but we search more aggressively. While the stock dictionary only considers positions where the LZ77 match finder did not find a match, we also consider positions that have a bad match according to the brotli cost model: LZ77 matches that are short or have a long distance between the current position and the reference usually offer only a small compression improvement, so it is worth trying to find a better match in the static dictionary.</p><p><b>Only consider the longest match and then transform it.</b> Instead of finding and transforming all matches at a position, the radix trie only gives us the longest match, which we then transform. This approach yields a vast performance improvement, and in most cases it still finds the best match.</p><p><b>Only include some transforms.</b> While all transformations can improve the compression ratio, we only included those that work well with our data structures. The suffix transforms can easily be applied after finding a non-transformed match. For the uppercase transforms, we include both the non-transformed and the uppercase version of a word in the radix trie. The prefix and cut transforms do not play well with the radix trie, so cuts of more than one byte and prefix transforms are not supported.</p>
    <div>
      <h2>Generating the reduced dictionary</h2>
      <a href="#generating-the-reduced-dictionary">
        
      </a>
    </div>
    <p>At low compression levels, brotli searches a subset of ~5,500 of the dictionary's 13,504 words, negatively impacting compression. To store the entire dictionary, we would need to store ~31,700 words in the trie (counting the uppercase-transformed output of ASCII sequences) and ~11,000 four-byte prefixes in the hash table. This would slow down the hash table and radix trie, so we needed to find a different subset of the dictionary that works well for web content.</p><p>For this purpose, we used a large data set containing representative content. We made sure to use web content from several world regions to reflect language diversity and optimize compression. Based on this data set, we identified which words are most common and result in the largest compression improvement according to the brotli cost model, and we only include the most useful words based on this calculation. Additionally, we remove some words if they slow down hash table lookups of other, more common words based on their hash value.</p><p>We generated separate dictionaries for HTML, CSS, and JavaScript content and use the MIME type to identify the right dictionary. The dictionaries we currently use include about 15-35% of the entire dictionary, including uppercase transforms. Depending on the type of data and the desired compression/speed tradeoff, different dictionary sizes can be useful. We have also developed code that automatically gathers statistics about matches and generates a reduced dictionary from them, which makes it easy to extend this approach to other textual formats, such as majority non-English text or XML, and achieve better results for those types of data.</p>
    <div>
      <h2>Results</h2>
      <a href="#results">
        
      </a>
    </div>
    <p>We tested the reduced dictionary on a large data set of HTML, CSS, and JavaScript files.</p><p>The improvement is especially big for small files, as LZ77 compression is less effective on them. Since the improvement on large files is a lot smaller, we only tested files up to 256KB. We used compression level 5, the same level we currently use for dynamic compression on our edge, and tested on an Intel Core i7-7820HQ CPU.</p><p>Compression improvement is defined as 1 - (compressed size using the reduced dictionary / compressed size without the dictionary). This ratio is then averaged for each input size range. We also provide an average value weighted by file size. Our data set mirrors typical web traffic, covering a wide range of file sizes, with small files being more common, which explains the large difference between the weighted and unweighted averages.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6fjLER7c308iIMA3kQx9Qs/cc51eec61ad9c4b9b3637e7cfec397cb/image-2.png" />
            
            </figure><p>With the improved dictionary approach, we are now able to compress HTML, JavaScript, and CSS files as well as, and sometimes even better than, a higher compression level would allow, all while using only 1% to 3% more CPU. For reference, using compression level 6 instead of 5 would increase CPU usage by up to 12%.</p> ]]></content:encoded>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Careers]]></category>
            <guid isPermaLink="false">34MUGp0LHm4s1IhKECGkCR</guid>
            <dc:creator>Felix Hanau</dc:creator>
        </item>
        <item>
            <title><![CDATA[A Solution to Compression Oracles on the Web]]></title>
            <link>https://blog.cloudflare.com/a-solution-to-compression-oracles-on-the-web/</link>
            <pubDate>Tue, 27 Mar 2018 12:00:00 GMT</pubDate>
            <description><![CDATA[ Compression is often considered an essential tool when reducing the bandwidth usage of internet services. The impact that the use of such compression schemes can have on security, however, has often been overlooked.  ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://commons.wikimedia.org/wiki/File:Ressort_de_compression.jpg">CC 3.0 by Jean-Jacques MILAN</a></p><p><i>This is a guest post by Blake Loring, a PhD student at Royal Holloway, University of London. Blake worked at Cloudflare as an intern in the summer of 2017.</i></p><p>Compression is often considered an essential tool when reducing the bandwidth usage of Internet services. The impact that the use of such compression schemes can have on security, however, has often been overlooked. The recently detailed <a href="https://en.wikipedia.org/wiki/CRIME">CRIME</a>, <a href="http://breachattack.com/">BREACH</a>, <a href="https://www.blackhat.com/eu-13/briefings.html#Beery">TIME</a> and <a href="https://www.blackhat.com/docs/us-16/materials/us-16-VanGoethem-HEIST-HTTP-Encrypted-Information-Can-Be-Stolen-Through-TCP-Windows-wp.pdf">HEIST</a> attacks on TLS have shown that if an attacker can make requests on behalf of a user, then secret information can be extracted from encrypted messages using only the length of the response. Deciding whether an element of a web page should be secret often depends on the content of the page; however, some common elements of web pages should always remain secret, such as <a href="https://en.wikipedia.org/wiki/Cross-site_request_forgery">Cross-Site Request Forgery (CSRF)</a> tokens. Such tokens are used to ensure that malicious webpages cannot forge requests from a user, by enforcing that any request must contain a secret token included in a previous response.</p><p>I worked at Cloudflare last summer to investigate possible solutions to this problem. The result is a project called <a href="https://github.com/cloudflare/cf-nocompress">cf-nocompress</a>. The aim of this project was to develop a tool which automatically and transparently mitigates instances of the attack, in particular CSRF extraction, on Cloudflare-hosted services without significantly impacting the effectiveness of compression. We have published a <a href="https://github.com/cloudflare/cf-nocompress/tree/master/cf-nocompress">proof-of-concept implementation</a> on GitHub, and provide a <a href="https://compression.website">challenge site</a> and a <a href="https://github.com/cloudflare/cf-nocompress/tree/master/example_attack/src/main">tool</a> which demonstrates the attack in action.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/OWY4iAZGfQjKwmBgRkway/2bc96e67c7df62a5316ec450b5f56142/compression.website.jpg" />
            
            </figure>
    <div>
      <h3>The Problem</h3>
      <a href="#the-problem">
        
      </a>
    </div>
    <p>Most web compression schemes reduce the size of data by replacing common sequences with references to a dictionary of terms created during compression. When using such schemes, the size of the encrypted response will be reduced if there are repeated strings within the plaintext. This can be exploited through the use of a canary, an element in a request which we know will be added to the response, to test whether a string exists within the original response using only the compressed response length. From this we can extract the contents of portions of a webpage incrementally by guessing each subsequent character. This creates an opportunity for malicious JavaScript served to a browser to extract CSRF tokens and other confidential information from a webpage, measuring response sizes using either a packet sniffer (a methodology created by Duong and Rizzo as part of the <a href="https://blog.cryptographyengineering.com/2011/09/21/brief-diversion-beast-attack-on-tlsssl/">BEAST attack</a>) or JavaScript APIs which reveal network statistics (described by Vanhoef and Van Goethem in <a href="https://www.blackhat.com/docs/us-16/materials/us-16-VanGoethem-HEIST-HTTP-Encrypted-Information-Can-Be-Stolen-Through-TCP-Windows.pdf">HEIST</a>).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/23iahybYJxsleUyYAbpWoQ/81440ac0fe67f75711e7d1c5608d41ef/Artboard---4.png" />
            
            </figure><p>There are two common mitigation schemes for this attack. The first is to send a unique CSRF token every time a page is loaded; by removing the consistent element from the page, the threat is removed. This approach requires the server to keep state of valid CSRF tokens and whether they have been used; additionally, it can only be used to protect page tokens and not user-readable data. Another approach is to XOR all secrets in a response with a per-request random number and then transmit the number with the response. Once received, a piece of JavaScript can be used to recover the original secret by XORing the data again. Alternatively, the server can be modified to expect the XORed variant and the random number rather than the original secret. This approach allows all secrets to be protected, but it requires client-side post-processing. Both approaches also require extensive, per-page modification, which makes mitigation incredibly cumbersome in practice. At present, the only way to fully mitigate such an attack is to disable compression entirely on vulnerable websites, an impractical solution for most websites and content delivery networks.</p>
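    <p>To see how length alone can leak a secret, consider the following toy Go program: it gzips a page body together with an attacker-controlled guess and prints the compressed sizes. A correct guess extends an existing match and tends to compress smaller (at this tiny scale the difference can be subtle, and real attacks average over many measurements). The page body and token here are made up:</p>
            <pre><code>package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// compressedLen returns the gzipped size of a page body.
func compressedLen(page string) int {
	var buf bytes.Buffer
	w := gzip.NewWriter(&amp;buf)
	w.Write([]byte(page))
	w.Close()
	return buf.Len()
}

func main() {
	// Hypothetical response body containing a secret CSRF token.
	page := `&lt;input type="hidden" name="csrf" value="s3cr3t"&gt;`
	// The attacker reflects a canary plus a guessed first character into
	// the same response and observes the compressed length.
	for _, guess := range []string{"a", "b", "s"} {
		n := compressedLen(page + ` name="csrf" value="` + guess)
		fmt.Printf("guess %q -&gt; %d bytes compressed\n", guess, n)
	}
}</code></pre>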
    <div>
      <h3>Our Solution</h3>
      <a href="#our-solution">
        
      </a>
    </div>
    <p>We decided to use selective compression, compressing only the non-secret parts of a page, in order to stop the extraction of secret information. We found that in most cases a secret within a webpage can be described in terms of a classical regular expression. These descriptions allow us to identify secrets online as a response is streamed. Once the secrets are identified, they can be flagged so that a modified compression library can ensure they are not added to the dictionary. The primary advantage of this approach is that protection can be offered transparently by the web server, and the application does not need to be modified as long as a regular expression can clearly express which portions of a response are secret. In addition, we do not need to maintain state for each user or require client-side JavaScript to render the page correctly.</p><p>The proof-of-concept is implemented as a plugin for NGINX and requires a small patch to the gzip module. The plugin uses <a href="https://github.com/openresty/sregex">sregex</a> to identify secrets within a page. The modified gzip module functions as normal; however, when a secret is processed, compression is disabled. This ensures secrets do not get added to the compression dictionary, removing any effect they would otherwise have on the response size.</p>
    <div>
      <h3>Additional security considerations</h3>
      <a href="#additional-security-considerations">
        
      </a>
    </div>
    <p>The regular expression matching engine we use in this proof-of-concept is not guaranteed to run in constant time. As such, matching a string against some regular expressions could introduce a timing-based side-channel attack. This issue is compounded by the complexity of modern regular expressions, as matching time can often be non-intuitive. Whilst in many cases the risk such an attack would pose is minimal, a limited matcher with constant runtime and restrictions on unbounded loops should be developed if our mitigation is adopted.</p>
    <div>
      <h3>The Challenge Site</h3>
      <a href="#the-challenge-site">
        
      </a>
    </div>
    <p>We have set up the challenge website <a href="https://compression.website/">compression.website</a> with protection, and a clone of the site at <a href="https://compression.website/unsafe/">compression.website/unsafe</a> without it. The page is a simple form with a per-client CSRF token, designed to emulate common CSRF protection. Using the example attack presented with the library, we have shown that we are able to extract the CSRF token from the size of responses on the unprotected variant, but we have not been able to extract it on the protected site. We welcome attempts to extract the CSRF token without access to the unencrypted response.</p>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[HTTPS]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">6s2fKoLeMHkDIfUFiJ7gxJ</guid>
            <dc:creator>Guest Author</dc:creator>
        </item>
        <item>
            <title><![CDATA[Everyone can now run JavaScript on Cloudflare with Workers]]></title>
            <link>https://blog.cloudflare.com/cloudflare-workers-unleashed/</link>
            <pubDate>Tue, 13 Mar 2018 13:00:00 GMT</pubDate>
            <description><![CDATA[ We believe the true dream of cloud computing is that your code lives in the network itself. Your code doesn't run in "us-west-4", it runs everywhere. ]]></description>
            <content:encoded><![CDATA[ <p><i>This post is also available in </i><a href="/ja-jp/cloudflare-workers-unleashed-ja-jp/"><i>日本語</i></a><i>.</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Q5P6uXQ3lvPtobTplWq5U/7ef904ddcd1f9c2284caf65fd3b4511d/workers-social.png" />
            
            </figure><p>Exactly one year ago today, Cloudflare gave me a mission: Make it so people can run code on Cloudflare's edge. At the time, we didn't yet know what that would mean. Would it be container-based? A new Turing-incomplete domain-specific language? Lua? "Functions"? There were lots of ideas.</p><p>Eventually, we settled on what now seems the obvious choice: JavaScript, using the standard Service Workers API, running in a new environment built on V8. Five months ago, we <a href="/introducing-cloudflare-workers/">gave you a preview</a> of what we were building, and started the beta.</p><p>Today, with thousands of scripts deployed and many billions of requests served, <a href="https://www.cloudflare.com/developer-platform/workers/">Cloudflare Workers</a> is now ready for everyone.</p><p>"Moving away from VCL and adopting Cloudflare Workers will allow us to do some creative routing that will let us deliver JavaScript to npm's millions of users even faster than we do now. We will be building our next generation of services on Cloudflare's platform and we get to do it in JavaScript!"</p><p>— CJ Silverio, CTO, npm, Inc.</p>
    <div>
      <h3>What is the Cloud, really?</h3>
      <a href="#what-is-the-cloud-really">
        
      </a>
    </div>
    <p>Historically, web application code has been split between servers and browsers. Between them lies a vast but fundamentally <i>dumb</i> network which merely ferries data from point to point.</p><p>We don't believe this lives up to the promise of "The Cloud."</p><p>We believe the true dream of cloud computing is that your code lives in the network itself. Your code doesn't run in "us-west-4" or "South Central Asia (Mumbai)", it runs <i>everywhere</i>.</p><p>More concretely, it should run where it is most needed. When responding to a user in New Zealand, your code should run in New Zealand. When crunching data in your database, your code should run on the machines that store the data. When interacting with a third-party API, your code should run wherever that API is hosted. When human explorers reach Mars, they aren't going to be happy waiting half an hour for your app to respond -- your code needs to be running on Mars.</p><p>Cloudflare Workers are our first step towards this vision. When you deploy a Worker, it is deployed to Cloudflare's entire edge network of over a hundred locations worldwide in under 30 seconds. Each request for your domain will be handled by your Worker at a Cloudflare location close to the end user, with no need for you to think about individual locations. The more locations we bring online, the more your code just "runs everywhere."</p><p>Well, OK… it won't run on Mars. Yet. You out there, Elon?</p>
    <div>
      <h3>What's a Worker?</h3>
      <a href="#whats-a-worker">
        
      </a>
    </div>
    <p>Cloudflare Workers derive their name from Web Workers, and more specifically Service Workers, the W3C standard API for scripts that run in the background in a web browser and intercept HTTP requests. Cloudflare Workers are written against the same standard API, but run on Cloudflare's servers, not in a browser.</p><p>Here are the tools you get to work with:</p><ul><li><p>Execute any JavaScript code, using the latest standard language features.</p></li><li><p>Intercept and modify HTTP request and response URLs, status, headers, and body content.</p></li><li><p>Respond to requests directly from your Worker, or forward them elsewhere.</p></li><li><p>Send HTTP requests to third-party servers.</p></li><li><p>Send multiple requests, in serial or parallel, and use the responses to compose a final response to the original request.</p></li><li><p>Send asynchronous requests after the response has already been returned to the client (for example, for logging or analytics).</p></li><li><p>Control other Cloudflare features, such as caching behavior.</p></li></ul><p>The possible uses for Workers are infinite, and we're excited to see what our customers come up with. Here are some ideas we've seen in the beta:</p><ul><li><p>Route different types of requests to different origin servers.</p></li><li><p>Expand HTML templates on the edge, to <a href="https://www.cloudflare.com/learning/cdn/how-cdns-reduce-bandwidth-cost/">reduce bandwidth costs</a> at your origin.</p></li><li><p>Apply <a href="https://www.cloudflare.com/learning/access-management/what-is-access-control/">access control</a> to cached content.</p></li><li><p>Redirect a fraction of users to a staging server.</p></li><li><p>Perform A/B testing between two entirely different back-ends.</p></li><li><p>Build "<a href="https://www.cloudflare.com/learning/serverless/what-is-serverless/">serverless</a>" applications that rely entirely on web APIs.</p></li><li><p>Create custom security filters to block unwanted traffic unique to your app.</p></li><li><p>Rewrite requests to <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cache-hit-ratio/">improve cache hit rate</a>.</p></li><li><p>Implement custom <a href="https://www.cloudflare.com/learning/performance/what-is-load-balancing/">load balancing</a> and failover logic.</p></li><li><p>Apply quick fixes to your application without having to update your production servers.</p></li><li><p>Collect analytics without running code in the user's browser.</p></li><li><p>Much more.</p></li></ul><p>Here's an example.</p>
            <pre><code>// A Worker which:
// 1. Redirects visitors to the home page ("/") to a
//    country-specific page (e.g. "/US/").
// 2. Blocks hotlinks.
// 3. Serves images directly from Google Cloud Storage.
addEventListener('fetch', event =&gt; {
  event.respondWith(handle(event.request))
})

async function handle(request) {
  let url = new URL(request.url)
  if (url.pathname == "/") {
    // This is a request for the home page ("/").
    // Redirect to country-specific path.
    // E.g. users in the US will be sent to "/US/".
    let country = request.headers.get("CF-IpCountry")
    url.pathname = "/" + country + "/"
    return Response.redirect(url, 302)

  } else if (url.pathname.startsWith("/images/")) {
    // This is a request for an image (under "/images").
    // First, block third-party referrers to discourage
    // hotlinking.
    let referer = request.headers.get("Referer")
    if (referer &amp;&amp;
        new URL(referer).hostname != url.hostname) {
      return new Response(
          "Hotlinking not allowed.",
          { status: 403 })
    }

    // Hotlink check passed. Serve the image directly
    // from Google Cloud Storage, to save serving
    // costs. The image will be cached at Cloudflare's
    // edge according to its Cache-Control header.
    url.hostname = "example-bucket.storage.googleapis.com"
    return fetch(url, request)
  } else {
    // Regular request. Forward to origin server.
    return fetch(request)
  }
}</code></pre>
            
    <div>
      <h3>It's Really Fast</h3>
      <a href="#its-really-fast">
        
      </a>
    </div>
    <p>Sometimes people ask us if JavaScript is "slow". Nothing could be further from the truth.</p><p>Workers uses the V8 JavaScript engine built by Google for Chrome. V8 is not only one of the fastest implementations of JavaScript, but one of the fastest implementations of any dynamically-typed language, period. Due to the immense amount of work that has gone into optimizing V8, it outperforms just about any popular server programming language with the possible exceptions of C/C++, Rust, and Go. (Incidentally, we will support those soon, via WebAssembly.)</p><p>The bottom line: <b>A typical Worker script executes in less than one millisecond.</b> Most users are unable to measure any <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-latency/">latency</a> difference when they enable Workers -- except, of course, when their worker actually <i>improves</i> latency by responding directly from the edge.</p><p>On another speed-related note, Workers deploy fast, too. <b>Workers deploy globally in under 30 seconds</b> from the time you save and enable the script.</p>
    <div>
      <h3>Pricing</h3>
      <a href="#pricing">
        
      </a>
    </div>
    <p>Workers are a paid add-on to Cloudflare. We wanted to keep the pricing as simple as possible, so here's the deal:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4FhizNXvRr918DQvbZPx9V/218889d4e4623f62dfe041a5590e650f/workers-pricing-text_4x.png" />
            
            </figure>
    <div>
      <h3>Get Started</h3>
      <a href="#get-started">
        
      </a>
    </div>
    <ul><li><p><a href="https://www.cloudflare.com/a/overview">Log into your Cloudflare account</a> and visit the "Workers" section to configure Workers.</p></li><li><p><a href="https://cloudflareworkers.com/">Experiment with Workers in the Playground</a>, no account required.</p></li><li><p><a href="https://developers.cloudflare.com/workers/">Read the documentation</a> to learn how Workers are written.</p></li><li><p><a href="/introducing-cloudflare-workers/">Check out the original announcement blog post</a> for more technical details.</p></li><li><p><a href="https://community.cloudflare.com/c/developers/workers">Discuss Workers in the Cloudflare Community.</a></p></li></ul><p>"Cloudflare Workers saves us a great deal of time. Managing bot traffic without Workers would consume valuable development and server resources that are better spent elsewhere."</p><p>— John Thompson, Senior System Administrator, MaxMind</p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[JavaScript]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">4srwZERwM1P2louT4gzvty</guid>
            <dc:creator>Kenton Varda</dc:creator>
        </item>
        <item>
            <title><![CDATA[Squeezing the firehose: getting the most from Kafka compression]]></title>
            <link>https://blog.cloudflare.com/squeezing-the-firehose/</link>
            <pubDate>Mon, 05 Mar 2018 16:17:03 GMT</pubDate>
            <description><![CDATA[ How Cloudflare was able to save hundreds of gigabits of network bandwidth and terabytes of storage from Kafka. ]]></description>
            <content:encoded><![CDATA[ <p>We at Cloudflare are long-time <a href="https://kafka.apache.org/">Kafka</a> users; the first mentions of it date back to the beginning of 2014, when the most recent version was 0.8.0. We use Kafka as a log to power analytics (both HTTP and DNS), <a href="https://www.cloudflare.com/learning/ddos/ddos-mitigation/">DDoS mitigation</a>, logging, and metrics.</p><p>While the idea of the unifying abstraction of the log has remained the same since then (<a href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying">read this classic blog post</a> from Jay Kreps if you haven't), Kafka has evolved in other areas. One of the improved areas was compression support. Back in the old days we tried enabling it a few times and ultimately gave up on the idea because of <a href="https://github.com/Shopify/sarama/issues/805">unresolved</a> <a href="https://issues.apache.org/jira/browse/KAFKA-1718">issues</a> in the protocol.</p>
    <div>
      <h3>Kafka compression overview</h3>
      <a href="#kafka-compression-overview">
        
      </a>
    </div>
    <p>Just last year, Kafka 0.11.0 came out with a new, improved protocol and log format.</p><p>The naive approach to compression would be to compress messages in the log individually:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4hIZ5PFDUxsm8R48jv6TfL/2d066d744d9e89775c35424db5b9f6d5/Screen-Shot-2018-03-05-at-12.10.00-PM.png" />
            
            </figure><p>Edit: originally we said this is how Kafka worked before 0.11.0, but that appears to be false.</p><p>Compression algorithms work best when they have more data to work with, so in the new log format messages (now called records) are packed back to back and compressed in batches. In the previous log format, messages were recursive (a compressed set of messages was itself a message); the new format makes things more straightforward: a compressed batch of records is just a batch.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6WMDU1akGFipMtPWFPXXJB/e0bc3c5bffc3bb215251c4bc33598fda/Screen-Shot-2018-03-05-at-12.10.13-PM.png" />
            
            </figure><p>Now compression has a lot more space to do its job. There's a high chance that records in the same Kafka topic share common parts, which means they can be compressed better. On the scale of thousands of messages, the difference becomes enormous. The downside is that if you want to read record3 in the example above, you have to fetch records 1 and 2 as well, whether the batch is compressed or not. In practice this doesn't matter too much, because consumers usually read all records sequentially, batch after batch.</p><p>The beauty of compression in Kafka is that it lets you trade off CPU usage against disk and network usage. The protocol itself is designed to minimize overhead as well, by requiring decompression in only a few places:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/572hmHvuR97iuBVjOqD5SS/a5a10aceffd450e8b6563e94966cc53c/Screen-Shot-2018-03-05-at-12.10.19-PM.png" />
            
            </figure><p>On the receiving side of the log only consumers need to decompress messages:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Jyhzp3FFqxOEUML1Zw3Cv/85d1d134fd87dc74820da7afe99c090e/Screen-Shot-2018-03-05-at-12.10.25-PM.png" />
            
            </figure><p>In reality, if you don't use encryption, data can be copied between the NIC and disks with <a href="https://www.ibm.com/developerworks/linux/library/j-zerocopy/">zero copies to user space</a>, lowering the cost to some degree.</p>
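            <p>From a producer's point of view, opting in to compression is a single configuration knob; brokers and consumers pick the codec up from the batch itself. A minimal sketch in Go using the sarama client we discuss below (the broker address and topic are placeholders):</p>
            <pre><code>package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	config := sarama.NewConfig()
	// Compress record batches on the producer side.
	config.Producer.Compression = sarama.CompressionSnappy
	// Required for SyncProducer.
	config.Producer.Return.Successes = true

	producer, err := sarama.NewSyncProducer([]string{"kafka1:9092"}, config)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	_, _, err = producer.SendMessage(&amp;sarama.ProducerMessage{
		Topic: "requests",
		Value: sarama.StringEncoder("hello"),
	})
	if err != nil {
		log.Fatal(err)
	}
}</code></pre>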
    <div>
      <h3>Kafka bottlenecks at Cloudflare</h3>
      <a href="#kafka-bottlenecks-at-cloudflare">
        
      </a>
    </div>
    <p>Having less network and disk usage was a big selling point for us. Back in 2014 we started with spinning disks under Kafka and never had issues with disk space. However, at some point we started having issues with random I/O. Most of the time, consumers and replicas (which are just another type of consumer) read from the very tip of the log, and that data resides in the page cache, meaning you don't need to read from disks at all:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/LFhjE9d3QMnMvDRS4mtlJ/3e47d9695821f3374c4ad326cb32cce3/Screen-Shot-2018-03-01-at-13.59.06.png" />
            
            </figure><p>In this case the only time you touch the disk is during writes, and sequential writes are cheap. However, things start to fall apart when you have multiple lagging consumers:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2b3aYKJupVoKGdx3tLmTjq/1b740bb0dda4e74025d8299b9d808abc/Screen-Shot-2018-03-01-at-13.59.29.png" />
            
            </figure><p>Each consumer wants to read a different part of the log from the physical disk, which means seeking back and forth. One lagging consumer was okay to have, but multiple of them would start fighting for disk I/O and just increase the lag for all of them. To work around this problem we upgraded to SSDs.</p><p>Consumers were no longer fighting for disk time, but it felt terribly wasteful, since most of the time consumers are not lagging and there's zero read I/O. We were not bored for too long, as other problems emerged:</p><ul><li><p>Disk space became a problem. SSDs are much more expensive, and usable disk space was reduced by a lot.</p></li><li><p>As we grew, we started saturating the network. We used 2x10Gbit NICs, and imperfect balance meant that we sometimes saturated network links.</p></li></ul><p>Compression promised to solve both of these problems, so we were eager to try again with improved support from Kafka.</p>
    <div>
      <h3>Performance testing</h3>
      <a href="#performance-testing">
        
      </a>
    </div>
    <p>At Cloudflare, we use Go extensively, which means that a lot of our Kafka consumers and producers are written in Go. We can't just take the off-the-shelf Java client that the Kafka team ships with every server release and start enjoying the benefits of compression; we had to get support from our Kafka client library first (we use <a href="https://github.com/Shopify/sarama">sarama from Shopify</a>). Luckily, support was added at the end of 2017. With more fixes from our side we were able to get the test setup working.</p><p>Kafka supports 4 compression codecs: <code>none</code>, <code>gzip</code>, <code>lz4</code> and <code>snappy</code>. We had to figure out how these would work for our topics, so we wrote a simple producer that copied data from an existing topic into destination topics, one for each compression type. This gave us the following numbers.</p><p>Each destination topic was getting roughly the same amount of messages:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7MOMWEQrQVLcd9DeOpignN/81ee9337ba0266c03770c6684237e476/1.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/48H1RMFi0oltn0zAHDtmzD/4e5dab19dc2d95d53563e331bcf60923/2.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2w92h2jHYHdce63bWv0gJR/2c810303800bbd073569c3f086326c31/3.png" />
          </figure><p>To make it even more obvious, this was the disk usage of these topics:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/FQ5TGkcRqPR63TBPiLW85/cc2c905abd0af052b3db6b4df2660b89/4.png" />
          </figure><p>This looked amazing, but it was a rather low-throughput nginx errors topic, containing literal error strings from nginx. Our main target was the <code>requests</code> HTTP log topic with <a href="https://capnproto.org/">capnp</a>-encoded messages that are much harder to compress. Naturally, we moved on to try one <code>partition</code> of the requests topic. The first results were insanely good:</p>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/03/5.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ziqP8gINUJ6THlURi989m/28ecfd3e722da5fdbe2cb5ab048cb80c/5.png" />
            </a>
            </figure><p>They were so good because they were lies. With nginx error logs we were pushing under 20Mbps of uncompressed logs; here we jumped 30x to 600Mbps, and compression wasn't able to keep up. Still, as a starting point, this experiment gave us some expectations in terms of compression ratios for the main target.</p><table><tr><td><p><b>Compression</b></p></td><td><p><b>Messages consumed</b></p></td><td><p><b>Disk usage</b></p></td><td><p><b>Average message size</b></p></td></tr><tr><td><p>None</p></td><td><p>30.18M</p></td><td><p>48106MB</p></td><td><p>1594B</p></td></tr><tr><td><p>Gzip</p></td><td><p>3.17M</p></td><td><p>1443MB</p></td><td><p>455B</p></td></tr><tr><td><p>Snappy</p></td><td><p>20.99M</p></td><td><p>14807MB</p></td><td><p>705B</p></td></tr><tr><td><p>LZ4</p></td><td><p>20.93M</p></td><td><p>14731MB</p></td><td><p>703B</p></td></tr></table><p>Gzip sounded too expensive from the beginning (especially in Go), but Snappy should have been able to keep up. We profiled our producer: it was spending just 2.4% of CPU time in Snappy compression, never saturating a single core:</p>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/03/6.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6X1yjXYICcPKzHxKKdQ76K/73278df1829440a7456f32d574d93f1a/6.png" />
            </a>
            </figure>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/03/7.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57E71JTk6DPA6oo8SOLghQ/83edd6f2db81a83db4bd8762117eeb1d/7.png" />
            </a>
            </figure><p>For Snappy we were able to get the following thread stack trace from Kafka with <code>jstack</code>:</p>
            <pre><code>"kafka-request-handler-3" #87 daemon prio=5 os_prio=0 tid=0x00007f80d2e97800 nid=0x1194 runnable [0x00007f7ee1adc000]
   java.lang.Thread.State: RUNNABLE
    at org.xerial.snappy.SnappyNative.rawCompress(Native Method)
    at org.xerial.snappy.Snappy.rawCompress(Snappy.java:446)
    at org.xerial.snappy.Snappy.compress(Snappy.java:119)
    at org.xerial.snappy.SnappyOutputStream.compressInput(SnappyOutputStream.java:376)
    at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:130)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    - locked &lt;0x00000007a74cc8f0&gt; (a java.io.DataOutputStream)
    at org.apache.kafka.common.utils.Utils.writeTo(Utils.java:861)
    at org.apache.kafka.common.record.DefaultRecord.writeTo(DefaultRecord.java:203)
    at org.apache.kafka.common.record.MemoryRecordsBuilder.appendDefaultRecord(MemoryRecordsBuilder.java:622)
    at org.apache.kafka.common.record.MemoryRecordsBuilder.appendWithOffset(MemoryRecordsBuilder.java:409)
    at org.apache.kafka.common.record.MemoryRecordsBuilder.appendWithOffset(MemoryRecordsBuilder.java:442)
    at org.apache.kafka.common.record.MemoryRecordsBuilder.appendWithOffset(MemoryRecordsBuilder.java:595)
    at kafka.log.LogValidator$.$anonfun$buildRecordsAndAssignOffsets$1(LogValidator.scala:336)
    at kafka.log.LogValidator$.$anonfun$buildRecordsAndAssignOffsets$1$adapted(LogValidator.scala:335)
    at kafka.log.LogValidator$$$Lambda$675/1035377790.apply(Unknown Source)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:52)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at kafka.log.LogValidator$.buildRecordsAndAssignOffsets(LogValidator.scala:335)
    at kafka.log.LogValidator$.validateMessagesAndAssignOffsetsCompressed(LogValidator.scala:288)
    at kafka.log.LogValidator$.validateMessagesAndAssignOffsets(LogValidator.scala:71)
    at kafka.log.Log.liftedTree1$1(Log.scala:654)
    at kafka.log.Log.$anonfun$append$2(Log.scala:642)
    - locked &lt;0x0000000640068e88&gt; (a java.lang.Object)
    at kafka.log.Log$$Lambda$627/239353060.apply(Unknown Source)
    at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
    at kafka.log.Log.append(Log.scala:624)
    at kafka.log.Log.appendAsLeader(Log.scala:597)
    at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:499)
    at kafka.cluster.Partition$$Lambda$625/1001513143.apply(Unknown Source)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:217)
    at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:223)
    at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:487)
    at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$2(ReplicaManager.scala:724)
    at kafka.server.ReplicaManager$$Lambda$624/2052953875.apply(Unknown Source)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$Lambda$12/187472540.apply(Unknown Source)
    at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:138)
    at scala.collection.mutable.HashMap$$Lambda$25/1864869682.apply(Unknown Source)
    at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:236)
    at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:229)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:138)
    at scala.collection.TraversableLike.map(TraversableLike.scala:234)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:227)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:708)
    at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:459)
    at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:466)
    at kafka.server.KafkaApis.handle(KafkaApis.scala:99)
    at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:65)
    at java.lang.Thread.run(Thread.java:748)</code></pre>
            <p>This pointed us to <a href="https://github.com/apache/kafka/blob/1.0.0/core/src/main/scala/kafka/log/LogValidator.scala#L70-L71">this piece of code</a> in the Kafka repository.</p><p>There wasn't enough logging to figure out why Kafka was recompressing our batches, but we were able to get this information out with a patched Kafka broker:</p>
            <pre><code>diff --git a/core/src/main/scala/kafka/log/LogValidator.scala b/core/src/main/scala/kafka/log/LogValidator.scala
index 15750e9cd..5197d0885 100644
--- a/core/src/main/scala/kafka/log/LogValidator.scala
+++ b/core/src/main/scala/kafka/log/LogValidator.scala
@@ -21,6 +21,7 @@ import java.nio.ByteBuffer
 import kafka.common.LongRef
 import kafka.message.{CompressionCodec, NoCompressionCodec}
 import kafka.utils.Logging
+import org.apache.log4j.Logger
 import org.apache.kafka.common.errors.{InvalidTimestampException, UnsupportedForMessageFormatException}
 import org.apache.kafka.common.record._
 import org.apache.kafka.common.utils.Time
@@ -236,6 +237,7 @@ private[kafka] object LogValidator extends Logging {
   
       // No in place assignment situation 1 and 2
       var inPlaceAssignment = sourceCodec == targetCodec &amp;&amp; toMagic &gt; RecordBatch.MAGIC_VALUE_V0
+      logger.info("inPlaceAssignment = " + inPlaceAssignment + ", condition: sourceCodec (" + sourceCodec + ") == targetCodec (" + targetCodec + ") &amp;&amp; toMagic (" + toMagic + ") &gt; RecordBatch.MAGIC_VALUE_V0 (" + RecordBatch.MAGIC_VALUE_V0 + ")")
   
       var maxTimestamp = RecordBatch.NO_TIMESTAMP
       val expectedInnerOffset = new LongRef(0)
@@ -250,6 +252,7 @@ private[kafka] object LogValidator extends Logging {
         // Do not compress control records unless they are written compressed
         if (sourceCodec == NoCompressionCodec &amp;&amp; batch.isControlBatch)
           inPlaceAssignment = true
+          logger.info("inPlaceAssignment = " + inPlaceAssignment + ", condition: sourceCodec (" + sourceCodec + ") == NoCompressionCodec (" + NoCompressionCodec + ") &amp;&amp; batch.isControlBatch (" + batch.isControlBatch + ")")
   
         for (record &lt;- batch.asScala) {
           validateRecord(batch, record, now, timestampType, timestampDiffMaxMs, compactedTopic)
@@ -261,21 +264,26 @@ private[kafka] object LogValidator extends Logging {
           if (batch.magic &gt; RecordBatch.MAGIC_VALUE_V0 &amp;&amp; toMagic &gt; RecordBatch.MAGIC_VALUE_V0) {
             // Check if we need to overwrite offset
             // No in place assignment situation 3
-            if (record.offset != expectedInnerOffset.getAndIncrement())
+            val off = expectedInnerOffset.getAndIncrement()
+            if (record.offset != off)
               inPlaceAssignment = false
+              logger.info("inPlaceAssignment = " + inPlaceAssignment + ", condition: record.offset (" + record.offset + ") != expectedInnerOffset.getAndIncrement() (" + off + ")")
             if (record.timestamp &gt; maxTimestamp)
               maxTimestamp = record.timestamp
           }
   
           // No in place assignment situation 4
-          if (!record.hasMagic(toMagic))
+          if (!record.hasMagic(toMagic)) {
+            logger.info("inPlaceAssignment = " + inPlaceAssignment + ", condition: !record.hasMagic(toMagic) (" + !record.hasMagic(toMagic) + ")")
             inPlaceAssignment = false
+          }
   
           validatedRecords += record
         }
       }
   
       if (!inPlaceAssignment) {
+        logger.info("inPlaceAssignment = " + inPlaceAssignment + "; recompressing")
         val (producerId, producerEpoch, sequence, isTransactional) = {
           // note that we only reassign offsets for requests coming straight from a producer. For records with magic V2,
           // there should be exactly one RecordBatch per request, so the following is all we need to do. For Records</code></pre>
            <p>And the output was:</p>
            <pre><code>Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = true, condition: sourceCodec (SnappyCompressionCodec) == targetCodec (SnappyCompressionCodec) &amp;&amp; toMagic (2) &gt; RecordBatch.MAGIC_VALUE_V0 (0) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = true, condition: sourceCodec (SnappyCompressionCodec) == NoCompressionCodec (NoCompressionCodec) &amp;&amp; batch.isControlBatch (false) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = true, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (0) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (1) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (2) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (3) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (4) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (5) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (6) (kafka.log.LogValidator$)</code></pre>
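            <p>The output made the problem clear: with the v2 message format (<code>toMagic = 2</code> above), records inside a compressed batch are expected to carry incrementing inner offsets (0, 1, 2, ...), and the broker keeps the batch as-is only if they line up. Our producer was writing offset 0 for every inner record, so the <code>record.offset != expectedInnerOffset.getAndIncrement()</code> check failed from the second record onward, and the broker recompressed every batch. A minimal sketch of the invariant the producer has to maintain (the field and function names are hypothetical, not sarama's actual internals):</p>
            <pre><code>package producer

// Records inside a compressed batch must carry offset deltas of
// 0, 1, 2, ... relative to the batch. Writing 0 for every record,
// which is what our producer effectively did, forces the broker to
// rebuild and recompress the whole batch on every produce request.
type record struct {
    offsetDelta int64
    value       []byte
}

func assignInnerOffsets(batch []record) {
    for i := range batch {
        batch[i].offsetDelta = int64(i) // was effectively always 0
    }
}</code></pre>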
            <p>We promptly <a href="https://github.com/Shopify/sarama/pull/1015">fixed the issue</a> in sarama and resumed the testing. These were the results:</p><table><tr><td><p><b>Compression</b></p></td><td><p><b>User time</b></p></td><td><p><b>Messages</b></p></td><td><p><b>Time per 1m</b></p></td><td><p><b>CPU ratio</b></p></td><td><p><b>Disk usage</b></p></td><td><p><b>Avg. message size</b></p></td><td><p><b>Compression ratio</b></p></td></tr><tr><td><p>None</p></td><td><p>209.67s</p></td><td><p>26.00M</p></td><td><p>8.06s</p></td><td><p>1x</p></td><td><p>41448MB</p></td><td><p>1594B</p></td><td><p>1x</p></td></tr><tr><td><p>Gzip</p></td><td><p>570.56s</p></td><td><p>6.98M</p></td><td><p>81.74s</p></td><td><p>10.14x</p></td><td><p>3111MB</p></td><td><p>445B</p></td><td><p>3.58x</p></td></tr><tr><td><p>Snappy</p></td><td><p>337.55s</p></td><td><p>26.02M</p></td><td><p>12.97s</p></td><td><p>1.61x</p></td><td><p>17675MB</p></td><td><p>679B</p></td><td><p>2.35x</p></td></tr><tr><td><p>LZ4</p></td><td><p>525.82s</p></td><td><p>26.01M</p></td><td><p>20.22s</p></td><td><p>2.51x</p></td><td><p>22922MB</p></td><td><p>881B</p></td><td><p>1.81x</p></td></tr></table><p>Now we were able to keep up with both Snappy and LZ4. Gzip was still out of the question, and LZ4 had incompatibility issues between Kafka versions and our Go client, which left us with Snappy. Snappy was also the winner in terms of compression ratio and speed, so we were not too disappointed by the lack of choice.</p>
    <div>
      <h3>Deploying into production</h3>
      <a href="#deploying-into-production">
        
      </a>
    </div>
    <p>In production, we started small, with Java-based consumers and producers. Our first production topic was just 1Mbps and 600rps of nginx error logs. Messages there were very repetitive, and we were able to get a whopping 8x decrease in size by batching records for just 1 second across 2 partitions.</p><p>This gave us some confidence to move on to the next topic: <code>journald</code> logs encoded as JSON. Here we were able to reduce ingress from 300Mbps to just 50Mbps (the yellow line on the graph):</p>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/03/8.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2MBwqqBM9fKfK5etewI3K8/a6c3e5154769721ace34e35448e0e9d8/8.png" />
            </a>
            </figure>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/03/10.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1VRIDnquD9cRDxx9EpNliX/abed839321dde455854bea1c93a0fd76/10.png" />
            </a>
            </figure>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/03/11.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1S1GiVnOPIEaZi20vDdu7B/4e0b4a064af5e576f615b18618e22a14/11.png" />
            </a>
            </figure>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/03/12.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6F0xiuQrUHNiU2Q0VgaYtC/2176de2b39c4ec3493f06dae8a995ff2/12.png" />
            </a>
            </figure>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/03/13.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/lvIkKRy1yrsTJNe4n4DEP/d2da438e3fbe1d9f40eb2e1347b684a7/13.png" />
            </a>
            </figure><p>With all major topics in the DNS cluster switched to Snappy, the picture for broker CPU usage looked even better:</p>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/03/14.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ERRHln9aNu0pSoKSFFQkA/197e14bebbe7642b690f23363a5cdf1e/14.png" />
            </a>
            </figure>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/03/15.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7cfmeyd2Jch2ZfTquu14rW/63fd83ad46a6484dd8b7ebe446b21e09/15.png" />
            </a>
            </figure><p>On the next graph you can see Kafka CPU usage as the purple line and producer CPU usage as the green line:</p>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/03/16.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5QWGl1NhwkYWMHl5Q5Lius/a4085495612071ae2a16a8aa9a675a05/16.png" />
            </a>
            </figure><p>CPU usage of the producer did not go up substantially, which means most of its work is spent on tasks unrelated to compression. Consumers did not see any increase in CPU usage either, which means we got our 2.6x decrease in size practically for free.</p><p>It was time to hunt the biggest beast of all: the <code>requests</code> topic with HTTP access logs. There we were doing up to 100Gbps and 7.5Mrps of ingress at peak (a lot more when big attacks are happening, but this was a quiet week):</p>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/03/17.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/695gJ4I1AIqDHKIH5phRKh/36940318a0b87aa698b4e088c60fc0c2/17.png" />
            </a>
            </figure><p>With many smaller topics switched to Snappy already, we did not need to do anything special here. This is how it went:</p>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/03/18.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2b39JEnQP0w68PhPDkGjOz/b4f74fb90ed73a44aa8b05fca03a9de3/18.png" />
            </a>
            </figure>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/03/19.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7KvLGzI0Rnk44SXDJb6JXT/cdb6ee681318809a65b8ef8a61229589/19.png" />
            </a>
            </figure><p>That's a 2.25x decrease in ingress bandwidth and average message size. We have multiple replicas and consumers, which means egress is a multiple of ingress. We were able to cut hundreds of gigabits per second of internal in-DC traffic and save terabytes of flash storage. With network and disks being the bottlenecks, this meant we'd need less than half the hardware we had. Kafka was one of the main hardware hogs in this datacenter, so this was a large-scale win.</p><p>Yet, 2.25x seemed a bit on the low side.</p>
    <div>
      <h3>Looking for more</h3>
      <a href="#looking-for-more">
        
      </a>
    </div>
    <p>We wanted to see if we could do better. To find out, we extracted one batch of records from Kafka and ran some benchmarks on it. Batches were around 1MB uncompressed, with 600 records in each on average.</p><p>To run the benchmarks we used <a href="https://github.com/inikep/lzbench">lzbench</a>, which runs lots of different compression algorithms and provides a summary. Here's what we saw, with results sorted by compression ratio (a heavily filtered list):</p>
            <pre><code>lzbench 1.7.3 (64-bit MacOS)   Assembled by P.Skibinski
Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                  33587 MB/s 33595 MB/s      984156 100.00
...
lz4 1.8.0                 594 MB/s  2428 MB/s      400577  40.70
...
snappy 1.1.4              446 MB/s  1344 MB/s      425564  43.24
...
zstd 1.3.3 -1             409 MB/s   844 MB/s      259438  26.36
zstd 1.3.3 -2             303 MB/s   889 MB/s      244650  24.86
zstd 1.3.3 -3             242 MB/s   899 MB/s      232057  23.58
zstd 1.3.3 -4             240 MB/s   910 MB/s      230936  23.47
zstd 1.3.3 -5             154 MB/s   891 MB/s      226798  23.04</code></pre>
            <p>This looked too good to be true. <a href="https://facebook.github.io/zstd/">Zstandard</a> is a fairly new (released 1.5 years ago) compression algorithm from Facebook. In benchmarks on the project's home page you can see this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4RmnIQ4k2nqI5jabUBQLMq/035633fbc37b15c7fb80d2ed72de01f4/zstd.png" />
            
            </figure><p>In our case we were getting this:</p><table><tr><td><p><b>Compressor name</b></p></td><td><p><b>Ratio</b></p></td><td><p><b>Compression</b></p></td><td><p><b>Decompression</b></p></td></tr><tr><td><p>zstd</p></td><td><p>3.794</p></td><td><p>409 MB/s</p></td><td><p>844 MB/s</p></td></tr><tr><td><p>lz4</p></td><td><p>2.475</p></td><td><p>594 MB/s</p></td><td><p>2428 MB/s</p></td></tr><tr><td><p>snappy</p></td><td><p>2.313</p></td><td><p>446 MB/s</p></td><td><p>1344 MB/s</p></td></tr></table><p>Clearly, results are very dependent on the kind of data you are trying to compress. For our data, zstd gave amazing results even at the lowest compression level: the compression ratio was better than even gzip's at its maximum level, while throughput was a lot higher. For posterity, this is how DNS logs compressed (HTTP logs compressed similarly):</p>
            <pre><code>$ ./lzbench -ezstd/zlib rrdns.recordbatch
lzbench 1.7.3 (64-bit MacOS)   Assembled by P.Skibinski
Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                  33235 MB/s 33502 MB/s      927048 100.00 rrdns.recordbatch
zstd 1.3.3 -1             430 MB/s   909 MB/s      226298  24.41 rrdns.recordbatch
zstd 1.3.3 -2             322 MB/s   878 MB/s      227271  24.52 rrdns.recordbatch
zstd 1.3.3 -3             255 MB/s   883 MB/s      217730  23.49 rrdns.recordbatch
zstd 1.3.3 -4             253 MB/s   883 MB/s      217141  23.42 rrdns.recordbatch
zstd 1.3.3 -5             169 MB/s   869 MB/s      216119  23.31 rrdns.recordbatch
zstd 1.3.3 -6             102 MB/s   939 MB/s      211092  22.77 rrdns.recordbatch
zstd 1.3.3 -7              78 MB/s   968 MB/s      208710  22.51 rrdns.recordbatch
zstd 1.3.3 -8              65 MB/s  1005 MB/s      204370  22.05 rrdns.recordbatch
zstd 1.3.3 -9              59 MB/s  1008 MB/s      204071  22.01 rrdns.recordbatch
zstd 1.3.3 -10             44 MB/s  1029 MB/s      202587  21.85 rrdns.recordbatch
zstd 1.3.3 -11             43 MB/s  1054 MB/s      202447  21.84 rrdns.recordbatch
zstd 1.3.3 -12             32 MB/s  1051 MB/s      201190  21.70 rrdns.recordbatch
zstd 1.3.3 -13             31 MB/s  1050 MB/s      201190  21.70 rrdns.recordbatch
zstd 1.3.3 -14             13 MB/s  1074 MB/s      200228  21.60 rrdns.recordbatch
zstd 1.3.3 -15           8.15 MB/s  1171 MB/s      197114  21.26 rrdns.recordbatch
zstd 1.3.3 -16           5.96 MB/s  1051 MB/s      190683  20.57 rrdns.recordbatch
zstd 1.3.3 -17           5.64 MB/s  1057 MB/s      191227  20.63 rrdns.recordbatch
zstd 1.3.3 -18           4.45 MB/s  1166 MB/s      187967  20.28 rrdns.recordbatch
zstd 1.3.3 -19           4.40 MB/s  1108 MB/s      186770  20.15 rrdns.recordbatch
zstd 1.3.3 -20           3.19 MB/s  1124 MB/s      186721  20.14 rrdns.recordbatch
zstd 1.3.3 -21           3.06 MB/s  1125 MB/s      186710  20.14 rrdns.recordbatch
zstd 1.3.3 -22           3.01 MB/s  1125 MB/s      186710  20.14 rrdns.recordbatch
zlib 1.2.11 -1             97 MB/s   301 MB/s      305992  33.01 rrdns.recordbatch
zlib 1.2.11 -2             93 MB/s   327 MB/s      284784  30.72 rrdns.recordbatch
zlib 1.2.11 -3             74 MB/s   364 MB/s      265415  28.63 rrdns.recordbatch
zlib 1.2.11 -4             68 MB/s   342 MB/s      269831  29.11 rrdns.recordbatch
zlib 1.2.11 -5             48 MB/s   367 MB/s      258558  27.89 rrdns.recordbatch
zlib 1.2.11 -6             32 MB/s   376 MB/s      247560  26.70 rrdns.recordbatch
zlib 1.2.11 -7             24 MB/s   409 MB/s      244623  26.39 rrdns.recordbatch
zlib 1.2.11 -8           9.67 MB/s   429 MB/s      239659  25.85 rrdns.recordbatch
zlib 1.2.11 -9           3.63 MB/s   446 MB/s      235604  25.41 rrdns.recordbatch</code></pre>
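            <p>To sanity-check how a given level behaves on a captured batch from Go, one can compress it directly. A minimal sketch using the klauspost/compress zstd bindings, a library that postdates the work described here and is shown purely for illustration; the file name matches the lzbench run above:</p>
            <pre><code>package main

import (
    "fmt"
    "io/ioutil"

    "github.com/klauspost/compress/zstd"
)

func main() {
    raw, err := ioutil.ReadFile("rrdns.recordbatch")
    if err != nil {
        panic(err)
    }
    // EncoderLevelFromZstd maps a zstd CLI-style level (we eventually
    // picked 6) onto the library's coarser internal levels.
    enc, err := zstd.NewWriter(nil, zstd.WithEncoderLevel(zstd.EncoderLevelFromZstd(6)))
    if err != nil {
        panic(err)
    }
    compressed := enc.EncodeAll(raw, nil)
    fmt.Printf("%d bytes down to %d bytes (%.2fx)\n",
        len(raw), len(compressed), float64(len(raw))/float64(len(compressed)))
}</code></pre>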
            <p>For our purposes we picked level 6 as the compromise between compression ratio and CPU cost; as real-world usage later proved, it's possible to be even more aggressive.</p><p>One great property of zstd is that decompression speed is more or less the same across levels, which means there is only one knob connecting the CPU cost of compression to the compression ratio.</p><p>Armed with this knowledge, we dug up a <a href="https://issues.apache.org/jira/browse/KAFKA-4514">forgotten Kafka ticket</a> to add zstd, along with a <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-110%3A+Add+Codec+for+ZStandard+Compression">KIP</a> (Kafka Improvement Proposal) and even a <a href="https://github.com/apache/kafka/pull/2267">PR on GitHub</a>. Sadly, these did not get traction back in the day, but the work saved us a lot of time.</p><p>We <a href="https://github.com/bobrik/kafka/commit/8b17836efda64dba1ebdc080e30ee2945793aef3">ported</a> the patch to the Kafka 1.0.0 release and pushed it into production. After another round of smaller-scale testing, and with <a href="https://github.com/bobrik/sarama/commit/c36187fbafab5afe5c152d2012b05b9306196cdb">patched</a> clients, we pushed zstd into production for the requests topic.</p><p>The graphs below include the switch from no compression (before 2/9) to Snappy (2/9 to 2/17) to Zstandard (after 2/17):</p>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/03/20.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/466acpqo8HHXc9mml41SGp/178fefc55aa2e9d29a484169ee47c0ed/20.png" />
            </a>
            </figure>
            <figure>
            <a href="http://staging.blog.mrk.cfdata.org/content/images/2018/03/21.png">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6HSguckI7mJln65401u5Y2/cd9e39e86898c214637e0cb0c865ae02/21.png" />
            </a>
            </figure><p>The decrease in size was <b>4.5x</b> compared to no compression at all. On next-generation hardware, with 2.4x more storage and 2.5x higher network throughput, we suddenly made our bottleneck more than 10x wider and shifted it from storage and network to CPU cost. We even got to cancel a pending hardware order for Kafka expansion because of this.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Zstandard is a great modern compression algorithm, offering high compression ratios and high throughput, tunable in small increments. Whenever you consider using compression, you should check zstd; if you aren't considering compression, it's worth seeing whether you could benefit from it. In either case, run benchmarks with your own data.</p><p>Testing in a real-world scenario showed how benchmarks, even those coming from zstd itself, can be misleading. Going beyond the codecs built into Kafka allowed us to improve the compression ratio 2x at very low cost.</p><p>We hope that the data we gathered can be a catalyst for making Zstandard an official compression codec in Kafka, to the benefit of others. There are 3 bits allocated for the codec type, and with only 4 of the 8 possible values taken so far, there are 4 vacant places.</p><p>If you were skeptical of compression benefits in Kafka because of old flaws in the Kafka protocol, this may be the time to reconsider.</p><p>If you enjoy benchmarking, profiling and optimizing large-scale services, come <a href="https://www.cloudflare.com/careers/">join us</a>.</p> ]]></content:encoded>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Kafka]]></category>
            <guid isPermaLink="false">50nksAmOM8KO1t9ihhnJxe</guid>
            <dc:creator>Ivan Babrou</dc:creator>
        </item>
        <item>
            <title><![CDATA[ARM Takes Wing:  Qualcomm vs. Intel CPU comparison]]></title>
            <link>https://blog.cloudflare.com/arm-takes-wing/</link>
            <pubDate>Wed, 08 Nov 2017 20:03:14 GMT</pubDate>
            <description><![CDATA[ One of the nicer perks I have here at Cloudflare is access to the latest hardware, long before it even reaches the market. Until recently I mostly played with Intel hardware.  ]]></description>
            <content:encoded><![CDATA[ <p>One of the nicer perks I have here at Cloudflare is access to the latest hardware, long before it even reaches the market.</p><p>Until recently I mostly played with Intel hardware. For example Intel supplied us with an engineering sample of their Skylake based Purley platform back in August 2016, to give us time to evaluate it and optimize our software. As a former Intel Architect, who did a lot of work on Skylake (as well as Sandy Bridge, Ivy Bridge and Icelake), I really enjoy that.</p><p>Our previous generation of servers was based on the Intel Broadwell micro-architecture. Our configuration includes dual-socket Xeons E5-2630 v4, with 10 cores each, running at 2.2GHz, with a 3.1GHz turboboost and hyper-threading enabled, for a total of 40 threads per server.</p><p>Since Intel was, and still is, the undisputed leader of the server CPU market with greater than 98% market share, our upgrade process until now was pretty straightforward: every year Intel releases a new generation of CPUs, and every year we buy them. In the process we usually get two extra cores per socket, and all the extra architectural features such upgrade brings: hardware AES and CLMUL in Westmere, AVX in Sandy Bridge, AVX2 in Haswell, etc.</p><p>In the current upgrade cycle, our next server processor ought to be the Xeon Silver 4116, also in a dual-socket configuration. In fact, we have already purchased a significant number of them. Each CPU has 12 cores, but it runs at a lower frequency of 2.1GHz, with 3.0GHz turboboost. It also has smaller last level cache: 1.375 MiB/core, compared to 2.5 MiB the Broadwell processors had. In addition, the Skylake based platform supports 6 memory channels and the AVX-512 instruction set.</p><p>As we head into 2018, however, change is in the air. For the first time in a while, Intel has serious competition in the server market: Qualcomm and Cavium both have new server platforms based on the ARMv8 64-bit architecture (aka aarch64 or arm64). Qualcomm has the Centriq platform (code name Amberwing), based on the Falkor core, and Cavium has the ThunderX2 platform, based on the ahm ... ThunderX2 core?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2OerKvEoy4NXTwjEpDA1tY/d8d9c4dad28ab5bc7100710b3dcf644e/25704115174_061e907e57_o.jpg" />
            
            </figure><p>The majestic Amberwing powered by the Falkor CPU <a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/drphotomoto/25704115174/in/photolist-FaoiVy-oqK2j2-f6A8TL-6PTBkR-gRasf9-f2R2Hz-7bZeUp-fmxSzZ-o9fogQ-8evb42-f4tgSX-eGzXYi-6umTDd-8evd8H-gdCU2L-uhCbnz-fmxSsX-oxnuko-wb7in9-oqsJSH-uxxAS2-CzS4Eh-6y8KQA-brLjKf-YT2jrY-eGG5QJ-8pLnKt-8eyvgY-cnQqJs-fXYs9f-f2R2jK-28ahBA-fXYjkD-a9K25u-289gvW-PHrqDS-cmkggf-Ff9NXa-EhMcP4-f36dMm-289xP7-Ehrz1y-f2QZRZ-GqT3vt-uUeHBq-xUDQoa-ymMxE9-wWFi3q-MDva8W-8CWG5X">image</a> by <a href="https://www.flickr.com/photos/drphotomoto/">DrPhotoMoto</a></p><p>Recently, both Qualcomm and Cavium provided us with engineering samples of their ARM based platforms, and in this blog post I would like to share my findings about Centriq, the Qualcomm platform.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4t48heslIVcJiaqGTuxwhl/a968a4b7f0a6a5dd25fc8481029db354/IMG_7222.jpg" />
            
            </figure><p>The actual Amberwing in question</p>
    <div>
      <h3>Overview</h3>
      <a href="#overview">
        
      </a>
    </div>
    <p>I tested the Qualcomm Centriq server, and compared it with our newest Intel Skylake based server and previous Broadwell based server.</p><table><tr><td><p><b>Platform</b></p></td><td><p><b>Grantley
(Intel)</b></p></td><td><p><b>Purley
(Intel)</b></p></td><td><p><b>Centriq
(Qualcomm)</b></p></td></tr><tr><td><p>Core</p></td><td><p>Broadwell</p></td><td><p>Skylake</p></td><td><p>Falkor</p></td></tr><tr><td><p>Process</p></td><td><p>14nm</p></td><td><p>14nm</p></td><td><p>10nm</p></td></tr><tr><td><p>Issue</p></td><td><p>8 µops/cycle</p></td><td><p>8 µops/cycle</p></td><td><p>8 instructions/cycle</p></td></tr><tr><td><p>Dispatch</p></td><td><p>4 µops/cycle</p></td><td><p>5 µops/cycle</p></td><td><p>4 instructions/cycle</p></td></tr><tr><td><p># Cores</p></td><td><p>10 x 2S + HT (40 threads)</p></td><td><p>12 x 2S + HT (48 threads)</p></td><td><p>46</p></td></tr><tr><td><p>Frequency</p></td><td><p>2.2GHz (3.1GHz turbo)</p></td><td><p>2.1GHz (3.0GHz turbo)</p></td><td><p>2.5 GHz</p></td></tr><tr><td><p>LLC</p></td><td><p>2.5 MB/core</p></td><td><p>1.375 MB/core</p></td><td><p>1.25 MB/core</p></td></tr><tr><td><p>Memory Channels</p></td><td><p>4</p></td><td><p>6</p></td><td><p>6</p></td></tr><tr><td><p>TDP</p></td><td><p>170W (85W x 2S)</p></td><td><p>170W (85W x 2S)</p></td><td><p>120W</p></td></tr><tr><td><p>Other features</p></td><td><p>AES
CLMUL
AVX2</p></td><td><p>AES
CLMUL
AVX512</p></td><td><p>AES
CLMUL
NEON
Trustzone
CRC32</p></td></tr></table><p>Overall, on paper, Falkor looks very competitive. In theory a Falkor core can process 8 instructions/cycle, the same as Skylake or Broadwell, and it has a higher base frequency at a lower TDP rating.</p>
    <div>
      <h3>Ecosystem readiness</h3>
      <a href="#ecosystem-readiness">
        
      </a>
    </div>
    <p>Up until now, a major obstacle to the deployment of ARM servers was the lack of, or weak, support from the majority of software vendors. In the past two years, ARM’s enablement efforts have paid off: most Linux distros, as well as the most popular libraries, now support the 64-bit ARM architecture. Driver availability, however, is unclear at this point.</p><p>At Cloudflare, we run a complex software stack that consists of many integrated services, and running each of them efficiently is a top priority.</p><p>On the edge we have the NGINX server software, which does support ARMv8. NGINX is written in C, and it also uses several libraries written in C, such as zlib and BoringSSL, so solid C compiler support is very important.</p><p>In addition, our flavor of NGINX is highly integrated with the <a href="https://github.com/openresty/lua-nginx-module">lua-nginx-module</a>, and we rely a lot on <a href="/pushing-nginx-to-its-limit-with-lua/">LuaJIT</a>.</p><p>Finally, a lot of our services, such as our DNS server, <a href="/what-weve-been-doing-with-go/#sts=RRDNS">RRDNS</a>, are written in Go.</p><p>The good news is that both gcc and clang not only support ARMv8 in general, but also have optimization profiles for the Falkor core.</p><p>Go has official support for ARMv8 as well, and the arm64 backend improves constantly.</p><p>As for LuaJIT, the stable version, 2.0.5, does not support ARMv8, but the beta version, 2.1.0, does. Let’s hope it gets out of beta soon.</p>
    <div>
      <h3>Benchmarks</h3>
      <a href="#benchmarks">
        
      </a>
    </div>
    
    <div>
      <h4>OpenSSL</h4>
      <a href="#openssl">
        
      </a>
    </div>
    <p>The first benchmark I wanted to perform was OpenSSL version 1.1.1 (development version), using the bundled <code>openssl speed</code> tool. Although we recently switched to BoringSSL, I still prefer OpenSSL for benchmarking, because it has almost equally well-optimized assembly code paths for both ARMv8 and the latest Intel processors.</p><p>In my opinion handcrafted assembly is the best measure of a CPU’s potential, as it bypasses compiler bias.</p>
    <div>
      <h4>Public key cryptography</h4>
      <a href="#public-key-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5jqlU0sjyAGrZaT2Y1x8JR/9fc0bccc931f0e8f8cbf1c8fcd2bdecf/pub_key_1_core-2.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/19nPDhEQXutllHbY08nO6O/0da82dc7260b6dbd732cc3cf79455ba8/pub_key_all_core-2.png" />
            
            </figure><p>Public key cryptography is all about raw ALU performance. It is interesting, but not surprising, to see that in the single core benchmark the Broadwell core is faster than Skylake, and both in turn are faster than Falkor. This is because Broadwell runs at a higher frequency, while architecturally it is not much inferior to Skylake.</p><p>Falkor is at a disadvantage here. First, in a single core benchmark, the turbo is engaged, meaning the Intel processors run at a higher frequency. Second, in Broadwell, Intel introduced two special instructions to accelerate big number multiplication: ADCX and ADOX. These perform two independent add-with-carry operations per cycle, whereas ARM can only do one. Similarly, the ARMv8 instruction set does not have a single instruction that performs a full 64-bit multiplication with a 128-bit result; instead it uses a pair of MUL and UMULH instructions for the low and high halves.</p><p>Nevertheless, at the SoC level, Falkor wins big time. It is only marginally slower than Skylake at RSA2048 signatures, and only because RSA2048 does not have an optimized implementation for ARM. The <a href="https://www.cloudflare.com/learning/dns/dnssec/ecdsa-and-dnssec/">ECDSA</a> performance is ridiculously fast. A single Centriq chip can satisfy the ECDSA needs of almost any company in the world.</p><p>It is also very interesting to see Skylake outperform Broadwell by a 30% margin, despite losing the single core benchmark and having only 20% more cores. This can be explained by more efficient all-core turbo and improved hyper-threading.</p>
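    <p>To make the multiplication gap concrete: the building block of bignum arithmetic is a full 64-bit multiply yielding a 128-bit result, which is one instruction on x86-64 but a MUL/UMULH pair on ARMv8. A small Go illustration, using the <code>math/bits</code> package (which was added to Go after this post was written), purely as a sketch:</p>
            <pre><code>package main

import (
    "fmt"
    "math/bits"
)

func main() {
    // One full-width multiply: the compiler lowers this to a single MUL
    // on x86-64 (both halves land in RDX:RAX), but to a MUL (low half)
    // plus a UMULH (high half) on arm64.
    hi, lo := bits.Mul64(0xdeadbeefdeadbeef, 0xfeedfacefeedface)
    fmt.Printf("hi=%#x lo=%#x\n", hi, lo)
}</code></pre>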
    <div>
      <h4>Symmetric key cryptography</h4>
      <a href="#symmetric-key-cryptography">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5c1gb11NBOBVq2PyFyxHIJ/05ddba2c522eecce2fe1fceca405c3ac/sym_key_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3HzScBUWaLPW2l7t0bvsuo/8c235d691f67a6fd9bbfe2176e3037d4/sym_key_all_core.png" />
            
            </figure><p>Symmetric key performance of the Intel cores is outstanding.</p><p>AES-GCM uses a combination of special hardware instructions to accelerate AES and CLMUL (carry-less multiplication). Intel first introduced those instructions back in 2010, with their Westmere CPU, and has improved their performance with every generation since. ARM introduced a set of similar instructions only recently, with its 64-bit instruction set, and as an optional extension. Fortunately, every hardware vendor I know of has implemented them. It is very likely that Qualcomm will improve the performance of the cryptographic instructions in future generations.</p><p>ChaCha20-Poly1305 is a more generic algorithm, designed in such a way as to better utilize wide SIMD units. The Qualcomm CPU only has the 128-bit wide NEON SIMD, while Broadwell has 256-bit wide AVX2, and Skylake has 512-bit wide AVX-512. This explains the huge lead Skylake has over both in single core performance. In the all-cores benchmark the Skylake lead shrinks, because it has to lower its clock speed when executing AVX-512 workloads. When executing AVX-512 on all cores, the base frequency goes down to just 1.4GHz. Keep that in mind if you are mixing AVX-512 and other code.</p><p>The bottom line for symmetric crypto is that although Skylake has the lead, Broadwell and Falkor both have good enough performance for any real-life scenario, especially considering the fact that on our edge, RSA consumes more CPU time than all the other crypto algorithms combined.</p>
    <div>
      <h3>Compression</h3>
      <a href="#compression">
        
      </a>
    </div>
    <p>The next benchmark I wanted to see was compression, for two reasons. First, it is a very important workload on the edge, as better compression saves bandwidth and helps deliver content to the client faster. Second, it is a very demanding workload, with a high rate of branch mispredictions.</p><p>Obviously the first benchmark would be the popular zlib library. At Cloudflare, we use an <a href="/cloudflare-fights-cancer/">improved version of the library</a>, optimized for 64-bit Intel processors; although it is written mostly in C, it does use some Intel-specific intrinsics. Comparing this optimized version to the generic zlib library wouldn’t be fair. Not to worry: with little effort I <a href="https://github.com/cloudflare/zlib/tree/vlad/aarch64">adapted the library</a> to work very well on the ARMv8 architecture, with the use of NEON and CRC32 intrinsics. As a result it is twice as fast as the generic library for some files.</p><p>The second benchmark is the emerging brotli library. It is written in C, which allows for a level playing field across all platforms.</p><p>All the benchmarks are performed on the HTML of <a href="/">blog.cloudflare.com</a>, in memory, similar to the way NGINX performs streaming compression. The size of the specific version of the HTML file is 29,329 bytes, making it a good representative of the type of files we usually compress. The parallel benchmark compresses multiple files in parallel, as opposed to compressing a single file on many threads, which is also similar to the way NGINX works.</p>
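    <p>For illustration, this is roughly what such an in-memory measurement looks like in Go with the standard <code>compress/gzip</code> package; the file name is hypothetical, and the actual benchmarks used our optimized C zlib rather than this sketch:</p>
            <pre><code>package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "io/ioutil"
    "time"
)

func main() {
    html, err := ioutil.ReadFile("blog.html") // hypothetical captured page
    if err != nil {
        panic(err)
    }
    const level = 5     // gzip quality setting under test
    const rounds = 1000 // repeat to get a stable throughput number
    var size int
    start := time.Now()
    for i := 0; i &lt; rounds; i++ {
        var buf bytes.Buffer
        w, _ := gzip.NewWriterLevel(&amp;buf, level)
        w.Write(html)
        w.Close()
        size = buf.Len()
    }
    elapsed := time.Since(start)
    fmt.Printf("level %d: %d bytes down to %d bytes, %.1f MB/s\n",
        level, len(html), size, float64(len(html))*rounds/elapsed.Seconds()/1e6)
}</code></pre>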
    <div>
      <h4>gzip</h4>
      <a href="#gzip">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2pi68sRQQHPJLXZO5vIbhW/72c6e8568f4d95e3e8d592dd72577ea8/gzip_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/S2VrG4fcrmbDUeOtJ453S/bbc5092e3565339bb7cb44ba7c8dd14d/gzip_all_core.png" />
            
            </figure><p>When using gzip, at the single core level Skylake is the clear winner. Despite having a lower frequency than Broadwell, it seems that a lower branch misprediction penalty helps it pull ahead. The Falkor core is not far behind, especially at lower quality settings. At the system level Falkor performs significantly better, thanks to its higher core count. Note how well gzip scales across multiple cores.</p>
    <div>
      <h4>brotli</h4>
      <a href="#brotli">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5LA2CLLkYa9iCe1eQzgs6x/1ff12aa354d80166407fe4f9fd538d78/brot_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1cuPwzCY4eJzgsOXmNrLXu/f7be60b7ac3497c13b3c0a2abe6ac522/brot_all_core.png" />
            
            </figure><p>With brotli on a single core the situation is similar: Skylake is the fastest, but Falkor is not far behind, and at quality setting 9 Falkor is actually faster. Brotli at quality level 4 performs very similarly to gzip at level 5, while actually compressing slightly better (8,010B vs 8,187B).</p><p>When performing many-core compression, the situation becomes a bit messy. For levels 4, 5 and 6 brotli scales very well. At levels 7 and 8 we start seeing lower performance per core, bottoming out at level 9, where we get less than 3x the single core performance while running on all cores.</p><p>My understanding is that at those quality levels brotli consumes significantly more memory and starts thrashing the cache. Scaling improves again at levels 10 and 11.</p><p>Bottom line for brotli: Falkor wins, since we would not consider going above quality 7 for dynamic compression.</p>
    <div>
      <h3>Golang</h3>
      <a href="#golang">
        
      </a>
    </div>
    <p>Golang is another very important language for Cloudflare. It is also one of the first languages to offer ARMv8 support, so one would expect good performance. I used some of the built-in benchmarks, but modified them to run on multiple goroutines.</p>
    <div>
      <h4>Go crypto</h4>
      <a href="#go-crypto">
        
      </a>
    </div>
    <p>I would like to start the benchmarks with crypto performance. Thanks to OpenSSL we have good reference numbers, and it is interesting to see just how good the Go library is.</p>
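    <p>To measure whole-system rather than single-core throughput, a standard library benchmark can be spread across goroutines with <code>testing.B.RunParallel</code>. A minimal sketch of how the built-in P-256 signing benchmark might be adapted; this is an illustration, not the exact harness used here:</p>
            <pre><code>package bench

import (
    "crypto/ecdsa"
    "crypto/elliptic"
    "crypto/rand"
    "testing"
)

// BenchmarkSignP256Parallel runs ECDSA P-256 signing on GOMAXPROCS
// goroutines, turning the single-threaded built-in benchmark into a
// whole-system throughput measurement.
func BenchmarkSignP256Parallel(b *testing.B) {
    priv, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
    hashed := []byte("testing")
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            ecdsa.Sign(rand.Reader, priv, hashed)
        }
    })
}</code></pre>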
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/64U9LsIbiAKp7Ut75mMmhj/c73c1142233e260e8733b079dac06ff8/go_pub_key_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5e9MFqm53p78KdB211Z8rZ/bc16992088b6aa826819bec645f99581/go_pub_key_all_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3TZyK91xQAJk4yfvYLnVjZ/c33931e0ead0f425c157746b9d637fc5/go_sym_key_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1S2LKr04vDIi1WrrZNaJn1/827505e17425e0521142cb397e209504/go_sym_key_all_core.png" />
            
            </figure><p>As far as Go crypto is concerned, ARM and Intel are not even on the same playground. Go has very optimized assembly code for ECDSA, AES-GCM and ChaCha20-Poly1305 on Intel. It also has Intel-optimized math functions, used in RSA computations. All of those are missing for ARMv8, putting it at a big disadvantage.</p><p>Nevertheless, the gap can be bridged with relatively small effort, and we know that with the right optimizations performance can be on par with OpenSSL. Even a very minor change, such as implementing the function <a href="https://go-review.googlesource.com/c/go/+/76270">addMulVVW</a> in assembly, led to an over tenfold improvement in RSA performance, putting Falkor ahead of both Broadwell and Skylake with 8,009 signatures/second.</p><p>Another interesting thing to note is that on Skylake, the Go ChaCha20-Poly1305 code, which uses AVX2, performs almost identically to the OpenSSL AVX-512 code; this is again due to AVX2 running at higher clock speeds.</p>
    <div>
      <h4>Go gzip</h4>
      <a href="#go-gzip">
        
      </a>
    </div>
    <p>Next up in Go performance is gzip. Here again we have a reference point in pretty well optimized code, and we can compare Go against it. In the case of the Go gzip library, there are no Intel-specific optimizations in place.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1GbKt2HcdJT5eNMdfsLw4d/7c55522afbfa58653a91c28479de8c3d/go_gzip_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/59TTlWALNqIkryEc7IwOnB/73ee80ab27313c3f55d3cb22c06b6704/go_gzip_all_core.png" />
            
            </figure><p>Gzip performance is pretty good. Single core Falkor performance is way below both Intel processors, but at the system level it manages to outperform Broadwell, while still lagging behind Skylake. Since we already know that Falkor outperforms both when C is used, this can only mean that Go’s backend for ARMv8 is still pretty immature compared to gcc’s.</p>
    <div>
      <h4>Go regexp</h4>
      <a href="#go-regexp">
        
      </a>
    </div>
    <p>Regexp is widely used in a variety of tasks, so its performance is quite important too. I ran the built-in benchmarks on 32KB strings.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2AhmqRQhO37Z03KXfz9PUk/9b39646f7e800c9a7c8565134d516815/go_regexp_easy_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6PH0IDtvODvYcGbECgOvEt/fbb7c4a099ad2b478aa12b8f3b7fb0a0/go_regexp_easy_all_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eAhDauk00ZR4A37NPqQm5/8f1c94f9d34087b57b8672a4928e76a7/go_regexp_comp_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ytsG1skhi4B1v24tqlrWw/d0d803c543a990854bd60916ed03b089/go_regexp_comp_all_core.png" />
            
            </figure><p>Go regexp performance is not very good on Falkor. In the medium and hard tests it takes second place, thanks to the higher core count, but Skylake is still significantly faster.</p><p>Some profiling shows that a lot of the time is spent in the function <code>bytes.IndexByte</code>. This function has an assembly implementation for amd64 (<code>runtime.indexbytebody</code>), but only a generic Go implementation on ARM. The easy regexp tests spend most of their time in this function, which explains the even wider gap there.</p>
    <div>
      <h4>Go strings</h4>
      <a href="#go-strings">
        
      </a>
    </div>
    <p>Another important library for a web server is the Go strings library. I only tested the basic Replacer type here.</p>
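    <p>For reference, a <code>Replacer</code> is built once and then applies all of its substitutions in a single pass over the input; the replacement pair below is just an arbitrary example:</p>
            <pre><code>package main

import (
    "fmt"
    "strings"
)

// Build the Replacer once and reuse it; construction compiles the
// substitution table, and Replace then runs in a single pass.
var upgrader = strings.NewReplacer("http://", "https://")

func main() {
    fmt.Println(upgrader.Replace("see http://example.com and http://blog.cloudflare.com"))
}</code></pre>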
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/G20n7Ru5izVdUBOMM8af7/9a257ef47b5a38e3f745a22da7f36da5/go_str_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2W6GTABFY1t23YOIhyLc1A/ce51cb7081718a724b797f39e7ff80e2/go_str_all_core.png" />
            
            </figure><p>In this test, Falkor again lags behind, losing even to Broadwell. Profiling shows significant time spent in the function <code>runtime.memmove</code>. Guess what? It has highly optimized assembly code for amd64 that uses AVX2, but only very simple ARM assembly that copies 8 bytes at a time. By changing three lines in that code and using the LDP/STP instructions (load pair/store pair) to copy 16 bytes at a time, I improved the performance of memmove by 30%, which resulted in 20% faster EscapeString and UnescapeString performance. And that is just scratching the surface.</p>
    <div>
      <h4>Go conclusion</h4>
      <a href="#go-conclusion">
        
      </a>
    </div>
    <p>Go support for aarch64 is quite disappointing. I am very happy to say that everything compiles and works flawlessly, but on the performance side, things should get better. It seems like the enablement effort so far was concentrated on the compiler back end, and the library was left largely untouched. There are a lot of low-hanging optimization fruits out there, as my 20-minute fix for <a href="https://go-review.googlesource.com/c/go/+/76270">addMulVVW</a> clearly shows. Qualcomm and other ARMv8 vendors intend to put significant engineering resources into amending this situation, but really anyone can contribute to Go. So if you want to leave your mark, now is the time.</p>
    <div>
      <h3>LuaJIT</h3>
      <a href="#luajit">
        
      </a>
    </div>
    <p>Lua is the glue that holds Cloudflare together.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/eTdeElawEbvCLpjM9XMUa/8527a83cf68acc1e6a48af94860b190c/luajit_1_core.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ohY0uFyotdk0eJt9lU725/472f32803de7ecd6e2d95f27948aa85a/luajit_all_cores.png" />
            
            </figure><p>Except for the binary_trees benchmark, the performance of LuaJIT on ARM is very competitive. It wins two benchmarks and is almost tied in a third.</p><p>That being said, binary_trees is a very important benchmark, because it triggers many memory allocations and garbage collection cycles. It will require deeper investigation in the future.</p>
    <div>
      <h3>NGINX</h3>
      <a href="#nginx">
        
      </a>
    </div>
    <p>For the NGINX workload, I decided to generate a load that would resemble an actual server.</p><p>I set up a server that serves the HTML file used in the gzip benchmark over https, with the ECDHE-ECDSA-AES128-GCM-SHA256 cipher suite.</p><p>It also uses LuaJIT to redirect the incoming request and to remove all line breaks and extra spaces from the HTML file, while adding a timestamp. The HTML is then compressed using brotli at quality 5.</p><p>Each server was configured to run as many workers as it has virtual CPUs: 40 for Broadwell, 48 for Skylake and 46 for Falkor.</p><p>As the client for this test, I used the <a href="https://github.com/rakyll/hey">hey</a> program, running from 3 Broadwell servers.</p><p>Concurrently with the test, we took power readings from the respective BMC units of each server.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3d3OUyJ1m0q11IoWUNCyTp/5cde6f742d596d71d14e5f1355a4b570/nginx.png" />
            
            </figure><p>With the NGINX workload, Falkor handled almost the same number of requests as the Skylake server, and both significantly outperformed Broadwell. The power readings taken from the BMC show that Falkor did so while consuming less than half the power of the other processors. That means Falkor managed 214 requests/watt, versus Skylake’s 99 requests/watt and Broadwell’s 77.</p><p>I was a bit surprised to see Skylake and Broadwell consume about the same amount of power, given that both are manufactured with the same process and Skylake has more cores.</p><p>The low power consumption of Falkor is not surprising: Qualcomm processors are known for their great power efficiency, which has allowed them to be a dominant player in the mobile phone CPU market.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The engineering sample of Falkor we got certainly impressed me a lot. This is a huge step up from any previous attempt at ARM-based servers. Core for core, the Intel Skylake is certainly far superior, but when you look at the system level the performance becomes very attractive.</p><p>The production version of the Centriq SoC will feature up to 48 Falkor cores, running at a frequency of up to 2.6GHz, for a potential additional 8% of performance.</p><p>Obviously the Skylake server we tested is not the flagship Platinum unit that has 28 cores, but those 28 cores come with both a big price tag and an over-200W TDP, whereas we are interested in improving our bang-for-buck metric and performance per watt.</p><p>Currently, my main concern is the weak Go language performance, but that is bound to improve quickly once ARM-based servers start gaining some market share.</p><p>Both C and LuaJIT performance is very competitive, and in many cases outperforms the Skylake contender. In almost every benchmark Falkor shows itself to be a worthy upgrade from Broadwell.</p><p>The largest win by far for Falkor is its low power consumption. Although it has a TDP of 120W, during my tests it never went above 89W (in the Go benchmark). In comparison, Skylake and Broadwell both went over 160W, while the TDP of the two CPUs is 170W.</p><p><i>If you enjoy testing and selecting hardware on behalf of millions of Internet properties, come <a href="https://www.cloudflare.com/careers/">join us</a>.</i></p> ]]></content:encoded>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[LUA]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Cryptography]]></category>
            <guid isPermaLink="false">3Rk7Ip66PBd0Hn31OM4NuO</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[A Very WebP New Year from Cloudflare]]></title>
            <link>https://blog.cloudflare.com/a-very-webp-new-year-from-cloudflare/</link>
            <pubDate>Wed, 21 Dec 2016 14:00:00 GMT</pubDate>
<description><![CDATA[ Cloudflare has an automatic image optimization feature called Polish, available to paid plan users. It recompresses images and strips excess data, speeding up delivery to browsers. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare has an automatic image optimization feature called <a href="/introducing-polish-automatic-image-optimizati/">Polish</a>, available to customers on paid plans. It recompresses images and removes unnecessary data so that they are delivered to browsers more quickly.</p><p>Up until now, Polish has not changed image types when optimizing (even if, for example, a PNG might sometimes have been smaller than the equivalent JPEG). But a new feature in Polish allows us to swap out an image for an equivalent image compressed using Google’s WebP format when the browser is capable of handling WebP and delivering that type of image would be quicker.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6RCs2VzrEL7pYO3RNWBRYa/5434d04bd47f702aa548ea48985c2aff/holly.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC-BY 2.0</a> <a href="https://www.flickr.com/photos/john47kent/5307525503/">image</a> by <a href="https://www.flickr.com/photos/john47kent/">John Stratford</a></p>
    <div>
      <h3>What is WebP?</h3>
      <a href="#what-is-webp">
        
      </a>
    </div>
<p>The main image formats used on the web haven’t changed much since the early days (apart from the SVG vector format, PNG was the last one to establish itself, <a href="https://en.wikipedia.org/wiki/Portable_Network_Graphics#History_and_development">almost two decades ago</a>).</p><p><a href="https://en.wikipedia.org/wiki/WebP">WebP</a> is a newer image format for the web, proposed by Google. It takes advantage of progress in <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-image-compression/">image compression techniques</a> since formats such as JPEG and PNG were designed. It is often able to compress images into significantly less data than the older formats.</p><p>WebP is versatile and able to replace the three main raster image formats used on the web today:</p><ul><li><p>WebP can do lossy compression, so it can be used instead of JPEG for photographic and photo-like images.</p></li><li><p>WebP can do lossless compression, and supports an alpha channel, meaning images can have transparent regions. So it can be used instead of PNG, such as for images with sharp transitions that should be reproduced exactly (e.g. line art and graphic design elements).</p></li><li><p>WebP images can be animated, so the format can be used as a replacement for animated GIF images.</p></li></ul><p>Currently, the main browser that supports WebP is Google’s Chrome (both on desktop and mobile devices). See <a href="http://caniuse.com/#feat=webp">the WebP page on caniuse.com</a> for more details.</p>
    <div>
      <h3>Polish WebP conversion</h3>
      <a href="#polish-webp-conversion">
        
      </a>
    </div>
    <p>Customers on the Pro, Business, and Enterprise plans can enable the automatic creation of WebP images by checking the WebP box in the Polish settings for a zone (these are found on the “Speed” page of the dashboard):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Beloplqsl1SNVyjFSOyl4/d6853afc514b5a6fc9cf94758d629695/polish-webp.png" />
            
</figure><p>When this is enabled, Polish will optimize images just as it always has. But it will also convert the image to WebP if WebP can shrink the image data more than the original format. These WebP images are only returned to web browsers that indicate they support WebP (e.g. Google Chrome), so most websites using Polish should be able to benefit from WebP conversion.</p><p>(Although Polish can now produce WebP images by converting them from other formats, it can't consume WebP images to optimize them. If you put a WebP image on an origin site, Polish won't do anything with it. Until the WebP ecosystem grows and matures, it is unclear whether attempting to optimize WebP is worthwhile.)</p><p>Polish has two modes: <i>lossless</i> and <i>lossy</i>. In lossless mode, JPEG images are optimized to remove unnecessary data, but the image displayed is unchanged. In lossy mode, Polish reduces the quality of JPEG images in a way that should not have a significant visible effect, but allows it to further reduce the size of the image data.</p><p>These modes are respected when JPEG images are converted to WebP. In lossless mode, the conversion is done in a way that preserves the image as faithfully as possible (due to the nature of the conversion, the resulting WebP might not be exactly identical, but there are unlikely to be any visible differences). In lossy mode, the conversion sacrifices a little quality in order to shrink the image data further, but as before, there should not be a significant visible effect.</p><p>These modes do not affect PNGs and GIFs, as these are lossless formats and so Polish will preserve images in those formats exactly.</p><p>Note that WebP conversion does not change the URLs of images, even if the file extension in the URL implies a different format. For example, a JPEG image at <code>https://example.com/picture.jpg</code> that has been converted to WebP will still have that same URL. The “Content-Type” HTTP header tells the browser the true format of an image.</p>
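<p>The content negotiation involved is simple. Here is a minimal sketch of the decision (not Polish's actual implementation; the function and its inputs are hypothetical) based on the browser's Accept header:</p>
<pre><code>def pick_variant(accept_header, jpeg_path, jpeg_size, webp_path, webp_size):
    """Return (path, content_type) for one optimized image."""
    # WebP-capable browsers advertise support in the Accept header
    # (Chrome includes "image/webp" among its accepted types).
    supports_webp = "image/webp" in accept_header
    # Only swap formats when the WebP version is actually smaller.
    if supports_webp and webp_size &lt; jpeg_size:
        return webp_path, "image/webp"  # Content-Type reveals the true format
    return jpeg_path, "image/jpeg"
</code></pre>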
    <div>
      <h3>By the Numbers</h3>
      <a href="#by-the-numbers">
        
      </a>
    </div>
<p>A few studies have been published on how well WebP compresses images compared with established formats. These studies provide a useful picture of how WebP performs. But before we released our WebP support, we decided to do a survey based on the context in which we planned to use WebP:</p><ul><li><p>We evaluated WebP based on a collection of images gathered from the websites of our customers. The corpus consisted of 23,500 images (JPEG, PNG and GIFs).</p></li><li><p>Some studies compare WebP with JPEG by taking uncompressed images and compressing them to JPEG and WebP directly. But we wanted to know what happens when we convert an image that has already been compressed as a JPEG. In a sense this is an unfair test, because a JPEG may contain artifacts due to compression that would not be present in the original raw image, and conversion to WebP may try to retain those artifacts. But it is such conversions that matter for our use of WebP (this consideration does not apply to PNG and GIF conversions, because they are lossless).</p></li><li><p>We’re not just interested in whether WebP conversion can shrink images found on the web. We want to know how much WebP allows Polish to reduce the size further than it already does, thus providing a real end-user benefit. So our survey also includes the results of Polish without WebP.</p></li><li><p>In some cases, converting to WebP does not produce a result smaller than the optimized image in the original format. In such cases, we discard the WebP image. So the figures presented below do not penalize WebP for such cases.</p></li></ul><p>Here is a chart showing the results of Polish, with and without WebP conversion. For each format, the average original image size is normalized to 100%, and the average sizes after Polishing are shown relative to that.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5YazeQmjUBmLEir39hYrUh/bab2f4768bf57f5a1f8aecd3f07d9d9a/webp-chart.png" />
            
</figure><p>Here are the average savings corresponding to the chart:</p><table><tr><td><p><b>Original Format</b></p></td><td><p><b>Polish without WebP</b></p></td><td><p><b>Polish using WebP</b></p></td></tr><tr><td><p>JPEG (with Polish lossless mode)</p></td><td><p>9%</p></td><td><p>19%</p></td></tr><tr><td><p>JPEG (with Polish lossy mode)</p></td><td><p>34%</p></td><td><p>47%</p></td></tr><tr><td><p>PNG</p></td><td><p>16%</p></td><td><p>38%</p></td></tr><tr><td><p>GIF</p></td><td><p>3%</p></td><td><p>16%</p></td></tr></table><p>(The saving is calculated as 1 - (polished size) / (original size), expressed as a percentage.)</p><p>As you can see, WebP conversion achieves significant size improvements not only for JPEG images, but also for PNG and GIF images. We believe supporting WebP will result in lower bandwidth and faster website delivery.</p>
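<p>For concreteness, this is the computation behind each cell of the table (an illustrative one-liner, not part of Polish itself):</p>
<pre><code>def polish_saving(original_size, polished_size):
    """Percentage saving: e.g. 100 KB polished down to 53 KB gives 47%."""
    return 100.0 * (1 - polished_size / original_size)
</code></pre>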
    <div>
      <h3>… and a WebP New Year</h3>
      <a href="#and-a-webp-new-year">
        
      </a>
    </div>
    <p>WebP does not yet have the same level of browser support as JPEG, PNG and GIF, but we are excited about its potential to streamline web pages. Polish WebP conversion allows our customers to adopt WebP with a simple change to the settings in the Cloudflare dashboard. So, if you are on one of our <a href="https://www.cloudflare.com/plans/">paid plans</a>, we encourage you to try it out today.</p><p>PS — Want to help optimize the web? We’re <a href="https://www.cloudflare.com/join-our-team/">hiring</a>.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Polish]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[WebP]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">7M6kU8b93kMUiHWhgQTkjv</guid>
            <dc:creator>David Wragg</dc:creator>
        </item>
        <item>
            <title><![CDATA[Results of experimenting with Brotli for dynamic web content]]></title>
            <link>https://blog.cloudflare.com/results-experimenting-brotli/</link>
            <pubDate>Fri, 23 Oct 2015 14:24:50 GMT</pubDate>
            <description><![CDATA[ Compression is one of the most important tools CloudFlare has to accelerate website performance. Compressed content takes less time to transfer, and consequently reduces load times. ]]></description>
<content:encoded><![CDATA[ <p>Compression is one of the most important tools CloudFlare has to accelerate website performance. Compressed content takes less time to transfer, and consequently reduces load times. On expensive mobile data plans, compression even saves money for consumers. However, compression is not free: it is one of the most computationally expensive operations our servers perform, and the better the compression ratio we want, the more effort we have to spend.</p><p>The most popular compression format on the web is gzip. We put a great deal of effort into improving the performance of our gzip compression, so we can compress on the fly with fewer CPU cycles. Recently, a potential replacement for gzip, called Brotli, was announced by Google. As early adopters of many technologies, we at CloudFlare want to see for ourselves if it is as good as claimed.</p><p>This post takes a look at a bit of history behind gzip and Brotli, followed by a performance comparison.</p>
    <div>
      <h3>Compression 101</h3>
      <a href="#compression-101">
        
      </a>
    </div>
    <p>Many popular lossless compression algorithms rely on LZ77 and Huffman coding, so it’s important to have a basic understanding of these two techniques before getting into gzip or Brotli.</p>
    <div>
      <h4>LZ77</h4>
      <a href="#lz77">
        
      </a>
    </div>
<p>LZ77 is a simple technique developed by Abraham Lempel and Jacob Ziv in 1977 (hence the name). Let's call the input to the algorithm a string (a sequence of bytes, not necessarily letters) and each consecutive sequence of bytes in the input a substring. LZ77 compresses the input string by replacing some of its substrings with pointers (or backreferences) to an identical substring previously encountered in the input.</p><p>The pointer usually has the form of <code>&lt;length, distance&gt;</code>, where length indicates the number of identical bytes found, and distance indicates how many bytes separate the current occurrence of the substring from the previous one. For example, the string <code>abcdeabcdf</code> can be compressed with LZ77 to <code>abcde&lt;4,5&gt;f</code>, and <code>aaaaaaaaaa</code> can be compressed to simply <code>a&lt;9,1&gt;</code>. When the decompressor encounters a backreference, it simply copies the required number of bytes from the already decompressed output, which makes decompression very fast. Here is a nice illustration of LZ77 from my previous blog post, <a href="/improving-compression-with-preset-deflate-dictionary/">Improving compression with a preset DEFLATE dictionary</a>. The input:</p><pre><code>Little bunny Foo Foo
Went hopping through the forest
Scooping up the field mice
And bopping them on the head
Down came the Good Fairy, and she said
"Little bunny Foo Foo
I don't want to see you
Scooping up the field mice
And bopping them on the head."</code></pre><p>The output, with backreferences shown as <code>&lt;length,distance&gt;</code> pairs:</p><pre><code>Little bunny Foo&lt;5,4&gt;Went hopping through&lt;3,8&gt;e forest Scoo&lt;5,28&gt;up&lt;6,23&gt;ield mice And b&lt;9,58&gt;em on&lt;5,35&gt;head Down came&lt;5,19&gt;Good Fairy, a&lt;3,55&gt;s&lt;3,20&gt;said "L&lt;20,149&gt;I don't wa&lt;3,157&gt;to see you&lt;56,141&gt;."</code></pre><p>The deflate algorithm managed to reduce the original text from 251 characters to just 152 tokens! Those tokens are later compressed further by Huffman coding.</p>
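<p>To make the mechanics concrete, here is a toy LZ77 compressor in Python. This is an illustrative sketch only; real implementations such as zlib's use hash chains rather than this naive scan:</p>
<pre><code>def lz77_compress(data, window=32768, min_len=3):
    """Toy LZ77: returns a list of literals and (length, distance) pairs."""
    out, i = [], 0
    while i &lt; len(data):
        best_len, best_dist = 0, 0
        # Naively scan the sliding window for the longest earlier match.
        for j in range(max(0, i - window), i):
            k = 0
            while i + k &lt; len(data) and data[j + k] == data[i + k]:
                k += 1
            if k &gt; best_len:
                best_len, best_dist = k, i - j
        if best_len &gt;= min_len:
            out.append((best_len, best_dist))  # emit a backreference
            i += best_len
        else:
            out.append(data[i])                # emit a literal
            i += 1
    return out

# lz77_compress("abcdeabcdf") yields ['a', 'b', 'c', 'd', 'e', (4, 5), 'f']
</code></pre>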
    <div>
      <h4>Huffman Coding</h4>
      <a href="#huffman-coding">
        
      </a>
    </div>
<p>Huffman coding is another lossless compression algorithm. Developed by David Huffman back in the 1950s, it is used in many compression formats, including JPEG. A Huffman code is a type of prefix code, where, given an alphabet and an input, frequently occurring characters are replaced by shorter bit sequences and rarely occurring characters are replaced with longer sequences.</p><p>The code can be expressed as a binary tree, where the leaf nodes are the literals of the alphabet and the two edges from each node are marked with 0 and 1. To decode the next character, the decompressor walks the tree from the root until it reaches a leaf.</p><p>Each compression format uses Huffman coding differently, and for our little example we will create a Huffman code for an alphabet that includes only the literals and the length codes we actually used. To start, we must count the frequency of each letter in the LZ77-compressed text:</p><p>(space) - 19, o - 14, e - 11, n - 8, t - 7, a - 6, d - 6, i - 6, 3 - 4, 5 - 4, h - 4, s - 4, u - 4, c - 3, m - 3, p - 3, r - 3, y - 3, (") - 2, F - 2, L - 2, b - 2, g - 2, l - 2, w - 2, (') - 1, (,) - 1, (.) - 1, A - 1, D - 1, G - 1, I - 1, 9 - 1, 20 - 1, S - 1, 56 - 1, W - 1, 6 - 1, f - 1.</p><p>We can then use the algorithm from <a href="https://en.wikipedia.org/wiki/Huffman_coding">Wikipedia</a> to build this Huffman code (binary):</p><p>(space) - 101, o - 000, e - 1101, n - 0101, t - 0011, a - 11101, d - 11110, i - 11100, 3 - 01100, 5 - 01111, h - 10001, s - 11000, u - 10010, c - 00100, m - 111111, p - 110011, r - 110010, y - 111110, (") - 011010, F - 010000, L - 100000, b - 011101, g - 010011, l - 011100, w - 011011, (') - 0100101, (,) - 1001110, (.) - 1001100, A - 0100011, D - 0100010, G - 1000011, I - 0010110, 9 - 1000010, 20 - 0010111, S - 0010100, 56 - 0100100, W - 0010101, 6 - 1001101, f - 1001111.</p><p>Here the most frequent letters, space and 'o', got the shortest codes, only 3 bits long, whereas the letters that occur only once got 7-bit codes. If we were to represent the alphabet of 256 bytes and some length tokens with a fixed-length code, we would require 9 bits for every letter.</p><p>Now apply the code to the LZ77 output (leaving the distance tokens untouched) and we get:</p><p>100000 11100 0011 0011 011100 1101 101 011101 10010 0101 0101 111110 101 010000 000 000 01111 4 0010101 1101 0101 0011 101 10001 000 110011 110011 11100 0101 010011 101 0011 10001 110010 000 10010 010011 01100 8 10001 1101 101 1001111 000 110010 1101 11000 0011 101 0010100 00100 000 000 01111 28 10010 110011 1001101 23 11100 1101 011100 11110 101 111111 11100 00100 1101 101 0100011 0101 11110 101 011101 1000010 58 1101 111111 101 000 0101 01111 35 10001 1101 11101 11110 101 0100010 000 011011 0101 101 00100 11101 111111 1101 01111 19 1000011 000 000 11110 101 010000 11101 11100 110010 111110 1001110 101 11101 01100 55 11000 01100 20 11000 11101 11100 11110 101 011010 100000 0010111 149 0010110 101 11110 000 0101 0100101 0011 101 011011 11101 01100 157 0011 000 101 11000 1101 1101 101 111110 000 10010 0100100 141 1001100 011010</p><p>Assuming all the distance tokens require one byte to store, the total output length is 898 bits, compared to the 1356 bits we would require to store the LZ77 output and 2008 bits for the original input. We achieved a compression ratio of 31%, which is very good. In practice the saving would be smaller, since we must also encode the Huffman tree we used; otherwise it would be impossible to decompress the text.</p>
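<p>The greedy construction itself is compact. Below is an illustrative Python sketch (not how deflate or Brotli actually serialize their trees): repeatedly merge the two least frequent subtrees, then read the codes off the resulting tree.</p>
<pre><code>import heapq
from collections import Counter

def huffman_code(symbols):
    """Greedy Huffman construction: returns {symbol: bit string}.
    Symbols may be characters or tokens, but not tuples (tuples are
    used internally for tree nodes)."""
    freq = Counter(symbols)
    # Heap of (frequency, tiebreaker, subtree); a subtree is a symbol or a pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) &gt; 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        count += 1
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
    code = {}
    def assign(tree, prefix):
        if isinstance(tree, tuple):  # internal node
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:                        # leaf: an actual symbol
            code[tree] = prefix or "0"
    assign(heap[0][2], "")
    return code
</code></pre>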
    <div>
      <h4>gzip</h4>
      <a href="#gzip">
        
      </a>
    </div>
<p>The compression algorithm used in gzip is called "deflate," and it’s a combination of the LZ77 and Huffman algorithms discussed above.</p><p>On the LZ77 side, gzip has a minimum size of 3 bytes and a maximum of 258 bytes for the length tokens, and a maximum of 32768 bytes for the distance tokens. The maximum distance also defines the sliding window size the implementation uses for compression and decompression.</p><p>For Huffman coding, deflate has two alphabets. The first alphabet includes the input literals ("letters" 0-255), the end-of-block symbol (the "letter" 256) and the length tokens ("letters" 257-285). There are only 29 letters to encode all the possible lengths, and the tokens 265-284 will always have additional bits encoded in the stream to cover the entire range from 3 to 258. For example, the letter 257 indicates the minimal length of 3, whereas the letter 265 will indicate the length 11 if followed by the bit 0, and the length 12 if followed by the bit 1.</p><p>The second alphabet is for the distance tokens only. Its letters are the codewords 0 through 29. Similar to the length tokens, the codes 4-29 are followed by 1 to 13 additional bits to cover the whole range from 1 to 32768.</p><p>Using two distinct alphabets makes a lot of sense, because it allows us to represent the distance code with fewer bits while avoiding any ambiguity, since the distance codes always follow the length codes.</p><p>The most common implementation of gzip compression is the zlib library. At CloudFlare, we use a custom version of zlib <a href="/cloudflare-fights-cancer/">optimized for our servers</a>.</p><p>zlib has 9 preset quality settings for the deflate algorithm, labeled from 1 to 9. It can also be run with quality set to 0, in which case it does not perform any compression. These quality settings can be divided into two categories:</p><ul><li><p>Fast compression (levels 1-3): When a sufficiently long backreference is found, it is emitted immediately, and the search moves to the position at the end of the matched string. If the match is longer than a few bytes, the entire matched string will not be hashed, meaning its substrings will never be referenced in the future. Clearly, this reduces the compression ratio.</p></li><li><p>Slow compression (levels 4-9): Here, <i>every</i> substring is hashed; therefore, any substring can be referenced in the future. In addition, slow compression enables "lazy" matches. If at a given position a sufficiently long match is found, it will not be immediately emitted. Instead, the algorithm attempts to find a match at the next position. If a longer match is found, the algorithm will emit a single literal at the current position instead of the shorter match, and continue to the next position. As a rule, this results in a better compression ratio than levels 1-3.</p></li></ul><p>Other differences between the quality levels are how far to search for a backreference and how long a match should be before stopping.</p><p>A very decent compression ratio can be observed at levels 3-4, which are also quite fast. Increasing compression quality from level 4 and up gives incrementally smaller compression gains, while requiring substantially more time. Often, levels 8 and 9 will produce similarly compressed output, but level 9 will require more time to do it.</p><p>It is important to understand that the zlib implementation does not guarantee the best compression possible with the format, even when the quality setting is set to 9.
Instead, it uses a set of heuristics and optimizations that allow for very good compression at reasonable speed.</p><p>Better gzip compression can be achieved by the <a href="https://github.com/google/zopfli">zopfli</a> library, but it is significantly slower than zlib.</p>
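<p>The quality/speed tradeoff is easy to observe for yourself with Python's built-in zlib bindings (a quick sketch; the file name is a placeholder):</p>
<pre><code>import time
import zlib

def compare_zlib_levels(payload, levels=(1, 3, 6, 9)):
    """Print compressed size and time for several zlib quality settings."""
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(payload, level)
        elapsed = time.perf_counter() - start
        print(f"level {level}: {len(compressed):8d} bytes, {elapsed * 1e3:7.2f} ms")

# compare_zlib_levels(open("page.html", "rb").read())
</code></pre>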
    <div>
      <h4>Brotli</h4>
      <a href="#brotli">
        
      </a>
    </div>
<p>The Brotli format was developed by Google and has been refined for a while now. Here at CloudFlare, we built an nginx module that performs dynamic Brotli compression, and we deployed it on our <a href="https://http2.cloudflare.com/">test server</a> that supports HTTP/2 and other new features.</p><p>To check the Brotli compression on the test server, you can use the nightly build of the Firefox browser, which also supports this format.</p>
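<p>You don't need our nginx module to experiment with the format itself. Assuming Google's reference Python bindings are installed (<code>pip install brotli</code>), a quick comparison looks like this (a sketch; the file name is a placeholder):</p>
<pre><code>import zlib
import brotli  # Google's reference bindings: pip install brotli

data = open("page.html", "rb").read()
gz = zlib.compress(data, 8)            # our current baseline quality
br = brotli.compress(data, quality=5)  # mid-range Brotli quality
assert brotli.decompress(br) == data   # round-trips losslessly
print(f"original {len(data)}, gzip-8 {len(gz)}, brotli-5 {len(br)} bytes")
</code></pre>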
    <div>
      <h4>Brotli, deflate, and gzip</h4>
      <a href="#brotli-deflate-and-gzip">
        
      </a>
    </div>
<p>Brotli and deflate are very closely related. Brotli also uses the LZ77 and Huffman algorithms for compression, and both formats use a sliding window for backreferences. Gzip uses a fixed-size 32KB window, while Brotli can use any window size from 1KB to 16MB, in powers of 2 (minus 16 bytes). This means that the Brotli window can be up to 512 times larger than the deflate window. This difference is almost irrelevant in a web-server context, as text files larger than 32KB are in the minority.</p><p>Other differences include a smaller minimal match length (2 bytes in Brotli, compared to 3 bytes in deflate) and a larger maximal match length (16779333 bytes in Brotli, compared to 258 bytes in deflate).</p>
    <div>
      <h4>Static dictionary</h4>
      <a href="#static-dictionary">
        
      </a>
    </div>
    <p>Brotli also features a static dictionary. The "dictionary" supported by deflate can greatly improve compression, but has to be supplied independently and can only be addressed as part of the sliding window. The Brotli dictionary is part of the implementation and can be referenced from anywhere in the stream, somewhat increasing its efficiency for larger files. Moreover, different transformations can be applied to words of the dictionary effectively increasing its size.</p>
    <div>
      <h4>Context modeling</h4>
      <a href="#context-modeling">
        
      </a>
    </div>
<p>Brotli also supports something called context modeling. Context modeling is a feature that allows multiple Huffman trees for the same alphabet in the same block. For example, in deflate each block consists of a series of literals (bytes that could not be compressed by backreferencing) and <code>&lt;length, distance&gt;</code> pairs that define a backreference for copying. Literals and lengths form a single alphabet, while the distances are a different alphabet.</p><p>In Brotli, each block is composed of "commands". A command consists of 3 parts. The first part of each command is a word <code>&lt;insert, copy&gt;</code>. "Insert" defines the number of literals that will follow the word, and it may have the value of 0. "Copy" defines the number of bytes to copy from a backreference. The word <code>&lt;insert,copy&gt;</code> is followed by a sequence of "insert" literals: <code>&lt;lit&gt; … &lt;lit&gt;</code>. Finally, the command ends with a <code>&lt;distance&gt;</code>. Distance defines the backreference from which to copy the previously defined number of bytes. Unlike the distance in deflate, a Brotli distance can have additional meanings, such as references to the static dictionary, or references to recently used distances.</p><p>There are three alphabets here. One is for <code>&lt;lit&gt;</code>s and it simply covers all the possible byte values from 0 to 255. The other one is for <code>&lt;distance&gt;</code>s, and its size depends on the size of the sliding window and other parameters. The third alphabet is for the <code>&lt;insert,copy&gt;</code> length pairs, with 704 letters. Here <code>&lt;insert, copy&gt;</code> indicates a single letter in the alphabet, as opposed to <code>&lt;length,distance&gt;</code> pairs in deflate where length and distance are letters in distinct alphabets.</p><p>So why do we care about context modeling? It means that, for any of the alphabets, up to 256 different Huffman trees can be used in the same block. The switch between different trees is determined by "context". This can be useful when the compressed file consists of different types of characters. For example, binary data interleaved with UTF-8 strings, or a multilingual dictionary.</p><p>Whereas the basic idea behind Brotli remains identical to that of deflate, the way the data is encoded is very different. Those improvements allow for significantly better compression, but they also require a significant amount of processing. To alleviate the performance cost somewhat, Brotli drops the error-detection CRC check present in gzip.</p>
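<p>Schematically, a decoded Brotli command can be pictured as a small record (this is only an illustration of the structure described above, not the actual bitstream encoding):</p>
<pre><code>from dataclasses import dataclass

@dataclass
class BrotliCommand:
    """Schematic form of one &lt;insert, copy&gt; command (illustration only,
    not the actual bitstream encoding)."""
    insert: int      # number of literal bytes that follow
    copy: int        # number of bytes to copy via the backreference
    literals: bytes  # exactly `insert` literal bytes
    distance: int    # backreference; may also address the static dictionary
                     # or reuse a recently used distance
</code></pre>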
    <div>
      <h3>Benchmarking</h3>
      <a href="#benchmarking">
        
      </a>
    </div>
<p>The heaviest operation in both deflate and Brotli is the search for backward references. A higher level in zlib generally means that the backward reference search will attempt to find a better match for longer, but will not necessarily succeed. This leads to a significant increase in processing time. In contrast, Brotli trades longer searches for lighter operations that give a better return, such as context modeling, dictionary references, and more efficient data encoding in general. For that reason Brotli is, in theory, capable of outperforming zlib at some stage for similar compression ratios, as well as giving better compression at its maximal levels.</p><p>We decided to put those theories to the test. At CloudFlare, one of the primary use cases for Brotli/gzip is on-the-fly compression of textual web assets like HTML, CSS, and JavaScript, so that’s what we’ll be testing.</p><p>There is a tradeoff between compression speed and transfer speed. It is only beneficial to increase the compression ratio if the bytes saved would have taken longer to transfer than the extra time spent compressing them; otherwise, slower compression will actually slow the connection down. CloudFlare currently uses our own zlib implementation with quality set to 8, and that is our benchmarking baseline.</p><p>The benchmark set consists of 10,655 HTML, CSS, and JavaScript files. The benchmarks were performed on an Intel E3-1241 v3 CPU running at 3.5GHz.</p><p>The files were grouped into several size groups, since different websites have different size characteristics, and we must optimize for them all.</p><p>Although the quality setting in the Brotli implementation states the possible values are 0 to 11, we didn't see any differences between 0 and 1, or between 10 and 11, so only quality settings 1 to 10 are reported.</p>
    <div>
      <h3>Compression quality</h3>
      <a href="#compression-quality">
        
      </a>
    </div>
    <p>Compression quality is measured as (total size of all files after compression)/(total size of all files before compression)*100%. The columns represent the size distribution of the HTML, CSS, and JavaScript files used in the test in bytes (the file size ranges are shown using the [x,y) notation).</p><table><tr><td><p>
</p></td><td><p><b>[20,
1024)</b></p></td><td><p><b>[1024,
2048)</b></p></td><td><p><b>[2048,
3072)</b></p></td><td><p><b>[3072,
4096)</b></p></td><td><p><b>[4096,
8192)</b></p></td><td><p><b>[8192,
16384)</b></p></td><td><p><b>[16384,
32768)</b></p></td><td><p><b>[32768,
65536)</b></p></td><td><p><b>[65536,
+∞)</b></p></td><td><p><b>All files</b></p></td></tr><tr><td><p>zlib 1</p></td><td><p>65.0%</p></td><td><p>46.8%</p></td><td><p>42.8%</p></td><td><p>38.7%</p></td><td><p>34.4%</p></td><td><p>32.0%</p></td><td><p>29.9%</p></td><td><p>31.1%</p></td><td><p>31.4%</p></td><td><p>31.5%</p></td></tr><tr><td><p>zlib 2</p></td><td><p>64.9%</p></td><td><p>46.5%</p></td><td><p>42.5%</p></td><td><p>38.3%</p></td><td><p>33.8%</p></td><td><p>31.4%</p></td><td><p>29.2%</p></td><td><p>30.3%</p></td><td><p>30.5%</p></td><td><p>30.6%</p></td></tr><tr><td><p>zlib 3</p></td><td><p>64.9%</p></td><td><p>46.4%</p></td><td><p>42.3%</p></td><td><p>38.0%</p></td><td><p>33.6%</p></td><td><p>31.1%</p></td><td><p>28.8%</p></td><td><p>29.8%</p></td><td><p>29.9%</p></td><td><p>30.1%</p></td></tr><tr><td><p>zlib 4</p></td><td><p>64.5%</p></td><td><p>45.8%</p></td><td><p>41.6%</p></td><td><p>37.1%</p></td><td><p>32.6%</p></td><td><p>30.0%</p></td><td><p>27.8%</p></td><td><p>28.5%</p></td><td><p>28.7%</p></td><td><p>28.9%</p></td></tr><tr><td><p>zlib 5</p></td><td><p>64.5%</p></td><td><p>45.5%</p></td><td><p>41.3%</p></td><td><p>36.7%</p></td><td><p>32.1%</p></td><td><p>29.4%</p></td><td><p>27.1%</p></td><td><p>27.8%</p></td><td><p>27.8%</p></td><td><p>28.0%</p></td></tr><tr><td><p>zlib 6</p></td><td><p>64.4%</p></td><td><p>45.5%</p></td><td><p>41.3%</p></td><td><p>36.7%</p></td><td><p>32.0%</p></td><td><p>29.3%</p></td><td><p>27.0%</p></td><td><p>27.6%</p></td><td><p>27.6%</p></td><td><p>27.8%</p></td></tr><tr><td><p>zlib 7</p></td><td><p>64.4%</p></td><td><p>45.5%</p></td><td><p>41.3%</p></td><td><p>36.6%</p></td><td><p>32.0%</p></td><td><p>29.2%</p></td><td><p>26.9%</p></td><td><p>27.5%</p></td><td><p>27.5%</p></td><td><p>27.7%</p></td></tr><tr><td><p>zlib 8</p></td><td><p>64.4%</p></td><td><p>45.5%</p></td><td><p>41.3%</p></td><td><p>36.6%</p></td><td><p>32.0%</p></td><td><p>29.2%</p></td><td><p>26.9%</p></td><td><p>27.5%</p></td><td><p>27.4%</p></td><td><p>27.7%</p></td></tr><tr><td><p>zlib 9</p></td><td><p>64.4%</p></td><td><p>45.5%</p></td><td><p>41.3%</p></td><td><p>36.6%</p></td><td><p>32.0%</p></td><td><p>29.2%</p></td><td><p>26.9%</p></td><td><p>27.5%</p></td><td><p>27.4%</p></td><td><p>27.7%</p></td></tr><tr><td><p>brotli 1</p></td><td><p>61.3%</p></td><td><p>46.6%</p></td><td><p>42.7%</p></td><td><p>38.4%</p></td><td><p>33.8%</p></td><td><p>30.8%</p></td><td><p>28.6%</p></td><td><p>29.5%</p></td><td><p>28.6%</p></td><td><p>29.0%</p></td></tr><tr><td><p>brotli 2</p></td><td><p>61.9%</p></td><td><p>46.9%</p></td><td><p>42.8%</p></td><td><p>38.3%</p></td><td><p>33.6%</p></td><td><p>30.6%</p></td><td><p>28.3%</p></td><td><p>29.1%</p></td><td><p>28.3%</p></td><td><p>28.6%</p></td></tr><tr><td><p>brotli 3</p></td><td><p>61.8%</p></td><td><p>46.8%</p></td><td><p>42.7%</p></td><td><p>38.2%</p></td><td><p>33.3%</p></td><td><p>30.4%</p></td><td><p>28.1%</p></td><td><p>28.9%</p></td><td><p>28.0%</p></td><td><p>28.4%</p></td></tr><tr><td><p>brotli 4</p></td><td><p>53.8%</p></td><td><p>40.8%</p></td><td><p>38.6%</p></td><td><p>34.8%</p></td><td><p>30.9%</p></td><td><p>28.7%</p></td><td><p>27.0%</p></td><td><p>28.0%</p></td><td><p>27.5%</p></td><td><p>27.7%</p></td></tr><tr><td><p>brotli 5</p></td><td><p>49.9%</p></td><td><p>37.7%</p></td><td><p>35.7%</p></td><td><p>32.3%</p></td><td><p>28.7%</p></td><td><p>26.6%</p></td><td><p>25.2%</p></td><td><p>26.2%</p></td><td><p>26.0%</p></td><td><p>26.1%</p></td></tr><tr><td><p>brotli 
6</p></td><td><p>50.0%</p></td><td><p>37.7%</p></td><td><p>35.7%</p></td><td><p>32.3%</p></td><td><p>28.6%</p></td><td><p>26.5%</p></td><td><p>25.1%</p></td><td><p>26.0%</p></td><td><p>25.7%</p></td><td><p>25.9%</p></td></tr><tr><td><p>brotli 7</p></td><td><p>50.0%</p></td><td><p>37.6%</p></td><td><p>35.6%</p></td><td><p>32.3%</p></td><td><p>28.5%</p></td><td><p>26.4%</p></td><td><p>25.0%</p></td><td><p>25.9%</p></td><td><p>25.5%</p></td><td><p>25.7%</p></td></tr><tr><td><p>brotli 8</p></td><td><p>50.0%</p></td><td><p>37.6%</p></td><td><p>35.6%</p></td><td><p>32.3%</p></td><td><p>28.5%</p></td><td><p>26.4%</p></td><td><p>25.0%</p></td><td><p>25.9%</p></td><td><p>25.4%</p></td><td><p>25.6%</p></td></tr><tr><td><p>brotli 9</p></td><td><p>50.0%</p></td><td><p>37.6%</p></td><td><p>35.5%</p></td><td><p>32.2%</p></td><td><p>28.5%</p></td><td><p>26.4%</p></td><td><p>25.0%</p></td><td><p>25.8%</p></td><td><p>25.3%</p></td><td><p>25.5%</p></td></tr><tr><td><p>brotli 10</p></td><td><p>46.8%</p></td><td><p>33.4%</p></td><td><p>32.5%</p></td><td><p>29.4%</p></td><td><p>26.0%</p></td><td><p>23.9%</p></td><td><p>22.9%</p></td><td><p>23.8%</p></td><td><p>23.0%</p></td><td><p>23.3%</p></td></tr></table><p>Clearly, the compression achievable with Brotli is significant. On average, Brotli at the maximal quality setting produces 1.19X smaller results than zlib at the maximal quality. For files smaller than 1KB, the result is 1.38X smaller on average, a very impressive improvement that can probably be attributed to the use of the static dictionary.</p>
    <div>
      <h3>Compression speed</h3>
      <a href="#compression-speed">
        
      </a>
    </div>
    <p>Compression speed is measured as (total size of files before compression)/(total time to compress all files) and is reported in MB/s.</p><table><tr><td><p></p></td><td><p><b>[20,
1024)</b></p></td><td><p><b>[1024,
2048)</b></p></td><td><p><b>[2048,
3072)</b></p></td><td><p><b>[3072,
4096)</b></p></td><td><p><b>[4096,
8192)</b></p></td><td><p><b>[8192,
16384)</b></p></td><td><p><b>[16384,
32768)</b></p></td><td><p><b>[32768,
65536)</b></p></td><td><p><b>[65536,
+∞)</b></p></td><td><p><b>All files</b></p></td></tr><tr><td><p>zlib 1</p></td><td><p>5.9</p></td><td><p>21.8</p></td><td><p>34.4</p></td><td><p>43.4</p></td><td><p>62.1</p></td><td><p>89.8</p></td><td><p>117.9</p></td><td><p>127.9</p></td><td><p>139.6</p></td><td><p>125.5</p></td></tr><tr><td><p>zlib 2</p></td><td><p>5.9</p></td><td><p>21.7</p></td><td><p>34.3</p></td><td><p>43.0</p></td><td><p>61.2</p></td><td><p>87.6</p></td><td><p>114.3</p></td><td><p>123.1</p></td><td><p>130.7</p></td><td><p>118.9</p></td></tr><tr><td><p>zlib 3</p></td><td><p>5.9</p></td><td><p>21.7</p></td><td><p>34.0</p></td><td><p>42.4</p></td><td><p>60.5</p></td><td><p>84.8</p></td><td><p>108.0</p></td><td><p>114.5</p></td><td><p>114.9</p></td><td><p>106.9</p></td></tr><tr><td><p>zlib 4</p></td><td><p>5.8</p></td><td><p>20.9</p></td><td><p>32.2</p></td><td><p>39.8</p></td><td><p>54.8</p></td><td><p>74.9</p></td><td><p>93.3</p></td><td><p>97.5</p></td><td><p>96.1</p></td><td><p>90.7</p></td></tr><tr><td><p>zlib 5</p></td><td><p>5.8</p></td><td><p>20.6</p></td><td><p>31.4</p></td><td><p>38.3</p></td><td><p>51.6</p></td><td><p>68.4</p></td><td><p>82.0</p></td><td><p>81.3</p></td><td><p>73.2</p></td><td><p>71.6</p></td></tr><tr><td><p>zlib 6</p></td><td><p>5.8</p></td><td><p>20.6</p></td><td><p>31.2</p></td><td><p>37.9</p></td><td><p>50.6</p></td><td><p>64.0</p></td><td><p>73.7</p></td><td><p>70.2</p></td><td><p>57.5</p></td><td><p>58.0</p></td></tr><tr><td><p>zlib 7</p></td><td><p>5.8</p></td><td><p>20.5</p></td><td><p>31.0</p></td><td><p>37.4</p></td><td><p>49.6</p></td><td><p>60.8</p></td><td><p>67.4</p></td><td><p>64.6</p></td><td><p>51.0</p></td><td><p>52.0</p></td></tr><tr><td><p>zlib 8</p></td><td><p>5.8</p></td><td><p>20.5</p></td><td><p>31.0</p></td><td><p>37.2</p></td><td><p>48.8</p></td><td><p>53.2</p></td><td><p>56.6</p></td><td><p>56.5</p></td><td><p>41.6</p></td><td><p>43.1</p></td></tr><tr><td><p>zlib 9</p></td><td><p>5.8</p></td><td><p>20.6</p></td><td><p>30.8</p></td><td><p>37.3</p></td><td><p>48.6</p></td><td><p>51.7</p></td><td><p>56.6</p></td><td><p>54.2</p></td><td><p>40.4</p></td><td><p>41.9</p></td></tr><tr><td><p>brotli 1</p></td><td><p>3.4</p></td><td><p>12.8</p></td><td><p>20.4</p></td><td><p>25.9</p></td><td><p>37.8</p></td><td><p>57.3</p></td><td><p>80.0</p></td><td><p>94.1</p></td><td><p>105.8</p></td><td><p>91.3</p></td></tr><tr><td><p>brotli 2</p></td><td><p>3.4</p></td><td><p>12.4</p></td><td><p>19.5</p></td><td><p>24.4</p></td><td><p>35.2</p></td><td><p>52.3</p></td><td><p>71.2</p></td><td><p>82.0</p></td><td><p>89.0</p></td><td><p>78.8</p></td></tr><tr><td><p>brotli 3</p></td><td><p>3.4</p></td><td><p>12.3</p></td><td><p>19.0</p></td><td><p>23.7</p></td><td><p>34.0</p></td><td><p>49.8</p></td><td><p>67.4</p></td><td><p>76.3</p></td><td><p>81.5</p></td><td><p>73.0</p></td></tr><tr><td><p>brotli 4</p></td><td><p>2.0</p></td><td><p>7.6</p></td><td><p>11.9</p></td><td><p>15.2</p></td><td><p>22.2</p></td><td><p>33.1</p></td><td><p>44.7</p></td><td><p>51.9</p></td><td><p>58.5</p></td><td><p>51.0</p></td></tr><tr><td><p>brotli 5</p></td><td><p>2.0</p></td><td><p>5.2</p></td><td><p>8.0</p></td><td><p>10.3</p></td><td><p>15.0</p></td><td><p>22.0</p></td><td><p>29.7</p></td><td><p>33.3</p></td><td><p>32.8</p></td><td><p>30.3</p></td></tr><tr><td><p>brotli 6</p></td><td><p>1.8</p></td><td><p>3.8</p></td><td><p>5.5</p></td><td><p>7.0</p></td><td><p>10.5</p></td><td><p>16.3</p></td><td><p>23.5</p></td><td><p>28.6</p></td><td><p>28.4</p></td><td><p>25.6</p></td></tr><tr><td><p>brotli 
7</p></td><td><p>1.5</p></td><td><p>2.3</p></td><td><p>3.1</p></td><td><p>3.7</p></td><td><p>4.9</p></td><td><p>7.2</p></td><td><p>10.7</p></td><td><p>15.5</p></td><td><p>19.6</p></td><td><p>16.2</p></td></tr><tr><td><p>brotli 8</p></td><td><p>1.4</p></td><td><p>2.3</p></td><td><p>2.7</p></td><td><p>3.1</p></td><td><p>4.0</p></td><td><p>5.3</p></td><td><p>7.1</p></td><td><p>10.6</p></td><td><p>15.1</p></td><td><p>12.2</p></td></tr><tr><td><p>brotli 9</p></td><td><p>1.3</p></td><td><p>2.1</p></td><td><p>2.4</p></td><td><p>2.8</p></td><td><p>3.4</p></td><td><p>4.3</p></td><td><p>5.5</p></td><td><p>7.0</p></td><td><p>10.6</p></td><td><p>8.8</p></td></tr><tr><td><p>brotli 10</p></td><td><p>0.2</p></td><td><p>0.4</p></td><td><p>0.4</p></td><td><p>0.5</p></td><td><p>0.5</p></td><td><p>0.6</p></td><td><p>0.6</p></td><td><p>0.6</p></td><td><p>0.5</p></td><td><p>0.5</p></td></tr></table><p>On average for all files, we can see that Brotli at quality level 4 is slightly faster than zlib at quality level 8 (and 9) while having a comparable compression ratio. However, that is misleading. Most files are smaller than 64KB, and if we look only at those files, then Brotli 4 is actually 1.48X slower than zlib level 8!</p>
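<p>Both metrics are straightforward to compute. Here is a sketch of the measurement loop (Python's built-in zlib shown purely for illustration, not the implementation benchmarked above):</p>
<pre><code>import time
import zlib

def corpus_metrics(files, level=8):
    """Compression quality (%) and speed (MB/s), as defined above."""
    total_in = total_out = 0
    start = time.perf_counter()
    for data in files:
        total_in += len(data)
        total_out += len(zlib.compress(data, level))
    elapsed = time.perf_counter() - start
    quality = 100.0 * total_out / total_in  # smaller is better
    speed_mb_s = total_in / elapsed / 1e6   # input MB per second
    return quality, speed_mb_s
</code></pre>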
    <div>
      <h3>Connection speedup</h3>
      <a href="#connection-speedup">
        
      </a>
    </div>
<p>For on-the-fly compression, the most important question is how much time to invest in compression to make the data transfer faster. Because increasing compression quality only gives an incremental improvement over a given level, we need the extra bytes saved to outweigh the additional time spent compressing them.</p><p>Again, CloudFlare uses compression quality 8 with zlib, so that’s our baseline. For each quality setting of Brotli starting at 4 (which is somewhat comparable to zlib 8 in terms of both time and compression ratio), we compute the added compression speed as: ((total size for zlib 8) - (total size after compression with Brotli))/((total time for Brotli)-(total time for zlib 8)).</p><p>The results are reported in MB/s. Negative numbers indicate a lower compression ratio.</p><table><tr><td><p>
</p></td><td><p><b>[20,
1024)</b></p></td><td><p><b>[1024,
2048)</b></p></td><td><p><b>[2048,
3072)</b></p></td><td><p><b>[3072,
4096)</b></p></td><td><p><b>[4096,
8192)</b></p></td><td><p><b>[8192,
16384)</b></p></td><td><p><b>[16384,
32768)</b></p></td><td><p><b>[32768,
65536)</b></p></td><td><p><b>[65536,
+∞)</b></p></td><td><p><b>All files</b></p></td></tr><tr><td><p>brotli 4</p></td><td><p>0.33</p></td><td><p>0.56</p></td><td><p>0.52</p></td><td><p>0.47</p></td><td><p>0.42</p></td><td><p>0.44</p></td><td><p>-0.24</p></td><td><p>-2.86</p></td><td><p>0.02</p></td><td><p>0.00</p></td></tr><tr><td><p>brotli 5</p></td><td><p>0.44</p></td><td><p>0.55</p></td><td><p>0.60</p></td><td><p>0.61</p></td><td><p>0.71</p></td><td><p>0.97</p></td><td><p>1.01</p></td><td><p>1.08</p></td><td><p>2.24</p></td><td><p>1.58</p></td></tr><tr><td><p>brotli 6</p></td><td><p>0.36</p></td><td><p>0.36</p></td><td><p>0.37</p></td><td><p>0.37</p></td><td><p>0.45</p></td><td><p>0.63</p></td><td><p>0.69</p></td><td><p>0.86</p></td><td><p>1.52</p></td><td><p>1.12</p></td></tr><tr><td><p>brotli 7</p></td><td><p>0.28</p></td><td><p>0.20</p></td><td><p>0.19</p></td><td><p>0.18</p></td><td><p>0.19</p></td><td><p>0.23</p></td><td><p>0.24</p></td><td><p>0.34</p></td><td><p>0.72</p></td><td><p>0.52</p></td></tr><tr><td><p>brotli 8</p></td><td><p>0.26</p></td><td><p>0.20</p></td><td><p>0.17</p></td><td><p>0.15</p></td><td><p>0.15</p></td><td><p>0.17</p></td><td><p>0.15</p></td><td><p>0.21</p></td><td><p>0.48</p></td><td><p>0.35</p></td></tr><tr><td><p>brotli 9</p></td><td><p>0.25</p></td><td><p>0.19</p></td><td><p>0.15</p></td><td><p>0.13</p></td><td><p>0.13</p></td><td><p>0.13</p></td><td><p>0.11</p></td><td><p>0.13</p></td><td><p>0.30</p></td><td><p>0.24</p></td></tr><tr><td><p>brotli 10</p></td><td><p>0.03</p></td><td><p>0.04</p></td><td><p>0.04</p></td><td><p>0.04</p></td><td><p>0.03</p></td><td><p>0.03</p></td><td><p>0.02</p></td><td><p>0.02</p></td><td><p>0.02</p></td><td><p>0.02</p></td></tr></table><p>Those numbers are quite low, due to the slow speed of Brotli. To get any speedup, we really want those numbers to be greater than the connection speed. It seems that for files greater than 64KB, Brotli at quality setting 5 can speed up slow connections.</p><p>Keep in mind that on a real server, compression is only one of many tasks that share the CPU, so compression speeds would be slower there.</p>
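<p>Expressed in code, the break-even check behind the table is (an illustrative helper; sizes in bytes, times in seconds):</p>
<pre><code>def added_compression_speed(size_zlib8, time_zlib8, size_brotli, time_brotli):
    """MB/s of extra bytes saved per extra second spent compressing;
    switching pays off only when this exceeds the connection speed."""
    saved_bytes = size_zlib8 - size_brotli
    extra_time = time_brotli - time_zlib8
    return saved_bytes / extra_time / 1e6
</code></pre>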
    <div>
      <h3>Conclusions</h3>
      <a href="#conclusions">
        
      </a>
    </div>
<p>The current state of Brotli gives us some mixed impressions. There is no yes/no answer to the question "Is Brotli better than gzip?". It definitely looks like a big win for static content compression, but on the web, where content is dynamic, we also need to consider on-the-fly compression.</p><p>The way I see it, Brotli already has an advantage over zlib for large files (larger than 64KB) on slow connections. However, those constitute only 20% of our sampled dataset (and 80% of the total size).</p><p>Our Brotli module has a minimal size setting for Brotli compression that allows us to use gzip for smaller files and Brotli only for large ones.</p><p>It is important to remember that zlib has had the advantage of being the entire web community's optimization target for years, while Brotli is the development effort of a small but capable and talented team. There is no doubt that the current implementation will only improve with time.</p> ]]></content:encoded>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">499uvv9lvLrJ47pbY7HZTq</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Simple Helix chooses CloudFlare to ignite white-hot Magento performance]]></title>
            <link>https://blog.cloudflare.com/simple-helix-chooses-cloudflare-to-ignite-white-hot-magento-performance/</link>
            <pubDate>Tue, 01 Sep 2015 17:04:32 GMT</pubDate>
            <description><![CDATA[ Some months ago, we made a big bet on partnering with CloudFlare for performance improvements and website security for our Magento hosting customers. Customer experience is core to our business and relying on another company is a major deal.  ]]></description>
<content:encoded><![CDATA[ <p><i>Today’s guest blogger is George Cagle. George is a system administrator at Simple Helix, a CloudFlare partner.</i></p><p>Some months ago, we made a big bet on partnering with CloudFlare for performance improvements and website security for our Magento hosting customers. Customer experience is core to our business, and relying on another company is a major deal. CloudFlare is now included in Default-On mode for select Simple Helix hosting plans and can be added to any existing plan. The results have been great, and we wanted to share a couple of successes with the rest of the CloudFlare community.</p>
    <div>
      <h3>Testing the waters</h3>
      <a href="#testing-the-waters">
        
      </a>
    </div>
<p>The first thing one notices after melding their site with the worldwide CloudFlare <a href="https://www.cloudflare.com/features-cdn">CDN network</a> is just how fast a website becomes. In Simple Helix’s testing, we found that proper CloudFlare implementation can yield 100% speed increases, and an even faster 143% speed increase when paired with the <a href="https://www.cloudflare.com/railgun">Railgun™</a> web optimizer for dynamic content.</p><p>Adding CloudFlare will certainly improve performance, but it can also significantly improve security through the <a href="https://www.cloudflare.com/waf">Web Application Firewall</a> feature. The security benefits of having the CloudFlare service can be seen after just the first few days of adoption, as outlined below:</p><p>Total number of threats mitigated by CloudFlare in a week</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7NsKnptazEMY2zHX8ibJ5/e868d404812e487313adbd0d614d4f97/image_0.png" />
            
</figure><p>The results provided insight: "CloudFlare helped break down the problem in an elegant way that made threat assessments much easier for our customers to digest and make well-reasoned decisions based on the information presented," said Brian Rorex, systems administrator at Simple Helix. CloudFlare protects sites from some of the most common maladies that plague the modern Internet, like overly-aggressive crawler bots, botnet attacks, and DDoS attacks.</p>
    <div>
      <h3>Getting results</h3>
      <a href="#getting-results">
        
      </a>
    </div>
<p>Given our happiness with the performance of the CloudFlare service, we have used it to address some of our customers' unique performance challenges, with much success. One such Simple Helix client is a popular fashion accessory company that let us know they were launching a high-traffic media campaign that would increase traffic significantly for several days. We responded by putting them on CloudFlare as the first step in bolstering the company's infrastructure in preparation for the big day. When the tweet finally hit the Internet and the traffic ramped up, CloudFlare and the servers hardly broke a sweat, reducing the effective bandwidth usage at the origin by almost 70%.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1hvNetMi85ACNUto4fKWDw/b050d5ce7d9bd40755e40bc8e948608f/image_1.png" />
            
            </figure><p>Another Simple Helix customer with a popular apparel store routinely saw 300-500% spikes in traffic during regular sales events. Nesting their web servers behind the CloudFlare CDN evened out the traffic to an almost flat bandwidth usage graph and provided an 83.4% bandwidth savings at the origin server.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3qMTtUyEGJ79ObmlgFcyVm/caf912f0fd083ec81f96b2e017ebfc6c/image_2.png" />
            
</figure><p>Simple Helix is one of the leading hosting providers serving the Magento e-commerce market, with over 100,000 domains. Our customers vary from mom-and-pop shops to internationally recognized brands. A young, rapidly growing company, Simple Helix is constantly on the lookout for new technology that will improve the quality of service for its customers and set it apart from the pack.</p><p>We’re excited to make CloudFlare a standard part of setup at Simple Helix. This means that our customers get the performance and security benefits of CloudFlare without any additional work. If you currently run an e-commerce store that could take advantage of what we have to offer, please contact <a href="https://manage.simplehelix.com/submitticket.php?step=2&amp;deptid=2&amp;__hstc=86869831.707593530589295e70264e08352f1582.1435944664030.1439996175366.1439999612847.67&amp;__hssc=86869831.1.1439999612847&amp;__hsfp=2729578089">the Simple Helix team</a> to find out which plan is right for you.</p> ]]></content:encoded>
            <category><![CDATA[Railgun]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[WAF]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Partners]]></category>
            <guid isPermaLink="false">7moWo2v1ykcY0tAqOmGxP5</guid>
            <dc:creator>Guest Author</dc:creator>
        </item>
        <item>
            <title><![CDATA[Railgun v5 has landed: better, faster, lighter]]></title>
            <link>https://blog.cloudflare.com/railgun-v5-has-landed/</link>
            <pubDate>Mon, 31 Aug 2015 22:31:51 GMT</pubDate>
            <description><![CDATA[ Three years ago we launched Railgun, CloudFlare's origin network optimizer. Railgun allows us to cache the uncacheable to accelerate the connection between CloudFlare and our customers' origin servers.  ]]></description>
            <content:encoded><![CDATA[ <p>Three years ago we launched <a href="https://www.cloudflare.com/railgun">Railgun</a>, CloudFlare's origin network optimizer. Railgun allows us to <a href="/cacheing-the-uncacheable-cloudflares-railgun-73454/">cache the uncacheable</a> to accelerate the connection between CloudFlare and our customers' origin servers. That brings the benefit of a CDN to even dynamic content with no need for 'fast purging' or other tricks. With Railgun even dynamic, ever-changing pages benefit from caching.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4tGSSOyLcACP5qioD3Ssol/ef4edbd58835782afb34061a085b3e2f/2300190277_360853ae0d_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/thatguyfromcchs08/2300190277/in/photolist-4vg5r4-at4qz-oMjmcq-wnQ3p-dTxa4e-5otM4s-4dqdHU-9saDQZ-cu6Fc-4Pmroo-sBSNuJ-9dehck-ngDfDT-nUu6YR-5UgoU2-6jneZb-3a4fyp-7YsAXD-bZtZaN-a8M5vq-bEKeA8-GhFjt-roySGp-i14jSw-cAPypE-87LoH-5Uy3NF-6M6Pse-rVtN9-cfdeos-bfMtZP-u3Zw2b-8N97Uu-jamkM-7x2csx-7RyBfo-ibekvp-ibepqt-a6B8Ss-je9ZR9-toS7kJ-9tNfaF-5gt9Wg-6UbNjh-aCPkJ1-RvxwP-8qkQfq-9ndbXy-7HqKEz-4me8vi">image</a> by <a href="https://www.flickr.com/photos/thatguyfromcchs08/">Nathan E Photography</a></p><p>Over those three years Railgun has been deployed widely by our customers to accelerate the delivery of their web sites and <a href="/railgun-gives-our-ecommerce-sites-the-edge/">lower</a> their bandwidth costs.</p><p>Today we're announcing the availability of Railgun v5 with a number of significant improvements:</p>
    <div>
      <h4>We've substantially reduced memory utilization and CPU requirements</h4>
      <a href="#weve-substantially-reduced-memory-utilization-and-cpu-requirements">
        
      </a>
    </div>
    <p>Railgun performs delta compression on every request/response, requiring CPU (to perform the compression) and memory (to keep a cache of pages to delta against). Version 5 has undergone extensive optimization based on the performance of Railgun on large web sites and at hosting providers. Version 5 requires much less memory and much less CPU.</p>
    <div>
      <h4>A new, lighter weight, faster wire protocol</h4>
      <a href="#a-new-lighter-weight-faster-wire-protocol">
        
      </a>
    </div>
    <p>The original Railgun wire protocol that transfers requests and compressed responses between the customer server and CloudFlare's infrastructure has been completely replaced with a new, lighter-weight, fully binary protocol that is faster and uses less bandwidth.</p>
    <div>
      <h4>An extra layer of compression</h4>
      <a href="#an-extra-layer-of-compression">
        
      </a>
    </div>
    <p>We noticed in real-world tests that although delta compression provided <a href="/railgun-in-the-real-world/">incredible compression</a> and faster page load times, it was possible to squeeze out even greater compression by performing traditional non-delta compression in addition to the delta compression. This is now standard: all content is compressed, yielding an extra 10-15% compression.</p>
    <div>
      <h4>Streaming mode for large downloads</h4>
      <a href="#streaming-mode-for-large-downloads">
        
      </a>
    </div>
    <p>Large downloads are not delta compressed for performance reasons (the benefits of the delta compression are outweighed by the cost of compressing very large pages). To ensure that large pages are downloaded as quickly as possible, Railgun v5 provides an automatic streaming mode where the page is streamed from the origin server across the Internet to CloudFlare and on to the end web browser. This substantially reduces the time to download very large pages through Railgun.</p>
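    <p>As an illustration of the streaming idea (not Railgun's actual code, which is not public), here is a minimal Go sketch of a pass-through handler: the response body flows to the client chunk by chunk as it arrives from the origin, rather than being buffered in full. The origin hostname is invented.</p>
    <pre><code>package main

import (
	"io"
	"net/http"
)

func stream(w http.ResponseWriter, r *http.Request) {
	// Fetch the page from the (hypothetical) origin; header copying
	// is omitted for brevity.
	resp, err := http.Get("http://origin.example.com" + r.URL.Path)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	w.WriteHeader(resp.StatusCode)

	// io.Copy moves the body in fixed-size chunks, so a very large
	// download starts reaching the client immediately.
	io.Copy(w, resp.Body)
}

func main() {
	http.HandleFunc("/", stream)
	http.ListenAndServe(":8080", nil)
}
</code></pre>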
    <div>
      <h4>Better utilization of origin web server connections</h4>
      <a href="#better-utilization-of-origin-web-server-connections">
        
      </a>
    </div>
    <p>Railgun's management of the connection between Railgun and the customer origin server has been improved to pool connections and make the best use of HTTP keep-alives. This reduces the load on the origin server and improves performance, as connections are efficiently reused, resulting in lower latency.</p>
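    <p>For readers unfamiliar with connection pooling, Go's standard HTTP client illustrates the mechanism well. The numbers below are illustrative tuning knobs, not Railgun's settings.</p>
    <pre><code>package main

import (
	"net/http"
	"time"
)

func main() {
	// Idle keep-alive connections are kept in a pool and reused for
	// subsequent requests to the same origin, avoiding a fresh TCP
	// (and TLS) handshake each time.
	tr := &amp;http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 10,
		IdleConnTimeout:     90 * time.Second,
	}
	client := &amp;http.Client{Transport: tr}

	// Both requests can ride the same underlying TCP connection.
	for _, url := range []string{"http://origin.example.com/", "http://origin.example.com/about"} {
		resp, err := client.Get(url)
		if err != nil {
			continue
		}
		resp.Body.Close() // the connection returns to the idle pool
	}
}
</code></pre>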
    <div>
      <h4>Improved cryptographic infrastructure</h4>
      <a href="#improved-cryptographic-infrastructure">
        
      </a>
    </div>
    <p>CloudFlare has been moving all communication between servers to encrypted connections. Railgun has always used a TLS connection between CloudFlare and the customer server even if the requests being passed were HTTP and not HTTPS. With version 5 we've switched Railgun to use our new CA for greater security. The connection between CloudFlare and the customer's Railgun is <a href="/how-to-build-your-own-public-key-infrastructure/">secured with certificates in both directions</a> that are verified against the CloudFlare CA.</p>
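    <p>In Go terms, mutual verification against a private CA looks roughly like the sketch below. This is a generic illustration, not Railgun's source; the file names and address are invented.</p>
    <pre><code>package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"
)

func main() {
	// Trust only the private CA, not the system roots.
	caPEM, err := os.ReadFile("ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	// This side's own certificate, presented to the peer.
	cert, err := tls.LoadX509KeyPair("railgun.pem", "railgun.key")
	if err != nil {
		log.Fatal(err)
	}

	conf := &amp;tls.Config{
		RootCAs:      pool,                    // verify the peer against the CA
		Certificates: []tls.Certificate{cert}, // prove our own identity
	}
	// The server side does the mirror image with ClientCAs and
	// ClientAuth: tls.RequireAndVerifyClientCert.
	conn, err := tls.Dial("tcp", "origin.example.com:2408", conf)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
</code></pre>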
    <div>
      <h4>Optimized partners</h4>
      <a href="#optimized-partners">
        
      </a>
    </div>
    <p>CloudFlare Optimized Partners in particular can benefit from the lower resource usage of Railgun version 5. A2 Hosting, an Optimized Partner and Railgun Beta participant, reported increased compression rates using version 5. Also new for partners is the ability to assign subdomains to a Railgun. Upgrading to the latest version, or installing Railgun for the first time, only takes a few minutes (<a href="https://www.cloudflare.com/static/media/downloads/optimized-partner-rg-quick.pdf">Railgun Quick Start Guide</a> for Optimized Partners). Railgun is perfect for ecommerce sites as well as news sites and popular blogs.</p>
    <div>
      <h4>Install or upgrade today</h4>
      <a href="#install-or-upgrade-today">
        
      </a>
    </div>
    <p>Railgun is available as part of CloudFlare's Business and Enterprise plans or from an Optimized Partner. <a href="https://www.cloudflare.com/resources-downloads#railgun">Installation instructions for Railgun</a> are available on CloudFlare's resources and downloads page. We recommend installing from <a href="https://pkg.cloudflare.com">CloudFlare's package repository</a>, which makes it easy to keep Railgun up-to-date. This release also sees Railgun available on Red Hat Enterprise Linux (RHEL) and CentOS 7. Railgun v5's configuration is completely compatible with version 4 and customers can simply replace the Railgun binary and restart to use version 5 and immediately see the benefits.</p> ]]></content:encoded>
            <category><![CDATA[Railgun]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Speed]]></category>
            <category><![CDATA[Partners]]></category>
            <guid isPermaLink="false">65AObM1ggWI7puGROpTutw</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Fighting Cancer: The Unexpected Benefit Of Open Sourcing Our Code]]></title>
            <link>https://blog.cloudflare.com/cloudflare-fights-cancer/</link>
            <pubDate>Wed, 08 Jul 2015 13:28:00 GMT</pubDate>
            <description><![CDATA[ Recently I was contacted by Dr. Igor Kozin from The Institute of Cancer Research in London. He asked about the optimal way to compile CloudFlare's open source fork of zlib. ]]></description>
            <content:encoded><![CDATA[ <p>Recently I was contacted by Dr. Igor Kozin from <a href="http://www.icr.ac.uk/">The Institute of Cancer Research</a> in London. He asked about the optimal way to compile CloudFlare's open source fork of <a href="https://github.com/cloudflare/zlib">zlib</a>. It turns out that zlib is widely used to compress the <a href="https://en.wikipedia.org/wiki/SAMtools">SAM/BAM</a> files that are used for DNA sequencing. And it turns out our zlib fork is the best open source solution for that file <a href="http://www.htslib.org/benchmarks/zlib.html">format</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Vz94lbV5IqYHgXdHKNWW7/23658931cdf1dd3ecb2fc73fd74d9c0b/2653833040_644faff824_z.jpg" />
            
</figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/shaury/2653833040/in/photolist-53vA75-53rmFB-7hFuix-bpRbbf-hY8WDp-o3PyM-7KZrSc-a8Yj5r-66zTgT-6H5EcX-83C9AP-4i7Wsj-4i3QFK-bvHJVB-4i3QLV-i49oXB-8xqd9g-buojvP-hy8GiB-5dopRY-2xdo6Y-nGM3PK-nEJrfL-nqhpNL-4gxWpM-6aV2Yg-bgKpqv-b8T29n-5iijjc-pwvN8L-aA67rU-aA3r6R-aA67qW-aA3r7c-nEJpZE-63xBT-uKbHEa-oqrPhv-hWkGhN-4i3QQt-yRRDS-sgiRUL-sxJ9os-sxJ8Ky-sgj7j9-4i3QPa-svAPTm-4i7WBY-4i3QRD-4i7Wtm">image</a> by <a href="https://www.flickr.com/photos/shaury/">Shaury Nash</a></p><p>The files used for this kind of research reach hundreds of gigabytes, and every time they are compressed and decompressed with our library many important seconds are saved, bringing the cure for cancer that much closer. At least that's what I am going to tell myself when I go to bed.</p><p>This made me realize that the benefits of open source go much farther than one can imagine, and you never know where a piece of code may end up. Open sourcing makes sophisticated algorithms and software accessible to individuals and organizations that would not have the resources to develop them on their own, or the money to pay for a proprietary solution.</p><p>It also made me wonder exactly what we did to zlib that makes it stand out from other zlib forks.</p>
    <div>
      <h3>Recap</h3>
      <a href="#recap">
        
      </a>
    </div>
    <p>Zlib is a compression library that supports two formats: deflate and gzip. Both formats use the same algorithm, called <a href="https://en.wikipedia.org/wiki/DEFLATE">DEFLATE</a>, but with different headers and checksum functions. The deflate algorithm is described <a href="/improving-compression-with-preset-deflate-dictionary/">here</a>.</p><p>Both formats are supported by the vast majority of web browsers, and we at CloudFlare compress all text content on the fly using the gzip format. Moreover, DEFLATE is also used by the PNG file format, and our fork of zlib also accelerates our image optimization engine <a href="/introducing-polish-automatic-image-optimizati/">Polish</a>. You can find the optimized fork of pngcrush <a href="https://github.com/cloudflare/pngcrush">here</a>.</p><p>Given the amount of traffic we must handle, compression optimization really makes sense for us. Therefore we included several improvements over the default implementation.</p><p>First of all, it is important to understand the current state of zlib. It is a very old library, one of the oldest that is still used as is to this day. It is so old it was written in K&amp;R C. It is so old USB was not invented yet. It is so old that DOS was still a thing. It is so old (insert your favorite so old joke here). More precisely, it dates back to 1995, back to the days when 16-bit computers with 64KB of addressable space were still in use.</p><p>Still, it represents one of the best pieces of code ever written, and even modernizing it gives only a modest performance boost, which shows the great skill of its authors and how far compilers have come since 1995.</p><p>Below is a list of some of the improvements in our fork of zlib. This work was done by me, my colleague Shuxin Yang, and also includes improvements from other sources.</p><ul><li><p><code>uint64_t</code> as the standard type - the default fork used 16-bit types.</p></li><li><p>Using an improved hash function - we use the <a href="https://tools.ietf.org/html/rfc3385">iSCSI CRC32</a> function as the hash function in our zlib. This specific function is implemented as a hardware instruction on Intel processors. It is very fast and has better collision properties.</p></li><li><p>Searching for matches of at least 4 bytes, instead of the 3 bytes the format suggests. This leads to fewer hash collisions and less effort wasted on insignificant matches. It also improves the compression rate a little bit in the majority of cases (but not all).</p></li><li><p>Using SIMD instructions for window rolling.</p></li><li><p>Using the hardware carry-less multiplication instruction <code>PCLMULQDQ</code> for the CRC32 checksum.</p></li><li><p>An optimized longest-match function. This is the most performance-demanding function in the library. It is responsible for finding the (length, distance) matches in the current window.</p></li></ul><p>In addition, we have an experimental branch that implements an improved version of the linked list used in zlib. It has much better performance for compression levels 6 to 9, while retaining the same compression ratio. You can find the experimental branch <a href="https://github.com/cloudflare/zlib/tree/experimental">here</a>.</p>
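    <p>The iSCSI CRC32 (the Castagnoli polynomial) mentioned in the list above is also available in Go's standard library, where it is likewise backed by the SSE4.2 hardware instruction on capable CPUs. A minimal sketch of using it to hash a 4-byte window, in the spirit of the hash-function change described above (an illustration only; the fork itself is C):</p>
    <pre><code>package main

import (
	"fmt"
	"hash/crc32"
)

// Castagnoli is the "iSCSI" CRC32 polynomial; Go dispatches to the
// hardware CRC32 instruction on amd64 when it is available.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// hash4 hashes a 4-byte window, matching the minimum match length
// described above.
func hash4(b []byte) uint32 {
	return crc32.Checksum(b[:4], castagnoli)
}

func main() {
	fmt.Printf("%08x\n", hash4([]byte("html")))
}
</code></pre>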
    <div>
      <h3>Benchmarking</h3>
      <a href="#benchmarking">
        
      </a>
    </div>
    <p>You can find independent benchmarks of our library <a href="https://www.snellman.net/blog/archive/2015-06-05-updated-zlib-benchmarks/">here</a> and <a href="http://www.htslib.org/benchmarks/zlib.html">here</a>. In addition, I performed some in-house benchmarking and put the results here for your convenience.</p><p>All the benchmarks were performed on an i5-4278U CPU. The compression was performed from and to a ramdisk. All libraries were compiled with gcc version 4.8.4 with the compilation flags: "-O3 -march=native".</p><p>I tested the performance of the <a href="https://github.com/madler/zlib">main zlib fork</a>, the optimized implementation by <a href="https://github.com/jtkukunas/zlib">Intel</a>, our own <a href="https://github.com/cloudflare/zlib">main branch</a>, and our <a href="https://github.com/cloudflare/zlib/tree/experimental">experimental branch</a>.</p><p>Four data sets were used for the benchmarks: the <a href="http://corpus.canterbury.ac.nz/resources/calgary.tar.gz">Calgary corpus</a>, the <a href="http://corpus.canterbury.ac.nz/resources/cantrbry.tar.gz">Canterbury corpus</a>, the <a href="http://corpus.canterbury.ac.nz/resources/large.tar.gz">Large Canterbury corpus</a> and the <a href="http://sun.aei.polsl.pl/~sdeor/corpus/silesia.zip">Silesia corpus</a>.</p><p><b>Calgary corpus</b></p><p>Performance:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1swjNfobp1N11Oj2GBmdA3/d3190f3208c1b31365129e71b13334bc/calgary-1.png" />
            
            </figure><p>Compression rates:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ZIne0q7jXVMUHZZUpW5hx/5275c8abea0aa98fe9334fb1ece3a403/calgary_c-1.png" />
            
</figure><p>For this benchmark, Intel only outperforms our implementation for level 1, but at the cost of 1.39X larger files. This difference is far greater than even the difference between levels 1 and 9, and should probably be regarded as a different compression level. CloudFlare is faster on all other levels, and significantly outperforms it for levels 6 to 9. The experimental implementation is even faster for those levels.</p><p><b>Canterbury corpus</b></p><p>Performance:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Sj9CVjFYbS2CGZDJYoNhA/a198c2e7cd82d2dee27b37f1c2c777c7/canterbry.png" />
            
            </figure><p>Compression rates:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2lVTEFn90wmzkWC05IZbq4/4b6cfc78fa09972da920f859e9f2c846/canterbry_c.png" />
            
</figure><p>Here we see a similar situation. Intel at level 1 produces 1.44X larger files. CloudFlare is faster for levels 2 to 9. On level 9, the experimental branch outperforms the reference implementation by 2X.</p><p><b>Large corpus</b></p><p>Performance:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ZRBEXqOI6FpT900xL3qrC/68e11482156b5460fc9175dd7fa3afc9/large.png" />
            
            </figure><p>Compression rates:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6PqDEMmdGUstoT8b5jFy9h/b0c68295d26a32afdea95f24ceb6b153/large_c.png" />
            
</figure><p>This time Intel is slightly faster than the CloudFlare implementation for levels 5 and 6. The experimental CloudFlare implementation is faster still on level 6. The compression rate for Intel level 1 is 1.58X lower than CloudFlare's. On level 9, the experimental fork is 7.5X(!) faster than the reference.</p><p><b>Silesia corpus</b></p><p>Performance:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5mzcc6T1U4rNNsQF9tPBzK/bd84eb1387a5c52506d3d9205999ba4e/seli.png" />
            
            </figure><p>Compression rates:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/359sqp7ZUTycFlJVDoebdc/d77c055ad0e57374c7addf8a622e9a57/seli_c-1.png" />
            
            </figure><p>Here again, CloudFlare is the fastest on levels 2 to 9. On level 9 the difference in speed between the experimental fork and the reference fork is 2.44X.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>As evident from the benchmarks, the CloudFlare implementation outperforms the competition in the vast majority of settings. We put great effort into making it as fast as possible on our servers.</p><p>If you intend to use our library, you should check for yourself whether it delivers the best balance of performance and compression for your dataset, as performance can vary between different file formats and sizes.</p><p>And if you like open source software, don't forget to give back to the community by contributing your own code!</p>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Optimization]]></category>
            <guid isPermaLink="false">42jiDxmNfkq3Cch2DFVty0</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Improving compression with a preset DEFLATE dictionary]]></title>
            <link>https://blog.cloudflare.com/improving-compression-with-preset-deflate-dictionary/</link>
            <pubDate>Mon, 30 Mar 2015 09:21:21 GMT</pubDate>
            <description><![CDATA[ A few years ago Google made a proposal for a new HTTP compression method, called SDCH (SanDwiCH). The idea behind the method is to create a dictionary of long strings that appear throughout many pages of the same domain (or popular search results). ]]></description>
            <content:encoded><![CDATA[ <p>A few years ago Google made a proposal for a new HTTP compression method, called SDCH (SanDwiCH). The idea behind the method is to create a dictionary of long strings that appear throughout many pages of the same domain (or popular search results). Compression then simply searches for appearances of those long strings and replaces them with references into the dictionary. Afterwards the output is further compressed with <a href="https://en.wikipedia.org/wiki/DEFLATE">DEFLATE</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6DlU37lpBE25TGdgfyM43y/495a90025471558e0ccb4978a8043ee1/15585984911_6a6bff8f30_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/"><sub>CC BY SA 2.0</sub></a><sub> </sub><a href="https://www.flickr.com/photos/quinnanya/15585984911/in/photolist-pKhfiM-52LCYm-abTcAV-9eBXNP-9EQsPE-k2zt7W-3wGhDE-8FHYSs-4H94jD-8FFL2G-4U8deD-8FHZ7h-9zxQ3Q-96AfdV-8F5h1w-8FjCSW-8FeYBM-RbMqS-8FgTbT-8FjUvd-8FgF7T-8FjM4E-8Fgwtt-8Fjvxf-8FjtpU-8FjpUm-8Fjn3u-8Fjiv9-8Ff5ex-8Fid39-8FeTN6-8Fi3KW-8F29dP-8FhaH2-8FkcNm-8Fk8Kw-6DASnR-3T6V24-7PMw1h-7PJhZ8-uMSAt-cUzN8q-4fjD2P-eRviNL-BZg6b-qz5e4C-cqkfc-4szYvT-u1RCZ-ctvmWW"><sub>image</sub></a><sub> by </sub><a href="https://www.flickr.com/photos/quinnanya/"><sub>Quinn Dombrowski</sub></a></p><p>With the right dictionary for the right page the savings can be spectacular, even 70% smaller than gzip alone. In theory, a whole file can be replaced by a single token.</p><p>The drawbacks of the method are twofold: first - the dictionary that is created is fairly large and must be distributed as a separate file, in fact the dictionary is often larger than the individual pages it compresses; second - the dictionary is usually absolutely useless for another set of pages.</p><p>For large domains that are visited repeatedly the advantage is huge: at a cost of single dictionary download, all the following page views can be compressed with much higher efficiency. Currently we aware of Google and LinkedIn compressing content with SDCH.</p>
    <div>
      <h3>SDCH for millions of domains</h3>
      <a href="#sdch-for-millions-of-domains">
        
      </a>
    </div>
    <p>Here at CloudFlare our task is to support millions of domains, which have little in common, so creating a single SDCH dictionary is very difficult. Nevertheless better compression is important, because it produces smaller payloads, which result in content being delivered faster to our clients. That is why we set out to find that little something that is common to all the pages and to see if we could compress them further.</p><p>Besides SDCH (which is only supported by the Chrome browser), the common compression methods over HTTP are gzip and DEFLATE. It is not widely known, but the compression they perform is identical. The two formats differ in the content headers they use, with gzip having slightly larger headers, and in the error detection function: gzip uses CRC32, whereas DEFLATE uses Adler32.</p><p>Usually servers opt to compress with gzip; however, its cousin DEFLATE supports a neat feature called a "Preset Dictionary". This dictionary is not like the dictionary used by SDCH; in fact, it is not a real dictionary. To understand how this "dictionary" can be used to our advantage, it is important to understand how the DEFLATE algorithm works.</p>
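    <p>As an aside, the closeness of the two formats is easy to see in Go, where both writers wrap the same DEFLATE compressor and differ only in framing (a sketch for illustration; the "deflate" HTTP encoding corresponds to the zlib container):</p>
    <pre><code>package main

import (
	"bytes"
	"compress/gzip"
	"compress/zlib"
	"fmt"
)

func main() {
	payload := bytes.Repeat([]byte("hello world "), 100)

	var g, z bytes.Buffer
	gw := gzip.NewWriter(&amp;g) // gzip framing, CRC32 trailer
	gw.Write(payload)
	gw.Close()

	zw := zlib.NewWriter(&amp;z) // zlib framing, Adler32 trailer
	zw.Write(payload)
	zw.Close()

	// The DEFLATE stream inside is the same; only the headers and
	// checksums differ, which is why the gzip output is slightly larger.
	fmt.Println("gzip:", g.Len(), "bytes  zlib:", z.Len(), "bytes")
}
</code></pre>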
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15p0MSMrIz6i5jMxj6DmrP/191d330fe298e473be033432d5136dff/5510506796_dff8c07b64_z.jpg" />
            
</figure><p><sub></sub><a href="https://creativecommons.org/licenses/by/2.0/"><sub>CC BY 2.0</sub></a><sub> </sub><a href="https://www.flickr.com/photos/crdot/5510506796/in/photolist-9oWMFS-5ZetrG-aB6EXd-7BFq6C-8bfuDz-8a7u13-8bfv1z-89PDmK-8a7uvj-8a4fbn-89STbj-5U3gbF-8a4fPH-8a7qPA-89PADp-7BFqLU-8a4fha-7A9GMJ-89SQaN-8biMCh-89PASH-89PCRT-8biMuU-8biM8j-89PCgx-8a7unY-89SSf3-49Vg4c-89PzPn-8a4e4B-89PziH-8biMRU-8a7uHq-ngPPhQ-nkCZTe-7TQ4Yj-8a7qhy-8biLH7-8a7rCW-8a4fp4-76ixPX-8a4fiH-7BFqow-eijkuy-8a7uDY-8a4aat-89Pzv8-89SQdU-nkDdYH-89Pzsp"><sub>image</sub></a><sub> by </sub><a href="https://www.flickr.com/photos/crdot/"><sub>Caleb Roenigk</sub></a></p><p>The DEFLATE algorithm consists of two stages. First, it performs the LZ77 algorithm, where it simply goes over the input and replaces occurrences of previously encountered strings with (short) "directions" telling where the same string can be found in the previous input. The directions are a tuple of (length, distance), where distance tells how far back in the input the string was and length tells how many bytes were matched. The minimal length DEFLATE will match is 3 (4 in the highly optimized implementation CloudFlare uses), the maximal length is 258, and the farthest distance back is 32KB.</p><p>This is an illustration of the LZ77 algorithm:</p><p>Input:</p>
<div><table><thead>
  <tr>
    <th><span>L</span></th>
    <th><span>i</span></th>
    <th><span>t</span></th>
    <th><span>t</span></th>
    <th><span>l</span></th>
    <th><span>e</span></th>
    <th></th>
    <th><span>b</span></th>
    <th><span>u</span></th>
    <th><span>n</span></th>
    <th><span>n</span></th>
    <th><span>y</span></th>
    <th></th>
    <th><span>F</span></th>
    <th><span>o</span></th>
    <th><span>o</span></th>
    <th></th>
    <th><span>F</span></th>
    <th><span>o</span></th>
    <th><span>o</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td></td>
    <td><span>W</span></td>
    <td><span>e</span></td>
    <td><span>n</span></td>
    <td><span>t</span></td>
    <td></td>
    <td><span>h</span></td>
    <td><span>o</span></td>
    <td><span>p</span></td>
    <td><span>p</span></td>
    <td><span>i</span></td>
    <td><span>n</span></td>
    <td><span>g</span></td>
    <td></td>
    <td><span>t</span></td>
    <td><span>h</span></td>
    <td><span>r</span></td>
    <td><span>o</span></td>
    <td><span>u</span></td>
    <td><span>g</span></td>
  </tr>
  <tr>
    <td><span>h</span></td>
    <td></td>
    <td><span>t</span></td>
    <td><span>h</span></td>
    <td><span>e</span></td>
    <td></td>
    <td><span>f</span></td>
    <td><span>o</span></td>
    <td><span>r</span></td>
    <td><span>e</span></td>
    <td><span>s</span></td>
    <td><span>t</span></td>
    <td></td>
    <td><span>S</span></td>
    <td><span>c</span></td>
    <td><span>o</span></td>
    <td><span>o</span></td>
    <td><span>p</span></td>
    <td><span>i</span></td>
    <td><span>n</span></td>
  </tr>
  <tr>
    <td><span>g</span></td>
    <td></td>
    <td><span>u</span></td>
    <td><span>p</span></td>
    <td></td>
    <td><span>t</span></td>
    <td><span>h</span></td>
    <td><span>e</span></td>
    <td></td>
    <td><span>f</span></td>
    <td><span>i</span></td>
    <td><span>e</span></td>
    <td><span>l</span></td>
    <td><span>d</span></td>
    <td></td>
    <td><span>m</span></td>
    <td><span>i</span></td>
    <td><span>c</span></td>
    <td><span>e</span></td>
    <td></td>
  </tr>
  <tr>
    <td><span>A</span></td>
    <td><span>n</span></td>
    <td><span>d</span></td>
    <td></td>
    <td><span>b</span></td>
    <td><span>o</span></td>
    <td><span>p</span></td>
    <td><span>p</span></td>
    <td><span>i</span></td>
    <td><span>n</span></td>
    <td><span>g</span></td>
    <td></td>
    <td><span>t</span></td>
    <td><span>h</span></td>
    <td><span>e</span></td>
    <td><span>m</span></td>
    <td></td>
    <td><span>o</span></td>
    <td><span>n</span></td>
    <td></td>
  </tr>
  <tr>
    <td><span>t</span></td>
    <td><span>h</span></td>
    <td><span>e</span></td>
    <td></td>
    <td><span>h</span></td>
    <td><span>e</span></td>
    <td><span>a</span></td>
    <td><span>d</span></td>
    <td></td>
    <td><span>D</span></td>
    <td><span>o</span></td>
    <td><span>w</span></td>
    <td><span>n</span></td>
    <td></td>
    <td><span>c</span></td>
    <td><span>a</span></td>
    <td><span>m</span></td>
    <td><span>e</span></td>
    <td></td>
    <td><span>t</span></td>
  </tr>
  <tr>
    <td><span>h</span></td>
    <td><span>e</span></td>
    <td></td>
    <td><span>G</span></td>
    <td><span>o</span></td>
    <td><span>o</span></td>
    <td><span>d</span></td>
    <td></td>
    <td><span>F</span></td>
    <td><span>a</span></td>
    <td><span>i</span></td>
    <td><span>r</span></td>
    <td><span>y</span></td>
    <td><span>,</span></td>
    <td></td>
    <td><span>a</span></td>
    <td><span>n</span></td>
    <td><span>d</span></td>
    <td></td>
    <td><span>s</span></td>
  </tr>
  <tr>
    <td><span>h</span></td>
    <td><span>e</span></td>
    <td></td>
    <td><span>s</span></td>
    <td><span>a</span></td>
    <td><span>i</span></td>
    <td><span>d</span></td>
    <td></td>
    <td><span>"</span></td>
    <td><span>L</span></td>
    <td><span>i</span></td>
    <td><span>t</span></td>
    <td><span>t</span></td>
    <td><span>l</span></td>
    <td><span>e</span></td>
    <td></td>
    <td><span>b</span></td>
    <td><span>u</span></td>
    <td><span>n</span></td>
    <td><span>n</span></td>
  </tr>
  <tr>
    <td><span>y</span></td>
    <td></td>
    <td><span>F</span></td>
    <td><span>o</span></td>
    <td><span>o</span></td>
    <td></td>
    <td><span>F</span></td>
    <td><span>o</span></td>
    <td><span>o</span></td>
    <td></td>
    <td><span>I</span></td>
    <td></td>
    <td><span>d</span></td>
    <td><span>o</span></td>
    <td><span>n</span></td>
    <td><span>'</span></td>
    <td><span>t</span></td>
    <td></td>
    <td><span>w</span></td>
    <td><span>a</span></td>
  </tr>
  <tr>
    <td><span>n</span></td>
    <td><span>t</span></td>
    <td></td>
    <td><span>t</span></td>
    <td><span>o</span></td>
    <td></td>
    <td><span>s</span></td>
    <td><span>e</span></td>
    <td><span>e</span></td>
    <td></td>
    <td><span>y</span></td>
    <td><span>o</span></td>
    <td><span>u</span></td>
    <td></td>
    <td><span>S</span></td>
    <td><span>c</span></td>
    <td><span>o</span></td>
    <td><span>o</span></td>
    <td><span>p</span></td>
    <td><span>i</span></td>
  </tr>
  <tr>
    <td><span>n</span></td>
    <td><span>g</span></td>
    <td></td>
    <td><span>u</span></td>
    <td><span>p</span></td>
    <td></td>
    <td><span>t</span></td>
    <td><span>h</span></td>
    <td><span>e</span></td>
    <td></td>
    <td><span>f</span></td>
    <td><span>i</span></td>
    <td><span>e</span></td>
    <td><span>l</span></td>
    <td><span>d</span></td>
    <td></td>
    <td><span>m</span></td>
    <td><span>i</span></td>
    <td><span>c</span></td>
    <td><span>e</span></td>
  </tr>
  <tr>
    <td></td>
    <td><span>A</span></td>
    <td><span>n</span></td>
    <td><span>d</span></td>
    <td></td>
    <td><span>b</span></td>
    <td><span>o</span></td>
    <td><span>p</span></td>
    <td><span>p</span></td>
    <td><span>i</span></td>
    <td><span>n</span></td>
    <td><span>g</span></td>
    <td></td>
    <td><span>t</span></td>
    <td><span>h</span></td>
    <td><span>e</span></td>
    <td><span>m</span></td>
    <td></td>
    <td><span>o</span></td>
    <td><span>n</span></td>
  </tr>
  <tr>
    <td></td>
    <td><span>t</span></td>
    <td><span>h</span></td>
    <td><span>e</span></td>
    <td></td>
    <td><span>h</span></td>
    <td><span>e</span></td>
    <td><span>a</span></td>
    <td><span>d</span></td>
    <td><span>.</span></td>
    <td><span>"</span></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
</tbody></table></div><p>Output (length tokens are blue, distance tokens are red):</p>
<div><table><thead>
  <tr>
    <th><span>L</span></th>
    <th><span>i</span></th>
    <th><span>t</span></th>
    <th><span>t</span></th>
    <th><span>l</span></th>
    <th><span>e</span></th>
    <th></th>
    <th><span>b</span></th>
    <th><span>u</span></th>
    <th><span>n</span></th>
    <th><span>n</span></th>
    <th><span>y</span></th>
    <th></th>
    <th><span>F</span></th>
    <th><span>o</span></th>
    <th><span>o</span></th>
    <th>5</th>
    <th>4</th>
    <th><span>W</span></th>
    <th><span>e</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>n</span></td>
    <td><span>t</span></td>
    <td></td>
    <td><span>h</span></td>
    <td><span>o</span></td>
    <td><span>p</span></td>
    <td><span>p</span></td>
    <td><span>i</span></td>
    <td><span>n</span></td>
    <td><span>g</span></td>
    <td></td>
    <td><span>t</span></td>
    <td><span>h</span></td>
    <td><span>r</span></td>
    <td><span>o</span></td>
    <td><span>u</span></td>
    <td><span>g</span></td>
    <td><span>h</span></td>
    <td>3</td>
    <td>8</td>
  </tr>
  <tr>
    <td><span>e</span></td>
    <td></td>
    <td><span>f</span></td>
    <td><span>o</span></td>
    <td><span>r</span></td>
    <td><span>e</span></td>
    <td><span>s</span></td>
    <td><span>t</span></td>
    <td></td>
    <td><span>S</span></td>
    <td><span>c</span></td>
    <td><span>o</span></td>
    <td><span>o</span></td>
    <td>5</td>
    <td>28</td>
    <td><span>u</span></td>
    <td><span>p</span></td>
    <td>6</td>
    <td>23</td>
    <td><span>i</span></td>
  </tr>
  <tr>
    <td><span>e</span></td>
    <td><span>l</span></td>
    <td><span>d</span></td>
    <td></td>
    <td><span>m</span></td>
    <td><span>i</span></td>
    <td><span>c</span></td>
    <td><span>e</span></td>
    <td></td>
    <td><span>A</span></td>
    <td><span>n</span></td>
    <td><span>d</span></td>
    <td></td>
    <td><span>b</span></td>
    <td>9</td>
    <td>58</td>
    <td><span>e</span></td>
    <td><span>m</span></td>
    <td></td>
    <td><span>o</span></td>
  </tr>
  <tr>
    <td><span>n</span></td>
    <td>5</td>
    <td>35</td>
    <td><span>h</span></td>
    <td><span>e</span></td>
    <td><span>a</span></td>
    <td><span>d</span></td>
    <td></td>
    <td><span>D</span></td>
    <td><span>o</span></td>
    <td><span>w</span></td>
    <td><span>n</span></td>
    <td></td>
    <td><span>c</span></td>
    <td><span>a</span></td>
    <td><span>m</span></td>
    <td><span>e</span></td>
    <td>5</td>
    <td>19</td>
    <td><span>G</span></td>
  </tr>
  <tr>
    <td><span>o</span></td>
    <td><span>o</span></td>
    <td><span>d</span></td>
    <td></td>
    <td><span>F</span></td>
    <td><span>a</span></td>
    <td><span>i</span></td>
    <td><span>r</span></td>
    <td><span>y</span></td>
    <td><span>,</span></td>
    <td></td>
    <td><span>a</span></td>
    <td>3</td>
    <td>55</td>
    <td><span>s</span></td>
    <td>3</td>
    <td>20</td>
    <td><span>s</span></td>
    <td><span>a</span></td>
    <td><span>i</span></td>
  </tr>
  <tr>
    <td><span>d</span></td>
    <td></td>
    <td><span>"</span></td>
    <td><span>L</span></td>
    <td>20</td>
    <td>149</td>
    <td><span>I</span></td>
    <td></td>
    <td><span>d</span></td>
    <td><span>o</span></td>
    <td><span>n</span></td>
    <td><span>'</span></td>
    <td><span>t</span></td>
    <td></td>
    <td><span>w</span></td>
    <td><span>a</span></td>
    <td>3</td>
    <td>157</td>
    <td><span>t</span></td>
    <td><span>o</span></td>
  </tr>
  <tr>
    <td></td>
    <td><span>s</span></td>
    <td><span>e</span></td>
    <td><span>e</span></td>
    <td></td>
    <td><span>y</span></td>
    <td><span>o</span></td>
    <td><span>o</span></td>
    <td>56</td>
    <td>141</td>
    <td><span>.</span></td>
    <td><span>"</span></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
</tbody></table></div><p>DEFLATE managed to reduce the original text from 251 characters to just 152 tokens! Those tokens are later compressed further by Huffman encoding, which is the second stage.</p><p>How long and devotedly the algorithm searches for a string before it stops is defined by the compression level. For example, with compression level 4 the algorithm will be happy to find a match of 16 bytes, whereas with level 9 it will attempt to look for the maximal 258-byte match. If a match is not found, the algorithm outputs the input as is, uncompressed.</p><p>Clearly, at the beginning of the input there can be no references to previous strings, and it is always uncompressed. Similarly, the first occurrence of any string in the input will never be compressed. For example, almost all HTML files start with the string "&lt;html ", however only the second occurrence of that string will be replaced with a match; the first will remain uncompressed.</p><p>To solve this problem, the DEFLATE dictionary effectively acts as an initial back-reference for possible matches. So if we add the aforementioned string "&lt;html " to the dictionary, the algorithm will be able to match it from the start, improving the compression ratio. And there are many more such strings that are used in any HTML page, which we can put in the dictionary to improve the compression ratio. In fact, the SPDY protocol uses this technique for HTTP header compression.</p><p>To illustrate, let's compress the children’s song with the help of a 42-byte dictionary containing the following: "Little bunny Foo hopping forest Good Fairy". The compressed output will then be:</p>
<div><table><thead>
  <tr>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
    <th></th>
  </tr></thead>
<tbody>
  <tr>
    <td>17</td>
    <td>42</td>
    <td>4</td>
    <td>4</td>
    <td><span>W</span></td>
    <td><span>e</span></td>
    <td><span>n</span></td>
    <td><span>t</span></td>
    <td>9</td>
    <td>51</td>
    <td><span>t</span></td>
    <td><span>h</span></td>
    <td></td>
    <td><span>o</span></td>
    <td><span>u</span></td>
    <td><span>g</span></td>
    <td><span>h</span></td>
    <td>3</td>
    <td>8</td>
    <td><span>e</span></td>
  </tr>
  <tr>
    <td>8</td>
    <td>63</td>
    <td><span>S</span></td>
    <td><span>c</span></td>
    <td><span>o</span></td>
    <td><span>o</span></td>
    <td>5</td>
    <td>28</td>
    <td><span>u</span></td>
    <td><span>p</span></td>
    <td>6</td>
    <td>23</td>
    <td><span>i</span></td>
    <td><span>e</span></td>
    <td><span>l</span></td>
    <td><span>d</span></td>
    <td></td>
    <td><span>m</span></td>
    <td><span>i</span></td>
    <td><span>c</span></td>
  </tr>
  <tr>
    <td><span>e</span></td>
    <td></td>
    <td><span>A</span></td>
    <td><span>n</span></td>
    <td><span>d</span></td>
    <td></td>
    <td><span>b</span></td>
    <td>9</td>
    <td>58</td>
    <td><span>e</span></td>
    <td><span>m</span></td>
    <td></td>
    <td><span>o</span></td>
    <td><span>n</span></td>
    <td>5</td>
    <td>35</td>
    <td><span>h</span></td>
    <td><span>e</span></td>
    <td><span>a</span></td>
    <td><span>d</span></td>
  </tr>
  <tr>
    <td></td>
    <td><span>D</span></td>
    <td><span>o</span></td>
    <td><span>w</span></td>
    <td><span>n</span></td>
    <td></td>
    <td><span>c</span></td>
    <td><span>a</span></td>
    <td><span>m</span></td>
    <td><span>e</span></td>
    <td>5</td>
    <td>19</td>
    <td>10</td>
    <td>133</td>
    <td><span>,</span></td>
    <td><span>a</span></td>
    <td>3</td>
    <td>55</td>
    <td><span>s</span></td>
    <td>3</td>
  </tr>
  <tr>
    <td>20</td>
    <td><span>s</span></td>
    <td><span>a</span></td>
    <td><span>i</span></td>
    <td><span>d</span></td>
    <td></td>
    <td><span>"</span></td>
    <td>21</td>
    <td>149</td>
    <td><span>I</span></td>
    <td></td>
    <td><span>d</span></td>
    <td><span>o</span></td>
    <td><span>n</span></td>
    <td><span>'</span></td>
    <td><span>t</span></td>
    <td></td>
    <td><span>w</span></td>
    <td><span>a</span></td>
    <td>3</td>
  </tr>
  <tr>
    <td>157</td>
    <td><span>t</span></td>
    <td><span>o</span></td>
    <td></td>
    <td><span>s</span></td>
    <td><span>e</span></td>
    <td><span>e</span></td>
    <td></td>
    <td><span>y</span></td>
    <td><span>o</span></td>
    <td><span>o</span></td>
    <td>56</td>
    <td>141</td>
    <td><span>.</span></td>
    <td><span>"</span></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
</tbody></table></div><p>Now, even strings at the very beginning of the input are compressed, and strings that appear only once in the file are compressed as well. With the help of the dictionary we are down to 115 tokens, which means a roughly 25% better compression rate.</p>
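    <p>Go's compress/flate package exposes exactly this mechanism, so the song example above can be reproduced directly. A minimal sketch, using the same 42-byte dictionary (the song text is abbreviated):</p>
    <pre><code>package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
	"log"
)

func main() {
	dict := []byte("Little bunny Foo hopping forest Good Fairy")
	song := []byte("Little bunny Foo Foo Went hopping through the forest...")

	// The dictionary acts as an initial 42-byte back-reference window.
	var buf bytes.Buffer
	w, err := flate.NewWriterDict(&amp;buf, flate.BestCompression, dict)
	if err != nil {
		log.Fatal(err)
	}
	w.Write(song)
	w.Close()

	// The decompressor must be primed with the same dictionary.
	r := flate.NewReaderDict(&amp;buf, dict)
	out, _ := io.ReadAll(r)
	fmt.Printf("%d compressed bytes, round-trip ok: %v\n", buf.Len(), bytes.Equal(out, song))
}
</code></pre>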
    <div>
      <h2>An experiment</h2>
      <a href="#an-experiment">
        
      </a>
    </div>
    <p>We wanted to see if we could make a dictionary that would benefit ALL the HTML pages we serve, and not just a specific domain. To that end we scanned over 20,000 publicly available random HTML pages that passed through our servers on a random sunny day. We took the first 16KB of each page and used them to prepare two dictionaries, one of 16KB and the other of 32KB. Using a larger dictionary is useless, because it would then be larger than the LZ77 window used by DEFLATE.</p><p>To build a dictionary, I made a little Go program that takes a set of files and performs "pseudo" LZ77 over them, finding strings that DEFLATE would not compress in the first 16KB of each input file. It then performs a frequency count of the individual strings, and scores them according to their length and frequency. In the end the highest-scoring strings are saved into the dictionary file.</p><p>Our benchmark consists of another set of pages obtained in a similar manner. The number of benchmarked files was about 19,000, with a total size of 563MB.</p>
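    <p>For a feel of the scoring step, here is a much-simplified sketch in Go: count repeated fixed-size substrings across sample pages, score each by frequency times length, and greedily fill a byte budget. The real dictator tool (linked at the end of this post) is more careful, performing the "pseudo" LZ77 pass and handling variable-length, overlapping strings, so treat this as illustrative only.</p>
    <pre><code>package main

import (
	"fmt"
	"sort"
)

// buildDict scores fixed-size substrings by frequency times length and
// packs the best ones into a dictionary of at most budget bytes. This is
// a toy version of the approach described above.
func buildDict(pages [][]byte, n, budget int) []byte {
	freq := map[string]int{}
	for _, p := range pages {
		for i := 0; i+n &lt;= len(p); i += n {
			freq[string(p[i:i+n])]++
		}
	}
	type cand struct {
		s     string
		score int
	}
	var cands []cand
	for s, c := range freq {
		if c > 1 { // a string seen only once compresses nothing
			cands = append(cands, cand{s, c * len(s)})
		}
	}
	sort.Slice(cands, func(i, j int) bool { return cands[i].score > cands[j].score })
	var dict []byte
	for _, c := range cands {
		if len(dict)+len(c.s) > budget {
			break
		}
		dict = append(dict, c.s...)
	}
	return dict
}

func main() {
	pages := [][]byte{
		[]byte("headernav body-one footer. "),
		[]byte("headernav body-two footer. "),
	}
	fmt.Printf("%q\n", buildDict(pages, 6, 32))
}
</code></pre>
    <p>The benchmark results:</p>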
<div><table><thead>
  <tr>
    <th></th>
    <th><span>deflate -4</span></th>
    <th><span>deflate -9</span></th>
    <th><span>deflate -4 + 16K dict</span></th>
    <th><span>deflate -9 + 16K dict</span></th>
    <th><span>deflate -4 + 32K dict</span></th>
    <th><span>deflate -9 + 32K dict</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Size (KB)</span></td>
    <td><span>169,176</span></td>
    <td><span>166,012</span></td>
    <td><span>161,896</span></td>
    <td><span>158,352</span></td>
    <td><span>161,212</span></td>
    <td><span>157,444</span></td>
  </tr>
  <tr>
    <td><span>Time (sec)</span></td>
    <td><span>6.90</span></td>
    <td><span>11.56</span></td>
    <td><span>7.15</span></td>
    <td><span>11.80</span></td>
    <td><span>7.88</span></td>
    <td><span>11.82</span></td>
  </tr>
</tbody></table></div><p>We can see from the results that the compression we gain for level 4 is almost 5% better than without the dictionary, which is even greater than the compression gained by using level 9 compression, while being substantially faster. For level 9, the gain is greater than 5% without a significant performance hit.</p><p>The results depend highly on the dataset used for the dictionary and on the compressed pages. For example, when making a dictionary aimed at a specific web site, the compression rate for that site increased by up to 30%.</p><p>For very small pages, such as error pages, with size less than 1KB, a DEFLATE dictionary was able to produce output up to 50% smaller than DEFLATE alone.</p><p>Of course different dictionaries may be used for different file types. In fact we think that it would make sense to create a standard set of dictionaries that could be used across the web.</p><p>The utility to make a dictionary for DEFLATE can be found at <a href="https://github.com/vkrasnov/dictator">https://github.com/vkrasnov/dictator</a>. The optimized version of zlib used by CloudFlare can be found at <a href="https://github.com/cloudflare/zlib">https://github.com/cloudflare/zlib</a>.</p> ]]></content:encoded>
            <category><![CDATA[Google]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Compression]]></category>
            <guid isPermaLink="false">4bo94j5hDmZv1CmCO7udHU</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Efficiently compressing dynamically generated web content]]></title>
            <link>https://blog.cloudflare.com/efficiently-compressing-dynamically-generated-53805/</link>
            <pubDate>Thu, 06 Dec 2012 09:19:00 GMT</pubDate>
            <description><![CDATA[ With the widespread adoption of high bandwidth Internet connections in the home, offices and on mobile devices, limitations in available bandwidth to download web pages have largely been eliminated. ]]></description>
            <content:encoded><![CDATA[ <p><i>I originally wrote this article for the </i><a href="http://calendar.perfplanet.com/2012/efficiently-compressing-dynamically-generated-web-content/"><i>Web Performance Calendar website</i></a><i>, which is a terrific resource of expert opinions on making your website as fast as possible. We thought CloudFlare users would be interested so we reproduced it here. Enjoy!</i></p>
    <div>
      <h3>Efficiently compressing dynamically generated web content</h3>
      <a href="#efficiently-compressing-dynamically-generated-web-content">
        
      </a>
    </div>
    <p>With the widespread adoption of high bandwidth Internet connections in homes, offices, and on mobile devices, limitations in available bandwidth to download web pages have largely been eliminated.</p><p>At the same time, latency remains a major problem. According to a recent presentation by Google, broadband Internet latency is 18ms for fiber technologies, 26ms for cable-based services, 43ms for DSL and 150ms-400ms for mobile devices. Ultimately, bandwidth can be expanded greatly with new technologies, but latency is limited by the speed of light. The latency of an Internet connection directly affects the speed with which a web page can be downloaded.</p><p>The latency problem occurs because the TCP protocol requires round trips to acknowledge received information (since packets can and do get lost while traversing the Internet). To prevent Internet congestion, TCP also has mechanisms to limit the amount of data sent per round trip until it has learnt how much it can send without causing congestion.</p><p>The collision between the speed of light and the TCP protocol is made worse by the fact that web site owners are likely to choose the cheapest hosting available without thinking about its physical location. In fact, the move to ‘the cloud' encourages the idea that web sites are simply ‘out there' without taking into account the very real problem of latency introduced by the distance between the end user's web browser and the server. It is not uncommon, for example, to see web sites aimed at UK consumers being hosted in the US. A web user in London accessing a .co.uk site that is actually hosted in Chicago incurs an additional 60ms round trip time because of the distance traversed.</p><p>Dealing with speed-of-light-induced latency requires moving web content closer to the users who are browsing, or making the web content smaller so that fewer round trips are required (or both).</p>
    <div>
      <h3>The caching challenge</h3>
      <a href="#the-caching-challenge">
        
      </a>
    </div>
    <p>Caching technologies and content delivery services mean that static content (such as images, CSS, and JavaScript) can be cached close to end users, helping to reduce latency when it is loaded. CloudFlare sees on average that about 65% of web content is cacheable.</p><p>But the most critical part of a web page, the actual HTML content, is often dynamically generated and cannot be cached. Because even the relatively fast-to-load content that's in cache cannot be loaded before the HTML, any delay in the web browser receiving the HTML affects the entire web browsing experience.</p><p>Thus being able to deliver the page HTML as quickly as possible, even in high-latency environments, is vital to ensuring a good browsing experience. Studies have shown that the slower the page load time, the more likely the user is to give up and move elsewhere. A recent Google study said that a response time of less than 100ms is perceived by a human as ‘instant' (a human eye blink is somewhere in the 100ms to 400ms range); at up to 300ms the computer seems sluggish; above 1s the user's train of thought is lost to distraction or other thoughts. TCP's congestion avoidance algorithm means that many round trips are necessary when downloading a web page. For example, getting just the HTML for the CNN home page takes approximately 15 round trips; it's not hard to see how high latency can quickly multiply into a situation where the end user loses patience with the web site.</p><p>Unfortunately, it is not possible to cache the HTML of most web pages because it is dynamically generated. Dynamic pages are commonplace because the HTML is programmatically generated and not static. For example, a news web site will generate fresh HTML as news stories change, or to show a different page depending on the geographical location of the end user. Many web pages are also dynamically generated because they are personalized for the end user: each person's Facebook page is unique. And web application frameworks, such as WordPress, encourage the use of dynamically generated HTML by default and mark the content as uncacheable.</p>
    <div>
      <h3>Compression to the rescue</h3>
      <a href="#compression-to-the-rescue">
        
      </a>
    </div>
    <p>Given that web pages need to be dynamically generated, the only viable option is to reduce the page size so that fewer TCP round trips are needed, minimizing the effect of latency. The current best option for doing this is the use of the gzip encoding. On typical web page content, gzip encoding will reduce the page size to about 20-25% of the original size. But this still results in multiple kilobytes of page data being transmitted, incurring the TCP congestion avoidance and latency penalty; in the CNN example above there were 15 round-trips even though the page was gzip compressed.</p><p>Gzip encoding is completely generic. It does not take into account any special features of the content it is compressing. It is also self-referential: a gzip encoded page is entirely self-contained. This is advantageous because it means that a system that uses gzipped content can be stateless, but it also means that the even larger compression ratios that would be possible with external dictionaries of common content are not possible.</p><p>External dictionaries increase compression ratios dramatically because the compressed data can refer to items from the dictionary. Those references can be very small (a few bytes each) but expand to very large content from the dictionary.</p><p>For example, imagine that it's necessary to transmit The King James Bible to a user. The plain text version from Project Gutenberg is 4,452,097 bytes and compressed with gzip it is 1,404,452 bytes (a reduction in size to 31%). But imagine the case where the compressor knows that the end user has a separate copy of the Old Testament and New Testament in a dictionary of useful content. Instead of transmitting a megabyte of gzip compressed content they can transmit an instruction of the form &lt;Insert Old Testament&gt;&lt;Insert New Testament&gt;. That instruction will just be a few bytes long.</p><p>Clearly, that's an extreme and unusual case, but it highlights the usefulness of external shared dictionaries of common content that can be used to reconstruct an original, uncompressed document. External dictionaries can be applied to dynamically generated web content to achieve compression that exceeds that possible with gzip.</p>
    <div>
      <h3>Caching page parts</h3>
      <a href="#caching-page-parts">
        
      </a>
    </div>
    <p>On the web, shared dictionaries make sense because dynamic web content contains large chunks that are the same for all users and over time. Consider, for example, the BBC News homepage, which is approximately 116KB of HTML. That page is dynamically generated and the HTTP caching headers are set so that it is not cached. Even though the news stories on the page are frequently updated, a large amount of boilerplate HTML does not change from request to request (or even user to user). The first 32KB of the page (28% of the HTML) consists of embedded JavaScript, headers, navigational elements and styles. If that ‘header block' were stored by web browsers in a local dictionary then the BBC would only need to send a small instruction saying &lt;Insert BBC Header&gt; instead of 32KB of data. That would save multiple round-trips. And throughout the BBC News page there are smaller chunks of unchanging content that could be referenced from a dictionary.</p><p>It's not hard to imagine that for any web site there are large parts of the HTML that are the same from request to request and from user to user. Even on a very personalized site like Facebook the HTML is similar from user to user.</p><p>And as more and more applications use HTTP for APIs, there's an opportunity to increase API performance through the use of shared dictionaries of JSON or XML. APIs often contain even more common, repeated parts than HTML, as they are intended for machine consumption and change slowly over time (whereas the HTML of a page will change more quickly as designers update the look of a page).</p><p>Two different proposals, SDCH and ESI, have tried to address this in different ways. Neither has achieved acceptance as an Internet standard, partly because of the added complexity of deploying them.</p>
    <div>
      <h4>SDCH</h4>
      <a href="#sdch">
        
      </a>
    </div>
    <p>In 2008, a group working at Google proposed a protocol for negotiating shared dictionaries of content so that a web server can compress a page in the knowledge that a web browser has chunks of the page in its cache. The proposal is known as <a href="http://en.wikipedia.org/wiki/Shared_Dictionary_Compression_Over_HTTP">SDCH</a> (Shared Dictionary Compression over HTTP). Current versions of Google Chrome use SDCH to compress Google Search results.</p><p>This can be seen in the Developer Tools in Google Chrome. Any search request will contain an HTTP header specifying that the browser accepts SDCH compressed pages:</p>
            <pre><code>Accept-Encoding: gzip,deflate,sdch</code></pre>
            <p>And if SDCH is used, the server responds indicating the dictionary that was used. If necessary, Chrome will retrieve the dictionary. Since the dictionary should change infrequently, it will be in the local web browser cache most of the time. For example, here's a sample HTTP header seen in a real response from a Google Search:</p>
            <pre><code>Get-Dictionary: /sdch/60W93cgP.dct</code></pre>
            <p>The dictionary file simply contains HTML (and JavaScript etc.) and the compressed page contains references to parts of the dictionary file using the <a href="http://en.wikipedia.org/wiki/VCDIFF">VCDIFF</a> format specified in <a href="http://tools.ietf.org/html/rfc3284">RFC 3284</a>. The compressed page consists mostly of COPY and ADD VCDIFF functions. A COPY x, y instruction tells the browser to copy y bytes of data from position x in the dictionary (this is how common content gets compressed and expanded from the dictionary). The ADD instruction is used to insert uncompressed data (i.e. those parts of the page that are not in the dictionary).</p><p>In a Google Search the dictionary is used to locally cache infrequently changing parts of a page (such as the HTML header, navigation elements and page footer).</p><p>SDCH has not achieved widespread acceptance because of the difficulty of generating the shared dictionaries. Three problems arise: when to update the dictionary, how to update the dictionary, and how to prevent leakage of private information.</p><p>For maximum effectiveness, it's desirable to produce a shared dictionary that will be useful in reducing page sizes across a large number of page views. To do this it's necessary either to implement an automatic technique that samples real web traffic and identifies common blocks of HTML, or to determine which pages are most viewed and compute dictionaries for them (perhaps based on specialised knowledge of what parts of the page are common across requests).</p><p>When automated techniques are used, it's important to ensure that, when sampling traffic that contains personal information (such as for a logged-in user), the personal information does not end up in the dictionary.</p><p>Although SDCH is powerful when used, these dictionary-generation difficulties have prevented its widespread use. The Apache mod_sdch project is inactive and the Google SDCH group has been largely inactive since 2011.</p>
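            <p>To make COPY and ADD concrete, here is a toy decoder in Go. Real VCDIFF is a compact binary format with its own instruction encoding; this sketch only mirrors the two operations described above, with invented example strings.</p>
            <pre><code>package main

import "fmt"

// instr mimics the spirit of VCDIFF's COPY and ADD operations.
type instr struct {
	op   string // "COPY" or "ADD"
	pos  int    // COPY: position in the dictionary
	n    int    // COPY: number of bytes to copy
	data []byte // ADD: literal bytes not found in the dictionary
}

func apply(dict []byte, prog []instr) []byte {
	var out []byte
	for _, in := range prog {
		switch in.op {
		case "COPY":
			out = append(out, dict[in.pos:in.pos+in.n]...)
		case "ADD":
			out = append(out, in.data...)
		}
	}
	return out
}

func main() {
	// The dictionary holds the page's unchanging boilerplate.
	dict := []byte("shared header and navigation | shared footer")
	page := apply(dict, []instr{
		{op: "COPY", pos: 0, n: 30},                        // boilerplate
		{op: "ADD", data: []byte("today's story here | ")}, // new content
		{op: "COPY", pos: 31, n: 13},                       // the footer
	})
	fmt.Println(string(page))
}
</code></pre>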
    <div>
      <h4>ESI</h4>
      <a href="#esi">
        
      </a>
    </div>
    <p>In 2001 a consortium of companies proposed addressing both latency and common content with <a href="http://en.wikipedia.org/wiki/Edge_Side_Includes">ESI</a> (Edge Side Includes). Edge Side Includes work by having a web page creator identify the unchanging parts of a page and then make those parts available as separate mini-pages over HTTP.</p><p>For example, if a page contains a common header and navigation, a web page author might place that markup in a separate nav.html file and then, in the page they are authoring, enter the following XML in place of the header and navigation HTML:</p>
            <pre><code>&lt;esi:include src="http://example.com/nav.html" onerror="continue"/&gt;</code></pre>
            <p>ESI is intended for use with HTML content that is delivered via a Content Delivery Network, and the major CDNs were the sponsors of the original proposal.</p><p>When a user retrieves a CDN-managed page that contains ESI components, the <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDN</a> reconstructs the complete page from the component parts (which the CDN will either have to retrieve or, more likely, have in cache since they change infrequently).</p><p>The CDN delivers the complete, normal HTML to the end user, but because the CDN has access nodes all over the world the latency between the end user's web browser and the CDN is minimized. ESI tries to minimize the amount of data sent between the origin web server and the CDN (where the latency may be high) while being transparent to the browser.</p><p>The biggest problem with adoption of ESI is that it forces web page authors to break pages up into blocks that can be safely cached by a CDN, adding to the complexity of web page authoring. In addition, a CDN has to be used to deliver the pages since web browsers do not understand the ESI directives.</p>
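            <p>The expansion step a CDN performs is simple in spirit. Here is a toy sketch; the fragment table and <code>fetch_fragment</code> helper are hypothetical, and a real edge server implements the full ESI language, error handling and cache revalidation:</p>
            <pre><code>import re

# Hypothetical fragment cache mapping URLs to cached mini-pages.
FRAGMENTS = {
    "http://example.com/nav.html": "&lt;nav&gt;...site navigation...&lt;/nav&gt;",
}

def fetch_fragment(url):
    # A real edge node would fetch from the origin on a cache miss.
    return FRAGMENTS[url]

# Find &lt;esi:include src="..."/&gt; tags and splice in the named fragment.
ESI_INCLUDE = re.compile(r'&lt;esi:include\s+src="([^"]+)"[^&gt;]*/&gt;')

def expand(page):
    return ESI_INCLUDE.sub(lambda m: fetch_fragment(m.group(1)), page)

page = ('&lt;html&gt;&lt;body&gt;'
        '&lt;esi:include src="http://example.com/nav.html" onerror="continue"/&gt;'
        '&lt;p&gt;the stories&lt;/p&gt;&lt;/body&gt;&lt;/html&gt;')
print(expand(page))</code></pre>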
    <div>
      <h3>The time dimension</h3>
      <a href="#the-time-dimension">
        
      </a>
    </div>
    <p>The SDCH and ESI approaches rely on identifying parts of pages that are known to be unchanging and can be cached either at the edge of a CDN or in a shared dictionary in a web browser.</p><p>Another approach is to consider how web pages evolve over time. It is common for web users to visit the same web pages frequently (such as news sites, online email, social media and major retailers). This may mean that a user's web browser has some previous version of the web page they are loading in its local cache. Even though that web page may be out of date, it could still be used as a shared dictionary since components of it are likely to appear in the latest version of the page.</p><p>For example, a daily visit to a news web site could be sped up if a web server were able to send only the differences between yesterday's news and today's. It's likely that most of the HTML of a page like the BBC News homepage will have remained unchanged; only the stories will be new, and they make up only a small portion of the page.</p><p>CloudFlare looked at how much dynamically generated pages change over time and found that, for example, reddit.com changes by about 2.15% over five minutes and 3.16% over an hour. The New York Times home page changes by about 0.6% over five minutes and 3% over an hour. BBC News changes by about 0.4% over five minutes and 2% over an hour. With delta compression it would be possible to turn those figures directly into a compression ratio by sending only the tiny percentage of the page that has changed. Compressing the BBC News web site to 0.4% is an enormous improvement on gzip's 20-25% compression ratio: 116KB would result in just 464 bytes transmitted, which would likely fit in a single TCP packet and so require a single round trip.</p><p>This delta method is the essence of <a href="http://www.ietf.org/rfc/rfc3229.txt">RFC 3229</a>, which was written in 2002.</p>
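            <p>The measurement itself is easy to reproduce in rough form. A sketch along these lines compares two copies of a page fetched some interval apart (the snapshot filenames are hypothetical, and this is not necessarily how CloudFlare computed its figures):</p>
            <pre><code>import difflib

# What fraction of a page changed between two snapshots taken some
# interval apart? 0.0 means identical, 1.0 means nothing in common.
def changed_fraction(old, new):
    matcher = difflib.SequenceMatcher(None, old.splitlines(), new.splitlines())
    return 1.0 - matcher.ratio()

# Hypothetical snapshots of the same homepage, five minutes apart.
old_html = open("homepage_t0.html").read()
new_html = open("homepage_t1.html").read()
print("{:.2%} of the page changed".format(changed_fraction(old_html, new_html)))</code></pre>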
    <div>
      <h3>RFC 3229</h3>
      <a href="#rfc-3229">
        
      </a>
    </div>
    <p>This RFC proposes an extension to HTTP whereby a web browser can indicate to a server that it has a particular version of a page (using the value from the ETag HTTP header that was supplied when the page was previously downloaded). The receiving web server can then apply a delta compression technique (encoded using the VCDIFF format discussed above) to send only the parts that have changed since that particular version of the page.</p><p>The RFC also proposes that a web browser be able to send the identifiers of multiple versions of a single page so that the web server can choose among them. That way, if the web browser has multiple versions in cache, there's an increased chance that the server will have one of those versions available to it for delta compression.</p><p>Although this technique is powerful (it greatly reduces the amount of data sent from web server to browser), it has not been widely deployed because of the enormous resources it demands of web servers.</p><p>To be effective, a web server would need to keep copies of the versions of the pages it generates so that when a request comes in it is able to perform delta compression. For a popular web site that would create a large storage burden; for a site with heavy personalization it would mean keeping a copy of the pages served to every single user. For example, Facebook has around 1 billion active users; just keeping a copy of the HTML of the last time each of them viewed their timeline would require around 250TB of storage.</p>
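            <p>On the wire, an RFC 3229 exchange looks roughly like this (the ETag values are illustrative). The browser lists the versions it holds in <code>If-None-Match</code> and the delta encodings it accepts in <code>A-IM</code>; the server answers with status <code>226 IM Used</code>, naming the encoding in <code>IM</code> and the base version in <code>Delta-Base</code>:</p>
            <pre><code>GET /news HTTP/1.1
Host: example.com
If-None-Match: "v123", "v125"
A-IM: vcdiff

HTTP/1.1 226 IM Used
ETag: "v130"
IM: vcdiff
Delta-Base: "v125"</code></pre>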
    <div>
      <h3>CloudFlare's Railgun</h3>
      <a href="#cloudflares-railgun">
        
      </a>
    </div>
    <p>CloudFlare's <a href="https://www.cloudflare.com/railgun">Railgun</a> is a transparent delta compression technology that takes advantage of CloudFlare's CDN network to greatly accelerate the transmission of dynamically generated web pages from origin web servers to the CDN node nearest the end user. Unlike SDCH and ESI it does not require any work on the part of a web site creator, and unlike RFC 3229 it does not require caching a version of each page for each end user.</p><p>Railgun consists of two components: the sender and the listener. The sender is installed at every CloudFlare data center around the world. The listener is a software component that customers install on their network.</p><p>The sender and listener establish a permanent TCP connection that's secured by TLS. This TCP connection is used for the Railgun protocol: an all-binary multiplexing protocol that allows multiple HTTP requests to be run simultaneously and asynchronously across the link. To a web client the Railgun system looks like a proxy server, but instead of being a server it's a wide-area link with special properties. One of those properties is that it performs compression on non-cacheable content by synchronizing page versions.</p><p>Each end of the Railgun link keeps track of the last version of a web page that's been requested. When a new request comes in for a page that Railgun has already seen, only the changes are sent across the link. The listener component makes an HTTP request to the real, origin web server for the uncacheable page, compares the response with the stored version, and sends across the differences.</p><p>The sender then reconstructs the page from its cache and the difference sent by the other side. Because multiple users pass through the same Railgun link, only a single cached version of the page is needed for delta compression, as opposed to one per end user with techniques like RFC 3229.</p><p>For example, a test on a major news site sent 23,529 bytes of gzipped data which, when decompressed, became 92,516 bytes of page (so the page is compressed to about 25.4% of its original size). Railgun compression between two versions of the page at a five minute interval resulted in just 266 bytes of difference data being sent (a compression to 0.29% of the original page size). The one hour difference was 2,885 bytes (a compression to 3% of the original page size). Clearly, Railgun delta compression outperforms gzip enormously.</p><p>For pages that are frequently accessed the deltas are often so small that they fit inside a single TCP packet, and because the connection between the two parts of Railgun is kept active, problems with TCP congestion avoidance are eliminated.</p>
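            <p>The listener's core loop can be pictured with a short sketch. Everything here is a hypothetical simplification: <code>fetch_from_origin</code> and <code>compute_delta</code> stand in for the real binary, multiplexed protocol and its VCDIFF-style encoder. But it shows why one cached version per URL suffices:</p>
            <pre><code># Sketch of the listener side of a Railgun-style link. The helper
# functions passed in are hypothetical stand-ins; the real protocol
# is binary, multiplexed and proprietary.
last_version = {}  # one cached body per URL, shared by all end users

def handle_request(url, fetch_from_origin, compute_delta):
    new_body = fetch_from_origin(url)   # ask the origin as usual
    old_body = last_version.get(url)
    last_version[url] = new_body
    if old_body is None:
        return ("FULL", new_body)       # first sighting: send everything
    # Afterwards: ship only a delta across the long-haul link.
    return ("DELTA", compute_delta(old_body, new_body))</code></pre>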
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The use of external dictionaries of content is a powerful technique that can achieve much larger compression ratios than the self-contained gzip method. But only CloudFlare's Railgun implements delta compression in a manner that is completely transparent to end users and website owners.</p> ]]></content:encoded>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Cache]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">39TxvNcWsuE7mZgsb8eRZi</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
    </channel>
</rss>