The Cloudflare Blog

Unweight: how we compressed an LLM 22% without sacrificing quality

Mari Galicer — Fri, 17 Apr 2026 13:00:00 GMT

Running inference within 50ms of 95% of the world's Internet-connected population means being ruthlessly efficient with GPU memory. Last year we improved memory utilization with Infire, our Rust-based inference engine, and eliminated cold-starts with Omni, our model scheduling platform. Now we are tackling the next big bottleneck in our inference platform: model weights.

Generating a single token from an LLM requires reading every model weight from GPU memory. On the NVIDIA H100 GPUs we use in many of our datacenters, the tensor cores can process data nearly 600 times faster than memory can deliver it, leading to a bottleneck not in compute, but memory bandwidth. Every byte that crosses the memory bus is a byte that could have been avoided if the weights were smaller.

To solve this problem, we built Unweight: a lossless compression system that can make model weights up to 15–22% smaller while preserving bit-exact outputs, without relying on any special hardware. The core breakthrough here is that decompressing weights in fast on-chip memory and feeding them directly to the tensor cores avoids an extra round-trip through slow main memory. Depending on the workload, Unweight’s runtime selects from multiple execution strategies – some prioritize simplicity, others minimize memory traffic – and an autotuner picks the best one per weight matrix and batch size.

This post dives into how Unweight works, but in the spirit of greater transparency and encouraging innovation in this rapidly developing space, we’re also publishing a technical paper and open sourcing the GPU kernels.

Our initial results on Llama-3.1-8B show ~30% compression of Multi-Layer Perceptron (MLP) weights alone. Because Unweight works selectively on the parameters for decoding, this leads to a 15-22% in model size reduction and ~3 GB VRAM savings. As shown in the graphic below, this enables us to squeeze more out of our GPUs and thus run more models in more places — making inference cheaper and faster on Cloudflare’s network.

^{Thanks to Unweight, we’re able to fit more models on a single GPU}

Why compression is harder than it sounds

There is a growing body of research exploring how to compress model weights in creative ways to make inference faster and/or run on smaller GPUs. The most common is quantization, a technique to reduce the size of model weights and activations by converting large 32- or 16-bit floating point numbers to smaller 8 or 4-bit integers. This is a form of lossy compression: different 16-bit floating point values can be converted to the same 4-bit integer. This reduction in accuracy affects the quality of responses in unpredictable ways. For production inference serving diverse use cases, we knew we wanted something lossless that preserves exact model behaviour.

Several recent systems (Huff-LLM, ZipNN, and ZipServ) have shown that LLM weights can be compressed significantly, but these approaches target different problems than ours. ZipNN compresses weights for distribution and storage with decompression happening on the CPU. HUff-LLM proposes custom FGPA hardware for decoding. And ZipServ does fuse decompression with GPU inference, but targets consumer grade GPUs, which don’t work with our H100 GPUs. None of these gave us what we needed: lossless inference-time decompression on Hopper GPUs that can integrate with our Rust based inference engine.

The core challenge isn't vanilla compression — exponent bytes in BF16 weights are highly redundant, so entropy coding works well on them. The challenge is decompressing fast enough that it doesn't slow down inference. On an H100, the tensor cores sit idle waiting for memory most of the time — but that idle capacity can't simply be repurposed for decompression. Each GPU compute unit can run either the decompression kernel or the matrix multiplication kernel, not both simultaneously, due to shared memory constraints. Any decode latency that isn't perfectly overlapped with the matrix multiplication becomes directly additive to token latency. Unweight's answer is to decompress weights in fast on-chip shared memory and feed the results directly to the tensor cores — but making that work efficiently across different batch sizes and weight shapes is where the real engineering lives.

How model weights can be compressed effectively

Every number in an AI model is stored as a 16-bit "brain float" (BF16). Each BF16 value has three parts:

Sign (1 bit): positive or negative
Exponent (8 bits): the magnitude
Mantissa (7 bits): the precise value within that magnitude

Here’s how one of these weights breaks down:

The sign and mantissa vary unpredictably across weights — they look like random data and can't be meaningfully compressed. But the exponent tells a different story.

The exponent is surprisingly predictable

Prior research has established that across trained LLMs, out of 256 possible exponent values, just a handful dominate. The top 16 most common exponents cover over 99% of all weights in a typical layer. Information theory says you only need ~2.6 bits to represent this distribution — far less than the 8 bits allocated. If you look at the exponent value distribution in a typical LLM layer, you can see that the top 16 exponents account for 99% of all model weights.

Exponent value distribution in a typical LLM layer

This is the redundancy that Unweight exploits. We leave the sign and mantissa untouched and compress only the exponent byte using Huffman coding — a classic technique that assigns short codes to common values and longer codes to rare ones. Because the exponent distribution is so skewed, this achieves roughly 30% compression on the exponent stream. We apply this selectively to the MLP weight matrices (gate, up, and down projections), which make up roughly two-thirds of a model’s parameters and dominate memory traffic during token generation. Attention weights, embeddings and layer norms are uncompressed. All told the optimizations translate to about 20% reduction in overall multilayer perceptron (MLP) weight size, as explained in full detail in our technical report.

The small number of weights with rare exponents are handled separately: if any weight in a row of 64 has an exponent outside the top-16 palette, the entire row is stored verbatim. This approach eliminates per-element branching in the hot path — instead of checking every single weight for edge cases, we make one decision per row up front.

The GPU memory bottleneck

An NVIDIA H100 GPU has two relevant kinds of memory:

High Bandwidth Memory (HBM): large, but relatively slow to access. This is where model weights live.
Shared memory (SMEM): tiny, but extremely fast. This is where the GPU stages data right before doing math.

^{During inference, generating each token requires reading the full weight matrix from HBM. The memory bus between HBM and SMEM is the performance bottleneck –}^{not the math itself. Fewer bytes across the bus = faster token generation.}

During inference, generating each token requires reading the full weight matrix from HBM through the memory bus — this is the bottleneck. The H100's tensor cores can crunch numbers far faster than HBM can feed them data. Compression helps because fewer bytes need to cross the bus. But there's a catch: the GPU can't do math on compressed data. The weights must be decompressed first.

Most prior work decompresses entire weight matrices back into HBM, then runs a standard matrix multiplication. This helps with storage capacity but doesn't help with bandwidth because you still read the full uncompressed matrix from HBM for every token.

Four ways to use compressed weights

There's no single best way to use compressed weights during inference. The right approach depends on the workload — the batch size, the shape of the weight matrix, and how much GPU time is available for decompression. Unweight offers four compressed execution pipelines, each with a different balance between decompression effort and computation complexity: a full Huffman decode, exponent-only decode, palette transcode, or skipping pre-processing completely.

^{Four different execution pipelines}

The four pipelines form a spectrum. At one end, full decode completely reconstructs the original BF16 weights and hands them to NVIDIA’s cuBLAS library for a standard matrix multiplication. This is the simplest path with cuBLAS running at full speed on ordinary data, but the preprocess step writes the most bytes back to main memory. It works well at small batch sizes where the matrix multiplication is tiny and custom kernel overhead dominates. At the other end, direct palette skips preprocessing entirely. Weights are pre-transcoded to a compact 4-bit format at model load time, and the matrix multiplication kernel reconstructs BF16 values on the fly from these indices. Zero preprocess cost, but the kernel does more work per element.

In between sit two independent paths: one that decodes only the exponent bytes (halving preprocess traffic), and one that transcodes to 4-bit palette indices at runtime (quartering it). Both use a reconstructive matrix multiplication — a custom kernel that loads compressed data, reconstructs BF16 in fast shared memory, and feeds it directly to the tensor cores without a round-trip through main memory.

Why no single pipeline wins

Less preprocessing means less data written to HBM, which frees the memory bus sooner. But it shifts more reconstruction work onto the matmul kernel. Whether that tradeoff pays off depends on the situation.

With small batch sizes (i.e. 1-64 tokens), the matmul is tiny, so there isn't much computation to overlap with, and the fixed costs of a custom kernel dominate. Full decode + cuBLAS often wins simply because cuBLAS has lower overhead. With large batch sizes (i.e. 256+ tokens), the matmul runs long enough to absorb the extra reconstruction work. A lighter preprocess finishes faster, and the freed-up bus bandwidth and compute overlap pay off. The palette or exponent pipelines pull ahead. Different weight matrices within the same layer can favor different pipelines. The "gate" and "up" projections have different dimensions than the "down" projection, changing the order of operations performed within the matmul which requires different performance tradeoffs.

Throughput vs pipeline strategy

This is why Unweight doesn't hard-code a single strategy. The runtime picks the best pipeline for each weight matrix at each batch size, informed by an autotuning process that measures actual end-to-end throughput on the target hardware (more on this below).

How the reconstructive matmul works

Three of the four pipelines use a custom matrix multiplication kernel that fuses decompression with computation. This kernel loads compressed data from HBM, reconstructs the original BF16 values in shared memory, and feeds them directly into the tensor cores — all in one operation. The reconstructed weights never exist in main memory.

Traditional decompression vs Unweight

With Unweight, ~30% fewer bytes cross the memory bus for MLP weight matrices

Inside this kernel, the GPU's thread groups are split into two roles:

A producer group loads compressed inputs from HBM into shared memory using dedicated memory-copy hardware (TMA). It stages sign+mantissa bytes, exponent data (or palette indices), and – for rows with rare exponents – the verbatim exponent rows. It runs ahead of the consumer, filling a circular buffer so data is ready before it's needed.
Consumer groups reconstruct BF16 values by combining exponents with sign+mantissa bytes, then immediately feed the result into Hopper's WGMMA tensor-core instructions. The reconstructed weights go straight from assembly to computation without leaving shared memory.

The reconstructive matmul comes in multiple variants, differing in how many output tiles each compute unit handles and how deep the circular buffer runs. Wider output tiles improve data reuse at large batch sizes; deeper buffers hide memory latency at small batch sizes. The autotuner selects the best variant per workload.

Sharing the GPU between decoding and computation

In the two fused pipelines, a separate preprocess kernel (Huffman decoder or palette transcoder) runs concurrently with the reconstructive matmul. But these kernels compete for GPU resources.

On Hopper, each compute unit (SM) has 228 KB of shared memory. The reconstructive matmul needs ~227 KB for its pipeline buffer and accumulator tiles. A decode kernel needs ~16 KB for its Huffman lookup table. Since 227 + 16 > 228, these two kernels cannot share the same compute unit. Every SM assigned to decoding is one fewer SM available for the matmul.

This creates a balancing act: more decode SMs means faster preprocessing but slower matrix multiplication, and vice versa. The optimal split is another tunable parameter — and another reason why the autotuner measures real throughput rather than relying on heuristics.

Pipelining across layers

Even with the SM partitioning constraint, Unweight hides much of the decompression cost by exploiting the structure of transformer models.

Not every layer needs Huffman decoding at runtime. Unweight classifies layers as "hard" (requiring Huffman preprocessing) or "easy" (using pre-transcoded palette data that the matmul can consume directly). The runtime alternates between them:

^{Decode runs on separate CUDA streams during bootstrap, attention, and easy MLP compute. By the time a hard layer's MLP runs, its preprocessed weights are already waiting}

While the GPU computes an easy layer — which needs no preprocessing — a separate set of CUDA streams is decoding the next hard layer's weights in the background. By the time the easy layers finish and the hard layer's turn arrives, its preprocessed data is already waiting. Double-buffered preprocess slots ensure that decode output from one hard layer isn't overwritten while it's still being consumed.

The down projection benefits most from this overlap: it's consumed last in the MLP sequence (after gate, activation, and up), so its decode has the longest runway to complete.

Autotuning

With four pipelines, multiple matmul kernel variants , and a tunable SM split between decoding and computation, the configuration space is large. Rather than hard-coding a single strategy, Unweight uses an autotuner that measures actual end-to-end inference throughput on the target hardware. It sweeps candidate configurations for the gate projection while holding up and down fixed, then sweeps up, then down, repeating until no further improvement is found. The result is a per-model configuration file that tells the runtime exactly which pipeline, matmul variant, and SM allocation to use for each projection at each batch size — all driven by measured performance rather than heuristics.

One compression format, multiple uses

Encoding format, execution pipeline, and scheduling are independent choices. The same Huffman-compressed model bundle can serve both distribution and inference:

For distribution, Huffman encoding maximizes compression (~22% total model size reduction), reducing transfer times when shipping models across the network.
For inference, Huffman-encoded projections can be transcoded to the palette intermediate format on model load, enabling the most efficient runtime execution without constraining the distribution format.

A single model bundle doesn't need to commit to one strategy at packaging time. The runtime selects the best execution path per projection and per batch size on the fly.

Our results

On Llama 3.1 8B (our primary testbed), Unweight achieves:

~13% model footprint reduction for inference bundles (compressing only gate/up MLP projections), or ~22% for distribution bundles (compressing all MLP projections including down). All compression is 100% bit-exact lossless. Extrapolating to Llama 70B, this can translate to roughly 18–28 GB saved depending on configuration.
30–40% throughput overhead at current optimization level, measured end-to-end on H100 SXM5. The overhead is largest at batch size 1 (~41%) and narrows at batch 1024 (~30%). Three known sources – small-batch fixed costs, redundant weight-tile reconstruction, and the excluded down projection – are under active optimization.

These are intermediate results on a single model. The compression ratios should generalize to other SwiGLU architectures (exponent statistics are consistent across model scales), but the throughput numbers are specific to the current kernel implementations and will change as optimization continues. We do not yet compress attention weights, embeddings, or layer norms, which dilute the overall reduction.

Why this matters

GPUs are expensive in multiple dimensions: the cost of the cards themselves, the high-bandwidth memory they demand, and their significant power consumption.

To combat this, several researchers have shown systems with promising results of ~30% compression ratios on full models — but these target consumer GPUs and research frameworks that don’t work at production scale. The key insight into Unweight’s development is that multilayer perceptrons (MLPs) constitute the majority of model weights and a significant amount of the compute cost during inference workloads. It compresses only MLP weights (avoiding overhead on layers where compression benefit is marginal), is designed specifically for datacenter H100 GPUs with their tightly-balanced compute and memory, and comes with four execution pipelines that adapt to batch size rather than using a single approach.

However, we want to be clear: Unweight is not a free lunch. On-chip reconstruction adds computational work that wouldn't exist with uncompressed weights. On Llama 3.1 8B, the inference configuration saves approximately 13% of total model memory at a throughput cost of roughly 30% at typical serving batch sizes. This gap narrows at larger batches (where preprocess overlap improves) and is expected to narrow further as we optimize — in particular, we haven't yet compressed the down projection in each MLP layer (about one-third of the compressible weights), and several kernel improvements are in active development.

For Cloudflare's network, Unweight gives us better capacity: it allows us to serve state-of-the-art models with less GPU memory per instance, which translates to cost savings and the ability to deploy more models in more places. For model distribution, the savings are larger: Huffman-compressed bundles are about 22% smaller, reducing transfer times when shipping models to edge locations worldwide.

What’s next

Looking forward, we have three concrete research directions we think will improve upon our efficiency gains:

Down projection compression. Unweight compresses gate and up MLP projections today, but down projection accounts for roughly one-third of compressible weights. This requires a different kernel variant due to its transposed dimensions, which we will expect to reduce the total model size beyond 22%.

Kernel optimization. The current 30–40% throughput overhead has three identified sources: small-batch fixed costs in the reconstructive matmul, redundant weight reconstruction at large batch sizes, and the missing down projection. Each has a known mitigation path, which we outline in our technical paper.

More models. Our results are for Llama 3.1 8B, but the underlying exponent statistics are consistent across SwiGLU architectures at all scales. We're working to bring Unweight to the larger models we serve through Workers AI.

Longer term, we are investigating what Unweight’s architecture means for Mixture-of-Experts models, where cold experts must be fetched on demand and reduced storage would further reduce cost.

This is a fast-moving field, so we’re excited to open-source our work here and contribute to a growing corpus of research in compression and GPU efficiency. Unweight is one piece of the puzzle, but we hope that other researchers find it a useful paradigm to build upon!

Bringing more transparency to post-quantum usage, encrypted messaging, and routing security

David Belson — Fri, 27 Feb 2026 06:00:00 GMT

Cloudflare Radar already offers a wide array of security insights — from application and network layer attacks, to malicious email messages, to digital certificates and Internet routing.

And today we’re introducing even more. We are launching several new security-related data sets and tools on Radar:

We are extending our post-quantum (PQ) monitoring beyond the client side to now include origin-facing connections. We have also released a new tool to help you check any website's post-quantum encryption compatibility.
A new Key Transparency section on Radar provides a public dashboard showing the real-time verification status of Key Transparency Logs for end-to-end encrypted messaging services like WhatsApp, showing when each log was last signed and verified by Cloudflare's Auditor. The page serves as a transparent interface where anyone can monitor the integrity of public key distribution and access the API to independently validate our Auditor’s proofs.
Routing Security insights continue to expand with the addition of global, country, and network-level information about the deployment of ASPA, an emerging standard that can help detect and prevent BGP route leaks.

Measuring origin post-quantum support

Since April 2024, we have tracked the aggregate growth of client support for post-quantum encryption on Cloudflare Radar, chronicling its global growth from under 3% at the start of 2024, to over 60% in February 2026. And in October 2025, we added the ability for users to check whether their browser supports X25519MLKEM768 — a hybrid key exchange algorithm combining classical X25519 with ML-KEM, a lattice-based post-quantum scheme standardized by NIST. This provides security against both classical and quantum attacks.

However, post-quantum encryption support on user-to-Cloudflare connections is only part of the story.

For content not in our CDN cache, or for uncacheable content, Cloudflare’s edge servers establish a separate connection with a customer’s origin servers to retrieve it. To accelerate the transition to quantum-resistant security for these origin-facing fetches, we previously introduced an API allowing customers to opt in to preferring post-quantum connections. Today, we’re making post-quantum compatibility of origin servers visible on Radar.

The new origin post-quantum support graph on Radar illustrates the share of customer origins supporting X25519MLKEM768. This data is derived from our automated TLS scanner, which probes TLS 1.3-compatible origins and aggregates the results daily. It is important to note that our scanner tests for support rather than the origin server's specific preference. While an origin may support a post-quantum key exchange algorithm, its local TLS key exchange preference can ultimately dictate the encryption outcome.

While the headline graph focuses on post-quantum readiness, the scanner also evaluates support for classical key exchange algorithms. Within the Radar Data Explorer view, you can also see the full distribution of these supported TLS key exchange methods.

As shown in the graphs above, approximately 10% of origins could benefit from a post-quantum-preferred key agreement today. This represents a significant jump from less than 1% at the start of 2025 — a 10x increase in just over a year. We expect this number to grow steadily as the industry continues its migration. This upward trend likely accelerated in 2025 as many server-side TLS libraries, such as OpenSSL 3.5.0+, GnuTLS 3.8.9+, and Go 1.24+, enabled hybrid post-quantum key exchange by default, allowing platforms and services to support post-quantum connections simply by upgrading their cryptographic library dependencies.

In addition to the Radar and Data Explorer graphs, the origin readiness data is available through the Radar API as well.

As an additional part of our efforts to help the Internet transition to post-quantum cryptography, we are also launching a tool to test whether a specific hostname supports post-quantum encryption. These tests can be run against any publicly accessible website, as long as they allow connections from Cloudflare’s egress IP address ranges.

_{A screenshot of the tool in Radar to test whether a hostname supports post-quantum encryption.}

The tool presents a simple form where users can enter a hostname (such as cloudflare.com or www.wikipedia.org) and optionally specify a custom port (the default is 443, the standard HTTPS port). After clicking "Test", the result displays a tag indicating PQ support status alongside the negotiated TLS key exchange algorithm. If the server prefers PQ secure connections, a green "PQ" tag appears with a message confirming the connection is "post-quantum secure." Otherwise, a red tag indicates the connection is "not post-quantum secure", showing the classical algorithm that was negotiated.

Under the hood, this tool uses Cloudflare Containers — a new capability that allows running container workloads alongside Workers. Since the Workers runtime is not exposed to details of the underlying TLS handshake, Workers cannot initiate TLS scans. Therefore, we created a Go container that leverages the crypto/tls package's support for post-quantum compatibility checks. The container runs on-demand and performs the actual handshake to determine the negotiated TLS key exchange algorithm, returning results through the Radar API.

With the addition of these origin-facing insights, complementing the existing client-facing insights, we have moved all the post-quantum content to its own section on Radar.

Securing E2EE messaging systems with Key Transparency

End-to-end encrypted (E2EE) messaging apps like WhatsApp and Signal have become essential tools for private communication, relied upon by billions of people worldwide. These apps use public-key cryptography to ensure that only the sender and recipient can read the contents of their messages — not even the messaging service itself. However, there's an often-overlooked vulnerability in this model: users must trust that the messaging app is distributing the correct public keys for each contact.

If an attacker were able to substitute an incorrect public key in the messaging app's database, they could intercept messages intended for someone else — all without the sender knowing.

Key Transparency addresses this challenge by creating an auditable, append-only log of public keys — similar in concept to Certificate Transparency for TLS certificates. Messaging apps publish their users' public keys to a transparency log, and independent third parties can verify and vouch that the log has been constructed correctly and consistently over time. In September 2024, Cloudflare announced such a Key Transparency auditor for WhatsApp, providing an independent verification layer that helps ensure the integrity of public key distribution for the messaging app's billions of users.

Today, we're publishing Key Transparency audit data in a new Key Transparency section on Cloudflare Radar. This section showcases the Key Transparency logs that Cloudflare audits, giving researchers, security professionals, and curious users a window into the health and activity of these critical systems.

The new page launches with two monitored logs: WhatsApp and Facebook Messenger Transport. Each monitored log is displayed as a card containing the following information:

Status: Indicates whether the log is online, in initialization, or disabled. An "online" status means the log is actively publishing key updates into epochs that Cloudflare audits. (An epoch represents a set of updates applied to the key directory at a specific time.)
Last signed epoch: The most recent epoch that has been published by the messaging service's log and acknowledged by Cloudflare. By clicking on the eye icon, users can view the full epoch data in JSON format, including the epoch number, timestamp, cryptographic digest, and signature.
Last verified epoch: The most recent epoch that Cloudflare has verified. Verification involves checking that the transition of the transparency log data structure from the previous epoch to the current one represents a valid tree transformation — ensuring the log has been constructed correctly. The verification timestamp indicates when Cloudflare completed its audit.
Root: The current root hash of the Auditable Key Directory (AKD) tree. This hash cryptographically represents the entire state of the key directory at the current epoch. Like the epoch fields, users can click to view the complete JSON response from the auditor.

The data shown on the page is also available via the Key Transparency Auditor API, with endpoints for auditor information and namespaces.

If you would like to perform audit proof verification yourself, you can follow the instructions in our Auditing Key Transparency blog post. We hope that these use cases are the first of many that we publish in this Key Transparency section in Radar — if your company or organization is interested in auditing for your public key or related infrastructure, you can reach out to us here.

Tracking RPKI ASPA adoption

While the Border Gateway Protocol (BGP) is the backbone of Internet routing, it was designed without built-in mechanisms to verify the validity of the paths it propagates. This inherent trust has long left the global network vulnerable to route leaks and hijacks, where traffic is accidentally or maliciously detoured through unauthorized networks.

Although RPKI and Route Origin Authorizations (ROAs) have successfully hardened the origin of routes, they cannot verify the path traffic takes between networks. This is where ASPA (Autonomous System Provider Authorization) comes in. ASPA extends RPKI protection by allowing an Autonomous System (AS) to cryptographically sign a record listing the networks authorized to propagate its routes upstream. By validating these Customer-to-Provider relationships, ASPA allows systems to detect invalid path announcements with confidence and react accordingly.

While the specific IETF standard remains in draft, the operational community is moving fast. Support for creating ASPA objects has already landed in the portals of Regional Internet Registries (RIRs) like ARIN and RIPE NCC, and validation logic is available in major software routing stacks like OpenBGPD and BIRD.

To provide better visibility into the adoption of this emerging standard, we have added comprehensive RPKI ASPA support to the Routing section of Cloudflare Radar. Tracking these records globally allows us to understand how quickly the industry is moving toward better path validation.

Our new ASPA deployment view allows users to examine the growth of ASPA adoption over time, with the ability to visualize trends across the five Regional Internet Registries (RIRs) based on AS registration. You can view the entire history of ASPA entries, dating back to October 1, 2023, or zoom into specific date ranges to correlate spikes in adoption with industry events, such as the introduction of ASPA features on ARIN and RIPE NCC online dashboards.

Beyond aggregate trends, we have also introduced a granular, searchable explorer for real-time ASPA content. This table view allows you to inspect the current state of ASPA records, searchable by AS number, AS name, or by filtering for only providers or customer ASNs. This allows network operators to verify that their records are published correctly and to view other networks’ configurations.

We have also integrated ASPA data directly into the country/region routing pages. Users can now track how different locations are progressing in securing their infrastructure, based on the associated ASPA records from the customer ASNs registered locally.

On individual AS pages, we have updated the Connectivity section. Now, when viewing the connections of a network, you may see a visual indicator for "ASPA Verified Provider." This annotation confirms that an ASPA record exists authorizing that specific upstream connection, providing an immediate signal of routing hygiene and trust.

For ASes that have deployed ASPA, we now display a complete list of authorized provider ASNs along with their details. Beyond the current state, Radar also provides a detailed timeline of ASPA activity involving the AS. This history distinguishes between changes initiated by the AS itself ("As customer") and records created by others designating it as a provider ("As provider"), allowing users to immediately identify when specific routing authorizations were established or modified.

Visibility is an essential first step toward broader adoption of emerging routing security protocols like ASPA. By surfacing this data, we aim to help operators deploy protections and assist researchers in tracking the Internet's progress toward a more secure routing path. For those who need to integrate this data into their own workflows or perform deeper analysis, we are also exposing these metrics programmatically. Users can now access ASPA content snapshots, historical timeseries, and detailed changes data using the newly introduced endpoints in the Cloudflare Radar API.

As security evolves, so does our data

Internet security continues to evolve, with new approaches, protocols, and standards being developed to ensure that information, applications, and networks remain secure. The security data and insights available on Cloudflare Radar will continue to evolve as well. The new sections highlighted above serve to expand existing routing security, transparency, and post-quantum insights already available on Cloudflare Radar.

If you share any of these new charts and graphs on social media, be sure to tag us: @CloudflareRadar (X), noc.social/@cloudflareradar (Mastodon), and radar.cloudflare.com (Bluesky). If you have questions or comments, or suggestions for data that you’d like to see us add to Radar, you can reach out to us on social media, or contact us via email.

Internet measurement, resilience, and transparency: blog takeover from Cloudflare Research and friends

Mari Galicer — Mon, 27 Oct 2025 12:00:00 GMT

The Cloudflare Research team spends our time investigating how we can apply new technologies to continue to help build a better Internet. We don’t just write papers – we put ideas into practice, and test our hypotheses in real time.

Our work is deeply collaborative by nature, working closely with academia, standards bodies like the IETF, the open-source community, and our own product and engineering teams. We believe in doing this research in the open so that others can learn from it, give us feedback, and work with us to make the next version of the Internet even better. That’s why this week we’re publishing a series of posts to make more of our research public – research that we think will help push forward a more measurable, resilient, and transparent Internet.

Internet Measurement will be one of the week’s major themes because our posts here coincide with the Association for Computing Machinery (ACM)’s annual Internet Measurement Conference, a venue for new work that measures and analyzes the behavior, performance, and evolution of the Internet and networked systems. Internet measurement is hard to get right, so we’re taking the opportunity to dive deeper into some of the foundational concepts and products that define how we do measurement at Cloudflare scale.

Each day this week we share new stories from our Research team and friends in our engineering groups elsewhere at Cloudflare. We will dive deep into Internet measurement data, establish new frameworks for Internet resilience, discuss cryptographic protocols for an increasingly automated web, and explore new advances in networking technologies.

We’re excited to showcase this work, so stay tuned this week for the posts to follow. Want a preview of what to expect? Read on for an outline of what we will cover this week.

An ode to Internet measurement

We’ll start the week with a foundational look at what Internet measurement actually consists of, explaining the jargon behind the science and some of the fundamental tradeoffs one has to make when trying to do measurement well. A former Cloudflare intern will share how working with Cloudflare-scale data completely changed his perspective on detecting connection tampering. We’ll also dig into how Cloudflare Radar has evolved in the past few years, and take a deeper look at how our Internet speed test works!

A better Internet is a more resilient Internet

Something that we take for granted, but notice when it fails: a network's ability not just to stay online, but to withstand, adapt to, and rapidly recover from breakdowns – otherwise known as Internet Resilience. There are many factors that can cause Internet disruption, from cyberattack to natural disaster to government-directed shutdowns. We’ll go deeper into these disruptions in our quarterly Internet Disruption Summary, which details the length and impact of each outage as observed from Cloudflare’s network.

It’s easy to say Internet Resilience is the goal, but it can be harder to define what that actually means. In our blog “A Framework for Internet Resilience,” we do exactly that – establish a framework for how governments, infrastructure providers, and researchers can assess how resilient their infrastructure is, from first principles.

A resilient Internet is also immune to quantum compromise. Much has happened since we published our highly cited State of the Post-Quantum Internet, so we’ll share an updated view of progress of post-quantum deployment over the past year, as well as a deep dive into Merkle Tree Certificates, an experimental design with Chrome to make post-quantum certificates deployable at scale.

A transparent look into Cloudflare’s network

Cloudflare sees millions of connections and IP addresses per second – and characterizing them at scale isn’t easy. We’ll take a deeper look at what a connection actually means at Cloudflare: what server-side characteristics we observe and measure across our network, and what they tell us about the size and flow of data through the Internet.

Many products at Cloudflare aren’t possible without pushing the limits of network hardware and software to deliver improved performance, increased efficiency, or novel capabilities. That’s why we’re sharing a deep dive into how we bend the limits of our Linux networking stack to be economical with addressing space while maintaining performance.

All of this theory has real-world applications we’ll dive into: from detecting shared IP space (CGNAT), to defending against DDoS attacks, to improving the efficiency of our cache.

Cryptographic protocols for an agentic web

The rise of AI agents and AI crawlers is a turning point for infrastructure providers. For instance, traffic from many users is condensed into a few beefy datacenters, and request patterns appear to be more automated as LLMs orchestrate web browsers. Measuring the impact of this shift has become an interesting and complex problem.

This week, we’ll dive into how honest agents and website operators can work together to stay safe, private, and resilient. We’ll discuss new work being done in the IETF that builds upon Web Bot Auth – a protocol that allows automated HTTP clients like bots and agents to identify themselves to the rest of the Internet. In addition, in order to empower honest users, we’ll propose new cryptographic protocols that allow them through while protecting websites from DDoS, fraud, or scraping attacks. We will present real-world deployment considerations, as well as mechanisms to future-proof them in the face of the imminent post-quantum transition.

Get your reading glasses on

Expect blog posts this week that push the boundaries of emerging research in their respective fields, establish new frameworks and ideas, and bridge the gap between academic theory and real-world applications. We couldn’t be more excited to share them with you!

How we built the most efficient inference engine for Cloudflare’s network

Vlad Krasnov — Wed, 27 Aug 2025 14:00:00 GMT

Inference powers some of today’s most powerful AI products: chat bot replies, AI agents, autonomous vehicle decisions, and fraud detection. The problem is, if you’re building one of these products on top of a hyperscaler, you’ll likely need to rent expensive GPUs from large centralized data centers to run your inference tasks. That model doesn’t work for Cloudflare — there’s a mismatch between Cloudflare’s globally-distributed network and a typical centralized AI deployment using large multi-GPU nodes. As a company that operates our own compute on a lean, fast, and widely distributed network within 50ms of 95% of the world’s Internet-connected population, we need to be running inference tasks more efficiently than anywhere else.

This is further compounded by the fact that AI models are getting larger and more complex. As we started to support these models, like the Llama 4 herd and gpt-oss, we realized that we couldn’t just throw money at the scaling problems by buying more GPUs. We needed to utilize every bit of idle capacity and be agile with where each model is deployed.

After running most of our models on the widely used open source inference and serving engine vLLM, we figured out it didn’t allow us to fully utilize the GPUs at the edge. Although it can run on a very wide range of hardware, from personal devices to data centers, it is best optimized for large data centers. When run as a dedicated inference server on powerful hardware serving a specific model, vLLM truly shines. However, it is much less optimized for dynamic workloads, distributed networks, and for the unique security constraints of running inference at the edge alongside other services.

That’s why we decided to build something that will be able to meet the needs of Cloudflare inference workloads for years to come. Infire is an LLM inference engine, written in Rust, that employs a range of techniques to maximize memory, network I/O, and GPU utilization. It can serve more requests with fewer GPUs and significantly lower CPU overhead, saving time, resources, and energy across our network.

Our initial benchmarking has shown that Infire completes inference tasks up to 7% faster than vLLM 0.10.0 on unloaded machines equipped with an H100 NVL GPU. On infrastructure under real load, it performs significantly better.

Currently, Infire is powering the Llama 3.1 8B model for Workers AI, and you can test it out today at @cf/meta/llama-3.1-8b-instruct!

The Architectural Challenge of LLM Inference at Cloudflare

Thanks to industry efforts, inference has improved a lot over the past few years. vLLM has led the way here with the recent release of the vLLM V1 engine with features like an optimized KV cache, improved batching, and the implementation of Flash Attention 3. vLLM is great for most inference workloads — we’re currently using it for several of the models in our Workers AI catalog — but as our AI workloads and catalog has grown, so has our need to optimize inference for the exact hardware and performance requirements we have.

Cloudflare is writing much of our new infrastructure in Rust, and vLLM is written in Python. Although Python has proven to be a great language for prototyping ML workloads, to maximize efficiency we need to control the low-level implementation details. Implementing low-level optimizations through multiple abstraction layers and Python libraries adds unnecessary complexity and leaves a lot of CPU performance on the table, simply due to the inefficiencies of Python as an interpreted language.

We love to contribute to open-source projects that we use, but in this case our priorities may not fit the goals of the vLLM project, so we chose to write a server for our needs. For example, vLLM does not support co-hosting multiple models on the same GPU without using Multi-Instance GPU (MIG), and we need to be able to dynamically schedule multiple models on the same GPU to minimize downtime. We also have an in-house AI Research team exploring unique features that are difficult, if not impossible, to upstream to vLLM.

Finally, running code securely is our top priority across our platform and Workers AI is no exception. We simply can’t trust a 3rd party Python process to run on our edge nodes alongside the rest of our services without strong sandboxing. We are therefore forced to run vLLM via gvisor. Having an extra virtualization layer adds another performance overhead to vLLM. More importantly, it also increases the startup and tear downtime for vLLM instances — which are already pretty long. Under full load on our edge nodes, vLLM running via gvisor consumes as much as 2.5 CPU cores, and is forced to compete for CPU time with other crucial services, that in turn slows vLLM down and lowers GPU utilization as a result.

While developing Infire, we’ve been incorporating the latest research in inference efficiency — let’s take a deeper look at what we actually built.

How Infire works under the hood

Infire is composed of three major components: an OpenAI compatible HTTP server, a batcher, and the Infire engine itself.

^{An overview of Infire’s architecture}

Platform startup

When a model is first scheduled to run on a specific node in one of our data centers by our auto-scaling service, the first thing that has to happen is for the model weights to be fetched from our R2 object storage. Once the weights are downloaded, they are cached on the edge node for future reuse.

As the weights become available either from cache or from R2, Infire can begin loading the model onto the GPU.

Model sizes vary greatly, but most of them are large, so transferring them into GPU memory can be a time-consuming part of Infire’s startup process. For example, most non-quantized models store their weights in the BF16 floating point format. This format has the same dynamic range as the 32-bit floating format, but with reduced accuracy. It is perfectly suited for inference providing the sweet spot of size, performance and accuracy. As the name suggests, the BF16 format requires 16 bits, or 2 bytes per weight. The approximate in-memory size of a given model is therefore double the size of its parameters. For example, LLama3.1 8B has approximately 8B parameters, and its memory footprint is about 16 GB. A larger model, like LLama4 Scout, has 109B parameters, and requires around 218 GB of memory. Infire utilizes a combination of Page Locked memory with CUDA asynchronous copy mechanism over multiple streams to speed up model transfer into GPU memory.

While loading the model weights, Infire begins just-in-time compiling the required kernels based on the model's parameters, and loads them onto the device. Parallelizing the compilation with model loading amortizes the latency of both processes. The startup time of Infire when loading the Llama-3-8B-Instruct model from disk is just under 4 seconds.

The HTTP server

The Infire server is built on top of hyper, a high performance HTTP crate, which makes it possible to handle hundreds of connections in parallel – while consuming a modest amount of CPU time. Because of ChatGPT’s ubiquity, vLLM and many other services offer OpenAI compatible endpoints out of the box. Infire is no different in that regard. The server is responsible for handling communication with the client: accepting connections, handling prompts and returning responses. A prompt will usually consist of some text, or a "transcript" of a chat session along with extra parameters that affect how the response is generated. Some parameters that come with a prompt include the temperature, which affects the randomness of the response, as well as other parameters that affect the randomness and length of a possible response.

After a request is deemed valid, Infire will pass it to the tokenizer, which transforms the raw text into a series of tokens, or numbers that the model can consume. Different models use different kinds of tokenizers, but the most popular ones use byte-pair encoding. For tokenization, we use HuggingFace's tokenizers crate. The tokenized prompts and params are then sent to the batcher, and scheduled for processing on the GPU, where they will be processed as vectors of numbers, called embeddings.

The batcher

The most important part of Infire is in how it does batching: by executing multiple requests in parallel. This makes it possible to better utilize memory bandwidth and caches.

In order to understand why batching is so important, we need to understand how the inference algorithm works. The weights of a model are essentially a bunch of two-dimensional matrices (also called tensors). The prompt represented as vectors is passed through a series of transformations that are largely dominated by one operation: vector-by-matrix multiplication. The model weights are so large, that the cost of the multiplication is dominated by the time it takes to fetch it from memory. In addition, modern GPUs have hardware units dedicated to matrix-by-matrix multiplications (called Tensor Cores on Nvidia GPUs). In order to amortize the cost of memory access and take advantage of the Tensor Cores, it is necessary to aggregate multiple operations into a larger matrix multiplication.

Infire utilizes two techniques to increase the size of those matrix operations. The first one is called prefill: this technique is applied to the prompt tokens. Because all the prompt tokens are available in advance and do not require decoding, they can all be processed in parallel. This is one reason why input tokens are often cheaper (and faster) than output tokens.

^{How Infire enables larger matrix multiplications via batching}

The other technique is called batching: this technique aggregates multiple prompts into a single decode operation.

Infire mixes both techniques. It attempts to process as many prompts as possible in parallel, and fills the remaining slots in a batch with prefill tokens from incoming prompts. This is also known as continuous batching with chunked prefill.

As tokens get decoded by the Infire engine, the batcher is also responsible for retiring prompts that reach an End of Stream token, and sending tokens back to the decoder to be converted into text.

Another job the batcher has is handling the KV cache. One demanding operation in the inference process is called attention. Attention requires going over the KV values computed for all the tokens up to the current one. If we had to recompute those previously encountered KV values for every new token we decode, the runtime of the process would explode for longer context sizes. However, using a cache, we can store all the previous values and re-read them for each consecutive token. Potentially the KV cache for a prompt can store KV values for as many tokens as the context window allows. In LLama 3, the maximal context window is 128K tokens. If we pre-allocated the KV cache for each prompt in advance, we would only have enough memory available to execute 4 prompts in parallel on H100 GPUs! The solution for this is paged KV cache. With paged KV caching, the cache is split into smaller chunks called pages. When the batcher detects that a prompt would exceed its KV cache, it simply assigns another page to that prompt. Since most prompts rarely hit the maximum context window, this technique allows for essentially unlimited parallelism under typical load.

Finally, the batcher drives the Infire forward pass by scheduling the needed kernels to run on the GPU.

CUDA kernels

Developing Infire gives us the luxury of focusing on the exact hardware we use, which is currently Nvidia Hopper GPUs. This allowed us to improve performance of specific compute kernels using low-level PTX instructions for this specific architecture.

Infire just-in-time compiles its kernel for the specific model it is running, optimizing for the model’s parameters, such as the hidden state size, dictionary size and the GPU it is running on. For some operations, such as large matrix multiplications, Infire will utilize the high performance cuBLASlt library, if it would deem it faster.

Infire also makes use of very fine-grained CUDA graphs, essentially creating a dedicated CUDA graph for every possible batch size on demand. It then stores it for future launch. Conceptually, a CUDA graph is another form of just-in-time compilation: the CUDA driver replaces a series of kernel launches with a single construct (the graph) that has a significantly lower amortized kernel launch cost, thus kernels executed back to back will execute faster when launched as a single graph as opposed to individual launches.

How Infire performs in the wild

We ran synthetic benchmarks on one of our edge nodes with an H100 NVL GPU.

The benchmark we ran was on the widely used ShareGPT v3 dataset. We ran the benchmark on a set of 4,000 prompts with a concurrency of 200. We then compared Infire and vLLM running on bare metal as well as vLLM running under gvisor, which is the way we currently run in production. In a production traffic scenario, an edge node would be competing for resources with other traffic. To simulate this, we benchmarked vLLM running in gvisor with only one CPU available.

	requests/s	tokens/s	CPU load
Infire	40.91	17224.21	25%
vLLM 0.10.0	38.38	16164.41	140%
vLLM under gvisor	37.13	15637.32	250%
vLLM under gvisor with CPU constraints	22.04	9279.25	100%

As evident from the benchmarks we achieved our initial goal of matching and even slightly surpassing vLLM performance, but more importantly, we’ve done so at a significantly lower CPU usage, in large part because we can run Infire as a trusted bare-metal process. Inference no longer takes away precious resources from our other services and we see GPU utilization upward of 80%, reducing our operational costs.

This is just the beginning. There are still multiple proven performance optimizations yet to be implemented in Infire – for example, we’re integrating Flash Attention 3, and most of our kernels don’t utilize kernel fusion. Those and other optimizations will allow us to unlock even faster inference in the near future.

What’s next

Running AI inference presents novel challenges and demands to our infrastructure. Infire is how we’re running AI efficiently — close to users around the world. By building upon techniques like continuous batching, a paged KV-cache, and low-level optimizations tailored to our hardware, Infire maximizes GPU utilization while minimizing overhead. Infire completes inference tasks faster and with a fraction of the CPU load of our previous vLLM-based setup, especially under the strict security constraints we require. This allows us to serve more requests with fewer resources, making requests served via Workers AI faster and more efficient.

However, this is just our first iteration — we’re excited to build in multi-GPU support for larger models, quantization, and true multi-tenancy into the next version of Infire. This is part of our goal to make Cloudflare the best possible platform for developers to build AI applications.

Want to see if your AI workloads are faster on Cloudflare? Get started with Workers AI today.

How Cloudflare runs more AI models on fewer GPUs: A technical deep-dive

Sven Sauleau — Wed, 27 Aug 2025 14:00:00 GMT

As the demand for AI products grows, developers are creating and tuning a wider variety of models. While adding new models to our growing catalog on Workers AI, we noticed that not all of them are used equally – leaving infrequently used models occupying valuable GPU space. Efficiency is a core value at Cloudflare, and with GPUs being the scarce commodity they are, we realized that we needed to build something to fully maximize our GPU usage.

Omni is an internal platform we’ve built for running and managing AI models on Cloudflare’s edge nodes. It does so by spawning and managing multiple models on a single machine and GPU using lightweight isolation. Omni makes it easy and efficient to run many small and/or low-volume models, combining multiple capabilities by:

Spawning multiple models from a single control plane,
Implementing lightweight process isolation, allowing models to spin up and down quickly,
Isolating the file system between models to easily manage per-model dependencies, and
Over-committing GPU memory to run more models on a single GPU.

Cloudflare aims to place GPUs as close as we possibly can to people and applications that are using them. With Omni in place, we’re now able to run more models on every node in our network, improving model availability, minimizing latency, and reducing power consumed by idle GPUs.

Here’s how.

Omni’s architecture – at a glance

At a high level, Omni is a platform to run AI models. When an inference request is made on Workers AI, we load the model’s configuration from Workers KV and our routing layer forwards it to the closest Omni instance that has available capacity. For inferences using the Asynchronous Batch API, we route to an Omni instance that is idle, which is typically in a location where it’s night.

Omni runs a few checks on the inference request, runs model specific pre and post processing, then hands the request over to the model.

Elastic scaling by spawning multiple models from a single control plane

If you’re developing an AI application, a typical setup is having a container or a VM dedicated to running a single model with a GPU attached to it. This is simple. But it’s also heavy-handed — because it requires managing the entire stack from provisioning the VM, installing GPU drivers, downloading model weights, and managing the Python environment. At scale, managing infrastructure this way is incredibly time consuming and often requires an entire team.

If you’re using Workers AI, we handle all of this for you. Omni uses a single control plane for running multiple models, called the scheduler, which automatically provisions models and spawns new instances as your traffic scales. When starting a new model instance, it downloads model weights, Python code, and any other dependencies. Omni’s scheduler provides fine-grained control and visibility over the model’s lifecycle: it receives incoming inference requests and routes them to the corresponding model processes, being sure to distribute the load between multiple GPUs. It then makes sure the model processes are running, rolls out new versions as they are released, and restarts itself when detecting errors or failure states. It also collects metrics for billing and emits logs.

The inference itself is done by a per-model process, supervised by the scheduler. It receives the inference request and some metadata, then sends back a response. Depending on the model, the response can be various types; for instance, a JSON object or a SSE stream for text generation, or binary for image generation.

The scheduler and the child processes communicate by passing messages over Inter-Process Communication (IPC). Usually the inference request is buffered in the scheduler for applying features, like prompt templating or tool calling, before the request is passed to the child process. For potentially large binary requests, the scheduler hands over the underlying TCP connection to the child process for consuming the request body directly.

Implementing lightweight process and Python isolation

Typically, deploying a model requires its own dedicated container, but we want to colocate more models on a single container to conserve memory and GPU capacity. In order to do so, we needed finer-grained controls over CPU memory and the ability to isolate a model from its dependencies and environment. We deploy Omni in two configurations; a container running multiple models or bare metal running a single model. In both cases, process isolation and Python virtual environments allow us to isolate models with different dependencies by creating namespaces and are limited by cgroups.

Python doesn’t take into account cgroups memory limits for memory allocations, which can lead to OOM errors. Many AI Python libraries rely on psutil for pre-allocating CPU memory. psutil reads /proc/meminfo to determine how much memory is available. Since in Omni each model has its own configurable memory limits, we need psutil to reflect the current usage and limits for a given model, not for the entire system.

The solution for us was to create a virtual file system, using fuse, to mount our own version of /proc/meminfo which reflects the model’s current usage and limits.

To illustrate this, here’s an Omni instance running a model (running as pid 8). If we enter the mount namespace and look at /proc/meminfo it will reflect the model’s configuration:

# Enter the mount (file system) namespace of a child process
$ nsenter -t 8 -m

$ mount
...
none /proc/meminfo fuse ...

$ cat /proc/meminfo
MemTotal:     7340032 kB
MemFree:     7316388 kB
MemAvailable:     7316388 kB

In this case the model has 7Gib of memory available and the entire container 15Gib. If the model tries to allocate more than 7Gib of memory, it will be OOM killed and restarted by the scheduler’s process manager, without causing any problems to the other models.

For isolating Python and some system dependencies, each model runs in a Python virtual environment, managed by uv. Dependencies are cached on the machine and, if possible, shared between models (uv uses symbolic links between its cache and virtual environments).

Also separated processes for models allows to have different CUDA contexts and isolation for error recovery.

Over-committing memory to run more models on a single GPU

Some models don’t receive enough traffic to fully utilize a GPU, and with Omni we can pack more models on a single GPU, freeing up capacity for other workloads. When it comes to GPU memory management, Omni has two main jobs: safely over-commit GPU memory, so that more models than normal can share a single GPU, and enforce memory limits, to prevent any single model from running out of memory while running.

Over-committing memory means allocating more memory than is physically available to the device.

For example, if a GPU has 10 Gib of memory, Omni would allow 2 models of 10Gib each on that GPU.

Right now, Omni is configured to run 13 models and is allocating about 400% GPU memory on a single GPU, saving up 4 GPUs. Omni does this by injecting a CUDA stub library that intercepts CUDA memory allocations (cuMalloc* or cudaMalloc*) calls and forces memory allocations to be performed in unified memory mode.

In Unified memory mode CUDA shares the same memory address space for both the GPU and the CPU:

^CUDA’s^{unified memory mode}

In practice this is what memory over-commitment looks like: imagine 3 models (A, B and C). Models A+B fit in the GPU’s memory but C takes up the entire memory.

Models A+B are loaded first and are in GPU memory, while model C is in CPU memory
Omni receives a request for model C so models A+B are swapped out and C is swapped in.
Omni receives a request for model B, so model C is partly swapped out and model B is swapped back in.
Omni receives a request for model A, so model A is swapped back in and model C is completely swapped out.

The trade-off is added latency: if performing an inference requires memory that is currently on the host system, it must be transferred to the GPU. For smaller models, this latency is minimal, because with PCIe 4.0, the physical bus between your GPU and system, provides 32 GB/sec of bandwidth. On the other hand, if a model need to be “cold started” i.e. it’s been swapped out because it hasn’t been used in a while, the system may need to swap back the entire model – a larger sized model, for example, might use 5Gib of GPU memory for weights and caches, and would take ~156ms to be swapped back into the GPU. Naturally, over time, inactive models are put into CPU memory, while active models stay hot in the GPU.

Rather than allowing the model to choose how much GPU memory it uses, AI frameworks tend to pre-allocate as much GPU memory as possible for performance reasons, making co-locating models more complicated. Omni allows us to control how much memory is actually exposed to any given model to prevent a greedy model from over-using the GPU allocated to it. We do this by overriding the CUDA runtime and driver APIs (cudaMemGetInfo and cuMemGetInfo). Instead of exposing the entire GPU memory, we only expose a subset of memory to each model.

How Omni runs multiple models for Workers AI

AI models can run in a variety of inference engines or backends: vLLM, Python, and now our very own inference engine, Infire. While models have different capabilities, each model needs to support Workers AI features, like batching and function calling. Omni acts as a unified layer for integrating these systems. It integrates into our internal routing and scheduling systems, and provides a Python API for our engineering team to add new models more easily. Let’s take a closer look at how Omni does this in practice:

from omni import Response
import cowsay


def handle_request(request, context):
    try:
        json = request.body.json
        text = json["text"]
    except Exception as err:
        return Response.error(...)

    return cowsay.get_output_string('cow', text)

Similar to how a JavaScript Worker works, Omni calls a request handler, running the model’s logic and returning a response.

Omni installs Python dependencies at model startup. We run an internal Python registry and mirror the public registry. In either case we declare dependencies in requirements.txt:

cowsay==6.1

The handle_request function can be async and return different Python types, including pydantic objects. Omni will convert the return value into a Workers AI response for the eyeball.

A Python package is injected, named omni, containing all the Python APIs to interact with the request, the Workers AI systems, building Responses, error handling, etc. Internally we publish it as regular Python package to be used in standalone, for unit testing for instance:

from omni import Context, Request
from model import handle_request


def test_basic():
    ctx = Context.inactive()
    req = Request(json={"text": "my dog is cooler than you!"})
    out = handle_request(req, ctx)
    assert out == """  __________________________
| my dog is cooler than you! |
  ==========================
                          \\
                           \\
                             ^__^
                             (oo)\\_______
                             (__)\\       )\\/\\
                                 ||----w |
                                 ||     ||"""

What’s next

Omni allows us to run models more efficiently by spawning them from a single control plane and implementing lightweight process isolation. This enables quick starting and stopping of models, isolated file systems for managing Python and system dependencies, and over-committing GPU memory to run more models on a single GPU. This improves the performance for our entire Workers AI stack, reduces the cost of running GPUs, and allows us to ship new models and features quickly and safely.

Right now, Omni is running in production on a handful of models in the Workers AI catalog, and we’re adding more every week. Check out Workers AI today to experience Omni’s performance benefits on your AI application.

Message Signatures are now part of our Verified Bots Program, simplifying bot authentication

Mari Galicer — Tue, 01 Jul 2025 10:00:00 GMT

As a site owner, how do you know which bots to allow on your site, and which you’d like to block? Existing identification methods rely on a combination of IP address range (which may be shared by other services, or change over time) and user-agent header (easily spoofable). These have limitations and deficiencies. In our last blog post, we proposed using HTTP Message Signatures: a way for developers of bots, agents, and crawlers to clearly identify themselves by cryptographically signing requests originating from their service.

Since we published the blog post on Message Signatures and the IETF draft for Web Bot Auth in May 2025, we’ve seen significant interest around implementing and deploying Message Signatures at scale. It’s clear that well-intentioned bot owners want a clear way to identify their bots to site owners, and site owners want a clear way to identify and manage bot traffic. Both parties seem to agree that deploying cryptography for the purposes of authentication is the right solution.

Today, we’re announcing that we’re integrating HTTP Message Signatures directly into our Verified Bots Program. This announcement has two main parts: (1) for bots, crawlers, and agents, we’re simplifying enrollment into the Verified Bots program for those who sign requests using Message Signatures, and (2) we’re encouraging all bot operators moving forward to use Message Signatures over existing verification mechanisms. Because Verified Bots are considered authenticated, they do not face challenges from our Bot Management to identify as bots, given they’re already identified as such.

For site owners, no additional action is required – Cloudflare will automatically validate signatures on our edge, and if that validation is a success, that traffic will be marked as verified so that site owners can use the verified bot fields to create Bot Management and WAF rules based on it.

This isn't just about simplifying things for bot operators — it’s about giving website owners unparalleled accuracy in identifying trusted bot traffic, cutting down on the overhead for cryptographic verification, and fundamentally transforming how we manage authentication across the Cloudflare network.

Become a Verified Bot with Message Signatures

Cloudflare’s existing Verified Bots program is for bots that are transparent about who they are and what they do, like indexing sites for search or scanning for security vulnerabilities. You can see a list of these verified bots in Cloudflare Radar:

^{A preview of the Verified Bots page on Cloudflare Radar.}

In the past, in order to apply to be a verified bot, we used to ask for IP address ranges or reverse DNS names so that we could verify your identity. This required some manual steps like checking that the IP address range is valid and is associated with the appropriate ASN.

With the integration of Message Signatures, we’re aiming to streamline applications into our Verified Bot program. Bots applying with well-formed Message Signatures will be prioritized, and approved more quickly!

Getting started

In order to make generating Message Signatures as easy as possible, Cloudflare is providing two open source libraries: a web-bot-auth library in rust, and a web-bot-auth npm package in TypeScript. If you’re working on a different implementation, let us know – we’d love to add it to our developer docs!

At a high level, signing your requests with web bot auth consists of the following steps:

Generate a valid signing key. See Signing Key section for step-by-step instructions.
Host a JSON web key set containing your public key under /.well-known/http-message-signature-directory of your website.
Sign responses for that URL using a Web Bot Auth library, one signature for each key contained in it, to prove you own it. See the Hosting section for step-by-step instructions.
Register that URL with us, using our Verified Bots form. This can be done directly in your Cloudflare account. See our documentation.
Sign requests using a Web Bot Auth library.

As an example, Cloudflare Radar's URL Scanner lets you scan any URL and get a publicly shareable report with security, performance, technology, and network information. Here’s an example of what a well-formed signature looks like for requests coming from URL Scanner:

GET /path/to/resource HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36
Signature-Agent: "https://web-bot-auth-directory.radar-cfdata-org.workers.dev"
Signature-Input: sig=("@authority" "signature-agent");\
             	 created=1700000000;\
             	 expires=1700011111;\
             	 keyid="poqkLGiymh_W0uP6PZFw-dvez3QJT5SolqXBCW38r0U";\
             	 tag="web-bot-auth"
Signature:sig=jdq0SqOwHdyHr9+r5jw3iYZH6aNGKijYp/EstF4RQTQdi5N5YYKrD+mCT1HA1nZDsi6nJKuHxUi/5Syp3rLWBA==:

Since we’ve already registered URLScanner as a Verified Bot, Cloudflare will now automatically verify that the signature in the Signature header matches the request — more on that later.

Register your bot

Access the Verified Bots submission form on your account. If that link does not immediately take you there, go to your Cloudflare account → Account Home → the three dots next to your account name → Configurations → Verified Bots.

If you do not have a Cloudflare account, you can sign up for a free one.

For the verification method, select "Request Signature", then enter the URL of your key directory in Validation Instructions. Specifying the User-Agent values is optional if you’re submitting a Request Signature bot.

Once your application has gone through our (now shortened) review process, you don’t need to take any further action.

Message Signature verification for origins

Starting today, Cloudflare is ramping up verification of cryptographic signatures provided by automated crawlers and bots. This is currently available for all Free and Pro plans, and as we continue to test and validate at scale, will be released to all Business and Enterprise plans. This means that as time passes, the number of unauthenticated web crawlers should diminish, ensuring most bot traffic is authenticated before it reaches your website’s servers, helping to prevent spoofing attacks.

At a high level, signature verification works like this:

A bot or agent sends a request to a website behind Cloudflare.
Cloudflare’s Message Signature verification service checks for the Signature, Signature-Input, and Signature-Agent headers.
It checks that the incoming request presents a keyid parameter in your Signature-Input that points to a key we already know.
It looks at the expires parameter in the incoming bot request. If the current time is after expiration, verification fails. This guards against replay attacks, preventing malicious agents from trying to pass as a bot by retrying messages they captured in the past.
It checks that you’ve specified a tag parameter indicating web-bot-auth, to indicate your intent that the message be handled using web bot authentication specifically
It looks at all the components chosen in your Signature-Input header, and constructs a signature base from it.
If all pre-flight checks pass, Cloudflare attempts to verify the signature base against the value in Signature field using an ed25519 verification algorithm and the key supplied in keyid.
Verified Bots and other systems at Cloudflare use a successful verification as proof of your identity, and apply rules corresponding to that identity.

If any of the above steps fail, Cloudflare falls back to existing bot identification and mitigation mechanisms. As the system matures, we would strengthen these requirements, and limit the possibilities of a soft downgrade.

As a site owner, you can segment your Verified Bot traffic by its type and purpose by adding the Verified Bot Categories field cf.verified_bot_category as a filter criterion in WAF Custom rules, Advanced Rate Limiting, and Late Transform rules. For instance, to allow the Bibliothèque nationale de France and the Library of Congress, and institutions dedicated to academic research, you can add a rule that allows bots in the Academic Research category.

Where we’re going next

HTTP Message Signatures is a primitive that is useful beyond Cloudflare – the IETF standardized it as part of RFC 9421.

As discussed in our previous blog post, Cloudflare believes that making Message Signatures a core component of bot authentication on the web should follow the same path. The specifications for the protocol are being built in the open, and they have already evolved following feedback.

Moreover, due to widespread interest, the IETF is considering forming a working group around Web Bot Auth. Should you be a crawler, an origin, or even a CDN, we invite you to provide feedback to ensure the solution gets stronger, and suits your needs.

A better, more trusted Internet

For bot, agent, and crawler operators that act transparently and provide vital services for the Internet, we’re providing a faster and more automated path to being recognized as a Verified Bot, reducing manual processes. We trust that this approach improves bot authentication from what were formerly brittle and unreliable authentication methods, to a secure and reliable alternative. It should reduce the overall volume of friction and hurdles genuinely useful bots face.

For site owners, Message Signatures provides better assurance that the bot traffic is legitimate — automatically recognized and allowed, minimizing disruption to essential services (e.g., search engine indexing, monitoring). In line with our commitments to making TLS/SSL and Post-Quantum certificates available for everyone, we’ll always offer the cryptographic verification of Message Signatures for all sites because we believe in a safer and more efficient Internet by fostering a trusted environment for both human and automated traffic.

If you have a feature request, feedback, or are interested in partnering with us, please reach out.

Orange Me2eets: We made an end-to-end encrypted video calling app and it was easy

Michael Rosenberg — Thu, 26 Jun 2025 14:00:00 GMT

Developing a new video conferencing application often begins with a peer-to-peer setup using WebRTC, facilitating direct data exchange between clients. While effective for small demonstrations, this method encounters scalability hurdles with increased participants. The data transmission load for each client escalates significantly in proportion to the number of users, as each client is required to send data to every other client except themselves (n-1).

In the scaling of video conferencing applications, Selective Forwarding Units (SFUs) are essential. Essentially a media stream routing hub, an SFU receives media and data flows from participants and intelligently determines which streams to forward. By strategically distributing media based on network conditions and participant needs, this mechanism minimizes bandwidth usage and greatly enhances scalability. Nearly every video conferencing application today uses SFUs.

In 2024, we announced Cloudflare Realtime (then called Cloudflare Calls), our suite of WebRTC products, and we also released Orange Meets, an open source video chat application built on top of our SFU.

We also realized that use of an SFU often comes with a privacy cost, as there is now a centralized hub that could see and listen to all the media contents, even though its sole job is to forward media bytes between clients as a data plane.

We believe end-to-end encryption should be the industry standard for secure communication and that’s why today we’re excited to share that we’ve implemented and open sourced end-to-end encryption in Orange Meets. Our generic implementation is client-only, so it can be used with any WebRTC infrastructure. Finally, our new designated committer distributed algorithm is verified in a bounded model checker to verify this algorithm handles edge cases gracefully.

End-to-end encryption for video conferencing is different than for text messaging

End-to-end encryption describes a secure communication channel whereby only the intended participants can read, see, or listen to the contents of the conversation, not anybody else. WhatsApp and iMessage, for example, are end-to-end-encrypted, which means that the companies that operate those apps or any other infrastructure can’t see the contents of your messages.

Whereas encrypted group chats are usually long-lived, highly asynchronous, and low bandwidth sessions, video and audio calls are short-lived, highly synchronous, and require high bandwidth. This difference comes with plenty of interesting tradeoffs, which influenced the design of our system.

We had to consider how factors like the ephemeral nature of calls, compared to the persistent nature of group text messages, also influenced the way we designed E2EE for Orange Meets. In chat messages, users must be able to decrypt messages sent to them while they were offline (e.g. while taking a flight). This is not a problem for real-time communication.

The bandwidth limitations around audio/video communication and the use of an SFU prevented us from using some of the E2EE technologies already available for text messages. Apple’s iMessage, for example, encrypts a message N-1 times for an N-user group chat. We can't encrypt the video for each recipient, as that could saturate the upload capacity of Internet connections as well as slow down the client. Media has to be encrypted once and decrypted by each client while preserving secrecy around only the current participants of the call.

Messaging Layer Security (MLS)

Around the same time we were working on Orange Meets, we saw a lot of excitement around new apps being built with Messaging Layer Security (MLS), an IETF-standardized protocol that describes how you can do a group key exchange in order to establish end-to-end-encryption for group communication.

Previously, the only way to achieve these properties was to essentially run your own fork of the Signal protocol, which itself is more of a living protocol than a solidified standard. Since MLS is standardized, we’ve now seen multiple high-quality implementations appear, and we’re able to use them to achieve Signal-level security with far less effort.

Implementing MLS here wasn’t easy: it required a moderate amount of client modification, and the development and verification of an encrypted room-joining protocol. Nonetheless, we’re excited to be pioneering a standards-based approach that any customer can run on our network, and to share more details about how our implementation works.

We did not have to make any changes to the SFU to get end-to-end encryption working. Cloudflare’s SFU doesn’t care about the contents of the data forwarded on our data plane and whether it’s encrypted or not.

Orange Meets: the basics

Orange Meets is a video calling application built on Cloudflare Workers that uses the Cloudflare Realtime SFU service as the data plane. The roles played by the three main entities in the application are as follows:

The user is a participant in the video call. They connect to the Orange Meets server and SFU, described below.
The Orange Meets Server is a simple service run on a Cloudflare Worker that runs the small-scale coordination logic of Orange Meets, which is concerned with which user is in which video call — called a room — and what the state of the room is. Whenever something in the room changes, like a participant joining or leaving, or someone muting themselves, the app server broadcasts the change to all room participants. You can use any backend server for this component, we just chose Cloudflare Workers for its convenience.
Cloudflare Realtime Selective Forwarding Unit (SFU) is a service that Cloudflare runs, which takes everyone’s audio and video and broadcasts it to everyone else. These connections are potentially lossy, using UDP for transmission. This is done because a dropped video frame from five seconds ago is not very important in the context of a video call, and so should not be re-sent, as it would be in a TCP connection.

^{The network topology of Orange Meets}

Next, we have to define what we mean by end-to-end encryption in the context of video chat.

End-to-end encrypting Orange Meets

The most immediate way to end-to-end encrypt Orange Meets is to simply have the initial users agree on a symmetric encryption/decryption key at the beginning of a call, and just encrypt every video frame using that key. This is sufficient to hide calls from Cloudflare’s SFU. Some source-encrypted video conferencing implementations, such as Jitsi Meet, work this way.

The issue, however, is that kicking a malicious user from a call does not invalidate their key, since the keys are negotiated just once. A joining user learns the key that was used to encrypt video from before they joined. These failures are more formally referred to as failures of post-compromise security and perfect forward secrecy. When a protocol successfully implements these in a group setting, we call the protocol a continuous group key agreement protocol.

Fortunately for us, MLS is a continuous group key agreement protocol that works out of the box, and the nice folks at Phoenix R&D and Cryspen have a well-documented open-source Rust implementation of most of the MLS protocol.

All we needed to do was write an MLS client and compile it to WASM, so we could decrypt video streams in-browser. We’re using WASM since that’s one way of running Rust code in the browser. If you’re running a video conferencing application on a desktop or mobile native environment, there are other MLS implementations in your preferred programming language.

Our setup for encryption is as follows:

Make a web worker for encryption. We wrote a web worker in Rust that accepts a WebRTC video stream, broken into individual frames, and encrypts each frame. This code is quite simple, as it’s just an MLS encryption:

group.create_message(
	&self.mls_provider,
	self.my_signing_keys.as_ref()?,
	frame,
)

Postprocess outgoing audio/video. We take our normal stream and, using some newer features of the WebRTC API, add a transform step to it. This transform step simply sends the stream to the worker:

const senderStreams = sender.createEncodedStreams()
const { readable, writable } = senderStreams
this.worker.postMessage(
	{
    	    type: 'encryptStream',
    	    in: readable,
    	    out: writable,
	},
	[readable, writable]
)

And the same for decryption:

const receiverStreams = receiver.createEncodedStreams()
const { readable, writable } = receiverStreams
this.worker.postMessage(
	{
    	    type: 'decryptStream',
    	    in: readable,
    	    out: writable,
	},
	[readable, writable]
)

Once we do this for both audio and video streams, we’re done.

Handling different codec behaviors

The streams are now encrypted before sending and decrypted before rendering, but the browser doesn’t know this. To the browser, the stream is still an ordinary video or audio stream. This can cause errors to occur in the browser’s depacketizing logic, which expects to see certain bytes in certain places, depending on the codec. This results in some extremely cypherpunk artifacts every dozen seconds or so:

Fortunately, this exact issue was discovered by engineers at Discord, who handily documented it in their DAVE E2EE videocalling protocol. For the VP8 codec, which we use by default, the solution is simple: split off the first 1–10 bytes of each packet, and send them unencrypted:

fn split_vp8_header(frame: &[u8]) -> Option<(&[u8], &[u8])> {
    // If this is a keyframe, keep 10 bytes unencrypted. Otherwise, 1 is enough
    let is_keyframe = frame[0] >> 7 == 0;
    let unencrypted_prefix_size = if is_keyframe { 10 } else { 1 };
    frame.split_at_checked(unencrypted_prefix_size)
}

These bytes are not particularly important to encrypt, since they only contain versioning info, whether or not this frame is a keyframe, some constants, and the width and height of the video.

And that’s truly it for the stream encryption part! The only thing remaining is to figure out how we will let new users join a room.

“Join my Orange Meet”

Usually, the only way to join the call is to click a link. And since the protocol is encrypted, a joining user needs to have some cryptographic information in order to decrypt any messages. How do they receive this information, though? There are a few options.

DAVE does it by using an MLS feature called external proposals. In short, the Discord server registers itself as an external sender, i.e., a party that can send administrative messages to the group, but cannot receive any. When a user wants to join a room, they provide their own cryptographic material, called a key package, and the server constructs and sends an MLS External Add message to the group to let them know about the new user joining. Eventually, a group member will commit this External Add, sending the joiner a Welcome message containing all information necessary to send and receive video.

^{A user joining a group via MLS external proposals. Recall the Orange Meets app server functions as a broadcast channel for the whole group. We consider a group of 3 members. We write member #2 as the one committing to the proposal, but this can be done by any member. Member #2 also sends a Commit message to the other members, but we omit this for space.}

This is a perfectly viable way to implement room joining, but implementing it would require us to extend the Orange Meets server logic to have some concept of MLS. Since part of our goal is to keep things as simple as possible, we would like to do all our cryptography client-side.

So instead we do what we call the designated committer algorithm. When a user joins a group, they send their cryptographic material to one group member, the designated committer, who then constructs and sends the Add message to the rest of the group. Similarly, when notified of a user’s exit, the designated committer constructs and sends a Remove message to the rest of the group. With this setup, the server’s job remains nothing more than broadcasting messages! It’s quite simple too—the full implementation of the designated committer state machine comes out to 300 lines of Rust, including the MLS boilerplate, and it’s about as efficient.

^{A user joining a group via the designated committer algorithm.}

One cool property of the designated committer algorithm is that something like this isn’t possible in a text group chat setting, since any given user (in particular, the designated committer) may be offline for an arbitrary period of time. Our method works because it leverages the fact that video calls are an inherently synchronous medium.

Verifying the Designated Committer Algorithm with TLA⁺

The designated committer algorithm is a pretty neat simplification, but it comes with some non-trivial edge cases that we need to make sure we handle, such as:

How do we make sure there is only one designated committer at a time? The designated committer is the alive user with the smallest index in the MLS group state, which all users share.
What happens if the designated committer exits? Then the next user will take its place. Every user keeps track of pending Adds and Removes, so it can continue where the previous designated committer left off.
If a user has not caught up to all messages, could they think they’re the designated committer? No, they have to believe first that all prior eligible designated committers are disconnected.

To make extra sure that this algorithm was correct, we formally modeled it and put it through the TLA⁺ model checker. To our surprise, it caught some low-level bugs! In particular, it found that, if the designated committer dies while adding a user, the protocol does not recover. We fixed these by breaking up MLS operations and enforcing a strict ordering on messages locally (e.g., a Welcome is always sent before its corresponding Add).

You can find an explainer, lessons learned, and the full PlusCal program (a high-level language that compiles to TLA⁺) here. The caveat, as with any use of a bounded model checker, is that the checking is, well, bounded. We verified that no invalid protocol states are possible in a group of up to five users. We think this is good evidence that the protocol is correct for an arbitrary number of users. Because there are only two distinct roles in the protocol (designated committer and other group member), any weird behavior ought to be reproducible with two or three users, max.

Preventing Monster-in-the-Middle attacks

One important concern to address in any end-to-end encryption setup is how to prevent the service provider from replacing users’ key packages with their own. If the Orange Meets app server did this, and colluded with a malicious SFU to decrypt and re-encrypt video frames on the fly, then the SFU could see all the video sent through the network, and nobody would know.

To resolve this, like DAVE, we include a safety number in the corner of the screen for all calls. This number uniquely represents the cryptographic state of the group. If you check out-of-band (e.g., in a Signal group chat) that everyone agrees on the safety number, then you can be sure nobody’s key material has been secretly replaced.

In fact, you could also read the safety number aloud in the video call itself, but doing this is not provably secure. Reading a safety number aloud is an in-band verification mechanism, i.e., one where a party authenticates a channel within that channel. If a malicious app server colluding with a malicious SFU were able to construct believable video and audio of the user reading the safety number aloud, it could bypass this safety mechanism. So if your threat model includes adversaries that are able to break into a Worker and Cloudflare’s SFU, and simultaneously generate real-time deep-fakes, you should use out-of-band verification 😄.

Future work

There are some areas we could improve on:

There is another attack vector for a malicious app server: it is possible to simply serve users malicious JavaScript. This problem, more generally called the JavaScript Cryptography Problem, affects any in-browser application where the client wants to hide data from the server. Fortunately, we are working on a standard to address this, called Web Application Manifest Consistency, Integrity, and Transparency. In short, like our Code Verify solution for WhatsApp, this would allow every website to commit to the JavaScript it serves, and have a third party create an auditable log of the code. With transparency, malicious JavaScript can still be distributed, but at least now there is a log that records the code.
We can make out-of-band authentication easier by placing trust in an identity provider. Using OpenPubkey, it would be possible for a user to get the identity provider to sign their cryptographic material, and then present that. Then all the users would check the signature before using the material. Transparency would also help here to ensure no signatures were made in secret.

Conclusion

We built end-to-end encryption into the Orange Meets video chat app without a lot of engineering time, and by modifying just the client code. To do so, we built a WASM (compiled from Rust) service worker that sets up an MLS group and does stream encryption and decryption, and designed a new joining protocol for groups, called the designated committer algorithm, and formally modeled it in TLA⁺. We made comments for all kinds of optimizations that are left to do, so please send us a PR if you’re so inclined!

Try using Orange Meets with E2EE enabled at e2ee.orange.cloudflare.dev, or deploy your own instance using the open source repository on Github.

Forget IPs: using cryptography to verify bot and agent traffic

Thibault Meunier — Thu, 15 May 2025 13:00:00 GMT

With the rise of traffic from AI agents, what’s considered a bot is no longer clear-cut. There are some clearly malicious bots, like ones that DoS your site or do credential stuffing, and ones that most site owners do want to interact with their site, like the bot that indexes your site for a search engine, or ones that fetch RSS feeds.

Historically, Cloudflare has relied on two main signals to verify legitimate web crawlers from other types of automated traffic: user agent headers and IP addresses. The User-Agent header allows bot developers to identify themselves, i.e. MyBotCrawler/1.1. However, user agent headers alone are easily spoofed and are therefore insufficient for reliable identification. To address this, user agent checks are often supplemented with IP address validation, the inspection of published IP address ranges to confirm a crawler's authenticity. However, the logic around IP address ranges representing a product or group of users is brittle – connections from the crawling service might be shared by multiple users, such as in the case of privacy proxies and VPNs, and these ranges, often maintained by cloud providers, change over time.

Cloudflare will always try to block malicious bots, but we think our role here is to also provide an affirmative mechanism to authenticate desirable bot traffic. By using well-established cryptography techniques, we’re proposing a better mechanism for legitimate agents and bots to declare who they are, and provide a clearer signal for site owners to decide what traffic to permit.

Today, we’re introducing two proposals – HTTP message signatures and request mTLS – for friendly bots to authenticate themselves, and for customer origins to identify them. In this blog post, we’ll share how these authentication mechanisms work, how we implemented them, and how you can participate in our closed beta.

Existing bot verification mechanisms are broken

Historically, if you’ve worked on ChatGPT, Claude, Gemini, or any other agent, you’ve had several options to identify your HTTP traffic to other services:

You define a user agent, an HTTP header described in RFC 9110. The problem here is that this header is easily spoofable and there’s not a clear way for agents to identify themselves as semi-automated browsers — agents often use the Chrome user agent for this very reason, which is discouraged. The RFC states: “If a user agent masquerades as a different user agent, recipients can assume that the user intentionally desires to see responses tailored for that identified user agent, even if they might not work as well for the actual user agent being used.”
You publish your IP address range(s). This has limitations because the same IP address might be shared by multiple users or multiple services within the same company, or even by multiple companies when hosting infrastructure is shared (like Cloudflare Workers, for example). In addition, IP addresses are prone to change as underlying infrastructure changes, leading services to use ad-hoc sharing mechanisms like CIDR lists.
You go to every website and share a secret, like a Bearer token. This is impractical at scale because it requires developers to maintain separate tokens for each website their bot will visit.

We can do better! Instead of these arduous methods, we’re proposing that developers of bots and agents cryptographically sign requests originating from their service. When protecting origins, reverse proxies such as Cloudflare can then validate those signatures to confidently identify the request source on behalf of site owners, allowing them to take action as they see fit.

A typical system has three actors:

User: the entity that wants to perform some actions on the web. This may be a human, an automated program, or anything taking action to retrieve information from the web.
Agent: an orchestrated browser or software program. For example, Chrome on your computer, or OpenAI’s Operator with ChatGPT. Agents can interact with the web according to web standards (HTML rendering, JavaScript, subrequests, etc.).
Origin: the website hosting a resource. The user wants to access it through the browser. This is Cloudflare when your website is using our services, and it’s your own server(s) when exposed directly to the Internet.

In the next section, we’ll dive into HTTP Message Signatures and request mTLS, two mechanisms a browser agent may implement to sign outgoing requests, with different levels of ease for an origin to adopt.

Introducing HTTP Message Signatures

HTTP Message Signatures is a standard that defines the cryptographic authentication of a request sender. It’s essentially a cryptographically sound way to say, “hey, it’s me!”. It’s not the only way that developers can sign requests from their infrastructure — for example, AWS has used Signature v4, and Stripe has a framework for authenticating webhooks — but Message Signatures is a published standard, and the cleanest, most developer-friendly way to sign requests.

We’re working closely with the wider industry to support these standards-based approaches. For example, OpenAI has started to sign their requests. In their own words:

"Ensuring the authenticity of Operator traffic is paramount. With HTTP Message Signatures (RFC 9421), OpenAI signs all Operator requests so site owners can verify they genuinely originate from Operator and haven’t been tampered with” – Eugenio, Engineer, OpenAI

Without further delay, let’s dive in how HTTP Messages Signatures work to identify bot traffic.

Scoping standards to bot authentication

Generating a message signature works like this: before sending a request, the agent signs the target origin with a public key. When fetching https://example.com/path/to/resource, it signs example.com. This public key is known to the origin, either because the agent is well known, because it has previously registered, or any other method. Then, the agent writes a Signature-Input header with the following parameters:

A validity window (created and expires timestamps)
A Key ID that uniquely identifies the key used in the signature. This is a JSON Web Key Thumbprint.
A tag that shows websites the signature’s purpose and validation method, i.e. web-bot-auth for bot authentication.

In addition, the Signature-Agent header indicates where the origin can find the public keys the agent used when signing the request, such as in a directory hosted by signer.example.com. This header is part of the signed content as well.

Here’s an example:

GET /path/to/resource HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 Chrome/113.0.0 MyBotCrawler/1.1
Signature-Agent: signer.example.com
Signature-Input: sig=("@authority" "signature-agent");\
             	 created=1700000000;\
             	 expires=1700011111;\
             	 keyid="ba3e64==";\
             	 tag="web-bot-auth"
Signature: sig=abc==

For those building bots, we propose signing the authority of the target URI, i.e. www.example.com, and a way to retrieve the bot public key in the form of signature-agent, if present, i.e. crawler.search.google.com for Google Search, operator.openai.com for OpenAI Operator, workers.dev for Cloudflare Workers.

The User-Agent from the example above indicates that the software making the request is Chrome, because it is an agent that uses an orchestrated Chrome to browse the web. You should note that MyBotCrawler/1.1 is still present. The User-Agent header can actually contain multiple products, in decreasing order of importance. If our agent is making requests via Chrome, that’s the most important product and therefore comes first.

At Internet-level scale, these signatures may add a notable amount of overhead to request processing. However, with the right cryptographic suite, and compared to the cost of existing bot mitigation, both technical and social, this seems to be a straightforward tradeoff. This is a metric we will monitor closely, and report on as adoption grows.

Generating request signatures

We’re making several examples for generating Message Signatures for bots and agents available on Github (though we encourage other implementations!), all of which are standards-compliant, to maximize interoperability.

Imagine you’re building an agent using a managed Chromium browser, and want to sign all outgoing requests. To achieve this, the webextensions standard provides chrome.webRequest.onBeforeSendHeaders, where you can modify HTTP headers before they are sent by the browser. The event is triggered before sending any HTTP data, and when headers are available.

Here’s what that code would look like:

chrome.webRequest.onBeforeSendHeaders.addListener(
  function (details) {
	// Signature and header assignment logic goes here
      // 
  },
  { urls: [""] },
  ["blocking", "requestHeaders"] // requires "installation_mode": "force_installed"
);


            Cloudflare provides a web-bot-auth helper package on npm that helps you generate request signatures with the correct parameters. onBeforeSendHeaders is a Chrome extension hook that needs to be implemented synchronously. To do so, we import {signatureHeadersSync} from “web-bot-auth”. Once the signature completes, both Signature and Signature-Input headers are assigned. The request flow can then continue.
            const request = new URL(details.url);
const created = new Date();
const expired = new Date(created.getTime() + 300_000)


// Perform request signature
const headers = signatureHeadersSync(
  request,
  new Ed25519Signer(jwk),
  { created, expires }
);
// `headers` object now contains `Signature` and `Signature-Input` headers that can be used
            This extension code is available on GitHub, alongside a  debugging server, deployed at https://http-message-signatures-example.research.cloudflare.com. 
    
      Validating request signatures 
      
        
      
    
    Using our debug server, we can now inspect and validate our request signatures from the perspective of the website we’d be visiting. We should now see the Signature and Signature-Input headers:  
          
          
          
^{In this example, the homepage of the debugging server validates the signature from the RFC 9421 Ed25519 verifying key, which the extension uses for signing.}
The above demo and code walkthrough has been fully written in TypeScript: the verification website is on Cloudflare Workers, and the client is a Chrome browser extension. We are cognisant that this does not suit all clients and servers on the web. To demonstrate the proposal works in more environments, we have also implemented bot signature validation in Go with a plugin for Caddy server.
    
      Experimentation with request mTLS
      
        
      
    
    HTTP is not the only way to convey signatures. For instance, one mechanism that has been used in the past to authenticate automated traffic against secured endpoints is mTLS, the “mutual” presentation of TLS certificates. As described in our knowledge base:
Mutual TLS, or mTLS for short, is a method for mutual authentication. mTLS ensures that the parties at each end of a network connection are who they claim to be by verifying that they both have the correct private key. The information within their respective TLS certificates provides additional verification.
While mTLS seems like a good fit for bot authentication on the web, it has limitations. If a user is asked for authentication via the mTLS protocol but does not have a certificate to provide, they would get an inscrutable and unskippable error. Origin sites need a way to conditionally signal to clients that they accept or require mTLS authentication, so that only mTLS-enabled clients use it.
    
      A TLS flag for bot authentication
      
        
      
    
    TLS flags are an efficient way to describe whether a feature, like mTLS, is supported by origin sites. Within the IETF, we have proposed a new TLS flag called req mTLS to be sent by the client during the establishment of a connection that signals support for authentication via a client certificate. 
This proposal leverages the tls-flags proposal under discussion in the IETF. The TLS Flags draft allows clients and servers to send an array of one bit flags to each other, rather than creating a new extension (with its associated overhead) for each piece of information they want to share. This is one of the first uses of this extension, and we hope that by using it here we can help drive adoption.
When a client sends the req mTLS flag to the server, they signal to the server that they are able to respond with a certificate if requested. The server can then safely request a certificate without risk of blocking ordinary user traffic, because ordinary users will never set this flag. 
Let’s take a look at what an example of such a req mTLS would look like in Wireshark, a network protocol analyser. You can follow along in the packet capture here.
            Extension: req mTLS (len=12)
	Type: req mTLS (65025)
	Length: 12
	Data: 0b0000000000000000000001
            The extension number is 65025, or 0xfe01. This corresponds to an unassigned block of TLS extensions that can be used to experiment with TLS Flags. Once the standard is adopted and published by the IETF, the number would be fixed. To use the req mTLS flag the client needs to set the 80^th bit to true, so with our block length of 12 bytes, it should  contain the data 0b0000000000000000000001, which is the case here. The server then responds with a certificate request, and the request follows its course.
    
      Request mTLS in action
      
        
      
    
    Code for this section is available in GitHub under cloudflareresearch/req-mtls
Because mutual TLS is widely supported in TLS libraries already, the parts we need to introduce to the client and server are:
Sending/parsing of TLS-flags
Specific support for the req mTLS flag
To the best of our knowledge, there is no complete public implementation of either scheme. Using it for bot authentication may provide a motivation to do so.
Using our experimental fork of Go, a TLS client could support req mTLS as follows:
            config := &tls.Config{
    	TLSFlagsSupported:  []tls.TLSFlag{0x50},
    	RootCAs:       	rootPool,
    	Certificates:  	certs,
    	NextProtos:    	[]string{"h2"},
}
trans := http.Transport{TLSClientConfig: config, ForceAttemptHTTP2: true}
            This example library allows you to configure Go to send req mTLS 0x50 bytes in the TLS Flags extension. If you’d like to test your implementation out, you can prompt your client for certificates against req-mtls.research.cloudflare.com using the Cloudflare Research client cloudflareresearch/req-mtls. For clients, once they set the TLS Flags associated with req mTLS, they are done. The code section taking care of normal mTLS will take over at that point, with no need to implement something new.
    
      Two approaches, one goal
      
        
      
    
    We believe that developers of agents and bots should have a public, standard way to authenticate themselves to CDNs and website hosting platforms, regardless of the technology they use or provider they choose. At a high level, both HTTP Message Signatures and request mTLS achieve a similar goal: they allow the owner of a service to authentically identify themselves to a website. That’s why we’re participating in the standardizing effort for both of these protocols at the IETF, where many other authentication mechanisms we’ve discussed here — from TLS to OAuth Bearer tokens –— been developed by diverse sets of stakeholders and standardized as RFCs.   
Evaluating both proposals against each other, we’re prioritizing HTTP Message Signatures for Bots because it relies on the previously adopted RFC 9421 with several reference implementations, and works at the HTTP layer, making adoption simpler. request mTLS may be a better fit for site owners with concerns about the additional bandwidth, but TLS Flags has fewer implementations, is still waiting for IETF adoption, and upgrading the TLS stack has proven to be more challenging than with HTTP. Both approaches share similar discovery and key management concerns, as highlighted in a glossary draft at the IETF. We’re actively exploring both options, and would love to hear from both site owners and bot developers about how you’re evaluating their respective tradeoffs.
    
      The bigger picture 
      
        
      
    
    In conclusion, we think request signatures and mTLS are promising mechanisms for bot owners and developers of AI agents to authenticate themselves in a tamper-proof manner, forging a path forward that doesn’t rely on ever-changing IP address ranges or spoofable headers such as User-Agent. This authentication can be consumed by Cloudflare when acting as a reverse proxy, or directly by site owners on their own infrastructure. This means that as a bot owner, you can now go to content creators and discuss crawling agreements, with as much granularity as the number of bots you have. You can start implementing these solutions today and test them against the research websites we’ve provided in this post.
Bot authentication also empowers site owners small and large to have more control over the traffic they allow, empowering them to continue to serve content on the public Internet while monitoring automated requests. Longer term, we will integrate these authentication mechanisms into our AI Audit and Bot Management products, to provide better visibility into the bots and agents that are willing to identify themselves.
Being able to solve problems for both origins and clients is key to helping build a better Internet, and we think identification of automated traffic is a step towards that. If you want us to start verifying your message signatures or client certificates, have a compelling use case you’d like us to consider, or any questions, please reach out.



What’s new in Cloudflare: MASQUE now powers 1.1.1.1 & WARP apps, DEX now generally available with Remote Captures
Mari Galicer — Fri, 27 Dec 2024 14:00:00 GMT
 At Cloudflare, we are constantly innovating and launching new features and capabilities across our product portfolio. Today’s roundup blog post shares two exciting updates across our platform: our cross-platform 1.1.1.1 & WARP applications (consumer) and device agents (Zero Trust)  now use MASQUE, a cutting-edge HTTP/3-based protocol, to secure your Internet connection. Additionally, DEX is now available for general availability. 
    
      Faster and more stable: our 1.1.1.1 & WARP apps now use MASQUE by default
      
        
      
    
    
          
          
          
We’re excited to announce that as of today, our cross-platform 1.1.1.1 & WARP apps now use MASQUE, a cutting-edge HTTP/3-based protocol, to secure your Internet connection.
As a reminder, our 1.1.1.1 & WARP apps have two main functions: send all DNS queries through 1.1.1.1, our privacy-preserving DNS resolver, and protect your device’s network traffic via WARP by creating a private and encrypted tunnel to the resources you’re accessing, preventing unwanted third parties or public Wi-Fi networks from snooping on your traffic.
There are many ways to encrypt and proxy Internet traffic — you may have heard of a few, such as IPSec, WireGuard, or OpenVPN. There are many tradeoffs we considered when choosing a protocol, but we believe MASQUE is the future of fast, secure, and stable Internet proxying, it aligns with our belief in building on top of open Internet standards, and we’ve deployed it successfully at scale for customers like iCloud Private Relay and Microsoft Edge Secure Network.
    
      Why MASQUE?
      
        
      
    
    MASQUE is a modern framework for proxying traffic that allows a variety of application protocols, including HTTP/3, to utilize QUIC as their transport mechanism. That’s a lot of acronyms, so let's make sure those are clear. 
          
          
          
QUIC is a general-purpose transport protocol and Internet standard that operates on top of UDP (instead of TCP), is encrypted by default, and solves several performance issues that plagued its predecessors. HTTP/3 is the latest version of the HTTP protocol, defining the application-layer protocol that runs on top of QUIC as its transport mechanism. MASQUE is a set of mechanisms for tunneling traffic over HTTP. It extends the existing HTTP CONNECT model, to allow tunneling UDP and IP traffic. This is especially efficient when combined with the QUIC’s unreliable datagram extension. 
For example, we can use MASQUE’s CONNECT-IP method to establish a tunnel that can send multiple concurrent requests over a single QUIC connection:
            HEADERS
:method = CONNECT
:protocol = connect-ip
:scheme = https
:path = /.well-known/masque/ip/*/*/
:authority = example.org
capsule-protocol = ?1
            The benefit these protocols have for the quality and security of everyone’s Internet browsing experience is real. Earlier transport protocols were built before the advent of smartphones and mobile networks, so QUIC was designed to support a mobile world, maintaining connections even in poorly connected networks, and minimizing disruptions as people switch rapidly between networks as they move through their day. Leveraging HTTP/3 as the application layer means that MASQUE is more like “normal” HTTP traffic on the Internet, meaning that it is easier to support, is compatible with existing firewall and security rules, and that it supports cryptographic agility (i.e. support for post-quantum crypto), making this traffic more secure and resilient in the long term.
    
      Get started now 
      
        
      
    
    All new installations of our 1.1.1.1 & WARP apps support MASQUE, including iOS, Android, macOS, Windows, and Linux, and we’ve started to roll it out as the preferred protocol over WireGuard. On mobile, to check if your connection is already secured over MASQUE, or change your device’s default option, you can toggle this setting via Advanced > Connection options > Tunnel protocol:
          
          
          
^{Protocol connection options shown here on the iOS app}
We offer the following options: 
Auto: this allows the app to choose the protocol.
MASQUE: always use MASQUE to secure your connection.
WireGuard: always use WireGuard to secure your connection.
On desktop versions, you can switch the protocol by using the WARP command-line interface. For example:
            warp-cli tunnel protocol set WireGuard
warp-cli tunnel protocol set MASQUE
            With this rollout, we're excited to see MASQUE deliver increased performance and stability to millions of users. Download one of the WARP apps today!
    
      DEX now Generally Available: Announcing detailed device visibility with DEX Remote Captures
      
        
      
    
    
          
          
          
Following the successful beta launch of Digital Experience Monitoring (DEX), we are thrilled to announce the general availability of DEX, along with new Remote Captures functionality.
In today's hyper distributed environment, user experience is paramount. Recurring performance problems can lead to decreased user satisfaction, lost productivity, and damaged brand reputation.  Digital Experience Monitoring (DEX) offers a comprehensive solution to these challenges. Previous blog posts have discussed the solution and its capabilities. (Introducing Digital Experience Monitoring, Understanding end user-connectivity and performance with Digital Experience Monitoring, now available in beta, What's new in Cloudflare One: Digital Experience monitoring notifications)
    
      Introducing Remote Captures: PCAP and WARP Diag
      
        
      
    
    Imagine this: an end user is frustrated with a slow application, and your IT team is struggling to pinpoint the root cause. Traditionally, troubleshooting such issues involved contacting the end user and asking them to manually collect and share network traffic data. This process is time-consuming, prone to errors, and often disruptive to the end user's workflow.
Building upon the capabilities of DEX, we are excited to introduce Remote Captures, a powerful new feature that empowers IT admins to gain unprecedented visibility into end-user devices and network performance. DEX now introduces Remote Captures, a powerful new feature that empowers IT admins to remotely initiate network packet captures (PCAP) and WARP Diag logs directly from your end users’ devices and capture diagnostic information automatically from our device client. This streamlined approach accelerates troubleshooting, reduces the burden on end users, and provides valuable insights into network performance and security.
    
      Why Remote Captures?
      
        
      
    
    Remote Captures offer several key advantages. By analyzing detailed network traffic, IT teams can quickly pinpoint the root cause of network issues. Furthermore, granular network data empowers security teams to proactively detect and investigate potential threats. Finally, by identifying bottlenecks and latency issues, Remote Captures enable organizations to optimize network performance for a smoother user experience.
    
      How Remote Captures work
      
        
      
    
    Initiating a Remote Capture is straightforward. First, select the specific device you wish to troubleshoot. Then, with a few simple clicks, start capturing network traffic and/or WARP Diag data. Once the capture is complete, download the captured data and utilize your preferred tools for in-depth analysis.
          
          
          
    
      Get started today
      
        
      
    
    DEX Remote Captures are now available for Cloudflare One customers. They can be configured by going to Cloudflare Dashboard >  Zero Trust > DEX > Remote Captures, and then selecting the device you wish to collect from. For more information, refer to Remote captures. This new capability highlights just one of the many ways our unified SASE platform helps organizations find and fix security issues across SaaS applications. Try it out now using our free tier to get started.
    
      Never miss an update 
      
        
      
    
    We hope you enjoy reading our roundup blog posts as we continue to build and innovate. Stay tuned to the Cloudflare Blog for the latest news and updates. 


Cloudflare helps verify the security of end-to-end encrypted messages by auditing key transparency for WhatsApp
Thibault Meunier — Tue, 24 Sep 2024 13:00:00 GMT
 Chances are good that today you’ve sent a message through an end-to-end encrypted (E2EE) messaging app such as WhatsApp, Signal, or iMessage. While we often take the privacy of these conversations for granted, they in fact rely on decades of research, testing, and standardization efforts, the foundation of which is a public-private key exchange. There is, however, an oft-overlooked implicit trust inherent in this model: that the messaging app infrastructure is distributing the public keys of all of its users correctly.
Here’s an example: if Joe and Alice are messaging each other on WhatsApp, Joe uses Alice’s phone number to retrieve Alice’s public key from the WhatsApp database, and Alice receives Joe’s public key. Their messages are then encrypted using this key exchange, so that no one — even WhatsApp — can see the contents of their messages besides Alice and Joe themselves. However, in the unlikely situation where an attacker, Bob, manages to register a different public key in WhatsApp’s database, Joe would try to message Alice but unknowingly be messaging Bob instead. And while this threat is most salient for journalists, activists, and those most vulnerable to cyber attacks, we believe that protecting the privacy and integrity of end-to-end encrypted conversations is for everyone.
There are several methods that end-to-end encrypted messaging apps have deployed thus far to protect the integrity of public key distribution, the most common of which is to do an in-person verification of the QR code fingerprint of your public key (WhatsApp and Signal both have a version of this). As you can imagine, this experience is inconvenient and unwieldy, especially as your number of contacts and group chats increase.
Over the past few years, there have been significant developments in this area of cryptography, and WhatsApp has paved the way with their Key Transparency announcement. But as an independent third party, Cloudflare can provide stronger reassurance: that’s why we’re excited to announce that we’re now verifying WhatsApp’s Key Transparency audit proofs. 
    
      Auditing: the next frontier of encryption 
      
        
      
    
    We didn’t build this in a vacuum: similar to how the web and messaging apps became encrypted over time, we see auditing public key infrastructure as the next logical step in securing Internet infrastructure. This solution builds upon learnings from Certificate Transparency and Binary Transparency, which share some of the underlying data structure and cryptographic techniques, and we’re excited about the formation of a working group at the IETF to make multi-party operation of Key Transparency-like systems tractable for a broader set of use cases. 
We see our role here as a pioneer of a real world deployment of this auditing infrastructure, working through and sharing the operational challenges of operating a system that is critical for a messaging app used by billions of people around the world.   
We’ve also done this before — in 2022, Cloudflare announced Code Verify, a partnership in which we verify that the code delivered in the browser for WhatsApp Web has not been tampered with. When users run WhatsApp in their browser, the WhatsApp Code Verify extension compares a hash of the code that is executing in the browser with the hash that Cloudflare has of the codebase, enabling WhatsApp web users to easily see whether the code that is executing is the code that was publicly committed to. 
            
            
            
            
            
^{In Code Verify, Cloudflare builds a non-mutable chain associating the WhatsApp version with the hash of its code.}
Cloudflare’s role in Key Transparency is similar in that we are checking that a tree-based directory of public keys (more on this later) has been constructed correctly, and has been done so consistently over time.
    
      How Key Transparency works
      
        
      
    
    The architectural foundation of Key Transparency is the Auditable Key Directory (AKD): a tree-shaped data structure, constructed and maintained by WhatsApp, in which the nodes contain hashed contact details of each user. We’ll explain the basics here but if you’re interested in learning more, check out the SEEMless and Parakeet papers.
The AKD tree is constructed by building a binary tree, each parent node of which is a hash of each of its left and right child nodes:
            
            
            
            
            
^{Each child node on the tree contains contact and public key details for a user (shown here for illustrative purposes). In reality, Cloudflare only sees a hash of each node rather than Alice and Bob’s contact info in plaintext.}
An epoch describes a specific version of the tree at a given moment in time, identified by its root node. Using a structure similar to Code Verify, the WhatsApp Log stores each root node hash as part of an append-only time structure of updates.
            
            
            
            
            
What kind of changes are valid to be included in a given epoch? When a new person, Brian, joins WhatsApp, WhatsApp inserts a new “B” node in the AKD tree, and a new epoch. If Alice loses her phone and rotates her key, her “version” is updated to v1 in the next update.  
    
      How we built the Auditor on Cloudflare Workers 
      
        
      
    
    The role of the Auditor is to provide two main guarantees: that epochs are globally unique, and that they are valid. They are, however, quite different: global uniqueness requires consistency on whether an epoch and its associated root hash has been seen, while validity is a matter of computation — is the transition from the previous epoch to the current one a correct tree transformation?
    
      Timestamping service
      
        
      
    
    
            
            
            
            
            
^{Timestamping service architecture (Cloudflare Workers in Rust, using a Durable Object for storage)}
At regular intervals, the WhatsApp Log puts all new updates into the tree, and cuts a new epoch in the format “{counter}/{previous}/{current}”. The counter is a number, whereby “previous” is a hexadecimal encoded hash of the previous tree root, and “current” is a hexadecimal encoded hash for the new tree root. As a shorthand, epochs can be referred to by their counter only.
Here’s an example:
1001/d0bbf29c48716f26a951ae2a244eb1d070ee38865c29c8ad8174e8904e3cdc1a/e1006114485e8f0bbe2464e0ebac77af37bce76851745592e8dd5991ff2cd411
Once an epoch is constructed, the WhatsApp Log sends it to the Auditor for cross-signing, to ensure it has only been seen once. The Auditor adds a timestamp as to when this new epoch has been seen. Cloudflare’s Auditor uses a Durable Object for every epoch to create their timestamp. This guarantees the global uniqueness of an epoch, and the possibility of replay in the event the WhatsApp Log experiences an outage or is distributed across multiple locations. WhatsApp’s Log is expected to produce new epochs at regular intervals, given this constrains the propagation of public key updates seen by their users. Therefore, Cloudflare Auditor does not have to keep the durable object state forever. Once replay and consistency have been accounted for, this state is cleared. This is done after a month, thanks to durable object alarms.
Additional checks are performed by the service, such as checking that the epochs are consecutive, or that their digest is unique. This enforces a chain of epochs and their associated digests, provided by the WhatsApp Log and signed by the Auditor, providing a consistent view for all to see.
We decided to write this service in Rust because Workers rely on cloudflare/workers-rs bindings, and the auditable key directory library is also in Rust (facebook/akd).
    
      Tree validation service
      
        
      
    
    With the timestamping service above, WhatsApp users (as well as their Log) have assurance that epochs are transparent. WhatsApp’s directory can be audited at any point in time, and if it were to be tampered with by WhatsApp or an intermediary, the WhatsApp Log can be held accountable for it.
Epochs and their digests are only representations of their underlying key directory. To fully audit the directory, the transition from the previous digest to a current digest has to be validated. To perform validation, we need to run the epoch validation method. Specifically, we want to run verify_consecutive_append_only on every epoch constructed by the Log. The size of an epoch varies with the number of updates it contains, and therefore the number of associated nodes in the tree to construct as well. While Workers are able to run such validation for a small number of updates, this is a compute-intensive task. Therefore, still leveraging the same Rust codebase, the Auditor leverages a container that only performs the tree construction and validation. The Auditor retrieves the updates for a given epoch, copies them into its own R2 bucket, and delegates the validation to a container running on Cloudflare. Once validated, the epoch is marked as verified.
            
            
            
            
            
^{Architecture for Cloudflare’s Plexi Auditor. The proof verification and signatures stored do not contain personally identifiable information such as your phone number, public key, or other metadata tied to your WhatsApp account.}
This decouples global uniqueness requirements and epoch validation, which happens at two distinct times. It allows the validation to take more time, and not be latency sensitive.
    
      How can I verify Cloudflare has signed an epoch?
      
        
      
    
    Anyone can perform audit proof verification — the proofs are publicly available — but Cloudflare will be doing so automatically and publicly to make the results accessible to all. Verify that Cloudflare’s signature matches WhatsApp’s by visiting our Key Transparency website, or via our command line tool.
To use our command line tool, you’ll need to download the plexi client. It helps construct data structures which are used for signatures, and requires you to have git and cargo installed.
            cargo install plexi
            With the client installed, let’s now check the audit proofs for WhatsApp namespace: whatsapp.key-transparency.v1. Plexi Auditor is represented by one public key, which can verify and vouch for multiple Logs with their own dedicated “namespace.” To validate an epoch, such as epoch 458298 (the epoch at which the log decided to start sharing data), you can run the following command:
            plexi audit --remote-url 'https://akd-auditor.cloudflare.com' --namespace 'whatsapp.key-transparency.v1' --long
Namespace
  Name              	: whatsapp.key-transparency.v1
  Ciphersuite       	: ed25519(protobuf)

Signature (2024-09-23T16:53:45Z)
  Epoch height      	: 489193
  Epoch digest      	: cbe5097ae832a3ae51ad866104ffd4aa1f7479e873fd18df9cb96a02fc91ebfe
  Signature         	: fe94973e19da826487b637c019d3ce52f0c08093ada00b4fe6563e2f8117b4345121342bc33aae249be47979dfe704478e2c18aed86e674df9f934b718949c08
  Signature verification: success
  Proof verification	: success
            
    
      Interested in having Cloudflare audit your public key infrastructure?
      
        
      
    
    At the end of the day, security threats shouldn’t become usability problems — everyday messaging app users shouldn’t have to worry about whether the public keys of the people they’re talking to have been compromised. In the same way that certificate transparency is now built into the issuance and use of digital certificates to encrypt web traffic, we think that public key transparency and auditing should be built into end-to-end encrypted systems by default, so that users never have to do manual QR code verification again. 
We built our auditing service to be general purpose, reliable, and fast, and WhatsApp’s Key Transparency is just the first of several use cases it will be used for – Cloudflare is interested in helping audit the delivery of code binaries and integrity of all types of end-to-end encrypted infrastructure. If your company or organization is interested in working with us, you can reach out to us here. 


Privacy-preserving measurement and machine learning
Christopher Patton — Fri, 29 Sep 2023 13:00:45 GMT
 
In 2023, data-driven approaches to making decisions are the norm. We use data for everything from analyzing x-rays to translating thousands of languages to directing autonomous cars. However, when it comes to building these systems, the conventional approach has been to collect as much data as possible, and worry about privacy as an afterthought.
The problem is, data can be sensitive and used to identify individuals – even when explicit identifiers are removed or noise is added.
Cloudflare Research has been interested in exploring different approaches to this question: is there a truly private way to perform data collection, especially for some of the most sensitive (but incredibly useful!) technology?
Some of the use cases we’re thinking about include: training federated machine learning models for predictive keyboards without collecting every user’s keystrokes; performing a census without storing data about individuals’ responses; providing healthcare authorities with data about COVID-19 exposures without tracking peoples’ locations en masse; figuring out the most common errors browsers are experiencing without reporting which websites are visiting.  
It’s with those use cases in mind that we’ve been participating in the Privacy Preserving Measurement working group at the IETF whose goal is to develop systems for collecting and using this data while minimizing the amount of per-user information exposed to the data collector.
So far, the most promising standard in this space is DAP – Distributed Aggregation Protocol – a clever way to use multi-party computation to aggregate data without exposing individual measurements. Early versions of the algorithms used by DAP have been implemented by Google and Apple for exposure notifications.
In this blog post, we’ll do a deep dive into the fundamental concepts behind the DAP protocol and give an example of how we’ve implemented it into Daphne, our open source aggregator server. We hope this will inspire others to collaborate with us and get involved in this space!
    
      The principles behind DAP, an open standard for privacy preserving measurement
      
        
      
    
    
            
            
            
            
            
At a high level, using the DAP protocol forces us to think in terms of data minimization: collect only the data that we use and nothing more. Abstractly, our goal is to devise a system with which a data collector can compute some function \( f(m_{1},...,m_{N}) \) of measurements \( m_{1},...,m_{N} \) uploaded by users without observing the measurements in the clear.
            
            
            
            
            
Alice wants to know some aggregate statistic – like the average salary of the people at the party – without knowing how much each individual person makes.
This may at first seem like an impossible task: to compute on data without knowing the data we're computing on. Nevertheless, —and, as is often the case in cryptography— once we've properly constrained the problem, solutions begin to emerge.
            
            
            
            
            
Strawperson solution: delegate the calculation to a trusted third party, Bob. The problem with this is that Bob can see the private inputs in the clear
In an ideal world (see above), there would be some server somewhere on the Internet that we could trust to consume measurements, aggregate them, and send the result to the data collector without ever disclosing anything else. However, in reality there's no reason for users to trust such a server more than the data collector; Indeed, both are subject to the usual assortment of attacks that can lead to a data breach.
            
            
            
            
            
_MPC solution: secret-share the inputs across multiple parties, a.k.a. Bob and Daphne. If at least one person is honest, Alice gets the aggregate result without anyone knowing individual inputs in the clear._‌ ‌
Instead, what we do in DAP is distribute the computation across the servers such that no single server has a complete measurement. The key idea that makes this possible is secret sharing.
    
      Computing on secret shared data
      
        
      
    
    To set things up, let's make the problem a little more concrete. Suppose each measurement \( m_{i} \) is a number and our goal is to compute the sum of the measurements. That is, \( f(m_{1},...,m_{N}) = m_{1} + \cdots + m_{N} \). Our goal is to use secret sharing to allow two servers, which we'll call aggregators, to jointly compute this sum.
To understand secret sharing, we're going to need a tiny bit of math—modular arithmetic. The expression \(  X + Y  (\textrm{mod})  \textit{q} \) means "add \(  X  \) and \(  Y  \), then divide the sum by \(  q  \) and return the remainder". For now the modulus \(  q  \) can be any large number, as long as it's larger than any sum we'd ever want to compute (\(  2 ^{64}  \), say). In the remainder of this section, we'll omit \(  q  \) and simply write \(  X  + Y \) for addition modulo \(  q  \).
The goal of secret sharing is to shard a measurement (i.e., a "secret") into two "shares" such that (i) the measurement can be recovered by combining the shares together and (ii) neither share leaks any information about the measurement. To secret share each \(  m_{i} \), we choose a random number \( R_{i} \in \lbrace  0,...,q - 1\rbrace \), set the first share to be \(X_{i} = m_{i} - R_{i} \) and set the other share to be \( Y_{i} = R_{i} \). To recover the measurement, we simply add the shares together. This works because \( X_{i} + Y_{i} = (m_{i} - R_{i}) + R_{i} = m_{i} \). Moreover, each share is indistinguishable from a random number: For example, \( 1337 \) might be secret-shared into \( 11419752798245067454 \) and \( 7026991275464485499 \) (modulo \( q = 2^{64} \)).
With this scheme we can devise a simple protocol for securely computing the sum:
Each client shards its measurement \( m_{i} \) into \( X_{i} \) and \( Y_{i} \) and sends one share to each server.
The first aggregator computes \( X = X_{1} + \cdots + X_{N} \) and reveals \( X \) to the data collector. The second aggregator computes \( Y = Y_{1} + \cdots + Y_{N} \) and reveals \( Y \) to the data collector.
The data collector unshards the result as \( r = X + Y \).
This works because the secret shares are additive, and the order in which we add things up is irrelevant to the function we're computing:
\( r = m_{1} + \cdots + m_{N} \) // by definition\( r = (m_{1} - R_{1}) + R_{1} + \cdots (m_{N} - R_{N}) + R_{N} \) // apply sharding\( r = (m_{1} - R_{1}) + \cdots + (m_{N} - R_{N}) + R_{1} + \cdots R_{N} \) // rearrange the sum\( r = X + Y \) // apply aggregation
    
      Rich data types
      
        
      
    
    This basic template for secure aggregation was described in a paper from Henry Corrigan-Gibbs and Dan Boneh called "Prio: Private, Robust, and Scalable Computation of Aggregate Statistics" (NSDI 2017). This paper is a critical milestone in DAP's history, as it showed that a wide variety of aggregation tasks (not just sums) can be solved within one, simple protocol framework, Prio. With DAP, our goal in large part is to bring this framework to life.
All Prio tasks are instances of the same template. Measurements are encoded in a form that allows the aggregation function to be expressed as the sum of (shares of) the encoded measurements. For example:
To get arithmetic mean, we just divide the sum by the number of measurements.
Variance and standard deviation can be expressed as a linear function of the sum and the sum of squares (i.e., \( m_{i}, m_{i}^{2} \) for each \( i \)).
Quantiles (e.g., median) can be estimated reasonably well by mapping the measurements into buckets and aggregating the histogram.
Linear regression (i.e., finding a line of best fit through a set of data points) is a bit more complicated, but can also be expressed in the Prio framework.
This degree of flexibility is essential for wide-spread adoption because it allows us to get the most value we can out of a relatively small amount of software. However, there are a couple problems we still need to overcome, both of which entail the need for some form of interaction.
    
      Input validation
      
        
      
    
    The first problem is input validation. Software engineers, especially those of us who operate web services, know in our bones that validating inputs we get from clients is of paramount importance. (Never, ever stick a raw input you got from a client into an SQL query!) But if the inputs are secret shared, then there is no way for an aggregator to discern even a single bit of the measurement, let alone check that it has an expected value. (A secret share of a valid measurement and a number sampled randomly from \( \lbrace 0,...,q - 1 \rbrace \) look identical.) At least, not on its own.
The solution adopted by Prio (and the standard, with some improvements), is a special kind of zero-knowledge proof (ZKP) system designed to operate on secret shared data. The goal is for a prover to convince a verifier that a statement about some data it has committed to is true (e.g., the user has a valid hardware key), without revealing the data itself (e.g. which hardware key is in-use).
Our setting is exactly the same, except that we're working on secret-shared data rather than committed data. Along with the measurement shares, the client sends shares of a validity proof; then during aggregation, the aggregators interact with one another in order to check and verify the proof. (One round-trip over the network is required.)
A happy consequence of working with secret shared data is that proof generation and verification are much faster than for committed (or encrypted) data. This is mainly because we avoid the use of public-key cryptography (i.e., elliptic curves) and are less constrained in how we choose cryptographic parameters. (We require the modulus \( q \) to be a prime number with a particular structure, but such primes are not hard to find.)
    
      Non-linear aggregation
      
        
      
    
    There are a variety of aggregation tasks for which Prio is not well-suited, in particular those that are non-linear. One such task is to find the "heavy hitters" among the set of measurements. The heavy hitters are the subset of the measurements that occur most frequently, say at least \( t \) times for some threshold \( t \). For example, the measurements might be the URLs visited on a given day by users of a web browser; the heavy hitters would be the set of URLs that were visited by at least \( t \) users.
This computation can be expressed as a simple program:
            def heavy_hitters(measurements: list[bytes], t: int) -> set[bytes]:
    hh = defaultdict(lambda: 0)
    for measurement in measurements:
        hh[measurement] += 1
    return set(map(lambda x: x[0], filter(lambda x: x[1] >= t, hh.items())))
            However, it cannot be expressed as a linear function, at least not efficiently (with sub-exponential space). This would be required to perform this computation on secret-shared measurements.
In order to enable non-linear computation on secret shared data, it is necessary to introduce some form of interaction. There are a few possibilities. For the heavy hitters problem in particular, Henry Corrigan-Gibbs and others devised a protocol called Poplar (IEEE Security & Privacy 2021) in which several rounds of aggregation and unsharding are performed, where in each round, information provided by the collector is used to "query" the measurements to obtain a refined aggregate result.
    
      Helping to build a world of multi-party computation
      
        
      
    
    Protocols like Prio or Poplar that enable computation over secret shared data fit into a rich tradition in cryptography known as multi-party computation (MPC). MPC is at once an active research area in theoretical computer science and a class of protocols that are beginning to see real-world use—in our case, to minimize the amount of privacy-sensitive information we collect in order to keep the Internet moving.
The PPM working group at IETF represents a significant effort, by Cloudflare and others, to standardize MPC techniques for privacy preserving measurement. This work has three main prongs:
To identify the types of problems that need to be solved.
To provide cryptography researchers from academia, industry, and the public sector with "templates" for solutions that we know how to deploy. One such template is called a "Verifiable Distributed Aggregation Function (VDAF)", which specifies a kind of "API boundary" between protocols like Prio and Poplar and the systems that are built around them. Cloudflare Research is leading development of the standard, contributing to implementations, and providing security analysis.
To provide a deployment roadmap for emerging protocols. DAP is one such roadmap: it specifies execution of a generic VDAF over HTTPS and attends to the various operational considerations that arise as deployments progress. As well as contributing to the standard itself, Cloudflare has developed its own implementation designed for our own infrastructure (see below).
The IETF is working on its first set of drafts (DAP/VDAF). These drafts are mature enough to deploy, and a number of deployments are scaling up as we speak. Our hope is that we have initiated positive feedback between theorists and practitioners: as new cryptographic techniques emerge, more practitioners will begin to work with them, which will lead to identifying new problems to solve, leading to new techniques, and so on.
    
      Daphne: Cloudflare’s implementation of a DAP Aggregation Server
      
        
      
    
    Our emerging technology group has been working on Daphne, our Rust-based implementation of a DAP aggregator server. This is only half of a deployment – DAP architecture requires two aggregator servers to interoperate, both operated by different parties. Our current version only implements the DAP Helper role; the other role is the DAP Leader. Plans are in the works to implement the Leader as well, which will open us up to deploy Daphne for more use cases.
We made two big decisions in our implementation here: using Rust and using Workers. Rust has been skyrocketing in popularity in the past few years due to its performance and memory management – a favorite of cryptographers for similar reasons. Workers is Cloudflare’s serverless execution environment that allows developers to easily deploy applications globally across our network – making it a favorite tool to prototype with at Cloudflare. This allows for easy integration with our Workers-based storage solutions like: Durable Objects, which we’re using for storing various data artifacts as required by the DAP protocol; and KV, which we’re using for managing aggregation task configuration. We’ve learned a lot from our interop tests and deployment, which has helped improve our own Workers products and which we have also fed back into the PPM working group to help improve the DAP standard.
If you’re interested in learning more about Daphne or collaborating with us in this space, you can fill out this form. If you’d like to get involved in the DAP standard, you can check out the working group. 


Cloudflare is now powering Microsoft Edge Secure Network
Mari Galicer — Thu, 28 Sep 2023 13:00:30 GMT
 
            
            
            
            
            
Between third-party cookies that track your activity across websites, to highly targeted advertising based on your IP address and browsing data, it's no secret that today’s Internet browsing experience isn’t as private as it should be. Here at Cloudflare, we believe everyone should be able to browse the Internet free of persistent tracking and prying eyes.
That’s why we’re excited to announce that we’ve partnered with Microsoft Edge to provide a fast and secure VPN, right in the browser. Users don’t have to install anything new or understand complex concepts to get the latest in network-level privacy: Edge Secure Network VPN is available on the latest consumer version of Microsoft Edge in most markets, and automatically comes with 5 GB of data. Just enable the feature by going to [Microsoft Edge Settings & more (…) > Browser essentials, and click Get VPN for free]. See Microsoft’s Edge Secure Network page for more details.
    
      Cloudflare’s Privacy Proxy platform isn’t your typical VPN
      
        
      
    
    To take a step back: a VPN is a way in which the Internet traffic leaving your device is tunneled through an intermediary server operated by a provider – in this case, Cloudflare! There are many important pieces that make this possible, but among them is the VPN protocol, which defines the way in which the tunnel is established and how traffic flows through it. You may have heard of some of these protocols: Wireguard, IPsec, and OpenVPN, for example. And while we’re no stranger to these, (Cloudflare’s WireGuard implementation is currently in use by millions of devices that use 1.1.1.1+WARP) – we see our Privacy Proxy Platform as a way to push forward the next frontier of Internet privacy and embrace one of Cloudflare’s core values: open Internet standards.
The Privacy Proxy Platform implements HTTP CONNECT, a method defined in the HTTP standard that proxies traffic by establishing a tunnel and then sending reliable and ordered byte streams through that tunnel. You can read more about this proxying method (and its history!) in our Primer on Proxies.
We also leverage other parts of Cloudflare’s privacy-oriented infrastructure that are already deployed at scale: requests first utilize 1.1.1.1 for DNS, a token proxy based on Privacy Pass for client authentication, and Geo-egress to choose an accurate egress IP address without exposing users’ precise location.
    
      How it works
      
        
      
    
    Let’s dive into the details of these components. For the purposes of this blog, we’ll call the devices people are using to browse the Internet (your phone, tablet or computer) clients, and the websites they’re trying to visit origin sites.
The Privacy Proxy Platform includes three main parts:
Token Proxy: this is the service that checks if you’re an Edge Secure Network user with a legitimate Microsoft account.
Privacy API: based on the above, Cloudflare’s Privacy API issues authentication tokens that clients use for authenticating to the proxy itself.
Privacy Proxy: this is the HTTP CONNECT-based proxy service running on Cloudflare’s network. This service checks that the client presents a valid authentication token, and if so, proxies the encrypted HTTP request to the origin site. It is also responsible for selecting a valid egress IP address to be used.
            
            
            
            
            
When Edge Secure Network protections are on – say, when a user connects to an open Wi-Fi network at a coffee shop – our proxy will automatically prompt that client for a token to authenticate. If the client has a token, it will present one. If it doesn’t, it will utilize the token proxy to mint a new pool using the help of an attester and issuer: the attester checks the validity of the client and Microsoft account, and the issuer issues tokens for that client in return. This dance is based on the Privacy Pass protocol. Importantly, it allows Cloudflare to validate that clients are who they say they are without collecting or storing personal information from Microsoft users.
Once the client has presented the proxy server with a valid token, the Privacy Proxy then chooses a valid egress IP address based on a hash of the client’s geolocation. It then uses the DNS record (provided by Cloudflare’s DNS resolver, 1.1.1.1) to open up an encrypted session to the origin website. From there, it’s pretty straightforward: if the user continues to browse on that site, further requests will be sent through that connection, if they stop or close the browser, that connection will close as well.
Because Cloudflare proxies millions of requests per second, many of the operational aspects of the proxy are managed by Oxy, our proxying framework that handles everything from telemetry, graceful restarts, to stream multiplexing and IP fallbacks, and authentication hooks.
    
      Low last-mile latency and geolocation parity thanks to Cloudflare’s Network
      
        
      
    
    Cloudflare’s privacy proxy implementation maximizes user experience without sacrificing privacy. When Edge Secure Network is enabled, users will have search and browsing results relevant to where they’re geographically located. At Cloudflare, we call this the pizza test: people should be able to use any of our privacy proxy products and still be able to get results for “pizza places near me”. We accomplish this by always egressing through a Cloudflare data center that has an IP address that corresponds to the user’s location – we’ve written more about how we did this for 1.1.1.1+WARP.
Unlike your typical VPN operator that has dozens – sometimes hundreds – of servers, Cloudflare has a much larger footprint: data centers in over 300 cities. Because our network is an anycast “every service, everywhere” approach, each of our data centers can accept traffic from an Edge Secure network client. This means that Edge users will automatically detect and connect with a Cloudflare data center geographically very close to them, minimizing last-mile latency. Finally, because Cloudflare also operates a CDN, websites that are already on Cloudflare will be given a “hot-path,” and will load faster.
We at Cloudflare are always striving to bring more privacy options to the open Internet, and we are excited to provide more private and secure browsing to Edge users. To learn more, head to Microsoft’s Edge Secure Network page or Microsoft’s support page. If you’re a partner interested in using a privacy-preserving proxy like this one, fill out this form. 


Donning a MASQUE: building a new protocol into Cloudflare WARP
Mari Galicer — Thu, 22 Jun 2023 13:00:02 GMT
 
            
            
            
            
            
When we originally announced WARP, we knew we were launching a product that was different from other VPNs. Cloudflare has not only hundreds more data centers than your typical VPN provider, but also a unique purview into the adoption of open Internet standards. The confluence of these two factors have led us to today’s announcement: support for MASQUE, a cutting-edge new protocol for the beta version of our consumer WARP iOS app.
MASQUE is a set of mechanisms that extend HTTP/3 and leverage the unique properties of the QUIC transport protocol to efficiently proxy IP and UDP traffic. Most importantly, it will make your Internet browsing experience faster and more stable without sacrificing privacy.
Like many products at Cloudflare, we’re offering this first as a free, consumer offering. Once we’ve had an opportunity to learn from what it’s like to operate MASQUE on mobile devices, at scale, we plan to integrate it into our Zero Trust enterprise product suite.
    
      We’re not saying goodbye to WireGuard
      
        
      
    
    When we first built WARP we chose to go with WireGuard for many reasons – among them, simplicity. This is where WireGuard shines: ~4,000 lines of code that use public-key cryptography to create an encrypted tunnel between one computer and another. The cryptographic parts – encapsulation and decapsulation –  are fast, simple, and secure. This simplicity has allowed us to implement it cross-platform without much effort; today, we support WireGuard clients on iOS, Android, macOS, Windows, and Linux.
That being said, the protocol is not without its issues. Like many tradeoffs in technology, WireGuard’s strengths are also its drawbacks. While simple, it is also rigid: it’s not possible to extend it easily, for example, for session management, congestion control, or to recover more quickly from error-state behaviors we’re familiar with. Finally, neither the protocol nor the cryptography it uses are standards-based, making it difficult to keep up with the strongest known cryptography (post-quantum crypto, for example).
    
      We want to move QUIC-ly
      
        
      
    
    We’re excited about MASQUE because it fits into the way the Internet is evolving. According to this year’s usage report from our Radar team, HTTP/2 is currently the standard in use by the majority of Internet traffic, but HTTP/3 occupies a growing share – 28% as of June 2023. Cloudflare has always been dedicated towards adopting the cutting edge when it comes to standards: when RFC 9000 (the QUIC transport protocol) was published, we enabled it for all Cloudflare customers the very next day.
            
            
            
            
            
So why do we think HTTP/3 is so promising? Well, a lot of it has to do with solving performance issues with HTTP/2. HTTP/3 promises a number of things.
Faster connection establishment: the TCP+TLS handshake of earlier HTTP versions typically takes two to three round trips. QUIC performs the transport and security handshake at the same time, cutting down on the total required round trips.
No more head of line blocking: when one packet of information does not make it to its destination, it will no longer block all streams of information.
Agility and evolution: QUIC has strong extension and version negotiation mechanisms. And because it encrypts all but a few bits of its wire image, deploying new transport features is easier and more practical. In contrast, TCP evolution was hampered by middleboxes that failed to keep up with the times.
Naturally, we’d want the proxying protocol we use for so many people’s everyday browsing to take advantage of these benefits. For example, the QUIC unreliable datagram extension doesn't help much for standard web traffic but it's ideal for tunneling UDP or IP packets that expect an unreliable substrate beneath them. Without the unreliable aspect, the protocols on top can get upset and start to perform badly. Datagrams help unlock QUIC's proxying potential.
    
      MASQUE: A new era for VPN performance and flexibility
      
        
      
    
    You may have heard of HTTP GET, POST, and PUT, but what about CONNECT? HTTP-CONNECT is a method that opens up a tunnel between servers and proxies traffic between them. For a deeper dive, check out our Primer on Proxies. Many Cloudflare services use this method like so:
            
            
            
            
            
Clients send a CONNECT request, and if the proxy sends back a 2xx (success) status code, tunnel secured! Simple. However, remember that QUIC is UDP-based. Luckily, the MASQUE working group has figured out how to run multiple concurrently stream and datagram-based connections. Establishing one looks like this:
            
            
            
            
            
Here’s what this MASQUE proxying looks like:
            
            
            
            
            
From a development perspective, MASQUE also allows us to improve our performance in other ways: we’re already running it for iCloud Private Relay and other Privacy Proxy partners. The services that power these partnerships, from our Rust-based proxy framework to our open source QUIC implementation, are already deployed globally in our network and have proven to be fast, resilient, and reliable. We've already learned a lot about how to operate proxies at scale, but there’s plenty of room for improvement. The good news is that every performance improvement we make to speed up MASQUE-based connections for our WARP clients will also improve performance for our customers that use HTTP-CONNECT, and vice-versa.
From a protocol perspective, we also think that MASQUE will prove to be resilient over time. As you can see above, connections are made through port 443, which for both TCP and UDP blends in well with general HTTP/3 traffic and is less susceptible than WireGuard to blocking.
Finally, because MASQUE is an IETF standard, innovations via extensions are already underway. One we’re particularly excited about is Multipath QUIC, an extension whose implementation would allow us to use multiple concurrent network interfaces for a single logical QUIC connection. For example, using both LTE and WiFi on a single mobile device could allow for seamless switching between the two, helping to avoid pesky disruptions when you’re coming to and from work or home.
The magic of supporting MASQUE is that it combines some pretty cool (and very Cloudflare-y!) elements: a standards-based proxying protocol that provides real user-facing performance benefits, built upon Cloudflare’s widely available Anycast network, and encryption of that last-mile between that network and your phone.
    
      So how can I use it?
      
        
      
    
    If you’d like to join the waitlist for our beta tester program for MASQUE, you can sign up here.
You’ll first need to download Testflight on a valid iOS device. We will be sending out invites to download the app via Testflight first come, first served, as early as next week. Once you’ve downloaded the app, MASQUE will be available as the default connection in our beta iOS version, only available in iOS 17 (and up).
To toggle between WireGuard and MASQUE, go to Settings > Personalization > Protocol:
            
            
            
            
            
    
      Protocols come and go, but our privacy promise remains the same
      
        
      
    
    While the protocols that dominate the Internet may change, our promise to consumers remains the same – a more private Internet, free of cost. When using WARP, we still route all DNS queries through 1.1.1.1, our privacy-respecting DNS resolver; we will never write user-identifiable log data to disk; we will never sell your browsing data or use it in any way to target you with advertising data; and you can still use WARP without providing any personal information like your name, phone number, or email address. 


Privacy Gateway: a privacy preserving proxy built on Internet standards
Mari Galicer — Thu, 27 Oct 2022 13:01:00 GMT
 
If you’re running a privacy-oriented application or service on the Internet, your options to provably protect users’ privacy are limited. You can minimize logs and data collection but even then, at a network level, every HTTP request needs to come from somewhere. Information generated by HTTP requests, like users’ IP addresses and TLS fingerprints, can be sensitive especially when combined with application data.
Meaningful improvements to your users’ privacy require a change in how HTTP requests are sent from client devices to the server that runs your application logic. This was the motivation for Privacy Gateway: a service that relays encrypted HTTP requests and responses between a client and application server. With Privacy Gateway, Cloudflare knows where the request is coming from, but not what it contains, and applications can see what the request contains, but not where it comes from. Neither Cloudflare nor the application server has the full picture, improving end-user privacy.
We recently deployed Privacy Gateway for Flo Health Inc., a leading female health app, for the launch of their Anonymous Mode. With Privacy Gateway in place, all request data for Anonymous Mode users is encrypted between the app user and Flo, which prevents Flo from seeing the IP addresses of those users and Cloudflare from seeing the contents of that request data.
With Privacy Gateway in place, several other privacy-critical applications are possible:
Browser developers can collect user telemetry in a privacy-respecting manner– what extensions are installed, what defaults a user might have changed — while removing what is still a potentially personal identifier (the IP address) from that data.
Users can visit a healthcare site to report a Covid-19 exposure without worrying that the site is tracking their IP address and/or location.
DNS resolvers can serve DNS queries without linking who made the request with what website they’re visiting – a pattern we’ve implemented with Oblivious DNS.Privacy Gateway is based on Oblivious HTTP (OHTTP), an emerging IETF standard and is built upon standard hybrid public-key cryptography.
    
      How does it work?
      
        
      
    
    The main innovation in the Oblivious HTTP standard – beyond a basic proxy service – is that these messages are encrypted to the application’s server, such that Privacy Gateway learns nothing of the application data beyond the source and destination of each message.
Privacy Gateway enables application developers and platforms, especially those with strong privacy requirements, to build something that closely resembles a “Mixnet”: an approach to obfuscating the source and destination of a message across a network. To that end, Privacy Gateway consists of three main components:
Client: the user’s device, or any client that’s configured to forward requests to Privacy Gateway.
Privacy Gateway: a service operated by Cloudflare and designed to relay requests between the Client and the Gateway, without being able to observe the contents within.
Application server: the origin or application web server responsible for decrypting requests from clients, and encrypting responses back.
If you were to imagine request data as the contents of the envelope (a letter) and the IP address and request metadata as the address on the outside, with Privacy Gateway, Cloudflare is able to see the envelope’s address and safely forward it to its destination without being able to see what’s inside.
            
            
            
            
            
An Oblivious HTTP transaction using Privacy Gateway
In slightly more detail, the data flow is as follows:
Client encapsulates an HTTP request using the public key of the application server, and sends it to Privacy Gateway over an HTTPS connection.
Privacy Gateway forwards the request to the server over its own, separate HTTPS connection with the application server.
The application server  decapsulates the request, forwarding it to the target server which can produce the response.
The application server returns an encapsulated response to Privacy Gateway, which then forwards the result to the client.As specified in the protocol, requests from the client to the server are encrypted using HPKE, a state-of-the-art standard for public key encryption – which you can read more about here. We’ve taken additional measures to ensure that OHTTP’s use of HPKE is secure by conducting a formal analysis of the protocol, and we expect to publish a deeper analysis in the coming weeks.
    
      How Privacy Gateway improves end-user privacy
      
        
      
    
    This interaction offers two types of privacy, which we informally refer to as request privacy and client privacy.
Request privacy means that the application server does not learn information that would otherwise be revealed by an HTTP request, such as IP address, geolocation, TLS and HTTPS fingerprints, and so on. Because Privacy Gateway uses a separate HTTPS connection between itself and the application server, all of this per-request information revealed to the application server represents that of Privacy Gateway, not of the client. However, developers need to take care to not send personally identifying information in the contents of requests. If the request, once decapsulated, includes information like users’ email, phone number, or credit card info, for example, Privacy Gateway will not meaningfully improve privacy.
Client privacy is a stronger notion. Because Cloudflare and the application server are not colluding to share individual user’s data, from the server’s perspective, each individual transaction came from some unknown client behind Privacy Gateway. In other words, a properly configured Privacy Gateway deployment means that applications cannot link any two requests to the same client. In particular, with Privacy Gateway, privacy loves company. If there is only one end-user making use of Privacy Gateway, then it only provides request privacy (since the client IP address remains hidden from the Gateway). It would not provide client privacy, since the server would know that each request corresponds to the same, single client. Client privacy requires that there be many users of the system, so the application server cannot make this determination.
To better understand request and client privacy, consider the following HTTP request between a client and server:
            
            
            
            
            
Normal HTTP configuration with a client anonymity set of size 1
If a client connects directly to the server (or “Gateway” in OHTTP terms), the server is likely to see information about the client, including the IP address, TLS cipher used, and a degree of location data based on that IP address:
            - ipAddress: 192.0.2.33 # the client’s real IP address
- ASN: 7922
- AS Organization: Comcast Cable
- tlsCipher: AEAD-CHACHA20-POLY1305-SHA256 # potentially unique
- tlsVersion: TLSv1.3
- Country: US
- Region: California
- City: Campbell
            There’s plenty of sensitive information here that might be unique to the end-user. In other words, the connection offers neither request nor client privacy.
With Privacy Gateway, clients do not connect directly to the application server itself. Instead, they connect to Privacy Gateway, which in turn connects to the server. This means that the server only observes connections from Privacy Gateway, not individual connections from clients, yielding a different view:
            - ipAddress: 104.16.5.5 # a Cloudflare IP
- ASN: 13335
- AS Organization: Cloudflare
- tlsCipher: ECDHE-ECDSA-AES128-GCM-SHA256 # shared across several clients
- tlsVersion: TLSv1.3
- Country: US
- Region: California
- City: Los Angeles
            
            
            
            
            
            
Privacy Gateway configuration with a client anonymity set of size k
This is request privacy. All information about the client’s location and identity are hidden from the application server. And all details about the application data are hidden from Privacy Gateway. For sensitive applications and protocols like DNS, keeping this metadata separate from the application data is an important step towards improving end-user privacy.
Moreover, applications should take care to not reveal sensitive, per-client information in their individual requests. Privacy Gateway cannot guarantee that applications do not send identifying info – such as email addresses, full names, etc – in request bodies, since it cannot observe plaintext application data. Applications which reveal user identifying information in requests may violate client privacy, but not request privacy. This is why – unlike our full application-level Privacy Proxy product – Privacy Gateway is not meant to be used as a generic proxy-based protocol for arbitrary applications and traffic. It is meant to be a special purpose protocol for sensitive applications, including DNS (as is evidenced by Oblivious DNS-over-HTTPS), telemetry data, or generic API requests as discussed above.
    
      Integrating Privacy Gateway into your application
      
        
      
    
    Integrating with Privacy Gateway requires applications to implement the client and server side of the OHTTP protocol. Let’s walk through what this entails.
    
      Server Integration
      
        
      
    
    The server-side part of the protocol is responsible for two basic tasks:
Publishing a public key for request encapsulation; and
Decrypting encapsulated client requests, processing the resulting request, and encrypting the corresponding response.
A public encapsulation key, called a key configuration, consists of a key identifier (so the server can support multiple keys at once for rotation purposes), cryptographic algorithm identifiers for encryption and decryption, and a public key:
            HPKE Symmetric Algorithms {
  HPKE KDF ID (16),
  HPKE AEAD ID (16),
}

OHTTP Key Config {
  Key Identifier (8),
  HPKE KEM ID (16),
  HPKE Public Key (Npk * 8),
  HPKE Symmetric Algorithms Length (16),
  HPKE Symmetric Algorithms (32..262140),
}
            Clients need this public key to create their request, and there are lots of ways to do this. Servers could fix a public key and then bake it into their application, but this would require a software update to rotate the key. Alternatively, clients could discover the public key some other way. Many discovery mechanisms exist and vary based on your threat model – see this document for more details. To start, a simple approach is to have clients fetch the public key directly from the server over some API. Below is a snippet of the API that our open source OHTTP server provides:
            func (s *GatewayResource) configHandler(w http.ResponseWriter, r *http.Request) {
	config, err := s.Gateway.Config(s.keyID)
	if err != nil {
		http.Error(w, http.StatusText(http.StatusInternalServerError), http.StatusInternalServerError)
		return
	}
	w.Write(config.Marshal())
}
            Once public key generation and distribution is solved, the server then needs to handle encapsulated requests from clients. For each request, the server needs to decrypt the request, translate the plaintext to a corresponding HTTP request that can be resolved, and then encrypt the resulting response back to the client.
Open source OHTTP libraries typically offer functions for request decryption and response encryption, whereas plaintext translation from binary HTTP to an HTTP request is handled separately. For example, our open source server delegates this translation to a different library that is specific to how Go HTTP requests are represented in memory. In particular, the function to translate from a plaintext request to a Go HTTP request is done with a function that has the following signature:
            func UnmarshalBinaryRequest(data []byte) (*http.Request, error) {
	...
}
            Conversely, translating a Go HTTP response to a plaintext binary HTTP response message is done with a function that has the following signature:
            type BinaryResponse http.Response

func (r *BinaryResponse) Marshal() ([]byte, error) {
	...
}
            While there exist several open source libraries that one can use to implement OHTTP server support, we’ve packaged all of it up in our open source server implementation available here. It includes instructions for building, testing, and deploying to make it easy to get started.
    
      Client integration
      
        
      
    
    Naturally, the client-side behavior of OHTTP mirrors that of the server. In particular, the client must:
Discover or obtain the server public key; and
Encode and encrypt HTTP requests, send them to Privacy Gateway, and decrypt and decode the HTTP responses.
Discovery of the server public key depends on the server’s chosen deployment model. For example, if the public key is available over an API, clients can simply fetch it directly:
            $ curl https://server.example/ohttp-configs > config.bin
            Encoding, encrypting, decrypting, and decoding are again best handled by OHTTP libraries when available. With these functions available, building client support is rather straightforward. A trivial example Go client using the library functions linked above is as follows:
            configEnc := ... // encoded public key
config, err := ohttp.UnmarshalPublicConfig(configEnc)
if err != nil {
	return err
}

request, err := http.NewRequest(http.MethodGet, "https://test.example/index.html", nil)
if err != nil {
	return err
}

binaryRequest := ohttp.BinaryRequest(*request)
encodedRequest, err := binaryRequest.Marshal()
if err != nil {
	return err
}

ohttpClient := ohttp.NewDefaultClient(config)
encapsulatedReq, reqContext, err := ohttpClient.EncapsulateRequest(encodedRequest)

relayRequest, err := http.NewRequest(http.MethodPost, "https://relay.example", bytes.NewReader(encapsulatedReq.Marshal()))
if err != nil {
	return err
}
relayRequest.Header.Set("Content-Type", "message/ohttp-req")

client := http.Client{}
relayResponse, err := client.Do(relayRequest)
if err != nil {
	return err
}
bodyBytes, err := ioutil.ReadAll(relayResponse.Body)
if err != nil {
	return err
}
encapsulatedResp, err := ohttp.UnmarshalEncapsulatedResponse(bodyBytes)
if err != nil {
	return err
}

receivedResp, err := reqContext.DecapsulateResponse(encapsulatedResp)
if err != nil {
	return err
}

response, err := ohttp.UnmarshalBinaryResponse(receivedResp)
if err != nil {
	return err
}

fmt.Println(response)
            A standalone client like this isn’t likely very useful to you if you have an existing application. To help integration into your existing application, we created a sample OHTTP client library that’s compatible with iOS and macOS applications. Additionally, if there’s language or platform support you would like to see to help ease integration on either or the client or server side, please let us know!
    
      Interested?
      
        
      
    
    Privacy Gateway is currently in early access – available to select privacy-oriented companies and partners. If you’re interested, please get in touch. 


Private by design: building privacy-preserving products with Cloudflare's Privacy Edge
Mari Galicer — Wed, 28 Sep 2022 13:00:00 GMT
 
When Cloudflare was founded, our value proposition had three pillars: more secure, more reliable, and more performant. Over time, we’ve realized that a better Internet is also a more private Internet, and we want to play a role in building it.
User awareness and expectations of and for privacy are higher than ever, but we believe that application developers and platforms shouldn’t have to start from scratch. We’re excited to introduce Privacy Edge – Code Auditability, Privacy Gateway, Privacy Proxy, and Cooperative Analytics – a suite of products that make it easy for site owners and developers to build privacy into their products, by default.
    
      Building network-level privacy into the foundations of app infrastructure
      
        
      
    
    As you’re browsing the web every day, information from the networks and apps you use can expose more information than you intend. When accumulated over time, identifiers like your IP address, cookies, browser and device characteristics create a unique profile that can be used to track your browsing activity. We don’t think this status quo is right for the Internet, or that consumers should have to understand the complex ecosystem of third-party trackers to maintain privacy. Instead, we’ve been working on technologies that encourage and enable website operators and app developers to build privacy into their products at the protocol level.
Getting privacy right is hard. We figured we’d start in the area we know best: building privacy into our network infrastructure. Like other work we’ve done in this space – offering free SSL certificates to make encrypted HTTP requests the norm, and launching 1.1.1.1, a privacy-respecting DNS resolver, for example – the products we’re announcing today are built upon the foundations of open Internet standards, many of which are co-authored by members of our Research Team.
Privacy Edge – the collection of products we’re announcing today, includes:
Privacy Gateway: A lightweight proxy that encrypts request data and forwards it through an IP-blinding relay
Code Auditability: A solution to verifying that code delivered in your browser hasn’t been tampered with
Private Proxy: A proxy that offers the protection of a VPN, built natively into application architecture
Cooperative Analytics: A multi-party computation approach to measurement and analytics based on an emerging distributed aggregation protocol.
Today’s announcement of Privacy Edge isn’t exhaustive. We’re continuing to explore, research and develop new privacy-enabling technologies, and we’re excited about all of them.
    
      Privacy Gateway: IP address privacy for your users
      
        
      
    
    There are situations in which applications only need to receive certain HTTP requests for app functionality, but linking that data with who or where it came from creates a privacy concern.
We recently partnered with Flo Health, a period tracking app, to solve exactly that privacy concern: for users that have turned on “Anonymous mode,” Flo encrypts and forwards traffic through Privacy Gateway so that the network-level request information (most importantly, users’ IP addresses) are replaced by the Cloudflare network.
            
            
            
            
            
How data is encapsulated, forwarded, and decapsulated in the Privacy Gateway system.
So how does it work? Privacy Gateway is based on Oblivious HTTP, an emerging IETF standard, and at a high level describes the following data flow:
The client encapsulates an HTTP request using the public key of the customer’s gateway server, and sends it to the relay over a client<>relay HTTPS connection.
The relay forwards the request to the server over its own relay<>gateway HTTPS connection.
The gateway server decapsulates the request, forwarding it to the application server.
The gateway server returns an encapsulated response to the relay, which then forwards the result to the client.
The novel feature Privacy Gateway implements from the OHTTP specification is that messages sent through the relay are encrypted (via HPKE) to the application server, so that the relay learns nothing of the application data beyond the source and destination of each message.
The end result is that the relay will know where the data request is coming from (i.e. users’ IP addresses) but not what it contains (i.e. contents of the request), and the application can see what the data contains but won’t know where it comes from. A win for end-user privacy.
    
      Delivering verifiable and authentic code for privacy-critical applications
      
        
      
    
    How can you ensure that the code — the JavaScript, CSS or even HTML —delivered to a browser hasn’t been tampered with?
One way is to generate a hash (a consistent, unique, and shorter representation) of the code, and have two independent parties compare those hashes when delivered to the user's browser.
Our Code Auditability service does exactly that, and our recent partnership with Meta deployed it at scale to WhatsApp Web. Installing their Code Verify browser extension ensures users can be sure that they are delivered the code they’re intended to run – free of tampering or corrupted files.
With WhatsApp Web:
WhatsApp publishes the latest version of their JavaScript libraries to their servers, and the corresponding hash for that version to Cloudflare’s audit endpoint.
A WhatsApp web client fetches the latest libraries from WhatsApp.
The Code Verify browser extension subsequently fetches the hash for that version from Cloudflare over a separate, secure connection.
Code Verify compares the “known good” hash from Cloudflare with the hash of the libraries it locally computed.
If the hashes match, as they should under almost any circumstance, the code is “verified” from the perspective of the extension. If the hashes don’t match, it indicates that the code running on the user's browser is different from the code WhatsApp intended to run on all its user's browsers.
            
            
            
            
            
How Cloudflare and WhatsApp Web verify code shipped to users isn't tampered with.
Right now, we call this "Code Auditability" and we see a ton of other potential use cases including password managers, email applications, certificate issuance – all technologies that are potentially targets of tampering or security threats because of the sensitive data they handle.
In the near term, we’re working with other app developers to co-design solutions that meet their needs for privacy-critical products. In the long term, we’re working on standardizing the approach, including building on existing Content Security Policy standards, or the Isolated Web Apps proposal, and even an approach towards building Code Auditability natively into the browser so that a browser extension (existing or new) isn't required.
    
      Privacy-preserving proxying – built into applications
      
        
      
    
    What if applications could build the protection of a VPN into their products, by default?
Privacy Proxy is our platform to proxy traffic through Cloudflare using a combination of privacy protocols that make it much more difficult to track users’ web browsing activity over time. At a high level, the Privacy Proxy Platform encrypts browsing traffic, replaces a device’s IP address with one from the Cloudflare network, and then forwards it onto its destination.
            
            
            
            
            
System architecture for Privacy Proxy.
The Privacy Proxy platform consists of several pieces and protocols to make it work:
Privacy API: a service that issues unique cryptographic tokens, later redeemed against the proxy service to ensure that only valid clients are able to connect to the service.
Geolocated IP assignment: a service that assigns each connection a new Cloudflare IP address based on the client’s approximate location.
Privacy Proxy: the HTTP CONNECT-based service running on Cloudflare’s network that handles the proxying of traffic. This service validates the privacy token passed by the client, enforces any double spend prevention necessary for the token.
            
            
            
            
            
We’re working on several partnerships to provide network-level protection for user’s browsing traffic, most recently with Apple for Private Relay. Private Relay’s design adds privacy to the traditional proxy design by adding an additional hop – an ingress proxy, operated by Apple – that separates handling users’ identities (i.e., whether they’re a valid iCloud+ user) from the proxying of traffic – the egress proxy, operated by Cloudflare.
    
      Measurements and analytics without seeing individual inputs
      
        
      
    
    What if you could calculate the results of a poll, without seeing individuals' votes, or update inputs to a machine learning model that predicted COVID-19 exposure without seeing who was exposed?
It might seem like magic, but it's actually just cryptography. Cooperative Analytics is a multi-party computation system for aggregating privacy-sensitive user measurements that doesn’t reveal individual inputs, based on the Distributed Aggregation Protocol (DAP).
            
            
            
            
            
How data flows through the Cooperative Analytics system.
At a high-level, DAP takes the core concept behind MapReduce — what became a fundamental way to aggregate large amounts of data — and rethinks how it would work with privacy-built in, so that each individual input cannot be (provably) mapped back to the original user.
Specifically:
Measurements are first "secret shared," or split into multiple pieces. For example, if a user's input is the number 5, her input could be split into two shares of [10,-5].
The input share pieces are then distributed between different, non-colluding servers for aggregation (in this example, simply summed up). Similar to Privacy Gateway or Private Proxy, no one party has all the information needed to reconstruct any user's input.
Depending on the use case, the servers will then communicate with one another in order to verify that the input is "valid" – so that no one can insert an input that throws off the entire results. The magic of multi-party computation is that the servers can perform this computation without learning anything about the input beyond its validity.
Once enough input shares have been aggregated to ensure strong anonymity and a statistically significant sample size – each server sends its sum of the input shares to the overall consumer of this service to then compute the final result.
For simplicity, the above example talks about measurements as summed up numbers, but DAP describes algorithms for multiple different types of inputs: the most common string input, or a linear regression, for example.
Early iterations of this system have been implemented by Apple and Google for COVID-19 exposure notifications, but there are many other potential use cases for a system like this: think sensitive browser telemetry, geolocation data – any situation where one has a question about a population of users, but doesn't want to have to measure them directly.
Because this system requires different parties to operate separate aggregation servers, Cloudflare is working with several partners to act as one of the aggregation servers for DAP. We’re calling our implementation Daphne, and it’s built on top of Cloudflare Workers.
    
      Privacy still requires trust
      
        
      
    
    Part of what's cool about these systems is that they distribute information — whether user data, network traffic, or both — amongst multiple parties.
While we think that products included in Privacy Edge are moving the Internet in the right direction, we understand that trust only goes so far. To that end, we're trying to be as transparent as possible.
We've open sourced the code for Privacy Gateway's server and DAP's aggregation server, and all the standards work we're doing is in public with the IETF.
We're also working on detailed and accessible privacy notices for each product that describe exactly what kind of network data Cloudflare sees, doesn't see, and how long we retain it for.
And, most importantly, we’re continuing to develop new protocols (like Oblivious HTTP) and technologies that don’t just require trust, but that can provably minimize the data observed or logged.
We'd love to see more folks get involved in the standards space, and we welcome feedback from privacy experts and potential customers on how we can improve the integrity of these systems.
    
      We’re looking for collaborators
      
        
      
    
    Privacy Edge products are currently in early access.
We're looking for application developers who want to build more private user-facing apps with Privacy Gateway; browser and existing VPN vendors looking to improve network-level security for their users via Privacy Proxy; and anyone shipping sensitive software on the Internet that is looking to iterate with us on code auditability and web app signing.
If you're interested in working with us on furthering privacy on the Internet, then please reach out, and we’ll be in touch! 


1.1.1.1 + WARP: More features, still private
Mari Galicer — Sat, 06 Aug 2022 16:15:03 GMT
 
            
            
            
            
            
It’s a Saturday night. You open your browser, looking for nearby pizza spots that are open. If the search goes as intended, your browser will show you results that are within a few miles, often based on the assumed location of your IP address. At Cloudflare, we affectionately call this type of geolocation accuracy the “pizza test”. When you use a Cloudflare product that sits between you and the Internet (for example, WARP), it’s one of the ways we work to balance user experience and privacy. Too inaccurate and you’re getting pizza places from a neighboring country; too accurate and you’re reducing the privacy benefits of obscuring your location.
With that in mind, we’re excited to announce two major improvements to our 1.1.1.1 + WARP apps: first, an improvement to how we ensure search results and other geographically-aware Internet activity work without compromising your privacy, and second, a larger network with more locations available to WARP+ subscribers, powering even speedier connections to our global network.
    
      A better Internet browsing experience for every WARP user
      
        
      
    
    When we originally built the 1.1.1.1+ WARP mobile app, we wanted to create a consumer-friendly way to connect to our network and privacy-respecting DNS resolver.
What we discovered over time is that the topology of the Internet dictates a different type of experience for users in different locations. Why? Sometimes, because traffic congestion or technical issues route your traffic to a less congested part of the network. Other times, Internet Service Providers may not peer with Cloudflare or engage in traffic engineering to optimize their networks how they see fit, which could result in user traffic connecting to a location that doesn’t quite map to their locale or language.
Regardless of the cause, the impact is that your search results become less relevant, if not outright confusing. For example, in somewhere dense with country borders, like Europe, your traffic in Berlin could get routed to Amsterdam because your mobile operator chooses to not peer in-country, giving you results in Dutch instead of German. This can also be disruptive if you’re trying to stream content subject to licensing restrictions, such as a person in the UK trying to watch BBC iPlayer or a person in Brazil trying to watch the World Cup.
So we fixed this. We just rolled out a major update to the service that powers WARP that will give you a geographically accurate browsing experience without revealing your IP address to the websites you’re visiting. Instead, websites you visit will see a Cloudflare IP address instead, making it harder for them to track you directly.
    
      How it works
      
        
      
    
    Traditionally, consumer VPNs deliberately route your traffic through a server in another country, making your connection slow, and often getting blocked because of their ability to flout location-based content restrictions. We took a different approach when we first launched WARP in 2018, giving you the best possible performance by routing your traffic through the Cloudflare data center closest to you. However, because not every Internet Service Provider (ISP) peers with Cloudflare, users sometimes end up exiting the Cloudflare network from a more “random” data center – one that does not accurately represent their locale.
Websites and third party services often infer geolocation from your IP address, and now, 1.1.1.1 + WARP replaces your original IP address with one that consistently and accurately represents your approximate location.
Here’s how we did it:
We ran an analysis on a subset of our network traffic to find a rough approximation of how many users we have per city.
We divided that amongst our egress IPs, using an anycast architecture to be efficient with the number of additional IPs we had to allocate and advertise per metro area.
We then submitted geolocation information of those IPs to various geolocation database providers, ensuring third party services associate those Cloudflare egress IPs with an accurate approximate location.
It was important to us to provide the benefits of this location accuracy without compromising user privacy, so the app doesn’t ask for specific location permissions or log your IP address.
            
            
            
            
            
    
      An even bigger network for WARP+ users
      
        
      
    
    
            
            
            
            
            
We also recently announced that we’ve expanded our network to over 275 cities in over 100 countries. This gave us an opportunity to revisit where we offered WARP, and how we could expand the number of locations users can connect to WARP with (in other words: an opportunity to make things faster).
From today, all WARP+ subscribers will benefit from a larger network with 20+ new cities: with no change in subscription pricing. A closer Cloudflare data center means less latency between your device and Cloudflare, which directly improves your download speed, thanks to what’s called the Bandwidth-Delay Product (put simply: lower latency, higher throughput!).
As a result, sites load faster, both for those on the Cloudflare network and those that aren’t. As we continue to expand our network, we'll be revising this on a regular basis to ensure that all WARP and WARP+ subscribers continue to get great performance.
    
      Speed, privacy, and relevance
      
        
      
    
    Beyond being able to find pizza on a Saturday night, we believe everyone should be able to browse the Internet freely – and not have to sacrifice the speed, privacy, or relevance of their search results in order to do so.
In the near future, we’ll be investing in features to bring even more of the benefits of Cloudflare infrastructure to every 1.1.1.1 + WARP user. Stay tuned! 


How Cloudflare verifies the code WhatsApp Web serves to users
Matt Silverlock — Thu, 10 Mar 2022 18:30:01 GMT
 
How do you know the code your web browser downloads when visiting a website is the code the website intended you to run? In contrast to a mobile app downloaded from a trusted app store, the web doesn’t provide the same degree of assurance that the code hasn’t been tampered with. Today, we’re excited to be partnering with WhatsApp to provide a system that assures users that the code run when they visit WhatsApp on the web is the code that WhatsApp intended.
With WhatsApp usage in the browser growing, and the increasing number of at-risk users — including journalists, activists, and human rights defenders — WhatsApp wanted to take steps to provide assurances to browser-based users. They approached us to help dramatically raise the bar for third-parties looking to compromise or otherwise tamper with the code responsible for end-to-end encryption of messages between WhatsApp users.
So how will this work? Cloudflare holds a hash of the code that WhatsApp users should be running. When users run WhatsApp in their browser, the WhatsApp Code Verify extension compares a hash of that code that is executing in their browser with the hash that Cloudflare has — enabling them to easily see whether the code that is executing is the code that should be.
The idea itself — comparing hashes to detect tampering or even corrupted files — isn’t new, but automating it, deploying it at scale, and making sure it “just works” for WhatsApp users is. Given the reach of WhatsApp and the implicit trust put into Cloudflare, we want to provide more detail on how this system actually works from a technical perspective.
Before we dive in, there's one important thing to explicitly note: Cloudflare is providing a trusted audit endpoint to support Code Verify. Messages, chats or other traffic between WhatsApp users are never sent to Cloudflare; those stay private and end-to-end encrypted. Messages or media do not traverse Cloudflare’s network as part of this system, an important property from Cloudflare’s perspective in our role as a trusted third party.
    
      Making verification easier
      
        
      
    
    Hark back to 2003: Fedora, a popular Linux distribution based on Red Hat, has just been launched. You’re keen to download it, but want to make sure you have the “real” Fedora, and that the download isn’t a “fake” version that siphons off your passwords or logs your keystrokes. You head to the download page, kick off the download, and see an MD5 hash (considered secure at the time) next to the download. After the download is complete, you run md5 fedora-download.iso and compare the hash output to the hash on the page. They match, life is good, and you proceed to installing Fedora onto your machine.
But hold on a second: if the same website providing the download is also providing the hash, couldn’t a malicious actor replace both the download and the hash with their own values? The md5 check we ran above would still pass, but there’s no guarantee that we have the “real” (untampered) version of the software we intended to download.
            
            
            
            
            
Hosting the hash on the same server as the software is still common in 2022.
There are other approaches that attempt to improve upon this — providing signed signatures that users can verify were signed with “well known” public keys hosted elsewhere. Hosting those signatures (or “hashes”) with a trusted third party dramatically raises the bar when it comes to tampering, but now we require the user to know who to trust, and require them to learn tools like GnuPG. That doesn’t help us trust and verify software at the scale of the modern Internet.
This is where the Code Verify extension and Cloudflare come in. The Code Verify extension, published by Meta Open Source, automates this: locally computing the cryptographic hash of the libraries used by WhatsApp Web and comparing that hash to one from a trusted third-party source (Cloudflare, in this case).
We’ve illustrated this to make how it works a little clearer, showing how each of the three parties — the user, WhatsApp and Cloudflare — interact with each other.
            
            
            
            
            
Broken down, there are four major steps to verifying the code hasn’t been tampered with:
WhatsApp publishes the latest version of their JavaScript libraries to their servers, and the corresponding hash for that version to Cloudflare’s audit endpoint.
A WhatsApp web client fetches the latest libraries from WhatsApp.
The Code Verify browser extension subsequently fetches the hash for that version from Cloudflare over a separate, secure connection.
Code Verify compares the “known good” hash from Cloudflare with the hash of the libraries it locally computed.
If the hashes match, as they should under almost any circumstance, the code is “verified” from the perspective of the extension. If the hashes don’t match, it indicates that the code running on the user's browser is different from the code WhatsApp intended to run on all its user's browsers.
    
      Security needs to be convenient
      
        
      
    
    It’s this process — and the fact that is automated on behalf of the user — that helps provide transparency in a scalable way. If users had to manually fetch, compute and compare the hashes themselves, detecting tampering would only be for the small fraction of technical users. For a service as large as WhatsApp, that wouldn’t have been a particularly accessible or user-friendly approach.
This approach also has parallels to a number of technologies in use today. One of them is Subresource Integrity in web browsers: when you fetch a third-party asset (such as a script or stylesheet), the browser validates that the returned asset matches the hash described. If it doesn’t, it refuses to load that asset, preventing potentially compromised scripts from siphoning off user data. Another is Certificate Transparency and the related Binary Transparency projects. Both of these provide publicly auditable transparency for critical assets, including WebPKI certificates and other binary blobs. The system described in this post doesn’t scale to arbitrary assets – yet – but we are exploring ways in which we could extend this offering for something more general and usable like Binary Transparency.
Our collaboration with the team at WhatsApp is just the beginning of the work we’re doing to help improve privacy and security on the web.  We’re aiming to help other organizations verify the code delivered to users is the code they’re meant to be running. Protecting Internet users at scale and enabling privacy are core tenets of what we do at Cloudflare, and we look forward to continuing this work throughout 2022.

The Cloudflare Blog

Unweight: how we compressed an LLM 22% without sacrificing quality

Why compression is harder than it sounds

How model weights can be compressed effectively

The exponent is surprisingly predictable

The GPU memory bottleneck

Four ways to use compressed weights

Why no single pipeline wins

How the reconstructive matmul works

Sharing the GPU between decoding and computation

Pipelining across layers

Autotuning

One compression format, multiple uses

Our results

Why this matters

What’s next

Bringing more transparency to post-quantum usage, encrypted messaging, and routing security

Measuring origin post-quantum support

Securing E2EE messaging systems with Key Transparency

Tracking RPKI ASPA adoption

As security evolves, so does our data

Internet measurement, resilience, and transparency: blog takeover from Cloudflare Research and friends

An ode to Internet measurement

A better Internet is a more resilient Internet

A transparent look into Cloudflare’s network

Cryptographic protocols for an agentic web

Get your reading glasses on

How we built the most efficient inference engine for Cloudflare’s network

The Architectural Challenge of LLM Inference at Cloudflare

How Infire works under the hood

Platform startup

The HTTP server

The batcher

CUDA kernels

How Infire performs in the wild

What’s next

How Cloudflare runs more AI models on fewer GPUs: A technical deep-dive

Omni’s architecture – at a glance

Elastic scaling by spawning multiple models from a single control plane

Implementing lightweight process and Python isolation

Over-committing memory to run more models on a single GPU

How Omni runs multiple models for Workers AI

What’s next

Message Signatures are now part of our Verified Bots Program, simplifying bot authentication

Become a Verified Bot with Message Signatures

Getting started

Register your bot

Message Signature verification for origins

Where we’re going next

A better, more trusted Internet

Orange Me2eets: We made an end-to-end encrypted video calling app and it was easy

End-to-end encryption for video conferencing is different than for text messaging

Messaging Layer Security (MLS)

Orange Meets: the basics

End-to-end encrypting Orange Meets

Handling different codec behaviors

“Join my Orange Meet”

Verifying the Designated Committer Algorithm with TLA+

Preventing Monster-in-the-Middle attacks

Future work

Conclusion

Forget IPs: using cryptography to verify bot and agent traffic

Existing bot verification mechanisms are broken

Introducing HTTP Message Signatures

Scoping standards to bot authentication

Generating request signatures

Validating request signatures

Experimentation with request mTLS

A TLS flag for bot authentication

Request mTLS in action

Two approaches, one goal

The bigger picture

What’s new in Cloudflare: MASQUE now powers 1.1.1.1 & WARP apps, DEX now generally available with Remote Captures

Faster and more stable: our 1.1.1.1 & WARP apps now use MASQUE by default

Why MASQUE?

Get started now

DEX now Generally Available: Announcing detailed device visibility with DEX Remote Captures

Introducing Remote Captures: PCAP and WARP Diag

Why Remote Captures?

How Remote Captures work

Verifying the Designated Committer Algorithm with TLA⁺