The Cloudflare Blog

Making Rust Workers reliable: panic and abort recovery in wasm‑bindgen

Guy Bedford — Wed, 22 Apr 2026 13:00:00 GMT

Rust Workers run on the Cloudflare Workers platform by compiling Rust to WebAssembly, but as we’ve found, WebAssembly has some sharp edges. When things go wrong with a panic or an unexpected abort, the runtime can be left in an undefined state. For users of Rust Workers, panics were historically fatal, poisoning the instance and possibly even bricking the Worker for a period of time.

While we were able to detect and mitigate these issues, there remained a small chance that a Rust Worker would unexpectedly fail and cause other requests to fail along with it. An unhandled Rust abort in a Worker affecting one request might escalate into a broader failure affecting sibling requests or even continue to affect new incoming requests. The root cause of this was in wasm-bindgen, the core project that generates the Rust-to-JavaScript bindings Rust Workers depend on, and its lack of built-in recovery semantics.

In this post, we’ll share how the latest version of Rust Workers handles comprehensive Wasm error recovery that solves this abort-induced sandbox poisoning. This work has been contributed back into wasm-bindgen as part of our collaboration within the wasm-bindgen organization formed last year. First with panic=unwind support, which ensures that a single failed request never poisons other requests, and then with abort recovery mechanisms that guarantee Rust code on Wasm can never re-execute after an abort.

Initial recovery mitigations

Our initial attempts to address reliability in this area focused on understanding and containing failures caused by Rust panics and aborts in production Rust Workers. We introduced a custom Rust panic handler that tracked failure state within a Worker and triggered full application reinitialization before handling subsequent requests. On the JavaScript side, this required wrapping the Rust-JavaScript call boundary using Proxy‑based indirection to ensure that all entrypoints were consistently encapsulated. We also made targeted modifications to the generated bindings to correctly reinitialize the WebAssembly module after a failure.

While this approach relied on custom JavaScript logic, it demonstrated that reliable recovery was achievable and eliminated the persistent failure modes we were seeing in practice. This solution was shipped by default to all workers‑rs users starting in version 0.6, and it laid the groundwork for the more general, upstreamed abort recovery mechanisms described in the sections that follow.

Implementing `panic=unwind` with WebAssembly Exception Handling

The abort recovery mechanisms described above ensure that a Worker can survive a failure, but they do so by reinitializing the entire application. For stateless request handlers, this is fine. But for workloads that hold meaningful state in memory, such as Durable Objects, reinitialization means losing that state entirely. A single panic in one request could wipe the in-memory state being used by other concurrent requests.

In most native Rust environments, panics can be unwound, allowing destructors to run and the program to recover without losing state. In WebAssembly, things historically looked very different. Rust compiled to Wasm via wasm32-unknown-unknown defaults to panic=abort, so a panic inside a Rust Worker would abruptly trap with an unreachable instruction and exit Wasm back to JS with a WebAssembly.RuntimeError.

To recover from panics without discarding instance state, we needed panic=unwind support for wasm32-unknown-unknown in wasm-bindgen, made possible by the WebAssembly Exception Handling proposal, which gained wide engine support in 2023.

We start by compiling with RUSTFLAGS='-Cpanic=unwind' cargo build -Zbuild-std, which rebuilds the standard library with unwind support and generates code with proper panic unwinding. For example:

struct HasDropA;
struct HasDropB;
extern "C" {
    fn imported_func();
}

fn some_func() {
    let a = HasDropA;
    let b = HasDropB;
    imported_func();
}

compiles to WebAssembly as:

try
  call 
catch_all
  call 
  call 
  rethrow
end
call 
call

This ensures that even if imported_func() panics, destructors still run. Similarly, std::panic::catch_unwind(|| some_func()) compiles into:

try
  call 
  ;; set result to Ok(return value)
catch
  try
    call 
    ;; set result to Err(panic payload)
  catch_all
    call 
    unreachable
  end
end

Getting this to work end-to-end required several changes to the wasm-bindgen toolchain. The WebAssembly parser Walrus did not know how to handle try/catch instructions, so we added support for them. The descriptor interpreter also needed to be taught how to evaluate code containing exception handling blocks. At that point, the full application could be built with panic=unwind.

The final step was modifying the exports generated by wasm-bindgen to catch panics at the Rust-JavaScript boundary and surface them as JavaScript PanicError exceptions. One subtlety: Rust will catch foreign exceptions and abort when unwinding through extern "C" functions, so exports needed to be marked extern "C-unwind" to explicitly allow unwinding across the boundary. For futures, a panic rejects the JavaScript Promise with a PanicError.

Closures required special attention to ensure unwind safety was properly checked, via a new MaybeUnwindSafe trait that checks UnwindSafe only when built with panic=unwind. This quickly exposed a problem, though: many closures capture references that remain after an unwind, making them inherently unwind-unsafe. To avoid a situation where users are encouraged to incorrectly wrap closures in AssertUnwindSafe just to satisfy the compiler, we added Closure::new_aborting variants, which terminate on panic instead of unwinding in cases where unwind safety can't be guaranteed.

With panic unwinding enabled:

Panics in exported Rust functions are caught by wasm-bindgen
Panics surface to JavaScript as PanicError exceptions
Async exports reject their returned promises with a PanicError
Rust destructors run correctly
The WebAssembly instance remains valid and reusable

The full details of the approach and how to use it in wasm-bindgen are covered in the latest guide page for Wasm Bindgen: Catching Panics.

Abort recovery

Even with panic=unwind support, aborts still happen - out-of-memory errors being one common cause. Because aborts can’t unwind, there is no possibility of state recovery at all, but we can at least detect and recover from aborts for future operations to avoid invalid state erroring subsequent requests.

Panic unwind support introduced a new problem for abort recovery. When we receive an error from Wasm we don’t know if it came from an extern “C-unwind” foreign error, or if it was a genuine abort. Aborts can take many shapes in WebAssembly.

We had two options to solve this technically: either mark all errors which are definitely aborts, or mark all errors which are definitely unwinds. Either could have worked but we chose the latter. Since our foreign exception handling was directly using raw WAT-level (WebAssembly text format) Exception Handling instructions already, we found it easier to implement exception tags for foreign exceptions to distinguish them from aborting non-unwind-safe exceptions.

With the ability to clearly distinguish between recoverable and non-recoverable errors thanks to this Exception.Tag feature in WebAssembly Exception Handling, we were able to then integrate both a new abort handler as well as abort reentrancy guards. A new abort hook, set_on_abort, can be used at initialization time to attach a handler that recovers accordingly for the platform embedding’s needs.

Hardening panic and abort handling is critical to avoiding invalid execution state. WebAssembly allows deeply interleaved call stacks, where Wasm can call into JavaScript and JavaScript can re-enter Wasm at arbitrary depths, while alongside this, multiple tasks can be functioning in the same instance. Previously, an abort occurring in one task or nested stack was not guaranteed to invalidate higher stacks through JS, leading to undefined behavior. Care was required to ensure we can guarantee the execution model, and contribution in this space remains ongoing.

While aborts are never ideal, and reinitialization on failure is an absolute worst-case scenario, implementing critical error recovery as the last line of defense ensures execution correctness and that future operations will be able to succeed. The invalid state does not persist, ensuring a single failure does not cascade into multiple failures.

Extension: abort reinitialization for wasm-bindgen libraries

While we were working on this, we realized that this is a common problem for libraries used by JS that are built with wasm-bindgen, and that they would also benefit from attaching an abort handler to be able to perform recovery.

But when building Wasm as an ES module and importing it directly (e.g. via import { func } from ‘wasm-dep’), it’s not clear what the recovery mechanism would be for a Wasm abort while calling func() for an already-linked and initialized library that is in a user JS application.

While not strictly a Rust Workers use case, our team also supports JS-based Workers users who run Rust-backed Wasm library dependencies. If we could fix this problem at the same time, that could indirectly also benefit Wasm usage on the Cloudflare Workers platform.

To support automatic abort recovery for Wasm library use cases, we added support for an experimental reinitialization mechanism into wasm‑bindgen, --reset-state-function. This exposes a function that allows the Rust application to effectively request that it reset its internal Wasm instance back to its initial state for the next call, without requiring consumers of the generated bindings to reimport or recreate them. Class instances from the old instance will throw as their handles become orphaned, but new classes can then be constructed. The JS application using a Wasm library is errored but not bricked.

The full technical details of this feature and how to use it in wasm-bindgen are covered in the new wasm-bindgen guide section Wasm Bindgen: Handling Aborts.

Maturing the Rust Wasm Exception Handling ecosystem

Upstream contributions for this work did not stop at the wasm-bindgen project. Building for Wasm with panic=unwind still requires an experimental nightly Rust target, so we’ve also been working to advance Rust’s Wasm support for WebAssembly Exception Handling to help bring this to stable Rust.

During the development of WebAssembly Exception Handling, a late‑stage specification change resulted in two variants: legacy exception handling and the final modern exception handling "with exnref". Today, Rust’s WebAssembly targets still default to emitting code for the legacy variant. While legacy exception handling is widely supported, it is now deprecated.

Modern WebAssembly Exception Handling is supported as of the following JS platform releases:

Runtime	Version	Release Date
v8	13.8.1	April 28, 2025
workerd	v1.20250620.0	June 19, 2025
Chrome	138	June 28, 2025
Firefox	131	October 1, 2024
Safari	18.4	March 31, 2025
Node.js	25.0.0	October 15, 2025

As we were investigating the support matrix, the largest concern ended up being the Node.js 24 LTS release schedule, which would have left the entire ecosystem stuck on legacy WebAssembly Exception Handling until April 2028.

Having discovered this discrepancy, we were able to backport modern exception handling to the Node.js 24 release, and even backport the fixes needed to make it work on the Node.js 22 release line to ensure support for this target. This should allow the modern Exception Handling proposal to become the default target next year.

Over the coming months, we’ll be working to make the transition to stable panic=unwind and modern Exception Handling as invisible as possible to end users.

While these long‑term investments in the ecosystem take time, they help build a stronger foundation for the Rust WebAssembly community as a whole, and we’re glad to be able to contribute to these improvements.

Using panic unwind in Rust Workers

As of version 0.8.0 of Rust Workers, we have a new --panic-unwind flag, which can be added to the build command, following the instructions here.

With this flag, panics can be fully recovered, and abort recovery will use the new abort classification and recovery hook mechanism. We highly recommend upgrading and trying it out for a more stable Rust Workers experience, and plan to make panic=unwind the default in a subsequent release. Users remaining on panic=abort will still continue to take advantage of the previous custom recovery wrapper handling from 0.6.0.

Committing to Rust Workers stability

This work is part of our ongoing effort towards a stable release for Rust Workers. By solving these sharp edges of the Wasm platform foundations at their root, and contributing back to the ecosystem where it makes sense, we build stronger foundations not just for our platform, but the entire Rust, JS, and Wasm ecosystem.

We have a number of future improvements planned for Rust Workers, and we’ll soon be sharing updates on this additional work, including wasm-bindgen generics and automated bindgen, which Guy Bedford from our team previewed in a talk on Rust & JS Interoperability at Wasm.io last month.

Find us in #rust‑on‑workers on the Cloudflare Discord. We also welcome feedback and discussion and especially all new contributors to the workers-rs and wasm-bindgen GitHub projects.

Inside Gen 13: how we built our most powerful server yet

Syona Sarma — Mon, 23 Mar 2026 13:00:00 GMT

A few months ago, Cloudflare announced the transition to FL2, our Rust-based rewrite of Cloudflare's core request handling layer. This transition accelerates our ability to help build a better Internet for everyone. With the migration in the software stack, Cloudflare has refreshed our server hardware design with improved hardware capabilities and better efficiency to serve the evolving demands of our network and software stack. Gen 13 is designed with 192-core AMD EPYC™ Turin 9965 processor, 768 GB of DDR5-6400 memory, 24 TB of PCIe 5.0 NVMe storage, and dual 100 GbE port network interface card.

Gen 13 delivers:

Up to 2x throughput compared to Gen 12 while staying within latency SLA
Up to 50% improvement in performance / watt efficiency, reducing data center expansion costs
Up to 60% higher throughput per rack keeping rack power budget constant
2x memory capacity, 1.5x storage capacity, 4x network bandwidth
Introduced PCIe encryption hardware support in addition to memory encryption
Improved support for thermally demanding powerful drop-in PCIe accelerators

This blog post covers the engineering rationale behind each major component selection: what we evaluated, what we chose, and why.

Generation	Gen 13 Compute	Previous Gen 12 Compute
Form Factor	2U1N, Single socket	2U1N, Single socket
Processor	AMD EPYC™ 9965 Turin 192-Core Processor	AMD EPYC™ 9684X Genoa-X 96-Core Processor
Memory	768GB of DDR5-6400 x12 memory channel	384GB of DDR5-4800 x12 memory channel
Storage	x3 E1.S NVMe Samsung PM9D3a 7.68TB / Micron 7600 Pro 7.68TB	x2 E1.S NVMe Samsung PM9A3 7.68TB / Micron 7450 Pro 7.68TB
Network	Dual 100 GbE OCP 3.0 Intel Ethernet Network Adapter E830-CDA2 / NVIDIA Mellanox ConnectX-6 Dx	Dual 25 GbE OCP 3.0 Intel Ethernet Network Adapter E810-XXVDA2 / NVIDIA Mellanox ConnectX-6 Lx
System Management	DC-SCM 2.0 ASPEED AST2600 (BMC) + AST1060 (HRoT)	DC-SCM 2.0 ASPEED AST2600 (BMC) + AST1060 (HRoT)
Power Supply	1300W, Titanium Grade	800W, Titanium Grade

^{Figure: Gen 13 server}

CPU

Gen 12	AMD EPYC™ 9684X Genoa-X 96-Core (400W TDP, 1152 MB L3 Cache)
Gen 13	AMD EPYC™ 9965 Turin Dense 192-Core (500W TDP, 384 MB L3 Cache)

During the design phase, we evaluated several 5th generation AMD EPYC™ Processors, code-named Turin, in Cloudflare’s hardware lab: AMD Turin 9755, AMD Turin 9845, and AMD Turin 9965. The table below summarizes the differences in specifications of the candidates for Gen 13 servers against the AMD Genoa-X 9684X used in our Gen 12 servers. Notably, all three candidates offer increases in core count but with smaller L3 cache per core. However, with the migration to FL2, the new workloads are less dependent on L3 cache and scale up well with the increased core count to achieve up to 100% increase in throughput.

The three CPU candidates are designed to target different use cases: AMD Turin 9755 offers superior per-core performance, AMD Turin 9965 trades per-core performance for efficiency, and AMD Turin 9845 trades core count for lower socket power. We evaluated three CPUs in the production environment.

CPU Model	AMD Genoa-X 9684X	AMD Turin 9755	AMD Turin 9845	AMD Turin 9965
For server platform	Gen 12	Gen 13 candidate	Gen 13 candidate	Gen 13 candidate
# of CPU Cores	96	128	160	192
# of Threads	192	256	320	384
Base Clock	2.4 GHz	2.7 GHz	2.1 GHz	2.25 GHz
Max Boost Clock	3.7 GHz	4.1 GHz	3.7 GHz	3.7 GHz
All Core Boost Clock	3.42 GHz	4.1 GHz	3.25 GHz	3.35 GHz
Total L3 Cache	1152 MB	512 MB	320 MB	384 MB
L3 cache per core	12 MB / core	4 MB / core	2 MB / core	2 MB / core
Maximum configurable TDP	400W	500W	390W	500W

Why AMD Turin 9965?

First, FL2 ended the L3 cache crunch.

L3 cache is the large, last-level cache shared among all CPU cores on the same compute die to store frequently used data. It bridges the gap between slow main memory external to the CPU, and the fast but smaller L1 and L2 cache on the CPU, reducing the latency for the CPU to access data.

Some may notice that the 9965 has only 2 MB of L3 cache per core, an 83.3% reduction from the 12 MB per core on Gen 12’s Genoa-X 9684X. Why trade away the very cache advantage that gave Gen 12 its edge? The answer lies in how our workloads have evolved.

Cloudflare has migrated from FL1 to FL2, a complete rewrite of our request handling layer in Rust. With the new software stack, Cloudflare’s request processing pipeline has become significantly less dependent on large L3 cache. FL2 workloads scale nearly linearly with core count, and the 9965’s 192 cores provide a 2x increase in hardware threads over Gen 12.

Second, performance per total cost of ownership (TCO). During production evaluation, the 9965’s 192 cores delivered the highest aggregate requests per second of the three candidates, and its performance-per-watt scaled favorably at 500W TDP, yielding superior rack-level TCO.

	Gen 12	Gen 13
Processor	AMD EPYC™ 4th Gen Genoa-X 9684X	AMD EPYC™ 5th Gen Turin 9965
Core count	96C/192T	192C/384T
FL throughput	Baseline	Up to +100%
Performance per watt	Baseline	Up to +50%

Third, operational simplicity. Our operational teams have a strong preference for fewer, higher-density servers. Managing a fleet of 192-core machines means fewer nodes to provision, patch, and monitor per unit of compute delivered. This directly reduces operational overhead across our global network.

Finally, they are forward compatible. The AMD processor architecture supports DDR5-6400, PCIe Gen 5.0, CXL 2.0 Type 3 memory across all SKUs. AMD Turin 9965 has the highest number of high-performing cores per socket in the industry, maximizing the compute density per socket, maintaining competitiveness and relevance of the platform for years to come. By moving to AMD Turin 9965 from AMD Genoa-X 9684X, we get longer security support from AMD, extending the useful life of the Gen 13 server before they become obsolete and need to be refreshed.

Memory

Gen 12	12x 32GB DDR5-4800 2Rx8 (384 GB total, 4 GB/core)
Gen 13	12x 64GB DDR5-6400 2Rx4 (768 GB total, 4 GB/core)

Because the AMD Turin processor has twice the core count of the previous generation, it demands more memory resources, both in capacity and in bandwidth, to deliver throughput gains.

Maximizing bandwidth with 12 channels

The chosen AMD EPYC™ 9965 CPU supports twelve memory channels, and for Gen 13, we are populating every single one of them. We’ve selected 64 GB DDR5-6400 ECC RDIMMs in a “one DIMM per channel” (1DPC) configuration.

This setup provides 614 GB/s of peak memory bandwidth per socket, a 33.3% increase compared to our Gen 12 server platform. By utilizing all 12 channels, we ensure that the CPU is never “starved” for data, even during the most memory-intensive parallel workloads.

Populating all twelve channels in a balanced configuration — equal capacity per channel, with no mixed configurations — is common best practice. This matters operationally: AMD Turin processors interleave across all memory channels with the same DIMM type, same memory capacity and same rank configuration. Interleaving increases memory bandwidth by spreading contiguous memory access across all memory channels in the interleave set instead of sending all memory access to a single or a small subset of memory channels.

The 4 GB per core “sweet spot”

Our Gen 12 servers are configured with 4GB per core. We revisited that decision as we designed Gen 13.

Cloudflare launches a lot of new products and services every month, and each new product or service demands an incremental amount of memory capacity. These accumulate over time and could become an issue of memory pressure, if memory capacity is not sized appropriately.

Initial requirement considered a memory-to-core ratio between 4 GB and 6 GB per core. With 192 cores on the AMD Turin 9965, that translates to a range of 768 GB to 1152 GB. Note that at higher capacities, DIMM module capacity granularity are typically 16GB increments. With 12 channels in a 1DPC configuration, our options are 12x 48GB (576 GB), 12x 64GB (768 GB), or 12x 96GB (1152 GB).

12x 48GB = 576 GB, or 1.5 GB/thread. The memory capacity of this configuration is too low; this would starve memory-hungry workloads and violate the lower bound.
12x 96GB = 1152 GB, or 3.0 GB/thread. This would be a 50% capacity increase per core and would also result in higher power consumption and a substantial increase in cost, especially in the current market conditions where memory prices are 10x of what they were a year ago.
12x 64GB = 768 GB, or 2.0 GB/thread (4 GB/core). This configuration is consistent with our Gen 12 memory to core ratio, and represents a 2x increase in memory capacity per server. Keeping the memory capacity configuration at 4 GB per core provides sufficient capacity for workloads that scale with core count, like our primary workload, FL, and provide sufficient memory capacity headroom for future growth without overprovisioning.

FL2 uses memory more efficiently than FL1 did: our internal measures show FL2 uses less than half the CPU of FL1, and far less than half the memory. The capacity freed up by the software stack migration provides ample headroom to support Cloudflare growth for the next few years.

The decision: 12x 64GB for 768 GB total. This maintains the proven 4 GB/core ratio, provides a 2x total capacity increase over Gen 12, and stays within the DIMM cost curve sweet spot.

Efficiency through dual rank

In Gen 12, we demonstrated that dual-rank DIMMs provide measurably higher memory throughput than single-rank modules, with advantages of up to 17.8% at a 1:1 read-write ratio. Dual-rank DIMMs are faster because they allow the memory controller to access one rank while another is refreshing. That same principle carries forward here.

Our requirement also calls for approximately 1 GB/s of memory bandwidth per hardware thread. With 614 GB/s of peak bandwidth across 384 threads, we deliver 1.6 GB/s per thread, comfortably exceeding the minimum. Production analysis has shown that Cloudflare workloads are not memory-bandwidth-bound, so we bank the headroom as margin for future workload growth.

By opting for 2Rx4 DDR5 RDIMMs at maximum supported 6400MT/s, we ensure we get the lowest latency and best performance from our Gen 13 platform memory configuration.

Storage

Gen 12

x2 E1.S NVMe PCIe 4.0, 16 TB total

Samsung PM9A3 7.68TB

Micron 7450 Pro 7.68TB

Gen 13

x3 E1.S NVMe PCIe 5.0, 24 TB total

Samsung PM9D3a 7.68TB

Micron 7600 Pro 7.68TB

+10x U.2 NVMe PCIe 5.0 option

Our storage architecture underwent a transformation in Gen 12 when we pivoted from M.2 to EDSFF E1.S. For Gen 13, we are increasing the storage capacity and the bandwidth to align with the latest technology. We have also added a front drive bay for flexibility to add up to 10x U.2 drives to keep pace with Cloudflare storage product growth.

The move to PCIe 5.0

Gen 13 is configured with PCIe Gen 5.0 NVMe drives. While Gen 4.0 served us well, the move to Gen 5.0 ensures that our storage subsystem can serve data at improved latency, and keep up with increased storage bandwidth demand from the new processor.

16 TB to 24 TB

Beyond the speed increase, we are physically expanding the array from two to three NVMe drives. Our Gen 12 server platform was designed with four E1.S storage drive slots, but only two slots were populated with 8TB drives. The Gen 13 server platform uses the same design with four E1.S storage drive slots available, but with three slots populated with 8TB drives. Why add a third drive? This increases our storage capacity per server from 16TB to 24TB, ensuring we are expanding our global storage capacity to maintain and improve CDN cache performance. This supports growth projections for Durable Objects, Containers, and Quicksilver services, too.

Front drive bay to support additional drives

For Gen 13, the chassis is designed with a front drive bay that can support up to ten U.2 PCIe Gen 5.0 NVMe drives. The front drive bay provides the option for Cloudflare to use the same chassis across compute and storage platforms, as well as the flexibility to convert a compute SKU to a storage SKU when needed.

Endurance and reliability

We designed our servers to have a 5-year operational life and require storage drives endurance to sustain 1 DWPD (Drive Writes Per Day) over the full server lifespan.

Both the Samsung PM9D3a and Micron 7600 Pro meet the 1 DWPD specification with a hardware over-provisioning (OP) of approximately 7%. If future workload profiles demand higher endurance, we have the option to hold back additional user capacity to increase effective OP.

NVMe 2.0 and OCP NVMe 2.0 compliance

Both the Samsung PM9D3a and Micron 7600 adopt the NVMe 2.0 specification (up from NVMe 1.4) and the OCP NVMe Cloud SSD Specification 2.0. Key improvements include Zoned Namespaces (ZNS) for better write amplification management, Simple Copy Command for intra-device data movement without crossing the PCIe bus, and enhanced Command and Feature Lockdown for tighter security controls. The OCP 2.0 spec also adds deeper telemetry and debug capabilities purpose-built for datacenter operations, which aligns with our emphasis on fleet-wide manageability.

Thermal efficiency

The storage drives will continue to be in the E1.S 15mm form factor. Its high-surface-area design is essential for cooling these new Gen 5.0 controllers, which can pull upwards of 25W under sustained heavy I/O. The 2U chassis provides ample airflow over the E1.S drives as well as U.2 drive bays, a design advantage we validated in Gen 12 when we made the decision to move from 1U to 2U.

Network

Gen 12

Dual 25 GbE port OCP 3.0 NIC

Intel E810-XXVDA2

NVIDIA Mellanox ConnectX-6 Lx

Gen 13

Dual 100 GbE port OCP 3.0 NIC

Intel E830-CDA2

NVIDIA Mellanox ConnectX-6 Dx

For more than eight years, dual 25 GbE was the backbone of our fleet. Since 2018 it has served us well, but as the CPU has improved to serve more requests and our products scale, we’ve officially hit the wall. For Gen 13, we are quadrupling our per-port bandwidth.

Why 100 GbE and why now?

Network Interface Card (NIC) bandwidth must keep pace with compute performance growth. With 192 modern cores, our 25 GbE links will become a measurable bottleneck. Production data from our co-locations worldwide over a week showed that, on our Gen 12, P95 bandwidth per port is consistently >50% of available bandwidth. Since throughput is doubling per server on Gen 13, we are at risk of saturating the NIC bandwidth.

^{Figure: on Gen 12, P95 bandwidth per port is consistently >50% of available bandwidth}

The decision to go to 100 GbE rather than 50 GbE was driven by industry economics: 50 GbE transceiver volumes remain low in the industry, making them a poor supply chain bet. Dual 100 GbE ports also give us 200 Gb/s of aggregate bandwidth per server, future-proofing against the next several years of traffic growth.

Hardware choices and compatibility

We are maintaining our dual-vendor strategy to ensure supply chain resilience, a lesson hard-learned during the pandemic when single-sourcing the Gen 11 NIC left us scrambling.

Both NICs are compliant with OCP 3.0 SFF/TSFF form factor with the integrated pull tab, maintaining chassis commonality with Gen 12 and ensuring field technicians need no new tools or training for swaps.

PCIe Allocation

The OCP 3.0 NIC slot is allocated PCIe 4.0 x16 lanes on the motherboard, providing 256 Gb/s of bidirectional bandwidth, more than enough for dual 100 GbE (200 Gb/s aggregate) with room to spare.

Management

Gen 12

Project Argus Data Center Secure Control Module 2.0

Gen 13

Project Argus Data Center Secure Control Module 2.0

PCIe encryption

We are maintaining the architectural shift, introduced in Gen 12, of separating management and security-related components from the motherboard onto the Project Argus Data Center Secure Control Module 2.0.

^{Figure: Project Argus DC-SCM 2.0}

Continuity with DC-SCM 2.0

We are carrying forward the Data Center Secure Control Module 2.0 (DC-SCM 2.0) standard. By decoupling management and security functions from the motherboard, we ensure that the “brains” of the server’s security stay modular and protected.

The DC-SCM module houses our most critical components:

Basic Input/Output System (BIOS)
Baseboard Management Controller (BMC)
Hardware Root of Trust (HRoT) and TPM (Infineon SLB 9672)
Dual BMC/BIOS flash chips for redundancy

Why we are staying the course with DC-SCM 2.0

The decision to keep this architecture for Gen 13 is driven by the proven security gains we saw in the previous generation. By offloading these functions to a dedicated module, we maintain:

Rapid recovery: Dual image redundancy allows for near-instant restoration of BIOS/UEFI and BMC firmware if an accidental corruption or a malicious update is detected.
Physical resilience: The Gen 13 chassis also moves the intrusion detection mechanism further from the flat edge of the chassis, making physical intercept harder.
PCIe encryption: In addition to TSME (Transparent Secure Memory Encryption) for CPU-to-memory encryption that was already enabled since our Gen 10 platforms, AMD Turin 9965 processor for Gen 13 extends encryption to PCIe traffic, this ensures data is protected in transit across every bus in the system.
Operational consistency: Sticking with the Gen 12 management stack means our security audits, deployment, provisioning, and operational standard procedure remain fully compatible.

Power

Gen 12	800W 80 PLUS Titanium CRPS
Gen 13	1300W 80 PLUS Titanium CRPS

As we upgrade the compute and networking capability of the server, the power envelope of our servers has naturally expanded. Gen 13 are equipped with bigger power supplies to deliver the power needed.

The jump to 1300W

While our Gen 12 nodes operated comfortably with 800W 80 PLUS Titanium CRPS (Common Redundant Power Supply), the Gen 13 specification requires a larger power supply. We have selected a 1300W 80 PLUS Titanium CRPS.

Power consumption of Gen 13 during typical operation has risen to 850W, a 250W increase over the 600W seen in Gen 12. The primary contributors are the 500W TDP CPU (up from 400W), doubling of the memory capacity and the additional NVMe drive.

Why 1300W instead of 1000W? The current PSU ecosystem lacks viable, high-efficiency options at 1000W. To ensure supply chain reliability, we moved to the next industry-standard tier of 1300W.

EU Lot 9 is a regulation that requires servers deploying in the European Union to have power supplies with efficiency at 10%, 20%, 50% and 100% load to be at or above the percentage threshold specified in the regulation. The threshold matches 80 PLUS Power Supply certification program titanium grade PSU requirement. We chose a titanium grade PSU for Gen 13 to maintain full compliance with EU Lot 9, ensuring that the servers can be deployed in our European data centers and beyond.

Thermal design: 2U pays dividends again

The 2U1N form factor we adopted in Gen 12 continues to pay dividends. Gen 13 uses 5x 80mm fans (up from 4x in Gen 12) to handle the increased thermal load from the 500W CPU. The larger fan volume, combined with the 2U chassis airflow characteristics, means fans operate well below maximum duty cycle at typical ambient temperatures, keeping fan power in the < 50W range per fan.

Drop-in accelerator support

Gen 12	x2 single width FHFL or x1 double width FHFL
Gen 13	x2 double width FHFL

Maintaining the modularity of our fleet is a core requirement for our server design. This requirement enabled Cloudflare to quickly retrofit and deploy GPUs globally to more than 100 cities in 2024. In Gen 13, we are continuing the support of high-performance PCIe add-in cards.

On Gen 13, the 2U chassis layout is updated and configured to support more demanding power and thermal requirements. While Gen 12 was limited to a single double-width GPU, the Gen 13 architecture now supports two double-width PCIe cards.

A launchpad to scale Cloudflare to greater heights

Every generation of Cloudflare servers is an exercise in balancing competing constraints: performance versus power, capacity versus cost, flexibility versus simplicity. Gen 13 comes with 2x core count, 2x memory capacity, 4x network bandwidth, 1.5x storage capacity, and future-proofing for accelerator deployments — all while improving total cost of ownership and maintaining a robust management feature set and security posture that our global fleet demands.

Gen 13 servers are fully qualified and will be deployed to serve millions of requests across Cloudflare’s global network in more than 330 cities. As always, Cloudflare’s journey to serve the Internet as efficiently as possible does not end here. As the deployment of Gen 13 begins, we are planning the architecture for Gen 14.

If you are excited about helping build a better Internet, come join us. We are hiring.

Launching Cloudflare’s Gen 13 servers: trading cache for cores for 2x edge compute performance

Syona Sarma — Mon, 23 Mar 2026 13:00:00 GMT

Two years ago, Cloudflare deployed our 12th Generation server fleet, based on AMD EPYC™ Genoa-X processors with their massive 3D V-Cache. That cache-heavy architecture was a perfect match for our request handling layer, FL1 at the time. But as we evaluated next-generation hardware, we faced a dilemma — the CPUs offering the biggest throughput gains came with a significant cache reduction. Our legacy software stack wasn't optimized for this, and the potential throughput benefits were being capped by increasing latency.

This blog describes how the FL2 transition, our Rust-based rewrite of Cloudflare's core request handling layer, allowed us to prove Gen 13's full potential and unlock performance gains that would have been impossible on our previous stack. FL2 removes the dependency on the larger cache, allowing for performance to scale with cores while maintaining our SLAs. Today, we are proud to announce the launch of Cloudflare's Gen 13 based on AMD EPYC™ 5th Gen Turin-based servers running FL2, effectively capturing and scaling performance at the edge.

What AMD EPYCTurin brings to the table

AMD's EPYC™ 5th Generation Turin-based processors deliver more than just a core count increase. The architecture delivers improvements across multiple dimensions of what Cloudflare servers require.

2x core count: up to 192 cores versus Gen 12's 96 cores, with SMT providing 384 threads
Improved IPC: Zen 5's architectural improvements deliver better instructions-per-cycle compared to Zen 4
Better power efficiency: Despite the higher core count, Turin consumes up to 32% fewer watts per core compared to Genoa-X
DDR5-6400 support: Higher memory bandwidth to feed all those cores

However, Turin's high density OPNs make a deliberate tradeoff: prioritizing throughput over per core cache. Our analysis across the Turin stack highlighted this shift. For example, comparing the highest density Turin OPN to our Gen 12 Genoa-X processors reveals that Turin's 192 cores share 384MB of L3 cache. This leaves each core with access to just 2MB, one-sixth of Gen 12's allocation. For any workload that relies heavily on cache locality, which ours did, this reduction posed a serious challenge.

Generation	Processor	Cores/Threads	L3 Cache/Core
Gen 12	AMD Genoa-X 9684X	96C/192T	12MB (3D V-Cache)
Gen 13 Option 1	AMD Turin 9755	128C/256T	4MB
Gen 13 Option 2	AMD Turin 9845	160C/320T	2MB
Gen 13 Option 3	AMD Turin 9965	192C/384T	2MB

Diagnosing the problem with performance counters

For our FL1 request handling layer, NGINX- and LuaJIT-based code, this cache reduction presented a significant challenge. But we didn't just assume it would be a problem; we measured it.

During the CPU evaluation phase for Gen 13, we collected CPU performance counters and profiling data to identify exactly what was happening under the hood using AMD uProf tool. The data showed:

L3 cache miss rates increased dramatically compared to Gen 12's server equipped with 3D V-cache processors
Memory fetch latency dominated request processing time as data that previously stayed in L3 now required trips to DRAM
The latency penalty scaled with utilization as we pushed CPU usage higher, and cache contention worsened

L3 cache hits complete in roughly 50 cycles; L3 cache misses requiring DRAM access take 350+ cycles, an order of magnitude difference. With 6x less cache per core, FL1 on Gen 13 was hitting memory far more often, incurring latency penalties.

The tradeoff: latency vs. throughput

Our initial tests running FL1 on Gen 13 confirmed what the performance counters had already suggested. While the Turin processor could achieve higher throughput, it came at a steep latency cost.

Metric	Gen 12 (FL1)	Gen 13 - AMD Turin 9755 (FL1)	Gen 13 - AMD Turin 9845 (FL1)	Gen 13 - AMD Turin 9965 (FL1)	Delta
Core count	baseline	+33%	+67%	+100%
FL throughput	baseline	+10%	+31%	+62%	Improvement
Latency at low to moderate CPU utilization	baseline	+10%	+30%	+30%	Regression
Latency at high CPU utilization	baseline	> 20%	> 50%	> 50%	Unacceptable

The Gen 13 evaluation server with AMD Turin 9965 that generated 60% throughput gain was compelling, and the performance uplift provided the most improvement to Cloudflare’s total cost of ownership (TCO).

But a more than 50% latency penalty is not acceptable. The increase in request processing latency would directly impact customer experience. We faced a familiar infrastructure question: do we accept a solution with no TCO benefit, accept the increased latency tradeoff, or find a way to boost efficiency without adding latency?

Incremental gains with performance tuning

To find a path to an optimal outcome, we collaborated with AMD to analyze the Turin 9965 data and run targeted optimization experiments. We systematically tested multiple configurations:

Hardware Tuning: Adjusting hardware prefetchers and Data Fabric (DF) Probe Filters, which showed only marginal gains
Scaling Workers: Launching more FL1 workers, which improved throughput but cannibalized resources from other production services
CPU Pinning & Isolation: Adjusting workload isolation configurations to find optimal mix, with limited success

The configuration that ultimately provided the most value was AMD’s Platform Quality of Service (PQOS). PQOS extensions enable fine-grained regulation of shared resources like cache and memory bandwidth. Since Turin processors consist of one I/O Die and up to 12 Core Complex Dies (CCDs), each sharing an L3 cache across up to 16 cores, we put this to the test. Here is how the different experimental configurations performed.

First, we used PQOS to allocate a dedicated L3 cache share within a single CCD for FL1, the gains were minimal. However, when we scaled the concept to the socket level, dedicating an entire CCD strictly to FL1, we saw meaningful throughput gains while keeping latency acceptable.

Configuration	Description	Illustration	Performance gain
NUMA-aware core affinity (equivalent to PQOS at socket level)	6 out of 12 CCD (aligned with NUMA domain) run FL. 32MB L3 cache in each CCD shared among all cores.		>15% incremental throughput gain
PQOS config 1	1 of 2 vCPU on each physical core in each CCD runs FL. FL gets 75% of the 32MB L3 cache of each CCD.		< 5% incremental throughput gain Other services show minor signs of degradation
PQOS config 2	1 of 2 vCPU in each physical core in each CCD runs FL. FL gets 50% of the 32MB L3 cache of each CCD.		< 5% incremental throughput gain
PQOS config 3	2 vCPU on 50% of the physical core in each CCD runs FL. FL gets 50% of the 32MB L3 cache of each CCD.		< 5% incremental throughput gain

The opportunity: FL2 was already in progress

Hardware tuning and resource configuration provided modest gains, but to truly unlock the performance potential of the Gen 13 architecture, we knew we would have to rewrite our software stack to fundamentally change how it utilized system resources.

Fortunately, we weren't starting from scratch. As we announced during Birthday Week 2025, we had already been rebuilding FL1 from the ground up. FL2 is a complete rewrite of our request handling layer in Rust, built on our Pingora and Oxy frameworks, replacing 15 years of NGINX and LuaJIT code.

The FL2 project wasn't initiated to solve the Gen 13 cache problem — it was driven by the need for better security (Rust's memory safety), faster development velocity (strict module system), and improved performance across the board (less CPU, less memory, modular execution).

FL2's cleaner architecture, with better memory access patterns and less dynamic allocation, might not depend on massive L3 caches the way FL1 did. This gave us an opportunity to use the FL2 transition to prove whether Gen 13's throughput gains could be realized without the latency penalty.

Proving it out: FL2 on Gen 13

As the FL2 rollout progressed, production metrics from our Gen 13 servers validated what we had hypothesized.

Metric	Gen 13 AMD Turin 9965 (FL1)	Gen 13 AMD Turin 9965 (FL2)
FL requests per CPU%	baseline	50% higher
Latency vs Gen 12	baseline	70% lower
Throughput vs Gen 12	62% higher	100% higher

The out-of-the-box efficiency gains on our new FL2 stack were substantial, even before any system optimizations. FL2 slashed the latency penalty by 70%, allowing us to push Gen 13 to higher CPU utilization while strictly meeting our latency SLAs. Under FL1, this would have been impossible.

By effectively eliminating the cache bottleneck, FL2 enables our throughput to scale linearly with core count. The impact is undeniable on the high-density AMD Turin 9965: we achieved a 2x performance gain, unlocking the true potential of the hardware. With further system tuning, we expect to squeeze even more power out of our Gen 13 fleet.

Generational improvement with Gen 13

With FL2 unlocking the immense throughput of the high-core-count AMD Turin 9965, we have officially selected these processors for our Gen 13 deployment. Hardware qualification is complete, and Gen 13 servers are now shipping at scale to support our global rollout.

Performance improvements

	Gen 12	Gen 13
Processor	AMD EPYC™ 4th Gen Genoa-X 9684X	AMD EPYC™ 5th Gen Turin 9965
Core count	96C/192T	192C/384T
FL throughput	baseline	Up to +100%
Performance per watt	baseline	Up to +50%

Gen 13 business impact

Up to 2x throughput vs Gen 12 for uncompromising customer experience: By doubling our throughput capacity while staying within our latency SLAs, we guarantee our applications remain fast and responsive, and able to absorb massive traffic spikes.

50% better performance/watt vs Gen 12 for sustainable scaling: This gain in power efficiency not only reduces data center expansion costs, but allows us to process growing traffic with a vastly lower carbon footprint per request.

60% higher rack throughput vs Gen 12 for global edge upgrades: Because we achieved this throughput density while keeping the rack power budget constant, we can seamlessly deploy this next generation compute anywhere in the world across our global edge network, delivering top tier performance exactly where our customers want it.

Gen 13 + FL2: ready for the edge

Our legacy request serving layer FL1 hit a cache contention wall on Gen 13, forcing an unacceptable tradeoff between throughput and latency. Instead of compromising, we built FL2.

Designed with a vastly leaner memory access pattern, FL2 removes our dependency on massive L3 caches and allows linear scaling with core count. Running on the Gen 13 AMD Turin platform, FL2 unlocks 2x the throughput and a 50% boost in power efficiency all while keeping latency within our SLAs. This leap forward is a great reminder of the importance of hardware-software co-design. Unconstrained by cache limits, Gen 13 servers are now ready to be deployed to serve millions of requests across Cloudflare’s global network.

If you're excited about working on infrastructure at global scale, we're hiring.

The most-seen UI on the Internet? Redesigning Turnstile and Challenge Pages

Leo Bacevicius — Fri, 27 Feb 2026 06:00:00 GMT

You've seen it. Maybe you didn't register it consciously, but you've seen it. That little widget asking you to verify you're human. That full-page security check before accessing a website. If you've spent any time on the Internet, you've encountered Cloudflare's Turnstile widget or Challenge Pages — likely more times than you can count.

^{The Turnstile widget – a familiar sight across millions of websites}

When we say that a large portion of the Internet sits behind Cloudflare, we mean it. Our Turnstile widget and Challenge Pages are served 7.67 billion times every single day. That's not a typo. Billions. This might just be the most-seen user interface on the Internet.

And that comes with enormous responsibility.

Designing a product with billions of eyeballs on it isn't just challenging — it requires a fundamentally different approach. Every pixel, every word, every interaction has to work for someone's grandmother in rural Japan, a teenager in São Paulo, a visually impaired developer in Berlin, and a busy executive in Lagos. All at the same time. In moments of frustration.

Today we’re sharing the story of how we redesigned Turnstile and Challenge Pages. It's a story told in three parts, by three of us: the design process and research that shaped our decisions (Leo), the engineering challenge of deploying changes at unprecedented scale (Ana), and the measurable impact on billions of users (Marina).

Let's start with how we approached the problem from a design perspective.

Part 1: The design process

The problem

Let's be honest: nobody likes being asked to prove they're human. You know you're human. I know I'm human. The only one who doesn't seem convinced is that little widget standing between you and the website you're trying to access. At best, it's a minor inconvenience. At worst? You've probably wanted to throw your computer out the window in a fit of rage. We've all been there. And no one would blame you.

^{Turnstile integrated into a login flow}

As the world warms up to what appears to be an inevitable AI revolution, the need for security verification is only increasing. At Cloudflare, we've seen a significant rise in bot attacks — and in response, organizations are investing more heavily in security measures. That means more challenges being issued to more end users, more often.

The numbers tell the story:

2023: 2.14B daily

2024: 3B daily

2025: 5.35B daily

That's a 58.1% average increase in security checks, year over year. More security checks mean more opportunities for end user frustration. The more companies integrate these verification systems to protect themselves and their customers, the higher the chance that someone, somewhere, is going to have a bad experience.

We knew it was time to take a hard look at our flagship products and ask ourselves: Are we doing right by the billions of people who encounter these experiences? Are we fulfilling our mission to build a better Internet — not just a more secure one, but a more human one?

The answer, we discovered, was: we could do better.

The design audit

Before redesigning anything, we needed to understand what we were working with. We started by conducting a comprehensive audit of every state, every error message, and every interaction across both Turnstile and Challenge Pages.

What we found wasn't the best.

^{The state of inconsistency in the Turnstile widget. Multiple states with no unified approach}

The inconsistencies were glaring. We had no unified approach across the multitude of different error scenarios. Some messages were overly verbose and technical ("Your device clock is set to a wrong time or this challenge page was accidentally cached by an intermediary and is no longer available"). Others were too vague to be helpful ("Timed out"). The visual language varied wildly — different layouts, different hierarchies, different tones of voice.

We also examined the feedback we'd received online. Social media, support tickets, community forums — we read it all. The frustration was palpable, and much of it was avoidable.

Take our feedback mechanism, for example. We offered users feedback options like "The widget sometimes fails" versus "The widget fails all the time." But what's the difference, really? And how were they supposed to know how often it failed? We were asking users to interpret ambiguous options during their most frustrated moments. The more we left open to interpretation, the less useful the feedback became — and the more frustration we saw across social channels.

^{The previous feedback screen: "The widget sometimes fails" vs "The widget fails all the time" — what's the difference?}

Our Challenge Pages — the full-page security blocks that appear when we detect suspicious activity or when site owners have heightened security settings — had similar issues. Some states were confusing. Others used too much technical jargon. Many failed to provide actionable guidance when users needed it most.

^{The state of inconsistency on the Challenge pages. Multiple states with no unified approach}

The audit was humbling. But it gave us a clear picture of where we needed to focus.

Mapping the user journey

To design better experiences, we first needed to understand every possible path a user could take. What was the happy path? Was there even one? And what were the unhappy paths that led to escalating frustration?

^{Mapping the complete user journey — from initial encounter through error scenarios, with sentiment tracking}

This was a true cross-functional effort. We worked closely with engineers like Ana who knew the technical ins and outs of every edge case, and with Marina on the product side who understood not just how the product worked, but how users felt about it — the love and the hate we'd see online.

We have some of the smartest people working on bot protection at Cloudflare. But intelligence and clarity aren't the same thing. There's a delicate balance between technical complexity and user simplicity. Only when these two dance together successfully can we communicate information in a way that actually makes sense to people.

And here's the thing: the messaging has to work for everyone. A person of any age. Any mental or physical capability. Any cultural background. Any level of technical sophistication. That's what designing at scale really means — you can’t ignore edge cases, since, at such scale, they are no longer edge cases.

Establishing a unified information architecture

One of the most influential books in UX design is Steve Krug's Don't Make Me Think. The core principle is simple: every moment a user spends trying to interpret, understand, or decode your interface is a moment of friction. And friction, especially in moments of frustration, leads to abandonment.

Our audit revealed that we were asking users to think far too much. Different pieces of information occupied the same space in the UI across different states. There was no consistent visual hierarchy. Users encountering an error state in Turnstile would find information in a completely different place than they would on a Challenge Page.

We made a fundamental decision: one information architecture to rule them all.

^{Visual diagram displaying a unified information architecture with a consistent structure across Turnstile widget and Challenge pages}

Both Turnstile and Challenge Pages would now follow the same structural pattern. The same visual hierarchy. The same placement for actions, for explanatory text, for links to documentation.

Did this constrain our design options? Absolutely. We had to say no to a lot of creative ideas that didn't fit the framework. But constraints aren't the enemy of good design — they're often its best friend. By limiting our options, we could go deeper on the details that actually mattered.

For users, the benefit is profound: they don't need to re-learn what each piece of the UI means. Error states look consistent. Help links are always in the same place. Once you understand one state, you understand them all. That's cognitive load reduced to a minimum — exactly where it should be during a security verification.

What user research taught us

How do you keep yourself accountable when redesigning something that billions of people see? You test. A lot.

We recruited 8 participants across 8 different countries, deliberately seeking diversity in age, digital savviness, and cultural background. We weren't looking for tech-savvy early adopters — we wanted to understand how the redesign would work for everyone.

Our approach was rigorous: participants saw both the current experience and proposed changes, without knowing which was "old" or "new." We counterbalanced positioning to eliminate bias. And we did not just test our new ideas, but also challenged our assumptions about what needed changing in the first place.

^{Two different versions of a Turnstile being tested in an A/B test}

Some things didn’t need fixing

One hypothesis: should we align with competitors? Most CAPTCHA providers show "I am human" across all states. We use distinct content — "Verify you are human," then "Verifying...," then "Success!"

Were we overcomplicating things? We tested it head-to-head.

Our approach won decisively. For the interactivity state, "Verify you are human" scored 5 out of 8 points versus just 3 for "I am human." For the verifying state, it was even more dramatic — 7.5 versus 0.5. Users wanted to know what was happening, not just be told what they were.

^{User testing results: users strongly favored our approach over the competitor-style design}

This experiment didn't ship as a feature, but it was invaluable. It gave us confidence we weren't just being different for the sake of it. Some things were already right.

But these needed to change

The research surfaced four areas where we were failing users:

Help, not bureaucracy. When users encountered errors, we offered "Send Feedback." In testing, they were baffled. "Who am I sending this to? The website? Cloudflare? My ISP?" More importantly, we discovered something fundamental: at the moment of maximum frustration, people don't want to file a report — they want to fix the problem. We replaced "Send Feedback" with "Troubleshoot" — a single word that promises action rather than bureaucracy.

^{The problematic "Send Feedback" prompt: users didn't know who they were sending feedback to}

Attention, not alarm. We'd used red backgrounds liberally for errors. The reaction in testing was visceral — participants felt they had failed, felt powerless. Even for simple issues that would resolve with a retry, users assumed the worst and gave up. Red at full saturation wasn't communicating "Here's something to address." It was communicating "You have failed, and there's nothing you can do." The fix: red only for icons, never for text or backgrounds.

^{The evolution: from the states with unclear error state description in red to much clearer and concise error communication in neutral-color text.}

Scannable, not verbose. We'd tried to be thorough, explaining errors in technical detail. It backfired. Non-technical users found it alienating. Technical users didn't need it. Everyone was trying to read it in the tiny real estate of a widget. The lesson: less is more, especially in constrained spaces during stressful moments.

Accessible to everyone. Our audit revealed 10px fonts in some states. Grey text that technically met AA (at least 4.5:1 for normal text and 3:1 for large text) compliance but was difficult to read in practice. "Technically compliant" isn't good enough when you're serving the entire Internet.

We set a clear goal: to meet the WCAG 2.2 AAA standard— the highest and most stringent level of web accessibility compliance, designed to make content accessible to the broadest range of users, including those with severe disabilities. Throughout the redesign, when visual consistency conflicted with readability, readability won. Every time.

This extended beyond vision. We designed for screen reader users, keyboard-only navigators, and people with color vision variations — going beyond what automated compliance tools can catch.

And accessibility isn't just about impairments — it's about language. What fits in English, overflows in German. What's concise in Spanish is ambiguous in Japanese. Supporting over 40 languages forced us to radically simplify. The same "Unable to connect to website / Troubleshoot" pattern now works across English, Bulgarian, Danish, German, Greek, Japanese, Indonesian, Russian, Slovak, Slovenian, Serbian, Filipino, and many more.

^{The redesigned error state across 12 languages — consistent layout despite varying text lengths}

Final redesign

So what did we actually ship?

First, let's talk about what we didn't change. The happy path — "Verify you are human" → "Verifying..." → "Success!" — tested exceptionally well. Users understood what was happening at each stage. The distinct content for each state, which we'd worried might be overcomplicating things, was actually our competitive advantage.

^{The happy path: Verify you are human → Verifying → Success! These states tested well and remained largely unchanged}

But for the states that needed work, we made significant changes guided by everything we learned.

Simplified, scannable content

We radically reduced the amount of text in error states. Instead of verbose explanations like "Your device clock is set to a wrong time or this challenge page was accidentally cached by an intermediary and is no longer available," we now show:

A clear, simple state name (e.g., "Incorrect device time")
A prominent "Troubleshoot" link

That's it. The detailed guidance now lives in a dedicated modal screen that opens when users need it — giving them room to actually read and follow troubleshooting steps.

^{The troubleshooting modal: detailed guidance when users need it, without cluttering the widget}

The troubleshooting modal provides context ("This error occurs when your device's clock or calendar is inaccurate. To complete this website’s security verification process, your device must be set to the correct date and time in your time zone."), numbered steps to try, links to documentation, and — only after the user has tried to resolve the issue — an option to submit feedback to Cloudflare. Help first, feedback second.

AAA accessibility compliance

Every state now meets WCAG 2.2 AAA standards for contrast and readability. Font sizes have established minimums. Interactive elements are clearly focusable and properly announced by screen readers.

Unified experience across Turnstile and Challenge pages

Whether users encounter the compact Turnstile widget or a full Challenge Page, the information architecture is now consistent. Same hierarchy. Same placement. Same mental model.

Challenge Pages now follow a clean structure: the website name and favicon at the top, a clear status message (like "Verification successful" or "Your browser is out of date"), and actionable guidance below. No more walls of orange or red text. No more technical jargon without context.

^{Re-designed Challenge page states with clear troubleshooting instructions.}

Validated across languages

Every piece of content was tested in over 40 supported languages. Our process involved three layers of validation:

Initial design review by the design team
Professional translation by our qualified vendor
Final review by native-speaking Cloudflare employees

This wasn't just about translation accuracy — it was about ensuring the visual design held up when content length varied dramatically between languages.

The complete picture

The result is a security verification experience that's clearer, more accessible, less frustrating, and — crucially — just as secure. We didn't compromise on protection to improve the experience. We proved that good design and strong security aren't in conflict.

^{Re-designed Turnstile widgets on the left and a re-designed Challenge page on the right}

But designing the experience was only half the battle. Shipping it to billions of users? That's where Ana comes in.

Part 2: Shipping to billions

Beyond centering a div

Some may say the hardest part of being a Frontend Engineer is centering a div. In reality, the real challenge often lies much deeper, especially when working close to the platform primitives. Building a critical piece of Internet infrastructure using native APIs forces you to think differently about UI development, tradeoffs, and long-term maintainability.

In our case, we use Rust to handle the UI for both the Turnstile widget and the Challenge page. This decision brought clear benefits in terms of safety and consistency across platforms, but it also increased frontend complexity. Many of us are used to the ergonomics of modern frameworks like React, where common UI interactions come almost for free. Working with Rust meant reimplementing even simple interactions using lower level constructs like document.getElementById, createElement, and appendChild.

On top of that, compile times and strict checks naturally slowed down rapid UI iteration compared to JavaScript based frameworks. Debugging was also more involved, as the tooling ecosystem is still evolving. These constraints pushed us to be more deliberate, more thoughtful, and ultimately more disciplined in how we approached UI development.

Small visual changes, big global impact

What initially looked like small visual tweaks such as padding adjustments or alignment changes quickly revealed a much bigger challenge: internationalization.

Once translations were available, we had to ensure that content remained readable and usable across 38 languages and 16 different UI states. Text length variability alone required careful design decisions. Some translations can be 30 to 300 percent longer than English. A short English string like “Stuck?” becomes “Tidak bisa melanjutkan?” in Indonesian or “Es geht nicht weiter?” in German, dramatically changing layout requirements.

Right-to-left language support added another layer of complexity. Supporting Arabic, Persian or Farsi, and Hebrew meant more than flipping text direction. Entire layouts had to be mirrored, including alignment, navigation patterns, directional icons, and animation flows. Many of these elements are implicitly designed with left-to-right assumptions, so we had to revisit those decisions and make them truly bidirectional.

Ordered lists also required special care. Not every culture uses the Western 1, 2, 3 numbering system, and hardcoding numeric sequences can make interfaces feel foreign or incorrect. We leaned on locale-aware numbering and fully translatable list formats to ensure ordering felt natural and culturally appropriate in every language.

Building confidence through testing

As we started listing action points in feedback reports, correctness became even more critical. Every action needed to render properly, trigger the right flow, and behave consistently across states, languages, and edge cases.

To get there, we invested heavily in testing. Unit tests helped us validate logic in isolation, while end-to-end tests ensured that new states and languages worked as expected in real scenarios. This testing foundation gave us confidence to iterate safely, prevented regressions, and ensured that feedback reports remained reliable and actionable for users.

The outcome

What began as a set of technical constraints turned into an opportunity to build a more robust, inclusive, and well-tested UI system. Working with fewer abstractions and closer to the browser primitives forced us to rethink assumptions, improve our internationalization strategy, and raise the overall quality bar.

The result is not just a solution that works, but one we trust. And that trust is what allows us to keep improving, even when centering a div turns out to be the easy part.

Part 3: The impact

Designing for billions of people is a responsibility we take seriously. At this scale, it is essential to leverage measurable data to tell us the real impact of our design choices. As we prepare to roll out these changes, we are focusing on five key metrics that will tell us if we’ve truly succeeded in making the Internet’s most-seen UI more human.

1. Challenge Completion Rate

Our primary north star is the Challenge Solve Rate: the percentage of issued challenges that are successfully completed. By moving away from technical jargon like "intermediary caching" and toward simple, actionable labels like "Incorrect device time," we expect a significant uptick in CSR. A higher CSR doesn't mean we're being easier on bots; it means we’re removing the hurdles that were accidentally tripping up legitimate human users.

2. Time to Complete

Every second a user spends on a challenge page is a second they aren't getting the information that they need. Our research showed that users were often paralyzed by choice when seeing a wall of red text. With our new scannable, neutral-color design, we are tracking Time to Complete to ensure users can identify and resolve issues in seconds rather than minutes.

3. Abandonment Rate Changes

In the past, our liberal use of "saturated red" caused a visceral reaction: users felt they had failed and simply gave up. By reserving red only for icons and using a unified architecture, we aim to reduce Abandonment Rates. We want users to feel empowered to click Troubleshoot rather than feeling powerless and clicking away.

4. Support Ticket Volume

One of the bigger shifts from a product perspective is our new Troubleshooting Modal. By providing clear, numbered steps directly within the widget, we are building self-service support into the UI. We expect this to result in a measurable decrease in support ticket volume for both our customers and our own internal teams.

5. Social Sentiment

We know that security challenges are rarely loved, but they shouldn't be hated because they are confusing. We are monitoring Social Sentiment across community forums, feedback reports, and social channels to see if the conversation shifts from "this widget is broken" to "I had an issue, but I fixed it".

As a Product Manager, my goal is often invisible security — the best challenge is the one the user never sees. But when a challenge must be seen, it should be an assistant, not a bouncer. This redesign proves that AAA accessibility and high-security standards aren't in competition; they are two sides of the same coin. By unifying the architecture of Turnstile and Challenge Pages, we’ve built a foundation that allows us to iterate faster and protect the Internet more humanely than ever before.

Looking ahead

This redesign is a foundation, not a finish line.

We're continuing to monitor how users interact with the new experience, and we're committed to iterating based on what we learn. The feedback mechanisms we've built into the new design — the ones that actually help users troubleshoot, rather than just asking them to report problems — will give us richer insights than we've ever had before.

We're also watching how the security landscape evolves. As bot attacks grow more sophisticated, and as AI continues to blur the line between human and automated behavior, the challenge of verification will only get harder. Our job is to stay ahead — to keep improving security without making the human experience worse.

If you encounter the new Turnstile or Challenge Pages and have feedback, we want to hear it. Reach out through our community forums or use the feedback mechanisms built into the experience itself.

Shedding old code with ecdysis: graceful restarts for Rust services at Cloudflare

Manuel Olguín Muñoz — Fri, 13 Feb 2026 14:00:00 GMT

ecdysis | ˈekdəsəs |
noun
the process of shedding the old skin (in reptiles) or casting off the outer cuticle (in insects and other arthropods).

How do you upgrade a network service, handling millions of requests per second around the globe, without disrupting even a single connection?

One of our solutions at Cloudflare to this massive challenge has long been ecdysis, a Rust library that implements graceful process restarts where no live connections are dropped, and no new connections are refused.

Last month, we open-sourced ecdysis, so now anyone can use it. After five years of production use at Cloudflare, ecdysis has proven itself by enabling zero-downtime upgrades across our critical Rust infrastructure, saving millions of requests with every restart across Cloudflare’s global network.

It’s hard to overstate the importance of getting these upgrades right, especially at the scale of Cloudflare’s network. Many of our services perform critical tasks such as traffic routing, TLS lifecycle management, or firewall rules enforcement, and must operate continuously. If one of these services goes down, even for an instant, the cascading impact can be catastrophic. Dropped connections and failed requests quickly lead to degraded customer performance and business impact.

When these services need updates, security patches can’t wait. Bug fixes need deployment and new features must roll out.

The naive approach involves waiting for the old process to be stopped before spinning up the new one, but this creates a window of time where connections are refused and requests are dropped. For a service handling thousands of requests per second in a single location, multiply that across hundreds of data centers, and a brief restart becomes millions of failed requests globally.

Let’s dig into the problem, and how ecdysis has been the solution for us — and maybe will be for you.

Links: GitHub | crates.io | docs.rs

Why graceful restarts are hard

The naive approach to restarting a service, as we mentioned, is to stop the old process and start a new one. This works acceptably for simple services that don’t handle real-time requests, but for network services processing live connections, this approach has critical limitations.

First, the naive approach creates a window during which no process is listening for incoming connections. When the old process stops, it closes its listening sockets, which causes the OS to immediately refuse new connections with ECONNREFUSED. Even if the new process starts immediately, there will always be a gap where nothing is accepting connections, whether milliseconds or seconds. For a service handling thousands of requests per second, even a gap of 100ms means hundreds of dropped connections.

Second, stopping the old process kills all already-established connections. A client uploading a large file or streaming video gets abruptly disconnected. Long-lived connections like WebSockets or gRPC streams are terminated mid-operation. From the client’s perspective, the service simply vanishes.

Binding the new process before shutting down the old one appears to solve this, but also introduces additional issues. The kernel normally allows only one process to bind to an address:port combination, but the SO_REUSEPORT socket option permits multiple binds. However, this creates a problem during process transitions that makes it unsuitable for graceful restarts.

When SO_REUSEPORT is used, the kernel creates separate listening sockets for each process and load balances new connections across these sockets. When the initial SYN packet for a connection is received, the kernel will assign it to one of the listening processes. Once the initial handshake is completed, the connection then sits in the accept() queue of the process until the process accepts it. If the process then exits before accepting this connection, it becomes orphaned and is terminated by the kernel. GitHub’s engineering team documented this issue extensively when building their GLB Director load balancer.

How ecdysis works

When we set out to design and build ecdysis, we identified four key goals for the library:

Old code can be completely shut down post-upgrade.
The new process has a grace period for initialization.
New code crashing during initialization is acceptable and shouldn’t affect the running service.
Only a single upgrade runs in parallel to avoid cascading failures.

ecdysis satisfies these requirements following an approach pioneered by NGINX, which has supported graceful upgrades since its early days. The approach is straightforward:

The parent process fork()s a new child process.
The child process replaces itself with a new version of the code with execve().
The child process inherits the socket file descriptors via a named pipe shared with the parent.
The parent process waits for the child process to signal readiness before shutting down.

Crucially, the socket remains open throughout the transition. The child process inherits the listening socket from the parent as a file descriptor shared via a named pipe. During the child's initialization, both processes share the same underlying kernel data structure, allowing the parent to continue accepting and processing new and existing connections. Once the child completes initialization, it notifies the parent and begins accepting connections. Upon receiving this ready notification, the parent immediately closes its copy of the listening socket and continues handling only existing connections.

This process eliminates coverage gaps while providing the child a safe initialization window. There is a brief window of time when both the parent and child may accept connections concurrently. This is intentional; any connections accepted by the parent are simply handled until completion as part of the draining process.

This model also provides the required crash safety. If the child process fails during initialization (e.g., due to a configuration error), it simply exits. Since the parent never stopped listening, no connections are dropped, and the upgrade can be retried once the problem is fixed.

ecdysis implements the forking model with first-class support for asynchronous programming through Tokio and systemd integration:

Tokio integration: Native async stream wrappers for Tokio. Inherited sockets become listeners without additional glue code. For synchronous services, ecdysis supports operation without async runtime requirements.
systemd-notify support: When the systemd_notify feature is enabled, ecdysis automatically integrates with systemd’s process lifecycle notifications. Setting Type=notify-reload in your service unit file allows systemd to track upgrades correctly.
systemd named sockets: The systemd_sockets feature enables ecdysis to manage systemd-activated sockets. Your service can be socket-activated and support graceful restarts simultaneously.

Platform note: ecdysis relies on Unix-specific syscalls for socket inheritance and process management. It does not work on Windows. This is a fundamental limitation of the forking approach.

Security considerations

Graceful restarts introduce security considerations. The forking model creates a brief window where two process generations coexist, both with access to the same listening sockets and potentially sensitive file descriptors.

ecdysis addresses these concerns through its design:

Fork-then-exec: ecdysis follows the traditional Unix pattern of fork() followed immediately by execve(). This ensures the child process starts with a clean slate: new address space, fresh code, and no inherited memory. Only explicitly-passed file descriptors cross the boundary.

Explicit inheritance: Only listening sockets and communication pipes are inherited. Other file descriptors are closed via CLOEXEC flags. This prevents accidental leakage of sensitive handles.

seccomp compatibility: Services using seccomp filters must allow fork() and execve(). This is a tradeoff: graceful restarts require these syscalls, so they cannot be blocked.

For most network services, these tradeoffs are acceptable. The security of the fork-exec model is well understood and has been battle-tested for decades in software like NGINX and Apache.

Code example

Let’s look at a practical example. Here’s a simplified TCP echo server that supports graceful restarts:

use ecdysis::tokio_ecdysis::{SignalKind, StopOnShutdown, TokioEcdysisBuilder};
use tokio::{net::TcpStream, task::JoinSet};
use futures::StreamExt;
use std::net::SocketAddr;

#[tokio::main]
async fn main() {
    // Create the ecdysis builder
    let mut ecdysis_builder = TokioEcdysisBuilder::new(
        SignalKind::hangup()  // Trigger upgrade/reload on SIGHUP
    ).unwrap();

    // Trigger stop on SIGUSR1
    ecdysis_builder
        .stop_on_signal(SignalKind::user_defined1())
        .unwrap();

    // Create listening socket - will be inherited by children
    let addr: SocketAddr = "0.0.0.0:8080".parse().unwrap();
    let stream = ecdysis_builder
        .build_listen_tcp(StopOnShutdown::Yes, addr, |builder, addr| {
            builder.set_reuse_address(true)?;
            builder.bind(&addr.into())?;
            builder.listen(128)?;
            Ok(builder.into())
        })
        .unwrap();

    // Spawn task to handle connections
    let server_handle = tokio::spawn(async move {
        let mut stream = stream;
        let mut set = JoinSet::new();
        while let Some(Ok(socket)) = stream.next().await {
            set.spawn(handle_connection(socket));
        }
        set.join_all().await;
    });

    // Signal readiness and wait for shutdown
    let (_ecdysis, shutdown_fut) = ecdysis_builder.ready().unwrap();
    let shutdown_reason = shutdown_fut.await;

    log::info!("Shutting down: {:?}", shutdown_reason);

    // Gracefully drain connections
    server_handle.await.unwrap();
}

async fn handle_connection(mut socket: TcpStream) {
    // Echo connection logic here
}

The key points:

build_listen_tcp creates a listener that will be inherited by child processes.
ready() signals to the parent process that initialization is complete and that it can safely exit.
shutdown_fut.await blocks until an upgrade or stop is requested. This future only yields once the process should be shut down, either because an upgrade/reload was executed successfully or because a shutdown signal was received.

When you send SIGHUP to this process, here’s what ecdysis does…

…on the parent process:

Forks and execs a new instance of your binary.
Passes the listening socket to the child.
Waits for the child to call ready().
Drains existing connections, then exits.

…on the child process:

Initializes itself following the same execution flow as the parent, except any sockets owned by ecdysis are inherited and not bound by the child.
Signals readiness to the parent by calling ready().
Blocks waiting for a shutdown or upgrade signal.

Production at scale

ecdysis has been running in production at Cloudflare since 2021. It powers critical Rust infrastructure services deployed across 330+ data centers in 120+ countries. These services handle billions of requests per day and require frequent updates for security patches, feature releases, and configuration changes.

Every restart using ecdysis saves hundreds of thousands of requests that would otherwise be dropped during a naive stop/start cycle. Across our global footprint, this translates to millions of preserved connections and improved reliability for customers.

ecdysis vs alternatives

Graceful restart libraries exist for several ecosystems. Understanding when to use ecdysis versus alternatives is critical to choosing the right tool.

tableflip is our Go library that inspired ecdysis. It implements the same fork-and-inherit model for Go services. If you need Go, tableflip is a great option!

shellflip is Cloudflare’s other Rust graceful restart library, designed specifically for Oxy, our Rust-based proxy. shellflip is more opinionated: it assumes systemd and Tokio, and focuses on transferring arbitrary application state between parent and child. This makes it excellent for complex stateful services, or services that want to apply such aggressive sandboxing that they can’t even open their own sockets, but adds overhead for simpler cases.

Start building

ecdysis brings five years of production-hardened graceful restart capabilities to the Rust ecosystem. It’s the same technology protecting millions of connections across Cloudflare’s global network, now open-sourced and available for anyone!

Full documentation is available at docs.rs/ecdysis, including API reference, examples for common use cases, and steps for integrating with systemd.

The examples directory in the repository contains working code demonstrating TCP listeners, Unix socket listeners, and systemd integration.

The library is actively maintained by the Argo Smart Routing & Orpheus team, with contributions from teams across Cloudflare. We welcome contributions, bug reports, and feature requests on GitHub.

Whether you’re building a high-performance proxy, a long-lived API server, or any network service where uptime matters, ecdysis can provide a foundation for zero-downtime operations.

Start building: github.com/cloudflare/ecdysis

Finding the grain of sand in a heap of Salt

Opeyemi Onikute — Thu, 13 Nov 2025 14:00:00 GMT

How do you find the root cause of a configuration management failure when you have a peak of hundreds of changes in 15 minutes on thousands of servers?

That was the challenge we faced as we built the infrastructure to reduce release delays due to failures of Salt, a configuration management tool. (We eventually reduced such failures on the edge by over 5%, as we’ll explain below.) We’ll explore the fundamentals of Salt, and how it is used at Cloudflare. We then describe the common failure modes and how they delay our ability to release valuable changes to serve our customers.

By first solving an architectural problem, we provided the foundation for self-service mechanisms to find the root cause of Salt failures on servers, datacenters and groups of datacenters. This system is able to correlate failures with git commits, external service failures and ad hoc releases. The result of this has been a reduction in the duration of software release delays, and an overall reduction in toilsome, repetitive triage for SRE.

To start, we will go into the basics of the Cloudflare network and how Salt operates within it. And then we’ll get to how we solved the challenge akin to finding a grain of sand in a heap of Salt.

How Salt works

Configuration management (CM) ensures that a system corresponds to its configuration information, and maintains the integrity and traceability of that information over time. A good configuration management system ensures that a system does not “drift” – i.e. deviate from the desired state. Modern CM systems include detailed descriptions of infrastructure, version control for these descriptions, and other mechanisms to enforce the desired state across different environments. Without CM, administrators must manually configure systems, a process that is error-prone and difficult to reproduce.

Salt is an example of such a CM tool. Designed for high-speed remote execution and configuration management, it uses a simple, scalable model to manage large fleets. As a mature CM tool, it provides consistency, reproducibility, change control, auditability and collaboration across team and organisational boundaries.

Salt’s design revolves around a master/minion architecture, a message bus built on ZeroMQ, and a declarative state system. (At Cloudflare we generally avoid the terms "master" and "minion." But we will use them here because that's how Salt describes its architecture.) The salt master is a central controller that distributes jobs and configuration data. It listens for requests on the message bus and dispatches commands to targeted minions. It also stores state files, pillar data and cache files. The salt minion is a lightweight agent installed on each managed host/server. Each minion maintains a connection to the master via ZeroMQ and subscribes to published jobs. When a job matches the minion, it executes the requested function and returns results.

The diagram below shows a simplification of the Salt architecture described in the docs, for the purpose of this blog post.

The state system provides declarative configuration management. States are often written in YAML and describe a resource (package, file, service, user, etc.) and the desired attributes. A common example is a package state, which ensures that a package is installed at a specified version.

# /srv/salt/webserver/init.sls
include:
  - common

nginx:
  pkg.installed: []

/etc/nginx/nginx.conf:
  file.managed:
    - source: salt://webserver/files/nginx.conf
    - require:
      - pkg: nginx

States can call execution modules, which are Python functions that implement system actions. When applying states, Salt returns a structured result containing whether the state succeeded (result: True/False), a comment, changes made, and duration.

Salt at Cloudflare

We use Salt to manage our ever-growing fleet of machines, and have previously written about our extensive usage. The master-minion architecture described above allows us to push configuration in the form of states to thousands of servers, which is essential for maintaining our network. We’ve designed our change propagation to involve blast radius protection. With these protections in place, a highstate failure becomes a signal, rather than a customer-impacting event.

This release design was intentional – we decided to “fail safe” instead of failing hard. By further adding guardrails to safely release new code before a feature reaches all users, we are able to propagate a change with confidence that failures will halt the Salt deployment pipeline by default. However, every halt blocks other configuration deployments and requires human intervention to determine the root cause. This can quickly become a toilsome process as the steps are repetitive and bring no enduring value.

Part of our deployment pipeline for Salt changes uses Apt. Every X minutes a commit is merged into the master branch, per Y minutes those merges are bundled and deployed to APT servers. The key file to retrieving Salt Master configuration from that APT server is the APT source file:

# /etc/apt/sources.list.d/saltcodebase.sources
# MANAGED BY SALT -- DO NOT MODIFY

Types: deb
URIs: mirror+file:/etc/apt/mirrorlists/saltcodebase.txt
Suites: stable canary
Components: cloudflare
Signed-By: /etc/apt/keyrings/cloudflare.gpg

This file directs a master to the correct suite for its specific environment. Using that suite, it retrieves the latest package containing the relevant Salt Debian package with the latest changes. It installs that package and begins deploying the included configuration. As it deploys the configuration on machines, the machines report their health using Prometheus. If a version is healthy, it will be progressed into the next environment. Before it can be progressed, a version has to pass a certain soak threshold to allow a version to develop its errors, making more complex issues become apparent. That is the happy case.

The unhappy case brings a myriad of complications: As we do progressive deployments, if a version is broken, any subsequent version is also broken. And because broken versions are continuously overtaken by newer versions, we need to stop deployments altogether. In a broken version scenario, it is crucial to get a fix out as soon as possible. This touches upon the core question of this blog post: What if a broken Salt version is propagated across the environment, we are abandoning deployments, and we need to get a fix out as soon as possible?

The pain: how Salt breaks and reports errors (and how it affects Cloudflare)

While Salt aims for idempotent and predictable configuration, failures can occur during the render, compile, or runtime stages. These failures are commonly due to misconfiguration. Errors in Jinja templates or invalid YAML can cause the render stage to fail. Examples include missing colons, incorrect indentation, or undefined variables. A syntax error is often raised with a stack trace pointing to the offending line.

Another frequent cause of failure is missing pillar or grain data. Since pillar data is compiled on the master, forgetting to update pillar top files or refreshing pillar can result in KeyError exceptions. As a system that maintains order using requisites, misconfigured requisites can lead to states executing out-of-order or being skipped. Failures can also happen when minions are unable to authenticate with the master, or cannot reach the master due to network or firewall issues.

Salt reports errors in several ways. By default, the salt and salt-call commands exit with a retcode 1 when any state fails. Salt also sets internal retcodes for specific cases: 1 for compile errors, 2 when a state returns False, and 5 for pillar compilation errors. Test mode shows what changes would be made without actually executing them, but is useful for catching syntax or ordering issues. Debug logs can be toggled using the -l debug CLI option (salt state.highstate -l debug).

The state return also includes the details of the individual state failures - the durations, timestamps, functions and results. If we introduce a failure to the file.managed state by referencing a file that doesn’t exist in the Salt fileserver, we see this failure:

web1:
----------
          ID: nginx
    Function: pkg.installed
      Result: True
     Comment: Package nginx is already installed
     Started: 15:32:41.157235
    Duration: 256.138 ms
     Changes:   

----------
          ID: /etc/nginx/nginx.conf
    Function: file.managed
      Result: False
     Comment: Source file salt://webserver/files/nginx.conf not found in saltenv 'base'
     Started: 15:32:41.415128
    Duration: 14.581 ms
     Changes:   

Summary for web1
------------
Succeeded: 1 (changed=0)
Failed:    1
------------
Total states run:     2
Total run time: 270.719 ms

The return can also be displayed in JSON:

{
  "web1": {
    "pkg_|-nginx_|-nginx_|-installed": {
      "comment": "Package nginx is already installed",
      "name": "nginx",
      "start_time": "15:32:41.157235",
      "result": true,
      "duration": 256.138,
      "changes": {}
    },
    "file_|-/etc/nginx/nginx.conf_|-/etc/nginx/nginx.conf_|-managed": {
      "comment": "Source file salt://webserver/files/nginx.conf not found in saltenv 'base'",
      "name": "/etc/nginx/nginx.conf",
      "start_time": "15:32:41.415128",
      "result": false,
      "duration": 14.581,
      "changes": {}
    }
  }
}

The flexibility of the output format means that humans can parse them in custom scripts. But more importantly, it can also be consumed by more complex, interconnected automation systems. We knew we could easily parse these outputs to attribute the cause of a Salt failure with an input – e.g. a change in source control, an external service failure, or a software release. But something was missing.

The solutions

Configuration errors are a common cause of failure in large-scale systems. Some of these could even lead to full system outages, which we prevent with our release architecture. When a new release or configuration breaks in production, our SRE team needs to find and fix the root cause to avoid release delays. As we’ve previously noted, this triage is tedious and increasingly difficult due to system complexity.

While some organisations use formal techniques such as automated root cause analysis, most triage is still frustratingly manual. After evaluating the scope of the problem, we decided to adopt an automated approach. This section describes the step-by-step approach to solving this broad, complex problem in production.

Phase one: retrievable CM inputs

When a Salt highstate fails on a minion, SRE teams faced a tedious investigation process: manually SSHing into minions, searching through logs for error messages, tracking down job IDs (JIDs), and locating the job associated with the JID on one of multiple associated masters. This is all while racing against a 4-hour retention window on master logs. The fundamental problem was architectural: Job results live on Salt Masters, not on the minions where they're executed, forcing operators to guess which master processed their job (SSHing into each one) and limiting visibility for users without master access.

We built a solution that caches job results directly on minions, similar to the local_cache returner that exists for masters. That enables local job retrieval and extended retention periods. This transformed a multistep, time-sensitive investigation into a single query — operators can retrieve job details, automatically extract error context, and trace failures back to specific file changes and commit authors, all from the minion itself. The custom returner filters and manages cache size intelligently, eliminating the “which master?” problem while also enabling automated error attribution, reducing time to resolution, and removing human toil from routine troubleshooting.

By decentralizing job history and making it queryable at the source, we moved significantly closer to a self-service debugging experience where failures are automatically contextualized and attributed, letting SRE teams focus on fixes rather than forensics.

Phase two: Self-service using a Salt Blame Module

Once job information was available on the minion, we no longer needed to resolve which master triggered the job that failed. The next step was to write a Salt execution module that would allow an external service to query for job information, and more specifically failed job information, without needing to know Salt internals. This led us to write a module called Salt Blame. Cloudflare prides itself on its blameless culture, our software on the other hand…

The blame module is responsible for pulling together three things:

Local job history information
CM inputs (latest commit present during the job)
Git repo commit history

We chose to write an execution module for simplicity, decoupling external automation from the need to understand Salt internals, and potential usage by operators for further troubleshooting. Writing execution modules is already well established within operational teams and adheres to well-defined best practices such as unit tests, linting and extensive peer-review.

The module is understandably very simple. It iterates in reverse chronological order through the jobs in the local cache and looks for the first job failure chronologically, and then the successful job immediately prior to it. This is for no other reason than narrowing down the true first failure and giving us before and after state results. At this stage, we have several avenues to present context to the caller: To find possible commit culprits, we look through all commits between the last successful Job ID and the failure to determine if any of these changed files relevant to the failure. We also provided the list of failed states and their outputs as another avenue to spot the root cause. We’ve learned that this flexibility is important to cover the wide range of failure possibilities.

We also make a distinction between normal failed states, and compile errors. As described in the Salt docs, each job returns different retcodes based on the outcome.

Compile Error: 1 is set when any error is encountered in the state compiler.
Failed State: 2 is set when any state returns a False result.

Most of our failures manifest as failed states as a result of a change in source control. An engineer building a new feature for our customers may unintentionally introduce a failure that was uncaught by our CI and Salt Master tests. In the first iteration of the module, listing all the failed states was sufficient to pinpoint the root cause of a highstate failure.

However, we noticed that we had a blind spot. Compile errors do not result in a failed state, since no state runs. Since these errors returned a different retcode from what we checked for, the module was completely blind to them. Most compile errors happen when a Salt service dependency fails during the state compile phase. They can also happen as a result of a change in source control, although that is rare.

With both state failures and compile errors accounted for, we drastically improved our ability to pinpoint issues. We released the module to SREs who immediately realised the benefits of faster Salt triage.

# List all the recent failed states
minion~$ salt-call -l info blame.last_failed_states
local:
    |_
      ----------
      __id__:
          /etc/nginx/nginx.conf
      __run_num__:
          5221
      __sls__:
          foo
      changes:
          ----------
      comment:
          Source file salt://webserver/files/nginx.conf not found in saltenv 'base'
      duration:
          367.233
      finish_time_stamp:
          2025-10-22T10:00:17.289897+00:00
      fun:
          file.managed
      name:
          /etc/nginx/nginx.conf
      result:
          False
      start_time:
          10:00:16.922664
      start_time_stamp:
          2025-10-22T10:00:16.922664+00:00

# List all the commits that correlate with a failed state
minion~$ salt-call -l info blame.last_highstate_failure
local:
    ----------
    commits:
        |_
          ----------
          author_email:
              johndoe@cloudflare.com
          author_name:
              John Doe
          commit_datetime:
              2025-06-30T15:29:26.000+00:00
          commit_id:
              e4a91b2c9f7d3b6f84d12a9f0e62a58c3c7d9b5a
          path:
              /srv/salt/webserver/init.sls
    message:
        reviewed 5 change(s) over 12 commit(s) looking for 1 state failure(s)
    result:
        True

# List all the compile errors
minion~$ salt-call -l info blame.last_compile_errors
local:
    |_
      ----------
      error_types:
      job_timestamp:
          2025-10-24T21:55:54.595412+00:00
      message: A service failure has occured
      state: foo
      traceback:
          Full stack trace of the failure
      urls: http://url-matching-external-service-if-found

Phase three: automate, automate, automate!

Faster triage is always a welcome development, and engineers were comfortable running local commands on minions to triage Salt failures. But in a busy shift, time is of the essence. When failures spanned across multiple datacenters or machines, it easily became cumbersome to run commands across all these minions. This solution also required context-switches between multiple nodes and datacenters. We needed a way to aggregate common failure types using a single command – single minions, pre-production datacenters and production datacenters.

We implemented several mechanisms to simplify triage and eliminate manual triggers. We aimed to get this tooling as close to the triage location as possible, which is often chat. With three distinct commands, engineers were now able to triage Salt failures right from chat threads.

With a hierarchical approach, we made individual triage possible for minions, data centers and groups of data centers. A hierarchy makes this architecture fully extensible, flexible and self-organising. An engineer is able to triage a failure on one minion, and at the same time the entire data center as needed.

The ability to triage multiple data centers at the same time became immediately useful for tracking the root cause of failures in pre-production data centers. These failures delay the propagation of changes to other data centers, and hinder our ability to release changes for customer features, bug fixes or incident remediation. The addition of this triage option has cut down the time to debug and remediate Salt failures by over 5%, allowing us to consistently release important changes for our customers.

While 5% does not immediately look like a drastic improvement, the magic is in the cumulative effect. We won’t release actual figures of the amount of time releases are delayed for, but we can do a simple thought experiment. If the average amount of time spent is even just 60 minutes per day, a reduction by 5% saves us 90 minutes (one hour 30 minutes) per month.

Another indirect benefit lies in more efficient feedback loops. Since engineers spend less time fiddling with complex configurations, that energy is diverted towards preventing reoccurrence, further reducing the overall time by an immeasurable amount. Our future plans include measurement and data analytics to understand the outcomes of these direct and indirect feedback loops.

The image below shows an example of pre-production triage output. We are able to correlate failures with git commits, releases, and external service failures. During a busy shift, this information is invaluable for quickly fixing breakage. On average, each minion “blame” takes less than 30 seconds, while multiple data centers are able to return a result in a minute or less.

The image below describes the hierarchical model. Each step in the hierarchy is executed in parallel, allowing us to achieve blazing fast results.

With these mechanisms available, we further cut down triage time by triggering the triage automation on known conditions, especially those with impact to the release pipeline. This directly improved the velocity of changes to the edge since it took less time to find a root cause and fix-forward or revert.

Phase four: measure, measure, measure

After we got blazing fast Salt triage, we needed a way to measure the root causes. While individual root causes are not immediately valuable, historical analysis was deemed important. We wanted to understand the common causes of failure, especially as they hinder our ability to deliver value to customers. This knowledge creates a feedback loop that can be used to keep the number of failures low.

Using Prometheus and Grafana, we track the top causes of failure: git commits, releases, external service failures and unattributed failed states. The list of failed states is particularly useful because we want to know repeat offenders and drive better adoption of stable releasing practices. We are also particularly interested in root causes — a spike in the number of failures due to git commits indicates a need to adopt better coding practices and linting, a spike in external service failures indicates a regression in an internal system to be investigated, and a spike in release-based failures indicates a need for better gating and release-shepherding.

We analyse these metrics on a monthly cycle, providing feedback mechanisms through internal tickets and escalations. While the immediate impact of these efforts is not yet visible as the efforts are nascent, we expect to improve the overall health of our Saltstack infrastructure and release process by reducing the amount of breakage we see.

The broader picture

Much of operational work is often seen as a “necessary evil”. Humans in ops are conditioned to intervene when failures happen and remediate them. This cycle of alert-response is necessary to keep the infrastructure running, but it often leads to toil. We have discussed the effect of toil in a previous blog post.

This work represents another step in the right direction – removing more toil for our on-call SREs, and freeing up valuable time to work on novel issues. We hope that this encourages other operations engineers to share the progress they are making towards reducing overall toil in their organizations. We also hope that this sort of work can be adopted within Saltstack itself, although the lack of homogeneity in production systems across several companies makes it unlikely.

In the future, we plan to improve the accuracy of detection and rely less on external correlation of inputs to determine the root cause of failed outcomes. We will investigate how to move more of this logic into our native Saltstack modules, further streamlining the process and avoiding regressions as external systems drift.

If this sort of work is exciting to you, we encourage you to take a look at our careers page.

Cloudflare just got faster and more secure, powered by Rust

Richard Boulton — Fri, 26 Sep 2025 14:00:00 GMT

Cloudflare is relentless about building and running the world’s fastest network. We have been tracking and reporting on our network performance since 2021: you can see the latest update here.

Building the fastest network requires work in many areas. We invest a lot of time in our hardware, to have efficient and fast machines. We invest in peering arrangements, to make sure we can talk to every part of the Internet with minimal delay. On top of this, we also have to invest in the software we run our network on, especially as each new product can otherwise add more processing delay.

No matter how fast messages arrive, we introduce a bottleneck if that software takes too long to think about how to process and respond to requests. Today we are excited to share a significant upgrade to our software that cuts the median time we take to respond by 10ms and delivers a 25% performance boost, as measured by third-party CDN performance tests.

We've spent the last year rebuilding major components of our system, and we've just slashed the latency of traffic passing through our network for millions of our customers. At the same time, we've made our system more secure, and we've reduced the time it takes for us to build and release new products.

Where did we start?

Every request that hits Cloudflare starts a journey through our network. It might come from a browser loading a webpage, a mobile app calling an API, or automated traffic from another service. These requests first terminate at our HTTP and TLS layer, then pass into a system we call FL, and finally through Pingora, which performs cache lookups or fetches data from the origin if needed.

FL is the brain of Cloudflare. Once a request reaches FL, we then run the various security and performance features in our network. It applies each customer’s unique configuration and settings, from enforcing WAF rules and DDoS protection to routing traffic to the Developer Platform and R2.

Built more than 15 years ago, FL has been at the core of Cloudflare’s network. It enables us to deliver a broad range of features, but over time that flexibility became a challenge. As we added more products, FL grew harder to maintain, slower to process requests, and more difficult to extend. Each new feature required careful checks across existing logic, and every addition introduced a little more latency, making it increasingly difficult to sustain the performance we wanted.

You can see how FL is key to our system — we’ve often called it the “brain” of Cloudflare. It’s also one of the oldest parts of our system: the first commit to the codebase was made by one of our founders, Lee Holloway, well before our initial launch. We’re celebrating our 15th Birthday this week - this system started 9 months before that!

commit 39c72e5edc1f05ae4c04929eda4e4d125f86c5ce
Author: Lee Holloway 
Date:   Wed Jan 6 09:57:55 2010 -0800

    nginx-fl initial configuration

As the commit implies, the first version of FL was implemented based on the NGINX webserver, with product logic implemented in PHP. After 3 years, the system became too complex to manage effectively, and too slow to respond, and an almost complete rewrite of the running system was performed. This led to another significant commit, this time made by Dane Knecht, who is now our CTO.

commit bedf6e7080391683e46ab698aacdfa9b3126a75f
Author: Dane Knecht
Date:   Thu Sep 19 19:31:15 2013 -0700

    remove PHP.

From this point on, FL was implemented using NGINX, the OpenResty framework, and LuaJIT. While this was great for a long time, over the last few years it started to show its age. We had to spend increasing amounts of time fixing or working around obscure bugs in LuaJIT. The highly dynamic and unstructured nature of our Lua code, which was a blessing when first trying to implement logic quickly, became a source of errors and delay when trying to integrate large amounts of complex product logic. Each time a new product was introduced, we had to go through all the other existing products to check if they might be affected by the new logic.

It was clear that we needed a rethink. So, in July 2024, we cut an initial commit for a brand new, and radically different, implementation. To save time agreeing on a new name for this, we just called it “FL2”, and started, of course, referring to the original FL as “FL1”.

commit a72698fc7404a353a09a3b20ab92797ab4744ea8
Author: Maciej Lechowski
Date:   Wed Jul 10 15:19:28 2024 +0100

    Create fl2 project

Rust and rigid modularization

We weren’t starting from scratch. We’ve previously blogged about how we replaced another one of our legacy systems with Pingora, which is built in the Rust programming language, using the Tokio runtime. We’ve also blogged about Oxy, our internal framework for building proxies in Rust. We write a lot of Rust, and we’ve gotten pretty good at it.

We built FL2 in Rust, on Oxy, and built a strict module framework to structure all the logic in FL2.

Why Oxy?

When we set out to build FL2, we knew we weren’t just replacing an old system; we were rebuilding the foundations of Cloudflare. That meant we needed more than just a proxy; we needed a framework that could evolve with us, handle the immense scale of our network, and let teams move quickly without sacrificing safety or performance.

Oxy gives us a powerful combination of performance, safety, and flexibility. Built in Rust, it eliminates entire classes of bugs that plagued our Nginx/LuaJIT-based FL1, like memory safety issues and data races, while delivering C-level performance. At Cloudflare’s scale, those guarantees aren’t nice-to-haves, they’re essential. Every microsecond saved per request translates into tangible improvements in user experience, and every crash or edge case avoided keeps the Internet running smoothly. Rust’s strict compile-time guarantees also pair perfectly with FL2’s modular architecture, where we enforce clear contracts between product modules and their inputs and outputs.

But the choice wasn’t just about language. Oxy is the culmination of years of experience building high-performance proxies. It already powers several major Cloudflare services, from our Zero Trust Gateway to Apple’s iCloud Private Relay, so we knew it could handle the diverse traffic patterns and protocol combinations that FL2 would see. Its extensibility model lets us intercept, analyze, and manipulate traffic from layer 3 up to layer 7, and even decapsulate and reprocess traffic at different layers. That flexibility is key to FL2’s design because it means we can treat everything from HTTP to raw IP traffic consistently and evolve the platform to support new protocols and features without rewriting fundamental pieces.

Oxy also comes with a rich set of built-in capabilities that previously required large amounts of bespoke code. Things like monitoring, soft reloads, dynamic configuration loading and swapping are all part of the framework. That lets product teams focus on the unique business logic of their module rather than reinventing the plumbing every time. This solid foundation means we can make changes with confidence, ship them quickly, and trust they’ll behave as expected once deployed.

Smooth restarts - keeping the Internet flowing

One of the most impactful improvements Oxy brings is handling of restarts. Any software under continuous development and improvement will eventually need to be updated. In desktop software, this is easy: you close the program, install the update, and reopen it. On the web, things are much harder. Our software is in constant use and cannot simply stop. A dropped HTTP request can cause a page to fail to load, and a broken connection can kick you out of a video call. Reliability is not optional.

In FL1, upgrades meant restarts of the proxy process. Restarting a proxy meant terminating the process entirely, which immediately broke any active connections. That was particularly painful for long-lived connections such as WebSockets, streaming sessions, and real-time APIs. Even planned upgrades could cause user-visible interruptions, and unplanned restarts during incidents could be even worse.

Oxy changes that. It includes a built-in mechanism for graceful restarts that lets us roll out new versions without dropping connections whenever possible. When a new instance of an Oxy-based service starts up, the old one stops accepting new connections but continues to serve existing ones, allowing those sessions to continue uninterrupted until they end naturally.

This means that if you have an ongoing WebSocket session when we deploy a new version, that session can continue uninterrupted until it ends naturally, rather than being torn down by the restart. Across Cloudflare’s fleet, deployments are orchestrated over several hours, so the aggregate rollout is smooth and nearly invisible to end users.

We take this a step further by using systemd socket activation. Instead of letting each proxy manage its own sockets, we let systemd create and own them. This decouples the lifetime of sockets from the lifetime of the Oxy application itself. If an Oxy process restarts or crashes, the sockets remain open and ready to accept new connections, which will be served as soon as the new process is running. That eliminates the “connection refused” errors that could happen during restarts in FL1 and improves overall availability during upgrades.

We also built our own coordination mechanisms in Rust to replace Go libraries like tableflip with shellflip. This uses a restart coordination socket that validates configuration, spawns new instances, and ensures the new version is healthy before the old one shuts down. This improves feedback loops and lets our automation tools detect and react to failures immediately, rather than relying on blind signal-based restarts.

Composing FL2 from Modules

To avoid the problems we had in FL1, we wanted a design where all interactions between product logic were explicit and easy to understand.

So, on top of the foundations provided by Oxy, we built a platform which separates all the logic built for our products into well-defined modules. After some experimentation and research, we designed a module system which enforces some strict rules:

No IO (input or output) can be performed by the module.
The module provides a list of phases.
Phases are evaluated in a strictly defined order, which is the same for every request.
Each phase defines a set of inputs which the platform provides to it, and a set of outputs which it may emit.

Here’s an example of what a module phase definition looks like:

Phase {
    name: phases::SERVE_ERROR_PAGE,
    request_types_enabled: PHASE_ENABLED_FOR_REQUEST_TYPE,
    inputs: vec![
        InputKind::IPInfo,
        InputKind::ModuleValue(
            MODULE_VALUE_CUSTOM_ERRORS_FETCH_WORKER_RESPONSE.as_str(),
        ),
        InputKind::ModuleValue(MODULE_VALUE_ORIGINAL_SERVE_RESPONSE.as_str()),
        InputKind::ModuleValue(MODULE_VALUE_RULESETS_CUSTOM_ERRORS_OUTPUT.as_str()),
        InputKind::ModuleValue(MODULE_VALUE_RULESETS_UPSTREAM_ERROR_DETAILS.as_str()),
        InputKind::RayId,
        InputKind::StatusCode,
        InputKind::Visitor,
    ],
    outputs: vec![OutputValue::ServeResponse],
    filters: vec![],
    func: phase_serve_error_page::callback,
}

This phase is for our custom error page product. It takes a few things as input — information about the IP of the visitor, some header and other HTTP information, and some “module values.” Module values allow one module to pass information to another, and they’re key to making the strict properties of the module system workable. For example, this module needs some information that is produced by the output of our rulesets-based custom errors product (the “MODULE_VALUE_RULESETS_CUSTOM_ERRORS_OUTPUT” input). These input and output definitions are enforced at compile time.

While these rules are strict, we’ve found that we can implement all our product logic within this framework. The benefit of doing so is that we can immediately tell which other products might affect each other.

How to replace a running system

Building a framework is one thing. Building all the product logic and getting it right, so that customers don’t notice anything other than a performance improvement, is another.

The FL code base supports 15 years of Cloudflare products, and it’s changing all the time. We couldn’t stop development. So, one of our first tasks was to find ways to make the migration easier and safer.

Step 1 - Rust modules in OpenResty

It’s a big enough distraction from shipping products to customers to rebuild product logic in Rust. Asking all our teams to maintain two versions of their product logic, and reimplement every change a second time until we finished our migration was too much.

So, we implemented a layer in our old NGINX and OpenResty based FL which allowed the new modules to be run. Instead of maintaining a parallel implementation, teams could implement their logic in Rust, and replace their old Lua logic with that, without waiting for the full replacement of the old system.

For example, here’s part of the implementation for the custom error page module phase defined earlier (we’ve cut out some of the more boring details, so this doesn’t quite compile as-written):

pub(crate) fn callback(_services: &mut Services, input: &Input<'_>) -> Output {
    // Rulesets produced a response to serve - this can either come from a special
    // Cloudflare worker for serving custom errors, or be directly embedded in the rule.
    if let Some(rulesets_params) = input
        .get_module_value(MODULE_VALUE_RULESETS_CUSTOM_ERRORS_OUTPUT)
        .cloned()
    {
        // Select either the result from the special worker, or the parameters embedded
        // in the rule.
        let body = input
            .get_module_value(MODULE_VALUE_CUSTOM_ERRORS_FETCH_WORKER_RESPONSE)
            .and_then(|response| {
                handle_custom_errors_fetch_response("rulesets", response.to_owned())
            })
            .or(rulesets_params.body);

        // If we were able to load a body, serve it, otherwise let the next bit of logic
        // handle the response
        if let Some(body) = body {
            let final_body = replace_custom_error_tokens(input, &body);

            // Increment a metric recording number of custom error pages served
            custom_pages::pages_served("rulesets").inc();

            // Return a phase output with one final action, causing an HTTP response to be served.
            return Output::from(TerminalAction::ServeResponse(ResponseAction::OriginError {
                rulesets_params.status,
                source: "rulesets http_custom_errors",
                headers: rulesets_params.headers,
                body: Some(Bytes::from(final_body)),
            }));
        }
    }
}

The internal logic in each module is quite cleanly separated from the handling of data, with very clear and explicit error handling encouraged by the design of the Rust language.

Many of our most actively developed modules were handled this way, allowing the teams to maintain their change velocity during our migration.

Step 2 - Testing and automated rollouts

It’s essential to have a seriously powerful test framework to cover such a migration. We built a system, internally named Flamingo, which allows us to run thousands of full end-to-end test requests concurrently against our production and pre-production systems. The same tests run against FL1 and FL2, giving us confidence that we’re not changing behaviours.

Whenever we deploy a change, that change is rolled out gradually across many stages, with increasing amounts of traffic. Each stage is automatically evaluated, and only passes when the full set of tests have been successfully run against it - as well as overall performance and resource usage metrics being within acceptable bounds. This system is fully automated, and pauses or rolls back changes if the tests fail.

The benefit is that we’re able to build and ship new product features in FL2 within 48 hours - where it would have taken weeks in FL1. In fact, at least one of the announcements this week involved such a change!

Step 3 - Fallbacks

Over 100 engineers have worked on FL2, and we have over 130 modules. And we’re not quite done yet. We're still putting the final touches on the system, to make sure it replicates all the behaviours of FL1.

So how do we send traffic to FL2 without it being able to handle everything? If FL2 receives a request, or a piece of configuration for a request, that it doesn’t know how to handle, it gives up and does what we’ve called a fallback - it passes the whole thing over to FL1. It does this at the network level - it just passes the bytes on to FL1.

As well as making it possible for us to send traffic to FL2 without it being fully complete, this has another massive benefit. When we have implemented a piece of new functionality in FL2, but want to double check that it is working the same as in FL1, we can evaluate the functionality in FL2, and then trigger a fallback. We are able to compare the behaviour of the two systems, allowing us to get a high confidence that our implementation was correct.

Step 4 - Rollout

We started running customer traffic through FL2 early in 2025, and have been progressively increasing the amount of traffic served throughout the year. Essentially, we’ve been watching two graphs: one with the proportion of traffic routed to FL2 going up, and another with the proportion of traffic failing to be served by FL2 and falling back to FL1 going down.

We started this process by passing traffic for our free customers through the system. We were able to prove that the system worked correctly, and drive the fallback rates down for our major modules. Our Cloudflare Community MVPs acted as an early warning system, smoke testing and flagging when they suspected the new platform might be the cause of a new reported problem. Crucially their support allowed our team to investigate quickly, apply targeted fixes, or confirm the move to FL2 was not to blame.

We then advanced to our paying customers, gradually increasing the amount of customers using the system. We also worked closely with some of our largest customers, who wanted the performance benefits of FL2, and onboarded them early in exchange for lots of feedback on the system.

Right now, most of our customers are using FL2. We still have a few features to complete, and are not quite ready to onboard everyone, but our target is to turn off FL1 within a few more months.

Impact of FL2

As we described at the start of this post, FL2 is substantially faster than FL1. The biggest reason for this is simply that FL2 performs less work. You might have noticed in the module definition example a line

    filters: vec![],

Every module is able to provide a set of filters, which control whether they run or not. This means that we don’t run logic for every product for every request — we can very easily select just the required set of modules. The incremental cost for each new product we develop has gone away.

Another huge reason for better performance is that FL2 is a single codebase, implemented in a performance focussed language. In comparison, FL1 was based on NGINX (which is written in C), combined with LuaJIT (Lua, and C interface layers), and also contained plenty of Rust modules. In FL1, we spent a lot of time and memory converting data from the representation needed by one language, to the representation needed by another.

As a result, our internal measures show that FL2 uses less than half the CPU of FL1, and much less than half the memory. That’s a huge bonus — we can spend the CPU on delivering more and more features for our customers!

How do we measure if we are getting better?

Using our own tools and independent benchmarks like CDNPerf, we measured the impact of FL2 as we rolled it out across the network. The results are clear: websites are responding 10 ms faster at the median, a 25% performance boost.

Security

FL2 is also more secure by design than FL1. No software system is perfect, but the Rust language brings us huge benefits over LuaJIT. Rust has strong compile-time memory checks and a type system that avoids large classes of errors. Combine that with our rigid module system, and we can make most changes with high confidence.

Of course, no system is secure if used badly. It’s easy to write code in Rust, which causes memory corruption. To reduce risk, we maintain strong compile time linting and checking, together with strict coding standards, testing and review processes.

We have long followed a policy that any unexplained crash of our systems needs to be investigated as a high priority. We won’t be relaxing that policy, though the main cause of novel crashes in FL2 so far has been due to hardware failure. The massively reduced rates of such crashes will give us time to do a good job of such investigations.

What’s next?

We’re spending the rest of 2025 completing the migration from FL1 to FL2, and will turn off FL1 in early 2026. We’re already seeing the benefits in terms of customer performance and speed of development, and we’re looking forward to giving these to all our customers.

We have one last service to completely migrate. The “HTTP & TLS Termination” box from the diagram way back at the top is also an NGINX service, and we’re midway through a rewrite in Rust. We’re making good progress on this migration, and expect to complete it early next year.

After that, when everything is modular, in Rust and tested and scaled, we can really start to optimize! We’ll reorganize and simplify how the modules connect to each other, expand support for non-HTTP traffic like RPC and streams, and much more.

If you’re interested in being part of this journey, check out our careers page for open roles - we’re always looking for new talent to help us to help build a better Internet.

Eliminating Cold Starts 2: shard and conquer

Harris Hancock — Fri, 26 Sep 2025 13:00:00 GMT

Five years ago, we announced that we were Eliminating Cold Starts with Cloudflare Workers. In that episode, we introduced a technique to pre-warm Workers during the TLS handshake of their first request. That technique takes advantage of the fact that the TLS Server Name Indication (SNI) is sent in the very first message of the TLS handshake. Armed with that SNI, we often have enough information to pre-warm the request’s target Worker.

Eliminating cold starts by pre-warming Workers during TLS handshakes was a huge step forward for us, but “eliminate” is a strong word. Back then, Workers were still relatively small, and had cold starts constrained by limits explained later in this post. We’ve relaxed those limits, and users routinely deploy complex applications on Workers, often replacing origin servers. Simultaneously, TLS handshakes haven’t gotten any slower. In fact, TLS 1.3 only requires a single round trip for a handshake – compared to three round trips for TLS 1.2 – and is more widely used than it was in 2021.

Earlier this month, we finished deploying a new technique intended to keep pushing the boundary on cold start reduction. The new technique (or old, depending on your perspective) uses a consistent hash ring to take advantage of our global network. We call this mechanism “Worker sharding”.

What’s in a cold start?

A Worker is the basic unit of compute in our serverless computing platform. It has a simple lifecycle. We instantiate it from source code (typically JavaScript), make it serve a bunch of requests (often HTTP, but not always), and eventually shut it down some time after it stops receiving traffic, to re-use its resources for other Workers. We call that shutdown process “eviction”.

The most expensive part of the Worker’s lifecycle is the initial instantiation and first request invocation. We call this part a “cold start”. Cold starts have several phases: fetching the script source code, compiling the source code, performing a top-level execution of the resulting JavaScript module, and finally, performing the initial invocation to serve the incoming HTTP request that triggered the whole sequence of events in the first place.

Cold starts have become longer than TLS handshakes

Fundamentally, our TLS handshake technique depends on the handshake lasting longer than the cold start. This is because the duration of the TLS handshake is time that the visitor must spend waiting, regardless, so it’s beneficial to everyone if we do as much work during that time as possible. If we can run the Worker’s cold start in the background while the handshake is still taking place, and if that cold start finishes before the handshake, then the request will ultimately see zero cold start delay. If, on the other hand, the cold start takes longer than the TLS handshake, then the request will see some part of the cold start delay – though the technique still helps reduce that visible delay.

In the early days, TLS handshakes lasting longer than Worker cold starts was a safe bet, and cold starts typically won the race. One of our early blog posts explaining how our platform works mentions 5 millisecond cold start times – and that was correct, at the time!

For every limit we have, our users have challenged us to relax them. Cold start times are no different.

There are two crucial limits which affect cold start time: Worker script size and the startup CPU time limit. While we didn’t make big announcements at the time, we have quietly raised both of those limits since our last Eliminating Cold Starts blog post:

Worker script size (compressed) increased from 1 MB to 5 MB, then again from 5 MB to 10 MB, for paying users.
Worker script size (compressed) increased from 1 MB to 3 MB for free users.
Startup CPU time increased from 200ms to 400ms.

We relaxed these limits because our users wanted to deploy increasingly complex applications to our platform. And deploy they did! But the increases have a cost:

Increasing script size increases the amount of data we must transfer from script storage to the Workers runtime.
Increasing script size also increases the time complexity of the script compilation phase.
Increasing the startup CPU time limit increases the maximum top-level execution time.

Taken together, cold starts for complex applications began to lose the TLS handshake race.

Routing requests to an existing Worker

With relaxed script size and startup time limits, optimizing cold start time directly was a losing battle. Instead, we needed to figure out how to reduce the absolute number of cold starts, so that requests are simply less likely to incur one.

One option is to route requests to existing Worker instances, where before we might have chosen to start a new instance.

Previously, we weren’t particularly good at routing requests to existing Worker instances. We could trivially coalesce requests to a single Worker instance if they happened to land on a machine which already hosted a Worker, because in that case it’s not a distributed systems problem. But what if a Worker already existed in our data center on a different server, and some other server received a request for the Worker? We would always choose to cold start a new Worker on the machine which received the request, rather than forward the request to the machine with the already-existing Worker, even though forwarding the request would avoid the cold start.

To drive the point home: Imagine a visitor sends one request per minute to a data center with 300 servers, and that the traffic is load balanced evenly across all servers. On average, each server will receive one request every five hours. In particularly busy data centers, this span of time could be long enough that we need to evict the Worker to re-use its resources, resulting in a 100% cold start rate. That’s a terrible experience for the visitor.

Consequently, we found ourselves explaining to users, who saw high latency while prototyping their applications, that their latency would counterintuitively decrease once they put sufficient traffic on our network. This highlighted the inefficiency in our original, simple design.

If, instead, those requests were all coalesced onto one single server, we would notice multiple benefits. The Worker would receive one request per minute, which is short enough to virtually guarantee that it won’t be evicted. This would mean the visitor may experience a single cold start, and then have a 100% “warm request rate.” We would also use 99.7% (299 / 300) less memory serving this traffic. This makes room for other Workers, decreasing their eviction rate, and increasing their warm request rates, too – a virtuous cycle!

There’s a cost to coalescing requests to a single instance, though, right? After all, we’re adding latency to requests if we have to proxy them around the data center to a different server.

In practice, the added time-to-first-byte is less than one millisecond, and is the subject of continual optimization by our IPC and performance teams. One millisecond is far less than a typical cold start, meaning it’s always better, in every measurable way, to proxy a request to a warm Worker than it is to cold start a new one.

The consistent hash ring

A solution to this very problem lies at the heart of many of our products, including one of our oldest: the HTTP cache in our Content Delivery Network.

When a visitor requests a cacheable web asset through Cloudflare, the request gets routed through a pipeline of proxies. One of those proxies is a caching proxy, which stores the asset for later, so we can serve it to future requests without having to request it from the origin again.

A Worker cold start is analogous to an HTTP cache miss, in that a request to a warm Worker is like an HTTP cache hit.

When our standard HTTP proxy pipeline routes requests to the caching layer, it chooses a cache server based on the request's cache key to optimize the HTTP cache hit rate. The cache key is the request’s URL, plus some other details. This technique is often called “sharding”. The servers are considered to be individual shards of a larger, logical system – in this case a data center’s HTTP cache. So, we can say things like, “Each data center contains one logical HTTP cache, and that cache is sharded across every server in the data center.”

Until recently, we could not make the same claim about the set of Workers in a data center. Instead, each server contained its own standalone set of Workers, and they could easily duplicate effort.

We borrow the cache’s trick to solve that. In fact, we even use the same type of data structure used by our HTTP cache to choose servers: a consistent hash ring. A naive sharding implementation might use a classic hash table mapping Worker script IDs to server addresses. That would work fine for a set of servers which never changes. But servers are actually ephemeral and have their own lifecycle. They can crash, get rebooted, taken out for maintenance, or decommissioned. New ones can come online. When these events occur, the size of the hash table would change, necessitating a re-hashing of the whole table. Every Worker’s home server would change, and all sharded Workers would be cold started again!

A consistent hash ring improves this scenario significantly. Instead of establishing a direct correspondence between script IDs and server addresses, we map them both to a number line whose end wraps around to its beginning, also known as a ring. To look up the home server of a Worker, first we hash its script, and then we find where it lies on the ring. Next, we take the server address which comes directly on or after that position on the ring, and consider that the Worker’s home.

If a new server appears for some reason, all the Workers that lie before it on the ring get re-homed, but none of the other Workers are disturbed. Similarly, if a server disappears, all the Workers which lay before it on the ring get re-homed.

We refer to the Worker’s home server as the “shard server”. In request flows involving sharding, there is also a “shard client”. It’s also a server! The shard client initially receives a request, and, using its consistent hash ring, looks up which shard server it should send the request to. I’ll be using these two terms – shard client and shard server – in the rest of this post.

Handling overload

The nature of HTTP assets lend themselves well to sharding. If they are cacheable, they are static, at least for their cache Time to Live (TTL) duration. So, serving them requires time and space complexity which scales linearly with their size.

But Workers aren’t JPEGs. They are live units of compute which can use up to five minutes of CPU time per request. Their time and space complexity do not necessarily scale with their input size, and can vastly outstrip the amount of computing power we must dedicate to serving even a huge file from cache.

This means that individual Workers can easily get overloaded when given sufficient traffic. So, no matter what we do, we need to keep in mind that we must be able to scale back up to infinity. We will never be able to guarantee that a data center has only one instance of a Worker, and we must always be able to horizontally scale at the drop of a hat to support burst traffic. Ideally this is all done without producing any errors.

This means that a shard server must have the ability to refuse requests to invoke Workers on it, and shard clients must always gracefully handle this scenario.

Two load shedding options

I am aware of two general solutions to shedding load gracefully, without serving errors.

In the first solution, the client asks politely if it may issue the request. It then sends the request if it receives a positive response. If it instead receives a “go away” response, it handles the request differently, like serving it locally. In HTTP, this pattern can be found in Expect: 100-continue semantics. The main downside is that this introduces one round-trip of latency to set the expectation of success before the request can be sent. (Note that a common naive solution is to just retry requests. This works for some kinds of requests, but is not a general solution, as requests may carry arbitrarily large bodies.)

The second general solution is to send the request without confirming that it can be handled by the server, then count on the server to forward the request elsewhere if it needs to. This could even be back to the client. This avoids the round-trip of latency that the first solution incurs, but there is a tradeoff: It puts the shard server in the request path, pumping bytes back to the client. Fortunately, we have a trick to minimize the amount of bytes we actually have to send back in this fashion, which I’ll describe in the next section.

Optimistically sending sharded requests

There are a couple of reasons why we chose to optimistically send sharded requests without waiting for permission.

The first reason of note is that we expect to see very few of these refused requests in practice. The reason is simple: If a shard client receives a refusal for a Worker, then it must cold start the Worker locally. As a consequence, it can serve all future requests locally without incurring another cold start. So, after a single refusal, the shard client won’t shard that Worker any more (until traffic for the Worker tapers off enough for an eviction, at least).

Generally, this means we expect that if a request gets sharded to a different server, the shard server will most likely accept the request for invocation. Since we expect success, it makes a lot more sense to optimistically send the entire request to the shard server than it does to incur a round-trip penalty to establish permission first.

The second reason is that we have a trick to avoid paying too high a cost for proxying the request back to the client, as I mentioned above.

We implement our cross-instance communication in the Workers runtime using Cap’n Proto RPC, whose distributed object model enables some incredible features, like JavaScript-native RPC. It is also the elder, spiritual sibling to the just-released Cap’n Web.

In the case of sharding, Cap’n Proto makes it very easy to implement an optimal request refusal mechanism. When the shard client assembles the sharded request, it includes a handle (called a capability in Cap’n Proto) to a lazily-loaded local instance of the Worker. This lazily-loaded instance has the same exact interface as any other Worker exposed over RPC. The difference is just that it’s lazy – it doesn’t get cold started until invoked. In the event the shard server decides it must refuse the request, it does not return a “go away” response, but instead returns the shard client’s own lazy capability!

The shard client’s application code only sees that it received a capability from the shard server. It doesn’t know where that capability is actually implemented. But the shard client’s RPC system does know where the capability lives! Specifically, it recognizes that the returned capability is actually a local capability – the same one that it passed to the shard server. Once it realizes this, it also realizes that any request bytes it continues to send to the shard server will just come looping back. So, it stops sending more request bytes, waits to receive back from the shard server all the bytes it already sent, and shortens the request path as soon as possible. This takes the shard server entirely out of the loop, preventing a “trombone effect.”

Workers invoking Workers

With load shedding behavior figured out, we thought the hard part was over.

But, of course, Workers may invoke other Workers. There are many ways this could occur, most obviously via Service Bindings. Less obviously, many of our favorite features, such as Workers KV, are actually cross-Worker invocations. But there is one product, in particular, that stands out for its powerful ability to invoke other Workers: Workers for Platforms.

Workers for Platforms allows you to run your own functions-as-a-service on Cloudflare infrastructure. To use the product, you deploy three special types of Workers:

a dynamic dispatch Worker
any number of user Workers
an optional, parameterized outbound Worker

A typical request flow for Workers for Platforms goes like so: First, we invoke the dynamic dispatch Worker. The dynamic dispatch Worker chooses and invokes a user Worker. Then, the user Worker invokes the outbound Worker to intercept its subrequests. The dynamic dispatch Worker chose the outbound Worker's arguments prior to invoking the user Worker.

To really amp up the fun, the dynamic dispatch Worker could have a tail Worker attached to it. This tail Worker would need to be invoked with traces related to all the preceding invocations. Importantly, it should be invoked one single time with all events related to the request flow, not invoked multiple times for different fragments of the request flow.

You might further ask, can you nest Workers for Platforms? I don’t know the official answer, but I can tell you that the code paths do exist, and they do get exercised.

To support this nesting doll of Workers, we keep a context stack during invocations. This context includes things like ownership overrides, resource limit overrides, trust levels, tail Worker configurations, outbound Worker configurations, feature flags, and so on. This context stack was manageable-ish when everything was executed on a single thread. For sharding to be truly useful, though, we needed to be able to move this context stack around to other machines.

Our choice of Cap’n Proto RPC as our primary communications medium helped us make sense of it all. To shard Workers deep within a stack of invocations, we serialize the context stack into a Cap’n Proto data structure and send it to the shard server. The shard server deserializes it into native objects, and continues the execution where things left off.

As with load shedding, Cap’n Proto’s distributed object model provides us simple answers to otherwise difficult questions. Take the tail Worker question – how do we coalesce tracing data from invocations which got fanned out across any number of other servers back to one single place? Easy: create a capability (a live Cap’n Proto object) for a reportTraces() callback on the dynamic dispatch Worker’s home server, and put that in the serialized context stack. Now, that context stack can be passed around at will. That context stack will end up in multiple places: At a minimum, it will end up on the user Worker’s shard server and the outbound Worker’s shard server. It may also find its way to other shard servers if any of those Workers invoked service bindings! Each of those shard servers can call the reportTraces() callback, and be confident that the data will make its way back to the right place: the dynamic dispatch Worker’s home server. None of those shard servers need to actually know where that home server is. Phew!

Eviction rates down, warm request rates up

Features like this are always satisfying to roll out, because they produce graphs showing huge efficiency gains.

Once fully rolled out, only about 4% of total requests from enterprise traffic ended up being sharded. To put that another way, 96% of all enterprise requests are to Workers which are sufficiently loaded that we must run multiple instances of them in a data center.

Despite that low total rate of sharding, we reduced our global Worker eviction rate by 10x.

Our eviction rate is a measure of memory pressure within our system. You can think of it like garbage collection at a macro level, and it has the same implications. Fewer evictions means our system uses memory more efficiently. This has the happy consequence of using less CPU to clean up our memory. More relevant to Workers users, the increased efficiency means we can keep Workers in memory for an order of magnitude longer, improving their warm request rate and reducing their latency.

The high leverage shown – sharding just 4% of our traffic to improve memory efficiency by 10x – is a consequence of the power-law distribution of Internet traffic.

A power law distribution is a phenomenon which occurs across many fields of science, including linguistics, sociology, physics, and, of course, computer science. Events which follow power law distributions typically see a huge amount clustered in some small number of “buckets”, and the rest spread out across a large number of those “buckets”. Word frequency is a classic example: A small handful of words like “the”, “and”, and “it” occur in texts with extremely high frequency, while other words like “eviction” or “trombone” might occur only once or twice in a text.

In our case, the majority of Workers requests goes to a small handful of high-traffic Workers, while a very long tail goes to a huge number of low-traffic Workers. The 4% of requests which were sharded are all to low-traffic Workers, which are the ones that benefit the most from sharding.

So did we eliminate cold starts? Or will there be an Eliminating Cold Starts 3 in our future?

For enterprise traffic, our warm request rate increased from 99.9% to 99.99% – that’s three 9’s to four 9’s. Conversely, this means that the cold start rate went from 0.1% to 0.01% of requests, a 10x decrease. A moment’s thought, and you’ll realize that this is coherent with the eviction rate graph I shared above: A 10x decrease in the number of Workers we destroy over time must imply we’re creating 10x fewer to begin with.

Simultaneously, our warm request rate became less volatile throughout the course of the day.

Hmm.

I hate to admit this to you, but I still notice a little bit of space at the top of the graph. 😟

Can you help us get to five 9’s?

Safe in the sandbox: security hardening for Cloudflare Workers

Erik Corry — Thu, 25 Sep 2025 14:00:00 GMT

As a serverless cloud provider, we run your code on our globally distributed infrastructure. Being able to run customer code on our network means that anyone can take advantage of our global presence and low latency. Workers isn’t just efficient though, we also make it simple for our users. In short: You write code. We handle the rest.

Part of 'handling the rest' is making Workers as secure as possible. We have previously written about our security architecture. Making Workers secure is an interesting problem because the whole point of Workers is that we are running third party code on our hardware. This is one of the hardest security problems there is: any attacker has the full power available of a programming language running on the victim's system when they are crafting their attacks.

This is why we are constantly updating and improving the Workers Runtime to take advantage of the latest improvements in both hardware and software. This post shares some of the latest work we have been doing to keep Workers secure.

Some background first: Workers is built around the V8 JavaScript runtime, originally developed for Chromium-based browsers like Chrome. This gives us a head start, because V8 was forged in an adversarial environment, where it has always been under intense attack and scrutiny. Like Workers, Chromium is built to run adversarial code safely. That's why V8 is constantly being tested against the best fuzzers and sanitizers, and over the years, it has been hardened with new technologies like Oilpan/cppgc and improved static analysis.

We use V8 in a slightly different way, though, so we will be describing in this post how we have been making some changes to V8 to improve security in our use case.

Hardware-assisted security improvements from Memory Protection Keys

Modern CPUs from Intel, AMD, and ARM have support for memory protection keys, sometimes called PKU, Protection Keys for Userspace. This is a great security feature which increases the power of virtual memory and memory protection.

Traditionally, the memory protection features of the CPU in your PC or phone were mainly used to protect the kernel and to protect different processes from each other. Within each process, all threads had access to the same memory. Memory protection keys allow us to prevent specific threads from accessing memory regions they shouldn't have access to.

V8 already uses memory protection keys for the JIT compilers. The JIT compilers for a language like JavaScript generate optimized, specialized versions of your code as it runs. Typically, the compiler is running on its own thread, and needs to be able to write data to the code area in order to install its optimized code. However, the compiler thread doesn't need to be able to run this code. The regular execution thread, on the other hand, needs to be able to run, but not modify, the optimized code. Memory protection keys offer a way to give each thread the permissions it needs, but no more. And the V8 team in the Chromium project certainly aren't standing still. They describe some of their future plans for memory protection keys here.

In Workers, we have some different requirements than Chromium. The security architecture for Workers uses V8 isolates to separate different scripts that are running on our servers. (In addition, we have extra mitigations to harden the system against Spectre attacks). If V8 is working as intended, this should be enough, but we believe in defense in depth: multiple, overlapping layers of security controls.

That's why we have deployed internal modifications to V8 to use memory protection keys to isolate the isolates from each other. There are up to 15 different keys available on a modern x64 CPU and a few are used for other purposes in V8, so we have about 12 to work with. We give each isolate a random key which is used to protect its V8 heap data, the memory area containing the JavaScript objects a script creates as it runs. This means security bugs that might previously have allowed an attacker to read data from a different isolate would now hit a hardware trap in 92% of cases. (Assuming 12 keys, 92% is about 11/12.)

The illustration shows an attacker attempting to read from a different isolate. Most of the time this is detected by the mismatched memory protection key, which kills their script and notifies us, so we can investigate and remediate. The red arrow represents the case where the attacker got lucky by hitting an isolate with the same memory protection key, represented by the isolates having the same colors.

However, we can further improve on a 92% protection rate. In the last part of this blog post we'll explain how we can lift that to 100% for a particular common scenario. But first, let's look at a software hardening feature in V8 that we are taking advantage of.

The V8 sandbox, a software-based security boundary

Over the past few years, V8 has been gaining another defense in depth feature: the V8 sandbox. (Not to be confused with the layer 2 sandbox which Workers have been using since the beginning.) The V8 sandbox has been a multi-year project that has been gaining maturity for a while. The sandbox project stems from the observation that many V8 security vulnerabilities start by corrupting objects in the V8 heap memory. Attackers then leverage this corruption to reach other parts of the process, giving them the opportunity to escalate and gain more access to the victim's browser, or even the entire system.

V8's sandbox project is an ambitious software security mitigation that aims to thwart that escalation: to make it impossible for the attacker to progress from a corruption on the V8 heap to a compromise of the rest of the process. This means, among other things, removing all pointers from the heap. But first, let's explain in as simple terms as possible, what a memory corruption attack is.

Memory corruption attacks

A memory corruption attack tricks a program into misusing its own memory. Computer memory is just a store of integers, where each integer is stored in a location. The locations each have an address, which is also just a number. Programs interpret the data in these locations in different ways, such as text, pixels, or pointers. Pointers are addresses that identify a different memory location, so they act as a sort of arrow that points to some other piece of data.

Here's a concrete example, which uses a buffer overflow. This is a form of attack that was historically common and relatively simple to understand: Imagine a program has a small buffer (like a 16-character text field) followed immediately by an 8-byte pointer to some ordinary data. An attacker might send the program a 24-character string, causing a "buffer overflow." Because of a vulnerability in the program, the first 16 characters fill the intended buffer, but the remaining 8 characters spill over and overwrite the adjacent pointer.

^{See below for how such an attack would now be thwarted.}

Now the pointer has been redirected to point at sensitive data of the attacker's choosing, rather than the normal data it was originally meant to access. When the program tries to use what it believes is its normal pointer, it's actually accessing sensitive data chosen by the attacker.

This type of attack works in steps: first create a small confusion (like the buffer overflow), then use that confusion to create bigger problems, eventually gaining access to data or capabilities the attacker shouldn't have. The attacker can eventually use the misdirection to either steal information or plant malicious data that the program will treat as legitimate.

This was a somewhat abstract description of memory corruption attacks using a buffer overflow, one of the simpler techniques. For some much more detailed and recent examples, see this description from Google, or this breakdown of a V8 vulnerability.

Compressed pointers in V8

Many attacks are based on corrupting pointers, so ideally we would remove all pointers from the memory of the program. Since an object-oriented language's heap is absolutely full of pointers, that would seem, on its face, to be a hopeless task, but it is enabled by an earlier development. Starting in 2020, V8 has offered the option of saving memory by using compressed pointers. This means that, on a 64-bit system, the heap uses only 32 bit offsets, relative to a base address. This limits the total heap to maximally 4 GiB, a limitation that is acceptable for a browser, and also fine for individual scripts running in a V8 isolate on Cloudflare Workers.

^{An artificial object with various fields, showing how the layout differs in a compressed vs. an uncompressed heap. The boxes are 64 bits wide.}

If the whole of the heap is in a single 4 GiB area then the first 32 bits of all pointers will be the same, and we don't need to store them in every pointer field in every object. In the diagram we can see that the object pointers all start with 0x12345678, which is therefore redundant and doesn't need to be stored. This means that object pointer fields and integer fields can be reduced from 64 to 32 bits.

We still need 64 bit fields for some fields like double precision floats and for the sandbox offsets of buffers, which are typically used by the script for input and output data. See below for details.

Integers in an uncompressed heap are stored in the high 32 bits of a 64 bit field. In the compressed heap, the top 31 bits of a 32 bit field are used. In both cases the lowest bit is set to 0 to indicate integers (as opposed to pointers or offsets).

Conceptually, we have two methods for compressing and decompressing, using a base address that is divisible by 4 GiB:

// Decompress a 32 bit offset to a 64 bit pointer by adding a base address.
void* Decompress(uint32_t offset) { return base + offset; }
// Compress a 64 bit pointer to a 32 bit offset by discarding the high bits.
uint32_t Compress(void* pointer) { return (intptr_t)pointer & 0xffffffff; }

This pointer compression feature, originally primarily designed to save memory, can be used as the basis of a sandbox.

From compressed pointers to the sandbox

The biggest 32-bit unsigned integer is about 4 billion, so the Decompress() function cannot generate any pointer that is outside the range [base, base + 4 GiB]. You could say the pointers are trapped in this area, so it is sometimes called the pointer cage. V8 can reserve 4 GiB of virtual address space for the pointer cage so that only V8 objects appear in this range. By eliminating all pointers from this range, and following some other strict rules, V8 can contain any memory corruption by an attacker to this cage. Even if an attacker corrupts a 32 bit offset within the cage, it is still only a 32 bit offset and can only be used to create new pointers that are still trapped within the pointer cage.

^{The buffer overflow attack from earlier no longer works because only the attacker's own data is available in the pointer cage.}

To construct the sandbox, we take the 4 GiB pointer cage and add another 4 GiB for buffers and other data structures to make the 8 GiB sandbox. This is why the buffer offsets above are 33 bits, so they can reach buffers in the second half of the sandbox (40 bits in Chromium with larger sandboxes). V8 stores these buffer offsets in the high 33 bits and shifts down by 31 bits before use, in case an attacker corrupted the low bits.

Cloudflare Workers have made use of compressed pointers in V8 for a while, but for us to get the full power of the sandbox we had to make some changes. Until recently, all isolates in a process had to be one single sandbox if you were using the sandboxed configuration of V8. This would have limited the total size of all V8 heaps to be less than 4 GiB, far too little for our architecture, which relies on serving 1000s of scripts at once.

That's why we commissioned Igalia to add isolate groups to V8. Each isolate group has its own sandbox and can have 1 or more isolates within it. Building on this change we have been able to start using the sandbox, eliminating a whole class of potential security issues in one stroke. Although we can place multiple isolates in the same sandbox, we are currently only putting a single isolate in each sandbox.

^{The layout of the sandbox. In the sandbox there can be more than one isolate, but all their heap pages must be in the pointer cage: the first 4 GiB of the sandbox. Instead of pointers between the objects, we use 32 bit offsets. The offsets for the buffers are 33 bits, so they can reach the whole sandbox, but not outside it.}

Virtual memory isn't infinite, there's a lot going on in a Linux process

At this point, we were not quite done, though. Each sandbox reserves 8 GiB of space in the virtual memory map of the process, and it must be 4 GiB aligned for efficiency. It uses much less physical memory, but the sandbox mechanism requires this much virtual space for its security properties. This presents us with a problem, since a Linux process 'only' has 128 TiB of virtual address space in a 4-level page table (another 128 TiB are reserved for the kernel, not available to user space).

At Cloudflare, we want to run Workers as efficiently as possible to keep costs and prices down, and to offer a generous free tier. That means that on each machine we have so many isolates running (one per sandbox) that it becomes hard to place them all in a 128 TiB space.

Knowing this, we have to place the sandboxes carefully in memory. Unfortunately, the Linux syscall, mmap, does not allow us to specify the alignment of an allocation unless you can guess a free location to request. To get an 8 GiB area that is 4 GiB aligned, we have to ask for 12 GiB, then find the aligned 8 GiB area that must exist within that, and return the unused (hatched) edges to the OS:

If we allow the Linux kernel to place sandboxes randomly, we end up with a layout like this with gaps. Especially after running for a while, there can be both 8 GiB and 4 GiB gaps between sandboxes:

Sadly, because of our 12 GiB alignment trick, we can't even make use of the 8 GiB gaps. If we ask the OS for 12 GiB, it will never give us a gap like the 8 GiB gap between the green and blue sandboxes above. In addition, there are a host of other things going on in the virtual address space of a Linux process: the malloc implementation may want to grab pages at particular addresses, the executable and libraries are mapped at a random location by ASLR, and V8 has allocations outside the sandbox.

The latest generation of x64 CPUs supports a much bigger address space, which solves both problems, and Linux kernels are able to make use of the extra bits with five level page tables. A process has to opt into this, which is done by a single mmap call suggesting an address outside the 47 bit area. The reason this needs an opt-in is that some programs can't cope with such high addresses. Curiously, V8 is one of them.

This isn't hard to fix in V8, but not all of our fleet has been upgraded yet to have the necessary hardware. So for now, we need a solution that works with the existing hardware. We have modified V8 to be able to grab huge memory areas and then use mprotect syscalls to create tightly packed 8 GiB spaces for sandboxes, bypassing the inflexible mmap API.

Putting it all together

Taking control of the sandbox placement like this actually gives us a security benefit, but first we need to describe a particular threat model.

We assume for the purposes of this threat model that an attacker has an arbitrary way to corrupt data within the sandbox. This is historically the first step in many V8 exploits. So much so that there is a special tier in Google's V8 bug bounty program where you may assume you have this ability to corrupt memory, and they will pay out if you can leverage that to a more serious exploit.

However, we assume that the attacker does not have the ability to execute arbitrary machine code. If they did, they could disable memory protection keys. Having access to the in-sandbox memory only gives the attacker access to their own data. So the attacker must attempt to escalate, by corrupting data inside the sandbox to access data outside the sandbox.

You will recall that the compressed, sandboxed V8 heap only contains 32 bit offsets. Therefore, no corruption there can reach outside the pointer cage. But there are also arrays in the sandbox — vectors of data with a given size that can be accessed with an index. In our threat model, the attacker can modify the sizes recorded for those arrays and the indexes used to access elements in the arrays. That means an attacker could potentially turn an array in the sandbox into a tool for accessing memory incorrectly. For this reason, the V8 sandbox normally has guard regions around it: These are 32 GiB virtual address ranges that have no virtual-to-physical address mappings. This helps guard against the worst case scenario: Indexing an array where the elements are 8 bytes in size (e.g. an array of double precision floats) using a maximal 32 bit index. Such an access could reach a distance of up to 32 GiB outside the sandbox: 8 times the maximal 32 bit index of four billion.

We want such accesses to trigger an alarm, rather than letting an attacker access nearby memory. This happens automatically with guard regions, but we don't have space for conventional 32 GiB guard regions around every sandbox.

Instead of using conventional guard regions, we can make use of memory protection keys. By carefully controlling which isolate group uses which key, we can ensure that no sandbox within 32 GiB has the same protection key. Essentially, the sandboxes are acting as each other's guard regions, protected by memory protection keys. Now we only need a wasted 32 GiB guard region at the start and end of the huge packed sandbox areas.

With the new sandbox layout, we use strictly rotating memory protection keys. Because we are not using randomly chosen memory protection keys, for this threat model the 92% problem described above disappears. Any in-sandbox security issue is unable to reach a sandbox with the same memory protection key. In the diagram, we show that there is no memory within 32 GiB of a given sandbox that has the same memory protection key. Any attempt to access memory within 32 GiB of a sandbox will trigger an alarm, just like it would with unmapped guard regions.

The future

In a way, this whole blog post is about things our customers don't need to do. They don't need to upgrade their server software to get the latest patches, we do that for them. They don't need to worry whether they are using the most secure or efficient configuration. So there's no call to action here, except perhaps to sleep easy.

However, if you find work like this interesting, and especially if you have experience with the implementation of V8 or similar language runtimes, then you should consider coming to work for us. We are recruiting both in the US and in Europe. It's a great place to work, and Cloudflare is going from strength to strength.

Building Jetflow: a framework for flexible, performant data pipelines at Cloudflare

Harry Hough — Wed, 23 Jul 2025 14:00:00 GMT

The Cloudflare Business Intelligence team manages a petabyte-scale data lake and ingests thousands of tables every day from many different sources. These include internal databases such as Postgres and ClickHouse, as well as external SaaS applications such as Salesforce. These tasks are often complex and tables may have hundreds of millions or billions of rows of new data each day. They are also business-critical for product decisions, growth plannings, and internal monitoring. In total, about 141 billion rows are ingested every day.

As Cloudflare has grown, the data has become ever larger and more complex. Our existing Extract Load Transform (ELT) solution could no longer meet our technical and business requirements. After evaluating other common ELT solutions, we concluded that their performance generally did not surpass our current system, either.

It became clear that we needed to build our own framework to cope with our unique requirements — and so Jetflow was born.

What we achieved

Over 100x efficiency improvement in GB-s:

Our longest running job with 19 billion rows was taking 48 hours using 300 GB of memory, and now completes in 5.5 hours using 4 GB of memory
We estimate that ingestion of 50 TB from Postgres via Jetflow could cost under $100 based on rates published by commercial cloud providers

>10x performance improvement:

Our largest dataset was ingesting 60-80,000 rows per second, this is now 2-5 million rows per second per database connection.
In addition, these numbers scale well with multiple database connections for some databases.

Extensibility:

The modular design makes it easy to extend and test. Today Jetflow works with ClickHouse, Postgres, Kafka, many different SaaS APIs, Google BigQuery and many others. It has continued to work well and remain flexible with the addition of new use cases.

How did we do this?

Requirements

The first step to designing our new framework had to be a clear understanding of the problems we were aiming to solve, with clear requirements to stop us creating new ones.

Performant & efficient

We needed to be able to move more data in less time as some ingestion jobs were taking ~24 hours, and our data will only grow. The data should be ingested in a streaming fashion and use less memory and compute resources than our existing solution.

Backwards compatible

Given the daily ingestion of thousands of tables, the chosen solution needed to allow for the migration of individual tables as needed. Due to our usage of Spark downstream and Spark's limitations in merging disparate Parquet schemas, the chosen solution had to offer the flexibility to generate the precise schemas needed for each case to match legacy.

We also required seamless integration with our custom metadata system, used for dependency checks and job status information.

Ease of use

We want a configuration file that can be version-controlled, without introducing bottlenecks on repositories with many concurrent changes.

To increase accessibility for different roles within the team, another requirement was no-code (or configuration as code) in the vast majority of cases. Users should not have to worry about availability or translation of data types between source and target systems, or writing new code for each new ingestion. The configuration needed should also be minimal — for example, data schema should be inferred from the source system and not need to be supplied by the user.

Customizable

Striking a balance with the no-code requirement above, although we want a low bar of entry we also want to have the option to tune and override options if desired, with a flexible and optional configuration layer. For example, writing Parquet files is often more expensive than reading from the database, so we want to be able to allocate more resources and concurrency as needed.

Additionally, we wanted to allow for control over where the work is executed, with the ability to spin up concurrent workers in different threads, different containers, or on different machines. The execution of workers and communication of data was abstracted away with an interface, and different implementations can be written and injected, controlled via the job configuration.

Testable

We wanted a solution capable of running locally in a containerized environment, which would allow us to write tests for every stage of the pipeline. With “black box” solutions, testing often means validating the output after making a change, which is a slow feedback loop, risks not testing all edge cases as there isn’t good visibility of all code paths internally, and makes debugging issues painful.

Designing a flexible framework

To build a truly flexible framework, we broke the pipeline down into distinct stages, and then create a config layer to define the composition of the pipeline from these stages, and any configuration overrides. Every pipeline configuration that makes sense logically should execute correctly, and users should not be able to create pipeline configs that do not work.

Pipeline configuration

This led us to a design where we created stages which were classified according to the meaningfully different categories of:

Consumers
Transformers
Loaders

The pipeline was constructed via a YAML file that required a consumer, zero or more transformers, and at least one loader. Consumers create a data stream (via reading from the source system), Transformers (e.g. data transformations, validations) take a data stream input and output a data stream conforming to the same API so that they can be chained, and Loaders have the same data streaming interface, but are the stages with persistent effects — i.e. stages where data is saved to an external system.

This modular design means that each stage is independently testable, with shared behaviour (such as error handling and concurrency) inherited from shared base stages, significantly decreasing development time for new use cases and increasing confidence in code correctness.

Data divisions

Next, we designed a breakdown for the data that would allow the pipeline to be idempotent both on whole pipeline re-run and also on internal retry of any data partition due to transient error. We decided on a design that let us parallelize processing, while maintaining meaningful data divisions that allowed the pipeline to perform cleanups of data where required for a retry.

RunInstance: the least granular division, corresponding to a business unit for a single run of the pipeline (e.g. one month/day/hour of data).
Partition: a division of the RunInstance that allows each row to be allocated to a partition in a way that is deterministic and self-evident from the row data without external state, and is therefore idempotent on retry. (e.g. an accountId range, a 10-minute interval)
Batch: a division of the partition data that is non-deterministic and used only to break the data down into smaller chunks for streaming/parallel processing for faster processing with fewer resources. (e.g. 10k rows, 50 MB)

The options that the user configures in the consumer stage YAML both construct the query that is used to retrieve the data from the source system, and also encode the semantic meaning of this data division in a system agnostic way, so that later stages understand what this data represents — e.g. this partition contains the data for all accounts IDs 0-500. This means that we can do targeted data cleanup and avoid, for example, duplicate data entries if a single data partition is retried due to error.

Framework implementation

Standard internal state for stage compatibility

Our most common use case is something like read from a database, convert to Parquet format, and then save to object storage, with each of these steps being a separate stage. As more use cases were onboarded to Jetflow, we had to make sure that if someone wrote a new stage it would be compatible with the other stages. We don’t want to create a situation where new code needs to be written for every output format and target system, or you end up with a custom pipeline for every different use case.

The way we have solved this problem is by having our stage extractor class only allow output data in a single format. This means as long as any downstream stages support this format as in the input and output format they would be compatible with the rest of the pipeline. This seems obvious in retrospect, but internally was a painful learning experience, as we originally created a custom type system and struggled with stage interoperability.

For this internal format, we chose to use Arrow, an in-memory columnar data format. The key benefits of this format for us are:

Arrow ecosystem: Many data projects now support Arrow as an output format. This means when we write extractor stages for new data sources, it is often trivial to produce Arrow output.
No serialisation overhead: This makes it easy to move Arrow data between machines and even programming languages with minimum overhead. Jetflow was designed from the start to have the flexibility to be able to run in a wide range of systems via a job controller interface, so this efficiency in data transmission means there’s minimal compromise on performance when creating distributed implementations.
Reserve memory in large fixed-size batches to avoid memory allocations: As Go is a garbage collected (GC) language and GC cycle times are affected mostly by the number of objects rather than the sizes of those objects, fewer heap objects reduces CPU time spent garbage collecting significantly, even if the total size is the same. As the number of objects to scan, and possibly collect, during a GC cycle increases with the number of allocations, if we have 8192 rows with 10 columns each, Arrow would only require us to do 10 allocations versus the 8192 allocations of most drivers that allocate on a row by row basis, meaning fewer objects and lower GC cycle times with Arrow.

Converting rows to columns

Another important performance optimization was reducing the number of conversion steps that happen when reading and processing data. Most data ingestion frameworks internally represent data as rows. In our case, we are mostly writing data in Parquet format, which is column based. When reading data from column-based sources (e.g. ClickHouse, where most drivers receive RowBinary format), converting into row-based memory representations for the specific language implementation is inefficient. This is then converted again from rows to columns to write Parquet files. These conversions result in a significant performance impact.

Jetflow instead reads data from column-based sources in columnar formats (e.g. for ClickHouse-native Block format) and then copies this data into Arrow column format. Parquet files are then written directly from Arrow columns. The simplification of this process improves performance.

Writing each pipelines stage

Case study: ClickHouse

When testing an initial version of Jetflow, we discovered that due to the architecture of ClickHouse, using additional connections would not be of any benefit, since ClickHouse was reading faster than we were receiving data. It should then be possible, with a more optimized database driver, to take better advantage of that single connection to read a much larger number of rows per second, without needing additional connections.

Initially, a custom database driver was written for ClickHouse, but we ended up switching to the excellent ch-go low level library, which directly reads Blocks from ClickHouse in a columnar format. This had a dramatic effect on performance in comparison to the standard Go driver. Combined with the framework optimisations above, we now ingest millions of rows per second with a single ClickHouse connection.

A valuable lesson learned is that as with any software, tradeoffs are often made for the sake of convenience or a common use case that may not match your own. Most database drivers tend not to be optimized for reading large batches of rows, and have high per-row overhead.

Case study: Postgres

For Postgres, we use the excellent jackc/pgx driver, but instead of using the database/sql Scan interface, we directly receive the raw bytes for each row and use the jackc/pgx internal scan functions for each Postgres OID (Object Identifier) type.

The database/sql Scan interface in Go uses reflection to understand the type passed to the function and then also uses reflection to set each field with the column value received from Postgres. In typical scenarios, this is fast enough and easy to use, but falls short for our use cases in terms of performance. The jackc/pgx driver reuses the row bytes produced each time the next Postgres row is requested, resulting in zero allocations per row. This allows us to write high-performance, low-allocation code within Jetflow. With this design, we are able to achieve nearly 600,000 rows per second per Postgres connection for most tables, with very low memory usage.

Conclusion

As of early July 2025, the team ingests 77 billion records per day via Jetflow. The remaining jobs are in the process of being migrated to Jetflow, which will bring the total daily ingestion to 141 billion records. The framework has allowed us to ingest tables in cases that would not otherwise have been possible, and provided significant cost savings due to ingestions running for less time and with fewer resources.

In the future, we plan to open source the project, and if you are interested in joining our team to help develop tools like this, then open roles can be found at https://www.cloudflare.com/careers/jobs/.

Building Vectorize, a distributed vector database, on Cloudflare’s Developer Platform

Jérôme Schneider — Tue, 22 Oct 2024 13:00:00 GMT

Vectorize is a globally distributed vector database that enables you to build full-stack, AI-powered applications with Cloudflare Workers. Vectorize makes querying embeddings — representations of values or objects like text, images, audio that are designed to be consumed by machine learning models and semantic search algorithms — faster, easier and more affordable.

In this post, we dive deep into how we built Vectorize on Cloudflare’s Developer Platform, leveraging Cloudflare’s global network, Cache, Workers, R2, Queues, Durable Objects, and container platform.

What is a vector database?

A vector database is a queryable store of vectors. A vector is a large array of numbers called vector dimensions.

A vector database has a similarity search query: given an input vector, it returns the vectors that are closest according to a specified metric, potentially filtered on their metadata.

Vector databases are used to power semantic search, document classification, and recommendation and anomaly detection, as well as contextualizing answers generated by LLMs (Retrieval Augmented Generation, RAG).

Why do vectors require special database support?

Conventional data structures like B-trees, or binary search trees expect the data they index to be cheap to compare and to follow a one-dimensional linear ordering. They leverage this property of the data to organize it in a way that makes search efficient. Strings, numbers, and booleans are examples of data featuring this property.

Because vectors are high-dimensional, ordering them in a one-dimensional linear fashion is ineffective for similarity search, as the resulting ordering doesn’t capture the proximity of vectors in the high-dimensional space. This phenomenon is often referred to as the curse of dimensionality.

In addition to this, comparing two vectors using distance metrics useful for similarity search is a computationally expensive operation, requiring vector-specific techniques for databases to overcome.

Query processing architecture

Vectorize builds upon Cloudflare’s global network to bring fast vector search close to its users, and relies on many components to do so.

These are the Vectorize components involved in processing vector queries.

Vectorize runs in every Cloudflare data center, on the infrastructure powering Cloudflare Workers. It serves traffic coming from Worker bindings as well as from the Cloudflare REST API through our API Gateway.

Each query is processed on a server in the data center in which it enters, picked in a fashion that spreads the load across all servers of that data center.

The Vectorize DB Service (a Rust binary) running on that server processes the query by reading the data for that index on R2, Cloudflare’s object storage. It does so by reading through Cloudflare’s Cache to speed up I/O operations.

Searching vectors, and indexing them to speed things up

Being a vector database, Vectorize features a similarity search query: given an input vector, it returns the K vectors that are closest according to a specified metric.

Conceptually, this similarity search consists of 3 steps:

Evaluate the proximity of the query vector with every vector present in the index.
Sort the vectors based on their proximity “score”.
Return the top matches.

While this method is accurate and effective, it is computationally expensive and does not scale well to indexes containing millions of vectors (see Why do vectors require special database support? above).

To do better, we need to prune the search space, that is, avoid scanning the entire index for every query.

For this to work, we need to find a way to discard vectors we know are irrelevant for a query, while focusing our efforts on those that might be relevant.

Indexing vectors with IVF

Vectorize prunes the search space for a query using an indexing technique called IVF, Inverted File Index.

IVF clusters the index vectors according to their relative proximity. For each cluster, it then identifies its centroid, the center of gravity of that cluster, a high-dimensional point minimizing the distance with every vector in the cluster.

Once the list of centroids is determined, each centroid is given a number. We then structure the data on storage by placing each vector in a file named like the centroid it is closest to.

When processing a query, we then can then focus on relevant vectors by looking only in the centroid files closest to that query vector, effectively pruning the search space.

Compressing vectors with PQ

Vectorize supports vectors of up to 1536 dimensions. At 4 bytes per dimension (32 bits float), this means up to 6 KB per vector. That’s 6 GB of uncompressed vector data per million vectors that we need to fetch from storage and put in memory.

To process multi-million vector indexes while limiting the CPU, memory, and I/O required to do so, Vectorize uses a dimensionality reduction technique called PQ (Product Quantization). PQ compresses the vectors data in a way that retains most of their specificity while greatly reducing their size — a bit like down sampling a picture to reduce the file size, while still being able to tell precisely what’s in the picture — enabling Vectorize to efficiently perform similarity search on these lighter vectors.

In addition to storing the compressed vectors, their original data is retained on storage as well, and can be requested through the API; the compressed vector data is used only to speed up the search.

Approximate nearest neighbor search and result accuracy refining

By pruning the search space and compressing the vector data, we’ve managed to increase the efficiency of our query operation, but it is now possible to produce a set of matches that is different from the set of true closest matches. We have traded result accuracy for speed by performing an approximate nearest neighbor search, reaching an accuracy of ~80%.

To boost the result accuracy up to over 95%, Vectorize then performs a result refinement pass on the top approximate matches using uncompressed vector data, and returns the best refined matches.

Eventual consistency and snapshot versioning

Whenever you query your Vectorize index, you are guaranteed to receive results which are read from a consistent, immutable snapshot — even as you write to your index concurrently. Writes are applied in strict order of their arrival in our system, and they are funneled into an asynchronous process. We update the index files by reading the old version, making changes, and writing this updated version as a new object in R2. Each index file has its own version number, and can be updated independently of the others. Between two versions of the index we may update hundreds or even thousands of IVF and metadata index files, but even as we update the files, your queries will consistently use the current version until it is time to switch.

Each IVF and metadata index file has its own version. The list of all versioned files which make up the snapshotted version of the index is contained within a manifest file. Each version of the index has its own manifest. When we write a new manifest file based on the previous version, we only need to update references to the index files which were modified; if there are files which weren't modified, we simply keep the references to the previous version.

We use a root manifest as the authority of the current version of the index. This is the pivot point for changes. The root manifest is a copy of a manifest file from a particular version, which is written to a deterministic location (the root of the R2 bucket for the index). When our async write process has finished processing vectors, and has written all new index files to R2, we commit by overwriting the current root manifest with a copy of the new manifest. PUT operations in R2 are atomic, so this effectively makes our updates atomic. Once the manifest is updated, Vectorize DB Service instances running on our network will pick it up, and use it to serve reads.

Because we keep past versions of index and manifest files, we effectively maintain versioned snapshots of your index. This means we have a straightforward path towards building a point-in-time recovery feature (similar to D1's Time Travel feature).

You may have noticed that because our write process is asynchronous, this means Vectorize is eventually consistent — that is, there is a delay between the successful completion of a request writing on the index, and finally seeing those updates reflected in queries. This isn't always ideal for all data storage use cases. For example, imagine two users using an online ticket reservation application for airline tickets, where both users buy the same seat — one user will successfully reserve the ticket, and the other will eventually get an error saying the seat was taken, and they need to choose again. Because a vector index is not typically used as a primary database for these transactional use cases, we decided eventual consistency was a worthy trade off in order to ensure Vectorize queries would be fast, high-throughput, and cheap even as the size of indexes grew into the millions.

Coordinating distributed writes: it's just another block in the WAL

In the section above, we touched on our eventually consistent, asynchronous write process. Now we'll dive deeper into our implementation.

The WAL

A write ahead log (WAL) is a common technique for making atomic and durable writes in a database system. Vectorize’s WAL is implemented with SQLite in Durable Objects.

In Vectorize, the payload for each update is given an ID, written to R2, and the ID for that payload is handed to the WAL Durable Object which persists it as a "block." Because it's just a pointer to the data, the blocks are lightweight records of each mutation.

Durable Objects (DO) have many benefits — strong transactional guarantees, a novel combination of compute and storage, and a high degree of horizontal scale — but individual DOs are small allotments of memory and compute. However, the process of updating the index for even a single mutation is resource intensive — a single write may include thousands of vectors, which may mean reading and writing thousands of data files stored in R2, and storing a lot of data in memory. This is more than what a single DO can handle.

So we designed the WAL to leverage DO's strengths and made it a coordinator. It controls the steps of updating the index by delegating the heavy lifting to beefier instances of compute resources (which we call "Executors"), but uses its transactional properties to ensure the steps are done with strong consistency. It safeguards the process from rogue or stalled executors, and ensures the WAL processing continues to move forward. DOs are easy to scale, so we create a new DO instance for each Vectorize index.

WAL Executor

The executors run from a single pool of compute resources, shared by all WALs. We use a simple producer-consumer pattern using Cloudflare Queues. The WAL enqueues a request, and executors poll the queue. When they get a request, they call an API on the WAL requesting to be assigned to the request.

The WAL ensures that one and only one executor is ever assigned to that write. As the executor writes, the index files and the updated manifest are written in R2, but they are not yet visible. The final step is for the executor to call another API on the WAL to commit the change — and this is key — it passes along the updated manifest. The WAL is responsible for overwriting the root manifest with the updated manifest. The root manifest is the pivot point for atomic updates: once written, the change is made visible to Vectorize’s database service, and the updated data will appear in queries.

From the start, we designed this process to account for non-deterministic errors. We focused on enumerating failure modes first, and only moving forward with possible design options after asserting they handled the possibilities for failure. For example, if an executor stalls, the WAL finds a new executor. If the first executor comes back, the coordinator will reject its attempt to commit the update. Even if that first executor is working on an old version which has already been written, and writes new index files and a new manifest to R2, they will not overwrite the files written from the committed version.

Batching updates

Now that we have discussed the general flow, we can circle back to one of our favorite features of the WAL. On the executor, the most time-intensive part of the write process is reading and writing many files from R2. Even with making our reads and writes concurrent to maximize throughput, the cost of updating even thousands of vectors within a single file is dwarfed by the total latency of the network I/O. Therefore it is more efficient to maximize the number of vectors processed in a single execution.

So that is what we do: we batch discrete updates. When the WAL is ready to request work from an executor, it will get a chunk of "blocks" off the WAL, starting with the next un-written block, and maintaining the sequence of blocks. It will write a new "batch" record into the SQLite table, which ties together that sequence of blocks, the version of the index, and the ID of the executor assigned to the batch.

Users can batch multiple vectors to update in a single insert or upsert call. Because the size of each update can vary, the WAL adaptively calculates the optimal size of its batch to increase throughput. The WAL will fit as many upserted vectors as possible into a single batch by counting the number of updates represented by each block. It will batch up to 200,000 vectors at once (a value we arrived at after our own testing) with a limit of 1,000 blocks. With this throughput, we have been able to quickly load millions of vectors into an index (with upserts of 5,000 vectors at a time). Also, the WAL does not pause itself to collect more writes to batch — instead, it begins processing a write as soon as it arrives. Because the WAL only processes one batch at a time, this creates a natural pause in its workflow to batch up writes which arrive in the meantime.

Retraining the index

The WAL also coordinates our process for retraining the index. We occasionally re-train indexes to ensure the mapping of IVF centroids best reflects the current vectors in the index. This maintains the high accuracy of the vector search.

Retraining produces a completely new index. All index files are updated; vectors have been reshuffled across the index space. For this reason, all indexes have a second version stamp — which we call the generation — so that we can differentiate between retrained indexes.

The WAL tracks the state of the index, and controls when the training is started. We have a second pool of processes called "trainers." The WAL enqueues a request on a queue, then a trainer picks up the request and it begins training.

Training can take a few minutes to complete, but we do not pause writes on the current generation. The WAL will continue to handle writes as normal. But the training runs from a fixed snapshot of the index, and will become out-of-date as the live index gets updated in parallel. Once the trainer has completed, it signals the WAL, which will then start a multi-step process to switch to the new generation. It enters a mode where it will continue to record writes in the WAL, but will stop making those writes visible on the current index. Then it will begin catching up the retrained index with all of the updates that came in since it started. Once it has caught up to all data present in the index when the trainer signaled the WAL, it will switch over to the newly retrained index. This prevents the new index from appearing to "jump back in time." All subsequent writes will be applied to that new index.

This is all modeled seamlessly with the batch record. Because it associates the index version with a range of WAL blocks, multiple batches can span the same sequence of blocks as long as they belong to different generations. We can say this another way: a single WAL block can be associated with many batches, as long as these batches are in different generations. Conceptually, the batches act as a second WAL layered over the WAL blocks.

Indexing and filtering metadata

Vectorize supports metadata filters on vector similarity queries. This allows a query to focus the vector similarity search on a subset of the index data, yielding matches that would otherwise not have been part of the top results.

For instance, this enables us to query for the best matching vectors for color: “blue” and category: ”robe”.

Conceptually, what needs to happen to process this example query is:

Identify the set of vectors matching color: “blue” by scanning all metadata.
Identify the set of vectors matching category: “robe” by scanning all metadata.
Intersect both sets (boolean AND in the filter) to identify vectors matching both the color and category filter.
Score all vectors in the intersected set, and return the top matches.

While this method works, it doesn’t scale well. For an index with millions of vectors, processing the query that way would be very resource intensive. What’s worse, it prevents us from using our IVF index to identify relevant vector data, forcing us to compute a proximity score on potentially millions of vectors if the filtered set of vectors is large.

To do better, we need to prune the metadata search space by indexing it like we did for the vector data, and find a way to efficiently join the vector sets produced by the metadata index with our IVF vector index.

Indexing metadata with Chunked Sorted List Indexes

Vectorize maintains one metadata index per filterable property. Each filterable metadata property is indexed using a Chunked Sorted List Index.

A Chunked Sorted List Index is a sorted list of all distinct values present in the data for a filterable property, with each value mapped to the set of vector IDs having that value. This enables Vectorize to binary search a value in the metadata index in O(log n) complexity, in other words about as fast as search can be on a large dataset.

Because it can become very large on big indexes, the sorted list is chunked in pieces matching a target weight in KB to keep index state fetches efficient.

A lightweight chunk descriptor list is maintained in the index manifest, keeping track of the list chunks and their lower/upper values. This chunk descriptor list can be binary searched to identify which chunk would contain the searched metadata value.

Once the candidate chunk is identified, Vectorize fetches that chunk from index data and binary searches it to take the set of vector IDs matching a metadata value if found, or an empty set if not found.

We identify the matching vector set this way for every predicate in the metadata filter of the query, then intersect the sets in memory to determine the final set of vectors matched by the filters.

This is just half of the query being processed. We now need to identify the vectors most similar to the query vector, within those matching the metadata filters.

Joining the metadata and vector indexes

A vector similarity query always comes with an input vector. We can rank all centroids of our IVF vector index based on their proximity with that query vector.

The vector set matched by the metadata filters contains for each vector its ID and IVF centroid number.

From this, Vectorize derives the number of vectors matching the query filters per IVF centroid, and determines which and how many top-ranked IVF centroids need to be scanned according to the number of matches the query asks for.

Vectorize then performs the IVF-indexed vector search (see the section Searching Vectors, and indexing them to speed things up above) by considering only the vectors in the filtered metadata vector set while doing so.

Because we’re effectively pruning the vector search space using metadata filters, filtered queries can often be faster than their unfiltered equivalent.

Query performance

The performance of a system is measured in terms of latency and throughput.

Latency is a measure relative to individual queries, evaluating the time it takes for a query to be processed, usually expressed in milliseconds. It is what an end user perceives as the “speed” of the service, so a lower latency is desirable.

Throughput is a measure relative to an index, evaluating the number of queries it can process concurrently over a period of time, usually expressed in requests per second or RPS. It is what enables an application to scale to thousands of simultaneous users, so a higher throughput is desirable.

Vectorize is designed for great index throughput and optimized for low query latency to deliver great performance for demanding applications. Check out our benchmarks.

Query latency optimization

As a distributed database keeping its data state on blob storage, Vectorize’s latency is primarily driven by the fetch of index data, and relies heavily on Cloudflare’s network of caches as well as individual server RAM cache to keep latency low.

Because Vectorize data is snapshot versioned, (see Eventual consistency and snapshot versioning above), each version of the index data is immutable and thus highly cacheable, increasing the latency benefits Vectorize gets from relying on Cloudflare’s cache infrastructure.

To keep the index data lean, Vectorize uses techniques to reduce its weight. In addition to Product Quantization (see Compressing vectors with PQ above), index files use a space-efficient binary format optimized for runtime performance that Vectorize is able to use without parsing, once fetched.

Index data is fragmented in a way that minimizes the amount of data required to process a query. Auxiliary indexes into that data are maintained to limit the amount of fragments to fetch, reducing overfetch by jumping straight to the relevant piece of data on mass storage.

Vectorize boosts all vector proximity computations by leveraging SIMD CPU instructions, and by organizing the vector search in 2 passes, effectively balancing the latency/result accuracy ratio (see Approximate nearest neighbor search and result accuracy refining above).

When used via a Worker binding, each query is processed close to the server serving the worker request, and thus close to the end user, minimizing the network-induced latency between the end user, the Worker application, and Vectorize.

Query throughput

Vectorize runs in every Cloudflare data center, on thousands of servers across the world.

Thanks to the snapshot versioning of every index’s data, every server is simultaneously able to serve the index concurrently, without contention on state.

This means that a Vectorize index elastically scales horizontally with its distributed traffic, providing very high throughput for the most demanding Worker applications.

Increased index size

We are excited that our upgraded version of Vectorize can support a maximum of 5 million vectors, which is a 25x improvement over the limit in beta (200,000 vectors). All the improvements we discussed in this blog post contribute to this increase in vector storage. Improved query performance and throughput comes with this increase in storage as well.

However, 5 million may be constraining for some use cases. We have already heard this feedback. The limit falls out of the constraints of building a brand new globally distributed stateful service, and our desire to iterate fast and make Vectorize generally available so builders can confidently leverage it in their production apps.

We believe builders will be able to leverage Vectorize as their primary vector store, either with a single index or by sharding across multiple indexes. But if this limit is too constraining for you, please let us know. Tell us your use case, and let's see if we can work together to make Vectorize work for you.

Try it now!

Every developer on a free plan can give Vectorize a try. You can visit our developer documentation to get started.

If you’re looking for inspiration on what to build, see the semantic search tutorial that combines Workers AI and Vectorize for document search, running entirely on Cloudflare. Or an example of how to combine OpenAI and Vectorize to give an LLM more context and dramatically improve the accuracy of its answers.

And if you have questions about how to use Vectorize for our product & engineering teams, or just want to bounce an idea off of other developers building on Workers AI, join the #vectorize and #workers-ai channels on our Developer Discord.

Improving platform resilience at Cloudflare through automation

Opeyemi Onikute — Wed, 09 Oct 2024 13:00:00 GMT

Failure is an expected state in production systems, and no predictable failure of either software or hardware components should result in a negative experience for users. The exact failure mode may vary, but certain remediation steps must be taken after detection. A common example is when an error occurs on a server, rendering it unfit for production workloads, and requiring action to recover.

When operating at Cloudflare’s scale, it is important to ensure that our platform is able to recover from faults seamlessly. It can be tempting to rely on the expertise of world-class engineers to remediate these faults, but this would be manual, repetitive, unlikely to produce enduring value, and not scaling. In one word: toil; not a viable solution at our scale and rate of growth.

In this post we discuss how we built the foundations to enable a more scalable future, and what problems it has immediately allowed us to solve.

Growing pains

The Cloudflare Site Reliability Engineering (SRE) team builds and manages the platform that helps product teams deliver our extensive suite of offerings to customers. One important component of this platform is the collection of servers that power critical products such as Durable Objects, Workers, and DDoS mitigation. We also build and maintain foundational software services that power our product offerings, such as configuration management, provisioning, and IP address allocation systems.

As part of tactical operations work, we are often required to respond to failures in any of these components to minimize impact to users. Impact can vary from lack of access to a specific product feature, to total unavailability. The level of response required is determined by the priority, which is usually a reflection of the severity of impact on users. Lower-priority failures are more common — a server may run too hot, or experience an unrecoverable hardware error. Higher-priority failures are rare and are typically resolved via a well-defined incident response process, requiring collaboration with multiple other teams.

The commonality of lower-priority failures makes it obvious when the response required, as defined in runbooks, is “toilsome”. To reduce this toil, we had previously implemented a plethora of solutions to automate runbook actions such as manually-invoked shell scripts, cron jobs, and ad-hoc software services. These had grown organically over time and provided solutions on a case-by-case basis, which led to duplication of work, tight coupling, and lack of context awareness across the solutions.

We also care about how long it takes to resolve any potential impact on users. A resolution process which involves the manual invocation of a script relies on human action, increasing the Mean-Time-To-Resolve (MTTR) and leaving room for human error. This risks increasing the amount of errors we serve to users and degrading trust.

These problems proved that we needed a way to automatically heal these platform components. This especially applies to our servers, for which failure can cause impact across multiple product offerings. While we have mechanisms to automatically steer traffic away from these degraded servers, in some rare cases the breakage is sudden enough to be visible.

Solving the problem

To provide a more reliable platform, we needed a new component that provides a common ground for remediation efforts. This would remove duplication of work, provide unified context-awareness and increase development speed, which ultimately saves hours of engineering time and effort.

A good solution would not allow only the SRE team to auto-remediate, it would empower the entire company. The key to adding self-healing capability was a generic interface for all teams to self-service and quickly remediate failures at various levels: machine, service, network, or dependencies.

A good way to think about auto-remediation is in terms of workflows. A workflow is a sequence of steps to get to a desired outcome. This is not dissimilar to a manual shell script which executes what a human would otherwise do via runbook instructions. Because of this logical fit with workflows and durable execution, we decided to adopt an open-source platform called Temporal.

The concept of durable execution is useful to gracefully manage infrastructure failures such as network outages and transient failures in external service endpoints. This capability meant we only needed to build a way to schedule “workflow” tasks and have the code provide reliability guarantees by default, using Temporal. This allowed us to focus on building out the orchestration system to support the control and flow of workflow execution in our data centers.

Temporal’s documentation provides a good introduction to writing Temporal workflows.

Building an Automatic Remediation System

Below, we describe how our automatic remediation system works. It is essentially a way to schedule tasks across our global network with built-in reliability guarantees. With this system, teams can serve their customers more reliably. An unexpected failure mode can be recognized and immediately mitigated, while the root cause can be determined later via a more detailed analysis.

Step one: we need a coordinator

After our initial testing of Temporal, it was now possible to write workflows. But we needed a way to schedule workflow tasks from other internal services. The coordinator was built to serve this purpose, and became the primary mechanism for the authorisation and scheduling of workflows.

The most important roles of the coordinator are authorisation, workflow task routing, and safety constraints enforcement. Each consumer is authorized via mTLS authentication, and the coordinator uses an ACL to determine whether to permit the execution of a workflow. An ACL configuration looks like the following example.

server_config {
    enable_tls = true
    [...]
    route_rule {
      name  = "global_get"
      method = "GET"
      route_patterns = ["/*"]
      uris = ["spiffe://example.com/worker-admin"]
    }
    route_rule {
      name = "global_post"
      method = "POST"
      route_patterns = ["/*"]
      uris = ["spiffe://example.com/worker-admin"]
      allow_public = true
    }
    route_rule {
      name = "public_access"
      method = "GET"
      route_patterns = ["/metrics"]
      uris = []
      allow_public = true
      skip_log_match = true
    }
}

Each workflow specifies two key characteristics: where to run the tasks and the safety constraints, using an HCL configuration file. Example constraints could be whether to run on only a specific node type (such as a database), or if multiple parallel executions are allowed: if a task has been triggered too many times, that is a sign of a wider problem that might require human intervention. The coordinator uses the Temporal Visibility API to determine the current state of the executions in the Temporal cluster.

An example of a configuration file is shown below:

task_queue_target = ""

# The following entries will ensure that
# 1. This workflow is not run at the same time in a 15m window.
# 2. This workflow will not run more than once an hour.
# 3. This workflow will not run more than 3 times in one day.
#
constraint {
    kind = "concurency"
    value = "1"
    period = "15m"
}

constraint {
    kind = "maxExecution"
    value = "1"
    period = "1h"
}

constraint {
    kind = "maxExecution"
    value = "3"
    period = "24h"
    is_global = true
}

Step two: Task Routing is amazing

An unforeseen benefit of using a central Temporal cluster was the discovery of Task Routing. This feature allows us to schedule a Workflow/Activity on any server that has a running Temporal Worker, and further segment by the type of server, its location, etc. For this reason, we have three primary task queues — the general queue in which tasks can be executed by any worker in the datacenter, the node type queue in which tasks can only be executed by a specific node type in the datacenter, and the individual node queue where we target a specific node for task execution.

We rely on this heavily to ensure the speed and efficiency of automated remediation. Certain tasks can be run in datacenters with known low latency to an external resource, or a node type with better performance than others (due to differences in the underlying hardware). This reduces the amount of failure and latency we see overall in task executions. Sometimes we are also constrained by certain types of tasks that can only run on a certain node type, such as a database.

Task Routing also means that we can configure certain task queues to have a higher priority for execution, although this is not a feature we have needed so far. A drawback of task routing is that every Workflow/Activity needs to be registered to the target task queue, which is a common gotcha. Thankfully, it is possible to catch this failure condition with proper testing.

Step three: when/how to self-heal?

None of this would be relevant if we didn’t put it to good use. A primary design goal for the platform was to ensure we had easy, quick ways to trigger workflows on the most important failure conditions. The next step was to determine what the best sources to trigger the actions were. The answer to this was simple: we could trigger workflows from anywhere as long as they are properly authorized and detect the failure conditions accurately.

Example triggers are an alerting system, a log tailer, a health check daemon, or an authorized engineer via a chatbot. Such flexibility allows a high level of reuse, and permits to invest more in workflow quality and reliability.

As part of the solution, we built a daemon that is able to poll a signal source for any unwanted condition and trigger a configured workflow. We have initially found Prometheus useful as a source because it contains both service-level and hardware/system-level metrics. We are also exploring more event-based trigger mechanisms, which could eliminate the need to use precious system resources to poll for metrics.

We already had internal services that are able to detect widespread failure conditions for our customers, but were only able to page a human. With the adoption of auto-remediation, these systems are now able to react automatically. This ability to create an automatic feedback loop with our customers is the cornerstone of these self-healing capabilities, and we continue to work on stronger signals, faster reaction times, and better prevention of future occurrences.

The most exciting part, however, is the future possibility. Every customer cares about any negative impact from Cloudflare. With this platform we can onboard several services (especially those that are foundational for the critical path) and ensure we react quickly to any failure conditions, even before there is any visible impact.

Step four: packaging and deployment

The whole system is written in golang, and a single binary can implement each role. We distribute it as an apt package or a container for maximum ease of deployment.

We deploy a Temporal-based worker to every server we intend to run tasks on, and a daemon in datacenters where we intend to automatically trigger workflows based on the local conditions. The coordinator is more nuanced since we rely on task routing and can trigger from a central coordinator, but we have also found value in running coordinators locally in the datacenters. This is especially useful in datacenters with less capacity or degraded performance, removing the need for a round-trip to schedule the workflows.

Step five: test, test, test

Temporal provides native mechanisms to test an entire workflow, via a comprehensive test suite that supports end-to-end, integration, and unit testing, which we used extensively to prevent regressions while developing. We also ensured proper test coverage for all the critical platform components, especially the coordinator.

Despite the ease of written tests, we quickly discovered that they were not enough. After writing workflows, engineers need an environment as close as possible to the target conditions. This is why we configured our staging environments to support quick and efficient testing. These environments receive the latest changes and point to a different (staging) Temporal cluster, which enables experimentation and easy validation of changes.

After a workflow is validated in the staging environment, we can then do a full release to production. It seems obvious, but catching simple configuration errors before releasing has saved us many hours in development/change-related-task time.

Deploying to production

As you can guess from the title of this post, we put this in production to automatically react to server-specific errors and unrecoverable failures. To this end, we have a set of services that are able to detect single-server failure conditions based on analyzed traffic data. After deployment, we have successfully mitigated potential impact by taking any errant single sources of failure out of production.

We have also created a set of workflows to reduce internal toil and improve efficiency. These workflows can automatically test pull requests on target machines, wipe and reset servers after experiments are concluded, and take away manual processes that cost many hours in toil.

Building a system that is maintained by several SRE teams has allowed us to iterate faster, and rapidly tackle long-standing problems. We have set ambitious goals regarding toil elimination and are on course to achieve them, which will allow us to scale faster by eliminating the human bottleneck.

Looking to the future

Our immediate plans are to leverage this system to provide a more reliable platform for our customers and drastically reduce operational toil, freeing up engineering resources to tackle larger-scale problems. We also intend to leverage more Temporal features such as Workflow Versioning, which will simplify the process of making changes to workflows by ensuring that triggered workflows run expected versions.

We are also interested in how others are solving problems using durable execution platforms such as Temporal, and general strategies to eliminate toil. If you would like to discuss this further, feel free to reach out on the Cloudflare Community and start a conversation!

If you’re interested in contributing to projects that help build a better Internet, our engineering teams are hiring.

Adopting OpenTelemetry for our logging pipeline

Colin Douch — Mon, 03 Jun 2024 13:00:41 GMT

Cloudflare’s logging pipeline is one of the largest data pipelines that Cloudflare has, serving millions of log events per second globally, from every server we run. Recently, we undertook a project to migrate the underlying systems of our logging pipeline from syslog-ng to OpenTelemetry Collector and in this post we want to share how we managed to swap out such a significant piece of our infrastructure, why we did it, what went well, what went wrong, and how we plan to improve the pipeline even more going forward.

Background

A full breakdown of our existing infrastructure can be found in our previous post An overview of Cloudflare's logging pipeline, but to quickly summarize here:

We run a syslog-ng daemon on every server, reading from the local systemd-journald journal, and a set of named pipes.
We forward those logs to a set of centralized “log-x receivers”, in one of our core data centers.
We have a dead letter queue destination in another core data center, which receives messages that could not be sent to the primary receiver, and which get mirrored across to the primary receivers when possible.

The goal of this project was to replace those syslog-ng instances as transparently as possible. That means we needed to implement all these behaviors as precisely as possible, so that we didn’t need to modify any downstream systems.

There were a few reasons for wanting to make this shift, and enduring the difficulties of overhauling such a large part of our infrastructure:

syslog-ng is written in C, which is not a core competency of our team. While we have made upstream contributions to the project in the past, and the experience was great, having the OpenTelemetry collector in Go allows much more of our team to be able to contribute improvements to the system.
Building syslog-ng against our internal Post-Quantum cryptography libraries was difficult, due to having to maintain an often brittle C build chain, whereas our engineering teams have optimized the Go build model to make this as simple as possible.
OpenTelemetry Collectors have built in support for Prometheus metrics, which allows us to gather much deeper levels of telemetry data around what the collectors are doing, and surface these insights as “meta-observability” to our engineering teams.
We already use OpenTelemetry Collectors for some of our tracing infrastructure, so unifying onto one daemon rather than having separate collectors for all our different types of telemetry reduces the cognitive load on the team.

The Migration Process

What we needed to build

While the upstream contrib repository contains a wealth of useful components, all packaged into its own distribution, it became clear early on that we would need our own internal components. Having our own internal components would require us to build our own distribution, so one of the first things we did was turn to OCB (OpenTelemetry Collector Builder) to provide us a way to build an internal distribution of an OpenTelemetry Collector. We eventually ended up templating our OCB configuration file to automatically include all the internal components we have built, so that we didn’t have to add them manually.

In total, we built four internal components for our initial version of the collector.

cfjs1exporter

Internally, our logging pipeline uses a line format we call “cfjs1”. This format describes a JSON encoded log, with two fields: a format field, that decides the type of the log, and a “wrapper” field which contains the log body (which is a structured JSON object in and of itself), with a field name that changes depending on the format field. These two fields decide which Kafka topic our receivers will end up placing the log message in.

Because we didn’t want to make changes to other parts of the pipeline, we needed to support this format in our collector. To do this, we took inspiration from the contrib repository’s syslogexporter, building our cfjs1 format into it.

Ultimately, we would like to move towards using OTLP (OpenTelemetry Protocol) as our line format. This would allow us to remove our custom exporter, and utilize open standards, enabling easier migrations in the future.

fileexporter

While the upstream contrib repo does have a file exporter component, it only supports two formats: JSON and Protobuf. We needed to support two other formats, plain text and syslog, so we ended up forking the file exporter internally. Our plain text formatter simply outputs the body of the log message into a file, with newlines as a delimiter. Our syslog format outputs RFC 5424 formatted syslog messages into a file.

The other feature we implemented on our internal fork was custom permissions. The upstream file exporter is a bit of a mess, in that it actually has two different modes of operation – a standard mode, not utilizing any of the compression or rotation features, and a more advanced mode which uses those features. Crucially, if you want to use any of the rotation features, you end up using lumberjack, whereas without those features you use a more native file handling. This leads to strange issues where some features of the exporter are supported in one mode, but not the other. In the case of permissions, the community seems open to the idea in the native handling, but lumberjack seems against the idea. This dichotomy is what led us to implement it ourselves internally.

Ultimately, we would love to upstream these improvements should the community be open to them. Having support for custom marshallers (https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/30331) would have made this a bit easier, however it’s not clear how that would work with OCB. Either that, or we could open source them in the Cloudflare organization, but we would love to remove the need to maintain our own fork in the future.

externaljsonprocessor

We want to set the value of an attribute/field that comes from external sources: either from an HTTP endpoint or an output from running a specific command. In syslog-ng, we have a sidecar service that generates a syslog-ng configuration to achieve this. In replacing syslog-ng with our OpenTelemetry Collector, we thought it would be easier to implement this feature as a custom component of our collector instead.

To that end, we implemented an “external JSON processor”, which is able to periodically query external data sources and add those fields to all the logs that flow through the processor. Cloudflare has many internal tools and APIs, and we use this processor to fetch data like the status of a data center, or the status of a systemd unit. This enables our engineers to have more filtering options, such as to exclude logs from data centers that are not supposed to receive customer traffic, or servers that are disabled for maintenance. Crucially, this allows us to update these values much faster than the standard three-hour cadence of other configuration updates through salt, allowing more rapid updates to these fields that may change quickly as we operate our network.

ratelimit processor

The last component we needed to implement was a replacement for the syslog-ng ratelimit filter, also contributed by us upstream. The ratelimit filter allows applying rate limits based on a specific field of a log message, dropping messages that exceed some limit (with an optional burst limit). In our case, we apply rate limits over the service field, ensuring that no individual service can degrade the log collection for any other.

While there has been some upstream discussion of similar components, we couldn’t find anything that explicitly fit our needs. This was especially true when you consider that in our case the data loss during the rate limiting process is intentional, something that might be hard to sell when trying to build something more generally applicable.

How we migrated

Once we had an OpenTelemetry Collector binary, we had to deploy it. Our deployment process took two forks: Deploying to our core data centers, and deploying to our edge data centers. For those unfamiliar, Cloudflare’s core data centers contain a small number of servers with a very diverse set of workloads, from Postgresql, to ElasticSearch, to Kubernetes, and everything in between. Our edge data centers, on the other hand, are much more homogenous. They contain a much larger number of servers, each one running the same set of services.

Both edge and core use salt to configure the services running on their servers. This meant that the first step was to write salt states that would install the OpenTelemetry collector, and write the appropriate configurations to disk. Once we had those in place, we also needed to write some temporary migration pieces that would disable syslog-ng and start the OpenTelemetry collector, as well as the inverse in the case of a roll back.

For the edge data centers, once we had a set of configurations written, it mostly came down to rolling the changes out gradually across the edge servers. Because edge servers run the same set of services, once we had gained confidence in our set of configurations, it became a matter of rolling out the changes slowly and monitoring the logging pipelines along the way. We did have a few false starts here, and needed to instrument our cfjs1exporter a bit more to work around issues surrounding some of our more niche services and general Internet badness which we’ll detail below in our lessons learned.

The core data centers required a more hands-on approach. Many of our services in core have custom syslog-ng configurations. For example, our Postgresql servers have custom handling for their audit logs, and our Kubernetes servers have custom handling for contour ingress and error logs. This meant that each role with a custom config had to be manually onboarded, with extensive testing on the designated canary nodes of each role to validate the configurations.

Lessons Learned

Failover

At Cloudflare, we regularly schedule chaos testing on our core data centers which contain our centralized log receivers. During one of these chaos tests, our cfjs1 exporter did not notice that it could not send to the primary central logging server. This caused our collector to not failover to the secondary central logging server and its log buffer to fill up, which resulted in the collector failing to consume logs from its receivers. This is not a problem with journal receivers since logs are buffered by journald before they get consumed by the collector, but it is a different case with named pipe receivers. Due to this bug, our collectors stopped consuming logs from named pipes, and services writing to these named pipes started blocking threads waiting to write to them. Our syslog-ng deployment solved this issue using a monit script to periodically kill the connections between syslog-ng and the central receivers, however we opted to solve this more explicitly in our exporter by building in much tighter timeouts, and modifying the upstream failover receiver to better respond to these partial failures.

Cutover delays

As we’ve previously blogged about, at Cloudflare, we use Nomad for running dynamic tasks in our edge data centers. We use a custom driver to run containers and this custom driver handles the shipping of logs from the container to a named pipe.

We did the migration from syslog-ng to OpenTelemetry Collectors while servers were live and running production services. During the migration, there was a gap when syslog-ng was stopped by our configuration management and our OpenTelemetry collector was started on the server. This gap caused the logs in the named pipe to not get consumed and similar to the previous named pipe, the services writing to the named pipe receiver in blocking mode got affected. Similar to NGINX and Postgresql, Cloudflare’s driver for Nomad also writes logs to the named pipe driver in blocking mode. Because of this delay, the driver timed out sending logs and rescheduled the containers.

We ultimately caught this pretty early on in testing, and changed our approach to the rollout. Instead of using Salt to separately stop syslog-ng and start the collector, we instead used salt to schedule a systemd “one shot” service that simultaneously stopped syslog-ng and started the collector, minimizing the downtime between the two.

What’s next?

Migrating such a critical part of our infrastructure is never easy, especially when it has remained largely untouched for nearly half a decade. Even with the issues we hit during our rollout, migrating to an OpenTelemetry Collector unlocks so many more improvements to our logging pipeline going forward. With the initial deployment complete, there are a number of changes we’re excited to work on next, including:

Better handling for log sampling, including tail sampling
Better insights for our engineering teams on their telemetry production
Migration to OTLP as our line protocol
Upstreaming of some of our custom components

If that sounds interesting to you, we’re hiring engineers to come work on our logging pipeline, so please reach out!

The Cloudflare Blog

Making Rust Workers reliable: panic and abort recovery in wasm‑bindgen

Initial recovery mitigations

Implementing panic=unwind with WebAssembly Exception Handling

Abort recovery

Extension: abort reinitialization for wasm-bindgen libraries

Maturing the Rust Wasm Exception Handling ecosystem

Using panic unwind in Rust Workers

Committing to Rust Workers stability

Inside Gen 13: how we built our most powerful server yet

CPU

Why AMD Turin 9965?

Memory

Maximizing bandwidth with 12 channels

The 4 GB per core “sweet spot”

Efficiency through dual rank

Storage

The move to PCIe 5.0

16 TB to 24 TB

Front drive bay to support additional drives

Endurance and reliability

NVMe 2.0 and OCP NVMe 2.0 compliance

Thermal efficiency

Network

Why 100 GbE and why now?

Hardware choices and compatibility

PCIe Allocation

Management

Continuity with DC-SCM 2.0

Why we are staying the course with DC-SCM 2.0

Power

The jump to 1300W

Thermal design: 2U pays dividends again

Drop-in accelerator support

A launchpad to scale Cloudflare to greater heights

Launching Cloudflare’s Gen 13 servers: trading cache for cores for 2x edge compute performance

What AMD EPYCTurin brings to the table

Diagnosing the problem with performance counters

The tradeoff: latency vs. throughput

Incremental gains with performance tuning

The opportunity: FL2 was already in progress

Proving it out: FL2 on Gen 13

Generational improvement with Gen 13

Performance improvements

Gen 13 business impact

Gen 13 + FL2: ready for the edge

The most-seen UI on the Internet? Redesigning Turnstile and Challenge Pages

Part 1: The design process

The problem

The design audit

Mapping the user journey

Establishing a unified information architecture

What user research taught us

Some things didn’t need fixing

But these needed to change

Final redesign

Simplified, scannable content

AAA accessibility compliance

Unified experience across Turnstile and Challenge pages

Validated across languages

The complete picture

Part 2: Shipping to billions

Beyond centering a div

Small visual changes, big global impact

Building confidence through testing

The outcome

Part 3: The impact

1. Challenge Completion Rate

2. Time to Complete

3. Abandonment Rate Changes

4. Support Ticket Volume

5. Social Sentiment

Looking ahead

Shedding old code with ecdysis: graceful restarts for Rust services at Cloudflare

Why graceful restarts are hard

How ecdysis works

Security considerations

Code example

Production at scale

ecdysis vs alternatives

Implementing `panic=unwind` with WebAssembly Exception Handling