
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built and the technologies we use, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Mon, 06 Apr 2026 13:10:44 GMT</lastBuildDate>
        <item>
            <title><![CDATA[A one-line Kubernetes fix that saved 600 hours a year]]></title>
            <link>https://blog.cloudflare.com/one-line-kubernetes-fix-saved-600-hours-a-year/</link>
            <pubDate>Thu, 26 Mar 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ When we investigated why our Atlantis instance took 30 minutes to restart, we discovered a bottleneck in how Kubernetes handles volume permissions. By adjusting the fsGroupChangePolicy, we reduced restart times to 30 seconds. ]]></description>
            <content:encoded><![CDATA[ <p>Every time we restarted Atlantis, the tool we use to plan and apply Terraform changes, we’d be stuck for 30 minutes waiting for it to come back up. No plans, no applies, no infrastructure changes for any repository managed by Atlantis. With roughly 100 restarts a month for credential rotations and onboarding, that added up to over <b>50 hours of blocked engineering time every month</b>, and paged the on-call engineer every time.</p><p>This was ultimately caused by a safe default in Kubernetes that had silently become a bottleneck as the persistent volume used by Atlantis grew to millions of files. Here’s how we tracked it down and fixed it with a one-line change.</p>
    <div>
      <h3>Mysteriously slow restarts</h3>
      <a href="#mysteriously-slow-restarts">
        
      </a>
    </div>
    <p>We manage dozens of Terraform projects with GitLab merge requests (MRs) using <a href="https://www.runatlantis.io/"><u>Atlantis</u></a>, which handles planning and applying. It enforces locking to ensure that only one MR can modify a project at a time. </p><p>It runs on Kubernetes as a singleton StatefulSet and relies on a Kubernetes PersistentVolume (PV) to keep track of repository state on disk. Whenever a Terraform project needs to be onboarded or offboarded, or credentials used by Terraform are updated, we have to restart Atlantis to pick up those changes — a process that can take 30 minutes.</p><p>The slow restart became impossible to ignore when we recently ran out of inodes on the persistent storage used by Atlantis, forcing us to restart it to resize the volume. Inodes are consumed by each file and directory entry on disk, and the number available to a filesystem is determined by parameters passed when creating it. The Ceph persistent storage implementation provided by our Kubernetes platform does not expose a way to pass flags to <code>mkfs</code>, so we’re at the mercy of default values: growing the filesystem is the only way to grow available inodes, and resizing a PV requires a pod restart. </p><p>We talked about extending the alert window, but that would just mask the problem and delay our response to actual issues. Instead, we decided to investigate exactly why it was taking so long.</p>
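    <p>As an aside, inode exhaustion is easy to spot from a shell. A minimal sketch (the demo directory below is an illustrative stand-in; on the real PV the entry count was in the millions):</p>
            <pre><code># Inode exhaustion shows up in df -i long before df -h looks full
df -i /
# Every file and directory entry consumes an inode; counting entries
# on a volume is a quick proxy for how many inodes it uses
rm -rf /tmp/inode-demo
mkdir -p /tmp/inode-demo/a /tmp/inode-demo/b
touch /tmp/inode-demo/a/f1 /tmp/inode-demo/a/f2
ENTRIES=$(find /tmp/inode-demo | wc -l)
echo "$ENTRIES entries"
</code></pre>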
    <div>
      <h3>Bad behavior</h3>
      <a href="#bad-behavior">
        
      </a>
    </div>
    <p>When we were asked to do a rolling restart of Atlantis to pick up a change to the secrets it uses, we would run <code>kubectl rollout restart statefulset atlantis</code>, which would gracefully terminate the existing Atlantis pod before spinning up a new one. The new pod would appear almost immediately, but looking at it would show:</p>
            <pre><code>$ kubectl get pod atlantis-0
NAME         READY   STATUS     RESTARTS   AGE
atlantis-0   0/1     Init:0/1   0          30m
</code></pre>
            <p>...so what gives? Naturally, the first thing to check would be events for that pod. It's waiting around for an init container to run, so maybe the pod events would illuminate why?</p>
            <pre><code>$ kubectl events --for=pod/atlantis-0
LAST SEEN   TYPE      REASON                   OBJECT                   MESSAGE
30m         Normal    Killing                  Pod/atlantis-0   Stopping container atlantis-server
30m         Normal    Scheduled                Pod/atlantis-0   Successfully assigned atlantis/atlantis-0 to 36com1167.cfops.net
22s         Normal    Pulling                  Pod/atlantis-0   Pulling image "oci.example.com/git-sync/master:v4.1.0"
22s         Normal    Pulled                   Pod/atlantis-0   Successfully pulled image "oci.example.com/git-sync/master:v4.1.0" in 632ms (632ms including waiting). Image size: 58518579 bytes.</code></pre>
            <p>That looks almost normal... but what's taking so long between scheduling the pod and actually starting to pull the image for the init container? Unfortunately that was all the data we had to go on from Kubernetes itself. But surely there <i>had</i> to be something more that could tell us why it was taking so long to actually start running the pod.</p>
    <div>
      <h3>Going deeper</h3>
      <a href="#going-deeper">
        
      </a>
    </div>
    <p>In Kubernetes, a component called <code>kubelet</code> that runs on each node is responsible for coordinating pod creation, mounting persistent volumes, and many other things. From my time on our Kubernetes team, I know that <code>kubelet</code> runs as a systemd service and so its logs should be available to us in Kibana. Since the pod has been scheduled, we know the host name we're interested in, and the log messages from <code>kubelet</code> include the associated object, so we could filter for <code>atlantis</code> to narrow down the log messages to anything we found interesting.</p><p>We were able to observe the Atlantis PV being mounted shortly after the pod was scheduled. We also observed all the secret volumes mount without issue. However, there was still a big unexplained gap in the logs. We saw:</p>
            <pre><code>[operation_generator.go:664] "MountVolume.MountDevice succeeded for volume \"pvc-94b75052-8d70-4c67-993a-9238613f3b99\" (UniqueName: \"kubernetes.io/csi/rook-ceph-nvme.rbd.csi.ceph.com^0001-000e-rook-ceph-nvme-0000000000000002-a6163184-670f-422b-a135-a1246dba4695\") pod \"atlantis-0\" (UID: \"83089f13-2d9b-46ed-a4d3-cba885f9f48a\") device mount path \"/state/var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph-nvme.rbd.csi.ceph.com/d42dcb508f87fa241a49c4f589c03d80de2f720a87e36932aedc4c07840e2dfc/globalmount\"" pod="atlantis/atlantis-0"
[pod_workers.go:1298] "Error syncing pod, skipping" err="unmounted volumes=[atlantis-storage], unattached volumes=[], failed to process volumes=[]: context deadline exceeded" pod="atlantis/atlantis-0" podUID="83089f13-2d9b-46ed-a4d3-cba885f9f48a"
[util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="atlantis/atlantis-0"</code></pre>
            <p>The last two messages looped several times until eventually we observed the pod actually start up properly.</p><p>So <code>kubelet</code> thinks that the pod is otherwise ready to go, but it's not starting it and something's timing out.</p>
    <div>
      <h3>The missing piece</h3>
      <a href="#the-missing-piece">
        
      </a>
    </div>
    <p>The lowest-level logs we had on the pod didn't show us what was going on. What else did we have to look at? Well, the last message before the hang was the PV being mounted onto the node. Ordinarily, if a PV has issues mounting (e.g. due to still being stuck mounted on another node), that bubbles up as an event. But something was still going on here, and the only thing left to drill down on was the PV itself. So I plugged the PV name into Kibana, since it's unique enough to make a good search term... and immediately something jumped out:</p>
            <pre><code>[volume_linux.go:49] Setting volume ownership for /state/var/lib/kubelet/pods/83089f13-2d9b-46ed-a4d3-cba885f9f48a/volumes/kubernetes.io~csi/pvc-94b75052-8d70-4c67-993a-9238613f3b99/mount and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699</code></pre>
            <p>Remember how I said at the beginning we'd just run out of inodes? In other words, we have a <i>lot</i> of files on this PV. When the PV is mounted, <code>kubelet</code> is running <code>chgrp -R</code> to recursively change the group on every file and folder across this filesystem. No wonder it was taking so long — that's a ton of entries to traverse even on fast flash storage!</p><p>The pod's <code>spec.securityContext</code> included <code>fsGroup: 1</code>, which ensures that processes running under GID 1 can access files on the volume. Atlantis runs as a non-root user, so without this setting it wouldn’t have permission to read or write to the PV. The way Kubernetes enforces this is by recursively updating ownership on the entire PV <i>every time it's mounted</i>.</p>
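            <p>To get a feel for the cost mechanically, here is a scaled-down sketch of what <code>kubelet</code> effectively does on every mount when <code>fsGroup</code> is set — 2,000 files here, versus millions of entries on the real PV:</p>
            <pre><code># Build a small stand-in for the PV
rm -rf /tmp/pv-sim
mkdir -p /tmp/pv-sim
for i in $(seq 1 2000); do touch "/tmp/pv-sim/f$i"; done
# kubelet's fsGroup handling is the moral equivalent of this recursive
# group change: one traversal and one syscall per entry. Run it against
# a volume with millions of entries and the 30-minute stall makes sense.
chgrp -R "$(id -g)" /tmp/pv-sim
COUNT=$(find /tmp/pv-sim | wc -l)
echo "walked $COUNT entries"
</code></pre>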
    <div>
      <h3>The fix</h3>
      <a href="#the-fix">
        
      </a>
    </div>
    <p>Fixing this was heroically...boring. Since version 1.20, Kubernetes has supported an additional field on <code>pod.spec.securityContext</code> called <code>fsGroupChangePolicy</code>. This field defaults to <code>Always</code>, which leads to the exact behavior we saw here. The other option, <code>OnRootMismatch</code>, only changes permissions if the root directory of the PV doesn't already have the right ones. One caveat: with <code>OnRootMismatch</code>, files created later with the wrong group ownership will never be fixed automatically, so don't set it unless you know exactly how files are created on your PV. We checked to make sure that nothing should be changing the group on anything in the PV, and then set that field: </p>
            <pre><code>spec:
  template:
    spec:
      securityContext:
        fsGroupChangePolicy: OnRootMismatch</code></pre>
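            <p>Before rolling this out, you may want to find every workload that sets <code>fsGroup</code> but still inherits the <code>Always</code> default. A rough sketch over locally rendered manifests (the directory and file below are illustrative stand-ins, not our real layout):</p>
            <pre><code># Stand-in manifest that sets fsGroup but not fsGroupChangePolicy
rm -rf /tmp/manifests
mkdir -p /tmp/manifests
printf 'spec:\n  template:\n    spec:\n      securityContext:\n        fsGroup: 1\n' | tee /tmp/manifests/atlantis.yaml
# List files that mention fsGroup but never set fsGroupChangePolicy
FLAGGED=$(grep -rl "fsGroup:" /tmp/manifests | xargs grep -L "fsGroupChangePolicy")
echo "$FLAGGED"
</code></pre>
            <p>In a live cluster, the same check could be run against the output of <code>kubectl get statefulsets -A -o yaml</code> instead of files on disk.</p>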
            <p>Now, it takes about 30 seconds to restart Atlantis, down from the 30 minutes it was when we started.</p><p>Default Kubernetes settings are sensible for small volumes, but they can become bottlenecks as data grows. For us, this one-line change to <code>fsGroupChangePolicy</code> reclaimed nearly 50 hours of blocked engineering time per month. This was time our teams had been spending waiting for infrastructure changes to go through, and time that our on-call engineers had been spending responding to false alarms. That’s roughly 600 hours a year returned to productive work, from a fix that took longer to diagnose than deploy.</p><p>Safe defaults in Kubernetes are designed for small, simple workloads. But as you scale, they can slowly become bottlenecks. If you’re running workloads with large persistent volumes, it’s worth checking whether recursive permission changes like this are silently eating your restart time. Audit your <code>securityContext</code> settings, especially <code>fsGroup</code> and <code>fsGroupChangePolicy</code>. <code>OnRootMismatch</code> has been available since v1.20.</p><p>Not every fix is heroic or complex, and it’s usually worth asking “why does the system behave this way?”</p><p>If debugging infrastructure problems at scale sounds interesting, <a href="https://cloudflare.com/careers"><u>we’re hiring</u></a>. Come join us on the <a href="https://community.cloudflare.com/"><u>Cloudflare Community</u></a> or our <a href="https://discord.cloudflare.com/"><u>Discord</u></a> to talk shop.</p> ]]></content:encoded>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Terraform]]></category>
            <category><![CDATA[Platform Engineering]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[SRE]]></category>
            <guid isPermaLink="false">6bSk27AUeu3Ja7pTySyy0t</guid>
            <dc:creator>Braxton Schafer</dc:creator>
        </item>
        <item>
            <title><![CDATA[Inside Gen 13: how we built our most powerful server yet]]></title>
            <link>https://blog.cloudflare.com/gen13-config/</link>
            <pubDate>Mon, 23 Mar 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare's Gen 13 servers introduce AMD EPYC™ Turin 9965 processors and a transition to 100 GbE networking to meet growing traffic demands. In this technical deep dive, we explain the engineering rationale behind each major component selection. ]]></description>
            <content:encoded><![CDATA[ <p>A few months ago, Cloudflare announced <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>the transition to FL2</u></a>, our Rust-based rewrite of Cloudflare's core request handling layer. This transition accelerates our ability to help build a better Internet for everyone. Alongside that software migration, we have refreshed our server hardware design with improved capabilities and better efficiency to serve the evolving demands of our network and software stack. Gen 13 is built around a 192-core AMD EPYC™ Turin 9965 processor, 768 GB of DDR5-6400 memory, 24 TB of PCIe 5.0 NVMe storage, and a dual-port 100 GbE network interface card.</p><p>Gen 13 delivers:</p><ul><li><p>Up to 2x throughput compared to Gen 12 while staying within our latency SLA</p></li><li><p>Up to 50% improvement in performance / watt efficiency, reducing data center expansion costs</p></li><li><p>Up to 60% higher throughput per rack at a constant rack power budget</p></li><li><p>2x memory capacity, 1.5x storage capacity, 4x network bandwidth</p></li><li><p>Hardware support for PCIe encryption in addition to memory encryption</p></li><li><p>Improved support for thermally demanding drop-in PCIe accelerators</p></li></ul><p>This blog post covers the engineering rationale behind each major component selection: what we evaluated, what we chose, and why.</p><table><tr><td><p>Generation</p></td><td><p>Gen 13 Compute</p></td><td><p>Previous Gen 12 Compute</p></td></tr><tr><td><p>Form Factor</p></td><td><p>2U1N, Single socket</p></td><td><p>2U1N, Single socket</p></td></tr><tr><td><p>Processor</p></td><td><p>AMD EPYC™ 9965 
Turin 192-Core Processor</p></td><td><p>AMD EPYC™ 9684X 
Genoa-X 96-Core Processor</p></td></tr><tr><td><p>Memory</p></td><td><p>768GB of DDR5-6400 x12 memory channel</p></td><td><p>384GB of DDR5-4800 x12 memory channel</p></td></tr><tr><td><p>Storage</p></td><td><p>x3 E1.S NVMe</p><p>
</p><p> Samsung PM9D3a 7.68TB / 
Micron 7600 Pro 7.68TB</p></td><td><p>x2 E1.S NVMe </p><p>
</p><p>Samsung PM9A3 7.68TB / 
Micron 7450 Pro 7.68TB</p></td></tr><tr><td><p>Network</p></td><td><p>Dual 100 GbE OCP 3.0 </p><p>
</p><p>Intel Ethernet Network Adapter E830-CDA2 /
NVIDIA Mellanox ConnectX-6 Dx</p></td><td><p>Dual 25 GbE OCP 3.0</p><p>
</p><p>Intel Ethernet Network Adapter E810-XXVDA2 / 
NVIDIA Mellanox ConnectX-6 Lx</p></td></tr><tr><td><p>System Management</p></td><td><p>DC-SCM 2.0 ASPEED AST2600 (BMC) + AST1060 (HRoT)</p></td><td><p>DC-SCM 2.0 ASPEED AST2600 (BMC) + AST1060 (HRoT)</p></td></tr><tr><td><p>Power Supply</p></td><td><p>1300W, Titanium Grade</p></td><td><p>800W, Titanium Grade</p></td></tr></table>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Gawj2GP8s2CCZCWwNgBiB/587b0ed5ef65cf95cf178e5457150b6a/image3.png" />
          </figure><p><i><sup>Figure: Gen 13 server</sup></i></p>
    <div>
      <h2>CPU</h2>
      <a href="#cpu">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>AMD EPYC™ 9684X Genoa-X 96-Core (400W TDP, 1152 MB L3 Cache)</p></td></tr><tr><td><p>Gen 13</p></td><td><p>AMD EPYC™ 9965 Turin Dense 192-Core (500W TDP, 384 MB L3 Cache)</p></td></tr></table><p>During the design phase, we evaluated several 5th generation AMD EPYC™ Processors, code-named Turin, in Cloudflare’s hardware lab: AMD Turin 9755, AMD Turin 9845, and AMD Turin 9965. The table below summarizes the differences in <a href="https://www.amd.com/content/dam/amd/en/documents/epyc-business-docs/datasheets/amd-epyc-9005-series-processor-datasheet.pdf"><u>specifications</u></a> of the candidates for Gen 13 servers against the AMD Genoa-X 9684X used in our <a href="https://blog.cloudflare.com/gen-12-servers/"><u>Gen 12 servers</u></a>. Notably, all three candidates offer increases in core count but with smaller L3 cache per core. However, with the <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>migration to FL2</u></a>, the new workloads are <a href="https://blog.cloudflare.com/gen13-launch/"><u>less dependent on L3 cache and scale up well with the increased core count to achieve up to 100% increase in throughput</u></a>.</p><p>The three CPU candidates are designed to target different use cases: AMD Turin 9755 offers superior per-core performance, AMD Turin 9965 trades per-core performance for efficiency, and AMD Turin 9845 trades core count for lower socket power. 
We evaluated all three in our production environment.</p><table><tr><td><p>CPU Model</p></td><td><p>AMD Genoa-X 9684X</p></td><td><p>AMD Turin 9755</p></td><td><p>AMD Turin 9845</p></td><td><p>AMD Turin 9965</p></td></tr><tr><td><p>For server platform</p></td><td><p>Gen 12</p></td><td><p>Gen 13 candidate</p></td><td><p>Gen 13 candidate</p></td><td><p>Gen 13 candidate</p></td></tr><tr><td><p># of CPU Cores</p></td><td><p>96</p></td><td><p>128</p></td><td><p>160</p></td><td><p>192</p></td></tr><tr><td><p># of Threads</p></td><td><p>192</p></td><td><p>256</p></td><td><p>320</p></td><td><p>384</p></td></tr><tr><td><p>Base Clock</p></td><td><p>2.4 GHz</p></td><td><p>2.7 GHz</p></td><td><p>2.1 GHz</p></td><td><p>2.25 GHz</p></td></tr><tr><td><p>Max Boost Clock</p></td><td><p>3.7 GHz</p></td><td><p>4.1 GHz</p></td><td><p>3.7 GHz</p></td><td><p>3.7 GHz</p></td></tr><tr><td><p>All Core Boost Clock</p></td><td><p>3.42 GHz</p></td><td><p>4.1 GHz</p></td><td><p>3.25 GHz</p></td><td><p>3.35 GHz</p></td></tr><tr><td><p>Total L3 Cache</p></td><td><p>1152 MB</p></td><td><p>512 MB</p></td><td><p>320 MB</p></td><td><p>384 MB</p></td></tr><tr><td><p>L3 cache per core</p></td><td><p>12 MB / core</p></td><td><p>4 MB / core</p></td><td><p>2 MB / core</p></td><td><p>2 MB / core</p></td></tr><tr><td><p>Maximum configurable TDP</p></td><td><p>400W</p></td><td><p>500W</p></td><td><p>390W</p></td><td><p>500W</p></td></tr></table>
    <div>
      <h3>Why AMD Turin 9965?</h3>
      <a href="#why-amd-turin-9965">
        
      </a>
    </div>
    <p>First, <b>FL2 ended the L3 cache crunch</b>.</p><p>L3 cache is the large, last-level cache shared among all CPU cores on the same compute die to store frequently used data. It bridges the gap between slow main memory external to the CPU, and the fast but smaller L1 and L2 cache on the CPU, reducing the latency for the CPU to access data.</p><p>Some may notice that the 9965 has only 2 MB of L3 cache per core, an 83.3% reduction from the 12 MB per core on Gen 12’s Genoa-X 9684X. Why trade away the very cache advantage that gave Gen 12 its edge? The answer lies in how our workloads have evolved.</p><p>Cloudflare has <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>migrated from FL1 to FL2</u></a>, a complete rewrite of our request handling layer in Rust. With the new software stack, Cloudflare’s request processing pipeline has become significantly less dependent on large L3 cache. FL2 workloads <a href="https://blog.cloudflare.com/gen13-launch/"><u>scale nearly linearly with core count</u></a>, and the 9965’s 192 cores provide a 2x increase in hardware threads over Gen 12.</p><p>Second, <b>performance per total cost of ownership (TCO)</b>. During production evaluation, the 9965’s 192 cores delivered the highest aggregate requests per second of the three candidates, and its performance-per-watt scaled favorably at 500W TDP, yielding superior rack-level TCO.</p><table><tr><td><p>
</p></td><td><p><b>Gen 12 </b></p></td><td><p><b>Gen 13 </b></p></td></tr><tr><td><p>Processor</p></td><td><p>AMD EPYC™ 4th Gen Genoa-X 9684X</p></td><td><p>AMD EPYC™ 5th Gen Turin 9965</p></td></tr><tr><td><p>Core count</p></td><td><p>96C/192T</p></td><td><p>192C/384T</p></td></tr><tr><td><p>FL throughput</p></td><td><p>Baseline</p></td><td><p>Up to +100%</p></td></tr><tr><td><p>Performance per watt</p></td><td><p>Baseline</p></td><td><p>Up to +50%</p></td></tr></table><p>Third, <b>operational simplicity</b>. Our operational teams have a strong preference for fewer, higher-density servers. Managing a fleet of 192-core machines means fewer nodes to provision, patch, and monitor per unit of compute delivered. This directly reduces operational overhead across our global network.</p><p>Finally, the platform is <b>forward compatible</b>. The AMD processor architecture supports DDR5-6400, PCIe Gen 5.0, and CXL 2.0 Type 3 memory across all SKUs. AMD Turin 9965 has the highest number of high-performing cores per socket in the industry, maximizing compute density and keeping the platform competitive and relevant for years to come. By moving from AMD Genoa-X 9684X to AMD Turin 9965, we get longer security support from AMD, extending the useful life of Gen 13 servers before they become obsolete and need to be refreshed.</p>
    <div>
      <h2>Memory</h2>
      <a href="#memory">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>12x 32GB DDR5-4800 2Rx8 (384 GB total, 4 GB/core)</p></td></tr><tr><td><p>Gen 13</p></td><td><p>12x 64GB DDR5-6400 2Rx4 (768 GB total, 4 GB/core)</p></td></tr></table><p>Because the AMD Turin processor has twice the core count of the previous generation, it demands more memory resources, both in capacity and in bandwidth, to deliver throughput gains.</p>
    <div>
      <h3>Maximizing bandwidth with 12 channels</h3>
      <a href="#maximizing-bandwidth-with-12-channels">
        
      </a>
    </div>
    <p>The chosen AMD EPYC™ 9965 CPU supports twelve memory channels, and for Gen 13, we are populating every single one of them. We’ve selected 64 GB DDR5-6400 ECC RDIMMs in a “one DIMM per channel” (1DPC) configuration.</p><p>This setup provides 614 GB/s of peak memory bandwidth per socket, a 33.3% increase compared to our Gen 12 server platform. By utilizing all 12 channels, we ensure that the CPU is never “starved” for data, even during the most memory-intensive parallel workloads.</p><p>Populating all twelve channels in a balanced configuration — equal capacity per channel, with no mixed configurations — is common best practice. This matters operationally: AMD Turin processors interleave across all memory channels with the same DIMM type, same memory capacity and same rank configuration. Interleaving increases memory bandwidth by spreading contiguous memory access across all memory channels in the interleave set instead of sending all memory access to a single or a small subset of memory channels. </p>
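    <p>That 614 GB/s figure falls out of standard DDR5 accounting: each channel is 64 bits (8 bytes) wide per transfer, so twelve channels at 6400 MT/s deliver:</p>
            <pre><code># 12 channels x 6400 MT/s x 8 bytes per transfer
echo "$(( 12 * 6400 * 8 )) MB/s total"             # 614400 MB/s, i.e. ~614 GB/s
echo "$(( 12 * 6400 * 8 / 384 )) MB/s per thread"  # ~1.6 GB/s across 384 hardware threads
</code></pre>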
    <div>
      <h3>The 4 GB per core “sweet spot”</h3>
      <a href="#the-4-gb-per-core-sweet-spot">
        
      </a>
    </div>
    <p>Our Gen 12 servers are configured with 4GB per core. We revisited that decision as we designed Gen 13.</p><p>Cloudflare launches a lot of new products and services every month, and each new product or service demands an incremental amount of memory capacity. These demands accumulate over time and could become a source of memory pressure if capacity is not sized appropriately.</p><p>Our initial requirement called for a memory-to-core ratio between 4 GB and 6 GB per core. With 192 cores on the AMD Turin 9965, that translates to a range of 768 GB to 1152 GB. Note that at these capacities, DIMM modules come only in coarse capacity steps. With 12 channels in a 1DPC configuration, our options are 12x 48GB (576 GB), 12x 64GB (768 GB), or 12x 96GB (1152 GB).</p><ul><li><p>12x 48GB = 576 GB, or 1.5 GB/thread. The memory capacity of this configuration is too low; this would starve memory-hungry workloads and violate the lower bound.</p></li><li><p>12x 96GB = 1152 GB, or 3.0 GB/thread. This would be a 50% capacity increase per core and would also result in higher power consumption and a substantial increase in cost, especially in the current market conditions where memory prices are 10x what they were a year ago.</p></li><li><p>12x 64GB = 768 GB, or 2.0 GB/thread (4 GB/core). This configuration is consistent with our Gen 12 memory-to-core ratio, and represents a 2x increase in memory capacity per server. Keeping the memory capacity at 4 GB per core provides sufficient capacity for workloads that scale with core count, like our primary workload, FL, and provides sufficient headroom for future growth without overprovisioning.</p></li></ul><p><a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>FL2 uses memory more efficiently</u></a> than FL1 did: our internal measurements show FL2 uses less than half the CPU of FL1, and far less than half the memory. 
The capacity freed up by the software stack migration provides ample headroom to support Cloudflare growth for the next few years.</p><p>The decision: 12x 64GB for 768 GB total. This maintains the proven 4 GB/core ratio, provides a 2x total capacity increase over Gen 12, and stays within the DIMM cost curve sweet spot.</p>
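    <p>The candidate math above, spelled out:</p>
            <pre><code>echo "$(( 12 * 48 )) GB"   # 576 GB  -> 1.5 GB/thread, below the lower bound
echo "$(( 12 * 64 )) GB"   # 768 GB  -> 2.0 GB/thread (4 GB/core), the chosen configuration
echo "$(( 12 * 96 )) GB"   # 1152 GB -> 3.0 GB/thread, a substantial cost increase
</code></pre>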
    <div>
      <h3>Efficiency through dual rank</h3>
      <a href="#efficiency-through-dual-rank">
        
      </a>
    </div>
    <p>In Gen 12, we demonstrated that dual-rank DIMMs provide measurably higher memory throughput than single-rank modules, with advantages of up to 17.8% at a 1:1 read-write ratio. Dual-rank DIMMs are faster because they allow the memory controller to access one rank while another is refreshing. That same principle carries forward here.</p><p>Our requirement also calls for approximately 1 GB/s of memory bandwidth per hardware thread. With 614 GB/s of peak bandwidth across 384 threads, we deliver 1.6 GB/s per thread, comfortably exceeding the minimum. Production analysis has shown that Cloudflare workloads are not memory-bandwidth-bound, so we bank the headroom as margin for future workload growth.</p><p>By opting for 2Rx4 DDR5 RDIMMs at the maximum supported 6400 MT/s, we ensure we get the lowest latency and best performance from our Gen 13 platform memory configuration.</p>
    <div>
      <h2>Storage</h2>
      <a href="#storage">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>x2 E1.S NVMe PCIe 4.0, 16 TB total</p><p>Samsung PM9A3 7.68TB</p><p>Micron 7450 Pro 7.68TB</p></td></tr><tr><td><p>Gen 13</p></td><td><p>x3 E1.S NVMe PCIe 5.0, 24 TB total</p><p>Samsung PM9D3a 7.68TB</p><p>Micron 7600 Pro 7.68TB</p><p>+10x U.2 NVMe PCIe 5.0 option</p></td></tr></table><p>Our storage architecture underwent a transformation in Gen 12 when we pivoted from M.2 to EDSFF E1.S. For Gen 13, we are increasing the storage capacity and the bandwidth to align with the latest technology. We have also added a front drive bay for flexibility to add up to 10x U.2 drives to keep pace with Cloudflare storage product growth. </p>
    <div>
      <h3>The move to PCIe 5.0</h3>
      <a href="#the-move-to-pcie-5-0">
        
      </a>
    </div>
    <p>Gen 13 is configured with PCIe Gen 5.0 NVMe drives. While Gen 4.0 served us well, the move to Gen 5.0 ensures that our storage subsystem can serve data at improved latency, and keep up with increased storage bandwidth demand from the new processor. </p>
    <div>
      <h3>16 TB to 24 TB</h3>
      <a href="#16-tb-to-24-tb">
        
      </a>
    </div>
    <p>Beyond the speed increase, we are physically expanding the array from two to three NVMe drives. Our Gen 12 server platform was designed with four E1.S storage drive slots, but only two slots were populated with 8TB drives. The Gen 13 server platform uses the same design with four E1.S storage drive slots available, but with three slots populated with 8TB drives. Why add a third drive? This increases our storage capacity per server from 16TB to 24TB, expanding our global storage footprint to maintain and improve CDN cache performance. This supports growth projections for Durable Objects, Containers, and Quicksilver services, too.</p>
    <div>
      <h3>Front drive bay to support additional drives</h3>
      <a href="#front-drive-bay-to-support-additional-drives">
        
      </a>
    </div>
    <p>For Gen 13, the chassis is designed with a front drive bay that can support up to ten U.2 PCIe Gen 5.0 NVMe drives. The front drive bay provides the option for Cloudflare to use the same chassis across compute and storage platforms, as well as the flexibility to convert a compute SKU to a storage SKU when needed. </p>
    <div>
      <h3>Endurance and reliability</h3>
      <a href="#endurance-and-reliability">
        
      </a>
    </div>
    <p>We designed our servers to have a 5-year operational life, and require storage drive endurance sufficient to sustain 1 DWPD (Drive Writes Per Day) over the full server lifespan.</p><p>Both the Samsung PM9D3a and Micron 7600 Pro meet the 1 DWPD specification with a hardware over-provisioning (OP) of approximately 7%. If future workload profiles demand higher endurance, we have the option to hold back additional user capacity to increase effective OP.</p>
    <div>
      <h3>NVMe 2.0 and OCP NVMe 2.0 compliance</h3>
      <a href="#nvme-2-0-and-ocp-nvme-2-0-compliance">
        
      </a>
    </div>
    <p>Both the Samsung PM9D3a and Micron 7600 adopt the NVMe 2.0 specification (up from NVMe 1.4) and the OCP NVMe Cloud SSD Specification 2.0. Key improvements include Zoned Namespaces (ZNS) for better write amplification management, Simple Copy Command for intra-device data movement without crossing the PCIe bus, and enhanced Command and Feature Lockdown for tighter security controls. The OCP 2.0 spec also adds deeper telemetry and debug capabilities purpose-built for datacenter operations, which aligns with our emphasis on fleet-wide manageability.</p>
    <div>
      <h3>Thermal efficiency</h3>
      <a href="#thermal-efficiency">
        
      </a>
    </div>
    <p>The storage drives will continue to be in the E1.S 15mm form factor. Its high-surface-area design is essential for cooling these new Gen 5.0 controllers, which can pull upwards of 25W under sustained heavy I/O. The 2U chassis provides ample airflow over the E1.S drives as well as U.2 drive bays, a design advantage we validated in Gen 12 when we made the decision to move from 1U to 2U.</p>
    <div>
      <h2>Network</h2>
      <a href="#network">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>Dual 25 GbE port OCP 3.0 NIC </p><p>Intel E810-XXVDA2</p><p>NVIDIA Mellanox ConnectX-6 Lx</p></td></tr><tr><td><p>Gen 13</p></td><td><p>Dual 100 GbE port OCP 3.0 NIC</p><p>Intel E830-CDA2</p><p>NVIDIA Mellanox ConnectX-6 Dx</p></td></tr></table><p>Dual 25 GbE has been the backbone of our fleet <a href="https://blog.cloudflare.com/a-tour-inside-cloudflares-g9-servers/"><u>since 2018</u></a>. It served us well for more than eight years, but as CPUs improved to serve more requests and our products scaled, we’ve officially hit the wall. For Gen 13, we are quadrupling our per-port bandwidth.</p>
    <div>
      <h3>Why 100 GbE and why now?</h3>
      <a href="#why-100-gbe-and-why-now">
        
      </a>
    </div>
    <p>Network Interface Card (NIC) bandwidth must keep pace with compute performance growth. With 192 modern cores, our 25 GbE links will become a measurable bottleneck. Production data from our co-locations worldwide over a week showed that, on our Gen 12, P95 bandwidth per port is consistently &gt;50% of available bandwidth. Since throughput is doubling per server on Gen 13, we are at risk of saturating the NIC bandwidth.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2lxP5Vy6y6CzCk1rE9FKVU/064d9e5392a08e92b38bca637d053573/image4.png" />
          </figure><p><sup><i>Figure: on Gen 12, P95 bandwidth per port is consistently &gt;50% of available bandwidth</i></sup></p><p>The decision to go to 100 GbE rather than 50 GbE was driven by industry economics: 50 GbE transceiver volumes remain low in the industry, making them a poor supply chain bet. Dual 100 GbE ports also give us 200 Gb/s of aggregate bandwidth per server, future-proofing against the next several years of traffic growth.</p>
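<p>A back-of-envelope check makes the saturation risk concrete. This sketch uses only the figures quoted above (25 GbE links, P95 utilization above 50%, roughly 2x throughput per server on Gen 13); the Rust snippet and its helper function are illustrative, not Cloudflare tooling.</p>

```rust
// Projected P95 per-port demand, given the current link speed, the
// observed P95 utilization, and a per-server throughput growth factor.
fn projected_p95_gbps(link_gbps: f64, p95_utilization: f64, growth: f64) -> f64 {
    link_gbps * p95_utilization * growth
}

fn main() {
    // Gen 12 observation: >50% of a 25 GbE port at P95.
    // Gen 13 roughly doubles per-server throughput.
    let demand = projected_p95_gbps(25.0, 0.5, 2.0);
    println!("projected P95 demand: {demand} Gb/s"); // 25 Gb/s: a full 25 GbE port

    // On a 100 GbE port the same demand leaves substantial headroom.
    let headroom = 100.0 - demand;
    println!("headroom on a 100 GbE port: {headroom} Gb/s");
}
```

<p>In other words, the projected P95 demand alone would consume an entire 25 GbE port, before any traffic growth or spikes.</p>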
    <div>
      <h3>Hardware choices and compatibility</h3>
      <a href="#hardware-choices-and-compatibility">
        
      </a>
    </div>
    <p>We are maintaining our dual-vendor strategy to ensure supply chain resilience, a lesson hard-learned during the pandemic when single-sourcing the Gen 11 NIC left us scrambling.</p><p>Both NICs are compliant with <a href="https://www.servethehome.com/ocp-nic-3-0-form-factors-quick-guide-intel-broadcom-nvidia-meta-inspur-dell-emc-hpe-lenovo-gigabyte-supermicro/"><u>OCP 3.0 SFF/TSFF</u></a> form factor with the integrated pull tab, maintaining chassis commonality with Gen 12 and ensuring field technicians need no new tools or training for swaps.</p>
    <div>
      <h3>PCIe Allocation</h3>
      <a href="#pcie-allocation">
        
      </a>
    </div>
    <p>The OCP 3.0 NIC slot is allocated PCIe 4.0 x16 lanes on the motherboard, providing 256 Gb/s of raw bandwidth in each direction, more than enough for dual 100 GbE (200 Gb/s aggregate) with room to spare.</p>
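<p>The lane math behind that headroom is simple: PCIe 4.0 signals at 16 GT/s per lane, so an x16 slot carries 16 × 16 = 256 Gb/s of raw bandwidth per direction (slightly less in practice after 128b/130b line encoding). A quick illustrative check in Rust:</p>

```rust
// Raw PCIe bandwidth per direction: transfer rate per lane times lane count.
// (PCIe 4.0 is 16 GT/s per lane; 128b/130b encoding overhead is ignored here.)
fn pcie_raw_gbps(gt_per_s_per_lane: u32, lanes: u32) -> u32 {
    gt_per_s_per_lane * lanes
}

fn main() {
    let raw = pcie_raw_gbps(16, 16); // PCIe 4.0 x16
    let nic_aggregate: u32 = 2 * 100; // dual 100 GbE ports
    println!("PCIe 4.0 x16 raw: {raw} Gb/s per direction");
    println!("NIC aggregate: {nic_aggregate} Gb/s, headroom: {} Gb/s", raw - nic_aggregate);
}
```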
    <div>
      <h2>Management</h2>
      <a href="#management">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p><a href="https://blog.cloudflare.com/introducing-the-project-argus-datacenter-ready-secure-control-module-design-specification/"><u>Project Argus</u></a> Data Center Secure Control Module 2.0</p></td></tr><tr><td><p>Gen 13</p></td><td><p><a href="https://blog.cloudflare.com/introducing-the-project-argus-datacenter-ready-secure-control-module-design-specification/"><u>Project Argus</u></a> Data Center Secure Control Module 2.0</p><p>PCIe encryption</p></td></tr></table><p>We are maintaining the architectural shift, introduced in Gen 12, of separating management and security-related components from the motherboard onto the <a href="https://blog.cloudflare.com/introducing-the-project-argus-datacenter-ready-secure-control-module-design-specification/"><u>Project Argus</u></a> Data Center Secure Control Module 2.0.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6F3XH0uQvBry9LJkZlVZOi/42f507d0d46d276db1e3724b21ea49dc/image1.png" />
          </figure><p><sup><i>Figure: Project Argus DC-SCM 2.0</i></sup></p>
    <div>
      <h3>Continuity with DC-SCM 2.0</h3>
      <a href="#continuity-with-dc-scm-2-0">
        
      </a>
    </div>
    <p>We are carrying forward the Data Center Secure Control Module 2.0 (DC-SCM 2.0) standard. By decoupling management and security functions from the motherboard, we ensure that the “brains” of the server’s security stay modular and protected.</p><p>The DC-SCM module houses our most critical components:</p><ul><li><p>Basic Input/Output System (BIOS)</p></li><li><p>Baseboard Management Controller (BMC)</p></li><li><p>Hardware Root of Trust (HRoT) and TPM (Infineon SLB 9672)</p></li><li><p>Dual BMC/BIOS flash chips for redundancy</p></li></ul>
    <div>
      <h3>Why we are staying the course with DC-SCM 2.0</h3>
      <a href="#why-we-are-staying-the-course-with-dc-scm-2-0">
        
      </a>
    </div>
    <p>The decision to keep this architecture for Gen 13 is driven by the proven security gains we saw in the previous generation. By offloading these functions to a dedicated module, we maintain:</p><ul><li><p><b>Rapid recovery</b>: Dual image redundancy allows for near-instant restoration of BIOS/UEFI and BMC firmware if an accidental corruption or a malicious update is detected.</p></li><li><p><b>Physical resilience</b>: The Gen 13 chassis also moves the intrusion detection mechanism further from the flat edge of the chassis, making physical intercept harder.</p></li><li><p><b>PCIe encryption</b>: In addition to TSME (Transparent Secure Memory Encryption) for CPU-to-memory encryption, which has been enabled since our Gen 10 platforms, the AMD Turin 9965 processor in Gen 13 extends encryption to PCIe traffic, ensuring data is protected in transit across every bus in the system.</p></li><li><p><b>Operational consistency</b>: Sticking with the Gen 12 management stack means our security audits, deployment, provisioning, and standard operating procedures remain fully compatible. </p></li></ul>
    <div>
      <h2>Power</h2>
      <a href="#power">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>800W 80 PLUS Titanium CRPS</p></td></tr><tr><td><p>Gen 13</p></td><td><p>1300W 80 PLUS Titanium CRPS</p></td></tr></table><p>As we upgrade the compute and networking capability of the server, the power envelope of our servers has naturally expanded. Gen 13 servers are equipped with larger power supplies to deliver the power needed.</p>
    <div>
      <h3>The jump to 1300W</h3>
      <a href="#the-jump-to-1300w">
        
      </a>
    </div>
    <p>While our Gen 12 nodes operated comfortably with an 800W 80 PLUS Titanium CRPS (Common Redundant Power Supply), the Gen 13 specification requires a larger power supply. We have selected a 1300W 80 PLUS Titanium CRPS.</p><p>Power consumption of Gen 13 during typical operation has risen to 850W, a 250W increase over the 600W seen in Gen 12. The primary contributors are the 500W TDP CPU (up from 400W), the doubling of memory capacity, and the additional NVMe drive.</p><p>Why 1300W instead of 1000W? The current PSU ecosystem lacks viable, high-efficiency options at 1000W. To ensure supply chain reliability, we moved to the next industry-standard tier of 1300W. </p><p><a href="https://eur-lex.europa.eu/eli/reg/2019/424/oj/eng"><u>EU Lot 9</u></a> is a regulation that requires servers deployed in the European Union to have power supplies whose efficiency at 10%, 20%, 50%, and 100% load meets or exceeds the thresholds specified in the regulation. These thresholds match the Titanium grade requirements of the <a href="https://www.clearesult.com/80plus/80plus-psu-ratings-explained"><u>80 PLUS power supply certification program</u></a>. We chose a Titanium grade PSU for Gen 13 to maintain full compliance with EU Lot 9, ensuring that the servers can be deployed in our European data centers and beyond. </p>
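<p>A quick sizing check with the figures above: 850 W typical draw on a 1300 W supply puts the PSU at roughly 65% load, inside the band where Titanium-grade supplies are most efficient. The snippet below is an illustrative sketch; the efficiency observation in the comment reflects the commonly published 80 PLUS Titanium test points (10/20/50/100% load) and is an assumption, not a figure from this post.</p>

```rust
// Fraction of PSU capacity consumed at a given draw.
fn load_fraction(draw_w: f64, psu_w: f64) -> f64 {
    draw_w / psu_w
}

fn main() {
    let frac = load_fraction(850.0, 1300.0);
    println!("typical load: {:.0}% of PSU capacity", frac * 100.0);
    // 80 PLUS Titanium is certified at the 10/20/50/100% load points;
    // a ~65% typical load sits between the 50% and 100% test points,
    // near the top of the supply's efficiency curve.
}
```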
    <div>
      <h3>Thermal design: 2U pays dividends again</h3>
      <a href="#thermal-design-2u-pays-dividends-again">
        
      </a>
    </div>
    <p>The 2U1N form factor we adopted in Gen 12 continues to pay dividends. Gen 13 uses 5x 80mm fans (up from 4x in Gen 12) to handle the increased thermal load from the 500W CPU. The additional fan capacity, combined with the 2U chassis airflow characteristics, means the fans operate well below maximum duty cycle at typical ambient temperatures, keeping fan power under 50W per fan.</p>
    <div>
      <h2>Drop-in accelerator support</h2>
      <a href="#drop-in-accelerator-support">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>x2 single width FHFL or x1 double width FHFL</p></td></tr><tr><td><p>Gen 13</p></td><td><p>x2 double width FHFL</p></td></tr></table><p>Maintaining the modularity of our fleet is a core requirement for our server design. This requirement enabled Cloudflare to quickly retrofit and <a href="https://blog.cloudflare.com/workers-ai/#a-road-to-global-gpu-coverage"><u>deploy GPUs globally to more than 100 cities in 2024</u></a>. In Gen 13, we continue to support high-performance PCIe add-in cards.</p><p>On Gen 13, the 2U chassis layout has been updated to support more demanding power and thermal requirements. While Gen 12 was limited to a single double-width GPU, the Gen 13 architecture supports two double-width PCIe cards.</p>
    <div>
      <h2>A launchpad to scale Cloudflare to greater heights</h2>
      <a href="#a-launchpad-to-scale-cloudflare-to-greater-heights">
        
      </a>
    </div>
    <p>Every generation of Cloudflare servers is an exercise in balancing competing constraints: performance versus power, capacity versus cost, flexibility versus simplicity. Gen 13 comes with 2x core count, 2x memory capacity, 4x network bandwidth, 1.5x storage capacity, and future-proofing for accelerator deployments — all while improving total cost of ownership and maintaining a robust management feature set and security posture that our global fleet demands.</p><p>Gen 13 servers are fully qualified and will be deployed to serve millions of requests across Cloudflare’s global network in more than 330 cities. As always, Cloudflare’s journey to serve the Internet as efficiently as possible does not end here. As the deployment of Gen 13 begins, we are planning the architecture for Gen 14.</p><p>If you are excited about helping build a better Internet, come join us. <a href="https://www.cloudflare.com/careers/jobs/"><u>We are hiring</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[AMD]]></category>
            <guid isPermaLink="false">7KkjVfneO6PwoHTEAiSYVM</guid>
            <dc:creator>Syona Sarma</dc:creator>
            <dc:creator>JQ Lau</dc:creator>
            <dc:creator>Ma Xiong</dc:creator>
            <dc:creator>Victor Hwang</dc:creator>
        </item>
        <item>
            <title><![CDATA[Launching Cloudflare’s Gen 13 servers: trading cache for cores for 2x edge compute performance]]></title>
            <link>https://blog.cloudflare.com/gen13-launch/</link>
            <pubDate>Mon, 23 Mar 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare’s Gen 13 servers double our compute throughput by rethinking the balance between cache and cores. Moving to high-core-count AMD EPYC ™ Turin CPUs, we traded large L3 cache for raw compute density. By running our new Rust-based FL2 stack, we completely mitigated the latency penalty to unlock twice the performance. ]]></description>
            <content:encoded><![CDATA[ <p>Two years ago, Cloudflare deployed our <a href="https://blog.cloudflare.com/cloudflare-gen-12-server-bigger-better-cooler-in-a-2u1n-form-factor/"><u>12th Generation server fleet</u></a>, based on AMD EPYC™ Genoa-X processors with their massive 3D V-Cache. That cache-heavy architecture was a perfect match for our request handling layer at the time, FL1. But as we evaluated next-generation hardware, we faced a dilemma: the CPUs offering the biggest throughput gains came with a significant cache reduction. Our legacy software stack wasn't optimized for this, and the potential throughput benefits were being capped by increasing latency.</p><p>This blog describes how the <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>FL2 transition</u></a>, our Rust-based rewrite of Cloudflare's core request handling layer, allowed us to prove Gen 13's full potential and unlock performance gains that would have been impossible on our previous stack. FL2 removes the dependency on a large cache, allowing performance to scale with cores while maintaining our SLAs. Today, we are proud to announce the launch of Cloudflare's Gen 13 servers, based on 5th Gen AMD EPYC™ Turin processors and running FL2, effectively capturing and scaling performance at the edge. </p>
    <div>
      <h2>What AMD EPYC™ Turin brings to the table</h2>
      <a href="#what-amd-epycturin-brings-to-the-table">
        
      </a>
    </div>
    <p><a href="https://www.amd.com/en/products/processors/server/epyc/9005-series.html"><u>AMD's EPYC™ 5th Generation Turin-based processors</u></a> deliver more than just a core count increase. The architecture delivers improvements across multiple dimensions of what Cloudflare servers require.</p><ul><li><p><b>2x core count:</b> up to 192 cores versus Gen 12's 96 cores, with SMT providing 384 threads</p></li><li><p><b>Improved IPC:</b> Zen 5's architectural improvements deliver better instructions-per-cycle compared to Zen 4</p></li><li><p><b>Better power efficiency</b>: Despite the higher core count, Turin consumes up to 32% fewer watts per core compared to Genoa-X</p></li><li><p><b>DDR5-6400 support</b>: Higher memory bandwidth to feed all those cores</p></li></ul><p>However, Turin's high density OPNs make a deliberate tradeoff: prioritizing throughput over per core cache. Our analysis across the Turin stack highlighted this shift. For example, comparing the highest density Turin OPN to our Gen 12 Genoa-X processors reveals that Turin's 192 cores share 384MB of L3 cache. This leaves each core with access to just 2MB, one-sixth of Gen 12's allocation. For any workload that relies heavily on cache locality, which ours did, this reduction posed a serious challenge.</p><table><tr><td><p>Generation</p></td><td><p>Processor</p></td><td><p>Cores/Threads</p></td><td><p>L3 Cache/Core</p></td></tr><tr><td><p>Gen 12</p></td><td><p>AMD Genoa-X 9684X</p></td><td><p>96C/192T</p></td><td><p>12MB (3D V-Cache)</p></td></tr><tr><td><p>Gen 13 Option 1</p></td><td><p>AMD Turin 9755</p></td><td><p>128C/256T</p></td><td><p>4MB</p></td></tr><tr><td><p>Gen 13 Option 2</p></td><td><p>AMD Turin 9845</p></td><td><p>160C/320T</p></td><td><p>2MB</p></td></tr><tr><td><p>Gen 13 Option 3</p></td><td><p>AMD Turin 9965</p></td><td><p>192C/384T</p></td><td><p>2MB</p></td></tr></table>
    <div>
      <h2>Diagnosing the problem with performance counters</h2>
      <a href="#diagnosing-the-problem-with-performance-counters">
        
      </a>
    </div>
    <p>For our FL1 request handling layer, NGINX- and LuaJIT-based code, this cache reduction presented a significant challenge. But we didn't just assume it would be a problem; we measured it.</p><p>During the CPU evaluation phase for Gen 13, we collected CPU performance counters and profiling data using the <a href="https://docs.amd.com/r/en-US/68658-uProf-getting-started-guide/Identifying-Issues-Using-uProfPcm"><u>AMD uProf tool</u></a> to identify exactly what was happening under the hood. The data showed:</p><ul><li><p>L3 cache miss rates increased dramatically compared to Gen 12's servers equipped with 3D V-Cache processors</p></li><li><p>Memory fetch latency dominated request processing time, as data that previously stayed in L3 now required trips to DRAM</p></li><li><p>The latency penalty scaled with utilization: as we pushed CPU usage higher, cache contention worsened</p></li></ul><p>L3 cache hits complete in roughly 50 cycles; L3 cache misses requiring DRAM access take 350+ cycles, roughly a sevenfold difference. With 6x less cache per core, FL1 on Gen 13 was hitting memory far more often, incurring latency penalties.</p>
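<p>A simple average-memory-access-time model shows why a higher miss rate hurts so much. The cycle counts below come from the text (~50 cycles for an L3 hit, ~350 for a DRAM access); the miss rates are invented round numbers purely for illustration, not measured FL1 figures.</p>

```rust
// Expected cycles per memory access, weighting the L3-hit and DRAM-access
// latencies by how often each occurs.
fn avg_access_cycles(l3_hit_cycles: f64, dram_cycles: f64, miss_rate: f64) -> f64 {
    (1.0 - miss_rate) * l3_hit_cycles + miss_rate * dram_cycles
}

fn main() {
    // A workload that mostly fits in a large V-Cache (say, a 2% miss rate)...
    let cache_friendly = avg_access_cycles(50.0, 350.0, 0.02);
    // ...versus the same workload with one-sixth the cache (say, 20% misses).
    let cache_starved = avg_access_cycles(50.0, 350.0, 0.20);
    println!("avg cycles per access: {cache_friendly:.0} -> {cache_starved:.0}");
}
```

<p>With these illustrative rates, the average access cost roughly doubles (56 to 110 cycles), even though only one access in five actually misses.</p>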
    <div>
      <h2>The tradeoff: latency vs. throughput </h2>
      <a href="#the-tradeoff-latency-vs-throughput">
        
      </a>
    </div>
    <p>Our initial tests running FL1 on Gen 13 confirmed what the performance counters had already suggested. While the Turin processor could achieve higher throughput, it came at a steep latency cost.</p><table><tr><td><p>Metric</p></td><td><p>Gen 12 (FL1)</p></td><td><p>Gen 13 - AMD Turin 9755 (FL1)</p></td><td><p>Gen 13 - AMD Turin 9845 (FL1)</p></td><td><p>Gen 13 - AMD Turin 9965 (FL1)</p></td><td><p>Delta</p></td></tr><tr><td><p>Core count</p></td><td><p>baseline</p></td><td><p>+33%</p></td><td><p>+67%</p></td><td><p>+100%</p></td><td><p></p></td></tr><tr><td><p>FL throughput</p></td><td><p>baseline</p></td><td><p>+10%</p></td><td><p>+31%</p></td><td><p>+62%</p></td><td><p>Improvement</p></td></tr><tr><td><p>Latency at low to moderate CPU utilization</p></td><td><p>baseline</p></td><td><p>+10%</p></td><td><p>+30%</p></td><td><p>+30%</p></td><td><p>Regression</p></td></tr><tr><td><p>Latency at high CPU utilization</p></td><td><p>baseline</p></td><td><p>&gt; 20% </p></td><td><p>&gt; 50% </p></td><td><p>&gt; 50% </p></td><td><p>Unacceptable</p></td></tr></table><p>The Gen 13 evaluation server with the AMD Turin 9965, which delivered a 62% throughput gain, was compelling, and its performance uplift provided the greatest improvement to Cloudflare’s total cost of ownership (TCO). </p><p>But a more than 50% latency penalty is not acceptable. The increase in request processing latency would directly impact customer experience. We faced a familiar infrastructure question: do we accept a solution with no TCO benefit, accept the increased latency tradeoff, or find a way to boost efficiency without adding latency?</p>
    <div>
      <h2>Incremental gains with performance tuning</h2>
      <a href="#incremental-gains-with-performance-tuning">
        
      </a>
    </div>
    <p>To find a path to an optimal outcome, we collaborated with AMD to analyze the Turin 9965 data and run targeted optimization experiments. We systematically tested multiple configurations:</p><ul><li><p><b>Hardware tuning:</b> Adjusting hardware prefetchers and Data Fabric (DF) Probe Filters, which showed only marginal gains</p></li><li><p><b>Scaling workers</b>: Launching more FL1 workers, which improved throughput but cannibalized resources from other production services</p></li><li><p><b>CPU pinning &amp; isolation:</b> Adjusting workload isolation configurations to find the optimal mix, with limited success </p></li></ul><p>The configuration that ultimately provided the most value was <b>AMD’s Platform Quality of Service (PQOS)</b>. PQOS extensions enable fine-grained regulation of shared resources like cache and memory bandwidth. Since Turin processors consist of one I/O die and up to 12 Core Complex Dies (CCDs), each sharing an L3 cache across up to 16 cores, we put this to the test. Here is how the different experimental configurations performed. </p><p>First, we used PQOS to allocate a dedicated L3 cache share within a single CCD for FL1; the gains were minimal. However, when we scaled the concept to the socket level, dedicating entire CCDs strictly to FL1, we saw meaningful throughput gains while keeping latency acceptable.</p><div>
<figure>
<table><colgroup><col></col><col></col><col></col><col></col></colgroup>
<tbody>
<tr>
<td>
<p><span><span>Configuration</span></span></p>
</td>
<td>
<p><span><span>Description</span></span></p>
</td>
<td>
<p><span><span>Illustration</span></span></p>
</td>
<td>
<p><span><span>Performance gain</span></span></p>
</td>
</tr>
<tr>
<td>
<p><span><span>NUMA-aware core affinity </span></span><br /><span><span>(equivalent to PQOS at socket level)</span></span></p>
</td>
<td>
<p><span><span>6 out of 12 CCD (aligned with NUMA domain) run FL.</span></span></p>
<p> </p>
<p><span><span>32MB L3 cache in each CCD shared among all cores. </span></span></p>
</td>
<td>
<p><span><span><img src="https://images.ctfassets.net/zkvhlag99gkb/4CBSHY02oIZOiENgFrzLSz/0c6c2ac8ef0096894ff4827e30d25851/image3.png" /></span></span></p>
</td>
<td>
<p><span><span>&gt;15% incremental </span></span></p>
<p><span><span>throughput gain</span></span></p>
</td>
</tr>
<tr>
<td>
<p><span><span>PQOS config 1</span></span></p>
</td>
<td>
<p><span><span>1 of 2 vCPU on each physical core in each CCD runs FL. </span></span></p>
<p> </p>
<p><span><span>FL gets 75% of the 32MB L3 cache of each CCD.</span></span></p>
</td>
<td>
<p><span><span><img src="https://images.ctfassets.net/zkvhlag99gkb/3iJo1BBRueQRy92R3aXbGx/596c3231fa0e66f20de70ea02615f9a7/image2.png" /></span></span></p>
</td>
<td>
<p><span><span>&lt; 5% incremental throughput gain</span></span></p>
<p> </p>
<p><span><span>Other services show minor signs of degradation</span></span></p>
</td>
</tr>
<tr>
<td>
<p><span><span>PQOS config 2</span></span></p>
</td>
<td>
<p><span><span>1 of 2 vCPU in each physical core in each CCD runs FL.</span></span></p>
<p> </p>
<p><span><span>FL gets 50% of the 32MB L3 cache of each CCD.</span></span></p>
</td>
<td>
<p><span><span><img src="https://images.ctfassets.net/zkvhlag99gkb/3iJo1BBRueQRy92R3aXbGx/596c3231fa0e66f20de70ea02615f9a7/image2.png" /></span></span></p>
</td>
<td>
<p><span><span>&lt; 5% incremental throughput gain</span></span></p>
</td>
</tr>
<tr>
<td>
<p><span><span>PQOS config 3</span></span></p>
</td>
<td>
<p><span><span>2 vCPU on 50% of the physical core in each CCD runs FL. </span></span></p>
<p> </p>
<p><span><span>FL gets 50% of  the 32MB L3 cache of each CCD.</span></span></p>
</td>
<td>
<p><span><span><img src="https://images.ctfassets.net/zkvhlag99gkb/7FKLfSxnSNUlXJCw8CJGzU/69c7b81b6cee5a2c7040ecc96748084b/image5.png" /></span></span></p>
</td>
<td>
<p><span><span>&lt; 5% incremental throughput gain</span></span></p>
</td>
</tr>
</tbody>
</table>
</figure>
</div>
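<p>On Linux, cache-allocation experiments like these are typically driven through the resctrl filesystem, where an L3 allocation is expressed as a contiguous capacity bitmask (CBM) over cache ways. The sketch below only builds such a mask; the 16-way geometry, the example schemata string, and the helper name are assumptions for illustration, not Cloudflare's actual tooling.</p>

```rust
// Build a contiguous capacity bitmask (CBM) covering `fraction` of `ways`
// cache ways, as used in resctrl "schemata" entries. At least one way is
// always allocated, since an empty CBM is invalid.
fn cbm_for_fraction(ways: u32, fraction: f64) -> u32 {
    let n = ((ways as f64 * fraction).round() as u32).clamp(1, ways);
    (1u32 << n) - 1
}

fn main() {
    // "PQOS config 1" above: 75% of a (hypothetical) 16-way L3 -> 12 ways.
    println!("75% of 16 ways: {:#06x}", cbm_for_fraction(16, 0.75));
    // 50% -> 8 ways. A mask like this would be written into a resctrl
    // group's schemata file, e.g. "L3:0=00ff" (path and format assumed).
    println!("50% of 16 ways: {:#06x}", cbm_for_fraction(16, 0.50));
}
```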
    <div>
      <h2>The opportunity: FL2 was already in progress</h2>
      <a href="#the-opportunity-fl2-was-already-in-progress">
        
      </a>
    </div>
    <p>Hardware tuning and resource configuration provided modest gains, but to truly unlock the performance potential of the Gen 13 architecture, we knew we would have to rewrite our software stack to fundamentally change how it utilized system resources.</p><p>Fortunately, we weren't starting from scratch. As we <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>announced during Birthday Week 2025</u></a>, we had already been rebuilding FL1 from the ground up. FL2 is a complete rewrite of our request handling layer in Rust, built on our <a href="https://blog.cloudflare.com/pingora-open-source/"><u>Pingora</u></a> and <a href="https://blog.cloudflare.com/introducing-oxy/"><u>Oxy</u></a> frameworks, replacing 15 years of NGINX and LuaJIT code.</p><p>The FL2 project wasn't initiated to solve the Gen 13 cache problem — it was driven by the need for better security (Rust's memory safety), faster development velocity (strict module system), and improved performance across the board (less CPU, less memory, modular execution).</p><p>FL2's cleaner architecture, with better memory access patterns and less dynamic allocation, might not depend on massive L3 caches the way FL1 did. This gave us an opportunity to use the FL2 transition to prove whether Gen 13's throughput gains could be realized without the latency penalty.</p>
    <div>
      <h2>Proving it out: FL2 on Gen 13</h2>
      <a href="#proving-it-out-fl2-on-gen-13">
        
      </a>
    </div>
    <p>As the FL2 rollout progressed, production metrics from our Gen 13 servers validated what we had hypothesized.</p><table><tr><td><p>Metric</p></td><td><p>Gen 13 AMD Turin 9965 (FL1)</p></td><td><p>Gen 13 AMD Turin 9965 (FL2)</p></td></tr><tr><td><p>FL requests per CPU%</p></td><td><p>baseline</p></td><td><p>50% higher</p></td></tr><tr><td><p>Latency vs Gen 12</p></td><td><p>baseline</p></td><td><p>70% lower</p></td></tr><tr><td><p>Throughput vs Gen 12</p></td><td><p>62% higher</p></td><td><p>100% higher</p></td></tr></table><p>The out-of-the-box efficiency gains on our new FL2 stack were substantial, even before any system optimizations. FL2 slashed the latency penalty by 70%, allowing us to push Gen 13 to higher CPU utilization while strictly meeting our latency SLAs. Under FL1, this would have been impossible.</p><p>By effectively eliminating the cache bottleneck, FL2 enables our throughput to scale linearly with core count. The impact is undeniable on the high-density AMD Turin 9965: we achieved a 2x performance gain, unlocking the true potential of the hardware. With further system tuning, we expect to squeeze even more power out of our Gen 13 fleet.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1jV1q0n9PgmbbNzDl8E1J1/2ead24a20cc10836ba041f73a16f3883/image6.png" />
          </figure>
    <div>
      <h2>Generational improvement with Gen 13</h2>
      <a href="#generational-improvement-with-gen-13">
        
      </a>
    </div>
    <p>With FL2 unlocking the immense throughput of the high-core-count AMD Turin 9965, we have officially selected these processors for our Gen 13 deployment. Hardware qualification is complete, and Gen 13 servers are now shipping at scale to support our global rollout.</p>
    <div>
      <h3>Performance improvements</h3>
      <a href="#performance-improvements">
        
      </a>
    </div>
    <table><tr><td><p>
</p></td><td><p>Gen 12 </p></td><td><p>Gen 13 </p></td></tr><tr><td><p>Processor</p></td><td><p>AMD EPYC™ 4th Gen Genoa-X 9684X</p></td><td><p>AMD EPYC™ 5th Gen Turin 9965</p></td></tr><tr><td><p>Core count</p></td><td><p>96C/192T</p></td><td><p>192C/384T</p></td></tr><tr><td><p>FL throughput</p></td><td><p>baseline</p></td><td><p>Up to +100%</p></td></tr><tr><td><p>Performance per watt</p></td><td><p>baseline</p></td><td><p>Up to +50%</p></td></tr></table>
    <div>
      <h3>Gen 13 business impact</h3>
      <a href="#gen-13-business-impact">
        
      </a>
    </div>
    <p><b>Up to 2x throughput vs Gen 12 </b>for uncompromising customer experience: By doubling our throughput capacity while staying within our latency SLAs, we guarantee our applications remain fast and responsive, and able to absorb massive traffic spikes.</p><p><b>50% better performance/watt vs Gen 12 </b>for sustainable scaling: This gain in power efficiency not only reduces data center expansion costs, but allows us to process growing traffic with a vastly lower carbon footprint per request.</p><p><b>60% higher rack throughput vs Gen 12 </b>for global edge upgrades: Because we achieved this throughput density while keeping the rack power budget constant, we can seamlessly deploy this next generation compute anywhere in the world across our global edge network, delivering top tier performance exactly where our customers want it.</p>
    <div>
      <h2>Gen 13 + FL2: ready for the edge </h2>
      <a href="#gen-13-fl2-ready-for-the-edge">
        
      </a>
    </div>
    <p>Our legacy request serving layer FL1 hit a cache contention wall on Gen 13, forcing an unacceptable tradeoff between throughput and latency. Instead of compromising, we built FL2. </p><p>Designed with a vastly leaner memory access pattern, FL2 removes our dependency on massive L3 caches and allows linear scaling with core count. Running on the Gen 13 AMD Turin platform, FL2 unlocks 2x the throughput and a 50% boost in power efficiency all while keeping latency within our SLAs. This leap forward is a great reminder of the importance of hardware-software co-design. Unconstrained by cache limits, Gen 13 servers are now ready to be deployed to serve millions of requests across Cloudflare’s global network.</p><p>If you're excited about working on infrastructure at global scale, <a href="https://www.cloudflare.com/careers/jobs"><u>we're hiring</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[Engineering]]></category>
            <guid isPermaLink="false">4shbA7eyT2KredK7RJyizK</guid>
            <dc:creator>Syona Sarma</dc:creator>
            <dc:creator>JQ Lau</dc:creator>
            <dc:creator>Jesse Brandeburg</dc:creator>
        </item>
        <item>
            <title><![CDATA[Shedding old code with ecdysis: graceful restarts for Rust services at Cloudflare]]></title>
            <link>https://blog.cloudflare.com/ecdysis-rust-graceful-restarts/</link>
            <pubDate>Fri, 13 Feb 2026 14:00:00 GMT</pubDate>
            <description><![CDATA[ ecdysis is a Rust library enabling zero-downtime upgrades for network services. After five years protecting millions of connections at Cloudflare, it’s now open source. ]]></description>
            <content:encoded><![CDATA[ <blockquote><p>ecdysis | <i>ˈekdəsəs</i> |</p><p>noun</p><p>    the process of shedding the old skin (in reptiles) or casting off the outer 
    cuticle (in insects and other arthropods).  </p></blockquote><p>How do you upgrade a network service, handling millions of requests per second around the globe, without disrupting even a single connection?</p><p>One of our solutions at Cloudflare to this massive challenge has long been <a href="https://github.com/cloudflare/ecdysis"><b><u>ecdysis</u></b></a>, a Rust library that implements graceful process restarts where no live connections are dropped, and no new connections are refused. </p><p>Last month, <b>we open-sourced ecdysis</b>, so now anyone can use it. After five years of production use at Cloudflare, ecdysis has proven itself by enabling zero-downtime upgrades across our critical Rust infrastructure, saving millions of requests with every restart across Cloudflare’s <a href="https://www.cloudflare.com/network/"><u>global network</u></a>.</p><p>It’s hard to overstate the importance of getting these upgrades right, especially at the scale of Cloudflare’s network. Many of our services perform critical tasks such as traffic routing, <a href="https://www.cloudflare.com/application-services/solutions/certificate-lifecycle-management/"><u>TLS lifecycle management</u></a>, or firewall rules enforcement, and must operate continuously. If one of these services goes down, even for an instant, the cascading impact can be catastrophic. Dropped connections and failed requests quickly lead to degraded customer performance and business impact.</p><p>When these services need updates, security patches can’t wait. Bug fixes need deployment and new features must roll out. </p><p>The naive approach involves waiting for the old process to be stopped before spinning up the new one, but this creates a window of time where connections are refused and requests are dropped. 
For a service handling thousands of requests per second in a single location, multiply that across hundreds of data centers, and a brief restart becomes millions of failed requests globally.</p><p>Let’s dig into the problem, and how ecdysis has been the solution for us — and maybe will be for you. </p><p><b>Links</b>: <a href="https://github.com/cloudflare/ecdysis">GitHub</a> <b>|</b> <a href="https://crates.io/crates/ecdysis">crates.io</a> <b>|</b> <a href="https://docs.rs/ecdysis">docs.rs</a></p>
    <div>
      <h3>Why graceful restarts are hard</h3>
      <a href="#why-graceful-restarts-are-hard">
        
      </a>
    </div>
    <p>The naive approach to restarting a service, as we mentioned, is to stop the old process and start a new one. This works acceptably for simple services that don’t handle real-time requests, but for network services processing live connections, this approach has critical limitations.</p><p>First, the naive approach creates a window during which no process is listening for incoming connections. When the old process stops, it closes its listening sockets, which causes the OS to immediately refuse new connections with <code>ECONNREFUSED</code>. Even if the new process starts immediately, there will always be a gap where nothing is accepting connections, whether milliseconds or seconds. For a service handling thousands of requests per second, even a gap of 100ms means hundreds of dropped connections.</p><p>Second, stopping the old process kills all already-established connections. A client uploading a large file or streaming video gets abruptly disconnected. Long-lived connections like WebSockets or gRPC streams are terminated mid-operation. From the client’s perspective, the service simply vanishes.</p><p>Binding the new process before shutting down the old one appears to solve this, but also introduces additional issues. The kernel normally allows only one process to bind to an address:port combination, but <a href="https://man7.org/linux/man-pages/man7/socket.7.html"><u>the SO_REUSEPORT socket option</u></a> permits multiple binds. However, this creates a problem during process transitions that makes it unsuitable for graceful restarts.</p><p>When <code>SO_REUSEPORT</code> is used, the kernel creates separate listening sockets for each process and <a href="https://lwn.net/Articles/542629/"><u>load balances new connections across these sockets</u></a>. When the initial <code>SYN</code> packet for a connection is received, the kernel will assign it to one of the listening processes. 
Once the initial handshake is completed, the connection then sits in the <code>accept()</code> queue of the process until the process accepts it. If the process then exits before accepting this connection, it becomes orphaned and is terminated by the kernel. GitHub’s engineering team documented this issue extensively when <a href="https://github.blog/2020-10-07-glb-director-zero-downtime-load-balancer-updates/"><u>building their GLB Director load balancer</u></a>.</p>
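<p>To make the bind semantics concrete, here is a small Python sketch (unrelated to ecdysis itself, Linux/macOS only) showing that a second socket can bind an already-bound port only when both sockets opt into <code>SO_REUSEPORT</code>:</p>

```python
import errno
import socket

# Illustrative sketch of SO_REUSEPORT bind behavior (Linux/macOS).
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
s1.bind(("127.0.0.1", 0))             # ephemeral port
port = s1.getsockname()[1]
s1.listen(16)

# A socket without SO_REUSEPORT is rejected with EADDRINUSE.
plain = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    plain.bind(("127.0.0.1", port))
    second_bind_failed = False
except OSError as e:
    second_bind_failed = e.errno == errno.EADDRINUSE
plain.close()

# A second socket that also sets SO_REUSEPORT binds successfully,
# and the kernel load balances new connections between the two.
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
s2.bind(("127.0.0.1", port))
s2.listen(16)

assert second_bind_failed
s1.close()
s2.close()
```

This is exactly the property that makes <code>SO_REUSEPORT</code> attractive at first glance, and exactly why the orphaned accept-queue problem above rules it out for restarts.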
    <div>
      <h3>How ecdysis works</h3>
      <a href="#how-ecdysis-works">
        
      </a>
    </div>
    <p>When we set out to design and build ecdysis, we identified four key goals for the library:</p><ol><li><p><b>Old code can be completely shut down</b> post-upgrade.</p></li><li><p><b>The new process has a grace period</b> for initialization.</p></li><li><p><b>New code crashing during initialization is acceptable</b> and shouldn’t affect the running service.</p></li><li><p><b>Only a single upgrade runs in parallel</b> to avoid cascading failures.</p></li></ol><p>ecdysis satisfies these requirements following an approach pioneered by NGINX, which has supported graceful upgrades since its early days. The approach is straightforward: </p><ol><li><p>The parent process <code>fork()</code>s a new child process.</p></li><li><p>The child process replaces itself with a new version of the code with <code>execve()</code>.</p></li><li><p>The child process inherits the socket file descriptors via a named pipe shared with the parent.</p></li><li><p>The parent process waits for the child process to signal readiness before shutting down.</p></li></ol>
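<p>Stripped of the library specifics, the handover can be sketched in a few lines of Python. This is a toy that forks without re-executing, so it is not the ecdysis API, but it shows the load-bearing property: the listening socket stays open across the generational handover, so a client connecting mid-restart is still served.</p>

```python
import os
import socket

# Toy sketch (POSIX only): parent creates the listener, forks, and the
# child serves a connection on the inherited file descriptor. A real
# upgrade would execve() a new binary and rebuild the socket from the
# raw fd; here we just fork to show the socket never closes.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 0))       # ephemeral port
listener.listen(16)
addr = listener.getsockname()

pid = os.fork()
if pid == 0:
    # "New generation": accepts on the inherited listening socket.
    conn, _ = listener.accept()
    conn.sendall(b"ready")
    conn.close()
    os._exit(0)

# A client connecting during the handover is served normally.
client = socket.create_connection(addr)
buf = b""
while len(buf) < 5:
    buf += client.recv(5 - len(buf))
assert buf == b"ready"
client.close()
os.wait()
listener.close()
```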
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4QK8GY1s30C8RUovBQnqbD/525094478911eda96c7877a10753159f/image3.png" />
          </figure><p>Crucially, the socket remains open throughout the transition. The child process inherits the listening socket from the parent as a file descriptor shared via a named pipe. During the child's initialization, both processes share the same underlying kernel data structure, allowing the parent to continue accepting and processing new and existing connections. Once the child completes initialization, it notifies the parent and begins accepting connections. Upon receiving this ready notification, the parent immediately closes its copy of the listening socket and continues handling only existing connections. </p><p>This process eliminates coverage gaps while providing the child a safe initialization window. There is a brief window of time when both the parent and child may accept connections concurrently. This is intentional; any connections accepted by the parent are simply handled until completion as part of the draining process.</p><p>This model also provides the required crash safety. If the child process fails during initialization (e.g., due to a configuration error), it simply exits. Since the parent never stopped listening, no connections are dropped, and the upgrade can be retried once the problem is fixed.</p><p>ecdysis implements the forking model with first-class support for asynchronous programming through <a href="https://tokio.rs"><u>Tokio</u></a> and <code>systemd</code> integration:</p><ul><li><p><b>Tokio integration</b>: Native async stream wrappers for Tokio. Inherited sockets become listeners without additional glue code. For synchronous services, ecdysis supports operation without async runtime requirements.</p></li><li><p><b>systemd-notify support</b>: When the <code>systemd_notify</code> feature is enabled, ecdysis automatically integrates with systemd’s process lifecycle notifications. 
Setting <code>Type=notify-reload</code> in your service unit file allows systemd to track upgrades correctly.</p></li><li><p><b>systemd named sockets</b>: The <code>systemd_sockets</code> feature enables ecdysis to manage systemd-activated sockets. Your service can be socket-activated and support graceful restarts simultaneously.</p></li></ul><p>Platform note: ecdysis relies on Unix-specific syscalls for socket inheritance and process management. It does not work on Windows. This is a fundamental limitation of the forking approach.</p>
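<p>As a sketch, a unit file for an ecdysis-based service might look like the following (the binary path is a placeholder; <code>Type=notify-reload</code> requires systemd 253 or newer). With this type, <code>systemctl reload</code> sends the reload signal, <code>SIGHUP</code> by default, and systemd waits for the service to report readiness:</p>

```ini
# Hypothetical unit file for an ecdysis-based service.
# ExecStart path is a placeholder, not from the ecdysis docs.
[Unit]
Description=Graceful-restart echo server

[Service]
Type=notify-reload
ExecStart=/usr/local/bin/echo-server
# ReloadSignal=SIGHUP is the default for notify-reload, matching the
# SIGHUP upgrade trigger used in the code example later in this post.
```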
    <div>
      <h3>Security considerations</h3>
      <a href="#security-considerations">
        
      </a>
    </div>
    <p>Graceful restarts introduce security considerations. The forking model creates a brief window where two process generations coexist, both with access to the same listening sockets and potentially sensitive file descriptors.</p><p>ecdysis addresses these concerns through its design:</p><p><b>Fork-then-exec</b>: ecdysis follows the traditional Unix pattern of <code>fork()</code> followed immediately by <code>execve()</code>. This ensures the child process starts with a clean slate: new address space, fresh code, and no inherited memory. Only explicitly-passed file descriptors cross the boundary.</p><p><b>Explicit inheritance</b>: Only listening sockets and communication pipes are inherited. Other file descriptors are closed via <code>CLOEXEC</code> flags. This prevents accidental leakage of sensitive handles.</p><p><b>seccomp compatibility</b>: Services using seccomp filters must allow <code>fork()</code> and <code>execve()</code>. This is a tradeoff: graceful restarts require these syscalls, so they cannot be blocked.</p><p>For most network services, these tradeoffs are acceptable. The security of the fork-exec model is well understood and has been battle-tested for decades in software like NGINX and Apache.</p>
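<p>The explicit-inheritance behavior is easy to demonstrate outside of ecdysis. In this Python sketch (illustrative, not library code), file descriptors are close-on-exec by default, so only the fd deliberately marked inheritable survives into the exec'd child:</p>

```python
import os
import subprocess
import sys

# Sketch of the explicit-inheritance pattern (not ecdysis code). In
# modern runtimes, fds are close-on-exec by default, so only fds
# deliberately marked inheritable survive execve().
r, w = os.pipe()                      # both fds are CLOEXEC by default
os.set_inheritable(r, True)           # explicitly pass only the read end
os.write(w, b"socket-fd")             # stand-in for a listening socket
os.close(w)

# The child process reads from the single inherited descriptor.
child = subprocess.run(
    [sys.executable, "-c",
     "import os, sys; print(os.read(int(sys.argv[1]), 16).decode())",
     str(r)],
    close_fds=False,                  # per-fd inheritable flags decide
    capture_output=True,
    text=True,
)
assert child.stdout.strip() == "socket-fd"
```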
    <div>
      <h3>Code example</h3>
      <a href="#code-example">
        
      </a>
    </div>
    <p>Let’s look at a practical example. Here’s a simplified TCP echo server that supports graceful restarts:</p>
            <pre><code>use ecdysis::tokio_ecdysis::{SignalKind, StopOnShutdown, TokioEcdysisBuilder};
use tokio::{net::TcpStream, task::JoinSet};
use futures::StreamExt;
use std::net::SocketAddr;

#[tokio::main]
async fn main() {
    // Create the ecdysis builder
    let mut ecdysis_builder = TokioEcdysisBuilder::new(
        SignalKind::hangup()  // Trigger upgrade/reload on SIGHUP
    ).unwrap();

    // Trigger stop on SIGUSR1
    ecdysis_builder
        .stop_on_signal(SignalKind::user_defined1())
        .unwrap();

    // Create listening socket - will be inherited by children
    let addr: SocketAddr = "0.0.0.0:8080".parse().unwrap();
    let stream = ecdysis_builder
        .build_listen_tcp(StopOnShutdown::Yes, addr, |builder, addr| {
            builder.set_reuse_address(true)?;
            builder.bind(&amp;addr.into())?;
            builder.listen(128)?;
            Ok(builder.into())
        })
        .unwrap();

    // Spawn task to handle connections
    let server_handle = tokio::spawn(async move {
        let mut stream = stream;
        let mut set = JoinSet::new();
        while let Some(Ok(socket)) = stream.next().await {
            set.spawn(handle_connection(socket));
        }
        set.join_all().await;
    });

    // Signal readiness and wait for shutdown
    let (_ecdysis, shutdown_fut) = ecdysis_builder.ready().unwrap();
    let shutdown_reason = shutdown_fut.await;

    log::info!("Shutting down: {:?}", shutdown_reason);

    // Gracefully drain connections
    server_handle.await.unwrap();
}

async fn handle_connection(mut socket: TcpStream) {
    // Echo connection logic here
}</code></pre>
            <p>The key points:</p><ol><li><p><code><b>build_listen_tcp</b></code> creates a listener that will be inherited by child processes.</p></li><li><p><code><b>ready()</b></code> signals to the parent process that initialization is complete and that it can safely exit.</p></li><li><p><code><b>shutdown_fut.await</b></code> blocks until an upgrade or stop is requested. This future only yields once the process should be shut down, either because an upgrade/reload was executed successfully or because a shutdown signal was received.</p></li></ol><p>When you send <code>SIGHUP</code> to this process, here’s what ecdysis does…</p><p><i>…on the parent process:</i></p><ul><li><p>Forks and execs a new instance of your binary.</p></li><li><p>Passes the listening socket to the child.</p></li><li><p>Waits for the child to call <code>ready()</code>.</p></li><li><p>Drains existing connections, then exits.</p></li></ul><p><i>…on the child process:</i></p><ul><li><p>Initializes itself following the same execution flow as the parent, except any sockets owned by ecdysis are inherited and not bound by the child.</p></li><li><p>Signals readiness to the parent by calling <code>ready()</code>.</p></li><li><p>Blocks waiting for a shutdown or upgrade signal.</p></li></ul>
    <div>
      <h3>Production at scale</h3>
      <a href="#production-at-scale">
        
      </a>
    </div>
    <p>ecdysis has been running in production at Cloudflare since 2021. It powers critical Rust infrastructure services deployed across 330+ data centers in 120+ countries. These services handle billions of requests per day and require frequent updates for security patches, feature releases, and configuration changes.</p><p>Every restart using ecdysis saves hundreds of thousands of requests that would otherwise be dropped during a naive stop/start cycle. Across our global footprint, this translates to millions of preserved connections and improved reliability for customers.</p>
    <div>
      <h3>ecdysis vs alternatives</h3>
      <a href="#ecdysis-vs-alternatives">
        
      </a>
    </div>
    <p>Graceful restart libraries exist for several ecosystems. Understanding when to use ecdysis versus alternatives is critical to choosing the right tool.</p><p><a href="https://github.com/cloudflare/tableflip"><b><u>tableflip</u></b></a> is our Go library that inspired ecdysis. It implements the same fork-and-inherit model for Go services. If you need Go, tableflip is a great option!</p><p><a href="https://github.com/cloudflare/shellflip"><b><u>shellflip</u></b></a> is Cloudflare’s other Rust graceful restart library, designed specifically for Oxy, our Rust-based proxy. shellflip is more opinionated: it assumes systemd and Tokio, and focuses on transferring arbitrary application state between parent and child. This makes it excellent for complex stateful services, or services that want to apply such aggressive sandboxing that they can’t even open their own sockets, but adds overhead for simpler cases.</p>
    <div>
      <h3>Start building</h3>
      <a href="#start-building">
        
      </a>
    </div>
    <p>ecdysis brings five years of production-hardened graceful restart capabilities to the Rust ecosystem. It’s the same technology protecting millions of connections across Cloudflare’s global network, now open-sourced and available for anyone!</p><p>Full documentation is available at <a href="https://docs.rs/ecdysis"><u>docs.rs/ecdysis</u></a>, including API reference, examples for common use cases, and steps for integrating with <code>systemd</code>.</p><p>The <a href="https://github.com/cloudflare/ecdysis/tree/main/examples"><u>examples directory</u></a> in the repository contains working code demonstrating TCP listeners, Unix socket listeners, and systemd integration.</p><p>The library is actively maintained by the Argo Smart Routing &amp; Orpheus team, with contributions from teams across Cloudflare. We welcome contributions, bug reports, and feature requests on <a href="https://github.com/cloudflare/ecdysis"><u>GitHub</u></a>.</p><p>Whether you’re building a high-performance proxy, a long-lived API server, or any network service where uptime matters, ecdysis can provide a foundation for zero-downtime operations.</p><p>Start building:<a href="https://github.com/cloudflare/ecdysis"> <u>github.com/cloudflare/ecdysis</u></a></p> ]]></content:encoded>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[Edge]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Application Services]]></category>
            <guid isPermaLink="false">GMarF75NkFuiwVuyFJk77</guid>
            <dc:creator>Manuel Olguín Muñoz</dc:creator>
        </item>
        <item>
            <title><![CDATA[How Workers powers our internal maintenance scheduling pipeline]]></title>
            <link>https://blog.cloudflare.com/building-our-maintenance-scheduler-on-workers/</link>
            <pubDate>Mon, 22 Dec 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Physical data center maintenance is risky on a global network. We built a maintenance scheduler on Workers to safely plan disruptive operations, while solving scaling challenges by viewing the state of our infrastructure through a graph interface on top of multiple data sources and metrics pipelines. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare has data centers in over <a href="https://www.cloudflare.com/network/"><u>330 cities globally</u></a>, so you might think we could easily disrupt a few at any time without users noticing when we plan data center operations. However, the reality is that <a href="https://developers.cloudflare.com/support/disruptive-maintenance/"><u>disruptive maintenance</u></a> requires careful planning, and as Cloudflare grew, managing these complexities through manual coordination between our infrastructure and network operations specialists became nearly impossible.</p><p>It is no longer feasible for a human to track every overlapping maintenance request or account for every customer-specific routing rule in real time. We reached a point where manual oversight alone couldn't guarantee that a routine hardware update in one part of the world wouldn't inadvertently conflict with a critical path in another.</p><p>We realized we needed a centralized, automated "brain" to act as a safeguard — a system that could see the entire state of our network at once. By building this scheduler on <a href="https://workers.cloudflare.com/"><u>Cloudflare Workers</u></a>, we created a way to programmatically enforce safety constraints, ensuring that no matter how fast we move, we never sacrifice the reliability of the services on which our customers depend.</p><p>In this blog post, we’ll explain how we built it, and share the results we’re seeing now.</p>
    <div>
      <h2>Building a system to de-risk critical maintenance operations</h2>
      <a href="#building-a-system-to-de-risk-critical-maintenance-operations">
        
      </a>
    </div>
    <p>Picture an edge router that acts as one of a small, redundant group of gateways that collectively connect the public Internet to the many Cloudflare data centers operating in a metro area. In a populated city, we need to ensure that the multiple data centers sitting behind this small cluster of routers do not get cut off because the routers were all taken offline simultaneously. </p><p>Another maintenance challenge comes from our Zero Trust product, Dedicated CDN Egress IPs, which allows customers to choose specific data centers from which their user traffic will exit Cloudflare and be sent to their geographically close origin servers for low latency. (For the purpose of brevity in this post, we'll refer to the Dedicated CDN Egress IPs product as "Aegis," which was its former name.) If all the data centers a customer chose are offline at once, they would see higher latency and possibly 5xx errors, which we must avoid. </p><p>Our maintenance scheduler solves problems like these. We can make sure that we always have at least one edge router active in a certain area. And when scheduling maintenance, we can see if the combination of multiple scheduled events would cause all the data centers for a customer’s Aegis pools to be offline at the same time.</p><p>Before we created the scheduler, these simultaneous disruptive events could cause downtime for customers. Now, our scheduler notifies internal operators of potential conflicts, allowing us to propose a new time to avoid overlapping with other related data center maintenance events.</p><p>We define these operational scenarios, such as edge router availability and customer rules, as maintenance constraints which allow us to plan more predictable and safe maintenance.</p>
    <div>
      <h2>Maintenance constraints</h2>
      <a href="#maintenance-constraints">
        
      </a>
    </div>
    <p>Every constraint starts with a set of proposed maintenance items, such as a network router or list of servers. We then find all the maintenance events in the calendar that overlap with the proposed maintenance time window.</p>
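<p>The overlap test itself is simple: treating each window as half-open <code>[start, end)</code>, two windows overlap exactly when each starts before the other ends. A Python sketch with hypothetical helper names, not our production code:</p>

```python
# Hypothetical overlap check: windows are (start, end) pairs, half-open.
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def conflicting_events(proposed, calendar):
    """Return calendar events whose windows overlap the proposed one."""
    return [e for e in calendar if overlaps(proposed, e["window"])]

calendar = [
    {"id": "swap-psu-fra03", "window": (1000, 2000)},
    {"id": "recable-syd01", "window": (3000, 4000)},
]
assert [e["id"] for e in conflicting_events((1500, 2500), calendar)] == ["swap-psu-fra03"]
# Half-open windows: touching edges do not count as overlap.
assert conflicting_events((2000, 3000), calendar) == []
```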
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2vHCauxOGRXzhrO6DNDr2S/cf38b93ac9b812e5e064f800e537e549/image4.png" />
          </figure><p>Next, we aggregate product APIs, such as a list of Aegis customer IP pools. Aegis returns a set of IP ranges where a customer requested egress out of specific data center IDs, shown below.</p>
            <pre><code>[
    {
      "cidr": "104.28.0.32/32",
      "pool_name": "customer-9876",
      "port_slots": [
        {
          "dc_id": 21,
          "other_colos_enabled": true,
        },
        {
          "dc_id": 45,
          "other_colos_enabled": true,
        }
      ],
      "modified_at": "2023-10-22T13:32:47.213767Z"
    },
]</code></pre>
            <p>In this scenario, data center 21 and data center 45 relate to each other because we need at least one data center online for the Aegis customer 9876 to receive egress traffic from Cloudflare. If we tried to take data centers 21 and 45 down simultaneously, our coordinator would alert us that there would be unintended consequences for that customer workload.</p><p>We initially had a naive solution to load all data into a single Worker. This included all server relationships, product configurations, and metrics for product and infrastructure health to compute constraints. Even in our proof of concept phase, we ran into problems with “out of memory” errors.</p>
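<p>In code, this constraint reduces to a set check. A hypothetical Python sketch over the pool data above (names are illustrative, not our actual implementation):</p>

```python
# Hypothetical constraint check: a pool is violated when every one of
# its data centers would be offline at the same time.
pools = {"customer-9876": {21, 45}}   # pool name -> data center IDs

def violated_pools(pools, proposed_offline):
    offline = set(proposed_offline)
    return sorted(name for name, dcs in pools.items() if dcs <= offline)

assert violated_pools(pools, [21]) == []                     # DC 45 still serves egress
assert violated_pools(pools, [21, 45]) == ["customer-9876"]  # alert: pool fully offline
```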
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1v4Q6bXsZLBXLbrbRrcW3o/00d291ef3db459e99ae9b620965b6bc7/image2.png" />
          </figure><p>We needed to be more cognizant of Workers’ <a href="https://developers.cloudflare.com/workers/platform/limits/"><u>platform limits</u></a>. This required loading only as much data as was absolutely necessary to process the constraint’s business logic. If a maintenance request for a router in Frankfurt, Germany, comes in, we almost certainly do not care what is happening in Australia since there is no overlap across regions. Thus, we should only load data for neighboring data centers in Germany. We needed a more efficient way to process relationships in our dataset.</p>
    <div>
      <h2>Graph processing on Workers</h2>
      <a href="#graph-processing-on-workers">
        
      </a>
    </div>
    <p>As we looked at our constraints, a pattern emerged where each constraint boiled down to two concepts: objects and associations. In graph theory, these components are known as vertices and edges, respectively. An object could be a network router and an association could be the list of Aegis pools in the data center that requires the router to be online. We took inspiration from Facebook’s <a href="https://research.facebook.com/publications/tao-facebooks-distributed-data-store-for-the-social-graph/"><u>TAO</u></a> research paper to establish a graph interface on top of our product and infrastructure data. The API looks like the following:</p>
            <pre><code>type ObjectID = string

interface MainTAOInterface&lt;TObject, TAssoc, TAssocType&gt; {
  object_get(id: ObjectID): Promise&lt;TObject | undefined&gt;

  assoc_get(id1: ObjectID, atype: TAssocType): AsyncIterable&lt;TAssoc&gt;

  assoc_count(id1: ObjectID, atype: TAssocType): Promise&lt;number&gt;
}</code></pre>
            <p>The core insight is that associations are typed. For example, a constraint would call the graph interface to retrieve Aegis product data.</p>
            <pre><code>async function constraint(c: AppContext, aegis: TAOAegisClient, datacenters: string[]): Promise&lt;Record&lt;string, PoolAnalysis&gt;&gt; {
  const datacenterEntries = await Promise.all(
    datacenters.map(async (dcID) =&gt; {
      const iter = aegis.assoc_get(c, dcID, AegisAssocType.DATACENTER_INSIDE_AEGIS_POOL)
      const pools: string[] = []
      for await (const assoc of iter) {
        pools.push(assoc.id2)
      }
      return [dcID, pools] as const
    }),
  )

  const datacenterToPools = new Map&lt;string, string[]&gt;(datacenterEntries)
  const uniquePools = new Set&lt;string&gt;()
  for (const pools of datacenterToPools.values()) {
    for (const pool of pools) uniquePools.add(pool)
  }

  const poolTotalsEntries = await Promise.all(
    [...uniquePools].map(async (pool) =&gt; {
      const total = await aegis.assoc_count(c, pool, AegisAssocType.AEGIS_POOL_CONTAINS_DATACENTER)
      return [pool, total] as const
    }),
  )

  const poolTotals = new Map&lt;string, number&gt;(poolTotalsEntries)
  const poolAnalysis: Record&lt;string, PoolAnalysis&gt; = {}
  for (const [dcID, pools] of datacenterToPools.entries()) {
    for (const pool of pools) {
      // Accumulate affected data centers instead of overwriting the
      // entry when several affected data centers share a pool.
      const analysis = poolAnalysis[pool] ?? {
        affectedDatacenters: new Set&lt;string&gt;(),
        totalDatacenters: poolTotals.get(pool) ?? 0,
      }
      analysis.affectedDatacenters.add(dcID)
      poolAnalysis[pool] = analysis
    }
  }

  return poolAnalysis
}</code></pre>
            <p>We use two association types in the code above:</p><ol><li><p>DATACENTER_INSIDE_AEGIS_POOL, which retrieves the Aegis customer pools that a data center resides in.</p></li><li><p>AEGIS_POOL_CONTAINS_DATACENTER, which retrieves the data centers an Aegis pool needs to serve traffic.</p></li></ol><p>The associations are inverted indices of one another. The access pattern is exactly the same as before, but now the graph implementation has much more control of how much data it queries. Before, we needed to load all Aegis pools into memory and filter inside constraint business logic. Now, we can directly fetch only the data that matters to the application.</p>
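<p>Maintaining both directions is cheap because each association type is just the same set of edges indexed from the opposite end. A hypothetical Python sketch:</p>

```python
# Illustrative sketch: the two association types from the post are
# inverted indices over one set of (data center, pool) edges.
# "customer-1234" is a made-up second pool for the example.
edges = [(21, "customer-9876"), (45, "customer-9876"), (21, "customer-1234")]

DATACENTER_INSIDE_AEGIS_POOL = {}     # dc -> pools (forward index)
AEGIS_POOL_CONTAINS_DATACENTER = {}   # pool -> dcs (inverse index)
for dc, pool in edges:
    DATACENTER_INSIDE_AEGIS_POOL.setdefault(dc, []).append(pool)
    AEGIS_POOL_CONTAINS_DATACENTER.setdefault(pool, []).append(dc)

assert DATACENTER_INSIDE_AEGIS_POOL[21] == ["customer-9876", "customer-1234"]
assert AEGIS_POOL_CONTAINS_DATACENTER["customer-9876"] == [21, 45]
```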
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4b68YLIHiOPt5EeyTUTeBt/5f624f0d0912e7dfd0e308a3427d194c/unnamed.png" />
          </figure><p>The interface is powerful because our graph implementation can improve performance behind the scenes without complicating the business logic. This lets us use the scalability of Workers and Cloudflare’s CDN to fetch data from our internal systems very quickly.</p>
    <div>
      <h2>Fetch pipeline</h2>
      <a href="#fetch-pipeline">
        
      </a>
    </div>
    <p>We switched to the new graph implementation, sending more targeted API requests. Response sizes dropped by 100x overnight as we went from loading a few massive responses to many tiny ones.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/71aDOicyippmUbj4ypXKw/73dacdf16ca0ac422efdfec9e86e9dbf/image5.png" />
          </figure><p>While this solved the issue of loading too much into memory, it created a subrequest problem: instead of a few large HTTP requests, we now made an order of magnitude more small ones. Overnight, we started consistently breaching subrequest limits.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/36KjfOU8xIuUkwF7QOlNkK/e2275a50ff1bef497cdb201c2d3a6249/image3.png" />
          </figure><p>In order to solve this problem, we built a smart middleware layer between our graph implementation and the <code>fetch</code> API.</p>
            <pre><code>export const fetchPipeline = new FetchPipeline()
  .use(requestDeduplicator())
  .use(lruCacher({
    maxItems: 100,
  }))
  .use(cdnCacher())
  .use(backoffRetryer({
    retries: 3,
    baseMs: 100,
    jitter: true,
  }))
  .handler(terminalFetch);</code></pre>
            <p>If you’re familiar with Go, you may have seen the <a href="https://pkg.go.dev/golang.org/x/sync/singleflight"><u>singleflight</u></a> package before. We took inspiration from this idea and the first middleware component in the fetch pipeline deduplicates inflight HTTP requests, so they all wait on the same Promise for data instead of producing duplicate requests in the same Worker. Next, we use a lightweight Least Recently Used (LRU) cache to internally cache requests that we have already seen before.</p><p>Once both of those are complete, we use Cloudflare’s <code>caches.default.match</code> function to cache all GET requests in the region that the Worker is running. Since we have multiple data sources with different performance characteristics, we choose time to live (TTL) values carefully. For example, real-time data is only cached for 1 minute. Relatively static infrastructure data could be cached for 1–24 hours depending on the type of data. Power management data might be changed manually and infrequently, so we can cache it for longer at the edge.</p><p>In addition to those layers, we have the standard exponential backoff, retries and jitter. This helps reduce wasted <code>fetch</code> calls where a downstream resource might be unavailable temporarily. By backing off slightly, we increase the chance that we fetch the next request successfully. Conversely, if the Worker sends requests constantly without backoff, it will easily breach the subrequest limit when the origin starts returning 5xx errors.</p><p>Putting it all together, we saw ~99% cache hit rate. <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cache-hit-ratio/"><u>Cache hit rate</u></a> is the percentage of HTTP requests served from Cloudflare’s fast cache memory (a "hit") versus slower requests to data sources running in our control plane (a "miss"), calculated as (hits / (hits + misses)). 
A high rate means better HTTP request performance and lower costs, because querying data from cache in our Worker is an order of magnitude faster than fetching from an origin server in a different region. After tuning settings for our in-memory and CDN caches, hit rates increased dramatically. Since much of our workload is real-time, we will never reach a 100% hit rate, as we must request fresh data at least once per minute.</p>
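<p>The deduplication layer deserves a closer look, since it is the piece that tames concurrent fan-out. This Python sketch (hypothetical names, not the actual pipeline code) shows the singleflight idea: concurrent requests for the same key share one in-flight task instead of each hitting the origin:</p>

```python
import asyncio

# Minimal sketch of singleflight-style request deduplication.
class Deduplicator:
    def __init__(self, fetch):
        self._fetch = fetch
        self._inflight = {}

    async def get(self, key):
        task = self._inflight.get(key)
        if task is None:
            task = asyncio.ensure_future(self._fetch(key))
            self._inflight[key] = task
            # Forget the task once done so later calls refetch fresh data.
            task.add_done_callback(lambda _: self._inflight.pop(key, None))
        return await task

origin_calls = 0

async def slow_fetch(key):
    global origin_calls
    origin_calls += 1
    await asyncio.sleep(0.01)         # simulated origin latency
    return f"data for {key}"

async def main():
    dedup = Deduplicator(slow_fetch)
    # Ten concurrent callers for the same key...
    return await asyncio.gather(*(dedup.get("/aegis/pools") for _ in range(10)))

results = asyncio.run(main())
assert results == ["data for /aegis/pools"] * 10
assert origin_calls == 1              # ...but only one origin request
```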
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1jifI33QpBkQPd7tE5Tapi/186a74b922faac3abe091b79f03d640b/image1.png" />
          </figure><p>We have talked about improving the fetching layer, but not about how we made origin HTTP requests faster. Our maintenance coordinator needs to react in real-time to network degradation and failure of machines in data centers. We use our distributed <a href="https://blog.cloudflare.com/how-cloudflare-runs-prometheus-at-scale/"><u>Prometheus</u></a> query engine, Thanos, to deliver performant metrics from the edge into the coordinator.</p>
    <div>
      <h2>Thanos in real-time</h2>
      <a href="#thanos-in-real-time">
        
      </a>
    </div>
    <p>To explain how our choice in using the graph processing interface affected our real-time queries, let’s walk through an example. In order to analyze the health of edge routers, we could send the following query:</p>
            <pre><code>sum by (instance) (network_snmp_interface_admin_status{instance=~"edge.*"})</code></pre>
            <p>Originally, we asked our Thanos service, which stores Prometheus metrics, for a list of each edge router’s current health status and would manually filter for routers relevant to the maintenance inside the Worker. This was suboptimal for many reasons. For example, Thanos returned multi-MB responses that it had to encode and the Worker had to decode. The Worker also needed to cache and parse these large HTTP responses only to filter out the majority of the data while processing a specific maintenance request. Since JavaScript is single-threaded and parsing JSON is CPU-bound, sending two large HTTP requests means that one is blocked waiting for the other to finish parsing.</p><p>Instead, we simply use the graph to find targeted relationships such as the interface links between edge and spine routers, denoted as <code>EDGE_ROUTER_NETWORK_CONNECTS_TO_SPINE</code>.</p>
            <pre><code>sum by (lldp_name) (network_snmp_interface_admin_status{instance=~"edge01.fra03", lldp_name=~"spine.*"})</code></pre>
            <p>The result is 1 KB on average instead of multiple MBs, approximately 1000x smaller. This also massively reduces the amount of CPU required inside the Worker because we offload most of the deserialization to Thanos. As we explained before, this means we need to make a higher number of these smaller fetch requests, but load balancers in front of Thanos can spread the requests evenly to increase throughput for this use case. </p><p>Our graph implementation and fetch pipeline successfully tamed the 'thundering herd' of thousands of tiny real-time requests. However, historical analysis presents a different I/O challenge. Instead of fetching small, specific relationships, we need to scan months of data to find conflicting maintenance windows. In the past, Thanos would issue a massive number of random reads to our object store, <a href="https://www.cloudflare.com/developer-platform/products/r2/">R2</a>. To solve this massive bandwidth penalty without losing performance, we adopted a new approach the Observability team developed internally this year.</p>
    <div>
      <h2>Historical data analysis</h2>
      <a href="#historical-data-analysis">
        
      </a>
    </div>
    <p>There are enough maintenance use cases that we must rely on historical data to tell us if our solution is accurate and will scale with the growth of Cloudflare’s network. We do not want to cause incidents, and we also want to avoid blocking proposed physical maintenance unnecessarily. In order to balance these two priorities, we can use time series data about maintenance events that happened two months or even a year ago to tell us how often a maintenance event is violating one of our constraints, e.g. edge router availability or Aegis. We blogged earlier this year about using Thanos to <a href="https://blog.cloudflare.com/safe-change-at-any-scale/"><u>automatically release and revert software</u></a> to the edge.</p><p>Thanos primarily fans out to Prometheus, but when Prometheus' retention is not enough to answer the query it has to download data from object storage — R2 in our case. Prometheus TSDB blocks were originally designed for local SSDs, relying on random access patterns that become a bottleneck when moved to object storage. When our scheduler needs to analyze months of historical maintenance data to identify conflicting constraints, random reads from object storage incur a massive I/O penalty. To solve this, we implemented a conversion layer that transforms these blocks into <a href="https://parquet.apache.org/"><u>Apache Parquet</u></a> files. Parquet is a columnar format native to big data analytics that organizes data by column rather than row, which — together with rich statistics — allows us to only fetch what we need.</p><p>Furthermore, since we are rewriting TSDB blocks into Parquet files, we can also store the data in a way that allows us to read the data in just a few big sequential chunks.</p>
            <pre><code>sum by (instance) (hmd:release_scopes:enabled{dc_id="45"})</code></pre>
            <p>In the example above we would choose the tuple “(__name__, dc_id)” as a primary sorting key so that metrics with the name “hmd:release_scopes:enabled” and the same value for “dc_id” get sorted close together.</p><p>Our Parquet gateway now issues precise R2 range requests to fetch only the specific columns relevant to the query. This reduces the payload from megabytes to kilobytes. Furthermore, because these file segments are immutable, we can aggressively cache them on the Cloudflare CDN.</p><p>This turns R2 into a low-latency query engine, allowing us to backtest complex maintenance scenarios against long-term trends instantly, avoiding the timeouts and high tail latency we saw with the original TSDB format. The graph below shows a recent load test, where Parquet reached up to 15x the P90 performance compared to the old system for the same query pattern.</p>
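<p>Sorting by that primary key is what turns a label-filtered query into a few sequential reads. A Python sketch of the idea (illustrative data, not our actual storage code): once series are sorted by <code>(__name__, dc_id)</code>, every row matching the query forms one contiguous run that can be fetched as a single range:</p>

```python
import bisect

# Illustrative: series sorted by (metric name, dc_id). All rows
# matching {__name__="hmd:release_scopes:enabled", dc_id="45"} end up
# adjacent, i.e. readable with one sequential range request.
rows = sorted([
    ("node_cpu_seconds_total", "21", "..."),
    ("hmd:release_scopes:enabled", "45", "..."),
    ("hmd:release_scopes:enabled", "21", "..."),
    ("hmd:release_scopes:enabled", "45", "..."),
    ("node_cpu_seconds_total", "45", "..."),
])

keys = [(name, dc) for name, dc, _ in rows]
want = ("hmd:release_scopes:enabled", "45")
lo = bisect.bisect_left(keys, want)
hi = bisect.bisect_right(keys, want)

# Every matching row sits inside rows[lo:hi]; nothing outside matches.
assert all(k == want for k in keys[lo:hi])
assert sum(k == want for k in keys) == hi - lo
```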
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6lVj6W4W4MMUy6cEsDpk5G/21614b7ac003a86cb5162a2ba75f4c42/image8.png" />
          </figure><p>To get a deeper understanding of how the Parquet implementation works, you can watch this talk at PromCon EU 2025, <a href="https://www.youtube.com/watch?v=wDN2w2xN6bA&amp;list=PLoz-W_CUquUlHOg314_YttjHL0iGTdE3O&amp;index=16"><u>Beyond TSDB: Unlocking Prometheus with Parquet for Modern Scale</u></a>.</p>
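<p>The effect of that sorting key can be sketched in a few lines of Python. This is an illustration only, not our conversion code; the row values and the key (__name__, dc_id) mirror the example query above:</p>
<pre><code># Illustrative sketch: sorting rows by (__name__, dc_id) turns a
# label-matching query into one contiguous run of rows, i.e. one
# sequential range read instead of scattered random reads.
rows = [
    ("hmd:release_scopes:enabled", "45", 1.0),
    ("node_cpu_seconds_total", "12", 2.0),
    ("hmd:release_scopes:enabled", "12", 3.0),
    ("node_cpu_seconds_total", "45", 4.0),
    ("hmd:release_scopes:enabled", "45", 5.0),
]

rows.sort(key=lambda r: (r[0], r[1]))  # primary sorting key

# All rows matching {__name__="hmd:release_scopes:enabled", dc_id="45"}
# now occupy adjacent indices, so one range request covers them all.
matches = [i for i, r in enumerate(rows)
           if r[0] == "hmd:release_scopes:enabled" and r[1] == "45"]
print(matches)  # [1, 2] -- contiguous</code></pre>
<p>The same idea, applied to Parquet row groups and column chunks, is what lets the gateway issue a handful of precise R2 range requests per query.</p>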
    <div>
      <h2>Building for scale</h2>
      <a href="#building-for-scale">
        
      </a>
    </div>
    <p>By leveraging Cloudflare Workers, we moved from a system that ran out of memory to one that intelligently caches data and uses efficient observability tooling to analyze product and infrastructure data in real time. We built a maintenance scheduler that balances network growth with product performance.</p><p>But “balance” is a moving target.</p><p>Every day, we add more hardware around the world, and the logic required to maintain it without disrupting customer traffic gets exponentially harder with more products and types of maintenance operations. We’ve worked through the first set of challenges, but now we’re staring down more subtle, complex ones that only appear at this massive scale.</p><p>We need engineers who aren't afraid of hard problems. Join our <a href="https://www.cloudflare.com/careers/jobs/?department=Infrastructure"><u>Infrastructure team</u></a> and come build with us.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Prometheus]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <guid isPermaLink="false">5pdspiP2m71MeIoVL8wv1i</guid>
            <dc:creator>Kevin Deems</dc:creator>
            <dc:creator>Michael Hoffmann</dc:creator>
        </item>
        <item>
            <title><![CDATA[Is this thing on? Using OpenBMC and ACPI power states for reliable server boot]]></title>
            <link>https://blog.cloudflare.com/how-we-use-openbmc-and-acpi-power-states-to-monitor-the-state-of-our-servers/</link>
            <pubDate>Tue, 22 Oct 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare’s global fleet benefits from being managed by open source firmware for the Baseboard Management Controller (BMC), OpenBMC. This has come with various challenges, some of which we discuss here with an explanation of how the open source nature of the firmware for the BMC enabled us to fix the issues and maintain a more stable fleet. ]]></description>
            <content:encoded><![CDATA[ 
    <div>
      <h2>Introduction</h2>
      <a href="#introduction">
        
      </a>
    </div>
    <p>At Cloudflare, we provide a range of services through our global network of servers, located in <a href="https://www.cloudflare.com/network/"><u>330 cities</u></a> worldwide. When you interact with our long-standing <a href="https://www.cloudflare.com/application-services/products/"><u>application services</u></a>, or newer services like <a href="https://ai.cloudflare.com/?_gl=1*1vedsr*_gcl_au*NzE0Njc1NTIwLjE3MTkzMzEyODc.*_ga*NTgyMWU1Y2MtYTI2NS00MDA3LTlhZDktYWUxN2U5MDkzYjY3*_ga_SQCRB0TXZW*MTcyMTIzMzM5NC4xNS4xLjE3MjEyMzM1MTguMC4wLjA."><u>Workers AI</u></a>, you’re in contact with one of the thousands of servers in our fleet that support those services.</p><p>The servers that provide Cloudflare services are managed by a Baseboard Management Controller (BMC). The BMC is a special-purpose processor, different from the Central Processing Unit (CPU) of a server, whose sole purpose is to keep the server operating smoothly.</p><p>Regardless of the server vendor, each server has a BMC. The BMC runs independently of the CPU and has its own embedded operating system, usually referred to as <a href="https://en.wikipedia.org/wiki/Firmware"><u>firmware</u></a>. At Cloudflare, we customize and deploy a server-specific version of the BMC firmware, based on the <a href="https://www.openbmc.org/"><u>Linux Foundation Project for BMCs, OpenBMC</u></a>. OpenBMC is an open source firmware stack designed to work across a variety of systems, including enterprise, telco, and cloud-scale data centers. The open source nature of OpenBMC gives us greater flexibility and ownership of this critical server subsystem than closed, proprietary firmware would. It gives us transparency (which is important to us as a security company) and lets us develop custom features and fixes for the BMC firmware we run on our entire fleet more quickly.</p><p>In this blog post, we describe how we customized and extended the OpenBMC firmware to better monitor our servers’ boot-up processes, make servers start more reliably, and allow better diagnostics when an issue does happen during boot-up.</p>
    <div>
      <h2>Server subsystems</h2>
      <a href="#server-subsystems">
        
      </a>
    </div>
    <p>Server systems consist of multiple complex subsystems that include the processors, memory, storage, networking, power supply, cooling, etc. When booting up the host of a server system, the power state of each subsystem of the server is changed in an asynchronous manner. This is done so that subsystems can initialize simultaneously, thereby improving the efficiency of the boot process. Though started asynchronously, these subsystems may interact with each other at different points of the boot sequence and rely on handshakes and synchronization to exchange information. For example, during boot-up, the <a href="https://en.wikipedia.org/wiki/UEFI"><u>UEFI (Unified Extensible Firmware Interface)</u></a>, often referred to as the <a href="https://en.wikipedia.org/wiki/BIOS"><u>BIOS</u></a>, configures the motherboard in a phase known as the Platform Initialization (PI) phase, during which the UEFI collects information from subsystems such as the CPUs, memory, etc. to initialize the motherboard with the right settings.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6csPNEksLXsGgt3dq5xZ0S/3236656dbc01f3085bada5af853c3516/image1.png" />
          </figure><p><sup><i>Figure 1: Server Boot Process</i></sup></p><p>When the power state of the subsystems, handshakes, and synchronization are not properly managed, there may be race conditions that would result in failures during the boot process of the host. Cloudflare experienced some of these boot-related failures while rolling out open source firmware (<a href="https://en.wikipedia.org/wiki/OpenBMC"><u>OpenBMC</u></a>) to the Baseboard Management Controllers (BMCs) of our servers. </p>
    <div>
      <h2>Baseboard Management Controller (BMC) as a manager of the host</h2>
      <a href="#baseboard-management-controller-bmc-as-a-manager-of-the-host">
        
      </a>
    </div>
    <p>A BMC is a specialized microprocessor that is attached to the board of a host (server) to assist with remote management capabilities of the host. Servers usually sit in data centers and are often far away from the administrators, and this creates a challenge to maintain them at scale. This is where a BMC comes in, as the BMC serves as the interface that gives administrators the ability to securely and remotely access the servers and carry out management functions. The BMC does this by exposing various interfaces, including <a href="https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface"><u>Intelligent Platform Management Interface (IPMI)</u></a> and <a href="https://www.dmtf.org/standards/redfish"><u>Redfish</u></a>, for distributed management. In addition, the BMC receives data from various sensors/devices (e.g. temperature, power supply) connected to the server, and also the operating parameters of the server, such as the operating system state, and publishes the values on its IPMI and Redfish interfaces.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33dNmfyjqrbAGvcbZLTa0h/db3e6b79b1010081916ee6498b10c297/image2.png" />
          </figure><p><sup><i>Figure 2: Block diagram of BMC in a server system.</i></sup></p><p>At Cloudflare, we use the <a href="https://github.com/openbmc/openbmc"><u>OpenBMC</u></a> project for our Baseboard Management Controller (BMC).</p><p>Below are examples of management functions carried out on a server through the BMC. The interactions in the examples are done over <a href="https://github.com/ipmitool/ipmitool/wiki"><u>ipmitool</u></a>, a command line utility for interacting with systems that support IPMI.</p>
            <pre><code># Check the sensor readings of a server remotely (i.e. over a network)
$  ipmitool &lt;some authentication&gt; &lt;bmc ip&gt; sdr
PSU0_CURRENT_IN  | 0.47 Amps         | ok
PSU0_CURRENT_OUT | 6 Amps            | ok
PSU0_FAN_0       | 6962 RPM          | ok
SYS_FAN          | 13034 RPM         | ok
SYS_FAN1         | 11172 RPM         | ok
SYS_FAN2         | 11760 RPM         | ok
CPU_CORE_VR_POUT | 9.03 Watts        | ok
CPU_POWER        | 76.95 Watts       | ok
CPU_SOC_VR_POUT  | 12.98 Watts       | ok
DIMM_1_VR_POUT   | 29.03 Watts       | ok
DIMM_2_VR_POUT   | 27.97 Watts       | ok
CPU_CORE_MOSFET  | 40 degrees C      | ok
CPU_TEMP         | 50 degrees C      | ok
DIMM_MOSFET_1    | 36 degrees C      | ok
DIMM_MOSFET_2    | 39 degrees C      | ok
DIMM_TEMP_A1     | 34 degrees C      | ok
DIMM_TEMP_B1     | 33 degrees C      | ok

…

# check the power status of a server remotely (i.e. over a network)
ipmitool &lt;some authentication&gt; &lt;bmc ip&gt; power status
Chassis Power is off

# power on the server
ipmitool &lt;some authentication&gt; &lt;bmc ip&gt; power on
Chassis Power Control: On</code></pre>
            <p>Switching to OpenBMC firmware for our BMCs gives us more control over the software that powers our infrastructure. This has given us more flexibility, more customization, and an overall more uniform experience for managing our servers. Since OpenBMC is open source, we also leverage community fixes while upstreaming some of our own. Some of the advantages we have experienced with OpenBMC include a faster turnaround time for fixing issues, <a href="https://blog.cloudflare.com/de-de/thermal-design-supporting-gen-12-hardware-cool-efficient-and-reliable/"><u>optimizations around thermal cooling</u></a>, <a href="https://blog.cloudflare.com/gen-12-servers/"><u>increased power efficiency</u></a> and <a href="https://blog.cloudflare.com/how-we-used-openbmc-to-support-ai-inference-on-gpus-around-the-world/"><u>supporting AI inference</u></a>.</p><p>While developing Cloudflare’s OpenBMC firmware, however, we ran into a number of boot problems.</p><p><b><i>Host not booting:</i></b> When we sent a request over IPMI for a host to power on (as in the example above), ipmitool would report the power status of the host as ON, but we would see no power going into the CPU and no activity on the CPU. ipmitool was correct that power was going into the chassis, but it told us nothing about the power state of the rest of the server, and we initially (and wrongly) assumed that since the chassis power was on, the rest of the server components should be ON as well. The <a href="https://documents.uow.edu.au/~blane/netapp/ontap/sysadmin/monitoring/concept/c_oc_mntr_bmc-sys-event-log.html"><u>System Event Log (SEL)</u></a>, which records platform-specific events, was not giving us any useful information beyond indicating that the server was in a soft-off state (powered off), a working state (operating system is loading and running), or that a “System Restart” of the host was initiated.</p>
            <pre><code># System Event Logs (SEL) showing the various power states of the server
$ ipmitool sel elist | tail -n3
  4d |  Pre-Init  |0000011021| System ACPI Power State ACPI_STATUS | S5_G2: soft-off | Asserted
  4e |  Pre-Init  |0000011022| System ACPI Power State ACPI_STATUS | S0_G0: working | Asserted
  4f |  Pre-Init  |0000011023| System Boot Initiated RESTART_CAUSE | System Restart | Asserted</code></pre>
            <p>In the System Event Logs shown above, ACPI is the acronym for Advanced Configuration and Power Interface, a standard for power management on computing systems. In the ACPI soft-off state, the host is powered off (the motherboard is on standby power, but the CPU/host isn’t powered on); according to the <a href="https://uefi.org/sites/default/files/resources/ACPI_Spec_6_5_Aug29.pdf"><u>ACPI specifications</u></a>, this state is called S5_G2. (These states are discussed in more detail below.) In the ACPI working state, the host is booted and running, known in the ACPI specifications as state S0_G0; in our case this report happened to be false, since the host was not actually running. The third row indicates that the restart was caused by a System Restart. Most of the boot-related SEL events are sent from the UEFI to the BMC. The UEFI has been something of a black box to us, as we rely on our original equipment manufacturers (OEMs) to develop the UEFI firmware for us, and for the generation of servers with this issue, the UEFI firmware did not implement sending the boot progress of the host to the BMC.</p><p>One discrepancy we observed was the difference between the chassis power status and the power going into the CPU, which we read with a sensor we call CPU_POWER.</p>
            <pre><code># Check power status
$ ipmitool &lt;some authentication&gt; &lt;bmc ip&gt;  power status
Chassis Power is on
</code></pre>
            <p>However, checking the power into the CPU shows that the CPU was not receiving any power.</p>
            <pre><code># Check power going into the CPU
$ ipmitool &lt;some authentication&gt; &lt;bmc ip&gt;  sdr | grep CPU_POWER    
CPU_POWER        | 0 Watts           | ok</code></pre>
            <p>The CPU_POWER reading of 0 watts contradicts all the previous information that the host was powered up and working, when the host was actually completely shut down.</p><p><b><i>Missing Memory Modules:</i></b> Our servers would randomly boot up with less memory than expected. Computers can boot up with less memory than installed due to a number of problems, such as a loose connection, a hardware problem, or faulty memory. In our case, it was none of the usual suspects: both the BMC and the UEFI were trying to read from the memory modules simultaneously, leading to access contention. Memory modules usually contain a <a href="https://en.wikipedia.org/wiki/Serial_presence_detect"><u>Serial Presence Detect (SPD)</u></a>, which is used by the UEFI to dynamically detect the memory module. The SPD is usually accessed over an <a href="https://learn.sparkfun.com/tutorials/i2c/all"><u>inter-integrated circuit (i2c)</u></a> bus, a low-speed, two-wire protocol for devices to talk to each other. The BMC also reads the temperature of the memory modules via i2c. When the server is powered on, the UEFI initializes, among other hardware, each memory module it can detect via that module’s Serial Presence Detect (SPD); at the same time, the BMC could be trying to read the temperature of the memory module over the same i2c bus. This simultaneous access denies one of the parties the bus. When the UEFI is denied access to the SPD, it concludes that the memory module is not available and skips over it. Below is an example of the related i2c-bus contention logs we saw in the <a href="https://www.freedesktop.org/software/systemd/man/latest/journalctl.html"><u>journal</u></a> of the BMC while the host was booting.</p>
            <pre><code>kernel: aspeed-i2c-bus 1e78a300.i2c-bus: irq handled != irq. expected 0x00000021, but was 0x00000020</code></pre>
            <p>The above logs indicate that the i2c address 1e78a300 (which happens to be connected to the serial presence detect of the memory modules) could not properly handle a signal, known as an interrupt request (irq). When this scenario plays out on the UEFI, the UEFI is unable to detect the memory module.</p>
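<p>Conceptually, this is a textbook shared-bus race. The toy Python model below (illustrative only, not BMC code) captures the shape of it: a lock stands in for exclusive bus ownership, and a flag stands in for the host state the BMC can observe, so the BMC defers its temperature read instead of colliding with the UEFI’s SPD reads:</p>
<pre><code>import threading

bus_lock = threading.Lock()           # exclusive ownership of the i2c bus
uefi_enumerating = threading.Event()  # "UEFI is reading SPDs" flag

detected_dimms = []

def uefi_enumerate(dimms):
    # UEFI reads each DIMM's SPD; it must not lose the bus mid-read.
    uefi_enumerating.set()
    for dimm in dimms:
        with bus_lock:
            detected_dimms.append(dimm)  # stand-in for a successful SPD read
    uefi_enumerating.clear()

def bmc_poll_temperature():
    # BMC defers its read while the UEFI owns the bus.
    if uefi_enumerating.is_set():
        return None  # deferred; fans fall back to a fixed speed
    with bus_lock:
        return 34  # stand-in for a DIMM temperature read, degrees C

t = threading.Thread(target=uefi_enumerate, args=(["DIMM_A1", "DIMM_B1"],))
t.start()
t.join()
print(detected_dimms)          # ['DIMM_A1', 'DIMM_B1'] -- no module skipped
print(bmc_poll_temperature())  # 34</code></pre>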
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Fe8wb6xqwXkanb8iPv8O2/eaecfe0474576a00cdc25bfeb6fba7a2/image4.png" />
          </figure><p><sup><i>Figure 3: I2C diagram showing I2C interconnection of the server’s memory modules (also known as DIMMs) with the BMC </i></sup></p><p><a href="https://www.techtarget.com/searchstorage/definition/DIMM"><u>DIMM</u></a> in Figure 3 refers to <a href="https://www.techtarget.com/searchstorage/definition/DIMM"><u>Dual Inline Memory Module</u></a>, which is the type of memory module used in servers.</p><p><b><i>Thermal telemetry:</i></b> During the boot-up process of some of our servers, some temperature devices, such as the temperature sensors of the memory modules, would show up as failed, causing some of the fans to enter a fail-safe <a href="https://en.wikipedia.org/wiki/Pulse-width_modulation"><u>Pulse Width Modulation (PWM)</u></a> mode. <a href="https://en.wikipedia.org/wiki/Pulse-width_modulation"><u>PWM</u></a> is a technique for controlling the power delivered to electronic devices by adjusting the duty cycle of a fixed-frequency signal, i.e. the fraction of each period the signal spends high. It is used in this case to control fan speed. When a fan enters fail-safe mode, its PWM duty cycle is set to a preset value, irrespective of what the optimized PWM setting for the fan should be, and this can negatively affect the cooling of the server and its power consumption.</p>
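<p>Incidentally, the first symptom above, chassis “on” while CPU_POWER reads 0 watts, is easy to flag automatically once you know to look for it. A minimal sketch of such a cross-check (hypothetical monitoring helper; the CPU_POWER sensor name is specific to our platform, and the inputs are the two ipmitool outputs shown earlier):</p>
<pre><code>def host_really_on(power_status, sdr_output):
    # Chassis power alone is not trustworthy: cross-check it against
    # the CPU_POWER sensor reading from `ipmitool ... sdr`.
    chassis_on = "Chassis Power is on" in power_status
    cpu_watts = 0.0
    for line in sdr_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) == 3 and fields[0] == "CPU_POWER":
            cpu_watts = float(fields[1].split()[0])
    return chassis_on and cpu_watts != 0.0

# The discrepancy from the logs above: chassis on, CPU drawing 0 W.
print(host_really_on("Chassis Power is on",
                     "CPU_POWER        | 0 Watts           | ok"))  # False</code></pre>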
    <div>
      <h2>Implementing host ACPI state on OpenBMC</h2>
      <a href="#implementing-host-acpi-state-on-openbmc">
        
      </a>
    </div>
    <p>In the process of studying the issues we faced relating to the boot-up process of the host, we learned how the power state of the subsystems within the chassis changes. Part of our learnings led us to investigate the Advanced Configuration and Power Interface (ACPI) and how the ACPI state of the host changed during the boot process.</p><p>Advanced Configuration and Power Interface (ACPI) is an open industry specification for power management used in desktop, mobile, workstation, and server systems. The <a href="https://uefi.org/sites/default/files/resources/ACPI_Spec_6_5_Aug29.pdf"><u>ACPI Specification</u></a> replaces previous power management methodologies such as <a href="https://en.wikipedia.org/wiki/Advanced_Power_Management"><u>Advanced Power Management (APM)</u></a>. ACPI provides the advantages of:</p><ul><li><p>Allowing OS-directed power management (OSPM).</p></li><li><p>Having a standardized and robust interface for power management.</p></li><li><p>Sending system-level events such as when the server power/sleep buttons are pressed </p></li><li><p>Hardware and software support, such as a real-time clock (RTC) to schedule the server to wake up from sleep or to reduce the functionality of the CPU based on RTC ticks when there is a loss of power.</p></li></ul><p>From the perspective of power management, ACPI enables an OS-driven conservation of energy by transitioning components which are not in active use to a lower power state, thereby reducing power consumption and contributing to more efficient power management.</p><p>The ACPI Specification defines four global “Gx” states, six sleeping “Sx” states, and four “Dx” device power states. These states are defined as follows:</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>Gx</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Name</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Sx</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Description</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>G0</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Working</span></span></p>
                    </td>
                    <td>
                        <p><span><span>S0</span></span></p>
                    </td>
                    <td>
                        <p><span><span>The run state. In this state the machine is fully running.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>G1</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Sleeping</span></span></p>
                    </td>
                    <td>
                        <p><span><span>S1</span></span></p>
                    </td>
                    <td>
                        <p><span><span>A sleep state where the CPU will suspend activity but retain its contexts.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>S2</span></span></p>
                    </td>
                    <td>
                        <p><span><span>A sleep state where memory contexts are held, but CPU contexts are lost. CPU re-initialization is done by firmware.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>S3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>A logically deeper sleep state than S2 where CPU re-initialization is done by device. Equates to Suspend to RAM.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>S4</span></span></p>
                    </td>
                    <td>
                        <p><span><span>A logically deeper sleep state than S3 in which DRAM context is not maintained and contexts are saved to disk. Can be implemented by either OS or firmware. </span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>G2</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Soft off but PSU still supplies power</span></span></p>
                    </td>
                    <td>
                        <p><span><span>S5</span></span></p>
                    </td>
                    <td>
                        <p><span><span>The soft off state. All activity stops, and all contexts are lost. The Complex Programmable Logic Device (CPLD), which is responsible for the power-up and power-down sequences of components such as the CPU and BMC, remains on standby power, but the CPU/host is off.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>G3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Mechanical off</span></span></p>
                    </td>
                    <td> </td>
                    <td>
                        <p><span><span>PSU does not supply power. The system is safe for disassembly.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Dx</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Name</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Description</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>D0</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Fully powered on</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Hardware device is fully functional and operational </span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>D1</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Hardware device is partially powered down</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Reduced functionality and can be quickly powered back to D0</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>D2</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Hardware device is in a deeper low-power state than D1</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Much more limited functionality and can only be slowly powered back to D0.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>D3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Hardware device is significantly powered down or off</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Device is inactive with perhaps only the ability to be powered back on</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p>The states that matter to us are:</p><ul><li><p><b>S0_G0_D0:</b> often referred to as the working state. Here we know our host system is running just fine.</p></li><li><p><b>S2_D2: </b>Memory contexts are held, but CPU context is lost. We usually use this state to know when the host’s UEFI is performing platform firmware initialization.</p></li><li><p><b>S5_G2:</b> Often referred to as the soft off state. Here we still have power going into the chassis, however, processor and DRAM context are not maintained, and the operating system power management of the host has no context.</p></li></ul><p>Since the issues we were experiencing were related to the power state changes of the host — when we asked the host to reboot or power on — we needed a way to track the various power state changes of the host as it went from power off to a complete working state. This would give us better management capabilities over the devices that were on the same power domain of the host during the boot process. Fortunately, the OpenBMC community already implemented an <a href="https://github.com/openbmc/google-misc/tree/master/subprojects/acpi-power-state-daemon"><u>ACPI daemon</u></a>, which we extended to serve our needs. We added an ACPI S2_D2 power state, in which memory contexts are held, but CPU context is lost, to the ACPI daemon running on the BMC to enable us to know when the host’s UEFI is performing firmware initialization, and also set up various management tasks for the different ACPI power states.</p><p>An example of a power management task we carry out using the S0_G0_D0 state is to re-export our Voltage Regulator (VR) sensors on S0_G0_D0 state, as shown with the service file below:</p>
            <pre><code>cat /lib/systemd/system/Re-export-VR-device.service 
[Unit]
Description=RE Export VR Device Process
Wants=xyz.openbmc_project.EntityManager.service
After=xyz.openbmc_project.EntityManager.service
Conflicts=host-s2-state.target

[Service]
Type=simple
ExecStart=/bin/bash -c 'set -a &amp;&amp; source /usr/bin/Re-export-VR-device.sh on'
SyslogIdentifier=Re-export-VR-device.service

[Install]
WantedBy=host-s0-state.target
</code></pre>
            <p>Having set this up, OpenBMC has a command handler (ipmiSetACPIState) in <a href="https://github.com/openbmc/phosphor-host-ipmid/tree/master"><u>phosphor-host-ipmid</u></a> that is responsible for setting the ACPI state of the host on the BMC. The host calls this handler using the standard IPMI command with NetFn=0x06 and Cmd=0x06.</p><p>In the event of an immediate power cycle (i.e. a host reboot without an operating system shutdown), the host is unable to send its S5_G2 state to the BMC. For this case, we created a patch to OpenBMC’s <a href="https://github.com/openbmc/x86-power-control/tree/master"><u>x86-power-control</u></a> to let the BMC become aware that the host has entered the ACPI S5_G2 state (i.e. soft-off). When the host comes out of the powered-off state, the UEFI performs the Power On Self Test (POST) and sends S2_D2 to the BMC, and after the UEFI has loaded the OS on the host, it notifies the BMC by sending the ACPI S0_G0_D0 state.</p>
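<p>For the curious, the request body of that command is only two bytes. The sketch below encodes it according to our reading of the IPMI v2.0 specification (Set ACPI Power State, NetFn 0x06, Cmd 0x06): each byte carries a power state in its low bits, with bit 7 set to mean “set this state”. The numeric state values here are our interpretation of the spec, so verify against it before relying on them:</p>
<pre><code># Sketch of the IPMI "Set ACPI Power State" request data
# (NetFn=0x06, Cmd=0x06), per our reading of the IPMI v2.0 spec.
SYSTEM_STATES = {"S0_G0": 0x00, "S5_G2": 0x05}  # working, soft-off
DEVICE_STATES = {"D0": 0x00, "D3": 0x03}

SET_STATE = 0x80  # bit 7: "set system/device power state" enabled

def set_acpi_power_state(system, device):
    # Two data bytes: system power state, then device power state.
    return bytes([SET_STATE | SYSTEM_STATES[system],
                  SET_STATE | DEVICE_STATES[device]])

# e.g. what the host sends once the OS is up and running:
print(set_acpi_power_state("S0_G0", "D0").hex())  # 8080</code></pre>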
    <div>
      <h2>Fixing the issues</h2>
      <a href="#fixing-the-issues">
        
      </a>
    </div>
    <p>Going back to the boot-up issues we faced, we discovered that they were mostly caused by devices in the same power domain as the CPU interfering with the UEFI/platform firmware initialization phase. Below is a high-level description of the fixes we applied.</p><p><b><i>Servers not booting</i></b><b>:</b> After identifying the devices that were interfering with the POST stage of firmware initialization, we used the host ACPI state to control when we set the appropriate power mode for those devices, so that they do not cause POST to fail.</p><p><b><i>Memory modules missing</i></b><b>:</b> During the boot-up process, memory modules (DIMMs) are powered and initialized in the S2_D2 ACPI state. During this initialization, the UEFI firmware sends read commands to the Serial Presence Detect (SPD) on each DIMM to retrieve information for DIMM enumeration. At the same time, the BMC could be sending commands to read DIMM temperature sensors. This can cause SMBus collisions, which can make either the DIMM temperature reading or the UEFI DIMM enumeration fail. The latter case would cause the system to boot up with reduced DIMM capacity, which could be mistaken for a failing DIMM. After we discovered the race condition, we stopped the BMC from reading the DIMM temperature sensors during the S2_D2 ACPI state and set a fixed speed for the corresponding fans. This solution allows our UEFI to retrieve all the necessary DIMM information for enumeration, and our servers now boot up with the correct amount of memory.</p><p><b><i>Thermal telemetry:</i></b> In the S0_G0 power state, when sensors are not reporting values back to the BMC, the BMC assumes that devices may be overheating and puts the fan controller into a fail-safe mode where fan speeds are ramped up to maximum. However, in the S5_G2 state, some thermal sensors, such as the CPU and NIC temperature sensors, are not powered and therefore not available. Our solution is to mark these thermal sensors as non-functional in their exported configuration while in the S5_G2 state and during the transition from S5_G2 to S2_D2. Marking the affected devices as non-functional in their configuration, instead of waiting for thermal sensor read commands to error out, prevents the controller from entering fail-safe mode.</p>
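<p>The thermal fix amounts to a small change in the fan controller’s decision logic: a sensor that is marked non-functional for the current ACPI state is treated as expectedly absent rather than failed. An illustrative sketch of that logic (hypothetical Python, not the actual OpenBMC fan control code; sensor names and PWM values are made up):</p>
<pre><code># Only sensors expected to be functional in the current ACPI state
# may trigger fail-safe mode; unpowered sensors are ignored.
FUNCTIONAL_IN_STATE = {
    "S0_G0": {"CPU_TEMP", "NIC_TEMP", "DIMM_TEMP_A1"},
    "S5_G2": set(),  # host powered off: these sensors are unpowered
}

FAILSAFE_PWM = 100  # preset maximum fan speed, percent
NORMAL_PWM = 40     # stand-in for the thermal loop's computed speed

def fan_pwm(acpi_state, readings):
    expected = FUNCTIONAL_IN_STATE[acpi_state]
    # A missing reading is only a failure if the sensor should be alive.
    failed = [s for s in expected if readings.get(s) is None]
    return FAILSAFE_PWM if failed else NORMAL_PWM

# In S5_G2 the unpowered sensors no longer force fail-safe mode:
print(fan_pwm("S5_G2", {"CPU_TEMP": None, "NIC_TEMP": None}))  # 40
# In S0_G0 a genuinely silent sensor still ramps the fans up:
print(fan_pwm("S0_G0", {"CPU_TEMP": None, "NIC_TEMP": 45,
                        "DIMM_TEMP_A1": 34}))  # 100</code></pre>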
    <div>
      <h2>Moving forward</h2>
      <a href="#moving-forward">
        
      </a>
    </div>
    <p>Aside from resolving issues, implementing ACPI power states in our BMC firmware has brought other benefits. One example is our automated firmware regression testing: various parts of our tests require rebooting or power cycling servers over a hundred times, during which we monitor the ACPI power state changes of our servers rather than relying on a simple boolean (running or not, pingable or not) to assert their status.</p><p>It has also given us the opportunity to learn more about the complex subsystems in a server system and the various power modes of those subsystems. This is an area we are still actively learning about as we look to further optimize the boot sequence of our servers.</p><p>Over time, implementing ACPI states has helped us achieve the following:</p><ul><li><p>All components are enabled by the end of the boot sequence,</p></li><li><p>BIOS and BMC are able to retrieve component information,</p></li><li><p>The BMC is aware when thermal sensors are in a non-functional state.</p></li></ul><p>For better observability of the boot progress and “last state” of our systems, we have also started adding the BootProgress object of the <a href="https://redfish.dmtf.org/schemas/v1/ComputerSystem.v1_13_0.json"><u>Redfish ComputerSystem Schema</u></a> to our systems. This gives us pre-operating system (OS) boot observability and an easier debugging starting point when the UEFI has issues during server platform initialization (such as when the server isn’t coming on).</p><p>Every day, Cloudflare’s OpenBMC team, made up of folks from different embedded backgrounds, learns about, experiments with, and deploys OpenBMC across our global fleet. This has been made possible by the OpenBMC community’s contributions (as well as upstreaming some of our own) and by our interactions with our various vendors. It gives us ownership of, and responsibility for, the firmware that powers the BMCs managing our servers, and the opportunity to make our systems more reliable. If you are thinking of embracing open-source firmware in your BMC, we hope this blog post, written by a team that started deploying OpenBMC less than 18 months ago, has inspired you to give it a try.</p><p>For those interested in making the jump to open-source firmware, check it out <a href="https://github.com/openbmc/openbmc"><u>here</u></a>!</p> ]]></content:encoded>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[OpenBMC]]></category>
            <category><![CDATA[Servers]]></category>
            <category><![CDATA[Firmware]]></category>
            <guid isPermaLink="false">2hySj1JFTXmlofjA6IRijm</guid>
            <dc:creator>Nnamdi Ajah</dc:creator>
            <dc:creator>Ryan Chow</dc:creator>
            <dc:creator>Giovanni Pereira Zantedeschi</dc:creator>
        </item>
        <item>
            <title><![CDATA[Leveraging Kubernetes virtual machines at Cloudflare with KubeVirt]]></title>
            <link>https://blog.cloudflare.com/leveraging-kubernetes-virtual-machines-with-kubevirt/</link>
            <pubDate>Tue, 08 Oct 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ The Kubernetes team runs several multi-tenant clusters across Cloudflare’s core data centers. When multi-tenant cluster isolation is too limiting for an application, we use KubeVirt. KubeVirt is a cloud-native solution that enables our developers to run virtual machines alongside containers. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare runs several <a href="https://kubernetes.io/docs/concepts/security/multi-tenancy/"><u>multi-tenant</u></a> <a href="https://kubernetes.io/"><u>Kubernetes</u></a> clusters across our core data centers. These general-purpose clusters run on bare metal and power our <a href="https://www.cloudflare.com/learning/network-layer/what-is-the-control-plane/"><u>control plane</u></a>, analytics, and various engineering tools such as build infrastructure and continuous integration.</p><p>Kubernetes is a container orchestration platform that enables software engineers to deploy containerized applications to a cluster of machines, allowing teams to build highly available software on a scalable and resilient platform.</p><p>In this blog post, we discuss our Kubernetes architecture, why we needed virtualization, and how we’re using it today.</p>
    <div>
      <h2>Multi-tenant clusters</h2>
      <a href="#multi-tenant-clusters">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/cloud/what-is-multitenancy/"><u>Multi-tenancy</u></a> is a concept where one system can share its resources among a wide range of customers. This model allows us to build and manage a small number of general purpose Kubernetes clusters for our internal application teams. Keeping the number of clusters small reduces our operational toil. This model shrinks costs and increases computational efficiency by sharing hardware. Multi-tenancy also allows us to scale more efficiently. Scaling is done at either a cluster or application level. Cluster operators scale the platform by adding more hardware. Teams scale their applications by updating their Kubernetes manifests. They can scale <a href="https://en.wikipedia.org/wiki/Scalability#Vertical_or_scale_up"><u>vertically</u></a> by increasing their <a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/"><u>resource</u></a> requests or <a href="https://en.wikipedia.org/wiki/Scalability#Horizontal_or_scale_out"><u>horizontally</u></a> by increasing the number of replicas.</p><p>All of our Kubernetes clusters are multi-tenant with various components enabled for a secure and resilient platform.</p><p><a href="https://kubernetes.io/docs/concepts/workloads/pods/"><u>Pods</u></a> are secured using the latest standards recommended by the Kubernetes project. We use <a href="https://kubernetes.io/docs/concepts/security/pod-security-admission/"><u>Pod Security Admission</u></a> (PSA) and <a href="https://kubernetes.io/docs/concepts/security/pod-security-standards/"><u>Pod Security Standards</u></a> to ensure all workloads are following best practices. 
By default, all namespaces use the most <a href="https://kubernetes.io/docs/concepts/security/pod-security-standards/#restricted"><u>restrictive</u></a> profile, and only a few Kubernetes control plane namespaces are granted <a href="https://kubernetes.io/docs/concepts/security/pod-security-standards/#privileged"><u>privileged</u></a> access. For additional policies not covered by PSA, we built custom <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook"><u>Validating Webhooks</u></a> on top of the <a href="https://github.com/kubernetes-sigs/controller-runtime/tree/main/pkg/webhook/admission"><u>controller-runtime</u></a> framework. PSA and our custom policies ensure clusters are secure and workloads are isolated.</p>
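    <p>To illustrate the mechanism (the namespace name here is hypothetical, not one of our actual tenants), enforcing the restricted profile on a tenant namespace is done with PSA labels:</p>
            <pre><code>apiVersion: v1
kind: Namespace
metadata:
  name: team-example
  labels:
    # Reject pods that violate the restricted Pod Security Standard
    pod-security.kubernetes.io/enforce: restricted
    # Also surface warnings to users at admission time
    pod-security.kubernetes.io/warn: restricted</code></pre>
            <p><sup><i>A sketch of a tenant namespace enforcing the restricted Pod Security Standard via PSA labels</i></sup></p>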
    <div>
      <h2>Our need for virtualization</h2>
      <a href="#our-need-for-virtualization">
        
      </a>
    </div>
    <p>A select number of teams needed tight integration with the Linux kernel. Examples include Docker daemons for build infrastructure and the ability to simulate servers running the software and configuration of our <a href="https://www.cloudflare.com/network/"><u>global network</u></a>. With our pod security requirements, these workloads are not permitted to interface with the host kernel at a deep level (e.g. no <a href="https://en.wikipedia.org/wiki/Iptables"><u>iptables</u></a> or <a href="https://en.wikipedia.org/wiki/Sysctl"><u>sysctls</u></a>). Doing so may disrupt other tenants sharing the node and open additional <a href="https://www.cloudflare.com/learning/security/glossary/attack-vector/"><u>attack vectors</u></a> if an application were compromised. A virtualization platform would enable these workloads to interact with their own kernel within a secured Kubernetes cluster.</p><p>We considered several virtualization solutions. Running a separate virtualization platform outside of Kubernetes would have worked, but would not tightly integrate containerized workloads with virtual machines. It would also be an additional operational burden on our team, as backups, alerting, and fleet management would have to exist for both our Kubernetes and virtual machine clusters.</p><p>We then looked for solutions that run virtual machines within Kubernetes. Teams could already manually deploy <a href="https://www.qemu.org/"><u>QEMU</u></a> pods, but this was not an elegant solution. We needed a better way. There were several other options, but <a href="https://kubevirt.io/"><u>KubeVirt</u></a> was the tool that met the majority of our requirements. Other solutions required a privileged container to run a virtual machine, but KubeVirt did not – this was a crucial requirement in our goal of creating a more secure multi-tenant cluster. 
KubeVirt also uses a feature of the Kubernetes API called <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/"><u>Custom Resource Definitions</u></a> (CRDs), which extends the Kubernetes API with new objects, increasing the flexibility of Kubernetes beyond its built-in types. For KubeVirt, this includes objects such as VirtualMachine and VirtualMachineInstanceReplicaSet. We felt the use of CRDs would allow KubeVirt to grow as more features were added.</p>
    <div>
      <h2>What is KubeVirt?</h2>
      <a href="#what-is-kubevirt">
        
      </a>
    </div>
    <p>KubeVirt is a virtualization platform that enables users to run virtual machines within Kubernetes. With KubeVirt, <a href="https://www.cloudflare.com/learning/cloud/what-is-a-virtual-machine/"><u>virtual machines</u></a> run alongside containerized workloads on the same platform. Kubernetes primitives such as <a href="https://kubernetes.io/docs/concepts/services-networking/network-policies/"><u>network policies</u></a>, <a href="https://kubernetes.io/docs/concepts/configuration/configmap/"><u>configmaps</u></a>, and <a href="https://kubernetes.io/docs/concepts/services-networking/service/"><u>services</u></a> all integrate with virtual machines. KubeVirt scales with our needs and is successfully running hundreds of virtual machines across several clusters. We frequently <a href="https://blog.cloudflare.com/automatic-remediation-of-kubernetes-nodes"><u>remediate Kubernetes nodes</u></a>, so virtual machines and pods are always exercising their startup/shutdown processes.</p>
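    <p>To make this concrete, here is a minimal sketch of a KubeVirt VirtualMachine manifest (the name and container disk image are illustrative, not what we run in production):</p>
            <pre><code>apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: example-vm
spec:
  running: true
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 2Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
      volumes:
        # An ephemeral disk backed by an image in a container registry
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/debian:12</code></pre>
            <p><sup><i>A minimal VirtualMachine definition; KubeVirt schedules it as a pod wrapping a QEMU process</i></sup></p>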
    <div>
      <h2>How Cloudflare uses KubeVirt</h2>
      <a href="#how-cloudflare-uses-kubevirt">
        
      </a>
    </div>
    <p>There are a number of internal projects leveraging virtual machines at Cloudflare. We’ll touch on a few of our more popular use cases:</p><ol><li><p>Kubernetes scalability testing</p></li><li><p>Development environments</p></li><li><p>Kernel and iPXE testing</p></li><li><p>Build pipelines</p></li></ol>
    <div>
      <h3>Kubernetes scalability testing</h3>
      <a href="#kubernetes-scalability-testing">
        
      </a>
    </div>
    
    <div>
      <h4>Setup process</h4>
      <a href="#setup-process">
        
      </a>
    </div>
    <p>Our staging clusters are much smaller than our largest production clusters. They also run on bare metal and mirror the configuration we have for each production cluster. This is extremely useful when rolling out new software, operating systems, or kernel changes; however, they miss bugs that only surface at scale. We use KubeVirt to bridge this gap and virtualize Kubernetes clusters with hundreds of nodes and thousands of pods.</p><p>The setup process for virtualized clusters differs from our bare metal provisioning steps. For bare metal, we use <a href="https://saltproject.io/"><u>Salt</u></a> to provision clusters from start to finish. For our virtualized clusters we use <a href="https://www.ansible.com/"><u>Ansible</u></a> and <a href="https://kubernetes.io/docs/reference/setup-tools/kubeadm/"><u>kubeadm</u></a>.  Our bare metal staging clusters are responsible for testing and validating our Salt configuration. The virtualized clusters give us a vanilla Kubernetes environment without any Cloudflare customizations. Having a stock environment in addition to our Salt environment helps us isolate bugs down to a Kubernetes change, a kernel change, or a Cloudflare-specific configuration change.</p><p>Our virtualized clusters consist of a KubeVirt <a href="https://kubevirt.io/api-reference/v1.2.2/definitions.html#_v1_virtualmachine"><u>VirtualMachine</u></a> object per node. We create three control-plane nodes and any number of worker nodes. Each virtual machine starts out as a vanilla Debian generic <a href="https://cdimage.debian.org/images/cloud/"><u>cloud image</u></a>. 
Using KubeVirt’s <a href="https://kubevirt.io/user-guide/user_workloads/startup_scripts/#cloud-init"><u>cloud-init support</u></a>, the virtual machine downloads an internal <a href="https://www.ansible.com/"><u>Ansible</u></a> <a href="https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_intro.html"><u>playbook</u></a> which installs a recent kernel, <a href="https://cri-o.io/"><u>cri-o</u></a> (the container runtime we use), and <a href="https://kubernetes.io/docs/reference/setup-tools/kubeadm/"><u>kubeadm</u></a>.</p>
            <pre><code>- name: Add the Kubernetes gpg key
  apt_key:
    url: https://pkgs.k8s.io/core:/stable:/{{ kube_version }}/deb/Release.key
    keyring: /etc/apt/keyrings/kubernetes-apt-keyring.gpg
    state: present

- name: Add the Kubernetes repository
  shell: echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/{{ kube_version }}/deb/ /" | tee /etc/apt/sources.list.d/kubernetes.list

- name: Add the CRI-O gpg key
  apt_key:
    url: https://pkgs.k8s.io/addons:/cri-o:/{{ crio_version }}/deb/Release.key
    keyring: /etc/apt/keyrings/cri-o-apt-keyring.gpg
    state: present

- name: Add the CRI-O repository
  shell: echo "deb [signed-by=/etc/apt/keyrings/cri-o-apt-keyring.gpg] https://pkgs.k8s.io/addons:/cri-o:/{{ crio_version }}/deb/ /" | tee /etc/apt/sources.list.d/cri-o.list

- name: Install CRI-O and Kubernetes packages
  apt:
    name:
      - cri-o
      - kubelet
      - kubeadm
      - kubectl
    update_cache: yes
    state: present

- name: Enable and start CRI-O service
  service:
    state: started
    enabled: yes
    name: crio.service</code></pre>
            <p><sup><i>Ansible playbook steps to download and install Kubernetes tooling</i></sup></p><p>Once each node has completed its individual playbook, we can <a href="https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-init/"><u>initialize</u></a> and <a href="https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-join/"><u>join</u></a> nodes to the cluster using another playbook that runs kubeadm. From there, the cluster can be accessed by logging into a control plane node and using kubectl.</p>
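            <p>The bootstrap playbook itself is internal; a hedged sketch of what such kubeadm tasks could look like (the group names and variables are hypothetical):</p>
            <pre><code># Sketch only: group names and variables are illustrative
- name: Initialize the first control-plane node
  command: kubeadm init --control-plane-endpoint "{{ api_endpoint }}"
  when: inventory_hostname == groups['kube-control-plane'][0]

- name: Join the remaining nodes to the cluster
  command: >
    kubeadm join {{ api_endpoint }}
    --token {{ join_token }}
    --discovery-token-ca-cert-hash {{ ca_cert_hash }}
  when: inventory_hostname != groups['kube-control-plane'][0]</code></pre>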
    <div>
      <h4>Simulating at scale</h4>
      <a href="#simulating-at-scale">
        
      </a>
    </div>
    <p>When tens or hundreds of nodes are lost at once, Kubernetes needs to act quickly to minimize downtime. The sooner it recognizes node failure, the faster it can reroute traffic to healthy pods.</p><p>By running Kubernetes inside KubeVirt, we are able to simulate a large cluster undergoing a network cut and observe how Kubernetes reacts. The virtualized cluster allows us to rapidly iterate on configuration changes and code patches.</p><p>The following Ansible playbook task simulates a network segmentation failure where only the control-plane nodes remain online.</p>
            <pre><code>- name: Disable network interfaces on all workers
  command: ifconfig enp1s0 down
  async: 5
  poll: 0
  ignore_errors: yes
  when: inventory_hostname in groups['kube-node']</code></pre>
            <p><sup><i>An Ansible task that disables the network on all worker nodes simultaneously.</i></sup></p><p>This framework allows us to exercise the code in <a href="https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/"><u>controller-manager</u></a>, Kubernetes’s daemon that reconciles the fundamental state of the system (Nodes, Pods, etc). Our simulation platform helped us drastically shorten full traffic recovery time when a large number of Kubernetes nodes <a href="https://blog.cloudflare.com/major-data-center-power-failure-again-cloudflare-code-orange-tested"><u>become unreachable</u></a>. We upstreamed our <a href="https://github.com/kubernetes/kubernetes/pull/114296"><u>changes</u></a> to Kubernetes and more controller-manager speed improvements are coming soon.</p>
    <div>
      <h3>Development environments</h3>
      <a href="#development-environments">
        
      </a>
    </div>
    <p>Compiling code on your laptop can be slow. Perhaps you’re working on a patch for a large open-source project (e.g. <a href="https://v8.dev/"><u>V8</u></a> or <a href="https://clickhouse.com/"><u>Clickhouse</u></a>) or need more bandwidth to upload and download containers. With KubeVirt, we enable our developers to rapidly iterate on software development and testing on <a href="https://blog.cloudflare.com/cloudflare-gen-12-server-bigger-better-cooler-in-a-2u1n-form-factor"><u>powerful server hardware</u></a>. KubeVirt <a href="https://kubevirt.io/user-guide/storage/disks_and_volumes/#persistentvolumeclaim"><u>integrates</u></a> with Kubernetes <a href="https://kubernetes.io/docs/concepts/storage/persistent-volumes/"><u>Persistent Volumes</u></a>, which enables teams to persist their development environment across restarts.</p><p>There are a number of teams at Cloudflare using KubeVirt for a variety of development and testing environments. Most notably is a project called Edge Test Fleet, which emulates a physical server and all the software that runs Cloudflare’s <a href="https://www.cloudflare.com/network/"><u>global network</u></a>. Teams can test their code and configuration changes against the entire software stack without reserving dedicated hardware. Cloudflare uses <a href="https://blog.cloudflare.com/tag/salt/"><u>Salt</u></a> to provision systems. It can be difficult to iterate and test Salt changes without a complete virtual environment. Edge Test Fleet makes iterating on Salt easier, ensuring states compile and render the right output. With Edge Test Fleet, new developers can better understand how Cloudflare’s global network works without touching staging or production.</p><p>Additionally, one Cloudflare team developed a framework that allows users to build and test changes to <a href="https://blog.cloudflare.com/log-analytics-using-clickhouse"><u>Clickhouse</u></a> using a <a href="https://code.visualstudio.com/"><u>VSCode</u></a> environment. 
This framework is generally applicable to all teams requiring a development environment. Once a template environment is provisioned, <a href="https://kubernetes.io/docs/concepts/storage/volume-pvc-datasource/"><u>CSI Volume Cloning</u></a> can duplicate a golden volume, separating persistent environments for each developer.</p>
            <pre><code>apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: devspace-jcichra-rootfs
  namespace: dev-clickhouse-vms
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: rook-ceph-nvme
  dataSource:
    kind: PersistentVolumeClaim
    name: dev-rootfs
  resources:
    requests:
      storage: 500Gi</code></pre>
            <p><sup><i>A PersistentVolumeClaim that clones data from another volume using CSI Volume Cloning</i></sup></p>
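            <p>Once cloned, the claim can be attached to that developer’s virtual machine like any other disk. A sketch of the relevant KubeVirt spec fragment, reusing the claim name from the example above:</p>
            <pre><code># Fragment of a VirtualMachine spec; the disk/volume names are illustrative
domain:
  devices:
    disks:
      - name: rootfs
        disk:
          bus: virtio
volumes:
  - name: rootfs
    persistentVolumeClaim:
      claimName: devspace-jcichra-rootfs</code></pre>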
    <div>
      <h3>Kernel and iPXE testing</h3>
      <a href="#kernel-and-ipxe-testing">
        
      </a>
    </div>
    <p>Unlike <a href="https://en.wikipedia.org/wiki/User_space_and_kernel_space"><u>user space</u></a> software development, when a kernel crashes, the entire system crashes. The <a href="https://blog.cloudflare.com/tag/kernel"><u>kernel</u></a> team uses KubeVirt for development. KubeVirt gives all kernel engineers, regardless of laptop OS or architecture, the same x86 environment and <a href="https://en.wikipedia.org/wiki/Hypervisor"><u>hypervisor</u></a>. Virtual machines on server hardware can be scaled up to more cores and memory than on laptops. The Cloudflare kernel team has also found low-level issues which only surface in environments with many CPUs.</p><p>To make testing fast and easy, the kernel team serves <a href="https://ipxe.org/"><u>iPXE</u></a> images via an <a href="https://nginx.org/"><u>nginx</u></a> Pod and Service adjacent to the virtual machine. A recent kernel and Debian image are copied to the nginx pod via kubectl cp. The iPXE file can then be referenced in the KubeVirt virtual machine definition via the DNS name for the Kubernetes Service.</p>
            <pre><code>interfaces:
  - name: default
    masquerade: {}
    model: e1000e
    ports:
      - port: 22
    dhcpOptions:
      bootFileName: http://httpboot.u-$K8S_USER.svc.cluster.local/boot.ipxe</code></pre>
            <p>When the virtual machine boots, it will get an IP address on the default interface behind <a href="https://en.wikipedia.org/wiki/Network_address_translation"><u>NAT</u></a> due to our <a href="https://kubevirt.io/user-guide/network/interfaces_and_networks/#masquerade"><u>masquerade</u></a> setting. Then it will download boot.ipxe, which describes what additional files should be downloaded to start the system. In this case, the kernel (<code>vmlinuz-amd64</code>), Debian (<code>baseimg-amd64.img</code>) and additional kernel modules (<code>modules-amd64.img</code>) are downloaded.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/74Pndk3FS6TVPACarSKD4N/fc7c3add6bae3c2c8b5e086ef9061872/image2.png" />
          </figure><p><sup><i>UEFI iPXE boot connecting and downloading files from nginx pod in user’s namespace</i></sup></p><p>Once the system is booted, a developer can log in to the system for testing:</p>
            <pre><code>linux login: root
Password: 
Linux linux 6.6.35-cloudflare-2024.6.7 #1 SMP PREEMPT_DYNAMIC Mon Sep 27 00:00:00 UTC 2010 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
root@linux:~# </code></pre>
            <p>Custom kernels can be copied to the nginx pod via <code>kubectl cp</code>. Restarting the virtual machine will load that new kernel for testing. When a kernel panic occurs, the virtual machine can quickly be restarted with <code>virtctl restart linux</code> and it will go through the iPXE boot process again.</p>
    <div>
      <h3>Build pipelines</h3>
      <a href="#build-pipelines">
        
      </a>
    </div>
    <p>Cloudflare leverages KubeVirt to build the majority of our software. Virtual machines give build system users full control over their pipeline. For example, Debian packages can easily be installed and separate container daemons (such as <a href="https://www.docker.com/"><u>Docker</u></a>) can run all within a Kubernetes namespace using the <code>restricted</code> Pod Security Standard. KubeVirt’s <a href="https://kubevirt.io/user-guide/user_workloads/replicaset/"><u>VirtualMachineInstanceReplicaSet</u></a> concept allows us to quickly scale the number of build agents up and down to match demand. We can roll out different sets of virtual machines with varying sizes, kernels, and operating systems.</p><p>To scale efficiently, we leverage <a href="https://kubevirt.io/user-guide/storage/disks_and_volumes/#containerdisk"><u>container disks</u></a> to store our agent virtual machine images. Container disks allow us to store the virtual machine image (for example, a <a href="https://en.wikipedia.org/wiki/Qcow"><u>qcow</u></a> image) in our container registry. This strategy works well when the state in virtual machines is ephemeral. <a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-command"><u>Liveness probes</u></a> detect unhealthy or broken agents, shutting down the virtual machine and replacing it with a fresh instance. Other automation limits virtual machine uptime, capping it to 3–4 hours to keep build agents fresh.</p>
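    <p>As a hedged sketch of this pattern (the replica count, labels, and registry image are illustrative, not our actual configuration), a pool of ephemeral build agents backed by a container disk can be declared like this:</p>
            <pre><code>apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceReplicaSet
metadata:
  name: build-agents
spec:
  replicas: 4
  selector:
    matchLabels:
      app: build-agent
  template:
    metadata:
      labels:
        app: build-agent
    spec:
      domain:
        resources:
          requests:
            cpu: "4"
            memory: 8Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
      volumes:
        # Ephemeral root disk pulled from the container registry
        - name: rootdisk
          containerDisk:
            image: registry.example.com/build/agent-image:latest</code></pre>
            <p><sup><i>Scaling the replica count up or down adds or removes build agents</i></sup></p>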
    <div>
      <h2>Next steps</h2>
      <a href="#next-steps">
        
      </a>
    </div>
    <p>We’re excited to expand our use of KubeVirt and unlock new capabilities for our internal users. KubeVirt’s Linux ARM64 support will allow us to build ARM64 packages in-cluster and simulate ARM64 systems.</p><p>Projects like <a href="https://kubevirt.io/user-guide/operations/containerized_data_importer/"><u>KubeVirt CDI</u></a> (Containerized Data Importer) will streamline our users’ virtual machine experience. Instead of users manually building container disks, we can provide a catalog of virtual machine images. It also allows us to copy virtual machine disks between namespaces.</p>
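    <p>For instance (a sketch only; the registry URL and sizes are illustrative), a CDI DataVolume can import a virtual machine image from a registry into a PersistentVolumeClaim, so users never build container disks themselves:</p>
            <pre><code>apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: debian-golden
spec:
  source:
    # CDI pulls the image and writes it into the claim below
    registry:
      url: docker://registry.example.com/images/debian-vm:latest
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 20Gi</code></pre>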
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>KubeVirt has proven to be a great tool for virtualization in our Kubernetes-first environment. We’ve unlocked the ability to support more workloads with our multi-tenant model. The KubeVirt platform allows us to offer a single compute platform supporting containers and virtual machines. Managing it has been simple, and upgrades have been straightforward and non-disruptive. We’re exploring additional features KubeVirt offers to improve the experience for our users.</p><p>Finally, our team is expanding! We’re looking for more people passionate about Kubernetes to <a href="https://boards.greenhouse.io/cloudflare/jobs/5579824"><u>join our team</u></a> and help us push Kubernetes to the next level.</p> ]]></content:encoded>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <guid isPermaLink="false">1149BgOuHn2l6ubvzlzHar</guid>
            <dc:creator>Justin Cichra</dc:creator>
        </item>
    </channel>
</rss>