
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built and the technologies we use, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/rss/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 02 May 2026 04:36:33 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Code Orange: Fail Small is complete. The result is a stronger Cloudflare network]]></title>
            <link>https://blog.cloudflare.com/code-orange-fail-small-complete/</link>
            <pubDate>Fri, 01 May 2026 21:07:30 GMT</pubDate>
            <description><![CDATA[ We have completed a massive engineering effort to make our infrastructure more resilient. Through new tools like Snapstone and the Engineering Codex, we've implemented safer configuration changes and automated best practices to prevent future incidents. ]]></description>
            <content:encoded><![CDATA[ <p>Over the past two-plus quarters, we've undertaken an intensive engineering effort, internally code-named "<a href="https://blog.cloudflare.com/fail-small-resilience-plan/"><u>Code Orange: Fail Small</u></a>", focused on making Cloudflare's infrastructure more resilient, secure, and reliable for every customer.</p><p>Last month, the Cloudflare team finished this work.</p><p>While improving resiliency will never be “done”, and will always be a top priority across our development lifecycle, we have now completed the work that would have prevented the <a href="https://blog.cloudflare.com/18-november-2025-outage/"><u>November 18, 2025</u></a> and <a href="https://blog.cloudflare.com/5-december-2025-outage/"><u>December 5, 2025</u></a> global outages.</p><p>This work focused on several key areas: safer configuration changes, reducing the impact of failure, and revising our “break glass” procedures and incident management. We also introduced measures to prevent drift and regressions over time, and strengthened the way we communicate with our customers during an outage.</p><p>Here we explain in depth what we shipped and what it means for you.</p>
    <div>
      <h2>Safer configuration changes</h2>
      <a href="#safer-configuration-changes">
        
      </a>
    </div>
    <p><b><i>What it means for you</i></b><i>: In most cases, Cloudflare internal configuration changes no longer reach our network instantly and are instead rolled out progressively with real-time health monitoring. This allows our observability tools to catch problems and revert changes before they affect your traffic.</i></p><p>In order to catch potentially dangerous deployments before they reach production, we've identified high-risk configuration pipelines and built new tools to manage configuration changes better.</p><p>For products that process customer traffic on our network and receive configuration changes, we no longer deploy those changes instantly across the network. Instead, relevant teams have adopted a “health-mediated deployment” methodology, the same one <a href="https://blog.cloudflare.com/safe-change-at-any-scale/"><u>we use when releasing software</u></a>, for all configuration deployments. This includes, but is not limited to, the product teams that were directly affected by the incidents.</p><p>Central to this is a new internal component we call Snapstone, which we built to bring health-mediated deployment to configuration changes. Snapstone bundles a configuration change into a package, then releases it gradually under health-mediation principles. Before Snapstone, applying this methodology to config was possible but difficult: it required significant per-team effort and wasn't consistently applied across the network. Snapstone closes this gap by providing a unified way to bring progressive rollout, real-time health monitoring, and automated rollback to configuration deployments by default.</p><p>What makes Snapstone particularly powerful is its flexibility. Rather than being a fix for specific past failures, Snapstone allows teams to dynamically define any unit of configuration that needs health mediation, whether that's a data file like the one that caused the <a href="https://blog.cloudflare.com/18-november-2025-outage/"><u>November 18 outage</u></a>, or a control flag in our global configuration system like the one involved in the <a href="https://blog.cloudflare.com/5-december-2025-outage/"><u>December 5 outage</u></a>. Teams create these configuration units on demand, and Snapstone ensures they are deployed safely everywhere they're used.</p><p>This gives us something we didn't have before: when a risk review or operational experience identifies a dangerous configuration pattern, the fix is straightforward. Bring it into Snapstone, and the configuration pattern immediately inherits safe deployment.</p>
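    <p>Snapstone's interfaces are internal, but the shape of a health-mediated configuration release can be sketched in a few lines of Rust. Everything below (ConfigUnit, Stage, the health and rollback functions) is a hypothetical illustration of the principle, not Snapstone's actual API:</p>
    <pre><code>use std::{thread, time::Duration};

// Hypothetical unit of configuration managed by a Snapstone-like system:
// a Bot Management data file, a global control flag, and so on.
struct ConfigUnit {
    name: &amp;'static str,
}

// One step of a progressive rollout: what fraction of the fleet receives
// the new config, and how long to watch health signals before proceeding.
struct Stage {
    percent: u8,
    soak: Duration,
}

fn deploy(unit: &amp;ConfigUnit, plan: &amp;[Stage]) -&gt; Result&lt;(), String&gt; {
    for stage in plan {
        apply_to_fraction(unit, stage.percent);
        thread::sleep(stage.soak); // let real-time health metrics accumulate
        if !health_ok(unit) {
            rollback(unit); // automated rollback, no human in the loop
            return Err(format!("{} failed at {}%", unit.name, stage.percent));
        }
    }
    Ok(())
}

// Stand-ins for fleet orchestration and observability.
fn apply_to_fraction(unit: &amp;ConfigUnit, percent: u8) {
    println!("applying {} to {percent}% of hosts", unit.name);
}
fn health_ok(_unit: &amp;ConfigUnit) -&gt; bool {
    true // in reality: compare error rates against the unit's health policy
}
fn rollback(unit: &amp;ConfigUnit) {
    println!("restoring last known good {}", unit.name);
}

fn main() {
    let unit = ConfigUnit { name: "bot-mgmt-feature-file" };
    let plan = [
        // Soak times are seconds here for the sketch; in practice they
        // would be minutes to hours per step.
        Stage { percent: 1, soak: Duration::from_secs(1) },
        Stage { percent: 10, soak: Duration::from_secs(1) },
        Stage { percent: 100, soak: Duration::from_secs(0) },
    ];
    if let Err(e) = deploy(&amp;unit, &amp;plan) {
        eprintln!("{e}");
    }
}</code></pre>
    <p>The important property is that the rollout plan and health policy travel with the configuration unit itself: any team that wraps a risky configuration pattern in Snapstone gets progressive rollout, health monitoring, and automated rollback by default.</p>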
    <div>
      <h2>Reducing the impact of failure</h2>
      <a href="#reducing-the-impact-of-failure">
        
      </a>
    </div>
    <p><b><i>What it means for you</i></b><i>: In the event an issue is observed on our network, our systems now fail more gracefully. This vastly reduces the potential impact radius, ensuring your traffic is delivered even in worst-case scenarios.</i></p><p>Product teams have carefully reviewed, both manually and programmatically, the potential failure modes of products that are critical for serving customer traffic. Teams have removed non-essential runtime dependencies and implemented better failure modes. We will now use the last known good configuration where possible (“fail stale”), and where that isn’t possible we have reviewed each failure case and implemented “fail open” or “fail closed”, depending on whether serving traffic with reduced functionality is preferable to failing to serve traffic.</p><p>Let’s look at an example of how this works. Our November 2025 outage was triggered by a failed rollout of our Bot Management detection machine learning classifier. Under our new procedures, if data were again generated that our system could not read, the system would refuse to use the updated configuration and instead use the old configuration. If the old configuration were not available for some reason, it would fail open to ensure customer production traffic continues to be served, which is a much better outcome than downtime. (We sketch this behavior in code below.)</p><p>As a result, if the same Bot Management change that caused the failure in November were to roll out now, the system would detect the failure in an early stage of the deployment, before it had affected anything more than a small percentage of traffic.</p><p>We have also begun further segmenting our systems so that independent copies of services run for different cohorts of traffic. Cloudflare already takes advantage of these customer cohorts for blast radius mitigation with traffic management techniques today, and this additional process segmentation provides a powerful reliability capability for us going forward.</p><p>For example, the Workers runtime system is segmented into multiple independent services handling different cohorts of traffic, with one handling only traffic for our free customers. Changes are deployed to these segments based on customer cohorts, starting with free customers first. We’re also sending updates more quickly and frequently to the least critical segments, and at a slower pace to the most critical segments.</p><p>As a result, if a change were deployed to the Workers runtime system and it broke traffic, it would now only affect a small percentage of our free customers before being automatically detected and rolled back.</p><p>Sticking with the Workers runtime system as an example: in a seven-day period last month, the deployment process was triggered more than 50 times. You can see how each release happens in “waves” as the change propagates to the edge, often in parallel with prior and subsequent releases:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/27Qu0u1iTXZ6MNI4GabAWs/b47737d63334808155e7dd8148818e06/3278-chart.png" />
          </figure><p>We’re working on extending this pattern of deployment to many more of our systems in the future.</p>
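    <p>Returning to failure modes: to make the “fail stale, then fail open” behavior described above concrete, here is a minimal Rust sketch. The types and validation rules are hypothetical stand-ins for something like the Bot Management feature file from the November 18 incident:</p>
    <pre><code>// Hypothetical classifier configuration, standing in for something like
// the Bot Management feature file from the November 18 incident.
struct ClassifierConfig {
    features: Vec&lt;String&gt;,
}

// A config refresh must never panic: prefer the new config, fall back to
// the last known good copy ("fail stale"), and if neither is usable, run
// with the module disabled ("fail open") rather than dropping traffic.
fn refresh(raw: &amp;str, last_good: Option&lt;ClassifierConfig&gt;) -&gt; Option&lt;ClassifierConfig&gt; {
    match parse_and_validate(raw) {
        Ok(cfg) =&gt; Some(cfg),
        Err(_) if last_good.is_some() =&gt; last_good, // fail stale
        Err(_) =&gt; None,                             // fail open
    }
}

fn parse_and_validate(raw: &amp;str) -&gt; Result&lt;ClassifierConfig, String&gt; {
    if raw.is_empty() {
        return Err("empty or corrupted config".into());
    }
    let features: Vec&lt;String&gt; = raw.lines().map(str::to_owned).collect();
    // Validate invariants up front, e.g. a bounded feature count.
    if features.len() &gt; 200 {
        return Err(format!("too many features: {}", features.len()));
    }
    Ok(ClassifierConfig { features })
}

fn main() {
    let last_good = Some(ClassifierConfig { features: vec!["f1".into()] });
    assert!(refresh("", last_good).is_some()); // stale copy keeps serving
    assert!(refresh("", None).is_none());      // degraded, but traffic flows
    if let Some(cfg) = refresh("f1\nf2", None) {
        println!("loaded {} features", cfg.features.len());
    }
}</code></pre>
    <p>Whether a given system should fail open or fail closed remains a per-product decision: for a security module on the request path, serving traffic with reduced functionality is usually preferable to an error page, while the reverse can be true for something like an authentication check.</p>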
    <div>
      <h2>Revised “break glass” and incident management procedures</h2>
      <a href="#revised-break-glass-and-incident-management-procedures">
        
      </a>
    </div>
    <p><b><i>What it means for you</i></b><i>: If an incident does occur, we have the tools and teams to communicate more clearly and resolve it faster, minimizing downtime.</i></p><p>Cloudflare runs on Cloudflare. We use our own Zero Trust products to secure our infrastructure, but this creates a dependency: if a network-wide outage impacts these tools, we lose the very pathways we need to fix them. Before this Code Orange initiative, our "break glass" pathways were restricted to a handful of people and offered limited tool access. We needed these tools and pathways to be more broadly available during an outage.</p><p>To solve this, we conducted a comprehensive audit of the tools essential for system visibility, debugging, and production changes. We ultimately developed backup authorization pathways for 18 key services, supported by new emergency scripts and proxies.</p><p>Throughout the Code Orange program, we moved from theory to practice. After small-team exercises, we conducted an engineering-wide drill on April 7, 2026, involving more than 200 team members. While automation keeps these pathways functional, drills like these ensure our engineers have the muscle memory to use them under pressure.</p><p>This effort also focused on the flow of information. When internal visibility is disrupted, our incident response slows down, and our ability to communicate with the outside world suffers. Historically, technical observations from the heat of the moment didn't always translate into clear updates for our customers.</p><p>To bridge this gap, we established a dedicated communications team to work in lockstep with incident responders during major events. Just as our engineers practiced their "break glass" procedures, this team used the Code Orange program to drill on streamlining the cadence and clarity of customer updates. By ensuring we have both the tools to see and the structure to speak, we can resolve incidents faster and keep our customers better informed.</p>
    <div>
      <h2>We have codified our improvements</h2>
      <a href="#we-have-codified-our-improvements">
        
      </a>
    </div>
    <p><b><i>What it means for you</i></b><i>: We will remember the learnings from our incidents and have codified the resolutions. Our network will only become more resilient.</i></p><p>To prevent the work done as part of Code Orange from drifting or regressing over time, the team has built an internal Codex that codifies all our guidelines as clear, concise rules.</p><p>The Codex is now mandatory for all engineering and product teams, and has become a central part of Cloudflare's internal procedures. Its rules are enforced via AI code reviews that automatically highlight any instance that might diverge from the guidelines and require additional manual review. This is applied without exception to our entire codebase. The goal is simple: build institutional memory that enforces itself.</p><p>The November and December outages shared a common failure mode: code that assumed inputs would always be valid, with no graceful degradation when that assumption broke. A Rust service called .unwrap() instead of handling an error; Lua code indexed an object that didn't exist. Both patterns are preventable if the lessons are captured and enforced.</p><p>The Codex is part of our answer. It's a living repository of engineering standards written by domain experts through our Request For Comments (RFC) process, then distilled into actionable rules. Best practices that previously lived in the heads of senior engineers, or were discovered only after an incident, now become shared knowledge accessible to everyone. Each rule follows a simple format: "If you need X, use Y", with a link to the RFC that explains why.</p><p>For example, one RFC now states: "Do not use .unwrap() outside of tests and build.rs." Another captures a broader principle: "Services MUST validate that upstream dependencies are in an expected state before processing."</p><p>Had these rules been enforced earlier, the November and December outages would have been rejected merge requests instead of global incidents.</p><p>Rules without enforcement are suggestions. The Codex integrates with AI-powered agents at every stage of the software development lifecycle, from design review through deployment to incident analysis. This shifts enforcement left, from "global outage" to "rejected merge request." The blast radius of a violation shrinks from millions of affected requests to a single developer getting actionable feedback before their code ever reaches production.</p><p>The Codex is a living document and will be continuously improved over time. Domain experts write RFCs to codify best practices. Incidents surface gaps that become new RFCs. Every approved RFC generates Codex rules. Those rules feed the agents that review the next merge request. It's a flywheel: expertise becomes standards, standards become enforcement, enforcement raises the floor for everyone.</p>
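    <p>As a concrete, simplified illustration of that first rule, consider a service reading a numeric threshold from upstream configuration. The function names and the fallback value below are ours, invented for illustration, not taken from Cloudflare's codebase:</p>
    <pre><code>use std::collections::HashMap;

// What the Codex rule forbids outside tests and build.rs: if "threshold"
// is missing or malformed, each .unwrap() panics and can take the whole
// service down with it.
#[allow(dead_code)]
fn threshold_noncompliant(cfg: &amp;HashMap&lt;String, String&gt;) -&gt; f64 {
    cfg.get("threshold").unwrap().parse().unwrap()
}

// The compliant version validates the upstream input and degrades to a
// sane default (the value here is purely illustrative) instead of panicking.
fn threshold_compliant(cfg: &amp;HashMap&lt;String, String&gt;) -&gt; f64 {
    const DEFAULT: f64 = 0.5;
    cfg.get("threshold")
        .and_then(|v| v.parse().ok())
        .unwrap_or(DEFAULT)
}

fn main() {
    let cfg = HashMap::new(); // an upstream dependency in an unexpected state
    println!("{}", threshold_compliant(&amp;cfg)); // 0.5: degraded, still serving
    // threshold_noncompliant(&amp;cfg) would panic here.
}</code></pre>
    <p>A rule like this is easy for an AI reviewer to enforce mechanically: the pattern is syntactic, the fix is local, and the feedback arrives at merge-request time rather than during an incident.</p>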
    <div>
      <h2>It’s not just about code: communication is key</h2>
      <a href="#its-not-just-about-code-communication-is-key">
        
      </a>
    </div>
    <p><b><i>What it means for you</i></b><i>: Transparency is important to us. If something goes wrong, we’re committed to keeping you updated every step of the way so you can stay focused on what matters to you.</i></p><p>The global outages have made us review core processes and cultural approaches even beyond engineering and product development. As part of the broader Code Orange initiatives, we have introduced additional service level objectives (SLOs) for all our services, enforced a global changelog, onboarded all teams to our maintenance coordination system, and improved transparency across the company on our incident “prevents” ticket backlog.</p><p>We have also strengthened the way we communicate with our customers during an outage. Our goal is to alert you to an issue the moment we confirm it, before you even notice a problem. By the time you see a lag or an error, our aim is to have an update already waiting in your notifications.</p><p>During an active incident, we now provide updates at predictable intervals (e.g., every 30 or 60 minutes), even if the update is simply, "We are still testing the fix; no new changes yet." This allows you to plan your day rather than constantly refreshing a status page.</p><p>Our job isn't done when the status returns to normal. We provide detailed post-mortems explaining what happened, why it happened, and the specific structural changes we are making to ensure it doesn't happen again.</p>
    <div>
      <h2>This initiative is complete. But our work on resiliency is never done.</h2>
      <a href="#this-initiative-is-complete-but-our-work-on-resiliency-is-never-done">
        
      </a>
    </div>
    <p>We took these incidents very seriously and adopted shared ownership across the entire Cloudflare organization, asking every team: what could have been done better? This guided the work that we carried out over the last two quarters.</p><p>While this work is never truly done, we are confident that we are in a much better position, and that Cloudflare is now much stronger because of it.</p> ]]></content:encoded>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[Code Orange]]></category>
            <guid isPermaLink="false">6EfXlJEx6OJ21w9NlnS59D</guid>
            <dc:creator>Jeremy Hartman</dc:creator>
        </item>
        <item>
            <title><![CDATA[Code Orange: Fail Small — our resilience plan following recent incidents]]></title>
            <link>https://blog.cloudflare.com/fail-small-resilience-plan/</link>
            <pubDate>Fri, 19 Dec 2025 22:35:30 GMT</pubDate>
            <description><![CDATA[ We have declared “Code Orange: Fail Small” to focus everyone at Cloudflare on a set of high-priority workstreams with one simple goal: ensure that the cause of our last two global outages never happens again. ]]></description>
            <content:encoded><![CDATA[ <p>On <a href="https://blog.cloudflare.com/18-november-2025-outage/"><u>November 18, 2025</u></a>, Cloudflare’s network experienced significant failures to deliver network traffic for approximately two hours and ten minutes. Nearly three weeks later, on <a href="https://blog.cloudflare.com/5-december-2025-outage/"><u>December 5, 2025</u></a>, our network again failed to serve traffic for 28% of the applications behind it for about 25 minutes.</p><p>We published detailed post-mortem blog posts following both incidents, but we know that we have more to do to earn back your trust. Today we are sharing details about the work underway at Cloudflare to prevent outages like these from happening again.</p><p>We are calling the plan “<b>Code Orange: Fail Small</b>”, which reflects our goal of making our network more resilient to errors or mistakes that could lead to a major outage. A “Code Orange” means the work on this project is prioritized above all else. For context, we declared a “Code Orange” at Cloudflare <a href="https://blog.cloudflare.com/major-data-center-power-failure-again-cloudflare-code-orange-tested/"><u>once before</u></a>, following another major incident that required top priority from everyone across the company. We feel the recent events require the same focus. Code Orange is how we enable that: teams work cross-functionally as necessary to get the job done, pausing all other work.</p><p>The Code Orange work is organized into three main areas:</p><ul><li><p>Require controlled rollouts for any configuration change that is propagated to the network, just as we do today for software binary releases.</p></li><li><p>Review, improve, and test failure modes of all systems handling network traffic to ensure they exhibit well-defined behavior under all conditions, including unexpected error states.</p></li><li><p>Change our internal “break glass”* procedures, and remove any circular dependencies, so that we, and our customers, can act fast and access all systems without issue during an incident.</p></li></ul><p>These projects will deliver iterative improvements as they proceed, rather than one “big bang” change at their conclusion. Every individual update will contribute to more resiliency at Cloudflare. By the end, we expect Cloudflare’s network to be much more resilient, including against issues such as those that triggered the global incidents we experienced in the last two months.</p><p>We understand that these incidents are painful for our customers and the Internet as a whole. We’re deeply embarrassed by them, which is why this work is the first priority for everyone here at Cloudflare.</p><p><sup><b><i>*</i></b></sup><sup><i> Break glass procedures at Cloudflare allow certain individuals to temporarily elevate their privileges in order to perform urgent actions that resolve high-severity scenarios.</i></sup></p>
    <div>
      <h2>What went wrong?</h2>
      <a href="#what-went-wrong">
        
      </a>
    </div>
    <p>In the first incident, users visiting a customer site on Cloudflare saw error pages that indicated Cloudflare could not deliver a response to their request. In the second, they saw blank pages.</p><p>Both outages followed a similar pattern. In the moments leading up to each incident, we instantaneously deployed a configuration change to data centers in hundreds of cities around the world.</p><p>The November change was an automatic update to our Bot Management classifier. We run various artificial intelligence models that learn from the traffic flowing through our network to build detections that identify bots. We constantly update those systems to stay ahead of bad actors trying to evade our security protection to reach customer sites.</p><p>During the December incident, while trying to protect our customers from a vulnerability in the popular open source framework React, we deployed a change to a security tool used by our security analysts to improve our signatures. As with new bot management updates, we needed to get ahead of the attackers who wanted to exploit the vulnerability. That change triggered the start of the incident.</p><p>This pattern exposed a serious gap in how we deploy configuration changes at Cloudflare versus how we release software updates. When we release software updates, we do so in a controlled and monitored fashion. For each new binary release, the deployment must successfully pass multiple gates before it can serve worldwide traffic. We deploy first to employee traffic, before carefully rolling out the change to increasing percentages of customers worldwide, starting with free users. If we detect an anomaly at any stage, we can revert the release without any human intervention.</p><p>We have not applied that methodology to configuration changes. Unlike releasing the core software that powers our network, when we make configuration changes, we are modifying the values that govern how that software behaves, and we can do so instantly. We give this power to our customers too: if you make a change to a setting in Cloudflare, it will propagate globally in seconds.</p><p>While that speed has advantages, it also comes with risks that we need to address. The past two incidents have demonstrated that we need to treat any change that affects how we serve traffic in our network with the same tested caution that we apply to changes to the software itself.</p>
    <div>
      <h2>We will change how we deploy configuration updates at Cloudflare</h2>
      <a href="#we-will-change-how-we-deploy-configuration-updates-at-cloudflare">
        
      </a>
    </div>
    <p>Our ability to deploy configuration changes globally within seconds was the core commonality across the two incidents. In both events, a bad configuration took down our network in seconds.</p><p>Introducing controlled rollouts of our configuration, just as we <b><i>already do</i></b> for software releases, is the most important workstream of our Code Orange plan.</p><p>Configuration changes at Cloudflare propagate to the network very quickly. When a user creates a new DNS record or a new security rule, it reaches 90% of the servers on our network within seconds. This is powered by a software component that we internally call Quicksilver.</p><p>Quicksilver is also used for any configuration change required by our own teams. The speed is a feature: we can react and globally update our network behavior very quickly. However, in both incidents it allowed a breaking change to propagate to the entire network in seconds rather than passing through gates that would have tested it.</p><p>While the ability to deploy changes to our network on a near-instant basis is useful in many cases, it is rarely necessary. Work is underway to treat configuration the same way that we treat code by introducing controlled deployments within Quicksilver for any configuration change.</p><p>We release software updates to our network multiple times per day through what we call our Health Mediated Deployment (HMD) system. In this framework, every team at Cloudflare that owns a service (a piece of software deployed into our network) must define the metrics that indicate whether a deployment has succeeded or failed, the rollout plan, and the steps to take if it does not succeed.</p><p>Different services will have slightly different parameters. Some might need longer wait times before proceeding to more data centers, while others might have tighter error-rate tolerances, even if that causes false-positive signals.</p><p>Once a deployment begins, our HMD toolkit carefully progresses through that plan, monitoring each step before proceeding. If any step fails, rollback begins automatically and the team is paged if needed.</p><p>By the end of Code Orange, configuration updates will follow this same process. We expect this to allow us to quickly catch the kinds of issues that occurred in these past two incidents, long before they become widespread problems.</p>
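    <p>In other words, an HMD plan boils down to three declarations: what “healthy” means, which cohort receives the change next, and what happens on failure. The Rust sketch below is a hypothetical rendering of such a plan; the schema, metric names, and values are illustrative rather than our internal format:</p>
    <pre><code>use std::time::Duration;

// A health signal the deployment is judged against; each owning team
// defines what success means for its own service.
struct HealthMetric {
    name: &amp;'static str,
    max_rate: f64, // maximum tolerated fraction of failing requests
}

// One gate in the rollout: which cohort receives the change next, and how
// long to observe the health metrics before proceeding.
struct Gate {
    cohort: &amp;'static str,
    observe: Duration,
}

// What a team must declare before HMD will deploy its service or config.
struct DeploymentPlan {
    service: &amp;'static str,
    success_criteria: Vec&lt;HealthMetric&gt;,
    gates: Vec&lt;Gate&gt;,
    on_failure: &amp;'static str,
}

fn main() {
    let plan = DeploymentPlan {
        service: "example-config-consumer",
        success_criteria: vec![
            HealthMetric { name: "http_5xx_rate", max_rate: 0.001 },
            HealthMetric { name: "panic_rate", max_rate: 0.0 },
        ],
        gates: vec![
            Gate { cohort: "employee traffic", observe: Duration::from_secs(600) },
            Gate { cohort: "free plan", observe: Duration::from_secs(600) },
            Gate { cohort: "global", observe: Duration::from_secs(0) },
        ],
        on_failure: "automatic rollback, page the owning team",
    };

    println!("deployment plan for {}", plan.service);
    for m in &amp;plan.success_criteria {
        println!("  require {} at or below {}", m.name, m.max_rate);
    }
    for g in &amp;plan.gates {
        println!("  gate: {} (observe for {:?})", g.cohort, g.observe);
    }
    println!("  on failure: {}", plan.on_failure);
}</code></pre>
    <p>The value of making the plan declarative is that the HMD toolkit, not the owning team, executes it: progression, monitoring, and rollback all happen mechanically against the declared criteria.</p>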
    <div>
      <h2>How will we address failure modes between services?</h2>
      <a href="#how-will-we-address-failure-modes-between-services">
        
      </a>
    </div>
    <p>While we are optimistic that better control over configuration changes will catch more problems before they become incidents, we know that mistakes can and will occur. During both incidents, errors in one part of our network became problems in most of our technology stack, including the control plane that customers rely on to configure how they use Cloudflare.</p><p>We need to think about careful, graduated rollouts not just in terms of geographic progression (spreading to more of our data centers) or population progression (spreading to employees and customer types). We also need to plan for safer deployments that contain failures in terms of service progression (spreading from one product, like our Bot Management service, to an unrelated one, like our dashboard).</p><p>To that end, we are in the process of reviewing the interface contracts between every critical product and service that comprise our network to ensure that we a) <b>assume failure will occur</b> at each interface and b) handle that failure in the <b>most reasonable way possible</b>.</p><p>To go back to our Bot Management service failure, there were at least two key interfaces where, if we had assumed failure was going to happen, we could have handled it gracefully, to the point that it is unlikely any customer would have been impacted. The first was the interface that read the corrupted config file. Instead of panicking, there should have been a sane set of validated defaults that would have allowed traffic to pass through our network; at worst, we would have lost the realtime fine-tuning that feeds into our bot detection machine-learning models.</p><p>The second interface was between the core software that runs our network and the Bot Management module itself. In the event that our Bot Management module failed (as it did), we should not have dropped traffic by default. Instead, we could have fallen back to a saner default: allowing the traffic to pass with a default classification.</p>
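    <p>That second interface contract can also be sketched in a few lines of Rust. The score scale and the default value are invented for illustration; the point is only the shape of the boundary between the core proxy and the module:</p>
    <pre><code>// The verdict the core proxy needs from the Bot Management module
// (the 0-99 score scale here is purely illustrative).
struct BotScore(u8);

// Stand-in for the module call that failed during the November incident.
fn classify(request_id: u64) -&gt; Result&lt;BotScore, String&gt; {
    if request_id % 2 == 0 {
        Ok(BotScore(87))
    } else {
        Err("classifier unavailable: unreadable feature file".into())
    }
}

// The contract after the review: a module failure degrades to a default,
// traffic-passing classification instead of dropped requests.
fn handle_request(request_id: u64) -&gt; BotScore {
    classify(request_id).unwrap_or_else(|err| {
        // Log and count the failure so it stays visible and alertable,
        // but keep serving: fail open with a neutral default score.
        eprintln!("bot module failed for request {request_id}: {err}");
        BotScore(50)
    })
}

fn main() {
    for id in 0..4 {
        println!("request {id}: served with bot score {}", handle_request(id).0);
    }
}</code></pre>
    <p>With both interfaces hardened this way, a corrupted file would cost us, at worst, some classification accuracy, rather than causing dropped requests across the network.</p>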
    <div>
      <h2>How will we solve emergencies faster?</h2>
      <a href="#how-will-we-solve-emergencies-faster">
        
      </a>
    </div>
    <p>During the incidents, it took us too long to resolve the problem. In both cases, this was worsened by our security systems preventing team members from accessing the tools they needed to fix the problem, and in some cases, circular dependencies slowed us down as some internal systems also became unavailable.</p><p>As a security company, we keep all our tools behind authentication layers with fine-grained access controls to keep customer data safe and prevent unauthorized access. This is the right thing to do, but at the same time, our current processes and systems slowed us down when speed was a top priority.</p><p>Circular dependencies also affected our customer experience. For example, during the November 18 incident, Turnstile, our CAPTCHA alternative, became unavailable. Because we use Turnstile in the login flow to the Cloudflare dashboard, customers who did not have active sessions or API service tokens were unable to log in to Cloudflare, at the moment of greatest need, to make critical changes.</p><p>Our team will be reviewing and improving all of our break glass procedures and technology to ensure that, when necessary, we can access the right tools as fast as possible while maintaining our security requirements. This includes reviewing and removing circular dependencies, or being able to “bypass” them quickly during an incident. We will also increase the frequency of our training exercises, so that processes are well understood by all teams before any potential disaster scenario in the future.</p>
    <div>
      <h2>When will we be done?</h2>
      <a href="#when-will-we-be-done">
        
      </a>
    </div>
    <p>While we haven’t captured in this post all the work being undertaken internally, the workstreams detailed above describe the top priorities the teams are being asked to focus on. Each of these workstreams maps to a detailed plan touching nearly every product and engineering team at Cloudflare. We have a lot of work to do.</p><p>By the end of Q1, and largely before then, we will:</p><ul><li><p>Ensure all production systems are covered by Health Mediated Deployments (HMD) for configuration management.</p></li><li><p>Update our systems to adhere to proper failure modes, as appropriate for each product set.</p></li><li><p>Ensure we have processes in place so the right people have the right access to provide proper remediation during an emergency.</p></li></ul><p>Some of these goals will be evergreen. We will always need to handle circular dependencies better as we launch new software, and our break glass procedures will need to be updated to reflect how our security technology changes over time.</p><p>We failed our users and the Internet as a whole in these past two incidents. We have work to do to make it right. We plan to share updates as this work proceeds, and we appreciate the questions and feedback we have received from our customers and partners.</p> ]]></content:encoded>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[Code Orange]]></category>
            <guid isPermaLink="false">DMVZ2E5NT13VbQvP1hUNj</guid>
            <dc:creator>Dane Knecht</dc:creator>
        </item>
    </channel>
</rss>