
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Mon, 13 Apr 2026 18:51:20 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Cloudflare outage on February 20, 2026]]></title>
            <link>https://blog.cloudflare.com/cloudflare-outage-february-20-2026/</link>
            <pubDate>Sat, 21 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare suffered a service outage on February 20, 2026. A subset of customers who use Cloudflare’s Bring Your Own IP (BYOIP) service saw their routes to the Internet withdrawn via Border Gateway Protocol (BGP). ]]></description>
            <content:encoded><![CDATA[ <p>On February 20, 2026, at 17:48 UTC, Cloudflare experienced a service outage when a subset of customers who use Cloudflare’s Bring Your Own IP (BYOIP) service saw their routes to the Internet withdrawn via Border Gateway Protocol (BGP).</p><p>The issue was not caused, directly or indirectly, by a cyberattack or malicious activity of any kind. This issue was caused by a change that Cloudflare made to how our network manages IP addresses onboarded through the BYOIP pipeline. This change caused Cloudflare to unintentionally withdraw customer prefixes.</p><p>For some BYOIP customers, this resulted in their services and applications being unreachable from the Internet, causing timeouts and failures to connect across their Cloudflare deployments that used BYOIP. The website for Cloudflare’s recursive DNS resolver (1.1.1.1) also returned 403 errors. The total duration of the incident was 6 hours and 7 minutes, with most of that time spent restoring prefix configurations to their state prior to the change.</p><p>When we began to observe failures, Cloudflare engineers reverted the change and prefixes stopped being withdrawn. However, before engineers were able to revert the change, ~1,100 BYOIP prefixes were withdrawn from the Cloudflare network. Some customers were able to restore their own service by using the Cloudflare dashboard to re-advertise their IP addresses. We resolved the incident when we restored all prefix configurations.</p><p>We are sorry for the impact to our customers. We let you down today. This post is an in-depth recounting of exactly what happened and which systems and processes failed. We will also outline the steps we are taking to prevent outages like this from happening again.</p>
    <div>
      <h2>How did the outage impact customers?</h2>
      <a href="#how-did-the-outage-impact-customers">
        
      </a>
    </div>
    <p>This graph shows the number of prefixes advertised by Cloudflare to a BGP neighbor during the incident, which correlates with impact because prefixes that were not advertised were unreachable on the Internet:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5QnazHN20Gcf3vLH5r95Cd/c8f42e90f266dd3daeaa308945507024/BLOG-3193_2.png" />
          </figure><p>Out of the total 6,500 prefixes advertised to this peer, 4,306 were BYOIP prefixes. These BYOIP prefixes are advertised to every peer and represent all the BYOIP prefixes we advertise globally.</p><p>During the incident, 1,100 prefixes out of the total 6,500 were withdrawn from 17:56 to 18:46 UTC: roughly 25% of the 4,306 BYOIP prefixes were unintentionally withdrawn. We detected impact on one.one.one.one and reverted the change before more prefixes were affected. At 19:19 UTC, we published guidance to customers that they would be able to self-remediate this incident by going to the Cloudflare dashboard and re-advertising their prefixes.</p><p>Cloudflare was able to revert many of the advertisement changes around 20:20 UTC, which restored about 800 prefixes. Roughly 300 prefixes still could not be remediated through the dashboard because the service configurations for those prefixes were removed from the edge due to a software bug. These prefixes were manually restored by Cloudflare engineers at 23:03 UTC.</p><p>This incident did not impact all BYOIP customers because the configuration change was applied iteratively rather than instantaneously across all BYOIP customers. Once the configuration change was found to be causing impact, it was reverted before all customers were affected.</p><p>The impacted BYOIP customers first experienced a behavior called <a href="https://blog.cloudflare.com/going-bgp-zombie-hunting/"><u>BGP Path Hunting</u></a>. In this state, end user connections traverse networks trying to find a route to the destination IP. This behavior persists until the connection that was opened times out and fails, and customers will continue to see this failure mode until the prefix is advertised somewhere. This loop-until-failure scenario affected any product that uses BYOIP for advertisement to the Internet. Additionally, visitors to one.one.one.one, the website for Cloudflare’s recursive DNS resolver, were met with HTTP 403 errors and an “Edge IP Restricted” error message. DNS resolution over the 1.1.1.1 Public Resolver, including DNS over HTTPS, was not affected. A full breakdown of the services impacted is below.</p>
<div><table><colgroup>
<col></col>
<col></col>
</colgroup>
<thead>
  <tr>
    <th><span>Service/Product</span></th>
    <th><span>Impact Description</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Core CDN and Security Services</span></td>
    <td><span>Traffic was not attracted to Cloudflare, and users connecting to websites advertised on those ranges would have seen failures to connect</span></td>
  </tr>
  <tr>
    <td><span>Spectrum</span></td>
    <td><span>Spectrum apps on BYOIP failed to proxy traffic due to traffic not being attracted to Cloudflare</span></td>
  </tr>
  <tr>
    <td><span>Dedicated Egress</span></td>
    <td><span>Customers who used Gateway Dedicated Egress leveraging BYOIP or Dedicated IPs for CDN Egress leveraging BYOIP would not have been able to send traffic out to their destinations</span></td>
  </tr>
  <tr>
    <td><span>Magic Transit</span></td>
    <td><span>End users connecting to applications protected by Magic Transit would not have been advertised on the Internet, and would have seen connection timeouts and failures</span></td>
  </tr>
</tbody></table></div><p>There was also a set of customers who were unable to restore service by toggling the prefixes on the Cloudflare dashboard. As engineers began re-announcing prefixes to restore service, these customers may have seen increased latency and failures even though their IP addresses were advertised. This was because the addressing settings for some users were removed from edge servers due to an issue in our own software, and the state had to be propagated back to the edge. </p><p>We’re going to get into what exactly broke in our addressing system, but first we need a quick primer on the Addressing API, which is the underlying source of truth for customer IP addresses at Cloudflare.</p>
    <div>
      <h2>Cloudflare’s Addressing API</h2>
      <a href="#cloudflares-addressing-api">
        
      </a>
    </div>
    <p>The Addressing API is an authoritative dataset of the addresses present on the Cloudflare network. Any change to that dataset is immediately reflected in Cloudflare's global network. While we are in the process of improving how these systems roll out changes as a part of <a href="https://blog.cloudflare.com/fail-small-resilience-plan/"><u>Code Orange: Fail Small</u></a>, today customers configure their IP addresses by interacting with public-facing APIs, which update a set of databases that trigger operational workflows propagating the changes to Cloudflare’s edge. This means that changes to the Addressing API are immediately propagated to the Cloudflare edge.</p><p>Advertising and configuring IP addresses on Cloudflare involves several steps:</p><ul><li><p>Customers signal to Cloudflare about advertisement/withdrawal of IP addresses via the Addressing API or BGP Control (a sketch of this call follows below)</p></li><li><p>The Addressing API instructs the machines to change the prefix advertisements</p></li><li><p>BGP will be updated on the routers once enough machines have received the notification to update the prefix</p></li><li><p>Finally, customers can configure Cloudflare products to use BYOIP addresses via <a href="https://developers.cloudflare.com/byoip/service-bindings/"><u>service bindings</u></a> which will assign products to these ranges</p></li></ul><p>The Addressing API allows us to automate most of the processes surrounding how we advertise or withdraw addresses, but some processes still require manual actions. These manual processes are risky because of their close proximity to Production. As a part of <a href="https://blog.cloudflare.com/fail-small-resilience-plan/"><u>Code Orange: Fail Small</u></a>, one of the goals of remediation was to remove manual actions taken in the Addressing API and replace them with safe workflows.</p>
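    <p>To make the signaling step concrete, here is a minimal Go sketch of what re-advertising a prefix through the Addressing API looks like. The endpoint path and payload shape are assumptions modeled on Cloudflare’s public API documentation, not taken from this post, and the identifiers are placeholders:</p>
            <pre><code>package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Hypothetical example: re-advertise a BYOIP prefix, the same action the
	// dashboard toggle performs. Account and prefix IDs come from the caller.
	accountID := os.Getenv("CF_ACCOUNT_ID")
	prefixID := os.Getenv("CF_PREFIX_ID")
	url := fmt.Sprintf(
		"https://api.cloudflare.com/client/v4/accounts/%s/addressing/prefixes/%s/bgp/status",
		accountID, prefixID)

	// Setting "advertised" to false would signal a withdrawal instead.
	body := bytes.NewBufferString(`{"advertised": true}`)
	req, err := http.NewRequest(http.MethodPatch, url, body)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("CF_API_TOKEN"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
</code></pre>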
    <div>
      <h2>How did the incident occur?</h2>
      <a href="#how-did-the-incident-occur">
        
      </a>
    </div>
    <p>The specific piece of configuration that broke was a modification attempting to automate the customer action of removing prefixes from Cloudflare’s BYOIP service, a regular customer request that is handled manually today. Removing this manual process was part of our Code Orange: Fail Small work to push all changes toward safe, automated, health-mediated deployment. Since the list of objects related to a BYOIP prefix can be large, this was implemented as part of a regularly running sub-task that checks for BYOIP prefixes that should be removed, and then removes them. Unfortunately, the API query made by this cleanup sub-task contained a bug.</p><p>Here is the API query from the cleanup sub-task:</p>
            <pre><code> resp, err := d.doRequest(ctx, http.MethodGet, `/v1/prefixes?pending_delete`, nil)
</code></pre>
            <p>And here is the relevant part of the API implementation:</p>
            <pre><code>	if v := req.URL.Query().Get("pending_delete"); v != "" {
		// ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
		prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
		if err != nil {
			api.RenderError(ctx, w, ErrInternalError)
			return
		}

		api.Render(ctx, w, http.StatusOK, renderIPPrefixAPIResponse(prefixes, nil))
		return
	}
</code></pre>
            <p>Because the client passes pending_delete with no value, the result of Query().Get("pending_delete") here is an empty string (""). The check <code>v != ""</code> therefore fails, and the API server falls through to its default behavior, returning all BYOIP prefixes instead of just those that were supposed to be removed. The sub-task interpreted every returned prefix as queued for deletion, and began systematically deleting all BYOIP prefixes and all of their related dependent objects, including <a href="https://developers.cloudflare.com/byoip/service-bindings/"><u>service bindings</u></a>, until the impact was noticed and an engineer identified the sub-task and shut it down.</p>
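            <p>For illustration, here is a minimal, self-contained Go snippet (not Cloudflare’s code) showing why a value-less query parameter slips through a <code>Get</code>-based check, and how <code>url.Values.Has</code> (available since Go 1.17) distinguishes presence from value:</p>
            <pre><code>package main

import (
	"fmt"
	"net/url"
)

func main() {
	// A query string where the parameter is present but has no value,
	// exactly like the cleanup sub-task’s request: /v1/prefixes?pending_delete
	q, err := url.ParseQuery("pending_delete")
	if err != nil {
		panic(err)
	}

	// Get returns "" both when the key is absent and when it has no value,
	// so a server-side check like `if v != ""` silently falls through.
	fmt.Printf("Get: %q\n", q.Get("pending_delete")) // Get: ""

	// Has reports whether the key is present at all, which is the signal
	// the handler actually needed here.
	fmt.Printf("Has: %v\n", q.Has("pending_delete")) // Has: true
}
</code></pre>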
    <div>
      <h3>Why did Cloudflare not catch the bug in our staging environment or testing?</h3>
      <a href="#why-did-cloudflare-not-catch-the-bug-in-our-staging-environment-or-testing">
        
      </a>
    </div>
    <p>Our staging environment contains data that matches Production as closely as possible, but it was not sufficient in this case, and neither was the mock data we relied on to simulate what would occur. </p><p>In addition, while we have tests for this functionality, coverage for this scenario in our testing process and environment was incomplete. Initial testing and code review focused on the BYOIP self-service API journey and completed successfully. While our engineers tested the exact process a customer would have followed, testing did not cover a scenario where the task-runner service would independently execute changes to user data without explicit input.</p>
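    <p>As a sketch of the missing coverage, the following self-contained Go test asserts the stricter behavior we now want: a bare pending_delete parameter is rejected rather than treated as a request for every prefix. The handler is a hypothetical stand-in, not our actual Addressing API code:</p>
            <pre><code>package prefixes

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// prefixHandler is a hypothetical stand-in for the real handler. It treats a
// pending_delete parameter with no explicit value as a client error instead
// of silently falling through to "return everything".
func prefixHandler(w http.ResponseWriter, r *http.Request) {
	q := r.URL.Query()
	if q.Has("pending_delete") &amp;&amp; q.Get("pending_delete") == "" {
		http.Error(w, "pending_delete requires an explicit value", http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func TestBarePendingDeleteIsRejected(t *testing.T) {
	req := httptest.NewRequest(http.MethodGet, "/v1/prefixes?pending_delete", nil)
	rec := httptest.NewRecorder()
	prefixHandler(rec, req)
	if rec.Code != http.StatusBadRequest {
		t.Fatalf("bare pending_delete: got status %d, want %d", rec.Code, http.StatusBadRequest)
	}
}
</code></pre>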
    <div>
      <h3>Why was recovery not immediate?</h3>
      <a href="#why-was-recovery-not-immediate">
        
      </a>
    </div>
    <p>Affected BYOIP prefixes were not all impacted in the same way, necessitating more intensive data recovery steps. As a part of Code Orange: Fail Small, we are building a system where operational state snapshots can be safely rolled out through health-mediated deployments. In the event something does roll out that causes unexpected behavior, it can be very quickly rolled back to a known-good state. However, that system is not in Production today.</p><p>BYOIP prefixes were in different states of impact during this incident, and each of these different states required different actions:</p><ul><li><p>Most impacted customers only had their prefixes withdrawn. Customers in this configuration could go into the dashboard and toggle their advertisements, which would restore service. </p></li><li><p>Some customers had their prefixes withdrawn and some bindings removed. These customers were in a partial state of recovery where they could toggle some prefixes but not others.</p></li><li><p>Some customers had their prefixes withdrawn and all service bindings removed. They could not toggle their prefixes in the dashboard because there was no <a href="https://developers.cloudflare.com/byoip/service-bindings/"><u>service</u></a> (Magic Transit, Spectrum, CDN) bound to them. These customers took the longest to mitigate, as a global configuration update had to be initiated to reapply the service bindings for all these customers to every single machine on Cloudflare’s edge.</p></li></ul>
    <div>
      <h3>How does this incident relate to Code Orange: Fail Small?</h3>
      <a href="#how-does-this-incident-relate-to-code-orange-fail-small">
        
      </a>
    </div>
    <p>The change we were making when this incident occurred is part of the Code Orange: Fail Small initiative, which is aimed at improving the resiliency of code and configuration at Cloudflare. As a brief primer on the <a href="https://blog.cloudflare.com/fail-small-resilience-plan/"><u>Code Orange: Fail Small</u></a> initiative, the work can be divided into three buckets:</p><ul><li><p>Require controlled rollouts for any configuration change that is propagated to the network, just like we do today for software binary releases.</p></li><li><p>Change our internal “break glass” procedures and remove any circular dependencies so that we, and our customers, can act fast and access all systems without issue during an incident.</p></li><li><p>Review, improve, and test failure modes of all systems handling network traffic to ensure they exhibit well-defined behavior under all conditions, including unexpected error states.</p></li></ul><p>The change that we attempted to deploy falls under the first bucket. By moving risky, manual changes to safe, automated configuration updates that are deployed in a health-mediated manner, we aim to improve the reliability of the service.</p><p>Critical work was already underway to enhance the Addressing API's configuration change support through staged test mediation and better correctness checks; this work proceeded in parallel with the deployed change. Although these preventative measures weren't fully deployed before the outage, teams were actively working on them when the incident occurred. Following our Code Orange: Fail Small promise to require controlled rollouts of any change into Production, our engineering teams have been reaching deep into all layers of our stack to identify and fix problematic findings. While this outage wasn't itself global, the blast radius and impact were unacceptably large, further reinforcing Code Orange: Fail Small as a priority until we have re-established confidence in all changes to our network being as gradual as possible. Now let’s talk more specifically about improvements to these systems.</p>
    <div>
      <h2>Remediation and follow-up steps</h2>
      <a href="#remediation-and-follow-up-steps">
        
      </a>
    </div>
    
    <div>
      <h3>API schema standardization</h3>
      <a href="#api-schema-standardization">
        
      </a>
    </div>
    <p>One of the issues in this incident is that the pending_delete flag was interpreted as a bare string, making it difficult for both client and server to agree on the meaning of the flag. We will improve the API schema to ensure better standardization, which will make it much easier for tests and systems to validate whether an API call is well-formed. This work is part of the third Code Orange workstream, which aims to create well-defined behavior under all conditions.</p>
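    <p>A minimal sketch of what schema-enforced parsing could look like in Go (the helper name and error handling are ours, for illustration only): the flag must be an explicit boolean, and anything else, including a bare <code>?pending_delete</code>, is rejected rather than silently falling through:</p>
            <pre><code>package prefixes

import (
	"fmt"
	"net/url"
	"strconv"
)

// parsePendingDelete is a hypothetical helper: it reports whether the flag
// was supplied and its parsed value, and fails loudly on malformed input,
// including a parameter that is present but has no value.
func parsePendingDelete(q url.Values) (set bool, value bool, err error) {
	if !q.Has("pending_delete") {
		return false, false, nil
	}
	v, err := strconv.ParseBool(q.Get("pending_delete"))
	if err != nil {
		return true, false, fmt.Errorf("pending_delete must be true or false, got %q", q.Get("pending_delete"))
	}
	return true, v, nil
}
</code></pre>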
    <div>
      <h3>Better separation between operational and configured state</h3>
      <a href="#better-separation-between-operational-and-configured-state">
        
      </a>
    </div>
    <p>Today, customers make changes to the addressing schema that are persisted in an authoritative database, and that database is the same one used for operational actions. This makes manual rollback processes more challenging because engineers need to utilize database snapshots instead of reconciling desired and actual states. We will redesign the rollback mechanism and database configuration to ensure that we have an easy way to roll back changes quickly, and also to introduce layers between customer configuration and Production.</p><p>We will snapshot the data that we read from the database and apply to Production, and apply those snapshots in the same way that we deploy all our other Production changes, mediated by health metrics that can automatically stop the deployment if things are going wrong. This means that the next time we have a problem where the database gets changed into a bad state, we can near-instantly revert individual customers (or all customers) to a version that was working.</p><p>While this will temporarily block our customers from being able to make direct updates via our API in the event of an outage, it will mean that we can continue serving their traffic while we work to fix the database, instead of being down for that time. This work aligns with the first and second Code Orange workstreams, which involve fast rollback and safe, health-mediated deployment of configuration.</p>
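    <p>In pseudocode terms, the rollout loop we are describing looks something like the following Go sketch. Everything here (types, health signal, apply step) is illustrative of the design, not Cloudflare’s actual implementation:</p>
            <pre><code>package main

import "fmt"

// snapshot is a hypothetical versioned copy of the configuration read from
// the authoritative database.
type snapshot struct {
	version int
	config  map[string]bool // prefix -&gt; advertised
}

// deploy applies snapshots in order, gated by a health check, and reverts to
// the last known-good snapshot if health degrades mid-rollout.
func deploy(snaps []snapshot, apply func(snapshot), healthy func() bool) {
	lastGood := -1
	for i, s := range snaps {
		apply(s)
		if !healthy() {
			if lastGood &gt;= 0 {
				apply(snaps[lastGood]) // near-instant revert to a working version
			}
			return
		}
		lastGood = i
	}
}

func main() {
	snaps := []snapshot{
		{version: 1, config: map[string]bool{"192.0.2.0/24": true}},
		{version: 2, config: map[string]bool{"192.0.2.0/24": false}}, // the bad change
	}
	applied := 0
	deploy(snaps,
		func(s snapshot) { applied = s.version; fmt.Println("applying version", s.version) },
		func() bool { return applied &lt; 2 }, // health degrades once version 2 lands
	)
	// Output: version 1 applies cleanly, version 2 trips the health check,
	// and the loop re-applies version 1.
}
</code></pre>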
    <div>
      <h3>Better arbitration of large withdrawal actions</h3>
      <a href="#better-arbitrate-large-withdrawal-actions">
        
      </a>
    </div>
    <p>We will improve our monitoring to detect when changes are happening too fast or too broadly, such as withdrawing or deleting BGP prefixes quickly, and disable the deployment of snapshots when this happens. This will form a type of circuit breaker that stops any out-of-control process manipulating the database from having a large blast radius, like we saw in this incident.</p><p>We also have ongoing work to directly monitor that the services run by our customers are behaving correctly, and those signals can also be used to trip the circuit breaker and stop potentially dangerous changes from being applied until we have had time to investigate. This work aligns with the first Code Orange workstream, which involves safe deployment of changes.</p><p>Below is the timeline of events, from deployment of the change through remediation:</p>
<div><table><colgroup>
<col></col>
<col></col>
<col></col>
</colgroup>
<thead>
  <tr>
    <th><span>Time (UTC)</span></th>
    <th><span>Status</span></th>
    <th><span>Description</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>2026-02-05 21:53</span></td>
    <td><span>Code merged into system</span></td>
    <td><span>Broken sub-process merged into code base</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 17:46</span></td>
    <td><span>Code deployed into system</span></td>
    <td><span>Address API release with broken sub-process completes</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 17:56</span></td>
    <td><span>Impact Start</span></td>
    <td><span>Broken sub-process begins executing. Prefix advertisement updates begin propagating and prefixes begin to be withdrawn </span><span>– IMPACT STARTS – </span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 18:13</span></td>
    <td><span>Cloudflare engaged</span></td>
    <td><span>Cloudflare engaged for failures on one.one.one.one</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 18:18</span></td>
    <td><span>Internal incident declared</span></td>
    <td><span>Cloudflare engineers continue investigating impact</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 18:21</span></td>
    <td><span>Addressing API team paged</span></td>
    <td><span>Engineering team responsible for Addressing API engaged and debugging begins</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 18:46</span></td>
    <td><span>Issue identified</span></td>
    <td><span>Broken sub-process terminated by an engineer and regular execution disabled; remediation begins</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 19:11</span></td>
    <td><span>Mitigation begins</span></td>
    <td><span>Cloudflare engineers begin to restore service for prefixes that were withdrawn, while others focus on prefixes that were removed</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 19:19</span></td>
    <td><span>Some prefixes mitigated</span></td>
    <td><span>Customers begin to re-advertise their prefixes via the dashboard to restore service. </span><span>– IMPACT DOWNGRADE –</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 19:44</span></td>
    <td><span>Additional mitigation continues</span></td>
    <td><span>Engineers begin database recovery methods for removed prefixes</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 20:30</span></td>
    <td><span>Final mitigation process begins</span></td>
    <td><span>Engineers complete release to restore withdrawn prefixes that still have existing service bindings. Others are still working on removed prefixes </span><span>– IMPACT DOWNGRADE –</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 21:08</span></td>
    <td><span>Configuration update deploys</span></td>
    <td><span>Engineering begins global machine configuration rollout to restore prefixes that were not self-mitigated or mitigated via previous efforts </span><span>– IMPACT DOWNGRADE –</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 23:03</span></td>
    <td><span>Configuration update completed</span></td>
    <td><span>Global machine configuration deployment to restore remaining prefixes is completed. </span><span>– IMPACT ENDS –</span></td>
  </tr>
</tbody></table></div><p>We deeply apologize for this incident and how it affected the service we provide our customers, and the Internet at large. We aim to provide a network that is resilient to change, and we did not deliver on that promise to you. We are actively making the improvements described above to ensure stability moving forward and to prevent this problem from happening again.</p> ]]></content:encoded>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[Incident Response]]></category>
            <category><![CDATA[Outage]]></category>
            <guid isPermaLink="false">6apSdbZfHEgeIzBwCqn5ob</guid>
            <dc:creator>David Tuber</dc:creator>
            <dc:creator>Dzevad Trumic</dc:creator>
        </item>
        <item>
            <title><![CDATA[Route leak incident on January 22, 2026]]></title>
            <link>https://blog.cloudflare.com/route-leak-incident-january-22-2026/</link>
            <pubDate>Fri, 23 Jan 2026 14:00:00 GMT</pubDate>
            <description><![CDATA[ An automated routing policy configuration error caused us to leak some Border Gateway Protocol prefixes unintentionally from a router at our Miami data center. We discuss the impact and the changes we are implementing as a result. ]]></description>
            <content:encoded><![CDATA[ <p>On January 22, 2026, an automated routing policy configuration error caused us to leak some <a href="http://cloudflare.com/learning/security/glossary/what-is-bgp/"><u>Border Gateway Protocol (BGP)</u></a> prefixes unintentionally from a router at our data center in Miami, Florida. While the route leak caused some impact to Cloudflare customers, multiple external parties were also affected because their traffic was accidentally funneled through our Miami data center location.</p><p>The route leak lasted 25 minutes, causing congestion on some of our backbone infrastructure in Miami, elevated loss for some Cloudflare customer traffic, and higher latency for traffic across these links. Additionally, some traffic was discarded by firewall filters on our routers that are designed to only accept traffic for Cloudflare services and our customers.</p><p>While we’ve written about route leaks before, we rarely find ourselves causing them. This route leak was the result of an accidental misconfiguration on a router in Cloudflare’s network, and only affected IPv6 traffic. We sincerely apologize to the users, customers, and networks we impacted yesterday as a result of this BGP route leak.</p>
    <div>
      <h3>BGP route leaks </h3>
      <a href="#bgp-route-leaks">
        
      </a>
    </div>
    <p>We have <a href="https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/"><u>written multiple times</u></a> about <a href="https://blog.cloudflare.com/cloudflare-1111-incident-on-june-27-2024/"><u>BGP route leaks</u></a>, and we even record <a href="https://blog.cloudflare.com/route-leak-detection-with-cloudflare-radar/"><u>route leak events</u></a> on Cloudflare Radar for anyone to view and learn from. To get a fuller understanding of what route leaks are, you can refer to this <a href="https://blog.cloudflare.com/bgp-route-leak-venezuela/#background-bgp-route-leaks"><u>detailed background section</u></a>, or to the formal definition within <a href="https://datatracker.ietf.org/doc/html/rfc7908"><u>RFC7908</u></a>. </p><p>Essentially, a route leak occurs when a network tells the broader Internet to send it traffic that it’s not supposed to forward. Technically, a route leak occurs when a network, or Autonomous System (AS), appears unexpectedly in an AS path. An AS path is what BGP uses to determine the path across the Internet to a final destination. An example of an anomalous AS path indicative of a route leak would be a network sending routes received from a peer to a provider.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/30z4NDtf6DjVQZOfZatGUX/6ff06eb02c61d8e818d9da8ecd87c1c8/BLOG-3135_2.png" />
          </figure><p>During this type of route leak, the rules of <a href="https://people.eecs.berkeley.edu/~sylvia/cs268-2019/papers/gao-rexford.pdf"><u>valley-free routing</u></a> are violated, as BGP updates are sent from AS64501 to their peer (AS64502), and then unexpectedly up to a provider (AS64503). Oftentimes the leaker, in this case AS64502, is not prepared to handle the amount of traffic they are going to receive and may not even have firewall filters configured to accept all of the traffic coming in their direction. In simple terms, once a route update is sent to a peer or provider, it should only be sent further to customers and not to another peer or provider AS.</p><p>During the incident on January 22, we caused a similar kind of route leak, in which we took routes from some of our peers and redistributed them in Miami to some of our peers and providers. According to the route leak definitions in RFC7908, we caused a mixture of Type 3 and Type 4 route leaks on the Internet. </p>
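    <p>The valley-free export rule described above fits in a few lines of code. The following Go sketch is ours, for illustration only; the type and function names are not from any real BGP implementation:</p>
            <pre><code>package main

import "fmt"

// rel describes how the local AS learned a route, or to whom it would
// export it, in business-relationship terms.
type rel int

const (
	customer rel = iota // a downstream network that pays us for transit
	peer                // a settlement-free lateral neighbor
	provider            // an upstream network we pay for transit
)

// mayExport encodes valley-free routing: routes learned from customers may
// go anywhere, but routes learned from peers or providers may only be
// exported to customers.
func mayExport(learnedFrom, exportTo rel) bool {
	if learnedFrom == customer {
		return true
	}
	return exportTo == customer
}

func main() {
	// The pattern behind this incident: a route learned from a peer being
	// exported to a provider violates the rule.
	fmt.Println(mayExport(peer, provider)) // false
	fmt.Println(mayExport(customer, peer)) // true
}
</code></pre>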
    <div>
      <h3>Timeline</h3>
      <a href="#timeline">
        
      </a>
    </div>
    <table><tr><th><p><b>Time (UTC)</b></p></th><th><p><b>Event</b></p></th></tr><tr><td><p>2026-01-22 19:52 UTC</p></td><td><p>A change that ultimately triggers the routing policy bug is merged in our network automation code repository</p></td></tr><tr><td><p>2026-01-22 20:25 UTC</p></td><td><p>Automation is run on single Miami edge-router resulting in unexpected advertisements to BGP transit providers and peers</p><p><b>IMPACT START</b></p></td></tr><tr><td><p>2026-01-22 20:40 UTC</p></td><td><p>Network team begins investigating unintended route advertisements from Miami</p></td></tr><tr><td><p>2026-01-22 20:44 UTC</p></td><td><p>Incident is raised to coordinate response</p></td></tr><tr><td><p>2026-01-22 20:50 UTC</p></td><td><p>The bad configuration change is manually reverted by a network operator, and automation is paused for the router, so it cannot run again</p><p><b>IMPACT STOP</b></p></td></tr><tr><td><p>2026-01-22 21:47 UTC</p></td><td><p>The change that triggered the leak is reverted from our code repository</p></td></tr><tr><td><p>2026-01-22 22:07 UTC</p></td><td><p>Automation is confirmed by operators to be healthy to run again on the Miami router, without the routing policy bug</p></td></tr><tr><td><p>2026-01-22 22:40 UTC</p></td><td><p>Automation is unpaused on the single router in Miami</p></td></tr></table>
    <div>
      <h3>What happened: the configuration error</h3>
      <a href="#what-happened-the-configuration-error">
        
      </a>
    </div>
    <p>On January 22, 2026, at 20:25 UTC, we pushed a change via our policy automation platform to remove the BGP announcements from Miami for one of our data centers in Bogotá, Colombia. This was purposeful, as we previously forwarded some IPv6 traffic through Miami toward the Bogotá data center, but recent infrastructure upgrades removed the need for us to do so.</p><p>This change generated the following diff (the output of a program that compares configuration files to determine how or whether they differ):</p>
            <pre><code>[edit policy-options policy-statement 6-COGENT-ACCEPT-EXPORT term ADV-SITELOCAL-GRE-RECEIVER from]
-      prefix-list 6-BOG04-SITE-LOCAL;
[edit policy-options policy-statement 6-COMCAST-ACCEPT-EXPORT term ADV-SITELOCAL-GRE-RECEIVER from]
-      prefix-list 6-BOG04-SITE-LOCAL;
[edit policy-options policy-statement 6-GTT-ACCEPT-EXPORT term ADV-SITELOCAL-GRE-RECEIVER from]
-      prefix-list 6-BOG04-SITE-LOCAL;
[edit policy-options policy-statement 6-LEVEL3-ACCEPT-EXPORT term ADV-SITELOCAL-GRE-RECEIVER from]
-      prefix-list 6-BOG04-SITE-LOCAL;
[edit policy-options policy-statement 6-PRIVATE-PEER-ANYCAST-OUT term ADV-SITELOCAL from]
-      prefix-list 6-BOG04-SITE-LOCAL;
[edit policy-options policy-statement 6-PUBLIC-PEER-ANYCAST-OUT term ADV-SITELOCAL from]
-      prefix-list 6-BOG04-SITE-LOCAL;
[edit policy-options policy-statement 6-PUBLIC-PEER-OUT term ADV-SITELOCAL from]
-      prefix-list 6-BOG04-SITE-LOCAL;
[edit policy-options policy-statement 6-TELEFONICA-ACCEPT-EXPORT term ADV-SITELOCAL-GRE-RECEIVER from]
-      prefix-list 6-BOG04-SITE-LOCAL;
[edit policy-options policy-statement 6-TELIA-ACCEPT-EXPORT term ADV-SITELOCAL-GRE-RECEIVER from]
-      prefix-list 6-BOG04-SITE-LOCAL;</code></pre>
            <p>While this policy change looks innocent at a glance, only removing the prefix lists containing BOG04 unicast prefixes resulted in a policy that was too permissive:</p>
            <pre><code>policy-options policy-statement 6-TELIA-ACCEPT-EXPORT {
    term ADV-SITELOCAL-GRE-RECEIVER {
        from route-type internal;
        then {
            community add STATIC-ROUTE;
            community add SITE-LOCAL-ROUTE;
            community add MIA01;
            community add NORTH-AMERICA;
            accept;
        }
    }
}
</code></pre>
            <p>The policy would now mark every prefix of type “internal” as acceptable, and proceed to add some informative communities to all matching prefixes. More importantly, the policy also accepted the route through the policy filter, so prefixes that were intended to stay internal were advertised externally. This is an issue because the “route-type internal” match in JunOS or JunOS EVO (the operating systems used by <a href="https://www.hpe.com/us/en/home.html"><u>HPE Juniper Networks</u></a> devices) will match any non-external route type, such as Internal BGP (IBGP) routes, which is what happened here.</p><p>As a result, all IPv6 prefixes that Cloudflare redistributes internally across the backbone were accepted by this policy, and advertised to all our BGP neighbors in Miami. This is unfortunately very similar to the outage we experienced in 2020, which you can read more about <a href="https://blog.cloudflare.com/cloudflare-outage-on-july-17-2020/"><u>on our blog</u></a>.</p><p>When the policy misconfiguration was applied at 20:25 UTC, a series of unintended BGP updates were sent from AS13335 to peers and providers in Miami. These BGP updates are viewable historically by looking at MRT files with the <a href="https://github.com/bgpkit/monocle"><u>monocle</u></a> tool or using <a href="https://stat.ripe.net/bgplay/2a03%3A2880%3Af312%3A%3A%2F48#starttime=1769112000&amp;endtime=1769115659&amp;instant=56,1769113845"><u>RIPE BGPlay</u></a>. </p>
            <pre><code>➜  ~ monocle search --start-ts 2026-01-22T20:24:00Z --end-ts 2026-01-22T20:30:00Z --as-path ".*13335[ \d$]32934$*"
A|1769113609.854028|2801:14:9000::6:4112:1|64112|2a03:2880:f077::/48|64112 22850 174 3356 13335 32934|IGP|2801:14:9000::6:4112:1|0|0|22850:65151|false|||pit.scl
A|1769113609.854028|2801:14:9000::6:4112:1|64112|2a03:2880:f091::/48|64112 22850 174 3356 13335 32934|IGP|2801:14:9000::6:4112:1|0|0|22850:65151|false|||pit.scl
A|1769113609.854028|2801:14:9000::6:4112:1|64112|2a03:2880:f16f::/48|64112 22850 174 3356 13335 32934|IGP|2801:14:9000::6:4112:1|0|0|22850:65151|false|||pit.scl
A|1769113609.854028|2801:14:9000::6:4112:1|64112|2a03:2880:f17c::/48|64112 22850 174 3356 13335 32934|IGP|2801:14:9000::6:4112:1|0|0|22850:65151|false|||pit.scl
A|1769113609.854028|2801:14:9000::6:4112:1|64112|2a03:2880:f26f::/48|64112 22850 174 3356 13335 32934|IGP|2801:14:9000::6:4112:1|0|0|22850:65151|false|||pit.scl
A|1769113609.854028|2801:14:9000::6:4112:1|64112|2a03:2880:f27c::/48|64112 22850 174 3356 13335 32934|IGP|2801:14:9000::6:4112:1|0|0|22850:65151|false|||pit.scl
A|1769113609.854028|2801:14:9000::6:4112:1|64112|2a03:2880:f33f::/48|64112 22850 174 3356 13335 32934|IGP|2801:14:9000::6:4112:1|0|0|22850:65151|false|||pit.scl
A|1769113583.095278|2001:504:d::4:9544:1|49544|2a03:2880:f17c::/48|49544 1299 3356 13335 32934|IGP|2001:504:d::4:9544:1|0|0|1299:25000 1299:25800 49544:16000 49544:16106|false|||route-views.isc
A|1769113583.095278|2001:504:d::4:9544:1|49544|2a03:2880:f27c::/48|49544 1299 3356 13335 32934|IGP|2001:504:d::4:9544:1|0|0|1299:25000 1299:25800 49544:16000 49544:16106|false|||route-views.isc
A|1769113583.095278|2001:504:d::4:9544:1|49544|2a03:2880:f091::/48|49544 1299 3356 13335 32934|IGP|2001:504:d::4:9544:1|0|0|1299:25000 1299:25800 49544:16000 49544:16106|false|||route-views.isc
A|1769113584.324483|2001:504:d::19:9524:1|199524|2a03:2880:f091::/48|199524 1299 3356 13335 32934|IGP|2001:2035:0:2bfd::1|0|0||false|||route-views.isc
A|1769113584.324483|2001:504:d::19:9524:1|199524|2a03:2880:f17c::/48|199524 1299 3356 13335 32934|IGP|2001:2035:0:2bfd::1|0|0||false|||route-views.isc
A|1769113584.324483|2001:504:d::19:9524:1|199524|2a03:2880:f27c::/48|199524 1299 3356 13335 32934|IGP|2001:2035:0:2bfd::1|0|0||false|||route-views.isc
{trimmed}
</code></pre>
            <p><sup><i>In the monocle output seen above, we have the timestamp of our BGP update, followed by the next-hop in the announcement, the ASN of the network feeding a given route-collector, the prefix involved, and the AS path and BGP communities if any are found. At the end of the output per-line, we also find the route-collector instance.</i></sup></p><p>Looking at the first update for prefix 2a03:2880:f077::/48, the AS path is <i>64112 22850 174 3356 13335 32934</i>. This means we (AS13335) took the prefix received from Meta (AS32934), our peer, and then advertised it toward Lumen (AS3356), one of our upstream transit providers. We know this is a route leak as routes received from peers are only meant to be readvertised to downstream (customer) networks, not laterally to other peers or up to providers.</p><p>As a result of the leak and the forwarding of unintended traffic into our Miami router from providers and peers, we experienced congestion on our backbone between Miami and Atlanta, as you can see in the graph below. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/SIiBSb7qnfStZ0jQAZ8ne/14009779b7551e4f26c4cc3ae2c1141b/BLOG-3135_3.png" />
          </figure><p>This would have resulted in elevated loss for some Cloudflare customer traffic, and higher latency than usual for traffic traversing these links. In addition to this congestion, the networks whose prefixes we leaked would have had their traffic discarded by firewall filters on our routers that are designed to only accept traffic for Cloudflare services and our customers. At peak, we discarded around 12Gbps of traffic ingressing our router in Miami for these non-downstream prefixes. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/jMaMLbijdtS8GYZOVzwoX/40325907032cbb8d27f00bc561191d23/BLOG-3135_4.png" />
          </figure>
    <div>
      <h3>Follow-ups and preventing route leaks </h3>
      <a href="#follow-ups-and-preventing-route-leaks">
        
      </a>
    </div>
    <p>We are big supporters of, and active contributors to, efforts within the <a href="https://www.ietf.org/"><u>IETF</u></a> and <a href="https://manrs.org/"><u>infrastructure community</u></a> that strengthen routing security. We know firsthand how easy it is to cause a route leak accidentally, as evidenced by this incident. </p><p>Preventing route leaks will require a multi-faceted approach, but we have identified multiple areas in which we can improve, both short- and long-term.</p><p>In terms of our routing policy configurations and automation, we are:</p><ul><li><p>Patching the flaw in our routing policy automation that caused the route leak, which will immediately mitigate this failure mode and others like it </p></li><li><p>Implementing additional BGP community-based safeguards in our routing policies that explicitly reject routes that were received from providers and peers on external export policies </p></li><li><p>Adding automatic routing policy evaluation into our <a href="https://www.cloudflare.com/learning/serverless/glossary/what-is-ci-cd/">CI/CD pipelines</a> that looks specifically for empty or erroneous policy terms </p></li><li><p>Improving early detection of issues with network configurations and the negative effects of an automated change</p></li></ul><p>To help prevent route leaks in general, we are: </p><ul><li><p>Validating routing equipment vendors' implementation of <a href="https://datatracker.ietf.org/doc/rfc9234/"><u>RFC9234</u></a> (BGP roles and the Only-to-Customer Attribute) in preparation for our rollout of the feature, which is the only way <i>independent of routing policy</i> to prevent route leaks caused at the <i>local</i> Autonomous System (AS)</p></li><li><p>Encouraging the long-term adoption of RPKI <a href="https://datatracker.ietf.org/doc/draft-ietf-sidrops-aspa-verification/"><u>Autonomous System Provider Authorization (ASPA)</u></a>, where networks could automatically reject routes that contain anomalous AS paths</p></li></ul><p>Most importantly, we would again like to apologize for the impact we caused users and customers of Cloudflare, as well as any impact felt by external networks.</p> ]]></content:encoded>
            <category><![CDATA[BGP]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <guid isPermaLink="false">1lDFdmcpnlPwczsEbswsTs</guid>
            <dc:creator>Bryton Herdes</dc:creator>
            <dc:creator>Tom Strickx</dc:creator>
        </item>
        <item>
            <title><![CDATA[What came first: the CNAME or the A record?]]></title>
            <link>https://blog.cloudflare.com/cname-a-record-order-dns-standards/</link>
            <pubDate>Wed, 14 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[ A recent change to 1.1.1.1 accidentally altered the order of CNAME records in DNS responses, breaking resolution for some clients. This post explores the technical root cause, examines the source code of affected resolvers, and dives into the inherent ambiguities of the DNS RFCs.   ]]></description>
            <content:encoded><![CDATA[ <p>On January 8, 2026, a routine update to 1.1.1.1 aimed at reducing memory usage accidentally triggered a wave of DNS resolution failures for users across the Internet. The root cause wasn't an attack or an outage, but a subtle shift in the order of records within our DNS responses.  </p><p>While most modern software treats the order of records in DNS responses as irrelevant, we discovered that some implementations expect CNAME records to appear before everything else. When that order changed, resolution started failing. This post explores the code change that caused the shift, why it broke specific DNS clients, and the 40-year-old protocol ambiguity that makes the "correct" order of a DNS response difficult to define.</p>
    <div>
      <h2>Timeline</h2>
      <a href="#timeline">
        
      </a>
    </div>
    <p><i>All timestamps referenced are in Coordinated Universal Time (UTC).</i></p><table><tr><th><p><b>Time</b></p></th><th><p><b>Description</b></p></th></tr><tr><td><p>2025-12-02</p></td><td><p>The record reordering is introduced to the 1.1.1.1 codebase</p></td></tr><tr><td><p>2025-12-10</p></td><td><p>The change is released to our testing environment</p></td></tr><tr><td><p>2026-01-07 23:48</p></td><td><p>A global release containing the change starts</p></td></tr><tr><td><p>2026-01-08 17:40</p></td><td><p>The release reaches 90% of servers</p></td></tr><tr><td><p>2026-01-08 18:19</p></td><td><p>Incident is declared</p></td></tr><tr><td><p>2026-01-08 18:27</p></td><td><p>The release is reverted</p></td></tr><tr><td><p>2026-01-08 19:55</p></td><td><p>Revert is completed. Impact ends</p></td></tr></table>
    <div>
      <h2>What happened?</h2>
      <a href="#what-happened">
        
      </a>
    </div>
    <p>While making some improvements to lower the memory usage of our cache implementation, we introduced a subtle change to CNAME record ordering. The change was introduced on December 2, 2025, released to our testing environment on December 10, and began deployment on January 7, 2026.</p>
    <div>
      <h3>How DNS CNAME chains work</h3>
      <a href="#how-dns-cname-chains-work">
        
      </a>
    </div>
    <p>When you query for a domain like <code>www.example.com</code>, you might get a <a href="https://www.cloudflare.com/learning/dns/dns-records/dns-cname-record/"><u>CNAME (Canonical Name)</u></a> record that indicates one name is an alias for another name. It’s the job of public resolvers, such as <a href="https://www.cloudflare.com/learning/dns/what-is-1.1.1.1/"><u>1.1.1.1</u></a>, to follow this chain of aliases until it reaches a final response:</p><p><code>www.example.com → cdn.example.com → server.cdn-provider.com → 198.51.100.1</code></p><p>As 1.1.1.1 traverses this chain, it caches every intermediate record. Each record in the chain has its own <a href="https://www.cloudflare.com/learning/cdn/glossary/time-to-live-ttl/"><u>TTL (Time-To-Live)</u></a>, indicating how long we can cache it. Not all the TTLs in a CNAME chain need to be the same:</p><p><code>www.example.com → cdn.example.com (TTL: 3600 seconds) # Still cached
cdn.example.com → 198.51.100.1    (TTL: 300 seconds)  # Expired</code></p><p>When one or more records in a CNAME chain expire, it’s considered partially expired. Fortunately, since parts of the chain are still in our cache, we don’t have to resolve the entire CNAME chain again — only the part that has expired. In our example above, we would take the still valid <code>www.example.com → cdn.example.com</code> chain, and only resolve the expired <code>cdn.example.com</code> <a href="https://www.cloudflare.com/learning/dns/dns-records/dns-a-record/"><u>A record</u></a>. Once that’s done, we combine the existing CNAME chain and the newly resolved records into a single response.</p>
    <div>
      <h3>The logic change</h3>
      <a href="#the-logic-change">
        
      </a>
    </div>
    <p>The code that merges these two chains is where the change occurred. Previously, the code would create a new list, insert the existing CNAME chain, and then append the new records:</p>
            <pre><code>impl PartialChain {
    /// Merges records to the cache entry to make the cached records complete.
    pub fn fill_cache(&amp;self, entry: &amp;mut CacheEntry) {
        let mut answer_rrs = Vec::with_capacity(entry.answer.len() + self.records.len());
        answer_rrs.extend_from_slice(&amp;self.records); // CNAMEs first
        answer_rrs.extend_from_slice(&amp;entry.answer); // Then A/AAAA records
        entry.answer = answer_rrs;
    }
}
</code></pre>
            <p>However, to save some memory allocations and copies, the code was changed to instead append the CNAMEs to the existing answer list:</p>
            <pre><code>impl PartialChain {
    /// Merges records to the cache entry to make the cached records complete.
    pub fn fill_cache(&amp;self, entry: &amp;mut CacheEntry) {
        entry.answer.extend_from_slice(&amp;self.records); // CNAMEs last
    }
}
</code></pre>
            <p>As a result, the responses that 1.1.1.1 returned now sometimes had the CNAME records appearing at the bottom, after the final resolved answer.</p>
    <div>
      <h3>Why this caused impact</h3>
      <a href="#why-this-caused-impact">
        
      </a>
    </div>
    <p>When DNS clients receive a response with a CNAME chain in the answer section, they also need to follow this chain to find out that <code>www.example.com</code> points to <code>198.51.100.1</code>. Some DNS client implementations handle this by keeping track of the expected name for the records as they’re iterated sequentially. When a CNAME is encountered, the expected name is updated:</p>
            <pre><code>;; QUESTION SECTION:
;; www.example.com.        IN    A

;; ANSWER SECTION:
www.example.com.    3600   IN    CNAME  cdn.example.com.
cdn.example.com.    300    IN    A      198.51.100.1
</code></pre>
            <p></p><ol><li><p>Find records for <code>www.example.com</code></p></li><li><p>Encounter <code>www.example.com. CNAME cdn.example.com</code></p></li><li><p>Find records for <code>cdn.example.com</code></p></li><li><p>Encounter <code>cdn.example.com. A 198.51.100.1</code></p></li></ol><p>When the CNAME suddenly appears at the bottom, this no longer works:</p>
            <pre><code>;; QUESTION SECTION:
;; www.example.com.	       IN    A

;; ANSWER SECTION:
cdn.example.com.    300    IN    A      198.51.100.1
www.example.com.    3600   IN    CNAME  cdn.example.com.
</code></pre>
            <p></p><ol><li><p>Find records for <code>www.example.com</code></p></li><li><p>Ignore <code>cdn.example.com. A 198.51.100.1</code> as it doesn’t match the expected name</p></li><li><p>Encounter <code>www.example.com. CNAME cdn.example.com</code></p></li><li><p>Find records for <code>cdn.example.com</code></p></li><li><p>No more records are present, so the response is considered empty</p></li></ol><p>One such implementation that broke is the <a href="https://man7.org/linux/man-pages/man3/getaddrinfo.3.html"><code><u>getaddrinfo</u></code></a> function in glibc, which is commonly used on Linux for DNS resolution. When looking at its <code>getanswer_r</code> implementation, we can indeed see it expects to find the CNAME records before any answers:</p>
            <pre><code>for (; ancount &gt; 0; --ancount)
  {
    // ... parsing DNS records ...
    
    if (rr.rtype == T_CNAME)
      {
        /* Record the CNAME target as the new expected name. */
        int n = __ns_name_unpack (c.begin, c.end, rr.rdata,
                                  name_buffer, sizeof (name_buffer));
        expected_name = name_buffer;  // Update what we're looking for
      }
    else if (rr.rtype == qtype
             &amp;&amp; __ns_samebinaryname (rr.rname, expected_name)  // Must match!
             &amp;&amp; rr.rdlength == rrtype_to_rdata_length (qtype))
      {
        /* Address record matches - store it */
        ptrlist_add (addresses, (char *) alloc_buffer_next (abuf, uint32_t));
        alloc_buffer_copy_bytes (abuf, rr.rdata, rr.rdlength);
      }
  }
</code></pre>
            <p>Another notable affected implementation was the DNSC process in three models of Cisco Ethernet switches. Where switches had been configured to use 1.1.1.1, they experienced spontaneous reboot loops when they received a response containing the reordered CNAMEs. <a href="https://www.cisco.com/c/en/us/support/docs/smb/switches/Catalyst-switches/kmgmt3846-cbs-reboot-with-fatal-error-from-dnsc-process.html"><u>Cisco has published a service document describing the issue</u></a>.</p>
    <div>
      <h3>Not all implementations break</h3>
      <a href="#not-all-implementations-break">
        
      </a>
    </div>
    <p>Most DNS clients don’t have this issue. For example, <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd-resolved.service.html"><u>systemd-resolved</u></a> first parses the records into an ordered set:</p>
            <pre><code>typedef struct DnsAnswerItem {
        DnsResourceRecord *rr; // The actual record
        DnsAnswerFlags flags;  // Which section it came from
        // ... other metadata
} DnsAnswerItem;


typedef struct DnsAnswer {
        unsigned n_ref;
        OrderedSet *items;
} DnsAnswer;
</code></pre>
            <p>When following a CNAME chain, it can then search the entire answer set, even if the CNAME records don’t appear at the top.</p>
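            <p>To make the difference concrete, here is a small, order-insensitive lookup written in Go. This is our own simplified sketch of the idea, not systemd-resolved’s code; the types and names are invented for illustration:</p>
            <pre><code>package main

import "fmt"

// rr is a deliberately simplified resource record for this example.
type rr struct {
	name  string // owner name
	rtype string // "CNAME" or "A"
	data  string // CNAME target or address
}

// resolveChain follows a CNAME chain without assuming any record order:
// whenever the expected name yields a CNAME, it rescans the entire answer
// set from the start, so reordered responses still resolve.
func resolveChain(answers []rr, qname string) []string {
	expected := qname
	var addrs []string
	// Each pass either follows one CNAME or terminates, so len(answers)
	// passes are enough, and malformed CNAME loops cannot run forever.
	for pass := 0; pass &lt; len(answers); pass++ {
		followed := false
		for _, r := range answers {
			if r.name != expected {
				continue // not the name we are looking for yet
			}
			if r.rtype == "CNAME" {
				expected = r.data // follow the alias and rescan
				followed = true
				break
			}
			addrs = append(addrs, r.data)
		}
		if !followed {
			break
		}
	}
	return addrs
}

func main() {
	// The reordered response from the incident: the A record precedes the CNAME.
	answers := []rr{
		{name: "cdn.example.com.", rtype: "A", data: "198.51.100.1"},
		{name: "www.example.com.", rtype: "CNAME", data: "cdn.example.com."},
	}
	fmt.Println(resolveChain(answers, "www.example.com.")) // [198.51.100.1]
}
</code></pre>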
    <div>
      <h2>What the RFC says</h2>
      <a href="#what-the-rfc-says">
        
      </a>
    </div>
    <p><a href="https://datatracker.ietf.org/doc/html/rfc1034"><u>RFC 1034</u></a>, published in 1987, defines much of the behavior of the DNS protocol, and should give us an answer on whether the order of CNAME records matters. <a href="https://datatracker.ietf.org/doc/html/rfc1034#section-4.3.1"><u>Section 4.3.1</u></a> contains the following text:</p><blockquote><p>If recursive service is requested and available, the recursive response to a query will be one of the following:</p><p>- The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.</p></blockquote><p>While "possibly preface" can be interpreted as a requirement for CNAME records to appear before everything else, it does not use normative key words, such as <a href="https://datatracker.ietf.org/doc/html/rfc2119"><u>MUST and SHOULD</u></a> that modern RFCs use to express requirements. This isn’t a flaw in RFC 1034, but simply a result of its age. <a href="https://datatracker.ietf.org/doc/html/rfc2119"><u>RFC 2119</u></a>, which standardized these key words, was published in 1997, 10 years <i>after</i> RFC 1034.</p><p>In our case, we did originally implement the specification so that CNAMEs appear first. However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC.</p>
    <div>
      <h3>The subtle distinction: RRsets vs RRs in message sections</h3>
      <a href="#the-subtle-distinction-rrsets-vs-rrs-in-message-sections">
        
      </a>
    </div>
    <p>To understand why this ambiguity exists, we need to understand a subtle but important distinction in DNS terminology.</p><p>RFC 1034 <a href="https://datatracker.ietf.org/doc/html/rfc1034#section-3.6"><u>section 3.6</u></a> defines Resource Record Sets (RRsets) as collections of records with the same name, type, and class. For RRsets, the specification is clear about ordering:</p><blockquote><p>The order of RRs in a set is not significant, and need not be preserved by name servers, resolvers, or other parts of the DNS.</p></blockquote><p>However, RFC 1034 doesn’t clearly specify how message sections relate to RRsets. While modern DNS specifications have shown that message sections can indeed contain multiple RRsets (consider <a href="https://www.cloudflare.com/learning/dns/dnssec/how-dnssec-works/">DNSSEC</a> responses with signatures), RFC 1034 doesn’t describe message sections in those terms. Instead, it treats message sections as containing individual Resource Records (RRs).</p><p>The problem is that the RFC primarily discusses ordering in the context of RRsets but doesn't specify the ordering of different RRsets relative to each other within a message section. This is where the ambiguity lives.</p><p>RFC 1034 <a href="https://datatracker.ietf.org/doc/html/rfc1034#section-6.2.1"><u>section 6.2.1</u></a> includes an example that demonstrates this ambiguity further. It mentions that the order of Resource Records (RRs) is not significant either:</p><blockquote><p>The difference in ordering of the RRs in the answer section is not significant.</p></blockquote><p>However, this example only shows two A records for the same name within the same RRset. It doesn't address whether this applies to different record types like CNAMEs and A records.</p>
    <div>
      <h2>CNAME chain ordering</h2>
      <a href="#cname-chain-ordering">
        
      </a>
    </div>
    <p>It turns out that this issue extends beyond putting CNAME records before other record types. Even when CNAMEs appear before other records, sequential parsing can still break if the CNAME chain itself is out of order. Consider the following response:</p>
            <pre><code>;; QUESTION SECTION:
;; www.example.com.              IN    A

;; ANSWER SECTION:
cdn.example.com.           3600  IN    CNAME  server.cdn-provider.com.
www.example.com.           3600  IN    CNAME  cdn.example.com.
server.cdn-provider.com.   300   IN    A      198.51.100.1
</code></pre>
            <p>Each CNAME belongs to a different RRset, as they have different owners, so the statement about RRset order being insignificant doesn’t apply here.</p><p>However, RFC 1034 doesn't specify that CNAME chains must appear in any particular order. There's no requirement that <code>www.example.com. CNAME cdn.example.com.</code> must appear before <code>cdn.example.com. CNAME server.cdn-provider.com.</code>. With sequential parsing, the same issue occurs:</p><ol><li><p>Find records for <code>www.example.com</code></p></li><li><p>Ignore <code>cdn.example.com. CNAME server.cdn-provider.com</code>. as it doesn’t match the expected name</p></li><li><p>Encounter <code>www.example.com. CNAME cdn.example.com</code></p></li><li><p>Find records for <code>cdn.example.com</code></p></li><li><p>Ignore <code>server.cdn-provider.com. A 198.51.100.1</code> as it doesn’t match the expected name</p></li></ol>
    <div>
      <h2>What should resolvers do?</h2>
      <a href="#what-should-resolvers-do">
        
      </a>
    </div>
    <p>RFC 1034 section 5 describes resolver behavior. <a href="https://datatracker.ietf.org/doc/html/rfc1034#section-5.2.2"><u>Section 5.2.2</u></a> specifically addresses how resolvers should handle aliases (CNAMEs): </p><blockquote><p>In most cases a resolver simply restarts the query at the new name when it encounters a CNAME.</p></blockquote><p>This suggests that resolvers should restart the query upon finding a CNAME, regardless of where it appears in the response. However, it's important to distinguish between different types of resolvers:</p><ul><li><p>Recursive resolvers, like 1.1.1.1, are full DNS resolvers that perform recursive resolution by querying authoritative nameservers</p></li><li><p>Stub resolvers, like glibc’s getaddrinfo, are simplified local interfaces that forward queries to recursive resolvers and process the responses</p></li></ul><p>The RFC sections on resolver behavior were primarily written with full resolvers in mind, not the simplified stub resolvers that most applications actually use. Some stub resolvers evidently don’t implement certain parts of the spec, such as the CNAME-restart logic described in the RFC. </p>
    <div>
      <h2>The DNSSEC specifications provide contrast</h2>
      <a href="#the-dnssec-specifications-provide-contrast">
        
      </a>
    </div>
    <p>Later DNS specifications demonstrate a different approach to defining record ordering. <a href="https://datatracker.ietf.org/doc/html/rfc4035"><u>RFC 4035</u></a>, which defines protocol modifications for <a href="https://www.cloudflare.com/learning/dns/dnssec/how-dnssec-works/"><u>DNSSEC</u></a>, uses more explicit language:</p><blockquote><p>When placing a signed RRset in the Answer section, the name server MUST also place its RRSIG RRs in the Answer section. The RRSIG RRs have a higher priority for inclusion than any other RRsets that may have to be included.</p></blockquote><p>The specification uses "MUST" and explicitly defines "higher priority" for <a href="https://www.cloudflare.com/learning/dns/dnssec/how-dnssec-works/"><u>RRSIG</u></a> records. However, "higher priority for inclusion" refers to whether RRSIGs should be included in the response, not where they should appear. This provides unambiguous guidance to implementers about record inclusion in DNSSEC contexts, while not mandating any particular behavior around record ordering.</p><p>For unsigned zones, however, the ambiguity from RFC 1034 remains. The word "preface" has guided implementation behavior for nearly four decades, but it has never been formally specified as a requirement.</p>
    <div>
      <h2>Do CNAME records come first?</h2>
      <a href="#do-cname-records-come-first">
        
      </a>
    </div>
    <p>While in our interpretation the RFCs do not require CNAMEs to appear in any particular order, it’s clear that at least some widely-deployed DNS clients rely on a specific ordering. As some systems using these clients might be updated infrequently, or never updated at all, we believe it’s best to require CNAME records to appear in order before any other records.</p><p>Based on what we have learned during this incident, we have reverted the CNAME re-ordering and do not intend to change the order in the future.</p><p>To prevent any future incidents or confusion, we have written a proposal in the form of an <a href="https://www.ietf.org/participate/ids/"><u>Internet-Draft</u></a> to be discussed at the IETF. If consensus is reached on the clarified behavior, this would become an RFC that explicitly defines how to correctly handle CNAMEs in DNS responses, helping us and the wider DNS community navigate the protocol. The proposal can be found at <a href="https://datatracker.ietf.org/doc/draft-jabley-dnsop-ordered-answer-section/">https://datatracker.ietf.org/doc/draft-jabley-dnsop-ordered-answer-section</a>. If you have suggestions or feedback, we would love to hear your opinions, most usefully via the <a href="https://datatracker.ietf.org/wg/dnsop/about/"><u>DNSOP working group</u></a> at the IETF.</p>
            <category><![CDATA[1.1.1.1]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Resolver]]></category>
            <category><![CDATA[Standards]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Consumer Services]]></category>
            <guid isPermaLink="false">3fP84BsxwSxKr7ffpmVO6s</guid>
            <dc:creator>Sebastiaan Neuteboom</dc:creator>
        </item>
        <item>
            <title><![CDATA[Code Orange: Fail Small — our resilience plan following recent incidents]]></title>
            <link>https://blog.cloudflare.com/fail-small-resilience-plan/</link>
            <pubDate>Fri, 19 Dec 2025 22:35:30 GMT</pubDate>
            <description><![CDATA[ We have declared “Code Orange: Fail Small” to focus everyone at Cloudflare on a set of high-priority workstreams with one simple goal: ensure that the cause of our last two global outages never happens again. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>On <a href="https://blog.cloudflare.com/18-november-2025-outage/"><u>November 18, 2025</u></a>, Cloudflare’s network experienced significant failures to deliver network traffic for approximately two hours and ten minutes. Nearly three weeks later, on <a href="https://blog.cloudflare.com/5-december-2025-outage/"><u>December 5, 2025</u></a>, our network again failed to serve traffic for 28% of applications behind our network for about 25 minutes.</p><p>We published detailed post-mortem blog posts following both incidents, but we know that we have more to do to earn back your trust. Today we are sharing details about the work underway at Cloudflare to prevent outages like these from happening again.</p><p>We are calling the plan “<b>Code Orange: Fail Small</b>”, which reflects our goal of making our network more resilient to errors or mistakes that could lead to a major outage. A “Code Orange” means the work on this project is prioritized above all else. For context, we declared a “Code Orange” at Cloudflare <a href="https://blog.cloudflare.com/major-data-center-power-failure-again-cloudflare-code-orange-tested/"><u>once before</u></a>, following another major incident that required top priority from everyone across the company. We feel the recent events require the same focus.  Code Orange is our way to enable that to happen, allowing teams to work cross-functionally as necessary to get the job done while pausing any other work.</p><p>The Code Orange work is organized into three main areas:</p><ul><li><p>Require controlled rollouts for any configuration change that is propagated to the network, just like we do today for software binary releases.</p></li><li><p>Review, improve, and test failure modes of all systems handling network traffic to ensure they exhibit well-defined behavior under all conditions, including unexpected error states.</p></li><li><p>Change our internal “break glass”* procedures, and remove any circular dependencies so that we, and our customers, can act fast and access all systems without issue during an incident.</p></li></ul><p>These projects will deliver iterative improvements as they proceed, rather than one “big bang” change at their conclusion. Every individual update will contribute to more resiliency at Cloudflare. By the end, we expect Cloudflare’s network to be much more resilient, including for issues such as those that triggered the global incidents we experienced in the last two months.</p><p>We understand that these incidents are painful for our customers and the Internet as a whole. We’re deeply embarrassed by them, which is why this work is the first priority for everyone here at Cloudflare.</p><p><sup><b><i>*</i></b></sup><sup><i> Break glass procedures at Cloudflare allow certain individuals to elevate their privilege under certain circumstances to perform urgent actions to resolve high severity scenarios.</i></sup></p>
    <div>
      <h2>What went wrong?</h2>
      <a href="#what-went-wrong">
        
      </a>
    </div>
    <p>In the first incident, users visiting a customer site on Cloudflare saw error pages that indicated Cloudflare could not deliver a response to their request. In the second, they saw blank pages.</p><p>Both outages followed a similar pattern. In the moments leading up to each incident we instantaneously deployed a configuration change in our data centers in hundreds of cities around the world.</p><p>The November change was an automatic update to our Bot Management classifier. We run various artificial intelligence models that learn from the traffic flowing through our network to build detections that identify bots. We constantly update those systems to stay ahead of bad actors trying to evade our security protection to reach customer sites.</p><p>During the December incident, while trying to protect our customers from a vulnerability in the popular open source framework React, we deployed a change to a security tool used by our security analysts to improve our signatures. Similar to the urgency of new bot management updates, we needed to get ahead of the attackers who wanted to exploit the vulnerability. That change triggered the start of the incident.</p><p>This pattern exposed a serious gap in how we deploy configuration changes at Cloudflare, versus how we release software updates. When we release software version updates, we do so in a controlled and monitored fashion. For each new binary release, the deployment must successfully complete multiple gates before it can serve worldwide traffic. We deploy first to employee traffic, before carefully rolling out the change to increasing percentages of customers worldwide, starting with free users. If we detect an anomaly at any stage, we can revert the release without any human intervention.</p><p>We have not applied that methodology to configuration changes. Unlike releasing the core software that powers our network, when we make configuration changes, we are modifying the values of how that software behaves and we can do so instantly. We give this power to our customers too: If you make a change to a setting in Cloudflare, it will propagate globally in seconds.</p><p>While that speed has advantages, it also comes with risks that we need to address. The past two incidents have demonstrated that we need to treat any change that is applied to how we serve traffic in our network with the same level of tested caution that we apply to changes to the software itself.</p>
    <div>
      <h2>We will change how we deploy configuration updates at Cloudflare</h2>
      <a href="#we-will-change-how-we-deploy-configuration-updates-at-cloudflare">
        
      </a>
    </div>
    <p>Our ability to deploy configuration changes globally within seconds was the core commonality across the two incidents. In both events, a wrong configuration took down our network in seconds.</p><p>Introducing controlled rollouts of our configuration, just as we <b><i>already do</i></b> for software releases, is the most important workstream of our Code Orange plan.</p><p>Configuration changes at Cloudflare propagate to the network very quickly. When a user creates a new DNS record, or creates a new security rule, it reaches 90% of servers on the network within seconds. This is powered by a software component that we internally call Quicksilver.</p><p>Quicksilver is also used for any configuration change required by our own teams. The speed is a feature: we can react and globally update our network behavior very quickly. However, in both incidents this caused a breaking change to propagate to the entire network in seconds rather than passing through gates to test it.</p><p>While the ability to deploy changes to our network on a near-instant basis is useful in many cases, it is rarely necessary. Work is underway to treat configuration the same way that we treat code by introducing controlled deployments within Quicksilver to any configuration change.</p><p>We release software updates to our network multiple times per day through what we call our Health Mediated Deployment (HMD) system. In this framework, every team at Cloudflare that owns a service (a piece of software deployed into our network) must define the metrics that indicate a deployment has succeeded or failed, the rollout plan, and the steps to take if it does not succeed.</p><p>Different services will have slightly different variables. Some might need longer wait times before proceeding to more data centers, while others might have lower tolerances for error rates even if it causes false positive signals.</p><p>Once deployed, our HMD toolkit begins to carefully progress against that plan while monitoring each step before proceeding. If any step fails, the rollback will automatically begin and the team can be paged if needed.</p><p>By the end of Code Orange, configuration updates will follow this same process. We expect this to allow us to quickly catch the kinds of issues that occurred in these past two incidents long before they become widespread problems.</p>
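    <p>To illustrate the shape of such a health-gated rollout, here is a minimal sketch in Rust. It is not our actual HMD code: the stage names, metrics, and helper functions are hypothetical, and in the real system each service team defines its own success criteria and rollout plan:</p>
            <pre><code>use std::time::Duration;

// Illustrative shape of a health-gated rollout plan; field names are hypothetical.
struct Stage {
    name: &'static str,  // e.g. "employee traffic", "a few data centers", "global"
    soak: Duration,      // how long to observe before advancing
    max_error_rate: f64, // success criterion owned by the service team
}

enum Outcome {
    Promoted,
    RolledBack { failed_stage: &'static str },
}

/// Walk the plan stage by stage, checking health before advancing.
/// `error_rate_during` stands in for whatever metrics the owning team defined.
fn roll_out(stages: &[Stage], error_rate_during: impl Fn(&Stage) -> f64) -> Outcome {
    for stage in stages {
        apply_to(stage.name);
        std::thread::sleep(stage.soak); // observe the stage for its soak period
        if error_rate_during(stage) > stage.max_error_rate {
            roll_back(); // automatic, no human in the loop
            return Outcome::RolledBack { failed_stage: stage.name };
        }
    }
    Outcome::Promoted
}

// Placeholders for the deployment machinery itself.
fn apply_to(_scope: &str) {}
fn roll_back() {}

fn main() {
    let plan = [
        Stage { name: "employee traffic", soak: Duration::from_secs(0), max_error_rate: 0.001 },
        Stage { name: "a few data centers", soak: Duration::from_secs(0), max_error_rate: 0.001 },
        Stage { name: "global", soak: Duration::from_secs(0), max_error_rate: 0.001 },
    ];
    // Pretend the change misbehaves once it reaches real traffic.
    let outcome = roll_out(&plan, |s| if s.name == "a few data centers" { 0.05 } else { 0.0 });
    match outcome {
        Outcome::Promoted => println!("change promoted to all stages"),
        Outcome::RolledBack { failed_stage } => println!("rolled back at stage: {failed_stage}"),
    }
}</code></pre>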
    <div>
      <h2>How will we address failure modes between services?</h2>
      <a href="#how-will-we-address-failure-modes-between-services">
        
      </a>
    </div>
    <p>While we are optimistic that better control over configuration changes will catch more problems before they become incidents, we know that mistakes can and will occur. During both incidents, errors in one part of our network became problems in most of our technology stack, including the control plane that customers rely on to configure how they use Cloudflare.</p><p>We need to think about careful, graduated rollouts not just in terms of geographic progression (spreading to more of our data centers) or population progression (spreading from employees to broader customer groups). We also need to plan for safer deployments that contain failures in terms of service progression (spreading from one product, like our Bot Management service, to an unrelated one, like our dashboard).</p><p>To that end, we are in the process of reviewing the interface contracts between every critical product and service that makes up our network to ensure that we a) <b>assume failure will occur</b> at each interface and b) handle that failure in the <b>most reasonable way possible</b>.</p><p>To go back to our Bot Management service failure, there were at least two key interfaces where, had we assumed failure was going to happen, we could have handled it gracefully enough that it is unlikely any customer would have been impacted. The first was the interface that read the corrupted configuration file. Instead of panicking, the module should have fallen back to a sane set of validated defaults, allowing traffic to keep flowing while we would have, at worst, lost the real-time fine-tuning that feeds into our bot detection machine-learning models.</p><p>The second interface was between the core software that runs our network and the Bot Management module itself. In the event that our bot management module failed (as it did), we should not have dropped traffic by default. Instead, a saner default would have been to let the traffic pass with a neutral classification.</p>
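    <p>As a rough illustration of what “assume failure and degrade gracefully” can look like at those two interfaces, consider the following sketch. It is not our proxy code; the types, the neutral score value, and the helper functions are invented for the example:</p>
            <pre><code>// Illustrative only, not Cloudflare's proxy code. The types, the neutral
// score value, and the helper names are invented for this sketch.
struct BotConfig {
    features: Vec<String>,
}

impl BotConfig {
    /// Conservative, pre-validated defaults used when the real
    /// configuration cannot be loaded.
    fn validated_defaults() -> Self {
        BotConfig { features: Vec::new() }
    }
}

#[derive(Debug)]
struct BotScore(u8);

impl BotScore {
    /// A neutral score that lets traffic pass rather than blocking it.
    fn neutral() -> Self {
        BotScore(50)
    }
}

/// Interface 1: loading configuration. A corrupt file is logged and
/// replaced with defaults instead of panicking.
fn load_config(raw: &str) -> BotConfig {
    match parse_config(raw) {
        Ok(cfg) => cfg,
        Err(err) => {
            eprintln!("bot config rejected ({err}); falling back to defaults");
            BotConfig::validated_defaults()
        }
    }
}

/// Interface 2: the core proxy calling the bot module. A module failure
/// degrades to a neutral classification instead of dropping the request.
fn classify(request_features: &[f32], cfg: &BotConfig) -> BotScore {
    run_bot_module(request_features, cfg).unwrap_or_else(|err| {
        eprintln!("bot module failed ({err}); serving neutral score");
        BotScore::neutral()
    })
}

// Stand-ins for the real parsing and scoring logic; both fail here to
// exercise the fallback paths.
fn parse_config(_raw: &str) -> Result<BotConfig, String> {
    Err("too many features".to_string())
}
fn run_bot_module(_features: &[f32], _cfg: &BotConfig) -> Result<BotScore, String> {
    Err("model unavailable".to_string())
}

fn main() {
    let cfg = load_config("corrupted bytes");
    let score = classify(&[0.1, 0.7], &cfg);
    println!("request passed with {score:?} and {} features", cfg.features.len());
}</code></pre>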
    <div>
      <h2>How will we solve emergencies faster?</h2>
      <a href="#how-will-we-solve-emergencies-faster">
        
      </a>
    </div>
    <p>During the incidents, it took us too long to resolve the problem. In both cases, this was worsened by our security systems preventing team members from accessing the tools they needed to fix the problem, and in some cases, circular dependencies slowed us down as some internal systems also became unavailable.</p><p>As a security company, we keep all our tools behind authentication layers with fine-grained access controls to ensure customer data is safe and to prevent unauthorized access. This is the right thing to do, but at the same time, our current processes and systems slowed us down when speed was a top priority.</p><p>Circular dependencies also affected our customer experience. For example, during the November 18 incident, Turnstile, our CAPTCHA alternative, became unavailable. As we use Turnstile on the login flow to the Cloudflare dashboard, customers who did not have active sessions, or API service tokens, were not able to log in to Cloudflare at the moment they most needed to make critical changes.</p><p>Our team will be reviewing and improving all of our break glass procedures and technology to ensure that, when necessary, we can access the right tools as fast as possible while maintaining our security requirements. This includes reviewing and removing circular dependencies, or being able to “bypass” them quickly in the event of an incident. We will also increase the frequency of our training exercises, so that processes are well understood by all teams before any potential disaster scenario in the future.</p>
    <div>
      <h2>When will we be done?</h2>
      <a href="#when-will-we-be-done">
        
      </a>
    </div>
    <p>While we haven’t captured in this post all the work being undertaken internally, the workstreams detailed above describe the top priorities the teams are being asked to focus on. Each of these workstreams maps to a detailed plan touching nearly every product and engineering team at Cloudflare. We have a lot of work to do.</p><p>By the end of Q1, and largely before then, we will:</p><ul><li><p>Ensure all production systems are covered by Health Mediated Deployments (HMD) for configuration management.</p></li><li><p>Update our systems to adhere to proper failure modes as appropriate for each product set.</p></li><li><p>Ensure we have processes in place so the right people have the right access to provide proper remediation during an emergency.</p></li></ul><p>Some of these goals will be evergreen. We will always need to better handle circular dependencies as we launch new software and our break glass procedures will need to update to reflect how our security technology changes over time.</p><p>We failed our users and the Internet as a whole in these past two incidents. We have work to do to make it right. We plan to share updates as this work proceeds and appreciate the questions and feedback we have received from our customers and partners.</p> ]]></content:encoded>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[Code Orange]]></category>
            <guid isPermaLink="false">DMVZ2E5NT13VbQvP1hUNj</guid>
            <dc:creator>Dane Knecht</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare outage on December 5, 2025]]></title>
            <link>https://blog.cloudflare.com/5-december-2025-outage/</link>
            <pubDate>Fri, 05 Dec 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare experienced a significant traffic outage on  December 5, 2025, starting approximately at 8:47 UTC. The incident lasted approximately 25 minutes before resolution. We are sorry for the impact that it caused to our customers and the Internet. The incident was not caused by an attack and was due to configuration changes being applied to attempt to mitigate a recent industry-wide vulnerability impacting React Server Components. ]]></description>
            <content:encoded><![CDATA[ <p><i><sup></sup></i></p><p><i><sup>Note: This post was updated to clarify the relationship of the internal WAF tool with the incident on Dec. 5.</sup></i></p><p>On December 5, 2025, at 08:47 UTC (all times in this blog are UTC), a portion of Cloudflare’s network began experiencing significant failures. The incident was resolved at 09:12 (~25 minutes total impact), when all services were fully restored.</p><p>A subset of customers were impacted, accounting for approximately 28% of all HTTP traffic served by Cloudflare. Several factors needed to combine for an individual customer to be affected as described below.</p><p>The issue was not caused, directly or indirectly, by a cyber attack on Cloudflare’s systems or malicious activity of any kind. Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability <a href="https://blog.cloudflare.com/waf-rules-react-vulnerability/"><u>disclosed this week</u></a> in React Server Components.</p><p>Any outage of our systems is unacceptable, and we know we have let the Internet down again following the incident on November 18. We will be publishing details next week about the work we are doing to stop these types of incidents from occurring.</p>
    <div>
      <h3>What happened</h3>
      <a href="#what-happened">
        
      </a>
    </div>
    <p>The graph below shows HTTP 500 errors served by our network during the incident timeframe (red line at the bottom), compared to unaffected total Cloudflare traffic (green line at the top).</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/43yFyHQhKjhPoLh4yB7pQ8/c1eb08a3e056530311e6056ecac522ed/image1.png" />
          </figure><p>Cloudflare's Web Application Firewall (WAF) provides customers with protection against malicious payloads, allowing them to be detected and blocked. To do this, Cloudflare’s proxy buffers HTTP request body content in memory for analysis. Before today, the buffer size was set to 128KB.</p><p>As part of our ongoing work to protect customers who use React against a critical vulnerability, <a href="https://nvd.nist.gov/vuln/detail/CVE-2025-55182"><u>CVE-2025-55182</u></a>, we started rolling out an increase of the buffer size to 1MB, the default limit allowed by Next.js applications, to make sure as many customers as possible were protected.</p><p>This first change was being rolled out using our gradual deployment system. During rollout, we noticed that our internal WAF testing tool did not support the increased buffer size. As this internal test tool was not needed at that time and had no effect on customer traffic, we made a second change to turn it off.</p><p>This second change of turning off our WAF testing tool was implemented using our global configuration system. This system does not perform gradual rollouts, but rather propagates changes within seconds to the entire fleet of servers in our network and is under review <a href="https://blog.cloudflare.com/18-november-2025-outage/"><u>following the outage we experienced on November 18</u></a>.</p><p>Unfortunately, in the FL1 version of our proxy, under certain circumstances, the second change of turning off our WAF rule testing tool caused an error state that resulted in HTTP 500 error codes being served from our network.</p><p>As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module, which led to the following Lua exception:</p>
            <pre><code>[lua] Failed to run module rulesets callback late_routing: /usr/local/nginx-fl/lua/modules/init.lua:314: attempt to index field 'execute' (a nil value)</code></pre>
            <p>resulting in HTTP 500 errors being issued.</p><p>The issue was identified shortly after the change was applied, and the change was reverted at 09:12, after which all traffic was served correctly.</p><p>Customers that had their web assets served by our older FL1 proxy <b>AND</b> had the Cloudflare Managed Ruleset deployed were impacted. All requests for websites in this state returned an HTTP 500 error, with the small exception of some test endpoints such as <code>/cdn-cgi/trace</code>.</p><p>Customers that did not have the configuration above applied were not impacted. Customer traffic served by our China network was also not impacted.</p>
    <div>
      <h3>The runtime error</h3>
      <a href="#the-runtime-error">
        
      </a>
    </div>
    <p>Cloudflare’s rulesets system consists of sets of rules which are evaluated for each request entering our system. A rule consists of a filter, which selects some traffic, and an action, which applies an effect to that traffic. Typical actions are “<code>block</code>”, “<code>log</code>”, or “<code>skip</code>”. Another type of action is “<code>execute</code>”, which is used to trigger evaluation of another ruleset.</p><p>Our internal logging system uses this feature to evaluate new rules before we make them available to the public. A top-level ruleset will execute another ruleset containing test rules. It was these test rules that we were attempting to disable.</p><p>We have a killswitch subsystem as part of the rulesets system, intended to allow a misbehaving rule to be disabled quickly. This killswitch system receives information from our global configuration system mentioned in the prior sections. We have used this killswitch system on a number of occasions in the past to mitigate incidents and have a well-defined Standard Operating Procedure, which was followed in this incident.</p><p>However, we have never before applied a killswitch to a rule with an action of “<code>execute</code>”. When the killswitch was applied, the code correctly skipped the evaluation of the execute action, and didn’t evaluate the sub-ruleset pointed to by it. But an error was then encountered while processing the overall results of evaluating the ruleset:</p>
            <pre><code>if rule_result.action == "execute" then
  rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
end</code></pre>
            <p>This code expects that, if a rule result has the action “<code>execute</code>”, the <code>rule_result.execute</code> object will exist. However, because the rule had been skipped, the <code>rule_result.execute</code> object did not exist, and Lua raised an error when attempting to index a nil value.</p><p>This is a straightforward error in the code, which had existed undetected for many years. This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.</p>
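    <p>To show what “prevented by a strong type system” means in practice, here is a brief sketch of the same idea in Rust. It is not the actual FL2 code and the names are invented; the point is that the data needed to process an execute action travels inside the enum variant, so a skipped execute with a missing payload cannot even be expressed:</p>
            <pre><code>// Illustrative only, not the FL2 rules engine; names are invented.
enum RuleOutcome {
    Block,
    Log,
    Skip,
    // The results index travels with the variant, so it cannot be
    // absent when the action is "execute".
    Execute { results_index: usize },
    // A killswitched rule is its own state rather than a half-filled record.
    Killswitched,
}

fn attach_results(outcome: &RuleOutcome, ruleset_results: &[Vec<String>]) -> Option<Vec<String>> {
    match outcome {
        // The compiler forces every variant to be handled, including the
        // killswitched case that the Lua code never considered.
        RuleOutcome::Execute { results_index } => ruleset_results.get(*results_index).cloned(),
        _ => None,
    }
}

fn main() {
    let results = vec![vec!["sub-ruleset result".to_string()]];
    // A killswitched rule simply yields no sub-results instead of crashing.
    assert_eq!(attach_results(&RuleOutcome::Killswitched, &results), None);
    assert_eq!(
        attach_results(&RuleOutcome::Execute { results_index: 0 }, &results),
        Some(vec!["sub-ruleset result".to_string()])
    );
}</code></pre>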
    <div>
      <h3>What about the changes being made after the incident on November 18, 2025?</h3>
      <a href="#what-about-the-changes-being-made-after-the-incident-on-november-18-2025">
        
      </a>
    </div>
    <p>We made an unrelated change that caused a similar, <a href="https://blog.cloudflare.com/18-november-2025-outage/"><u>longer availability incident</u></a> two weeks ago on November 18, 2025. In both cases, a deployment to help mitigate a security issue for our customers propagated to our entire network and led to errors for nearly all of our customer base.</p><p>We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.</p><p>We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization. In particular, the projects outlined below should help contain the impact of these kinds of changes:</p><ul><li><p><b>Enhanced Rollouts &amp; Versioning</b>: Similar to how we slowly deploy software with strict health validation, data used for rapid threat response and general configuration needs to have the same safety and blast mitigation features. This includes health validation and quick rollback capabilities among other things.</p></li><li><p><b>Streamlined break glass capabilities:</b> Ensure that critical operations can still be achieved in the face of additional types of failures. This applies to internal services as well as all standard methods of interaction with the Cloudflare control plane used by all Cloudflare customers.</p></li><li><p><b>"Fail-Open" Error Handling: </b>As part of the resilience effort, we are replacing the incorrectly applied hard-fail logic across all critical Cloudflare data-plane components. If a configuration file is corrupt or out-of-range (e.g., exceeding feature caps), the system will log the error and default to a known-good state or pass traffic without scoring, rather than dropping requests. Some services will likely give the customer the option to fail open or closed in certain scenarios. This will include drift-prevention capabilities to ensure this is enforced continuously.</p></li></ul><p>Before the end of next week we will publish a detailed breakdown of all the resiliency projects underway, including the ones listed above. While that work is underway, we are locking down all changes to our network in order to ensure we have better mitigation and rollback systems before we begin again.</p><p>These kinds of incidents, and how closely they are clustered together, are not acceptable for a network like ours. On behalf of the team at Cloudflare we want to apologize for the impact and pain this has caused again to our customers and the Internet as a whole.</p>
    <div>
      <h3>Timeline</h3>
      <a href="#timeline">
        
      </a>
    </div>
    <table><tr><td><p>Time (UTC)</p></td><td><p>Status</p></td><td><p>Description</p></td></tr><tr><td><p>08:47</p></td><td><p>INCIDENT start</p></td><td><p>Configuration change deployed and propagated to the network</p></td></tr><tr><td><p>08:48</p></td><td><p>Full impact</p></td><td><p>Change fully propagated</p></td></tr><tr><td><p>08:50</p></td><td><p>INCIDENT declared</p></td><td><p>Automated alerts</p></td></tr><tr><td><p>09:11</p></td><td><p>Change reverted</p></td><td><p>Configuration change reverted and propagation start</p></td></tr><tr><td><p>09:12</p></td><td><p>INCIDENT end</p></td><td><p>Revert fully propagated, all traffic restored</p></td></tr></table><p></p> ]]></content:encoded>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <guid isPermaLink="false">7lRsBDx09Ye8w2dhpF0Yc</guid>
            <dc:creator>Dane Knecht</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare outage on November 18, 2025]]></title>
            <link>https://blog.cloudflare.com/18-november-2025-outage/</link>
            <pubDate>Tue, 18 Nov 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare suffered a service outage on November 18, 2025. The outage was triggered by a bug in generation logic for a Bot Management feature file causing many Cloudflare services to be affected. 
 ]]></description>
            <content:encoded><![CDATA[ <p>On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures to deliver core network traffic. This showed up to Internet users trying to access our customers' sites as an error page indicating a failure within Cloudflare's network.  </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ony9XsTIteX8DNEFJDddJ/7da2edd5abca755e9088002a0f5d1758/BLOG-3079_2.png" />
          </figure><p><b>The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind.</b> Instead, it was triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.</p><p>The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.</p><p>After we initially wrongly suspected the symptoms we were seeing were caused by a hyper-scale DDoS attack, we correctly identified the core issue and were able to stop the propagation of the larger-than-expected feature file and replace it with an earlier version of the file. Core traffic was largely flowing as normal by 14:30. We worked over the next few hours to mitigate increased load on various parts of our network as traffic rushed back online. As of 17:06 all systems at Cloudflare were functioning as normal.</p><p>We are sorry for the impact to our customers and to the Internet in general. Given Cloudflare's importance in the Internet ecosystem any outage of any of our systems is unacceptable. That there was a period of time where our network was not able to route traffic is deeply painful to every member of our team. We know we let you down today.</p><p>This post is an in-depth recount of exactly what happened and what systems and processes failed. It is also the beginning, though not the end, of what we plan to do in order to make sure an outage like this will not happen again.</p>
    <div>
      <h2>The outage</h2>
      <a href="#the-outage">
        
      </a>
    </div>
    <p>The chart below shows the volume of 5xx error HTTP status codes served by the Cloudflare network. Normally this should be very low, and it was right up until the start of the outage. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7GdZcWhEqNjwOmLcsKOXT0/fca7e6970d422d04c81b2baafb988cbe/BLOG-3079_3.png" />
          </figure><p>The volume prior to 11:20 is the expected baseline of 5xx errors observed across our network. The spike, and subsequent fluctuations, show our system failing due to loading the incorrect feature file. What’s notable is that our system would then recover for a period. This was very unusual behavior for an internal error.</p><p>The explanation was that the file was being generated every five minutes by a query running on a ClickHouse database cluster, which was being gradually updated to improve permissions management. Bad data was only generated if the query ran on a part of the cluster which had been updated. As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.</p><p>This fluctuation made it unclear what was happening as the entire system would recover and then fail again as sometimes good, sometimes bad configuration files were distributed to our network. Initially, this led us to believe this might be caused by an attack. Eventually, every ClickHouse node was generating the bad configuration file and the fluctuation stabilized in the failing state.</p><p>Errors continued until the underlying issue was identified and resolved starting at 14:30. We solved the problem by stopping the generation and propagation of the bad feature file and manually inserting a known good file into the feature file distribution queue. And then forcing a restart of our core proxy.</p><p>The remaining long tail in the chart above is our team restarting remaining services that had entered a bad state, with 5xx error code volume returning to normal at 17:06.</p><p>The following services were impacted:</p><table><tr><th><p><b>Service / Product</b></p></th><th><p><b>Impact description</b></p></th></tr><tr><td><p>Core CDN and security services</p></td><td><p>HTTP 5xx status codes. The screenshot at the top of this post shows a typical error page delivered to end users.</p></td></tr><tr><td><p>Turnstile</p></td><td><p>Turnstile failed to load.</p></td></tr><tr><td><p>Workers KV</p></td><td><p>Workers KV returned a significantly elevated level of HTTP 5xx errors as requests to KV’s “front end” gateway failed due to the core proxy failing.</p></td></tr><tr><td><p>Dashboard</p></td><td><p>While the dashboard was mostly operational, most users were unable to log in due to Turnstile being unavailable on the login page.</p></td></tr><tr><td><p>Email Security</p></td><td><p>While email processing and delivery were unaffected, we observed a temporary loss of access to an IP reputation source which reduced spam-detection accuracy and prevented some new-domain-age detections from triggering, with no critical customer impact observed. We also saw failures in some Auto Move actions; all affected messages have been reviewed and remediated.</p></td></tr><tr><td><p>Access</p></td><td><p>Authentication failures were widespread for most users, beginning at the start of the incident and continuing until the rollback was initiated at 13:05. Any existing Access sessions were unaffected.</p><p>
</p><p>All failed authentication attempts resulted in an error page, meaning none of these users ever reached the target application while authentication was failing. Successful logins during this period were correctly logged during this incident. </p><p>
</p><p>Any Access configuration updates attempted at that time would have either failed outright or propagated very slowly. All configuration updates are now recovered.</p></td></tr></table><p>As well as returning HTTP 5xx errors, we observed significant increases in latency of responses from our CDN during the impact period. This was due to large amounts of CPU being consumed by our debugging and observability systems, which automatically enhance uncaught errors with additional debugging information.</p>
    <div>
      <h2>How Cloudflare processes requests, and how this went wrong today</h2>
      <a href="#how-cloudflare-processes-requests-and-how-this-went-wrong-today">
        
      </a>
    </div>
    <p>Every request to Cloudflare takes a well-defined path through our network. It could be from a browser loading a webpage, a mobile app calling an API, or automated traffic from another service. These requests first terminate at our HTTP and TLS layer, then flow into our core proxy system (which we call FL for “Frontline”), and finally through Pingora, which performs cache lookups or fetches data from the origin if needed.</p><p>We previously shared more detail about how the core proxy works <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>here</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6qlWXM3gh4SaYYvsGc7mFV/99294b22963bb414435044323aed7706/BLOG-3079_4.png" />
          </figure><p>As a request transits the core proxy, we run the various security and performance products available in our network. The proxy applies each customer’s unique configuration and settings, from enforcing WAF rules and DDoS protection to routing traffic to the Developer Platform and R2. It accomplishes this through a set of domain-specific modules that apply the configuration and policy rules to traffic transiting our proxy.</p><p>One of those modules, Bot Management, was the source of today’s outage. </p><p>Cloudflare’s <a href="https://www.cloudflare.com/application-services/products/bot-management/"><u>Bot Management</u></a> includes, among other systems, a machine learning model that we use to generate bot scores for every request traversing our network. Our customers use bot scores to control which bots are allowed to access their sites — or not.</p><p>The model takes as input a “feature” configuration file. A feature, in this context, is an individual trait used by the machine learning model to make a prediction about whether the request was automated or not. The feature configuration file is a collection of individual features.</p><p>This feature file is refreshed every few minutes and published to our entire network and allows us to react to variations in traffic flows across the Internet. It allows us to react to new types of bots and new bot attacks. So it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly.</p><p>A change in our underlying ClickHouse query behaviour (explained below) that generates this file caused it to have a large number of duplicate “feature” rows. This changed the size of the previously fixed-size feature configuration file, causing the bots module to trigger an error.</p><p>As a result, HTTP 5xx error codes were returned by the core proxy system that handles traffic processing for our customers, for any traffic that depended on the bots module. This also affected Workers KV and Access, which rely on the core proxy.</p><p>Unrelated to this incident, we were and are currently migrating our customer traffic to a new version of our proxy service, internally known as <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>FL2</u></a>. Both versions were affected by the issue, although the impact observed was different.</p><p>Customers deployed on the new FL2 proxy engine, observed HTTP 5xx errors. Customers on our old proxy engine, known as FL, did not see errors, but bot scores were not generated correctly, resulting in all traffic receiving a bot score of zero. Customers that had rules deployed to block bots would have seen large numbers of false positives. Customers who were not using our bot score in their rules did not see any impact.</p><p>Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare. While it turned out to be a coincidence, it led some of the team diagnosing the issue to believe that an attacker may be targeting both our systems as well as our status page. Visitors to the status page at that time were greeted by an error message:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7LwbB5fv7vdoNRWWDGN7ia/dad8cef76eee1305e0216d74a813612b/BLOG-3079_5.png" />
          </figure><p>In the internal incident chat room, we were concerned that this might be the continuation of the recent spate of high volume <a href="https://techcommunity.microsoft.com/blog/azureinfrastructureblog/defending-the-cloud-azure-neutralized-a-record-breaking-15-tbps-ddos-attack/4470422"><u>Aisuru</u></a> <a href="https://blog.cloudflare.com/defending-the-internet-how-cloudflare-blocked-a-monumental-7-3-tbps-ddos/"><u>DDoS attacks</u></a>:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Ph13HSsOGC0KYRfoeZmSy/46522e46ed0132d2ea551aef4c71a5d6/BLOG-3079_6.png" />
          </figure>
    <div>
      <h3>The query behaviour change</h3>
      <a href="#the-query-behaviour-change">
        
      </a>
    </div>
    <p>I mentioned above that a change in the underlying query behaviour resulted in the feature file containing a large number of duplicate rows. The database system in question uses ClickHouse’s software.</p><p>For context, it’s helpful to know how ClickHouse distributed queries work. A ClickHouse cluster consists of many shards. To query data from all shards, we have so-called distributed tables (powered by the table engine <code>Distributed</code>) in a database called <code>default</code>. The Distributed engine queries underlying tables in a database <code>r0</code>. The underlying tables are where data is stored on each shard of a ClickHouse cluster.</p><p>Queries to the distributed tables run through a shared system account. As part of efforts to improve the security and reliability of our distributed queries, there’s work being done to make them run under the initial user accounts instead.</p><p>Before today, ClickHouse users would only see the tables in the <code>default</code> database when querying table metadata from ClickHouse system tables such as <code>system.tables</code> or <code>system.columns</code>.</p><p>Since users already have implicit access to underlying tables in <code>r0</code>, we made a change at 11:05 to make this access explicit, so that users can see the metadata of these tables as well. By making sure that all distributed subqueries can run under the initial user, query limits and access grants can be evaluated in a more fine-grained manner, preventing one bad subquery from one user from affecting others.</p><p>The change explained above resulted in all users accessing accurate metadata about tables they have access to. Unfortunately, there were assumptions made in the past that the list of columns returned by a query like this would only include the “<code>default</code>” database:</p>
            <pre><code>SELECT
  name,
  type
FROM system.columns
WHERE
  table = 'http_requests_features'
order by name;</code></pre>
            <p>Note how the query does not filter for the database name. As we gradually rolled out the explicit grants to users of a given ClickHouse cluster, after the change at 11:05 the query above started returning “duplicate” columns, because those columns belonged to the underlying tables stored in the <code>r0</code> database.</p><p>This, unfortunately, was the type of query performed by the Bot Management feature file generation logic to construct each input “feature” for the file mentioned at the beginning of this section.</p><p>The query above would return a table of columns like the one displayed (simplified example):</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ZIC5X8vMM7ifbJc0vxgLD/49dd33e7267bdb03b265ee0acccf381d/Screenshot_2025-11-18_at_2.51.24%C3%A2__PM.png" />
          </figure><p>However, as part of the additional permissions that were granted to the user, the response now also contained all the metadata of the <code>r0</code> schema, effectively more than doubling the rows in the response and, ultimately, the number of rows (i.e. features) in the final file output.</p>
    <div>
      <h3>Memory preallocation</h3>
      <a href="#memory-preallocation">
        
      </a>
    </div>
    <p>Each module running on our proxy service has a number of limits in place to avoid unbounded memory consumption and to preallocate memory as a performance optimization. In this specific instance, the Bot Management system has a limit on the number of machine learning features that can be used at runtime. Currently that limit is set to 200, well above our current use of ~60 features. Again, the limit exists because for performance reasons we preallocate memory for the features.</p><p>When the bad file with more than 200 features was propagated to our servers, this limit was hit — resulting in the system panicking. The FL2 Rust code that makes the check and was the source of the unhandled error is shown below:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/640fjk9dawDk7f0wJ8Jm5S/668bcf1f574ae9e896671d9eee50da1b/BLOG-3079_7.png" />
          </figure><p>This resulted in the following panic, which in turn resulted in a 5xx error:</p><p><code>thread fl2_worker_thread panicked: called Result::unwrap() on an Err value</code></p>
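    <p>A more defensive handling of the same limit treats the feature file as untrusted input: validate it against the preallocated bound and fall back to the last known-good file rather than calling <code>unwrap()</code>. The sketch below illustrates that idea; it is not the actual FL2 module, and names such as <code>FEATURE_LIMIT</code> and <code>load_features</code> are invented here:</p>
            <pre><code>// Illustrative sketch, not the actual FL2 module. The limit mirrors the
// preallocation bound described above; the function names are invented.
const FEATURE_LIMIT: usize = 200;

#[derive(Debug)]
enum FeatureFileError {
    TooManyFeatures { got: usize, limit: usize },
}

/// Validate a freshly propagated feature file against the preallocated limit.
fn load_features(names: Vec<String>) -> Result<Vec<String>, FeatureFileError> {
    if names.len() > FEATURE_LIMIT {
        return Err(FeatureFileError::TooManyFeatures {
            got: names.len(),
            limit: FEATURE_LIMIT,
        });
    }
    Ok(names)
}

fn main() {
    // A bad file with duplicated rows, like the one generated during the incident.
    let bad_file: Vec<String> = (0..260).map(|i| format!("feature_{i}")).collect();
    let last_known_good: Vec<String> = (0..60).map(|i| format!("feature_{i}")).collect();

    // Instead of `.unwrap()`, which panics the worker thread and turns into a
    // 5xx, reject the bad file and keep serving with the last known-good one.
    let active = match load_features(bad_file) {
        Ok(features) => features,
        Err(err) => {
            eprintln!("rejecting feature file: {err:?}; keeping previous version");
            last_known_good
        }
    };
    println!("serving with {} features", active.len());
}</code></pre>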
    <div>
      <h3>Other impact during the incident</h3>
      <a href="#other-impact-during-the-incident">
        
      </a>
    </div>
    <p>Other systems that rely on our core proxy were impacted during the incident. This included Workers KV and Cloudflare Access. The team was able to reduce the impact to these systems at 13:04, when a patch was made to Workers KV to bypass the core proxy. Subsequently, all downstream systems that rely on Workers KV (such as Access itself) observed a reduced error rate. </p><p>The Cloudflare Dashboard was also impacted due to both Workers KV being used internally and Cloudflare Turnstile being deployed as part of our login flow.</p><p>Turnstile was impacted by this outage, resulting in customers who did not have an active dashboard session being unable to log in. This showed up as reduced availability during two time periods: from 11:30 to 13:10, and between 14:40 and 15:30, as seen in the graph below.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/nB2ZlYyXiGTNngsVotyjN/479a0f9273c160c63925be87592be023/BLOG-3079_8.png" />
          </figure><p>The first period, from 11:30 to 13:10, was due to the impact to Workers KV, which some control plane and dashboard functions rely upon. This was restored at 13:10, when Workers KV bypassed the core proxy system.

The second period of impact to the dashboard occurred after restoring the feature configuration data. A backlog of login attempts began to overwhelm the dashboard. This backlog, in combination with retry attempts, resulted in elevated latency, reducing dashboard availability. Scaling control plane concurrency restored availability at approximately 15:30.</p>
    <div>
      <h2>Remediation and follow-up steps</h2>
      <a href="#remediation-and-follow-up-steps">
        
      </a>
    </div>
    <p>Now that our systems are back online and functioning normally, work has already begun on how we will harden them against failures like this in the future. In particular we are:</p><ul><li><p>Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input</p></li><li><p>Enabling more global kill switches for features</p></li><li><p>Eliminating the ability for core dumps or other error reports to overwhelm system resources</p></li><li><p>Reviewing failure modes for error conditions across all core proxy modules</p></li></ul><p>Today was Cloudflare's worst outage <a href="https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/"><u>since 2019</u></a>. We've had outages that have made our <a href="https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage/"><u>dashboard unavailable</u></a>. Some that have caused <a href="https://blog.cloudflare.com/cloudflare-service-outage-june-12-2025/"><u>newer features</u></a> to not be available for a period of time. But in the last 6+ years we've not had another outage that has caused the majority of core traffic to stop flowing through our network.</p><p>An outage like today is unacceptable. We've architected our systems to be highly resilient to failure to ensure traffic will always continue to flow. When we've had outages in the past it's always led to us building new, more resilient systems.</p><p>On behalf of the entire team at Cloudflare, I would like to apologize for the pain we caused the Internet today. </p><table><tr><th><p>Time (UTC)</p></th><th><p>Status</p></th><th><p>Description</p></th></tr><tr><td><p>11:05</p></td><td><p>Normal.</p></td><td><p>Database access control change deployed.</p></td></tr><tr><td><p>11:28</p></td><td><p>Impact starts.</p></td><td><p>Deployment reaches customer environments, first errors observed on customer HTTP traffic.</p></td></tr><tr><td><p>11:32-13:05</p></td><td><p>The team investigated elevated traffic levels and errors to Workers KV service.</p><p>

</p></td><td><p>The initial symptom appeared to be degraded Workers KV response rate causing downstream impact on other Cloudflare services.</p><p>
</p><p>Mitigations such as traffic manipulation and account limiting were attempted to bring the Workers KV service back to normal operating levels.</p><p>
</p><p>The first automated test detected the issue at 11:31 and manual investigation started at 11:32. The incident call was created at 11:35.</p></td></tr><tr><td><p>13:05</p></td><td><p>Workers KV and Cloudflare Access bypass implemented — impact reduced.</p></td><td><p>During investigation, we used internal system bypasses for Workers KV and Cloudflare Access so they fell back to a prior version of our core proxy. Although the issue was also present in prior versions of our proxy, the impact was smaller as described below.</p></td></tr><tr><td><p>13:37</p></td><td><p>Work focused on rollback of the Bot Management configuration file to a last-known-good version.</p></td><td><p>We were confident that the Bot Management configuration file was the trigger for the incident. Teams worked on ways to repair the service in multiple workstreams, with the fastest workstream a restore of a previous version of the file.</p></td></tr><tr><td><p>14:24</p></td><td><p>Stopped creation and propagation of new Bot Management configuration files.</p></td><td><p>We identified that the Bot Management module was the source of the 500 errors and that this was caused by a bad configuration file. We stopped automatic deployment of new Bot Management configuration files.</p></td></tr><tr><td><p>14:24</p></td><td><p>Test of new file complete.</p></td><td><p>We observed successful recovery using the old version of the configuration file and then focused on accelerating the fix globally.</p></td></tr><tr><td><p>14:30</p></td><td><p>Main impact resolved. Downstream impacted services started observing reduced errors.</p></td><td><p>A correct Bot Management configuration file was deployed globally and most services started operating correctly.</p></td></tr><tr><td><p>17:06</p></td><td><p>All services resolved. Impact ends.</p></td><td><p>All downstream services restarted and all operations fully restored.</p></td></tr></table><p></p> ]]></content:encoded>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[Bot Management]]></category>
            <guid isPermaLink="false">oVEUcpjyyDA8DSSXiE7E6</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
        <item>
            <title><![CDATA[A deep dive into Cloudflare’s September 12, 2025 dashboard and API outage]]></title>
            <link>https://blog.cloudflare.com/deep-dive-into-cloudflares-sept-12-dashboard-and-api-outage/</link>
            <pubDate>Sat, 13 Sep 2025 07:19:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare’s Dashboard and a set of related APIs were unavailable or partially available for an hour starting on Sep 12, 17:57 UTC.  The outage did not affect the serving of cached files via the  ]]></description>
            <content:encoded><![CDATA[ <p></p>
    <div>
      <h2>What Happened</h2>
      <a href="#what-happened">
        
      </a>
    </div>
    <p>We had an outage in our Tenant Service API which led to a broad outage of many of our APIs and the Cloudflare Dashboard. </p><p>The incident’s impact stemmed from several issues, but the immediate trigger was a bug in the dashboard. This bug caused repeated, unnecessary calls to the Tenant Service API. The API calls were managed by a React useEffect hook, but we mistakenly included a problematic object in its dependency array. Because this object was recreated on every state or prop change, React treated it as “always new,” causing the useEffect to re-run each time. As a result, the API call executed many times during a single dashboard render instead of just once. This behavior coincided with a service update to the Tenant Service API, compounding instability and ultimately overwhelming the service, which then failed to recover.</p><p>When the Tenant Service became overloaded, it had an impact on other APIs and the dashboard because Tenant Service is part of our API request authorization logic.  Without Tenant Service, API request authorization can not be evaluated.  When authorization evaluation fails, API requests return 5xx status codes.</p><p>We’re very sorry about the disruption.  The rest of this blog goes into depth on what happened, and what steps we are taking to prevent it from happening again.  </p>
    <div>
      <h2>Timeline</h2>
      <a href="#timeline">
        
      </a>
    </div>
    <table><tr><th><p>Time (UTC)</p></th><th><p>Description</p></th></tr><tr><td><p>2025-09-12 16:32</p></td><td><p>A new version of the Cloudflare Dashboard is released which contains a bug that will trigger many more calls to the /organizations endpoint, including retries in the event of failure.</p></td></tr><tr><td><p>2025-09-12 17:50</p></td><td><p>A new version of the Tenant API Service is deployed.</p></td></tr><tr><td><p>2025-09-12 17:57</p></td><td><p>The Tenant API Service becomes overwhelmed as new versions are deploying. Dashboard Availability begins to drop <b>IMPACT START</b></p></td></tr><tr><td><p>2025-09-12 18:17</p></td><td><p>After providing more resources to the Tenant API Service, the Cloudflare API climbs to 98% availability, but the dashboard does not recover. <b>IMPACT DECREASE</b></p></td></tr><tr><td><p>2025-09-12 18:58</p></td><td><p>In an attempt to restore dashboard availability, some erroring codepaths were removed and a new version of the Tenant Service is released. This was ultimately a bad change and causes API Impact again. <b>IMPACT INCREASE</b></p></td></tr><tr><td><p>2025-09-12 19:01</p></td><td><p>In an effort to relieve traffic against the Tenant API Service, a temporary ratelimiting rule is published.</p></td></tr><tr><td><p>2025-09-12 19:12</p></td><td><p>The problematic changes to the Tenant API Service are reverted, and Dashboard Availability returns to 100%. <b>IMPACT END</b></p></td></tr></table>
    <div>
      <h3>Dashboard availability</h3>
      <a href="#dashboard-availability">
        
      </a>
    </div>
    <p>The Cloudflare dashboard was severely impacted throughout the full duration of the incident.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2OI52n6YrKdSBxQIShw5al/3ecb621d4e6a56d43699d16e0e984055/BLOG-3011_1.png" />
          </figure>
    <div>
      <h3>API availability</h3>
      <a href="#api-availability">
        
      </a>
    </div>
    <p>The Cloudflare API was severely impacted for two periods during the incident when the Tenant API Service was down.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/19udRtEzYji4dSRwljaYxP/44f42a9cbc92cba22fdead28c177da21/BLOG-3011_2.png" />
          </figure>
    <div>
      <h2>How we responded</h2>
      <a href="#how-we-responded">
        
      </a>
    </div>
    <p>Our first goal in an incident is to restore service.  Often that involves fixing the underlying issue directly, but not always.  In this case, we noticed increased usage across our Tenant Service, so we focused on reducing the load and increasing the available resources.  We installed a global rate limit on the Tenant Service to help regulate the load (a simplified sketch of such a limiter is shown below).  The Tenant Service is a Go process that runs on Kubernetes in a subset of our datacenters.  We also increased the number of pods available to help improve throughput.  While we did this, we had others on the team continue to investigate why we were seeing the unusually high usage.  Ultimately, increasing the resources available to the Tenant Service helped with availability but was insufficient to restore normal service.</p>
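    <p>For illustration only, here is a minimal token-bucket sketch of the kind of global rate limit described above. It is not the Tenant Service’s actual code (the real service is written in Go, and the production limit was applied as a published rate-limiting rule rather than in application code), and the numbers are placeholders:</p>
    <pre><code>// Minimal token-bucket sketch of a global rate limit, for illustration only.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity;
  }

  allow(): boolean {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true; // under the limit: let the request through
    }
    return false; // over the global limit: shed load instead of falling over
  }
}

const globalLimit = new TokenBucket(500, 500); // placeholder: roughly 500 requests/second

export function handle(_request: Request): Response {
  if (!globalLimit.allow()) {
    return new Response("rate limited", { status: 429 });
  }
  return new Response("ok"); // normal request handling would go here
}
</code></pre>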
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/UOY1fEUaSzxRE6tNrsBPu/fd02638a5d2e37e47f5c9a9888b5eac3/BLOG-3011_3.png" />
          </figure><p>After the Tenant Service began reporting healthy again and the API largely recovered, we still observed a considerable number of errors being reported from the service. We theorized that these were responsible for the ongoing dashboard availability issues and made a patch to the service with the expectation that it would improve API health and restore the dashboard to a healthy state. Ultimately, this change degraded service further and was quickly reverted. The second outage can be seen in the graph above.</p><p>It’s painful to have an outage like this.  That said, there were a few things that helped lessen the impact.  Our automatic alerting service quickly identified the correct people to join the call and start working on remediation.  Additionally, this was a failure in the control plane, which has strict separation of concerns from the data plane.  Thus, the outage did not affect services on Cloudflare’s network.  The majority of Cloudflare users were unaffected unless they were making configuration changes or using our dashboard.</p>
    <div>
      <h2>Going forward</h2>
      <a href="#going-forward">
        
      </a>
    </div>
    <p>We believe it’s important to learn from our mistakes, and this incident is an opportunity to make some improvements.  Those improvements can be categorized either as ways to reduce or eliminate the impact of a similar change, or as improvements to our observability tooling to better inform the team during future events.</p>
    <div>
      <h3>Reducing impact</h3>
      <a href="#reducing-impact">
        
      </a>
    </div>
    <p>We use Argo Rollouts for releases; it monitors a deployment for errors and automatically rolls the service back when an error is detected.  We’ve been migrating our services over to Argo Rollouts but have not yet updated the Tenant Service to use it.  Had it been in place, we would have automatically rolled back the second Tenant Service update, limiting the second outage.  This work had already been scheduled by the team, and we’ve increased the priority of the migration.</p><p>When we restarted the Tenant Service, every open dashboard began to re-authenticate with the API at once, which made the API unstable again and caused further dashboard issues.  This pattern is often referred to as a thundering herd: once a resource or service becomes available again, every client tries to use it at the same time.  The pattern is common, but it was amplified by the bug in our dashboard logic.  A hotfix for that bug was released shortly after the impact ended.  We’ll also be introducing changes to the dashboard that add random delays to spread out retries and reduce contention (see the sketch below). </p><p>Finally, the Tenant Service was not allocated sufficient capacity to handle spikes in load like this. We’ve allocated substantially more resources to this service, and we are improving the monitoring so that we will be proactively alerted before this service hits capacity limits.</p>
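    <p>As a sketch of the retry change described above (illustrative only, not the dashboard’s actual retry code), exponential backoff with full jitter spreads retries out so that many clients recovering at the same moment do not return in lockstep:</p>
    <pre><code>// Illustrative sketch: retry a request with exponential backoff and full
// jitter so that clients recovering together do not re-create a thundering
// herd. Not the dashboard's actual retry implementation.
async function fetchWithJitteredRetry(url: string, maxAttempts = 5, baseDelayMs = 250) {
  let lastError: unknown;
  for (let attempt = 0; maxAttempts > attempt; attempt++) {
    try {
      const response = await fetch(url);
      if (response.ok) return response;
      lastError = new Error("HTTP " + response.status);
    } catch (err) {
      lastError = err;
    }
    // Full jitter: sleep a random amount between 0 and the capped backoff.
    const cap = Math.min(10_000, baseDelayMs * 2 ** attempt);
    const delayMs = Math.random() * cap;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw lastError;
}
</code></pre>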
    <div>
      <h3>Improving visibility</h3>
      <a href="#improving-visibility">
        
      </a>
    </div>
    <p>We immediately saw an increase in our API usage but found it difficult to identify which requests were retries versus new requests.  Had we known that we were seeing a sustained, large volume of new requests, it would have been easier to identify the issue as a loop in the dashboard.  We are changing how the dashboard calls our APIs to include additional information, such as whether a request is a retry or a new request.</p><p>We’re very sorry about the disruption.  We will continue to investigate this issue and make improvements to our systems and processes.</p> ]]></content:encoded>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <guid isPermaLink="false">7xKJsK5ZM4e3RrUVd8IVPQ</guid>
            <dc:creator>Tom Lianza</dc:creator>
            <dc:creator>Joaquin Madruga</dc:creator>
        </item>
        <item>
            <title><![CDATA[The impact of the Salesloft Drift breach on Cloudflare and our customers]]></title>
            <link>https://blog.cloudflare.com/response-to-salesloft-drift-incident/</link>
            <pubDate>Tue, 02 Sep 2025 17:10:00 GMT</pubDate>
            <description><![CDATA[ An advanced threat actor, GRUB1, exploited the integration between Salesloft’s Drift chat agent and Salesforce to gain unauthorized access to Salesforce tenants of Cloudflare and many other companies. ]]></description>
            <content:encoded><![CDATA[ <p>On August 23rd, Cloudflare was notified that we (and our customers) were affected by the Salesloft Drift breach. Because of this breach, someone outside Cloudflare gained access to our Salesforce instance, which we use for customer support and internal customer case management, and some of the data it contains. Most of this information is customer contact information and basic support case data, but some customer support interactions may reveal information about a customer's configuration and could contain sensitive information like access tokens. Given that Salesforce support case data contains the contents of support tickets with Cloudflare, any information that a customer may have shared with Cloudflare in our support system—including logs, tokens or passwords—should be considered compromised, and we strongly urge you to rotate any credentials that you may have shared with us through this channel. </p><p>As part of our response to this incident, we did our own search through the compromised data to look for tokens or passwords and found 104 Cloudflare API tokens. We have identified no suspicious activity associated with those tokens, but all of these have been rotated out of an abundance of caution. All customers whose data was compromised in this breach have been informed directly by Cloudflare.</p><p>No Cloudflare services or infrastructure were compromised as a result of this breach.</p><p>We are responsible for the choice of tools we use in support of our business. With this breach, we let our customers down. For that, we sincerely apologize. The rest of this blog gives a detailed timeline and explains how we investigated this breach.</p>
    <div>
      <h2>The Salesloft Drift breach</h2>
      <a href="#the-salesloft-drift-breach">
        
      </a>
    </div>
    <p>Last week, Cloudflare became aware of suspicious activity within our Salesforce tenant and learned that we, as well as hundreds of other companies, had become the target of a threat actor that was able to successfully exfiltrate the text fields of support cases from our Salesforce instance. Our security team immediately began an investigation, cut off the threat actor’s access, and took a number of steps, detailed below, to secure our environment. We are writing this blog to detail what happened, how we responded, and to help our customers and others understand how to protect themselves from this incident.</p><p>Cloudflare uses Salesforce to keep track of who our customers are and how they use our services, and we use it as a support tool to interact with our customers. An important detail to understand as part of this incident is that the threat actor only accessed data in Salesforce “cases,” which may be created when Cloudflare sales and support team members need to comment to each other internally in order to support our customers; they are also created when customers interact with Cloudflare support. Salesforce had an integration with the Salesloft Drift chatbot, which Cloudflare used to give anyone who visited our website a way to contact us.</p><p>As Salesloft has <a href="https://trust.salesloft.com/?uid=Drift%2FSalesforce+Security+Update"><u>announced</u></a>, a threat actor breached their systems. As part of the breach, the threat actor was able to obtain OAuth credentials associated with the Salesloft Drift chat agent’s Salesforce integration to exfiltrate data from Salesloft customers’ Salesforce instances. Our investigation revealed that this was part of a sophisticated supply chain attack targeting business-to-business third-party integrations, affecting hundreds of organizations globally that were customers of Salesloft. <a href="https://www.cloudflare.com/en-au/application-services/products/cloudforceone/"><u>Cloudforce One</u></a>—Cloudflare’s threat intelligence &amp; research team—has classified the advanced threat actor as <b>GRUB1</b>. Additional disclosures from <a href="https://cloud.google.com/blog/topics/threat-intelligence/data-theft-salesforce-instances-via-salesloft-drift"><u>Google’s Threat Intelligence Group</u> </a>aligned with the activity we observed in our environment.</p><p>Our investigation showed the threat actor compromised and exfiltrated data from our Salesforce tenant between August 12-17, 2025, following initial reconnaissance observed on August 9, 2025. A detailed analysis confirmed the exposure was limited to Salesforce case objects, which primarily consist of customer support tickets and their associated data within our Salesforce tenant. These case objects contain customer contact information related to the support case, case subject lines, and the body of the case correspondence—but <i>not</i> any attachments to the cases. Cloudflare does not request or require customers to share secrets, credentials, or API keys in support cases. However, in some troubleshooting scenarios, customers may paste keys, logs, or other sensitive information into the case text fields. Anything shared through this channel should now be considered compromised.</p><p>We believe this incident was not an isolated event but that the threat actor intended to harvest credentials and customer information for future attacks. 
Given that hundreds of organizations were affected through this Drift compromise, we suspect the threat actor will use this information to launch targeted attacks against customers across the affected organizations. </p><p>This post provides a timeline of the attack, details our response, and offers security recommendations to help other organizations mitigate similar threats.</p><p><i>Throughout this blog post, all dates and times are in UTC.</i></p>
    <div>
      <h2>Cloudflare's response and remediation</h2>
      <a href="#cloudflares-response-and-remediation">
        
      </a>
    </div>
    <p>When Salesforce and Salesloft notified us on August 23, 2025, that the Drift integration had been abused across multiple organizations, including Cloudflare, we immediately launched a company-wide Security Incident Response. We activated cross-functional teams, pulling together experts from Security, IT, Product, Legal, Communications, and business leadership under a single, unified incident command structure.</p><p>We set up four clear priority workstreams with the goal to protect our customers and Cloudflare:</p><ol><li><p><b>Immediate Threat Containment:</b> We cut off all threat actor access by disabling the compromised Drift integration, conducted forensic analysis to understand the scope of the compromise, and eliminated the active threat from our environment.</p></li><li><p><b>Secure our third-party ecosystem</b>: We immediately disconnected all third-party integrations from Salesforce. We issued new secrets for all services and implemented a new process to rotate them weekly.</p></li><li><p><b>Safeguard the integrity of our wider systems</b>: We expanded credential rotation to all our third-party Internet services and accounts as a precautionary measure to prevent the attacker from using compromised data to access other Cloudflare systems.</p></li><li><p><b>Customer Impact Analysis:</b> We analyzed our Salesforce case objects data to identify whether customers could be compromised and to ensure they received timely and accurate communication about potential exposure.</p></li></ol>
    <div>
      <h2>Attack timeline &amp; Cloudflare response</h2>
      <a href="#attack-timeline-cloudflare-response">
        
      </a>
    </div>
    <p>Our forensic investigation reconstructed the threat actor’s activities against Cloudflare, which occurred between August 9 and August 17, 2025. The following is a chronological summary of the threat actor's actions, including reconnaissance prior to the initial compromise.</p>
    <div>
      <h3>August 9, 2025: First signs of reconnaissance</h3>
      <a href="#august-9-2025-first-signs-of-reconnaissance">
        
      </a>
    </div>
    <p>At <b>11:51</b>, GRUB1 attempted to validate a customer’s Cloudflare-issued API token against the Cloudflare API. The actor used Trufflehog (a popular open-source secrets scanner) as their User-Agent and sent a verification request to <code>client/v4/user/tokens/verify</code>. The request failed with a <code>404 Not Found</code>, confirming the token was invalid. The source of this API token is unclear—it could have been obtained from various sources, including other Drift customers that GRUB1 may have compromised prior to Cloudflare. </p>
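    <p>The endpoint the actor probed is the standard Cloudflare token verification endpoint, which any token holder can call. As a rough illustration (error handling omitted, not a reproduction of the attacker's tooling), a verification request looks like this; a live token returns HTTP 200, while the invalid token GRUB1 tried returned a 404 as described above:</p>
    <pre><code>// Rough illustration of verifying a Cloudflare API token against the same
// endpoint the threat actor probed with Trufflehog.
async function verifyCloudflareToken(token: string) {
  const response = await fetch("https://api.cloudflare.com/client/v4/user/tokens/verify", {
    headers: { Authorization: "Bearer " + token },
  });
  const body = await response.json();
  // The v4 API wraps results in an envelope with a boolean `success` field.
  return { httpStatus: response.status, valid: body.success === true };
}
</code></pre>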
    <div>
      <h3>August 12, 2025: Initial compromise of Cloudflare</h3>
      <a href="#august-12-2025-initial-compromise-of-cloudflare">
        
      </a>
    </div>
    <p>At <b>22:14</b>, GRUB1 gained access to Cloudflare’s Salesforce tenant using a stolen credential belonging to the Salesloft Drift integration. With this credential, GRUB1 logged in from the IP address <code>44[.]215[.]108[.]109</code> and made a GET request to the <code>/services/data/v58.0/sobjects/</code> API endpoint. This action appeared to enumerate all objects in our Salesforce environment, giving the threat actor a high-level overview of the data stored there.</p>
    <div>
      <h3>August 13, 2025: Expanding reconnaissance</h3>
      <a href="#august-13-2025-expanding-reconnaissance">
        
      </a>
    </div>
    <p>One day after the initial breach, the threat actor GRUB1 launched a subsequent attack from the same IP address, <code>44[.]215[.]108[.]109</code>. Starting at <b>19:33</b>, the threat actor stole customer data from the Salesforce case objects. They first re-ran an object enumeration to confirm the data structure, then immediately retrieved the case objects’ schema using the <code>/sobjects/Case/describe/</code> endpoint. This was followed by a broad Salesforce query that enumerated fields from the Salesforce case object.</p>
    <div>
      <h3>August 14, 2025: Understanding our Salesforce environment</h3>
      <a href="#august-14-2025-understanding-our-salesforce-environment">
        
      </a>
    </div>
    <p>The threat actor GRUB1 spent hours conducting comprehensive reconnaissance of Cloudflare's Salesforce tenant from the IP address <code>44[.]215[.]108[.]109</code>. It appears their objective was to build an understanding of our environment. They executed a series of targeted <a href="#detailed-event-timeline">queries</a>:</p><ul><li><p><b>00:17 - </b>Measured the tenant's scale by counting accounts, contacts, and users; </p></li><li><p><b>04:34 - </b>Analyzed case workflows by querying CaseTeamMemberHistory; and </p></li><li><p><b>11:09 - </b>Confirmed they were in a production environment by fingerprinting the Organization object. </p></li></ul><p>The threat actor completed their reconnaissance with additional queries to understand how our customer support system operates—including how team members handle different types of cases, how cases are assigned and escalated, and how our support processes work—and then queried the /limits/ endpoint to learn the API's operational thresholds. The queries run by GRUB1 provided them with insight into their level of access, the size of the case objects, and the precise API limits they needed to respect to avoid detection within our Salesforce environment.</p>
    <div>
      <h3>August 16, 2025: Preparing for the operation</h3>
      <a href="#august-16-2025-preparing-for-the-operation">
        
      </a>
    </div>
    <p>Following the reconnaissance on August 14, 2025, we observed no traffic or successful logins from the threat actor GRUB1 for nearly 48 hours.  </p><p>They returned on August 16, 2025. At <b>19:26</b>, GRUB1 logged back into Cloudflare’s Salesforce tenant from the IP address <code>44[.]215[.]108[.]109</code> and, at <b>19:28</b>, executed a single, final query: <code>SELECT COUNT() FROM Case</code>. This action served as a final "dry run" to verify the exact size of the dataset they were about to steal, marking the definitive end of the reconnaissance phase and setting the stage for the main attack. </p>
    <div>
      <h3>August 17, 2025: Final exfiltration and coverup</h3>
      <a href="#august-17-2025-final-exfiltration-and-coverup">
        
      </a>
    </div>
    <p>GRUB1 initiated the data exfiltration phase by switching to new infrastructure, logging in at <b>11:11:23</b> from the IP address <code>208[.]68[.]36[.]90</code>. After performing one final check on the size of the case object, they launched a Salesforce Bulk API 2.0 job at <b>11:11:56</b>. In just over three minutes, they successfully exfiltrated a dataset containing the text of support cases—but not any attachments or files—in our instance of Salesforce. At <b>11:15:42</b>, GRUB1 attempted to cover their tracks by deleting the API job. While this action concealed the primary evidence, our team was able to reconstruct the attack from residual logs. </p><p>We observed no further activity from this threat actor after August 17, 2025.</p>
    <div>
      <h3>August 20, 2025: Vendor action ahead of notification</h3>
      <a href="#august-20-2025-vendor-action-ahead-of-notification">
        
      </a>
    </div>
    <p>Salesloft revoked Drift-to-Salesforce connections across its customer base and published a notice on their website. At that point, Cloudflare had not yet been notified, and we had no indication that this vendor action might relate to our environment.</p>
    <div>
      <h3>August 23, 2025: Salesforce and Salesloft notifications to Cloudflare</h3>
      <a href="#august-23-2025-salesforce-and-salesloft-notifications-to-cloudflare">
        
      </a>
    </div>
    <p>Our response to this incident began when Salesforce and Salesloft notified us of unusual Drift-related activity.  We promptly implemented the vendors' recommended containment steps and engaged them to gather intelligence. </p>
    <div>
      <h3>August 25, 2025: Cloudflare initiates response activity</h3>
      <a href="#august-25-2025-cloudflare-initiates-response-activity">
        
      </a>
    </div>
    <p>By August 25, we had received additional intelligence about the incident and escalated our response beyond the initial vendor-recommended containment steps. We launched our own comprehensive investigation and remediation effort.</p><p>Our first priority was cutting off GRUB1's access at the source. We disabled the Drift user account, revoked its client ID and secrets, and completely purged all Salesloft software and browser extensions from Cloudflare systems. This comprehensive removal mitigated the risk of the threat actor reusing compromised tokens, regaining access through stale sessions, or leveraging software extensions for persistence.

Separately, we expanded our security review to include all third-party services connected to our Salesforce environment, rotating credentials as a precautionary measure to prevent any potential lateral movement by the threat actor. </p><p>Since we use Salesforce as our primary tool for managing our customer support data, the risk was that customers had submitted secrets, passwords, or other sensitive data in their customer service requests. We needed to understand what sensitive material the attacker now had. </p><p>We immediately focused on whether any of that data could have been used to compromise our customers’ accounts, systems, or infrastructure. We examined the data obtained by the threat actor to see if it contained exposed credentials, since cases include freeform text fields where customers may submit Cloudflare API tokens, keys, or logs to our support team. Our teams developed custom scanning tools using regex, entropy, and pattern-matching techniques to detect likely Cloudflare secrets at scale (a simplified sketch of this approach follows below). </p><p>Our investigation confirmed that the exposure was strictly limited to the freeform text in Salesforce case objects—not attachments or files. Cases are used by sales and support teams to communicate internally about customer support issues and to communicate directly with customers. As a result, these case objects contained text-only data consisting of:</p><ul><li><p>The subject line of the Salesforce case</p></li><li><p>The body of the case (freeform text which may include any correspondence including keys, secrets, etc., if provided by the customer to Cloudflare)</p></li><li><p>Customer contact information (for example, company name, requestor email address and phone number, company domain name, and company country)</p></li></ul><p>This conclusion was validated through extensive reviews of integrations, authentication activity, endpoint telemetry, and network logs.</p>
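    <p>As a rough sketch of that scanning approach (illustrative only; the internal tooling is more involved and tuned to Cloudflare token formats, and the thresholds here are placeholders), pattern matching pulls out token-like strings and a Shannon-entropy check separates random-looking secrets from ordinary prose:</p>
    <pre><code>// Simplified sketch of regex-plus-entropy secret detection, for illustration only.

// Shannon entropy in bits per character: random-looking tokens score high,
// ordinary English words score low.
function shannonEntropy(s: string): number {
  const counts: { [ch: string]: number } = {};
  for (const ch of s) counts[ch] = (counts[ch] ?? 0) + 1;
  let bits = 0;
  for (const ch in counts) {
    const p = counts[ch] / s.length;
    bits -= p * Math.log2(p);
  }
  return bits;
}

// Candidate pattern: long runs of token-like characters (40-character
// Cloudflare API tokens match, as do many other credential formats).
const candidatePattern = /[A-Za-z0-9_-]{32,}/g;

export function findLikelySecrets(caseText: string, entropyThreshold = 4.0): string[] {
  const matches = caseText.match(candidatePattern) ?? [];
  return matches.filter((candidate) => shannonEntropy(candidate) > entropyThreshold);
}
</code></pre>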
    <div>
      <h3>August 26–29, 2025: Scaling the response and proactive measures</h3>
      <a href="#august-26-29-2025-scaling-the-response-and-proactive-measures">
        
      </a>
    </div>
    <p>While the primary Salesforce and Salesloft credentials had already been rotated, our next step was to terminate and securely re-establish our third-party integrations. We began methodically re-onboarding the terminated services, ensuring each was provisioned with new secrets and subject to stricter security controls.</p><p>Meanwhile, our teams continued to analyze the data that was exfiltrated. Based on the analysis, we triaged and validated potential exposures, operating under the principle that any data that could have been exposed was examined. This enabled us to take direct action by rotating Cloudflare platform-issued tokens immediately upon discovery—a total of 104 API tokens were rotated. No suspicious activity has been identified related to those tokens. </p>
    <div>
      <h3>September 2, 2025: Customers notified</h3>
      <a href="#september-2-2025-customers-notified">
        
      </a>
    </div>
    <p>Based on Cloudflare’s detailed analysis, all impacted customers were formally notified via email and banner notices in our Dashboard with information about the incident and recommended next steps. </p>
    <div>
      <h2>Recommendations for all organizations</h2>
      <a href="#recommendations-for-all-organizations">
        
      </a>
    </div>
    <p>This incident highlights the critical need for heightened vigilance in securing SaaS applications and other third-party integrations. The data compromised across hundreds of companies targeted in this attack could be used to launch additional attacks. We strongly urge all organizations to adopt the following security measures:</p><ul><li><p><b>Disconnect Salesloft and its applications:</b> Immediately disconnect all Salesloft connections from your Salesforce environment and uninstall any related software or browser extensions.</p></li><li><p><b>Rotate credentials:</b> Reset the credentials for all third-party applications and integrations connected to your Salesforce instance. Rotate any credentials that may have been previously shared in a support case with Cloudflare. Based on the scope and intent of this attack, we also recommend rotating all third-party credentials in your environment, as well as any credentials that may have been included in a support case with any other vendor. </p></li><li><p><b>Implement frequent credential rotation:</b> Establish a regular rotation schedule for all API keys and other secrets used in your integrations to reduce the window of exposure.</p></li><li><p><b>Review support case data:</b> Review all customer support case data with your third-party providers to identify what sensitive information may have been exposed. Look for cases containing credentials, API keys, configuration details, or other sensitive data that customers may have shared. For Cloudflare customers specifically: you can access your support case history through the Cloudflare dashboard under <code>Support &gt; Get Help &gt; Technical Support &gt; My Activities</code>, where you can filter cases or use the "Download Cases" feature to conduct a comprehensive review.</p></li><li><p><b>Conduct forensics:</b> Review access logs and permissions for all third-party integrations, review <a href="https://cloud.google.com/blog/topics/threat-intelligence/data-theft-salesforce-instances-via-salesloft-drift"><u>public materials</u></a> associated with the Drift incident, and conduct a security review of your environment as appropriate.</p></li><li><p><b>Enforce least privilege:</b> Audit all third-party applications to ensure they operate with the minimum level of access (least privilege) required for their function, and ensure that admin accounts are not used for vendors. Additionally, enforce strict controls like IP address restrictions and session binding on all third-party and business-to-business (B2B) connections.</p></li><li><p><b>Enhance monitoring and controls:</b> Deploy enhanced monitoring to detect anomalies such as large data exports or logins from unfamiliar locations. While capturing third-party-to-third-party logs can be difficult, it is imperative that these logs are available to your security operations teams.</p></li></ul>
    <div>
      <h2>Indicators of compromise</h2>
      <a href="#indicators-of-compromise">
        
      </a>
    </div>
    <p>Below are the Indicators of Compromise (IOCs) that we saw from GRUB1. We are publishing them so that other organizations, and especially those that may have been impacted by the Salesloft breach, can search their logs to confirm the same threat actor did not access their systems or third parties. A simple sketch of such a log search follows the table.</p><table><tr><th><p><b>Indicator</b></p></th><th><p><b>Type</b></p></th><th><p><b>Description</b></p></th></tr><tr><td><p>208[.]68[.]36[.]90</p></td><td><p>IPV4</p></td><td><p>DigitalOcean-based infrastructure </p></td></tr><tr><td><p>44[.]215[.]108[.]109</p></td><td><p>IPV4</p></td><td><p>AWS-based infrastructure </p></td></tr><tr><td><p>TruffleHog</p></td><td><p>User Agent</p></td><td><p>Open-source secret scanning tool</p></td></tr><tr><td><p>Salesforce-Multi-Org-Fetcher/1.0</p></td><td><p>User Agent</p></td><td><p>User-Agent string linked to malicious tooling</p></td></tr><tr><td><p>Salesforce-CLI/1.0</p></td><td><p>User Agent</p></td><td><p>Salesforce Command Line Interface (CLI)</p></td></tr><tr><td><p>python-requests/2.32.4</p></td><td><p>User Agent</p></td><td><p>User agent that may indicate custom scripting </p></td></tr><tr><td><p>Python/3.11 aiohttp/3.12.15</p></td><td><p>User Agent</p></td><td><p>User agent which may allow many API calls in parallel</p></td></tr></table>
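    <p>As a hedged sketch of how those indicators could be checked against exported logs (the field names <code>sourceIp</code> and <code>userAgent</code> are placeholders; adapt them to your own log schema):</p>
    <pre><code>// Illustrative only: scan newline-delimited JSON log records for the IOCs
// listed above. Field names are placeholders for whatever your log schema uses.
// Generic user agents (python-requests, aiohttp) will also match plenty of
// benign traffic, so treat matches as leads to investigate, not verdicts.
const iocIps = new Set(["208.68.36.90", "44.215.108.109"]);
const iocUserAgents = [
  "TruffleHog",
  "Salesforce-Multi-Org-Fetcher/1.0",
  "python-requests/2.32.4",
  "Python/3.11 aiohttp/3.12.15",
];

export function matchesIoc(line: string): boolean {
  let record: { sourceIp?: string; userAgent?: string } = {};
  try {
    record = JSON.parse(line);
  } catch {
    return false; // skip records that are not JSON
  }
  if (record.sourceIp !== undefined) {
    if (iocIps.has(record.sourceIp)) return true;
  }
  const ua = record.userAgent ?? "";
  return iocUserAgents.some((indicator) => ua.includes(indicator));
}
</code></pre>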
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We are responsible for the tools that we select and when those tools are compromised by sophisticated threat actors, we own the consequences. Our team responded to the notice, and our investigation confirmed that the impact was strictly limited to data in Salesforce case objects, with no compromise of other Cloudflare systems or infrastructure.</p><p>That said, we consider the compromise of <i>any</i> data to be unacceptable. Our customers trust Cloudflare with their data, their infrastructure, and their security. In turn, we sometimes place our trust in third-party tools which need to be monitored and carefully scoped in what they can access. We are responsible for this. We let our customers down. For this, we sincerely apologize.</p><p>As third-party tools increasingly integrate with internal corporate data across the industry, we need to approach each new tool with careful scrutiny. This incident affected hundreds of organizations through a single integration point, highlighting the interconnected risks in today's technology landscape. We are committed to developing new capabilities to help us and our customers defend against such attacks in the future—stay tuned for announcements during Cloudflare's Birthday Week later this month.</p><p>We are also committed to sharing threat intelligence and research with the broader security community. In the weeks ahead, our Cloudforce One team will publish an in-depth blog analyzing GRUB1’s tradecraft to support the broader community in defending against similar campaigns.</p>
    <div>
      <h2>Detailed event timeline</h2>
      <a href="#detailed-event-timeline">
        
      </a>
    </div>
    <p>The following table provides a granular, chronological view of GRUB1's specific actions during the incident.</p><table><tr><th><p><b>Date/Time (UTC)</b></p></th><th><p><b>Event Description</b></p></th></tr><tr><td><p><b>2025-08-09 11:51:13</b></p></td><td><p>GRUB1 observed leveraging Trufflehog and attempting to verify a token against a Cloudflare Customer Tenant: <code>client/v4/user/tokens/verify</code>, and received a 404 error from <code>44[.]215[.]108[.]109</code></p></td></tr><tr><td><p><b>2025-08-12 22:14:08</b></p></td><td><p>GRUB1 logged into Cloudflare’s Salesforce tenant from <code>44[.]215[.]108[.]109</code></p></td></tr><tr><td><p><b>2025-08-12 22:14:09</b></p></td><td><p>GRUB1 sent a <code>GET</code> request for a list of objects in Cloudflare’s Salesforce tenant: <code>/services/data/v58.0/sobjects/</code></p></td></tr><tr><td><p><b>2025-08-13 19:33:02</b></p></td><td><p>GRUB1 logged into Cloudflare’s Salesforce tenant from <code>44[.]215[.]108[.]109</code></p></td></tr><tr><td><p><b>2025-08-13 19:33:03</b></p></td><td><p>GRUB1 sent a <code>GET</code> request for a list of objects in Cloudflare's Salesforce tenant: <code>/services/data/v58.0/sobjects/</code></p></td></tr><tr><td><p><b>2025-08-13 19:33:07 and 19:33:09</b></p></td><td><p>GRUB1 sent a <code>GET</code> request for metadata information for case in Cloudflare’s Salesforce tenant: <code>/services/data/v58.0/sobjects/Case/describe/</code></p></td></tr><tr><td><p><b>2025-08-13 19:33:11</b></p></td><td><p>GRUB1 first observed executing Salesforce query: A broad query against the case object by <code>44[.]215[.]108[.]109</code>. This produced one of the earliest and larger data responses, consistent with reconnaissance via bulk record retrieval</p></td></tr><tr><td><p><b>2025-08-14 0:17:40</b></p></td><td><p>GRUB1 lists available objects and counts <code>“Account”</code>, <code>“Contact”</code> and <code>“User”</code> objects.</p></td></tr><tr><td><p><b>2025-08-14 00:17:47</b></p></td><td><p>GRUB1 queried Account table in Cloudflare’s Salesforce tenant: <code>“SELECT COUNT() FROM Account”</code> query on Cloudflare’s Salesforce tenant</p></td></tr><tr><td><p><b>2025-08-14 00:17:51</b></p></td><td><p>GRUB1 queried Contact table in Cloudflare’s Salesforce tenant: <code>“SELECT COUNT() FROM Contact”</code> query on Cloudflare’s Salesforce tenant</p></td></tr><tr><td><p><b>2025-08-14 00:18:00</b></p></td><td><p>GRUB1 queried User table in Cloudflare’s Salesforce tenant: <code>“SELECT COUNT() FROM User”</code> query on Cloudflare’s Salesforce tenant</p></td></tr><tr><td><p><b>2025-08-14 04:34:39</b></p></td><td><p>GRUB1 queried "CaseTeamMemberHistory” in Cloudflare’s Salesforce tenant: <code>“SELECT Id, IsDeleted, Name, CreatedDate, CreatedById, LastModifiedDate, LastModifiedById, SystemModstamp, LastViewedDate, LastReferencedDate, Case__c FROM CaseTeamMemberHistory__c LIMIT 5000”</code></p></td></tr><tr><td><p><b>2025-08-14 11:09:14</b></p></td><td><p>GRUB1 queried Organization table in Cloudflare’s Salesforce tenant: <code>“SELECT Id, Name, OrganizationType, InstanceName, IsSandbox FROM Organization LIMIT 1”</code></p></td></tr><tr><td><p><b>2025-08-14 11:09:21</b></p></td><td><p>GRUB1 queried User table in Cloudflare’s Salesforce tenant: <code>“SELECT Id, Username, Email, FirstName, LastName, Name, Title, CompanyName, Department, Division, Phone, MobilePhone, IsActive, LastLoginDate, CreatedDate, LastModifiedDate, TimeZoneSidKey, LocaleSidKey, LanguageLocaleKey, EmailEncodingKey FROM User WHERE IsActive = :x ORDER 
BY LastLoginDate DESC NULLS LAST LIMIT 20”</code></p></td></tr><tr><td><p><b>2025-08-14 11:09:22</b></p></td><td><p>GRUB1 sent a <code>GET</code> request on LimitSnapshot in Cloudflare’s Salesforce tenant: <code>/services/data/v58.0/limits/</code></p></td></tr><tr><td><p><b>2025-08-16 19:26:37</b></p></td><td><p>GRUB1 logged into Cloudflare’s Salesforce tenant from  <code>44[.]215[.]108[.]109</code></p></td></tr><tr><td><p><b>2025-08-16 19:28:08</b></p></td><td><p>GRUB1 queried Cases table in Cloudflare’s Salesforce tenant: <code>SELECT COUNT() FROM</code> Case</p></td></tr><tr><td><p><b>2025-08-17 11:11:23</b></p></td><td><p>GRUB1 logged into Cloudflare’s Salesforce tenant from <code>208[.]68[.]36[.]90</code></p></td></tr><tr><td><p><b>2025-08-17 11:11:55</b></p></td><td><p>GRUB1 queried Case table in Cloudflare’s Salesforce tenant: <code>SELECT COUNT() FROM </code>Case</p></td></tr><tr><td><p><b>2025-08-17 11:11:56 to 11:15:18</b></p></td><td><p>GRUB1 leveraged Salesforce BulkAPI 2.0 from <code>208[.]68[.]36[.]90</code> to <b>execute</b> a job to exfiltrate the Cases object </p></td></tr><tr><td><p><b>2025-08-17 11:15:42</b></p></td><td><p>GRUB1 leveraged Salesforce Bulk API 2.0 from <code>208[.]68[.]36[.]90</code> to<b> delete </b>the recently executed job used to exfiltrate the <i>C</i>ases object</p></td></tr></table><p></p> ]]></content:encoded>
            <category><![CDATA[Post Mortem]]></category>
            <guid isPermaLink="false">4769G2MCNVfvZe8a8Rr8bv</guid>
            <dc:creator>Sourov Zaman</dc:creator>
            <dc:creator>Craig Strubhart</dc:creator>
            <dc:creator>Grant Bourzikas</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare incident on August 21, 2025]]></title>
            <link>https://blog.cloudflare.com/cloudflare-incident-on-august-21-2025/</link>
            <pubDate>Fri, 22 Aug 2025 00:58:00 GMT</pubDate>
            <description><![CDATA[ On August 21, 2025, an influx of traffic directed toward clients hosted in AWS us-east-1 caused severe congestion on links between Cloudflare and us-east-1. In this post, we explain the details. ]]></description>
            <content:encoded><![CDATA[ <p>On August 21, 2025, an influx of traffic directed toward clients hosted in the Amazon Web Services (AWS) us-east-1 region caused severe congestion on links between Cloudflare and AWS us-east-1. Many users connecting to or receiving connections from Cloudflare via servers in AWS us-east-1 experienced high latency, packet loss, and failed connections to origins.</p><p>Customers with origins in AWS us-east-1 began experiencing impact at 16:27 UTC. The impact was substantially reduced by 19:38 UTC, with intermittent latency increases continuing until 20:18 UTC.</p><p>This was a regional problem between Cloudflare and AWS us-east-1, and global Cloudflare services were not affected. The degradation in performance was limited to traffic between Cloudflare and AWS us-east-1. The incident was a result of a surge of traffic from a single customer that overloaded Cloudflare's links with AWS us-east-1. It was a network congestion event, not an attack or a BGP hijack.</p><p>We’re very sorry for this incident. In this post, we explain what the failure was, why it occurred, and what we’re doing to make sure this doesn’t happen again.</p>
    <div>
      <h2>Background</h2>
      <a href="#background">
        
      </a>
    </div>
    <p>Cloudflare helps anyone to build, connect, protect, and accelerate their websites on the Internet. Most customers host their websites on origin servers that Cloudflare does not operate. To make their sites fast and secure, they put Cloudflare in front as a <a href="https://www.cloudflare.com/learning/cdn/glossary/reverse-proxy/"><u>reverse proxy</u></a>. </p><p>When a visitor requests a page, Cloudflare will first inspect the request. If the content is already cached on Cloudflare’s global network, or if the customer has configured Cloudflare to serve the content directly, Cloudflare will respond immediately, delivering the content without contacting the origin. If the content cannot be served from cache, we fetch it from the origin, serve it to the visitor, and cache it along the way (if it is eligible). The next time someone requests that same content, we can serve it directly from cache instead of making another round trip to the origin server. </p><p>When Cloudflare responds to a request with the cached content, it will send the response traffic over internal Data Center Interconnect (DCI) links through a series of network equipment and eventually reach the routers that represent our network edge (our “edge routers”) as shown below:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/23A3EjLWZ9Z9EW6jRejDi2/3febaedc062c61031d38de91b215b363/BLOG-2938_2.png" />
          </figure><p>Our internal network capacity is designed to be larger than the available traffic demand in a location to account for failures of redundant links, failover from other locations, traffic engineering within or between networks, or even traffic surges from users. The majority of Cloudflare’s network links were operating normally, but some edge router links to an AWS peering switch had insufficient capacity to handle this particular surge. </p>
    <div>
      <h2>What happened</h2>
      <a href="#what-happened">
        
      </a>
    </div>
    <p>At approximately 16:27 UTC on August 21, 2025, a customer started sending many requests from AWS us-east-1 to Cloudflare for objects in Cloudflare’s cache. These requests generated a volume of response traffic that saturated all available direct peering connections between Cloudflare and AWS. This initial saturation became worse when AWS, in an effort to alleviate the congestion, withdrew some BGP advertisements to Cloudflare over some of the congested links. This action rerouted traffic to an additional set of peering links connected to Cloudflare via an offsite network interconnection switch, which subsequently also became saturated, leading to significant performance degradation. The impact was compounded for two reasons: one of the direct peering links was operating at half capacity due to a pre-existing failure, and the Data Center Interconnect (DCI) that connected Cloudflare’s edge routers to the offsite switch was due for a capacity upgrade. The diagram below illustrates this using approximate capacity estimates:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6lQgbq0PNeaeC3R9i8J5fV/d4a6df7b17d30ec33b6c4ea69bae61eb/BLOG-2938_3.png" />
          </figure><p>In response, our incident team immediately engaged with our partners at AWS to address the issue. Through close collaboration, we successfully alleviated the congestion and fully restored services for all affected customers.</p>
    <div>
      <h2>Timeline</h2>
      <a href="#timeline">
        
      </a>
    </div>
    <table><tr><th><p><b>Time</b></p></th><th><p><b>Description</b></p></th></tr><tr><td><p>2025-08-21 16:27 UTC</p></td><td><p>Traffic surge for a single customer begins, doubling total traffic from Cloudflare to AWS</p><p><b>IMPACT START</b></p></td></tr><tr><td><p>2025-08-21 16:37 UTC</p></td><td><p>AWS begins withdrawing prefixes from Cloudflare on congested PNI (Private Network Interconnect) BGP sessions</p></td></tr><tr><td><p>2025-08-21 16:44 UTC</p></td><td><p>Network team is alerted to internal congestion in Ashburn (IAD)</p></td></tr><tr><td><p>2025-08-21 16:45 UTC</p></td><td><p>Network team evaluates options for response, but AWS prefixes are unavailable on paths that are not congested due to the withdrawals</p></td></tr><tr><td><p>2025-08-21 17:22 UTC</p></td><td><p>AWS BGP prefix withdrawals result in a higher amount of dropped traffic</p><p><b>IMPACT INCREASE</b></p></td></tr><tr><td><p>2025-08-21 17:45 UTC</p></td><td><p>Incident is raised for customer impact in Ashburn (IAD)</p></td></tr><tr><td><p>2025-08-21 19:05 UTC</p></td><td><p>Rate limiting of the single customer causing the traffic surge decreases congestion</p></td></tr><tr><td><p>2025-08-21 19:27 UTC</p></td><td><p>Additional traffic engineering actions by the network team fully resolve congestion</p><p><b>IMPACT DECREASE</b></p></td></tr><tr><td><p>2025-08-21 19:45 UTC</p></td><td><p>AWS begins reverting BGP withdrawals as requested by Cloudflare</p></td></tr><tr><td><p>2025-08-21 20:07 UTC</p></td><td><p>AWS finishes normalizing BGP prefix announcements to Cloudflare over IAD PNIs</p></td></tr><tr><td><p>2025-08-21 20:18 UTC</p></td><td><p><b>IMPACT END</b></p></td></tr></table><p>When impact started, we saw a significant amount of traffic related to one customer, resulting in congestion:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3pfUKAP3TfVgUseokIKnvf/114c19e3b12c59da980a2d89a719b7db/BLOG-2938_4.png" />
          </figure><p>This was handled by manual traffic actions from both Cloudflare and AWS. You can see some of the attempts by AWS to alleviate the congestion by looking at the number of IP prefixes AWS advertised to Cloudflare over the course of the outage. The lines in different colors correspond to the number of prefixes advertised per BGP session with us. The dips indicate AWS withdrawing prefixes from the BGP sessions in an attempt to steer traffic elsewhere:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6WRQYJDJMUeh1ghWCLFvsa/df1e27124fc975e287c6504f0945a2ca/BLOG-2938_5.png" />
          </figure><p>The congestion in the network caused network queues on the routers to grow significantly and begin dropping packets. Our edge routers were dropping high priority packets consistently during the outage, as seen in the chart below, which shows the queue drops for our Ashburn routers during the impact period:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4zTkZJ5ZwDSIPHD5Wj19Vi/fc7e144ea7cb90b9f4705342989c3669/BLOG-2938_6.png" />
          </figure><p>
The primary impact to customers as a result of this congestion would have been latency, loss (timeouts), or low throughput. We have a set of latency Service Level Objectives (SLOs), defined by measurements that imitate customer requests back to their origins and track availability and latency. We can see that during the impact period, the percentage of requests whose latency fails to meet the target SLO threshold dips below an acceptable level in lockstep with the packet drops during the outage:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2rt0mrwBOfNPfIczonc20W/2a1b9f20cd625737cffe309ee0aae608/BLOG-2938_7.png" />
          </figure><p>After the congestion was alleviated, there was a brief period when both AWS and Cloudflare were attempting to normalize the prefix advertisements that had been adjusted to mitigate the congestion. That caused a long tail of latency that may have impacted some customers, which is why you see the packet drops resolve before the customer latencies are restored.</p>
    <div>
      <h2>Remediations and follow-up steps</h2>
      <a href="#remediations-and-follow-up-steps">
        
      </a>
    </div>
    <p>This event has underscored the need for enhanced safeguards to ensure that one customer's usage patterns cannot negatively affect the broader ecosystem. Our key takeaways are the need to architect for better customer isolation, so that no single entity can monopolize shared resources and impact the stability of the platform for others, and the need to augment our network infrastructure with sufficient capacity to meet demand. </p><p>To prevent a recurrence of this issue, we are implementing a multi-phased mitigation strategy. In the short and medium term: </p><ul><li><p>We are developing a mechanism to selectively deprioritize a customer’s traffic if it begins to congest the network to a degree that impacts others.</p></li><li><p>We are expediting the Data Center Interconnect (DCI) upgrades, which will provide network capacity significantly above today’s levels.</p></li><li><p>We are working with AWS to make sure their and our BGP traffic engineering actions do not conflict with one another in the future.</p></li></ul><p>Looking further ahead, our long-term solution involves building a new, enhanced traffic management system. This system will allot network resources on a per-customer basis, creating a budget that, once exceeded, will prevent a customer's traffic from degrading the service for anyone else on the platform. This system will also allow us to automate many of the manual actions that were taken to remediate the congestion seen during this incident.</p>
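    <p>As a simplified sketch of the per-customer budgeting idea described above (illustrative only; the real system will operate at the network layer, and the budget figure here is a placeholder), traffic over budget is marked for deprioritization rather than dropped outright:</p>
    <pre><code>// Simplified sketch of per-customer traffic budgeting, for illustration only.
type Priority = "normal" | "deprioritized";

const byteBudgetPerWindow = 10n * 1024n * 1024n * 1024n; // placeholder: 10 GiB per window
const bytesUsed = new Map(); // customer ID -> bytes sent in the current window

export function classify(customerId: string, responseBytes: bigint): Priority {
  const used = (bytesUsed.get(customerId) ?? 0n) + responseBytes;
  bytesUsed.set(customerId, used);
  // Over budget: keep serving the customer, but mark their traffic so that it
  // is queued or dropped first under congestion, instead of penalizing others.
  return used > byteBudgetPerWindow ? "deprioritized" : "normal";
}

// Budgets reset on each accounting window (for example, once per minute).
export function resetWindow(): void {
  bytesUsed.clear();
}
</code></pre>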
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Customers accessing AWS us-east-1 through Cloudflare experienced an outage due to insufficient network congestion management during an unusual high-traffic event.</p><p>We are sorry for the disruption this incident caused for our customers. We are actively making these improvements to ensure improved stability moving forward and to prevent this problem from happening again.</p><p>
</p> ]]></content:encoded>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <guid isPermaLink="false">4561MhtGYAXUCbb1Y5vNWa</guid>
            <dc:creator>David Tuber</dc:creator>
            <dc:creator>Emily Music</dc:creator>
            <dc:creator>Bryton Herdes</dc:creator>
        </item>
        <item>
            <title><![CDATA[Redesigning Workers KV for increased availability and faster performance]]></title>
            <link>https://blog.cloudflare.com/rearchitecting-workers-kv-for-redundancy/</link>
            <pubDate>Fri, 08 Aug 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Workers KV is Cloudflare's global key-value store. After the incident on June 12, we re-architected KV’s redundant storage backend, removed single points of failure, and made substantial improvements.
 ]]></description>
            <content:encoded><![CDATA[ <p>On June 12, 2025, Cloudflare suffered a significant service outage that affected a large set of our critical services. As explained in <a href="https://blog.cloudflare.com/cloudflare-service-outage-june-12-2025/"><u>our blog post about the incident</u></a>, the cause was a failure in the underlying storage infrastructure used by our Workers KV service. Workers KV is not only relied upon by many customers, but serves as critical infrastructure for many other Cloudflare products, handling configuration, authentication and asset delivery across the affected services. Part of this infrastructure was backed by a third-party cloud provider, which experienced an outage on June 12 and directly impacted availability of our KV service.</p><p>Today we're providing an update on the improvements that have been made to Workers KV to ensure that a similar outage cannot happen again. We are now storing all data on our own infrastructure. We are also serving all requests from our own infrastructure in addition to any third-party cloud providers used for redundancy, ensuring high availability and eliminating single points of failure. Finally, the work has meaningfully improved performance and set a clear path for the removal of any reliance on third-party providers as redundant back-ups.</p>
    <div>
      <h2>Background: The Original Architecture</h2>
      <a href="#background-the-original-architecture">
        
      </a>
    </div>
    <p>Workers KV is a global key-value store that supports high read volumes with low latency. Behind the scenes, the service stores data in regional storage and caches data across Cloudflare's network to deliver exceptional read performance, making it ideal for configuration data, static assets, and user preferences that need to be available instantly around the globe.</p><p>Workers KV was initially launched in September 2018, predating Cloudflare-native storage services like Durable Objects and R2. As a result, Workers KV's original design leveraged <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/">object storage</a> offerings from multiple third-party cloud service providers, maximizing availability via provider redundancy. The system operated in an active-active configuration, successfully serving requests even when one of the providers was unavailable, experiencing errors, or performing slowly.</p><p>Requests to Workers KV were handled by Storage Gateway Worker (SGW), a service running on Cloudflare Workers. When it received a write request, SGW would simultaneously write the key-value pair to two different third-party object storage providers, ensuring that data was always available from multiple independent sources. Deletes were handled similarly, by writing a special tombstone value in place of the object to mark the key as deleted, with these tombstones garbage collected later.</p><p>Reads from Workers KV could usually be served from Cloudflare's cache, providing reliably low latency. For reads of data not in cache, the system would race requests against both providers and return whichever response arrived first, typically from the geographically closer provider. This racing approach optimized read latency by always taking the fastest response while providing resilience against provider issues.</p>
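    <p>A minimal sketch of that write and read pattern (illustrative only; SGW’s real implementation also handles caching, tombstones, and the consistency machinery described below):</p>
    <pre><code>// Illustrative sketch of the original dual-provider pattern, not SGW's actual
// code. The Backend class stands in for a third-party object storage provider.
class Backend {
  private data = new Map();
  async get(key: string) {
    return this.data.get(key) ?? null;
  }
  async put(key: string, value: string) {
    this.data.set(key, value);
  }
}

const providerA = new Backend();
const providerB = new Backend();

// Writes are issued to both providers in parallel so that the value is always
// available from multiple independent sources.
async function dualWrite(key: string, value: string) {
  await Promise.all([providerA.put(key, value), providerB.put(key, value)]);
}

// Cache-miss reads race both providers and take whichever responds first,
// typically the geographically closer one; if one provider errors, the other
// answer still wins the race.
async function racedRead(key: string) {
  return Promise.any([providerA.get(key), providerB.get(key)]);
}
</code></pre>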
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/47G3VI7yhmPK7xMblLqYH9/1c69708e7e408d8f235964b26586249f/image3.png" />
          </figure><p>Given the inherent difficulty of keeping two independent storage providers synchronized, the architecture included sophisticated machinery to handle data consistency issues between backends. Despite this machinery, consistency edge cases remained more frequent than consumers required due to the inherently imperfect availability of upstream object storage systems and the challenges of maintaining perfect synchronization across independent providers.</p><p>Over the years, the system's implementation evolved significantly, including<a href="https://blog.cloudflare.com/faster-workers-kv/"> <u>a variety of performance improvements we discussed last year</u></a>, but the fundamental dual-provider architecture remained unchanged. This provided a reliable foundation for the massive growth in Workers KV usage while maintaining the performance characteristics that made it valuable for global applications.</p>
    <div>
      <h2>Scaling Challenges and Architectural Trade-offs</h2>
      <a href="#scaling-challenges-and-architectural-trade-offs">
        
      </a>
    </div>
    <p>As Workers KV usage scaled dramatically and access patterns became more diverse, the dual-provider architecture faced mounting operational challenges. The providers had fundamentally different limits, failure modes, APIs, and operational procedures that required constant adaptation. </p><p>The scaling issues extended beyond provider reliability. As KV traffic increased, the total number of IOPS exceeded what we could write to local cache infrastructure, forcing us to rely on traditional caching approaches when data was fetched from origin storage. This shift exposed additional consistency edge cases that hadn't been apparent at smaller scales, as the caching behavior became less predictable and more dependent on upstream provider performance characteristics.</p><p>Eventually, the combination of consistency issues, provider reliability disparities, and operational overhead led to a strategic decision to reduce complexity by moving to a single object storage provider earlier this year. This decision was made with awareness of the increased risk profile, but we believed the operational benefits outweighed the risks and viewed this as a temporary intermediate state while we developed our own storage infrastructure.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ZNIELjE1IHxvygKmVnFEX/f942b19fc2eb20055c350f5a8d83ad60/image5.png" />
          </figure><p>Unfortunately, on June 12, 2025, that risk materialized when our remaining third-party cloud provider experienced a global outage, causing a high percentage of Workers KV requests to fail for a period that lasted over two hours. The cascading impact to customers and to other Cloudflare services was severe: Access failed all identity-based logins, Gateway proxy became unavailable, WARP clients couldn't connect, and dozens of other services experienced significant disruptions.</p>
    <div>
      <h2>Designing the Solution</h2>
      <a href="#designing-the-solution">
        
      </a>
    </div>
    <p>The immediate goal after the incident was clear: bring at least one other fully redundant provider online such that another single-provider outage would not bring KV down. The new provider needed to handle massive scale along several dimensions: hundreds of billions of key-value pairs, petabytes of data stored, millions of GET requests per second, tens of thousands of steady-state PUT/DELETE requests per second, and tens of gigabits per second of throughput—all with high availability and low single-digit millisecond internal latency.</p><p>One obvious option was to bring back the provider that we had disabled earlier in the year. However, we could not just flip the switch back. The infrastructure to run in the dual-backend configuration on the prior third-party storage provider was gone, and the code had experienced some bit rot, making it infeasible to quickly revert to the previous dual-provider setup. </p><p>Additionally, the other provider had frequently been a source of its own operational problems, with relatively high error rates and concerningly low request throughput limits, which made us hesitant to rely on it again. Ultimately, we decided that our second provider should be entirely owned and operated by Cloudflare.</p><p>The next option was to build directly on top of <a href="https://www.cloudflare.com/developer-platform/products/r2/">Cloudflare R2</a>. We already had a private beta version of Workers KV running on R2, but this experience helped us better understand Workers KV's unique storage requirements. Workers KV's traffic patterns are characterized by hundreds of billions of small objects with a median size of just 288 bytes—very different from typical object storage workloads that assume larger file sizes.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7HwNtmJvOyyLDKPtMwaSxh/035283328f33e80d37832762b994a97b/3.png" />
          </figure><p>For workloads dominated by sub-1KB objects at this scale, database storage becomes significantly more efficient and cost-effective than traditional object storage. When you need to store billions of very small values with minimal per-value overhead, a database is a natural architectural fit. We're working on optimizations for R2, such as inlining small objects with their metadata to eliminate additional retrieval hops, which will improve performance for small objects; but for our immediate needs, a database-backed solution offered the most promising path forward.</p><p>After thorough evaluation of possible options, we decided to use a distributed database already in production at Cloudflare. This same database is used behind the scenes by both R2 and Durable Objects, giving us several key advantages: we have deep in-house expertise and existing automation for deployment and operations, and we knew we could depend on its reliability and performance characteristics at scale.</p><p>We sharded data across multiple database clusters, each with three-way replication for durability and availability. This approach allows us to scale capacity horizontally while maintaining strong consistency guarantees within each shard. We chose to run multiple clusters rather than one massive system to ensure a smaller blast radius if any cluster becomes unhealthy and to avoid pushing the practical limits of single-cluster scalability as Workers KV continues to grow.</p>
    <div>
      <h2>Implementing the Solution</h2>
      <a href="#implementing-the-solution">
        
      </a>
    </div>
    <p>One immediate challenge that we ran into when implementing the system was connectivity. The SGW needed to communicate with database clusters running in our core datacenters, but databases typically use binary protocols over persistent TCP connections—not the HTTP-based communication patterns that work efficiently across our global network.</p><p>We built KV Storage Proxy (KVSP) to bridge this gap. KVSP is a service that provides an HTTP interface that our SGW can use while managing the complex database connectivity, authentication, and shard routing behind the scenes. KVSP stripes namespaces across multiple clusters using consistent hashing, preventing hotspotting where popular namespaces could overwhelm single clusters, eliminating noisy neighbor issues, and ensuring capacity limitations are distributed rather than concentrated.</p><p>The biggest downside of using a distributed database for Workers KV’s storage is that, while it excels at handling the small objects that dominate KV traffic, it is not optimal for the occasional large values of up to 25 MiB that some users store. Rather than compromise on either use case, we extended KVSP to automatically route larger objects to Cloudflare R2, creating a hybrid storage architecture that optimizes the backend choice based on object characteristics. From the perspective of SGW, this complexity is completely transparent—the same HTTP API works for all objects regardless of size.</p><p>We also restored the dual-provider capabilities from KV’s prior architecture and adapted them to work in tandem with the changes that had been made to KV’s implementation since it dropped down to a single provider. The modified system now operates by racing writes to both backends simultaneously, but returns success to the client as soon as the first backend confirms the write.</p><p>This improvement minimizes latency while ensuring durability across both systems. When one backend succeeds but the other fails—due to temporary network issues, rate limiting, or service degradation—the failed write is queued for background reconciliation, which serves as part of our synchronization machinery that is described in more detail below.</p>
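    <p>To make the routing concrete, here is a minimal sketch of how a proxy along the lines of KVSP might pick a storage target: namespaces are striped across database clusters with a consistent hash ring, and values above a size threshold are routed to object storage instead. The names, hashing scheme, and threshold below are illustrative assumptions, not our actual implementation.</p>
<pre><code>// Illustrative sketch only -- not Cloudflare's actual KVSP code.
// Assumptions: namespaces stripe across database clusters via a hash
// ring, and "large" values are routed to object storage (R2) instead.
import { createHash } from "node:crypto";

const LARGE_VALUE_THRESHOLD = 1024 * 1024; // hypothetical cutoff (bytes)

interface Backend {
  put(key: string, value: Uint8Array): Promise&lt;void&gt;;
}

class HashRing {
  // Virtual nodes (ring position, cluster id), sorted by position.
  private ring: { pos: number; cluster: string }[] = [];

  constructor(clusters: string[], vnodesPerCluster = 128) {
    for (const cluster of clusters) {
      for (let v = 0; v &lt; vnodesPerCluster; v++) {
        this.ring.push({ pos: this.hash(cluster + "#" + v), cluster });
      }
    }
    this.ring.sort((a, b) =&gt; a.pos - b.pos);
  }

  private hash(s: string): number {
    // First 4 bytes of SHA-256 as an unsigned 32-bit ring position.
    return createHash("sha256").update(s).digest().readUInt32BE(0);
  }

  // First virtual node clockwise from the namespace's hash position.
  clusterFor(namespace: string): string {
    const h = this.hash(namespace);
    const node = this.ring.find((n) =&gt; n.pos &gt;= h) ?? this.ring[0];
    return node.cluster;
  }
}

// Route a write: large values go to object storage; small values go
// to the database cluster that owns the namespace.
async function routePut(
  ring: HashRing,
  clusters: Map&lt;string, Backend&gt;,
  objectStore: Backend,
  namespace: string,
  key: string,
  value: Uint8Array,
): Promise&lt;void&gt; {
  if (value.byteLength &gt; LARGE_VALUE_THRESHOLD) {
    await objectStore.put(namespace + "/" + key, value);
  } else {
    await clusters.get(ring.clusterFor(namespace))!.put(key, value);
  }
}</code></pre>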
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2rUmyvzl6LsWdsjoD8H7vv/23d25498c1dc01cc3df84d92fa2c500e/image6.png" />
          </figure>
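    <p>The write-racing behavior described above can be sketched as follows; the <code>Backend</code> interface and reconciliation queue here are hypothetical stand-ins for the real storage clients and synchronization machinery.</p>
<pre><code>// Illustrative sketch of the dual-backend write race, not KV's code.
interface Backend {
  name: string;
  put(key: string, value: Uint8Array): Promise&lt;void&gt;;
}

// Failed (backend, key) pairs awaiting background reconciliation.
const reconcileQueue: { backend: string; key: string }[] = [];

async function racedPut(
  a: Backend,
  b: Backend,
  key: string,
  value: Uint8Array,
): Promise&lt;void&gt; {
  const attempts = [a, b].map((backend) =&gt;
    backend.put(key, value).catch((err) =&gt; {
      // The other backend may still succeed; queue this one for
      // background reconciliation instead of failing the request.
      reconcileQueue.push({ backend: backend.name, key });
      throw err;
    }),
  );
  // Resolve as soon as either backend confirms persistence; reject
  // (with an AggregateError) only if both backends fail.
  await Promise.any(attempts);
}</code></pre>
<p>Because <code>Promise.any</code> attaches handlers to every attempt, a late failure on the slower backend is still recorded for reconciliation without surfacing an error to the caller.</p>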
    <div>
      <h2>Deploying the Solution</h2>
      <a href="#deploying-the-solution">
        
      </a>
    </div>
    <p>With the hybrid architecture implemented, we began a careful rollout process designed to validate the new system while maintaining service availability.</p><p>The first step was introducing background writes from SGW to the new Cloudflare backend. This allowed us to validate write performance and error rates under real production load without affecting read traffic or user experience. It was also a necessary step in copying all data over to the new backend.</p><p>Next, we copied existing data from the third-party provider to our new backend running on Cloudflare infrastructure, routing the data through KVSP. This brought us to a critical milestone: we were now in a state where we could manually fail over all operations to the new backend within minutes in the event of another provider outage. The single point of failure that caused the June incident had been eliminated.</p><p>With confidence in the failover capability, we began enabling our first namespaces in active-active mode, starting with internal Cloudflare services where we had sophisticated monitoring and deep understanding of the workloads. We dialed up traffic very slowly, carefully comparing results between backends. The fact that SGW could see responses from both backends asynchronously—after already returning a response to the user—allowed us to perform detailed comparisons and catch any discrepancies without impacting user-facing latency.</p><p>During testing, we discovered an important consistency regression compared to our single-provider setup, which caused us to briefly roll back the change to put namespaces in active-active mode. While Workers KV <a href="https://developers.cloudflare.com/kv/concepts/how-kv-works/"><u>is eventually consistent by design</u></a>, with changes taking up to 60 seconds to propagate globally as cached versions time out, we had inadvertently regressed read-your-own-write (RYOW) consistency for requests routed through the same Cloudflare point of presence.</p><p>In the previous dual-provider active-active setup, RYOW was provided within each PoP because we wrote PUT operations directly to a local cache instead of relying on the traditional caching system in front of upstream storage. However, KV throughput had outgrown the number of IOPS that the caching infrastructure could support, so we could no longer rely on that approach. This wasn't a documented property of Workers KV, but it is behavior that some customers have come to rely on in their applications.</p><p>To understand the scope of this issue, we created an adversarial test framework designed to maximize the likelihood of hitting consistency edge cases by rapidly interspersing reads and writes to a small set of keys from a handful of locations around the world. This framework allowed us to measure the percentage of reads where we observed a violation of RYOW consistency—scenarios where a read immediately following a write from the same point of presence would return stale data instead of the value that was just written. Using it, we designed and verified a new approach to how KV populates and invalidates data in cache, which restored the RYOW behavior that customers expect while maintaining the performance characteristics that make Workers KV effective for high-read workloads.</p>
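    <p>As a rough illustration of that framework, a probe like the sketch below writes a unique value and immediately reads it back through the same point of presence, counting stale reads as RYOW violations. The <code>KvClient</code> interface is a hypothetical stand-in, not the framework's actual API.</p>
<pre><code>// Illustrative read-your-own-write (RYOW) probe; not the actual tool.
interface KvClient {
  // Assumed to be pinned to a single point of presence.
  put(key: string, value: string): Promise&lt;void&gt;;
  get(key: string): Promise&lt;string | null&gt;;
}

async function ryowViolationRate(
  kv: KvClient,
  keys: string[],     // small, deliberately contended key set
  iterations: number,
): Promise&lt;number&gt; {
  let violations = 0;
  for (let i = 0; i &lt; iterations; i++) {
    const key = keys[i % keys.length];
    const expected = "v" + i + "-" + Math.random(); // unique per write
    await kv.put(key, expected);
    // Under RYOW, an immediate read through the same PoP must observe
    // the value just written; anything else counts as a violation.
    const observed = await kv.get(key);
    if (observed !== expected) violations++;
  }
  return violations / iterations;
}</code></pre>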
    <div>
      <h2>How KV Maintains Consistency Across Multiple Backends</h2>
      <a href="#how-kv-maintains-consistency-across-multiple-backends">
        
      </a>
    </div>
    <p>With writes racing to both backends and reads potentially returning different results, maintaining data consistency across independent storage providers requires a sophisticated multi-layered approach. While the details have evolved over time, KV has always taken the same basic approach, consisting of three complementary mechanisms that work together to reduce the likelihood of inconsistencies and minimize the window for data divergence.</p><p>The first line of defense happens during write operations. When SGW sends writes to both backends simultaneously, we treat the write as successful as soon as either provider confirms persistence. However, if a write succeeds on one provider but fails on the other—due to network issues, rate limiting, or temporary service degradation—the failed write is captured and sent to a background reconciliation system. This system deduplicates failed keys and initiates a synchronization process to resolve the inconsistency.</p><p>The second mechanism activates during read operations. When SGW races reads against both providers and notices different results, it triggers the same background synchronization process. This helps ensure that keys that become inconsistent are brought back into alignment when first accessed rather than remaining divergent indefinitely.</p><p>The third layer consists of background crawlers that continuously scan data across both providers, identifying and fixing any inconsistencies missed by the previous mechanisms. These crawlers also provide valuable data on consistency drift rates, helping us understand how frequently keys slip through the reactive mechanisms and address any underlying issues.</p><p>The synchronization process itself relies on version metadata that we attach to every key-value pair. Each write automatically generates a new version consisting of a high-precision timestamp plus a random nonce, stored alongside the actual data. When comparing values between providers, we can determine which version is newer based on these timestamps. The newer value is then copied to the provider with the older version.</p><p>In rare cases where timestamps are within milliseconds of each other, clock skew could theoretically cause incorrect ordering, though given the tight bounds we maintain on our clocks through <a href="https://www.cloudflare.com/time/"><u>Cloudflare Time Services</u></a> and typical write latencies, such conflicts would only occur with nearly simultaneous overlapping writes.</p><p>To prevent data loss during synchronization, we use conditional writes that verify the stored timestamp is older before writing, instead of blindly overwriting values. This allows us to avoid introducing new inconsistency issues in cases where requests in close proximity succeed on different backends and the synchronization process could otherwise copy older values over newer values.</p><p>Similarly, we can’t just delete data when the user requests it, because if the delete succeeded on only one backend, the synchronization process would see this as missing data and copy it from the other backend. Instead, we overwrite the value with a tombstone that has a newer timestamp and no actual data.
Only after both providers have the tombstone do we proceed with actually removing the keys from storage.</p><p>This layered consistency architecture doesn't guarantee strong consistency, but in practice it eliminates most mismatches between backends while maintaining a performance profile that makes Workers KV attractive for latency-sensitive, high-read workloads, and it provides high availability in the case of backend errors. In distributed systems terms, KV chooses availability (AP) over consistency (CP) in the <a href="https://en.wikipedia.org/wiki/CAP_theorem"><u>CAP theorem</u></a>, and more interestingly also chooses latency over consistency in the absence of a partition, meaning it’s PA/EL under the <a href="https://en.wikipedia.org/wiki/PACELC_design_principle"><u>PACELC theorem</u></a>. Most inconsistencies are resolved within seconds through the reactive mechanisms, while the background crawlers ensure that even edge cases are typically corrected over time.</p><p>The above description applies to both our historical dual-provider setup and today's implementation, but two key improvements in the current architecture lead to significantly better consistency outcomes. First, KVSP maintains a much lower steady-state error rate compared to our previous third-party providers, reducing the frequency of write failures that create inconsistencies in the first place. Second, we now race all reads against both backends, whereas the previous system optimized for cost and latency by preferentially routing reads to a single provider after an initial learning period.</p><p>In the original dual-provider architecture, each SGW instance would initially race reads against both providers to establish baseline performance characteristics. Once an instance determined that one provider consistently outperformed the other for its geographic region, it would route subsequent reads exclusively to the faster provider, only falling back to the slower provider when the primary experienced failures or abnormal latency. While this approach effectively controlled third-party provider costs and optimized read performance, it created a significant blind spot in our consistency detection mechanisms—inconsistencies between providers could persist indefinitely if reads were consistently served from only one backend.</p>
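    <p>To make these synchronization mechanics concrete, here is a minimal sketch of the per-key synchronization step, combining version comparison, conditional writes, and tombstones. The types and helper names are assumptions for illustration, not the actual Workers KV code.</p>
<pre><code>// Illustrative sketch of cross-backend key synchronization.
interface Versioned {
  timestampMs: number;      // high-precision write timestamp
  nonce: string;            // random tie-breaker minted at write time
  tombstone: boolean;       // true = deleted, pending physical removal
  value: Uint8Array | null; // null for tombstones
}

interface VersionedStore {
  get(key: string): Promise&lt;Versioned | null&gt;;
  // Hypothetical conditional write: persists v only if the currently
  // stored version is older, so a racing newer write is never clobbered.
  putIfOlder(key: string, v: Versioned): Promise&lt;void&gt;;
}

function newer(x: Versioned, y: Versioned): Versioned {
  if (x.timestampMs !== y.timestampMs) {
    return x.timestampMs &gt; y.timestampMs ? x : y;
  }
  return x.nonce &gt; y.nonce ? x : y; // deterministic tie-break
}

// Bring a single key back into agreement across two backends.
async function syncKey(
  key: string,
  a: VersionedStore,
  b: VersionedStore,
): Promise&lt;void&gt; {
  const [va, vb] = await Promise.all([a.get(key), b.get(key)]);
  if (!va &amp;&amp; !vb) return; // consistent: absent everywhere

  if (va &amp;&amp; vb) {
    // Copy the newer version over the older one; putIfOlder is a no-op
    // if the destination already holds an equal or newer version.
    const w = newer(va, vb);
    await (w === va ? b : a).putIfOlder(key, w);
    return;
  }

  // Present on one side only: copy it across. Deletes are written as
  // tombstones precisely so that "missing" can never be mistaken for
  // the newer state and resurrected from the other backend.
  const present = (va ?? vb)!;
  await (va ? b : a).putIfOlder(key, present);
}</code></pre>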
    <div>
      <h2>Results: Performance and Availability Gains</h2>
      <a href="#results-performance-and-availability-gains">
        
      </a>
    </div>
    <p>With these consistency mechanisms in place and our careful rollout strategy validated through internal services, we continued expanding active-active operation to additional namespaces across both internal and external workloads, and we were thrilled with what we saw. Not only did the new architecture provide the increased availability we needed for Workers KV, it also delivered significant performance improvements.</p><p>These performance gains were particularly pronounced in Europe, where our new storage backend is located, but the benefits extended far beyond what geographic locality alone could explain. The internal latency improvements compared to the third-party object store we were writing to in parallel were remarkable.</p><p>For example, p99 internal latency for reads to KVSP was below 5 milliseconds. For comparison, non-cached reads to the third-party object store from our closest location—after normalizing for transit time to create an apples-to-apples comparison—were typically around 80ms at p50 and 200ms at p99.</p><p>The graphs below show the closest comparison we can make: our observed internal latency for requests to KVSP compared with observed latency for requests that are cache misses and end up being forwarded to the external service provider from the closest point of presence, which includes an additional 5-10 milliseconds of request transit time.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6MtjLi4ajLQTTcqQ77Qgm0/10c9b30285d2725f6bf94a519c2d3a78/5.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3zOE8nlQm46xejHuC6ns7b/780bdba6913e45b4332841cfb287400c/6.png" />
          </figure><p>These performance improvements translated directly into faster response times for the many internal Cloudflare services that depend on Workers KV, creating cascading benefits across our platform. The database-optimized storage proved particularly effective for the small object access patterns that dominate Workers KV traffic.</p><p>After seeing these positive results, we continued expanding the rollout, copying data and enabling groups of namespaces for both internal and external customers. The combination of improved availability and better performance validated our architectural approach and demonstrated the value of building critical infrastructure on our own platform.</p>
    <div>
      <h2>What’s next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>Our immediate plans focus on expanding this hybrid architecture to provide even greater resilience and performance for Workers KV. We're rolling out the KVSP solution to additional locations, creating a truly global distributed backend that can serve traffic entirely from our own infrastructure. We are also working to further improve how quickly we reach consistency between providers and in cache after writes.</p><p>Our ultimate goal is to eliminate our remaining third-party storage dependency entirely, achieving full infrastructure independence for Workers KV. This will remove the external single points of failure that led to the June incident while giving us complete control over the performance and reliability characteristics of our storage layer.</p><p>Beyond Workers KV, this project has demonstrated the power of hybrid architectures that combine the best aspects of different storage technologies. The patterns we've developed—using KVSP as a translation layer, automatically routing objects based on size characteristics, and leveraging our existing database expertise—can be applied to other services that need to balance global scale with strong consistency requirements. The journey from a single-provider setup to a resilient hybrid architecture running on Cloudflare infrastructure demonstrates how thoughtful engineering can turn operational challenges into competitive advantages. With dramatically improved performance and active-active redundancy, Workers KV is well positioned to serve as an even more reliable foundation for the growing set of customers that depend on it.</p>
            <category><![CDATA[Cloudflare Workers KV]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <guid isPermaLink="false">6Cd705JORQK737rDeTEDX8</guid>
            <dc:creator>Alex Robinson</dc:creator>
            <dc:creator>Tyson Trautmann</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare service outage June 12, 2025]]></title>
            <link>https://blog.cloudflare.com/cloudflare-service-outage-june-12-2025/</link>
            <pubDate>Thu, 12 Jun 2025 22:00:00 GMT</pubDate>
            <description><![CDATA[ Multiple Cloudflare services, including Workers KV, Access, WARP and the Cloudflare dashboard, experienced an outage for up to 2 hours and 28 minutes on June 12, 2025. ]]></description>
            <content:encoded><![CDATA[ <p>On June 12, 2025, Cloudflare suffered a significant service outage that affected a large set of our critical services, including Workers KV, WARP, Access, Gateway, Images, Stream, Workers AI, Turnstile and Challenges, AutoRAG, Zaraz, and parts of the Cloudflare Dashboard.</p><p>This outage lasted 2 hours and 28 minutes, and globally impacted all Cloudflare customers using the affected services. This outage was caused by a failure in the underlying storage infrastructure used by our Workers KV service, which is a critical dependency for many Cloudflare products and is relied upon for configuration, authentication, and asset delivery across the affected services. Part of this infrastructure is backed by a third-party cloud provider, which experienced an outage today and directly impacted the availability of our KV service.</p><p>We’re deeply sorry for this outage: this was a failure on our part, and while the proximate cause (or trigger) for this outage was a third-party vendor failure, we are ultimately responsible for our chosen dependencies and how we choose to architect around them.</p><p>This was not the result of an attack or other security event. No data was lost as a result of this incident. Cloudflare Magic Transit and Magic WAN, DNS, Cache, proxy, WAF and related services were not directly impacted by this incident.</p>
    <div>
      <h2>What was impacted?</h2>
      <a href="#what-was-impacted">
        
      </a>
    </div>
    <p>As a rule, Cloudflare designs and builds its services on its own platform building blocks, and as such many of Cloudflare’s products are built to rely on the Workers KV service. </p><p>The following table details the impacted services, including the user-facing impact, operation failures, and increases in error rates observed:</p><table><tr><th><p><b>Product/Service</b></p></th><th><p><b>Impact</b></p></th></tr><tr><td><p><b>Workers KV</b></p><p>

</p></td><td><p>Workers KV saw 90.22% of requests fail: any request for a key-value pair that was not cached and required retrieving the value from Workers KV's origin storage backends failed with response code 503 or 500. </p><p>The remaining requests were successfully served from Workers KV's cache (status code 200 and 404) or returned errors within our expected limits and/or error budget.</p><p>This did not impact data stored in Workers KV.</p></td></tr><tr><td><p><b>Access</b></p><p>

</p></td><td><p>Access uses Workers KV to store application and policy configuration along with user identity information.</p><p>During the incident, Access failed 100% of identity-based logins for all application types, including Self-Hosted, SaaS, and Infrastructure. User identity information was unavailable to other services like WARP and Gateway during this incident. Access is designed to fail closed when it cannot successfully fetch policy configuration or a user’s identity. </p><p>Active Infrastructure Application SSH sessions with command logging enabled failed to save logs due to a Workers KV dependency. </p><p>Access’ System for Cross-domain Identity Management (SCIM) service was also impacted due to its reliance on Workers KV and Durable Objects (which depended on KV) to store user information. During this incident, user identities were not updated due to Workers KV update failures. These failures would result in a 500 returned to identity providers. Some providers may require a manual re-synchronization, but most customers would have seen immediate service restoration once Access’ SCIM service was restored, thanks to retry logic by the identity provider.</p><p>Service-authentication-based logins (e.g., service token, Mutual TLS, and IP-based policies) and Bypass policies were unaffected. No Access policy edits or changes were lost during this time.</p></td></tr><tr><td><p><b>Gateway</b></p><p>

</p></td><td><p>This incident did not affect most Gateway DNS queries, including those over IPv4, IPv6, DNS over TLS (DoT), and DNS over HTTPS (DoH).</p><p>However, there were two exceptions:</p><p>DoH queries with identity-based rules failed. This happened because Gateway couldn't retrieve the required user’s identity information.</p><p>Authenticated DoH was disrupted for some users. Users with active sessions with valid authentication tokens were unaffected, but those needing to start new sessions or refresh authentication tokens could not.</p><p>Users of Gateway proxy, egress, and TLS decryption were unable to connect, register, proxy, or log traffic.</p><p>This was due to our reliance on Workers KV to retrieve up-to-date identity and device posture information. Each of these actions requires a call to Workers KV, and when unavailable, Gateway is designed to fail closed to prevent traffic from bypassing customer-configured rules.</p></td></tr><tr><td><p><b>WARP</b></p><p>


</p></td><td><p>The WARP client was impacted due to core dependencies on Access and Workers KV, which are required for device registration and authentication. As a result, no new clients were able to connect or sign up during the incident.</p><p>Existing WARP client sessions that were routed through the Gateway proxy experienced disruptions, as Gateway was unable to perform its required policy evaluations.</p><p>Additionally, the WARP emergency disconnect override was rendered unavailable because of a failure in its underlying dependency, Workers KV.</p><p>Consumer WARP saw sporadic impact similar to the Zero Trust version.</p></td></tr><tr><td><p><b>Dashboard</b></p><p>

</p></td><td><p>Dashboard user logins and most of the existing dashboard sessions were unavailable. This was due to an outage affecting Turnstile, Durable Objects (DO), Workers KV, and Access. The specific causes for login failures were:</p><p>Standard Logins (User/Password): Failed due to Turnstile unavailability.</p><p>Sign-in with Google (OIDC) Logins: Failed due to a KV dependency issue.</p><p>SSO Logins: Failed due to a full dependency on Access.</p><p>The Cloudflare v4 API was not impacted during this incident.</p></td></tr><tr><td><p><b>Challenges and Turnstile</b></p><p>

</p></td><td><p>The Challenge platform that powers Cloudflare Challenges and Turnstile saw a high rate of failure and timeout for siteverify API requests during the incident window due to its dependencies on Workers KV and Durable Objects.</p><p>We have kill switches in place to disable these calls in case of incidents and outages such as this. We activated these kill switches as a mitigation so that eyeballs are not blocked from proceeding. Notably, while these kill switches were active, Turnstile’s siteverify API (the API that validates issued tokens) could redeem valid tokens multiple times, potentially allowing for attacks where a bad actor might try to use a previously valid token to bypass a challenge. </p><p>There was no impact to Turnstile’s ability to detect bots. A bot attempting to solve a challenge would still have failed the challenge and thus not receive a token. </p></td></tr><tr><td><p><b>Browser Isolation</b></p><p>

</p></td><td><p>Existing Browser Isolation sessions via Link-based isolation were impacted due to a reliance on Gateway for policy evaluation.</p><p>New link-based Browser Isolation sessions could not be initiated due to a dependency on Cloudflare Access. All Gateway-initiated isolation sessions failed due to the Gateway dependency.</p></td></tr><tr><td><p><b>Images</b></p><p>

</p></td><td><p>Batch uploads to Cloudflare Images were impacted during the incident window, with a 100% failure rate at the peak of the incident. Other uploads were not impacted.</p><p>Overall image delivery dipped to around 97% success rate. Image Transformations were not significantly impacted, and Polish was not impacted.</p></td></tr><tr><td><p><b>Stream</b></p><p>

</p></td><td><p>Stream’s error rate exceeded 90% during the incident window as video playlists could not be served. Stream Live observed a 100% error rate.</p><p>Video uploads were not impacted.</p></td></tr><tr><td><p><b>Realtime</b></p><p>

</p></td><td><p>The Realtime TURN (Traversal Using Relays around NAT) service uses KV and was heavily impacted. Error rates were near 100% for the duration of the incident window.</p><p>The Realtime SFU service (Selective Forwarding Unit) was unable to create new sessions, although existing connections were maintained. This caused a reduction to 20% of normal traffic during the impact window. </p></td></tr><tr><td><p><b>Workers AI</b></p><p>

</p></td><td><p>All inference requests to Workers AI failed for the duration of the incident. Workers AI depends on Workers KV for distributing configuration and routing information for AI requests globally.</p></td></tr><tr><td><p><b>Pages &amp; Workers Assets</b></p><p>

</p></td><td><p>Static assets served by Cloudflare Pages and Workers Assets (such as HTML, JavaScript, CSS, and images) are stored in Workers KV, cached, and retrieved at request time. Workers Assets saw an average error rate increase of around 0.06% of total requests during this time. </p><p>During the incident window, the Pages error rate peaked at ~100%, and no Pages builds could complete. </p></td></tr><tr><td><p><b>AutoRAG</b></p><p>

</p></td><td><p>AutoRAG relies on Workers AI models for both document conversion and generating vector embeddings during indexing, as well as LLMs for querying and search. AutoRAG was unavailable during the incident window because of the Workers AI dependency.</p></td></tr><tr><td><p><b>Durable Objects</b></p><p>

</p></td><td><p>SQLite-backed Durable Objects share the same underlying storage infrastructure as Workers KV. The average error rate during the incident window peaked at 22%, and dropped to 2% as services started to recover.</p><p>Durable Object namespaces using the legacy key-value storage were not impacted.</p></td></tr><tr><td><p><b>D1</b></p><p>
</p></td><td><p>D1 databases share the same underlying storage infrastructure as Workers KV and Durable Objects.</p><p>Similar to Durable Objects, the average error rate during the incident window peaked at 22%, and dropped to 2% as services started to recover.</p></td></tr><tr><td><p><b>Queues &amp; Event Notifications</b></p><p>
</p></td><td><p>Queues message operations, including pushing and consuming, were unavailable during the incident window.</p><p>Queues uses KV to map each Queue to underlying Durable Objects that contain queued messages.</p><p>Event Notifications use Queues as their underlying delivery mechanism.</p></td></tr><tr><td><p><b>AI Gateway</b></p><p>

</p></td><td><p>AI Gateway is built on top of Workers and relies on Workers KV for client and internal configurations. During the incident window, AI Gateway saw error rates peak at 97% of requests until dependencies recovered.</p></td></tr><tr><td><p><b>CDN</b></p><p>

</p></td><td><p>Automated traffic management infrastructure was operational but acted with reduced efficacy during the impact period. In particular, registration requests from Zero Trust clients increased substantially as a result of the outage.</p><p>The increase in requests imposed additional load on several Cloudflare locations, triggering a response from automated traffic management. In response to these conditions, systems rerouted incoming CDN traffic to nearby locations, reducing impact to customers. A portion of traffic was not rerouted as expected; this is under investigation. CDN requests impacted by this issue would experience elevated latency, HTTP 499 errors, and/or HTTP 503 errors. Impacted Cloudflare service areas included São Paulo, Philadelphia, Atlanta, and Raleigh.</p></td></tr><tr><td><p><b>Workers / Workers for Platforms</b></p><p>

</p></td><td><p>Workers and Workers for Platforms rely on a third-party service for uploads. During the incident window, Workers saw an overall error rate peak of ~2% of total requests. Workers for Platforms saw an overall error rate peak of ~10% of total requests during the same time period. </p></td></tr><tr><td><p><b>Workers Builds (CI/CD)
 </b></p><p>
</p></td><td><p>Starting at 18:03 UTC, Workers Builds could not receive new source code management push events because Access was down.</p><p>100% of new Workers Builds failed during the incident window.</p></td></tr><tr><td><p><b>Browser Rendering</b></p><p>

</p></td><td><p>Browser Rendering depends on Browser Isolation for browser instance infrastructure.</p><p>Requests via both the REST API and the Workers Browser Binding were 100% impacted during the incident window.</p></td></tr><tr><td><p><b>Zaraz</b></p><p>
</p></td><td><p>100% of requests were impacted during the incident window. Zaraz relies on Workers KV for website configurations when handling eyeball traffic. Due to the same dependency, attempts to save updates to Zaraz configs were unsuccessful during this period, but our monitoring shows that only a single user was affected.</p></td></tr></table>
    <div>
      <h2>Background</h2>
      <a href="#background">
        
      </a>
    </div>
    <p>Workers KV is built as what we call a “coreless” service, which means there should be no single point of failure as the service runs independently in each of our locations worldwide. However, Workers KV today relies on a central data store to provide a source of truth for data. A failure of that store caused a complete outage for cold reads and writes to the KV namespaces used by services across Cloudflare.</p><p>Workers KV is in the process of being transitioned to significantly more resilient infrastructure for its central store: regrettably, we had a gap in coverage which was exposed during this incident. We had removed a storage provider from Workers KV as we worked to re-architect KV’s backend, including migrating it to Cloudflare R2, to prevent data consistency issues (caused by the original data syncing architecture) and to improve support for data residency requirements.</p><p>One of our principles is to build Cloudflare services on our own platform as much as possible, and Workers KV is no exception. Many of our internal and external services rely heavily on Workers KV, which under normal circumstances helps us deliver the most robust services possible, instead of service teams attempting to build their own storage services. In this case, the cascading impact of the Workers KV failure exacerbated the issue and significantly broadened the blast radius. </p>
    <div>
      <h2>Incident timeline and impact</h2>
      <a href="#incident-timeline-and-impact">
        
      </a>
    </div>
    <p>The incident timeline, including the initial impact, investigation, root cause, and remediation, is detailed below. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7CBPPVgr3GroJP2EcD3yvB/6073457ce6263e7e05f6eb3d796ddd48/BLOG-2847_2.png" />
          </figure><p><i><sub>Workers KV error rates to storage infrastructure. 91% of requests to KV failed during the incident window.</sub></i></p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5sGFmSsNHD9yXwJ9Ea5jov/78349be4318cb738cdd72643e69f7bdb/BLOG-2847_1.png" />
          </figure><p><i><sub>Cloudflare Access percentage of successful requests. Cloudflare Access relies directly on Workers KV and serves as a good proxy to measure Workers KV availability over time.</sub></i></p><p>All timestamps referenced are in Coordinated Universal Time (UTC).</p><table><tr><th><p><b>Time</b></p></th><th><p><b>Event</b></p></th></tr><tr><td><p>2025-06-12 17:52</p></td><td><p><b>INCIDENT START
</b>The Cloudflare WARP team begins to see new device registrations fail, starts investigating these failures, and declares an incident.</p></td></tr><tr><td><p>2025-06-12 18:05</p></td><td><p>Cloudflare Access team received an alert due to a rapid increase in error rates.</p><p>Service Level Objectives for multiple services drop below targets and trigger alerts across those teams.</p></td></tr><tr><td><p>2025-06-12 18:06</p></td><td><p>Multiple service-specific incidents are combined into a single incident as we identify a shared cause (Workers KV unavailability). Incident priority upgraded to P1.</p></td></tr><tr><td><p>2025-06-12 18:21</p></td><td><p>Incident priority upgraded to P0 from P1 as severity of impact becomes clear.</p></td></tr><tr><td><p>2025-06-12 18:43</p></td><td><p>Cloudflare Access begins exploring options to remove the Workers KV dependency by migrating to a different backing datastore with the Workers KV engineering team. This was a proactive measure in case the storage infrastructure remained down.</p></td></tr><tr><td><p>2025-06-12 19:09</p></td><td><p>Zero Trust Gateway began working to remove dependencies on Workers KV by gracefully degrading rules that referenced Identity or Device Posture state.</p></td></tr><tr><td><p>2025-06-12 19:32</p></td><td><p>Access and Device Posture begin force-dropping identity and device posture requests to shed load on Workers KV until the third-party service comes back online.</p></td></tr><tr><td><p>2025-06-12 19:45</p></td><td><p>Cloudflare teams continue to work on a path to deploying a Workers KV release against an alternative backing datastore and having critical services write configuration data to that store.</p></td></tr><tr><td><p>2025-06-12 20:23</p></td><td><p>Services begin to recover as storage infrastructure begins to recover. We continue to see a non-negligible error rate and infrastructure rate limits due to the influx of services repopulating caches.</p></td></tr><tr><td><p>2025-06-12 20:25</p></td><td><p>Access and Device Posture restore calling Workers KV as the third-party service is restored.</p></td></tr><tr><td><p>2025-06-12 20:28</p></td><td><p><b>IMPACT END
</b>Service Level Objectives return to pre-incident level. Cloudflare teams continue to monitor systems to ensure services do not degrade as dependent systems recover.</p></td></tr><tr><td><p>
</p></td><td><p><b>INCIDENT END
</b>Cloudflare teams see all affected services return to normal function. Service Level Objective alerts recover.</p></td></tr></table>
    <div>
      <h2>Remediation and follow-up steps</h2>
      <a href="#remediation-and-follow-up-steps">
        
      </a>
    </div>
    <p>We’re taking immediate steps to improve the resiliency of services that depend on Workers KV and our storage infrastructure. This includes existing planned work that we are accelerating as a result of this incident.</p><p>This encompasses several workstreams, including efforts to avoid singular dependencies on storage infrastructure we do not own and to improve our ability to recover critical services (including Access, Gateway, and WARP).</p><p>Specifically:</p><ul><li><p>(Actively in-flight): Bringing forward our work to improve the redundancy within Workers KV’s storage infrastructure, removing the dependency on any single provider. During the incident window we began work to cut over and backfill critical KV namespaces to our own infrastructure, in the event the incident continued. </p></li><li><p>(Actively in-flight): Short-term blast radius remediations for individual products that were impacted by this incident so that each product becomes resilient to any loss of service caused by any single point of failure, including third-party dependencies.</p></li><li><p>(Actively in-flight): Implementing tooling that allows us to progressively re-enable namespaces during storage infrastructure incidents. This will allow us to ensure that key dependencies, including Access and WARP, are able to come up without risking a denial-of-service against our own infrastructure as caches are repopulated.</p></li></ul><p>This list is not exhaustive: our teams continue to revisit design decisions and assess the infrastructure changes we need to make in both the near (immediate) term and long term to mitigate incidents like this going forward.</p><p>This was a serious outage, and we understand that organizations and institutions large and small depend on us to protect and/or run their websites, applications, zero trust, and network infrastructure. Again, we are deeply sorry for the impact and are working diligently to improve our service resiliency. </p>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <guid isPermaLink="false">2dN6tdkhtvWTTgAgbfzaSX</guid>
            <dc:creator>Jeremy Hartman</dc:creator>
            <dc:creator>CJ Desai</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare incident on March 21, 2025]]></title>
            <link>https://blog.cloudflare.com/cloudflare-incident-march-21-2025/</link>
            <pubDate>Tue, 25 Mar 2025 01:40:38 GMT</pubDate>
            <description><![CDATA[ On March 21, 2025, multiple Cloudflare services, including R2 object storage experienced an elevated rate of error responses. Here’s what caused the incident, the impact, and how we are making sure it ]]></description>
            <content:encoded><![CDATA[ <p>Multiple Cloudflare services, including <a href="https://www.cloudflare.com/developer-platform/products/r2/"><u>R2 object storage</u></a>, experienced an elevated rate of errors for 1 hour and 7 minutes on March 21, 2025 (starting at 21:38 UTC and ending 22:45 UTC). During the incident window, 100% of write operations failed and approximately 35% of read operations to R2 failed globally. Although this incident started with R2, it impacted other Cloudflare services including <a href="https://www.cloudflare.com/developer-platform/products/cache-reserve/"><u>Cache Reserve</u></a>, <a href="https://www.cloudflare.com/developer-platform/products/cloudflare-images/"><u>Images</u></a>, <a href="https://developers.cloudflare.com/logs/edge-log-delivery/"><u>Log Delivery</u></a>, <a href="https://www.cloudflare.com/developer-platform/products/cloudflare-stream/"><u>Stream</u></a>, and <a href="https://developers.cloudflare.com/vectorize/"><u>Vectorize</u></a>.</p><p>While rotating credentials used by the R2 Gateway service (R2's API frontend) to authenticate with our storage infrastructure, the R2 engineering team inadvertently deployed the new credentials (ID and key pair) to a development instance of the service instead of production. When the old credentials were deleted from our storage infrastructure (as part of the key rotation process), the production R2 Gateway service did not have access to the new credentials. This ultimately resulted in R2’s Gateway service not being able to authenticate with our storage backend. No data loss or corruption occurred as part of this incident: any in-flight uploads or mutations that returned successful HTTP status codes were persisted.</p><p>Once the root cause was identified and we realized we hadn’t deployed the new credentials to the production R2 Gateway service, we deployed the updated credentials and service availability was restored. </p><p>This incident happened because of human error and lasted longer than it should have because we didn’t have proper visibility into which credentials were being used by the Gateway Worker to authenticate with our storage infrastructure. </p><p>We’re deeply sorry for this incident and the disruption it may have caused to you or your users. We hold ourselves to a high standard and this is not acceptable. This blog post explains the impact, exactly what happened and when, and the steps we are taking to make sure this failure (and others like it) doesn’t happen again.</p>
    <div>
      <h2>What was impacted?</h2>
      <a href="#what-was-impacted">
        
      </a>
    </div>
    <p><b>The primary incident window occurred between 21:38 UTC and 22:45 UTC.</b></p><p>The following table details the specific impact to R2 and Cloudflare services that depend on, or interact with, R2:</p>
<div><table><thead>
  <tr>
    <th><span>Product/Service</span></th>
    <th><span>Impact</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>R2</span></td>
    <td><span>All customers using Cloudflare R2 would have experienced an elevated error rate during the primary incident window. Specifically:</span><br /><br /><span>* Object write operations had a 100% error rate.</span><br /><br /><span>* Object reads had an approximate error rate of 35% globally. Individual customer error rate varied during this window depending on access patterns. Customers accessing public assets through custom domains would have seen a reduced error rate as cached object reads were not impacted.</span><br /><br /><span>* Operations involving metadata only (e.g., head and list operations) were not impacted.</span><br /><br /><span>There was no data loss or risk to data integrity within R2's storage subsystem. This incident was limited to a temporary authentication issue between R2's API frontend and our storage infrastructure.</span></td>
  </tr>
  <tr>
    <td><span>Billing</span></td>
    <td><span>Billing uses R2 to store customer invoices. During the primary incident window, customers may have experienced errors when attempting to download/access past Cloudflare invoices.</span></td>
  </tr>
  <tr>
    <td><span>Cache Reserve</span></td>
    <td><span>Cache Reserve customers observed an increase in requests to their origins during the incident window as an increased percentage of reads to R2 failed, resulting in more requests to origins to fetch assets unavailable in Cache Reserve during this period.</span><br /><br /><span>User-facing requests for assets to sites with Cache Reserve did not observe failures as cache misses failed over to the origin.</span></td>
  </tr>
  <tr>
    <td><span>Email Security</span></td>
    <td><span>Email Security depends on R2 for customer-facing metrics. During the primary incident window, customer-facing metrics would not have updated.</span></td>
  </tr>
  <tr>
    <td><span>Images</span></td>
    <td><span>All (100% of) uploads failed during the primary incident window. Successful delivery of stored images dropped to approximately 25%.</span></td>
  </tr>
  <tr>
    <td><span>Key Transparency Auditor</span></td>
    <td><span>All (100% of) operations failed during the primary incident window due to dependence on R2 writes and/or reads. Once the incident was resolved, service returned to normal operation immediately.</span></td>
  </tr>
  <tr>
    <td><span>Log Delivery</span></td>
    <td><span>Log delivery (for Logpush and Logpull) was delayed during the primary incident window, resulting in significant delays (up to 70 minutes) in log processing. All logs were delivered after incident resolution.</span></td>
  </tr>
  <tr>
    <td><span>Stream</span></td>
    <td><span>All (100% of) uploads failed during the primary incident window. Successful Stream video segment delivery dropped to 94%. Viewers may have seen video stalls every minute or so, although actual impact would have varied.</span><br /><br /><span>Stream Live was down during the primary incident window as it depends on object writes.</span></td>
  </tr>
  <tr>
    <td><span>Vectorize</span></td>
    <td><span>Queries and operations against Vectorize indexes were impacted during the incident window. Vectorize customers would have seen an increased error rate for read queries to indexes, and all (100% of) insert and upsert operations failed, as Vectorize depends on R2 for persistent storage.</span></td>
  </tr>
</tbody></table></div>
    <div>
      <h2>Incident timeline</h2>
      <a href="#incident-timeline">
        
      </a>
    </div>
    <p><b>All timestamps referenced are in Coordinated Universal Time (UTC).</b></p>
<div><table><colgroup>
<col></col>
<col></col>
</colgroup>
<thead>
  <tr>
    <th><span>Time</span></th>
    <th><span>Event</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Mar 21, 2025 - 19:49 UTC</span></td>
    <td><span>The R2 engineering team started the credential rotation process. A new set of credentials (ID and key pair) for storage infrastructure was created. Old credentials were maintained to avoid downtime during credential change over.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 20:19 UTC</span></td>
    <td><span>Set the updated production secret (</span><span>wrangler secret put</span><span>) and executed the </span><span>wrangler deploy</span><span> command to deploy the R2 Gateway service with updated credentials. </span><br /><br /><span>Note: We later discovered the </span><span>--env</span><span> parameter was inadvertently omitted for both Wrangler commands. This resulted in credentials being deployed to the Worker assigned to the </span><span>default</span><span> environment instead of the Worker assigned to the </span><span>production</span><span> environment.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 20:20 UTC</span></td>
    <td><span>The R2 Gateway service Worker assigned to the </span><span>default</span><span> environment is now using the updated storage infrastructure credentials.</span><br /><br /><span>Note: This was the wrong Worker; the </span><span>production</span><span> environment should have been explicitly set. At this point, however, we incorrectly believed the credentials were updated on the correct production Worker.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 20:37 UTC</span></td>
    <td><span>Old credentials were removed from our storage infrastructure to complete the credential rotation process.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 21:38 UTC</span></td>
    <td><span>– IMPACT BEGINS –</span><br /><br /><span>R2 availability metrics begin to show signs of service degradation. The impact to R2 availability metrics was gradual and not immediately obvious because there was a delay in the propagation of the previous credential deletion to storage infrastructure.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 21:45 UTC</span></td>
    <td><span>R2 global availability alerts are triggered (indicating 2% of error budget burn rate).</span><br /><br /><span>The R2 engineering team began looking at operational dashboards and logs to understand impact.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 21:50 UTC</span></td>
    <td><span>Internal incident declared.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 21:51 UTC</span></td>
    <td><span>R2 engineering team observes gradual but consistent decline in R2 availability metrics for both read and write operations. Operations involving metadata only (e.g., head and list operations) were not impacted.</span><br /><br /><span>Given gradual decline in availability metrics, R2 engineering team suspected a potential regression in propagation of new credentials in storage infrastructure.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 22:05 UTC</span></td>
    <td><span>Public incident status page published.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 22:15 UTC</span></td>
    <td><span>R2 engineering team created a new set of credentials (ID and key pair) for storage infrastructure in an attempt to force re-propagation.</span><br /><br /><span>Continued monitoring operational dashboards and logs.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 22:20 UTC</span></td>
    <td><span>R2 engineering team saw no improvement in availability metrics. Continued investigating other potential root causes.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 22:30 UTC</span></td>
    <td><span>R2 engineering team deployed a new set of credentials (ID and key pair) to the R2 Gateway service Worker. This was to validate whether there was an issue with the credentials we had pushed to the gateway service.</span><br /><br /><span>The environment parameter was still omitted in the </span><span>deploy</span><span> and </span><span>secret put</span><span> commands, so this deployment was still to the wrong non-production Worker.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 22:36 UTC</span></td>
    <td><span>– ROOT CAUSE IDENTIFIED –</span><br /><br /><span>The R2 engineering team discovered that credentials had been deployed to a non-production Worker by reviewing production Worker release history.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 22:45 UTC</span></td>
    <td><span>– IMPACT ENDS –</span><br /><br /><span>Deployed credentials to correct production Worker. R2 availability recovered.</span></td>
  </tr>
  <tr>
    <td><span>Mar 21, 2025 - 22:54 UTC</span></td>
    <td><span>The incident is considered resolved.</span></td>
  </tr>
</tbody></table></div>
    <div>
      <h2>Analysis</h2>
      <a href="#analysis">
        
      </a>
    </div>
    <p>R2’s architecture is primarily composed of three parts: the R2 production gateway Worker (which serves requests from the S3 API, REST API, and Workers API), the metadata service, and the storage infrastructure (which stores encrypted object data).</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1G9gfdEE4RIuMIUeNz42RN/25d1b4cca187a4a24c600a43ba51fb71/BLOG-2793_2.png" />
          </figure><p>The R2 Gateway Worker uses credentials (ID and key pair) to securely authenticate with our distributed storage infrastructure. We rotate these credentials regularly as a best-practice security precaution.</p><p>Our key rotation process involves the following high-level steps:</p><ol><li><p>Create a new set of credentials (ID and key pair) for our storage infrastructure. At this point, the old credentials are maintained to avoid downtime during the credential changeover.</p></li><li><p>Set the new credential secret for the R2 production gateway Worker using the <code>wrangler secret put</code> command.</p></li><li><p>Set the new credential ID as an environment variable in the R2 production gateway Worker using the <code>wrangler deploy</code> command. At this point, new storage credentials start being used by the gateway Worker.</p></li><li><p>Remove previous credentials from our storage infrastructure to complete the credential rotation process.</p></li><li><p>Monitor operational dashboards and logs to validate the changeover.</p></li></ol><p>The R2 engineering team uses <a href="https://developers.cloudflare.com/workers/wrangler/environments/"><u>Workers environments</u></a> to separate production and development environments for the R2 Gateway Worker. Each environment defines a separate isolated Cloudflare Worker with separate environment variables and secrets. </p><p>Critically, both <code>wrangler secret put</code> and <code>wrangler deploy</code> commands default to the <code>default</code> environment if the <code>--env</code> command-line parameter is not included. In this case, due to human error, we inadvertently omitted the <code>--env</code> parameter and deployed the new storage credentials to the wrong Worker (<code>default</code> environment instead of <code>production</code>). To correctly deploy storage credentials to the production R2 Gateway Worker, we need to specify <code>--env production</code>.</p><p>The action we took in step 4 above to remove the old credentials from our storage infrastructure caused authentication errors, as the R2 Gateway production Worker still had the old credentials. This is ultimately what resulted in degraded availability.</p><p>The decline in R2 availability metrics was gradual and not immediately obvious because there was a delay in the propagation of the previous credential deletion to storage infrastructure. This accounted for a delay in our initial discovery of the problem. Instead of relying on availability metrics after updating the old set of credentials, we should have explicitly validated which token was being used by the R2 Gateway service to authenticate with R2's storage infrastructure.</p><p>Overall, the impact on read availability was significantly mitigated by our intermediate cache that sits in front of storage and continued to serve requests.</p>
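    <p>To illustrate the failure mode concretely, the difference comes down to the presence of <code>--env</code> on both commands (the secret name below is hypothetical):</p>
<pre><code># What happened: with --env omitted, both commands target the Worker
# in the "default" environment rather than production.
wrangler secret put STORAGE_ACCESS_KEY_ID
wrangler deploy

# What was intended: explicitly target the production Worker.
wrangler secret put STORAGE_ACCESS_KEY_ID --env production
wrangler deploy --env production</code></pre>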
    <div>
      <h2>Resolution</h2>
      <a href="#resolution">
        
      </a>
    </div>
    <p>Once we identified the root cause, we were able to resolve the incident quickly by deploying the new credentials to the production R2 Gateway Worker. This resulted in an immediate recovery of R2 availability.</p>
    <div>
      <h2>Next steps</h2>
      <a href="#next-steps">
        
      </a>
    </div>
    <p>This incident happened because of human error and lasted longer than it should have because we didn’t have proper visibility into which credentials were being used by the R2 Gateway Worker to authenticate with our storage infrastructure.</p><p>We have taken immediate steps to prevent this failure (and others like it) from happening again:</p><ul><li><p>Added logging tags that include the suffix of the credential ID the R2 Gateway Worker uses to authenticate with our storage infrastructure. With this change, we can explicitly confirm which credential is being used.</p></li><li><p>Related to the above step, our internal processes now require explicit confirmation that the suffix of the new token ID matches logs from our storage infrastructure before deleting the previous token.</p></li><li><p>Require that key rotation takes place through our hotfix release tooling instead of relying on manual wrangler command entry, which introduces human error. Our hotfix release deploy tooling explicitly enforces the environment configuration and contains other safety checks.</p></li><li><p>While it’s been an implicit standard that this process involves at least two humans validating the changes as we progress, we’ve updated our relevant SOPs (standard operating procedures) to make this explicit.</p></li><li><p><b>In Progress</b>: Extend our existing closed-loop health check system that monitors our endpoints to test new keys, automate reporting of their status through our alerting platform, and ensure global propagation prior to releasing the gateway Worker. </p></li><li><p><b>In Progress</b>: To expedite triage on any future issues with our distributed storage endpoints, we are updating our <a href="https://www.cloudflare.com/learning/performance/what-is-observability/">observability</a> platform to include views of upstream success rates that bypass caching to give a clearer indication of issues serving requests for any reason.</p></li></ul><p>The list above is not exhaustive: as we work through the above items, we will likely uncover other improvements to our systems, controls, and processes that we’ll be applying to improve R2’s resiliency, on top of our business-as-usual efforts. We are confident that this set of changes will prevent this failure, and related credential rotation failure modes, from occurring again. Again, we sincerely apologize for this incident and deeply regret any disruption it has caused you or your users.</p> ]]></content:encoded>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <guid isPermaLink="false">4I4XNCQlRirlf9SaA9ySTS</guid>
            <dc:creator>Phillip Jones</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare incident on February 6, 2025]]></title>
            <link>https://blog.cloudflare.com/cloudflare-incident-on-february-6-2025/</link>
            <pubDate>Fri, 07 Feb 2025 00:00:00 GMT</pubDate>
            <description><![CDATA[ On Thursday, February 6, 2025, we experienced an outage with our object storage service (R2) and products that rely on it. Here's what happened and what we're doing to fix this going forward. ]]></description>
            <content:encoded><![CDATA[ <p>Multiple Cloudflare services, including our <a href="https://www.cloudflare.com/developer-platform/products/r2/"><u>R2 object storage</u></a>, were unavailable for 59 minutes on Thursday, February 6, 2025. This caused all operations against R2 to fail for the duration of the incident, and caused a number of other Cloudflare services that depend on R2 — including <a href="https://www.cloudflare.com/developer-platform/products/cloudflare-stream/"><u>Stream</u></a>, <a href="https://www.cloudflare.com/developer-platform/products/cloudflare-images/"><u>Images</u></a>, <a href="https://www.cloudflare.com/developer-platform/products/cache-reserve/"><u>Cache Reserve</u></a>, <a href="https://www.cloudflare.com/developer-platform/products/vectorize/"><u>Vectorize</u></a> and <a href="https://developers.cloudflare.com/logs/edge-log-delivery/"><u>Log Delivery</u></a> — to suffer significant failures.</p><p>The incident occurred due to human error and insufficient validation safeguards during a routine abuse remediation for a report about a phishing site hosted on R2. The action taken on the complaint resulted in an advanced product disablement action against the site that ended up disabling the production R2 Gateway service responsible for the R2 API. </p><p>Critically, this incident did <b>not</b> result in the loss or corruption of any data stored on R2. </p><p>We’re deeply sorry for this incident: this was a failure of a number of controls, and we are prioritizing work to implement additional system-level controls, not only in our abuse processing systems, but also to continue reducing the blast radius of <i>any</i> system or human action that could result in disabling any production service at Cloudflare.</p>
    <div>
      <h2>What was impacted?</h2>
      <a href="#what-was-impacted">
        
      </a>
    </div>
    <p>All customers using Cloudflare R2 would have observed a 100% failure rate against their R2 buckets and objects during the primary incident window. Services that depend on R2 (detailed in the table below) observed heightened error rates and failure modes depending on their usage of R2.</p><p>The primary incident window occurred between 08:14 UTC to 09:13 UTC, when operations against R2 had a 100% error rate. Dependent services (detailed below) observed increased failure rates for operations that relied on R2.</p><p>From 09:13 UTC to 09:36 UTC, as R2 recovered and clients reconnected, the backlog and resulting spike in client operations caused load issues with R2's metadata layer (built on Durable Objects). This impact was significantly more isolated: we observed a 0.09% increase in error rates in calls to Durable Objects running in North America during this window. </p><p>The following table details the impacted services, including the user-facing impact, operation failures, and increases in error rates observed:</p><table><tr><td><p><b>Product/Service</b></p></td><td><p><b>Impact</b></p></td></tr><tr><td><p><b>R2</b></p></td><td><p>100% of operations against R2 buckets and objects, including uploads, downloads, and associated metadata operations were impacted during the primary incident window. During the secondary incident window, we observed a &lt;1% increase in errors as clients reconnected and increased pressure on R2's metadata layer.</p><p>There was no data loss within the R2 storage subsystem: this incident impacted the HTTP frontend of R2. Separation of concerns and blast radius management meant that the underlying R2 infrastructure was unaffected by this.</p></td></tr><tr><td><p><b>Stream</b></p></td><td><p>100% of operations (upload &amp; streaming delivery) against assets managed by Stream were impacted during the primary incident window.</p></td></tr><tr><td><p><b>Images</b></p></td><td><p>100% of operations (uploads &amp; downloads) against assets managed by Images were impacted during the primary incident window.</p><p>Impact to Image Delivery was minor: success rate dropped to 97% as these assets are fetched from existing customer backends and do not rely on intermediate storage.</p></td></tr><tr><td><p><b>Cache Reserve</b></p></td><td><p>Cache Reserve customers observed an increase in requests to their origin during the incident window as 100% of operations failed. This resulted in an increase in requests to origins to fetch assets unavailable in Cache Reserve during this period. This impacted less than 0.049% of all cacheable requests served during the incident window.</p><p>User-facing requests for assets to sites with Cache Reserve did not observe failures as cache misses failed over to the origin.</p></td></tr><tr><td><p><b>Log Delivery</b></p></td><td><p>Log delivery was delayed during the primary incident window, resulting in significant delays (up to an hour) in log processing, as well as some dropped logs. </p><p>Specifically:</p><p>Non-R2 delivery jobs would have experienced up to 4.5% data loss during the incident. This level of data loss could have been different between jobs depending on log volume and buffer capacity in a given location.</p><p>R2 delivery jobs would have experienced up to 13.6% data loss during the incident. </p><p>R2 is a major destination for Cloudflare Logs. During the primary incident window, all available resources became saturated attempting to buffer and deliver data to R2. 
This prevented other jobs from acquiring resources to process their queues. Data loss (dropped logs) occurred when the job queues expired their data (to allow for new, incoming data). The system recovered when we enabled a kill switch to stop processing jobs sending data to R2.</p></td></tr><tr><td><p><b>Durable Objects</b></p></td><td><p>Durable Objects, and services that rely on it for coordination &amp; storage, were impacted as the stampeding horde of clients re-connecting to R2 drove an increase in load.</p><p>We observed a 0.09% actual increase in error rates in calls to Durable Objects running in North America, starting at 09:13 UTC and recovering by 09:36 UTC.</p></td></tr><tr><td><p><b>Cache Purge</b></p></td><td><p>Requests to the Cache Purge API saw a 1.8% error rate (HTTP 5xx) increase and a 10x increase in p90 latency for purge operations during the primary incident window. Error rates returned to normal immediately after this.</p></td></tr><tr><td><p><b>Vectorize</b></p></td><td><p>Queries and operations against Vectorize indexes were impacted during the primary incident window. 75% of queries to indexes failed (the remainder were served out of cache) and 100% of insert, upsert, and delete operations failed during the incident window as Vectorize depends on R2 for persistent storage. Once R2 recovered, Vectorize systems recovered in full.</p><p>We observed no continued impact during the secondary incident window, and we have not observed any index corruption as the Vectorize system has protections in place for this.</p></td></tr><tr><td><p><b>Key Transparency Auditor</b></p></td><td><p>100% of signature publish &amp; read operations to the KT auditor service failed during the primary incident window. No third party reads occurred during this window and thus were not impacted by the incident.</p></td></tr><tr><td><p><b>Workers &amp; Pages</b></p></td><td><p>A small volume (0.002%) of deployments to Workers and Pages projects failed during the primary incident window. These failures were limited to services with bindings to R2, as our control plane was unable to communicate with the R2 service during this period.</p></td></tr></table>
    <div>
      <h2>Incident timeline and impact</h2>
      <a href="#incident-timeline-and-impact">
        
      </a>
    </div>
    <p>The incident timeline, including the initial impact, investigation, root cause, and remediation, is detailed below.</p><p><b>All timestamps referenced are in Coordinated Universal Time (UTC).</b></p>
<div><table><colgroup>
<col></col>
<col></col>
</colgroup>
<thead>
  <tr>
    <th><span>Time</span></th>
    <th><span>Event</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>2025-02-06 08:12</span></td>
    <td><span>The R2 Gateway service is inadvertently disabled while responding to an abuse report.</span></td>
  </tr>
  <tr>
    <td><span>2025-02-06 08:14</span></td>
    <td><span>-- IMPACT BEGINS --</span></td>
  </tr>
  <tr>
    <td><span>2025-02-06 08:15</span></td>
    <td><span>R2 service metrics begin to show signs of service degradation.</span></td>
  </tr>
  <tr>
    <td><span>2025-02-06 08:17</span></td>
    <td><span>Critical R2 alerts begin to fire due to our service no longer responding to our health checks.</span></td>
  </tr>
  <tr>
    <td><span>2025-02-06 08:18</span></td>
    <td><span>R2 on-call engaged and began looking at our operational dashboards and service logs to understand impact to availability.</span></td>
  </tr>
  <tr>
    <td><span>2025-02-06 08:23</span></td>
    <td><span>Sales engineering escalated to the R2 engineering team that customers were experiencing a rapid increase in HTTP 500s from all R2 APIs.</span></td>
  </tr>
  <tr>
    <td><span>2025-02-06 08:25 </span></td>
    <td><span>Internal incident declared.</span></td>
  </tr>
  <tr>
    <td><span>2025-02-06 08:33</span></td>
    <td><span>R2 on-call was unable to identify the root cause and escalated to the lead on-call for assistance.</span></td>
  </tr>
  <tr>
    <td><span>2025-02-06 08:42</span></td>
    <td><span>Root cause identified as R2 team reviews service deployment history and configuration, which surfaces the action and the validation gap that allowed this to impact a production service.</span></td>
  </tr>
  <tr>
    <td><span>2025-02-06 08:46</span></td>
    <td><span>On-call attempts to re-enable the R2 Gateway service using our internal admin tooling; however, this tooling was unavailable because it relies on R2.</span></td>
  </tr>
  <tr>
    <td><span>2025-02-06 08:49</span></td>
    <td><span>On-call escalates to an operations team who has lower-level system access and can re-enable the R2 Gateway service.</span></td>
  </tr>
  <tr>
    <td><span>2025-02-06 08:57</span></td>
    <td><span>The operations team engaged and began to re-enable the R2 Gateway service.</span></td>
  </tr>
  <tr>
    <td><span>2025-02-06 09:09</span></td>
    <td><span>R2 team triggers a redeployment of the R2 Gateway service.</span></td>
  </tr>
  <tr>
    <td><span> 2025-02-06 09:10</span></td>
    <td><span>R2 began to recover as the forced re-deployment rolled out and clients were able to reconnect to R2.</span></td>
  </tr>
  <tr>
    <td><span>2025-02-06 09:13</span></td>
    <td><span>-- IMPACT ENDS --</span><br /><span>R2 availability recovers to within its service-level objective (SLO). Durable Objects begins to observe a slight increase in error rate (0.09%) for Durable Objects running in North America due to the spike in R2 clients reconnecting.</span></td>
  </tr>
  <tr>
    <td><span>2025-02-06 09:36</span></td>
    <td><span>The Durable Objects error rate recovers.</span></td>
  </tr>
  <tr>
    <td><span>2025-02-06 10:29</span></td>
    <td><span>The incident is closed after monitoring error rates.</span></td>
  </tr>
</tbody></table></div><p>At the R2 service level, our internal Prometheus metrics showed R2’s SLO drop almost immediately to 0% as R2’s Gateway service stopped serving all requests and terminated in-flight requests.</p><p>The slight delay between the disablement action and the observed failure was due to the action taking 1–2 minutes to take effect, as well as our configured metrics aggregation intervals:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4pbONRcG99RWttIUyGqnI6/bad397f73762a706285ea143ed2418b3/BLOG-2685_2.png" />
          </figure><p>For context, R2’s architecture separates the Gateway service (which is responsible for authenticating and serving requests to R2’s S3 &amp; REST APIs, and is the “front door” for R2) from its metadata store (built on Durable Objects), our intermediate caches, and the underlying, distributed storage subsystem responsible for durably storing objects.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/E2cgDKA2zGwaQDBs31tPk/4272c94625fd788148d16a90cc7cceaa/Image_20250206_172217_707.png" />
          </figure><p>During the incident, all other components of R2 remained up: this is what allowed the service to recover so quickly once the R2 Gateway service was restored and re-deployed. The R2 Gateway acts as the coordinator for all work when operations are made against R2. During the request lifecycle, we validate authentication and authorization, write any new data to a new immutable key in our object store, then update our metadata layer to point to the new object. When the service was disabled, all running processes stopped.</p><p>While this means that all in-flight and subsequent requests fail, anything that had received an HTTP 200 response had already succeeded with no risk of reverting to a prior version when the service recovered. This is critical to R2’s consistency guarantees and mitigates the chance of a client receiving a successful API response without the underlying metadata <i>and </i>storage infrastructure having persisted the change.</p>
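    <p>To make that write path concrete, here is a minimal, self-contained sketch of the "write a new immutable key, then update metadata" ordering. This is illustrative Lua only, with in-memory tables standing in for the real storage and metadata layers; none of the names below are R2's actual internals.</p>
            <pre><code>-- Illustrative sketch only: in-memory stand-ins for the storage and
-- metadata layers, not R2's actual internals.
local storage  = {}   -- immutable object store: key -> object body
local metadata = {}   -- metadata layer: path -> current object key
local version  = 0

local function new_immutable_key(path)
    version = version + 1
    return path .. "#v" .. version   -- every write gets a fresh key
end

local function handle_put(path, body)
    -- 1. Write the payload under a brand-new key; existing objects are
    --    never mutated in place.
    local key = new_immutable_key(path)
    storage[key] = body
    -- 2. Only then point the metadata entry at the new key, so readers
    --    see either the old version or the new one, never a partial write.
    metadata[path] = key
    -- 3. Acknowledge only after both layers have persisted the change.
    return 200
end

local function handle_get(path)
    local key = metadata[path]
    if key == nil then return 404 end
    return 200, storage[key]
end

handle_put("bucket/hello.txt", "v1")
handle_put("bucket/hello.txt", "v2")
print(handle_get("bucket/hello.txt"))   --> 200   v2</code></pre>
            <p>If a process dies between steps 1 and 2, the new object is orphaned but the metadata still points at the last fully persisted version, which is why a successful response implies the change is durable.</p>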
    <div>
      <h2>Deep dive </h2>
      <a href="#deep-dive">
        
      </a>
    </div>
    <p><b>Due to human error and insufficient validation safeguards in our admin tooling, the R2 Gateway service was taken down as part of a routine remediation for a phishing URL.</b></p><p>During a routine abuse remediation, action was taken on a complaint that inadvertently disabled the R2 Gateway service instead of the specific endpoint/bucket associated with the report. This was a failure of multiple system-level controls (first and foremost) and operator training. </p><p>A key system-level gap that led to this incident was in how we identify (or "tag") internal accounts used by our teams. Teams typically have multiple accounts (dev, staging, prod) to reduce the blast radius of any configuration changes or deployments, but our abuse processing systems were not explicitly configured to identify these accounts and block disablement actions against them. Instead of disabling the specific endpoint associated with the abuse report, the system allowed the operator to (incorrectly) disable the R2 Gateway service. </p><p>Once we identified this as the cause of the outage, remediation and recovery were inhibited by the lack of direct controls to revert the product disablement action and the need to engage an operations team with lower-level access than is routinely used. The R2 Gateway service then required a re-deployment in order to rebuild its routing pipeline across our edge network.</p><p>Once re-deployed, clients were able to re-connect to R2, and error rates for dependent services (including Stream, Images, Cache Reserve and Vectorize) returned to normal levels.</p>
    <div>
      <h2>Remediation and follow-up steps</h2>
      <a href="#remediation-and-follow-up-steps">
        
      </a>
    </div>
    <p>We have taken immediate steps to resolve the validation gaps in our tooling to prevent this specific failure from occurring in the future.</p><p>We are prioritizing several work-streams to implement stronger, system-wide controls (defense-in-depth) to prevent this, including how we provision internal accounts so that we are not relying on our teams to correctly and reliably tag accounts. A key theme of our remediation efforts is removing the need to rely on training or process, and instead ensuring that our systems have the right guardrails and controls built in to prevent operator errors.</p><p>These work-streams include (but are not limited to) the following:</p><ul><li><p><b>Actioned</b>: Deployed additional guardrails in the Admin API to prevent product disablement of services running in internal accounts (a sketch of this kind of check follows this list).</p></li><li><p><b>Actioned</b>: Product disablement actions in the abuse review UI have been disabled while we add more robust safeguards. This will prevent us from inadvertently repeating similar high-risk manual actions.</p></li><li><p><b>In-flight</b>: Changing how we create all internal accounts (staging, dev, production) to ensure that all accounts are correctly provisioned into the correct organization. This must include protections against creating standalone accounts to avoid a recurrence of this incident (or similar) in the future.</p></li><li><p><b>In-flight</b>: Further restricting access to product disablement actions that go beyond the system-recommended remediations to a smaller group of senior operators.</p></li><li><p><b>In-flight</b>: Requiring two-party approval for ad-hoc product disablement actions. Going forward, if an investigator requires additional remediations, those must be approved by a manager or a person on our approved remediation acceptance list before they are applied to an abuse report.</p></li><li><p><b>In-flight</b>: Expand existing abuse checks that prevent accidental blocking of internal hostnames to also prevent any product disablement action against products associated with an internal Cloudflare account.</p></li><li><p><b>In-flight</b>: Internal accounts are being moved to our new Organizations model ahead of public release of this feature. The R2 production account was a member of this organization, but our abuse remediation engine did not have the necessary protections to prevent acting against accounts within this organization.</p></li></ul><p>We’re continuing to discuss &amp; review additional steps that can further reduce the blast radius of any system- or human-initiated action that could result in disabling any production service at Cloudflare.</p>
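    <p>To illustrate the shape of the guardrail described in the first item above, here is a minimal sketch in Lua. Everything in it (the field names, the organization label, and the function itself) is hypothetical; it is not Cloudflare's actual Admin API, only the kind of check these work-streams add.</p>
            <pre><code>-- Hypothetical guardrail sketch; names and fields are illustrative,
-- not Cloudflare's actual Admin API.
local INTERNAL_ORG = "cloudflare-internal"

local function can_disable_product(account, action)
    -- Hard block: never allow disablement actions against accounts that
    -- belong to the internal organization, regardless of operator intent.
    if account.organization == INTERNAL_ORG then
        return false, "refusing to disable a product on an internal account"
    end
    -- Ad-hoc actions (beyond the system-recommended remediations) need a
    -- second approver before they can proceed.
    if action.ad_hoc and not action.second_approver then
        return false, "ad-hoc disablement requires two-party approval"
    end
    return true
end

print(can_disable_product(
    { organization = "cloudflare-internal" },
    { ad_hoc = true }))
--> false   refusing to disable a product on an internal account</code></pre>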
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We understand this was a serious incident, and we are painfully aware of — and extremely sorry for — the impact it caused to customers and teams building and running their businesses on Cloudflare.</p><p>This is the first (and ideally, the last) incident of this kind and duration for R2, and we’re committed to improving controls across our systems and workflows to prevent this in the future.</p> ]]></content:encoded>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[Outage]]></category>
            <guid isPermaLink="false">mDiwAePfMfpVHMlYrfrFu</guid>
            <dc:creator>Matt Silverlock</dc:creator>
            <dc:creator>Javier Castro</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare incident on June 20, 2024]]></title>
            <link>https://blog.cloudflare.com/cloudflare-incident-on-june-20-2024/</link>
            <pubDate>Wed, 26 Jun 2024 13:00:22 GMT</pubDate>
            <description><![CDATA[ A new DDoS rule resulted in an increase in error responses and latency for Cloudflare customers. Here’s how it went wrong, and what we’ve learned ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/18peGsB3IdmJIUOoqJalSs/b69735394144edb57ba12ff8e8516518/2459.png" />
            
            </figure><p>On Thursday, June 20, 2024, two independent events caused an increase in latency and error rates for Internet properties and Cloudflare services that lasted 114 minutes. During the 30-minute peak of the impact, we saw that 1.4–2.1% of HTTP requests to our CDN received a generic error page, and observed a 3x increase in the 99th percentile Time To First Byte (TTFB) latency.</p><p>These events occurred because:</p><ol><li><p><a href="https://www.cloudflare.com/network-services/solutions/network-monitoring-tools/">Automated network monitoring</a> detected performance degradation, re-routing traffic suboptimally and <a href="https://www.cloudflarestatus.com/incidents/k7d4c79c63lq">causing backbone congestion between 17:33 and 17:50 UTC</a></p></li><li><p>A new Distributed Denial-of-Service (DDoS) mitigation mechanism deployed between 14:14 and 17:06 UTC triggered a latent bug in our rate limiting system that allowed a specific form of HTTP request to cause a process handling it to enter an infinite loop <a href="https://www.cloudflarestatus.com/incidents/p7l6rrbhysck">between 17:47 and 19:27 UTC</a></p></li></ol><p>Impact from these events was observed in many Cloudflare data centers around the world.</p><p>With respect to the backbone congestion event, we were already working on expanding backbone capacity in the affected data centers, and improving our network mitigations to use more information about the available capacity on alternative network paths when taking action. In the remainder of this blog post, we will go into more detail on the second and more impactful of these events.</p><p>As part of routine updates to our protection mechanisms, we created a new DDoS rule to prevent a specific type of abuse that we observed on our infrastructure. This DDoS rule worked as expected; however, in a specific case of suspect traffic, it exposed a latent bug in our existing rate-limiting component. To be absolutely clear, we have no reason to believe this suspect traffic was intentionally exploiting this bug, and there is no evidence of a breach of any kind.</p><p>We are sorry for the impact and have already made changes to help prevent these problems from occurring again.</p>
    <div>
      <h2>Background</h2>
      <a href="#background">
        
      </a>
    </div>
    
    <div>
      <h3>Rate-limiting suspicious traffic</h3>
      <a href="#rate-limiting-suspicious-traffic">
        
      </a>
    </div>
    <p>Depending on the profile of an HTTP request and the configuration of the requested Internet property, Cloudflare may protect our network and our customers’ origins by applying a limit to the number of requests a visitor can make within a certain time window. These <a href="https://www.cloudflare.com/en-gb/learning/bots/what-is-rate-limiting/">rate limits</a> can activate through customer configuration or in response to DDoS rules detecting suspicious activity.</p><p>Usually, these rate limits will be applied based on the IP address of the visitor. As many institutions and Internet Service Providers (ISPs) can have <a href="https://en.wikipedia.org/wiki/Carrier-grade_NAT">many devices and individual users behind a single IP address</a>, rate limiting based on the IP address is a broad brush that can unintentionally block legitimate traffic.</p>
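    <p>As a rough illustration of the mechanism, and of why the IP address is such a broad key, here is a minimal fixed-window rate limiter in Lua. It is a sketch of the general technique only, not Cloudflare's implementation.</p>
            <pre><code>-- Minimal fixed-window rate limiter keyed on client IP.
-- A sketch of the general technique, not Cloudflare's implementation.
local WINDOW_SECONDS = 10
local MAX_REQUESTS   = 100

local counters = {}   -- key (client IP) -> { window_start, count }

local function allow_request(client_ip, now)
    local entry = counters[client_ip]
    if entry == nil or now - entry.window_start >= WINDOW_SECONDS then
        -- First request in a new window for this key.
        counters[client_ip] = { window_start = now, count = 1 }
        return true
    end
    entry.count = entry.count + 1
    -- Every user behind a shared (e.g. carrier-grade NAT) address
    -- consumes the same budget, which is the "broad brush" problem.
    if entry.count > MAX_REQUESTS then
        return false
    end
    return true
end

print(allow_request("203.0.113.7", os.time()))   --> true</code></pre>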
    <div>
      <h3>Balancing traffic across our network</h3>
      <a href="#balancing-traffic-across-our-network">
        
      </a>
    </div>
    <p>Cloudflare has several systems that together provide continuous real-time capacity monitoring and rebalancing to ensure we serve as much traffic as we can, as quickly and efficiently as possible.</p><p>The first of these is <a href="/unimog-cloudflares-edge-load-balancer">Unimog, Cloudflare’s edge load balancer</a>. Every packet that reaches our anycast network passes through Unimog, which delivers it to an appropriate server to process that packet. That server may be in a different location from where the packet originally arrived into our network, depending on the availability of compute capacity. Within each data center, Unimog aims to keep the CPU load uniform across all active servers.</p><p>For a global view of our network, we rely on <a href="/meet-traffic-manager">Traffic Manager</a>. Across all of our data center locations, it takes in a variety of signals, such as overall CPU utilization, HTTP request latency, and bandwidth utilization, to inform rebalancing decisions. It has built-in safety limits to prevent causing outsized traffic shifts, and also considers the expected resulting load in destination locations when making any decisions.</p>
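    <p>As a toy model of the kind of decision Traffic Manager makes, consider the Lua sketch below. It is hugely simplified, with made-up thresholds and units; Traffic Manager's real inputs and safety logic are far richer.</p>
            <pre><code>-- Toy rebalancing sketch with made-up numbers; Traffic Manager's real
-- signals, units, and safety limits are far more sophisticated.
local CPU_TARGET      = 0.80   -- start shedding load above this utilization
local MAX_SHIFT_RATIO = 0.25   -- safety limit: never move more than 25%

local function plan_shift(source, candidates)
    if CPU_TARGET >= source.cpu then return nil end
    -- Size the move by the overload, capped by the safety limit.
    local desired = (source.cpu - CPU_TARGET) * source.traffic
    local capped  = math.min(desired, MAX_SHIFT_RATIO * source.traffic)
    -- Consider the expected resulting load at each destination.
    for _, dc in ipairs(candidates) do
        local projected = dc.cpu + capped / dc.capacity
        if CPU_TARGET > projected then   -- destination stays below target
            return { to = dc.name, amount = capped }
        end
    end
    return nil   -- no safe destination: leave the traffic where it is
end

local shift = plan_shift(
    { cpu = 0.95, traffic = 1000 },
    { { name = "ams", cpu = 0.55, capacity = 4000 },
      { name = "fra", cpu = 0.90, capacity = 3000 } })
print(shift.to, shift.amount)   --> ams   150.0</code></pre>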
    <div>
      <h2>Incident timeline and impact</h2>
      <a href="#incident-timeline-and-impact">
        
      </a>
    </div>
    <p>All timestamps are UTC on 2024-06-20.</p><ul><li><p>14:14 DDoS rule gradual deployment starts</p></li><li><p>17:06 DDoS rule deployed globally</p></li><li><p>17:47 First HTTP request handling process is poisoned</p></li><li><p>18:04 Incident declared automatically based on detected high CPU load</p></li><li><p>18:34 Service restart shown to recover on a server, full restart tested in one data center</p></li><li><p>18:44 CPU load normalized in data center after service restart</p></li><li><p>18:51 Continual global reloads of all servers with many stuck processes begin</p></li><li><p>19:05 Global eyeball HTTP error rate peaks at 2.1% service unavailable / 3.45% total</p></li><li><p>19:05 First Traffic Manager actions recovering service</p></li><li><p>19:11 Global eyeball HTTP error rate halved to 1% service unavailable / 1.96% total</p></li><li><p>19:27 Global eyeball HTTP error rate reduced to baseline levels</p></li><li><p>19:29 DDoS rule deployment identified as likely cause of process poisoning</p></li><li><p>19:34 DDoS rule is fully disabled</p></li><li><p>19:43 Engineers stop routine restarts of services on servers with many stuck processes</p></li><li><p>20:16 Incident response stood down</p></li></ul><p>Below, we provide a view of the impact from some of Cloudflare’s internal metrics. The first graph illustrates the percentage of all eyeball (inbound from external devices) HTTP requests that were served an error response because the service suffering poisoning could not be reached. We saw an initial increase to 0.5% of requests, and then later a larger one reaching as much as 2.1% before recovery started due to our service reloads.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1EoCaNBt1Bss8RhwKO5s5B/155382a7f8eba337c39f48c803c9979b/image7-3.png" />
            
            </figure><p>For a broader view of errors, we can see all 5xx responses our network returned to eyeballs during the same window, including those from origin servers. These peaked at 3.45%, and you can more clearly see the gradual recovery between 19:25 and 20:00 UTC as Traffic Manager finished its re-routing activities. The dip at 19:25 UTC aligns with the last large reload, with the error increase afterwards primarily consisting of upstream DNS timeouts and connection limits which are consistent with high and unbalanced load.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7G1ZnlpW9Qbw8DgW5MfsLv/9758cb94e7d6be79dcfac41e482114ea/image4-6.png" />
            
            </figure><p>And here’s what our TTFB measurements looked like at the 50th, 90th and 99th percentiles, showing an almost 3x increase in latency at p99:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/16tMNhSZdhkV3wH6MtAWd6/44a638b8f639d39026112c32cb80dc2e/image2-10.png" />
            
            </figure>
    <div>
      <h2>Technical description of the error and how it happened</h2>
      <a href="#technical-description-of-the-error-and-how-it-happened">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/MQYpmg1O82vEURJNsl2VU/da0b8e5c1afd70998bc928a533ee4160/image5-5.png" />
            
            </figure><p><i>Global percentage of HTTP Request handling processes that were using excessive CPU during the event</i></p><p>Earlier on June 20, between 14:14 and 17:06 UTC, we gradually activated a new DDoS rule on our network. Cloudflare has recently been building a new way of mitigating HTTP DDoS attacks. This method uses a combination of rate limits and cookies in order to allow legitimate clients that were falsely identified as being part of an attack to proceed anyway.</p><p>With this new method, an HTTP request that is considered suspicious runs through these key steps:</p><ol><li><p>Check for the presence of a valid cookie; otherwise, block the request</p></li><li><p>If a valid cookie is found, add a rate-limit rule based on the cookie value to be evaluated at a later point</p></li><li><p>Once all the currently applied DDoS mitigations have run, apply the rate-limit rules</p></li></ol><p>We use this "asynchronous" workflow because it is more efficient to block a request without a rate-limit rule, so it gives a chance for other rule types to be applied.</p><p>So overall, the flow can be summarized with this pseudocode:</p>
            <pre><code>for (rule in active_mitigations) {
   // ... (ignore other rule types)
   if (rule.match_current_request()) {
       if (!has_valid_cookie()) {
           // no cookie: serve error page
           return serve_error_page();
       } else {
           // add a rate-limit rule to be evaluated later
           add_rate_limit_rule(rule);
       }
   }
}


evaluate_rate_limit_rules();</code></pre>
            <p>When evaluating rate-limit rules, we need to make a <i>key</i> for each client that is used to look up the correct counter and compare it with the target rate. Typically, this key is the client IP address, but other options are available, such as the value of a cookie as used here. We actually reused an existing portion of the rate-limit logic to achieve this. In pseudocode, it looks like:</p>
            <pre><code>function get_cookie_key() {
   // Validate that the cookie is valid before taking its value.
   // Here the cookie has been checked before already, but this code is
   // also used for "standalone" rate-limit rules.
   if (has_valid_cookie_broken()) { // more on the "broken" part later
       return cookie_value;
   } else {
       return parent_key_generator();
   }
}</code></pre>
            <p>This simple <i>key</i> generation function had two issues that, combined with a specific form of client request, caused an infinite loop in the process handling the HTTP request:</p><ol><li><p>The rate-limit rules generated by the DDoS logic use internal APIs in ways that hadn’t been anticipated. This caused the <code>parent_key_generator</code> in the pseudocode above to point to the <code>get_cookie_key</code> function itself, meaning that if that code path was taken, the function would call itself indefinitely</p></li><li><p>As these rate-limit rules are added only after validating the cookie, validating it a second time should give the same result. The problem is that the <code>has_valid_cookie_broken</code> function used here is actually a different implementation, and the two can disagree if the client sends multiple cookies where some are valid and some are not</p></li></ol><p>So, combining these two issues: the broken validation function tells <code>get_cookie_key</code> that the cookie is invalid, causing the <code>else</code> branch to be taken and calling the same function over and over.</p><p>A protection many programming languages have in place to help prevent loops like this is a runtime limit on how deep the stack of function calls can get. An attempt to call a function once already at this limit will result in a runtime error. When reading the logic above, an initial analysis might suggest we were reaching the limit in this case, and so requests eventually resulted in an error, with a stack containing those same function calls over and over.</p><p>However, this isn’t the case here. Some languages, including Lua, in which this logic is written, also implement an optimization called proper tail calls. A tail call is when the final action a function takes is to execute another function. Instead of adding that function as another layer in the stack, as we know for sure that we will not be returning execution context to the parent function afterwards, nor using any of its local variables, we can replace the top frame in the stack with this function call instead.</p><p>The end result is a loop in the request processing logic which never increases the size of the stack. Instead, it simply consumes 100% of available CPU resources and never terminates. Once a process handling HTTP requests receives a single request on which the action should be applied and that has a mixture of valid and invalid cookies, that process is poisoned and is never able to process any further requests.</p><p>Every Cloudflare server has dozens of such processes, so a single poisoned process does not have much of an impact. However, other things then start happening:</p><ol><li><p>The increase in CPU utilization for the server causes Unimog to lower the amount of new traffic that server receives, moving traffic to other servers. At a certain point, more new connections are directed away from servers with a subset of their processes poisoned and toward those with fewer or no poisoned processes, and therefore lower CPU utilization.</p></li><li><p>The gradual increase in CPU utilization in the data center starts to cause Traffic Manager to redirect traffic to other data centers. 
As this movement does not fix the poisoned processes, CPU utilization remains high, and so Traffic Manager continues to redirect more and more traffic away.</p></li><li><p>The redirected traffic in both cases includes the requests that are poisoning processes, causing the servers and data centers to which this redirected traffic was sent to start failing in the same way.</p></li></ol><p>Within a few minutes, multiple data centers had many poisoned processes, and Traffic Manager had redirected as much traffic away from them as possible, but was restricted from doing more. This was partly due to its built-in automation safety limits, but also because it was becoming more difficult to find a data center with sufficient available capacity to use as a target.</p><p>The first case of a poisoned process was at 17:47 UTC, and by 18:09 UTC – five minutes after the incident was declared – Traffic Manager was re-routing a lot of traffic out of Europe:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2AVE8fJsh1ODNjAoAtduon/6d77ac4385ba78ddd251c887ef02b74a/image3-14.png" />
            
            </figure><p><i>A summary map of Traffic Manager capacity actions as of 18:09 UTC. Each circle represents a data center that traffic is being re-routed towards or away from. The color of the circle indicates the CPU load of that data center. The orange ribbons between them show how much traffic is re-routed, and where from/to.</i></p><p>It’s easy to see why if we look at the percentage of the HTTP request service’s processes that were saturating their CPUs. 10% of our capacity in Western Europe was already gone, and 4% in Eastern Europe, during peak traffic time for those timezones:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7A71NY1brim2rjGowoewXo/a1c7f316d465b7dfb6c844b14c6408c3/image1-21.png" />
            
            </figure><p><i>Percentage of all the HTTP request handling processes saturating their CPU, by geographic region</i></p><p>Partially poisoned servers in many locations struggled with the request load, and the remaining processes could not keep up, resulting in Cloudflare returning minimal HTTP error responses.</p><p>Cloudflare engineers were automatically notified at 18:04 UTC, once our global CPU utilization reached a certain sustained level, and started to investigate. Many of our on-duty incident responders were already working on the open incident caused by backbone network congestion, and in the early minutes we looked into a likely correlation with the network congestion events. It took some time for us to realize that the locations where CPU utilization was highest were the ones where traffic was lowest, drawing the investigation away from a network event being the trigger. At this point, the focus moved to two main streams:</p><ol><li><p>Evaluating if restarting poisoned processes allowed them to recover, and if so, instigating mass-restarts of the service on affected servers</p></li><li><p>Identifying the trigger of processes entering this CPU saturation state</p></li></ol><p>It was 25 minutes after the initial incident was declared when we validated that restarts helped on one sample server. Five minutes after this, we started executing wider restarts – initially to entire data centers at once, and then as the identification method was refined, on servers with a large number of poisoned processes. Some engineers continued regular routine restarts of the affected service on impacted servers, whilst others moved to join the ongoing parallel effort to identify the trigger. At 19:34 UTC, the new DDoS rule was disabled globally, and the incident was declared resolved after executing one more round of mass restarts and monitoring.</p><p>At the same time, conditions presented by the incident triggered a latent bug in Traffic Manager. When triggered, the system would attempt to recover from the exception by initiating a graceful restart, halting its activity. The bug was first triggered at 18:17 UTC, then numerous times between 18:35 and 18:57 UTC. During two periods in this window (18:35-18:52 UTC and 18:56-19:05 UTC) the system did not issue any new traffic routing actions. This meant that whilst we had recovered service in the most affected data centers, almost all traffic was still being re-routed away from them. Alerting notified on-call engineers of the issue at 18:34 UTC. By 19:05 UTC the Traffic team had written, tested, and deployed a fix. The first actions after the fix was deployed had a positive impact on restoring service.</p>
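    <p>To make the poisoning mechanism concrete, the behaviour described above can be reproduced in a few lines of standalone Lua. This is an illustrative reconstruction, not the production code: the two functions stand in for the real key-generation logic, and the boolean stands in for the disagreeing cookie check.</p>
            <pre><code>-- Illustrative reconstruction of the loop, not the production code.
local get_cookie_key

local function parent_key_generator()
    -- In the incident, internal APIs were wired so that the "parent"
    -- generator pointed back at get_cookie_key itself.
    return get_cookie_key()            -- proper tail call: the stack does not grow
end

function get_cookie_key()
    local cookie_is_valid = false      -- the broken re-validation disagrees
    if cookie_is_valid then
        return "cookie-value"
    else
        return parent_key_generator()  -- proper tail call again
    end
end

-- Calling get_cookie_key() would never raise a stack-overflow error and
-- never return: it simply pins one CPU core, exactly like a poisoned
-- request-handling process.</code></pre>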
    <div>
      <h2>Remediation and follow-up steps</h2>
      <a href="#remediation-and-follow-up-steps">
        
      </a>
    </div>
    <p>To resolve the immediate impact to our network from the request poisoning, Cloudflare instigated mass rolling restarts of the affected service until the change that triggered the condition was identified and rolled back. The change, which was the activation of a new type of DDoS rule, remains fully rolled back, and the rule will not be reactivated until we have fixed the broken cookie validation check and are fully confident this situation cannot recur.</p><p>We take these incidents very seriously, and recognize the magnitude of impact they had. We have identified several steps we can take to address these specific situations and to reduce the risk of these sorts of problems recurring in the future.</p><ul><li><p><b>Design:</b> The rate limiting implementation in use for our DDoS module is a legacy component, and rate limiting rules customers configure for their Internet properties use a newer engine with more modern technologies and protections.</p></li><li><p><b>Design:</b> We are exploring options within and around the service that experienced process poisoning to limit the ability to loop forever through tail calls (one possible shape of such a limit is sketched after this list). Longer term, Cloudflare is entering the early implementation stages of replacing this service entirely. The design of this replacement service will allow us to apply limits on the non-interrupted and total execution time of a single request.</p></li><li><p><b>Process:</b> The activation of the new rule for the first time was staged in a handful of production data centers for validation, and then to all data centers a few hours later. We will continue to enhance our staging and rollout procedures to minimize the potential change-related blast radius.</p></li></ul>
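    <p>For the execution-time limits mentioned above, one possible shape (a sketch only, using plain Lua's standard debug library rather than whatever the replacement service will actually use) is a per-request instruction budget enforced by a count hook:</p>
            <pre><code>-- Sketch of bounding non-interrupted execution in plain Lua; the
-- replacement service's actual mechanism may look nothing like this.
local MAX_INSTRUCTIONS = 10000000   -- made-up per-request budget

local function run_with_budget(fn, ...)
    local co = coroutine.create(fn)
    -- Count hook: fires inside the coroutine every MAX_INSTRUCTIONS VM
    -- instructions and aborts it, even if it never yields.
    debug.sethook(co, function()
        error("execution budget exceeded", 0)
    end, "", MAX_INSTRUCTIONS)
    return coroutine.resume(co, ...)
end

-- A "poisoned" handler that tail-calls itself forever:
local function poisoned() return poisoned() end

print(run_with_budget(poisoned))
--> false   execution budget exceeded</code></pre>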
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Cloudflare experienced two back-to-back incidents that affected a significant set of customers using our CDN and network services. The first was network backbone congestion that our systems automatically remediated. We mitigated the second by regularly restarting the faulty service whilst we identified and deactivated the DDoS rule that was triggering the fault. We are sorry for any disruption this caused our customers and to end users trying to access services.</p><p>The conditions necessary to activate the latent bug in the faulty service are no longer possible in our production environment, and we are putting further fixes and detections in place as soon as possible.</p> ]]></content:encoded>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[Outage]]></category>
            <guid isPermaLink="false">1J7iIhkUBWQNj3JpbBTVEU</guid>
            <dc:creator>Lloyd Wallis</dc:creator>
            <dc:creator>Julien Desgats</dc:creator>
            <dc:creator>Manish Arora</dc:creator>
        </item>
        <item>
            <title><![CDATA[Major data center power failure (again): Cloudflare Code Orange tested]]></title>
            <link>https://blog.cloudflare.com/major-data-center-power-failure-again-cloudflare-code-orange-tested/</link>
            <pubDate>Mon, 08 Apr 2024 13:00:15 GMT</pubDate>
            <description><![CDATA[ Just four months after a complete power outage at a critical data center we were hit with the exact same scenario.  Here’s how we did this time, and what’s next ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4fn80cCKCVWYn0XOOh3eX2/e23f4144cdb106dc80bd3b8a27f27254/image3-11.png" />
            
            </figure><p>Here's a post we never thought we'd need to write: less than five months after one of our major data centers lost power, it happened again to the exact same data center. That sucks and, if you're thinking "why do they keep using this facility??," I don't blame you. We're thinking the same thing. But, here's the thing, while a lot may not have changed at the data center, a lot changed over those five months at Cloudflare. So, while five months ago a major data center going offline was really painful, this time it was much less so.</p><p>This is a little bit about how a high availability data center lost power for the second time in five months. But, more so, it's the story of how our team worked to ensure that even if one of our critical data centers lost power it wouldn't impact our customers.</p><p>On November 2, 2023, one of our critical facilities in the Portland, Oregon region lost power for an extended period of time. It happened because of a cascading series of faults that appears to have been caused by maintenance by the electrical grid provider, climaxing with a ground fault at the facility, and was made worse by a series of unfortunate incidents that prevented the facility from getting back online in a timely fashion.</p><p>If you want to read all the gory details, they're available <a href="/post-mortem-on-cloudflare-control-plane-and-analytics-outage/">here</a>.</p><p>It's painful whenever a data center has a complete loss of power, but it's something that we were supposed to expect. Unfortunately, in spite of that expectation, we hadn't enforced a number of requirements on our products that would ensure they continued running in spite of a major failure.</p><p>That was a mistake we were never going to allow to happen again.</p>
    <div>
      <h3>Code Orange</h3>
      <a href="#code-orange">
        
      </a>
    </div>
    <p>The incident was painful enough that we declared what we called Code Orange. We borrowed the idea from Google which, when they have an existential threat to their business, reportedly declares a Code Yellow or Code Red. Our logo is orange, so we altered the formula a bit.</p><p>Our conception of Code Orange was that the person who led the incident, in this case our SVP of Technical Operations, Jeremy Hartman, would be empowered to charge any engineer on our team to work on what he deemed the highest priority project. (Unless we declared a Code Red, which we actually ended up doing due to a hacking incident, and which would then take even higher priority. If you're interested, you can read more about that <a href="/thanksgiving-2023-security-incident/">here</a>.)</p><p>After getting through the immediate incident, Jeremy quickly triaged the most important work that needed to be done in order to ensure we'd be highly available even in the case of another catastrophic failure of a major data center facility. And the team got to work.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Q7F31g2w6xPxdlq39dpDW/ad9a106fed84e8fcd728e165bfd2767a/image2-15.png" />
            
            </figure>
    <div>
      <h3>How'd we do?</h3>
      <a href="#howd-we-do">
        
      </a>
    </div>
    <p>We didn’t expect such an extensive real-world test so quickly, but the universe works in mysterious ways. On Tuesday, March 26, 2024, — just shy of five months after the initial incident — the same facility had another major power outage. Below, we'll get into what caused the outage this time, but what is most important is that it provided a perfect test for the work our team had done under Code Orange. So, what were the results?</p><p>First, let’s revisit what functions the Portland data centers at Cloudflare provide. As described in the November 2, 2023, <a href="/post-mortem-on-cloudflare-control-plane-and-analytics-outage/">post</a>, the control plane of Cloudflare primarily consists of the customer-facing interface for all of our services including our website and API. Additionally, the underlying services that provide the Analytics and Logging pipelines are primarily served from these facilities.</p><p>Just like in November 2023, we were alerted immediately that we had lost connectivity to our PDX01 data center. Unlike in November, we very quickly knew with certainty that we had once again lost all power, putting us in the exact same situation as five months prior. We also knew, based on a successful internal cut test in February, how our systems should react. We had spent months preparing, updating countless systems and activating huge amounts of network and server capacity, culminating with a test to prove the work was having the intended effect, which in this case was an automatic failover to the redundant facilities.</p><p>Our Control Plane consists of hundreds of internal services, and the expectation is that when we lose one of the three critical data centers in Portland, these services continue to operate normally in the remaining two facilities, and we continue to operate primarily in Portland. We have the capability to fail over to our European data centers in case our Portland centers are completely unavailable. However, that is a secondary option, and not something we pursue immediately.</p><p>On March 26, 2024, at 14:58 UTC, PDX01 lost power and our systems began to react. By 15:05 UTC, our APIs and Dashboards were operating normally, all without human intervention. Our primary focus over the past few months has been to make sure that our customers would still be able to configure and operate their Cloudflare services in case of a similar outage. 
There were a few specific services that required human intervention and therefore took a bit longer to recover; however, the primary interface mechanism was operating as expected.</p><p>To put a finer point on this, during the November 2, 2023, incident the following services had at least six hours of control plane downtime, with several of them functionally degraded for days.</p><ul><li><p>API and Dashboard</p></li><li><p>Zero Trust</p></li><li><p>Magic Transit</p></li><li><p>SSL</p></li><li><p>SSL for SaaS</p></li><li><p>Workers</p></li><li><p>KV</p></li><li><p>Waiting Room</p></li><li><p>Load Balancing</p></li><li><p>Zero Trust Gateway</p></li><li><p>Access</p></li><li><p>Pages</p></li><li><p>Stream</p></li><li><p>Images</p></li></ul><p>During the March 26, 2024, incident, all of these services were up and running within minutes of the power failure, and many of them did not experience any impact at all during the failover.</p><p>The data plane, which handles the traffic that Cloudflare customers pass through our data centers in over 300 cities worldwide, was not impacted.</p><p>Our Analytics platform, which provides a view into customer traffic, was impacted and wasn’t fully restored until later that day. This was expected behavior as the Analytics platform is reliant on the PDX01 data center. Just like the Control Plane work, we began building new Analytics capacity immediately after the November 2, 2023, incident. However, the scale of the work means it will take a bit more time to complete. We have been working as fast as we can to remove this dependency, and we expect to complete this work in the near future.</p><p>Once we had validated the functionality of our Control Plane services, we were faced yet again with the cold start of a very large data center. This activity took roughly 72 hours in November 2023, but this time around we were able to complete this in roughly 10 hours. There is still work to be done to make that even faster, and we will continue to refine our procedures in case we have a similar incident in the future.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7cu18EGvfdwXuXIr81qHN8/eaa05db6a5944d0270ed685ce558b070/Incident-inspection.png" />
            
            </figure>
    <div>
      <h3>How did we get here?</h3>
      <a href="#how-did-we-get-here">
        
      </a>
    </div>
    <p>As mentioned above, the power outage event from last November led us to introduce Code Orange, a process where we shift most or all engineering resources to addressing the issue at hand when there’s a significant event or crisis. Over the past five months, we shifted all non-critical engineering functions to focusing on ensuring high reliability of our control plane.</p><p>Teams across our engineering departments rallied to ensure our systems would be more resilient in the face of a similar failure in the future. Though the March 26, 2024, incident was unexpected, it was something we’d been preparing for.</p><p>The most obvious difference is the speed at which the control plane and APIs regained service. Without human intervention, the ability to log in and make changes to Cloudflare configuration was possible seven minutes after PDX01 was lost. This is due to our efforts to move all of our configuration databases to a Highly Available (HA) topology, and pre-provision enough capacity that we could absorb the capacity loss. More than 100 databases across over 20 different database clusters simultaneously failed out of the affected facility and restored service automatically. This was actually the culmination of over a year’s worth of work, and we make sure we prove our ability to fail over properly with weekly tests.</p><p>Another significant improvement is the updates to our Logpush infrastructure. In November 2023, the loss of the PDX01 data center meant that we were unable to push logs to our customers. During Code Orange, we invested in making the Logpush infrastructure HA in Portland, and additionally created an active failover option in Amsterdam. Logpush took advantage of our massively expanded Kubernetes cluster that spans all of our Portland facilities and provides a seamless way for service owners to deploy HA compliant services that have resiliency baked in. In fact, during our February chaos exercise, we found a flaw in our Portland HA deployment, but customers were not impacted because the Amsterdam Logpush infrastructure took over successfully. During this event, we saw that the fixes we’d made since then worked, and we were able to push logs from the Portland region.</p><p>A number of other improvements in our Stream and Zero Trust products resulted in little to no impact to their operation. Our Stream products, which use a lot of compute resources to transcode videos, were able to seamlessly hand off to our Amsterdam facility to continue operations. Teams were given specific availability targets for the services and were provided several options to achieve those targets. Stream is a good example of a service that chose a different resiliency architecture but was able to seamlessly deliver its service during this outage. Zero Trust, which was also impacted in November 2023, has since moved the vast majority of its functionality to our hundreds of data centers, which kept working seamlessly throughout this event. Ultimately this is the strategy we are pushing all Cloudflare products to adopt as our data centers in <a href="https://www.cloudflare.com/network">over 300 cities worldwide</a> provide the highest level of availability possible.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4hnYtkVM6JHuvAOD3HGNmq/239ae0443184a22761245b4458e15ead/image1-12.png" />
            
            </figure>
    <div>
      <h3>What happened to the power in the data center?</h3>
      <a href="#what-happened-to-the-power-in-the-data-center">
        
      </a>
    </div>
    <p>On March 26, 2024, at 14:58 UTC, PDX01 experienced a total loss of power to Cloudflare’s physical infrastructure following a reportedly simultaneous failure of four Flexential-owned and operated switchboards serving all of Cloudflare’s cages. This meant both primary and redundant power paths were deactivated across the entire environment. During the Flexential investigation, engineers focused on a set of equipment known as Circuit Switch Boards, or CSBs. A CSB is similar to an electrical panel board, consisting of a main input circuit breaker and a series of smaller output breakers. Flexential engineers reported that infrastructure upstream of the CSBs (power feed, generator, UPS &amp; PDU/transformer) was not impacted and continued to operate normally. Similarly, infrastructure downstream from the CSBs such as Remote Power Panels and connected switchgear was not impacted – thus implying the outage was isolated to the CSBs themselves.</p><p>Initial assessment of the root cause of Flexential’s CSB failures points to incorrectly set breaker coordination settings within the four CSBs as one contributing factor. Trip settings that are too restrictive can result in overly sensitive overcurrent protection and the potential nuisance tripping of devices. In our case, Flexential’s breaker settings within the four CSBs were reportedly too low in relation to the downstream provisioned power capacities. When one or more of these breakers tripped, a cascading failure of the remaining active CSBs resulted, causing a total loss of power serving Cloudflare’s cage and others on the shared infrastructure. During the triage of the incident, we were told that the Flexential facilities team noticed the incorrect trip settings, reset the CSBs, and adjusted them to the expected values, enabling our team to power up our servers in a staged and controlled fashion. We do not know when these settings were established – typically, these would be set/adjusted as part of a data center commissioning process and/or breaker coordination study before customer critical loads are installed.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3lJDAVlXMNrU7Eyp7PP0lF/db9a86dfa40f4ca85965d8af8b36c634/Incident-inspection-3.png" />
            
            </figure>
    <div>
      <h3>What’s next?</h3>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>Our top priority is completing the resilience program for our Analytics platform. Analytics aren’t simply pretty charts in a dashboard. When you want to check the status of attacks, activities a firewall is blocking, or even the status of Cloudflare Tunnels - you need analytics. We have evidence that the resiliency pattern we are adopting works as expected, so this remains our primary focus, and we will progress as quickly as possible.</p><p>There were some services that still required manual intervention to properly recover, and we have collected data and action items for each of them to ensure that further manual action is not required. We will continue to use production cut tests to prove all of these changes and enhancements provide the resiliency that our customers expect.</p><p>We will continue to work with Flexential on follow-up activities to expand our understanding of their operational and review procedures to the greatest extent possible. While this incident was limited to a single facility, we will turn this exercise into a process that ensures we have a similar view into all of our critical data center facilities.</p><p>Once again, we are very sorry for the impact to our customers, particularly those that rely on the Analytics engine who were unable to access that product feature during the incident. Our work over the past four months has yielded the results that we expected, and we will stay absolutely focused on completing the remaining body of work.</p> ]]></content:encoded>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[Outage]]></category>
            <guid isPermaLink="false">3jSHB2RGdy2XNScvpyF1oX</guid>
            <dc:creator>Matthew Prince</dc:creator>
            <dc:creator>John Graham-Cumming</dc:creator>
            <dc:creator>Jeremy Hartman</dc:creator>
        </item>
        <item>
            <title><![CDATA[Post mortem on the Cloudflare Control Plane and Analytics Outage]]></title>
            <link>https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage/</link>
            <pubDate>Sat, 04 Nov 2023 06:18:55 GMT</pubDate>
            <description><![CDATA[ Beginning on Thursday, November 2, 2023, at 11:43 UTC Cloudflare's control plane and analytics services experienced an outage. Here are the details ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Beginning on Thursday, November 2, 2023, at 11:43 UTC Cloudflare's control plane and analytics services experienced an outage. The control plane of Cloudflare consists primarily of the customer-facing interface for all of our services including our website and APIs. Our analytics services include logging and analytics reporting.</p><p>The incident lasted from November 2 at 11:44 UTC until November 4 at 04:25 UTC. We were able to restore most of our control plane at our disaster recovery facility as of November 2 at 17:57 UTC. Many customers would not have experienced issues with most of our products after the disaster recovery facility came online. However, other services took longer to restore and customers that used them may have seen issues until we fully resolved the incident. Our raw log services were unavailable for most customers for the duration of the incident.</p><p>Services have now been restored for all customers. Throughout the incident, Cloudflare's network and security services continued to work as expected. While there were periods where customers were unable to make changes to those services, traffic through our network was not impacted.</p><p>This post outlines the events that caused this incident, the architecture we had in place to prevent issues like this, what failed, what worked and why, and the changes we're making based on what we've learned over the last 36 hours.</p><p>To start, this never should have happened. We believed that we had high availability systems in place that should have stopped an outage like this, even when one of our core data center providers failed catastrophically. And, while many systems did remain online as designed, some critical systems had non-obvious dependencies that made them unavailable. I am sorry and embarrassed for this incident and the pain that it caused our customers and our team.</p>
    <div>
      <h3>Intended Design</h3>
      <a href="#intended-design">
        
      </a>
    </div>
    <p>Cloudflare's control plane and analytics systems run primarily on servers in three data centers around Hillsboro, Oregon. The three data centers are independent of one another, each have multiple utility power feeds, and each have multiple redundant and independent network connections.</p><p>The facilities were intentionally chosen to be at a distance apart that would minimize the chances that a natural disaster would cause all three to be impacted, while still close enough that they could all run active-active redundant data clusters. This means that they are continuously syncing data between the three facilities. By design, if any of the facilities goes offline then the remaining ones are able to continue to operate.</p><p>This is a system design that we began implementing four years ago. While most of our critical control plane systems had been migrated to the high availability cluster, some services, especially for some newer products, had not yet been added to the high availability cluster.</p><p>In addition, our logging systems were intentionally not part of the high availability cluster. The logic of that decision was that logging was already a distributed problem where logs were queued at the edge of our network and then sent back to the core in Oregon (or another regional facility for customers using regional services for logging). If our logging facility was offline then analytics logs would queue at the edge of our network until it came back online. We determined that analytics being delayed was acceptable.</p>
    <div>
      <h3>Flexential Data Center Power Failure</h3>
      <a href="#flexential-data-center-power-failure">
        
      </a>
    </div>
    <p>The largest of the three facilities in Oregon is run by Flexential. We refer to this facility as “PDX-04”. Cloudflare leases space in PDX-04 where we house our largest analytics cluster as well as more than a third of the machines for our high availability cluster. It is also the default location for services that have not yet been onboarded onto our high availability cluster. We are a relatively large customer of the facility, consuming approximately 10 percent of its total capacity.</p><p>On November 2 at 08:50 UTC, Portland General Electric (PGE), the utility company that services PDX-04, had an unplanned maintenance event affecting one of their independent power feeds into the building. That event shut down one feed into PDX-04. The data center has multiple feeds with some level of independence that can power the facility. However, Flexential powered up their generators to effectively supplement the feed that was down.</p><p>Counter to best practices, Flexential did not inform Cloudflare that they had failed over to generator power. None of our <a href="https://www.cloudflare.com/learning/performance/what-is-observability/">observability tools</a> were able to detect that the source of power had changed. Had they informed us, we would have stood up a team to monitor the facility closely and move control plane services that were dependent on that facility out while it was degraded.</p><p>It is also unusual that Flexential ran both the one remaining utility feed and the generators at the same time. It is not unusual for utilities to ask data centers to drop off the grid when power demands are high and run exclusively on generators. Flexential operates 10 generators, inclusive of redundant units, capable of supporting the facility at full load. It would also have been possible for Flexential to run the facility only from the remaining utility feed. We haven't gotten a clear answer as to why they ran both utility and generator power at the same time.</p>
    <div>
      <h3>Informed Speculation On What Happened Next</h3>
      <a href="#informed-speculation-on-what-happened-next">
        
      </a>
    </div>
    <p>From this decision onward, we don't yet have clarity from Flexential on the root cause or some of the decisions they made or the events. We will update this post as we get more information from Flexential, as well as PGE, on what happened. Some of what follows is informed speculation based on the most likely series of events as well as what individual Flexential employees have shared with us unofficially.</p><p>One possible reason they may have left the utility line running is because Flexential was part of a program with PGE called DSG. DSG allows the local utility to run a data center's generators to help supply additional power to the grid. In exchange, the power company helps maintain the generators and supplies fuel. We have been unable to locate any record of Flexential informing us about the DSG program. We've asked if DSG was active at the time and have not received an answer. We do not know if it contributed to the decisions that Flexential made, but it could explain why the utility line continued to remain online after the generators were started.</p><p>At approximately 11:40 UTC, there was a ground fault on a PGE transformer at PDX-04. We believe, but have not been able to get confirmation from Flexential or PGE, that this was the transformer that stepped down power from the grid for the second feed that was still running as it entered the data center. It seems likely, though we have not been able to confirm with Flexential or PGE, that the ground fault was caused by the unplanned maintenance PGE was performing that impacted the first feed. Or it was a very unlucky coincidence.</p><p>Ground faults with high voltage (12,470 volt) power lines are very bad. Electrical systems are designed to quickly shut down to prevent damage when one occurs. Unfortunately, in this case, the protective measure also shut down all of PDX-04’s generators. This meant that the two sources of power generation for the facility — both the redundant utility lines as well as the 10 generators — were offline.</p><p>Fortunately, in addition to the generators, PDX-04 also contains a bank of UPS batteries. These batteries are supposedly sufficient to power the facility for approximately 10 minutes. That time is meant to be enough to bridge the gap between the power going out and the generators automatically starting up. If Flexential could get the generators or a utility feed restored within 10 minutes then there would be no interruption. In reality, the batteries started to fail after only 4 minutes based on what we observed from our own equipment failing. And it took Flexential far longer than 10 minutes to get the generators restored.</p>
    <div>
      <h3>Attempting to Restore Power</h3>
      <a href="#attempting-to-restore-power">
        
      </a>
    </div>
    <p>While we haven't gotten official confirmation, we have been told by employees that three things hampered getting the generators back online. First, they needed to be physically accessed and manually restarted because of the way the ground fault had tripped circuits. Second, Flexential's access control system was not powered by the battery backups, so it was offline. And third, the overnight staffing at the site did not include an experienced operations or electrical expert — the overnight shift consisted of security and an unaccompanied technician who had only been on the job for a week.</p><p>Between 11:44 and 12:01 UTC, with the generators not fully restarted, the UPS batteries ran out of power and all customers of the data center lost power. Throughout this, Flexential never informed Cloudflare that there was any issue at the facility. We were first notified of issues in the data center when the two routers that connect the facility to the rest of the world went offline at 11:44 UTC. When we weren't able to reach the routers directly or through out-of-band management, we attempted to contact Flexential and dispatched our local team to physically travel to the facility. The first message to us from Flexential that they were experiencing an issue was at 12:28 UTC.</p><blockquote><p><i>We are currently experiencing an issue with power at our [PDX-04] that began at approximately 0500AM PT [12:00 UTC]. Engineers are actively working to resolve the issue and restore service. We will communicate progress every 30 minutes or as more information becomes available as to the estimated time to restore. Thank you for your patience and understanding.</i></p></blockquote>
    <div>
      <h3>Designing for Data Center Level Failure</h3>
      <a href="#designing-for-data-center-level-failure">
        
      </a>
    </div>
    <p>While the PDX-04’s design was certified Tier III before construction and is expected to provide high availability SLAs, we planned for the possibility that it could go offline. Even well-run facilities can have bad days. And we planned for that. What we expected would happen in that case is that our analytics would be offline, logs would be queued at the edge and delayed, and certain lower priority services that were not integrated into our high availability cluster would go offline temporarily until they could be restored at another facility.</p><p>The other two data centers running in the area would take over responsibility for the high availability cluster and keep critical services online. Generally that worked as planned. Unfortunately, we discovered that a subset of services that were supposed to be on the high availability cluster had dependencies on services exclusively running in PDX-04.</p><p>In particular, two critical services that process logs and power our analytics — Kafka and ClickHouse — were only available in PDX-04 but had services that depended on them that were running in the high availability cluster. Those dependencies shouldn’t have been so tight, should have failed more gracefully, and we should have caught them.</p><p>We had performed testing of our high availability cluster by taking each (and both) of the other two data center facilities entirely offline. And we had also tested taking the high availability portion of PDX-04 offline. However, we had never tested fully taking the entire PDX-04 facility offline. As a result, we had missed the importance of some of these dependencies on our data plane.</p><p>We were also far too lax about requiring new products and their associated databases to integrate with the high availability cluster. Cloudflare allows multiple teams to innovate quickly. As such, products often take different paths toward their initial alpha. While, over time, our practice is to migrate the backend for these services to our best practices, we did not formally require that before products were declared generally available (GA). That was a mistake as it meant that the redundancy protections we had in place worked inconsistently depending on the product.</p><p>Moreover, far too many of our services depend on the availability of our core facilities. While this is the way a lot of software services are created, it does not play to Cloudflare’s strength. We are good at distributed systems. Throughout this incident, our global network continued to perform as expected. While some of our products and features are configurable and serviceable through the edge of our network without needing the core, far too many today fail if the core is unavailable. We need to use the distributed systems products that we make available to all our customers for all our services, so they continue to function mostly as normal even if our core facilities are disrupted.</p>
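    <p>We are not sharing our internal tooling here, but the kind of check that would have caught the Kafka and ClickHouse dependencies described above is straightforward to sketch. The following is a minimal, illustrative Python example (the service names and placements are hypothetical): given a catalog of services, their placements, and their declared dependencies, flag any service in the high availability cluster whose transitive dependencies include something pinned to a single facility.</p>
            <pre><code># Minimal sketch: detect HA-cluster services that transitively depend on a
# service pinned to a single facility. Names and placements are hypothetical.

SERVICES = {
    # name: (placement, direct dependencies)
    "api-gateway":      ("ha-cluster", ["config-store", "analytics-ingest"]),
    "config-store":     ("ha-cluster", []),
    "analytics-ingest": ("pdx-04-only", ["kafka"]),
    "kafka":            ("pdx-04-only", []),
}

def transitive_deps(name, seen=None):
    """Return the set of all services reachable from `name`."""
    seen = set() if seen is None else seen
    for dep in SERVICES[name][1]:
        if dep not in seen:
            seen.add(dep)
            transitive_deps(dep, seen)
    return seen

def audit():
    """Print every HA-cluster service that depends on a single-facility service."""
    for name, (placement, _) in SERVICES.items():
        if placement != "ha-cluster":
            continue
        risky = [d for d in transitive_deps(name) if SERVICES[d][0] != "ha-cluster"]
        if risky:
            print(f"{name} is in the HA cluster but depends on: {', '.join(sorted(risky))}")

audit()
# prints: api-gateway is in the HA cluster but depends on: analytics-ingest, kafka</code></pre>
            <p>The hard part, of course, is not the graph walk but keeping the catalog of dependencies accurate, which is exactly what the testing and GA-readiness changes described later in this post are meant to enforce.</p>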
    <div>
      <h3>Disaster Recovery</h3>
      <a href="#disaster-recovery">
        
      </a>
    </div>
    <p>At 12:48 UTC, Flexential was able to get the generators restarted. Power returned to portions of the facility. In order to not overwhelm the system, when power is restored to a data center it is typically done gradually by powering back on one circuit at a time. Like the circuit breakers in a residential home, each customer is serviced by redundant breakers. When Flexential attempted to power back up Cloudflare's circuits, the circuit breakers were discovered to be faulty. We don't know if the breakers failed due to the ground fault or some other surge as a result of the incident, or if they'd been bad before, and it was only discovered after they had been powered off.</p><p>Flexential began the process of replacing the failed breakers. That required them to source new breakers because more were bad than they had on hand in the facility. Because more services were offline than we expected, and because Flexential could not give us a time for restoration of our services, we made the call at 13:40 UTC to fail over to Cloudflare's disaster recovery sites located in Europe. Thankfully, we only needed to fail over a small percentage of Cloudflare’s overall control plane. Most of our services continued to run across our high availability systems across the two active core data centers.</p><p>We turned up the first services on the disaster recovery site at 13:43 UTC. Cloudflare's disaster recovery sites provide critical control plane services in the event of a disaster. While the disaster recovery site does not support some of our log processing services, it is designed to support the other portions of our control plane.</p><p>When services were turned up there, we experienced a thundering herd problem where the API calls that had been failing overwhelmed our services. We implemented rate limits to get the request volume under control. During this period, customers of most products would have seen intermittent errors when making modifications through our dashboard or API. By 17:57 UTC, the services that had been successfully moved to the disaster recovery site were stable and most customers were no longer directly impacted. However, some systems still required manual configuration (e.g., Magic WAN) and some other services, largely related to log processing and some bespoke APIs, remained unavailable until we were able to restore PDX-04.</p>
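    <p>We are not detailing the specific rate limits we applied, but the shape of the mitigation for a thundering herd is well known: admit requests at a sustainable rate and reject the rest early so they retry later, rather than letting a backlog of failed API calls land all at once. A minimal token bucket sketch in Python (the rate and burst values are illustrative, not our production settings):</p>
            <pre><code>import time

class TokenBucket:
    """Admit at most `rate` requests per second, with bursts up to `burst`."""

    def __init__(self, rate, burst):
        self.rate = float(rate)        # tokens added back per second
        self.capacity = float(burst)   # maximum number of stored tokens
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill for the time elapsed since the last request, then spend one token.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

def process(request):
    return 200                  # stand-in for the real API handler

limiter = TokenBucket(rate=500, burst=1000)   # illustrative limits

def handle(request):
    if not limiter.allow():
        return 429              # tell the client to back off and retry later
    return process(request)</code></pre>
            <p>A limiter like this trades a portion of requests failing fast for the services behind it getting room to recover, which is broadly the behavior described above.</p>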
    <div>
      <h3>Some Products and Features Delayed Restart</h3>
      <a href="#some-products-and-features-delayed-restart">
        
      </a>
    </div>
    <p>A handful of products did not properly get stood up on our disaster recovery sites. These tended to be newer products where we had not fully implemented and tested a disaster recovery procedure. These included our Stream service for uploading new videos and some other services. Our team worked two simultaneous tracks to get these services restored: 1) reimplementing them on our disaster recovery sites; and 2) migrating them to our high-availability cluster.</p><p>Flexential replaced our failed circuit breakers, restored both utility feeds, and confirmed clean power at 22:48 UTC. Our team was all-hands-on-deck and had worked all day on the emergency, so I made the call that most of us should get some rest and start the move back to PDX-04 in the morning. That decision delayed our full recovery, but I believe made it less likely that we’d compound this situation with additional mistakes.</p><p>Beginning first thing on November 3, our team began restoring service in PDX-04. That began with physically booting our network gear then powering up thousands of servers and restoring their services. The state of our services in the data center was unknown as we believed multiple power cycles were likely to have occurred during the incident. Our only safe process to recover was to follow a complete bootstrap of the entire facility.</p><p>This involved a manual process of bringing our configuration management servers online to begin the restoration of the facility. Rebuilding these took 3 hours. From there, our team was able to bootstrap the rebuild of the rest of the servers that power our services. Each server took between 10 minutes and 2 hours to rebuild. While we were able to run this in parallel across multiple servers, there were inherent dependencies between services that required some to be brought back online in sequence.</p><p>Services are now fully restored as of November 4, 2023, at 04:25 UTC. For most customers, because we also store analytics in our European core data centers, you should see no data loss in most analytics across our dashboard and APIs. However, some datasets which are not replicated in the EU will have persistent gaps. For customers that use our log push feature, your logs will not have been processed for the majority of the event, so anything you did not receive will not be recovered.</p>
    <div>
      <h3>Lessons and Remediation</h3>
      <a href="#lessons-and-remediation">
        
      </a>
    </div>
    <p>We have a number of questions that we need answered from Flexential. But we also must expect that entire data centers may fail. Google has a process where when there’s a significant event or crisis they can call a Code Yellow or Code Red. In these cases, most or all engineering resources are shifted to addressing the issue at hand.</p><p>We have not had such a process in the past, but it’s clear today we need to implement a version of it ourselves: Code Orange. We are shifting all non-critical engineering functions to focusing on ensuring high reliability of our control plane. As part of that, we expect the following changes:</p><ul><li><p>Remove dependencies on our core data centers for control plane configuration of all services and move them wherever possible to be powered first by our distributed network</p></li><li><p>Ensure that the control plane running on the network continues to function even if all our core data centers are offline</p></li><li><p>Require that all products and features that are designated Generally Available must rely on the high availability cluster (if they rely on any of our core data centers), without having any software dependencies on specific facilities</p></li><li><p>Require all products and features that are designated Generally Available have a reliable disaster recovery plan that is tested</p></li><li><p>Test the blast radius of system failures and minimize the number of services that are impacted by a failure</p></li><li><p>Implement more rigorous chaos testing of all data center functions including the full removal of each of our core data center facilities</p></li><li><p>Thorough auditing of all core data centers and a plan to reaudit to ensure they comply with our standards</p></li><li><p>Logging and analytics disaster recovery plan that ensures no logs are dropped even in the case of a failure of all our core facilities</p></li></ul><p>As I said earlier, I am sorry and embarrassed for this incident and the pain that it caused our customers and our team. We have the right systems and procedures in place to be able to withstand even the cascading string of failures we saw at our data center provider, but we need to be more rigorous about enforcing that they are followed and tested for unknown dependencies. This will have my full attention and the attention of a large portion of our team through the balance of the year. And the pain from the last couple of days will make us better.</p> ]]></content:encoded>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <guid isPermaLink="false">73EdtHPXJkJx7fwtNqtJ2a</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare incident on October 30, 2023]]></title>
            <link>https://blog.cloudflare.com/cloudflare-incident-on-october-30-2023/</link>
            <pubDate>Wed, 01 Nov 2023 16:39:43 GMT</pubDate>
            <description><![CDATA[ Multiple Cloudflare services were unavailable for 37 minutes on October 30, 2023, due to the misconfiguration of a deployment tool used by Workers KV. ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1fuf2Zu6hQEVim57ZP2cze/0c4311c85148c448749069bf6cb900a4/Vulnerabilitiy-1.png" />
            
            </figure><p>Multiple Cloudflare services were unavailable for 37 minutes on October 30, 2023. This was due to the misconfiguration of a deployment tool used by Workers KV. This was a frustrating incident, made more difficult by Cloudflare’s reliance on our own suite of products. We are deeply sorry for the impact it had on customers. What follows is a discussion of what went wrong, how the incident was resolved, and the work we are undertaking to ensure it does not happen again.</p><p>Workers KV is our globally distributed key-value store. It is used by both customers and Cloudflare teams alike to manage configuration data, routing lookups, static asset bundles, authentication tokens, and other data that needs low-latency access.</p><p>During this incident, KV returned what it believed was a valid HTTP 401 (Unauthorized) status code instead of the requested key-value pair(s) due to a bug in a new deployment tool used by KV.</p><p>These errors manifested differently for each product depending on how KV is used by each service, with their impact detailed below.</p>
    <div>
      <h3>What was impacted</h3>
      <a href="#what-was-impacted">
        
      </a>
    </div>
    <p>A number of Cloudflare services depend on Workers KV for distributing configuration, routing information, static asset serving, and authentication state globally. These services instead received an HTTP 401 (Unauthorized) error when performing any get, put, delete, or list operation against a KV namespace.</p><p>Customers using the following Cloudflare products would have observed heightened error rates and/or would have been unable to access some or all features for the duration of the incident:</p>
<table>
<thead>
  <tr>
    <th><span>Product</span></th>
    <th><span>Impact</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>Workers KV</span></td>
    <td><span>Customers with applications leveraging KV saw those applications fail during the duration of this incident, including both the KV API within Workers, and the REST API.</span><br /><span>Workers applications not using KV were not impacted.</span></td>
  </tr>
  <tr>
    <td><span>Pages</span></td>
    <td><span>Applications hosted on Pages were unreachable for the duration of the incident and returned HTTP 500 errors to users. New Pages deployments also returned HTTP 500 errors to users for the duration.</span></td>
  </tr>
  <tr>
    <td><span>Access</span></td>
    <td><span>Users who were unauthenticated could not log in; any origin attempting to validate the JWT using the /certs endpoint would fail; any application with a device posture policy failed for all users.</span><br /><span>Existing logged-in sessions that did not use the /certs endpoint or posture checks were unaffected; overall, however, a large percentage of existing sessions did rely on these and were still affected.</span></td>
  </tr>
  <tr>
    <td><span>WARP / Zero Trust</span></td>
    <td><span>Users were unable to register new devices or connect to resources subject to policies that enforce Device Posture checks or WARP Session timeouts.</span><br /><span>Devices already enrolled, resources not relying on device posture, or that had re-authorized outside of this window were unaffected.</span></td>
  </tr>
  <tr>
    <td><span>Images</span></td>
    <td><span>The Images API returned errors during the incident. Existing image delivery was not impacted.</span></td>
  </tr>
  <tr>
    <td><span>Cache Purge (single file)</span></td>
    <td><span>Single file purge was partially unavailable for the duration of the incident as some data centers could not access configuration data in KV. Data centers that had existing configuration data locally cached were unaffected.</span><br /><span>Other cache purge mechanisms, including purge by tag, were unaffected.</span></td>
  </tr>
  <tr>
    <td><span>Workers</span></td>
    <td><span>Uploading or editing Workers through the dashboard, wrangler or API returned errors during the incident. Deployed Workers were not impacted, unless they used KV. </span></td>
  </tr>
  <tr>
    <td><span>AI Gateway</span></td>
    <td><span>AI Gateway was not able to proxy requests for the duration of the incident.</span></td>
  </tr>
  <tr>
    <td><span>Waiting Room</span></td>
    <td><span>Waiting Room configuration is stored at the edge in Workers KV. Waiting Room configurations, and configuration changes, were unavailable and the service failed open.</span><br /><span>When access to KV was restored, some Waiting Room users would have experienced queuing as the service came back up. </span></td>
  </tr>
  <tr>
    <td><span>Turnstile and Challenge Pages</span></td>
    <td><span>Turnstile's JavaScript assets are stored in KV, and the entry point for Turnstile (api.js) was not able to be served. Clients accessing pages using Turnstile could not initialize the Turnstile widget and would have failed closed during the incident window.</span><br /><span>Challenge Pages (which products like Custom, Managed and Rate Limiting rules use) also use Turnstile infrastructure for presenting challenge pages to users under specific conditions, and would have blocked users who were presented with a challenge during that period.</span></td>
  </tr>
  <tr>
    <td><span>Cloudflare Dashboard</span></td>
    <td><span>Parts of the Cloudflare dashboard that rely on Turnstile and/or our internal feature flag tooling (which uses KV for configuration) returned errors to users for the duration. </span></td>
  </tr>
</tbody>
</table>
    <div>
      <h3>Timeline</h3>
      <a href="#timeline">
        
      </a>
    </div>
    <p><i>All timestamps referenced are in Coordinated Universal Time (UTC).</i></p>
<table>
<thead>
  <tr>
    <th><span>Time</span></th>
    <th><span>Description</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>2023-10-30 18:58 UTC</span></td>
    <td><span>The Workers KV team began a progressive deployment of a new KV build to production.</span></td>
  </tr>
  <tr>
    <td><span>2023-10-30 19:29 UTC</span></td>
    <td><span>The internal progressive deployment API returns a staging build GUID in response to a call to list production builds. </span></td>
  </tr>
  <tr>
    <td><span>2023-10-30 19:40 UTC</span></td>
    <td><span>The progressive deployment API was used to continue rolling out the release. This routed a percentage of traffic to the wrong destination, triggering alerting and leading to the decision to roll back.</span></td>
  </tr>
  <tr>
    <td><span>2023-10-30 19:54 UTC</span></td>
    <td><span>Rollback via progressive deployment API attempted, traffic starts to fail at scale. </span><span>— IMPACT START —</span></td>
  </tr>
  <tr>
    <td><span>2023-10-30 20:15 UTC</span></td>
    <td><span>Cloudflare engineers manually edit (via break glass mechanisms) deployment routes to revert to last known good build for the majority of traffic.</span></td>
  </tr>
  <tr>
    <td><span>2023-10-30 20:29 UTC</span></td>
    <td><span>Workers KV error rates return to normal pre-incident levels, and impacted services recover within the following minute.</span></td>
  </tr>
  <tr>
    <td><span>2023-10-30 20:31 UTC</span></td>
    <td><span>Impact resolved </span><span>— IMPACT END — </span></td>
  </tr>
</tbody>
</table><p>As shown in the above timeline, there was a delay between the time we realized we were having an issue at 19:54 UTC and the time we were actually able to perform the rollback at 20:15 UTC.</p><p>This was caused by the fact that multiple tools within Cloudflare rely on Workers KV including Cloudflare Access. Access leverages Workers KV as part of its request verification process. Due to this, we were unable to leverage our internal tooling and had to use break-glass mechanisms to bypass the normal tooling. As described below, we had not spent sufficient time testing the rollback mechanisms. We plan to harden this moving forward.</p>
    <div>
      <h3>Resolution</h3>
      <a href="#resolution">
        
      </a>
    </div>
    <p>Cloudflare engineers manually switched (via break glass mechanism) the production route to the previous working version of Workers KV, which immediately eliminated the failing request path and subsequently resolved the issue with the Workers KV deployment.</p>
    <div>
      <h3>Analysis</h3>
      <a href="#analysis">
        
      </a>
    </div>
    <p>Workers KV is a low-latency key-value store that allows users to store persistent data on Cloudflare's network, as close to the users as possible. This distributed key-value store is used in many applications, some of which are first-party Cloudflare products like Pages, Access, and Zero Trust.</p><p>The Workers KV team was progressively deploying a new release using a specialized deployment tool. The deployment mechanism contains a staging and a production environment, and utilizes a process where the production environment is upgraded to the new version at progressive percentages until all production environments are upgraded to the most recent production build. The deployment tool had a latent bug with how it returns releases and their respective versions. Instead of returning releases from a single environment, the tool returned a broader list of releases than intended, resulting in production and staging releases being returned together.</p><p>In this incident, the service was deployed and tested in staging. But because of the deployment automation bug, when promoting to production, a script that had been deployed to the staging account was incorrectly referenced instead of the pre-production version on the production account. As a result, the deployment mechanism pointed the production environment to a version that was not running anywhere in the production environment, effectively black-holing traffic.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1YKI1LYglUMvikcDlkZF41/ff8ba4a17059a4139884ee71524a8209/image1.png" />
            
            </figure><p>When this happened, Workers KV became unreachable in production, as calls to the product were directed to a version that was not authorized for production access, returning an HTTP 401 error code. This caused dependent products which stored key-value pairs in KV to fail, regardless of whether the key-value pair was cached locally or not.</p><p>Although automated alerting detected the issue immediately, there was a delay between the time we realized we were having an issue and the time we were actually able to perform the rollback. This was caused by the fact that multiple tools within Cloudflare rely on Workers KV, including Cloudflare Access. Access uses Workers KV as part of the verification process for user JWTs (JSON Web Tokens).</p><p>These tools include the dashboard, which was used to revert the change, and the authentication mechanism to access our continuous integration (CI) system. As Workers KV was down, so too were these services. Automatic rollbacks via our CI system had been successfully tested previously, but the authentication issues caused by the incident (Access relies on KV) made it impossible to access the secrets needed to roll back the deploy.</p><p>The fix ultimately was a manual change of the production build path to a previous and known good state. This path was known to have been deployed and was the previous production build before the attempted deployment.</p>
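    <p>We are not reproducing the deployment tool here, but the class of bug, and the pre-check that guards against it, are easy to illustrate. In this hypothetical Python sketch (the GUIDs, fields, and function names are made up), a release listing that does not filter by environment lets a staging build GUID be selected for a production promotion, and a validation step of the kind described in the next section refuses to route production traffic to it:</p>
            <pre><code># Hypothetical release records; GUIDs and fields are illustrative only.
RELEASES = [
    {"guid": "prod-1f2e3d", "environment": "production", "version": "2023.10.29"},
    {"guid": "stag-9a8b7c", "environment": "staging",    "version": "2023.10.30"},
]

def list_releases_buggy():
    # The latent bug class: returns releases from every environment,
    # so callers listing "production builds" also see staging GUIDs.
    return RELEASES

def list_releases_fixed(environment):
    return [r for r in RELEASES if r["environment"] == environment]

def promote_to_production(guid):
    """Pre-check before routing production traffic to a build."""
    release = next((r for r in RELEASES if r["guid"] == guid), None)
    if release is None or release["environment"] != "production":
        raise ValueError(f"refusing to promote {guid}: not a production build")
    print(f"routing production traffic to {guid}")

# Picking the "latest" build from the buggy listing selects the staging GUID...
latest = max(list_releases_buggy(), key=lambda r: r["version"])
# ...and the pre-check catches the mismatch before traffic is black-holed.
try:
    promote_to_production(latest["guid"])
except ValueError as err:
    print(err)   # refusing to promote stag-9a8b7c: not a production build</code></pre>
            <p>The actual changes we are making are listed under “Next steps” below; the point of the sketch is only that environment mismatches are cheap to detect before traffic moves.</p>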
    <div>
      <h3>Next steps</h3>
      <a href="#next-steps">
        
      </a>
    </div>
    <p>As more teams at Cloudflare have built on Workers, we have "organically" ended up in a place where Workers KV now underpins a tremendous amount of our products and services. This incident has continued to reinforce the need for us to revisit how we can reduce the blast radius of critical dependencies, which includes improving the sophistication of our deployment tooling, its ease-of-use for our internal teams, and product-level controls for these dependencies. We’re prioritizing these efforts to ensure that there is not a repeat of this incident.</p><p>This also reinforces the need for Cloudflare to improve the tooling, and the safety of said tooling, around progressive deployments of Workers applications internally and for customers.</p><p>This includes (but is not limited) to the below list of key follow-up actions (in no specific order) this quarter:</p><ol><li><p>Onboard KV deployments to standardized Workers deployment models which use automated systems for impact detection and recovery.</p></li><li><p>Ensure that the rollback process has access to a known good deployment identifier and that it works when Cloudflare Access is down.</p></li><li><p>Add pre-checks to deployments which will validate input parameters to ensure version mismatches don't propagate to production environments.</p></li><li><p>Harden the progressive deployment tooling to operate in a way that is designed for multi-tenancy. The current design assumes a single-tenant model.</p></li><li><p>Add additional validation to progressive deployment scripts to verify that the deployment matches the app environment (production, staging, etc.).</p></li></ol><p>Again, we’re extremely sorry this incident occurred, and take the impact of this incident on our customers extremely seriously.</p> ]]></content:encoded>
            <category><![CDATA[Post Mortem]]></category>
            <guid isPermaLink="false">2RLr0QNONtOjY9xl3wKG1G</guid>
            <dc:creator>Matt Silverlock</dc:creator>
            <dc:creator>Kris Evans</dc:creator>
        </item>
        <item>
            <title><![CDATA[How Cloudflare mitigated yet another Okta compromise]]></title>
            <link>https://blog.cloudflare.com/how-cloudflare-mitigated-yet-another-okta-compromise/</link>
            <pubDate>Fri, 20 Oct 2023 21:39:13 GMT</pubDate>
            <description><![CDATA[ On Wednesday, October 18, 2023, we discovered attacks on our system that we were able to trace back to Okta. We have verified that no Cloudflare customer information or systems were impacted by this event because of our rapid response.  ]]></description>
            <content:encoded><![CDATA[ <p></p><p>On Wednesday, October 18, 2023, we discovered attacks on our system that we were able to trace back to Okta – threat actors were able to leverage an authentication token compromised at Okta to pivot into Cloudflare’s Okta instance. While this was a troubling security incident, our Security Incident Response Team’s (SIRT) real-time detection and prompt response enabled containment and minimized the impact to Cloudflare systems and data. We have verified that <b>no Cloudflare customer information or systems were impacted by this event</b> because of our rapid response. Okta has now released a <a href="https://sec.okta.com/harfiles">public statement</a> about this incident.</p><p>This is the second time Cloudflare has been impacted by a breach of Okta’s systems. In <a href="/cloudflare-investigation-of-the-january-2022-okta-compromise/">March 2022</a>, we blogged about our investigation on how a breach of Okta affected Cloudflare. In that incident, we concluded that there was no access from the threat actor to any of our systems or data – Cloudflare’s use of hard keys for multi-factor authentication stopped this attack.  </p><p>The key to mitigating this week’s incident was our team’s early detection and immediate response. In fact, we contacted Okta about the breach of their systems before they had notified us. The attacker used an open session from Okta, with Administrative privileges, and accessed our Okta instance. We were able to use our Cloudflare Zero Trust Access, Gateway, and Data Loss Prevention and our Cloudforce One threat research to validate the scope of the incident and contain it before the attacker could gain access to customer data, customer systems, or our production network. With this confidence, we were able to quickly mitigate the incident before the threat-actors were able to establish persistence.</p><p>According to Okta’s statement, the threat-actor accessed Okta’s customer support system and viewed files uploaded by certain Okta customers as part of recent support cases. It appears that in our case, the threat-actor was able to hijack a session token from a support ticket which was created by a Cloudflare employee. Using the token extracted from Okta, the threat-actor accessed Cloudflare systems on October 18. In this sophisticated attack, we observed that threat-actors compromised two separate Cloudflare employee accounts within the Okta platform. We detected this activity internally more than 24 hours before we were notified of the breach by Okta. Upon detection, our SIRT was able to engage quickly to identify the complete scope of compromise and contain the security incident. Cloudflare’s <a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/">Zero Trust architecture</a> protects our production environment, which helped prevent any impact to our customers.</p>
    <div>
      <h2>Recommendations for Okta</h2>
      <a href="#recommendations-for-okta">
        
      </a>
    </div>
    <p>We urge Okta to consider implementing the following best practices, including:</p><ul><li><p>Take any report of compromise seriously and act immediately to limit damage; in this case Okta was first notified on October 2, 2023 by <a href="https://www.beyondtrust.com/blog/entry/okta-support-unit-breach">BeyondTrust</a> but the attacker still had access to their support systems at least until October 18, 2023.</p></li><li><p>Provide timely, responsible disclosures to your customers when you identify that a breach of your systems has affected them.</p></li><li><p>Require hardware keys to protect all systems, including third-party support providers.</p></li></ul><p>For a critical security service provider like Okta, we believe following these best practices is table stakes.</p>
    <div>
      <h2>Recommendations for Okta’s Customers</h2>
      <a href="#recommendations-for-oktas-customers">
        
      </a>
    </div>
    <p>If you are an Okta customer, we recommend that you reach out to them for further information regarding potential impact to your organization. We also advise the following actions:</p><ul><li><p>Enable Hardware MFA for all user accounts. Passwords alone do not offer the necessary level of protection against attacks. We strongly recommend the usage of hardware keys, as other methods of MFA can be vulnerable to phishing attacks.</p></li><li><p>Investigate and respond to:</p><ul><li><p>All unexpected password and MFA changes for your Okta instances.</p></li><li><p>Suspicious support-initiated events.</p></li><li><p>Ensure all password resets are valid and force a password reset for any under suspicion.</p></li><li><p>Any suspicious MFA-related events, ensuring only valid MFA keys are present in the user's account configuration.</p></li></ul></li><li><p>Monitor for:</p><ul><li><p>New Okta users created.</p></li><li><p>Reactivation of Okta users.</p></li><li><p>All sessions have proper authentication associated with it.</p></li><li><p>All Okta account and permission changes.</p></li><li><p>MFA policy overrides, MFA changes, and MFA removal.</p></li><li><p>Delegation of sensitive applications.</p></li><li><p>Supply chain providers accessing your tenants.</p></li></ul></li><li><p>Review session expiration policies to limit session hijack attacks.</p></li><li><p>Utilize tools to validate devices connected to your critical systems, such as Cloudflare Access Device Posture Check.</p></li><li><p>Practice defense in depth for your detection and monitoring strategies.</p></li></ul><p>Cloudflare’s Security and IT teams continue to remain vigilant after this compromise. If further information is disclosed by Okta or discovered through additional log analysis, we will publish an update to this post.</p><p><i>Cloudflare's Security Incident Response Team </i><a href="http://cloudflare.com/careers"><i>is hiring</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[Okta]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[1.1.1.1]]></category>
            <guid isPermaLink="false">3tmgWjRNgroPDMiiOZFXsq</guid>
            <dc:creator>Sourov Zaman</dc:creator>
            <dc:creator>Lucas Ferreira</dc:creator>
            <dc:creator>Kimberly Hall</dc:creator>
            <dc:creator>Grant Bourzikas</dc:creator>
        </item>
        <item>
            <title><![CDATA[1.1.1.1 lookup failures on  October 4, 2023]]></title>
            <link>https://blog.cloudflare.com/1-1-1-1-lookup-failures-on-october-4th-2023/</link>
            <pubDate>Wed, 04 Oct 2023 19:40:34 GMT</pubDate>
            <description><![CDATA[ On 4 October 2023, Cloudflare experienced DNS resolution problems. Some users may have received SERVFAIL DNS responses to valid queries. In this blog, we’re going to talk about what the failure was, why it occurred, and what we’re doing to make sure this doesn’t happen again ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5S774vlQ0Giaa7OrRw5WpA/429311dc4a79f8e088291906058eb206/DNS-Incident-1.png" />
            
            </figure><p>On 4 October 2023, Cloudflare experienced DNS resolution problems starting at 07:00 UTC and ending at 11:00 UTC. Some users of 1.1.1.1 or products like WARP, Zero Trust, or third party DNS resolvers which use 1.1.1.1 may have received SERVFAIL DNS responses to valid queries. We’re very sorry for this outage. This outage was an internal software error and not the result of an attack. In this blog, we’re going to talk about what the failure was, why it occurred, and what we’re doing to make sure this doesn’t happen again.</p>
    <div>
      <h2>Background</h2>
      <a href="#background">
        
      </a>
    </div>
    <p>In the <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">Domain Name System (DNS)</a>, every domain name exists within a DNS zone. The zone is a collection of <a href="https://www.cloudflare.com/learning/dns/glossary/what-is-a-domain-name/">domain names</a> and host names that are controlled together. For example, Cloudflare is responsible for the domain name cloudflare.com, which we say is in the “cloudflare.com” zone. The .com top-level domain (TLD) is owned by a third party and is in the “com” zone. It gives directions on how to reach cloudflare.com. Above all of the TLDs is <a href="https://en.wikipedia.org/wiki/DNS_root_zone">the root zone</a>, which gives <a href="https://www.iana.org/domains/root/db">directions on how to reach TLDs</a>. This means that the root zone is important in being able to resolve all other domain names. Like other important parts of the DNS, <a href="https://www.iana.org/dnssec/procedures">the root zone is signed with DNSSEC</a>, which means the root zone itself contains cryptographic signatures.</p><p>The root zone is published on <a href="https://root-servers.org/">the root servers</a>, but it is also common for DNS operators to <a href="https://www.rfc-editor.org/rfc/rfc8806">retrieve and retain a copy of the root zone automatically</a> so that in the event that the root servers cannot be reached, the information in the root zone is still available. Cloudflare’s recursive DNS infrastructure takes this approach as it also makes the resolution process faster. New versions of the root zone are normally published twice a day. 1.1.1.1 has a <a href="/big-pineapple-intro/">WebAssembly app</a> called static_zone running on top of the main DNS logic that serves those new versions when they are available.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5tgd6rlFY7udR3qnaykmLu/b3256f48a8d2aa87e51de7da96bdfa4b/image2-1.png" />
            
            </figure>
    <div>
      <h2>What happened</h2>
      <a href="#what-happened">
        
      </a>
    </div>
    <p>On 21 September, as part of <a href="https://blog.verisign.com/security/root-zone-zonemd/">a known and planned change in root zone management</a>, a new resource record type was included in the root zones for the first time. The new resource record is named <a href="https://www.rfc-editor.org/rfc/rfc8976.html">ZONEMD</a>, and is in effect a checksum for the contents of the root zone.</p><p>The root zone is retrieved by software running in Cloudflare’s core network. It is subsequently redistributed to Cloudflare’s data centers around the world. After the change, the root zone containing the ZONEMD record continued to be retrieved and distributed as normal. However, the 1.1.1.1 resolver systems that make use of that data had problems parsing the ZONEMD record. Because zones must be loaded and served in their entirety, the system’s failure to parse ZONEMD meant the new versions of the root zone were not used in Cloudflare’s resolver systems. Some of the servers hosting Cloudflare's resolver infrastructure failed over to querying the DNS root servers directly on a request-by-request basis when they did not receive the new root zone. However, others continued to rely on the known working version of the root zone still available in their memory cache, which was the version pulled on 21 September before the change.</p><p>On 4 October 2023 at 07:00 UTC, the DNSSEC signatures in the version of the root zone from 21 September expired. Because there was no newer version that the Cloudflare resolver systems were able to use, some of Cloudflare’s resolver systems stopped being able to validate DNSSEC signatures and as a result started sending error responses (SERVFAIL). The rate at which Cloudflare resolvers generated SERVFAIL responses grew by 12%. The diagrams below illustrate the progression of the failure and how it became visible to users.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1XAzPsw7SXN9MWtsdfATWP/6df4fb0f70eddd8d5a7b92ae242cb0f5/image3-1.png" />
            
            </figure>
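    <p>For readers less familiar with DNSSEC, the expiry that triggered the SERVFAILs is encoded directly in the zone’s RRSIG records: every signature carries an inception and an expiration timestamp, and a validating resolver must reject the data once the expiration time has passed. A small illustrative check in Python (the timestamps below are made up for the example; as described above, the signatures in our stale copy of the root zone expired at 07:00 UTC on 4 October):</p>
            <pre><code>from datetime import datetime, timezone

def parse_rrsig_time(ts):
    """Parse an RRSIG timestamp in YYYYMMDDHHMMSS presentation format (UTC)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

def signature_is_valid(inception, expiration, now):
    """Data is only usable while `now` falls inside the signature validity window."""
    return now >= parse_rrsig_time(inception) and parse_rrsig_time(expiration) >= now

# Illustrative values only: a signature that expired at 07:00 UTC on 4 October 2023,
# checked a few minutes later.
now = parse_rrsig_time("20231004070500")
print(signature_is_valid("20230920000000", "20231004070000", now))   # False</code></pre>
            <p>Once every copy of the root zone a resolver holds fails this check, answers that depend on it can no longer be validated, and the resolver returns SERVFAIL, which is the failure mode charted in the next section.</p>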
    <div>
      <h2>Incident timeline and impact</h2>
      <a href="#incident-timeline-and-impact">
        
      </a>
    </div>
    <ul><li><p><b>21 September 6:30 UTC</b>: Last successful pull of the root zone.</p></li><li><p><b>4 October 7:00 UTC</b>: DNSSEC signatures in the root zone obtained on 21 September expired, causing an increase in SERVFAIL responses to client queries.</p></li><li><p><b>7:57</b>: First external reports of unexpected SERVFAILs started coming in.</p></li><li><p><b>8:03</b>: Internal Cloudflare incident declared.</p></li><li><p><b>8:50</b>: Initial attempt made at stopping 1.1.1.1 from serving responses using the stale root zone file with an override rule.</p></li><li><p><b>10:30</b>: Stopped 1.1.1.1 from preloading the root zone file entirely.</p></li><li><p><b>10:32</b>: Responses returned to normal.</p></li><li><p><b>11:02</b>: Incident closed.</p></li></ul><p>The chart below shows the timeline of impact along with the percentage of DNS queries that returned a SERVFAIL error:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Wfn8yZCTZqq1AcB8EjQ4L/fe78cafa32baa2553bdba082c61e9dca/image1-3.png" />
            
            </figure><p>We expect a baseline level of SERVFAIL errors for regular traffic during normal operation; it usually sits at around 3% of queries. These SERVFAILs can be caused by legitimate issues in the DNSSEC chain, failures to connect to authoritative servers, authoritative servers taking too long to respond, <a href="/unwrap-the-servfail/">and many others</a>. During the incident the proportion of SERVFAILs peaked at 15% of total queries, although the impact was not evenly distributed around the world and was mainly concentrated in our larger data centers like Ashburn, Virginia; Frankfurt, Germany; and Singapore.</p>
    <div>
      <h2>Why this incident happened</h2>
      <a href="#why-this-incident-happened">
        
      </a>
    </div>
    
    <div>
      <h4>Why parsing the ZONEMD record failed</h4>
      <a href="#why-parsing-the-zonemd-record-failed">
        
      </a>
    </div>
    <p>DNS has a binary format for storing resource records. In this binary format the type of the resource record (TYPE)  is stored as a 16-bit integer. The type of resource record determines how the resource data (RDATA) is parsed. When the record type is 1, this means it is an A record, and the RDATA can be parsed as an IPv4 address. Record type 28 is an AAAA record, whose RDATA can be parsed as an IPv6 address instead. When a parser runs into an unknown resource type it won’t know how to parse its RDATA, but fortunately it doesn’t have to: the RDLENGTH field indicates how long the RDATA field is, allowing the parser to treat it as an opaque data element.</p>
            <pre><code>                                   1  1  1  1  1  1
      0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                                               |
    /                                               /
    /                      NAME                     /
    |                                               |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      TYPE                     |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                     CLASS                     |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                      TTL                      |
    |                                               |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    |                   RDLENGTH                    |
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
    /                     RDATA                     /
    /                                               /
    +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+</code></pre>
            <p><a href="https://www.ietf.org/rfc/rfc1035.txt">RFC 1035</a></p><p>The reason static_zone didn’t support the new ZONEMD record is because up until now we had chosen to distribute the root zone internally in its presentation format, rather than in the binary format. When looking at the text representation for a few resource records we can see there is a lot more variation in how different records are presented.</p>
            <pre><code>.			86400	IN	SOA	a.root-servers.net. nstld.verisign-grs.com. 2023100400 1800 900 604800 86400
.			86400	IN	RRSIG	SOA 8 0 86400 20231017050000 20231004040000 46780 . J5lVTygIkJHDBt6HHm1QLx7S0EItynbBijgNlcKs/W8FIkPBfCQmw5BsUTZAPVxKj7r2iNLRddwRcM/1sL49jV9Jtctn8OLLc9wtouBmg3LH94M0utW86dKSGEKtzGzWbi5hjVBlkroB8XVQxBphAUqGxNDxdE6AIAvh/eSSb3uSQrarxLnKWvHIHm5PORIOftkIRZ2kcA7Qtou9NqPCSE8fOM5EdXxussKChGthmN5AR5S2EruXIGGRd1vvEYBrRPv55BAWKKRERkaXhgAp7VikYzXesiRLdqVlTQd+fwy2tm/MTw+v3Un48wXPg1lRPlQXmQsuBwqg74Ts5r8w8w==
.			518400	IN	NS	a.root-servers.net.
.			86400	IN	ZONEMD	2023100400 1 241 E375B158DAEE6141E1F784FDB66620CC4412EDE47C8892B975C90C6A102E97443678CCA4115E27195B468E33ABD9F78C</code></pre>
            <p>Example records taken from <a href="https://www.internic.net/domain/root.zone">https://www.internic.net/domain/root.zone</a></p><p>When we run into an unknown resource record it’s not always easy to know how to handle it. Because of this, the library we use to parse the root zone at the edge does not attempt to handle it, and instead returns a parser error.</p>
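    <p>The difference matters in code. A parser for the binary (wire) format can step over a record type it has never seen, because RDLENGTH tells it exactly how many RDATA bytes to skip; a presentation-format parser has to understand every type’s text representation before it can even finish reading the line. A minimal illustration of the wire-format case in Python (the field layout follows the RFC 1035 diagram above; NAME decoding is omitted for brevity):</p>
            <pre><code>import struct

def skip_record(buf, offset):
    """Read the fixed fields of a resource record starting at TYPE (just after NAME)
    and return (rtype, next_offset). Unknown types are fine: RDLENGTH tells us how
    many RDATA bytes to step over without understanding them."""
    rtype, _rclass, _ttl, rdlength = struct.unpack_from("!HHIH", buf, offset)
    return rtype, offset + 10 + rdlength    # 2 + 2 + 4 + 2 fixed bytes, then RDATA

# A made-up record: TYPE 63 (ZONEMD), CLASS 1 (IN), TTL 86400, six bytes of RDATA.
record = struct.pack("!HHIH", 63, 1, 86400, 6) + bytes(6)
rtype, next_offset = skip_record(record, 0)
print(rtype, next_offset)   # 63 16 -- the parser moved past RDATA it cannot interpret</code></pre>
            <p>Our presentation-format pipeline had no equivalent escape hatch, so the unfamiliar ZONEMD line became a parser error rather than opaque data, and the whole zone failed to load.</p>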
    <div>
      <h3>Why a stale version of the root zone was used</h3>
      <a href="#why-a-stale-version-of-the-root-zone-was-used">
        
      </a>
    </div>
    <p>The static_zone app, tasked with loading and parsing the root zone so that it can be served locally (<a href="https://datatracker.ietf.org/doc/html/rfc7706">RFC 7706</a>), stores the latest version in memory. When a new version is published, the app parses it and, once it has done so successfully, drops the old version. However, because parsing failed, the static_zone app never switched to a newer version and instead continued using the old version indefinitely. When the 1.1.1.1 service is first started, the static_zone app has no existing version in memory. If it then fails to parse the root zone, there is no older version to fall back on, so it instead queries the root servers directly for incoming requests.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6CnsfVerlBEVukAWjtcK0B/2bdf1562912f7694fe4c65f73e8a9810/image5.png" />
            
            </figure>
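    <p>To make the failure mode concrete, here is an illustrative Python sketch of the behavior described above (not the actual static_zone implementation), along with the kind of staleness bound discussed in the remediation section further down:</p>
            <pre><code>import time

class ParseError(Exception):
    pass

def parse_root_zone(raw):
    """Stand-in for the real parser; fails on a record type it does not know."""
    if "ZONEMD" in raw:
        raise ParseError("unknown resource record type")
    return raw

MAX_ZONE_AGE = 2 * 86400   # illustrative bound, not a real production value

class StaticZone:
    """Keeps the last successfully parsed root zone in memory."""

    def __init__(self):
        self.zone = None       # no copy at all right after a service restart
        self.loaded_at = 0.0

    def on_new_root_zone(self, raw):
        try:
            parsed = parse_root_zone(raw)
        except ParseError:
            return             # the gap: silently keep the old copy, with no alert
        self.zone, self.loaded_at = parsed, time.time()

    def answer(self):
        if self.zone is None:
            return None        # fresh start: fall back to querying the root servers
        if time.time() - self.loaded_at > MAX_ZONE_AGE:
            return None        # remediation: refuse to serve a copy that is too stale
        return self.zone</code></pre>
            <p>With neither the alerting nor an age check in place, a copy pulled on 21 September could keep answering quietly until its DNSSEC signatures expired on 4 October.</p>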
    <div>
      <h3>Why the initial attempt at disabling static_zone didn’t work</h3>
      <a href="#why-the-initial-attempt-at-disabling-static_zone-didnt-work">
        
      </a>
    </div>
    <p>Initially we tried to disable the static_zone app through override rules, a mechanism that allows us to programmatically change some behavior of 1.1.1.1. The rule we deployed was:</p>
            <pre><code>phase = pre-cache set-tag rec_disable_static</code></pre>
            <p>For any incoming request this rule adds the tag rec_disable_static to the request. Inside the static_zone app we check for this tag and, if it’s set, we do not return a response from the cached, static root zone. However, <a href="/big-pineapple-intro/">to improve cache performance</a> queries are sometimes forwarded to another node if the current node can’t find the response in its own cache. Unfortunately, the rec_disable_static tag is not included in the queries being forwarded to other nodes, which caused the static_zone app to continue replying with stale information until we eventually disabled the app entirely.</p>
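    <p>The gap is small enough to show in a few lines. In this hypothetical Python sketch (the structures are illustrative, not the resolver’s internals), the override tag is attached to the original request, but the query rebuilt for forwarding to another node does not carry it, so the static_zone check on the receiving node never sees it:</p>
            <pre><code>def apply_override_rules(request):
    # phase = pre-cache set-tag rec_disable_static
    request["tags"].add("rec_disable_static")
    return request

def forward_to_peer(request):
    # The gap: the forwarded query is rebuilt from scratch and the tags
    # added by override rules are not copied across.
    return {"name": request["name"], "qtype": request["qtype"], "tags": set()}

def static_zone_answer(request, cached_root_zone):
    if "rec_disable_static" in request["tags"]:
        return None               # override honored: do not use the static copy
    return cached_root_zone       # otherwise keep answering from the (stale) copy

query = apply_override_rules({"name": ".", "qtype": "SOA", "tags": set()})
print(static_zone_answer(query, "stale zone"))                    # None
print(static_zone_answer(forward_to_peer(query), "stale zone"))   # stale zone</code></pre>
            <p>Propagating the tag with the forwarded query would close the gap; disabling the app entirely, which is what we ultimately did, sidesteps it.</p>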
    <div>
      <h3>Why the impact was partial</h3>
      <a href="#why-the-impact-was-partial">
        
      </a>
    </div>
    <p>Cloudflare regularly performs rolling reboots of the servers that host our services for tasks like kernel updates that can only take effect after a full system restart. At the time of this outage, resolver server instances that were restarted between the ZONEMD change and the DNSSEC invalidation did not contribute to impact: because they failed to load the root zone on startup, they fell back to resolving by sending DNS queries to the root servers instead. In addition, the resolver uses a technique called serve stale (<a href="https://datatracker.ietf.org/doc/html/rfc8767">RFC 8767</a>) to continue serving popular records from a potentially stale cache and limit the impact. A record is considered stale once its TTL, in seconds, has elapsed since the record was retrieved from upstream. This prevented a total outage; impact was mainly felt in our largest data centers, which had many servers that had not restarted the 1.1.1.1 service in that timeframe.</p>
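    <p>Serve stale is worth a short illustration, since it is the mechanism that kept this outage partial: when fresh resolution fails, the resolver may keep answering from records whose TTL has already elapsed, within a bounded window. A simplified Python sketch (the stale window and TTLs are illustrative; RFC 8767 leaves the exact limits to the implementation):</p>
            <pre><code>import time

MAX_STALE = 3600   # illustrative: how far past its TTL a record may still be served
cache = {}         # name -> (rrset, ttl_seconds, stored_at)

def query_upstream(name):
    """Stand-in for real resolution; imagine it failing while validation is broken."""
    raise TimeoutError("upstream resolution failed")

def resolve(name):
    now = time.time()
    entry = cache.get(name)
    if entry is not None:
        rrset, ttl, stored_at = entry
        if ttl >= now - stored_at:
            return rrset                    # still within its TTL
    try:
        fresh = query_upstream(name)
    except Exception:
        # Upstream failed; answer from the stale copy if it is not too old,
        # rather than turning every query into a SERVFAIL.
        if entry is not None and (ttl + MAX_STALE) >= (now - stored_at):
            return rrset
        raise                               # nothing usable: the client sees an error
    cache[name] = (fresh, 300, now)         # simplified fixed TTL for the example
    return fresh

# A popular record cached 20 minutes ago with a 300-second TTL: stale, but still served.
cache["example.com."] = (["192.0.2.1"], 300, time.time() - 1200)
print(resolve("example.com."))   # ['192.0.2.1']</code></pre>
            <p>This is the behavior that kept the SERVFAIL rate at a fraction of total queries rather than turning the signature expiry into a total outage.</p>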
    <div>
      <h2>Remediation and follow-up steps</h2>
      <a href="#remediation-and-follow-up-steps">
        
      </a>
    </div>
    <p>This incident had widespread impact, and we take the availability of our services very seriously. We have identified several areas of improvement and will continue to work on uncovering any other gaps that could cause a recurrence.</p><p>Here is what we are working on immediately:</p><p><b>Visibility</b>: We’re adding alerts to notify when static_zone serves a stale root zone file. It should not have been the case that serving a stale root zone file went unnoticed for as long as it did. If we had been monitoring this better, with the caching that exists, there would have been no impact. It is our goal to protect our customers and their users from upstream changes.</p><p><b>Resilience:</b> We will re-evaluate how we ingest and distribute the root zone internally. Our ingestion and distribution pipelines should handle new RRTYPEs seamlessly, and any brief interruption to the pipeline should be invisible to end users.</p><p><b>Testing:</b> Despite having tests in place around this problem, including tests related to unreleased changes in parsing the new ZONEMD records, we did not adequately test what happens when the root zone fails to parse. We will improve our test coverage and the related processes.</p><p><b>Architecture</b>: We should not use stale copies of the root zone past a certain point. While it’s certainly possible to continue to use stale root zone data for a limited amount of time, past a certain point there are unacceptable operational risks. We will take measures to ensure that the lifetime of cached root zone data is better managed as described in <a href="https://www.rfc-editor.org/rfc/rfc8806">RFC 8806: Running a Root Server Local to a Resolver</a>.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We are deeply sorry that this incident happened. There is one clear message from this incident: never assume that something is not going to change!  Many modern systems are built with a long chain of libraries that are pulled into the final executable; each of those may have bugs, or may not be updated early enough for programs to operate correctly when their input changes. We understand how important it is to have good testing in place that detects regressions, and to have systems and components that fail gracefully when their input changes. We understand that we need to always assume that “format” changes in the most critical systems of the Internet (DNS and BGP) are going to have an impact.</p><p>We have a lot to follow up on internally and are working around the clock to make sure something like this does not happen again.</p> ]]></content:encoded>
            <category><![CDATA[1.1.1.1]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[Outage]]></category>
            <guid isPermaLink="false">5Epuwd37zPtXIStrh5nqXB</guid>
            <dc:creator>Ólafur Guðmundsson</dc:creator>
        </item>
        <item>
            <title><![CDATA[Hardening Workers KV]]></title>
            <link>https://blog.cloudflare.com/workers-kv-restoring-reliability/</link>
            <pubDate>Wed, 02 Aug 2023 13:05:42 GMT</pubDate>
            <description><![CDATA[ A deep dive into the recent incidents relating to Workers KV, and how we’re going to fix them ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Over the last couple of months, Workers KV has suffered from a series of incidents, culminating in three back-to-back incidents during the week of July 17th, 2023. These incidents have directly impacted customers that rely on KV — and this isn’t good enough.</p><p>We’re going to share the work we have done to understand why KV has had such a spate of incidents and, more importantly, share in depth what we’re doing to dramatically improve how we deploy changes to KV going forward.</p>
    <div>
      <h3>Workers KV?</h3>
      <a href="#workers-kv">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/developer-platform/workers-kv/">Workers KV</a> — or just “KV” — is a key-value service for storing data: specifically, data with high read throughput requirements. It’s especially useful for user configuration, service routing, small assets and/or authentication data.</p><p>We use KV extensively inside Cloudflare too, with <a href="https://www.cloudflare.com/zero-trust/products/access/">Cloudflare Access</a> (part of our Zero Trust suite) and <a href="https://pages.cloudflare.com/">Cloudflare Pages</a> being some of our highest-profile internal customers. Both teams benefit from KV’s ability to keep regularly accessed key-value pairs close to where they’re accessed, as well as its ability to scale out horizontally without any need to become an expert in operating KV.</p><p>Given Cloudflare’s extensive use of KV, it wasn’t just external customers who were impacted. Our own internal teams felt the pain of these incidents, too.</p>
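    <p>For readers less familiar with KV, a minimal Worker that reads and writes KV data looks roughly like this (the CONFIG binding name and keys are hypothetical; a real Worker declares the binding in its configuration):</p>
<pre><code>// Minimal Workers KV usage sketch; the CONFIG binding and keys are illustrative.
export interface Env {
  CONFIG: KVNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise&lt;Response> {
    // Reads are served from a cache close to where they are accessed when possible.
    const routing = await env.CONFIG.get("service-routing", "json");

    // Writes are eventually consistent across locations.
    await env.CONFIG.put("last-deploy", new Date().toISOString(), {
      expirationTtl: 3600, // optional TTL, in seconds
    });

    return Response.json({ routing });
  },
};
</code></pre>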
    <div>
      <h3>The summary of the post-mortem</h3>
      <a href="#the-summary-of-the-post-mortem">
        
      </a>
    </div>
    <p>Back in June 2023, we announced the move to a new architecture for KV, which is designed to address two major points of customer feedback we’ve had around KV: high latency for infrequently accessed keys (or a key accessed in different regions), and ensuring the upper bound on KV’s eventual consistency model for writes is 60 seconds — not “mostly 60 seconds”.</p><p>At the time of that announcement, we’d already been testing the new architecture internally, including via early access with our community champions and by running a small percentage of production traffic on it to validate stability and performance expectations beyond what we could emulate within a staging environment.</p><p>However, in the weeks between mid-June and the series of incidents during the week of July 17th, we continued to increase the volume of traffic onto the new architecture. When we did this, we would encounter previously unseen problems (many of them customer-impacting) — then immediately roll back, fix bugs, and repeat. Internally, we’d begun to identify that this pattern was becoming unsustainable — each attempt to cut traffic over to the new architecture would surface errors or behaviors we hadn’t seen before and couldn’t immediately explain, and so we would roll back and assess.</p><p>The issues at the root of this series of incidents proved to be significantly challenging to track down and observe. Once identified, the two causes themselves were quick to fix, but (1) an observability gap in our error reporting and (2) a mutation to local state that resulted in an unexpected mutation of global state were both hard to observe and reproduce in the days after the customer-facing impact ended.</p>
    <div>
      <h3>The detail</h3>
      <a href="#the-detail">
        
      </a>
    </div>
    <p>One important piece of context to understand before we go into detail on the post-mortem: Workers KV is composed of two separate Workers scripts – internally referred to as the Storage Gateway Worker and SuperCache. SuperCache is an optional path in the Storage Gateway Worker workflow, and is the basis for KV's new (faster) backend (refer to the blog).</p><p>Here is a timeline of events:</p>
<table>
<thead>
  <tr>
    <th><span>Time</span></th>
    <th><span>Description</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>2023-07-17 21:52 UTC</span></td>
    <td><span>Cloudflare observes alerts showing 500 HTTP status codes in the MEL01 data-center (Melbourne, AU) and begins investigating.</span><br /><span>We also begin to see a small set of customers reporting HTTP 500s being returned via multiple channels. It is not immediately clear if this is a data-center-wide issue or KV specific, as there had not been a recent KV deployment, and the issue directly correlated with three data-centers being brought back online.</span></td>
  </tr>
  <tr>
    <td><span>2023-07-18 00:09 UTC</span></td>
    <td><span>We disable the new backend for KV in MEL01 in an attempt to mitigate the issue (noting that there had not been a recent deployment or change to the % of users on the new backend).</span></td>
  </tr>
  <tr>
    <td><span>2023-07-18 05:42 UTC</span></td>
    <td><span>Investigating alerts showing 500 HTTP status codes in VIE02 (Vienna, AT) and JNB01 (Johannesburg, SA).</span></td>
  </tr>
  <tr>
    <td><span>2023-07-18 13:51 UTC</span></td>
    <td><span>The new backend is disabled globally after seeing issues in VIE02 (Vienna, AT) and JNB01 (Johannesburg, SA) data-centers, similar to MEL01. In both cases, they had also recently come back online after maintenance, but it remained unclear as to why KV was failing.</span></td>
  </tr>
  <tr>
    <td><span>2023-07-20 19:12 UTC</span></td>
    <td><span>The new backend is inadvertently re-enabled while deploying the update due to a misconfiguration in a deployment script. </span></td>
  </tr>
  <tr>
    <td><span>2023-07-20 19:33 UTC</span></td>
    <td><span>The new backend is (re-) disabled globally as HTTP 500 errors return.</span></td>
  </tr>
  <tr>
    <td><span>2023-07-20 23:46 UTC</span></td>
    <td><span>Broken Workers script pipeline deployed as part of gradual rollout due to incorrectly defined pipeline configuration in the deployment script.</span><br /><span>Metrics begin to report that a subset of traffic is being black-holed.</span></td>
  </tr>
  <tr>
    <td><span>2023-07-20 23:56 UTC</span></td>
    <td><span>Broken pipeline rolled back; error rates return to pre-incident (normal) levels.</span></td>
  </tr>
</tbody>
</table><p><i>All timestamps referenced are in Coordinated Universal Time (UTC).</i></p><p>We initially observed alerts showing 500 HTTP status codes in the MEL01 data-center (Melbourne, AU) at 21:52 UTC on July 17th, and began investigating. We also received reports from a small set of customers reporting HTTP 500s being returned via multiple channels. This correlated with three data centers being brought back online, and it was not immediately clear if it related to the data centers or was KV-specific — especially given there had not been a recent KV deployment. At 05:42 UTC on July 18th, we began investigating alerts showing 500 HTTP status codes in the VIE02 (Vienna) and JNB01 (Johannesburg) data-centers; while both had recently come back online after maintenance, it was still unclear why KV was failing. At 13:51 UTC, we made the decision to disable the new backend globally.</p><p>Following the incident on July 18th, we attempted to deploy an allow-list configuration to reduce the scope of impacted accounts. However, while attempting to roll out a change for the Storage Gateway Worker at 19:12 UTC on July 20th, an older configuration was progressed, causing the new backend to be enabled again and leading to the third event. As the team worked to fix this and deploy the intended configuration, they attempted to manually progress the deployment at 23:46 UTC, which resulted in a malformed configuration value being passed and traffic being sent to an invalid Workers script configuration.</p><p>After all deployments and the broken Workers configuration (pipeline) had been rolled back at 23:56 UTC on July 20th, we spent the following three days working to identify the root cause of the issue. We lacked <a href="https://www.cloudflare.com/learning/performance/what-is-observability/">observability</a> as KV's Worker script (responsible for much of KV's logic) was throwing an unhandled exception very early on in the request handling process. This was further exacerbated by prior work to disable error reporting for data-centers that are themselves disabled (due to the noise generated), which had previously resulted in logs being rate-limited upstream from our service.</p><p>This previous mitigation prevented us from capturing meaningful logs from the Worker, including identifying the exception itself, as an uncaught exception terminates request processing. This has raised the priority of improving how unhandled exceptions are reported and surfaced in a Worker (see the recommendations below for further details). The problem was compounded by the fact that KV's Worker script would fail to re-enter its "healthy" state when a Cloudflare data center was brought back online, as the Worker was mutating an environment variable that appeared to be in request scope but was actually in global scope and persisted across requests. This effectively left the Worker “frozen” with the previous, invalid configuration for the affected locations.</p><p>Further, the introduction of a new progressive release process for Workers KV, designed to de-risk rollouts (as an action from a prior incident), prolonged the incident. We found a bug in the deployment logic that led to a broader outage due to an incorrectly defined configuration.</p><p>This configuration effectively caused us to drop a single-digit percentage of traffic until it was rolled back 10 minutes later. This code is untested at scale, and we need to spend more time hardening it before using it as the default path in production.</p><p>Additionally, although the root cause of the incidents was limited to three Cloudflare data-centers (Melbourne, Vienna, and Johannesburg), traffic across their wider regions still uses these data centers to route reads and writes to our system of record. Because these three data centers participate in KV’s new backend as regional tiers, a portion of traffic across the Oceania, Europe, and Africa regions was affected. Only a portion of keys from enrolled namespaces use any given data center as a regional tier in order to limit a single (regional) point of failure, so while traffic across <i>all</i> data centers in the region was impacted, nowhere was <i>all</i> traffic in a given data center affected.</p><p>We estimated the affected traffic to be 0.2-0.5% of KV's global traffic (based on our error reporting); however, we observed some customers with error rates approaching 20% of their total KV operations. The impact was spread across KV namespaces and keys for customers within the scope of this incident.</p><p>Both KV’s high total traffic volume and its role as a critical dependency for many customers amplify the impact of even small error rates. In all cases, once the changes were rolled back, errors returned to normal levels and did not persist.</p>
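    <p>To illustrate the environment variable pitfall described above (a simplified sketch with hypothetical names, not KV’s actual code): bindings on the env object live at isolate scope, so a mutation that looks request-scoped is visible to every later request handled by the same isolate.</p>
<pre><code>// Simplified sketch of the request-vs-global scope pitfall; names are illustrative.
export interface Env {
  BACKEND_MODE: string; // e.g. "new" or "legacy"
}

export default {
  async fetch(request: Request, env: Env): Promise&lt;Response> {
    const url = new URL(request.url);

    // Looks like a per-request override, but env is shared by every request
    // handled by this isolate: the new value persists until the isolate is recycled.
    if (url.searchParams.get("mode") === "legacy") {
      env.BACKEND_MODE = "legacy";
    }

    // Safer: derive a request-scoped value instead of mutating the binding.
    const mode = url.searchParams.get("mode") ?? env.BACKEND_MODE;

    return new Response("backend mode: " + mode);
  },
};
</code></pre>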
    <div>
      <h3>Thinking about risks in building software</h3>
      <a href="#thinking-about-risks-in-building-software">
        
      </a>
    </div>
    <p>Before we dive into what we’re doing to significantly improve how we build, test, deploy and observe Workers KV going forward, we think there are lessons from the real world that can equally apply to how we improve the safety factor of the software we ship.</p><p>In traditional engineering and construction, there is an extremely common procedure known as a “JSEA”, or <a href="https://en.wikipedia.org/wiki/Job_safety_analysis">Job Safety and Environmental Analysis</a> (sometimes just “JSA”). A JSEA is designed to help you iterate through a list of tasks, the potential hazards, and most importantly, the controls that will be applied to prevent those hazards from damaging equipment, injuring people, or worse.</p><p>One of the most critical concepts is the “hierarchy of controls” — that is, what controls should be applied to mitigate these hazards. In most practices, these are elimination, substitution, engineering, administration and personal protective equipment. Elimination and substitution are fairly self-explanatory: is there a different way to achieve this goal? Can we eliminate that task completely? Engineering and administration ask us whether there is additional engineering work that could mitigate the hazard, such as changing the placement of a panel, or using a horizontal boring machine to lay an underground pipe vs. opening up a trench that people can fall into.</p><p>The last, and lowest on the hierarchy, is personal protective equipment (PPE). A hard hat can protect you from severe injury from something falling from above, but it’s a last resort, and it certainly isn’t guaranteed. In engineering practice, any hazard that <i>only</i> lists PPE as a mitigating factor is unsatisfactory: there must be additional controls in place. For example, instead of only wearing a hard hat, we should <i>engineer</i> the floor of scaffolding so that large objects (such as a wrench) cannot fall through in the first place. Further, if we require that all tools are attached to the wearer, we significantly reduce the chance that a tool is dropped in the first place. These controls ensure that there are multiple degrees of mitigation — defense in depth — before your hard hat has to come into play.</p><p>Coming back to software, we can draw parallels between these controls: engineering can be likened to improving automation, gradual rollouts, and detailed metrics. Similarly, personal protective equipment can be likened to code review: useful, but code review cannot be the only thing protecting you from shipping bugs or untested code. Automation with linters, more robust testing, and new metrics are all vastly <i>safer</i> ways of shipping software.</p><p>As we spent time assessing where to improve our existing controls and how to put new controls in place to mitigate risks and improve the reliability (safety) of Workers KV, we took a similar approach: eliminating unnecessary changes, engineering more resilience into our codebase, improving automation and deployment tooling, and only then looking at human processes.</p>
    <div>
      <h3>How we plan to get better</h3>
      <a href="#how-we-plan-to-get-better">
        
      </a>
    </div>
    <p>Cloudflare is undertaking a larger, more structured review of KV's observability tooling, release infrastructure and processes to mitigate not only the contributing factors to the incidents within this report, but also those behind other recent incidents related to KV. Critically, we see tooling and automation as the most powerful mechanisms for preventing incidents, with process improvements designed to provide an additional layer of protection. Process improvements alone cannot be the only mitigation.</p><p>Specifically, we have identified and prioritized the below efforts as the most important next steps towards meeting our own availability SLOs, and (above all) making KV a service that customers building on Workers can rely on for storing configuration and service data in the hot path of their traffic:</p><ul><li><p>Substantially improve the existing observability tooling for unhandled exceptions, both for internal teams and customers building on Workers. This is especially critical for high-volume services, where traditional logging alone can be too noisy (and not specific enough) to aid in tracking down these cases. The existing ongoing work to land this will be prioritized further. In the meantime, we have directly addressed the specific uncaught exception with KV's primary Worker script.</p></li><li><p>Improve the safety around the mutation of environmental variables in a Worker, which currently operate at "global" (per-isolate) scope, but can appear to be per-request. Mutating an environmental variable in request scope mutates the value for all requests transiting that same isolate (in a given location), which can be unexpected. Changes here will need to keep backwards compatibility in mind.</p></li><li><p>Continue to expand KV’s test coverage to better address the above issues, in parallel with the aforementioned observability and tooling improvements, as an additional layer of defense. This includes allowing our test infrastructure to simulate traffic from any source data-center, which would have allowed us to more quickly reproduce the issue and identify a root cause.</p></li><li><p>Improvements to our release process, including how KV changes and releases are reviewed and approved going forward. We will enforce a higher level of scrutiny for future changes and, where possible, reduce the number of changes deployed at once. This includes taking on new infrastructure dependencies, which will have a higher bar for both design and testing.</p></li><li><p>Additional logging improvements, including sampling, throughout our request handling process to improve troubleshooting &amp; debugging. A significant amount of the challenge related to these incidents was due to the lack of logging around specific requests (especially non-2xx requests).</p></li><li><p>Review and, where applicable, improve alerting thresholds surrounding error rates. As mentioned previously in this report, sub-% error rates at a global scale can have severe negative impact on specific users and/or locations: ensuring that errors are caught and not lost in the noise is an ongoing effort.</p></li><li><p>Address maturity issues with our progressive deployment tooling for Workers, which is net-new (and will eventually be exposed to customers directly).</p></li></ul><p>This is not an exhaustive list: we're continuing to expand on preventative measures associated with these and other incidents. These changes will improve not only KV's reliability, but also the reliability of other services across Cloudflare that KV relies on, or that rely on KV.</p><p>We recognize that KV hasn’t lived up to our customers’ expectations recently. Because we rely on KV so heavily internally, we’ve felt that pain firsthand as well. The work to fix the issues that led to this cycle of incidents is already underway. That work will not only improve KV’s reliability, but also that of any software written on the Cloudflare Workers developer platform, whether by our customers or by ourselves.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">6sRjpTRuwGjPJmHgwHlg7u</guid>
            <dc:creator>Matt Silverlock</dc:creator>
            <dc:creator>Charles Burnett</dc:creator>
            <dc:creator>Rob Sutter</dc:creator>
            <dc:creator>Kris Evans</dc:creator>
        </item>
    </channel>
</rss>