On February 20, 2026, at 17:48 UTC, Cloudflare experienced a service outage when a subset of customers who use Cloudflare’s Bring Your Own IP (BYOIP) service saw their routes to the Internet withdrawn via Border Gateway Protocol (BGP).
The issue was not caused, directly or indirectly, by a cyberattack or malicious activity of any kind. This issue was caused by a change that Cloudflare made to how our network manages IP addresses onboarded through the BYOIP pipeline. This change caused Cloudflare to unintentionally withdraw customer prefixes.
For some BYOIP customers, this resulted in their services and applications being unreachable from the Internet, causing timeouts and failures to connect across their Cloudflare deployments that used BYOIP. The website for Cloudflare’s recursive DNS resolver (1.1.1.1) saw 403 errors as well. The total duration of the incident was 5 hours and 7 minutes, with most of that time spent restoring prefix configurations to their state prior to the change.
When we began to observe failures, Cloudflare engineers reverted the change and prefixes stopped being withdrawn. However, before engineers were able to revert the change, ~1,100 BYOIP prefixes had already been withdrawn from the Cloudflare network. Some customers were able to restore their own service by using the Cloudflare dashboard to re-advertise their IP addresses. We resolved the incident when we restored all prefix configurations.
We are sorry for the impact to our customers. We let you down today. This post is an in-depth recounting of exactly what happened and which systems and processes failed. We will also outline the steps we are taking to prevent outages like this from happening again.
How did the outage impact customers?
This graph shows the number of prefixes advertised by Cloudflare to a BGP neighbor during the incident, which correlates with impact because prefixes that weren’t advertised were unreachable on the Internet:
Out of the total 6,500 prefixes advertised to this peer, 4,306 of those were BYOIP prefixes. These BYOIP prefixes are advertised to every peer and represent all the BYOIP prefixes we advertise globally.
During the incident, 1,100 of the total 6,500 prefixes were withdrawn between 17:56 and 18:46 UTC, roughly 25% of the 4,306 BYOIP prefixes. We detected the impact on one.one.one.one and reverted the change before more prefixes were affected. At 19:19 UTC, we published guidance to customers that they could self-remediate by going to the Cloudflare dashboard and re-advertising their prefixes.
Cloudflare was able to revert many of the advertisement changes around 20:20 UTC, restoring roughly 800 prefixes. There were still ~300 prefixes that could not be remediated through the dashboard because the service configurations for those prefixes had been removed from the edge by a software bug. These prefixes were manually restored by Cloudflare engineers at 23:03 UTC.
This incident did not impact all BYOIP customers because the configuration change was applied iteratively rather than instantaneously across all BYOIP customers. Once the configuration change was identified as the cause of the impact, it was reverted before all customers were affected.
The impacted BYOIP customers first experienced a behavior called BGP Path Hunting. In this state, end user connections traverse networks trying to find a route to the destination IP, and this continues until the connection that was opened times out and fails. Until the prefix is re-advertised, customers will continue to see this failure mode. This loop-until-failure scenario affected any product that uses BYOIP for advertisement to the Internet. Additionally, visitors to one.one.one.one, the website for Cloudflare’s recursive DNS resolver, were met with HTTP 403 errors and an “Edge IP Restricted” error message. DNS resolution over the 1.1.1.1 Public Resolver, including DNS over HTTPS, was not affected. A full breakdown of the services impacted is below.
| Service/Product | Impact Description |
| --- | --- |
| Core CDN and Security Services | Traffic was not attracted to Cloudflare, and users connecting to websites advertised on those ranges would have seen failures to connect |
| Spectrum | Spectrum apps on BYOIP failed to proxy traffic because traffic was not attracted to Cloudflare |
| Dedicated Egress | Customers using Gateway Dedicated Egress with BYOIP, or Dedicated IPs for CDN Egress with BYOIP, would not have been able to send traffic out to their destinations |
| Magic Transit | Prefixes protecting applications behind Magic Transit were not advertised on the Internet, so end users connecting to those applications would have seen connection timeouts and failures |
There was also a set of customers who were unable to restore service by toggling their prefixes on the Cloudflare dashboard. As engineers began re-announcing prefixes to restore service, these customers may have seen increased latency and failures even though their IP addresses were being advertised. This was because the addressing settings for some users had been removed from edge servers due to an issue in our own software, and that state had to be propagated back to the edge.
We’re going to get into what exactly broke in our addressing system, but to do that we need to cover a quick primer on the Addressing API, which is the underlying source of truth for customer IP addresses at Cloudflare.
Cloudflare’s Addressing API
The Addressing API is the authoritative dataset of the addresses present on the Cloudflare network, and any change to that dataset is immediately reflected across Cloudflare’s global network. While we are in the process of improving how these systems roll out changes as part of Code Orange: Fail Small, today customers configure their IP addresses through public-facing APIs that write to a set of databases, which in turn trigger operational workflows that propagate the changes to Cloudflare’s edge.
Advertising and configuring IP addresses on Cloudflare involves several steps:
Customers signal to Cloudflare about advertisement/withdrawal of IP addresses via the Addressing API or BGP Control
The Addressing API instructs the machines to change the prefix advertisements
BGP will be updated on the routers once enough machines have received the notification to update the prefix
Finally, customers can configure Cloudflare products to use BYOIP addresses via service bindings which will assign products to these ranges
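As an illustration of the first step, here is a minimal sketch of what a client-side advertisement toggle can look like. It is not the exact Addressing API surface: the endpoint path, field names, and authentication shown here are simplified assumptions for illustration only.

package addressing

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
)

// setAdvertised asks an Addressing-API-style endpoint to advertise or
// withdraw a BYOIP prefix. Paths and field names are illustrative.
func setAdvertised(ctx context.Context, client *http.Client, baseURL, token, accountID, prefixID string, advertised bool) error {
    body, err := json.Marshal(map[string]bool{"advertised": advertised})
    if err != nil {
        return err
    }

    url := fmt.Sprintf("%s/accounts/%s/addressing/prefixes/%s/bgp/status", baseURL, accountID, prefixID)
    req, err := http.NewRequestWithContext(ctx, http.MethodPatch, url, bytes.NewReader(body))
    if err != nil {
        return err
    }
    req.Header.Set("Authorization", "Bearer "+token)
    req.Header.Set("Content-Type", "application/json")

    resp, err := client.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("unexpected status %d toggling advertisement", resp.StatusCode)
    }
    return nil
}

Setting advertised to true is effectively what re-advertising a prefix from the dashboard does on a customer’s behalf; setting it to false withdraws the prefix.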
The Addressing API allows us to automate most of the processes surrounding how we advertise or withdraw addresses, but some processes still require manual actions. These manual processes are risky because of their close proximity to Production. As a part of Code Orange: Fail Small, one of the goals of remediation was to remove manual actions taken in the Addressing API and replace them with safe workflows.
How did the incident occur?
The specific piece of configuration that broke was a modification attempting to automate the customer action of removing prefixes from Cloudflare’s BYOIP service, a regular customer request that is handled manually today. Removing this manual process was part of our Code Orange: Fail Small work to push all changes toward safe, automated, health-mediated deployment. Since the list of objects related to a BYOIP prefix can be large, this was implemented as a regularly running sub-task that checks for BYOIP prefixes that should be removed, and then removes them. Unfortunately, the query this cleanup sub-task sent to the API contained a bug.
Here is the API query from the cleanup sub-task:
resp, err := d.doRequest(ctx, http.MethodGet, `/v1/prefixes?pending_delete`, nil)
And here is the relevant part of the API implementation:
if v := req.URL.Query().Get("pending_delete"); v != "" {
    // ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
    prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
    if err != nil {
        api.RenderError(ctx, w, ErrInternalError)
        return
    }
    api.Render(ctx, w, http.StatusOK, renderIPPrefixAPIResponse(prefixes, nil))
    return
}
Because the client passes pending_delete with no value, the result of Query().Get("pending_delete") here is an empty string (""), so the v != "" check fails and the handler falls through to its default behavior: returning all BYOIP prefixes instead of just those that were supposed to be removed. The sub-task interpreted every returned prefix as queued for deletion, and began systematically deleting all BYOIP prefixes and all of their related dependent objects, including service bindings, until the impact was noticed and an engineer identified the sub-task and shut it down.
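The behavior is easy to reproduce with Go’s standard library. This is a minimal, standalone illustration (not Cloudflare’s code) of why a valueless query parameter slips past a Get(...) != "" check, and how presence can be distinguished from value:

package main

import (
    "fmt"
    "net/url"
)

func main() {
    // Simulates the request the cleanup sub-task sent: the key is present
    // but carries no value.
    u, _ := url.Parse("/v1/prefixes?pending_delete")
    q := u.Query()

    // Get returns "" both when the key is absent and when it is present
    // without a value, so a != "" guard cannot tell the two cases apart.
    fmt.Printf("Get(%q) = %q\n", "pending_delete", q.Get("pending_delete"))

    // Has (Go 1.17+) reports whether the key was supplied at all.
    fmt.Printf("Has(%q) = %v\n", "pending_delete", q.Has("pending_delete"))
}

Either sending an explicit value from the client (pending_delete=true) or checking presence with Has on the server would have kept this request from falling through to the list-everything path.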
Why did Cloudflare not catch the bug in our staging environment or testing?
Our staging environment contains data that matches Production as closely as possible, but it was not sufficient in this case, and the mock data we relied on to simulate what would happen did not cover this scenario.
In addition, while we have tests for this functionality, coverage for this scenario in our testing process and environment was incomplete. Initial testing and code review focused on the BYOIP self-service API journey and were completed successfully. While our engineers successfully tested the exact process a customer would have followed, testing did not cover a scenario where the task-runner service would independently execute changes to user data without explicit input.
Affected BYOIP prefixes were not all impacted in the same way, necessitating more intensive data recovery steps. As a part of Code Orange: Fail Small, we are building a system where operational state snapshots can be safely rolled out through health-mediated deployments. In the event something does roll out that causes unexpected behavior, it can be very quickly rolled back to a known-good state. However, that system is not in Production today.
BYOIP prefixes were in different states of impact during this incident, and each of these different states required different actions:
Most impacted customers only had their prefixes withdrawn. Customers in this configuration could go into the dashboard and toggle their advertisements, which would restore service.
Some customers had their prefixes withdrawn and some bindings removed. These customers were in a partial state of recovery where they could toggle some prefixes but not others.
Some customers had their prefixes withdrawn and all service bindings removed. They could not toggle their prefixes in the dashboard because there was no service (Magic Transit, Spectrum, CDN) bound to them. These customers took the longest to mitigate, as a global configuration update had to be initiated to reapply the service bindings for all these customers to every single machine on Cloudflare’s edge.
How does this incident relate to Code Orange: Fail Small?
The change we were making when this incident occurred is part of the Code Orange: Fail Small initiative, which is aimed at improving the resiliency of code and configuration at Cloudflare. As a brief primer on Code Orange: Fail Small, the work can be divided into three buckets:
Require controlled rollouts for any configuration change that is propagated to the network, just like we do today for software binary releases.
Change our internal “break glass” procedures and remove any circular dependencies so that we, and our customers, can act fast and access all systems without issue during an incident.
Review, improve, and test failure modes of all systems handling network traffic to ensure they exhibit well-defined behavior under all conditions, including unexpected error states.
The change that we attempted to deploy falls under the first bucket. By moving risky, manual changes to safe, automated configuration updates that are deployed in a health-mediated manner, we aim to improve the reliability of the service.
Critical work to enhance the Addressing API's configuration change support, through staged test mediation and better correctness checks, was already underway in parallel with the deployed change, but those preventative measures were not fully in place before the outage. Following our Code Orange: Fail Small promise to require controlled rollouts of any change into Production, our engineering teams have been reaching deep into all layers of our stack to identify and fix problematic findings. While this outage was not itself global, the blast radius and impact were unacceptably large, further reinforcing Code Orange: Fail Small as a priority until we have re-established confidence that every change to our network is as gradual as possible. Now let’s talk more specifically about improvements to these systems.
API schema standardization
One of the issues in this incident is that the pending_delete flag was passed and interpreted as a bare string, making it difficult for both client and server to reason about the flag's value. We will improve the API schema to enforce better standardization, which will make it much easier for tests and systems to validate whether an API call is properly formed. This work is part of the third Code Orange workstream, which aims to create well-defined behavior under all conditions.
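As a sketch of the direction this takes (not the final schema work), a handler can require an explicit, parseable boolean and reject anything else, so a bare or empty pending_delete can never be silently reinterpreted. The helper below uses hypothetical names:

package addressing

import (
    "fmt"
    "net/url"
    "strconv"
)

// parseBoolParam requires the query parameter, if supplied, to carry an
// explicit boolean value. A present-but-empty or malformed value is a
// client error rather than something to guess about.
func parseBoolParam(q url.Values, key string) (value bool, present bool, err error) {
    if !q.Has(key) {
        return false, false, nil // parameter not supplied at all
    }
    v, parseErr := strconv.ParseBool(q.Get(key))
    if parseErr != nil {
        return false, true, fmt.Errorf("query parameter %q must be true or false", key)
    }
    return v, true, nil
}

With this shape, the cleanup sub-task's request for /v1/prefixes?pending_delete (with no value) would have been rejected with a 400 instead of silently returning every prefix.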
Today, customers make changes to the addressing schema that are persisted in an authoritative database, and that same database is used for operational actions. This makes manual rollback more challenging because engineers need to work from database snapshots instead of reconciling desired and actual states. We will redesign the rollback mechanism and database configuration so that we can roll back changes quickly, and introduce layers between customer configuration and Production.
We will snapshot the data that we read from the database and apply to Production, and apply those snapshots the same way we deploy all our other Production changes: mediated by health metrics that can automatically stop the deployment if things are going wrong. This means that the next time the database is changed into a bad state, we can near-instantly revert individual customers (or all customers) to a version that was working.
While this will temporarily block our customers from making direct updates via our API in the event of an outage, it means we can continue serving their traffic while we work to fix the database, instead of being down for that time. This work aligns with the first and second Code Orange workstreams, which involve fast rollback and safe, health-mediated deployment of configuration.
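To make the idea concrete, here is a minimal sketch of the shape of that mechanism. The types and names are hypothetical, and the real system will be health-mediated and considerably more involved:

package addressing

import (
    "errors"
    "sync"
    "time"
)

// Snapshot is an immutable, versioned copy of addressing state that gets
// deployed to the edge. Fields here are illustrative only.
type Snapshot struct {
    Version  int
    TakenAt  time.Time
    Prefixes map[string]bool // prefix -> advertised
}

// SnapshotStore remembers every deployed snapshot plus the last one that
// passed health checks, so a bad deployment can be reverted immediately.
type SnapshotStore struct {
    mu       sync.Mutex
    deployed []Snapshot
    lastGood int // index into deployed
}

// Deploy records a new snapshot as the current candidate.
func (s *SnapshotStore) Deploy(snap Snapshot) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.deployed = append(s.deployed, snap)
}

// MarkHealthy promotes the most recent snapshot to "known good" once the
// health metrics for its deployment look normal.
func (s *SnapshotStore) MarkHealthy() {
    s.mu.Lock()
    defer s.mu.Unlock()
    if len(s.deployed) > 0 {
        s.lastGood = len(s.deployed) - 1
    }
}

// Rollback returns the last known-good snapshot so it can be redeployed
// when health metrics regress after a change.
func (s *SnapshotStore) Rollback() (Snapshot, error) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if len(s.deployed) == 0 {
        return Snapshot{}, errors.New("no snapshot has ever been deployed")
    }
    return s.deployed[s.lastGood], nil
}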
Better arbitrate large withdrawal actions
We will improve our monitoring to detect when changes are happening too fast or too broadly, such as withdrawing or deleting BGP prefixes quickly, and disable the deployment of snapshots when this happens. This will form a type of circuit breaker to stop any out-of-control process that is manipulating the database from having a large blast radius, like we saw in this incident.
We also have some ongoing work to directly monitor that the services run by our customers are behaving correctly, and those signals can also be used to trip the circuit breaker and stop potentially dangerous changes from being applied until we have had time to investigate. This work aligns with the first Code Orange workstream, which involves safe deployment of changes.
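The breaker itself can be simple. The sketch below is illustrative (the names, window, and threshold are assumptions, not production values): it counts withdrawals over a sliding window and refuses further ones once the count crosses a limit, so a runaway process halts and a human has to investigate before it can continue.

package addressing

import (
    "errors"
    "sync"
    "time"
)

// ErrBreakerOpen signals that too many withdrawals happened in the window;
// further withdrawals must wait for human review.
var ErrBreakerOpen = errors.New("withdrawal circuit breaker is open")

// WithdrawalBreaker trips once more than maxWithdrawals occur within window.
type WithdrawalBreaker struct {
    mu             sync.Mutex
    window         time.Duration
    maxWithdrawals int
    events         []time.Time
}

func NewWithdrawalBreaker(window time.Duration, max int) *WithdrawalBreaker {
    return &WithdrawalBreaker{window: window, maxWithdrawals: max}
}

// Allow records an attempted withdrawal and reports whether it may proceed.
func (b *WithdrawalBreaker) Allow(now time.Time) error {
    b.mu.Lock()
    defer b.mu.Unlock()

    // Drop events that have aged out of the sliding window.
    cutoff := now.Add(-b.window)
    kept := b.events[:0]
    for _, t := range b.events {
        if t.After(cutoff) {
            kept = append(kept, t)
        }
    }
    b.events = kept

    if len(b.events) >= b.maxWithdrawals {
        return ErrBreakerOpen
    }
    b.events = append(b.events, now)
    return nil
}

A deployment or cleanup task would call Allow before each withdrawal, for example with NewWithdrawalBreaker(10*time.Minute, 50), and stop and page a human when it sees ErrBreakerOpen.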
Below is the timeline of events inclusive of deployment of the change and remediation steps:
| Time (UTC) | Status | Description |
| --- | --- | --- |
| 2026-02-05 21:53 | Code merged into system | Broken sub-process merged into code base |
| 2026-02-20 17:46 | Code deployed into system | Addressing API release with broken sub-process completes |
| 2026-02-20 17:56 | Impact Start | Broken sub-process begins executing. Prefix advertisement updates begin propagating and prefixes begin to be withdrawn – IMPACT STARTS – |
| 2026-02-20 18:13 | Cloudflare engaged | Cloudflare engaged for failures on one.one.one.one |
| 2026-02-20 18:18 | Internal incident declared | Cloudflare engineers continue investigating impact |
| 2026-02-20 18:21 | Addressing API team paged | Engineering team responsible for Addressing API engaged and debugging begins |
| 2026-02-20 18:46 | Issue identified | Broken sub-process terminated by an engineer and regular execution disabled; remediation begins |
| 2026-02-20 19:11 | Mitigation begins | Cloudflare engineers begin to restore serviceability for prefixes that were withdrawn while others focus on prefixes that were removed |
| 2026-02-20 19:19 | Some prefixes mitigated | Customers begin to re-advertise their prefixes via the dashboard to restore service – IMPACT DOWNGRADE – |
| 2026-02-20 19:44 | Additional mitigation continues | Engineers begin database recovery methods for removed prefixes |
| 2026-02-20 20:30 | Final mitigation process begins | Engineers complete release to restore withdrawn prefixes that still have existing service bindings; others are still working on removed prefixes – IMPACT DOWNGRADE – |
| 2026-02-20 21:08 | Configuration update deploys | Engineering begins global machine configuration rollout to restore prefixes that were not self-mitigated or mitigated via previous efforts – IMPACT DOWNGRADE – |
| 2026-02-20 23:03 | Configuration update completed | Global machine configuration deployment to restore remaining prefixes is completed – IMPACT ENDS – |
We deeply apologize for this incident and how it affected the service we provide to our customers, and the Internet at large. We aim to provide a network that is resilient to change, and we did not deliver on that promise today. We are actively making these improvements to provide greater stability moving forward and to prevent this problem from happening again.