
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Tue, 12 May 2026 12:53:07 GMT</lastBuildDate>
        <item>
            <title><![CDATA[How Cloudflare responded to the “Copy Fail” Linux vulnerability]]></title>
            <link>https://blog.cloudflare.com/copy-fail-linux-vulnerability-mitigation/</link>
            <pubDate>Thu, 07 May 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ When a critical Linux kernel privilege escalation was publicly disclosed, Cloudflare's security and engineering teams detected, investigated, and mitigated the threat across our global fleet, confirming zero customer impact and no malicious exploitation. ]]></description>
            <content:encoded><![CDATA[ <p>On April 29, 2026, a Linux kernel local privilege escalation vulnerability was publicly disclosed under the name "Copy Fail" (<a href="https://security-tracker.debian.org/tracker/CVE-2026-31431"><u>CVE-2026-31431</u></a>). Cloudflare’s Security and Engineering teams began assessing the vulnerability as soon as it was disclosed. We reviewed the exploit technique, evaluated exposure across our infrastructure, and validated that our existing behavioral detections could identify the exploit pattern within minutes. </p><p><b>There was no impact to the Cloudflare environment, no customer data was at risk, and no services were disrupted at any point.</b> Read on to learn how our preparedness paid off. </p>
    <div>
      <h2>Background</h2>
      <a href="#background">
        
      </a>
    </div>
    
    <div>
      <h4>Our Linux kernel release process</h4>
      <a href="#our-linux-kernel-release-process">
        
      </a>
    </div>
    <p>Cloudflare operates a global Linux server infrastructure at an immense scale, with datacenters located <a href="https://www.cloudflare.com/network/"><u>across 330 cities</u></a>. We maintain a custom Linux kernel build based on the community's Long-Term Support (LTS) versions to manage updates effectively at this volume. At any given time, we may utilize multiple LTS versions from various series, such as 6.12 or 6.18, which benefit from extended update periods.</p><p>The community regularly merges and releases security and stability updates which trigger an automated job to generate a new internal kernel build approximately every week. These builds undergo testing in our staging data centers to ensure stability before a global rollout. Following a successful release, the Edge Reboot Release (ERR) pipeline manages a systematic update and reboot of the edge infrastructure on a four-week cycle. Our control plane infrastructure typically adopts the most recent kernel, with reboots scheduled according to specific workload requirements.</p><p>By the time a CVE becomes public knowledge, the necessary fix has typically been integrated into stable Linux LTS releases for several weeks. Our established procedures ensure that we have already deployed these patches.</p><p>At the time of the "Copy Fail" disclosure, the majority of our infrastructure was running the 6.12 LTS version, while a subset of machines had begun transitioning to the newer 6.18 LTS release.</p>
    <div>
      <h3>About the Copy Fail vulnerability</h3>
      <a href="#about-the-copy-fail-vulnerability">
        
      </a>
    </div>
    <p>It helps to understand the vulnerability before getting to the response story. A comprehensive write-up can be found in the original <a href="https://xint.io/blog/copy-fail-linux-distributions"><u>Xint Code disclosure</u></a> post.</p>
    <div>
      <h4>AF_ALG and the kernel crypto API</h4>
      <a href="#af_alg-and-the-kernel-crypto-api">
        
      </a>
    </div>
    <p>The Linux kernel's internal crypto API backs in-kernel features such as kTLS and IPsec. Userspace programs access it via the <code>AF_ALG</code> socket family, which allows unprivileged processes to request encryption or decryption. The <code>algif_aead</code> module exposes this interface for Authenticated Encryption with Associated Data (AEAD) ciphers.</p><p>An unprivileged program follows these steps:</p><ol><li><p>Opens an <code>AF_ALG</code> socket and binds to an AEAD template.</p></li><li><p>Sets a key and accepts a request socket.</p></li><li><p>Submits input via <code>sendmsg()</code> or <code>splice()</code>.</p></li><li><p>Executes the operation using <code>recvmsg()</code>.</p></li></ol><p>The <code>splice()</code> system call is critical here, as it moves data by passing page cache references rather than copying bytes.</p>
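    <p>For readers who want to see what that flow looks like from userspace, here is a minimal sketch in Go using <code>golang.org/x/sys/unix</code>. It is purely illustrative (and uses the benign <code>gcm(aes)</code> AEAD rather than the <code>authencesn</code> template involved in this bug), not code from our fleet:</p>
            <pre><code>package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Step 1: open an AF_ALG socket and bind it to an AEAD algorithm.
	tfm, err := unix.Socket(unix.AF_ALG, unix.SOCK_SEQPACKET, 0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(tfm)
	if err := unix.Bind(tfm, &amp;unix.SockaddrALG{Type: "aead", Name: "gcm(aes)"}); err != nil {
		panic(err)
	}

	// Step 2: set a key and accept a request socket. None of this requires privileges.
	key := make([]byte, 16)
	if err := unix.SetsockoptString(tfm, unix.SOL_ALG, unix.ALG_SET_KEY, string(key)); err != nil {
		panic(err)
	}
	req, _, err := unix.Accept(tfm)
	if err != nil {
		panic(err)
	}
	defer unix.Close(req)

	// Steps 3 and 4 happen on req: input is submitted with sendmsg() (or spliced in),
	// and the operation runs when recvmsg() is called. The exploit abuses the splice()
	// path here to hand page cache references to the kernel's crypto scatterlist.
	fmt.Println("AF_ALG request socket ready, fd:", req)
}
</code></pre>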
    <div>
      <h4>Memory mechanics: page cache and in-place crypto</h4>
      <a href="#memory-mechanics-page-cache-and-in-place-crypto">
        
      </a>
    </div>
    <p>The <b>page cache</b> is a shared system cache for file contents. Modifying a page belonging to a setuid binary effectively edits that program for all users until the page is evicted.</p><p>The crypto API utilizes <b>scatterlists</b>, which are structures linking various memory pages. In 2017, <code>algif_aead</code> was optimized for <i>in-place</i> operations, chaining destination and reference pages together. This design lacked enforcement to prevent algorithms from writing past intended boundaries.</p>
    <div>
      <h4>The vulnerability: out-of-bounds write</h4>
      <a href="#the-vulnerability-out-of-bounds-write">
        
      </a>
    </div>
    <p>When the user executes <code>recvmsg()</code>, the <code>authencesn</code> wrapper in the kernel performs a 4-byte write past the legitimate output region:</p>
            <pre><code>scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1);
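/* buf = tmp + 1, sg = dst, start = assoclen + cryptlen, nbytes = 4, out = 1:
   copy 4 bytes of scratch data into the destination scatterlist just past the
   legitimate output (the final argument selects the "write to sg" direction) */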
</code></pre>
    <p>By using <code>splice()</code>, an attacker can chain a target file's page cache pages onto the scatterlist. The out-of-bounds write then taints the cached file, and the attacker controls which file is modified, the offset, and the specific 4 bytes written:</p><ul><li><p>File: Any readable file.</p></li><li><p>Offset: Tunable via <code>assoclen</code> and splice parameters.</p></li><li><p>Value: Controlled via AAD bytes 4–7 in <code>sendmsg()</code>.</p></li></ul>
    <div>
      <h4>The exploit, step by step</h4>
      <a href="#the-exploit-step-by-step">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5uqfOddH7biQaTjtgQOist/5c16c08bf3e5ce2f9030d6f98d2403cb/BLOG-3308_2.png" />
          </figure><p>The default exploit targets <code>/usr/bin/su</code>, a setuid-root binary present on essentially every distribution.</p><ol><li><p><b>Cache Reference:</b> Open <code>/usr/bin/su</code> as <code>O_RDONLY</code> and <code>read()</code> to populate the page cache. Use splice() on the file descriptor to pass these page cache references into the crypto scatterlist.</p></li><li><p><b>Setup:</b> Create an <code>AF_ALG</code> socket, <code>bind()</code> to <code>authencesn(hmac(sha256),cbc(aes))</code>, set a key, and accept a request socket without needing privileges.</p></li><li><p><b>Write Construction:</b> For each 4-byte shellcode chunk:</p><ul><li><p><code>sendmsg()</code> with AAD bytes 4–7 containing the shellcode.</p></li><li><p><code>splice()</code> the binary into a pipe then the <code>AF_ALG</code> socket so <code>assoclen + cryptlen</code> targets the desired <code>.text</code> offset.</p></li></ul></li><li><p><b>Trigger:</b> <code>recvmsg()</code> initiates decryption. <code>authencesn</code> writes its scratch data to the target offset of <code>/usr/bin/su</code> in the page cache. Although the function returns <code>-EBADMSG</code>, the 4-byte write is now in the global page cache.</p></li><li><p><b>Execution:</b> Running <code>execve("/usr/bin/su")</code> loads the tainted page cache. Since the binary is setuid-root, the injected shellcode executes with <b>root</b> privileges.</p></li></ol><p>The upstream fix (<a href="https://github.com/torvalds/linux/commit/a664bf3d603d"><u>commit a664bf3d603d</u></a>) reverts the 2017 in-place optimization, removing the exploit.</p>
    <div>
      <h3>How we responded </h3>
      <a href="#how-we-responded">
        
      </a>
    </div>
    <p>When the vulnerability was disclosed, many workstreams started in parallel:</p><ul><li><p><b>Mapping the blast radius:</b> Our security team worked with kernel engineers to determine which kernel versions were vulnerable and assess the potential exposure.</p></li><li><p><b>Validating coverage:</b> Security reviewed the exploit technique and confirmed that our existing behavioral detections could identify the exploit pattern during authorized internal validation.</p></li><li><p><b>Proactive threat hunting:</b> Security began searching for signs that the vulnerability had been exploited before it was publicly known, going back 48 hours in our fleet-wide logs.</p></li><li><p><b>Engineering a mitigation:</b> Kernel engineers began building a runtime mitigation that would protect the fleet without breaking production services.</p></li><li><p><b>Continuing software updates</b>: Our engineering teams worked on delivering an updated Linux kernel, which required carefully rebooting and rolling it out across our servers.</p></li></ul><p>There was no customer impact at any point during this response.</p>
    <div>
      <h4>Validating detection coverage</h4>
      <a href="#validating-detection-coverage">
        
      </a>
    </div>
    <p>One of the first things our security team did was confirm that our existing endpoint detection would catch this exploit. Our servers run behavioral detection that continuously monitors process execution patterns. It doesn't rely on knowing about specific vulnerabilities; it watches for anomalous behavior across the fleet.</p><p>When our engineers validated the vulnerability internally as part of the response, the detection platform flagged it within minutes. The system linked the entire execution chain—starting at the script interpreter, moving through the kernel’s cryptographic subsystem, and ending at the privilege escalation binary—flagging it as malicious based on fleet-wide behavioral patterns.</p><p>This happened without a signature update, without a rule change, and without human intervention. Our behavioral detection coverage existed before we wrote any custom logic for this particular Copy Fail exploit.</p><p>That confirmation mattered: it meant we had coverage in place before writing a vulnerability-specific rule.</p>
    <div>
      <h4>Hunting for exploitation</h4>
      <a href="#hunting-for-exploitation">
        
      </a>
    </div>
    <p>While our engineering team moved to a more targeted mitigation, our security investigation had been running since disclosure. This is our standard procedure for any critical vulnerability.</p><p>Our security team operates on a simple principle for critical vulnerabilities: assume compromise until you can prove otherwise. The investigation started from the assumption that exploitation could have occurred before the vulnerability was public, and we worked systematically to either confirm or rule it out.</p><p>The exploit leaves a distinctive trace in kernel logs when it runs. We searched for that trace across our centralized logging infrastructure, covering 48 hours before the vulnerability was publicly disclosed. If someone had exploited this before the world knew about it, we would have seen it.</p><p>We pulled access logs for affected systems and reconstructed who connected, when, and what commands they ran. This gave us a complete forensic picture of interactive activity on potentially affected infrastructure.</p><p>We checked that system binaries had not been tampered with, validated cryptographic hashes against known-good package manifests, looked for persistence mechanisms, and audited network connections for anything unusual. Everything was clean. </p>
    <div>
      <h2>Incident timeline and impact</h2>
      <a href="#incident-timeline-and-impact">
        
      </a>
    </div>
    
<div><table><colgroup>
<col></col>
<col></col>
</colgroup>
<thead>
  <tr>
    <th><span>Time (UTC)</span></th>
    <th><span>Event</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>2026-04-29 16:00</span></td>
    <td><span>Copy Fail publicly disclosed.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-29 ~21:00</span></td>
    <td><span>Security and Engineering teams began assessing fleet exposure and mitigation options before the full declaration of the Incident Response process.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-29 22:52</span></td>
    <td><span>Security confirmed existing behavioral detection covered the Copy Fail exploit pattern. During authorized internal validation, detection flagged the activity within minutes.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-29 23:01</span></td>
    <td><span>Existing behavioral detection generated a high-severity alert for exploit-like activity, confirming detection coverage for the technique.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-29 (evening)</span></td>
    <td><span>First mitigation attempt pushed to our staging datacenter. The deployment process surfaced a dependency conflict; the mitigation was rolled back. No production systems were affected.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-29 (overnight)</span></td>
    <td><span>Engineering drafted the bpf-lsm mitigation program.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-30 03:14</span></td>
    <td><span>Security incident declared to drive cross-functional collaboration and urgency. Security performed fleetwide threat hunting of historical data to confirm that no malicious activity was present on Cloudflare systems.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-30 (morning)</span></td>
    <td><span>Engineering tested the bpf-lsm mitigation program and made it production-ready.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-30 14:25</span></td>
    <td><span>Engineering incident declared to coordinate mitigation program and Linux patch rollout. </span></td>
  </tr>
  <tr>
    <td><span>2026-04-30 ~17:00</span></td>
    <td><span>Decision made: ship a patched build of the previous LTS line through reboot automation; do not accelerate the new LTS; lean on bpf-lsm in the meantime.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-30 (afternoon)</span></td>
    <td><span>Visibility pipeline (eBPF tracing of AF_ALG socket usage) deployed fleet-wide. Gives a complete picture of all legitimate AF_ALG users.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-30 (evening)</span></td>
    <td><span>bpf-lsm mitigation program rolled out behind a separate gate to fully mitigate the fleet. End-to-end verification on a previously-vulnerable test node confirms the exploit no longer works.</span></td>
  </tr>
  <tr>
    <td><span>2026-05-04 (morning)</span></td>
    <td><span>Reboot automation resumed at normal pace with the patched kernel.</span></td>
  </tr>
  <tr>
    <td><span>2026-05-04 onward</span></td>
    <td><span>Servers that had already passed through reboot automation earlier in the week were manually rebooted to pick up the patched kernel. Unpatched servers update through our normal reboot automation.</span></td>
  </tr>
</tbody></table></div>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7LI0k0FJgbLKzOkSEtYPwW/997234a334b3694c63417fad5810b679/BLOG-3308_3.png" />
          </figure><p>This graph shows how our mitigation program progressed through our infrastructure.</p>
    <div>
      <h3>How did we mitigate it?</h3>
      <a href="#how-did-we-mitigate-it">
        
      </a>
    </div>
    <p>Because of the long timeframe involved in deploying a patched Linux kernel, we also pursued mitigating this exploit without a reboot.</p>
    <div>
      <h4>Removing the module</h4>
      <a href="#removing-the-module">
        
      </a>
    </div>
    <p>The bug was in the <code>algif_aead</code> kernel module, so the simplest fix was to remove the module and prevent it from being reloaded.</p><p>This is exactly the mitigation that the <a href="https://copy.fail/"><u>Copy Fail</u></a> write-up from the security researchers who identified the vulnerability recommends:</p>
            <pre><code>echo "install algif_aead /bin/false" &gt; /etc/modprobe.d/disable-algif.conf
rmmod algif_aead 2&gt;/dev/null || true</code></pre>
            <p>Unfortunately, removing the module would have impacted software that relies on the kernel crypto API, so we had to find a more surgical mitigation.</p>
    <div>
      <h4>Bpf-lsm</h4>
      <a href="#bpf-lsm">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2VYFKst8aUkkHCaEdb8yQ4/91ad9a855d6185bccd179fcf092ed636/BLOG-3308_4.png" />
          </figure><p>We’ve already developed and deployed such a tool for this exact scenario: <a href="https://blog.cloudflare.com/live-patch-security-vulnerabilities-with-ebpf-lsm/"><u>bpf-lsm</u></a>. Instead of removing the module, this tool leaves it loaded for legitimate users and uses a BPF Linux Security Module program to deny the <code>socket_bind</code> LSM hook for everyone else. This completely blocks the front door for any exploits.</p><p>A draft of the eBPF program was put together overnight. Team members picked it up the following morning, ran validations, and made it production-ready. The program is fairly straightforward. On every <code>socket_bind</code> call:</p><ol><li><p>If the socket family is not <code>AF_ALG</code>, allow the call through unchanged.</p></li><li><p>If the family is <code>AF_ALG</code>, check the calling binary's path against an allow-list of the binaries we know to be legitimate users.</p></li><li><p>If the binary is on the allow-list, allow the bind. Otherwise, deny it.</p></li></ol><p>To verify the mitigation on a given machine without exploiting it, the <a href="https://copy.fail/"><u>Copy Fail</u></a> write-up gives a one-liner:</p>
            <pre><code>python3 -c 'import socket; s = socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0); s.bind(("aead","authencesn(hmac(sha256),cbc(aes))"));'</code></pre>
            <p>On a mitigated machine you get <code>PermissionError: [Errno 1] Operation not permitted</code> (or <code>FileNotFoundError</code>, depending on which mitigation is active) instead of a successful bind.</p>
    <div>
      <h4>Rolling it out</h4>
      <a href="#rolling-it-out">
        
      </a>
    </div>
    <p><b>To avoid accidental outages, we verified before enabling enforcement that our known internal service was the sole legitimate </b><code><b>AF_ALG</b></code><b> user.</b> We used <a href="https://github.com/cloudflare/ebpf_exporter"><code>prometheus-ebpf-exporter</code></a> to hook the <code>socket()</code> syscall and track <code>AF_ALG</code> usage per binary across the fleet. This required no kernel changes and provided aggregate data from hundreds of thousands of servers within hours. Results confirmed the identified service was indeed the only legitimate user.</p><p>The bpf-lsm rollout was deliberately staged in two steps:</p><ol><li><p><b>Get visibility first.</b> Push the ebpf-exporter config gated by Salt. Confirm at the metric layer that the known service is effectively the only thing creating <code>AF_ALG</code> sockets.</p></li><li><p><b>Then enforce.</b> Push the bpf-lsm program behind a separate enforcement gate.</p></li></ol><p>In parallel, the upstream backport for our majority LTS line finally became available, and our internal automation built a patched kernel against it.</p><p>We started testing the patched kernel in our staging datacenters as soon as possible, and then resumed the longer reboot process to fully patch our fleet.</p>
    <div>
      <h2>Remediation and follow-up steps</h2>
      <a href="#remediation-and-follow-up-steps">
        
      </a>
    </div>
    <p>While we were prepared for this scenario, at Cloudflare we’re always learning and improving. Key areas we identified for improvement:</p><ul><li><p><b>Better visibility into kernel-API dependencies.</b> We will review kernel-subsystem usage across production services, so we can continue to quickly mitigate exploits without service disruption.</p></li><li><p><b>Better runtime mitigation.</b> bpf-lsm is a valuable mitigation tool, but we want to make it even better, including faster deployments, better playbooks, and better logging and visibility.</p></li><li><p><b>Reduce the attack surface of the Linux kernel.</b> Review and audit our kernel configuration. Proactively identify unused modules or features so that we can remove them from our build entirely.</p></li></ul>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The "Copy Fail" vulnerability presented a unique challenge for us. Despite our practice of deploying Linux patch updates every two weeks, we remained vulnerable because a month-old mainline fix had yet to be backported to our primary kernel line. Despite that, we were still able to roll out patched kernels within hours of the backport's release. In the interim, bpf-lsm provided a surgical, no-reboot mitigation that secured our fleet. While our initial attempt to disable the problematic module failed, it did so safely within our internal staging environment rather than production, allowing us to identify this dependency.</p><p>By the end of the rollout, every machine in our fleet was protected by either a patched kernel or a bpf-lsm program denying the vulnerable code path to non-allow-listed binaries. There was no customer impact at any point during this incident, and we have committed to the follow-up work above to make our response faster and our visibility better the next time something like this lands. Responsible disclosure works, in-kernel visibility tooling pays off in moments exactly like this one, and bpf-lsm continues to be one of the most useful primitives we have for runtime kernel mitigation.</p><p>At Cloudflare, critical vulnerability response is a coordinated effort across Security, Engineering, Product, and many other teams. Special thanks to Ali Adnan, Ivan Babrou, Frederik Baetens, Curtis Bray, Piers Cornwell, Everton Didone Foscarini, Rob Dinh, Elle Dougherty, Kevin Flansburg, Matt Fleming, Kimberley Hall, Brandon Harris, Jerry Ho, Oxana Kharitonova, Marek Kroemeke, Fred Lawler, James Munson, Nafeez Nazer, Walead Parviz, Miguel Pato, Evan Pratten, Josh Seba, June Slater, Ryan Timken, Michael Wolf, Jianxin Zeng and everyone else who contributed to the investigation, mitigation, and remediation of Copy Fail. We'd also like to thank the Linux upstream maintainers and Copy Fail researchers whose work helped make a rapid response possible.</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Incident Response]]></category>
            <category><![CDATA[Kernel]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Mitigation]]></category>
            <category><![CDATA[eBPF]]></category>
            <guid isPermaLink="false">7JN0oOT8V9YgCD6JFW92my</guid>
            <dc:creator>Chris J Arges</dc:creator>
            <dc:creator>Sourov Zaman</dc:creator>
            <dc:creator>Rian Islam</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare outage on February 20, 2026]]></title>
            <link>https://blog.cloudflare.com/cloudflare-outage-february-20-2026/</link>
            <pubDate>Sat, 21 Feb 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare suffered a service outage on February 20, 2026. A subset of customers who use Cloudflare’s Bring Your Own IP (BYOIP) service saw their routes to the Internet withdrawn via Border Gateway Protocol (BGP). ]]></description>
            <content:encoded><![CDATA[ <p>On February 20, 2026, at 17:48 UTC, Cloudflare experienced a service outage when a subset of customers who use Cloudflare’s Bring Your Own IP (BYOIP) service saw their routes to the Internet withdrawn via Border Gateway Protocol (BGP).</p><p>The issue was not caused, directly or indirectly, by a cyberattack or malicious activity of any kind. This issue was caused by a change that Cloudflare made to how our network manages IP addresses onboarded through the BYOIP pipeline. This change caused Cloudflare to unintentionally withdraw customer prefixes.</p><p>For some BYOIP customers, this resulted in their services and applications being unreachable from the Internet, causing timeouts and failures to connect across their Cloudflare deployments that used BYOIP. The website for Cloudflare’s recursive DNS resolver (1.1.1.1) saw 403 errors as well. The total duration of the incident was 6 hours and 7 minutes, with most of that time spent restoring prefix configurations to their state prior to the change.</p><p>When we began to observe failures, Cloudflare engineers reverted the change and prefixes stopped being withdrawn. However, before engineers were able to revert it, ~1,100 BYOIP prefixes had already been withdrawn from the Cloudflare network. Some customers were able to restore their own service by using the Cloudflare dashboard to re-advertise their IP addresses. We resolved the incident when we restored all prefix configurations.</p><p>We are sorry for the impact to our customers. We let you down today. This post is an in-depth recounting of exactly what happened and which systems and processes failed. We will also outline the steps we are taking to prevent outages like this from happening again.</p>
    <div>
      <h2>How did the outage impact customers?</h2>
      <a href="#how-did-the-outage-impact-customers">
        
      </a>
    </div>
    <p>This graph shows the number of prefixes advertised by Cloudflare to a BGP neighbor during the incident, which correlates with impact, as prefixes that weren’t advertised were unreachable on the Internet:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5QnazHN20Gcf3vLH5r95Cd/c8f42e90f266dd3daeaa308945507024/BLOG-3193_2.png" />
          </figure><p>Out of the 6,500 total prefixes advertised to this peer, 4,306 were BYOIP prefixes. These BYOIP prefixes are advertised to every peer and represent all the BYOIP prefixes we advertise globally.</p><p>During the incident, 1,100 prefixes out of the total 6,500 were withdrawn from 17:56 to 18:46 UTC. Out of the 4,306 total BYOIP prefixes, 25% were unintentionally withdrawn. We were able to detect impact on one.one.one.one and revert the change before more prefixes were affected. At 19:19 UTC, we published guidance to customers that they would be able to self-remediate this incident by going to the Cloudflare dashboard and re-advertising their prefixes.</p><p>Cloudflare was able to revert many of the advertisement changes around 20:20 UTC, which caused 800 prefixes to be restored. There were still ~300 prefixes that could not be remediated through the dashboard because the service configurations for those prefixes were removed from the edge due to a software bug. These prefixes were manually restored by Cloudflare engineers at 23:03 UTC.</p><p>This incident did not impact all BYOIP customers because the configuration change was applied iteratively and not instantaneously across all BYOIP customers. Once the configuration change was revealed to be causing impact, the change was reverted before all customers were affected.</p><p>The impacted BYOIP customers first experienced a behavior called <a href="https://blog.cloudflare.com/going-bgp-zombie-hunting/"><u>BGP Path Hunting</u></a>. In this state, end user connections traverse networks trying to find a route to the destination IP. This behavior persists until the connection that was opened times out and fails. Until the prefix is advertised somewhere, customers will continue to see this failure mode. This loop-until-failure scenario affected any product that uses BYOIP for advertisement to the Internet. Additionally, visitors to one.one.one.one, the website for Cloudflare’s recursive DNS resolver, were met with HTTP 403 errors and an “Edge IP Restricted” error message. DNS resolution over the 1.1.1.1 Public Resolver, including DNS over HTTPS, was not affected. A full breakdown of the services impacted is below.</p>
<div><table><colgroup>
<col></col>
<col></col>
</colgroup>
<thead>
  <tr>
    <th><span>Service/Product</span></th>
    <th><span>Impact Description</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Core CDN and Security Services</span></td>
    <td><span>Traffic was not attracted to Cloudflare, and users connecting to websites advertised on those ranges would have seen failures to connect</span></td>
  </tr>
  <tr>
    <td><span>Spectrum</span></td>
    <td><span>Spectrum apps on BYOIP failed to proxy traffic due to traffic not being attracted to Cloudflare</span></td>
  </tr>
  <tr>
    <td><span>Dedicated Egress</span></td>
    <td><span>Customers who used Gateway Dedicated Egress leveraging BYOIP or Dedicated IPs for CDN Egress leveraging BYOIP would not have been able to send traffic out to their destinations</span></td>
  </tr>
  <tr>
    <td><span>Magic Transit</span></td>
    <td><span>Applications protected by Magic Transit would not have been advertised on the Internet, and end users connecting to them would have seen connection timeouts and failures</span></td>
  </tr>
</tbody></table></div><p>There was also a set of customers who were unable to restore service by toggling the prefixes on the Cloudflare dashboard. As engineers began reannouncing prefixes to restore service, these customers may have seen increased latency and failures despite their IP addresses being advertised. This was because the addressing settings for some users were removed from edge servers due to an issue in our own software, and the state had to be propagated back to the edge.</p><p>We’re going to get into what exactly broke in our addressing system, but to do that we need to cover a quick primer on the Addressing API, which is the underlying source of truth for customer IP addresses at Cloudflare.</p>
    <div>
      <h2>Cloudflare’s Addressing API</h2>
      <a href="#cloudflares-addressing-api">
        
      </a>
    </div>
    <p>The Addressing API is an authoritative dataset of the addresses present on the Cloudflare network. Any change to that dataset is immediately reflected in Cloudflare's global network. While we are in the process of improving how these systems roll out changes as a part of <a href="https://blog.cloudflare.com/fail-small-resilience-plan/"><u>Code Orange: Fail Small</u></a>, today customers can configure their IP addresses by interacting with public-facing APIs which configure a set of databases that trigger operational workflows propagating the changes to Cloudflare’s edge. This means that changes to the Addressing API are immediately propagated to the Cloudflare edge.</p><p>Advertising and configuring IP addresses on Cloudflare involves several steps:</p><ul><li><p>Customers signal to Cloudflare about advertisement/withdrawal of IP addresses via the Addressing API or BGP Control</p></li><li><p>The Addressing API instructs the machines to change the prefix advertisements</p></li><li><p>BGP will be updated on the routers once enough machines have received the notification to update the prefix</p></li><li><p>Finally, customers can configure Cloudflare products to use BYOIP addresses via <a href="https://developers.cloudflare.com/byoip/service-bindings/"><u>service bindings</u></a> which will assign products to these ranges</p></li></ul><p>The Addressing API allows us to automate most of the processes surrounding how we advertise or withdraw addresses, but some processes still require manual actions. These manual processes are risky because of their close proximity to Production. As a part of <a href="https://blog.cloudflare.com/fail-small-resilience-plan/"><u>Code Orange: Fail Small</u></a>, one of the goals of remediation was to remove manual actions taken in the Addressing API and replace them with safe workflows.</p>
    <div>
      <h2>How did the incident occur?</h2>
      <a href="#how-did-the-incident-occur">
        
      </a>
    </div>
    <p>The specific piece of configuration that broke was a modification attempting to automate the customer action of removing prefixes from Cloudflare’s BYOIP service, a regular customer request that is done manually today. Removing this manual process was part of our Code Orange: Fail Small work to push all changes toward safe, automated, health-mediated deployment. Since the list of objects related to a BYOIP prefix can be large, this was implemented as part of a regularly running sub-task that checks for BYOIP prefixes that should be removed, and then removes them. Unfortunately, the query this cleanup sub-task made to the API contained a bug.</p><p>Here is the API query from the cleanup sub-task:</p>
            <pre><code> resp, err := d.doRequest(ctx, http.MethodGet, `/v1/prefixes?pending_delete`, nil)
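// note: pending_delete is sent with no value ("?pending_delete" rather than
// "?pending_delete=true"), which is what trips up the server-side check below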
</code></pre>
            <p>And here is the relevant part of the API implementation:</p>
            <pre><code>	if v := req.URL.Query().Get("pending_delete"); v != "" {
		// ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
		prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
		if err != nil {
			api.RenderError(ctx, w, ErrInternalError)
			return
		}

		api.Render(ctx, w, http.StatusOK, renderIPPrefixAPIResponse(prefixes, nil))
		return
	}
</code></pre>
            <p>Because the client passes <code>pending_delete</code> with no value, <code>Query().Get("pending_delete")</code> returns an empty string (<code>""</code>), the <code>v != ""</code> check fails, and the API server falls through to its default behavior: returning all BYOIP prefixes instead of just those queued for removal. The sub-task treated every returned prefix as pending deletion and began systematically deleting all BYOIP prefixes and their dependent objects, including <a href="https://developers.cloudflare.com/byoip/service-bindings/"><u>service bindings</u></a>, until the impact was noticed and an engineer identified the sub-task and shut it down.</p>
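            <p>As a minimal illustration of the underlying Go behavior (our own sketch, not the Addressing API code): <code>url.Values.Get</code> returns an empty string both when a parameter is absent and when it is present without a value, while <code>url.Values.Has</code> (available since Go 1.17) distinguishes the two.</p>
            <pre><code>package main

import (
	"fmt"
	"net/url"
)

func main() {
	u, _ := url.Parse("/v1/prefixes?pending_delete")
	q := u.Query()

	// Get returns "" whether the key is missing or merely has no value, so a
	// guard like `v != ""` treats "?pending_delete" the same as no filter at all.
	fmt.Printf("Get: %q\n", q.Get("pending_delete")) // prints: Get: ""

	// Has reports whether the key was supplied at all, regardless of its value.
	fmt.Println("Has:", q.Has("pending_delete")) // prints: Has: true
}
</code></pre>
            <p>A stricter check such as <code>Has</code>, or requiring an explicit <code>pending_delete=true</code>, would have distinguished "filter to prefixes pending deletion" from "no filter at all".</p>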
    <div>
      <h3>Why did Cloudflare not catch the bug in our staging environment or testing?</h3>
      <a href="#why-did-cloudflare-not-catch-the-bug-in-our-staging-environment-or-testing">
        
      </a>
    </div>
    <p>Our staging environment contains data that matches Production as closely as possible, but it was not sufficient in this case, and the mock data we relied on to simulate what would occur did not capture this scenario.</p><p>In addition, while we have tests for this functionality, coverage for this scenario in our testing process and environment was incomplete. Initial testing and code review focused on the BYOIP self-service API journey and were completed successfully. While our engineers successfully tested the exact process a customer would have followed, testing did not cover a scenario where the task-runner service would independently execute changes to user data without explicit input.</p>
    <div>
      <h3>Why was recovery not immediate?</h3>
      <a href="#why-was-recovery-not-immediate">
        
      </a>
    </div>
    <p>Affected BYOIP prefixes were not all impacted in the same way, necessitating more intensive data recovery steps. As a part of Code Orange: Fail Small, we are building a system where operational state snapshots can be safely rolled out through health-mediated deployments. In the event something does roll out that causes unexpected behavior, it can be very quickly rolled back to a known-good state. However, that system is not in Production today.</p><p>BYOIP prefixes were in different states of impact during this incident, and each of these different states required different actions:</p><ul><li><p>Most impacted customers only had their prefixes withdrawn. Customers in this configuration could go into the dashboard and toggle their advertisements, which would restore service. </p></li><li><p>Some customers had their prefixes withdrawn and some bindings removed. These customers were in a partial state of recovery where they could toggle some prefixes but not others.</p></li><li><p>Some customers had their prefixes withdrawn and all service bindings removed. They could not toggle their prefixes in the dashboard because there was no <a href="https://developers.cloudflare.com/byoip/service-bindings/"><u>service</u></a> (Magic Transit, Spectrum, CDN) bound to them. These customers took the longest to mitigate, as a global configuration update had to be initiated to reapply the service bindings for all these customers to every single machine on Cloudflare’s edge.</p></li></ul>
    <div>
      <h3>How does this incident relate to Code Orange: Fail Small?</h3>
      <a href="#how-does-this-incident-relate-to-code-orange-fail-small">
        
      </a>
    </div>
    <p>The change we were making when this incident occurred is part of the Code Orange: Fail Small initiative, which is aimed at improving the resiliency of code and configuration at Cloudflare. As a brief primer of the <a href="https://blog.cloudflare.com/fail-small-resilience-plan/"><u>Code Orange: Fail Small</u></a> initiatives, the work can be divided into three buckets:</p><ul><li><p>Require controlled rollouts for any configuration change that is propagated to the network, just like we do today for software binary releases.</p></li><li><p>Change our internal “break glass” procedures and remove any circular dependencies so that we, and our customers, can act fast and access all systems without issue during an incident.</p></li><li><p>Review, improve, and test failure modes of all systems handling network traffic to ensure they exhibit well-defined behavior under all conditions, including unexpected error states.</p></li></ul><p>The change that we attempted to deploy falls under the first bucket. By moving risky, manual changes to safe, automated configuration updates that are deployed in a health-mediated manner, we aim to improve the reliability of the service.</p><p>Critical work was already ongoing to enhance the Addressing API's configuration change support through staged test mediation and better correctness checks. This work was ongoing in parallel with the deployed change. Although preventative measures weren't fully deployed before the outage, teams were actively working on these systems when the incident occurred. Following our Code Orange: Fail Small promise to require controlled rollouts of any change into Production, our engineering teams have been reaching deep into all layers of our stack to identify and fix all problematic findings. While this outage wasn't itself global, the blast radius and impact were unacceptably large, further reinforcing Code Orange: Fail Small as a priority until we have re-established confidence in all changes to our network being as gradual as possible. Now let’s talk more specifically about improvements to these systems.</p>
    <div>
      <h2>Remediation and follow-up steps</h2>
      <a href="#remediation-and-follow-up-steps">
        
      </a>
    </div>
    
    <div>
      <h3>API schema standardization</h3>
      <a href="#api-schema-standardization">
        
      </a>
    </div>
    <p>One of the issues in this incident is that the <code>pending_delete</code> flag was interpreted as a bare string, making it difficult for both client and server to reason about its value. We will improve the API schema to ensure better standardization, which will make it much easier for tests and systems to validate whether an API call is properly formed. This work is part of the third Code Orange workstream, which aims to create well-defined behavior under all conditions.</p>
    <div>
      <h3>Better separation between operational and configured state</h3>
      <a href="#better-separation-between-operational-and-configured-state">
        
      </a>
    </div>
    <p>Today, customers make changes to the addressing schema that are persisted in an authoritative database, and that database is the same one used for operational actions. This makes manual rollback processes more challenging because engineers need to utilize database snapshots instead of reconciling desired and actual states. We will redesign the rollback mechanism and database configuration to ensure that we have an easy way to roll back changes quickly and also to introduce layers between customer configuration and Production.</p><p>We will snapshot the data that we read from the database and are applying to Production, and apply those snapshots in the same way that we deploy all our other Production changes, mediated by health metrics that can automatically stop the deployment if things are going wrong. This means that the next time we have a problem where the database gets changed into a bad state, we can near-instantly revert individual customers (or all customers) to a version that was working.</p><p>While this will temporarily block our customers from being able to make direct updates via our API in the event of an outage, it will mean that we can continue serving their traffic while we work to fix the database, instead of being down for that time. This work aligns with the first and second Code Orange workstreams, which involve fast rollback and also safe, health-mediated deployment of configuration.</p>
    <div>
      <h3>Better arbitrate large withdrawal actions</h3>
      <a href="#better-arbitrate-large-withdrawal-actions">
        
      </a>
    </div>
    <p>We will improve our monitoring to detect when changes are happening too fast or too broadly, such as withdrawing or deleting BGP prefixes quickly, and disable the deployment of snapshots when this happens. This will form a type of circuit breaker to stop any out-of-control process that is manipulating the database from having a large blast radius, like we saw in this incident.</p><p>We also have some ongoing work to directly monitor that the services run by our customers are behaving correctly, and those signals can also be used to trip the circuit breaker and stop potentially dangerous changes from being applied until we have had time to investigate. This work aligns with the first Code Orange workstream, which involves safe deployment of changes.</p><p>Below is the timeline of events inclusive of deployment of the change and remediation steps: </p>
<div><table><colgroup>
<col></col>
<col></col>
<col></col>
</colgroup>
<thead>
  <tr>
    <th><span>Time (UT</span><span>C</span><span>)</span></th>
    <th><span>Status</span></th>
    <th><span>Description</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>2026-02-05 21:53</span></td>
    <td><span>Code merged into system</span></td>
    <td><span>Broken sub-process merged into code base</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 17:46</span></td>
    <td><span>Code deployed into system</span></td>
    <td><span>Address API release with broken sub-process completes</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 17:56</span></td>
    <td><span>Impact Start</span></td>
    <td><span>Broken sub-process begins executing. Prefix advertisement updates begin propagating and prefixes begin to be withdrawn </span><span>– IMPACT STARTS – </span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 18:13</span></td>
    <td><span>Cloudflare engaged</span></td>
    <td><span>Cloudflare engaged for failures on one.one.one.one</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 18:18</span></td>
    <td><span>Internal incident declared</span></td>
    <td><span>Cloudflare engineers continue investigating impact</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 18:21</span></td>
    <td><span>Addressing API team paged</span></td>
    <td><span>Engineering team responsible for Addressing API engaged and debugging begins</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 18:46</span></td>
    <td><span>Issue identified</span></td>
    <td><span>Broken sub-process terminated by an engineer and regular execution disabled; remediation begins</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 19:11</span></td>
    <td><span>Mitigation begins</span></td>
    <td><span>Cloudflare Engineers begin to restore serviceability for prefixes that were withdrawn while others focused on prefixes that were removed</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 19:19</span></td>
    <td><span>Some prefixes mitigated</span></td>
    <td><span>Customers begin to re-advertise their prefixes via the dashboard to restore service. </span><span>– IMPACT DOWNGRADE –</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 19:44</span></td>
    <td><span>Additional mitigation continues</span></td>
    <td><span>Engineers begin database recovery methods for removed prefixes</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 20:30</span></td>
    <td><span>Final mitigation process begins</span></td>
    <td><span>Engineers complete release to restore withdrawn prefixes that still have existing service bindings. Others are still working on removed prefixes </span><span>– IMPACT DOWNGRADE –</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 21:08</span></td>
    <td><span>Configuration update deploys</span></td>
    <td><span>Engineering begins global machine configuration rollout to restore prefixes that were not self-mitigated or mitigated via previous efforts </span><span>– IMPACT DOWNGRADE –</span></td>
  </tr>
  <tr>
    <td><span>2026-02-20 23:03</span></td>
    <td><span>Configuration update completed</span></td>
    <td><span>Global machine configuration deployment to restore remaining prefixes is completed. </span><span>– IMPACT ENDS –</span></td>
  </tr>
</tbody></table></div><p>We deeply apologize for this incident today and how it affected the service we provide our customers, and also the Internet at large. We aim to provide a network that is resilient to change, and we did not deliver on our promise to you. We are actively making these improvements to ensure improved stability moving forward and to prevent this problem from happening again.</p> ]]></content:encoded>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[Incident Response]]></category>
            <category><![CDATA[Outage]]></category>
            <guid isPermaLink="false">6apSdbZfHEgeIzBwCqn5ob</guid>
            <dc:creator>David Tuber</dc:creator>
            <dc:creator>Dzevad Trumic</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing REACT: Why We Built an Elite Incident Response Team]]></title>
            <link>https://blog.cloudflare.com/introducing-react-why-we-built-an-elite-incident-response-team/</link>
            <pubDate>Thu, 09 Oct 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ We're launching Cloudforce One REACT, a team of expert security responders designed to eliminate the gap between perimeter defense and internal incident response. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Cloudforce One’s mission is to help defend the Internet. In Q2’25 alone, Cloudflare stopped an average of 190 billion cyber threats every single day. But real-world customer experiences showed us that stopping attacks at the edge isn’t always enough. We saw ransomware disrupt financial operations, data breaches cripple real estate firms, and misconfigurations cause major data losses.</p><p>In each case, the real damage occurred <i>inside</i> networks.</p><p>These internal breaches uncovered another problem: customers had to hand off incidents to separate internal teams for investigation and remediation. Those handoffs created delays and fractured the response. The result was a gap that attackers could exploit. Critical context collected at the edge didn’t reach the teams managing cleanup, and valuable time was lost. Closing this gap has become essential, and we recognized the need to take responsibility for providing customers with a more unified defense.</p><p>Today, <a href="https://www.cloudflare.com/threat-intelligence/"><u>Cloudforce One</u></a> is launching a new suite of <a href="http://cloudflare.com/cloudforce-one/services/incident-response"><u>incident response and security services</u></a> to help organizations prepare for and respond to breaches.</p><p>These services are delivered by <b>Cloudforce One REACT (Respond, Evaluate, Assess, Consult Team)</b>, a group of seasoned responders and security veterans who investigate threats, hunt adversaries, and work closely with executive leadership to guide response and decision-making.

Customers already trust Cloudforce One to provide industry-leading <a href="https://www.cloudflare.com/cloudforce-one/research/"><u>threat intelligence</u></a>, proactively identifying and <a href="https://www.cloudflare.com/threat-intelligence/research/report/cloudflare-participates-in-global-operation-to-disrupt-raccoono365/"><u>neutralizing</u></a> the most sophisticated threats. REACT extends that partnership, bringing our expertise directly to customer environments to stop threats wherever they occur. In this post, we’ll introduce REACT, explain how it works, detail the top threats our team has observed, and show you how to engage our experts directly for support.</p><p>Our goal is simple: to provide an end-to-end<b> security partnership</b>. We want to eliminate the painful gap between defense and recovery. Now, customers can get everything from proactive preparation to decisive incident response and full recovery—all from the partner you already trust to protect your infrastructure.</p><p>It’s time to move beyond fragmented responses and into one unified, powerful defense.</p>
    <div>
      <h2>How REACT works</h2>
      <a href="#how-react-works">
        
      </a>
    </div>
    <p>REACT services consist of two main components: security advisory services to prepare for incidents, and incident response for emergency situations.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5NvO487oZA6GrphFGNORGt/a49489f86f7a556dd9fcbffdf42a8b33/image5.png" />
          </figure><p><sup><i>A breakdown of the Cloudforce One incident readiness and response service offerings.</i></sup></p><p>Advisory services are designed to assess and improve an organization's security posture and readiness. These include proactive threat hunting, backed by Cloudflare’s real-time global threat intelligence, to find existing compromises, tabletop exercises to test response plans against simulated attacks, and both incident readiness and maturity assessments to identify and address systemic weaknesses.</p><p>The Incident Response component is initiated during an active security crisis. The team specializes in handling a range of complex threats, including APT and nation-state activity, ransomware, insider threats, and business email compromise. The response is also informed by Cloudflare's threat intelligence and, as a network-native service, allows responders to deploy mitigation measures directly at the Cloudflare edge for faster containment.</p><p>For organizations requiring guaranteed availability, incident response retainers are offered. These retainers provide priority response, the development of tailored playbooks, and ongoing advisory support.</p><p>Cloudflare’s REACT services are vendor-agnostic in their scope. We are making REACT available to both existing Cloudflare customers and non-customers, regardless of their current technology stack, and regardless of whether their environment is on-premise, public cloud, or hybrid.</p>
    <div>
      <h2>What makes Cloudflare's approach different?</h2>
      <a href="#what-makes-cloudflares-approach-different">
        
      </a>
    </div>
    <p>Our new service provides significant advantages over traditional incident response, where engagement and data sharing occur over separate, out-of-band channels. The integration of the service into the platform enables a more efficient and effective response to threats.</p><p>The core differentiators of this approach are:</p><ul><li><p><b>Unmatched threat visibility. </b>With roughly 20% of the web sitting behind Cloudflare's network, Cloudforce One has unique visibility into emerging attacks as they unfold globally. This lets REACT accelerate their investigations and quickly correlate incident details with emerging attack vectors and known adversary tactics.</p></li><li><p><b>Network-native mitigation.</b> The service is designed for network-native response. This allows the team, with customer authorization, to deploy mitigations directly at the Cloudflare edge, such as a <a href="https://developers.cloudflare.com/waf/custom-rules/"><u>WAF rule</u></a> or <a href="https://developers.cloudflare.com/cloudflare-one/policies/gateway/"><u>Secure Web Gateway policy</u></a>. This capability reduces the time between threat identification and containment. All response actions are tracked within the dashboard for full visibility.</p></li><li><p><b>Service delivery by proven experts.</b> Cloudforce One is composed of seasoned threat researchers, consultants, and incident responders. The team has a documented history of managing complex security incidents, including nation-state activity and sophisticated financial fraud.</p></li><li><p><b>Vendor-agnostic scope.</b> While managed through the Cloudflare dashboard, the scope of the response is vendor-agnostic. The team is equipped to conduct investigations and coordinate remediation across diverse customer environments, including on-premise, public cloud, and hybrid infrastructures.</p></li></ul>
    <div>
      <h2>Key Threats Seen During Engagements So Far</h2>
      <a href="#key-threats-seen-during-engagements-so-far">
        
      </a>
    </div>
    <p>Analysis of security engagements by the REACT team over the last six months reveals three prevalent and high-impact trends. The data indicates that automated defenses, while critical, must be supplemented by specialized incident response capabilities to effectively counter these specific threats.</p>
    <div>
      <h4><b>High-impact insider threats </b></h4>
      <a href="#high-impact-insider-threats">
        
      </a>
    </div>
    <p>The REACT team has seen a significant number of incidents driven by insiders who use trusted access to bypass typical security controls. These threats are difficult to detect as they often combine technical actions with non-technical motivations. Recently observed scenarios include:</p><ul><li><p>Disgruntled or current employees using their specialized, trusted access to execute targeted, destructive attacks.</p></li><li><p>Financially motivated insiders who are compensated by external actors to exfiltrate data or compromise internal systems.</p></li><li><p>State-sponsored operatives gaining trusted, privileged access via fraudulent remote work roles to exfiltrate data, conduct espionage, and steal funds for illicit regime financing.</p></li></ul>
    <div>
      <h4><b>Ransomware</b></h4>
      <a href="#ransomware">
        
      </a>
    </div>
    <p>The REACT team has observed that ransomware continues to be a primary driver of high-severity incidents, posing an existential threat to nearly every sector. Common themes observed include:</p><ul><li><p>Disruption of core operations in the financial sector via hostage-taking of critical systems. </p></li><li><p>Paralysis of business functions and compromise of client data in the real estate industry, leading to significant downtime and regulatory scrutiny.</p></li><li><p>Broad impact across all industry verticals. </p></li></ul><p>Stopping these attacks demands not only robust defenses but also a well-rehearsed recovery plan that cuts time-to-restoration to hours, not weeks.</p>
    <div>
      <h4><b>Application security and supply chain breaches</b></h4>
      <a href="#application-security-and-supply-chain-breaches">
        
      </a>
    </div>
    <p>The REACT team has also seen a significant increase in incidents originating at the application layer. These threats typically manifest in two primary areas: vulnerabilities within an organization’s own custom-developed (‘vibe coded’) applications, and security failures originating from their third-party supply chain:</p><ul><li><p>Vibe coding: The practice of providing natural language prompts to AI models to generate code can produce critical vulnerabilities that can be exploited by threat actors using techniques like remote code execution (RCE), memory corruption, and SQL injection.</p></li><li><p>SaaS supply chain risk: A compromise at a critical third-party vendor that exposes sensitive data, such as when attackers used a stolen <a href="https://blog.cloudflare.com/response-to-salesloft-drift-incident/"><u>Salesloft OAuth token</u></a> to exfiltrate customer support cases from their clients' Salesforce instances.</p></li></ul>
    <div>
      <h2>Integrated directly into your Cloudflare dashboard</h2>
      <a href="#integrated-directly-into-your-cloudflare-dashboard">
        
      </a>
    </div>
    <p>Starting today, Cloudflare Enterprise customers will find a new "Incident Response Services" tab in the Threat intelligence navigation page in the Cloudflare dashboard. This dashboard integration ensures that critical security information and the ability to engage our incident response team are always at your fingertips, streamlining the process of getting expert help when it matters most.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Imz3bhNLw4khcfHhjtvHr/b8d526964688763983b61d588d97b80f/image4.png" />
          </figure><p><sup><i>Screenshot of the Cloudforce One Incident Response Services page in the Cloudflare dashboard</i></sup></p><p>Retainer customers will benefit from a dedicated Under Attack page, which allows them to contact the Cloudforce One team during an active incident: a simple “Request Help” button immediately pages our on-call incident responders to get you the help you need without delay.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4V9Gr3tYWwORVsPhOLByGr/0844aa8e4f5852ad40ead3e52bff0630/image6.png" />
          </figure><p><sup><i>Screenshot on the Under Attack button in the Cloudflare dashboard</i></sup></p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KnOXewLXgkQ6c4AabrNqS/fdb6ff08ac9170391aa7e2a8e0965223/image3.png" />
          </figure><p><sup><i>Screenshot of the Emergency Incident Response page in the Cloudflare dashboard</i></sup></p><p>For proactive needs, you can also easily submit requests for security advisory services through the Cloudflare dashboard: </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4R25QIIofrdQe71aOv2pFh/40d1de44dc81cede364b76c5c0d2176a/image2.png" />
          </figure><p><sup><i>Confirmation of the successful service request submission</i></sup></p>
    <div>
      <h2>How to engage with Cloudforce One </h2>
      <a href="#how-to-engage-with-cloudforce-one">
        
      </a>
    </div>
    <p><i>To learn more about REACT, existing Enterprise customers can explore the dedicated Incident Response section in the Cloudflare dashboard. For new inquiries regarding proactive partnerships and retainers, please </i><a href="https://www.cloudflare.com/plans/enterprise/contact/"><i><u>contact Cloudflare sales</u></i></a><i>.

If you are facing an active security crisis and need the REACT team on the ground, </i><a href="https://www.cloudflare.com/under-attack-hotline/"><i><u>please contact us immediately</u></i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[Cloudforce One]]></category>
            <category><![CDATA[Incident Response]]></category>
            <category><![CDATA[Digital Forensics]]></category>
            <category><![CDATA[Threat Intelligence]]></category>
            <guid isPermaLink="false">75gR5VwIoZW3jysVwZlES5</guid>
            <dc:creator>Chris O’Rourke</dc:creator>
            <dc:creator>Utsav Adhikari</dc:creator>
            <dc:creator>Blake Darché</dc:creator>
            <dc:creator>Jacob Crisp</dc:creator>
            <dc:creator>Trevor Lyness</dc:creator>
        </item>
    </channel>
</rss>