
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sun, 10 May 2026 19:12:40 GMT</lastBuildDate>
        <item>
            <title><![CDATA[How Cloudflare responded to the “Copy Fail” Linux vulnerability]]></title>
            <link>https://blog.cloudflare.com/copy-fail-linux-vulnerability-mitigation/</link>
            <pubDate>Thu, 07 May 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ When a critical Linux kernel privilege escalation was publicly disclosed, Cloudflare's security and engineering teams detected, investigated, and mitigated the threat across our global fleet, confirming zero customer impact and no malicious exploitation. ]]></description>
            <content:encoded><![CDATA[ <p>On April 29, 2026, a Linux kernel local privilege escalation vulnerability was publicly disclosed under the name "Copy Fail" (<a href="https://security-tracker.debian.org/tracker/CVE-2026-31431"><u>CVE-2026-31431</u></a>). Cloudflare’s Security and Engineering teams began assessing the vulnerability as soon as it was disclosed. We reviewed the exploit technique, evaluated exposure across our infrastructure, and validated that our existing behavioral detections could identify the exploit pattern within minutes. </p><p><b>There was no impact to the Cloudflare environment, no customer data was at risk, and no services were disrupted at any point.</b> Read on to learn how our preparedness paid off. </p>
    <div>
      <h2>Background</h2>
      <a href="#background">
        
      </a>
    </div>
    
    <div>
      <h4>Our Linux kernel release process</h4>
      <a href="#our-linux-kernel-release-process">
        
      </a>
    </div>
    <p>Cloudflare operates a global Linux server infrastructure at an immense scale, with data centers located <a href="https://www.cloudflare.com/network/"><u>across 330 cities</u></a>. We maintain a custom Linux kernel build based on the community's Long-Term Support (LTS) versions to manage updates effectively at this volume. At any given time, we may utilize multiple LTS versions from various series, such as 6.12 or 6.18, which benefit from extended update periods.</p><p>The community regularly merges and releases security and stability updates, which trigger an automated job to generate a new internal kernel build approximately every week. These builds undergo testing in our staging data centers to ensure stability before a global rollout. Following a successful release, the Edge Reboot Release (ERR) pipeline manages a systematic update and reboot of the edge infrastructure on a four-week cycle. Our control plane infrastructure typically adopts the most recent kernel, with reboots scheduled according to specific workload requirements.</p><p>By the time a CVE becomes public knowledge, the necessary fix has typically been integrated into stable Linux LTS releases for several weeks. Our established procedures ensure that we have already deployed these patches.</p><p>At the time of the "Copy Fail" disclosure, the majority of our infrastructure was running the 6.12 LTS version, while a subset of machines had begun transitioning to the newer 6.18 LTS release.</p>
    <div>
      <h3>About the Copy Fail vulnerability</h3>
      <a href="#about-the-copy-fail-vulnerability">
        
      </a>
    </div>
    <p>It helps to understand the vulnerability before getting to the response story. A comprehensive write-up can be found in the original <a href="https://xint.io/blog/copy-fail-linux-distributions"><u>Xint Code disclosure</u></a> post.</p>
    <div>
      <h4>AF_ALG and the kernel crypto API</h4>
      <a href="#af_alg-and-the-kernel-crypto-api">
        
      </a>
    </div>
    <p>The Linux kernel's internal crypto API manages functions like kTLS and IPsec. Userspace programs access this via the <code>AF_ALG</code> socket family, allowing unprivileged processes to request encryption or decryption. The <code>algif_aead</code> module facilitates this for Authenticated Encryption with Associated Data (AEAD) ciphers.</p><p>An unprivileged program follows these steps:</p><ol><li><p>Opens an <code>AF_ALG</code> socket and binds to an AEAD template.</p></li><li><p>Sets a key and accepts a request socket.</p></li><li><p>Submits input via <code>sendmsg()</code> or <code>splice()</code>.</p></li><li><p>Executes the operation using <code>recvmsg()</code>.</p></li></ol><p>The <code>splice()</code> system call is critical here, as it moves data by passing page cache references.</p>
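    <p>To make those steps concrete, here is a minimal userspace sketch of one AEAD operation over <code>AF_ALG</code>. This is an illustration, not code from the disclosure: the <code>gcm(aes)</code> algorithm and the zeroed key and IV are arbitrary choices, error handling is omitted, and a real caller could feed data with <code>splice()</code> instead of <code>sendmsg()</code>.</p>
            <pre><code>/* Sketch: one AEAD encryption through AF_ALG as an unprivileged user.
 * gcm(aes) and the zeroed key/IV are illustrative; errors unchecked. */
#include &lt;linux/if_alg.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/socket.h&gt;
#include &lt;unistd.h&gt;

#ifndef SOL_ALG
#define SOL_ALG 279
#endif

int main(void)
{
    struct sockaddr_alg sa = {
        .salg_family = AF_ALG,
        .salg_type = "aead",
        .salg_name = "gcm(aes)",
    };
    unsigned char key[16] = { 0 }, iv[12] = { 0 };
    unsigned char pt[16] = "hello, AF_ALG!!";
    unsigned char out[16 + 16]; /* ciphertext + 16-byte tag */

    /* Steps 1 and 2: bind the transform socket, set a key, and
     * accept a request socket for the actual operation. */
    int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
    bind(tfm, (struct sockaddr *)&amp;sa, sizeof(sa));
    setsockopt(tfm, SOL_ALG, ALG_SET_KEY, key, sizeof(key));
    int req = accept(tfm, NULL, 0);

    /* Step 3: submit input; the operation type and IV travel as
     * control messages alongside the plaintext. */
    char cbuf[CMSG_SPACE(sizeof(int)) +
              CMSG_SPACE(sizeof(struct af_alg_iv) + sizeof(iv))] = { 0 };
    struct iovec io = { .iov_base = pt, .iov_len = sizeof(pt) };
    struct msghdr msg = {
        .msg_iov = &amp;io, .msg_iovlen = 1,
        .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
    };
    struct cmsghdr *c = CMSG_FIRSTHDR(&amp;msg);
    c-&gt;cmsg_level = SOL_ALG;
    c-&gt;cmsg_type = ALG_SET_OP;
    c-&gt;cmsg_len = CMSG_LEN(sizeof(int));
    *(int *)CMSG_DATA(c) = ALG_OP_ENCRYPT;
    c = CMSG_NXTHDR(&amp;msg, c);
    c-&gt;cmsg_level = SOL_ALG;
    c-&gt;cmsg_type = ALG_SET_IV;
    c-&gt;cmsg_len = CMSG_LEN(sizeof(struct af_alg_iv) + sizeof(iv));
    struct af_alg_iv *aiv = (struct af_alg_iv *)CMSG_DATA(c);
    aiv-&gt;ivlen = sizeof(iv);
    memcpy(aiv-&gt;iv, iv, sizeof(iv));
    sendmsg(req, &amp;msg, 0);

    /* Step 4: recvmsg()/read() triggers the operation in the kernel. */
    printf("got %zd bytes back\n", read(req, out, sizeof(out)));
    close(req);
    close(tfm);
    return 0;
}</code></pre>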
    <div>
      <h4>Memory mechanics: page cache and in-place crypto</h4>
      <a href="#memory-mechanics-page-cache-and-in-place-crypto">
        
      </a>
    </div>
    <p>The <b>page cache</b> is a shared system cache for file contents. Modifying a page belonging to a setuid binary effectively edits that program for all users until the page is evicted.</p><p>The crypto API utilizes <b>scatterlists</b>, which are structures linking various memory pages. In 2017, <code>algif_aead</code> was optimized for <i>in-place</i> operations, chaining destination and reference pages together. This design lacked enforcement to prevent algorithms from writing past intended boundaries.</p>
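    <p>As a concrete illustration of that shared cache, an unprivileged process can watch a file's pages become resident. The sketch below is illustrative only and is not part of the exploit: it merely reads the file and asks <code>mincore()</code> which pages of the mapping are cached.</p>
            <pre><code>/* Sketch: reading a file pulls its pages into the shared page cache;
 * mincore() reports which pages of a mapping are resident.
 * Illustration only; the path is the example target from this post. */
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;sys/stat.h&gt;
#include &lt;unistd.h&gt;

int main(void)
{
    const char *path = "/usr/bin/su";
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &amp;st);

    size_t pages = (st.st_size + 4095) / 4096;
    unsigned char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    unsigned char *vec = malloc(pages);

    /* Touch every page so the kernel faults the file in. */
    volatile unsigned char sum = 0;
    for (off_t off = 0; off &lt; st.st_size; off += 4096)
        sum += map[off];

    mincore(map, st.st_size, vec);
    size_t resident = 0;
    for (size_t i = 0; i &lt; pages; i++)
        resident += vec[i] &amp; 1;
    printf("%zu of %zu pages of %s are in the page cache\n",
           resident, pages, path);
    return 0;
}</code></pre>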
    <div>
      <h4>The vulnerability: out-of-bounds write</h4>
      <a href="#the-vulnerability-out-of-bounds-write">
        
      </a>
    </div>
    <p>When the user executes <code>recvmsg()</code>, the <code>authencesn</code> wrapper in the kernel performs a 4-byte write past the legitimate output region:</p>
            <pre><code>/* out=1: copy 4 bytes from the tmp scratch buffer into the dst
 * scatterlist at offset assoclen + cryptlen, just past the end of
 * the legitimate output region */
scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1);
</code></pre>
            <p>By using <code>splice()</code>, an attacker can chain a target file's page cache pages to the scatterlist. The out-of-bounds write then taints the cached file, allowing the attacker to control which file is modified, the offset, and the specific 4 bytes written:</p><ul><li><p>File: Any readable file.</p></li><li><p>Offset: Tunable via <code>assoclen</code> and splice parameters.</p></li><li><p>Value: Controlled via AAD bytes 4–7 in <code>sendmsg()</code>.</p></li></ul>
    <div>
      <h4>The exploit, step by step</h4>
      <a href="#the-exploit-step-by-step">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5uqfOddH7biQaTjtgQOist/5c16c08bf3e5ce2f9030d6f98d2403cb/BLOG-3308_2.png" />
          </figure><p>The default exploit targets <code>/usr/bin/su</code>, a setuid-root binary present on essentially every distribution.</p><ol><li><p><b>Cache Reference:</b> Open <code>/usr/bin/su</code> as <code>O_RDONLY</code> and <code>read()</code> to populate the page cache. Use <code>splice()</code> on the file descriptor to pass these page cache references into the crypto scatterlist.</p></li><li><p><b>Setup:</b> Create an <code>AF_ALG</code> socket, <code>bind()</code> to <code>authencesn(hmac(sha256),cbc(aes))</code>, set a key, and accept a request socket, all without needing privileges.</p></li><li><p><b>Write Construction:</b> For each 4-byte shellcode chunk:</p><ul><li><p><code>sendmsg()</code> with AAD bytes 4–7 containing the shellcode.</p></li><li><p><code>splice()</code> the binary into a pipe and then into the <code>AF_ALG</code> socket so that <code>assoclen + cryptlen</code> targets the desired <code>.text</code> offset.</p></li></ul></li><li><p><b>Trigger:</b> <code>recvmsg()</code> initiates decryption. <code>authencesn</code> writes its scratch data to the target offset of <code>/usr/bin/su</code> in the page cache. Although the function returns <code>-EBADMSG</code>, the 4-byte write is now in the global page cache.</p></li><li><p><b>Execution:</b> Running <code>execve("/usr/bin/su")</code> loads the tainted page cache. Since the binary is setuid-root, the injected shellcode executes with <b>root</b> privileges.</p></li></ol><p>The upstream fix (<a href="https://github.com/torvalds/linux/commit/a664bf3d603d"><u>commit a664bf3d603d</u></a>) reverts the 2017 in-place optimization, eliminating the exploit path.</p>
    <div>
      <h3>How we responded </h3>
      <a href="#how-we-responded">
        
      </a>
    </div>
    <p>When the vulnerability was disclosed, several workstreams started in parallel:</p><ul><li><p><b>Mapping the blast radius:</b> Our security team worked with kernel engineers to determine which kernel versions were vulnerable and assess the potential exposure.</p></li><li><p><b>Validating coverage:</b> Security reviewed the exploit technique and confirmed that our existing behavioral detections could identify the exploit pattern during authorized internal validation.</p></li><li><p><b>Proactive threat hunting:</b> Security began searching for signs that the vulnerability had been exploited before it was publicly known, going back 48 hours in our fleet-wide logs.</p></li><li><p><b>Engineering a mitigation:</b> Kernel engineers began building a runtime mitigation that would protect the fleet without breaking production services.</p></li><li><p><b>Continuing software updates:</b> Our engineering teams worked on delivering an updated Linux kernel, which required a careful rollout and reboot across our servers.</p></li></ul><p>There was no customer impact at any point during this response.</p>
    <div>
      <h4>Validating detection coverage</h4>
      <a href="#validating-detection-coverage">
        
      </a>
    </div>
    <p>One of the first things our security team did was confirm that our existing endpoint detection would catch this exploit. Our servers run behavioral detection that continuously monitors process execution patterns. It doesn't rely on knowing about specific vulnerabilities; it watches for anomalous behavior across the fleet.</p><p>When our engineers validated the vulnerability internally as part of the response, the detection platform flagged it within minutes. The system linked the entire execution chain, from the script interpreter through the kernel’s cryptographic subsystem to the privilege escalation binary, and flagged it as malicious based on fleet-wide behavioral patterns.</p><p>This happened without a signature update, without a rule change, and without human intervention: our behavioral detection coverage existed before we wrote any custom logic for the Copy Fail exploit. That confirmation mattered because it meant we had coverage before writing a vulnerability-specific rule.</p>
    <div>
      <h4>Hunting for exploitation</h4>
      <a href="#hunting-for-exploitation">
        
      </a>
    </div>
    <p>While our engineering team moved to a more targeted mitigation, our security investigation had been running since disclosure. This is our standard procedure for any critical vulnerability.</p><p>Our security team operates on a simple principle for critical vulnerabilities: assume compromise until you can prove otherwise. The investigation started from the assumption that exploitation could have occurred before the vulnerability was public, and we worked systematically to either confirm or rule it out.</p><p>The exploit leaves a distinctive trace in kernel logs when it runs. We searched for that trace across our centralized logging infrastructure, covering 48 hours before the vulnerability was publicly disclosed. If someone had exploited this before the world knew about it, we would have seen it.</p><p>We pulled access logs for affected systems and reconstructed who connected, when, and what commands they ran. This gave us a complete forensic picture of interactive activity on potentially affected infrastructure.</p><p>We checked that system binaries had not been tampered with, validated cryptographic hashes against known-good package manifests, looked for persistence mechanisms, and audited network connections for anything unusual. Everything was clean. </p>
    <div>
      <h2>Incident timeline and impact</h2>
      <a href="#incident-timeline-and-impact">
        
      </a>
    </div>
    
<div><table><colgroup>
<col></col>
<col></col>
</colgroup>
<thead>
  <tr>
    <th><span>Time (UTC)</span></th>
    <th><span>Event</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>2026-04-29 16:00</span></td>
    <td><span>Copy Fail publicly disclosed.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-29 ~21:00</span></td>
    <td><span>Security and Engineering teams began assessing fleet exposure and mitigation options before full declaration of the Incident Response process.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-29 22:52</span></td>
    <td><span>Security confirmed existing behavioral detection covered the Copy Fail exploit pattern. During authorized internal validation, detection flagged the activity within minutes.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-29 23:01</span></td>
    <td><span>Existing behavioral detection generated a high-severity alert for exploit-like activity, confirming detection coverage for the technique.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-29 (evening)</span></td>
    <td><span>First mitigation attempt pushed to our staging datacenter. The deployment process surfaced a dependency conflict; the mitigation was rolled back. No production systems were affected.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-29 (overnight)</span></td>
    <td><span>Engineering drafted the bpf-lsm mitigation program.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-30 03:14</span></td>
    <td><span>Security incident declared to drive cross-functional collaboration and urgency. Security performed fleetwide threat hunting of historical data to confirm that no malicious activity was present on Cloudflare systems.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-30 (morning)</span></td>
    <td><span>Engineering tested the bpf-lsm mitigation program and made it production-ready.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-30 14:25</span></td>
    <td><span>Engineering incident declared to coordinate mitigation program and Linux patch rollout. </span></td>
  </tr>
  <tr>
    <td><span>2026-04-30 ~17:00</span></td>
    <td><span>Decision made: ship a patched build of the previous LTS line through reboot automation; do not accelerate the new LTS; lean on bpf-lsm in the meantime.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-30 (afternoon)</span></td>
    <td><span>Visibility pipeline (eBPF tracing of AF_ALG socket usage) deployed fleet-wide. Gives a complete picture of all legitimate AF_ALG users.</span></td>
  </tr>
  <tr>
    <td><span>2026-04-30 (evening)</span></td>
    <td><span>bpf-lsm mitigation program rolled out behind a separate gate to fully mitigate the fleet. End-to-end verification on a previously vulnerable test node confirmed the exploit no longer works.</span></td>
  </tr>
  <tr>
    <td><span>2026-05-04 (morning)</span></td>
    <td><span>Reboot automation resumed at normal pace with the patched kernel.</span></td>
  </tr>
  <tr>
    <td><span>2026-05-04 onward</span></td>
    <td><span>Servers that had already passed through reboot automation earlier in the week were manually rebooted to pick up the patched kernel. Unpatched servers updated via our normal reboot automation.</span></td>
  </tr>
</tbody></table></div>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7LI0k0FJgbLKzOkSEtYPwW/997234a334b3694c63417fad5810b679/BLOG-3308_3.png" />
          </figure><p>This graph shows how the rollout of our mitigation program progressed through our infrastructure.</p>
    <div>
      <h3>How did we mitigate it?</h3>
      <a href="#how-did-we-mitigate-it">
        
      </a>
    </div>
    <p>Because deploying a patched Linux kernel takes a long time at our scale, we also pursued a mitigation for this exploit that did not require a reboot.</p>
    <div>
      <h4>Removing the module</h4>
      <a href="#removing-the-module">
        
      </a>
    </div>
    <p>The bug lives in the <code>algif_aead</code> kernel module, so the simplest fix was to remove the module and prevent it from being reloaded. This is exactly the mitigation recommended in the <a href="https://copy.fail/"><u>Copy Fail</u></a> write-up from the security researchers who identified the vulnerability:</p>
            <pre><code>echo "install algif_aead /bin/false" &gt; /etc/modprobe.d/disable-algif.conf
rmmod algif_aead 2&gt;/dev/null || true</code></pre>
            <p>Unfortunately, removing the module would have broken software that legitimately relies on the kernel crypto API, so we had to find a more surgical mitigation.</p>
    <div>
      <h4>bpf-lsm</h4>
      <a href="#bpf-lsm">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2VYFKst8aUkkHCaEdb8yQ4/91ad9a855d6185bccd179fcf092ed636/BLOG-3308_4.png" />
          </figure><p>We had already developed and deployed a tool for this exact scenario: <a href="https://blog.cloudflare.com/live-patch-security-vulnerabilities-with-ebpf-lsm/"><u>bpf-lsm</u></a>. Instead of removing the module, this tool leaves it loaded for legitimate users and uses a BPF Linux Security Module program to deny the <code>socket_bind</code> LSM hook for everyone else. This completely blocks the front door for any exploit.</p><p>A draft of the eBPF program was put together overnight. Team members picked it up the following morning, ran validations, and made it production-ready. The program is fairly straightforward. On every <code>socket_bind</code> call:</p><ol><li><p>If the socket family is not <code>AF_ALG</code>, allow the call through unchanged.</p></li><li><p>If the family is <code>AF_ALG</code>, check the calling binary's path against an allow-list of the binaries we know to be legitimate users.</p></li><li><p>If the binary is on the allow-list, allow the bind. Otherwise, deny it.</p></li></ol><p>A simplified sketch of such a program appears below, after the verification snippet.</p><p>To verify the mitigation on a given machine without exploiting it, the <a href="https://copy.fail/"><u>Copy Fail</u></a> write-up gives a one-liner:</p>
            <pre><code>python3 -c 'import socket; s = socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0); s.bind(("aead","authencesn(hmac(sha256),cbc(aes))"));'</code></pre>
            <p>On a mitigated machine, the bind fails with <code>PermissionError: [Errno 1] Operation not permitted</code> (or <code>FileNotFoundError</code>, depending on which mitigation is active) instead of succeeding.</p>
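            <p>Here is what such a <code>socket_bind</code> LSM program can look like. To be clear, this is a minimal sketch rather than Cloudflare's production bpf-lsm program: it matches on the task's <code>comm</code> for brevity, where the real program checks the calling binary's path, and populating the <code>allowed</code> map is left to userspace.</p>
            <pre><code>/* Sketch of an allow-list LSM hook: deny AF_ALG binds unless the
 * caller is known. Matching on comm is a simplification of the
 * binary-path check described above. */
#include "vmlinux.h"
#include &lt;bpf/bpf_helpers.h&gt;
#include &lt;bpf/bpf_tracing.h&gt;

#define AF_ALG 38
#define EPERM  1

struct comm_key {
    char comm[16]; /* TASK_COMM_LEN */
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 64);
    __type(key, struct comm_key);
    __type(value, __u8);
} allowed SEC(".maps");

SEC("lsm/socket_bind")
int BPF_PROG(deny_af_alg_bind, struct socket *sock,
             struct sockaddr *address, int addrlen)
{
    /* Leave every other socket family alone. */
    if (address-&gt;sa_family != AF_ALG)
        return 0;

    struct comm_key key = {};
    bpf_get_current_comm(&amp;key.comm, sizeof(key.comm));

    /* Allow-listed callers keep using the kernel crypto API;
     * everyone else is denied at the front door. */
    if (bpf_map_lookup_elem(&amp;allowed, &amp;key))
        return 0;

    return -EPERM;
}

char LICENSE[] SEC("license") = "GPL";</code></pre>
            <p>A denied bind surfaces to userspace as the <code>PermissionError</code> shown above.</p>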
    <div>
      <h4>Rolling it out</h4>
      <a href="#rolling-it-out">
        
      </a>
    </div>
    <p><b>Before enabling enforcement, we verified that our known internal service was the sole legitimate </b><code><b>AF_ALG</b></code><b> user to avoid accidental outages.</b> We used <a href="https://github.com/cloudflare/ebpf_exporter"><code>prometheus-ebpf-exporter</code></a> to hook the <code>socket()</code> syscall and track <code>AF_ALG</code> usage per binary across the fleet. This required no kernel changes and provided aggregate data from hundreds of thousands of servers within hours. The results confirmed that the identified service was indeed the only legitimate user.</p><p>The bpf-lsm rollout was therefore deliberately staged in two steps:</p><ol><li><p><b>Get visibility first.</b> Push the ebpf-exporter config gated by Salt. Confirm at the metric layer that the known service is effectively the only thing creating <code>AF_ALG</code> sockets (a sketch of this style of probe follows below).</p></li><li><p><b>Then enforce.</b> Push the bpf-lsm program behind a separate enforcement gate.</p></li></ol><p>In parallel, the upstream backport for our majority LTS line finally became available, and our internal automation built a patched kernel against it.</p><p>We began testing the patched kernel in our staging data centers as soon as it was available, then resumed the longer reboot process to fully patch our fleet.</p>
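    <p>To give a flavor of the visibility step, a tracepoint program along these lines can count <code>AF_ALG</code> <code>socket()</code> calls per process name. This sketch is illustrative only; it is not the actual <code>ebpf_exporter</code> configuration we deployed, which exports such counters as Prometheus metrics.</p>
            <pre><code>/* Sketch: count AF_ALG socket() calls per process name, so the
 * allow-list can be validated fleet-wide before enforcement.
 * Illustration only, not our ebpf_exporter configuration. */
#include "vmlinux.h"
#include &lt;bpf/bpf_helpers.h&gt;

#define AF_ALG 38

struct comm_key {
    char comm[16];
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, struct comm_key);
    __type(value, __u64);
} af_alg_socket_calls SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_socket")
int count_af_alg(struct trace_event_raw_sys_enter *ctx)
{
    /* The first socket() argument is the address family. */
    if ((int)ctx-&gt;args[0] != AF_ALG)
        return 0;

    struct comm_key key = {};
    bpf_get_current_comm(&amp;key.comm, sizeof(key.comm));

    __u64 one = 1;
    __u64 *count = bpf_map_lookup_elem(&amp;af_alg_socket_calls, &amp;key);
    if (count)
        __sync_fetch_and_add(count, 1);
    else
        bpf_map_update_elem(&amp;af_alg_socket_calls, &amp;key, &amp;one, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";</code></pre>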
    <div>
      <h2>Remediation and follow-up steps</h2>
      <a href="#remediation-and-follow-up-steps">
        
      </a>
    </div>
    <p>While we were prepared for this scenario, at Cloudflare we’re always learning and improving. Key areas we identified for improvement:</p><ul><li><p><b>Better visibility into kernel-API dependencies.</b> We will review kernel-subsystem usage across production services so we can continue to quickly mitigate exploits without service disruption.</p></li><li><p><b>Better runtime mitigation.</b> bpf-lsm is a valuable mitigation tool, and we want to make it even better: faster deployments, stronger playbooks, and improved logging and visibility.</p></li><li><p><b>Reduced Linux kernel attack surface.</b> We will review and audit our kernel configuration, proactively identifying unused modules or features so we can remove them from our build entirely.</p></li></ul>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The "Copy Fail" vulnerability presented a unique challenge for us. Despite our practice of regularly deploying Linux patch updates, we remained vulnerable because a month-old mainline fix had yet to be backported to our primary kernel line. Even so, we were able to roll out patched kernels within hours of the backport's release. In the interim, bpf-lsm provided a surgical, no-reboot mitigation that secured our fleet. While our initial attempt to disable the problematic module failed, it did so safely within our internal staging environment rather than production, allowing us to identify the dependency before any production impact.</p><p>By the end of the rollout, every machine in our fleet was protected by either a patched kernel or a bpf-lsm program denying the vulnerable code path to non-allow-listed binaries. There was no customer impact at any point during this incident, and we have committed to the follow-up work above to make our response faster and our visibility better the next time something like this lands. Responsible disclosure works, in-kernel visibility tooling pays off in moments exactly like this one, and bpf-lsm continues to be one of the most useful primitives we have for runtime kernel mitigation.</p><p>At Cloudflare, critical vulnerability response is a coordinated effort across Security, Engineering, Product, and many other teams. Special thanks to Ali Adnan, Ivan Babrou, Frederik Baetens, Curtis Bray, Piers Cornwell, Everton Didone Foscarini, Rob Dinh, Elle Dougherty, Kevin Flansburg, Matt Fleming, Kimberley Hall, Brandon Harris, Jerry Ho, Oxana Kharitonova, Marek Kroemeke, Fred Lawler, James Munson, Nafeez Nazer, Walead Parviz, Miguel Pato, Evan Pratten, Josh Seba, June Slater, Ryan Timken, Michael Wolf, Jianxin Zeng and everyone else who contributed to the investigation, mitigation, and remediation of Copy Fail. We'd also like to thank the Linux upstream maintainers and Copy Fail researchers whose work helped make a rapid response possible.</p>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Incident Response]]></category>
            <category><![CDATA[Kernel]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[Mitigation]]></category>
            <category><![CDATA[eBPF]]></category>
            <guid isPermaLink="false">7JN0oOT8V9YgCD6JFW92my</guid>
            <dc:creator>Chris J Arges</dc:creator>
            <dc:creator>Sourov Zaman</dc:creator>
            <dc:creator>Rian Islam</dc:creator>
        </item>
        <item>
            <title><![CDATA[Network flow monitoring is GA, providing end-to-end traffic visibility]]></title>
            <link>https://blog.cloudflare.com/network-flow-monitoring-generally-available/</link>
            <pubDate>Wed, 18 Oct 2023 13:00:53 GMT</pubDate>
            <description><![CDATA[ Network engineers often need better visibility into their network’s traffic when analyzing DDoS attacks or troubleshooting other traffic anomalies. To solve this problem, Cloudflare offers a network flow monitoring product that gives customers end-to-end traffic visibility across their network. ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4EZamNYbSPCC1yBqXXwaZR/d3e36168073dcc8f08b715ab7d4bbe5e/image4-4.png" />
            
            </figure><p>Network engineers often find they need better visibility into their network’s traffic and operations while analyzing DDoS attacks or troubleshooting other traffic anomalies. These engineers typically have some high level metrics about their network traffic, but they struggle to collect essential information on the specific traffic flows that would clarify the issue. To solve this problem, Cloudflare has been piloting a <a href="https://www.cloudflare.com/network-services/solutions/network-monitoring-tools/">cloud network flow monitoring product</a> called <a href="https://www.cloudflare.com/network-services/products/magic-network-monitoring/">Magic Network Monitoring</a> that gives customers end-to-end visibility into all traffic across their network.</p><p>Today, Cloudflare is excited to announce that Magic Network Monitoring (previously called <a href="/flow-based-monitoring-for-magic-transit/">Flow Based Monitoring</a>) is now generally available to all enterprise customers. Over the last year, the Cloudflare engineering team has significantly improved Magic Network Monitoring; we’re excited to offer a network services product that will help our customers identify threats faster, reduce vulnerabilities, and <a href="https://www.cloudflare.com/network-services/solutions/enterprise-network-security/">make their network more secure</a>.</p><p>Magic Network Monitoring is automatically enabled for all Magic Transit and Magic WAN enterprise customers. The product is located at the account level of the Cloudflare dashboard and can be opened by navigating to “Analytics &amp; Logs &gt; Magic Monitoring”. The onboarding process for Magic Network Monitoring is self-serve, and all enterprise customers with access can begin configuring the product today.</p><p>Any enterprise customers without Magic Transit or Magic WAN that are interested in testing Magic Network Monitoring can receive access to the free version (with some <a href="https://developers.cloudflare.com/magic-network-monitoring/magic-network-monitoring-free/">limitations</a> on traffic volume) by submitting a request to their Cloudflare account team or filling out this form to <a href="https://cloudflare.com/network-services/products/magic-network-monitoring/">talk with an expert</a>.</p>
    <div>
      <h3>What is Magic Network Monitoring?</h3>
      <a href="#what-is-magic-network-monitoring">
        
      </a>
    </div>
    <p>Magic Network Monitoring is a cloud network flow monitor. <a href="https://en.wikipedia.org/wiki/Traffic_flow_(computer_networking)">Network traffic flow</a> refers to any stream of packets between one source and one destination with the same Internet protocol and set of ports. Customers can send network flow reports from their routers (or any other network flow generator) to a publicly available endpoint on <a href="https://www.cloudflare.com/learning/cdn/glossary/anycast-network/">Cloudflare’s anycast network</a>, even if the traffic didn’t originally pass through Cloudflare’s network. Cloudflare analyzes the network flow data, then provides customers visibility into key network traffic metrics via an analytics dashboard. These metrics include: traffic volume (in bits or packets) over time, source IPs, destination IPs, ports, traffic protocols, and router IPs. Customers can also configure alerts to identify <a href="https://www.cloudflare.com/learning/ddos/what-is-a-ddos-attack/">DDoS attacks</a> and any other abnormal traffic volume activities.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3CrObnYrLzKlSOjSUS8dH6/c59b39388b98ba4e7492121d5db3bacf/1-1.png" />
            
            </figure><p>Send flow data from your network to Cloudflare for analysis</p>
    <div>
      <h3>Enterprise DDoS attack type detection</h3>
      <a href="#enterprise-ddos-attack-type-detection">
        
      </a>
    </div>
    <p><a href="https://developers.cloudflare.com/magic-transit/on-demand/">Magic Transit On Demand</a> (MTOD) customers will experience significant traffic visibility benefits when using Magic Network Monitoring. <a href="https://www.cloudflare.com/network-services/products/magic-transit/">Magic Transit</a> is a <a href="https://www.cloudflare.com/network-security/">network security solution</a> that offers DDoS protection and traffic acceleration from every Cloudflare data center for on-premise, cloud-hosted, and hybrid networks. Magic Transit On Demand customers can activate Magic Transit for protection when a DDoS attack is detected.</p><p>In general, we noticed that some MTOD customers lacked the network visibility tools to quickly identify DDoS attacks and take the appropriate mitigation action. Now, MTOD customers can use Magic Network Monitoring to analyze their network data and receive an alert if a DDoS attack is detected.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6HcgWfT995D5YTtTjI7t0x/8f5265dc6c920df9aa4de7db814bfc71/2-1.png" />
            
            </figure><p>Cloudflare detects a DDoS attack from the customer’s network flow data</p><p>Once a DDoS attack is detected, Magic Network Monitoring customers can choose to either manually or automatically enable Magic Transit to mitigate any DDoS attacks.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5FlxXObNPK0L8lx2sN0S6S/8a47e805c9ec45f41c1f9d3bf6d84a33/3-1.png" />
            
            </figure><p>Activate Magic Transit for DDoS protection</p>
    <div>
      <h3>Enterprise network monitoring</h3>
      <a href="#enterprise-network-monitoring">
        
      </a>
    </div>
    <p>Cloudflare’s Magic WAN and Cloudflare One customers can also benefit from using Magic Network Monitoring. Today, these customers have excellent visibility into the traffic they send through Cloudflare’s network, but sometimes they may lack visibility into traffic that isn’t sent through Cloudflare. This can include traffic that remains on a local network, or network traffic sent in between cloud environments. Magic WAN and Cloudflare One customers can add Magic Network Monitoring into their suite of product solutions to establish end-to-end network visibility across all traffic on their network.</p>
    <div>
      <h3>A deep dive into network flow and network traffic sampling</h3>
      <a href="#a-deep-dive-into-network-flow-and-network-traffic-sampling">
        
      </a>
    </div>
    <p>Magic Network Monitoring gives customers better visibility into their network traffic by ingesting and analyzing network flow data.</p><p>The process starts when a router (or other network flow generation device) collects <a href="https://en.wikipedia.org/wiki/Sampling_(statistics)">statistical samples</a> of inbound and/or outbound packet data. These samples are collected by examining 1 in every X packets, where X is the sampling rate configured on the router. Typical sampling rates range from 1 in every 1,000 to 1 in every 4,000 packets. At a 1-in-2,000 sampling rate, for example, 500 sampled packets represent roughly 1,000,000 packets of actual traffic. The ideal sampling rate depends on the traffic volume, traffic diversity, and the compute and memory power of your router’s hardware. You can read more about the <a href="https://developers.cloudflare.com/magic-network-monitoring/routers/recommended-sampling-rate/">recommended network flow sampling rate</a> in Cloudflare’s MNM Developer Docs.</p><p>The sampled data is packaged into one of two industry standard formats for network flow data: NetFlow or sFlow. In NetFlow, the sampled packet data is grouped by different packet characteristics such as source/destination IP, port, and protocol. Each group of sampled packet data also includes a traffic volume estimate. In sFlow, the entire packet header is selected as the representative sample, and there isn’t any data summarization. As a result, sFlow is a richer data format and includes more details about network traffic than NetFlow data. Once either the NetFlow or sFlow data samples are collected, they’re sent to Magic Network Monitoring for analysis and alerting.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2XoUWHVTlsaVD6wYjekYm6/951cff39344b3912444f239618af64c6/4-1.png" />
            
            </figure>
    <div>
      <h3>Why simple random sampling didn’t work for Magic Network Monitoring</h3>
      <a href="#why-simple-random-sampling-didnt-work-for-magic-network-monitoring">
        
      </a>
    </div>
    <p>Magic Network Monitoring has come a long way from its early access release one year ago. In particular, the Cloudflare engineering team invested significant time in improving the accuracy of the traffic volume estimations in MNM. In the early access version of Magic Network Monitoring, customers were unexpectedly reporting that their network traffic volume estimates were too high and didn’t match the expected value.</p><p>Magic Network Monitoring performs its own sampling of the NetFlow or sFlow data it receives, so it can effectively scale and manage the data ingested across Cloudflare’s global network. Increasing the accuracy of the traffic volume estimations was more difficult than expected, as the NetFlow or sFlow data parsed by MNM is already built on sampled packet data. This introduces multiple distinct layers of data sampling in the product’s analytics.</p><p>The first version of Magic Network Monitoring used <a href="https://en.wikipedia.org/wiki/Simple_random_sample">random sampling</a> where a random subset of network flow data with the same timestamp was selected to represent the traffic volume at that point in time. A characteristic of network flow data is that some samples are more significant than others and represent a greater volume of network traffic. In order to account for this significance, we can associate a <a href="https://en.wikipedia.org/wiki/Weighting">weight</a> with each sample based on the traffic volume it represents. Network flow data weights are always positive numbers, and they follow a <a href="https://en.wikipedia.org/wiki/Long_tail">long tail distribution</a>. These data characteristics caused MNM’s random sampling to incorrectly estimate the traffic volume of a customer’s network. Customers would see false spikes in their traffic volume analytics when an outlying data sample from the long tail was randomly selected to be the representative of all traffic at that point in time.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Tje0Xn9GucCoNamBEyvVE/0d097130617a1c6584efa8679f91c87a/5-1.png" />
            
            </figure>
    <div>
      <h3>Increasing accuracy with VarOpt reservoir sampling</h3>
      <a href="#increasing-accuracy-with-varopt-reservoir-sampling">
        
      </a>
    </div>
    <p>To solve this problem, the Cloudflare engineering team implemented an alternative <a href="https://en.wikipedia.org/wiki/Reservoir_sampling">reservoir sampling</a> technique called <a href="https://arxiv.org/pdf/0803.0473.pdf">VarOpt</a>. VarOpt is designed to collect samples from a stream of data when the length of the data stream is unknown (a perfect application for analyzing incoming network flow data). In the MNM implementation of VarOpt, we start with an empty reservoir of a fixed size that is filled with samples of network flow data. When the reservoir is full, and there is still new incoming network flow data, an old sample is randomly discarded from the reservoir and replaced with a new one.</p><p>After a certain number of samples have been observed, we calculate the traffic volume across all weighted samples in the reservoir, and that is the estimated traffic volume of a customer’s network flow at that point in time. Finally, the reservoir is emptied, and the VarOpt loop is restarted by filling the reservoir with the next set of the latest network flow samples.</p><p>The new VarOpt sampling method significantly increased the accuracy of the traffic volume estimations in Magic Network Monitoring, and solved our customers’ problems. These sampling improvements paved the way for general availability, and we’re excited to make accurate network flow analytics available to everyone.</p>
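    <p>VarOpt's bookkeeping is more involved than we can show here, but the core idea of a fixed-size weighted reservoir over a stream of unknown length is easy to sketch. The example below uses the simpler Efraimidis–Spirakis A-Res scheme from the same family; to be clear, it is an illustration of weighted reservoir sampling, not MNM's VarOpt implementation. Each item draws a key <code>u^(1/w)</code> for a uniform <code>u</code>, and the items with the largest keys are kept, so heavier flow samples are proportionally more likely to survive.</p>
            <pre><code>/* Weighted reservoir sampling sketch (Efraimidis-Spirakis A-Res).
 * An illustration of the technique family, not MNM's VarOpt code:
 * each flow sample draws a key u^(1/w), and the K largest keys are
 * kept, so heavier samples are more likely to stay in the reservoir. */
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

#define K 4 /* reservoir size */

struct sample { double weight; double key; int flow_id; };

static struct sample reservoir[K];
static int filled;

/* Index of the smallest key currently held. */
static int min_index(void)
{
    int min = 0;
    for (int i = 1; i &lt; K; i++)
        if (reservoir[i].key &lt; reservoir[min].key)
            min = i;
    return min;
}

static void offer(int flow_id, double weight)
{
    double u = (rand() + 1.0) / (RAND_MAX + 2.0); /* u in (0,1) */
    struct sample s = { weight, pow(u, 1.0 / weight), flow_id };

    if (filled &lt; K) {
        reservoir[filled++] = s;
        return;
    }
    int min = min_index();
    if (s.key &gt; reservoir[min].key)
        reservoir[min] = s; /* evict the weakest key */
}

int main(void)
{
    /* A long-tailed stream: mostly small flows plus a few huge ones,
     * like the flow weights that tripped up simple random sampling. */
    for (int i = 0; i &lt; 10000; i++)
        offer(i, i % 1000 == 0 ? 5000.0 : 1.0);

    for (int i = 0; i &lt; K; i++)
        printf("kept flow %d (weight %.0f)\n",
               reservoir[i].flow_id, reservoir[i].weight);
    return 0;
}</code></pre>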
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2NYGpyTodAgtP9K8KycjGZ/fa5e290cdf3286c7efcbbe53954e1540/6-1.png" />
            
            </figure>
    <div>
      <h3>Developer Docs and Discord Community</h3>
      <a href="#developer-docs-and-discord-community">
        
      </a>
    </div>
    <p>There are detailed <a href="https://developers.cloudflare.com/magic-network-monitoring/">Developer Docs for Magic Network Monitoring</a> that explain the product’s features and outlines a step-by-step configuration guide for new customers. As you’re working through the Magic Network Monitoring documentation, please feel free to provide feedback by clicking the “Give Feedback” button in the top right corner of the Developer Docs.</p><p>We’ve also created a channel in Cloudflare’s Discord community built around debugging configuration problems, testing new features, and providing product feedback. You can follow this link to join the <a href="https://discord.gg/cloudflaredev">Cloudflare Discord server</a>.</p>
    <div>
      <h3>Free version</h3>
      <a href="#free-version">
        
      </a>
    </div>
    <p>A <a href="https://developers.cloudflare.com/magic-network-monitoring/magic-network-monitoring-free/">free version of Magic Network Monitoring</a> is available to all Enterprise customers on request to their Cloudflare account team. The free version is designed to enable Enterprise customers to quickly test and evaluate Magic Network Monitoring before purchasing Magic Transit, Magic WAN, or Cloudflare One. Enterprise customers can fully configure Magic Network Monitoring themselves by following the <a href="https://developers.cloudflare.com/magic-network-monitoring/get-started/">step-by-step onboarding guide</a> in the product’s documentation. The free version has some <a href="https://developers.cloudflare.com/magic-network-monitoring/magic-network-monitoring-free/">limitations</a> on the quantity of traffic that can be processed which are further outlined in the product’s documentation.</p><p>The free version of Magic Network Monitoring is also available to all Free, Pro, and Business plan Cloudflare customers via a closed beta. Anyone can request access to the free version by <a href="https://developers.cloudflare.com/magic-network-monitoring/magic-network-monitoring-free/">reading the free version documentation</a> and <a href="https://forms.gle/z93ghpydpKdAFZ7P9">filling out this form</a>. Priority access is granted to anyone that joins <a href="https://discord.com/invite/cloudflaredev">Cloudflare’s Discord server</a> and sends a message in the Magic Network Monitoring Discord channel.</p>
    <div>
      <h3>Next steps that you can take today</h3>
      <a href="#next-steps-that-you-can-take-today">
        
      </a>
    </div>
    <p>Magic Network Monitoring is generally available, and all Magic Transit and Magic WAN customers have been automatically granted access to the product today. You can navigate to the product by going to the account level of the Cloudflare dashboard, then selecting “Analytics &amp; Logs &gt; Magic Monitoring”.</p><p>If you’re an enterprise customer without Magic Transit or Magic WAN, and you want to use Magic Network Monitoring to improve your traffic visibility, you can <a href="https://cloudflare.com/network-services/products/magic-network-monitoring/">talk with an MNM expert today</a>.</p><p>If you’re interested in using Magic Transit and Magic Network Monitoring for DDoS protection, you can <a href="https://www.cloudflare.com/network-services/products/magic-transit/">request a demo of Magic Transit</a>. If you want to use Magic WAN and Magic Network Monitoring together to establish end-to-end network traffic visibility, you can <a href="https://www.cloudflare.com/network-services/products/magic-wan/">talk with a Magic WAN expert</a>.</p> ]]></content:encoded>
            <category><![CDATA[Magic Network Monitoring]]></category>
            <category><![CDATA[Network Services]]></category>
            <category><![CDATA[Magic Transit]]></category>
            <category><![CDATA[Magic WAN]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">5Q496AB243DF9bETeys1Pq</guid>
            <dc:creator>Chris Draper</dc:creator>
            <dc:creator>Chris J Arges</dc:creator>
            <dc:creator>Ana Oliveira</dc:creator>
            <dc:creator>João Santos</dc:creator>
            <dc:creator>Luís Franco</dc:creator>
            <dc:creator>Nadin El-Yabroudi</dc:creator>
            <dc:creator>Dan Geraghty</dc:creator>
        </item>
        <item>
            <title><![CDATA[How We Used eBPF to Build Programmable Packet Filtering in Magic Firewall]]></title>
            <link>https://blog.cloudflare.com/programmable-packet-filtering-with-magic-firewall/</link>
            <pubDate>Mon, 06 Dec 2021 13:59:53 GMT</pubDate>
            <description><![CDATA[ By combining the power of eBPF and Nftables, Magic Firewall can mitigate sophisticated attacks on infrastructure by enforcing a positive security model. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Cloudflare actively protects services from sophisticated attacks day after day. For users of Magic Transit, <a href="https://www.cloudflare.com/ddos/">DDoS protection</a> detects and drops attacks, while <a href="https://www.cloudflare.com/magic-firewall/">Magic Firewall</a> allows custom packet-level rules, enabling customers to deprecate hardware firewall appliances and block malicious traffic at Cloudflare’s network. The types of attacks and sophistication of attacks continue to evolve, as recent DDoS and reflection <a href="/attacks-on-voip-providers/">attacks</a> <a href="/update-on-voip-attacks/">against</a> VoIP services targeting protocols such as <a href="https://en.wikipedia.org/wiki/Session_Initiation_Protocol">Session Initiation Protocol</a> (SIP) have shown. Fighting these attacks requires pushing the limits of packet filtering beyond what traditional firewalls are capable of. We did this by taking best of class technologies and combining them in new ways to turn Magic Firewall into a blazing fast, fully programmable firewall that can stand up to even the most sophisticated of attacks.</p>
    <div>
      <h3>Magical Walls of Fire</h3>
      <a href="#magical-walls-of-fire">
        
      </a>
    </div>
    <p><a href="/introducing-magic-firewall/">Magic Firewall</a> is a distributed stateless packet firewall built on Linux nftables. It runs on every server, in every Cloudflare data center around the world. To provide isolation and flexibility, each customer’s nftables rules are configured within their own Linux network namespace.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7IjaJVGQWr5ssJPteOj4DR/2a0a310c33b86840a752386253f555b9/image1-20.png" />
            
            </figure><p>This diagram shows the life of an example packet when using <a href="/magic-transit-network-functions/">Magic Transit</a>, which has Magic Firewall built in. First, packets go into the server and DDoS protections are applied, which drop attacks as early as possible. Next, the packet is routed into a customer-specific network namespace, which applies the nftables rules to the packets. After this, packets are routed back to the origin via a GRE tunnel. Magic Firewall users can construct firewall statements through a <a href="https://developers.cloudflare.com/magic-firewall">single API</a>, using a flexible <a href="https://github.com/cloudflare/wirefilter">Wirefilter syntax</a>. In addition, rules can be configured via the Cloudflare dashboard, using friendly drag-and-drop UI elements.</p><p>Magic Firewall provides a very powerful syntax for matching on various packet parameters, but it is also limited to the matches provided by nftables. While this is more than sufficient for many use cases, it does not provide enough flexibility to implement the advanced packet parsing and content matching we want. We needed more power.</p>
    <div>
      <h3>Hello eBPF, meet Nftables!</h3>
      <a href="#hello-ebpf-meet-nftables">
        
      </a>
    </div>
    <p>When looking to add more power to your Linux networking needs, Extended Berkeley Packet Filter (<a href="https://ebpf.io/">eBPF</a>) is a natural choice. With eBPF, you can insert packet processing programs that execute <i>in the kernel</i>, giving you the flexibility of familiar programming paradigms with the speed of in-kernel execution. Cloudflare <a href="/tag/ebpf/">loves eBPF</a> and this technology has been transformative in enabling many of our products. Naturally, we wanted to find a way to use eBPF to extend our use of nftables in Magic Firewall. This means being able to match, using an eBPF program within a table and chain as a rule. By doing this we can have our cake and eat it too, by keeping our existing infrastructure and code, and extending it further.</p><p>If nftables could leverage eBPF natively, this story would be much shorter; alas, we had to continue our quest. To get us started in our search, we know that iptables integrates with eBPF. For example, one can use iptables and a pinned eBPF program for dropping packets with the following command:</p>
            <pre><code>iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/match -j DROP</code></pre>
            <p>This clue helped to put us on the right path. Iptables uses the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/netfilter/xt_bpf.c#n60">xt_bpf</a> extension to match on an eBPF program. This extension uses the BPF_PROG_TYPE_SOCKET_FILTER eBPF program type, which allows us to load the packet information from the socket buffer and return a value based on our code.</p><p>Since we know iptables can use eBPF, why not just use that? Magic Firewall currently leverages nftables, which is a great choice for our use case due to its flexibility in syntax and programmable interface. Thus, we need to find a way to use the xt_bpf extension with nftables.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4EnxwNmNsjU68l44Mf0M3J/8d5f3b9cfc9f74f1d398c2955925e0c0/image2-13.png" />
            
            </figure><p>This <a href="https://developers.redhat.com/blog/2020/08/18/iptables-the-two-variants-and-their-relationship-with-nftables#using_iptables_nft">diagram</a> helps explain the relationship between iptables, nftables and the kernel. The nftables API can be used by both the iptables and nft userspace programs, and can configure both xtables matches (including xt_bpf) and normal nftables matches.</p><p>This means that given the right API calls (netlink/netfilter messages), we can embed an xt_bpf match within an nftables rule. In order to do this, we need to understand which netfilter messages we need to send. By using tools such as strace, Wireshark, and especially using the <a href="https://github.com/torvalds/linux/blob/master/net/netfilter/xt_bpf.c">source</a> we were able to construct a message that could append an eBPF rule given a table and chain.</p>
            <pre><code>NFTA_RULE_TABLE table
NFTA_RULE_CHAIN chain
NFTA_RULE_EXPRESSIONS | NFTA_MATCH_NAME
	NFTA_LIST_ELEM | NLA_F_NESTED
	NFTA_EXPR_NAME "match"
		NLA_F_NESTED | NFTA_EXPR_DATA
		NFTA_MATCH_NAME "bpf"
		NFTA_MATCH_REV 1
		NFTA_MATCH_INFO ebpf_bytes	</code></pre>
            <p>The structure of the netlink/netfilter message to add an eBPF match should look like the above example. Of course, this message needs to be properly embedded and include a conditional step, such as a verdict, when there is a match. The next step was decoding the format of <code>ebpf_bytes</code> as shown in the example below.</p>
            <pre><code> struct xt_bpf_info_v1 {
	__u16 mode;
	__u16 bpf_program_num_elem;
	__s32 fd;
	union {
		struct sock_filter bpf_program[XT_BPF_MAX_NUM_INSTR];
		char path[XT_BPF_PATH_MAX];
	};
};</code></pre>
            <p>The bytes format can be found in the kernel header definition of <a href="https://git.netfilter.org/iptables/tree/include/linux/netfilter/xt_bpf.h#n27">struct xt_bpf_info_v1</a>. The code example above shows the relevant parts of the structure.</p><p>The xt_bpf module supports both raw bytecode and a path to a pinned eBPF program. The latter mode is the technique we used to combine the eBPF program with nftables.</p><p>With this information we were able to write code that could create netlink messages and properly serialize any relevant data fields. This approach was just the first step; we are also looking into incorporating this into proper tooling instead of sending custom netfilter messages.</p>
    <div>
      <h3>Just Add eBPF</h3>
      <a href="#just-add-ebpf">
        
      </a>
    </div>
    <p>Now we needed to construct an eBPF program and load it into an existing nftables table and chain. Starting to use eBPF can be a bit daunting. Which program type do we want to use? How do we compile and load our eBPF program? We started this process by doing some exploration and research.</p><p>First we constructed an example program to try it out.</p>
            <pre><code>SEC("socket")
int filter(struct __sk_buff *skb) {
  /* get header */
  struct iphdr iph;
  if (bpf_skb_load_bytes(skb, 0, &amp;iph, sizeof(iph))) {
    return BPF_DROP;
  }

  /* read last 5 bytes in payload of udp */
  __u16 pkt_len = bswap_16(iph.tot_len);
  char data[5];
  if (bpf_skb_load_bytes(skb, pkt_len - sizeof(data), &amp;data, sizeof(data))) {
    return BPF_DROP;
  }

  /* only packets with the magic word at the end of the payload are allowed */
  const char SECRET_TOKEN[5] = "xyzzy";
  for (int i = 0; i &lt; sizeof(SECRET_TOKEN); i++) {
    if (SECRET_TOKEN[i] != data[i]) {
      return BPF_DROP;
    }
  }

  return BPF_OK;
}</code></pre>
            <p>The excerpt above is an example of an eBPF program that only accepts packets that have a magic string at the end of the payload. This requires checking the total length of the packet to find where to start the search. For clarity, this example omits error checking and headers.</p><p>Once we had a program, the next step was integrating it into our tooling. We tried a few technologies to load the program, such as BCC and libbpf, and we even created a custom loader. Ultimately, we ended up using <a href="https://github.com/cilium/ebpf/">Cilium’s ebpf library</a>, since we are using Golang for our control-plane program and the library makes it easy to generate, embed and load eBPF programs.</p>
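            <p>For readers who want to experiment without the Go tooling, a roughly equivalent loader in C with libbpf might look like the sketch below. The object file name <code>filter.o</code> is an assumption for illustration; the program name and pin path match the example above and the ruleset listing that follows.</p>
            <pre><code>/* Sketch: load a compiled socket-filter object with libbpf and pin
 * it where an xt_bpf match can reference it. Illustrative only; our
 * control plane uses the cilium/ebpf Go library instead. */
#include &lt;bpf/libbpf.h&gt;
#include &lt;stdio.h&gt;

int main(void)
{
    /* "filter.o" is assumed to contain the SEC("socket") program. */
    struct bpf_object *obj = bpf_object__open_file("filter.o", NULL);
    if (!obj || bpf_object__load(obj)) {
        fprintf(stderr, "failed to open or load filter.o\n");
        return 1;
    }

    struct bpf_program *prog =
        bpf_object__find_program_by_name(obj, "filter");
    if (!prog || bpf_program__pin(prog, "/sys/fs/bpf/mfw/match")) {
        fprintf(stderr, "failed to pin program\n");
        return 1;
    }

    puts("pinned to /sys/fs/bpf/mfw/match");
    return 0;
}</code></pre>
            <p>Once pinned, the program outlives the loader process, and the netlink message described earlier can reference it by the <code>/sys/fs/bpf/mfw/match</code> path.</p>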
            <pre><code># nft list ruleset
table ip mfw {
	chain input {
		#match bpf pinned /sys/fs/bpf/mfw/match drop
	}
}</code></pre>
            <p>Once the program is compiled and pinned, we can add matches into nftables using netlink commands. Listing the ruleset shows the match is present. This is incredible! We are now able to deploy custom C programs to provide advanced matching inside a Magic Firewall ruleset!</p>
    <div>
      <h3>More Magic</h3>
      <a href="#more-magic">
        
      </a>
    </div>
    <p>With the addition of eBPF to our toolkit, Magic Firewall is an even more flexible and powerful way to protect your network from bad actors. We are now able to look deeper into packets and implement more complex matching logic than nftables alone could provide. Since our firewall is running as software on all Cloudflare servers, we can quickly iterate and update features.</p><p>One outcome of this project is SIP protection, which is currently in beta. That’s only the beginning. We are currently exploring using eBPF for protocol validations, advanced field matching, looking into payloads, and supporting even larger sets of IP lists.</p><p>We welcome your help here, too! If you have other use cases and ideas, please talk to your account team. If you find this technology interesting, come <a href="https://www.cloudflare.com/careers/">join our team</a>!</p> ]]></content:encoded>
            <category><![CDATA[CIO Week]]></category>
            <category><![CDATA[Magic Firewall]]></category>
            <category><![CDATA[Magic Transit]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[VoIP]]></category>
            <category><![CDATA[eBPF]]></category>
            <guid isPermaLink="false">1RUM6TLPNlUYMBqyfPL81y</guid>
            <dc:creator>Chris J Arges</dc:creator>
        </item>
    </channel>
</rss>