Lecture: 5 min.
As cyber threats continue to exploit systemic vulnerabilities in widely used technologies, the United States Cybersecurity and Infrastructure Agency (CISA) produced best practices for the technology industry with their Secure-by-Design pledge. Cloudflare proudly signed this pledge on May 8, 2024, reinforcing our commitment to creating resilient systems where security is not just a feature, but a foundational principle.
We’re excited to share and provide transparency into how our security patching process meets one of CISA’s goals in the pledge: Demonstrating actions taken to increase installation of security patches for our customers.
Balancing security patching and customer experience
Managing and deploying Linux kernel updates is one of Cloudflare’s most challenging security processes. In 2024, over 1000 CVEs were logged against the Linux kernel and patched. To keep our systems secure, it is vital to perform critical patch deployment across systems while maintaining the user experience.
A common technical support phrase is “Have you tried turning it off and then on again?”. One may be surprised how often this tactic is used — it is also an essential part of how Cloudflare operates at scale when it comes to applying our most critical patches. Frequently restarting systems exercises the restart process, applies the latest firmware changes, and refreshes the filesystem. Simply put, the Linux kernel requires a restart to take effect.
However, considering that a single Cloudflare server may be processing hundreds of thousands of requests at any point in time, rebooting it would impact user experience. As a result, a calculated approach is required, and traffic must be carefully removed from the server before it can safely reboot.
First, the server is marked for maintenance. This action alerts our load balancing system, unimog, to stop sending traffic to this server. Next, the server waits for this flow of traffic to terminate, and once public traffic is gone, the server begins to disable internal traffic. Internal traffic has multiple purposes, such as determining optimal routing, service discovery, and system health checks. Once the server is no longer actively serving any traffic, it can safely restart, using the new kernel.
Kernel lifecycle at Cloudflare
This diagram is a high level view of the lifecycle of the Linux kernel at Cloudflare. The list of kernel versions shown is a point in time example snapshot from kernel.org.
First, a new kernel is released by the upstream kernel developers. We follow the longterm stable branch of the kernel. Each new kernel release is pulled into our internal repository automatically, where the kernel is built and tested. Once all testing has successfully passed, several flavors of the kernel are built and readied for a preliminary deployment.
The first stage of deployment is an internal environment that receives no traffic. Once it is confirmed that there are no crashes or unintended behavior, it is promoted to a production environment with traffic generated by Cloudflare employees as eyeballs.
Cloudflare employees are connected via Zero Trust to this environment. This allows our telemetry to collect information regarding CPU utilization, memory usage, and filesystem behavior, which is then analyzed for deviations from the previous kernel. This is the first time that a new kernel is interacting with live traffic and real users in a Cloudflare environment.
Once we are satisfied with kernel performance and behavior, we begin to deploy this kernel to customer traffic. This progression starts as a small percentage of traffic in multiple datacenters and ends in one large regional datacenter. This is an important qualification phase for a new kernel, as we need to collect data on real world traffic. Once we are satisfied with performance and behavior, we have a candidate release that can go everywhere.
When a new kernel is ready for release, an automated cycle named the Edge Reboot Release is initiated. The Edge Reboot Release begins and completes every 30 days. This guarantees that we are running an up-to-date kernel in our infrastructure every month.
What about patches for the kernel that are needed faster than the standard cycle? We can live patch changes to close those gaps faster, and we have even written about closing one of these CVE’s.
Automating kernel updates in our Control Plane
The Cloudflare network is 50 ms from 95% of the world’s Internet-connected population. The Control Plane runs different workloads than our network, and is composed of 80 different clustered workloads responsible for persistence of information and decisions that feed the Cloudflare network. Until 2024, the Control Plane kernel maintenance was performed ad-hoc, and this caused the working kernel for Control Plane workloads to fall behind on patches. Under the pledge, this had to change and become just as consistent as the rest of our network.
Consider a relational database as an example workload, as illustrated in the diagram above. One would need a copy available to restart the original in order to provide a seamless end user experience. This copy is called a database replica. That replica should then be promoted to become the primary serving database. Now that a new primary is serving traffic, the old primary is free to restart. If a database replica reboot is needed, an additional replica would be needed to take its place, allowing another safe restart. In this example, we have 2 different ways to restart a member of the clustered workload. Every clustered workload has different safe methodologies to restart one of its members.
Reboau (short for reboot automation) is an internally-built tool to manage custom reboot logic in the Control Plane. Reboau offers additional efficiencies described as “rack aware”, meaning it can operate on a rack of servers vs. a single server at a time. This optimization is helpful for a clustered workload, where it may be more efficient to drain and reboot a rack versus a single server. It also leverages metrics to determine when it is safe to lose a clustered member, execute the reboot, and ensure the system is healthy through the process.
In 2024, Cloudflare migrated Control Plane workloads to leverage Reboau and follow the same kernel upgrade cadence as the network. Now all of our infrastructure benefits from faster patching of the Linux kernel, to improve security and reliability for our customers.
Irrespective of the industry, if your organization builds software, we encourage you to familiarize yourself with CISA’s ‘Secure by Design’ principles and create a plan to implement them in your company. The CISA Secure by Design pledge is built around seven security goals, prioritizing the security of customers, and challenges organizations to think differently about security.
By implementing automated security patching through kernel updates, Cloudflare has demonstrated measurable progress in implementing functionality that allows automatic deployment of software patches by default. This process highlights Cloudflare's commitment to protecting our infrastructure and keeping our customers against emerging vulnerabilities.
For more updates on our CISA progress, you check out our blog. Cloudflare has delivered five of the seven CISA Secure by Design pledge goals, and we aim to complete the entirety of the pledge goals by May 2025.