
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built and the technologies we use, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Mon, 06 Apr 2026 13:10:44 GMT</lastBuildDate>
        <item>
            <title><![CDATA[A one-line Kubernetes fix that saved 600 hours a year]]></title>
            <link>https://blog.cloudflare.com/one-line-kubernetes-fix-saved-600-hours-a-year/</link>
            <pubDate>Thu, 26 Mar 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ When we investigated why our Atlantis instance took 30 minutes to restart, we discovered a bottleneck in how Kubernetes handles volume permissions. By adjusting the fsGroupChangePolicy, we reduced restart times to 30 seconds. ]]></description>
            <content:encoded><![CDATA[ <p>Every time we restarted Atlantis, the tool we use to plan and apply Terraform changes, we’d be stuck for 30 minutes waiting for it to come back up. No plans, no applies, no infrastructure changes for any repository managed by Atlantis. With roughly 100 restarts a month for credential rotations and onboarding, that added up to over <b>50 hours of blocked engineering time every month</b>, and paged the on-call engineer every time.</p><p>This was ultimately caused by a safe default in Kubernetes that had silently become a bottleneck as the persistent volume used by Atlantis grew to millions of files. Here’s how we tracked it down and fixed it with a one-line change.</p>
    <div>
      <h3>Mysteriously slow restarts</h3>
      <a href="#mysteriously-slow-restarts">
        
      </a>
    </div>
    <p>We manage dozens of Terraform projects with GitLab merge requests (MRs) using <a href="https://www.runatlantis.io/"><u>Atlantis</u></a>, which handles planning and applying. It enforces locking to ensure that only one MR can modify a project at a time. </p><p>It runs on Kubernetes as a singleton StatefulSet and relies on a Kubernetes PersistentVolume (PV) to keep track of repository state on disk. Whenever a Terraform project needs to be onboarded or offboarded, or credentials used by Terraform are updated, we have to restart Atlantis to pick up those changes — a process that can take 30 minutes.</p><p>The slow restart became impossible to ignore when we recently ran out of inodes on the persistent storage used by Atlantis, forcing us to restart it to resize the volume. Inodes are consumed by each file and directory entry on disk, and the number available to a filesystem is determined by parameters passed when creating it. The Ceph persistent storage implementation provided by our Kubernetes platform does not expose a way to pass flags to <code>mkfs</code>, so we’re at the mercy of default values: growing the filesystem is the only way to grow available inodes, and resizing a PV requires a pod restart. </p><p>We talked about extending the alert window, but that would just mask the problem and delay our response to actual issues. Instead, we decided to investigate exactly why it was taking so long.</p>
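    <p>As an aside, inode exhaustion is easy to spot from a shell. A minimal sketch (the demo directory below is an illustrative stand-in; on the real PV the entry count was in the millions):</p>
            <pre><code># Inode exhaustion shows up in df -i long before df -h looks full
df -i /
# Every file and directory entry consumes an inode; counting entries
# on a volume is a quick proxy for how many inodes it uses
rm -rf /tmp/inode-demo
mkdir -p /tmp/inode-demo/a /tmp/inode-demo/b
touch /tmp/inode-demo/a/f1 /tmp/inode-demo/a/f2
ENTRIES=$(find /tmp/inode-demo | wc -l)
echo "$ENTRIES entries"
</code></pre>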
    <div>
      <h3>Bad behavior</h3>
      <a href="#bad-behavior">
        
      </a>
    </div>
    <p>When we were asked to do a rolling restart of Atlantis to pick up a change to the secrets it uses, we would run <code>kubectl rollout restart statefulset atlantis</code>, which would gracefully terminate the existing Atlantis pod before spinning up a new one. The new pod would appear almost immediately, but looking at it would show:</p>
            <pre><code>$ kubectl get pod atlantis-0
NAME         READY   STATUS     RESTARTS   AGE
atlantis-0   0/1     Init:0/1   0          30m
</code></pre>
            <p>...so what gives? Naturally, the first thing to check would be events for that pod. It's waiting around for an init container to run, so maybe the pod events would illuminate why?</p>
            <pre><code>$ kubectl events --for=pod/atlantis-0
LAST SEEN   TYPE      REASON                   OBJECT                   MESSAGE
30m         Normal    Killing                  Pod/atlantis-0   Stopping container atlantis-server
30m         Normal    Scheduled                Pod/atlantis-0   Successfully assigned atlantis/atlantis-0 to 36com1167.cfops.net
22s         Normal    Pulling                  Pod/atlantis-0   Pulling image "oci.example.com/git-sync/master:v4.1.0"
22s         Normal    Pulled                   Pod/atlantis-0   Successfully pulled image "oci.example.com/git-sync/master:v4.1.0" in 632ms (632ms including waiting). Image size: 58518579 bytes.</code></pre>
            <p>That looks almost normal... but what's taking so long between scheduling the pod and actually starting to pull the image for the init container? Unfortunately that was all the data we had to go on from Kubernetes itself. But surely there <i>had</i> to be something more that could tell us why it was taking so long to actually start running the pod.</p>
    <div>
      <h3>Going deeper</h3>
      <a href="#going-deeper">
        
      </a>
    </div>
    <p>In Kubernetes, a component called <code>kubelet</code> that runs on each node is responsible for coordinating pod creation, mounting persistent volumes, and many other things. From my time on our Kubernetes team, I know that <code>kubelet</code> runs as a systemd service and so its logs should be available to us in Kibana. Since the pod has been scheduled, we know the host name we're interested in, and the log messages from <code>kubelet</code> include the associated object, so we could filter for <code>atlantis</code> to narrow down the log messages to anything we found interesting.</p><p>We were able to observe the Atlantis PV being mounted shortly after the pod was scheduled. We also observed all the secret volumes mount without issue. However, there was still a big unexplained gap in the logs. We saw:</p>
            <pre><code>[operation_generator.go:664] "MountVolume.MountDevice succeeded for volume \"pvc-94b75052-8d70-4c67-993a-9238613f3b99\" (UniqueName: \"kubernetes.io/csi/rook-ceph-nvme.rbd.csi.ceph.com^0001-000e-rook-ceph-nvme-0000000000000002-a6163184-670f-422b-a135-a1246dba4695\") pod \"atlantis-0\" (UID: \"83089f13-2d9b-46ed-a4d3-cba885f9f48a\") device mount path \"/state/var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph-nvme.rbd.csi.ceph.com/d42dcb508f87fa241a49c4f589c03d80de2f720a87e36932aedc4c07840e2dfc/globalmount\"" pod="atlantis/atlantis-0"
[pod_workers.go:1298] "Error syncing pod, skipping" err="unmounted volumes=[atlantis-storage], unattached volumes=[], failed to process volumes=[]: context deadline exceeded" pod="atlantis/atlantis-0" podUID="83089f13-2d9b-46ed-a4d3-cba885f9f48a"
[util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="atlantis/atlantis-0"</code></pre>
            <p>The last two messages looped several times until eventually we observed the pod actually start up properly.</p><p>So <code>kubelet</code> thinks that the pod is otherwise ready to go, but it's not starting it and something's timing out.</p>
    <div>
      <h3>The missing piece</h3>
      <a href="#the-missing-piece">
        
      </a>
    </div>
    <p>The lowest-level logs we had on the pod didn't show us what was going on. What else did we have to look at? Well, the last message before the hang was the PV being mounted onto the node. Ordinarily, if a PV has issues mounting (e.g. due to still being stuck mounted on another node), that bubbles up as an event. But something was still going on here, and the only thing left to drill down on was the PV itself. So I plugged the PV name into Kibana, since it's unique enough to make a good search term... and immediately something jumped out:</p>
            <pre><code>[volume_linux.go:49] Setting volume ownership for /state/var/lib/kubelet/pods/83089f13-2d9b-46ed-a4d3-cba885f9f48a/volumes/kubernetes.io~csi/pvc-94b75052-8d70-4c67-993a-9238613f3b99/mount and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699</code></pre>
            <p>Remember how I said at the beginning we'd just run out of inodes? In other words, we have a <i>lot</i> of files on this PV. When the PV is mounted, <code>kubelet</code> is running <code>chgrp -R</code> to recursively change the group on every file and folder across this filesystem. No wonder it was taking so long — that's a ton of entries to traverse even on fast flash storage!</p><p>The pod's <code>spec.securityContext</code> included <code>fsGroup: 1</code>, which ensures that processes running under GID 1 can access files on the volume. Atlantis runs as a non-root user, so without this setting it wouldn’t have permission to read or write to the PV. The way Kubernetes enforces this is by recursively updating ownership on the entire PV <i>every time it's mounted</i>.</p>
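            <p>To get a feel for the cost mechanically, here is a scaled-down sketch of what <code>kubelet</code> effectively does on every mount when <code>fsGroup</code> is set — 2,000 files here, versus millions of entries on the real PV:</p>
            <pre><code># Build a small stand-in for the PV
rm -rf /tmp/pv-sim
mkdir -p /tmp/pv-sim
for i in $(seq 1 2000); do touch "/tmp/pv-sim/f$i"; done
# kubelet's fsGroup handling is the moral equivalent of this recursive
# group change: one traversal and one syscall per entry. Run it against
# a volume with millions of entries and the 30-minute stall makes sense.
chgrp -R "$(id -g)" /tmp/pv-sim
COUNT=$(find /tmp/pv-sim | wc -l)
echo "walked $COUNT entries"
</code></pre>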
    <div>
      <h3>The fix</h3>
      <a href="#the-fix">
        
      </a>
    </div>
    <p>Fixing this was heroically...boring. Since version 1.20, Kubernetes has supported an additional field on <code>pod.spec.securityContext</code> called <code>fsGroupChangePolicy</code>. This field defaults to <code>Always</code>, which leads to the exact behavior we saw here. The other option, <code>OnRootMismatch</code>, only changes permissions if the root directory of the PV doesn't already have the right ones. One caveat: with <code>OnRootMismatch</code>, files created later with the wrong group ownership will never be fixed automatically, so don't set it unless you know exactly how files are created on your PV. We checked to make sure that nothing should be changing the group on anything in the PV, and then set that field: </p>
            <pre><code>spec:
  template:
    spec:
      securityContext:
        fsGroupChangePolicy: OnRootMismatch</code></pre>
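            <p>Before rolling this out, you may want to find every workload that sets <code>fsGroup</code> but still inherits the <code>Always</code> default. A rough sketch over locally rendered manifests (the directory and file below are illustrative stand-ins, not our real layout):</p>
            <pre><code># Stand-in manifest that sets fsGroup but not fsGroupChangePolicy
rm -rf /tmp/manifests
mkdir -p /tmp/manifests
printf 'spec:\n  template:\n    spec:\n      securityContext:\n        fsGroup: 1\n' | tee /tmp/manifests/atlantis.yaml
# List files that mention fsGroup but never set fsGroupChangePolicy
FLAGGED=$(grep -rl "fsGroup:" /tmp/manifests | xargs grep -L "fsGroupChangePolicy")
echo "$FLAGGED"
</code></pre>
            <p>In a live cluster, the same check could be run against the output of <code>kubectl get statefulsets -A -o yaml</code> instead of files on disk.</p>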
            <p>Now, it takes about 30 seconds to restart Atlantis, down from the 30 minutes it was when we started.</p><p>Default Kubernetes settings are sensible for small volumes, but they can become bottlenecks as data grows. For us, this one-line change to <code>fsGroupChangePolicy</code> reclaimed nearly 50 hours of blocked engineering time per month. This was time our teams had been spending waiting for infrastructure changes to go through, and time that our on-call engineers had been spending responding to false alarms. That’s roughly 600 hours a year returned to productive work, from a fix that took longer to diagnose than deploy.</p><p>Safe defaults in Kubernetes are designed for small, simple workloads. But as you scale, they can slowly become bottlenecks. If you’re running workloads with large persistent volumes, it’s worth checking whether recursive permission changes like this are silently eating your restart time. Audit your <code>securityContext</code> settings, especially <code>fsGroup</code> and <code>fsGroupChangePolicy</code>. <code>OnRootMismatch</code> has been available since v1.20.</p><p>Not every fix is heroic or complex, and it’s usually worth asking “why does the system behave this way?”</p><p>If debugging infrastructure problems at scale sounds interesting, <a href="https://cloudflare.com/careers"><u>we’re hiring</u></a>. Come join us on the <a href="https://community.cloudflare.com/"><u>Cloudflare Community</u></a> or our <a href="https://discord.cloudflare.com/"><u>Discord</u></a> to talk shop.</p> ]]></content:encoded>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Terraform]]></category>
            <category><![CDATA[Platform Engineering]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[SRE]]></category>
            <guid isPermaLink="false">6bSk27AUeu3Ja7pTySyy0t</guid>
            <dc:creator>Braxton Schafer</dc:creator>
        </item>
        <item>
            <title><![CDATA[Inside Gen 13: how we built our most powerful server yet]]></title>
            <link>https://blog.cloudflare.com/gen13-config/</link>
            <pubDate>Mon, 23 Mar 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare's Gen 13 servers introduce AMD EPYC™ Turin 9965 processors and a transition to 100 GbE networking to meet growing traffic demands. In this technical deep dive, we explain the engineering rationale behind each major component selection. ]]></description>
            <content:encoded><![CDATA[ <p>A few months ago, Cloudflare announced <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>the transition to FL2</u></a>, our Rust-based rewrite of Cloudflare's core request handling layer. This transition accelerates our ability to help build a better Internet for everyone. Alongside that software migration, we have refreshed our server hardware design with improved capabilities and better efficiency to serve the evolving demands of our network and software stack. Gen 13 is built around a 192-core AMD EPYC™ Turin 9965 processor, 768 GB of DDR5-6400 memory, 24 TB of PCIe 5.0 NVMe storage, and a dual-port 100 GbE network interface card.</p><p>Gen 13 delivers:</p><ul><li><p>Up to 2x throughput compared to Gen 12 while staying within our latency SLA</p></li><li><p>Up to 50% improvement in performance / watt efficiency, reducing data center expansion costs</p></li><li><p>Up to 60% higher throughput per rack at a constant rack power budget</p></li><li><p>2x memory capacity, 1.5x storage capacity, 4x network bandwidth</p></li><li><p>Hardware support for PCIe encryption in addition to memory encryption</p></li><li><p>Improved support for thermally demanding drop-in PCIe accelerators</p></li></ul><p>This blog post covers the engineering rationale behind each major component selection: what we evaluated, what we chose, and why.</p><table><tr><td><p>Generation</p></td><td><p>Gen 13 Compute</p></td><td><p>Previous Gen 12 Compute</p></td></tr><tr><td><p>Form Factor</p></td><td><p>2U1N, Single socket</p></td><td><p>2U1N, Single socket</p></td></tr><tr><td><p>Processor</p></td><td><p>AMD EPYC™ 9965 
Turin 192-Core Processor</p></td><td><p>AMD EPYC™ 9684X 
Genoa-X 96-Core Processor</p></td></tr><tr><td><p>Memory</p></td><td><p>768GB of DDR5-6400 x12 memory channel</p></td><td><p>384GB of DDR5-4800 x12 memory channel</p></td></tr><tr><td><p>Storage</p></td><td><p>x3 E1.S NVMe</p><p>
</p><p> Samsung PM9D3a 7.68TB / 
Micron 7600 Pro 7.68TB</p></td><td><p>x2 E1.S NVMe </p><p>
</p><p>Samsung PM9A3 7.68TB / 
Micron 7450 Pro 7.68TB</p></td></tr><tr><td><p>Network</p></td><td><p>Dual 100 GbE OCP 3.0 </p><p>
</p><p>Intel Ethernet Network Adapter E830-CDA2 /
NVIDIA Mellanox ConnectX-6 Dx</p></td><td><p>Dual 25 GbE OCP 3.0</p><p>
</p><p>Intel Ethernet Network Adapter E810-XXVDA2 / 
NVIDIA Mellanox ConnectX-6 Lx</p></td></tr><tr><td><p>System Management</p></td><td><p>DC-SCM 2.0 ASPEED AST2600 (BMC) + AST1060 (HRoT)</p></td><td><p>DC-SCM 2.0 ASPEED AST2600 (BMC) + AST1060 (HRoT)</p></td></tr><tr><td><p>Power Supply</p></td><td><p>1300W, Titanium Grade</p></td><td><p>800W, Titanium Grade</p></td></tr></table>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Gawj2GP8s2CCZCWwNgBiB/587b0ed5ef65cf95cf178e5457150b6a/image3.png" />
          </figure><p><i><sup>Figure: Gen 13 server</sup></i></p>
    <div>
      <h2>CPU</h2>
      <a href="#cpu">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>AMD EPYC™ 9684X Genoa-X 96-Core (400W TDP, 1152 MB L3 Cache)</p></td></tr><tr><td><p>Gen 13</p></td><td><p>AMD EPYC™ 9965 Turin Dense 192-Core (500W TDP, 384 MB L3 Cache)</p></td></tr></table><p>During the design phase, we evaluated several 5th generation AMD EPYC™ Processors, code-named Turin, in Cloudflare’s hardware lab: AMD Turin 9755, AMD Turin 9845, and AMD Turin 9965. The table below summarizes the differences in <a href="https://www.amd.com/content/dam/amd/en/documents/epyc-business-docs/datasheets/amd-epyc-9005-series-processor-datasheet.pdf"><u>specifications</u></a> of the candidates for Gen 13 servers against the AMD Genoa-X 9684X used in our <a href="https://blog.cloudflare.com/gen-12-servers/"><u>Gen 12 servers</u></a>. Notably, all three candidates offer increases in core count but with smaller L3 cache per core. However, with the <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>migration to FL2</u></a>, the new workloads are <a href="https://blog.cloudflare.com/gen13-launch/"><u>less dependent on L3 cache and scale up well with the increased core count to achieve up to 100% increase in throughput</u></a>.</p><p>The three CPU candidates are designed to target different use cases: AMD Turin 9755 offers superior per-core performance, AMD Turin 9965 trades per-core performance for efficiency, and AMD Turin 9845 trades core count for lower socket power. 
We evaluated all three in our production environment.</p><table><tr><td><p>CPU Model</p></td><td><p>AMD Genoa-X 9684X</p></td><td><p>AMD Turin 9755</p></td><td><p>AMD Turin 9845</p></td><td><p>AMD Turin 9965</p></td></tr><tr><td><p>For server platform</p></td><td><p>Gen 12</p></td><td><p>Gen 13 candidate</p></td><td><p>Gen 13 candidate</p></td><td><p>Gen 13 candidate</p></td></tr><tr><td><p># of CPU Cores</p></td><td><p>96</p></td><td><p>128</p></td><td><p>160</p></td><td><p>192</p></td></tr><tr><td><p># of Threads</p></td><td><p>192</p></td><td><p>256</p></td><td><p>320</p></td><td><p>384</p></td></tr><tr><td><p>Base Clock</p></td><td><p>2.4 GHz</p></td><td><p>2.7 GHz</p></td><td><p>2.1 GHz</p></td><td><p>2.25 GHz</p></td></tr><tr><td><p>Max Boost Clock</p></td><td><p>3.7 GHz</p></td><td><p>4.1 GHz</p></td><td><p>3.7 GHz</p></td><td><p>3.7 GHz</p></td></tr><tr><td><p>All Core Boost Clock</p></td><td><p>3.42 GHz</p></td><td><p>4.1 GHz</p></td><td><p>3.25 GHz</p></td><td><p>3.35 GHz</p></td></tr><tr><td><p>Total L3 Cache</p></td><td><p>1152 MB</p></td><td><p>512 MB</p></td><td><p>320 MB</p></td><td><p>384 MB</p></td></tr><tr><td><p>L3 cache per core</p></td><td><p>12 MB / core</p></td><td><p>4 MB / core</p></td><td><p>2 MB / core</p></td><td><p>2 MB / core</p></td></tr><tr><td><p>Maximum configurable TDP</p></td><td><p>400W</p></td><td><p>500W</p></td><td><p>390W</p></td><td><p>500W</p></td></tr></table>
    <div>
      <h3>Why AMD Turin 9965?</h3>
      <a href="#why-amd-turin-9965">
        
      </a>
    </div>
    <p>First, <b>FL2 ended the L3 cache crunch</b>.</p><p>L3 cache is the large, last-level cache shared among all CPU cores on the same compute die to store frequently used data. It bridges the gap between slow main memory external to the CPU, and the fast but smaller L1 and L2 cache on the CPU, reducing the latency for the CPU to access data.</p><p>Some may notice that the 9965 has only 2 MB of L3 cache per core, an 83.3% reduction from the 12 MB per core on Gen 12’s Genoa-X 9684X. Why trade away the very cache advantage that gave Gen 12 its edge? The answer lies in how our workloads have evolved.</p><p>Cloudflare has <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>migrated from FL1 to FL2</u></a>, a complete rewrite of our request handling layer in Rust. With the new software stack, Cloudflare’s request processing pipeline has become significantly less dependent on large L3 cache. FL2 workloads <a href="https://blog.cloudflare.com/gen13-launch/"><u>scale nearly linearly with core count</u></a>, and the 9965’s 192 cores provide a 2x increase in hardware threads over Gen 12.</p><p>Second, <b>performance per total cost of ownership (TCO)</b>. During production evaluation, the 9965’s 192 cores delivered the highest aggregate requests per second of the three candidates, and its performance-per-watt scaled favorably at 500W TDP, yielding superior rack-level TCO.</p><table><tr><td><p>
</p></td><td><p><b>Gen 12 </b></p></td><td><p><b>Gen 13 </b></p></td></tr><tr><td><p>Processor</p></td><td><p>AMD EPYC™ 4th Gen Genoa-X 9684X</p></td><td><p>AMD EPYC™ 5th Gen Turin 9965</p></td></tr><tr><td><p>Core count</p></td><td><p>96C/192T</p></td><td><p>192C/384T</p></td></tr><tr><td><p>FL throughput</p></td><td><p>Baseline</p></td><td><p>Up to +100%</p></td></tr><tr><td><p>Performance per watt</p></td><td><p>Baseline</p></td><td><p>Up to +50%</p></td></tr></table><p>Third, <b>operational simplicity</b>. Our operational teams have a strong preference for fewer, higher-density servers. Managing a fleet of 192-core machines means fewer nodes to provision, patch, and monitor per unit of compute delivered. This directly reduces operational overhead across our global network.</p><p>Finally, the platform is <b>forward compatible</b>. The AMD processor architecture supports DDR5-6400, PCIe Gen 5.0, and CXL 2.0 Type 3 memory across all SKUs. AMD Turin 9965 has the highest number of high-performing cores per socket in the industry, maximizing compute density and keeping the platform competitive and relevant for years to come. By moving from AMD Genoa-X 9684X to AMD Turin 9965, we get longer security support from AMD, extending the useful life of Gen 13 servers before they become obsolete and need to be refreshed.</p>
    <div>
      <h2>Memory</h2>
      <a href="#memory">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>12x 32GB DDR5-4800 2Rx8 (384 GB total, 4 GB/core)</p></td></tr><tr><td><p>Gen 13</p></td><td><p>12x 64GB DDR5-6400 2Rx4 (768 GB total, 4 GB/core)</p></td></tr></table><p>Because the AMD Turin processor has twice the core count of the previous generation, it demands more memory resources, both in capacity and in bandwidth, to deliver throughput gains.</p>
    <div>
      <h3>Maximizing bandwidth with 12 channels</h3>
      <a href="#maximizing-bandwidth-with-12-channels">
        
      </a>
    </div>
    <p>The chosen AMD EPYC™ 9965 CPU supports twelve memory channels, and for Gen 13, we are populating every single one of them. We’ve selected 64 GB DDR5-6400 ECC RDIMMs in a “one DIMM per channel” (1DPC) configuration.</p><p>This setup provides 614 GB/s of peak memory bandwidth per socket, a 33.3% increase compared to our Gen 12 server platform. By utilizing all 12 channels, we ensure that the CPU is never “starved” for data, even during the most memory-intensive parallel workloads.</p><p>Populating all twelve channels in a balanced configuration — equal capacity per channel, with no mixed configurations — is common best practice. This matters operationally: AMD Turin processors interleave across all memory channels with the same DIMM type, same memory capacity and same rank configuration. Interleaving increases memory bandwidth by spreading contiguous memory access across all memory channels in the interleave set instead of sending all memory access to a single or a small subset of memory channels. </p>
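    <p>That 614 GB/s figure falls out of standard DDR5 accounting: each channel is 64 bits (8 bytes) wide per transfer, so twelve channels at 6400 MT/s deliver:</p>
            <pre><code># 12 channels x 6400 MT/s x 8 bytes per transfer
echo "$(( 12 * 6400 * 8 )) MB/s total"             # 614400 MB/s, i.e. ~614 GB/s
echo "$(( 12 * 6400 * 8 / 384 )) MB/s per thread"  # ~1.6 GB/s across 384 hardware threads
</code></pre>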
    <div>
      <h3>The 4 GB per core “sweet spot”</h3>
      <a href="#the-4-gb-per-core-sweet-spot">
        
      </a>
    </div>
    <p>Our Gen 12 servers are configured with 4GB per core. We revisited that decision as we designed Gen 13.</p><p>Cloudflare launches a lot of new products and services every month, and each new product or service demands an incremental amount of memory capacity. These demands accumulate over time and could become a source of memory pressure if capacity is not sized appropriately.</p><p>Our initial requirement called for a memory-to-core ratio between 4 GB and 6 GB per core. With 192 cores on the AMD Turin 9965, that translates to a range of 768 GB to 1152 GB. Note that at these capacities, DIMM modules come only in coarse capacity steps. With 12 channels in a 1DPC configuration, our options are 12x 48GB (576 GB), 12x 64GB (768 GB), or 12x 96GB (1152 GB).</p><ul><li><p>12x 48GB = 576 GB, or 1.5 GB/thread. The memory capacity of this configuration is too low; this would starve memory-hungry workloads and violate the lower bound.</p></li><li><p>12x 96GB = 1152 GB, or 3.0 GB/thread. This would be a 50% capacity increase per core and would also result in higher power consumption and a substantial increase in cost, especially in the current market conditions where memory prices are 10x what they were a year ago.</p></li><li><p>12x 64GB = 768 GB, or 2.0 GB/thread (4 GB/core). This configuration is consistent with our Gen 12 memory-to-core ratio, and represents a 2x increase in memory capacity per server. Keeping the memory capacity at 4 GB per core provides sufficient capacity for workloads that scale with core count, like our primary workload, FL, and provides sufficient headroom for future growth without overprovisioning.</p></li></ul><p><a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>FL2 uses memory more efficiently</u></a> than FL1 did: our internal measurements show FL2 uses less than half the CPU of FL1, and far less than half the memory. 
The capacity freed up by the software stack migration provides ample headroom to support Cloudflare growth for the next few years.</p><p>The decision: 12x 64GB for 768 GB total. This maintains the proven 4 GB/core ratio, provides a 2x total capacity increase over Gen 12, and stays within the DIMM cost curve sweet spot.</p>
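    <p>The candidate math above, spelled out:</p>
            <pre><code>echo "$(( 12 * 48 )) GB"   # 576 GB  -> 1.5 GB/thread, below the lower bound
echo "$(( 12 * 64 )) GB"   # 768 GB  -> 2.0 GB/thread (4 GB/core), the chosen configuration
echo "$(( 12 * 96 )) GB"   # 1152 GB -> 3.0 GB/thread, a substantial cost increase
</code></pre>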
    <div>
      <h3>Efficiency through dual rank</h3>
      <a href="#efficiency-through-dual-rank">
        
      </a>
    </div>
    <p>In Gen 12, we demonstrated that dual-rank DIMMs provide measurably higher memory throughput than single-rank modules, with advantages of up to 17.8% at a 1:1 read-write ratio. Dual-rank DIMMs are faster because they allow the memory controller to access one rank while another is refreshing. That same principle carries forward here.</p><p>Our requirement also calls for approximately 1 GB/s of memory bandwidth per hardware thread. With 614 GB/s of peak bandwidth across 384 threads, we deliver 1.6 GB/s per thread, comfortably exceeding the minimum. Production analysis has shown that Cloudflare workloads are not memory-bandwidth-bound, so we bank the headroom as margin for future workload growth.</p><p>By opting for 2Rx4 DDR5 RDIMMs at the maximum supported 6400 MT/s, we ensure we get the lowest latency and best performance from our Gen 13 platform memory configuration.</p>
    <div>
      <h2>Storage</h2>
      <a href="#storage">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>x2 E1.S NVMe PCIe 4.0, 16 TB total</p><p>Samsung PM9A3 7.68TB</p><p>Micron 7450 Pro 7.68TB</p></td></tr><tr><td><p>Gen 13</p></td><td><p>x3 E1.S NVMe PCIe 5.0, 24 TB total</p><p>Samsung PM9D3a 7.68TB</p><p>Micron 7600 Pro 7.68TB</p><p>+10x U.2 NVMe PCIe 5.0 option</p></td></tr></table><p>Our storage architecture underwent a transformation in Gen 12 when we pivoted from M.2 to EDSFF E1.S. For Gen 13, we are increasing the storage capacity and the bandwidth to align with the latest technology. We have also added a front drive bay for flexibility to add up to 10x U.2 drives to keep pace with Cloudflare storage product growth. </p>
    <div>
      <h3>The move to PCIe 5.0</h3>
      <a href="#the-move-to-pcie-5-0">
        
      </a>
    </div>
    <p>Gen 13 is configured with PCIe Gen 5.0 NVMe drives. While Gen 4.0 served us well, the move to Gen 5.0 ensures that our storage subsystem can serve data at improved latency, and keep up with increased storage bandwidth demand from the new processor. </p>
    <div>
      <h3>16 TB to 24 TB</h3>
      <a href="#16-tb-to-24-tb">
        
      </a>
    </div>
    <p>Beyond the speed increase, we are physically expanding the array from two to three NVMe drives. Our Gen 12 server platform was designed with four E1.S storage drive slots, but only two slots were populated with 8TB drives. The Gen 13 server platform uses the same design with four E1.S storage drive slots available, but with three slots populated with 8TB drives. Why add a third drive? This increases our storage capacity per server from 16TB to 24TB, expanding our global storage footprint to maintain and improve CDN cache performance. This supports growth projections for Durable Objects, Containers, and Quicksilver services, too.</p>
    <div>
      <h3>Front drive bay to support additional drives</h3>
      <a href="#front-drive-bay-to-support-additional-drives">
        
      </a>
    </div>
    <p>For Gen 13, the chassis is designed with a front drive bay that can support up to ten U.2 PCIe Gen 5.0 NVMe drives. The front drive bay provides the option for Cloudflare to use the same chassis across compute and storage platforms, as well as the flexibility to convert a compute SKU to a storage SKU when needed. </p>
    <div>
      <h3>Endurance and reliability</h3>
      <a href="#endurance-and-reliability">
        
      </a>
    </div>
    <p>We designed our servers to have a 5-year operational life, and require storage drive endurance sufficient to sustain 1 DWPD (Drive Writes Per Day) over the full server lifespan.</p><p>Both the Samsung PM9D3a and Micron 7600 Pro meet the 1 DWPD specification with a hardware over-provisioning (OP) of approximately 7%. If future workload profiles demand higher endurance, we have the option to hold back additional user capacity to increase effective OP.</p>
    <div>
      <h3>NVMe 2.0 and OCP NVMe 2.0 compliance</h3>
      <a href="#nvme-2-0-and-ocp-nvme-2-0-compliance">
        
      </a>
    </div>
    <p>Both the Samsung PM9D3a and Micron 7600 adopt the NVMe 2.0 specification (up from NVMe 1.4) and the OCP NVMe Cloud SSD Specification 2.0. Key improvements include Zoned Namespaces (ZNS) for better write amplification management, Simple Copy Command for intra-device data movement without crossing the PCIe bus, and enhanced Command and Feature Lockdown for tighter security controls. The OCP 2.0 spec also adds deeper telemetry and debug capabilities purpose-built for datacenter operations, which aligns with our emphasis on fleet-wide manageability.</p>
    <div>
      <h3>Thermal efficiency</h3>
      <a href="#thermal-efficiency">
        
      </a>
    </div>
    <p>The storage drives will continue to be in the E1.S 15mm form factor. Its high-surface-area design is essential for cooling these new Gen 5.0 controllers, which can pull upwards of 25W under sustained heavy I/O. The 2U chassis provides ample airflow over the E1.S drives as well as U.2 drive bays, a design advantage we validated in Gen 12 when we made the decision to move from 1U to 2U.</p>
    <div>
      <h2>Network</h2>
      <a href="#network">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>Dual 25 GbE port OCP 3.0 NIC </p><p>Intel E810-XXVDA2</p><p>NVIDIA Mellanox ConnectX-6 Lx</p></td></tr><tr><td><p>Gen 13</p></td><td><p>Dual 100 GbE port OCP 3.0 NIC</p><p>Intel E830-CDA2</p><p>NVIDIA Mellanox ConnectX-6 Dx</p></td></tr></table><p>Dual 25 GbE has been the backbone of our fleet <a href="https://blog.cloudflare.com/a-tour-inside-cloudflares-g9-servers/"><u>since 2018</u></a>. It served us well for more than eight years, but as CPUs improved to serve more requests and our products scaled, we’ve officially hit the wall. For Gen 13, we are quadrupling our per-port bandwidth.</p>
    <div>
      <h3>Why 100 GbE and why now?</h3>
      <a href="#why-100-gbe-and-why-now">
        
      </a>
    </div>
    <p>Network Interface Card (NIC) bandwidth must keep pace with compute performance growth. With 192 modern cores, our 25 GbE links will become a measurable bottleneck. Production data from our co-locations worldwide over a week showed that, on our Gen 12, P95 bandwidth per port is consistently &gt;50% of available bandwidth. Since throughput is doubling per server on Gen 13, we are at risk of saturating the NIC bandwidth.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2lxP5Vy6y6CzCk1rE9FKVU/064d9e5392a08e92b38bca637d053573/image4.png" />
          </figure><p><sup><i>Figure: on Gen 12, P95 bandwidth per port is consistently &gt;50% of available bandwidth</i></sup></p><p>The decision to go to 100 GbE rather than 50 GbE was driven by industry economics: 50 GbE transceiver volumes remain low in the industry, making them a poor supply chain bet. Dual 100 GbE ports also give us 200 Gb/s of aggregate bandwidth per server, future-proofing against the next several years of traffic growth.</p>
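<p>A back-of-envelope check makes the saturation risk concrete. This sketch uses only the figures quoted above (25 GbE links, P95 utilization above 50%, roughly 2x throughput per server on Gen 13); the Rust snippet and its helper function are illustrative, not Cloudflare tooling.</p>

```rust
// Projected P95 per-port demand, given the current link speed, the
// observed P95 utilization, and a per-server throughput growth factor.
fn projected_p95_gbps(link_gbps: f64, p95_utilization: f64, growth: f64) -> f64 {
    link_gbps * p95_utilization * growth
}

fn main() {
    // Gen 12 observation: >50% of a 25 GbE port at P95.
    // Gen 13 roughly doubles per-server throughput.
    let demand = projected_p95_gbps(25.0, 0.5, 2.0);
    println!("projected P95 demand: {demand} Gb/s"); // 25 Gb/s: a full 25 GbE port

    // On a 100 GbE port the same demand leaves substantial headroom.
    let headroom = 100.0 - demand;
    println!("headroom on a 100 GbE port: {headroom} Gb/s");
}
```

<p>In other words, the projected P95 demand alone would consume an entire 25 GbE port, before any traffic growth or spikes.</p>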
    <div>
      <h3>Hardware choices and compatibility</h3>
      <a href="#hardware-choices-and-compatibility">
        
      </a>
    </div>
    <p>We are maintaining our dual-vendor strategy to ensure supply chain resilience, a lesson hard-learned during the pandemic when single-sourcing the Gen 11 NIC left us scrambling.</p><p>Both NICs are compliant with <a href="https://www.servethehome.com/ocp-nic-3-0-form-factors-quick-guide-intel-broadcom-nvidia-meta-inspur-dell-emc-hpe-lenovo-gigabyte-supermicro/"><u>OCP 3.0 SFF/TSFF</u></a> form factor with the integrated pull tab, maintaining chassis commonality with Gen 12 and ensuring field technicians need no new tools or training for swaps.</p>
    <div>
      <h3>PCIe Allocation</h3>
      <a href="#pcie-allocation">
        
      </a>
    </div>
    <p>The OCP 3.0 NIC slot is allocated PCIe 4.0 x16 lanes on the motherboard, providing 256 Gb/s of raw bandwidth in each direction, more than enough for dual 100 GbE (200 Gb/s aggregate) with room to spare.</p>
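<p>The lane math behind that headroom is simple: PCIe 4.0 signals at 16 GT/s per lane, so an x16 slot carries 16 × 16 = 256 Gb/s of raw bandwidth per direction (slightly less in practice after 128b/130b line encoding). A quick illustrative check in Rust:</p>

```rust
// Raw PCIe bandwidth per direction: transfer rate per lane times lane count.
// (PCIe 4.0 is 16 GT/s per lane; 128b/130b encoding overhead is ignored here.)
fn pcie_raw_gbps(gt_per_s_per_lane: u32, lanes: u32) -> u32 {
    gt_per_s_per_lane * lanes
}

fn main() {
    let raw = pcie_raw_gbps(16, 16); // PCIe 4.0 x16
    let nic_aggregate: u32 = 2 * 100; // dual 100 GbE ports
    println!("PCIe 4.0 x16 raw: {raw} Gb/s per direction");
    println!("NIC aggregate: {nic_aggregate} Gb/s, headroom: {} Gb/s", raw - nic_aggregate);
}
```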
    <div>
      <h2>Management</h2>
      <a href="#management">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p><a href="https://blog.cloudflare.com/introducing-the-project-argus-datacenter-ready-secure-control-module-design-specification/"><u>Project Argus</u></a> Data Center Secure Control Module 2.0</p></td></tr><tr><td><p>Gen 13</p></td><td><p><a href="https://blog.cloudflare.com/introducing-the-project-argus-datacenter-ready-secure-control-module-design-specification/"><u>Project Argus</u></a> Data Center Secure Control Module 2.0</p><p>PCIe encryption</p></td></tr></table><p>We are maintaining the architectural shift, introduced in Gen 12, of separating management and security-related components from the motherboard onto the <a href="https://blog.cloudflare.com/introducing-the-project-argus-datacenter-ready-secure-control-module-design-specification/"><u>Project Argus</u></a> Data Center Secure Control Module 2.0.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6F3XH0uQvBry9LJkZlVZOi/42f507d0d46d276db1e3724b21ea49dc/image1.png" />
          </figure><p><sup><i>Figure: Project Argus DC-SCM 2.0</i></sup></p>
    <div>
      <h3>Continuity with DC-SCM 2.0</h3>
      <a href="#continuity-with-dc-scm-2-0">
        
      </a>
    </div>
    <p>We are carrying forward the Data Center Secure Control Module 2.0 (DC-SCM 2.0) standard. By decoupling management and security functions from the motherboard, we ensure that the “brains” of the server’s security stay modular and protected.</p><p>The DC-SCM module houses our most critical components:</p><ul><li><p>Basic Input/Output System (BIOS)</p></li><li><p>Baseboard Management Controller (BMC)</p></li><li><p>Hardware Root of Trust (HRoT) and TPM (Infineon SLB 9672)</p></li><li><p>Dual BMC/BIOS flash chips for redundancy</p></li></ul>
    <div>
      <h3>Why we are staying the course with DC-SCM 2.0</h3>
      <a href="#why-we-are-staying-the-course-with-dc-scm-2-0">
        
      </a>
    </div>
    <p>The decision to keep this architecture for Gen 13 is driven by the proven security gains we saw in the previous generation. By offloading these functions to a dedicated module, we maintain:</p><ul><li><p><b>Rapid recovery</b>: Dual image redundancy allows for near-instant restoration of BIOS/UEFI and BMC firmware if an accidental corruption or a malicious update is detected.</p></li><li><p><b>Physical resilience</b>: The Gen 13 chassis also moves the intrusion detection mechanism further from the flat edge of the chassis, making physical intercept harder.</p></li><li><p><b>PCIe encryption</b>: In addition to TSME (Transparent Secure Memory Encryption) for CPU-to-memory encryption, which has been enabled since our Gen 10 platforms, the AMD Turin 9965 processor in Gen 13 extends encryption to PCIe traffic, ensuring data is protected in transit across every bus in the system.</p></li><li><p><b>Operational consistency</b>: Sticking with the Gen 12 management stack means our security audits, deployment, provisioning, and standard operating procedures remain fully compatible. </p></li></ul>
    <div>
      <h2>Power</h2>
      <a href="#power">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>800W 80 PLUS Titanium CRPS</p></td></tr><tr><td><p>Gen 13</p></td><td><p>1300W 80 PLUS Titanium CRPS</p></td></tr></table><p>As we upgrade the compute and networking capability of the server, the power envelope of our servers has naturally expanded. Gen 13 servers are equipped with larger power supplies to deliver the power needed.</p>
    <div>
      <h3>The jump to 1300W</h3>
      <a href="#the-jump-to-1300w">
        
      </a>
    </div>
    <p>While our Gen 12 nodes operated comfortably with an 800W 80 PLUS Titanium CRPS (Common Redundant Power Supply), the Gen 13 specification requires a larger power supply. We have selected a 1300W 80 PLUS Titanium CRPS.</p><p>Power consumption of Gen 13 during typical operation has risen to 850W, a 250W increase over the 600W seen in Gen 12. The primary contributors are the 500W TDP CPU (up from 400W), the doubling of memory capacity, and the additional NVMe drive.</p><p>Why 1300W instead of 1000W? The current PSU ecosystem lacks viable, high-efficiency options at 1000W. To ensure supply chain reliability, we moved to the next industry-standard tier of 1300W. </p><p><a href="https://eur-lex.europa.eu/eli/reg/2019/424/oj/eng"><u>EU Lot 9</u></a> is a regulation that requires servers deployed in the European Union to have power supplies whose efficiency at 10%, 20%, 50%, and 100% load meets or exceeds the thresholds specified in the regulation. These thresholds match the Titanium grade requirements of the <a href="https://www.clearesult.com/80plus/80plus-psu-ratings-explained"><u>80 PLUS power supply certification program</u></a>. We chose a Titanium grade PSU for Gen 13 to maintain full compliance with EU Lot 9, ensuring that the servers can be deployed in our European data centers and beyond. </p>
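<p>A quick sizing check with the figures above: 850 W typical draw on a 1300 W supply puts the PSU at roughly 65% load, inside the band where Titanium-grade supplies are most efficient. The snippet below is an illustrative sketch; the efficiency observation in the comment reflects the commonly published 80 PLUS Titanium test points (10/20/50/100% load) and is an assumption, not a figure from this post.</p>

```rust
// Fraction of PSU capacity consumed at a given draw.
fn load_fraction(draw_w: f64, psu_w: f64) -> f64 {
    draw_w / psu_w
}

fn main() {
    let frac = load_fraction(850.0, 1300.0);
    println!("typical load: {:.0}% of PSU capacity", frac * 100.0);
    // 80 PLUS Titanium is certified at the 10/20/50/100% load points;
    // a ~65% typical load sits between the 50% and 100% test points,
    // near the top of the supply's efficiency curve.
}
```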
    <div>
      <h3>Thermal design: 2U pays dividends again</h3>
      <a href="#thermal-design-2u-pays-dividends-again">
        
      </a>
    </div>
    <p>The 2U1N form factor we adopted in Gen 12 continues to pay dividends. Gen 13 uses 5x 80mm fans (up from 4x in Gen 12) to handle the increased thermal load from the 500W CPU. The additional fan capacity, combined with the 2U chassis airflow characteristics, means the fans operate well below maximum duty cycle at typical ambient temperatures, keeping fan power under 50W per fan.</p>
    <div>
      <h2>Drop-in accelerator support</h2>
      <a href="#drop-in-accelerator-support">
        
      </a>
    </div>
    <table><tr><td><p>Gen 12</p></td><td><p>x2 single width FHFL or x1 double width FHFL</p></td></tr><tr><td><p>Gen 13</p></td><td><p>x2 double width FHFL</p></td></tr></table><p>Maintaining the modularity of our fleet is a core requirement for our server design. This requirement enabled Cloudflare to quickly retrofit and <a href="https://blog.cloudflare.com/workers-ai/#a-road-to-global-gpu-coverage"><u>deploy GPUs globally to more than 100 cities in 2024</u></a>. In Gen 13, we continue to support high-performance PCIe add-in cards.</p><p>On Gen 13, the 2U chassis layout has been updated to support more demanding power and thermal requirements. While Gen 12 was limited to a single double-width GPU, the Gen 13 architecture supports two double-width PCIe cards.</p>
    <div>
      <h2>A launchpad to scale Cloudflare to greater heights</h2>
      <a href="#a-launchpad-to-scale-cloudflare-to-greater-heights">
        
      </a>
    </div>
    <p>Every generation of Cloudflare servers is an exercise in balancing competing constraints: performance versus power, capacity versus cost, flexibility versus simplicity. Gen 13 comes with 2x core count, 2x memory capacity, 4x network bandwidth, 1.5x storage capacity, and future-proofing for accelerator deployments — all while improving total cost of ownership and maintaining a robust management feature set and security posture that our global fleet demands.</p><p>Gen 13 servers are fully qualified and will be deployed to serve millions of requests across Cloudflare’s global network in more than 330 cities. As always, Cloudflare’s journey to serve the Internet as efficiently as possible does not end here. As the deployment of Gen 13 begins, we are planning the architecture for Gen 14.</p><p>If you are excited about helping build a better Internet, come join us. <a href="https://www.cloudflare.com/careers/jobs/"><u>We are hiring</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[AMD]]></category>
            <guid isPermaLink="false">7KkjVfneO6PwoHTEAiSYVM</guid>
            <dc:creator>Syona Sarma</dc:creator>
            <dc:creator>JQ Lau</dc:creator>
            <dc:creator>Ma Xiong</dc:creator>
            <dc:creator>Victor Hwang</dc:creator>
        </item>
        <item>
            <title><![CDATA[Launching Cloudflare’s Gen 13 servers: trading cache for cores for 2x edge compute performance]]></title>
            <link>https://blog.cloudflare.com/gen13-launch/</link>
            <pubDate>Mon, 23 Mar 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare’s Gen 13 servers double our compute throughput by rethinking the balance between cache and cores. Moving to high-core-count AMD EPYC ™ Turin CPUs, we traded large L3 cache for raw compute density. By running our new Rust-based FL2 stack, we completely mitigated the latency penalty to unlock twice the performance. ]]></description>
            <content:encoded><![CDATA[ <p>Two years ago, Cloudflare deployed our <a href="https://blog.cloudflare.com/cloudflare-gen-12-server-bigger-better-cooler-in-a-2u1n-form-factor/"><u>12th Generation server fleet</u></a>, based on AMD EPYC™ Genoa-X processors with their massive 3D V-Cache. That cache-heavy architecture was a perfect match for our request handling layer at the time, FL1. But as we evaluated next-generation hardware, we faced a dilemma: the CPUs offering the biggest throughput gains came with a significant cache reduction. Our legacy software stack wasn't optimized for this, and the potential throughput benefits were being capped by increasing latency.</p><p>This blog describes how the <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>FL2 transition</u></a>, our Rust-based rewrite of Cloudflare's core request handling layer, allowed us to prove Gen 13's full potential and unlock performance gains that would have been impossible on our previous stack. FL2 removes the dependency on a large cache, allowing performance to scale with cores while maintaining our SLAs. Today, we are proud to announce the launch of Cloudflare's Gen 13 servers, based on 5th Gen AMD EPYC™ Turin processors and running FL2, effectively capturing and scaling performance at the edge. </p>
    <div>
      <h2>What AMD EPYC™ Turin brings to the table</h2>
      <a href="#what-amd-epycturin-brings-to-the-table">
        
      </a>
    </div>
    <p><a href="https://www.amd.com/en/products/processors/server/epyc/9005-series.html"><u>AMD's EPYC™ 5th Generation Turin-based processors</u></a> deliver more than just a core count increase. The architecture delivers improvements across multiple dimensions of what Cloudflare servers require.</p><ul><li><p><b>2x core count:</b> up to 192 cores versus Gen 12's 96 cores, with SMT providing 384 threads</p></li><li><p><b>Improved IPC:</b> Zen 5's architectural improvements deliver better instructions-per-cycle compared to Zen 4</p></li><li><p><b>Better power efficiency</b>: Despite the higher core count, Turin consumes up to 32% fewer watts per core compared to Genoa-X</p></li><li><p><b>DDR5-6400 support</b>: Higher memory bandwidth to feed all those cores</p></li></ul><p>However, Turin's high density OPNs make a deliberate tradeoff: prioritizing throughput over per core cache. Our analysis across the Turin stack highlighted this shift. For example, comparing the highest density Turin OPN to our Gen 12 Genoa-X processors reveals that Turin's 192 cores share 384MB of L3 cache. This leaves each core with access to just 2MB, one-sixth of Gen 12's allocation. For any workload that relies heavily on cache locality, which ours did, this reduction posed a serious challenge.</p><table><tr><td><p>Generation</p></td><td><p>Processor</p></td><td><p>Cores/Threads</p></td><td><p>L3 Cache/Core</p></td></tr><tr><td><p>Gen 12</p></td><td><p>AMD Genoa-X 9684X</p></td><td><p>96C/192T</p></td><td><p>12MB (3D V-Cache)</p></td></tr><tr><td><p>Gen 13 Option 1</p></td><td><p>AMD Turin 9755</p></td><td><p>128C/256T</p></td><td><p>4MB</p></td></tr><tr><td><p>Gen 13 Option 2</p></td><td><p>AMD Turin 9845</p></td><td><p>160C/320T</p></td><td><p>2MB</p></td></tr><tr><td><p>Gen 13 Option 3</p></td><td><p>AMD Turin 9965</p></td><td><p>192C/384T</p></td><td><p>2MB</p></td></tr></table>
    <div>
      <h2>Diagnosing the problem with performance counters</h2>
      <a href="#diagnosing-the-problem-with-performance-counters">
        
      </a>
    </div>
    <p>For our FL1 request handling layer, NGINX- and LuaJIT-based code, this cache reduction presented a significant challenge. But we didn't just assume it would be a problem; we measured it.</p><p>During the CPU evaluation phase for Gen 13, we collected CPU performance counters and profiling data using the <a href="https://docs.amd.com/r/en-US/68658-uProf-getting-started-guide/Identifying-Issues-Using-uProfPcm"><u>AMD uProf tool</u></a> to identify exactly what was happening under the hood. The data showed:</p><ul><li><p>L3 cache miss rates increased dramatically compared to Gen 12's servers equipped with 3D V-Cache processors</p></li><li><p>Memory fetch latency dominated request processing time, as data that previously stayed in L3 now required trips to DRAM</p></li><li><p>The latency penalty scaled with utilization: as we pushed CPU usage higher, cache contention worsened</p></li></ul><p>L3 cache hits complete in roughly 50 cycles; L3 cache misses requiring DRAM access take 350+ cycles, roughly a sevenfold difference. With 6x less cache per core, FL1 on Gen 13 was hitting memory far more often, incurring latency penalties.</p>
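<p>A simple average-memory-access-time model shows why a higher miss rate hurts so much. The cycle counts below come from the text (~50 cycles for an L3 hit, ~350 for a DRAM access); the miss rates are invented round numbers purely for illustration, not measured FL1 figures.</p>

```rust
// Expected cycles per memory access, weighting the L3-hit and DRAM-access
// latencies by how often each occurs.
fn avg_access_cycles(l3_hit_cycles: f64, dram_cycles: f64, miss_rate: f64) -> f64 {
    (1.0 - miss_rate) * l3_hit_cycles + miss_rate * dram_cycles
}

fn main() {
    // A workload that mostly fits in a large V-Cache (say, a 2% miss rate)...
    let cache_friendly = avg_access_cycles(50.0, 350.0, 0.02);
    // ...versus the same workload with one-sixth the cache (say, 20% misses).
    let cache_starved = avg_access_cycles(50.0, 350.0, 0.20);
    println!("avg cycles per access: {cache_friendly:.0} -> {cache_starved:.0}");
}
```

<p>With these illustrative rates, the average access cost roughly doubles (56 to 110 cycles), even though only one access in five actually misses.</p>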
    <div>
      <h2>The tradeoff: latency vs. throughput </h2>
      <a href="#the-tradeoff-latency-vs-throughput">
        
      </a>
    </div>
    <p>Our initial tests running FL1 on Gen 13 confirmed what the performance counters had already suggested. While the Turin processor could achieve higher throughput, it came at a steep latency cost.</p><table><tr><td><p>Metric</p></td><td><p>Gen 12 (FL1)</p></td><td><p>Gen 13 - AMD Turin 9755 (FL1)</p></td><td><p>Gen 13 - AMD Turin 9845 (FL1)</p></td><td><p>Gen 13 - AMD Turin 9965 (FL1)</p></td><td><p>Delta</p></td></tr><tr><td><p>Core count</p></td><td><p>baseline</p></td><td><p>+33%</p></td><td><p>+67%</p></td><td><p>+100%</p></td><td><p></p></td></tr><tr><td><p>FL throughput</p></td><td><p>baseline</p></td><td><p>+10%</p></td><td><p>+31%</p></td><td><p>+62%</p></td><td><p>Improvement</p></td></tr><tr><td><p>Latency at low to moderate CPU utilization</p></td><td><p>baseline</p></td><td><p>+10%</p></td><td><p>+30%</p></td><td><p>+30%</p></td><td><p>Regression</p></td></tr><tr><td><p>Latency at high CPU utilization</p></td><td><p>baseline</p></td><td><p>&gt; 20% </p></td><td><p>&gt; 50% </p></td><td><p>&gt; 50% </p></td><td><p>Unacceptable</p></td></tr></table><p>The Gen 13 evaluation server with the AMD Turin 9965, which delivered a 62% throughput gain, was compelling, and its performance uplift provided the greatest improvement to Cloudflare’s total cost of ownership (TCO). </p><p>But a more than 50% latency penalty is not acceptable. The increase in request processing latency would directly impact customer experience. We faced a familiar infrastructure question: do we accept a solution with no TCO benefit, accept the increased latency tradeoff, or find a way to boost efficiency without adding latency?</p>
    <div>
      <h2>Incremental gains with performance tuning</h2>
      <a href="#incremental-gains-with-performance-tuning">
        
      </a>
    </div>
    <p>To find a path to an optimal outcome, we collaborated with AMD to analyze the Turin 9965 data and run targeted optimization experiments. We systematically tested multiple configurations:</p><ul><li><p><b>Hardware tuning:</b> Adjusting hardware prefetchers and Data Fabric (DF) Probe Filters, which showed only marginal gains</p></li><li><p><b>Scaling workers</b>: Launching more FL1 workers, which improved throughput but cannibalized resources from other production services</p></li><li><p><b>CPU pinning &amp; isolation:</b> Adjusting workload isolation configurations to find the optimal mix, with limited success </p></li></ul><p>The configuration that ultimately provided the most value was <b>AMD’s Platform Quality of Service (PQOS)</b>. PQOS extensions enable fine-grained regulation of shared resources like cache and memory bandwidth. Since Turin processors consist of one I/O die and up to 12 Core Complex Dies (CCDs), each sharing an L3 cache across up to 16 cores, we put this to the test. Here is how the different experimental configurations performed. </p><p>First, we used PQOS to allocate a dedicated L3 cache share within a single CCD for FL1; the gains were minimal. However, when we scaled the concept to the socket level, dedicating entire CCDs strictly to FL1, we saw meaningful throughput gains while keeping latency acceptable.</p><div>
<figure>
<table><colgroup><col></col><col></col><col></col><col></col></colgroup>
<tbody>
<tr>
<td>
<p><span><span>Configuration</span></span></p>
</td>
<td>
<p><span><span>Description</span></span></p>
</td>
<td>
<p><span><span>Illustration</span></span></p>
</td>
<td>
<p><span><span>Performance gain</span></span></p>
</td>
</tr>
<tr>
<td>
<p><span><span>NUMA-aware core affinity </span></span><br /><span><span>(equivalent to PQOS at socket level)</span></span></p>
</td>
<td>
<p><span><span>6 out of 12 CCD (aligned with NUMA domain) run FL.</span></span></p>
<p> </p>
<p><span><span>32MB L3 cache in each CCD shared among all cores. </span></span></p>
</td>
<td>
<p><span><span><img src="https://images.ctfassets.net/zkvhlag99gkb/4CBSHY02oIZOiENgFrzLSz/0c6c2ac8ef0096894ff4827e30d25851/image3.png" /></span></span></p>
</td>
<td>
<p><span><span>&gt;15% incremental </span></span></p>
<p><span><span>throughput gain</span></span></p>
</td>
</tr>
<tr>
<td>
<p><span><span>PQOS config 1</span></span></p>
</td>
<td>
<p><span><span>1 of 2 vCPU on each physical core in each CCD runs FL. </span></span></p>
<p> </p>
<p><span><span>FL gets 75% of the 32MB L3 cache of each CCD.</span></span></p>
</td>
<td>
<p><span><span><img src="https://images.ctfassets.net/zkvhlag99gkb/3iJo1BBRueQRy92R3aXbGx/596c3231fa0e66f20de70ea02615f9a7/image2.png" /></span></span></p>
</td>
<td>
<p><span><span>&lt; 5% incremental throughput gain</span></span></p>
<p> </p>
<p><span><span>Other services show minor signs of degradation</span></span></p>
</td>
</tr>
<tr>
<td>
<p><span><span>PQOS config 2</span></span></p>
</td>
<td>
<p><span><span>1 of 2 vCPU in each physical core in each CCD runs FL.</span></span></p>
<p> </p>
<p><span><span>FL gets 50% of the 32MB L3 cache of each CCD.</span></span></p>
</td>
<td>
<p><span><span><img src="https://images.ctfassets.net/zkvhlag99gkb/3iJo1BBRueQRy92R3aXbGx/596c3231fa0e66f20de70ea02615f9a7/image2.png" /></span></span></p>
</td>
<td>
<p><span><span>&lt; 5% incremental throughput gain</span></span></p>
</td>
</tr>
<tr>
<td>
<p><span><span>PQOS config 3</span></span></p>
</td>
<td>
<p><span><span>2 vCPU on 50% of the physical core in each CCD runs FL. </span></span></p>
<p> </p>
<p><span><span>FL gets 50% of  the 32MB L3 cache of each CCD.</span></span></p>
</td>
<td>
<p><span><span><img src="https://images.ctfassets.net/zkvhlag99gkb/7FKLfSxnSNUlXJCw8CJGzU/69c7b81b6cee5a2c7040ecc96748084b/image5.png" /></span></span></p>
</td>
<td>
<p><span><span>&lt; 5% incremental throughput gain</span></span></p>
</td>
</tr>
</tbody>
</table>
</figure>
</div>
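<p>On Linux, cache-allocation experiments like these are typically driven through the resctrl filesystem, where an L3 allocation is expressed as a contiguous capacity bitmask (CBM) over cache ways. The sketch below only builds such a mask; the 16-way geometry, the example schemata string, and the helper name are assumptions for illustration, not Cloudflare's actual tooling.</p>

```rust
// Build a contiguous capacity bitmask (CBM) covering `fraction` of `ways`
// cache ways, as used in resctrl "schemata" entries. At least one way is
// always allocated, since an empty CBM is invalid.
fn cbm_for_fraction(ways: u32, fraction: f64) -> u32 {
    let n = ((ways as f64 * fraction).round() as u32).clamp(1, ways);
    (1u32 << n) - 1
}

fn main() {
    // "PQOS config 1" above: 75% of a (hypothetical) 16-way L3 -> 12 ways.
    println!("75% of 16 ways: {:#06x}", cbm_for_fraction(16, 0.75));
    // 50% -> 8 ways. A mask like this would be written into a resctrl
    // group's schemata file, e.g. "L3:0=00ff" (path and format assumed).
    println!("50% of 16 ways: {:#06x}", cbm_for_fraction(16, 0.50));
}
```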
    <div>
      <h2>The opportunity: FL2 was already in progress</h2>
      <a href="#the-opportunity-fl2-was-already-in-progress">
        
      </a>
    </div>
    <p>Hardware tuning and resource configuration provided modest gains, but to truly unlock the performance potential of the Gen 13 architecture, we knew we would have to rewrite our software stack to fundamentally change how it utilized system resources.</p><p>Fortunately, we weren't starting from scratch. As we <a href="https://blog.cloudflare.com/20-percent-internet-upgrade/"><u>announced during Birthday Week 2025</u></a>, we had already been rebuilding FL1 from the ground up. FL2 is a complete rewrite of our request handling layer in Rust, built on our <a href="https://blog.cloudflare.com/pingora-open-source/"><u>Pingora</u></a> and <a href="https://blog.cloudflare.com/introducing-oxy/"><u>Oxy</u></a> frameworks, replacing 15 years of NGINX and LuaJIT code.</p><p>The FL2 project wasn't initiated to solve the Gen 13 cache problem — it was driven by the need for better security (Rust's memory safety), faster development velocity (strict module system), and improved performance across the board (less CPU, less memory, modular execution).</p><p>FL2's cleaner architecture, with better memory access patterns and less dynamic allocation, might not depend on massive L3 caches the way FL1 did. This gave us an opportunity to use the FL2 transition to prove whether Gen 13's throughput gains could be realized without the latency penalty.</p>
    <div>
      <h2>Proving it out: FL2 on Gen 13</h2>
      <a href="#proving-it-out-fl2-on-gen-13">
        
      </a>
    </div>
    <p>As the FL2 rollout progressed, production metrics from our Gen 13 servers validated what we had hypothesized.</p><table><tr><td><p>Metric</p></td><td><p>Gen 13 AMD Turin 9965 (FL1)</p></td><td><p>Gen 13 AMD Turin 9965 (FL2)</p></td></tr><tr><td><p>FL requests per CPU%</p></td><td><p>baseline</p></td><td><p>50% higher</p></td></tr><tr><td><p>Latency vs Gen 12</p></td><td><p>baseline</p></td><td><p>70% lower</p></td></tr><tr><td><p>Throughput vs Gen 12</p></td><td><p>62% higher</p></td><td><p>100% higher</p></td></tr></table><p>The out-of-the-box efficiency gains on our new FL2 stack were substantial, even before any system optimizations. FL2 slashed the latency penalty by 70%, allowing us to push Gen 13 to higher CPU utilization while strictly meeting our latency SLAs. Under FL1, this would have been impossible.</p><p>By effectively eliminating the cache bottleneck, FL2 enables our throughput to scale linearly with core count. The impact is undeniable on the high-density AMD Turin 9965: we achieved a 2x performance gain, unlocking the true potential of the hardware. With further system tuning, we expect to squeeze even more power out of our Gen 13 fleet.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1jV1q0n9PgmbbNzDl8E1J1/2ead24a20cc10836ba041f73a16f3883/image6.png" />
          </figure>
    <div>
      <h2>Generational improvement with Gen 13</h2>
      <a href="#generational-improvement-with-gen-13">
        
      </a>
    </div>
    <p>With FL2 unlocking the immense throughput of the high-core-count AMD Turin 9965, we have officially selected these processors for our Gen 13 deployment. Hardware qualification is complete, and Gen 13 servers are now shipping at scale to support our global rollout.</p>
    <div>
      <h3>Performance improvements</h3>
      <a href="#performance-improvements">
        
      </a>
    </div>
    <table><tr><td><p>
</p></td><td><p>Gen 12 </p></td><td><p>Gen 13 </p></td></tr><tr><td><p>Processor</p></td><td><p>AMD EPYC™ 4th Gen Genoa-X 9684X</p></td><td><p>AMD EPYC™ 5th Gen Turin 9965</p></td></tr><tr><td><p>Core count</p></td><td><p>96C/192T</p></td><td><p>192C/384T</p></td></tr><tr><td><p>FL throughput</p></td><td><p>baseline</p></td><td><p>Up to +100%</p></td></tr><tr><td><p>Performance per watt</p></td><td><p>baseline</p></td><td><p>Up to +50%</p></td></tr></table>
    <div>
      <h3>Gen 13 business impact</h3>
      <a href="#gen-13-business-impact">
        
      </a>
    </div>
    <p><b>Up to 2x throughput vs Gen 12 </b>for uncompromising customer experience: By doubling our throughput capacity while staying within our latency SLAs, we guarantee our applications remain fast and responsive, and able to absorb massive traffic spikes.</p><p><b>50% better performance/watt vs Gen 12 </b>for sustainable scaling: This gain in power efficiency not only reduces data center expansion costs, but allows us to process growing traffic with a vastly lower carbon footprint per request.</p><p><b>60% higher rack throughput vs Gen 12 </b>for global edge upgrades: Because we achieved this throughput density while keeping the rack power budget constant, we can seamlessly deploy this next generation compute anywhere in the world across our global edge network, delivering top tier performance exactly where our customers want it.</p>
    <div>
      <h2>Gen 13 + FL2: ready for the edge </h2>
      <a href="#gen-13-fl2-ready-for-the-edge">
        
      </a>
    </div>
    <p>Our legacy request serving layer FL1 hit a cache contention wall on Gen 13, forcing an unacceptable tradeoff between throughput and latency. Instead of compromising, we built FL2. </p><p>Designed with a vastly leaner memory access pattern, FL2 removes our dependency on massive L3 caches and allows linear scaling with core count. Running on the Gen 13 AMD Turin platform, FL2 unlocks 2x the throughput and a 50% boost in power efficiency all while keeping latency within our SLAs. This leap forward is a great reminder of the importance of hardware-software co-design. Unconstrained by cache limits, Gen 13 servers are now ready to be deployed to serve millions of requests across Cloudflare’s global network.</p><p>If you're excited about working on infrastructure at global scale, <a href="https://www.cloudflare.com/careers/jobs"><u>we're hiring</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[Engineering]]></category>
            <guid isPermaLink="false">4shbA7eyT2KredK7RJyizK</guid>
            <dc:creator>Syona Sarma</dc:creator>
            <dc:creator>JQ Lau</dc:creator>
            <dc:creator>Jesse Brandeburg</dc:creator>
        </item>
        <item>
            <title><![CDATA[Shedding old code with ecdysis: graceful restarts for Rust services at Cloudflare]]></title>
            <link>https://blog.cloudflare.com/ecdysis-rust-graceful-restarts/</link>
            <pubDate>Fri, 13 Feb 2026 14:00:00 GMT</pubDate>
            <description><![CDATA[ ecdysis is a Rust library enabling zero-downtime upgrades for network services. After five years protecting millions of connections at Cloudflare, it’s now open source. ]]></description>
            <content:encoded><![CDATA[ <blockquote><p>ecdysis | <i>ˈekdəsəs</i> |</p><p>noun</p><p>    the process of shedding the old skin (in reptiles) or casting off the outer 
    cuticle (in insects and other arthropods).  </p></blockquote><p>How do you upgrade a network service, handling millions of requests per second around the globe, without disrupting even a single connection?</p><p>One of our solutions at Cloudflare to this massive challenge has long been <a href="https://github.com/cloudflare/ecdysis"><b><u>ecdysis</u></b></a>, a Rust library that implements graceful process restarts where no live connections are dropped, and no new connections are refused. </p><p>Last month, <b>we open-sourced ecdysis</b>, so now anyone can use it. After five years of production use at Cloudflare, ecdysis has proven itself by enabling zero-downtime upgrades across our critical Rust infrastructure, saving millions of requests with every restart across Cloudflare’s <a href="https://www.cloudflare.com/network/"><u>global network</u></a>.</p><p>It’s hard to overstate the importance of getting these upgrades right, especially at the scale of Cloudflare’s network. Many of our services perform critical tasks such as traffic routing, <a href="https://www.cloudflare.com/application-services/solutions/certificate-lifecycle-management/"><u>TLS lifecycle management</u></a>, or firewall rules enforcement, and must operate continuously. If one of these services goes down, even for an instant, the cascading impact can be catastrophic. Dropped connections and failed requests quickly lead to degraded customer performance and business impact.</p><p>When these services need updates, security patches can’t wait. Bug fixes need deployment and new features must roll out. </p><p>The naive approach involves waiting for the old process to be stopped before spinning up the new one, but this creates a window of time where connections are refused and requests are dropped. 
For a service handling thousands of requests per second in a single location, multiply that across hundreds of data centers, and a brief restart becomes millions of failed requests globally.</p><p>Let’s dig into the problem, and how ecdysis has been the solution for us — and maybe will be for you. </p><p><b>Links</b>: <a href="https://github.com/cloudflare/ecdysis">GitHub</a> <b>|</b> <a href="https://crates.io/crates/ecdysis">crates.io</a> <b>|</b> <a href="https://docs.rs/ecdysis">docs.rs</a></p>
    <div>
      <h3>Why graceful restarts are hard</h3>
      <a href="#why-graceful-restarts-are-hard">
        
      </a>
    </div>
    <p>The naive approach to restarting a service, as we mentioned, is to stop the old process and start a new one. This works acceptably for simple services that don’t handle real-time requests, but for network services processing live connections, this approach has critical limitations.</p><p>First, the naive approach creates a window during which no process is listening for incoming connections. When the old process stops, it closes its listening sockets, which causes the OS to immediately refuse new connections with <code>ECONNREFUSED</code>. Even if the new process starts immediately, there will always be a gap where nothing is accepting connections, whether milliseconds or seconds. For a service handling thousands of requests per second, even a gap of 100ms means hundreds of dropped connections.</p><p>Second, stopping the old process kills all already-established connections. A client uploading a large file or streaming video gets abruptly disconnected. Long-lived connections like WebSockets or gRPC streams are terminated mid-operation. From the client’s perspective, the service simply vanishes.</p><p>Binding the new process before shutting down the old one appears to solve this, but also introduces additional issues. The kernel normally allows only one process to bind to an address:port combination, but <a href="https://man7.org/linux/man-pages/man7/socket.7.html"><u>the SO_REUSEPORT socket option</u></a> permits multiple binds. However, this creates a problem during process transitions that makes it unsuitable for graceful restarts.</p><p>When <code>SO_REUSEPORT</code> is used, the kernel creates separate listening sockets for each process and <a href="https://lwn.net/Articles/542629/"><u>load balances new connections across these sockets</u></a>. When the initial <code>SYN</code> packet for a connection is received, the kernel will assign it to one of the listening processes. 
Once the initial handshake is completed, the connection then sits in the <code>accept()</code> queue of the process until the process accepts it. If the process then exits before accepting this connection, it becomes orphaned and is terminated by the kernel. GitHub’s engineering team documented this issue extensively when <a href="https://github.blog/2020-10-07-glb-director-zero-downtime-load-balancer-updates/"><u>building their GLB Director load balancer</u></a>.</p>
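<p>To make the bind semantics concrete, here is a small Python sketch (unrelated to ecdysis itself, Linux/macOS only) showing that a second socket can bind an already-bound port only when both sockets opt into <code>SO_REUSEPORT</code>:</p>

```python
import errno
import socket

# Illustrative sketch of SO_REUSEPORT bind behavior (Linux/macOS).
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
s1.bind(("127.0.0.1", 0))             # ephemeral port
port = s1.getsockname()[1]
s1.listen(16)

# A socket without SO_REUSEPORT is rejected with EADDRINUSE.
plain = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    plain.bind(("127.0.0.1", port))
    second_bind_failed = False
except OSError as e:
    second_bind_failed = e.errno == errno.EADDRINUSE
plain.close()

# A second socket that also sets SO_REUSEPORT binds successfully,
# and the kernel load balances new connections between the two.
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
s2.bind(("127.0.0.1", port))
s2.listen(16)

assert second_bind_failed
s1.close()
s2.close()
```

This is exactly the property that makes <code>SO_REUSEPORT</code> attractive at first glance, and exactly why the orphaned accept-queue problem above rules it out for restarts.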
    <div>
      <h3>How ecdysis works</h3>
      <a href="#how-ecdysis-works">
        
      </a>
    </div>
    <p>When we set out to design and build ecdysis, we identified four key goals for the library:</p><ol><li><p><b>Old code can be completely shut down</b> post-upgrade.</p></li><li><p><b>The new process has a grace period</b> for initialization.</p></li><li><p><b>New code crashing during initialization is acceptable</b> and shouldn’t affect the running service.</p></li><li><p><b>Only a single upgrade runs in parallel</b> to avoid cascading failures.</p></li></ol><p>ecdysis satisfies these requirements following an approach pioneered by NGINX, which has supported graceful upgrades since its early days. The approach is straightforward: </p><ol><li><p>The parent process <code>fork()</code>s a new child process.</p></li><li><p>The child process replaces itself with a new version of the code with <code>execve()</code>.</p></li><li><p>The child process inherits the socket file descriptors via a named pipe shared with the parent.</p></li><li><p>The parent process waits for the child process to signal readiness before shutting down.</p></li></ol>
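<p>Stripped of the library specifics, the handover can be sketched in a few lines of Python. This is a toy that forks without re-executing, so it is not the ecdysis API, but it shows the load-bearing property: the listening socket stays open across the generational handover, so a client connecting mid-restart is still served.</p>

```python
import os
import socket

# Toy sketch (POSIX only): parent creates the listener, forks, and the
# child serves a connection on the inherited file descriptor. A real
# upgrade would execve() a new binary and rebuild the socket from the
# raw fd; here we just fork to show the socket never closes.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 0))       # ephemeral port
listener.listen(16)
addr = listener.getsockname()

pid = os.fork()
if pid == 0:
    # "New generation": accepts on the inherited listening socket.
    conn, _ = listener.accept()
    conn.sendall(b"ready")
    conn.close()
    os._exit(0)

# A client connecting during the handover is served normally.
client = socket.create_connection(addr)
buf = b""
while len(buf) < 5:
    buf += client.recv(5 - len(buf))
assert buf == b"ready"
client.close()
os.wait()
listener.close()
```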
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4QK8GY1s30C8RUovBQnqbD/525094478911eda96c7877a10753159f/image3.png" />
          </figure><p>Crucially, the socket remains open throughout the transition. The child process inherits the listening socket from the parent as a file descriptor shared via a named pipe. During the child's initialization, both processes share the same underlying kernel data structure, allowing the parent to continue accepting and processing new and existing connections. Once the child completes initialization, it notifies the parent and begins accepting connections. Upon receiving this ready notification, the parent immediately closes its copy of the listening socket and continues handling only existing connections. </p><p>This process eliminates coverage gaps while providing the child a safe initialization window. There is a brief window of time when both the parent and child may accept connections concurrently. This is intentional; any connections accepted by the parent are simply handled until completion as part of the draining process.</p><p>This model also provides the required crash safety. If the child process fails during initialization (e.g., due to a configuration error), it simply exits. Since the parent never stopped listening, no connections are dropped, and the upgrade can be retried once the problem is fixed.</p><p>ecdysis implements the forking model with first-class support for asynchronous programming through <a href="https://tokio.rs"><u>Tokio</u></a> and <code>systemd</code> integration:</p><ul><li><p><b>Tokio integration</b>: Native async stream wrappers for Tokio. Inherited sockets become listeners without additional glue code. For synchronous services, ecdysis supports operation without async runtime requirements.</p></li><li><p><b>systemd-notify support</b>: When the <code>systemd_notify</code> feature is enabled, ecdysis automatically integrates with systemd’s process lifecycle notifications. 
Setting <code>Type=notify-reload</code> in your service unit file allows systemd to track upgrades correctly.</p></li><li><p><b>systemd named sockets</b>: The <code>systemd_sockets</code> feature enables ecdysis to manage systemd-activated sockets. Your service can be socket-activated and support graceful restarts simultaneously.</p></li></ul><p>Platform note: ecdysis relies on Unix-specific syscalls for socket inheritance and process management. It does not work on Windows. This is a fundamental limitation of the forking approach.</p>
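<p>As a sketch, a unit file for an ecdysis-based service might look like the following (the binary path is a placeholder; <code>Type=notify-reload</code> requires systemd 253 or newer). With this type, <code>systemctl reload</code> sends the reload signal, <code>SIGHUP</code> by default, and systemd waits for the service to report readiness:</p>

```ini
# Hypothetical unit file for an ecdysis-based service.
# ExecStart path is a placeholder, not from the ecdysis docs.
[Unit]
Description=Graceful-restart echo server

[Service]
Type=notify-reload
ExecStart=/usr/local/bin/echo-server
# ReloadSignal=SIGHUP is the default for notify-reload, matching the
# SIGHUP upgrade trigger used in the code example later in this post.
```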
    <div>
      <h3>Security considerations</h3>
      <a href="#security-considerations">
        
      </a>
    </div>
    <p>Graceful restarts introduce security considerations. The forking model creates a brief window where two process generations coexist, both with access to the same listening sockets and potentially sensitive file descriptors.</p><p>ecdysis addresses these concerns through its design:</p><p><b>Fork-then-exec</b>: ecdysis follows the traditional Unix pattern of <code>fork()</code> followed immediately by <code>execve()</code>. This ensures the child process starts with a clean slate: new address space, fresh code, and no inherited memory. Only explicitly-passed file descriptors cross the boundary.</p><p><b>Explicit inheritance</b>: Only listening sockets and communication pipes are inherited. Other file descriptors are closed via <code>CLOEXEC</code> flags. This prevents accidental leakage of sensitive handles.</p><p><b>seccomp compatibility</b>: Services using seccomp filters must allow <code>fork()</code> and <code>execve()</code>. This is a tradeoff: graceful restarts require these syscalls, so they cannot be blocked.</p><p>For most network services, these tradeoffs are acceptable. The security of the fork-exec model is well understood and has been battle-tested for decades in software like NGINX and Apache.</p>
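<p>The explicit-inheritance behavior is easy to demonstrate outside of ecdysis. In this Python sketch (illustrative, not library code), file descriptors are close-on-exec by default, so only the fd deliberately marked inheritable survives into the exec'd child:</p>

```python
import os
import subprocess
import sys

# Sketch of the explicit-inheritance pattern (not ecdysis code). In
# modern runtimes, fds are close-on-exec by default, so only fds
# deliberately marked inheritable survive execve().
r, w = os.pipe()                      # both fds are CLOEXEC by default
os.set_inheritable(r, True)           # explicitly pass only the read end
os.write(w, b"socket-fd")             # stand-in for a listening socket
os.close(w)

# The child process reads from the single inherited descriptor.
child = subprocess.run(
    [sys.executable, "-c",
     "import os, sys; print(os.read(int(sys.argv[1]), 16).decode())",
     str(r)],
    close_fds=False,                  # per-fd inheritable flags decide
    capture_output=True,
    text=True,
)
assert child.stdout.strip() == "socket-fd"
```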
    <div>
      <h3>Code example</h3>
      <a href="#code-example">
        
      </a>
    </div>
    <p>Let’s look at a practical example. Here’s a simplified TCP echo server that supports graceful restarts:</p>
            <pre><code>use ecdysis::tokio_ecdysis::{SignalKind, StopOnShutdown, TokioEcdysisBuilder};
use tokio::{net::TcpStream, task::JoinSet};
use futures::StreamExt;
use std::net::SocketAddr;

#[tokio::main]
async fn main() {
    // Create the ecdysis builder
    let mut ecdysis_builder = TokioEcdysisBuilder::new(
        SignalKind::hangup()  // Trigger upgrade/reload on SIGHUP
    ).unwrap();

    // Trigger stop on SIGUSR1
    ecdysis_builder
        .stop_on_signal(SignalKind::user_defined1())
        .unwrap();

    // Create listening socket - will be inherited by children
    let addr: SocketAddr = "0.0.0.0:8080".parse().unwrap();
    let stream = ecdysis_builder
        .build_listen_tcp(StopOnShutdown::Yes, addr, |builder, addr| {
            builder.set_reuse_address(true)?;
            builder.bind(&amp;addr.into())?;
            builder.listen(128)?;
            Ok(builder.into())
        })
        .unwrap();

    // Spawn task to handle connections
    let server_handle = tokio::spawn(async move {
        let mut stream = stream;
        let mut set = JoinSet::new();
        while let Some(Ok(socket)) = stream.next().await {
            set.spawn(handle_connection(socket));
        }
        set.join_all().await;
    });

    // Signal readiness and wait for shutdown
    let (_ecdysis, shutdown_fut) = ecdysis_builder.ready().unwrap();
    let shutdown_reason = shutdown_fut.await;

    log::info!("Shutting down: {:?}", shutdown_reason);

    // Gracefully drain connections
    server_handle.await.unwrap();
}

async fn handle_connection(mut socket: TcpStream) {
    // Echo connection logic here
}</code></pre>
            <p>The key points:</p><ol><li><p><code><b>build_listen_tcp</b></code> creates a listener that will be inherited by child processes.</p></li><li><p><code><b>ready()</b></code> signals to the parent process that initialization is complete and that it can safely exit.</p></li><li><p><code><b>shutdown_fut.await</b></code> blocks until an upgrade or stop is requested. This future only yields once the process should be shut down, either because an upgrade/reload was executed successfully or because a shutdown signal was received.</p></li></ol><p>When you send <code>SIGHUP</code> to this process, here’s what ecdysis does…</p><p><i>…on the parent process:</i></p><ul><li><p>Forks and execs a new instance of your binary.</p></li><li><p>Passes the listening socket to the child.</p></li><li><p>Waits for the child to call <code>ready()</code>.</p></li><li><p>Drains existing connections, then exits.</p></li></ul><p><i>…on the child process:</i></p><ul><li><p>Initializes itself following the same execution flow as the parent, except any sockets owned by ecdysis are inherited and not bound by the child.</p></li><li><p>Signals readiness to the parent by calling <code>ready()</code>.</p></li><li><p>Blocks waiting for a shutdown or upgrade signal.</p></li></ul>
    <div>
      <h3>Production at scale</h3>
      <a href="#production-at-scale">
        
      </a>
    </div>
    <p>ecdysis has been running in production at Cloudflare since 2021. It powers critical Rust infrastructure services deployed across 330+ data centers in 120+ countries. These services handle billions of requests per day and require frequent updates for security patches, feature releases, and configuration changes.</p><p>Every restart using ecdysis saves hundreds of thousands of requests that would otherwise be dropped during a naive stop/start cycle. Across our global footprint, this translates to millions of preserved connections and improved reliability for customers.</p>
    <div>
      <h3>ecdysis vs alternatives</h3>
      <a href="#ecdysis-vs-alternatives">
        
      </a>
    </div>
    <p>Graceful restart libraries exist for several ecosystems. Understanding when to use ecdysis versus alternatives is critical to choosing the right tool.</p><p><a href="https://github.com/cloudflare/tableflip"><b><u>tableflip</u></b></a> is our Go library that inspired ecdysis. It implements the same fork-and-inherit model for Go services. If you need Go, tableflip is a great option!</p><p><a href="https://github.com/cloudflare/shellflip"><b><u>shellflip</u></b></a> is Cloudflare’s other Rust graceful restart library, designed specifically for Oxy, our Rust-based proxy. shellflip is more opinionated: it assumes systemd and Tokio, and focuses on transferring arbitrary application state between parent and child. This makes it excellent for complex stateful services, or services that want to apply such aggressive sandboxing that they can’t even open their own sockets, but adds overhead for simpler cases.</p>
    <div>
      <h3>Start building</h3>
      <a href="#start-building">
        
      </a>
    </div>
    <p>ecdysis brings five years of production-hardened graceful restart capabilities to the Rust ecosystem. It’s the same technology protecting millions of connections across Cloudflare’s global network, now open-sourced and available for anyone!</p><p>Full documentation is available at <a href="https://docs.rs/ecdysis"><u>docs.rs/ecdysis</u></a>, including API reference, examples for common use cases, and steps for integrating with <code>systemd</code>.</p><p>The <a href="https://github.com/cloudflare/ecdysis/tree/main/examples"><u>examples directory</u></a> in the repository contains working code demonstrating TCP listeners, Unix socket listeners, and systemd integration.</p><p>The library is actively maintained by the Argo Smart Routing &amp; Orpheus team, with contributions from teams across Cloudflare. We welcome contributions, bug reports, and feature requests on <a href="https://github.com/cloudflare/ecdysis"><u>GitHub</u></a>.</p><p>Whether you’re building a high-performance proxy, a long-lived API server, or any network service where uptime matters, ecdysis can provide a foundation for zero-downtime operations.</p><p>Start building:<a href="https://github.com/cloudflare/ecdysis"> <u>github.com/cloudflare/ecdysis</u></a></p> ]]></content:encoded>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[Edge]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Application Services]]></category>
            <guid isPermaLink="false">GMarF75NkFuiwVuyFJk77</guid>
            <dc:creator>Manuel Olguín Muñoz</dc:creator>
        </item>
        <item>
            <title><![CDATA[How Workers powers our internal maintenance scheduling pipeline]]></title>
            <link>https://blog.cloudflare.com/building-our-maintenance-scheduler-on-workers/</link>
            <pubDate>Mon, 22 Dec 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Physical data center maintenance is risky on a global network. We built a maintenance scheduler on Workers to safely plan disruptive operations, while solving scaling challenges by viewing the state of our infrastructure through a graph interface on top of multiple data sources and metrics pipelines. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare has data centers in over <a href="https://www.cloudflare.com/network/"><u>330 cities globally</u></a>, so you might think we could easily disrupt a few at any time without users noticing when we plan data center operations. However, the reality is that <a href="https://developers.cloudflare.com/support/disruptive-maintenance/"><u>disruptive maintenance</u></a> requires careful planning, and as Cloudflare grew, managing these complexities through manual coordination between our infrastructure and network operations specialists became nearly impossible.</p><p>It is no longer feasible for a human to track every overlapping maintenance request or account for every customer-specific routing rule in real time. We reached a point where manual oversight alone couldn't guarantee that a routine hardware update in one part of the world wouldn't inadvertently conflict with a critical path in another.</p><p>We realized we needed a centralized, automated "brain" to act as a safeguard — a system that could see the entire state of our network at once. By building this scheduler on <a href="https://workers.cloudflare.com/"><u>Cloudflare Workers</u></a>, we created a way to programmatically enforce safety constraints, ensuring that no matter how fast we move, we never sacrifice the reliability of the services on which our customers depend.</p><p>In this blog post, we’ll explain how we built it, and share the results we’re seeing now.</p>
    <div>
      <h2>Building a system to de-risk critical maintenance operations</h2>
      <a href="#building-a-system-to-de-risk-critical-maintenance-operations">
        
      </a>
    </div>
    <p>Picture an edge router that acts as one of a small, redundant group of gateways that collectively connect the public Internet to the many Cloudflare data centers operating in a metro area. In a populated city, we need to ensure that the multiple data centers sitting behind this small cluster of routers do not get cut off because the routers were all taken offline simultaneously. </p><p>Another maintenance challenge comes from our Zero Trust product, Dedicated CDN Egress IPs, which allows customers to choose specific data centers from which their user traffic will exit Cloudflare and be sent to their geographically close origin servers for low latency. (For the purpose of brevity in this post, we'll refer to the Dedicated CDN Egress IPs product as "Aegis," which was its former name.) If all the data centers a customer chose are offline at once, they would see higher latency and possibly 5xx errors, which we must avoid. </p><p>Our maintenance scheduler solves problems like these. We can make sure that we always have at least one edge router active in a certain area. And when scheduling maintenance, we can see if the combination of multiple scheduled events would cause all the data centers for a customer’s Aegis pools to be offline at the same time.</p><p>Before we created the scheduler, these simultaneous disruptive events could cause downtime for customers. Now, our scheduler notifies internal operators of potential conflicts, allowing us to propose a new time to avoid overlapping with other related data center maintenance events.</p><p>We define these operational scenarios, such as edge router availability and customer rules, as maintenance constraints which allow us to plan more predictable and safe maintenance.</p>
    <div>
      <h2>Maintenance constraints</h2>
      <a href="#maintenance-constraints">
        
      </a>
    </div>
    <p>Every constraint starts with a set of proposed maintenance items, such as a network router or list of servers. We then find all the maintenance events in the calendar that overlap with the proposed maintenance time window.</p>
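<p>The overlap test itself is simple: treating each window as half-open <code>[start, end)</code>, two windows overlap exactly when each starts before the other ends. A Python sketch with hypothetical helper names, not our production code:</p>

```python
# Hypothetical overlap check: windows are (start, end) pairs, half-open.
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def conflicting_events(proposed, calendar):
    """Return calendar events whose windows overlap the proposed one."""
    return [e for e in calendar if overlaps(proposed, e["window"])]

calendar = [
    {"id": "swap-psu-fra03", "window": (1000, 2000)},
    {"id": "recable-syd01", "window": (3000, 4000)},
]
assert [e["id"] for e in conflicting_events((1500, 2500), calendar)] == ["swap-psu-fra03"]
# Half-open windows: touching edges do not count as overlap.
assert conflicting_events((2000, 3000), calendar) == []
```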
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2vHCauxOGRXzhrO6DNDr2S/cf38b93ac9b812e5e064f800e537e549/image4.png" />
          </figure><p>Next, we aggregate product APIs, such as a list of Aegis customer IP pools. Aegis returns a set of IP ranges where a customer requested egress out of specific data center IDs, shown below.</p>
            <pre><code>[
    {
      "cidr": "104.28.0.32/32",
      "pool_name": "customer-9876",
      "port_slots": [
        {
          "dc_id": 21,
          "other_colos_enabled": true,
        },
        {
          "dc_id": 45,
          "other_colos_enabled": true,
        }
      ],
      "modified_at": "2023-10-22T13:32:47.213767Z"
    },
]</code></pre>
            <p>In this scenario, data center 21 and data center 45 relate to each other because we need at least one data center online for the Aegis customer 9876 to receive egress traffic from Cloudflare. If we tried to take data centers 21 and 45 down simultaneously, our coordinator would alert us that there would be unintended consequences for that customer workload.</p><p>We initially had a naive solution to load all data into a single Worker. This included all server relationships, product configurations, and metrics for product and infrastructure health to compute constraints. Even in our proof of concept phase, we ran into problems with “out of memory” errors.</p>
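<p>In code, this constraint reduces to a set check. A hypothetical Python sketch over the pool data above (names are illustrative, not our actual implementation):</p>

```python
# Hypothetical constraint check: a pool is violated when every one of
# its data centers would be offline at the same time.
pools = {"customer-9876": {21, 45}}   # pool name -> data center IDs

def violated_pools(pools, proposed_offline):
    offline = set(proposed_offline)
    return sorted(name for name, dcs in pools.items() if dcs <= offline)

assert violated_pools(pools, [21]) == []                     # DC 45 still serves egress
assert violated_pools(pools, [21, 45]) == ["customer-9876"]  # alert: pool fully offline
```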
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1v4Q6bXsZLBXLbrbRrcW3o/00d291ef3db459e99ae9b620965b6bc7/image2.png" />
          </figure><p>We needed to be more cognizant of Workers’ <a href="https://developers.cloudflare.com/workers/platform/limits/"><u>platform limits</u></a>. This required loading only as much data as was absolutely necessary to process the constraint’s business logic. If a maintenance request for a router in Frankfurt, Germany, comes in, we almost certainly do not care what is happening in Australia since there is no overlap across regions. Thus, we should only load data for neighboring data centers in Germany. We needed a more efficient way to process relationships in our dataset.</p>
    <div>
      <h2>Graph processing on Workers</h2>
      <a href="#graph-processing-on-workers">
        
      </a>
    </div>
    <p>As we looked at our constraints, a pattern emerged where each constraint boiled down to two concepts: objects and associations. In graph theory, these components are known as vertices and edges, respectively. An object could be a network router and an association could be the list of Aegis pools in the data center that requires the router to be online. We took inspiration from Facebook’s <a href="https://research.facebook.com/publications/tao-facebooks-distributed-data-store-for-the-social-graph/"><u>TAO</u></a> research paper to establish a graph interface on top of our product and infrastructure data. The API looks like the following:</p>
            <pre><code>type ObjectID = string

interface MainTAOInterface&lt;TObject, TAssoc, TAssocType&gt; {
  object_get(id: ObjectID): Promise&lt;TObject | undefined&gt;

  assoc_get(id1: ObjectID, atype: TAssocType): AsyncIterable&lt;TAssoc&gt;

  assoc_count(id1: ObjectID, atype: TAssocType): Promise&lt;number&gt;
}</code></pre>
            <p>The core insight is that associations are typed. For example, a constraint would call the graph interface to retrieve Aegis product data.</p>
            <pre><code>async function constraint(c: AppContext, aegis: TAOAegisClient, datacenters: string[]): Promise&lt;Record&lt;string, PoolAnalysis&gt;&gt; {
  const datacenterEntries = await Promise.all(
    datacenters.map(async (dcID) =&gt; {
      const iter = aegis.assoc_get(c, dcID, AegisAssocType.DATACENTER_INSIDE_AEGIS_POOL)
      const pools: string[] = []
      for await (const assoc of iter) {
        pools.push(assoc.id2)
      }
      return [dcID, pools] as const
    }),
  )

  const datacenterToPools = new Map&lt;string, string[]&gt;(datacenterEntries)
  const uniquePools = new Set&lt;string&gt;()
  for (const pools of datacenterToPools.values()) {
    for (const pool of pools) uniquePools.add(pool)
  }

  const poolTotalsEntries = await Promise.all(
    [...uniquePools].map(async (pool) =&gt; {
      const total = await aegis.assoc_count(c, pool, AegisAssocType.AEGIS_POOL_CONTAINS_DATACENTER)
      return [pool, total] as const
    }),
  )

  const poolTotals = new Map&lt;string, number&gt;(poolTotalsEntries)
  const poolAnalysis: Record&lt;string, PoolAnalysis&gt; = {}
  for (const [dcID, pools] of datacenterToPools.entries()) {
    for (const pool of pools) {
      // Accumulate affected data centers instead of overwriting the
      // entry when several affected data centers share a pool.
      const analysis = poolAnalysis[pool] ?? {
        affectedDatacenters: new Set&lt;string&gt;(),
        totalDatacenters: poolTotals.get(pool) ?? 0,
      }
      analysis.affectedDatacenters.add(dcID)
      poolAnalysis[pool] = analysis
    }
  }

  return poolAnalysis
}</code></pre>
            <p>We use two association types in the code above:</p><ol><li><p>DATACENTER_INSIDE_AEGIS_POOL, which retrieves the Aegis customer pools that a data center resides in.</p></li><li><p>AEGIS_POOL_CONTAINS_DATACENTER, which retrieves the data centers an Aegis pool needs to serve traffic.</p></li></ol><p>The associations are inverted indices of one another. The access pattern is exactly the same as before, but now the graph implementation has much more control of how much data it queries. Before, we needed to load all Aegis pools into memory and filter inside constraint business logic. Now, we can directly fetch only the data that matters to the application.</p>
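<p>Maintaining both directions is cheap because each association type is just the same set of edges indexed from the opposite end. A hypothetical Python sketch:</p>

```python
# Illustrative sketch: the two association types from the post are
# inverted indices over one set of (data center, pool) edges.
# "customer-1234" is a made-up second pool for the example.
edges = [(21, "customer-9876"), (45, "customer-9876"), (21, "customer-1234")]

DATACENTER_INSIDE_AEGIS_POOL = {}     # dc -> pools (forward index)
AEGIS_POOL_CONTAINS_DATACENTER = {}   # pool -> dcs (inverse index)
for dc, pool in edges:
    DATACENTER_INSIDE_AEGIS_POOL.setdefault(dc, []).append(pool)
    AEGIS_POOL_CONTAINS_DATACENTER.setdefault(pool, []).append(dc)

assert DATACENTER_INSIDE_AEGIS_POOL[21] == ["customer-9876", "customer-1234"]
assert AEGIS_POOL_CONTAINS_DATACENTER["customer-9876"] == [21, 45]
```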
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4b68YLIHiOPt5EeyTUTeBt/5f624f0d0912e7dfd0e308a3427d194c/unnamed.png" />
          </figure><p>The interface is powerful because our graph implementation can improve performance behind the scenes without complicating the business logic. This lets us use the scalability of Workers and Cloudflare’s CDN to fetch data from our internal systems very quickly.</p>
    <div>
      <h2>Fetch pipeline</h2>
      <a href="#fetch-pipeline">
        
      </a>
    </div>
    <p>We switched to the new graph implementation, sending more targeted API requests. Response sizes dropped by 100x overnight as we went from loading a few massive responses to many tiny ones.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/71aDOicyippmUbj4ypXKw/73dacdf16ca0ac422efdfec9e86e9dbf/image5.png" />
          </figure><p>While this solved the issue of loading too much into memory, it created a subrequest problem: instead of a few large HTTP requests, we now made an order of magnitude more small ones. Overnight, we started consistently breaching subrequest limits.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/36KjfOU8xIuUkwF7QOlNkK/e2275a50ff1bef497cdb201c2d3a6249/image3.png" />
          </figure><p>In order to solve this problem, we built a smart middleware layer between our graph implementation and the <code>fetch</code> API.</p>
            <pre><code>export const fetchPipeline = new FetchPipeline()
  .use(requestDeduplicator())
  .use(lruCacher({
    maxItems: 100,
  }))
  .use(cdnCacher())
  .use(backoffRetryer({
    retries: 3,
    baseMs: 100,
    jitter: true,
  }))
  .handler(terminalFetch);</code></pre>
            <p>If you’re familiar with Go, you may have seen the <a href="https://pkg.go.dev/golang.org/x/sync/singleflight"><u>singleflight</u></a> package before. We took inspiration from this idea and the first middleware component in the fetch pipeline deduplicates inflight HTTP requests, so they all wait on the same Promise for data instead of producing duplicate requests in the same Worker. Next, we use a lightweight Least Recently Used (LRU) cache to internally cache requests that we have already seen before.</p><p>Once both of those are complete, we use Cloudflare’s <code>caches.default.match</code> function to cache all GET requests in the region that the Worker is running. Since we have multiple data sources with different performance characteristics, we choose time to live (TTL) values carefully. For example, real-time data is only cached for 1 minute. Relatively static infrastructure data could be cached for 1–24 hours depending on the type of data. Power management data might be changed manually and infrequently, so we can cache it for longer at the edge.</p><p>In addition to those layers, we have the standard exponential backoff, retries and jitter. This helps reduce wasted <code>fetch</code> calls where a downstream resource might be unavailable temporarily. By backing off slightly, we increase the chance that we fetch the next request successfully. Conversely, if the Worker sends requests constantly without backoff, it will easily breach the subrequest limit when the origin starts returning 5xx errors.</p><p>Putting it all together, we saw ~99% cache hit rate. <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cache-hit-ratio/"><u>Cache hit rate</u></a> is the percentage of HTTP requests served from Cloudflare’s fast cache memory (a "hit") versus slower requests to data sources running in our control plane (a "miss"), calculated as (hits / (hits + misses)). 
A high rate means better HTTP request performance and lower costs, because querying data from cache in our Worker is an order of magnitude faster than fetching from an origin server in a different region. After tuning settings for our in-memory and CDN caches, hit rates increased dramatically. Since much of our workload is real-time, we will never reach a 100% hit rate, as we must request fresh data at least once per minute.</p>
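<p>The deduplication layer deserves a closer look, since it is the piece that tames concurrent fan-out. This Python sketch (hypothetical names, not the actual pipeline code) shows the singleflight idea: concurrent requests for the same key share one in-flight task instead of each hitting the origin:</p>

```python
import asyncio

# Minimal sketch of singleflight-style request deduplication.
class Deduplicator:
    def __init__(self, fetch):
        self._fetch = fetch
        self._inflight = {}

    async def get(self, key):
        task = self._inflight.get(key)
        if task is None:
            task = asyncio.ensure_future(self._fetch(key))
            self._inflight[key] = task
            # Forget the task once done so later calls refetch fresh data.
            task.add_done_callback(lambda _: self._inflight.pop(key, None))
        return await task

origin_calls = 0

async def slow_fetch(key):
    global origin_calls
    origin_calls += 1
    await asyncio.sleep(0.01)         # simulated origin latency
    return f"data for {key}"

async def main():
    dedup = Deduplicator(slow_fetch)
    # Ten concurrent callers for the same key...
    return await asyncio.gather(*(dedup.get("/aegis/pools") for _ in range(10)))

results = asyncio.run(main())
assert results == ["data for /aegis/pools"] * 10
assert origin_calls == 1              # ...but only one origin request
```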
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1jifI33QpBkQPd7tE5Tapi/186a74b922faac3abe091b79f03d640b/image1.png" />
          </figure><p>We have talked about improving the fetching layer, but not about how we made origin HTTP requests faster. Our maintenance coordinator needs to react in real-time to network degradation and failure of machines in data centers. We use our distributed <a href="https://blog.cloudflare.com/how-cloudflare-runs-prometheus-at-scale/"><u>Prometheus</u></a> query engine, Thanos, to deliver performant metrics from the edge into the coordinator.</p>
    <div>
      <h2>Thanos in real-time</h2>
      <a href="#thanos-in-real-time">
        
      </a>
    </div>
    <p>To explain how our choice in using the graph processing interface affected our real-time queries, let’s walk through an example. In order to analyze the health of edge routers, we could send the following query:</p>
            <pre><code>sum by (instance) (network_snmp_interface_admin_status{instance=~"edge.*"})</code></pre>
            <p>Originally, we asked our Thanos service, which stores Prometheus metrics, for a list of each edge router’s current health status and would manually filter for routers relevant to the maintenance inside the Worker. This was suboptimal for many reasons. For example, Thanos returned multi-MB responses that it had to encode and the Worker had to decode. The Worker also needed to cache and parse these large HTTP responses only to filter out the majority of the data while processing a specific maintenance request. Since JavaScript is single-threaded and parsing JSON is CPU-bound, sending two large HTTP requests means that one is blocked waiting for the other to finish parsing.</p><p>Instead, we simply use the graph to find targeted relationships such as the interface links between edge and spine routers, denoted as <code>EDGE_ROUTER_NETWORK_CONNECTS_TO_SPINE</code>.</p>
            <pre><code>sum by (lldp_name) (network_snmp_interface_admin_status{instance=~"edge01.fra03", lldp_name=~"spine.*"})</code></pre>
            <p>The result is 1 KB on average instead of multiple MBs, approximately 1000x smaller. This also massively reduces the amount of CPU required inside the Worker because we offload most of the deserialization to Thanos. As we explained before, this means we need to make a higher number of these smaller fetch requests, but load balancers in front of Thanos can spread the requests evenly to increase throughput for this use case. </p><p>Our graph implementation and fetch pipeline successfully tamed the 'thundering herd' of thousands of tiny real-time requests. However, historical analysis presents a different I/O challenge. Instead of fetching small, specific relationships, we need to scan months of data to find conflicting maintenance windows. In the past, Thanos would issue a massive number of random reads to our object store, <a href="https://www.cloudflare.com/developer-platform/products/r2/">R2</a>. To solve this massive bandwidth penalty without losing performance, we adopted a new approach the Observability team developed internally this year.</p>
    <div>
      <h2>Historical data analysis</h2>
      <a href="#historical-data-analysis">
        
      </a>
    </div>
    <p>There are enough maintenance use cases that we must rely on historical data to tell us if our solution is accurate and will scale with the growth of Cloudflare’s network. We do not want to cause incidents, and we also want to avoid blocking proposed physical maintenance unnecessarily. In order to balance these two priorities, we can use time series data about maintenance events that happened two months or even a year ago to tell us how often a maintenance event is violating one of our constraints, e.g. edge router availability or Aegis. We blogged earlier this year about using Thanos to <a href="https://blog.cloudflare.com/safe-change-at-any-scale/"><u>automatically release and revert software</u></a> to the edge.</p><p>Thanos primarily fans out to Prometheus, but when Prometheus' retention is not enough to answer the query it has to download data from object storage — R2 in our case. Prometheus TSDB blocks were originally designed for local SSDs, relying on random access patterns that become a bottleneck when moved to object storage. When our scheduler needs to analyze months of historical maintenance data to identify conflicting constraints, random reads from object storage incur a massive I/O penalty. To solve this, we implemented a conversion layer that transforms these blocks into <a href="https://parquet.apache.org/"><u>Apache Parquet</u></a> files. Parquet is a columnar format native to big data analytics that organizes data by column rather than row, which — together with rich statistics — allows us to only fetch what we need.</p><p>Furthermore, since we are rewriting TSDB blocks into Parquet files, we can also store the data in a way that allows us to read the data in just a few big sequential chunks.</p>
            <pre><code>sum by (instance) (hmd:release_scopes:enabled{dc_id="45"})</code></pre>
            <p>In the example above we would choose the tuple “(__name__, dc_id)” as a primary sorting key so that metrics with the name “hmd:release_scopes:enabled” and the same value for “dc_id” get sorted close together.</p><p>Our Parquet gateway now issues precise R2 range requests to fetch only the specific columns relevant to the query. This reduces the payload from megabytes to kilobytes. Furthermore, because these file segments are immutable, we can aggressively cache them on the Cloudflare CDN.</p><p>This turns R2 into a low-latency query engine, allowing us to backtest complex maintenance scenarios against long-term trends instantly, avoiding the timeouts and high tail latency we saw with the original TSDB format. The graph below shows a recent load test, where Parquet reached up to 15x the P90 performance compared to the old system for the same query pattern.</p>
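<p>Sorting by that primary key is what turns a label-filtered query into a few sequential reads. A Python sketch of the idea (illustrative data, not our actual storage code): once series are sorted by <code>(__name__, dc_id)</code>, every row matching the query forms one contiguous run that can be fetched as a single range:</p>

```python
import bisect

# Illustrative: series sorted by (metric name, dc_id). All rows
# matching {__name__="hmd:release_scopes:enabled", dc_id="45"} end up
# adjacent, i.e. readable with one sequential range request.
rows = sorted([
    ("node_cpu_seconds_total", "21", "..."),
    ("hmd:release_scopes:enabled", "45", "..."),
    ("hmd:release_scopes:enabled", "21", "..."),
    ("hmd:release_scopes:enabled", "45", "..."),
    ("node_cpu_seconds_total", "45", "..."),
])

keys = [(name, dc) for name, dc, _ in rows]
want = ("hmd:release_scopes:enabled", "45")
lo = bisect.bisect_left(keys, want)
hi = bisect.bisect_right(keys, want)

# Every matching row sits inside rows[lo:hi]; nothing outside matches.
assert all(k == want for k in keys[lo:hi])
assert sum(k == want for k in keys) == hi - lo
```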
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6lVj6W4W4MMUy6cEsDpk5G/21614b7ac003a86cb5162a2ba75f4c42/image8.png" />
          </figure><p>To get a deeper understanding of how the Parquet implementation works, you can watch this talk at PromCon EU 2025, <a href="https://www.youtube.com/watch?v=wDN2w2xN6bA&amp;list=PLoz-W_CUquUlHOg314_YttjHL0iGTdE3O&amp;index=16"><u>Beyond TSDB: Unlocking Prometheus with Parquet for Modern Scale</u></a>.</p>
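<p>The effect of that sorting key can be sketched in a few lines of Python. This is an illustration only, not our conversion code; the row values and the key (__name__, dc_id) mirror the example query above:</p>
<pre><code># Illustrative sketch: sorting rows by (__name__, dc_id) turns a
# label-matching query into one contiguous run of rows, i.e. one
# sequential range read instead of scattered random reads.
rows = [
    ("hmd:release_scopes:enabled", "45", 1.0),
    ("node_cpu_seconds_total", "12", 2.0),
    ("hmd:release_scopes:enabled", "12", 3.0),
    ("node_cpu_seconds_total", "45", 4.0),
    ("hmd:release_scopes:enabled", "45", 5.0),
]

rows.sort(key=lambda r: (r[0], r[1]))  # primary sorting key

# All rows matching {__name__="hmd:release_scopes:enabled", dc_id="45"}
# now occupy adjacent indices, so one range request covers them all.
matches = [i for i, r in enumerate(rows)
           if r[0] == "hmd:release_scopes:enabled" and r[1] == "45"]
print(matches)  # [1, 2] -- contiguous</code></pre>
<p>The same idea, applied to Parquet row groups and column chunks, is what lets the gateway issue a handful of precise R2 range requests per query.</p>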
    <div>
      <h2>Building for scale</h2>
      <a href="#building-for-scale">
        
      </a>
    </div>
    <p>By leveraging Cloudflare Workers, we moved from a system that ran out of memory to one that intelligently caches data and uses efficient observability tooling to analyze product and infrastructure data in real time. We built a maintenance scheduler that balances network growth with product performance.</p><p>But “balance” is a moving target.</p><p>Every day, we add more hardware around the world, and the logic required to maintain it without disrupting customer traffic gets exponentially harder with more products and types of maintenance operations. We’ve worked through the first set of challenges, but now we’re staring down more subtle, complex ones that only appear at this massive scale.</p><p>We need engineers who aren't afraid of hard problems. Join our <a href="https://www.cloudflare.com/careers/jobs/?department=Infrastructure"><u>Infrastructure team</u></a> and come build with us.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Prometheus]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <guid isPermaLink="false">5pdspiP2m71MeIoVL8wv1i</guid>
            <dc:creator>Kevin Deems</dc:creator>
            <dc:creator>Michael Hoffmann</dc:creator>
        </item>
        <item>
            <title><![CDATA[Is this thing on? Using OpenBMC and ACPI power states for reliable server boot]]></title>
            <link>https://blog.cloudflare.com/how-we-use-openbmc-and-acpi-power-states-to-monitor-the-state-of-our-servers/</link>
            <pubDate>Tue, 22 Oct 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare’s global fleet benefits from being managed by open source firmware for the Baseboard Management Controller (BMC), OpenBMC. This has come with various challenges, some of which we discuss here with an explanation of how the open source nature of the firmware for the BMC enabled us to fix the issues and maintain a more stable fleet. ]]></description>
            <content:encoded><![CDATA[ 
    <div>
      <h2>Introduction</h2>
      <a href="#introduction">
        
      </a>
    </div>
    <p>At Cloudflare, we provide a range of services through our global network of servers, located in <a href="https://www.cloudflare.com/network/"><u>330 cities</u></a> worldwide. When you interact with our long-standing <a href="https://www.cloudflare.com/application-services/products/"><u>application services</u></a>, or newer services like <a href="https://ai.cloudflare.com/?_gl=1*1vedsr*_gcl_au*NzE0Njc1NTIwLjE3MTkzMzEyODc.*_ga*NTgyMWU1Y2MtYTI2NS00MDA3LTlhZDktYWUxN2U5MDkzYjY3*_ga_SQCRB0TXZW*MTcyMTIzMzM5NC4xNS4xLjE3MjEyMzM1MTguMC4wLjA."><u>Workers AI</u></a>, you’re in contact with one of the thousands of servers in our fleet that support those services.</p><p>The servers that provide Cloudflare services are managed by a Baseboard Management Controller (BMC). The BMC is a special-purpose processor, different from the Central Processing Unit (CPU) of a server, whose sole purpose is to keep the server operating smoothly.</p><p>Regardless of the server vendor, each server has a BMC. The BMC runs independently of the CPU and has its own embedded operating system, usually referred to as <a href="https://en.wikipedia.org/wiki/Firmware"><u>firmware</u></a>. At Cloudflare, we customize and deploy a server-specific version of the BMC firmware, based on the <a href="https://www.openbmc.org/"><u>Linux Foundation Project for BMCs, OpenBMC</u></a>. OpenBMC is an open source firmware stack designed to work across a variety of systems, including enterprise, telco, and cloud-scale data centers. The open source nature of OpenBMC gives us greater flexibility and ownership of this critical server subsystem than closed, proprietary firmware would. It gives us transparency (which is important to us as a security company) and lets us develop custom features and fixes for the BMC firmware we run on our entire fleet more quickly.</p><p>In this blog post, we describe how we customized and extended the OpenBMC firmware to better monitor our servers’ boot-up processes, make servers start more reliably, and allow better diagnostics when an issue does happen during boot-up.</p>
    <div>
      <h2>Server subsystems</h2>
      <a href="#server-subsystems">
        
      </a>
    </div>
    <p>Server systems consist of multiple complex subsystems that include the processors, memory, storage, networking, power supply, cooling, etc. When booting up the host of a server system, the power state of each subsystem of the server is changed in an asynchronous manner. This is done so that subsystems can initialize simultaneously, thereby improving the efficiency of the boot process. Though started asynchronously, these subsystems may interact with each other at different points of the boot sequence and rely on handshakes and synchronization to exchange information. For example, during boot-up, the <a href="https://en.wikipedia.org/wiki/UEFI"><u>UEFI (Unified Extensible Firmware Interface)</u></a>, often referred to as the <a href="https://en.wikipedia.org/wiki/BIOS"><u>BIOS</u></a>, configures the motherboard in a phase known as the Platform Initialization (PI) phase, during which the UEFI collects information from subsystems such as the CPUs, memory, etc. to initialize the motherboard with the right settings.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6csPNEksLXsGgt3dq5xZ0S/3236656dbc01f3085bada5af853c3516/image1.png" />
          </figure><p><sup><i>Figure 1: Server Boot Process</i></sup></p><p>When the power state of the subsystems, handshakes, and synchronization are not properly managed, there may be race conditions that would result in failures during the boot process of the host. Cloudflare experienced some of these boot-related failures while rolling out open source firmware (<a href="https://en.wikipedia.org/wiki/OpenBMC"><u>OpenBMC</u></a>) to the Baseboard Management Controllers (BMCs) of our servers. </p>
    <div>
      <h2>Baseboard Management Controller (BMC) as a manager of the host</h2>
      <a href="#baseboard-management-controller-bmc-as-a-manager-of-the-host">
        
      </a>
    </div>
    <p>A BMC is a specialized microprocessor that is attached to the board of a host (server) to assist with remote management capabilities of the host. Servers usually sit in data centers and are often far away from the administrators, and this creates a challenge to maintain them at scale. This is where a BMC comes in, as the BMC serves as the interface that gives administrators the ability to securely and remotely access the servers and carry out management functions. The BMC does this by exposing various interfaces, including <a href="https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface"><u>Intelligent Platform Management Interface (IPMI)</u></a> and <a href="https://www.dmtf.org/standards/redfish"><u>Redfish</u></a>, for distributed management. In addition, the BMC receives data from various sensors/devices (e.g. temperature, power supply) connected to the server, and also the operating parameters of the server, such as the operating system state, and publishes the values on its IPMI and Redfish interfaces.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33dNmfyjqrbAGvcbZLTa0h/db3e6b79b1010081916ee6498b10c297/image2.png" />
          </figure><p><sup><i>Figure 2: Block diagram of BMC in a server system.</i></sup></p><p>At Cloudflare, we use the <a href="https://github.com/openbmc/openbmc"><u>OpenBMC</u></a> project for our Baseboard Management Controller (BMC).</p><p>Below are examples of management functions carried out on a server through the BMC. The interactions in the examples are done over <a href="https://github.com/ipmitool/ipmitool/wiki"><u>ipmitool</u></a>, a command line utility for interacting with systems that support IPMI.</p>
            <pre><code># Check the sensor readings of a server remotely (i.e. over a network)
$  ipmitool &lt;some authentication&gt; &lt;bmc ip&gt; sdr
PSU0_CURRENT_IN  | 0.47 Amps         | ok
PSU0_CURRENT_OUT | 6 Amps            | ok
PSU0_FAN_0       | 6962 RPM          | ok
SYS_FAN          | 13034 RPM         | ok
SYS_FAN1         | 11172 RPM         | ok
SYS_FAN2         | 11760 RPM         | ok
CPU_CORE_VR_POUT | 9.03 Watts        | ok
CPU_POWER        | 76.95 Watts       | ok
CPU_SOC_VR_POUT  | 12.98 Watts       | ok
DIMM_1_VR_POUT   | 29.03 Watts       | ok
DIMM_2_VR_POUT   | 27.97 Watts       | ok
CPU_CORE_MOSFET  | 40 degrees C      | ok
CPU_TEMP         | 50 degrees C      | ok
DIMM_MOSFET_1    | 36 degrees C      | ok
DIMM_MOSFET_2    | 39 degrees C      | ok
DIMM_TEMP_A1     | 34 degrees C      | ok
DIMM_TEMP_B1     | 33 degrees C      | ok

…

# check the power status of a server remotely (i.e. over a network)
ipmitool &lt;some authentication&gt; &lt;bmc ip&gt; power status
Chassis Power is off

# power on the server
ipmitool &lt;some authentication&gt; &lt;bmc ip&gt; power on
Chassis Power Control: On</code></pre>
            <p>Switching to OpenBMC firmware for our BMCs gives us more control over the software that powers our infrastructure. This has given us more flexibility, more customization, and an overall more uniform experience for managing our servers. Since OpenBMC is open source, we also leverage community fixes while upstreaming some of our own. Some of the advantages we have experienced with OpenBMC include a faster turnaround time for fixing issues, <a href="https://blog.cloudflare.com/de-de/thermal-design-supporting-gen-12-hardware-cool-efficient-and-reliable/"><u>optimizations around thermal cooling</u></a>, <a href="https://blog.cloudflare.com/gen-12-servers/"><u>increased power efficiency</u></a> and <a href="https://blog.cloudflare.com/how-we-used-openbmc-to-support-ai-inference-on-gpus-around-the-world/"><u>supporting AI inference</u></a>.</p><p>While developing Cloudflare’s OpenBMC firmware, however, we ran into a number of boot problems.</p><p><b><i>Host not booting:</i></b> When we sent a request over IPMI for a host to power on (as in the example above), ipmitool would report the power status of the host as ON, but we would see no power going into the CPU and no activity on the CPU. ipmitool was correct that power was going into the chassis, but it told us nothing about the power state of the rest of the server, and we initially (and wrongly) assumed that since the chassis power was on, the rest of the server components should be ON as well. The <a href="https://documents.uow.edu.au/~blane/netapp/ontap/sysadmin/monitoring/concept/c_oc_mntr_bmc-sys-event-log.html"><u>System Event Log (SEL)</u></a>, which records platform-specific events, was not giving us any useful information beyond indicating that the server was in a soft-off state (powered off), a working state (operating system is loading and running), or that a “System Restart” of the host was initiated.</p>
            <pre><code># System Event Logs (SEL) showing the various power states of the server
$ ipmitool sel elist | tail -n3
  4d |  Pre-Init  |0000011021| System ACPI Power State ACPI_STATUS | S5_G2: soft-off | Asserted
  4e |  Pre-Init  |0000011022| System ACPI Power State ACPI_STATUS | S0_G0: working | Asserted
  4f |  Pre-Init  |0000011023| System Boot Initiated RESTART_CAUSE | System Restart | Asserted</code></pre>
            <p>In the System Event Logs shown above, ACPI is the acronym for Advanced Configuration and Power Interface, a standard for power management on computing systems. In the ACPI soft-off state, the host is powered off (the motherboard is on standby power, but the CPU/host isn’t powered on); according to the <a href="https://uefi.org/sites/default/files/resources/ACPI_Spec_6_5_Aug29.pdf"><u>ACPI specifications</u></a>, this state is called S5_G2. (These states are discussed in more detail below.) In the ACPI working state, the host is booted and running, known in the ACPI specifications as state S0_G0; in our case this report happened to be false, since the host was not actually running. The third row indicates that the restart was caused by a System Restart. Most of the boot-related SEL events are sent from the UEFI to the BMC. The UEFI has been something of a black box to us, as we rely on our original equipment manufacturers (OEMs) to develop the UEFI firmware for us, and for the generation of servers with this issue, the UEFI firmware did not implement sending the boot progress of the host to the BMC.</p><p>One discrepancy we observed was the difference between the chassis power status and the power going into the CPU, which we read with a sensor we call CPU_POWER.</p>
            <pre><code># Check power status
$ ipmitool &lt;some authentication&gt; &lt;bmc ip&gt;  power status
Chassis Power is on
</code></pre>
            <p>However, checking the power into the CPU shows that the CPU was not receiving any power.</p>
            <pre><code># Check power going into the CPU
$ ipmitool &lt;some authentication&gt; &lt;bmc ip&gt;  sdr | grep CPU_POWER    
CPU_POWER        | 0 Watts           | ok</code></pre>
            <p>The CPU_POWER reading of 0 watts contradicts all the previous information that the host was powered up and working, when the host was actually completely shut down.</p><p><b><i>Missing Memory Modules:</i></b> Our servers would randomly boot up with less memory than expected. Computers can boot up with less memory than installed due to a number of problems, such as a loose connection, a hardware problem, or faulty memory. In our case, it was none of the usual suspects: both the BMC and the UEFI were trying to read from the memory modules simultaneously, leading to access contention. Memory modules usually contain a <a href="https://en.wikipedia.org/wiki/Serial_presence_detect"><u>Serial Presence Detect (SPD)</u></a>, which is used by the UEFI to dynamically detect the memory module. The SPD is usually accessed over an <a href="https://learn.sparkfun.com/tutorials/i2c/all"><u>inter-integrated circuit (i2c)</u></a> bus, a low-speed, two-wire protocol for devices to talk to each other. The BMC also reads the temperature of the memory modules via i2c. When the server is powered on, the UEFI initializes, among other hardware, each memory module it can detect via that module’s Serial Presence Detect (SPD); at the same time, the BMC could be trying to read the temperature of the memory module over the same i2c bus. This simultaneous access denies one of the parties the bus. When the UEFI is denied access to the SPD, it concludes that the memory module is not available and skips over it. Below is an example of the related i2c-bus contention logs we saw in the <a href="https://www.freedesktop.org/software/systemd/man/latest/journalctl.html"><u>journal</u></a> of the BMC while the host was booting.</p>
            <pre><code>kernel: aspeed-i2c-bus 1e78a300.i2c-bus: irq handled != irq. expected 0x00000021, but was 0x00000020</code></pre>
            <p>The above logs indicate that the i2c address 1e78a300 (which happens to be connected to the serial presence detect of the memory modules) could not properly handle a signal, known as an interrupt request (irq). When this scenario plays out on the UEFI, the UEFI is unable to detect the memory module.</p>
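<p>Conceptually, this is a textbook shared-bus race. The toy Python model below (illustrative only, not BMC code) captures the shape of it: a lock stands in for exclusive bus ownership, and a flag stands in for the host state the BMC can observe, so the BMC defers its temperature read instead of colliding with the UEFI’s SPD reads:</p>
<pre><code>import threading

bus_lock = threading.Lock()           # exclusive ownership of the i2c bus
uefi_enumerating = threading.Event()  # "UEFI is reading SPDs" flag

detected_dimms = []

def uefi_enumerate(dimms):
    # UEFI reads each DIMM's SPD; it must not lose the bus mid-read.
    uefi_enumerating.set()
    for dimm in dimms:
        with bus_lock:
            detected_dimms.append(dimm)  # stand-in for a successful SPD read
    uefi_enumerating.clear()

def bmc_poll_temperature():
    # BMC defers its read while the UEFI owns the bus.
    if uefi_enumerating.is_set():
        return None  # deferred; fans fall back to a fixed speed
    with bus_lock:
        return 34  # stand-in for a DIMM temperature read, degrees C

t = threading.Thread(target=uefi_enumerate, args=(["DIMM_A1", "DIMM_B1"],))
t.start()
t.join()
print(detected_dimms)          # ['DIMM_A1', 'DIMM_B1'] -- no module skipped
print(bmc_poll_temperature())  # 34</code></pre>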
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Fe8wb6xqwXkanb8iPv8O2/eaecfe0474576a00cdc25bfeb6fba7a2/image4.png" />
          </figure><p><sup><i>Figure 3: I2C diagram showing I2C interconnection of the server’s memory modules (also known as DIMMs) with the BMC </i></sup></p><p><a href="https://www.techtarget.com/searchstorage/definition/DIMM"><u>DIMM</u></a> in Figure 3 refers to <a href="https://www.techtarget.com/searchstorage/definition/DIMM"><u>Dual Inline Memory Module</u></a>, which is the type of memory module used in servers.</p><p><b><i>Thermal telemetry:</i></b> During the boot-up process of some of our servers, some temperature devices, such as the temperature sensors of the memory modules, would show up as failed, causing some of the fans to enter a fail-safe <a href="https://en.wikipedia.org/wiki/Pulse-width_modulation"><u>Pulse Width Modulation (PWM)</u></a> mode. <a href="https://en.wikipedia.org/wiki/Pulse-width_modulation"><u>PWM</u></a> is a technique for controlling the power delivered to electronic devices by adjusting the duty cycle of a fixed-frequency signal, i.e. the fraction of each period the signal spends high. It is used in this case to control fan speed. When a fan enters fail-safe mode, its PWM duty cycle is set to a preset value, irrespective of what the optimized PWM setting for the fan should be, and this can negatively affect the cooling of the server and its power consumption.</p>
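<p>Incidentally, the first symptom above, chassis “on” while CPU_POWER reads 0 watts, is easy to flag automatically once you know to look for it. A minimal sketch of such a cross-check (hypothetical monitoring helper; the CPU_POWER sensor name is specific to our platform, and the inputs are the two ipmitool outputs shown earlier):</p>
<pre><code>def host_really_on(power_status, sdr_output):
    # Chassis power alone is not trustworthy: cross-check it against
    # the CPU_POWER sensor reading from `ipmitool ... sdr`.
    chassis_on = "Chassis Power is on" in power_status
    cpu_watts = 0.0
    for line in sdr_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) == 3 and fields[0] == "CPU_POWER":
            cpu_watts = float(fields[1].split()[0])
    return chassis_on and cpu_watts != 0.0

# The discrepancy from the logs above: chassis on, CPU drawing 0 W.
print(host_really_on("Chassis Power is on",
                     "CPU_POWER        | 0 Watts           | ok"))  # False</code></pre>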
    <div>
      <h2>Implementing host ACPI state on OpenBMC</h2>
      <a href="#implementing-host-acpi-state-on-openbmc">
        
      </a>
    </div>
    <p>In the process of studying the issues we faced relating to the boot-up process of the host, we learned how the power state of the subsystems within the chassis changes. Part of our learnings led us to investigate the Advanced Configuration and Power Interface (ACPI) and how the ACPI state of the host changed during the boot process.</p><p>Advanced Configuration and Power Interface (ACPI) is an open industry specification for power management used in desktop, mobile, workstation, and server systems. The <a href="https://uefi.org/sites/default/files/resources/ACPI_Spec_6_5_Aug29.pdf"><u>ACPI Specification</u></a> replaces previous power management methodologies such as <a href="https://en.wikipedia.org/wiki/Advanced_Power_Management"><u>Advanced Power Management (APM)</u></a>. ACPI provides the advantages of:</p><ul><li><p>Allowing OS-directed power management (OSPM).</p></li><li><p>Having a standardized and robust interface for power management.</p></li><li><p>Sending system-level events such as when the server power/sleep buttons are pressed </p></li><li><p>Hardware and software support, such as a real-time clock (RTC) to schedule the server to wake up from sleep or to reduce the functionality of the CPU based on RTC ticks when there is a loss of power.</p></li></ul><p>From the perspective of power management, ACPI enables an OS-driven conservation of energy by transitioning components which are not in active use to a lower power state, thereby reducing power consumption and contributing to more efficient power management.</p><p>The ACPI Specification defines four global “Gx” states, six sleeping “Sx” states, and four “Dx” device power states. These states are defined as follows:</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>Gx</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Name</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Sx</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Description</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>G0</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Working</span></span></p>
                    </td>
                    <td>
                        <p><span><span>S0</span></span></p>
                    </td>
                    <td>
                        <p><span><span>The run state. In this state the machine is fully running.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>G1</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Sleeping</span></span></p>
                    </td>
                    <td>
                        <p><span><span>S1</span></span></p>
                    </td>
                    <td>
                        <p><span><span>A sleep state where the CPU will suspend activity but retain its contexts.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>S2</span></span></p>
                    </td>
                    <td>
                        <p><span><span>A sleep state where memory contexts are held, but CPU contexts are lost. CPU re-initialization is done by firmware.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>S3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>A logically deeper sleep state than S2 where CPU re-initialization is done by device. Equates to Suspend to RAM.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>S4</span></span></p>
                    </td>
                    <td>
                        <p><span><span>A logically deeper sleep state than S3 in which DRAM context is not maintained and contexts are saved to disk. Can be implemented by either OS or firmware. </span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>G2</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Soft off but PSU still supplies power</span></span></p>
                    </td>
                    <td>
                        <p><span><span>S5</span></span></p>
                    </td>
                    <td>
                        <p><span><span>The soft off state. All activity stops, and all contexts are lost. The Complex Programmable Logic Device (CPLD), which is responsible for the power-up and power-down sequences of components such as the CPU and BMC, remains on standby power, but the CPU/host is off.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>G3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Mechanical off</span></span></p>
                    </td>
                    <td> </td>
                    <td>
                        <p><span><span>PSU does not supply power. The system is safe for disassembly.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Dx</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Name</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Description</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>D0</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Fully powered on</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Hardware device is fully functional and operational </span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>D1</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Hardware device is partially powered down</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Reduced functionality and can be quickly powered back to D0</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>D2</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Hardware device is in a deeper low-power state than D1</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Much more limited functionality and can only be slowly powered back to D0.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>D3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Hardware device is significantly powered down or off</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Device is inactive with perhaps only the ability to be powered back on</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p>The states that matter to us are:</p><ul><li><p><b>S0_G0_D0:</b> often referred to as the working state. Here we know our host system is running just fine.</p></li><li><p><b>S2_D2: </b>Memory contexts are held, but CPU context is lost. We usually use this state to know when the host’s UEFI is performing platform firmware initialization.</p></li><li><p><b>S5_G2:</b> Often referred to as the soft off state. Here we still have power going into the chassis, however, processor and DRAM context are not maintained, and the operating system power management of the host has no context.</p></li></ul><p>Since the issues we were experiencing were related to the power state changes of the host — when we asked the host to reboot or power on — we needed a way to track the various power state changes of the host as it went from power off to a complete working state. This would give us better management capabilities over the devices that were on the same power domain of the host during the boot process. Fortunately, the OpenBMC community already implemented an <a href="https://github.com/openbmc/google-misc/tree/master/subprojects/acpi-power-state-daemon"><u>ACPI daemon</u></a>, which we extended to serve our needs. We added an ACPI S2_D2 power state, in which memory contexts are held, but CPU context is lost, to the ACPI daemon running on the BMC to enable us to know when the host’s UEFI is performing firmware initialization, and also set up various management tasks for the different ACPI power states.</p><p>An example of a power management task we carry out using the S0_G0_D0 state is to re-export our Voltage Regulator (VR) sensors on S0_G0_D0 state, as shown with the service file below:</p>
            <pre><code>cat /lib/systemd/system/Re-export-VR-device.service 
[Unit]
Description=RE Export VR Device Process
Wants=xyz.openbmc_project.EntityManager.service
After=xyz.openbmc_project.EntityManager.service
Conflicts=host-s2-state.target

[Service]
Type=simple
ExecStart=/bin/bash -c 'set -a &amp;&amp; source /usr/bin/Re-export-VR-device.sh on'
SyslogIdentifier=Re-export-VR-device.service

[Install]
WantedBy=host-s0-state.target
</code></pre>
            <p>Having set this up, OpenBMC has a command handler (ipmiSetACPIState) in <a href="https://github.com/openbmc/phosphor-host-ipmid/tree/master"><u>phosphor-host-ipmid</u></a> that is responsible for setting the ACPI state of the host on the BMC. The host calls this handler using the standard IPMI command with NetFn=0x06 and Cmd=0x06.</p><p>In the event of an immediate power cycle (i.e. a host reboot without an operating system shutdown), the host is unable to send its S5_G2 state to the BMC. For this case, we created a patch to OpenBMC’s <a href="https://github.com/openbmc/x86-power-control/tree/master"><u>x86-power-control</u></a> to let the BMC become aware that the host has entered the ACPI S5_G2 state (i.e. soft-off). When the host comes out of the powered-off state, the UEFI performs the Power On Self Test (POST) and sends S2_D2 to the BMC, and after the UEFI has loaded the OS on the host, it notifies the BMC by sending the ACPI S0_G0_D0 state.</p>
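<p>For the curious, the request body of that command is only two bytes. The sketch below encodes it according to our reading of the IPMI v2.0 specification (Set ACPI Power State, NetFn 0x06, Cmd 0x06): each byte carries a power state in its low bits, with bit 7 set to mean “set this state”. The numeric state values here are our interpretation of the spec, so verify against it before relying on them:</p>
<pre><code># Sketch of the IPMI "Set ACPI Power State" request data
# (NetFn=0x06, Cmd=0x06), per our reading of the IPMI v2.0 spec.
SYSTEM_STATES = {"S0_G0": 0x00, "S5_G2": 0x05}  # working, soft-off
DEVICE_STATES = {"D0": 0x00, "D3": 0x03}

SET_STATE = 0x80  # bit 7: "set system/device power state" enabled

def set_acpi_power_state(system, device):
    # Two data bytes: system power state, then device power state.
    return bytes([SET_STATE | SYSTEM_STATES[system],
                  SET_STATE | DEVICE_STATES[device]])

# e.g. what the host sends once the OS is up and running:
print(set_acpi_power_state("S0_G0", "D0").hex())  # 8080</code></pre>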
    <div>
      <h2>Fixing the issues</h2>
      <a href="#fixing-the-issues">
        
      </a>
    </div>
    <p>Going back to the boot-up issues we faced, we discovered that they were mostly caused by devices in the same power domain as the CPU interfering with the UEFI/platform firmware initialization phase. Below is a high-level description of the fixes we applied.</p><p><b><i>Servers not booting</i></b><b>:</b> After identifying the devices that were interfering with the POST stage of firmware initialization, we used the host ACPI state to control when we set the appropriate power mode for those devices, so that they do not cause POST to fail.</p><p><b><i>Memory modules missing</i></b><b>:</b> During the boot-up process, memory modules (DIMMs) are powered and initialized in the S2_D2 ACPI state. During this initialization, the UEFI firmware sends read commands to the Serial Presence Detect (SPD) on each DIMM to retrieve information for DIMM enumeration. At the same time, the BMC could be sending commands to read DIMM temperature sensors. This can cause SMBus collisions, which can make either the DIMM temperature reading or the UEFI DIMM enumeration fail. The latter case would cause the system to boot up with reduced DIMM capacity, which could be mistaken for a failing DIMM. After we discovered the race condition, we stopped the BMC from reading the DIMM temperature sensors during the S2_D2 ACPI state and set a fixed speed for the corresponding fans. This solution allows our UEFI to retrieve all the necessary DIMM information for enumeration, and our servers now boot up with the correct amount of memory.</p><p><b><i>Thermal telemetry:</i></b> In the S0_G0 power state, when sensors are not reporting values back to the BMC, the BMC assumes that devices may be overheating and puts the fan controller into a fail-safe mode where fan speeds are ramped up to maximum. However, in the S5_G2 state, some thermal sensors, such as the CPU and NIC temperature sensors, are not powered and therefore not available. Our solution is to mark these thermal sensors as non-functional in their exported configuration while in the S5_G2 state and during the transition from S5_G2 to S2_D2. Marking the affected devices as non-functional in their configuration, instead of waiting for thermal sensor read commands to error out, prevents the controller from entering fail-safe mode.</p>
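<p>The thermal fix amounts to a small change in the fan controller’s decision logic: a sensor that is marked non-functional for the current ACPI state is treated as expectedly absent rather than failed. An illustrative sketch of that logic (hypothetical Python, not the actual OpenBMC fan control code; sensor names and PWM values are made up):</p>
<pre><code># Only sensors expected to be functional in the current ACPI state
# may trigger fail-safe mode; unpowered sensors are ignored.
FUNCTIONAL_IN_STATE = {
    "S0_G0": {"CPU_TEMP", "NIC_TEMP", "DIMM_TEMP_A1"},
    "S5_G2": set(),  # host powered off: these sensors are unpowered
}

FAILSAFE_PWM = 100  # preset maximum fan speed, percent
NORMAL_PWM = 40     # stand-in for the thermal loop's computed speed

def fan_pwm(acpi_state, readings):
    expected = FUNCTIONAL_IN_STATE[acpi_state]
    # A missing reading is only a failure if the sensor should be alive.
    failed = [s for s in expected if readings.get(s) is None]
    return FAILSAFE_PWM if failed else NORMAL_PWM

# In S5_G2 the unpowered sensors no longer force fail-safe mode:
print(fan_pwm("S5_G2", {"CPU_TEMP": None, "NIC_TEMP": None}))  # 40
# In S0_G0 a genuinely silent sensor still ramps the fans up:
print(fan_pwm("S0_G0", {"CPU_TEMP": None, "NIC_TEMP": 45,
                        "DIMM_TEMP_A1": 34}))  # 100</code></pre>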
    <div>
      <h2>Moving forward</h2>
      <a href="#moving-forward">
        
      </a>
    </div>
    <p>Aside from resolving issues, implementing ACPI power states in our BMC firmware has brought other benefits. One example is our automated firmware regression testing: various parts of our tests require rebooting or power cycling servers over a hundred times, during which we monitor the ACPI power state changes of our servers rather than relying on a simple boolean (running or not, pingable or not) to assert their status.</p><p>It has also given us the opportunity to learn more about the complex subsystems in a server system and the various power modes of those subsystems. This is an area we are still actively learning about as we look to further optimize the boot sequence of our servers.</p><p>Over time, implementing ACPI states has helped us achieve the following:</p><ul><li><p>All components are enabled by the end of the boot sequence,</p></li><li><p>BIOS and BMC are able to retrieve component information,</p></li><li><p>The BMC is aware when thermal sensors are in a non-functional state.</p></li></ul><p>For better observability of the boot progress and “last state” of our systems, we have also started adding the BootProgress object of the <a href="https://redfish.dmtf.org/schemas/v1/ComputerSystem.v1_13_0.json"><u>Redfish ComputerSystem Schema</u></a> to our systems. This gives us pre-operating system (OS) boot observability and an easier debugging starting point when the UEFI has issues during server platform initialization (such as when the server isn’t coming on).</p><p>Every day, Cloudflare’s OpenBMC team, made up of folks from different embedded backgrounds, learns about, experiments with, and deploys OpenBMC across our global fleet. This has been made possible by the OpenBMC community’s contributions (as well as upstreaming some of our own) and by our interactions with our various vendors. It gives us ownership of, and responsibility for, the firmware that powers the BMCs managing our servers, and the opportunity to make our systems more reliable. If you are thinking of embracing open-source firmware in your BMC, we hope this blog post, written by a team that started deploying OpenBMC less than 18 months ago, has inspired you to give it a try.</p><p>For those interested in making the jump to open-source firmware, check it out <a href="https://github.com/openbmc/openbmc"><u>here</u></a>!</p> ]]></content:encoded>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[OpenBMC]]></category>
            <category><![CDATA[Servers]]></category>
            <category><![CDATA[Firmware]]></category>
            <guid isPermaLink="false">2hySj1JFTXmlofjA6IRijm</guid>
            <dc:creator>Nnamdi Ajah</dc:creator>
            <dc:creator>Ryan Chow</dc:creator>
            <dc:creator>Giovanni Pereira Zantedeschi</dc:creator>
        </item>
        <item>
            <title><![CDATA[Leveraging Kubernetes virtual machines at Cloudflare with KubeVirt]]></title>
            <link>https://blog.cloudflare.com/leveraging-kubernetes-virtual-machines-with-kubevirt/</link>
            <pubDate>Tue, 08 Oct 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ The Kubernetes team runs several multi-tenant clusters across Cloudflare’s core data centers. When multi-tenant cluster isolation is too limiting for an application, we use KubeVirt. KubeVirt is a cloud-native solution that enables our developers to run virtual machines alongside containers. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare runs several <a href="https://kubernetes.io/docs/concepts/security/multi-tenancy/"><u>multi-tenant</u></a> <a href="https://kubernetes.io/"><u>Kubernetes</u></a> clusters across our core data centers. These general-purpose clusters run on bare metal and power our <a href="https://www.cloudflare.com/learning/network-layer/what-is-the-control-plane/"><u>control plane</u></a>, analytics, and various engineering tools such as build infrastructure and continuous integration.</p><p>Kubernetes is a container orchestration platform that enables software engineers to deploy containerized applications to a cluster of machines, allowing teams to build highly available software on a scalable and resilient platform.</p><p>In this blog post, we discuss our Kubernetes architecture, why we needed virtualization, and how we’re using it today.</p>
    <div>
      <h2>Multi-tenant clusters</h2>
      <a href="#multi-tenant-clusters">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/cloud/what-is-multitenancy/"><u>Multi-tenancy</u></a> is a concept where one system can share its resources among a wide range of customers. This model allows us to build and manage a small number of general purpose Kubernetes clusters for our internal application teams. Keeping the number of clusters small reduces our operational toil. This model shrinks costs and increases computational efficiency by sharing hardware. Multi-tenancy also allows us to scale more efficiently. Scaling is done at either a cluster or application level. Cluster operators scale the platform by adding more hardware. Teams scale their applications by updating their Kubernetes manifests. They can scale <a href="https://en.wikipedia.org/wiki/Scalability#Vertical_or_scale_up"><u>vertically</u></a> by increasing their <a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/"><u>resource</u></a> requests or <a href="https://en.wikipedia.org/wiki/Scalability#Horizontal_or_scale_out"><u>horizontally</u></a> by increasing the number of replicas.</p><p>All of our Kubernetes clusters are multi-tenant with various components enabled for a secure and resilient platform.</p><p><a href="https://kubernetes.io/docs/concepts/workloads/pods/"><u>Pods</u></a> are secured using the latest standards recommended by the Kubernetes project. We use <a href="https://kubernetes.io/docs/concepts/security/pod-security-admission/"><u>Pod Security Admission</u></a> (PSA) and <a href="https://kubernetes.io/docs/concepts/security/pod-security-standards/"><u>Pod Security Standards</u></a> to ensure all workloads are following best practices. 
By default, all namespaces use the most <a href="https://kubernetes.io/docs/concepts/security/pod-security-standards/#restricted"><u>restrictive</u></a> profile, and only a few Kubernetes control plane namespaces are granted <a href="https://kubernetes.io/docs/concepts/security/pod-security-standards/#privileged"><u>privileged</u></a> access. For additional policies not covered by PSA, we built custom <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook"><u>Validating Webhooks</u></a> on top of the <a href="https://github.com/kubernetes-sigs/controller-runtime/tree/main/pkg/webhook/admission"><u>controller-runtime</u></a> framework. PSA and our custom policies ensure clusters are secure and workloads are isolated.</p>
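    <p>To illustrate the mechanism (the namespace name here is hypothetical, not one of our actual tenants), enforcing the restricted profile on a tenant namespace is done with PSA labels:</p>
            <pre><code>apiVersion: v1
kind: Namespace
metadata:
  name: team-example
  labels:
    # Reject pods that violate the restricted Pod Security Standard
    pod-security.kubernetes.io/enforce: restricted
    # Also surface warnings to users at admission time
    pod-security.kubernetes.io/warn: restricted</code></pre>
            <p><sup><i>A sketch of a tenant namespace enforcing the restricted Pod Security Standard via PSA labels</i></sup></p>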
    <div>
      <h2>Our need for virtualization</h2>
      <a href="#our-need-for-virtualization">
        
      </a>
    </div>
    <p>A select number of teams needed tight integration with the Linux kernel. Examples include Docker daemons for build infrastructure and the ability to simulate servers running the software and configuration of our <a href="https://www.cloudflare.com/network/"><u>global network</u></a>. With our pod security requirements, these workloads are not permitted to interface with the host kernel at a deep level (e.g. no <a href="https://en.wikipedia.org/wiki/Iptables"><u>iptables</u></a> or <a href="https://en.wikipedia.org/wiki/Sysctl"><u>sysctls</u></a>). Doing so may disrupt other tenants sharing the node and open additional <a href="https://www.cloudflare.com/learning/security/glossary/attack-vector/"><u>attack vectors</u></a> if an application were compromised. A virtualization platform would enable these workloads to interact with their own kernel within a secured Kubernetes cluster.</p><p>We considered several virtualization solutions. Running a separate virtualization platform outside of Kubernetes would have worked, but would not tightly integrate containerized workloads with virtual machines. It would also be an additional operational burden on our team, as backups, alerting, and fleet management would have to exist for both our Kubernetes and virtual machine clusters.</p><p>We then looked for solutions that run virtual machines within Kubernetes. Teams could already manually deploy <a href="https://www.qemu.org/"><u>QEMU</u></a> pods, but this was not an elegant solution. We needed a better way. There were several other options, but <a href="https://kubevirt.io/"><u>KubeVirt</u></a> was the tool that met the majority of our requirements. Other solutions required a privileged container to run a virtual machine, but KubeVirt did not – this was a crucial requirement in our goal of creating a more secure multi-tenant cluster. 
KubeVirt also uses a feature of the Kubernetes API called <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/"><u>Custom Resource Definitions</u></a> (CRDs), which extends the Kubernetes API with new objects, increasing the flexibility of Kubernetes beyond its built-in types. For KubeVirt, this includes objects such as VirtualMachine and VirtualMachineInstanceReplicaSet. We felt the use of CRDs would allow KubeVirt to grow as more features were added.</p>
    <div>
      <h2>What is KubeVirt?</h2>
      <a href="#what-is-kubevirt">
        
      </a>
    </div>
    <p>KubeVirt is a virtualization platform that enables users to run virtual machines within Kubernetes. With KubeVirt, <a href="https://www.cloudflare.com/learning/cloud/what-is-a-virtual-machine/"><u>virtual machines</u></a> run alongside containerized workloads on the same platform. Kubernetes primitives such as <a href="https://kubernetes.io/docs/concepts/services-networking/network-policies/"><u>network policies</u></a>, <a href="https://kubernetes.io/docs/concepts/configuration/configmap/"><u>configmaps</u></a>, and <a href="https://kubernetes.io/docs/concepts/services-networking/service/"><u>services</u></a> all integrate with virtual machines. KubeVirt scales with our needs and is successfully running hundreds of virtual machines across several clusters. We frequently <a href="https://blog.cloudflare.com/automatic-remediation-of-kubernetes-nodes"><u>remediate Kubernetes nodes</u></a>, so virtual machines and pods are always exercising their startup/shutdown processes.</p>
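    <p>To make this concrete, here is a minimal sketch of a KubeVirt VirtualMachine manifest (the name and container disk image are illustrative, not what we run in production):</p>
            <pre><code>apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: example-vm
spec:
  running: true
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 2Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
      volumes:
        # An ephemeral disk backed by an image in a container registry
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/debian:12</code></pre>
            <p><sup><i>A minimal VirtualMachine definition; KubeVirt schedules it as a pod wrapping a QEMU process</i></sup></p>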
    <div>
      <h2>How Cloudflare uses KubeVirt</h2>
      <a href="#how-cloudflare-uses-kubevirt">
        
      </a>
    </div>
    <p>There are a number of internal projects leveraging virtual machines at Cloudflare. We’ll touch on a few of our more popular use cases:</p><ol><li><p>Kubernetes scalability testing</p></li><li><p>Development environments</p></li><li><p>Kernel and iPXE testing</p></li><li><p>Build pipelines</p></li></ol>
    <div>
      <h3>Kubernetes scalability testing</h3>
      <a href="#kubernetes-scalability-testing">
        
      </a>
    </div>
    
    <div>
      <h4>Setup process</h4>
      <a href="#setup-process">
        
      </a>
    </div>
    <p>Our staging clusters are much smaller than our largest production clusters. They also run on bare metal and mirror the configuration we have for each production cluster. This is extremely useful when rolling out new software, operating systems, or kernel changes; however, they miss bugs that only surface at scale. We use KubeVirt to bridge this gap and virtualize Kubernetes clusters with hundreds of nodes and thousands of pods.</p><p>The setup process for virtualized clusters differs from our bare metal provisioning steps. For bare metal, we use <a href="https://saltproject.io/"><u>Salt</u></a> to provision clusters from start to finish. For our virtualized clusters we use <a href="https://www.ansible.com/"><u>Ansible</u></a> and <a href="https://kubernetes.io/docs/reference/setup-tools/kubeadm/"><u>kubeadm</u></a>.  Our bare metal staging clusters are responsible for testing and validating our Salt configuration. The virtualized clusters give us a vanilla Kubernetes environment without any Cloudflare customizations. Having a stock environment in addition to our Salt environment helps us isolate bugs down to a Kubernetes change, a kernel change, or a Cloudflare-specific configuration change.</p><p>Our virtualized clusters consist of a KubeVirt <a href="https://kubevirt.io/api-reference/v1.2.2/definitions.html#_v1_virtualmachine"><u>VirtualMachine</u></a> object per node. We create three control-plane nodes and any number of worker nodes. Each virtual machine starts out as a vanilla Debian generic <a href="https://cdimage.debian.org/images/cloud/"><u>cloud image</u></a>. 
Using KubeVirt’s <a href="https://kubevirt.io/user-guide/user_workloads/startup_scripts/#cloud-init"><u>cloud-init support</u></a>, the virtual machine downloads an internal <a href="https://www.ansible.com/"><u>Ansible</u></a> <a href="https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_intro.html"><u>playbook</u></a> which installs a recent kernel, <a href="https://cri-o.io/"><u>cri-o</u></a> (the container runtime we use), and <a href="https://kubernetes.io/docs/reference/setup-tools/kubeadm/"><u>kubeadm</u></a>.</p>
            <pre><code>- name: Add the Kubernetes gpg key
  apt_key:
    url: https://pkgs.k8s.io/core:/stable:/{{ kube_version }}/deb/Release.key
    keyring: /etc/apt/keyrings/kubernetes-apt-keyring.gpg
    state: present

- name: Add the Kubernetes repository
  shell: echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/{{ kube_version }}/deb/ /" | tee /etc/apt/sources.list.d/kubernetes.list

- name: Add the CRI-O gpg key
  apt_key:
    url: https://pkgs.k8s.io/addons:/cri-o:/{{ crio_version }}/deb/Release.key
    keyring: /etc/apt/keyrings/cri-o-apt-keyring.gpg
    state: present

- name: Add the CRI-O repository
  shell: echo "deb [signed-by=/etc/apt/keyrings/cri-o-apt-keyring.gpg] https://pkgs.k8s.io/addons:/cri-o:/{{ crio_version }}/deb/ /" | tee /etc/apt/sources.list.d/cri-o.list

- name: Install CRI-O and Kubernetes packages
  apt:
    name:
      - cri-o
      - kubelet
      - kubeadm
      - kubectl
    update_cache: yes
    state: present

- name: Enable and start CRI-O service
  service:
    state: started
    enabled: yes
    name: crio.service</code></pre>
            <p><sup><i>Ansible playbook steps to download and install Kubernetes tooling</i></sup></p><p>Once each node has completed its individual playbook, we can <a href="https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-init/"><u>initialize</u></a> and <a href="https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-join/"><u>join</u></a> nodes to the cluster using another playbook that runs kubeadm. From there, the cluster can be accessed by logging into a control plane node and using kubectl.</p>
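            <p>The bootstrap playbook itself is internal; a hedged sketch of what such kubeadm tasks could look like (the group names and variables are hypothetical):</p>
            <pre><code># Sketch only: group names and variables are illustrative
- name: Initialize the first control-plane node
  command: kubeadm init --control-plane-endpoint "{{ api_endpoint }}"
  when: inventory_hostname == groups['kube-control-plane'][0]

- name: Join the remaining nodes to the cluster
  command: >
    kubeadm join {{ api_endpoint }}
    --token {{ join_token }}
    --discovery-token-ca-cert-hash {{ ca_cert_hash }}
  when: inventory_hostname != groups['kube-control-plane'][0]</code></pre>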
    <div>
      <h4>Simulating at scale</h4>
      <a href="#simulating-at-scale">
        
      </a>
    </div>
    <p>When tens or hundreds of nodes are lost at once, Kubernetes needs to act quickly to minimize downtime. The sooner it recognizes node failure, the faster it can reroute traffic to healthy pods.</p><p>By running Kubernetes inside KubeVirt, we are able to simulate a large cluster undergoing a network cut and observe how Kubernetes reacts. The virtualized cluster allows us to rapidly iterate on configuration changes and code patches.</p><p>The following Ansible playbook task simulates a network segmentation failure where only the control-plane nodes remain online.</p>
            <pre><code>- name: Disable network interfaces on all workers
  command: ifconfig enp1s0 down
  async: 5
  poll: 0
  ignore_errors: yes
  when: inventory_hostname in groups['kube-node']</code></pre>
            <p><sup><i>An Ansible task that disables the network on all worker nodes simultaneously.</i></sup></p><p>This framework allows us to exercise the code in <a href="https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/"><u>controller-manager</u></a>, Kubernetes’s daemon that reconciles the fundamental state of the system (Nodes, Pods, etc). Our simulation platform helped us drastically shorten full traffic recovery time when a large number of Kubernetes nodes <a href="https://blog.cloudflare.com/major-data-center-power-failure-again-cloudflare-code-orange-tested"><u>become unreachable</u></a>. We upstreamed our <a href="https://github.com/kubernetes/kubernetes/pull/114296"><u>changes</u></a> to Kubernetes and more controller-manager speed improvements are coming soon.</p>
    <div>
      <h3>Development environments</h3>
      <a href="#development-environments">
        
      </a>
    </div>
    <p>Compiling code on your laptop can be slow. Perhaps you’re working on a patch for a large open-source project (e.g. <a href="https://v8.dev/"><u>V8</u></a> or <a href="https://clickhouse.com/"><u>Clickhouse</u></a>) or need more bandwidth to upload and download containers. With KubeVirt, we enable our developers to rapidly iterate on software development and testing on <a href="https://blog.cloudflare.com/cloudflare-gen-12-server-bigger-better-cooler-in-a-2u1n-form-factor"><u>powerful server hardware</u></a>. KubeVirt <a href="https://kubevirt.io/user-guide/storage/disks_and_volumes/#persistentvolumeclaim"><u>integrates</u></a> with Kubernetes <a href="https://kubernetes.io/docs/concepts/storage/persistent-volumes/"><u>Persistent Volumes</u></a>, which enables teams to persist their development environment across restarts.</p><p>There are a number of teams at Cloudflare using KubeVirt for a variety of development and testing environments. Most notably is a project called Edge Test Fleet, which emulates a physical server and all the software that runs Cloudflare’s <a href="https://www.cloudflare.com/network/"><u>global network</u></a>. Teams can test their code and configuration changes against the entire software stack without reserving dedicated hardware. Cloudflare uses <a href="https://blog.cloudflare.com/tag/salt/"><u>Salt</u></a> to provision systems. It can be difficult to iterate and test Salt changes without a complete virtual environment. Edge Test Fleet makes iterating on Salt easier, ensuring states compile and render the right output. With Edge Test Fleet, new developers can better understand how Cloudflare’s global network works without touching staging or production.</p><p>Additionally, one Cloudflare team developed a framework that allows users to build and test changes to <a href="https://blog.cloudflare.com/log-analytics-using-clickhouse"><u>Clickhouse</u></a> using a <a href="https://code.visualstudio.com/"><u>VSCode</u></a> environment. 
This framework is generally applicable to all teams requiring a development environment. Once a template environment is provisioned, <a href="https://kubernetes.io/docs/concepts/storage/volume-pvc-datasource/"><u>CSI Volume Cloning</u></a> can duplicate a golden volume, separating persistent environments for each developer.</p>
            <pre><code>apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: devspace-jcichra-rootfs
  namespace: dev-clickhouse-vms
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: rook-ceph-nvme
  dataSource:
    kind: PersistentVolumeClaim
    name: dev-rootfs
  resources:
    requests:
      storage: 500Gi</code></pre>
            <p><sup><i>A PersistentVolumeClaim that clones data from another volume using CSI Volume Cloning</i></sup></p>
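            <p>Once cloned, the claim can be attached to that developer’s virtual machine like any other disk. A sketch of the relevant KubeVirt spec fragment, reusing the claim name from the example above:</p>
            <pre><code># Fragment of a VirtualMachine spec; the disk/volume names are illustrative
domain:
  devices:
    disks:
      - name: rootfs
        disk:
          bus: virtio
volumes:
  - name: rootfs
    persistentVolumeClaim:
      claimName: devspace-jcichra-rootfs</code></pre>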
    <div>
      <h3>Kernel and iPXE testing</h3>
      <a href="#kernel-and-ipxe-testing">
        
      </a>
    </div>
    <p>Unlike <a href="https://en.wikipedia.org/wiki/User_space_and_kernel_space"><u>user space</u></a> software development, when a kernel crashes, the entire system crashes. The <a href="https://blog.cloudflare.com/tag/kernel"><u>kernel</u></a> team uses KubeVirt for development. KubeVirt gives all kernel engineers, regardless of laptop OS or architecture, the same x86 environment and <a href="https://en.wikipedia.org/wiki/Hypervisor"><u>hypervisor</u></a>. Virtual machines on server hardware can be scaled up to more cores and memory than on laptops. The Cloudflare kernel team has also found low-level issues which only surface in environments with many CPUs.</p><p>To make testing fast and easy, the kernel team serves <a href="https://ipxe.org/"><u>iPXE</u></a> images via an <a href="https://nginx.org/"><u>nginx</u></a> Pod and Service adjacent to the virtual machine. A recent kernel and Debian image are copied to the nginx pod via kubectl cp. The iPXE file can then be referenced in the KubeVirt virtual machine definition via the DNS name for the Kubernetes Service.</p>
            <pre><code>interfaces:
  - name: default
    masquerade: {}
    model: e1000e
    ports:
      - port: 22
    dhcpOptions:
      bootFileName: http://httpboot.u-$K8S_USER.svc.cluster.local/boot.ipxe</code></pre>
            <p>When the virtual machine boots, it will get an IP address on the default interface behind <a href="https://en.wikipedia.org/wiki/Network_address_translation"><u>NAT</u></a> due to our <a href="https://kubevirt.io/user-guide/network/interfaces_and_networks/#masquerade"><u>masquerade</u></a> setting. Then it will download boot.ipxe, which describes what additional files should be downloaded to start the system. In this case, the kernel (<code>vmlinuz-amd64</code>), Debian (<code>baseimg-amd64.img</code>) and additional kernel modules (<code>modules-amd64.img</code>) are downloaded.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/74Pndk3FS6TVPACarSKD4N/fc7c3add6bae3c2c8b5e086ef9061872/image2.png" />
          </figure><p><sup><i>UEFI iPXE boot connecting and downloading files from nginx pod in user’s namespace</i></sup></p><p>Once the system is booted, a developer can log in to the system for testing:</p>
            <pre><code>linux login: root
Password: 
Linux linux 6.6.35-cloudflare-2024.6.7 #1 SMP PREEMPT_DYNAMIC Mon Sep 27 00:00:00 UTC 2010 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
root@linux:~# </code></pre>
            <p>Custom kernels can be copied to the nginx pod via <code>kubectl cp</code>. Restarting the virtual machine will load that new kernel for testing. When a kernel panic occurs, the virtual machine can quickly be restarted with <code>virtctl restart linux</code> and it will go through the iPXE boot process again.</p>
    <div>
      <h3>Build pipelines</h3>
      <a href="#build-pipelines">
        
      </a>
    </div>
    <p>Cloudflare leverages KubeVirt to build the majority of our software. Virtual machines give build system users full control over their pipeline. For example, Debian packages can easily be installed and separate container daemons (such as <a href="https://www.docker.com/"><u>Docker</u></a>) can run all within a Kubernetes namespace using the <code>restricted</code> Pod Security Standard. KubeVirt’s <a href="https://kubevirt.io/user-guide/user_workloads/replicaset/"><u>VirtualMachineInstanceReplicaSet</u></a> concept allows us to quickly scale the number of build agents up and down to match demand. We can roll out different sets of virtual machines with varying sizes, kernels, and operating systems.</p><p>To scale efficiently, we leverage <a href="https://kubevirt.io/user-guide/storage/disks_and_volumes/#containerdisk"><u>container disks</u></a> to store our agent virtual machine images. Container disks allow us to store the virtual machine image (for example, a <a href="https://en.wikipedia.org/wiki/Qcow"><u>qcow</u></a> image) in our container registry. This strategy works well when the state in virtual machines is ephemeral. <a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-command"><u>Liveness probes</u></a> detect unhealthy or broken agents, shutting down the virtual machine and replacing it with a fresh instance. Other automation limits virtual machine uptime, capping it to 3–4 hours to keep build agents fresh.</p>
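    <p>As a hedged sketch of this pattern (the replica count, labels, and registry image are illustrative, not our actual configuration), a pool of ephemeral build agents backed by a container disk can be declared like this:</p>
            <pre><code>apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceReplicaSet
metadata:
  name: build-agents
spec:
  replicas: 4
  selector:
    matchLabels:
      app: build-agent
  template:
    metadata:
      labels:
        app: build-agent
    spec:
      domain:
        resources:
          requests:
            cpu: "4"
            memory: 8Gi
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
      volumes:
        # Ephemeral root disk pulled from the container registry
        - name: rootdisk
          containerDisk:
            image: registry.example.com/build/agent-image:latest</code></pre>
            <p><sup><i>Scaling the replica count up or down adds or removes build agents</i></sup></p>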
    <div>
      <h2>Next steps</h2>
      <a href="#next-steps">
        
      </a>
    </div>
    <p>We’re excited to expand our use of KubeVirt and unlock new capabilities for our internal users. KubeVirt’s Linux ARM64 support will allow us to build ARM64 packages in-cluster and simulate ARM64 systems.</p><p>Projects like <a href="https://kubevirt.io/user-guide/operations/containerized_data_importer/"><u>KubeVirt CDI</u></a> (Containerized Data Importer) will streamline our users’ virtual machine experience. Instead of users manually building container disks, we can provide a catalog of virtual machine images. It also allows us to copy virtual machine disks between namespaces.</p>
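    <p>For instance (a sketch only; the registry URL and sizes are illustrative), a CDI DataVolume can import a virtual machine image from a registry into a PersistentVolumeClaim, so users never build container disks themselves:</p>
            <pre><code>apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: debian-golden
spec:
  source:
    # CDI pulls the image and writes it into the claim below
    registry:
      url: docker://registry.example.com/images/debian-vm:latest
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 20Gi</code></pre>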
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>KubeVirt has proven to be a great tool for virtualization in our Kubernetes-first environment. We’ve unlocked the ability to support more workloads with our multi-tenant model. The KubeVirt platform allows us to offer a single compute platform supporting containers and virtual machines. Managing it has been simple, and upgrades have been straightforward and non-disruptive. We’re exploring additional features KubeVirt offers to improve the experience for our users.</p><p>Finally, our team is expanding! We’re looking for more people passionate about Kubernetes to <a href="https://boards.greenhouse.io/cloudflare/jobs/5579824"><u>join our team</u></a> and help us push Kubernetes to the next level.</p> ]]></content:encoded>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <guid isPermaLink="false">1149BgOuHn2l6ubvzlzHar</guid>
            <dc:creator>Justin Cichra</dc:creator>
        </item>
    </channel>
</rss>