
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Tue, 14 Apr 2026 21:53:13 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Beyond the blank slate: how Cloudflare accelerates your Zero Trust journey]]></title>
            <link>https://blog.cloudflare.com/cloudflare-one-onboarding-project-helix/</link>
            <pubDate>Mon, 02 Mar 2026 06:00:00 GMT</pubDate>
            <description><![CDATA[ Project Helix simplifies and accelerates the onboarding process for Cloudflare One. By using automation and Terraform templates, this tool allows customers to quickly deploy a comprehensive, best-practice configuration in minutes. ]]></description>
            <content:encoded><![CDATA[ <p>In the world of cybersecurity, "starting from scratch" is a double-edged sword. On one hand, you have a clean slate; on the other, you face a mountain of configurations, best practices, and potential "gotchas."</p><p>While <a href="https://www.cloudflare.com/zero-trust/"><u>Cloudflare One</u></a> has often been cited as one of the easiest-to-use SASE platforms, there is no magic without proper configuration. Cloudflare has worked to simplify complex networking concepts with products such as <a href="https://www.cloudflare.com/network-services/products/magic-wan/"><u>Cloudflare WAN</u></a>, <a href="https://www.cloudflare.com/network-services/products/magic-transit/"><u>Magic Transit</u></a>, and <a href="https://www.cloudflare.com/network-services/products/magic-firewall/"><u>Cloudflare Network Firewall</u></a>, which reduce the complexity typically associated with deploying comparable functions from other vendors. Even so, the breadth of capabilities in Cloudflare One requires best-practice policies and templates to achieve optimal outcomes.</p><p>To make it easy to start taking advantage of Cloudflare’s powerful SASE platform, we have developed a method that ensures customers get the right configuration quickly and easily. We call it Project Helix.</p><p>In this post, we’ll dig into the problem of getting the configuration right, and how we built Project Helix to make it simple. That means our customers have access to the most powerful SASE platform out there — and the easiest to onboard.</p>
    <div>
      <h2>The complexity barrier: Why a 'blank slate' can slow Zero Trust adoption</h2>
      <a href="#the-complexity-barrier-why-a-blank-slate-can-slow-zero-trust-adoption">
        
      </a>
    </div>
    <p>Cloudflare One is the world’s largest composable platform, and we enable our product teams to release different capabilities when they are ready. That means customers get access to cutting-edge features as soon as possible, but sometimes these features require tweaking settings or attributes that are set in the platform by default.</p><p>For example, Cloudflare One provides DNS protection, network protection, a Secure Web Gateway, and Zero Trust access to any private application, all included in our comprehensive <a href="https://www.cloudflare.com/plans/enterprise/interna/#why-cloudflare-interna-packages"><u>Interna</u></a> packages. But deploying advanced security capabilities such as Secure Web Gateway, TLS inspection, DLP, AV scanning, etc. may be too disruptive right out of the gate — so a Cloudflare One tenant is typically provisioned with a blank slate. That means there are many switches to flip to enable the full power of Cloudflare One.</p><p>So we faced a dilemma: How can we help our customers get the right settings, right away?</p><p>We started by releasing guides to help administrators get started quickly, wherein they could select a scenario that matches their goals and outcomes.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/01zYmAkjofG6lVx5Ped3IO/62e0817816463fd1144b665f014a338a/image5.png" />
          </figure><p>But we soon realized that this approach did not achieve the frictionless nirvana we were after. For example, customers who wanted to take advantage of all four scenarios described in the “Get Started” guide would need to step through each of those wizards individually.</p><p>In another instance, we released a highly anticipated capability to <a href="https://blog.cloudflare.com/tunnel-hostname-routing/"><u>connect and secure any private app by hostname</u></a>. But it was tricky to enable: in addition to flipping a switch on the Cloudflare One settings page, it required customers to change their default split tunnel configuration to include a specific CGNAT range designated for this traffic to be sent to Cloudflare via the Cloudflare One Client. We couldn’t easily make this change part of the default Cloudflare One Client profile, as any change affecting traffic routing on a customer’s network could potentially break existing environments.</p><p>For greenfield deployments, we wanted any customer to be able to benefit from this capability without friction.</p><p>We needed a way to capture the knowledge we have, and use it to navigate the numerous knobs, switches, and policies on behalf of our customers — so they can take advantage of the full breadth of innovation.</p>
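<p>To make the split tunnel requirement concrete, here is a minimal sketch (not Cloudflare’s implementation, and the exact range Cloudflare designates is an assumption here): the change amounts to ensuring that addresses in a CGNAT block such as the shared space 100.64.0.0/10 match a split-tunnel include entry.</p>

```typescript
// Illustrative only: check whether an IPv4 address falls inside a CIDR block,
// e.g. the shared CGNAT space 100.64.0.0/10 that a split-tunnel "include"
// entry might need to cover.
function ipv4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => (acc << 8) + Number(octet), 0) >>> 0;
}

function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsStr] = cidr.split("/");
  const bits = Number(bitsStr);
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
  return (ipv4ToInt(ip) & mask) === (ipv4ToInt(base) & mask);
}

// 100.64.0.0/10 spans 100.64.0.0 – 100.127.255.255
console.log(inCidr("100.96.0.1", "100.64.0.0/10"));  // true
console.log(inCidr("100.128.0.1", "100.64.0.0/10")); // false
```

<p>A misconfigured include entry here means traffic for the new capability never reaches Cloudflare, which is exactly the class of subtle, breakable detail Helix automates away.</p>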
    <div>
      <h2>Project Helix: Codifying expertise and automation</h2>
      <a href="#project-helix-codifying-expertise-and-automation">
        
      </a>
    </div>
    <p>To achieve this goal, we needed to find a reliable way of taking the amazing brainpower of our Solutions Engineers, Professional Service Engineers, and Partners and enable them to share the best practices they encountered deploying Cloudflare One, whether for production, demos, or proof-of-concepts. </p><p>Sharing this knowledge had to be as easy as a push of a button and in a codified format — otherwise we knew it wouldn’t be done consistently. We decided to call it Project Helix, for the way in which it weaves together expertise and automation.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4VkCzbtFn0VKI5LtEwUPWo/52ab2fd075995a62d09a6dc20909d37f/image4.png" />
          </figure><p>We kicked off the knowledge gathering by asking ourselves what we wanted customers to experience during proofs of concept, and we documented all those outcomes. These included enabling baseline security best-practice protections across DNS, network, and HTTP protocols; enabling TLS inspection and QUIC/HTTP3 security (a Cloudflare-exclusive capability for over 3 years now!); deploying Remote Browser Isolation for risky domain categories (such as newly registered domains); deploying visibility and controls over the AI applications users can access; and elevating the visibility and configuration of the Tenant Control policies that restrict users to accessing only their own instances of SaaS applications such as Office 365, Google Workspace, Dropbox, Box, etc.</p><p>We also noted that a frequent point of friction for our customers was splitting out traffic for popular real-time communication apps such as Zoom to go directly to the Internet. And for customers whose users are often traveling, the team assembled a list of widely used captive portals across airlines, hotels, etc., to help ensure a smoother experience for users accessing resources on those private networks in conjunction with the Cloudflare One client.</p><p>The old way — manual deployment — had significant drawbacks. Deploying all those policies and configurations manually on a brand-new tenant would take several hours. It would also require copious documentation that would need to be manually maintained and updated. And manual configuration and execution of all these steps is subject to human error, raising questions of consistency.</p>
    <div>
      <h2>The technology behind Helix: Terraform and Workers</h2>
      <a href="#the-technology-behind-helix-terraform-and-workers">
        
      </a>
    </div>
    <p>When we learned that our in-house Cloudflare teams had <a href="https://blog.cloudflare.com/shift-left-enterprise-scale"><u>embraced Terraform</u></a> to manage the ever-growing number of accounts used to support Cloudflare internal users, we decided to use a similar approach to solve our own dilemma.</p><p>We architected scalable and flexible Terraform templates that were programmed to deliver all these settings, configuration snippets, and policies. Once we saw how amazing that outcome was, we wanted to make this easier and more user-friendly for the broader user base.</p><p>So the team created a web-based user interface, hosted in Cloudflare Workers and leveraging <a href="https://blog.cloudflare.com/containers-are-available-in-public-beta-for-simple-global-and-programmable/"><u>Cloudflare Containers</u></a>, to take input parameters and execute Terraform templates in an ephemeral fashion. As there’s no persistent storage used for this solution, it eliminates any potential security risk of storing logs or tokens used in the Terraform provisioning process. This allows anyone, from the most seasoned Solution Engineer to someone who is brand new to Cloudflare One, to deploy the full-functioning baseline configuration with the push of a button.

Within a couple of minutes of entering some basic information, the Cloudflare One tenant is fully configured and enabled with advanced security features and optimal settings. Helix also surfaces a comprehensive list of security policies that we recommend the customer enable — with a flip of the switch.</p><p>We start by deploying a set of robust DNS-based security settings, surfacing policies that allow corporate DNS for Zero Trust, while blocking security risks and questionable categories from ever being resolved. So when you log in to the Cloudflare dashboard, you will see the following DNS policies preconfigured:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7L9HjaIQmALkMhvcDL974E/80cfda0ebbb4f853831615f25fe3832f/image2.png" />
          </figure><p>We then layer on robust network policies that protect users and stop malicious traffic across all ports and protocols, which you can view under the Network Policies tab in the dashboard.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2kD8g4UvGApHq8fPhoiCPG/1f299d6170cc48ae312bcfd6b5303fa0/image1.png" />
          </figure><p>And finally, we finish with a broad set of robust HTTP security policies, featuring granular enterprise application tenant controls, securing AI prompts, and isolating risky domains via <a href="https://www.cloudflare.com/sase/products/browser-isolation/"><u>Browser Isolation</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2kViSELKjsezICQo3R5mnj/a29132e225a03d321a2dc5ab4d3caa27/image3.png" />
          </figure><p>All of this is achieved in a matter of minutes, with 100% consistency and immunity to human data-entry errors. All you have to do is turn these policies on or off to suit your particular needs.</p><p>To top it off, the deployment is optimized for maximum interoperability with leading captive portals across airlines and hotels, while also providing an option to easily break out traffic to Zoom to avoid the performance overhead of tunneling.</p><p>But wait — there was one more thing! Cloudflare <a href="https://blog.cloudflare.com/internationalizing-the-cloudflare-dashboard"><u>internationalized its UI</u></a> back in 2020, and we wanted to bring the same language-friendliness to all customers and partners across the globe. So we templatized all the object names, policy names, user interactions, etc., within Terraform, and delivered the ability to deploy these core best practices and policies in any language.</p>
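<p>As a sketch of how such an ephemeral run might be driven (the interface and names below are illustrative assumptions, not Helix’s actual code): non-secret inputs such as the account ID and the chosen locale become Terraform variables, while the API token travels only through the Cloudflare provider’s <code>CLOUDFLARE_API_TOKEN</code> environment variable, so it never appears on a command line or in logs.</p>

```typescript
// Hypothetical sketch of a Helix-style runner assembling an ephemeral
// `terraform apply` invocation from user input. The interface and variable
// names are illustrative assumptions, not the actual Helix code.
interface HelixInput {
  accountId: string;
  locale: string;    // used to select localized policy/object names
  apiToken: string;  // secret: must never appear in argv or in logs
}

function buildTerraformArgs(input: HelixInput): { argv: string[]; env: Record<string, string> } {
  return {
    // Non-secret values travel as -var flags...
    argv: [
      "apply",
      "-auto-approve",
      `-var=account_id=${input.accountId}`,
      `-var=locale=${input.locale}`,
    ],
    // ...while the token is passed via the Cloudflare Terraform provider's
    // environment variable, keeping it out of command lines and output.
    env: { CLOUDFLARE_API_TOKEN: input.apiToken },
  };
}

const run = buildTerraformArgs({ accountId: "abc123", locale: "de", apiToken: "s3cret" });
console.log(run.argv.join(" ").includes("s3cret")); // false — the secret stays out of argv
```

<p>Because the container is ephemeral and nothing is written to persistent storage, both the token and any output produced during provisioning disappear when the run completes.</p>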
    <div>
      <h2>The impact</h2>
      <a href="#the-impact">
        
      </a>
    </div>
    <p>The impact of this initiative has been massive. According to Bob Percciacante, a seasoned Cloudflare One Solutions Engineer, using Helix for one of his proofs of concept saved 2–3 weeks of start-up and prep time to configure and verify all the necessary settings and features. He was able to demonstrate all the essential Cloudflare One features to the customer within 15 minutes of deploying a Helix-based configuration.</p><p>For the customer, it means they can start enjoying the security of Zero Trust from day one.</p><p><b>Ready to go beyond the blank slate and accelerate your own Zero Trust deployment?</b></p><ul><li><p><b>Explore Cloudflare One:</b> Learn more about the Cloudflare One platform and its comprehensive SASE capabilities on our <a href="https://www.cloudflare.com/sase/"><u>Cloudflare One page</u></a>.</p></li><li><p>Contact your Cloudflare account team to experience the best of Cloudflare One deployment at lightning speed!</p></li></ul> ]]></content:encoded>
            <category><![CDATA[Cloudflare One]]></category>
            <category><![CDATA[Automation]]></category>
            <guid isPermaLink="false">789OboluT5DiD55gWkWYQi</guid>
            <dc:creator>Michael Koyfman</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we simplified NCMEC reporting with Cloudflare Workflows]]></title>
            <link>https://blog.cloudflare.com/simplifying-ncmec-reporting-with-cloudflare-workflows/</link>
            <pubDate>Fri, 11 Apr 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ We transitioned to Cloudflare Workflows to manage complex, multi-step processes more efficiently. This shift replaced our National Center for Missing & Exploited Children (NCMEC) reporting system. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare plays a significant role in supporting the Internet’s infrastructure. <a href="https://w3techs.com/technologies/history_overview/proxy/all/q"><u>Used as a reverse proxy by approximately 20% of all websites</u></a>, we sit directly in the request path between users and the origin, helping to improve performance, security, and reliability at scale. Beyond that, our global network powers services like <a href="https://www.cloudflare.com/en-gb/application-services/products/cdn/"><u>content delivery</u></a>, <a href="https://workers.cloudflare.com/"><u>Workers</u></a>, and <a href="https://www.cloudflare.com/en-gb/developer-platform/products/r2/"><u>R2</u></a> — making Cloudflare not just a passive intermediary, but an active platform for delivering and hosting content across the Internet.</p><p>Since Cloudflare’s launch in 2010, we have collaborated with the National Center for Missing and Exploited Children (<a href="https://www.missingkids.org/home"><u>NCMEC</u></a>), a US-based clearinghouse for reporting child sexual abuse material (CSAM), and are committed to doing what we can to support the identification and removal of CSAM content.</p><p>Members of the public, <a href="https://blog.cloudflare.com/cloudflares-response-to-csam-online/"><u>customers, and trusted organizations can submit reports</u></a> of abuse observed on Cloudflare’s network. A minority of these reports relate to CSAM; these are triaged with the highest priority by Cloudflare’s Trust &amp; Safety team. We also forward details of the report, along with relevant files (where applicable) and supplemental information, to NCMEC.</p><p>The process to generate and submit reports to NCMEC involves multiple steps, dependencies, and error handling, which quickly became complex under our original queue-based architecture.
In this blog post, we discuss how Cloudflare <a href="https://developers.cloudflare.com/workflows/"><u>Workflows</u></a> helped streamline this process and simplify the code behind it.</p>
    <div>
      <h2>Life before Cloudflare Workflows</h2>
      <a href="#life-before-cloudflare-workflows">
        
      </a>
    </div>
    <p>When we designed our latest NCMEC reporting system in early 2024, <a href="https://blog.cloudflare.com/building-workflows-durable-execution-on-workers/"><u>Cloudflare Workflows</u></a> did not exist yet. We used <a href="https://developers.cloudflare.com/queues/"><b><u>Queues</u></b></a>, part of the Workers platform, as our solution for managing asynchronous tasks, and structured our system around them.</p><p>Our goal was to ensure reliability, fault tolerance, and automatic retries. However, without an orchestrator, we had to manually handle state, retries, and inter-queue messaging. While Queues worked, we needed something more explicit to help debug and observe the more complex asynchronous workflows we were building on top of the messaging system that Queues gave us.</p><p>In our queue-based architecture, each report would go through multiple steps:</p><ol><li><p><b>Validate input</b>: Ensure the report has all necessary details.</p></li><li><p><b>Initiate report</b>: Call the NCMEC API to create a report.</p></li><li><p><b>Fetch impounded files (if applicable)</b>: Retrieve files stored in R2.</p></li><li><p><b>Upload files</b>: Send files to NCMEC via API.</p></li><li><p><b>Finalize report</b>: Mark the report as completed.</p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7n99a6YkThlegGitE2i7iv/a53e70ac11e21025d436c27dce7aaf3a/image2.png" />
          </figure><p><sup><i>A diagram of our queue-based architecture </i></sup></p><p>Each of these steps was handled by a separate queue, and if an error occurred, the system would retry the message several times before marking the report as failed. But errors weren’t always straightforward — for instance, if an external API call consistently failed due to bad input or returned an unexpected response shape, retries wouldn’t help. In those cases, the report could get stuck in an intermediate state, and we’d often have to manually dig through logs across different queues to figure out what went wrong.</p><p>Even more frustrating, when handling failed reports, we relied on a "Reaper" — a cron job that ran every hour to resubmit failed reports. Since a report could fail at any step, the Reaper had to deduce which queue failed and send a message to begin reprocessing. This meant:</p><ul><li><p><b>Debugging was a nightmare</b>: Tracing the journey of a single report meant jumping between logs for multiple queues.</p></li><li><p><b>Retries were unreliable</b>: Some queues had retry logic, while others relied on the Reaper, leading to inconsistencies.</p></li><li><p><b>State management was painful</b>: We had no clear way to track whether a report was halfway through the pipeline or completely lost, except by looking through the logs.</p></li><li><p><b>Operational overhead was high</b>: Developers frequently had to manually inspect failed reports and resubmit them.</p></li></ul><p>Queues gave us a solid foundation for moving messages around, but it wasn’t meant to handle orchestration. What we’d really done was build a bunch of loosely connected steps on top of a message bus and hoped it would all hold together. It worked, for the most part, but it was clunky, hard to reason about, and easy to break. 
Just understanding how a single report moved through the system meant tracing messages across multiple queues and digging through logs.</p><p>We knew we needed something better: a way to define workflows explicitly, with clear visibility into where things were and what had failed. But back then, we didn’t have a good way to do that without bringing in heavyweight tools or writing a bunch of glue code ourselves. When Cloudflare Workflows came along, it felt like the missing piece, finally giving us a simple, reliable way to orchestrate everything without duct tape.</p>
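<p>To make the Reaper’s job concrete, here is a minimal sketch (the step and queue names are invented for illustration) of the deduction it had to perform on every run: map a failed report’s last completed step to the queue where reprocessing should resume.</p>

```typescript
// Illustrative sketch of the Reaper's resubmission logic. Step/queue names
// are invented; each pipeline stage was backed by its own queue.
const pipeline = ["validate", "initiate", "fetch-files", "upload-files", "finalize"] as const;
type Step = (typeof pipeline)[number];

// Given the last step a failed report completed, return the queue to
// resubmit to (null means the report already finished the pipeline).
function queueToResume(lastCompleted: Step | null): Step | null {
  if (lastCompleted === null) return pipeline[0]; // never started: from the top
  const idx = pipeline.indexOf(lastCompleted);
  return idx + 1 < pipeline.length ? pipeline[idx + 1] : null;
}

console.log(queueToResume("initiate")); // "fetch-files"
console.log(queueToResume("finalize")); // null — nothing left to do
```

<p>The fragility is visible even in this toy version: the logic only works if the "last completed step" is recorded accurately somewhere, which is precisely the state-tracking burden an orchestrator removes.</p>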
    <div>
      <h2>The solution: Cloudflare Workflows</h2>
      <a href="#the-solution-cloudflare-workflows">
        
      </a>
    </div>
    <p>Once <a href="https://developers.cloudflare.com/workflows/"><u>Cloudflare Workflows</u></a> was <a href="https://blog.cloudflare.com/building-workflows-durable-execution-on-workers/"><u>announced</u></a>, we saw an immediate opportunity to replace our queue-based architecture with a more structured, observable, and retryable system. Instead of relying on a web of multiple queues passing messages to each other, we now have a single workflow that orchestrates the entire process from start to finish. Critically, if any step failed, the Workflow could pick back up from where it left off, without having to repeat earlier processing steps, re-parsing files, or duplicating uploads.</p><p>With Cloudflare Workflows, each report follows a clear sequence of steps:</p><ol><li><p><b>Creating the report</b>: The system validates the incoming report and initiates it with NCMEC.</p></li><li><p><b>Checking for impounded files</b>: If there are impounded files associated with the report, the workflow proceeds to file collection.</p></li><li><p><b>Gathering files</b>: The system retrieves impounded files stored in R2 and prepares them for upload.</p></li><li><p><b>Uploading files to NCMEC</b>: Each file is uploaded to NCMEC using their API, ensuring all relevant evidence is submitted.</p></li><li><p><b>Adding file metadata</b>: Metadata about the uploaded files (hashes, timestamps, etc.) is attached to the report.</p></li><li><p><b>Finalizing the report</b>: Once all files are processed, the report is finalized and marked as complete.</p></li></ol><p>Here’s a simplified version of the orchestrator:</p>
            <pre><code>import { WorkflowEntrypoint, WorkflowEvent, WorkflowStep } from 'cloudflare:workers';

export class ReportWorkflow extends WorkflowEntrypoint&lt;Env, ReportType&gt; {
  async run(event: WorkflowEvent&lt;ReportType&gt;, step: WorkflowStep) {
    const reportToCreate: ReportType = event.payload;

    // Return the ID from the step so it is persisted with the step's result
    // and survives retries and replays, rather than relying on a local
    // variable captured outside the step.
    const reportId: number = await step.do('Create Report', async () =&gt; {
      const createdReport = await createReportStep(reportToCreate, this.env);
      if (!createdReport?.id) throw new Error('Report ID is undefined.');
      return createdReport.id;
    });

    if (reportToCreate.hasImpoundedFiles) {
      await step.do('Gather Files', async () =&gt; {
        await gatherFilesStep(reportId, this.env);
      });

      await step.do('Upload Files', async () =&gt; {
        await uploadFilesStep(reportId, this.env);
      });

      await step.do('Add File Metadata', async () =&gt; {
        await addFilesInfoStep(reportId, this.env);
      });
    }

    await step.do('Finalize Report', async () =&gt; {
      await finalizeReportStep(reportId, this.env);
    });
  }
}</code></pre>
            <p>Not only can tasks be broken into discrete steps, but the Workflows dashboard gives us real-time visibility into each report processed and the status of each step in the workflow!</p><p>This allows us to easily see active and completed workflows, identify which steps failed and where, and retry failed steps or terminate workflows. These features have transformed how we troubleshoot, giving us a tool to dig into any problems that arise and retry steps with the click of a button.</p><p>Below are two dashboard screenshots: the first shows our running workflows, and the second inspects the successes and failures of each step in the workflow. Some workflows look slower or “stuck” — that’s because failed steps are retried with exponential backoff. This helps smooth over transient issues like flaky APIs without manual intervention.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2DjVg3WMp8e5QGy19TuHMj/69e611c9267598c44e5a2b120f0f59ac/image4.png" />
          </figure><p><sup><i>Cloudflare Workflows Dashboard for our NCMEC Workflow</i></sup></p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ElqnGMtnJQumNhuWZI3nb/6866cc9aa2b27856a8730a9faebc1747/image3.png" />
          </figure><p><sup><i>Cloudflare Workflows Dashboard containing a breakout of the NCMEC Workflow Steps</i></sup></p><p>Cloudflare Workflows transformed how we handle NCMEC incident reports. What was once a complex, queue-based architecture is now a structured, retryable, and observable process. Debugging is easier, error handling is more robust, and monitoring is seamless. </p>
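<p>The exponential backoff behavior mentioned above is easy to picture with a small sketch. The base delay, cap, and growth factor here are illustrative values, not Workflows defaults; in Workflows the retry limit, delay, and backoff strategy can be configured per step.</p>

```typescript
// Sketch of exponential backoff: the delay before retry attempt n grows as
// base * 2^n, capped at a maximum. Values are illustrative, not Workflows
// defaults.
function backoffDelays(baseSeconds: number, maxSeconds: number, attempts: number): number[] {
  return Array.from({ length: attempts }, (_, n) =>
    Math.min(baseSeconds * 2 ** n, maxSeconds),
  );
}

console.log(backoffDelays(10, 300, 6)); // [ 10, 20, 40, 80, 160, 300 ]
```

<p>Growing delays like these give a flaky upstream API time to recover, which is why a retrying workflow can look momentarily “stuck” in the dashboard while actually making progress.</p>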
    <div>
      <h3>Deploy your own Workflows</h3>
      <a href="#deploy-your-own-workflows">
        
      </a>
    </div>
    <p>If you’re also building larger, multi-step applications, or have an existing Workers application that has started to approach what we ended up with for our incident reporting process, then you can typically wrap that code within a Workflow with minimal changes. <a href="https://developers.cloudflare.com/workflows/examples/backup-d1/"><u>Workflows can read from R2, write to KV, query D1</u></a> and call other APIs just like any other Worker, but are designed to help orchestrate asynchronous, long-running tasks.</p><p>To get started with Workflows, you can head to the <a href="https://developers.cloudflare.com/workflows/"><u>Workflows developer documentation</u></a> and/or pull down the starter project and dive into the code immediately:</p>
            <pre><code>$ npm create cloudflare@latest workflows-starter -- --template="cloudflare/workflows-starter"</code></pre>
            <p><i>Learn more about </i><a href="https://developers.cloudflare.com/workers/workflows"><i><u>Cloudflare Workflows</u></i></a><i>, and about using </i><a href="https://developers.cloudflare.com/cache/reference/csam-scanning/"><i><u>the Cloudflare CSAM Scanning Tool</u></i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Workflows]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[CSAM Reporting]]></category>
            <category><![CDATA[Automation]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">32j7ZR5lpPUtSjC9lwtY0t</guid>
            <dc:creator>Mahmoud Salem</dc:creator>
            <dc:creator>Rachael Truong</dc:creator>
        </item>
        <item>
            <title><![CDATA[Autonomous hardware diagnostics and recovery at scale]]></title>
            <link>https://blog.cloudflare.com/autonomous-hardware-diagnostics-and-recovery-at-scale/</link>
            <pubDate>Mon, 25 Mar 2024 13:00:33 GMT</pubDate>
            <description><![CDATA[ Operating hardware in 310 cities in 120 countries means that hardware can break anywhere and anytime. Detecting and managing server failure at scale requires automation. Here's how we automated ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare’s global network spans more than 310 cities in over 120 countries. That means thousands of servers geographically spread across different data centers, running services that protect and accelerate our customers’ Internet applications. Operating hardware at such a scale means that hardware can break anywhere and at any time. In such cases, our systems are engineered such that these failures cause little to no impact. However, detecting and managing server failure at scale requires automation. This blog aims to provide insights into the difficulties involved in handling broken servers and how we were able to simplify the process through automation.</p>
    <div>
      <h2>Challenges dealing with broken servers</h2>
      <a href="#challenges-dealing-with-broken-servers">
        
      </a>
    </div>
    <p>When a server is found to have faulty hardware and needs to be removed from production, it is considered broken and its state is set to Repair in the internal database where server status is tracked. In the past, our Data Center Operations team was essentially left to troubleshoot and diagnose broken servers on their own. They had to go through laborious tasks like performing queries to locate servers in need of repair, conducting diagnostics, reviewing results, evaluating whether a server could be restored to production, and creating the necessary tickets for re-enabling servers and executing operations to put them back in production. Such effort can take hours for a single server alone, and can easily consume an engineer’s entire day.</p><p>As you can see, addressing server repairs was a labor-intensive process performed manually. Additionally, many of these servers remained powered on within the racks, wasting energy. With our fleet expanding rapidly, the attention of Data Center Operations is primarily devoted to supporting this growth, leaving less time to handle servers in need of repair.</p><p>It was clear that our infrastructure was growing too fast for us to keep up with repairs and recovery, so we had to find a better way to handle these sorts of inefficiencies in our operations. This would allow our engineers to focus on the growth of our footprint while not abandoning repair and recovery – after all, these are still huge CapEx investments and wasted capacity that otherwise would have been fully utilized.</p>
    <div>
      <h2>Using automation as an autonomous system</h2>
      <a href="#using-automation-as-an-autonomous-system">
        
      </a>
    </div>
    <p>As members of the Infrastructure Software Systems and Automation team at Cloudflare, we primarily work on building tools and automation that help reduce excess work in order to ease the pressure on our operations teams, increase productivity, and enable people to execute operations with the highest efficiency.</p><p>Our team continuously strives to challenge our existing processes and systems, finding ways we can evolve them and make significant improvements – one of which is to build not just a typical automated system but an <b>autonomous</b> one. Building autonomous automations means creating systems that can operate independently, without the need for constant human intervention or oversight – a perfect example of this is <b>Phoenix</b>.</p>
    <div>
      <h2>Introducing Phoenix</h2>
      <a href="#introducing-phoenix">
        
      </a>
    </div>
    <p>Phoenix is an autonomous diagnostics and recovery automation that runs at regular intervals to discover Cloudflare data centers with broken servers, perform diagnostics on detection, recover those that pass diagnostics by re-provisioning them, and ultimately re-enable those that have been successfully re-provisioned in the safest and most unobtrusive way possible – <b>all without requiring any human intervention!</b> Should a server fail at any point in the process, Phoenix will take care of updating relevant tickets, even pinpointing the cause of the failure, and reverting the state of the server accordingly when needed – again, all without any human intervention!</p><p>The image below illustrates the whole process:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1qVvRJpQWUcF6rAMlVLbmO/df9ced60e39057106e8a17f06d682990/image1-34.png" />
            
            </figure><p>To better understand exactly how Phoenix works, let’s dive into some details about its core functionality.</p>
    <div>
      <h3>Discovery</h3>
      <a href="#discovery">
        
      </a>
    </div>
    <p>Discovery runs at a regular interval of 30 minutes, selecting a maximum of two Cloudflare data centers that have servers in a broken or Repair state (both the interval and the maximum are configurable depending on business and operational needs), and immediately executes diagnostics against them. At this rate, Phoenix is able to discover and operate on all broken servers in the fleet in about three days. On each run, it also detects data centers that already have broken servers queued for recovery, and ensures that the Recovery phase is executed immediately.</p>
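    <p>The discovery pass described above can be sketched roughly as follows. This is a minimal illustration in Python, not Phoenix’s actual code: the inventory shape and names like <code>discovery_pass</code> are assumptions; only the cap of two data centers per run comes from the post.</p>

```python
# Illustrative sketch of a discovery pass: pick at most two data centers
# that contain servers in the Repair state. The inventory structure and
# function names are hypothetical; the per-run cap matches the post.

MAX_DATACENTERS_PER_RUN = 2  # configurable, per business/operational needs

def discovery_pass(inventory):
    """Return up to two data centers that have servers needing repair."""
    candidates = [dc for dc, servers in inventory.items()
                  if any(s["state"] == "Repair" for s in servers)]
    # Operate on at most two data centers per 30-minute run.
    return candidates[:MAX_DATACENTERS_PER_RUN]

inventory = {
    "ams01": [{"id": 1, "state": "Repair"}],
    "sin02": [{"id": 2, "state": "Production"}],
    "sfo03": [{"id": 3, "state": "Repair"}],
    "lhr04": [{"id": 4, "state": "Repair"}],
}
selected = discovery_pass(inventory)
print(selected)  # ['ams01', 'sfo03']
```

    <p>Running such a pass every 30 minutes against a bounded number of locations is what lets the automation sweep the entire fleet in a few days without overwhelming any single data center.</p>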
    <div>
      <h3>Diagnostics</h3>
      <a href="#diagnostics">
        
      </a>
    </div>
    <p>Diagnostics takes care of running various tests across the broken servers of a selected data center in a single run, verifying the viability of their hardware components and identifying candidates for recovery.</p><p>A diagnostic operation includes running the following:</p><ul><li><p><b>Out-of-Band connectivity check</b>: This check determines the reachability of a device via the out-of-band network. We employ IPMI (Intelligent Platform Management Interface) to ensure proper physical connectivity and accessibility of devices, which allows for effective monitoring and management of hardware components. Only devices that pass this check can progress to the Node Acceptance Testing phase.</p></li><li><p><b>Node Acceptance Tests</b>: We leverage an existing internally built tool called <a href="/redefining-fleet-management-at-cloudflare/">INAT</a> (Integrated Node Acceptance Testing) that runs various test suites and cases (Hardware Validation, Performance, etc.).</p><p>For every server that needs to be diagnosed, Phoenix sends the relevant system instructions to boot it into a custom Linux boot image, internally called INAT-image. Built into this image are the various tests that run when the server boots up, publishing the results to an internal resource in both human-readable (HTML) and machine-readable (JSON) formats; the latter is consumed and interpreted by Phoenix. Upon completion of the boot diagnostics, the server is powered off again to ensure it is not wasting energy.</p></li></ul><p>Our node acceptance tests encompass a range of evaluations, including but not limited to benchmark testing, CPU/memory/storage checks, drive wiping, and various other assessments. <i>Look out for an upcoming in-depth blog post covering INAT.</i></p><p>A summarized diagnostics result, pinpointing the exact cause of any failure, is immediately added to the tracking ticket.</p>
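    <p>To make the flow concrete, here is a hedged sketch of how machine-readable diagnostics output might be interpreted to select recovery candidates. The JSON shape and field names below are assumptions for illustration, not INAT’s actual output format.</p>

```python
import json

# Hypothetical consumer of machine-readable (JSON) diagnostics results:
# a server qualifies for recovery only if every test passed; otherwise
# the failing tests are recorded so the cause can be pinpointed.

def recovery_candidates(results_json):
    results = json.loads(results_json)
    passed, failed = [], []
    for server in results["servers"]:
        failures = [t["name"] for t in server["tests"] if not t["passed"]]
        if failures:
            failed.append((server["id"], failures))  # cause pinpointed
        else:
            passed.append(server["id"])
    return passed, failed

raw = json.dumps({"servers": [
    {"id": "srv-a", "tests": [{"name": "cpu", "passed": True},
                              {"name": "memory", "passed": True}]},
    {"id": "srv-b", "tests": [{"name": "cpu", "passed": True},
                              {"name": "storage", "passed": False}]},
]})
passed, failed = recovery_candidates(raw)
print(passed)   # ['srv-a']
print(failed)   # [('srv-b', ['storage'])]
```

    <p>Keeping the failure details alongside the pass list is what makes it possible to annotate tickets automatically with the exact cause of a failure.</p>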
    <div>
      <h3>Recovery</h3>
      <a href="#recovery">
        
      </a>
    </div>
    <p>Recovery executes what we call an expansion operation. Its first phase provisions the servers that pass diagnostics; the second phase re-enables the successfully provisioned servers back into production, where only those re-enabled successfully start receiving production traffic again.</p><p>Once diagnostics pass and the broken servers move into the first phase of recovery, we change their statuses from Repair to Pending Provision. If a server doesn't fully recover, for example because of server configuration errors or issues enabling services, Phoenix assesses the situation and returns it to the Repair state for additional evaluation. If the diagnostics indicate that a server needs faulty components replaced, Phoenix notifies our Data Center Operations team for manual repairs, ensuring that the server is not repeatedly selected until the required part replacement is completed. This way, any necessary human intervention can be applied promptly, making the server ready for Phoenix to rediscover on its next iteration.</p><p>An autonomous recovery operation requires infusing intelligence into the automated system so that we can fully trust it to execute an expansion operation in the safest way possible and handle situations on its own, without any human intervention. To do this, we’ve made Phoenix automation-aware: it knows when other automations are executing operations such as expansions, and will only execute an expansion when there are no ongoing provisioning operations in the target data center. Executing only when it is safe to do so ensures that the recovery operation will not interfere with any other ongoing operations in the data center. We’ve also tuned its tolerance for faulty hardware: Phoenix deals gracefully with misbehaving servers by quickly dropping them from the recovery candidate list, so a single bad server cannot block the operation.</p>
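    <p>The two safety behaviors above — deferring when another provisioning operation is in flight, and dropping misbehaving servers instead of blocking — can be sketched as follows. All names here are illustrative assumptions, not Phoenix’s real interfaces; only the Repair and Pending Provision states come from the post.</p>

```python
# Hypothetical sketch of recovery planning with the safety checks the
# post describes: automation-awareness (skip data centers with ongoing
# provisioning) and tolerance for faulty hardware (drop misbehaving
# servers from the candidate list rather than blocking the run).

REPAIR, PENDING_PROVISION = "Repair", "Pending Provision"

def plan_recovery(datacenter, candidates, active_provisioning_ops):
    # Automation-awareness: defer if another operation is in flight here.
    if datacenter in active_provisioning_ops:
        return []
    planned = []
    for server in candidates:
        if server.get("misbehaving"):
            # Drop the faulty server; do not block the whole operation.
            continue
        server["state"] = PENDING_PROVISION  # Repair -> Pending Provision
        planned.append(server["id"])
    return planned

candidates = [
    {"id": "srv-1", "state": REPAIR, "misbehaving": False},
    {"id": "srv-2", "state": REPAIR, "misbehaving": True},
]
planned = plan_recovery("ams01", candidates, active_provisioning_ops={"sin02"})
print(planned)  # ['srv-1']
```

    <p>With this shape, a data center that is mid-expansion simply yields an empty plan, and a flapping server costs nothing more than its own slot in the run.</p>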
    <div>
      <h3>Visibility</h3>
      <a href="#visibility">
        
      </a>
    </div>
    <p>While our autonomous system, Phoenix, seamlessly handles operations without human intervention, that doesn't mean we sacrifice visibility. Transparency is a key feature of Phoenix: it meticulously logs every operation, from executing tasks to providing progress updates, and shares this information in communication channels such as chat rooms and Jira tickets. This ensures a clear understanding of what Phoenix is doing at all times.</p><p>Tracking the actions taken by the automation, along with each server's state transitions, keeps us in the loop about what was done and when, giving us valuable insights that help us improve not only the system but our processes as well. This operational data also lets us generate dashboards that allow various teams to monitor automation activities, measure their success, guide business decisions, and answer common operational questions related to repair and recovery.</p>
    <div>
      <h2>Balancing automation and empathy: Error Budgets</h2>
      <a href="#balancing-automation-and-empathy-error-budgets">
        
      </a>
    </div>
    <p>When we launched Phoenix, we were well aware that not every broken server can be re-enabled and successfully returned to production. More importantly, there is no guarantee that a recovered server will be as stable as one with no repair history; there is a risk that these servers could fail and end up back in Repair status again.</p><p>While we cannot guarantee that recovered servers won't fail again, creating additional work for SREs when monitoring alerts are triggered, what we can guarantee is that Phoenix immediately stops recoveries, without any human intervention, once a certain number of server failures is reached in a given time window. This is where we applied the concept of an Error Budget.</p><p>The Error Budget is the amount of error that automation can accumulate over a certain period of time before our SREs start being unhappy due to excessive server failures or unreliability of the system. It is empathy embedded in automation.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6xHmNRRtlbEbe3um2Cof6C/c8d88078be39d761074d85272e16b3b7/image2-32.png" />
            
            </figure><p>In the figure above, the y-axis represents the error budget; in this context, it is the number of recovered servers that failed and were moved back to the Repair state. The x-axis represents the time unit allocated to the error budget, in this case 24 hours. To keep Phoenix strict enough in mitigating possible issues, we divide the time unit into three consecutive buckets of equal duration, representing the three “follow the sun” SRE shifts in a day. With this, Phoenix can only execute recoveries while the number of server failures in the current bucket is no more than 2, and any excess failures in a bucket are deducted from the budget of the buckets that follow.</p><p>Phoenix immediately stops recoveries if it exhausts its error budget prematurely, meaning before the end of the time unit for which the budget was granted. Regardless of how quickly the budget is depleted within a time unit, it is fully replenished at the beginning of the next one, so the budget resets every day.</p><p>The Error Budget has helped us define and manage our tolerance for hardware failures without causing significant harm to the system or too much noise for SREs, and it has given us opportunities to improve our diagnostics system. It provides a common incentive that allows both the Infrastructure Engineering and SRE teams to focus on finding the right balance between innovation and reliability.</p>
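    <p>A toy model of this accounting, under the numbers stated above (a 24-hour time unit split into three shift-sized buckets, each tolerating up to 2 failures, with excess deducted from the next bucket), might look like the following. The class itself is an illustrative sketch, not Phoenix’s implementation.</p>

```python
# Toy model of the Error Budget: three "follow the sun" buckets per day,
# each with a budget of 2 recovered-server failures; excess failures in
# one bucket are deducted from the next, and the whole budget resets at
# the start of each day. Numbers come from the post; the code is a sketch.

BUCKETS = 3
BUDGET_PER_BUCKET = 2

class ErrorBudget:
    def __init__(self):
        self.reset()

    def reset(self):
        """Fully replenish the budget at the start of each time unit."""
        self.remaining = [BUDGET_PER_BUCKET] * BUCKETS

    def record_failure(self, bucket):
        self.remaining[bucket] -= 1
        # Compensate succeeding buckets for any excess failures.
        if self.remaining[bucket] < 0 and bucket + 1 < BUCKETS:
            self.remaining[bucket + 1] += self.remaining[bucket]
            self.remaining[bucket] = 0

    def can_recover(self, bucket):
        """Recoveries stop once this bucket's budget is exhausted."""
        return self.remaining[bucket] > 0

budget = ErrorBudget()
for _ in range(3):            # three failures during the first shift
    budget.record_failure(0)
print(budget.can_recover(0))  # False: bucket 0 is exhausted
print(budget.remaining)       # [0, 1, 2]: the excess cost the next shift
```

    <p>The deduction step is what makes a bad first shift shrink the headroom of the shifts that follow, while the daily reset keeps one rough day from penalizing the automation forever.</p>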
    <div>
      <h2>Where we go from here</h2>
      <a href="#where-we-go-from-here">
        
      </a>
    </div>
    <p>With Phoenix, we’ve not only seen the significant and far-reaching potential of an autonomous automated system in our infrastructure, we’re actually reaping its benefits. It is a win-win: hardware is successfully recovered, and broken devices are powered off rather than consuming unnecessary power while sitting idle in our racks, which reduces energy waste and contributes to both sustainability efforts and cost savings. Automated processes that operate independently have freed our colleagues on various Infrastructure teams from mundane and repetitive tasks, allowing them to apply their skill sets to more interesting and productive work. They have also led us to evolve our old processes for handling hardware failures and repairs, making us more efficient than ever.</p><p>Autonomous automation is a reality that is now beginning to shape the future of how we build better and smarter systems here at Cloudflare, and we will continue to invest engineering time in these initiatives.</p><p><i>A huge thank you to Elvin Tan for his awesome work on INAT, and to Graeme, Darrel and David for INAT’s continuous improvements.</i></p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Automation]]></category>
            <guid isPermaLink="false">3wcKMm06trYxEYxIg3wWd</guid>
            <dc:creator>Jet Mariscal</dc:creator>
            <dc:creator>Aakash Shah</dc:creator>
            <dc:creator>Yilin Xiong</dc:creator>
        </item>
    </channel>
</rss>