
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Tue, 14 Apr 2026 21:55:04 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Introducing EmDash — the spiritual successor to WordPress that solves plugin security]]></title>
            <link>https://blog.cloudflare.com/emdash-wordpress/</link>
            <pubDate>Wed, 01 Apr 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Today we are launching the beta of EmDash, a full-stack serverless JavaScript CMS built on Astro 6.0. It combines the features of a traditional CMS with modern security, running plugins in sandboxed Worker isolates. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>The cost of building software has drastically decreased. We recently <a href="https://blog.cloudflare.com/vinext/"><u>rebuilt Next.js in one week</u></a> using AI coding agents. But for the past two months our agents have been working on an even more ambitious project: rebuilding the WordPress open source project from the ground up.</p><p>WordPress powers <a href="https://w3techs.com/technologies/details/cm-wordpress"><u>over 40% of the Internet</u></a>. It is a massive success that has enabled anyone to be a publisher, and created a global community of WordPress developers. But the WordPress open source project will be 24 years old this year. Hosting a website has changed dramatically during that time. When WordPress was born, AWS EC2 didn’t exist. In the intervening years, that task has gone from renting virtual private servers, to uploading a JavaScript bundle to a globally distributed network at virtually no cost. It’s time to upgrade the most popular CMS on the Internet to take advantage of this change.</p><p>Our name for this new CMS is EmDash. We think of it as the spiritual successor to WordPress. It’s written entirely in TypeScript. It is serverless, but you can run it on your own hardware or any platform you choose. Plugins are securely sandboxed and can run in their own <a href="https://developers.cloudflare.com/workers/reference/how-workers-works/"><u>isolate</u></a>, via <a href="https://developers.cloudflare.com/workers/runtime-apis/bindings/worker-loader/"><u>Dynamic Workers</u></a>, solving the fundamental security problem with the WordPress plugin architecture. And under the hood, EmDash is powered by <a href="https://astro.build/"><u>Astro</u></a>, the fastest web framework for content-driven websites.</p><p>EmDash is fully open source, MIT licensed, and <a href="https://github.com/emdash-cms/emdash"><u>available on GitHub</u></a>. 
While EmDash aims to be compatible with WordPress functionality, no WordPress code was used to create EmDash. That allows us to license the open source project under the more permissive MIT license. We hope that allows more developers to adapt, extend, and participate in EmDash’s development.</p><p>You can deploy the EmDash v0.1.0 preview to your own Cloudflare account, or to any Node.js server today as part of our early developer beta:</p><a href="https://deploy.workers.cloudflare.com/?url=https://github.com/emdash-cms/templates/tree/main/blog-cloudflare"><img src="https://deploy.workers.cloudflare.com/button" /></a>
<p></p><p>Or you can try out the admin interface here in the <a href="https://emdashcms.com/"><u>EmDash Playground</u></a>:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/50n8mewREzoxOFq2jDzpT9/6a38dbfbaeec2d21040137e574a935ad/CleanShot_2026-04-01_at_07.45.29_2x.png" />
          </figure>
    <div>
      <h3>What WordPress has accomplished</h3>
      <a href="#what-wordpress-has-accomplished">
        
      </a>
    </div>
    <p>The story of WordPress is a triumph of open source that enabled publishing at a scale never before seen. Few projects have had the same recognizable impact on the generation raised on the Internet. The contributors to WordPress’s core, and its many thousands of plugin and theme developers, have built a platform that democratized publishing for millions, transforming many lives and livelihoods through this ubiquitous software.</p><p>There will always be a place for WordPress, but there is also a lot more space for the world of content publishing to grow. A decade ago, people picking up a keyboard universally learned to publish their blogs with WordPress. Today it’s just as likely that person picks up Astro, or another TypeScript framework, to learn and build with. The ecosystem needs an option that empowers a wide audience, in the same way it needed WordPress 23 years ago.</p><p>EmDash is committed to building on what WordPress created: an open source publishing stack that anyone can install and use at little cost, while fixing the core problems that WordPress cannot solve.</p>
    <div>
      <h3>Solving the WordPress plugin security crisis</h3>
      <a href="#solving-the-wordpress-plugin-security-crisis">
        
      </a>
    </div>
    <p>WordPress’s plugin architecture is fundamentally insecure. <a href="https://patchstack.com/whitepaper/state-of-wordpress-security-in-2025/"><u>96% of security issues</u></a> for WordPress sites originate in plugins. In 2025, more high-severity vulnerabilities <a href="https://patchstack.com/whitepaper/state-of-wordpress-security-in-2026/"><u>were found in the WordPress ecosystem</u></a> than in the previous two years combined.</p><p>Why, after more than two decades, is WordPress plugin security so problematic?</p><p>A WordPress plugin is a PHP script that hooks directly into WordPress to add or modify functionality. There is no isolation: a WordPress plugin has direct access to the WordPress site’s database and filesystem. When you install a WordPress plugin, you are trusting it with access to nearly everything, and trusting it to handle every malicious input or edge case perfectly.</p><p>EmDash solves this. In EmDash, each plugin runs in its own isolated sandbox: a <a href="https://developers.cloudflare.com/dynamic-workers/"><u>Dynamic Worker</u></a>. Rather than giving direct access to underlying data, EmDash provides the plugin with <a href="https://blog.cloudflare.com/workers-environment-live-object-bindings/"><u>capabilities via bindings</u></a>, based on what the plugin explicitly declares that it needs in its manifest. This security model has a strict guarantee: an EmDash plugin can only perform the actions explicitly declared in its manifest. Before installing a plugin, you can know and trust exactly what you are granting it permission to do, similar to going through an OAuth flow and granting a third-party app a specific set of scoped permissions.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4JDq2oEgwONHL8uUJsrof2/fb2ae5fcacd5371aaab575c35ca2ce2e/image8.png" />
          </figure><p>For example, a plugin that sends an email after a content item gets saved looks like this:</p>
            <pre><code>import { definePlugin } from "emdash";

export default () =&gt;
  definePlugin({
    id: "notify-on-publish",
    version: "1.0.0",
    capabilities: ["read:content", "email:send"],
    hooks: {
      "content:afterSave": async (event, ctx) =&gt; {
        if (event.collection !== "posts" || event.content.status !== "published") return;

        await ctx.email!.send({
          to: "editors@example.com",
          subject: `New post published: ${event.content.title}`,
          text: `"${event.content.title}" is now live.`,
        });

        ctx.log.info(`Notified editors about ${event.content.id}`);
      },
    },
  });</code></pre>
            <p>This plugin explicitly requests two capabilities: <code>read:content</code>, which lets it read content from its <code>content:afterSave</code> lifecycle hook, and <code>email:send</code>, which grants access to the <code>ctx.email</code> function. It is impossible for the plugin to do anything other than use these capabilities. It has no external network access. If it does need network access, it can specify the exact hostname it needs to talk to, as part of its definition, and be granted only the ability to communicate with that particular hostname.</p><p>Because the plugin’s needs are declared statically and upfront, it is always clear at install time exactly what the plugin is asking permission to do. A platform or administrator could define rules for which plugins certain groups of users are allowed to install, based on the permissions those plugins request, rather than maintaining an allowlist of approved or safe plugins.</p>
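<p>For illustration, this capability model amounts to an install-time check against the plugin’s manifest. The sketch below is a hypothetical rendering of that check: the <code>allowedHosts</code> field and the <code>network:fetch</code> capability name are assumptions for this example, not the actual EmDash API.</p>

```typescript
// Hypothetical shape of a plugin manifest that requests scoped network access.
// Field and capability names are illustrative, not EmDash's real schema.
interface PluginManifest {
  id: string;
  version: string;
  capabilities: string[];
  // Only these hostnames would be reachable from the sandboxed Worker.
  allowedHosts?: string[];
}

const manifest: PluginManifest = {
  id: "webhook-on-publish",
  version: "1.0.0",
  capabilities: ["read:content", "network:fetch"],
  allowedHosts: ["hooks.example.com"],
};

// An install-time gate can answer "what is this plugin allowed to do?"
// before any plugin code runs -- the core of capability-based security.
function isAllowed(m: PluginManifest, capability: string, host?: string): boolean {
  if (!m.capabilities.includes(capability)) return false;
  if (host !== undefined && !(m.allowedHosts ?? []).includes(host)) return false;
  return true;
}

console.log(isAllowed(manifest, "network:fetch", "hooks.example.com")); // true
console.log(isAllowed(manifest, "network:fetch", "evil.example.net")); // false
console.log(isAllowed(manifest, "email:send")); // false
```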
    <div>
      <h3>Solving plugin security means solving marketplace lock-in</h3>
      <a href="#solving-plugin-security-means-solving-marketplace-lock-in">
        
      </a>
    </div>
    <p>WordPress plugin security is such a real risk that WordPress.org <a href="https://developer.wordpress.org/plugins/wordpress-org/plugin-developer-faq/#where-do-i-submit-my-plugin"><u>manually reviews and approves each plugin</u></a> in its marketplace. At the time of writing, that review queue is over 800 plugins long, and takes at least two weeks to traverse. The vulnerability surface area of WordPress plugins is so wide that in practice, all parties rely on marketplace reputation, ratings, and reviews. And because WordPress plugins run in the same execution context as WordPress itself and are so deeply intertwined with WordPress code, some argue they must carry forward WordPress’s GPL license.</p><p>These realities combine to create a chilling effect on developers building plugins, and on platforms hosting WordPress sites.</p><p>Plugin security is the root of this problem. Marketplace businesses provide trust when parties otherwise cannot easily trust each other. In the case of the WordPress marketplace, the plugin security risk is so large and so probable that many of your customers can only reasonably trust your plugin via the marketplace. But in order to be part of the marketplace, your code must be licensed in a way that forces you to give it away for free everywhere other than that marketplace. You are locked in.</p><p>EmDash plugins have two important properties that mitigate this marketplace lock-in:</p><ol><li><p><b>Plugins can have any license</b>: they run independently of EmDash and share no code. It’s the plugin author’s choice.</p></li><li><p><b>Plugin code runs independently in a secure sandbox</b>: a plugin can be provided to an EmDash site, and trusted, without the EmDash site ever seeing the code.</p></li></ol><p>The first part is straightforward: as the plugin author, you choose what license you want, the same way you can when publishing to NPM, PyPI, Packagist, or any other registry. 
It’s an open ecosystem for all; which license you use for plugins and themes is up to the community, not the EmDash project.</p><p>The second part is where EmDash’s plugin architecture breaks free of the centralized marketplace.</p><p>Developers need to rely far less on a third-party marketplace having vetted a plugin in order to decide whether to use or trust it. Consider the example plugin above that sends emails after content is saved; the plugin declares three things:</p><ul><li><p>It only runs on the <code>content:afterSave</code> hook</p></li><li><p>It has the <code>read:content</code> capability</p></li><li><p>It has the <code>email:send</code> capability</p></li></ul><p>The plugin can have tens of thousands of lines of code in it, but unlike a WordPress plugin that has access to everything and can talk to the public Internet, the person adding the plugin knows exactly what access they are granting to it. The clearly defined boundaries allow you to make informed decisions about security risks, and to zoom in on the specific risks that relate directly to the capabilities the plugin is given.</p><p>The more that both sites and platforms can trust the security model to provide constraints, the more that sites and platforms can trust plugins, and break free of centralized control of marketplaces and reputation. Put another way: if you trust that food safety is enforced in your city, you’ll be adventurous and try new places. If you can’t trust that there won’t be a staple in your soup, you’ll be consulting Google before every new place you try, and it’s harder for everyone to open new restaurants.</p>
    <div>
      <h3>Every EmDash site has x402 support built in — charge for access to content</h3>
      <a href="#every-emdash-site-has-x402-support-built-in-charge-for-access-to-content">
        
      </a>
    </div>
    <p>The business model of the web <a href="https://blog.cloudflare.com/content-independence-day-no-ai-crawl-without-compensation/"><u>is at risk</u></a>, particularly for content creators and publishers. The old way of making content widely accessible, allowing all clients free access in exchange for traffic, breaks when there is no human looking at a site to advertise to, and the client is instead their agent accessing the web on their behalf. Creators need ways to continue to make money in this new world of agents, and to build new kinds of websites that serve what people’s agents need and will pay for. Decades ago, a new wave of creators built websites that became great businesses (often using WordPress to power them), and a similar opportunity exists today.</p><p><a href="https://www.x402.org/"><u>x402</u></a> is an open, neutral standard for Internet-native payments. It lets anyone on the Internet easily charge, and any client pay on-demand, on a pay-per-use basis. A client, such as an agent, sends an HTTP request and receives an HTTP 402 Payment Required status code. In response, the client pays for access on-demand, and the server can let the client through to the requested content.</p><p>EmDash has built-in support for x402. This means anyone with an EmDash site can charge for access to their content without requiring subscriptions and with zero engineering work. All you need to do is configure which content should require payment, set how much to charge, and provide a wallet address. The request/response flow ends up looking like this:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3IKfYGHF6Pgi3jQf1ERRQC/48815ffec3e204f4f2c6f7a40f232a93/image4.png" />
          </figure><p>Every EmDash site has a built-in business model for the AI era.</p>
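<p>As a rough sketch of that flow: the server answers an unpaid request with <code>402 Payment Required</code> plus the payment terms, and lets the request through once a payment proof verifies. Everything below is illustrative; the real header names, payment fields, and verification logic live in the x402 specification and EmDash’s implementation.</p>

```typescript
// Illustrative x402-style gate in front of paid content. The payment terms,
// proof format, and verifier here are stand-ins, not the real wire format.
type PaymentVerifier = (proof: string) => boolean;

interface GateResult {
  status: number;
  body: string;
}

function servePaidContent(
  paymentProof: string | null,
  verify: PaymentVerifier,
  content: string,
): GateResult {
  if (paymentProof === null) {
    // No payment attached: tell the client how to pay, so it can retry.
    return {
      status: 402, // HTTP 402 Payment Required
      body: JSON.stringify({ amount: "0.01", currency: "USD", payTo: "<wallet address>" }),
    };
  }
  if (!verify(paymentProof)) {
    return { status: 402, body: JSON.stringify({ error: "invalid payment" }) };
  }
  return { status: 200, body: content };
}

// Toy verifier: accepts any non-empty proof. A real deployment would verify
// an actual payment against the configured wallet address.
const verify: PaymentVerifier = (proof) => proof.length > 0;

console.log(servePaidContent(null, verify, "article").status); // 402
console.log(servePaidContent("proof", verify, "article").status); // 200
```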
    <div>
      <h3>Solving scale-to-zero for WordPress hosting platforms</h3>
      <a href="#solving-scale-to-zero-for-wordpress-hosting-platforms">
        
      </a>
    </div>
    <p>WordPress is not serverless: it requires provisioning and managing servers, scaling them up and down like a traditional web application. To maximize performance, and to be able to handle traffic spikes, there’s no avoiding the need to pre-provision instances and run some amount of idle compute, or share resources in ways that limit performance. This is particularly true for sites with content that must be server-rendered and cannot be cached.</p><p>EmDash is different: it’s built to run on serverless platforms, and to make the most of the <a href="https://developers.cloudflare.com/workers/reference/how-workers-works/"><u>V8 isolate architecture</u></a> of Cloudflare’s open source runtime <a href="https://github.com/cloudflare/workerd"><u>workerd</u></a>. On an incoming request, the Workers runtime instantly spins up an isolate to execute code and serve a response. It scales back down to zero if there are no requests. And it <a href="https://blog.cloudflare.com/workers-pricing-scale-to-zero/"><u>only bills for CPU time</u></a> (time spent doing actual work).</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3yIX0whveiJ7xQ9P20TeyA/84462e6ec58cab27fbd6bf1703efeabc/image7.png" />
          </figure><p>You can run EmDash anywhere, on any Node.js server — but on Cloudflare you can run millions of instances of EmDash using <a href="https://developers.cloudflare.com/cloudflare-for-platforms/"><u>Cloudflare for Platforms</u></a> that each instantly scale fully to zero or up to as many RPS as you need to handle, using the exact same network and runtime that the biggest websites in the world rely on.</p><p>Beyond cost optimizations and performance benefits, we’ve bet on this architecture at Cloudflare in part because we believe in having low cost and free tiers, and that everyone should be able to build websites that scale. We’re excited to help platforms extend the benefits of this architecture to their own customers, both big and small.</p>
    <div>
      <h3>Modern frontend theming and architecture via Astro</h3>
      <a href="#modern-frontend-theming-and-architecture-via-astro">
        
      </a>
    </div>
    <p>EmDash is powered by Astro, the web framework for content-driven websites. To create an EmDash theme, you create an Astro project that includes:</p><ul><li><p><b>Pages</b>: Astro routes for rendering content (homepage, blog posts, archives, etc.)</p></li><li><p><b>Layouts</b>: Shared HTML structure</p></li><li><p><b>Components</b>: Reusable UI elements (navigation, cards, footers)</p></li><li><p><b>Styles</b>: CSS or Tailwind configuration</p></li><li><p><b>A seed file</b>: JSON that tells the CMS what content types and fields to create</p></li></ul><p>This makes creating themes familiar to frontend developers, who are <a href="https://npm-stat.com/charts.html?package=astro&amp;from=2024-01-01&amp;to=2026-03-30"><u>increasingly choosing Astro</u></a>, and to LLMs, which are already trained on Astro.</p><p>WordPress themes, though incredibly flexible, carry many of the same security risks as plugins, and the more popular and commonplace your theme, the more of a target it is. WordPress themes integrate through <code>functions.php</code>, an all-encompassing execution environment, which makes a theme both incredibly powerful and potentially dangerous. EmDash themes, like EmDash plugins, turn this expectation on its head: your theme can never perform database operations.</p>
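<p>For example, a minimal seed file for a blog theme might look like the following sketch. The exact schema is an assumption for illustration; the documented seed format may differ:</p>

```typescript
// Illustrative theme seed: the JSON that tells the CMS which collections and
// fields to create. Field names here are assumptions, not the real format.
const seed = {
  collections: [
    {
      name: "posts",
      fields: [
        { name: "title", type: "text", required: true },
        { name: "body", type: "richtext", required: true },
        { name: "publishedAt", type: "datetime", required: false },
      ],
    },
  ],
};

console.log(seed.collections[0].fields.map((f) => f.name)); // ["title", "body", "publishedAt"]
```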
    <div>
      <h3>An AI Native CMS — MCP, CLI, and Skills for EmDash</h3>
      <a href="#an-ai-native-cms-mcp-cli-and-skills-for-emdash">
        
      </a>
    </div>
    <p>The least fun part about working with any CMS is doing the rote migration of content: finding and replacing strings, migrating custom fields from one format to another, renaming, reordering and moving things around. This is either boring repetitive work or requires one-off scripts and  “single-use” plugins and tools that are usually neither fun to write nor to use.</p><p>EmDash is designed to be managed programmatically by your AI agents. It provides the context and the tools that your agents need, including:</p><ol><li><p><b>Agent Skills:</b> Each EmDash instance includes <a href="https://agentskills.io/home"><u>Agent Skills</u></a> that describe to your agent the capabilities EmDash can provide to plugins, the hooks that can trigger plugins, <a href="https://github.com/emdash-cms/emdash/blob/main/skills/creating-plugins/SKILL.md"><u>guidance on how to structure a plugin</u></a>, and even <a href="https://github.com/emdash-cms/emdash/blob/main/skills/wordpress-theme-to-emdash/SKILL.md"><u>how to port legacy WordPress themes to EmDash natively</u></a>. When you give an agent an EmDash codebase, EmDash provides everything the agent needs to be able to customize your site in the way you need.</p></li><li><p><b>EmDash CLI:</b> The <a href="https://github.com/emdash-cms/emdash/blob/main/docs/src/content/docs/reference/cli.mdx"><u>EmDash CLI</u></a> enables your agent to interact programmatically with your local or remote instance of EmDash. 
You can <a href="https://github.com/emdash-cms/emdash/blob/main/docs/src/content/docs/reference/cli.mdx#media-upload-file"><u>upload media</u></a>, <a href="https://github.com/emdash-cms/emdash/blob/main/docs/src/content/docs/reference/cli.mdx#emdash-search"><u>search for content</u></a>, <a href="https://github.com/emdash-cms/emdash/blob/main/docs/src/content/docs/reference/cli.mdx#schema-create-collection"><u>create and manage schemas</u></a>, and do the same set of things you can do in the Admin UI.</p></li><li><p><b>Built-in MCP Server:</b> Every EmDash instance provides its own remote Model Context Protocol (MCP) server, allowing you to do the same set of things you can do in the Admin UI.</p></li></ol>
    <div>
      <h3>Pluggable authentication, with Passkeys by default</h3>
      <a href="#pluggable-authentication-with-passkeys-by-default">
        
      </a>
    </div>
    <p>EmDash uses passkey-based authentication by default, meaning there are no passwords to leak and no brute-force vectors to defend against. User management includes familiar role-based access control out of the box: administrators, editors, authors, and contributors, each scoped strictly to the actions they need. Authentication is pluggable, so you can set EmDash up to work with your SSO provider, and automatically provision access based on IdP metadata.</p>
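<p>Conceptually, that role model is a table mapping each role to the actions it may perform. A sketch, with illustrative action names (the four roles come from the description above; the actions themselves are not EmDash’s actual role definitions):</p>

```typescript
// Illustrative role-based access control table. The four roles match the
// post; the action names are assumptions for this example.
const roleActions: Record<string, string[]> = {
  administrator: ["users:manage", "plugins:install", "content:publish", "content:write"],
  editor: ["content:publish", "content:write"],
  author: ["content:write"],
  contributor: ["content:draft"],
};

// A permission check is then a single scoped lookup: each role is granted
// strictly the actions it needs, nothing more.
function can(role: string, action: string): boolean {
  return (roleActions[role] ?? []).includes(action);
}

console.log(can("editor", "content:publish")); // true
console.log(can("contributor", "content:publish")); // false
```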
    <div>
      <h3>Import your WordPress sites to EmDash</h3>
      <a href="#import-your-wordpress-sites-to-emdash">
        
      </a>
    </div>
    <p>You can import an existing WordPress site either by going to the WordPress admin and exporting a WXR file, or by installing the <a href="https://github.com/emdash-cms/wp-emdash/tree/main/plugins/emdash-exporter"><u>EmDash Exporter plugin</u></a> on a WordPress site, which configures a secure endpoint that is exposed only to you and protected by a WordPress Application Password you control. Migrating content takes just a few minutes and automatically brings any attached media into EmDash’s media library.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/SUFaWUIoEFSN2z9rclKZW/28870489d502cff34e35ab3b59f19eae/image1.png" />
          </figure><p>Creating custom content types on WordPress beyond a Post or a Page has meant installing heavy plugins like Advanced Custom Fields, and squeezing the result into a crowded WordPress posts table. EmDash does things differently: you can define a schema directly in the admin panel, which will create entirely new EmDash collections for you, stored separately in the database. On import, you can use the same capabilities to take any custom post type from WordPress and create an EmDash content type from it.</p><p>For bespoke blocks, you can use the <a href="https://github.com/emdash-cms/emdash/blob/main/skills/creating-plugins/references/block-kit.md"><u>EmDash Block Kit Agent Skill</u></a> to instruct your agent of choice to build them for EmDash.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5xutdF9nvHYMYlN6XfqRGu/1db0e0d73327e926d606f92fdd7aabec/image3.png" />
          </figure>
    <div>
      <h3>Try it</h3>
      <a href="#try-it">
        
      </a>
    </div>
    <p>EmDash is a v0.1.0 preview. We’d love for you to try it and give feedback, and we welcome contributions to the <a href="https://github.com/emdash-cms/emdash/"><u>EmDash GitHub repository</u></a>.</p><p>If you’re just playing around and want to first understand what’s possible, try out the admin interface in the <a href="https://emdashcms.com/"><u>EmDash Playground</u></a>.</p><p>To create a new EmDash site locally, via the CLI, run:</p><p><code>npm create emdash@latest</code></p><p>Or you can do the same via the Cloudflare dashboard below:</p><a href="https://deploy.workers.cloudflare.com/?url=https://github.com/emdash-cms/templates/tree/main/blog-cloudflare"><img src="https://deploy.workers.cloudflare.com/button" /></a>
<p></p><p>We’re excited to see what you build, and if you're active in the WordPress community, as a hosting platform, a plugin or theme author, or otherwise — we’d love to hear from you. Email us at emdash@cloudflare.com, and tell us what you’d like to see from the EmDash project.</p><p>If you want to stay up to date with major EmDash developments, you can leave your email address <a href="https://forms.gle/ofE1LYRYxkpAPqjE7"><u>here</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">64rkKr9jewVmxagIFgbwY4</guid>
            <dc:creator>Matt “TK” Taylor</dc:creator>
            <dc:creator>Matt Kane</dc:creator>
        </item>
        <item>
            <title><![CDATA[Fixing request smuggling vulnerabilities in Pingora OSS deployments]]></title>
            <link>https://blog.cloudflare.com/pingora-oss-smuggling-vulnerabilities/</link>
            <pubDate>Mon, 09 Mar 2026 14:00:00 GMT</pubDate>
            <description><![CDATA[ Today we’re disclosing request smuggling vulnerabilities when our open source Pingora service is deployed as an ingress proxy and how we’ve fixed them in Pingora 0.8.0.  ]]></description>
            <content:encoded><![CDATA[ <p>In December 2025, Cloudflare received reports of HTTP/1.x request smuggling vulnerabilities in the <a href="https://github.com/cloudflare/pingora"><u>Pingora open source</u></a> framework when Pingora is used to build an ingress proxy. Today we are discussing how these vulnerabilities work and how we patched them in <a href="https://github.com/cloudflare/pingora/releases/tag/0.8.0"><u>Pingora 0.8.0</u></a>.</p><p>The vulnerabilities are <a href="https://www.cve.org/CVERecord?id=CVE-2026-2833"><u>CVE-2026-2833</u></a>, <a href="https://www.cve.org/CVERecord?id=CVE-2026-2835"><u>CVE-2026-2835</u></a>, and <a href="https://www.cve.org/CVERecord?id=CVE-2026-2836"><u>CVE-2026-2836</u></a>. These issues were responsibly reported to us by Rajat Raghav (xclow3n) through our <a href="https://www.cloudflare.com/disclosure/"><u>Bug Bounty Program</u></a>.</p><p>Our investigation found that <b>Cloudflare’s CDN and customer traffic were not affected</b>. <b>No action is needed for Cloudflare customers, and no impact was detected.</b></p><p>Due to the architecture of Cloudflare’s network, these vulnerabilities could not be exploited: Pingora is not used as an ingress proxy in Cloudflare’s CDN.</p><p>However, these issues impact standalone Pingora deployments exposed to the Internet, and may enable an attacker to:</p><ul><li><p>Bypass Pingora proxy-layer security controls</p></li><li><p>Desync HTTP requests/responses with backends for cross-user hijacking attacks (session or credential theft)</p></li><li><p>Poison Pingora proxy-layer caches retrieving content from shared backends</p></li></ul><p>We have released <a href="https://github.com/cloudflare/pingora/releases/tag/0.8.0"><u>Pingora 0.8.0</u></a> with fixes and hardening. While Cloudflare customers were not affected, we strongly recommend that users of the Pingora framework <b>upgrade as soon as possible.</b></p>
    <div>
      <h2>What was the vulnerability?</h2>
      <a href="#what-was-the-vulnerability">
        
      </a>
    </div>
    <p>The reports described a few different HTTP/1 attack payloads that could cause desync attacks. Such requests could cause the proxy and backend to disagree about where the request body ends, allowing a second request to be “smuggled” past proxy‑layer checks. The researcher provided a proof-of-concept to validate how a basic Pingora reverse proxy misinterpreted request body lengths and forwarded those requests to server backends such as Node/Express or uvicorn.</p><p>Upon receiving the reports, our engineering team immediately investigated and validated that, as the reporter also confirmed, the Cloudflare CDN itself was not vulnerable. However, the team also confirmed that vulnerabilities exist when Pingora acts as the ingress proxy to shared backends.</p><p>By design, the Pingora framework <a href="https://blog.cloudflare.com/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/#design-decisions"><u>does allow</u></a> edge-case HTTP requests or responses that are not strictly RFC compliant, because we must accept this sort of traffic for customers with legacy HTTP stacks. But this leniency has limits to avoid exposing Cloudflare itself to vulnerabilities.</p><p>In this case, Pingora had non-RFC-compliant interpretations of request bodies within its HTTP/1 stack that allowed these desync attacks to exist. Pingora deployments within Cloudflare are not directly exposed to ingress traffic, and we found that production traffic that arrived at Pingora services was not subject to these misinterpretations. Thus, the attacks were not exploitable on Cloudflare traffic itself, unlike a <a href="https://blog.cloudflare.com/resolving-a-request-smuggling-vulnerability-in-pingora/"><u>previous Pingora smuggling vulnerability</u></a> disclosed in May 2025.</p><p>We’ll explain, case-by-case, how these attack payloads worked.</p>
    <div>
      <h3>1. Premature upgrade without 101 handshake</h3>
      <a href="#1-premature-upgrade-without-101-handshake">
        
      </a>
    </div>
    <p>The first report showed that a request with an <code>Upgrade</code> header value would cause Pingora to pass through subsequent bytes on the HTTP connection immediately, before the backend had accepted an upgrade (by returning <code>101 Switching Protocols</code>). The attacker could thus pipeline a second HTTP request after the upgrade request on the same connection:</p>
            <pre><code>GET / HTTP/1.1
Host: example.com
Upgrade: foo


GET /admin HTTP/1.1
Host: example.com</code></pre>
            <p>Pingora would parse only the initial request, then treat the remaining buffered bytes as the “upgraded” stream and forward them directly to the backend in a “passthrough” mode <a href="https://github.com/cloudflare/pingora/blob/ef017ceb01962063addbacdab2a4fd2700039db5/pingora-core/src/protocols/http/v1/server.rs#L797"><u>due to the Upgrade header</u></a> (until the response <a href="https://github.com/cloudflare/pingora/blob/ef017ceb01962063addbacdab2a4fd2700039db5/pingora-core/src/protocols/http/v1/server.rs#L523"><u>was received</u></a>).</p><p>This is not at all how the HTTP/1.1 Upgrade process per <a href="https://datatracker.ietf.org/doc/html/rfc9110#field.upgrade"><u>RFC 9110</u></a> is intended to work. The subsequent bytes should <i>only</i> be interpreted as part of an upgraded stream if a <code>101 Switching Protocols</code> header is received, and if a <code>200 OK</code> response is received instead, the subsequent bytes should continue to be interpreted as HTTP.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2IYHyGkABpNA0e09wiiGpY/4f51ea330c2d266260f6361dd9d64d79/image4.png" />
          </figure><p><sup><i>An attacker that sends an Upgrade request, then pipelines a partial HTTP request may cause a desync attack. Pingora will incorrectly interpret both as the same upgraded request, even if the backend server declines the upgrade with a 200.</i></sup></p><p>Via the improper pass-through, a Pingora deployment that received a non-101 response could still forward the second partial HTTP request to the upstream as-is, bypassing any Pingora user‑defined ACL-handling or WAF logic, and poison the connection to the upstream so that a subsequent request from a different user could improperly receive the <code>/admin</code> response.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/oIwatu6gaMoJHCCs95sFN/8ea94ee8f04be6f7f00474168b382180/image3.png" />
          </figure><p><sup><i>After the attack payload, Pingora and the backend server are now “desynced.” The backend server will wait until it thinks the rest of the partial /attack request header that Pingora forwarded is complete. When Pingora forwards a different user’s request, the two headers are combined from the backend server’s perspective, and the attacker has now poisoned the other user’s response.</i></sup></p><p>We’ve since <a href="https://github.com/cloudflare/pingora/commit/824bdeefc61e121cc8861de1b35e8e8f39026ecd"><u>patched</u></a> Pingora to switch the interpretation of subsequent bytes only once the upstream responds with <code>101 Switching Protocols</code>.</p><p>We verified Cloudflare was <b>not affected</b> for two reasons:</p><ol><li><p>The ingress CDN proxies do not have this improper behavior.</p></li><li><p>The clients to our internal Pingora services do not attempt to <a href="https://en.wikipedia.org/wiki/HTTP_pipelining"><u>pipeline</u></a> HTTP/1 requests. Furthermore, the Pingora service these clients talk directly with disables keep-alive on these <code>Upgrade</code> requests by injecting a <code>Connection: close</code> header; this prevents additional requests that would be sent — and subsequently smuggled — over the same connection.</p></li></ol>
    <div>
      <h3>2. HTTP/1.0, close-delimiting, and transfer-encoding</h3>
      <a href="#2-http-1-0-close-delimiting-and-transfer-encoding">
        
      </a>
    </div>
    <p>The reporter also demonstrated what <i>appeared</i> to be a more classic “CL.TE” desync-type attack, where the Pingora proxy would use Content-Length as framing while the backend would use Transfer-Encoding as framing:</p>
            <pre><code>GET / HTTP/1.0
Host: example.com
Connection: keep-alive
Transfer-Encoding: identity, chunked
Content-Length: 29

0

GET /admin HTTP/1.1
X:
</code></pre>
            <p>In the reporter’s example, Pingora would treat all subsequent bytes after the first GET / request header as part of that request’s body, but the Node.js backend server would interpret the body as chunked and ending at the zero-length chunk. There are actually a few things going on here:</p><ol><li><p>Pingora’s chunked encoding recognition was quite barebones (only checking for whether <code>Transfer-Encoding</code> was “<a href="https://github.com/cloudflare/pingora/blob/9ac75d0356f449d26097e08bf49af14de6271727/pingora-core/src/protocols/http/v1/common.rs#L146"><u>chunked</u></a>”) and assumed that there could only be one encoding or <code>Transfer-Encoding</code> header. But the RFC only <a href="https://datatracker.ietf.org/doc/html/rfc9112#section-6.3-2.4.1"><u>mandates</u></a> that the <i>final</i> encoding must be <code>chunked</code> to apply chunked framing. So per RFC, this request should have a chunked message body (if it were not HTTP/1.0 — more on that below).</p></li><li><p>Pingora was <i>also</i> not actually using the <code>Content-Length</code> (because the Transfer-Encoding overrode the Content-Length <a href="https://datatracker.ietf.org/doc/html/rfc9112#section-6.3-2.3"><u>per RFC</u></a>). Because of the unrecognized Transfer-Encoding and the HTTP/1.0 version, the request body was <a href="https://github.com/cloudflare/pingora/blob/ef017ceb01962063addbacdab2a4fd2700039db5/pingora-core/src/protocols/http/v1/server.rs#L817"><u>instead treated as close-delimited</u></a> (meaning the message body’s end is marked by closure of the underlying transport connection). An absence of framing headers would also trigger the same misinterpretation on HTTP/1.0. Although response bodies are allowed to be close-delimited, request bodies are <i>never</i> close-delimited. 
In fact, this clarification is now explicitly called out as a separate note in <a href="https://datatracker.ietf.org/doc/html/rfc9112#section-6.3-4.1"><u>RFC 9112</u></a>.</p></li><li><p>This is an HTTP/1.0 request that <a href="https://datatracker.ietf.org/doc/html/rfc9112#appendix-C.2.3-1"><u>did not define</u></a> Transfer-Encoding. The RFC <a href="https://datatracker.ietf.org/doc/html/rfc9112#section-6.1-16">mandates</a> that HTTP/1.0 requests containing Transfer-Encoding must “treat the message as if the framing is faulty” and close the connection. Parsers such as the ones in nginx and hyper just reject these requests to avoid ambiguous framing.</p></li></ol>
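<p>Taken together, these rules suggest roughly the following request-body framing decision. This is an illustrative sketch of the RFC 9112 logic with hypothetical names, not Pingora's actual parser:</p>

```javascript
// Sketch of RFC 9112 request-body framing rules (illustrative names).
function requestBodyFraming({ version, transferEncoding, contentLength }) {
  if (transferEncoding !== undefined) {
    // HTTP/1.0 did not define Transfer-Encoding: framing is faulty, reject.
    if (version === "1.0") throw new Error("400: Transfer-Encoding on HTTP/1.0");
    const codings = transferEncoding.split(",").map((c) => c.trim().toLowerCase());
    // Only a *final* transfer coding of "chunked" gives chunked framing.
    if (codings[codings.length - 1] !== "chunked") {
      throw new Error("400: final transfer coding must be chunked");
    }
    // Transfer-Encoding overrides any Content-Length present.
    return { mode: "chunked" };
  }
  if (contentLength !== undefined) {
    // Reject ambiguous values rather than guessing.
    if (!/^\d+$/.test(contentLength)) throw new Error("400: invalid Content-Length");
    return { mode: "length", length: Number(contentLength) };
  }
  // No framing headers: the request has no body. Request bodies are
  // never close-delimited.
  return { mode: "none" };
}
```

Applied to the reporter's payload, this logic rejects the request outright (HTTP/1.0 plus Transfer-Encoding) instead of inventing a close-delimited body.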
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1jLbMNafmF96toxAPxj2Cm/8561b96a56dc0fc654476e33d0f34888/image2.png" />
          </figure><p><sup><i>When an attacker pipelines a partial HTTP request header after the HTTP/1.0 + Transfer-Encoding request, Pingora would incorrectly interpret that partial header as part of the same request, rather than as a distinct request. This enables the same kind of desync attack as described in the premature Upgrade example.</i></sup></p><p>This spoke to a more fundamental misreading of the RFC, particularly in terms of response vs. request message framing. We’ve since fixed the improper <a href="https://github.com/cloudflare/pingora/commit/7f7166d62fa916b9f11b2eb8f9e3c4999e8b9023"><u>multiple Transfer-Encoding parsing</u></a>, now adhere strictly to the request length guidelines such that HTTP request bodies can <a href="https://github.com/cloudflare/pingora/commit/40c3c1e9a43a86b38adeab8da7a2f6eba68b83ad"><u>never be considered close-delimited</u></a>, and reject <a href="https://github.com/cloudflare/pingora/commit/fc904c0d2c679be522de84729ec73f0bd344963d"><u>invalid Content-Length</u></a> and <a href="https://github.com/cloudflare/pingora/commit/87e2e2fb37edf9be33e3b1d04726293ae6bf2052"><u>HTTP/1.0 + Transfer-Encoding</u></a> request messages. Further protections we’ve added include <a href="https://github.com/cloudflare/pingora/commit/d3d2cf5ef4eca1e5d327fe282ec4b4ee474350c6"><u>rejecting</u></a> <a href="https://datatracker.ietf.org/doc/html/rfc9110#name-connect"><u>CONNECT</u></a> requests by default, because the HTTP proxy logic doesn’t currently treat CONNECT as special for the purposes of CONNECT upgrade proxying, and these requests have special <a href="https://datatracker.ietf.org/doc/html/rfc9112#section-6.3-2.2"><u>message framing rules</u></a>. 
(Note that incoming CONNECT requests are <a href="https://developers.cloudflare.com/fundamentals/concepts/traffic-flow-cloudflare/#cloudflares-network"><u>rejected</u></a> by the Cloudflare CDN.)</p><p>When we investigated and instrumented our services internally, we found no requests arriving at our Pingora services that would have been misinterpreted. We found that downstream proxy layers in the CDN would forward as HTTP/1.1 only, reject ambiguous framing such as invalid Content-Length, and only forward a single <code>Transfer-Encoding: chunked</code> header for chunked requests.</p>
    <div>
      <h3>3. Cache key construction</h3>
      <a href="#3-cache-key-construction">
        
      </a>
    </div>
    <p>The researcher also reported one other cache poisoning vulnerability regarding default <code>CacheKey</code> construction. The <a href="https://github.com/cloudflare/pingora/blob/ef017ceb01962063addbacdab2a4fd2700039db5/pingora-cache/src/key.rs#L218"><u>naive default implementation</u></a> factored in only the URI path (without other factors such as host header or upstream server HTTP scheme), which meant different hosts using the same HTTP path could collide and poison each other’s cache.</p><p>This would affect users of the alpha proxy caching feature who chose to use the default <code>CacheKey</code> implementation. We have since <a href="https://github.com/cloudflare/pingora/commit/257b59ada28ed6cac039f67d0b71f414efa0ab6e"><u>removed that default</u></a>, because while using something like HTTP scheme + host + URI makes sense for many applications, we want users to be careful when constructing their cache keys for themselves. If their proxy logic will conditionally adjust the URI or method on the upstream request, for example, that logic likely also must be factored into the cache key scheme to avoid poisoning.</p><p>Internally, Cloudflare’s <a href="https://developers.cloudflare.com/cache/how-to/cache-keys/"><u>default cache key</u></a> uses a number of factors to prevent cache key poisoning, and never made use of the previously provided default.</p>
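<p>For illustration, a safer key factors in at least the scheme and host alongside the path. This is a hypothetical helper sketch, not Pingora's API:</p>

```javascript
// Sketch: a cache key that disambiguates scheme and host, not just path.
// If proxy logic conditionally rewrites the URI or method before sending
// the request upstream, those rewrites must be reflected here as well.
function cacheKey(req) {
  return `${req.scheme}://${req.host}${req.path}`;
}
```

With this shape, `/admin` on two different hosts produces two distinct keys, so the cross-host poisoning described above cannot occur.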
    <div>
      <h2>Recommendation</h2>
      <a href="#recommendation">
        
      </a>
    </div>
    <p>If you use Pingora as a proxy, upgrade to <a href="https://github.com/cloudflare/pingora/releases/tag/0.8.0"><u>Pingora 0.8.0</u></a> at your earliest convenience.</p><p>We apologize for the impact these vulnerabilities may have had on Pingora users. As Pingora earns its place as critical Internet infrastructure beyond Cloudflare, we believe it’s important for the framework to promote strict RFC compliance by default, and we will continue this effort. Very few users of the framework should have to deal with the same “wild Internet” that Cloudflare does. Our intention is that stricter adherence to the latest RFC standards by default will harden security for Pingora users and move the Internet as a whole toward best practices.</p>
    <div>
      <h2>Disclosure and response timeline</h2>
      <a href="#disclosure-and-response-timeline">
        
      </a>
    </div>
    <p>- 2025-12-02: Upgrade-based smuggling reported via bug bounty.</p><p>- 2026-01-13: Transfer-Encoding / HTTP/1.0 parsing issues reported.</p><p>- 2026-01-18: Default cache key construction issue reported.</p><p>- 2026-01-29 to 2026-02-13: Fixes validated with the reporter. Work on more RFC-compliance checks continues.</p><p>- 2026-02-25: Cache key default removal and additional RFC checks validated with researcher.</p><p>- 2026-03-02: Pingora 0.8.0 released.</p><p>- 2026-03-04: CVE advisories published.</p>
    <div>
      <h2>Acknowledgements</h2>
      <a href="#acknowledgements">
        
      </a>
    </div>
    <p>We thank Rajat Raghav (xclow3n) for the report, detailed reproductions, and verification of the fixes through our bug bounty program. Please see the researcher's <a href="https://xclow3n.github.io/post/6">corresponding blog post</a> for more information.</p><p>We would also like to extend a heartfelt thank you to the Pingora open source community for their active engagement, issue reports, and contributions to the framework. You truly help us build a better Internet.</p>
            <category><![CDATA[Pingora]]></category>
            <category><![CDATA[Application Security]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">1b0iJgL57wbfiLHXhEjuwR</guid>
            <dc:creator>Edward Wang</dc:creator>
            <dc:creator>Fei Deng</dc:creator>
            <dc:creator>Andrew Hauck</dc:creator>
        </item>
        <item>
            <title><![CDATA[We deserve a better streams API for JavaScript]]></title>
            <link>https://blog.cloudflare.com/a-better-web-streams-api/</link>
            <pubDate>Fri, 27 Feb 2026 06:00:00 GMT</pubDate>
            <description><![CDATA[ The Web streams API has become ubiquitous in JavaScript runtimes but was designed for a different era. Here's what a modern streaming API could (should?) look like. ]]></description>
            <content:encoded><![CDATA[ <p>Handling data in streams is fundamental to how we build applications. To make streaming work everywhere, the <a href="https://streams.spec.whatwg.org/"><u>WHATWG Streams Standard</u></a> (informally known as "Web streams") was designed to establish a common API to work across browsers and servers. It shipped in browsers, was adopted by Cloudflare Workers, Node.js, Deno, and Bun, and became the foundation for APIs like <a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API"><u>fetch()</u></a>. It's a significant undertaking, and the people who designed it were solving hard problems with the constraints and tools they had at the time.</p><p>But after years of building on Web streams – implementing them in both Node.js and Cloudflare Workers, debugging production issues for customers and runtimes, and helping developers work through far too many common pitfalls – I've come to believe that the standard API has fundamental usability and performance issues that cannot be fixed easily with incremental improvements alone. The problems aren't bugs; they're consequences of design decisions that may have made sense a decade ago, but don't align with how JavaScript developers write code today.</p><p>This post explores some of the fundamental issues I see with Web streams and presents an alternative approach built around JavaScript language primitives that demonstrates something better is possible.</p><p>In benchmarks, this alternative can run anywhere from 2x to <i>120x</i> faster than Web streams in every runtime I've tested it on (including Cloudflare Workers, Node.js, Deno, Bun, and every major browser). The improvements are not due to clever optimizations, but to fundamentally different design choices that more effectively leverage modern JavaScript language features. I'm not here to disparage the work that came before; I'm here to start a conversation about what can potentially come next.</p>
    <div>
      <h2>Where we're coming from</h2>
      <a href="#where-were-coming-from">
        
      </a>
    </div>
    <p>The Streams Standard was developed between 2014 and 2016 with an ambitious goal: to provide "APIs for creating, composing, and consuming streams of data that map efficiently to low-level I/O primitives." Before Web streams, the web platform had no standard way to work with streaming data.</p><p>Node.js already had its own <a href="https://nodejs.org/api/stream.html"><u>streaming API</u></a> at the time that was ported to also work in browsers, but the WHATWG chose not to use it as a starting point, given that the WHATWG is chartered to consider only the needs of Web browsers. Server-side runtimes only adopted Web streams later, after Cloudflare Workers and Deno each emerged with first-class Web streams support and cross-runtime compatibility became a priority.</p><p>The design of Web streams predates async iteration in JavaScript. The <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for-await...of"><code><u>for await...of</u></code></a> syntax didn't land until <a href="https://262.ecma-international.org/9.0/"><u>ES2018</u></a>, two years after the Streams Standard was initially finalized. This timing meant the API couldn't initially leverage what would eventually become the idiomatic way to consume asynchronous sequences in JavaScript. Instead, the spec introduced its own reader/writer acquisition model, and that decision rippled through every aspect of the API.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3X0niHShBlgF4LlpWYB7eC/f0bbf35f12ecc98a3888e6e3835acf3a/1.png" />
          </figure>
    <div>
      <h4>Excessive ceremony for common operations</h4>
      <a href="#excessive-ceremony-for-common-operations">
        
      </a>
    </div>
    <p>The most common task with streams is reading them to completion. Here's what that looks like with Web streams:</p>
            <pre><code>// First, we acquire a reader that gives an exclusive lock
// on the stream...
const reader = stream.getReader();
const chunks = [];
try {
  // Second, we repeatedly call read and await on the returned
  // promise to either yield a chunk of data or indicate we're
  // done.
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    chunks.push(value);
  }
} finally {
  // Finally, we release the lock on the stream
  reader.releaseLock();
}</code></pre>
            <p>You might assume this pattern is inherent to streaming. It isn't. The reader acquisition, the lock management, and the <code>{ value, done }</code> protocol are all just design choices, not requirements. They are artifacts of how and when the Web streams spec was written. Async iteration exists precisely to handle sequences that arrive over time, but async iteration did not yet exist when the streams specification was written. The complexity here is pure API overhead, not fundamental necessity.</p><p>Consider the alternative approach now that Web streams do support <code>for await...of</code>:</p>
            <pre><code>const chunks = [];
for await (const chunk of stream) {
  chunks.push(chunk);
}</code></pre>
            <p>This is better in that there is far less boilerplate, but it doesn't solve everything. Async iteration was retrofitted onto an API that wasn't designed for it, and it shows. Features like <a href="https://developer.mozilla.org/en-US/docs/Web/API/ReadableStreamBYOBReader"><u>BYOB (bring your own buffer)</u></a> reads aren't accessible through iteration. The underlying complexity of readers, locks, and controllers is still there, just hidden. When something does go wrong, or when additional features of the API are needed, developers find themselves back in the weeds of the original API, trying to understand why their stream is "locked", why <code>releaseLock()</code> didn't do what they expected, or hunting down bottlenecks in code they don't control.</p>
    <div>
      <h4>The locking problem</h4>
      <a href="#the-locking-problem">
        
      </a>
    </div>
    <p>Web streams use a locking model to prevent multiple consumers from interleaving reads. When you call <a href="https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream/getReader"><code><u>getReader()</u></code></a>, the stream becomes locked. While locked, nothing else can read from the stream directly, pipe it, or even cancel it – only the code that is actually holding the reader can.</p><p>This sounds reasonable until you see how easily it goes wrong:</p>
            <pre><code>async function peekFirstChunk(stream) {
  const reader = stream.getReader();
  const { value } = await reader.read();
  // Oops — forgot to call reader.releaseLock()
  // And the reader is no longer available when we return
  return value;
}

const first = await peekFirstChunk(stream);
// TypeError: Cannot obtain lock — stream is permanently locked
for await (const chunk of stream) { /* never runs */ }</code></pre>
            <p>Forgetting <a href="https://developer.mozilla.org/en-US/docs/Web/API/ReadableStreamDefaultReader/releaseLock"><code><u>releaseLock()</u></code></a> permanently breaks the stream. The <a href="https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream/locked"><code><u>locked</u></code></a> property tells you that a stream is locked, but not why, by whom, or whether the lock is even still usable. <a href="https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream/pipeTo"><u>Piping</u></a> internally acquires locks, making streams unusable during pipe operations in ways that aren't obvious.</p><p>The semantics around releasing locks with pending reads were also unclear for years. If you called read() but didn't await it, then called releaseLock(), what happened? The spec was recently clarified to cancel pending reads on lock release – but implementations varied, and code that relied on the previous unspecified behavior can break.</p><p>That said, it's important to recognize that locking in itself is not bad. It serves an important purpose: ensuring that applications consume or produce data in an orderly way. The key challenge is the original manual management of locks through APIs like <code>getReader()</code> and <code>releaseLock()</code>. With the arrival of automatic lock and reader management with async iterables, dealing with locks from the user's point of view became a lot easier.</p><p>For implementers, the locking model adds a fair amount of non-trivial internal bookkeeping. Every operation must check lock state, readers must be tracked, and the interplay between locks, cancellation, and error states creates a matrix of edge cases that must all be handled correctly.</p>
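<p>For comparison, here is the earlier <code>peekFirstChunk</code> example rewritten so the lock is always released. A sketch of one possible fix:</p>

```javascript
// Sketch: always release the reader lock, even on early return or error,
// so callers can keep reading, piping, or canceling the stream afterward.
async function peekFirstChunk(stream) {
  const reader = stream.getReader();
  try {
    const { value } = await reader.read();
    return value; // note: this chunk is consumed, not put back
  } finally {
    reader.releaseLock(); // the stream is usable again after this
  }
}
```

Because the read is awaited before <code>releaseLock()</code> runs, this version also sidesteps the pending-read cancellation semantics discussed above.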
    <div>
      <h4>BYOB: complexity without payoff</h4>
      <a href="#byob-complexity-without-payoff">
        
      </a>
    </div>
    <p><a href="https://developer.mozilla.org/en-US/docs/Web/API/ReadableStreamBYOBReader"><u>BYOB (bring your own buffer)</u></a> reads were designed to let developers reuse memory buffers when reading from streams, an important optimization intended for high-throughput scenarios. The idea is sound: instead of allocating new buffers for each chunk, you provide your own buffer and the stream fills it.</p><p>In practice (and yes, there are always exceptions to be found), BYOB is rarely used to any measurable benefit. The API is substantially more complex than default reads, requiring a separate reader type (<code>ReadableStreamBYOBReader</code>) and other specialized classes (e.g. <code>ReadableStreamBYOBRequest</code>), careful buffer lifecycle management, and understanding of <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/ArrayBuffer#transferring_arraybuffers"><code><u>ArrayBuffer</u></code><u> detachment</u></a> semantics. When you pass a buffer to a BYOB read, the buffer becomes detached – transferred to the stream – and you get back a different view over potentially different memory. This transfer-based model is error-prone and confusing:</p>
            <pre><code>const reader = stream.getReader({ mode: 'byob' });
const buffer = new ArrayBuffer(1024);
let view = new Uint8Array(buffer);

const result = await reader.read(view);
// 'view' should now be detached and unusable
// (it isn't always in every impl)
// result.value is a NEW view, possibly over different memory
view = result.value; // Must reassign</code></pre>
            <p>BYOB also can't be used with async iteration or TransformStreams, so developers who want zero-copy reads are forced back into the manual reader loop.</p><p>For implementers, BYOB adds significant complexity. The stream must track pending BYOB requests, handle partial fills, manage buffer detachment correctly, and coordinate between the BYOB reader and the underlying source. The <a href="https://github.com/web-platform-tests/wpt/tree/master/streams/readable-byte-streams"><u>Web Platform Tests for readable byte streams</u></a> include dedicated test files just for BYOB edge cases: detached buffers, bad views, response-after-enqueue ordering, and more.</p><p>BYOB ends up being complex for both users and implementers, yet sees little adoption in practice. Most developers stick with default reads and accept the allocation overhead.</p><p>Most userland implementations of custom ReadableStream instances do not typically bother with all the ceremony required to correctly implement both default and BYOB read support in a single stream – and for good reason. It's difficult to get right, and most of the time, consuming code will simply fall back on the default read path. The example below shows what a "correct" implementation would need to do. It's big, complex, and error-prone, and not a level of complexity that the typical developer really wants to have to deal with:</p>
            <pre><code>new ReadableStream({
    type: 'bytes',
    
    async pull(controller: ReadableByteStreamController) {      
      if (offset &gt;= totalBytes) {
        controller.close();
        return;
      }
      
      // Check for BYOB request FIRST
      const byobRequest = controller.byobRequest;
      
      if (byobRequest) {
        // === BYOB PATH ===
        // Consumer provided a buffer - we MUST fill it (or part of it)
        const view = byobRequest.view!;
        const bytesAvailable = totalBytes - offset;
        const bytesToWrite = Math.min(view.byteLength, bytesAvailable);
        
        // Create a view into the consumer's buffer and fill it
        // not critical but safer when bytesToWrite != view.byteLength
        const dest = new Uint8Array(
          view.buffer,
          view.byteOffset,
          bytesToWrite
        );
        
        // Fill with sequential bytes (our "data source")
        // Can be any thing here that writes into the view
        for (let i = 0; i &lt; bytesToWrite; i++) {
          dest[i] = (offset + i) &amp; 0xFF;
        }
        
        offset += bytesToWrite;
        
        // Signal how many bytes we wrote
        byobRequest.respond(bytesToWrite);
        
      } else {
        // === DEFAULT READER PATH ===
        // No BYOB request - allocate and enqueue a chunk
        const bytesAvailable = totalBytes - offset;
        const chunkSize = Math.min(1024, bytesAvailable);
        
        const chunk = new Uint8Array(chunkSize);
        for (let i = 0; i &lt; chunkSize; i++) {
          chunk[i] = (offset + i) &amp; 0xFF;
        }
        
        offset += chunkSize;
        controller.enqueue(chunk);
      }
    },
    
    cancel(reason) {
      console.log('Stream canceled:', reason);
    }
  });</code></pre>
            <p>When a host runtime provides a byte-oriented ReadableStream from the runtime itself, for instance as the <code>body</code> of a fetch <code>Response</code>, it is often far easier for the runtime itself to provide an optimized implementation of BYOB reads. But even those implementations still need to handle both default and BYOB reading patterns, and that requirement brings a fair amount of complexity with it.</p>
    <div>
      <h4>Backpressure: good in theory, broken in practice</h4>
      <a href="#backpressure-good-in-theory-broken-in-practice">
        
      </a>
    </div>
    <p>Backpressure – the ability for a slow consumer to signal a fast producer to slow down – is a first-class concept in Web streams. In theory. In practice, the model has some serious flaws.</p><p>The primary signal is <a href="https://developer.mozilla.org/en-US/docs/Web/API/ReadableStreamDefaultController/desiredSize"><code><u>desiredSize</u></code></a> on the controller. It can be positive (wants data), zero (at capacity), negative (over capacity), or null (closed). Producers are supposed to check this value and stop enqueueing when it's not positive. But there's nothing enforcing this: <a href="https://developer.mozilla.org/en-US/docs/Web/API/ReadableStreamDefaultController/enqueue"><code><u>controller.enqueue()</u></code></a> always succeeds, even when desiredSize is deeply negative.</p>
            <pre><code>new ReadableStream({
  start(controller) {
    // Nothing stops you from doing this
    while (true) {
      controller.enqueue(generateData()); // desiredSize: -999999
    }
  }
});</code></pre>
            <p>Stream implementations can and do ignore backpressure, and some spec-defined features explicitly break backpressure. <a href="https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream/tee"><code><u>tee()</u></code></a>, for instance, creates two branches from a single stream. If one branch reads faster than the other, data accumulates in an internal buffer with no limit. A fast consumer can cause unbounded memory growth while the slow consumer catches up, and there's no way to configure this or opt out beyond canceling the slower branch.</p><p>Web streams do provide clear mechanisms for tuning backpressure behavior in the form of the <code>highWaterMark</code> option and customizable size calculations, but these are just as easy to ignore as <code>desiredSize</code>, and many applications simply fail to pay attention to them.</p><p>The same issues exist on the <code>WritableStream</code> side. A <code>WritableStream</code> has a <code>highWaterMark</code> and <code>desiredSize</code>. There is a <code>writer.ready</code> promise that producers of data are supposed to pay attention to, but often don't.</p>
            <pre><code>const writable = getWritableStreamSomehow();
const writer = writable.getWriter();

// Producers are supposed to wait for writer.ready.
// It is a promise that resolves once the writable's internal
// backpressure has cleared and it is OK to write more data.
await writer.ready;
await writer.write(...);</code></pre>
            <p>For implementers, backpressure adds complexity without providing guarantees. The machinery to track queue sizes, compute <code>desiredSize</code>, and invoke <code>pull()</code> at the right times must all be implemented correctly. However, since these signals are advisory, all that work doesn't actually prevent the problems backpressure is supposed to solve.</p>
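<p>A producer that does cooperate with backpressure drives its work from <code>pull()</code>, which the stream only invokes when the queue has room. A minimal sketch:</p>

```javascript
// Sketch: producing from pull() instead of an unbounded loop in start().
// pull() is only called while desiredSize > 0, so production naturally
// pauses whenever the consumer falls behind the highWaterMark.
let next = 0;
const stream = new ReadableStream({
  pull(controller) {
    if (next >= 5) {
      controller.close(); // done producing
      return;
    }
    controller.enqueue(next++);
  }
}, { highWaterMark: 1 });
```

This keeps the internal queue bounded without the producer ever having to inspect <code>desiredSize</code> directly, though nothing in the spec forces producers to be written this way.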
    <div>
      <h4>The hidden cost of promises</h4>
      <a href="#the-hidden-cost-of-promises">
        
      </a>
    </div>
    <p>The Web streams spec requires promise creation at numerous points, often in hot paths and often invisible to users. Each <code>read()</code> call doesn't just return a promise; internally, the implementation creates additional promises for queue management, <code>pull()</code> coordination, and backpressure signaling.</p><p>This overhead is mandated by the spec's reliance on promises for buffer management, completion, and backpressure signals. While some of it is implementation-specific, much of it is unavoidable if you're following the spec as written. For high-frequency streaming – video frames, network packets, real-time data – this overhead is significant.</p><p>The problem compounds in pipelines. Each <code>TransformStream</code> adds another layer of promise machinery between source and sink. The spec doesn't define synchronous fast paths, so even when data is available immediately, the promise machinery still runs.</p><p>For implementers, this promise-heavy design constrains optimization opportunities. The spec mandates specific promise resolution ordering, making it difficult to batch operations or skip unnecessary async boundaries without risking subtle compliance failures. There are many hidden internal optimizations that implementers do make but these can be complicated and difficult to get right.</p><p>While I was writing this blog post, Vercel's Malte Ubl published their own <a href="https://vercel.com/blog/we-ralph-wiggumed-webstreams-to-make-them-10x-faster"><u>blog post</u></a> describing some research work Vercel has been doing around improving the performance of Node.js' Web streams implementation. In that post they discuss the same fundamental performance optimization problem that every implementation of Web streams face:</p><blockquote><p>"Or consider pipeTo(). Each chunk passes through a full Promise chain: read, write, check backpressure, repeat. An {value, done} result object is allocated per read. 
Error propagation creates additional Promise branches.</p><p>None of this is wrong. These guarantees matter in the browser where streams cross security boundaries, where cancellation semantics need to be airtight, where you do not control both ends of a pipe. But on the server, when you are piping React Server Components through three transforms at 1KB chunks, the cost adds up.</p><p>We benchmarked native WebStream pipeThrough at 630 MB/s for 1KB chunks. Node.js pipeline() with the same passthrough transform: ~7,900 MB/s. That is a 12x gap, and the difference is almost entirely Promise and object allocation overhead." 
- Malte Ubl, <a href="https://vercel.com/blog/we-ralph-wiggumed-webstreams-to-make-them-10x-faster"><u>https://vercel.com/blog/we-ralph-wiggumed-webstreams-to-make-them-10x-faster</u></a></p></blockquote><p>As part of their research, they have put together a set of proposed improvements for Node.js' Web streams implementation that eliminate promises in certain code paths, yielding a performance boost of up to 10x, which only proves the point: promises, while useful, add significant overhead. As one of the core maintainers of Node.js, I am looking forward to helping Malte and the folks at Vercel get their proposed improvements landed!</p><p>In a recent update made to Cloudflare Workers, I made similar kinds of modifications to an internal data pipeline that reduced the number of JavaScript promises created in certain application scenarios by up to 200x. The result was a performance improvement of several orders of magnitude in those applications.</p>
    <div>
      <h3>Real-world failures</h3>
      <a href="#real-world-failures">
        
      </a>
    </div>
    
    <div>
      <h4>Exhausting resources with unconsumed bodies</h4>
      <a href="#exhausting-resources-with-unconsumed-bodies">
        
      </a>
    </div>
    <p>When <code>fetch()</code> returns a response, the body is a <a href="https://developer.mozilla.org/en-US/docs/Web/API/Response/body"><code><u>ReadableStream</u></code></a>. If you only check the status and don't consume or cancel the body, what happens? The answer varies by implementation, but a common outcome is resource leakage.</p>
            <pre><code>async function checkEndpoint(url) {
  const response = await fetch(url);
  return response.ok; // Body is never consumed or cancelled
}

// In a loop, this can exhaust connection pools
for (const url of urls) {
  await checkEndpoint(url);
}</code></pre>
            <p>This pattern has caused connection pool exhaustion in Node.js applications using <a href="https://nodejs.org/api/globals.html#fetch"><u>undici</u></a> (the <code>fetch() </code>implementation built into Node.js), and similar issues have appeared in other runtimes. The stream holds a reference to the underlying connection, and without explicit consumption or cancellation, the connection may linger until garbage collection – which may not happen soon enough under load.</p><p>The problem is compounded by APIs that implicitly create stream branches. <a href="https://developer.mozilla.org/en-US/docs/Web/API/Request/clone"><code><u>Request.clone()</u></code></a> and <a href="https://developer.mozilla.org/en-US/docs/Web/API/Response/clone"><code><u>Response.clone()</u></code></a> perform implicit <code>tee()</code> operations on the body stream – a detail that's easy to miss. Code that clones a request for logging or retry logic may unknowingly create branched streams that need independent consumption, multiplying the resource management burden.</p><p>Now, to be certain, these types of issues <i>are</i> implementation bugs. The connection leak was definitely something that undici needed to fix in its own implementation, but the complexity of the specification does not make dealing with these types of issues easy.</p><blockquote><p>"Cloning streams in Node.js's fetch() implementation is harder than it looks. When you clone a request or response body, you're calling tee() - which splits a single stream into two branches that both need to be consumed. If one consumer reads faster than the other, data buffers unbounded in memory waiting for the slow branch. If you don't properly consume both branches, the underlying connection leaks. The coordination required between two readers sharing one source makes it easy to accidentally break the original request or exhaust connection pools. 
It's a simple API call with complex underlying mechanics that are difficult to get right." - Matteo Collina, Ph.D. - Platformatic Co-Founder &amp; CTO, Node.js Technical Steering Committee Chair</p></blockquote>
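<p>The fix, once you know it's needed, is straightforward: explicitly consume or cancel the body. A sketch (the <code>statusOnly</code> helper is illustrative, not part of any API):</p>

```javascript
// If only the status matters, cancel the unread body explicitly so the
// underlying connection can be released now rather than whenever the
// garbage collector gets around to it.
async function statusOnly(response) {
  const ok = response.ok;
  await response.body?.cancel(); // releases the underlying source/connection
  return ok;
}

// Usage with fetch():
//   const ok = await statusOnly(await fetch(url));
```

<p>The optional chaining matters: some responses (204s, HEAD requests) have no body at all.</p>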
    <div>
      <h4>Falling headlong off the tee() memory cliff</h4>
      <a href="#falling-headlong-off-the-tee-memory-cliff">
        
      </a>
    </div>
    <p><a href="https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream/tee"><code><u>tee()</u></code></a> splits a stream into two branches. It seems straightforward, but the implementation requires buffering: if one branch is read faster than the other, the data must be held somewhere until the slower branch catches up.</p>
            <pre><code>const [forHash, forStorage] = response.body.tee();

// Hash computation is fast
const hash = await computeHash(forHash);

// Storage write is slow — meanwhile, the entire stream
// may be buffered in memory waiting for this branch
await writeToStorage(forStorage);</code></pre>
    <p>The spec does not mandate buffer limits for <code>tee()</code>. To be fair, the spec allows implementations to realize the internal mechanics of <code>tee()</code> and other APIs in any way they see fit, so long as the observable normative requirements of the specification are met. But if an implementation follows the specific algorithm the streams specification describes, <code>tee()</code> comes with a built-in memory management issue that is difficult to work around.</p><p>Implementations have had to develop their own strategies for dealing with this. Firefox initially used a linked-list approach that led to <code>O(n)</code> memory growth proportional to the difference in consumption rates. In Cloudflare Workers, we opted for a shared buffer model where backpressure is signaled by the slowest consumer rather than the fastest.</p>
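<p>One mitigation, when you control both consumers, is to drive the two branches concurrently rather than sequentially. The slow branch then makes progress from the start, so buffering is bounded by the difference in consumption rates rather than by the full length of the stream. A sketch with stand-in <code>computeHash</code> and <code>writeToStorage</code> helpers:</p>

```javascript
// Stand-in helpers for illustration: a fast "hash" and a slow storage write
async function computeHash(stream) {
  let bytes = 0;
  for await (const chunk of stream) bytes += chunk.byteLength; // pretend-hash
  return bytes;
}

async function writeToStorage(stream) {
  for await (const chunk of stream) {
    await new Promise((resolve) => setTimeout(resolve, 1)); // simulated latency
  }
}

async function hashAndStore(body) {
  const [forHash, forStorage] = body.tee();
  // Start both reads at once. Unlike the sequential version, the slow branch
  // is draining its side of tee()'s buffer the whole time, instead of the
  // entire stream piling up while it waits its turn.
  const [hash] = await Promise.all([
    computeHash(forHash),
    writeToStorage(forStorage),
  ]);
  return hash;
}
```

<p>This doesn't eliminate the buffering, it only avoids the worst case; the fast branch can still race arbitrarily far ahead of the slow one.</p>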
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5cl4vqYfaHaVXiHjLSXv0a/03a0b9fe4c9c0594e181ffee43b63998/2.png" />
          </figure>
    <div>
      <h4>Transform backpressure gaps</h4>
      <a href="#transform-backpressure-gaps">
        
      </a>
    </div>
    <p><code>TransformStream</code> creates a <code>readable/writable</code> pair with processing logic in between. The <code>transform()</code> function executes on <i>write</i>, not on read. Processing happens eagerly as data arrives, regardless of whether any consumer is ready. This causes unnecessary work when consumers are slow, and the backpressure signaling between the two sides has gaps that can cause unbounded buffering under load. The spec expects the producer of the data being transformed to pay attention to the <code>writer.ready</code> signal on the writable side of the transform, but quite often producers simply ignore it.</p><p>If the transform's <code>transform()</code> operation is synchronous and always enqueues output immediately, it never signals backpressure back to the writable side, even when the downstream consumer is slow. This is a consequence of the spec design that many developers completely overlook. In browsers, where there's only a single user and typically only a small number of stream pipelines active at any given time, this foot gun is often of no consequence, but it has a major impact on server-side or edge performance in runtimes that serve thousands of concurrent requests.</p>
            <pre><code>const fastTransform = new TransformStream({
  transform(chunk, controller) {
    // Synchronously enqueue — this never applies backpressure
    // Even if the readable side's buffer is full, this succeeds
    controller.enqueue(processChunk(chunk));
  }
});

// Pipe a fast source through the transform to a slow sink
fastSource
  .pipeThrough(fastTransform)
  .pipeTo(slowSink);  // Buffer grows without bound</code></pre>
            <p>What TransformStreams are supposed to do is check for backpressure on the controller and use promises to communicate that back to the writer:</p>
            <pre><code>const fastTransform = new TransformStream({
  async transform(chunk, controller) {
    if (controller.desiredSize &lt;= 0) {
      // Wait on the backpressure to clear somehow
    }

    controller.enqueue(processChunk(chunk));
  }
});</code></pre>
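<p>Filling in that "somehow" takes improvisation. A crude sketch, with an arbitrary polling interval and a stand-in <code>processChunk</code>:</p>

```javascript
const processChunk = (chunk) => chunk; // stand-in for real per-chunk work

// The third constructor argument sets the readable side's high-water mark
// explicitly; TransformStream defaults it to 0, and with that default
// desiredSize can never become positive.
const fastTransform = new TransformStream({
  async transform(chunk, controller) {
    // The controller exposes no ready promise, so poll desiredSize until
    // the readable side's queue drains below its high-water mark.
    while (controller.desiredSize !== null && controller.desiredSize <= 0) {
      await new Promise((resolve) => setTimeout(resolve, 1));
    }
    controller.enqueue(processChunk(chunk));
  }
}, undefined, { highWaterMark: 1 });
```

<p>Polling works, but it trades latency for memory and is exactly the kind of workaround a well-designed API shouldn't require.</p>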
            <p>A difficulty here, however, is that the <code>TransformStreamDefaultController</code> does not have a ready promise mechanism like Writers do; so the <code>TransformStream</code> implementation would need to implement a polling mechanism to periodically check when <code>controller.desiredSize</code> becomes positive again.</p><p>The problem gets worse in pipelines. When you chain multiple transforms – say, parse, transform, then serialize – each <code>TransformStream</code> has its own internal readable and writable buffers. If implementers follow the spec strictly, data cascades through these buffers in a push-oriented fashion: the source pushes to transform A, which pushes to transform B, which pushes to transform C, each accumulating data in intermediate buffers before the final consumer has even started pulling. With three transforms, you can have six internal buffers filling up simultaneously.</p><p>Developers using the streams API are expected to remember to use options like <code>highWaterMark</code> when creating their sources, transforms, and writable destinations but often they either forget or simply choose to ignore it.</p>
            <pre><code>source
  .pipeThrough(parse)      // buffers filling...
  .pipeThrough(transform)  // more buffers filling...
  .pipeThrough(serialize)  // even more buffers...
  .pipeTo(destination);    // consumer hasn't started yet</code></pre>
            <p>Implementations have found ways to optimize transform pipelines by collapsing identity transforms, short-circuiting non-observable paths, deferring buffer allocation, or falling back to native code that does not run JavaScript at all. Deno, Bun, and Cloudflare Workers have all successfully implemented "native path" optimizations that can help eliminate much of the overhead, and Vercel's recent <a href="https://vercel.com/blog/we-ralph-wiggumed-webstreams-to-make-them-10x-faster"><u>fast-webstreams</u></a> research is working on similar optimizations for Node.js. But the optimizations themselves add significant complexity and still can't fully escape the inherently push-oriented model that TransformStream uses.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/64FcAUPYrTvOSYOPoT2FkR/cc91e0d32dd47320e8ac9d6f431a2fda/3.png" />
          </figure>
    <div>
      <h4>GC thrashing in server-side rendering</h4>
      <a href="#gc-thrashing-in-server-side-rendering">
        
      </a>
    </div>
    <p>Streaming server-side rendering (SSR) is a particularly painful case. A typical SSR stream might render thousands of small HTML fragments, each passing through the streams machinery:</p>
            <pre><code>const encoder = new TextEncoder();

// Each component enqueues a small chunk
function renderComponent(component, controller) {
  controller.enqueue(encoder.encode(`&lt;div&gt;${component.content}&lt;/div&gt;`));
}

// Hundreds of components = hundreds of enqueue calls
// Each one triggers promise machinery internally
for (const component of components) {
  renderComponent(component, controller);  // Promises created, objects allocated
}</code></pre>
            <p>Every fragment means promises created for <code>read()</code> calls, promises for backpressure coordination, intermediate buffer allocations, and <code>{ value, done } </code>result objects – most of which become garbage almost immediately.</p><p>Under load, this creates GC pressure that can devastate throughput. The JavaScript engine spends significant time collecting short-lived objects instead of doing useful work. Latency becomes unpredictable as GC pauses interrupt request handling. I've seen SSR workloads where garbage collection accounts for a substantial portion (up to and beyond 50%) of total CPU time per request. That's time that could be spent actually rendering content.</p><p>The irony is that streaming SSR is supposed to improve performance by sending content incrementally. But the overhead of the streams machinery can negate those gains, especially for pages with many small components. Developers sometimes find that buffering the entire response is actually faster than streaming through Web streams, defeating the purpose entirely.</p>
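<p>One mitigation that often helps is batching: accumulate small fragments and enqueue one combined chunk, so the per-chunk machinery runs once per batch instead of once per fragment. A sketch (<code>createBatcher</code> and its <code>flushBytes</code> threshold are illustrative, not a real API):</p>

```javascript
const encoder = new TextEncoder();

// Accumulates encoded fragments and enqueues them as one combined chunk,
// amortizing promise and result-object allocations across many fragments.
function createBatcher(controller, flushBytes = 8192) {
  let parts = [];
  let size = 0;
  return {
    push(text) {
      const bytes = encoder.encode(text);
      parts.push(bytes);
      size += bytes.byteLength;
      if (size >= flushBytes) this.flush();
    },
    flush() {
      if (parts.length === 0) return;
      const combined = new Uint8Array(size);
      let offset = 0;
      for (const part of parts) {
        combined.set(part, offset);
        offset += part.byteLength;
      }
      controller.enqueue(combined); // one enqueue for the whole batch
      parts = [];
      size = 0;
    },
  };
}
```

<p>Each component calls <code>batcher.push(...)</code>, and the stream sees one enqueue per batch of output rather than one per component. It's effective, but notice that it's the application reimplementing buffering the streams machinery was supposed to handle.</p>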
    <div>
      <h3>The optimization treadmill</h3>
      <a href="#the-optimization-treadmill">
        
      </a>
    </div>
    <p>To achieve usable performance, every major runtime has resorted to non-standard internal optimizations for Web streams. Node.js, Deno, Bun, and Cloudflare Workers have all developed their own workarounds. This is particularly true for streams wired up to system-level I/O, where much of the machinery is non-observable and can be short-circuited.</p><p>Finding these optimization opportunities can itself be a significant undertaking. It requires end-to-end understanding of the spec to identify which behaviors are observable and which can safely be elided. Even then, whether a given optimization is actually spec-compliant is often unclear. Implementers must make judgment calls about which semantics they can relax without breaking compatibility. This puts enormous pressure on runtime teams to become spec experts just to achieve acceptable performance.</p><p>These optimizations are difficult to implement, frequently error-prone, and lead to inconsistent behavior across runtimes. Bun's "<a href="https://bun.sh/docs/api/streams#direct-readablestream"><u>Direct Streams</u></a>" optimization takes a deliberately and observably non-standard approach, bypassing much of the spec's machinery entirely. Cloudflare Workers' <a href="https://developers.cloudflare.com/workers/runtime-apis/streams/transformstream/"><code><u>IdentityTransformStream</u></code></a> provides a fast-path for pass-through transforms but is Workers-specific and implements behaviors that are not standard for a <code>TransformStream</code>. Each runtime has its own set of tricks and the natural tendency is toward non-standard solutions, because that's often the only way to make things fast.</p><p>This fragmentation hurts portability. Code that performs well on one runtime may behave differently (or poorly) on another, even though it's using "standard" APIs. 
The complexity burden on runtime implementers is substantial, and the subtle behavioral differences create friction for developers trying to write cross-runtime code, particularly those maintaining frameworks that must be able to run efficiently across many runtime environments.</p><p>It is also necessary to emphasize that many optimizations are only possible in parts of the spec that are unobservable to user code. The alternative, like Bun "Direct Streams", is to intentionally diverge from the spec-defined observable behaviors. This means optimizations often feel "incomplete". They work in some scenarios but not in others, in some runtimes but not others, etc. Every such case adds to the overall unsustainable complexity of the Web streams approach which is why most runtime implementers rarely put significant effort into further improvements to their streams implementations once the conformance tests are passing.</p><p>Implementers shouldn't need to jump through these hoops. When you find yourself needing to relax or bypass spec semantics just to achieve reasonable performance, that's a sign something is wrong with the spec itself. A well-designed streaming API should be efficient by default, not require each runtime to invent its own escape hatches.</p>
    <div>
      <h3>The compliance burden</h3>
      <a href="#the-compliance-burden">
        
      </a>
    </div>
    <p>A complex spec creates complex edge cases. The <a href="https://github.com/web-platform-tests/wpt/tree/master/streams"><u>Web Platform Tests for streams</u></a> span over 70 test files, and while comprehensive testing is a good thing, what's telling is what needs to be tested.</p><p>Consider some of the more obscure tests that implementations must pass:</p><ul><li><p>Prototype pollution defense: One test patches <code>Object.prototype.then</code> to intercept promise resolutions, then verifies that <code>pipeTo()</code> and <code>tee()</code> operations don't leak internal values through the prototype chain. This tests a security property that only exists because the spec's promise-heavy internals create an attack surface.</p></li><li><p>WebAssembly memory rejection: BYOB reads must explicitly reject ArrayBuffers backed by WebAssembly memory, which look like regular buffers but can't be transferred. This edge case exists because of the spec's buffer detachment model – a simpler API wouldn't need to handle it.</p></li><li><p>Crash regression for state machine conflicts: A test specifically checks that calling <code>byobRequest.respond()</code> after <code>enqueue()</code> doesn't crash the runtime. This sequence creates a conflict in the internal state machine – the <code>enqueue()</code> fulfills the pending read and should invalidate the <code>byobRequest</code>, but implementations must gracefully handle the subsequent <code>respond()</code> rather than corrupting memory, covering the very real possibility that developers are not using the complex API correctly.</p></li></ul><p>These aren't contrived scenarios invented by test authors in a vacuum. They're consequences of the spec's design and reflect real-world bugs.</p><p>For runtime implementers, passing the WPT suite means handling intricate corner cases that most application code will never encounter. 
The tests encode not just the happy path but the full matrix of interactions between readers, writers, controllers, queues, strategies, and the promise machinery that connects them all.</p><p>A simpler API would mean fewer concepts, fewer interactions between concepts, and fewer edge cases to get right, resulting in more confidence that implementations actually behave consistently.</p>
    <div>
      <h3>The takeaway</h3>
      <a href="#the-takeaway">
        
      </a>
    </div>
    <p>Web streams are complex for users and implementers alike. The problems with the spec aren't bugs. They emerge from using the API exactly as designed. They aren't issues that can be fixed solely through incremental improvements. They're consequences of fundamental design choices. To improve things we need different foundations.</p>
    <div>
      <h2>A better streams API is possible</h2>
      <a href="#a-better-streams-api-is-possible">
        
      </a>
    </div>
    <p>After implementing the Web streams spec multiple times across different runtimes and seeing the pain points firsthand, I decided it was time to explore what a better, alternative streaming API could look like if designed from first principles today.</p><p>What follows is a proof of concept: it's not a finished standard, not a production-ready library, not even necessarily a concrete proposal for something new, but a starting point for discussion that demonstrates the problems with Web streams aren't inherent to streaming itself; they're consequences of specific design choices that could be made differently. Whether this exact API is the right answer is less important than whether it sparks a productive conversation about what we actually need from a streaming primitive.</p>
    <div>
      <h3>What is a stream?</h3>
      <a href="#what-is-a-stream">
        
      </a>
    </div>
    <p>Before diving into API design, it's worth asking: what is a stream?</p><p>At its core, a stream is just a sequence of data that arrives over time. You don't have all of it at once. You process it incrementally as it becomes available.</p><p>Unix pipes are perhaps the purest expression of this idea:</p>
            <pre><code>cat access.log | grep "error" | sort | uniq -c</code></pre>
            <p>
Data flows left to right. Each stage reads input, does its work, writes output. There's no pipe reader to acquire, no controller lock to manage. If a downstream stage is slow, upstream stages naturally slow down as well. Backpressure is implicit in the model, not a separate mechanism to learn (or ignore).</p><p>In JavaScript, the natural primitive for "a sequence of things that arrive over time" is already in the language: the async iterable. You consume it with <code>for await...of</code>. You stop consuming by stopping iteration.</p><p>This is the intuition the new API tries to preserve: streams should feel like iteration, because that's what they are. The complexity of Web streams – readers, writers, controllers, locks, queuing strategies – obscures this fundamental simplicity. A better API should make the simple case simple and only add complexity where it's genuinely needed.</p>
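<p>That intuition is expressible with nothing beyond language features. A sketch:</p>

```javascript
// A stream is just async iteration: an async generator produces values over
// time. The generator body only resumes when the consumer asks for the next
// value, so backpressure is implicit in the iteration protocol itself.
async function* source() {
  yield 'first';
  yield 'second';
  yield 'third';
}

async function collect(iterable) {
  const out = [];
  for await (const item of iterable) out.push(item); // stop looping, stop producing
  return out;
}
```

<p>No readers, no locks, no controllers: stopping the <code>for await</code> loop stops the generator, and that's the whole cancellation story for the simple case.</p>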
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3AUAA4bitbTOVSQg7Pd7fv/0856b44d78899dcffc4493f4146fb64f/4.png" />
          </figure>
    <div>
      <h3>Design principles</h3>
      <a href="#design-principles">
        
      </a>
    </div>
    <p>I built the proof-of-concept alternative around a different set of principles.</p>
    <div>
      <h4>Streams are iterables</h4>
      <a href="#streams-are-iterables">
        
      </a>
    </div>
    <p>No custom <code>ReadableStream</code> class with hidden internal state. A readable stream is just an <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Iteration_protocols#the_async_iterator_and_async_iterable_protocols"><code><u>AsyncIterable&lt;Uint8Array[]&gt;</u></code></a>. You consume it with <code>for await...of</code>. No readers to acquire, no locks to manage.</p>
    <div>
      <h4>Pull-through transforms</h4>
      <a href="#pull-through-transforms">
        
      </a>
    </div>
    <p>Transforms don't execute until the consumer pulls. There's no eager evaluation, no hidden buffering. Data flows on-demand from source, through transforms, to the consumer. If you stop iterating, processing stops.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4bEXBTEOHBMnCRKGA7odt5/cf51074cce3bb8b2ec1b5158c7560b68/5.png" />
          </figure>
    <div>
      <h4>Explicit backpressure</h4>
      <a href="#explicit-backpressure">
        
      </a>
    </div>
    <p>Backpressure is strict by default. When a buffer is full, writes reject rather than silently accumulating. You can configure alternative policies – block until space is available, drop oldest, drop newest – but you have to choose explicitly. No more silent memory growth.</p>
    <div>
      <h4>Batched chunks</h4>
      <a href="#batched-chunks">
        
      </a>
    </div>
    <p>Instead of yielding one chunk per iteration, streams yield <code>Uint8Array[]</code>: arrays of chunks. This amortizes the async overhead across multiple chunks, reducing promise creation and microtask latency in hot paths.</p>
    <div>
      <h4>Bytes only</h4>
      <a href="#bytes-only">
        
      </a>
    </div>
    <p>The API deals exclusively with bytes (<a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Uint8Array"><code><u>Uint8Array</u></code></a>). Strings are UTF-8 encoded automatically. There's no "value stream" vs "byte stream" dichotomy. If you want to stream arbitrary JavaScript values, use async iterables directly. While the API uses <code>Uint8Array</code>, it treats chunks as opaque. There is no partial consumption, no BYOB patterns, no byte-level operations within the streaming machinery itself. Chunks go in, chunks come out, unchanged unless a transform explicitly modifies them.</p>
    <div>
      <h4>Synchronous fast paths matter</h4>
      <a href="#synchronous-fast-paths-matter">
        
      </a>
    </div>
    <p>The API recognizes that synchronous data sources are both necessary and common. The application should not be forced to always accept the performance cost of asynchronous scheduling simply because that's the only option provided. At the same time, mixing sync and async processing can be dangerous. Synchronous paths should always be an option and should always be explicit.</p>
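<p>To illustrate the principle, here's a sketch of a caller that prefers the sync path when the writer accepts it and falls back to the awaited path otherwise. The writer shape follows the proof of concept's sync variants (<code>writeSync()</code> returning a boolean for accepted or rejected); the <code>makeWriter</code> helper is a stand-in, not part of any API:</p>

```javascript
// Stand-in writer: writeSync() accepts up to `capacity` chunks synchronously
// and returns false once full; write() is the general awaited path.
function makeWriter(capacity) {
  const accepted = [];
  return {
    accepted,
    writeSync(chunk) {
      if (accepted.length >= capacity) return false; // full: sync path rejected
      accepted.push(chunk);
      return true;
    },
    async write(chunk) {
      // The async path may wait for space; this stand-in just accepts.
      accepted.push(chunk);
    },
  };
}

// Explicitly try the sync fast path first; fall back to awaiting only when
// the writer declines, so the common case skips promise scheduling entirely.
async function send(writer, chunks) {
  for (const chunk of chunks) {
    if (!writer.writeSync(chunk)) {
      await writer.write(chunk);
    }
  }
}
```

<p>The key design point is that the fast path is opt-in and visible at the call site, rather than a hidden optimization the runtime may or may not apply.</p>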
    <div>
      <h3>The new API in action</h3>
      <a href="#the-new-api-in-action">
        
      </a>
    </div>
    
    <div>
      <h4>Creating and consuming streams</h4>
      <a href="#creating-and-consuming-streams">
        
      </a>
    </div>
    <p>In Web streams, creating a simple producer/consumer pair requires <code>TransformStream</code>, manual encoding, and careful lock management:</p>
            <pre><code>const { readable, writable } = new TransformStream();
const enc = new TextEncoder();
const writer = writable.getWriter();
await writer.write(enc.encode("Hello, World!"));
await writer.close();
writer.releaseLock();

const dec = new TextDecoder();
let text = '';
for await (const chunk of readable) {
  text += dec.decode(chunk, { stream: true });
}
text += dec.decode();</code></pre>
            <p>Even this relatively clean version requires: a <code>TransformStream</code>, manual <code>TextEncoder</code> and <code>TextDecoder</code>, and explicit lock release.</p><p>Here's the equivalent with the new API:</p>
            <pre><code>import { Stream } from 'new-streams';

// Create a push stream
const { writer, readable } = Stream.push();

// Write data — backpressure is enforced
await writer.write("Hello, World!");
await writer.end();

// Consume as text
const text = await Stream.text(readable);</code></pre>
            <p>The readable is just an async iterable. You can pass it to any function that expects one, including <code>Stream.text()</code> which collects and decodes the entire stream.</p><p>The writer has a simple interface: <code>write()</code>, <code>writev()</code> for batched writes, <code>end()</code> to signal completion, and <code>abort()</code> for errors. That's essentially it.</p><p>The Writer is not a concrete class. Any object that implements <code>write()</code>, <code>end()</code>, and <code>abort()</code> can be a writer, making it easy to adapt existing APIs or create specialized implementations without subclassing. There's no complex <code>UnderlyingSink</code> protocol with <code>start()</code>, <code>write()</code>, <code>close()</code>, and <code>abort()</code> callbacks that must coordinate through a controller whose lifecycle and state are independent of the <code>WritableStream</code> it is bound to.</p><p>Here's a simple in-memory writer that collects all written data:</p>
            <pre><code>// A minimal writer implementation — just an object with methods
function createBufferWriter() {
  const chunks = [];
  let totalBytes = 0;
  let closed = false;

  const addChunk = (chunk) =&gt; {
    chunks.push(chunk);
    totalBytes += chunk.byteLength;
  };

  return {
    get desiredSize() { return closed ? null : 1; },

    // Async variants
    write(chunk) { addChunk(chunk); },
    writev(batch) { for (const c of batch) addChunk(c); },
    end() { closed = true; return totalBytes; },
    abort(reason) { closed = true; chunks.length = 0; },

    // Sync variants return boolean (true = accepted)
    writeSync(chunk) { addChunk(chunk); return true; },
    writevSync(batch) { for (const c of batch) addChunk(c); return true; },
    endSync() { closed = true; return totalBytes; },
    abortSync(reason) { closed = true; chunks.length = 0; return true; },

    getChunks() { return chunks; }
  };
}

// Use it
const writer = createBufferWriter();
await Stream.pipeTo(source, writer);
const allData = writer.getChunks();</code></pre>
            <p>No base class to extend, no abstract methods to implement, no controller to coordinate with. Just an object with the right shape.</p>
    <div>
      <h4>Pull-through transforms</h4>
      <a href="#pull-through-transforms">
        
      </a>
    </div>
    <p>Under the new API design, transforms should not perform any work until the data is being consumed. This is a fundamental principle.</p>
            <pre><code>// Nothing executes until iteration begins
const output = Stream.pull(source, compress, encrypt);

// Transforms execute as we iterate
for await (const chunks of output) {
  for (const chunk of chunks) {
    process(chunk);
  }
}</code></pre>
            <p><code>Stream.pull()</code> creates a lazy pipeline. The <code>compress</code> and <code>encrypt</code> transforms don't run until you start iterating output. Each iteration pulls data through the pipeline on demand.</p><p>This is fundamentally different from Web streams' <a href="https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream/pipeThrough"><code><u>pipeThrough()</u></code></a>, which starts actively pumping data from the source to the transform as soon as you set up the pipe. Pull semantics mean you control when processing happens, and stopping iteration stops processing.</p><p>Transforms can be stateless or stateful. A stateless transform is just a function that takes chunks and returns transformed chunks:</p>
            <pre><code>// Stateless transform — a pure function
// Receives chunks or null (flush signal)
const decoder = new TextDecoder();
const encoder = new TextEncoder();
const toUpperCase = (chunks) =&gt; {
  if (chunks === null) return null; // End of stream
  return chunks.map(chunk =&gt;
    encoder.encode(decoder.decode(chunk).toUpperCase())
  );
};

// Use it directly
const output = Stream.pull(source, toUpperCase);</code></pre>
            <p>Stateful transforms are simple objects with member functions that maintain state across calls:</p>
            <pre><code>// Stateful transform — a generator that wraps the source
function createLineParser() {
  // Helper to concatenate Uint8Arrays
  const concat = (...arrays) =&gt; {
    const result = new Uint8Array(arrays.reduce((n, a) =&gt; n + a.length, 0));
    let offset = 0;
    for (const arr of arrays) { result.set(arr, offset); offset += arr.length; }
    return result;
  };

  return {
    async *transform(source) {
      let pending = new Uint8Array(0);
      
      for await (const chunks of source) {
        if (chunks === null) {
          // Flush: yield any remaining data
          if (pending.length &gt; 0) yield [pending];
          continue;
        }
        
        // Concatenate pending data with new chunks
        const combined = concat(pending, ...chunks);
        const lines = [];
        let start = 0;

        for (let i = 0; i &lt; combined.length; i++) {
          if (combined[i] === 0x0a) { // newline
            lines.push(combined.slice(start, i));
            start = i + 1;
          }
        }

        pending = combined.slice(start);
        if (lines.length &gt; 0) yield lines;
      }
    }
  };
}

const output = Stream.pull(source, createLineParser());</code></pre>
            <p>For transforms that need cleanup on abort, add an abort handler:</p>
            <pre><code>// Stateful transform with resource cleanup
function createGzipCompressor() {
  // Hypothetical compression API...
  const deflate = new Deflater({ gzip: true });

  return {
    async *transform(source) {
      for await (const chunks of source) {
        if (chunks === null) {
          // Flush: finalize compression
          deflate.push(new Uint8Array(0), true);
          if (deflate.result) yield [deflate.result];
        } else {
          for (const chunk of chunks) {
            deflate.push(chunk, false);
            if (deflate.result) yield [deflate.result];
          }
        }
      }
    },
    abort(reason) {
      // Clean up compressor resources on error/cancellation
    }
  };
}</code></pre>
            <p>For implementers, there's no Transformer protocol with <code>start()</code>, <code>transform()</code>, <code>flush()</code> methods and controller coordination passed into a <code>TransformStream</code> class that has its own hidden state machine and buffering mechanisms. Transforms are just functions or simple objects: far simpler to implement and test.</p>
    <div>
      <h4>Explicit backpressure policies</h4>
      <a href="#explicit-backpressure-policies">
        
      </a>
    </div>
    <p>When a bounded buffer fills up and a producer wants to write more, there are only a few things you can do:</p><ol><li><p>Reject the write: refuse to accept more data</p></li><li><p>Wait: block until space becomes available</p></li><li><p>Discard old data: evict what's already buffered to make room</p></li><li><p>Discard new data: drop what's incoming</p></li></ol><p>That's it. Any other response is either a variation of these (like "resize the buffer," which is really just deferring the choice) or domain-specific logic that doesn't belong in a general streaming primitive. Web streams currently always choose Wait by default.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/68339c8QsvNmb7JcZ2lSDO/e52a86a9b8f52b52eb9328d5ee58f23a/6.png" />
          </figure><p>The new API makes you choose one of these four explicitly:</p><ul><li><p><code>strict</code> (default): Rejects writes when the buffer is full and too many writes are pending. Catches "fire-and-forget" patterns where producers ignore backpressure.</p></li><li><p><code>block</code>: Writes wait until buffer space is available. Use when you trust the producer to await writes properly.</p></li><li><p><code>drop-oldest</code>: Drops the oldest buffered data to make room. Useful for live feeds where stale data loses value.</p></li><li><p><code>drop-newest</code>: Discards incoming data when full. Useful when you want to process what you have without being overwhelmed.</p></li></ul>
            <pre><code>const { writer, readable } = Stream.push({
  highWaterMark: 10,
  backpressure: 'strict' // or 'block', 'drop-oldest', 'drop-newest'
});</code></pre>
            <p>No more hoping producers cooperate. The policy you choose determines what happens when the buffer fills.</p><p>Here's how each policy behaves when a producer writes faster than the consumer reads:</p>
            <pre><code>// strict: Catches fire-and-forget writes that ignore backpressure
const strict = Stream.push({ highWaterMark: 2, backpressure: 'strict' });
strict.writer.write(chunk1);  // ok (not awaited)
strict.writer.write(chunk2);  // ok (fills slots buffer)
strict.writer.write(chunk3);  // ok (queued in pending)
strict.writer.write(chunk4);  // ok (pending buffer fills)
strict.writer.write(chunk5);  // throws! too many pending writes

// block: Wait for space (unbounded pending queue)
const blocking = Stream.push({ highWaterMark: 2, backpressure: 'block' });
await blocking.writer.write(chunk1);  // ok
await blocking.writer.write(chunk2);  // ok
await blocking.writer.write(chunk3);  // waits until consumer reads
await blocking.writer.write(chunk4);  // waits until consumer reads
await blocking.writer.write(chunk5);  // waits until consumer reads

// drop-oldest: Discard old data to make room
const dropOld = Stream.push({ highWaterMark: 2, backpressure: 'drop-oldest' });
await dropOld.writer.write(chunk1);  // ok
await dropOld.writer.write(chunk2);  // ok
await dropOld.writer.write(chunk3);  // ok, chunk1 discarded

// drop-newest: Discard incoming data when full
const dropNew = Stream.push({ highWaterMark: 2, backpressure: 'drop-newest' });
await dropNew.writer.write(chunk1);  // ok
await dropNew.writer.write(chunk2);  // ok
await dropNew.writer.write(chunk3);  // silently dropped</code></pre>
            
    <div>
      <h4>Explicit multi-consumer patterns</h4>
      <a href="#explicit-multi-consumer-patterns">
        
      </a>
    </div>
    
            <pre><code>// Share with explicit buffer management
const shared = Stream.share(source, {
  highWaterMark: 100,
  backpressure: 'strict'
});

const consumer1 = shared.pull();
const consumer2 = shared.pull(decompress);</code></pre>
            <p>Instead of <code>tee()</code> with its hidden unbounded buffer, you get explicit multi-consumer primitives. <code>Stream.share()</code> is pull-based: consumers pull from a shared source, and you configure the buffer limits and backpressure policy upfront.</p><p>There's also <code>Stream.broadcast()</code> for push-based multi-consumer scenarios. Both require you to think about what happens when consumers run at different speeds, because that's a real concern that shouldn't be hidden.</p>
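To see why consumer speed matters, here's a toy standalone broadcast (this is not the reference library's API, just an illustration): each subscriber gets its own bounded queue, and a full queue evicts its oldest entry, the drop-oldest policy from earlier.

```javascript
// Toy push-based broadcast, for illustration only: each consumer has its
// own bounded queue, and a full queue drops its oldest entry (drop-oldest).
function makeBroadcast(highWaterMark) {
  const queues = [];
  return {
    write(chunk) {
      for (const q of queues) {
        if (q.length >= highWaterMark) q.shift(); // evict oldest for a slow consumer
        q.push(chunk);
      }
    },
    subscribe() {
      const q = [];
      queues.push(q);
      return q; // the consumer drains this queue at its own pace
    },
  };
}

const b = makeBroadcast(2);
const fast = b.subscribe();
const slow = b.subscribe();
for (const chunk of [1, 2, 3]) {
  b.write(chunk);
  fast.length = 0; // the fast consumer drains immediately
}
console.log(slow); // [2, 3]: the slow consumer lost chunk 1, by explicit policy
```

The point is that the data loss is a stated policy with a stated bound, rather than the silent unbounded buffering that `tee()` gives you.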
    <div>
      <h4>Sync/async separation</h4>
      <a href="#sync-async-separation">
        
      </a>
    </div>
    <p>Not all streaming workloads involve I/O. When your source is in-memory and your transforms are pure functions, async machinery adds overhead without benefit: you're paying to coordinate "waiting" that never happens.</p><p>The new API has a complete set of parallel sync versions: <code>Stream.pullSync()</code>, <code>Stream.bytesSync()</code>, <code>Stream.textSync()</code>, and so on. If your source and transforms are all synchronous, you can process the entire pipeline without a single promise.</p>
            <pre><code>// Async — when source or transforms may be asynchronous
const textAsync = await Stream.text(source);

// Sync — when all components are synchronous
const textSync = Stream.textSync(source);</code></pre>
            <p>Here's a complete synchronous pipeline – compression, transformation, and consumption with zero async overhead:</p>
            <pre><code>// Synchronous source from in-memory data
const source = Stream.fromSync([inputBuffer]);

// Synchronous transforms
const compressed = Stream.pullSync(source, zlibCompressSync);
const encrypted = Stream.pullSync(compressed, aesEncryptSync);

// Synchronous consumption — no promises, no event loop trips
const result = Stream.bytesSync(encrypted);</code></pre>
            <p>The entire pipeline executes in a single call stack. No promises are created, no microtask queue scheduling occurs, and no GC pressure from short-lived async machinery. For CPU-bound workloads like parsing, compression, or transformation of in-memory data, this can be significantly faster than the equivalent Web streams code – which would force async boundaries even when every component is synchronous.</p><p>Web streams has no synchronous path. Even if your source has data ready and your transform is a pure function, you still pay for promise creation and microtask scheduling on every operation. Promises are fantastic for cases in which waiting is actually necessary, but they aren't always necessary. The new API lets you stay in sync-land when that's what you need.</p>
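The "single call stack" claim is easy to picture with plain generators. This sketch uses ordinary generator functions rather than the proposed API, but the shape is the same: no promise is created anywhere in the chain.

```javascript
// Plain-generator sketch of a fully synchronous pipeline. Nothing here
// allocates a promise or touches the microtask queue.
function* fromSync(items) {
  yield* items;
}

function* mapSync(source, fn) {
  for (const item of source) yield fn(item);
}

// Compose lazily; work happens only when the consumer iterates.
const pipeline = mapSync(mapSync(fromSync([1, 2, 3]), (x) => x * 2), (x) => x + 1);
const result = [...pipeline]; // [3, 5, 7]
```

An async version of the same pipeline would allocate a promise and schedule a microtask for every single value, which is exactly the overhead the sync path avoids.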
    <div>
      <h4>Bridging the gap between this and Web streams</h4>
      <a href="#bridging-the-gap-between-this-and-web-streams">
        
      </a>
    </div>
    <p>The async iterator based approach provides a natural bridge between this alternative and Web streams. Going from a ReadableStream to the new approach, simply passing the readable in as input works as expected, provided the ReadableStream is set up to yield bytes:</p>
            <pre><code>const readable = getWebReadableStreamSomehow();
const input = Stream.pull(readable, transform1, transform2);
for await (const chunks of input) {
  // process chunks
}</code></pre>
            <p>Going the other way, to a ReadableStream, requires a bit more work since the alternative approach yields batches of chunks, but the adaptation layer is just as straightforward:</p>
            <pre><code>async function* adapt(input) {
  for await (const chunks of input) {
    for (const chunk of chunks) {
      yield chunk;
    }
  }
}

const input = Stream.pull(source, transform1, transform2);
const readable = ReadableStream.from(adapt(input));</code></pre>
            
    <div>
      <h4>How this addresses the real-world failures from earlier</h4>
      <a href="#how-this-addresses-the-real-world-failures-from-earlier">
        
      </a>
    </div>
    <ul><li><p>Unconsumed bodies: Pull semantics mean nothing happens until you iterate. No hidden resource retention. If you don't consume a stream, there's no background machinery holding connections open.</p></li><li><p>The <code>tee()</code> memory cliff: <code>Stream.share()</code> requires explicit buffer configuration. You choose the <code>highWaterMark</code> and backpressure policy upfront: no more silent unbounded growth when consumers run at different speeds.</p></li><li><p>Transform backpressure gaps: Pull-through transforms execute on-demand. Data doesn't cascade through intermediate buffers; it flows only when the consumer pulls. Stop iterating, stop processing.</p></li><li><p>GC thrashing in SSR: Batched chunks (<code>Uint8Array[]</code>) amortize async overhead. Sync pipelines via <code>Stream.pullSync()</code> eliminate promise allocation entirely for CPU-bound workloads.</p></li></ul>
    <div>
      <h3>Performance</h3>
      <a href="#performance">
        
      </a>
    </div>
    <p>The design choices have performance implications. Here are benchmarks from the reference implementation of this possible alternative compared to Web streams (Node.js v24.x, Apple M1 Pro, averaged over 10 runs):</p><table><tr><td><p><b>Scenario</b></p></td><td><p><b>Alternative</b></p></td><td><p><b>Web streams</b></p></td><td><p><b>Difference</b></p></td></tr><tr><td><p>Small chunks (1KB × 5000)</p></td><td><p>~13 GB/s</p></td><td><p>~4 GB/s</p></td><td><p>~3× faster</p></td></tr><tr><td><p>Tiny chunks (100B × 10000)</p></td><td><p>~4 GB/s</p></td><td><p>~450 MB/s</p></td><td><p>~8× faster</p></td></tr><tr><td><p>Async iteration (8KB × 1000)</p></td><td><p>~530 GB/s</p></td><td><p>~35 GB/s</p></td><td><p>~15× faster</p></td></tr><tr><td><p>Chained 3× transforms (8KB × 500)</p></td><td><p>~275 GB/s</p></td><td><p>~3 GB/s</p></td><td><p><b>~80–90× faster</b></p></td></tr><tr><td><p>High-frequency (64B × 20000)</p></td><td><p>~7.5 GB/s</p></td><td><p>~280 MB/s</p></td><td><p>~25× faster</p></td></tr></table><p>The chained transform result is particularly striking: pull-through semantics eliminate the intermediate buffering that plagues Web streams pipelines. Instead of each <code>TransformStream</code> eagerly filling its internal buffers, data flows on-demand from consumer to source.</p><p>Now, to be fair, Node.js really has not yet put significant effort into fully optimizing the performance of its Web streams implementation. There's likely significant room for improvement in Node.js' performance results through a bit of applied effort to optimize the hot paths there. 
That said, running these benchmarks in Deno and Bun also shows the iterator-based approach significantly outperforming each runtime's Web streams implementation.</p><p>Browser benchmarks (Chrome/Blink, averaged over 3 runs) show consistent gains as well:</p><table><tr><td><p><b>Scenario</b></p></td><td><p><b>Alternative</b></p></td><td><p><b>Web streams</b></p></td><td><p><b>Difference</b></p></td></tr><tr><td><p>Push 3KB chunks</p></td><td><p>~135k ops/s</p></td><td><p>~24k ops/s</p></td><td><p>~5–6× faster</p></td></tr><tr><td><p>Push 100KB chunks</p></td><td><p>~24k ops/s</p></td><td><p>~3k ops/s</p></td><td><p>~7–8× faster</p></td></tr><tr><td><p>3 transform chain</p></td><td><p>~4.6k ops/s</p></td><td><p>~880 ops/s</p></td><td><p>~5× faster</p></td></tr><tr><td><p>5 transform chain</p></td><td><p>~2.4k ops/s</p></td><td><p>~550 ops/s</p></td><td><p>~4× faster</p></td></tr><tr><td><p>bytes() consumption</p></td><td><p>~73k ops/s</p></td><td><p>~11k ops/s</p></td><td><p>~6–7× faster</p></td></tr><tr><td><p>Async iteration</p></td><td><p>~1.1M ops/s</p></td><td><p>~10k ops/s</p></td><td><p><b>~40–100× faster</b></p></td></tr></table><p>These benchmarks measure throughput in controlled scenarios; real-world performance depends on your specific use case. The difference between Node.js and browser gains reflects the distinct optimization paths each environment takes for Web streams.</p><p>It's worth noting that these benchmarks compare a pure TypeScript/JavaScript implementation of the new API against the native (JavaScript/C++/Rust) implementations of Web streams in each runtime. The new API's reference implementation has had no performance optimization work; the gains come entirely from the design. 
A native implementation would likely show further improvement.</p><p>The gains illustrate how fundamental design choices compound: batching amortizes async overhead, pull semantics eliminate intermediate buffering, and the freedom for implementations to use synchronous fast paths when data is available immediately all contribute.</p><blockquote><p>"We’ve done a lot to improve performance and consistency in Node streams, but there’s something uniquely powerful about starting from scratch. New streams’ approach embraces modern runtime realities without legacy baggage, and that opens the door to a simpler, performant and more coherent streams model." 
- Robert Nagy, Node.js TSC member and Node.js streams contributor</p></blockquote>
    <div>
      <h2>What's next</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>I'm publishing this to start a conversation. What did I get right? What did I miss? Are there use cases that don't fit this model? What would a migration path for this approach look like? The goal is to gather feedback from developers who've felt the pain of Web streams and have opinions about what a better API should look like.</p>
    <div>
      <h3>Try it yourself</h3>
      <a href="#try-it-yourself">
        
      </a>
    </div>
    <p>A reference implementation for this alternative approach is available now and can be found at <a href="https://github.com/jasnell/new-streams"><u>https://github.com/jasnell/new-streams</u></a>.</p><ul><li><p>API Reference: See the <a href="https://github.com/jasnell/new-streams/blob/main/API.md"><u>API.md</u></a> for complete documentation</p></li><li><p>Examples: The <a href="https://github.com/jasnell/new-streams/tree/main/samples"><u>samples directory</u></a> has working code for common patterns</p></li></ul><p>I welcome issues, discussions, and pull requests. If you've run into Web streams problems I haven't covered, or if you see gaps in this approach, let me know. But again, the idea here is not to say "Let's all use this shiny new object!"; it is to kick off a discussion that looks beyond the current status quo of Web streams and returns to first principles.</p><p>Web streams was an ambitious project that brought streaming to the web platform when nothing else existed. The people who designed it made reasonable choices given the constraints of 2014 – before async iteration, before years of production experience revealed the edge cases.</p><p>But we've learned a lot since then. JavaScript has evolved. A streaming API designed today can be simpler, more aligned with the language, and more explicit about the things that matter, like backpressure and multi-consumer behavior.</p><p>We deserve a better stream API. So let's talk about what that could look like.</p>
            <category><![CDATA[Standards]]></category>
            <category><![CDATA[JavaScript]]></category>
            <category><![CDATA[TypeScript]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Node.js]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[API]]></category>
            <guid isPermaLink="false">37h1uszA2vuOfmXb3oAnZr</guid>
            <dc:creator>James M Snell</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we rebuilt Next.js with AI in one week]]></title>
            <link>https://blog.cloudflare.com/vinext/</link>
            <pubDate>Tue, 24 Feb 2026 20:00:00 GMT</pubDate>
            <description><![CDATA[ One engineer used AI to rebuild Next.js on Vite in a week. vinext builds up to 4x faster, produces 57% smaller bundles, and deploys to Cloudflare Workers with a single command. ]]></description>
            <content:encoded><![CDATA[ <p><sub><i>*This post was updated at 12:35 pm PT to fix a typo in the build time benchmarks.</i></sub></p><p>Last week, one engineer and an AI model rebuilt the most popular front-end framework from scratch. The result, <a href="https://github.com/cloudflare/vinext"><u>vinext</u></a> (pronounced "vee-next"), is a drop-in replacement for Next.js, built on <a href="https://vite.dev/"><u>Vite</u></a>, that deploys to Cloudflare Workers with a single command. In early benchmarks, it builds production apps up to 4x faster and produces client bundles up to 57% smaller. And we already have customers running it in production. </p><p>The whole thing cost about $1,100 in tokens.</p>
    <div>
      <h2>The Next.js deployment problem</h2>
      <a href="#the-next-js-deployment-problem">
        
      </a>
    </div>
    <p><a href="https://nextjs.org/"><u>Next.js</u></a> is the most popular React framework. Millions of developers use it. It powers a huge chunk of the production web, and for good reason. The developer experience is top-notch.</p><p>But Next.js has a deployment problem when used in the broader serverless ecosystem. The tooling is entirely bespoke: Next.js has invested heavily in Turbopack, but if you want to deploy it to Cloudflare, Netlify, or AWS Lambda, you have to take that build output and reshape it into something the target platform can actually run.</p><p>If you’re thinking: “Isn’t that what OpenNext does?”, you are correct. </p><p>That is indeed the problem <a href="https://opennext.js.org/"><u>OpenNext</u></a> was built to solve. And a lot of engineering effort has gone into OpenNext from multiple providers, including us at Cloudflare. It works, but quickly runs into limitations and becomes a game of whack-a-mole. </p><p>Building on top of Next.js output as a foundation has proven difficult and fragile. Because OpenNext has to reverse-engineer Next.js's build output, each new Next.js version can break it in unpredictable ways that take a lot of work to correct. </p><p>Next.js has been working on a first-class adapters API, and we've been collaborating with them on it. It's still an early effort, but even with adapters, you're still building on the bespoke Turbopack toolchain. And adapters only cover build and deploy. During development, <code>next dev</code> runs exclusively in Node.js with no way to plug in a different runtime. If your application uses platform-specific APIs like Durable Objects, KV, or AI bindings, you can't test that code in dev without workarounds.</p>
    <div>
      <h2>Introducing vinext </h2>
      <a href="#introducing-vinext">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7BCYnb6nCnc9oRBPQnuES5/d217b3582f4fe30597a3b4bf000d9bd7/BLOG-3194_2.png" />
          </figure><p>What if instead of adapting Next.js output, we reimplemented the Next.js API surface on <a href="https://vite.dev/"><u>Vite</u></a> directly? Vite is the build tool used by most of the front-end ecosystem outside of Next.js, powering frameworks like Astro, SvelteKit, Nuxt, and Remix. A clean reimplementation, not merely a wrapper or adapter. We honestly didn't think it would work. But it’s 2026, and the cost of building software has completely changed.</p><p>We got a lot further than we expected.</p>
            <pre><code>npm install vinext</code></pre>
            <p>Replace <code>next</code> with <code>vinext</code> in your scripts and everything else stays the same. Your existing <code>app/</code>, <code>pages/</code>, and <code>next.config.js</code> work as-is.</p>
            <pre><code>vinext dev          # Development server with HMR
vinext build        # Production build
vinext deploy       # Build and deploy to Cloudflare Workers</code></pre>
            <p>This is not a wrapper around Next.js and Turbopack output. It's an alternative implementation of the API surface: routing, server rendering, React Server Components, server actions, caching, middleware. All of it built on top of Vite as a plugin. Most importantly Vite output runs on any platform thanks to the <a href="https://vite.dev/guide/api-environment"><u>Vite Environment API</u></a>.</p>
    <div>
      <h2>The numbers</h2>
      <a href="#the-numbers">
        
      </a>
    </div>
    <p>Early benchmarks are promising. We compared vinext against Next.js 16 using a shared 33-route App Router application.

Both frameworks are doing the same work: compiling, bundling, and preparing server-rendered routes. We disabled TypeScript type checking and ESLint in Next.js's build (Vite doesn't run these during builds), and used force-dynamic so Next.js doesn't spend extra time pre-rendering static routes, which would unfairly slow down its numbers. The goal was to measure only bundler and compilation speed, nothing else. Benchmarks run on GitHub CI on every merge to main. </p><p><b>Production build time:</b></p>
<div><table><colgroup>
<col></col>
<col></col>
<col></col>
</colgroup>
<thead>
  <tr>
    <th><span>Framework</span></th>
    <th><span>Mean</span></th>
    <th><span>vs Next.js</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Next.js 16.1.6 (Turbopack)</span></td>
    <td><span>7.38s</span></td>
    <td><span>baseline</span></td>
  </tr>
  <tr>
    <td><span>vinext (Vite 7 / Rollup)</span></td>
    <td>4.64s</td>
    <td>1.6x faster</td>
  </tr>
  <tr>
    <td><span>vinext (Vite 8 / Rolldown)</span></td>
    <td>1.67s</td>
    <td>4.4x faster</td>
  </tr>
</tbody></table></div><p><b>Client bundle size (gzipped):</b></p>
<div><table><colgroup>
<col></col>
<col></col>
<col></col>
</colgroup>
<thead>
  <tr>
    <th><span>Framework</span></th>
    <th><span>Gzipped</span></th>
    <th><span>vs Next.js</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Next.js 16.1.6</span></td>
    <td><span>168.9 KB</span></td>
    <td><span>baseline</span></td>
  </tr>
  <tr>
    <td><span>vinext (Rollup)</span></td>
    <td><span>74.0 KB</span></td>
    <td><span>56% smaller</span></td>
  </tr>
  <tr>
    <td><span>vinext (Rolldown)</span></td>
    <td><span>72.9 KB</span></td>
    <td><span>57% smaller</span></td>
  </tr>
</tbody></table></div><p>These benchmarks measure compilation and bundling speed, not production serving performance. The test fixture is a single 33-route app, not a representative sample of all production applications. We expect these numbers to evolve as all three projects continue to develop. The <a href="https://benchmarks.vinext.workers.dev"><u>full methodology and historical results</u></a> are public. Take them as directional, not definitive.</p><p>The direction is encouraging, though. Vite's architecture, and especially <a href="https://rolldown.rs/"><u>Rolldown</u></a> (the Rust-based bundler coming in Vite 8), has structural advantages for build performance that show up clearly here.</p>
    <div>
      <h2>Deploying to Cloudflare Workers</h2>
      <a href="#deploying-to-cloudflare-workers">
        
      </a>
    </div>
    <p>vinext is built with Cloudflare Workers as the first deployment target. A single command takes you from source code to a running Worker:</p>
            <pre><code>vinext deploy</code></pre>
            <p>This handles everything: builds the application, auto-generates the Worker configuration, and deploys. Both the App Router and Pages Router work on Workers, with full client-side hydration, interactive components, client-side navigation, and React state.</p><p>For production caching, vinext includes a Cloudflare KV cache handler that gives you ISR (Incremental Static Regeneration) out of the box:</p>
            <pre><code>import { KVCacheHandler } from "vinext/cloudflare";
import { setCacheHandler } from "next/cache";

setCacheHandler(new KVCacheHandler(env.MY_KV_NAMESPACE));</code></pre>
            <p><a href="https://developers.cloudflare.com/kv/"><u>KV</u></a> is a good default for most applications, but the caching layer is designed to be pluggable. That <code>setCacheHandler</code> call means you can swap in whatever backend makes sense. <a href="https://developers.cloudflare.com/r2/"><u>R2</u></a> might be a better fit for apps with large cached payloads or different access patterns. We're also working on improvements to our Cache API that should provide a strong caching layer with less configuration. The goal is flexibility: pick the caching strategy that fits your app.</p><p>Live examples running right now:</p><ul><li><p><a href="https://app-router-playground.vinext.workers.dev"><u>App Router Playground</u></a></p></li><li><p><a href="https://hackernews.vinext.workers.dev"><u>Hacker News clone</u></a></p></li><li><p><a href="https://app-router-cloudflare.vinext.workers.dev"><u>App Router minimal</u></a></p></li><li><p><a href="https://pages-router-cloudflare.vinext.workers.dev"><u>Pages Router minimal</u></a></p></li></ul><p>We also have <a href="https://next-agents.threepointone.workers.dev/"><u>a live example</u></a> of Cloudflare Agents running in a Next.js app, without the need for workarounds like <a href="https://developers.cloudflare.com/workers/wrangler/api/#getplatformproxy"><u>getPlatformProxy</u></a>, since the entire app now runs in workerd, during both dev and deploy phases. This means being able to use Durable Objects, AI bindings, and every other Cloudflare-specific service without compromise. <a href="https://github.com/cloudflare/vinext-agents-example"><u>Have a look here.</u></a></p>
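As a sketch of what a swapped-in backend could look like, an R2-backed handler might store serialized cache entries as bucket objects. The `get`/`set` handler shape and the `R2CacheHandler` name here are assumptions, not vinext's documented contract; check the vinext docs for the real interface.

```typescript
// Hypothetical sketch only: the get/set handler shape is an assumption,
// not vinext's documented cache-handler contract.
interface Bucket {
  get(key: string): Promise<{ text(): Promise<string> } | null>;
  put(key: string, value: string): Promise<unknown>;
}

class R2CacheHandler {
  private bucket: Bucket;

  constructor(bucket: Bucket) {
    this.bucket = bucket;
  }

  // Read a cached entry, returning null on a miss.
  async get(key: string): Promise<unknown | null> {
    const obj = await this.bucket.get(key);
    return obj === null ? null : JSON.parse(await obj.text());
  }

  // Serialize and store an entry.
  async set(key: string, value: unknown): Promise<void> {
    await this.bucket.put(key, JSON.stringify(value));
  }
}
```

The `Bucket` interface mirrors only the subset of the R2 binding this sketch needs; in a Worker you would pass the R2 binding (e.g. a hypothetical `env.MY_BUCKET`) directly.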
    <div>
      <h2>Frameworks are a team sport</h2>
      <a href="#frameworks-are-a-team-sport">
        
      </a>
    </div>
    <p>The current deployment target is Cloudflare Workers, but that's a small part of the picture. Something like 95% of vinext is pure Vite. The routing, the module shims, the SSR pipeline, the RSC integration: none of it is Cloudflare-specific.</p><p>Cloudflare is looking to work with other hosting providers on adopting this toolchain for their customers (the lift is minimal — we got a proof-of-concept working on <a href="https://vinext-on-vercel.vercel.app/"><u>Vercel</u></a> in less than 30 minutes!). This is an open-source project, and for its long-term success, we believe it’s important we work with partners across the ecosystem to ensure ongoing investment. PRs from other platforms are welcome. If you're interested in adding a deployment target, <a href="https://github.com/cloudflare/vinext/issues"><u>open an issue</u></a> or reach out.</p>
    <div>
      <h2>Status: Experimental</h2>
      <a href="#status-experimental">
        
      </a>
    </div>
    <p>We want to be clear: vinext is experimental. It's not even one week old, and it has not yet been battle-tested with any meaningful traffic at scale. If you're evaluating it for a production application, proceed with appropriate caution.</p><p>That said, the test suite is extensive: over 1,700 Vitest tests and 380 Playwright E2E tests, including tests ported directly from the Next.js test suite and OpenNext's Cloudflare conformance suite. We’ve verified it against the Next.js App Router Playground. Coverage sits at 94% of the Next.js 16 API surface.

Early results from real-world customers are encouraging. We've been working with <a href="https://ndstudio.gov/"><u>National Design Studio</u></a>, a team that's aiming to modernize every government interface, on one of their beta sites, <a href="https://www.cio.gov/"><u>CIO.gov</u></a>. They're already running vinext in production, with meaningful improvements in build times and bundle sizes.</p><p>The README is honest about <a href="https://github.com/cloudflare/vinext#whats-not-supported-and-wont-be"><u>what's not supported and won't be</u></a>, and about <a href="https://github.com/cloudflare/vinext#known-limitations"><u>known limitations</u></a>. We want to be upfront rather than overpromise.</p>
    <div>
      <h2>What about pre-rendering?</h2>
      <a href="#what-about-pre-rendering">
        
      </a>
    </div>
    <p>vinext already supports Incremental Static Regeneration (ISR) out of the box. After the first request to any page, it's cached and revalidated in the background, just like Next.js. That part works today.</p><p>vinext does not yet support static pre-rendering at build time. In Next.js, pages without dynamic data get rendered during <code>next build</code> and served as static HTML. If you have dynamic routes, you use <code>generateStaticParams()</code> to enumerate which pages to build ahead of time. vinext doesn't do that… yet.</p><p>This was an intentional design decision for launch. It's <a href="https://github.com/cloudflare/vinext/issues/9">on the roadmap</a>, but if your site is 100% prebuilt HTML with static content, you probably won't see much benefit from vinext today. That said, if one engineer can spend <span>$</span>1,100 in tokens and rebuild Next.js, you can probably spend $10 and migrate to a Vite-based framework designed specifically for static content, like <a href="https://astro.build/">Astro</a> (which <a href="https://blog.cloudflare.com/astro-joins-cloudflare/">also deploys to Cloudflare Workers</a>).</p><p>For sites that aren't purely static, though, we think we can do something better than pre-rendering everything at build time.</p>
    <div>
      <h2>Introducing Traffic-aware Pre-Rendering</h2>
      <a href="#introducing-traffic-aware-pre-rendering">
        
      </a>
    </div>
    <p>Next.js pre-renders every page listed in <code>generateStaticParams()</code> during the build. A site with 10,000 product pages means 10,000 renders at build time, even though 99% of those pages may never receive a request. Builds scale linearly with page count. This is why large Next.js sites end up with 30-minute builds.</p><p>So we built <b>Traffic-aware Pre-Rendering</b> (TPR). It's experimental today, and we plan to make it the default once we have more real-world testing behind it.</p><p>The idea is simple. Cloudflare is already the reverse proxy for your site. We have your traffic data. We know which pages actually get visited. So instead of pre-rendering everything or pre-rendering nothing, vinext queries Cloudflare's zone analytics at deploy time and pre-renders only the pages that matter.</p>
            <pre><code>vinext deploy --experimental-tpr

  Building...
  Build complete (4.2s)

  TPR (experimental): Analyzing traffic for my-store.com (last 24h)
  TPR: 12,847 unique paths — 184 pages cover 90% of traffic
  TPR: Pre-rendering 184 pages...
  TPR: Pre-rendered 184 pages in 8.3s → KV cache

  Deploying to Cloudflare Workers...
</code></pre>
            <p>For a site with 100,000 product pages, the power law means 90% of traffic usually goes to 50 to 200 pages. Those get pre-rendered in seconds. Everything else falls back to on-demand SSR and gets cached via ISR after the first request. Every new deploy refreshes the set based on current traffic patterns. Pages that go viral get picked up automatically. All of this works without <code>generateStaticParams()</code> and without coupling your build to your production database.</p>
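The selection step is essentially a coverage cut over a sorted traffic histogram. Here's an illustrative sketch of that idea (not vinext's actual TPR code; the function name and data shape are made up for this example):

```javascript
// Illustrative only: given per-path request counts, pick the smallest set
// of pages covering 90% of traffic, i.e. the set TPR would pre-render.
function selectPagesForPrerender(counts, coverage = 0.9) {
  const total = Object.values(counts).reduce((a, b) => a + b, 0);
  // Sort paths by traffic, most-visited first.
  const sorted = Object.entries(counts).sort((a, b) => b[1] - a[1]);
  const selected = [];
  let seen = 0;
  for (const [path, hits] of sorted) {
    if (seen / total >= coverage) break; // target coverage already reached
    selected.push(path);
    seen += hits;
  }
  return selected;
}

const traffic = { "/": 500, "/pricing": 300, "/blog/a": 150, "/blog/b": 40, "/about": 10 };
console.log(selectPagesForPrerender(traffic)); // ["/", "/pricing", "/blog/a"]
```

With power-law traffic, the selected set stays tiny even as the long tail of paths grows, which is why pre-render time stops scaling with total page count.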
    <div>
      <h2>Taking on the Next.js challenge, but this time with AI</h2>
      <a href="#taking-on-the-next-js-challenge-but-this-time-with-ai">
        
      </a>
    </div>
    <p>A project like this would normally take a team of engineers months, if not years. Several teams at various companies have attempted it, and the scope is just enormous. We tried once at Cloudflare! Two routers, 33+ module shims, server rendering pipelines, RSC streaming, file-system routing, middleware, caching, static export. There's a reason nobody has pulled it off.</p><p>This time we did it in under a week. One engineer (technically engineering manager) directing AI.</p><p>The first commit landed on February 13. By the end of that same evening, both the Pages Router and App Router had basic SSR working, along with middleware, server actions, and streaming. By the next afternoon, <a href="https://app-router-playground.vinext.workers.dev"><u>App Router Playground</u></a> was rendering 10 of 11 routes. By day three, <code>vinext deploy</code> was shipping apps to Cloudflare Workers with full client hydration. The rest of the week was hardening: fixing edge cases, expanding the test suite, bringing API coverage to 94%.</p><p>What changed from those earlier attempts? AI got better. Way better.</p>
    <div>
      <h2>Why this problem is made for AI</h2>
      <a href="#why-this-problem-is-made-for-ai">
        
      </a>
    </div>
    <p>Not every project would go this way. This one did because a few things happened to line up at the right time.</p><p><b>Next.js is well-specified.</b> It has extensive documentation, a massive user base, and years of Stack Overflow answers and tutorials. The API surface is all over the training data. When you ask Claude to implement <code>getServerSideProps</code> or explain how <code>useRouter</code> works, it doesn't hallucinate. It knows how Next works.</p><p><b>Next.js has an elaborate test suite.</b> The <a href="https://github.com/vercel/next.js"><u>Next.js repo</u></a> contains thousands of E2E tests covering every feature and edge case. We ported tests directly from their suite (you can see the attribution in the code). This gave us a specification we could verify against mechanically.</p><p><b>Vite is an excellent foundation.</b> <a href="https://vite.dev/"><u>Vite</u></a> handles the hard parts of front-end tooling: fast HMR, native ESM, a clean plugin API, production bundling. We didn't have to build a bundler. We just had to teach it to speak Next.js. <a href="https://github.com/vitejs/vite-plugin-rsc"><code><u>@vitejs/plugin-rsc</u></code></a> is still early, but it gave us React Server Components support without having to build an RSC implementation from scratch.</p><p><b>The models caught up.</b> We don't think this would have been possible even a few months ago. Earlier models couldn't sustain coherence across a codebase this size. New models can hold the full architecture in context, reason about how modules interact, and produce correct code often enough to keep momentum going. At times, I saw it go into Next, Vite, and React internals to figure out a bug. The state-of-the-art models are impressive, and they seem to keep getting better.</p><p>All of those things had to be true at the same time. Well-documented target API, comprehensive test suite, solid build tool underneath, and a model that could actually handle the complexity. 
Take any one of them away and this doesn't work nearly as well.</p>
    <div>
      <h2>How we actually built it</h2>
      <a href="#how-we-actually-built-it">
        
      </a>
    </div>
    <p>Almost every line of code in vinext was written by AI. But here's the thing that matters more: every line passes the same quality gates you'd expect from human-written code. The project has 1,700+ Vitest tests, 380 Playwright E2E tests, full TypeScript type checking via tsgo, and linting via oxlint. Continuous integration runs all of it on every pull request. Establishing a set of good guardrails is critical to making AI productive in a codebase.</p><p>The process started with a plan. I spent a couple of hours going back and forth with Claude in <a href="https://opencode.ai"><u>OpenCode</u></a> to define the architecture: what to build, in what order, which abstractions to use. That plan became the north star. From there, the workflow was straightforward:</p><ol><li><p>Define a task ("implement the <code>next/navigation</code> shim with usePathname, <code>useSearchParams</code>, <code>useRouter</code>").</p></li><li><p>Let the AI write the implementation and tests.</p></li><li><p>Run the test suite.</p></li><li><p>If tests pass, merge. If not, give the AI the error output and let it iterate.</p></li><li><p>Repeat.</p></li></ol><p>We wired up AI agents for code review too. When a PR was opened, an agent reviewed it. When review comments came back, another agent addressed them. The feedback loop was mostly automated. </p><p>It didn't work perfectly every time. There were PRs that were just wrong. The AI would confidently implement something that seemed right but didn't match actual Next.js behavior. I had to course-correct regularly. Architecture decisions, prioritization, knowing when the AI was headed down a dead end: that was all me. When you give AI good direction, good context, and good guardrails, it can be very productive. 
But the human still has to steer.</p><p>For browser-level testing, I used <a href="https://github.com/vercel-labs/agent-browser"><u>agent-browser</u></a> to verify actual rendered output, client-side navigation, and hydration behavior. Unit tests miss a lot of subtle browser issues. This caught them.</p><p>Over the course of the project, we ran over 800 sessions in OpenCode. Total cost: roughly $1,100 in Claude API tokens.</p>
    <div>
      <h2>What this means for software</h2>
      <a href="#what-this-means-for-software">
        
      </a>
    </div>
    <p>Why do we have so many layers in the stack? This project forced me to think deeply about this question. And to consider how AI impacts the answer.</p><p>Most abstractions in software exist because humans need help. We couldn't hold the whole system in our heads, so we built layers to manage the complexity for us. Each layer made the next person's job easier. That's how you end up with frameworks on top of frameworks, wrapper libraries, thousands of lines of glue code.</p><p>AI doesn't have the same limitation. It can hold the whole system in context and just write the code. It doesn't need an intermediate framework to stay organized. It just needs a spec and a foundation to build on.</p><p>It's not clear yet which abstractions are truly foundational and which ones were just crutches for human cognition. That line is going to shift a lot over the next few years. But vinext is a data point. We took an API contract, a build tool, and an AI model, and the AI wrote everything in between. No intermediate framework needed. We think this pattern will repeat across a lot of software. The layers we've built up over the years aren't all going to make it.</p>
    <div>
      <h2>Acknowledgments</h2>
      <a href="#acknowledgments">
        
      </a>
    </div>
    <p>Thanks to the Vite team. <a href="https://vite.dev/"><u>Vite</u></a> is the foundation this whole thing stands on. <a href="https://github.com/vitejs/vite-plugin-rsc"><code><u>@vitejs/plugin-rsc</u></code></a> is still early days, but it gave me RSC support without having to build that from scratch, which would have been a dealbreaker. The Vite maintainers were responsive and helpful as I pushed the plugin into territory it hadn't been tested in before.</p><p>We also want to acknowledge the <a href="https://nextjs.org/"><u>Next.js</u></a> team. They've spent years building a framework that raised the bar for what React development could look like. The fact that their API surface is so well-documented and their test suite so comprehensive is a big part of what made this project possible. vinext wouldn't exist without the standard they set.</p>
    <div>
      <h2>Try it</h2>
      <a href="#try-it">
        
      </a>
    </div>
    <p>vinext includes an <a href="https://agentskills.io"><u>Agent Skill</u></a> that handles migration for you. It works with Claude Code, OpenCode, Cursor, Codex, and dozens of other AI coding tools. Install it, open your Next.js project, and tell the AI to migrate:</p>
            <pre><code>npx skills add cloudflare/vinext</code></pre>
            <p>Then open your Next.js project in any supported tool and say:</p>
            <pre><code>migrate this project to vinext</code></pre>
            <p>The skill handles compatibility checking, dependency installation, config generation, and dev server startup. It knows what vinext supports and will flag anything that needs manual attention.</p><p>Or if you prefer doing it by hand:</p>
            <pre><code>npx vinext init    # Migrate an existing Next.js project
npx vinext dev     # Start the dev server
npx vinext deploy  # Ship to Cloudflare Workers</code></pre>
            <p>The source is at <a href="https://github.com/cloudflare/vinext"><u>github.com/cloudflare/vinext</u></a>. Issues, PRs, and feedback are welcome.</p> ]]></content:encoded>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[JavaScript]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Performance]]></category>
            <guid isPermaLink="false">2w61xT0J7H7ECzhiABytS</guid>
            <dc:creator>Steve Faulkner</dc:creator>
        </item>
        <item>
            <title><![CDATA[Code Mode: give agents an entire API in 1,000 tokens]]></title>
            <link>https://blog.cloudflare.com/code-mode-mcp/</link>
            <pubDate>Fri, 20 Feb 2026 14:00:00 GMT</pubDate>
            <description><![CDATA[ The Cloudflare API has over 2,500 endpoints. Exposing each one as an MCP tool would consume more than a million tokens. With Code Mode, we collapsed all of it into two tools and roughly 1,000 tokens of context. ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://www.cloudflare.com/learning/ai/what-is-model-context-protocol-mcp/"><u>Model Context Protocol (MCP)</u></a> has become the standard way for AI agents to use external tools. But there is a tension at its core: agents need many tools to do useful work, yet every tool added fills the model's context window, leaving less room for the actual task. </p><p><a href="https://blog.cloudflare.com/code-mode/"><u>Code Mode</u></a> is a technique we first introduced for reducing context window usage during agent tool use. Instead of describing every operation as a separate tool, let the model write code against a typed SDK and execute the code safely in a <a href="https://developers.cloudflare.com/workers/runtime-apis/bindings/worker-loader/"><u>Dynamic Worker Loader</u></a>. The code acts as a compact plan. The model can explore tool operations, compose multiple calls, and return just the data it needs. Anthropic independently explored the same pattern in their <a href="https://www.anthropic.com/engineering/code-execution-with-mcp"><u>Code Execution with MCP</u></a> post.</p><p>Today we are introducing <a href="https://github.com/cloudflare/mcp"><u>a new MCP server</u></a> for the <a href="https://developers.cloudflare.com/api/"><u>entire Cloudflare API</u></a> — from <a href="https://developers.cloudflare.com/dns/"><u>DNS</u></a> and <a href="https://developers.cloudflare.com/cloudflare-one/"><u>Zero Trust</u></a> to <a href="https://workers.cloudflare.com/product/workers/"><u>Workers</u></a> and <a href="https://workers.cloudflare.com/product/r2/"><u>R2</u></a> — that uses Code Mode. With just two tools, search() and execute(), the server is able to provide access to the entire Cloudflare API over MCP, while consuming only around 1,000 tokens. The footprint stays fixed, no matter how many API endpoints exist.</p><p>For a large API like the Cloudflare API, Code Mode reduces the number of input tokens used by 99.9%. 
An equivalent MCP server without Code Mode would consume 1.17 million tokens — more than the entire context window of the most advanced foundation models.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7KqjQiI09KubtUSe9Dgf0N/6f37896084c7f34abca7dc36ab18d8e0/image2.png" />
          </figure><p><sup><i>Code Mode savings vs. native MCP, measured with </i></sup><a href="https://github.com/openai/tiktoken"><sup><i><u>tiktoken</u></i></sup></a></p><p>You can start using this new Cloudflare MCP server today. And we are also open-sourcing a new <a href="https://github.com/cloudflare/agents/tree/main/packages/codemode"><u>Code Mode SDK</u></a> in the <a href="https://github.com/cloudflare/agents"><u>Cloudflare Agents SDK</u></a>, so you can use the same approach in your own MCP servers and AI Agents.</p>
    <div>
      <h3>Server‑side Code Mode</h3>
      <a href="#server-side-code-mode">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ir1KOZHIjVNyqdC9FSuZs/334456a711fb2b5fa612b3fc0b4adc48/images_BLOG-3184_2.png" />
          </figure><p>This new MCP server applies Code Mode server-side. Instead of thousands of tools, the server exports just two: <code>search()</code> and <code>execute()</code>. Both are powered by Code Mode. Here is the full tool surface area that gets loaded into the model context:</p>
            <pre><code>[
  {
    "name": "search",
    "description": "Search the Cloudflare OpenAPI spec. All $refs are pre-resolved inline.",
    "inputSchema": {
      "type": "object",
      "properties": {
        "code": {
          "type": "string",
          "description": "JavaScript async arrow function to search the OpenAPI spec"
        }
      },
      "required": ["code"]
    }
  },
  {
    "name": "execute",
    "description": "Execute JavaScript code against the Cloudflare API.",
    "inputSchema": {
      "type": "object",
      "properties": {
        "code": {
          "type": "string",
          "description": "JavaScript async arrow function to execute"
        }
      },
      "required": ["code"]
    }
  }
]
</code></pre>
            <p>To discover what it can do, the agent calls <code>search()</code>. It writes JavaScript against a typed representation of the OpenAPI spec. The agent can filter endpoints by product, path, tags, or any other metadata and narrow thousands of endpoints to the handful it needs. The full OpenAPI spec never enters the model context. The agent only interacts with it through code.</p><p>When the agent is ready to act, it calls <code>execute()</code>. The agent writes code that can make Cloudflare API requests, handle pagination, check responses, and chain operations together in a single execution.</p><p>Both tools run the generated code inside a <a href="https://developers.cloudflare.com/workers/runtime-apis/bindings/worker-loader/"><u>Dynamic Worker</u></a> isolate — a lightweight V8 sandbox with no file system, no environment variables to leak through prompt injection, and external fetches disabled by default. Outbound requests can be explicitly controlled with outbound fetch handlers when needed.</p>
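<p>To make the <code>search()</code> contract concrete, here is a minimal sketch of the evaluation model: the agent submits the source of an async arrow function, and the server evaluates it with a <code>spec</code> object in scope. Everything below is illustrative only: the real server runs the code inside a sandboxed Dynamic Worker isolate, not <code>new Function()</code>, and the two-path <code>spec</code> is a made-up miniature of the real OpenAPI spec.</p>

```javascript
// Illustrative only: new Function() stands in for the Dynamic Worker
// isolate the real server uses. The tiny spec object is made up.
const spec = {
  paths: {
    "/zones/{zone_id}/rulesets": { get: { summary: "List zone rulesets" } },
    "/zones/{zone_id}/dns_records": { get: { summary: "List DNS records" } },
  },
};

// The agent submits the source of an async arrow function; the server
// evaluates it with `spec` in scope and returns whatever it resolves to.
async function runSearch(code) {
  const fn = new Function("spec", `return (${code})();`);
  return await fn(spec);
}

// Example: the agent narrows the spec to ruleset endpoints.
const results = await runSearch(`async () => {
  const out = [];
  for (const [path, methods] of Object.entries(spec.paths)) {
    if (!path.includes("rulesets")) continue;
    for (const [method, op] of Object.entries(methods)) {
      out.push({ method: method.toUpperCase(), path, summary: op.summary });
    }
  }
  return out;
}`);

console.log(results); // logs a single GET entry for /zones/{zone_id}/rulesets
```

<p>Only the returned array crosses back into the model context; the <code>spec</code> object itself never does.</p>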
    <div>
      <h4>Example: Protecting an origin from DDoS attacks</h4>
      <a href="#example-protecting-an-origin-from-ddos-attacks">
        
      </a>
    </div>
    <p>Suppose a user tells their agent: "protect my origin from DDoS attacks." The agent's first step is to consult documentation. It might call the <a href="https://developers.cloudflare.com/agents/model-context-protocol/mcp-servers-for-cloudflare/"><u>Cloudflare Docs MCP Server</u></a>, use a <a href="https://github.com/cloudflare/skills"><u>Cloudflare Skill</u></a>, or search the web directly. From the docs it learns: put <a href="https://www.cloudflare.com/application-services/products/waf/"><u>Cloudflare WAF</u></a> and <a href="https://www.cloudflare.com/ddos/"><u>DDoS protection</u></a> rules in front of the origin.</p><p><b>Step 1: Search for the right endpoints
</b>The <code>search</code> tool gives the model a <code>spec</code> object: the full Cloudflare OpenAPI spec with all <code>$refs</code> pre-resolved. The model writes JavaScript against it. Here the agent looks for WAF and ruleset endpoints on a zone:</p>
            <pre><code>async () =&gt; {
  const results = [];
  for (const [path, methods] of Object.entries(spec.paths)) {
    if (path.includes('/zones/') &amp;&amp;
        (path.includes('firewall/waf') || path.includes('rulesets'))) {
      for (const [method, op] of Object.entries(methods)) {
        results.push({ method: method.toUpperCase(), path, summary: op.summary });
      }
    }
  }
  return results;
}
</code></pre>
            <p>The server runs this code in a Workers isolate and returns:</p>
            <pre><code>[
  { "method": "GET",    "path": "/zones/{zone_id}/firewall/waf/packages",              "summary": "List WAF packages" },
  { "method": "PATCH",  "path": "/zones/{zone_id}/firewall/waf/packages/{package_id}", "summary": "Update a WAF package" },
  { "method": "GET",    "path": "/zones/{zone_id}/firewall/waf/packages/{package_id}/rules", "summary": "List WAF rules" },
  { "method": "PATCH",  "path": "/zones/{zone_id}/firewall/waf/packages/{package_id}/rules/{rule_id}", "summary": "Update a WAF rule" },
  { "method": "GET",    "path": "/zones/{zone_id}/rulesets",                           "summary": "List zone rulesets" },
  { "method": "POST",   "path": "/zones/{zone_id}/rulesets",                           "summary": "Create a zone ruleset" },
  { "method": "GET",    "path": "/zones/{zone_id}/rulesets/phases/{ruleset_phase}/entrypoint", "summary": "Get a zone entry point ruleset" },
  { "method": "PUT",    "path": "/zones/{zone_id}/rulesets/phases/{ruleset_phase}/entrypoint", "summary": "Update a zone entry point ruleset" },
  { "method": "POST",   "path": "/zones/{zone_id}/rulesets/{ruleset_id}/rules",        "summary": "Create a zone ruleset rule" },
  { "method": "PATCH",  "path": "/zones/{zone_id}/rulesets/{ruleset_id}/rules/{rule_id}", "summary": "Update a zone ruleset rule" }
]
</code></pre>
            <p>The full Cloudflare API spec has over 2,500 endpoints. The model narrowed that to the WAF and ruleset endpoints it needs, without any of the spec entering the context window. </p><p>The model can also drill into a specific endpoint's schema before calling it. Here it inspects what phases are available on zone rulesets:</p>
            <pre><code>async () =&gt; {
  const op = spec.paths['/zones/{zone_id}/rulesets']?.get;
  const items = op?.responses?.['200']?.content?.['application/json']?.schema;
  // Walk the schema to find the phase enum
  const props = items?.allOf?.[1]?.properties?.result?.items?.allOf?.[1]?.properties;
  return { phases: props?.phase?.enum };
}

{
  "phases": [
    "ddos_l4", "ddos_l7",
    "http_request_firewall_custom", "http_request_firewall_managed",
    "http_response_firewall_managed", "http_ratelimit",
    "http_request_redirect", "http_request_transform",
    "magic_transit", "magic_transit_managed"
  ]
}
</code></pre>
            <p>The agent now knows the exact phases it needs: <code>ddos_l7</code> for DDoS protection and <code>http_request_firewall_managed</code> for WAF.</p><p><b>Step 2: Act on the API
</b>The agent switches to using <code>execute</code>. The sandbox gets a <code>cloudflare.request()</code> client that can make authenticated calls to the Cloudflare API. First the agent checks what rulesets already exist on the zone:</p>
            <pre><code>async () =&gt; {
  const response = await cloudflare.request({
    method: "GET",
    path: `/zones/${zoneId}/rulesets`
  });
  return response.result.map(rs =&gt; ({
    name: rs.name, phase: rs.phase, kind: rs.kind
  }));
}

[
  { "name": "DDoS L7",          "phase": "ddos_l7",                        "kind": "managed" },
  { "name": "Cloudflare Managed","phase": "http_request_firewall_managed", "kind": "managed" },
  { "name": "Custom rules",     "phase": "http_request_firewall_custom",   "kind": "zone" }
]
</code></pre>
            <p>The agent sees that managed DDoS and WAF rulesets already exist. It can now chain calls to inspect their rules and update sensitivity levels in a single execution:</p>
            <pre><code>async () =&gt; {
  // Get the current DDoS L7 entrypoint ruleset
  const ddos = await cloudflare.request({
    method: "GET",
    path: `/zones/${zoneId}/rulesets/phases/ddos_l7/entrypoint`
  });

  // Get the WAF managed ruleset
  const waf = await cloudflare.request({
    method: "GET",
    path: `/zones/${zoneId}/rulesets/phases/http_request_firewall_managed/entrypoint`
  });

  // Return both configurations so the agent can inspect them
  return { ddos: ddos.result, waf: waf.result };
}
</code></pre>
            <p>This entire operation, from searching the spec and inspecting a schema to listing rulesets and fetching DDoS and WAF configurations, took four tool calls.</p>
    <div>
      <h3>The Cloudflare MCP server</h3>
      <a href="#the-cloudflare-mcp-server">
        
      </a>
    </div>
    <p>We started with MCP servers for individual products. Want an agent that manages DNS? Add the <a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/dns-analytics"><u>DNS MCP server</u></a>. Want Workers logs? Add the <a href="https://developers.cloudflare.com/agents/model-context-protocol/mcp-servers-for-cloudflare/"><u>Workers Observability MCP server</u></a>. Each server exported a fixed set of tools that mapped to API operations. This worked when the tool set was small, but the Cloudflare API has over 2,500 endpoints. No collection of hand-maintained servers could keep up.</p><p>The Cloudflare MCP server simplifies this. Two tools, roughly 1,000 tokens, and coverage of every endpoint in the API. When we add new products, the same <code>search()</code> and <code>execute()</code> code paths discover and call them — no new tool definitions, no new MCP servers. It even has support for the <a href="https://developers.cloudflare.com/analytics/graphql-api/"><u>GraphQL Analytics API</u></a>.</p><p>Our MCP server is built on the latest MCP specifications. It is OAuth 2.1 compliant, using <a href="https://github.com/cloudflare/workers-oauth-provider"><u>Workers OAuth Provider</u></a> to downscope the token to selected permissions approved by the user when connecting. The agent only gets the capabilities the user explicitly granted.</p><p>For developers, this means you can use a simple agent loop and still give your agent access to the full Cloudflare API with built-in progressive capability discovery.</p>
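<p>The "simple agent loop" can be sketched in a few lines. Everything below is illustrative: <code>model</code> and the tool implementations are stubs, and the names are invented for this example. A real agent would call an LLM API where the stub model is, and issue MCP tool calls to the server's <code>search()</code> and <code>execute()</code> where the stub tools are.</p>

```javascript
// Illustrative two-tool agent loop. The tools and model are stubs; in a
// real agent they would be MCP tools/call requests and an LLM API call.
const tools = {
  search: async (code) => ({ endpoints: [`matched by: ${code}`] }),
  execute: async (code) => ({ ok: true, ran: code }),
};

// Stub model: first asks to search, then to execute, then finishes.
function model(messages) {
  const toolTurns = messages.filter((m) => m.role === "tool").length;
  if (toolTurns === 0) return { tool: "search", code: "async () => spec.paths" };
  if (toolTurns === 1) return { tool: "execute", code: "async () => { /* call the API */ }" };
  return { answer: "origin protected" };
}

async function agentLoop(task) {
  const messages = [{ role: "user", content: task }];
  for (;;) {
    const step = model(messages);
    if (step.answer) return step.answer; // model decided it is done
    // Dispatch the tool call and feed the result back into the transcript.
    const result = await tools[step.tool](step.code);
    messages.push({ role: "tool", content: JSON.stringify(result) });
  }
}

console.log(await agentLoop("protect my origin from DDoS attacks")); // logs: origin protected
```

<p>The loop never grows past two tool definitions, no matter how many endpoints sit behind <code>execute()</code>.</p>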
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/60ZoSFdK6t6hR6DpAn6Bub/93b86239cedb06d7fb265859be7590e8/images_BLOG-3184_4.png" />
          </figure>
    <div>
      <h3>Comparing approaches to context reduction</h3>
      <a href="#comparing-approaches-to-context-reduction">
        
      </a>
    </div>
    <p>Several approaches have emerged to reduce how many tokens MCP tools consume:</p><p><b>Client-side Code Mode</b> was our first experiment. The model writes TypeScript against typed SDKs and runs it in a Dynamic Worker Loader on the client. The tradeoff is that it requires the agent to ship with secure sandbox access. Code Mode is implemented in <a href="https://block.github.io/goose/blog/2025/12/15/code-mode-mcp/"><u>Goose</u></a> and Anthropic's Claude SDK as <a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling"><u>Programmatic Tool Calling</u></a>.</p><p><b>Command-line interfaces</b> are another path. CLIs are self-documenting and reveal capabilities as the agent explores. Tools like <a href="https://openclaw.ai/"><u>OpenClaw</u></a> and <a href="https://blog.cloudflare.com/moltworker-self-hosted-ai-agent/"><u>Moltworker</u></a> convert MCP servers into CLIs using <a href="https://github.com/steipete/mcporter"><u>MCPorter</u></a> to give agents progressive disclosure. The limitation is obvious: the agent needs a shell, which not every environment provides and which introduces a much broader attack surface than a sandboxed isolate.</p><p><b>Dynamic tool search</b>, as used by <a href="https://x.com/trq212/status/2011523109871108570"><u>Anthropic in Claude Code</u></a>, surfaces a smaller set of tools, hopefully relevant to the current task. It shrinks context use but now requires a search function that must be maintained and evaluated, and each matched tool still uses tokens.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5FPxVAuJggv7A08DbPsksb/aacb9087a79d08a1430ea87bb6960ad3/images_BLOG-3184_5.png" />
          </figure><p>Each approach solves a real problem. But for MCP servers specifically, server-side Code Mode combines their strengths: fixed token cost regardless of API size, no modifications needed on the agent side, progressive discovery built in, and safe execution inside a sandboxed isolate. The agent just calls two tools with code. Everything else happens on the server.</p>
    <div>
      <h3>Get started today</h3>
      <a href="#get-started-today">
        
      </a>
    </div>
    <p>The Cloudflare MCP server is available now. Point your MCP client at the server URL and you'll be redirected to Cloudflare to authorize and select the permissions to grant to your agent. Add this config to your MCP client: </p>
            <pre><code>{
  "mcpServers": {
    "cloudflare-api": {
      "url": "https://mcp.cloudflare.com/mcp"
    }
  }
}
</code></pre>
            <p>For CI/CD, automation, or if you prefer managing tokens yourself, create a Cloudflare API token with the permissions you need. Both user tokens and account tokens are supported and can be passed as bearer tokens in the <code>Authorization</code> header.</p><p>More information on different MCP setup configurations can be found at the <a href="https://github.com/cloudflare/mcp"><u>Cloudflare MCP repository</u></a>.</p>
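<p>For the token-based setup, the same client config can carry the token directly. The <code>headers</code> shape below is an assumption (exact configuration keys vary by MCP client), and the token placeholder is yours to fill in:</p>

```json
{
  "mcpServers": {
    "cloudflare-api": {
      "url": "https://mcp.cloudflare.com/mcp",
      "headers": {
        "Authorization": "Bearer <CLOUDFLARE_API_TOKEN>"
      }
    }
  }
}
```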
    <div>
      <h3>Looking forward</h3>
      <a href="#looking-forward">
        
      </a>
    </div>
    <p>Code Mode solves context costs for a single API. But agents rarely talk to one service. A developer's agent might need the Cloudflare API alongside GitHub, a database, and an internal docs server. Each additional MCP server brings the same context window pressure we started with.</p><p><a href="https://blog.cloudflare.com/zero-trust-mcp-server-portals/"><u>Cloudflare MCP Server Portals</u></a> let you compose multiple MCP servers behind a single gateway with unified auth and access control. We are building a first-class Code Mode integration for all your MCP servers, and exposing them to agents with built-in progressive discovery and the same fixed-token footprint, regardless of how many services sit behind the gateway.</p> ]]></content:encoded>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Open Source]]></category>
            <guid isPermaLink="false">2lWwgP33VT0NJjZ3pWShsw</guid>
            <dc:creator>Matt Carey</dc:creator>
        </item>
        <item>
            <title><![CDATA[Shedding old code with ecdysis: graceful restarts for Rust services at Cloudflare]]></title>
            <link>https://blog.cloudflare.com/ecdysis-rust-graceful-restarts/</link>
            <pubDate>Fri, 13 Feb 2026 14:00:00 GMT</pubDate>
            <description><![CDATA[ ecdysis is a Rust library enabling zero-downtime upgrades for network services. After five years protecting millions of connections at Cloudflare, it’s now open source. ]]></description>
            <content:encoded><![CDATA[ <blockquote><p>ecdysis | <i>ˈekdəsis</i> |</p><p>noun</p><p>    the process of shedding the old skin (in reptiles) or casting off the outer 
    cuticle (in insects and other arthropods).  </p></blockquote><p>How do you upgrade a network service, handling millions of requests per second around the globe, without disrupting even a single connection?</p><p>One of our solutions at Cloudflare to this massive challenge has long been <a href="https://github.com/cloudflare/ecdysis"><b><u>ecdysis</u></b></a>, a Rust library that implements graceful process restarts where no live connections are dropped, and no new connections are refused. </p><p>Last month, <b>we open-sourced ecdysis</b>, so now anyone can use it. After five years of production use at Cloudflare, ecdysis has proven itself by enabling zero-downtime upgrades across our critical Rust infrastructure, saving millions of requests with every restart across Cloudflare’s <a href="https://www.cloudflare.com/network/"><u>global network</u></a>.</p><p>It’s hard to overstate the importance of getting these upgrades right, especially at the scale of Cloudflare’s network. Many of our services perform critical tasks such as traffic routing, <a href="https://www.cloudflare.com/application-services/solutions/certificate-lifecycle-management/"><u>TLS lifecycle management</u></a>, or firewall rules enforcement, and must operate continuously. If one of these services goes down, even for an instant, the cascading impact can be catastrophic. Dropped connections and failed requests quickly lead to degraded customer performance and business impact.</p><p>When these services need updates, security patches can’t wait. Bug fixes need deployment and new features must roll out. </p><p>The naive approach involves waiting for the old process to be stopped before spinning up the new one, but this creates a window of time where connections are refused and requests are dropped. 
Take a service handling thousands of requests per second in a single location, multiply that across hundreds of data centers, and even a brief restart becomes millions of failed requests globally.</p><p>Let’s dig into the problem, and how ecdysis has been the solution for us — and maybe will be for you. </p><p><b>Links</b>: <a href="https://github.com/cloudflare/ecdysis">GitHub</a> <b>|</b> <a href="https://crates.io/crates/ecdysis">crates.io</a> <b>|</b> <a href="https://docs.rs/ecdysis">docs.rs</a></p>
    <div>
      <h3>Why graceful restarts are hard</h3>
      <a href="#why-graceful-restarts-are-hard">
        
      </a>
    </div>
    <p>The naive approach to restarting a service, as we mentioned, is to stop the old process and start a new one. This works acceptably for simple services that don’t handle real-time requests, but for network services processing live connections, this approach has critical limitations.</p><p>First, the naive approach creates a window during which no process is listening for incoming connections. When the old process stops, it closes its listening sockets, which causes the OS to immediately refuse new connections with <code>ECONNREFUSED</code>. Even if the new process starts immediately, there will always be a gap where nothing is accepting connections, whether milliseconds or seconds. For a service handling thousands of requests per second, even a gap of 100ms means hundreds of dropped connections.</p><p>Second, stopping the old process kills all already-established connections. A client uploading a large file or streaming video gets abruptly disconnected. Long-lived connections like WebSockets or gRPC streams are terminated mid-operation. From the client’s perspective, the service simply vanishes.</p><p>Binding the new process before shutting down the old one appears to solve this, but also introduces additional issues. The kernel normally allows only one process to bind to an address:port combination, but <a href="https://man7.org/linux/man-pages/man7/socket.7.html"><u>the SO_REUSEPORT socket option</u></a> permits multiple binds. However, this creates a problem during process transitions that makes it unsuitable for graceful restarts.</p><p>When <code>SO_REUSEPORT</code> is used, the kernel creates separate listening sockets for each process and <a href="https://lwn.net/Articles/542629/"><u>load balances new connections across these sockets</u></a>. When the initial <code>SYN</code> packet for a connection is received, the kernel will assign it to one of the listening processes. 
Once the handshake completes, the connection sits in that process's <code>accept()</code> queue until the process accepts it. If the process exits before accepting the connection, the connection becomes orphaned and is terminated by the kernel. GitHub’s engineering team documented this issue extensively when <a href="https://github.blog/2020-10-07-glb-director-zero-downtime-load-balancer-updates/"><u>building their GLB Director load balancer</u></a>.</p>
    <div>
      <h3>How ecdysis works</h3>
      <a href="#how-ecdysis-works">
        
      </a>
    </div>
    <p>When we set out to design and build ecdysis, we identified four key goals for the library:</p><ol><li><p><b>Old code can be completely shut down</b> post-upgrade.</p></li><li><p><b>The new process has a grace period</b> for initialization.</p></li><li><p><b>New code crashing during initialization is acceptable</b> and shouldn’t affect the running service.</p></li><li><p><b>Only a single upgrade runs at a time</b> to avoid cascading failures.</p></li></ol><p>ecdysis satisfies these requirements by following an approach pioneered by NGINX, which has supported graceful upgrades since its early days. The approach is straightforward: </p><ol><li><p>The parent process <code>fork()</code>s a new child process.</p></li><li><p>The child process replaces itself with a new version of the code via <code>execve()</code>.</p></li><li><p>The child process inherits the socket file descriptors via a named pipe shared with the parent.</p></li><li><p>The parent process waits for the child process to signal readiness before shutting down.</p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4QK8GY1s30C8RUovBQnqbD/525094478911eda96c7877a10753159f/image3.png" />
          </figure><p>Crucially, the socket remains open throughout the transition. The child process inherits the listening socket from the parent as a file descriptor shared via a named pipe. During the child's initialization, both processes share the same underlying kernel data structure, allowing the parent to continue accepting and processing new and existing connections. Once the child completes initialization, it notifies the parent and begins accepting connections. Upon receiving this ready notification, the parent immediately closes its copy of the listening socket and continues handling only existing connections.</p><p>This process eliminates coverage gaps while providing the child a safe initialization window. There is a brief window of time when both the parent and child may accept connections concurrently. This is intentional; any connections accepted by the parent are simply handled until completion as part of the draining process.</p><p>This model also provides the required crash safety. If the child process fails during initialization (e.g., due to a configuration error), it simply exits. Since the parent never stopped listening, no connections are dropped, and the upgrade can be retried once the problem is fixed.</p><p>ecdysis implements the forking model with first-class support for asynchronous programming through <a href="https://tokio.rs"><u>Tokio</u></a> and <code>systemd</code> integration:</p><ul><li><p><b>Tokio integration</b>: Native async stream wrappers for Tokio. Inherited sockets become listeners without additional glue code. For synchronous services, ecdysis supports operation without async runtime requirements.</p></li><li><p><b>systemd-notify support</b>: When the <code>systemd_notify</code> feature is enabled, ecdysis automatically integrates with systemd’s process lifecycle notifications. 
Setting <code>Type=notify-reload</code> in your service unit file allows systemd to track upgrades correctly.</p></li><li><p><b>systemd named sockets</b>: The <code>systemd_sockets</code> feature enables ecdysis to manage systemd-activated sockets. Your service can be socket-activated and support graceful restarts simultaneously.</p></li></ul><p>Platform note: ecdysis relies on Unix-specific syscalls for socket inheritance and process management. It does not work on Windows. This is a fundamental limitation of the forking approach.</p>
    <div>
      <h3>Security considerations</h3>
      <a href="#security-considerations">
        
      </a>
    </div>
    <p>Graceful restarts introduce security considerations. The forking model creates a brief window where two process generations coexist, both with access to the same listening sockets and potentially sensitive file descriptors.</p><p>ecdysis addresses these concerns through its design:</p><p><b>Fork-then-exec</b>: ecdysis follows the traditional Unix pattern of <code>fork()</code> followed immediately by <code>execve()</code>. This ensures the child process starts with a clean slate: new address space, fresh code, and no inherited memory. Only explicitly-passed file descriptors cross the boundary.</p><p><b>Explicit inheritance</b>: Only listening sockets and communication pipes are inherited. Other file descriptors are closed via <code>CLOEXEC</code> flags. This prevents accidental leakage of sensitive handles.</p><p><b>seccomp compatibility</b>: Services using seccomp filters must allow <code>fork()</code> and <code>execve()</code>. This is a tradeoff: graceful restarts require these syscalls, so they cannot be blocked.</p><p>For most network services, these tradeoffs are acceptable. The security of the fork-exec model is well understood and has been battle-tested for decades in software like NGINX and Apache.</p>
    <div>
      <h3>Code example</h3>
      <a href="#code-example">
        
      </a>
    </div>
    <p>Let’s look at a practical example. Here’s a simplified TCP echo server that supports graceful restarts:</p>
            <pre><code>use ecdysis::tokio_ecdysis::{SignalKind, StopOnShutdown, TokioEcdysisBuilder};
use tokio::{net::TcpStream, task::JoinSet};
use futures::StreamExt;
use std::net::SocketAddr;

#[tokio::main]
async fn main() {
    // Create the ecdysis builder
    let mut ecdysis_builder = TokioEcdysisBuilder::new(
        SignalKind::hangup()  // Trigger upgrade/reload on SIGHUP
    ).unwrap();

    // Trigger stop on SIGUSR1
    ecdysis_builder
        .stop_on_signal(SignalKind::user_defined1())
        .unwrap();

    // Create listening socket - will be inherited by children
    let addr: SocketAddr = "0.0.0.0:8080".parse().unwrap();
    let stream = ecdysis_builder
        .build_listen_tcp(StopOnShutdown::Yes, addr, |builder, addr| {
            builder.set_reuse_address(true)?;
            builder.bind(&amp;addr.into())?;
            builder.listen(128)?;
            Ok(builder.into())
        })
        .unwrap();

    // Spawn task to handle connections
    let server_handle = tokio::spawn(async move {
        let mut stream = stream;
        let mut set = JoinSet::new();
        while let Some(Ok(socket)) = stream.next().await {
            set.spawn(handle_connection(socket));
        }
        set.join_all().await;
    });

    // Signal readiness and wait for shutdown
    let (_ecdysis, shutdown_fut) = ecdysis_builder.ready().unwrap();
    let shutdown_reason = shutdown_fut.await;

    log::info!("Shutting down: {:?}", shutdown_reason);

    // Gracefully drain connections
    server_handle.await.unwrap();
}

async fn handle_connection(mut socket: TcpStream) {
    // Echo connection logic here
}</code></pre>
            <p>The key points:</p><ol><li><p><code><b>build_listen_tcp</b></code> creates a listener that will be inherited by child processes.</p></li><li><p><code><b>ready()</b></code> signals to the parent process that initialization is complete and that it can safely exit.</p></li><li><p><code><b>shutdown_fut.await</b></code> blocks until an upgrade or stop is requested. This future only yields once the process should be shut down, either because an upgrade/reload was executed successfully or because a shutdown signal was received.</p></li></ol><p>When you send <code>SIGHUP</code> to this process, here’s what ecdysis does…</p><p><i>…on the parent process:</i></p><ul><li><p>Forks and execs a new instance of your binary.</p></li><li><p>Passes the listening socket to the child.</p></li><li><p>Waits for the child to call <code>ready()</code>.</p></li><li><p>Drains existing connections, then exits.</p></li></ul><p><i>…on the child process:</i></p><ul><li><p>Initializes itself following the same execution flow as the parent, except any sockets owned by ecdysis are inherited and not bound by the child.</p></li><li><p>Signals readiness to the parent by calling <code>ready()</code>.</p></li><li><p>Blocks waiting for a shutdown or upgrade signal.</p></li></ul>
    <div>
      <h3>Production at scale</h3>
      <a href="#production-at-scale">
        
      </a>
    </div>
    <p>ecdysis has been running in production at Cloudflare since 2021. It powers critical Rust infrastructure services deployed across 330+ data centers in 120+ countries. These services handle billions of requests per day and require frequent updates for security patches, feature releases, and configuration changes.</p><p>Every restart using ecdysis saves hundreds of thousands of requests that would otherwise be dropped during a naive stop/start cycle. Across our global footprint, this translates to millions of preserved connections and improved reliability for customers.</p>
    <div>
      <h3>ecdysis vs alternatives</h3>
      <a href="#ecdysis-vs-alternatives">
        
      </a>
    </div>
    <p>Graceful restart libraries exist for several ecosystems. Here’s how ecdysis compares to its closest relatives, so you can choose the right tool.</p><p><a href="https://github.com/cloudflare/tableflip"><b><u>tableflip</u></b></a> is our Go library that inspired ecdysis. It implements the same fork-and-inherit model for Go services. If you need Go, tableflip is a great option!</p><p><a href="https://github.com/cloudflare/shellflip"><b><u>shellflip</u></b></a> is Cloudflare’s other Rust graceful restart library, designed specifically for Oxy, our Rust-based proxy. shellflip is more opinionated: it assumes systemd and Tokio, and focuses on transferring arbitrary application state between parent and child. This makes it excellent for complex stateful services, or services that want to apply such aggressive sandboxing that they can’t even open their own sockets, but adds overhead for simpler cases.</p>
    <div>
      <h3>Start building</h3>
      <a href="#start-building">
        
      </a>
    </div>
    <p>ecdysis brings five years of production-hardened graceful restart capabilities to the Rust ecosystem. It’s the same technology protecting millions of connections across Cloudflare’s global network, now open-sourced and available for anyone!</p><p>Full documentation is available at <a href="https://docs.rs/ecdysis"><u>docs.rs/ecdysis</u></a>, including API reference, examples for common use cases, and steps for integrating with <code>systemd</code>.</p><p>The <a href="https://github.com/cloudflare/ecdysis/tree/main/examples"><u>examples directory</u></a> in the repository contains working code demonstrating TCP listeners, Unix socket listeners, and systemd integration.</p><p>The library is actively maintained by the Argo Smart Routing &amp; Orpheus team, with contributions from teams across Cloudflare. We welcome contributions, bug reports, and feature requests on <a href="https://github.com/cloudflare/ecdysis"><u>GitHub</u></a>.</p><p>Whether you’re building a high-performance proxy, a long-lived API server, or any network service where uptime matters, ecdysis can provide a foundation for zero-downtime operations.</p><p>Start building:<a href="https://github.com/cloudflare/ecdysis"> <u>github.com/cloudflare/ecdysis</u></a></p> ]]></content:encoded>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[Edge]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Application Services]]></category>
            <guid isPermaLink="false">GMarF75NkFuiwVuyFJk77</guid>
            <dc:creator>Manuel Olguín Muñoz</dc:creator>
        </item>
        <item>
            <title><![CDATA[Keeping the Internet fast and secure: introducing Merkle Tree Certificates]]></title>
            <link>https://blog.cloudflare.com/bootstrap-mtc/</link>
            <pubDate>Tue, 28 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare is launching an experiment with Chrome to evaluate fast, scalable, and quantum-ready Merkle Tree Certificates, all without degrading performance or changing WebPKI trust relationships. ]]></description>
            <content:encoded><![CDATA[ <p>The world is in a race to build its first quantum computer capable of solving practical problems not feasible on even the largest conventional supercomputers. While the quantum computing paradigm promises many benefits, it also threatens the security of the Internet by breaking much of the cryptography we have come to rely on.</p><p>To mitigate this threat, Cloudflare is helping to migrate the Internet to Post-Quantum (PQ) cryptography. Today, <a href="https://radar.cloudflare.com/adoption-and-usage#post-quantum-encryption"><u>about 50%</u></a> of traffic to Cloudflare's edge network is protected against the most urgent threat: an attacker who can intercept and store encrypted traffic today and then decrypt it in the future with the help of a quantum computer. This is referred to as the <a href="https://en.wikipedia.org/wiki/Harvest_now,_decrypt_later"><u>harvest now, decrypt later</u></a><i> </i>threat.</p><p>However, this is just one of the threats we need to address. A quantum computer can also be used to crack a server's <a href="https://www.cloudflare.com/application-services/products/ssl/">TLS certificate</a>, allowing an attacker to impersonate the server to unsuspecting clients. The good news is that we already have PQ algorithms we can use for quantum-safe authentication. The bad news is that adoption of these algorithms in TLS will require significant changes to one of the most complex and security-critical systems on the Internet: the Web Public-Key Infrastructure (WebPKI).</p><p>The central problem is the sheer size of these new algorithms: signatures for ML-DSA-44, one of the most performant PQ algorithms standardized by NIST, are 2,420 bytes long, compared to just 64 bytes for ECDSA-P256, the most popular non-PQ signature in use today; and its public keys are 1,312 bytes long, compared to just 64 bytes for ECDSA. That's a roughly 20-fold increase in size. 
Worse yet, the average TLS handshake includes a number of public keys and signatures, adding up to 10s of kilobytes of overhead per handshake. This is enough to have a <a href="https://blog.cloudflare.com/another-look-at-pq-signatures/#how-many-added-bytes-are-too-many-for-tls"><u>noticeable impact</u></a> on the performance of TLS.</p><p>That makes drop-in PQ certificates a tough sell to enable today: they don’t bring any security benefit before Q-day — the day a cryptographically relevant quantum computer arrives — but they do degrade performance. We could sit and wait until Q-day is a year away, but that’s playing with fire. Migrations always take longer than expected, and by waiting we risk the security and privacy of the Internet, which is <a href="https://developers.cloudflare.com/ssl/edge-certificates/universal-ssl/"><u>dear to us</u></a>.</p><p>It's clear that we must find a way to make post-quantum certificates cheap enough to deploy today by default for everyone — not just those that can afford it. In this post, we'll introduce you to the plan we’ve brought together with industry partners to the <a href="https://datatracker.ietf.org/group/plants/about/"><u>IETF</u></a> to redesign the WebPKI in order to allow a smooth transition to PQ authentication with no performance impact (and perhaps a performance improvement!). We'll provide an overview of one concrete proposal, called <a href="https://datatracker.ietf.org/doc/draft-davidben-tls-merkle-tree-certs/"><u>Merkle Tree Certificates (MTCs)</u></a>, whose goal is to whittle down the number of public keys and signatures in the TLS handshake to the bare minimum required.</p><p>But talk is cheap. 
We <a href="https://blog.cloudflare.com/experiment-with-pq/"><u>know</u></a> <a href="https://blog.cloudflare.com/announcing-encrypted-client-hello/"><u>from</u></a> <a href="https://blog.cloudflare.com/why-tls-1-3-isnt-in-browsers-yet/"><u>experience</u></a> that, as with any change to the Internet, it's crucial to test early and often. <b>Today we're announcing our intent to deploy MTCs on an experimental basis in collaboration with Chrome Security.</b> In this post, we'll describe the scope of this experiment, what we hope to learn from it, and how we'll make sure it's done safely.</p>
    <div>
      <h2>The WebPKI today — an old system with many patches</h2>
      <a href="#the-webpki-today-an-old-system-with-many-patches">
        
      </a>
    </div>
    <p>Why does the TLS handshake have so many public keys and signatures?</p><p>Let's start with Cryptography 101. When your browser connects to a website, it asks the server to <b>authenticate</b> itself to make sure it's talking to the real server and not an impersonator. This is usually achieved with a cryptographic primitive known as a digital signature scheme (e.g., ECDSA or ML-DSA). In TLS, the server signs the messages exchanged between the client and server using its <b>secret key</b>, and the client verifies the signature using the server's <b>public key</b>. In this way, the server confirms to the client that they've had the same conversation, since only the server could have produced a valid signature.</p><p>If the client already knows the server's public key, then only <b>1 signature</b> is required to authenticate the server. In practice, however, this is not really an option. The web today is made up of around a billion TLS servers, so it would be unrealistic to provision every client with the public key of every server. What's more, the set of public keys will change over time as new servers come online and existing ones rotate their keys, so we would need some way of pushing these changes to clients.</p><p>This scaling problem is at the heart of the design of all PKIs.</p>
    <div>
      <h3>Trust is transitive</h3>
      <a href="#trust-is-transitive">
        
      </a>
    </div>
    <p>Instead of expecting the client to know the server's public key in advance, the server might just send its public key during the TLS handshake. But how does the client know that the public key actually belongs to the server? This is the job of a <b>certificate</b>.</p><p>A certificate binds a public key to the identity of the server — usually its DNS name, e.g., <code>cloudflareresearch.com</code>. The certificate is signed by a Certification Authority (CA) whose public key is known to the client. In addition to verifying the server's handshake signature, the client verifies the signature of this certificate. This establishes a chain of trust: by accepting the certificate, the client is trusting that the CA verified that the public key actually belongs to the server with that identity.</p><p>Clients are typically configured to trust many CAs and must be provisioned with a public key for each. This is much more manageable, however, since there are only hundreds of CAs instead of billions of servers. In addition, new certificates can be created without having to update clients.</p><p>These efficiencies come at a relatively low cost: for those counting at home, that's <b>+1</b> signature and <b>+1</b> public key, for a total of <b>2 signatures and 1 public key</b> per TLS handshake.</p><p>That's not the end of the story, however. As the WebPKI has evolved, these chains of trust have grown a bit longer. These days it's common for a chain to consist of two or more certificates rather than just one. This is because CAs sometimes need to rotate their keys, just as servers do. But before they can start using the new key, they must distribute the corresponding public key to clients. This takes time, since it requires billions of clients to update their trust stores. 
To bridge the gap, the CA will sometimes use the old key to issue a certificate for the new one and append this certificate to the end of the chain.</p><p>That's<b> +1</b> signature and<b> +1</b> public key, which brings us to<b> 3 signatures and 2 public keys</b>. And we still have a little ways to go.</p>
    <div>
      <h3>Trust but verify</h3>
      <a href="#trust-but-verify">
        
      </a>
    </div>
    <p>The main job of a CA is to verify that a server has control over the domain for which it’s requesting a certificate. This process has evolved over the years from a high-touch, CA-specific process to a standardized, <a href="https://datatracker.ietf.org/doc/html/rfc8555/"><u>mostly automated process</u></a> used for issuing most certificates on the web. (Not all CAs fully support automation, however.) This evolution is marked by a number of security incidents in which a certificate was <b>mis-issued </b>to a party other than the server, allowing that party to impersonate the server to any client that trusts the CA.</p><p>Automation helps, but <a href="https://en.wikipedia.org/wiki/DigiNotar#Issuance_of_fraudulent_certificates"><u>attacks</u></a> are still possible, and mistakes are almost inevitable. <a href="https://blog.cloudflare.com/unauthorized-issuance-of-certificates-for-1-1-1-1/"><u>Earlier this year</u></a>, several certificates for Cloudflare's encrypted 1.1.1.1 resolver were issued without our involvement or authorization. This apparently occurred by accident, but it nonetheless put users of 1.1.1.1 at risk. (The mis-issued certificates have since been revoked.)</p><p>Ensuring mis-issuance is detectable is the job of the Certificate Transparency (CT) ecosystem. The basic idea is that each certificate issued by a CA gets added to a public <b>log</b>. Servers can audit these logs for certificates issued in their name. If a certificate is ever issued that the server operator didn't request, they can prove the issuance happened, and the PKI ecosystem can take action to prevent the certificate from being trusted by clients.</p><p>Major browsers, including Firefox and Chrome and its derivatives, require certificates to be logged before they can be trusted. For example, Chrome, Safari, and Firefox will only accept the server's certificate if it appears in at least two logs the browser is configured to trust. 
This policy is easy to state, but tricky to implement in practice:</p><ol><li><p>Operating a CT log has historically been fairly expensive. Logs ingest billions of certificates over their lifetimes: when an incident happens, or even just under high load, it can take some time for a log to make a new entry available for auditors.</p></li><li><p>Clients can't really audit logs themselves, since this would expose their browsing history (i.e., the servers they wanted to connect to) to the log operators.</p></li></ol><p>The solution to both problems is to include a signature from the CT log along with the certificate. The signature is produced immediately in response to a request to log a certificate, and attests to the log's intent to include the certificate in the log within 24 hours.</p><p>Per browser policy, certificate transparency adds <b>+2</b> signatures to the TLS handshake, one for each log. This brings us to a total of <b>5 signatures and 2 public keys</b> in a typical handshake on the public web.</p>
    <div>
      <h3>The future WebPKI</h3>
      <a href="#the-future-webpki">
        
      </a>
    </div>
    <p>The WebPKI is a living, breathing, and highly distributed system. We've had to patch it a number of times over the years to keep it going, but on balance it has served our needs quite well — until now.</p><p>Previously, whenever we needed to update something in the WebPKI, we would tack on another signature. This strategy has worked because conventional cryptography is so cheap. But <b>5 signatures and 2 public keys </b>on average for each TLS handshake is simply too much to cope with for the larger PQ signatures that are coming.</p><p>The good news is that by moving what we already have around in clever ways, we can drastically reduce the number of signatures we need.</p>
    <div>
      <h3>Crash course on Merkle Tree Certificates</h3>
      <a href="#crash-course-on-merkle-tree-certificates">
        
      </a>
    </div>
    <p><a href="https://datatracker.ietf.org/doc/draft-davidben-tls-merkle-tree-certs/"><u>Merkle Tree Certificates (MTCs)</u></a> is a proposal for the next generation of the WebPKI that we are implementing and plan to deploy on an experimental basis. Its key features are as follows:</p><ol><li><p>All the information a client needs to validate a Merkle Tree Certificate can be disseminated out-of-band. If the client is sufficiently up-to-date, then the TLS handshake needs just <b>1 signature, 1 public key, and 1 Merkle tree inclusion proof</b>. This is quite small, even if we use post-quantum algorithms.</p></li><li><p>The MTC specification makes certificate transparency a first-class feature of the PKI by having each CA run its own log of exactly the certificates they issue.</p></li></ol><p>Let's poke our head under the hood a little. Below we have an MTC generated by one of our internal tests. This would be transmitted from the server to the client in the TLS handshake:</p>
            <pre><code>-----BEGIN CERTIFICATE-----
MIICSzCCAUGgAwIBAgICAhMwDAYKKwYBBAGC2ksvADAcMRowGAYKKwYBBAGC2ksv
AQwKNDQzNjMuNDguMzAeFw0yNTEwMjExNTMzMjZaFw0yNTEwMjgxNTMzMjZaMCEx
HzAdBgNVBAMTFmNsb3VkZmxhcmVyZXNlYXJjaC5jb20wWTATBgcqhkjOPQIBBggq
hkjOPQMBBwNCAARw7eGWh7Qi7/vcqc2cXO8enqsbbdcRdHt2yDyhX5Q3RZnYgONc
JE8oRrW/hGDY/OuCWsROM5DHszZRDJJtv4gno2wwajAOBgNVHQ8BAf8EBAMCB4Aw
EwYDVR0lBAwwCgYIKwYBBQUHAwEwQwYDVR0RBDwwOoIWY2xvdWRmbGFyZXJlc2Vh
cmNoLmNvbYIgc3RhdGljLWN0LmNsb3VkZmxhcmVyZXNlYXJjaC5jb20wDAYKKwYB
BAGC2ksvAAOB9QAAAAAAAAACAAAAAAAAAAJYAOBEvgOlvWq38p45d0wWTPgG5eFV
wJMhxnmDPN1b5leJwHWzTOx1igtToMocBwwakt3HfKIjXYMO5CNDOK9DIKhmRDSV
h+or8A8WUrvqZ2ceiTZPkNQFVYlG8be2aITTVzGuK8N5MYaFnSTtzyWkXP2P9nYU
Vd1nLt/WjCUNUkjI4/75fOalMFKltcc6iaXB9ktble9wuJH8YQ9tFt456aBZSSs0
cXwqFtrHr973AZQQxGLR9QCHveii9N87NXknDvzMQ+dgWt/fBujTfuuzv3slQw80
mibA021dDCi8h1hYFQAA
-----END CERTIFICATE-----</code></pre>
            <p>Looks like your average PEM encoded certificate. Let's decode it and look at the parameters:</p>
            <pre><code>$ openssl x509 -in merkle-tree-cert.pem -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 531 (0x213)
        Signature Algorithm: 1.3.6.1.4.1.44363.47.0
        Issuer: 1.3.6.1.4.1.44363.47.1=44363.48.3
        Validity
            Not Before: Oct 21 15:33:26 2025 GMT
            Not After : Oct 28 15:33:26 2025 GMT
        Subject: CN=cloudflareresearch.com
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    04:70:ed:e1:96:87:b4:22:ef:fb:dc:a9:cd:9c:5c:
                    ef:1e:9e:ab:1b:6d:d7:11:74:7b:76:c8:3c:a1:5f:
                    94:37:45:99:d8:80:e3:5c:24:4f:28:46:b5:bf:84:
                    60:d8:fc:eb:82:5a:c4:4e:33:90:c7:b3:36:51:0c:
                    92:6d:bf:88:27
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature
            X509v3 Extended Key Usage:
                TLS Web Server Authentication
            X509v3 Subject Alternative Name:
                DNS:cloudflareresearch.com, DNS:static-ct.cloudflareresearch.com
    Signature Algorithm: 1.3.6.1.4.1.44363.47.0
    Signature Value:
        00:00:00:00:00:00:02:00:00:00:00:00:00:00:02:58:00:e0:
        44:be:03:a5:bd:6a:b7:f2:9e:39:77:4c:16:4c:f8:06:e5:e1:
        55:c0:93:21:c6:79:83:3c:dd:5b:e6:57:89:c0:75:b3:4c:ec:
        75:8a:0b:53:a0:ca:1c:07:0c:1a:92:dd:c7:7c:a2:23:5d:83:
        0e:e4:23:43:38:af:43:20:a8:66:44:34:95:87:ea:2b:f0:0f:
        16:52:bb:ea:67:67:1e:89:36:4f:90:d4:05:55:89:46:f1:b7:
        b6:68:84:d3:57:31:ae:2b:c3:79:31:86:85:9d:24:ed:cf:25:
        a4:5c:fd:8f:f6:76:14:55:dd:67:2e:df:d6:8c:25:0d:52:48:
        c8:e3:fe:f9:7c:e6:a5:30:52:a5:b5:c7:3a:89:a5:c1:f6:4b:
        5b:95:ef:70:b8:91:fc:61:0f:6d:16:de:39:e9:a0:59:49:2b:
        34:71:7c:2a:16:da:c7:af:de:f7:01:94:10:c4:62:d1:f5:00:
        87:bd:e8:a2:f4:df:3b:35:79:27:0e:fc:cc:43:e7:60:5a:df:
        df:06:e8:d3:7e:eb:b3:bf:7b:25:43:0f:34:9a:26:c0:d3:6d:
        5d:0c:28:bc:87:58:58:15:00:00</code></pre>
            <p>While some of the parameters probably look familiar, others will look unusual. On the familiar side, the subject and public key are exactly what we might expect: the DNS name is <code>cloudflareresearch.com</code> and the public key is for a familiar signature algorithm, ECDSA-P256. This algorithm is not PQ, of course — in the future we would put ML-DSA-44 there instead.</p><p>On the unusual side, OpenSSL appears not to recognize the signature algorithm of the issuer and just prints the raw OID and bytes of the signature. There's a good reason for this: the MTC does not have a signature in it at all! So what exactly are we looking at?</p><p>The trick to leaving out signatures is that a Merkle Tree Certification Authority (MTCA) produces its <i>signatureless</i> certificates <i>in batches</i> rather than individually. In place of a signature, the certificate has an <b>inclusion proof</b> of the certificate in a batch of certificates signed by the MTCA.</p><p>To understand how inclusion proofs work, let's think about a slightly simplified version of the MTC specification. To issue a batch, the MTCA arranges the unsigned certificates into a data structure called a <b>Merkle tree</b> that looks like this:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4LGhISsS07kbpSgDkqx8p2/68e3b36deeca7f97139654d2c769df68/image3.png" />
          </figure><p>Each leaf of the tree corresponds to a certificate, and each inner node is equal to the hash of its children. To sign the batch, the MTCA uses its secret key to sign the head of the tree. The structure of the tree guarantees that each certificate in the batch was signed by the MTCA: if we tried to tweak the bits of any one of the certificates, the treehead would end up having a different value, which would cause the signature to fail.</p><p>An inclusion proof for a certificate consists of the hash of each sibling node along the path from the certificate to the treehead:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4UZZHkRwsBLWXRYeop4rXv/8598cde48c27c112bc4992889f3d5799/image1.gif" />
          </figure><p>Given a validated treehead, this sequence of hashes is sufficient to prove inclusion of the certificate in the tree. This means that, in order to validate an MTC, the client also needs to obtain the signed treehead from the MTCA.</p><p>This is the key to MTC's efficiency:</p><ol><li><p>Signed treeheads can be disseminated to clients out-of-band and validated offline. Each validated treehead can then be used to validate any certificate in the corresponding batch, eliminating the need to obtain a signature for each server certificate.</p></li><li><p>During the TLS handshake, the client tells the server which treeheads it has. If the server has a signatureless certificate covered by one of those treeheads, then it can use that certificate to authenticate itself. That's <b>1 signature, 1 public key, and 1 inclusion proof</b> per handshake, all for the server being authenticated.</p></li></ol><p>Now, that's the simplified version. MTC proper has some more bells and whistles. To start, it doesn’t create a separate Merkle tree for each batch, but grows a single large tree, which also improves transparency. As this tree grows, (sub)tree heads are periodically selected to be shipped to browsers, which we call <b>landmarks</b>. In the common case browsers will be able to fetch the most recent landmarks, and servers can wait for batch issuance, but we need a fallback: MTC also supports certificates that can be issued immediately and don’t require landmarks to be validated, but these are not as small. A server would provision both types of Merkle tree certificates, so that the common case is fast, and the exceptional case is slow, but at least it’ll work.</p>
    <div>
      <h2>Experimental deployment</h2>
      <a href="#experimental-deployment">
        
      </a>
    </div>
    <p>Ever since early designs for MTCs emerged, we’ve been eager to experiment with the idea. In line with the IETF principle of “<a href="https://www.ietf.org/runningcode/"><u>running code</u></a>”, it often takes implementing a protocol to work out kinks in the design. At the same time, we cannot risk the security of users. In this section, we describe our approach to experimenting with aspects of the Merkle Tree Certificates design <i>without</i> changing any trust relationships.</p><p>Let’s start with what we hope to learn. We have lots of questions whose answers can help to either validate the approach, or uncover pitfalls that require reshaping the protocol — in fact, an implementation of an early MTC draft by <a href="https://www.cs.ru.nl/masters-theses/2025/M_Pohl___Implementation_and_Analysis_of_Merkle_Tree_Certificates_for_Post-Quantum_Secure_Authentication_in_TLS.pdf"><u>Maximilian Pohl</u></a> and <a href="https://www.ietf.org/archive/id/draft-davidben-tls-merkle-tree-certs-07.html#name-acknowledgements"><u>Mia Celeste</u></a> did exactly this. We’d like to know:</p><p><b>What breaks?</b> Protocol ossification (the tendency of implementation bugs to make it harder to change a protocol) is an ever-present issue with deploying protocol changes. For TLS in particular, despite having built-in flexibility, time after time we’ve found that if that flexibility is not regularly used, there will be buggy implementations and middleboxes that break when they see things they don’t recognize. TLS 1.3 deployment <a href="https://blog.cloudflare.com/why-tls-1-3-isnt-in-browsers-yet/"><u>took years longer</u></a> than we hoped for this very reason. 
And more recently, the rollout of PQ key exchange in TLS caused the Client Hello to be split over multiple TCP packets, something that many middleboxes <a href="https://tldr.fail/"><u>weren't ready for</u></a>.</p><p><b>What is the performance impact?</b> In fact, we expect MTCs to <i>reduce </i>the size of the handshake, even compared to today's non-PQ certificates. They will also reduce CPU cost: ML-DSA signature verification is about as fast as ECDSA, and there will be far fewer signatures to verify. We therefore expect to see a <i>reduction in latency</i>. We would like to see if there is a measurable performance improvement.</p><p><b>What fraction of clients will stay up to date? </b>Getting the performance benefit of MTCs requires the clients and servers to be roughly in sync with one another. We expect MTCs to have fairly short lifetimes, a week or so. This means that if the client's latest landmark is older than a week, the server would have to fall back to a larger certificate. Knowing how often this fallback happens will help us tune the parameters of the protocol to make fallbacks less likely.</p><p>In order to answer these questions, we are implementing MTC support in our TLS stack and in our certificate issuance infrastructure. For their part, Chrome is implementing MTC support in their own TLS stack and will stand up infrastructure to disseminate landmarks to their users.</p><p>As we've done in past experiments, we plan to enable MTCs for a subset of our free customers with enough traffic that we will be able to get useful measurements. Chrome will control the experimental rollout: they can ramp up slowly, measuring as they go and rolling back if and when bugs are found.</p><p>Which leaves us with one last question: who will run the Merkle Tree CA?</p>
    <div>
      <h3>Bootstrapping trust from the existing WebPKI</h3>
      <a href="#bootstrapping-trust-from-the-existing-webpki">
        
      </a>
    </div>
    <p>Standing up a proper CA is no small task: it takes years to be trusted by major browsers. That’s why Cloudflare isn’t going to become a “real” CA for this experiment, and Chrome isn’t going to trust us directly.</p><p>Instead, to make progress in a reasonable timeframe, without sacrificing due diligence, we plan to "mock" the role of the MTCA. We will run an MTCA (on <a href="https://github.com/cloudflare/azul/"><u>Workers</u></a> based on our <a href="https://blog.cloudflare.com/azul-certificate-transparency-log/"><u>StaticCT logs</u></a>), but for each MTC we issue, we also publish an existing certificate from a trusted CA that agrees with it. We call this the <b>bootstrap certificate</b>. When Chrome’s infrastructure pulls updates from our MTCA log, they will also pull these bootstrap certificates, and check whether they agree. Only if they do will they push the corresponding landmarks to Chrome clients. In other words, Cloudflare is effectively just “re-encoding” an existing certificate (with domain validation performed by a trusted CA) as an MTC, and Chrome is using certificate transparency to keep us honest.</p>
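    <p>Conceptually, the agreement check is simple. Here is a minimal sketch, where the function name and field layout are our own illustration, not the actual MTC assertion or X.509 structure:</p>

```javascript
// Illustrative agreement check: an MTC assertion is accepted only if
// a bootstrap certificate from a trusted CA binds the same DNS names
// to the same subject public key. All field names are hypothetical.
function agreesWithBootstrap(mtcAssertion, bootstrapCert) {
  const sameNames =
    mtcAssertion.dnsNames.length === bootstrapCert.dnsNames.length &&
    mtcAssertion.dnsNames.every((n) => bootstrapCert.dnsNames.includes(n));
  const sameKey = mtcAssertion.publicKey === bootstrapCert.publicKey;
  return sameNames && sameKey;
}
```

    <p>If any assertion in the log lacks a matching bootstrap certificate, the batch is rejected and no landmark is pushed to clients.</p>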
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>With almost 50% of our traffic already protected by post-quantum encryption, we’re halfway to a fully post-quantum secure Internet. The second part of our journey, post-quantum certificates, is the harder one, though. A simple drop-in upgrade has a noticeable performance impact and no security benefit before Q-day. This means it’s a hard sell to enable today by default. But here we are playing with fire: migrations always take longer than expected. If we want to keep a ubiquitously private and secure Internet, we need a post-quantum solution that’s performant enough to be enabled by default <b>today</b>.</p><p>Merkle Tree Certificates (MTCs) solve this problem by reducing the number of signatures and public keys to the bare minimum while maintaining the WebPKI's essential properties. We plan to roll out MTCs to a fraction of free accounts by early next year. This does not affect any visitors that are not part of the Chrome experiment. For those that are, thanks to the bootstrap certificates, there is no impact on security.</p><p>We’re excited to keep the Internet fast <i>and</i> secure, and will report back soon on the results of this experiment: watch this space! MTC is evolving as we speak; if you want to get involved, please join the IETF <a href="https://mailman3.ietf.org/mailman3/lists/plants@ietf.org/"><u>PLANTS mailing list</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Post-Quantum]]></category>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Chrome]]></category>
            <category><![CDATA[Google]]></category>
            <category><![CDATA[IETF]]></category>
            <category><![CDATA[Transparency]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <guid isPermaLink="false">4jURWdZzyjdrcurJ4LlJ1z</guid>
            <dc:creator>Luke Valenta</dc:creator>
            <dc:creator>Christopher Patton</dc:creator>
            <dc:creator>Vânia Gonçalves</dc:creator>
            <dc:creator>Bas Westerbaan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Why Cloudflare, Netlify, and Webflow are collaborating to support Open Source tools like Astro and TanStack]]></title>
            <link>https://blog.cloudflare.com/cloudflare-astro-tanstack/</link>
            <pubDate>Tue, 23 Sep 2025 13:10:00 GMT</pubDate>
            <description><![CDATA[ Today, Cloudflare is proud to announce support for two cornerstone frameworks in the modern web ecosystem: we’re partnering with Webflow to sponsor Astro, and with Netlify to sponsor TanStack. ]]></description>
            <content:encoded><![CDATA[ 
    <p>Open source is the core fabric of the web, and the open source tools that power the modern web depend on the stability and support of the community. </p><p>To ensure two major open source projects have the resources they need, we are proud to announce our financial sponsorship of two cornerstone frameworks in the modern web ecosystem: <b>Astro</b> and <b>TanStack</b>.</p><p>Critically, we think it’s important we don’t do this alone — for the open web to continue to thrive, we must bet on and support technologies and frameworks that are open and accessible to all, and not beholden to any one company. </p><p>Which is why we are also excited to announce that for these sponsorships we are joining forces with our peers at <b>Netlify to sponsor TanStack</b> and <b>Webflow to sponsor Astro</b>.</p>
    <div>
      <h2>Why Astro and TanStack? Investing in the Future of the Frontend</h2>
      <a href="#why-astro-and-tanstack-investing-in-the-future-of-the-frontend">
        
      </a>
    </div>
    <p>Our decision to support Astro and TanStack was deliberate. These two projects represent distinct but complementary visions for the future of web development. One is redefining the architecture for high-performance, content-driven websites, while the other provides a full-stack toolkit for building the most ambitious web applications.</p>
    <div>
      <h3>Astro: the framework for the high-performance sites </h3>
      <a href="#astro-the-framework-for-the-high-performance-sites">
        
      </a>
    </div>
    <p>When it comes to endorsing a technology, we believe actions speak louder than words. </p><p>That’s why our support for Astro isn't just financial—it's foundational. We run our developer documentation site, developers.cloudflare.com, entirely on Astro. This isn't a small side project — it's a critical resource visited by hundreds of thousands of developers every day, with dozens of contributors constantly keeping it updated. For a site like this, performance isn't a feature; it's a requirement. </p><p>We chose Astro because its core principles mirror our own. Its "zero JS by default" architecture delivers the raw performance and stellar SEO that a content-heavy site demands, ensuring our docs are fast and discoverable. Just as importantly, Astro is framework-agnostic, letting teams use components from React, Vue, or Svelte without vendor lock-in. </p><p>Astro makes it easy for our global team to keep content up-to-date and, most importantly, keep our docs blazing fast. Our sponsorship is a direct result of the immense value we've experienced firsthand.   </p><blockquote><p><i>Cloudflare’s partnership and support affirms our shared mission: to make the web faster, more open, and better for everyone who builds on it.  - Fred K. Schott, Astro Co-creator, Project Steward</i></p></blockquote><blockquote><p><i>Webflow gives marketers, designers, and developers the freedom to build without compromise. Astro shares that same spirit by removing barriers, speeding up workflows, and opening new creative possibilities. Together with Cloudflare and Netlify, we’re helping ensure the tools our community relies on stay open, sustainable, and ready for the future. - Allan Leinwand, Webflow CTO</i></p></blockquote>
    <div>
      <h3>TanStack Start: the full-stack framework for ambitious applications</h3>
      <a href="#tanstack-start-the-full-stack-framework-for-ambitious-applications">
        
      </a>
    </div>
    <p>If Astro provides the ideal foundation for content-heavy sites, TanStack provides the ideal engine for complex web applications. TanStack is not a single framework but a suite of powerful, headless, and type-safe libraries that solve the hardest problems in modern application development.</p><p>Libraries like TanStack Query have become the de facto industry standard for managing asynchronous server state, elegantly solving complex challenges like caching, background refetching, and optimistic updates that once required thousands of lines of fragile, bespoke code. Similarly, TanStack Router brings full type-safety to routing, eliminating an entire class of common bugs, while TanStack Table and TanStack Form provide the robust, headless primitives needed to build sophisticated, data-intensive user interfaces.</p><p>And today, TanStack announced the release candidate for TanStack Start 1.0, taking a big stride towards production-readiness.</p><p><b>TanStack Start</b> is a new full-stack framework that composes these powerful libraries into a cohesive, enterprise-grade development experience. With features like full-document Server-Side Rendering (SSR), streaming, and a "deploy anywhere" architecture, TanStack Start is designed for the modern, serverless edge. It provides the power and type-safety needed for ambitious applications and is a perfect match for deployment environments like Cloudflare Workers.</p><blockquote><p><i>With Cloudflare alongside us, TanStack can keep raising the bar for fast, scalable, and type-safe tools for powering the next generation of web apps while protecting the openness and freedom developers depend on. - Tanner Linsley, TanStack creator</i></p></blockquote><blockquote><p><i>Supporting an open web is not a nice-to-have for us, but a requirement for us to fulfill our mission to build a better web. Collaborating with Cloudflare on making sure these top projects are funded is the easiest decision we can make! 
-</i> <i>Mat B, CEO</i></p></blockquote>
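    <p>To make the "fragile, bespoke code" point concrete, here is a dependency-free toy of just one of the things TanStack Query automates: deduplicating in-flight requests by query key. This sketch is our own illustration, not TanStack Query's API; the real library layers caching, staleness tracking, background refetching, retries, and much more on top.</p>

```javascript
// Toy request deduplication keyed by a serialized query key.
// Not TanStack Query's API: a sketch of one problem it solves.
const cache = new Map();

function fetchQuery(queryKey, queryFn) {
  const id = JSON.stringify(queryKey);
  // Callers with the same key share a single promise for the result.
  if (!cache.has(id)) cache.set(id, Promise.resolve(queryFn()));
  return cache.get(id);
}

// Usage: both calls resolve from a single invocation of queryFn.
let invocations = 0;
const a = fetchQuery(["user", 1], () => { invocations++; return { name: "Ada" }; });
const b = fetchQuery(["user", 1], () => { invocations++; return { name: "Ada" }; });
```
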
    <div>
      <h2>Joining forces builds a stronger open web</h2>
      <a href="#joining-forces-builds-a-stronger-open-web">
        
      </a>
    </div>
    <p>It is not lost on us that this coalition includes companies that compete in the market. We believe this is a feature, not a bug. It demonstrates a shared understanding that we are all building on the same open-source foundations. A healthy, innovative, and sustainable open-source ecosystem is the rising tide that lifts all of our boats.</p><p>This joint sponsorship model means Astro and TanStack are more resilient. For you, that means you can build on them with confidence, knowing they aren't dependent on a single company's shifting priorities.</p>
    <div>
      <h2>With that, show us what you build!</h2>
      <a href="#with-that-show-us-what-you-build">
        
      </a>
    </div>
    <p>The best way to support open source is to use it, build with it, and contribute back to it. See how easy it is to get started with Astro and TanStack and deploy an application to Cloudflare in minutes with the following framework guides:</p><ul><li><p><a href="https://developers.cloudflare.com/workers/framework-guides/web-apps/astro/"><u>Get started with Astro</u></a></p></li><li><p><a href="https://tanstack.com/start/latest/docs/framework/react/overview"><u>Get started with TanStack Start</u></a></p></li></ul><p></p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[Open Source]]></category>
            <guid isPermaLink="false">6fqBbuuMhg7sdSmsIGTchD</guid>
            <dc:creator>Rita Kozlov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Supporting the future of the open web: Cloudflare is sponsoring Ladybird and Omarchy ]]></title>
            <link>https://blog.cloudflare.com/supporting-the-future-of-the-open-web/</link>
            <pubDate>Mon, 22 Sep 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ We are excited to announce our support of two independent, open source projects: Ladybird, an ambitious project to build an independent browser, and Omarchy, an opinionated Arch Linux for developers.  ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare, we believe that helping build a better Internet means encouraging a healthy ecosystem of options for how people can connect safely and quickly to the resources they need. Sometimes that means we tackle immense, Internet-scale problems with established partners. And sometimes that means we support and partner with fantastic open teams taking big bets on the next generation of tools.</p><p>To that end, today we are excited to announce our support of two independent, open source projects: <a href="https://ladybird.org/"><u>Ladybird</u></a>, an ambitious project to build a completely independent browser from the ground up, and <a href="https://omarchy.org/"><u>Omarchy</u></a>, an opinionated Arch Linux setup for developers. </p>
    <div>
      <h2>Two open source projects strengthening the open Internet </h2>
      <a href="#two-open-source-projects-strengthening-the-open-internet">
        
      </a>
    </div>
    <p>Cloudflare has a long history of supporting open-source software – both through <a href="https://blog.cloudflare.com/tag/open-source/"><u>our own projects shared with the community</u></a> and <a href="https://developers.cloudflare.com/sponsorships/"><u>external</u></a> projects that we support. We see our sponsorship of Ladybird and Omarchy as a natural extension of these efforts in a moment where energy for a diverse ecosystem is needed more than ever.  </p>
    <div>
      <h3>Ladybird, a new and independent browser </h3>
      <a href="#ladybird-a-new-and-independent-browser">
        
      </a>
    </div>
    <p>Most of us spend a significant amount of time using a web browser –  in fact, you’re probably using one to read this blog! The beauty of browsers is that they help users experience the open Internet, giving you access to everything from the largest news publications in the world to a tiny website hosted on a Raspberry Pi.  </p><p>Unlike dedicated apps, browsers reduce the barriers to building an audience for new services and communities on the Internet. If you are launching something new, you can offer it through a browser in a world where most people have absolutely zero desire to install an app just to try something out. Browsers help encourage competition and new ideas on the open web.</p><p>While the openness of how browsers work has led to an explosive growth of services on the Internet, browsers themselves have consolidated to a tiny handful of viable options. There’s a high probability you’re reading this on a Chromium-based browser, like Google’s Chrome, along with about <a href="https://radar.cloudflare.com/reports/browser-market-share-2025-q2"><u>65% of users on the Internet.</u></a> However, that consolidation has also scared off new entrants in the space. If all browsers ship on the same operating systems, powered by the same underlying technology, we lose out on potential privacy, security and performance innovations that could benefit developers and everyday Internet users.  </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3j6xYLX9ZdqhS0yWCMjM0b/45fa8bd5b275a45a9f37b7a015d4c15d/BLOG-2998_2.png" />
          </figure><p><sup><i>A screenshot of Cloudflare Workers developer docs in Ladybird </i></sup></p><p>This is where Ladybird comes in: it’s not Chromium based – everything is built from scratch. The Ladybird project has two main components: LibWeb, a brand-new rendering engine, and LibJS, a brand-new JavaScript engine with its own parser, interpreter, and bytecode execution engine. </p><p>Building an engine that can correctly and securely render the modern web is a monumental task that requires deep technical expertise and navigating decades of specifications governed by standards bodies like the W3C and WHATWG. And because Ladybird implements these standards directly, it also stress-tests them in practice. Along the way, the project has found, reported, and sometimes fixed countless issues in the specifications themselves, contributions that strengthen the entire web platform for developers, browser vendors, and anyone who may attempt to build a browser in the future.</p><p>Whether to build something from scratch or not is a perennial source of debate between software engineers, but absent the pressures of revenue or special interests, we’re excited about the ways Ladybird will prioritize privacy, performance, and security, potentially in novel ways that will influence the entire ecosystem.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7zzAGb1Te5G6wGH2ieFbMU/1a3289c199695f88f6f6e57d7289851e/image1.png" />
          </figure><p><sup><i>A screenshot of the Omarchy development environment</i></sup></p>
    <div>
      <h3>Omarchy, an independent development environment </h3>
      <a href="#omarchy-an-independent-development-environment">
        
      </a>
    </div>
    <p>Developers deserve choice, too. Beyond the browser, a developer’s operating system and environment is where they spend a ton of time – and where a few big players have become the dominant choice. Omarchy challenges this by providing a complete, opinionated Arch Linux distribution that transforms a bare installation into a modern development workstation that developers are <a href="https://github.com/basecamp/omarchy"><u>excited about</u></a>.</p><p>Perfecting one’s development environment can be a career-long art, but learning how to do so shouldn’t be a barrier to beginning to code. The beauty of Omarchy is that it makes Linux approachable to more developers by doing most of the setup for them, making it look good, and then making it configurable. Omarchy provides most of the tools developers need – like Neovim, Docker, and Git – out of the box, and <a href="https://learn.omacom.io/2/the-omarchy-manual"><u>tons of other features</u></a>.</p><p>At its core, Omarchy embraces Linux for all of its complexity and configurability, and makes a version of it that is accessible and fun to use for developers that don’t have a deep background in operating systems. Projects like this ensure that a powerful, independent Linux desktop remains a compelling choice for people building the next generation of applications and Internet infrastructure. </p>
    <div>
      <h3>Our support comes with no strings attached  </h3>
      <a href="#our-support-comes-with-no-strings-attached">
        
      </a>
    </div>
    <p>We want to be very clear here: we are supporting these projects because we believe the Internet can be better if these projects, and more like them, succeed. No requirement to use our technology stack or any arrangement like that. We are happy to partner with great teams like Ladybird and Omarchy simply because we believe that our missions have real overlap.</p>
    <div>
      <h2>Notes from the teams</h2>
      <a href="#notes-from-the-teams">
        
      </a>
    </div>
    <p>Ladybird is still in its early days, with an alpha release planned for 2026, but we encourage anyone who is interested to consider contributing to the <a href="https://github.com/LadybirdBrowser/ladybird/tree/master"><u>open source codebase</u></a> as they prepare for launch.</p><blockquote><p><i>"Cloudflare knows what it means to build critical web infrastructure on the server side. With Ladybird, we’re tackling the near-monoculture on the client side, because we believe it needs multiple implementations to stay healthy, and we’re extremely thankful for their support in that mission.”</i></p><p>– <b>Andreas Kling</b>, Founder, Ladybird  </p></blockquote><p><a href="https://github.com/basecamp/omarchy/releases/tag/v3.0.0"><u>Omarchy 3.0</u></a> was released just last week with faster installation and increased Macbook compatibility, so if you’ve been Linux-curious for a while now, we encourage you to try it out!</p><blockquote><p><i>"Cloudflare's support of Omarchy has ensured we have the fastest ISO and package delivery from wherever you are in the world. Without a need to manually configure mirrors or deal with torrents. The combo of a super CDN, great R2 storage, and the best DDoS shield in the business has been a huge help for the project."</i></p><p>– <b>David Heinemeier Hansson</b>, Creator of Omarchy and Ruby on Rails</p></blockquote><p>A better Internet is one where people have more choice in how they browse and develop new software. We’re incredibly excited about the potential of Ladybird, Omarchy, and other audacious projects that support a free and open Internet. </p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Browser Rendering]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">1mBKYqbp7645szLQobH6SI</guid>
            <dc:creator>Sam Rhea</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cap'n Web: a new RPC system for browsers and web servers]]></title>
            <link>https://blog.cloudflare.com/capnweb-javascript-rpc-library/</link>
            <pubDate>Mon, 22 Sep 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cap'n Web is a new open source, JavaScript-native RPC protocol for use in browsers and web servers. It provides the expressive power of Cap'n Proto, but with no schemas and no boilerplate. ]]></description>
            <content:encoded><![CDATA[ <p>Allow us to introduce <a href="https://github.com/cloudflare/capnweb"><u>Cap'n Web</u></a>, an RPC protocol and implementation in pure TypeScript.</p><p>Cap'n Web is a spiritual sibling to <a href="https://capnproto.org/"><u>Cap'n Proto</u></a>, an RPC protocol I (Kenton) created a decade ago, but designed to play nice in the web stack. That means:</p><ul><li><p>Like Cap'n Proto, it is an object-capability protocol. ("Cap'n" is short for "capabilities and".) We'll get into this more below, but it's incredibly powerful.</p></li><li><p>Unlike Cap'n Proto, Cap'n Web has <i>no schemas</i>. In fact, it has almost no boilerplate whatsoever. This means it works more like the <a href="https://blog.cloudflare.com/javascript-native-rpc/"><u>JavaScript-native RPC system in Cloudflare Workers</u></a>.</p></li><li><p>That said, it integrates nicely with TypeScript.</p></li><li><p>Also unlike Cap'n Proto, Cap'n Web's underlying serialization is human-readable. In fact, it's just JSON, with a little pre-/post-processing.</p></li><li><p>It works over HTTP, WebSocket, and postMessage() out-of-the-box, with the ability to extend it to other transports easily.</p></li><li><p>It works in all major browsers, Cloudflare Workers, Node.js, and other modern JavaScript runtimes.</p></li><li><p>The whole thing compresses (minify+gzip) to under 10 kB with no dependencies.</p></li><li><p><a href="https://github.com/cloudflare/capnweb"><u>It's open source</u></a> under the MIT license.</p></li></ul><p>Cap'n Web is more expressive than almost every other RPC system, because it implements an <b>object-capability RPC model</b>. That means it:</p><ul><li><p>Supports bidirectional calling. The client can call the server, and the server can also call the client.</p></li><li><p>Supports passing functions by reference: If you pass a function over RPC, the recipient receives a "stub". 
When they call the stub, they actually make an RPC back to you, invoking the function where it was created. This is how bidirectional calling happens: the client passes a callback to the server, and then the server can call it later.</p></li><li><p>Similarly, supports passing objects by reference: If a class extends the special marker type <code>RpcTarget</code>, then instances of that class are passed by reference, with method calls calling back to the location where the object was created.</p></li><li><p>Supports promise pipelining. When you start an RPC, you get back a promise. Instead of awaiting it, you can immediately use the promise in dependent RPCs, thus performing a chain of calls in a single network round trip.</p></li><li><p>Supports capability-based security patterns.</p></li></ul><p>In short, Cap'n Web lets you design RPC interfaces the way you'd design regular JavaScript APIs – while still acknowledging and compensating for network latency.</p><p>The best part is, Cap'n Web is absolutely trivial to set up.</p><p>A client looks like this:</p>
            <pre><code>import { newWebSocketRpcSession } from "capnweb";

// One-line setup.
let api = newWebSocketRpcSession("wss://example.com/api");

// Call a method on the server!
let result = await api.hello("World");

console.log(result);
</code></pre>
            <p>And here's a complete Cloudflare Worker implementing an RPC server:</p>
            <pre><code>import { RpcTarget, newWorkersRpcResponse } from "capnweb";

// This is the server implementation.
class MyApiServer extends RpcTarget {
  hello(name) {
    return `Hello, ${name}!`
  }
}

// Standard Workers HTTP handler.
export default {
  fetch(request, env, ctx) {
    // Parse URL for routing.
    let url = new URL(request.url);

    // Serve API at `/api`.
    if (url.pathname === "/api") {
      return newWorkersRpcResponse(request, new MyApiServer());
    }

    // You could serve other endpoints here...
    return new Response("Not found", {status: 404});
  }
}
</code></pre>
            <p>That's it. That's the app.</p><ul><li><p>You can add more methods to <code>MyApiServer</code>, and call them from the client.</p></li><li><p>You can have the client pass a callback function to the server, and the server can simply call it.</p></li><li><p>You can define a TypeScript interface for your API, and easily apply it to the client and server.</p></li></ul><p>It just works.</p>
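            <p>Here is the shape of that callback pattern. This is a hedged, dependency-free sketch: the client and server live in one process and call each other directly, and <code>ChatServer</code> is our own hypothetical example. Over Cap'n Web, the server class would extend <code>RpcTarget</code>, the function would arrive on the server as a stub, and each call on the stub would be an RPC back to the client that created it.</p>

```javascript
// Sketch of pass-by-reference callbacks, in-process for brevity.
// With Cap'n Web, `callback` would be a stub, and invoking it would
// travel back over the network to the subscribing client.
class ChatServer {
  constructor() { this.subscribers = []; }
  subscribe(callback) {
    this.subscribers.push(callback);
  }
  broadcast(message) {
    // Each of these calls would go back over the network.
    for (const cb of this.subscribers) cb(message);
  }
}

// The "client" passes a plain function; the server calls it later.
const server = new ChatServer();
const received = [];
server.subscribe((msg) => received.push(msg));
server.broadcast("hello subscribers");
```
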
    <div>
      <h3>Why RPC? (And what is RPC anyway?)</h3>
      <a href="#why-rpc-and-what-is-rpc-anyway">
        
      </a>
    </div>
    <p>Remote Procedure Calls (RPC) are a way of expressing communications between two programs over a network. Without RPC, you might communicate using a protocol like HTTP. With HTTP, though, you must format and parse your communications as an HTTP request and response, perhaps designed in <a href="https://en.wikipedia.org/wiki/REST"><u>REST</u></a> style. RPC systems try to make communications look like a regular function call instead, as if you were calling a library rather than a remote service. The RPC system provides a "stub" object on the client side which stands in for the real server-side object. When a method is called on the stub, the RPC system figures out how to serialize and transmit the parameters to the server, invoke the method on the server, and then transmit the return value back.</p><p>The merits of RPC have been subject to a great deal of debate. RPC is often accused of committing many of the <a href="https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing"><u>fallacies of distributed computing</u></a>.</p><p>But this reputation is outdated. When RPC was first invented some 40 years ago, async programming barely existed. We did not have Promises, much less async and await. Early RPC was synchronous: calls would block the calling thread waiting for a reply. At best, latency made the program slow. At worst, network failures would hang or crash the program. No wonder it was deemed "broken".</p><p>Things are different today. We have Promise and async and await, and we can throw exceptions on network failures. We even understand how RPCs can be pipelined so that a chain of calls takes only one network round trip. Many large distributed systems you likely use every day are built on RPC. It works.</p><p>The fact is, RPC fits the programming model we're used to. Every programmer is trained to think in terms of APIs composed of function calls, not in terms of byte stream protocols nor even REST. 
Using RPC frees you from the need to constantly translate between mental models, allowing you to move faster.</p>
    <div>
      <h3>When should you use Cap'n Web?</h3>
      <a href="#when-should-you-use-capn-web">
        
      </a>
    </div>
    <p>Cap'n Web is useful anywhere you have two JavaScript applications speaking to each other over a network, including client-to-server and microservice-to-microservice scenarios. However, it is particularly well-suited to interactive web applications with real-time collaborative features, as well as modeling interactions over complex security boundaries.</p><p>Cap'n Web is still new and experimental, so for now, a willingness to live on the cutting edge may also be required!</p>
    <div>
      <h2>Features, features, features…</h2>
      <a href="#features-features-features">
        
      </a>
    </div>
    <p>Here's some more things you can do with Cap'n Web.</p>
    <div>
      <h3>HTTP batch mode</h3>
      <a href="#http-batch-mode">
        
      </a>
    </div>
    <p>Sometimes a WebSocket connection is a bit too heavyweight. What if you just want to make a quick one-time batch of calls, but don't need an ongoing connection?</p><p>For that, Cap'n Web supports HTTP batch mode:</p>
            <pre><code>import { newHttpBatchRpcSession } from "capnweb";

let batch = newHttpBatchRpcSession("https://example.com/api");

let result = await batch.hello("World");

console.log(result);
</code></pre>
            <p><i>(The server is exactly the same as before.)</i></p><p>Note that once you've awaited an RPC in the batch, the batch is done, and all the remote references received through it become broken. To make more calls, you need to start over with a new batch. However, you can make multiple calls in a single batch:</p>
            <pre><code>let batch = newHttpBatchRpcSession("https://example.com/api");

// We can make multiple calls, as long as we await them all at once.
let promise1 = batch.hello("Alice");
let promise2 = batch.hello("Bob");

let [result1, result2] = await Promise.all([promise1, promise2]);

console.log(result1);
console.log(result2);
</code></pre>
            <p>And that brings us to another feature…</p>
    <div>
      <h3>Chained calls (Promise Pipelining)</h3>
      <a href="#chained-calls-promise-pipelining">
        
      </a>
    </div>
    <p>Here's where things get magical.</p><p>In both batch mode and WebSocket mode, you can make a call that depends on the result of another call, without waiting for the first call to finish. In batch mode, that means you can, in a single batch, call a method, then use its result in another call. The entire batch still requires only one network round trip.</p><p>For example, say your API is:</p>
            <pre><code>class MyApiServer extends RpcTarget {
  getMyName() {
    return "Alice";
  }

  hello(name) {
    return `Hello, ${name}!`
  }
}
</code></pre>
            <p>You can do:</p>
            <pre><code>let namePromise = batch.getMyName();
let result = await batch.hello(namePromise);

console.log(result);
</code></pre>
            <p>Notice the initial call to <code>getMyName()</code> returned a promise, but we used the promise itself as the input to <code>hello()</code>, without awaiting it first. With Cap'n Web, this just works: The client sends a message to the server saying: "Please insert the result of the first call into the parameters of the second."</p><p>Or perhaps the first call returns an object with methods. You can call the methods immediately, without awaiting the first promise, like:</p>
            <pre><code>let batch = newHttpBatchRpcSession("https://example.com/api");

// Authenticate the API key, returning a Session object.
let sessionPromise = batch.authenticate(apiKey);

// Get the user's name.
let name = await sessionPromise.whoami();

console.log(name);
</code></pre>
            <p>This works because the promise returned by a Cap'n Web call is not a regular promise. Instead, it's a JavaScript Proxy object. Any methods you call on it are interpreted as speculative method calls on the eventual result. These calls are sent to the server immediately, telling the server: "When you finish the call I sent earlier, call this method on what it returns."</p>
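To build intuition for the Proxy mechanics described above, here is a toy sketch. This is an illustration of the idea only, not Cap'n Web's actual implementation: it chains speculative calls onto a local promise, whereas the real library sends them over the wire immediately.

```javascript
// Hypothetical sketch: wrap a promise in a Proxy so that method calls on it
// are interpreted as calls on the eventual result, without awaiting first.
function pipeline(promise) {
  return new Proxy(function () {}, {
    get(_target, prop) {
      if (prop === "then") {
        // Allow the pipelined promise itself to be awaited.
        return promise.then.bind(promise);
      }
      // A method call chains onto the eventual result and is itself pipelined.
      return (...args) => pipeline(promise.then(obj => obj[prop](...args)));
    },
  });
}

// Usage: call whoami() on the session before it has resolved.
const sessionPromise = pipeline(
  Promise.resolve({ whoami: () => "Alice" })
);
sessionPromise.whoami().then(name => console.log(name)); // logs "Alice"
```

In the real protocol, each intercepted call becomes a message telling the server "when the earlier call finishes, invoke this method on its result", so no extra round trip is added.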
    <div>
      <h3>Did you spot the security?</h3>
      <a href="#did-you-spot-the-security">
        
      </a>
    </div>
    <p>This last example shows an important security pattern enabled by Cap'n Web's object-capability model.</p><p>When we call the authenticate() method, after it has verified the provided API key, it returns an authenticated session object. The client can then make further RPCs on the session object to perform operations that require authorization as that user. The server code might look like this:</p>
            <pre><code>class MyApiServer extends RpcTarget {
  async authenticate(apiKey) {
    let username = await checkApiKey(apiKey);
    return new AuthenticatedSession(username);
  }
}

class AuthenticatedSession extends RpcTarget {
  constructor(username) {
    super();
    this.username = username;
  }

  whoami() {
    return this.username;
  }

  // ...other methods requiring auth...
}
</code></pre>
            <p>Here's what makes this work: <b>It is impossible for the client to "forge" a session object. The only way to get one is to call authenticate(), and have it return successfully.</b></p><p>In most RPC systems, it is not possible for one RPC to return a stub pointing at a new RPC object in this way. Instead, all functions are top-level, and can be called by anyone. In such a traditional RPC system, it would be necessary to pass the API key again to every function call, and check it again on the server each time. Or, you'd need to do authorization outside the RPC system entirely.</p><p>This is a common pain point for WebSockets in particular. Due to the design of the web APIs for WebSocket, you generally cannot use headers nor cookies to authorize them. Instead, authorization must happen in-band, by sending a message over the WebSocket itself. But this can be annoying for RPC protocols, as it means the authentication message is "special" and changes the state of the connection itself, affecting later calls. This breaks the abstraction.</p><p>The authenticate() pattern shown above neatly makes authentication fit naturally into the RPC abstraction. It's even type-safe: you can't possibly forget to authenticate before calling a method requiring auth, because you wouldn't have an object on which to make the call. Speaking of type-safety…</p>
    <div>
      <h3>TypeScript</h3>
      <a href="#typescript">
        
      </a>
    </div>
    <p>If you use TypeScript, Cap'n Web plays nicely with it. You can declare your RPC API once as a TypeScript interface, implement it on the server, and call it on the client:</p>
            <pre><code>// Shared interface declaration:
interface MyApi {
  hello(name: string): Promise&lt;string&gt;;
}

// On the client:
let api: RpcStub&lt;MyApi&gt; = newWebSocketRpcSession("wss://example.com/api");

// On the server:
class MyApiServer extends RpcTarget implements MyApi {
  hello(name) {
    return `Hello, ${name}!`
  }
}
</code></pre>
            <p>Now you get end-to-end type checking, auto-completed method names, and so on.</p><p>Note that, as always with TypeScript, no type checks occur at runtime. The RPC system itself does not prevent a malicious client from calling an RPC with parameters of the wrong type. This is, of course, not a problem unique to Cap'n Web – JSON-based APIs have always had this problem. You may wish to use a runtime type-checking system like Zod to solve this. (Meanwhile, we hope to add type checking based directly on TypeScript types in the future.)</p>
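As a concrete illustration of why runtime checks still matter, a server method might guard its own parameters explicitly. This is a hand-rolled sketch shown in place of a schema library like Zod, and it omits the RpcTarget base class for brevity:

```javascript
// Sketch: TypeScript types are erased at runtime, so the server must
// validate parameters itself before trusting them.
class MyApiServer {
  hello(name) {
    // A library like Zod can express this declaratively; here we check by
    // hand to keep the example dependency-free.
    if (typeof name !== "string") {
      throw new TypeError("hello(): expected `name` to be a string");
    }
    return `Hello, ${name}!`;
  }
}

const api = new MyApiServer();
console.log(api.hello("Alice")); // logs "Hello, Alice!"
```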
    <div>
      <h2>An alternative to GraphQL?</h2>
      <a href="#an-alternative-to-graphql">
        
      </a>
    </div>
    <p>If you’ve used GraphQL before, you might notice some similarities. One benefit of GraphQL was to solve the “waterfall” problem of traditional REST APIs by allowing clients to ask for multiple pieces of data in one query. For example, instead of making three sequential HTTP calls:</p>
            <pre><code>GET /user
GET /user/friends
GET /user/friends/photos</code></pre>
            <p>…you can write one GraphQL query to fetch it all at once.</p><p>That’s a big improvement over REST, but GraphQL comes with its own tradeoffs:</p><ul><li><p><b>New language and tooling.</b> You have to adopt GraphQL’s schema language, servers, and client libraries. If your team is all-in on JavaScript, that’s a lot of extra machinery.</p></li><li><p><b>Limited composability.</b> GraphQL queries are declarative, which makes them great for fetching data, but awkward for chaining operations or mutations. For example, you can’t easily say: “create a user, then immediately use that new user object to make a friend request, all in one round trip.”</p></li><li><p><b>Different abstraction model.</b> GraphQL doesn’t look or feel like the JavaScript APIs you already know. You’re learning a new mental model rather than extending the one you use every day.</p></li></ul>
    <div>
      <h3>How Cap'n Web goes further</h3>
      <a href="#how-capn-web-goes-further">
        
      </a>
    </div>
    <p>Cap'n Web solves the waterfall problem <i>without</i> introducing a new language or ecosystem. It’s just JavaScript. Because Cap'n Web supports promise pipelining and object references, you can write code that looks like this:</p>
            <pre><code>let user = api.createUser({ name: "Alice" });
let friendRequest = await user.sendFriendRequest("Bob");</code></pre>
            <p>What happens under the hood? Both calls are pipelined into a single network round trip:</p><ol><li><p>Create the user.</p></li><li><p>Take the result of that call (a new User object).</p></li><li><p>Immediately invoke sendFriendRequest() on that object.</p></li></ol><p>All of this is expressed naturally in JavaScript, with no schemas, query languages, or special tooling required. You just call methods and pass objects around, like you would in any other JavaScript code.</p><p>In other words, GraphQL gave us a way to flatten REST’s waterfalls. Cap'n Web lets us go even further: it gives you the power to model complex interactions exactly the way you would in a normal program, with no impedance mismatch.</p>
    <div>
      <h3>But how do we solve arrays?</h3>
      <a href="#but-how-do-we-solve-arrays">
        
      </a>
    </div>
    <p>With everything we've presented so far, there's a critical missing piece to seriously consider Cap'n Web as an alternative to GraphQL: handling lists. Often, GraphQL is used to say: "Perform this query, and then, for every result, perform this other query." For example: "List the user's friends, and then for each one, fetch their profile photo."</p><p>In short, we need an <code>array.map()</code> operation that can be performed without adding a round trip.</p><p>Cap'n Proto, historically, has never supported such a thing.</p><p>But with Cap'n Web, we've solved it. You can do:</p>
            <pre><code>let user = api.authenticate(token);

// Get the user's list of friends (an array).
let friendsPromise = user.listFriends();

// Do a .map() to annotate each friend record with their photo.
// This operates on the *promise* for the friends list, so does not
// add a round trip.
// (wait WHAT!?!?)
let friendsWithPhotos = friendsPromise.map(friend =&gt; {
  return {friend, photo: api.getUserPhoto(friend.id)};
});

// Await the friends list with attached photos -- one round trip!
let results = await friendsWithPhotos;
</code></pre>
            
    <div>
      <h3>Wait… How!?</h3>
      <a href="#wait-how">
        
      </a>
    </div>
    <p><code>.map()</code> takes a callback function, which needs to be applied to each element in the array. As we described earlier, <i>normally</i> when you pass a function to an RPC, the function is passed "by reference", meaning that the remote side receives a stub, where calling that stub makes an RPC back to the client where the function was created.</p><p>But that is NOT what is happening here. That would defeat the purpose: we don't want the server to have to round-trip to the client to process every member of the array. We want the server to just apply the transformation server-side.</p><p>To that end, <code>.map() </code>is special. It does not send JavaScript code to the server, but it does send something like "code", restricted to a domain-specific, non-Turing-complete language. The "code" is a list of instructions that the server should carry out for each member of the array. In this case, the instructions are:</p><ol><li><p>Invoke <code>api.getUserPhoto(friend.id)</code>.</p></li><li><p>Return an object <code>{friend, photo}</code>, where friend is the original array element and photo is the result of step 1.</p></li></ol><p>But the application code just specified a JavaScript method. How on Earth could we convert this into the narrow DSL?</p><p>The answer is record-replay: On the client side, we execute the callback once, passing in a special placeholder value. The parameter behaves like an RPC promise. However, the callback is required to be synchronous, so it cannot actually await this promise. The only thing it can do is use promise pipelining to make pipelined calls. 
These calls are intercepted by the implementation and recorded as instructions, which can then be sent to the server, where they can be replayed as needed.</p><p>And because the recording is based on promise pipelining, which is what the RPC protocol itself is designed to represent, it turns out that the "DSL" used to represent "instructions" for the map function is <i>just the RPC protocol itself</i>. 🤯</p>
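The record-replay trick can be demonstrated with a standalone toy. This is a simplified sketch, not Cap'n Web's actual code: it only records a single chain of method calls on the element itself, rather than full pipelined expressions involving other stubs.

```javascript
// Hypothetical sketch of record-replay: run the callback once against a
// recording placeholder, capture which methods it calls, then replay those
// instructions against each real element on the "server" side.
function recordCalls(callback) {
  const instructions = []; // e.g. [{ method: "trim", args: [] }]
  const placeholder = new Proxy({}, {
    get: (_target, method) =>
      (...args) => {
        instructions.push({ method, args });
        return placeholder; // allow further chained calls
      },
  });
  callback(placeholder); // executed exactly once, synchronously
  return instructions;
}

// "Server side": apply the recorded instructions to every array element.
function replayMap(array, instructions) {
  return array.map(element =>
    instructions.reduce((value, insn) => value[insn.method](...insn.args), element)
  );
}

// Usage: the callback runs once, yet its effect applies to each element.
const insns = recordCalls(s => s.trim().toUpperCase());
console.log(replayMap(["  alice ", " bob "], insns)); // [ 'ALICE', 'BOB' ]
```

The real implementation records pipelined RPC calls rather than plain method calls, which is why the instruction format ends up being the RPC protocol itself.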
    <div>
      <h2>Implementation details</h2>
      <a href="#implementation-details">
        
      </a>
    </div>
    
    <div>
      <h3>JSON-based serialization</h3>
      <a href="#json-based-serialization">
        
      </a>
    </div>
    <p>Cap'n Web's underlying protocol is based on JSON – but with a preprocessing step to handle special types. Arrays are treated as "escape sequences" that let us encode other values. For example, JSON does not have an encoding for <code>Date</code> objects, but Cap'n Web does. You might see a message that looks like this:</p>
            <pre><code>{
  "event": "Birthday Week",
  "timestamp": ["date", 1758499200000]
}
</code></pre>
            <p>To encode a literal array, we simply double-wrap it in <code>[]</code>:</p>
            <pre><code>{
  "names": [["Alice", "Bob", "Carol"]]
}
</code></pre>
            <p>In other words, an array with just one element which is itself an array, evaluates to the inner array literally. An array whose first element is a type name, evaluates to an instance of that type, where the remaining elements are parameters to the type.</p><p>Note that only a fixed set of types are supported: essentially, <a href="https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Structured_clone_algorithm"><u>"structured clonable" types</u></a>, and RPC stub types.</p><p>On top of this basic encoding, we define an RPC protocol inspired by Cap'n Proto – but greatly simplified.</p>
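The escape-sequence scheme described above can be sketched as a small encoder/decoder pair. This is an illustrative sketch handling only `Date` and literal arrays, not the library's actual serializer, which supports the full set of structured-clonable types and stubs:

```javascript
// Sketch: arrays act as escape sequences. Dates become ["date", millis];
// literal arrays are double-wrapped so they can't be mistaken for escapes.
function encode(value) {
  if (value instanceof Date) return ["date", value.getTime()];
  if (Array.isArray(value)) return [value.map(encode)];
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k, encode(v)])
    );
  }
  return value; // strings, numbers, booleans, null pass through unchanged
}

function decode(value) {
  if (Array.isArray(value)) {
    if (value.length === 1 && Array.isArray(value[0])) {
      return value[0].map(decode); // double-wrapped: a literal array
    }
    if (value[0] === "date") return new Date(value[1]);
    throw new TypeError("unknown escape sequence: " + value[0]);
  }
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k, decode(v)])
    );
  }
  return value;
}

// Round trip through plain JSON:
const msg = { event: "Birthday Week", timestamp: new Date(1758499200000) };
const wire = JSON.stringify(encode(msg));
console.log(decode(JSON.parse(wire)).timestamp.getTime()); // 1758499200000
```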
    <div>
      <h3>RPC protocol</h3>
      <a href="#rpc-protocol">
        
      </a>
    </div>
    <p>Since Cap'n Web is a symmetric protocol, there is no well-defined "client" or "server" at the protocol level. There are just two parties exchanging messages across a connection. Every kind of interaction can happen in either direction.</p><p>In order to make it easier to describe these interactions, I will refer to the two parties as "Alice" and "Bob".</p><p>Alice and Bob start the connection by establishing some sort of bidirectional message stream. This may be a WebSocket, but Cap'n Web also allows applications to define their own transports. Each message in the stream is JSON-encoded, as described earlier.</p><p>Alice and Bob each maintain some state about the connection. In particular, each maintains an "export table", describing all the pass-by-reference objects they have exposed to the other side, and an "import table", describing the references they have received. Alice's exports correspond to Bob's imports, and vice versa. Each entry in the export table has a signed integer ID, which is used to reference it. You can think of these IDs like file descriptors in a POSIX system. Unlike file descriptors, though, IDs can be negative, and an ID is never reused over the lifetime of a connection.</p><p>At the start of the connection, Alice and Bob each populate their export tables with a single entry, numbered zero, representing their "main" interfaces. Typically, when one side is acting as the "server", they will export their main public RPC interface as ID zero, whereas the "client" will export an empty interface. However, this is up to the application: either side can export whatever they want.</p><p>From there, new exports are added in two ways:</p><ul><li><p>When Alice sends a message to Bob that contains within it an object or function reference, Alice adds the target object to her export table. 
IDs assigned in this case are always negative, starting from -1 and counting downwards.</p></li><li><p>Alice can send a "push" message to Bob to request that Bob add a value to his export table. The "push" message contains an expression which Bob evaluates, exporting the result. Usually, the expression describes a method call on one of Bob's existing exports – this is how an RPC is made. Each "push" is assigned a positive ID on the export table, starting from 1 and counting upwards. Since positive IDs are only assigned as a result of pushes, Alice can predict the ID of each push she makes, and can immediately use that ID in subsequent messages. This is how promise pipelining is achieved.</p></li></ul><p>After sending a push message, Alice can subsequently send a "pull" message, which tells Bob that once he is done evaluating the "push", he should proactively serialize the result and send it back to Alice, as a "resolve" (or "reject") message. However, this is optional: Alice may not actually care to receive the return value of an RPC, if Alice only wants to use it in promise pipelining. In fact, the Cap'n Web implementation will only send a "pull" message if the application has actually awaited the returned promise.</p><p>Putting it together, a code sequence like this:</p>
            <pre><code>let namePromise = api.getMyName();
let result = await api.hello(namePromise);

console.log(result);</code></pre>
            <p>Might produce a message exchange like this:</p>
            <pre><code>// Call api.getMyName(). `api` is the server's main export, so has export ID 0.
-&gt; ["push", ["pipeline", 0, "getMyName", []]]
// Call api.hello(namePromise). `namePromise` refers to the result of the first push,
// so has ID 1.
-&gt; ["push", ["pipeline", 0, "hello", [["pipeline", 1]]]]
// Ask that the result of the second push be proactively serialized and returned.
-&gt; ["pull", 2]
// Server responds.
&lt;- ["resolve", 2, "Hello, Alice!"]</code></pre>
            <p>For more details about the protocol, <a href="https://github.com/cloudflare/capnweb/blob/main/protocol.md"><u>check out the docs</u></a>.</p>
    <div>
      <h2>Try it out!</h2>
      <a href="#try-it-out">
        
      </a>
    </div>
    <p>Cap'n Web is new and still highly experimental. There may be bugs to shake out. But, we're already using it today. Cap'n Web is the basis of <a href="https://developers.cloudflare.com/changelog/2025-09-16-remote-bindings-ga/"><u>the recently-launched "remote bindings" feature in Wrangler</u></a>, allowing a local test instance of workerd to speak RPC to services in production. We've also begun to experiment with it in various frontend applications – expect more blog posts on this in the future.</p><p>In any case, Cap'n Web is open source, and you can start using it in your own projects now.</p><p><a href="https://github.com/cloudflare/capnweb"><u>Check it out on GitHub.</u></a></p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/53YF87AtEsYhHMN3PV23UV/8e9a938099c71e6f274e95292b16b382/BLOG-2954_2.png" />
          </figure><p>
</p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[JavaScript]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">4Du5F6RJFvwqqEbMMuuxTi</guid>
            <dc:creator>Kenton Varda</dc:creator>
            <dc:creator>Steve Faulkner</dc:creator>
        </item>
        <item>
            <title><![CDATA[Performance measurements… and the people who love them]]></title>
            <link>https://blog.cloudflare.com/loving-performance-measurements/</link>
            <pubDate>Tue, 20 May 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Developers have a gut-felt understanding for performance, but that intuition breaks down when systems reach Cloudflare’s scale. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>⚠️ WARNING ⚠️ This blog post contains graphic depictions of probability. Reader discretion is advised.</p><p>Measuring performance is tricky. You have to think about accuracy and precision. Are your sampling rates high enough? Could they be too high?? How much metadata does each recording need??? Even after all that, all you have is raw data. Eventually for all this raw performance information to be useful, it has to be aggregated and communicated. Whether it's in the form of a dashboard, customer report, or a paged alert, performance measurements are only useful if someone can see and understand them.</p><p>This post is a collection of things I've learned working on customer performance escalations within Cloudflare and analyzing existing tools (both internal and commercial) that we use when evaluating our own performance.  A lot of this information also comes from Gil Tene's talk, <a href="https://youtu.be/lJ8ydIuPFeU"><u>How NOT to Measure Latency</u></a>. You should definitely watch that too (but maybe after reading this, so you don't spoil the ending). I was surprised by my own blind spots and which assumptions turned out to be wrong, even though they seemed "obviously true" at the start. I expect I am not alone in these regards. For that reason this journey starts by establishing fundamental definitions and ends with some new tools and techniques that we will be sharing as well as the surprising results that those tools uncovered.</p>
    <div>
      <h2>Check your verbiage</h2>
      <a href="#check-your-verbiage">
        
      </a>
    </div>
    <p>So ... what is performance? Alright, let's start with something easy: definitions. "Performance" is not a very precise term because it gets used in too many contexts. Most of us as nerds and engineers have a gut understanding of what it means, without a real definition. We can't <i>really</i> measure it because how "good" something is depends on what makes that thing good. "Latency" is better ... but not as much as you might think. Latency does at least have an implicit time unit, so we <i>can</i> measure it. But ... <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-latency/">what is latency</a>? There are lots of good, specific examples of measurements of latency, but we are going to use a general definition. Someone starts something, and then it finishes — the elapsed time between is the latency.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1r4blwH5oeloUdXoizuLB4/f58014b1b4b3715f54400e6b03c60ea7/image7.png" />
          </figure><p>This seems a bit reductive, but it’s a surprisingly useful definition because it gives us a key insight. This fundamental definition of latency is based around the client's perspective. Indeed, when we look at our internal measurements of latency for health checks and monitoring, they all have this one-sided caller/callee relationship. There is the latency of the caching layer from the point of view of the ingress proxy. There’s the latency of the origin from the cache’s point of view. Each component can measure the latency of its upstream counterparts, but not the other way around. </p><p>This one-sided nature of latency observation is a real problem for us because Cloudflare <i>only</i> exists on the server side. This makes all of our internal measurements of latency purely estimations. Even if we did have full visibility into a client’s request timing, the start-to-finish latency of a request to Cloudflare isn’t a great measure of Cloudflare’s latency. The process of making an HTTP request has lots of steps, only a subset of which are affected by us. Time spent on things like DNS lookup, local computation for TLS, or resource contention <i>do</i> affect the client’s experience of latency, but only serve as sources of noise when we are considering our own performance.</p><p>There is a very useful and common metric that is used to measure web requests, and I’m sure lots of you have been screaming it in your brains from the second you read the title of this post. ✨Time to first byte✨. Clearly this is the answer, right?!  But ... what is “Time to first byte”?</p>
    <div>
      <h2>TTFB mine</h2>
      <a href="#ttfb-mine">
        
      </a>
    </div>
    <p>Time to first byte (TTFB) on its face is simple. The name implies that it's the time it takes (on the client's side) to receive the first byte of the response from the server, but unfortunately, that only describes when the timer should end. It doesn't say when the timer should start. This ambiguity is just one factor that leads to inconsistencies when trying to compare TTFB across different measurement platforms ... or even across a single platform because there is no <i>one</i> definition of TTFB. Similar to “performance”, it is used in too many places to have a single definition. That being said, TTFB is a very useful concept, so in order to measure it and report it in an unambiguous way, we need to pick a definition that’s already in use.</p><p>We have mentioned TTFB in other blog posts, but <a href="https://blog.cloudflare.com/ttfb-is-not-what-it-used-to-be/"><u>this one</u></a> sums up the problem best with “Time to first byte isn’t what it used to be.” You should read that article too, but the gist is that one popular TTFB definition used by browsers was changed in a confusing way with the introduction of <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/103"><u>early hints</u></a> in June 2022. That post and <a href="https://blog.cloudflare.com/tag/ttfb/"><u>others</u></a> make the point that while TTFB is useful, it isn’t the best direct measurement for web performance. Later on in this post we will derive why that’s the case.</p><p>One common place <i>we</i> see TTFB used is our customers’ analysis comparing Cloudflare's performance to our competitors through <a href="https://www.catchpoint.com/"><u>Catchpoint</u></a>. Customers, as you might imagine, have a vested interest in measuring our latency, as it affects theirs. Catchpoint provides several tools built on their global Internet probe network for measuring HTTP request latency (among other things) and visualizing it in their web interface. 
In an effort to align better with our customers, we decided to adopt Catchpoint’s terminology for talking about latency, both internally and externally.</p>
    <div>
      <h2>Catchpoint catch-up</h2>
      <a href="#catchpoint-catch-up">
        
      </a>
    </div>
    <p>While Catchpoint makes things like TTFB easy to plot over time, the visualization tool doesn't give a definition of what TTFB is, but after going through all of their technical blog posts and combing through thousands of lines of raw data, we were able to get functional definitions for TTFB and other composite metrics. This was an important step because these metrics are how our customers are viewing our performance, so we all need to be able to understand exactly what they signify! The final report for this is internal (and long and dry), so in this post, I'll give you the highlights in the form of colorful diagrams, starting with this one.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5bB3HmSrIIhQ2AzVpheJWa/8d2b73f3f2f0602217daaf7fea847e11/image6.png" />
          </figure><p>This diagram shows our customers' most commonly viewed client metrics on Catchpoint and how they fit together into the processing of a request from the server side. Notice that some are directly measured, and some are calculated based on the direct measurements. Right in the middle is TTFB, which Catchpoint calculates as the sum of the DNS, Connect, TLS, and Wait times. It’s worth noting again that this is not <i>the</i> definition of TTFB, this is just Catchpoint’s definition, and now ours.</p><p>This breakdown of HTTPS phases is not the only one commonly used. Browsers themselves have a standard for measuring the stages of a request. The diagram below shows how most browsers are reporting request metrics. Luckily (and maybe unsurprisingly) these phases match Catchpoint's very closely.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ZouyuBQV7XgER2kqhMy8r/04f750eef44ba12bb6915a06eac532ca/image1.png" />
          </figure><p>There are some differences beyond the inclusion of things like <a href="https://html.spec.whatwg.org/#applicationcache"><u>AppCache</u></a> and <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Redirections"><u>Redirects</u></a> (which are not directly impacted by Cloudflare's latency). Browser timing metrics are based on timestamps instead of durations. The diagram subtly calls this out with gaps between the different phases indicating that there is the potential for the computer running the browser to do things that are not part of any phase. We can line up these timestamps with Catchpoint's metrics like so:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4TvwOuTxWvMBxKGQQTUfZc/a8105d77725a9fa0d3e5bf6a115a13a5/Screenshot_2025-05-15_at_11.31.46.png" />
          </figure><p>Now that we, our customers, and our browsers (with data coming from <a href="https://en.wikipedia.org/wiki/Real_user_monitoring"><u>RUM</u></a>) have a common and well-defined language to talk about the phases of a request, we can start to measure, visualize, and compare the components that make up the network latency of a request. </p>
    <div>
      <h2>Visual basics</h2>
      <a href="#visual-basics">
        
      </a>
    </div>
    <p>Now that we have defined what our key values for latency are, we can record numbers and put them in a chart and watch them roll by ... except not directly. In most cases, the systems we use to record the data actively prevent us from seeing the recorded data in its raw form. Tools like <a href="https://prometheus.io/"><u>Prometheus</u></a> are designed to collect pre-aggregated data, not individual samples, and for a good reason. Storing every recorded metric (even compacted) would be an enormous amount of data. Even worse, the data loses its value exponentially over time, since the most recent data is the most actionable.</p><p>The unavoidable conclusion is that some aggregation has to be done before performance data can be visualized. In most cases, the aggregation means looking at a series of windowed percentiles over time. The most common are 50th percentile (median), 75th, 90th, and 99th if you're really lucky. Here is an example of a latency visualization from one of our own internal dashboards.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/lvjAR41mTJf2d5Vdg5SwT/19ff931587790b1fb7fbcc317ab83a5e/image8.png" />
          </figure><p>It clearly shows a spike in latency around 14:40 UTC. Was it an incident? The p99 jumped by 1300% (500 ms to 6500 ms) for multiple minutes, while the p50 jumped by more than 13600% (4.4 ms to 600 ms). It is a clear signal, so something must have happened, but what was it? Let me keep you in suspense for a second while we talk about statistics and probability.</p>
    <div>
      <h2>Uncooked math</h2>
      <a href="#uncooked-math">
        
      </a>
    </div>
    <p>Let me start with a quote from my dear, close, personal friend <a href="https://www.youtube.com/watch?v=xV4rLfpidIk"><u>@ThePrimeagen</u></a>:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/I8VbrcSjVSKY1i7fbVEMl/8108e25e78c1ee5356bbd080c467c056/Screenshot_2025-05-15_at_11.33.40.png" />
          </figure><p>It's a good reminder that while statistics is a great tool for providing a simplified and generalized representation of a complex system, it can also obscure important subtleties of that system. A good way to think of statistical modeling is like lossy compression. In the latency visualization above (which is a plot of TTFB over time), we are compressing the entire spectrum of latency metrics into 4 percentile bands, and because we are only considering up to the 99th percentile, there's an entire 1% of samples left over that we are ignoring! </p><p>"What?" I hear you asking. "P99 is already well into perfection territory. We're not trying to be perfectionists. Maybe we should get our p50s down first". Let's put things in perspective. This zone (<a href="http://www.cloudflare.com/"><u>www.cloudflare.com</u></a>) is getting about 30,000 req/s and the 99th percentile latency is 500 ms. (Here we are defining latency as “Edge TTFB”, a server-side approximation of our now official definition.) So there are 300 req/s that are taking longer than half a second to complete, and that's just the portion of the request that <i>we</i> can see. How much worse than 500 ms are those requests in the top 1%? If we look at the 100th percentile (the max), we get a much different vibe from our Edge TTFB plot.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/NDvJObDLjy5D8bKIEhsjS/10f1c40940ba41aae308100c7f374836/image12.png" />
          </figure><p>Viewed like this, the spike in latency no longer looks so remarkable. Without seeing more of the picture, we could easily believe something was wrong when in reality, even if something is wrong, it is not localized to that moment. In this case, it's like we are using our own statistics to lie to ourselves. </p>
    <div>
      <h2>The top 1% of requests have 99% of the latency</h2>
      <a href="#the-top-1-of-requests-have-99-of-the-latency">
        
      </a>
    </div>
    <p>Maybe you're still not convinced. It feels more intuitive to focus on the median because the latency experienced by 50 out of 100 people seems more important than that of 1 in 100. That statement is true, but notice I said "people" and not "requests." A person visiting a website is rarely making one request at a time.</p><p>Taking <a href="http://www.cloudflare.com/"><u>www.cloudflare.com</u></a> as an example again, when a user opens that page, their browser makes more than <b>70</b> requests. That sounds like a lot, but in the world of user-facing websites, it’s not that bad. In contrast, <a href="http://www.amazon.com/"><u>www.amazon.com</u></a> issues more than <b>400</b> requests! It's worth noting that not all of those requests need to complete before a web page or application becomes usable. That's why more advanced, browser-focused metrics exist, but I will leave a discussion of those for later blog posts. I am more interested in how making that many requests changes the probability calculations for expected latency on a per-user basis. </p><p>Here's a brief primer on combining probabilities that covers everything you need to know to understand this section.</p><ul><li><p>The probability of two independent events both happening is the probability of the first multiplied by the probability of the second. $$P(X\cap Y )=P(X) \times P (Y)$$</p></li><li><p>The probability that a value falls at or below the $X^{th}$ percentile is $X\%$. $$P(pX) = X\%$$</p></li></ul><p>Let's define $P( pX_{N} )$ as the probability that someone on a website with $N$ requests experiences no latencies &gt;= the $X^{th}$ percentile. For example, $P(p50_{2})$ would be the probability of getting no latencies greater than the median on a page with 2 requests. Treating the request latencies as independent, this is equivalent to the probability of one request having a latency less than the $p50$ and the other request also having a latency less than the $p50$. 
We can combine the identities above. </p><p>$$\begin{align}
P( p50_{2}) &amp;= P\left ( p50 \cap p50 \right ) \\
   &amp;= P( p50) \times P\left ( p50 \right ) \\
   &amp;= 50\%^{2} \\
   &amp;= 25\%
\end{align}$$</p><p>We can generalize this for any percentile and any number of requests. $$P( pX_{N}) = X\%^{N}$$</p><p>For <a href="http://www.cloudflare.com/"><u>www.cloudflare.com</u></a> and its 70ish requests, the percentage of visitors that won't experience a latency above the median is </p><p>$$\begin{align} 
P( p50_{70}) &amp;= 50\%^{70} \\
  &amp;\approx 8.5 \times 10^{-20}\%
\end{align}$$</p><p>This vanishingly small number should make you question why we would value the $p50$ latency so highly at all when effectively no one experiences it as their worst-case latency.</p><p>So now the question is, what request latency percentile <i>should</i> we be looking at? Let's go back to the statement at the beginning of this section. What does the median person experience on <a href="http://www.cloudflare.com/"><u>www.cloudflare.com</u></a>? We can use a little algebra to solve for that.</p><p>$$\begin{align} 
P( pX_{70}) &amp;= 50\% \\
X\%^{70}  &amp;= 50\% \\
X\% &amp;= e^{ \frac{\ln\left ( 50\% \right )}{70}} \\
 &amp;\approx  99\%
\end{align}$$</p><p>This seems a little too perfect, but I am not making this up. For <a href="http://www.cloudflare.com/"><u>www.cloudflare.com</u></a>, if you want to capture a value that's representative of what the median user can expect, you need to look at $p99$ request latency. Extending this even further, if you want a value that's representative of what 99% of users will experience, you need to look at the <b>99.99th</b> <b>percentile</b>!</p>
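<p>The arithmetic above is easy to check numerically. Here is a quick sketch in Python (the function names are mine, for illustration only):</p>

```python
import math

def p_all_below(x_percentile: float, n_requests: int) -> float:
    # P(pX_N) = (X/100)^N: the chance that all N independent requests
    # land below the X-th percentile latency.
    return (x_percentile / 100.0) ** n_requests

def median_user_percentile(n_requests: int) -> float:
    # Solve (X/100)^N = 0.5 for X: the request-latency percentile that
    # the median user's slowest request corresponds to.
    return 100.0 * math.exp(math.log(0.5) / n_requests)

print(p_all_below(50, 2))                    # 0.25 -- two requests, both under the median
print(p_all_below(50, 70))                   # ~8.5e-22 -- effectively nobody on a 70-request page
print(round(median_user_percentile(70), 2))  # ~99.01 -- the median user lives at p99
```

<p>Plugging in 0.99 instead of 0.5 reproduces the last claim as well: <code>median_user_percentile</code> generalizes to any target fraction of users, and for 99% of users on a 70-request page the answer comes out near the 99.99th percentile.</p>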
    <div>
      <h2>Spherical latency in a vacuum</h2>
      <a href="#spherical-latency-in-a-vacuum">
        
      </a>
    </div>
    <p>Okay, this is where we bring everything together, so stay with me. So far, we have only talked about measuring the performance of a single system. This gives us absolute numbers to look at internally for monitoring, but if you’ll recall, the goal of this post was to be able to clearly communicate about performance outside the company. Often this communication takes the form of comparing Cloudflare’s performance against other providers. How are these comparisons done? By plotting a percentile request "latency" over time and eyeballing the difference.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/x9j5kstMS1kXdsb1PaIbu/837398e0da4758743155595f4f570340/image2.png" />
          </figure><p>With everything we have discussed in this post, it seems like we can devise a better method for doing this comparison. We saw how exposing more of the percentile spectrum can provide a new perspective on existing data, and how impactful higher percentile statistics can be when looking at a more complete user experience. Let me close this post with an example of how putting those two concepts together yields some intriguing results.</p>
    <div>
      <h2>One last thing</h2>
      <a href="#one-last-thing">
        
      </a>
    </div>
    <p>Below is a comparison of the latency (defined here as the sum of the TLS, Connect, and Wait times, or equivalently TTFB minus DNS lookup time) for the customer when viewed through Cloudflare and a competing provider. This is the same data represented in the chart immediately above (containing 90,000 samples for each provider), just in a different form called a <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function"><u>CDF plot</u></a>, which is one of a few ways we are making it easier to visualize the entire percentile range. The chart shows percentiles on the y-axis and latency measurements on the x-axis, so to read the latency value for a given percentile, go up to that percentile and then over to the curve. Interpreting these charts is as easy as finding which curve is farther to the left at any given percentile: that curve has the lower latency.</p>
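<p>If you want to build this kind of chart from your own measurements, an empirical CDF is just the sorted samples paired with their ranks. A minimal sketch in Python (standard library only; the sample data here is synthetic, not the 90,000 real measurements):</p>

```python
import math
import random

def empirical_cdf(samples):
    # Sort the latencies; the i-th smallest value sits at percentile
    # 100 * i / n. Plot latency on x and percentile on y.
    xs = sorted(samples)
    n = len(xs)
    return [(x, 100.0 * (i + 1) / n) for i, x in enumerate(xs)]

def latency_at(samples, p):
    # Read the chart programmatically: nearest-rank percentile.
    xs = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(xs)))
    return xs[rank - 1]

random.seed(1)
# Stand-in latencies in ms, roughly lognormal like real request timings.
latencies = [random.lognormvariate(4.5, 0.4) for _ in range(10_000)]
print(latency_at(latencies, 50), latency_at(latencies, 99))
```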
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/53sRk6UCoflU2bGcXypgEQ/f435bbdf43e1646cf2afb56d2aca26be/image4.png" />
          </figure><p>It's pretty clear that for nearly the entire percentile range, the other provider has the lower latency, by as much as 30 ms. That is, until you get to the very top of the chart. There's a little bit of blue that's above (and therefore to the left of) the green. To see what's going on there more clearly, we can use a different kind of visualization. This one is called a <a href="https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot"><u>QQ-Plot</u></a>, or quantile-quantile plot. It shows the same information as the CDF plot, but now each point on the curve represents a specific quantile, and the two axes are the latency values of the two providers at that percentile.</p>
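<p>The points of a QQ plot fall out of the same machinery: pair up the two providers' latencies at each percentile. A small sketch under the nearest-rank quantile convention (synthetic inputs, illustrative names):</p>

```python
def qq_points(samples_a, samples_b, percentiles=None):
    # For each percentile, pair provider A's latency with provider B's.
    # Points where a > b are percentiles where provider B is faster.
    a, b = sorted(samples_a), sorted(samples_b)

    def q(xs, p):
        # Nearest-rank quantile lookup on pre-sorted data.
        return xs[min(len(xs) - 1, int(p / 100.0 * len(xs)))]

    ps = percentiles or range(1, 100)
    return [(q(a, p), q(b, p)) for p in ps]

# Points above the diagonal (b > a) are percentiles where A is faster.
points = qq_points([1, 2, 3, 4], [10, 20, 30, 40])
print(points[49])  # the p50 pairing
```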
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/jeYkDomZjnqhrCIIUJBqj/ebb4533c6982b0f8b9f5f491aa1549fb/image9.png" />
          </figure><p>This chart looks complicated, but interpreting it is similar to the CDF plot. The blue line is a dividing marker showing where the latency of both providers is equal. Points below the line indicate percentiles where the other provider has a lower latency than Cloudflare, and points above the line indicate percentiles where Cloudflare is faster. We see again that for most of the percentile range, the other provider is faster, but for percentiles above 99, Cloudflare is significantly faster. </p><p>This is not so compelling by itself, but what if we take into account the number of requests this page issues, which is over 180? Using the same math from above, and considering only <i>half</i> the requests to be required for the page to count as loaded, yields this new effective QQ plot.</p>
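<p>The post doesn't spell out the exact model behind the "effective" plot, but one simple reading of the math above is: each page view draws its required requests from the measured per-request distribution, and the user experiences the slowest of them. A Monte Carlo sketch under that assumption (all names and data here are illustrative, not the real measurements):</p>

```python
import random
import statistics

def experienced_latencies(samples, required, trials=20_000, seed=0):
    # Simulate page views: each one makes `required` requests drawn from
    # the measured distribution; the user waits for the slowest one.
    rng = random.Random(seed)
    return [max(rng.choice(samples) for _ in range(required))
            for _ in range(trials)]

random.seed(2)
per_request = [random.lognormvariate(4.5, 0.4) for _ in range(10_000)]  # synthetic, ms
per_user = experienced_latencies(per_request, required=90)

print(statistics.mean(per_request))  # expected request latency
print(statistics.mean(per_user))     # expected experienced latency -- much higher
```

<p>Because the per-user number is a maximum over many draws, it is dominated by the tail of the per-request distribution, which is exactly why a provider with better tail latency can win the experienced-latency comparison while losing the median.</p>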
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/S0lLIZfVyVM7KjWUawcNg/967417729939f454bacd0d4c12b0c0e2/image3.png" />
          </figure><p>Taking multiple requests into account, we see that the median latency is close to even for both Cloudflare and the other provider, but the stories above and below that point are very different. A user has about an even chance of an experience where Cloudflare is significantly faster and one where Cloudflare is slightly slower than the other provider. We can show the impact of this shift in perspective more directly by calculating the <a href="https://en.wikipedia.org/wiki/Expected_value#Arbitrary_real-valued_random_variables"><u>expected value</u></a> for request and experienced latency.</p><table><tr><td><p><b>Latency Kind</b></p></td><td><p><b>Cloudflare </b>(ms)</p></td><td><p><b>Other CDN</b> (ms)</p></td><td><p><b>Difference</b> (ms)</p></td></tr><tr><td><p>Expected Request Latency</p></td><td><p>141.9</p></td><td><p>129.9</p></td><td><p><b>+12.0</b></p></td></tr><tr><td><p>Expected Experienced Latency </p><p>Based on 90 Requests </p></td><td><p>207.9</p></td><td><p>281.8</p></td><td><p><b>-71.9</b></p></td></tr></table><p>Shifting the focus from individual request latency to user latency, we see that Cloudflare is nearly 72 ms faster than the other provider. This is where our obsession with reliability and tail latency becomes a win for our customers, but without a large volume of raw data, knowledge, and tools, this win would be totally hidden. That is why in the near future we are going to be making this tool and others available to our customers so that we can all get a clearer, more accurate picture of our users’ experiences with latency. Keep an eye out for more announcements to come later in 2025.</p> ]]></content:encoded>
            <category><![CDATA[Internet Performance]]></category>
            <category><![CDATA[Latency]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Observability]]></category>
            <category><![CDATA[TTFB]]></category>
            <guid isPermaLink="false">6R3IB3ISH3fXyycnjNPyZC</guid>
            <dc:creator>Kevin Guthrie</dc:creator>
        </item>
        <item>
            <title><![CDATA[Thirteen new MCP servers from Cloudflare you can use today]]></title>
            <link>https://blog.cloudflare.com/thirteen-new-mcp-servers-from-cloudflare/</link>
            <pubDate>Thu, 01 May 2025 13:01:19 GMT</pubDate>
            <description><![CDATA[ You can now connect to Cloudflare's first publicly available remote Model Context Protocol (MCP) servers from any MCP client that supports remote servers.  ]]></description>
            <content:encoded><![CDATA[ <p>You can now connect to Cloudflare's first publicly available <a href="https://blog.cloudflare.com/remote-model-context-protocol-servers-mcp/"><u>remote Model Context Protocol (MCP) servers</u></a> from Claude.ai (<a href="http://anthropic.com/news/integrations"><u>now supporting remote MCP connections!</u></a>) and other <a href="https://modelcontextprotocol.io/clients"><u>MCP clients</u></a> like Cursor, Windsurf, or our own <a href="https://playground.ai.cloudflare.com/"><u>AI Playground</u></a>. Unlock Cloudflare tools, resources, and real time information through our new suite of MCP servers including: </p>
<div><table><thead>
  <tr>
    <th><span>Server</span></th>
    <th><span>Description </span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/docs-vectorize"><span>Cloudflare Documentation server</span></a></td>
    <td><span>Get up to date reference information from Cloudflare Developer Documentation</span></td>
  </tr>
  <tr>
    <td><a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/workers-bindings"><span>Workers Bindings server </span></a></td>
    <td><span>Build Workers applications with storage, AI, and compute primitives</span></td>
  </tr>
  <tr>
    <td><a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/workers-observability"><span>Workers Observability server </span></a></td>
    <td><span>Debug and get insight into your Workers application’s logs and analytics</span></td>
  </tr>
  <tr>
    <td><a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/sandbox-container"><span>Container server</span></a></td>
    <td><span>Spin up a sandbox development environment </span></td>
  </tr>
  <tr>
    <td><a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/browser-rendering"><span>Browser rendering server</span></a><span> </span></td>
    <td><span>Fetch web pages, convert them to markdown and take screenshots</span></td>
  </tr>
  <tr>
    <td><a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/radar"><span>Radar server </span></a></td>
    <td><span>Get global Internet traffic insights, trends, URL scans, and other utilities</span></td>
  </tr>
  <tr>
    <td><a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/logpush"><span>Logpush server </span></a></td>
    <td><span>Get quick summaries for Logpush job health</span></td>
  </tr>
  <tr>
    <td><a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/ai-gateway"><span>AI Gateway server </span></a></td>
    <td><span>Search your logs, get details about the prompts and responses</span></td>
  </tr>
  <tr>
    <td><a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/autorag"><span>AutoRAG server</span></a></td>
    <td><span>List and search documents on your AutoRAGs</span></td>
  </tr>
  <tr>
    <td><a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/auditlogs"><span>Audit Logs server </span></a></td>
    <td><span>Query audit logs and generate reports for review</span></td>
  </tr>
  <tr>
    <td><a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/dns-analytics"><span>DNS Analytics server </span></a></td>
    <td><span>Optimize DNS performance and debug issues based on your current setup</span></td>
  </tr>
  <tr>
    <td><a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/dex-analysis"><span>Digital Experience Monitoring server </span></a></td>
    <td><span>Get quick insight on critical applications for your organization</span></td>
  </tr>
  <tr>
    <td><a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/cloudflare-one-casb"><span>Cloudflare One CASB server </span></a></td>
    <td><span>Quickly identify any security misconfigurations for SaaS applications to safeguard applications, users, and data</span></td>
  </tr>
</tbody></table></div><p>… all through a natural language interface! </p><p>Today, we also <a href="http://blog.cloudflare.com/mcp-demo-day"><u>announced our collaboration with Anthropic</u></a> to bring remote MCP to <a href="https://claude.ai/"><u>Claude</u></a> users, and showcased how other leading companies such as <a href="https://www.atlassian.com/platform/remote-mcp-server"><u>Atlassian</u></a>, <a href="https://developer.paypal.com/tools/mcp-server/"><u>PayPal</u></a>, <a href="https://docs.sentry.io/product/sentry-mcp/"><u>Sentry</u></a>, and <a href="https://mcp.webflow.com"><u>Webflow</u></a> have built remote MCP servers on Cloudflare to extend their service to their users. We’ve also been using the same infrastructure and tooling to build out our own suite of remote servers, and today we’re excited to show customers what’s ready for use and share what we’ve learned along the way. </p>
    <div>
      <h3>Cloudflare’s MCP servers available today: </h3>
      <a href="#cloudflares-mcp-servers-available-today">
        
      </a>
    </div>
    <p>These <a href="https://www.cloudflare.com/learning/ai/what-is-model-context-protocol-mcp/">MCP servers</a> allow your <a href="https://modelcontextprotocol.io/clients"><u>MCP Client</u></a> to read configurations from your account, process information, make suggestions based on data, and even make those suggested changes for you. All of these actions can happen across Cloudflare's many services including application development, security, and performance.</p>
    <div>
      <h4><b>Cloudflare Documentation Server: </b>Get up-to-date reference information on Cloudflare </h4>
      <a href="#cloudflare-documentation-server-get-up-to-date-reference-information-on-cloudflare">
        
      </a>
    </div>
    <p>Our <a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/docs-vectorize"><u>Cloudflare Documentation server</u></a> enables any MCP Client to access up-to-date <a href="https://developers.cloudflare.com/"><u>documentation</u></a> in real-time, rather than relying on potentially outdated information from the model's training data. If you’re new to building with Cloudflare, this server synthesizes information right from our documentation and exposes it to your MCP Client, so you can get reliable, up-to-date responses to any complex question like “Search Cloudflare for the best way to build an AI Agent”.  </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3vanQPwy6YSwI7bsDTk2md/09cb4763ddbd4858fcd90aca00106bb9/BLOG-2808_2.png" />
          </figure>
    <div>
      <h4><b>Workers Bindings server: </b>Build with developer resources </h4>
      <a href="#workers-bindings-server-build-with-developer-resources">
        
      </a>
    </div>
    <p>Connecting to the <a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/workers-bindings"><u>Bindings MCP server</u></a> lets you leverage application development primitives like D1 databases, <a href="https://www.cloudflare.com/developer-platform/products/r2/">R2 object storage</a> and Key Value stores on the fly as you build out a Workers application. If you're leveraging your MCP Client to generate code, the bindings server provides access to read existing resources from your account or create fresh resources to implement in your application. In combination with our <a href="https://developers.cloudflare.com/workers/get-started/prompting/"><u>base prompt</u></a> designed to help you build robust Workers applications, you can add the Bindings MCP server to give your client all it needs to start generating full stack applications from natural language. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6N0Y8BCBz5ULSHbj0JCIkL/3a6a9ef269202a6c05d18444f313ce87/BLOG-2808_3.png" />
          </figure><p>
Full example output using the Workers Bindings MCP server can be found <a href="https://claude.ai/share/273dadf7-b060-422d-b2b6-4f436d537136"><u>here</u></a>.</p>
    <div>
      <h4><b>Workers Observability server: </b>Debug your application </h4>
      <a href="#workers-observability-server-debug-your-application">
        
      </a>
    </div>
    <p>The <a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/workers-observability"><u>Workers Observability MCP server</u></a> integrates with <a href="https://developers.cloudflare.com/workers/observability/logs/workers-logs/"><u>Workers Logs</u></a> to browse invocation logs and errors, compute statistics across invocations, and find specific invocations matching specific criteria. By querying logs across all of your Workers, this MCP server can help isolate errors and trends quickly. The telemetry data that the MCP server returns can also be used to create new visualizations and improve <a href="https://www.cloudflare.com/learning/performance/what-is-observability/">observability</a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1rydyUALBbwtPrT477xAKM/81547e1fb3cec5ffadd90ee5e68e1a5e/BLOG-2808_4.png" />
          </figure>
    <div>
      <h4><b>Container server:</b> Spin up a development environment</h4>
      <a href="#container-server-spin-up-a-development-environment">
        
      </a>
    </div>
    <p>The <a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/sandbox-container"><u>Container MCP server</u></a> provides any MCP client with access to a secure, isolated execution environment running on Cloudflare’s network where it can run and test code, which is useful if your MCP client does not have a built-in development environment (e.g., claude.ai). When building and generating application code, this lets the AI run its own commands and validate its assumptions in real time. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1rXgpQ3znIE01ccY1Qd2cQ/058902719a90af14175b8e838b09b78e/BLOG-2808_5.png" />
          </figure>
    <div>
      <h4><b>Browser Rendering server: </b>Fetch and convert web pages, take screenshots </h4>
      <a href="#browser-rendering-server-fetch-and-convert-web-pages-take-screenshots">
        
      </a>
    </div>
    <p>The <a href="https://developers.cloudflare.com/browser-rendering/"><u>Browser Rendering</u></a> MCP server provides AI friendly tools from our <a href="https://developers.cloudflare.com/browser-rendering/rest-api/"><u>RESTful interface</u></a> for common browser actions such as capturing screenshots, extracting HTML content, and <a href="https://blog.cloudflare.com/markdown-for-agents/">converting pages to Markdown</a>. These are particularly useful when building agents that require interacting with a web browser.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3aQtQxzj1hP6cbtbY4CHXI/27535e9f9a041187c12f6b41ba36afdb/BLOG-2808_6.png" />
          </figure>
    <div>
      <h4><b>Radar server: </b>Ask questions about how we see the Internet and Scan URLs</h4>
      <a href="#radar-server-ask-questions-about-how-we-see-the-internet-and-scan-urls">
        
      </a>
    </div>
    <p>The <a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/radar"><u>Cloudflare Radar MCP server</u></a> exposes tools that allow any MCP client to explore our aggregated <a href="https://radar.cloudflare.com/traffic#http-traffic"><u>HTTP traffic data</u></a>, get information on <a href="https://radar.cloudflare.com/traffic/as701"><u>Autonomous Systems</u></a> (AS) and <a href="https://radar.cloudflare.com/ip/72.74.50.251"><u>IP addresses</u></a>, list traffic anomalies from our <a href="https://radar.cloudflare.com/outage-center"><u>Outage Center</u></a>, get <a href="https://radar.cloudflare.com/domains"><u>trending domains</u></a>, and domain rank information. It can even create charts. Here’s a chat where we ask "show me the <a href="https://claude.ai/public/artifacts/34c8a494-abdc-4755-9ca7-cd8e0a8bea41"><u>HTTP traffic from Portugal</u></a> for the last week":</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/9yg9Fnkoz6t6QOwUK1a5r/b11756fd82058f04037740678160cc7c/BLOG-2808_7.png" />
          </figure>
    <div>
      <h4><b>Logpush server: </b>Get quick summaries for Logpush job health </h4>
      <a href="#logpush-server-get-quick-summaries-for-logpush-job-health">
        
      </a>
    </div>
    <p><a href="https://developers.cloudflare.com/logs/about/"><u>Logpush</u></a> jobs deliver comprehensive logs to your destination of choice, allowing near real-time information processing. The Logpush MCP server can help you analyze your Logpush job results and understand your job health at a high level, allowing you to filter and narrow down for jobs or scenarios you care about. For example, you can ask “provide me with a list of recently failed jobs.” Now, you can quickly find out which jobs are failing with which error message and when, summarized in a human-readable format. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ltuXD2TgEhiblx6aNhm4d/d63b14f151fd3a239b0a3cf0dfb92ebf/BLOG-2808_8.png" />
          </figure>
    <div>
      <h4><b>AI Gateway server: </b>Check out your AI Gateway logs </h4>
      <a href="#ai-gateway-server-check-out-your-ai-gateway-logs">
        
      </a>
    </div>
    <p>Use this <a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/ai-gateway"><u>MCP server</u></a> to inspect your <a href="https://www.cloudflare.com/developer-platform/products/ai-gateway/">AI Gateway</a> logs and get details about your prompts and the AI models’ responses. In this example, we ask our agent “What is my average latency for my AI Gateway logs in the <i>Cloudflare Radar</i> account?”</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7dYzkZ0cYcXPjFKlhdqkMp/52f2b7e62fe2bb05c91fc563738ddfc2/BLOG-2808_9.png" />
          </figure>
    <div>
      <h4><b>AutoRAG server:</b> List and search documents on your AutoRAGs</h4>
      <a href="#autorag-server-list-and-search-documents-on-your-autorags">
        
      </a>
    </div>
    <p>Having AutoRAG RAGs available to query as MCP tools greatly expands on typical static one-shot retrieval and opens the door to use cases where the agent can dynamically decide if and when to retrieve information from one or more <a href="https://www.cloudflare.com/learning/ai/retrieval-augmented-generation-rag/">RAGs</a>, combine them with other tools and APIs, cross-check information, and generate a much richer and more complete final answer.</p><p>Here we have a RAG containing a few blog posts about retrocomputers. If we ask "tell me about restoring an amiga 1000 using the blog-celso autorag", the agent will go through a sequence of <a href="https://claude.ai/share/18f1be31-9936-48c0-9de0-151d64f3534e"><u>reasoning steps</u></a>:</p><ul><li><p>“Now that I have some information about Amiga 1000 restoration from blog-celso, let me search for more specific details.”</p></li><li><p>“Let me get more specific information about hardware upgrades and fixes for the Amiga 1000.”</p></li><li><p>“Let me get more information about the DiagROM and other tools used in the restoration.”</p></li><li><p>“Let me search for information about GBA1000 and other expansions mentioned in the blog.”</p></li><li><p>And finally, “Based on the comprehensive information I've gathered from the blog-celso AutoRAG, I can now provide you with a detailed guide on restoring an Amiga 1000.”</p></li></ul><p>And at the end, it generates a very detailed answer based on all the data from all the queries:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Z8BJSvd4x233FPZDePeSk/e59b8676a92f828d32feb1d28381a216/BLOG-2808_10.png" />
          </figure>
    <div>
      <h4><b>Audit Logs server: </b>Query audit logs and generate reports for review</h4>
      <a href="#audit-logs-server-query-audit-logs-and-generate-reports-for-review">
        
      </a>
    </div>
    <p>Audit Logs record detailed information about actions and events within a system, providing a transparent history of all activity. However, because these logs can be large and complex, it can take effort to query them and reconstruct a clear sequence of events. The <a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/auditlogs"><u>Audit Logs MCP server</u></a> helps by allowing you to query audit logs and generate reports. Common queries include whether anything notable happened in a Cloudflare account for a given user around a particular time of day, or whether any users used API keys to perform actions on the account. For example, you can ask “Were there any suspicious changes made to my Cloudflare account yesterday around lunchtime?” and obtain the following response: </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/YdV73LhsCmjdQtOK8U7Ii/89f11db15e079190a234665ac4794754/BLOG-2808_11.png" />
          </figure>
    <div>
      <h4><b>DNS Analytics server: </b>Optimize DNS performance and debug issues based on your current setup</h4>
      <a href="#dns-analytics-server-optimize-dns-performance-and-debug-issues-based-on-current-set-up">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/application-services/products/analytics/"><u>Cloudflare’s DNS Analytics</u></a> provides detailed insights into DNS traffic, which helps you monitor, analyze, and troubleshoot DNS performance and security across your domains. With Cloudflare’s <a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/dns-analytics"><u>DNS Analytics MCP server</u></a>, you can review DNS configurations across all domains in your account, access comprehensive DNS performance reports, and receive recommendations for performance improvements. By leveraging documentation, the MCP server can help identify opportunities for improving performance. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3X7w64xQvvFbv24HaFeLDv/3fefb7ff9e912207c5897200beefd26f/image4.png" />
          </figure>
    <div>
      <h4><b>Digital Experience Monitoring server</b>: Get quick insight on critical applications for your organization </h4>
      <a href="#digital-experience-monitoring-server-get-quick-insight-on-critical-applications-for-your-organization">
        
      </a>
    </div>
    <p>Cloudflare <a href="https://www.cloudflare.com/learning/performance/what-is-digital-experience-monitoring/">Digital Experience Monitoring (DEM)</a> was built to help network professionals understand the performance and availability of their critical applications from self-hosted applications like Jira and Bitbucket to SaaS applications like Figma or Salesforce. The <a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/dex-analysis"><u>Digital Experience Monitoring MCP server</u></a> fetches DEM test results to surface performance and availability trends within your Cloudflare One deployment, providing quick insights on users, applications, and the networks they are connected to. You can ask questions like: Which users had the worst experience? What times of the day were applications most and least performant? When do I see the most HTTP status errors? When do I see the shortest, longest, or most instability in the network path? </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7Ctxdt7tw04Rfzl9Ihxnkw/fc9c9ab553daa58f59e024dd66dd3dea/BLOG-2808_12.png" />
          </figure>
    <div>
      <h4><b>CASB server</b>: Insights from SaaS Integrations</h4>
      <a href="#casb-server-insights-from-saas-integrations">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/zero-trust/products/casb/"><u>Cloudflare CASB</u></a> provides the ability to integrate with your organization’s <a href="https://developers.cloudflare.com/cloudflare-one/applications/casb/casb-integrations/"><u>SaaS and cloud applications</u></a> to discover assets and surface any security misconfigurations that may be present. A core task is helping security teams understand users, files, and other assets they care about in a way that transcends any one SaaS application. The <a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main/apps/cloudflare-one-casb"><u>CASB MCP server</u></a> can explore across users, files, and the many other asset categories to help you understand relationships in data that can span many different integrations. A common query might be “Tell me about ‘Frank Meszaros’ and what SaaS tools they appear to have accessed.”</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3aJOta5YYZ1FqZyHF0wVnx/2c79512fb674eb2762395e5ccaac9700/BLOG-2808_13.png" />
          </figure>
    <div>
      <h3>Get started with our MCP servers </h3>
      <a href="#get-started-with-our-mcp-servers">
        
      </a>
    </div>
    <p>You can start using our Cloudflare MCP servers today! If you’d like to read more about the specific tools available in each server, you can find them in our <a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main"><u>public GitHub repository</u></a>. Each server is deployed to a server URL, such as</p><p><code>https://observability.mcp.cloudflare.com/sse</code>.</p><p>If your MCP client has first-class support for remote MCP servers, the client will provide a way to accept the server URL directly within its interface. For example, if you are using <a href="https://claude.ai/settings/profile"><u>claude.ai</u></a>, you can: </p><ol><li><p>Navigate to your <a href="https://claude.ai/settings/profile"><u>settings</u></a> and add a new “Integration” by entering the URL of your MCP server</p></li><li><p>Authenticate with Cloudflare</p></li><li><p>Select the tools you’d like claude.ai to be able to call</p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5zWyWq2gS08CZsCQNB2fFZ/4e2c88abc90e11055159127e2abaf7b2/BLOG-2808_14.png" />
          </figure><p>If your client does not yet support remote MCP servers, you will need to set up its respective configuration file (mcp_config.json) using <a href="https://www.npmjs.com/package/mcp-remote"><u>mcp-remote</u></a> to specify which servers your client can access.</p>
            <pre><code>{
	"mcpServers": {
		"cloudflare-observability": {
			"command": "npx",
			"args": ["mcp-remote", "https://observability.mcp.cloudflare.com/sse"]
		},
		"cloudflare-bindings": {
			"command": "npx",
			"args": ["mcp-remote", "https://bindings.mcp.cloudflare.com/sse"]
		}
	}
}
</code></pre>
            
    <div>
      <h3>Have feedback on our servers?</h3>
      <a href="#have-feedback-on-our-servers">
        
      </a>
    </div>
    <p>While we're launching with these initial 13 MCP servers, we are just getting started! We want to hear your feedback as we shape our existing servers and build out more Cloudflare MCP servers that unlock the most value for teams leveraging AI in their daily workflows. If you’d like to provide feedback, request a new MCP server, or report bugs, please raise an issue on our <a href="https://github.com/cloudflare/mcp-server-cloudflare/tree/main"><u>GitHub repository</u></a>. </p>
    <div>
      <h3>Building your own MCP server?</h3>
      <a href="#building-your-own-mcp-server">
        
      </a>
    </div>
    <p>If you’re interested in building your own servers, we've discovered valuable best practices while building ours that we're excited to share. While MCP is really starting to gain momentum and many organizations are just beginning to build their own servers, these principles should help guide you as you start building out MCP servers for your customers. </p><ol><li><p><b>An MCP server is not our entire API schema: </b>Our goal isn't to build a large wrapper around all of Cloudflare’s API schema, but instead to focus on optimizing for specific jobs to be done and the reliability of the outcome. This means that while one tool from our MCP server may map to one API, another tool may map to many. We’ve found that fewer but more powerful tools tend to serve the agent better: smaller context windows, lower costs, faster output, and more valid answers from LLMs. Our MCP servers were created directly by the product teams responsible for each of these areas of Cloudflare – application development, security and performance – and are designed with user stories in mind. This is a pattern you will continue to see us use as we build out more Cloudflare servers. </p></li><li><p><b>Specialize permissions with multiple servers:</b> We built out several specialized servers rather than one for a critical reason: security through precise permission scoping. Each MCP server operates with exactly the permissions needed for its specific task – nothing more. By separating capabilities across multiple servers, each with its own authentication scope, we prevent the common security pitfall of over-privileged access. </p></li><li><p><b>Add robust descriptions within tool parameters:</b> Tool descriptions were core to providing helpful context to the agent. We’ve found that more detailed descriptions help the agent understand not just the expected data type, but also the parameter's purpose, acceptable value ranges, and impact on server behavior. 
This context allows agents to make intelligent decisions about parameter values rather than providing arbitrary and potentially problematic inputs, allowing your natural language to go further with the agent. </p></li><li><p><b>Using evals at each iteration:</b> For each server, we implemented evaluation tests or “evals” to assess the model's ability to follow instructions, select appropriate tools, and provide correct arguments to those tools. This gave us a programmatic way to understand if any regressions occurred through each iteration, especially when tweaking tool descriptions. </p></li></ol><p>Ready to start building? Click the button below to deploy your first remote MCP server to production: </p><a href="https://deploy.workers.cloudflare.com/?url=https://github.com/cloudflare/ai/tree/main/demos/remote-mcp-authless"><img src="https://deploy.workers.cloudflare.com/button" /></a>
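<p>As a sketch of the third practice above, here is what a richly-described tool definition might look like, in the shape MCP clients receive from a <code>tools/list</code> call. The tool name and parameters below are hypothetical illustrations, not Cloudflare’s actual observability server:</p>

```typescript
// Hypothetical MCP tool definition (illustrative names, not Cloudflare's
// real API). Note how each parameter documents its purpose, valid range,
// and default, not just its type, so an agent can choose sensible values.
const queryWorkerLogsTool = {
  name: "query_worker_logs",
  description:
    "Search recent logs for a single deployed Worker script. " +
    "Use this to investigate errors or latency spikes; returns newest entries first.",
  inputSchema: {
    type: "object",
    properties: {
      scriptName: {
        type: "string",
        description: "Exact name of the deployed Worker script to search.",
      },
      level: {
        type: "string",
        enum: ["debug", "info", "warn", "error"],
        description:
          "Minimum severity to include; 'error' narrows results to failures only.",
      },
      limit: {
        type: "integer",
        minimum: 1,
        maximum: 100,
        description: "Maximum number of log entries to return (1-100; default 20).",
      },
    },
    required: ["scriptName"],
  },
};
```

The description strings are the only documentation the model actually sees, so they carry the weight that API reference pages carry for human developers.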
<p>Or check out our documentation to learn more! If you have any questions or feedback for us, you can reach us via email at <a href="mailto:1800-mcp@cloudflare.com"><u>1800-mcp@cloudflare.com</u></a> or join the chatter in the <a href="https://discord.com/channels/595317990191398933/1354548448635912324"><u>Cloudflare Developers Discord</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Model Context Protocol]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[MCP]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">17j3OSuM89oMb5wurF4Tij</guid>
            <dc:creator>Nevi Shah</dc:creator>
            <dc:creator>Maximo Guk </dc:creator>
            <dc:creator>Christian Sparks</dc:creator>
        </item>
        <item>
            <title><![CDATA[A next-generation Certificate Transparency log built on Cloudflare Workers]]></title>
            <link>https://blog.cloudflare.com/azul-certificate-transparency-log/</link>
            <pubDate>Fri, 11 Apr 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Learn about recent developments in Certificate Transparency (CT), and how we built a next-generation CT log on top of Cloudflare's Developer Platform. ]]></description>
            <content:encoded><![CDATA[ <p>Any public <a href="https://en.wikipedia.org/wiki/Certificate_authority"><u>certification authority (CA)</u></a> can issue a <a href="https://www.cloudflare.com/learning/ssl/what-is-an-ssl-certificate/"><u>certificate</u></a> for any website on the Internet to allow a webserver to authenticate itself to connecting clients. Take a moment to scroll through the list of trusted CAs for your web browser (e.g., <a href="https://chromium.googlesource.com/chromium/src/+/main/net/data/ssl/chrome_root_store/test_store.certs"><u>Chrome</u></a>). You may recognize (and even trust) some of the names on that list, but it should make you uncomfortable that <i>any</i> CA on that list could issue a certificate for any website, and your browser would trust it. It’s a castle with 150 doors.</p><p><a href="https://datatracker.ietf.org/doc/html/rfc6962"><u>Certificate Transparency (CT)</u></a> plays a vital role in the <a href="https://datatracker.ietf.org/wg/wpkops/about/"><u>Web Public Key Infrastructure (WebPKI)</u></a>, the set of systems, policies, and procedures that help to establish trust on the Internet. CT ensures that all website certificates are <a href="https://crt.sh"><u>publicly visible</u></a> and <a href="https://developers.cloudflare.com/ssl/edge-certificates/additional-options/certificate-transparency-monitoring/"><u>auditable</u></a>, helping to protect website operators from certificate mis-issuance by dishonest CAs, and helping honest CAs to detect key compromise and other failures.</p><p>In this post, we’ll discuss the history, evolution, and future of the CT ecosystem. We’ll cover some of the challenges we and others have faced in operating CT logs, and how the new <a href="https://c2sp.org/static-ct-api"><u>static CT API</u></a> log design lowers the bar for operators, helping to ensure that this critical infrastructure keeps up with the fast growth and changing landscape of the Internet and WebPKI. 
We’re excited to open source our <a href="https://github.com/cloudflare/azul"><u>Rust implementation</u></a> of the new log design, built for deployment on Cloudflare’s Developer Platform, and to announce <a href="https://github.com/cloudflare/azul/tree/main/crates/ct_worker#test-logs"><u>test logs</u></a> deployed using this infrastructure.</p>
    <div>
      <h2>What is Certificate Transparency?</h2>
      <a href="#what-is-certificate-transparency">
        
      </a>
    </div>
    <p>In 2011, the Dutch CA DigiNotar was <a href="https://threatpost.com/final-report-diginotar-hack-shows-total-compromise-ca-servers-103112/77170/"><u>hacked</u></a>, allowing attackers to forge a certificate for *.google.com and use it to impersonate Gmail to targeted Iranian users in an attempt to compromise personal information. Google caught this because they used <a href="https://developers.cloudflare.com/ssl/reference/certificate-pinning/"><u>certificate pinning</u></a>, but that technique <a href="https://blog.cloudflare.com/why-certificate-pinning-is-outdated/"><u>doesn’t scale well</u></a> for the web. This, among other similar attacks, led a team at Google in 2013 to develop Certificate Transparency (CT) as a mechanism to catch mis-issued certificates. CT creates a public audit trail of all certificates issued by public CAs, helping to protect users and website owners by holding <a href="https://sslmate.com/resources/certificate_authority_failures"><u>CAs accountable</u></a> for the certificates they issue (even unwittingly, in the event of key compromise or software bugs). CT has been a great success: since 2013, over <a href="https://crt.sh/cert-populations"><u>17 billion</u></a> certificates have been logged, and CT was awarded the prestigious <a href="https://blog.transparency.dev/certificate-transparency-wins-the-levchin-prize"><u>Levchin Prize</u></a> in 2024 for its role as a critical safety mechanism for the Internet.</p><p>Let’s take a brief look at the entities involved in the CT ecosystem. 
Cloudflare itself plays two of the roles described below, operating the <a href="https://blog.cloudflare.com/introducing-certificate-transparency-and-nimbus/"><u>Nimbus CT logs</u></a> and the CT monitor powering the <a href="https://blog.cloudflare.com/a-tour-through-merkle-town-cloudflares-ct-ecosystem-dashboard/"><u>Merkle Town</u></a> <a href="https://ct.cloudflare.com"><u>dashboard</u></a>.</p><p><i>Certification Authorities (CAs)</i> are organizations entrusted to issue certificates on behalf of website operators, which in turn can use those certificates to authenticate themselves to connecting clients.</p><p><i>CT-enforcing clients</i> like the <a href="https://googlechrome.github.io/CertificateTransparency/ct_policy.html"><u>Chrome</u></a>, <a href="https://support.apple.com/en-us/103214"><u>Safari</u></a>, and <a href="https://developer.mozilla.org/en-US/docs/Web/Security/Certificate_Transparency"><u>Firefox</u></a> browsers are web clients that only accept certificates compliant with their CT policies. For example, a policy might require that a certificate includes proof that it has been submitted to at least two independently-operated public CT logs.</p><p><i>Log operators</i> run CT logs, which are public, append-only lists of certificates. CAs and other clients can submit a certificate to a CT log to obtain a “promise” from the CT log that it will incorporate the entry into the append-only log within some grace period. CT logs periodically (every few seconds, typically) update their log state to incorporate batches of new entries, and publish a signed checkpoint that attests to the new state.</p><p><i>Monitors</i> are third parties that continuously crawl CT logs and check that their behavior is correct. For instance, they verify that a log is self-consistent and append-only by ensuring that when new entries are added to the log, no previous entries are deleted or modified. Monitors may also examine logged certificates to help website operators detect mis-issuance.</p>
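<p>The signed checkpoints mentioned above use a deliberately simple text format (the C2SP checkpoint specification): the body is the log’s origin, the tree size, and the base64-encoded root hash on three lines, followed by an empty line and one or more signature lines. A minimal parsing sketch, with a made-up origin and hash, and signature verification omitted entirely:</p>

```typescript
// Minimal sketch of parsing a checkpoint body (C2SP checkpoint format):
// line 1 is the log's origin, line 2 the tree size, line 3 the
// base64-encoded root hash. Signature lines follow a blank line and are
// ignored here; a real client must verify them.
interface Checkpoint {
  origin: string;
  treeSize: number;
  rootHash: string; // base64; 32 bytes once decoded for a SHA-256 tree
}

function parseCheckpoint(text: string): Checkpoint {
  const [body] = text.split("\n\n", 1); // everything before the signatures
  const lines = body.split("\n");
  if (lines.length < 3) throw new Error("checkpoint body too short");
  const treeSize = Number(lines[1]);
  if (!Number.isInteger(treeSize) || treeSize < 0) {
    throw new Error("invalid tree size");
  }
  return { origin: lines[0], treeSize, rootHash: lines[2] };
}

// Example with a fabricated origin and root hash, signatures elided:
const example =
  "static-ct.example.com/logs/test2025h1\n" +
  "12345\n" +
  "qINS1GkFzyxEkHddkAqD8eqVbZnRczRH4N0JhjiC7zA=\n" +
  "\n" +
  "(signature lines elided)\n";
const cp = parseCheckpoint(example);
```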
    <div>
      <h2>Challenges in operating a CT log</h2>
      <a href="#challenges-in-operating-a-ct-log">
        
      </a>
    </div>
    <p>Despite the success of CT, it is a less than perfect system. Eric Rescorla has an <a href="https://educatedguesswork.org/posts/transparency-part-2/"><u>excellent writeup</u></a> on the many compromises made to make CT deployable on the Internet of 2013. We’ll focus on the operational complexities of running a CT log.</p><p>Let’s look at the requirements for running a CT log from <a href="https://googlechrome.github.io/CertificateTransparency/log_policy.html#ongoing-requirements-of-included-logs"><u>Chrome’s CT log policy</u></a> (which are more or less mirrored by those of <a href="https://support.apple.com/en-us/103703"><u>Safari</u></a> and <a href="https://groups.google.com/a/mozilla.org/g/dev-security-policy/c/lypRGp4JGGE"><u>Firefox</u></a>), and what can go wrong. The requirements center around <b>integrity</b> and <b>availability</b>.</p><p>To be considered a trusted auditing source, CT logs necessarily have stringent <b>integrity</b> requirements. Anything the log produces must be correct and self-consistent, meaning that a CT log cannot present two different views of the log to different clients, and must present a consistent history for its entire lifetime. Similarly, when a CT log accepts a certificate and promises to incorporate it by returning a Signed Certificate Timestamp (SCT) to the client, it must eventually incorporate that certificate into its append-only log.</p><p>The integrity requirements are unforgiving. A single bit-flip due to a hardware failure or cosmic ray can (<a href="https://www.agwa.name/blog/post/how_ct_logs_fail"><u>and</u></a> <a href="https://groups.google.com/a/chromium.org/g/ct-policy/c/R27Zy9U5NjM"><u>has</u></a>) caused logs to produce incorrect results and thus be disqualified by CT programs. Even software updates to running logs can be fatal, as a change that causes a correctness violation cannot simply be rolled back. 
Perhaps the <a href="https://github.com/C2SP/C2SP/issues/79"><u>greatest risk</u></a> to individual log integrity is <a href="https://groups.google.com/a/chromium.org/g/ct-policy/c/W1Ty2gO0JNA"><u>failing to incorporate certificates</u></a> for which they issued SCTs, for example if they fail to commit those pending certificates to durable storage. See Andrew Ayer’s <a href="https://www.agwa.name/blog/post/how_ct_logs_fail"><u>great synopsis</u></a> for more examples of CT log failures (up to 2021).</p><p>A CT log must also meet certain <b>availability</b> requirements to effectively provide its core functionality as a publicly auditable log. Clients must be able to reliably retrieve log data — Chrome’s policy requires a minimum of 99% average uptime over a 90-day rolling period for each API endpoint — and any entries for which an SCT has been issued must be incorporated into the log within the grace period, called the Maximum Merge Delay (MMD), 24 hours in Chrome’s policy.</p><p>The design of the current CT log read APIs puts strain on the ability of log operators to meet uptime requirements. The API endpoints are <i>dynamic</i> and not easily cacheable without bespoke caching rules that are aware of the CT API. For instance, the <a href="https://datatracker.ietf.org/doc/html/rfc6962#section-4.6"><u>get-entries</u></a> endpoint allows a client to request arbitrary ranges of entries from a log, and the <a href="https://datatracker.ietf.org/doc/html/rfc6962#section-4.5"><u>get-proof-by-hash</u></a> requires the server to construct inclusion proofs for any certificate requested by the client. To serve these requests, CT log servers need to be backed by databases easily 5-10TB in size capable of serving tens of millions of requests per day. This increases operator complexity and expense, not to mention the high cost of bandwidth of serving these requests.</p><p>MMD violations are unfortunately not uncommon. 
Cloudflare’s own Nimbus logs have experienced prolonged outages in the past, most recently in <a href="https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage/"><u>November 2023</u></a> due to complete power loss in the datacenter running the logs. During normal log operation, if the log accepts entries more quickly than it incorporates them, the backlog can grow to exceed the MMD. Log operators can remedy this by rate-limiting or temporarily disabling the write APIs, but this can in turn contribute to violations of the uptime requirements.</p><p>The high bar for log operation has limited the organizations operating CT logs to only <a href="https://ct.cloudflare.com/logs"><u>Cloudflare and five others</u></a>! Losing one or two logs is enough to compromise the stability of the CT ecosystem. Clearly, a change is needed.</p>
    <div>
      <h2>A next-generation CT log design</h2>
      <a href="#a-next-generation-ct-log-design">
        
      </a>
    </div>
    <p>In May 2024, Let’s Encrypt <a href="https://letsencrypt.org/2024/03/14/introducing-sunlight/"><u>announced</u></a> <a href="https://github.com/FiloSottile/sunlight"><u>Sunlight</u></a>, an implementation of a next-generation CT log designed for the modern WebPKI, incorporating a decade of lessons learned from running CT and similar transparency systems. The new CT log design, called the <a href="https://c2sp.org/static-ct-api"><u>static CT API</u></a>, is partially based on the <a href="https://go.googlesource.com/proposal/+/master/design/25530-sumdb.md"><u>Go checksum database</u></a>, and organizes log data as a series of <a href="https://research.swtch.com/tlog#tiling_a_log"><u>tiles</u></a> that are easy to cache and serve. The new design provides efficiency improvements that cut operation costs, help logs to meet availability requirements, and reduce the risk of integrity violations.</p><p>The static CT API is split into two parts, the <a href="https://github.com/C2SP/C2SP/blob/main/static-ct-api.md#monitoring-apis"><b><u>monitoring APIs</u></b></a> (so named because CT monitors are the primary clients), and the <a href="https://github.com/C2SP/C2SP/blob/main/static-ct-api.md#submission-apis"><b><u>submission APIs</u></b></a> for adding new certificates to the log.</p><p>The <b>monitoring APIs</b> replace the dynamic read APIs of <a href="https://datatracker.ietf.org/doc/html/rfc6962#section-4"><u>RFC 6962</u></a>, and organize log data into static, cacheable tiles. (See <a href="https://research.swtch.com/tlog#tiling_a_log"><u>Russ Cox’s blog post</u></a> for an in-depth explanation of tiled logs.) CT log operators can efficiently serve static tiles from <a href="https://www.cloudflare.com/developer-platform/solutions/s3-compatible-object-storage/">S3-compatible object storage buckets</a> and cache them using CDN infrastructure, without needing dedicated API servers. 
Clients can then download the necessary tiles to retrieve specific log entries or reconstruct arbitrary proofs.</p><p>The static CT API introduces another efficiency by deduplicating intermediate and root “issuer” certificates in a log entry’s certificate chain. The number of publicly-trusted issuer certificates is small (<a href="https://www.ccadb.org/"><u>in the low thousands</u></a>), so instead of storing them repeatedly for each log entry, only the issuer hash is stored. Clients can look up issuer certificates by hash from a <a href="https://github.com/C2SP/C2SP/blob/main/static-ct-api.md#issuers"><u>separate endpoint</u></a>.</p><p>The <b>submission APIs</b> remain backwards-compatible with <a href="https://datatracker.ietf.org/doc/html/rfc6962#section-4"><u>RFC 6962</u></a>, meaning that TLS clients and CAs can submit to them without any changes. However, there is one notable addition: the static CT specification requires a log to hold on to requests while it batches and sequences them, responding with an SCT only after the entries have been incorporated into the log. The specification defines a <a href="https://github.com/C2SP/C2SP/blob/main/static-ct-api.md#sct-extension"><u>required SCT extension</u></a> indicating the entry’s index in the log. At the cost of slightly delayed SCT issuance (on the order of seconds), this change eliminates one of the major pain points of operating a CT log (the Merge Delay).</p><p>Having the log <i>index</i> of a certificate available in an SCT enables further efficiencies. <i>SCT auditing</i> refers to the process by which TLS clients or monitors can check if a log has fulfilled its promise to incorporate a certificate for which it has issued an SCT. 
In the RFC 6962 API, checking if a certificate is present in a log when you don’t already know the index requires using the <a href="https://datatracker.ietf.org/doc/html/rfc6962#section-4.5"><u>get-proof-by-hash</u></a> endpoint to look up the entry by the certificate hash (and the server needs to maintain a mapping from hash to index to efficiently serve these requests). Instead, with the index immediately available in the SCT, clients can directly retrieve the specific log data tile covering that index, even with <a href="https://transparency.dev/summit2024/sct-auditing.html"><u>efficient privacy-preserving techniques</u></a>.</p><p>Since it was announced, the static CT API has taken the CT ecosystem by storm. Aside from <a href="https://github.com/FiloSottile/sunlight"><u>Sunlight</u></a> and our brand new <a href="https://github.com/cloudflare/azul"><u>Azul</u></a> (discussed below), there are at least two other independent implementations, <a href="https://blog.transparency.dev/i-built-a-new-certificate-transparency-log-in-2024-heres-what-i-learned"><u>Itko</u></a> and <a href="https://blog.transparency.dev/introducing-trillian-tessera"><u>Trillian Tessera</u></a>. Several CT monitors (including <a href="https://crt.sh"><u>crt.sh</u></a>, <a href="https://sslmate.com/certspotter/"><u>certspotter</u></a>, <a href="https://censys.com/"><u>Censys</u></a>, and our own <a href="https://ct.cloudflare.com"><u>Merkle Town</u></a>) have added support for the new log format, and as of April 1, 2025, Chrome has begun accepting submissions for <a href="https://groups.google.com/a/chromium.org/g/ct-policy/c/HBFZHG0TCsY/m/HAaVRK6MAAAJ"><u>static CT API logs</u></a> into their CT log program.</p>
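<p>To make the tile layout concrete, here is a sketch of how a client holding an index-bearing SCT could locate the data tile covering that entry. Tiles are 256 entries wide (height 8), and tile numbers are path-encoded in groups of three decimal digits, every group but the last prefixed with “x”; the <code>tile/data/</code> prefix follows the static CT API layout (consult the spec for the authoritative details):</p>

```typescript
// Sketch: locating the data tile that covers a given leaf index.
// Tiles have height 8, so each data tile covers 256 consecutive entries.
// Tile numbers are encoded in the URL path as groups of three decimal
// digits, e.g. tile number 1234067 -> "x001/x234/067".
const TILE_HEIGHT = 8;
const ENTRIES_PER_TILE = 1 << TILE_HEIGHT; // 256

function encodeTileNumber(n: number): string {
  let s = String(n).padStart(3, "0");
  // Pad to a multiple of three digits, then split into groups of three.
  while (s.length % 3 !== 0) s = "0" + s;
  const groups: string[] = [];
  for (let i = 0; i < s.length; i += 3) groups.push(s.slice(i, i + 3));
  // Every group except the last carries an "x" prefix.
  return groups.map((g, i) => (i < groups.length - 1 ? "x" + g : g)).join("/");
}

// Which data tile holds the entry at `leafIndex`, and at what offset?
function dataTileFor(leafIndex: number): { path: string; offset: number } {
  const tileNumber = Math.floor(leafIndex / ENTRIES_PER_TILE);
  return {
    path: `tile/data/${encodeTileNumber(tileNumber)}`,
    offset: leafIndex % ENTRIES_PER_TILE,
  };
}
```

Because the path is a pure function of the index, these objects are immutable once written, which is what makes them trivially cacheable by a CDN.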
    <div>
      <h2>A static CT API implementation on Workers</h2>
      <a href="#a-static-ct-api-implementation-on-workers">
        
      </a>
    </div>
    <p>This section discusses how we designed and built our static CT log implementation, <a href="https://github.com/cloudflare/azul"><u>Azul</u></a> (short for <a href="https://en.wikipedia.org/wiki/Azulejo"><u>azulejos</u></a>, the colorful Portuguese and Spanish ceramic tiles). For curious readers and prospective CT log operators, we encourage you to follow the instructions in the repo to quickly set up your own static CT log. Questions and feedback in the form of GitHub issues are welcome!</p><p>Our two prototype logs, <a href="https://static-ct.cloudflareresearch.com/logs/cftest2025h1a/metadata"><u>Cloudflare Research 2025h1a</u></a> and <a href="https://static-ct.cloudflareresearch.com/logs/cftest2025h2a/metadata"><u>Cloudflare Research 2025h2a</u></a> (accepting certificates expiring in the first and second half of 2025, respectively), are available for testing.</p>
    <div>
      <h3>Design decisions and goals</h3>
      <a href="#design-decisions-and-goals">
        
      </a>
    </div>
    <p>The advent of the static CT API gave us the perfect opportunity to rethink how we run our CT logs. There were a few design decisions we made early on to shape the project.</p><p>First and foremost, we wanted to run our CT logs on our distributed global network. Especially after the <a href="https://blog.cloudflare.com/post-mortem-on-cloudflare-control-plane-and-analytics-outage/"><u>painful November 2023 control plane outage</u></a>, there’s been a push to deploy services on our highly available and resilient network instead of running in centralized datacenters.</p><p>Second, with Cloudflare’s deeply engrained culture of <a href="https://blog.cloudflare.com/tag/dogfooding/"><u>dogfooding</u></a> (building Cloudflare on top of Cloudflare), we decided to implement the CT log on top of Cloudflare’s Developer Platform and <a href="https://workers.cloudflare.com/"><u>Workers</u></a>. </p><p>Dogfooding gives us an opportunity to find pain points in our product offerings, and to provide feedback to our development teams to improve the developer experience for everyone. We restricted ourselves to only features and default limits generally available to customers, so that we could have the same experience as an external Cloudflare developer, and would produce an implementation that anyone could deploy.</p><p>Another major design decision was to implement the CT log in Rust, a modern systems programming language with static typing and built-in memory safety that is heavily used across Cloudflare, and which already has mature (if sometimes <a href="#developing-a-workers-application-in-rust"><u>lacking full feature parity</u></a>) <a href="https://github.com/cloudflare/workers-rs"><u>Workers bindings</u></a> that we have used to build <a href="https://blog.cloudflare.com/wasm-coredumps/"><u>several production services</u></a>. 
This also provided us with an opportunity to produce Rust crates porting <a href="https://pkg.go.dev/golang.org/x/mod/sumdb"><u>Go implementations</u></a> of various <a href="https://c2sp.org"><u>C2SP</u></a> specifications that can be reused across other projects.</p><p>For the new logs to be deployable, they needed to be at least as performant as existing CT logs. As a point of reference, the <a href="https://ct.cloudflare.com/logs/nimbus2025"><u>Nimbus2025</u></a> log currently handles just over 33 million requests per day (~380/s) across the read APIs, and about 6 million per day (~70/s) across the write APIs.</p>
    <div>
      <h3>Implementation </h3>
      <a href="#implementation">
        
      </a>
    </div>
    <p>We based Azul heavily on <a href="https://github.com/FiloSottile/sunlight"><u>Sunlight</u></a>, a Go application built for deployment as a standalone server. As such, this section serves as a reference for translating a traditional server to Cloudflare’s serverless platform.</p><p>To start, let’s briefly review the Sunlight architecture (described in more detail in the <a href="https://github.com/FiloSottile/sunlight/blob/main/README.md"><u>README</u></a> and <a href="https://filippo.io/a-different-CT-log"><u>original design doc</u></a>). A Sunlight instance is a single Go process, serving one or multiple CT logs. It is backed by three different storage locations with different properties:</p><ul><li><p>A “lock backend” which stores the current checkpoint for each log. This datastore needs to be strongly consistent, but only stores trivial amounts of data.</p></li><li><p>A per-log object storage bucket from which to serve tiles, checkpoints, and issuers to CT clients. This datastore needs to be strongly consistent, and to handle multiple terabytes of data.</p></li><li><p>A per-log deduplication cache, to return SCTs for previously-submitted (pre-)certificates. 
This datastore is best-effort (as duplicate entries are not fatal to log operation), and stores tens to hundreds of gigabytes of data.</p></li></ul><p>Two major components handle the bulk of the CT log application logic:</p><ul><li><p>A frontend HTTP server handles incoming requests to the submission APIs to add new certificates to the log, validates them, checks the deduplication cache, adds the certificate to a pool of entries to be sequenced, and waits for sequencing to complete before responding to the client.</p></li><li><p>The sequencer periodically (every 1s, by default) sequences the pool of pending entries, writes new tiles to the object backend, persists the latest checkpoint covering the new log state to the lock and object backends, and signals to waiting requests that the pool has been sequenced.</p></li></ul>
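<p>The interaction between these two components can be sketched as a pending pool whose submitters await the next sequencing round. This is a deliberate simplification for illustration, not Sunlight’s actual code:</p>

```typescript
// Minimal sketch (not Sunlight's real implementation) of the
// pool-and-sequencer pattern: submitters add entries to a pending pool
// and await the next sequencing round, which assigns each entry its
// index in the append-only log.
type Pending = { entry: string; resolve: (index: number) => void };

class Sequencer {
  private treeSize = 0;         // entries incorporated so far
  private pool: Pending[] = []; // entries awaiting the next round

  // Called by the frontend for each validated submission.
  add(entry: string): Promise<number> {
    return new Promise((resolve) => this.pool.push({ entry, resolve }));
  }

  // Called periodically (every ~1s by default in Sunlight). In a real
  // log, tiles and the new checkpoint are durably written *before*
  // waiters are released, so no SCT ever refers to unpublished state.
  sequence(): void {
    const batch = this.pool;
    this.pool = [];
    for (const p of batch) p.resolve(this.treeSize++);
  }
}

// Usage: three submissions sequenced in one round get indices 0, 1, 2.
const seq = new Sequencer();
const results = Promise.all([seq.add("a"), seq.add("b"), seq.add("c")]);
seq.sequence();
```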
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6gLwzRo4Azbls2wvM12TJx/80d6f7aad1317f31dfe06a0c474ee93c/image5.png" />
          </figure><p><sup><i>A static CT API log running on a traditional server using the Sunlight implementation.</i></sup></p><p>Next, let’s look at how we can translate these components into ones suitable for deployment on Workers.</p>
    <div>
      <h4>Making it work</h4>
      <a href="#making-it-work">
        
      </a>
    </div>
    <p>Let’s start with the easy choices. The static CT <a href="https://github.com/C2SP/C2SP/blob/main/static-ct-api.md#monitoring-apis"><u>monitoring APIs</u></a> are designed to serve static, cacheable, compressible assets from object storage. The API should be highly available and have the capacity to serve any number of CT clients. The natural choice is <a href="https://www.cloudflare.com/developer-platform/products/r2/"><u>Cloudflare R2</u></a>, which provides globally consistent storage with capacity for <a href="https://developers.cloudflare.com/r2/platform/limits/"><u>large data volumes</u></a>, customizability to configure caching and compression, and unbounded read operations.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/qsC1dO8blS1eGOysu9WQa/75da37719be35824a7533dbbd62bede3/image4.png" />
          </figure><p><sup><i>A static CT API log running on Workers using a preliminary version of the Azul implementation which ran into performance limitations.</i></sup></p><p>The static CT <a href="https://github.com/C2SP/C2SP/blob/main/static-ct-api.md#submission-apis"><u>submission APIs</u></a> are where the real challenge lies. In particular, they allow CT clients to submit certificate chains to be incorporated into the append-only log. We used <a href="https://developers.cloudflare.com/learning-paths/workers/concepts/workers-concepts/"><u>Workers</u></a> as the frontend for the CT log application. Workers run in data centers close to the client, scaling on demand to handle request load, making them the ideal place to run the majority of the heavyweight request handling logic, including validating requests, checking the deduplication cache (discussed below), and submitting the entry to be sequenced.</p><p>The next question was where and how we’d run the backend to handle the CT log sequencing logic, which needs to be stateful and tightly coordinated. We chose <a href="https://developers.cloudflare.com/durable-objects/"><u>Durable Objects (DOs)</u></a>, a special type of stateful Cloudflare Worker where each instance has persistent storage and a unique name which can be used to route requests to it from anywhere in the world. DOs are designed to scale effortlessly for applications that can be easily broken up into self-contained units that do not need a lot of coordination across units. For example, a <a href="https://blog.cloudflare.com/introducing-workers-durable-objects/#demo-chat"><u>chat application</u></a> can use one DO to control each chat room. In our model, then, each CT log is controlled by a single DO. This architecture allows us to easily run multiple CT logs within a single Workers application, but as we’ll see, the limitations of <i>individual</i> single-threaded DOs can easily become a bottleneck. 
More on this later.</p><p>With the CT log backend as a Durable Object, several other components fell into place: Durable Objects’ <a href="https://developers.cloudflare.com/durable-objects/api/storage-api/"><u>strongly-consistent transactional storage</u></a> neatly fit the requirements for the “lock backend” to persist the log’s latest checkpoint, and we can use an <a href="https://developers.cloudflare.com/durable-objects/api/alarms/"><u>alarm</u></a> to trigger the log sequencing every second. We can also use <a href="https://developers.cloudflare.com/durable-objects/reference/data-location/#provide-a-location-hint"><u>location hints</u></a> to place CT logs in locations geographically close to clients for reduced latency, similar to <a href="https://groups.google.com/g/certificate-transparency/c/I74Wp-KdWHc"><u>Google’s Argon and Xenon logs</u></a>.</p><p>The <a href="https://developers.cloudflare.com/workers/platform/storage-options/"><u>choice of datastore</u></a> for the deduplication cache proved to be non-obvious. The cache is best-effort, and intended to avoid re-sequencing entries that are already present in the log. The cache key is computed by hashing certain fields of the <code>add-[pre-]chain</code> request, and the cache value consists of the entry’s index in the log and the timestamp at which it was sequenced. At current log submission rates, the deduplication cache could grow in excess of <a href="https://github.com/FiloSottile/sunlight/tree/main?tab=readme-ov-file#operating-a-sunlight-log"><u>50 GB for 6 months of log data</u></a>. In the Sunlight implementation, the deduplication cache is implemented as a local SQLite database, where checks against it are tightly coupled with sequencing, which ensures that duplicates from in-flight requests are correctly accounted for. However, this architecture did not translate well to Cloudflare's platform. 
The data size doesn’t comfortably fit within <a href="https://developers.cloudflare.com/durable-objects/platform/limits/"><u>Durable Object Storage</u></a> or <a href="https://developers.cloudflare.com/d1/platform/limits/"><u>single-database D1</u></a> limits, and it was too slow to directly read and write to remote storage from within the sequencing loop. Ultimately, we split the deduplication cache into two components: a local fixed-size in-memory cache for fast deduplication over short periods of time (on the order of minutes), and a long-term deduplication cache built on <a href="https://developers.cloudflare.com/kv/"><u>Cloudflare Workers KV</u></a>, a global, low-latency, <a href="https://developers.cloudflare.com/kv/reference/faq/#is-workers-kv-eventually-consistent-or-strongly-consistent"><u>eventually-consistent</u></a> key-value store <a href="https://developers.cloudflare.com/kv/platform/limits/"><u>without storage limitations</u></a>.</p><p>With this architecture, it was <a href="#developing-a-workers-application-in-rust"><u>relatively straightforward</u></a> to port the Go code to Rust, and to bring up a functional static CT log on Workers. We’re done then, right? Not quite. Performance tests showed that the log was only capable of sequencing 20-30 new entries per second, well under the 70 per second target of existing logs. We could work around this by simply <a href="https://letsencrypt.org/2024/03/14/introducing-sunlight/#running-more-logs"><u>running more logs</u></a>, but that puts strain on other parts of the CT ecosystem — namely on TLS clients and monitors, which need to keep state for each log. Additionally, the alarm used to trigger sequencing would often be delayed by multiple seconds, meaning that the log was failing to produce new tree heads at consistent intervals. Time to go back to the drawing board.</p>
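<p>The two-tier deduplication cache described above can be sketched as follows; the KV tier is faked with an in-memory <code>Map</code> here, and the names, capacity, and eviction policy are illustrative rather than Azul’s actual code:</p>

```typescript
// Sketch of a two-tier deduplication cache: a fixed-size in-memory map
// catches duplicates submitted within minutes of each other, while a
// KV-style store handles long-term deduplication. Workers KV is
// eventually consistent, which is why the short-term tier exists: a
// just-written entry may not yet be visible through the KV tier.
type CacheValue = { index: number; timestamp: number };

interface KvStore {
  get(key: string): Promise<CacheValue | null>;
  put(key: string, value: CacheValue): Promise<void>;
}

class DedupCache {
  private recent = new Map<string, CacheValue>(); // insertion-ordered
  constructor(private kv: KvStore, private maxRecent = 1000) {}

  async lookup(key: string): Promise<CacheValue | null> {
    return this.recent.get(key) ?? (await this.kv.get(key));
  }

  async record(key: string, value: CacheValue): Promise<void> {
    this.recent.set(key, value);
    // Evict the oldest entry once the short-term tier is full.
    if (this.recent.size > this.maxRecent) {
      this.recent.delete(this.recent.keys().next().value!);
    }
    await this.kv.put(key, value); // long-term tier (a Batcher's job in Azul)
  }
}

// In-memory stand-in for Workers KV, for demonstration only.
const fakeKv: KvStore = (() => {
  const m = new Map<string, CacheValue>();
  return {
    async get(k: string) { return m.get(k) ?? null; },
    async put(k: string, v: CacheValue) { m.set(k, v); },
  };
})();

const cache = new DedupCache(fakeKv, 2);
```

Because the cache is best-effort, a miss in both tiers merely costs a redundant sequencing, never a correctness violation.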
    <div>
      <h4>Making it fast</h4>
      <a href="#making-it-fast">
        
      </a>
    </div>
    <p>In the design thus far, we’re asking a single-threaded Durable Object instance to do a lot of multi-tasking. The DO processes incoming requests from the Frontend Worker to add entries to the sequencing pool, and must periodically sequence the pool and write state to the various storage backends. A log handling 100 requests per second needs to switch between 101 running tasks (the extra one for the sequencing), plus any async tasks like writing to remote storage — usually 10+ writes to object storage and one write to the long-term deduplication cache per sequenced entry. No wonder the sequencing task was getting delayed!</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7BCidjDyYw2YS1Ot84LHdk/240ce935eb4e36c82255d846d964fdff/image2.png" />
          </figure><p><sup><i>A static CT API log running on Workers using the Azul implementation with batching to improve performance.</i></sup></p><p>We were able to work around these issues by adding an additional layer of DOs between the Frontend Worker and the Sequencer, which we call Batchers. The Frontend Worker uses <a href="https://en.wikipedia.org/wiki/Consistent_hashing"><u>consistent hashing</u></a> on the cache key to determine which of several Batchers to submit the entry to, and the Batcher helps to reduce the number of requests to the Sequencer by buffering requests and sending them together in batches. When the batch is sequenced, the Batcher distributes the responses back to the Frontend Workers that submitted the request. The Batcher also handles writing updates to the deduplication cache, further freeing up resources for the Sequencer.</p><p>By limiting the scope of the critical block of code that needed to be run synchronously in a single DO, and leaning on the strengths of DOs by scaling horizontally where the workload allows it, we were able to drastically improve application performance. With this new architecture, the CT log application can handle upwards of 500 requests per second to the submission APIs to add new log entries, while maintaining a consistent sequencing tempo to keep per-request latency low (typically 1-2 seconds).</p>
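<p>To make the routing step concrete, here is a minimal Go sketch of consistent hashing over a set of Batchers: the Frontend Worker hashes the cache key onto a ring of Batcher points, so a given submission always reaches the same Batcher (which is what lets a Batcher coalesce duplicate submissions before they hit the Sequencer). The names, virtual-node count, and hash construction below are illustrative, not the actual Azul implementation.</p>

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

// hash64 gives a uniform 64-bit position on the ring for any string.
func hash64(s string) uint64 {
	sum := sha256.Sum256([]byte(s))
	return binary.BigEndian.Uint64(sum[:8])
}

// Ring is a minimal consistent-hash ring. Each Batcher gets several
// virtual points so that load spreads evenly across Batchers.
type Ring struct {
	points []uint64
	owner  map[uint64]string
}

func NewRing(batchers []string, vnodes int) *Ring {
	r := &Ring{owner: make(map[uint64]string)}
	for _, b := range batchers {
		for i := 0; i < vnodes; i++ {
			h := hash64(fmt.Sprintf("%s#%d", b, i))
			r.points = append(r.points, h)
			r.owner[h] = b
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Pick returns the Batcher owning the first ring point at or after the
// cache key's hash, wrapping around past the end of the ring.
func (r *Ring) Pick(cacheKey string) string {
	h := hash64(cacheKey)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"batcher-0", "batcher-1", "batcher-2", "batcher-3"}, 16)
	fmt.Println(ring.Pick("sha256-of-an-add-chain-request"))
}
```

<p>A nice property of the ring (versus a plain modulo) is that adding or removing a Batcher only remaps the keys adjacent to its points, rather than reshuffling every key.</p>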
    <div>
      <h3>Developing a Workers application in Rust</h3>
      <a href="#developing-a-workers-application-in-rust">
        
      </a>
    </div>
    <p>One of the reasons I was excited to work on this project is that it gave me an opportunity to implement a Workers application in Rust, which I’d never done from scratch before. Not everything was smooth, but overall I would recommend the experience.</p><p>The <a href="https://github.com/cloudflare/workers-rs"><u>Rust bindings to Cloudflare Workers</u></a> are an open source project that aims to bring support for all of the features you know and love from the <a href="https://developers.cloudflare.com/workers/languages/javascript/"><u>JavaScript APIs</u></a> to the Rust language. However, there is some lag in terms of feature parity. Often when working on this project, I’d read about a particular Workers feature in the <a href="https://developers.cloudflare.com"><u>developer docs</u></a>, only to find that support had <a href="https://github.com/cloudflare/workers-rs/issues/645"><u>not yet</u></a> <a href="https://github.com/cloudflare/workers-rs/issues/716"><u>been added</u></a>, or was only <a href="https://github.com/cloudflare/workers-rs?tab=readme-ov-file#rpc-support"><u>partially supported</u></a>, for the Rust bindings. I came across some <a href="https://github.com/cloudflare/workers-rs/issues/432"><u>surprising gotchas</u></a> (not all bad, like <a href="https://docs.rs/tokio/1.44.1/tokio/sync/watch/index.html"><u>tokio::sync::watch</u></a> channels <a href="https://github.com/cloudflare/workers-rs/pull/719"><u>working seamlessly</u></a>, despite <a href="https://github.com/cloudflare/workers-rs?tab=readme-ov-file#faq"><u>this warning</u></a>). 
Documentation about <a href="https://developers.cloudflare.com/workers/observability/dev-tools/breakpoints/"><u>debugging</u></a> and <a href="https://developers.cloudflare.com/workers/observability/dev-tools/cpu-usage/"><u>profiling</u></a> Rust Workers was also not clear (e.g., how to <a href="https://github.com/cloudflare/cloudflare-docs/pull/21347"><u>preserve debug symbols</u></a>), but it does in fact work!</p><p>To be clear, these rough edges are expected! The Workers platform is continuously gaining new features, and it’s natural that the Rust bindings would fall behind. As more developers rely on (and contribute to, <i>hint hint</i>) the Rust bindings, the developer experience will continue to improve.</p>
    <div>
      <h2>What is next for Certificate Transparency</h2>
      <a href="#what-is-next-for-certificate-transparency">
        
      </a>
    </div>
    <p>The WebPKI is constantly evolving and growing, and upcoming changes, in particular shorter certificate lifetimes and larger post-quantum certificates, are going to place significantly more load on the CT ecosystem.</p><p>The <a href="https://cabforum.org/"><u>CA/Browser Forum</u></a> defines a set of <a href="https://cabforum.org/working-groups/server/baseline-requirements/documents/TLSBRv2.0.4.pdf"><u>Baseline Requirements</u></a> for publicly-trusted TLS server certificates. As of 2020, the maximum certificate lifetime for publicly-trusted certificates is 398 days. However, there is a <a href="https://github.com/cabforum/servercert/pull/553"><u>ballot measure</u></a> to reduce that period to as low as 47 days by March 2029. Let’s Encrypt is going even further, and at the <a href="https://letsencrypt.org/2024/12/11/eoy-letter-2024/"><u>end of 2024 announced</u></a> that they will be offering short-lived certificates with a lifetime of only <a href="https://letsencrypt.org/2025/01/16/6-day-and-ip-certs/"><u>six days</u></a> by the end of 2025. Based on some back-of-the-envelope calculations using statistics from <a href="https://ct.cloudflare.com/"><u>Merkle Town</u></a>, these changes could increase the number of logged entries in the CT ecosystem by <b>16-20x</b>.</p><p>If you’ve been keeping up with this blog, you’ll also know that <a href="https://blog.cloudflare.com/another-look-at-pq-signatures/"><u>post-quantum certificates</u></a> are on the horizon, bringing with them larger signature and public key sizes. Today, a <a href="https://crt.sh/?id=17119212878"><u>certificate</u></a> with a P-256 ECDSA public key and issuer signature can be less than 1kB. Dropping in an ML-DSA<sub>44</sub> public key and signature brings the same certificate size to 4.6 kB, assuming the SCTs use 96-byte <a href="https://blog.cloudflare.com/another-look-at-pq-signatures/"><u>UOV</u><u><sub>ls-pkc</sub></u></a> signatures. 
With these choices, post-quantum certificates could require CT logs to store <b>4x</b> the amount of data per log entry.</p><p>The static CT API design helps to ensure that CT logs are much better equipped to handle this increased load, especially if the load is distributed across <a href="https://letsencrypt.org/2024/03/14/introducing-sunlight/#running-more-logs"><u>multiple logs</u></a> per operator. Our <a href="https://github.com/cloudflare/azul"><u>new implementation</u></a> makes it easy for log operators to run CT logs on top of Cloudflare’s infrastructure, adding more operational diversity and robustness to the CT ecosystem. We welcome feedback on the design and implementation as <a href="https://github.com/cloudflare/azul/issues"><u>GitHub issues</u></a>, and encourage CAs and other interested parties to start submitting to and consuming from our <a href="https://github.com/cloudflare/azul/tree/main/crates/ct_worker#test-logs"><u>test logs</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Transparency]]></category>
            <category><![CDATA[Certificate Transparency]]></category>
            <guid isPermaLink="false">5n88kLCWbpk22AmRzMQN9g</guid>
            <dc:creator>Luke Valenta</dc:creator>
        </item>
        <item>
            <title><![CDATA[Skip the setup: deploy a Workers application in seconds]]></title>
            <link>https://blog.cloudflare.com/deploy-workers-applications-in-seconds/</link>
            <pubDate>Tue, 08 Apr 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ You can now add a Deploy to Cloudflare button to your repository’s README when building a Workers application, making it simple for other developers to set up and deploy your project!  ]]></description>
            <content:encoded><![CDATA[ <p>You can now add a <a href="http://developers.cloudflare.com/workers/platform/deploy-buttons/"><u>Deploy to Cloudflare button</u></a> to the README of your Git repository containing a Workers application — making it simple for other developers to quickly set up and deploy your project! </p><a href="https://deploy.workers.cloudflare.com/?url=https://github.com/cloudflare/templates/tree/main/saas-admin-template"><img src="https://deploy.workers.cloudflare.com/button" /></a>
<p></p><p>The Deploy to Cloudflare button: </p><ol><li><p><b>Creates a new Git repository on your GitHub/GitLab account: </b>Cloudflare will automatically clone and create a new repository on your account, so you can continue developing. </p></li><li><p><b>Automatically provisions resources the app needs:</b> If your repository requires Cloudflare primitives like a <a href="https://developers.cloudflare.com/kv/"><u>Workers KV namespace</u></a>, a <a href="https://www.cloudflare.com/developer-platform/products/d1/"><u>D1 database</u></a>, or an <a href="https://developers.cloudflare.com/r2/buckets/"><u>R2 bucket</u></a>, Cloudflare will automatically provision them on your account and bind them to your Worker upon deployment. </p></li><li><p><b>Configures Workers Builds (CI/CD): </b>Every new push to your production branch on your newly created repository will <a href="https://www.cloudflare.com/learning/serverless/glossary/what-is-ci-cd/">automatically build and deploy</a> courtesy of <a href="https://developers.cloudflare.com/workers/ci-cd/builds/"><u>Workers Builds</u></a>. </p></li><li><p><b>Adds preview URLs to each pull request: </b>If you’d like to test your changes before deploying, you can push changes to a <a href="https://developers.cloudflare.com/workers/ci-cd/builds/build-branches/#configure-non-production-branch-builds"><u>non-production branch</u></a> and <a href="https://developers.cloudflare.com/workers/configuration/previews/"><u>preview URLs</u></a> will be generated and <a href="https://developers.cloudflare.com/workers/ci-cd/builds/git-integration/github-integration/#pull-request-comment"><u>posted back to GitHub as a comment</u></a>. </p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1mViUpslwRWYrQqkr1I0dU/a890e9ae5d4d36278c4b6ff3a002c3b5/Introducing_Deploy_to_Cloudflare_Buttons_2.png" />
          </figure><p>There is nothing more frustrating than struggling to kick the tires on a new project because you don’t know where to start. Over the past couple of months, we’ve launched some improvements to getting started on Workers, including a gallery of <a href="https://dash.cloudflare.com/?to=/:account/workers-and-pages/templates"><u>Git-connected templates</u></a> that help you kickstart your development journey. </p><p>But we think there’s another part of the story. Every day, we see new Workers applications being built and open-sourced by developers in the community, ranging from starter projects to mission-critical applications. These projects are designed to be shared, deployed, customized, and contributed to. But first and foremost, they must be simple to deploy.</p>
    <div>
      <h2>Ditch the setup instructions</h2>
      <a href="#ditch-the-setup-instructions">
        
      </a>
    </div>
    <p>If you’ve open-sourced a Workers application before, you may have listed the following steps in your README to get others going with your repository:</p><ol><li><p>“Clone this repo” </p></li><li><p>“Install these packages”</p></li><li><p>“Install Wrangler” </p></li><li><p>“Create this database”</p></li><li><p>“Paste the database ID back into your config file” </p></li><li><p>“Run this command to deploy” </p></li><li><p>“Push to a new Git repo” </p></li><li><p>“Set up CI” </p></li></ol><p>And the list only grows as your application gets more complicated, deterring other developers and making your project feel intimidating to deploy. Now, your project can be up and running in one shot — which means more traction, more feedback, and more contributions.</p>
    <div>
      <h2>Self-hosting made easy </h2>
      <a href="#self-hosting-made-easy">
        
      </a>
    </div>
    <p>We’re not just talking about building and sharing small starter apps but also complex pieces of software. If you’ve ever <a href="https://www.cloudflare.com/developer-platform/solutions/hosting/">self-hosted your own instance of an application</a> on a traditional cloud provider before, you’re likely familiar with the pain of tedious setup, operational overhead, or hidden costs of your infrastructure. </p><table><tr><td><p><b>Self-hosting with a traditional cloud provider</b></p></td><td><p><b>Self-hosting with Cloudflare</b></p></td></tr><tr><td><p>Set up a VPC</p><p>Install tools and dependencies</p><p>Set up and provision storage</p><p>Manually configure a CI/CD pipeline to automate deployments</p><p>Scramble to manually secure your environment if a runtime vulnerability is discovered</p><p>Configure autoscaling policies and manage idle servers</p></td><td><p>✅ Serverless</p><p>✅ Highly-available global network</p><p>✅ Automatic provisioning of datastores like D1 databases and R2 buckets</p><p>✅ Built-in CI/CD workflow configured out of the box</p><p>✅ Automatic runtime updates to keep your environment secure</p><p>✅ Scale automatically and only pay for what you use</p></td></tr></table><p>By making your open-source repository accessible with a Deploy to Cloudflare button, you can allow other developers to deploy their own instance of your app without requiring deep infrastructure expertise. </p>
    <div>
      <h2>From starter projects to full-stack applications</h2>
      <a href="#from-starter-projects-to-full-stack-applications">
        
      </a>
    </div>
    <p>We’re inviting all Workers developers looking to open-source their projects to add Deploy to Cloudflare buttons and help others get up and running faster. We’ve already started working with open-source app developers! Here are a few great examples to explore: </p>
    <div>
      <h3>Test and explore your APIs with Fiberplane </h3>
      <a href="#test-and-explore-your-apis-with-fiberplane">
        
      </a>
    </div>
    <p><a href="https://fiberplane.com/"><u>Fiberplane</u></a> helps developers build, test, and explore <a href="https://hono.dev/"><u>Hono</u></a> APIs and AI Agents in an embeddable playground. This Developer Week, Fiberplane released a set of sample Worker applications built on the ‘<a href="http://honc.dev/"><u>HONC</u></a>’ stack — Hono, <a href="https://orm.drizzle.team/"><u>Drizzle</u></a> ORM, <a href="https://developers.cloudflare.com/d1/"><u>D1 Database</u></a>, and <a href="https://developers.cloudflare.com/workers/"><u>Cloudflare Workers</u></a> — that you can use as the foundation for your own projects. With an easy one-click Deploy to Cloudflare, each application comes preconfigured with the <a href="https://github.com/fiberplane/fiberplane"><u>open source</u></a> Fiberplane API Playground, making it easy to generate OpenAPI docs, test your handlers, and explore your API, all within one embedded interface.</p><a href="https://deploy.workers.cloudflare.com/?url=https://github.com/fiberplane/create-honc-app/tree/main/examples/uptime-monitor"><img src="https://deploy.workers.cloudflare.com/button" /></a>
<p></p>
    <div>
      <h3>Deploy your first remote MCP server </h3>
      <a href="#deploy-your-first-remote-mcp-server">
        
      </a>
    </div>
    <p>You can now <a href="https://blog.cloudflare.com/remote-model-context-protocol-servers-mcp/"><u>build and deploy remote Model Context Protocol (MCP) servers</u></a> on Cloudflare Workers! <a href="https://www.cloudflare.com/learning/ai/what-is-model-context-protocol-mcp/">MCP servers</a> provide a standardized way for AI agents to interact with services directly, enabling them to complete actions on users' behalf. Cloudflare's remote MCP server implementation supports authentication, allowing users to log in to their service from the agent to give it scoped permissions. This gives users the ability to interact with services without navigating dashboards or learning APIs — they simply tell their AI agent what they want to accomplish.</p><a href="https://deploy.workers.cloudflare.com/?url=https://github.com/cloudflare/ai/tree/main/demos/remote-mcp-server"><img src="https://deploy.workers.cloudflare.com/button" /></a>
<p></p>
    <div>
      <h3>Start building your first agent </h3>
      <a href="#start-building-your-first-agent">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/ai/what-is-agentic-ai/">AI agents are intelligent systems</a> capable of autonomously executing tasks by making real-time decisions about which tools to use and how to structure their workflows. Unlike traditional automation (which follows rigid, predefined steps), agents dynamically adapt their strategies based on context and evolving inputs. This template serves as a starting point for building AI-driven chat agents on Cloudflare's Agent platform. Powered by Cloudflare’s <a href="https://www.npmjs.com/package/agents"><u>Agents SDK</u></a>, it provides a solid foundation for creating interactive AI chat experiences with a modern UI and tool integration capabilities.</p><a href="https://deploy.workers.cloudflare.com/?url=https://github.com/cloudflare/agents-starter"><img src="https://deploy.workers.cloudflare.com/button" /></a>
<p></p>
    <div>
      <h2>Try it now</h2>
      <a href="#try-it-now">
        
      </a>
    </div>
    <p>You can start using <a href="http://developers.cloudflare.com/workers/platform/deploy-buttons/"><u>Deploy to Cloudflare buttons</u></a> today!</p>
    <div>
      <h3>Add a Deploy to Cloudflare button to your README</h3>
      <a href="#add-a-deploy-to-cloudflare-button-to-your-readme">
        
      </a>
    </div>
    <p>Be sure to make your Git repository public, then add the following snippet to your README with your Git repository URL filled in.</p>
            <pre><code>[![Deploy to Cloudflare](https://deploy.workers.cloudflare.com/button)](https://deploy.workers.cloudflare.com/?url=&lt;YOUR_GIT_REPO_URL&gt;)</code></pre>
            <p>When another developer clicks your Deploy to Cloudflare button, Cloudflare will parse the Wrangler configuration file, provision any resources detected, and create a new repo on their account that’s updated with information about newly created resources. For example:</p>
            <pre><code>{
  "compatibility_date": "2024-04-03",
  "d1_databases": [
    {
      "binding": "MY_D1_DATABASE",
      // will be updated with the newly created database ID
      "database_id": "1234567890abcdef1234567890abcdef"
    }
  ]
}</code></pre>
            <p>Check out our <a href="http://developers.cloudflare.com/workers/platform/deploy-buttons/"><u>documentation</u></a> for more information on how to set up a deploy button for your application and best practices to ensure a successful deployment for other developers. </p>
    <div>
      <h3>Start building </h3>
      <a href="#start-building">
        
      </a>
    </div>
    <p>For new Cloudflare developers, keep an eye out for “Deploy to Cloudflare” buttons across the web, or simply paste the URL of any public GitHub or GitLab repository containing a Workers application into the <a href="https://dash.cloudflare.com/?to=/:account/workers-and-pages/create/deploy-to-workers"><u>Cloudflare dashboard</u></a> to get started.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/64oCplgDSH0jgE9Nsqt4VL/57083c66d1c6240c03973a43642da9e9/Screenshot_2025-04-07_at_17.29.16.png" />
          </figure><p>During Developer Week, <a href="https://blog.cloudflare.com/"><u>tune in to our blog</u></a> as we unveil new features and announcements — many including Deploy to Cloudflare buttons — so you can jump right in and start building!</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Open Source]]></category>
            <guid isPermaLink="false">18znr6c8JHaWhYT3czW9hw</guid>
            <dc:creator>Nevi Shah</dc:creator>
        </item>
        <item>
            <title><![CDATA[Open-sourcing OpenPubkey SSH (OPKSSH): integrating single sign-on with SSH]]></title>
            <link>https://blog.cloudflare.com/open-sourcing-openpubkey-ssh-opkssh-integrating-single-sign-on-with-ssh/</link>
            <pubDate>Tue, 25 Mar 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ OPKSSH (OpenPubkey SSH) is now open-sourced as part of the OpenPubkey project. ]]></description>
            <content:encoded><![CDATA[ <p>OPKSSH makes it easy to <a href="https://en.wikipedia.org/wiki/Secure_Shell"><u>SSH</u></a> with single sign-on technologies like OpenID Connect, thereby removing the need to manually manage and configure SSH keys. It does this without adding a trusted party other than your identity provider (IdP).</p><p>We are excited to announce <a href="https://github.com/openpubkey/opkssh/"><u>OPKSSH (OpenPubkey SSH)</u></a> has been open-sourced under the umbrella of the OpenPubkey project. While the underlying protocol <a href="https://github.com/openpubkey/openpubkey/"><u>OpenPubkey</u></a> became <a href="https://www.linuxfoundation.org/press/announcing-openpubkey-project"><u>an open source Linux foundation project in 2023</u></a>, OPKSSH was closed source and owned by <a href="https://www.cloudflare.com/press-releases/2024/cloudflare-acquires-bastionzero-to-add-zero-trust-infrastructure-access/"><u>BastionZero (now Cloudflare)</u></a>. Cloudflare has gifted this code to the OpenPubkey project, making it open source.</p><p>In this post, we describe what OPKSSH is, how it simplifies SSH management, and what OPKSSH being open source means for you.</p>
    <div>
      <h2>Background</h2>
      <a href="#background">
        
      </a>
    </div>
    <p>A cornerstone of modern access control is single sign-on <a href="https://www.cloudflare.com/learning/access-management/what-is-sso/"><u>(SSO)</u></a>, where a user authenticates to an <a href="https://www.cloudflare.com/learning/access-management/what-is-an-identity-provider/"><u>identity provider (IdP)</u></a>, and in response the IdP issues the user a <i>token</i>. The user can present this token to prove their identity, such as “Google says I am Alice”. SSO is the rare security technology that both increases convenience — users only need to sign in once to get access to many different systems — and increases security.</p>
    <div>
      <h3>OpenID Connect</h3>
      <a href="#openid-connect">
        
      </a>
    </div>
    <p><a href="https://openid.net/developers/how-connect-works/"><u>OpenID Connect (OIDC)</u></a> is the main protocol used for SSO. As shown below, in OIDC the IdP, called an OpenID Provider (OP), issues the user an ID Token which contains identity claims about the user, such as “email is alice@example.com”. These claims are digitally signed by the OP, so anyone who receives the ID Token can check that it really was issued by the OP.</p><p>Unfortunately, while ID Tokens <i>do </i>include identity claims like name, organization, and email address, they <i>do not</i> include the user’s public key. This prevents them from being used to directly secure protocols like SSH or <a href="https://www.cloudflare.com/learning/privacy/what-is-end-to-end-encryption/"><u>End-to-End Encrypted messaging</u></a>.</p><p>Note that throughout this post we use the term OpenID Provider (OP) rather than IdP, as OP specifies the exact type of IdP we are using, i.e., an OpenID IdP. We use Google as an example OP, but OpenID Connect works with Google, Azure, Okta, etc.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2zfixdknoL3a9HqBzcsmAY/605f469bdd25b4b8cfaf29dac3561c4f/image1.png" />
          </figure><p><sup><i>Shows a user Alice signing in to Google using OpenID Connect and receiving an ID Token</i></sup></p>
    <div>
      <h3>OpenPubkey</h3>
      <a href="#openpubkey">
        
      </a>
    </div>
    <p>OpenPubkey, shown below, adds public keys to ID Tokens. This enables ID Tokens to be used like certificates, e.g. “Google says <code>alice@example.com</code> is using public key 0x123.” We call an ID token that contains a public key a <i>PK Token</i>. The beauty of OpenPubkey is that, unlike other approaches, OpenPubkey does not require any changes to existing SSO protocols and supports any OpenID Connect compliant OP.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2HNkaW8vPE26KQrNwNOzNo/ef00f91dc983f3f2ac3c3a00b223b3e5/image3.png" />
          </figure><p><sup><i>Shows a user Alice signing in to Google using OpenID Connect/OpenPubkey and then producing a PK Token</i></sup>
While OpenPubkey enables ID Tokens to be used as certificates, OPKSSH extends this functionality so that these ID Tokens can be used as SSH keys in the SSH protocol. This adds SSO authentication to SSH without requiring changes to the SSH protocol.</p>
    <div>
      <h2>Why this matters</h2>
      <a href="#why-this-matters">
        
      </a>
    </div>
    <p>OPKSSH frees users and administrators from the need to manage long-lived SSH keys, making SSH more secure and more convenient.</p><blockquote><p><i>“In many organizations – even very security-conscious organizations – there are many times more obsolete authorized keys than they have employees. Worse, authorized keys generally grant command-line shell access, which in itself is often considered privileged. We have found that in many organizations about 10% of the authorized keys grant root or administrator access. SSH keys never expire.”</i>  
- <a href="https://ylonen.org/papers/ssh-key-challenges.pdf">Challenges in Managing SSH Keys – and a Call for Solutions</a> by Tatu Ylonen (Inventor of SSH)</p></blockquote><p>In SSH, users generate a long-lived SSH public key and SSH private key. To enable a user to access a server, the user or the administrator of that server configures that server to trust that user’s public key. Users must protect the file containing their SSH private key. If the user loses this file, they are locked out. If they copy their SSH private key to multiple computers or back up the key, they increase the risk that the key will be compromised. When a private key is compromised or a user no longer needs access, the user or administrator must remove that public key from any servers it currently trusts. All of these problems create headaches for users and administrators.</p><p>OPKSSH overcomes these issues:</p><p><b>Improved security:</b> OPKSSH replaces long-lived SSH keys with ephemeral SSH keys that are created on-demand by OPKSSH and expire when they are no longer needed. This reduces the risk that a private key is compromised, and limits the time period where an attacker can use a compromised private key. By default, these OPKSSH public keys expire every 24 hours, but the expiration policy can be set in a configuration file.</p><p><b>Improved usability:</b> Creating an SSH key is as easy as signing in to an OP. This means that a user can SSH from any computer with <code>opkssh</code> installed, even if they haven’t copied their SSH private key to that computer.</p><p>To generate their SSH key, the user simply runs <code>opkssh login</code>, and they can use <code>ssh</code> as they typically do.</p><p><b>Improved visibility:</b> OPKSSH moves SSH from authorization by public key to authorization by identity. If Alice wants to give Bob access to a server, she doesn’t need to ask for his public key; she can just add Bob’s email address <code>bob@example.com</code> to the OPKSSH authorized users file, and he can sign in. 
This makes tracking who has access much easier, since administrators can see the email addresses of the authorized users.</p><p>OPKSSH does not require any code changes to the SSH server or client. The only change needed on the SSH server is adding two lines to the SSH config file. For convenience, we provide an installation script that does this automatically, as seen in the video below.</p><div>
  
</div>
<p></p>
    <div>
      <h2>How it works</h2>
      <a href="#how-it-works">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5rVAtbu3vv9wU84ke8IZIb/a5e564c2ae3834391bd7f04c843508b7/image4.png" />
          </figure><p><sup><i>Shows a user Alice SSHing into a server with her PK Token inside her SSH public key. The server then verifies her SSH public key using the OpenPubkey verifier.</i></sup></p><p>Let’s look at an example of Alice (<code>alice@example.com</code>) using OPKSSH to SSH into a server:</p><ul><li><p>Alice runs <code>opkssh login</code>. This command automatically generates an ephemeral public key and private key for Alice. Then it runs the OpenPubkey protocol by opening a browser window and having Alice log in through her SSO provider, e.g., Google. </p></li><li><p>If Alice SSOs successfully, OPKSSH will now have a PK Token that commits to Alice’s ephemeral public key and Alice’s identity. Essentially, this PK Token says “<code>alice@example.com</code> authenticated her identity and her public key is 0x123…”.</p></li><li><p>OPKSSH then saves to Alice’s <code>.ssh</code> directory:</p><ul><li><p>an SSH public key file that contains Alice’s PK Token </p></li><li><p>and an SSH private key set to Alice’s ephemeral private key.</p></li></ul></li><li><p>When Alice attempts to SSH into a server, the SSH client will find the SSH public key file containing the PK Token in Alice’s <code>.ssh</code> directory, and it will send it to the SSH server to authenticate.</p></li><li><p>The SSH server forwards the received SSH public key to the OpenPubkey verifier installed on the SSH server. This is because the SSH server has been configured to use the OpenPubkey verifier via the <code>AuthorizedKeysCommand</code>.</p></li><li><p>The OpenPubkey verifier receives the SSH public key file and extracts the PK Token from it. It then verifies that the PK Token is unexpired, valid, signed by the OP, and that the public key in the PK Token matches the public key field in the SSH public key file. 
Finally, it extracts the email address from the PK Token and checks if <code>alice@example.com</code> is allowed to SSH into this server.</p></li></ul><p>Consider the problems we face in getting OpenPubkey to work with SSH without requiring any changes to the SSH protocol or software:</p><p><b>How do we get the PK Token from the user’s machine to the SSH server inside the SSH protocol?</b>
We use the fact that SSH public keys can be SSH certificates, and that SSH certificates have <a href="https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/ssh/PROTOCOL.certkeys?rev=1.4&amp;content-type=text/x-cvsweb-markup"><u>an extension field</u></a> that allows arbitrary data to be included in the certificate. Thus, we package the PK Token into an SSH certificate extension so that the PK Token will be transmitted inside the SSH public key as a normal part of the SSH protocol. This enables us to send the PK Token to the SSH server as additional data in the SSH certificate, and allows OPKSSH to work without any changes to the SSH client.</p><p><b>How do we check that the PK Token is valid once it arrives at the SSH server?
</b>SSH servers support a configuration parameter called the <a href="https://man.openbsd.org/sshd_config#AuthorizedKeysCommand"><i><u>AuthorizedKeysCommand</u></i></a> that allows us to use a custom program to determine whether an SSH public key is authorized. Thus, we change the SSH server’s config file to use the OpenPubkey verifier instead of the SSH verifier by making the following two-line change to <code>sshd_config</code>:</p>
            <pre><code>AuthorizedKeysCommand /usr/local/bin/opkssh verify %u %k %t
AuthorizedKeysCommandUser root</code></pre>
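            <p>Per the <code>sshd_config</code> manual, the program named by <i>AuthorizedKeysCommand</i> is invoked with the substituted tokens (here <code>%u</code> is the username, <code>%k</code> the base64-encoded key, and <code>%t</code> the key type), and the login proceeds only if the command exits successfully and prints the key in <code>authorized_keys</code> format on standard output. The toy TypeScript sketch below illustrates only this contract, not the real <code>opkssh verify</code> implementation, and its allow-list is a hypothetical stand-in for real policy.</p>

```typescript
// Toy AuthorizedKeysCommand (hypothetical; NOT the real opkssh verifier).
// sshd runs it as: /usr/local/bin/toy-verify %u %k %t
// and accepts the login only if the process exits 0 and prints the key
// back in authorized_keys format on stdout.
const ALLOWED_USERS = ["alice", "root"]; // hypothetical stand-in for real policy

function authorize(user: string, keyType: string, keyB64: string): string {
  // The real opkssh verifier would parse the SSH certificate out of keyB64,
  // extract the PK Token from its extension field, and verify it; this toy
  // only demonstrates the input/output contract that sshd expects.
  if (!ALLOWED_USERS.includes(user)) {
    return ""; // print nothing: sshd rejects the key
  }
  return keyType + " " + keyB64; // echo the key back: sshd accepts it
}

// Arguments arrive in the order given in sshd_config: %u %k %t.
const [, , user, keyB64, keyType] = process.argv;
if (user && keyB64 && keyType) {
  process.stdout.write(authorize(user, keyType, keyB64));
}
```

            <p>Because sshd treats an empty stdout as “no matching key”, denying access is as simple as printing nothing.</p>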
            <p>The OpenPubkey verifier will check that the PK Token is unexpired, valid and signed by the OP. It checks the user’s email address in the PK Token to determine if the user is authorized to access the server.</p><p><b>How do we ensure that the public key in the PK Token is actually the public key that secures the SSH session?</b>
The OpenPubkey verifier also checks that the public key field in the SSH public key matches the user’s public key inside the PK Token. This works because the public key field in the SSH public key is the actual public key that secures the SSH session.</p>
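            <p>Putting these checks together, the verifier’s logic can be pictured as follows. This is an illustrative sketch, not the real PK Token format: the field names and commitment construction are simplified stand-ins, and the check of the OP’s signature over the ID token is omitted entirely.</p>

```typescript
import { createHash } from "node:crypto";

// Simplified stand-in for a PK Token (illustrative field names only).
interface PkTokenSketch {
  email: string;   // identity asserted by the OP
  exp: number;     // expiry, seconds since the epoch
  nonce: string;   // hex commitment to the ephemeral public key
}

// The ID token's nonce commits to the user's ephemeral public key, so the
// verifier can recompute the commitment from the key presented in the SSH
// certificate and compare. The hash construction here is a placeholder.
function commit(alg: string, publicKey: string, random: string): string {
  return createHash("sha256").update(alg + "." + publicKey + "." + random).digest("hex");
}

function verifyKeyBinding(
  token: PkTokenSketch,
  sshPublicKey: string,   // key field taken from the SSH certificate
  alg: string,
  random: string,
  allowedEmails: string[],
  now: number,
): boolean {
  if (now >= token.exp) return false;                                  // unexpired?
  if (token.nonce !== commit(alg, sshPublicKey, random)) return false; // key bound?
  return allowedEmails.includes(token.email);                          // authorized?
}
```

            <p>The important property is the middle check: an attacker who substitutes their own SSH key cannot produce a matching commitment without redoing SSO as the victim.</p>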
    <div>
      <h2>What is happening</h2>
      <a href="#what-is-happening">
        
      </a>
    </div>
    <p>We have <a href="https://github.com/openpubkey/openpubkey/pull/234"><u>open sourced OPKSSH</u></a> under the <a href="https://www.apache.org/licenses/LICENSE-2.0"><u>Apache 2.0 license</u></a>, and released it as <a href="https://github.com/openpubkey/opkssh/"><u>openpubkey/opkssh on GitHub</u></a>. While the OpenPubkey project has had code for using SSH with OpenPubkey since the early days of the project, this code was intended as a prototype and was missing many important features. With OPKSSH, SSH support in OpenPubkey is no longer a prototype and is now a complete feature. Cloudflare is not endorsing OPKSSH, but simply donating code to OPKSSH.</p><p><b>OPKSSH provides the following improvements to OpenPubkey:</b></p><ul><li><p>Production ready SSH in OpenPubkey</p></li><li><p>Automated installation</p></li><li><p>Better configuration tools</p></li></ul>
    <div>
      <h2>To learn more</h2>
      <a href="#to-learn-more">
        
      </a>
    </div>
    <p>See the <a href="https://github.com/openpubkey/opkssh/blob/main/README.md"><u>OPKSSH readme</u></a> for documentation on how to install and connect using OPKSSH.</p>
    <div>
      <h2>How to get involved</h2>
      <a href="#how-to-get-involved">
        
      </a>
    </div>
    <p>There are a number of ways to get involved in OpenPubkey or OPKSSH. The project is organized through the <a href="https://github.com/openpubkey/opkssh"><u>OPKSSH GitHub</u></a>. We are building an open and friendly community and welcome pull requests from anyone. If you are interested in contributing, see <a href="https://github.com/openpubkey/openpubkey/blob/main/CONTRIBUTING.md"><u>our contribution guide</u></a>.</p><p>We run a <a href="https://github.com/openpubkey/community"><u>community</u></a> meeting every month which is open to everyone, and you can also find us over on the <a href="https://openssf.org/getinvolved/"><u>OpenSSF Slack</u></a> in the #openpubkey channel.</p> ]]></content:encoded>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[SSH]]></category>
            <category><![CDATA[Single Sign On (SSO)]]></category>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[Authentication]]></category>
            <category><![CDATA[Research]]></category>
            <guid isPermaLink="false">01zA7RtUKkhrUeINJ9AIS3</guid>
            <dc:creator>Ethan Heilman</dc:creator>
        </item>
        <item>
            <title><![CDATA[Open source all the way down: Upgrading our developer documentation]]></title>
            <link>https://blog.cloudflare.com/open-source-all-the-way-down-upgrading-our-developer-documentation/</link>
            <pubDate>Wed, 08 Jan 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ At Cloudflare, we treat developer content like an open source product. This collaborative approach enables global contributions to enhance quality and relevance for a wide range of users. This year, ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare, we treat developer <a href="https://blog.cloudflare.com/content-as-a-product/"><u>content like a product</u></a>, where we take the user and their feedback into consideration. We are constantly iterating, testing, analyzing, and refining content. Inspired by agile practices, treating developer content like an open source product means we approach our documentation the same way an open source software project is created and maintained.</p><p>Open source documentation empowers the developer community because it allows anyone, anywhere, to contribute content. By making both the content and the framework of the documentation site publicly accessible, we provide developers with the opportunity to not only improve the material itself but also understand and engage with the processes that govern how the documentation is built, approved, and maintained. This transparency fosters collaboration, learning, and innovation, enabling developers to contribute their expertise and learn from others in a shared, open environment. We also provide feedback to other open source products and plugins, giving back to the same community that supports us.</p>
    <div>
      <h2>Building the best open source documentation experience</h2>
      <a href="#building-the-best-open-source-documentation-experience">
        
      </a>
    </div>
    <p>Great documentation empowers users to be successful with a new product as quickly as possible, showing them how to use the product and describing its benefits. Relevant, timely, and accurate content can save frustration, time, and money. Open source documentation adds a few more benefits, including building inclusive and supportive communities that help reduce the learning curve. We love being open source!</p><p>While the Cloudflare content team has scaled to deliver documentation alongside product launches, the open source documentation site itself was not scaling well. <a href="http://developers.cloudflare.com"><u>developers.cloudflare.com</u></a> had outgrown the workflow for contributors, plus we were missing out on all the neat stuff created by developers in the community.</p><p>Just like a software product evaluation, we reviewed our business needs. We asked ourselves: was remaining open source still appropriate? Were there other tools we wanted to use? What benefits did we want to see in a year, or in five? In addition to the contributor workflow challenges, our biggest limitations were scalability and the high maintenance cost of user experience improvements. </p><p>After compiling our wishlist of new features to implement, we reaffirmed our commitment to open source. We valued the benefit of open source in both the content and the underlying framework of our documentation site. This commitment goes beyond technical considerations, because it's a fundamental aspect of our relationship with our community and our philosophy of transparency and collaboration. While the choice of an open source framework to build the site on might not be visible to many visitors, we recognized its significance for our community of developers and contributors. 
Our decision-making process was heavily influenced by two primary factors: first, whether the update would enhance the collaborative ecosystem, and second, how it would improve the overall documentation experience. This focus reflects our belief that open source principles, applied to both content and infrastructure, are essential for fostering innovation, ensuring quality through peer review, and building a more engaged and empowered user community.</p>
    <div>
      <h2>Cloudflare developer documentation: A collaborative open source approach</h2>
      <a href="#cloudflare-developer-documentation-a-collaborative-open-source-approach">
        
      </a>
    </div>
    <p>Cloudflare’s developer documentation is <a href="https://github.com/cloudflare/cloudflare-docs/"><u>open source on GitHub</u></a>, with content supporting all of Cloudflare’s products. The underlying documentation engine has gone through a few iterations, with the first version of the site released in 2020. That first version provided dev-friendly features such as dark mode and proper code syntax. </p>
    <div>
      <h3>2021 update: enhanced documentation engine</h3>
      <a href="#2021-update-enhanced-documentation-engine">
        
      </a>
    </div>
    <p>In 2021, we introduced a new custom documentation engine, bringing significant improvements to the Cloudflare content experience. The benefits of the Gatsby to Hugo <a href="https://blog.cloudflare.com/new-dev-docs/"><u>migration</u></a> included:</p><ul><li><p><b>Faster development flow</b>: The development flow replicated production behavior, increasing iteration speed and confidence. <a href="https://developers.cloudflare.com/pages/configuration/preview-deployments/"><u>Preview links</u></a> via Cloudflare Pages were also introduced, so the content team and stakeholders could quickly review what content would look like in production.</p></li><li><p><b>Custom components</b>: Introduced features like <a href="https://github.com/cloudflare/cloudflare-docs/blob/4c3c819ebe3714df1698097135c645429bcbe7cc/layouts/shortcodes/resource-by-selector.html"><u>resources-by-selector</u></a> which let us reference content throughout the repository and gave us the flexibility to expand checks and automations.</p></li><li><p><b>Structured changelog management</b>: Implementation of <a href="https://github.com/cloudflare/cloudflare-docs/tree/4c3c819ebe3714df1698097135c645429bcbe7cc/data/changelogs"><u>structured YAML</u></a> changelog entries which facilitated sharing with various platforms like <a href="https://developers.cloudflare.com/changelog/index.xml"><u>RSS feeds</u></a>, <a href="http://discord.cloudflare.com"><u>Developer Discord</u></a>, and within the docs themselves.</p></li><li><p><b>Improved performance</b>: Significant page load time improvements with the migration to HTML-first and almost instantaneous local builds.</p></li></ul><p>These features were non-negotiable as part of our evaluation of whether to migrate. We knew that any update to the site had to maintain the functionality we’d established as core parts of the new experience.</p>
    <div>
      <h3>2024 update: Say “hello, world!” to our new developer documentation, powered by Astro</h3>
      <a href="#2024-update-say-hello-world-to-our-new-developer-documentation-powered-by-astro">
        
      </a>
    </div>
    <p>After careful evaluation, we chose to migrate from Hugo to the <a href="https://astro.build/"><u>Astro</u></a> (and by extension, JavaScript) ecosystem. Astro fulfilled many items on our wishlist including:</p><ul><li><p><b>Enhanced content organization</b>: Improved tagging and better cross-referencing of  related pages.</p></li><li><p><b>Extensibility</b>: Support for user plugins like <a href="https://github.com/HiDeoo/starlight-image-zoom"><u>starlight-image-zoom</u></a> for lightbox functionality.</p></li><li><p><b>Development experience</b>: Type-checking at build time with <a href="https://docs.astro.build/en/reference/cli-reference/#astro-check"><u>astro check</u></a>, along with syntax highlighting, Intellisense, diagnostic messages, and plugins for ESLint, Stylelint, and Prettier. </p></li><li><p><b>JavaScript/TypeScript support</b>: Aligned the docs site framework with the preferred languages of many contributors, facilitating easier contribution.</p></li><li><p><b>CSS management</b>: Introduction of Tailwind and <a href="https://docs.astro.build/en/guides/styling/#scoped-styles"><u>scoped styles</u></a>.</p></li><li><p><a href="https://docs.astro.build/en/guides/content-collections/"><b><u>Content collections</u></b></a>: Offered various ways to manage and enhance tagging practices including Markdown front matter <a href="https://docs.astro.build/en/guides/content-collections/#defining-datatypes-with-zod"><u>validated by Zod schemas</u></a>, JSON schemas for <a href="https://docs.astro.build/en/guides/content-collections/#enabling-json-schema-generation"><u>Intellisense</u></a>, and a JavaScript callback for <a href="https://docs.astro.build/en/guides/content-collections/#filtering-collection-queries"><u>filtering returned entries</u></a>.</p></li></ul>
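    <p>As a concrete illustration of the content collections bullet above, a collection schema lives in <code>src/content/config.ts</code> and is validated by Zod at build time. The fields below are illustrative, not the actual cloudflare-docs schema.</p>

```typescript
// src/content/config.ts — illustrative sketch, not the actual
// cloudflare-docs collection definition.
import { defineCollection, z } from "astro:content";

const docs = defineCollection({
  type: "content",
  schema: z.object({
    title: z.string(),                       // required in every page's front matter
    tags: z.array(z.string()).default([]),   // enables consistent tagging practices
    updated: z.date().optional(),            // parsed and validated from front matter
  }),
});

// Astro derives JSON schemas and editor Intellisense from this export.
export const collections = { docs };
```

    <p>Front matter that fails the schema fails the build with a diagnostic pointing at the offending file, so validation errors surface before deployment rather than in production.</p>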
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1wz2uWlAwbHFG4QgG0d8tt/4eeb3fbcd4d9b33c5590be39654bbff1/BLOG-2600_2.png" />
          </figure><p><a href="https://starlight.astro.build/"><u>Starlight</u></a>, Astro’s documentation theme, was a key factor in the decision. Its powerful <a href="https://starlight.astro.build/guides/overriding-components/"><u>component overrides</u></a> and <a href="https://starlight.astro.build/resources/plugins/"><u>plugins</u></a> system allowed us to leverage built-in components and base styling.</p>
    <div>
      <h3>How we migrated to Astro</h3>
      <a href="#how-we-migrated-to-astro">
        
      </a>
    </div>
    <p>Content needed to be migrated quickly. With dozens of pull requests opened and merged each day, entering a code freeze for a week simply wasn’t feasible. This is where <a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree"><u>abstract syntax trees</u></a> (ASTs) came into play: an AST captures only the structure of a <a href="https://blog.cloudflare.com/markdown-for-agents/">Markdown document</a>, not details like whitespace or indentation that would make a <a href="https://en.wikipedia.org/wiki/Regular_expression"><u>regular expression</u></a> approach tricky.</p><p>With Hugo in 2021, we configured code block functionality like titles or line highlights with front matter inside the code block.</p>
            <pre><code>---
title: index.js
highlight: 1
---
const foo = "bar";
</code></pre>
            <p>Starlight uses <a href="https://expressive-code.com/"><u>Expressive Code</u></a> for code blocks, and these options are now on the opening code fence.</p>
            <pre><code>js title="index.js" {1}
const foo = "bar";
</code></pre>
            <p>With <a href="https://www.npmjs.com/package/astray"><u>astray</u></a>, this is as simple as visiting the <code>code</code> nodes and:</p><ol><li><p>Parsing <code>node.value</code> with <a href="https://www.npmjs.com/package/front-matter"><u>front-matter</u></a>.</p></li><li><p>Assigning the attributes from <code>front-matter</code> to <code>node.meta</code>.</p></li><li><p>Replacing <code>node.value</code> with the rest of the code block.</p></li></ol>
            <pre><code>import { fromMarkdown } from "mdast-util-from-markdown";
import { toMarkdown } from "mdast-util-to-markdown";
 
import * as astray from "astray";
import type * as MDAST from "mdast";
import fm from "front-matter";
 
const markdown = await Bun.file("example.md").text();
 
const AST = fromMarkdown(markdown);
 
astray.walk&lt;MDAST.Root, void, any&gt;(AST, {
    code(node: MDAST.Code) {
        const { attributes, body } = fm(node.value);
        const { title, highlight } = attributes;
 
        // Build the meta string from scratch so a missing title doesn't
        // leave node.meta undefined when a highlight is appended.
        let meta = "";
 
        if (title) {
            meta = `title="${title}"`;
        }
 
        if (highlight) {
            meta += ` {${highlight}}`;
        }
 
        node.meta = meta.trim();
 
        node.value = body;
 
        return;
    }
});
 
// Write the transformed tree back out as Markdown.
await Bun.write("example.md", toMarkdown(AST));
</code></pre>
            
    <div>
      <h2>The migration in numbers</h2>
      <a href="#the-migration-in-numbers">
        
      </a>
    </div>
    <p>When we <a href="https://blog.cloudflare.com/new-dev-docs/"><u>migrated from Gatsby to Hugo</u></a> in 2021, the <a href="https://github.com/cloudflare/cloudflare-docs/pull/3609/"><u>pull request</u></a> included 4,850 files and the migration took close to three weeks from planning to implementation. This time around, the migration was nearly twice as large, with 8,060 files changed. Our planning and migration took six weeks in total:</p><ul><li><p>10 days: Evaluate platforms, vendors, and features </p></li><li><p>14 days: Migrate the <a href="https://developers.cloudflare.com/style-guide/components/"><u>components</u></a> required by the documentation site</p></li><li><p>5 days: Staging and user acceptance testing (UAT) </p></li><li><p>8 hours: Code freeze and <a href="https://github.com/cloudflare/cloudflare-docs/pull/16096"><u>migrate to Astro/Starlight</u></a></p></li></ul><p>The migration removed a net 19,624 lines of code from our maintenance burden.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3r9Hj2NwU40GPLTw5TbGYG/d292b405c097ebd577173f5d61c17d03/BLOG-2600_3.png" />
          </figure><p>While the number of files had grown substantially since our last major migration, our strategy was very similar to the 2021 migration. We used <a href="https://github.com/syntax-tree/mdast"><u>Markdown AST</u></a> and astray, a utility to walk ASTs, created specifically for the previous migration!</p>
    <div>
      <h2>What we learned</h2>
      <a href="#what-we-learned">
        
      </a>
    </div>
    <p>A website migration like our move to Astro/Starlight is a complex process that requires time to plan, review, and coordinate, and our preparation paid off! Including our <a href="https://community.cloudflare.com/t/2025-mvp-nominations/705496"><u>Cloudflare Community MVPs</u></a> in the planning and review period proved incredibly helpful. They provided great guidance and feedback as we planned for the migration. We only needed one day of code freeze, and there were no rollbacks or major incidents. Visitors to the site never experienced downtime, and overall the migration was a major success.</p><p>During testing, we ran into several use cases that warranted using <a href="https://docs.astro.build/en/reference/container-reference/"><u>experimental Astro APIs</u></a>. These APIs were always well documented, thanks to fantastic open source content from the Astro community. We were able to implement them quickly without impacting our release timeline.</p><p>We also ran into <a href="https://github.com/withastro/starlight/issues/2215"><u>an edge case</u></a> with build-time performance due to the number of pages on our site (4,000+). The Astro team was quick to triage the problem and begin investigating a <a href="https://github.com/withastro/starlight/pull/2252"><u>permanent fix</u></a>. Their fast, helpful responses made us truly grateful for the support of the Astro Discord server. A big thank you to the Astro/Starlight community!</p>
    <div>
      <h2>Contribute to developers.cloudflare.com!</h2>
      <a href="#contribute-to-developers-cloudflare-com">
        
      </a>
    </div>
    <p>Migrating <a href="http://developers.cloudflare.com"><u>developers.cloudflare.com</u></a> to Astro/Starlight is just one example of the ways we prioritize world-class documentation and user experiences at Cloudflare. Our deep investment in documentation makes this a great place to work for technical writers, UX strategists, and many other content creators. Since adopting a <a href="https://blog.cloudflare.com/content-as-a-product/"><u>content like a product</u></a> strategy in 2021, we have evolved to better serve the open source community by focusing on inclusivity and transparency, which ultimately leads to happier Cloudflare users. </p><p>We invite everyone to connect with us and explore these exciting new updates. Feel free to <a href="https://github.com/cloudflare/cloudflare-docs/issues"><u>reach out</u></a> if you’d like to speak with someone on the content team or share feedback about our documentation. You can share your thoughts or submit a pull request directly on the cloudflare-docs <a href="https://github.com/cloudflare/cloudflare-docs"><u>repository</u></a> in GitHub.</p> ]]></content:encoded>
            <category><![CDATA[Technical Writing]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Developer Documentation]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">6HAo0CAvmODAhYHnIF5Hbr</guid>
            <dc:creator>Kim Jeske</dc:creator>
            <dc:creator>Kian Newman-Hazel</dc:creator>
            <dc:creator>Kody Jackson</dc:creator>
        </item>
        <item>
            <title><![CDATA[Is this thing on? Using OpenBMC and ACPI power states for reliable server boot]]></title>
            <link>https://blog.cloudflare.com/how-we-use-openbmc-and-acpi-power-states-to-monitor-the-state-of-our-servers/</link>
            <pubDate>Tue, 22 Oct 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare’s global fleet benefits from being managed by open source firmware for the Baseboard Management Controller (BMC), OpenBMC. This has come with various challenges, some of which we discuss here with an explanation of how the open source nature of the firmware for the BMC enabled us to fix the issues and maintain a more stable fleet. ]]></description>
            <content:encoded><![CDATA[ 
    <div>
      <h2>Introduction</h2>
      <a href="#introduction">
        
      </a>
    </div>
    <p>At Cloudflare, we provide a range of services through our global network of servers, located in <a href="https://www.cloudflare.com/network/"><u>330 cities</u></a> worldwide. When you interact with our long-standing <a href="https://www.cloudflare.com/application-services/products/"><u>application services</u></a>, or newer services like <a href="https://ai.cloudflare.com/"><u>Workers AI</u></a>, you’re in contact with one of our fleet of thousands of servers that support those services.</p><p>These servers which provide Cloudflare services are managed by a Baseboard Management Controller (BMC). The BMC is a special-purpose processor — different from the Central Processing Unit (CPU) of a server — whose sole purpose is ensuring smooth operation of the server.</p><p>Regardless of the server vendor, each server has this BMC. The BMC runs independently of the CPU and has its own embedded operating system, usually referred to as <a href="https://en.wikipedia.org/wiki/Firmware"><u>firmware</u></a>. At Cloudflare, we customize and deploy a server-specific version of the BMC firmware. The BMC firmware we deploy at Cloudflare is based on the <a href="https://www.openbmc.org/"><u>Linux Foundation Project for BMCs, OpenBMC</u></a>. OpenBMC is an open source firmware stack designed to work across a variety of systems including enterprise, telco, and cloud-scale data centers. The open source nature of OpenBMC gives us greater flexibility and ownership of this critical server subsystem, instead of the closed nature of proprietary firmware. 
This gives us transparency (which is important to us as a security company) and allows us faster time to develop custom features/fixes for the BMC firmware that we run on our entire fleet.</p><p>In this blog post, we are going to describe how we customized and extended the OpenBMC firmware to better monitor our servers’ boot-up processes to start more reliably and allow better diagnostics in the event that an issue happens during server boot-up.</p>
    <div>
      <h2>Server subsystems</h2>
      <a href="#server-subsystems">
        
      </a>
    </div>
    <p>Server systems consist of multiple complex subsystems that include the processors, memory, storage, networking, power supply, cooling, etc. When booting up the host of a server system, the power state of each subsystem of the server is changed in an asynchronous manner. This is done so that subsystems can initialize simultaneously, thereby improving the efficiency of the boot process. Though started asynchronously, these subsystems may interact with each other at different points of the boot sequence and rely on handshake/synchronization to exchange information. For example, during boot-up, the <a href="https://en.wikipedia.org/wiki/UEFI"><u>UEFI (Unified Extensible Firmware Interface)</u></a>, often referred to as the <a href="https://en.wikipedia.org/wiki/BIOS"><u>BIOS</u></a>, configures the motherboard in a phase known as the Platform Initialization (PI) phase, during which the UEFI collects information from subsystems such as the CPUs, memory, etc. to initialize the motherboard with the right settings.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6csPNEksLXsGgt3dq5xZ0S/3236656dbc01f3085bada5af853c3516/image1.png" />
          </figure><p><sup><i>Figure 1: Server Boot Process</i></sup></p><p>When the power state of the subsystems, handshakes, and synchronization are not properly managed, there may be race conditions that would result in failures during the boot process of the host. Cloudflare experienced some of these boot-related failures while rolling out open source firmware (<a href="https://en.wikipedia.org/wiki/OpenBMC"><u>OpenBMC</u></a>) to the Baseboard Management Controllers (BMCs) of our servers. </p>
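    <p>The synchronization problem described above is easiest to see in miniature. The sketch below (hypothetical TypeScript, not firmware code) models two concurrently started “subsystems”: a reader that waits for the producer’s handshake always observes initialized state, while an unsynchronized read may not.</p>

```typescript
// Abstract sketch of why asynchronous subsystem start-up needs explicit
// handshakes (hypothetical illustration, not firmware code).
function makeHandshake() {
  let signal = function () {};
  const done = new Promise(function (resolve) {
    // The executor runs synchronously, so `signal` is the resolver on return.
    signal = function () { resolve(undefined); };
  });
  return { signal, done };
}

let spdData: string | null = null;       // e.g. a memory module's SPD contents
const spdReady = makeHandshake();

async function initMemoryModule() {
  // Simulate slow hardware initialization before the data becomes readable.
  await new Promise(function (r) { setTimeout(r, 10); });
  spdData = "DDR4 32GB";
  spdReady.signal();                     // handshake: data is now valid
}

async function detectMemory() {
  await spdReady.done;                   // synchronize before reading
  return spdData;                        // guaranteed initialized
}

async function main() {
  const boot = initMemoryModule();       // subsystems start asynchronously
  const unsynchronized = spdData;        // racy read: may still be null
  const synchronized = await detectMemory();
  await boot;
  return { unsynchronized, synchronized };
}
```

    <p>In real hardware the stakes are the same but the failure is physical: a subsystem that reads before its peer is ready sees garbage, or, as in the i2c case later in this post, two readers contend for the same bus.</p>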
    <div>
      <h2>Baseboard Management Controller (BMC) as a manager of the host</h2>
      <a href="#baseboard-management-controller-bmc-as-a-manager-of-the-host">
        
      </a>
    </div>
    <p>A BMC is a specialized microprocessor that is attached to the board of a host (server) to assist with remote management capabilities of the host. Servers usually sit in data centers and are often far away from the administrators, and this creates a challenge to maintain them at scale. This is where a BMC comes in, as the BMC serves as the interface that gives administrators the ability to securely and remotely access the servers and carry out management functions. The BMC does this by exposing various interfaces, including <a href="https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface"><u>Intelligent Platform Management Interface (IPMI)</u></a> and <a href="https://www.dmtf.org/standards/redfish"><u>Redfish</u></a>, for distributed management. In addition, the BMC receives data from various sensors/devices (e.g. temperature, power supply) connected to the server, and also the operating parameters of the server, such as the operating system state, and publishes the values on its IPMI and Redfish interfaces.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33dNmfyjqrbAGvcbZLTa0h/db3e6b79b1010081916ee6498b10c297/image2.png" />
          </figure><p><sup><i>Figure 2: Block diagram of BMC in a server system.</i></sup></p><p>At Cloudflare, we use the <a href="https://github.com/openbmc/openbmc"><u>OpenBMC</u></a> project for our Baseboard Management Controller (BMC).</p><p>Below are examples of management functions carried out on a server through the BMC. The interactions in the examples are done over <a href="https://github.com/ipmitool/ipmitool/wiki"><u>ipmitool</u></a>, a command line utility for interacting with systems that support IPMI.</p>
            <pre><code># Check the sensor readings of a server remotely (i.e. over a network)
$  ipmitool &lt;some authentication&gt; &lt;bmc ip&gt; sdr
PSU0_CURRENT_IN  | 0.47 Amps         | ok
PSU0_CURRENT_OUT | 6 Amps            | ok
PSU0_FAN_0       | 6962 RPM          | ok
SYS_FAN          | 13034 RPM         | ok
SYS_FAN1         | 11172 RPM         | ok
SYS_FAN2         | 11760 RPM         | ok
CPU_CORE_VR_POUT | 9.03 Watts        | ok
CPU_POWER        | 76.95 Watts       | ok
CPU_SOC_VR_POUT  | 12.98 Watts       | ok
DIMM_1_VR_POUT   | 29.03 Watts       | ok
DIMM_2_VR_POUT   | 27.97 Watts       | ok
CPU_CORE_MOSFET  | 40 degrees C      | ok
CPU_TEMP         | 50 degrees C      | ok
DIMM_MOSFET_1    | 36 degrees C      | ok
DIMM_MOSFET_2    | 39 degrees C      | ok
DIMM_TEMP_A1     | 34 degrees C      | ok
DIMM_TEMP_B1     | 33 degrees C      | ok

…

# check the power status of a server remotely (i.e. over a network)
ipmitool &lt;some authentication&gt; &lt;bmc ip&gt; power status
Chassis Power is off

# power on the server
ipmitool &lt;some authentication&gt; &lt;bmc ip&gt; power on
Chassis Power Control: On</code></pre>
            <p>Switching to OpenBMC firmware for our BMCs gives us more control over the software that powers our infrastructure. This has given us more flexibility, customizations, and an overall more uniform experience for managing our servers. Since OpenBMC is open source, we also leverage community fixes while upstreaming some of our own. Some of the advantages we have experienced with OpenBMC include a faster turnaround time for fixing issues, <a href="https://blog.cloudflare.com/de-de/thermal-design-supporting-gen-12-hardware-cool-efficient-and-reliable/"><u>optimizations around thermal cooling</u></a>, <a href="https://blog.cloudflare.com/gen-12-servers/"><u>increased power efficiency</u></a> and <a href="https://blog.cloudflare.com/how-we-used-openbmc-to-support-ai-inference-on-gpus-around-the-world/"><u>supporting AI inference</u></a>.</p><p>While developing Cloudflare’s OpenBMC firmware, however, we ran into a number of boot problems.</p><p><b><i>Host not booting:</i></b> When we sent a request over IPMI for a host to power on (as in the example above, power on the server), ipmitool would indicate the power status of the host as ON, but we would not see any power going into the CPU or any activity on the CPU. While ipmitool was correct about the power going into the chassis as ON, we had no information about the power state of the server from ipmitool, and we initially falsely assumed that since the chassis power was on, the rest of the server components should be ON. The <a href="https://documents.uow.edu.au/~blane/netapp/ontap/sysadmin/monitoring/concept/c_oc_mntr_bmc-sys-event-log.html"><u>System Event Log (SEL)</u></a>, which is responsible for displaying platform-specific events, was not giving us any useful information beyond indicating that the server was in a soft-off state (powered off), working state (operating system is loading and running), or that a “System Restart” of the host was initiated.</p>
            <pre><code># System Event Logs (SEL) showing the various power states of the server
$ ipmitool sel elist | tail -n3
  4d |  Pre-Init  |0000011021| System ACPI Power State ACPI_STATUS | S5_G2: soft-off | Asserted
  4e |  Pre-Init  |0000011022| System ACPI Power State ACPI_STATUS | S0_G0: working | Asserted
  4f |  Pre-Init  |0000011023| System Boot Initiated RESTART_CAUSE | System Restart | Asserted</code></pre>
            <p>In the System Event Logs shown above, ACPI is the acronym for Advanced Configuration and Power Interface, a standard for power management on computing systems. In the ACPI soft-off state, the host is powered off (the motherboard is on standby power but CPU/host isn’t powered on); according to the <a href="https://uefi.org/sites/default/files/resources/ACPI_Spec_6_5_Aug29.pdf"><u>ACPI specifications</u></a>, this state is called S5_G2. (These states are discussed in more detail below.) In the ACPI working state, the host is booted and in a working state, known in the ACPI specifications as state S0_G0 (which in our case happened to be false), and the third row indicates that the restart was caused by a System Restart. Most of the boot-related SEL events are sent from the UEFI to the BMC. The UEFI has been something of a black box to us, as we rely on our original equipment manufacturers (OEMs) to develop the UEFI firmware for us, and for the generation of servers with this issue, the UEFI firmware did not implement sending the boot progress of the host to the BMC.</p><p>One discrepancy we observed was the difference between the reported power status and the power going into the CPU, which we read with a sensor we call CPU_POWER.</p>
            <pre><code># Check power status
$ ipmitool &lt;some authentication&gt; &lt;bmc ip&gt;  power status
Chassis Power is on
</code></pre>
            <p>However, checking the power into the CPU shows that the CPU was not receiving any power.</p>
            <pre><code># Check power going into the CPU
$ ipmitool &lt;some authentication&gt; &lt;bmc ip&gt;  sdr | grep CPU_POWER    
CPU_POWER        | 0 Watts           | ok</code></pre>
            <p>The CPU_POWER reading of 0 watts contradicts all the previous information that the host was powered up and working; the host was actually completely shut down.</p><p><b><i>Missing memory modules:</i></b> Our servers would randomly boot up with less memory than expected. A computer can boot with less memory than installed for a number of reasons, such as a loose connection, a hardware problem, or a faulty module. In our case it was none of the usual suspects: both the BMC and the UEFI were trying to read from the memory modules at the same time, leading to access contention. Memory modules usually contain a <a href="https://en.wikipedia.org/wiki/Serial_presence_detect"><u>Serial Presence Detect (SPD)</u></a>, which the UEFI uses to dynamically detect each module. The SPD usually sits on an <a href="https://learn.sparkfun.com/tutorials/i2c/all"><u>inter-integrated circuit (i2c)</u></a> bus, a low-speed, two-wire protocol for devices to talk to each other. The BMC also reads the temperature of the memory modules via i2c. When the server is powered on, the UEFI, among other hardware initialization tasks, detects and initializes the memory modules via each module’s SPD, while the BMC could be trying to read a module’s temperature at the same time, over the same i2c bus. This simultaneous access denies one of the parties the bus. When the UEFI is denied access to an SPD, it concludes the memory module is not present and skips over it. Below is an example of the related i2c-bus contention logs we saw in the <a href="https://www.freedesktop.org/software/systemd/man/latest/journalctl.html"><u>journal</u></a> of the BMC while the host was booting.</p>
            <pre><code>kernel: aspeed-i2c-bus 1e78a300.i2c-bus: irq handled != irq. expected 0x00000021, but was 0x00000020</code></pre>
            <p>The log above indicates that the i2c bus at 1e78a300 (which happens to be connected to the serial presence detect of the memory modules) could not properly handle a signal known as an interrupt request (irq). When this scenario plays out against the UEFI, the UEFI is unable to detect the memory module.</p>
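<p>To make the race concrete, here is a toy model (a sketch for illustration only, not OpenBMC or UEFI code; the addresses and names are assumed): the UEFI enumerates DIMMs by reading each SPD, and any SPD whose read collides with a concurrent BMC temperature poll is wrongly treated as absent.</p>

```python
# Toy model of the SPD/i2c contention described above (illustrative only).
# A DIMM whose SPD read is denied because the BMC holds the bus gets skipped,
# so the host boots with less memory than is physically installed.

def enumerate_dimms(spd_addrs, bmc_busy_addrs):
    """Return the SPD addresses the UEFI successfully detects."""
    detected = []
    for addr in spd_addrs:
        if addr in bmc_busy_addrs:
            # Simultaneous BMC temperature read on the same i2c bus: the
            # UEFI's SPD read fails, so it assumes the slot is empty.
            continue
        detected.append(addr)
    return detected

DIMM_SPDS = [0x50, 0x51, 0x52, 0x53]  # typical SPD address range (assumed)

# BMC polling a DIMM's temperature during POST: memory goes "missing".
print(enumerate_dimms(DIMM_SPDS, bmc_busy_addrs={0x51}))  # detects 3 of 4 DIMMs

# Deferring BMC reads while the host is in POST restores full detection.
print(enumerate_dimms(DIMM_SPDS, bmc_busy_addrs=set()))   # detects all 4 DIMMs
```

<p>Once the BMC defers its reads during firmware initialization, the UEFI always wins the bus and enumerates the full set of modules.</p>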
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Fe8wb6xqwXkanb8iPv8O2/eaecfe0474576a00cdc25bfeb6fba7a2/image4.png" />
          </figure><p><sup><i>Figure 3: I2C diagram showing the I2C interconnection of the server’s memory modules (also known as DIMMs) with the BMC</i></sup></p><p><a href="https://www.techtarget.com/searchstorage/definition/DIMM"><u>DIMM</u></a> in Figure 3 refers to <a href="https://www.techtarget.com/searchstorage/definition/DIMM"><u>Dual Inline Memory Module</u></a>, the type of memory module used in servers.</p><p><b><i>Thermal telemetry:</i></b> During the boot-up process of some of our servers, some temperature devices, such as the temperature sensors of the memory modules, would show up as failed, causing some of the fans to enter a fail-safe <a href="https://en.wikipedia.org/wiki/Pulse-width_modulation"><u>Pulse Width Modulation (PWM)</u></a> mode. <a href="https://en.wikipedia.org/wiki/Pulse-width_modulation"><u>PWM</u></a> is a technique for controlling the power delivered to an electronic device by adjusting the duty cycle of the signal, that is, the fraction of each period the signal is high. Here it is used to control fan speed by adjusting the duty cycle of the power signal delivered to the fan. When a fan enters fail-safe mode, its PWM duty cycle is set to a preset value, irrespective of what the optimized setting should be, which can negatively affect both the cooling of the server and its power consumption.</p>
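<p>As a sketch of the idea (the curve and numbers below are illustrative assumptions, not our production fan tables): a fan controller maps each temperature reading to a PWM duty cycle, and falls back to a fixed preset whenever a sensor fails to report.</p>

```python
# Illustrative fan-control sketch: temperature -> PWM duty cycle (%).
# All constants are assumed values for the example, not production settings.
FAILSAFE_DUTY = 80  # preset duty cycle used when telemetry is missing

def fan_duty(temp_c):
    """Map a temperature (deg C) to a PWM duty cycle (%); a toy linear curve."""
    if temp_c is None:            # sensor failed to report a value
        return FAILSAFE_DUTY      # fail-safe: fixed speed regardless of demand
    lo, hi = 30.0, 80.0           # assumed thermal envelope for the curve
    frac = min(max((temp_c - lo) / (hi - lo), 0.0), 1.0)
    return round(20 + frac * 80)  # keep a 20% floor so airflow never stops

print(fan_duty(55))    # mid-range temperature -> mid-range duty (60)
print(fan_duty(None))  # failed sensor -> fail-safe preset (80)
```

<p>The fail-safe branch is exactly what bit us: a sensor that is merely unpowered during boot looks the same as a failed one, pinning the fans at the preset duty.</p>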
    <div>
      <h2>Implementing host ACPI state on OpenBMC</h2>
      <a href="#implementing-host-acpi-state-on-openbmc">
        
      </a>
    </div>
    <p>In studying the issues we faced with the host’s boot-up process, we learned how the power state of the subsystems within the chassis changes. These learnings led us to investigate the Advanced Configuration and Power Interface (ACPI) and how the ACPI state of the host changes during the boot process.</p><p>Advanced Configuration and Power Interface (ACPI) is an open industry specification for power management used in desktop, mobile, workstation, and server systems. The <a href="https://uefi.org/sites/default/files/resources/ACPI_Spec_6_5_Aug29.pdf"><u>ACPI Specification</u></a> replaces previous power management methodologies such as <a href="https://en.wikipedia.org/wiki/Advanced_Power_Management"><u>Advanced Power Management (APM)</u></a>. ACPI provides the advantages of:</p><ul><li><p>Allowing OS-directed power management (OSPM).</p></li><li><p>Having a standardized and robust interface for power management.</p></li><li><p>Signaling system-level events, such as when the server’s power or sleep buttons are pressed.</p></li><li><p>Hardware and software support, such as a real-time clock (RTC) to schedule the server to wake up from sleep or to reduce the functionality of the CPU based on RTC ticks when there is a loss of power.</p></li></ul><p>From the perspective of power management, ACPI enables OS-driven conservation of energy by transitioning components that are not in active use to a lower power state, thereby reducing power consumption.</p><p>The ACPI Specification defines four global “Gx” states, six sleeping “Sx” states, and four “Dx” device power states. These states are defined as follows:</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>Gx</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Name</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Sx</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Description</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>G0</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Working</span></span></p>
                    </td>
                    <td>
                        <p><span><span>S0</span></span></p>
                    </td>
                    <td>
                        <p><span><span>The run state. In this state the machine is fully running</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>G1</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Sleeping</span></span></p>
                    </td>
                    <td>
                        <p><span><span>S1</span></span></p>
                    </td>
                    <td>
                        <p><span><span>A sleep state where the CPU will suspend activity but retain its contexts.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>S2</span></span></p>
                    </td>
                    <td>
                        <p><span><span>A sleep state where memory contexts are held, but CPU contexts are lost. CPU re-initialization is done by firmware.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>S3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>A logically deeper sleep state than S2 where CPU re-initialization is done by device. Equates to Suspend to RAM.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>S4</span></span></p>
                    </td>
                    <td>
                        <p><span><span>A logically deeper sleep state than S3 in which DRAM context is not maintained and contexts are saved to disk. Can be implemented by either the OS or firmware. </span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>G2</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Soft off but PSU still supplies power</span></span></p>
                    </td>
                    <td>
                        <p><span><span>S5</span></span></p>
                    </td>
                    <td>
                        <p><span><span>The soft off state. All activity will stop, and all contexts are lost. The Complex Programmable Logic Device (CPLD) responsible for power-up and power-down sequences of various components e.g. CPU, BMC is on standby power, but the CPU/host is off.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>G3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Mechanical off</span></span></p>
                    </td>
                    <td> </td>
                    <td>
                        <p><span><span>PSU does not supply power. The system is safe for disassembly.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Dx</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Name</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Description</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>D0</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Fully powered on</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Hardware device is fully functional and operational </span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>D1</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Hardware device is partially powered down</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Reduced functionality and can be quickly powered back to D0</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>D2</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Hardware device is in a deeper low-power state than D1</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Much more limited functionality and can only be slowly powered back to D0.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>D3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Hardware device is significantly powered down or off</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Device is inactive with perhaps only the ability to be powered back on</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p>The states that matter to us are:</p><ul><li><p><b>S0_G0_D0:</b> Often referred to as the working state. Here we know our host system is running just fine.</p></li><li><p><b>S2_D2: </b>Memory contexts are held, but CPU context is lost. We usually use this state to know when the host’s UEFI is performing platform firmware initialization.</p></li><li><p><b>S5_G2:</b> Often referred to as the soft-off state. Here power still goes into the chassis; however, processor and DRAM context are not maintained, and the host’s operating system power management has no context.</p></li></ul><p>Since the issues we were experiencing were related to the host’s power state changes when we asked it to reboot or power on, we needed a way to track those changes as the host went from powered off to a fully working state. This gives us better management capabilities over the devices in the same power domain as the host during the boot process. Fortunately, the OpenBMC community had already implemented an <a href="https://github.com/openbmc/google-misc/tree/master/subprojects/acpi-power-state-daemon"><u>ACPI daemon</u></a>, which we extended to serve our needs. We added an ACPI S2_D2 power state (memory contexts held, but CPU context lost) to the ACPI daemon running on the BMC, letting us know when the host’s UEFI is performing firmware initialization, and we set up various management tasks for the different ACPI power states.</p><p>An example of a power management task we carry out on entering the S0_G0_D0 state is re-exporting our Voltage Regulator (VR) sensors, as shown in the service file below:</p>
            <pre><code>cat /lib/systemd/system/Re-export-VR-device.service 
[Unit]
Description=RE Export VR Device Process
Wants=xyz.openbmc_project.EntityManager.service
After=xyz.openbmc_project.EntityManager.service
Conflicts=host-s2-state.target

[Service]
Type=simple
ExecStart=/bin/bash -c 'set -a &amp;&amp; source /usr/bin/Re-export-VR-device.sh on'
SyslogIdentifier=Re-export-VR-device.service

[Install]
WantedBy=host-s0-state.target
</code></pre>
            <p>With this in place, OpenBMC’s <a href="https://github.com/openbmc/phosphor-host-ipmid/tree/master"><u>phosphor-host-ipmid</u></a> provides a handler (ipmiSetACPIState) that records the host’s ACPI state on the BMC. The host invokes it with the standard IPMI Set ACPI Power State command, NetFn=0x06 and Cmd=0x06.</p><p>In the event of an immediate power cycle (i.e. a host reboot without an operating system shutdown), the host cannot send its S5_G2 state to the BMC. For this case, we patched OpenBMC’s <a href="https://github.com/openbmc/x86-power-control/tree/master"><u>x86-power-control</u></a> so the BMC becomes aware on its own that the host has entered the ACPI S5_G2 (soft-off) state. When the host comes out of the powered-off state, the UEFI performs the Power On Self Test (POST) and sends S2_D2 to the BMC; after the UEFI has loaded the OS, it notifies the BMC by sending the ACPI S0_G0_D0 state.</p>
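<p>As a sketch (the enumeration values below follow our reading of the IPMI v2.0 Set ACPI Power State command and should be verified against your platform), the request carries one data byte for the system power state and one for the device power state, with bit 7 flagging that the state should actually be set:</p>

```python
# Sketch of the IPMI "Set ACPI Power State" request (NetFn=0x06, Cmd=0x06).
# State values follow our reading of the IPMI v2.0 spec; bit 7 of each data
# byte means "set this state", bits 6:0 carry the state. Illustrative only.
SYSTEM_STATES = {"S0_G0": 0x00, "S2": 0x02, "S5_G2": 0x05}
DEVICE_STATES = {"D0": 0x00, "D2": 0x02}
SET_FLAG = 0x80  # bit 7: actually set the state rather than leave it unchanged

def set_acpi_power_state(system, device):
    """Return the data bytes for `ipmitool raw 0x06 0x06 <sys> <dev>`."""
    return [SET_FLAG | SYSTEM_STATES[system], SET_FLAG | DEVICE_STATES[device]]

# e.g. the UEFI announcing "firmware initialization in progress" (S2/D2):
print([hex(b) for b in set_acpi_power_state("S2", "D2")])  # ['0x82', '0x82']
```
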
    <div>
      <h2>Fixing the issues</h2>
      <a href="#fixing-the-issues">
        
      </a>
    </div>
    <p>Returning to the boot-up issues, we discovered that they were mostly caused by devices in the same power domain as the CPU interfering with the UEFI/platform firmware initialization phase. Below is a high-level description of the fixes we applied.</p><p><b><i>Servers not booting:</i></b> After identifying the devices that were interfering with the POST stage of firmware initialization, we used the host ACPI state to control when those devices are put into the appropriate power mode, so that they no longer cause POST to fail.</p><p><b><i>Memory modules missing:</i></b> During boot-up, memory modules (DIMMs) are powered and initialized in the S2_D2 ACPI state. During this initialization, the UEFI firmware sends read commands to the Serial Presence Detect (SPD) on each DIMM to retrieve the information needed for DIMM enumeration. At the same time, the BMC could be sending commands to read the DIMM temperature sensors. This can cause SMBus collisions, which make either the DIMM temperature read or the UEFI DIMM enumeration fail. The latter causes the system to boot with reduced DIMM capacity, which can be mistaken for a failing DIMM. Once we discovered this race condition, we stopped the BMC from reading the DIMM temperature sensors during the S2_D2 ACPI state and set a fixed speed for the corresponding fans. This lets the UEFI retrieve all the DIMM information it needs for enumeration, and our servers now boot with the correct amount of memory.</p><p><b><i>Thermal telemetry:</i></b> In the S0_G0 power state, when sensors do not report values back to the BMC, the BMC assumes devices may be overheating and puts the fan controller into a fail-safe mode in which fan speeds are ramped up to maximum. However, in the S5_G2 state, some thermal sensors, such as the CPU and NIC temperature sensors, are not powered and therefore not available. Our solution is to mark these sensors as non-functional in their exported configuration while in the S5_G2 state and during the transition from S5_G2 to S2_D2. Marking the affected devices as non-functional, instead of waiting for sensor read commands to error out, prevents the controller from entering fail-safe mode.</p>
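<p>The per-state sensor handling can be sketched as a small policy table (sensor names and return values are illustrative, not the actual OpenBMC configuration):</p>

```python
# Toy policy table mirroring the fixes above (illustrative, not OpenBMC code).
HOST_POWERED = {"CPU_TEMP", "NIC_TEMP", "DIMM_TEMP"}  # dead without host power

def sensor_action(acpi_state, sensor):
    """Decide how the BMC should treat a sensor in a given host ACPI state."""
    if acpi_state == "S5_G2" and sensor in HOST_POWERED:
        return "mark-non-functional"  # don't wait for the read to error out
    if acpi_state == "S2_D2" and sensor == "DIMM_TEMP":
        return "skip-poll"            # leave the i2c bus to the UEFI during POST
    return "poll"

print(sensor_action("S5_G2", "CPU_TEMP"))   # host is off: mark-non-functional
print(sensor_action("S2_D2", "DIMM_TEMP"))  # POST in progress: skip-poll
print(sensor_action("S0_G0", "DIMM_TEMP"))  # working state: poll
```

<p>Marking sensors non-functional up front, rather than letting reads time out, is what keeps the fan controller out of fail-safe mode.</p>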
    <div>
      <h2>Moving forward</h2>
      <a href="#moving-forward">
        
      </a>
    </div>
    <p>Aside from resolving these issues, implementing the ACPI power state in our BMC firmware has brought other benefits. One example is our automated firmware regression testing. Various tests reboot or power cycle the servers over a hundred times, and during these we monitor the ACPI power state changes of our servers, instead of relying on a boolean (running or not running, pingable or not pingable) to assert their status.</p><p>It has also given us the opportunity to learn more about the complex subsystems in a server and their various power modes, an area we are still actively exploring as we look to further optimize the boot sequence of our servers.</p><p>Over time, implementing ACPI states is helping us ensure that:</p><ul><li><p>All components are enabled by the end of the boot sequence,</p></li><li><p>The BIOS and BMC are able to retrieve component information,</p></li><li><p>And the BMC is aware when thermal sensors are in a non-functional state.
</p></li></ul><p>For better observability of the boot progress and “last state” of our systems, we have also started adding the BootProgress object of the <a href="https://redfish.dmtf.org/schemas/v1/ComputerSystem.v1_13_0.json"><u>Redfish ComputerSystem Schema</u></a> to our systems. This will give us pre-operating system (OS) boot observability and an easier debugging starting point when the UEFI has issues during server platform initialization (such as when the server won’t power on).</p><p>With each passing day, Cloudflare’s OpenBMC team, made up of folks from different embedded backgrounds, learns about, experiments with, and deploys OpenBMC across our global fleet. This has been made possible by the OpenBMC community’s contributions (along with upstreaming some of our own) and by our interactions with our various vendors, giving us the opportunity to make our systems more reliable, and giving us ownership of, and responsibility for, the firmware that powers the BMCs managing our servers. If you are thinking of embracing open source firmware on your BMC, we hope this blog post, written by a team that started deploying OpenBMC less than 18 months ago, has inspired you to give it a try.</p><p>If you are interested in making the jump to open source firmware, check it out <a href="https://github.com/openbmc/openbmc"><u>here</u></a>!</p> ]]></content:encoded>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[OpenBMC]]></category>
            <category><![CDATA[Servers]]></category>
            <category><![CDATA[Firmware]]></category>
            <guid isPermaLink="false">2hySj1JFTXmlofjA6IRijm</guid>
            <dc:creator>Nnamdi Ajah</dc:creator>
            <dc:creator>Ryan Chow</dc:creator>
            <dc:creator>Giovanni Pereira Zantedeschi</dc:creator>
        </item>
        <item>
            <title><![CDATA[Expanding Cloudflare's support for open source projects with Project Alexandria]]></title>
            <link>https://blog.cloudflare.com/expanding-our-support-for-oss-projects-with-project-alexandria/</link>
            <pubDate>Fri, 27 Sep 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ At Cloudflare, we believe in the power of open source. With Project Alexandria, our expanded open source program, we’re helping open source projects have a sustainable and scalable future, providing them with the tools and protection needed to thrive. ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare, we believe in the power of open source. It’s more than just code; it’s the spirit of collaboration, innovation, and shared knowledge that drives the Internet forward. Open source is the foundation upon which the Internet thrives, allowing developers and creators from around the world to contribute to a greater whole.</p><p>But oftentimes, open source maintainers struggle with the costs associated with running their projects and providing access to users all over the world. We’ve had the privilege of supporting incredible open source projects such as <a href="https://git-scm.com/"><u>Git</u></a> and the <a href="https://www.linuxfoundation.org/"><u>Linux Foundation</u></a> through our <a href="https://blog.cloudflare.com/cloudflare-new-oss-sponsorships-program/"><u>open source program</u></a>, and we have learned first-hand about the places where Cloudflare can help the most.</p><p>Today, we’re introducing a streamlined and expanded open source program: Project Alexandria. The ancient city of Alexandria is known for hosting a prolific library and a lighthouse that was one of the Seven Wonders of the Ancient World. The Lighthouse of Alexandria served as a beacon of culture and community, welcoming people from afar into the city. We think Alexandria is a great metaphor for the role open source projects play as a beacon for developers around the world and a source of knowledge that is core to making a better Internet.</p><p>This project offers recurring annual credits to even more open source projects to provide our products for free. In the past, we offered an upgrade to our Pro plan, but now we’re offering upgrades tailored to the size and needs of each project, along with access to a broader range of products like <a href="https://workers.cloudflare.com/"><u>Workers</u></a>, <a href="https://pages.cloudflare.com/"><u>Pages</u></a>, and more. 
Our goal with Project Alexandria is to ensure every OSS project not only survives but thrives, with access to Cloudflare’s enhanced security, performance optimization, and developer tools — all at no cost.</p>
    <div>
      <h2>Building a program based on your needs</h2>
      <a href="#building-a-program-based-on-your-needs">
        
      </a>
    </div>
    <p>We realize that open source projects have different needs. Some projects, like package repositories, may be most concerned about storage and transfer costs. Other projects need help protecting themselves from DDoS attacks. And some projects need a robust developer platform to enable them to quickly build and deploy scalable and secure applications.</p><p>With our new program we’ll work with your project to help unlock the following based on your needs:</p><ul><li><p>An upgrade to a Cloudflare Pro, Business, or Enterprise plan, which will give you more flexibility, with more <a href="https://developers.cloudflare.com/rules/"><u>Cloudflare Rules</u></a> to manage traffic, Image Optimization with <a href="https://developers.cloudflare.com/images/polish/"><u>Polish</u></a> to accelerate the speed of image downloads, and enhanced security with <a href="https://www.cloudflare.com/en-gb/application-services/products/waf/"><u>Web Application Firewall (WAF)</u></a>, <a href="https://developers.cloudflare.com/waf/analytics/security-analytics/"><u>Security Analytics</u></a>, and <a href="https://developers.cloudflare.com/page-shield/"><u>Page Shield</u></a>, to protect projects from potential threats and vulnerabilities.</p></li><li><p>Increased requests to Cloudflare <a href="https://workers.cloudflare.com/"><u>Workers</u></a> and <a href="https://pages.cloudflare.com/"><u>Pages</u></a>, allowing you to handle more traffic and scale your applications globally.</p></li><li><p>Increased <a href="https://developers.cloudflare.com/r2/"><u>R2</u></a> storage for builds and artifacts, ensuring you have the space needed to store and access your project’s assets efficiently.</p></li><li><p>Enhanced <a href="https://developers.cloudflare.com/cloudflare-one/"><u>Zero Trust</u></a> access, including <a href="https://developers.cloudflare.com/cloudflare-one/policies/browser-isolation/"><u>Remote Browser Isolation</u></a>, no user limits, and extended activity log retention to give
you deeper insights and more control over your project’s security.</p></li></ul><p>Every open source project in the program will receive additional resources and support through a dedicated <a href="https://discord.com/channels/595317990191398933/1284158129474506802"><u>channel</u></a> on our <a href="https://discord.cloudflare.com"><u>Discord server</u></a>. And if there’s something you think we can do to help that we don’t currently offer, we’re here to figure out how to make it happen.</p><p>Many open source projects run within the limits of Cloudflare’s generous <a href="https://www.cloudflare.com/en-gb/plans/"><u>free tiers</u></a>. Our mission to help build a better Internet means that cost should not be a barrier to creating, securing, and distributing your open source packages globally, no matter the size of the project. Indie or niche open source projects can still run for free without the need for credits. For larger open source projects, the annual recurring credits are available to you, so your money can continue to be reinvested into innovation, instead of paying for infrastructure to store, secure, and deliver your packages and websites.</p><p>We’re dedicated to supporting projects that are not only innovative but also crucial to the continued growth and health of the Internet. The criteria for the program remain the same:</p><ul><li><p>Operate solely on a non-profit basis and/or otherwise align with the project mission.</p></li><li><p>Be an open source project with a <a href="https://opensource.org/licenses/"><u>recognized OSS license</u></a>.</p></li></ul><p>If you’re an open source project that meets these requirements, you can <a href="https://www.cloudflare.com/lp/project-alexandria/"><u>apply for the program here</u></a>.</p>
    <div>
      <h2>Empowering the Open Source community</h2>
      <a href="#empowering-the-open-source-community">
        
      </a>
    </div>
    <p>We’re incredibly lucky to have open source projects that we admire, and the incredible people behind those <a href="https://developers.cloudflare.com/sponsorships/"><u>projects</u></a>, as part of our program — including the <a href="https://openjsf.org/"><u>OpenJS Foundation</u></a>, <a href="https://opentofu.org/"><u>OpenTofu</u></a>, and <a href="https://julialang.org/"><u>JuliaLang</u></a>.</p><p><b>OpenJS Foundation</b></p><p><a href="https://github.com/nodejs"><u>Node.js</u></a> has been part of our OSS Program since 2019, and we’ve recently partnered with the <a href="https://openjsf.org/"><u>OpenJS Foundation</u></a> to provide technical support and infrastructure improvements to other critical JavaScript projects hosted at the foundation, including <a href="https://github.com/fastify/fastify"><u>Fastify</u></a>, <a href="https://github.com/jquery/jquery"><u>jQuery</u></a>, <a href="https://github.com/electron/electron"><u>Electron</u></a>, and <a href="https://github.com/NativeScript/NativeScript"><u>NativeScript</u></a>.</p><p>One prominent example of the <a href="https://openjsf.org/"><u>OpenJS Foundation</u></a> using Cloudflare is the Node.js CDN Worker. It’s currently in active development by the Node.js Web Infrastructure and Build teams and aims to serve all Node.js release assets (binaries, documentation, etc.) provided on their website.</p><p><a href="https://x.com/NodeConfEU/status/1823676122648715581"><u>Aaron Snell</u></a> explained that these release assets are currently being served by a single static origin file server fronted by Cloudflare. This worked fine up until a few years ago, when issues began to pop up with new releases. With each new release came a cache purge, meaning that all the requests for the release assets were cache misses, causing Cloudflare to go directly to the static file server and overloading it. 
Because Node.js releases nightly builds, this issue occurs every day.</p><p>The CDN Worker plans to fix this by using Cloudflare Workers and R2 to serve requests for the release assets, taking all the load off the static file server, resulting in improved availability for Node.js downloads and documentation, and ultimately making the process more sustainable in the long run.</p><p><b>OpenTofu</b></p><p><a href="https://github.com/opentofu/opentofu"><u>OpenTofu</u></a> has been focused on building a free and open alternative to proprietary infrastructure-as-code platforms. One of their major challenges has been ensuring the reliability and scalability of their registry while keeping costs low. Cloudflare’s <a href="https://developers.cloudflare.com/r2/"><u>R2</u></a> storage and caching services provided the perfect fit, allowing <a href="https://github.com/opentofu/opentofu"><u>OpenTofu</u></a> to serve static files at scale without worrying about bandwidth or performance bottlenecks.</p><p>The OpenTofu team noted that it was paramount for OpenTofu to keep the costs of running the registry as low as possible, both in terms of bandwidth and in human cost. However, they also needed to make sure that the registry had an uptime close to 100%, since thousands upon thousands of developers would be left without a means to update their infrastructure if it went down.</p><p>The registry codebase (written in Go) pre-generates all possible answers of the OpenTofu Registry API and uploads the static files to an R2 bucket. With R2, OpenTofu has been able to run the registry essentially for free, with no servers or scaling issues to worry about.</p><p><b>JuliaLang</b></p><p><a href="https://github.com/JuliaLang/julia"><u>JuliaLang</u></a> has recently joined our OSS Sponsorship Program, and we’re excited to support their critical infrastructure to ensure the smooth operation of their ecosystem. 
A key aspect of this support is enabling the use of Cloudflare’s services to help <a href="https://github.com/JuliaLang/julia"><u>JuliaLang</u></a> deliver packages to its user base.</p><p>According to <a href="https://staticfloat.github.io/"><u>Elliot Saba</u></a>, JuliaLang had been using Amazon Lightsail as a cost-effective global CDN to serve packages to their user base. However, as their user base grew, they would occasionally exceed their bandwidth limits and rack up serious cloud costs, not to mention experiencing degraded performance due to load balancer VMs getting overloaded by traffic spikes. Now JuliaLang is using Cloudflare <a href="https://developers.cloudflare.com/r2/"><u>R2</u></a>, and the speed and reliability of <a href="https://www.cloudflare.com/developer-platform/products/r2/">R2 object storage</a> have so far exceeded those of their own within-datacenter solutions. The lack of bandwidth charges means JuliaLang is now getting faster, more reliable service for less than a tenth of their previous spend.</p>
    <div>
      <h2>How can we help?</h2>
      <a href="#how-can-we-help">
        
      </a>
    </div>
    <p>If your project fits our criteria and you’re looking to reduce costs and eliminate surprise bills, we invite you to apply! We’re eager to help the next generation of open source projects make their mark on the Internet.</p><p>For more details and to apply, visit our new <a href="https://www.cloudflare.com/lp/project-alexandria/"><u>Project Alexandria page</u></a>. And if you know other projects that could benefit from this program, please spread the word!</p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Better Internet]]></category>
            <guid isPermaLink="false">5LrF3eCtonOcP2Sf5BSVpe</guid>
            <dc:creator>Veronica Marin</dc:creator>
            <dc:creator>Gabby Shires</dc:creator>
        </item>
        <item>
            <title><![CDATA[A good day to trie-hard: saving compute 1% at a time]]></title>
            <link>https://blog.cloudflare.com/pingora-saving-compute-1-percent-at-a-time/</link>
            <pubDate>Tue, 10 Sep 2024 14:00:00 GMT</pubDate>
            <description><![CDATA[ Pingora handles 35M+ requests per second, so saving a few microseconds per request can translate to thousands of dollars saved on computing costs. In this post, we share how we freed up over 500 CPU cores. ]]></description>
            <content:encoded><![CDATA[ 
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5uwVNobeSBws457ad5SoNY/080de413142fc98caffc3c0108912fe4/2442-1-hero.png" />
          </figure><p>Cloudflare’s global network handles <i>a lot</i> of HTTP requests – over 60 million per second on average. That in and of itself is not news, but it is the starting point of an adventure that began a few months ago and ends with the announcement of a new <a href="https://github.com/cloudflare/trie-hard"><u>open-source Rust crate</u></a> that we are using to reduce our CPU utilization, enabling our CDN to handle even more of the world’s ever-increasing web traffic. </p>
    <div>
      <h2>Motivation</h2>
      <a href="#motivation">
        
      </a>
    </div>
    <p>Let’s start at the beginning. You may recall a few months ago we released <a href="https://blog.cloudflare.com/pingora-open-source/"><u>Pingora</u></a> (the heart of our Rust-based proxy services) as an <a href="https://github.com/cloudflare/pingora"><u>open-source project on GitHub</u></a>. I work on the team that maintains the Pingora framework, as well as Cloudflare’s production services built upon it. One of those services is responsible for the final step in transmitting users’ (non-cached) requests to their true destination. Internally, we call the request’s destination server its “origin”, so our service has the (unimaginative) name of “pingora-origin”.</p><p>One of the many responsibilities of pingora-origin is to ensure that when a request leaves our infrastructure, it has been cleaned to remove the internal information we use to route, measure, and optimize traffic for our customers. This has to be done for every request that leaves Cloudflare, and as I mentioned above, it’s <i>a lot</i> of requests. At the time of writing, the rate of requests leaving pingora-origin (globally) is 35 million requests per second. Any code that has to be run per-request is in the hottest of hot paths, and it’s in this path that we find this code and comment:</p>
            <pre><code>// PERF: heavy function: 1.7% CPU time
pub fn clear_internal_headers(request_header: &amp;mut RequestHeader) {
    INTERNAL_HEADERS.iter().for_each(|h| {
        request_header.remove_header(h);
    });
}</code></pre>
            <p></p><p>This small and pleasantly readable function consumes more than 1.7% of pingora-origin’s total CPU time. To put that in perspective, the total CPU time consumed by pingora-origin is 40,000 compute-seconds per second. You can think of this as 40,000 saturated CPU cores fully dedicated to running pingora-origin. Of those 40,000, 1.7% (680) are dedicated solely to evaluating <code>clear_internal_headers</code>. The function’s heavy usage and simplicity make it seem like a great place to start optimizing.</p>
    <div>
      <h2>Benchmarking</h2>
      <a href="#benchmarking">
        
      </a>
    </div>
    <p>Benchmarking the function shown above is straightforward because we can use the wonderful <a href="https://crates.io/crates/criterion"><u>criterion</u></a> Rust crate. Criterion provides an API for timing Rust code down to the nanosecond by aggregating multiple isolated executions. It also provides feedback on how performance improves or regresses over time. The input for the benchmark is a large set of synthesized requests, each with a random number of headers and a uniform distribution of internal vs. non-internal headers. With our tooling and test data, we find that our original <code>clear_internal_headers</code> function runs in an average of <b>3.65µs</b>. Now, for each new method of clearing headers, we can measure against the same set of requests and get a relative performance difference. </p>
    <div>
      <h2>Reducing Reads</h2>
      <a href="#reducing-reads">
        
      </a>
    </div>
    <p>One potentially quick win is to invert how we find the headers that need to be removed from requests. If you look at the original code, you can see that we are evaluating <code>request_header.remove_header(h)</code> for each header in our list of internal headers, so 100+ times. Diagrammatically, it looks like this:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7y2qHbNfBQeoGRc8PqjBcp/9e8fccb6951a475a26def66695e47635/2442-2.png" />
          </figure><p></p><p>Since an average request has significantly fewer than 100 headers (10-30), flipping the lookup direction should reduce the number of reads while yielding the same intersection. Because we are working in Rust (and because <code>retain</code> does not exist for <code>http::HeaderMap</code> <a href="https://github.com/hyperium/http/issues/541"><u>yet</u></a>), we have to collect the identified internal headers in a separate step before removing them from the request. Conceptually, it looks like this:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6hgLavu1hZbwkw91Tee8e1/4d43b538274ae2c680236ca66791d73b/2442-3.png" />
          </figure><p></p><p>Using our benchmarking tool, we can measure the impact of this small change, and surprisingly this is already a substantial improvement. The runtime improves from <b>3.65µs</b> to <b>1.53µs</b>. That’s a 2.39x speed improvement for our function. We can calculate the theoretical CPU percentage by multiplying the starting utilization by the ratio of the new and old times: 1.71% * 1.53 / 3.65 = 0.717%. Unfortunately, if we subtract that from the original 1.71% that only equates to saving 1.71% - 0.717% = <i>0.993%</i> of the total CPU time. We should be able to do better. </p>
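    <p>As a rough, std-only model of this inversion (hypothetical header names, and a plain <code>Vec</code> of pairs standing in for Pingora’s <code>RequestHeader</code> type), iterating the request’s headers and testing membership in a set looks like this:</p>

```rust
use std::collections::HashSet;

// Hypothetical, simplified model: headers as (name, value) pairs rather than
// Pingora's RequestHeader. Each of the request's 10-30 headers is tested
// against the internal set once, instead of probing the request 100+ times.
fn clear_internal_headers(headers: &mut Vec<(String, String)>, internal_set: &HashSet<&str>) {
    // Vec has `retain`; http::HeaderMap does not, which is why the real
    // service collects the matches first and then removes them.
    headers.retain(|(name, _)| !internal_set.contains(name.as_str()));
}

fn main() {
    // Hypothetical internal header names, for illustration only.
    let internal_set: HashSet<&str> = ["x-internal-routing", "x-internal-metrics"].into();
    let mut headers = vec![
        ("host".to_string(), "example.com".to_string()),
        ("x-internal-routing".to_string(), "colo-42".to_string()),
        ("accept".to_string(), "*/*".to_string()),
    ];
    clear_internal_headers(&mut headers, &internal_set);
    assert_eq!(headers.len(), 2);
}
```

    <p>The direction of iteration is the whole trick: the loop now runs once per request header rather than once per entry in the internal list.</p>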
    <div>
      <h2>Searching Data Structures</h2>
      <a href="#searching-data-structures">
        
      </a>
    </div>
    <p>Now that we have reorganized our function to search a static set of internal headers instead of the actual request, we have the freedom to choose which data structure we store our header names in, simply by changing the type of <code>INTERNAL_HEADER_SET</code>.</p>
            <pre><code>pub fn clear_internal_headers(request_header: &amp;mut RequestHeader) {
    let to_remove = request_header
        .headers
        .keys()
        .filter_map(|name| INTERNAL_HEADER_SET.get(name))
        .collect::&lt;Vec&lt;_&gt;&gt;();

    to_remove.into_iter().for_each(|k| {
        request_header.remove_header(k);
    });
}</code></pre>
            <p></p><p>Our first attempt used <code>std::collections::HashMap</code>, but there may be other data structures that better suit our needs. All computer science students are taught at some point that hash tables are great because they have constant-time asymptotic behavior, or O(1), for reading. (If you are not familiar with <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-o-notation"><u>big O notation</u></a>, it is a way to express how an algorithm consumes a resource, in this case time, as the input size changes.) This means that no matter how large the map gets, reads always take the same amount of time. Too bad this is only partially true. In order to read from a hash table, you have to compute the hash, and computing the hash of a string requires reading every one of its bytes. So while read time for a hashmap is constant over the table’s size, it is linear over key length. Our goal, then, is to find a data structure that is better than O(L), where L is the length of the key.</p><p>There are a few common data structures whose read behavior meets our criteria. Sorted sets like <code>BTreeSet</code> use comparisons for searching, which makes them logarithmic over key length, <b>O(log(L))</b>, but they are also logarithmic over the number of entries. The net effect is that even very fast sorted sets like <a href="https://crates.io/crates/fst"><u>FST</u></a> work out to be a little (50 ns) slower in our benchmarks than the standard hashmap.</p><p>State machines like parsers and regexes are another common tool for searching for strings, though it’s hard to consider them data structures. These systems accept input one unit at a time and determine on each step whether or not to keep evaluating. Being able to make that determination at every step means state machines are very fast at identifying negative cases (i.e. when a string is not valid or not a match). This is perfect for us, because only one or two headers per request on average will be internal. In fact, an implementation of <code>clear_internal_headers</code> using regular expressions benchmarks at only about twice the runtime of the hashmap-based solution, which is impressively fast given that regexes, while powerful, aren’t known for their raw speed. This approach feels promising – we just need something in between a data structure and a state machine. </p><p>That’s where the trie comes in.</p>
    <div>
      <h2>Don’t Just Trie</h2>
      <a href="#dont-just-trie">
        
      </a>
    </div>
    <p>A <a href="https://en.wikipedia.org/wiki/Trie"><u>trie</u></a> (pronounced like “try” or “tree”) is a type of <a href="https://en.wikipedia.org/wiki/Tree_(data_structure)"><u>tree data structure</u></a> normally used for prefix searches or auto-complete systems over a known set of strings. The structure of the trie lends itself to this because each node in the trie represents a substring of characters found in the initial set. The connections between the nodes represent the characters that can follow a prefix. Here is a small example of a trie built from the words: “and”, “ant”, “dad”, “do”, &amp; “dot”. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5wy48j3XNs9awxRNvjLljC/4e2a05b4e1802eba26f9e10e95bd843f/2442-4.png" />
          </figure><p>The root node represents an empty string prefix, so the two lettered edges directed out of it are the only letters that can appear as the first letter in the list of strings, “a” and “d”. Subsequent nodes have increasingly longer prefixes until the final valid words are reached. This layout should make it easy to see how a trie could be useful for quickly identifying strings that are not contained. Even at the root node, we can eliminate any candidate string that does not start with “a” or “d”. This paring down of the search space on every step gives reading from a trie the <b>O(log(L))</b> we were looking for … but only for misses. Hits within a trie are still <b>O(L)</b>, but that’s okay, because we are getting misses over 90% of the time.</p><p>Benchmarking a few trie implementations from <a href="https://crates.io/search?q=trie"><u>crates.io</u></a> was disheartening. Remember, most tries are used in response to keyboard events, so optimizing them to run in the hot path of tens of millions of requests per second is not a priority. The fastest existing implementation we found was <a href="https://crates.io/crates/radix_trie"><u>radix_trie</u></a>, but it still clocked in at a full microsecond slower than the hashmap. The only thing left to do was write our own implementation of a trie optimized for our use case.</p>
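    <p>To make the structure concrete, here is a deliberately naive, std-only trie sketch built from the same example words. It is illustrative only (trie-hard itself stores node relationships far more compactly), but it shows why misses are cheap:</p>

```rust
use std::collections::HashMap;

// Naive illustrative trie: each node maps a byte to a child node, and
// `terminal` marks that the path from the root spells a stored word.
#[derive(Default)]
struct Node {
    children: HashMap<u8, Node>,
    terminal: bool,
}

#[derive(Default)]
struct Trie {
    root: Node,
}

impl Trie {
    fn insert(&mut self, word: &str) {
        let mut node = &mut self.root;
        for &b in word.as_bytes() {
            node = node.children.entry(b).or_default();
        }
        node.terminal = true;
    }

    // Misses are rejected at the first byte with no outgoing edge, often
    // long before the whole key is read; hits must walk every byte.
    fn contains(&self, word: &str) -> bool {
        let mut node = &self.root;
        for &b in word.as_bytes() {
            match node.children.get(&b) {
                Some(child) => node = child,
                None => return false,
            }
        }
        node.terminal
    }
}

fn main() {
    let mut trie = Trie::default();
    for word in ["and", "ant", "dad", "do", "dot"] {
        trie.insert(word);
    }
    assert!(trie.contains("dot"));
    assert!(!trie.contains("da")); // a prefix, but not a stored word
    assert!(!trie.contains("zebra")); // rejected at the root: no 'z' edge
}
```

    <p>Note the heap-allocated <code>HashMap</code> per node: chasing those pointers is exactly the overhead trie-hard avoids by packing the whole tree into contiguous memory.</p>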
    <div>
      <h2>Trie Hard</h2>
      <a href="#trie-hard">
        
      </a>
    </div>
    <p>And we did! Today we are announcing <a href="https://github.com/cloudflare/trie-hard"><u>trie-hard</u></a>. The repository gives a full description of how it works, but the big takeaway is that it gets its speed from storing node relationships in the bits of unsigned integers and keeping the entire tree in a contiguous chunk of memory. In our benchmarks, we found that trie-hard reduced the average runtime for <code>clear_internal_headers</code> to under a microsecond (0.93µs). We can reuse the same formula from above to calculate the expected CPU utilization for trie-hard: 1.71% * 0.93 / 3.65 = 0.43%. That means we have finally achieved and surpassed our goal by reducing the compute utilization of pingora-origin by 1.71% - 0.43% = <b>1.28%</b>! </p><p>Up until now, we had been working only in theory and local benchmarks. What really matters is whether our benchmarking reflects real-life behavior. Trie-hard has been running in production since July 2024, and over the course of this project we have been collecting performance metrics from production instances of pingora-origin using statistical sampling of their stack traces over time. Using this technique, the CPU utilization percentage of a function is estimated by the percentage of samples in which the function appears. 
If we compare the sampled performance of the different versions of <code>clear_internal_headers</code>, we can see that the results from the performance sampling closely match what our benchmarks predicted.</p><table><tr><th><p>Implementation</p></th><th><p>Stack trace samples containing <code>clear_internal_headers</code></p></th><th><p>Actual CPU Usage (%)</p></th><th><p>Predicted CPU Usage (%)</p></th></tr><tr><td><p>Original </p></td><td><p>19 / 1111</p></td><td><p>1.71</p></td><td><p>n/a</p></td></tr><tr><td><p>Hashmap</p></td><td><p>9 / 1103</p></td><td><p>0.82</p></td><td><p>0.72</p></td></tr><tr><td><p>trie-hard</p></td><td><p>4 / 1171</p></td><td><p>0.34</p></td><td><p>0.43</p></td></tr></table>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Optimizing functions and writing new data structures is cool, but the real conclusion for this post is that knowing where your code is slow and by how much is more important than how you go about optimizing it. Take a moment to thank your observability team (if you're lucky enough to have one), and make use of flame graphs or any other profiling and benchmarking tool you can. Optimizing operations that are already measured in microseconds may seem a little silly, but these small improvements add up.</p> ]]></content:encoded>
            <category><![CDATA[Internet Performance]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Pingora]]></category>
            <guid isPermaLink="false">2CqKLNS1jaf5H2j99sDONe</guid>
            <dc:creator>Kevin Guthrie</dc:creator>
        </item>
        <item>
            <title><![CDATA[Go wild: Wildcard support in Rules and a new open-source wildcard crate]]></title>
            <link>https://blog.cloudflare.com/wildcard-rules/</link>
            <pubDate>Thu, 22 Aug 2024 14:00:00 GMT</pubDate>
            <description><![CDATA[ We’re excited to announce wildcard support across our Ruleset Engine-based products and our open-source wildcard crate in Rust. Configuring rules has never been easier, with powerful pattern matching enabling simple and flexible URL redirects and beyond for users on all plans. ]]></description>
            <content:encoded><![CDATA[ 
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5SVubtrh9iaqDDQSnA60Me/b511db040441d802b147cb838448ab26/2478-1-hero.png" />
          </figure><p>Back in 2012, we <a href="https://blog.cloudflare.com/introducing-pagerules-fine-grained-feature-co"><u>introduced</u></a> <a href="https://developers.cloudflare.com/rules/page-rules/"><u>Page Rules</u></a>, a pioneering feature that gave Cloudflare users unprecedented control over how their web traffic was managed. At the time, this was a significant leap forward, enabling users to define <a href="https://developers.cloudflare.com/rules/page-rules/reference/wildcard-matching/"><u>patterns</u></a> for specific URLs and adjust Cloudflare <a href="https://developers.cloudflare.com/rules/page-rules/reference/settings/"><u>features</u></a> on a page-by-page basis. The ability to apply such precise configurations through a simple, user-friendly interface was a major advancement, establishing Page Rules as a cornerstone of our platform.</p><p>Page Rules allowed users to implement a variety of actions, including <a href="https://developers.cloudflare.com/rules/url-forwarding/#redirects"><u>redirects</u></a>, which automatically send visitors from one URL to another. Redirects are crucial for maintaining a seamless user experience on the Internet, whether it's guiding users <a href="https://developers.cloudflare.com/rules/url-forwarding/examples/redirect-new-url/"><u>from outdated links to new content</u></a> or managing traffic during <a href="https://developers.cloudflare.com/rules/url-forwarding/examples/redirect-all-different-hostname/"><u>site migrations</u></a>.</p><p>As the Internet has evolved, so too have the needs of our users. 
The demand for greater flexibility, higher performance, and more advanced capabilities led to the development of the <a href="https://developers.cloudflare.com/ruleset-engine/"><u>Ruleset Engine</u></a>, a powerful framework designed to handle complex rule evaluations with unmatched speed and precision.</p><p>In September 2022, we announced and released <a href="https://blog.cloudflare.com/dynamic-redirect-rules"><u>Single Redirects</u></a> as a modern replacement for the <a href="https://developers.cloudflare.com/rules/page-rules/how-to/url-forwarding/"><u>URL Forwarding</u></a> feature of Page Rules. Built on top of the Ruleset Engine, this new product offered a powerful syntax and enhanced performance.</p><p>Despite the <a href="https://blog.cloudflare.com/future-of-page-rules"><u>enhancements</u></a>, one of the most consistent pieces of feedback from our users was the need for wildcard matching and expansion, also known as <a href="https://github.com/begin/globbing"><u>globbing</u></a>. 
This feature is essential for creating dynamic and flexible URL patterns, allowing users to manage a broader range of scenarios with ease.</p><p>Today we are excited to announce that wildcard support is now available across our Ruleset Engine-based products, including <a href="https://developers.cloudflare.com/cache/how-to/cache-rules/"><u>Cache Rules</u></a>, <a href="https://developers.cloudflare.com/rules/compression-rules/"><u>Compression Rules</u></a>, <a href="https://developers.cloudflare.com/rules/configuration-rules/"><u>Configuration Rules</u></a>, <a href="https://developers.cloudflare.com/rules/custom-error-responses/"><u>Custom Errors</u></a>, <a href="https://developers.cloudflare.com/rules/origin-rules/"><u>Origin Rules</u></a>, <a href="https://developers.cloudflare.com/rules/url-forwarding/"><u>Redirect Rules</u></a>, <a href="https://developers.cloudflare.com/rules/snippets/"><u>Snippets</u></a>, <a href="https://developers.cloudflare.com/rules/transform/"><u>Transform Rules</u></a>, <a href="https://developers.cloudflare.com/waf/"><u>Web Application Firewall (WAF)</u></a>, <a href="https://developers.cloudflare.com/waiting-room/"><u>Waiting Room</u></a>, and more.</p>
    <div>
      <h3>Understanding wildcards</h3>
      <a href="#understanding-wildcards">
        
      </a>
    </div>
    <p>Wildcard pattern matching allows users to employ an asterisk (<code>*</code>) in a string as a placeholder that matches any sequence of characters, including an empty one. For example, a single pattern like <code>https://example.com/*/t*st</code> can cover multiple URLs such as <code>https://example.com/en/test</code>, <code>https://example.com/images/toast</code>, and <code>https://example.com/blog/trust</code>.</p><p>Once a segment is captured, it can be used in another expression by referencing the matched wildcard with the <code>${&lt;X&gt;}</code> syntax, where <code>&lt;X&gt;</code> is the index of the matched wildcard. This is particularly useful in URL forwarding. For instance, the URL pattern <code>https://example.com/*/t*st</code> can redirect to <code>https://${1}.example.com/t${2}st</code>, allowing dynamic and flexible URL redirection. This setup ensures that <code>https://example.com/uk/test</code> is forwarded to <code>https://uk.example.com/test</code>, <code>https://example.com/images/toast</code> to <code>https://images.example.com/toast</code>, and so on.</p>
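    <p>As a rough illustration of this capture-and-substitute behavior, here is a hypothetical, std-only sketch. It is not the Ruleset Engine implementation: it uses naive recursive backtracking, which is fine for an example but far too slow for a hot path.</p>

```rust
// Hypothetical model of wildcard capture and ${N} substitution (not the
// Ruleset Engine's implementation). Each '*' captures the bytes it matched.
// Recursive backtracking: a '*' tries to absorb ever-longer chunks of input.
fn match_captures<'a>(pat: &[u8], input: &'a [u8], caps: &mut Vec<&'a [u8]>) -> bool {
    match pat.first() {
        None => input.is_empty(),
        Some(&b'*') => {
            // Try every possible capture length for this '*', shortest first.
            for take in 0..=input.len() {
                caps.push(&input[..take]);
                if match_captures(&pat[1..], &input[take..], caps) {
                    return true;
                }
                caps.pop();
            }
            false
        }
        Some(&c) => input.first() == Some(&c) && match_captures(&pat[1..], &input[1..], caps),
    }
}

// Substitute ${1}, ${2}, ... in the target with the captured segments.
fn wildcard_replace(input: &str, pattern: &str, target: &str) -> Option<String> {
    let mut caps: Vec<&[u8]> = Vec::new();
    if !match_captures(pattern.as_bytes(), input.as_bytes(), &mut caps) {
        return None;
    }
    let mut out = target.to_string();
    for (i, cap) in caps.iter().copied().enumerate() {
        out = out.replace(&format!("${{{}}}", i + 1), std::str::from_utf8(cap).ok()?);
    }
    Some(out)
}

fn main() {
    // The example above: /uk/test is forwarded to uk.example.com/test.
    let redirected = wildcard_replace(
        "https://example.com/uk/test",
        "https://example.com/*/t*st",
        "https://${1}.example.com/t${2}st",
    );
    assert_eq!(redirected.as_deref(), Some("https://uk.example.com/test"));
}
```

    <p>Here <code>${1}</code> captures “uk” and <code>${2}</code> captures “e”, reproducing the forwarding behavior described above.</p>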
    <div>
      <h3>Challenges with Single Redirects</h3>
      <a href="#challenges-with-single-redirects">
        
      </a>
    </div>
    <p>In Page Rules, redirecting from an old URI path to a new one looked like this:</p><ul><li><p><b>Source URL:</b> <code>https://example.com/old-path/*</code></p></li><li><p><b>Target URL:</b> <code>https://example.com/new-path/$1</code></p></li></ul><p>In comparison, replicating this behavior in Single Redirects without wildcards required a more complex approach:</p><ul><li><p><b>Filter:</b> <code>(http.host eq "example.com" and starts_with(http.request.uri.path, "/old-path/"))</code></p></li><li><p><b>Expression:</b> <code>concat("/new-path/", substring(http.request.uri.path, 10))</code> (where 10 is the length of <code>/old-path/</code>)</p></li></ul><p>This complexity created unnecessary overhead and difficulty, especially for users without access to regular expressions (regex) or the technical expertise to write expressions that use nested functions.</p>
    <div>
      <h3>Wildcard support in Ruleset Engine</h3>
      <a href="#wildcard-support-in-ruleset-engine">
        
      </a>
    </div>
    <p>With the introduction of wildcard support across our Ruleset Engine-based products, users can now take advantage of the power and flexibility of the Ruleset Engine through simpler and more intuitive configurations. This enhancement ensures high performance while making it easier to create dynamic and flexible URL patterns and beyond.</p>
    <div>
      <h3>What’s new?</h3>
      <a href="#whats-new">
        
      </a>
    </div>
    
    <div>
      <h4>1) Operators "wildcard" and "strict wildcard" in Ruleset Engine:</h4>
      <a href="#1-operators-wildcard-and-strict-wildcard-in-ruleset-engine">
        
      </a>
    </div>
    <ul><li><p>"<b>wildcard</b>" (case insensitive): Matches patterns regardless of case (e.g., "test" and "TesT" are treated the same, similar to <a href="https://developers.cloudflare.com/rules/page-rules/reference/wildcard-matching/"><u>Page Rules</u></a>).</p></li><li><p>"<b>strict wildcard</b>" (case sensitive): Matches patterns exactly, respecting case differences (e.g., "test" won't match "TesT").</p></li></ul><p>Both operators <a href="https://developers.cloudflare.com/ruleset-engine/rules-language/operators/#wildcard-matching"><u>can be applied</u></a> to any string field available in the Ruleset Engine, including full URI, host, headers, cookies, user-agent, country, and more.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/46A6KAfGTCGykWIGiSLItF/c2b0743622244de48369da29fc7c4093/2478-2.png" />
          </figure><p></p><p>This example demonstrates the use of the "wildcard" operator in a <a href="https://developers.cloudflare.com/waf/"><u>Web Application Firewall (WAF)</u></a> rule applied to the User Agent field. This rule matches any incoming request where the User Agent string contains patterns starting with "Mozilla/" and includes specific elements like "Macintosh; Intel Mac OS ", "Gecko/", and "Firefox/". Importantly, the wildcard operator is case insensitive, so it captures variations like "mozilla" and "Mozilla" without requiring exact matches.</p>
    <div>
      <h4>2) Function <code>wildcard_replace()</code> in Single Redirects:</h4>
      <a href="#2-function-wildcard_replace-in-single-redirects">
        
      </a>
    </div>
    <p>In <a href="https://developers.cloudflare.com/rules/url-forwarding/single-redirects/"><u>Single Redirects</u></a>, the <code>wildcard_replace()</code> <a href="https://developers.cloudflare.com/ruleset-engine/rules-language/functions/#wildcard_replace"><u>function</u></a> allows you to use matched segments in redirect URL targets.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5s2Y9zPgK4AqzD24DNSGU1/e8c456882160ad62b339888d05545f0d/2478-3.png" />
          </figure><p></p><p>Consider the URL pattern <code>https://example.com/*/t*st</code> mentioned earlier. Using <code>wildcard_replace()</code>, you can now set the target URL to <code>https://${1}.example.com/t${2}st</code> and dynamically redirect URLs like <code>https://example.com/uk/test</code> to <code>https://uk.example.com/test</code> and <code>https://example.com/images/toast</code> to <code>https://images.example.com/toast</code>.</p>
    <div>
      <h4>3) Simplified UI in Single Redirects:</h4>
      <a href="#3-simplified-ui-in-single-redirects">
        
      </a>
    </div>
    <p>We understand that not everyone wants to use advanced Ruleset Engine <a href="https://developers.cloudflare.com/ruleset-engine/rules-language/functions/"><u>functions</u></a>, especially for simple URL patterns. That’s why we’ve introduced an easy and intuitive UI for <a href="https://developers.cloudflare.com/rules/url-forwarding/single-redirects/"><u>Single Redirects</u></a> called “wildcard pattern”. This new interface, available under the Rules &gt; Redirect Rules tab of the zone dashboard, lets you specify request and target URL wildcard patterns in seconds without needing to delve into complex functions, much like Page Rules.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1y72vKTVFZjDUglnTpC2Nl/3da615997c9e245858356e79dfbbd3ec/2478-4.png" />
          </figure><p></p>
    <div>
      <h3>How we built it</h3>
      <a href="#how-we-built-it">
        
      </a>
    </div>
    <p>The Ruleset Engine powering Cloudflare Rules products is written in <a href="https://www.rust-lang.org/"><u>Rust</u></a>. When adding wildcard support, we first explored existing <a href="https://doc.rust-lang.org/book/ch07-01-packages-and-crates.html"><u>Rust crates</u></a> for wildcard matching.</p><p>We considered using the popular <a href="https://crates.io/crates/regex"><code><u>regex</u></code></a> crate, known for its robustness. However, it requires converting wildcard patterns into regular expressions (e.g., <code>*</code> to <code>.*</code>, and <code>?</code> to <code>.</code>) and escaping other characters that are special in regex patterns, which adds complexity.</p><p>We also looked at the <a href="https://crates.io/crates/wildmatch"><code><u>wildmatch</u></code></a> crate, which is designed specifically for wildcard matching and has a couple of advantages over <code>regex</code>. The most obvious one is that there is no need to convert wildcard patterns to regular expressions. More importantly, wildmatch can handle complex patterns efficiently: wildcard matching takes quadratic time – in the worst case the time is proportional to the length of the pattern multiplied by the length of the input string. To be more specific, the time complexity is <i>O(p + ℓ + s ⋅ ℓ)</i>, where <i>p</i> is the length of the wildcard pattern, <i>ℓ</i> the length of the input string, and <i>s</i> the number of asterisk metacharacters in the pattern. (If you are not familiar with <a href="https://www.khanacademy.org/computing/computer-science/algorithms/asymptotic-notation/a/big-o-notation"><u>big O notation</u></a>, it is a way to express how an algorithm consumes a resource, in this case time, as the input size changes.) In the Ruleset Engine, we limit the number of asterisk metacharacters in the pattern to a maximum of 8. 
This ensures we will have good performance and limits the impact of a bad actor trying to consume too much CPU time by targeting extremely complicated patterns and input strings.</p><p>Unfortunately, <code>wildmatch</code> did not meet all our requirements. Ruleset Engine uses byte-oriented matching, and <code>wildmatch</code> works only on UTF-8 strings. We also have to support escape sequences –  for example, you should be able to represent a literal * in the pattern with <code>\*</code>.</p><p>Last but not least, to implement the <a href="https://developers.cloudflare.com/ruleset-engine/rules-language/functions/#wildcard_replace"><code><u>wildcard_replace() function</u></code></a> we needed not only to be able to match, but also to be able to replace parts of strings with captured segments. This is necessary to dynamically create HTTP redirects based on the source URL. For example, to redirect a request from <code>https://example.com/*/page/*</code> to <code>https://example.com/products/${1}?page=${2}</code>, you should be able to define the target URL using an expression like this:</p>
            <pre><code>wildcard_replace(
  http.request.full_uri,
  &quot;https://example.com/*/page/*&quot;,
  &quot;https://example.com/products/${1}?page=${2}&quot;
)</code></pre>
            <p></p><p>This means that in order to implement this function in the Ruleset Engine, we also need our wildcard matching implementation to capture the input substrings that match the wildcard’s metacharacters.</p><p>Given these requirements, we decided to build our own wildcard matching crate. The implementation is based on <a href="http://dodobyte.com/wildcard.html"><u>Kurt's 2016 iterative algorithm</u></a>, with optimizations from <a href="http://developforperformance.com/MatchingWildcards_AnImprovedAlgorithmForBigData.html"><u>Krauss’ 2014 algorithm</u></a>. (You can find more information about the algorithm <a href="https://github.com/cloudflare/wildcard/blob/v0.2.0/src/lib.rs#L555-L569"><u>here</u></a>). Our implementation supports byte-oriented matching, escape sequences, and capturing matched segments for further processing.</p><p>Cloudflare’s <a href="https://crates.io/crates/wildcard"><code><u>wildcard crate</u></code></a> is now available and is open-source. You can find the source repository <a href="https://github.com/cloudflare/wildcard"><u>here</u></a>. Contributions are welcome!</p>
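    <p>To give a feel for how an iterative approach works, here is a hypothetical, std-only sketch in the spirit of the algorithms cited above. It is a simplification: it handles <code>*</code> and <code>?</code>, but not the escape sequences or capture groups the real crate supports.</p>

```rust
// Iterative wildcard matching with backtracking to the most recent '*'.
// '*' matches any run of bytes (including none); '?' matches exactly one.
fn wildcard_match(pattern: &[u8], input: &[u8]) -> bool {
    let (mut p, mut i) = (0usize, 0usize);
    // Position just after the most recent '*', and the input position
    // where that '*' started matching.
    let mut star: Option<(usize, usize)> = None;
    while i < input.len() {
        if p < pattern.len() && (pattern[p] == b'?' || pattern[p] == input[i]) {
            p += 1;
            i += 1;
        } else if p < pattern.len() && pattern[p] == b'*' {
            star = Some((p + 1, i));
            p += 1;
        } else if let Some((after_star, start)) = star {
            // Backtrack: let the last '*' absorb one more input byte.
            star = Some((after_star, start + 1));
            p = after_star;
            i = start + 1;
        } else {
            return false;
        }
    }
    // Trailing '*'s can match the empty string.
    while p < pattern.len() && pattern[p] == b'*' {
        p += 1;
    }
    p == pattern.len()
}

fn main() {
    assert!(wildcard_match(b"https://example.com/*/t*st", b"https://example.com/blog/trust"));
    assert!(wildcard_match(b"t??st", b"toast"));
    assert!(!wildcard_match(b"*.jpg", b"photo.png"));
}
```

    <p>The key property is that each mismatch either advances the pattern or re-anchors at the most recent <code>*</code>, so no recursion or allocation is needed per match.</p>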
    <div>
      <h3>FAQs and resources</h3>
    </div>
    <p>For more details on using wildcards in Rules products, please refer to our updated Ruleset Engine documentation:</p><ul><li><p><a href="https://developers.cloudflare.com/ruleset-engine/rules-language/operators/#wildcard-matching"><u>Ruleset Engine Operators</u></a></p></li><li><p><a href="https://developers.cloudflare.com/ruleset-engine/rules-language/functions/#wildcard_replace"><u>Ruleset Engine Functions</u></a></p></li></ul><p>We value your feedback and invite you to share your thoughts in our <a href="https://community.cloudflare.com/t/wildcard-support-in-ruleset-engine-products/692658"><u>community forums</u></a>. Your input directly influences our product and design decisions, helping us make Rules products even better.</p><p>Additionally, check out our <a href="https://crates.io/crates/wildcard"><code><u>wildcard crate</u></code></a> implementation and <a href="https://github.com/cloudflare/wildcard"><u>contribute to its development</u></a>.</p>
    <div>
      <h3>Conclusion</h3>
    </div>
    <p>The new wildcard functionality in Rules is available on all plans and is completely free. The feature is rolling out immediately, and no beta access registration is required.</p><p>We are thrilled to offer this much-requested feature and look forward to seeing how you leverage wildcards in your Rules configurations. Try it now and experience the enhanced flexibility and performance. Your feedback is invaluable to us, so please let us know in the <a href="https://community.cloudflare.com/t/wildcard-support-in-ruleset-engine-products/692658"><u>community</u></a> how this new feature works for you!</p> ]]></content:encoded>
            <category><![CDATA[CDN]]></category>
            <category><![CDATA[Edge Rules]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">1NVmSxeyTXrlaivG80ZNzS</guid>
            <dc:creator>Nikita Cano</dc:creator>
            <dc:creator>Diogo Sousa</dc:creator>
        </item>
    </channel>
</rss>