
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Wed, 08 Apr 2026 20:39:29 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Email Routing subdomain support, new APIs and security protocols]]></title>
            <link>https://blog.cloudflare.com/email-routing-subdomains/</link>
            <pubDate>Thu, 26 Oct 2023 13:10:06 GMT</pubDate>
            <description><![CDATA[ It's been two years since we announced Email Routing, our solution to create custom email addresses for your domains and route incoming emails to your preferred mailbox. Since then, the team has worked hard to evolve the product and add more powerful features to meet our users' expectations.  ]]></description>
<content:encoded><![CDATA[ <p></p><p>It's been two years since we announced Email Routing, our solution to create custom email addresses for your domains and route incoming emails to your preferred mailbox. Since then, the team has worked hard to evolve the product and add more powerful features to meet our users' expectations. Examples include <a href="/announcing-route-to-workers/">Route to Workers</a>, which allows you to <a href="https://developers.cloudflare.com/email-routing/email-workers/">process your emails programmatically</a> using Workers scripts, <a href="/email-routing-leaves-beta/">Public APIs</a>, Audit Logs, or <a href="/dmarc-management/">DMARC Management</a>.</p><p>We also made significant progress in supporting more email security extensions and protocols, protecting our customers from unwanted traffic, and keeping the reputation of our email egress IP space impeccable to maximize deliverability rates to whichever upstream inbox provider you choose.</p><p>Since <a href="/email-routing-leaves-beta/">leaving beta</a>, Email Routing has grown into one of our most popular products; it’s used by more than one million different customer zones globally, and we forward around 20 million messages daily to every major email platform out there. Our product is mature, robust enough for general usage, and suitable for any production environment. And it keeps evolving: today, we announce three new features that will help make Email Routing more secure, flexible, and powerful than ever.</p>
    <div>
      <h2>New security protocols</h2>
      <a href="#new-security-protocols">
        
      </a>
    </div>
    <p>The SMTP email protocol has been around since the early 80s. Naturally, it wasn't designed with the best security practices and requirements in mind, at least not the ones that the Internet expects today. For that reason, several protocol revisions and extensions have been standardized and adopted by the community over the years. Cloudflare is known for being an early adopter of promising emerging technologies; Email Routing already <a href="https://developers.cloudflare.com/email-routing/postmaster/">supports</a> things like SPF, DKIM signatures, DMARC policy enforcement, TLS transport, STARTTLS, and IPv6 egress, to name a few. Today, we are introducing support for two new standards to help <a href="https://www.cloudflare.com/zero-trust/products/email-security/">increase email security</a> and improve deliverability to third-party upstream email providers.</p>
    <div>
      <h3>ARC</h3>
      <a href="#arc">
        
      </a>
    </div>
    <p><a href="https://arc-spec.org/">Authenticated Received Chain</a> (ARC) is an email authentication system designed to allow an intermediate email server (such as Email Routing) to preserve email authentication results. In other words, with ARC, we can securely preserve the results of validating sender authentication mechanisms like SPF and DKIM, which we support when the email is received, and transport that information to the upstream provider when we forward the message. ARC establishes a chain of trust with all the hops the message has passed through. So, if it was tampered with or changed in one of the hops, it is possible to see where by following that chain.</p><p>We began rolling out ARC support to Email Routing a few weeks ago. Here’s how it works:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/67xk7IFzgYjOSwQEqUSbY/d48e08b735580f20fcafca988bb43748/pasted-image-0--1--2.png" />
            
</figure><p>As you can see, <code>joe@example.com</code> sends an email to <code>henry@domain.example</code>, an Email Routing address, which in turn is forwarded to the final address, <code>example@gmail.com</code>.</p><p>Email Routing will use <code>@example.com</code>’s DMARC policy to check the SPF and DKIM alignments (SPF, DKIM, and DMARC <a href="https://www.cloudflare.com/learning/email-security/dmarc-dkim-spf/">help authenticate</a> email senders by verifying that the emails came from the domain they claim to be from). It then stores this authentication result by adding an <code>ARC-Authentication-Results</code> header to the message:</p>
            <pre><code>ARC-Authentication-Results: i=1; mx.cloudflare.net; dkim=pass header.d=cloudflare.com header.s=example09082023 header.b=IRdayjbb; dmarc=pass header.from=example.com policy.dmarc=reject; spf=none (mx.cloudflare.net: no SPF records found for postmaster@example.com) smtp.helo=smtp.example.com; spf=pass (mx.cloudflare.net: domain of joe@example.com designates 2a00:1440:4824:20::32e as permitted sender) smtp.mailfrom=joe@example.com; arc=none smtp.remote-ip=2a00:1440:4824:20::32e</code></pre>
            <p>Then we take a snapshot of all the headers and the body of the original message, and generate an <code>ARC-Message-Signature</code> header with a DKIM-like cryptographic signature (in fact, ARC uses the same DKIM keys):</p>
            <pre><code>ARC-Message-Signature: i=1; a=rsa-sha256; s=2022; d=email.cloudflare.net; c=relaxed/relaxed; h=To:Date:Subject:From:reply-to:cc:resent-date:resent-from:resent-to :resent-cc:in-reply-to:references:list-id:list-help:list-unsubscribe :list-subscribe:list-post:list-owner:list-archive; t=1697709687; bh=sN/+...aNbf==;</code></pre>
            <p>Finally, before forwarding the message to <code>example@gmail.com</code>, Email Routing generates the <code>ARC-Seal</code> header, another DKIM-like signature computed over the <code>ARC-Authentication-Results</code> and <code>ARC-Message-Signature</code> headers, and cryptographically “seals” the message:</p>
            <pre><code>ARC-Seal: i=1; a=rsa-sha256; s=2022; d=email.cloudflare.net; cv=none; b=Lx35lY6..t4g==;</code></pre>
            <p>When Gmail receives the message from Email Routing, it not only authenticates the last hop against the domain.example domain as usual (Email Routing uses <a href="https://developers.cloudflare.com/email-routing/postmaster/#sender-rewriting">SRS</a>), but also checks the ARC headers, which carry the authentication results of the original sender.</p><p>ARC increases the traceability of the message path through email intermediaries, allowing for more informed delivery decisions by those who receive emails, as well as higher deliverability rates for those who transport them, like Email Routing. It has been adopted by all the major email providers, like <a href="https://support.google.com/a/answer/175365?hl=en">Gmail</a> and Microsoft. You can read more about the ARC protocol in <a href="https://datatracker.ietf.org/doc/html/rfc8617">RFC 8617</a>.</p>
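To make the chain-of-trust idea concrete, here is a small illustrative sketch (not Cloudflare's implementation) of how a validator might group a message's ARC headers by instance number and confirm the chain is complete. A real validator would also verify each seal's cryptographic signature.

```javascript
// Hypothetical sketch: group ARC headers by instance number (i=) and
// check that every hop in the chain contributed all three header types.
// Signature verification is deliberately omitted here.
function checkArcChain(headers) {
  const sets = new Map(); // instance number -> set of ARC header names seen
  for (const [name, value] of headers) {
    const lower = name.toLowerCase();
    if (!lower.startsWith("arc-")) continue;
    const match = value.match(/(?:^|;)\s*i=(\d+)/);
    if (!match) return { valid: false, hops: 0 };
    const i = Number(match[1]);
    if (!sets.has(i)) sets.set(i, new Set());
    sets.get(i).add(lower);
  }
  const required = ["arc-authentication-results", "arc-message-signature", "arc-seal"];
  // Instances must run 1..N, each hop with all three headers present.
  for (let i = 1; i <= sets.size; i++) {
    const hop = sets.get(i);
    if (!hop || !required.every((h) => hop.has(h))) return { valid: false, hops: 0 };
  }
  return { valid: true, hops: sets.size };
}
```

A message forwarded through one intermediary would have a single complete set (i=1); each further intermediary appends a new set with the next instance number.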
    <div>
      <h3>MTA-STS</h3>
      <a href="#mta-sts">
        
      </a>
    </div>
    <p>As we said earlier, SMTP is an old protocol. Initially, email communications were sent in the clear: plain text, unencrypted. In the late 90s, the email community standardized STARTTLS, also known as Opportunistic TLS. The <a href="https://datatracker.ietf.org/doc/html/rfc3207">STARTTLS extension</a> allowed a client in an SMTP session to upgrade to TLS-encrypted communications.</p><p>While at the time this seemed like a step in the right direction, it later became clear that because a STARTTLS session begins as an unencrypted plain-text connection that can be hijacked, the protocol is <a href="https://lwn.net/Articles/866481/">susceptible to man-in-the-middle attacks</a>.</p><p>A few years ago, MTA Strict Transport Security (<a href="https://datatracker.ietf.org/doc/html/rfc8461">MTA-STS</a>) was introduced by email service providers, including Microsoft, Google, and Yahoo, as a way to protect against downgrade and man-in-the-middle attacks in SMTP sessions, and to address the lack of security-first communication standards in email.</p><p>Suppose that <code>example.com</code> uses Email Routing. Here’s how you can enable MTA-STS for it.</p><p>First, log in to the <a href="https://dash.cloudflare.com/">Cloudflare dashboard</a> and select your account and zone. Then go to <b>DNS</b> &gt; <b>Records</b> and create a new CNAME record named “<code>_mta-sts</code>” that points to Cloudflare’s record “<code>_mta-sts.mx.cloudflare.net</code>”. Make sure to disable the proxy mode.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4czTYhSi9X5kPU3TZ0m861/e7d8162ff6f40494ce6d11fbf5899dad/pasted-image-0-2.png" />
            
            </figure><p>Confirm that the record was created:</p>
            <pre><code>$ dig txt _mta-sts.example.com
_mta-sts.example.com.	300	IN	CNAME	_mta-sts.mx.cloudflare.net.
_mta-sts.mx.cloudflare.net. 300	IN	TXT	"v=STSv1; id=20230615T153000;"</code></pre>
            <p>This record tells clients connecting to us that we support MTA-STS.</p><p>Next, you need an HTTPS endpoint at <code>mta-sts.example.com</code> to serve your policy file. This file defines the mail servers in the domain that use MTA-STS. HTTPS is used here instead of DNS because not everyone uses DNSSEC yet, and we want to avoid another man-in-the-middle attack vector.</p><p>To do this, you deploy a very simple Worker that allows email clients to pull Cloudflare’s Email Routing <a href="https://mta-sts.mx.cloudflare.net/.well-known/mta-sts.txt">policy</a> file using the <a href="https://en.wikipedia.org/wiki/Well-known_URI">“well-known” URI</a> convention. Go to your <b>Account</b> &gt; <b>Workers &amp; Pages</b> and press <b>Create Application</b>. Pick the “MTA-STS” template from the list.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6BBFtG8hiHehJw74L2DbHX/d2afee1d61f266382082c08681e05e1a/pasted-image-0--2--2.png" />
            
            </figure><p>This Worker simply proxies <code>https://mta-sts.mx.cloudflare.net/.well-known/mta-sts.txt</code> to your own domain. After deploying it, go to the Worker configuration, then <b>Triggers</b> &gt; <b>Custom Domains</b> and <b>Add Custom Domain</b>.</p>
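The essence of such a proxy Worker can be sketched in a few lines; this is a simplified illustration of the idea, and the template Cloudflare ships may differ in its details:

```javascript
// Minimal sketch of an MTA-STS proxy Worker: serve Cloudflare's published
// policy file from your own mta-sts.<your-domain> custom domain.
const worker = {
  async fetch() {
    return fetch("https://mta-sts.mx.cloudflare.net/.well-known/mta-sts.txt");
  },
};

export default worker; // Workers modules use a default export
```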
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7MWHc7AuevDzxafJ0gfaFb/f659d8c0ae8c30f9a1457bc4b20f3535/customdomains.png" />
            
            </figure><p>You can then confirm that your policy file is working:</p>
            <pre><code>$ curl https://mta-sts.example.com/.well-known/mta-sts.txt
version: STSv1
mode: enforce
mx: *.mx.cloudflare.net
max_age: 86400</code></pre>
            <p>This says that we enforce MTA-STS. Capable email clients will only deliver email to this domain over a secure connection to the specified MX servers. If no secure connection can be established, the email will not be delivered.</p><p>Email Routing also supports MTA-STS upstream, which greatly improves security when forwarding your emails to service providers like <a href="https://support.google.com/a/answer/9261504?hl=en">Gmail</a> and <a href="https://learn.microsoft.com/en-us/purview/enhancing-mail-flow-with-mta-sts">Microsoft</a>.</p><p>While enabling MTA-STS involves a few steps today, we plan to simplify things and automatically configure MTA-STS for your domains from the Email Routing dashboard as a future improvement.</p>
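For illustration, here is a simplified sketch of how a sending client might parse this policy file and match an MX host against it. The function names are hypothetical and this is not a complete RFC 8461 implementation:

```javascript
// Hypothetical sketch: parse an MTA-STS policy file into key/value pairs.
// A policy may list multiple "mx" entries; everything else is single-valued.
function parsePolicy(text) {
  const policy = {};
  for (const line of text.split("\n")) {
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim();
    const value = line.slice(idx + 1).trim();
    if (key === "mx") (policy.mx ||= []).push(value);
    else policy[key] = value;
  }
  return policy;
}

// An "mx" entry like "*.mx.cloudflare.net" matches exactly one leading label.
function mxAllowed(policy, host) {
  return (policy.mx || []).some((pattern) =>
    pattern.startsWith("*.")
      ? host.endsWith(pattern.slice(1)) &&
        !host.slice(0, -pattern.slice(1).length).includes(".")
      : host === pattern
  );
}
```

Under "mode: enforce", a client that finds no allowed MX, or cannot establish TLS to one, refuses to deliver the message.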
    <div>
      <h2>Sending emails and replies from Workers</h2>
      <a href="#sending-emails-and-replies-from-workers">
        
      </a>
    </div>
    <p>Last year we announced <a href="https://developers.cloudflare.com/email-routing/email-workers/">Email Workers</a>, allowing anyone using Email Routing to associate a Worker script with an email address rule and programmatically process their incoming emails in any way they want. <a href="https://developers.cloudflare.com/workers/">Workers</a>, our serverless compute platform, provides hundreds of features and APIs, like <a href="https://developers.cloudflare.com/workers/databases/">databases</a> and <a href="https://developers.cloudflare.com/r2/api/workers/workers-api-reference/">storage</a>. Email Workers opened the door to a flood of use cases and applications that weren’t possible before, like implementing allow/block lists, advanced rules, notifications to messaging applications, honeypot aggregators, and more.</p><p>Still, you could only act on the incoming email event. You could read and process the email message, and you could even manipulate and create some headers, but you couldn’t rewrite the body of the message or create new emails from scratch.</p><p>Today we’re announcing two new powerful Email Workers APIs that further enhance what you can do with Email Routing and Workers.</p>
    <div>
      <h3>Send emails from Workers</h3>
      <a href="#send-emails-from-workers">
        
      </a>
    </div>
    <p>Now you can send an email from any Worker, from scratch, whenever you want, not just when you receive incoming messages, to any email address verified on Email Routing under your account. Here are a few practical examples where sending email from Workers to your verified addresses can be helpful:</p><ul><li><p>Daily digests with the news from your favorite publications.</p></li><li><p>Alert messages whenever the weather conditions are adverse.</p></li><li><p>Automatic notifications when systems complete tasks.</p></li><li><p>Messages composed from the inputs of an online contact form.</p></li></ul><p>Let's look at a simple example of a Worker sending an email. First, you need to create a “<code>send_email</code>” binding in your wrangler.toml configuration:</p>
            <pre><code>send_email = [
    {type = "send_email", name = "EMAIL_OUT"}
 ]</code></pre>
            <p>And then creating a new message and sending it in a Worker is as simple as:</p>
            <pre><code>import { EmailMessage } from "cloudflare:email";
import { createMimeMessage } from "mimetext";

export default {
 async fetch(request, env) {
   const msg = createMimeMessage();
   msg.setSender({ name: "Workers AI story", addr: "joe@example.com" });
   msg.setRecipient("mary@domain.example");
   msg.setSubject("An email generated in a worker");
   msg.addMessage({
       contentType: 'text/plain',
       data: `Congratulations, you just sent an email from a worker.`
   });

   const message = new EmailMessage(
     "joe@example.com",
     "mary@domain.example",
     msg.asRaw()
   );
   try {
     await env.EMAIL_OUT.send(message);
   } catch (e) {
     return new Response(e.message);
   }

   return new Response("email sent!");
 },
};</code></pre>
            <p>This example makes use of <a href="https://muratgozel.github.io/MIMEText/">mimetext</a>, an open-source raw email message generator.</p><p>Again, for security reasons, you can only send emails to the addresses for which you have confirmed ownership in Email Routing under your Cloudflare account. If you’re looking to send email campaigns or newsletters to destination addresses that you do not control, or to larger subscription groups, you should consider other options like our <a href="/sending-email-from-workers-with-mailchannels/">MailChannels integration</a>.</p><p>Since sending emails from Workers is not tied to the EmailEvent, you can send them from any type of Worker, including <a href="https://developers.cloudflare.com/workers/configuration/cron-triggers/">Cron Triggers</a> and <a href="https://developers.cloudflare.com/durable-objects/">Durable Objects</a>, whenever you want; you control all the logic.</p>
    <div>
      <h3>Reply to emails</h3>
      <a href="#reply-to-emails">
        
      </a>
    </div>
    <p>One of our most-requested features has been a way to programmatically respond to incoming emails. It has been possible to do this with Email Workers in a very limited capacity by returning a permanent SMTP error message, but this may or may not be visible to the end user, depending on the client implementation.</p>
            <pre><code>export default {
  async email(message, env, ctx) {
      // Bounce the message with a permanent SMTP error; whether the
      // sender ever sees this text depends on their email client.
      message.setReject("Address not allowed");
  }
}
</code></pre>
            <p>As of today, you can truly reply to incoming emails with a new message and implement smart auto-responders programmatically, adding any content and context to the main body of the message. Think of a customer support email automatically generating a ticket and returning the link to the sender, an out-of-office reply with instructions when you're on vacation, or a detailed explanation of why you rejected an email. Here’s a code example:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4NgbXFwy3Xw0VHLemZ4smZ/682a581c21af850880fada5bbc17e99f/Screenshot-2023-10-26-at-12.05.33.png" />
            
</figure><p>To mitigate security risks and abuse, replying to incoming emails has a few requirements:</p><ul><li><p>The incoming email must have valid DMARC.</p></li><li><p>The email can only be replied to once.</p></li><li><p>The <code>In-Reply-To</code> header of the reply message must match the <code>Message-ID</code> of the incoming message.</p></li><li><p>The recipient of the reply must match the incoming sender.</p></li><li><p>The outgoing sender domain must match the domain that received the email.</p></li></ul><p>If these and other internal conditions are not met, <code>reply()</code> will fail with an exception; otherwise, you can freely compose your reply message and send it back to the original sender.</p><p>For more information, the documentation for these APIs is available in our <a href="https://developers.cloudflare.com/email-routing/email-workers/runtime-api/">Developer Docs</a>.</p>
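The listed conditions can also be sanity-checked locally before attempting a reply. The sketch below uses hypothetical field names (none of them are part of the Email Workers API) purely to illustrate the rules; the runtime performs its own enforcement and throws when a condition is not met:

```javascript
// Hypothetical pre-flight check mirroring the reply requirements above.
// "incoming" and "reply" are plain illustrative objects, not API types.
function canReply(incoming, reply) {
  const domainOf = (addr) => addr.split("@")[1]?.toLowerCase();
  return (
    incoming.dmarcVerified === true &&                        // valid DMARC
    incoming.alreadyReplied !== true &&                       // reply only once
    reply.inReplyTo === incoming.messageId &&                 // In-Reply-To matches Message-ID
    reply.to.toLowerCase() === incoming.from.toLowerCase() && // recipient is the original sender
    domainOf(reply.from) === domainOf(incoming.to)            // same domain that received the email
  );
}
```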
    <div>
      <h2>Subdomains support</h2>
      <a href="#subdomains-support">
        
      </a>
    </div>
    <p>This is a big one.</p><p>Email Routing is a <a href="https://developers.cloudflare.com/fundamentals/concepts/accounts-and-zones/#zones">zone-level</a> feature. A zone has a <a href="https://www.cloudflare.com/learning/dns/top-level-domain/">top-level domain</a> (the same as the zone name) and can have subdomains (managed under the DNS feature). For example, I can have the <code>example.com</code> zone, and then the <code>mail.example.com</code> and <code>corp.example.com</code> subdomains under it. However, until now, Email Routing could only be used with the top-level domain of the zone, <code>example.com</code> in this example. While this is fine for the vast majority of use cases, some customers, particularly bigger organizations with complex email requirements, have asked for more flexibility.</p><p>This changes today. Now you can use Email Routing with any subdomain of any zone in your account. To make this possible, we redesigned the dashboard UI to make it easier to get started and manage all your Email Routing domains, subdomains, rules, and destination addresses in one place. Let’s see how it works.</p><p>To add Email Routing to a new subdomain, log in to the <a href="https://dash.cloudflare.com/">Cloudflare dashboard</a> and select your account and zone. Then go to <b>Email</b> &gt; <b>Email Routing</b> &gt; <b>Settings</b> and click “Add subdomain”.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1WwS0LP1o8Ijlk0IzcqzCE/8528ed0f90a34029777d66b411d9e696/prev-req-rec.png" />
            
            </figure><p>Once the subdomain is added and the DNS records are configured, you can see it in the <b>Settings</b> list under the <b>Subdomains</b> section:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7gwBTgYQ36QxcvCGHfBqEd/450707647df2a8277eb0dc66e966088e/Domain.png" />
            
            </figure><p>Now you can go to <b>Email</b> &gt; <b>Email Routing</b> &gt; <b>Routing rules</b> and create new custom addresses that will show you the option of using either the top domain of the zone or any other configured subdomain.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1KJ9AIM6MpcaYeV5IrVZQw/1e306de0bd46177eb2601e8e4e600930/Screenshot-2023-10-25-at-11.55.31-AM.png" />
            
            </figure><p>After the new custom address for the subdomain is created you can see it in the list with all the other addresses, and manage it from there.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6vEJFroWoVivSr9n6SwPVl/28a4938f201e4153c964895d4687f1b2/custom-addresses.png" />
            
            </figure><p>It’s this easy.</p>
    <div>
      <h2>Final words</h2>
      <a href="#final-words">
        
      </a>
    </div>
    <p>We hope you enjoy the new features that we are announcing today. Still, we want to be clear: there are no changes in pricing, and Email Routing is still free for Cloudflare customers.</p><p>Ever since Email Routing was launched, we’ve been listening to customers’ feedback and trying to adjust our roadmap to both our requirements and their own ideas and requests. Email shouldn't be difficult; our goal is to listen, learn and keep improving the <a href="https://www.cloudflare.com/zero-trust/solutions/email-security-services/">email security service</a> with better, more powerful features.</p><p>You can find detailed information about the new features and more in our Email Routing <a href="https://developers.cloudflare.com/email-routing">Developer Docs</a>.</p><p>If you have any questions or feedback about Email Routing, please come see us in the <a href="https://community.cloudflare.com/new-topic?category=Feedback/Previews%20%26%20Betas&amp;tags=email">Cloudflare Community</a> and the <a href="https://discord.gg/cloudflaredev">Cloudflare Discord</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1OKqc3VieWKGRFBDtPU7io/18e8d2db548d341b0cb78a111aaa8480/Email-Routing-spot.png" />
            
            </figure><p></p> ]]></content:encoded>
            <category><![CDATA[Email Routing]]></category>
            <category><![CDATA[Email Workers]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">54W5SKQEt6kELFJMaWSRyh</guid>
            <dc:creator>Celso Martinho</dc:creator>
            <dc:creator>André Cruz</dc:creator>
            <dc:creator>Nelson Duarte</dc:creator>
        </item>
        <item>
            <title><![CDATA[You can now use WebGPU in Cloudflare Workers]]></title>
            <link>https://blog.cloudflare.com/webgpu-in-workers/</link>
            <pubDate>Wed, 27 Sep 2023 13:00:56 GMT</pubDate>
            <description><![CDATA[ Today, we are introducing WebGPU support to Cloudflare Workers. This blog will explain why it's important, why we did it, how you can use it, and what comes next. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>The browser as an app platform is real and stronger every day; long gone are the Browser Wars. Vendors and standards bodies have done amazingly well over the last few years, working together and advancing web standards with new <a href="https://www.cloudflare.com/learning/security/api/what-is-an-api/">APIs</a> that allow developers to build fast and powerful applications, finally comparable to those we got used to seeing in native OS environments.</p><p>Today, browsers can render web pages and run code that interfaces with an <a href="https://developer.mozilla.org/en-US/docs/Web/API">extensive catalog of modern Web APIs</a>. Things like networking, rendering accelerated graphics, or even accessing low-level hardware features like USB devices are all now possible within the browser sandbox.</p><p>One of the most exciting new browser APIs that vendors have been rolling out over the last months is WebGPU, a modern, low-level GPU programming interface designed for high-performance 2D and 3D graphics and general-purpose GPU compute.</p><p>Today, we are introducing <a href="https://developer.chrome.com/blog/webgpu-release/">WebGPU</a> support to Cloudflare Workers. This blog will explain why it's important, why we did it, how you can use it, and what comes next.</p>
    <div>
      <h3>The history of the GPU in the browser</h3>
      <a href="#the-history-of-the-gpu-in-the-browser">
        
      </a>
    </div>
    <p>To understand why WebGPU is a big deal, we must revisit history and see how browsers went from relying only on the CPU for everything in the early days to taking advantage of GPUs over the years.</p><p>In 2011, <a href="https://en.wikipedia.org/wiki/WebGL">WebGL 1</a>, a limited port of <a href="https://www.khronos.org/opengles/">OpenGL ES 2.0</a>, was introduced, providing an API for fast, accelerated 3D graphics in the browser for the first time. By then, this was somewhat of a revolution in enabling gaming and 3D visualizations in the browser. Some of the most popular 3D animation frameworks, like <a href="https://threejs.org/">Three.js</a>, launched in the same period. Who doesn't remember going to the (now defunct) <a href="https://en.wikipedia.org/wiki/Google_Chrome_Experiments">Google Chrome Experiments</a> page and spending hours in awe exploring the demos? Another option at the time was the Flash Player, still dominant in the desktop environment, and its <a href="https://en.wikipedia.org/wiki/Stage3D">Stage 3D</a> API.</p><p>Later, in 2017, based on the learnings and shortcomings of its predecessor, WebGL 2 brought a significant upgrade and more advanced GPU capabilities like shaders and more flexible textures and rendering.</p><p>WebGL, however, has proved to have a steep and complex learning curve for developers who want to take control of things, do low-level 3D graphics using the GPU, and not rely on third-party abstraction libraries.</p><p>Furthermore, and more importantly, with the advent of <a href="https://www.cloudflare.com/learning/ai/what-is-machine-learning/">machine learning</a> and cryptography, we discovered that GPUs are great not only at drawing graphics, but also at other applications that can take advantage of high-speed data transfer and blazing-fast matrix multiplications, and can be used to perform general computation. This became known as <a href="https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units">GPGPU</a>, short for general-purpose computing on graphics processing units.</p><p>With this in mind, in the native desktop and mobile operating system worlds, developers started using more advanced frameworks like <a href="https://en.wikipedia.org/wiki/CUDA">CUDA</a>, <a href="https://developer.apple.com/metal/">Metal</a>, <a href="https://en.wikipedia.org/wiki/DirectX#DirectX_12">DirectX 12</a>, or <a href="https://www.vulkan.org/learn#key-resources">Vulkan</a>. WebGL stayed behind. To fill this void and bring the browser up to date, in 2017, companies like Google, Apple, Intel, Microsoft, Khronos, and Mozilla created the <a href="https://www.w3.org/community/gpu/">GPU for the Web Community Group</a> to collaboratively design the successor to WebGL and create the next generation of 3D graphics and computation APIs for the Web.</p>
    <div>
      <h3>What is WebGPU</h3>
      <a href="#what-is-webgpu">
        
      </a>
    </div>
    <p>WebGPU was developed with the following advantages in mind:</p><ul><li><p><b>Lower Level Access</b> - WebGPU provides lower-level, direct access to the GPU vs. the high-level abstractions in WebGL. This enables more control over GPU resources.</p></li><li><p><b>Multi-Threading</b> - WebGPU can leverage multi-threaded rendering and compute, allowing improved CPU/GPU parallelism compared to WebGL, which relies on a single thread.</p></li><li><p><b>Compute Shaders</b> - First-class support for general-purpose compute shaders for GPGPU tasks, not just graphics. WebGL compute is limited.</p></li><li><p><b>Safety</b> - WebGPU ensures memory and GPU access safety, avoiding common WebGL pitfalls.</p></li><li><p><b>Portability</b> - The WGSL shader language targets cross-API portability across GPU vendors vs. GLSL in WebGL.</p></li><li><p><b>Reduced Driver Overhead</b> - The lower-level Vulkan/Metal/D3D12 basis reduces driver overhead vs. OpenGL drivers in WebGL.</p></li><li><p><b>Pipeline State Objects</b> - Predefined pipeline configs avoid per-draw driver overhead in WebGL.</p></li><li><p><b>Memory Management</b> - Finer-grained buffer and resource management vs. WebGL.</p></li></ul><p>The “too long; didn't read” version is that WebGPU provides lower-level control over the GPU hardware with reduced overhead. It's safer, has multi-threading, is focused on compute, not just graphics, and has portability advantages compared to WebGL.</p><p>If these aren't reasons enough to get excited, developers are also looking at WebGPU as an option for native platforms, not just the Web. For instance, you can use this <a href="https://github.com/webgpu-native/webgpu-headers/blob/main/webgpu.h">C API</a> that mimics the JavaScript specification. If you combine this with the power of WebAssembly, you effectively have a truly platform-agnostic GPU hardware layer that you can use to develop applications for any operating system or browser.</p>
    <div>
      <h3>More than just graphics</h3>
      <a href="#more-than-just-graphics">
        
      </a>
    </div>
    <p>As explained above, besides being a graphics API, WebGPU makes it possible to perform tasks such as:</p><ul><li><p><b>Machine Learning</b> - Implement ML applications like <a href="https://www.cloudflare.com/learning/ai/what-is-neural-network/">neural networks</a> and computer vision algorithms using WebGPU compute shaders and matrix operations.</p></li><li><p><b>Scientific Computing</b> - Perform complex scientific computations like physics simulations and mathematical modeling using the GPU.</p></li><li><p><b>High Performance Computing</b> - Unlock breakthrough performance for parallel workloads by connecting WebGPU to languages like Rust and C/C++ via <a href="https://webassembly.org/">WebAssembly</a>.</p></li></ul><p><a href="https://gpuweb.github.io/gpuweb/wgsl/">WGSL</a>, the shader language for WebGPU, is what enables the general-purpose compute feature. Shaders, or more precisely, <a href="https://www.khronos.org/opengl/wiki/Compute_Shader">compute shaders</a>, have no user-defined inputs or outputs and are used for computing arbitrary information. Here are <a href="https://webgpufundamentals.org/webgpu/lessons/webgpu-compute-shaders.html">some examples</a> of simple WebGPU compute shaders if you want to learn more.</p>
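For a flavor of what general-purpose compute looks like, here is a minimal, hypothetical WGSL kernel, embedded as a JavaScript string the way shader source is passed to the WebGPU API, that doubles every element of a storage buffer in parallel:

```javascript
// A minimal, hypothetical WGSL compute shader. In a WebGPU application this
// string would be handed to device.createShaderModule({ code: doubleShader })
// and dispatched with enough workgroups to cover the buffer.
const doubleShader = `
@group(0) @binding(0) var<storage, read_write> data: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
  if (id.x < arrayLength(&data)) {
    data[id.x] = data[id.x] * 2.0;
  }
}`;
```

Each invocation handles one array element; the bounds check guards the last workgroup, whose invocations may run past the end of the buffer.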
    <div>
      <h3>WebGPU in Workers</h3>
      <a href="#webgpu-in-workers">
        
      </a>
    </div>
    <p>We've been watching WebGPU since the API was published. Its general-purpose compute features perfectly fit our Workers' ecosystem and capabilities and align well with our vision of providing our customers multiple compute and hardware options and bringing GPU workloads to our global network, close to clients.</p><p>Cloudflare also has a track record of pioneering support for emerging web standards on our network and services, accelerating their adoption for our customers. Examples include the <a href="https://developers.cloudflare.com/workers/runtime-apis/web-crypto/">Web Crypto API</a>, <a href="/introducing-http2/">HTTP/2</a>, <a href="/http3-the-past-present-and-future/">HTTP/3</a>, <a href="/introducing-tls-1-3/">TLS 1.3</a>, and <a href="/early-hints/">Early Hints</a>, but <a href="https://developers.cloudflare.com/workers/runtime-apis/">there are more</a>.</p><p>Bringing WebGPU to Workers was both natural and timely. Today, we are announcing that we have released a version of <a href="https://github.com/cloudflare/workerd">workerd</a>, the open-source JavaScript/Wasm runtime that powers Cloudflare Workers, with <a href="https://github.com/cloudflare/workerd/tree/main/src/workerd/api/gpu">WebGPU support</a> that you can use to start experimenting and building applications locally.</p><p>Starting today, anyone can run this on their personal computer and experiment with WebGPU-enabled workers. Implementing local development first allows us to put this API in the hands of our customers and developers earlier and get feedback that will guide the development of this feature for production use.</p><p>But before we dig into code examples, let's explain how we built it.</p>
    <div>
      <h3>How we built WebGPU on top of Workers</h3>
      <a href="#how-we-built-webgpu-on-top-of-workers">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2rem1ZyAcVa3LO7ue6OTC7/debccf54201fe93a9221a6dd01bc5338/image2-22.png" />
            
            </figure><p>To implement the WebGPU API, we took advantage of <a href="https://dawn.googlesource.com/dawn/">Dawn</a>, an open-source library backed by Google, the same implementation used in Chromium and Chrome, that provides applications with an implementation of the WebGPU standard. It also provides the <a href="https://github.com/webgpu-native/webgpu-headers/blob/main/webgpu.h">webgpu.h</a> header file, the de facto reference for all the other implementations of the standard.</p><p>Dawn can interoperate with Linux, macOS, and Windows GPUs by interfacing with each platform's native GPU frameworks. For example, when an application makes a WebGPU draw call, Dawn will convert that draw command into the equivalent Vulkan, Metal, or Direct3D 12 API call, depending on the platform.</p><p>From an application standpoint, Dawn handles the interactions with the underlying native graphics APIs that communicate directly with the GPU drivers. Dawn essentially acts as a middle layer that translates the WebGPU API calls into calls for the platform's native graphics API.</p><p>Cloudflare <a href="/workerd-open-source-workers-runtime/">workerd</a> is the underlying open-source runtime engine that executes Workers code. It shares most of its code with the same runtime that powers Cloudflare Workers' production environment but with some changes designed to make it more portable to other environments. We then have release cycles that aim to synchronize both codebases; more on that later. 
Workerd is also used with <a href="https://github.com/cloudflare/workers-sdk">wrangler</a>, our command-line tool for building and interacting with Cloudflare Workers, to support local development.</p><p>The WebGPU code that interfaces with the Dawn library can be found <a href="https://github.com/cloudflare/workerd/tree/main/src/workerd/api/gpu">here</a>, and can easily be enabled with a flag, checked <a href="https://github.com/cloudflare/workerd/blob/main/src/workerd/api/global-scope.c%2B%2B#L728">here</a>.</p>
            <pre><code>jsg::Ref&lt;api::gpu::GPU&gt; Navigator::getGPU(CompatibilityFlags::Reader flags) {
  // is this a durable object?
  KJ_IF_MAYBE (actor, IoContext::current().getActor()) {
    JSG_REQUIRE(actor-&gt;getPersistent() != nullptr, TypeError,
                "webgpu api is only available in Durable Objects (no storage)");
  } else {
    JSG_FAIL_REQUIRE(TypeError, "webgpu api is only available in Durable Objects");
  };

  JSG_REQUIRE(flags.getWebgpu(), TypeError, "webgpu needs the webgpu compatibility flag set");

  return jsg::alloc&lt;api::gpu::GPU&gt;();
}</code></pre>
            <p>The WebGPU API can only be accessed using <a href="https://developers.cloudflare.com/durable-objects/">Durable Objects</a>, which are essentially global singleton instances of Cloudflare Workers. There are two important reasons for this:</p><ul><li><p>WebGPU code typically wants to store state between requests, for example, loading an <a href="https://www.cloudflare.com/learning/ai/what-is-artificial-intelligence/">AI model</a> into the GPU memory once and using it multiple times for inference.</p></li><li><p>Not all Cloudflare servers have GPUs yet, so although the worker that receives the request is typically the closest one available, the Durable Object that uses WebGPU will be instantiated where there are GPU resources available, which may not be on the same machine.</p></li></ul><p>Using Durable Objects instead of regular Workers allows us to address both of these issues.</p>
    <div>
      <h3>The WebGPU Hello World in Workers</h3>
      <a href="#the-webgpu-hello-world-in-workers">
        
      </a>
    </div>
    <p>Wrangler uses Miniflare 3, a <a href="/wrangler3/">fully-local simulator for Workers</a>, which in turn is powered by workerd. This means you can start experimenting and writing WebGPU code locally on your machine right now, before we prepare things in our production environment.</p><p>Let’s get coding then.</p><p>Since Workers doesn't render graphics yet, we started by implementing the general-purpose GPU (GPGPU) APIs in the <a href="https://www.w3.org/TR/webgpu/">WebGPU specification</a>. In other words, we fully support the part of the API that the <a href="https://www.w3.org/TR/webgpu/#gpucomputepipeline">compute shaders and the compute pipeline</a> require, but we are not yet focused on fragment or vertex shaders used in rendering pipelines.</p><p>Here’s a typical “hello world” in WebGPU. This Durable Object script will output the name of the GPU device that workerd found in your machine to your console.</p>
            <pre><code>const adapter = await navigator.gpu.requestAdapter();
const adapterInfo = await adapter.requestAdapterInfo(["device"]);
console.log(adapterInfo.device);</code></pre>
            <p>A more interesting example, though, is a simple compute shader. In this case, we will fill a results buffer with an incrementing value taken from the iteration number via <code>global_invocation_id</code>.</p><p>For this, we need two buffers, one to store the results of the computations as they happen (<code>storageBuffer</code>) and another to copy the results at the end (<code>mappedBuffer</code>).</p><p>We then dispatch four workgroups, meaning that the increments can happen in parallel. This parallelism and programmability are two key reasons why compute shaders and GPUs provide an advantage for things like machine learning inference workloads. Other advantages are:</p><ul><li><p><b>Bandwidth</b> - GPUs have a very high memory bandwidth, up to 10-20x more than CPUs. This allows fast reading and writing of all the model parameters and data needed for inference.</p></li><li><p><b>Floating-point performance</b> - GPUs are optimized for high throughput of floating-point operations, which are used extensively in neural networks. They can deliver much higher <a href="https://www.tomshardware.com/reviews/gpu-hierarchy,4388.html">TFLOPs than CPUs</a>.</p></li></ul><p>Let’s look at the code:</p>
            <pre><code>// Create device and command encoder
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();
const encoder = device.createCommandEncoder();

// Storage buffer
const storageBuffer = device.createBuffer({
  size: 4 * Float32Array.BYTES_PER_ELEMENT, // 4 float32 values
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
});

// Mapped buffer
const mappedBuffer = device.createBuffer({
  size: 4 * Float32Array.BYTES_PER_ELEMENT,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});

// Create shader that writes incrementing numbers to storage buffer
const computeShaderCode = `
    @group(0) @binding(0)
    var&lt;storage, read_write&gt; result : array&lt;f32&gt;;

    @compute @workgroup_size(1)
    fn main(@builtin(global_invocation_id) gid : vec3&lt;u32&gt;) {
      result[gid.x] = f32(gid.x);
    }
`;

// Create compute pipeline
const computePipeline = device.createComputePipeline({
  layout: "auto",
  compute: {
    module: device.createShaderModule({ code: computeShaderCode }),
    entryPoint: "main",
  },
});

// Bind group
const bindGroup = device.createBindGroup({
  layout: computePipeline.getBindGroupLayout(0),
  entries: [{ binding: 0, resource: { buffer: storageBuffer } }],
});

// Dispatch compute work
const computePass = encoder.beginComputePass();
computePass.setPipeline(computePipeline);
computePass.setBindGroup(0, bindGroup);
computePass.dispatchWorkgroups(4);
computePass.end();

// Copy from storage to mapped buffer
encoder.copyBufferToBuffer(
  storageBuffer,
  0,
  mappedBuffer,
  0,
  4 * Float32Array.BYTES_PER_ELEMENT //mappedBuffer.size
);

// Submit and read back result
const gpuBuffer = encoder.finish();
device.queue.submit([gpuBuffer]);

await mappedBuffer.mapAsync(GPUMapMode.READ);
console.log(new Float32Array(mappedBuffer.getMappedRange()));
// [0, 1, 2, 3]</code></pre>
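            <p>For intuition, the shader above can be mirrored on the CPU in plain JavaScript. This is our own illustrative sketch, not part of the Workers API: with <code>@workgroup_size(1)</code> and <code>dispatchWorkgroups(4)</code>, <code>gid.x</code> runs from 0 to 3, and each invocation writes its own index into the result buffer.</p>

```javascript
// CPU mirror of the WebGPU compute shader above: every invocation is
// independent, so on the GPU they can all run in parallel. Here we simply
// loop over the global invocation IDs sequentially.
function simulateShader(workgroupCount, workgroupSize = 1) {
  const result = new Float32Array(workgroupCount * workgroupSize);
  for (let gidX = 0; gidX < result.length; gidX++) {
    result[gidX] = gidX; // mirrors: result[gid.x] = f32(gid.x);
  }
  return result;
}

simulateShader(4); // Float32Array [0, 1, 2, 3], matching the GPU output above
```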
            <p>Now that we covered the basics of WebGPU and compute shaders, let's move to something more demanding. What if we could perform machine learning inference using Workers and GPUs?</p>
    <div>
      <h3>ONNX WebGPU demo</h3>
      <a href="#onnx-webgpu-demo">
        
      </a>
    </div>
    <p>The <a href="https://github.com/microsoft/onnxruntime">ONNX runtime</a> is a popular open-source, cross-platform, high-performance machine learning inference accelerator. <a href="https://github.com/webonnx/wonnx">Wonnx</a> is a GPU-accelerated ONNX inference runtime written in Rust that can be compiled to WebAssembly and take advantage of WebGPU in the browser. We are going to run it in Workers using a combination of <a href="https://github.com/cloudflare/workers-rs">workers-rs</a>, our Rust bindings for Cloudflare Workers, and the workerd WebGPU APIs.</p><p>For this demo, we are using <a href="https://www.kdnuggets.com/2016/09/deep-learning-reading-group-squeezenet.html">SqueezeNet</a>. This small image classification model runs with far fewer resources but still achieves similar levels of accuracy on the <a href="https://en.wikipedia.org/wiki/ImageNet">ImageNet</a> image classification validation dataset as larger models like <a href="https://en.wikipedia.org/wiki/AlexNet">AlexNet</a>.</p><p>In essence, our worker will receive any uploaded image and attempt to classify it according to the 1000 ImageNet classes. Once ONNX runs the machine learning model using the GPU, it will return the list of classes with the highest probability scores. Let’s go step by step.</p><p>First, we load the model from R2 into the GPU memory the first time the Durable Object is called:</p>
            <pre><code>#[durable_object]
pub struct Classifier {
    env: Env,
    session: Option&lt;wonnx::Session&gt;,
}

impl Classifier {
    async fn ensure_session(&amp;mut self) -&gt; Result&lt;()&gt; {
        match self.session {
            Some(_) =&gt; worker::console_log!("DO already has a session"),
            None =&gt; {
                // No session, so this should be the first request. In this case
                // we will fetch the model from R2, build a wonnx session, and
                // store it for subsequent requests.
                let model_bytes = fetch_model(&amp;self.env).await?;
                let session = wonnx::Session::from_bytes(&amp;model_bytes)
                    .await
                    .map_err(|err| err.to_string())?;
                worker::console_log!("session created in DO");
                self.session = Some(session);
            }
        };
        Ok(())
    }
}</code></pre>
            <p>This is only required once, when the Durable Object is instantiated. For subsequent requests, we retrieve the model input tensor, call the existing session for the inference, and return to the calling worker the result tensor converted to JSON:</p>
            <pre><code>        let request_data: ArrayBase&lt;OwnedRepr&lt;f32&gt;, Dim&lt;[usize; 4]&gt;&gt; =
            serde_json::from_str(&amp;req.text().await?)?;
        let mut input_data = HashMap::new();
        input_data.insert("data".to_string(), request_data.as_slice().unwrap().into());

        let result = self
            .session
            .as_ref()
            .unwrap() // we know the session exists
            .run(&amp;input_data)
            .await
            .map_err(|err| err.to_string())?;
...
        let probabilities: Vec&lt;f32&gt; = result
            .into_iter()
            .next()
            .ok_or("did not obtain a result tensor from session")?
            .1
            .try_into()
            .map_err(|err: TensorConversionError| err.to_string())?;

        let do_response = serde_json::to_string(&amp;probabilities)?;
        Response::ok(do_response)</code></pre>
            <p>On the Worker script itself, we load the uploaded image and pre-process it into a model input tensor:</p>
            <pre><code>    let image_file: worker::File = match req.form_data().await?.get("file") {
        Some(FormEntry::File(buf)) =&gt; buf,
        Some(_) =&gt; return Response::error("`file` part of POST form must be a file", 400),
        None =&gt; return Response::error("missing `file`", 400),
    };
    let image_content = image_file.bytes().await?;
    let image = load_image(&amp;image_content)?;</code></pre>
            <p>Finally, we call the GPU Durable Object, which runs the model and returns the most likely classes of our image:</p>
            <pre><code>    let probabilities = execute_gpu_do(image, stub).await?;
    let mut probabilities = probabilities.iter().enumerate().collect::&lt;Vec&lt;_&gt;&gt;();
    probabilities.sort_unstable_by(|a, b| b.1.partial_cmp(a.1).unwrap());
    Response::ok(LABELS[probabilities[0].0])</code></pre>
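            <p>In plain JavaScript terms, the sorting step above reduces to an argmax over the probabilities followed by a label lookup. The following is an illustrative sketch of ours, with made-up labels, not code from the demo:</p>

```javascript
// Pick the index with the highest probability and map it to its label,
// which is what the Rust snippet above achieves by sorting descending
// and taking the first entry.
function topLabel(probabilities, labels) {
  let best = 0;
  for (let i = 1; i < probabilities.length; i++) {
    if (probabilities[i] > probabilities[best]) best = i;
  }
  return labels[best];
}

topLabel([0.1, 0.7, 0.2], ["cat", "pelican", "dog"]); // "pelican"
```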
            <p>We packaged this demo in a public repository, so you can also run it. Make sure that you have a <a href="https://www.rust-lang.org/">Rust</a> compiler, <a href="https://nodejs.org/en">Node.js</a>, <a href="https://git-scm.com/">Git</a> and <a href="https://curl.se/">curl</a> installed, then clone the repository:</p>
            <pre><code>git clone https://github.com/cloudflare/workers-wonnx.git
cd workers-wonnx</code></pre>
            <p>Upload the model to the local R2 simulator:</p>
            <pre><code>npx wrangler@latest r2 object put model-bucket-dev/opt-squeeze.onnx --local --file models/opt-squeeze.onnx</code></pre>
            <p>And then run the Worker locally:</p>
            <pre><code>npx wrangler@latest dev</code></pre>
            <p>With the Worker running and waiting for requests, you can then open another terminal window and upload one of the image examples in the same repository using curl:</p>
            <pre><code>&gt; curl -F "file=@images/pelican.jpeg" http://localhost:8787
n02051845 pelican</code></pre>
            <p>If everything goes according to plan, the result of the curl command will be the most likely class of the image.</p>
    <div>
      <h3>Next steps and final words</h3>
      <a href="#next-steps-and-final-words">
        
      </a>
    </div>
    <p>Over the upcoming weeks, we will merge the workerd WebGPU code into the Cloudflare Workers production environment and make it available globally, on top of our growing fleet of GPU nodes. We didn't do it earlier because that environment is subject to strict security and isolation requirements. For example, we can't break the <a href="https://developers.cloudflare.com/workers/learning/security-model/">security model</a> of our process sandbox by having V8 talk to the GPU hardware directly; instead, we must create a configuration where another process sits closer to the GPU and use IPC (inter-process communication) to talk to it. Other things like managing resource allocation and billing are being sorted out.</p><p>For now, we wanted to get the good news out that we will support WebGPU in Cloudflare Workers and ensure that you can start playing and coding with it today and learn from it. WebGPU and general-purpose computing on GPUs are still in their early days. We presented a machine-learning demo, but we can imagine other applications taking advantage of this new feature, and we hope you can show us some of them.</p><p>As usual, you can talk to us on our <a href="https://discord.cloudflare.com/">Developers Discord</a> or the <a href="https://community.cloudflare.com/c/developers/39">Community forum</a>; the team will be listening. We are eager to hear from you and learn about what you're building.</p>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Standards]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">4osLizDbNHndEk9BG23KFi</guid>
            <dc:creator>André Cruz</dc:creator>
            <dc:creator>Celso Martinho</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we built DMARC Management using Cloudflare Workers]]></title>
            <link>https://blog.cloudflare.com/how-we-built-dmarc-management/</link>
            <pubDate>Fri, 17 Mar 2023 13:00:00 GMT</pubDate>
            <description><![CDATA[ At Cloudflare, we use the Workers platform and our product stack to build new services. Read how we made the new DMARC Management solution entirely on top of our APIs.
 ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3P7EqcZydcPUKVhQNdVwkr/55e20a63d7ae1ce2ff638c2818d7da58/How-we-built-DMARC-Management.png" />
            
            </figure>
    <div>
      <h3>What are DMARC reports</h3>
      <a href="#what-are-dmarc-reports">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/en-gb/learning/dns/dns-records/dns-dmarc-record/">DMARC</a> stands for Domain-based Message Authentication, Reporting, and Conformance. It's an email authentication protocol that helps protect against email <a href="https://www.cloudflare.com/learning/access-management/phishing-attack/">phishing</a> and <a href="https://www.cloudflare.com/learning/email-security/what-is-email-spoofing/">spoofing</a>.</p><p>DMARC allows the domain owner to publish a DNS record that specifies which authentication methods, such as <a href="https://www.cloudflare.com/learning/dns/dns-records/dns-spf-record/">SPF</a> (Sender Policy Framework) and <a href="https://www.cloudflare.com/learning/dns/dns-records/dns-dkim-record/">DKIM</a> (DomainKeys Identified Mail), are used to verify an email's authenticity. When an email fails these authentication checks, DMARC instructs the recipient's email provider on how to handle the message, either by quarantining it or rejecting it outright.</p><p>DMARC has become increasingly important in today's Internet, where email phishing and spoofing attacks are becoming more sophisticated and prevalent. By implementing DMARC, domain owners can protect their brand and their customers from the negative impacts of these attacks, including loss of trust, reputation damage, and financial loss.</p><p>In addition to <a href="https://www.cloudflare.com/learning/dns/dns-records/protect-domains-without-email/">protecting</a> against phishing and spoofing attacks, DMARC also provides <a href="https://www.rfc-editor.org/rfc/rfc7489">reporting</a> capabilities. Domain owners can receive reports on email authentication activity, including which messages passed and failed DMARC checks, as well as where these messages originated from.</p><p>DMARC management involves the configuration and maintenance of DMARC policies for a domain. 
Effective DMARC management requires ongoing monitoring and analysis of email authentication activity, as well as the ability to make adjustments and updates to DMARC policies as needed.</p><p>Some key components of effective DMARC management include:</p><ul><li><p>Setting up DMARC policies: This involves configuring the domain's DMARC record to specify the appropriate authentication methods and policies for handling messages that fail authentication checks. Here’s what a DMARC DNS record looks like:</p></li></ul><p><code>v=DMARC1; p=reject; rua=mailto:dmarc@example.com</code></p><p>This specifies that we are going to use DMARC version 1, our policy is to reject emails if they fail the DMARC checks, and the email address to which providers should send DMARC reports.</p><ul><li><p>Monitoring email authentication activity: DMARC reports are an important tool for domain owners to ensure <a href="https://www.cloudflare.com/zero-trust/products/email-security/">email security</a> and deliverability, as well as compliance with industry standards and regulations. By regularly monitoring and analyzing DMARC reports, domain owners can <a href="https://www.cloudflare.com/learning/email-security/how-to-prevent-phishing/">identify email threats</a>, optimize email campaigns, and improve overall email authentication.</p></li><li><p>Making adjustments as needed: Based on analysis of DMARC reports, domain owners may need to make adjustments to DMARC policies or authentication methods to ensure that email messages are properly authenticated and protected from phishing and spoofing attacks.</p></li><li><p>Working with email providers and third-party vendors: Effective DMARC management may require collaboration with email providers and third-party vendors to ensure that DMARC policies are being properly implemented and enforced.</p></li></ul><p>Today we launched <a href="/dmarc-management">DMARC management</a>. This is how we built it.</p>
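            <p>As a side note, the tag=value syntax of the DMARC record shown above is simple enough to parse mechanically. Here is a minimal JavaScript sketch of ours (not part of the product) that splits a DMARC TXT record into its tags:</p>

```javascript
// Split a DMARC TXT record like "v=DMARC1; p=reject; rua=mailto:..." into
// an object of tag/value pairs. Values may themselves contain "=", so we
// only split on the first one.
function parseDmarcRecord(record) {
  const tags = {};
  for (const part of record.split(";")) {
    const [key, ...rest] = part.trim().split("=");
    if (key) tags[key] = rest.join("=");
  }
  return tags;
}

parseDmarcRecord("v=DMARC1; p=reject; rua=mailto:dmarc@example.com");
// { v: "DMARC1", p: "reject", rua: "mailto:dmarc@example.com" }
```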
    <div>
      <h3>How we built it</h3>
      <a href="#how-we-built-it">
        
      </a>
    </div>
    <p>As a leading provider of cloud-based security and performance solutions, we at Cloudflare take a specific approach to testing our products. We "dogfood" our own tools and services, which means we use them to run our business. This helps us identify any issues or bugs before they affect our customers.</p><p>We use our own products internally, such as <a href="https://workers.cloudflare.com/">Cloudflare Workers</a>, a serverless platform that allows developers to run their code on our global network. Since its launch in 2017, the Workers ecosystem has grown significantly. Today, there are thousands of developers building and deploying applications on the platform. The power of the Workers ecosystem lies in its ability to enable developers to build sophisticated applications that were previously impossible or impractical to run so close to clients. Workers can be used to build APIs, generate dynamic content, optimize images, perform real-time processing, and much more. The possibilities are virtually endless. We used Workers to power services like <a href="/technology-behind-radar2/">Radar 2.0</a> and software packages like <a href="/welcome-to-wildebeest-the-fediverse-on-cloudflare/">Wildebeest</a>.</p><p>Recently, our <a href="https://developers.cloudflare.com/email-routing/">Email Routing</a> product joined forces with Workers, enabling the <a href="/announcing-route-to-workers/">processing of incoming emails</a> via Workers scripts. As the <a href="https://developers.cloudflare.com/email-routing/email-workers/">documentation</a> states: “With Email Workers you can leverage the power of Cloudflare Workers to implement any logic you need to <a href="https://www.cloudflare.com/learning/email-security/what-is-email-routing/">process your emails</a> and create complex rules. 
These rules determine what happens when you receive an email.” Rules and verified addresses can all be configured via our <a href="https://developers.cloudflare.com/api/operations/email-routing-destination-addresses-list-destination-addresses">API</a>.</p><p>Here’s what a simple Email Worker looks like:</p>
            <pre><code>export default {
  async email(message, env, ctx) {
    const allowList = ["friend@example.com", "coworker@example.com"];
    if (allowList.indexOf(message.headers.get("from")) == -1) {
      message.setReject("Address not allowed");
    } else {
      await message.forward("inbox@corp");
    }
  }
}</code></pre>
            <p>Pretty straightforward, right?</p><p>With the ability to programmatically process incoming emails in place, it seemed like the perfect way to handle incoming DMARC report emails in a scalable and efficient manner, letting Email Routing and Workers do the heavy lifting of receiving an unbounded number of emails from across the globe. A high-level description of what we needed is:</p><ol><li><p>Receive email and extract report</p></li><li><p>Publish relevant details to analytics platform</p></li><li><p>Store the raw report</p></li></ol><p>Email Workers enable us to do #1 easily. We just need to create a worker with an email() handler. This handler will receive the <a href="https://www.rfc-editor.org/rfc/rfc5321">SMTP</a> envelope elements, a pre-parsed version of the email headers, and a stream to read the entire raw email.</p><p>For #2, we can look to the Workers platform itself, where we find the <a href="https://developers.cloudflare.com/analytics/analytics-engine/">Workers Analytics Engine</a>. We just need to define an appropriate schema, which depends both on what’s present in the reports and the queries we plan to do later. Afterwards, we can query the data using either the <a href="https://developers.cloudflare.com/analytics/graphql-api/">GraphQL</a> or <a href="https://developers.cloudflare.com/analytics/analytics-engine/sql-api/">SQL</a> API.</p><p>For #3, we don’t need to look further than our <a href="https://www.cloudflare.com/developer-platform/products/r2/">R2 object storage</a>. It is <a href="https://developers.cloudflare.com/r2/examples/demo-worker/">trivial</a> to access R2 from a Worker. 
After extracting the reports from the email we will store them in R2 for posterity.</p><p>We built this as a managed service that you can enable on your zone, and added a dashboard interface for convenience, but in reality all the tools are available for you to deploy your own DMARC reports processor on top of Cloudflare Workers, in your own account, without having to worry about servers, scalability or performance.</p>
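            <p>Putting the three steps together, the skeleton of such a processor might look like the following sketch. The binding names and the exact Analytics Engine data point shape are our assumptions; the bindings are passed in as plain objects so the flow is easy to exercise outside Workers:</p>

```javascript
// Hedged sketch of the three steps above. In a real Email Worker this would
// be called from the email() handler; `analytics` and `storage` stand in for
// Workers Analytics Engine and R2 bindings (names are our assumptions).
async function processReport(rua, rawReport, { analytics, storage }) {
  // 1. The report arrived as the raw email attachment (rawReport, an ArrayBuffer).
  // 2. Publish relevant details as a data point.
  analytics.writeDataPoint({
    blobs: [rua],                    // which RUA address received the report
    doubles: [rawReport.byteLength], // e.g. report size
  });
  // 3. Store the raw report for posterity, keyed by time for easy retrieval.
  await storage.put(`reports/${rua}/${Date.now()}.xml`, rawReport);
}
```

<p>In the managed service, these steps are of course more involved, with MIME parsing, decompression, and a richer analytics schema, as the walkthrough below shows.</p>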
    <div>
      <h3>Architecture</h3>
      <a href="#architecture">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3d55aU3WlGrgQcuc1TKPAF/a02a34f819174b82e768b0aed5053708/Screenshot-2023-03-16-at-4.18.08-PM.png" />
            
            </figure><p><a href="https://developers.cloudflare.com/email-routing/email-workers/">Email Workers</a> is a feature of our Email Routing product. The Email Routing component runs in all our nodes, so any one of them is able to process incoming mail, which is important because we announce the Email ingress BGP prefix from all our datacenters. Sending emails to an Email Worker is as easy as setting a rule in the Email Routing dashboard.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4tgjcPuXeJ3lJ9yLxQoPxz/7c9f34dd7ae03aeb2293e64df9ebf8e6/pasted-image-0--4--3.png" />
            
            </figure><p>When the Email Routing component receives an email that matches a rule to be delivered to a Worker, it will contact our internal version of the recently open-sourced <a href="https://github.com/cloudflare/workerd">workerd</a> runtime, which also runs on all nodes. The RPC schema that governs this interaction is defined in a <a href="https://github.com/capnproto/capnproto">Capnproto</a> schema, and allows the body of the email to be streamed to Edgeworker as it’s read. If the worker script decides to forward this email, Edgeworker will contact Email Routing using a capability sent in the original request.</p>
            <pre><code>jsg::Promise&lt;void&gt; ForwardableEmailMessage::forward(kj::String rcptTo, jsg::Optional&lt;jsg::Ref&lt;Headers&gt;&gt; maybeHeaders) {
  auto req = emailFwdr-&gt;forwardEmailRequest();
  req.setRcptTo(rcptTo);

  auto sendP = req.send().then(
      [](capnp::Response&lt;rpc::EmailMetadata::EmailFwdr::ForwardEmailResults&gt; res) mutable {
    auto result = res.getResponse().getResult();
    JSG_REQUIRE(result.isOk(), Error, result.getError());
  });
  auto&amp; context = IoContext::current();
  return context.awaitIo(kj::mv(sendP));
}
</code></pre>
            <p>In the context of DMARC reports this is how we handle the incoming emails:</p><ol><li><p>Fetch the recipient of the email being processed, this is the RUA that was used. RUA is a DMARC configuration parameter that indicates where aggregate DMARC processing feedback should be reported pertaining to a certain domain. This recipient can be found in the “to” attribute of the message.</p></li></ol>
            <pre><code>const ruaID = message.to</code></pre>
            <ol><li><p>Since we handle DMARC reports for an unbounded number of domains, we use Workers KV to store some information about each one and key this information on the RUA. This also lets us know if we should be receiving these reports.</p></li></ol>
            <pre><code>const accountInfoRaw = await env.KV_DMARC_REPORTS.get(`dmarc:${ruaID}`)</code></pre>
            <ol><li><p>At this point, we want to read the entire email into an arrayBuffer in order to parse it. Depending on the size of the report, we may run into the limits of the free Workers plan. If this happens, we recommend that you switch to the <a href="https://www.cloudflare.com/workers-unbound-beta/">Workers Unbound</a> resource model, which does not have this issue.</p></li></ol>
            <pre><code>const rawEmail = new Response(message.raw)
const arrayBuffer = await rawEmail.arrayBuffer()</code></pre>
            <ol><li><p>Parsing the raw email involves, among other things, parsing its MIME parts. There are multiple libraries available that allow one to do this. For example, you could use <a href="https://www.npmjs.com/package/postal-mime">postal-mime</a>:</p></li></ol>
            <pre><code>const parser = new PostalMime.default()
const email = await parser.parse(arrayBuffer)</code></pre>
            <ol><li><p>Having parsed the email, we now have access to its attachments. These attachments are the DMARC reports themselves, and they can be compressed. The first thing we want to do is store them in their compressed form in <a href="https://developers.cloudflare.com/r2/data-access/workers-api/workers-api-usage/">R2</a> for long-term storage. They can be useful later on for re-processing or investigating interesting reports. Doing this is as simple as calling put() on the R2 binding. In order to facilitate retrieval later, we recommend that you spread the report files across directories based on the current time.</p></li></ol>
            <pre><code>await env.R2_DMARC_REPORTS.put(
    `${date.getUTCFullYear()}/${date.getUTCMonth() + 1}/${attachment.filename}`,
    attachment.content
  )</code></pre>
            <ol start="6"><li><p>We now need to look into the attachment MIME type. The raw form of DMARC reports is XML, but they can be compressed, in which case we need to decompress them first. DMARC report files can use multiple compression algorithms; we use the MIME type to know which one to apply. For <a href="https://en.wikipedia.org/wiki/Zlib">Zlib</a>-compressed reports, <a href="https://www.npmjs.com/package/pako">pako</a> can be used, while for ZIP-compressed reports, <a href="https://www.npmjs.com/package/unzipit">unzipit</a> is a good choice.</p></li><li><p>Having obtained the raw XML form of the report, <a href="https://www.npmjs.com/package/fast-xml-parser">fast-xml-parser</a> has worked well for us in parsing them. Here’s how the DMARC report XML looks:</p></li></ol>
            <pre><code>&lt;feedback&gt;
  &lt;report_metadata&gt;
    &lt;org_name&gt;example.com&lt;/org_name&gt;
    &lt;email&gt;dmarc-reports@example.com&lt;/email&gt;
    &lt;extra_contact_info&gt;http://example.com/dmarc/support&lt;/extra_contact_info&gt;
    &lt;report_id&gt;9391651994964116463&lt;/report_id&gt;
    &lt;date_range&gt;
      &lt;begin&gt;1335521200&lt;/begin&gt;
      &lt;end&gt;1335652599&lt;/end&gt;
    &lt;/date_range&gt;
  &lt;/report_metadata&gt;
  &lt;policy_published&gt;
    &lt;domain&gt;business.example&lt;/domain&gt;
    &lt;adkim&gt;r&lt;/adkim&gt;
    &lt;aspf&gt;r&lt;/aspf&gt;
    &lt;p&gt;none&lt;/p&gt;
    &lt;sp&gt;none&lt;/sp&gt;
    &lt;pct&gt;100&lt;/pct&gt;
  &lt;/policy_published&gt;
  &lt;record&gt;
    &lt;row&gt;
      &lt;source_ip&gt;192.0.2.1&lt;/source_ip&gt;
      &lt;count&gt;2&lt;/count&gt;
      &lt;policy_evaluated&gt;
        &lt;disposition&gt;none&lt;/disposition&gt;
        &lt;dkim&gt;fail&lt;/dkim&gt;
        &lt;spf&gt;pass&lt;/spf&gt;
      &lt;/policy_evaluated&gt;
    &lt;/row&gt;
    &lt;identifiers&gt;
      &lt;header_from&gt;business.example&lt;/header_from&gt;
    &lt;/identifiers&gt;
    &lt;auth_results&gt;
      &lt;dkim&gt;
        &lt;domain&gt;business.example&lt;/domain&gt;
        &lt;result&gt;fail&lt;/result&gt;
        &lt;human_result&gt;&lt;/human_result&gt;
      &lt;/dkim&gt;
      &lt;spf&gt;
        &lt;domain&gt;business.example&lt;/domain&gt;
        &lt;result&gt;pass&lt;/result&gt;
      &lt;/spf&gt;
    &lt;/auth_results&gt;
  &lt;/record&gt;
&lt;/feedback&gt;</code></pre>
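<p>Once parsed, the interesting fields sit in a handful of well-known elements. As a dependency-free illustration of what gets extracted (in production a real XML parser such as fast-xml-parser, mentioned above, is the safer choice), a sketch might look like this:</p>

```javascript
// Dependency-free sketch: pull a few fields out of a DMARC aggregate
// report like the one above. A condensed sample of the same report:
const SAMPLE_REPORT = `<feedback><report_metadata><org_name>example.com</org_name><report_id>9391651994964116463</report_id></report_metadata><policy_published><domain>business.example</domain></policy_published><record><row><source_ip>192.0.2.1</source_ip><count>2</count><policy_evaluated><disposition>none</disposition><dkim>fail</dkim><spf>pass</spf></policy_evaluated></row></record></feedback>`;

// Returns the first child-free <tag>...</tag> occurrence's text content.
function extractTag(xml, tag) {
  const m = xml.match(new RegExp(`<${tag}>([^<]*)</${tag}>`));
  return m ? m[1] : null;
}

function summarizeReport(xml) {
  return {
    org: extractTag(xml, "org_name"),
    reportId: extractTag(xml, "report_id"),
    domain: extractTag(xml, "domain"),         // policy_published/domain
    sourceIp: extractTag(xml, "source_ip"),
    count: Number(extractTag(xml, "count")),
    disposition: extractTag(xml, "disposition"),
    dkim: extractTag(xml, "dkim"),             // policy_evaluated/dkim
    spf: extractTag(xml, "spf"),               // policy_evaluated/spf
  };
}
```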
            <ol start="8"><li><p>We now have all the data in the report at our fingertips. What we do from here on depends a lot on how we want to present the data. For us, the goal was to display meaningful data extracted from the reports in our dashboard, so we needed an analytics platform to which we could push the enriched data. Enter <a href="https://developers.cloudflare.com/analytics/analytics-engine/">Workers Analytics Engine</a>. Analytics Engine is perfect for this task since it allows us to <a href="https://developers.cloudflare.com/analytics/analytics-engine/get-started/#3-write-data-from-your-worker">send</a> data to it from a Worker, and exposes a <a href="https://developers.cloudflare.com/analytics/graphql-api/">GraphQL API</a> to interact with the data afterwards. This is how we obtain the data to show in our dashboard.</p></li></ol><p>In the future, we are also considering integrating <a href="https://developers.cloudflare.com/queues/">Queues</a> into the workflow to process reports asynchronously, so that the sending client doesn't have to wait for processing to complete.</p><p>We managed to implement this project end-to-end relying only on the Workers infrastructure, proving that it’s possible, and advantageous, to build non-trivial apps without having to worry about scalability, performance, storage, and security issues.</p>
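<p>Concretely, the write to Workers Analytics Engine is a single writeDataPoint() call, with string dimensions in blobs, numbers in doubles, and a sampling key in indexes. A sketch (the binding name and the field layout here are our own illustrative choices, not a fixed schema):</p>

```javascript
// Flatten one parsed DMARC record into the Workers Analytics Engine
// data point shape: strings go in `blobs`, numeric values in `doubles`,
// and a sampling/grouping key in `indexes`. The field order is a
// convention this sketch chooses, not something the API imposes.
function toDataPoint(rua, record) {
  return {
    indexes: [rua], // one index value, used for sampling/grouping
    blobs: [
      record.headerFrom,   // identifiers/header_from
      record.sourceIp,     // row/source_ip
      record.disposition,  // policy_evaluated/disposition
      record.dkim,         // policy_evaluated/dkim: "pass" | "fail"
      record.spf,          // policy_evaluated/spf:  "pass" | "fail"
    ],
    doubles: [record.count],
  };
}

// Inside the Worker (binding name hypothetical):
//   env.DMARC_ANALYTICS.writeDataPoint(toDataPoint(ruaID, record));
```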
    <div>
      <h3>Open sourcing</h3>
      <a href="#open-sourcing">
        
      </a>
    </div>
    <p>As we mentioned before, we built a managed service that you can enable and use, and we will manage it for you. But everything we did can also be deployed by you, in your account, so that you can manage your own DMARC reports. It’s easy and free. To help you with that, we are releasing an open-source version of a Worker that processes DMARC reports in the way described above: <a href="https://github.com/cloudflare/dmarc-email-worker">https://github.com/cloudflare/dmarc-email-worker</a></p><p>If you don’t have a dashboard in which to show the data, you can also <a href="https://developers.cloudflare.com/analytics/analytics-engine/worker-querying/">query</a> the Analytics Engine from a Worker. Or, if you want to store the reports in a relational database, there’s <a href="https://developers.cloudflare.com/d1/platform/client-api/">D1</a> to the rescue. The possibilities are endless, and we are excited to find out what you’ll build with these tools.</p><p>Please contribute, make your own, we’ll be listening.</p>
    <div>
      <h3>Final words</h3>
      <a href="#final-words">
        
      </a>
    </div>
    <p>We hope that this post has furthered your understanding of the Workers platform. Today Cloudflare takes advantage of this platform to build most of our services, and we think you should too.</p><p>Feel free to contribute to our open-source version and show us what you can do with it.</p><p>The Email Routing team is also working on expanding the Email Workers API with more functionality, but that deserves another blog post soon.</p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[Email Security]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[DMARC]]></category>
            <guid isPermaLink="false">HNhMxPSjzPXyTdtRLp51K</guid>
            <dc:creator>André Cruz</dc:creator>
            <dc:creator>Nelson Duarte</dc:creator>
        </item>
        <item>
            <title><![CDATA[Email Routing leaves Beta]]></title>
            <link>https://blog.cloudflare.com/email-routing-leaves-beta/</link>
            <pubDate>Tue, 25 Oct 2022 13:00:00 GMT</pubDate>
            <description><![CDATA[ Today Email Routing leaves Beta and an update on all the new things we've been adding to the service, including behind-the-scenes and not-so-visible improvements ]]></description>
            <content:encoded><![CDATA[ <p>Email Routing was <a href="/introducing-email-routing/">announced</a> during Birthday Week in 2021 and has been available for free to every Cloudflare customer since early this year. When we launched in beta, we set out to make a difference and provide the most <a href="/migrating-to-cloudflare-email-routing/">uncomplicated</a>, most powerful <a href="https://www.cloudflare.com/learning/email-security/what-is-email-routing/">email forwarding service</a> on the Internet for all our customers, for free.</p><p>We feel we've met and <a href="https://w3techs.com/technologies/details/em-cloudflare">surpassed</a> our goals for the first year. Cloudflare Email Routing is now one of our most popular features and one of the leading email forwarding providers. We are processing email traffic for more than 550,000 inboxes and forwarding an average of two million messages daily, and we're still growing month over month.</p><p>In February, we also announced that we were <a href="/why-we-are-acquiring-area-1/">acquiring</a> Area1. Merging their team, products, and know-how with Cloudflare was a significant step in strengthening our <a href="https://www.cloudflare.com/zero-trust/products/email-security/">Email Security</a> capabilities.</p><p>All this is good, but what about more features, you ask?</p><p>The team has been working hard to enhance Email Routing over the last few months. <b>Today Email Routing leaves beta.</b></p><p>Also, we feel that this could be a good time to give you an update on all the new things we've been adding to the service, including behind-the-scenes and not-so-visible improvements.</p><p>Let’s get started.</p>
    <div>
      <h3>Public API and Terraform</h3>
      <a href="#public-api-and-terraform">
        
      </a>
    </div>
    <p>Cloudflare has a strong API-first philosophy. All of our services expose their primitives in our vast API catalog and gateway, which we then “dogfood” extensively. For instance, our customers' configuration dashboard is built entirely on top of these APIs.</p><p>The Email Routing APIs didn't quite make it to this catalog on day one and were kept private and undocumented for a while. This summer we made those APIs <a href="https://api.cloudflare.com/#email-routing-destination-addresses-properties">available</a> on the public Cloudflare API catalog. You can programmatically use them to manage your destination addresses, rules, and other Email Routing settings. The methods' definitions and parameters are documented, and we provide <a href="https://curl.se/">curl</a> examples if you want to get your hands dirty quickly.</p><p>Even better, if you're an infrastructure-as-code type of user and use Terraform to configure your systems automatically, we have you covered too. The latest releases of <a href="https://registry.terraform.io/providers/cloudflare/cloudflare/">Cloudflare's Terraform provider</a> now <a href="https://github.com/cloudflare/terraform-provider-cloudflare/tree/master/internal/provider">incorporate</a> the Email Routing API resources, which you can use with <a href="https://www.terraform.io/language/syntax/configuration">HCL</a>.</p>
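<p>For illustration, a forwarding rule expressed in HCL might look like the following sketch. The resource and attribute names should be checked against the provider documentation, and the zone ID and addresses are placeholders:</p>

```hcl
# Sketch only: verify the schema against the Cloudflare provider docs.
resource "cloudflare_email_routing_rule" "sales" {
  zone_id = "replace-with-your-zone-id"
  name    = "forward sales mail"
  enabled = true

  matcher {
    type  = "literal"
    field = "to"
    value = "sales@example.com"
  }

  action {
    type  = "forward"
    value = ["team-inbox@example.net"]
  }
}
```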
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/oPdbQSpCrGdInWwSmc3Gz/bfa929155775e78998b86f6149b6ed9d/image4-11.png" />
            
            </figure>
    <div>
      <h3>IPv6 egress</h3>
      <a href="#ipv6-egress">
        
      </a>
    </div>
    <p>IPv6 adoption is on a <a href="https://radar.cloudflare.com/reports/ipv6">sustained growth</a> path. Our latest IPv6 adoption report shows that we're nearing the 30% penetration figure globally, with some countries, where mobile usage is prevalent, exceeding the 50% mark. Cloudflare has offered full IPv6 support <a href="/introducing-cloudflares-automatic-ipv6-gatewa/">since 2011</a> as it aligns entirely with our mission to help build a better Internet.</p><p>We are IPv6-ready across the board in our network and our products, and Email Routing has had IPv6 ingress support since day one.</p>
            <pre><code>➜  ~ dig celso.io MX +noall +answer
celso.io.		300	IN	MX	91 isaac.mx.cloudflare.net.
celso.io.		300	IN	MX	2 linda.mx.cloudflare.net.
celso.io.		300	IN	MX	2 amir.mx.cloudflare.net.
➜  ~ dig linda.mx.cloudflare.net AAAA +noall +answer
linda.mx.cloudflare.net. 300	IN	AAAA	2606:4700:f5::b
linda.mx.cloudflare.net. 300	IN	AAAA	2606:4700:f5::c
linda.mx.cloudflare.net. 300	IN	AAAA	2606:4700:f5::d</code></pre>
            <p>More recently, we closed the loop and added egress IPv6 as well. Now we also use IPv6 when sending emails to upstream servers. If the MX server to which an email is being forwarded supports IPv6, then we will try to use it. <a href="https://en.wikipedia.org/wiki/Comparison_of_webmail_providers">Gmail</a> is a good example of a high-traffic destination that has IPv6 MX records.</p>
            <pre><code>➜  ~ dig gmail.com MX +noall +answer
gmail.com.		3362	IN	MX	30 alt3.gmail-smtp-in.l.google.com.
gmail.com.		3362	IN	MX	5 gmail-smtp-in.l.google.com.
gmail.com.		3362	IN	MX	10 alt1.gmail-smtp-in.l.google.com.
gmail.com.		3362	IN	MX	20 alt2.gmail-smtp-in.l.google.com.
gmail.com.		3362	IN	MX	40 alt4.gmail-smtp-in.l.google.com.
➜  ~ dig gmail-smtp-in.l.google.com AAAA +noall +answer
gmail-smtp-in.l.google.com. 116	IN	AAAA	2a00:1450:400c:c03::1a</code></pre>
            <p>We’re happy to report that we’re now delivering most of our email to upstreams using IPv6.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/fpqLS2x7AzUJBHUfBd1Vw/65e0089ca141515c51b2ff2df5a4716e/image1-22.png" />
            
            </figure>
    <div>
      <h3>Observability</h3>
      <a href="#observability">
        
      </a>
    </div>
    <p>Email Routing is effectively another system that sits in the middle of the life of an email message. No one likes to navigate blindly, especially when using and depending on critical services like email, so it's our responsibility to provide as much observability as possible about what's going on as messages transit our network.</p><p>End-to-end email deliverability is a complex topic and often challenging to troubleshoot due to the nature of the protocol and the number of systems and hops involved. We added two widgets, Analytics and Detailed Logs, which will hopefully provide the needed <a href="/email-routing-insights/">insights</a> and help increase visibility.</p><p>The Analytics section of Email Routing shows general statistics about the number of emails received during the selected timeframe, how they were handed off to the upstream destination addresses, and a convenient time-series chart.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5idsdXSP16hDLOxayDOGqi/6b62d3ce039cdd9d94abc0e69405594f/image5-4.png" />
            
            </figure><p>On the Activity Log, you can get detailed information about what happened to each individual message that was received and then delivered to the destination. That information includes the sender and the custom address used, the timestamp, and the delivery attempt result. It also has the details of our SPF, DMARC, and DKIM validations. We also provide filters to help you find what you're looking for in case your message volume is high.</p><p>More recently, the Activity Log now also shows <a href="https://en.wikipedia.org/wiki/Bounce_message">bounces</a>. A bounce happens when the upstream SMTP server accepts the delivery, but then, for any reason (exceeded quota, virus checks, forged messages, or other issues), the recipient inbox rejects the message and returns an error to the latest <a href="https://en.wikipedia.org/wiki/Message_transfer_agent">MTA</a> in the chain, read from the <a href="https://www.rfc-editor.org/rfc/rfc5322.html#section-3.6.7">Return-Path</a> header, which in this case is us.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7wyT0gd7l6GHjl7LVFjCUR/67628c835e2f8f76f0d6a16ef99011be/image8-4.png" />
            
            </figure>
    <div>
      <h3>Audit Logs</h3>
      <a href="#audit-logs">
        
      </a>
    </div>
    <p><a href="https://developers.cloudflare.com/fundamentals/account-and-billing/account-security/review-audit-logs/">Audit Logs</a> are available on all plan types and summarize the history of events, like login and logout actions, or zone configuration changes, made within your Cloudflare account. Accounts with multiple members or companies that must comply with regulatory obligations rely on Audit logs for tracking and evidence reasons.</p><p>Email Routing now integrates with Audit Logs and records all configuration changes, like adding a new address, changing a rule, or editing the catch-all address. You can find the Audit Logs on the dashboard under "Manage Account" or use our API to download the list.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/15USXKo9itSQkX8h8JS0kE/ea244f54b31e72c73be3416ee42ace4e/image6-7.png" />
            
            </figure>
    <div>
      <h3>Anti-spam</h3>
      <a href="#anti-spam">
        
      </a>
    </div>
    <p>Unsolicited and malicious messages plague the world of email and are a big problem for end users. They affect the user experience and efficiency of email, and often carry security risks that can lead to scams, identity theft, and manipulation.</p><p>Since day one, we have supported and validated <a href="https://www.cloudflare.com/learning/dns/dns-records/dns-spf-record/">SPF</a> (Sender Policy Framework) records,  <a href="https://www.cloudflare.com/learning/dns/dns-records/dns-dkim-record/">DKIM</a> (DomainKeys Identified Mail) signatures, and <a href="https://www.cloudflare.com/learning/dns/dns-records/dns-dmarc-record/">DMARC</a> (Domain-based Message Authentication) policies in incoming messages. These steps are important and mitigate some risks associated with authenticating the origin of an email from a specific legitimate domain, but they don't solve the problem completely. You can still have bad actors generating spam or <a href="https://www.cloudflare.com/learning/email-security/how-to-identify-a-phishing-email/">phishing</a> attacks from other domains that ignore SPF or DKIM completely.</p><p>Anti-spam techniques today are often based on blocking emails whose origin (the IP address of the client trying to deliver the message) has a poor confidence score. This is commonly known in the industry as IP reputation. Other companies specialize in maintaining reputation lists for IPs and email domains, also known as <a href="https://en.wikipedia.org/wiki/Domain_Name_System-based_blocklist">RBL</a> lists, which are then shared across providers and used widely.</p><p>Simply put, an IP or a domain gets a bad reputation when it starts sending unsolicited or malicious emails. If your IP or domain has a bad reputation, you'll have a hard time delivering emails from them to any major email provider. A bad reputation goes away when the IP or domain stops misbehaving.</p><p>Cloudflare is a security company that knows a few things about IP <a href="https://developers.cloudflare.com/ruleset-engine/rules-language/fields/#field-cf-threat_score">threat scores</a> and reputation. Working with the Area1 team and learning from them, we added support to flag and block emails received from what we consider bad IPs at the SMTP level. Our approach uses a combination of heuristics and reputation databases, including some RBL lists, which we constantly update.</p><p>This measure benefits not only those customers that receive a lot of spam, who will now get another layer of <a href="https://www.cloudflare.com/learning/dns/dns-records/protect-domains-without-email/">protection</a> and filtering, but also everyone else using Email Routing. The reputation of our own IP space and forwarding domain, which we use to deliver messages to other email providers, will improve, and with it, so will our deliverability success rate.</p>
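<p>The mechanics of an RBL lookup are simple DNS: reverse the IPv4 octets, append the blocklist zone, and resolve the resulting name. A sketch, using the well-known public Spamhaus ZEN list as the example zone:</p>

```javascript
// Sketch of how a DNSBL/RBL query name is built: the IPv4 octets are
// reversed and the blocklist zone is appended. A listed IP resolves to
// a 127.0.0.x answer; an unlisted one gets NXDOMAIN.
function rblQueryName(ipv4, zone = "zen.spamhaus.org") {
  const octets = ipv4.split(".");
  if (octets.length !== 4) throw new Error(`not an IPv4 address: ${ipv4}`);
  return octets.reverse().join(".") + "." + zone;
}

// In Node, the actual check would then be something like:
//   const { resolve4 } = require("node:dns/promises");
//   await resolve4(rblQueryName(ip)); // throws ENOTFOUND when not listed
```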
    <div>
      <h3>IDN support</h3>
      <a href="#idn-support">
        
      </a>
    </div>
    <p><a href="https://datatracker.ietf.org/doc/html/rfc5891">Internationalized domain names</a>, or IDNs for short, are domains that contain at least one non-ASCII character. To accommodate backward compatibility with older Internet protocols and applications, the IETF approved the IDNA protocol (Internationalized Domain Names in Applications), which was then adopted by <a href="https://chromium.googlesource.com/chromium/src/+/main/docs/idn.md">many browsers</a>, <a href="https://www.cloudflare.com/learning/dns/glossary/what-is-a-domain-name-registrar/">top-level domain registrars</a> and other service providers.</p><p>Cloudflare was <a href="/non-latinutf8-domains-now-fully-supported/">one of the first</a> platforms to adopt IDNs back in 2012.  Supporting internationalized domain names on email, though, is challenging. Email uses DNS, SMTP, and other standards (like TLS and DKIM signatures) stacked on top of each other. IDNA conversions need to work end to end, or something will break.</p><p>Email Routing didn’t support IDNs until now. Starting today, Email Routing can be used with IDNs and everything will work end to end as expected.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2w5ochqMtILwVjTYbA0Pb/659ce2e551b0ea0e8540045dd48839e7/image3-10.png" />
            
            </figure>
    <div>
      <h3>8-bit MIME transport</h3>
      <a href="#8-bit-mime-transport">
        
      </a>
    </div>
    <p>The SMTP protocol supports extensions since the <a href="https://www.rfc-editor.org/rfc/rfc2821">RFC 2821</a> revision. When an email client connects to an SMTP server, it announces its capabilities on the EHLO command.</p>
            <pre><code>➜  ~ telnet linda.mx.cloudflare.net 25
Trying 162.159.205.24...
Connected to linda.mx.cloudflare.net.
Escape character is '^]'.
220 mx.cloudflare.net Cloudflare Email ESMTP Service ready
EHLO celso.io
250-mx.cloudflare.net greets celso.io
250-STARTTLS
250-8BITMIME
250 ENHANCEDSTATUSCODES</code></pre>
            <p>This tells the client that we support <a href="https://www.ietf.org/rfc/rfc3207.txt">Secure SMTP</a> over TLS, <a href="https://www.rfc-editor.org/rfc/rfc2034.html">Enhanced Error Codes</a>, and <a href="https://www.rfc-editor.org/rfc/rfc6152">8-bit MIME Transport</a>, our latest addition.</p><p>Most modern clients and servers support the 8BITMIME extension, making the transmission of binary files easier and more efficient, without additional conversions to and from 7-bit.</p><p>Email Routing now supports transmitting 8BITMIME SMTP messages end to end and handles DKIM signatures accordingly.</p>
    <div>
      <h3>Other fixes</h3>
      <a href="#other-fixes">
        
      </a>
    </div>
    <p>We’ve been making other smaller improvements to Email Routing too:</p><ul><li><p>We ported our SMTP server to use <a href="https://boringssl.googlesource.com/boringssl/">BoringSSL</a>, Cloudflare’s SSL/TLS <a href="/make-ssl-boring-again/">implementation of choice</a>, and now support more ciphers when clients connect to us using STARTTLS and when we connect to upstream servers.</p></li><li><p>We made a number of improvements to how we add our own <a href="https://datatracker.ietf.org/doc/html/rfc6376">DKIM signatures</a> to messages. We keep our <a href="https://www.rust-lang.org/">Rust</a> DKIM <a href="https://github.com/cloudflare/dkim">implementation</a> open source on GitHub, and we also <a href="https://github.com/lettre/lettre/commits/master">contribute</a> to <a href="https://github.com/lettre/lettre">Lettre</a>, a Rust mailer library that we use.</p></li><li><p>When a destination address domain has multiple MX records, we now try them all in their preference value order, as described in the <a href="https://datatracker.ietf.org/doc/html/rfc974">RFC</a>, until we get a good delivery, or we fail.</p></li></ul>
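<p>That last item, trying MX hosts in preference order until one delivery succeeds, can be sketched as follows (the deliver callback stands in for one SMTP delivery attempt and is an assumption of this sketch, not a real API):</p>

```javascript
// Sketch of "try MX records in preference value order": sort ascending
// by preference (lower = more preferred), then attempt each host in
// turn, stopping at the first successful delivery.
async function forwardViaMx(mxRecords, deliver) {
  const ordered = [...mxRecords].sort((a, b) => a.preference - b.preference);
  for (const mx of ordered) {
    try {
      return await deliver(mx.exchange); // first good delivery wins
    } catch (err) {
      // this host failed; fall through to the next preference
    }
  }
  throw new Error("all MX hosts failed");
}
```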
    <div>
      <h3>Route to Workers update</h3>
      <a href="#route-to-workers-update">
        
      </a>
    </div>
    <p>We announced <a href="/announcing-route-to-workers/">Route to Workers</a> in May this year. Route to Workers enables everyone to programmatically process their emails and use them as triggers for any other action. In other words, you can choose to process any incoming email with a Cloudflare Worker script and then implement any logic you wish before you deliver it to a destination address or drop it. Think about it as programmable email.</p><p>The good news, though, is that we're near completing the project. The APIs, the dashboard configuration screens, the SMTP service, and the necessary <a href="https://github.com/cloudflare/workerd/blob/main/src/workerd/io/worker-interface.capnp">Cap'n Proto interface</a> to Workers are mostly complete, and "all" we have left now is adding the Email Workers primitives to the runtime and testing the hell out of everything before we ship.</p><p>Thousands of users are waiting for Email Workers to start creating advanced email processing workflows, and we're excited about the possibilities this will open. We promise we're working hard to open the public beta as soon as possible.</p>
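<p>To give a flavour of what this enables, here is a sketch of the kind of routing logic an Email Worker could run. The decision function is plain JavaScript, while the commented handler shows how it would plug into a Worker; the addresses are placeholders, and message.forward() is the forwarding method described in the Route to Workers announcement:</p>

```javascript
// A sketch of programmable email routing. The decision function is
// ordinary, testable JavaScript; the handler below it is illustrative.
function chooseDestination(to) {
  if (to.startsWith("billing@")) return "finance-team@example.net";
  if (to.startsWith("support@")) return "helpdesk@example.net";
  return null; // no rule matched
}

// In the Worker itself (module syntax):
//
// export default {
//   async email(message, env, ctx) {
//     const dest = chooseDestination(message.to);
//     if (dest) {
//       await message.forward(dest); // deliver to the chosen inbox
//     }
//     // with no forward() call, the message is not delivered
//   },
// };
```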
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/xNqMMpydzI8i8kWisriPT/d01bb6f42e9fe4bad92e8fec3796f6b4/image7-4.png" />
            
            </figure>
    <div>
      <h3>What’s next?</h3>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>We keep looking at ways to improve email and will add more features and support for emerging protocols and extensions. Two examples are <a href="https://en.wikipedia.org/wiki/Authenticated_Received_Chain">ARC</a> (Authenticated Received Chain), a new signature-based authentication system designed with email forwarding services in mind, and <a href="https://datatracker.ietf.org/doc/html/rfc4952">EAI</a> (Email Address Internationalization), which we will be supporting soon.</p><p>In the meantime, you can start using Email Routing with your own domain if you haven't yet; it only <a href="/migrating-to-cloudflare-email-routing/">takes a few minutes</a> to set up, and it's free. Our <a href="https://developers.cloudflare.com/email-routing/">Developers Documentation page</a> has details on how to get started, troubleshooting, and technical information.</p><p>Ping us on our <a href="https://discord.com/invite/cloudflaredev">Discord server</a>, <a href="https://community.cloudflare.com/new-topic?category=Feedback/Previews%20%26%20Betas&amp;tags=email">community forum</a>, or <a href="https://twitter.com/cloudflare">Twitter</a> if you have suggestions or questions; the team is listening.</p>
            <category><![CDATA[Email]]></category>
            <category><![CDATA[Email Routing]]></category>
            <category><![CDATA[Email Workers]]></category>
            <guid isPermaLink="false">eSf4sLZdb5Gb9Y7mVbjOl</guid>
            <dc:creator>Celso Martinho</dc:creator>
            <dc:creator>André Cruz</dc:creator>
            <dc:creator>Nelson Duarte</dc:creator>
        </item>
        <item>
            <title><![CDATA[The benefits of serving stale DNS entries when using Consul]]></title>
            <link>https://blog.cloudflare.com/the-benefits-of-serving-stale-dns-entries-when-using-consul/</link>
            <pubDate>Mon, 08 Mar 2021 14:00:00 GMT</pubDate>
            <description><![CDATA[ We use Consul for service discovery, and we’ve deployed a cluster that spans several of our data centers. We were aware from the start that the DNS query latencies were not great from certain parts of the world that were furthest away from these data centers. ]]></description>
            <content:encoded><![CDATA[ 
    <div>
      <h2>Introduction</h2>
      <a href="#introduction">
        
      </a>
    </div>
    <p>We use <a href="https://www.consul.io/">Consul</a> for service discovery, and we’ve deployed a cluster that spans several of our data centers. This cluster exposes HTTP and DNS interfaces so that clients can query the Consul catalog and search for a particular service; the majority of clients use DNS. We were aware from the start that the DNS query latencies were not great from certain parts of the world that were furthest away from these data centers. This, together with the fact that we use <a href="https://en.wikipedia.org/wiki/DNS_over_TLS">DNS over TLS</a>, results in some long latencies. The low TTL of these names makes resolving them in the hot path even more impractical.</p><p>The usual way to solve these issues is by caching values so that at least subsequent requests are resolved quickly, and this is exactly what our resolver of choice, <a href="https://www.nlnetlabs.nl/projects/unbound/about/">Unbound</a>, is configured to do. The problem remains when the cache expires. When it expires, the next client will have to wait while Unbound resolves the name using the network. To have a low recovery time in case some service needs to fail over and clients need to use another address, we use a small TTL (30 seconds) in our records, and as such cache expirations happen often.</p><p>There are at least two strategies to deal with this: prefetching and stale cache.</p>
    <div>
      <h2>Prefetching</h2>
      <a href="#prefetching">
        
      </a>
    </div>
    <p>When prefetching is turned on, the server tries to refresh DNS records in the background before they expire. In practice, the way this works is: if an entry is served from cache and its remaining TTL is less than 10% of the record's lifetime, the server responds to the client, but in the background, it dispatches a job to refresh the record across the network. If it all goes as planned, the entry is refreshed in the cache before the TTL expires, and no client is ever left waiting for the network resolution.</p><p>This works very well for records that are accessed frequently, but poses issues in two situations: records that are accessed infrequently and records that have a small TTL. In both cases, it is unlikely that a prefetch would be triggered, and so the cache expires and the next client will have to wait for the network resolution.</p>
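<p>The trigger described above reduces to a simple rule, sketched here (the 10% threshold is the figure from the paragraph; Unbound's internal bookkeeping is of course more involved):</p>

```javascript
// Sketch of the prefetch trigger: when a cache hit is served with less
// than 10% of the record's original TTL remaining, kick off a background
// refresh so the next client never has to wait on the network.
const PREFETCH_FRACTION = 0.1;

function shouldPrefetch(remainingTtl, originalTtl) {
  return remainingTtl > 0 && remainingTtl < originalTtl * PREFETCH_FRACTION;
}

// For a 30-second record, only hits in the final 3 seconds trigger a
// prefetch -- which is why infrequently queried, short-TTL names tend
// to expire before any request lands in that window.
```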
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6NcCnqSXw360PiHjNzeq2y/4e2305dada9a66609edb4fc4264d20e5/image1-6.png" />
            
            </figure><p>In our case, the records have a 30 second TTL, so frequently no requests come in during the yellow "prefetch" stage, which bumps the records out of the cache and causes the next client to have to wait for the slow network resolution. Prefetching was already turned on in our Unbound configuration but this still was not enough for our use case.</p>
    <div>
      <h2>Stale cache</h2>
      <a href="#stale-cache">
        
      </a>
    </div>
    <p>Another option is to serve stale records from the cache while we dispatch a job to refresh the record in the background, without keeping the client waiting for the response. This would provide benefits in two different scenarios:</p><ul><li><p>General high latency - even after the TTL expires, subsequent requests will be answered from cache if the upstream response takes too long to arrive.</p></li><li><p>Consul/network blips - if for some reason Consul is unavailable/unreachable, there is still a period of time during which expired responses can be used instead of failing resolution.</p></li></ul><p>The first case is exactly our current scenario.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ixBixKeeIwU63f485mr2U/18844f87438211049fcfcf923e12c231/image7.png" />
            
            </figure><p>Being able to serve stale entries from the cache for a period after the TTL expires affords us some time in which we can return a result to a client either immediately or after some small amount of time has passed. The latter is the behaviour we are looking for, which also makes us compliant with <a href="https://www.rfc-editor.org/rfc/rfc8767">RFC 8767</a>, “Serving Stale Data to Improve DNS Resiliency”, published March 2020.</p>
    <div>
      <h3>Previous dealings with Unbound's stale cache</h3>
      <a href="#previous-dealings-with-unbounds-stale-cache">
        
      </a>
    </div>
    <p>Stale cache had been tried before at Cloudflare in 2017, and at that time, we encountered two problems:</p><p><b>1. Unbound could serve stale records from cache indefinitely</b> We could solve this through configuration that did not exist in 2017, but we had to test it.</p><p><b>2. Unbound serves negative/incomplete information from the expired cache</b> It was reported that Unbound would serve incomplete chains of responses to clients (for example, a CNAME record with no corresponding A/AAAA record).</p><p>This turned out to be unconfirmed, but it is a failure mode we nevertheless wanted to make sure Unbound could handle, since parts of the response chain could be purged from the stale cache before others.</p>
    <div>
      <h3>Test Configuration</h3>
      <a href="#test-configuration">
        
      </a>
    </div>
    <p>As mentioned before, we would need to test a current version of Unbound, come up with an up-to-date stale cache configuration, and test the scenarios that posed problems when we last tried to use it, plus any others we came up with. The Unbound version used for tests was 1.13.0, and the stale cache was configured thus:</p><ul><li><p><i>serve-expired: yes</i> - enable the stale cache.</p></li><li><p><i>serve-expired-ttl: 3600</i> - limit serving of expired records to one hour after expiry when no response from upstream has been obtained.</p></li><li><p><i>serve-expired-client-timeout: 500</i> - serve from stale cache only when upstream takes longer than 500ms to respond.</p></li></ul><p>The option <i>serve-expired-ttl-reset</i> was considered, which would cause records never to expire as long as there are regular requests for them, even with the upstream down indefinitely; it was deemed too dangerous since it could potentially last forever and it applies to all entries.</p><p>Example consul zone: <b>service1.sd</b>.</p><p>While resolving this name we end up with:</p>
            <pre><code>;; ANSWER SECTION:
service1.sd. 120 IN CNAME service1.query.consul.
service1.query.consul. 30 IN A 192.0.2.1
 
;; Query time: 230 msec</code></pre>
            <p>If we dump Unbound's cache (unbound-control dump_cache), we can find the entries there:</p>
            <pre><code>service1.query.consul. 23  IN  A   192.0.2.1
service1.sd.   113 IN  CNAME   service1.query.consul.</code></pre>
            <p>If we wait for the TTL of these entries to expire and dump the cache again, we notice that they don't show up anymore:</p>
            <pre><code>➜ unbound-control dump_cache | grep -e query.consul -e service1
➜</code></pre>
            <p>But if we block access to the Consul resolver, we can see that the entries are still returned:</p>
            <pre><code>;; ANSWER SECTION:
service1.sd. 119 IN CNAME service1.query.consul.
service1.query.consul. 30 IN A 192.0.2.1
 
;; Query time: 503 msec</code></pre>
            <p>Notice that Unbound waited 500 ms before returning a response, which is the value we configured for <i>serve-expired-client-timeout</i>. This also shows that the cache dump does not include expired entries even though they are still present. So far so good.</p>
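            <p>Put together, the options above correspond to a small <i>unbound.conf</i> fragment along these lines. This is a sketch of the test configuration described in this post, not our exact production file:</p>

```
server:
    # Enable the stale cache.
    serve-expired: yes
    # Serve an expired record for at most one hour past its TTL.
    serve-expired-ttl: 3600
    # Wait up to 500 ms for upstream before falling back to stale data.
    serve-expired-client-timeout: 500
    # Left at its default of "no": resetting the timer on every cache
    # hit could keep records alive forever.
    # serve-expired-ttl-reset: no
```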
    <div>
      <h5><b>Test - Unbound could serve stale records from cache indefinitely</b></h5>
      <a href="#test-unbound-could-serve-stale-records-from-cache-indefinitely">
        
      </a>
    </div>
    <p>After 3600 seconds pass, we again try to resolve the same name:</p>
            <pre><code>➜ dig  service1.sd. @127.0.0.1
 
; &lt;&lt;&gt;&gt; DiG 9.10.6 &lt;&lt;&gt;&gt; service1.sd. @127.0.0.1
;; global options: +cmd
;; connection timed out; no servers could be reached</code></pre>
            <p>Dig gave up on waiting for a response from Unbound. So we see that after the configured time, Unbound will indeed stop serving the stale record, even if it has been requested recently. This is due to <i>serve-expired-ttl: 3600</i> and <i>serve-expired-ttl-reset</i> being set to "no" by default. These options were not available when we first tried Unbound’s stale cache.</p>
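            <p>The cutoff rules can be made concrete with a small model. The Python sketch below is not Unbound code, just our reading of how <i>serve-expired-ttl</i> governs whether a cached record is answered at all while the upstream stays unreachable:</p>

```python
SERVE_EXPIRED_TTL = 3600  # seconds a record may be served past its TTL


def cache_decision(now, inserted_at, ttl, upstream_reachable):
    """Model of the configured serve-expired rules (not Unbound code).

    Returns "fresh", "stale" or "fail" for a record that entered the
    cache at `inserted_at` with the given `ttl`.
    """
    age = now - inserted_at
    if age <= ttl:
        return "fresh"  # TTL has not expired yet
    if upstream_reachable:
        return "fresh"  # Unbound re-resolves and refreshes the cache
    if age <= ttl + SERVE_EXPIRED_TTL:
        return "stale"  # within the serve-expired-ttl window
    return "fail"       # window exhausted: no answer, dig times out


# The A record from the example: TTL 30 s, Consul resolver blocked.
assert cache_decision(10, 0, 30, False) == "fresh"
assert cache_decision(120, 0, 30, False) == "stale"
assert cache_decision(30 + 3601, 0, 30, False) == "fail"
```

The third case mirrors the dig timeout: once the TTL plus 3600 seconds have passed with the upstream still down, no answer is given.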
    <div>
      <h5><b>Test - Unbound serves negative/incomplete information from the expired cache</b></h5>
      <a href="#test-unbound-serves-negative-incomplete-information-from-the-expired-cache">
        
      </a>
    </div>
    <p>Since the record we are resolving includes a CNAME and an A record with different TTLs, we will try to see if Unbound can be made to return an incomplete chain (a CNAME with no A record).</p><p>We start with both entries in the cache and block connections to the Consul resolver. We then remove the A entry from the cache:</p>
            <pre><code>$ /usr/sbin/unbound-control flush_type service1.query.consul. A
ok</code></pre>
            <p>Now we attempt to query:</p>
            <pre><code>$ dig  service1.sd. @127.0.0.1
...
;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: SERVFAIL, id: 9532
;; Query time: 22 msec</code></pre>
            <p>Nothing was returned, which is the result we were expecting.</p><p>Another scenario we wanted to test was whether, once <i>serve-expired-ttl</i> plus the A record's TTL had passed, the CNAME would be returned on its own.</p>
            <pre><code>➜ dig  service1.sd. @127.0.0.1
;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: NOERROR, id: 48918
;; ANSWER SECTION:
service1.sd. 48 IN CNAME service1.query.consul.
service1.query.consul. 30 IN A 192.0.2.1
;; Query time: 430 msec
;; WHEN: Thu Dec 31 17:07:55 WET 2020
 
➜ dig  service1.sd. @127.0.0.1
;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: SERVFAIL, id: 42432
;; Query time: 509 msec
;; WHEN: Thu Dec 31 17:07:56 WET 2020</code></pre>
            <p>We lowered <i>serve-expired-ttl</i> to 40 seconds to make this easier to test, and we saw that as soon as the A entry was deemed no longer serveable, the CNAME was not returned either, which is the behaviour we want.</p><p>One last thing we wanted to test was whether the cache was really refreshed in the background while the stale entry was being served. <a href="https://developer.apple.com/download/more/?q=Additional%20Tools">Network Link Conditioner</a> was used to simulate high latency:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2LzVKSaIVIOaKuMuHbq9Fa/578619d90974ef897ae8e5d58ca9fe12/image16.png" />
            
            </figure>
            <pre><code>➜ dig service1.service.consul @CONSUL_ADDRESS
 
;; ANSWER SECTION:
service1.service.consul. 60 IN A 192.0.2.10
service1.service.consul. 60 IN A 192.0.2.12
 
;; Query time: 3053 msec</code></pre>
            <p>We can see that the artificial latency is working. We can now test whether changes to the address are actually reflected in the Consul record, even though a stale record is served at first:</p>
            <pre><code>➜ dig service2.service.consul @127.0.0.1
...
;; ANSWER SECTION:
service2.service.consul. 30 IN A 192.0.2.100
 
;; Query time: 501 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Sat Jan 02 13:00:36 WET 2021
;; MSG SIZE  rcvd: 89
 
➜ dig service2.service.consul @127.0.0.1
...
;; ANSWER SECTION:
service2.service.consul. 58 IN A 192.0.2.200
 
;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Sat Jan 02 13:00:39 WET 2021
;; MSG SIZE  rcvd: 89</code></pre>
            <p>We can see that the first attempt was served from the stale cache, since the address had already been changed on Consul to <i>192.0.2.200</i> and the query time was 500 ms. The next attempt was served from the fresh cache, hence the 0 ms response time, so we know that the cache was refreshed in the background.</p>
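            <p>The pattern we just observed (answer from the stale cache after 500 ms while the refresh continues in the background, then answer fresh on the next query) can be sketched in a few lines. This is a toy simulation of the behaviour, not how Unbound is implemented:</p>

```python
import threading
import time

SERVE_EXPIRED_CLIENT_TIMEOUT = 0.5  # 500 ms, as configured


class StaleCache:
    """Toy stale-while-revalidate cache mimicking the observed behaviour."""

    def __init__(self, stale_value, resolve):
        self.value = stale_value  # expired entry still held in cache
        self.fresh = False
        self.resolve = resolve    # slow upstream lookup function
        self.lock = threading.Lock()

    def _refresh(self):
        value = self.resolve()    # may take longer than the client timeout
        with self.lock:
            self.value, self.fresh = value, True

    def query(self):
        with self.lock:
            if self.fresh:
                return self.value  # fresh hit: answered immediately
        worker = threading.Thread(target=self._refresh, daemon=True)
        worker.start()
        worker.join(SERVE_EXPIRED_CLIENT_TIMEOUT)  # wait at most 500 ms
        with self.lock:
            return self.value      # stale answer if upstream was too slow


def slow_upstream():
    time.sleep(1.0)  # simulated slow Consul resolver
    return "192.0.2.200"


cache = StaleCache("192.0.2.100", slow_upstream)
first = cache.query()   # stale answer, returned after ~500 ms
time.sleep(1.0)         # give the background refresh time to land
second = cache.query()  # fresh answer, returned immediately
```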
    <div>
      <h4><b>Test notes</b></h4>
      <a href="#test-notes">
        
      </a>
    </div>
    <p>All the test scenarios completed successfully, although we did find an <a href="https://github.com/NLnetLabs/unbound/issues/394">issue</a> that has since been fixed upstream: the prefetching logic blocked the cache while it attempted to refresh an entry, which prevented stale entries from being served if the upstream was unresponsive. The fix is included in version 1.13.1.</p>
    <div>
      <h2>Deployment status</h2>
      <a href="#deployment-status">
        
      </a>
    </div>
    <p>This change has already been deployed to all of our data centers, and while it addressed our original use case, it also had some interesting side effects, as can be seen in the graphs below.</p>
    <div>
      <h3>Decreased in-flight requests</h3>
      <a href="#decreased-in-flight-requests">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5SEqkHk3xrVWSixpgdbdZa/7efa3f7c24e00772122d5b2e41efc353/image2-2.png" />
            
            </figure><p>The graph above is from the Toronto data center, which received the change at approximately 14:00 UTC. We can observe an attenuated curve, which indicates that spikes in the number of in-flight requests were probably caused by slow or unresponsive upstreams, for which a stale response is now served after a short while.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/VG1RTOCYpAKbaGFm1wIzo/42901205b6abb8db0fb4d2056555eb42/image5-2.png" />
            
            </figure><p>The same graph from the Perth data center, which received the change on a later day.</p>
    <div>
      <h3>Decrease in P99 latency</h3>
      <a href="#decrease-in-p99-latency">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3sHaBdYvhCW0SdPdPLm03F/14d71c18a9e1e6ae6fa99f15e4edab38/image15.png" />
            
            </figure><p>The graph above is from the Lisbon data center, which received the change at approximately 12:37 UTC. We can see that the long tail (P99) of request latencies decreases dramatically as we are now able to serve a response to clients in more situations without having them wait for a slow upstream network resolution that might not even succeed.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4205xzfOSup4fdebmcwW5G/d1c7d53dfb89066328eb9db64d3abf95/image10-2.png" />
            
            </figure><p>The effect is even more dramatic on the Paris data center.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/61OoMQx3e3PjZVzVvQoSUe/c70e5b82c7349897757ec5658ef36d03/image8-2.png" />
            
            </figure><p>In the graph above we can observe the number of stale records served per second for the same data center, using the last 3 minutes of datapoints. The decrease in P99 latencies correlates directly with the start of serving stale entries, even though there are not many of them.</p>
    <div>
      <h3>Decreased general error rate</h3>
      <a href="#decreased-general-error-rate">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/49nNtzXobSQtD0tgnddiJL/7505715c40eeb1d2767a6d7e33ea86bc/image12.png" />
            
            </figure><p>This graph is from our Perth data center, which received the change on February 8. The drop is explained by our now being able to answer some requests whose upstreams were slow or unresponsive, which previously would have resulted in failed requests.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Pj6PgH2tsWfgRjFIOvZ7P/83f7dfaa8623dc3acc694ff04953bddc/image19.png" />
            
            </figure><p>Graph from the Vientiane data center, which received the change on February 1.</p>
    <div>
      <h3>Less jitter on the rate of load balancer queries by RRDNS</h3>
      <a href="#less-jitter-on-the-rate-of-load-balancer-queries-by-rrdns">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6LFLwd8W5wriMKlsZZdCXG/d06e8adc686b4aec2c4cca858beafd56/image11.png" />
            
            </figure><p>Another graph from the Perth data center, which received the change on February 8. An explanation for this change is that resolvers usually retry after a short time when they get no response, so by returning a stale response instead of keeping the client waiting (and retrying), we prevent a surge of new queries for expired and slow-to-resolve records.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Nlj6ruekdbxc5e86cDIVd/ef1c0ea3d9de713c60fd28ec247258a8/image18.png" />
            
            </figure><p>A graph from the Doha data center, which received the change on February 1.</p>
    <div>
      <h3>Miscellaneous Unbound metrics</h3>
      <a href="#miscellaneous-unbound-metrics">
        
      </a>
    </div>
    <p>This change also had a positive effect on several Unbound-specific metrics, below are several graphs from the Perth data center that received the change on February 8.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6tvgfdPirOIApYCRxmzujb/945dea0b77381302fd9a931ccd8389ee/image6-2.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1TnCvyvE1SxFM02ulD7wOA/dbc2b15548cf92494dcc439d17d2f2ab/image13.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1KnVnBNkHrtUnkuC4X4rlN/6dbafa822edc14fbb03c077636e59b7c/image14.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1K0X5VnGlDWyfj4GXhHwKR/f676986d1c901998425a855e10ac4a2e/image4-4.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/78YTgVILYTXNqtkZCgPZhu/5a6aeb35e06bb620000291a16bf69412/image9-1.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3lbt8d6fKHtHIi84k0SnyW/f2cd288ec63f34e2be4c5f938b658028/image17.png" />
            
            </figure><p>How it started vs how it's going</p>
    <div>
      <h3>Further improvements</h3>
      <a href="#further-improvements">
        
      </a>
    </div>
    <p>In the future, we would like to tweak the <i>serve-expired-client-timeout</i> value, which is currently set to 500 ms. It was chosen conservatively, at a time when we were cautiously experimenting with a feature that had been problematic in the past, and our gut feeling is that it can be at least halved without causing issues, further improving the client’s experience. There is also the possibility of using different values per data center, for example picking the P90 resolve latency, which differs per data center; this is something we want to explore further.</p>
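    <p>As a sketch of the per-data-center idea (hypothetical, not something we run in production), the timeout could be derived from each data center's observed resolve latencies, clamped to a sane range:</p>

```python
def percentile(samples, p):
    """Nearest-rank percentile; good enough for picking a timeout."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]


def stale_timeout_ms(resolve_latencies_ms, floor_ms=100, ceil_ms=500):
    """Hypothetical per-data-center serve-expired-client-timeout:
    the P90 resolve latency, clamped between floor_ms and ceil_ms."""
    p90 = percentile(resolve_latencies_ms, 90)
    return min(max(p90, floor_ms), ceil_ms)


# A data center with fast upstreams would get a much lower timeout than
# the current global 500 ms; a slow one would stay near the ceiling.
fast_dc = [20, 25, 30, 31, 35, 40, 42, 45, 60, 300]
slow_dc = [100, 200, 300, 400, 450, 460, 470, 480, 490, 900]
assert stale_timeout_ms(fast_dc) == 100  # P90 = 60, clamped to floor
assert stale_timeout_ms(slow_dc) == 490
```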
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>We are enabling Unbound to return expired entries from the cache whenever an upstream takes longer than 500 ms to respond, while refreshing the cache in the background, unless more than 3600 seconds have elapsed since the record's TTL expired.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Cache]]></category>
            <guid isPermaLink="false">64s0ckd7FF53Lw8Q48zi1h</guid>
            <dc:creator>André Cruz</dc:creator>
        </item>
    </channel>
</rss>