
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how Cloudflare products are built and the technologies they use, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Mon, 04 May 2026 22:50:28 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Agents that remember: introducing Agent Memory]]></title>
            <link>https://blog.cloudflare.com/introducing-agent-memory/</link>
            <pubDate>Fri, 17 Apr 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare Agent Memory is a managed service that gives AI agents persistent memory, allowing them to recall what matters, forget what doesn't, and get smarter over time. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>As developers build increasingly sophisticated <a href="https://developers.cloudflare.com/agents/">agents</a> on Cloudflare, one of the biggest challenges they face is getting the right information into context at the right time. The quality of results produced by models is directly tied to the quality of context they operate with, but even as context window sizes grow past one million (1M) tokens, <a href="https://www.trychroma.com/research/context-rot"><u>context rot</u></a> remains an unsolved problem. A natural tension emerges between two bad options: keep everything in context and watch quality degrade, or aggressively prune and risk losing information the agent needs later.</p><p>Today we're announcing the private beta of <b>Agent Memory</b>, a managed service that extracts information from agent conversations and makes it available when it’s needed, without filling up the context window.</p><p>It gives AI agents persistent memory, allowing them to recall what matters, forget what doesn't, and get smarter over time. In this post, we’ll explain how it works — and what it can help you build.</p>
    <div>
      <h2>The state of agentic memory</h2>
      <a href="#the-state-of-agentic-memory">
        
      </a>
    </div>
    <p>Agentic memory is one of the fastest-moving spaces in AI infrastructure, with new open-source libraries, managed services, and research prototypes launching on a near-weekly basis. These offerings vary widely in what they store, how they retrieve, and what kinds of agents they're designed for. Benchmarks like <a href="https://arxiv.org/abs/2410.10813"><u>LongMemEval</u></a>, <a href="https://arxiv.org/abs/2402.17753"><u>LoCoMo</u></a>, and <a href="https://arxiv.org/pdf/2510.27246"><u>BEAM</u></a> provide useful apples-to-apples comparisons, but they also make it easy to build systems that <a href="https://en.wikipedia.org/wiki/Overfitting"><u>overfit</u></a> for a specific evaluation and break down in production.</p><p>Existing offerings also differ in architecture. Some are managed services that handle extraction and retrieval in the background, others are self-hosted frameworks where you run the memory pipeline yourself. Some expose constrained, purpose-built APIs that keep memory logic out of the agent's main context; others give the model raw access to a database or filesystem and let it design its own queries, burning tokens on storage and retrieval strategy instead of the actual task. Some try to fit everything into the context window, partitioning across multiple agents if needed, while others use retrieval to surface only what's relevant. </p><p>Agent Memory is a managed service with an opinionated API and retrieval-based architecture. We've carefully considered the alternatives, and we believe this combination is the right default for most production workloads. Tighter ingestion and retrieval pipelines are superior to giving agents raw filesystem access. In addition to improved cost and performance, they provide a better foundation for complex reasoning tasks required in production, like temporal logic, supersession, and instruction following. 
We'll likely expose data for programmatic querying down the road, but we expect that to be useful for edge cases, not common cases.</p><p>We built Agent Memory because the workloads we see on our platform exposed gaps that existing approaches don't fully address. Agents running for weeks or months against real codebases and production systems need memory that stays useful as it grows — not just memory that performs well on a clean benchmark dataset that may fit entirely into a newer model's context window. </p><p>They need fast ingestion. They need retrieval that doesn't block the conversation. And they need to run on models that keep the per-query cost reasonable.</p>
    <div>
      <h2>How you use it</h2>
      <a href="#how-you-use-it">
        
      </a>
    </div>
    <p>Agent Memory stores memories in a profile, which is addressed by name. A profile gives you several operations: ingest a conversation, remember something specific, recall what you need, list memories, or forget a specific memory. <i>Ingest</i> is the bulk path that is typically called when the harness compacts context. <i>Remember</i> is for the model to store something important on the spot. <i>Recall</i> runs the full retrieval pipeline and returns a synthesized answer.</p>
            <pre><code>export default {
  async fetch(request: Request, env: Env): Promise&lt;Response&gt; {
    // Get a profile -- an isolated memory store shared across sessions, agents, and users
    const profile = await env.MEMORY.getProfile("my-project");
    // Ingest -- extract memories from a conversation (typically called at compaction)
    await profile.ingest([
      { role: "user", content: "Set up the project with React and TypeScript." },
      { role: "assistant", content: "Done. Scaffolded a React + TS project targeting Workers." },
      { role: "user", content: "Use pnpm, not npm. And dark mode by default." },
      { role: "assistant", content: "Got it -- pnpm and dark mode as default." },
    ], { sessionId: "session-001" });
    // Remember -- store a single memory explicitly (direct tool use by the model)
    const memory = await profile.remember({
      content: "API rate limit was increased to 10,000 req/s per zone after the April 10 incident.",
      sessionId: "session-001",
    });
    // Recall -- retrieve memories and get a synthesized answer
    const results = await profile.recall("What package manager does the user prefer?");
    console.log(results.result); // "The user prefers pnpm over npm."
    return Response.json({ ok: true });
  },
};</code></pre>
            <p>Agent Memory is accessed via a binding from any Cloudflare Worker. It can also be accessed via a REST API for agents running outside of Workers, following the same pattern as other Cloudflare developer platform APIs. If you’re building with the Cloudflare Agents SDK, the Agent Memory service integrates neatly as the reference implementation for handling compaction, remembering, and searching over memories in <a href="https://developers.cloudflare.com/agents/concepts/memory/"><u>the memory portion</u></a> of the Sessions API.</p>
    <div>
      <h2>What you can build with it</h2>
      <a href="#what-you-can-build-with-it">
        
      </a>
    </div>
    <p>Agent Memory is designed to work across a range of agent architectures:</p><p><b>Memory for individual agents.</b> Regardless of whether you're building with coding agents like Claude Code or OpenCode with a human in the loop, using self-hosted agent frameworks like OpenClaw or Hermes to act on your behalf, or wiring up managed services like <a href="https://www.anthropic.com/engineering/managed-agents"><u>Anthropic’s Managed Agents</u></a>, Agent Memory can serve as the persistent memory layer without any changes to the agent's core loop.</p><p><b>Memory for custom agent harnesses.</b> Many teams are building their own agent infrastructure, including background agents that run autonomously without a human in the loop. <a href="https://builders.ramp.com/post/why-we-built-our-background-agent"><u>Ramp Inspect</u></a> is one public example; <a href="https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents"><u>Stripe</u></a> and <a href="https://engineering.atspotify.com/2025/11/spotifys-background-coding-agent-part-1"><u>Spotify</u></a> have described similar systems. These harnesses can also benefit from giving their agents memory that persists across sessions and survives restarts.</p><p><b>Shared memory across agents, people, and tools.</b> A memory profile doesn't have to belong to a single agent. A team of engineers can share a memory profile so that knowledge learned by one person's coding agent is available to everyone: coding conventions, architectural decisions, tribal knowledge that currently lives in people's heads or gets lost when context is pruned. A code review bot and a coding agent can share memory so that review feedback shapes future code generation. The knowledge your agents accumulate stops being ephemeral and starts becoming a durable team asset.</p><p>While search is a component of memory, agent search and agent memory solve distinct problems. 
<a href="https://developers.cloudflare.com/ai-search/"><u>AI Search</u></a> is our primitive for finding results across unstructured and structured files; Agent Memory is for context recall. The data in Agent Memory doesn't exist as files; it's derived from sessions. An agent can use both, and they are designed to work together. </p>
    <div>
      <h2>Your memories are yours</h2>
      <a href="#your-memories-are-yours">
        
      </a>
    </div>
    <p>As agents become more capable and more deeply embedded in business processes, the memory they accumulate becomes genuinely valuable — not just as operational state, but as institutional knowledge that took real work to build. We're hearing growing concern from customers about what it means to tie that asset to a single vendor, and that concern is reasonable. The more an agent learns, the higher the switching cost if that memory can't move with it.</p><p>Agent Memory is a managed service, but your data is yours. Every memory is exportable, and we're committed to making sure the knowledge your agents accumulate on Cloudflare can leave with you if your needs change. We think the right way to earn long-term trust is to make leaving easy and to keep building something good enough that you don't want to.</p>
    <div>
      <h2>How Agent Memory works</h2>
      <a href="#how-agent-memory-works">
        
      </a>
    </div>
    <p>To understand what happens behind the API shown above, it helps to break down how agents manage context. An agent has three components:</p><ol><li><p>A <b>harness</b> that drives repeated calls to a model, facilitates tool calls, and manages state.</p></li><li><p>A <b>model</b> that takes context and returns completions.</p></li><li><p><b>State</b> that includes both the current context window and additional information outside context: conversation history, files, databases, memory.</p></li></ol><p>The critical moment in an agent’s context lifecycle is <b>compaction,</b> when the harness decides to shorten context to stay within a model's limits or to avoid context rot. Today, most agents discard information permanently. Agent Memory preserves knowledge on compaction instead of losing it.</p><p>Agent Memory integrates into this lifecycle in two ways:</p><ol><li><p><b>Bulk ingestion at compaction.</b> When the harness compacts context, it ships the conversation to Agent Memory for ingestion. Ingestion extracts facts, events, instructions, and tasks from the message history, deduplicates them against existing memories, and stores them as memories for future retrieval.</p></li><li><p><b>Direct tool use by the model.</b> The model gets tools to interact directly with memories, including the ability to recall (search memories for specific information). The model can also remember (explicitly store memories based on something important), forget (mark a memory as no longer important or true), and list (see what memories are stored). These are lightweight operations that don't require the model to design queries or manage storage. The primary agent should never burn context on storage strategy. The tool surface it sees is deliberately constrained so that memory stays out of the way of the actual task.</p></li></ol>
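<p>To make the compaction path concrete, here's a minimal sketch of a harness-side hook. The <code>MemoryProfile</code> interface, the rough 4-characters-per-token estimate, and the budget numbers are illustrative assumptions, not the actual SDK surface:</p>

```typescript
// Hypothetical harness-side compaction hook (names and numbers are
// illustrative, not the real SDK surface).
type Message = { role: string; content: string };

interface MemoryProfile {
  ingest(messages: Message[], opts: { sessionId: string }): Promise<void>;
}

// Rough token estimate: ~4 characters per token (assumption for illustration).
const estimateTokens = (msgs: Message[]) =>
  msgs.reduce((n, m) => n + Math.ceil(m.content.length / 4), 0);

// When the transcript exceeds the budget, ship everything but a recent tail
// to Agent Memory for ingestion, then keep only the tail in context.
async function maybeCompact(
  messages: Message[],
  profile: MemoryProfile,
  sessionId: string,
  budgetTokens = 100_000,
  keepTail = 10,
): Promise<Message[]> {
  if (estimateTokens(messages) <= budgetTokens) return messages;
  await profile.ingest(messages.slice(0, -keepTail), { sessionId });
  return messages.slice(-keepTail);
}
```

Because ingestion is idempotent, calling a hook like this twice on overlapping history is safe.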
    <div>
      <h3>The ingestion pipeline</h3>
      <a href="#the-ingestion-pipeline">
        
      </a>
    </div>
    <p>When a conversation arrives for ingestion, it passes through a multi-stage pipeline that extracts, verifies, classifies, and stores memories.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Cjj8uXbZRCrQVAxirTo7y/d62b5b798990540e78825747ff24afec/image3.png" />
</figure><p>The first step is deterministic ID generation. Each message gets a content-addressed ID — a SHA-256 hash of session ID, role, and content, truncated to 128 bits. If the same conversation is ingested twice, every message resolves to the same ID, making re-ingestion idempotent. </p><p>Next, the extractor runs two passes in parallel. A full pass chunks messages at roughly 10K characters with two-message overlap and processes up to four chunks concurrently. Each chunk gets a structured transcript with role labels, relative dates resolved to absolutes ("yesterday" becomes "2026-04-14"), and line indices for source provenance. For longer conversations (9+ messages), a detail pass runs alongside the full pass, using overlapping windows that focus specifically on extracting concrete values like names, prices, version numbers, and entity attributes that broad extraction tends to miss. The two result sets are then merged.</p><p>The next step is to verify each extracted memory against the source transcript. The verifier runs eight checks covering entity identity, object identity, location context, temporal accuracy, organizational context, completeness, relational context, and whether inferred facts are actually supported by the conversation. Each item is passed, corrected, or dropped accordingly.</p><p>The pipeline then classifies each verified memory into one of four types:</p><ul><li><p><b>Facts</b> represent what is true right now: atomic, stable knowledge like "the project uses GraphQL" or "the user prefers dark mode." </p></li><li><p><b>Events</b> capture what happened at a specific time, like a deployment or a decision. </p></li><li><p><b>Instructions</b> describe how to do something, such as procedures, workflows, and runbooks. </p></li><li><p><b>Tasks</b> track what is being worked on right now and are ephemeral by design.</p></li></ul><p>Facts and instructions are keyed. 
Each gets a normalized topic key, and when a new memory has the same key as an existing one, the old memory is superseded rather than deleted. This creates a version chain with a forward pointer from the old memory to the new memory. Tasks are excluded from the vector index entirely to keep it lean but remain discoverable via full-text search.</p><p>Finally, everything is written to storage using INSERT OR IGNORE so that content-addressed duplicates are silently skipped. After returning a response to the harness, background vectorization runs asynchronously. The embedding text prepends the 3-5 search queries generated during classification to the memory content itself, bridging the gap between how memories are written (declaratively: "user prefers dark mode") and how they're searched (interrogatively: "what theme does the user want?"). Vectors for superseded memories are deleted in parallel with new upserts.</p>
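<p>The content-addressed ID step can be sketched in a few lines. The field separator and hex truncation below are assumptions; the source only specifies SHA-256 over session ID, role, and content, truncated to 128 bits:</p>

```typescript
import { createHash } from "node:crypto";

// Content-addressed message ID: SHA-256 over session ID, role, and content,
// truncated to 128 bits (32 hex characters). The NUL separator is an
// assumption; any unambiguous field encoding works.
function messageId(sessionId: string, role: string, content: string): string {
  return createHash("sha256")
    .update([sessionId, role, content].join("\x00"))
    .digest("hex")
    .slice(0, 32); // 128 bits
}
```

Because the ID is a pure function of its inputs, an INSERT OR IGNORE on this key skips duplicates without a read-before-write.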
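<p>The full-pass chunking described above (roughly 10K characters per chunk, two-message overlap) can be sketched as follows; the exact splitting rules are assumptions for illustration:</p>

```typescript
type Msg = { role: string; content: string };

// Greedy chunker: accumulate messages until the next one would push the chunk
// past maxChars, then start a new chunk that carries the last `overlap`
// messages forward for context continuity.
function chunkTranscript(
  messages: Msg[],
  maxChars = 10_000,
  overlap = 2,
): Msg[][] {
  const chunks: Msg[][] = [];
  let current: Msg[] = [];
  let size = 0;
  for (const msg of messages) {
    if (size + msg.content.length > maxChars && current.length > 0) {
      chunks.push(current);
      current = current.slice(-overlap); // two-message overlap
      size = current.reduce((n, m) => n + m.content.length, 0);
    }
    current.push(msg);
    size += msg.content.length;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```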
    <div>
      <h3>The retrieval pipeline</h3>
      <a href="#the-retrieval-pipeline">
        
      </a>
    </div>
    <p>When an agent searches for a memory, the query goes through a separate retrieval pipeline. During development, we discovered that no single retrieval method works best for all queries, so we run several methods in parallel and fuse the results.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4cH9jB2ghMIQZG0McicyV7/8d75ca8c748054c06cc5e25c4d1d730f/image5.png" />
</figure><p>The first stage runs query analysis and embedding concurrently. The query analyzer produces ranked topic keys, full-text search terms with synonyms, and a HyDE (Hypothetical Document Embedding), a declarative statement phrased as if it were the answer to the question. In parallel, the raw query is embedded directly; both the HyDE embedding and the raw-query embedding are used downstream.</p><p>In the next stage, five retrieval channels run in parallel. Full-text search with <a href="https://tartarus.org/martin/PorterStemmer/"><u>Porter stemming</u></a> handles keyword precision for queries where you know the exact term but not the surrounding context. Exact fact-key lookup returns results where the query maps directly to a known topic key. Raw message search queries the stored conversation messages directly via full-text search; these unclassified conversation fragments act as a safety net, catching verbatim details that the extraction pipeline may have generalized away. Direct vector search finds semantically similar memories using the embedded query. And HyDE vector search finds memories that are similar to what the answer would look like, which often surfaces results that direct embedding misses — particularly for abstract or multi-hop queries where the question and the answer use different vocabulary.</p><p>In the third and final stage, results from all five retrieval channels are merged using Reciprocal Rank Fusion (RRF), where each result receives a weighted score based on where it ranked within a given channel. Fact-key matches get the highest weight because an exact topic match is the strongest signal. Full-text search, HyDE vectors, and direct vectors are each weighted based on strength of signal. Finally, raw message matches are included with low weight to surface candidates the extraction pipeline may have missed. 
Ties are broken by recency, with newer results ranked higher.</p><p>The pipeline then passes the top candidates to the synthesis model, which generates a natural-language answer to the original search query. Some specific query types get special treatment. As an example, temporal computation is handled deterministically via regex and arithmetic, not by the LLM. The results are injected into the synthesis prompt as pre-computed facts. Models are unreliable at things like date math, so we don't ask them to do it.</p>
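<p>The fusion step can be sketched as weighted Reciprocal Rank Fusion with a recency tiebreak. The per-channel weights and the RRF constant <code>k = 60</code> below are illustrative values, not the production configuration:</p>

```typescript
type Candidate = { id: string; createdAt: number };

// Weighted Reciprocal Rank Fusion: each channel contributes
// weight / (k + rank) for every candidate it returned, scores are summed
// across channels, and exact ties fall back to recency.
function fuse(
  channels: { weight: number; results: Candidate[] }[],
  k = 60,
): Candidate[] {
  const scores = new Map<string, { score: number; cand: Candidate }>();
  for (const { weight, results } of channels) {
    results.forEach((cand, rank) => {
      const entry = scores.get(cand.id) ?? { score: 0, cand };
      entry.score += weight / (k + rank + 1); // rank is 0-based
      scores.set(cand.id, entry);
    });
  }
  return [...scores.values()]
    .sort(
      (a, b) =>
        b.score - a.score || b.cand.createdAt - a.cand.createdAt, // recency tiebreak
    )
    .map((e) => e.cand);
}
```

In this sketch, fact-key results would be passed in with the largest weight and raw message results with the smallest, matching the ordering described above.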
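<p>As a taste of the deterministic temporal handling, here's a toy relative-date resolver; the three phrases it supports are an illustrative subset, not the real system's coverage:</p>

```typescript
// Resolve a relative-date phrase to an absolute YYYY-MM-DD string using plain
// date arithmetic instead of asking the model. Only a toy subset of phrases
// is handled here.
function resolveRelativeDate(phrase: string, today: Date): string | null {
  const offsets: Record<string, number> = { today: 0, yesterday: -1, tomorrow: 1 };
  const m = phrase.toLowerCase().match(/\b(today|yesterday|tomorrow)\b/);
  if (!m) return null;
  const d = new Date(today);
  d.setUTCDate(d.getUTCDate() + offsets[m[1]]);
  return d.toISOString().slice(0, 10); // YYYY-MM-DD
}
```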
    <div>
      <h2>How we built it</h2>
      <a href="#how-we-built-it">
        
      </a>
    </div>
    <p>Our initial prototype of Agent Memory was lightweight, with a basic extraction pipeline, vector storage, and simple retrieval. It worked well enough to demonstrate the concept, but not well enough to ship.</p><p>So we put it into an agent-driven loop and iterated. The cycle looked like this: run benchmarks, analyze where we had gaps, propose solutions, have a human review the proposals to select strategies that generalize rather than overfit, let the agent make the changes, repeat.</p><p>This worked well, but came with one specific challenge. LLMs are stochastic, even with temperature set to zero. This caused results to vary across runs, which meant we had to average multiple runs (time-consuming for large benchmarks) and rely on trend analysis alongside raw scores to understand what was actually working. Along the way we had to guard carefully against overfitting the benchmarks in ways that didn't genuinely make the product better for the general case.</p><p>Over time, this got us to a place where benchmark scores improved consistently with each iteration and we had a generalized architecture that would work in the real world. We intentionally tested against multiple benchmarks (including LoCoMo, LongMemEval, and BEAM) to push the system in different ways.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/68NbsLPqv7VXu0ODV6PFGK/89aa2e05b82f24b642d0c4cbddbdf340/image.png" />
          </figure>
    <div>
      <h2>Why Cloudflare</h2>
      <a href="#why-cloudflare">
        
      </a>
    </div>
    <p>We build Cloudflare on Cloudflare, and Agent Memory is no different. Powerful, easily composable existing primitives let us ship the first prototype in a weekend and a fully functioning, productionized internal version of Agent Memory in less than a month. Beyond speed of delivery, Cloudflare turned out to be the ideal place to build this kind of service for a few other reasons.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7IWzT1bmOW7Lip9ib17Di8/1a3a83be3d4e7f1450eacd41d18a924f/image4.png" />
</figure><p>Under the hood, Agent Memory is a Cloudflare Worker that coordinates several systems:</p><ul><li><p>Durable Object: stores the raw messages and classified memories</p></li><li><p>Vectorize: provides vector search over embedded memories</p></li><li><p>Workers AI: runs the LLMs and embedding models</p></li></ul><p>Each memory profile maps to its own Durable Object instance and Vectorize index, keeping data fully isolated between profiles. This also lets us scale easily as demand grows.</p><p><b>Compute isolation via Durable Objects.</b> Each memory profile gets its own <a href="https://developers.cloudflare.com/durable-objects/"><u>Durable Object</u></a> (DO) with a SQLite-backed store, providing strong isolation between tenants without any infrastructure overhead. The DO handles FTS indexing, supersession chains, and transactional writes. DO’s getByName() addressing means any request, from anywhere, can reach the right memory profile by name, and ensures that sensitive memories are strongly isolated from other tenants.</p><p><b>Storage across the stack.</b> Memory content lives in SQLite-backed DOs. Vectors live in <a href="https://developers.cloudflare.com/vectorize/"><u>Vectorize</u></a>. In the future, snapshots and exports will go to <a href="https://developers.cloudflare.com/r2/"><u>R2</u></a> for cost-efficient long-term storage. Each primitive is purpose-built for its workload; we don't need to force everything into a single shape or database.</p><p><b>Local model inference with Workers AI.</b> The entire extraction, classification, and synthesis pipeline runs on <a href="https://developers.cloudflare.com/workers-ai/"><u>Workers AI</u></a> models deployed on Cloudflare's network. 
All AI calls pass a session affinity header keyed to the memory profile name, so repeated requests hit the same backend and benefit from prompt caching.</p><p>One interesting finding from our model selection: a bigger, more powerful model isn't always better. We currently default to Llama 4 Scout (17B, 16-expert MoE) for extraction, verification, classification, and query analysis, and Nemotron 3 (120B MoE, 12B active parameters) for synthesis. Scout handles the structured classification tasks efficiently, while Nemotron's larger reasoning capacity improves the quality of natural-language answers. The synthesizer is the only stage where throwing more parameters at the problem consistently helped. For everything else, the smaller model hit a better sweet spot of cost, quality, and latency.</p>
    <div>
      <h2>How we've been using it</h2>
      <a href="#how-weve-been-using-it">
        
      </a>
    </div>
    <p>We run Agent Memory internally for our own workflows at Cloudflare, as both a proving ground and a source of ideas for what to build next.</p><p><b>Coding agent memory.</b> We use an internal <a href="https://opencode.ai">OpenCode</a> plugin that wires Agent Memory into the development loop. Agent Memory preserves context from past compactions, both within sessions and across them. The less obvious benefit has been shared memory across a team: with a shared profile, the agent knows what other members of your team have already learned, which means it can stop asking questions that have already been answered and stop making mistakes that have already been corrected.</p><p><b>Agentic code review.</b> We've connected Agent Memory to our internal agentic code reviewer. Arguably the most useful thing it learned to do was stay quiet. The reviewer now remembers that a particular comment wasn't relevant in a past review, or that a specific pattern was flagged and the author chose to keep it for a good reason. Reviews get less noisy over time, not just smarter.</p><p><b>Chat bots.</b> We've also wired memory into an internal chat bot that ingests message history, then lurks and remembers new messages as they arrive. When someone asks a question, the bot can answer based on previous conversations.</p><p>We also have a number of additional use cases that we plan to roll out internally in the near future as we refine and improve the service.</p>
    <div>
      <h2>What's next</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>We're continuing to test and refine Agent Memory internally, improving the extraction pipeline, tuning retrieval quality, and expanding the background processing capabilities. Similar to how the human brain consolidates memories by replaying and strengthening connections during sleep, we see opportunities for memory storage to improve asynchronously and are currently implementing and testing various strategies to make this work.</p><p>We plan to make Agent Memory publicly available soon. If you're building agents on Cloudflare and want early access, <a href="https://forms.gle/RAXbK6gN9Yy89ECw8"><u>contact us to join the waitlist</u></a>.</p><p>If you want to dig into the architecture, share what you're building, or follow along as we develop this further, join us on the<a href="https://discord.cloudflare.com"> <u>Cloudflare Discord</u></a> or start a thread in the<a href="https://community.cloudflare.com"> <u>Cloudflare Community</u></a>. We're actively watching both, and are interested in what production agent workloads actually look like in the wild.</p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Storage]]></category>
            <guid isPermaLink="false">3VGCjw0ivWk7RfPbPZ9k9H</guid>
            <dc:creator>Tyson Trautmann</dc:creator>
            <dc:creator>Rob Sutter</dc:creator>
        </item>
        <item>
            <title><![CDATA[Artifacts: versioned storage that speaks Git]]></title>
            <link>https://blog.cloudflare.com/artifacts-git-for-agents-beta/</link>
            <pubDate>Thu, 16 Apr 2026 13:01:00 GMT</pubDate>
            <description><![CDATA[ Give your agents, developers, and automations a home for code and data. We’ve just launched Artifacts: Git-compatible versioned storage built for agents. Create tens of millions of repos, fork from any remote, and hand off a URL to any Git client.
 ]]></description>
            <content:encoded><![CDATA[ <p>Agents have changed how we think about source control, file systems, and persisting state. Developers and agents are generating more code than ever — more code will be written over the next 5 years than in all of programming history — and it’s driven an order-of-magnitude change in the scale of the systems needed to meet this demand. Source control platforms are especially struggling here: they were built to meet the needs of humans, not a 10x change in volume driven by agents who never sleep, can work on several issues at once, and never tire.</p><p>We think there’s a need for a new primitive: a distributed, versioned file system that’s built for agents first and foremost, and that can serve the types of applications that are being built today.</p><p>We’re calling this Artifacts: a versioned file system that speaks Git. You can create repositories programmatically, alongside your agents, sandboxes, Workers, or any other compute paradigm, and connect to them from any regular Git client.</p><p>Want to give every agent session a repo? Artifacts can do it. Every sandbox instance? Also Artifacts. Want to create 10,000 forks from a known-good starting point? You guessed it: Artifacts again. Artifacts exposes a REST API and native Workers API for creating repositories, generating credentials, and making commits, for environments where a Git client isn’t the right fit (e.g., a serverless function).</p><p>Artifacts is available in private beta and we’re aiming to open this up as a public beta by early May.</p>
            <pre><code>// Create a repo
const repo = await env.AGENT_REPOS.create(name)
// Pass the remote &amp; token back to your agent
return { remote: repo.remote, token: repo.token }</code></pre>
            
            <pre><code># Clone it and use it like any regular git remote
$ git clone https://x:${TOKEN}@123def456abc.artifacts.cloudflare.net/git/repo-13194.git
</code></pre>
            <p>That’s it. A bare repo, ready to go, created on the fly, that any Git client can operate against.</p><p>And if you want to bootstrap an Artifacts repo from an existing Git repository so that your agent can work on it and push changes independently, you can do that too with .import():</p>
            <pre><code>interface Env {
  ARTIFACTS: Artifacts
}

export default {
  async fetch(request: Request, env: Env) {
    // Import from GitHub
    const { remote, token } = await env.ARTIFACTS.import({
      source: {
        url: "https://github.com/cloudflare/workers-sdk",
        branch: "main",
      },
      target: {
        name: "workers-sdk",
      },
    })

    // Get a handle to the imported repo
    const repo = await env.ARTIFACTS.get("workers-sdk")

    // Fork to an isolated, read-only copy
    const fork = await repo.fork("workers-sdk-review", {
      readOnly: true,
    })

    return Response.json({ remote: fork.remote, token: fork.token })
  },
}</code></pre>
            <p><a href="http://developers.cloudflare.com/artifacts/"><u>Check out the documentation</u></a> to get started, or if you want to understand how Artifacts is being used, how it was built, and how it works under the hood: read on.</p>
    <div>
      <h2>Why Git? What’s a versioned file system?</h2>
      <a href="#why-git-whats-a-versioned-file-system">
        
      </a>
    </div>
    <p>Agents know Git. It’s deep in the training data of most models. The happy path <i>and </i>the edge cases are well known to agents, and code-optimized models (and/or harnesses) are particularly good at using git.</p><p>Further, Git’s data model is not only good for source control, but for <i>anything</i> where you need to track state, time travel, and persist large amounts of small data. Code, config, session prompts and agent history: all of these are things (“objects”) that you often want to store in small chunks (“commits”) and be able to revert or otherwise roll back to (“history”). </p><p>We could have invented an entirely new, bespoke protocol… but then you have the bootstrap problem. AI models don’t know it, so you have to distribute skills, or a CLI, or hope that users are plugged into your docs MCP… all of that adds friction.

If we can just give agents an authenticated, secure HTTPS Git remote URL and have them operate as if it were a Git repo, though? That turns out to work pretty well. And for non-Git-speaking clients — such as a Cloudflare Worker, a Lambda function, or a Node.js app — we’ve exposed a REST API and (soon) language-specific SDKs. Those clients can also use <a href="https://isomorphic-git.org/"><u>isomorphic-git</u></a>, but in many cases a simpler TypeScript API can reduce the API surface needed.</p>
    <div>
      <h3>Not just for source control</h3>
      <a href="#not-just-for-source-control">
        
      </a>
    </div>
    <p>Artifacts’ Git API might make you think it’s just for source control, but it turns out that the Git API and data model are a powerful way to persist, fork, time-travel, and diff state for <i>any</i> data.</p><p>Inside Cloudflare, we’re using Artifacts for our internal agents: automatically persisting the current state of the filesystem <i>and</i> the session history in a per-session Artifacts repo. This enables us to:</p><ul><li><p>Persist sandbox state without having to provision (and keep around) block storage.</p></li><li><p>Share sessions with others and allow them to time-travel back through both session (prompt) state <i>and</i> file state, irrespective of whether there were commits to the “actual” repository (source control).</p></li><li><p>And best of all: <i>fork</i> a session from any point, so our team can hand sessions to each other. Debugging something and want another set of eyes? Send a URL and fork it. Want to riff on an API? Have a co-worker fork it and pick up from where you left off.</p></li></ul><p>We’ve also spoken to teams who want to use Artifacts in cases where the Git protocol isn’t a requirement at all, but the semantics (reverting, cloning, diffing) <i>are</i>. Storing per-customer config as part of your product, and want the ability to roll back? Artifacts can be a good representation of this.</p><p>We’re excited to see teams explore the non-Git use-cases around Artifacts just as much as the Git-focused ones.</p>
    <div>
      <h2>Under the hood</h2>
      <a href="#under-the-hood">
        
      </a>
    </div>
    <p>Artifacts is built on top of Durable Objects. The ability to create millions (or tens of millions) of instances of stateful, isolated compute is inherent to how Durable Objects work today, and that’s exactly what we needed to support millions of Git repos per namespace.</p><p>Major League Baseball (for live game fan-out), Confluence Whiteboards, and our own <a href="https://developers.cloudflare.com/agents/"><u>Agents SDK</u></a> use Durable Objects under the hood at significant scale, so we’re building this on a primitive that we’ve had in production for some time.</p><p>What we did need, however, was a Git server implementation that could run on Cloudflare Workers. It needed to be small, as complete as possible, extensible (<a href="https://git-scm.com/docs/git-notes"><u>notes</u></a>, <a href="https://git-lfs.com/"><u>LFS</u></a>), and efficient. So we built one in <a href="https://ziglang.org/"><u>Zig</u></a>, and compiled it to Wasm.</p><p>Why did we use Zig? Three reasons:</p><ol><li><p>The entire Git protocol engine is written in pure Zig (no libc), compiled to a ~100KB WASM binary (with room for optimization!). It implements SHA-1, zlib inflate/deflate, delta encoding/decoding, pack parsing, and the full Git smart HTTP protocol — all from scratch, with zero external dependencies other than the standard library.</p></li><li><p>Zig gives us manual control over memory allocation, which is important in constrained environments like Durable Objects. The Zig Build System lets us easily share code between the WASM runtime (production) and native builds (testing against libgit2 for correctness verification).</p></li><li><p>The WASM module communicates with the JS host via a thin callback interface: 11 host-imported functions for storage operations (<code>host_get_object</code>, <code>host_put_object</code>, etc.) and one for streaming output (<code>host_emit_bytes</code>). 
The WASM side is fully testable in isolation.</p></li></ol><p>Under the hood, Artifacts also uses R2 (for snapshots) and KV (for tracking auth tokens):</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/35SxJbQfntIscpotc0GBt8/48ae11213d7483c9b488321baacf78e7/BLOG-3269_2.png" />
          </figure><p><sup><code><i>How Artifacts works (Workers, Durable Objects, and WebAssembly)</i></code></sup></p><p>A Worker acts as the front-end, handling authentication &amp; authorization, tracking key metrics (errors, latency), and looking up each Artifacts repository (Durable Object) on the fly.</p><p>Specifically:</p><ul><li><p>Files are stored in the underlying Durable Object’s SQLite database.</p><ul><li><p>Durable Object storage has a 2MB max row size, so large Git objects are chunked and stored across multiple rows.</p></li><li><p>We make use of the sync KV API (<code>state.storage.kv</code>), which is backed by SQLite under the hood.</p></li></ul></li><li><p>DOs have ~128MB memory limits: this means we can spawn tens of millions of them (they’re fast and light) but have to work within those limits.</p><ul><li><p>We make heavy use of streaming in both the fetch and push paths, directly returning a <code>ReadableStream&lt;Uint8Array&gt;</code> built from the raw WASM output chunks.</p></li><li><p>We avoid calculating our own Git deltas; instead, the raw deltas and base hashes are persisted alongside the resolved object. On fetch, if the requesting client already has the base object, Zig emits the delta instead of the full object, which saves bandwidth <i>and</i> memory.</p></li></ul></li><li><p>We support both v1 and v2 of the Git protocol.</p><ul><li><p>That includes capabilities such as ls-refs, shallow clones (deepen, deepen-since, deepen-relative), and incremental fetch with have/want negotiation.</p></li><li><p>We have an extensive test suite with conformance tests against real Git clients and verification tests against a libgit2 server, designed to validate protocol support.</p></li></ul></li></ul><p>On top of this, we have native support for <a href="https://git-scm.com/docs/git-notes"><u>git-notes</u></a>. Artifacts is designed to be agent-first, and notes enable agents to attach metadata to Git objects. 
This includes prompts, agent attribution and other metadata that can be read/written from the repo without mutating the objects themselves.</p>
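<p>To illustrate the row-chunking described above (a sketch under assumed key names, not Artifacts’ actual storage code): an object larger than the row cap is split into sequence-numbered rows keyed by its hash, then reassembled in order on read.</p>

```typescript
// Sketch: store a large Git object across multiple rows under a ~2MB row cap.
// Key names (obj:<oid>:<n>) are illustrative, not Artifacts' real schema.
const MAX_ROW_BYTES = 2 * 1024 * 1024;

function chunkObject(oid: string, data: Uint8Array): Array<[string, Uint8Array]> {
  const out: Array<[string, Uint8Array]> = [];
  for (let i = 0, n = 0; i < data.length; i += MAX_ROW_BYTES, n++) {
    // Key each chunk by object id plus sequence number so reads can reassemble in order
    out.push([`obj:${oid}:${n}`, data.subarray(i, i + MAX_ROW_BYTES)]);
  }
  return out;
}

function reassemble(rows: Array<[string, Uint8Array]>): Uint8Array {
  const total = rows.reduce((sum, [, chunk]) => sum + chunk.length, 0);
  const joined = new Uint8Array(total);
  let offset = 0;
  for (const [, chunk] of rows) {
    joined.set(chunk, offset);
    offset += chunk.length;
  }
  return joined;
}

// A 5MB object spans three rows: 2MB + 2MB + 1MB.
const rows = chunkObject("abc123", new Uint8Array(5 * 1024 * 1024));
console.log(rows.length); // 3
```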
    <div>
      <h2>Big repos, big problems? Meet ArtifactFS.</h2>
      <a href="#big-repos-big-problems-meet-artifactfs">
        
      </a>
    </div>
    <p>Most repos aren’t that big, and Git is <a href="https://github.blog/open-source/git/gits-database-internals-i-packed-object-store/"><u>designed to be extremely efficient</u></a> in terms of storage: most repositories take only a few seconds to clone at most, and that’s dominated by network setup time, auth, and <a href="https://git-scm.com/book/ms/v2/Git-Internals-Git-Objects"><u>checksumming</u></a>. In most agent or sandbox scenarios, that’s workable: just clone the repo as the sandbox starts and get to work.</p><p>But what about a multi-GB repository and/or repos with millions of objects? How can we clone that repo quickly, without blocking the agent from getting to work for minutes (and burning compute the whole time)?</p><p>A popular web framework (at 2.4GB and with a long history!) takes close to 2 minutes to clone. A shallow clone is faster, but not enough to get down to single digit seconds, and we don’t always want to omit history (agents find it useful).</p><p>Can we get large repos down to ~10-15 seconds so that our agent can get to work? Well, yes: with a few tricks.</p><p>As part of our launch of Artifacts, <a href="https://github.com/cloudflare/artifact-fs"><u>we’re open-sourcing ArtifactFS</u></a>, a filesystem driver designed to mount large Git repos as quickly as possible, hydrating file contents on the fly instead of blocking on the initial clone. It’s ideal for agents, sandboxes, containers and other use cases where startup time is critical. If you can shave ~90-100 seconds off your sandbox startup time for every large repo, and you’re running 10,000 of those sandbox jobs per month, that’s roughly 278 sandbox hours saved each month.</p><p>You can think of ArtifactFS as “Git clone but async”:</p><ul><li><p>ArtifactFS runs a blobless clone of a Git repository: it fetches the file tree and refs, but not the file contents. 
It can do that during sandbox startup, which then allows your agent harness to get to work.</p></li><li><p>In the background, it starts to hydrate (download) file contents concurrently via a lightweight daemon.</p></li><li><p>It prioritizes files that agents typically want to operate on first: package manifests (<code>package.json, go.mod</code>), configuration files, and code, deprioritizing binary blobs (images, executables, and other non-text files) where possible so that agents can scan the file tree as the files themselves are hydrated.</p></li><li><p>If a file isn’t fully hydrated when the agent tries to read it, the read blocks until hydration completes.</p></li></ul><p>The filesystem does not attempt to “sync” files back to the remote repository: with thousands or millions of objects, that’s typically very slow, and since we’re speaking Git, we don’t need to. Your agent just needs to commit and push, as it would with any repository. No new APIs to learn.</p><p>Importantly, ArtifactFS works with any Git remote, not just our own Artifacts. If you’re cloning large repos from GitHub, GitLab, or self-hosted Git infrastructure: you can still use ArtifactFS.</p>
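<p>The hydration-priority step above can be sketched as a simple scoring function (the file lists here are hypothetical; ArtifactFS’s real heuristics live in the open-source repo):</p>

```typescript
// Hypothetical sketch of hydration ordering: manifests first, other text/code
// next, binary blobs last. Not ArtifactFS's exact heuristic.
const MANIFESTS = new Set(["package.json", "go.mod", "Cargo.toml", "pyproject.toml"]);
const BINARY_EXTS = new Set([".png", ".jpg", ".gif", ".zip", ".exe", ".wasm"]);

function hydrationPriority(path: string): number {
  const name = path.split("/").pop() ?? path;
  const dot = name.lastIndexOf(".");
  const ext = dot > 0 ? name.slice(dot) : "";
  if (MANIFESTS.has(name)) return 0; // hydrate first: agents read these early
  if (BINARY_EXTS.has(ext)) return 2; // hydrate last: rarely needed up front
  return 1; // everything else (code, config, docs)
}

const files = ["assets/logo.png", "src/index.ts", "package.json"];
files.sort((a, b) => hydrationPriority(a) - hydrationPriority(b));
console.log(files); // ["package.json", "src/index.ts", "assets/logo.png"]
```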
    <div>
      <h2>What’s coming?</h2>
      <a href="#whats-coming">
        
      </a>
    </div>
    <p>Our release today is just the beta, and we’re already working on a number of features that you’ll see land over the next few weeks:</p><ul><li><p>Expanding the <a href="https://developers.cloudflare.com/artifacts/observability/metrics/"><u>available metrics</u></a> we expose. Today we’re shipping metrics for key operations counts per namespace, repo and stored bytes per repo, so that managing millions of Artifacts isn’t toilsome.</p></li><li><p>Support for <a href="https://developers.cloudflare.com/queues/event-subscriptions/"><u>Event Subscriptions</u></a> for repo-level events so that we can emit events on pushes, pulls, clones, and forks to any repository within a namespace. This will also allow you to consume events, write webhooks, and use those events to notify end-users, drive lifecycle events within your products, and/or run post-push jobs (like CI/CD).</p></li><li><p>Native TypeScript, Go and Python client SDKs for interacting with the Artifacts API</p></li><li><p>Repo-level search APIs and namespace-wide search APIs, e.g. “find all the repos with a <code>package.json</code> file”. </p></li></ul><p>We’re also planning an API for <a href="https://developers.cloudflare.com/workers/ci-cd/builds/"><u>Workers Builds</u></a>, allowing you to run CI/CD jobs on any agent-driven workflow.</p>
    <div>
      <h2>What will it cost me?</h2>
      <a href="#what-will-it-cost-me">
        
      </a>
    </div>
    <p>We’re still early with Artifacts, but want our pricing to work at agent-scale: it needs to be cost effective to have millions of repos, unused (or rarely used) repos shouldn’t be a drag, and our pricing should match the massively-single-tenant nature of agents.</p><p>You also shouldn’t have to think about whether a repo is going to be used or not, whether it’s hot or cold, and/or whether an agent is going to wake it up. We’ll charge you for the storage you consume and the operations (e.g. clones, forks, pushes &amp; pulls) against each repo.</p><table><tr><th><p></p></th><th><p><b>$/unit</b></p></th><th><p><b>Included</b></p></th></tr><tr><td><p><b>Operations</b></p></td><td><p>$0.15 per 1,000 operations</p></td><td><p>First 10k included (per month)</p></td></tr><tr><td><p><b>Storage</b></p></td><td><p>$0.50/GB-mo</p></td><td><p>First 1GB included.</p></td></tr></table><p>Big, busy repos will cost more than smaller, less-often-used repos, whether you have 1,000, 100,000, or 10 million of them.</p><p>We’ll also be bringing Artifacts to the Workers Free plan (with some fair limits) as the beta progresses, and we’ll provide updates throughout the beta should this pricing change and ahead of billing any usage.</p>
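<p>As a back-of-the-envelope example using the rates above (assuming no other fees apply, and noting the beta pricing may change):</p>

```typescript
// Monthly cost from the published rates: $0.15 per 1,000 operations beyond
// the first 10,000, plus $0.50 per GB-month beyond the first GB.
function monthlyCostUSD(operations: number, storageGB: number): number {
  const opsCost = (Math.max(0, operations - 10_000) / 1_000) * 0.15;
  const storageCost = Math.max(0, storageGB - 1) * 0.5;
  return opsCost + storageCost;
}

// 50,000 operations and 5GB stored:
// 40,000 billable ops -> $6.00, plus 4 billable GB -> $2.00
console.log(monthlyCostUSD(50_000, 5).toFixed(2)); // "8.00"

// Usage within the included allowances costs nothing:
console.log(monthlyCostUSD(10_000, 1)); // 0
```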
    <div>
      <h2>Where do I start? </h2>
      <a href="#where-do-i-start">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3xLMMKCN1HNGWbkSyq0tDZ/2a8d49383957804f3ce204783e11ae80/BLOG-3269_3.png" />
          </figure><p>Artifacts is launching in private beta, and we expect public beta to be ready in early May (2026, to be clear!). We’ll be allowing customers in progressively over the next few weeks, and <a href="https://forms.gle/DwBoPRa3CWQ8ajFp7"><u>you can register interest for the private beta</u></a> directly.</p><p>In the meantime, you can learn more about Artifacts by:</p><ul><li><p>Reading the <a href="http://developers.cloudflare.com/artifacts/get-started/workers/"><u>getting started guide</u></a> in the docs.</p></li><li><p>Visiting the Cloudflare dashboard (Build &gt; Storage &amp; Databases &gt; Artifacts)</p></li><li><p>Reading through the <a href="http://developers.cloudflare.com/artifacts/api/rest-api/"><u>REST API examples</u></a></p></li><li><p>Learning more about <a href="http://developers.cloudflare.com/artifacts/concepts/how-artifacts-works/"><u>how Artifacts works</u></a> under the hood</p></li></ul><p>Follow <a href="http://developers.cloudflare.com/changelog/product/artifacts/"><u>the changelog</u></a> to track the beta as it progresses.</p>
    <div>
      <h2>Watch on Cloudflare TV</h2>
      <a href="#watch-on-cloudflare-tv">
        
      </a>
    </div>
    <div>
  
</div>
<p></p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[GitHub]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">2sshzOlmGVsrtBz2mgeceE</guid>
            <dc:creator>Dillon Mulroy</dc:creator>
            <dc:creator>Matt Carey</dc:creator>
            <dc:creator>Matt Silverlock</dc:creator>
        </item>
        <item>
            <title><![CDATA[Deploy Postgres and MySQL databases with PlanetScale + Workers]]></title>
            <link>https://blog.cloudflare.com/deploy-planetscale-postgres-with-workers/</link>
            <pubDate>Thu, 16 Apr 2026 13:00:22 GMT</pubDate>
            <description><![CDATA[ Learn how to deploy PlanetScale Postgres and MySQL databases via Cloudflare and connect Cloudflare Workers. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare announced our PlanetScale partnership last September to give <a href="https://workers.cloudflare.com/"><u>Cloudflare Workers</u></a> direct access to Postgres and MySQL databases for fast, full-stack applications.</p><p>Soon, we’re bringing our technologies even closer: you’ll be able to create PlanetScale Postgres and MySQL databases directly from the Cloudflare dashboard and API, and have them billed to your Cloudflare account. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Tj4gJrV5hxlIWxlmoXVZe/7661c1e47c0c868b88301b5f4aca4441/BLOG-3213_2.png" />
          </figure><p>You choose the data storage that fits your Worker application needs and keep a single system for billing as a Cloudflare self-serve or enterprise customer. Cloudflare credits like those given in our <a href="https://www.cloudflare.com/forstartups/"><u>startup program</u></a> or Cloudflare committed spend can be used towards PlanetScale databases.</p>
    <div>
      <h2>Postgres &amp; MySQL for Workers</h2>
      <a href="#postgres-mysql-for-workers">
        
      </a>
    </div>
    <p>SQL relational databases like Postgres and MySQL are a foundation of modern applications. In particular, Postgres has risen in developer popularity with its rich tooling ecosystem (ORMs, GUIs, etc.) and extensions like <a href="https://github.com/pgvector/pgvector/"><u>pgvector</u></a> for building vector search in AI-driven applications. Postgres is the default choice for most developers who need a powerful, flexible, and scalable database to power their applications.</p><p>You can already connect your PlanetScale account and create Postgres databases directly from the <a href="https://dash.cloudflare.com/?to=/:account/workers/hyperdrive?modal=1">Cloudflare dashboard</a> for your Workers. Starting next month, a new Cloudflare subscription will bill new PlanetScale databases directly to your Cloudflare account, whether you’re a self-serve or enterprise user.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1CHTq1qoaNGSNO5atsiS8J/a8eba618b77362aa467d94c4f625c600/BLOG-3213_3.png" />
          </figure><p><sup><i>How to create PlanetScale databases via the </i></sup><a href="https://dash.cloudflare.com/?to=/:account/workers/hyperdrive?modal=1"><sup><i><u>Cloudflare dashboard</u></i></sup></a><sup><i> after your PlanetScale account is connected. Cloudflare billing is coming next month.</i></sup></p><p>With our built-in integration, PlanetScale databases automatically work with Workers using Hyperdrive, our database connectivity service. <a href="https://blog.cloudflare.com/how-hyperdrive-speeds-up-database-access/"><u>Hyperdrive</u></a> manages database connection pools and query caching to make database queries fast and reliable. You just add a <a href="https://developers.cloudflare.com/workers/runtime-apis/bindings/"><u>binding</u></a> to your Worker’s <a href="https://developers.cloudflare.com/workers/wrangler/configuration/#hyperdrive"><u>config file</u></a>:</p>
            <pre><code>// wrangler.jsonc file
{
  "hyperdrive": [
    {
      "binding": "DATABASE",
      "id": &lt;AUTO_CREATED_ID&gt;
    }
  ]
}
</code></pre>
            <p>And start running SQL queries via your Worker with your Postgres client of choice:</p>
            <pre><code>import { Client } from "pg";

export default {
  async fetch(request, env, ctx) {
   
    const client = new Client({ connectionString: env.DATABASE.connectionString });
    await client.connect();

    const result = await client.query("SELECT * FROM pg_tables");
    // Clean up after the response is sent, and return the rows
    ctx.waitUntil(client.end());
    return Response.json({ tables: result.rows });
  },
};
</code></pre>
            
    <div>
      <h2>PlanetScale developer experience</h2>
      <a href="#planetscale-developer-experience">
        
      </a>
    </div>
    <p>PlanetScale was the obvious choice to provide to the Workers community due to its unrivaled performance and reliability. Developers can choose between two of the most popular relational databases: <a href="https://planetscale.com/docs/postgres/postgres-compatibility"><u>Postgres</u></a> or Vitess MySQL. PlanetScale matches how Cloudflare treats performance and reliability as key features of a developer platform. And with features like <a href="https://planetscale.com/docs/postgres/monitoring/query-insights"><u>query insights</u></a> and <a href="https://planetscale.com/docs/connect/ai-tooling"><u>agent-driven</u></a> workflows for improving SQL query performance, plus <a href="https://planetscale.com/docs/postgres/branching"><u>branching</u></a> for deploying code safely (including database changes), the PlanetScale database developer experience is first-class.</p><p>Cloudflare users get the exact same PlanetScale database developer experience. Your PlanetScale databases can be deployed directly from Cloudflare with connections managed via Hyperdrive, which already makes your existing regional databases fast with global Workers. This means access to the same PlanetScale <a href="https://planetscale.com/docs/plans/planetscale-skus"><u>database clusters</u></a> at standard PlanetScale <a href="https://planetscale.com/pricing"><u>pricing</u></a>, with all features included, like query insights and a detailed breakdown of <a href="https://planetscale.com/docs/billing#organization-usage-and-billing-page"><u>usage and costs</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Pfh4oM8zQSUGJKGEsxF3W/700f627c38279d9d90337b38de72b44e/BLOG-3213_4.png" />
          </figure><p><sup><i>A single node on PlanetScale Postgres starts at </i></sup><a href="https://planetscale.com/blog/5-dollar-planetscale"><sup><i><u>$5/month</u></i></sup></a><sup><i>.</i></sup></p>
    <div>
      <h2>Workers placement</h2>
      <a href="#workers-placement">
        
      </a>
    </div>
    <p>With centralized databases, Workers can run right next to your primary database to reduce latency with an <a href="https://developers.cloudflare.com/workers/configuration/placement/#configure-explicit-placement-hints"><u>explicit placement hint</u></a>. By default, Workers execute closest to a user request, which adds network latency when querying a central database, especially when a request makes multiple queries. Instead, you can configure your Worker to execute in the Cloudflare data center closest to your PlanetScale database. In the future, Cloudflare will be able to automatically set a placement hint based on the location of your PlanetScale database, reducing network latency to single-digit milliseconds.</p>
            <pre><code>{
  "placement": {
    "region": "aws:us-east-1"
  }
}
</code></pre>
            
    <div>
      <h2>Coming soon</h2>
      <a href="#coming-soon">
        
      </a>
    </div>
    <p>You can deploy a PlanetScale Postgres database or connect an existing PlanetScale database to Workers today via the <a href="https://dash.cloudflare.com/?to=/:account/workers/hyperdrive?modal=1"><u>Cloudflare dashboard</u></a>. Everything today is still billed via PlanetScale.</p><p>Launching next month, new PlanetScale databases can be billed to your Cloudflare account. </p><p>We are building more with our PlanetScale partners, such as Cloudflare API integration, so tell us what you’d like to see next.</p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[Database]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Postgres]]></category>
            <category><![CDATA[MySQL]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[PlanetScale]]></category>
            <guid isPermaLink="false">1IGJnHwj5QVRJm9iCdEqYV</guid>
            <dc:creator>Vy Ton</dc:creator>
            <dc:creator>Matt Silverlock</dc:creator>
        </item>
        <item>
            <title><![CDATA[Project Think: building the next generation of AI agents on Cloudflare]]></title>
            <link>https://blog.cloudflare.com/project-think/</link>
            <pubDate>Wed, 15 Apr 2026 13:01:00 GMT</pubDate>
            <description><![CDATA[ Announcing a preview of the next edition of the Agents SDK — from lightweight primitives to a batteries-included platform for AI agents that think, act, and persist.
 ]]></description>
            <content:encoded><![CDATA[ <p>Today, we're introducing Project Think: the next generation of the <a href="https://developers.cloudflare.com/agents/"><u>Agents SDK</u></a>. Project Think is a set of new primitives for building long-running agents (durable execution, sub-agents, sandboxed code execution, persistent sessions) and an opinionated base class that wires them all together. Use the primitives to build exactly what you need, or use the base class to get started fast.</p><p>Something happened earlier this year that changed how we think about AI. Tools like <a href="https://github.com/badlogic/pi-mono"><u>Pi</u></a>, <a href="https://github.com/openclaw"><u>OpenClaw</u></a>, <a href="https://docs.anthropic.com/en/docs/agents"><u>Claude Code</u></a>, and <a href="https://openai.com/codex"><u>Codex</u></a> proved a simple but powerful idea: give an LLM the ability to read files, write code, execute it, and remember what it learned, and you get something that looks less like a developer tool and more like a general-purpose assistant.</p><p>These coding agents aren't just writing code anymore. People are using them to manage calendars, analyze datasets, negotiate purchases, file taxes, and automate entire business workflows. The pattern is always the same: the agent reads context, reasons about it, writes code to take action, observes the result, and iterates. Code is the universal medium of action.</p><p>Our team has been using these coding agents every day. And we kept running into the same walls:</p><ul><li><p><b>They only run on your laptop or an expensive VPS:</b> there's no sharing, no collaboration, no handoff between devices.</p></li><li><p><b>They're expensive when idle</b>: a fixed monthly cost whether the agent is working or not. 
Scale that to a team, or a company, and it adds up fast.</p></li><li><p><b>They require management and manual setup</b>: installing dependencies, managing updates, configuring identity and secrets.</p></li></ul><p>And there's a deeper structural issue. Traditional applications serve many users from one instance. As mentioned in our Welcome to Agents Week post, <a href="https://blog.cloudflare.com/welcome-to-agents-week/"><u>agents are one-to-one</u></a>. Each agent is a unique instance, serving one user, running one task. A restaurant has a menu and a kitchen optimized to churn out dishes at volume. An agent is more like a personal chef: different ingredients, different techniques, different tools every time.</p><p>That fundamentally changes the scaling math. If a hundred million knowledge workers each use an agentic assistant at even modest concurrency, you need capacity for tens of millions of simultaneous sessions. At current per-container costs, that's unsustainable. We need a different foundation.</p><p>That's what we've been building.</p>
    <div>
      <h2>Introducing Project Think</h2>
      <a href="#introducing-project-think">
        
      </a>
    </div>
    <p>Project Think ships a set of new primitives for the Agents SDK:</p><ul><li><p><b>Durable execution</b> with fibers: crash recovery, checkpointing, automatic keepalive</p></li><li><p><b>Sub-agents</b>: isolated child agents with their own SQLite and typed RPC</p></li><li><p><b>Persistent sessions</b>: tree-structured messages, forking, compaction, full-text search</p></li><li><p><b>Sandboxed code execution</b>: Dynamic Workers, codemode, runtime npm resolution</p></li><li><p><b>The execution ladder</b>: workspace, isolate, npm, browser, sandbox</p></li><li><p><b>Self-authored extensions</b>: agents that write their own tools at runtime</p></li></ul><p>Each of these is usable directly with the Agent base class. Build exactly what you need with the primitives, or use the Think base class to get started fast. Let's look at what each one does.</p>
    <div>
      <h2>Long-running agents</h2>
      <a href="#long-running-agents">
        
      </a>
    </div>
    <p>Agents, as they exist today, are ephemeral. They run for a session, tied to a single process or device, and then they are gone. A coding agent that dies when your laptop sleeps, that’s a tool. An agent that persists — that can wake up on demand, continue work after interruptions, and carry forward the state without depending on your local runtime — that starts to look like infrastructure. And it changes the scaling model for agents completely.</p><p>The Agents SDK builds on <a href="https://developers.cloudflare.com/durable-objects/"><u>Durable Objects</u></a> to give every agent an identity, persistent state, and the ability to wake on message. This is the <a href="https://en.wikipedia.org/wiki/Actor_model"><u>actor model</u></a>: each agent is an addressable entity with its own SQLite database. It consumes zero compute when hibernated. When something happens (an HTTP request, a WebSocket message, a scheduled alarm, an inbound email) the platform wakes the agent, loads its state, and hands it the event. The agent does its work, then goes back to sleep.</p><table><tr><th><p>
</p></th><th><p><b>VMs / Containers</b></p></th><th><p><b>Durable Objects</b></p></th></tr><tr><td><p><b>Idle cost</b></p></td><td><p>Full compute cost, always</p></td><td><p>Zero (hibernated)</p></td></tr><tr><td><p><b>Scaling</b></p></td><td><p>Provision and manage capacity</p></td><td><p>Automatic, per-agent</p></td></tr><tr><td><p><b>State</b></p></td><td><p>External database required</p></td><td><p>Built-in SQLite</p></td></tr><tr><td><p><b>Recovery</b></p></td><td><p>You build it (process managers, health checks)</p></td><td><p>Platform restarts, state survives</p></td></tr><tr><td><p><b>Identity / routing</b></p></td><td><p>You build it (load balancers, sticky sessions)</p></td><td><p>Built-in (name → agent)</p></td></tr><tr><td><p><b>10,000 agents, each active 1% of the time</b></p></td><td><p>10,000 always-on instances</p></td><td><p>~100 active at any moment</p></td></tr></table><p>This changes the economics of running agents at scale. Instead of "one expensive agent per power user," you can build "one agent per customer" or "one agent per task" or "one agent per email thread." The marginal cost of spawning a new agent is effectively zero.</p>
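<p>The last table row is just expected-concurrency arithmetic; assuming agents are active independently at a given duty cycle:</p>

```typescript
// With hibernation, you pay for the expected number of simultaneously active
// agents, not the total provisioned count (as with always-on VMs/containers).
function expectedActive(agents: number, dutyCycle: number): number {
  return agents * dutyCycle;
}

// 10,000 agents, each active 1% of the time -> ~100 active at any moment.
console.log(expectedActive(10_000, 0.01));
```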
    <div>
      <h3>Surviving crashes: durable execution with fibers</h3>
      <a href="#surviving-crashes-durable-execution-with-fibers">
        
      </a>
    </div>
    <p>An LLM call can take 30 seconds. A multi-turn agent loop can run for much longer. At any point during that window, the execution environment can vanish: a deploy, a platform restart, hitting resource limits. The upstream connection to the model provider is severed permanently, in-memory state is lost, and connected clients see the stream stop with no explanation.</p><p><a href="https://developers.cloudflare.com/agents/api-reference/durable-execution/"><code><u>runFiber()</u></code></a> solves this. A fiber is a durable function invocation: registered in SQLite before execution begins, checkpointable at any point via <code>stash()</code>, and recoverable on restart via <code>onFiberRecovered</code>.</p>
            <pre><code>import { Agent } from "agents";

export class ResearchAgent extends Agent {
  async startResearch(topic: string) {
    void this.runFiber("research", async (ctx) =&gt; {
      const findings = [];

      for (let i = 0; i &lt; 10; i++) {
        const result = await this.callLLM(`Research step ${i}: ${topic}`);
        findings.push(result);

        // Checkpoint: if evicted, we resume from here
        ctx.stash({ findings, step: i, topic });

        this.broadcast({ type: "progress", step: i });
      }

      return { findings };
    });
  }

  async onFiberRecovered(ctx) {
    if (ctx.name === "research" &amp;&amp; ctx.snapshot) {
      const { topic } = ctx.snapshot;
      await this.startResearch(topic);
    }
  }
}
</code></pre>
            <p>The SDK keeps the agent alive automatically during fiber execution; no special configuration is needed. For work measured in minutes, <code>keepAlive()</code> / <code>keepAliveWhile()</code> prevent eviction during active work. For longer operations (CI pipelines, design reviews, video generation), the agent starts the work, persists the job ID, hibernates, and wakes on callback.</p>
    <div>
      <h3>Delegating work: sub-agents via Facets</h3>
      <a href="#delegating-work-sub-agents-via-facets">
        
      </a>
    </div>
    <p>A single agent shouldn't do everything itself. <a href="https://developers.cloudflare.com/agents/api-reference/sub-agents/"><u>Sub-agents</u></a> are child Durable Objects colocated with the parent via <a href="https://blog.cloudflare.com/durable-object-facets-dynamic-workers/"><u>Facets</u></a>, each with their own isolated SQLite and execution context:</p>
            <pre><code>import { Agent } from "agents";

export class ResearchAgent extends Agent {
  async search(query: string) { /* ... */ }
}

export class ReviewAgent extends Agent {
  async analyze(query: string) { /* ... */ }
}

export class Orchestrator extends Agent {
  async handleTask(task: string) {
    const researcher = await this.subAgent(ResearchAgent, "research");
    const reviewer = await this.subAgent(ReviewAgent, "review");

    const [research, review] = await Promise.all([
      researcher.search(task),
      reviewer.analyze(task)
    ]);

    return this.synthesize(research, review);
  }
}
</code></pre>
            <p>Sub-agents are isolated at the storage level. Each one gets its own SQLite database, and there’s no implicit sharing of data between them. Isolation is enforced by the runtime, and because facets are colocated with the parent, sub-agent RPC has the latency of a function call. TypeScript catches misuse at compile time.</p>
    <div>
      <h3>Conversations that persist: the Session API</h3>
      <a href="#conversations-that-persist-the-session-api">
        
      </a>
    </div>
    <p>Agents that run for days or weeks need more than the typical flat list of messages. The experimental <a href="https://developers.cloudflare.com/agents/api-reference/sessions/"><u>Session API</u></a>, available on the <code>Agent</code> base class, models this explicitly: conversations are stored as trees, where each message has a <code>parent_id</code>. This enables forking (explore an alternative without losing the original path), non-destructive compaction (summarize older messages rather than deleting them), and full-text search across conversation history via <a href="https://www.sqlite.org/fts5.html"><u>FTS5</u></a>.</p>
            <pre><code>import { Agent } from "agents";
import { Session, SessionManager } from "agents/experimental/memory/session";

export class MyAgent extends Agent {
  sessions = SessionManager.create(this);

  async onStart() {
    const session = this.sessions.create("main");
    const history = session.getHistory();
    // messageId: the id of an earlier message to branch from
    const forked = this.sessions.fork(session.id, messageId, "alternative-approach");
  }
}
</code></pre>
            <p>Session is usable directly with <code>Agent</code>, and it's the storage layer that the <code>Think</code> base class builds on.</p>
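            <p>The tree model is easy to picture as plain data. A hedged sketch (our own illustration of the shape described above, not the Session API's actual schema): each message points at its parent, and walking the parent links reconstructs either branch independently.</p>
            <pre><code>type Message = { id: number; parentId: number | null; text: string };

const messages: Message[] = [
  { id: 1, parentId: null, text: "How do I deploy?" },
  { id: 2, parentId: 1, text: "Use wrangler deploy." },
  { id: 3, parentId: 2, text: "What about CI?" },            // original path
  { id: 4, parentId: 2, text: "Can I use GitHub Actions?" }, // fork from message 2
];

// Reconstruct the conversation leading to a given leaf message.
function history(leafId: number): string[] {
  const byId = new Map&lt;number, Message&gt;(messages.map((m) =&gt; [m.id, m]));
  const path: string[] = [];
  for (let m = byId.get(leafId); m; m = m.parentId !== null ? byId.get(m.parentId) : undefined) {
    path.unshift(m.text);
  }
  return path;
}
</code></pre>
            <p>Forking is just adding a message with an existing <code>parent_id</code>: the original branch stays reachable, and compaction can summarize one branch without touching the other.</p>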
    <div>
      <h2>From tool calls to code execution</h2>
      <a href="#from-tool-calls-to-code-execution">
        
      </a>
    </div>
    <p>Conventional tool-calling has an awkward shape. The model calls a tool, pulls the result back through the context window, calls another tool, pulls that back, and so on. As the tool surface grows, this gets both expensive and clumsy. A hundred files means a hundred round-trips through the model.</p><p>But <a href="https://blog.cloudflare.com/code-mode/"><u>models are better at writing code to use a system than they are at playing the tool-calling game</u></a>. This is the insight behind <a href="https://github.com/cloudflare/agents/tree/main/packages/codemode"><u>@cloudflare/codemode</u></a>: instead of sequential tool calls, the LLM writes a single program that handles the entire task.</p>
            <pre><code>// The LLM writes this. It runs in a sandboxed Dynamic Worker.
const files = await tools.find({ pattern: "**/*.ts" });
const results = [];
for (const file of files) {
  const content = await tools.read({ path: file });
  if (content.includes("TODO")) {
    results.push({ file, todos: content.match(/\/\/ TODO:.*/g) });
  }
}
return results;
</code></pre>
            <p>Instead of 100 round-trips to the model, you just run a single program. This leads to fewer tokens used, faster execution, and better results. The <a href="https://github.com/cloudflare/mcp"><u>Cloudflare API MCP server</u></a> demonstrates this at scale. We expose only two tools (<code>search()</code> and <code>execute()</code>), which consume ~1,000 tokens, vs. ~1.17 million tokens for the naive tool-per-endpoint equivalent. This is a 99.9% reduction.</p>
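            <p>To make the two-tool pattern concrete, here is a minimal sketch (our own illustration; the real server's catalog and sandbox are not reproduced): <code>search()</code> filters a catalog of endpoint descriptions so only the relevant handful ever enters context, while <code>execute()</code> would run LLM-written code against them inside a Dynamic Worker.</p>
            <pre><code>type Endpoint = { path: string; summary: string };

// A tiny stand-in for the catalog; the real one has thousands of entries.
const ENDPOINTS: Endpoint[] = [
  { path: "/zones", summary: "List zones" },
  { path: "/zones/{id}/dns_records", summary: "List DNS records" },
];

// Tool 1: look up endpoints by keyword instead of registering each
// endpoint as its own tool.
function search(query: string): Endpoint[] {
  const q = query.toLowerCase();
  return ENDPOINTS.filter((e) =&gt; e.summary.toLowerCase().includes(q));
}

// Tool 2: run LLM-written code against the chosen endpoints. In the real
// system this loads the code into a sandboxed Dynamic Worker, which this
// sketch does not reproduce.
async function execute(code: string): Promise&lt;unknown&gt; {
  throw new Error("sandbox not reproduced in this sketch");
}
</code></pre>
            <p>The token savings come from the catalog living outside the context window: the model pulls in descriptions on demand rather than carrying every endpoint definition on every request.</p>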
    <div>
      <h3>The missing primitive: safe sandboxes</h3>
      <a href="#the-missing-primitive-safe-sandboxes">
        
      </a>
    </div>
    <p>Once you accept that models should write code on behalf of users, the question becomes: where does that code run? Not eventually, not after a product team turns it into a roadmap item. Right now, for this user, against this system, with tightly defined permissions.</p><p><a href="https://blog.cloudflare.com/dynamic-workers/"><u>Dynamic Workers</u></a> are that sandbox. A fresh V8 isolate spun up at runtime, in milliseconds, with a few megabytes of memory. That's roughly 100x faster and up to 100x more memory-efficient than a container. You can start a new one for every single request, run a snippet of code, and throw it away.</p><p>The critical design choice is the capability model. Instead of starting with a general-purpose machine and trying to constrain it, Dynamic Workers begin with almost no ambient authority (<code>globalOutbound: null</code>, no network access) and the developer grants capabilities explicitly, resource by resource, through bindings. We go from asking "how do we stop this thing from doing too much?" to "what exactly do we want this thing to be able to do?"</p><p>This is the right question for agent infrastructure.</p>
    <div>
      <h3>The execution ladder</h3>
      <a href="#the-execution-ladder">
        
      </a>
    </div>
    <p>This capability model leads naturally to a spectrum of compute environments, an <b>execution ladder</b> that the agent escalates through as needed:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6yokfTVcg8frH4snf7c4sp/2306d721650b4956b28e2198f7cf915d/BLOG-3200_2.png" />
          </figure><p><b>Tier 0</b> is the Workspace, a durable virtual filesystem backed by SQLite and R2. Read, write, edit, search, grep, diff. Powered by <a href="https://www.npmjs.com/package/@cloudflare/shell"><code><u>@cloudflare/shell</u></code></a>.</p><p><b>Tier 1</b> is a Dynamic Worker: LLM-generated JavaScript running in a sandboxed isolate with no network access. Powered by <a href="https://www.npmjs.com/package/@cloudflare/codemode"><code><u>@cloudflare/codemode</u></code></a>.</p><p><b>Tier 2</b> adds npm. <a href="https://github.com/cloudflare/agents/tree/main/packages/worker-bundler"><code><u>@cloudflare/worker-bundler</u></code></a> fetches packages from the registry, bundles them with esbuild, and loads the result into the Dynamic Worker. The agent writes <code>import { z } from "zod"</code> and it just works.</p><p><b>Tier 3</b> is a headless browser via <a href="https://developers.cloudflare.com/browser-rendering/"><u>Cloudflare Browser Run</u></a>. Navigate, click, extract, screenshot. Useful when a service doesn't yet expose MCP or APIs for agents.</p><p><b>Tier 4</b> is a <a href="https://developers.cloudflare.com/sandbox/"><u>Cloudflare Sandbox</u></a> configured with your toolchains, repos, and dependencies: <code>git clone</code>, <code>npm test</code>, <code>cargo build</code>, synced bidirectionally with the Workspace.</p><p>The key design principle: <b>the agent should be useful at Tier 0 alone, and each tier is purely additive.</b> The user can add capabilities as they go.</p>
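            <p>The additive-tier idea can be sketched as a simple policy (our own illustration, not SDK API): given what a task needs, the agent escalates to the lowest tier that satisfies it.</p>
            <pre><code>type Needs = { code?: boolean; npm?: boolean; web?: boolean; os?: boolean };

// Pick the lowest tier that covers a task's requirements.
function lowestTier(n: Needs): number {
  if (n.os) return 4;   // full sandbox: git, compilers, test runners
  if (n.web) return 3;  // headless browser
  if (n.npm) return 2;  // bundled npm dependencies in the isolate
  if (n.code) return 1;  // sandboxed Dynamic Worker, no network
  return 0;              // workspace file tools only
}
</code></pre>
            <p>Because every tier includes the capabilities below it, a task that needs npm (Tier 2) still has code execution (Tier 1) and the Workspace (Tier 0) available.</p>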
    <div>
      <h3>Building blocks, not a framework</h3>
      <a href="#building-blocks-not-a-framework">
        
      </a>
    </div>
    <p>All of these primitives are available as standalone packages. <a href="https://blog.cloudflare.com/dynamic-workers/"><u>Dynamic Workers</u></a>, <a href="https://github.com/cloudflare/agents/tree/main/packages/codemode"><code><u>@cloudflare/codemode</u></code></a>, <a href="https://github.com/cloudflare/agents/tree/main/packages/worker-bundler"><code><u>@cloudflare/worker-bundler</u></code></a>, and <a href="https://www.npmjs.com/package/@cloudflare/shell"><code><u>@cloudflare/shell</u></code></a> (a durable filesystem with tools) are all usable directly with the Agent base class. You can combine them to give any agent a workspace, code execution, and runtime package resolution without adopting an opinionated framework.</p>
    <div>
      <h2>The platform</h2>
      <a href="#the-platform">
        
      </a>
    </div>
    <p>Here's the complete stack for building agents on Cloudflare:</p><table><tr><th><p><b>Capability</b></p></th><th><p><b>What it does</b></p></th><th><p><b>Powered by</b></p></th></tr><tr><td><p>Per-agent isolation</p></td><td><p>Every agent is its own world</p></td><td><p><a href="https://developers.cloudflare.com/durable-objects/"><u>Durable Objects</u></a> (DOs)</p></td></tr><tr><td><p>Zero cost when idle</p></td><td><p>$0 until the agent wakes up</p></td><td><p><a href="https://developers.cloudflare.com/durable-objects/best-practices/websockets/#websocket-hibernation-api"><u>DO Hibernation</u></a></p></td></tr><tr><td><p>Persistent state</p></td><td><p>Queryable, transactional storage</p></td><td><p><a href="https://developers.cloudflare.com/durable-objects/best-practices/access-durable-objects-storage/"><u>DO SQLite</u></a></p></td></tr><tr><td><p>Durable filesystem</p></td><td><p>Files that survive restarts</p></td><td><p>Workspace (SQLite + <a href="https://developers.cloudflare.com/r2/"><u>R2</u></a>)</p></td></tr><tr><td><p>Sandboxed code execution</p></td><td><p>Run LLM-generated code safely</p></td><td><p><a href="https://blog.cloudflare.com/dynamic-workers/"><u>Dynamic Workers</u></a> + <a href="https://github.com/cloudflare/agents/tree/main/packages/codemode"><code><u>@cloudflare/codemode</u></code></a></p></td></tr><tr><td><p>Runtime dependencies</p></td><td><p><code>import { z } from "zod"</code> just works</p></td><td><p><a href="https://github.com/cloudflare/agents/tree/main/packages/worker-bundler"><code><u>@cloudflare/worker-bundler</u></code></a></p></td></tr><tr><td><p>Web automation</p></td><td><p>Browse, navigate, fill forms</p></td><td><p><a href="https://developers.cloudflare.com/browser-rendering/"><u>Browser Run</u></a></p></td></tr><tr><td><p>Full OS access</p></td><td><p>git, compilers, test runners</p></td><td><p><a href="https://developers.cloudflare.com/sandbox/"><u>Sandboxes</u></a></p></td></tr><tr><td><p>Scheduled 
execution</p></td><td><p>Proactive, not just reactive</p></td><td><p><a href="https://developers.cloudflare.com/durable-objects/api/alarms/"><u>DO Alarms + Fibers</u></a></p></td></tr><tr><td><p>Real-time streaming</p></td><td><p>Token-by-token to any client</p></td><td><p>WebSockets</p></td></tr><tr><td><p>External tools</p></td><td><p>Connect to any tool server</p></td><td><p>MCP</p></td></tr><tr><td><p>Agent coordination</p></td><td><p>Typed RPC between agents</p></td><td><p>Sub-agents (<a href="https://developers.cloudflare.com/dynamic-workers/usage/durable-object-facets/"><u>Facets</u></a>)</p></td></tr><tr><td><p>Model access</p></td><td><p>Connect to an LLM to power the agent</p></td><td><p><a href="https://developers.cloudflare.com/ai-gateway/"><u>AI Gateway</u></a> + <a href="https://developers.cloudflare.com/workers-ai/"><u>Workers AI</u></a> (or Bring Your Own Model)</p></td></tr></table><p>Each of these is a building block. Together, they form something new: a platform where anyone can build, deploy, and run AI agents as capable as the ones running on your local machine today, but <a href="https://www.cloudflare.com/learning/serverless/what-is-serverless/"><u>serverless</u></a>, durable, and safe by construction. </p>
    <div>
      <h2>The Think base class</h2>
      <a href="#the-think-base-class">
        
      </a>
    </div>
    <p>Now that you've seen the primitives, here's what happens when you wire them all together.</p><p><code>Think</code> is an opinionated harness that handles the full chat lifecycle: agentic loop, message persistence, streaming, tool execution, stream resumption, and extensions. You focus on what makes your agent unique.</p><p>The minimal subclass looks like this:</p>
            <pre><code>import { Think } from "@cloudflare/think";
import { createWorkersAI } from "workers-ai-provider";

export class MyAgent extends Think&lt;Env&gt; {
  getModel() {
    return createWorkersAI({ binding: this.env.AI })(
      "@cf/moonshotai/kimi-k2.5"
    );
  }
}
</code></pre>
            <p>That’s effectively all you need to have a working chat agent with streaming, persistence, abort/cancel, error handling, resumable streams, and a built-in workspace filesystem. Deploy with <code>npx wrangler deploy</code>.</p><p>Think makes decisions for you. When you need more control, you can override the ones you care about:</p><table><tr><td><p><b>Override</b></p></td><td><p><b>Purpose</b></p></td></tr><tr><td><p><code>getModel()</code></p></td><td><p>Return the <code>LanguageModel</code> to use</p></td></tr><tr><td><p><code>getSystemPrompt()</code></p></td><td><p>System prompt</p></td></tr><tr><td><p><code>getTools()</code></p></td><td><p>AI SDK compatible <code>ToolSet</code> for the agentic loop</p></td></tr><tr><td><p><code>maxSteps</code></p></td><td><p>Max tool-call rounds per turn</p></td></tr><tr><td><p><code>configureSession()</code></p></td><td><p>Context blocks, compaction, search, skills</p></td></tr></table><p>Under the hood, Think runs the complete agentic loop on every turn: it assembles the context (base instructions + tool descriptions + skills + memory + conversation history), calls <code>streamText</code>, executes tool calls (with output truncation to prevent context blowup), appends results, loops until the model is done or the step limit is reached. All messages are persisted after each turn.</p>
    <div>
      <h3>Lifecycle hooks</h3>
      <a href="#lifecycle-hooks">
        
      </a>
    </div>
    <p>Think gives you hooks at every stage of the chat turn, without requiring you to own the whole pipeline:</p>
            <pre><code>beforeTurn()
  → streamText()
    → beforeToolCall()
    → afterToolCall()
  → onStepFinish()
→ onChatResponse()
</code></pre>
            <p>With these hooks you can switch to a lower-cost model for follow-up turns, limit the tools the model can use, and pass in client-side context on each turn. You can also log every tool call to analytics and automatically trigger one more follow-up turn after the model completes, all without replacing <code>onChatMessage</code>.</p>
    <div>
      <h3>Persistent memory and long conversations</h3>
      <a href="#persistent-memory-and-long-conversations">
        
      </a>
    </div>
    <p>Think builds on the <a href="https://developers.cloudflare.com/agents/api-reference/sessions/"><u>Session API</u></a> as its storage layer, giving you tree-structured messages with branching built in.</p><p>On top of that, it adds persistent memory through <b>context blocks</b>. These are structured sections of the system prompt that the model can read and update over time, and they persist across hibernation. The model sees "MEMORY (Important facts, use set_context to update) [42%, 462/1100 tokens]" and can proactively remember things.</p>
            <pre><code>configureSession(session: Session) {
  return session
    .withContext("soul", {
      provider: { get: async () =&gt; "You are a helpful coding assistant." }
    })
    .withContext("memory", {
      description: "Important facts learned during conversation.",
      maxTokens: 2000
    })
    .withCachedPrompt();
}
</code></pre>
            <p>Sessions are flexible. You can run multiple conversations per agent and fork them to try a different direction without losing the original.</p><p>As context grows, Think handles limits with non-destructive compaction. Older messages are summarized instead of removed, while the full history remains stored in SQLite.</p><p>Search is built in as well. Using FTS5, you can query conversation history within a session or across all sessions. The agent is also able to search its own past using the <code>search_context</code> tool.</p>
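            <p>Non-destructive compaction is easy to picture with a sketch (our own illustration, not Think's implementation): only the <i>context</i> sent to the model swaps older messages for a summary, while the stored history stays intact.</p>
            <pre><code>type Msg = { id: number; text: string };

// Build the model-facing context: summarize everything except the most
// recent messages. `summarize` stands in for an LLM summarization call.
function compactForContext(
  history: Msg[],
  keepRecent: number,
  summarize: (old: Msg[]) =&gt; string
): string[] {
  if (history.length &lt;= keepRecent) return history.map((m) =&gt; m.text);
  const old = history.slice(0, history.length - keepRecent);
  const recent = history.slice(history.length - keepRecent);
  // The stored history is untouched; only the returned context is compacted.
  return [summarize(old), ...recent.map((m) =&gt; m.text)];
}
</code></pre>
            <p>Because the original messages remain in SQLite, full-text search and forking still work over the complete history even after many rounds of compaction.</p>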
    <div>
      <h3>The full execution ladder, wired in</h3>
      <a href="#the-full-execution-ladder-wired-in">
        
      </a>
    </div>
    <p>Think integrates the entire execution ladder into a single <code>getTools()</code> return:</p>
            <pre><code>import { Think } from "@cloudflare/think";
import { createWorkspaceTools } from "@cloudflare/think/tools/workspace";
import { createExecuteTool } from "@cloudflare/think/tools/execute";
import { createBrowserTools } from "@cloudflare/think/tools/browser";
import { createSandboxTools } from "@cloudflare/think/tools/sandbox";
import { createExtensionTools } from "@cloudflare/think/tools/extensions";

export class MyAgent extends Think&lt;Env&gt; {
  extensionLoader = this.env.LOADER;

  getModel() {
    /* ... */
  }

  getTools() {
    return {
      execute: createExecuteTool({
        tools: createWorkspaceTools(this.workspace),
        loader: this.env.LOADER
      }),
      ...createBrowserTools(this.env.BROWSER),
      ...createSandboxTools(this.env.SANDBOX), // configured per-agent: toolchains, repos, snapshots
      ...createExtensionTools({ manager: this.extensionManager! }),
      ...this.extensionManager!.getTools()
    };
  }
}
</code></pre>
            
    <div>
      <h3>Self-authored extensions</h3>
      <a href="#self-authored-extensions">
        
      </a>
    </div>
    <p>Think takes code execution one step further. An agent can write its own extensions: TypeScript programs that run in Dynamic Workers, declaring permissions for network access and workspace operations.</p>
            <pre><code>{
  "name": "github",
  "description": "GitHub integration: PRs, issues, repos",
  "tools": ["create_pr", "list_issues", "review_pr"],
  "permissions": {
    "network": ["api.github.com"],
    "workspace": "read-write"
  }
}
</code></pre>
            <p>Think's <code>ExtensionManager</code> bundles the extension (optionally with npm deps via <code>@cloudflare/worker-bundler</code>), loads it into a Dynamic Worker, and registers the new tools. The extension persists in DO storage and survives hibernation. The next time the user asks about pull requests, the agent has a <code>github_create_pr</code> tool that didn't exist 30 seconds ago.</p><p>This is the kind of self-improvement loop that makes agents genuinely more useful over time. Not through fine-tuning or RLHF, but through code. The agent is able to write new capabilities for itself, all in sandboxed, auditable, and revocable TypeScript.</p>
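            <p>The permission model in the manifest above is straightforward to enforce. A hedged sketch (our own illustration, not the actual <code>ExtensionManager</code> code): an outbound request from the extension's sandbox is allowed only if its hostname appears in the manifest's <code>permissions.network</code> list.</p>
            <pre><code>// Mirrors the "permissions" section of the manifest shown above.
const permissions = {
  network: ["api.github.com"],
  workspace: "read-write",
};

// Gate outbound fetches from the extension's Dynamic Worker: anything
// not explicitly granted is rejected.
function isAllowed(url: string): boolean {
  return permissions.network.includes(new URL(url).hostname);
}
</code></pre>
            <p>This is the capability model in miniature: the sandbox starts with no network access at all, and the manifest grants exactly the hosts the extension declared.</p>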
    <div>
      <h3>Sub-agent RPC</h3>
      <a href="#sub-agent-rpc">
        
      </a>
    </div>
    <p>Think also works as a sub-agent, called via <code>chat()</code> over RPC from a parent, with streaming events via callback:</p>
            <pre><code>const researcher = await this.subAgent(ResearchSession, "research");
const result = await researcher.chat(`Research this: ${task}`, streamRelay);
</code></pre>
            <p>Each child gets its own conversation tree, memory, tools, and model. The parent doesn't need to know the details.</p>
    <div>
      <h3>Getting started</h3>
      <a href="#getting-started">
        
      </a>
    </div>
    <p>Project Think is experimental. The core API surface works today but will continue to evolve in the coming days and weeks. We're already using it internally to build our own background agent infrastructure, and we're sharing it early so you can build alongside us.</p>
            <pre><code>npm install @cloudflare/think agents ai @cloudflare/shell zod workers-ai-provider</code></pre>
            
            <pre><code>// src/server.ts
import { Think } from "@cloudflare/think";
import { createWorkersAI } from "workers-ai-provider";
import { routeAgentRequest } from "agents";

export class MyAgent extends Think&lt;Env&gt; {
  getModel() {
    return createWorkersAI({ binding: this.env.AI })(
      "@cf/moonshotai/kimi-k2.5"
    );
  }
}

export default {
  async fetch(request: Request, env: Env) {
    return (
      (await routeAgentRequest(request, env)) ||
      new Response("Not found", { status: 404 })
    );
  }
} satisfies ExportedHandler&lt;Env&gt;;
</code></pre>
            
            <pre><code>// src/client.tsx
import { useAgent } from "agents/react";
import { useAgentChat } from "@cloudflare/ai-chat/react";

function Chat() {
  const agent = useAgent({ agent: "MyAgent" });
  const { messages, sendMessage, status } = useAgentChat({ agent });
  // Render your chat UI
}
</code></pre>
            <p>Think speaks the same WebSocket protocol as <code>@cloudflare/ai-chat</code>, so existing UI components work out of the box. If you've built on <a href="https://developers.cloudflare.com/agents/api-reference/chat-agents/"><code><u>AIChatAgent</u></code></a>, your client code doesn't change.</p>
    <div>
      <h2>The third wave</h2>
      <a href="#the-third-wave">
        
      </a>
    </div>
    <p>We see three waves of AI agents:</p><p><b>The first wave was chatbots.</b> They were stateless, reactive, and fragile. Every conversation started from scratch with no memory, no tools, and no ability to act. This made them useful for answering questions, but limited them to only answering questions.</p><p><b>The second wave was coding agents.</b> These are stateful, tool-using, and far more capable: tools like Pi, Claude Code, OpenClaw, and Codex. These agents can read codebases, write code, execute it, and iterate. They proved that an LLM with the right tools is a general-purpose machine, but they run on your laptop, for one user, with no durability guarantees.</p><p><b>Now we are entering the third wave: agents as infrastructure.</b> Durable, distributed, structurally safe, and serverless. These are agents that run on the Internet, survive failures, cost nothing when idle, and enforce security through architecture rather than behavior. Agents that any developer can build and deploy for any number of users.</p><p>This is the direction we’re betting on.</p><p>The Agents SDK is already powering thousands of production agents. With Project Think and the primitives it introduces, we're adding the missing pieces to make those agents dramatically more capable: persistent workspaces, sandboxed code execution, durable long-running tasks, structural security, sub-agent coordination, and self-authored extensions.</p><p>It's available today in preview. We're building alongside you, and we'd genuinely love to see what you (and your coding agent) create with it.</p><hr /><p><sup><i>Think is part of the Cloudflare Agents SDK, available as @cloudflare/think. The features described in this post are in preview. APIs may change as we incorporate feedback. 
Check the </i></sup><a href="https://github.com/cloudflare/agents/blob/main/docs/think/index.md"><sup><i><u>documentation</u></i></sup></a><sup><i> and </i></sup><a href="https://github.com/cloudflare/agents/tree/main/examples/assistant"><sup><i><u>example</u></i></sup></a><sup><i> to get started.</i></sup></p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/161Wz7Tf8Cpzn2u2cBCH3V/37633c016734590005edd280732e89b9/BLOG-3200_3.png" />
          </figure><p></p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Durable Objects]]></category>
            <category><![CDATA[AI]]></category>
            <guid isPermaLink="false">3r2ykMs0LTSPwVHmVWldCy</guid>
            <dc:creator>Sunil Pai</dc:creator>
            <dc:creator>Kate Reznykova</dc:creator>
        </item>
        <item>
            <title><![CDATA[Durable Objects in Dynamic Workers: Give each AI-generated app its own database]]></title>
            <link>https://blog.cloudflare.com/durable-object-facets-dynamic-workers/</link>
            <pubDate>Mon, 13 Apr 2026 13:08:35 GMT</pubDate>
            <description><![CDATA[ We’re introducing Durable Object Facets, allowing Dynamic Workers to instantiate Durable Objects with their own isolated SQLite databases. This enables developers to build platforms that run persistent, stateful code generated on-the-fly.
 ]]></description>
            <content:encoded><![CDATA[ <p>A few weeks ago, we announced <a href="https://blog.cloudflare.com/dynamic-workers/"><u>Dynamic Workers</u></a>, a new feature of the Workers platform which lets you load Worker code on-the-fly into a secure sandbox. The Dynamic Worker Loader API essentially provides direct access to the basic compute isolation primitive that Workers has been based on all along: isolates, not containers. Isolates are much lighter-weight than containers, and as such, can load 100x faster using 1/10 the memory. They are so efficient, they can be treated as "disposable": start one up to run a few lines of code, then throw it away. Like a secure version of eval(). </p><p>Dynamic Workers have many uses. In the original announcement, we focused on how to use them to run AI-agent-generated code as an alternative to tool calls. In this use case, an AI agent performs actions at the request of a user by writing a few lines of code and executing them. The code is single-use, intended to perform one task one time, and is thrown away immediately after it executes.</p><p>But what if you want an AI to generate more persistent code? What if you want your AI to build a small application with a custom UI the user can interact with? What if you want that application to have long-lived state? But of course, you still want it to run in a secure sandbox.</p><p>One way to do this would be to use Dynamic Workers, and simply provide the Worker with an <a href="https://developers.cloudflare.com/workers/runtime-apis/rpc/"><u>RPC</u></a> API that gives it access to storage. 
Using <a href="https://developers.cloudflare.com/dynamic-workers/usage/bindings/"><u>bindings</u></a>, you could give the Dynamic Worker an API that points back to your remote SQL database (perhaps backed by <a href="https://developers.cloudflare.com/d1/"><u>Cloudflare D1</u></a>, or a Postgres database you access through <a href="https://developers.cloudflare.com/hyperdrive/"><u>Hyperdrive</u></a> — it's up to you).</p><p>But Workers also has a unique and extremely fast type of storage that may be a perfect fit for this use case: <a href="https://developers.cloudflare.com/durable-objects/"><u>Durable Objects</u></a>. A Durable Object is a special kind of Worker that has a unique name, with one instance globally per name. That instance has a SQLite database attached, which lives <i>on local disk</i> on the machine where the Durable Object runs. This makes storage access ridiculously fast: there is effectively <a href="https://blog.cloudflare.com/sqlite-in-durable-objects/"><u>zero latency</u></a>.</p><p>Perhaps, then, what you really want is for your AI to write code for a Durable Object, and then you want to run that code in a Dynamic Worker.</p>
    <div>
      <h2><b>But how?</b></h2>
      <a href="#but-how">
        
      </a>
    </div>
    <p>This presents a weird problem. Normally, to use Durable Objects you have to:</p><ol><li><p>Write a class extending <code>DurableObject</code>.</p></li><li><p>Export it from your Worker's main module.</p></li><li><p><a href="https://developers.cloudflare.com/durable-objects/get-started/#5-configure-durable-object-class-with-sqlite-storage-backend"><u>Specify in your Wrangler config</u></a> that storage should be provisioned for this class. This creates a Durable Object namespace that points at your class for handling incoming requests.</p></li><li><p><a href="https://developers.cloudflare.com/durable-objects/get-started/#4-configure-durable-object-bindings"><u>Declare a Durable Object namespace binding</u></a> pointing at your namespace (or use <a href="https://developers.cloudflare.com/workers/runtime-apis/context/#exports"><u>ctx.exports</u></a>), and use it to make requests to your Durable Object.</p></li></ol><p>This doesn't extend naturally to Dynamic Workers. First, there is the obvious problem: The code is dynamic. You run it without invoking the Cloudflare API at all. But Durable Object storage has to be provisioned through the API, and the namespace has to point at an implementing class. It can't point at your Dynamic Worker.</p><p>But there is a deeper problem: Even if you could somehow configure a Durable Object namespace to point directly at a Dynamic Worker, would you want to? Do you want your agent (or user) to be able to create a whole namespace full of Durable Objects? To use unlimited storage spread around the world?</p><p>You probably don't. You probably want some control. You may want to limit, or at least track, how many objects they create. Maybe you want to limit them to just one object (probably good enough for vibe-coded personal apps). You may want to add logging and other observability. Metrics. Billing. 
Etc.</p><p>To do all this, what you really want is for requests to these Durable Objects to go to <i>your</i> code <i>first</i>, where you can then do all the "logistics", and <i>then</i> forward the request into the agent's code. You want to write a <i>supervisor</i> that runs as part of every Durable Object.</p>
    <div>
      <h2><b>Solution: Durable Object Facets</b></h2>
      <a href="#solution-durable-object-facets">
        
      </a>
    </div>
    <p>Today we are releasing, in open beta, a feature that solves this problem.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/mUUk7svflWvIp5Ff3npbG/cd2ec9a7111681657c37e3560fd9af58/BLOG-3211_2.png" />
          </figure><p><a href="https://developers.cloudflare.com/dynamic-workers/usage/durable-object-facets/"><u>Durable Object Facets</u></a> allow you to load and instantiate a Durable Object class dynamically, while providing it with a SQLite database to use for storage. With Facets:</p><ul><li><p>First you create a normal Durable Object namespace, pointing to a class <i>you</i> write.</p></li><li><p>In that class, you load the agent's code as a Dynamic Worker, and call into it.</p></li><li><p>The Dynamic Worker's code can implement a Durable Object class directly. That is, it literally exports a class declared as <code>extends DurableObject</code>.</p></li><li><p>You are instantiating that class as a "facet" of your own Durable Object.</p></li><li><p>The facet gets its own SQLite database, which it can use via the normal Durable Object storage APIs. This database is separate from the supervisor's database, but the two are stored together as part of the same overall Durable Object.</p></li></ul>
    <div>
      <h2><b>How it works</b></h2>
      <a href="#how-it-works">
        
      </a>
    </div>
    <p>Here is a simple, complete implementation of an app platform that dynamically loads and runs a Durable Object class:</p>
            <pre><code>import { DurableObject } from "cloudflare:workers";

// For the purpose of this example, we'll use this static
// application code, but in the real world this might be generated
// by AI (or even, perhaps, a human user).
const AGENT_CODE = `
  import { DurableObject } from "cloudflare:workers";

  // Simple app that remembers how many times it has been invoked
  // and returns it.
  export class App extends DurableObject {
    fetch(request) {
      // We use storage.kv here for simplicity, but storage.sql is
      // also available. Both are backed by SQLite.
      let counter = this.ctx.storage.kv.get("counter") || 0;
      ++counter;
      this.ctx.storage.kv.put("counter", counter);

      return new Response("You've made " + counter + " requests.\\n");
    }
  }
`;

// AppRunner is a Durable Object you write that is responsible for
// dynamically loading applications and delivering requests to them.
// Each instance of AppRunner contains a different app.
export class AppRunner extends DurableObject {
  async fetch(request) {
    // We've received an HTTP request, which we want to forward into
    // the app.

    // The app itself runs as a child facet named "app". One Durable
    // Object can have any number of facets (subject to storage limits)
    // with different names, but in this case we have only one. Call
    // this.ctx.facets.get() to get a stub pointing to it.
    let facet = this.ctx.facets.get("app", async () =&gt; {
      // If this callback is called, it means the facet hasn't
      // started yet (or has hibernated). In this callback, we can
      // tell the system what code we want it to load.

      // Load the Dynamic Worker.
      let worker = this.#loadDynamicWorker();

      // Get the exported class we're interested in.
      let appClass = worker.getDurableObjectClass("App");

      return { class: appClass };
    });

    // Forward request to the facet.
    // (Alternatively, you could call RPC methods here.)
    return await facet.fetch(request);
  }

  // RPC method that a client can call to set the dynamic code
  // for this app.
  setCode(code) {
    // Store the code in the AppRunner's SQLite storage.
    // Each unique code must have a unique ID to pass to the
    // Dynamic Worker Loader API, so we generate one randomly.
    this.ctx.storage.kv.put("codeId", crypto.randomUUID());
    this.ctx.storage.kv.put("code", code);
  }

  #loadDynamicWorker() {
    // Use the Dynamic Worker Loader API like normal. Use get()
    // rather than load() since we may load the same Worker many
    // times.
    let codeId = this.ctx.storage.kv.get("codeId");
    return this.env.LOADER.get(codeId, async () =&gt; {
      // This Worker hasn't been loaded yet. Load its code from
      // our own storage.
      let code = this.ctx.storage.kv.get("code");

      return {
        compatibilityDate: "2026-04-01",
        mainModule: "worker.js",
        modules: { "worker.js": code },
        globalOutbound: null,  // block network access
      }
    });
  }
}

// This is a simple Workers HTTP handler that uses AppRunner.
export default {
  async fetch(req, env, ctx) {
    // Get the instance of AppRunner named "my-app".
    // (Each name has exactly one Durable Object instance in the
    // world.)
    let obj = ctx.exports.AppRunner.getByName("my-app");

    // Initialize it with code. (In a real use case, you'd only
    // want to call this once, not on every request.)
    await obj.setCode(AGENT_CODE);

    // Forward the request to it.
    return await obj.fetch(req);
  }
}
</code></pre>
<p>In this example:</p><ul><li><p><code>AppRunner</code> is a "normal" Durable Object written by the platform developer (you).</p></li><li><p>Each instance of <code>AppRunner</code> manages one application. It stores the app code and loads it on demand.</p></li><li><p>The application itself implements and exports a Durable Object class, which the platform expects to be named <code>App</code>.</p></li><li><p><code>AppRunner</code> loads the application code using Dynamic Workers, and then executes the code as a Durable Object Facet.</p></li><li><p>Each instance of <code>AppRunner</code> is one Durable Object composed of <i>two</i> SQLite databases: one belonging to the parent (<code>AppRunner</code> itself) and one belonging to the facet (<code>App</code>). These databases are isolated: the application cannot read <code>AppRunner</code>'s database, only its own.</p></li></ul><p>To run the example, copy the code above into a file <code>worker.js</code>, pair it with the following <code>wrangler.jsonc</code>, and run it locally with <code>npx wrangler dev</code>.</p>
            <pre><code>// wrangler.jsonc for the above sample worker.
{
  "compatibility_date": "2026-04-01",
  "main": "worker.js",
  "migrations": [
    {
      "tag": "v1",
      "new_sqlite_classes": [
        "AppRunner"
      ]
    }
  ],
  "worker_loaders": [
    {
      "binding": "LOADER",
    },
  ],
}
</code></pre>
            
    <div>
      <h2><b>Start building</b></h2>
      <a href="#start-building">
        
      </a>
    </div>
    <p>Facets are a feature of Dynamic Workers, available in beta immediately to users on the Workers Paid plan.</p><p>Check out the documentation to learn more about <a href="https://developers.cloudflare.com/dynamic-workers/"><u>Dynamic Workers</u></a> and <a href="https://developers.cloudflare.com/dynamic-workers/usage/durable-object-facets/"><u>Facets</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Durable Objects]]></category>
            <category><![CDATA[Storage]]></category>
            <guid isPermaLink="false">2OYAJUdGLODlCXKKdCZMeG</guid>
            <dc:creator>Kenton Varda</dc:creator>
        </item>
        <item>
            <title><![CDATA[Investigating multi-vector attacks in Log Explorer]]></title>
            <link>https://blog.cloudflare.com/investigating-multi-vector-attacks-in-log-explorer/</link>
            <pubDate>Tue, 10 Mar 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Log Explorer customers can now identify and investigate multi-vector attacks. Log Explorer supports 14 additional Cloudflare datasets, enabling users to have a 360-degree view of their network. ]]></description>
            <content:encoded><![CDATA[ <p>In the world of cybersecurity, a single data point is rarely the whole story. Modern attackers don’t just knock on the front door; they probe your APIs, flood your network with "noise" to distract your team, and attempt to slide through applications and servers using stolen credentials.</p><p>To stop these multi-vector attacks, you need the full picture. By using Cloudflare Log Explorer to conduct security forensics, you get 360-degree visibility through the integration of 14 new datasets, covering the full surface of Cloudflare’s Application Services and Cloudflare One product portfolios. By correlating telemetry from application-layer HTTP requests, network-layer DDoS and Firewall logs, and Zero Trust Access events, security analysts can significantly reduce Mean Time to Detect (MTTD) and effectively unmask sophisticated, multi-layered attacks.</p><p>Read on to learn more about how Log Explorer gives security teams the ultimate landscape for rapid, deep-dive forensics.</p>
    <div>
      <h2>The flight recorder for your entire stack</h2>
      <a href="#the-flight-recorder-for-your-entire-stack">
        
      </a>
    </div>
    <p>The contemporary digital landscape requires deep, correlated telemetry to defend against adversaries using multiple attack vectors. Raw logs serve as the "flight recorder" for an application, capturing every single interaction, attack attempt, and performance bottleneck. And because Cloudflare sits at the edge, between your users and your servers, all of these events are logged before the requests even reach your infrastructure. </p><p>Cloudflare Log Explorer centralizes these logs into a unified interface for rapid investigation.</p>
    <div>
<h3>Log types supported</h3>
      <a href="#log-types-supported">
        
      </a>
    </div>
    
    <div>
<h4>Zone-scoped logs</h4>
      <a href="#zone-scoped-logs">
        
      </a>
    </div>
<p><i>Focus: Website traffic, security events, and edge performance.</i></p><table><tr><td><p><b>HTTP Requests</b></p></td><td><p>As the most comprehensive dataset, it serves as the "primary record" of all application-layer traffic, enabling the reconstruction of session activity, exploit attempts, and bot patterns.</p></td></tr><tr><td><p><b>Firewall Events</b></p></td><td><p>Provides critical evidence of blocked or challenged threats, allowing analysts to identify the specific WAF rules, IP reputations, or custom filters that intercepted an attack.</p></td></tr><tr><td><p><b>DNS Logs</b></p></td><td><p>Identify cache poisoning attempts, domain hijacking, and infrastructure-level reconnaissance by tracking every query resolved at the authoritative edge.</p></td></tr><tr><td><p><b>NEL (Network Error Logging) Reports</b></p></td><td><p>Distinguish between a coordinated Layer 7 DDoS attack and legitimate network connectivity issues by tracking client-side browser errors.</p></td></tr><tr><td><p><b>Spectrum Events</b></p></td><td><p>For non-web applications, these logs provide visibility into L4 traffic (TCP/UDP), helping to identify anomalies or brute-force attacks against protocols like SSH, RDP, or custom gaming traffic.</p></td></tr><tr><td><p><b>Page Shield</b></p></td><td><p>Track and audit unauthorized changes to your site's client-side environment, such as JavaScript dependencies and outbound connections.</p></td></tr><tr><td><p><b>Zaraz Events</b></p></td><td><p>Examine how third-party tools and trackers are interacting with user data, which is vital for auditing privacy compliance and detecting unauthorized script behaviors.</p></td></tr></table>
    <div>
<h4>Account-scoped logs</h4>
      <a href="#account-scoped-logs">
        
      </a>
    </div>
<p><i>Focus: Internal security, Zero Trust, administrative changes, and network activity.</i></p><table><tr><td><p><b>Access Requests</b></p></td><td><p>Tracks identity-based authentication events to determine which users accessed specific internal applications and whether those attempts were authorized.</p></td></tr><tr><td><p><b>Audit Logs</b></p></td><td><p>Provides a trail of configuration changes within the Cloudflare dashboard to identify unauthorized administrative actions or modifications.</p></td></tr><tr><td><p><b>CASB Findings</b></p></td><td><p>Identifies security misconfigurations and data risks within SaaS applications (like Google Drive or Microsoft 365) to prevent unauthorized data exposure.</p></td></tr><tr><td><p><b>Magic Transit / IPSec Logs</b></p></td><td><p>Helps network engineers perform network-level (L3) monitoring, such as reviewing tunnel health and viewing BGP routing changes.</p></td></tr><tr><td><p><b>Browser Isolation Logs</b></p></td><td><p>Tracks user actions <i>inside</i> an isolated browser session (e.g., copy-paste, print, or file uploads) to prevent data leaks on untrusted sites.</p></td></tr><tr><td><p><b>Device Posture Results</b></p></td><td><p>Details the security health and compliance status of devices connecting to your network, helping to identify compromised or non-compliant endpoints.</p></td></tr><tr><td><p><b>DEX Application Tests</b></p></td><td><p>Monitors application performance from the user's perspective, which can help distinguish between a security-related outage and a standard performance degradation.</p></td></tr><tr><td><p><b>DEX Device State Events</b></p></td><td><p>Provides telemetry on the physical state of user devices, useful for correlating hardware or OS-level anomalies with potential security incidents.</p></td></tr><tr><td><p><b>DNS Firewall Logs</b></p></td><td><p>Tracks DNS queries filtered through the DNS Firewall to identify communication with known malicious domains or command-and-control (C2) servers.</p></td></tr><tr><td><p><b>Email Security Alerts</b></p></td><td><p>Logs malicious email activity and phishing attempts detected at the gateway to trace the origin of email-based entry vectors.</p></td></tr><tr><td><p><b>Gateway DNS</b></p></td><td><p>Monitors every DNS query made by users on your network to identify shadow IT, malware callbacks, or domain-generation algorithms (DGAs).</p></td></tr><tr><td><p><b>Gateway HTTP</b></p></td><td><p>Provides full visibility into encrypted and unencrypted web traffic to detect hidden payloads, malicious file downloads, or unauthorized SaaS usage.</p></td></tr><tr><td><p><b>Gateway Network</b></p></td><td><p>Tracks L3/L4 network traffic (non-HTTP) to identify unauthorized port usage, protocol anomalies, or lateral movement within the network.</p></td></tr><tr><td><p><b>IPSec Logs</b></p></td><td><p>Monitors the status and traffic of encrypted site-to-site tunnels to ensure the integrity and availability of secure network connections.</p></td></tr><tr><td><p><b>Magic IDS Detections</b></p></td><td><p>Surfaces matches against intrusion detection signatures to alert investigators to known exploit patterns or malware behavior traversing the network.</p></td></tr><tr><td><p><b>Network Analytics Logs</b></p></td><td><p>Provides high-level visibility into packet-level data to identify volumetric DDoS attacks or unusual traffic spikes targeting specific infrastructure.</p></td></tr><tr><td><p><b>Sinkhole HTTP Logs</b></p></td><td><p>Captures traffic directed to "sinkholed" IP addresses to confirm which internal devices are attempting to communicate with known botnet infrastructure.</p></td></tr><tr><td><p><b>WARP Config Changes</b></p></td><td><p>Tracks modifications to the WARP client settings on end-user devices to ensure that security agents haven't been tampered with or disabled.</p></td></tr><tr><td><p><b>WARP Toggle Changes</b></p></td><td><p>Specifically logs when users enable or disable their secure connectivity, helping to identify periods where a device may have been unprotected.</p></td></tr><tr><td><p><b>Zero Trust Network Session Logs</b></p></td><td><p>Logs the duration and status of authenticated user sessions to map out the complete lifecycle of a user's access within the protected perimeter.</p></td></tr></table>
    <div>
      <h2>Log Explorer can identify malicious activity at every stage</h2>
      <a href="#log-explorer-can-identify-malicious-activity-at-every-stage">
        
      </a>
    </div>
<p>Get granular application-layer visibility with <b>HTTP Requests</b>, <b>Firewall Events</b>, and <b>DNS logs</b> to see exactly how traffic is hitting your public-facing properties. Track internal movement with <b>Access Requests</b>, <b>Gateway logs</b>, and <b>Audit logs</b>. If a credential is compromised, you’ll see where the attacker went. Use <b>Magic IDS</b> and <b>Network Analytics logs</b> to spot volumetric attacks and "East-West" lateral movement within your private network.</p>
    <div>
      <h3>Identify the reconnaissance</h3>
      <a href="#identify-the-reconnaissance">
        
      </a>
    </div>
<p>Attackers use scanners and other tools to look for entry points, hidden directories, or software vulnerabilities. To identify this in Log Explorer, query <code>http_requests</code> for any <code>EdgeResponseStatus</code> codes of 401, 403, or 404 coming from a single IP, or requests to sensitive paths (e.g. <code>/.env</code>, <code>/.git</code>, <code>/wp-admin</code>).</p><p>The <code>magic_ids_detections</code> logs can also be used to identify scanning at the network layer. These logs provide packet-level visibility into threats targeting your network. Unlike standard HTTP logs, they focus on <b>signature-based detections</b> at the network and transport layers (IP, TCP, UDP). Query for cases where a single <code>SourceIP</code> is triggering multiple unique detections across a wide range of <code>DestinationPort</code> values in a short timeframe. Magic IDS signatures can specifically flag activities like Nmap scans or SYN stealth scans.</p>
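<p>For example, a query in the spirit of the others in this post can surface that probing. This is a sketch — adapt the date, status code, and path pattern to your investigation, and verify field names such as <code>ClientRequestPath</code> against the dataset schema:</p>
            <pre><code>SELECT ClientIP, ClientRequestPath, EdgeResponseStatus
FROM http_requests
WHERE date = '2026-02-22'
  AND EdgeResponseStatus = 404
  AND ClientRequestPath LIKE '%/wp-admin%'
LIMIT 100</code></pre>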
    <div>
      <h3>Check for diversions</h3>
      <a href="#check-for-diversions">
        
      </a>
    </div>
    <p>While the attacker is conducting reconnaissance, they may attempt to disguise this with a simultaneous network flood. Pivot to <code>network_analytics_logs</code> to see if a volumetric attack is being used as a smokescreen.</p>
    <div>
      <h3>Identify the approach </h3>
      <a href="#identify-the-approach">
        
      </a>
    </div>
<p>Once attackers identify a potential vulnerability, they begin to craft their weapon. The attacker sends malicious payloads (e.g. SQL injection or large/corrupt file uploads) to confirm the vulnerability. Review <code>http_requests</code> and/or <code>fw_events</code> to identify any Cloudflare detection tools that have triggered. Cloudflare logs security signals in these datasets to easily identify requests with malicious payloads using fields such as <code>WAFAttackScore</code>, <code>WAFSQLiAttackScore</code>, <code>FraudAttack</code>, <code>ContentScanJobResults</code>, and several more. Review <a href="https://developers.cloudflare.com/logs/logpush/logpush-job/datasets/zone/http_requests/"><u>our documentation</u></a> to get a full understanding of these fields. The <code>fw_events</code> logs can be used to determine whether these requests made it past Cloudflare’s defenses by examining the <code>action</code>, <code>source</code>, and <code>ruleID</code> fields. Cloudflare’s managed rules block many of these payloads by default. Review the Application Security Overview to confirm that your application is protected.</p>
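<p>As a starting point, a sketch of that <code>fw_events</code> query — the exact field casing and action values may differ, so check the dataset schema in Log Explorer:</p>
            <pre><code>SELECT ClientIP, Action, Source, RuleID
FROM fw_events
WHERE date = '2026-02-22'
  AND Action != 'allow' -- e.g. 'block' or 'challenge' indicates a triggered defense
LIMIT 100</code></pre>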
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1zpFguYrnbOPwyASGQCqZK/63f398acce2226e453a5eea1cc749241/image3.png" />
          </figure><p><sup><i>Showing the Managed rules Insight that displays on Security Overview if the current zone does not have Managed Rules enabled</i></sup></p>
    <div>
      <h3>Audit the identity</h3>
      <a href="#audit-the-identity">
        
      </a>
    </div>
    <p>Did that suspicious IP manage to log in? Use the <code>ClientIP</code> to search <code>access_requests</code>. If you see a "<code>Decision: Allow</code>" for a sensitive internal app, you know you have a compromised account.</p>
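<p>Following the query patterns used elsewhere in this post, that lookup might be sketched as:</p>
            <pre><code>SELECT Email, IPAddress, Allowed
FROM access_requests
WHERE date = '2026-02-22'
  AND IPAddress = 'INSERT_SUSPICIOUS_IP_HERE'
  AND Allowed = true</code></pre>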
    <div>
      <h3>Stop the leak (data exfiltration)</h3>
      <a href="#stop-the-leak-data-exfiltration">
        
      </a>
    </div>
<p>Attackers sometimes use DNS tunneling to bypass firewalls by encoding sensitive data (like passwords or SSH keys) into DNS queries. Instead of a normal request like <code>google.com</code>, the logs will show long, encoded strings. Look for an unusually high volume of queries for unique, long, and high-entropy subdomains by examining these fields:</p><ul><li><p><code>QueryName</code>: Look for strings like <code>h3ldo293js92.example.com</code>.</p></li><li><p><code>QueryType</code>: Tunneling often uses <code>TXT</code>, <code>CNAME</code>, or <code>NULL</code> records to carry the payload.</p></li><li><p><code>ClientIP</code>: Identify whether a single internal host is generating thousands of these unique requests.</p></li></ul><p>Additionally, attackers may attempt to leak sensitive data by hiding it within non-standard protocols or by using common protocols (like DNS or ICMP) in unusual ways to bypass standard firewalls. Discover this by querying the <code>magic_ids_detections</code> logs for signatures that flag protocol anomalies, such as "ICMP tunneling" or "DNS tunneling" detections in the <code>SignatureMessage</code> field.</p><p>Whether you are investigating a zero-day vulnerability or tracking a sophisticated botnet, the data you need is now at your fingertips.</p>
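<p>For the DNS tunneling case above, a hedged starting query against <code>gateway_dns</code> — the <code>QueryType</code> filter is just one heuristic to tune, long high-entropy names still need manual review, and field names should be verified against the dataset schema:</p>
            <pre><code>SELECT SrcIP, QueryName, QueryType
FROM gateway_dns
WHERE date = '2026-02-22'
  AND QueryType = 'TXT'
LIMIT 1000</code></pre>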
    <div>
      <h2>Correlate across datasets</h2>
      <a href="#correlate-across-datasets">
        
      </a>
    </div>
<p>Investigate malicious activity across multiple datasets by pivoting between concurrent searches. With Log Explorer, you can now work with multiple queries simultaneously using the new Tabs feature. Switch between tabs to query different datasets, or pivot and refine a query by filtering directly on your query results.</p><div>
  
</div>
<p>When you correlate data across multiple Cloudflare log sources, you can detect sophisticated multi-stage attacks that appear benign when viewed in isolation. This cross-dataset analysis allows you to see the full attack chain from reconnaissance to exfiltration.</p>
    <div>
      <h3>Session hijacking (token theft)</h3>
      <a href="#session-hijacking-token-theft">
        
      </a>
    </div>
<p><b>Scenario:</b> A user authenticates via Cloudflare Access, but their subsequent HTTP traffic looks like a bot.</p><p><b>Step 1:</b> Identify high-risk sessions in <code>http_requests</code>.</p>
            <pre><code>SELECT RayID, ClientIP, ClientRequestUserAgent, BotScore
FROM http_requests
WHERE date = '2026-02-22' 
  AND BotScore &lt; 20 
LIMIT 100</code></pre>
            <p><b>Step 2:</b> Copy the <code>RayID</code> and search <code>access_requests</code> to see which user account is associated with that suspicious bot activity.</p>
<pre><code>SELECT Email, IPAddress, Allowed
FROM access_requests
WHERE date = '2026-02-22' 
  AND RayID = 'INSERT_RAY_ID_HERE'</code></pre>
            
    <div>
      <h3>Post-phishing C2 beaconing</h3>
      <a href="#post-phishing-c2-beaconing">
        
      </a>
    </div>
<p><b>Scenario:</b> An employee clicked a link in a phishing email, compromising their workstation. This workstation sends a DNS query for a known malicious domain, then immediately triggers an IDS alert.</p><p><b>Step 1:</b> Find phishing attacks by examining <code>email_security_alerts</code> for violations.</p>
            <pre><code>SELECT Timestamp, Threatcategories, To, Alertreason
FROM email_security_alerts
WHERE date = '2026-02-22' 
  AND Threatcategories LIKE '%phishing%'</code></pre>
<p><b>Step 2:</b> Use Access logs to correlate the user’s email (<code>To</code>) to their IP address.</p>
            <pre><code>SELECT Email, IPAddress
FROM access_requests
WHERE date = '2026-02-22'
  AND Email = 'INSERT_EMAIL_FROM_STEP_1'</code></pre>
            <p><b>Step 3:</b> Find internal IPs querying a specific malicious domain in <code>gateway_dns</code> logs.</p>
<pre><code>SELECT SrcIP, QueryName, DstIP
FROM gateway_dns
WHERE date = '2026-02-22' 
  AND SrcIP = 'INSERT_IP_FROM_PREVIOUS_QUERY'
  AND QueryName LIKE '%malicious_domain_name%'</code></pre>
            
    <div>
      <h3>Lateral movement (Access → network probing)</h3>
      <a href="#lateral-movement-access-network-probing">
        
      </a>
    </div>
    <p><b>Scenario:</b> A user logs in via Zero Trust and then tries to scan the internal network.</p><p><b>Step 1:</b> Find successful logins from unexpected locations in <code>access_requests</code>.</p>
            <pre><code>SELECT IPAddress, Email, Country
FROM access_requests
WHERE date = '2026-02-22' 
  AND Allowed = true 
  AND Country != 'US' -- Replace with your HQ country</code></pre>
            <p><b>Step 2:</b> Check if that <code>IPAddress</code> is triggering network-level signatures in <code>magic_ids_detections</code>.</p>
            <pre><code>SELECT SignatureMessage, DestinationIP, Protocol
FROM magic_ids_detections
WHERE date = '2026-02-22' 
  AND SourceIP = 'INSERT_IP_ADDRESS_HERE'</code></pre>
            
    <div>
      <h3>Opening doors for more data </h3>
      <a href="#opening-doors-for-more-data">
        
      </a>
    </div>
    <p>From the beginning, Log Explorer was designed with extensibility in mind. Every dataset schema is defined using JSON Schema, a widely-adopted standard for describing the structure and types of JSON data. This design decision has enabled us to easily expand beyond HTTP Requests and Firewall Events to the full breadth of Cloudflare's telemetry. The same schema-driven approach that powered our initial datasets scaled naturally to accommodate Zero Trust logs, network analytics, email security alerts, and everything in between.</p><p>More importantly, this standardization opens the door to ingesting data beyond Cloudflare's native telemetry. Because our ingestion pipeline is schema-driven rather than hard-coded, we're positioned to accept any structured data that can be expressed in JSON format. For security teams managing hybrid environments, this means Log Explorer could eventually serve as a single pane of glass, correlating Cloudflare's edge telemetry with logs from third-party sources, all queryable through the same SQL interface. While today's release focuses on completing coverage of Cloudflare's product portfolio, the architectural groundwork is laid for a future where customers can bring their own data sources with custom schemas.</p>
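<p>To illustrate the idea, a dataset definition in this approach might look something like the following — a purely hypothetical, simplified sketch, not an actual Log Explorer schema:</p>
            <pre><code>{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "example_dataset",
  "type": "object",
  "properties": {
    "Timestamp": { "type": "string", "format": "date-time" },
    "ClientIP":  { "type": "string" },
    "Action":    { "type": "string" }
  },
  "required": ["Timestamp"]
}</code></pre>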
    <div>
      <h3>Faster data, faster response: architectural upgrades</h3>
      <a href="#faster-data-faster-response-architectural-upgrades">
        
      </a>
    </div>
<p>To investigate a multi-vector attack effectively, timing is everything. A delay of even a few minutes in log availability can be the difference between proactive defense and reactive damage control.</p><p>That is why we have optimized our ingestion pipeline for speed and resilience. By increasing concurrency in one part of our ingestion path, we have eliminated bottlenecks that could cause “noisy neighbor” issues, ensuring that one client’s data surge doesn’t slow down another’s visibility. This architectural work has reduced our P99 ingestion latency by approximately 55%, and our P50 by 25%, cutting the time it takes for an event at the edge to become available for your SQL queries.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/41M2eWP0BwrQFSZW4GzZbV/7a6139354abb561aba17e77d83beb17a/image4.png" />
          </figure><p><sup><i>Grafana chart displaying the drop in ingest latency after architectural upgrades</i></sup></p>
    <div>
      <h2>Follow along for more updates</h2>
      <a href="#follow-along-for-more-updates">
        
      </a>
    </div>
<p>We're just getting started. We're actively working on even more powerful features to further enhance your experience with Log Explorer, including the ability to run these detection queries on a custom-defined schedule.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2JIOu9PXDwVAVcmbgq456q/1eace4b5d38bb705e82442a4ee8045dc/Scheduled_Queries_List.png" />
          </figure><p><sup><i>Design mockup of upcoming Log Explorer Scheduled Queries feature</i></sup></p><p><a href="https://blog.cloudflare.com/"><u>Subscribe to the blog</u></a> and keep an eye out for more Log Explorer updates soon in our <a href="https://developers.cloudflare.com/changelog/product/log-explorer/"><u>Change Log</u></a>. </p>
    <div>
      <h2>Get access to Log Explorer</h2>
      <a href="#get-access-to-log-explorer">
        
      </a>
    </div>
<p>To get access to Log Explorer, you can purchase it self-serve directly from the dashboard, or, for contract customers, reach out for a <a href="https://www.cloudflare.com/application-services/products/log-explorer/"><u>consultation</u></a> or contact your account manager. Additionally, you can read more in our <a href="https://developers.cloudflare.com/logs/log-explorer/"><u>Developer Documentation</u></a>.</p>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Logs]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[R2]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[SIEM]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Connectivity Cloud]]></category>
            <guid isPermaLink="false">1hirraqs3droftHovXp1G6</guid>
            <dc:creator>Jen Sells</dc:creator>
            <dc:creator>Claudio Jolowicz</dc:creator>
            <dc:creator>Nico Gutierrez</dc:creator>
        </item>
        <item>
            <title><![CDATA[Improve global upload performance with R2 Local Uploads]]></title>
            <link>https://blog.cloudflare.com/r2-local-uploads/</link>
            <pubDate>Tue, 03 Feb 2026 14:00:00 GMT</pubDate>
            <description><![CDATA[ Local Uploads on R2 reduces request duration for uploads by up to 75%. It writes object data to a nearby location and asynchronously copies it to your bucket, all while data is available immediately.  ]]></description>
            <content:encoded><![CDATA[ <p>Today, we are launching<b> Local Uploads</b> for R2 in <b>open beta</b>. With <a href="https://developers.cloudflare.com/r2/buckets/local-uploads/"><u>Local Uploads</u></a> enabled, object data is automatically written to a storage location close to the client first, then asynchronously copied to where the bucket lives. The data is immediately accessible and stays <a href="https://developers.cloudflare.com/r2/reference/consistency/"><u>strongly consistent</u></a>. Uploads get faster, and data feels global.</p><p>For many applications, performance needs to be global. Users uploading media content from different regions, for example, or devices sending logs and telemetry from all around the world. But your data has to live somewhere, and that means uploads from far away have to travel the full distance to reach your bucket. </p><p><a href="https://www.cloudflare.com/developer-platform/products/r2/"><u>R2</u></a> is <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/"><u>object storage</u></a> built on Cloudflare's global network. Out of the box, it automatically caches object data globally for fast reads anywhere — all while retaining strong consistency and zero <a href="https://www.cloudflare.com/learning/cloud/what-are-data-egress-fees/"><u>egress fees</u></a>. This happens behind the scenes whether you're using the <a href="https://www.cloudflare.com/developer-platform/solutions/s3-compatible-object-storage/">S3</a> API, Workers Bindings, or plain HTTP. And now with Local Uploads, both reads and writes can be fast from anywhere in the world.</p><p>Try it yourself <a href="https://local-uploads.r2-demo.workers.dev/"><u>in this demo</u></a> to see the benefits of Local Uploads.</p><p>Ready to try it? 
Enable Local Uploads in the <a href="https://dash.cloudflare.com/?to=/:account/r2/overview"><u>Cloudflare Dashboard</u></a> under your bucket's settings, or with a single Wrangler command on an existing bucket.</p>
            <pre><code>npx wrangler r2 bucket local-uploads enable [BUCKET]</code></pre>
            
    <div>
      <h2>75% lower total request duration for global uploads</h2>
      <a href="#75-lower-total-request-duration-for-global-uploads">
        
      </a>
    </div>
<p><a href="https://developers.cloudflare.com/r2/buckets/local-uploads"><u>Local Uploads</u></a> makes upload requests (i.e. PutObject, UploadPart) faster. In both our private beta tests with customers and our synthetic benchmarks, we saw up to a 75% reduction in Time to Last Byte (TTLB) when upload requests are made from a different region than the bucket. In these results, TTLB is measured from when R2 receives the upload request to when R2 returns a 200 response.</p><p>In our synthetic tests, we used a workload that simulates a cross-region upload workflow. We deployed a test client in Western North America and configured an R2 bucket with a <a href="https://developers.cloudflare.com/r2/reference/data-location/"><u>location hint</u></a> for Asia-Pacific. The client performed around 20 PutObject requests per second over 30 minutes to upload 5 MB objects.</p><p>The following graph compares the p50 (or median) TTLB for these requests, showing the difference in upload request duration — first without Local Uploads (TTLB around 2s), and then with Local Uploads enabled (TTLB around 500ms):</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4uvSdPwflyjHohwLQvOKsu/4b82637a5ac29ceee0fc37e04ab0107f/image1.png" />
          </figure>
    <div>
      <h2>How it works: The distance problem</h2>
      <a href="#how-it-works-the-distance-problem">
        
      </a>
    </div>
<p>To understand how Local Uploads can improve upload requests, let’s first take a look at <a href="https://developers.cloudflare.com/r2/how-r2-works/"><u>how R2 works</u></a>. R2's architecture is composed of multiple components, including:</p><ul><li><p><b>R2 Gateway Worker: </b>The entry point for all API requests that handles authentication and routing logic. It is deployed across Cloudflare's global network via <a href="https://developers.cloudflare.com/workers/"><u>Cloudflare Workers</u></a>.</p></li><li><p><b>Durable Object Metadata Service: </b>A distributed layer built on <a href="https://developers.cloudflare.com/durable-objects/"><u>Durable Objects</u></a> used to store and manage object metadata (e.g. object key, checksum).</p></li><li><p><b>Distributed Storage Infrastructure: </b>The underlying infrastructure that persistently stores encrypted object data.</p></li></ul><p>Without Local Uploads, here’s what happens when you upload objects to your bucket: The request is first received by the R2 Gateway, close to the user, where it is authenticated. Then, as the client streams bytes of the object data, the data is encrypted and written into the storage infrastructure in the region where the bucket is placed. When this is completed, the Gateway reaches out to the Metadata Service to publish the object metadata, and returns a success response to the client once the metadata is committed.</p><p>If the client and the bucket are in separate regions, more variability can be introduced in the process of uploading bytes of the object data, due to the longer distance that the request must travel. This could result in slower or less reliable uploads.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6toAZ6JSHPv2jgntdyCOvr/704f6837d2705f18a0e5b8554994cb7a/image9.png" />
          </figure><p><sup>A client uploading from Eastern North America to a bucket in Eastern Europe without Local Uploads enabled. </sup></p><p>Now, when you make an upload request to a bucket with Local Uploads enabled, there are two cases that are handled: </p><ol><li><p>The client and the bucket are in the <b>same</b> region</p></li><li><p>The client and the bucket are in <b>different</b> regions</p></li></ol><p>In the first case, R2 follows the regular flow, where object data is written to the storage infrastructure for your bucket. In the second case, R2 writes to the storage infrastructure located in the client region while still publishing the object metadata to the region of the bucket.</p><p>Importantly, the object is immediately accessible after the initial write completes. It remains accessible throughout the entire replication process — there's <b>no</b> <b>waiting period</b> for background replication to finish before the object can be read.</p>
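<p>The two cases above can be sketched as a small placement function. This is a hedged illustration: the region values, the <code>WritePlan</code> type, and the <code>planUpload</code> function are hypothetical names for this sketch, not R2's actual internals.</p>

```typescript
// Sketch of the Local Uploads write-placement decision described above.
// All names here are hypothetical, not R2's actual internals.
type Region = string;

interface WritePlan {
  dataRegion: Region;      // where the object bytes are first persisted
  metadataRegion: Region;  // metadata is always published in the bucket's region
  needsReplication: boolean;
}

function planUpload(clientRegion: Region, bucketRegion: Region): WritePlan {
  if (clientRegion === bucketRegion) {
    // Case 1: same region -- the regular flow, nothing to replicate later.
    return { dataRegion: bucketRegion, metadataRegion: bucketRegion, needsReplication: false };
  }
  // Case 2: different regions -- persist the data close to the client and
  // schedule a background replication task toward the bucket's region.
  return { dataRegion: clientRegion, metadataRegion: bucketRegion, needsReplication: true };
}
```

<p>Either way, metadata lands in the bucket's region, which is why the object is readable immediately after the initial write.</p>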
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33oUAdlGF8cWOeQhha6Ocy/68537e503f1ec8d1dd080db363f97dc3/image3.png" />
          </figure><p><sup>A client uploading from Eastern North America to a bucket in Eastern Europe with Local Uploads enabled. </sup></p><p>Note that this is for non-jurisdiction restricted buckets, and Local Uploads are not available for buckets with jurisdiction restriction (e.g. EU, FedRAMP) enabled.</p>
    <div>
      <h2>When to use Local Uploads</h2>
      <a href="#when-to-use-local-uploads">
        
      </a>
    </div>
    <p>Local Uploads is built for workloads that receive many upload requests originating from different geographic regions than where your bucket is located. This feature is ideal when:</p><ul><li><p>Your users are globally distributed</p></li><li><p>Upload performance and reliability are critical to your application</p></li><li><p>You want to optimize write performance without changing your bucket's primary location</p></li></ul><p>To understand the geographic distribution of where your read and write requests are initiated, visit the <a href="https://dash.cloudflare.com/?to=/:account/r2/overview"><u>Cloudflare Dashboard</u></a>, go to your R2 bucket’s Metrics page, and view the Request Distribution by Region graph. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6SJ9UYY3RADryXmnT0J3Vq/9b26c948e925a705387a64c24a1dd7e3/image7.png" />
          </figure>
    <div>
      <h2>How we built Local Uploads</h2>
      <a href="#how-we-built-local-uploads">
        
      </a>
    </div>
    <p>With Local Uploads, object data is written close to the client and then copied to the bucket's region in the background. We call this copy job a replication task.</p><p>These replication tasks need asynchronous processing, which is a great use case for <a href="https://developers.cloudflare.com/queues/"><u>Cloudflare Queues</u></a>. Queues allow us to control the rate at which we process replication tasks, and they provide built-in failure handling capabilities like <a href="https://developers.cloudflare.com/queues/configuration/batching-retries/"><u>retries</u></a> and <a href="https://developers.cloudflare.com/queues/configuration/dead-letter-queues/"><u>dead letter queues</u></a>. In this case, R2 shards replication tasks across multiple queues per storage region.</p>
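<p>As a rough sketch, sharding tasks across several queues per destination region can be done by hashing the object key and picking a queue deterministically. The hash (FNV-1a), the queue-naming scheme, and the shard count below are illustrative assumptions, not R2's actual implementation.</p>

```typescript
// Toy 32-bit FNV-1a hash, used only so this sketch is self-contained.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

const QUEUES_PER_REGION = 4; // assumed shard count, for illustration

// Pick a queue for a replication task: deterministic per object key,
// so load spreads evenly across the destination region's queues.
function queueFor(destinationRegion: string, objectKey: string): string {
  const shard = fnv1a(objectKey) % QUEUES_PER_REGION;
  return `replication-${destinationRegion}-${shard}`;
}
```

<p>Deterministic placement keeps retries of the same task on the same queue while still spreading unrelated tasks across all shards.</p>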
    <div>
      <h3>Publishing metadata and scheduling replication</h3>
      <a href="#publishing-metadata-and-scheduling-replication">
        
      </a>
    </div>
    <p>When publishing the metadata of an object with Local Uploads enabled, we perform three operations atomically:</p><ol><li><p>Store the object metadata</p></li><li><p>Create a pending replica key that tracks which replications still need to happen</p></li><li><p>Create a replication task marker keyed by timestamp, which controls when the task should be sent to the queue</p></li></ol><p>The pending replica key contains the full replication plan: the number of replication tasks, which source location to read from, which destination location to write to, the replication mode and priority, and whether the source should be deleted after successful replication.</p><p>This gives us flexibility in how we move an object's data. For example, moving data across long geographical distances is expensive. We could try to move all the replicas as fast as possible by processing them in parallel, but this would incur greater cost and pressure the network infrastructure. Instead, we minimize the number of cross-regional data movements by first creating one replica in the target bucket region, and then use this local copy to create additional replicas within the bucket region.</p>
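<p>The replication plan described above can be sketched as follows: exactly one cross-regional task moves the data into the bucket's region, and the remaining replicas are created from that local copy. The field and function names are hypothetical, chosen for this sketch.</p>

```typescript
// Sketch of a replication plan that minimizes cross-regional data movement,
// as described above. All names are hypothetical.
interface ReplicationTask {
  from: string; // source location to read from
  to: string;   // destination location to write to
  crossRegional: boolean;
}

function buildPlan(clientLoc: string, bucketLocs: string[]): ReplicationTask[] {
  // The first (and only) cross-regional move: client region -> bucket region.
  const [first, ...rest] = bucketLocs;
  const tasks: ReplicationTask[] = [{ from: clientLoc, to: first, crossRegional: true }];
  // Remaining replicas are created from the in-region copy, avoiding
  // further expensive long-distance transfers.
  for (const loc of rest) {
    tasks.push({ from: first, to: loc, crossRegional: false });
  }
  return tasks;
}
```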
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2rCuA2zXR4ltZJsiNDBHd7/ae388f13ea27922158b27f429080c69c/image6.png" />
          </figure><p>A background process periodically scans the replication task markers and sends them to one of the queues associated with the destination storage region. The markers guarantee at-least-once delivery to the queue — if enqueueing fails or the process crashes, the marker persists and the task will be retried on the next scan. This also allows us to process replications at different times and enqueue only valid tasks. Once a replication task reaches a queue, it is ready to be processed.</p>
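<p>A simplified model of the marker scan illustrates the at-least-once guarantee: a marker is only removed once its task has been successfully enqueued, so a failed enqueue (or a crash before deletion) leaves it in place for the next scan. The types and names below are hypothetical.</p>

```typescript
// Sketch of the periodic marker scan described above. Hypothetical names.
interface Marker {
  id: string;
  notBefore: number; // timestamp controlling when the task becomes due
}

// Returns the markers that survive this scan. A marker is deleted only
// after its task is safely in the queue, giving at-least-once delivery.
function scanAndEnqueue(
  markers: Marker[],
  now: number,
  enqueue: (id: string) => boolean, // returns false if enqueueing fails
): Marker[] {
  return markers.filter((m) => {
    if (m.notBefore > now) return true; // not due yet: keep for a later scan
    return !enqueue(m.id);              // keep only if the enqueue failed
  });
}
```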
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5G4STZSp67TnKhehzCqFMv/445a0c74ba7f4bc5dd3de04eb7aa1257/image4.png" />
          </figure>
    <div>
      <h3>Asynchronous replication: Pull model</h3>
      <a href="#asynchronous-replication-pull-model">
        
      </a>
    </div>
    <p>For the queue consumer, we chose a pull model where a centralized polling service consumes tasks from the regional queues and dispatches them to the Gateway Worker for execution.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2p6SHkqO1tT7wxdhPJCFCr/86f219af85e332813ede2eb95a3810d8/image2.png" />
          </figure><p>Here's how it works:</p><ol><li><p>Polling service pulls from a regional queue: The consumer service polls the regional queue for replication tasks. It then batches the tasks to create uniform batch sizes based on the amount of data to be moved.</p></li><li><p>Polling service dispatches to Gateway Worker: The consumer service sends the replication job to the Gateway Worker.</p></li><li><p>Gateway Worker executes replication: The worker reads object data from the source location, writes it to the destination, and updates metadata in the Durable Object, optionally marking the source location to be garbage collected.</p></li><li><p>Gateway Worker reports result: On completion, the worker returns the result to the poller, which acknowledges the task to the queue as completed or failed.</p></li></ol><p>By using this pull model approach, we ensure that the replication process remains stable and efficient. The service can dynamically adjust its pace based on real-time system health, guaranteeing that data is safely replicated across regions.</p>
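<p>Step 1's size-based batching can be sketched as grouping tasks until a byte budget is reached, so each dispatched batch moves a roughly uniform amount of data. The byte threshold and names are illustrative assumptions, not the service's actual parameters.</p>

```typescript
// Sketch of size-based batching: group replication tasks so each batch
// carries roughly the same number of bytes. Hypothetical names.
interface Task {
  objectKey: string;
  bytes: number;
}

function batchBySize(tasks: Task[], maxBatchBytes: number): Task[][] {
  const batches: Task[][] = [];
  let current: Task[] = [];
  let currentBytes = 0;
  for (const t of tasks) {
    // Start a new batch when adding this task would exceed the budget.
    if (current.length > 0 && currentBytes + t.bytes > maxBatchBytes) {
      batches.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(t);
    currentBytes += t.bytes;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```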
    <div>
      <h2>Try it out</h2>
      <a href="#try-it-out">
        
      </a>
    </div>
    <p>Local Uploads is available now in open beta. There is <b>no additional cost</b> to enable Local Uploads. Upload requests made with this feature enabled incur the standard <a href="https://developers.cloudflare.com/r2/pricing/"><u>Class A operation costs</u></a>, same as upload requests made without Local Uploads.</p><p>To get started, visit the <a href="https://dash.cloudflare.com/?to=/:account/r2/overview"><u>Cloudflare Dashboard</u></a> under your bucket's settings and look for the Local Uploads card to enable, or simply run the following command using Wrangler to enable Local Uploads on a bucket.</p>
            <pre><code>npx wrangler r2 bucket local-uploads enable [BUCKET]</code></pre>
            <p>Enabling Local Uploads on a bucket is seamless: existing uploads will complete as expected and there’s no interruption to traffic.</p><p>For more information, refer to the <a href="https://developers.cloudflare.com/r2/buckets/local-uploads/"><u>Local Uploads documentation</u></a>. If you have questions or want to share feedback, join the discussion on our <a href="https://discord.gg/cloudflaredev"><u>Developer Discord</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[R2]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Storage]]></category>
            <guid isPermaLink="false">453lZMuYluqGqfRKADhf9K</guid>
            <dc:creator>Frank Chen</dc:creator>
            <dc:creator>Rahul Suresh</dc:creator>
            <dc:creator>Anni Wang</dc:creator>
        </item>
        <item>
            <title><![CDATA[Quicksilver v2: evolution of a globally distributed key-value store (Part 2)]]></title>
            <link>https://blog.cloudflare.com/quicksilver-v2-evolution-of-a-globally-distributed-key-value-store-part-2-of-2/</link>
            <pubDate>Thu, 17 Jul 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ This is part two of a story about how we overcame the challenges of making a complex system more scalable. ]]></description>
            <content:encoded><![CDATA[ 
    <div>
      <h2>What is Quicksilver?</h2>
      <a href="#what-is-quicksilver">
        
      </a>
    </div>
    <p>Cloudflare has servers in <a href="https://www.cloudflare.com/network"><u>330 cities spread across 125+ countries</u></a>. All of these servers run Quicksilver, which is a key-value database that contains important configuration information for many of our services, and is queried for all requests that hit the Cloudflare network.</p><p>Because it is used while handling requests, Quicksilver is designed to be very fast; it currently responds to 90% of requests in less than 1 ms and 99.9% of requests in less than 7 ms. Most requests are only for a few keys, but some are for hundreds or even more keys.</p><p>Quicksilver currently contains over five billion key-value pairs with a combined size of 1.6 TB, and it serves over three billion keys per second, worldwide. Keeping Quicksilver fast provides some unique challenges, given that our dataset is always growing, and new use cases are added regularly.</p><p>Quicksilver used to store all key-values on all servers everywhere, but there is obviously a limit to how much disk space can be used on every single server. For instance, the more disk space used by Quicksilver, the less disk space is left for content caching. Also, with each added server that contains a particular key-value, the cost of storing that key-value increases.</p><p>This is why disk space usage has been the main battle that the Quicksilver team has been waging over the past several years. A lot was done over the years, but we now think that we have finally created an architecture that will allow us to get ahead of the disk space limitations and finally make Quicksilver scale better.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2mGYCFdZIWdYl1TuDzvlAf/45d1b975adce0ec4eb5cc2842c8bd185/image1.png" />
          </figure><p><sup>The size of the Quicksilver database has grown by 50% to about 1.6 TB in the past year</sup></p>
    <div>
      <h2>What we talked about previously</h2>
      <a href="#what-we-talked-about-previously">
        
      </a>
    </div>
    <p><a href="https://blog.cloudflare.com/quicksilver-v2-evolution-of-a-globally-distributed-key-value-store-part-1/"><u>Part one of the story</u></a> explained how Quicksilver V1 stored all key-value pairs on each server all around the world. It was a very simple and fast design, it worked very well, and it was a great way to get started. But over time, it turned out to not scale well from a disk space perspective.</p><p>The problem was that disk space was running out so fast that there was not enough time to design and implement a fully scalable version of Quicksilver. Therefore, Quicksilver V1.5 was created first. It halved the disk space used on each server compared to V1.</p><p>For this, a new <i>proxy </i>mode was introduced for Quicksilver. In this mode, Quicksilver does not contain the full dataset anymore, but only contains a cache. All cache misses are looked up on another server that runs Quicksilver with a full dataset. Each server runs about ten separate <i>instances</i> of Quicksilver, and all have different databases with different sets of key-values. We call Quicksilver instances with the full data set <i>replicas</i>.</p><p>For Quicksilver V1.5, half of those instances on a particular server would run Quicksilver in proxy mode, and therefore would not have the full dataset anymore. The other half would run in replica mode. This worked well for a time, but it was not the final solution.</p>
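<p>In pseudocode terms, a proxy-mode lookup is a read-through cache in front of a replica. This sketch uses hypothetical, in-memory stand-ins for the real Quicksilver components.</p>

```typescript
// Sketch of a V1.5 proxy-mode lookup: serve from the local cache, and on a
// miss fall back to an instance holding the full dataset. Hypothetical names.
function proxyGet(
  cache: Map<string, string>,
  replicaGet: (key: string) => string | undefined,
  key: string,
): string | undefined {
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const value = replicaGet(key);                  // cache miss: ask a replica
  if (value !== undefined) cache.set(key, value); // populate the local cache
  return value;
}
```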
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2GHG0tl5diZ7w09qNkblyd/6fa0ba250477cb9027496669640a0acb/image4.png" />
          </figure><p>Building this intermediate solution had the added benefit of allowing the team to gain experience running an even more distributed version of Quicksilver.</p>
    <div>
      <h2>The problem</h2>
      <a href="#the-problem">
        
      </a>
    </div>
    <p>There were a few reasons why Quicksilver V1.5 was not fully scalable.</p><p>First, the sizes of the separate instances were not very stable. The key-space is owned by the teams that use Quicksilver, not by the Quicksilver team, and the way those teams use Quicksilver changes frequently. Furthermore, while most instances grow in size over time, some instances have actually gotten smaller, such as when the use of Quicksilver is optimised by teams. The result is that a split of instances that was well-balanced at the start quickly became unbalanced.</p><p>Second, the analysis done to estimate how much of the key space would need to be in cache on each server assumed that all keys accessed in a three-day period would represent a good enough cache. This assumption turned out to be wildly off. The analysis estimated that we needed about 20% of the key space in cache, and while most instances did have a good cache hit rate with 20% or less of the key space in cache, some instances turned out to need a much higher percentage.</p><p>The main issue, however, was that reducing the disk space used by Quicksilver on our network by as much as 40% does not actually make it more scalable. The number of key-values stored in Quicksilver keeps growing, and it only took about two years before disk space was running low again.</p>
    <div>
      <h2>The solution</h2>
      <a href="#the-solution">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4G3Y1t9OeF7BlsASIwlpI7/8df9e21affde869fbdd41f203cbcd256/image6.png" />
          </figure><p><sup>Except for a handful of special storage servers, Quicksilver does not contain the full dataset anymore, but only cache. Any cache misses will be looked up in replicas on our storage servers, which do have the full dataset.</sup></p><p>The solution to the scalability problem was brought on by a new insight. As it turns out, numerous key-values were actually almost never used. We call these <i>cold keys</i>. There are different reasons for these cold keys: some of them were old and not well cleaned up, some were used only in certain regions or in certain data centers, and some were not used for a very long time or maybe not at all (a domain name that is never looked up for example or a script that was uploaded but never used).</p><p>At first, the team had been considering solving our scalability problem by splitting up the entire dataset into shards and distributing those across the servers in the different data centers. But sharding the full dataset adds a lot of complexity, corner cases, and unknowns. Sharding also does not optimize for data locality. For example, if the key-space is split into 4 shards and each server gets one shard, that server can only serve 25% of the requested keys from its local database. The cold keys would also still be contained in those shards and would take up disk space unnecessarily.</p><p>Another data structure that is much better at data locality and explicitly avoids storing keys that are never used is a cache. So it was decided that only a handful of servers with large disks would maintain the full data set, and all other servers would only have a cache. This was an obvious evolution from Quicksilver V1.5. Caching was already being done on a smaller scale, so all the components were already available. The caching proxies and the inter-data center discovery mechanisms were already in place. They had been used since 2021 and were therefore thoroughly battle tested. 
However, one more component needed to be added.</p><p>There was a concern that having all instances on all servers connect to a handful of storage nodes with replicas would overload them with too many connections. So a Quicksilver <i>relay</i> was added. For each instance, a few servers would be elected within each data center on which Quicksilver would run in relay mode. The relays would maintain the connections to the replicas on the storage nodes. All proxies inside a data center would discover those relays and all cache misses would be relayed through them to the replicas.</p><p>This new architecture worked very well. The cache hit rates still needed some improvement, however.</p>
    <div>
      <h3>Prefetching the future</h3>
      <a href="#prefetching-the-future">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/48FbryG5MCUAmQNY34jtOK/5e57c62fafa81fa323cf4ef77ccded55/Prefetching_the_future.png" />
          </figure><p><sup>Every resolved cache miss is prefetched by all servers in the data center</sup></p><p>We had a hypothesis that prefetching all keys that were cache misses on the other servers inside the same data center would improve the cache hit rate. So an analysis was done, and it indeed showed that every key that was a cache miss on one server in a data center had a very high probability of also being a cache miss on another server in the same data center sometime in the near future. Therefore, a mechanism was built that distributed all resolved cache misses on relays to all other servers.</p><p>All cache misses in a data center are resolved by requesting them from a relay, which subsequently forwards the requests to one of the replicas on the storage nodes. Therefore, the prefetching mechanism was implemented by making relays publish a stream of all resolved cache misses, to which all Quicksilver proxies in the same data center subscribe. The resulting key-values were then added to the proxy local caches.</p><p>This strategy is called <i>reactive</i> prefetching, because it fills caches only with the key-values that directly resulted from cache misses inside the same data center. Those prefetches are a <i>reaction</i> to the cache misses. Another way of prefetching is called <i>predictive</i> prefetching, in which an algorithm tries to predict which keys that have not yet been requested will be requested in the near future. A few approaches for making these predictions were tried, but they did not result in any improvement, and so this idea was abandoned.</p><p>With the prefetching enabled, cache hit rates went up to about 99.9% for the worst performing instance. This was the goal that we were trying to reach. 
But while rolling this out to more of our network, it turned out that there was one team that needed an even higher cache hit rate, because the tail latencies they were seeing with this new architecture were too high.</p><p>This team was using a Quicksilver instance called <i>dnsv2</i>. This is a very latency sensitive instance, because it is the one from which DNS queries are served. Some of the DNS queries under the hood need multiple queries to Quicksilver, so any added latency to Quicksilver multiplies for them. This is why it was decided that one more improvement to the Quicksilver cache was needed.</p>
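<p>The reactive prefetching mechanism above can be modeled as a relay publishing every resolved cache miss to all subscribed proxies in the data center. This is a toy, in-memory stand-in for the real pub/sub stream; the names are hypothetical.</p>

```typescript
// Sketch of reactive prefetching: every key-value resolved for a cache miss
// is pushed into the local caches of all proxies in the same data center.
type Cache = Map<string, string>;

function publishResolvedMiss(
  subscribers: Cache[],
  key: string,
  value: string,
): void {
  // Each proxy subscribed to the relay's stream prefetches the key-value,
  // so the next request for it anywhere in the data center is a hit.
  for (const cache of subscribers) cache.set(key, value);
}
```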
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7fKjcIASCIsaW52NDTF6OX/25cc40046536b28fca5b44249234cc48/image12.png" />
          </figure><p><sup>The level 1 cache hit-rate is 99.9% or higher, on average.</sup></p>
    <div>
      <h3>Back to the sharding</h3>
      <a href="#back-to-the-sharding">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/10ur3qVEVDt4carNCqldch/0777d16f53fd501354f7196caac4d9fe/back-to-sharding.png" />
          </figure><p><sup>Before going to a replica in another data center, a cache miss is first looked up in a data center-wide sharded cache</sup></p><p>The instance on which higher cache hit rates were required was also the instance on which the cache performed the worst. The cache works with a retention time, defined as the number of days a key-value is kept in cache after it was last accessed, after which it is evicted from the cache. An analysis of the cache showed that this instance needed a much longer retention time. But a higher retention time also causes the cache to take up more disk space — space that was not available.</p><p>However, while running Quicksilver V1.5, we had already noticed the pattern that caches generally performed much better in smaller data centers as compared to larger ones. This sparked the hypothesis that led to the final improvement.</p><p>It turns out that smaller data centers, with fewer servers, generally needed less disk space for their cache. Conversely, the more servers there are in a data center, the larger the Quicksilver cache needs to be. This is easily explained by the fact that larger data centers generally serve larger populations, and therefore have a larger diversity of requests. More servers also means more total disk space available inside the data center. To make use of this pattern, the concept of sharding was reintroduced.</p><p>Our key space was split up into multiple shards. Each server in a data center was assigned one of the shards. Instead of those shards containing the full dataset for their part of the key space, they contain a cache for it. Those cache shards are populated by all cache misses inside the data center. This all forms a data center-wide cache that is distributed using sharding.</p><p>The data locality issue that sharding the full dataset has, as described above, is solved by keeping the local per-server caches as well. The sharded cache is in addition to the local caches. 
All servers in a data center contain both their local cache and a cache for one physical shard of the sharded cache. Therefore, each requested key is first looked up in the server’s local cache, after that the data center-wide sharded cache is queried, and finally if both caches miss the requested key, it is looked up on one of the storage nodes.</p><p>The key space is split up into separate shards by first dividing hashes of the keys by range into 1024 logical shards. Those logical shards are then divided up into physical shards, again by range. Each server gets one physical shard assigned by repeating the same process on the server hostname.</p>
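<p>The two-step range mapping can be sketched as follows. The hash function (FNV-1a) is an illustrative stand-in, since the post doesn't specify which hash Quicksilver uses, and the function names are hypothetical.</p>

```typescript
// Sketch of the shard assignment described above: key hashes are divided by
// range into 1024 logical shards, and logical shards are divided by range
// into physical shards. Toy hash, hypothetical names.
const LOGICAL_SHARDS = 1024;

function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Map a key to a logical shard by range over the 32-bit hash space.
function logicalShard(key: string): number {
  return Math.floor(fnv1a(key) / (0x100000000 / LOGICAL_SHARDS));
}

// Map a logical shard to one of `physicalCount` physical shards by range.
function physicalShard(logical: number, physicalCount: number): number {
  return Math.floor(logical / (LOGICAL_SHARDS / physicalCount));
}
```

<p>With this scheme, doubling the physical shard count halves each shard's logical range, so a server's new shard covers a subset of its previous one and its existing cache stays useful.</p>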
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4fKK9EPdwP69l4VtIErNGW/5481fb93b8e89fff6f5bdc6dd2dfbb44/image7.png" />
          </figure><p><sup>Each server contains one physical shard. A physical shard contains a range of logical shards. A logical shard contains a range of the ordered set that results from hashing all keys.</sup></p><p>This approach has the advantage that the sharding factor can be scaled up by factors of two without the need for copying caches to other servers. When the sharding factor is increased in this way, the servers will automatically get a new physical shard assigned that contains a subset of the key space that the previous physical shard on that server contained. After this happens, each server’s cache contains a superset of the cache it needs. The key-values that are not needed anymore will be evicted over time.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2IBc0EMQEDeenGX2ShbrrM/d661ef58d7114caa3c77b9eb5b18b736/image10.png" />
          </figure><p><sup>When the number of physical shards is doubled, the servers automatically get new physical shards that are subsets of their previous physical shards, and therefore still have the relevant key-values in cache.</sup></p><p>This approach means that the sharded caches can easily be scaled up when needed as the number of keys in Quicksilver grows, and without any need for relocating data. Also, shards are well-balanced because they contain uniform random subsets of a very large key-space.</p><p>Adding new key-values to the physical cache shards piggybacks on the prefetching mechanism, which already distributes all resolved cache misses to all servers in a data center. The keys that are part of the key space for a physical shard on a particular server are just kept longer in cache than the keys that are not part of that physical shard.</p><p>Another reason why a sharded cache is simpler than sharding the full key-space is that it is possible to cut some corners with a cache. For instance, looking up older versions of key-values (as used for multiversion concurrency control) is not supported on cache shards. As explained in an <a href="https://blog.cloudflare.com/quicksilver-v2-evolution-of-a-globally-distributed-key-value-store-part-1/"><u>earlier blog post</u></a>, this is needed for consistency when looking up key-values on different servers, when that server has a newer version of the database. It is not needed in the cache shards, because lookups can always fall back to the storage nodes when the right version is not available.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7qKie4ZFES4yJw1Sa2K26X/3d326d4e8e7553dcf96738ceae7229fe/image9.png" />
          </figure><p><sup>Proxies have a recent keys window that contains all recently written key-values. A cache shard only has its cached key-values. Storage replicas contain all key-values and, on top of that, multiple versions for recently written key-values. When the proxy, which has database version 1000, has a cache miss for </sup><sup><i>key1</i></sup><sup>, it can be seen that the version of that key on the cache shard was written at database version 1002 and therefore is too new. This means that it is not consistent with the proxy’s database version. This is why the relay will fetch that key from a replica instead, which can return the earlier consistent version. In contrast, </sup><sup><i>key2</i></sup><sup> on the cache shard can be used, because it was written at index 994, well below the database version of the proxy.</sup></p><p>There is only one very specific corner case in which a key-value on a cache shard cannot be used. This happens when the key-value in the cache shard was written at a more recent database version than the version of the proxy database at that time. This would mean that the key-value probably has a different value than it had at the correct version. Because the cache shard and proxy database versions are generally very close to each other, and only key-values written between those two versions are affected, this happens very rarely. As such, deferring the lookup to storage nodes has no noticeable effect on the cache hit rate.</p>
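<p>The consistency check in this corner case reduces to a single version comparison. A sketch, with hypothetical names:</p>

```typescript
// Sketch of the cache-shard consistency check described above: an entry is
// usable only if it was written at or below the proxy's database version;
// otherwise the lookup falls back to a storage replica. Hypothetical names.
interface CachedEntry {
  value: string;
  writtenAtVersion: number; // database version at which this value was written
}

function resolveFromShard(
  entry: CachedEntry | undefined,
  proxyDbVersion: number,
): { source: "shard" | "replica"; value?: string } {
  if (entry !== undefined && entry.writtenAtVersion <= proxyDbVersion) {
    // Consistent with the proxy's view (like key2 in the figure above).
    return { source: "shard", value: entry.value };
  }
  // Missing, or written at a newer version than the proxy (like key1):
  // defer to a replica, which can return the earlier consistent version.
  return { source: "replica" };
}
```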
    <div>
      <h3>Tiered Storage</h3>
      <a href="#tiered-storage">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2bLnTL9WWzQfMqjyXv5o6Q/1c1bf1cb348aa736fb320c77b52e10dd/image3.png" />
          </figure><p>To summarize, Quicksilver V2 has three levels of storage.</p><ol><li><p>Level 1: The local cache on each server, which contains the most recently accessed key-values.</p></li><li><p>Level 2: The data center-wide sharded cache, which contains key-values that haven’t been accessed in a while, but have been accessed at some point.</p></li><li><p>Level 3: The replicas, which contain the full dataset, live on a handful of storage nodes, and are only queried for the <i>cold</i> keys.</p></li></ol>
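<p>Putting the three levels together, a lookup can be sketched as a cascade of fallbacks. These are toy, in-memory stand-ins for the real components, with hypothetical names.</p>

```typescript
// Sketch of the three-level lookup: local cache, then the data center-wide
// sharded cache, then a storage-node replica.
function tieredGet(
  local: Map<string, string>,
  shardedCache: Map<string, string>,
  replicaGet: (key: string) => string | undefined,
  key: string,
): { level: 1 | 2 | 3; value?: string } {
  const l1 = local.get(key);
  if (l1 !== undefined) return { level: 1, value: l1 };   // recently accessed here
  const l2 = shardedCache.get(key);
  if (l2 !== undefined) return { level: 2, value: l2 };   // accessed somewhere in the DC
  return { level: 3, value: replicaGet(key) };            // cold key: go to storage
}
```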
    <div>
      <h2>The results</h2>
      <a href="#the-results">
        
      </a>
    </div>
    <p>The percentage of keys that can be resolved within a data center improved significantly by adding the second caching layer. The worst performing instance has a cache hit rate higher than 99.99%. All other instances have a cache hit rate that is higher than 99.999%.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6V5BgOVPsTrYQf4WuUKdmg/67756312764cdab789eff1ff33caf0f3/image5.png" />
          </figure><p><sup>The combined level 1 and level 2 cache hit-rate is 99.99% or higher for the worst caching instance.</sup></p>
    <div>
      <h2>Final notes</h2>
      <a href="#final-notes">
        
      </a>
    </div>
    <p>It took the team quite a few years to go from the old Quicksilver V1, where all data was stored on each server, to the tiered-caching Quicksilver V2, where all but a handful of servers only have cache. We faced many challenges, including migrating hundreds of thousands of live databases without interruptions, while serving billions of requests per second. A lot of code changes were rolled out, with the result that Quicksilver now has a significantly different architecture. All of this was done transparently to our customers. It was all done iteratively, always learning from the previous step before taking the next one, and always making sure that, if at all possible, changes were easy to revert. These are important strategies for migrating complex systems safely.</p><p>If you like these kinds of stories, keep an eye out for more development stories on our blog. And if you are enthusiastic about solving these kinds of problems, <a href="https://www.cloudflare.com/en-gb/careers/jobs/"><u>we are hiring for multiple types of roles across the organization</u></a>.</p>
    <div>
      <h2>Thank you</h2>
      <a href="#thank-you">
        
      </a>
    </div>
    <p>And finally, a big thanks to the rest of the Quicksilver team, because we all do this together: Aleksandr Matveev, Aleksei Surikov, Alex Dzyoba, Alexandra (Modi) Stana-Palade, Francois Stiennon, Geoffrey Plouviez, Ilya Polyakovskiy, Manzur Mukhitdinov, Volodymyr Dorokhov.</p> ]]></content:encoded>
            <category><![CDATA[Cache]]></category>
            <category><![CDATA[Quicksilver]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Key Value]]></category>
            <guid isPermaLink="false">5oxVTbLoVGabfWNbMtNbs</guid>
            <dc:creator>Marten van de Sanden</dc:creator>
            <dc:creator>Anton Dort-Golts</dc:creator>
        </item>
        <item>
            <title><![CDATA[Quicksilver v2: evolution of a globally distributed key-value store (Part 1)]]></title>
            <link>https://blog.cloudflare.com/quicksilver-v2-evolution-of-a-globally-distributed-key-value-store-part-1/</link>
            <pubDate>Thu, 10 Jul 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ This blog post is the first of a series, in which we share our journey in redesigning Quicksilver — Cloudflare’s distributed key-value store that serves over 3 billion keys per second globally.  ]]></description>
            <content:encoded><![CDATA[ <p>Quicksilver is a key-value store developed internally by Cloudflare to enable fast global replication and low-latency access on a planet scale. It was <a href="https://blog.cloudflare.com/introducing-quicksilver-configuration-distribution-at-internet-scale/"><u>initially designed</u></a> to be a global distribution system for configurations, but over time it gained popularity and became the foundational storage system for many products in Cloudflare.</p><p>A previous <a href="https://blog.cloudflare.com/moving-quicksilver-into-production/"><u>post</u></a> described how we moved Quicksilver to production and started replicating on all machines across our global network. That is what we called Quicksilver v1: each server has a full copy of the data and updates it through asynchronous replication. The design served us well for some time. However, as our business grew with an ever-expanding data center footprint and a growing dataset, it became more and more expensive to store everything everywhere.</p><p>We realized that storing the full dataset on every server is inefficient. Due to the uniform design, data accessed in one region or data center is replicated globally, even if it's never accessed elsewhere. This leads to wasted disk space. We decided to introduce a more efficient system with two new server roles: <b><i>replica</i></b>, which stores the full dataset and <b><i>proxy</i></b>, which acts as a persistent cache, evicting unused key-value pairs to free up some disk space. We call this design <b>Quicksilver v1.5</b> – an interim step towards a more sophisticated and scalable system.</p><p>To understand how those two roles helped us reduce disk space usage, we first need to share some background on our setup and introduce some terminology. 
Cloudflare is architected with a few hyperscale core data centers that form our <a href="https://www.cloudflare.com/learning/network-layer/what-is-the-control-plane/"><u>control plane</u></a>, and many smaller data centers distributed across the globe where resources are more constrained. Quicksilver has dozens of servers in the core data centers with terabytes of storage called <b><i>root nodes</i></b>. In the smaller data centers, though, things are different. A typical data center has two types of nodes: <b><i>intermediate nodes</i></b> and <b><i>leaf nodes</i></b>. Intermediate servers replicate data either from other intermediate nodes or directly from the root nodes. Leaf nodes serve end-user traffic and receive updates from intermediate servers, effectively forming the leaves of a replication tree. Disk capacity varies significantly between node types: while root nodes aren't facing an imminent disk space bottleneck, it is a definite concern for leaf nodes.</p><p>Every server – whether it’s a root, intermediate, or leaf – hosts 10 Quicksilver <b><i>instances</i></b>. These are independent databases, each used by specific Cloudflare services or products such as <a href="https://www.cloudflare.com/application-services/products/dns/"><u>DNS</u></a>, <a href="https://www.cloudflare.com/application-services/products/cdn/"><u>CDN</u></a>, or <a href="https://www.cloudflare.com/application-services/products/waf/"><u>WAF</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/336Tl3Q00d5MVe29nzkQcY/3781ee725f2da6a847677235c1c3c960/image5.png" />
          </figure><p><sup>Figure 1. Global Quicksilver</sup></p><p>Let’s consider the role distribution. Instead of hosting ten full datasets on every machine within a data center, what if we deploy only a few replicas in each? The remaining servers would be proxies, maintaining a persistent cache of hot keys and querying replicas for any cache misses.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3RzCZ7Yr1Nx3wd28DL5uco/5e4bd0cd0ce649080e77814ab6f048c0/image1.png" />
          </figure><p><sup>Figure 2. Role allocation for different Quicksilver instances</sup></p><p>Data centers across our network are very different in size, ranging from hundreds of servers to a single rack with just a few servers. To ensure every data center has at least one replica, the simplest initial step is an even split: on each server, place five replicas of some instances and five proxies for others. The change immediately frees up disk space, as the cached hot dataset on a proxy should be smaller than a full replica. While it doesn’t remove the bottleneck entirely, it could, in theory, reduce disk space usage by up to 50%. More importantly, it lays the foundation for a new distributed design of Quicksilver, where queries can be served by multiple machines in a data center, paving the way for further horizontal scaling. Additionally, an iterative approach helps battle-test the code changes earlier.</p>
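<p>A minimal Python sketch of such an even split might look like the following. The staggered assignment and all names are illustrative assumptions, not Quicksilver’s real placement logic:</p>

```python
# Illustrative sketch: split replica/proxy roles evenly across a data center.
# Each server hosts 10 Quicksilver instances; for half of them it acts as a
# replica, for the other half as a proxy. Staggering the split per server
# spreads replicas of every instance across machines.

def assign_roles(num_servers: int, num_instances: int = 10) -> dict:
    roles = {}
    for server in range(num_servers):
        for instance in range(num_instances):
            # Rotate which half of the instances this server replicates.
            is_replica = (instance + server) % num_instances < num_instances // 2
            roles[(server, instance)] = "replica" if is_replica else "proxy"
    return roles

roles = assign_roles(num_servers=8)
# Every instance ends up with at least one replica in the data center.
for instance in range(10):
    assert any(roles[(s, instance)] == "replica" for s in range(8))
```

<p>Rotating the split per server is one simple way to guarantee that no instance is left without a replica in a data center, while keeping exactly half of each server’s instances as space-saving proxies.</p>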
    <div>
      <h2>Can it even work?</h2>
      <a href="#can-it-even-work">
        
      </a>
    </div>
    <p>Before committing to building Quicksilver v1.5, we wanted to be sure that the proxy/replica design would actually work for our workload. If proxies needed to cache the entire dataset for good performance, then it would be a dead end, offering no potential disk space benefits. To assess this, we built a data pipeline that pushes accessed keys from all across our network to <a href="https://clickhouse.com"><u>ClickHouse</u></a>. This allowed us to estimate typical sizes of working sets. Our analysis revealed that:</p><ul><li><p>in large data centers, approximately 20% of the keyspace was in use</p></li><li><p>in small data centers, this number dropped to just about 1%</p></li></ul><p>These findings gave us confidence that the caching approach should work, though it wouldn’t be without its challenges.</p>
    <div>
      <h2>Persistent caching</h2>
      <a href="#persistent-caching">
        
      </a>
    </div>
    <p>When talking about caches, the first thing that comes to mind is an in-memory cache. However, this cannot work for Quicksilver for two main reasons: memory usage and the “cold cache” problem.</p><p>Indeed, with billions of stored keys, even a fraction of them would lead to an unmanageable increase in memory usage. System restarts should not affect performance, which means that cache data must be preserved somewhere anyway. So we decided to make the cache persistent and store it in the same way as full datasets: in our embedded <a href="https://rocksdb.org/"><u>RocksDB</u></a>. Thus, cached keys normally sit on disk and can be retrieved on demand with a low memory footprint.</p><p>When a key cannot be found in the proxy’s cache, we request it from a replica using our internal <i>distributed key-value protocol</i>, and put it into the local cache after processing.</p><p>Evictions are based on RocksDB <a href="https://github.com/facebook/rocksdb/wiki/Compaction-Filter"><u>compaction filters</u></a>. Compaction filters allow defining custom logic executed in the background RocksDB threads responsible for compacting files on disk. Each key-value pair is processed with a filter on a regular basis, evicting the least recently used data from disk when available disk space drops below a certain threshold called a <i>soft limit</i>. To track keys accessed on disk, we have an LRU-like in-memory data structure, which is passed to the compaction filter to set the last access date in key metadata and inform potential evictions.</p><p>However, with some specific workloads there is still a chance that evictions will not keep up with disk space growth, and for this scenario we have a <i>hard limit</i>: when available disk space drops below a critical threshold, we temporarily stop adding new keys to the cache. This hurts performance, but it acts as a safeguard, ensuring our proxies remain stable and don't overflow under a massive surge of requests.</p>
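<p>The soft-limit/hard-limit logic described above can be sketched roughly as follows. The thresholds, function names, and data structures are illustrative assumptions, not Quicksilver’s actual implementation; in the real system this logic runs inside RocksDB compaction threads:</p>

```python
# Illustrative sketch of LRU-style eviction with soft and hard disk limits.

SOFT_LIMIT = 100  # free bytes below which eviction begins (illustrative)
HARD_LIMIT = 20   # free bytes below which new keys are not cached

def compaction_filter(last_access: dict, key: str, free_space: int,
                      max_age: float, now: float) -> bool:
    """Return True to evict the key-value pair during compaction."""
    if free_space >= SOFT_LIMIT:
        return False  # plenty of space: keep everything
    age = now - last_access.get(key, 0.0)
    return age > max_age  # evict least recently used entries

def should_cache(free_space: int) -> bool:
    """Hard limit: under severe pressure, stop adding keys to the cache."""
    return free_space > HARD_LIMIT

last_access = {"hot": 95.0, "cold": 10.0}
assert not compaction_filter(last_access, "hot", free_space=50, max_age=30, now=100.0)
assert compaction_filter(last_access, "cold", free_space=50, max_age=30, now=100.0)
assert not should_cache(free_space=5)
```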
    <div>
      <h2>Consistency and asynchronous replication</h2>
      <a href="#consistency-and-asynchronous-replication">
        
      </a>
    </div>
    <p>Quicksilver has, from the start, provided <a href="https://jepsen.io/consistency/models/sequential"><u>sequential consistency</u></a> to clients: if key A was written before B, it’s not possible to read B and not A. We are committed to maintaining this guarantee in the new design. We have experienced <a href="https://www.hyrumslaw.com/"><u>Hyrum's Law</u></a> first hand: Quicksilver is so widely adopted across the company that every property we introduced in earlier versions is now relied upon by other teams, which means that changing behaviour would inevitably break existing functionality and introduce bugs.</p><p>However, there is one thing standing in our way: asynchronous replication. Quicksilver replication is asynchronous mainly because machines in different parts of the world replicate at different speeds, and we don’t want a single server to slow down the entire tree. But it turns out that in a proxy-replica design, independent replication progress can result in non-<a href="https://jepsen.io/consistency/models/monotonic-reads"><u>monotonic</u></a> reads!</p><p>Consider the following scenario: a client sequentially writes keys A, B, C, …, K one after another to the Quicksilver root node. These keys are asynchronously replicated through data centers across our network with varying latency. Imagine we have a proxy at index 5, which has observed keys from A to E, and two replicas:</p><ul><li><p>replica_1 is at index 2 (slightly behind the proxy), having only received A and B</p></li><li><p>replica_2 is at index 9 (slightly ahead due to a faster replication path), having received all keys from A to I</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3miaQrw1UCUypXhra7mada/e713ebf0299e4b6f0a15bb25bbfa451a/image6.jpg" />
          </figure></li></ul><p><sup>Figure 3. Asynchronous replication in QSv1.5</sup></p><p>Now, a client performs two successive requests on a proxy, each time reading the keys E, F, G, H, and I. For simplicity, we assume these keys are not cacheable (for example, due to low disk space). The proxy’s first remote request is routed to replica_2, which already has all the keys and responds with their values. To prevent hot spots in a data center, we load-balance requests from proxies, and the next one lands on replica_1, which hasn’t received any of the requested keys yet and responds with a “not found” error.</p><p>So, which result is correct?</p><p>The correct behavior here is that of Quicksilver v1, which we aim to preserve. If the server at replication index 5 were a replica instead of a proxy, it would have seen updates for keys A through E inclusive, so E would be the only key found in both replies, while all the other keys could not be found yet. This means <i>responses from both replica_1 and replica_2 are wrong!</i></p><p>Therefore, to maintain previous guarantees and API backwards compatibility, Quicksilver v1.5 must address two crucial consistency problems: cases where the replica is ahead of the proxy, and conversely, where it lags behind. For now, let’s focus on the case where a proxy lags behind a replica.</p>
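<p>The scenario above can be reproduced with a small Python sketch (purely illustrative; replicas are modeled as plain dictionaries):</p>

```python
# Illustrative sketch of the non-monotonic read scenario: a proxy at
# replication index 5 load-balances reads across replicas at indices 2
# and 9 and gets contradictory answers for the same keys.

def replica_view(index: int) -> dict:
    """Keys A..K are written at indices 1..11; a replica at `index`
    has only seen the first `index` writes."""
    keys = "ABCDEFGHIJK"
    return {keys[i]: i + 1 for i in range(index)}

def read(replica: dict, wanted: str) -> set:
    return {k for k in wanted if k in replica}

wanted = "EFGHI"
first = read(replica_view(9), wanted)   # first request lands on replica_2
second = read(replica_view(2), wanted)  # load balancing sends the next to replica_1
assert first == {"E", "F", "G", "H", "I"}
assert second == set()                  # non-monotonic: the keys "disappeared"

# The correct v1 behavior, for a node at index 5, would be:
assert read(replica_view(5), wanted) == {"E"}
```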
    <div>
      <h2>Multiversion concurrency control</h2>
      <a href="#multiversion-concurrency-control">
        
      </a>
    </div>
    <p>In our example, replica_2 responds to a request from a proxy “from the past”. We cannot use any locks to synchronize two servers, as that would introduce undesirable delays to the replication tree, defeating the purpose of asynchronous replication. The only option is for replicas to maintain a history of recent updates. This naturally leads us to implementing <b>multiversion concurrency control </b>(<a href="https://en.wikipedia.org/wiki/Multiversion_concurrency_control"><u>MVCC</u></a>), a popular database mechanism for tracking changes in a non-blocking fashion, where for any key we can keep multiple versions of its value for different points in time.</p><p>With MVCC, we no longer overwrite the latest value of a key in the default <a href="https://github.com/facebook/rocksdb/wiki/column-families"><u>column family</u></a> for every update. Instead, we introduced a new MVCC column family in RocksDB, where all updates are stored with a corresponding replication index. A lookup for a key at some index in the past goes as follows:</p><ol><li><p>First, we search the default column family. If the key is found and its write timestamp is not greater than the index of the requesting proxy, we can use it straight away.</p></li><li><p>Otherwise, we scan the MVCC column family, where keys have unique suffixes based on the latest timestamps for which they are still valid.</p></li></ol><p>In the example above, replica_2 has MVCC enabled and has keys A@1 .. K@11 in its default column family. The MVCC column family is initially empty, because no keys have been overwritten yet. When it receives a request for, say, key H with target index 5, it first performs a lookup in the default column family and finds the key, but its timestamp is 8, which means this version should not be visible to the proxy yet. It then scans the MVCC column family, finds no matching previous versions, and responds with “not found” to the proxy. Had key H been updated twice, at indexes 4 and 8, we would have placed the initial version into the MVCC column family before overwriting it in the default column family, and the proxy would receive the first version in response.</p><p>If key E is requested at index 5, replica_2 can find it quickly in the default column family and return it to the proxy. There is no need to read from the MVCC column family, as the timestamp of the latest version (5) satisfies the request.</p><p>Another corner case to consider is deletions. When a key is deleted and then re-written, we need to explicitly mark the period of removal in the MVCC column family. For that we’ve implemented <a href="https://en.wikipedia.org/wiki/Tombstone_(data_store)"><u>tombstones</u></a> – a special value format for absent keys.</p><p>Finally, we need to make sure that key history does not grow uncontrollably, using up all of the available disk space. Luckily, we don’t actually need to record history for a long period of time; it just needs to cover the maximum replication index difference between any two machines. In practice, a two-hour interval turned out to be more than enough, while adding only about 500 MB of extra disk space usage. All records in the MVCC column family older than two hours are garbage collected, and for that we again use custom RocksDB compaction filters.</p>
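<p>The two-step lookup described above can be sketched as follows. Column families are modeled as plain dictionaries and the suffix layout is an illustrative assumption (a real implementation would scan suffixed RocksDB keys):</p>

```python
# Illustrative MVCC lookup: the default column family holds the latest
# version of each key as (value, write_index); older versions live in an
# MVCC column family keyed by (key, index-until-which-it-was-valid).
# Tombstones mark periods when a key was deleted.

NOT_FOUND = object()
TOMBSTONE = object()

def mvcc_lookup(default_cf: dict, mvcc_cf: dict, key: str, target_index: int):
    # Step 1: check the default column family first.
    entry = default_cf.get(key)
    if entry is not None:
        value, write_index = entry
        if write_index <= target_index:
            return value  # latest version is already visible to the proxy
    # Step 2: scan the MVCC column family for an older version that was
    # still valid at target_index.
    candidates = [(valid_until, v) for (k, valid_until), v in mvcc_cf.items()
                  if k == key and valid_until >= target_index]
    if not candidates:
        return NOT_FOUND
    value = min(candidates)[1]  # the version whose validity covers target_index
    return NOT_FOUND if value is TOMBSTONE else value

# Key H written at index 8; a proxy at index 5 must not see it yet.
default_cf = {"E": ("e-val", 5), "H": ("h-val", 8)}
assert mvcc_lookup(default_cf, {}, "H", target_index=5) is NOT_FOUND
assert mvcc_lookup(default_cf, {}, "E", target_index=5) == "e-val"

# Had H been written at 4 then overwritten at 8, the first version would
# sit in MVCC (valid until index 7) and be served to the proxy.
assert mvcc_lookup(default_cf, {("H", 7): "h-old"}, "H", target_index=5) == "h-old"
```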
    <div>
      <h2>Sliding window</h2>
      <a href="#sliding-window">
        
      </a>
    </div>
    <p>Now we know how to deal with proxies lagging behind replicas. But what about the opposite case, when a proxy is ahead of replicas?</p><p>The simplest solution is for a replica to not serve requests with a target index higher than its own. After all, it cannot know about keys from the future, whether they will be added, updated, or removed. In fact, our first implementation just returned an error when the proxy was ahead, as we expected this to happen quite infrequently. But after rolling out gradually to a few data centers, our metrics made it clear that the approach was not going to work.</p><p>This led us to analyze which keys are affected by this kind of replication asymmetry. It’s definitely not keys added or updated a long time ago, because replicas would already have those changes replicated. The only problematic keys are those updated very recently, which the proxy already knows about, but the replica does not.</p><p>With this insight, we realized that the issue should be solved on the proxy side rather than on the replica side. By preserving <i>all</i> recent updates locally, the proxy can avoid querying the replica. This became known as the <b>sliding window</b> approach.</p><p>The sliding window retains all recent updates written in a short, rolling timeframe. Unlike cached keys, items in the window cannot be evicted until they move outside of it. Internally, the sliding window is defined by lower and upper boundary pointers. These are kept in memory and can easily be restored after a reload from the current database index and the pre-configured window size.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3kj7NdphFy8BQoJmsRH8fH/37ca06ac1c1800e4e97073815e754e51/image4.jpg" />
          </figure><p><sup>Figure 4. The sliding window shifts when replication updates arrive</sup></p><p>When a new update event arrives from the replication layer, we add it to the sliding window by moving both the upper and lower boundaries one position higher, thereby maintaining the fixed size of the window. Keys written before the lower boundary can be evicted by the compaction filter, which is aware of the current sliding window boundaries.</p>
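<p>A minimal sketch of the sliding window bookkeeping, with illustrative names (the real structure lives in memory alongside the database’s replication index):</p>

```python
# Illustrative sliding window over replication indices: recent updates
# inside the window cannot be evicted; the window shifts as new updates
# arrive, keeping its size fixed.

class SlidingWindow:
    def __init__(self, size: int):
        self.size = size
        self.upper = 0  # current database replication index

    def on_update(self, index: int):
        # A new replication event moves both boundaries up together.
        self.upper = index

    @property
    def lower(self) -> int:
        # Restorable after a reload from the current index and window size.
        return max(0, self.upper - self.size)

    def evictable(self, write_index: int) -> bool:
        # The compaction filter may only evict keys written before the
        # lower boundary of the window.
        return write_index < self.lower

w = SlidingWindow(size=5)
w.on_update(10)
assert (w.lower, w.upper) == (5, 10)
assert w.evictable(3)      # old enough to evict
assert not w.evictable(7)  # still inside the window: must be kept
```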
    <div>
      <h2>Negative lookups</h2>
      <a href="#negative-lookups">
        
      </a>
    </div>
    <p>Another problem arising with our distributed replica-proxy design is negative lookups – requests for keys that don't exist in the database. Interestingly, in our workloads we see about ten times more negative lookups than positive ones!</p><p>But why is this a problem? Unfortunately, each negative lookup is a cache miss on a proxy, requiring a request to a replica. Given the volume of requests and the proportion of such lookups, it would be a disaster for performance, with overloaded replicas, overused data center networks, and massive latency degradation. We needed a fast and efficient way to identify non-existing keys directly at the proxy level.</p><p>In v1, negative lookups are the quickest type of request. We rely on a special probabilistic data structure – <a href="https://en.wikipedia.org/wiki/Bloom_filter"><u>Bloom filters</u></a> – used in RocksDB to determine whether the requested key might belong to a certain data file containing a range of sorted keys (called a Sorted Sequence Table, or SST) or definitely does not. 99% of the time, negative lookups are served using only this in-memory data structure, avoiding the need for disk I/O.</p><p>One approach we considered for proxies was to cache negative lookups. Two problems immediately arise:</p><ul><li><p>How big is the keyspace of negative lookups? In theory, it’s infinite, but the real size was unclear. We can store it in our cache only if it is small enough.</p></li><li><p>Cached negative lookups would no longer be served by the fast Bloom filters. We have row and block caches in RocksDB, but their hit rate is nowhere near that of the filters for SSTs, which means negative lookups would end up going to disk more often.</p></li></ul><p>These turned out to be dealbreakers: not only was the negative keyspace vast, greatly exceeding the actual keyspace (by a thousand times for some instances!), but clients also need lookups to be <i>really</i> fast, ideally served from memory.</p><p>In pursuit of probabilistic data structures that could give us a dynamic, compact representation of the full keyspace on proxies, we spent some time exploring <a href="https://en.wikipedia.org/wiki/Cuckoo_filter"><u>Cuckoo filters</u></a>. Unfortunately, with 5 billion keys it takes about 18 GB to achieve a false positive rate similar to Bloom filters (which only require 6 GB). And this is not only about wasted disk space — to be fast, we have to keep it all in memory too!</p><p>Clearly some other solution was needed.</p><p>Finally, we decided to implement <b>key and value separation</b>, storing all keys on proxies but persisting values only for cached keys. Evicting a key from the cache actually results in the removal of its value.</p><p>But wait, don’t the keys, even stripped of values, take up a lot of space? Well, yes and no.</p><p>The total size of the keys alone in Quicksilver is approximately 11 times smaller than the full dataset. Of course, it’s larger than any representation by a probabilistic data structure, but this solution has some very desirable properties. Firstly, we continue to enjoy fast Bloom filter lookups in RocksDB. Another benefit is that it unlocks some cool optimizations for range queries in a distributed context.</p><p>We may revisit this one day, but so far it has worked great for us.</p>
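<p>The key and value separation idea can be sketched as follows; the structures and names are illustrative, not the actual proxy code:</p>

```python
# Illustrative sketch of key/value separation on a proxy: every key is
# kept locally, but eviction drops only the value. Negative lookups are
# then answered without contacting a replica.

MISSING = object()  # key exists on the proxy but its value was evicted

class ProxyStore:
    def __init__(self):
        self.store = {}
        self.remote_requests = 0

    def replicate(self, key, value):
        self.store[key] = value

    def evict(self, key):
        self.store[key] = MISSING  # drop the value, keep the key

    def get(self, key, fetch_from_replica):
        if key not in self.store:
            return None  # negative lookup: resolved locally, no remote call
        value = self.store[key]
        if value is MISSING:
            self.remote_requests += 1
            value = fetch_from_replica(key)
            self.store[key] = value  # re-cache the value after the miss
        return value

replica = {"a": 1, "b": 2}
proxy = ProxyStore()
proxy.replicate("a", 1)
proxy.replicate("b", 2)
proxy.evict("b")

assert proxy.get("no-such-key", replica.get) is None
assert proxy.remote_requests == 0        # the negative lookup stayed local
assert proxy.get("b", replica.get) == 2  # evicted value refetched once
assert proxy.remote_requests == 1
```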
    <div>
      <h2>Discovery mechanism</h2>
      <a href="#discovery-mechanism">
        
      </a>
    </div>
    <p>Having solved all of the above challenges, one bit remained to be sorted out to make distributed query execution work: how can proxies discover replicas?</p><p>Within the local data center, it is fairly easy. Each data center runs its own <a href="https://www.consul.io/"><u>consul</u></a> cluster, where machines are registered as services. Consul is well integrated with our internal DNS resolvers, and with a single DNS request we can get the names of all replicas running in a data center, which proxies can connect to directly.</p><p>However, data centers vary in size, servers are constantly added and removed, and local discovery alone would not be enough for the system to work reliably. Proxies also need to find replicas in other nearby data centers.</p><p>We had previously encountered a similar problem with our replication layer. Initially, the replication topology was statically defined in a configuration file and distributed to all servers, so that they knew from which sources they should replicate. While simple, this approach was quite fragile and tedious to operate. It led to a rigid replication tree with suboptimal overall performance, unable to adapt to network changes.</p><p>Our solution to this problem was the <b>Network Oracle</b> – a special overlay network based on a <a href="https://en.wikipedia.org/wiki/Gossip_protocol"><u>gossip</u></a> protocol and consisting of intermediate nodes in our data centers. Each member of this overlay constantly exchanges status and meta-information with other nodes, which lets us see active members in near-real time. Each member runs network probes measuring round-trip time to its peers, making it easy to find the closest (in terms of <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/"><u>RTT</u></a>) active intermediate nodes and form a low-latency replication tree. Introducing the Network Oracle was a major improvement: we no longer need to reconfigure the topology when intermediate nodes or entire data centers go down, or investigate frequent replication issues. Replication is now a completely self-organized, self-healing dynamic system.</p><p>Naturally, we decided to reuse the Network Oracle for our discovery mechanism. It consists of two subproblems: data center discovery and specific service lookup. We use the Network Oracle to find the closest data centers. Adding all machines running Quicksilver to the same overlay would be inefficient because of the significant increase in network traffic and message delivery times. Instead, we use intermediate nodes as sources of <b>network proximity</b> information for the leaf nodes. Knowing which data centers are nearby, we can directly send DNS queries there to resolve specific services – Quicksilver replicas in this case.</p><p>Proxies maintain a pool of connections to active replicas and distribute requests among them to smooth out the load and avoid hotspots in a data center. Proxies also have a health-tracking mechanism, monitoring the state of connections and errors coming from replicas, and temporarily deprioritizing or isolating potentially faulty ones.</p>
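<p>The two-step discovery flow (find the closest data centers, then resolve replicas there) can be sketched like this, with RTT measurements and DNS lookups mocked as plain dictionaries; all names and data are illustrative:</p>

```python
# Illustrative discovery sketch: rank nearby data centers by measured
# RTT (as the Network Oracle overlay would), then "resolve" Quicksilver
# replicas in the closest ones.

def closest_data_centers(rtt_ms: dict, n: int) -> list:
    """Pick the n lowest-RTT data centers."""
    return sorted(rtt_ms, key=rtt_ms.get)[:n]

def discover_replicas(rtt_ms: dict, dns: dict, n: int = 2) -> list:
    replicas = []
    for dc in closest_data_centers(rtt_ms, n):
        # In production this would be a DNS query against the local
        # consul-backed resolver; here it's a plain dict lookup.
        replicas.extend(dns.get(dc, []))
    return replicas

rtt_ms = {"ams": 5.0, "fra": 12.0, "lhr": 9.0}
dns = {"ams": ["ams-r1", "ams-r2"], "lhr": ["lhr-r1"], "fra": ["fra-r1"]}

assert closest_data_centers(rtt_ms, 2) == ["ams", "lhr"]
assert discover_replicas(rtt_ms, dns) == ["ams-r1", "ams-r2", "lhr-r1"]
```

<p>A connection pool over the discovered replicas, plus per-replica error tracking, would then handle the load balancing and health isolation described above.</p>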
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/70olkvpfiCY3aDNn1owv1j/a188c0b9253c0e23f0341463a31b5f8e/image3.png" />
          </figure><p><sup>Figure 5. Internal replica request errors</sup></p><p>To demonstrate its efficiency, we graphed errors coming from replica requests, which showed that such errors almost disappeared after introducing the new discovery system.</p>
    <div>
      <h2>Results</h2>
      <a href="#results">
        
      </a>
    </div>
    <p>Our objective with Quicksilver v1.5 was simple: gain some disk space without losing request latency, because clients rely heavily on us being fast. While the replica-proxy design delivered significant space savings, what about latencies?</p><p>Proxy</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/jyrzamGb5K05r7EOSJbrq/05a30c951ad2c9c91edd72f898bd09f1/image7.png" />
          </figure><p>Replica</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ry4erxQ6vf7WF6wSfPpit/0de9d46bbdb4a69f2d210ec6c8baadcd/image2.png" />
          </figure><p><sup>Figure 6. Proxy-replica latency comparison</sup></p><p>Above, we have the 99.9th percentile of request latency on both a replica and a proxy during a 24-hour window. One can hardly find a difference between the two. Surprisingly, proxies can even be slightly faster than replicas at times, likely because of their smaller on-disk datasets!</p><p>Quicksilver v1.5 is released, but our journey to a highly scalable and efficient solution is not over. In the next post, we’ll share the challenges we faced with the following iteration. Stay tuned!</p>
    <div>
      <h2>Thank you</h2>
      <a href="#thank-you">
        
      </a>
    </div>
    <p>This project was a big team effort, so we’d like to thank everyone on the Quicksilver team – it would not have come true without you all.</p><ul><li><p>Aleksandr Matveev</p></li><li><p>Aleksei Surikov</p></li><li><p>Alex Dzyoba</p></li><li><p>Alexandra (Modi) Stana-Palade</p></li><li><p>Francois Stiennon</p></li><li><p>Geoffrey Plouviez</p></li><li><p>Ilya Polyakovskiy</p></li><li><p>Manzur Mukhitdinov</p></li><li><p>Volodymyr Dorokhov</p></li></ul><p></p> ]]></content:encoded>
            <category><![CDATA[Quicksilver]]></category>
            <category><![CDATA[Cache]]></category>
            <category><![CDATA[RocksDB]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Replication]]></category>
            <category><![CDATA[Distributed]]></category>
            <guid isPermaLink="false">4gVdi0wvV700DVpfnnBo7j</guid>
            <dc:creator>Anton Dort-Golts</dc:creator>
            <dc:creator>Marten van de Sanden</dc:creator>
        </item>
        <item>
            <title><![CDATA[R2 Data Catalog: Managed Apache Iceberg tables with zero egress fees]]></title>
            <link>https://blog.cloudflare.com/r2-data-catalog-public-beta/</link>
            <pubDate>Thu, 10 Apr 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ R2 Data Catalog is now in public beta: a managed Apache Iceberg data catalog built directly into your R2 bucket. ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://iceberg.apache.org/"><u>Apache Iceberg</u></a> is quickly becoming the standard table format for querying large analytic datasets in <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/">object storage</a>. We’re seeing this trend firsthand as more and more developers and data teams adopt Iceberg on <a href="https://www.cloudflare.com/developer-platform/products/r2/"><u>Cloudflare R2</u></a>. But until now, using Iceberg with R2 meant managing additional infrastructure or relying on external data catalogs.</p><p>So we’re fixing this. Today, we’re launching the <a href="https://developers.cloudflare.com/r2/data-catalog/"><u>R2 Data Catalog</u></a> in public beta, a managed Apache Iceberg catalog built directly into your Cloudflare R2 bucket.</p><p>If you’re not already familiar with it, Iceberg is an open table format built for large-scale analytics on datasets stored in object storage. With R2 Data Catalog, you get the database-like capabilities Iceberg is known for – <a href="https://en.wikipedia.org/wiki/ACID"><u>ACID</u></a> transactions, schema evolution, and efficient querying – without the overhead of managing your own external catalog.</p><p>R2 Data Catalog exposes a standard Iceberg REST catalog interface, so you can connect the engines you already use, like <a href="https://py.iceberg.apache.org/"><u>PyIceberg</u></a>, <a href="https://www.snowflake.com/"><u>Snowflake</u></a>, and <a href="https://spark.apache.org/"><u>Spark</u></a>. And, as always with R2, there are no egress fees, meaning that no matter which cloud or region your data is consumed from, you won’t have to worry about growing data transfer costs.</p><p>Ready to query data in R2 right now? Jump into the <a href="https://developers.cloudflare.com/r2/data-catalog/"><u>developer docs</u></a> and enable a data catalog on your R2 bucket in just a few clicks. 
Or keep reading to learn more about Iceberg, data catalogs, how metadata files work under the hood, and how to create your first Iceberg table.</p>
    <div>
      <h2>What is Apache Iceberg?</h2>
      <a href="#what-is-apache-iceberg">
        
      </a>
    </div>
    <p><a href="https://iceberg.apache.org/"><u>Apache Iceberg</u></a> is an open table format for analyzing large datasets in object storage. It brings database-like features – ACID transactions, time travel, and schema evolution – to files stored in formats like <a href="https://parquet.apache.org/"><u>Parquet</u></a> or <a href="https://orc.apache.org/"><u>ORC</u></a>.</p><p>Historically, data lakes were just collections of raw files in object storage. However, without a unified metadata layer, datasets could easily become corrupted, were difficult to evolve, and queries often required expensive full-table scans.</p><p>Iceberg solves these problems by:</p><ul><li><p>Providing ACID transactions for reliable, concurrent reads and writes.</p></li><li><p>Maintaining optimized metadata, so engines can skip irrelevant files and avoid unnecessary full-table scans.</p></li><li><p>Supporting schema evolution, allowing columns to be added, renamed, or dropped without rewriting existing data.</p></li></ul><p>Iceberg is already <a href="https://iceberg.apache.org/vendors/"><u>widely supported</u></a> by engines like Apache Spark, Trino, Snowflake, DuckDB, and ClickHouse, with a fast-growing community behind it.</p>
    <div>
      <h3>How Iceberg tables are stored</h3>
      <a href="#how-iceberg-tables-are-stored">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/779M4zsH5QnpDwlTORk1fo/38e7732ca0e20645507bdc0c628f671b/1.png" />
          </figure><p>Internally, an Iceberg table is a collection of data files (typically stored in columnar formats like Parquet or ORC) and metadata files (typically stored in JSON or <a href="https://avro.apache.org/"><u>Avro</u></a>) that describe table snapshots, schemas, and partition layouts.</p><p>To understand how query engines interact efficiently with Iceberg tables, it helps to look at an Iceberg metadata file (simplified):</p>
            <pre><code>{
  "format-version": 2,
  "table-uuid": "0195e49b-8f7c-7933-8b43-d2902c72720a",
  "location": "s3://my-bucket/warehouse/0195e49b-79ca/table",
  "current-schema-id": 0,
  "schemas": [
    {
      "schema-id": 0,
      "type": "struct",
      "fields": [
        { "id": 1, "name": "id", "required": false, "type": "long" },
        { "id": 2, "name": "data", "required": false, "type": "string" }
      ]
    }
  ],
  "current-snapshot-id": 3567362634015106507,
  "snapshots": [
    {
      "snapshot-id": 3567362634015106507,
      "sequence-number": 1,
      "timestamp-ms": 1743297158403,
      "manifest-list": "s3://my-bucket/warehouse/0195e49b-79ca/table/metadata/snap-3567362634015106507-0.avro",
      "summary": {},
      "schema-id": 0
    }
  ],
  "partition-specs": [{ "spec-id": 0, "fields": [] }]
}</code></pre>
            <p>A few of the important components are:</p><ul><li><p><code>schemas</code>: Iceberg tracks schema changes over time. Engines use schema information to safely read and write data without needing to rewrite underlying files.</p></li><li><p><code>snapshots</code>: Each snapshot references a specific set of data files that represent the state of the table at a point in time. This enables features like time travel.</p></li><li><p><code>partition-specs</code>: These define how the table is logically partitioned. Query engines leverage this information during planning to skip unnecessary partitions, greatly improving query performance.</p></li></ul><p>By reading Iceberg metadata, query engines can efficiently prune partitions, load only the relevant snapshots, and fetch only the data files they need, resulting in faster queries.</p>
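<p>As an illustration, the first planning steps an engine performs against this metadata can be sketched in a few lines of Python, using the simplified document above (plain dict lookups stand in for a real Iceberg client):</p>

```python
import json

# Simplified Iceberg table metadata, as in the example above.
metadata = json.loads("""
{
  "format-version": 2,
  "current-schema-id": 0,
  "schemas": [
    {"schema-id": 0, "type": "struct", "fields": [
      {"id": 1, "name": "id", "required": false, "type": "long"},
      {"id": 2, "name": "data", "required": false, "type": "string"}
    ]}
  ],
  "current-snapshot-id": 3567362634015106507,
  "snapshots": [
    {"snapshot-id": 3567362634015106507,
     "manifest-list": "s3://my-bucket/warehouse/0195e49b-79ca/table/metadata/snap-3567362634015106507-0.avro"}
  ]
}
""")

# Resolve the current schema and snapshot, the way a query engine would.
schema = next(s for s in metadata["schemas"]
              if s["schema-id"] == metadata["current-schema-id"])
snapshot = next(s for s in metadata["snapshots"]
                if s["snapshot-id"] == metadata["current-snapshot-id"])

column_names = [f["name"] for f in schema["fields"]]
manifest_list = snapshot["manifest-list"]  # the next file the engine fetches
```

<p>From here, a real engine would fetch the Avro manifest list to discover which data files (and partitions) the snapshot references.</p>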
    <div>
      <h3>Why do you need a data catalog?</h3>
      <a href="#why-do-you-need-a-data-catalog">
        
      </a>
    </div>
    <p>Although the Iceberg data and metadata files themselves live directly in object storage (like <a href="https://developers.cloudflare.com/r2/"><u>R2</u></a>), the list of tables and pointers to the current metadata need to be tracked centrally by a data catalog.</p><p>Think of a data catalog as a library's index system. While books (your data) are physically distributed across shelves (object storage), the index provides a single source of truth about what books exist, their locations, and their latest editions. Without this index, readers (query engines) would waste time searching for books, might access outdated versions, or could accidentally shelve new books in ways that make them unfindable.</p><p>Similarly, data catalogs ensure consistent, coordinated access, allowing multiple query engines to safely read from and write to the same tables without conflicts or data corruption.</p>
    <div>
      <h2>Create your first Iceberg table on R2</h2>
      <a href="#create-your-first-iceberg-table-on-r2">
        
      </a>
    </div>
    <p>Ready to try it out? Here’s a quick example using <a href="https://py.iceberg.apache.org/"><u>PyIceberg</u></a> and Python to get you started. For a detailed step-by-step guide, check out our <a href="https://developers.cloudflare.com/r2/data-catalog/get-started/"><u>developer docs</u></a>.</p><p>1. Enable R2 Data Catalog on your bucket:
</p>
            <pre><code>npx wrangler r2 bucket catalog enable my-bucket</code></pre>
            <p>Or use the Cloudflare dashboard: Navigate to <b>R2 Object Storage</b> &gt; <b>Settings</b> &gt; <b>R2 Data Catalog</b> and click <b>Enable</b>.</p><p>2. Create a <a href="https://developers.cloudflare.com/r2/api/s3/tokens/"><u>Cloudflare API token</u></a> with permissions for both R2 storage and the data catalog.</p><p>3. Install <a href="https://py.iceberg.apache.org/"><u>PyIceberg</u></a> and <a href="https://arrow.apache.org/docs/index.html"><u>PyArrow</u></a>, then open a Python shell or notebook:</p>
            <pre><code>pip install pyiceberg pyarrow</code></pre>
            <p>4. Connect to the catalog and create a table:</p>
            <pre><code>import pyarrow as pa
from pyiceberg.catalog.rest import RestCatalog

# Define catalog connection details (replace variables)
WAREHOUSE = "&lt;WAREHOUSE&gt;"
TOKEN = "&lt;TOKEN&gt;"
CATALOG_URI = "&lt;CATALOG_URI&gt;"

# Connect to R2 Data Catalog
catalog = RestCatalog(
    name="my_catalog",
    warehouse=WAREHOUSE,
    uri=CATALOG_URI,
    token=TOKEN,
)

# Create default namespace
catalog.create_namespace("default")

# Create simple PyArrow table
df = pa.table({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})

# Create an Iceberg table
table = catalog.create_table(
    ("default", "my_table"),
    schema=df.schema,
)</code></pre>
            <p>You can now append more data or run queries, just as you would with any Apache Iceberg table.</p>
    <div>
      <h2>Pricing</h2>
      <a href="#pricing">
        
      </a>
    </div>
    <p>While R2 Data Catalog is in open beta, there will be no additional charges beyond standard R2 storage and operations costs incurred by query engines accessing data. <a href="https://r2-calculator.cloudflare.com/"><u>Storage pricing</u></a> for buckets with R2 Data Catalog enabled remains the same as standard R2 buckets – $0.015 per GB-month. As always, egress directly from R2 buckets remains $0.</p><p>In the future, we plan to introduce pricing for catalog operations (e.g., creating tables, retrieving table metadata, etc.) and data compaction.</p><p>Below is our current thinking on future pricing. We’ll communicate more details around timing well before billing begins, so you can confidently plan your workloads.</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td> </td>
                    <td>
                        <p><span><span><strong>Pricing</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>R2 storage</span></span></p>
                        <p><span><span>For standard storage class</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.015 per GB-month (no change)</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>R2 Class A operations</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$4.50 per million operations (no change)</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>R2 Class B operations</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.36 per million operations (no change)</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Data Catalog operations</span></span></p>
                        <p><span><span>e.g., create table, get table metadata, update table properties</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$9.00 per million catalog operations</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Data Catalog compaction data processed</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0.05 per GB processed</span></span></p>
                        <p><span><span>$4.00 per million objects processed</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>Data egress</span></span></p>
                    </td>
                    <td>
                        <p><span><span>$0 (no change, always free)</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div>
    <div>
      <h2>What’s next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>We’re excited to see how you use R2 Data Catalog! If you’ve never worked with Iceberg – or even analytics data – before, we think this is the easiest way to get started.</p><p>Next on our roadmap is tackling compaction and table optimization. Query engines typically perform better when dealing with fewer, larger data files. We will automatically rewrite collections of small data files into larger files to deliver even faster query performance.</p><p>We’re also collaborating with the broader Apache Iceberg community to expand query-engine compatibility with the Iceberg REST Catalog spec.</p><p>We’d love your feedback. Join the <a href="https://discord.cloudflare.com/"><u>Cloudflare Developer Discord</u></a> to ask questions and share your thoughts during the public beta. For more details, examples, and guides, visit our <a href="https://developers.cloudflare.com/r2/data-catalog/get-started/"><u>developer documentation</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[R2]]></category>
            <category><![CDATA[Data Catalog]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">6JFB9cHUOoMZnVmYIuTLzd</guid>
            <dc:creator>Phillip Jones</dc:creator>
            <dc:creator>Garvit Gupta</dc:creator>
            <dc:creator>Alex Graham</dc:creator>
            <dc:creator>Garrett Gu</dc:creator>
        </item>
        <item>
            <title><![CDATA[Building Vectorize, a distributed vector database, on Cloudflare’s Developer Platform]]></title>
            <link>https://blog.cloudflare.com/building-vectorize-a-distributed-vector-database-on-cloudflare-developer-platform/</link>
            <pubDate>Tue, 22 Oct 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare's Vectorize is now generally available, offering faster responses, lower pricing, a free tier, and supporting up to 5 million vectors. ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://developers.cloudflare.com/vectorize/"><u>Vectorize</u></a> is a globally distributed vector database that enables you to build full-stack, AI-powered applications with Cloudflare Workers. Vectorize makes querying embeddings — representations of values or objects like text, images, and audio that are designed to be consumed by machine learning models and semantic search algorithms — faster, easier, and more affordable.</p><p>In this post, we dive deep into how we built Vectorize on <a href="https://developers.cloudflare.com/"><u>Cloudflare’s Developer Platform</u></a>, leveraging Cloudflare’s <a href="https://www.cloudflare.com/network/"><u>global network</u></a>, <a href="https://developers.cloudflare.com/cache"><u>Cache</u></a>, <a href="https://developers.cloudflare.com/workers/"><u>Workers</u></a>, <a href="https://developers.cloudflare.com/r2/"><u>R2</u></a>, <a href="https://developers.cloudflare.com/queues/"><u>Queues</u></a>, <a href="https://developers.cloudflare.com/durable-objects"><u>Durable Objects</u></a>, and <a href="https://blog.cloudflare.com/container-platform-preview/"><u>container platform</u></a>.</p>
    <div>
      <h2>What is a vector database?</h2>
      <a href="#what-is-a-vector-database">
        
      </a>
    </div>
    <p>A <a href="https://www.cloudflare.com/learning/ai/what-is-vector-database/"><u>vector database</u></a> is a queryable store of vectors. A vector is a large array of numbers; each of those numbers is called a vector dimension.</p><p>A vector database supports a <a href="https://en.wikipedia.org/wiki/Similarity_search"><u>similarity search</u></a> query: given an input vector, it returns the vectors that are closest according to a specified metric, optionally filtered on their metadata.</p><p><a href="https://blog.cloudflare.com/vectorize-vector-database-open-beta/#why-do-i-need-a-vector-database"><u>Vector databases are used</u></a> to power semantic search, document classification, recommendations, and anomaly detection, as well as to contextualize answers generated by LLMs (<a href="https://www.cloudflare.com/learning/ai/retrieval-augmented-generation-rag/"><u>Retrieval Augmented Generation, RAG</u></a>).</p>
    <div>
      <h3>Why do vectors require special database support?</h3>
      <a href="#why-do-vectors-require-special-database-support">
        
      </a>
    </div>
    <p>Conventional data structures like <a href="https://en.wikipedia.org/wiki/B-tree"><u>B-trees</u></a> or <a href="https://en.wikipedia.org/wiki/Binary_search_tree"><u>binary search trees</u></a> expect the data they index to be cheap to compare and to follow a one-dimensional linear ordering. They leverage these properties of the data to organize it in a way that makes search efficient. Strings, numbers, and booleans are examples of data featuring these properties.</p><p>Because vectors are high-dimensional, ordering them in a one-dimensional linear fashion is ineffective for similarity search, as the resulting ordering doesn’t capture the proximity of vectors in the high-dimensional space. This phenomenon is often referred to as the <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality"><u>curse of dimensionality</u></a>.</p><p>In addition to this, comparing two vectors using distance metrics useful for similarity search is a computationally expensive operation, requiring vector-specific techniques for databases to overcome.</p>
    <div>
      <h2>Query processing architecture</h2>
      <a href="#query-processing-architecture">
        
      </a>
    </div>
    <p>Vectorize builds upon Cloudflare’s global network to bring fast vector search close to its users, and relies on many components to do so.</p><p>These are the Vectorize components involved in processing vector queries.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3iEISPtYCjwmggsjjQ4i14/dddac58c03a875ca258456b25f75df38/blog-2590-vectorize-01-query-read.png" />
          </figure><p>Vectorize runs in every <a href="https://www.cloudflare.com/network/"><u>Cloudflare data center</u></a>, on the infrastructure powering Cloudflare Workers. It serves traffic coming from <a href="https://developers.cloudflare.com/workers/runtime-apis/bindings/"><u>Worker bindings</u></a> as well as from the <a href="https://developers.cloudflare.com/api/operations/vectorize-list-vectorize-indexes"><u>Cloudflare REST API</u></a> through our API Gateway.</p><p>Each query is processed on a server in the data center in which it enters, picked in a fashion that spreads the load across all servers of that data center.</p><p>The Vectorize DB Service (a Rust binary) running on that server processes the query by reading the data for that index on <a href="https://www.cloudflare.com/developer-platform/products/r2/"><u>R2</u></a>, Cloudflare’s object storage. It does so by reading through <a href="https://developers.cloudflare.com/cache"><u>Cloudflare’s Cache</u></a> to speed up I/O operations.</p>
    <div>
      <h2>Searching vectors, and indexing them to speed things up</h2>
      <a href="#searching-vectors-and-indexing-them-to-speed-things-up">
        
      </a>
    </div>
    <p>Being a vector database, Vectorize features a similarity search query: given an input vector, it returns the K vectors that are closest according to a specified metric.</p><p>Conceptually, this similarity search consists of 3 steps:</p><ol><li><p>Evaluate the proximity of the query vector with every vector present in the index.</p></li><li><p>Sort the vectors based on their proximity “score”.</p></li><li><p>Return the top matches.</p></li></ol><p>While this method is accurate and effective, it is computationally expensive and does not scale well to indexes containing millions of vectors (see <b>Why do vectors require special database support?</b> above).</p><p>To do better, we need to prune the search space, that is, avoid scanning the entire index for every query.</p><p>For this to work, we need to find a way to discard vectors we know are irrelevant for a query, while focusing our efforts on those that might be relevant.</p>
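<p>As a sketch of those three steps (not Vectorize’s actual implementation), a brute-force top-K search using cosine similarity as a hypothetical metric might look like:</p>

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def exact_top_k(query, index, k):
    """Score every vector in the index, sort by score, return the top K.

    Accurate, but O(n) comparisons per query -- the cost the post describes."""
    scored = [(vec_id, cosine_similarity(query, vec))
              for vec_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
matches = exact_top_k([1.0, 0.0], index, k=2)
```

<p>Every vector is touched on every query, which is exactly why the pruning techniques below matter.</p>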
    <div>
      <h3>Indexing vectors with IVF</h3>
      <a href="#indexing-vectors-with-ivf">
        
      </a>
    </div>
    <p>Vectorize prunes the search space for a query using an indexing technique called <a href="https://blog.dailydoseofds.com/p/approximate-nearest-neighbor-search"><u>IVF, Inverted File Index</u></a>.</p><p>IVF clusters the index vectors according to their relative proximity. For each cluster, it then identifies its centroid, the center of gravity of that cluster, a high-dimensional point minimizing the distance with every vector in the cluster.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1q48ixmjKRsTkBdTR6SbSR/1c0ba3c188a78150e74ef3baf849ae7e/blog-2590-vectorize-02-ivf-index.png" />
          </figure><p>Once the list of centroids is determined, each centroid is given a number. We then structure the data on storage by placing each vector in a file named after the centroid it is closest to.</p><p>When processing a query, we can then focus on relevant vectors by looking only in the centroid files closest to that query vector, effectively pruning the search space.</p>
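<p>Assuming Euclidean distance and precomputed centroids, the bucketing and probing steps can be sketched as (the <code>nprobe</code> parameter and helper names are illustrative, not Vectorize API):</p>

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroid(vec, centroids):
    # Index of the closest centroid; this names the file the vector lands in.
    return min(range(len(centroids)), key=lambda i: l2(vec, centroids[i]))

def build_ivf(vectors, centroids):
    """Place each vector in the 'file' named after its closest centroid."""
    files = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        files[nearest_centroid(vec, centroids)].append(vec_id)
    return files

def probe(query, centroids, files, nprobe=1):
    """At query time, search only the nprobe centroid files closest to the query."""
    ranked = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))
    return [vec_id for i in ranked[:nprobe] for vec_id in files[i]]

centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {"a": [0.5, 0.2], "b": [9.5, 10.1], "c": [0.1, 0.3]}
files = build_ivf(vectors, centroids)
candidates = probe([0.2, 0.2], centroids, files)
```

<p>Only the candidates from the probed files are scored, so most of the index is never read for a given query.</p>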
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3BjkRvCyT4fEgVqIHGkwQg/d692e167dc34fe5b09e94a5ebb29fe28/image8.png" />
          </figure>
    <div>
      <h3>Compressing vectors with PQ</h3>
      <a href="#compressing-vectors-with-pq">
        
      </a>
    </div>
    <p>Vectorize supports vectors of up to 1536 dimensions. At 4 bytes per dimension (32-bit float), this means up to 6 KB per vector. That’s 6 GB of uncompressed vector data per million vectors that we need to fetch from storage and put in memory.</p><p>To process multi-million vector indexes while limiting the CPU, memory, and I/O required to do so, Vectorize uses a <a href="https://en.wikipedia.org/wiki/Dimensionality_reduction"><u>dimensionality reduction</u></a> technique called <a href="https://en.wikipedia.org/wiki/Vector_quantization"><u>PQ (Product Quantization)</u></a>. PQ compresses the vector data in a way that retains most of their specificity while greatly reducing their size — a bit like downsampling a picture to reduce the file size, while still being able to tell precisely what’s in the picture — enabling Vectorize to efficiently perform similarity search on these lighter vectors.</p><p>In addition to storing the compressed vectors, their original data is retained on storage as well, and can be requested through the API; the compressed vector data is used only to speed up the search.</p>
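<p>As an illustration of the general PQ idea (the codebooks here are tiny and made up, not Vectorize’s): each vector is split into sub-vectors, and each sub-vector is replaced by the ID of its nearest entry in a small codebook, so several 4-byte floats collapse into one small integer:</p>

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pq_encode(vec, codebooks):
    """Compress a vector to one codebook ID per sub-vector."""
    m = len(codebooks)          # number of sub-vectors
    sub = len(vec) // m         # dimensions per sub-vector
    codes = []
    for i, book in enumerate(codebooks):
        chunk = vec[i * sub:(i + 1) * sub]
        codes.append(min(range(len(book)), key=lambda c: l2(chunk, book[c])))
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct an approximate vector from the codes."""
    out = []
    for code, book in zip(codes, codebooks):
        out.extend(book[code])
    return out

# Two codebooks of two entries each, for 4-dimensional vectors.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # codebook for dimensions 0-1
    [[0.0, 1.0], [1.0, 0.0]],   # codebook for dimensions 2-3
]
codes = pq_encode([0.9, 1.1, 0.1, 0.9], codebooks)  # 2 small ints, not 4 floats
approx = pq_decode(codes, codebooks)
```

<p>The decoded vector is close to, but not identical to, the original — which is why a refinement pass on uncompressed data (described below) is needed for high accuracy.</p>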
    <div>
      <h3>Approximate nearest neighbor search and result accuracy refining</h3>
      <a href="#approximate-nearest-neighbor-search-and-result-accuracy-refining">
        
      </a>
    </div>
    <p>By pruning the search space and compressing the vector data, we’ve managed to increase the efficiency of our query operation, but it is now possible to produce a set of matches that is different from the set of true closest matches. We have traded result accuracy for speed by performing an <a href="https://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximation_methods"><u>approximate nearest neighbor search</u></a>, reaching an accuracy of ~80%.</p><p>To boost the <a href="https://blog.cloudflare.com/workers-ai-bigger-better-faster/#how-fast-is-vectorize"><u>result accuracy up to over 95%</u></a>, Vectorize then <a href="https://developers.cloudflare.com/vectorize/best-practices/query-vectors/#control-over-scoring-precision-and-query-accuracy"><u>performs a result refinement pass</u></a> on the top approximate matches using uncompressed vector data, and returns the best refined matches.</p>
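<p>The refinement pass amounts to reranking the approximate candidates with exact distances over the uncompressed vectors. A minimal sketch (function and variable names are illustrative):</p>

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def refine(query, candidate_ids, full_vectors, k):
    """Rerank approximate matches using exact distances on uncompressed data."""
    scored = [(vid, l2(query, full_vectors[vid])) for vid in candidate_ids]
    scored.sort(key=lambda pair: pair[1])   # smaller distance = better match
    return [vid for vid, _ in scored[:k]]

full_vectors = {"a": [1.0, 0.0], "b": [0.6, 0.1], "c": [0.0, 1.0]}
# Suppose the compressed-vector search ranked candidates b, a, c;
# exact distances restore the true order.
best = refine([1.0, 0.0], ["b", "a", "c"], full_vectors, k=2)
```

<p>Only the handful of top approximate matches pay the cost of uncompressed reads, keeping the query fast while boosting accuracy.</p>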
    <div>
      <h2>Eventual consistency and snapshot versioning</h2>
      <a href="#eventual-consistency-and-snapshot-versioning">
        
      </a>
    </div>
    <p>Whenever you query your Vectorize index, you are guaranteed to receive results which are read from a consistent, immutable snapshot — even as you write to your index concurrently. Writes are applied in strict order of their arrival in our system, and they are funneled into an asynchronous process. We update the index files by reading the old version, making changes, and writing this updated version as a new object in R2. Each index file has its own version number, and can be updated independently of the others. Between two versions of the index we may update hundreds or even thousands of IVF and metadata index files, but even as we update the files, your queries will consistently use the current version until it is time to switch.</p><p>Each IVF and metadata index file has its own version. The list of all versioned files which make up the snapshotted version of the index is contained within a <i>manifest file</i>. Each version of the index has its own manifest. When we write a new manifest file based on the previous version, we only need to update references to the index files which were modified; if there are files which weren't modified, we simply keep the references to the previous version.</p><p>We use a <i>root manifest</i> as the authority of the current version of the index. This is the pivot point for changes. The root manifest is a copy of a manifest file from a particular version, which is written to a deterministic location (the root of the R2 bucket for the index). When our async write process has finished processing vectors, and has written all new index files to R2, we <i>commit</i> by overwriting the current root manifest with a copy of the new manifest. PUT operations in R2 are atomic, so this effectively makes our updates atomic. 
Once the manifest is updated, Vectorize DB Service instances running on our network will pick it up, and use it to serve reads.</p><p>Because we keep past versions of index and manifest files, we effectively maintain versioned snapshots of your index. This means we have a straightforward path towards building a point-in-time recovery feature (similar to <a href="https://developers.cloudflare.com/d1/reference/time-travel/"><u>D1's Time Travel feature</u></a>).</p><p>You may have noticed that because our write process is asynchronous, this means Vectorize is <i>eventually consistent</i> — that is, there is a delay between the successful completion of a request writing to the index, and seeing those updates reflected in queries. This isn't ideal for all data storage use cases. For example, imagine two users of an airline ticket reservation application, where both users buy the same seat — one user will successfully reserve the ticket, and the other will eventually get an error saying the seat was taken, and they need to choose again. Because a vector index is not typically used as a primary database for these transactional use cases, we decided eventual consistency was a worthwhile trade-off in order to ensure Vectorize queries would be fast, high-throughput, and cheap even as the size of indexes grew into the millions.</p>
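<p>A simplified model of this scheme — versioned index files, a per-version manifest that reuses references to unchanged files, and a single atomic overwrite of the root manifest — where a plain dict stands in for the R2 bucket:</p>

```python
# In-memory stand-in for the R2 bucket: key -> object.
bucket = {
    "files/ivf-0.v1": "data...",
    "files/ivf-1.v1": "data...",
    "manifest.v1": {"ivf-0": "files/ivf-0.v1", "ivf-1": "files/ivf-1.v1"},
}
bucket["root-manifest"] = dict(bucket["manifest.v1"])  # current version

def write_new_version(bucket, changed):
    """Write only the changed index files; keep references for the rest."""
    manifest = dict(bucket["root-manifest"])
    for name, data in changed.items():
        key = f"files/{name}.v2"
        bucket[key] = data
        manifest[name] = key          # only changed entries get new references
    bucket["manifest.v2"] = manifest
    return manifest

def commit(bucket, manifest):
    # One atomic PUT of the root manifest makes the new version visible.
    bucket["root-manifest"] = dict(manifest)

new_manifest = write_new_version(bucket, {"ivf-1": "updated data"})
reads_before = bucket["root-manifest"]["ivf-1"]   # readers still see v1
commit(bucket, new_manifest)
reads_after = bucket["root-manifest"]["ivf-1"]    # readers now see v2
```

<p>Until the final commit, queries keep resolving files through the old root manifest, so readers never observe a half-written version.</p>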
    <div>
      <h2>Coordinating distributed writes: it's just another block in the WAL</h2>
      <a href="#coordinating-distributed-writes-its-just-another-block-in-the-wal">
        
      </a>
    </div>
    <p>In the section above, we touched on our eventually consistent, asynchronous write process. Now we'll dive deeper into our implementation. </p>
    <div>
      <h3>The WAL</h3>
      <a href="#the-wal">
        
      </a>
    </div>
    <p>A <a href="https://en.wikipedia.org/wiki/Write-ahead_logging"><u>write-ahead log</u></a> (WAL) is a common technique for making atomic and durable writes in a database system. Vectorize’s WAL is implemented with <a href="https://blog.cloudflare.com/sqlite-in-durable-objects/"><u>SQLite in Durable Objects</u></a>.</p><p>In Vectorize, the payload for each update is given an ID, written to R2, and the ID for that payload is handed to the WAL Durable Object, which persists it as a "block." Because it's just a pointer to the data, the blocks are lightweight records of each mutation.</p><p>Durable Objects (DO) have many benefits — strong transactional guarantees, a novel combination of compute and storage, and a high degree of horizontal scale — but individual DOs are small allotments of memory and compute. However, the process of updating the index for even a single mutation is resource-intensive — a single write may include thousands of vectors, which may mean reading and writing thousands of data files stored in R2, and storing a lot of data in memory. This is more than what a single DO can handle.</p><p>So we designed the WAL to leverage DO's strengths and made it a coordinator. It controls the steps of updating the index by delegating the heavy lifting to beefier instances of compute resources (which we call "Executors"), but uses its transactional properties to ensure the steps are done with strong consistency. It safeguards the process from rogue or stalled executors, and ensures the WAL processing continues to move forward. DOs are easy to scale, so we create a new DO instance for each Vectorize index.</p>
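<p>A toy version of the block log, using SQLite the way the WAL Durable Object does (in-memory here; in the real system the payloads live in R2 and only their IDs are logged):</p>

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE wal_blocks (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,  -- strict arrival order
    payload_id TEXT NOT NULL                -- pointer to the payload in R2
)""")

def append_block(db, payload):
    """Write the payload to object storage, then log only its ID as a block."""
    payload_id = str(uuid.uuid4())
    # (Real system: PUT `payload` to R2 under payload_id here.)
    db.execute("INSERT INTO wal_blocks (payload_id) VALUES (?)", (payload_id,))
    db.commit()
    return payload_id

for update in ["upsert batch 1", "upsert batch 2", "delete batch"]:
    append_block(db, update)

count, = db.execute("SELECT COUNT(*) FROM wal_blocks").fetchone()
```

<p>Because each block is just a pointer, the log stays small even when individual writes carry thousands of vectors.</p>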
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7l5hw2yVkyGKpU8xTMyX8f/743941bd273731bdb3dbf8110afc2526/unnamed.png" />
          </figure>
    <div>
      <h3>WAL Executor</h3>
      <a href="#wal-executor">
        
      </a>
    </div>
    <p>The executors run from a single pool of compute resources, shared by all WALs. We use a simple producer-consumer pattern using <a href="https://developers.cloudflare.com/queues/"><u>Cloudflare Queues</u></a>. The WAL enqueues a request, and executors poll the queue. When they get a request, they call an API on the WAL requesting to be <i>assigned </i>to the request.</p><p>The WAL ensures that one and only one executor is ever assigned to that write. As the executor writes, the index files and the updated manifest are written in R2, but they are not yet visible. The final step is for the executor to call another API on the WAL to <i>commit</i> the change — and this is key — it passes along the updated manifest. The WAL is responsible for overwriting the root manifest with the updated manifest. The root manifest is the pivot point for atomic updates: once written, the change is made visible to Vectorize’s database service, and the updated data will appear in queries.</p><p>From the start, we designed this process to account for non-deterministic errors. We focused on enumerating failure modes first, and only moving forward with possible design options after asserting they handled the possibilities for failure. For example, if an executor stalls, the WAL finds a new executor. If the first executor comes back, the coordinator will reject its attempt to commit the update. Even if that first executor is working on an old version which has already been written, and writes new index files and a new manifest to R2, they will not overwrite the files written from the committed version.</p>
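<p>The assign/commit handshake can be modeled as a small state machine in which the coordinator records the assigned executor and rejects commits from anyone else (class and method names are illustrative):</p>

```python
class WalCoordinator:
    """Toy coordinator: one executor per write; stale commits are rejected."""

    def __init__(self):
        self.assigned_executor = None
        self.committed_manifest = None

    def assign(self, executor_id):
        # Reassignment happens when the coordinator decides the previous
        # executor has stalled; in this sketch, any new request wins.
        self.assigned_executor = executor_id

    def commit(self, executor_id, manifest):
        if executor_id != self.assigned_executor:
            return False            # rogue or stalled executor: reject
        # Overwrite the root manifest, making the update visible.
        self.committed_manifest = manifest
        return True

wal = WalCoordinator()
wal.assign("executor-1")
wal.assign("executor-2")                         # executor-1 presumed stalled
ok_stale = wal.commit("executor-1", {"v": 2})    # rejected
ok_live = wal.commit("executor-2", {"v": 2})     # accepted
```

<p>Even if the stale executor finished its work and wrote files to R2, its commit is refused, so it can never clobber the committed version.</p>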
    <div>
      <h3>Batching updates</h3>
      <a href="#batching-updates">
        
      </a>
    </div>
    <p>Now that we have discussed the general flow, we can circle back to one of our favorite features of the WAL. On the executor, the most time-intensive part of the write process is reading and writing many files from R2. Even with making our reads and writes concurrent to maximize throughput, the cost of updating even thousands of vectors within a single file is dwarfed by the total latency of the network I/O. Therefore, it is more efficient to maximize the number of vectors processed in a single execution.</p><p>So that is what we do: we batch discrete updates. When the WAL is ready to request work from an executor, it will get a chunk of "blocks" off the WAL, starting with the next unwritten block, and maintaining the sequence of blocks. It will write a new "batch" record into the SQLite table, which ties together that sequence of blocks, the version of the index, and the ID of the executor assigned to the batch.</p><p>Users can batch multiple vectors to update in a single insert or upsert call. Because the size of each update can vary, the WAL adaptively calculates the optimal size of its batch to increase throughput. The WAL will fit as many upserted vectors as possible into a single batch by counting the number of updates represented by each block. It will batch up to 200,000 vectors at once (a value we arrived at after our own testing) with a limit of 1,000 blocks. With this throughput, we have been able to quickly load millions of vectors into an index (with upserts of 5,000 vectors at a time). Also, the WAL does not pause itself to collect more writes to batch — instead, it begins processing a write as soon as it arrives. Because the WAL only processes one batch at a time, this creates a natural pause in its workflow to batch up writes which arrive in the meantime.</p>
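<p>The adaptive batch sizing can be sketched as a greedy walk over pending blocks, capped by both limits mentioned above (200,000 vectors, 1,000 blocks); the function and data shapes here are illustrative:</p>

```python
MAX_VECTORS_PER_BATCH = 200_000
MAX_BLOCKS_PER_BATCH = 1_000

def next_batch(pending_blocks):
    """Take the longest prefix of pending blocks that fits both caps.

    Each block is (block_id, vector_count); WAL order must be preserved,
    so we only ever take a contiguous prefix."""
    batch, vectors = [], 0
    for block_id, count in pending_blocks:
        if len(batch) >= MAX_BLOCKS_PER_BATCH:
            break
        if batch and vectors + count > MAX_VECTORS_PER_BATCH:
            break
        batch.append(block_id)
        vectors += count
    return batch, vectors

# Three upserts of 5,000 vectors each, then one block too big to join them.
pending = [("b1", 5_000), ("b2", 5_000), ("b3", 5_000), ("b4", 250_000)]
batch, total = next_batch(pending)
```

<p>An oversized first block is still taken on its own (the cap check skips an empty batch), so a single large write can never wedge the log.</p>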
    <div>
      <h3>Retraining the index</h3>
      <a href="#retraining-the-index">
        
      </a>
    </div>
    <p>The WAL also coordinates our process for retraining the index. We occasionally re-train indexes to ensure the mapping of IVF centroids best reflects the current vectors in the index. This maintains the high accuracy of the vector search.</p><p>Retraining produces a completely new index. All index files are updated; vectors have been reshuffled across the index space. For this reason, all indexes have a second version stamp — which we call the <i>generation</i> — so that we can differentiate between retrained indexes. </p><p>The WAL tracks the state of the index, and controls when the training is started. We have a second pool of processes called "trainers." The WAL enqueues a request on a queue, then a trainer picks up the request and it begins training.</p><p>Training can take a few minutes to complete, but we do not pause writes on the current generation. The WAL will continue to handle writes as normal. But the training runs from a fixed snapshot of the index, and will become out-of-date as the live index gets updated in parallel. Once the trainer has completed, it signals the WAL, which will then start a multi-step process to switch to the new generation. It enters a mode where it will continue to record writes in the WAL, but will stop making those writes visible on the current index. Then it will begin catching up the retrained index with all of the updates that came in since it started. Once it has caught up to all data present in the index when the trainer signaled the WAL, it will switch over to the newly retrained index. This prevents the new index from appearing to "jump back in time." All subsequent writes will be applied to that new index.</p><p>This is all modeled seamlessly with the batch record. Because it associates the index version with a range of WAL blocks, multiple batches can span the same sequence of blocks as long as they belong to different generations. 
We can say this another way: a single WAL block can be associated with many batches, as long as these batches are in different generations. Conceptually, the batches act as a second WAL layered over the WAL blocks.</p>
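<p>A rough model of that batch record: keyed by generation, the same block ranges can be covered again when catching up a retrained index (record shapes and names are made up for illustration):</p>

```python
# Toy batch records: (generation, first_block, last_block, executor_id).
batches = [
    (1, 1, 40, "exec-a"),    # applied to the current generation
    (1, 41, 60, "exec-a"),
]

def replay_for_generation(batches, new_generation, executor_id):
    """After retraining, re-cover the same block ranges under a new generation."""
    seen = sorted({(b[1], b[2]) for b in batches})
    return [(new_generation, lo, hi, executor_id) for lo, hi in seen]

catchup = replay_for_generation(batches, new_generation=2, executor_id="exec-b")
# The same WAL blocks now also appear in batches for generation 2.
```

<p>Each block range ends up associated with one batch per generation — the "second WAL layered over the WAL blocks" described above.</p>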
    <div>
      <h2>Indexing and filtering metadata</h2>
      <a href="#indexing-and-filtering-metadata">
        
      </a>
    </div>
    <p>Vectorize supports metadata filters on vector similarity queries. This allows a query to focus the vector similarity search on a subset of the index data, yielding matches that would otherwise not have been part of the top results.</p><p>For instance, this enables us to query for the best matching vectors for <code>color: “blue” </code>and <code>category: ”robe”</code>.</p><p>Conceptually, what needs to happen to process this example query is:</p><ul><li><p>Identify the set of vectors matching <code>color: “blue”</code> by scanning all metadata.</p></li><li><p>Identify the set of vectors matching <code>category: “robe”</code> by scanning all metadata.</p></li><li><p>Intersect both sets (boolean AND in the filter) to identify vectors matching both the color and category filter.</p></li><li><p>Score all vectors in the intersected set, and return the top matches.</p></li></ul><p>While this method works, it doesn’t scale well. For an index with millions of vectors, processing the query that way would be very resource intensive. What’s worse, it prevents us from using our IVF index to identify relevant vector data, forcing us to compute a proximity score on potentially millions of vectors if the filtered set of vectors is large.</p><p>To do better, we need to prune the metadata search space by indexing it like we did for the vector data, and find a way to efficiently join the vector sets produced by the metadata index with our IVF vector index.</p>
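<p>The conceptual (unindexed) steps above reduce to full scans plus set operations — a sketch of exactly the approach the post says does not scale:</p>

```python
def matching_ids(metadata, key, value):
    """Full scan: collect IDs of vectors whose metadata matches key=value."""
    return {vec_id for vec_id, meta in metadata.items() if meta.get(key) == value}

metadata = {
    "v1": {"color": "blue", "category": "robe"},
    "v2": {"color": "blue", "category": "hat"},
    "v3": {"color": "red", "category": "robe"},
}

# Boolean AND in the filter = set intersection of the per-predicate matches.
candidates = (matching_ids(metadata, "color", "blue")
              & matching_ids(metadata, "category", "robe"))
```

<p>Every predicate costs a scan over all metadata, which is what the index described next avoids.</p>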
    <div>
      <h3>Indexing metadata with Chunked Sorted List Indexes</h3>
      <a href="#indexing-metadata-with-chunked-sorted-list-indexes">
        
      </a>
    </div>
    <p>Vectorize maintains one metadata index per filterable property, each implemented as a Chunked Sorted List Index.</p><p>A Chunked Sorted List Index is a sorted list of all distinct values present in the data for a filterable property, with each value mapped to the set of vector IDs having that value. This enables Vectorize to <a href="https://en.wikipedia.org/wiki/Binary_search"><u>binary search</u></a> for a value in the metadata index in <a href="https://www.geeksforgeeks.org/what-is-logarithmic-time-complexity/"><u>O(log n)</u></a> complexity, in other words about as fast as search can be on a large dataset.</p><p>Because it can become very large on big indexes, the sorted list is chunked into pieces matching a target size in KB to keep index state fetches efficient.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7wRfqvPKtRx9RX5clyapxt/9630152212779ac008efc9685538f48e/blog-2590-vectorize-03-chunked-sorted-list.png" />
          </figure><p>A lightweight chunk descriptor list is maintained in the index manifest, keeping track of the list chunks and their lower/upper values. This chunk descriptor list can be binary searched to identify which chunk would contain the searched metadata value.</p><p>Once the candidate chunk is identified, Vectorize fetches that chunk from index data and binary searches it to retrieve the set of vector IDs matching the metadata value, or an empty set if the value is not found.</p>
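<p>A sketch of the two-level lookup, assuming simplified in-memory shapes rather than Vectorize's actual on-disk format:</p>

```typescript
// Hypothetical shapes for a Chunked Sorted List Index.
interface ChunkDescriptor { lower: string; upper: string; chunkId: number }
type Chunk = Array<[value: string, vectorIds: string[]]>; // sorted by value

// Binary search the descriptor list for the chunk whose [lower, upper]
// range could contain the value.
function findChunk(descs: ChunkDescriptor[], value: string): number | null {
  let lo = 0, hi = descs.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (value < descs[mid].lower) hi = mid - 1;
    else if (value > descs[mid].upper) lo = mid + 1;
    else return descs[mid].chunkId;
  }
  return null; // no chunk could contain this value
}

// Binary search inside the fetched chunk for the exact value.
function lookup(chunk: Chunk, value: string): string[] {
  let lo = 0, hi = chunk.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (chunk[mid][0] === value) return chunk[mid][1];
    if (chunk[mid][0] < value) lo = mid + 1; else hi = mid - 1;
  }
  return []; // empty set when the value is absent
}
```

<p>Both levels are O(log n), so a metadata value resolves to its vector ID set after one descriptor search plus one chunk fetch.</p>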
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6leIbnEGLN2qRitRXxoqog/01d5cfaad6e9e537b27bbc15933c1db0/blog-2590-vectorize-04-chunked-sorted-list-2.png" />
          </figure><p>We identify the matching vector set this way for every predicate in the metadata filter of the query, then intersect the sets in memory to determine the final set of vectors matched by the filters.</p><p>This is just half of the query being processed. We now need to identify the vectors most similar to the query vector, within those matching the metadata filters.</p>
    <div>
      <h3>Joining the metadata and vector indexes</h3>
      <a href="#joining-the-metadata-and-vector-indexes">
        
      </a>
    </div>
    <p>A vector similarity query always comes with an input vector. We can rank all centroids of our IVF vector index based on their proximity to that query vector.</p><p>The vector set matched by the metadata filters contains, for each vector, its ID and IVF centroid number.</p><p>From this, Vectorize derives the number of vectors matching the query filters per IVF centroid, and determines which and how many top-ranked IVF centroids need to be scanned according to the number of matches the query asks for.</p><p>Vectorize then performs the IVF-indexed vector search (see the section <b>Searching Vectors, and indexing them to speed things up</b> above) by considering only the vectors in the filtered metadata vector set.</p><p>Because we’re effectively pruning the vector search space using metadata filters, filtered queries can often be faster than their unfiltered equivalents.</p>
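<p>The centroid-selection step might look like this sketch (hypothetical shapes and names, not Vectorize's internals):</p>

```typescript
// Each filtered vector carries its ID and its IVF centroid number.
interface FilteredVector { id: string; centroid: number }

// Count filtered matches per centroid, then walk centroids in proximity
// order until enough candidates exist to satisfy a topK request.
function centroidsToScan(
  filtered: FilteredVector[],
  centroidsByProximity: number[], // centroid ids, closest to query first
  topK: number
): number[] {
  const counts = new Map<number, number>();
  for (const v of filtered) {
    counts.set(v.centroid, (counts.get(v.centroid) ?? 0) + 1);
  }
  const chosen: number[] = [];
  let candidates = 0;
  for (const c of centroidsByProximity) {
    const n = counts.get(c) ?? 0;
    if (n === 0) continue;         // nothing to scan in this centroid
    chosen.push(c);
    candidates += n;
    if (candidates >= topK) break; // enough candidates gathered
  }
  return chosen;
}
```

<p>Centroids holding no filtered vectors are skipped outright, which is why a filtered query can scan less data than its unfiltered equivalent.</p>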
    <div>
      <h2>Query performance</h2>
      <a href="#query-performance">
        
      </a>
    </div>
    <p>The performance of a system is measured in terms of latency and throughput.</p><p>Latency is a measure relative to individual queries, evaluating the time it takes for a query to be processed, usually expressed in milliseconds. It is what an end user perceives as the “speed” of the service, so a lower latency is desirable.</p><p>Throughput is a measure relative to an index, evaluating the number of queries it can process concurrently over a period of time, usually expressed in requests per second or RPS. It is what enables an application to scale to thousands of simultaneous users, so a higher throughput is desirable.</p><p>Vectorize is designed for great index throughput and optimized for low query latency to deliver great performance for demanding applications. <a href="https://blog.cloudflare.com/workers-ai-bigger-better-faster/#how-fast-is-vectorize"><u>Check out our benchmarks</u></a>.</p>
    <div>
      <h3>Query latency optimization</h3>
      <a href="#query-latency-optimization">
        
      </a>
    </div>
    <p>As a distributed database keeping its data state on blob storage, Vectorize’s latency is primarily driven by the fetch of index data, and relies heavily on <a href="https://developers.cloudflare.com/cache"><u>Cloudflare’s network of caches</u></a> as well as individual server RAM cache to keep latency low.</p><p>Because Vectorize data is snapshot versioned (see <b>Eventual consistency and snapshot versioning</b> above), each version of the index data is immutable and thus highly cacheable, increasing the latency benefits Vectorize gets from relying on Cloudflare’s cache infrastructure.</p><p>To keep the index data lean, Vectorize uses several techniques to reduce its size. In addition to Product Quantization (see <b>Compressing vectors with PQ</b> above), index files use a space-efficient binary format optimized for runtime performance that Vectorize is able to use without parsing, once fetched.</p><p>Index data is fragmented in a way that minimizes the amount of data required to process a query. Auxiliary indexes into that data are maintained to limit the number of fragments to fetch, reducing overfetch by jumping straight to the relevant piece of data on mass storage.</p><p>Vectorize speeds up all vector proximity computations by leveraging <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data"><u>SIMD CPU instructions</u></a>, and by organizing the vector search in two passes, effectively balancing the latency/result accuracy ratio (see <b>Approximate nearest neighbor search and result accuracy refining</b> above).</p><p>When used via a Worker binding, each query is processed close to the server serving the Worker request, and thus close to the end user, minimizing the network-induced latency between the end user, the Worker application, and Vectorize.</p>
    <div>
      <h3>Query throughput</h3>
      <a href="#query-throughput">
        
      </a>
    </div>
    <p>Vectorize runs in every Cloudflare data center, on thousands of servers across the world.</p><p>Thanks to the snapshot versioning of every index’s data, every server is able to serve the same index concurrently, without contention on state.</p><p>This means that a Vectorize index elastically scales horizontally with its distributed traffic, providing very high throughput for the most demanding Worker applications.</p>
    <div>
      <h2>Increased index size</h2>
      <a href="#increased-index-size">
        
      </a>
    </div>
    <p>We are excited that our upgraded version of Vectorize can support a maximum of 5 million vectors, which is a 25x improvement over the limit in beta (200,000 vectors). All the improvements we discussed in this blog post contribute to this increase in vector storage. <a href="https://blog.cloudflare.com/workers-ai-bigger-better-faster/#how-fast-is-vectorize"><u>Improved query performance</u></a> and throughput comes with this increase in storage as well.</p><p>However, 5 million may be constraining for some use cases. We have already heard this feedback. The limit falls out of the constraints of building a brand new globally distributed stateful service, and our desire to iterate fast and make Vectorize generally available so builders can confidently leverage it in their production apps.</p><p>We believe builders will be able to leverage Vectorize as their primary vector store, either with a single index or by sharding across multiple indexes. But if this limit is too constraining for you, please let us know. Tell us your use case, and let's see if we can work together to make Vectorize work for you.</p>
    <div>
      <h2>Try it now!</h2>
      <a href="#try-it-now">
        
      </a>
    </div>
    <p>Every developer on a free plan can give Vectorize a try. You can <a href="https://developers.cloudflare.com/vectorize/"><u>visit our developer documentation to get started</u></a>.</p><p>If you’re looking for inspiration on what to build, <a href="http://developers.cloudflare.com/vectorize/get-started/embeddings/"><u>see the semantic search tutorial</u></a> that combines <a href="https://developers.cloudflare.com/workers-ai/"><u>Workers AI</u></a> and Vectorize for document search, running entirely on Cloudflare. Or see an example of <a href="https://developers.cloudflare.com/workers-ai/tutorials/build-a-retrieval-augmented-generation-ai/"><u>how to combine OpenAI and Vectorize</u></a> to give an LLM more context and dramatically improve the accuracy of its answers.</p><p>And if you have questions for our product &amp; engineering teams about how to use Vectorize, or just want to bounce an idea off of other developers building on Workers AI, join the #vectorize and #workers-ai channels on our <a href="https://discord.cloudflare.com/"><u>Developer Discord</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Edge Database]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Storage]]></category>
            <guid isPermaLink="false">2UYBSJmJKD66lXTstsRhTg</guid>
            <dc:creator>Jérôme Schneider</dc:creator>
            <dc:creator>Alex Graham</dc:creator>
        </item>
        <item>
            <title><![CDATA[Sippy helps you avoid egress fees while incrementally migrating data from S3 to R2]]></title>
            <link>https://blog.cloudflare.com/sippy-incremental-migration-s3-r2/</link>
            <pubDate>Tue, 26 Sep 2023 13:00:44 GMT</pubDate>
            <description><![CDATA[ Use Sippy to incrementally migrate data from S3 to R2 as it’s requested and avoid migration-specific egress fees ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4whKfl7szHhdKNWCiCETR3/d65f22ad8c25598a41426bd1de59188e/image1-16.png" />
            
            </figure><p>Earlier in 2023, we announced <a href="/r2-super-slurper-ga/">Super Slurper</a>, a data migration tool that makes it easy to copy large amounts of data to <a href="https://developers.cloudflare.com/r2/">R2</a> from other <a href="https://www.cloudflare.com/developer-platform/products/r2/">cloud object storage</a> providers. Since the announcement, developers have used Super Slurper to run thousands of successful migrations to R2!</p><p>While Super Slurper is perfect for cases where you want to move all of your data to R2 at once, there are scenarios where you may want to migrate your data incrementally over time. Maybe you want to avoid the one-time, upfront <a href="https://www.cloudflare.com/learning/cloud/what-is-aws-data-transfer-pricing/">AWS data transfer bill</a>? Or perhaps you have legacy data that may never be accessed, and you only want to migrate what’s required?</p><p>Today, we’re announcing the open beta of <a href="https://developers.cloudflare.com/r2/data-migration/sippy/">Sippy</a>, an incremental migration service that copies data from S3 (other cloud providers coming soon!) to R2 as it’s requested, without the unnecessary cloud egress fees typically associated with moving large amounts of data. On top of addressing <a href="https://www.cloudflare.com/learning/cloud/what-is-vendor-lock-in/">vendor lock-in</a>, Sippy makes stressful, time-consuming migrations a thing of the past. All you need to do is replace the S3 endpoint in your application or attach your domain to your new R2 bucket, and data will start getting copied over.</p>
    <div>
      <h2>How does it work?</h2>
      <a href="#how-does-it-work">
        
      </a>
    </div>
    <p>Sippy is an incremental migration service built directly into your R2 bucket. It reduces migration-specific <a href="https://www.cloudflare.com/learning/cloud/what-are-data-egress-fees/">egress fees</a> by copying objects to R2 during requests your application already makes, and already pays egress fees on. Here is how it works:</p><p>When an object is requested via <a href="https://developers.cloudflare.com/r2/api/workers/">Workers</a>, the <a href="https://developers.cloudflare.com/r2/api/s3/">S3 API</a>, or a <a href="https://developers.cloudflare.com/r2/buckets/public-buckets/">public bucket</a>, it is served from your R2 bucket if it is found there.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2QSl0q22LEdoOUp6BzkWhE/12f69030f3c5a4d53edf0a7d62af8b43/image2-16.png" />
            
            </figure><p>If the object is not found in R2, it will simultaneously be returned from your S3 bucket and copied to R2.</p><p>Note: Some large objects may take multiple requests to copy.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ok5dS498ocJgCBRR1V5HU/683129c203b3343af7a4aa1c29e15b8a/image3-22.png" />
            
            </figure><p>That means after objects are copied, subsequent requests will be served from R2, and you’ll begin saving on egress fees immediately.</p>
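<p>Conceptually, the flow above is a pull-through copy. Here is a minimal sketch using plain in-memory stores in place of real R2 and S3 clients; all names are hypothetical:</p>

```typescript
// A hypothetical minimal object-store interface.
interface Store {
  get(key: string): Promise<string | undefined>;
  put(key: string, value: string): Promise<void>;
}

// Serve from R2 when present; otherwise return the S3 object and copy it
// into R2 so every subsequent request skips S3 (and its egress fees).
async function serveWithSippy(
  r2: Store,
  s3: Store,
  key: string
): Promise<string | undefined> {
  const cached = await r2.get(key);
  if (cached !== undefined) return cached;
  const origin = await s3.get(key);
  if (origin !== undefined) await r2.put(key, origin); // incremental copy
  return origin;
}

// A toy in-memory Store for illustration.
function memoryStore(init: Record<string, string> = {}): Store {
  const m = new Map(Object.entries(init));
  return {
    get: async (k) => m.get(k),
    put: async (k, v) => { m.set(k, v); },
  };
}
```

<p>The first request pays S3 egress once; after the copy, the same key is served straight from R2.</p>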
    <div>
      <h2>Start incrementally migrating data from S3 to R2</h2>
      <a href="#start-incrementally-migrating-data-from-s3-to-r2">
        
      </a>
    </div>
    
    <div>
      <h3>Create an R2 bucket</h3>
      <a href="#create-an-r2-bucket">
        
      </a>
    </div>
    <p>To get started with incremental migration, you’ll first need to create an R2 bucket if you don’t already have one. To create a new R2 bucket from the Cloudflare dashboard:</p><ol><li><p>Log in to the <a href="https://dash.cloudflare.com/">Cloudflare dashboard</a> and select <b>R2</b>.</p></li><li><p>Select <b>Create bucket</b>.</p></li><li><p>Give your bucket a name and select <b>Create bucket</b>.</p></li></ol><p>​​To learn more about other ways to create R2 buckets refer to the documentation on <a href="https://developers.cloudflare.com/r2/buckets/create-buckets/">creating buckets</a>.</p>
    <div>
      <h3>Enable Sippy on your R2 bucket</h3>
      <a href="#enable-sippy-on-your-r2-bucket">
        
      </a>
    </div>
    <p>Next, you’ll enable Sippy for the R2 bucket you created. During the beta, you can do this by using the API. Here’s an example of how to enable Sippy for an R2 bucket with cURL:</p>
            <pre><code>curl -X PUT https://api.cloudflare.com/client/v4/accounts/{account_id}/r2/buckets/{bucket_name}/sippy \
--header "Authorization: Bearer &lt;API_TOKEN&gt;" \
--header "Content-Type: application/json" \
--data '{"provider": "AWS", "bucket": "&lt;AWS_BUCKET_NAME&gt;", "zone": "&lt;AWS_REGION&gt;", "key_id": "&lt;AWS_ACCESS_KEY_ID&gt;", "access_key": "&lt;AWS_SECRET_ACCESS_KEY&gt;", "r2_key_id": "&lt;R2_ACCESS_KEY_ID&gt;", "r2_access_key": "&lt;R2_SECRET_ACCESS_KEY&gt;"}'</code></pre>
            <p>For more information on getting started, please refer to the <a href="https://developers.cloudflare.com/r2/data-migration/sippy/">documentation</a>. Once enabled, requests to your bucket will now start copying data over from S3 if it’s not already present in your R2 bucket.</p>
    <div>
      <h3>Finish your migration with Super Slurper</h3>
      <a href="#finish-your-migration-with-super-slurper">
        
      </a>
    </div>
    <p>You can run your incremental migration for as long as you want, but eventually you may want to complete the migration to R2. To do this, you can pair Sippy with <a href="https://developers.cloudflare.com/r2/data-migration/super-slurper/">Super Slurper</a> to easily migrate your remaining data that hasn’t been accessed to R2.</p>
    <div>
      <h2>What’s next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>We’re excited about the open beta, but it’s only the starting point. Next, we plan on making incremental migration configurable from the Cloudflare dashboard, complete with analytics that show you the progress of your migration and how much you are saving by not paying egress fees for objects that have been copied over so far.</p><p>If you are looking to start incrementally migrating your data to R2 and have any questions or feedback on what we should build next, we encourage you to join our <a href="https://discord.com/invite/cloudflaredev">Discord community</a> to share!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5OxNi1UZsYR9ITa7KbaQ11/753b4c01ddbed65d637c55b0b91a47de/Kfb7dwYUUzfrLKPH_ukrJRTvRlfl4E8Uy00vwEQPCTiW0IQ--fxpikjv1p0afm4A5J3JfVjQiOVjN3RMNeMcu3vhnz97pEmENCkNIuwdW_m-aW7ABfZnmUpJB_jh.png" />
            
            </figure><p></p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Connectivity Cloud]]></category>
            <category><![CDATA[R2]]></category>
            <guid isPermaLink="false">4Oc5iRahgLMh31qeAxVZPt</guid>
            <dc:creator>Phillip Jones</dc:creator>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing Object Lifecycle Management for Cloudflare R2]]></title>
            <link>https://blog.cloudflare.com/introducing-object-lifecycle-management-for-cloudflare-r2/</link>
            <pubDate>Wed, 10 May 2023 13:00:39 GMT</pubDate>
            <description><![CDATA[ We’re excited to announce that Object Lifecycle Management for R2 is generally available, allowing you to effectively manage object expiration, all from the R2 dashboard or via our API ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2gYUTS0mGft4UNdyl6azxW/13b7fd11df446836b17c2060d637f36f/R21.png" />
            
            </figure><p>Last year, <a href="/r2-ga/">R2 made its debut</a>, providing developers with <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/">object storage</a> while eliminating the burden of <a href="https://www.cloudflare.com/learning/cloud/what-are-data-egress-fees/">egress fees</a>. (For many, egress costs account for <a href="https://twitter.com/CloudflareDev/status/1639281667973033988?s=20">over half of their object storage bills</a>!) Since R2’s launch, tens of thousands of developers have chosen it to store data for many different types of applications.</p><p>But for some applications, data stored in <a href="https://www.cloudflare.com/developer-platform/products/r2/">R2</a> doesn’t need to be retained forever. Over time, as this data grows, it can <a href="https://r2-calculator.cloudflare.com/">unnecessarily lead to higher storage costs</a>. Today, we’re excited to announce that Object Lifecycle Management for R2 is generally available, allowing you to effectively manage object expiration, all from the R2 dashboard or via our API.</p>
    <div>
      <h3>Object Lifecycle Management</h3>
      <a href="#object-lifecycle-management">
        
      </a>
    </div>
    <p>Object lifecycles give you the ability to define rules (up to 1,000) that determine how long objects uploaded to your bucket are kept. For example, by implementing an object lifecycle rule that deletes objects after 30 days, you could automatically delete outdated logs or temporary files. You can also define rules to abort unfinished multipart uploads that are sitting around and contributing to storage costs.</p>
    <div>
      <h3>Getting started with object lifecycles in R2</h3>
      <a href="#getting-started-with-object-lifecycles-in-r2">
        
      </a>
    </div>
    
    <div>
      <h4>Cloudflare dashboard</h4>
      <a href="#cloudflare-dashboard">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2NokpyjDvm1FLghbgfaXLs/48ce0dd4d55e12e942c1d9c3f7603f3a/R22.png" />
            
            </figure><ol><li><p>From the Cloudflare dashboard, select <b>R2</b>.</p></li><li><p>Select your R2 bucket.</p></li><li><p>Navigate to the <b>Settings</b> tab and find the <b>Object lifecycle rules</b> section.</p></li><li><p>Select <b>Add rule</b> to define the name, relevant prefix, and lifecycle actions: delete uploaded objects or abort incomplete multipart uploads.</p></li><li><p>Select <b>Add rule</b> to complete the process.</p></li></ol>
    <div>
      <h4>S3 Compatible API</h4>
      <a href="#s3-compatible-api">
        
      </a>
    </div>
    <p>With R2’s <a href="https://www.cloudflare.com/developer-platform/solutions/s3-compatible-object-storage/">S3-compatible</a> API, it’s easy to apply any existing object lifecycle rules to your R2 buckets.</p><p>Here’s an example of how to configure your R2 bucket’s lifecycle policy using the AWS SDK for JavaScript. To try this out, you’ll need to generate an <a href="https://developers.cloudflare.com/r2/api/s3/tokens/">Access Key</a>.</p>
            <pre><code>import S3 from "aws-sdk/clients/s3.js";

const client = new S3({
  endpoint: `https://${ACCOUNT_ID}.r2.cloudflarestorage.com`,
  credentials: {
    accessKeyId: ACCESS_KEY_ID, // fill in your own
    secretAccessKey: SECRET_ACCESS_KEY, // fill in your own
  },
  region: "auto",
});

await client
  .putBucketLifecycleConfiguration({
    // Bucket is a top-level parameter, alongside LifecycleConfiguration
    Bucket: "testBucket",
    LifecycleConfiguration: {
      Rules: [
        // Example: deleting objects by age
        // Delete logs older than 90 days
        {
          ID: "Delete logs after 90 days",
          Status: "Enabled",
          Filter: {
            Prefix: "logs/",
          },
          Expiration: {
            Days: 90,
          },
        },
        // Example: abort all incomplete multipart uploads after a week
        {
          ID: "Abort incomplete multipart uploads",
          Status: "Enabled",
          Filter: {
            Prefix: "",
          },
          AbortIncompleteMultipartUpload: {
            DaysAfterInitiation: 7,
          },
        },
      ],
    },
  })
  .promise();</code></pre>
            <p>For more information on how object lifecycle policies work and how to configure them in the dashboard or API, see the documentation <a href="https://developers.cloudflare.com/r2/buckets/object-lifecycles/">here</a>.</p><p>Speaking of documentation, if you’d like to provide feedback for R2’s documentation, fill out our <a href="https://docs.google.com/forms/d/e/1FAIpQLScaVrdZh2PoZFvJGFPyMthuGVvKpQvoPfZ-BxIJ4Q5zsQebDA/viewform">documentation survey</a>!</p>
    <div>
      <h3>What’s next?</h3>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>Creating object lifecycle rules to delete ephemeral objects is a great way to reduce storage costs, but what if you need to keep objects around to access in the future? We’re working on new, lower cost ways to store objects in R2 that aren’t frequently accessed, like long tail user-generated content, archive data, and more. If you’re interested in providing feedback and gaining early access, let us know by joining the waitlist <a href="https://forms.gle/JLLzQtqSuLAQNRPX7">here</a>.</p>
    <div>
      <h3>Join the conversation: share your feedback and experiences</h3>
      <a href="#join-the-conversation-share-your-feedback-and-experiences">
        
      </a>
    </div>
    <p>If you have any questions or feedback relating to R2, we encourage you to join our <a href="https://discord.gg/cloudflaredev">Discord community</a> to share! Stay tuned for more exciting R2 updates in the future.</p> ]]></content:encoded>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[R2]]></category>
            <guid isPermaLink="false">2toJKgkVGnaE3gFJUCqFbf</guid>
            <dc:creator>Harshal Brahmbhatt</dc:creator>
            <dc:creator>Phillip Jones</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we built an open-source SEO tool using Workers, D1, and Queues]]></title>
            <link>https://blog.cloudflare.com/how-we-built-an-open-source-seo-tool-using-workers-d1-and-queues/</link>
            <pubDate>Thu, 02 Mar 2023 15:03:54 GMT</pubDate>
            <description><![CDATA[ In this blog post, I’m excited to show off some of the new tools in Cloudflare’s developer arsenal, D1 and Queues, to prototype and ship an internal tool for our SEO experts at Cloudflare. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Building applications on Cloudflare Workers has always been fun. Workers applications have low latency response times by default, and easy developer ergonomics thanks to Wrangler. It's no surprise that for years now, developers have been going from idea to production with Workers in just a few minutes.</p><p>Internally, we're no different. When a member of our team has a project idea, we often reach for Workers first, and not just for the MVP stage, but in production, too. Workers have been a secret ingredient to Cloudflare’s innovation for some time now, allowing us to build products like Access, Stream and Workers KV. Even better, when we have new ideas <i>and</i> we can use new Cloudflare products to build them, it's a great way to give feedback on those products.</p><p>We've discussed this in the past on the Cloudflare blog - in May last year, <a href="/new-dev-docs/">I wrote how we rebuilt Cloudflare's developer documentation</a> using many of the tools that had recently been released in the Workers ecosystem: Cloudflare Pages for hosting, and Bulk Redirects for the redirect rules. In November, <a href="/building-a-better-developer-experience-through-api-documentation/">we released a new version of our API documentation</a>, which again used Pages for hosting, and Pages functions for intelligent caching and transformation of our API schema.</p><p>In this blog post, I’m excited to show off some of the new tools in Cloudflare’s developer arsenal, <a href="https://www.cloudflare.com/developer-platform/products/d1/">D1</a> and <a href="/introducing-cloudflare-queues/">Queues</a>, to prototype and ship an internal tool for our SEO experts at Cloudflare. We've made this project, which we're calling Prospector, open-source too - check it out in our <a href="https://github.com/cloudflare/workers-sdk/tree/main/templates/worker-prospector"><code>cloudflare/templates</code></a> repo on GitHub.
Whether you're a developer looking to understand how to use multiple parts of Cloudflare's developer stack together, or an SEO specialist who may want to deploy the tool in production, we've made it incredibly easy to get up and running.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6AbvyVpEkBfnOITizlfkhT/73156db0a0fe274ead622a677b6eb959/image1.png" />
            
            </figure>
    <div>
      <h2>What we're building</h2>
      <a href="#what-were-building">
        
      </a>
    </div>
    <p>Prospector is a tool that allows Cloudflare's SEO experts to monitor our blog and marketing site for specific keywords. When a keyword is matched on a page, Prospector will notify an email address. This allows our SEO experts to stay informed of any changes to our website, and take action accordingly.</p><p><a href="/sending-email-from-workers-with-mailchannels/">Using MailChannels' integration with Workers</a>, we can quickly and easily send emails from our application using a single API call. This allows us to focus on the core functionality of the application, and not worry about the details of sending emails.</p><p>Prospector uses Cloudflare Workers as the user-facing API for the application. It uses D1 to store and retrieve data in real-time, and Queues to handle the fetching of all URLs and the notification process. We've also included an intuitive user interface for the application, which is built with HTML, CSS, and JavaScript.</p>
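<p>As a sketch, that "single API call" looks like the following. The payload shape follows MailChannels' send API, but treat the exact field names and the sender address as assumptions to verify against their documentation:</p>

```typescript
// Build the JSON body for a MailChannels transactional send.
// The sender address and display name here are hypothetical.
function buildEmailPayload(to: string, subject: string, text: string) {
  return {
    personalizations: [{ to: [{ email: to }] }],
    from: { email: "prospector@example.com", name: "Prospector" },
    subject,
    content: [{ type: "text/plain", value: text }],
  };
}

// Inside a Worker handler, the send itself is one fetch call (sketch):
//   await fetch("https://api.mailchannels.net/tx/v1/send", {
//     method: "POST",
//     headers: { "content-type": "application/json" },
//     body: JSON.stringify(buildEmailPayload(to, subject, text)),
//   });
```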
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/339I5LA7CAciIdEOyfKkl3/fc05a5f4b5d41ef794d638df5e41d1fb/image3-1.png" />
            
            </figure>
    <div>
      <h2>Why we built it</h2>
      <a href="#why-we-built-it">
        
      </a>
    </div>
    <p>It is widely known in SEO that both internal and external links help Google and other search engines understand what a website is about, which impacts keyword rankings. Not only do these links guide readers to additional helpful information, they also allow <a href="https://www.cloudflare.com/learning/bots/what-is-a-web-crawler/">web crawlers</a> for search engines to discover and index content on the site.</p><p>Acquiring external links is often a time-consuming process and at the discretion of third parties, whereas website owners typically have much more control over internal links. As a result, internal linking is one of the most useful levers available in SEO.</p><p>In an ideal world, every piece of content would be fully formed upon publication, replete with helpful internal links throughout the piece. However, this is often not the case. Many times, content is edited after the fact or additional pieces of relevant content come along after initial publication. These situations result in missed opportunities for internal linking.</p><p>Like other large organizations, Cloudflare has published thousands of blogs and web pages over the years. We share new content every time a product/technology is introduced and improved. Ultimately, that also means it's become more challenging to identify opportunities for internal linking in a timely, automated fashion. We needed a tool that would allow us to identify internal linking opportunities as they appear, and speed up the time it takes to identify new internal linking opportunities.</p><p>Although we tested several tools that might solve this problem, we found that they were limited in several ways. First, some tools only scanned the first 2,000 characters of a web page. Any opportunities found beyond that limit would not be detected. Next, some tools did not allow us to limit searches to certain areas of the site and resulted in many false positives. 
Finally, other potential solutions required manual operation, leaving the process at the mercy of human memory.</p><p>To solve our problem (and ultimately, improve our SEO), we needed an automated tool that could discover and notify us of new instances of targeted phrases on a specified range of pages.</p>
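<p>At its core, the matcher we needed is a whole-document, case-insensitive phrase scan with no character cutoff. A toy sketch, not Prospector's actual implementation:</p>

```typescript
// Find every occurrence of a phrase anywhere in a page, however long,
// case-insensitively. Returns the character offsets of each match.
function findKeyword(pageText: string, keyword: string): number[] {
  const haystack = pageText.toLowerCase();
  const needle = keyword.toLowerCase();
  const offsets: number[] = [];
  let i = haystack.indexOf(needle);
  while (i !== -1) {
    offsets.push(i);
    i = haystack.indexOf(needle, i + needle.length);
  }
  return offsets;
}
```

<p>Unlike the tools we evaluated, nothing here stops at the first 2,000 characters: matches deep in a long page are reported just the same.</p>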
    <div>
      <h2>How it works</h2>
      <a href="#how-it-works">
        
      </a>
    </div>
    
    <div>
      <h3>Data model</h3>
      <a href="#data-model">
        
      </a>
    </div>
    <p>First, let's explore the data model for Prospector. We have two main tables: <code>notifiers</code> and <code>urls</code>. The <code>notifiers</code> table stores the email address and keyword that we want to monitor. The <code>urls</code> table stores the URL and sitemap that we want to scrape. The <code>notifiers</code> table has a one-to-many relationship with the <code>urls</code> table, meaning that each notifier can have many URLs associated with it.</p><p>In addition, we have a <code>sitemaps</code> table that stores the sitemap URLs that we've scraped. Many larger websites don't just have a single sitemap: the Cloudflare blog, for instance, has a primary sitemap that contains four sub-sitemaps. When the application is deployed, a primary sitemap is provided as configuration, and Prospector will parse it to find all of the sub-sitemaps.</p><p>Finally, <code>notifier_matches</code> is a table that stores the matches between a notifier and a URL. This allows us to keep track of which URLs have already been matched, and which ones still need to be processed. When a match has been found, the <code>notifier_matches</code> table is updated to reflect that, and "matches" for a keyword are no longer processed. This saves our SEO experts from a crowded inbox, and allows them to focus and act on new matches.</p><p><b>Connecting the pieces with Cloudflare Queues</b></p><p>Cloudflare Queues acts as the work queue for Prospector. When a new notifier is added, a new job is created for it and added to the queue. Behind the scenes, Queues will distribute the work across multiple Workers, allowing us to scale the application as needed. When a job is processed, Prospector will scrape the URL and check for matches. If a match is found, Prospector will send an email to the notifier's email address.</p><p>Using the Cron Triggers functionality in Workers, we can schedule the scraping process to run at a regular interval - by default, once a day. 
This allows us to keep our data up-to-date, and ensures that we're always notified of any changes to our website. It also allows the end-user to configure when they receive emails in case they want to receive them more or less frequently, or at the beginning of their workday.</p><p>The Module Workers syntax for Workers makes accessing the application bindings - the constants available in the application for querying D1, Queues, and other services - incredibly easy. <code>src/index.ts</code>, the entrypoint for the application, looks like this:</p>
            <pre><code>import { DBUrl, Env } from './types'

import {
  handleQueuedUrl,
  scheduled,
} from './functions';

import h from './api'

export default {
  async fetch(
    request: Request,
    env: Env,
    ctx: ExecutionContext
  ): Promise&lt;Response&gt; {
    return h.fetch(request, env, ctx)
  },

  async queue(
    batch: MessageBatch&lt;string&gt;,
    env: Env
  ): Promise&lt;void&gt; {
    for (const message of batch.messages) {
      const url: DBUrl = JSON.parse(message.body)
      await handleQueuedUrl(url, env.DB)
    }
  },

  async scheduled(
    controller: ScheduledController,
    env: Env,
    ctx: ExecutionContext
  ): Promise&lt;void&gt; {
    await scheduled({
      authToken: env.AUTH_TOKEN,
      db: env.DB,
      queue: env.QUEUE,
      sitemapUrl: env.SITEMAP_URL,
    })
  }
};</code></pre>
            <p>With this syntax, we can see where the various events incoming to the application - the <code>fetch</code> event, the <code>queue</code> event, and the <code>scheduled</code> event - are handled. The <code>fetch</code> event is the main entrypoint for the application, and is where we handle all of the API routes. The <code>queue</code> event is where we handle the work that's been added to the queue, and the <code>scheduled</code> event is where we handle the scheduled scraping process.</p><p>Central to the application, of course, is Workers - acting as the API gateway and coordinator. We've elected to use the popular open-source framework <a href="https://honojs.dev/">Hono</a>, an Express-style API for Workers, in Prospector. With Hono, we can quickly map out a REST API in just a few lines of code. Here's an example of a few API routes and how they're defined with Hono:</p>
            <pre><code>const app = new Hono()

app.get("/", (context) =&gt; {
  return context.html(index)
})

app.post("/notifiers", async context =&gt; {
  try {
	const { keyword, email } = await context.req.parseBody()
	await context.env.DB.prepare(
  	"insert into notifiers (keyword, email) values (?, ?)"
	).bind(keyword, email).run()
	return context.redirect('/')
  } catch (err) {
	context.status(500)
	return context.text("Something went wrong")
  }
})

app.get('/sitemaps', async (context) =&gt; {
  const query = await context.env.DB.prepare(
	"select * from sitemaps"
  ).all();
  const sitemaps: Array&lt;DBSitemap&gt; = query.results
  return context.json(sitemaps)
})</code></pre>
            <p>Crucial to the development of Prospector are the improved TypeScript bindings for Workers. <a href="/improving-workers-types/">As announced in November of last year</a>, TypeScript bindings for Workers are now automatically generated based on <a href="/workerd-open-source-workers-runtime/">our open source runtime, <code>workerd</code></a>. This means that whenever we use the types provided from the <a href="https://github.com/cloudflare/workers-types"><code>@cloudflare/workers-types</code> package</a> in our application, we can be sure that the types are always up-to-date.</p><p>With these bindings, we can define the types for our environment variables, and use them in our application. Here's an example of the <code>Env</code> type, which defines the environment variables that we use in the application:</p>
            <pre><code>export interface Env {
  AUTH_TOKEN: string
  DB: D1Database
  QUEUE: Queue
  SITEMAP_URL: string
}</code></pre>
            <p>Notice the types of the <code>DB</code> and <code>QUEUE</code> bindings - <code>D1Database</code> and <code>Queue</code>, respectively. These types are automatically generated, complete with type signatures for each method inside of the D1 and Queue APIs. This means that we can be sure that we're using the correct methods, and that we're passing the correct arguments to them, directly from our text editor - without having to refer to the documentation.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6uqW1v9MEpdsEMieihxKFk/e865da1397167301f051616d61f83a1a/image4.png" />
            
            </figure>
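    <p>The matching step described above boils down to a simple check against the scraped page. As a hypothetical sketch (this helper is not part of the Prospector source), it might look like:</p>
            <pre><code>// Hypothetical helper: does a scraped page match a notifier's keyword?
// Case-insensitive substring matching is an assumption, not necessarily
// what Prospector itself does.
export function pageMatchesKeyword(pageText: string, keyword: string): boolean {
  const needle = keyword.trim().toLowerCase()
  if (needle.length === 0) return false
  return pageText.toLowerCase().includes(needle)
}</code></pre>
            <p>A queue consumer could run a check like this against each fetched page body, recording a row in <code>notifier_matches</code> on the first hit so that the keyword isn't processed again.</p>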
    <div>
      <h2>How to use it</h2>
      <a href="#how-to-use-it">
        
      </a>
    </div>
    <p>One of my favorite things about Workers is that deploying applications is quick and easy. Using <code>wrangler.toml</code> and some simple build scripts, we can deploy a fully-functional application in just a few minutes. Prospector is no different. With just a few commands, we can create the necessary D1 database and Queues instance, and deploy the application to our account.</p><p>First, you'll need to clone Prospector from our cloudflare/templates repository:</p><p><code>git clone $URL</code></p><p>If you haven't installed wrangler yet, you can do so by running:</p><p><code>npm install -g wrangler</code></p><p>With Wrangler installed, you can log in to your account by running:</p><p><code>wrangler login</code></p><p>After you've done that, you'll need to create a new D1 database, as well as a Queues instance. You can do this by running the following commands:</p><p><code>wrangler d1 create $DATABASE_NAME</code></p><p><code>wrangler queues create $QUEUE_NAME</code></p><p>Configure your <code>wrangler.toml</code> with the appropriate bindings (see <a href="URL">the README</a> for an example):</p>
            <pre><code>[[d1_databases]]
binding = "DB"
database_name = "keyword-tracker-db"
database_id = "ab4828aa-723b-4a77-a3f2-a2e6a21c4f87"
preview_database_id = "8a77a074-8631-48ca-ba41-a00d0206de32"

[[queues.producers]]
  queue = "queue"
  binding = "QUEUE"

[[queues.consumers]]
  queue = "queue"
  max_batch_size = 10
  max_batch_timeout = 30
  max_retries = 10
  dead_letter_queue = "queue-dlq"</code></pre>
            <p>Next, you can run the <code>bin/migrate</code> script to create the tables in your database:</p><p><code>bin/migrate</code></p><p>This will create all the needed tables in your database, both in development (locally) and in production. Note that you'll even see the creation of an honest-to-goodness <code>.sqlite3</code> file in your project directory - this is the local development database, which you can connect to directly using the same SQLite CLI that you're used to:</p><pre><code>$ sqlite3 .wrangler/state/d1/DB.sqlite3
sqlite&gt; .tables
notifier_matches  notifiers  sitemaps  urls</code></pre><p>Finally, you can deploy the application to your account:</p><p><code>npm run deploy</code></p><p>With a deployed application, you can visit your Workers URL to see the user interface. From there, you can add new notifiers and URLs, and see the results of your scraping process. When a new keyword match is found, you’ll receive an email with the details of the match instantly:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2P59U5wQRoysgE8nWLbwLD/b1e7240ddd90dd36676163b201998cb3/image2-1.png" />
            
            </figure>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>For some time, many applications were hard to build on Workers for lack of relational data and background task tooling. Now, with D1 and Queues, we can build applications that seamlessly integrate real-time user interfaces, geographically distributed data, background processing, and more, all using the same developer ergonomics and low latency that Workers is known for.</p><p>D1 has been crucial for building this application. On larger sites, the number of URLs that need to be scraped can be quite large. If we were to use Workers KV, our key-value store, for storing this data, we would quickly struggle with how to model, retrieve, and update the data needed for this use-case. With D1, we can build relational data models and quickly query <i>just</i> the data we need for each queued processing task.</p><p>Using these tools, developers can build internal tools and applications for their companies that are more powerful and more scalable than ever before. With the integration of Cloudflare's Zero Trust suite, developers can make these applications secure by default, and deploy them to Cloudflare's global network. This allows developers to build applications that are fast, secure, and reliable, all without having to worry about the underlying infrastructure.</p><p>Prospector is a great example of how easy it is to build applications on Cloudflare Workers. With the recent addition of D1 and Queues, we've been able to build fully-functional applications that require real-time data and background processing in just a few hours. 
We're excited to share the open-source code for Prospector, and we'd love to hear your feedback on the project.</p><p>If you have any questions, feel free to reach out to us on Twitter at <a href="https://twitter.com/cloudflaredev">@cloudflaredev</a>, or join us in the Cloudflare Workers Discord community, which recently hit 20k members and is a great place to ask questions and get help from other developers.</p> ]]></content:encoded>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[D1]]></category>
            <category><![CDATA[Queues]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">3Ye7OiZdwDby0AGqA7LQAh</guid>
            <dc:creator>Kristian Freeman</dc:creator>
            <dc:creator>Neal Kindschi</dc:creator>
        </item>
        <item>
            <title><![CDATA[Making static sites dynamic with Cloudflare D1]]></title>
            <link>https://blog.cloudflare.com/making-static-sites-dynamic-with-cloudflare-d1/</link>
            <pubDate>Wed, 16 Nov 2022 14:00:00 GMT</pubDate>
            <description><![CDATA[ In this blog post, I'll show you how to use D1 to add comments to a static blog site. To do this, we'll construct a new D1 database and build a simple JSON API that allows the creation and retrieval of comments. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4EtdDhbUmqO6pJ0onYlD80/6c88d9a3b9189dac7ead2fb45c2450f0/image1-40.png" />
            
            </figure>
    <div>
      <h3>Introduction</h3>
      <a href="#introduction">
        
      </a>
    </div>
    <p>There are many ways to store data in your applications. For example, in Cloudflare Workers applications, we have Workers KV for key-value storage and Durable Objects for real-time, coordinated storage without compromising on consistency. Outside the Cloudflare ecosystem, you can also plug in other tools like NoSQL and graph databases.</p><p>But sometimes, you want SQL. Indexes allow us to retrieve data quickly. Joins enable us to describe complex relationships between different tables. SQL declaratively describes how our application's data is validated, created, and performantly queried.</p><p><a href="/d1-open-alpha">D1 was released today in open alpha</a>, and to celebrate, I want to share my experience building apps with D1: specifically, how to get started, and why I’m excited about D1 joining the long list of tools you can use to build apps on Cloudflare.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1T6IVlTeQ59oZQ2M6auyey/ea995b570a3dfb93c2c1d466d75f4524/image3-24.png" />
            
            </figure><p><a href="https://www.cloudflare.com/developer-platform/products/d1/">D1</a> is remarkable because it's an instant value-add to applications without needing new tools or stepping out of the Cloudflare ecosystem. Using wrangler, we can do local development on our Workers applications, and with the addition of D1 in wrangler, we can now develop proper stateful applications locally as well. Then, when it's time to deploy the application, wrangler allows us to manage our D1 database and execute commands against it, as well as deploy the API itself.</p>
    <div>
      <h3>What we’re building</h3>
      <a href="#what-were-building">
        
      </a>
    </div>
    <p>In this blog post, I'll show you how to use D1 to add comments to a static blog site. To do this, we'll construct a new D1 database and build a simple JSON API that allows the creation and retrieval of comments.</p><p>As I mentioned, separating D1 from the app itself - an API and database that remains separate from the static site - allows us to abstract the static and dynamic pieces of our website from each other. It also makes it easier to deploy our application: we will deploy the frontend to Cloudflare Pages, and the D1-powered API to Cloudflare Workers.</p>
    <div>
      <h3>Building a new application</h3>
      <a href="#building-a-new-application">
        
      </a>
    </div>
    <p>First, we'll add a basic <a href="https://www.cloudflare.com/learning/security/api/what-is-an-api/">API</a> in Workers. Create a new directory and initialize a new wrangler project inside it:</p>
            <pre><code>$ mkdir d1-example &amp;&amp; cd d1-example
$ wrangler init</code></pre>
            <p>In this example, we’ll use Hono, an Express.js-style framework, to rapidly build our API. To use Hono in this project, install it using NPM:</p>
            <pre><code>$ npm install hono</code></pre>
            <p>Then, in <code>src/index.ts</code>, we’ll initialize a new Hono app, and define two endpoints: <code>GET /api/posts/:slug/comments</code> and <code>POST /api/posts/:slug/comments</code>.</p>
            <pre><code>import { Hono } from 'hono'
import { cors } from 'hono/cors'

const app = new Hono()

// Let the static site, served from a different origin, call this API
app.use('/api/*', cors())

app.get('/api/posts/:slug/comments', async c =&gt; {
  // do something
})

app.post('/api/posts/:slug/comments', async c =&gt; {
  // do something
})

export default app</code></pre>
            <p>Now we'll create a D1 database. In Wrangler 2, there is support for the <code>wrangler d1</code> subcommand, which allows you to create and query your D1 databases directly from the command line. So, for example, we can create a new database with a single command:</p>
            <pre><code>$ wrangler d1 create d1-example</code></pre>
            <p>With our created database, we can take the database name and ID and associate them with a <b>binding</b> inside of wrangler.toml, wrangler's configuration file. Bindings allow us to access Cloudflare resources, like D1 databases, KV namespaces, and R2 buckets, using a simple variable name in our code. Below, we’ll create the binding <code>DB</code> and use it to represent our new database:</p>
            <pre><code>[[ d1_databases ]]
binding = "DB" # i.e. available in your Worker on env.DB
database_name = "d1-example"
database_id = "4e1c28a9-90e4-41da-8b4b-6cf36e5abb29"</code></pre>
            <p>Note that this directive, the <code>[[d1_databases]]</code> field, currently requires a beta version of wrangler. You can install this for your project using the command <code>npm install -D wrangler@beta</code>.</p><p>With the database configured in our wrangler.toml, we can start interacting with it from the command line and inside our Workers function.</p><p>First, you can issue direct SQL commands using <code>wrangler d1 execute</code>:</p>
            <pre><code>$ wrangler d1 execute d1-example --command "SELECT name FROM sqlite_schema WHERE type ='table'"
Executing on d1-example:
┌─────────────────┐
│ name            │
├─────────────────┤
│ sqlite_sequence │
└─────────────────┘</code></pre>
            <p>You can also pass a SQL file - perfect for initial data seeding in a single command. Create <code>src/schema.sql</code>, which will create a new <code>comments</code> table for our project:</p>
            <pre><code>drop table if exists comments;
create table comments (
  id integer primary key autoincrement,
  author text not null,
  body text not null,
  post_slug text not null
);
create index idx_comments_post_id on comments (post_slug);

-- Optionally, uncomment the below query to create data

-- insert into comments (author, body, post_slug)
-- values ('Kristian', 'Great post!', 'hello-world');</code></pre>
            <p>With the file created, execute the schema file against the D1 database by passing it with the flag <code>--file</code>:</p>
            <pre><code>$ wrangler d1 execute d1-example --file src/schema.sql</code></pre>
            <p>We've created a SQL database with just a few commands and seeded it with initial data. Now we can add a route to our Workers function to retrieve data from that database. Based on our wrangler.toml config, the D1 database is now accessible via the <code>DB</code> binding. In our code, we can use the binding to prepare SQL statements and execute them, for instance, to retrieve comments:</p>
            <pre><code>app.get('/api/posts/:slug/comments', async c =&gt; {
  const { slug } = c.req.param()
  const { results } = await c.env.DB.prepare(`
    select * from comments where post_slug = ?
  `).bind(slug).all()
  return c.json(results)
})</code></pre>
            <p>In this function, we accept a <code>slug</code> URL path parameter and set up a new SQL statement where we select all comments with a <code>post_slug</code> value matching that parameter. We can then return the results as a simple JSON response.</p><p>So far, we've built read-only access to our data. But inserting values into SQL is, of course, possible as well. So let's define another function that allows POST-ing to an endpoint to create a new comment:</p>
            <pre><code>// The expected shape of a new comment
interface Comment {
  author: string
  body: string
}

app.post('/api/posts/:slug/comments', async c =&gt; {
  const { slug } = c.req.param()
  const { author, body } = await c.req.json&lt;Comment&gt;()

  if (!author) {
    c.status(400)
    return c.text("Missing author value for new comment")
  }
  if (!body) {
    c.status(400)
    return c.text("Missing body value for new comment")
  }

  const { success } = await c.env.DB.prepare(`
    insert into comments (author, body, post_slug) values (?, ?, ?)
  `).bind(author, body, slug).run()

  if (success) {
    c.status(201)
    return c.text("Created")
  } else {
    c.status(500)
    return c.text("Something went wrong")
  }
})</code></pre>
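            <p>To round out the picture, here's how the static site's front end might call this API. This is a hedged sketch: the Worker URL is a placeholder, and the helper is not part of the template.</p>
            <pre><code>// Placeholder for your deployed Worker's URL
const API_BASE = 'https://comments-api.example.workers.dev'

// Build a POST request that creates a comment on a given post
export function buildCommentRequest(slug: string, author: string, body: string): Request {
  return new Request(`${API_BASE}/api/posts/${encodeURIComponent(slug)}/comments`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ author, body }),
  })
}

// Usage from the static site:
//   await fetch(buildCommentRequest('hello-world', 'Kristian', 'Great post!'))</code></pre>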
            <p>In this example, we built a comments API for powering a blog. To see the source for this D1-powered comments API, you can visit <a href="https://github.com/cloudflare/templates/tree/main/worker-d1-api">cloudflare/templates/worker-d1-api</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Bbc7exdfzvFnV47Btu7Gn/362b947983416c62e0b9670417e1babb/image2-31.png" />
            
            </figure>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>One of the most exciting things about D1 is the opportunity to augment existing applications or websites with dynamic, relational data. As a former Ruby on Rails developer, one of the things I miss most about that framework in the world of JavaScript and serverless development tools is the ability to rapidly spin up full data-driven applications without needing to be an expert in managing database infrastructure. With D1 and its easy onramp to SQL-based data, we can build true data-driven applications without compromising on performance or developer experience.</p><p>This shift corresponds nicely with the advent of static sites in the last few years, using tools like Hugo or Gatsby. A blog built with a static site generator like Hugo is incredibly performant - it will build in seconds with small asset sizes.</p><p>But by trading a tool like WordPress for a static site generator, you lose the opportunity to add dynamic information to your site. Many developers have patched over this problem by adding more complexity to their build processes: fetching and retrieving data and generating pages using that data as part of the build.</p><p>This addition of complexity in the build process attempts to fix the lack of dynamism in applications, but it still isn't genuinely dynamic. Instead of being able to retrieve and display new data as it's created, the application rebuilds and redeploys whenever data changes so that it appears to be a live, dynamic representation of data. With a D1-backed API, that workaround disappears: your application can remain static, and the dynamic data will live geographically close to the users of your site, accessible via a queryable and expressive API.</p>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[D1]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">42d4M0F5dhHm6ImRHKTL7Z</guid>
            <dc:creator>Kristian Freeman</dc:creator>
        </item>
        <item>
            <title><![CDATA[Easy Postgres integration on Cloudflare Workers with Neon.tech]]></title>
            <link>https://blog.cloudflare.com/neon-postgres-database-from-workers/</link>
            <pubDate>Tue, 15 Nov 2022 13:44:00 GMT</pubDate>
            <description><![CDATA[ Neon.tech is a database platform that allows you to branch your (Postgres compatible) database exactly like your code repository. And with the release of their Cloudflare Workers integration you can get started in minutes ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5xCkAJAUyFmN2b0MizGCP0/e191956551ac3604f322f51bee54dc02/image1-28.png" />
            
            </figure><p>It’s no wonder that Postgres is one of the world’s favorite databases. It’s easy to learn, a pleasure to use, and can scale all the way up from your first database in an early-stage startup to the system of record for giant organizations. Postgres has been an integral part of Cloudflare’s journey, so we know this fact well. But when it comes to connecting to Postgres from environments like Cloudflare Workers, there are unfortunately a bunch of challenges, as we mentioned in our <a href="/relational-database-connectors/">Relational Database Connector post</a>.</p><p>Neon.tech not only solves these problems; it also has other cool features such as <a href="https://neon.tech/docs/conceptual-guides/branching/">branching databases</a> — being able to branch your database in exactly the same way you branch your code: instant, cheap and completely isolated.</p>
    <div>
      <h3>How to use it</h3>
      <a href="#how-to-use-it">
        
      </a>
    </div>
    <p>It’s easy to get started. Neon’s client library <code>@neondatabase/serverless</code> is a drop-in replacement for <a href="https://node-postgres.com/">node-postgres</a>, the npm <code>pg</code> package with which you may already be familiar. After going through the <a href="https://neon.tech/docs/get-started-with-neon/signing-up/">getting started</a> process to set up your Neon database, you can easily create a Worker to ask Postgres for the current time like so:</p><ol><li><p><b>Create a new Worker</b> — Run <code>npx wrangler init neon-cf-demo</code> and accept all the defaults. Enter the new folder with <code>cd neon-cf-demo</code>.</p></li><li><p><b>Install the Neon package</b> — Run <code>npm install @neondatabase/serverless</code>.</p></li><li><p><b>Provide connection details</b> — For deployment, run <code>npx wrangler secret put DATABASE_URL</code> and paste in your connection string when prompted (you’ll find this in your Neon dashboard: something like <code>postgres://user:password@project-name-1234.cloud.neon.tech/main</code>). For development, create a new file <code>.dev.vars</code> with the contents <code>DATABASE_URL=</code> plus the same connection string.</p></li><li><p><b>Write the code</b> — Lastly, replace <code>src/index.ts</code> with the following code:</p></li></ol>
            <pre><code>import { Client } from '@neondatabase/serverless';
interface Env { DATABASE_URL: string; }

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext) {
    const client = new Client(env.DATABASE_URL);
    await client.connect();
    const { rows: [{ now }] } = await client.query('select now();');
    ctx.waitUntil(client.end());  // this doesn’t hold up the response
    return new Response(now);
  }
}</code></pre>
            <p>To try this locally, type <code>npm start</code>. To deploy it around the globe, type <code>npx wrangler publish</code>.</p><p>You can also <a href="https://github.com/neondatabase/serverless-cfworker-demo">check out the source</a> for a slightly more complete demo app. This shows <a href="https://neon-cf-pg-test.pages.dev/">your nearest UNESCO World Heritage sites</a> using IP geolocation in Cloudflare Workers and nearest-neighbor sorting in PostGIS.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1cdG0zVBCrFyiBilGo94o8/1d33ccaca61c6efec303aaa763ddb1e8/image3-18.png" />
            
            </figure><p>How does this work? In this case, we take the coordinates supplied to our Worker in <code>request.cf.longitude</code> and <code>request.cf.latitude</code>. We then feed these coordinates to a SQL query that uses the <a href="https://postgis.net/docs/geometry_distance_knn.html">PostGIS distance operator <code>&lt;-&gt;</code></a> to order our results:</p>
            <pre><code>const { longitude, latitude } = request.cf
const { rows } = await client.query(`
  select 
    id_no, name_en, category,
    st_makepoint($1, $2) &lt;-&gt; location as distance
  from whc_sites_2021
  order by distance limit 10`,
  [longitude, latitude]
);</code></pre>
            <p>Since we created a <a href="http://postgis.net/workshops/postgis-intro/indexing.html">spatial index</a> on the location column, the query is blazing fast. The result (<code>rows</code>) looks like this:</p>
            <pre><code>[{
  "id_no": 308,
  "name_en": "Yosemite National Park",
  "category": "Natural",
  "distance": 252970.14782223428
},
{
  "id_no": 134,
  "name_en": "Redwood National and State Parks",
  "category": "Natural",
  "distance": 416334.3926827573
},
/* … */
]</code></pre>
            <p>For even lower latencies, we could cache these results at a slightly coarser geographical resolution — rounding, say, to one sixtieth of a degree (one <a href="https://en.wikipedia.org/wiki/Minute_and_second_of_arc">arc minute</a>) of longitude and latitude, which is a little under a mile.</p><p><a href="https://console.neon.tech/?invite=serverless">Sign up to Neon</a> using the invite code <i>serverless</i> and try the @neondatabase/serverless driver with Cloudflare Workers.</p>
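            <p>The coordinate rounding mentioned above is easy to sketch. This is a hypothetical helper, not part of the demo app:</p>
            <pre><code>// Snap coordinates to the nearest arc minute (1/60 of a degree) so that
// nearby requests share a cache key. One arc minute of latitude is a
// little under a mile.
export function coarseCacheKey(longitude: number, latitude: number): string {
  const lon = Math.round(longitude * 60) / 60
  const lat = Math.round(latitude * 60) / 60
  return `${lon.toFixed(4)},${lat.toFixed(4)}`
}</code></pre>
            <p>Requests from points within the same arc-minute cell produce the same key, so their PostGIS results could be served from cache.</p>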
    <div>
      <h3>Why we did it</h3>
      <a href="#why-we-did-it">
        
      </a>
    </div>
    <p>Cloudflare Workers has enormous potential to improve back-end development and deployment. It’s cost-effective, admin-free, and radically scalable.</p><p>The use of V8 isolates means Workers are now fast and lightweight enough for nearly any use case. But the platform has a key drawback: Cloudflare Workers don’t yet support raw TCP communication, which has made database connections a challenge.</p><p>Even when Workers eventually support raw TCP communication, we will not have fully solved our problem, because database connections are expensive to set up and also have quite a bit of memory overhead.</p><p>This is what the solution looks like:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7CWb6OEZuNu8mg1YJsQTzY/3c14f184ab034be30f106f921f2485c7/image2-23.png" />
            
            </figure><p>It consists of three parts:</p><ol><li><p><b>Connection pooling built into the platform</b> — Given Neon’s serverless compute model, splitting storage and compute operations, it is not recommended to rely on a one-to-one mapping between external clients and Postgres connections. Instead, you can turn on connection pooling simply by flicking a switch (it’s in the <i>Settings</i> area of your Neon dashboard).</p></li><li><p><b>WebSocket proxy</b> — We deploy our <a href="https://github.com/neondatabase/wsproxy">own WebSocket-to-TCP proxy</a>, written in Go. The proxy simply accepts WebSocket connections from Cloudflare Worker clients, relays message payloads to a requested (Neon-only) host over plain TCP, and relays back the responses.</p></li><li><p><b>Client library</b> — Our driver library is based on node-postgres but provides the necessary <a href="https://github.com/neondatabase/serverless">shims for Node.js features</a> that aren’t present in Cloudflare Workers. Crucially, we replace Node’s <code>net.Socket</code> and <code>tls.connect</code> with code that redirects network reads and writes via the WebSocket connection. To support end-to-end TLS encryption between Workers and the database, we compile <a href="https://www.wolfssl.com/">WolfSSL</a> to WebAssembly with <a href="https://emscripten.org/">emscripten</a>. Then we use <a href="https://esbuild.github.io/">esbuild</a> to bundle it all together into an easy-to-use npm package.</p></li></ol><p>The <code>@neondatabase/serverless</code> package is currently in public beta. We have plans to improve, extend, and explain it further in the near future <a href="https://neon.tech/blog/">on the Neon blog</a>. 
In line with our commitment to open source, you can configure our serverless driver and/or run our WebSocket proxy to provide access to Postgres databases hosted anywhere — just see the respective repos for details.</p><p>So <a href="https://console.neon.tech/?invite=serverless">try Neon</a> using invite code <code><i>serverless</i></code>, <a href="https://workers.cloudflare.com/">sign up and connect to it with Cloudflare Workers</a>, and you’ll have a fully flexible back-end service running in next to no time.</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Guest Post]]></category>
            <guid isPermaLink="false">7yvumN0c8zRWHVBbQAkbLc</guid>
            <dc:creator>Erwin van der Koogh</dc:creator>
            <dc:creator>George MacKerron (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Workers scale too well and broke our infrastructure, so we are rebuilding it on Workers]]></title>
            <link>https://blog.cloudflare.com/devcycle-customer-story/</link>
            <pubDate>Mon, 14 Nov 2022 14:00:00 GMT</pubDate>
            <description><![CDATA[ DevCycle re-architected their feature management tool using Workers for improved scalability ]]></description>
            <content:encoded><![CDATA[ <p></p><p>While scaling our new Feature Flagging product <a href="https://devcycle.com/">DevCycle</a>, we’ve encountered an interesting challenge: our Cloudflare Workers-based infrastructure can handle far more instantaneous load than our traditional AWS infrastructure. This led us to rethink how we design our infrastructure, and to build on Cloudflare Workers for everything.</p>
    <div>
      <h3>The origin of DevCycle</h3>
      <a href="#the-origin-of-devcycle">
        
      </a>
    </div>
    <p>For almost 10 years, Taplytics has been a leading provider of no-code A/B testing and feature flagging solutions for product and marketing teams across a wide range of use cases for some of the largest consumer-facing companies in the world. So when we set out to build a new engineering-focused feature management product, DevCycle, we built upon our experience with Workers, which have served over 140 billion requests for Taplytics customers.</p><p>The inspiration behind DevCycle is to build a focused feature management tool for engineering teams, empowering them to build their software more efficiently and deploy it faster, and helping them reach goals such as continuous deployment, a lower change failure rate, and faster recovery times. DevCycle is the culmination of our vision of how teams should use Feature Management to build high-quality software faster. We've used DevCycle to <i>build</i> DevCycle, enabling us to implement continuous deployment successfully.</p>
    <div>
      <h3>DevCycle architecture</h3>
      <a href="#devcycle-architecture">
        
      </a>
    </div>
    <p>One of the first things we asked ourselves when ideating DevCycle was how we could get out of the business of managing thousands of vCPUs’ worth of AWS instances and move our core business logic closer to our end-user devices. Based on our experience with Cloudflare Workers at Taplytics, we knew we wanted them to be a core part of our future infrastructure for DevCycle.</p><p>By using the global computing power of Workers and moving as much logic to the SDKs as possible with our local bucketing server-side SDKs, we were able to massively reduce or eliminate the latency of fetching feature flag configurations for our users. In addition, we used a shared WASM library across our Workers and local bucketing SDKs to dramatically reduce the amount of code we need to maintain per SDK, and increase the consistency of our platform. This architecture has also fundamentally changed our cost structure, allowing us to easily serve customers of any scale.</p><p>The core architecture of DevCycle revolves around publishing and consuming JSON configuration files per project environment. The publishing side is managed in our AWS services, while Cloudflare manages the consumption of these config files at scale. This split in responsibilities allows for all high-scale requests to be managed by Cloudflare, while keeping our AWS services simple and low-scale.</p>
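<p>To make the idea of local bucketing concrete, here is a minimal, illustrative sketch (not DevCycle’s actual algorithm, which lives in the shared WASM library): hash the user ID deterministically into a number between 0 and 1, then compare it against a feature’s rollout percentage from the published JSON config. Because the function is pure, every SDK and Worker computes the same answer with no network call.</p>

```typescript
// Illustrative sketch only: DevCycle's real bucketing logic lives in a shared
// WASM library. FNV-1a, a simple well-known string hash, is used here purely
// for demonstration.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Map a user into [0, 1) for a given feature, then compare against the
// feature's rollout percentage (0-100) taken from the published config.
function isEnabled(featureKey: string, userId: string, rolloutPct: number): boolean {
  const bucket = fnv1a(`${featureKey}:${userId}`) / 0x100000000; // in [0, 1)
  return bucket < rolloutPct / 100;
}
```

<p>Hashing the feature key together with the user ID keeps assignments independent across features while remaining stable for any one user.</p>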
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5u1ZlzkOHcB4ImfB5t9Nw7/95a30c622c0b646be1472e5619f8fc1d/image1-22.png" />
            
            </figure>
    <div>
      <h3>“Workers are breaking our events pipeline”</h3>
      <a href="#workers-are-breaking-our-events-pipeline">
        
      </a>
    </div>
    <p>One of the primary challenges of running a feature management platform is that we don’t have direct control over the load from our customers’ applications using our SDKs; our systems need the ability to scale instantly to match their load. For example, we have a couple of large customers whose mobile traffic is primarily driven by push notifications, which causes massive instantaneous spikes in traffic to our APIs in the range of 10x increases in load. As you can imagine, neither a traditional auto-scaled API service nor its load balancer can manage that type of increase in load. Thus, our choices are to dramatically increase the minimum size of our cluster and load balancer to handle these unknown load spikes, accept that some requests will be rate-limited, or move to an architecture that can handle this load.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6wv1AhdzZCGrR1nX0x0BBm/3788b48880257b4346e1ec58fda2e9a6/image4-10.png" />
            
            </figure><p>Given that all our SDK API requests are already served with Workers, they have no problem scaling instantly to 10x+ their base load. Sadly, we can’t say the same about the traditional parts of our infrastructure.</p><p>For each feature flag configuration request to a Worker, a corresponding events request is sent to our AWS events infrastructure. The events are received by our events API container in Kubernetes, where they are then published to Kafka and eventually ingested by Snowflake. While Cloudflare Workers have no problem handling instantaneous spikes in feature flag requests, the events system can't keep up: our cluster and events API containers can't scale up fast enough to keep the existing instances from being overwhelmed, and even the load balancer has issues accepting the sudden increase. Cloudflare Workers just work too well in comparison to EC2 instances + EKS.</p><p>To solve this issue, we are moving towards a new events Cloudflare Worker that can handle the instantaneous events load from these requests and use Kinesis Data Firehose to write events to our existing S3 bucket, which is ingested by Snowflake. In the future, we look forward to testing out Cloudflare Queues writing to R2 once a Snowflake connector has been created. This architecture should allow us to ingest events at almost any scale and withstand instantaneous traffic spikes with a predictable and efficient cost structure.</p>
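<p>As a sketch of what that batching step might look like (the names here are hypothetical; the actual DevCycle Worker isn’t public), Firehose’s <code>PutRecordBatch</code> API accepts up to 500 records per call, each carrying a base64-encoded payload. The signed HTTP call itself (AWS SigV4) is omitted and would be handled by an AWS client library in a real Worker:</p>

```typescript
// Hypothetical sketch of the batching step such an events Worker might use.
// Only the request-shaping is shown; sending the signed request to the
// Firehose endpoint is left to an AWS SigV4-capable client.
type FirehoseBatch = {
  DeliveryStreamName: string;
  Records: { Data: string }[];
};

function toFirehoseBatches(streamName: string, events: object[], maxPerBatch = 500): FirehoseBatch[] {
  const batches: FirehoseBatch[] = [];
  for (let i = 0; i < events.length; i += maxPerBatch) {
    batches.push({
      DeliveryStreamName: streamName,
      Records: events.slice(i, i + maxPerBatch).map((e) => ({
        // btoa is available in Workers; newline-delimited JSON keeps the
        // resulting S3 objects friendly to Snowflake ingestion.
        Data: btoa(JSON.stringify(e) + "\n"),
      })),
    });
  }
  return batches;
}
```
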
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3sOm1ZrZvJGgcHWtGvzPB8/2872f752d67f87eb2cbfa95b6ca0cffe/image2-21.png" />
            
            </figure>
    <div>
      <h3>Building without a database next to your code</h3>
      <a href="#building-without-a-database-next-to-your-code">
        
      </a>
    </div>
    <p>Workers provide many benefits, including fast response times, infinite scalability, serverless architecture, and excellent uptime performance. However, if you want to see all these benefits, you need to architect your Workers to assume that you don’t have direct access to a centralized SQL / NoSQL database (or <a href="/whats-new-with-d1/">D1</a>) like you would with a traditional API service. For example, suppose you build your Workers to require reaching out to a database to fetch and update user data every time a request is made to your Workers. In that case, your request latency will be tied to the geographic distance between your Worker and the database plus the latency of the database. In addition, your Workers will be able to scale significantly beyond the number of database connections you can support, and your uptime will be tied to the uptime of your external database. Therefore, when architecting your systems to use Workers, we advise relying primarily on data sent as part of the API request and cacheable data on Cloudflare’s global network.</p><p>Cloudflare provides multiple products and services to help with data on their global network:</p><ul><li><p><a href="https://developers.cloudflare.com/workers/learning/how-kv-works/">KV</a>: “global, low-latency, key-value data store.”</p><ul><li><p>However, the lowest-latency way of retrieving data from within a Worker is subject to a minimum 60-second TTL, so you’ll need to be OK with cached data that may be up to 60 seconds stale.</p></li></ul></li><li><p><a href="https://developers.cloudflare.com/workers/learning/using-durable-objects/">Durable Objects</a>: “provide low-latency coordination and consistent storage for the Workers platform through two features: global uniqueness and a transactional storage API.”</p><ul><li><p>Ability to store user-level information closer to the end user.</p></li><li><p>The Worker interface for accessing data may be unfamiliar to developers with SQL / NoSQL experience.</p></li></ul></li><li><p><a href="https://developers.cloudflare.com/r2/data-access/workers-api/workers-api-usage/">R2</a>: “store large amounts of unstructured data.”</p><ul><li><p>Ability to store arbitrarily large amounts of unstructured data using familiar S3 APIs.</p></li><li><p>Cloudflare’s cache can be used to provide low-latency access within Workers.</p></li></ul></li><li><p><a href="/introducing-d1/">D1</a>: “<a href="https://www.cloudflare.com/developer-platform/products/d1/">serverless SQLite database</a>”</p></li></ul><p>Each of these tools that Cloudflare provides has made building APIs far more accessible than when Workers launched initially; however, each service has aspects that need to be accounted for when architecting your systems. Being an open platform, you can also access any publicly available database you want from a Worker. For example, we are making use of <a href="https://www.macrometa.com/platform/global-data-mesh">Macrometa</a> for our <a href="https://devcycle.com/features/edge-flags">EdgeDB</a> product built into our Workers to help customers access their user data.</p>
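<p>As a concrete illustration of the KV pattern above, a Worker might read a project’s feature configuration through a KV binding (the key scheme and function names below are hypothetical):</p>

```typescript
// A minimal sketch of the KV read pattern described above. The key scheme is
// hypothetical; the 60-second cacheTtl is KV's minimum for its low-latency
// cached reads, so callers must tolerate configuration that is up to a
// minute stale.

// Structural subset of Workers' KVNamespace, so the function is easy to test.
interface KVJsonReader {
  get<T>(key: string, options: { type: "json"; cacheTtl: number }): Promise<T | null>;
}

// Fetch a project's feature configuration, letting Cloudflare's network cache
// the value close to the Worker for up to 60 seconds.
async function getProjectConfig<T>(kv: KVJsonReader, projectId: string): Promise<T | null> {
  return kv.get<T>(`config:${projectId}`, { type: "json", cacheTtl: 60 });
}
```

<p>Coding against a minimal structural interface rather than the full binding type also lets the same function run unchanged against an in-memory stub in tests.</p>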
    <div>
      <h3>The predictable cost structure of Workers</h3>
      <a href="#the-predictable-cost-structure-of-workers">
        
      </a>
    </div>
    <p>One of the greatest advantages of moving most of our workloads towards Cloudflare Workers is the predictable cost structure that can scale 1:1 with our request loads and can be easily mapped to usage-based billing for our customers. In addition, we no longer have to run excess EC2 instances to handle random spikes in load, just in case they happen.</p><p>Too many SaaS services have opaque billing based on max usage or other metrics that don’t relate directly to their costs. Moving from our legacy AWS architecture with high fixed costs like databases and caching layers to Workers means our infrastructure spending is now directly tied to usage of our APIs and SDKs. For DevCycle, this architecture has been roughly 5x more cost-efficient to operate.</p>
    <div>
      <h3>The future of DevCycle and Cloudflare</h3>
      <a href="#the-future-of-devcycle-and-cloudflare">
        
      </a>
    </div>
    <p>With DevCycle we will continue to invest in leveraging serverless computing and moving our core business logic as close to our users as possible, either on Cloudflare’s global network or locally within our SDKs. We’re excited to integrate even more deeply with the Cloudflare developer platform as new services evolve. We already see future use cases for R2, Queues and Durable Objects and look forward to what’s coming next from Cloudflare.</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Guest Post]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">3O0ZAqf35EBtm8oisold23</guid>
            <dc:creator>Jonathan Norris (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[Build applications of any size on Cloudflare with the Queues open beta]]></title>
            <link>https://blog.cloudflare.com/cloudflare-queues-open-beta/</link>
            <pubDate>Mon, 14 Nov 2022 14:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare Queues enables developers to build performant and resilient distributed applications that span the globe. Any developer with a paid Workers plan can enroll in the Open Beta and start building today! ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Message queues are a fundamental building block of cloud applications—and today the <a href="https://developers.cloudflare.com/queues/">Cloudflare Queues</a> open beta brings queues to every developer building for Region: Earth. Cloudflare Queues follows <a href="https://www.cloudflare.com/products/workers/">Cloudflare Workers</a> and <a href="https://www.cloudflare.com/products/r2/">Cloudflare R2</a> in a long line of <a href="https://www.cloudflare.com/application-services/">innovative application services</a> built for the Workers Developer Platform, enabling developers to build more complex applications without configuring networks, choosing regions, or estimating capacity. Best of all, like many other Cloudflare services, there are no egregious egress charges!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7L0APU7xTJNF6ByoHcvXpY/581bfdcb37394d7e30ce5bad852b2c58/image8.png" />
            
            </figure><p>If you’ve ever purchased something online and seen a message like “you will receive confirmation of your order shortly,” you’ve interacted with a queue. When you completed your order, your shopping cart and information were stored and the order was placed into a queue. At some later point, the order fulfillment service picks and packs your items and hands it off to the shipping service—again, via a queue. Your order may sit for only a minute, or much longer if an item is out of stock or a warehouse is busy, and queues enable all of this functionality.</p><p>Message queues are great at decoupling components of applications, like the checkout and order fulfillment services for an <a href="https://www.cloudflare.com/ecommerce/">ecommerce site</a>. Decoupled services are easier to reason about, deploy, and implement, allowing you to ship features that delight your customers without worrying about synchronizing complex deployments.</p><p>Queues also allow you to batch and buffer calls to downstream services and <a href="https://www.cloudflare.com/learning/security/api/what-is-an-api/">APIs</a>. This post shows you how to enroll in the open beta, walks you through a practical example of using Queues to build a log sink, and tells you how we built Queues using other Cloudflare services. You’ll also learn a bit about the roadmap for the open beta.</p>
    <div>
      <h2>Getting started</h2>
      <a href="#getting-started">
        
      </a>
    </div>
    
    <div>
      <h3>Enrolling in the open beta</h3>
      <a href="#enrolling-in-the-open-beta">
        
      </a>
    </div>
    <p>Open the <a href="https://dash.cloudflare.com">Cloudflare dashboard</a> and navigate to the <i>Workers</i> section. Select <i>Queues</i> from the <i>Workers</i> navigation menu and choose <i>Enable Queues Beta</i>.</p><p>Review your order and choose <i>Proceed to Payment Details</i>.</p><p><b>Note: If you are not already subscribed to a</b> <a href="https://www.cloudflare.com/plans/developer-platform/#overview"><b>Workers Paid Plan</b></a><b>, one will be added to your order automatically.</b></p><p>Enter your payment details and choose <i>Complete Purchase</i>. That’s it - you’re enrolled in the open beta! Choose <i>Return to Queues</i> on the confirmation page to return to the Cloudflare Queues home page.</p>
    <div>
      <h3>Creating your first queue</h3>
      <a href="#creating-your-first-queue">
        
      </a>
    </div>
    <p>After enabling the open beta, open the Queues home page and choose <i>Create Queue</i>. Name your queue <code>my-first-queue</code> and choose <i>Create queue</i>. That’s all there is to it!</p><p>The dash displays a confirmation message along with a list of all the queues in your account.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/8jcg7ga1WIbqJplp8XRsP/cbc7cdf9420a4d51c714aa8b4b96bd80/image1-18.png" />
            
            </figure><p><b>Note: As of the writing of this blog post, each account is limited to ten queues. We intend to raise this limit as we build towards general availability.</b></p>
    <div>
      <h3>Managing your queues with Wrangler</h3>
      <a href="#managing-your-queues-with-wrangler">
        
      </a>
    </div>
    <p>You can also manage your queues from the command line using <a href="https://github.com/cloudflare/wrangler2">Wrangler</a>, the CLI for Cloudflare Workers. In this section, you build a simple but complete application implementing a log aggregator or sink to learn how to integrate Workers, Queues, and R2.</p><p><b>Setting up resources</b></p><p>To create this application, you need access to a Cloudflare Workers account with a subscription plan, access to the Queues open beta, and an R2 plan.</p><p><a href="https://developers.cloudflare.com/workers/get-started/guide/#1-install-wrangler-workers-cli">Install</a> and <a href="https://developers.cloudflare.com/workers/get-started/guide/#2-authenticate-wrangler">authenticate</a> Wrangler then run <code>wrangler queues create log-sink</code> from the command line to create a queue for your application.</p><p>Run <code>wrangler queues list</code> and note that Wrangler displays your new queue.</p><p><b>Note: The following screenshots use the</b> <a href="https://stedolan.github.io/jq/"><b>jq</b></a> <b>utility to format the JSON output of wrangler commands. You</b> <b><i>do not</i></b> <b>need to install jq to complete this application.</b></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5DzA0apSCcSq7lDHPCaFwt/d44c7290077d4ddc9e3283614ea7bbb9/8-create-and-list-a-queue.png" />
            
            </figure><p>Finally, run <code>wrangler r2 bucket create log-sink</code> to create an R2 bucket to store your aggregated logs. After the bucket is created, run <code>wrangler r2 bucket list</code> to see your new bucket.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5QT0a9c73sO2l5pTUbROs9/ab6b9e7858c8f2e90149839c275c7aa6/9-create-and-list-a-bucket.png" />
            
            </figure><p><b>Creating your Worker</b></p><p>Next, create a Workers application with two handlers: a <code>fetch()</code> handler to receive individual incoming log lines and a <code>queue()</code> handler to aggregate a batch of logs and write the batch to R2.</p><p>In an empty directory, run <code>wrangler init</code> to create a new Cloudflare Workers application. When prompted:</p><ul><li><p>Choose “y” to create a new <i>package.json</i></p></li><li><p>Choose “y” to use TypeScript</p></li><li><p>Choose “Fetch handler” to create a new Worker at <i>src/index.ts</i></p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3PiZnMr0Q4S0fzsLDnHg2G/002b64768711ba224e8cb3a9618607d2/10-wrangler-init.png" />
            
            </figure><p>Open <i>wrangler.toml</i> and replace the contents with the following:</p>
    <div>
      <h4><i><b>wrangler.toml</b></i></h4>
      <a href="#wrangler-toml">
        
      </a>
    </div>
    
            <pre><code>name = "queues-open-beta"
main = "src/index.ts"
compatibility_date = "2022-11-03"
 
 
[[queues.producers]]
 queue = "log-sink"
 binding = "BUFFER"
 
[[queues.consumers]]
 queue = "log-sink"
 max_batch_size = 100
 max_batch_timeout = 30
 
[[r2_buckets]]
 bucket_name = "log-sink"
 binding = "LOG_BUCKET"</code></pre>
            <p>The <code>[[queues.producers]]</code> section creates a <i>producer</i> binding for the Worker at <i>src/index.ts</i> called <code>BUFFER</code> that refers to the <i>log-sink</i> queue. This Worker can place messages onto the <i>log-sink</i> queue by calling <code>await env.BUFFER.send(log);</code></p><p>The <code>[[queues.consumers]]</code> section creates a <i>consumer</i> binding for the <i>log-sink</i> queue for your Worker. Once the <i>log-sink</i> queue has a batch ready to be processed (or <i>consumed</i>), the Workers runtime will look for the <code>queue()</code> event handler in <i>src/index.ts</i> and invoke it, passing the batch as an argument. The <i>queue()</i> function signature looks as follows:</p><p><code>async queue(batch: MessageBatch&lt;Error&gt;, env: Env): Promise&lt;void&gt; {</code></p><p>The final binding in your <i>wrangler.toml</i> creates a binding for the <i>log-sink</i> R2 bucket that makes the bucket available to your Worker via <i>env.LOG_BUCKET</i>.</p>
    <div>
      <h4><i><b>src/index.ts</b></i></h4>
      <a href="#src-index-ts">
        
      </a>
    </div>
    <p>Open <i>src/index.ts</i> and replace the contents with the following code:</p>
            <pre><code>export interface Env {
 BUFFER: Queue;
 LOG_BUCKET: R2Bucket;
}
 
export default {
 async fetch(request: Request, env: Env): Promise&lt;Response&gt; {
   let log = await request.json();
   await env.BUFFER.send(log);
   return new Response("Success!");
 },
 async queue(batch: MessageBatch&lt;Error&gt;, env: Env): Promise&lt;void&gt; {
   const logBatch = JSON.stringify(batch.messages);
   await env.LOG_BUCKET.put(`logs/${Date.now()}.log.json`, logBatch);
 },
};
</code></pre>
            <p>The <code>export interface Env</code> section exposes the two bindings you defined in <i>wrangler.toml</i>: a queue named <i>BUFFER</i> and an R2 bucket named <i>LOG_BUCKET</i>.</p><p>The <code>fetch()</code> handler parses the request body as JSON, adds the body to the <i>BUFFER</i> queue, then returns an HTTP 200 response with the message <i>Success!</i></p><p>The <code>queue()</code> handler receives a batch of messages, each containing a log entry, serializes the batch to JSON, then writes the result to the <i>LOG_BUCKET</i> R2 bucket using the current timestamp in the filename.</p><p><b>Publishing and running your application</b></p><p>To publish your log sink application, run <code>wrangler publish</code>. Wrangler packages your application and its dependencies and deploys it to Cloudflare’s global network.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/131yCNzIgg4qaU5LKSFrcu/eddd596bcf02e369163e057abde27944/11-wrangler-publish.png" />
            
            </figure><p>Note that the output of <code>wrangler publish</code> includes the <i>BUFFER</i> queue binding, indicating that this Worker is a <i>producer</i> and can place messages onto the queue. The final line of output also indicates that this Worker is a <i>consumer</i> for the <i>log-sink</i> queue and can read and remove messages from the queue.</p><p>Use your favorite API client, like <a href="https://curl.se/">curl</a>, <a href="https://httpie.io/">httpie</a>, or <a href="https://www.postman.com/">Postman</a>, to send JSON log entries to the published URL for your Worker via HTTP POST requests. Navigate to your <i>log-sink</i> R2 bucket in the <a href="https://dash.cloudflare.com">Cloudflare dashboard</a> and note that the <i>logs</i> prefix is now populated with aggregated logs from your request.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6W4jmz2ezkbUwrs0uZi4xu/a8f2f124feb5c9a121c38fbfb8a1d1df/image4-9.png" />
            
            </figure><p>Download and open one of the logfiles to view the JSON array inside. That’s it - with fewer than 45 lines of code and config, you’ve built a log aggregator to ingest and store data in R2!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2dw6XLyx7REjzje0Y0i7as/f154e7ed56152651c2f664b0dec9cebe/13-aggregated-log-content.png" />
            
            </figure>
    <div>
      <h2>Buffering R2 writes with Queues in the real world</h2>
      <a href="#buffering-r2-writes-with-queues-in-the-real-world">
        
      </a>
    </div>
    <p>In the previous example, you created a simple Workers application that buffers data into batches before writing the batches to R2. Batching reduces the number of calls to the downstream service, reducing load on that service and saving you money.</p><p><a href="https://uuid.rocks/">UUID.rocks</a>, the fastest UUIDv4-as-a-service, wanted to confirm whether their API truly generates unique IDs on every request. With 80,000 requests per day, it wasn’t trivial to find out. They decided to write every generated UUID to R2 to compare IDs across the entire population. However, writing directly to R2 at the rate UUIDs are generated is inefficient and expensive.</p><p>To reduce writes and costs, UUID.rocks introduced Cloudflare Queues into their UUID generation workflow. Each time a UUID is requested, a Worker places the value of the UUID into a queue. Once enough messages have been received, the buffered batch of JSON objects is written to R2. This avoids invoking an R2 write on every API call, saving costs and making the data easier to process later.</p><p>The <a href="https://github.com/jahands/uuid-queue">uuid-queue application</a> consists of a single Worker with three event handlers:</p><ol><li><p>A fetch handler that receives a JSON object representing the generated UUID and writes it to a Cloudflare Queue.</p></li><li><p>A queue handler that writes batches of JSON objects to R2 in CSV format.</p></li><li><p>A scheduled handler that combines batches from the previous hour into a single file for future processing.</p></li></ol><p>To view the source or deploy this application into your own account, <a href="https://github.com/jahands/uuid-queue">visit the repository on GitHub</a>.</p>
    <div>
      <h2>How we built Cloudflare Queues</h2>
      <a href="#how-we-built-cloudflare-queues">
        
      </a>
    </div>
    <p>Like many of the Cloudflare services you use and love, we built Queues by composing other Cloudflare services like Workers and Durable Objects. This enabled us to rapidly solve two difficult challenges: securely invoking your Worker from our own service and maintaining a strongly consistent state at scale. Several recent Cloudflare innovations helped us overcome these challenges.</p>
    <div>
      <h3>Securely invoking your Worker</h3>
      <a href="#securely-invoking-your-worker">
        
      </a>
    </div>
    <p>In the <i>Before Times</i> (early 2022), invoking one Worker from another Worker meant a fresh HTTP call from inside your script. This was a brittle experience, requiring you to know your downstream endpoint at deployment time. Nested invocations ran as HTTP calls, passing all the way through the Cloudflare network a second time and adding latency to your request. It also meant security was on you - if you wanted to control how that second Worker was invoked, you had to design and implement your own authentication and authorization scheme.</p><p><b>Worker to Worker requests</b></p><p>During Platform Week in May 2022, <a href="/service-bindings-ga/">Service Worker Bindings</a> entered general availability. With Service Worker Bindings, your Worker code has a binding to another Worker in your account that you invoke directly, avoiding the network penalty of a nested HTTP call. This removes the performance and security barriers discussed previously, but it still requires that you hard-code your nested Worker at compile time. You can think of this setup as “static dispatch,” where your Worker has a static reference to another Worker where it can dispatch events.</p><p><b>Dynamic dispatch</b></p><p>As Service Worker Bindings entered general availability, we also <a href="/workers-for-platforms-ga/">launched a closed beta</a> of Workers for Platforms, our tool suite to help make any product programmable. With Workers for Platforms, software as a service (SaaS) and platform providers can allow users to upload their own scripts and run them safely via Cloudflare Workers. User scripts are not known at compile time, but are <i>dynamically dispatched</i> at runtime.</p><p><a href="/workers-for-platforms-ga/">Workers for Platforms entered</a> general availability during <a href="https://www.cloudflare.com/ga-week/">GA week</a> in September 2022, and is available for all customers to build with today.</p><p>With dynamic dispatch generally available, we now have the ability to discover and invoke Workers at runtime without the performance penalty of HTTP traffic over the network. We use dynamic dispatch to invoke your queue’s consumer Worker whenever a message or batch of messages is ready to be processed.</p>
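<p>For comparison, a statically dispatched Service Binding is declared in the calling Worker’s <i>wrangler.toml</i> (the binding and service names below are hypothetical):</p>

```toml
# wrangler.toml of the calling Worker. "auth-worker" must be another Worker
# deployed in the same account; "AUTH" is the name the binding gets in code.
services = [
  { binding = "AUTH", service = "auth-worker" }
]
```

<p>The caller can then invoke it directly with <code>env.AUTH.fetch(request)</code>, with no extra hop across the network; dynamic dispatch generalizes this by resolving the target Worker at runtime rather than at deploy time.</p>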
    <div>
      <h3>Consistent stateful data with Durable Objects</h3>
      <a href="#consistent-stateful-data-with-durable-objects">
        
      </a>
    </div>
    <p>Another challenge we faced was storing messages durably without sacrificing performance. We set a design goal of ensuring that all messages were persisted to disk in multiple locations before we confirmed receipt of the message to the user. Again, we turned to an existing Cloudflare product—<a href="https://developers.cloudflare.com/workers/learning/using-durable-objects/">Durable Objects</a>—which <a href="/durable-objects-ga/">entered general availability nearly one year ago today</a>.</p><p>Durable Objects are named instances of JavaScript classes that are guaranteed to be unique across Cloudflare’s entire network. Durable Objects process messages in order on a single thread, allowing for coordination across messages, and provide a strongly consistent storage API for key-value pairs. Offloading the hard problem of storing data durably in a distributed environment to Durable Objects allowed us to reduce the time to build Queues and prepare it for open beta.</p>
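<p>As a toy illustration (not Queues’ actual implementation, which isn’t public) of why this model fits, consider a per-queue Durable Object handing out message positions. Because each object processes events one at a time against strongly consistent storage, two producers can never be assigned the same position:</p>

```typescript
// Toy illustration only: a per-queue object that persists messages and hands
// out strictly increasing positions, relying on the Durable Object guarantee
// of single-threaded execution over strongly consistent storage.

// Structural subset of the Durable Object storage API, kept minimal for testing.
interface ConsistentStorage {
  get<T>(key: string): Promise<T | undefined>;
  put(key: string, value: unknown): Promise<void>;
}

class QueueShard {
  constructor(private storage: ConsistentStorage) {}

  // Persist a message and return its position. Because the object is
  // single-threaded, two concurrent producers can never observe the same seq.
  async enqueue(body: string): Promise<number> {
    const seq = ((await this.storage.get<number>("seq")) ?? 0) + 1;
    await this.storage.put("seq", seq);
    await this.storage.put(`msg:${seq}`, body);
    return seq;
  }
}
```
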
    <div>
      <h2>Open beta roadmap</h2>
      <a href="#open-beta-roadmap">
        
      </a>
    </div>
    <p>Our open beta process empowers you to influence feature prioritization and delivery. We’ve set ambitious goals for ourselves on the path to general availability, most notably supporting unlimited throughput while maintaining 100% durability. We also have many other great features planned, like first-in first-out (FIFO) message processing and API compatibility layers to ease migrations, but we need your feedback to build what you need most, first.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Cloudflare Queues is a global message queue for the Workers developer. Building with Queues makes your applications more performant, resilient, and cost-effective—but we’re not done yet. <a href="https://dash.cloudflare.com">Join the Open Beta today</a> and <a href="https://discord.gg/aTsevRH3pG">share your feedback</a> to help shape the Queues roadmap as we deliver application integration services for the next generation cloud.</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Queues]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">44r8mGfFRsG9yOGC0CDQnu</guid>
            <dc:creator>Rob Sutter</dc:creator>
        </item>
        <item>
            <title><![CDATA[Store and retrieve your logs on R2]]></title>
            <link>https://blog.cloudflare.com/store-and-retrieve-logs-on-r2/</link>
            <pubDate>Wed, 21 Sep 2022 14:15:00 GMT</pubDate>
            <description><![CDATA[ Log Storage on R2: a cost-effective solution to store event logs for any of our products ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Following today’s announcement of General Availability of Cloudflare R2 <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/">object storage</a>, we’re excited to announce that customers can also store and retrieve their logs on R2.</p><p>Cloudflare’s Logging and Analytics products provide vital insights into customers’ applications. Though we offer a breadth of capabilities, logs in particular play a pivotal role in understanding what occurs at a granular level: they contain detailed metadata generated by Cloudflare products as events flow through our network, and customers depend on them to illustrate or investigate anything (and everything), from the general performance or health of applications to specific security incidents.</p><p>Until today, we have only provided customers with the ability to export logs to third-party destinations for both storage and analysis. However, with Log Storage on R2 we are able to offer customers a cost-effective solution to store event logs for any of our products.</p>
    <div>
      <h3>The cost conundrum</h3>
      <a href="#the-cost-conundrum">
        
      </a>
    </div>
    <p>We’ve <a href="/logs-r2/">unpacked the commercial impact in a previous blog post</a>, but to recap, the <a href="https://r2-calculator.cloudflare.com/">cost of storage can vary broadly</a> depending on the volume of requests Internet properties receive. On top of that - and specifically pertaining to logs - there are usually additional, more expensive fees to access that data whenever the need arises. This can be incredibly problematic, especially when customers have to balance their budget against the need to access their logs - whether it's to mitigate a potential catastrophe or just out of curiosity.</p><p>With R2, not only do we not charge customers <a href="https://developers.cloudflare.com/r2/platform/pricing/">egress costs</a>, but we also provide the opportunity to make further operational savings by centralizing storage and retrieval. Most of all, though, we just want to make it easy and convenient for customers to access their logs via our Retrieval API - all you need to do is provide a time range!</p>
    <div>
      <h3>Logs on R2: get started!</h3>
      <a href="#logs-on-r2-get-started">
        
      </a>
    </div>
    <p>Why would you want to store your logs on <a href="https://www.cloudflare.com/products/r2/">Cloudflare R2</a>? First, R2 is S3 API compatible, so your existing tooling will continue to work as is. Second, not only is R2 cost-effective for storage, we also do not charge any egress fees if you want to get your logs out of Cloudflare to be ingested into your own systems. You can store logs for any Cloudflare product, and you can keep what you need for as long as you need; retention is completely within your control.</p>
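<p>Because R2 speaks the S3 API, standard tooling such as the AWS CLI can browse and download stored logs directly. As a sketch (the bucket name, account ID, and credentials below are placeholders for your own values):</p>

```shell
# Point the AWS CLI at R2's S3-compatible endpoint; all angle-bracket
# values are placeholders for your own account details.
export AWS_ACCESS_KEY_ID="<R2_ACCESS_KEY_ID>"
export AWS_SECRET_ACCESS_KEY="<R2_SECRET_ACCESS_KEY>"

# List the log objects Logpush wrote under a date prefix
aws s3 ls "s3://<BUCKET>/20220925/" \
    --endpoint-url "https://<ACCOUNT_ID>.r2.cloudflarestorage.com"

# Copy a batch of logs down for local analysis
aws s3 cp "s3://<BUCKET>/20220925/" ./logs/ --recursive \
    --endpoint-url "https://<ACCOUNT_ID>.r2.cloudflarestorage.com"
```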
    <div>
      <h3>Storing Logs on R2</h3>
      <a href="#storing-logs-on-r2">
        
      </a>
    </div>
    <p>To create Logpush jobs pushing to R2, you can use either the dashboard or Cloudflare API. Using the dashboard, you can create a job and select R2 as the destination during configuration:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4LxlofboVHwn12lpAJ3iok/d8a69475331cf09feae68d1523b5dd76/image2-26.png" />
            
            </figure><p>To use the Cloudflare API to create the job, do something like:</p>
            <pre><code>curl -s -X POST 'https://api.cloudflare.com/client/v4/zones/&lt;ZONE_ID&gt;/logpush/jobs' \
-H "X-Auth-Email: &lt;EMAIL&gt;" \
-H "X-Auth-Key: &lt;API_KEY&gt;" \
-d '{
  "name": "&lt;DOMAIN_NAME&gt;",
  "destination_conf": "r2://&lt;BUCKET_PATH&gt;/{DATE}?account-id=&lt;ACCOUNT_ID&gt;&amp;access-key-id=&lt;R2_ACCESS_KEY_ID&gt;&amp;secret-access-key=&lt;R2_SECRET_ACCESS_KEY&gt;",
  "dataset": "http_requests",
  "logpull_options": "fields=ClientIP,ClientRequestHost,ClientRequestMethod,ClientRequestURI,EdgeEndTimestamp,EdgeResponseBytes,EdgeResponseStatus,EdgeStartTimestamp,RayID&amp;timestamps=rfc3339",
  "kind": "edge"
}' | jq .</code></pre>
            <p>Please see <a href="https://developers.cloudflare.com/logs/get-started/enable-destinations/r2/">Logpush over R2</a> docs for more information.</p>
    <div>
      <h3>Log Retrieval on R2</h3>
      <a href="#log-retrieval-on-r2">
        
      </a>
    </div>
    <p>If you have your logs pushed to R2, you could use the Cloudflare API to retrieve logs in specific time ranges like the following:</p>
            <pre><code>curl -s -g -X GET 'https://api.cloudflare.com/client/v4/accounts/&lt;ACCOUNT_ID&gt;/logs/retrieve?start=2022-09-25T16:00:00Z&amp;end=2022-09-25T16:05:00Z&amp;bucket=&lt;YOUR_BUCKET&gt;&amp;prefix=&lt;YOUR_FILE_PREFIX&gt;/{DATE}' \
-H "X-Auth-Email: &lt;EMAIL&gt;" \
-H "X-Auth-Key: &lt;API_KEY&gt;" \
-H "R2-Access-Key-Id: &lt;R2_ACCESS_KEY_ID&gt;" \
-H "R2-Secret-Access-Key: &lt;R2_SECRET_ACCESS_KEY&gt;" | jq .</code></pre>
            <p>See <a href="https://developers.cloudflare.com/logs/r2-log-retrieval/">Log Retrieval API</a> for more details.</p><p>Now that you have critical logging infrastructure on Cloudflare, you probably want to be able to monitor the health of these Logpush jobs as well as get relevant alerts when something needs your attention.</p>
    <div>
      <h3>Looking forward</h3>
      <a href="#looking-forward">
        
      </a>
    </div>
    <p>While we have a vision to build out log analysis and forensics capabilities on top of R2 - and a roadmap to get us there - we’d still love to hear your thoughts on any improvements we can make, particularly to our retrieval options.</p><p>Get setup on <a href="https://www.cloudflare.com/products/r2/">R2</a> to start <a href="https://developers.cloudflare.com/logs/get-started/enable-destinations/r2/">pushing logs</a> today! If your current plan doesn’t include Logpush, storing logs on R2 is another great reason to upgrade!</p> ]]></content:encoded>
            <category><![CDATA[GA Week]]></category>
            <category><![CDATA[General Availability]]></category>
            <category><![CDATA[Logs]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Storage]]></category>
            <guid isPermaLink="false">1jAMjJ9M01pYKP2KUM2EVn</guid>
            <dc:creator>Shelley Jones</dc:creator>
            <dc:creator>Duc Nguyen</dc:creator>
        </item>
        <item>
            <title><![CDATA[Using Cloudflare R2 as an apt/yum repository]]></title>
            <link>https://blog.cloudflare.com/using-cloudflare-r2-as-an-apt-yum-repository/</link>
            <pubDate>Thu, 15 Sep 2022 13:00:00 GMT</pubDate>
            <description><![CDATA[ Host your apt/yum repositories like how Cloudflare Tunnel does. Here's how ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/sQTPP1Uhcxo0gXUo6NIKI/dc6d5bfe03b47d171a7f56edefa763b6/R2-as-an-apt-yum-repository.png" />
            
            </figure><p>In this blog post, we’re going to talk about how we use Cloudflare R2 as an <i>apt/yum</i> repository to bring <i>cloudflared</i> (the Cloudflare Tunnel daemon) to your Debian/Ubuntu and CentOS/RHEL systems and how you can do it for your own distributable in a few easy steps!</p><p>I work on <a href="https://www.cloudflare.com/en-gb/products/tunnel/"><i>Cloudflare Tunnel</i></a>, a product which enables customers to quickly connect their private networks and services through the Cloudflare global network without needing to expose any public IPs or ports through their firewall. Cloudflare Tunnel is managed for users by <i>cloudflared</i>, a tool that runs on the same network as the private services. It proxies traffic for these services via Cloudflare, and users can then access these services securely through the Cloudflare network.</p><p>Our connector, <i>cloudflared,</i> was designed to be lightweight and flexible enough to be effectively deployed on a Raspberry Pi, a router, your laptop, or a server running on a data center with applications ranging from IoT control to private networking. Naturally, this means <i>cloudflared</i> comes built for a myriad of operating systems, architectures and package distributions: You could download the appropriate package from our <a href="https://github.com/cloudflare/cloudflared/releases">GitHub releases</a>, <i>brew install</i> it or <i>apt/yum install</i> it (<a href="https://pkg.cloudflare.com">https://pkg.cloudflare.com</a>).</p><p>In the rest of this blog post, I’ll use cloudflared as an example of how to create an apt/yum repository backed by Cloudflare’s <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/">object storage</a> service R2. Note that this can be any binary/distributable. I simply use cloudflared as an example because this is something we recently did and actually use in production.</p>
    <div>
      <h2>How apt-get works</h2>
      <a href="#how-apt-get-works">
        
      </a>
    </div>
    <p>Let’s see what happens when you run something like this on your terminal.</p>
            <pre><code>$ apt-get install cloudflared</code></pre>
            <p>Let’s also assume that apt sources were already added like so:</p>
            <pre><code>$ echo 'deb [signed-by=/usr/share/keyrings/cloudflare-main.gpg] https://pkg.cloudflare.com/cloudflared buster main' | sudo tee /etc/apt/sources.list.d/cloudflared.list

$ apt-get update</code></pre>
            <p>From the sources.list entry above, apt first looks up the <a href="https://pkg.cloudflare.com/cloudflared/dists/buster/Release">Release</a> file (or <a href="https://pkg.cloudflare.com/cloudflared/dists/buster/InRelease">InRelease</a> if the repository is signed, as cloudflared’s is, but we will ignore this for the sake of simplicity).</p><p>A Release file contains a list of supported architectures and their md5, sha1 and sha256 checksums. It looks something like this:</p>
            <pre><code>$ curl https://pkg.cloudflare.com/cloudflared/dists/buster/Release
Origin: cloudflared
Label: cloudflared
Codename: buster
Date: Thu, 11 Aug 2022 08:40:18 UTC
Architectures: amd64 386 arm64 arm armhf
Components: main
Description: apt repository for cloudflared - buster
MD5Sum:
 c14a4a1cbe9437d6575ae788008a1ef4 549 main/binary-amd64/Packages
 6165bff172dd91fa658ca17a9556f3c8 374 main/binary-amd64/Packages.gz
 9cd622402eabed0b1b83f086976a8e01 128 main/binary-amd64/Release
 5d2929c46648ea8dbeb1c5f695d2ef6b 545 main/binary-386/Packages
 7419d40e4c22feb19937dce49b0b5a3d 371 main/binary-386/Packages.gz
 1770db5634dddaea0a5fedb4b078e7ef 126 main/binary-386/Release
 b0f5ccfe3c3acee38ba058d9d78a8f5f 549 main/binary-arm64/Packages
 48ba719b3b7127de21807f0dfc02cc19 376 main/binary-arm64/Packages.gz
 4f95fe1d9afd0124a32923ddb9187104 128 main/binary-arm64/Release
 8c50559a267962a7da631f000afc6e20 545 main/binary-arm/Packages
 4baed71e49ae3a5d895822837bead606 372 main/binary-arm/Packages.gz
 e472c3517a0091b30ab27926587c92f9 126 main/binary-arm/Release
 bb6d18be81e52e57bc3b729cbc80c1b5 549 main/binary-armhf/Packages
 31fd71fec8acc969a6128ac1489bd8ee 371 main/binary-armhf/Packages.gz
 8fbe2ff17eb40eacd64a82c46114d9e4 128 main/binary-armhf/Release
SHA1:
…
SHA256:
…</code></pre>
            <p>Depending on your system’s architecture, apt picks the appropriate Packages location. A Packages file contains metadata about the binary apt wants to install, including its location and checksum. Let’s say it’s an amd64 machine. This means we’ll go here next:</p>
            <pre><code>$ curl https://pkg.cloudflare.com/cloudflared/dists/buster/main/binary-amd64/Packages
Package: cloudflared
Version: 2022.8.0
License: Apache License Version 2.0
Vendor: Cloudflare
Architecture: amd64
Maintainer: Cloudflare &lt;support@cloudflare.com&gt;
Installed-Size: 30736
Homepage: https://github.com/cloudflare/cloudflared
Priority: extra
Section: default
Filename: pool/main/c/cloudflared/cloudflared_2022.8.0_amd64.deb
Size: 15286808
SHA256: c47ca10a3c60ccbc34aa5750ad49f9207f855032eb1034a4de2d26916258ccc3
SHA1: 1655dd22fb069b8438b88b24cb2a80d03e31baea
MD5sum: 3aca53ccf2f9b2f584f066080557c01e
Description: Cloudflare Tunnel daemon</code></pre>
            <p>Notice the Filename field. This is where apt gets the deb from before running a dpkg command on it. What all of this means is that an apt repository (and a yum repository, likewise) is basically a structured file system of mostly plaintext files plus a deb.</p><p>The deb file is a Debian software package that contains two things: installable data (in our case, the <i>cloudflared</i> binary) and metadata about the installable.</p>
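<p>To make the checksum step concrete, here is a sketch of the integrity check apt performs after downloading: compute the SHA256 of the file and compare it against the value advertised in Packages. The file below is a local stand-in, not a real deb:</p>

```shell
# Simulate apt's integrity check: compare a file's SHA256 against the
# checksum the repository's Packages file advertised for it.
# "pkg.deb" here is a local stand-in, not a real Debian package.
printf 'hello\n' > pkg.deb

# In reality this value comes from the SHA256 field in Packages;
# here we compute it from the same bytes so the check passes.
expected=$(printf 'hello\n' | sha256sum | awk '{print $1}')
actual=$(sha256sum pkg.deb | awk '{print $1}')

if [ "$actual" = "$expected" ]; then
    echo "checksum OK"             # apt proceeds to run dpkg on the deb
else
    echo "checksum MISMATCH" >&2   # apt would refuse to install
fi
```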
    <div>
      <h2>Building our own apt repository</h2>
      <a href="#building-our-own-apt-repository">
        
      </a>
    </div>
    <p>Now that we know what happens when an apt-get install runs, let’s work our way backwards to construct the repository.</p>
    <div>
      <h3>Create a deb file out of the binary</h3>
      <a href="#create-a-deb-file-out-of-the-binary">
        
      </a>
    </div>
    <p><b>Note:</b> It is optional, but recommended, to sign your packages. See the section about how apt verifies packages <a href="/dont-use-apt-key/">here</a>.</p><p>Debian files can be built with the <a href="https://man7.org/linux/man-pages/man1/dpkg-buildpackage.1.html">dpkg-buildpackage</a> tool in a Linux or Debian environment. We use a handy command-line tool called fpm (<a href="https://fpm.readthedocs.io/en/v1.13.1/">https://fpm.readthedocs.io/en/v1.13.1/</a>) to do this because it works for both rpm and deb.</p>
            <pre><code>$ fpm -s dir -t deb -C /path/to/project --name &lt;project_name&gt; --version &lt;version&gt;</code></pre>
            <p>This yields a .deb file.</p>
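<p>If you want to sanity-check the result, dpkg-deb can print the metadata fpm embedded (the filename below is hypothetical):</p>

```shell
# Inspect the control metadata of the generated package
$ dpkg-deb --info cloudflared_2022.8.0_amd64.deb

# List the files it would install
$ dpkg-deb --contents cloudflared_2022.8.0_amd64.deb
```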
    <div>
      <h3>Create plaintext files needed by apt to lookup downloads</h3>
      <a href="#create-plaintext-files-needed-by-apt-to-lookup-downloads">
        
      </a>
    </div>
    <p>Again, these files can be built by hand, but there are multiple <a href="https://wiki.debian.org/DebianRepository/Setup?action=show&amp;redirect=HowToSetupADebianRepository#Debian_Repository_Generation_Tools.">tools</a> available to generate them. We use <a href="https://wiki.debian.org/DebianRepository/Setup?action=show&amp;redirect=HowToSetupADebianRepository#reprepro.">reprepro</a>, and using it is as simple as:</p>
            <pre><code>$ reprepro buster includedeb &lt;path/to/the/deb&gt;</code></pre>
            <p>reprepro neatly creates a set of folders in exactly the structure we looked at above.</p>
    <div>
      <h3>Upload them to Cloudflare R2</h3>
      <a href="#upload-them-to-cloudflare-r2">
        
      </a>
    </div>
    <p>We now use Cloudflare R2 as the host for this structured file system. R2 lets us upload and serve objects in this structured format, so we just need to upload the files in the same structure reprepro created them in.</p><p><a href="https://github.com/cloudflare/cloudflared/blob/135c8e6d13663d2aa2d3c9169cde0cfc1e6e2062/release_pkgs.py#L36">Here</a> is a copyable example of how we do it for cloudflared.</p>
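<p>Because R2 is S3 API compatible, one simple way to mirror reprepro’s output into a bucket is an S3-style sync; the bucket name and account ID below are placeholders:</p>

```shell
# Mirror the repository tree (dists/ and pool/) into an R2 bucket,
# preserving the exact paths apt expects. Bucket and account ID are
# placeholders for your own values.
aws s3 sync ./dists "s3://<BUCKET>/dists" \
    --endpoint-url "https://<ACCOUNT_ID>.r2.cloudflarestorage.com"
aws s3 sync ./pool "s3://<BUCKET>/pool" \
    --endpoint-url "https://<ACCOUNT_ID>.r2.cloudflarestorage.com"
```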
    <div>
      <h3>Serve them from an R2 worker</h3>
      <a href="#serve-them-from-an-r2-worker">
        
      </a>
    </div>
    <p>For fine-grained control, we write a very lightweight Cloudflare Worker to serve as the front-end API that apt talks to. For an apt repository, it only needs to handle GET requests.</p><p>Here’s an example of how to do this: <a href="https://developers.cloudflare.com/r2/examples/demo-worker/">https://developers.cloudflare.com/r2/examples/demo-worker/</a></p>
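<p>Once the Worker is deployed, you can smoke-test it with a plain GET - exactly the request apt will make on an update (the hostname is a placeholder for your Worker’s route):</p>

```shell
# Fetch the Release file through the Worker, just as apt would during
# `apt-get update`; <your-worker-host> is a placeholder.
curl -fsS "https://<your-worker-host>/cloudflared/dists/buster/Release" | head
```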
    <div>
      <h2>Putting it all together</h2>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p><a href="https://github.com/cloudflare/cloudflared/blob/master/release_pkgs.py">Here</a> is a handy script we use to push cloudflared to R2, ready for apt/yum downloads, which also signs the packages and publishes the pubkey.</p><p>And that’s it! You now have your own apt/yum repo without a server to maintain, with no egress fees for downloads, and served from the Cloudflare global network, which protects it against high request volumes. You can automate many of these steps to make them part of a release process.</p><p>Today, this is how cloudflared is distributed on the apt and yum repositories: <a href="https://pkg.cloudflare.com/">https://pkg.cloudflare.com/</a></p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Cloudflare Tunnel]]></category>
            <guid isPermaLink="false">1wS0SQnobVypSO6YoT2RnL</guid>
            <dc:creator>Sudarsan Reddy</dc:creator>
        </item>
    </channel>
</rss>