The Cloudflare Blog

When DNSSEC goes wrong: how we responded to the .de TLD outage

Sebastiaan Neuteboom — Wed, 06 May 2026 17:00:00 GMT

On May 5, 2026, at roughly 19:30 UTC, DENIC, the registry operator for the .de country-code top-level domain (TLD), started publishing incorrect DNSSEC signatures for the .de zone. Any validating DNS resolver receiving these signatures was required by the DNSSEC specification to reject them and return SERVFAIL to clients, including 1.1.1.1, the public DNS resolver operated by Cloudflare.

The country-code top-level domain for Germany, .de, is one of the largest on the Internet. On Cloudflare Radar, it consistently ranks among the most broadly queried TLDs globally. An outage at this level of the DNS hierarchy has the potential to make millions of domains unreachable.

In this post, we’ll walk through what we saw, the impact of these events, and how we applied temporary mitigations while DENIC resolved the issue.

How DNSSEC works

DNSSEC (Domain Name System Security Extensions) adds cryptographic authentication to DNS. When a zone is signed with DNSSEC, each set of records is accompanied by a digital signature known as an RRSIG record that lets a resolver verify the records haven’t been tampered with. Unlike encrypted DNS protocols, such as DNS over TLS (DoT) and DNS over HTTPs (DoH), DNSSEC is about integrity, not privacy. The records are visible, but their authenticity can be proven.

What makes DNSSEC unique is that the signatures travel together with the records they protect. This means integrity can be verified regardless of how many caches or hops a response has passed through. A cached record is just as verifiable as a fresh one.

DNSSEC is built on a chain of trust. Starting at the root zone, whose trust anchor is hard-coded into the resolvers, each zone delegates trust to child zones via Delegation Signer (DS) records. A DS record in the parent zone contains a cryptographic hash of a public key in the child zone. When a resolver validates example.de it verifies the chain: root trusts .de, .de trusts example.de. A break anywhere in that chain causes validation to fail for everything below it, which is why a misconfiguration at a TLD like .de affects every domain under it.

Zones typically use two types of keys: a Zone Signing Key (ZSK), used to sign the zone’s records, and a Key Signing Key (KSK), used to sign the ZSK itself. The KSK’s public key is what the parent zone’s DS record points to, anchoring the chain of trust. Rotating a ZSK is relatively straightforward: generate a new key, re-sign the zone’s records, and wait for caches to expire. Rotating a KSK is more involved, because the parent’s DS record must also be updated, often requiring coordination with a registrar or registry.

During a key rotation, there is a critical window where the old key is being phased out and the new one phased in. If the signatures published in the zone are made with a key that resolvers cannot verify against the zone’s published DNSKEY records, whether because the signing step failed, the timing was wrong, or the new key wasn’t fully distributed yet, resolvers have no choice but to reject the responses and return SERVFAIL.

What we saw

On May 5, 2026, at roughly 19:30 UTC, DENIC, the operator for the .de TLD, started publishing incorrect DNSSEC signatures for the .de zone. Any validating resolver receiving these records was required by the DNSSEC specification to reject them and return SERVFAIL. 1.1.1.1 was no exception.

The graph below shows the response codes 1.1.1.1 returned for .de queries during the incident.

After the immediate spike in SERVFAILs at 19:30 UTC, it climbed steadily over the following three hours as cached records slowly started expiring. As each domain's cached records expired and resolvers went back to DENIC for fresh copies, they got back broken signatures and started failing.

Also visible is a large increase in query volume. This is typical during DNS incidents, as clients retry failed queries, often three or more times, inflating the raw numbers. The SERVFAIL rate looks more alarming than the actual user impact, as many of those queries represent the same user retrying the same domain.

What might be surprising is that the NOERROR rate stayed relatively stable throughout the incident. That's “serve stale” at work, which we'll cover in the next section.

Serve stale

Recursive resolvers cache the records they receive from authoritative nameservers for the duration of each record's TTL (Time-to-Live). While a record is cached, the resolver serves it directly without going back to the authoritative nameserver. When the TTL expires, the resolver fetches a fresh copy and re-caches it.

During the outage, freshly requested records ended up resolving to SERVFAIL. The DNSSEC signatures were broken and the resolver correctly rejected them. But many .de records were still sitting in cache from before the incident began. Rather than immediately discarding those and returning SERVFAIL to users, 1.1.1.1 continued serving them past their TTL. This is called “serving stale.”

1.1.1.1 implements RFC 8767, which formalizes this behavior. When upstream resolution fails, a resolver may continue serving expired cached records rather than returning an error. This significantly cushions the impact of an upstream outage, buying time for operators to respond.

The result is visible in the graph below, which shows response codes for .de queries during the incident excluding the stale-served responses. Without stale-served responses, the NOERROR rate drops steadily from 19:30 onward. These represent queries that users received good answers for only because their record was still in cache.

Our mitigation

While the issue was largely out of our own control, and serve stale was doing its job, there was still a legitimate impact for a lot of users. Luckily, there were some actions we were able to take to improve the situation.

Negative Trust Anchors

RFC 7646 defines the concept of a Negative Trust Anchor (NTA). In normal DNSSEC operation, a validating resolver maintains a set of trust anchors: public keys at the root of the chain of trust. Each DNS zone signed with DNSSEC has a trust anchor, and every child zone builds its own trust anchor upon it. When the cryptographic signatures linking the chain together are broken, responses will be rejected and result in SERVFAIL. An NTA is an explicit exception. It tells the resolver to treat a specific zone as if it were unsigned, bypassing validation for names under that zone.

NTAs exist precisely for these types of incidents. When a TLD operator publishes broken signatures, every DNSSEC-validating resolver is forced to return SERVFAIL for every domain under that TLD. Not because of anything wrong with those domains themselves, but because their parent zone is misconfigured. Continuing to return SERVFAIL in that situation provides no security value: the failure is already known, public, and being fixed. RFC 7646 explicitly names TLD misconfiguration as the primary use case for NTAs.

What we actually deployed

For 1.1.1.1 we have our own resolver referred to as Big Pineapple, which also powers 1.1.1.1 for Families, Gateway DNS, DNS Firewall, and more. At this time, we have not implemented a native NTA mechanism. Instead, we used an existing override rule mechanism to mark .de as an insecure zone, which causes all .de queries to be resolved as if they don’t have DNSSEC enabled. This is functionality equivalent to an NTA, though it is not formally defined in any RFC.

The decision to bypass DNSSEC is a deliberate tradeoff. Without DNSSEC validation, .de domains become vulnerable to genuine attacks for the duration of the incident. During incidents like this, we weighed this as acceptable because the signing failure was widespread, publicly confirmed, and affected every validating resolver on the Internet equally. As it was put in our internal incident room: “There is no user of 1.1.1.1 resolving a .de name right now who would prefer a SERVFAIL over an unvalidated response.”

We rolled out our mitigation at 22:17 UTC, which marked the end of impact for 1.1.1.1. We communicated this with fellow DNS operators in the DNS-OARC Mattermost.

Origin resolution mitigations

While all Internet users can access our 1.1.1.1 resolver, we have a particular responsibility to customers using our CDN platform services. Those with .de origin names were also affected by this outage.

Cloudflare operates a separate internal resolver for origin resolution, distinct from our publicly available 1.1.1.1 service. To mitigate impact we applied a similar NTA for .de on the internal resolver service, restoring origin connectivity for affected customers.

Extended DNS Errors

Before our mitigation, queries that couldn't be served from cache received a SERVFAIL response from 1.1.1.1. Each SERVFAIL included an Extended DNS Error (EDE) code, defined in RFC 8914, which gives clients more detail about what went wrong.

Some resolvers returned EDE 6 (DNSSEC Bogus) with a descriptive message pointing directly at the broken signature. This is the correct behavior:

EDE: 6 (DNSSEC Bogus): RRSIG with malformed signature found for example.de/nsec3 (keytag=33834)

1.1.1.1, on the other hand, returned EDE 22 (No Reachable Authority), which on the surface suggests a connectivity problem with the upstream nameservers rather than a DNSSEC validation failure.

The cause is a bug in how we propagate DNSSEC EDE codes up from our trust chain verifier. When the verifier detects a bogus signature it creates the DNSSEC Bogus EDE code, but this is never inserted into the response. Instead, the outer layer of the resolver sees a problem with recursive resolution with no error code and falls back to reporting “No Reachable Authority.” This obscures the underlying DNSSEC cause.

We're aware that this isn't helpful for 1.1.1.1 users and will be fixing our responses to surface the DNSSEC errors.

Is this a failure of DNSSEC as a technology?

DNS is a critical part of the request chain for most Internet communication. It would be easy to come to the conclusion that this outage and the mitigations applied means DNSSEC has failed as a technology. However, any technology that is misconfigured will risk breaking for users that rely on it. Leaving critical fiber cables exposed on the seabed for sharks to chew on does not invalidate the important role underwater cables pose in today's Internet communications. It only highlights that we’ve sometimes failed to accurately protect it. The same applies here. DNSSEC serves a critical role in ensuring that we can rely on the DNS answers without tampering by malicious actors.

#HugOps

No one likes to have serious incidents. These things, unfortunately, happen to everyone who operates critical infrastructure at scale. When they do, the DNS community tends to show up for each other.

Incidents like this also highlight why relationships between operators matter. DNS is a decentralized system, no single organization controls all of it, and keeping it running reliably depends on mutual trust and open lines of communication between registries, resolver operators, and the broader community. Forums like DNS-OARC provide exactly this: shared mailing lists and chat rooms where operators can coordinate quickly across organizational boundaries when something goes wrong.

DENIC has published a short blog post about the incident where they state: “The outage is linked to a routine, scheduled key rollover. During this process, non-validatable signatures were generated and distributed. As a precautionary measure, future rollovers have been suspended until the exact technical causes have been identified.”

We're sure we’ll hear more when their own analysis is ready.

Takeaways from this incident

This incident highlights a structural reality of the DNS hierarchy: when a registry at the TLD level fails, every domain under that TLD is affected simultaneously, regardless of where it's hosted or which resolver is used. This isn't unique to DNSSEC; the same is true if a TLD’s nameservers become unreachable. The hierarchy that makes the global DNS work is also what makes failures at the top propagate downward.

There is no simple fix for this. What the industry can do is respond quickly and consistently when it happens. In this incident, resolver operators across the Internet independently applied Negative Trust Anchors within an hour, restoring resolution while DENIC worked to fix the zone. Operational practices, industry communication channels like DNS-OARC, and features like serve stale all reduce the impact, even if they can’t eliminate the underlying dependency.

We also came away with some points to improve for ourselves. We will be working on our EDE errors to better surface DNSSEC errors.

We look forward to DENIC’s post-incident report and appreciate the transparency they showed throughout.

If you want to learn more about how DNSSEC works, visit our page How does DNSSEC work? And you can always follow real-time DNS trends and TLD data on Cloudflare Radar.

Making Rust Workers reliable: panic and abort recovery in wasm‑bindgen

Guy Bedford — Wed, 22 Apr 2026 13:00:00 GMT

Rust Workers run on the Cloudflare Workers platform by compiling Rust to WebAssembly, but as we’ve found, WebAssembly has some sharp edges. When things go wrong with a panic or an unexpected abort, the runtime can be left in an undefined state. For users of Rust Workers, panics were historically fatal, poisoning the instance and possibly even bricking the Worker for a period of time.

While we were able to detect and mitigate these issues, there remained a small chance that a Rust Worker would unexpectedly fail and cause other requests to fail along with it. An unhandled Rust abort in a Worker affecting one request might escalate into a broader failure affecting sibling requests or even continue to affect new incoming requests. The root cause of this was in wasm-bindgen, the core project that generates the Rust-to-JavaScript bindings Rust Workers depend on, and its lack of built-in recovery semantics.

In this post, we’ll share how the latest version of Rust Workers handles comprehensive Wasm error recovery that solves this abort-induced sandbox poisoning. This work has been contributed back into wasm-bindgen as part of our collaboration within the wasm-bindgen organization formed last year. First with panic=unwind support, which ensures that a single failed request never poisons other requests, and then with abort recovery mechanisms that guarantee Rust code on Wasm can never re-execute after an abort.

Initial recovery mitigations

Our initial attempts to address reliability in this area focused on understanding and containing failures caused by Rust panics and aborts in production Rust Workers. We introduced a custom Rust panic handler that tracked failure state within a Worker and triggered full application reinitialization before handling subsequent requests. On the JavaScript side, this required wrapping the Rust-JavaScript call boundary using Proxy‑based indirection to ensure that all entrypoints were consistently encapsulated. We also made targeted modifications to the generated bindings to correctly reinitialize the WebAssembly module after a failure.

While this approach relied on custom JavaScript logic, it demonstrated that reliable recovery was achievable and eliminated the persistent failure modes we were seeing in practice. This solution was shipped by default to all workers‑rs users starting in version 0.6, and it laid the groundwork for the more general, upstreamed abort recovery mechanisms described in the sections that follow.

Implementing `panic=unwind` with WebAssembly Exception Handling

The abort recovery mechanisms described above ensure that a Worker can survive a failure, but they do so by reinitializing the entire application. For stateless request handlers, this is fine. But for workloads that hold meaningful state in memory, such as Durable Objects, reinitialization means losing that state entirely. A single panic in one request could wipe the in-memory state being used by other concurrent requests.

In most native Rust environments, panics can be unwound, allowing destructors to run and the program to recover without losing state. In WebAssembly, things historically looked very different. Rust compiled to Wasm via wasm32-unknown-unknown defaults to panic=abort, so a panic inside a Rust Worker would abruptly trap with an unreachable instruction and exit Wasm back to JS with a WebAssembly.RuntimeError.

To recover from panics without discarding instance state, we needed panic=unwind support for wasm32-unknown-unknown in wasm-bindgen, made possible by the WebAssembly Exception Handling proposal, which gained wide engine support in 2023.

We start by compiling with RUSTFLAGS='-Cpanic=unwind' cargo build -Zbuild-std, which rebuilds the standard library with unwind support and generates code with proper panic unwinding. For example:

struct HasDropA;
struct HasDropB;
extern "C" {
    fn imported_func();
}

fn some_func() {
    let a = HasDropA;
    let b = HasDropB;
    imported_func();
}

compiles to WebAssembly as:

try
  call 
catch_all
  call 
  call 
  rethrow
end
call 
call

This ensures that even if imported_func() panics, destructors still run. Similarly, std::panic::catch_unwind(|| some_func()) compiles into:

try
  call 
  ;; set result to Ok(return value)
catch
  try
    call 
    ;; set result to Err(panic payload)
  catch_all
    call 
    unreachable
  end
end

Getting this to work end-to-end required several changes to the wasm-bindgen toolchain. The WebAssembly parser Walrus did not know how to handle try/catch instructions, so we added support for them. The descriptor interpreter also needed to be taught how to evaluate code containing exception handling blocks. At that point, the full application could be built with panic=unwind.

The final step was modifying the exports generated by wasm-bindgen to catch panics at the Rust-JavaScript boundary and surface them as JavaScript PanicError exceptions. One subtlety: Rust will catch foreign exceptions and abort when unwinding through extern "C" functions, so exports needed to be marked extern "C-unwind" to explicitly allow unwinding across the boundary. For futures, a panic rejects the JavaScript Promise with a PanicError.

Closures required special attention to ensure unwind safety was properly checked, via a new MaybeUnwindSafe trait that checks UnwindSafe only when built with panic=unwind. This quickly exposed a problem, though: many closures capture references that remain after an unwind, making them inherently unwind-unsafe. To avoid a situation where users are encouraged to incorrectly wrap closures in AssertUnwindSafe just to satisfy the compiler, we added Closure::new_aborting variants, which terminate on panic instead of unwinding in cases where unwind safety can't be guaranteed.

With panic unwinding enabled:

Panics in exported Rust functions are caught by wasm-bindgen
Panics surface to JavaScript as PanicError exceptions
Async exports reject their returned promises with a PanicError
Rust destructors run correctly
The WebAssembly instance remains valid and reusable

The full details of the approach and how to use it in wasm-bindgen are covered in the latest guide page for Wasm Bindgen: Catching Panics.

Abort recovery

Even with panic=unwind support, aborts still happen - out-of-memory errors being one common cause. Because aborts can’t unwind, there is no possibility of state recovery at all, but we can at least detect and recover from aborts for future operations to avoid invalid state erroring subsequent requests.

Panic unwind support introduced a new problem for abort recovery. When we receive an error from Wasm we don’t know if it came from an extern “C-unwind” foreign error, or if it was a genuine abort. Aborts can take many shapes in WebAssembly.

We had two options to solve this technically: either mark all errors which are definitely aborts, or mark all errors which are definitely unwinds. Either could have worked but we chose the latter. Since our foreign exception handling was directly using raw WAT-level (WebAssembly text format) Exception Handling instructions already, we found it easier to implement exception tags for foreign exceptions to distinguish them from aborting non-unwind-safe exceptions.

With the ability to clearly distinguish between recoverable and non-recoverable errors thanks to this Exception.Tag feature in WebAssembly Exception Handling, we were able to then integrate both a new abort handler as well as abort reentrancy guards. A new abort hook, set_on_abort, can be used at initialization time to attach a handler that recovers accordingly for the platform embedding’s needs.

Hardening panic and abort handling is critical to avoiding invalid execution state. WebAssembly allows deeply interleaved call stacks, where Wasm can call into JavaScript and JavaScript can re-enter Wasm at arbitrary depths, while alongside this, multiple tasks can be functioning in the same instance. Previously, an abort occurring in one task or nested stack was not guaranteed to invalidate higher stacks through JS, leading to undefined behavior. Care was required to ensure we can guarantee the execution model, and contribution in this space remains ongoing.

While aborts are never ideal, and reinitialization on failure is an absolute worst-case scenario, implementing critical error recovery as the last line of defense ensures execution correctness and that future operations will be able to succeed. The invalid state does not persist, ensuring a single failure does not cascade into multiple failures.

Extension: abort reinitialization for wasm-bindgen libraries

While we were working on this, we realized that this is a common problem for libraries used by JS that are built with wasm-bindgen, and that they would also benefit from attaching an abort handler to be able to perform recovery.

But when building Wasm as an ES module and importing it directly (e.g. via import { func } from ‘wasm-dep’), it’s not clear what the recovery mechanism would be for a Wasm abort while calling func() for an already-linked and initialized library that is in a user JS application.

While not strictly a Rust Workers use case, our team also supports JS-based Workers users who run Rust-backed Wasm library dependencies. If we could fix this problem at the same time, that could indirectly also benefit Wasm usage on the Cloudflare Workers platform.

To support automatic abort recovery for Wasm library use cases, we added support for an experimental reinitialization mechanism into wasm‑bindgen, --reset-state-function. This exposes a function that allows the Rust application to effectively request that it reset its internal Wasm instance back to its initial state for the next call, without requiring consumers of the generated bindings to reimport or recreate them. Class instances from the old instance will throw as their handles become orphaned, but new classes can then be constructed. The JS application using a Wasm library is errored but not bricked.

The full technical details of this feature and how to use it in wasm-bindgen are covered in the new wasm-bindgen guide section Wasm Bindgen: Handling Aborts.

Maturing the Rust Wasm Exception Handling ecosystem

Upstream contributions for this work did not stop at the wasm-bindgen project. Building for Wasm with panic=unwind still requires an experimental nightly Rust target, so we’ve also been working to advance Rust’s Wasm support for WebAssembly Exception Handling to help bring this to stable Rust.

During the development of WebAssembly Exception Handling, a late‑stage specification change resulted in two variants: legacy exception handling and the final modern exception handling "with exnref". Today, Rust’s WebAssembly targets still default to emitting code for the legacy variant. While legacy exception handling is widely supported, it is now deprecated.

Modern WebAssembly Exception Handling is supported as of the following JS platform releases:

Runtime	Version	Release Date
v8	13.8.1	April 28, 2025
workerd	v1.20250620.0	June 19, 2025
Chrome	138	June 28, 2025
Firefox	131	October 1, 2024
Safari	18.4	March 31, 2025
Node.js	25.0.0	October 15, 2025

As we were investigating the support matrix, the largest concern ended up being the Node.js 24 LTS release schedule, which would have left the entire ecosystem stuck on legacy WebAssembly Exception Handling until April 2028.

Having discovered this discrepancy, we were able to backport modern exception handling to the Node.js 24 release, and even backport the fixes needed to make it work on the Node.js 22 release line to ensure support for this target. This should allow the modern Exception Handling proposal to become the default target next year.

Over the coming months, we’ll be working to make the transition to stable panic=unwind and modern Exception Handling as invisible as possible to end users.

While these long‑term investments in the ecosystem take time, they help build a stronger foundation for the Rust WebAssembly community as a whole, and we’re glad to be able to contribute to these improvements.

Using panic unwind in Rust Workers

As of version 0.8.0 of Rust Workers, we have a new --panic-unwind flag, which can be added to the build command, following the instructions here.

With this flag, panics can be fully recovered, and abort recovery will use the new abort classification and recovery hook mechanism. We highly recommend upgrading and trying it out for a more stable Rust Workers experience, and plan to make panic=unwind the default in a subsequent release. Users remaining on panic=abort will still continue to take advantage of the previous custom recovery wrapper handling from 0.6.0.

Committing to Rust Workers stability

This work is part of our ongoing effort towards a stable release for Rust Workers. By solving these sharp edges of the Wasm platform foundations at their root, and contributing back to the ecosystem where it makes sense, we build stronger foundations not just for our platform, but the entire Rust, JS, and Wasm ecosystem.

We have a number of future improvements planned for Rust Workers, and we’ll soon be sharing updates on this additional work, including wasm-bindgen generics and automated bindgen, which Guy Bedford from our team previewed in a talk on Rust & JS Interoperability at Wasm.io last month.

Find us in #rust‑on‑workers on the Cloudflare Discord. We also welcome feedback and discussion and especially all new contributors to the workers-rs and wasm-bindgen GitHub projects.

How Workers powers our internal maintenance scheduling pipeline

Kevin Deems — Mon, 22 Dec 2025 14:00:00 GMT

Cloudflare has data centers in over 330 cities globally, so you might think we could easily disrupt a few at any time without users noticing when we plan data center operations. However, the reality is that disruptive maintenance requires careful planning, and as Cloudflare grew, managing these complexities through manual coordination between our infrastructure and network operations specialists became nearly impossible.

It is no longer feasible for a human to track every overlapping maintenance request or account for every customer-specific routing rule in real time. We reached a point where manual oversight alone couldn't guarantee that a routine hardware update in one part of the world wouldn't inadvertently conflict with a critical path in another.

We realized we needed a centralized, automated "brain" to act as a safeguard — a system that could see the entire state of our network at once. By building this scheduler on Cloudflare Workers, we created a way to programmatically enforce safety constraints, ensuring that no matter how fast we move, we never sacrifice the reliability of the services on which our customers depend.

In this blog post, we’ll explain how we built it, and share the results we’re seeing now.

Building a system to de-risk critical maintenance operations

Picture an edge router that acts as one of a small, redundant group of gateways that collectively connect the public Internet to the many Cloudflare data centers operating in a metro area. In a populated city, we need to ensure that the multiple data centers sitting behind this small cluster of routers do not get cut off because the routers were all taken offline simultaneously.

Another maintenance challenge comes from our Zero Trust product, Dedicated CDN Egress IPs, which allows customers to choose specific data centers from which their user traffic will exit Cloudflare and be sent to their geographically close origin servers for low latency. (For the purpose of brevity in this post, we'll refer to the Dedicated CDN Egress IPs product as "Aegis," which was its former name.) If all the data centers a customer chose are offline at once, they would see higher latency and possibly 5xx errors, which we must avoid.

Our maintenance scheduler solves problems like these. We can make sure that we always have at least one edge router active in a certain area. And when scheduling maintenance, we can see if the combination of multiple scheduled events would cause all the data centers for a customer’s Aegis pools to be offline at the same time.

Before we created the scheduler, these simultaneous disruptive events could cause downtime for customers. Now, our scheduler notifies internal operators of potential conflicts, allowing us to propose a new time to avoid overlapping with other related data center maintenance events.

We define these operational scenarios, such as edge router availability and customer rules, as maintenance constraints which allow us to plan more predictable and safe maintenance.

Maintenance constraints

Every constraint starts with a set of proposed maintenance items, such as a network router or list of servers. We then find all the maintenance events in the calendar that overlap with the proposed maintenance time window.

Next, we aggregate product APIs, such as a list of Aegis customer IP pools. Aegis returns a set of IP ranges where a customer requested egress out of specific data center IDs, shown below.

[
    {
      "cidr": "104.28.0.32/32",
      "pool_name": "customer-9876",
      "port_slots": [
        {
          "dc_id": 21,
          "other_colos_enabled": true,
        },
        {
          "dc_id": 45,
          "other_colos_enabled": true,
        }
      ],
      "modified_at": "2023-10-22T13:32:47.213767Z"
    },
]

In this scenario, data center 21 and data center 45 relate to each other because we need at least one data center online for the Aegis customer 9876 to receive egress traffic from Cloudflare. If we tried to take data centers 21 and 45 down simultaneously, our coordinator would alert us that there would be unintended consequences for that customer workload.

We initially had a naive solution to load all data into a single Worker. This included all server relationships, product configurations, and metrics for product and infrastructure health to compute constraints. Even in our proof of concept phase, we ran into problems with “out of memory” errors.

We needed to be more cognizant of Workers’ platform limits. This required loading only as much data as was absolutely necessary to process the constraint’s business logic. If a maintenance request for a router in Frankfurt, Germany, comes in, we almost certainly do not care what is happening in Australia since there is no overlap across regions. Thus, we should only load data for neighboring data centers in Germany. We needed a more efficient way to process relationships in our dataset.

Graph processing on Workers

As we looked at our constraints, a pattern emerged where each constraint boiled down to two concepts: objects and associations. In graph theory, these components are known as vertices and edges, respectively. An object could be a network router and an association could be the list of Aegis pools in the data center that requires the router to be online. We took inspiration from Facebook’s TAO research paper to establish a graph interface on top of our product and infrastructure data. The API looks like the following:

type ObjectID = string

interface MainTAOInterface {
  object_get(id: ObjectID): Promise

  assoc_get(id1: ObjectID, atype: TAssocType): AsyncIterable
}

The core insight is that associations are typed. For example, a constraint would call the graph interface to retrieve Aegis product data.

async function constraint(c: AppContext, aegis: TAOAegisClient, datacenters: string[]): Promise> {
  const datacenterEntries = await Promise.all(
    datacenters.map(async (dcID) => {
      const iter = aegis.assoc_get(c, dcID, AegisAssocType.DATACENTER_INSIDE_AEGIS_POOL)
      const pools: string[] = []
      for await (const assoc of iter) {
        pools.push(assoc.id2)
      }
      return [dcID, pools] as const
    }),
  )

  const datacenterToPools = new Map(datacenterEntries)
  const uniquePools = new Set()
  for (const pools of datacenterToPools.values()) {
    for (const pool of pools) uniquePools.add(pool)
  }

  const poolTotalsEntries = await Promise.all(
    [...uniquePools].map(async (pool) => {
      const total = aegis.assoc_count(c, pool, AegisAssocType.AEGIS_POOL_CONTAINS_DATACENTER)
      return [pool, total] as const
    }),
  )

  const poolTotals = new Map(poolTotalsEntries)
  const poolAnalysis: Record = {}
  for (const [dcID, pools] of datacenterToPools.entries()) {
    for (const pool of pools) {
      poolAnalysis[pool] = {
        affectedDatacenters: new Set([dcID]),
        totalDatacenters: poolTotals.get(pool),
      }
    }
  }

  return poolAnalysis
}

We use two association types in the code above:

DATACENTER_INSIDE_AEGIS_POOL, which retrieves the Aegis customer pools that a data center resides in.
AEGIS_POOL_CONTAINS_DATACENTER, which retrieves the data centers an Aegis pool needs to serve traffic.

The associations are inverted indices of one another. The access pattern is exactly the same as before, but now the graph implementation has much more control of how much data it queries. Before, we needed to load all Aegis pools into memory and filter inside constraint business logic. Now, we can directly fetch only the data that matters to the application.

The interface is powerful because our graph implementation can improve performance behind the scenes without complicating the business logic. This lets us use the scalability of Workers and Cloudflare’s CDN to fetch data from our internal systems very quickly.

Fetch pipeline

We switched to using the new graph implementation, sending more targeted API requests. Response sizes dropped by 100x overnight, switching from loading a few massive requests to many tiny requests.

While this solves the issue of loading too much into memory, we now have a subrequest problem because instead of a few large HTTP requests, we make an order of magnitude more of small requests. Overnight, we started consistently breaching subrequest limits.

In order to solve this problem, we built a smart middleware layer between our graph implementation and the fetch API.

export const fetchPipeline = new FetchPipeline()
  .use(requestDeduplicator())
  .use(lruCacher({
    maxItems: 100,
  }))
  .use(cdnCacher())
  .use(backoffRetryer({
    retries: 3,
    baseMs: 100,
    jitter: true,
  }))
  .handler(terminalFetch);

If you’re familiar with Go, you may have seen the singleflight package before. We took inspiration from this idea and the first middleware component in the fetch pipeline deduplicates inflight HTTP requests, so they all wait on the same Promise for data instead of producing duplicate requests in the same Worker. Next, we use a lightweight Least Recently Used (LRU) cache to internally cache requests that we have already seen before.

Once both of those are complete, we use Cloudflare’s caches.default.match function to cache all GET requests in the region that the Worker is running. Since we have multiple data sources with different performance characteristics, we choose time to live (TTL) values carefully. For example, real-time data is only cached for 1 minute. Relatively static infrastructure data could be cached for 1–24 hours depending on the type of data. Power management data might be changed manually and infrequently, so we can cache it for longer at the edge.

In addition to those layers, we have the standard exponential backoff, retries and jitter. This helps reduce wasted fetch calls where a downstream resource might be unavailable temporarily. By backing off slightly, we increase the chance that we fetch the next request successfully. Conversely, if the Worker sends requests constantly without backoff, it will easily breach the subrequest limit when the origin starts returning 5xx errors.

Putting it all together, we saw ~99% cache hit rate. Cache hit rate is the percentage of HTTP requests served from Cloudflare’s fast cache memory (a "hit") versus slower requests to data sources running in our control plane (a "miss"), calculated as (hits / (hits + misses)). A high rate means better HTTP request performance and lower costs because querying data from cache in our Worker is an order of magnitude faster than fetching from an origin server in a different region. After tuning settings, for our in memory and CDN caches, hit rates have increased dramatically. Since much of our workload is real-time, we will never have a 100% hit rate as we must request fresh data at least once per minute.

We have talked about improving the fetching layer, but not about how we made origin HTTP requests faster. Our maintenance coordinator needs to react in real-time to network degradation and failure of machines in data centers. We use our distributed Prometheus query engine, Thanos, to deliver performant metrics from the edge into the coordinator.

Thanos in real-time

To explain how our choice in using the graph processing interface affected our real-time queries, let’s walk through an example. In order to analyze the health of edge routers, we could send the following query:

sum by (instance) (network_snmp_interface_admin_status{instance=~"edge.*"})

Originally, we asked our Thanos service, which stores Prometheus metrics, for a list of each edge router’s current health status and would manually filter for routers relevant to the maintenance inside the Worker. This is suboptimal for many reasons. For example, Thanos returned multi-MB responses which it needed to decode and encode. The Worker also needed to cache and decode these large HTTP responses only to filter out the majority of the data while processing a specific maintenance request. Since TypeScript is single-threaded and parsing JSON data is CPU-bound, sending two large HTTP requests means that one is blocked waiting for the other to finish parsing.

Instead, we simply use the graph to find targeted relationships such as the interface links between edge and spine routers, denoted as EDGE_ROUTER_NETWORK_CONNECTS_TO_SPINE.

sum by (lldp_name) (network_snmp_interface_admin_status{instance=~"edge01.fra03", lldp_name=~"spine.*"})

The result is 1 Kb on average instead of multiple MBs, or approximately 1000x smaller. This also massively reduces the amount of CPU required inside the Worker because we offload most of the deserialization to Thanos. As we explained before, this means we need to make a higher number of these smaller fetch requests, but load balancers in front of Thanos can spread the requests evenly to increase throughput for this use case.

Our graph implementation and fetch pipeline successfully tamed the 'thundering herd' of thousands of tiny real-time requests. However, historical analysis presents a different I/O challenge. Instead of fetching small, specific relationships, we need to scan months of data to find conflicting maintenance windows. In the past, Thanos would issue a massive amount of random reads to our object store, R2. To solve this massive bandwidth penalty without losing performance, we adopted a new approach the Observability team developed internally this year.

Historical data analysis

There are enough maintenance use cases that we must rely on historical data to tell us if our solution is accurate and will scale with the growth of Cloudflare’s network. We do not want to cause incidents, and we also want to avoid blocking proposed physical maintenance unnecessarily. In order to balance these two priorities, we can use time series data about maintenance events that happened two months or even a year ago to tell us how often a maintenance event is violating one of our constraints, e.g. edge router availability or Aegis. We blogged earlier this year about using Thanos to automatically release and revert software to the edge.

Thanos primarily fans out to Prometheus, but when Prometheus' retention is not enough to answer the query it has to download data from object storage — R2 in our case. Prometheus TSDB blocks were originally designed for local SSDs, relying on random access patterns that become a bottleneck when moved to object storage. When our scheduler needs to analyze months of historical maintenance data to identify conflicting constraints, random reads from object storage incur a massive I/O penalty. To solve this, we implemented a conversion layer that transforms these blocks into Apache Parquet files. Parquet is a columnar format native to big data analytics that organizes data by column rather than row, which — together with rich statistics — allows us to only fetch what we need.

Furthermore, since we are rewriting TSDB blocks into Parquet files, we can also store the data in a way that allows us to read the data in just a few big sequential chunks.

sum by (instance) (hmd:release_scopes:enabled{dc_id="45"})

In the example above we would choose the tuple “(__name__, dc_id)” as a primary sorting key so that metrics with the name “hmd:release_scopes:enabled” and the same value for “dc_id” get sorted close together.

Our Parquet gateway now issues precise R2 range requests to fetch only the specific columns relevant to the query. This reduces the payload from megabytes to kilobytes. Furthermore, because these file segments are immutable, we can aggressively cache them on the Cloudflare CDN.

This turns R2 into a low-latency query engine, allowing us to backtest complex maintenance scenarios against long-term trends instantly, avoiding the timeouts and high tail latency we saw with the original TSDB format. The graph below shows a recent load test, where Parquet reached up to 15x the P90 performance compared to the old system for the same query pattern.

To get a deeper understanding of how the Parquet implementation works, you can watch this talk at PromCon EU 2025, Beyond TSDB: Unlocking Prometheus with Parquet for Modern Scale.

Building for scale

By leveraging Cloudflare Workers, we moved from a system that ran out of memory to one that intelligently caches data and uses efficient observability tooling to analyze product and infrastructure data in real time. We built a maintenance scheduler that balances network growth with product performance.

But “balance” is a moving target.

Every day, we add more hardware around the world, and the logic required to maintain it without disrupting customer traffic gets exponentially harder with more products and types of maintenance operations. We’ve worked through the first set of challenges, but now we’re staring down more subtle, complex ones that only appear at this massive scale.

We need engineers who aren't afraid of hard problems. Join our Infrastructure team and come build with us.

Vulnerability transparency: strengthening security through responsible disclosure

Sri Pulla — Fri, 16 May 2025 15:00:00 GMT

In an era where digital threats evolve faster than ever, cybersecurity isn't just a back-office concern — it's a critical business priority. At Cloudflare, we understand the responsibility that comes with operating in a connected world. As part of our ongoing commitment to security and transparency, Cloudflare is proud to have joined the United States Cybersecurity and Infrastructure Security Agency’s (CISA) “Secure by Design” pledge in May 2024.

By signing this pledge, Cloudflare joins a growing coalition of companies committed to strengthening the resilience of the digital ecosystem. This isn’t just symbolic — it's a concrete step in aligning with cybersecurity best practices and our commitment to protect our customers, partners, and data.

A central goal in CISA’s Secure by Design pledge is promoting transparency in vulnerability reporting. This initiative underscores the importance of proactive security practices and emphasizes transparency in vulnerability management — values that are deeply embedded in Cloudflare’s Product Security program. We believe that openness around vulnerabilities is foundational to earning and maintaining the trust of our customers, partners, and the broader security community.

Why transparency in vulnerability reporting matters

Transparency in vulnerability reporting is essential for building trust between companies and customers. In 2008, Linus Torvalds noted that disclosure is inherently tied to resolution: “So as far as I'm concerned, disclosing is the fixing of the bug”, emphasizing that resolution must start with visibility. While this mindset might apply well to open-source projects and communities familiar with code and patches, it doesn’t scale easily to non-expert users and enterprise users who require structured, validated, and clearly communicated disclosures regarding a vulnerability’s impact. Today’s threat landscape demands not only rapid remediation of vulnerabilities but also clear disclosure of their nature, impact and resolution. This builds trust with the customer and contributes to the broader collective understanding of common vulnerability classes and emerging systemic flaws.

What is a CVE?

Common Vulnerabilities and Exposures (CVE) is a catalog of publicly disclosed vulnerabilities and exposures. Each CVE includes a unique identifier, summary, associated metadata like the Common Weakness Enumeration (CWE) and Common Platform Enumeration (CPE), and a severity score that can range from None to Critical.

The format of a CVE ID consists of a fixed prefix, the year of the disclosure and an arbitrary sequence number like CVE-2017-0144. Memorable names such as "EternalBlue" (CVE-2017-0144) are often associated with high-profile exploits to enhance recall.

What is a CNA?

As an authorized CVE Numbering Authority (CNA), Cloudflare can assign CVE identifiers for vulnerabilities discovered within our products and ecosystems. Cloudflare has been actively involved with MITRE's CVE program since its founding in 2009. As a CNA, Cloudflare assumes the responsibility to manage disclosure timelines ensuring they are accurate, complete, and valuable to the broader industry.

Cloudflare CVE issuance process

Cloudflare issues CVEs for vulnerabilities discovered internally and through our Bug Bounty program when they affect open source software and/or our distributed closed source products.

The findings are triaged based on real-world exploitability and impact. Vulnerabilities without a plausible exploitation path, in addition to findings related to test repositories or exposed credentials like API keys, typically do not qualify for CVE issuance.

We recognize that CVE issuance involves nuance, particularly for sophisticated security issues in a complex codebase (for example, the Linux kernel). Issuance relies on impact to users and the likelihood of the exploit, which depends on the complexity of executing an attack. The growing number of CVEs issued industry-wide reflects a broader effort to balance theoretical vulnerabilities against real-world risk.

In scenarios where Cloudflare was impacted by a vulnerability, but the root cause was within another CNA’s scope of products, Cloudflare will not assign the CVE. Instead, Cloudflare may choose other mediums of disclosure, like blog posts.

How does Cloudflare disclose a CVE?

Our disclosure process begins with internal evaluation of severity and scope, and any potential privacy or compliance impacts. When necessary, we engage our Legal and Security Incident Response Teams (SIRT). For vulnerabilities reported to Cloudflare by external entities via our Bug Bounty program, our standard disclosure timeline is within 90 days. This timeline allows us to ensure proper remediation, thorough testing, and responsible coordination with affected parties. While we are committed to transparent disclosure, we believe addressing and validating fixes before public release is essential to protect users and uphold system security. For open source projects, we also issue security advisories on the relevant GitHub repositories. Additionally, we encourage external researchers to publish/blog about their findings after issues are remediated. Full details and process of Cloudflare’s external researcher/entity disclosure policy can be found via our Bug Bounty program policy page

Outcomes

To date, Cloudflare has issued and disclosed multiple CVEs. Because of the security platforms and products that Cloudflare builds, vulnerabilities have primarily been in the areas of denial of service, local privilege escalation, logical flaws, and improper input validation. Cloudflare also believes in collaboration and open sources of some of our software stack, therefore CVEs in these repositories are also promptly disclosed.

Cloudflare disclosures can be found here. Below are some of the most notable vulnerabilities disclosed by Cloudflare:

CVE-2024-1765: quiche: Memory Exhaustion Attack using post-handshake CRYPTO frames

Cloudflare quiche (through version 0.19.1/0.20.0) was affected by an unlimited resource allocation vulnerability causing rapid increase of memory usage of the system running a quiche server or client.

A remote attacker could take advantage of this vulnerability by repeatedly sending an unlimited number of 1-RTT CRYPTO frames after previously completing the QUIC handshake.

Exploitation was possible for the duration of the connection, which could be extended by the attacker.

quiche 0.19.2 and 0.20.1 are the earliest versions containing the fix for this issue.

CVE-2024-0212: Cloudflare WordPress plugin enables information disclosure of Cloudflare API (for low-privilege users)

The Cloudflare WordPress plugin was found to be vulnerable to improper authentication. The vulnerability enables attackers with a lower privileged account to access data from the Cloudflare API.

The issue has been fixed in version >= 4.12.3 of the plugin

CVE-2023-2754 - Plaintext transmission of DNS requests in Windows 1.1.1.1 WARP client

The Cloudflare WARP client for Windows assigns loopback IPv4 addresses for the DNS servers, since WARP acts as a local DNS server that performs DNS queries securely. However, if a user is connected to WARP over an IPv6-capable network, the WARP client did not assign loopback IPv6 addresses but rather Unique Local Addresses, which under certain conditions could point towards unknown devices in the same local network, enabling an attacker to view DNS queries made by the device.

This issue was patched in version 2023.7.160.0 of the WARP client (Windows).

CVE-2025-0651 - Improper privilege management allows file manipulations

An improper privilege management vulnerability in Cloudflare WARP for Windows allowed file manipulation by low-privilege users. Specifically, a user with limited system permissions could create symbolic links within the C:\ProgramData\Cloudflare\warp-diag-partials directory. When the "Reset all settings" feature is triggered, the WARP service — running with SYSTEM-level privileges — followed these symlinks and may delete files outside the intended directory, potentially including files owned by the SYSTEM user.

This vulnerability affected versions of WARP prior to 2024.12.492.0.

CVE-2025-23419: TLS client authentication can be bypassed due to ticket resumption (disclosed Cloudflare impact via blog post)

Cloudflare’s mutual TLS implementation caused a vulnerability in the session resumption handling. The underlying issue originated from BoringSSL’s process to resume TLS sessions. BoringSSL stored client certificates, which were reused from the original session (without revalidating the full certificate chain) and the original handshake's verification status was not re-validated.

While Cloudflare was impacted by the vulnerability, the root cause was within NGINX's implementation, making F5 the appropriate CNA to assign the CVE. This is an example of alternate mediums of disclosure that Cloudflare sometimes opt for. This issue was fixed as per guidance from the respective CVE — please see our blog post for more details.

Conclusion

Irrespective of the industry, if your organization builds software, we encourage you to familiarize yourself with CISA’s “Secure by Design” principles and create a plan to implement them in your company. The CISA Secure by Design pledge is built around seven security goals, prioritizing the security of customers, and challenges organizations to think differently about security.

As we continue to enhance our security posture, Cloudflare remains committed to enhancing our internal practices, investing in tooling and automation, and sharing knowledge with the community. CVE transparency is not a one-time initiative — it’s a sustained effort rooted in openness, discipline, and technical excellence. By embedding these values in how we design, build and secure our products, we aim to meet and exceed expectations set out in the CISA pledge and make the Internet more secure, faster and reliable!

For more updates on our CISA progress, review our related blog posts. Cloudflare has delivered five of the seven CISA Secure by Design pledge goals, and we aim to complete the remainder of the pledge goals in May 2025.

Scaling with safety: Cloudflare's approach to global service health metrics and software releases

Harshal Brahmbhatt — Mon, 05 May 2025 14:00:00 GMT

Has your browsing experience ever been disrupted by this error page? Sometimes Cloudflare returns "Error 500" when our servers cannot respond to your web request. This inability to respond could have several potential causes, including problems caused by a bug in one of the services that make up Cloudflare's software stack.

We know that our testing platform will inevitably miss some software bugs, so we built guardrails to gradually and safely release new code before a feature reaches all users. Health Mediated Deployments (HMD) is Cloudflare’s data-driven solution to automating software updates across our global network. HMD works by querying Thanos, a system for storing and scaling Prometheus metrics. Prometheus collects detailed data about the performance of our services, and Thanos makes that data accessible across our distributed network. HMD uses these metrics to determine whether new code should continue to roll out, pause for further evaluation, or be automatically reverted to prevent widespread issues.

Cloudflare engineers configure signals from their service, such as alerting rules or Service Level Objectives (SLOs). For example, the following Service Level Indicator (SLI) checks the rate of HTTP 500 errors over 10 minutes returned from a service in our software stack.

sum(rate(http_request_count{code="500"}[10m])) / sum(rate(http_request_count[10m]))

An SLO is a combination of an SLI and an objective threshold. For example, the service returns 500 errors <0.1% of the time.

If the success rate is unexpectedly decreasing where the new code is running, HMD reverts the change in order to stabilize the system, reacting before humans even know what Cloudflare service was broken. Below, HMD recognizes the degradation in signal in an early release stage and reverts the code back to the prior version to limit the blast radius.

Cloudflare’s network serves millions of requests per second across diverse geographies. How do we know that HMD will react quickly the next time we accidentally release code that contains a bug? HMD performs a testing strategy called backtesting, outside the release process, which uses historical incident data to test how long it would take to react to degrading signals in a future release.

We use Thanos to join thousands of small Prometheus deployments into a single unified query layer while keeping our monitoring reliable and cost-efficient. To backfill historical incident metric data that has fallen out of Prometheus’ retention period, we use our object storage solution, R2.

Today, we store 4.5 billion distinct time series for a year of retention, which results in roughly 8 petabytes of data in 17 million objects distributed all over the globe.

Making it work at scale

To give a sense of scale, we can estimate the impact of a batch of backtests:

Each backtest run is made up of multiple SLOs to evaluate a service's health.
Each SLO is evaluated using multiple queries containing batches of data centers.
Each data center issues anywhere from tens to thousands of requests to R2.

Thus, in aggregate, a batch can translate to hundreds of thousands of PromQL queries and millions of requests to R2. Initially, batch runs would take about 30 hours to complete but through blood, sweat, and tears, we were able to cut this down to 2 hours.

Let’s review how we made this processing more efficient.

Recording rules

HMD slices our fleet of machines across multiple dimensions. For the purposes of this post, let’s refer to them as “tier” and “color”. Given a pair of tier and color, we would use the following PromQL expression to find the machines that make up this combination:

group by (instance, datacenter, tier, color) (
  up{job="node_exporter"}
  * on (datacenter) group_left(tier) datacenter_metadata{tier="tier3"}
  * on (instance) group_left(color) server_metadata{color="green"}
  unless on (instance) (machine_in_maintenance == 1)
  unless on (datacenter) (datacenter_disabled == 1)
)

Most of these series have a cardinality of approximately the number of machines in our fleet. That’s a substantial amount of data we need to fetch from object storage and transmit home for query evaluation, as well as a significant number of series we need to decode and join together.

Since this is a fairly common query that is issued in every HMD run, it makes sense to precompute it. In the Prometheus ecosystem, this is commonly done with recording rules:

hmd:release_scopes:info{tier="tier3", color="green"}

Aside from looking much cleaner, this also reduces the load at query time significantly. Since all the joins involved can only have matches within a data center, it is well-defined to evaluate those rules directly in the Prometheus instances inside the data center itself.

Compared to the original query, the cardinality we need to deal with now scales with the size of the release scope instead of the size of the entire fleet.

This is significantly cheaper and also less likely to be affected by network issues along the way, which in turn reduces the amount that we need to retry the query, on average.

Distributed query processing

HMD and the Thanos Querier, depicted above, are stateless components that can run anywhere, with highly available deployments in North America and Europe. Let us quickly recap what happens when we evaluate the SLI expression from HMD in our introduction:

sum(rate(http_request_count{code="500"}[10m]))
/ 
sum(rate(http_request_count[10m]))

Upon receiving this query from HMD, the Thanos Querier will start requesting raw time series data for the “http_requests_total” metric from its connected Thanos Sidecar and Thanos Store instances all over the world, wait for all the data to be transferred to it, decompress it, and finally compute its result:

While this works, it is not optimal for several reasons. We have to wait for raw data from thousands of data sources all over the world to arrive in one location before we can even start to decompress it, and then we are limited by all the data being processed by one instance. If we double the number of data centers, we also need to double the amount of memory we allocate for query evaluation.

Many SLIs come in the form of simple aggregations, typically to boil down some aspect of the service's health to a number, such as the percentage of errors. As with the aforementioned recording rule, those aggregations are often distributive — we can evaluate them inside the data center and coalesce the sub-aggregations again to arrive at the same result.

To illustrate, if we had a recording rule per data center, we could rewrite our example like this:

sum(datacenter:http_request_count:rate10m{code="500"})
/ 
sum(datacenter:http_request_count:rate10m)

This would solve our problems, because instead of requesting raw time series data for high-cardinality metrics, we would request pre-aggregated query results. Generally, these pre-aggregated results are an order of magnitude less data that needs to be sent over the network and processed into a final result.

However, recording rules come with a steep write-time cost in our architecture, evaluated frequently across thousands of Prometheus instances in production, just to speed up a less frequent ad-hoc batch process. Scaling recording rules alongside our growing set of service health SLIs quickly would be unsustainable. So we had to go back to the drawing board.

It would be great if we could evaluate data center-scoped queries remotely and coalesce their result back again — for arbitrary queries and at runtime. To illustrate, we would like to evaluate our example like this:

(sum(rate(http_requests_total{status="500", datacenter="dc1"}[10m])) + ...)
/
(sum(rate(http_requests_total{datacenter="dc1"}[10m])) + ...)

This is exactly what Thanos’ distributed query engine is capable of doing. Instead of requesting raw time series data, we request data center scoped aggregates and only need to send those back home where they get coalesced back again into the full query result:

Note that we ensure all the expensive data paths are as short as possible by utilizing R2 location hints to specify the primary access region.

To measure the effectiveness of this approach, we used Cloudprober and wrote probes that evaluate the relatively cheap, but still global, query count(node_uname_info).

sum(thanos_cloudprober_latency:rate6h{component="thanos-central"})
/
sum(thanos_cloudprober_latency:rate6h{component="thanos-distributed"})

In the graph below, the y-axis represents the speedup of the distributed execution deployment relative to the centralized deployment. On average, distributed execution responds 3–5 times faster to probes.

Anecdotally, even slightly more complex queries quickly time out or even crash our centralized deployment, but they still can be comfortably computed by the distributed one. For a slightly more expensive query like count(up) for about 17 million scrape jobs, we had difficulty getting the centralized querier to respond and had to scope it to a single region, which took about 42 seconds:

Meanwhile, our distributed queriers were able to return the full result in about 8 seconds:

Congestion control

HMD batch processing leads to spiky load patterns that are hard to provision for. In a perfect world, it would issue a steady and predictable stream of queries. At the same time, HMD batch queries have lower priority to us than the queries that on-call engineers issue to triage production problems. We tackle both of those problems by introducing an adaptive priority-based concurrency control mechanism. After reading Netflix’s work on adaptive concurrency limits, we implemented a similar proxy to dynamically limit batch request flow when Thanos SLOs start to degrade. For example, one such SLO is its cloudprober failure rate over the last minute:

sum(thanos_cloudprober_fail:rate1m)
/
(sum(thanos_cloudprober_success:rate1m) + sum(thanos_cloudprober_fail:rate1m))

We apply jitter, a random delay, to smooth query spikes inside the proxy. Since batch processing prioritizes overall query throughput over individual query latency, jitter helps HMD send a burst of queries, while allowing Thanos to process queries gradually over several minutes. This reduces instantaneous load on Thanos, improving overall throughput, even if individual query latency increases. Meanwhile, HMD encounters fewer errors, minimizing retries and boosting batch efficiency.

Our solution simulates how TCP’s congestion control algorithm, additive increase/multiplicative decrease, works. When the proxy server receives a successful request from Thanos, it allows one more concurrent request through next time. If backpressure signals breach defined thresholds, the proxy limits the congestion window proportional to the failure rate.

As the failure rate increases past the “warn” threshold, approaching the “emergency” threshold, the proxy gets exponentially closer to allowing zero additional requests through the system. However, to prevent bad signals from halting all traffic, we cap the loss with a configured minimum request rate.

Columnar experiments

Because Thanos deals with Prometheus TSDB blocks that were never designed for being read over a slow medium like object storage, it does a lot of random I/O. Inspired by this excellent talk, we started storing our time series data in Parquet files, with some promising preliminary results. This project is still too early to draw any robust conclusions, but we wanted to share our implementation with the Prometheus community, so we are publishing our experimental object storage gateway as parquet-tsdb-poc on GitHub.

Conclusion

We built Health Mediated Deployments (HMD) to enable safe and reliable software releases while pushing the limits of our observability infrastructure. Along the way, we significantly improved Thanos’ ability to handle high-load queries, reducing batch runtimes by 15x.

But this is just the beginning. We’re excited to continue working with the observability, resiliency, and R2 teams to push our infrastructure to its limits — safely and at scale. As we explore new ways to enhance observability, one exciting frontier is optimizing time series storage for object storage.

We’re sharing this work with the community as an open-source proof of concept. If you’re interested in exploring Parquet-based time series storage and its potential for large-scale observability, check out the GitHub project linked above.

Demonstrating reduction of vulnerability classes: a key step in CISA’s “Secure by Design” pledge

Sri Pulla — Tue, 14 Jan 2025 14:00:00 GMT

In today’s rapidly evolving digital landscape, securing software systems has never been more critical. Cyber threats continue to exploit systemic vulnerabilities in widely used technologies, leading to widespread damage and disruption. That said, the United States Cybersecurity and Infrastructure Agency (CISA) helped shape best practices for the technology industry with their Secure-by-Design pledge. Cloudflare signed this pledge on May 8, 2024, reinforcing our commitment to creating resilient systems where security is not just a feature, but a foundational principle.

We’re excited to share an update aligned with one of CISA’s goals in the pledge: To reduce entire classes of vulnerabilities. This goal aligns with the Cloudflare Product Security program’s initiatives to continuously automate proactive detection and vigorously prevent vulnerabilities at scale.

Cloudflare’s commitment to the CISA pledge reflects our dedication to transparency and accountability to our customers. This blog post outlines why we prioritized certain vulnerability classes, the steps we took to further eliminate vulnerabilities, and the measurable outcomes of our work.

The core philosophy that continues: prevent, not patch

Cloudflare’s core security philosophy is to prevent security vulnerabilities from entering production environments. One of the goals for Cloudflare’s Product Security team is to champion this philosophy and ensure secure-by-design approaches are part of product and platform development. Over the last six months, the Product Security team aggressively added both new and customized rulesets aimed at completely eliminating secrets and injection code vulnerabilities. These efforts have enhanced detection precision, reducing false positives, while enabling the proactive detection and blocking of these two vulnerability classes. Cloudflare’s security practice to block vulnerabilities before they are introduced into code at merge or code changes serves to maintain a high security posture and aligns with CISA’s pledge around proactive security measures.

Injection vulnerabilities are a critical vulnerability class, irrespective of the product or platform. These occur when code and data are improperly mixed due to lack of clear boundaries as a result of inadequate validation, unsafe functions, and/or improper sanitization. Injection vulnerabilities are considered high impact as they lead to compromise of confidentiality, integrity, and availability of the systems involved. Some of the ways Cloudflare continuously detects and prevents these risks is through security reviews, secure code scanning, and vulnerability testing. Additionally, ongoing efforts to institute improved precision serve to reduce false positives and aggressively detect and block these vulnerabilities at the source if engineers accidentally introduce these into code.

Secrets in code is another vulnerability class of high impact, as it presents significant risk related to confidential information leaks, potentially leading to unauthorized access and insider threat challenges. In 2023, Cloudflare prioritized tuning our security tools and systems to further improve the detection and reduction of secrets within code. Through audits and usage patterns analysis across all Cloudflare repositories, we further decreased the probability of the reintroduction of these vulnerabilities into new code by writing and enabling enhanced secrets detection rules.

Cloudflare is committed to elimination of these vulnerability classes regardless of their criticality. By addressing these vulnerabilities at their source, Cloudflare has significantly reduced the attack surface and the potential for exploitation in production environments. This approach established secure defaults by enabling developers to rely on frameworks and tools that inherently separate data or secrets from code, minimizing the need for reactive fixes. Additionally, resolving these vulnerabilities at the code level “future-proofs” applications, ensuring they remain resilient as the threat landscape evolves.

Cloudflare’s techniques for addressing these vulnerabilities

To address both injection and embedded secrets vulnerabilities, Cloudflare focused on building secure defaults, leveraging automation, and empowering developers. To establish secure default configurations, Cloudflare uses frameworks designed to inherently separate data from code. We also increased reliance on secure storage systems and secret management tools, integrating them seamlessly into the development pipeline.

Continuous automation played a critical role in our strategy. Static analysis tools integration with DevOps process were enhanced with customized rule sets to block issues based on observed patterns and trends. Additionally, along with security scans running on every pull and merge request, software quality assurance measures of “build break” and “stop the code” were enforced. This prevented risks from entering production when true positive vulnerabilities were detected across all Cloudflare development activities, irrespective of criticality and impacted product. This proactive approach has further reduced the likelihood of these vulnerabilities reaching production environments.

Developer enablement was another key pillar. Priority was placed on bolstering existing continuous education and training for engineering teams by providing additional guidance and best practices on preventing security vulnerabilities, and leveraging our centralized secrets platform in an automated way. Embedding these principles into daily workflows has fostered a culture of shared responsibility for security across the organization.

The role of custom rulesets and “build break”

To operationalize the more aggressive detection and blocking capabilities, Cloudflare’s Product Security team wrote new detection rulesets for its static application security testing (SAST) tool integrated in CI/CD workflows and hardened the security criteria for code releases to production. Using the SAST tooling with both default and custom rulesets allows the security team to perform comprehensive scans for secure code, secrets, and software supply chain vulnerabilities, virtually eliminating injection vulnerabilities and secrets from source code. It also enables the security team to identify and address issues early while systematically enforcing security policies.

Cloudflare’s expansion of the security tool suite played a critical role in the company’s secure product strategy. Initially, rules were enabled in “monitoring only” mode to understand trends and potential false positives. Then rules were fine-tuned to enforce and adjust priorities without disrupting development workflows. Leveraging internal threat models, the team writes custom rules tailored to Cloudflare’s infrastructure. Every pull request (PR) and merge request (MR) was scanned against these specific rule sets, including those targeting injection and secrets. The fine-tuned rules, optimized for high precision, are then activated in blocking mode, which leads to breaking the build when detected. This process provides vulnerability remediation at the PR/MR stage.

Hardening these security checks directly into the CI/CD pipeline enforces a proactive security assurance strategy in the development lifecycle. This approach ensures vulnerabilities are detected and addressed early in the development process before reaching production. The detection and blocking of these issues early reduces remediation efforts, minimizes risk, and strengthens the overall security of our products and systems.

Outcomes

Cloudflare continues to follow a culture of transparency as it provides increased visibility into the root cause of an issue and consequently allowing us to improve the process/product at scale. As a result, these efforts have yielded tangible results and continue to strengthen the security posture of all Cloudflare products.

In the second half of 2024, the team aggressively added new rulesets that helped detect and remove new secrets introduced into code repositories. This led to a 79% reduction of secrets in code over the previous quarter, underscoring Cloudflare’s commitment to safeguarding the company's codebase and protecting sensitive information. Following a similar approach, the team also introduced new rulesets in blocking mode, irrespective of the criticality level for all injection vulnerabilities. These improvements led to an additional 44% reduction of potential SQL injection and code injection vulnerabilities.

While security tools may produce false positives, customized rulesets with high-confidence true positives remain a key step in order to methodically evaluate and address the findings. These reductions reflect the effectiveness of proactive security measures in reducing entire vulnerability classes at scale.

Future plans

Cloudflare will continue to mature the current practices and enforce secure-by-design principles. Some other security practices we will continue to mature include: providing secure frameworks, threat modeling at scale, integration of automated security tooling in every stage of the software development lifecycle (SDLC), and ongoing role based developer training on leading edge security standards. All of these strategies help reduce, or eliminate, entire classes of vulnerabilities.

Conclusion

Irrespective of the industry, if your organization builds software, we encourage you to familiarize yourself with CISA’s ‘Secure by Design’ principles and create a plan to implement them in your company. The commitment is built around seven security goals, prioritizing the security of customers.

The CISA Secure by Design pledge challenges organizations to think differently about security. By addressing vulnerabilities at their source, Cloudflare has demonstrated measurable progress in reducing systemic risks.

Cloudflare’s continued focus on addressing vulnerability classes through prevention mechanisms outlined above serves as a critical foundation. These efforts ensure the security of Cloudflare systems, employees, and customers. Cloudflare is invested in continuous innovation and building a safe digital world.

You can also find more updates on our blog as we build our roadmap to meet all seven CISA Secure by Design pledge goals by May 2025, such as our post about reaching Goal #5 of the pledge.

As a cybersecurity company, Cloudflare considers product security an integral part of its DNA. We strongly believe in CISA’s principles issued in the Secure by Design pledge, and will continue to uphold these principles in the work we do.

Improving platform resilience at Cloudflare through automation

Opeyemi Onikute — Wed, 09 Oct 2024 13:00:00 GMT

Failure is an expected state in production systems, and no predictable failure of either software or hardware components should result in a negative experience for users. The exact failure mode may vary, but certain remediation steps must be taken after detection. A common example is when an error occurs on a server, rendering it unfit for production workloads, and requiring action to recover.

When operating at Cloudflare’s scale, it is important to ensure that our platform is able to recover from faults seamlessly. It can be tempting to rely on the expertise of world-class engineers to remediate these faults, but this would be manual, repetitive, unlikely to produce enduring value, and not scaling. In one word: toil; not a viable solution at our scale and rate of growth.

In this post we discuss how we built the foundations to enable a more scalable future, and what problems it has immediately allowed us to solve.

Growing pains

The Cloudflare Site Reliability Engineering (SRE) team builds and manages the platform that helps product teams deliver our extensive suite of offerings to customers. One important component of this platform is the collection of servers that power critical products such as Durable Objects, Workers, and DDoS mitigation. We also build and maintain foundational software services that power our product offerings, such as configuration management, provisioning, and IP address allocation systems.

As part of tactical operations work, we are often required to respond to failures in any of these components to minimize impact to users. Impact can vary from lack of access to a specific product feature, to total unavailability. The level of response required is determined by the priority, which is usually a reflection of the severity of impact on users. Lower-priority failures are more common — a server may run too hot, or experience an unrecoverable hardware error. Higher-priority failures are rare and are typically resolved via a well-defined incident response process, requiring collaboration with multiple other teams.

The commonality of lower-priority failures makes it obvious when the response required, as defined in runbooks, is “toilsome”. To reduce this toil, we had previously implemented a plethora of solutions to automate runbook actions such as manually-invoked shell scripts, cron jobs, and ad-hoc software services. These had grown organically over time and provided solutions on a case-by-case basis, which led to duplication of work, tight coupling, and lack of context awareness across the solutions.

We also care about how long it takes to resolve any potential impact on users. A resolution process which involves the manual invocation of a script relies on human action, increasing the Mean-Time-To-Resolve (MTTR) and leaving room for human error. This risks increasing the amount of errors we serve to users and degrading trust.

These problems proved that we needed a way to automatically heal these platform components. This especially applies to our servers, for which failure can cause impact across multiple product offerings. While we have mechanisms to automatically steer traffic away from these degraded servers, in some rare cases the breakage is sudden enough to be visible.

Solving the problem

To provide a more reliable platform, we needed a new component that provides a common ground for remediation efforts. This would remove duplication of work, provide unified context-awareness and increase development speed, which ultimately saves hours of engineering time and effort.

A good solution would not allow only the SRE team to auto-remediate, it would empower the entire company. The key to adding self-healing capability was a generic interface for all teams to self-service and quickly remediate failures at various levels: machine, service, network, or dependencies.

A good way to think about auto-remediation is in terms of workflows. A workflow is a sequence of steps to get to a desired outcome. This is not dissimilar to a manual shell script which executes what a human would otherwise do via runbook instructions. Because of this logical fit with workflows and durable execution, we decided to adopt an open-source platform called Temporal.

The concept of durable execution is useful to gracefully manage infrastructure failures such as network outages and transient failures in external service endpoints. This capability meant we only needed to build a way to schedule “workflow” tasks and have the code provide reliability guarantees by default, using Temporal. This allowed us to focus on building out the orchestration system to support the control and flow of workflow execution in our data centers.

Temporal’s documentation provides a good introduction to writing Temporal workflows.

Building an Automatic Remediation System

Below, we describe how our automatic remediation system works. It is essentially a way to schedule tasks across our global network with built-in reliability guarantees. With this system, teams can serve their customers more reliably. An unexpected failure mode can be recognized and immediately mitigated, while the root cause can be determined later via a more detailed analysis.

Step one: we need a coordinator

After our initial testing of Temporal, it was now possible to write workflows. But we needed a way to schedule workflow tasks from other internal services. The coordinator was built to serve this purpose, and became the primary mechanism for the authorisation and scheduling of workflows.

The most important roles of the coordinator are authorisation, workflow task routing, and safety constraints enforcement. Each consumer is authorized via mTLS authentication, and the coordinator uses an ACL to determine whether to permit the execution of a workflow. An ACL configuration looks like the following example.

server_config {
    enable_tls = true
    [...]
    route_rule {
      name  = "global_get"
      method = "GET"
      route_patterns = ["/*"]
      uris = ["spiffe://example.com/worker-admin"]
    }
    route_rule {
      name = "global_post"
      method = "POST"
      route_patterns = ["/*"]
      uris = ["spiffe://example.com/worker-admin"]
      allow_public = true
    }
    route_rule {
      name = "public_access"
      method = "GET"
      route_patterns = ["/metrics"]
      uris = []
      allow_public = true
      skip_log_match = true
    }
}

Each workflow specifies two key characteristics: where to run the tasks and the safety constraints, using an HCL configuration file. Example constraints could be whether to run on only a specific node type (such as a database), or if multiple parallel executions are allowed: if a task has been triggered too many times, that is a sign of a wider problem that might require human intervention. The coordinator uses the Temporal Visibility API to determine the current state of the executions in the Temporal cluster.

An example of a configuration file is shown below:

task_queue_target = ""

# The following entries will ensure that
# 1. This workflow is not run at the same time in a 15m window.
# 2. This workflow will not run more than once an hour.
# 3. This workflow will not run more than 3 times in one day.
#
constraint {
    kind = "concurency"
    value = "1"
    period = "15m"
}

constraint {
    kind = "maxExecution"
    value = "1"
    period = "1h"
}

constraint {
    kind = "maxExecution"
    value = "3"
    period = "24h"
    is_global = true
}

Step two: Task Routing is amazing

An unforeseen benefit of using a central Temporal cluster was the discovery of Task Routing. This feature allows us to schedule a Workflow/Activity on any server that has a running Temporal Worker, and further segment by the type of server, its location, etc. For this reason, we have three primary task queues — the general queue in which tasks can be executed by any worker in the datacenter, the node type queue in which tasks can only be executed by a specific node type in the datacenter, and the individual node queue where we target a specific node for task execution.

We rely on this heavily to ensure the speed and efficiency of automated remediation. Certain tasks can be run in datacenters with known low latency to an external resource, or a node type with better performance than others (due to differences in the underlying hardware). This reduces the amount of failure and latency we see overall in task executions. Sometimes we are also constrained by certain types of tasks that can only run on a certain node type, such as a database.

Task Routing also means that we can configure certain task queues to have a higher priority for execution, although this is not a feature we have needed so far. A drawback of task routing is that every Workflow/Activity needs to be registered to the target task queue, which is a common gotcha. Thankfully, it is possible to catch this failure condition with proper testing.

Step three: when/how to self-heal?

None of this would be relevant if we didn’t put it to good use. A primary design goal for the platform was to ensure we had easy, quick ways to trigger workflows on the most important failure conditions. The next step was to determine what the best sources to trigger the actions were. The answer to this was simple: we could trigger workflows from anywhere as long as they are properly authorized and detect the failure conditions accurately.

Example triggers are an alerting system, a log tailer, a health check daemon, or an authorized engineer via a chatbot. Such flexibility allows a high level of reuse, and permits to invest more in workflow quality and reliability.

As part of the solution, we built a daemon that is able to poll a signal source for any unwanted condition and trigger a configured workflow. We have initially found Prometheus useful as a source because it contains both service-level and hardware/system-level metrics. We are also exploring more event-based trigger mechanisms, which could eliminate the need to use precious system resources to poll for metrics.

We already had internal services that are able to detect widespread failure conditions for our customers, but were only able to page a human. With the adoption of auto-remediation, these systems are now able to react automatically. This ability to create an automatic feedback loop with our customers is the cornerstone of these self-healing capabilities, and we continue to work on stronger signals, faster reaction times, and better prevention of future occurrences.

The most exciting part, however, is the future possibility. Every customer cares about any negative impact from Cloudflare. With this platform we can onboard several services (especially those that are foundational for the critical path) and ensure we react quickly to any failure conditions, even before there is any visible impact.

Step four: packaging and deployment

The whole system is written in golang, and a single binary can implement each role. We distribute it as an apt package or a container for maximum ease of deployment.

We deploy a Temporal-based worker to every server we intend to run tasks on, and a daemon in datacenters where we intend to automatically trigger workflows based on the local conditions. The coordinator is more nuanced since we rely on task routing and can trigger from a central coordinator, but we have also found value in running coordinators locally in the datacenters. This is especially useful in datacenters with less capacity or degraded performance, removing the need for a round-trip to schedule the workflows.

Step five: test, test, test

Temporal provides native mechanisms to test an entire workflow, via a comprehensive test suite that supports end-to-end, integration, and unit testing, which we used extensively to prevent regressions while developing. We also ensured proper test coverage for all the critical platform components, especially the coordinator.

Despite the ease of written tests, we quickly discovered that they were not enough. After writing workflows, engineers need an environment as close as possible to the target conditions. This is why we configured our staging environments to support quick and efficient testing. These environments receive the latest changes and point to a different (staging) Temporal cluster, which enables experimentation and easy validation of changes.

After a workflow is validated in the staging environment, we can then do a full release to production. It seems obvious, but catching simple configuration errors before releasing has saved us many hours in development/change-related-task time.

Deploying to production

As you can guess from the title of this post, we put this in production to automatically react to server-specific errors and unrecoverable failures. To this end, we have a set of services that are able to detect single-server failure conditions based on analyzed traffic data. After deployment, we have successfully mitigated potential impact by taking any errant single sources of failure out of production.

We have also created a set of workflows to reduce internal toil and improve efficiency. These workflows can automatically test pull requests on target machines, wipe and reset servers after experiments are concluded, and take away manual processes that cost many hours in toil.

Building a system that is maintained by several SRE teams has allowed us to iterate faster, and rapidly tackle long-standing problems. We have set ambitious goals regarding toil elimination and are on course to achieve them, which will allow us to scale faster by eliminating the human bottleneck.

Looking to the future

Our immediate plans are to leverage this system to provide a more reliable platform for our customers and drastically reduce operational toil, freeing up engineering resources to tackle larger-scale problems. We also intend to leverage more Temporal features such as Workflow Versioning, which will simplify the process of making changes to workflows by ensuring that triggered workflows run expected versions.

We are also interested in how others are solving problems using durable execution platforms such as Temporal, and general strategies to eliminate toil. If you would like to discuss this further, feel free to reach out on the Cloudflare Community and start a conversation!

If you’re interested in contributing to projects that help build a better Internet, our engineering teams are hiring.

Changing the industry with CISA’s Secure by Design principles

Kristina Galicova — Mon, 04 Mar 2024 14:00:56 GMT

The United States Cybersecurity and Infrastructure Agency (CISA) and seventeen international partners are helping shape best practices for the technology industry with their ‘Secure by Design’ principles. The aim is to encourage software manufacturers to not only make security an integral part of their products’ development, but to also design products with strong security capabilities that are configured by default.

As a cybersecurity company, Cloudflare considers product security an integral part of its DNA. We strongly believe in CISA’s principles and will continue to uphold them in the work we do. We’re excited to share stories about how Cloudflare has baked secure by design principles into the products we build and into the services we make available to all of our customers.

What do “secure by design” and “secure by default” mean?

Secure by design describes a product where the security is ‘baked in’ rather than ‘bolted on’. Rather than manufacturers addressing security measures reactively, they take actions to mitigate any risk beforehand by building products in a way that reasonably protects against attackers successfully gaining access to them.

Secure by default means products are built to have the necessary security configurations come as a default, without additional charges.

CISA outlines the following three software product security principles:

Take ownership of customer security outcomes
Embrace radical transparency and accountability
Lead from the top

In its documentation, CISA provides comprehensive guidance on how to achieve its principles and what security measures a manufacturer should follow. Adhering to these guidelines not only enhances security benefits to customers and boosts the developer’s brand reputation, it also reduces long term maintenance and patching costs for manufacturers.

Why does it matter?

Technology undeniably plays a significant role in our lives, automating numerous everyday tasks. The world’s dependence on technology and Internet-connected devices has significantly increased in the last few years, in large part due to Covid-19. During the outbreak, individuals and companies moved online as they complied with the public health measures that limited physical interactions.

While Internet connectivity makes our lives easier, bringing opportunities for online learning and remote work, it also creates an opportunity for attackers to benefit from such activities. Without proper safeguards, sensitive data such as user information, financial records, and login credentials can all be compromised and used for malicious activities.

Systems vulnerabilities can also impact entire industries and economies. In 2023, hackers from North Korea were suspected of being responsible for over 20% of crypto losses, exploiting software vulnerabilities and stealing more than $300 million from individuals and companies around the world.

Despite the potentially devastating consequences of insecure software, too many vendors place the onus of security on their customers — a fact that CISA underscores in their guidelines. While a level of care from customers is expected, the majority of risks should be handled by manufacturers and their products. Only then can we have more secure and trusting online interactions. The ‘Secure by Design’ principles are essential to bridge that gap and change the industry.

How does Cloudflare support secure by design principles?

Taking ownership of customer security outcomes

CISA explains that in order to take ownership of customer security outcomes, software manufacturers should invest in product security efforts that include application hardening, application features, and application default settings. At Cloudflare, we always have these product security efforts top of mind and a few examples are shared below.

Application hardening

At Cloudflare, our developers follow a defined software development life cycle (SDLC) management process with checkpoints from our security team. We proactively address known vulnerabilities before they can be exploited and fix any exploited vulnerabilities for all of our customers. For example, we are committed to memory safe programming languages and use them where possible. Back in 2021, Cloudflare rewrote the Cloudflare WAF from Lua into the memory safe Rust. More recently, Cloudflare introduced a new in-house built HTTP proxy named Pingora, that moved us from memory unsafe C to memory safe Rust as well. Both of these projects were extra large undertakings that would not have been possible without executive support from our technical leadership team.

Zero Trust Security

By default, we align with CISA’s Zero Trust Maturity Model through the use of Cloudflare’s Zero Trust Security suite of services, to prevent unauthorized access to Cloudflare data, development resources, and other services. We minimize trust assumptions and require strict identity verification for every person and device trying to access any Cloudflare resources, whether self-hosted or in the cloud.

At Cloudflare, we believe that Zero Trust Security is a must-have security architecture in today’s environment, where cyber security attacks are rampant and hybrid work environments are the new normal. To help protect small businesses today, we have a Zero Trust plan that provides the essential security controls needed to keep employees and apps protected online available free of charge for up to 50 users.

Application features

We not only provide users with many essential security tools for free, but we have helped push the entire industry to provide better security features by default since our early days.

Back in 2014, during Cloudflare's birthday week, we announced that we were making encryption free for all our customers by introducing Universal SSL. Then in 2015, we went one step further and provided full encryption of all data from the browser to the origin, for free. Now, the rest of the industry has followed our lead and encryption by default has become the standard for Internet applications.

During Cloudflare’s seventh Birthday Week in 2017, we were incredibly proud to announce unmetered DDoS mitigation. The service absorbs and mitigates large-scale DDoS attacks without charging customers for the excess bandwidth consumed during an attack. With such announcement we eliminated the industry standard of ‘surge pricing’ for DDoS attacks

In 2021, we announced a protocol called MIGP ("Might I Get Pwned") that allows users to check whether their credentials have been compromised without exposing any unnecessary information in the process. Aside from a bucket ID derived from a prefix of the hash of your email, your credentials stay on your device and are never sent (even encrypted) over the Internet. Before that, using credential checking services could turn out to be a vulnerability in itself, leaking sensitive information while you are checking whether or not your credentials have been compromised.

A year later, in 2022, Cloudflare again disrupted the industry when we announced WAF (Web Application Firewall) Managed Rulesets free of charge for all Cloudflare plans. WAF is a service responsible for protecting web applications from malicious attacks. Such attacks have a major impact across the Internet regardless of the size of an organization. By making WAF free, we are making the Internet safe for everyone.

Finally, at the end of 2023, we were excited to help lead the industry by making post-quantum cryptography available free of charge to all of our customers irrespective of plan levels.

Application default settings

To further protect our customers, we ensure our default settings provide a robust security posture right from the start. Once users are comfortable, they can change and configure any settings the way they prefer. For example, Cloudflare automatically deploys the Free Cloudflare Managed Ruleset to any new Cloudflare zone. The managed ruleset includes Log4j rules, Shellshock rules, rules matching very common WordPress exploits, and others. Customers are able to disable the ruleset, if necessary, or configure the traffic filter or individual rules. To provide an even more secure-by-default system, we also created the ML-computed WAF Attack Score that uses AI to detect bypasses of existing managed rules and can detect software exploits before they are made public.

As another example, all Cloudflare accounts come with unmetered DDoS mitigation services to protect applications from many of the Internet's most common and hard to handle attacks, by default.

As yet another example, when customers use our R2 storage, all the stored objects are encrypted at rest. Both encryption and decryption is automatic, does not require user configuration to enable, and does not impact the performance of R2.

Cloudflare also provides all of our customers with robust audit logs. Audit logs summarize the history of changes made within your Cloudflare account. Audit logs include account level actions like login, as well as zone configuration changes. Audit Logs are available on all plan types and are captured for both individual users and for multi-user organizations. Our audit logs are available across all plan levels for 18 months.

Embracing radical transparency and accountability

To embrace radical transparency and accountability means taking pride in delivering safe and secure products. Transparency and sharing information are crucial for improving and evolving the security industry, fostering an environment where companies learn from each other and make the online world safer. Cloudflare shows transparency in multiple ways, as outlined below.

The Cloudflare blog

On the Cloudflare blog, you can find the latest information about our features and improvements, but also about zero-day attacks that are relevant to the entire industry, like the historic HTTP/2 Rapid Reset attacks detected last year. We are transparent and write about important security incidents, such as the Thanksgiving 2023 security incident, where we go in detail about what happened, why it happened, and the steps we took to resolve it. We have also made a conscious effort to embrace radical transparency from Cloudflare’s inception about incidents impacting our services, and continue to embrace this important principle as one of our core values. We hope that the information we share can assist others in enhancing their software practices.

Cloudflare System Status

Cloudflare System Status is a page to inform website owners about the status of Cloudflare services. It provides information about the current status of services and whether they are operating as expected. If there are any ongoing incidents, the status page notes which services were affected, as well as details about the issue. Users can also find information about scheduled maintenance that may affect the availability of some services.

Technical transparency for code integrity

We believe in the importance of using cryptography as a technical means for transparently verifying identity and data integrity. For example, in 2022, we partnered with WhatsApp to provide a system for WhatsApp that assures users they are running the correct, untampered code when visiting the web version of the service by enabling the code verify extension to confirm hash integrity automatically. It’s this process, and the fact that is automated on behalf of the user, that helps provide transparency in a scalable way. If users had to manually fetch, compute, and compare the hashes themselves, detecting tampering would likely only be done by a small fraction of technical users.

Transparency report and warrant canaries

We also believe that an essential part of earning and maintaining the trust of our customers is being transparent about the requests we receive from law enforcement and other governmental entities. To this end, Cloudflare publishes semi-annual updates to our Transparency Report on the requests we have received to disclose information about our customers.

An important part of Cloudflare’s transparency report is our warrant canaries. Warrant canaries are a method to implicitly inform users that we have not taken certain actions or received certain requests from government or law enforcement authorities, such as turning over our encryption or authentication keys or our customers' encryption or authentication keys to anyone. Through these means we are able to let our users know just how private and secure their data is while adhering to orders from law enforcement that prohibit disclosing some of their requests. You can read Cloudflare’s warrant canaries here.

While transparency reports and warrant canaries are not explicitly mentioned in CISA’s secure by design principles, we think they are an important aspect in a technology company being transparent about their practices.

Public bug bounties

We invite you to contribute to our security efforts by participating in our public bug bounty hosted by HackerOne, where you can report Cloudflare vulnerabilities and receive financial compensation in return for your help.

Leading from the top

With this principle, security is deeply rooted inside Cloudflare’s business goals. Because of the tight relationship of security and quality, by improving a product's default security, the quality of the overall product also improves.

At Cloudflare, our dedication to security is reflected in the company’s structure. Our Chief Security Officer reports directly to our CEO, and presents at every board meeting. That allows for board members well-informed about the current cybersecurity landscape and emphasizes the importance of the company's initiatives to improve security.

Additionally, our security engineers are a part of the main R&D organization, with their work being as integral to our products as that of our system engineers. This means that our security engineers can bake security into the SDLC instead of bolting it on as an afterthought.

How can you help?

If you are a software manufacturer, we encourage you to familiarize yourself with CISA’s ‘Secure by Design’ principles and create a plan to implement them in your company.

As an individual, we encourage you to participate in bug bounty programs (such as Cloudflare’s HackerOne public bounty) and promote cybersecurity awareness in your community.

Let’s help build a better Internet together.

Hardening Workers KV

Matt Silverlock — Wed, 02 Aug 2023 13:05:42 GMT

Over the last couple of months, Workers KV has suffered from a series of incidents, culminating in three back-to-back incidents during the week of July 17th, 2023. These incidents have directly impacted customers that rely on KV — and this isn’t good enough.

We’re going to share the work we have done to understand why KV has had such a spate of incidents and, more importantly, share in depth what we’re doing to dramatically improve how we deploy changes to KV going forward.

Workers KV?

Workers KV — or just “KV” — is a key-value service for storing data: specifically, data with high read throughput requirements. It’s especially useful for user configuration, service routing, small assets and/or authentication data.

We use KV extensively inside Cloudflare too, with Cloudflare Access (part of our Zero Trust suite) and Cloudflare Pages being some of our highest profile internal customers. Both teams benefit from KV’s ability to keep regularly accessed key-value pairs close to where they’re accessed, as well its ability to scale out horizontally without any need to become an expert in operating KV.

Given Cloudflare’s extensive use of KV, it wasn’t just external customers impacted. Our own internal teams felt the pain of these incidents, too.

The summary of the post-mortem

Back in June 2023, we announced the move to a new architecture for KV, which is designed to address two major points of customer feedback we’ve had around KV: high latency for infrequently accessed keys (or a key accessed in different regions), and working to ensure the upper bound on KV’s eventual consistency model for writes is 60 seconds — not “mostly 60 seconds”.

At the time of the blog, we’d already been testing this internally, including early access with our community champions and running a small % of production traffic to validate stability and performance expectations beyond what we could emulate within a staging environment.

However, in the weeks between mid-June and culminating in the series of incidents during the week of July 17th, we would continue to increase the volume of new traffic onto the new architecture. When we did this, we would encounter previously unseen problems (many of these customer-impacting) — then immediately roll back, fix bugs, and repeat. Internally, we’d begun to identify that this pattern was becoming unsustainable — each attempt to cut traffic onto the new architecture would surface errors or behaviors we hadn’t seen before and couldn’t immediately explain, and thus we would roll back and assess.

The issues at the root of this series of incidents proved to be significantly challenging to track and observe. Once identified, the two causes themselves proved to be quick to fix, but an (1) observability gap in our error reporting and (2) a mutation to local state that resulted in an unexpected mutation of global state were both hard to observe and reproduce over the days following the customer-facing impact ending.

The detail

One important piece of context to understand before we go into detail on the post-mortem: Workers KV is composed of two separate Workers scripts – internally referred to as the Storage Gateway Worker and SuperCache. SuperCache is an optional path in the Storage Gateway Worker workflow, and is the basis for KV's new (faster) backend (refer to the blog).

Here is a timeline of events:

Time	Description
2023-07-17 21:52 UTC	Cloudflare observes alerts showing 500 HTTP status codes in the MEL01 data-center (Melbourne, AU) and begins investigating. We also begin to see a small set of customers reporting HTTP 500s being returned via multiple channels. It is not immediately clear if this is a data-center-wide issue or KV specific, as there had not been a recent KV deployment, and the issue directly correlated with three data-centers being brought back online.
2023-07-18 00:09 UTC	We disable the new backend for KV in MEL01 in an attempt to mitigate the issue (noting that there had not been a recent deployment or change to the % of users on the new backend).
2023-07-18 05:42 UTC	Investigating alerts showing 500 HTTP status codes in VIE02 (Vienna, AT) and JNB01 (Johannesburg, SA).
2023-07-18 13:51 UTC	The new backend is disabled globally after seeing issues in VIE02 (Vienna, AT) and JNB01 (Johannesburg, SA) data-centers, similar to MEL01. In both cases, they had also recently come back online after maintenance, but it remained unclear as to why KV was failing.
2023-07-20 19:12 UTC	The new backend is inadvertently re-enabled while deploying the update due to a misconfiguration in a deployment script.
2023-07-20 19:33 UTC	The new backend is (re-) disabled globally as HTTP 500 errors return.
2023-07-20 23:46 UTC	Broken Workers script pipeline deployed as part of gradual rollout due to incorrectly defined pipeline configuration in the deployment script. Metrics begin to report that a subset of traffic is being black-holed.
2023-07-20 23:56 UTC	Broken pipeline rolled back; errors rates return to pre-incident (normal) levels.

All timestamps referenced are in Coordinated Universal Time (UTC).

We initially observed alerts showing 500 HTTP status codes in the MEL01 data-center (Melbourne, AU) at 21:52 UTC on July 17th, and began investigating. We also received reports from a small set of customers reporting HTTP 500s being returned via multiple channels. This correlated with three data centers being brought back online, and it was not immediately clear if it related to the data centers or was KV-specific — especially given there had not been a recent KV deployment. On 05:42, we began investigating alerts showing 500 HTTP status codes in VIE02 (Vienna) and JNB02 (Johannesburg) data-centers; while both had recently come back online after maintenance, it was still unclear why KV was failing. At 13:51 UTC, we made the decision to disable the new backend globally.

Following the incident on July 18th, we attempted to deploy an allow-list configuration to reduce the scope of impacted accounts. However, while attempting to roll out a change for the Storage Gateway Worker at 19:12 UTC on July 20th, an older configuration was progressed causing the new backend to be enabled again, leading to the third event. As the team worked to fix this and deploy this configuration, they attempted to manually progress the deployment at 23:46 UTC, which resulted in the passing of a malformed configuration value that caused traffic to be sent to an invalid Workers script configuration.

After all deployments and the broken Workers configuration (pipeline) had been rolled back at 23:56 on the 20th July, we spent the following three days working to identify the root cause of the issue. We lacked observability as KV's Worker script (responsible for much of KV's logic) was throwing an unhandled exception very early on in the request handling process. This was further exacerbated by prior work to disable error reporting in a disabled data-center due to the noise generated, which had previously resulted in logs being rate-limited upstream from our service.

This previous mitigation prevented us from capturing meaningful logs from the Worker, including identifying the exception itself, as an uncaught exception terminates request processing. This has raised the priority of improving how unhandled exceptions are reported and surfaced in a Worker (see Recommendations, below, for further details). This issue was exacerbated by the fact that KV's Worker script would fail to re-enter its "healthy" state when a Cloudflare data center was brought back online, as the Worker was mutating an environment variable perceived to be in request scope, but that was in global scope and persisted across requests. This effectively left the Worker “frozen” with the previous, invalid configuration for the affected locations.

Further, the introduction of a new progressive release process for Workers KV, designed to de-risk rollouts (as an action from a prior incident), prolonged the incident. We found a bug in the deployment logic that led to a broader outage due to an incorrectly defined configuration.

This configuration effectively caused us to drop a single-digit % of traffic until it was rolled back 10 minutes later. This code is untested at scale, and we need to spend more time hardening it before using it as the default path in production.

Additionally: although the root cause of the incidents was limited to three Cloudflare data-centers (Melbourne, Vienna, and Johannesburg), traffic across these regions still uses these data centers to route reads and writes to our system of record. Because these three data centers participate in KV’s new backend as regional tiers, a portion of traffic across the Oceania, Europe, and African regions was affected. Only a portion of keys from enrolled namespaces use any given data center as a regional tier in order to limit a single (regional) point of failure, so while traffic across all data centers in the region was impacted, nowhere was all traffic in a given data center affected.

We estimated the affected traffic to be 0.2-0.5% of KV's global traffic (based on our error reporting), however we observed some customers with error rates approaching 20% of their total KV operations. The impact was spread across KV namespaces and keys for customers within the scope of this incident.

Both KV’s high total traffic volume and its role as a critical dependency for many customers amplify the impact of even small error rates. In all cases, once the changes were rolled back, errors returned to normal levels and did not persist.

Thinking about risks in building software

Before we dive into what we’re doing to significantly improve how we build, test, deploy and observe Workers KV going forward, we think there are lessons from the real world that can equally apply to how we improve the safety factor of the software we ship.

In traditional engineering and construction, there is an extremely common procedure known as a “JSEA”, or Job Safety and Environmental Analysis (sometimes just “JSA”). A JSEA is designed to help you iterate through a list of tasks, the potential hazards, and most importantly, the controls that will be applied to prevent those hazards from damaging equipment, injuring people, or worse.

One of the most critical concepts is the “hierarchy of controls” — that is, what controls should be applied to mitigate these hazards. In most practices, these are elimination, substitution, engineering, administration and personal protective equipment. Elimination and substitution are fairly self-explanatory: is there a different way to achieve this goal? Can we eliminate that task completely? Engineering and administration ask us whether there is additional engineering work, such as changing the placement of a panel, or using a horizontal boring machine to lay an underground pipe vs. opening up a trench that people can fall into.

The last and lowest on the hierarchy, is personal protective equipment (PPE). A hard hat can protect you from severe injury from something falling from above, but it’s a last resort, and it certainly isn’t guaranteed. In engineering practice, any hazard that only lists PPE as a mitigating factor is unsatisfactory: there must be additional controls in place. For example, instead of only wearing a hard hat, we should engineer the floor of scaffolding so that large objects (such as a wrench) cannot fall through in the first place. Further, if we require that all tools are attached to the wearer, then it significantly reduces the chance the tool can be dropped in the first place. These controls ensure that there are multiple degrees of mitigation — defense in depth — before your hard hat has to come into play.

Coming back to software, we can draw parallels between these controls: engineering can be likened to improving automation, gradual rollouts, and detailed metrics. Similarly, personal protective equipment can be likened to code review: useful, but code review cannot be the only thing protecting you from shipping bugs or untested code. Automation with linters, more robust testing, and new metrics are all vastly safer ways of shipping software.

As we spent time assessing where to improve our existing controls and how to put new controls in place to mitigate risks and improve the reliability (safety) of Workers KV, we took a similar approach: eliminating unnecessary changes, engineering more resilience into our codebase, automation, deployment tooling, and only then looking at human processes.

How we plan to get better

Cloudflare is undertaking a larger, more structured review of KV's observability tooling, release infrastructure and processes to mitigate not only the contributing factors to the incidents within this report, but recent incidents related to KV. Critically, we see tooling and automation as the most powerful mechanisms for preventing incidents, with process improvements designed to provide an additional layer of protection. Process improvements alone cannot be the only mitigation.

Specifically, we have identified and prioritized the below efforts as the most important next steps towards meeting our own availability SLOs, and (above all) make KV a service that customers building on Workers can rely on for storing configuration and service data in the hot path of their traffic:

Substantially improve the existing observability tooling for unhandled exceptions, both for internal teams and customers building on Workers. This is especially critical for high-volume services, where traditional logging alone can be too noisy (and not specific enough) to aid in tracking down these cases. The existing ongoing work to land this will be prioritized further. In the meantime, we have directly addressed the specific uncaught exception with KV's primary Worker script.
Improve the safety around the mutation of environmental variables in a Worker, which currently operate at "global" (per-isolate) scope, but can appear to be per-request. Mutating an environmental variable in request scope mutates the value for all requests transiting that same isolate (in a given location), which can be unexpected. Changes here will need to take backwards compatibility in mind.
Continue to expand KV’s test coverage to better address the above issues, in parallel with the aforementioned observability and tooling improvements, as an additional layer of defense. This includes allowing our test infrastructure to simulate traffic from any source data-center, which would have allowed us to more quickly reproduce the issue and identify a root cause.
Improvements to our release process, including how KV changes and releases are reviewed and approved, going forward. We will enforce a higher level of scrutiny for future changes, and where possible, reduce the number of changes deployed at once. This includes taking on new infrastructure dependencies, which will have a higher bar for both design and testing.
Additional logging improvements, including sampling, throughout our request handling process to improve troubleshooting & debugging. A significant amount of the challenge related to these incidents was due to the lack of logging around specific requests (especially non-2xx requests)
Review and, where applicable, improve alerting thresholds surrounding error rates. As mentioned previously in this report, sub-% error rates at a global scale can have severe negative impact on specific users and/or locations: ensuring that errors are caught and not lost in the noise is an ongoing effort.
Address maturity issues with our progressive deployment tooling for Workers, which is net-new (and will eventually be exposed to customers directly).

This is not an exhaustive list: we're continuing to expand on preventative measures associated with these and other incidents. These changes will not only improve KVs reliability, but other services across Cloudflare that KV relies on, or that rely on KV.

We recognize that KV hasn’t lived up to our customers’ expectations recently. Because we rely on KV so heavily internally, we’ve felt that pain first hand as well. The work to fix the issues that led to this cycle of incidents is already underway. That work will not only improve KV’s reliability but also improve the reliability of any software written on the Cloudflare Workers developer platform, whether by our customers or by ourselves.

How we scaled and protected Eurovision 2023 voting with Pages and Turnstile

Dirk-Jan van Helmond — Fri, 23 Jun 2023 13:00:55 GMT

2023 was the first year that non-participating countries could vote for their favorites during the Eurovision Song Contest, adding millions of additional viewers and voters to an already impressive 162 million tuning in from the participating countries. It became a truly global event with a potential for disruption from multiple sources. To prepare for anything, Cloudflare helped scale and protect the voting application, used by millions of dedicated fans around the world to choose the winner.

In this blog we will cover how once.net built their platform based.io to monitor, manage and scale the Eurovision voting application to handle all traffic using many Cloudflare services. The speed with which DNS changes made through the Cloudflare API propagate globally allowed them to scale their backend within seconds. At the same time, Cloudflare Pages was ready to serve any amount of traffic to the voting landing page so fans didn’t miss a beat. And to cap it off, by combining Cloudflare CDN, DDoS protection, WAF, and Turnstile, they made sure that attackers didn’t steal any of the limelight.

The unsung heroes

Based.io is a resilient live data platform built by the once.net team, with the capability to scale up to 400 million concurrent connected users. It’s built from the ground up for speed and performance, consisting of an observable real time graph database, networking layer, cloud functions, analytics and infrastructure orchestration. Since all system information, traffic analysis and disruptions are monitored in real time, it makes the platform instantly responsive to variable demand, which enables real time scaling of your infrastructure during spikes, outages and attacks.

Although the based.io platform on its own is currently in closed beta, it is already serving a few flagship customers in production assisted by the software and services of the once.net team. One such customer is Tally, a platform used by multiple broadcasters in Europe to add live interaction to traditional television. Over 100 live shows have been performed using the platform. Another is Airhub, a startup that handles and logs automatic drone flights. And of course the star of this blog post, the Eurovision Song Contest.

Setting the stage

The Eurovision Song Contest is one of the world’s most popular broadcasted contests, and this year it reached 162 million people across 38 broadcasting countries. In addition, on TikTok the three live shows were viewed 4.8 million times, while 7.6 million people watched the Grand Final live on YouTube. With such an audience, it is no surprise that Cloudflare sees the impact of it on the Internet. Last year, we wrote a blog post where we showed lower than average traffic during, and higher than average traffic after the grand final. This year, the traffic from participating countries showed an even more remarkable surge:

HTTP Requests per Second from Norway, with a similar pattern visible in countries such as the UK, Sweden and France. Internet traffic spiked at 21:20 UTC, when voting started.

Such large amounts of traffic are nothing new to the Eurovision Song Contest. Eurovision has relied on Cloudflare’s services for over a decade now and Cloudflare has helped to protect Eurovision.tv and improve its performance through noticeable faster load time to visitors from all corners of the world. Year after year, the team of Eurovision continued to use our services more, discovering additional features to improve performance and reliability further, with increasingly fine-grained control over their traffic flows. Eurovision.tv uses Page Rules to cache additional content on Cloudflare’s edge, speeding up delivery without sacrificing up-to-the-minute updates during the global event. Finally, to protect their backend and content management system, the team has placed their admin portals behind Cloudflare Zero Trust to delegate responsibilities down to individual levels.

Since then the contest itself has also evolved – sometimes by choice, sometimes by force. During the COVID-19 pandemic in-person cheering became impossible for many people due to a reduced live audience, resulting in the Eurovision Song Contest asking once.net to build a new iOS and Android application in which fans could cheer virtually. The feature was an instant hit, and it was clear that it would become part of this year’s contest as well.

A screenshot of the official Eurovision Song Contest application showing the real-time number of connected fans (1) and allowing them to cheer (2) for their favorites.

In 2023, once.net was also asked to handle the paid voting from the regions where phone and SMS voting was not possible. It was the first time that Eurovision allowed voting online. The challenge that had to be overcome was the extreme peak demand on the platform when the show was live, and especially when the voting window started.

Complicating it further, was the fact that during last year’s show, there had been a large number of targeted and coordinated attacks.

To prepare for these spikes in demand and determined adversaries, once.net needed a platform that isn’t only resilient and highly scalable, but could also act as a mitigation layer in front of it. once.net selected Cloudflare for this functionality and integrated Cloudflare deeply with its real-time monitoring and management platform. To understand how and why, it’s essential to understand based.io underlying architecture.

The based.io platform

Instead of relying on network or HTTP load balancers, based.io uses a client-side service discovery pattern, selecting the most suitable server to connect to and leveraging Cloudflare's fast cache propagation infrastructure to handle spikes in traffic (both malicious and benign).

First, each server continuously registers a unique access key that has an expiration of 15 seconds, which must be used when a client connects to the server. In addition, the backend servers register their health (such as active connections, CPU, memory usage, requests per second, etc.) to the service registry every 300 milliseconds. Clients then request the optimal server URL and associated access key from a central discovery registry and proceed to establish a long lived connection with that server. When a server gets overloaded it will disconnect a certain amount of clients and those clients will go through the discovery process again.

The central discovery registry would normally be a huge bottleneck and attack target. based.io resolves this by putting the registry behind Cloudflare's global network with a cache time of three seconds. Since the system relies on real-time stats to distribute load and uses short lived access keys, it is crucial that the cache updates fast and reliably. This is where Cloudflare’s infrastructure proved its worth, both due to the fast updating cache and reducing load with Tiered Caching.

Not using load balancers means the based.io system allows clients to connect to the backend servers through Cloudflare, resulting in better performance and a more resilient infrastructure by eliminating the load balancers as potential attack surface. It also results in a better distribution of connections, using the real-time information of server health, amount of active connections, active subscriptions.

Scaling up the platform happens automatically under load by deploying additional machines that can each handle 40,000 connected users. These are spun up in batches of a couple of hundred and as each machine spins up, it reaches out directly to the Cloudflare API to configure its own DNS record and proxy status. Thanks to Cloudflare’s high speed DNS system, these changes are then propagated globally within seconds, resulting in a total machine turn-up time of around three seconds. This means faster discovery of new servers and faster dynamic rebalancing from the clients. And since the voting window of the Eurovision Song Contest is only 45 minutes, with the main peak within minutes after the window opens, every second counts!

High level architecture of the based.io platform used for the 2023 Eurovision Song Contest‌ ‌

To vote, users of the mobile app and viewers globally were pointed to the voting landing page, esc.vote. Building a frontend web application able to handle this kind of an audience is a challenge in itself. Although hosting it yourself and putting a CDN in front seems straightforward, this still requires you to own, configure and manage your origin infrastructure. once.net decided to leverage Cloudflare’s infrastructure directly by hosting the voting landing page on Cloudflare Pages. Deploying was as quick as a commit to their Git repository, and they never had to worry about reachability or scaling of the webpage.

once.net also used Cloudflare Turnstile to protect their payment API endpoints that were used to validate online votes. They used the invisible Turnstile widget to make sure the request was not coming from emulated browsers (e.g. Selenium). And best of all, using the invisible Turnstile widget the user did not have to go through extra steps, which allowed for a better user experience and better conversion.

Cloudflare Pages stealing the show!

After the two semi-finals went according to plan with approximately 200,000 concurrent users during each,May 13 brought the Grand Final. The once.net team made sure that there were enough machines ready to take the initial load, jumped on a call with Cloudflare to monitor and started looking at the number of concurrent users slowly increasing. During the event, there were a few attempts to DDoS the site, which were automatically and instantaneously mitigated without any noticeable impact to any visitors.

The based.io discovery registry server also got some attention. Since the cache TTL was set quite low at five seconds, a high rate of distributed traffic to it could still result in a significant load. Luckily, on its own, the highly optimized based.io server can already handle around 300,000 requests per second. Still, it was great to see that during the event the cache hit ratio for normal traffic was 20%, and during one significant attack the cache hit ratio peaked towards 80%. This showed how easy it is to leverage a combination of Cloudflare CDN and DDoS protection to mitigate such attacks, while still being able to serve dynamic and real time content.

When the curtains finally closed, 1.3 million concurrent users connected to the based.io platform at peak. The based.io platform handled a total of 350 million events and served seven million unique users in three hours. The voting landing page hosted by Cloudflare Pages served 2.3 million requests per second at peak, and made sure that the voting payments were by real human fans using Turnstile. Although the Cloudflare platform didn’t blink for such a flood of traffic, it is no surprise that it shows up as a short crescendo in our traffic statistics:

Get in touch with us

If you’re also working on or with an application that would benefit from Cloudflare’s speed and security, but don’t know where to start, reach out and we’ll work together.

SLP: a new DDoS amplification vector in the wild

Alex Forster — Tue, 25 Apr 2023 13:07:56 GMT

Earlier today, April 25, 2023, researchers Pedro Umbelino at Bitsight and Marco Lux at Curesec published their discovery of CVE-2023-29552, a new DDoS reflection/amplification attack vector leveraging the SLP protocol. If you are a Cloudflare customer, your services are already protected from this new attack vector.

Service Location Protocol (SLP) is a “service discovery” protocol invented by Sun Microsystems in 1997. Like other service discovery protocols, it was designed to allow devices in a local area network to interact without prior knowledge of each other. SLP is a relatively obsolete protocol and has mostly been supplanted by more modern alternatives like UPnP, mDNS/Zeroconf, and WS-Discovery. Nevertheless, many commercial products still offer support for SLP.

Since SLP has no method for authentication, it should never be exposed to the public Internet. However, Umbelino and Lux have discovered that upwards of 35,000 Internet endpoints have their devices’ SLP service exposed and accessible to anyone. Additionally, they have discovered that the UDP version of this protocol has an amplification factor of up to 2,200x, which is the third largest discovered to-date.

Cloudflare expects the prevalence of SLP-based DDoS attacks to rise significantly in the coming weeks as malicious actors learn how to exploit this newly discovered attack vector.

Cloudflare customers are protected

If you are a Cloudflare customer, our automated DDoS protection system already protects your services from these SLP amplification attacks.To avoid being exploited to launch the attacks, if you are a network operator, you should ensure that you are not exposing the SLP protocol directly to the public Internet. You should consider blocking UDP port 427 via access control lists or other means. This port is rarely used on the public Internet, meaning it is relatively safe to block without impacting legitimate traffic. Cloudflare Magic Transit customers can use the Magic Firewall to craft and deploy such rules.

Oxy: Fish/Bumblebee/Splicer subsystems to improve reliability

Quang Luong — Thu, 20 Apr 2023 13:00:00 GMT

At Cloudflare, we are building proxy applications on top of Oxy that must be able to handle a huge amount of traffic. Besides high performance requirements, the applications must also be resilient against crashes or reloads. As the framework evolves, the complexity also increases. While migrating WARP to support soft-unicast (Cloudflare servers don't own IPs anymore), we needed to add different functionalities to our proxy framework. Those additions increased not only the code size but also resource usage and states required to be preserved between process upgrades.

To address those issues, we opted to split a big proxy process into smaller, specialized services. Following the Unix philosophy, each service should have a single responsibility, and it must do it well. In this blog post, we will talk about how our proxy interacts with three different services - Splicer (which pipes data between sockets), Bumblebee (which upgrades an IP flow to a TCP socket), and Fish (which handles layer 3 egress using soft-unicast IPs). Those three services help us to improve system reliability and efficiency as we migrated WARP to support soft-unicast.

Splicer

Most transmission tunnels in our proxy forward packets without making any modifications. In other words, given two sockets, the proxy just relays the data between them: read from one socket and write to the other. This is a common pattern within Cloudflare, and we reimplement very similar functionality in separate projects. These projects often have their own tweaks for buffering, flushing, and terminating connections, but they also have to coordinate long-running proxy tasks with their process restart or upgrade handling, too.

Turning this into a service allows other applications to send a long-running proxying task to Splicer. The applications pass the two sockets to Splicer and they will not need to worry about keeping the connection alive when restart. After finishing the task, Splicer will return the two original sockets and the original metadata attached to the request, so the original application can inspect the final state of the sockets - for example using TCP_INFO - and finalize audit logging if required.

Bumblebee

Many of Cloudflare’s on-ramps are IP-based (layer 3) but most of our services operate on TCP or UDP sockets (layer 4). To handle TCP termination, we want to create a kernel TCP socket from IP packets received from the client (and we can later forward this socket and an upstream socket to Splicer to proxy data between the eyeball and origin). Bumblebee performs the upgrades by spawning a thread in an anonymous network namespace with unshare syscall, NAT-ing the IP packets, and using a tun device there to perform TCP three-way handshakes to a listener. You can find a more detailed write-up on how we upgrade an IP flows to a TCP stream here.

In short, other services just need to pass a socket carrying the IP flow, and Bumblebee will upgrade it to a TCP socket, no user-space TCP stack involved! After the socket is created, Bumblebee will return the socket to the application requesting the upgrade. Again, the proxy can restart without breaking the connection as Bumblebee pipes the IP socket while Splicer handles the TCP ones.

Fish

Fish forwards IP packets using soft-unicast IP space without upgrading them to layer 4 sockets. We previously implemented packet forwarding on shared IP space using iptables and conntrack. However, IP/port mapping management is not simple when you have many possible IPs to egress from and variable port assignments. Conntrack is highly configurable, but applying configuration through iptables rules requires careful coordination and debugging iptables execution can be challenging. Plus, relying on configuration when sending a packet through the network stack results in arcane failure modes when conntrack is unable to rewrite a packet to the exact IP or port range specified.

Fish attempts to overcome this problem by rewriting the packets and configuring conntrack using the netlink protocol. Put differently, a proxy application sends a socket containing IP packets from the client, together with the desired soft-unicast IP and port range, to Fish. Then, Fish will ensure to forward those packets to their destination. The client’s choice of IP address does not matter; Fish ensures that egressed IP packets have a unique five-tuple within the root network namespace and performs the necessary packet rewriting to maintain this isolation. Fish’s internal state is also survived across restarts.

The Unix philosophy, manifest

To sum up what we are having so far: instead of adding the functionalities directly to the proxy application, we create smaller and reusable services. It becomes possible to understand the failure cases present in a smaller system and design it to exhibit reliable behavior. Then if we can remove the subsystems of a larger system, we can apply this logic to those subsystems. By focusing on making the smaller service work correctly, we improve the whole system's reliability and development agility.

Although those three services’ business logics are different, you can notice what they do in common: receive sockets, or file descriptors, from other applications to allow them to restart. Those services can be restarted without dropping the connection too. Let’s take a look at how graceful restart and file descriptor passing work in our cases.

File descriptor passing

We use Unix Domain Sockets for interprocess communication. This is a common pattern for inter-process communication. Besides sending raw data, unix sockets also allow passing file descriptors between different processes. This is essential for our architecture as well as graceful restarts.

There are two main ways to transfer a file descriptor: using pid_getfd syscall or SCM_RIGHTS. The latter is the better choice for us here as the use cases gear toward the proxy application “giving” the sockets instead of the microservices “taking” them. Moreover, the first method would require special permission and a way for the proxy to signal which file descriptor to take.

Currently we have our own internal library named hot-potato to pass the file descriptors around as we use stable Rust in production. If you are fine with using nightly Rust, you may want to consider the unix_socket_ancillary_data feature. The linked blog post above about SCM_RIGHTS also explains how that can be implemented. Still, we also want to add some “interesting” details you may want to know before using your SCM_RIGHTS in production:

There is a maximum number of file descriptors you can pass per messageThe limit is defined by the constant SCM_MAX_FD in the kernel. This is set to 253 since kernel version 2.6.38
Getting the peer credentials of a socket may be quite useful for observability in multi-tenant settings
A SCM_RIGHTS ancillary data forms a message boundary.
It is possible to send any file descriptors, not only socketsWe use this trick together with memfd_create to get around the maximum buffer size without implementing something like length-encoded frames. This also makes zero-copy message passing possible.

Graceful restart

We explored the general strategy for graceful restart in “Oxy: the journey of graceful restarts” blog. Let’s dive into how we leverage tokio and file descriptor passing to migrate all important states in the old process to the new one. We can terminate the old process almost instantly without leaving any connection behind.

Passing states and file descriptors

Applications like NGINX can be reloaded with no downtime. However, if there are pending requests then there will be lingering processes that handle those connections before they terminate. This is not ideal for observability. It can also cause performance degradation when the old processes start building up after consecutive restarts.

In three micro-services in this blog post, we use the state-passing concept, where the pending requests will be paused and transferred to the new process. The new process will pick up both new requests and the old ones immediately on start. This method indeed requires a higher complexity than keeping the old process running. At a high level, we have the following extra steps when the application receives an upgrade request (usually SIGHUP): pause all tasks, wait until all tasks (in groups) are paused, and send them to the new process.

WaitGroup using JoinSet

Problem statement: we dynamically spawn different concurrent tasks, and each task can spawn new child tasks. We must wait for some of them to complete before continuing.

In other words, tasks can be managed as groups. In Go, waiting for a collection of tasks to complete is a solved problem with WaitGroup. We discussed a way to implement WaitGroup in Rust using channels in a previous blog. There also exist crates like waitgroup that simply use AtomicWaker. Another approach is using JoinSet, which may make the code more readable. Considering the below example, we group the requests using a JoinSet.

    let mut task_group = JoinSet::new();

    loop {
        // Receive the request from a listener
        let Some(request) = listener.recv().await else {
            println!("There is no more request");
            break;
        };
        // Spawn a task that will process request.
        // This returns immediately
        task_group.spawn(process_request(request));
    }

    // Wait for all requests to be completed before continue
    while task_group.join_next().await.is_some() {}

However, an obvious problem with this is if we receive a lot of requests then the JoinSet will need to keep the results for all of them. Let’s change the code to clean up the JoinSet as the application processes new requests, so we have lower memory pressure

    loop {
        tokio::select! {
            biased; // This is optional

            // Clean up the JoinSet as we go
            // Note: checking for is_empty is important ?
            _task_result = task_group.join_next(), if !task_group.is_empty() => {}

            req = listener.recv() => {
                let Some(request) = req else {
                    println!("There is no more request");
                    break;
                };
                task_group.spawn(process_request(request));
            }
        }
    }

    while task_group.join_next().await.is_some() {}

Cancellation

We want to pass the pending requests to the new process as soon as possible once the upgrade signal is received. This requires us to pause all requests we are processing. In other terms, to be able to implement graceful restart, we need to implement graceful shutdown. The official tokio tutorial already covered how this can be achieved by using channels. Of course, we must guarantee the tasks we are pausing are cancellation-safe. The paused results will be collected into the JoinSet, and we just need to pass them to the new process using file descriptor passing.

For example, in Bumblebee, a paused state will include the environment’s file descriptors, client socket, and the socket proxying IP flow. We also need to transfer the current NAT table to the new process, which could be larger than the socket buffer. So the NAT table state is encoded into an anonymous file descriptor, and we just need to pass the file descriptor to the new process.

Conclusion

We considered how a complex proxy app can be divided into smaller components. Those components can run as new processes, allowing different life-times. Still, this type of architecture does incur additional costs: distributed tracing and inter-process communication. However, the costs are acceptable nonetheless considering the performance, maintainability, and reliability improvements. In the upcoming blog posts, we will talk about different debug tricks we learned when working with a large codebase with complex service interactions using tools like strace and eBPF.

Accelerate building resiliency into systems with Cloudflare Workers

Revathy Ramasundaram — Wed, 08 Mar 2023 14:00:00 GMT

In this blog post we’ll discuss how Cloudflare Workers enabled us to quickly improve the resiliency of a legacy system. In particular, we’ll discuss how we prevented the email notification systems within Cloudflare from outages caused by external vendors.

Email notification services

At Cloudflare, we send email notifications to customers such as billing invoices, password resets, OTP logins and certificate status updates. We rely on external Email Service Providers (ESPs) to deliver these emails to customers.

The following diagram shows how the system looks. Multiple services in our control plane dispatch emails through an external email vendor. They use HTTP Transmission APIs and also SMTP to send messages to the vendor. If dispatching an email fails, they are retried with exponential back-off mechanisms. Even when our ESP has outages, the retry mechanisms in place guarantee that we don’t lose any messages.

Why did we need to improve resilience?

In some cases, it isn’t sufficient to just deliver the email to the customer; it must be delivered on time. For example, OTP login emails are extremely time sensitive; their validity is short-lived such that a delay in sending them is as bad as not sending them at all. If the ESP is down, all emails including the critical ones will be delayed.

Some time ago our ESP suffered an outage; both the HTTP API and SMTP were unavailable and we couldn’t send emails to customers for a few hours. Several thousand emails couldn’t be delivered immediately and were queued. This led to a delay in clearing out the backlogged messages in the retry queue in addition to sending new messages.

Critical emails not being sent is a major incident for us. A major incident impacts our customers, our business, and also our reputation. We take serious measures to prevent it from happening again in the future. Hence improving the resiliency of this system was our top priority.

Our concrete action items to reduce the failure possibilities were to improve availability by onboarding another email vendor and, when ESP has an outage, mitigate downtime by failing over from one vendor to another.

We wanted to achieve the above as quickly as possible so that our customers are not impacted again.

Step 1 - Improve Availability

Introducing a new vendor gives us the option to failover. But before we introduced it we had to consider another key component of email deliverability, which is the email sender reputation. ISPs assign a sending score to any organization that sends emails. The higher the score, the higher the probability that the email sent by the organization reaches your inbox. Similarly the lower the score, the higher the probability it reaches your spam folder.

Adding a new vendor means emails will be delivered from a new IP address. Mailbox providers (such as Gmail or Hotmail) view new IP addresses as suspicious until they achieve a positive sending reputation. A positive sending reputation is determined by answering the question “did the recipient want this email?”. This is calculated differently for different mailbox providers, but the simplest measure would be that the recipient opened it, and perhaps clicked a link within. To build this reputation, it's recommended to use an IP warm-up strategy. This usually involves increasing e-mail volume slowly over a matter of days and weeks.

Using this new vendor “only” during failovers would increase the probability of emails reaching our customer’s spam folder for the reasons outlined above. To avoid this and establish a positive sending reputation we want to send a constant stream of traffic between the two vendors.

Solution

We came up with a weighted traffic routing module, an algorithm to route emails to the ESPs based on the ratio of weights assigned to them. This is how the algorithm works roughly:

This algorithm solves the problem of maintaining a constant stream of traffic between the two vendors and to failover when outages happen.

Our next big question was “where” to implement this module? How should this be implemented so that we can build, deploy, and test as quickly as possible? Since there are multiple teams involved, how can we make it quicker for all of them? This leads to a perfect segue to our next step.

Step 2 - Mitigate downtime

Let’s discuss a few approaches on where to build this logic. Should this just be a library? Or should it be a service in itself? Or do we have an even better way?

Design 1 - A simple solution

In this proposed solution, we would implement the weighted traffic routing module in a library and import it in every service. The pros of this approach is that there are no major changes to the overall existing architecture. However, it does come with quite a few drawbacks. Firstly, All of the services require a major code change and a considerable amount of development effort would have to be spent on this integration and testing end-to-end. Furthermore, updating libraries can be slow and there is no guarantee that teams would update in a timely manner. This means we may see teams still sending only to the old ESP during an incident, and we have not solved the problem we set out to.

Design 2 - Extract email routing decisions to its own microservice

The first approach is simple and doesn’t disrupt the architecture, but is not a sustainable solution in the long run. This naturally leads to our next design.

In this iteration, the weighted-traffic-routing module is extracted to its own service. This service decides where to route emails. Although it’s better than the previous approach, it is not without its drawbacks.

Firstly, the positives. This approach requires no major code changes in each service. As long as we honor the original ESP contract, team’s should simply need to update the base URL. Furthermore, we achieve the Single Responsibility Principle. When outages happen, the configurations have to be changed only in the weighted-traffic-routing service and this is the only service to be modified. Finally, we have much more control over retry semantics as they are implemented in a single place.

Some of the drawbacks of taking this approach are the time involved in spinning up a new service from scratch - building, testing, and shipping it is not insignificant. Secondly, we are adding an additional network hop; previously the email-notification-services talked directly to the email service providers, now we are proxying it through another service. Furthermore, since all the upstream requests are funneled through this service, it has to be load-balanced and scaled according to the incoming traffic and we need to ensure we service every request.

Design 3 - The Workers way

The final design we considered; Implementing the weighted traffic routing module in a Cloudflare Worker.

There are a lot of positives to doing it this way.

Firstly, minimal code changes are required in each service. In fact, nothing beyond the environment variables need to be updated. As with the previous design, when email vendor outages happen, only the Email Weighted Traffic Router Worker configuration has to be changed. One positive of using a Cloudflare Worker is when an environment variable is changed and deployed, it’s deployed in all of Cloudflare’s network in a few milliseconds. This is a significant advantage for us because it means the failover process takes only a few milliseconds; time-critical emails will reach our customers’ inbox on time, preventing major incidents.

Secondly, As the email traffic increases, we don’t have to care about scaling or load balancing the weighted-traffic-routing module. A Worker automatically takes care of it.

Finally, the operational overhead to building and running this is incredibly low. We don’t have to spend time on spinning up new containers, configuring and deploying them. The majority of the work is just to programme the algorithm that we discussed earlier.

There are still some drawbacks here though.We are still introducing a new service and an additional network hop. However, we can make use of the templates available here to make bootstrapping as efficient as possible.

Building it the Workers way

Weighing the pros and cons of the above approaches, we went with implementing it in a Cloudflare Worker.

The Email Traffic Router Worker sits in between the Cloudflare control plane and the ESPs and proxies requests between them. We stream raw responses from the upstream; the internal service is completely agnostic of multiple ESPs, and how the traffic is routed.

We configure the percentage of traffic we want to send to each vendor using environment variables. Cloudflare Workers provide environment variables which is a convenient way to configure a worker application.

addEventListener('fetch', async (event: FetchEvent) => {
try {
    const finalInstance = ratioBasedRandPicker(
      parseFloat(VENDOR1_TO_VENDOR2_RATIO),
      API_ENDPOINT_VENDOR1,
      API_ENDPOINT_VENDOR2,
    )
    // stream request and response
    return doRequest(request, finalInstance)
  } catch (err) {
       // handle errors
    }
})

export const ratioBasedRandPicker = (
  ratio: number,
  VENDOR1: string,
  VENDOR2: string,
  generator: () => number = Math.random,
): string => {
  if (isNaN(ratio) || ratio < 0 || ratio > 1) throw new ConfigurationError(`invalid ratio ${ratio}`)
  return generator() < ratio ? VENDOR1 : VENDOR2  
}

The Email Weighted Traffic Router Worker generates a random number between 0 (inclusive) and 1 with approximately uniform distribution over the range. When the ratio is set to 1, vendor 1 will be picked always, and similarly when it’s set to 0 vendor 2 will be picked all the time. For any values in between, the vendor is picked based on the weights assigned. Thus, when all the traffic has to be routed to vendor 1 we set the ratio to 1; the random numbers generated are always less than 1, hence vendor 1 will always be selected.

Doing a failover is as simple as updating the vendor1_to_vendor2 environment variable. In general, an environment variable for a Cloudflare Worker can be updated in one of the following ways:

Using Cloudflare Dashboard: All Worker environment variables are available to be configured in the Cloudflare Dashboard. Navigate to Workers -> Settings -> Variables -> Edit Variables and update the value

Using Wrangler (if your Worker is managed by wrangler): edit the environment variable in the wrangler.toml file and execute wrangler publish for the new values to kick in.

Once the variable is updated using one of the above methods, the Email Traffic Router Worker is deployed immediately in all Cloudflare data centers with this value. Hence all our emails will now be routed by the healthy vendor.

Now, we have built and shipped our weighted-traffic-routing logic in a Worker and building it in a Worker really accelerated our process of improving the resiliency.

The module is shipped, now what happens when the ESP is suffering an outage or a degraded performance? Let me conclude with a success story.

When an ESP outage happened

After deploying this Email Traffic Router Worker to production, we saw an outage declared by one of our vendors. We immediately configured our Traffic Router to send all traffic to the healthy vendor. Changing this configuration took about a few seconds, and deploying the change was even faster. Within a few milliseconds all the traffic was routed to the healthy instance, and all the critical emails reached their corresponding inboxes on time!

Looking at some timelines:

18:45 UTC - we get alerted that message delivery is slow/degraded from email provider 1. The other provider is not impacted

18:47 UTC - Teams within Cloudflare confirmed that their message outbound is experiencing delays

18:47 UTC - we decide to failover

18:50 UTC - we configured the Email Traffic Router Worker to dispatch all the traffic to the healthy email provider, deployed the changes

Starting 18:50 we saw a steady decrease in requests to email provider 1 and a steady uptick in the requests to the provider 2 instance.

Here’s our metrics dashboard:

Next steps

Currently, the failover step is manual. Our next step is to automate this process. Once we get notified that one of the vendors is experiencing outages, the Email Traffic Router Worker will failover automatically to the healthy vendor.

Conclusion

To reiterate our goals, we wanted to make our email notification systems resilient by

adding another vendor
having a simple and fast mechanism to failover when vendor outages happen
achieving the above two objectives as quickly as possible to reduce customer impact

Thus our email systems are more reliable, and more resilient to any outages or failures of our email service providers. Cloudflare Worker makes it easy and simple to build and ship a production-ready service in no time; it also makes failovers possible without any code changes or downtime. It’s powerful that multiple services/teams within Cloudflare can rely on one place, the Email Traffic Router Worker, for uninterrupted email delivery.

Cloudflare and COVID-19: Project Fair Shot Update

Brian Batraski — Thu, 29 Jul 2021 13:00:43 GMT

In February 2021, Cloudflare launched Project Fair Shot — a program that gave our Waiting Room product free of charge to any government, municipality, private/public business, or anyone responsible for the scheduling and/or dissemination of the COVID-19 vaccine.

By having our Waiting Room technology in front of the vaccine scheduling application, it ensured that:

Applications would remain available, reliable, and resilient against massive spikes of traffic for users attempting to get their vaccine appointment scheduled.
Visitors could wait for their long-awaited vaccine with confidence, arriving at a branded queuing page that provided accurate, estimated wait times.
Vaccines would get distributed equitably, and not just to folks with faster reflexes or Internet connections.

Since February, we’ve seen a good number of participants in Project Fair Shot. To date, we have helped more than 100 customers across more than 10 countries to schedule approximately 100 million vaccinations. Even better, these vaccinations went smoothly, with customers like the County of San Luis Obispo regularly dealing with more than 20,000 appointments in a day. “The bottom line is Cloudflare saved lives today. Our County will forever be grateful for your participation in getting the vaccine to those that need it most in an elegant, efficient and ethical manner” — Web Services Administrator for the County of San Luis Obispo.

We are happy to have helped not just in the US, but worldwide as well. In Canada, we partnered with a number of organizations and the Canadian government to increase access to the vaccine. One partner stated: “Our relationship with Cloudflare went from ‘Let's try Waiting Room’ to ‘Unless you have this, we're not going live with that public-facing site.'” — CEO of Verto Health. In another country in Europe, we saw over three million people go through the Waiting Room in less than 24 hours, leading to a significantly smoother and less stressful experience. Cities in Japan, — working closely with our partner, Classmethod — have been able to vaccinate over 40 million people and are on track to complete their vaccination process across 317 cities. If you want more stories from Project Fair Shot, check out our case studies.

A European customer seeing very high amounts of traffic during a vaccination event

We are continuing to add more customers to Project Fair Shot every day to ensure we are doing all that we can to help distribute more vaccines. With the emergence of the Delta variant and others, vaccine distribution (and soon, booster shots) is still very much a real problem to keep everyone healthy and resilient. Because of these new developments, Cloudflare will be extending Project Fair Shot until at least July 1, 2022. Though we are not excited to see the pandemic continue, we are humbled to be able to provide our services and be a critical part in helping us collectively move towards a better tomorrow.

Automatic Remediation of Kubernetes Nodes

Andrew DeMaria — Thu, 15 Jul 2021 12:59:42 GMT

We use Kubernetes to run many of the diverse services that help us control Cloudflare’s edge. We have five geographically diverse clusters, with hundreds of nodes in our largest cluster. These clusters are self-managed on bare-metal machines which gives us a good amount of power and flexibility in the software and integrations with Kubernetes. However, it also means we don’t have a cloud provider to rely on for virtualizing or managing the nodes. This distinction becomes even more prominent when considering all the different reasons that nodes degrade. With self-managed bare-metal machines, the list of reasons that cause a node to become unhealthy include:

Hardware failures
Kernel-level software failures
Kubernetes cluster-level software failures
Degraded network communication
Software updates are required
Resource exhaustion¹

Unhappy Nodes

We have plenty of examples of failures in the aforementioned categories, but one example has been particularly tedious to deal with. It starts with the following log line from the kernel:

unregister_netdevice: waiting for lo to become free. Usage count = 1

The issue is further observed with the number of network interfaces on the node owned by the Container Network Interface (CNI) plugin getting out of proportion with the number of running pods:

$ ip link | grep cali | wc -l
1088

This is unexpected as it shouldn't exceed the maximum number of pods allowed on a node (we use the default limit of 110). While this issue is interesting and perhaps worthy of a whole separate blog, the short of it is that the Linux network interfaces owned by the CNI are not getting cleaned up after a pod terminates.

Some history on this can be read in a Docker GitHub issue. We found this seems to plague nodes with a longer uptime, and after rebooting the node it would be fine for about a month. However, with a significant number of nodes, this was happening multiple times per day. Each instance would need rebooting, which means going through our worker reboot procedure which looked like this:

Cordon off the affected node to prevent new workloads from scheduling on it.
Collect any diagnostic information for later investigation.
Drain the node of current workloads.
Reboot and wait for the node to come back.
Verify the node is healthy.
Re-enable scheduling of new workloads to the node.

While solving the underlying issue would be ideal, we needed a mitigation to avoid toil in the meantime — an automated node remediation process.

Existing Detection and Remediation Solutions

While not complicated, the manual remediation process outlined previously became tedious and distracting, as we had to reboot nodes multiple times a day. Some manual intervention is unavoidable, but for cases matching the following, we wanted automation:

Generic worker nodes
Software issues confined to a given node
Already researched and diagnosed issues

Limiting automatic remediation to generic worker nodes is important as there are other node types in our clusters where more care is required. For example, for control-plane nodes the process has to be augmented to check etcd cluster health and ensure proper redundancy for components servicing the Kubernetes API. We are also going to limit the problem space to known software issues confined to a node where we expect automatic remediation to be the right answer (as in our ballooning network interface problem). With that in mind, we took a look at existing solutions that we could use.

Node Problem Detector

Node problem detector is a daemon that runs on each node that detects problems and reports them to the Kubernetes API. It has a pluggable problem daemon system such that one can add their own logic for detecting issues with a node. Node problems are distinguished between temporary and permanent problems, with the latter being persisted as status conditions on the Kubernetes node resources.²

Draino and Cluster-Autoscaler

Draino as its name implies, drains nodes but does so based on Kubernetes node conditions. It is meant to be used with cluster-autoscaler which then can add or remove nodes via the cluster plugins to scale node groups.

Kured

Kured is a daemon that looks at the presence of a file on the node to initiate a drain, reboot and uncordon of the given node. It uses a locking mechanism via the Kubernetes API to ensure only a single node is acted upon at a time.

Cluster-API

The Kubernetes cluster-lifecycle SIG has been working on the cluster-api project to enable declaratively defining clusters to simplify provisioning, upgrading, and operating multiple Kubernetes clusters. It has a concept of machine resources which back Kubernetes node resources and furthermore has a concept of machine health checks. Machine health checks use node conditions to determine unhealthy nodes and then the cluster-api provider is then delegated to replace that machine via create and delete operations.

Proof of Concept

Interestingly, with all the above except for Kured, there is a theme of pluggable components centered around Kubernetes node conditions. We wanted to see if we could build a proof of concept using the existing theme and solutions. For the existing solutions, draino with cluster-autoscaler didn’t make sense in a non-cloud environment like our bare-metal set up. The cluster-api health checks are interesting, however they require a more complete investment into the cluster-api project to really make sense. That left us with node-problem-detector and kured. Deploying node-problem-detector was simple, and we ended up testing a custom-plugin-monitor like the following:

apiVersion: v1
kind: ConfigMap
metadata:
  name: node-problem-detector-config
data:
  check_calico_interfaces.sh: |
    #!/bin/bash
    set -euo pipefail
    
    count=$(nsenter -n/proc/1/ns/net ip link | grep cali | wc -l)
    
    if (( $count > 150 )); then
      echo "Too many calico interfaces ($count)"
      exit 1
    else
      exit 0
    fi
  cali-monitor.json: |
    {
      "plugin": "custom",
      "pluginConfig": {
        "invoke_interval": "30s",
        "timeout": "5s",
        "max_output_length": 80,
        "concurrency": 3,
        "enable_message_change_based_condition_update": false
      },
      "source": "calico-custom-plugin-monitor",
      "metricsReporting": false,
      "conditions": [
        {
          "type": "NPDCalicoUnhealthy",
          "reason": "CalicoInterfaceCountOkay",
          "message": "Normal amount of interfaces"
        }
      ],
      "rules": [
        {
          "type": "permanent",
          "condition": "NPDCalicoUnhealthy",
          "reason": "TooManyCalicoInterfaces",
          "path": "/bin/bash",
          "args": [
            "/config/check_calico_interfaces.sh"
          ],
          "timeout": "3s"
        }
      ]
    }

Testing showed that when the condition became true, a condition would be updated on the associated Kubernetes node like so:

kubectl get node -o json worker1a | jq '.status.conditions[] | select(.type | test("^NPD"))'
{
  "lastHeartbeatTime": "2020-03-20T17:05:17Z",
  "lastTransitionTime": "2020-03-20T17:05:16Z",
  "message": "Too many calico interfaces (154)",
  "reason": "TooManyCalicoInterfaces",
  "status": "True",
  "type": "NPDCalicoUnhealthy"
}

With that in place, the actual remediation needed to happen. Kured seemed to do most everything we needed, except that it was looking at a file instead of Kubernetes node conditions. We hacked together a patch to change that and tested it successfully end to end — we had a working proof of concept!

Revisiting Problem Detection

While the above worked, we found that node-problem-detector was unwieldy because we were replicating our existing monitoring into shell scripts and node-problem-detector configuration. A 2017 blog post describes Cloudflare’s monitoring stack, although some things have changed since then. What hasn’t changed is our extensive usage of Prometheus and Alertmanager.

For the network interface issue and other issues we wanted to address, we already had the necessary exported metrics and alerting to go with them. Here is what our already existing alert looked like³:

- alert: CalicoTooManyInterfaces
  expr: sum(node_network_info{device=~"cali.*"}) by (node) >= 200
  for: 1h
  labels:
    priority: "5"
    notify: chat-sre-core chat-k8s

Note that we use a “notify” label to drive the routing logic in Alertmanager. However, that got us asking, could we just route this to a Kubernetes node condition instead?

Introducing Sciuro

Sciuro is our open-source replacement of node-problem-detector that has one simple job: synchronize Kubernetes node conditions with currently firing alerts in Alertmanager. Node problems can be defined with a more holistic view and using already existing exporters such as node exporter, cadvisor or mtail. It also doesn’t run on affected nodes which allows us to rely on out-of-band remediation techniques. Here is a high level diagram of how Sciuro works:

Starting from the top, nodes are scraped by Prometheus, which collects those metrics and fires relevant alerts to Alertmanager. Sciuro polls Alertmanager for alerts with a matching receiver, matches them with a corresponding node resource in the Kubernetes API and updates that node’s conditions accordingly.

In more detail, we can start by defining an alert in Prometheus like the following:

- alert: CalicoTooManyInterfacesEarly
  expr: sum(node_network_info{device=~"cali.*"}) by (node) >= 150
  labels:
    priority: "6"
    notify: node-condition-k8s

Note the two differences from the previous alert. First, we use a new name with a more sensitive trigger. The idea is that we want automatic node remediation to try fixing the node first as soon as possible, but if the problem worsens or automatic remediation is failing, humans will still get notified to act. The second difference is that instead of notifying chat rooms, we route to a target called “node-condition-k8s”.

Sciuro then comes into play, polling the Altertmanager API for alerts matching the “node-condition-k8s” receiver. The following shows the equivalent using amtool:

$ amtool alert query -r node-condition-k8s
Alertname                 	Starts At            	Summary                                                               	 
CalicoTooManyInterfacesEarly  2021-05-11 03:25:21 UTC  Kubernetes node worker1a has too many Calico interfaces

We can also check the labels for this alert:

$ amtool alert query -r node-condition-k8s -o json | jq '.[] | .labels'
{
  "alertname": "CalicoTooManyInterfacesEarly",
  "cluster": "a.k8s",
  "instance": "worker1a",
  "node": "worker1a",
  "notify": "node-condition-k8s",
  "priority": "6",
  "prometheus": "k8s-a"
}

Note the node and instance labels which Sciuro will use for matching with the corresponding Kubernetes node. Sciuro uses the excellent controller-runtime to keep track of and update node sources in the Kubernetes API. We can observe the updated node condition on the status field via kubectl:

$ kubectl get node worker1a -o json | jq '.status.conditions[] | select(.type | test("^AlertManager"))'
{
  "lastHeartbeatTime": "2021-05-11T03:31:20Z",
  "lastTransitionTime": "2021-05-11T03:26:53Z",
  "message": "[P6] Kubernetes node worker1a has too many Calico interfaces",
  "reason": "AlertIsFiring",
  "status": "True",
  "type": "AlertManager_CalicoTooManyInterfacesEarly"
}

One important note is Sciuro added the AlertManager_ prefix to the node condition type to prevent conflicts with other node condition types. For example, DiskPressure, a kubelet managed condition, could also be an alert name. Sciuro will also properly update heartbeat and transition times to reflect when it first saw the alert and its last update. With node conditions synchronized by Sciuro, remediation can take place via one of the existing tools. As mentioned previously we are using a modified version of Kured for now.

We’re happy to announce that we’ve open sourced Sciuro, and it can be found on GitHub where you can read the code, find the deployment instructions, or open a Pull Request for changes.

Managing Node Uptime

While we began using automatic node remediation for obvious problems, we’ve expanded its purpose to additionally keep node uptime low. Low node uptime is desirable to further reduce drift on nodes, keep the node initialization process well-oiled, and encourage the best deployment practices on the Kubernetes clusters. To expand on the last point, services that are deployed with best practices and in a high availability fashion should see negligible impact when a single node leaves the cluster. However, services that are not deployed with best practices will most likely have problems especially if they rely on singleton pods. By draining nodes more frequently, it introduces regular chaos that encourages best practices. To enable this with automatic node remediation the following alert was defined:

- alert: WorkerUptimeTooHigh
  expr: |
    (
      (
        (
              max by(node) (kube_node_role{role="worker"})
            - on(node) group_left()
              (max by(node) (kube_node_role{role!="worker"}))
          or on(node)
            max by(node) (kube_node_role{role="worker"})
        ) == 1
      )
    * on(node) group_left()
      (
        (time() - node_boot_time_seconds) > (60 * 60 * 24 * 7)
      )
    )
  labels:
    priority: "9"
    notify: node-condition-k8s

There is a bit of juggling with the kube_node_roles metric in the above to isolate the alert to generic worker nodes, but at a high level it looks at node_boot_time_seconds, a metric from prometheus node_exporter. Again the notify label is configured to send to node conditions which kicks off the automatic node remediation. One further detail is the priority here is set to “9” which is of lower precedence than our other alerts. Note that the message field of the node condition is prefixed with the alert priority in brackets. This allows the remediation process to take priority into account when choosing which node to remediate first, which is important because Kured uses a lock to act on a single node at a time.

Wrapping Up

In the past 30 days, we’ve used the above automatic node remediation process to action 571 nodes. That has saved our humans a considerable amount of time. We’ve also been able to reduce the time to repair for some issues as automatic remediation can act at all times of the day and with a faster response time.

As mentioned before, we’re open sourcing Sciuro and its code can be found on GitHub. We’re open to issues, suggestions, and pull requests. We do have some ideas for future improvements. For Sciuro, we may look to reduce latency which is mainly due to polling and potentially add a push model from Altermanager although this isn’t a need we’ve had yet. For the larger node remediation story, we hope to do an overhaul of the remediating component. As mentioned previously, we are currently using a fork of kured, but a future replacement component should include the following:

Use out-of-band management interfaces to be able to shut down and power on nodes without a functional operating system.
Move from decentralized architecture to a centralized one that can integrate more complicated logic. This might include being able to act on entire failure domains in parallel.
Handle specialized nodes such as masters or storage nodes.

Finally, we’re looking for more people passionate about Kubernetes to join our team. Come help us push Kubernetes to the next level to serve Cloudflare’s many needs!

¹Exhaustion can be applied to hardware resources, kernel resources, or logical resources like the amount of logging being produced.

²Nearly all Kubernetes objects have spec and status fields. The status field is used to describe the current state of an object. For node resources, typically the kubelet manages a conditions field under the status field for reporting things like if the node is ready for servicing pods. ³The format of the following alert is documented on Prometheus Alerting Rules.

Encrypt it or lose it: how encrypted SNI works

Alessandro Ghedini — Mon, 24 Sep 2018 12:01:00 GMT

Today we announced support for encrypted SNI, an extension to the TLS 1.3 protocol that improves privacy of Internet users by preventing on-path observers, including ISPs, coffee shop owners and firewalls, from intercepting the TLS Server Name Indication (SNI) extension and using it to determine which websites users are visiting.

Encrypted SNI, together with other Internet security features already offered by Cloudflare for free, will make it harder to censor content and track users on the Internet. Read on to learn how it works.

SNWhy?

The TLS Server Name Indication (SNI) extension, originally standardized back in 2003, lets servers host multiple TLS-enabled websites on the same set of IP addresses, by requiring clients to specify which site they want to connect to during the initial TLS handshake. Without SNI the server wouldn’t know, for example, which certificate to serve to the client, or which configuration to apply to the connection.

The client adds the SNI extension containing the hostname of the site it’s connecting to to the ClientHello message. It sends the ClientHello to the server during the TLS handshake. Unfortunately the ClientHello message is sent unencrypted, due to the fact that client and server don’t share an encryption key at that point.

TLS 1.3 with Unencrypted SNI

This means that an on-path observer (say, an ISP, coffee shop owner, or a firewall) can intercept the plaintext ClientHello message, and determine which website the client is trying to connect to. That allows the observer to track which sites a user is visiting.

But with SNI encryption the client encrypts the SNI even though the rest of the ClientHello is sent in plaintext.

TLS 1.3 with Encrypted SNI

So how come the original SNI couldn’t be encrypted before, but now it can? Where does the encryption key come from if client and server haven’t negotiated one yet?

If the chicken must come before the egg, where do you put the chicken?

As with many other Internet features the answer is simply “DNS”.

The server publishes a public key on a well-known DNS record, which can be fetched by the client before connecting (as it already does for A, AAAA and other records). The client then replaces the SNI extension in the ClientHello with an “encrypted SNI” extension, which is none other than the original SNI extension, but encrypted using a symmetric encryption key derived using the server’s public key, as described below. The server, which owns the private key and can derive the symmetric encryption key as well, can then decrypt the extension and therefore terminate the connection (or forward it to a backend server). Since only the client, and the server it’s connecting to, can derive the encryption key, the encrypted SNI cannot be decrypted and accessed by third parties.

It’s important to note that this is an extension to TLS version 1.3 and above, and doesn’t work with previous versions of the protocol. The reason is very simple: one of the changes introduced by TLS 1.3 (not without problems) meant moving the Certificate message sent by the server to the encrypted portion of the TLS handshake (before 1.3, it was sent in plaintext). Without this fundamental change to the protocol, an attacker would still be able to determine the identity of the server by simply observing the plaintext certificate sent on the wire.

The underlying cryptographic machinery involves using the Diffie-Hellman key exchange algorithm which allows client and server to generate a shared encryption key over an untrusted channel. The encrypted SNI encryption key is thus calculated on the client-side by using the server’s public key (which is actually the public portion of a Diffie-Hellman semi-static key share) and the private portion of an ephemeral Diffie-Hellman share generated by the client itself on the fly and discarded immediately after the ClientHello is sent to the server. Additional data (such as some of the cryptographic parameters sent by the client as part of its ClientHello message) is also mixed into the cryptographic process for good measure.

The client’s ESNI extension will then include, not only the actual encrypted SNI bits, but also the client’s public key share, the cipher suite it used for encryption and the digest of the server’s ESNI DNS record. On the other side, the server uses its own private key share, and the public portion of the client’s share to generate the encryption key and decrypt the extension.

While this may seem overly complicated, this ensures that the encryption key is cryptographically tied to the specific TLS session it was generated for, and cannot be reused across multiple connections. This prevents an attacker able to observe the encrypted extension sent by the client from simply capturing it and replaying it to the server in a separate session to unmask the identity of the website the user was trying to connect to (this is known as “cut-and-paste” attack).

However a compromise of the server’s private key would put all ESNI symmetric keys generated from it in jeopardy (which would allow observers to decrypt previously collected encrypted data), which is why Cloudflare’s own SNI encryption implementation rotates the server’s keys every hour to improve forward secrecy, but keeps track of the keys for the previous few hours to allow for DNS caching and replication delays, so that clients with slightly outdated keys can still use ESNI without problems (but eventually all keys are discarded and forgotten).

But wait, DNS? For real?

The observant reader might have realized that simply using DNS (which is, by default, unencrypted) would make the whole encrypted SNI idea completely pointless: an on-path observer would be able to determine which website the client is connecting to by simply observing the plaintext DNS queries sent by the client itself, whether encrypted SNI was used or not.

But with the introduction of DNS features such as DNS over TLS (DoT) and DNS over HTTPS (DoH), and of public DNS resolvers that provide those features to their users (such as Cloudflare’s own 1.1.1.1), DNS queries can now be encrypted and protected by the prying eyes of censors and trackers alike.

However, while responses from DoT/DoH DNS resolvers can be trusted, to a certain extent (evil resolvers notwithstanding), it might still be possible for a determined attacker to poison the resolver’s cache by intercepting its communication with the authoritative DNS server and injecting malicious data. That is, unless both the authoritative server and the resolver support DNSSEC_[1]. Incidentally, Cloudflare’s authoritative DNS servers can sign responses returned to resolvers, and the 1.1.1.1 resolver can verify them.

What about the IP address?

While both DNS queries and the TLS SNI extensions can now be protected by on-path attackers, it might still be possible to determine which websites users are visiting by simply looking at the destination IP addresses on the traffic originating from users’ devices. Some of our customers are protected by this to a certain degree thanks to the fact that many Cloudflare domains share the same sets of addresses, but this is not enough and more work is required to protect end users to a larger degree. Stay tuned for more updates from Cloudflare on the subject in the future.

Where do I sign up?

Encrypted SNI is now enabled for free on all Cloudflare zones using our name servers, so you don’t need to do anything to enable it on your Cloudflare website. On the browser side, our friends at Firefox tell us that they expect to add encrypted SNI support this week to Firefox Nightly (keep in mind that the encrypted SNI spec is still under development, so it’s not stable just yet).

By visiting encryptedsni.com you can check how secure your browsing experience is. Are you using secure DNS? Is your resolver validating DNSSEC signatures? Does your browser support TLS 1.3? Did your browser encrypt the SNI? If the answer to all those questions is “yes” then you can sleep peacefully knowing that your browsing is protected from prying eyes.

Conclusion

Encrypted SNI, along with TLS 1.3, DNSSEC and DoT/DoH, plugs one of the few remaining holes that enable surveillance and censorship on the Internet. More work is still required to get to a surveillance-free Internet, but we are (slowly) getting there.

[1]: It's important to mention that DNSSEC could be disabled by BGP route hijacking between a DNS resolver and the TLD server. Last week we announced our commitment to RPKI and if DNS resolvers and TLDs also implement RPKI, this type of hijacking will be much more difficult.

Subscribe to the blog for daily updates on all our Birthday Week announcements.

Encrypting SNI: Fixing One of the Core Internet Bugs

Matthew Prince — Mon, 24 Sep 2018 12:00:00 GMT

Cloudflare launched on September 27, 2010. Since then, we've considered September 27th our birthday. This Thursday we'll be turning 8 years old.

Ever since our first birthday, we've used the occasion to launch new products or services. Over the years we came to the conclusion that the right thing to do to celebrate our birthday wasn't so much about launching products that we could make money from but instead to do things that were gifts back to our users and the Internet in general. My cofounder Michelle wrote about this tradition in a great blog post yesterday.

Personally, one of my proudest moments at Cloudflare came on our birthday in 2014 when we made HTTPS support free for all our users. At the time, people called us crazy — literally and repeatedly. Frankly, internally we had significant debates about whether we were crazy since encryption was the primary reason why people upgraded from a free account to a paid account.

But it was the right thing to do. The fact that encryption wasn't built into the web from the beginning was, in our mind, a bug. Today, almost exactly four years later, the web is nearly 80% encrypted thanks to leadership from great projects like Let's Encrypt, the browser teams at Google, Apple, Microsoft, and Mozilla, and the fact that more and more hosting and SaaS providers have built in support for HTTPS at no cost. I'm proud of the fact that we were a leader in helping start that trend.

Today is another day I expect to look back on and be proud of because today we hope to help start a new trend to make the encrypted web more private and secure. To understand that, you have to understand a bit about why the encrypted web as exists today still leaks a lot of your browsing history.

How Private Is Your Browsing History?

The expectation when you visit a site over HTTPS is that no one listening on the line between you and where your connection terminates can see what you're doing. And to some extent, that's true. If you visit your bank's website, HTTPS is effective at keeping the contents sent to or from the site (for example, your username and password or the balance of your bank account) from being leaked to your ISP or anyone else monitoring your network connection.

While the contents sent to or received from a HTTPS site are protected, the fact that you visited the site can be observed easily in a couple of ways. Traditionally, one of these has been via DNS. DNS queries are, by default, unencrypted so your ISP or anyone else can see where you're going online. That's why last April, we launched 1.1.1.1 — a free (and screaming fast) public DNS resolver with support for DNS over TLS and DNS over HTTPS.

1.1.1.1 has been a huge success and we've significantly increased the percentage of DNS queries sent over an encrypted connection. Critics, however, rightly pointed out that the identity of the sites that you visit still can leak in other ways. The most problematic is something called the Server Name Indication (SNI) extension.

Why SNI?

Fundamentally, SNI exists in order to allow you to host multiple encrypted websites on a single IP address. Early browsers didn't include the SNI extension. As a result, when a request was made to establish a HTTPS connection the web server didn't have much information to go on and could only hand back a single SSL certificate per IP address the web server was listening on.

One solution to this problem was to create certificates with multiple Subject Alternative Names (SANs). These certificates would encrypt traffic for multiple domains that could all be hosted on the same IP. This is how Cloudflare handles HTTPS traffic from older browsers that don't support SNI. We limit that feature to our paying customers, however, for the same reason that SANs aren't a great solution: they're a hack, a pain to manage, and can slow down performance if they include too many domains.

The more scalable solution was SNI. The analogy that makes sense to me is to think of a postal mail envelope. The contents inside the envelope are protected and can't be seen by the postal carrier. However, outside the envelope is the street address which the postal carrier uses to bring the envelope to the right building. On the Internet, a web server's IP address is the equivalent of the street address.

However, if you live in a multi-unit building, a street address alone isn't enough to get the envelope to the right recipient. To supplement the street address you include an apartment number or recipient's name. That's the equivalent of SNI. If a web server hosts multiple domains, SNI ensures that a request is routed to the correct site so that the right SSL certificate can be returned to be able to encrypt and decrypt any content.

Nosey Networks

The specification for SNI was introduced by the IETF in 2003 and browsers rolled out support over the next few years. At the time, it seemed like an acceptable tradeoff. The vast majority of Internet traffic was unencrypted. Adding a TLS extension that made it easier to support encryption seemed like a great trade even if that extension itself wasn't encrypted.

But, today, as HTTPS covers nearly 80% of all web traffic, the fact that SNI leaks every site you go to online to your ISP and anyone else listening on the line has become a glaring privacy hole. Knowing what sites you visit can build a very accurate picture of who you are, creating both privacy and security risks.

In the United States, ISPs were briefly restricted in their ability to gather customer browsing data under FCC rules passed at the end of the Obama administration. ISPs, however, lobbied Congress and, in April 2017, President Trump signed a Congressional Resolution repealing those protections. As ISPs increasingly acquire media companies and ad targeting businesses, being able to mine the data flowing through their pipes is an increasingly attractive business for them and an increasingly troubling privacy threat to all of us.

Closing the SNI Privacy Hole

On May 3, about a month after we launched 1.1.1.1, I was reading a review of our new service. While the article praised the fact that 1.1.1.1 was privacy-oriented, it somewhat nihilistically concluded that it was all for naught because ISPs could still spy on you by monitoring SNI. Frustrated, I dashed off an email to some of Cloudflare's engineers and the senior team at Mozilla, who we'd been working on a project to help encrypt DNS. I concluded my email:

My simple PRD: if Firefox connects to a Cloudflare IP then we'd give you a public key to use to encrypt the SNI entry before sending it to us. How does it scale to other providers? Dunno, but we have to start somewhere. Rough consensus and running code, right?

It turned out to be a bit more complex than that. However, today I'm proud to announce that Encrypted SNI (ESNI) is live across Cloudflare's network. Later this week we expect Mozilla's Firefox to become the first browser to support the new protocol in their Nightly release. In the months to come, the plan is for it go mainstream. And it's not just Mozilla. There's been significant interest from all the major browser makers and I'm hopeful they'll all add support for ESNI over time.

Hoping to Start Another Trend

While we're the first to support ESNI, we haven't done this alone. We worked on ESNI with great teams from Apple, Fastly, Mozilla, and others across the industry who, like us, are concerned about Internet privacy. While Cloudflare is the first content network to support ESNI, this isn't a proprietary protocol. It's being worked on as an IETF Draft RFC and we are hopeful others will help us formalize the draft and implement the standard as well. If you're curious about the technical details behind ESNI, you can learn more from the great blog post just published by my colleague Alessandro Ghedini. Finally, when browser support starts to launch later this week you can test this from our handy ESNI testing tool.

Four years ago I'm proud that we helped start a trend that today has led to nearly all the web being encrypted. Today, I hope we are again helping start a trend — this time to make the encrypted web even more private and secure.

Subscribe to the blog for daily updates on all our Birthday Week announcements.

Expanding DNSSEC Adoption

Sergi Isasi — Tue, 18 Sep 2018 13:00:00 GMT

Cloudflare first started talking about DNSSEC in 2014 and at the time, Nick Sullivan wrote: “DNSSEC is a valuable tool for improving the trust and integrity of DNS, the backbone of the modern Internet.”

Over the past four years, it has become an even more critical part of securing the internet. While HTTPS has gone a long way in preventing user sessions from being hijacked and maliciously (or innocuously) redirected, not all internet traffic is HTTPS. A safer Internet should secure every possible layer between a user and the origin they are intending to visit.

As a quick refresher, DNSSEC allows a user, application, or recursive resolver to trust that the answer to their DNS query is what the domain owner intends it to be. Put another way: DNSSEC proves authenticity and integrity (though not confidentiality) of a response from the authoritative nameserver. Doing so makes it much harder for a bad actor to inject malicious DNS records into the resolution path through BGP Leaks and cache poisoning. Trust in DNS matters even more when a domain is publishing record types that are used to declare trust for other systems. As a specific example, DNSSEC is helpful for preventing malicious actors from obtaining fraudulent certificates for a domain. Research has shown how DNS responses can be spoofed for domain validation.

This week we are announcing our full support for CDS and CDNSKEY from RFC 8078. Put plainly: this will allow for setting up of DNSSEC without requiring the user to login to their registrar to upload a DS record. Cloudflare customers on supported registries will be able to enable DNSSEC with the click of one button in the Cloudflare dashboard.

Validation by Resolvers

DNSSEC’s largest problem has been adoption. The number of DNS queries validated by recursive resolvers for DNSSEC has remained flat. Worldwide, less than 14% of DNS requests have DNSSEC validated by the resolver according to our friends at APNIC. The blame here falls on the shoulders of the default DNS providers that most devices and users receive from DHCP via their ISP or network provider. Data shows that some countries do considerably better: Sweden, for example, has over 80% of their requests validated, showing that the default DNS resolvers in those countries validate the responses as they should. APNIC also has a fun interactive map so you can see how well your country does.

So what can we do? To ensure your resolver supports DNSSEC, visit brokendnssec.net in your browser. If the page loads, you are not protected by a DNSSEC validating resolver and should switch your resolver. However, in order to really move the needle across the internet, Cloudflare encourages network providers to either turn on the validation of DNSSEC in their software or switch to publicly available resolvers that validate DNSSEC by default. Of course we have a favourite, but there are other fine choices as well.

Signing of Zones

Validation handles the user side, but another problem has been the signing of the zones themselves. Initially, there was concern about adoption at the TLD level given that TLD support is a requirement for DNSSEC to work. This is now largely a non-issue with over 90% of TLDs signed with DS records in the root zone, as of 2018-08-27.

It’s a different story when it comes to the individual domains themselves. Per NIST data, a woefully low 3% of the Fortune 1000 sign their primary domains. Some of this is due to apathy by the domain owners. However, some large DNS operators do not yet support the option at all, requiring domain owners who want to protect their users to move to another provider altogether. If you are on a service that does not support DNSSEC, we encourage you to switch to one that does and let them know that was the reason for the switch. Other large operators, such as GoDaddy, charge for DNSSEC. Our stance here is clear: DNSSEC should be available and included at all DNS operators for free.

The DS Parent Issue

In December of 2017, APNIC wrote about why DNSSEC deployment remains so low and that remains largely true today. One key point was that the number of domain owners who attempt DNSSEC activation but do not complete it is very high. Using Cloudflare as an example, APNIC measured that 40% of those who enabled DNSSEC in the Cloudflare Dash (evidenced by the presence of a DNSKEY record) were actually successful in serving a DS key from the registry. Current data over a recent 90 day period is slightly better: we are seeing just over 50% of all zones which attempted to enable DNSSEC were able to complete the process with the registry (Note: these domains still resolve, they are just still not secured). Of our most popular TLDs, .be and .nl have success rates of over 70%, but these numbers are still not where we would want them to be in an ideal world. The graph below shows the specific rates for the most popular TLDs (most popular from left to right).

This end result is likely not surprising to anyone who has tried to add a DS record to their registrar. Locating the part of the registrar UI that houses DNSSEC can be problematic, as can the UI of adding the record itself. Additional factors such as varying degrees of technical knowledge amongst users and simply having to manage multiple logins and roles can also explain the lack of completion in the process. Finally, varying levels of DNSSEC compatibility amongst registrars may prevent even knowledgeable users from creating DS records in the parent.

As an example, at Cloudflare, we took a minimalist UX approach for adding DS records for delegated child domains. A novice user may not understand the fields and requirements for the DS record:

CDS and CDNSKEY

As mentioned in the aforementioned APNIC blog, Cloudflare is supportive of RFC 8078 and the CDS and CDNSKEY records. This should come as no surprise given that our own Olafur Gudmundsson is a co-author of the RFC. CDS and CDNSKEY are records that mirror the DS and DNSKEY record types but are designated to signal the parent/registrar that the child domain wishes to enable DNSSEC and have a DS record presented by the registry. We have been pushing for automated solutions in this space for years and are encouraging the industry to move with us.

Today, we are announcing General Availability and full support for CDS and CDNSKEY records for all Cloudflare managed domains that enable DNSSEC in the Cloudflare dash.

How It Works

Cloudflare will publish CDS and CDNSKEY records for all domains who enable DNSSEC. Parent registries should scan the nameservers of the domains under their purview and check for these rrsets. The presence of a CDS key for a domain delegated to Cloudflare indicates that a verified Cloudflare user has enabled DNSSEC within their dash and that the parent operator (a registrar or the registry itself) should take the CDS record content and create the requisite DS record to start signing the domain. TLDs .ch and .cz already support this automated method through Cloudflare and any other DNS operators that choose to support RFC8078. The registrar Gandi and a number of TLDs have indicated support in the near future.

Cloudflare also supports CDS0 for the removal of the DS record in the case that the user transfers their domain off of Cloudflare or otherwise disables DNSSEC.

Best Practices for Parent Operators

Below are a number of suggested procedures that parent registries may take to provide for the best experience for our users:

Scan Selection - Parent Operators should only scan their child domains who have nameservers pointed at Cloudflare (or other DNS operators who adopt RFC8078). Cloudflare nameservers are indicated *.ns.cloudflare.com.
Scan Regularly - Parent Operators should scan at regular intervals for the presence and change of CDS records. A scan every 12 hours should be sufficient, though faster is better.
Notify Domain Contacts - Parent Operators should notify their designated contacts through known channels (such as email and/or SMS) for a given child domain upon detection of a new CDS record and an impending change of their DS record. The Parent Operator may also wish to provide a standard delay (24 hours) before changing the DS record to allow the domain contact to cancel or otherwise change the operation.
Verify Success - Parent Operators must ensure that the domain continues to resolve after being signed. Should the domain fail to resolve immediately after changing the DS record, the Parent Operator must fall back to the previous functional state and should notify designated contacts.

What Does This All Mean and What’s Next?

For Cloudflare customers, this means an easier implementation of DNSSEC once your registry/registrar supports CDS and CDNSKEY. Customers can also enable DNSSEC for free on Cloudflare and manually enter the DS to the parent. To check your domain’s DNSSEC status, DNSViz (example cloudflare.com) has one of the most standards compliant tools online.

For registries and registrars, we are taking this step with the hope that more providers support RFC8078 and help increase the global adoption of technology that helps end users be less vulnerable to DNS attacks on the internet.

For other DNS operators, we encourage you to join us in supporting this method as the more major DNS operators that publish CDS and CDNSKEY, the more likely it will be that the registries will start looking for and use them.

Cloudflare will continue pushing down this path and has plans to create and open source additional tools to help registries and operators push and consume records. If this sounds interesting to you, we are hiring.

Subscribe to the blog for daily updates on our announcements.

Welcome to Crypto Week

Nick Sullivan — Mon, 17 Sep 2018 13:00:00 GMT

The Internet is an amazing invention. We marvel at how it connects people, connects ideas, and makes the world smaller. But the Internet isn’t perfect. It was put together piecemeal through publicly funded research, private investment, and organic growth that has left us with an imperfect tapestry. It’s also evolving. People are constantly developing creative applications and finding new uses for existing Internet technology. Issues like privacy and security that were afterthoughts in the early days of the Internet are now supremely important. People are being tracked and monetized, websites and web services are being attacked in interesting new ways, and the fundamental system of trust the Internet is built on is showing signs of age. The Internet needs an upgrade, and one of the tools that can make things better is cryptography.

Every day this week, Cloudflare will be announcing support for a new technology that uses cryptography to make the Internet better (hint: subscribe to the blog to make sure you don't miss any of the news). Everything we are announcing this week is free to use and provides a meaningful step towards supporting a new capability or structural reinforcement. So why are we doing this? Because it’s good for the users and good for the Internet. Welcome to Crypto Week!

Day 1: Distributed Web Gateway

Day 2: DNSSEC

Expanding DNSSEC Adoption

Day 3: RPKI

Day 4: Onion Routing

Introducing the Cloudflare Onion Service

Day 5: Roughtime

Roughtime: Securing Time with Digital Signatures

A more trustworthy Internet

Everything we do online depends on a relationship between users, services, and networks that is supported by some sort of trust mechanism. These relationships can be physical (I plug my router into yours), contractual (I paid a registrar for this domain name), or reliant on a trusted third party (I sent a message to my friend on iMessage via Apple). The simple act of visiting a website involves hundreds of trust relationships, some explicit and some implicit. The sheer size of the Internet and number of parties involved make trust online incredibly complex. Cryptography is a tool that can be used to encode and enforce, and most importantly scale these trust relationships.

To illustrate this, let’s break down what happens when you visit a website. But before we can do this, we need to know the jargon.

Autonomous Systems (100 thousand or so active): An AS corresponds to a network provider connected to the Internet. Each has a unique Autonomous System Number (ASN).
IP ranges (1 million or so): Each AS is assigned a set of numbers called IP addresses. Each of these IP addresses can be used by the AS to identify a computer on its network when connecting to other networks on the Internet. These addresses are assigned by the Regional Internet Registries (RIR), of which there are 5. Data sent from one IP address to another hops from one AS to another based on a “route” that is determined by a protocol called BGP.
Domain names (>1 billion): Domain names are the human-readable names that correspond to Internet services (like “cloudflare.com” or “mail.google.com”). These Internet services are accessed via the Internet by connecting to their IP address, which can be obtained from their domain name via the Domain Name System (DNS).
Content (infinite): The main use case of the Internet is to enable the transfer of specific pieces of data from one point on the network to another. This data can be of any form or type.

When you type a website such as blog.cloudflare.com into your browser, a number of things happen. First, a (recursive) DNS service is contacted to get the IP address of the site. This DNS server configured by your ISP when you connect to the Internet, or it can be a public service such as 1.1.1.1 or 8.8.8.8. A query to the DNS service travels from network to network along a path determined by BGP announcements. If the recursive DNS server does not know the answer to the query, then it contacts the appropriate authoritative DNS services, starting with a root DNS server, down to a top level domain server (such as com or org), down to the DNS server that is authoritative for the domain. Once the DNS query has been answered, the browser sends an HTTP request to its IP address (traversing a sequence of networks), and in response, the server sends the content of the website.

So what’s the problem with this picture? For one, every DNS query and every network hop needs to be trusted in order to trust the content of the site. Any DNS query could be modified, a network could advertise an IP that belongs to another network, and any machine along the path could modify the content. When the Internet was small, there were mechanisms to combat this sort of subterfuge. Network operators had a personal relationship with each other and could punish bad behavior, but given the number of networks in existence almost 400,000 as of this week this is becoming difficult to scale.

Cryptography is a tool that can encode these trust relationships and make the whole system reliant on hard math rather than physical handshakes and hopes.

Building a taller tower of turtles

Attribution 2.0 Generic (CC BY 2.0)

The two main tools that cryptography provides to help solve this problem are cryptographic hashes and digital signatures.

A hash function is a way to take any piece of data and transform it into a fixed-length string of data, called a digest or hash. A hash function is considered cryptographically strong if it is computationally infeasible (read: very hard) to find two inputs that result in the same digest, and that changing even one bit of the input results in a completely different digest. The most popular hash function that is considered secure is SHA-256, which has 256-bit outputs. For example, the SHA-256 hash of the word “crypto” is

DA2F073E06F78938166F247273729DFE465BF7E46105C13CE7CC651047BF0CA4

And the SHA-256 hash of “crypt0” is

7BA359D3742595F38347A0409331FF3C8F3C91FF855CA277CB8F1A3A0C0829C4

The other main tool is digital signatures. A digital signature is a value that can only be computed by someone with a private key, but can be verified by anyone with the corresponding public key. Digital signatures are way for a private key holder to “sign,” or attest to the authenticity of a specific message in a way that anyone can validate it.

These two tools can be combined to solidify the trust relationships on the Internet. By giving private keys to the trusted parties who are responsible for defining the relationships between ASs, IPs, domain names and content, you can create chains of trust that can be publicly verified. Rather than hope and pray, these relationships can be validated in real time at scale.

Let’s take our webpage loading example and see where digital signatures can be applied.

Routing. Time-bound delegation of trust is defined through a system called the RPKI. RPKI defines an object called a Resource Certificate, an attestation that states that a given IP range belongs to a specific ASN for this period of time, digitally signed by the RIR responsible for assigning the IP range. Networks share routes via BGP, and if a route is advertised for an IP that does not conform the the Resource Certificate, the network can choose not to accept it.

DNS. Adding cryptographic assurance to routing is powerful, but if a network adversary can change the content of the data (such as the DNS responses), then the system is still at risk. DNSSEC is a system built to provide a trusted link between names and IP addresses. The root of trust in DNSSEC is the DNS root key, which is managed with an elaborate signing ceremony.

HTTPS. When you connect to a site, not only do you want it to be coming from the right host, you also want the content to be private. The Web PKI is a system that issues certificates to sites, allowing you to bind the domain name to a time-bounded private key. Because there are many CAs, additional accountability systems like certificate transparency need to be involved to help keep the system in check.

This cryptographic scaffolding turns the Internet into an encoded system of trust. With these systems in place, Internet users no longer need to trust every network and party involved in this diagram, they only need to trust the RIRs, DNSSEC and the CAs (and know the correct time).

This week we’ll be making some announcements that help strengthen this system of accountability.

Privacy and integrity with friends

The Internet is great because it connects us to each other, but the details of how it connects us are important. The technical choices made when Internet was designed come with some interesting human implications.

One implication is trackability. Your IP address is contained on every packet you send over the Internet. This acts as a unique identifier for anyone (corporations, governments, etc.) to track what you do online. Furthermore, if you connect to a server, that server’s identity is sent in plaintext on the request even over HTTPS, revealing your browsing patterns to any intermediary who cares to look.

Another implication is malleability. Resources on the Internet are defined by where they are, not what they are. If you want to go to CNN or BBC, then you connect to the server for cnn.com or bbc.co.uk and validate the certificate to make sure it’s the right site. But once you’ve made the connection, there’s no good way to know that the actual content is what you expect it to be. If the server is hacked, it could send you anything, including dangerous malicious code. HTTPS is a secure pipe, but there’s no inherent way to make sure what gets sent through the pipe is what you expect.

Trackability and malleability are not inherent features of interconnectedness. It is possible to design networks that don’t have these downsides. It is also possible to build new networks with better characteristic on top of the existing Internet. The key ingredient is cryptography.

Tracking-resilient networking

One of the networks built on top of the Internet that provides good privacy properties is Tor. The Tor network is run by a group of users who allow their computers to be used to route traffic for other members of the network. Using cryptography, it is possible to route traffic from one place to another without points along the path knowing both the source and the destination at the same time. This is called onion routing because it involves multiple layers of encryption, like an onion. Traffic coming out of the Tor network is “anonymous” because it could have come from anyone connected to the network. Everyone just blends in, making it hard to track individuals.

Similarly, web services can use onion routing to serve content inside the Tor network without revealing their location to visitors. Instead of using a hostname to identify their network location, so-called onion services use a cryptographic public key as their address. There are hundreds of onion services in use, including the one we use for 1.1.1.1 or the one in use by Facebook.

Troubles occur at the boundary between Tor network and the rest of the Internet. This is especially true for user is attempting to access services that rely on abuse prevention mechanisms based on reputation. Since Tor is used by both privacy-conscious users and malicious bots, connections from both get lumped together and as the expression goes, one bad apple ruins the bunch. This unfortunately exposes legitimate visitors to anti-abuse mechanisms like CAPTCHAs. Tools like Privacy Pass help reduce this burden but don’t eliminate it completely. This week we’ll be announcing a new way to improve this situation.

Bringing integrity to content delivery

Let’s revisit the issue of malleability: the fact that you can’t always trust the other side of a connection to send you the content you expect. There are technologies that allow users to insure the integrity of content without trusting the server. One such technology is a feature of HTML called Subresource Integrity (SRI). SRI allows a webpage with sub-resources (like a script or stylesheet) to embed a unique cryptographic hash into the page so that when the sub-resource is loaded, it is checked to see that is matches the expected value. This protects the site from loading unexpected scripts from third parties, a known attack vector.

Another idea is to flip this on its head: what if instead of fetching a piece of content from a specific location on the network, you asked the network to find piece of content that matches a given hash? By assigning resources based on their actual content rather than by location it’s possible to create a network in which you can fetch content from anywhere on the network and still know it’s authentic. This idea is called content addressing and there are networks built on top of the Internet that use it. These content addressed networks, based on protocols such IPFS and DAT, are blazing a trail new trend in Internet applications called the Distributed Web. With the Distributed Web applications, malleability is no longer an issue, opening up a new set of possibilities.

Combining strengths

Networks based on cryptographic principles, like Tor and IPFS, have one major downside compared to networks based on names: usability. Humans are exceptionally bad at remembering or distinguishing between cryptographically-relevant numbers. Take, for instance, the New York Times’ onion address:

https://www.nytimes3xbfgragh.onion/

This is would easily confused with similar-looking onion addresses, such as

https://www.nytimes3xfkdbgfg.onion/

which may be controlled by a malicious actor.

Content addressed networks are even worse from the perspective of regular people. For example, there is a snapshot of the Turkish version of Wikipedia on IPFS with the hash:

QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX

Try typing this hash into your browser without making a mistake.

These naming issues are things Cloudflare is perfectly positioned to help solve.First, by putting the hash address of an IPFS site in the DNS (and adding DNSSEC for trust) you can give your site a traditional hostname while maintaining a chain of trust.

Second, by enabling browsers to use a traditional DNS name to access the web through onion services, you can provide safer access to your site for Tor user with the added benefit of being better able to distinguish between bots and humans.With Cloudflare as the glue, is is possible to connect both standard internet and tor users to web sites and services on both the traditional web with the distributed web.

This is the promise of Crypto Week: using cryptographic guarantees to make a stronger, more trustworthy and more private internet without sacrificing usability.

Happy Crypto Week

In conclusion, we’re working on many cutting-edge technologies based on cryptography and applying them to make the Internet better. The first announcement today is the launch of Cloudflare's Distributed Web Gateway and browser extension. Keep tabs on the Cloudflare blog for exciting updates as the week progresses.

I’m very proud of the team’s work on Crypto Week, which was made possible by the work of a dedicated team, including several brilliant interns. If this type of work is interesting to you, Cloudflare is hiring for the crypto team and others!

Fixing an old hack - why we are bumping the IPv6 MTU

Marek Majkowski — Mon, 10 Sep 2018 09:21:25 GMT

Back in 2015 we deployed ECMP routing - Equal Cost Multi Path - within our datacenters. This technology allowed us to spread traffic heading to a single IP address across multiple physical servers.

You can think about it as a third layer of load balancing.

First we split the traffic across multiple IP addresses with DNS.
Then we split the traffic across multiple datacenters with Anycast.
Finally, we split the traffic across multiple servers with ECMP.

photo by Sahra by-sa/2.0

When deploying ECMP we hit a problem with Path MTU discovery. The ICMP packets destined to our Anycast IP's were being dropped. You can read more about that (and the solution) in the 2015 blog post Path MTU Discovery in practice.

To solve the problem we created a small piece of software, called pmtud (https://github.com/cloudflare/pmtud). Since deploying pmtud, our ECMP setup has been working smoothly.

Hardcoding IPv6 MTU

During that initial ECMP rollout things were broken. To keep services running until pmtud was done, we deployed a quick hack. We reduced the MTU of IPv6 traffic to the minimal possible value: 1280 bytes.

This was done as a tag on a default route. This is how our routing table used to look:

$ ip -6 route show
...
default via 2400:xxxx::1 dev eth0 src 2400:xxxx:2  metric 1024  mtu 1280

Notice the mtu 1280 in the default route.

With this setting our servers never transmitted IPv6 packets larger than 1280 bytes, therefore "fixing" the issue. Since all IPv6 routers must have an MTU of at least 1280, we could expect that no ICMP Packet-Too-Big message would ever be sent to us.

Remember - the original problem introduced by ECMP was that ICMP routed back to our Anycast addresses could go to a wrong machine within the ECMP group. Therefore we became ICMP black holes. Cloudflare would send large packets, they would be dropped with ICMP PTB packet flying back to us. Which, in turn would fail to be delivered to the right machine due to ECMP.

But why did this problem not appear for IPv4 traffic? We believe the same issue exists on IPv4, but it's less damaging due to the different nature of the network. IPv4 is more mature and the great majority of end-hosts support either MTU 1500 or have their MSS option well configured - or clamped by some middle box. This is different in IPv6 where a large proportion of users use tunnels, have Path MTU strictly smaller than 1500 and use incorrect MSS settings in the TCP header. Finally, Linux implements RFC4821 for IPv4 but not IPv6. RFC4821 (PLPMTUD) has its disadvantages, but does slightly help to alleviate the ICMP blackhole issue.

Our "fix" - reducing the MTU to 1280 - was serving us well and we had no pressing reason to revert it.

Researchers did notice though. We were caught red-handed twice:

In 2017 Geoff Huston noticed (pdf) that we sent DNS fragments of 1280 only (older blog post).
In June 2018 the paper "Exploring usable Path MTU in the Internet" (pdf) mentioned our weird setting - where we can accept 1500 bytes just fine, but transmit is limited to 1280.

When small MTU is too small

photo by NH53 by/2.0

This changed recently, when we started working on Cloudflare Spectrum support for UDP. Spectrum is a terminating proxy, able to handle protocols other than HTTP. Getting Spectrum to forward TCP was relatively straightforward (barring couple of awesome hacks). UDP is different.

One of the major issues we hit was related to the MTU on our servers.

During tests we wanted to forward UDP VPN packets through Spectrum. As you can imagine, any VPN would encapsulate a packet in another packet. Spectrum received packets like this:

 +---------------------+------------------------------------------------+
 +  IPv6 + UDP header  |  UDP payload encapsulating a 1280 byte packet  |
 +---------------------+------------------------------------------------+

It's pretty obvious, that our edge servers supporting IPv6 packets of max 1280 bytes won't be able to handle this type of traffic. We are going to need at least 1280+40+8 bytes MTU! Hardcoding MTU=1280 in IPv6 may be acceptable solution if you are an end-node on the internet, but is definitely too small when forwarding tunneled traffic.

Picking a new MTU

But what MTU value should we use instead? Let's see what other major internet companies do. Here is a couple of examples of advertised MSS values in TCP SYN+ACK packets over IPv6:

+---- site -----+- MSS --+- estimated MTU -+
| google.com    |  1360  |    1420         |
+---------------+--------+-----------------+
| facebook.com  |  1410  |    1470         |
+---------------+--------+-----------------+
| wikipedia.org |  1440  |    1500         |
+---------------+--------+-----------------+

I believe Google and Facebook adjust their MTU due to their use of L4 load balancers. Their implementations do IP-in-IP encapsulation so need a bit of space for the header. Read more:

Google - Maglev
Facebook - Katran

There may be other reasons for having a smaller MTU. A reduced value may decrease the probability of the Path MTU detection algorithm kicking in (ie: relying on ICMP PTB). We can theorize that for the misconfigured eyeballs:

MTU=1280 will never run Path MTU detection
MTU=1500 will always run it.
In-between values would have increasing different chances of hitting the problem.

But just what is the chance of that?

A quick unscientific study of the MSS values we encountered from eyeballs shows the following distributions. For connections going over IPv4:

IPv4 eyeball advertised MSS in SYN:
 value |-------------------------------------------------- count cummulative
  1300 |                                                 *  1.28%   98.53%
  1360 |                                              ****  4.40%   95.68%
  1370 |                                                 *  1.15%   91.05%
  1380 |                                               ***  3.35%   89.81%
  1400 |                                          ********  7.95%   84.79%
  1410 |                                                 *  1.17%   76.66%
  1412 |                                              ****  4.58%   75.49%
  1440 |                                            ******  6.14%   65.71%
  1452 |                                      ************ 11.50%   58.94%
  1460 |************************************************** 47.09%   47.34%

Assuming the majority of clients have MSS configured right, we can say that 89.8% of connections advertised MTU=1380+40=1420 or higher. 75% had MTU >= 1452.

For IPv6 connections we saw:

IPv6 eyeball advertised MSS in SYN:
 value |-------------------------------------------------- count cummulative
  1220 |                                               ***  4.21%   99.96%
  1340 |                                                **  3.11%   93.23%
  1362 |                                                 *  1.31%   87.70%
  1368 |                                               ***  3.38%   86.36%
  1370 |                                               ***  4.24%   82.98%
  1380 |                                               ***  3.52%   78.65%
  1390 |                                                 *  2.11%   75.10%
  1400 |                                               ***  3.89%   72.25%
  1412 |                                               ***  3.64%   68.21%
  1420 |                                                 *  2.02%   64.54%
  1440 |************************************************** 54.31%   54.34%

On IPv6 87.7% connections had MTU >= 1422 (1362+60). 75% had MTU >= 1450. (See also: MTU distribution of DNS servers).

Interpretation

Before we move on it's worth reiterating the original problem. Each connection from an eyeball to our Anycast network has three numbers related to it:

Client advertised MTU - seen in MSS option in TCP header
True Path MTU value - generally unknown until measured
Our Edge server MTU - value we are trying to optimize in this exercise

(This is a slight simplification, paths on the internet aren't symmetric so the path from eyeball to Cloudflare could have different Path MTU than the reverse path.)

In order for the connection to misbehave, three conditions must be met:

Client advertised MTU must be "wrong", that is: larger than True Path MTU
Our edge server must be willing to send such large packets: Edge server MTU >= True Path MTU
The ICMP PTB messages must fail to be delivered to our edge server - preventing Path MTU detection from working.

The last condition could occur for one of the reasons:

the routers on the path are misbehaving and perhaps firewalling ICMP
due to the asymmetric nature of the internet the ICMP back is routed to the wrong Anycast datacenter
something is wrong on our side, for example pmtud process fails

In the past we limited our Edge Server MTU value to the smallest possible, to make sure we never encounter the problem. Due to the development of Spectrum UDP support we must increase the Edge Server MTU, while still minimizing the probability of the issue happening.

Finally, relying on ICMP PTB messages for a large fraction of traffic is a bad idea. It's easy to imagine the cost this induces: even with Path MTU detection working fine, the affected connection will suffer a hiccup. A couple of large packets will be dropped before the reverse ICMP will get through and reconfigure the saved Path MTU value. This is not optimal for latency.

Progress

In recent days we increased the IPv6 MTU. As part of the process we could have chosen 1300, 1350, or 1400. We choose 1400 because we think it's the next best value to use after 1280. With 1400 we believe 93.2% of IPv6 connections will not need to rely on Path MTU Detection/ICMP. In the near future we plan to increase this value further. We won't settle on 1500 though - we want to leave a couple of bytes for IPv4 encapsulation, to allow the most popular tunnels to keep working without suffering poor latency when Path MTU Detection kicks in.

Since the rollout we've been monitoring Icmp6InPktTooBigs counters:

$ nstat -az | grep Icmp6InPktTooBigs
Icmp6InPktTooBigs               738748             0.0

Here is a chart of the ICMP PTB packets we received over last 7 days. You can clearly see that when the rollout started, we saw a large increase in PTB ICMP messages (Y label - packet count - deliberately obfuscated):

Interestingly the majority of the ICMP packets are concentrated in our Frankfurt datacenter:

We estimate that in our Frankfurt datacenter, we receive ICMP PTB message on 2 out of every 100 IPv6 TCP connections. These seem to come from only a handful of ASNs:

AS6830 - Liberty Global Operations B.V.
AS20825- Unitymedia NRW GmbH
AS31334 - Vodafone Kabel Deutschland GmbH
AS29562 - Kabel BW GmbH

These networks send to us ICMP PTB messages, usually informing that their MTU is 1280. For example:

$ sudo tcpdump -tvvvni eth0 icmp6 and ip6[40+0]==2 
IP6 2a02:908:xxx > 2400:xxx ICMP6, packet too big, mtu 1280, length 1240
IP6 2a02:810d:xx > 2400:xxx ICMP6, packet too big, mtu 1280, length 1240
IP6 2001:ac8:xxx > 2400:xxx ICMP6, packet too big, mtu 1390, length 1240

Final thoughts

Finally, if you are an IPv6 user with a weird MTU and have misconfigured MSS - basically if you are doing tunneling - please let us know of any issues. We know that debugging MTU issues is notoriously hard. To aid that we created an online fragmentation and ICMP delivery test. You can run it:

IPv6 version: http://icmpcheckv6.popcount.org
(for completeness, we also have an IPv4 version)

If you are a server operator running IPv6 applications, you should not worry. In most cases leaving the MTU at default 1500 is a good choice and should work for the majority of connections. Just remember to allow ICMP PTB packets on the firewall and you should be good. If you serve variety of IPv6 users and need to optimize latency, you may consider choosing a slightly smaller MTU for outbound packets, to reduce the risk of relying on Path MTU Detection / ICMP.

Low level network tuning sound interesting? Join our world famous team in London, Austin, San Francisco, Champaign and our elite office in Warsaw, Poland.

The Cloudflare Blog

When DNSSEC goes wrong: how we responded to the .de TLD outage

How DNSSEC works

What we saw

Serve stale

Our mitigation

Negative Trust Anchors

What we actually deployed

Origin resolution mitigations

Extended DNS Errors

Is this a failure of DNSSEC as a technology?

#HugOps

Takeaways from this incident

Making Rust Workers reliable: panic and abort recovery in wasm‑bindgen

Initial recovery mitigations

Implementing panic=unwind with WebAssembly Exception Handling

Abort recovery

Extension: abort reinitialization for wasm-bindgen libraries

Maturing the Rust Wasm Exception Handling ecosystem

Using panic unwind in Rust Workers

Committing to Rust Workers stability

How Workers powers our internal maintenance scheduling pipeline

Building a system to de-risk critical maintenance operations

Maintenance constraints

Graph processing on Workers

Fetch pipeline

Thanos in real-time

Historical data analysis

Building for scale

Vulnerability transparency: strengthening security through responsible disclosure

Why transparency in vulnerability reporting matters

What is a CVE?

What is a CNA?

Cloudflare CVE issuance process

How does Cloudflare disclose a CVE?

Outcomes

CVE-2024-1765: quiche: Memory Exhaustion Attack using post-handshake CRYPTO frames

CVE-2024-0212: Cloudflare WordPress plugin enables information disclosure of Cloudflare API (for low-privilege users)

CVE-2023-2754 - Plaintext transmission of DNS requests in Windows 1.1.1.1 WARP client

CVE-2025-0651 - Improper privilege management allows file manipulations

CVE-2025-23419: TLS client authentication can be bypassed due to ticket resumption (disclosed Cloudflare impact via blog post)

Conclusion

Scaling with safety: Cloudflare's approach to global service health metrics and software releases

Making it work at scale

Recording rules

Distributed query processing

Congestion control

Columnar experiments

Conclusion

Demonstrating reduction of vulnerability classes: a key step in CISA’s “Secure by Design” pledge

The core philosophy that continues: prevent, not patch

Cloudflare’s techniques for addressing these vulnerabilities

The role of custom rulesets and “build break”

Outcomes

Future plans

Conclusion

Improving platform resilience at Cloudflare through automation

Growing pains

Solving the problem

Building an Automatic Remediation System

Step one: we need a coordinator

Step two: Task Routing is amazing

Step three: when/how to self-heal?

Step four: packaging and deployment

Step five: test, test, test

Deploying to production

Looking to the future

Changing the industry with CISA’s Secure by Design principles

What do “secure by design” and “secure by default” mean?

Why does it matter?

How does Cloudflare support secure by design principles?

Taking ownership of customer security outcomes

Application hardening

Zero Trust Security

Application features

Application default settings

Embracing radical transparency and accountability

The Cloudflare blog

Cloudflare System Status

Technical transparency for code integrity

Implementing `panic=unwind` with WebAssembly Exception Handling