The Cloudflare Blog

Remediating new DNSSEC resource exhaustion vulnerabilities

Vicky Shrestha — Thu, 29 Feb 2024 14:00:57 GMT

Cloudflare has been part of a multivendor, industry-wide effort to mitigate two critical DNSSEC vulnerabilities. These vulnerabilities exposed significant risks to critical infrastructures that provide DNS resolution services. Cloudflare provides DNS resolution for anyone to use for free with our public resolver 1.1.1.1 service. Mitigations for Cloudflare’s public resolver 1.1.1.1 service were applied before these vulnerabilities were disclosed publicly. Internal resolvers using unbound (open source software) were upgraded promptly after a new software version fixing these vulnerabilities was released.

All Cloudflare DNS infrastructure was protected from both of these vulnerabilities before they were disclosed and is safe today. These vulnerabilities do not affect our Authoritative DNS or DNS firewall products.

All major DNS software vendors have released new versions of their software. All other major DNS resolver providers have also applied appropriate mitigations. Please update your DNS resolver software immediately, if you haven’t done so already.

Background

Domain name system (DNS) security extensions, commonly known as DNSSEC, are extensions to the DNS protocol that add authentication and integrity capabilities. DNSSEC uses cryptographic keys and signatures that allow DNS responses to be validated as authentic. DNSSEC protocol specifications have certain requirements that prioritize availability at the cost of increased complexity and computational cost for the validating DNS resolvers. The mitigations for the vulnerabilities discussed in this blog require local policies to be applied that relax these requirements in order to avoid exhausting the resources of validators.

The design of the DNS and DNSSEC protocols follows the Robustness principle: “be conservative in what you do, be liberal in what you accept from others”. There have been many vulnerabilities in the past that have taken advantage of protocol requirements following this principle. Malicious actors can exploit these vulnerabilities to attack DNS infrastructure, in this case by causing additional work for DNS resolvers by crafting DNSSEC responses with complex configurations. As is often the case, we find ourselves having to create a pragmatic balance between the flexibility that allows a protocol to adapt and evolve and the need to safeguard the stability and security of the services we operate.

Cloudflare’s public resolver 1.1.1.1 is a privacy-centric public resolver service. We have been using stricter validations and limits aimed at protecting our own infrastructure in addition to shielding authoritative DNS servers operated outside our network. As a result, we often receive complaints about resolution failures. Experience shows us that strict validations and limits can impact availability in some edge cases, especially when DNS domains are improperly configured. However, these strict validations and limits are necessary to improve the overall reliability and resilience of the DNS infrastructure.

The vulnerabilities and how we mitigated them are described below.

Keytrap vulnerability (CVE-2023-50387)

Introduction

A DNSSEC signed zone can contain multiple keys (DNSKEY) to sign the contents of a DNS zone and a Resource Record Set (RRSET) in a DNS response can have multiple signatures (RRSIG). Multiple keys and signatures are required to support things like key rollover, algorithm rollover, and multi-signer DNSSEC. DNSSEC protocol specifications require a validating DNS resolver to try every possible combination of keys and signatures when validating a DNS response.

During validation, a resolver looks at the key tag of every signature and tries to find the associated key that was used to sign it. A key tag is an unsigned 16-bit number calculated as a checksum over the key’s resource data (RDATA). Key tags are intended to allow efficient pairing of a signature with the key which has supposedly created it. However, key tags are not unique, and it is possible that multiple keys can have the same key tag. A malicious actor can easily craft a DNS response with multiple keys having the same key tag together with multiple signatures, none of which might validate. A validating resolver would have to try every combination (number of keys multiplied by number of signatures) when trying to validate this response. This increases the computational cost of the validating resolver many-fold, degrading performance for all its users. This is known as the Keytrap vulnerability.

Variations of this vulnerability include using multiple signatures with one key, using one signature with multiple keys having colliding key tags, and using multiple keys with corresponding hashes added to the parent delegation signer record.

Mitigation

We have limited the maximum number of keys we will accept at a zone cut. A zone cut is where a parent zone delegates to a child zone, e.g. where the .com zone delegates cloudflare.com to Cloudflare nameservers. Even with this limit already in place and various other protections built for our platform, we realized that it would still be computationally costly to process a malicious DNS answer from an authoritative DNS server.

To address and further mitigate this vulnerability, we added a signature validations limit per RRSET and a total signature validations limit per resolution task. One resolution task might include multiple recursive queries to external authoritative DNS servers in order to answer a single DNS question. Clients queries exceeding these limits will fail to resolve and will receive a response with an Extended DNS Error (EDE) code 0. Furthermore, we added metrics which will allow us to detect attacks attempting to exploit this vulnerability.

NSEC3 iteration and closest encloser proof vulnerability (CVE-2023-50868)

Introduction

NSEC3 is an alternative approach for authenticated denial of existence. You can learn more about authenticated denial of existence here. NSEC3 uses a hash derived from DNS names instead of the DNS names directly in an attempt to prevent zone enumeration and the standard supports multiple iterations for hash calculations. However, because the full DNS name is used as input to the hash calculation, increasing hashing iterations beyond the initial doesn’t provide any additional value and is not recommended in RFC9276. This complication is further inflated while finding the closest enclosure proof. A malicious DNS response from an authoritative DNS server can set a high NSEC3 iteration count and long DNS names with multiple DNS labels to exhaust the computing resources of a validating resolver by making it do unnecessary hash computations.

Mitigation

For this vulnerability, we applied a similar mitigation technique as we did for Keytrap. We added a limit for total hash calculations per resolution task to answer a single DNS question. Similarly, clients queries exceeding this limit will fail to resolve and will receive a response with an EDE code 27. We also added metrics to track hash calculations allowing early detection of attacks attempting to exploit this vulnerability.

Timeline

Date and time in UTC	Event
2023-11-03 16:05	John Todd from Quad9 invites Cloudflare to participate in a joint task force to discuss a new DNS vulnerability.
2023-11-07 14:30	A group of DNS vendors and service providers meet to discuss the vulnerability during IETF 118. Discussions and collaboration continues in a closed chat group hosted at DNS-OARC
2023-12-08 20:20	Cloudflare public resolver 1.1.1.1 is fully patched to mitigate Keytrap vulnerability (CVE-2023-50387)
2024-01-17 22:39	Cloudflare public resolver 1.1.1.1 is fully patched to mitigate NSEC3 iteration count and closest encloser vulnerability (CVE-2023-50868)
2024-02-13 13:04	Unbound package is released
2024-02-13 23:00	Cloudflare internal CDN resolver is fully patched to mitigate both CVE-2023-50387 and CVE-2023-50868

Credits

We would like to thank Elias Heftrig, Haya Schulmann, Niklas Vogel, Michael Waidner from the German National Research Center for Applied Cybersecurity ATHENE, for discovering the Keytrap vulnerability and doing a responsible disclosure.

We would like to thank Petr Špaček from Internet Systems Consortium (ISC) for discovering the NSEC3 iteration and closest encloser proof vulnerability and doing a responsible disclosure.

We would like to thank John Todd from Quad9 and the DNS Operations Analysis and Research Center (DNS-OARC) for facilitating coordination amongst various stakeholders.

And finally, we would like to thank the DNS-OARC community members, representing various DNS vendors and service providers, who all came together and worked tirelessly to fix these vulnerabilities, working towards a common goal of making the internet resilient and secure.

How Rust and Wasm power Cloudflare's 1.1.1.1

Anbang Wen — Tue, 28 Feb 2023 14:00:00 GMT

On April 1, 2018, Cloudflare announced the 1.1.1.1 public DNS resolver. Over the years, we added the debug page for troubleshooting, global cache purge, 0 TTL for zones on Cloudflare, Upstream TLS, and 1.1.1.1 for families to the platform. In this post, we would like to share some behind the scenes details and changes.

When the project started, Knot Resolver was chosen as the DNS resolver. We started building a whole system on top of it, so that it could fit Cloudflare's use case. Having a battle tested DNS recursive resolver, as well as a DNSSEC validator, was fantastic because we could spend our energy elsewhere, instead of worrying about the DNS protocol implementation.

Knot Resolver is quite flexible in terms of its Lua-based plugin system. It allowed us to quickly extend the core functionality to support various product features, like DoH/DoT, logging, BPF-based attack mitigation, cache sharing, and iteration logic override. As the traffic grew, we reached certain limitations.

Lessons we learned

Before going any deeper, let’s first have a bird’s-eye view of a simplified Cloudflare data center setup, which could help us understand what we are going to talk about later. At Cloudflare, every server is identical: the software stack running on one server is exactly the same as on another server, only the configuration may be different. This setup greatly reduces the complexity of fleet maintenance.

Figure 1 Data center layout

The resolver runs as a daemon process, kresd, and it doesn’t work alone. Requests, specifically DNS requests, are load-balanced to the servers inside a data center by Unimog. DoH requests are terminated at our TLS terminator. Configs and other small pieces of data can be delivered worldwide by Quicksilver in seconds. With all the help, the resolver can concentrate on its own goal - resolving DNS queries, and not worrying about transport protocol details. Now let’s talk about 3 key areas we wanted to improve here - blocking I/O in plugins, a more efficient use of cache space, and plugin isolation.

Callbacks blocking the event loop

Knot Resolver has a very flexible plugin system for extending its core functionality. The plugins are called modules, and they are based on callbacks. At certain points during request processing, these callbacks will be invoked with current query context. This gives a module the ability to inspect, modify, and even produce requests / responses. By design, these callbacks are supposed to be simple, in order to avoid blocking the underlying event loop. This matters because the service is single threaded, and the event loop is in charge of serving many requests at the same time. So even just one request being held up in a callback means that no other concurrent requests can be progressed until the callback finishes.

The setup worked well enough for us until we needed to do blocking operations, for example, to pull data from Quicksilver before responding to the client.

Cache efficiency

As requests for a domain could land on any node inside a data center, it would be wasteful to repetitively resolve a query when another node already has the answer. By intuition, the latency could be improved if the cache could be shared among the servers, and so we created a cache module which multicasted the newly added cache entries. Nodes inside the same data center could then subscribe to the events and update their local cache.

The default cache implementation in Knot Resolver is LMDB. It is fast and reliable for small to medium deployments. But in our case, cache eviction shortly became a problem. The cache itself doesn’t track for any TTL, popularity, etc. When it’s full, it just clears all the entries and starts over. Scenarios like zone enumeration could fill the cache with data that is unlikely to be retrieved later.

Furthermore, our multicast cache module made it worse by amplifying the less useful data to all the nodes, and led them to the cache high watermark at the same time. Then we saw a latency spike because all the nodes dropped the cache and started over around the same time.

Module isolation

With the list of Lua modules increasing, debugging issues became increasingly difficult. This is because a single Lua state is shared among all the modules, so one misbehaving module could affect another. For example, when something went wrong inside the Lua state, like having too many coroutines, or being out of memory, we got lucky if the program just crashed, but the resulting stack traces were hard to read. It is also difficult to forcibly tear down, or upgrade, a running module as it not only has state in the Lua runtime, but also FFI, so memory safety is not guaranteed.

Hello BigPineapple

We didn’t find any existing software that would meet our somewhat niche requirements, so eventually we started building something ourselves. The first attempt was to wrap Knot Resolver's core with a thin service written in Rust (modified edgedns).

This proved to be difficult due to having to constantly convert between the storage, and C/FFI types, and some other quirks (for example, the ABI for looking up records from cache expects the returned records to be immutable until the next call, or the end of the read transaction). But we learned a lot from trying to implement this sort of split functionality where the host (the service) provides some resources to the guest (resolver core library), and how we would make that interface better.

In the later iterations, we replaced the entire recursive library with a new one based around an async runtime; and a redesigned module system was added to it, sneakily rewriting the service into Rust over time as we swapped out more and more components. That async runtime was tokio, which offered a neat thread pool interface for running both non-blocking and blocking tasks, as well as a good ecosystem for working with other crates (Rust libraries).

After that, as the futures combinators became tedious, we started converting everything to async/await. This was before the async/await feature that landed in Rust 1.39, which led us to use nightly (Rust beta) for a while and had some hiccups. When the async/await stabilized, it enabled us to write our request processing routine ergonomically, similar to Go.

All the tasks can be run concurrently, and certain I/O heavy ones can be broken down into smaller pieces, to benefit from a more granular scheduling. As the runtime executes tasks on a threadpool, instead of a single thread, it also benefits from work stealing. This avoids a problem we previously had, where a single request taking a lot of time to process, that blocks all the other requests on the event loop.

Figure 2 Components overview

Finally, we forged a platform that we are happy with, and we call it BigPineapple. The figure above shows an overview of its main components and the data flow between them. Inside BigPineapple, the server module gets inbound requests from the client, validates and transforms them into unified frame streams, which can then be processed by the worker module. The worker module has a set of workers, whose task is to figure out the answer to the question in the request. Each worker interacts with the cache module to check if the answer is there and still valid, otherwise it drives the recursor module to recursively iterate the query. The recursor doesn’t do any I/O, when it needs anything, it delegates the sub-task to the conductor module. The conductor then uses outbound queries to get the information from upstream nameservers. Through the whole process, some modules can interact with the Sandbox module, to extend its functionality by running the plugins inside.

Let’s look at some of them in more detail, and see how they helped us overcome the problems we had before.

Updated I/O architecture

A DNS resolver can be seen as an agent between a client and several authoritative nameservers: it receives requests from the client, recursively fetches data from the upstream nameservers, then composes the responses and sends them back to the client. So it has both inbound and outbound traffic, which are handled by the server and the conductor component respectively.

The server listens on a list of interfaces using different transport protocols. These are later abstracted into streams of “frames”. Each frame is a high level representation of a DNS message, with some extra metadata. Underneath, it can be a UDP packet, a segment of TCP stream, or the payload of a HTTP request, but they are all processed the same way. The frame is then converted into an asynchronous task, which in turn is picked up by a set of workers in charge of resolving these tasks. The finished tasks are converted back into responses, and sent back to the client.

This “frame” abstraction over the protocols and their encodings simplified the logic used to regulate the frame sources, such as enforcing fairness to prevent starving and controlling pacing to protect the server from being overwhelmed. One of the things we’ve learned with the previous implementations is that, for a service open to the public, a peak performance of the I/O matters less than the ability to pace clients fairly. This is mainly because the time and computational cost of each recursive request is vastly different (for example a cache hit from a cache miss), and it’s difficult to guess it beforehand. The cache misses in recursive service not only consume Cloudflare’s resources, but also the resources of the authoritative nameservers being queried, so we need to be mindful of that.

On the other side of the server is the conductor, which manages all the outbound connections. It helps to answer some questions before reaching out to the upstream: Which is the fastest nameserver to connect to in terms of latency? What to do if all the nameservers are not reachable? What protocol to use for the connection, and are there any better options? The conductor is able to make these decisions by tracking the upstream server’s metrics, such as RTT, QoS, etc. With that knowledge, it can also guess for things like upstream capacity, UDP packet loss, and take necessary actions, e.g. retry when it thinks the previous UDP packet didn’t reach the upstream.

Figure 3 I/O conductor

Figure 3 shows a simplified data flow about the conductor. It is called by the exchanger mentioned above, with upstream requests as input. The requests will be deduplicated first: meaning in a small window, if a lot of requests come to the conductor and ask for the same question, only one of them will pass, the others are put into a waiting queue. This is common when a cache entry expires, and can reduce unnecessary network traffic. Then based on the request and upstream metrics, the connection instructor either picks an open connection if available, or generates a set of parameters. With these parameters, the I/O executor is able to connect to the upstream directly, or even take a route via another Cloudflare data center using our Argo Smart Routing technology!

The cache

Caching in a recursive service is critical as a server can return a cached response in under one millisecond, while it will be hundreds of milliseconds to respond on a cache miss. As the memory is a finite resource (and also a shared resource in Cloudflare’s architecture), more efficient use of space for cache was one of the key areas we wanted to improve. The new cache is implemented with a cache replacement data structure (ARC), instead of a KV store. This makes good use of the space on a single node, as less popular entries are progressively evicted, and the data structure is resistant to scans.

Moreover, instead of duplicating the cache across the whole data center with multicast, as we did before, BigPineapple is aware of its peer nodes in the same data center, and relays queries from one node to another if it cannot find an entry in its own cache. This is done by consistent hashing the queries onto the healthy nodes in each data center. So, for example, queries for the same registered domain go through the same subset of nodes, which not only increases the cache hit ratio, but also helps the infrastructure cache, which stores information about performance and features of nameservers.

Figure 4 Updated data center layout

Async recursive library

The recursive library is the DNS brain of BigPineapple, as it knows how to find the answer to the question in the query. Starting from the root, it breaks down the client query into subqueries, and uses them to collect knowledge recursively from various authoritative nameservers on the internet. The product of this process is the answer. Thanks to the async/await it can be abstracted as a function like such:

async fn resolve(Request, Exchanger) → Result;

The function contains all the logic necessary to generate a response to a given request, but it doesn’t do any I/O on its own. Instead, we pass an Exchanger trait (Rust interface) that knows how to exchange DNS messages with upstream authoritative nameservers asynchronously. The exchanger is usually called at various await points - for example, when a recursion starts, one of the first things it does is that it looks up the closest cached delegation for the domain. If it doesn’t have the final delegation in cache, it needs to ask what nameservers are responsible for this domain and wait for the response, before it can proceed any further.

Thanks to this design, which decouples the “waiting for some responses” part from the recursive DNS logic, it is much easier to test by providing a mock implementation of the exchanger. In addition, it makes the recursive iteration code (and DNSSEC validation logic in particular) much more readable, as it’s written sequentially instead of being scattered across many callbacks.

Fun fact: writing a DNS recursive resolver from scratch is not fun at all!

Not only because of the complexity of DNSSEC validation, but also because of the necessary “workarounds” needed for various RFC incompatible servers, forwarders, firewalls, etc. So we ported deckard into Rust to help test it. Additionally, when we started migrating over to this new async recursive library, we first ran it in “shadow” mode: processing real world query samples from the production service, and comparing differences. We’ve done this in the past on Cloudflare’s authoritative DNS service as well. It is slightly more difficult for a recursive service due to the fact that a recursive service has to look up all the data over the Internet, and authoritative nameservers often give different answers for the same query due to localization, load balancing and such, leading to many false positives.

In December 2019, we finally enabled the new service on a public test endpoint (see the announcement) to iron out remaining issues before slowly migrating the production endpoints to the new service. Even after all that, we continued to find edge cases with the DNS recursion (and DNSSEC validation in particular), but fixing and reproducing these issues has become much easier due to the new architecture of the library.

Sandboxed plugins

Having the ability to extend the core DNS functionality on the fly is important for us, thus BigPineapple has its redesigned plugin system. Before, the Lua plugins run in the same memory space as the service itself, and are generally free to do what they want. This is convenient, as we can freely pass memory references between the service and modules using C/FFI. For example, to read a response directly from cache without having to copy to a buffer first. But it is also dangerous, as the module can read uninitialized memory, call a host ABI using a wrong function signature, block on a local socket, or do other undesirable things, in addition the service doesn’t have a way to restrict these behaviors.

So we looked at replacing the embedded Lua runtime with JavaScript, or native modules, but around the same time, embedded runtimes for WebAssembly (Wasm for short) started to appear. Two nice properties of WebAssembly programs are that it allows us to write them in the same language as the rest of the service, and that they run in an isolated memory space. So we started modeling the guest/host interface around the limitations of WebAssembly modules, to see how that would work.

BigPineapple’s Wasm runtime is currently powered by Wasmer. We tried several runtimes over time like Wasmtime, WAVM in the beginning, and found Wasmer was simpler to use in our case. The runtime allows each module to run in its own instance, with an isolated memory and a signal trap, which naturally solved the module isolation problem we described before. In addition to this, we can have multiple instances of the same module running at the same time. Being controlled carefully, the apps can be hot swapped from one instance to another without missing a single request! This is great because the apps can be upgraded on the fly without a server restart. Given that the Wasm programs are distributed via Quicksilver, BigPineapple’s functionality can be safely changed worldwide within a few seconds!

To better understand the WebAssembly sandbox, several terms need to be introduced first:

Host: the program which runs the Wasm runtime. Similar to a kernel, it has full control through the runtime over the guest applications.
Guest application: the Wasm program inside the sandbox. Within a restricted environment, it can only access its own memory space, which is provided by the runtime, and call the imported Host calls. We call it an app for short.
Host call: the functions defined in the host that can be imported by the guest. Comparable to syscall, it’s the only way guest apps can access the resources outside the sandbox.
Guest runtime: a library for guest applications to easily interact with the host. It implements some common interfaces, so an app can just use async, socket, log and tracing without knowing the underlying details.

Now it’s time to dive into the sandbox, so stay awhile and listen. First let’s start from the guest side, and see what a common app lifespan looks like. With the help of the guest runtime, guest apps can be written similar to regular programs. So like other executables, an app begins with a start function as an entrypoint, which is called by the host upon loading. It is also provided with arguments as from the command line. At this point, the instance normally does some initialization, and more importantly, registers callback functions for different query phases. This is because in a recursive resolver, a query has to go through several phases before it gathers enough information to produce a response, for example a cache lookup, or making subrequests to resolve a delegation chain for the domain, so being able to tie into these phases is necessary for the apps to be useful for different use cases. The start function can also run some background tasks to supplement the phase callbacks, and store global state. For example - report metrics, or pre-fetch shared data from external sources, etc. Again, just like how we write a normal program.

But where do the program arguments come from? How could a guest app send log and metrics? The answer is, external functions.

Figure 5 Wasm based Sandbox

In figure 5, we can see a barrier in the middle, which is the sandbox boundary, that separates the guest from the host. The only way one side can reach out to the other, is via a set of functions exported by the peer beforehand. As in the picture, the “hostcalls” are exported by the host, imported and called by the guest; while the “trampoline” are guest functions that the host has knowledge of.

It is called trampoline because it is used to invoke a function or a closure inside a guest instance that’s not exported. The phase callbacks are one example of why we need a trampoline function: each callback returns a closure, and therefore can’t be exported on instantiation. So a guest app wants to register a callback, it calls a host call with the callback address “hostcall_register_callback(pre_cache, #30987)”, when the callback needs to be invoked, the host cannot just call that pointer as it’s pointing to the guest’s memory space. What it can do instead is, to leverage one of the aforementioned trampolines, and give it the address of the callback closure: “trampoline_call(#30987)”.

Isolation overhead

Like a coin that has two sides, the new sandbox does come with some additional overhead. The portability and isolation that WebAssembly offers bring extra cost. Here, we'll list two examples.

Firstly, guest apps are not allowed to read host memory. The way it works is the guest provides a memory region via a host call, then the host writes the data into the guest memory space. This introduces a memory copy that would not be needed if we were outside the sandbox. The bad news is, in our use case, the guest apps are supposed to do something on the query and/or the response, so they almost always need to read data from the host on every single request. The good news, on the other hand, is that during a request life cycle, the data won’t change. So we pre-allocate a bulk of memory in the guest memory space right after the guest app instantiates. The allocated memory is not going to be used, but instead serves to occupy a hole in the address space. Once the host gets the address details, it maps a shared memory region with the common data needed by the guest into the guest’s space. When the guest code starts to execute, it can just access the data in the shared memory overlay, and no copy is needed.

Another issue we ran into was when we wanted to add support for a modern protocol, oDoH, into BigPineapple. The main job of it is to decrypt the client query, resolve it, then encrypt the answers before sending it back. By design, this doesn’t belong to core DNS, and should instead be extended with a Wasm app. However, the WebAssembly instruction set doesn’t provide some crypto primitives, such as AES and SHA-2, which prevents it from getting the benefit of host hardware. There is ongoing work to bring this functionality to Wasm with WASI-crypto. Until then, our solution for this is to simply delegate the HPKE to the host via host calls, and we already saw 4x performance improvements, compared to doing it inside Wasm.

Async in Wasm

Remember the problem we talked about before that the callbacks could block the event loop? Essentially, the problem is how to run the sandboxed code asynchronously. Because no matter how complex the request processing callback is, if it can yield, we can put an upper bound on how long it is allowed to block. Luckily, Rust’s async framework is both elegant and lightweight. It gives us the opportunity to use a set of guest calls to implement the “Future”s.

In Rust, a Future is a building block for asynchronous computations. From the user’s perspective, in order to make an asynchronous program, one has to take care of two things: implement a pollable function that drives the state transition, and place a waker as a callback to wake itself up, when the pollable function should be called again due to some external event (e.g. time passes, socket becomes readable, and so on). The former is to be able to progress the program gradually, e.g. read buffered data from I/O and return a new state indicating the status of the task: either finished, or yielded. The latter is useful in case of task yielding, as it will trigger the Future to be polled when the conditions that the task was waiting for are fulfilled, instead of busy looping until it’s complete.

Let’s see how this is implemented in our sandbox. For a scenario when the guest needs to do some I/O, it has to do so via the host calls, as it is inside a restricted environment. Assuming the host provides a set of simplified host calls which mirror the basic socket operations: open, read, write, and close, the guest can have its pseudo poller defined as below:

fn poll(&mut self, wake: fn()) -> Poll {
	match hostcall_socket_read(self.sock, self.buffer) {
    	    HostOk  => Poll::Ready,
    	    HostEof => Poll::Pending,
	}
}

Here the host call reads data from a socket into a buffer, depending on its return value, the function can move itself to one of the states we mentioned above: finished(Ready), or yielded(Pending). The magic happens inside the host call. Remember in figure 5, that it is the only way to access resources? The guest app doesn’t own the socket, but it can acquire a “handle” via “hostcall_socket_open”, which will in turn create a socket on the host side, and return a handle. The handle can be anything in theory, but in practice using integer socket handles map well to file descriptors on the host side, or indices in a vector or slab. By referencing the returned handle, the guest app is able to remotely control the real socket. As the host side is fully asynchronous, it can simply relay the socket state to the guest. If you noticed that the waker function isn’t used above, well done! That’s because when the host call is called, it not only starts opening a socket, but also registers the current waker to be called then the socket is opened (or fails to do so). So when the socket becomes ready, the host task will be woken up, it will find the corresponding guest task from its context, and wakes it up using the trampoline function as shown in figure 5. There are other cases where a guest task needs to wait for another guest task, an async mutex for example. The mechanism here is similar: using host calls to register wakers.

All of these complicated things are encapsulated in our guest async runtime, with easy to use API, so the guest apps get access to regular async functions without having to worry about the underlying details.

(Not) The End

Hopefully, this blog post gave you a general idea of the innovative platform that powers 1.1.1.1. It is still evolving. As of today, several of our products, such as 1.1.1.1 for Families, AS112, and Gateway DNS, are supported by guest apps running on BigPineapple. We are looking forward to bringing new technologies into it. If you have any ideas, please let us know in the community or via email.

Dig through SERVFAILs with EDE

Stanley Chiang — Wed, 25 May 2022 12:59:18 GMT

It can be frustrating to get errors (SERVFAIL response codes) returned from your DNS queries. It can be even more frustrating if you don’t get enough information to understand why the error is occurring or what to do next. That’s why back in 2020, we launched support for Extended DNS Error (EDE) Codes to 1.1.1.1.

As a quick refresher, EDE codes are a proposed IETF standard enabled by the Extension Mechanisms for DNS (EDNS) spec. The codes return extra information about DNS or DNSSEC issues without touching the RCODE so that debugging is easier.

Now we’re happy to announce we will return more error code types and include additional helpful information to further improve your debugging experience. Let’s run through some examples of how these error codes can help you better understand the issues you may face.

To try for yourself, you’ll need to run the dig or kdig command in the terminal. For dig, please ensure you have v9.11.20 or above. If you are on macOS 12.1, by default you only have dig 9.10.6. Install an updated version of BIND to fix that.

Let’s start with the output of an example dig command without EDE support.

% dig @1.1.1.1 dnssec-failed.org +noedns

; <<>> DiG 9.18.0 <<>> @1.1.1.1 dnssec-failed.org +noedns
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 8054
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;dnssec-failed.org.		IN	A

;; Query time: 23 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
;; WHEN: Thu Mar 17 10:12:57 PDT 2022
;; MSG SIZE  rcvd: 35

In the output above, we tried to do DNSSEC validation on dnssec-failed.org. It returns a SERVFAIL, but we don’t have context as to why.

Now let’s try that again with 1.1.1.1’s EDE support.

% dig @1.1.1.1 dnssec-failed.org +dnssec

; <<>> DiG 9.18.0 <<>> @1.1.1.1 dnssec-failed.org +dnssec
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 34492
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1232
; EDE: 9 (DNSKEY Missing): (no SEP matching the DS found for dnssec-failed.org.)
;; QUESTION SECTION:
;dnssec-failed.org.		IN	A

;; Query time: 15 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
;; WHEN: Fri Mar 04 12:53:45 PST 2022
;; MSG SIZE  rcvd: 103

We can see there is still a SERVFAIL. However, this time there is also an EDE Code 9 which stands for “DNSKey Missing”. Accompanying that, we also have additional information saying “no SEP matching the DS found” for dnssec-failed.org. That’s better!

Another nifty feature is that we will return multiple errors when appropriate, so you can debug each one separately. In the example below, we returned a SERVFAIL with three different error codes: “Unsupported DNSKEY Algorithm”, “No Reachable Authority”, and “Network Error”.

dig @1.1.1.1 [domain] +dnssec

; <<>> DiG 9.18.0 <<>> @1.1.1.1 [domain] +dnssec
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 55957
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1232
; EDE: 1 (Unsupported DNSKEY Algorithm): (no supported DNSKEY algorithm for [domain].)
; EDE: 22 (No Reachable Authority): (at delegation [domain].)
; EDE: 23 (Network Error): (135.181.58.79:53 rcode=REFUSED for [domain] A)
;; QUESTION SECTION:
;[domain].		IN	A

;; Query time: 1197 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
;; WHEN: Wed Mar 02 13:41:30 PST 2022
;; MSG SIZE  rcvd: 202

Here’s a list of the additional codes we now support:

Error Code Number	Error Code Name
1	Unsupported DNSKEY Algorithm
2	Unsupported DS Digest Type
5	DNSSEC Indeterminate
7	Signature Expired
8	Signature Not Yet Valid
9	DNSKEY Missing
10	RRSIGs Missing
11	No Zone Key Bit Set
12	NSEC Missing

We have documented all the error codes we currently support with additional information you may find helpful. Refer to our dev docs for more information.

Announcing experimental DDR in 1.1.1.1

Christopher Wood — Tue, 08 Mar 2022 20:55:00 GMT

1.1.1.1 sees approximately 600 billion queries per day. However, proportionally, most queries sent to this resolver are over cleartext: 89% over UDP and TCP combined, and the remaining 11% are encrypted. We care about end-user privacy and would prefer to see all of these queries sent to us over an encrypted transport using DNS-over-TLS or DNS-over-HTTPS. Having a mechanism by which clients could discover support for encrypted protocols such as DoH or DoT will help drive this number up and lead to more name encryption on the Internet. That’s where DDR – or Discovery of Designated Resolvers – comes into play. As of today, 1.1.1.1 supports the latest version of DDR so clients can automatically upgrade non-secure UDP and TCP connections to secure connections. In this post, we’ll describe the motivations for DDR, how the mechanism works, and, importantly, how you can test it out as a client.

DNS transports and public resolvers

We initially launched our public recursive resolver service 1.1.1.1 over three years ago, and have since seen its usage steadily grow. Today, it is one of the fastest public recursive resolvers available to end-users, supporting the latest security and privacy DNS transports such as HTTP/3 for DNS-over-HTTPS (DoH), as well as Oblivious DoH.

As a public resolver, all clients, regardless of type, are typically manually configured based on a user’s desired performance, security, and privacy requirements. This choice reflects answers to two separate but related types of questions:

What recursive resolver should be used to answer my DNS queries? Does the resolver perform well? Does the recursive resolver respect my privacy?
What protocol should be used to speak to this particular recursive resolver? How can I keep my DNS data safe from eavesdroppers that should otherwise not have access to it?

The second question primarily concerns technical matters. In particular, whether or not a recursive resolver supports DoH is simple enough to answer. Either the recursive resolver does or does not support it!

In contrast, the first question is primarily a matter of policy. For example, consider the question of choosing between a local network-provided DNS recursive resolver and a public recursive resolver. How do resolver features (including DoH support, for example) influence this decision? How does the resolver’s privacy policy regarding data use and retention influence this decision? More generally, what information about recursive resolver capabilities is available to clients in making this decision and how is this information delivered to clients?

These policy questions have been the topic of substantial debate in the Internet Engineering Task Force (IETF), the standards body where DoH was standardized, and is the one facet of the Adaptive DNS Discovery (ADD) Working Group, which is chartered to work on the following items (among others):

_- Define a mechanism that allows clients to discover DNS resolvers that support encryption and that are available to the client either on the public Internet or on private or local networks.
- Define a mechanism that allows communication of DNS resolver information to clients for use in selection decisions. This could be part of the mechanism used for discovery, above._

In other words, the ADD Working Group aims to specify mechanisms by which clients can obtain the information they need to answer question (1). Critically, one of those pieces of information is what encrypted transport protocols the recursive resolver supports, which would answer question (2).

As the answer to question (2) is purely technical and not a matter of policy, the ADD Working Group was able to specify a workable solution that we’ve implemented and tested with existing clients. Before getting into the details of how it works, let’s dig into the problem statement here and see what’s required to address it.

Threat model and problem statement

The DDR problem is relatively straightforward: given the IP address of a DNS recursive resolver, how can one discover parameters necessary for speaking to the same resolver using an encrypted transport? (As above, discovering parameters for a different resolver is a distinctly different problem that pertains to policy and is therefore out of scope.)

This question is only meaningful insofar as using encryption helps protect against some attacker. Otherwise, if the network was trusted, encryption would add no value! A direct consequence is that this question assumes the network – for some definition of “the network” – is untrusted and encryption helps protect against this network.

But what exactly is the network here? In practice, the topology typically looks like the following:

Typical DNS configuration from DHCP

Again, for DNS discovery to have any meaning, we assume that either the ISP or home network – or both – is untrusted and malicious. The setting here depends on the client and the network they are attached to, but it’s generally simplest to assume the ISP network is untrusted.

This question also makes one important assumption: clients know the desired recursive resolver address. Why is this important? Typically, the IP address of a DNS recursive resolver is provided via Dynamic Host Configuration Protocol (DHCP). When a client joins a network, it uses DHCP to learn information about the network, including the default DNS recursive resolver. However, DHCP is a famously unauthenticated protocol, which means that any active attacker on the network can spoof the information, as shown below.

Unauthenticated DHCP discovery

One obvious attacker vector would be for the attacker to redirect DNS traffic from the network’s desired recursive resolver to an attacker-controlled recursive resolver. This has important implications on the threat model for discovery.

First, there is currently no known mechanism for encrypted DNS discovery in the presence of an active attacker that can influence the client’s view of the recursive resolver’s address. In other words, to make any meaningful improvement, DNS discovery assumes the client’s view of the DNS recursive resolver address is correct (and obtained through some secure mechanism). A second implication is that the attacker can simply block any attempt of client discovery, preventing upgrade to encrypted transports. This seems true of any interactive discovery mechanism. As a result, DNS discovery must relax this attacker’s capabilities somewhat: rather than add, drop, or modify packets, the attacker can only add or modify packets.

Altogether, this threat model lets us sharpen the DNS discovery problem statement: given the IP address of a DNS recursive resolver, how can one securely discover parameters necessary for speaking to the same resolver using an encrypted transport in the presence of an active attacker that can add or modify packets? It should be infeasible, for example, for the attacker to redirect the client from the resolver that it knows at the outset to one the attacker controls.

So how does this work, exactly?

DDR mechanics

DDR depends on two mechanisms:

Certificate-based authentication of encrypted DNS resolvers.
SVCB records for encoding and communicating DNS parameters.

Certificates allow resolvers to prove authority for IP addresses. For example, if you view the certificate for one.one.one.one, you’ll see several IP addresses listed under the SubjectAlternativeName extension, including 1.1.1.1.

SubjectAltName list of the one.one.one.one certificate

SVCB records are extensible key-value stores that can be used for conveying information about services to clients. Example information includes the supported application protocols, including HTTP/3, as well as keying material like that used for TLS Encrypted Client Hello.

How does DDR combine these two to solve the discovery problem above? In three simple steps:

Clients query the expected DNS resolver for its designations and their parameters with a special-purpose SVCB record.
Clients open a secure connection to the designated resolver, for example, one.one.one.one, authenticating the resolver against the one.one.one.one name.
Clients check that the designated resolver is additionally authenticated for the IP address of the origin resolver. That is, the certificate for one.one.one.one, the designated resolver, must include the IP address 1.1.1.1, the original designator resolver.

If this validation completes, clients can then use the secure connection to the designated resolver. In pictures, this is as follows:

DDR discovery process

This demonstrates that the encrypted DNS resolver is authoritative for the client’s original DNS resolver. Or, in other words, that the original resolver and the encrypted resolver are effectively “the same.” An encrypted resolver that does not include the originally requested resolver IP address on its certificate would fail the validation, and clients are not expected to follow the designated upgrade path. This entire process is referred to as “Verified Discovery” in the DDR specification.

Experimental deployment and next steps

To enable more encrypted DNS on the Internet and help the standardization process, 1.1.1.1 now has experimental support for DDR. You can query it directly to find out:

$ dig +short @1.1.1.1 _dns.resolver.arpa type64

QUESTION SECTION
_dns.resolver.arpa.               IN SVCB 

ANSWER SECTION
_dns.resolver.arpa.                           300    IN SVCB  1 one.one.one.one. alpn="h2,h3" port="443" ipv4hint="1.1.1.1,1.0.0.1" ipv6hint="2606:4700:4700::1111,2606:4700:4700::1001" key7="/dns-query{?dns}"
_dns.resolver.arpa.                           300    IN SVCB  2 one.one.one.one. alpn="dot" port="853" ipv4hint="1.1.1.1,1.0.0.1" ipv6hint="2606:4700:4700::1111,2606:4700:4700::1001"

ADDITIONAL SECTION
one.one.one.one.                              300    IN AAAA  2606:4700:4700::1111
one.one.one.one.                              300    IN AAAA  2606:4700:4700::1001
one.one.one.one.                              300    IN A     1.1.1.1
one.one.one.one.                              300    IN A     1.0.0.1

This command sends a SVCB query (type64) for the reserved name _dns.resolver.arpa to 1.1.1.1. The output lists the contents of this record, including the DoH and DoT designation parameters. Let’s walk through the contents of this record:

_dns.resolver.arpa.                           300    IN SVCB  1 one.one.one.one. alpn="h2,h3" port="443" ipv4hint="1.1.1.1,1.0.0.1" ipv6hint="2606:4700:4700::1111,2606:4700:4700::1001" key7="/dns-query{?dns}"

This says that the DoH target one.one.one.one is accessible over port 443 (port=”443”) using either HTTP/2 or HTTP/3 (alpn=”h2,h3”), and the DoH path (key7) for queries is “/dns-query{?dns}”.

Moving forward

DDR is a simple mechanism that lets clients automatically upgrade to encrypted transport protocols for DNS queries without any manual configuration. At the end of the day, users running compatible clients will enjoy a more private Internet experience. Happily, both Microsoft and Apple recently announced experimental support for this emerging standard, and we’re pleased to help them and other clients test support.Going forward, we hope to help add support for DDR to open source DNS resolver software such as dnscrypt-proxy and Bind. If you’re interested in helping us continue to drive adoption of encrypted DNS and related protocols to help build a better Internet, we’re hiring!

Unwrap the SERVFAIL

Anbang Wen — Fri, 30 Oct 2020 12:00:00 GMT

We recently released a new version of Cloudflare Resolver which adds a piece of information called “Extended DNS Errors” (EDE) along with the response code under certain circumstances. This will be helpful in tracing DNS resolution errors and figuring out what went wrong behind the scenes.

(image from: https://www.pxfuel.com/en/free-photo-expka)

A tight-lipped agent

The DNS protocol was designed to map domain names to IP addresses. To inform the client about the result of the lookup, the protocol has a 4 bit field, called response code/RCODE. The logic to serve a response might look something like this:

function lookup(domain) {
    ...
    switch result {
    case "No error condition":
        return NOERROR with client expected answer
    case "No record for the request type":
        return NOERROR
    case "The request domain does not exist":
        return NXDOMAIN
    case "Refuse to perform the specified operation for policy reasons":
        return REFUSE
    default("Server failure: unable to process this query due to a problem with the name server"):
        return SERVFAIL
    }
}

try {
    lookup(domain)
} catch {
    return SERVFAIL
}

Although the context hasn't changed much, protocol extensions such as DNSSEC have been added, which makes the RCODE run out of space to express the server's internal status. To keep backward compatibility, DNS servers have to squeeze various statuses into existing ones. This behavior could confuse the client, especially with the "catch-all" SERVFAIL: something went wrong but what exactly?

Most often, end users don't talk to authoritative name servers directly, but use a stub and/or a recursive resolver as an agent to acquire the information it needs. When a user receives SERVFAIL, the failure can be one of the following:

The stub resolver fails to send the request.
The stub resolver doesn’t get a response.
The recursive resolver, which the stub resolver sends its query to, is overloaded.
The recursive resolver is unable to communicate with upstream authoritative servers.
The recursive resolver fails to verify the DNSSEC chain.
The authoritative server takes too long to respond.
...

In such cases, it is nearly impossible for the user to know exactly what’s wrong. The resolver is usually the one to be blamed, because, as an agent, it fails to get back the answer, and doesn’t return a clear reason for the failure in the response.

Keep backward compatibility

It seems we need to return more information, but (there's always a but) we also need to keep the behavior of existing clients unchanged.

One way is to extend the RCODE space, which came out with the Extension mechanisms for DNS or EDNS. It defines a 8 bit EXTENDED-RCODE, as high-order bits to current 4 bit RCODE. Together they make up a 12 bit integer. This changes the processing of RCODE, requires both client and server to fully support the logic unfortunately.

Another approach is to provide out-of-band data without touching the current RCODE. This is how Extended DNS Errors is defined. It introduces a new option to EDNS, containing an INFO-CODE to describe error details with an EXTRA-TEXT as an optional supplement. The option can be repeated as many times as needed, so it's possible for the client to get a full error chain with detailed messages. The INFO-CODE is just something like RCODE, but is 16 bits wide, while the EXTRA-TEXT is an utf-8 encoded string. For example, let’s say a client sends a request to a resolver, and the requested domain has two name servers. The client may receive a SERVFAIL response with an OPT record (see below) which contains two extended errors, one from one of the authoritative servers that shows it's not ready to serve, and the other from the resolver, showing it cannot connect to the other name server.

;; OPT PSEUDOSECTION:
; ...
; EDE: 14 (Not Ready)
; EDE: 23 (Network Error): (cannot reach upstream 192.0.2.1)
; ...

Google has something similar in their DoH JSON API, which provides diagnostic information in the "Comment" field.

Let's dig into it

Our 1.1.1.1 service has an initial support of the draft version of Extended DNS Errors, while we are still trying to find the best practice. As we mentioned above, this is not a breaking change, and existing clients will not be affected. The additional options can be safely ignored without any problem, since the RCODE stays the same.

If you have a newer version of dig, you can simply check it out with a known problematic domain. As you can see, due to DNSSEC verification failing, the RCODE is still SERVFAIL, but the extended error shows the failure is "DNSSEC Bogus".

$ dig @1.1.1.1 dnssec-failed.org

; <<>> DiG 9.16.4-Debian <<>> @1.1.1.1 dnssec-failed.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 1111
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; EDE: 6 (DNSSEC Bogus)
;; QUESTION SECTION:
;dnssec-failed.org.		IN	A

;; Query time: 111 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Wed Sep 01 00:00:00 PDT 2020
;; MSG SIZE  rcvd: 52

Note that Extended DNS Error relies on EDNS. So to be able to get one, the client needs to support EDNS, and needs to enable it in the request. At the time of writing this blog post, we see about 17% of queries that 1.1.1.1 received had EDNS enabled within a short time range. We hope this information will help you uncover the root cause of a SERVFAIL in the future.