The R2 Gateway Worker uses credentials (ID and key pair) to securely authenticate with our distributed storage infrastructure. We rotate these credentials regularly as a best practice security precaution.
Our key rotation process involves the following high-level steps:
Create a new set of credentials (ID and key pair) for our storage infrastructure. At this point, the old credentials are maintained to avoid downtime during credential change over.
Set the new credential secret for the R2 production gateway Worker using the wrangler secret put command.
Set the new updated credential ID as an environment variable in the R2 production gateway Worker using the wrangler deploy command. At this point, new storage credentials start being used by the gateway Worker.
Remove previous credentials from our storage infrastructure to complete the credential rotation process.
Monitor operational dashboards and logs to validate change over.
The R2 engineering team uses Workers environments to separate production and development environments for the R2 Gateway Worker. Each environment defines a separate isolated Cloudflare Worker with separate environment variables and secrets.
Critically, both wrangler secret put and wrangler deploy commands default to the default environment if the --env command line parameter is not included. In this case, due to human error, we inadvertently omitted the --env parameter and deployed the new storage credentials to the wrong Worker (default environment instead of production). To correctly deploy storage credentials to the production R2 Gateway Worker, we need to specify --env production.
The action we took on step 4 above to remove the old credentials from our storage infrastructure caused authentication errors, as the R2 Gateway production Worker still had the old credentials. This is ultimately what resulted in degraded availability.
The decline in R2 availability metrics was gradual and not immediately obvious because there was a delay in the propagation of the previous credential deletion to storage infrastructure. This accounted for a delay in our initial discovery of the problem. Instead of relying on availability metrics after updating the old set of credentials, we should have explicitly validated which token was being used by the R2 Gateway service to authenticate with R2's storage infrastructure.
Overall, the impact on read availability was significantly mitigated by our intermediate cache that sits in front of storage and continued to serve requests.
Once we identified the root cause, we were able to resolve the incident quickly by deploying the new credentials to the production R2 Gateway Worker. This resulted in an immediate recovery of R2 availability.
This incident happened because of human error and lasted longer than it should have because we didn’t have proper visibility into which credentials were being used by the R2 Gateway Worker to authenticate with our storage infrastructure.
We have taken immediate steps to prevent this failure (and others like it) from happening again:
Added logging tags that include the suffix of the credential ID the R2 Gateway Worker uses to authenticate with our storage infrastructure. With this change, we can explicitly confirm which credential is being used.
Related to the above step, our internal processes now require explicit confirmation that the suffix of the new token ID matches logs from our storage infrastructure before deleting the previous token.
Require that key rotation takes place through our hotfix release tooling instead of relying on manual wrangler command entry which introduces human error. Our hotfix release deploy tooling explicitly enforces the environment configuration and contains other safety checks.
While it’s been an implicit standard that this process involves at least two humans to validate the changes ahead as we progress, we’ve updated our relevant SOPs (standard operating procedures) to include this explicitly.
In Progress: Extend our existing closed loop health check system that monitors our endpoints to test new keys, automate reporting of their status through our alerting platform, and ensure global propagation prior to releasing the gateway Worker.
In Progress: To expedite triage on any future issues with our distributed storage endpoints, we are updating our observability platform to include views of upstream success rates that bypass caching to give clearer indication of issues serving requests for any reason.
The list above is not exhaustive: as we work through the above items, we will likely uncover other improvements to our systems, controls, and processes that we’ll be applying to improve R2’s resiliency, on top of our business-as-usual efforts. We are confident that this set of changes will prevent this failure, and related credential rotation failure modes, from occurring again. Again, we sincerely apologize for this incident and deeply regret any disruption it has caused you or your users.
"],"published_at":[0,"2025-03-25T01:40:38.542Z"],"updated_at":[0,"2025-04-07T23:11:40.989Z"],"feature_image":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1zVzYYX4Zs6rRox4hJO4wJ/c64947208676753e532135f9393df5c5/BLOG-2793_1.png"],"tags":[1,[[0,{"id":[0,"7JpaihvGGjNhG2v4nTxeFV"],"name":[0,"R2 Storage"],"slug":[0,"cloudflare-r2"]}],[0,{"id":[0,"4yliZlpBPZpOwBDZzo1tTh"],"name":[0,"Outage"],"slug":[0,"outage"]}],[0,{"id":[0,"3cCNoJJ5uusKFBLYKFX1jB"],"name":[0,"Post Mortem"],"slug":[0,"post-mortem"]}]]],"relatedTags":[0],"authors":[1,[[0,{"name":[0,"Phillip Jones"],"slug":[0,"phillip"],"bio":[0,null],"profile_image":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5KTNNpw9VHuwoZlwWsA7MN/c50f3f98d822a0fdce3196d7620d714e/phillip.jpg"],"location":[0,null],"website":[0,null],"twitter":[0,"@akaphill"],"facebook":[0,null]}]]],"meta_description":[0,"On March 21, 2025, multiple Cloudflare services, including R2 object storage experienced an elevated rate of error responses. Here’s what caused the incident, the impact, and how we are making sure it doesn’t happen again."],"primary_author":[0,{}],"localeList":[0,{"name":[0,"LOC: Cloudflare incident on March 21, 2025"],"enUS":[0,"English for Locale"],"zhCN":[0,"Translated for Locale"],"zhHansCN":[0,"No Page for Locale"],"zhTW":[0,"No Page for Locale"],"frFR":[0,"No Page for Locale"],"deDE":[0,"No Page for Locale"],"itIT":[0,"No Page for Locale"],"jaJP":[0,"Translated for Locale"],"koKR":[0,"No Page for Locale"],"ptBR":[0,"No Page for Locale"],"esLA":[0,"No Page for Locale"],"esES":[0,"No Page for Locale"],"enAU":[0,"No Page for Locale"],"enCA":[0,"No Page for Locale"],"enIN":[0,"No Page for Locale"],"enGB":[0,"No Page for Locale"],"idID":[0,"No Page for Locale"],"ruRU":[0,"No Page for Locale"],"svSE":[0,"No Page for Locale"],"viVN":[0,"No Page for Locale"],"plPL":[0,"No Page for Locale"],"arAR":[0,"No Page for Locale"],"nlNL":[0,"No Page for Locale"],"thTH":[0,"No Page for Locale"],"trTR":[0,"No Page for Locale"],"heIL":[0,"No Page for Locale"],"lvLV":[0,"No Page for Locale"],"etEE":[0,"No Page for Locale"],"ltLT":[0,"No Page for Locale"]}],"url":[0,"https://blog.cloudflare.com/cloudflare-incident-march-21-2025"],"metadata":[0,{"title":[0],"description":[0,"On March 21, 2025, multiple Cloudflare services, including R2 object storage experienced an elevated rate of error responses. Here’s what caused the incident, the impact, and how we are making sure it doesn’t happen again."],"imgPreview":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Snz5aLHm9r8iuQrJu1PFo/b08d68ab5199b08dfc096237501e8bf9/BLOG-2793_OG_Share.png"]}]}],[0,{"id":[0,"2aI8Y4m36DD0HQghRNFZ2n"],"title":[0,"Some TXT about, and A PTR to, new DNS insights on Cloudflare Radar"],"slug":[0,"new-dns-section-on-cloudflare-radar"],"excerpt":[0,"The new Cloudflare Radar DNS page provides increased visibility into aggregate traffic and usage trends seen by our 1.1.1.1 resolver"],"featured":[0,false],"html":[0,"
No joke – Cloudflare's 1.1.1.1 resolver was launched on April Fool's Day in 2018. Over the last seven years, this highly performant and privacy-conscious service has grown to handle an average of 1.9 Trillion queries per day from approximately 250 locations (countries/regions) around the world. Aggregated analysis of this traffic provides us with unique insight into Internet activity that goes beyond simple Web traffic trends, and we currently use analysis of 1.1.1.1 data to power Radar's Domains page, as well as the Radar Domain Rankings.
In December 2022, Cloudflare joined the AS112 Project, which helps the Internet deal with misdirected DNS queries. In March 2023, we launched an AS112 statistics page on Radar, providing insight into traffic trends and query types for this misdirected traffic. Extending the basic analysis presented on that page, and building on the analysis of resolver data used for the Domains page, today we are excited to launch a dedicated DNS page on Cloudflare Radar to provide increased visibility into aggregate traffic and usage trends seen across 1.1.1.1 resolver traffic. In addition to looking at global, location, and autonomous system (ASN) traffic trends, we are also providing perspectives on protocol usage, query and response characteristics, and DNSSEC usage.
The traffic analyzed for this new page may come from users that have manually configured their devices or local routers to use 1.1.1.1 as a resolver, ISPs that set 1.1.1.1 as the default resolver for their subscribers, ISPs that use 1.1.1.1 as a resolver upstream from their own, or users that have installed Cloudflare’s 1.1.1.1/WARP app on their device. The traffic analysis is based on anonymised DNS query logs, in accordance with Cloudflare’s Privacy Policy, as well as our 1.1.1.1 Public DNS Resolver privacy commitments.
Below, we walk through the sections of Radar’s new DNS page, reviewing the included graphs and the importance of the metrics they present. The data and trends shown within these graphs will vary based on the location or network that the aggregated queries originate from, as well as on the selected time frame.
As with many Radar metrics, the DNS page leads with traffic trends, showing normalized query volume at a worldwide level (default), or from the selected location or autonomous system (ASN). Similar to other Radar traffic-based graphs, the time period shown can be adjusted using the date picker, and for the default selections (last 24 hours, last 7 days, etc.), a comparison with traffic seen over the previous period is also plotted.
For location-level views (such as Latvia, in the example below), a table showing the top five ASNs by query volume is displayed alongside the graph. Showing the network’s share of queries from the selected location, the table provides insights into the providers whose users are generating the most traffic to 1.1.1.1.
\n \n \n \n \n
When a country/region is selected, in addition to showing an aggregate traffic graph for that location, we also show query volumes for the country code top level domain (ccTLD) associated with that country. The graph includes a line showing worldwide query volume for that ccTLD, as well as a line showing the query volume based on queries from the associated location. Anguilla’s ccTLD is .ai, and is a popular choice among the growing universe of AI-focused companies. While most locations see a gap between the worldwide and “local” query volume for their ccTLD, Anguilla’s is rather significant — as the graph below illustrates, this size of the gap is driven by both the popularity of the ccTLD and Anguilla’s comparatively small user base. (Traffic for .ai domains from Anguilla is shown by the dark blue line at the bottom of the graph.) Similarly, sizable gaps are seen with other “popular” ccTLDs as well, such as .io (British Indian Ocean Territory), .fm (Federated States of Micronesia), and .co (Colombia). A higher “local” ccTLD query volume in other locations results in smaller gaps when compared to the worldwide query volume.
\n \n \n \n \n
Depending on the strength of the signal (that is, the volume of traffic) from a given location or ASN, this data can also be used to corroborate reported Internet outages or shutdowns, or reported blocking of 1.1.1.1. For example, the graph below illustrates the result of Venezuelan provider CANTV reportedly blocking access to 1.1.1.1 for its subscribers. A comparable drop is visible for Supercable, another Venezuelan provider that also reportedly blocked access to Cloudflare’s resolver around the same time.
\n \n \n \n \n
Individual domain pages (like the one for cloudflare.com, for example) have long had a choropleth map and accompanying table showing the popularity of the domain by location, based on the share of DNS queries for that domain from each location. A similar view is included at the bottom of the worldwide overview page, based on the share of total global queries to 1.1.1.1 from each location.
While traffic trends are always interesting and important to track, analysis of the characteristics of queries to 1.1.1.1 and the associated responses can provide insights into the adoption of underlying transport protocols, record type popularity, cacheability, and security.
Published in November 1987, RFC 1035 notes that “The Internet supports name server access using TCP [RFC-793] on server port 53 (decimal) as well as datagram access using UDP [RFC-768] on UDP port 53 (decimal).” Over the subsequent three-plus decades, UDP has been the primary transport protocol for DNS queries, falling back to TCP for a limited number of use cases, such as when the response is too big to fit in a single UDP packet. However, as privacy has become a significantly greater concern, encrypted queries have been made possible through the specification of DNS over TLS (DoT) in 2016 and DNS over HTTPS (DoH) in 2018. Cloudflare’s 1.1.1.1 resolver has supported both of these privacy-preserving protocols since launch. The DNS transport protocol graph shows the distribution of queries to 1.1.1.1 over these four protocols. (Setting up 1.1.1.1 on your device or router uses DNS over UDP by default, although recent versions of Android support DoT and DoH. The 1.1.1.1 app uses DNS over HTTPS by default, and users can also configure their browsers to use DNS over HTTPS.)
Note that Cloudflare's resolver also services queries over DoH and Oblivious DoH (ODoH) for Mozilla and other large platforms, but this traffic is not currently included in our analysis. As such, DoH adoption is under-represented in this graph.
Aggregated worldwide between February 19 - February 26, distribution of transport protocols was 86.6% for UDP, 9.6% for DoT, 2.0% for TCP, and 1.7% for DoH. However, in some locations, these ratios may shift if users are more privacy conscious. For example, the graph below shows the distribution for Egypt over the same time period. In that country, the UDP and TCP shares are significantly lower than the global level, while the DoT and DoH shares are significantly higher, suggesting that users there may be more concerned about the privacy of their DNS queries than the global average, or that there is a larger concentration of 1.1.1.1 users on Android devices who have set up 1.1.1.1 using DoT manually. (The 2024 Cloudflare Radar Year in Review found that Android had an 85% mobile device traffic share in Egypt, so mobile device usage in the country leans very heavily toward Android.)
\n \n \n \n \n
RFC 1035 also defined a number of standard and Internet specific resource record types that return the associated information about the submitted query name. The most common record types are A and AAAA, which return the hostname’s IPv4 and IPv6 addresses respectively (assuming they exist). The DNS query type graph below shows that globally, these two record types comprise on the order of 80% of the queries received by 1.1.1.1. Among the others shown in the graph, HTTPS records can be used to signal HTTP/3 and HTTP/2 support, PTR records are used in reverse DNS records to look up a domain name based on a given IP address, and NS records indicate authoritative nameservers for a domain.
\n \n \n \n \n
A response code is sent with each response from 1.1.1.1 to the client. Six possible values were originally defined in RFC 1035, with the list further extended in RFC 2136 and RFC 2671. NOERROR, as the name suggests, means that no error condition was encountered with the query. Others, such as NXDOMAIN, SERVFAIL, REFUSED, and NOTIMP define specific error conditions encountered when trying to resolve the requested query name. The response codes may be generated by 1.1.1.1 itself (like REFUSED) or may come from an upstream authoritative nameserver (like NXDOMAIN).
The DNS response code graph shown below highlights that the vast majority of queries seen globally do not encounter an error during the resolution process (NOERROR), and that when errors are encountered, most are NXDOMAIN (no such record). It is worth noting that NOERROR also includes empty responses, which occur when there are no records for the query name and query type, but there are records for the query name and some other query type.
\n \n \n \n \n
With DNS being a first-step dependency for many other protocols, the amount of queries of particular types can be used to indirectly measure the adoption of those protocols. But to effectively measure adoption, we should also consider the fraction of those queries that are met with useful responses, which are represented with the DNS record adoption graphs.
The example below shows that queries for A records are met with a useful response nearly 88% of the time. As IPv4 is an established protocol, the remaining 12% are likely to be queries for valid hostnames that have no A records (e.g. email domains that only have MX records). But the same graph also shows that there’s still a significant adoption gap where IPv6 is concerned.
\n \n \n \n \n
When Cloudflare’s DNS resolver gets a response back from an upstream authoritative nameserver, it caches it for a specified amount of time — more on that below. By caching these responses, it can more efficiently serve subsequent queries for the same name. The DNS cache hit ratio graph provides insight into how frequently responses are served from cache. At a global level, as seen below, over 80% of queries have a response that is already cached. These ratios will vary by location or ASN, as the query patterns differ across geographies and networks.
\n \n \n \n \n
As noted in the preceding paragraph, when an authoritative nameserver sends a response back to 1.1.1.1, each record inside it includes information about how long it should be cached/considered valid for. This piece of information is known as the Time-To-Live (TTL) and, as a response may contain multiple records, the smallest of these TTLs (the “minimum” TTL) defines how long 1.1.1.1 can cache the entire response for. The TTLs on each response served from 1.1.1.1’s cache decrease towards zero as time passes, at which point 1.1.1.1 needs to go back to the authoritative nameserver. Hostnames with relatively low TTL values suggest that the records may be somewhat dynamic, possibly due to traffic management of the associated resources; longer TTL values suggest that the associated resources are more stable and expected to change infrequently.
The DNS minimum TTL graphs show the aggregate distribution of TTL values for five popular DNS record types, broken out across seven buckets ranging from under one minute to over one week. During the third week of February, for example, A and AAAA responses had a concentration of low TTLs, with over 80% below five minutes. In contrast, NS and MX responses were more concentrated across 15 minutes to one hour and one hour to one day. Because MX and NS records change infrequently, they are generally configured with higher TTLs. This allows them to be cached for longer periods in order to achieve faster DNS resolution.
DNS Security Extensions (DNSSEC) add an extra layer of authentication to DNS establishing the integrity and authenticity of a DNS response. This ensures subsequent HTTPS requests are not routed to a spoofed domain. When sending a query to 1.1.1.1, a DNS client can indicate that it is DNSSEC-aware by setting a specific flag (the “DO” bit) in the query, which lets our resolver know that it is OK to return DNSSEC data in the response. The DNSSEC client awareness graph breaks down the share of queries that 1.1.1.1 sees from clients that understand DNSSEC and can require validation of responses vs. those that don’t. (Note that by default, 1.1.1.1 tries to protect clients by always validating DNSSEC responses from authoritative nameservers and not forwarding invalid responses to clients, unless the client has explicitly told it not to by setting the “CD” (checking-disabled) bit in the query.)
Unfortunately, as the graph below shows, nearly 90% of the queries seen by Cloudflare’s resolver are made by clients that are not DNSSEC-aware. This broad lack of client awareness may be due to several factors. On the client side, DNSSEC is not enabled by default for most users, and enabling DNSSEC requires extra work, even for technically savvy and security conscious users. On the authoritative side, for domain owners, supporting DNSSEC requires extra operational maintenance and knowledge, and a mistake can cost your domain to disappear from the Internet, causing significant (including financial) issues.
The companion End-to-end security graph represents the fraction of DNS interactions that were protected from tampering, when considering the client’s DNSSEC capabilities and use of encryption (use of DoT or DoH). This shows an even greater imbalance at a global level, and highlights the importance of further adoption of encryption and DNSSEC.
\n \n \n \n \n
For DNSSEC validation to occur, the query name being requested must be part of a DNSSEC-enabled domain, and the DNSSEC validation status graph represents the share of queries where that was the case under the Secure and Invalid labels. Queries for domains without DNSSEC are labeled as Insecure, and queries where DNSSEC validation was not applicable (such as various kinds of errors) fall under the Other label. Although nearly 93% of generic Top Level Domains (TLDs) and 65% of country code Top Level Domains (ccTLDs) are signed with DNSSEC (as of February 2025), the adoption rate across individual (child) domains lags significantly, as the graph below shows that over 80% of queries were labeled as Insecure.
DNS is a fundamental, foundational part of the Internet. While most Internet users don’t think of DNS beyond its role in translating easy-to-remember hostnames to IP addresses, there’s a lot going on to make even that happen, from privacy to performance to security. The new DNS page on Cloudflare Radar endeavors to provide visibility into what’s going on behind the scenes, at a global, national, and network level.
While the graphs shown above are taken from the DNS page, all the underlying data is available via the API and can be interactively explored in more detail across locations, networks, and time periods using Radar’s Data Explorer and AI Assistant. And as always, Radar and Data Assistant charts and graphs are downloadable for sharing, and embeddable for use in your own blog posts, websites, or dashboards.
"],"published_at":[0,"2025-02-27T14:00+00:00"],"updated_at":[0,"2025-04-10T01:56:34.248Z"],"feature_image":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2hLvT4QdAxk6Uqpy7AelLJ/18c827f73c965bc9e2d6862ab0afd8b2/image5.png"],"tags":[1,[[0,{"id":[0,"2FQK880QI5lKEUCjVHBber"],"name":[0,"1.1.1.1"],"slug":[0,"1-1-1-1"]}],[0,{"id":[0,"5kZtWqjqa7aOUoZr8NFGwI"],"name":[0,"Radar"],"slug":[0,"cloudflare-radar"]}],[0,{"id":[0,"5fZHv2k9HnJ7phOPmYexHw"],"name":[0,"DNS"],"slug":[0,"dns"]}],[0,{"id":[0,"2erOhyZHpwsORouNTuWZfJ"],"name":[0,"Resolver"],"slug":[0,"resolver"]}],[0,{"id":[0,"5GwDZZTEDK1ZYAHNV31ygs"],"name":[0,"DNSSEC"],"slug":[0,"dnssec"]}],[0,{"id":[0,"lYgpkmxckzYQz50iiVcjw"],"name":[0,"DoH"],"slug":[0,"doh"]}],[0,{"id":[0,"2ScX2j6LG2ruyaS8eLYhsd"],"name":[0,"Traffic"],"slug":[0,"traffic"]}]]],"relatedTags":[0],"authors":[1,[[0,{"name":[0,"David Belson"],"slug":[0,"david-belson"],"bio":[0,null],"profile_image":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/en7vkXf6rLBm4F8IcNHXT/645022bf841fabff7732aa3be3949808/david-belson.jpeg"],"location":[0,null],"website":[0,null],"twitter":[0,"@dbelson"],"facebook":[0,null]}],[0,{"name":[0,"Carlos Rodrigues"],"slug":[0,"carlos-rodrigues"],"bio":[0,null],"profile_image":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/zkL0dQnH3FcqRYV7JkuSH/aa50211b4da9f4b79125905340e086e3/carlos-rodrigues.png"],"location":[0,"Lisbon, Portugal"],"website":[0,null],"twitter":[0,null],"facebook":[0,null]}],[0,{"name":[0,"Vicky Shrestha"],"slug":[0,"vicky"],"bio":[0,null],"profile_image":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4RvgQSpjYreEXaLPL0stwq/7df86de7712505d3a2af6ae50a39c00b/vicky.jpg"],"location":[0,null],"website":[0,null],"twitter":[0,null],"facebook":[0,null]}],[0,{"name":[0,"Hannes Gerhart"],"slug":[0,"hannes"],"bio":[0,null],"profile_image":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4NLFewzqaiZmAizzu5u0wA/1552ecba4be976c9c2cde1ad4ea1ffc2/hannes.jpg"],"location":[0,"Berlin, Germany"],"website":[0,"https://linkedin.com/in/hannesgerhart"],"twitter":[0,null],"facebook":[0,null]}]]],"meta_description":[0,"The new Cloudflare Radar DNS page provides increased visibility into aggregate traffic and usage trends seen by our 1.1.1.1 resolver. In addition to global, location, and ASN traffic trends, we are also providing perspectives on protocol usage, query/response characteristics, and DNSSEC usage. 
"],"primary_author":[0,{}],"localeList":[0,{"name":[0,"blog-english-only"],"enUS":[0,"English for Locale"],"zhCN":[0,"No Page for Locale"],"zhHansCN":[0,"No Page for Locale"],"zhTW":[0,"No Page for Locale"],"frFR":[0,"No Page for Locale"],"deDE":[0,"No Page for Locale"],"itIT":[0,"No Page for Locale"],"jaJP":[0,"No Page for Locale"],"koKR":[0,"No Page for Locale"],"ptBR":[0,"No Page for Locale"],"esLA":[0,"No Page for Locale"],"esES":[0,"No Page for Locale"],"enAU":[0,"No Page for Locale"],"enCA":[0,"No Page for Locale"],"enIN":[0,"No Page for Locale"],"enGB":[0,"No Page for Locale"],"idID":[0,"No Page for Locale"],"ruRU":[0,"No Page for Locale"],"svSE":[0,"No Page for Locale"],"viVN":[0,"No Page for Locale"],"plPL":[0,"No Page for Locale"],"arAR":[0,"No Page for Locale"],"nlNL":[0,"No Page for Locale"],"thTH":[0,"No Page for Locale"],"trTR":[0,"No Page for Locale"],"heIL":[0,"No Page for Locale"],"lvLV":[0,"No Page for Locale"],"etEE":[0,"No Page for Locale"],"ltLT":[0,"No Page for Locale"]}],"url":[0,"https://blog.cloudflare.com/new-dns-section-on-cloudflare-radar"],"metadata":[0,{"title":[0,"Some TXT about, and A PTR to, new DNS insights on Cloudflare Radar"],"description":[0,"The new Cloudflare Radar DNS page provides increased visibility into aggregate traffic and usage trends seen by our 1.1.1.1 resolver. In addition to global, location, and ASN traffic trends, we are also providing perspectives on protocol usage, query/response characteristics, and DNSSEC usage."],"imgPreview":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3NKxH6J1R6ou3wH82AYmwJ/884ef2960c553444ac37d35be5d5766f/Some_TXT_about__and_A_PTR_to__new_DNS_insights_on_Cloudflare_Radar-OG.png"]}]}],[0,{"id":[0,"mDiwAePfMfpVHMlYrfrFu"],"title":[0,"Cloudflare incident on February 6, 2025"],"slug":[0,"cloudflare-incident-on-february-6-2025"],"excerpt":[0,"On Thursday, February 6, 2025, we experienced an outage with our object storage service (R2) and products that rely on it. Here's what happened and what we're doing to fix this going forward."],"featured":[0],"html":[0,"
Multiple Cloudflare services, including our R2 object storage, were unavailable for 59 minutes on Thursday, February 6, 2025. This caused all operations against R2 to fail for the duration of the incident, and caused a number of other Cloudflare services that depend on R2 — including Stream, Images, Cache Reserve, Vectorize and Log Delivery — to suffer significant failures.
The incident occurred due to human error and insufficient validation safeguards during a routine abuse remediation for a report about a phishing site hosted on R2. The action taken on the complaint resulted in an advanced product disablement action on the site that led to disabling the production R2 Gateway service responsible for the R2 API.
Critically, this incident did not result in the loss or corruption of any data stored on R2.
We’re deeply sorry for this incident: this was a failure of a number of controls, and we are prioritizing work to implement additional system-level controls related not only to our abuse processing systems, but so that we continue to reduce the blast radius of any system- or human- action that could result in disabling any production service at Cloudflare.
All customers using Cloudflare R2 would have observed a 100% failure rate against their R2 buckets and objects during the primary incident window. Services that depend on R2 (detailed in the table below) observed heightened error rates and failure modes depending on their usage of R2.
The primary incident window occurred between 08:14 UTC to 09:13 UTC, when operations against R2 had a 100% error rate. Dependent services (detailed below) observed increased failure rates for operations that relied on R2.
From 09:13 UTC to 09:36 UTC, as R2 recovered and clients reconnected, the backlog and resulting spike in client operations caused load issues with R2's metadata layer (built on Durable Objects). This impact was significantly more isolated: we observed a 0.09% increase in error rates in calls to Durable Objects running in North America during this window.
The following table details the impacted services, including the user-facing impact, operation failures, and increases in error rates observed:
Product/Service
Impact
R2
100% of operations against R2 buckets and objects, including uploads, downloads, and associated metadata operations were impacted during the primary incident window. During the secondary incident window, we observed a <1% increase in errors as clients reconnected and increased pressure on R2's metadata layer.
There was no data loss within the R2 storage subsystem: this incident impacted the HTTP frontend of R2. Separation of concerns and blast radius management meant that the underlying R2 infrastructure was unaffected by this.
Stream
100% of operations (upload & streaming delivery) against assets managed by Stream were impacted during the primary incident window.
Images
100% of operations (uploads & downloads) against assets managed by Images were impacted during the primary incident window.
Impact to Image Delivery was minor: success rate dropped to 97% as these assets are fetched from existing customer backends and do not rely on intermediate storage.
Cache Reserve
Cache Reserve customers observed an increase in requests to their origin during the incident window as 100% of operations failed. This resulted in an increase in requests to origins to fetch assets unavailable in Cache Reserve during this period. This impacted less than 0.049% of all cacheable requests served during the incident window.
User-facing requests for assets to sites with Cache Reserve did not observe failures as cache misses failed over to the origin.
Log Delivery
Log delivery was delayed during the primary incident window, resulting in significant delays (up to an hour) in log processing, as well as some dropped logs.
Specifically:
Non-R2 delivery jobs would have experienced up to 4.5% data loss during the incident. This level of data loss could have been different between jobs depending on log volume and buffer capacity in a given location.
R2 delivery jobs would have experienced up to 13.6% data loss during the incident.
R2 is a major destination for Cloudflare Logs. During the primary incident window, all available resources became saturated attempting to buffer and deliver data to R2. This prevented other jobs from acquiring resources to process their queues. Data loss (dropped logs) occurred when the job queues expired their data (to allow for new, incoming data). The system recovered when we enabled a kill switch to stop processing jobs sending data to R2.
Durable Objects
Durable Objects, and services that rely on it for coordination & storage, were impacted as the stampeding horde of clients re-connecting to R2 drove an increase in load.
We observed a 0.09% actual increase in error rates in calls to Durable Objects running in North America, starting at 09:13 UTC and recovering by 09:36 UTC.
Cache Purge
Requests to the Cache Purge API saw a 1.8% error rate (HTTP 5xx) increase and a 10x increase in p90 latency for purge operations during the primary incident window. Error rates returned to normal immediately after this.
Vectorize
Queries and operations against Vectorize indexes were impacted during the primary incident window. 75% of queries to indexes failed (the remainder were served out of cache) and 100% of insert, upsert, and delete operations failed during the incident window as Vectorize depends on R2 for persistent storage. Once R2 recovered, Vectorize systems recovered in full.
We observed no continued impact during the secondary incident window, and we have not observed any index corruption as the Vectorize system has protections in place for this.
Key Transparency Auditor
100% of signature publish & read operations to the KT auditor service failed during the primary incident window. No third party reads occurred during this window and thus were not impacted by the incident.
Workers & Pages
A small volume (0.002%) of deployments to Workers and Pages projects failed during the primary incident window. These failures were limited to services with bindings to R2, as our control plane was unable to communicate with the R2 service during this period.
The incident timeline, including the initial impact, investigation, root cause, and remediation, are detailed below.
All timestamps referenced are in Coordinated Universal Time (UTC).
\n
\n
\n
\n
\n\n
\n
Time
\n
Event
\n
\n\n
\n
2025-02-06 08:12
\n
The R2 Gateway service is inadvertently disabled while responding to an abuse report.
\n
\n
\n
2025-02-06 08:14
\n
-- IMPACT BEGINS --
\n
\n
\n
2025-02-06 08:15
\n
R2 service metrics begin to show signs of service degradation.
\n
\n
\n
2025-02-06 08:17
\n
Critical R2 alerts begin to fire due to our service no longer responding to our health checks.
\n
\n
\n
2025-02-06 08:18
\n
R2 on-call engaged and began looking at our operational dashboards and service logs to understand impact to availability.
\n
\n
\n
2025-02-06 08:23
\n
Sales engineering escalated to the R2 engineering team that customers are experiencing a rapid increase in HTTP 500’s from all R2 APIs.
\n
\n
\n
2025-02-06 08:25
\n
Internal incident declared.
\n
\n
\n
2025-02-06 08:33
\n
R2 on-call was unable to identify the root cause and escalated to the lead on-call for assistance.
\n
\n
\n
2025-02-06 08:42
\n
Root cause identified as R2 team reviews service deployment history and configuration, which surfaces the action and the validation gap that allowed this to impact a production service.
\n
\n
\n
2025-02-06 08:46
\n
On-call attempts to re-enable the R2 Gateway service using our internal admin tooling, however this tooling was unavailable because it relies on R2.
\n
\n
\n
2025-02-06 08:49
\n
On-call escalates to an operations team who has lower level system access and can re-enable the R2 Gateway service.
\n
\n
\n
2025-02-06 08:57
\n
The operations team engaged and began to re-enable the R2 Gateway service.
\n
\n
\n
2025-02-06 09:09
\n
R2 team triggers a redeployment of the R2 Gateway service.
\n
\n
\n
2025-02-06 09:10
\n
R2 began to recover as the forced re-deployment rolled out as clients were able to reconnect to R2.
\n
\n
\n
2025-02-06 09:13
\n
-- IMPACT ENDS -- R2 availability recovers to within its service-level objective (SLO). Durable Objects begins to observe a slight increase in error rate (0.09%) for Durable Objects running in North America due to the spike in R2 clients reconnecting.
\n
\n
\n
2025-02-06 09:36
\n
The Durable Objects error rate recovers.
\n
\n
\n
2025-02-06 10:29
\n
The incident is closed after monitoring error rates.
\n
\n
At the R2 service level, our internal Prometheus metrics showed R2’s SLO near-immediately drop to 0% as R2’s Gateway service stopped serving all requests and terminated in-flight requests.
The slight delay in failure was due to the product disablement action taking 1–2 minutes to take effect as well as our configured metrics aggregation intervals:
\n \n \n
For context, R2’s architecture separates the Gateway service, which is responsible for authenticating and serving requests to R2’s S3 & REST APIs and is the “front door” for R2 — its metadata store (built on Durable Objects), our intermediate caches, and the underlying, distributed storage subsystem responsible for durably storing objects.
\n \n \n
During the incident, all other components of R2 remained up: this is what allowed the service to recover so quickly once the R2 Gateway service was restored and re-deployed. The R2 Gateway acts as the coordinator for all work when operations are made against R2. During the request lifecycle, we validate authentication and authorization, write any new data to a new immutable key in our object store, then update our metadata layer to point to the new object. When the service was disabled, all running processes stopped.
While this means that all in-flight and subsequent requests fail, anything that had received a HTTP 200 response had already succeeded with no risk of reverting to a prior version when the service recovered. This is critical to R2’s consistency guarantees and mitigates the chance of a client receiving a successful API response without the underlying metadata and storage infrastructure having persisted the change.
Due to human error and insufficient validation safeguards in our admin tooling, the R2 Gateway service was taken down as part of a routine remediation for a phishing URL.
During a routine abuse remediation, action was taken on a complaint that inadvertently disabled the R2 Gateway service instead of the specific endpoint/bucket associated with the report. This was a failure of multiple system level controls (first and foremost) and operator training.
A key system-level control that led to this incident was in how we identify (or "tag") internal accounts used by our teams. Teams typically have multiple accounts (dev, staging, prod) to reduce the blast radius of any configuration changes or deployments, but our abuse processing systems were not explicitly configured to identify these accounts and block disablement actions against them. Instead of disabling the specific endpoint associated with the abuse report, the system allowed the operator to (incorrectly) disable the R2 Gateway service.
Once we identified this as the cause of the outage, remediation and recovery was inhibited by the lack of direct controls to revert the product disablement action and the need to engage an operations team with lower level access than is routine. The R2 Gateway service then required a re-deployment in order to rebuild its routing pipeline across our edge network.
Once re-deployed, clients were able to re-connect to R2, and error rates for dependent services (including Stream, Images, Cache Reserve and Vectorize) returned to normal levels.
We have taken immediate steps to resolve the validation gaps in our tooling to prevent this specific failure from occurring in the future.
We are prioritizing several work-streams to implement stronger, system-wide controls (defense-in-depth) to prevent this, including how we provision internal accounts so that we are not relying on our teams to correctly and reliably tag accounts. A key theme to our remediation efforts here is around removing the need to rely on training or process, and instead ensuring that our systems have the right guardrails and controls built-in to prevent operator errors.
These work-streams include (but are not limited to) the following:
Actioned: deployed additional guardrails implemented in the Admin API to prevent product disablement of services running in internal accounts.
Actioned: Product disablement actions in the abuse review UI have been disabled while we add more robust safeguards. This will prevent us from inadvertently repeating similar high-risk manual actions.
In-flight: Changing how we create all internal accounts (staging, dev, production) to ensure that all accounts are correctly provisioned into the correct organization. This must include protections against creating standalone accounts to avoid re-occurrence of this incident (or similar) in the future.
In-flight: Further restricting access to product disablement actions beyond the remediations recommended by the system to a smaller group of senior operators.
In-flight: Two-party approval required for ad-hoc product disablement actions. Going forward, if an investigator requires additional remediations, they must be submitted to a manager or a person on our approved remediation acceptance list to approve their additional actions on an abuse report.
In-flight: Expand existing abuse checks that prevent accidental blocking of internal hostnames to also prevent any product disablement action of products associated with an internal Cloudflare account.
In-flight: Internal accounts are being moved to our new Organizations model ahead of public release of this feature. The R2 production account was a member of this organization, but our abuse remediation engine did not have the necessary protections to prevent acting against accounts within this organization.
We’re continuing to discuss & review additional steps and effort that can continue to reduce the blast radius of any system- or human- action that could result in disabling any production service at Cloudflare.
We understand this was a serious incident, and we are painfully aware of — and extremely sorry for — the impact it caused to customers and teams building and running their businesses on Cloudflare.
This is the first (and ideally, the last) incident of this kind and duration for R2, and we’re committed to improving controls across our systems and workflows to prevent this in the future.
"],"published_at":[0,"2025-02-06T16:00-08:00"],"updated_at":[0,"2025-02-07T20:37:36.839Z"],"feature_image":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1JR64uLhxWHgQPhhOlkyTt/dc50a43a0475ab8a0069bb8fce372e47/Screenshot_2025-02-06_at_4.54.16_PM.png"],"tags":[1,[[0,{"id":[0,"3cCNoJJ5uusKFBLYKFX1jB"],"name":[0,"Post Mortem"],"slug":[0,"post-mortem"]}],[0,{"id":[0,"4yliZlpBPZpOwBDZzo1tTh"],"name":[0,"Outage"],"slug":[0,"outage"]}],[0,{"id":[0,"7JpaihvGGjNhG2v4nTxeFV"],"name":[0,"R2 Storage"],"slug":[0,"cloudflare-r2"]}]]],"relatedTags":[0],"authors":[1,[[0,{"name":[0,"Matt Silverlock"],"slug":[0,"silverlock"],"bio":[0,"Director of Product at Cloudflare."],"profile_image":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7xP5qePZD9eyVtwIesXYxh/e714aaa573161ec9eb48d59bd1aa6225/silverlock.jpeg"],"location":[0,null],"website":[0,null],"twitter":[0,"@elithrar"],"facebook":[0,null]}],[0,{"name":[0,"Javier Castro"],"slug":[0,"javier"],"bio":[0,null],"profile_image":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3hJsvxP0uRGmk4DjS9IdSW/0197c661fe20e1ebc9922768d727df02/javier.png"],"location":[0,null],"website":[0,null],"twitter":[0,null],"facebook":[0,null]}]]],"meta_description":[0,"On Thursday February 6th, we experienced an outage with our object storage service (R2) and products that rely on it. Here's what happened and what we're doing to fix this going forward."],"primary_author":[0,{}],"localeList":[0,{"name":[0,"blog-english-only"],"enUS":[0,"English for Locale"],"zhCN":[0,"No Page for Locale"],"zhHansCN":[0,"No Page for Locale"],"zhTW":[0,"No Page for Locale"],"frFR":[0,"No Page for Locale"],"deDE":[0,"No Page for Locale"],"itIT":[0,"No Page for Locale"],"jaJP":[0,"No Page for Locale"],"koKR":[0,"No Page for Locale"],"ptBR":[0,"No Page for Locale"],"esLA":[0,"No Page for Locale"],"esES":[0,"No Page for Locale"],"enAU":[0,"No Page for Locale"],"enCA":[0,"No Page for Locale"],"enIN":[0,"No Page for Locale"],"enGB":[0,"No Page for Locale"],"idID":[0,"No Page for Locale"],"ruRU":[0,"No Page for Locale"],"svSE":[0,"No Page for Locale"],"viVN":[0,"No Page for Locale"],"plPL":[0,"No Page for Locale"],"arAR":[0,"No Page for Locale"],"nlNL":[0,"No Page for Locale"],"thTH":[0,"No Page for Locale"],"trTR":[0,"No Page for Locale"],"heIL":[0,"No Page for Locale"],"lvLV":[0,"No Page for Locale"],"etEE":[0,"No Page for Locale"],"ltLT":[0,"No Page for Locale"]}],"url":[0,"https://blog.cloudflare.com/cloudflare-incident-on-february-6-2025"],"metadata":[0,{"title":[0,"Cloudflare incident on February 6, 2025"],"description":[0,"On Thursday, February 6, 2025, we experienced an outage with our object storage service (R2) and products that rely on it. Here's what happened and what we're doing to fix this going forward."],"imgPreview":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/JjRBjBprqgy01LVreSbzM/9744518638a7c5c7072d3e29bdf8e267/BLOG-2685_OG.png"]}]}],[0,{"id":[0,"5LVsblxj2Git54IRxadpyg"],"title":[0,"How we prevent conflicts in authoritative DNS configuration using formal verification"],"slug":[0,"topaz-policy-engine-design"],"excerpt":[0,"We describe how Cloudflare uses a custom Lisp-like programming language and formal verifier (written in Racket and Rosette) to prevent logical contradictions in our authoritative DNS nameserver’s behavior."],"featured":[0,false],"html":[0,"
Over the last year, Cloudflare has begun formally verifying the correctness of our internal DNS addressing behavior — the logic that determines which IP address a DNS query receives when it hits our authoritative nameserver. This means that for every possible DNS query for a proxied domain we could receive, we try to mathematically prove properties about our DNS addressing behavior, even when different systems (owned by different teams) at Cloudflare have contradictory views on which IP addresses should be returned.
To achieve this, we formally verify the programs — written in a custom Lisp-like programming language — that our nameserver executes when it receives a DNS query. These programs determine which IP addresses to return. Whenever an engineer changes one of these programs, we run all the programs through our custom model checker (written in Racket + Rosette) to check for certain bugs (e.g., one program overshadowing another) before the programs are deployed.
Our formal verifier runs in production today, and is part of a larger addressing system called Topaz. In fact, it’s likely you’ve made a DNS query today that triggered a formally verified Topaz program.
This post is a technical description of how Topaz’s formal verification works. Besides being a valuable tool for Cloudflare engineers, Topaz is a real-world example of formal verification applied to networked systems. We hope it inspires other network operators to incorporate formal methods, where appropriate, to help make the Internet more reliable for all.
Topaz’s full technical details have been peer-reviewed and published in ACM SIGCOMM 2024, with both a paper and short video available online.
When a DNS query for a customer’s proxied domain hits Cloudflare’s nameserver, the nameserver returns an IP address — but how does it decide which address to return?
Let’s make this more concrete. When a customer, say example.com, signs up for Cloudflare and proxies their traffic through Cloudflare, it makes Cloudflare’s nameserver authoritative for their domain, which means our nameserver has the authority to respond to DNS queries for example.com. Later, when a client makes a DNS query for example.com, the client’s recursive DNS resolver (for example, 1.1.1.1) queries our nameserver for the authoritative response. Our nameserver returns someCloudflare IP address (of our choosing) to the resolver, which forwards that address to the client. The client then uses the IP address to connect to Cloudflare’s network, which is a global anycast network — every data center advertises all of our addresses.
\n \n \n
Clients query Cloudflare’s nameserver (via their resolver) for customer domains. The nameserver returns Cloudflare IP addresses, advertised by our entire global network, which the client uses to connect to the customer domain. Cloudflare may then connect to the origin server to fulfill the user’s HTTPS request.
When the customer has configured a static IP address for their domain, our nameserver’s choice of IP address is simple: it simply returns that static address in response to queries made for that domain.
But for all other customer domains, our nameserver could respond with virtually any IP address that we own and operate. We may return the same address in response to queries for different domains, or different addresses in response to different queries for the same domain. We do this for resilience, but also because decoupling names and IP addresses improves flexibility.
With all that in mind, let’s return to our initial question: given a query for a proxied domain without a static IP, which IP address should be returned? The answer: Cloudflare chooses IP addresses to meet various business objectives. For instance, we may choose IPs to:
Change the IP address of a domain that is under attack.
Direct fractions of traffic to specific IP addresses to test new features or services.
To change authoritative nameserver behavior — how we choose IPs — a Cloudflare engineer encodes their desired DNS business objective as a declarative Topaz program. Our nameserver stores the list of all such programs such that when it receives a DNS query for a proxied domain, it executes the list of programs in sequence until one returns an IP address. It then returns that IP to the resolver.
\n \n \n
Topaz receives DNS queries (metadata included) for proxied domains from Cloudflare’s nameserver. It executes a list of policies in sequence until a match is found. It returns the resulting IP address to the nameserver, which forwards it to the resolver.
What do these programs look like?
Each Topaz program has three primary components:
Match function: A program’s match function specifies under which circumstances the program should execute. It takes as input DNS query metadata (e.g., datacenter information, account information) and outputs a boolean. If, given a DNS query, the match function returns true, the program’s response function is executed.
Response function: A program’s response function specifies which IP addresses should be chosen. It also takes as input all the DNS query metadata, but outputs a 3-tuple (IPv4 addresses, IPv6 addresses, and TTL). When a program’s match function returns true, its corresponding response function is executed. The resulting IP addresses and TTL are returned to the resolver that made the query.
Configuration: A program’s configuration is a set of variables that parameterize that program’s match and response function. The match and response functions reference variables in the corresponding configuration, thereby separating the macro-level behavior of a program (match/response functions) from its nitty-gritty details (specific IP addresses, names, etc.). This separation makes it easier to understand how a Topaz program behaves at a glance, without getting bogged down by specific function parameters.
Let’s walk through an example Topaz program. The goal of this program is to give all queried domains whose metadata field “tag1” is equal to “orange” a particular IP address. The program looks like this:
Before we walk through the program, note that the program’s configuration, match, and response function are YAML strings, but more specifically they are topaz-lang expressions. Topaz-lang is the domain-specific language (DSL) we created specifically for expressing Topaz programs. It is based on Scheme, but is much simpler. It is dynamically typed, it is not Turing complete, and every expression evaluates to exactly one value (though functions can throw errors). Operators cannot define functions within topaz-lang, they can only add new DSL functions by writing functions in the host language (Go). The DSL provides basic types (numbers, lists, maps) but also Topaz-specific types, like IPv4/IPv6 addresses and TTLs.
Let’s now examine this program in detail.
The config is a set of four bindings from name to value. The first binds the string ”orange” to the name desired_tag1. The second binds the IPv4 address 192.0.2.3 to the name ipv4. The third binds the IPv6 address 2001:DB8:1:3 to the name ipv6. And the fourth binds the TTL (for which we added a topaz-lang type) 300 (seconds) to the name t.
The match function is an expression that must evaluate to a boolean. It can reference configuration values (e.g., desired_tag1), and can also reference DNS query fields. All DNS query fields use the prefix query_ and are brought into scope at evaluation time. This program’s match function checks whether desired_tag1 is equal to the tag attached to the queried domain, query_domain_tag1.
The response function is an expression that evaluates to the special response type, which is really just a 3-tuple consisting of: a list of IPv4 addresses, a list of IPv6 addresses, and a TTL. This program’s response function simply returns the configured IPv4 address, IPv6 address, and TTL (seconds).
Critically, all Topaz programs are encoded as YAML and live in the same version-controlled file. Imagine this program file contained only the orange program above, but now, a new team wants to add a new program, which checks whether the queried domain’s “tag1” field is equal to “orange” AND that the domain’s “tag2” field is equal to true:
This new team must place their new orange_and_true program either below or above the orange program in the file containing the list of Topaz programs. For instance, they could place orange_and_true after orange, like so:
Now let’s add a third, more interesting Topaz program. Say a Cloudflare team wants to test a modified version of our CDN’s HTTP server on a small percentage of domains, and only in a subset of Cloudflare’s data centers. Furthermore, they want to distribute these queries across a specific IP prefix such that queries for the same domain get the same IP. They write the following:
This Topaz program is significantly more complicated, so let’s walk through it.
Starting with configuration:
The first configuration value, purple_datacenters, is bound to the expression (fetch_datacenters “purple”), which is a function that retrieves all Cloudflare data centers tagged “purple” via an internal HTTP API. The result of this function call is a list of data centers.
The second configuration value, percentage, is a number representing the fraction of traffic we would like our program to act upon.
The third and fourth names are bound to IP prefixes, v4 and v6 respectively (note the built-in ipv4_prefix and ipv6_prefix types).
The match function is also more complicated. First, note the let form — this lets operators define local variables. We define one local variable, a random number generator called rand seeded with the hash of the queried domain name. The match expression itself is a conjunction that checks two things.
First, it checks whether the query landed in a data center tagged “purple”.
Second, it checks whether a random number between 0 and 99 (produced by a generator seeded by the domain name) is less than the configured percentage. By seeding the random number generator with the domain, the program ensures that 10% of domains trigger a match. If we had seeded the RNG with, say, the query ID, then queries for the same domain would behave differently.
Together, the conjuncts guarantee that the match expression evaluates to true for 10% of domains queried in “purple” data centers.
Now let’s look at the response function. We define three local variables. The first is a hash of the domain. The second is an IPv4 address selected from the configured IPv4 prefix. select_from always chooses the same IP address given the same prefix and hash — this ensures that queries for a given domain always receive the same IP address (which makes it easier to correlate queries for a single domain), but that queries for different domains can receive different IP addresses within the configured prefix. The third local variable is an IPv6 address selected similarly. The response function returns these IP addresses and a TTL of value 1 (second).
Topaz’s control plane validates the list of programs and distributes them to our global nameserver instances. As we’ve seen, the list of programs resides in a single, version-controlled YAML file. When an operator changes this file (i.e., adds a program, removes a program, or modifies an existing program), Topaz’s control plane does the following, in order:
First, it validates the programs, making sure there are no syntax errors.
Second, it “finalizes” each program’s configuration by evaluating every configuration binding and storing the result (see the sketch after this list). For instance, to finalize the purple program, it evaluates fetch_datacenters and stores the resulting list. This way our authoritative nameservers never need to retrieve external data.
Third, it verifies the finalized programs, which we will explain below.
Finally, it distributes the finalized programs across our network.
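As a concrete illustration of the second step, here is a hypothetical Python sketch of finalization: any configuration binding that requires an external call is evaluated once, centrally, and its result is stored in the program before distribution. The fetch_datacenters stub and the dictionary layout are assumptions, not Topaz’s real representation.

```python
# Hypothetical sketch of finalization: config bindings that require external
# calls are evaluated once, centrally, and replaced by their results.
def fetch_datacenters(tag: str) -> list[str]:
    # In reality an internal HTTP API call; stubbed here for illustration.
    return ["abc01", "xyz02"]

def finalize(program: dict) -> dict:
    finalized = dict(program)
    finalized["config"] = {
        name: (value() if callable(value) else value)
        for name, value in program["config"].items()
    }
    return finalized

purple = {
    "name": "purple",
    "config": {
        "purple_datacenters": lambda: fetch_datacenters("purple"),
        "percentage": 10,
    },
}
print(finalize(purple)["config"]["purple_datacenters"])   # ['abc01', 'xyz02']
```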
Topaz’s control plane distributes the programs to all servers globally by writing the list of programs to Quicksilver, our edge key-value store. The Topaz service on each server detects changes in Quicksilver and updates its program list.
When our nameserver service receives a DNS query, it augments the query with additional metadata (e.g., tags) and then forwards the query to the Topaz service (both services run on every Cloudflare server) via Inter-Process Communication (IPC). Topaz, upon receiving a DNS query from the nameserver, walks through its program list, executing each program’s match function (using the topaz-lang interpreter) with the DNS query in scope (with values prefixed with query_). It walks the list until a match function returns true. It then executes that program’s response function, and returns the resulting IP addresses and TTL to our nameserver. The nameserver packages these addresses and TTL in valid DNS format, and then returns them to the resolver.
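Put together, the per-query behavior amounts to a first-match-wins walk over the program list. A minimal sketch, reusing the hypothetical Program and Response types from the earlier sketch:

```python
# First-match-wins dispatch over the ordered program list (illustrative).
def handle_query(programs: list[Program], query: dict) -> Response | None:
    for program in programs:                 # walk programs in file order
        if program.match(query):             # run this program's match function
            return program.response(query)   # first match wins
    return None  # no program matched; the nameserver answers the query normally
```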
Before programs are distributed to our global network, they are formally verified. Each program is passed through our formal verification tool which throws an error if a program has a bug, or if two programs (e.g., the orange_and_true and orange programs) conflict with one another.
The Topaz formal verifier (model-checker) checks three properties.
First, it checks that each program is satisfiable — that there exists some DNS query that causes each program’s match function to return true. This property is useful for detecting internally-inconsistent programs that will simply never match. For instance, if a program’s match expression was (and true false), there exists no query that will cause this to evaluate to true, so the verifier throws an error.
Second, it checks that each program is reachable — that there exists some DNS query that causes each program’s match function to return true given all preceding programs. This property is useful for detecting “dead” programs that are completely overshadowed by higher-priority programs. For instance, recall the ordering of the orange and orange_and_true programs:
The verifier would throw an error because the orange_and_true program is unreachable. For all DNS queries for which query_domain_tag1 is “orange”, regardless of the value of tag2, the orange program will always match first, which means the orange_and_true program can never match. To resolve this error, we’d need to swap the two programs, as we did above.
Finally, and most importantly, the verifier checks for program conflicts: queries that cause two programs to both match. It only performs this check for programs explicitly marked exclusive; operators mark a program as exclusive when they want to guarantee that no other exclusive program can match the same queries. If such a query exists, the verifier throws an error (and prints the offending query), and the operators must resolve the conflict by changing their programs.
To see what conflict detection looks like, consider the corrected ordering of the orange_and_true and orange programs, but note that the two programs have now been marked exclusive:
After marking these two programs exclusive, the verifier will throw an error. Not only will it say that these two programs can contradict one another, but it will provide a sample query as proof:
Checking: no exclusive programs match the same queries: check FAILED!
Intersecting programs found:
programs "orange_and_true" and "orange" both match any query...
  to any domain...
  with tag1: "orange"
  with tag2: true
The teams behind the orange and orange_and_true programs must resolve this conflict before the programs are deployed, and they can use the sample query above to help them do so. They have a few options. The simplest is to remove the exclusive setting from one of the programs and acknowledge that the two simply cannot be exclusive. In that case, the order of the two programs matters (one must have higher priority), and that is fine: Topaz allows developers to mark programs that absolutely must not overlap with other programs as exclusive, but sometimes that is just not possible. And when it’s not, at least the priority between the programs is explicit.
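To make the conflict check itself concrete, here is a toy version written directly against the Z3 SMT solver’s Python bindings. The real verifier is written in Rosette/Racket and models queries more richly; the encoding of the two match functions below is an assumption based on the descriptions above.

```python
# Toy conflict check: does any query make both exclusive programs match?
from z3 import And, Bool, Solver, String, StringVal, sat

query_domain_tag1 = String("query_domain_tag1")
query_domain_tag2 = Bool("query_domain_tag2")

orange_match = query_domain_tag1 == StringVal("orange")
orange_and_true_match = And(orange_match, query_domain_tag2)

s = Solver()
s.add(And(orange_match, orange_and_true_match))  # can both match the same query?
if s.check() == sat:
    # The model is a witness query, e.g. tag1 = "orange", tag2 = True.
    print("conflict witness:", s.model())
```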
Note: in practice, we place all exclusive programs at the top of the program file. This makes it easier to reason about interactions between exclusive and non-exclusive programs.
In short, verification is powerful not only because it catches bugs (e.g., unsatisfiable or unreachable programs), but also because it highlights the consequences of program changes. It gives operators immediate feedback on the impact of what they are about to deploy. If two programs conflict, operators are forced to resolve the conflict before deployment, rather than after an incident.
Bonus: verification-powered diffs. One of the newest features we’ve added to the verifier is something we call semantic diffs. It’s in its early stages, but the key insight is that operators often just want to understand the impact of their changes, even when those changes are deemed safe. To help, the verifier compares the old and new versions of the program file. Specifically, it looks for any query that matched program X in the old version but matches a different program Y in the new version (or vice versa). For instance, if we changed the orange_and_true program, the verifier would generate a report like the following:
Generating a report to help you understand your changes...
NOTE: the queries below (if any) are just examples. Other such queries may exist.

* program "orange_and_true" now MATCHES any query...
  to any domain...
  with tag1: "orange"
  with tag2: false
While not exhaustive, this information helps operators understand whether their changes are doing what they intend or not, before deployment. We look forward to expanding our verifier’s diff capabilities going forward.
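Conceptually, the semantic diff can also be posed as a solver query: find a DNS query whose matching behavior for a program differs between the old and new versions of the file. A toy Z3 sketch, assuming (purely for illustration) that orange_and_true’s tag2 requirement was flipped from true to false:

```python
# Toy semantic diff: find a query where old and new match behavior disagree.
from z3 import And, Bool, Not, Solver, String, StringVal, Xor, sat

tag1 = String("query_domain_tag1")
tag2 = Bool("query_domain_tag2")

old_match = And(tag1 == StringVal("orange"), tag2)       # old: tag2 must be true
new_match = And(tag1 == StringVal("orange"), Not(tag2))  # new: tag2 must be false (assumed change)

s = Solver()
s.add(Xor(old_match, new_match))  # matched before XOR matches now
if s.check() == sat:
    print("behavior changed for query:", s.model())
```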
How does the verifier work? At a high-level, the verifier checks that, for all possible DNS queries, the three properties outlined above are satisfied. A Satisfiability Modulo Theories (SMT) solver — which we explain below — makes this seemingly impossible operation feasible. (It doesn't literally loop over all DNS queries, but it is equivalent to doing so — it provides exhaustive proof.)
We implemented our formal verifier in Rosette, a solver-aided language built on the Racket programming language. Rosette makes writing a verifier more of an engineering exercise than an exercise in formal logic: if you can express an interpreter for your language in Racket/Rosette, you get verification “for free”, in some sense. We wrote a topaz-lang interpreter in Racket, then expressed our three properties using the Rosette DSL.
How does Rosette work? Rosette translates our desired properties into formulae in first-order logic. At a high level, these formulae are like equations from algebra class in school, with “unknowns” or variables. For instance, when checking whether the orange program is reachable (with the orange_and_true program ordered before it), Rosette produces the formula ((NOT orange_and_true.match) AND orange.match). The “unknowns” here are the DNS query parameters that these match functions operate over, e.g., query_domain_tag1. To solve this formula, Rosette interfaces with an SMT solver (like Z3), which is specifically designed to solve these types of formulae by efficiently finding values to assign to the DNS query parameters that make the formulae true. Once the SMT solver finds satisfying values, Rosette translates them into a Racket data structure: in our case, a sample DNS query. In this example, once it finds a satisfying DNS query, it would report that the orange program is indeed reachable.
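The reachability formula quoted above maps almost directly onto a solver query. A sketch using Z3’s Python bindings (Rosette builds and discharges formulas like this automatically; the variable encoding here is an assumption):

```python
# ((NOT orange_and_true.match) AND orange.match): is "orange" reachable
# when orange_and_true is ordered before it?
from z3 import And, Bool, Not, Solver, String, StringVal, sat

tag1 = String("query_domain_tag1")
tag2 = Bool("query_domain_tag2")

orange_match = tag1 == StringVal("orange")
orange_and_true_match = And(orange_match, tag2)

s = Solver()
s.add(And(Not(orange_and_true_match), orange_match))
if s.check() == sat:
    print("reachable, e.g.:", s.model())  # e.g. tag1 = "orange", tag2 = False
```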
However, verification is not free. The primary cost is maintenance: the model checker’s topaz-lang interpreter (written in Racket) must be kept in lockstep with the production interpreter (written in Go). If they fall out of sync, the verifier loses the ability to accurately detect bugs. Furthermore, any function added to topaz-lang must be compatible with formal verification.
Also, not all functions are easily verifiable, which means we must restrict the kinds of functions that program authors can write. Rosette can only verify functions that operate over integers and bit-vectors. This means we only permit functions whose operations can be converted into operations over integers and bit-vectors. While this seems restrictive, it actually gets us pretty far. The main challenge is strings: Topaz does not support programs that, for example, manipulate or work with substrings of the queried domain name. However, it does support simple operations on closed-set strings. For instance, it supports checking if two domain names are equal, because we can convert all strings to a small set of values representable using integers (which are easily verifiable).
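One way to picture the closed-set string trick: because only a small, known set of tag values ever appears, each string can be assigned an integer, and string equality in a match function becomes integer equality, which solvers over integers and bit-vectors handle directly. (The Z3 snippets above use Z3’s native string theory for readability; Rosette’s restriction to integers and bit-vectors is what motivates this encoding.) The mapping below is purely illustrative:

```python
# Illustrative closed-set encoding: tag strings become small integers so
# that equality checks are verifiable over integers/bit-vectors.
from z3 import Int, IntVal, Solver, sat

TAG_ENCODING = {"orange": 0, "purple": 1, "green": 2}  # assumed closed set

query_domain_tag1 = Int("query_domain_tag1")
desired_tag1 = IntVal(TAG_ENCODING["orange"])

s = Solver()
s.add(query_domain_tag1 >= 0, query_domain_tag1 < len(TAG_ENCODING))  # domain constraint
s.add(query_domain_tag1 == desired_tag1)   # "string equality", encoded over integers
print(s.check())  # sat
```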
Fortunately, thanks to the design of Topaz programs, the verifier need not be compatible with all Topaz program code. The verifier only ever examines match functions, so only the functions used in match expressions need to be verification-compatible. We encountered other challenges in making our model accurate, such as modeling randomness; if you are interested in the details, we encourage you to read the paper.
Another potential cost is verification speed. We find that the verifier can ensure our existing seven programs satisfy all three properties within about six seconds, which is acceptable because verification happens only at build time. We verify programs centrally, before programs are deployed, and only when programs change.
We also ran microbenchmarks to determine how fast the verifier can check more programs — we found that, for instance, it would take the verifier about 300 seconds to verify 50 programs. While 300 seconds is still acceptable, we are looking into verifier optimizations that will reduce the time further.
Bringing formal verification from research to production
Topaz’s verifier began as a research project, and has since been deployed to production. It formally verifies all changes made to the authoritative DNS behavior specified in Topaz.
For more in-depth information on Topaz, see both our research paper published at SIGCOMM 2024 and the recording of the talk.
We thank our former intern, Tim Alberdingk-Thijm, for his invaluable work on Topaz’s verifier.
"],"published_at":[0,"2024-11-08T14:00+00:00"],"updated_at":[0,"2024-11-08T17:51:35.682Z"],"feature_image":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7jvYSp04T46jqP9h9zOSyE/8bb1594ce04a8aaed1cb35331a59c203/image2.png"],"tags":[1,[[0,{"id":[0,"5fZHv2k9HnJ7phOPmYexHw"],"name":[0,"DNS"],"slug":[0,"dns"]}],[0,{"id":[0,"1x7tpPmKIUCt19EDgM1Tsl"],"name":[0,"Research"],"slug":[0,"research"]}],[0,{"id":[0,"4G97lXs2uurehtdhYVj4r6"],"name":[0,"Addressing"],"slug":[0,"addressing"]}],[0,{"id":[0,"35dMVPLpVqHtZrNY9drkyz"],"name":[0,"Formal Methods"],"slug":[0,"formal-methods"]}]]],"relatedTags":[0],"authors":[1,[[0,{"name":[0,"James Larisch"],"slug":[0,"james-larisch"],"bio":[0],"profile_image":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1CYzjboSN2jX2JkZpzkAN0/837e9a57e77efd26c440e2b216323f0f/unnamed.jpg"],"location":[0],"website":[0],"twitter":[0,"@jameslarisch"],"facebook":[0]}],[0,{"name":[0,"Suleman Ahmad"],"slug":[0,"suleman"],"bio":[0,null],"profile_image":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1WnbvI0pMk4oVLXHdvpIqH/51084688ace222ec1c1aafbb3bf0fc8b/suleman.png"],"location":[0,"Austin, TX"],"website":[0,null],"twitter":[0,"@sulemanahmadd"],"facebook":[0,null]}],[0,{"name":[0,"Marwan Fayed"],"slug":[0,"marwan"],"bio":[0,null],"profile_image":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/IXphrXdPbBBtMHdRMzIrB/340102a31143f4743a74c87ec0e0c11e/marwan.jpg"],"location":[0,null],"website":[0,null],"twitter":[0,"@marwanfayed"],"facebook":[0,null]}]]],"meta_description":[0,"We describe how Cloudflare uses a custom Lisp-like programming language and formal verifier (written in Racket and Rosette) to prevent logical contradictions in our authoritative DNS nameserver’s behavior."],"primary_author":[0,{}],"localeList":[0,{"name":[0,"blog-english-only"],"enUS":[0,"English for Locale"],"zhCN":[0,"No Page for Locale"],"zhHansCN":[0,"No Page for Locale"],"zhTW":[0,"No Page for Locale"],"frFR":[0,"No Page for Locale"],"deDE":[0,"No Page for Locale"],"itIT":[0,"No Page for Locale"],"jaJP":[0,"No Page for Locale"],"koKR":[0,"No Page for Locale"],"ptBR":[0,"No Page for Locale"],"esLA":[0,"No Page for Locale"],"esES":[0,"No Page for Locale"],"enAU":[0,"No Page for Locale"],"enCA":[0,"No Page for Locale"],"enIN":[0,"No Page for Locale"],"enGB":[0,"No Page for Locale"],"idID":[0,"No Page for Locale"],"ruRU":[0,"No Page for Locale"],"svSE":[0,"No Page for Locale"],"viVN":[0,"No Page for Locale"],"plPL":[0,"No Page for Locale"],"arAR":[0,"No Page for Locale"],"nlNL":[0,"No Page for Locale"],"thTH":[0,"No Page for Locale"],"trTR":[0,"No Page for Locale"],"heIL":[0,"No Page for Locale"],"lvLV":[0,"No Page for Locale"],"etEE":[0,"No Page for Locale"],"ltLT":[0,"No Page for Locale"]}],"url":[0,"https://blog.cloudflare.com/topaz-policy-engine-design"],"metadata":[0,{"title":[0,"How we prevent conflicts in authoritative DNS configuration using formal verification"],"description":[0,"We describe how Cloudflare uses a custom Lisp-like programming language and formal verifier (written in Racket and Rosette) to prevent logical contradictions in our authoritative DNS nameserver’s behavior."],"imgPreview":[0,"https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7kguebziUGENWVEKYDB09P/cbd2d1758e28879bbd52b5d46fffe905/How_we_prevent_conflicts_in_authoritative_DNS_configuration_using_formal_verification-OG.png"]}]}]]],"locale":[0,"en-us"],"translations":[0,{"posts.by":[0,"By"],"footer.gdpr":[0,"GDPR"],"lang_blurb1":[0,"This post is also available in {lang1}."],"lang_blurb2":[0,"This post is also 
available in {lang1} and {lang2}."],"lang_blurb3":[0,"This post is also available in {lang1}, {lang2} and {lang3}."],"footer.press":[0,"Press"],"header.title":[0,"The Cloudflare Blog"],"search.clear":[0,"Clear"],"search.filter":[0,"Filter"],"search.source":[0,"Source"],"footer.careers":[0,"Careers"],"footer.company":[0,"Company"],"footer.support":[0,"Support"],"footer.the_net":[0,"theNet"],"search.filters":[0,"Filters"],"footer.our_team":[0,"Our team"],"footer.webinars":[0,"Webinars"],"page.more_posts":[0,"More posts"],"posts.time_read":[0,"{time} min read"],"search.language":[0,"Language"],"footer.community":[0,"Community"],"footer.resources":[0,"Resources"],"footer.solutions":[0,"Solutions"],"footer.trademark":[0,"Trademark"],"header.subscribe":[0,"Subscribe"],"footer.compliance":[0,"Compliance"],"footer.free_plans":[0,"Free plans"],"footer.impact_ESG":[0,"Impact/ESG"],"posts.follow_on_X":[0,"Follow on X"],"footer.help_center":[0,"Help center"],"footer.network_map":[0,"Network Map"],"header.please_wait":[0,"Please Wait"],"page.related_posts":[0,"Related posts"],"search.result_stat":[0,"Results {search_range} of {search_total} for {search_keyword}"],"footer.case_studies":[0,"Case Studies"],"footer.connect_2024":[0,"Connect 2024"],"footer.terms_of_use":[0,"Terms of Use"],"footer.white_papers":[0,"White Papers"],"footer.cloudflare_tv":[0,"Cloudflare TV"],"footer.community_hub":[0,"Community Hub"],"footer.compare_plans":[0,"Compare plans"],"footer.contact_sales":[0,"Contact Sales"],"header.contact_sales":[0,"Contact Sales"],"header.email_address":[0,"Email Address"],"page.error.not_found":[0,"Page not found"],"footer.developer_docs":[0,"Developer docs"],"footer.privacy_policy":[0,"Privacy Policy"],"footer.request_a_demo":[0,"Request a demo"],"page.continue_reading":[0,"Continue reading"],"footer.analysts_report":[0,"Analyst reports"],"footer.for_enterprises":[0,"For enterprises"],"footer.getting_started":[0,"Getting Started"],"footer.learning_center":[0,"Learning Center"],"footer.project_galileo":[0,"Project Galileo"],"pagination.newer_posts":[0,"Newer Posts"],"pagination.older_posts":[0,"Older Posts"],"posts.social_buttons.x":[0,"Discuss on X"],"search.icon_aria_label":[0,"Search"],"search.source_location":[0,"Source/Location"],"footer.about_cloudflare":[0,"About Cloudflare"],"footer.athenian_project":[0,"Athenian Project"],"footer.become_a_partner":[0,"Become a partner"],"footer.cloudflare_radar":[0,"Cloudflare Radar"],"footer.network_services":[0,"Network services"],"footer.trust_and_safety":[0,"Trust & Safety"],"header.get_started_free":[0,"Get Started Free"],"page.search.placeholder":[0,"Search Cloudflare"],"footer.cloudflare_status":[0,"Cloudflare Status"],"footer.cookie_preference":[0,"Cookie Preferences"],"header.valid_email_error":[0,"Must be valid email."],"search.result_stat_empty":[0,"Results {search_range} of {search_total}"],"footer.connectivity_cloud":[0,"Connectivity cloud"],"footer.developer_services":[0,"Developer services"],"footer.investor_relations":[0,"Investor relations"],"page.not_found.error_code":[0,"Error Code: 404"],"search.autocomplete_title":[0,"Insert a query. 
Press enter to send"],"footer.logos_and_press_kit":[0,"Logos & press kit"],"footer.application_services":[0,"Application services"],"footer.get_a_recommendation":[0,"Get a recommendation"],"posts.social_buttons.reddit":[0,"Discuss on Reddit"],"footer.sse_and_sase_services":[0,"SSE and SASE services"],"page.not_found.outdated_link":[0,"You may have used an outdated link, or you may have typed the address incorrectly."],"footer.report_security_issues":[0,"Report Security Issues"],"page.error.error_message_page":[0,"Sorry, we can't find the page you are looking for."],"header.subscribe_notifications":[0,"Subscribe to receive notifications of new posts:"],"footer.cloudflare_for_campaigns":[0,"Cloudflare for Campaigns"],"header.subscription_confimation":[0,"Subscription confirmed. Thank you for subscribing!"],"posts.social_buttons.hackernews":[0,"Discuss on Hacker News"],"footer.diversity_equity_inclusion":[0,"Diversity, equity & inclusion"],"footer.critical_infrastructure_defense_project":[0,"Critical Infrastructure Defense Project"]}],"localesAvailable":[1,[]],"footerBlurb":[0,"Cloudflare's connectivity cloud protects entire corporate networks, helps customers build Internet-scale applications efficiently, accelerates any website or Internet application, wards off DDoS attacks, keeps hackers at bay, and can help you on your journey to Zero Trust.
Visit 1.1.1.1 from any device to get started with our free app that makes your Internet faster and safer.
To learn more about our mission to help build a better Internet, start here. If you're looking for a new career direction, check out our open positions."]}" client="load" opts="{"name":"Post","value":true}" await-children="">
The Internet is a cooperative system: CNAME to Dyn DNS outage of 6 July 2015
Today, shortly after 21:00 UTC, on our internal operations chat there was a scary message from one of our senior support staff: "getting DNS resolution errors on support.cloudflare.com", at the same time as automated monitoring indicated a problem. Shortly thereafter, we saw alarms and feedback from a variety of customers (but not everyone) reporting "1001 errors", which indicated a DNS resolution error on the CloudFlare backend. Needless to say, this got an immediate and overwhelming response from our operations and engineering teams, as we hadn't changed anything and had no other indications of anomaly.
In the course of debugging, we identified a common characteristic of the affected sites: they were CNAME-based users of CloudFlare, rather than complete domains hosted entirely on CloudFlare, and, ironically, they included our own support site, support.cloudflare.com. When users point (via CNAME) to a domain instead of providing us with an IP address, our network resolves that name, and it obviously cannot connect if the DNS provider has issues. (Our status page, https://www.cloudflarestatus.com/, is off-network and was unaffected.) We then investigated why only certain domains were having issues: was the problem with the upstream DNS? Testing whether the affected domains were resolvable on the Internet (which they were) added a confounding data point.
The Internet is made up of many networks, operated by companies, organizations, governments, and individuals around the world, all cooperating using a common set of protocols and agreed policies and behaviors. These systems interoperate in a number of ways, sometimes entirely non-obviously. The mutual goal is to provide service to end users, letting them communicate, enjoy new services, and explore together. When one provider has a technical issue, it can cascade throughout the Internet and it isn’t obvious to users exactly which piece is broken.
Fortunately, even when companies are competitors, the spirit of the Internet is to work together for the good of the users. Once we identified this issue, we immediately contacted Dyn and relayed what we knew, and worked with them to resolve the issue for everyone’s benefit. We have temporarily put in workarounds to address this issue on our side, and hope the underlying difficulties will be resolved shortly.
Update: The good folks at Dyn have posted a short explanation of what happened on their nameservers.