
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Thu, 07 May 2026 00:06:57 GMT</lastBuildDate>
        <item>
            <title><![CDATA[When DNSSEC goes wrong: how we responded to the .de TLD outage]]></title>
            <link>https://blog.cloudflare.com/de-tld-outage-dnssec/</link>
            <pubDate>Wed, 06 May 2026 17:00:00 GMT</pubDate>
            <description><![CDATA[ On May 5, 2026, DENIC published broken DNSSEC signatures for the .de TLD, making millions of domains unreachable. Here's what 1.1.1.1 saw, how serve stale cushioned the impact, and how we restored resolution. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>On May 5, 2026, at roughly 19:30 UTC, DENIC, the registry operator for the <code>.de</code> country-code top-level domain (TLD), started publishing incorrect DNSSEC signatures for the <code>.de</code> zone. Any validating DNS resolver receiving these signatures was required by the DNSSEC specification to reject them and return SERVFAIL to clients, including <a href="https://www.cloudflare.com/learning/dns/what-is-1.1.1.1/"><u>1.1.1.1</u></a>, the public DNS resolver operated by Cloudflare. </p><p>The country-code top-level domain for Germany, <code>.de</code>, is one of the largest on the Internet. On <a href="https://radar.cloudflare.com/tlds?dateRange=7d"><u>Cloudflare Radar</u></a>, it consistently ranks among the most broadly queried TLDs globally. An outage at this level of the DNS hierarchy has the potential to make millions of domains unreachable.</p><p>In this post, we’ll walk through what we saw, the impact of these events, and how we applied temporary mitigations while DENIC resolved the issue.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4hF64h72z4oKRg28w0mDJm/7f535cf687750f9ea730c27fa5e729e3/BLOG-3309_2.png" />
          </figure>
    <div>
      <h2>How DNSSEC works</h2>
      <a href="#how-dnssec-works">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/dns/dnssec/how-dnssec-works/"><u>DNSSEC</u></a> (Domain Name System Security Extensions) adds cryptographic authentication to DNS. When a zone is signed with DNSSEC, each set of records is accompanied by a digital signature known as an RRSIG record that lets a resolver verify the records haven’t been tampered with. Unlike encrypted DNS protocols, such as DNS over TLS (DoT) and DNS over HTTPs (DoH), DNSSEC is about integrity, not privacy. The records are visible, but their authenticity can be proven.</p><p>What makes DNSSEC unique is that the signatures travel together with the records they protect. This means integrity can be verified regardless of how many caches or hops a response has passed through. A cached record is just as verifiable as a fresh one.</p><p>DNSSEC is built on a chain of trust. Starting at the root zone, whose trust anchor is hard-coded into the resolvers, each zone delegates trust to child zones via Delegation Signer (DS) records. A DS record in the parent zone contains a cryptographic hash of a public key in the child zone. When a resolver validates <code>example.de</code> it verifies the chain: root trusts <code>.de</code>, <code>.de</code> trusts <code>example.de</code>. A break anywhere in that chain causes validation to fail for everything below it, which is why a misconfiguration at a TLD like <code>.de</code> affects every domain under it.</p><p>Zones typically use two types of keys: a Zone Signing Key (ZSK), used to sign the zone’s records, and a Key Signing Key (KSK), used to sign the ZSK itself. The KSK’s public key is what the parent zone’s DS record points to, anchoring the chain of trust. Rotating a ZSK is relatively straightforward: generate a new key, re-sign the zone’s records, and wait for caches to expire. Rotating a KSK is more involved, because the parent’s DS record must also be updated, often requiring coordination with a registrar or registry.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6EDg7LKirRAVrzXCYprNIv/f14a9e3a24595d898cc9a650e9101fdd/image13.png" />
          </figure><p>During a key rotation, there is a critical window where the old key is being phased out and the new one phased in. If the signatures published in the zone are made with a key that resolvers cannot verify against the zone’s published DNSKEY records, whether because the signing step failed, the timing was wrong, or the new key wasn’t fully distributed yet, resolvers have no choice but to reject the responses and return SERVFAIL.</p>
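            <p>Every link in this chain can be inspected with standard tools. As a rough illustration, the queries below fetch the DS record that the root zone publishes for <code>.de</code>, the DNSKEY set that <code>.de</code> itself serves, and a signed answer together with its RRSIG. The key tags, key material, and truncated values shown are placeholders, not the real <code>.de</code> data:</p>
            <pre><code>$ dig de. DS +short            # hash of .de's KSK, as published in the root zone
26755 8 2 F3413578...

$ dig de. DNSKEY +short        # .de's public keys (257 = KSK, 256 = ZSK)
257 3 8 AwEAAYbc...
256 3 8 AwEAAZ4e...

$ dig example.de. A +dnssec    # an answer together with the RRSIG that covers it</code></pre>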
    <div>
      <h2>What we saw</h2>
      <a href="#what-we-saw">
        
      </a>
    </div>
    <p>On May 5, 2026, at roughly 19:30 UTC, DENIC, the operator for the <code>.de</code> TLD, started publishing incorrect DNSSEC signatures for the <code>.de</code> zone. Any validating resolver receiving these records was required by the DNSSEC specification to reject them and return SERVFAIL. 1.1.1.1 was no exception.</p><p>The graph below shows the response codes 1.1.1.1 returned for <code>.de</code> queries during the incident.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/78zFXArtjyc8vcUup4zr9L/4207aa01b3caad16392266b4c32037e7/BLOG-3309_4.png" />
          </figure><p>After the immediate spike at 19:30 UTC, the SERVFAIL rate climbed steadily over the following three hours as cached records expired. As each domain's cached records expired and 1.1.1.1 went back to DENIC's nameservers for fresh copies, it received broken signatures and those queries began to fail.</p><p>Also visible is a large increase in query volume. This is typical during DNS incidents, as clients retry failed queries, often three or more times, inflating the raw numbers. The SERVFAIL rate therefore looks more alarming than the actual user impact, as many of those queries represent the same user retrying the same domain.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KpMo46Phe5HtxmP34FYMK/46a1281a625760f58d592cbde91943f8/BLOG-3309_5.png" />
          </figure><p>What might be surprising is that the NOERROR rate stayed relatively stable throughout the incident. That's “serve stale” at work, which we'll cover in the next section.</p>
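          <p>For reference, this is roughly what the failure looked like from a client’s point of view. The output below is illustrative rather than captured during the incident; the second query sets the CD (Checking Disabled) bit, which asks the resolver to skip DNSSEC validation and confirms that validation, not an unreachable authority, is the reason for the SERVFAIL:</p>
            <pre><code>$ dig @1.1.1.1 example.de A
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 23471

$ dig @1.1.1.1 example.de A +cd    # CD bit set: skip DNSSEC validation
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50731
;; ANSWER SECTION:
example.de.    3600    IN    A    192.0.2.10</code></pre>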
    <div>
      <h2>Serve stale</h2>
      <a href="#serve-stale">
        
      </a>
    </div>
    <p>Recursive resolvers cache the records they receive from authoritative nameservers for the duration of each record's TTL (Time-to-Live). While a record is cached, the resolver serves it directly without going back to the authoritative nameserver. When the TTL expires, the resolver fetches a fresh copy and re-caches it.</p><p>During the outage, queries that had to be resolved freshly ended up returning SERVFAIL: the DNSSEC signatures were broken and the resolver correctly rejected them. But many <code>.de</code> records were still sitting in cache from before the incident began. Rather than immediately discarding those and returning SERVFAIL to users, 1.1.1.1 continued serving them past their TTL. This is called “serving stale.”</p><p>1.1.1.1 implements <a href="https://datatracker.ietf.org/doc/html/rfc8767"><u>RFC 8767</u></a>, which formalizes this behavior. When upstream resolution fails, a resolver may continue serving expired cached records rather than returning an error. This significantly cushions the impact of an upstream outage, buying time for operators to respond.</p><p>The result is visible in the graph below, which shows response codes for <code>.de</code> queries during the incident with the stale-served responses excluded. Without them, the NOERROR rate drops steadily from 19:30 onward. The difference represents queries that received good answers only because their records were still in cache.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3YUtnXiFixcdswxtGik46r/78082f4b4439130cf23ff1473448781a/BLOG-3309_6.png" />
          </figure>
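          <p>The serve-stale implementation in our resolver is internal to Cloudflare, but the RFC 8767 behavior it provides is also available in open-source resolvers. As a minimal sketch, this is roughly how it is enabled in an Unbound configuration; the values are examples rather than the settings 1.1.1.1 uses:</p>
            <pre><code># unbound.conf (illustrative)
server:
    # Answer from expired cached records when the upstream cannot provide a fresh one
    serve-expired: yes
    # Keep answering from expired data for up to a day past TTL expiry
    serve-expired-ttl: 86400
    # Wait up to 1.8 seconds (per RFC 8767) for a fresh answer before using expired data
    serve-expired-client-timeout: 1800</code></pre>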
    <div>
      <h2>Our mitigation</h2>
      <a href="#our-mitigation">
        
      </a>
    </div>
    <p>While the issue was largely out of our own control, and serve stale was doing its job, there was still a legitimate impact for a lot of users. Luckily, there were some actions we were able to take to improve the situation.</p>
    <div>
      <h3>Negative Trust Anchors</h3>
      <a href="#negative-trust-anchors">
        
      </a>
    </div>
    <p><a href="https://datatracker.ietf.org/doc/html/rfc7646"><u>RFC 7646</u></a> defines the concept of a Negative Trust Anchor (NTA). In normal DNSSEC operation, a validating resolver maintains a set of trust anchors: public keys at the root of the chain of trust. Each DNS zone signed with DNSSEC has a trust anchor, and every child zone builds its own trust anchor upon it. When the cryptographic signatures linking the chain together are broken, responses will be rejected and result in SERVFAIL. An NTA is an explicit exception. It tells the resolver to treat a specific zone as if it were unsigned, bypassing validation for names under that zone.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ZPkgIvIf1R9rlScLS7ofh/3483daabde429f99e5ca3bc2f6b5709f/BLOG-3309_7.png" />
          </figure><p>NTAs exist precisely for these types of incidents. When a TLD operator publishes broken signatures, every DNSSEC-validating resolver is forced to return SERVFAIL for every domain under that TLD. Not because of anything wrong with those domains themselves, but because their parent zone is misconfigured. Continuing to return SERVFAIL in that situation provides no security value: the failure is already known, public, and being fixed. RFC 7646 explicitly names TLD misconfiguration as the primary use case for NTAs.</p>
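          <p>Resolvers that implement NTAs natively expose them as an operator control, and others offer equivalent switches. The commands below are generic illustrations of an RFC 7646-style mitigation for <code>.de</code> using BIND and Unbound, not what Cloudflare deployed:</p>
            <pre><code># BIND: add a temporary NTA for .de (two hours), list current NTAs, and later remove it
$ rndc nta -lifetime 2h de
$ rndc nta -dump
$ rndc nta -remove de

# Unbound: ignore the DNSSEC chain of trust at and below .de (in unbound.conf)
server:
    domain-insecure: "de"</code></pre>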
    <div>
      <h3>What we actually deployed</h3>
      <a href="#what-we-actually-deployed">
        
      </a>
    </div>
    <p>For 1.1.1.1 we have our own resolver referred to as <a href="https://blog.cloudflare.com/big-pineapple-intro/"><u>Big Pineapple</u></a>, which also powers 1.1.1.1 for Families, Gateway DNS, DNS Firewall, and more. At this time, we have not implemented a native NTA mechanism. Instead, we used an existing override rule mechanism to mark <code>.de</code> as an insecure zone, which causes all <code>.de</code> queries to be resolved as if they don’t have DNSSEC enabled. This is functionally equivalent to an NTA, though the mechanism itself is not formally defined in any RFC.</p><p>The decision to bypass DNSSEC is a deliberate tradeoff. Without DNSSEC validation, <code>.de</code> domains become vulnerable to <a href="https://www.cloudflare.com/en-gb/learning/dns/dnssec/how-dnssec-works/"><u>genuine attacks</u></a> for the duration of the incident. In this case, we judged the tradeoff acceptable because the signing failure was widespread, publicly confirmed, and affected every validating resolver on the Internet equally. As it was put in our internal incident room: “<i>There is no user of 1.1.1.1 resolving a .de name right now who would prefer a SERVFAIL over an unvalidated response</i>.”</p><p>We rolled out our mitigation at 22:17 UTC, which marked the end of impact for 1.1.1.1. We communicated this to fellow DNS operators in the <a href="https://www.dns-oarc.net/oarc/services/chat"><u>DNS-OARC Mattermost</u></a>.</p>
    <div>
      <h3>Origin resolution mitigations</h3>
      <a href="#origin-resolution-mitigations">
        
      </a>
    </div>
    <p>While all Internet users can access our 1.1.1.1 resolver, we have a particular responsibility to customers using our CDN platform services. Those with <code>.de</code> origin names were also affected by this outage.</p><p>Cloudflare operates a separate internal resolver for origin resolution, distinct from our publicly available 1.1.1.1 service. To mitigate impact we applied a similar NTA for <code>.de</code> on the internal resolver service, restoring origin connectivity for affected customers.</p>
    <div>
      <h3>Extended DNS Errors</h3>
      <a href="#extended-dns-errors">
        
      </a>
    </div>
    <p>Before our mitigation, queries that couldn't be served from cache received a SERVFAIL response from 1.1.1.1. Each SERVFAIL included an Extended DNS Error (EDE) code, defined in <a href="https://datatracker.ietf.org/doc/html/rfc8914"><u>RFC 8914</u></a>, which gives clients more detail about what went wrong. </p><p>Some resolvers returned EDE 6 (DNSSEC Bogus) with a descriptive message pointing directly at the broken signature. This is the correct behavior:</p>
            <pre><code>EDE: 6 (DNSSEC Bogus): RRSIG with malformed signature found for example.de/nsec3 (keytag=33834)
</code></pre>
            <p></p><p>1.1.1.1, on the other hand, returned EDE 22 (No Reachable Authority), which on the surface suggests a connectivity problem with the upstream nameservers rather than a DNSSEC validation failure.</p><p>The cause is a bug in how we propagate DNSSEC EDE codes up from our trust chain verifier. When the verifier detects a bogus signature it creates the DNSSEC Bogus EDE code, but this is never inserted into the response. Instead, the outer layer of the resolver sees a recursive resolution failure with no error code attached and falls back to reporting “No Reachable Authority.” This obscures the underlying DNSSEC cause.</p><p>We're aware that this isn't helpful for 1.1.1.1 users and will be fixing our responses to surface the DNSSEC errors.</p>
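          <p>Recent versions of <code>dig</code> print EDE options when a server includes them, so the difference is visible from the command line. The output below is abbreviated and illustrative of what our EDE 22 responses looked like during the incident:</p>
            <pre><code>$ dig @1.1.1.1 example.de A

;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31337
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; EDE: 22 (No Reachable Authority)</code></pre>
            <p>A resolver that surfaces the validation failure would instead attach EDE 6 (DNSSEC Bogus), as in the example above.</p>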
    <div>
      <h2>Is this a failure of DNSSEC as a technology?</h2>
      <a href="#is-this-a-failure-of-dnssec-as-a-technology">
        
      </a>
    </div>
    <p>DNS is a critical part of the request chain for most Internet communication. It would be easy to conclude that this outage and the mitigations applied mean DNSSEC has failed as a technology. However, any technology that is misconfigured risks breaking for the users who rely on it. Leaving critical fiber cables exposed on the seabed for sharks to chew on does not invalidate the important role underwater cables play in today's Internet communications. It only highlights that we’ve sometimes failed to adequately protect them. The same applies here. DNSSEC serves a critical role in ensuring that the DNS answers we rely on have not been tampered with by malicious actors.</p>
    <div>
      <h2>#HugOps</h2>
      <a href="#hugops">
        
      </a>
    </div>
    <p>No one likes to have serious incidents. These things, unfortunately, happen to everyone who operates critical infrastructure at scale. When they do, the DNS community tends to show up for each other.</p><p>Incidents like this also highlight why relationships between operators matter. DNS is a decentralized system, no single organization controls all of it, and keeping it running reliably depends on mutual trust and open lines of communication between registries, resolver operators, and the broader community. Forums like <a href="https://dns-oarc.net/">DNS-OARC</a> provide exactly this: shared mailing lists and chat rooms where operators can coordinate quickly across organizational boundaries when something goes wrong.</p><p>DENIC has published <a href="https://blog.denic.de/en/technical-issue-with-de-domains-resolved/"><u>a short blog post about the incident</u></a> where they state: “The outage is linked to a routine, scheduled key rollover. During this process, non-validatable signatures were generated and distributed. As a precautionary measure, future rollovers have been suspended until the exact technical causes have been identified.”</p><p> We're sure we’ll hear more when their own analysis is ready. </p>
    <div>
      <h2>Takeaways from this incident</h2>
      <a href="#takeaways-from-this-incident">
        
      </a>
    </div>
    <p>This incident highlights a structural reality of the DNS hierarchy: when a registry at the TLD level fails, every domain under that TLD is affected simultaneously, regardless of where it's hosted or which resolver is used. This isn't unique to DNSSEC; the same is true if a TLD’s nameservers become unreachable. The hierarchy that makes the global DNS work is also what makes failures at the top propagate downward.</p><p>There is no simple fix for this. What the industry can do is respond quickly and consistently when it happens. In this incident, resolver operators across the Internet independently applied Negative Trust Anchors within an hour, restoring resolution while DENIC worked to fix the zone. Operational practices, industry communication channels like DNS-OARC, and features like serve stale all reduce the impact, even if they can’t eliminate the underlying dependency.</p><p>We also came away with some points to improve for ourselves. We will be improving our EDE responses to properly surface DNSSEC validation failures.</p><p>We look forward to DENIC’s post-incident report and appreciate the transparency they showed throughout.</p><p>If you want to learn more about how DNSSEC works, visit our page <a href="https://www.cloudflare.com/en-gb/learning/dns/dnssec/how-dnssec-works/"><u>How does DNSSEC work?</u></a> And you can always follow real-time DNS trends and TLD data on <a href="https://radar.cloudflare.com/tlds/de?dateStart=2026-05-05&amp;dateEnd=2026-05-06"><u>Cloudflare Radar</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[DNSSEC]]></category>
            <category><![CDATA[1.1.1.1]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Outage]]></category>
            <guid isPermaLink="false">2MckFmlh9Epgpruqa9MXRh</guid>
            <dc:creator>Sebastiaan Neuteboom</dc:creator>
            <dc:creator>Christian Elmerot</dc:creator>
            <dc:creator>Max Worsley</dc:creator>
        </item>
        <item>
            <title><![CDATA[Connection errors in Asia Pacific region on July 9, 2023]]></title>
            <link>https://blog.cloudflare.com/connection-errors-in-asia-pacific-region-on-july-9-2023/</link>
            <pubDate>Tue, 11 Jul 2023 08:48:13 GMT</pubDate>
            <description><![CDATA[ On July 9, 2023, users in the Asia Pacific region experienced connection errors due to origin DNS resolution failures to .com and .net TLD nameservers ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4XSclbffVsXyJvs6H28PzZ/8ca3e3e580eecf4e762af00eb94eb8d4/image2-5.png" />
            
            </figure><p>On Sunday, July 9, 2023, early morning UTC time, we observed a high number of DNS resolution failures — up to 7% of all DNS queries across the Asia Pacific region — caused by invalid DNSSEC signatures from Verisign .com and .net Top Level Domain (TLD) nameservers. This resulted in connection errors for visitors of Internet properties on Cloudflare in the region.</p><p>The local instances of Verisign’s nameservers started to respond with expired DNSSEC signatures in the Asia Pacific region. In order to remediate the impact, we have rerouted upstream DNS queries towards Verisign to locations on the US west coast which are returning valid signatures.</p><p>We have already reached out to Verisign to get more information on the root cause. Until their issues have been resolved, we will keep our DNS traffic to .com and .net TLD nameservers rerouted, which might cause slightly increased latency for the first visitor to domains under .com and .net in the region.</p>
    <div>
      <h3>Background</h3>
      <a href="#background">
        
      </a>
    </div>
    <p>In order to proxy a domain’s traffic through Cloudflare’s network, there are two components involved with respect to the Domain Name System (DNS) from the perspective of a Cloudflare data center: external DNS resolution, and upstream or origin DNS resolution.</p><p>To understand this, let’s look at the domain name <code>blog.cloudflare.com</code> — which is proxied through Cloudflare’s network — and let’s assume it is configured to use <code>origin.example</code> as the origin server:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5nnwNUFQmflIHHioGDISlg/250c388a2d796d0dc8139b4eddda6c05/image5-1.png" />
            
            </figure><p>Here, the external DNS resolution is the part where DNS queries to <code>blog.cloudflare.com</code> sent by public resolvers like <code>1.1.1.1</code> or <code>8.8.8.8</code> on behalf of a visitor return a set of Cloudflare Anycast IP addresses. This ensures that the visitor’s browser knows where to send an HTTPS request to load the website. Under the hood your browser performs a DNS query that looks something like this (the trailing dot indicates the <a href="https://en.wikipedia.org/wiki/DNS_root_zone">DNS root zone</a>):</p>
            <pre><code>$ dig blog.cloudflare.com. +short
104.18.28.7
104.18.29.7</code></pre>
            <p>(Your computer doesn’t actually use the dig command internally; we’ve used it to illustrate the process.) Then, when the next closest Cloudflare data center receives the HTTPS request for <code>blog.cloudflare.com</code>, it needs to fetch the content from the origin server (assuming it is not cached).</p><p>There are two ways Cloudflare can reach the origin server. If the DNS settings in Cloudflare contain IP addresses, then we can connect directly to the origin. In some cases, our customers use a CNAME, which means Cloudflare has to perform another DNS query to get the IP addresses associated with the CNAME. In the example above, <code>blog.cloudflare.com</code> has a CNAME record instructing us to look at <code>origin.example</code> for IP addresses. During the incident, only customers with CNAME records like this pointing to .com and .net domain names may have been affected.</p><p>The domain <code>origin.example</code> needs to be resolved by Cloudflare as part of the upstream or origin DNS resolution. This time, the Cloudflare data center needs to perform an outbound DNS query that looks like this:</p>
            <pre><code>$ dig origin.example. +short
192.0.2.1</code></pre>
            <p>DNS is a hierarchical protocol, so the recursive DNS resolver, which usually handles DNS resolution for whoever wants to resolve a <a href="https://www.cloudflare.com/learning/dns/glossary/what-is-a-domain-name/">domain name</a>, needs to talk to several involved nameservers until it finally gets an answer from the authoritative nameservers of the domain (assuming no DNS responses are cached). This is the same process during the external DNS resolution and the origin DNS resolution. Here is an example for the origin DNS resolution:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7E1GLN7i8qGi3oB6Zentug/8b55136d3c67a79d0d9c711c428911b4/image6-1.png" />
            
            </figure><p>Inherently, DNS is a public system that was originally designed without any means to validate the integrity of DNS traffic. So, in order to prevent someone from spoofing DNS responses, <a href="/dnssec-an-introduction/">DNS Security Extensions (DNSSEC)</a> were introduced as a means to authenticate that DNS responses really come from who they claim to come from. This is achieved by generating cryptographic signatures alongside existing DNS records like A, AAAA, MX, CNAME, etc. By validating a DNS record’s associated signature, it is possible to verify that a requested DNS record really comes from its authoritative nameserver and wasn’t altered en route. If a signature cannot be validated successfully, recursive resolvers usually return an error indicating the invalid signature. This is exactly what happened on Sunday.</p>
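            <p>To make this concrete, asking for DNSSEC data (the DO bit, set by dig’s <code>+dnssec</code> option) returns the signature alongside the records it covers. The output below is abbreviated and illustrative; the timestamps, key tag, and signature are placeholders:</p>
            <pre><code>$ dig blog.cloudflare.com. A +dnssec +short
104.18.28.7
104.18.29.7
A 13 3 300 20230801000000 20230715000000 34505 cloudflare.com. oK1xR...meBQ==</code></pre>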
    <div>
      <h3>Incident timeline and impact</h3>
      <a href="#incident-timeline-and-impact">
        
      </a>
    </div>
    <p>On Saturday, July 8, 2023, at 21:10 UTC, our logs showed the first instances of DNSSEC validation errors during upstream DNS resolution from multiple Cloudflare data centers in the Asia Pacific region. It appeared that NSEC3 responses (a DNSSEC mechanism to <a href="/black-lies/">prove non-existing DNS records</a>) from the .com and .net TLD nameservers included invalid signatures. About an hour later, at 22:16 UTC, the first internal alerts went off (alerting requires issues to persist for a certain period of time), but error rates were still low, at around 0.5% of all upstream DNS queries.</p><p>After several hours, the error rate had increased to the point where ~13% of our upstream DNS queries in affected locations were failing. This percentage continued to fluctuate between 10% and 15% of upstream DNS queries over the duration of the incident, and roughly 5-7% of all DNS queries (external &amp; upstream resolution) to affected Cloudflare data centers in the Asia Pacific region.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/acWvj718KdxYfGx33fBZT/7ee25b63cf83ee9ff18e6734aeb1cc3e/image1-6.png" />
            
            </figure><p>Initially it appeared as though only a single upstream nameserver was having issues with DNS resolution, however upon further investigation it was discovered that the issue was more widespread. Ultimately, we were able to verify that the Verisign nameservers for .com and .net were returning expired DNSSEC signatures on a portion of DNS queries in the Asia Pacific region. Based on our tests, other nameserver locations were correctly returning valid DNSSEC signatures.</p><p>In response, we rerouted our DNS traffic to the .com and .net TLD nameserver IP addresses to Verisign’s US west locations. After this change was propagated, the issue very quickly subsided and origin resolution error rates returned to normal levels.</p><p>All times are in UTC:</p><p><b>2023-07-08 21:10</b> First instances of DNSSEC validation errors appear in our logs for origin DNS resolution.</p><p><b>2023-07-08 22:16</b> First internal alerts for Asia Pacific data centers go off indicating origin DNS resolution error on our test domain. Very few errors for other domains at this point.</p><p><b>2023-07-09 02:58</b> Error rates have increased substantially since the first instance. An incident is declared.</p><p><b>2023-07-09 03:28</b> DNSSEC validation issues seem to be isolated to a single upstream provider. It is not abnormal that a single upstream has issues that propagate back to us, and in this case our logs were predominantly showing errors from domains that resolve to this specific upstream.</p><p><b>2023-07-09 04:52</b> We realize that DNSSEC validation issues are more widespread and affect multiple .com and .net domains. Validation issues continue to be isolated to the Asia Pacific region only, and appear to be intermittent.</p><p><b>2023-07-09 05:15</b> DNS queries via popular recursive resolvers like 8.8.8.8 and 1.1.1.1 do not return invalid DNSSEC signatures at this point. DNS queries using the local stub resolver continue to return DNSSEC errors.</p><p><b>2023-07-09 06:24</b> Responses from .com and .net Verisign nameservers in Singapore contain expired DNSSEC signatures, but responses from Verisign TLD nameservers in other locations are fine.</p><p><b>2023-07-09 06:41</b> We contact Verisign and notify them about expired DNSSEC signatures.</p><p><b>2023-07-09 06:50</b> In order to remediate the impact, we reroute DNS traffic via IPv4 for the .com and .net Verisign nameserver IPs to US west IPs instead. This immediately leads to a substantial drop in the error rate.</p><p><b>2023-07-09 07:06</b> We also reroute DNS traffic via IPv6 for the .com and .net Verisign nameserver IPs to US west IPs. This leads to the error rate going down to zero.</p><p><b>2023-07-10 09:23</b> The rerouting is still in place, but the underlying issue with expired signatures in the Asia Pacific region has still not been resolved.</p><p><b>2023-07-10 18:23</b> Verisign gets back to us confirming that they “were serving stale data” at their local site and have resolved the issues.</p>
    <div>
      <h3>Technical description of the error and how it happened</h3>
      <a href="#technical-description-of-the-error-and-how-it-happened">
        
      </a>
    </div>
    <p>As mentioned in the introduction, the underlying cause of this incident was expired DNSSEC signatures for the .net and .com zones. Expired DNSSEC signatures cause a DNS response to be interpreted as invalid. There are two main scenarios in which this error was observed by a user:</p><ol><li><p><a href="https://developers.cloudflare.com/dns/cname-flattening/">CNAME flattening</a> for external DNS resolution. This is when our authoritative nameservers attempt to return the IP address(es) that a CNAME record target resolves to rather than the CNAME record itself.</p></li><li><p>CNAME target lookup for origin DNS resolution. This is most commonly used when an HTTPS request is sent to a Cloudflare anycast IP address, and we need to determine what IP address to forward the request to. See <a href="https://developers.cloudflare.com/fundamentals/get-started/concepts/how-cloudflare-works/">How Cloudflare works</a> for more details.</p></li></ol><p>In both cases, behind the scenes the DNS query goes through an in-house recursive DNS resolver in order to look up what a hostname resolves to. The purpose of this resolver is to cache responses, optimize future queries, and provide DNSSEC validation. If the query from this resolver fails for whatever reason, our authoritative DNS system will not be able to perform the two scenarios outlined above.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5qE0JFXaLHPwt3orOsNBm5/37d2d34d396d4cc4c2a22b7241e4120f/image3-1.png" />
            
            </figure><p>During the incident, the recursive resolver was failing to validate the DNSSEC signatures in DNS responses for domains ending in .com and .net. These signatures are sent in the form of an RRSIG record alongside the set of DNS records they cover, known as a Resource Record set (RRset). Each RRSIG record contains the following fields:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3aZjsnpM6WSE70sPrnkHbr/6af366bd2accafc3f06296e241ceaba5/image4.png" />
            
            </figure><p>As you can see, the main part of the payload is associated with the signature and its corresponding metadata. Each recursive resolver is responsible for not only checking the signature but also the expiration time of the signature. It is important to obey the expiration time in order to avoid returning responses for RRsets that have been signed by old keys, which could have potentially been compromised by that time.</p><p>During this incident, Verisign, the authoritative operator for the .com and .net TLD zones, was occasionally returning expired signatures in its DNS responses in the Asia Pacific region. As a result our recursive resolver was not able to validate the corresponding RRset. Ultimately this meant that a percentage of DNS queries would return SERVFAIL as response code to our authoritative nameserver. This in turn caused connection issues for users trying to connect to a domain on Cloudflare, because we weren't able to resolve the upstream target of affected domain names and thus didn’t know where to send proxied HTTPS requests to upstream servers.</p>
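            <p>For reference, this is roughly what an RRSIG looks like next to the RRset it covers in a response from a .com nameserver. The output is abbreviated and illustrative; the timestamps, key tag, and signature are placeholders:</p>
            <pre><code>$ dig com. SOA +dnssec @a.gtld-servers.net

;; ANSWER SECTION:
com.  900  IN  SOA    a.gtld-servers.net. nstld.verisign-grs.com. ( ... )
com.  900  IN  RRSIG  SOA 8 1 900 20230722044941 20230715033941 46551 com. pX4N...Zz8=
;                     ^ type covered, algorithm, labels, original TTL,
;                       signature expiration, signature inception,
;                       key tag, signer name, and the signature itself</code></pre>
            <p>The two timestamps, signature expiration followed by signature inception, are what a validating resolver checks against the current time; a response whose expiration lies in the past is treated as bogus.</p>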
    <div>
      <h3>Remediation and follow-up steps</h3>
      <a href="#remediation-and-follow-up-steps">
        
      </a>
    </div>
    <p>Once we had identified the root cause we started to look at different ways to remedy the issue. We came to the conclusion that the fastest way to work around this very regionalized issue was to stop using the local route to Verisign's nameservers. This means that, at the time of writing this, our outgoing DNS traffic towards Verisign's nameservers in the Asia Pacific region now traverses the Pacific and ends up at the US west coast, rather than being served by closer nameservers. Due to the nature of DNS and the important role of DNS caching, this has less impact than you might initially expect. Frequently looked up names will be cached after the first request, and this only needs to happen once per data center, as we share and pool the local DNS recursor caches to maximize their efficiency.</p><p>Ideally, we would have been able to fix the issue right away as it potentially affected others in the region too, not just our customers. We will therefore work diligently to improve our incident communications channels with other providers in order to ensure that the DNS ecosystem remains robust against issues such as this. Being able to quickly get hold of other providers that can take action is vital when urgent issues like these arise.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>This incident <a href="/october-2021-facebook-outage/">once again</a> shows how impactful DNS failures are and how crucial this technology is for the Internet. We will investigate how we can improve our systems to detect and resolve issues like this more efficiently and quickly if they occur again in the future.</p><p>While Cloudflare was not the cause of this issue, and we are certain that we were not the only ones affected by this, we are still sorry for the disruption to our customers and to all the users who were unable to access Internet properties during this incident.</p><p><b>Update</b>: On Tue Jul 11 22:24:21 UTC 2023, Verisign posted an <a href="https://lists.dns-oarc.net/pipermail/dns-operations/2023-July/022174.html">announcement</a>, providing more details:</p><blockquote><p><i>Last week, during a migration of one of our DNS resolution sites in Singapore, from one provider to another, we unexpectedly lost management access and the ability to deliver changes and DNS updates to the site. Following our standard procedure, we disabled all transit links to the affected site. Unfortunately, a peering router remained active, which was not immediately obvious to our teams due to the lack of connectivity there.</i></p></blockquote><blockquote><p><i>Over the weekend, this caused an issue that may have affected the ability of some internet users in the region to reach some .com and .net domains, as DNSSEC signatures on the site began expiring. The issue was resolved by powering off the site’s peering router, causing the anycast route announcement to be withdrawn and traffic to be directed to other sites.</i></p></blockquote><blockquote><p><i>We are updating our processes and procedures and will work to prevent such issues from recurring in the future.</i></p></blockquote><blockquote><p><i>The Singapore site is part of a highly redundant constellation of more than 200 sites that make up our global network. This issue had no effect on the core resolution of .com and .net resolution globally. We apologize to those who may have been affected.</i></p></blockquote><p></p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[DNSSEC]]></category>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <guid isPermaLink="false">3ZlvMILKrfS2Z4IQ0qumTD</guid>
            <dc:creator>Christian Elmerot</dc:creator>
            <dc:creator>Alex Fattouche</dc:creator>
            <dc:creator>Hannes Gerhart</dc:creator>
        </item>
        <item>
            <title><![CDATA[DNS Flag Day 2020]]></title>
            <link>https://blog.cloudflare.com/dns-flag-day-2020/</link>
            <pubDate>Fri, 02 Oct 2020 07:35:00 GMT</pubDate>
            <description><![CDATA[ October 1 is DNS Flag Day, an initiative by the DNS community to make DNS more secure, reliable and robust. This year the focus is on problems around IP fragmentation of DNS packets. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5gu3Uesra4lbdKiknRxKHG/38b6887b68d23412d9292e740e40df73/DNS_Flag-1.png" />
            
            </figure><p>October 1 was this year’s DNS Flag Day. Read on to find out all about <a href="https://dnsflagday.net/2020/">DNS Flag Day</a> and how it affects Cloudflare’s DNS services (hint: it doesn’t, we already did the work to be compliant).</p>
    <div>
      <h3>What is DNS Flag Day?</h3>
      <a href="#what-is-dns-flag-day">
        
      </a>
    </div>
    <p>DNS Flag Day is an initiative by several DNS vendors and operators to increase the compliance of implementations with DNS standards. The goal is to make DNS more secure, reliable and robust. Rather than a push for new features, DNS Flag Day is meant to ensure that workarounds for non-compliance can be reduced and a common set of functionalities can be established and relied upon.</p><p>Last year’s flag day was February 1, and it set forth that servers and clients must be able to properly handle the Extensions to DNS (EDNS0) protocol (the first RFC about EDNS0, <a href="https://tools.ietf.org/html/rfc2671">RFC 2671</a>, dates from 1999). This way, by assuming clients have a working implementation of EDNS0, servers can rely on always sending messages with EDNS0. This is needed to support DNSSEC, the DNS security extensions. We were, of course, more than thrilled to support the effort, as we’re keen to push <a href="/tag/dnssec/">DNSSEC</a> <a href="/automatically-provision-and-maintain-dnssec/">adoption forward</a>.</p>
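            <p>Compliance with the 2019 flag day can be checked with simple queries: a well-behaved authoritative server answers both a plain DNS query and one that advertises EDNS0, rather than timing out or returning FORMERR on the latter. An illustrative check against the nameservers for <code>example.com</code>:</p>
            <pre><code># Plain DNS query, without an EDNS0 OPT record
$ dig @a.iana-servers.net example.com SOA +noedns

# The same query advertising EDNS0; a compliant server answers this too
$ dig @a.iana-servers.net example.com SOA +edns=0</code></pre>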
    <div>
      <h3>DNS Flag Day 2020</h3>
      <a href="#dns-flag-day-2020">
        
      </a>
    </div>
    <p>The goal for this year’s flag day is to increase DNS messaging reliability by focusing on problems around IP fragmentation of DNS packets. The intention is to reduce DNS message fragmentation, which <a href="https://blog.apnic.net/2019/07/12/its-time-to-consider-avoiding-ip-fragmentation-in-the-dns/">continues</a> to be a <a href="https://blog.apnic.net/2017/08/22/dealing-ipv6-fragmentation-dns/">problem</a>. We can do that by ensuring cleartext DNS messages sent over UDP are not too large, as large messages risk being fragmented during transport. Additionally, when sending or receiving large DNS messages, we can do so over TCP.</p>
    <div>
      <h3>Problem with DNS transport over UDP</h3>
      <a href="#problem-with-dns-transport-over-udp">
        
      </a>
    </div>
    <p>A potential issue with sending DNS messages over UDP is that the sender has no indication of the recipient actually receiving the message. When using TCP, each packet being sent is acknowledged (ACKed) by the recipient, and the sender will attempt to resend any packets not being ACKed. UDP, although it may be faster than TCP, does not have the same mechanism of messaging reliability. Anyone still wishing to use UDP as their transport protocol of choice will have to implement this reliability mechanism in higher layers of the network stack. For instance, this is what is being done in <a href="/tag/quic/">QUIC</a>, the new Internet transport protocol used by HTTP/3 that is built on top of UDP.</p><p>Even the earliest DNS standards (<a href="https://tools.ietf.org/html/rfc1035">RFC 1035</a>) specified the use of sending DNS messages over TCP as well as over UDP. Unfortunately, the choice of supporting TCP or not was up to the implementer/operator, and then firewalls were sometimes set to block DNS over TCP. More recent <a href="https://tools.ietf.org/html/rfc7766">updates</a> to RFC 1035, on the other hand, require that the DNS server is available to query using DNS over TCP.</p>
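            <p>In practice, checking that a nameserver answers over TCP is a one-line test; dig uses UDP by default and switches transport with <code>+tcp</code>:</p>
            <pre><code># Default transport is UDP
$ dig @1.1.1.1 cloudflare.com A

# Force the same query over TCP; RFC 7766 requires servers to support this
$ dig @1.1.1.1 cloudflare.com A +tcp</code></pre>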
    <div>
      <h3>DNS message fragmentation</h3>
      <a href="#dns-message-fragmentation">
        
      </a>
    </div>
    <p>Sending data over networks and the Internet is limited by how large each packet can be. Data is chopped up into a stream of packets, sized to adhere to the Maximum Transmission Unit (MTU) of the network. The MTU is typically 1500 bytes for IPv4 and, in the case of IPv6, the minimum is 1280 bytes. Subtracting both the IP header size (IPv4 20 bytes/IPv6 40 bytes) and the UDP protocol header size (8 bytes) from the MTU, we end up with a maximum DNS message size of 1472 bytes for IPv4 and 1232 bytes for IPv6 if a message is to fit within a single packet. If the message is any larger than that, it will have to be fragmented into more packets.</p><p>Sending large messages causes them to get fragmented into more than one packet. This is not a problem with TCP transport, since each packet is ACKed to ensure proper delivery. However, the same does not hold true when sending large DNS messages over UDP. For all intents and purposes, UDP has been treated as a second-class citizen to TCP as far as network routing is concerned. It is quite common to see UDP packet fragments being dropped by routers and firewalls, potentially causing parts of a message to be lost. To avoid fragmentation over UDP, it is better to truncate the DNS message and set the Truncation Flag in the DNS response. This tells the recipient that more data is available if the query is retried over TCP.</p><p>DNS Flag Day 2020 wants to ensure that DNS message fragmentation does not happen. When larger DNS messages need to be sent, we need to ensure it can be done reliably over TCP.</p><p>DNS servers need to support DNS message transport over TCP in order to be compliant with this year's flag day. Also, DNS messages sent over UDP must never exceed the limit over which they risk being fragmented.</p>
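            <p>The interplay between buffer size and truncation can be observed with dig. The first query below advertises the 1232-byte EDNS buffer size recommended for DNS Flag Day 2020 and uses <code>+ignore</code> so dig does not automatically retry over TCP; an answer that doesn't fit comes back truncated (TC bit set) rather than fragmented. The domain is just an example:</p>
            <pre><code># Ask for a potentially large DNSSEC response while capping the UDP buffer at 1232 bytes;
# +ignore stops dig from retrying over TCP when the TC (truncated) bit is set
$ dig @1.1.1.1 cloudflare.com DNSKEY +dnssec +bufsize=1232 +ignore

# Retry the same query over TCP to get the full answer
$ dig @1.1.1.1 cloudflare.com DNSKEY +dnssec +tcp</code></pre>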
    <div>
      <h3>Cloudflare authoritative DNS and 1.1.1.1</h3>
      <a href="#cloudflare-authoritative-dns-and-1-1-1-1">
        
      </a>
    </div>
    <p>We fully support the DNS Flag Day initiative, as it aims to make DNS more reliable and robust, and it ensures a common set of features for the DNS community to evolve on. In the DNS ecosystem, we are as much a client as we are a provider. When we perform DNS lookups on behalf of our customers and users, we rely on other providers to follow standards and be compliant. When they are not, and we can’t work around the issues, it leads to problems resolving names and reaching resources.</p><p>Both our public resolver 1.1.1.1 and our authoritative DNS service set and enforce reasonable limits on DNS message sizes when sent over UDP. Of course, both services are also available over TCP. If you’re already using Cloudflare, there is nothing you need to do but keep using our DNS services! We will continually work on improving DNS.</p><p>Oh, and you can test your domain on the DNS Flag Day site: <a href="https://dnsflagday.net/2020/">https://dnsflagday.net/2020/</a></p>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">1ORVgEYdgNU3bOHN69m1tl</guid>
            <dc:creator>Christian Elmerot</dc:creator>
        </item>
    </channel>
</rss>