The Cloudflare Blog

Performance measurements… and the people who love them

Kevin Guthrie — Tue, 20 May 2025 13:00:00 GMT

⚠️ WARNING ⚠️ This blog post contains graphic depictions of probability. Reader discretion is advised.

Measuring performance is tricky. You have to think about accuracy and precision. Are your sampling rates high enough? Could they be too high?? How much metadata does each recording need??? Even after all that, all you have is raw data. Eventually for all this raw performance information to be useful, it has to be aggregated and communicated. Whether it's in the form of a dashboard, customer report, or a paged alert, performance measurements are only useful if someone can see and understand them.

This post is a collection of things I've learned working on customer performance escalations within Cloudflare and analyzing existing tools (both internal and commercial) that we use when evaluating our own performance. A lot of this information also comes from Gil Tene's talk, How NOT to Measure Latency. You should definitely watch that too (but maybe after reading this, so you don't spoil the ending). I was surprised by my own blind spots and which assumptions turned out to be wrong, even though they seemed "obviously true" at the start. I expect I am not alone in these regards. For that reason this journey starts by establishing fundamental definitions and ends with some new tools and techniques that we will be sharing as well as the surprising results that those tools uncovered.

Check your verbiage

So ... what is performance? Alright, let's start with something easy: definitions. "Performance" is not a very precise term because it gets used in too many contexts. Most of us as nerds and engineers have a gut understanding of what it means, without a real definition. We can't really measure it because how "good" something is depends on what makes that thing good. "Latency" is better ... but not as much as you might think. Latency does at least have an implicit time unit, so we can measure it. But ... what is latency? There are lots of good, specific examples of measurements of latency, but we are going to use a general definition. Someone starts something, and then it finishes — the elapsed time between is the latency.

This seems a bit reductive, but it’s a surprisingly useful definition because it gives us a key insight. This fundamental definition of latency is based around the client's perspective. Indeed, when we look at our internal measurements of latency for health checks and monitoring, they all have this one-sided caller/callee relationship. There is the latency of the caching layer from the point of view of the ingress proxy. There’s the latency of the origin from the cache’s point of view. Each component can measure the latency of its upstream counterparts, but not the other way around.

This one-sided nature of latency observation is a real problem for us because Cloudflare only exists on the server side. This makes all of our internal measurements of latency purely estimations. Even if we did have full visibility into a client’s request timing, the start-to-finish latency of a request to Cloudflare isn’t a great measure of Cloudflare’s latency. The process of making an HTTP request has lots of steps, only a subset of which are affected by us. Time spent on things like DNS lookup, local computation for TLS, or resource contention do affect the client’s experience of latency, but only serve as sources of noise when we are considering our own performance.

There is a very useful and common metric that is used to measure web requests, and I’m sure lots of you have been screaming it in your brains from the second you read the title of this post. ✨Time to first byte✨. Clearly this is the answer, right?! But ... what is “Time to first byte”?

TTFB mine

Time to first byte (TTFB) on its face is simple. The name implies that it's the time it takes (on the client's side) to receive the first byte of the response from the server, but unfortunately, that only describes when the timer should end. It doesn't say when the timer should start. This ambiguity is just one factor that leads to inconsistencies when trying to compare TTFB across different measurement platforms ... or even across a single platform because there is no one definition of TTFB. Similar to “performance”, it is used in too many places to have a single definition. That being said, TTFB is a very useful concept, so in order to measure it and report it in an unambiguous way, we need to pick a definition that’s already in use.

We have mentioned TTFB in other blog posts, but this one sums up the problem best with “Time to first byte isn’t what it used to be.” You should read that article too, but the gist is that one popular TTFB definition used by browsers was changed in a confusing way with the introduction of early hints in June 2022. That post and others make the point that while TTFB is useful, it isn’t the best direct measurement for web performance. Later on in this post we will derive why that’s the case.

One common place we see TTFB used is our customers’ analysis comparing Cloudflare's performance to our competitors through Catchpoint. Customers, as you might imagine, have a vested interest in measuring our latency, as it affects theirs. Catchpoint provides several tools built on their global Internet probe network for measuring HTTP request latency (among other things) and visualizing it in their web interface. In an effort to align better with our customers, we decided to adopt Catchpoint’s terminology for talking about latency, both internally and externally.

Catchpoint catch-up

Catchpoint makes things like TTFB easy to plot over time, but the visualization tool doesn't give a definition of what TTFB is. After going through all of their technical blog posts and combing through thousands of lines of raw data, we were able to get functional definitions for TTFB and other composite metrics. This was an important step because these metrics are how our customers are viewing our performance, so we all need to be able to understand exactly what they signify! The final report for this is internal (and long and dry), so in this post, I'll give you the highlights in the form of colorful diagrams, starting with this one.

This diagram shows our customers' most commonly viewed client metrics on Catchpoint and how they fit together into the processing of a request from the server side. Notice that some are directly measured, and some are calculated based on the direct measurements. Right in the middle is TTFB, which Catchpoint calculates as the sum of the DNS, Connect, TLS, and Wait times. It’s worth noting again that this is not the definition of TTFB, this is just Catchpoint’s definition, and now ours.
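As a minimal sketch of how the calculated metric is composed from the directly measured ones, here is the sum written out in Rust. The `RequestPhases` type and its field names are hypothetical and chosen only for illustration; they are not Catchpoint's schema.

```rust
use std::time::Duration;

/// Directly measured phases of one request, named after the metrics in the text.
/// Hypothetical type: the struct and field names are illustrative assumptions.
#[derive(Debug, Clone, Copy)]
pub struct RequestPhases {
    pub dns: Duration,
    pub connect: Duration,
    pub tls: Duration,
    pub wait: Duration,
}

impl RequestPhases {
    /// TTFB as described above: the sum of the DNS, Connect, TLS, and Wait times.
    pub fn ttfb(&self) -> Duration {
        self.dns + self.connect + self.tls + self.wait
    }

    /// The same sum without the DNS lookup time.
    pub fn ttfb_without_dns(&self) -> Duration {
        self.connect + self.tls + self.wait
    }
}
```

Keeping the phases as separate fields makes it explicit which parts of the total are measured and which are derived.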

This breakdown of HTTPS phases is not the only one commonly used. Browsers themselves have a standard for measuring the stages of a request. The diagram below shows how most browsers are reporting request metrics. Luckily (and maybe unsurprisingly) these phases match Catchpoint's very closely.

There are some differences beyond the inclusion of things like AppCache and Redirects (which are not directly impacted by Cloudflare's latency). Browser timing metrics are based on timestamps instead of durations. The diagram subtly calls this out with gaps between the different phases indicating that there is the potential for the computer running the browser to do things that are not part of any phase. We can line up these timestamps with Catchpoint's metrics like so:

Now that we, our customers, and our browsers (with data coming from RUM) have a common and well-defined language to talk about the phases of a request, we can start to measure, visualize, and compare the components that make up the network latency of a request.

Visual basics

Now that we have defined what our key values for latency are, we can record numbers and put them in a chart and watch them roll by ... except not directly. In most cases, the systems we use to record the data actively prevent us from seeing the recorded data in its raw form. Tools like Prometheus are designed to collect pre-aggregated data, not individual samples, and for a good reason. Storing every recorded metric (even compacted) would be an enormous amount of data. Even worse, the data loses its value exponentially over time, since the most recent data is the most actionable.

The unavoidable conclusion is that some aggregation has to be done before performance data can be visualized. In most cases, the aggregation means looking at a series of windowed percentiles over time. The most common are 50th percentile (median), 75th, 90th, and 99th if you're really lucky. Here is an example of a latency visualization from one of our own internal dashboards.
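To make "windowed percentiles" concrete, here is a small sketch that reduces one window of raw latency samples to the usual bands. It assumes the nearest-rank percentile definition and millisecond `f64` samples; real metrics systems typically work from pre-aggregated histograms instead, so treat this as an illustration rather than how any particular dashboard computes its values.

```rust
/// Nearest-rank percentile of one window of latency samples (milliseconds).
/// Returns None for an empty window or a percentile outside (0, 100].
pub fn percentile(window: &[f64], pct: f64) -> Option<f64> {
    if window.is_empty() || !(pct > 0.0 && pct <= 100.0) {
        return None;
    }
    let mut sorted = window.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: the smallest sample with at least pct% of samples at or below it.
    let rank = (pct * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Summarize one window as [p50, p75, p90, p99, p100].
pub fn summarize(window: &[f64]) -> Option<[f64; 5]> {
    Some([
        percentile(window, 50.0)?,
        percentile(window, 75.0)?,
        percentile(window, 90.0)?,
        percentile(window, 99.0)?,
        percentile(window, 100.0)?,
    ])
}
```

Calling `summarize` once per time window yields the series that a percentile-over-time chart plots. Everything between and beyond the chosen bands is discarded at this step.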

It clearly shows a spike in latency around 14:40 UTC. Was it an incident? The p99 jumped 13x (500ms to 6500ms) for multiple minutes while the p50 jumped by more than 136x (4.4ms to 600ms). It is a clear signal, so something must have happened, but what was it? Let me keep you in suspense for a second while we talk about statistics and probability.

Uncooked math

Let me start with a quote from my dear, close, personal friend @ThePrimeagen:

It's a good reminder that while statistics is a great tool for providing a simplified and generalized representation of a complex system, it can also obscure important subtleties of that system. A good way to think of statistical modeling is like lossy compression. In the latency visualization above (which is a plot of TTFB over time), we are compressing the entire spectrum of latency metrics into 4 percentile bands, and because we are only considering up to the 99th percentile, there's an entire 1% of samples left over that we are ignoring!

"What?" I hear you asking. "P99 is already well into perfection territory. We're not trying to be perfectionists. Maybe we should get our p50s down first". Let's put things in perspective. This zone (www.cloudflare.com) is getting about 30,000 req/s and the 99th percentile latency is 500 ms. (Here we are defining latency as “Edge TTFB”, a server-side approximation of our now official definition.) So there are 300 req/s that are taking longer than half a second to complete, and that's just the portion of the request that we can see. How much worse than 500 ms are those requests in the top 1%? If we look at the 100th percentile (the max), we get a much different vibe from our Edge TTFB plot.

Viewed like this, the spike in latency no longer looks so remarkable. Without seeing more of the picture, we could easily believe something was wrong when in reality, even if something is wrong, it is not localized to that moment. In this case, it's like we are using our own statistics to lie to ourselves.

The top 1% of requests have 99% of the latency

Maybe you're still not convinced. It feels more intuitive to focus on the median because the latency experienced by 50 out of 100 people seems more important to focus on than that of 1 in 100. I would argue that is a totally true statement, but notice I said "people" and not "requests." A person visiting a website is not likely to be doing it one request at a time.

Taking www.cloudflare.com as an example again, when a user opens that page, their browser makes more than 70 requests. It sounds big, but in the world of user-facing websites, it’s not that bad. In contrast, www.amazon.com issues more than 400 requests! It's worth noting that not all those requests need to complete before a web page or application becomes usable. That's why more advanced and browser-focused metrics exist, but I will leave a discussion of those for later blog posts. I am more interested in how making that many requests changes the probability calculations for expected latency on a per-user basis.

Here's a brief primer on combining probabilities that covers everything you need to know to understand this section.

The probability of two independent things both happening is the probability of the first happening multiplied by the probability of the second happening. $$P(X\cap Y )=P(X) \times P (Y)$$
The probability of a single request's latency falling below the $X^{th}$ percentile is $X\%$. $$P(pX) = X\%$$

Let's define $P( pX_{N} )$ as the probability that someone on a website with $N$ requests experiences no latencies >= the $X^{th}$ percentile. For example, $P(p50_{2})$ would be the probability of getting no latencies greater than the median on a page with 2 requests. This is equivalent to the probability of one request having a latency less than the $p50$ and the other request having a latency less than the $p50$. We can use both identities above.

$$\begin{align} P( p50_{2}) &= P\left ( p50 \cap p50 \right ) \\ &= P( p50) \times P\left ( p50 \right ) \\ &= 50\%^{2} \\ &= 25\% \end{align}$$

We can generalize this for any percentile and any number of requests. $$P( pX_{N}) = X\%^{N}$$
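The generalized formula is a one-liner in code. This sketch assumes request latencies are independent (the same assumption the multiplication rule relies on); the function name is made up for illustration.

```rust
/// Probability that all `n` requests on a page come in below the request-latency
/// percentile `pct` (0..=100), assuming request latencies are independent.
pub fn prob_all_below(pct: f64, n: u32) -> f64 {
    (pct / 100.0).powi(n as i32)
}
```

For example, `prob_all_below(50.0, 2)` reproduces the two-request case worked through above, and raising `n` shows how quickly the probability collapses.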

For www.cloudflare.com and its 70ish requests, the percentage of visitors that won't experience a latency above the median is

$$\begin{align} P( p50_{70}) &= 50\%^{70} \\ &\approx 8.5 \times 10^{-20}\% \end{align}$$

This vanishingly small number should make you question why we would value the $p50$ latency so highly at all when effectively no one experiences it as their worst case latency.

So now the question is, what request latency percentile should we be looking at? Let's go back to the statement at the beginning of this section. What does the median person experience on www.cloudflare.com? We can use a little algebra to solve for that.

$$\begin{align} P( pX_{70}) &= 50\% \\ X^{70} &= 50\% \\ X &= e^{ \frac{\ln\left ( 50\% \right )}{70}} \\ X &\approx 99\% \end{align}$$

This seems a little too perfect, but I am not making this up. For www.cloudflare.com, if you want to capture a value that's representative of what the median user can expect, you need to look at $p99$ request latency. Extending this even further, if you want a value that's representative of what 99% of users will experience, you need to look at the 99.99th percentile!
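The algebra above can be turned into a small helper that answers "which request percentile should I watch for a given share of users?". This is a sketch under the same independence assumption, with a hypothetical function name.

```rust
/// Request-latency percentile (0..=100) to watch so that `user_pct` percent of
/// visitors see no request slower than it on a page with `n` requests.
/// Inverts P(pX_N) = X^N, assuming independent request latencies.
pub fn request_percentile_for(user_pct: f64, n: u32) -> f64 {
    100.0 * (user_pct / 100.0).powf(1.0 / n as f64)
}
```

Calling it with `(50.0, 70)` gives the median-user case and `(99.0, 70)` gives the 99%-of-users case discussed above. With `n = 1` it returns the user percentile unchanged, which is a quick sanity check.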

Spherical latency in a vacuum

Okay, this is where we bring everything together, so stay with me. So far, we have only talked about measuring the performance of a single system. This gives us absolute numbers to look at internally for monitoring, but if you’ll recall, the goal of this post was to be able to clearly communicate about performance outside the company. Often this communication takes the form of comparing Cloudflare’s performance against other providers. How are these comparisons done? By plotting a percentile request "latency" over time and eyeballing the difference.

With everything we have discussed in this post, it seems like we can devise a better method for doing this comparison. We saw how exposing more of the percentile spectrum can provide a new perspective on existing data, and how impactful higher percentile statistics can be when looking at a more complete user experience. Let me close this post with an example of how putting those two concepts together yields some intriguing results.

One last thing

Below is a comparison of the latency (defined here as the sum of the TLS, Connect, and Wait times or the equivalent of TTFB - DNS lookup time) for the customer when viewed through Cloudflare and a competing provider. This is the same data represented in the chart immediately above (containing 90,000 samples for each provider), just in a different form called a CDF plot, which is one of a few ways we are making it easier to visualize the entire percentile range. The chart shows the percentiles on the y-axis and latency measurements on the x-axis, so to see the latency value for a given percentile, you go up to the percentile you want and then over to the curve. Interpreting these charts is as easy as finding which curve is farther to the left for any given percentile. That curve will have the lower latency.
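For readers who want to build this kind of chart from raw samples, here is a minimal sketch of computing the points of an empirical CDF. It assumes latencies are `f64` milliseconds; the function name is illustrative, and the plotting itself is left to whatever tool you prefer.

```rust
/// Points of an empirical CDF: (latency, percentile) pairs sorted by latency.
/// The i-th smallest of n samples is placed at percentile 100 * i / n.
pub fn empirical_cdf(samples: &[f64]) -> Vec<(f64, f64)> {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len() as f64;
    sorted
        .into_iter()
        .enumerate()
        .map(|(i, latency)| (latency, 100.0 * (i + 1) as f64 / n))
        .collect()
}
```

Plotting the first element of each pair on the x-axis and the second on the y-axis gives the layout described above, with one curve per provider.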

It's pretty clear that for nearly the entire percentile range, the other provider has the lower latency by as much as 30ms. That is, until you get to the very top of the chart. There's a little bit of blue that's above (and therefore to the left) of the green. In order to see what's going on there more clearly, we can use a different kind of visualization. This one is called a QQ-Plot, or quantile-quantile plot. This shows the same information as the CDF plot, but now each point on the curve represents a specific quantile, and the 2 axes are the latency values of the two providers at that percentile.

This chart looks complicated, but interpreting it is similar to the CDF plot. The blue is a dividing marker that shows where the latency of both providers is equal. Points below the line indicate percentiles where the other provider has a lower latency than Cloudflare, and points above the line indicate percentiles where Cloudflare is faster. We see again that for most of the percentile range, the other provider is faster, but for percentiles above 99, Cloudflare is significantly faster.
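A QQ plot can be derived from the same raw samples by pairing up the two providers' latencies at each percentile. The sketch below assumes nearest-rank quantiles and puts Cloudflare on the x-axis and the other provider on the y-axis, which is inferred from the "above the line means Cloudflare is faster" reading; the function names are hypothetical.

```rust
/// Returns a sorted copy of the samples.
fn sorted_copy(samples: &[f64]) -> Vec<f64> {
    let mut v = samples.to_vec();
    v.sort_by(|a, b| a.total_cmp(b));
    v
}

/// Nearest-rank quantile of an already sorted, non-empty slice.
fn quantile(sorted: &[f64], pct: f64) -> f64 {
    let rank = (pct * sorted.len() as f64 / 100.0).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// One (cloudflare_latency, other_latency) point per requested percentile.
/// A point with y > x lies above the equal-latency line: Cloudflare is faster there.
pub fn qq_points(cloudflare: &[f64], other: &[f64], pcts: &[f64]) -> Vec<(f64, f64)> {
    if cloudflare.is_empty() || other.is_empty() {
        return Vec::new();
    }
    let (cf, ot) = (sorted_copy(cloudflare), sorted_copy(other));
    pcts.iter()
        .map(|&p| (quantile(&cf, p), quantile(&ot, p)))
        .collect()
}
```

Because each point corresponds to a single percentile, differences in the tail that are squeezed into the top sliver of a CDF plot get as much room as the rest of the range.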

This is not so compelling by itself, but what if we take into account the number of requests this page issues ... which is over 180. Using the same math from above, and only considering half the requests to be required for the page to be considered loaded, yields this new effective QQ plot.

Taking multiple requests into account, we see that the median latency is close to even for both Cloudflare and the other provider, but the stories above and below that point are very different. A user has about an even chance of an experience where Cloudflare is significantly faster and one where Cloudflare is slightly slower than the other provider. We can show the impact of this shift in perspective more directly by calculating the expected value for request and experienced latency.

Latency Kind	Cloudflare (ms)	Other CDN (ms)	Difference (ms)
Expected Request Latency	141.9	129.9	+12.0
Expected Experienced Latency (Based on 90 Requests)	207.9	281.8	-71.9
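The post does not spell out how these expected values were computed. One plausible reading, sketched here purely as an assumption, is that the request figure is the mean of the samples and the experienced figure is the expected slowest of N requests drawn independently from the same samples. The function names are hypothetical.

```rust
/// Expected request latency: the plain mean of the samples.
pub fn expected_request_latency(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

/// Expected "experienced" latency under the assumption stated above: the expected
/// slowest of `n` requests drawn independently (with replacement) from `samples`.
pub fn expected_experienced_latency(samples: &[f64], n: u32) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let m = sorted.len() as f64;
    // P(max <= i-th smallest sample) = (i / m)^n, so the i-th smallest sample is
    // the maximum with probability (i / m)^n - ((i - 1) / m)^n.
    sorted
        .iter()
        .enumerate()
        .map(|(i, x)| {
            let hi = ((i + 1) as f64 / m).powi(n as i32);
            let lo = (i as f64 / m).powi(n as i32);
            x * (hi - lo)
        })
        .sum()
}
```

With `n = 1` the second function reduces to the mean. As `n` grows, the result is pulled toward the tail of the distribution, which is how a provider with a slightly higher mean but a shorter tail can come out ahead.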

Shifting the focus from individual request latency to user latency we see that Cloudflare is 70 ms faster than the other provider. This is where our obsession with reliability and tail latency becomes a win for our customers, but without a large volume of raw data, knowledge, and tools, this win would be totally hidden. That is why in the near future we are going to be making this tool and others available to our customers so that we can all get a more accurate and clear picture of our users’ experiences with latency. Keep an eye out for more announcements to come later in 2025.

A good day to trie-hard: saving compute 1% at a time

Kevin Guthrie — Tue, 10 Sep 2024 14:00:00 GMT

Cloudflare’s global network handles a lot of HTTP requests – over 60 million per second on average. That in and of itself is not news, but it is the starting point to an adventure that started a few months ago and ends with the announcement of a new open-source Rust crate that we are using to reduce our CPU utilization, enabling our CDN to handle even more of the world’s ever-increasing Web traffic.

Motivation

Let’s start at the beginning. You may recall a few months ago we released Pingora (the heart of our Rust-based proxy services) as an open-source project on GitHub. I work on the team that maintains the Pingora framework, as well as Cloudflare’s production services built upon it. One of those services is responsible for the final step in transmitting users’ (non-cached) requests to their true destination. Internally, we call the request’s destination server its “origin”, so our service has the (unimaginative) name of “pingora-origin”.

One of the many responsibilities of pingora-origin is to ensure that when a request leaves our infrastructure, it has been cleaned to remove the internal information we use to route, measure, and optimize traffic for our customers. This has to be done for every request that leaves Cloudflare, and as I mentioned above, it’s a lot of requests. At the time of writing, the rate of requests leaving pingora-origin (globally) is 35 million requests per second. Any code that has to be run per-request is in the hottest of hot paths, and it’s in this path that we find this code and comment:

// PERF: heavy function: 1.7% CPU time
pub fn clear_internal_headers(request_header: &mut RequestHeader) {
    INTERNAL_HEADERS.iter().for_each(|h| {
        request_header.remove_header(h);
    });
}

This small and pleasantly-readable function consumes more than 1.7% of pingora-origin’s total CPU time. To put that in perspective, the total CPU time consumed by pingora-origin is 40,000 compute-seconds per second. You can think of this as 40,000 saturated CPU cores fully dedicated to running pingora-origin. Of those 40,000, 1.7% (680) are dedicated solely to evaluating clear_internal_headers. The function’s heavy usage and simplicity make it seem like a great place to start optimizing.

Benchmarking

Benchmarking the function shown above is straightforward because we can use the wonderful criterion Rust crate. Criterion provides an API for timing Rust code down to the nanosecond by aggregating multiple isolated executions. It also provides feedback on how the performance improves or regresses over time. The input for the benchmark is a large set of synthesized requests with a random number of headers with a uniform distribution of internal vs. non-internal headers. With our tooling and test data we find that our original clear_internal_headers function runs in an average of 3.65µs. Now for each new method of clearing headers, we can measure against the same set of requests and get a relative performance difference.

Reducing Reads

One potentially quick win is to invert how we find the headers that need to be removed from requests. If you look at the original code, you can see that we are evaluating request_header.remove_header(h) for each header in our list of internal headers, so 100+ times. Diagrammatically, it looks like this:

Since an average request has significantly fewer than 100 headers (10-30), flipping the lookup direction should reduce the number of reads while yielding the same intersection. Because we are working in Rust (and because retain does not exist for http::HeaderMap yet), we have to collect the identified internal headers in a separate step before removing them from the request. Conceptually, it looks like this:

Using our benchmarking tool, we can measure the impact of this small change, and surprisingly this is already a substantial improvement. The runtime improves from 3.65µs to 1.53µs. That’s a 2.39x speed improvement for our function. We can calculate the theoretical CPU percentage by multiplying the starting utilization by the ratio of the new and old times: 1.71% * 1.53 / 3.65 = 0.717%. Unfortunately, if we subtract that from the original 1.71% that only equates to saving 1.71% - 0.717% = 0.993% of the total CPU time. We should be able to do better.
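To illustrate the two lookup directions without pulling in Pingora or the http crate, here is a self-contained sketch that uses a plain `HashMap<String, String>` as a stand-in for the request's header map. The stand-in type and both function names are assumptions made for this example; real header maps are case-insensitive and multi-valued, which this ignores.

```rust
use std::collections::{HashMap, HashSet};

/// Original direction: one removal attempt per internal header name,
/// no matter how few headers the request actually carries.
pub fn clear_by_internal_list(headers: &mut HashMap<String, String>, internal: &[&str]) {
    for name in internal {
        headers.remove(*name);
    }
}

/// Flipped direction: one set lookup per header present on the request.
/// Matches are collected first and removed in a second pass, mirroring the
/// collect-then-remove step described in the text.
pub fn clear_by_request_headers(headers: &mut HashMap<String, String>, internal: &HashSet<&str>) {
    let to_remove: Vec<String> = headers
        .keys()
        .filter(|name| internal.contains(name.as_str()))
        .cloned()
        .collect();
    for name in to_remove {
        headers.remove(&name);
    }
}
```

Both functions leave the map in the same state. The difference is the number of lookups: proportional to the internal list (100+) in the first case, and to the request's own header count (10-30) in the second.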

Searching Data Structures

Now that we have reorganized our function to search a static set of internal headers instead of the actual request, we have the freedom to choose what data structure we store our header names in simply by changing the type of INTERNAL_HEADER_SET.

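pub fn clear_internal_headers(request_header: &mut RequestHeader) {
   // Collect the internal headers present on this request first, because they
   // have to be identified in a separate step before they can be removed.
   let to_remove = request_header
       .headers
       .keys()
       .filter_map(|name| INTERNAL_HEADER_SET.get(name))
       .collect::<Vec<_>>();

   // Remove each identified internal header in a second pass.
   to_remove.into_iter().for_each(|k| {
       request_header.remove_header(k);
   });
}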

Our first attempt used std::collections::HashMap, but there may be other data structures that better suit our needs. All computer science students were taught at some point that hash tables are great because they have constant-time asymptotic behavior, or O(1), for reading. (If you are not familiar with big O notation, it is a way to express how an algorithm consumes a resource, in this case time, as the input size changes.) This means no matter how large the map gets, reads always take the same amount of time. Too bad this is only partially true. In order to read from a hash table, you have to compute the hash. Computing a hash for strings requires reading every byte, so while read time for a hashmap is constant over the table’s size, it’s linear over key length. So, our goal is to find a data structure that is better than O(L) where L is the length of the key.

A few common data structures have read behavior that meets our criteria. Sorted sets like BTreeSet use comparisons for searching, and that makes them logarithmic over key length O(log(L)), but they are also logarithmic in the size of the set. The net effect is that even very fast sorted sets like FST work out to be a little (50 ns) slower in our benchmarks than the standard hashmap.

State machines like parsers and regex are another common tool for searching for strings, though it’s hard to consider them data structures. These systems work by accepting input one unit at a time and determining on each step whether or not to keep evaluating. Being able to make these determinations at every step means state machines are very fast to identify negative cases (i.e. when a string is not valid or not a match). This is perfect for us because only one or two headers per request on average will be internal. In fact, benchmarking an implementation of clear_internal_headers using regular expressions clocks in as taking about twice as long as the hashmap-based solution. This is impressively fast given that regexes, while powerful, aren't known for their raw speed. This approach feels promising – we just need something in between a data structure and a state machine.

That’s where the trie comes in.

Don’t Just Trie

A trie (pronounced like “try” or “tree”) is a type of tree data structure normally used for prefix searches or auto-complete systems over a known set of strings. The structure of the trie lends itself to this because each node in the trie represents a substring of characters found in the initial set. The connections between the nodes represent the characters that can follow a prefix. Here is a small example of a trie built from the words: “and”, “ant”, “dad”, “do”, & “dot”.

The root node represents an empty string prefix, so the two lettered edges directed out of it are the only letters that can appear as the first letter in the list of strings, “a” and “d”. Subsequent nodes have increasingly longer prefixes until the final valid words are reached. This layout should make it easy to see how a trie could be useful for quickly identifying strings that are not contained. Even at the root node, we can eliminate any strings that are presented that do not start with “a” or “d”. This paring down of the search space on every step gives reading from a trie the O(log(L)) we were looking for … but only for misses. Hits within a trie are still O(L), but that’s okay, because we are getting misses over 90% of the time.
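The early-exit behavior on misses can be seen in a deliberately simple trie. This sketch stores children in a `HashMap` per node for readability; it is not how trie-hard is implemented and makes no attempt to be fast. The type and method names are made up for this example.

```rust
use std::collections::HashMap;

/// One node of a byte-wise trie. Illustrative only.
#[derive(Default)]
pub struct TrieNode {
    children: HashMap<u8, TrieNode>,
    is_word: bool,
}

#[derive(Default)]
pub struct Trie {
    root: TrieNode,
}

impl Trie {
    pub fn from_words(words: &[&str]) -> Self {
        let mut trie = Trie::default();
        for word in words {
            let mut node = &mut trie.root;
            for &byte in word.as_bytes() {
                node = node.children.entry(byte).or_default();
            }
            node.is_word = true;
        }
        trie
    }

    /// Returns (found, number of key bytes examined before deciding).
    pub fn lookup(&self, key: &str) -> (bool, usize) {
        let mut node = &self.root;
        for (i, byte) in key.bytes().enumerate() {
            match node.children.get(&byte) {
                Some(next) => node = next,
                // No edge for this byte: the key cannot be in the set, stop early.
                None => return (false, i + 1),
            }
        }
        (node.is_word, key.len())
    }
}
```

Built from the five example words, a lookup of a key starting with any letter other than "a" or "d" is rejected after examining a single byte, while a hit such as "dot" has to walk the whole key.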

Benchmarking a few trie implementations from crates.io was disheartening. Remember, most tries are used in response to keyboard events, so optimizing them to run in the hot path of tens of millions of requests per second is not a priority. The fastest existing implementation we found was radix_trie, but it still clocked in at a full microsecond slower than hashmap. The only thing left to do was write our own implementation of a trie that was optimized for our use case.

Trie Hard

And we did! Today we are announcing trie-hard. The repository gives a full description of how it works, but the big takeaway is that it gets its speed from storing node relationships in the bits of unsigned integers and keeping the entire tree in a contiguous chunk of memory. In our benchmarks, we found that trie-hard reduced the average runtime for clear_internal_headers to under a microsecond (0.93µs). We can reuse the same formula from above to calculate the expected CPU utilization for trie-hard to be 1.71% * 0.93 / 3.65 = 0.43%. That means we have finally achieved and surpassed our goal by reducing the compute utilization of pingora-origin by 1.71% - 0.43% = 1.28%!

Up until now we have been working only in theory and local benchmarking. What really matters is whether our benchmarking reflects real-life behavior. Trie-hard has been running in production since July 2024, and over the course of this project we have been collecting performance metrics from the running production of pingora-origin using a statistical sampling of its stack trace over time. Using this technique, the CPU utilization percentage of a function is estimated by the percent of samples in which the function appears. If we compare the sampled performance of the different versions of clear_internal_headers, we can see that the results from the performance sampling closely match what our benchmarks predicted.

Implementation	Stack trace samples containing `clear_internal_headers`	Actual CPU Usage (%)	Predicted CPU Usage (%)
Original	19 / 1111	1.71	n/a
Hashmap	9 / 1103	0.82	0.72
trie-hard	4 / 1171	0.34	0.43
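The two right-hand columns of the table come from two small calculations, sketched here with hypothetical function names: the sampled share is the fraction of stack-trace samples containing the function, and the predicted share scales the original share by the ratio of new to old benchmark runtime.

```rust
/// CPU share (%) of a function, estimated as the share of stack-trace samples
/// in which the function appears.
pub fn sampled_cpu_pct(samples_with_fn: u32, total_samples: u32) -> f64 {
    100.0 * samples_with_fn as f64 / total_samples as f64
}

/// Predicted CPU share (%) after an optimization: the baseline share scaled by
/// the ratio of the new benchmark runtime to the old one.
pub fn predicted_cpu_pct(baseline_pct: f64, old_runtime_us: f64, new_runtime_us: f64) -> f64 {
    baseline_pct * new_runtime_us / old_runtime_us
}
```

Feeding in the sample counts from the table and the benchmark runtimes quoted earlier (3.65µs, 1.53µs, 0.93µs) lets you compare the measured and predicted columns for yourself.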

Conclusion

Optimizing functions and writing new data structures is cool, but the real conclusion for this post is that knowing where your code is slow and by how much is more important than how you go about optimizing it. Take a moment to thank your observability team (if you're lucky enough to have one), and make use of flame graphs or any other profiling and benchmarking tool you can. Optimizing operations that are already measured in microseconds may seem a little silly, but these small improvements add up.

Gone offline: how Cloudflare Radar detects Internet outages

Carlos Azevedo — Tue, 26 Sep 2023 13:00:02 GMT

Currently, Cloudflare Radar curates a list of observed Internet disruptions (which may include partial or complete outages) in the Outage Center. These disruptions are recorded whenever we have sufficient context to correlate with an observed drop in traffic, found by checking status updates and related communications from ISPs, or finding news reports related to cable cuts, government orders, power outages, or natural disasters.

However, we observe more disruptions than we currently report in the Outage Center because there are cases where we can’t find any source of information that provides a likely cause for what we are observing, although we are still able to validate with external data sources such as Georgia Tech’s IODA. This curation process involves manual work, and is supported by internal tooling that allows us to analyze traffic volumes and detect anomalies automatically, triggering the workflow to find an associated root cause. While the Cloudflare Radar Outage Center is a valuable resource, its key shortcomings are that we are not reporting all disruptions, and that the current curation process is not as timely as we’d like, because we still need to find the context.

As we announced today in a related blog post, Cloudflare Radar will be publishing anomalous traffic events for countries and Autonomous Systems (ASes). These events are the same ones referenced above that have been triggering our internal workflow to validate and confirm disruptions. (Note that at this time “anomalous traffic events” are associated with drops in traffic, not unexpected traffic spikes.) In addition to adding traffic anomaly information to the Outage Center, we are also launching the ability for users to subscribe to notifications at a location (country) or network (autonomous system) level whenever a new anomaly event is detected, or a new entry is added to the outage table. Please refer to the related blog post for more details on how to subscribe.

The current status of each detected anomaly will be shown in the new “Traffic anomalies” table on the Outage Center page:

- When the anomaly is automatically detected, its status will initially be ‘Unverified’.
- After attempting to validate ‘Unverified’ entries:
  - We will change the status to ‘Verified’ if we can confirm that the anomaly appears across multiple internal data sources, and possibly external ones as well. If we find associated context for it, we will also create an outage entry.
  - We will change the status to ‘False Positive’ if we cannot confirm it across multiple data sources. This will remove it from the “Traffic anomalies” table. (If a notification has been sent, but the anomaly isn’t shown in Radar anymore, it means we flagged it as ‘False Positive’.)
- We might also manually add an entry with a ‘Verified’ status. This might occur if we observe, and validate, a drop in traffic that is noticeable, but was not large enough for the algorithm to catch it.

A glimpse at what Internet traffic volume looks like

At Cloudflare, we have several internal data sources that can give us insights into what the traffic for a specific entity looks like. We identify the entity based on IP address geolocation in the case of locations, and IP address allocation in the case of ASes, and can analyze traffic from different sources, such as DNS, HTTP, NetFlows, and Network Error Logs (NEL). All the signals used in the figures below come from one of these data sources and in this blog post we will treat this as a univariate time-series problem — in the current algorithm, we use more than one signal just to add redundancy and identify anomalies with a higher level of confidence. In the discussion below, we intentionally select various examples to encompass a broad spectrum of potential Internet traffic volume scenarios.

1. Ideally, the signals would resemble the pattern depicted below for Australia (AU): a stable weekly pattern with a slightly positive trend meaning that the trend average is moving up over time (we see more traffic over time from users in Australia).

These statements can be clearly seen when we perform time-series decomposition, which allows us to break down a time-series into its constituent parts to better understand and analyze its underlying patterns. Decomposing the traffic volume for Australia above with Seasonal-Trend decomposition using LOESS (STL), assuming a weekly pattern, we get the following:

The weekly pattern we are referring to is captured by the seasonal component, which is expected because we are interested in eyeball (human) Internet traffic. As observed in the image above, the trend component is expected to move slowly compared with the signal level, and the residual would ideally resemble white noise, meaning that all existing patterns in the signal are captured by the seasonal and trend components.
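The idea of splitting a signal into trend, seasonal, and residual parts can be illustrated without STL itself. The sketch below is a simplified classical additive decomposition (a centered moving average for the trend and per-position means for the seasonal part), swapped in for STL only to keep the example dependency-free. The function name and the choice of one sample per day with a period of 7 are assumptions made for illustration.

```python
from statistics import mean

def decompose_additive(series, period):
    """Classical additive decomposition (a simplified stand-in for STL).

    Returns (trend, seasonal, residual). Trend and residual are None at the
    edges, where the centered moving average is undefined.
    """
    n = len(series)
    half = period // 2
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        if period % 2 == 0:
            # Even period: half-weight the two end points so weights sum to `period`.
            total = sum(window) - 0.5 * (window[0] + window[-1])
        else:
            total = sum(window)
        trend[i] = total / period

    # Seasonal part: average detrended value at each position in the cycle.
    buckets = [[] for _ in range(period)]
    for i, t in enumerate(trend):
        if t is not None:
            buckets[i % period].append(series[i] - t)
    raw = [mean(b) for b in buckets]
    offset = mean(raw)
    cycle = [r - offset for r in raw]  # center the cycle around zero
    seasonal = [cycle[i % period] for i in range(n)]

    # Residual: whatever the trend and seasonal parts do not explain.
    residual = [None if t is None else series[i] - t - seasonal[i]
                for i, t in enumerate(trend)]
    return trend, seasonal, residual
```

For a signal made of a straight-line trend plus a repeating weekly shape, the residual from this sketch is zero everywhere the trend is defined, which is the "white noise at best" situation described above.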

2. Below we have the traffic volume for AS15964 (CAMNET-AS) that appears to have more of a daily pattern, as opposed to weekly.

We also observe a value offset in the signal right after the first four days (blue dashed line). The red background shows an outage for which we didn’t find any reporting besides seeing it in our data and in data from other Internet data providers. Our intention here is to develop an algorithm that will trigger an event when it comes across this or similar patterns.

3. Here we have a similar example for French Guiana (GF). We observe some data offsets (August 9 and 23), a change in the amplitude (between August 15 and 23) and another outage for which we do have context that is observable in Cloudflare Radar.

4. Another scenario is several scheduled outages for AS203214 (HulumTele), for which we also have context. These anomalies are the easiest to detect since the traffic goes to values that are unique to outages (cannot be mistaken as regular traffic), but it poses another challenge: if our plan was to just check the weekly patterns, since these government-directed outages happen with the same frequency, at some point the algorithm would see this as expected traffic.

5. This outage in Kenya could be seen as similar to the above: the traffic volume went down to unseen values although not as significantly. We also observe some upward spikes in the data that are not following any specific pattern — possibly outliers — that we should clean depending on the approach we use to model the time-series.

6. Lastly, here's the data that will be used throughout this post as an example of how we are approaching this problem. For Madagascar (MG), we observe a clear pattern with pronounced weekends (blue background). There’s also a holiday (Assumption of Mary), highlighted with a green background, and an outage, with a red background. In this example, weekends, holidays, and outages all seem to have roughly the same traffic volume. Fortunately, the outage gives itself away by showing that it intended to go up as in a normal working day, but then there was a sudden drop — we will see it more closely later in this post.

In summary, here we looked over six examples out of ~700 (the number of entities we are automatically detecting anomalies for currently) and we see a wide range of variability. This means that in order to effectively model the time-series we would have to run a lot of preprocessing steps before the modeling itself. These steps include removing outliers, detecting short and long-term data offsets and readjusting, and detecting changes in variance, mean, or magnitude. Time is also a factor in preprocessing, as we would also need to know in advance when to expect events / holidays that will push the traffic down, apply daylight saving time adjustments that will cause a time shift in the data, and be able to apply local time zones for each entity, including dealing with locations that have multiple time zones and AS traffic that is shared across different time zones.

To add to the challenge, some of these steps cannot even be performed in a close-to-real-time fashion (example: we can only say there’s a change in seasonality after some time of observing the new pattern). Considering the challenges mentioned earlier, we have chosen an algorithm that combines basic preprocessing and statistics. This approach aligns with our expectations for the data's characteristics, offers ease of interpretation, allows us to control the false positive rate, and ensures fast execution while reducing the need for many of the preprocessing steps discussed previously.

Above, we noted that we are detecting anomalies for around 700 entities (locations and autonomous systems) at launch. This obviously does not represent the entire universe of countries and networks, and for good reason. As we discuss in this post, we need to see enough traffic from a given entity (have a strong enough signal) to be able to build relevant models and subsequently detect anomalies. For some smaller or sparsely populated countries, the traffic signal simply isn’t strong enough, and for many autonomous systems, we see little-to-no traffic from them, again resulting in a signal too weak to be useful. We are initially focusing on locations where we have a sufficiently strong traffic signal and/or are likely to experience traffic anomalies, as well as major or notable autonomous systems — those that represent a meaningful percentage of a location’s population and/or those that are known to have been impacted by traffic anomalies in the past.

Detecting anomalies

The approach we took to solve this problem involves creating a forecast that is a set of data points that correspond to our expectation according to what we’ve seen in historical data. This will be explained in the section Creating a forecast. We take this forecast and compare it to what we are actually observing — if what we are observing is significantly different from what we expect, then we call it an anomaly. Here, since we are interested in traffic drops, an anomaly will always correspond to lower traffic than the forecast / expected traffic. This comparison is elaborated in the section Comparing forecast with actual traffic.

In order to compute the forecast we need to fulfill the following business requirements:

We are mainly interested in traffic related to human activity.
The more timely we detect the anomaly, the more useful it is. This needs to take into account constraints such as data ingestion and data processing times, but once the data is available, we should be able to use the latest data point and detect if it is an anomaly.
A low False Positive (FP) rate is more important than a high True Positive (TP) rate. For an internal tool, this is not necessarily true, but for a publicly visible notification service, we want to limit spurious entries at the cost of not reporting some anomalies.

Selecting which entities to observe

Aside from the examples given above, the quality of the data highly depends on the volume of the data, and this means that we have different levels of data quality depending on which entity (location / AS) we are considering. As an extreme example, we don’t have enough data from Antarctica to reliably detect outages. Below is the process we used to select which entities are eligible to be observed.

For ASes, since we are mainly interested in Internet traffic that represents human activity, we use the number of users estimation provided by APNIC. We then compute the total number of users per location by summing up the number of users of each AS in that location, and then we calculate what percentage of users an AS has for that location (this number is also provided by the APNIC table in column ‘% of country’). We filter out ASes that have less than 1% of the users in that location. Here’s what the list looks like for Portugal — AS15525 (MEO-EMPRESAS) is excluded because it has less than 1% of users of the total number of Internet users in Portugal (estimated).
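The share-of-users filter described above can be sketched as follows. The function name and the input shape (a mapping from AS identifier to estimated users for one location) are assumptions for illustration; real values would come from the APNIC estimates.

```python
def eligible_ases(users_by_as, min_share=0.01):
    """Keep ASes holding at least `min_share` of a location's estimated users.

    `users_by_as` maps an AS identifier to its estimated user count for a
    single location. Returns a dict of AS identifier -> share of users.
    """
    total = sum(users_by_as.values())
    if total == 0:
        return {}
    shares = {asn: users / total for asn, users in users_by_as.items()}
    # ASes below the 1% cutoff are filtered out.
    return {asn: share for asn, share in shares.items() if share >= min_share}
```

With made-up counts such as `{"AS-A": 600, "AS-B": 395, "AS-C": 5}`, the third AS holds 0.5% of users and is dropped.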

At this point we have a subset of ASes and a set of locations (we don’t exclude any location a priori because we want to cover as much as possible), but we have to narrow it down based on the quality of the data to be able to reliably detect anomalies automatically. After testing several metrics and visually analyzing the results, we concluded that the best predictor of a stable signal is the volume of data, so we removed the entities that don’t meet a minimum number of daily unique IPs over a two-week period. The threshold is based on visual inspection.
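A minimal sketch of such a volume check is shown below. It assumes the criterion means "every one of the last 14 days meets the minimum", which is one possible reading of the text. The function name and the threshold value passed in are hypothetical, since the actual threshold is not stated.

```python
def has_stable_volume(daily_unique_ips, min_daily_ips, days=14):
    """Return True if each of the last `days` days has enough unique IPs.

    `daily_unique_ips` is a list of daily unique-IP counts, oldest first.
    `min_daily_ips` is a hypothetical threshold chosen by inspection.
    """
    recent = daily_unique_ips[-days:]
    # Require a full two-week window and no day below the minimum.
    return len(recent) == days and min(recent) >= min_daily_ips
```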

Creating a forecast

In order to detect the anomalies in a timely manner, we decided to go with traffic aggregated every fifteen minutes, and we are forecasting one hour of data (four data points / blocks of fifteen minutes) that are compared with the actual data.

After selecting the entities for which we will detect anomalies, the approach is quite simple:

1. We look at the last 24 hours immediately before the forecast window and use that interval as the reference. The assumption is that the last 24 hours will contain information about the shape of what follows. In the figure below, the last 24 hours (in blue) corresponds to data transitioning from Friday to Saturday. By using the Euclidean distance, we get the six most similar matches to that reference (orange) — four of those six matches correspond to other transitions from Friday to Saturday. It also captures the holiday on Monday (August 14, 2023) to Tuesday, and we also see a match that is the most dissimilar to the reference, a regular working day from Wednesday to Thursday. Capturing one that doesn't represent the reference properly should not be a problem because the forecast is the median of the most similar 24 hours to the reference, and thus the data of that day ends up being discarded.
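The search for the most similar 24-hour windows can be sketched as below. This is an illustration under assumptions, not the production code: the function name, the `stride` and `horizon` parameters, and the choice to slide over every fifteen-minute offset are all hypothetical. With fifteen-minute aggregation, the reference would hold 96 points and the history 28 days of points.

```python
import math

def most_similar_windows(history, reference, k=6, horizon=4, stride=1):
    """Return start indexes of the k windows in `history` closest to `reference`.

    `history` is the data preceding the reference window (oldest first).
    Each candidate must leave `horizon` points after it inside `history`,
    so that the points following a match can later be used for a forecast.
    """
    width = len(reference)
    last_start = len(history) - width - horizon
    scored = []
    for start in range(0, last_start + 1, stride):
        window = history[start:start + width]
        # Euclidean distance between the candidate window and the reference.
        scored.append((math.dist(window, reference), start))
    scored.sort()
    return [start for _, start in scored[:k]]
```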

There are two important parameters that we are using for this approach to work:
- We take into consideration the last 28 days (plus the reference day equals 29). This way we ensure that the weekly seasonality can be seen at least 4 times, we control the risk associated with the trend changing over time, and we set an upper bound to the amount of data we need to process. Looking at the example above, the first day was one with the highest similarity to the reference because it corresponds to the transition from Friday to Saturday.
- The other parameter is the number of most similar days. We are using six days as a result of empirical knowledge: given the weekly seasonality, when using six days, we expect at most to match four days for the same weekday and then two more that might be completely different. Since we use the median to create the forecast, the majority is still four and thus those extra days end up not being used as reference. Another scenario is in the case of holidays such as the example below:

A holiday in the middle of the week in this case looks like a transition from Friday to Saturday. Since we are using the last 28 days and the holiday starts on a Tuesday, we only see three such matching transitions (orange), followed by three regular working days, because that pattern is not found anywhere else in the time-series and those are the closest matches. This is why, when computing the median for an even number of values, we take the lower of the two middle values (meaning we round down to the lower values) and use the result as the forecast. This also allows us to be more conservative and plays a role in the true positive/false positive tradeoff.
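The effect of rounding the median down can be sketched with Python's `statistics.median_low`, which returns the lower of the two middle values for an even count. The sketch assumes the forecast for each of the four steps is built from the values that immediately follow each matched window; the function name and that construction are assumptions for illustration.

```python
from statistics import median_low

def forecast_from_matches(history, match_starts, width, horizon=4):
    """Build a forecast from the points that follow each matched window.

    For every forecast step, collect the value that followed each match
    and take the lower median, so ties between "low" and "high" days
    resolve toward the lower value.
    """
    forecast = []
    for step in range(horizon):
        followers = [history[start + width + step] for start in match_starts]
        forecast.append(median_low(followers))
    return forecast
```

In the holiday case above, three low-traffic matches and three working-day matches give six values per step, and the lower median picks the low-traffic side rather than averaging the two groups.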

Lastly let's look at the outage example:

In this case, the matches are always connected to low traffic because the last 24 hours (the reference) correspond to a transition from Sunday to Monday. Because of the low traffic, the windows with the lowest Euclidean distance (the most similar 24 hours) are either Saturdays (two times) or Sundays (four times). The forecast is therefore what we would expect to see on a regular Monday, which is why the forecast (red) has an upward trend. Since we had an outage, however, the actual volume of traffic (black) is considerably lower than the forecast.

This approach works for regular seasonal patterns, as would several other modeling approaches, and it has also been shown to work for holidays and other moving events (such as festivities that don’t happen on the same day every year) without having to actively add that information in. Nevertheless, there are still cases where it will fail, specifically when there’s an offset in the data. This is one of the reasons why we use multiple data sources to reduce the chances of the algorithm being affected by data artifacts.

Below we have an example of how the algorithm behaves over time.

Comparing forecast with actual traffic

Once we have the forecast and the actual traffic volume, we do the following steps.

We calculate relative change, which measures how much one value has changed relative to another. Since we are detecting anomalies based on traffic drops, the actual traffic will always be lower than the forecast.
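As a small sketch, relative change can be written as the difference between actual and forecast divided by the forecast, which is consistent with the worked figures in this section (a forecast of 100 and an actual value of 80 give -0.20). The function name is an assumption.

```python
def relative_change(actual, forecast):
    """Relative change of `actual` with respect to `forecast`.

    Negative values mean the actual traffic is below the forecast.
    """
    if forecast == 0:
        raise ValueError("relative change is undefined for a zero forecast")
    return (actual - forecast) / forecast
```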

After calculating this metric, we apply the following rules:

The difference between the actual and the forecast must be at least 10% of the magnitude of the signal. This magnitude is computed using the difference between 95th and 5th percentiles of the selected data. The idea is to avoid scenarios where the traffic is low, particularly during the off-peaks of the day and scenarios where small changes in actual traffic correspond to big changes in relative change because the forecast is also low. As an example:
- a forecast of 100 Gbps compared with an actual value of 80 Gbps gives us a relative change of -0.20 (-20%).
- a forecast of 20 Mbps compared with an actual value of 10 Mbps gives us a much smaller decrease in total volume than the previous example but a relative change of -0.50 (-50%).
Then we have two rules for detecting considerably low traffic:
- Sustained anomaly: The relative change is below a given threshold α throughout the forecast window (for all four data points). This allows us to detect weaker anomalies (with smaller relative changes) that are extended over time.
- Point anomaly: The relative change of the last data point of the forecast window is below a given threshold β, where β < α. These thresholds are negative; as an example, β and α might be -0.6 and -0.4, respectively. We need β < α to avoid triggering anomalies due to the stochastic nature of the data while still being able to detect sudden and short-lived traffic drops.

The values of α and β were chosen empirically to maximize detection rate, while keeping the false positive rate at an acceptable level.
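The rules above can be combined into a short sketch. Several details are assumptions made for illustration: the function names, the default values of α and β (taken from the example figures in the text, not the production values), and the choice to apply the 10% magnitude guard to every data point a rule looks at.

```python
from statistics import quantiles

def signal_magnitude(selected_data):
    """Difference between the 95th and 5th percentiles of the data."""
    cuts = quantiles(selected_data, n=20, method="inclusive")
    return cuts[-1] - cuts[0]  # cuts[0] is the 5th, cuts[-1] the 95th percentile

def detect_drop(forecast, actual, selected_data,
                alpha=-0.4, beta=-0.6, min_fraction=0.10):
    """Return 'sustained', 'point', or None for one forecast window."""
    min_drop = min_fraction * signal_magnitude(selected_data)
    changes = [(a - f) / f for a, f in zip(actual, forecast)]
    # Assumption: the absolute drop must clear the magnitude guard at each point.
    big_enough = [(f - a) >= min_drop for a, f in zip(actual, forecast)]

    # Sustained anomaly: every point in the window is below alpha.
    if all(c < alpha and ok for c, ok in zip(changes, big_enough)):
        return "sustained"
    # Point anomaly: only the last point is checked, against the stricter beta.
    if changes[-1] < beta and big_enough[-1]:
        return "point"
    return None
```

The magnitude guard is what keeps a drop from 10 to 3 during an off-peak period from firing when the signal normally swings across a range of 90: the relative change is -0.7, but the absolute drop is below 10% of the magnitude.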

Closing an anomaly event

Although the most important message that we want to convey is when an anomaly starts, it is also crucial to detect when the Internet traffic volume goes back to normal for two main reasons:

We need to have the notion of active anomaly, which means that we detected an anomaly and that same anomaly is still ongoing. This allows us to stop considering new data for the reference while the anomaly is still active. Considering that data would impact the reference and the selection of most similar sets of 24 hours.
Once the traffic goes back to normal, knowing the duration of the anomaly allows us to flag those data points as outliers and replace them, so we don’t end up using them as the reference or as best matches to the reference. Although we are using the median to compute the forecast, and in most cases that would be enough to overcome the presence of anomalous data, there are scenarios such as the one for AS203214 (HulumTele), used as example four, where the outages frequently occur at the same time of day, which would make the anomalous data become the expectation after a few days.

Whenever we detect an anomaly we keep the same reference until the data comes back to normal, otherwise our reference would start including anomalous data. To determine when the traffic is back to normal, we use lower thresholds than α and we give it a time period (currently four hours) where there should be no anomalies in order for it to close. This is to avoid situations where we observe drops in traffic that bounce back to normal and drop again. In such cases we want to detect a single anomaly and aggregate it to avoid sending multiple notifications, and in terms of semantics there’s a high chance that it’s related to the same anomaly.
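The open/close behavior can be sketched as a small state tracker. It assumes each fifteen-minute data point has already been judged anomalous or not (using whatever thresholds apply), and that four hours corresponds to 16 consecutive quiet points. The class and method names are hypothetical.

```python
class AnomalyTracker:
    """Tracks a single active anomaly and closes it after a quiet period."""

    def __init__(self, quiet_points=16):  # 16 x 15 minutes = 4 hours
        self.quiet_points = quiet_points
        self.active = False
        self._quiet_run = 0

    def update(self, is_anomalous):
        """Feed one data point; return 'opened', 'closed', or None."""
        if is_anomalous:
            # Any new drop resets the quiet period and extends the same event.
            self._quiet_run = 0
            if not self.active:
                self.active = True
                return "opened"
            return None
        if self.active:
            self._quiet_run += 1
            if self._quiet_run >= self.quiet_points:
                self.active = False
                self._quiet_run = 0
                return "closed"
        return None
```

Because a drop that recurs before the quiet period ends only resets the counter, traffic that bounces back and drops again produces one "opened" and one "closed" event rather than several notifications.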

Conclusion

Internet traffic data is generally predictable, which in theory would allow us to build a very straightforward anomaly detection algorithm to detect Internet disruptions. However, due to the heterogeneity of the time series depending on the entity we are observing (Location or AS) and the presence of artifacts in the data, it also needs a lot of context that poses some challenges if we want to track it in real-time. Here we’ve shown particular examples of what makes this problem challenging, and we have explained how we approached this problem in order to overcome most of the hurdles. This approach has been shown to be very effective at detecting traffic anomalies while keeping a low false positive rate, which is one of our priorities. Since it is a static threshold approach, one of the downsides is that we are not detecting anomalies that are not as steep as the ones we’ve shown.

We will keep working on adding more entities and refining the algorithm to be able to cover a broader range of anomalies.

Visit Cloudflare Radar for additional insights around Internet disruptions, routing issues, Internet traffic trends, attacks, Internet quality, and more. Follow us on social media at @CloudflareRadar (Twitter), https://noc.social/@cloudflareradar (Mastodon), and radar.cloudflare.com (Bluesky), or contact us via e-mail.

INP. Get ready for the new Core Web Vital

William Woodhead — Tue, 20 Jun 2023 13:00:16 GMT

INP will replace FID in the Core Web Vitals

On May 10, 2023, Google announced that INP will replace FID in the Core Web Vitals in March 2024. The Core Web Vitals play a role in the Google Search algorithm. So website owners who care about Search Engine Optimization (SEO) should prepare for the change. Otherwise their search ranking might suffer.

This post will first explain what FID, INP and the Core Web Vitals are. Then it will show how FID and INP relate to each other across a large range of Cloudflare sites. (Spoiler alert - If a site has ‘Good’ scoring FID, it might not have ‘Good’ scoring INP). Then it will discuss how to prepare for this change and how Cloudflare can help.

A few definitions

In order to make sense of the upcoming change, here are some definitions that will set the scene.

Core Web Vitals

Measuring user-centric web performance is challenging. To face this challenge, Google developed a series of metrics called the Web Vitals. These Web Vitals are signals that measure different aspects of web performance. For example Time To First Byte (TTFB) is one of the Web Vitals: from the perspective of the browser, TTFB “measures the time between the request for a resource and when the first byte of a response begins to arrive.” - https://web.dev/ttfb/

A subset of the Web Vitals are the Core Web Vitals. The Core Web Vitals are identified as the most critical Web Vitals to pay attention to. As such, they play a role in the Google Search algorithm. Improving a webpage’s Core Web Vitals can improve success with Google Search. The Core Web Vitals currently consist of three metrics: Largest Contentful Paint (LCP), First Input Delay (FID) and Cumulative Layout Shift (CLS). Respectively they measure Loading Speed, Interactivity and Visual Stability. This will change in March 2024 when Interaction To Next Paint (INP) replaces FID as the Core Web Vital for Interactivity.

Interaction

An Interaction on a webpage starts with a user input. The browser then reacts to this input. This includes the input delay, the processing time and the presentation delay before the next paint occurs and the new frame is presented.

Image from https://web.dev/inp/#whats-in-an-interaction

Keeping in mind that an interaction is composed of these three serialized durations - input delay, processing time and presentation delay - will help later on to understand the difference between FID and INP.

First Input Delay (FID)

“FID measures the time from when a user first interacts with a page (that is, when they click a link, tap on a button, or use a custom, JavaScript-powered control) to the time when the browser is actually able to begin processing event handlers in response to that interaction.” - https://web.dev/fid/

FID is the current Core Web Vital that measures interactivity. FID measures the first interaction with the page. User interactions after the first interaction are not measured with FID. FID also only measures the input delay part of the first interaction. It does not measure the processing time and the presentation delay.

FID is classed as ‘Good’ if the input delay is less than 100 ms. FID is classed as ‘Needs Improvement’ between 100 ms and 300 ms. Delays above 300 ms are classed as ‘Poor’.

Image from https://web.dev/fid/

Interaction to Next Paint (INP)

“INP is a metric that assesses a page's overall responsiveness to user interactions by observing the latency of all click, tap, and keyboard interactions that occur throughout the lifespan of a user's visit to a page. The final INP value is the longest interaction observed, ignoring outliers.” - https://web.dev/inp/

INP, like FID, is also a Web Vital that measures interactivity. INP observes all user interactions in a page view and reports the longest. It measures more than just the input delay duration of each interaction. It also measures the processing time and the presentation delay until the next paint occurs. Once you understand the mechanics of it, the name ‘Interaction to Next Paint’ is a good description of the duration it measures.

INP is classed as ‘Good’ if the final reported value is less than 200 ms. INP is classed as ‘Needs Improvement’ between 200 ms and 500 ms. Any measurement above 500 ms is classed as ‘Poor’.
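The FID and INP bands can be captured in a small lookup. This is a sketch: the names are hypothetical, and the handling of a value sitting exactly on a threshold follows the wording above ("less than" for ‘Good’, "above" for ‘Poor’), which is an assumption about the intended boundaries.

```python
# (good_below_ms, poor_above_ms) for each metric, as described in the text.
THRESHOLDS_MS = {"FID": (100, 300), "INP": (200, 500)}

def classify(metric, value_ms):
    """Classify a FID or INP value in milliseconds."""
    good_below, poor_above = THRESHOLDS_MS[metric]
    if value_ms < good_below:
        return "Good"
    if value_ms <= poor_above:
        return "Needs Improvement"
    return "Poor"
```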

Image from https://web.dev/inp/

A more comprehensive measure of interactivity

Both FID and INP measure interactivity by measuring delays after user interactions. But INP is more comprehensive than FID. As an example, let’s say a user loads a web page and interacts only once by clicking a button. In this case, FID will capture this interaction by reporting the input delay after the user input. INP will also observe this interaction but it will measure not only the input delay but also the subsequent processing time and the presentation delay. Since this is the only interaction on the web page, INP will observe this as the longest duration within the page view and so will report this duration as the final reported INP value. Already it is clear that with just a single user interaction, INP provides a more comprehensive signal of interactivity because it includes processing time and presentation delay.

But INP also goes beyond the first interaction on the page. It observes all the interactions within a page view and reports the longest interaction of the set (ignoring outliers). This means that web sites need to ensure that all interactions throughout the lifecycle of a page view are under 200ms in order to score a ‘Good’ INP. With FID, web sites only need to focus on optimizing the first interaction’s input delay under 100 ms to score ‘Good’ FID. With INP joining the Core Web Vitals, web sites will need to focus on optimizing all interactions.
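The difference between the two metrics can be sketched with a simple model of an interaction. The class and function names are assumptions, and the sketch deliberately leaves out the outlier exclusion that INP applies, so it only reports the longest interaction.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    input_delay_ms: float
    processing_ms: float
    presentation_delay_ms: float

    @property
    def duration_ms(self):
        # Full interaction: input delay + processing time + presentation delay.
        return self.input_delay_ms + self.processing_ms + self.presentation_delay_ms

def fid(interactions):
    """FID: the input delay of the first interaction only."""
    return interactions[0].input_delay_ms

def inp(interactions):
    """INP (simplified): the longest full interaction, outliers not excluded."""
    return max(i.duration_ms for i in interactions)
```

For a page view with a fast first click (10 ms input delay) followed by a slow second interaction totalling 320 ms, FID reports 10 ms while this simplified INP reports 320 ms, so the same page view can score ‘Good’ on one and ‘Needs Improvement’ on the other.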

Comparing FID with INP across Cloudflare

At Cloudflare, for sites that have enabled Real User Measurements (RUM) via the Web Analytics Product or the Observatory Product, we already collect INP data. This means we have a rich dataset across 850k sites with which to analyze INP against FID.

FID vs INP

Over the period of a week, across all Cloudflare RUM-enabled sites, we observed over four billion reported FIDs and INPs. 93% of the FID values classify as ‘Good’ (under 100 ms). 75% of the INP values classify as ‘Good’ (under 200 ms). Since these values are sourced from the same set of page views, it is already possible to see at a glance that INP is distributed more lightly in the ‘Good’ range than FID.

Interestingly, the P75 value for FID is just 16 ms, which is 0.16 times the FID ‘Good’ threshold of 100 ms. For INP, the P75 value is 200 ms, which is exactly the INP ‘Good’ threshold.

INP across devices

Focusing on INP, when the INP data is segmented by device type, we observe that 88% of desktop INP classify as ‘Good’, but only 67% of mobile INP classify as ‘Good’. This suggests that INP is generally suffering more on mobile devices than on desktop.

The bottom line is that INP is more challenging than FID to ‘get into the Green’. Since INP is a more comprehensive measure of interactivity, web sites are more exposed, and less likely to score favorably.

How to prepare for the change

INP will start affecting Google Search rankings when it becomes a Core Web Vital in March 2024. To get ready, websites need to improve interactivity beyond the initial load of the site.

As always, in order to improve on a metric, a baseline must be drawn by collecting data on the metric. But with INP, this can be challenging because INP is best reported from field data. Field data means that the data comes from real users interacting with web pages on real browsers. With field data, synthetic tools like Google Lighthouse aren’t a great fit. Synthetic tools are not designed to emulate the wide range of users on the wide range of browsers and operating systems that will navigate a web page.

In order to collect real user data from real browsers, websites need to enable a RUM provider. RUM providers collect performance data from real browsers, and process the data so that it can be aggregated and analyzed. It’s worth noting that performance data is anonymous and does not include any personally identifiable information (PII).

So the first step to understanding INP on a website is to enable a RUM provider.

Once a RUM provider is enabled on a website, it is then possible to analyze INP data to understand which web pages are not optimized for interactivity. There can be many possible causes for this. Since INP is composed of input delay, processing time and presentation delay, optimizing INP can be broken down into optimizing these distinct durations.

The best advice for optimizing INP is set out by web.dev. Optimizing INP is not straightforward - it can involve breaking up JavaScript tasks, reducing DOM interactions and layouts, avoiding use of timers, and limiting third-party code amongst a range of other measures. That’s why it’s best to get started early and learn as much as possible before the change takes place.

How Cloudflare can help

A free RUM provider

Cloudflare offers a free RUM provider which collects INP as part of its dataset. It currently powers Cloudflare Web Analytics and Cloudflare Observatory. By enabling RUM with Cloudflare, you can explore the interactivity of your site’s web pages and start to identify interactivity issues.

Just log onto the Cloudflare Dashboard, and head to your target account. Go to Web Analytics.

Reduce the impact of third-party scripts

One of the contributing causes of degraded INP is JavaScript blocking the main thread after an interaction. Often this can be caused by heavy third-party scripts attaching event listeners to DOM elements. For example, third-party analytics scripts can register listeners on interactive elements of a page to power various forms of analytics. Many sites have dozens, if not hundreds of third-party scripts running, each one potentially degrading INP.

Cloudflare offers a unique product that can entirely remove third-party scripts from the browser. This frees up the main thread, boosting interactivity. If you haven’t come across Cloudflare Zaraz before, check it out in our docs. Not only can it improve INP, but it can also dramatically improve the security posture of your site by reducing (or even entirely removing) the amount of third-party JavaScript that runs on your website.


Announcing database integrations: a few clicks to connect to Neon, PlanetScale and Supabase on Workers

Shaun Persad — Tue, 16 May 2023 13:05:00 GMT

This blog post references a feature which has updated documentation. For the latest reference content, visit https://developers.cloudflare.com/workers/databases/third-party-integrations/

One of the best feelings as a developer is seeing your idea come to life. You want to move fast and Cloudflare’s developer platform gives you the tools to take your applications from 0 to 100 within minutes.

One thing that we’ve heard slows developers down is the question: “What databases can be used with Workers?”. Developers stumble when it comes to things like finding the databases that Workers can connect to, the right library or driver that's compatible with Workers and translating boilerplate examples to something that can run on our developer platform.

Today we’re announcing Database Integrations – making it seamless to connect to your database of choice on Workers. To start, we’ve added some of the most popular databases that support HTTP connections: Neon, PlanetScale and Supabase with more (like Prisma, Fauna, MongoDB Atlas) to come!

Focus more on code, less on config

Our serverless SQL database, D1, launched in open alpha last year, and we’re continuing to invest in making it production ready (stay tuned for an exciting update later this week!). We also recognize that there are plenty of flavours of databases, and we want developers to have the freedom to select what’s best for them and pair it with our powerful compute offering.

On the second day of Developer Week 2023, data is in the spotlight. We’re taking huge strides in making it possible and more performant to connect to databases from Workers (spoiler alert!):

Making it possible and performant is just the start. We also want to make connecting to databases painless. Databases have specific protocols, drivers, APIs and vendor-specific features that you need to understand in order to get up and running. With Database Integrations, we want to make this process foolproof.

Whether you’re working on your first project or your hundredth project, you should be able to connect to your database of choice with your eyes closed. With Database Integrations, you can spend less time focusing on configuration and more on doing what you love – building your applications!

What does this experience look like?

Discoverability

If you’re starting a project from scratch or want to connect Workers to an existing database, you want to know “What are my options?”.

Workers supports connections to a wide array of database providers over HTTP. With newly released outbound TCP support, the databases that you can connect to on Workers will only grow!

In the new “Integrations” tab, you’ll be able to view all the databases that we support and add the integration to your Worker directly from here. To start, we have support for Neon, PlanetScale and Supabase with many more coming soon.

Authentication

You should never have to copy and paste your database credentials or other parts of the connection string.

Once you hit “Add Integration” we take you through an OAuth2 flow that automatically gets the right configuration from your database provider and adds them as encrypted environment variables to your Worker.

Once you have credentials set up, check out our documentation for examples on how to get started using the data platform’s client library. What’s more – we have templates coming that will allow you to get started even faster!

That’s it! With database integrations, you can connect your Worker with your database in just a few clicks. Head to your Worker > Settings > Integrations to try it out today.

What’s next?

We’ve only just scratched the surface with Database Integrations and there’s a ton more coming soon!

While we’ll be continuing to add support for more popular data platforms we also know that it's impossible for us to keep up in a moving landscape. We’ve been working on an integrations platform so that any database provider can easily build their own integration with Workers. As a developer, this means that you can start tinkering with the next new database right away on Workers.

Additionally, we’re working on adding wrangler support, so you can create integrations directly from the CLI. We’ll also be adding support for account level environment variables in order for you to share integrations across the Workers in your account.

We’re really excited about the potential here and to see all the new creations from our developers! Be sure to join Cloudflare’s Developer Discord and share your projects. Happy building!

Making home Internet faster has little to do with “speed”

Mike Conlow — Tue, 18 Apr 2023 13:00:00 GMT

More than ten years ago, researchers at Google published a paper with the seemingly heretical title “More Bandwidth Doesn’t Matter (much)”. We published our own blog showing it is faster to fly 1TB of data from San Francisco to London than it is to upload it on a 100 Mbps connection. Unfortunately, things haven’t changed much. When you make purchasing decisions about home Internet plans, you probably consider the bandwidth of the connection when evaluating Internet performance. More bandwidth is faster speed, or so the marketing goes. In this post, we’ll use real-world data to show both bandwidth and – spoiler alert! – latency impact the speed of an Internet connection. By the end, we think you’ll understand why Cloudflare is so laser focused on reducing latency everywhere we can find it.

The grand summary of the blog that follows is this:

There are many ways to evaluate network performance.
Performance “goodness” depends on the application -- a good number for one application can be of zero benefit to a different application.
“Speed” numbers can be misleading, not least because any single metric cannot accurately describe how all applications will perform.

To better understand these ideas, we should define bandwidth and latency. Bandwidth is the amount of data that can be transmitted at any single time. It’s the maximum throughput, or capacity, of the communications link between two servers that want to exchange data. The “bottleneck” is the place in the network where the connection is constrained by the amount of bandwidth available. Usually this is in the “last mile”, either the wire that connects a home, or the modem or router in the home itself.

If the Internet is an information superhighway, bandwidth is the number of lanes on the road. The wider the road, the more traffic can fit on the highway at any time. Bandwidth is useful for downloading large files like operating system updates and big game updates. We use bandwidth when streaming video, though probably less than you think. Netflix recommends 15 Mbps of bandwidth to watch a stream in 4K/Ultra HD. A 1 Gbps connection could stream more than 60 Netflix shows in 4K at the same time!
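The arithmetic behind these bandwidth figures is easy to check. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Gbps = 1,000 Mbps) and ignores protocol overhead; the function names are ours, for illustration only.

```python
def transfer_hours(size_bytes: float, link_mbps: float) -> float:
    """Hours needed to move size_bytes over a link of link_mbps, ignoring overhead."""
    bits = size_bytes * 8
    seconds = bits / (link_mbps * 1_000_000)
    return seconds / 3600


def concurrent_streams(link_mbps: float, stream_mbps: float) -> int:
    """Whole number of streams of stream_mbps that fit within link_mbps."""
    return int(link_mbps // stream_mbps)


upload_hours = transfer_hours(1e12, 100)  # 1 TB over a 100 Mbps connection
streams = concurrent_streams(1000, 15)    # 15 Mbps 4K streams on a 1 Gbps connection

print(f"1 TB at 100 Mbps: {upload_hours:.1f} hours")
print(f"4K streams on 1 Gbps: {streams}")
```

Under these assumptions the upload takes about 22 hours, and 66 streams fit on the 1 Gbps connection, consistent with the "more than 60" figure above.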

Latency, on the other hand, is the time it takes data to move through the Internet. To extend our superhighway analogy, latency is the speed at which vehicles move on the highway. If traffic is moving quickly, you’ll get to your destination faster. Latency is measured in the number of milliseconds that it takes a packet of data to travel between a client (such as your laptop computer) and a server. In practice, we have to measure latency as the round-trip time (RTT) between client and server because every device has its own independent clock, so it’s hard to measure latency in just one direction. If you’re practicing tennis against a wall, round-trip latency is the time the ball was in the air. On the Internet fibre optic “backbone”, data travels at almost 200,000 kilometers per second as it bounces off the glass on the inside of optical wires. That’s fast!

Low-latency connections are important for gaming, where tiny bits of data, such as the change in position of players in a game, need to reach another computer quickly. And increasingly, we’re becoming aware of high latency when it makes our live video conferencing choppy and unpleasant.

While we can’t make light travel through glass much faster, we can improve latency by moving the content closer to users, shortening the distance data needs to travel. That’s the effect of our presence in more than 285 cities globally: when you’re on the Internet superhighway trying to reach Cloudflare, we want to be just off the next exit.
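To see why distance matters, here is a minimal sketch of the physical lower bound on round-trip time, using the roughly 200,000 km/s figure quoted earlier. It assumes a straight fibre path and ignores routing detours, queuing and processing time, so real RTTs will be higher.

```python
FIBRE_SPEED_KM_PER_S = 200_000  # approximate speed of light in glass, from the text


def min_rtt_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fibre: there and back at fibre speed."""
    one_way_s = distance_km / FIBRE_SPEED_KM_PER_S
    return 2 * one_way_s * 1000


for km in (100, 1_000, 10_000):
    print(f"{km:>6} km -> at least {min_rtt_ms(km):.0f} ms RTT")
```

Every 1,000 km between a user and a server adds at least 10 ms to each round trip, which is why serving content from a nearby city helps.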

The terms bandwidth, capacity, and maximum throughput are slightly different from each other, but close enough in their meaning to be interchangeable. Confusingly, “speed” has come to mean bandwidth when talking about Internet plans, but “speed” gives no indication of the latency between your devices and the servers they connect to. More on this later. For now, we don’t use the Internet only to play games, nor only watch streaming video. We do those and more, and we visit a lot of normal web pages in between.

In the 2010 paper from Google, the author simulated loading web pages while varying the throughput and latency of the connection. The finding was that above about 5 Mbps, the page doesn’t load much faster. Increasing bandwidth from 1 Mbps to 2 Mbps is almost a 40 percent improvement in page load time. From 5 Mbps to 6 Mbps is less than a 5 percent improvement.

However, something interesting happened when varying the latency (the Round Trip Time, or RTT): there was a linear and proportional improvement on page load times. For every 20 milliseconds of reduced latency, the page load time improved by about 10%.

Let’s see what this looks like in real life with empirical data. In an excellent recent paper, two researchers from MIT used data from the FCC’s Measuring Broadband America program to produce the chart below, which shows results similar to the 2010 simulation. Though the point of diminishing returns to more bandwidth has moved higher – to about 20 Mbps – the overall trend was exactly the same.

We repeated this analysis with a focus on latency using our own Cloudflare data. The results are summarized in the next chart, showing a familiar pattern. For every 200 milliseconds of latency we can save, we cut the page load time by over 1 second. That relationship applies when the latency is 950 milliseconds. And it applies when the latency is 50 milliseconds.

There are a few reasons latency matters in the set of transactions needed to load pages. When you connect to a website, the first thing that your browser does is establish a secure connection, to authenticate the website and ensure your data is encrypted. The protocols to do this are TCP and TLS, or QUIC (that is encrypted by default). The number of message exchanges each needs to establish a secure connection varies, but one aspect of the establishment phase is common to all of them: Latency matters most.
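As a rough illustration, the sketch below multiplies the round-trip time by the number of round trips needed before the first HTTP request can be sent on a fresh connection. The round-trip counts are commonly cited values and are an assumption of this sketch rather than figures from this post; session resumption and 0-RTT modes can lower them.

```python
# Round trips before the first HTTP request on a fresh connection.
# Assumed, commonly cited values; resumption and 0-RTT can reduce them.
SETUP_ROUND_TRIPS = {
    "TCP + TLS 1.2": 3,  # 1 for TCP, 2 for TLS
    "TCP + TLS 1.3": 2,  # 1 for TCP, 1 for TLS
    "QUIC": 1,           # transport and encryption negotiated together
}


def setup_ms(protocol: str, rtt_ms: float) -> float:
    """Time spent establishing a secure connection, in milliseconds."""
    return SETUP_ROUND_TRIPS[protocol] * rtt_ms


for rtt in (20, 100):
    for protocol in SETUP_ROUND_TRIPS:
        print(f"RTT {rtt:>3} ms, {protocol}: {setup_ms(protocol, rtt):.0f} ms")
```

Whatever the protocol, the setup cost scales directly with the RTT and not at all with bandwidth.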

On top of that, when we load a webpage after we establish encryption and verify website authority, we might be asking the browser to load hundreds of different files across dozens of different domains. Some of these files can be loaded in parallel, but others need to be loaded sequentially. As the browser races to compile all these different files, it’s the speed at which it can get to the server and back that determines how fast it can put the page together. The files are often quite small, but there’s a lot of them.

The chart below shows the beginning of what the browser does when it loads cnn.com. First is the connection handshake phase, followed by a 301 redirect to www.cnn.com, which requires a completely new connection handshake before the browser can load the main HTML page in step two. Only then, more than 1 second into the load, does it learn about all the JavaScript files it requires in order to render the page. Files 3-19 are requested mostly on the same connection but are not served until after the HTML file has been delivered in full. Files 8, 9, and 10 are requested over separate connections (all costing handshakes). Files 20-27 are all blocked on earlier files and similarly need new connections. They can’t start until the browser has the previous file back from the server and executes it. There are 650 assets in this page load, and the blocking happens all the way through the page load. Here’s why this matters: better latency makes every file load faster, which in turn unblocks other files faster, and so on.

The protocols will use all the bandwidth available, but often complete a transfer before all the available bandwidth is consumed. It’s no wonder then that adding more bandwidth doesn’t speed up the page load, but better latency does. While developments like Early Hints help this by informing browsers of dependencies earlier, allowing them to pre-connect to servers or pre-fetch resources that don’t need to be strictly ordered, this is still a problem for many websites on the Internet today.
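A toy model makes the point concrete. The sketch below assumes a chain of small files that must be fetched one after another, with each fetch costing one round trip plus the transfer time. The file sizes and link parameters are hypothetical, and the model ignores handshakes, parallel requests, TCP slow start and server processing.

```python
def chain_load_ms(file_sizes_kb, rtt_ms, bandwidth_mbps):
    """Load time for files that must be fetched strictly one after another."""
    total = 0.0
    for size_kb in file_sizes_kb:
        transfer_ms = size_kb * 8 / bandwidth_mbps  # kilobits / (megabits/s) = ms
        total += rtt_ms + transfer_ms
    return total


files = [20] * 10  # ten hypothetical 20 KB files, each blocked on the previous one

baseline = chain_load_ms(files, rtt_ms=100, bandwidth_mbps=20)
more_bandwidth = chain_load_ms(files, rtt_ms=100, bandwidth_mbps=40)
less_latency = chain_load_ms(files, rtt_ms=50, bandwidth_mbps=20)

print(f"baseline:         {baseline:.0f} ms")
print(f"double bandwidth: {more_bandwidth:.0f} ms")
print(f"half the latency: {less_latency:.0f} ms")
```

In this model, doubling the bandwidth trims the load from 1080 ms to 1040 ms, while halving the round-trip time brings it down to 580 ms.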

Recently, Internet researchers have turned their attention to using our understanding of the relationship between throughput and latency to improve Internet Quality of Experience (QoE). A paper from the Broadband Internet Technical Advisory Group (BITAG) summarizes:

But we now recognize that it is not just greater throughput that matters, but also consistently low latency. Unfortunately, the way that we’ve historically understood and characterized latency was flawed, and our latency measurements and metrics were not aligned with end-user QoE.

Confusing matters further, there is a difference between latency on an idle Internet connection and latency measured in working conditions when many connections share the network resources, which we call “working latency” or “responsiveness”. Since responsiveness is what the user experiences as the speed of their Internet connection, it’s important to understand and measure this particular latency.

An Internet connection can suffer from poor responsiveness (even if it has good idle latency) when data is delayed in buffers. If you download a large file, for example an operating system update, the server sending the file might send the file with higher throughput than the Internet connection can accept. That’s ok. Extra bits of the file will sit in a buffer until it’s their turn to go through the funnel. Adding extra lanes to the highway allows more cars to pass through, and is a good strategy if we aren’t particularly concerned with the speed of the traffic.

Say for example, Christabel is watching a stream of the news while on a video meeting. When Christabel starts watching the video, her browser fetches a bunch of content and stores it in various buffers on the way from the content host to the browser. Those same buffers also contain data packets pertaining to the video meeting Christabel is currently in. If the data generated as part of a video conference sits in the same buffer as the video files, the video files will fill up the buffer and cause delay for the video meeting packets as well. The larger the buffers, the longer the wait for video conference packets.
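The wait is straightforward to estimate: a packet that arrives behind a full buffer has to wait for everything ahead of it to drain at the speed of the link. The sketch below assumes a single first-in, first-out buffer and hypothetical buffer sizes on a 20 Mbps link.

```python
def queue_delay_ms(buffered_bytes: int, link_mbps: float) -> float:
    """Time a newly arrived packet waits behind buffered_bytes already queued."""
    return buffered_bytes * 8 / (link_mbps * 1_000_000) * 1000


LINK_MBPS = 20  # hypothetical home connection

for buffered in (64_000, 256_000, 1_000_000):
    delay = queue_delay_ms(buffered, LINK_MBPS)
    print(f"{buffered:>9,} bytes queued -> {delay:.1f} ms extra delay")
```

With these numbers the added delay grows from roughly 26 ms to 400 ms as the queue grows, which is more than enough to make a video call choppy even though the idle latency has not changed.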

Cloudflare is helping to make “speed” meaningful

To help users understand the strengths and weaknesses of their connection, we recently added Aggregated Internet Measurement (AIM) scores to our own “Speed” Test. These scores remove the technical metrics and give users a real-world, plain-English understanding of what their connection will be good at, and where it might struggle. We’d also like to collect more data from our speed test to help track Page Load Times (PLT) and see how they correlate with reductions in working latency. You’ll start seeing those numbers on our speed test soon!

We all use our Internet connections in slightly different ways, but we share the desire for our connections to be as fast as possible. As more and more services move into the cloud – word documents, music, websites, communications, etc – the speed at which we can access those services becomes critical. While bandwidth plays a part, the latency of the connection – the real Internet “speed” – is more important.

At Cloudflare, we’re working every day to help build a more performant Internet. Want to help? Apply for one of our open engineering roles here.

The unintended consequences of blocking IP addresses

Alissa Starzak — Fri, 16 Dec 2022 14:00:00 GMT

In late August 2022, Cloudflare’s customer support team began to receive complaints about sites on our network being down in Austria. Our team immediately went into action to try to identify the source of what looked from the outside like a partial Internet outage in Austria. We quickly realized that it was an issue with local Austrian Internet Service Providers.

But the service disruption wasn’t the result of a technical problem. As we later learned from media reports, what we were seeing was the result of a court order. Without any notice to Cloudflare, an Austrian court had ordered Austrian Internet Service Providers (ISPs) to block 11 of Cloudflare’s IP addresses.

In an attempt to block 14 websites that copyright holders argued were violating copyright, the court-ordered IP block rendered thousands of websites inaccessible to ordinary Internet users in Austria over a two-day period. What did the thousands of other sites do wrong? Nothing. They were a temporary casualty of the failure to build legal remedies and systems that reflect the Internet’s actual architecture.

Today, we are going to dive into a discussion of IP blocking: why we see it, what it is, what it does, who it affects, and why it’s such a problematic way to address content online.

Collateral effects, large and small

The craziest thing is that this type of blocking happens on a regular basis, all around the world. But unless that blocking happens at the scale of what happened in Austria, or someone decides to highlight it, it is typically invisible to the outside world. Even Cloudflare, with deep technical expertise and understanding about how blocking works, can’t routinely see when an IP address is blocked.

For Internet users, it’s even more opaque. They generally don’t know why they can’t connect to a particular website, where the connection problem is coming from, or how to address it. They simply know they cannot access the site they were trying to visit. And that can make it challenging to document when sites have become inaccessible because of IP address blocking.

Blocking practices are also widespread. In their Freedom on the Net report, Freedom House recently reported that 40 out of the 70 countries that they examined - which vary from countries like Russia, Iran and Egypt to Western democracies like the United Kingdom and Germany - did some form of website blocking. Although the report doesn’t delve into exactly how those countries block, many of them use forms of IP blocking, with the same kind of potential effects for a partial Internet shutdown that we saw in Austria.

Although it can be challenging to assess the amount of collateral damage from IP blocking, we do have examples where organizations have attempted to quantify it. In conjunction with a case before the European Court of Human Rights, the European Information Society Institute, a Slovakia-based nonprofit, reviewed Russia’s regime for website blocking in 2017. Russia exclusively used IP addresses to block content. The European Information Society Institute concluded that IP blocking led to “collateral website blocking on a massive scale” and noted that as of June 28, 2017, “6,522,629 Internet resources had been blocked in Russia, of which 6,335,850 – or 97% – had been blocked collaterally, that is to say, without legal justification.”

In the UK, overbroad blocking prompted the non-profit Open Rights Group to create the website Blocked.org.uk. The website has a tool enabling users and site owners to report on overblocking and request that ISPs remove blocks. The group also has hundreds of individual stories about the effect of blocking on those whose websites were inappropriately blocked, from charities to small business owners. Although it’s not always clear what blocking methods are being used, the fact that the site is necessary at all conveys the amount of overblocking. Imagine a dressmaker, watchmaker or car dealer looking to advertise their services and potentially gain new customers with their website. That doesn’t work if local users can’t access the site.

One reaction might be, “Well, just make sure there are no restricted sites sharing an address with unrestricted sites.” But as we’ll discuss in more detail, this ignores the large difference between the number of possible domain names and the number of available IP addresses, and runs counter to the very technical specifications that empower the Internet. Moreover, the definitions of restricted and unrestricted differ across nations, communities, and organizations. Even if it were possible to know all the restrictions, the designs of the protocols -- of the Internet, itself -- mean that it is simply infeasible, if not impossible, to satisfy every agency’s constraints.

Legal and human rights concerns

Overblocking websites is not only a problem for users; it has legal implications. Because of the effect it can have on ordinary citizens looking to exercise their rights online, government entities (both courts and regulatory bodies) have a legal obligation to make sure that their orders are necessary and proportionate, and don’t unnecessarily affect those who are not contributing to the harm.

It would be hard to imagine, for example, that a court in response to alleged wrongdoing would blindly issue a search warrant or an order based solely on a street address without caring if that address was for a single family home, a six-unit condo building, or a high rise with hundreds of separate units. But those sorts of practices with IP addresses appear to be rampant.

In 2020, the European Court of Human Rights (ECHR) - the court overseeing the implementation of the Council of Europe’s European Convention on Human Rights - considered a case involving a website that was blocked in Russia not because it had been targeted by the Russian government, but because it shared an IP address with a blocked website. The website owner brought suit over the block. The ECHR concluded that the indiscriminate blocking was impermissible, ruling that the block on the lawful content of the site “amounts to arbitrary interference with the rights of owners of such websites.” In other words, the ECHR ruled that it was improper for a government to issue orders that resulted in the blocking of sites that were not targeted.

Using Internet infrastructure to address content challenges

Ordinary Internet users don’t think a lot about how the content they are trying to access online is delivered to them. They assume that when they type a domain name into their browser, the content will automatically pop up. And if it doesn’t, they tend to assume the website itself is having problems unless their entire Internet connection seems to be broken. But those basic assumptions ignore the reality that connections to a website are often used to limit access to content online.

Why do countries block connections to websites? Maybe they want to limit their own citizens from accessing what they believe to be illegal content - like online gambling or explicit material - that is permissible elsewhere in the world. Maybe they want to prevent the viewing of a foreign news source that they believe to be primarily disinformation. Or maybe they want to support copyright holders seeking to block access to a website to limit viewing of content that they believe infringes their intellectual property.

To be clear, blocking access is not the same thing as removing content from the Internet. There are a variety of legal obligations and authorities designed to permit actual removal of illegal content. Indeed, the legal expectation in many countries is that blocking is a matter of last resort, after attempts have been made to remove content at the source.

Blocking just prevents certain viewers - those whose Internet access depends on the ISP that is doing the blocking - from being able to access websites. The site itself continues to exist online and is accessible by everyone else. But when the content originates from a different place and can’t be easily removed, a country may see blocking as their best or only approach.

We recognize the concerns that sometimes drive countries to implement blocking. But fundamentally, we believe it’s important for users to know when the websites they are trying to access have been blocked, and, to the extent possible, who has blocked them from view and why. And it’s critical that any restrictions on content should be as limited as possible to address the harm, to avoid infringing on the rights of others.

Brute force IP address blocking doesn’t allow for those things. It’s fully opaque to Internet users. The practice has unintended, unavoidable consequences on other content. And the very fabric of the Internet means that there is no good way to identify what other websites might be affected either before or during an IP block.

To understand what happened in Austria and what happens in many other countries around the world that seek to block content with the bluntness of IP addresses, we have to understand what is going on behind the scenes. That means diving into some technical details.

Identity is attached to names, never addresses

Before we even get started describing the technical realities of blocking, it’s important to stress that the first and best option to deal with content is at the source. A website owner or hosting provider has the option of removing content at a granular level, without having to take down an entire website. On the more technical side, a domain name registrar or registry can potentially withdraw a domain name, and therefore a website, from the Internet altogether.

But how do you block access to a website, if for whatever reason the content owner or content source is unable or unwilling to remove it from the Internet? There are only three possible control points.

The first is via the Domain Name System (DNS), which translates domain names into IP addresses so that the site can be found. Instead of returning a valid IP address for a domain name, the DNS resolver could lie and respond with a code, NXDOMAIN, meaning that “there is no such name.” A better approach would be to use one of the honest error numbers standardized in 2020, including error 15 for blocked, error 16 for censored, 17 for filtered, or 18 for prohibited, although these are not widely used currently.
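For readers who want to handle those error numbers in software, here is a minimal sketch that maps the four values named above to readable labels. The class and function names are hypothetical, and the sketch does not parse real DNS responses.

```python
from enum import IntEnum


class BlockingErrorCode(IntEnum):
    """The four blocking-related error numbers named in the text."""
    BLOCKED = 15
    CENSORED = 16
    FILTERED = 17
    PROHIBITED = 18


def describe(code: int) -> str:
    """Readable explanation for an error number, if it signals blocking."""
    try:
        return f"access restricted: {BlockingErrorCode(code).name.lower()}"
    except ValueError:
        return "not a blocking-related error"


for code in (15, 16, 17, 18, 3):
    print(code, "->", describe(code))
```

A client that receives one of these values can tell the user that the name exists but access was restricted, instead of pretending the name does not exist.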

Interestingly, the precision and effectiveness of DNS as a control point depends on whether the DNS resolver is private or public. Private or ‘internal’ DNS resolvers are operated by ISPs and enterprise environments for their own known clients, which means that operators can be precise in applying content restrictions. By contrast, that level of precision is unavailable to open or public resolvers, not least because routing and addressing is global and ever-changing on the Internet map, and in stark contrast to addresses and routes on a fixed postal or street map. For example, private DNS resolvers may be able to block access to websites within specified geographic regions with at least some level of accuracy in a way that public DNS resolvers cannot, which becomes profoundly important given the disparate (and inconsistent) blocking regimes around the world.

The second approach is to block individual connection requests to a restricted domain name. When a user or client wants to visit a website, a connection is initiated from the client to a server name, i.e. the domain name. If a network or on-path device is able to observe the server name, then the connection can be terminated. Unlike DNS, there is no mechanism to communicate to the user that access to the server name was blocked, or why.

The third approach is to block access to an IP address where the domain name can be found. This is a bit like blocking the delivery of all mail to a physical address. Consider, for example, if that address is a skyscraper with its many unrelated and independent occupants. Halting delivery of mail to the address of the skyscraper causes collateral damage by invariably affecting all parties at that address. IP addresses work the same way.

Notably, the IP address is the only one of the three options that has no attachment to the domain name. The website domain name is not required for routing and delivery of data packets; in fact it is fully ignored. A website can be available on any IP address, or even on many IP addresses, simultaneously. And the set of IP addresses that a website is on can change at any time. The set of IP addresses cannot definitively be known by querying DNS, which has been able to return any valid address at any time for any reason, since 1995.
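The collateral effect of the third approach can be sketched in a few lines. The mapping below is entirely hypothetical (reserved example names and documentation-range addresses); it shows that blocking the addresses of one target also cuts off every other name that shares any of them.

```python
# Hypothetical name-to-address mapping; addresses are from a documentation range.
HOSTING = {
    "targeted.example": {"192.0.2.10"},
    "bakery.example": {"192.0.2.10", "192.0.2.11"},
    "charity.example": {"192.0.2.10"},
    "elsewhere.example": {"192.0.2.99"},
}


def collateral(blocked_ips, targets, hosting=HOSTING):
    """Names that are not targets but resolve to at least one blocked address."""
    return sorted(
        name
        for name, ips in hosting.items()
        if name not in targets and ips & blocked_ips
    )


targets = {"targeted.example"}
blocked = set().union(*(HOSTING[name] for name in targets))

print("blocked addresses:", sorted(blocked))
print("collateral names: ", collateral(blocked, targets))
```

Here a single blocked address takes two unrelated sites down with the target. In practice the mapping is not even knowable in advance, because the set of addresses for a name can change at any time.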

The idea that an address is representative of an identity is anathema to the Internet’s design, because the decoupling of address from name is deeply embedded in the Internet standards and protocols, as is explained next.

The Internet is a set of protocols, not a policy or perspective

Many people still incorrectly assume that an IP address represents a single website. We’ve previously stated that the association between names and addresses is understandable given that the earliest connected components of the Internet appeared as one computer, one interface, one address, and one name. This one-to-one association was an artifact of the ecosystem in which the Internet Protocol was deployed, and satisfied the needs of the time.

Despite the one-to-one naming practice of the early Internet, it has always been possible to assign more than one name to a server (or ‘host’). For example, a server was (and is still) often configured with names to reflect its service offerings such as mail.example.com and www.example.com, but these shared a base domain name. There were few reasons to have completely different domain names until the need to colocate completely different websites onto a single server. That practice was made easier in 1997 by the Host header in HTTP/1.1, a feature preserved by the SNI field in a TLS extension in 2003.
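To illustrate how the Host header lets unrelated sites share one server, and therefore one IP address, here is a minimal sketch of name-based dispatch. The site names and responses are hypothetical, and the parser handles only the simple case (no ports, no malformed requests).

```python
# Two unrelated, hypothetical sites served from the same server and address.
SITES = {
    "www.example.com": "Welcome to example.com",
    "shop.example.org": "Welcome to the shop",
}


def host_from_request(raw_request: str):
    """Extract the Host header value from a raw HTTP/1.1 request, if present."""
    for line in raw_request.split("\r\n")[1:]:  # skip the request line
        if not line:
            break  # a blank line ends the headers
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            return value.strip().lower()
    return None


def serve(raw_request: str) -> str:
    """Pick the site by name; the destination IP address plays no part."""
    return SITES.get(host_from_request(raw_request), "404: unknown site")


print(serve("GET / HTTP/1.1\r\nHost: shop.example.org\r\n\r\n"))
print(serve("GET / HTTP/1.1\r\nHost: www.example.com\r\n\r\n"))
```

Both requests arrive at the same address; only the name carried inside the request decides which site answers.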

Throughout these changes, the Internet Protocol and, separately, the DNS protocol, have not only kept pace, but have remained fundamentally unchanged. They are the very reason that the Internet has been able to scale and evolve, because they are about addresses, reachability, and arbitrary name to IP address relationships.

The designs of IP and DNS are also entirely independent, which only reinforces that names are separate from addresses. A closer inspection of the protocols’ design elements illuminates the misperceptions of policies that lead to today's common practice of controlling access to content by blocking IP addresses.

By design, IP is for reachability and nothing else

Much like large public civil engineering projects rely on building codes and best practice, the Internet is built using a set of open standards and specifications informed by experience and agreed by international consensus. The Internet standards that connect hardware and applications are published by the Internet Engineering Task Force (IETF) in the form of “Requests for Comment” or RFCs -- so named not to suggest incompleteness, but to reflect that standards must be able to evolve with knowledge and experience. The IETF and its RFCs are cemented in the very fabric of communications, for example, with the first, RFC 1, published in 1969. The Internet Protocol (IP) specification reached RFC status in 1981.

Alongside the standards organizations, the Internet’s success has been helped by a core idea known as the end-to-end (e2e) principle, codified also in 1981, based on years of trial and error experience. The end-to-end principle is a powerful abstraction that, despite taking many forms, manifests a core notion of the Internet Protocol specification: the network’s only responsibility is to establish reachability, and every other possible feature has a cost or a risk.

The idea of “reachability” in the Internet Protocol is also enshrined in the design of IP addresses themselves. Looking at the Internet Protocol specification, RFC 791, the following excerpt from Section 2.3 is explicit about IP addresses having no association with names, interfaces, or anything else.

Addressing

    A distinction is made between names, addresses, and routes [4].   A
    name indicates what we seek.  An address indicates where it is.  A
    route indicates how to get there.  The internet protocol deals
    primarily with addresses.  It is the task of higher level (i.e.,
    host-to-host or application) protocols to make the mapping from
    names to addresses.   The internet module maps internet addresses to
    local net addresses.  It is the task of lower level (i.e., local net
    or gateways) procedures to make the mapping from local net addresses
    to routes.
                            [ RFC 791, 1981 ]

Just like postal addresses for skyscrapers in the physical world, IP addresses are no more than street addresses written on a piece of paper. And just like a street address on paper, one can never be confident about the entities or organizations that exist behind an IP address. In a network like Cloudflare’s, any single IP address represents thousands of servers, and can have even more websites and services -- in some cases numbering into the millions -- expressly because the Internet Protocol is designed to enable it.

Here’s an interesting question: could we, or any content service provider, ensure that every IP address matches to one and only one name? The answer is an unequivocal no, and here too, because of a protocol design -- in this case, DNS.

The number of names in DNS always exceeds the available addresses

A one-to-one relationship between names and addresses is impossible given the Internet specifications for the same reasons that it is infeasible in the physical world. Ignore for a moment that people and organizations can change addresses. Fundamentally, the number of people and organizations on the planet exceeds the number of postal addresses. We not only want, but need for the Internet to accommodate more names than addresses.

The difference in magnitude between names and addresses is also codified in the specifications. IPv4 addresses are 32 bits, and IPv6 addresses are 128 bits. The size of a domain name that can be queried by DNS is as many as 253 octets, or 2,024 bits (from Section 2.3.4 in RFC 1035, published 1987). The table below helps to put those differences into perspective:

On November 15, 2022, the United Nations announced the population of the Earth surpassed eight billion people. Intuitively, we know that there cannot be anywhere near as many postal addresses. The difference between the number of possible names on the planet, and similarly on the Internet, does and must exceed the number of available addresses.
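The bit sizes quoted from the specifications can be checked with a few lines of arithmetic. This is only a sketch: the constants come from the text above, and treating every bit pattern as a usable name overstates the real name space, because DNS labels carry further constraints that are not modelled here.

```python
# Sizes taken from the text: IPv4 = 32 bits, IPv6 = 128 bits,
# and a queryable domain name of up to 253 octets.
IPV4_BITS = 32
IPV6_BITS = 128
MAX_NAME_OCTETS = 253

max_name_bits = MAX_NAME_OCTETS * 8   # 2,024 bits, as quoted in the text
ipv4_addresses = 2 ** IPV4_BITS       # every possible IPv4 address
ipv6_addresses = 2 ** IPV6_BITS       # every possible IPv6 address

# How many more bits a maximal-length name has than each address type.
extra_bits_vs_ipv4 = max_name_bits - IPV4_BITS
extra_bits_vs_ipv6 = max_name_bits - IPV6_BITS

print(f"name bits: {max_name_bits}")
print(f"IPv4 addresses: {ipv4_addresses:,}")
print(f"IPv6 addresses: {ipv6_addresses:.3e}")
print(f"extra bits vs IPv4 / IPv6: {extra_bits_vs_ipv4} / {extra_bits_vs_ipv6}")
```

Each extra bit doubles the space, so even a loose upper bound on names dwarfs both address families.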

The proof is in the names!

Now that those two relevant principles about IP addresses and DNS names in the international standards are understood (that IP addresses and domain names serve distinct purposes, and that there is no one-to-one relationship between the two), an examination of a recent case of content blocking using IP addresses can help show why it is problematic. Take, for example, the IP blocking incident in Austria in late August 2022. The goal was to restrict access to 14 target domains by blocking 11 IP addresses (source: RTR.Telekom.Post via the Internet Archive) -- the mismatch between those two numbers should have been a warning flag that IP blocking might not have the desired effect.

Analogies and international standards may explain the reasons that IP blocking should be avoided, but we can see the scale of the problem by looking at Internet-scale data. To better understand and explain the severity of IP blocking, we decided to generate a global view of domain names and IP addresses (thanks are due to a PhD research intern, Sudheesh Singanamalla, for the effort). In September 2022, we used the authoritative zone files for the top-level domains (TLDs) .com, .net, .info, and .org, together with top-1M website lists, to find a total of 255,315,270 unique names. We then queried DNS from each of five regions and recorded the set of IP addresses returned. The table below summarizes our findings:

The table above makes clear that it takes no more than 10.7 million addresses to reach 255,315,270 names from any region on the planet, and the total set of IP addresses for those names from everywhere is about 16 million -- the ratio of names to IP addresses is nearly 24x in Europe and 16x globally.

There is one more worthwhile detail about the numbers above: The IP addresses are the combined totals of both IPv4 and IPv6 addresses, meaning that far fewer addresses are needed to reach all 255M websites.

We’ve also inspected the data a few different ways to find some interesting observations. For example, the figure below shows the cumulative distribution (CDF) of the proportion of websites that can be visited with each additional IP address. On the y-axis is the proportion of websites that can be reached given some number of IP addresses. On the x-axis, the 16M IP addresses are ranked from the most domains on the left to the fewest domains on the right. Note that any IP address in this set is a response from DNS and so it must have at least one domain name, but the busiest IP addresses in the set each carry a number of domains that runs to eight digits, in the tens of millions.

By looking at the CDF there are a few eye-watering observations:

Fewer than 10 IP addresses are needed to reach 20% of, or approximately 51 million, domains in the set;
100 IPs are enough to reach almost 50% of domains;
1000 IPs are enough to reach 60% of domains;
10,000 IPs are enough to reach 80%, or about 204 million, domains.
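To make the construction of that curve concrete, here is a minimal sketch of how such a coverage CDF could be computed. It is not the code used for the study: the function names and the tiny dataset (documentation-range addresses and `.example` names) are made up for illustration.

```python
def coverage_cdf(domains_by_ip):
    """Cumulative fraction of domains reachable as IPs are added,
    with IPs ranked from the most domains to the fewest."""
    ranked = sorted(domains_by_ip.values(), key=len, reverse=True)
    all_domains = set().union(*ranked)
    reached = set()
    cdf = []
    for domains in ranked:
        reached |= domains  # a domain on several IPs is counted once
        cdf.append(len(reached) / len(all_domains))
    return cdf


def ips_needed(cdf, fraction):
    """Smallest number of top-ranked IPs whose coverage reaches `fraction`."""
    for count, covered in enumerate(cdf, start=1):
        if covered >= fraction:
            return count
    return None


# Hypothetical toy data: four IPs hosting ten distinct names.
toy = {
    "192.0.2.1": {"a.example", "b.example", "c.example", "d.example", "e.example"},
    "192.0.2.2": {"f.example", "g.example", "h.example"},
    "192.0.2.3": {"h.example", "i.example"},
    "192.0.2.4": {"j.example"},
}
cdf = coverage_cdf(toy)
print(cdf)                   # coverage after 1, 2, 3, 4 IPs
print(ips_needed(cdf, 0.8))  # IPs needed to reach 80% of names
```

Even in this toy set the shape is the same as in the real data: a small number of heavily shared addresses accounts for most of the names.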

In fact, from the total set of 16 million addresses, fewer than half, 7.1M (43.7%), of the addresses in the dataset had one name. On this ‘one’ point we must be additionally clear: we are unable to ascertain whether there was only one name and no others on those addresses, because there are many more domain names than those contained in .com, .org, .info, and .net -- there might very well be other names on those addresses.

In addition to having a number of domains on a single IP address, any IP address may change over time for any of those domains. Changing IP addresses periodically can help with the security, performance, and reliability of websites. One common example in use by many operations is load balancing. This means DNS queries may return different IP addresses over time, or in different places, for the same websites. This is a further, and separate, reason why blocking based on IP addresses will not serve its intended purpose over time.

Ultimately, there is no reliable way to know the number of domains on an IP address without inspecting all names in the DNS, from every location on the planet, at every moment in time -- an entirely infeasible proposition.

Any action on an IP address must, by the very definitions of the protocols that rule and empower the Internet, be expected to have collateral effects.
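A small sketch shows how those collateral effects arise. Everything here is hypothetical: the function name, the addresses, and the `.example` names are invented, and a real mapping would have to come from DNS data like the dataset described above.

```python
def collateral_domains(domains_by_ip, targets):
    """Block every IP that hosts a target name and report what else goes dark.

    Returns (blocked_ips, collateral), where collateral is the set of
    untargeted names that share an address with a target.
    """
    blocked_ips = {ip for ip, names in domains_by_ip.items() if names & targets}
    affected = set().union(*(domains_by_ip[ip] for ip in blocked_ips))
    return blocked_ips, affected - targets


# Hypothetical hosting map: one shared address, one dedicated address.
hosting = {
    "198.51.100.7": {"target.example", "bakery.example", "school.example"},
    "198.51.100.8": {"news.example"},
}
blocked, collateral = collateral_domains(hosting, {"target.example"})
print(sorted(blocked))     # addresses an IP block would cover
print(sorted(collateral))  # untargeted names blocked along with the target
```

The block is aimed at one name, but the unit of enforcement is the address, so every other name on that address is taken down with it.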

Lack of transparency with IP blocking

So if we have to expect that the blocking of an IP address will have collateral effects, and it’s generally agreed that it’s inappropriate or even legally impermissible to overblock by blocking IP addresses that have multiple domains on them, why does it still happen? That’s hard to know for sure, so we can only speculate. Sometimes it reflects a lack of technical understanding about the possible effects, particularly from entities like judges who are not technologists. Sometimes governments just ignore the collateral damage - as they do with Internet shutdowns - because they see the blocking as in their interest. And when there is collateral damage, it’s not usually obvious to the outside world, so there can be very little external pressure to have it addressed.

It’s worth stressing that point. When an IP is blocked, a user just sees a failed connection. They don’t know why the connection failed, or who caused it to fail. On the other side, the server acting on behalf of the website doesn’t even know it’s been blocked until it starts getting complaints about the fact that it is unavailable. There is virtually no transparency or accountability for the overblocking. And it can be challenging, if not impossible, for a website owner to challenge a block or seek redress for being inappropriately blocked.

Some governments, including Austria, do publish active block lists, which is an important step for transparency. But for all the reasons we’ve discussed, publishing an IP address does not reveal all the sites that may have been blocked unintentionally. And it doesn’t give those affected a means to challenge the overblocking. Again, in the physical world example, it’s hard to imagine a court order on a skyscraper that wouldn’t be posted on the door, but we often seem to jump over such due process and notice requirements in virtual space.

We think talking about the problematic consequences of IP blocking is more important than ever as an increasing number of countries push to block content online. Unfortunately, ISPs often use IP blocks to implement those requirements. It may be that the ISP is newer or less robust than larger counterparts, but larger ISPs engage in the practice, too, and understandably so because IP blocking takes the least effort and is readily available in most equipment.

And as more and more domains are included on the same number of IP addresses, the problem is only going to get worse.

Next steps

So what can we do?

We believe the first step is to improve transparency around the use of IP blocking. Although we’re not aware of any comprehensive way to document the collateral damage caused by IP blocking, we believe there are steps we can take to expand awareness of the practice. We are committed to working on new initiatives that highlight those insights, as we’ve done with the Cloudflare Radar Outage Center.

We also recognize that this is a whole Internet problem, and therefore has to be part of a broader effort. The significant likelihood that blocking by IP address will result in restricting access to a whole series of unrelated (and untargeted) domains should make it a non-starter for everyone. That’s why we’re engaging with civil society partners and like-minded companies to lend their voices to challenge the use of blocking IP addresses as a way of addressing content challenges and to point out collateral damage when they see it.

To be clear, to address the challenges of illegal content online, countries need legal mechanisms that enable the removal or restriction of content in a rights-respecting way. We believe that addressing the content at the source is almost always the best and the required first step. Laws like the EU’s new Digital Services Act or the Digital Millennium Copyright Act provide tools that can be used to address illegal content at the source, while respecting important due process principles. Governments should focus on building and applying legal mechanisms in ways that least affect other people’s rights, consistent with human rights expectations.

Very simply, these needs cannot be met by blocking IP addresses.

We’ll continue to look for new ways to talk about network activity and disruption, particularly when it results in unnecessary limitations on access. Check out Cloudflare Radar for more insights about what we see online.

New cities on the Cloudflare global network: March 2022 edition

Mike Conlow — Mon, 21 Mar 2022 12:59:02 GMT

If you follow the Cloudflare blog, you know that we love to add cities to our global map. With each new city we add, we help make the Internet faster, more reliable, and more secure. Today, we are announcing the addition of 18 new cities in Africa, South America, Asia, and the Middle East, bringing our network to over 270 cities globally. We’ll also look closely at how adding new cities improves Internet performance, such as our new locations in Israel, which reduced median response time (latency) from 86ms to 29ms (a 66% improvement) in a matter of weeks for subscribers of one Israeli Internet service provider (ISP).

The Cities

Without further ado, here are the 18 new cities in 10 countries we welcomed to our global network: Accra, Ghana; Almaty, Kazakhstan; Bhubaneshwar, India; Chiang Mai, Thailand; Joinville, Brazil; Erbil, Iraq; Fukuoka, Japan; Goiânia, Brazil; Haifa, Israel; Harare, Zimbabwe; Juazeiro do Norte, Brazil; Kanpur, India; Manaus, Brazil; Naha, Japan; Patna, India; São José do Rio Preto, Brazil; Tashkent, Uzbekistan; Uberlândia, Brazil.

Cloudflare’s ISP Edge Partnership Program

But let’s take a step back and understand why and how adding new cities to our list helps make the Internet better. First, we should reintroduce the Cloudflare Edge Partnership Program. Cloudflare is used as a reverse proxy by nearly 20% of all Internet properties, which means the volume of ISP traffic trying to reach us can be significant. In some cases, as we’ll see in Israel, the distance data needs to travel can also be significant, adding to latency and reducing Internet performance for the user. Our solution is partnering with ISPs to embed our servers inside their network. Not only does the ISP avoid lots of back haul traffic, but their subscribers also get much better performance because the website is served on-net, and close to them geographically. It is a win-win-win.

Consider a large Israeli ISP we did not peer with locally in Tel Aviv. Last year, if a subscriber wanted to reach a website on the Cloudflare network, their request had to travel on the Internet backbone – the large carriers that connect networks together on behalf of smaller ISPs – from Israel to Europe before reaching Cloudflare and going back. The map below shows where they were able to find Cloudflare content before our deployment went live: 48% in Frankfurt, 33% in London, and 18% in Amsterdam. That’s a long way!

In January and March 2022, we turned up deployments with the ISP in Tel Aviv and Haifa. Now live, these two locations serve practically all requests from their subscribers locally within Israel. Instead of traveling 3,000 km to reach one of the millions of websites on our network, most requests from Israel now travel 65 km, or less. The improvement has been dramatic: now we’re serving 66% of requests in under 50ms; before the deployment we couldn’t serve any in under 50ms because the distance was too great. Now, 85% are served in under 100ms; before, we served 66% of requests in under 100ms.
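The two kinds of numbers quoted here, a median and the share of requests under a threshold, are simple to compute from raw samples. The sketch below uses made-up latency samples chosen only so that the medians match the 86ms and 29ms mentioned earlier; the helper name is an assumption, not part of any Cloudflare tool.

```python
from statistics import median


def fraction_under(samples_ms, threshold_ms):
    """Share of samples strictly faster than the threshold."""
    return sum(1 for s in samples_ms if s < threshold_ms) / len(samples_ms)


# Made-up response times in milliseconds, for illustration only.
before = [70, 80, 86, 95, 130]
after = [20, 25, 29, 60, 120]

improvement = (median(before) - median(after)) / median(before)

print(median(before), median(after))  # medians before and after
print(f"{improvement:.0%}")           # relative improvement in the median
print(fraction_under(before, 50), fraction_under(after, 50))
```

Reporting both views is useful because a median says nothing about the tail: in the toy data the slowest request barely changes even though the median drops sharply.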

Logarithmic graph depicting the improvement in performance. The 50th percentile of requests decreased from almost 90ms to around 30ms.

As we continue to put dots on the map, we’ll keep putting updates here on how Internet performance is improving. As we like to say, we’re just getting started.

If you’re an ISP that is interested in hosting a Cloudflare cache to improve performance and reduce back haul, get in touch on our Edge Partnership Program page. And if you’re a software, data, or network engineer – or just the type of person who is curious and wants to help make the Internet better – consider joining our team.

Internet outage in Yemen amid airstrikes

João Tomé — Fri, 21 Jan 2022 12:20:22 GMT

The early hours of Friday, January 21, 2022, started in Yemen with a country-wide Internet outage. According to local and global news reports, airstrikes are happening in the country and the outage is likely related, as there are reports that a telecommunications building in Al-Hudaydah, where the FALCON undersea cable lands, was hit.

Cloudflare Radar shows that Internet traffic dropped close to zero between 21:30 and 22:00 UTC on January 20, 2022 (by 01:00 local time on January 21).

The outage affected the main state-owned ISP, Public Telecommunication Corporation (AS30873 in blue in the next chart), which represents almost all the Internet traffic in the country.

Looking at BGP (Border Gateway Protocol) updates from Yemen’s ASNs around the time of the outage, we see a clear spike at the same time the main ASN was affected, around 21:55 UTC on January 20, 2022. These update messages are BGP signalling that Yemen’s main ASN was no longer routable, something similar to what we saw happening in The Gambia and Kazakhstan but for very different reasons.

So far, 2022 has started with a few significant Internet disruptions for different reasons:

1. An Internet outage in The Gambia because of a cable problem.
2. An Internet shutdown in Kazakhstan because of unrest.
3. A mobile Internet shutdown in Burkina Faso because of a coup plot.
4. An Internet outage in Tonga because of a volcanic eruption (still ongoing).

You can keep an eye on Cloudflare Radar to monitor this situation as it unfolds.

Where is mobile traffic the most and least popular?

João Tomé — Sat, 09 Oct 2021 12:58:41 GMT

You’re having dinner. You look at the table next to you, noticing everyone is checking their phones, scrolling, browsing, and interacting with that little piece of hardware that puts everyone in touch with friends, family, work. Not to mention the giant public square of sorts that social media has become. That could happen in the car (hopefully with the passengers, never the driver), at home when you’re on the sofa, in bed or even when you’re commuting or just bored in line for the groceries.

Or perhaps you use your mobile phone as your only connection to the Internet. It might be your one means of communication and doing business. For many, the mobile Internet opened up access and opportunity that simply was not possible before.

Around the world the use of mobile Internet differs widely. In some countries mobile traffic dominates, in others desktop still reigns supreme.

Mobile Internet traffic has changed the way we relate to the online world — work (once, for some, done on desktop/laptop computers) is just one part of it — and Cloudflare Radar can help us get a better understanding of global Internet traffic but also access regional trends, and monitor emerging security threats. So let’s dig into the mobile traffic trends, starting with a kind of contest (the data reflected here is from the 30 days before October 4).

In this area of Cloudflare Radar, users can check the mobile traffic trends by country or worldwide (the case shown here) in the past 7 or 30 days. Worldwide, we can see that mobile wins over desktop traffic with 52%.

The country that has the greatest proportion of mobile Internet traffic is…

Cloudflare Radar has information on countries across the world, so we looked for the country with the highest proportion of mobile Internet traffic in the past month. And the answer is... Sudan, where 83% of Internet traffic comes from mobile devices (actually, it’s a tie with Yemen, which we talk about a little further below).

In many emerging economies (Sudan is one), a large percent of the population had its first contact with the Internet through a smartphone. In these countries it is normal not to have a computer and some even got their first bank account thanks to the mobile device.

How about Sudan’s neighbours? South Sudan follows that pattern and mobile traffic represents 74% of Internet use. The same in Chad (74%), Libya (75%), Egypt (68%), Eritrea (67%) and Ethiopia (58%). It’s a clear trend throughout Africa, especially in the central and eastern part of the continent, where mobile traffic wins in every country (for the past 30 days).

World map that shows (in yellow) the areas of the planet where most of the Internet traffic is done via mobile devices. Africa, the Middle East and Asia have the highest percentage of mobile traffic.

The Vatican goes for the desktop experience (but Italy loves mobile)

On the other hand, the country we found with the least mobile traffic in the past 30 days is... Vatican City, with only 13% (since the Vatican is literally inside Rome this might be an anomaly caused by mobile devices inside the Vatican connecting to Italian networks). Small countries like Seychelles (29%), Andorra (29%), Estonia (34%) and San Marino (36%) have the same pattern — also with a low mobile traffic percentage there’s Madagascar (27%), Haiti (34%) and Greenland (37%).

We can also see that the pattern inside Vatican City differs greatly from the pattern in Italy. Italy is one of the most mobile-friendly European countries — Italians seem to prefer mobile to desktop. About 57% of Internet traffic is via mobile devices. Italy is only matched, in Europe, by its neighbour Croatia — on the other side of the Adriatic Sea — that in the past month has had 58% mobile traffic.

European countries have differing mobile preferences

While we’re talking about Italy and Croatia, let’s dig a bit more into Europe. Only six countries have more mobile than desktop (laptops included) traffic. Besides Italy and Croatia, there’s Romania (54%), Slovakia (52%) and Greece (51%) — all more to the east of Europe.

At the end of this mobile ranking we have one of the most digitally advanced countries in the world: Estonia (a truly digital society, according to Wired). The small country only has 34% of mobile traffic. Other countries in the north of Europe like Denmark (38%) and Finland (39%) follow the same trend.

Spain (47%), France (48%) and Ireland (49%) are getting close to being mobile-first countries. The UK (50%) has the same trend as its neighbours — Russia is actually in the same ‘neighbourhood’ (with 49%). On the other hand, Portugal (42%), Netherlands (43%) or Germany (44%) are still a little far.

How about the Americas?

Peru seems to be the country in the Americas with the least mobile use (36%), comparable only with Canada (38%). Cuba is the country with the most mobile use (70%).

Peru (36%) and Canada (38%) are among the countries in the Americas with the least mobile use in the past 30 days.

Then there’s Brazil (50%), Mexico (52%) — Chile is not far, with 48% of mobile use. Cuba takes the crown, with 70%, followed by the Dominican Republic (56%), Puerto Rico (51%) and Jamaica (51%), all Caribbean countries. The exception is Haiti, the least mobile of the continent, with 34% of mobile use.

Let’s go to the Middle East: the champion of mobile traffic

Most Internet traffic in Yemen comes from mobile devices, as this Radar chart of the previous 30 days shows.

In this part of our planet there are no doubts whatsoever: mobile traffic rules completely. On the top of the list is Yemen, with the same 83% of mobile traffic as Sudan (that we talked about before). But Syria is actually a close second, with 82%.

Iran (71%), Iraq (70%), Pakistan (70%) and Egypt (69%) show the same trend. The exception, here, is the United Arab Emirates, with 44% of mobile traffic, and also Israel (45%). Nearby, Saudi Arabia (the country with the highest GDP in the region) is at 55%.

A (mobile) giant called India

Of the top 10 most populated countries, the clear winner of our mobile ranking is, without any doubt, India, with 80% mobile use. The country of 1.3 billion people surpasses the biggest country on the planet, China (1.4 billion live there), with 65% mobile.

Also in Asia, the fourth-biggest country in the world (after the US), Indonesia, has 68% of traffic by mobile devices. The same trend of mobile-first is followed by Thailand (65%), Vietnam (64%), Malaysia (64%), South Korea (56%), Japan (56%) and the Philippines (51%).

Down under, Australia is desktop first (37% mobile traffic), just like its neighbour New Zealand (38%).

Just as a curiosity, Vanuatu, the South Pacific Ocean nation (population of 307,150), ranked some years as the happiest nation on the planet (by the Happy Planet Index) has 52% of mobile traffic. The current number one in that same index, Costa Rica, is at 50%.

Conclusion

Mobile devices are here to stay and have already become a bridge that helps bring more humans to the vast opportunities that the Internet offers. Of the top 15 countries with the most mobile Internet traffic, there’s just one that is in the top 15 in terms of GDP: India.

As we already showed, there is a world of trends and even human habits (differing from country to country) to discover on our Cloudflare Radar platform. It’s all a matter of asking a question that could be reflected in our data and searching for the answers.

Project Myriagon: Cloudflare Passes 10,000 Connected Networks

Ticiane Takami — Sun, 19 Sep 2021 12:59:09 GMT

During Speed Week, we’ve talked a lot about the products we’ve improved and the places we’ve expanded to. Today, we have a final exciting announcement: Cloudflare now connects with more than 10,000 other networks. Put another way, over 10,000 networks have direct on-ramps to the Cloudflare network.

This is the culmination of a special project we’ve been working on for the last few months dubbed Project Myriagon, a reference to the 10,000-sided polygon of the same name. In going about this project, we have learned a lot about the performance impact of adding more direct connections to our network — in one recent case, we saw a 90% reduction in median round-trip end-user latency.

But to really explain why this is such a big milestone, we first need to explain a bit about how the Internet works.

More roads leading to Rome

The Internet that we all know and rely on is, on a basic level, an interconnected series of independently run local networks. Each network is defined as its own “autonomous system.” These networks are delineated numerically with Autonomous System Numbers, or ASNs. An ASN is like the Internet version of a zip code, a short number directly mapping to a distinct region of IP space using a clearly defined methodology. Network interconnection is all about bringing together different ASNs to exponentially multiply the number of possible paths between source and destination.

Most of us have home networks behind a modem and router, connecting your individual miniature network to your ISP. Your ISP then connects with other networks, to fetch the web pages or other Internet traffic you request. These networks in turn have connections to different networks, who in turn connect to interconnected networks, and so on, until your data reaches its destination. The fewer networks your request has to traverse, generally, the lower the end-to-end latency and odds that something will get lost along the way.

The average number of hops between any one network on the Internet to any other network is around 5.7 and 4.7, for the IPv4 and IPv6 networks respectively.

Source: https://blog.apnic.net/2020/01/14/bgp-in-2019-the-bgp-table/
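To see why fewer intermediate networks matter, here is a toy hop count over a made-up network graph. It is a deliberate simplification: real BGP path selection follows routing policy rather than pure shortest path, and the network names and helper functions below are invented for illustration.

```python
from collections import deque


def undirected(links):
    """Build an adjacency map from a list of (network, network) links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def as_hops(adjacency, source, destination):
    """Breadth-first search: fewest network-to-network hops, or None."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == destination:
            return hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None


# Hypothetical topology: an ISP reaches a content network via two transits.
links = [("isp", "transit_a"), ("transit_a", "transit_b"), ("transit_b", "content")]
hops_before = as_hops(undirected(links), "isp", "content")

# The same topology after the ISP and the content network connect directly.
hops_after = as_hops(undirected(links + [("isp", "content")]), "isp", "content")

print(hops_before, hops_after)
```

A single new link removes every intermediate network from the path, which is the effect direct interconnection is meant to have.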

How do ASNs work?

ASNs are a key part of the routing protocol that directs traffic along the Internet, BGP. Internet Assigned Numbers Authority (IANA), the global coordinator of the DNS Root, IP addressing, and other Internet protocol resources like AS Numbers, delegates ASN-making authority to Regional Internet Registries (RIRs), who in turn assign individual ASNs to network operators in line with their regional policies. The five RIRs are AFRINIC, APNIC, ARIN, LACNIC and RIPE, each entitled to assign and attribute ASNs in their respective appointed regions.

Cloudflare’s ASN is 13335, one of the approximately 70,000 ASNs advertised on the Internet. While we’d like to — and plan on — connecting to every one of these ASNs eventually, our team tries to prioritize those with the greatest impact on our overall breadth and improving the proximity to as many people on Earth as possible.

As enabling optimal routes is key to our core business and services, we continuously track how many ASNs we connect to (technically referred to as “adjacent networks”). With Project Myriagon, we aimed to speed up our rate of interconnection and pass 10,000 adjacent networks by the end of the year. By September 2021, we reached that milestone, bringing us from 8,300 at the start of 2020 to over 10,000 today.

As shown in the table below, that milestone is part of a continuous effort towards gradually hitting more of the total advertised ASNs on the Internet.

The Regional Internet Registries and their Regions

Table 1: Cloudflare's peer ASNs and their respective RIR

Given that there are 70,000+ ASNs out there, you might be wondering: why is 10,000 a big deal? To understand this, we need to look deeply at BGP, the protocol that glues the Internet together. There are three different classes of ASNs:

Transit Only ASNs: these networks only provide connectivity to other networks. They don't have any IP addresses inside their networks. These networks are quite rare, as it's very unusual to not have any IP addresses inside your network. Instead, these networks are often used primarily for distinct management purposes within a single organization.
Origin Only ASNs: these are networks that do not provide connectivity to other networks. They are a stub network, and often, like your home network, only connected to a single ISP.
Mixed ASNs: these networks both have IP addresses inside their network, and provide connectivity to other networks.

Origin Only ASNs	Mixed ASNs	Transit Only ASNs
61,127	11,128	443

Source: https://bgp.potaroo.net/as6447/
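One way to derive these three classes from routing data is to look at where each ASN appears in observed AS paths: the last ASN in a path originates the prefix, and any ASN before it is carrying traffic for someone else. The sketch below assumes that interpretation; the function name is invented, and the paths use ASNs from the range reserved for documentation rather than real measurements.

```python
def classify_asns(as_paths):
    """Split ASNs into origin-only, mixed, and transit-only sets.

    Each path is a list of ASNs ending with the network that
    originates the prefix.
    """
    origins, transits = set(), set()
    for path in as_paths:
        # Collapse prepending (the same ASN repeated back to back).
        hops = [asn for i, asn in enumerate(path) if i == 0 or asn != path[i - 1]]
        if not hops:
            continue
        origins.add(hops[-1])
        transits.update(hops[:-1])
    return {
        "origin_only": origins - transits,
        "mixed": origins & transits,
        "transit_only": transits - origins,
    }


# Hypothetical AS paths using documentation ASNs (64496-64511).
paths = [
    [64500, 64501, 64502],
    [64500, 64501],
    [64500, 64501, 64501, 64503],  # 64501 prepends itself once
]
classes = classify_asns(paths)
print(classes)
```

In the toy data, 64501 both originates a prefix and carries traffic for 64502 and 64503, so it is mixed, while 64500 never appears at the end of a path and is transit only.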

One interesting fact: of the 61,127 origin-only ASNs, nearly 43,000 are connected only to their ISP. As such, our direct connections to over 10,000 networks indicate that, of the networks that connect to more than one network, a very good percentage are now already connected to Cloudflare.

Cutting out the middle man

Directly connecting to a network — and eliminating the hops in between — can greatly improve performance in two ways. First, connecting with a network directly allows for Internet traffic to be exchanged locally rather than detouring through remote cities; and secondly, direct connections help avoid the congestion caused by bottlenecks that sometimes happen between networks.

To take a recent real-world example, turning up a direct peering session with a European network caused a 90% improvement in median end-user latency, from 76ms to 7ms.

Immediate 90% improvement in median end-user latency after peering with a new network.

By using our own on-ramps to other networks, we both ensure superior performance for our users and avoid adding load and causing congestion on the Internet at large.

And AS13335 is just getting started

Cloudflare is an anycast network, meaning that the better connected we are, the faster and better-protected we are, obviating legacy concepts like scrubbing centers and slow origins. Hitting five digits of connected networks is something we’re really proud of as a step toward our goal of helping to build a better Internet. As we’ve mentioned throughout the week, we’re all about high speed without having to pay a security or reliability cost.

There’s still work to do! While Project Myriagon has brought us, we believe, to be one of the top 5 most connected networks in the world, we estimate Google is connected to 12,000-15,000 networks. And so, today, we are kicking off Project CatchG. We won’t rest until we’re #1.

Interested in peering with us to help build a better Internet? Reach out to peering@cloudflare.com with your request. More details on the locations we are present at can be found at http://as13335.peeringdb.com/.

Improve site load times and SEO with one-click support for Signed Exchanges on Google Search

Marc Lamik — Tue, 14 Sep 2021 12:59:06 GMT

We’re excited to announce that, starting today, Cloudflare customers will be able to generate Signed Exchanges (SXG) for Google Search with just one click. Signed Exchanges is an open web platform specification Google developed as a way of verifying a cached version of a website — enabling massively faster delivery of a website from a third party, such as Google itself from its search results page, or from a news aggregator that is linking out to other sites.

The advantage to you as a website owner? Not only will your site load faster when linked to from a site supporting SXG, but because many search engines use page load times in order to determine search results, you should see a very nice boost in SEO.

What are signed exchanges, and how do they work?

Introduced by Google, a Signed Exchange (SXG) is an open standard delivery mechanism that makes it possible to authenticate the origin of a resource, independent of how it was delivered. This decoupling advances a variety of use cases, such as prefetching, offline Internet experiences, and serving from third-party caches. It does so in a secure and privacy-preserving manner.

Now, imagine yourself as the ruler of your kingdom with an important message to deliver to all your subjects. You have too many people to reach, so you can’t do it alone. You decide to enlist your trusty knights to ride out with large chests filled with copies of your message. There are villains everywhere that would love to take these messages and modify them to serve their own nefarious machinations.

You, being the wise ruler you are, have a crafty plan: you have a very special stamp made that can imprint a seal that everyone can recognize, yet no one can recreate. With this wondrous seal, no one can tamper with the messages without breaking the seal and proving the forgery for all to see. Now, your knights can bring these chests to all corners of the kingdom and hand out the messages to the masses, and your subjects can trust that the message came from you. There is a side benefit for your people, too. They can come whenever they want to pick up the message without your watchful eye, so they’re more inclined to read it at their leisure.

Maybe this is stretching the analogy a bit, but in the case of Signed Exchanges, a cryptographic signature on a digest of the response and headers acts as the tamper-proof seal for the message. Fast forwarding our example to the present day: you want to get your newest web experience out to global distribution with the understanding that just about everyone will come through a search engine or aggregator site. Ahead of time, when you publish your content, the search engine crawls your site for content, but instead of delivering the raw content, you negotiate the delivery of the signed exchange. (This is accomplished simply through additional “Accept: application/signed-exchange;v=” request headers from the crawler that announce the preference for signed exchanges.)

Then Cloudflare generates the Signed Exchange, using the following process:

Cloudflare fetches the original content that you want to sign, including the response headers.
An additional Digest header is added that uses Merkle Integrity Content Encoding to support the progressive detection of data modification/corruption.
We also strip out headers that don’t make sense within the context of Signed Exchanges (like Connection, Keep-Alive, etc.) as well as security sensitive headers (Set-Cookie, Authentication-Info, etc.).
Then these headers, including the digest, along with additional metadata, like request URL, URL of the certificate, hash of the certificate, expiration time, etc., are all chained together into a stream that is used to calculate the final signature.
The original content, along with the headers, signature, and a fallback URL are then packed into a final binary for delivery.
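
The steps above can be sketched roughly as follows. This is a simplified illustration, not Cloudflare's implementation and not the real SXG wire format: a plain SHA-256 digest stands in for Merkle Integrity Content Encoding, the stripped-header set contains only the headers named above, and the names `build_signing_input` and `STRIPPED_HEADERS`, along with the metadata layout, are hypothetical. The signing itself and the final binary packaging are left out.

```python
import base64
import hashlib

# Headers named in the text as stripped before signing. The real list is
# longer; this set is illustrative only.
STRIPPED_HEADERS = {"connection", "keep-alive", "set-cookie", "authentication-info"}


def build_signing_input(body: bytes, headers: dict[str, str], metadata: dict[str, str]) -> bytes:
    """Return the byte stream a signature would be computed over (simplified)."""
    # Drop headers that do not make sense inside a signed exchange.
    kept = {k.lower(): v for k, v in headers.items() if k.lower() not in STRIPPED_HEADERS}

    # Add a Digest header. ASSUMPTION: plain SHA-256 stands in for the
    # Merkle Integrity Content Encoding that real Signed Exchanges use.
    digest = base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")
    kept["digest"] = f"sha-256={digest}"

    # Chain metadata and headers together into one deterministic stream.
    parts = [f"{k}={v}" for k, v in sorted(metadata.items())]
    parts += [f"{k}: {v}" for k, v in sorted(kept.items())]
    return "\n".join(parts).encode("utf-8")


# Hypothetical response and metadata, for illustration only.
signing_input = build_signing_input(
    body=b"<html>hello</html>",
    headers={"Content-Type": "text/html", "Set-Cookie": "id=1", "Connection": "keep-alive"},
    metadata={"request-url": "https://example.com/", "expires": "1700000000"},
)
print(signing_input.decode("utf-8"))
```

Sorting the entries keeps the stream deterministic, so the same response always produces the same bytes to sign; a real implementation follows the ordering and encoding rules of the specification instead.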

This Signed Exchange is then cached and sent to the crawler, which also stores the Signed Exchange. After indexing the content, it can now show up in searches. The user then discovers the link to your content in the search results. The search engine also preloads the signed exchange for your content in the background in the meantime, effectively pre-filling the cache in the client’s browser. This exchange was delivered from the search engine, so no signal has gone to the origin yet. Thus, the search intent of the user isn’t leaked to the origin. Since the exchange is signed and validated against your certificate, the browser trusts the contents and can display the content with attribution to the original URL. Now, when the user clicks on the link to view the contents, it magically loads instantaneously from the local cache.

There are many resources on the web available that go into detail about the specific format of Signed Exchanges, so we won’t rehash them here in detail. But one important aspect that isn’t obvious at first glance is the complexity of managing the signing process itself. The many details involve:

The inclusion of the atypical CanSignHttpExchanges extension to your certificate.
The requirement to deliver your certificates in a specific CBOR (like binary JSON) format.
The requirement for OCSP stapling to ensure the validity of the certificates.
Renewals of these certificates on a more frequent basis (i.e. requires automation).
Caching of the generated signed exchanges, since they can be expensive to generate.

Luckily, all of these are in Cloudflare’s wheelhouse, since we already have deep expertise in Certificate Management and TLS delivery infrastructure. By partnering with Google on the Signed Exchange implementation, we can ensure a consistent implementation while making integration as simple as the single push of a button.

“Signed Exchanges make the web faster and a better user experience for users, by enabling cross-site prefetching. Site owners have seen clear improvement to Largest Contentful Paint, one of the Core Web Vitals, as well as increased user stickiness. Cloudflare now makes it simple for sites to implement Signed Exchanges and derive these benefits.” — Jeff Jose, Product Manager, Google

Bigger than search alone

The broader implication of SXGs is that they make content portable: content delivered via an SXG can be easily distributed by third parties while maintaining full assurance and attribution of its origin. Historically, the only way for a site to use a third party to distribute its content while maintaining attribution has been for the site to share its SSL certificates with the distributor. This has security drawbacks. Moreover, it is a far stretch from making content truly portable.

In the long-term, truly portable content can be used to achieve use cases like fully offline experiences. In the immediate term, the primary use case of SXGs is the delivery of faster user experiences by providing content in an easily cacheable format. Specifically, Google Search will cache and sometimes prefetch SXGs. For sites that receive a large portion of their traffic from Google Search, SXGs can be an important tool for delivering faster page loads to users.

It’s also possible that all sites could eventually support this standard. Every time a site is loaded, all the linked articles could be pre-loaded. Web speeds across the board would be dramatically increased. Matthew’s blog post talks more about this possibility.

Sign up today

Automatic Signed Exchanges will be free for all Cloudflare Pro, Business and Enterprise customers as well as for customers using our Advanced Platform Optimization product.

Sign up for the Automatic Signed Exchange beta waitlist today and after being approved, activating is only one flip of a switch.

To sign up for the waitlist go to the Speed page on the Cloudflare dashboard and click on “Join Waitlist” on the Automatic Signed Exchanges (SXGs) card.

We’ll take care of the rest.

Watch on Cloudflare TV

Cloudflare Passes 250 Cities, Triples External Network Capacity, 8x-es Backbone

Jon Rolfe — Mon, 13 Sep 2021 12:59:02 GMT

It feels like just the other week that we announced ten new cities and our expansion to 25+ cities in Brazil — probably because it was. Today, I have three speedy infrastructure updates: we’ve passed 250 on-network cities, more than tripled our external network capacity, and increased our long-haul internal backbone network by over 800% since the start of 2020.

Light only travels through fiber so fast and with so much bandwidth — and worse still over the copper or on mobile networks that make up most end-users’ connections to the Internet. At some point, there’s only so much software you can throw at the problem before you run into the fundamental problem that an edge network solves: if you want your users to see incredible performance, you have to have servers incredibly physically close. For example, over the past three months, we’ve added another 10 cities in Brazil. Here’s how that lowered the connection time to Cloudflare. The red line shows the latency prior to the expansion, the blue shows after.

We’re exceptionally proud of all the teams at Cloudflare that came together to raise the bar for the entire industry in terms of global performance despite border closures, semiconductor shortages, and a sudden shift to working from home. 95% of the entire Internet-connected world is now within 50 ms of a Cloudflare presence, and 80% of the entire Internet-connected world is within 20ms (for reference, it takes 300-400 ms for a human to blink):

Today, when we ask ourselves what it means to have a fast website, it means having a server less than 0.05 seconds away from your user, no matter where on Earth they are. This is only possible by adding new cities, partners, capacity, and cables — so let’s talk about those.

New Cities

Cutting straight to the point, let’s start with cities and countries: in the last two-ish months, we’ve added another 17 cities (outside of mainland China) split across eight countries: Guayaquil, Ecuador; Dammam, Saudi Arabia; Algiers, Algeria; Surat Thani, Thailand; Hagåtña, Guam, United States; Krasnoyarsk, Russia; Cagayan, Philippines; and ten cities in Brazil: Caçador, Ribeirão Preto, Brasília, Florianópolis, Sorocaba, Itajaí, Belém, Americana, Blumenau, and Belo Horizonte.

Meanwhile, with our partner, JD Cloud and AI, we’re up to 37 cities in mainland China: Anqing and Huainan, Anhui; Beijing, Beijing; Fuzhou and Quanzhou, Fujian; Lanzhou, Gansu; Foshan, Guangzhou, and Maoming, Guangdong; Guiyang, Guizhou; Chengmai and Haikou, Hainan; Langfang and Qinhuangdao, Hebei; Zhengzhou, Henan; Shiyan and Yichang, Hubei; Changde and Yiyang, Hunan; Hohhot, Inner Mongolia; Changzhou, Suqian, and Wuxi, Jiangsu; Nanchang and Xinyu, Jiangxi; Dalian and Shenyang, Liaoning; Xining, Qinghai; Baoji and Xianyang, Shaanxi; Jinan and Qingdao, Shandong; Shanghai, Shanghai; Chengdu, Sichuan; Jinhua, Quzhou, and Taizhou, Zhejiang. These are subject to change: as we ramp up, we have been working with JD Cloud to “trial” cities for a few weeks or months to observe performance and tweak the cities to match.

More Capacity: What and Why?

In addition to all these new cities, we’re also proud to announce that we have seen a 3.5x increase in external network capacity from the start of 2020 to now. This is just as key to our network strategy as new cities: it wouldn’t matter if we were in every city on Earth if we weren’t interconnected with other networks. Last-mile ISPs will sometimes still “trombone” their traffic, but in general, end users will get faster Internet as we interconnect more.

This interconnection is spread far and wide, both to user networks and those of website hosts and other major cloud networks. This has involved a lot of middleman-removal: rather than run fiber optics from our routers through a third-party network to an origin or user’s network, we’re running more and more Private Network Interconnects (PNIs) and, better yet, Cloudflare Network Interconnects (CNIs) to our customers.

These PNIs and CNIs can not only reduce egress costs for our customers (particularly with our Bandwidth Alliance partners) but also increase the speed, reliability, and privacy of connections. The fewer networks and less distance your Internet traffic flows through, the better off everyone is. To put some numbers on that, only 30% of this expanded capacity was transit, leaving 70% flowing directly either physically over PNIs/CNIs or logically over peering sessions at Internet exchange points.

The Backbone

At the same time as this increase in external capacity, we’ve quietly been adding hundreds of new segments to our backbone. Our backbone consists of dedicated fiber optic lines and reserved portions of wavelength that connect Cloudflare data centers together. This is split approximately 55/45 between “metro” capacity, which redundantly connects data centers within a city in which we have a presence, and “long-haul” capacity, which connects Cloudflare data centers in different cities.

The backbone is used to increase the speed of our customer traffic, e.g., for Argo Smart Routing, Argo Tiered Caching, and WARP+. Our backbone is like a private highway connecting cities, while public Internet routing is like local roads: not only does the backbone directly connect two cities, but it’s reliably faster and sees fewer issues. We’ll dive into some benchmarks of the speed improvements of the backbone in a more comprehensive future blog post.

The backbone is also more secure. While Cloudflare signs all of its BGP routes with RPKI, pushes adjacent networks to use RPKI to avoid route hijacks, and encrypts external and internal traffic, the most secure and private way to safeguard our users’ traffic is to keep it on-network as much as possible.

Internal load balancing between cities has also been greatly improved, thanks to the use of the backbone for traffic management with a technology we call Plurimog (a reference to our in-colo Layer 4 load balancer, Unimog). A surge of traffic into Portland can be shifted instantaneously over diverse links to Seattle, Denver, or San Jose with a single hop, without waiting for changes to propagate over anycast or running the risk of an interim increase in errors.

From an expansion perspective, two key areas of focus have been our undersea North America to Europe (transatlantic) and Asia to North America (transpacific) backbone rings. These links use geographically diverse subsea cable systems and connect into diverse routers and data centers on both ends: four transatlantic cables from North America to Europe, three transamerican cables connecting South and North America, and three transpacific cables connecting Asia and North America. User traffic coming from Los Angeles could travel to an origin as far west as Singapore or as far east as Moscow without leaving our network.

This rate of growth has been enabled by improved traffic forecast modeling, rapid internal feedback loops on link utilization, and more broadly by growing our teams and partnerships. We are creating a global view of capacity, pricing, and desirability of backbone links in the same way that we have for transit and peering. The result is a backbone that doubled in long-haul capacity this year, increased more than 800% from the start of last year, and will continue to expand to intelligently crisscross the globe.

The backbone has taken on a huge amount of traffic that would otherwise go over external transit and peering connections, freeing up capacity for when it is explicitly needed (last-hop routes, failover, etc.) and avoiding any outages on other major global networks (e.g., CenturyLink, Verizon).

In Conclusion

A map of the world highlighting all 250+ cities in which Cloudflare is deployed.

More cities, capacity, and backbone are further steps in going from being the most global network on Earth to the most local one as well. We believe in providing security, privacy, and reliability for all, not just those who have the money to pay for something we consider fundamental Internet rights. We have seen the investment into our network pay huge dividends this past year.

Happy Speed Week!

Do you want to work on the future of a globally local network? Are you passionate about edge networks? Do you thrive in an exciting, rapid-growth environment? If so, good news: Cloudflare Infrastructure is hiring; check our open roles here!

Alternatively — if you work at an ISP we aren’t already deployed with and want to bring this level of speed and control to your users, we’re here to make that happen. Please reach out to our Edge Partnerships team at epp@cloudflare.com.

Cloudflare’s Network Doubles CPU Capacity and Expands Into Ten New Cities in Four New Countries

Jon Rolfe — Wed, 30 Jun 2021 12:59:45 GMT

Cloudflare’s global network is always expanding, and 2021 has been no exception. Today, I’m happy to give a mid-year update: we've added ten new Cloudflare cities, with four new countries represented among them. And we've doubled our computational footprint since the start of pandemic-related lockdowns.

No matter what else we do at Cloudflare, constant expansion of our infrastructure to new places is a requirement to help build a better Internet. 2021, like 2020, has been a difficult time to be a global network — from semiconductor shortages to supply-chain disruptions — but regardless, we have continued to expand throughout the entire globe, experimenting with technologies like ARM, ASICs, and Nvidia all the way.

The Cities

Without further ado, here are the new Cloudflare cities: Tbilisi, Georgia; San José, Costa Rica; Tunis, Tunisia; Yangon, Myanmar; Nairobi, Kenya; Jashore, Bangladesh; Canberra, Australia; Palermo, Italy; and Salvador and Campinas, Brazil.

These deployments are spread across every continent except Antarctica.

We’ve solidified our presence in every country of the Caucasus with our first deployment in the country of Georgia in the capital city of Tbilisi. And on the other side of the world, we’ve also established our first deployment in Costa Rica’s capital of San José with NIC.CR, run by the Academia Nacional de Ciencias.

In the northernmost country in Africa comes another capital city deployment, this time in Tunis, bringing us one country closer towards fully circling the Mediterranean Sea. Wrapping up the new country docket is our first city in Myanmar with our presence in Yangon, the country’s former capital and largest city.

Our second Kenyan city is the country’s capital, Nairobi, bringing our city count in sub-Saharan Africa to a total of fifteen. In Bangladesh, Jashore puts us in the capital of its namesake Jashore District and the third largest city in the country after Chittagong and Dhaka, both already Cloudflare cities.

In the land way down under, our Canberra deployment puts us in Australia’s capital city, located, unsurprisingly, in the Australian Capital Territory. In differently warm lands is Palermo, Italy, capital of the island of Sicily, which we already see boosting performance throughout Italy.

25th percentile latency of non-bot traffic in Italy, year-to-date.

Finally, we’ve gone live in Salvador (capital of the state of Bahia) and Campinas, Brazil, the only city announced today that isn’t a capital. These are some of the first few steps in a larger Brazilian expansion — watch this blog for more news on that soon.

This is in addition to the dozens of new cities we’ve added in Mainland China with our partner, JD Cloud, with whom we have been working closely to quickly deploy and test new cities since last year.

The Impact

While we’re proud of our provisioning process, the work with new cities begins, not ends, with deployment. Each city is not only a new source of opportunity, but also of risk: Internet routing is fickle, and things that should improve network quality don’t always do so. While we have always had a slew of ways to track performance, we’ve found a significant, constant improvement in the 25th percentile latency of non-bot traffic to be an ideal approximation of latency impacted by only physical distance.
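
As a rough illustration of the metric, the 25th percentile can be computed from a set of latency samples as shown here. The sample values and the helper name `p25` are made up for this sketch; they are not Cloudflare data or tooling.

```python
import statistics


def p25(latencies_ms: list[float]) -> float:
    """Return the 25th percentile (first quartile) of latency samples."""
    # quantiles(n=4) returns the three quartile cut points; take the first.
    return statistics.quantiles(latencies_ms, n=4, method="inclusive")[0]


# Hypothetical samples (milliseconds) before and after a new city goes live.
# The slow tail barely moves, but the fastest quarter of requests improves.
before = [40, 42, 45, 60, 90, 150, 300]
after = [8, 9, 10, 25, 60, 140, 310]

print(p25(before), p25(after))
```

A low percentile largely ignores the slow tail caused by congestion or poor last-mile connections, which is why it tracks the effect of physical distance more closely than a mean would.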

Using this metric, we can quickly see the improvement that comes from adding new cities. For example, in Kenya, we can see that the addition of our Nairobi presence improved real user performance:

25th percentile latency of non-bot Kenyan traffic, before and after Nairobi gained a Cloudflare point of presence.

Latency variations in general are expected on the Internet — particularly in countries with high amounts of Internet traffic originating from non-fixed connections, like mobile phones — but in aggregate, the more consistently low latency, the better. From this chart, you can clearly see that not only was there a reduction in latency, but also that there were fewer frustrating variations in user latency. We all get annoyed when a page loads quickly one second and slowly the next, and the lower jitter that comes with being closer to the server helps to eliminate it.

As a reminder, while these measurements are in thousandths of a second, they add up quickly. Popular sites often require hundreds of individual requests for assets, some of which are initiated serially, so the difference between 25 milliseconds and 5 milliseconds can mean the difference between single and multi-second page load times.
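
To make that arithmetic concrete, here is a small sketch. The figure of 100 serially dependent round trips is an assumption chosen for illustration, and the calculation counts only network wait time, ignoring server processing and parallel requests.

```python
def serial_wait_seconds(serial_round_trips: int, latency_ms: float) -> float:
    """Time spent waiting on the network for requests issued one after another."""
    return serial_round_trips * latency_ms / 1000


# ASSUMPTION: 100 round trips that each depend on the previous one.
for latency_ms in (25, 5):
    total = serial_wait_seconds(100, latency_ms)
    print(f"{latency_ms} ms per round trip x 100 = {total:.1f} s")
```

Under that assumption, 25 ms per round trip adds up to 2.5 seconds of waiting, while 5 ms adds up to half a second.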

To sum things up, users in or around these cities should expect an improved Internet experience when using everything from our free, private 1.1.1.1 DNS resolver to the tens of millions of Internet properties that trust Cloudflare with their traffic. We have dozens more cities in the works at any given time, including now. Watch this space for more!

Join Our Team

Like our network, Cloudflare continues to grow rapidly. If working at a rapidly expanding, globally diverse company interests you, we’re hiring for scores of positions, including in the Infrastructure group. Or, if you work at a global ISP and would like to improve your users’ experience and be part of building a better Internet, get in touch with our Edge Partners Program at epp@cloudflare.com and we’ll look into sending some servers your way!

Sudan's exam-related Internet shutdowns

John Graham-Cumming — Tue, 22 Jun 2021 10:19:00 GMT

To prevent cheating in exams many countries restrict or even shut down Internet access during critical exam hours. I wrote two weeks ago about Syria having planned Internet shutdowns during June, for exams.

Sudan is doing the same thing and has had four shutdowns so far. Here's the Internet traffic pattern for Sudan over the last seven days. I've circled the shutdowns on Saturday, Sunday, Monday and Tuesday (today, June 22, 2021).

Cloudflare Radar allows anyone to track Internet traffic patterns around the world, and it has country-specific pages. The chart for the last seven days (shown above) came from the dedicated page for Sudan.

The Internet outages start at 0600 UTC (0800 local time) and end three hours later at 0900 UTC (1100 local time). This corresponds to the timings announced by the Sudanese Ministry of Education.

Further shutdowns are likely in Sudan on June 24, 26, 27, 29 and 30 (thanks to Twitter user _adonese for his assistance). Looking deeper into the data, the largest drop in use is for mobile Internet access in Sudan (the message above talks about mobile Internet use being restricted) while some non-mobile access appears to continue.

That can be seen by looking at the traffic mix from Sudan. During the exam times mobile use drops (as a percentage of traffic) and desktop use increases. This chart also shows how popular mobile Internet access is in Sudan: it's typically above 75% of traffic (compare with, for example, the US).

If you want to follow the other outages for the remaining five exams, you can see live data on the Cloudflare Radar Sudan page.

Internet traffic disruption caused by the Christmas Day bombing in Nashville

John Graham-Cumming — Wed, 06 Jan 2021 17:05:44 GMT

On December 25th, 2020, an apparent suicide bomb exploded in Nashville, TN. The explosion happened outside an AT&T network building on Second Avenue in Nashville at 1230 UTC. Damage to the AT&T building and its power supply and generators quickly caused an outage for telephone and Internet service for local people. These outages continued for two days.

Looking at traffic flow data for AT&T in the Nashville area to Cloudflare we can see that services continued operating (on battery power according to reports) for over five hours after the explosion, but at 1748 UTC we saw a dramatic drop in traffic. 1748 UTC is close to noon in Nashville when reports indicate that people lost phone and Internet service.

We saw traffic from Nashville via AT&T start to recover over a 45 minute period on December 27 at 1822 UTC making the total outage 2 days and 34 minutes.
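
The durations quoted here can be checked from the timestamps in this post. The variable names are only for this sketch.

```python
from datetime import datetime, timedelta, timezone

# Timestamps as given in the text, all in UTC.
explosion = datetime(2020, 12, 25, 12, 30, tzinfo=timezone.utc)
traffic_drop = datetime(2020, 12, 25, 17, 48, tzinfo=timezone.utc)
recovery_start = datetime(2020, 12, 27, 18, 22, tzinfo=timezone.utc)

time_before_drop = traffic_drop - explosion  # services kept running this long
outage = recovery_start - traffic_drop       # total outage

print(time_before_drop)
print(outage == timedelta(days=2, minutes=34))
```

The gap between the explosion and the traffic drop comes to 5 hours 18 minutes, matching the "over five hours" above.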

Traffic flows continue to be normal and no further disruption has been seen.

Cloudflare Network Interconnection partnerships launch

Steven Pack — Tue, 04 Aug 2020 13:00:00 GMT

Today we’re excited to announce Cloudflare’s Network Interconnection Partner Program, in support of our new CNI product. As ever more enterprises turn to Cloudflare to secure and accelerate their branch and core networks, the ability to connect privately and securely becomes increasingly important. Today's announcement significantly increases the interconnection options for our customers, allowing them to connect with us in the location of their choice using the method or vendors they prefer.

In addition to our physical locations, our customers can now interconnect with us at any of 23 metro areas across five continents using software-defined layer 2 networking technology. Following the recent release of CNI (which includes PNI support for Magic Transit), customers can now order layer 3 DDoS protection in any of the markets below, without requiring physical cross connects, providing private and secure links, with simpler setup.

Launch Partners

We’re very excited to announce that five of the world's premier interconnect platforms are available at launch: Console Connect by PCCW Global in 14 locations, Megaport in 14 locations, PacketFabric in 15 locations, Equinix ECX Fabric™ in 8 locations and Zayo Tranzact in 3 locations, spanning North America, Europe, Asia, Oceania and Africa.

What is an Interconnection Platform?

As in much of the networking world, there are many terms in the interconnection space for the same thing: Cloud Exchange, Virtual Cross Connect Platform and Interconnection Platform are all synonyms. They are platforms that allow two networks to interconnect privately at layer 2, without requiring additional physical cabling. Instead the customer can order a port and a virtual connection on a dashboard, and the interconnection ‘fabric’ will establish the connection. Since many large customers are already connected to these fabrics for their connections to traditional Cloud providers, it is a very convenient method to establish private connectivity with Cloudflare.

Why interconnect virtually?

Cloudflare has an extensive peering infrastructure and already has private links to thousands of other networks. Virtual private interconnection is particularly attractive to customers with strict security postures and demanding performance requirements who want to avoid the added burden of ordering and managing additional physical cross connects and expanding their physical infrastructure.

Key Benefits of Interconnection Platforms

Secure: Similar to physical PNI, traffic does not pass across the Internet. Rather, it flows from the customer router, to the Interconnection Platform’s network and ultimately to Cloudflare. So while there is still some element of shared infrastructure, it’s not over the public Internet.

Efficient: Modern PNIs are typically a minimum of 1Gbps, but if you have the security motivation without the sustained 1Gbps data transfer rates, then you will have idle capacity. Virtual connections provide for “sub-rate” speeds, which means less than 1Gbps, such as 100Mbps, meaning you only pay for what you use. Most providers also allow some level of “burstiness”, which is to say you can exceed that 100Mbps limit for short periods.

Performance: By avoiding the public Internet, virtual links avoid Internet congestion.

Price: The major cloud providers typically have different pricing for egressing data to the Internet compared to an Interconnect Platform. By connecting to your cloud via an Interconnect Partner, you can benefit from those reduced egress fees between your cloud and the Interconnection Platform. This builds on our Bandwidth Alliance to give customers more options to continue to drive down their network costs.

Less Overhead: By virtualizing, you reduce physical cable management to just one connection into the Interconnection Platform. From there, everything is defined and managed in software. For example, ordering a 100Mbps link to Cloudflare can be a few clicks in a Dashboard, as would be a 100Mbps link into Salesforce.

Data Center Independence: Is your infrastructure in the same metro, but in a different facility to Cloudflare? An Interconnection Platform can bring us together without the need for additional physical links.

Where can I connect?

In any of our physical facilities
In any of the 23 metro areas where we are currently connected to an Interconnection Platform (see below)
If you’d like to connect virtually in a location not yet listed below, simply get in touch via our interconnection page and we’ll work out the best way to connect.

Metro Areas

The metro areas below have currently active connections. New providers and locations can be turned up on request.

What’s next?

Our customers have been asking for direct on-ramps to our global network for a long time and we’re excited to deliver that today with both physical and virtual connectivity via the world’s leading interconnection platforms.

Already a Cloudflare customer and connected with one of our Interconnection partners? Then contact your account team today to get connected and benefit from improved reliability, security and privacy of Cloudflare Network Interconnect via our interconnection partners.

Are you an Interconnection Platform with customers demanding direct connectivity to Cloudflare? Head to our partner program page and click “Become a partner”. We’ll continue to add platforms and partners according to customer demand.

"Equinix and Cloudflare share the vision of software-defined, virtualized and API-driven network connections. The availability of Cloudflare on the Equinix Cloud Exchange Fabric demonstrates that shared vision and we’re excited to offer it to our joint customers today." – Joseph Harding, Equinix, Vice President, Global Product & Platform Marketing

"Cloudflare and Megaport are driven to offer greater flexibility to our customers. In addition to accessing Cloudflare’s platform on Megaport’s global internet exchange service, customers can now provision on-demand, secure connections through our Software Defined Network directly to Cloudflare Network Interconnect on-ramps globally. With over 700 enabled data centres in 23 countries, Megaport extends the reach of CNI onramps to the locations where enterprises house their critical IT infrastructure. Because Cloudflare is interconnected with our SDN, customers can point, click, and connect in real time. We’re delighted to grow our partnership with Cloudflare and bring CNI to our services ecosystem, allowing customers to build multi-service, securely-connected IT architectures in a matter of minutes." – Matt Simpson, Megaport, VP of Cloud Services

“The ability to self-provision direct connections to Cloudflare’s network from Console Connect is a powerful tool for enterprises as they come to terms with new demands on their networks. We are really excited to bring together Cloudflare’s industry-leading solutions with PCCW Global’s high-performance network on the Console Connect platform, which will deliver much higher levels of network security and performance to businesses worldwide.” – Michael Glynn, PCCW Global, VP of Digital Automated Innovation

"Our customers can now connect to Cloudflare via a private, secure, and dedicated connection via the PacketFabric Marketplace. PacketFabric is proud to be the launch partner for Cloudflare's Interconnection program. Our large U.S. footprint provides the reach and density that Cloudflare customers need." – Dave Ward, PacketFabric CEO

When people pause the Internet goes quiet

John Graham-Cumming — Fri, 01 May 2020 11:00:00 GMT

Recent news about the Internet has mostly been about the great increase in usage as those workers who can have been told to work from home. I've written about this twice recently, first in early March and then last week, looking at how Internet use has risen to a new normal.

As human behaviour has changed in response to the pandemic, it's left a mark on the charts that network operators look at day in, day out to ensure that their networks are running correctly.

Most Internet traffic has a fairly simple rhythm to it. Here, for example, is daily traffic seen on the Amsterdam Internet Exchange. It's a pattern that's familiar to most network operators. People sleep at night, and there's a peak of usage in the early evening when people get home and perhaps stream a movie, or listen to music or use the web for things they couldn't do during the workday.

But sometimes that rhythm gets broken. Recently we've seen the evening peak be joined by morning peaks as well. Here's a graph from the Milan Internet Exchange. There are three peaks: morning, afternoon and evening. These peaks seem to be caused by people working from home and children being schooled and playing at home.

But there are other ways human behaviour shows up on graphs like these. When humans pause the Internet goes quiet. Here are two examples that I've seen recently.

The UK and #ClapForNHS

Here's a chart of Internet traffic last week in the UK. The triple peak is clearly visible (see circle A). But circle B shows a significant drop in traffic on Thursday, April 23.

That's when people in the UK clapped for NHS workers to show their appreciation for those on the front line dealing with people sick with COVID-19.

Ramadan

Ramadan started last Friday, April 24 and it shows up in Internet traffic in countries with large Muslim populations. Here, for example, is a graph of traffic in Tunisia over the weekend. A similar pattern is seen across the Muslim world.

Two important parts of the day during Ramadan show up on the chart. These are the iftar and sahoor. Circle A shows the iftar, the evening meal at which Muslims break the fast. Circle B shows the sahoor, the early morning meal before the day's fasting.

Looking at the previous weekend (in green) you can see that the Ramadan-related changes are not present and that Internet use is generally higher (by 10% to 15%).

Conclusion

We built the Internet for ourselves, and despite all the machine-to-machine traffic that takes place (think IoT devices chatting to their APIs, or computers updating software in the night), human-directed traffic dominates.

I'd love to hear from readers about other ways human activity might show up in these Internet trends.

Internet performance during the COVID-19 emergency

John Graham-Cumming — Thu, 23 Apr 2020 15:08:51 GMT

A month ago I wrote about changes in Internet traffic caused by the COVID-19 emergency. At the time I wrote:

Cloudflare is watching carefully as Internet traffic patterns around the world alter as people alter their daily lives through home-working, cordon sanitaire, and social distancing. None of these traffic changes raise any concern for us. Cloudflare's network is well provisioned to handle significant spikes in traffic. We have not seen, and do not anticipate, any impact on our network's performance, reliability, or security globally.

That holds true today; our network is performing as expected under increased load. Overall the Internet has shown that it was built for this: designed to handle huge changes in traffic, outages, and a changing mix of use. As we are well into April I thought it was time for an update.

Growth

Here's a chart showing the relative change in Internet use as seen by Cloudflare since the beginning of the year. I've calculated a moving average of the trailing seven days for each country and used December 29, 2019 as the reference point.
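For readers who want to reproduce this kind of chart from their own data, here is a minimal sketch of the calculation, assuming one traffic total per day. The function names and the traffic numbers are made up for illustration; this is not Cloudflare's actual pipeline.

```python
from datetime import date, timedelta

def trailing_mean(values, window=7):
    """Trailing moving average: entry i averages values[i-window+1 .. i].
    The first window-1 entries are None because the window is incomplete."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the day that left the window
        out.append(running / window if i >= window - 1 else None)
    return out

def relative_change(dates, values, reference, window=7):
    """Percent change of the smoothed series relative to its value on `reference`."""
    smoothed = trailing_mean(values, window)
    base = smoothed[dates.index(reference)]
    if base is None:
        raise ValueError("reference date needs a full trailing window")
    return [None if s is None else 100.0 * (s - base) / base for s in smoothed]

# Hypothetical daily traffic totals: a flat week, then a week at 1.5x.
start = date(2019, 12, 23)
dates = [start + timedelta(days=i) for i in range(14)]
traffic = [100.0] * 7 + [150.0] * 7

change = relative_change(dates, traffic, reference=date(2019, 12, 29))
for d, c in zip(dates, change):
    print(d, "n/a" if c is None else f"{c:+.1f}%")
```

Smoothing over seven days removes the weekday/weekend rhythm, so the line only moves when the overall level of use changes. The cost is that a sudden jump takes a full week to show up in the smoothed series.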

On this chart the highest growth in Internet use has been in Portugal: it's currently running at about a 50% increase with Spain close behind followed by the UK. Italy flattened out at about a 40% increase in usage towards the end of March and France seems to be plateauing at a little over 30% up on the end of last year.

It's interesting to see how steeply Internet use grew in the UK, Spain and Portugal (the red, yellow and blue lines rise very steeply), with Spain and Portugal almost in unison and the UK lagging by about two weeks.

Looking at some other major economies we see other, yet similar patterns.

Similar increases in utilization are seen here. The US, Canada, Australia and Brazil are all running at between 40% and 50% above the level of use at the beginning of the year.

Stability

We measure the TCP RTT (round trip time) between our servers and visitors to Internet properties that are Cloudflare customers. This gives us a measure of the speed of the networks between us and end users, and if the RTT increases it is also a measure of congestion along the path.

Looking at TCP RTT over the last 90 days can help identify changes in congestion or the network. Cloudflare connects widely to the Internet via peering (and through the use of transit) and we connect to the largest number of Internet exchanges worldwide to ensure fast access for all users.

Cloudflare is also present in 200 cities worldwide; thus the TCP RTT seen by Cloudflare gives a measure of the performance of end-user networks within a country. Here's a chart showing the median and 95th percentile TCP RTT in the UK in the last 90 days.

What's striking in this chart is that despite the massive increase in Internet use (the grey line), the TCP RTT hasn't changed significantly. From our vantage point UK networks are coping well.
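As an aside, here is a minimal sketch of how a median and a 95th percentile can be taken from a batch of RTT samples. It uses the nearest-rank method, which is an assumption on my part (the post does not say which percentile definition is used), and the sample values are invented.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical RTT samples in milliseconds: mostly fast, two slow outliers.
rtt_ms = [12, 15, 11, 14, 13, 18, 16, 17, 19, 250,
          20, 21, 22, 23, 24, 25, 26, 27, 28, 180]

median_rtt = percentile(rtt_ms, 50)
p95_rtt = percentile(rtt_ms, 95)
print(f"median={median_rtt} ms, p95={p95_rtt} ms")
```

The median describes the typical connection, while the 95th percentile picks up the slow tail. Congestion tends to show in the tail first, which is why it helps to plot both.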

Here's the situation in Italy:

The picture here is slightly different. Both median and 95th percentile TCP RTT increased as traffic increased. This indicates that networks aren't operating as smoothly in Italy. It's noticeable, though, that as traffic has plateaued the TCP RTT has improved somewhat (take a look at the 95th percentile) indicating that ISPs and other network providers in Italy have likely taken action to improve the situation.

This doesn't mean that Italian Internet is in trouble, just that it's strained more than, say, the Internet in the UK.

Conclusion

The Internet has seen incredible, sudden growth in traffic but continues to operate well. What Cloudflare sees reflects what we've heard anecdotally: some end-user networks are feeling the strain of the sudden change of load but are working and helping us all cope with the societal effects of COVID-19.

It's hard to imagine another utility (say electricity, water or gas) coping with a sudden and continuous increase in demand of 50%.