Data at Cloudflare scale: some insights on measurement for 1,111 interns

Cloudflare recently announced our goal to hire 1,111 interns in 2026 — that’s equivalent to about 25% of our full-time workforce. This means countless opportunities to develop and ship working code into production. It also creates novel opportunities to measure aspects of the Internet that are otherwise hard to observe — and more difficult still to understand.

Measurement is hard, even at Cloudflare, despite the vast amount of data generated by our traffic (much of it published via Cloudflare Radar). A common misconception we often hear is, “Cloudflare has so much data that it must have all the answers.” Having a huge amount of data is great — but it also means much more noise to filter out, and lots of additional work to rule out alternative explanations.

Ram Sundara Raman was an intern at Cloudflare in 2022 as he pursued his PhD. He’s now an assistant professor at University of California, Santa Cruz, and we’ve invited him back to share his insights about working with data at Cloudflare.

Ram’s project is a great example of how insights that researchers bring from their university research lab can lay the groundwork for a valuable project at Cloudflare: in this case, detecting and explaining connection failures to customers. One tip for prospective interns: if you’re applying and thinking about data and measurement ideas to work on at Cloudflare, a good question to ponder is whether, how, and why your idea might matter to Cloudflare. We love hearing your ideas!

Without further ado, here’s Ram. We hope his insights are as informative and refreshing to future interns — and practitioners — as they are to us here at Cloudflare.

Insights from data at large scale might just be a small miracle

by Ram Sundara Raman, Assistant Professor of Computer Science and Engineering, UC Santa Cruz

Before joining Cloudflare as a research intern in the summer of 2022, I’d worked on multiple network security and privacy research problems as a PhD student at the University of Michigan. My previous experience involved active measurements, in which probes were carefully crafted and transmitted to detect and quantify security issues such as HTTPS interception and connection tampering. These attacks, performed by powerful network middleboxes between users and Internet servers, can block Internet content and services for numerous users in various regions, and can also reduce their security. For example, the HTTPS Interception Man-in-the-Middle Attack in Kazakhstan in 2019 was detected in 7-24% of all measurements we performed in the country.

Detecting such attacks is challenging. The underlying mechanisms are diverse, vary by geography and over time, and are entirely opaque. Moreover, the Internet has no technical mechanism to report to users when their traffic is being manipulated, and third-party actors are rarely, if ever, transparent with affected users.

My active measurement work before Cloudflare helped resolve these challenges. Along with my PI and team at the University of Michigan, I helped develop Censored Planet, one of the largest active Internet censorship observatories, detecting connection tampering in more than 200 countries. However, active measurements face barriers on scale, resources, and real-world view. For instance, Censored Planet is only able to measure blocking and connection tampering for the 2,000 most popular websites, simply because of limits on time and resources.

While working on projects like Censored Planet, I’d often look at large network operators or cloud providers and think: “If only I had my hands on the data they collect, I could solve this problem so easily. They have a global view of real-world traffic from nearly every network, and probably enough resources and telemetry to scale analysis to that level of data. How hard could it be to use this data, for example, to detect when middleboxes interfere?”

As we learned through our research published at ACM SIGCOMM’23, it can be very hard.

My perspectives on censorship evolved as a direct result of my experience at Cloudflare, which taught me that detection at scale is hard, even with large-scale data. The research I did during my internship helped reveal that network middleboxes block or otherwise interfere with certain connections not only in limited places, but also at various scales around the world.

An internship project built on real insights, using production data

In this research, we built upon insights gathered by the wider active measurement community, namely that middleboxes interfere with Internet TCP connections by dropping packets, or injecting RST packets to cause connections to abort. The same insights revealed that the patterns of packet drops and RSTs are deterministic — and, as a result, potentially detectable. Such is the flexibility of active measurement: craft a custom request, or ‘probe,’ that elicits a response from the environment. However, such a targeted approach would be difficult to scale and maintain, even for Cloudflare: What probes should be crafted? Where should they be sent? What motivation would Cloudflare have to even try, if the risk of missing so much is so high?

The goal of my internship was to see if we could instead flip the approach: to be passive instead of active. Everything Cloudflare does must be both scalable and sustainable. However, it was entirely uncertain whether a system restricted to passive observation could be constructed, even if the tampering events could be detected. The requirement was clear: Only observe and use data that comes to Cloudflare naturally. No mixing in other datasets, no running our own active measurements. Either would have made life easier: we could have controlled the variables, maybe even obtained ground truth that would help us confirm our observations. But where’s the fun in that? Besides, Cloudflare has all the data anyway… right?

Yes, maybe — if it is sampled appropriately, can be teased out reliably, and correctly interpreted.

Here’s a useful insight: I’ve often heard people say that finding middleboxes that tamper with Internet connections using active measurements is like finding a needle in a haystack: rare, finicky, and hard to pin down. When we started looking at this problem through the lens of Cloudflare’s passive dataset, we quickly realized we were still looking for the same needle, and in some ways it was now even harder to find.

That’s because as a passive observer we lose the ability to choose where to look. Also, the haystack now stretches across continents, millions of users, and — I’m not exaggerating here — thousands of ways connections can be made and broken. Not only did we have to identify tampering from millions of real-world data points, we had to do it with data that was full of obstacles and pitfalls. It felt a lot like working with unseen traps and their tripwires.

The traps and tripwires of large-scale passive data

There were multiple challenges that I only truly understood once faced with them. Let’s start with the obvious one: scale.

First, there was a glut of large-scale datasets, primarily associated with incoming connections to Cloudflare. For example, at the time of my internship, Cloudflare was serving more than 45 million HTTP requests per second globally, across more than 285 data centers. Cloudflare also gets TCP connections to its 1.1.1.1 DNS server. We also explored Network Error Logging (NEL) data. Usually, in measurement research, we’re dealing with the issue of too little scale. Here, we had the opposite problem: too much of a good thing. In practice, each of these datasets had its own independent sampling method, making it all but impossible to use them all together. Moreover, datasets like NEL are biased because only some clients support NEL and only some websites enable it. After we evaluated these biases, NEL did not make the final cut.

To manage the scale, we constructed special iptables rules to log and store incoming TCP connections across all of Cloudflare’s points of presence, meaning every server in each of 285 data centers. However, due to the extremely large scale of the data, we had to limit ourselves to a uniformly random sample of one in every 10,000 connections. For each sampled connection, we logged only the first 10 inbound packets. That meant we could not detect certain infrequent types of tampering, or any tampering that occurs later in a flow, after the first 10 packets.
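To make the sampling constraints concrete, here is a minimal Python sketch of the scheme described above: each new connection is kept with probability 1 in 10,000, and only its first 10 inbound packets are stored. This is an illustration only. The real system used firewall rules on production servers, and the `ConnectionLogger` class, its method names, and the connection identifiers are hypothetical.

```python
import random

SAMPLE_RATE = 10_000  # keep one in every 10,000 connections (from the text)
MAX_PACKETS = 10      # keep only the first 10 inbound packets (from the text)


class ConnectionLogger:
    """Hypothetical in-memory stand-in for the production logging rules."""

    def __init__(self, rng=None):
        # Any object with a randrange(n) method works; default is random.Random.
        self.rng = rng if rng is not None else random.Random()
        self.sampled = {}     # conn_id -> list of logged packets
        self.skipped = set()  # connections we decided not to sample

    def on_packet(self, conn_id, packet):
        """Handle one inbound packet belonging to connection conn_id."""
        if conn_id in self.skipped:
            return
        if conn_id not in self.sampled:
            # The sampling decision is made once, when a connection is first seen.
            if self.rng.randrange(SAMPLE_RATE) != 0:
                self.skipped.add(conn_id)
                return
            self.sampled[conn_id] = []
        packets = self.sampled[conn_id]
        if len(packets) < MAX_PACKETS:
            packets.append(packet)  # anything after packet 10 is never recorded
```

The last branch is where the blind spot comes from: a RST injected as packet 11 or later simply never reaches the log, and a tampering pattern that affects only a tiny share of connections may rarely land in the sample at all.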

Still, within those constraints, we managed to develop tampering signatures — distinctive packet patterns that reveal when middleboxes interfere. However, developing these signatures was anything but straightforward, due to the second tripwire: noisy data.

It’s difficult to imagine that we could have anticipated all the different sources of noise. For example, the resolution of time-keeping in event records was milliseconds, but many packets could arrive in a single millisecond, which meant we could not trust the ordering of logged packets. We eventually learned that some denial-of-service attack traffic, as well as port scans, can look eerily like tampering events, and certain “best practices” designed to help improve the Internet, such as Happy Eyeballs, became quirks that messed with our detection. We spent a lot of time analyzing these sources of noise and iterating on our signatures to understand them. We accepted events as tampering only if supported by other sources of evidence that we identified, including but not limited to inconsistent changes in the Time-To-Live (TTL) field in the IP header.
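Two of the checks mentioned above can be sketched in a few lines of Python. The first flags connections whose logged packets share a millisecond timestamp, so their order cannot be trusted. The second looks for a large spread in observed TTL values, on the assumption that packets from one client over one path usually arrive with nearly the same TTL. The `Packet` record, the function names, and the threshold of 2 are hypothetical choices for illustration, not the values used in the research.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    ts_ms: int  # arrival time at millisecond resolution
    flags: str  # TCP flags, e.g. "SYN", "ACK", "RST"
    ttl: int    # IP Time-To-Live as observed at the server


def ordering_is_ambiguous(packets):
    """True if any two logged packets share the same millisecond timestamp."""
    stamps = [p.ts_ms for p in packets]
    return len(stamps) != len(set(stamps))


def ttl_spread(packets):
    """Difference between the largest and smallest TTL in one connection."""
    ttls = [p.ttl for p in packets]
    return max(ttls) - min(ttls) if ttls else 0


def has_ttl_evidence(packets, threshold=2):
    """True if the TTL spread exceeds an (assumed) tolerance for path jitter."""
    return ttl_spread(packets) > threshold


# Example: a RST that arrives in the same millisecond as an ACK,
# but with a very different TTL from the rest of the connection.
suspicious = [
    Packet(ts_ms=100, flags="SYN", ttl=52),
    Packet(ts_ms=130, flags="ACK", ttl=52),
    Packet(ts_ms=130, flags="RST", ttl=243),
]
print(ordering_is_ambiguous(suspicious), has_ttl_evidence(suspicious))
```

Neither check is conclusive on its own. A TTL mismatch is supporting evidence, which is how the text above describes its use, and a timestamp tie only says that the logged order should not be relied upon.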

That brings me to our last tripwire: a lack of ground truth.

Without active, controlled experiments, it would have been extremely difficult for us to confirm when something we detected was indeed tampering, and not one of the thousand other phenomena on the Internet. Fortunately, thanks to the amazing work of many researchers in the censorship measurement space, we were able to recognize at least some known signals and patterns in the data, and these helped us confirm many cases of tampering.

There were plenty more tripwires. But the key realization for me was this: While providers have lots of data that can tell you things, it’s incredibly hard to know which thing, how much of it, and about what. Large infrastructure operators see a filtered, sampled, and often partial view of the Internet. For example:

Services like Cloudflare can see only which connections were affected and where they were initiated, but not who did the tampering;
It is sometimes possible to understand which domains were blocked, but not always, because the necessary packets can be dropped before they reach Cloudflare;
A passive observer can see only the user activity that is affected, not what could be affected.

For a company that handles a double-digit percentage of Internet websites and services, these were surprising, but understandable, limitations. It may seem like the exercise is impossible, but it’s not. It’s just more challenging than I expected it to be. Despite all that, we found ways to extract meaning from chaos. For example, we carefully and painstakingly enumerated all common packet sequences Cloudflare observed, and extracted from them those that might indicate tampering, based on prior work. Moreover, we used signals like the TTL field mentioned above as supporting evidence that these packet signatures did indeed show tampering.
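The enumeration step lends itself to a short sketch. Assuming each sampled connection has been reduced to an ordered list of TCP flag strings, the Python below counts how often each sequence occurs and keeps the common ones that end abruptly in a RST. The "ends in RST" rule and the `min_count` cutoff are illustrative stand-ins only. They are not the 19 signatures from the research, which were derived from prior work and reviewed by hand.

```python
from collections import Counter


def enumerate_sequences(connections):
    """Count how often each ordered flag sequence occurs across connections."""
    return Counter(tuple(conn) for conn in connections)


def candidate_sequences(counts, min_count=2):
    """Keep common sequences that end in a RST (an illustrative rule only)."""
    return [
        (seq, n)
        for seq, n in counts.most_common()
        if n >= min_count and seq and seq[-1] == "RST"
    ]


# Toy input: three normal connections, two reset after the handshake,
# and one rare sequence that falls below the min_count cutoff.
connections = [
    ["SYN", "ACK", "PSH+ACK", "FIN"],
    ["SYN", "ACK", "PSH+ACK", "FIN"],
    ["SYN", "ACK", "PSH+ACK", "FIN"],
    ["SYN", "ACK", "RST"],
    ["SYN", "ACK", "RST"],
    ["SYN", "RST"],
]
counts = enumerate_sequences(connections)
for seq, n in candidate_sequences(counts):
    print(" -> ".join(seq), n)
```

Counting is the easy part. A sequence ending in a RST can also come from a port scan, attack traffic, or a client that simply gave up, which is why each candidate still needs corroborating evidence and manual review.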

All of this adds up to a simple but important conclusion: large infrastructure providers are not omniscient. Having a global view can be powerful, but doesn’t automatically translate into easy observations. You can have all the data in the world and still struggle to tell the difference between a middlebox, a security filter, a confused IoT device, and even regular users closing tabs and browsers.

But that dichotomy is also the beauty of the problem space. Working with imperfect data forces us to be creative, to find patterns in the noise, and to design methods that work despite what’s missing. And no, before you ask, you can’t just throw machine learning at the problem, nor do you need to — even with all the noise, the protocols are tightly specified, meaning patterns can be enumerated easily but must still be debated manually.

An internship project built on real insights, using production data

Using our packet-level samples and 19 tampering signatures, we saw distinctive tampering behaviors across hundreds of networks, including being able to track large increases in tampering rates (Figure 1). And it worked because, despite the data’s limits, Cloudflare’s networks let us see the real-world effects of tampering. Also, thanks to the tireless efforts of Luke Valenta and the Cloudflare Radar team, the data from our project is continuously being published on Cloudflare Radar (Figure 2).

^{Figure 1: Increase in match rates of our 19 tampering signatures during a period of nationwide protests in Iran in late 2022.}

^{Figure 2: Data from our connection tampering research is available live on Radar.}

In the future, though, I think solving challenges like these will require a combination of passive and active probing, using the scale of providers like Cloudflare together with targeted, controlled measurements to paint the full picture of Internet tampering. My team at UCSC’s RANDLab and the research group at Censored Planet continue to work on this problem, especially asking how we can automatically identify tampering when attacks happen or networks change.

While collaborations between academia and industry aren’t always straightforward, they hold strong potential to help build a better Internet. If you’re interested in an internship adventure like the one I described, apply today!
