The Cloudflare Blog

Announcing the Monetization Gateway: charge for any resource behind Cloudflare via x402

Rohin Lohe — Wed, 01 Jul 2026 13:00:00 GMT

Today, we are announcing the Cloudflare Monetization Gateway, an engine that will give Cloudflare customers the ability to charge for any asset protected by Cloudflare: web pages, datasets, APIs, or MCP tools.

It will provide a single control plane to manage payment policies and access controls across your applications, while also protecting your origin from high payment volumes by handling payment verification and enforcement at the edge. At launch, payments will settle in stablecoins over x402, the open protocol we are building with a coalition of more than 25 industry leaders via the x402 Foundation.

The evolving business model of the web

For 30 years, the web has run on a simple economic bargain: trading content for human attention. That attention has been monetized through advertising, subscriptions, and e-commerce. This bargain funded the Internet as we know it.

But as agents become the dominant Internet users, the model is breaking. An agent does not look at ads or need to maintain a monthly subscription to all the tools it wants to access. It reads a page or consumes a data feed once, takes what it needs, and moves on. Across the web, AI crawlers already request content anywhere from a hundred to tens of thousands of times for every visitor they send back.

This reality demands a new model: usage-based pricing for everything. If attention and e-commerce are moving from websites to AI harnesses and AI-written software, then agents should pay for the inputs they need — training data, inference content, developer tooling, and API usage. The natural unit of payment for software is the request, the token, or the outcome, not the seat or the month. A few examples of what that could look like:

A few cents per web search, billed per call
\$0.001 base fee plus a \$0.01 per MB charge for an upload endpoint
\$0.99 per resolved support escalation, paid only when the work succeeds

This is the same shift behind paying creators when an answer engine uses their content — a fair exchange of value whenever content or a resource is used, priced on neutral rails built for the purpose. People often envision an agent buying high-priced assets like web domains, but most of what an agent pays for sits upstream of any checkout, and is priced far lower.

Some of the Internet already works this way. Cloud and APIs have been sold by the call and by the hour for years, but only to a known buyer: a user signs up, they are issued an API key, and they incur usage-based metered billing. Content mostly skipped payment and ran on advertising instead. These business models have never been able to serve unverified buyers for sub-cent transactions because the payment rails cost too much and took too long to settle. Below a certain price, collecting the payment cost more than the payment was worth.

Historically, usage-based billing was difficult to implement. Businesses needed to effectively become payments companies, running their own accounting to track internal usage in a robust and auditable way. Tracking this usage required significant overhauls of backend systems. Many instead chose per-seat pricing because it is simpler and frequently more profitable.

Agents flip this dynamic. A single agent can do the work of an entire team around the clock, making a flat one-time fee disconnected from actual consumption. At the same time, an agent can make thousands of micropayments without friction, while asking a person to approve each payment would be impossibly burdensome. Usage-based price points are where agents live and where stablecoin-based micropayments shine. That's because stablecoins (such as Open USD and USDC) allow buyers to transfer tiny sums across the Internet, incurring negligible fees and settling in less than a second. This is not feasible with other payment rails today.

Here’s where we can help. Cloudflare has spent years building usage-based accounting for our own billing systems and for our customers’ analytics. We can dramatically simplify the implementation of usage-based billing for web-based assets thanks to our position as a proxy layer between buyers and sellers. As shown below, with Cloudflare supporting usage-based billing, the evidence of payment can move into the request itself, and the payment validation and the request paths merge.

And here’s the benefit to you: the metering, the payment exchange, and the settlement move off your origin. What stays with you is what matters — your rules, your prices, and your revenue. You will not need to onboard the buyer or stand up a billing system. You will write a rule and agentic buyers will pay for what they use.

A refresher on x402

Last year on Content Independence Day, we gave site owners one-click control over which AI crawlers could reach their content, and with Pay Per Crawl we let them charge crawlers for it. The Monetization Gateway is the next step: instead of only charging crawlers for content, you will be able to charge any caller for any resource, from an API to data to an MCP tool call, and you will not have to build the payment machinery yourself.

x402 is an open protocol that makes it possible to pay over HTTP, named for the 402 status code it finally puts to use. The x402 exchange is simple: a client requests a payment-gated resource. Instead of serving it, the server responds with 402 Payment Required and a small payload that states the price, the accepted asset, and where to pay. The client pays and repeats the request with proof of payment attached. A facilitator verifies, and the server returns the resource. It all happens inside ordinary HTTP requests and responses, with no redirect to a checkout page and no separate payment API to call. Settlement happens peer-to-peer, so any funds that a buyer sends to a seller are directly deposited to the seller’s wallet. We are designing the Monetization Gateway to keep payment overhead low and are aiming for sub-second payment settlement.

^{x402 Payment Flow: AI Agent ↔ APIServer ↔ Blockchain, Source:}^{x402 Readme on GitHub}

Two properties make x402 a good fit for machine payments. The payment amounts can be small, down to fractions of a cent, because the protocol adds almost no overhead. And the buyer needs no account with the seller, because the payment itself is the credential. x402 is rail agnostic, but it is a natural fit for stablecoins, which can settle in under a second for a fraction of a cent with zero chargebacks.

What the Monetization Gateway does

The Monetization Gateway will provide a flexible payment rules API that will allow you to express exactly when you want a caller to pay to access your digital resources.

Here’s how it will work. Tokens, APIs, MCP tool calls, and data already flow through that path. You will decide, as precisely as you want, which of that traffic has to pay. And you will be able to enforce your decisions by writing expressions, similar to expressions that you already write for other Cloudflare rules, in a simple, dedicated product API. The Monetization Gateway will scale with Cloudflare’s global network across 330+ cities, which means that the x402 handshake will occur in close proximity to your buyer. This will reduce request latency and protect your origin.

A few examples of planned capabilities:

Charge for specific REST verbs: Require payment on calls to a specific route, for example $0.01 for every GET or POST request to /api/premium/*.
Variable pricing: Charge variable amounts for tasks of varying complexity, for example, image generation might charge any amount up to $2, depending on the compute used.
Charge only unauthenticated callers: Intercept HTTP 401 "Unauthorized" responses from your origin and return 402 "Payment Required" instead with pricing and payment instructions.

When a request matches, the Monetization Gateway will verify payment before letting it through. You will be able to set these rules in the dashboard, or manage them as code through the Cloudflare API and Terraform, so a paid endpoint is just another part of your infrastructure config.

The Monetization Gateway will initially allow users to require buyers to pay for services and resources in stablecoins. Sellers will be able to use the stablecoins they accumulate for their own transactions or redeem the stablecoins for equivalent fiat currency in their bank account. Using the Monetization Gateway offers a way to increase the addressable market for your products. With the Gateway, agents can request your resource, be told the price, pay, and get the response. No signup, no API key, no prior relationship required. You will decide how much you need to know about that buyer, and you will have the flexibility to require agents to authenticate with Web Bot Auth and apply usage-based pricing against accounts they already hold.

Where we see this going

The Monetization Gateway will turn the request into a payment and give Cloudflare customers new revenue opportunities, but where this goes is far bigger.

An agent is software that acts autonomously on a user’s behalf, and agents are starting to act on their own. Soon they will carry wallets and buy what they need without a person in the loop: a dataset, an API call, a tool, a block of compute. Some of those resources will be free, and some will require proof of who the agent is and who it acts for, through verified agent identity. Many will require both an identity and a payment, and Cloudflare is one of the few places that will be able to settle all of it inside a single request, by verifying the agent, applying the rule, and checking the payment before the origin ever sees the call. The agent becomes the primary buyer on the Internet, and the request becomes the transaction.

There is an enormous amount of value moving across the Internet today that goes unmonetized or undermonetized, not because no one would pay for it, but because the tools to charge for it have never existed. Every useful API call, every answer, every tool invocation an agent makes has value, and almost none of it is paid for today. That is the opportunity in front of us, and it is what the Monetization Gateway will unlock.

This is what we are building toward: an agent-first Internet with Internet-scale settlement built in. Where the people who make something worth paying for get paid by the software that uses it, automatically. And where the smallest new API can reach the same buyers, on the same terms, as the largest company on the web, and the independent creator is paid by the large language models that use their work. That is the next business model of the Internet, and we are building to power it.

Sign up for our waitlist

The Monetization Gateway waitlist is open now for Cloudflare customers. If you’re interested in monetizing your web page, dataset, API, or MCP tool with usage-based pricing, please join our early access list.

Content Independence Day, one year on: building the business model for the agentic Internet

Arielle Weiss — Wed, 01 Jul 2026 13:00:00 GMT

One year ago, we declared Content Independence Day. At the time, we could see what many in the industry were beginning to sense: the fundamental economics of the Internet were shifting. AI adoption was accelerating, publishers were experiencing rapid declines in referral traffic, and AI companies were crawling the web at unprecedented scale, often without clearly declaring intent, and almost always without compensation.

We changed the defaults. For all new domains on Cloudflare, AI training crawlers would be blocked by default unless domain owners chose otherwise. We didn't do this to wall off the web. We did it because we believed a healthier ecosystem required transparency, control, scarcity, and ultimately, a market where high-quality content could be valued and exchanged fairly.

A year later, that market has emerged. But the transformation of the Internet has happened even faster than we anticipated. In this report, we share key data points that illustrate how quickly the business model of the Internet has shifted – and what this new content market means for publishers and site owners.

Part I: The Internet has changed – faster than anyone expected

The vertical adoption curve

AI is not just another technology cycle. It is a platform shift happening at more than 2x the speed that smartphones were adopted. In just 3.5 years, over 30% of humanity — 2.5 billion active users — has adopted regular use of generative AI. The adoption curve isn't merely steep: it's going vertical.

The decline of the open web

Never before have we seen such a rapid change in how humans interact with information, perform work, and spend time online.

The way people use the Internet is changing dramatically. Today, for every hour spent online searching for information, only 15 minutes is spent on the open web. Traditional search behavior is collapsing as users shift to AI-driven discovery and consumption. Instead of visiting multiple sites to source and compare information, users simply type a prompt and receive a nearly instantaneous, consolidated answer.

The agentic Internet is here

This year, agent traffic crossed a historic threshold for the first time: more than 50% of traffic on the Internet is now non-human. This shift has staggering implications for publishers, content owners, and the future of the open web.

Crawlers have changed their purpose

When looking at the crawlers Cloudflare identifies by purpose, the composition of crawler traffic tells the story clearly:

52% of crawler requests are now for AI training as of June 2026, up from 22% in Spring 2025.
Mixed-use crawlers (those blending search, agent use, and training) represent over 36% of activity.
Pure search crawling now represents a small and declining share of overall crawler activity, despite remaining critical for publisher visibility.

As AI training becomes a primary driver of crawler activity, the ability to distinguish between discovery and training becomes increasingly important. Mixed-use crawlers blur that distinction, putting content owners in a difficult position: choose between remaining discoverable in the agentic era, and giving away their most valuable content without compensation.

The old business model is gone

For decades, the economic model of the open web was straightforward. Content creators exchanged access to their content for visibility in search engines, which returned referral traffic. That traffic became the primary mechanism through which publishers, creators, and businesses generated economic value.

But today, that exchange is breaking down. Content is still being crawled, indexed, and used — but increasingly without corresponding traffic being returned to the source. As AI systems answer questions, compare products, conduct research, and complete tasks directly, information across the open web is increasingly becoming part of AI training and retrieval systems. The existential question this raises is simple: if content is consumed without audiences ever visiting the source, how do content creators sustain themselves?

The implications are industry-agnostic

The earliest industries to feel the impact were news organizations and media companies. Today, similar dynamics are impacting businesses across retail, software, IT, and finance. Some of the most heavily crawled categories have seen human traffic decline as much as 40% in less than one year.

Many publishers are now preparing for what they call "Google Zero" — a world where little to no traffic comes from search referrals.

The implications extend to essentially every industry. Any organization that publishes proprietary information on the Internet will need to understand how to operate in an agentic era. This dynamic matters not just to content owners, but to all of us. The Internet is a critical part of the global economy and one of the world's most important public resources for surfacing information. Ensuring it remains healthy and sustainable is essential for all.

Part II: The market has emerged

What we built

When we launched Content Independence Day, we committed to three things:

Transparency and control for site owners, enabling them to define how their content is accessed and monetized.
Tools that create scarcity, shifting the balance of power back to content owners.
A marketplace where content creators and AI companies of all sizes can discover, license, and determine the value of content more efficiently.

One year later, a market for monetized content is here, and the conditions for a dynamic marketplace are forming.

Transparency and control created scarcity

Historically, publishers have had limited visibility into how AI companies accessed and used their content. As referral traffic declined, that lack of visibility became an economic problem prompting publishers to seek new ways to capture value.

Cloudflare's attribution, business intelligence, and enforcement tools gave publishers visibility into AI consumption at the network level — an enforcement mechanism far more effective than voluntary standards like robots.txt. For the first time, publishers could determine how their content was accessed and monetized. That control created scarcity, and drove a supply-and-demand content economy.

Scarcity created leverage

Publishers that exercised control over access successfully created scarcity, giving them negotiating leverage that led to better deals. For the first time, publishers gained operator-level attribution data — evidence of how often LLMs attempted to access their content, which competitive LLMs were crawling, what their most in-demand URLs were, and what their crawl-to-referral ratios looked like. This reduced information asymmetry in licensing discussions and enabled publishers to negotiate from a position of knowledge.

Leverage is changing the balance of power

This leverage has empowered our customers. As they have gained greater visibility into how AI systems access and use their content, they’ve become better equipped to understand the implications for their businesses and more confidently articulate the value of the information, brand, and audiences they have built.

As the balance of power between content owners and AI companies begins to change, a licensing economy is emerging:

More than 50 publisher-AI agreements have been signed since 2023.
Major AI companies now actively license content, increasingly recognizing the value of differentiated and premium content.
Collective licensing models continue to emerge and scale.
Large publishers are securing meaningful licensing agreements, demonstrating that content has real economic value within the AI ecosystem.

The conversation is no longer whether content should be compensated. The conversation now is how.

The market is maturing, but inefficiencies remain

Early licensing agreements proved demand exists, but licensing today remains largely bespoke and unlikely to fully replace lost referral, advertising, and affiliate revenue. As a result, publishers are increasingly optimizing for AI consumption alongside traditional human discovery while exploring new monetization pathways.

Supply and demand remain difficult to match efficiently, and while there’s an understanding that not all content carries the same value, content valuation is still unresolved.

The Google convergence problem

No discussion of this market is complete without addressing Google's unique role. Google remains the dominant gateway to online discovery, accounting for approximately 88% of referral traffic. But increasingly, Google is helping users consume content directly within Google-owned AI experiences.

Discovery and consumption serve fundamentally different purposes. Search drives users to content, while AI-powered experiences increasingly summarize and reuse it without requiring users to visit the source. Website owners view these activities differently because one generates traffic, while the other increasingly substitutes for it.

These differences become especially important when site owners are deciding who should be allowed to access their content and for what purpose. Most leading AI companies separate discovery crawlers from training crawlers, making it relatively simple for publishers to enable content access for one purpose or the other. Google does not. Today, Google has access to about 2x more information than leading AI companies because Google leverages a mixed-use bot that makes it difficult for customers to participate in Google's search ecosystem without also participating in Google's AI ecosystem.

Unlike other AI providers, Google’s mixed-use crawler also limits transparency for site owners. Because discovery and AI access are combined into a single crawler, publishers cannot tell why Google is accessing their content or distinguish between traffic used for search and traffic used for AI experiences. They also lose the visibility and evidence that comes from being able to allow or block these activities independently at the network level.

This dynamic has accelerated demand for greater transparency and control, as well as new monetization models to better serve both content owners and AI companies of all sizes.

Part III: A unique view of the ecosystem

Cloudflare sits at the intersection of the emerging agentic economy.

More than 20% of the web sits behind Cloudflare’s network. Of the world's most-visited websites, 36% rely on our network, and more than 40% of the Fortune 500 are Cloudflare customers. Nearly 80% of leading AI companies use Cloudflare, alongside thousands of developers and emerging AI companies.

This unique position gives us visibility into both sides of the market. We see the content owners creating content, the AI companies consuming it, and the signals increasingly connecting them. That perspective has given us a unique view into how the market has evolved over the past year, and what it now requires.

Part IV: Lessons from an emerging market

As publishers and AI companies adapt to a new agentic economy, Cloudflare has gained a clearer understanding of what the ecosystem now needs.

Transparency must become the standard

Content owners increasingly need visibility and control over who is accessing their content, how it is being used, and for what purpose. AI companies increasingly recognize that transparency builds trust and reduces friction with publishers. Visibility and enforcement are no longer security concerns alone — they have become business requirements that directly influence licensing negotiations and commercial decision making.

To help make transparency the standard, Cloudflare is continuing to invest in enhanced attribution, measurement, and publisher controls that give content owners greater visibility into and control over how their content is accessed and used.

As the industry shifts toward greater transparency, we believe that verifiable bot self-identification and declarations of crawl intent are fundamental to a sustainable ecosystem. Today, more than one-third of crawler activity on our network still comes from mixed-use bots that make it impossible for content owners to distinguish crawl intent. We are actively engaging with the ecosystem and investing in tooling to help drive that number to zero by this time next year.

Better AI requires better signals

Over the past year, it has become increasingly clear that AI companies need more than access to content. They need better ways to determine what to access, when to access it, and how frequently it has changed. Indiscriminate crawling wastes compute for AI companies and creates unnecessary bandwidth burden for publishers, reducing efficiency across the ecosystem.

We believe better answers require better intelligence. We are investing in real-time freshness signals with richer trust, quality, and relevance to help AI companies discover differentiated information while reducing unnecessary crawling across the web.

Markets need better discovery before better pricing

We believe better discovery must precede better pricing. In order for the market to mature, publishers and AI companies need better information about one another. We are investing in richer market intelligence, content signaling, and capabilities that improve discovery between both sides of the ecosystem, laying the foundation for more scalable market mechanisms over time.

Part V. Building the infrastructure for the agentic Internet

One year ago, Content Independence Day introduced a simple idea: content owners should have greater control over how AI companies access and use their information.

Over the past twelve months, that control helped give rise to a market. Transparency created scarcity. Scarcity created leverage. Leverage accelerated licensing. What was once a theoretical discussion about the future of AI and content has become an active market, with publishers, AI companies, and technology providers all adapting to a new set of economic realities.

The market is now entering a new phase that demands new infrastructure. As the Internet becomes increasingly agentic, the underlying systems that support it must evolve to handle permissions, licensing, and commercial transactions at scale. Content owners and AI companies need more efficient ways to connect and exchange value. We believe these capabilities will converge into programmable, scalable mechanisms for content discovery and monetization – reducing friction while unlocking richer forms of value exchange.

Cloudflare's role is to build the infrastructure and business intelligence, and contribute to the standards that allow the market to determine value more efficiently and help publishers and AI companies participate in a healthier, more dynamic content economy.

The Internet has always evolved. This evolution is faster and more consequential than most. But with the right infrastructure, the right incentives, and a commitment to transparency, we believe the agentic Internet can become more sustainable, more efficient, and better for everyone.

Methodology: The data in this report is compiled from Cloudflare Radar and the Cloudflare Investor Day 2026 Presentation.

Cloudflare Radar is a hub showcasing global Internet traffic, attack, and technology trends and insights. Powered by data from Cloudflare's global network, Radar was created to help anyone understand what is happening on the Internet from a security, performance and usage perspective.

Cloudflare's unique understanding of the Internet comes from its global network — one of the world's largest, spanning 330+ cities in 100+ countries — and aggregated and anonymized data from Cloudflare's 1.1.1.1 public DNS Resolver, widely used as a fast and private way to browse the Internet. More than 20% of the web sits behind Cloudflare’s network.

Making AI search smarter

Matthew Conroy — Wed, 01 Jul 2026 13:00:00 GMT

Search drives most experiences on the web. It's how we get things done, and how nearly everything on the web gets found — the creators, the merchants, the answer to whatever you just typed into a box. For nearly 30 years, that discovery journey ran on a simple bargain: let a search engine crawl your content, and it sends you visitors. You turned those visitors into a business — through ads, subscriptions, or just the audience itself. Being discoverable and getting paid were the same thing. A year ago, on the first Content Independence Day, we drew a line to defend that bargain in the AI era. But a line in the sand was only a first step. Since then, the prevalence of AI search in consumers’ lives has only accelerated as more than 50% of traffic online is non-human. The threat is no longer a handful of training crawlers you can block; it's search itself being rebuilt around AI answers.

Today's answer engines read your page and hand the user a summary, so the visit — and the revenue that depended on it — isn’t needed. We see it firsthand, and independent research backs it up: a 2025 Pew Research Center study found that when Google shows an AI summary, users clicked on a traditional search result link just 8% of the time (about half as often as when there's no summary) and clicked a link inside the summary only 1% of the time. That leaves our customers in a bind: opt out of AI and be hard to find, or opt in and deliver significant value to users while seeing increasingly little in return. Our customers want to be found and compensated for the value they provide, and right now they're forced to choose.

Today, we’ve announced new bot options to help our customers better control who can access their site and what they can do with it. But blocking was only step one: saying "no" protects content without rebuilding the business models that sustain it. So, it’s time to start building the new economic model of the Internet, starting with search.

Rebuilding the bargain

Transparency and control are the foundation, but more is needed. In 2025, we laid out our foundation via a set of responsible AI bot principles: bots should be transparent about who they are and what they're for, respect site owners' choices, and act in good faith. Our tools hold bots to that bar. But enforcing good bot behavior doesn't make AI search any better for the people relying on it, and it doesn't send a dollar back to the creator whose work made the answer possible. We can do more than help the web say "no"; we can help rebuild what it says "yes" to.

So today, we're announcing two initiatives that move from defense to offense and start putting both halves of that old bargain back together.

Make AI search smarter: By using the signals we see across our global network, like what's fresh, what's high quality, and what's actually changed, we can help answer engines surface the most relevant content and reduce unwanted crawling. People searching get better answers, while costs are reduced for both AI companies and site owners if webpages are only recrawled when they’ve changed.

Pay creators for the value they provide: When your work is used to answer someone's question, you should be rewarded instead of just being scraped for free. And you should be able to see what's being used and what people are asking. This should be a real revenue stream, and an incentive to keep producing original content worth finding.

Making search smarter

Today we're launching a research program to make AI search smarter and stop our customers footing the bill for crawls that produce nothing new.

More than 20% of the web sits behind Cloudflare’s network, which gives us a unique perspective. We can tell which pages have genuinely changed and which ones people and agents are flocking to. Through this program, we will explore using signals our customers have chosen to share about the freshness of their content, and we will combine those with our own insight into traffic flows, both human and bot. For answer engines, that's a roadmap to high-quality content. For our customers, it provides a view of what users are actually asking, and how their content shows up in AI results. The aim is to measure two things: how much these signals help answer engines to surface fresher, higher-quality content, and how much unnecessary crawling they cut out.

That second benefit, cutting unnecessary crawling, is bigger than it sounds. Cloudflare data suggests that more than 50% of crawl traffic from good bots goes to re-fetching pages that haven't changed — and that number is likely to climb as crawl volumes do. A signal that just says "nothing's changed here" lets a crawler skip the trip. That saves the answer engine compute. More importantly, it saves site owners from serving and paying for requests they never needed to.

The program is neutral by design: our goal is to make it work for every answer engine willing to play fair. It's limited to search. We aren't sharing any content, and nothing is used to train foundation models. We intend to publish what we learn, including the benefits to site owners such as better content discoverability and reduced server strain. We plan to make the capability broadly available later this year and reduce unnecessary crawling across our network.

From Pay Per Crawl to Pay Per Use

Last year we launched Pay Per Crawl so publishers could charge AI companies for crawling their content. It was a real start, but crawling is a crude measure of value. A single page might be crawled once and then cited in thousands of answers, or crawled over and over and never used at all. Creators want to be paid fairly for the value they provide.

So we're starting to shape Pay Per Crawl into Pay Per Use. We're running experiments with top AI companies, like Ceramic.ai and You.com, and the arrangement is straightforward: organizations can bring their payment models and easily scale them to content owners across the Cloudflare network.

Ceramic has built what it calls a "pay-per-query" model, so publishers who opt in can be paid when their content appears in Ceramic's search results. This means payment is designed to follow the value the work delivers rather than the number of times a crawler happens to fetch it.

"To scale the future of AI search, we need a partner with massive reach and a shared commitment to transparency and fair compensation,” says Anna Patterson, founder and CEO of Ceramic.ai. “Cloudflare allows us to easily and programmatically scale our operations. By bringing our pay-per-query model to their network, we ensure millions of content owners can seamlessly opt in to be compensated every single time their content appears in our search results."

In addition to compensation, content owners participating in the Cloudflare/Ceramic program will unlock new reporting to help with answer engine optimization (AEO). Customers can finally see the top queries leading to their content appearing in search results, the specific webpage and snippet, their average search result ranking position, and more. This is the first of many products we’ll be launching to help our customers with discoverability.

This is just one emerging approach. Another comes from You.com: agents can pay on demand for a specific piece of premium content they need, without any upfront commitment. New payment models from AI providers are being tested (e.g., Pay per Query, Pay per Result, etc.) and we have the infrastructure to support them all.

We want to be honest that this is an experiment. There’s a lot to learn, including exactly how this holds up at the scale of the Internet. We'll work that out with our partners and our customers as we go, and share what we learn. But the goal is clear: AI search companies get fresher, better-grounded answers, and the customers whose work makes the answers possible get paid when they help. Cloudflare's job in all of this is to provide the infrastructure layer that makes this market flourish.

We think this is a more natural fit for where the economics of search are heading. The old, human web optimized search to save time — providing excerpts, ten blue links, and a click. The agentic Internet is different: an agent can read fast and search continuously. Search is becoming something an agent does dozens of times to answer a single question, closer to a utility than a destination. In that world, the unit that matters isn't the crawl or the click. It's the outcome. Pricing the outcome, and paying the people who made it possible, is how the web continues to thrive.

The headline we want to earn

A year ago on Content Independence Day, the headline was a default ‘no’: AI can’t crawl without compensation. This year, our focus is on giving our users more products and controls to say ‘yes’ and bring more benefits with it.

Today's announcements are just the beginning. Cloudflare’s research project is designed to see if our signals produce better results with less crawling. Pay Per Use is a promising direction we’ll experiment with alongside partners who believe that content creators deserve fair compensation for their work. This is how the last 30 years of the web got built too: somebody runs the pilot that turns "the model is broken" into "here's the new model," one experiment at a time. We believe there’s value to our customers to be discoverable in this new agentic era, and to optimize their content for maximum discovery. But they should be able to do this without giving away their most valuable creative assets for free.

The web is changing, and the business models it’s relied on are changing with it. The old Internet was open, neutral, and worth contributing to. We have a rare chance to keep it that way, and to build the business models that fund it in the future. Smarter answers for humans and agents asking the questions. A fair deal for the people whose skill, creativity, and commitment makes the answers worthwhile. That’s how we pursue Cloudflare’s mission: to help build a better Internet.

Happy Content Independence Day!

Building on the open, agent-ready web? If you are interested in learning more about the Ceramic and You programs, please fill out this form. If you're building an answer engine and want to crawl smarter, we’d love to hear from you too: aeo@cloudflare.com.

Your site, your rules: new AI traffic options for all customers

Jin-Hee Lee — Wed, 01 Jul 2026 13:00:00 GMT

One year ago, we declared the first Content Independence Day, and we gave website owners the means to take back control of their content. The deal between crawlers and website owners that had held up for 30 years — we crawl you, and you get referrals — was no longer true. AI was taking everything and sending back nothing, presenting an existential threat to website owners. And so we launched a one-click "Block AI Bots" option, along with a Pay-Per-Crawl marketplace.

A lot has changed in a year. Last July, conversations around “AI bots” centered around blocking AI training without compensation, pointing to the win–lose deal where content was used for model training with no value driven back to the website owner. But a desire for more nuance has emerged: Content owners still want to be able to protect their content, and they should be compensated for the original content that they work hard to create, curate, and share. We also know that locking down content isn’t a one-size-fits-all solution; website owners want more options than resorting to “block all automation, every time.”

If you run a small site, the problem isn’t just that someone could train models on your content — it's that nobody can find you in the first place. So you have to make a Faustian bargain: either show up in search and let AI train on you, or risk losing discoverability. This unfairly advantages incumbent search providers if they use the same bots for both search and training; and this unfair advantage incentivizes new players to be evasive as they try to close the competitive gap.

Now, AI can be anything

Today, AI can be in anything. Google search has changed from being sorted by AI to being a full answer engine that answers your question directly on the results page. And Google is not unique in this position — this is the direction in which “search” is moving.

We could debate the cutoff for what qualifies as “AI” today, just to find that the standard changes tomorrow. So, instead of defining a bot primarily as “AI” or not, our updated approach to classification will ask deeper questions about bot or agent behavior: What are they doing on my site? What are they storing? And how will they reshare my content?

A pragmatic taxonomy

To address these questions, we need a more nuanced view — a pragmatic taxonomy that aligns with the AI use cases our customers care about. So we are opening the discussion beyond AI training alone and focusing on three AI use cases that we want all customers to be able to manage:

Search: any behavior that collects or indexes your content, so it can answer questions about it later. The key is that Search is proactively building a database of your site to later respond to queries with. Site owners should expect to get referral traffic or other equitable compensation as a result.
Agent: automated behavior that is acting, usually in real time, on a person's behalf, to get something done right now. This includes chat fetch bots (e.g., ChatGPT-User) and browser-use agents (e.g., Gemini or Claude driving Chrome). The key is that it visits your web application in order to complete a job, and often there's a human waiting on the other end.
Training: a crawler taking your content to train or fine-tune a model. The key is that your data is permanently absorbed into the underlying architecture of the AI to improve its capabilities.

Many popular crawlers on the web fall into one of the classifications above; some fall into multiple. We classify plenty of other behaviors beyond the three above — including ads verification, feed fetching, and agentic transactions (more on this below). But we believe it should be simple for all website owners to manage access for these three AI-centered use cases. We believe that bot operators should separate their crawlers because that creates more transparency for website owners: allowing them to better understand why a given crawler is visiting them, as well as to better manage the access they extend to that crawler. If a company runs automation that builds Search indexes, acts as an Agent, and collects data to Train their models, then we strongly encourage that company to separate the automation into three separate crawlers.

We want a classification system that is scalable and representative of the world of automated traffic as it evolves. Tracking a bot’s purposes is nothing new, but our new taxonomy involves a few updates that better represent the state of bot traffic today. Most notably, we want to recognize that bots that have multiple purposes should be tracked with all purposes, not just one of them.

New options to manage AI traffic

We want to provide more options for managing different kinds of AI traffic, to all website owners on the Cloudflare network.

The managed preset to “Block AI bots” that we’ve announced in the past included single-purpose bots that crawled data for model training, as shown below:

^{Screenshot of the existing setting to manage AI bot traffic on July 1, 2025.}

But not all AI use is the same, and we want our customers to have the controls they need. So, we’re launching the ability to manage AI traffic based on three major use cases: Search, Agent, and Training crawlers. With these new options, our customers can more finely tune how they manage AI bot traffic — including customers on our Free tier.

^{Screenshot of the new options to manage AI bot traffic on July 1, 2026.}

Setting new defaults

On September 15, 2026, we’ll be setting new defaults for each of these three classifications. For all new domains onboarding to Cloudflare, the categories of Training and Agent will be blocked by default on the pages that display ads, while Search will remain allowed by default.

An ad is a signal that a website owner meant for a person to land there and see it — something monetizable that fuels the business. So, on those pages, we treat human attention as the end goal, and keep away the bots that may prevent this attention (i.e., Training and Agent bots). On the other hand, Search is the behavior that most naturally funnels back visitors, and we believe it’s in the interest of most site owners to allow this.

Another change that will apply on September 15 is that multi-purpose crawlers (specifically those that combine Search with Training) will be allowed/blocked according to all of their behaviors, in line with our call for transparency for website owners. Since the defaults will be enforced by the most restrictive applicable rules, multi-purpose crawlers such as Googlebot, Applebot, and BingBot will be blocked by customers who have selected to block Training (either through the new options to manage AI traffic, or through the legacy Block AI bots service).

Of course, customer choice is paramount: if a website owner wants to opt out of these new default configurations, they can easily mark this in their Security settings any time leading up to September 15, which will confirm that they want no changes on Training crawlers that also crawl for Search purposes. We’ll also continue to notify customers of the upcoming change to defaults as we approach September 15 to ensure that customers who want to choose settings different from the defaults have the opportunity to do so.

BotBase: a new visibility plane for Enterprise customers

We’re also excited to launch a major visibility update as a new feature of Enterprise Bot Management. As Cloudflare’s directory of tracked bots has grown, so has the desire to manage these bots in sensible groupings and to understand more detail about a particular bot.

Introducing BotBase. BotBase is our new database tracking all known bots, including Verified bots and agents. This database provides a comprehensive, searchable view of our entire directory of bots, directly on the Cloudflare dashboard. We’re tackling visibility first, but, later this year, we’ll expand BotBase to provide a direct control center for known automated content on your website.

With this new view, Enterprise Bot Management customers can see the full catalogue of all Verified bots/agents and where they are classified in this updated taxonomy — a view we’ve never shown dynamically on the Cloudflare dashboard before. Customers who want to precisely target a specific bot can also easily filter for all traffic from this bot, plus copy the detection ID to use in Security rules. All of this is now live within a dedicated page, which can be accessed through the Bot Management configuration card.

As we built BotBase, we wanted to account for all of the pieces of information that would allow us to build scalable, powerful insights from bot to bot. One of these pieces is a cornerstone for our updated taxonomy, which is based on what a bot may do on your site — its behavior. We separate these classifications as shared below, and each bot is classified with one or more of these behaviors.

Bot classification	Behaviors and uses
*Search*	*Crawling to scan your site to help it appear in search engine results*
*Agent*	*User-directed agents visiting a page on behalf of a human*
*Training*	*Crawling to train or fine-tune models*
Transact	Checkout actions on behalf of users
Data Collection	Includes price scraping, competitive intelligence gathering, and third-party analytics
Security Testing	Includes vulnerability scanning and penetration testing
SEO	SEO crawling, site auditing, accessibility checks
Ads Verification	Ad placement verification, ad fraud detection
Social / Link Preview	Link previews for social platforms and messaging apps
Feed Fetching	Includes RSS readers, podcast aggregators, and news feed bots
Monitoring & Operations	Includes uptime monitoring, webhooks, and health checks

^{Bold italicized rows indicate the new configurable options that are available to all customers.}

How does a crawler use my content?

Another piece of information we’ve heard is important to our customers is a bot’s content use — what a bot may keep and reshare after it has crawled your content. To address this, we are building capabilities for Bot Management customers to select and block based on the “content use.” This setting can be set to one of three levels, from least to most permissive:

immediate — interact, but store and reuse nothing
reference (default) — index, excerpt, and link back
full — summarize and reproduce

These values can be combined with bot classifications to express nuanced rules, such as “allow all bots that are used for Search, SEO, and Ads Verification, but only up to the reference use level.” This allows website owners to make decisions in sensible groupings rather than manage individual bot-by-bot rules.

To further support this, starting today, we're testing a new signal, use, that extends Content Signals and lives in your robots.txt. This extends the three fields of the first version of Content Signals with a fourth, optional field that expresses the same preference as above:

use=immediate
use=reference
use=full

As with all other items listed in the robots.txt file, the values of content use signal a website owner’s preference, rather than issuing blocks directly. We’re now adding support for this extension: all customers who have already enabled managed robots.txt — which prepends the preference to robots.txt that crawling for search is okay, but that crawling for training is not — will now have the additional preference of use=reference added to their robots.txt.

# Cloudflare Managed content with original Content Signals

User-agent: *
Content-Signal: search=yes,ai-train=no
Allow: /

^{The contents of Cloudflare managed robots.txt with the original Content Signals values.}

# Cloudflare Managed content with the new content-use signal

User-agent: *
Content-Signal: search=yes,ai-train=no,use=reference
Allow: /

^{The contents of Cloudflare managed robots.txt with the added parameter.}

We’re also starting to track content uses for every bot in BotBase, and when we discover a bot abusing these signals, it will lose the “Verified” status, resulting in it no longer being allowed. Today, bots that reproduce in full cannot have the Verified status.

What does it mean for a bot to be Verified?

Speaking of “Verified,” the definition of Verified is being updated to reflect the upcoming changes to default allow and block baselines. Previously, all Verified bots were allowed by default, which was reflected in our basic Bot Fight Mode offering to block unwanted automatic traffic and in our rule templates for Enterprise Bot Management customers.

Starting today, we’re adjusting this to add nuance: non-verified bots are still default blocked, but we are no longer viewing Verified as “default allowed.” Now, the Verified label makes a bot allowable with its relevant category, meaning the allowed category (e.g., allowing Search) will determine what is allowed to access a website.

To balance this change, we’re opening up the process of becoming a Verified bot, and making it more transparent, too. To "Verify" a bot, a bot operator needs to show two things: that you represent yourself honestly, and you don't abuse the access that honesty earns. And to make this easier on bot operators, we’re currently building management tools for bot operators to better ensure they are accurately represented by Cloudflare’s classification system (to be announced in the near future).

^{A preview screenshot of the upcoming platform built directly for bot operators who are part of or want to be a part of BotBase, the next generation of the Cloudflare Bots Directory.}

Experimenting with transitive trust

One more piece: The bot (or agent) at your door increasingly isn't run by the company that built it. A platform like Cloudflare’s Developer Platform runs automations for thousands of different operators at once, ranging from enterprises to a developer you've never heard of. You might trust Stripe, but you don't necessarily trust everyone who wired Stripe's tools into a weekend project.

We call the case of (site owner → bot owning company → end user) a matter of transitive trust, and we're proposing to utilize the existing Forwarded header as defined in RFC 7239 that rides along with the request and allows “proxy components to disclose information lost in the proxying process.”

This is similar to what X-Forwarded-For does for IP addresses, or X-Forwarded-Host does to preserve the original Host header. So when a website owner says, "Allow this operator," that preference will hold, whether the operator comes to you directly or through three layers of intermediaries that are trusted. More details can be found in our documentation, with a brief example to show the format below.

Forwarded: for="openai"

Adding the extension with content-use discussed above, the header addition would look something like the below, specifying how the operator says they will use the content they access:

Forwarded: for="openai";use="reference"

This also lines up the incentive model we want to foster. Losing trusted status across the more than 20% of web domains that sit behind Cloudflare is a deterrent with teeth. Trust becomes something you can carry with you, and something you can lose.

However, as bot traffic blends with human traffic, it’s possible that this system of transitive trust doesn’t carry beyond the users who can afford to be identifiable. The measures we are proposing today help to convey trust, but they won’t fit the entire web for all time. Small sources of traffic need privacy, and companies that want to preserve their own privacy commitments should be able to explore fair building blocks for the future of an agentic Internet, such as private rate limiting.

Set your terms today

These are small changes that move in the same direction: site owners get more control over who uses their content, and how. We believe the new defaults we discussed today and will soon implement are ones that encourage transparency and are more reflective of where the world is going.

Of course, the ebbs and flows of the web will continue shifting under us, and we'll keep adjusting with it. But the direction won't change, because it's the one Cloudflare started with: a web ecosystem built around trust. Where the people who make things can decide how they're used — and one where being honest about what you do earns you more access, not less.

These new options to manage AI traffic are live now, and can be configured by all existing customers in their zone Settings. Not on Cloudflare yet? Start for free to set the traffic controls that you want today.

Happy Content Independence Day.

Unmasking the crawls with Attribution Business Insights

Jin-Hee Lee — Wed, 01 Jul 2026 06:00:00 GMT

Original content is the lifeblood of conversations and curiosities. Imagine a world without it: we could find a thousand ways to regurgitate the same material that’s already been created, but we would witness the decline of fresh ideas and arguments.

Website owners fuel the ecosystem of ideas, news, and interesting tidbits, but they face the increasingly complex challenge of managing traffic to their websites and being paid for their content. While some bot traffic is clearly malicious, it isn’t always obvious when a particular AI crawler is helping or harming your business. To answer this, site owners need granular, reliable data to differentiate between traffic that provides value, and traffic that strains resources while eroding the foundation of their business model: actual humans consuming their content.

At Cloudflare, we hold a core belief: website owners have the right to control access to their content. We want to help website owners maintain their high-quality content and regulate AI traffic.

To provide much-needed clarity and help website owners take control, we’re excited to announce the new Attribution Business Insights dashboard — designed with business decision-makers and publishers in mind.

The new economics of the Internet

For decades, the business model of the Internet relied on a straightforward, unspoken agreement: website owners allowed search engines to crawl their content and, in return, search engines sent readers back to their pages. This symbiotic relationship, where traditional search engines operated with a balanced "crawl-to-referral" ratio, generated the pageviews needed to sustain advertising, affiliate revenue, and subscriptions. Search index crawlers would scan your content a couple of times for each referral sent, so making your website available to crawlers had a clear pipeline to additional revenue. We can think of this as the SEO (Search Engine Optimization) era.

Today, the explosive rise of AI crawlers and agents has broken this contract, plunging the digital publishing industry into an unprecedented crisis. The Internet is risking a transition into a "zero-click" ecosystem where AI chatbots scrape original content to synthesize instant answers — completely bypassing the original sources. We’ve already seen a marked shift from the SEO-only world into an AEO (Answer Engine Optimization) world, and now conversations around GEO (Generative Engine Optimization) are taking center stage.

The imbalance of this new reality is made clear by the crawl-to-referral ratios we see across the Internet today. While traditional search engines had a more balanced ratio of crawls to legitimate visitors referred, major AI crawlers operate on a drastically different, extractive scale. Bots from leading AI companies have been observed with a range of crawl-to-referral ratios: we noted ratios of 118:1 up to nearly 50,000:1 around the time of our Content Independence Day in 2025. In other words, an AI crawler might have crawled your premium content tens of thousands of times just to send back a single visitor. This ratio is fundamentally unfair.

For publishers, this creates a double hit: first, they’re losing out on the crucial referral traffic, ad impressions, and direct audience relationships that fund content creation and journalism. Second, they’re forced to bear the rising infrastructure costs of hosting and serving content to automated bots that offer no commercial value in return. The era in which it makes sense to allow all crawlers in the hopes of being discovered is over.

Introducing Attribution Business Insights

We want website owners to have the facts — the cold, hard numbers to understand which bots are helping their business and which bots are harming it. We also want to make this analysis easier than ever, which is why we’ve designed Attribution Business Insights to cut the noise, focusing on the details that our customers have told us are most important.

Today, the Attribution Business Insights dashboard is available to all Cloudflare Bot Management customers. The new dashboard is designed to deliver a targeted view of bot traffic flowing to your website; unlike traditional analytics tools that may require extensive manual filtering, this dashboard provides you with key insights right away.

We set out to answer the most pressing questions for site owners today: How should you think about AI traffic on your websites? What is the value of different audiences — including humans, non-AI bots, and AI bots? And most importantly, what is your data being used for?

^{The new Attribution Business Insights dashboard view, which includes insights about bot traffic overall, a site-wide crawl-to-referral ratio, and the distribution of AI bot traffic vs. organic traffic.}

To answer these questions, the dashboard displays a powerful array of data and insights:

Bot traffic to content pages: View your overall bot vs. human traffic, as well as the volume of all bots successfully accessing content.
Crawl-to-referral ratios: See your site-wide crawl-to-referral ratio on the scale of 24 hours, seven days, or 30 days. You can also see crawl-to-referral ratios per bot operator (per company that owns one or more bots).
Top bots breakdown: A list of top bots by volume, including their country of origin, bandwidth they take up on your website, and whether you’re currently blocking or allowing them.
Updated classification based on crawler behavior: We go beyond a generic label of “AI Crawler” by classifying crawlers with our updated taxonomy, whether it’s Training (i.e., training the next version of an LLM chatbot), Search (i.e., refreshing databases for Retrieval-Augmented Generation), or Agent (i.e., used in agentic interaction to return answers to an end user).

From data to business strategy

You shouldn’t have to be a security expert to understand how AI crawlers affect your business. If website owners want to spend just a few minutes ingesting the high-level insights, they can walk away with a clear temperature check of the effectiveness of their content security policy.

For those who want to do a little more digging to understand how AI companies are making use of their content — or collect information to guide how they want their relationships with AI companies to develop — we show a more granular view organized by bot operator.

^{Breakdown of bot activity on a website, with important details for each bot such as type, crawl-to-referral ratio, and current action.}

By having a consolidated view of companies seeking to access content on your website, you can develop a better baseline of crawler activity. We want this data to equip our customers to step into any business conversation with the facts on their side. Tell Company1 that their crawl volume is twenty times that of Company4’s, and that Company4 is already compensating you for content. Revisit the way that Company2 licenses your content based on their recent activity. This new dashboard propels business conversations to move forward.

How does this new layer of visibility tie into the existing tools you have to protect your website from abuse? In line with other features of Bot Management, the action step still happens in Security rules. To avoid adding noise to the control plane, Attribution Business Insights is intended to be a hub for thoughtful, filtered analytics, rather than another place to take action. This dashboard serves as a central source of information, allowing you to investigate before then taking an action in the same rule engine that governs other abuse mitigations. We also want to be loud and clear about inviting business decision-makers into this dashboard, acknowledging that conversations around AI traffic have a wider set of stakeholders than only security-specialized users.

What’s next

The Attribution Business Insights dashboard is the next critical step in providing website owners with the transparency and control they need to manage evolving AI bot threats, and more broadly, shape the new dynamics of the Internet. We’re already investigating the next iteration with close publishing partners to create a visibility plane that covers security from the perspective of the website owner with valuable, original content to share.

A sneak preview below includes a new view to dissect crawler activity per-article to reveal the appetite that AI companies have for different pieces of content, different campaigns, and so on.

^{Breakdown of most popular articles, according to traffic volume. Shows key metrics such as AI bot traffic vs. other bot traffic vs. human traffic, both direct and from a referral.}

Visibility is the first piece, and there’s more to come to empower website owners to take control of their content in this new age. We encourage all customers of Cloudflare Bot Management — especially those driving business conversations — to access this today for a fresh take on analytics.

Content Independence Day: no AI crawl without compensation!

Matthew Prince — Tue, 01 Jul 2025 10:01:00 GMT

Almost 30 years ago, two graduate students at Stanford University — Larry Page and Sergey Brin — began working on a research project they called Backrub. That, of course, was the project that resulted in Google. But also something more: it created the business model for the web.

The deal that Google made with content creators was simple: let us copy your content for search, and we'll send you traffic. You, as a content creator, could then derive value from that traffic in one of three ways: running ads against it, selling subscriptions for it, or just getting the pleasure of knowing that someone was consuming your stuff.

Google facilitated all of this. Search generated traffic. They acquired DoubleClick and built AdSense to help content creators serve ads. And acquired Urchin to launch Google Analytics to let you measure just who was viewing your content at any given moment in time.

For nearly thirty years, that relationship was what defined the web and allowed it to flourish.

But that relationship is changing. For the first time in more than a decade, the percentage of searches run on Google is declining. What's taking its place? AI.

If you're like me, you've been amazed at the new AI systems that have launched over the last two years and find yourself turning to them to answer questions that, in the past, you may have previously looked to Google. While it's still early, it seems clear that the interface of the future of the web will look more like ChatGPT than a spartan search box and ten blue links.

Google itself has changed. While ten years ago they presented a list of links and said that success was getting you off their site as quickly as possible, today they've added an answer box and more recently AI Overviews which answer users' questions without them having to leave Google.com. With the answer box, researchers have found that 75 percent of mobile queries were answered without users leaving Google. With the more recent launch of AI Overviews it's even higher.

While Google’s users may like that, it's hurting content creators. Google still copies creators’ content, but over the last 10 years, because of the changes to the UI of “search” it's gotten almost 10 times more difficult for a content creator to get the same volume of traffic. That means it's 10 times more difficult to generate value from ads, subscriptions, or the ego of knowing someone cares about what you created.

And that's the good news. It’s even worse with today’s AI tools. With OpenAI, it's 750 times more difficult to get traffic than it was with the Google of old. With Anthropic, it's 30,000 times more difficult. The reason is simple: increasingly we aren't consuming originals, we're consuming derivatives.

The problem is whether you create content to sell ads, sell subscriptions, or just to know that people value what you've created, an AI-driven web doesn't reward content creators the way that the old search-driven web did. And that means the deal that Google made to take content in exchange for sending you traffic just doesn't make sense anymore.

Instead of being a fair trade, the web is being stripmined by AI crawlers with content creators seeing almost no traffic and therefore almost no value.

That changes today, July 1, what we’re calling Content Independence Day. Cloudflare, along with a majority of the world's leading publishers and AI companies, is changing the default to block AI crawlers unless they pay creators for their content. That content is the fuel that powers AI engines, and so it's only fair that content creators are compensated directly for it.

But that's just the beginning. Next, we'll work on a marketplace where content creators and AI companies, large and small, can come together. Traffic was always a poor proxy for value. We think we can do better. Let me explain.

Imagine an AI engine like a block of swiss cheese. New, original content that fills one of the holes in the AI engine’s block of cheese is more valuable than repetitive, low-value content that unfortunately dominates much of the web today.

We believe that if we can begin to score and value content not on how much traffic it generates, but on how much it furthers knowledge — measured by how much it fills the current holes in AI engines “swiss cheese” — we not only will help AI engines get better faster, but also potentially facilitate a new golden age of high-value content creation.

We don’t know all the answers yet, but we’re working with some of the leading economists and computer scientists to figure them out.

The web is changing. Its business model will change. And, in the process, we have an opportunity to learn from what was great about the web of the last 30 years and what we can make better for the web of the future.

Cloudflare's mission is to help build a better Internet. I'm proud of the role we're playing in doing exactly that as the web evolves. And I’m proud that we’re helping content creators stick up and demand value for the content they worked hard to create.

Happy Content Independence Day!

The crawl before the fall… of referrals: understanding AI’s impact on content providers

David Belson — Tue, 01 Jul 2025 10:00:00 GMT

Content publishers welcomed crawlers and bots from search engines because they helped drive traffic to their sites. The crawlers would see what was published on the site and surface that material to users searching for it. Site owners could monetize their material because those users still needed to click through to the page to access anything beyond a short title.

Artificial Intelligence (AI) bots also crawl the content of a site, but with an entirely different delivery model. These Large Language Models (LLMs) do their best to read the web to train a system that can repackage that content for the user, without the user ever needing to visit the original publication.

The AI applications might still try to cite the content, but we’ve found that very few users actually click through relative to how often the AI bot scrapes a given website. We have discussed this challenge in smaller settings, and today we are excited to publish our findings as a new metric shown on the AI Insights page on Cloudflare Radar.

Visitors to Cloudflare Radar can now review how often a given AI model sends traffic to a site relative to how often it crawls that site. We are sharing this analysis with a broad audience so that site owners can have better information to help them make decisions about which AI bots to allow or block and so that users can understand how AI usage in aggregate impacts Internet traffic.

How does this measurement work?

As HTML pages are arguably the most valuable content for these crawlers, the ratios displayed are calculated by dividing the total number of requests from relevant user agents associated with a given search or AI platform where the response was of Content-type: text/html by the total number of requests for HTML content where the Referer header contained a hostname associated with a given search or AI platform.

The diagrams below illustrate two common crawling scenarios, and show that companies may use different user agents depending on the purpose of the crawler. The top one represents a simple transaction where the example AI platform is requesting content for the purposes of training an LLM, representing itself as AIBot. The bottom one represents a scenario where the example AI platform is requesting content to service a user request — looking for flight information, for example. In this case, it is representing itself as AIBot-User. Request traffic from both of these user agents would be aggregated under a single platform name for the purposes of our analysis.

When a user clicks on a link on a website or application, the client will often send a Referer: header as part of the request to the target site. In the diagram below, the example AI platform has returned content that contains links to external sites in response to a user interaction. When the user clicks on a link, a request is made to the content provider that includes ai.example.com in the Referer: header, letting them know where that request traffic came from. Hostnames are associated with their respective platforms for the purpose of our analysis.

Observations

Reviewing the ratios

The new metric is presented as a simple table, comparing the number of aggregate HTML page requests from crawlers (user agents) associated with a given platform to the number of HTML page requests from clients referred by a hostname associated with a given platform. The calculated ratio is always normalized to a single referral request.

The table below shows that for the period June 19-26, 2025, as an example, the ratios range from Anthropic’s 70,900:1 down to Mistral’s 0.1:1. This means that Anthropic’s AI platform Claude made nearly 71,000 HTML page requests for every HTML page referral, while Mistral sent 10x as many referrals as crawl requests. (However, traffic referred by Claude’s native app does not include a Referer: header, and we believe that the same holds true for traffic generated from other native apps as well. As such, because the referral counts only include traffic from the Web-based tools from these providers, these calculations may overstate the respective ratios, but it is unclear by how much.)

Of course, due in part to changes in crawling patterns, these ratios will change over time. The table above also displays the ratio changes as compared to the previous period, with changes ranging from increases of over 6% for DuckDuckGo and Yandex to Google’s 19.4% decrease. The week-over-week drop in Google’s ratio is related to an observed drop in crawling traffic from GoogleBot starting on June 24, while Yandex’s week-over-week growth is related to an observed increase in YandexBot crawling activity that started on June 21, as seen in the graphs below.

Radar’s Data Explorer includes a time series view of how these ratios change over time, such as in the Baidu example below. The time series data is also available through an API endpoint.

Patterns in referral traffic

Changes and trends in the underlying activity can be seen in the associated Data Explorer view, as well as in the raw data available via API endpoints (timeseries, summary). Note that the shares of both referral and crawl traffic are relative to the sets of referrers and crawlers included in the graphs, and not Cloudflare traffic overall.

For example, in the referrer-centric view below, covering nearly the first four weeks of June 2025, we can see that referral traffic is dominated by search platform Google, with a fairly consistent diurnal pattern visible in the data. (The google.* entry covers referral traffic from the main google.com site, as well as local sites, such as google.es or google.com.tw.) Because of prefetching driven by the use of speculation rules, referral traffic coming from Google’s ASN (AS15169) is specifically excluded from analysis here, as it doesn’t represent active user consumption of content.

Clear diurnal patterns are also visible in the referral request shares of other search platforms, although the request shares are a fraction of what is seen from Google.

Throughout June, the share of traffic referred by AI platforms was significantly lower, even in aggregate, than the share of traffic referred by search platforms.

Changes in crawling traffic

As noted above, the change in ratio values over time can be driven by shifts in crawling activity. These shifts are visible in the crawling traffic shares available in Data Explorer, as well as in the raw data available via API endpoints (timeseries, summary). In the crawler-centric view below, covering nearly the first four weeks of June 2025, we can see that the share of requests related to Google’s crawling activity for both their Googlebot and GoogleOther identifiers falls over the course of the month, with several peak/valley periods. A similar pattern observed in HTTP request traffic from Google’s AS15169 during that same time period loosely matches this observed drop in share.

In addition, it appears that OpenAI’s GPTBot saw multiple periods where little-to-no crawling activity was observed throughout the month.

What this means for content providers

These ratios directly impact the viability of content publication on the Internet. While they will vary over time, the trend continues to be more crawls and fewer referrals when compared in relation to each other. Legacy search index crawlers would scan your content a couple of times, or less, for each visitor sent. A site’s availability to crawlers made their revenue model more viable, not less.

The new data we are observing suggests that is no longer the case. These models continue to consume more content, more frequently, despite sending the same or less traffic to the source of its content.

We have released new tools over the last year to help site owners take control back. With a single click, publishers can block the kinds of AI crawlers that train against their content. And today, we announced new ways to make the exchange of value fair for both sides of the equation. However, we continue to recommend that content creators audit and then enforce their preferred policies for AI crawlers.

One more thing…

In addition to providing these new insights around crawling and referral traffic and associated trends, we’ve also taken the opportunity to launch expanded Verified Bots content. The Bots page on Cloudflare Radar includes a paginated list of Verified Bots, displaying the bot name, owner, category, and rank (based on request volume). This list has now been expanded into a standalone directory in a new Bots section. The directory, shown below, displays a card for each Verified Bot, showing the bot name, a description, the bot owner and category, and verification status. Users can search the directory by bot name, owner, or description, and can also filter by category (selecting just Monitoring & Analytics bots, for example).

Clicking on a bot name within a card brings up a bot-specific page that includes metadata about the bot, information on how the bot’s user agent is represented in HTTP request headers and how it should be specified in robots.txt directives, and a traffic graph that shows associated HTTP request volume trends for the selected time period (with a default comparison to the previous period). Associated data is also available via the API. As we add additional information to these bot-specific pages in the future, we will document the updates in Changelog entries.

Control content use for AI training with Cloudflare’s managed robots.txt and blocking for monetized content

Jin-Hee Lee — Tue, 01 Jul 2025 10:00:00 GMT

Cloudflare is giving all website owners two new tools to easily control whether AI bots are allowed to access their content for model training. First, customers can let Cloudflare create and manage a robots.txt file, creating the appropriate entries to let crawlers know not to access their site for AI training. Second, all customers can choose a new option to block AI bots only on portions of their site that are monetized through ads.

The new generation of AI crawlers

Creators that monetize their content by showing ads depend on traffic volume. Their livelihood is directly linked to the number of views their content receives. These creators have allowed crawlers on their sites for decades, for a simple reason: search crawlers such as Googlebot made their sites more discoverable, and drove more traffic to their content. Google benefitted from delivering better search results to their customers, and the site owners also benefitted through increased views, and therefore increased revenues.

But recently, a new generation of crawlers has appeared: bots that crawl sites to gather data for training AI models. While these crawlers operate in the same technical way as search crawlers, the relationship is no longer symbiotic. AI training crawlers use the data they ingest from content sites to answer questions for their own customers directly, within their own apps. They typically send much less traffic back to the site they crawled. Our Radar team did an analysis of crawls and referrals for sites behind Cloudflare. As HTML pages are arguably the most valuable content for these crawlers, we calculated crawl ratios by dividing the total number of requests from relevant user agents associated with a given search or AI platform where the response was of Content-type: text/html by the total number of requests for HTML content where the Referer: header contained a hostname associated with a given search or AI platform. As of June 2025, we find that Google crawls websites about 14 times for every referral. But for AI companies, the crawl-to-refer ratio is orders of magnitude greater. In June 2025, OpenAI’s crawl-to-referral ratio was 1,700:1, Anthropic’s 73,000:1. This clearly breaks the “crawl in exchange for traffic” relationship that previously existed between search crawlers and publishers. (Please note that this calculation reflects our best estimate, recognizing that traffic referred by native apps may not always be attributed to a provider due to a lack of a Referer: header, which may affect the ratio.)

And while sites can use robots.txt to tell these bots not to crawl their site, most don’t take this first step. We found that only about 37% of the top 10,000 domains currently have a robots.txt file, showing that robots.txt is underutilized in this age of evolving crawlers.

That’s where Cloudflare comes in. Our mission is to help build a better Internet, and a better Internet is one with a huge thriving ecosystem of independent publishers. So, we’re taking action to keep that ecosystem alive.

Giving ALL customers full control

Protecting content creators isn’t new for Cloudflare. In July 2024, we gave everyone on the Cloudflare network a simple way to block all AI scrapers with a single click for free. We’ve already seen more than 1 million customers enable this feature, which has given us some interesting data.

Since our last update, we can see that Bytespider, our previous top bot, has seen traffic volume decline 71.45% since the first week of July 2024. During the same time, we saw an increased number of Bytespider requests that customers chose to specifically block. In contrast, GPTBot traffic volume has grown significantly as it has become more popular, now even surpassing traffic we see from big traditional tech players like Amazon and ByteDance.

The share of sites accessed by particular crawlers has gone down across the board since our last update. Previously, Bytespider accessed >40% of websites protected by Cloudflare, but that number has dropped to only 9.37%. GPTBot has taken the top spot for most sites accessed, but while its request volume has grown significantly (noted above), the share of sites it crawls has actually decreased since last year from 35.46% to 28.97%, with an increase in customers blocking.

AI Bot	Share of Websites Accessed
GPTBot	28.97%
Meta-ExternalAgent	22.16%
ClaudeBot	18.80%
Amazonbot	14.56%
Bytespider	9.37%
GoogleOther	9.31%
ImageSiftBot	4.45%
Applebot	3.77%
OAI-SearchBot	1.66%
ChatGPT-User	1.06%

And while AI Search and AI Assistant crawling related activity has exploded in popularity in the last 6 months, we still see their total traffic pale in comparison to AI training crawl activity, which has seen a 65% increase in traffic over the past 6 months.

To this end, we launched free granular auditing in September 2024 to help customers understand which crawlers were accessing their content most often, and created simple templates to block all or specific crawlers. And in December 2024, we made it easy for publishers to automatically block crawlers that weren’t respecting robots.txt. But we realized many sites didn’t have the time to create or manage their own robots.txt file. Today, we’re going two steps further.

Step 1: fully managed robots.txt

When it comes to managing your website’s visibility to search engine crawlers and other bots, the robots.txt file is a key player. This simple text file acts like a traffic controller, signaling to bots which parts of the website they should or should not access. We can think of robots.txt as a "Code of Conduct" sign posted at a community pool, listing general dos and don'ts, according to the pool owner’s wishes. While the sign itself does not enforce the listed directives, well-behaved visitors will still read the sign and follow the instructions they see. On the other hand, poorly-behaved visitors who break the rules risk getting themselves banned.

What do these files actually look like? Take Google’s as an example, visible to anyone at https://www.google.com/robots.txt. Parsing its contents, you'll notice four directives in the set of instructions: User-agent, Disallow, Allow, and Sitemap. In a robots.txt file, the User-agent directive specifies which bots the rules apply to. The Disallow directive tells those bots which parts of the website they should avoid. In contrast, the Allow directive grants specific bots permission to access certain areas. Finally, the Sitemap directive shows a bot which pages it can reach, so that it won’t miss any important pages. The Internet Engineering Task Force (IETF) formalized the definition and language for the Robots Exclusion Protocol in RFC 9309, specifying the exact syntax and precedence of these directives. It also outlines how crawlers should handle errors or redirects while stressing that compliance is voluntary and does not constitute access control.

Website owners should have agency over AI bot activity on their websites. We mentioned that only 37% of the top 10,000 domains on Cloudflare even have a robots.txt file. Of those robots files that do exist, few include Disallow directives for the top AI Bots that we see on a daily basis. For instance, as of publication, GPTBot is only disallowed in 7.8% of the robots.txt files found for the top domains; Google-Extended only shows up in 5.6%; anthropic-ai, PerplexityBot, ClaudeBot, and Bytespider each show up in under 5%. Furthermore, the difference between the 7.8% of Disallow directives for GPTBot and the ~5% of Disallow directives for other major AI crawlers suggests a gap between the desire to prevent your content from being used for AI model training and the proper configuration that accomplishes this by calling out bots like Google-Extended. (After all, there’s more to stopping AI crawlers than disallowing GPTBot.)

Along with viewing the most active bots and crawlers, Cloudflare Radar also shares weekly updates on how websites are handling AI bots in their robots.txt files. We can examine two snapshots below, one from June 2025 and the other from January 2025:

_{Radar snapshot from the week of June 23, 2025, showing the top AI user agents mentioned in the Disallow directive in robots.txt files across the top 10,000 domains. The 3 bots with the highest number of Disallows are GPTBot, CCBot, and facebookexternalhit.}

_{Radar snapshot from the week of January 26, 2025, showing the top AI user agents mentioned in the Disallow directive in robots.txt files across the top 10,000 domains. The 3 bots with the highest number of Disallows are GPTBot, CCBot, and anthropic-ai.}

From the above data, we also observe that fewer than 100 new robots.txt files have been added among the top domains between January and June. One visually striking change is the ratio of dark blue to light blue: compared to January, there is a steep decrease in “Partially Disallowed” permissions; websites are now flat-out choosing “Fully Disallowed” for the top AI crawlers, including GPTBot, CCBot, and Google-Extended. This underscores the changing landscape of web crawling, particularly the relationship of trust between website owners and AI crawlers.

Putting up a guardrail with Cloudflare’s managed robots.txt

Many website owners have told us they’re in a tricky spot in this new era of AI crawlers. They’ve poured time and effort into creating original content, have published it on their own sites, and naturally want it to reach as many people as possible. To do that, website owners make their sites accessible to search engine crawlers, which index the content and make it discoverable in search results. But with the rise of AI-powered crawlers, that same content is now being scraped not just for indexing, but also to train AI models, often without the creator’s explicit consent. Take Googlebot, for example: it’s an absolute requirement for most website owners to allow for SEO. But Google crawls with user agent Googlebot for both SEO and AI training purposes. Specifically disallowing Google-Extended (but not Googlebot) in your robots.txt file is what communicates to Google that you do not want your content to be crawled to feed AI training.

So, what if you don’t want your content to serve as training data for the next AI model, but don’t have the time to manually maintain an up-to-date robots.txt file? Enter Cloudflare’s new managed robots.txt offering. Once enabled, Cloudflare will automatically update your existing robots.txt or create a robots.txt file on your site that includes directives asking popular AI bot operators to not use your content for AI model training. For instance, Cloudflare’s managed robots.txt signals your preference to Google-Extended and Applebot-Extended, amongst others, that they should not crawl your site for AI training, while keeping your domain(s) SEO-friendly.

^{Cloudflare dashboard snapshot of the new managed robots.txt activation toggle}

This feature is available to all customers, meaning anyone can enable this today from the Cloudflare dashboard. Once enabled, website owners who previously had no robots.txt file will now have Cloudflare’s managed bot directives live on their website. What about website owners who already have a robots.txt file? The contents of Cloudflare’s managed robots.txt will be prepended to site owners’ existing file. This way, their existing Block directives – and the time and rationale put into customizing this file – are honored, while still ensuring the website has AI crawler guardrails managed by Cloudflare.

As the AI bot landscape changes with new bots on the rise, Cloudflare will keep our customers a step ahead by updating the directives on our managed robots.txt, so they don’t have to worry about maintaining things on their own. Once enabled, customers won’t need to take any action in order for any updates of the managed robots.txt content to go live on their site.

We believe that managing crawling is key to protecting the open Internet, so we’ll also be encouraging every new site that onboards to Cloudflare to enable our managed robots.txt. When you onboard a new site, you’ll see the following options for managing AI crawlers:

This makes it effortless to ensure that every new customer or domain onboarded to Cloudflare gives clear directives to how they want their content used.

Under the hood: technical implementation

To implement this feature, we developed a new module that intercepts all inbound HTTP requests for /robots.txt. For all such requests, we’ll check whether the zone has opted in to use Cloudflare’s managed robots.txt by reading a value from our distributed key-value store. If they have, the module then responds with the Cloudflare’s managed robots.txt directives, prepended to the origin’s robot.txt if there is an existing file. We prepend so we can add a generalized header that instructs all bots on the customers preferences for data use, as defined in the IETF AI preferences proposal. Note that in robots.txt, the most specific match must always be used, and since our disallow expressions are scoped to cover everything, we can ensure a directive we prepend will never conflict with a more targeted customer directive. If the customer has not enabled this feature, the request is forwarded to the origin server as usual, using whatever the customer has written in their own robots.txt file. (While caching origin's robots.txt could reduce latency by eliminating a round trip to the origin, the impact on overall page load times would be minimal, as robots.txt requests comprise a small fraction of total traffic. Adding cache update/invalidation would introduce complexity with limited benefit, so we prioritized functionality and reliability in our implementation.)

Step 2: block, but only where you show ads

Adding an entry to your robots.txt file is the first step to telling AI bots not to crawl you. But robots.txt is an honor system. Nothing forces bots to follow it. That’s why we introduced our one-click managed rule to block all AI bots across your zone. However, some customers want AI bots to visit certain pages, like developer or support documentation. For customers who are hesitant to block everywhere, we have a brand-new option: let us detect when ads are shown on a hostname, and we will block AI bots ONLY on that hostname. Here’s how we do it.

First, we use multiple techniques to identify if a request is coming from an AI bot. The easiest technique is to identify well-behaved crawlers that publicly declare their user agent, and use dedicated IP ranges. Often we work directly with these bot makers to add them to our Verified Bot list.

Many bot operators act in good faith by publicly publishing their user agents, or even cryptographically verifying their bot requests directly with Cloudflare. Unfortunately, some attempt to appear like a real browser by using a spoofed user agent. It's not new for our global machine learning models to recognize this activity as a bot, even when operators lie about their user agent. When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we’re able to fingerprint, and we use Cloudflare’s network of over 57 million requests per second on average, to understand how much we should trust the fingerprint. We compute global aggregates across many signals, and based on these signals, our models are able to consistently and appropriately flag traffic from evasive AI bots.

When we see a request from an AI bot, our system checks if we have previously identified ads in the response served by the target page. To do this, we inspect the “response body” — the raw HTML code of the web page being sent back. After parsing the HTML document, we perform a comprehensive scan for code patterns commonly found in ad units, which signals to us that the page is serving an ad. Examples of such code would be:

Here, the div-container has the ui-advert class commonly used for advertising. Similarly, links to commonly used ad servers like Google Syndication are a good signal as well, such as the following:

By streaming and directly parsing small chunks of the response using our ultra-fast LOL HTML parser, we can perform scans without adding any latency to the inspected response.

So as not to reinvent the wheel, we are adopting techniques similar to those that ad blockers have been using for years. Ad blockers fundamentally perform two separate tasks to block advertisements in a browser. The first is to block the browser from fetching resources from ad servers, and the second is to suppress displaying HTML elements that contain ads. For this, ad blockers rely on large filter lists such as EasyList that contain both so-called URL block filters that match outgoing request URLs against a set of patterns, and block them if they match one of the filters, and CSS selectors that are designed to match HTML ad elements.

We can use both of these techniques to detect if an HTML response contains ads by checking external resources (e.g. content referenced by HREF or SCRIPT tags) against URL block filters, and the HTML elements themselves against CSS selectors. Because we do not actually need to block every single advertisement on a site, but rather detect the overall presence of ads on a site, we can achieve the same detection efficacy when shrinking the number of CSS and URL filters down from more than 40,000 in EasyList to the 400 most commonly seen ones to increase our computational efficiency.

Because some sites load ads dynamically rather than directly in the returned HTML (partially to avoid ad blocking), we enrich this first information source with data from Content Security Policy (CSP) reports. The Content Security Policy standard is a security mechanism that helps web developers control the resources (like scripts, stylesheets, and images) a browser is allowed to load for a specific web page, and browsers send reports about loaded resources to a CSP management system, which for many sites is Cloudflare’s Page Shield product. These reports allow us to relate scripts loaded from ad servers directly with page URLs. Both of these information sources are consumed by our endpoint management service, which then matches incoming requests against hostnames that we already know are serving ads.

We do all of this on every request for any customer who opts in, even free customers.

To enable this feature, simply navigate to the Security > Settings > Bots section of the Cloudflare dashboard, and choose either Block on pages with Ads or Block Everywhere.

The AI bot hunt: finding and identifying bots

The AI bot landscape has exploded and continues to grow with an exponential trajectory as more and more operators come online. At Cloudflare, our team of security researchers are constantly identifying and classifying different AI-related crawlers and scrapers across our network.

There are two major ways in which we track AI bots and identify those that are poorly behaved:

1. Our customers play a crucial role by directly submitting reports of misbehaved AI bots that may not yet be classified by Cloudflare. (If you have an AI bot that comes to mind here, we’d love for you to let us know through our bots submission form today.) Once such a bot comes to our attention, our security analysts investigate to determine how it should be categorized.

2. We’re able to derive insights through analysis of the massive scale of our customers’ traffic that we observe. Specifically, we can see which AI agents visit which websites and when, drawing out trends or patterns that might make a website owner want to disallow a given AI bot. This bird’s-eye view on abusive AI bot behavior was paramount as we started to determine the content of a managed robots.txt.

What’s next?

Our new managed robots.txt and blocking AI bots on pages with ads features are available to all Cloudflare customers, including everyone on a Free plan. We encourage customers to start using them today – to take control over how the content on your website gets used. Looking ahead, Cloudflare will monitor the IETF’s pending proposal allowing website publishers to control how automated systems use their content and update our managed robots.txt accordingly. We will also continue to provide more granular control around AI bot management and investigate new distinguishing signals as AI bots become more and more precise. And if you’ve seen suspicious behavior from an AI scraper, contribute to the Internet ecosystem by letting us know!

Introducing pay per crawl: Enabling content owners to charge AI crawlers for access

Will Allen — Tue, 01 Jul 2025 10:00:00 GMT

A changing landscape of consumption

Many publishers, content creators and website owners currently feel like they have a binary choice — either leave the front door wide open for AI to consume everything they create, or create their own walled garden. But what if there was another way?

At Cloudflare, we started from a simple principle: we wanted content creators to have control over who accesses their work. If a creator wants to block all AI crawlers from their content, they should be able to do so. If a creator wants to allow some or all AI crawlers full access to their content for free, they should be able to do that, too. Creators should be in the driver’s seat.

After hundreds of conversations with news organizations, publishers, and large-scale social media platforms, we heard a consistent desire for a third path: They’d like to allow AI crawlers to access their content, but they’d like to get compensated. Currently, that requires knowing the right individual and striking a one-off deal, which is an insurmountable challenge if you don’t have scale and leverage.

What if I could charge a crawler?

We believe your choice need not be binary — there should be a third, more nuanced option: You can charge for access. Instead of a blanket block or uncompensated open access, we want to empower content owners to monetize their content at Internet scale.

We’re excited to help dust off a mostly forgotten piece of the web: HTTP response code 402.

Introducing pay per crawl

Pay per crawl, in private beta, is our first experiment in this area.

Pay per crawl integrates with existing web infrastructure, leveraging HTTP status codes and established authentication mechanisms to create a framework for paid content access.

Each time an AI crawler requests content, they either present payment intent via request headers for successful access (HTTP response code 200), or receive a 402 Payment Required response with pricing. Cloudflare acts as the Merchant of Record for pay per crawl and also provides the underlying technical infrastructure.

Publisher controls and pricing

Pay per crawl grants domain owners full control over their monetization strategy. They can define a flat, per-request price across their entire site. Publishers will then have three distinct options for a crawler:

Allow: Grant the crawler free access to content.
Charge: Require payment at the configured, domain-wide price.
Block: Deny access entirely, with no option to pay.

An important mechanism here is that even if a crawler doesn’t have a billing relationship with Cloudflare, and thus couldn’t be charged for access, a publisher can still choose to ‘charge’ them. This is the functional equivalent of a network level block (an HTTP 403 Forbidden response where no content is returned) — but with the added benefit of telling the crawler there could be a relationship in the future.

While publishers currently can define a flat price across their entire site, they retain the flexibility to bypass charges for specific crawlers as needed. This is particularly helpful if you want to allow a certain crawler through for free, or if you want to negotiate and execute a content partnership outside the pay per crawl feature.

To ensure integration with each publisher’s existing security posture, Cloudflare enforces Allow or Charge decisions via a rules engine that operates only after existing WAF policies and bot management or bot blocking features have been applied.

Payment headers and access

As we were building the system, we knew we had to solve an incredibly important technical challenge: ensuring we could charge a specific crawler, but prevent anyone from spoofing that crawler. Thankfully, there’s a way to do this using Web Bot Auth proposals.

For crawlers, this involves:

Generating an Ed25519 key pair, and making the JWK-formatted public key available in a hosted directory
Registering with Cloudflare to provide the URL of your key directory and user agent information.
Configuring your crawler to use HTTP Message Signatures with each request.

Once registration is accepted, crawler requests should always include signature-agent, signature-input, and signature headers to identify your crawler and discover paid resources.

GET /example.html
Signature-Agent: "https://signature-agent.example.com"
Signature-Input: sig2=("@authority" "signature-agent")
 ;created=1735689600
 ;keyid="poqkLGiymh_W0uP6PZFw-dvez3QJT5SolqXBCW38r0U"
 ;alg="ed25519"
 ;expires=1735693200
;nonce="e8N7S2MFd/qrd6T2R3tdfAuuANngKI7LFtKYI/vowzk4lAZYadIX6wW25MwG7DCT9RUKAJ0qVkU0mEeLElW1qg=="
 ;tag="web-bot-auth"
Signature: sig2=:jdq0SqOwHdyHr9+r5jw3iYZH6aNGKijYp/EstF4RQTQdi5N5YYKrD+mCT1HA1nZDsi6nJKuHxUi/5Syp3rLWBA==:

Accessing paid content

Once a crawler is set up, determination of whether content requires payment can happen via two flows:

Reactive (discovery-first)

Should a crawler request a paid URL, Cloudflare returns an HTTP 402 Payment Required response, accompanied by a crawler-price header. This signals that payment is required for the requested resource.

HTTP 402 Payment Required
crawler-price: USD XX.XX

The crawler can then decide to retry the request, this time including a crawler-exact-price header to indicate agreement to pay the configured price.

GET /example.html
crawler-exact-price: USD XX.XX

Proactive (intent-first)

Alternatively, a crawler can preemptively include a crawler-max-price header in its initial request.

GET /example.html
crawler-max-price: USD XX.XX

If the price configured for a resource is equal to or below this specified limit, the request proceeds, and the content is served with a successful HTTP 200 OK response, confirming the charge:

HTTP 200 OK
crawler-charged: USD XX.XX 
server: cloudflare

If the amount in a crawler-max-price request is greater than the content owner’s configured price, only the configured price is charged. However, if the resource’s configured price exceeds the maximum price offered by the crawler, an HTTP 402 Payment Required response is returned, indicating the specified cost. Only a single price declaration header, crawler-exact-price or crawler-max-price, may be used per request.

The crawler-exact-price or crawler-max-price headers explicitly declare the crawler's willingness to pay. If all checks pass, the content is served, and the crawl event is logged. If any aspect of the request is invalid, the edge returns an HTTP 402 Payment Required response.

Financial settlement

Crawler operators and content owners must configure pay per crawl payment details in their Cloudflare account. Billing events are recorded each time a crawler makes an authenticated request with payment intent and receives an HTTP 200-level response with a crawler-charged header. Cloudflare then aggregates all the events, charges the crawler, and distributes the earnings to the publisher.

Content for crawlers today, agents tomorrow

At its core, pay per crawl begins a technical shift in how content is controlled online. By providing creators with a robust, programmatic mechanism for valuing and controlling their digital assets, we empower them to continue creating the rich, diverse content that makes the Internet invaluable.

We expect pay per crawl to evolve significantly. It’s very early: we believe many different types of interactions and marketplaces can and should develop simultaneously. We are excited to support these various efforts and open standards.

For example, a publisher or new organization might want to charge different rates for different paths or content types. How do you introduce dynamic pricing based not only upon demand, but also how many users your AI application has? How do you introduce granular licenses at internet scale, whether for training, inference, search, or something entirely new?

The true potential of pay per crawl may emerge in an agentic world. What if an agentic paywall could operate entirely programmatically? Imagine asking your favorite deep research program to help you synthesize the latest cancer research or a legal brief, or just help you find the best restaurant in Soho — and then giving that agent a budget to spend to acquire the best and most relevant content. By anchoring our first solution on HTTP response code 402, we enable a future where intelligent agents can programmatically negotiate access to digital resources.

Getting started

Pay per crawl is currently in private beta. We’d love to hear from you if you’re either a crawler interested in paying to access content or a content creator interested in charging for access. You can reach out to us at http://www.cloudflare.com/paypercrawl-signup/ or contact your Account Executive if you’re an existing Enterprise customer.

From Googlebot to GPTBot: who’s crawling your site in 2025

João Tomé — Tue, 01 Jul 2025 10:00:00 GMT

Web crawlers are not new. The World Wide Web Wanderer debuted in 1993, though the first web search engines to truly use crawlers and indexers were JumpStation and WebCrawler. Crawlers are part of one of the backbones of the Internet’s success: search. Their main purpose has been to index the content of websites across the Internet so that those websites can appear in search engine results and direct users appropriately. In this blog post, we’re analyzing recent trends in web crawling, which now has a crucial and complex new role with the rise of AI.

Not all crawlers are the same. Bots, automated scripts that perform tasks across the Internet, come in many forms: those considered non-threatening or “good” (such as API clients, search indexing bots like Googlebot, or health checkers) and those considered malicious or “bad” (like those used for credential stuffing, spam, or scraping content without permission). In fact, around 30% of global web traffic today, according to Cloudflare Radar data, comes from bots, and even exceeds human Internet traffic in some locations.

A new category, AI crawlers, has emerged in recent years. These bots collect data from across the web to train AI models, improving tools and experiences, but also raising issues around content rights, unauthorized use, and infrastructure overload. We aimed to confirm the growth of both search and AI crawlers, examine specific AI crawlers, and understand broader crawler usage.

This is increasingly relevant with the rapid adoption of AI, growing content rights concerns, and data privacy discussions. Some sites and creators are looking to limit or block AI crawlers using tools like robots.txt or firewall rules. Others, like Dutch indie maker and entrepreneur Pieter Levels, have embraced them: “I’m 100% fine with AI crawlers… very important to rank in LLMs [large language models]”.

It’s important to note that crawlers serve different purposes. For example, the facebookexternalhit bot is not included in this analysis, as it is used by Facebook to fetch page content when generating previews for shared links. However, within this post, we are only focusing on AI and search crawlers that are indexing or scraping website content.

AI-only crawlers perspective

Let’s start with an AI-only crawler perspective that we currently have on Cloudflare Radar, focused only on crawlers advertised as AI-related. To identify them, we’re using here a list derived from an open-source project that helps website owners manage and control access to AI crawlers — especially those used to train large language models (LLMs). It also provides guidance on what to include in robots.txt files (more on that below). The data shown below is based on matching those crawler names with user-agent strings in HTTP requests. (Further details, including one exception, about this method can be found at the end of the blog post.)

The AI crawler landscape saw a significant shift between May 2024 and May 2025, with GPTBot (from OpenAI) emerging as the dominant force, surging from 5% to 30% share, and Meta-ExternalAgent (from Meta) making a strong new entry at 19%. This growth came at the expense of former leader Bytespider, which plummeted from 42% to 7%, as well as other AI crawlers like ClaudeBot and Amazonbot, which also saw declines. Our data clearly indicates a reordering of top AI crawlers, highlighting the increasing prominence of OpenAI and Meta in this category.

May 2024

May 2025

Rank	Bot Name	Share (May 2024)	Rank	Bot Name	Share (May 2025)
1	Bytespider	42%	1	GPTBot	30%
2	ClaudeBot	27%	2	ClaudeBot	21%
3	Amazonbot	21%	3	Meta-ExternalAgent	19%
4	GPTBot	5%	4	Amazonbot	11%
5	Applebot	4.1%	5	Bytespider	7.2%

For additional context, the list below includes further information about the bots with higher crawling shares seen above. This information comes from the same open-source list mentioned above and from publications by companies like OpenAI, which explain how their crawlers are used.

GPTBot – OpenAI’s crawler used to improve and train large language models like ChatGPT.
ClaudeBot – Anthropic’s crawler for training and updating the Claude AI assistant.
Meta-ExternalAgent – Meta’s bot likely used for collecting data to train or fine-tune LLMs.
Amazonbot – Amazon’s crawler that gathers data for its search and AI applications.
Bytespider – ByteDance’s AI data collector, often linked to training models like Ernie or TikTok-related AI.
Applebot – Apple’s web crawler primarily for Siri and Spotlight search, possibly used in AI development.
OAI-SearchBot – OpenAI’s search-focused crawler, likely used for retrieving real-time web info for models.
ChatGPT-User – Represents API-based or browser usage of ChatGPT in connection with user interactions.
PerplexityBot – Crawler from Perplexity.ai, which powers their AI answer engine using real-time web data.

Webmasters can inform crawler operators of whether they want these bots and crawlers to access their content by setting out rules in a file called robots.txt, which tells crawlers what pages they should or shouldn’t access. As we’ve seen recently, crawlers honoring your robots.txt policies is voluntary, but Cloudflare announced tools like AI Audit to help content creators to enforce it.

Now, as we’ve seen, the landscape of web crawling is evolving rapidly, driven by the merging roles of search engines and AI. AI is now deeply integrated into search, seen in Google’s AI Overviews and AI Mode, but also in social media platforms, like Meta AI on Instagram. So, let's broaden our analysis to include these wider AI-driven crawling activities.

General AI and search crawling growth: +18%

A broader view reveals the growth of crawling traffic from both search and AI crawlers over the first few months of 2025. To remove customer growth bias, we'll analyze trends using a fixed set of customers from specific weeks (a method we’ve used in our Cloudflare Radar Year in Review): the first week of May 2024, a week in November 2024, and the first week of April 2025.

Using that method, we found that AI and search crawler traffic grew by 18% from May 2024 to May 2025 (comparing full-month periods). The increase was even higher, at 48%, when including new Cloudflare customers added during that time. Peak AI and search crawling traffic occurred in April 2025, with a 32% increase compared to May 2024. This confirms that crawling traffic has clearly risen over the past year, but also that growth is not always constant. Google remains the dominant player, and its share is growing too, as we’ll see in the next section.

As the next chart shows, crawling traffic increased sharply in March and April 2025 and remained high, though slightly lower, in May.

The patterns on the above crawling chart also seem to reflect broader seasonal patterns and general human Internet traffic patterns. In 2024, traffic dropped during the summer in the Northern Hemisphere, with August and September being the least active months. And like overall Internet traffic, it then rose in November, when people are typically more online due to shopping and seasonal habits, as we've seen in past analyses.

Googlebot crawling grew 96% in one year

Googlebot, which indexes content for Google Search, was clearly the top crawler throughout the period and showed strong growth, up 96% from May 2024 to May 2025, reflecting increased crawling by Google. Crawling traffic peaked in April 2025, reaching 145% higher than in May 2024. It's also important to mention that Google made changes to its search and launched AI Overviews in its search engine during this time — first in the US in May 2024, then in more countries later.

Two trends stand out when looking at daily data for Google-related crawlers, as shown in the graph below. First, Googlebot and the more recent GoogleOther (a web crawler from 2023 for “research and development”) account for most of Google’s crawling activity. Second, there were two visible drops in crawling traffic: one on December 14, 2024 (around a Google Search update), and another from May 20 to May 28, 2025. That May 20 drop occurred around the same time as the rollout of AI Mode on Google Search in the US, although the timing may be coincidental.

Breakdown of top 20 AI and search web crawlers

Ranking crawlers by their share of total requests gives a clearer picture of which bots are gaining or losing ground, especially among those focused on search and AI. The table below shows a clear trend: some AI bots have grown rapidly since last year (with growth beginning even earlier), while many traditional search crawlers have remained flat or lost share (as in the case of Bing and its Bingbot crawler). The main exception is Googlebot.

The next table shows the percentage share of each crawler out of all crawling traffic generated by this specific cohort of over 30 AI & search crawlers observed by Cloudflare in May 2024 and May 2025. The table below also includes the change in percentage points and the growth or decline in raw request volume. Crawlers are ranked by their share in May 2025. Key crawler shifts include GPTBot rising sharply (+305%), while Bytespider dropped dramatically (-85%).

Rank	Bot name	Share May 2024	Share May 2025	Δ percentage-point change	Raw requests growth (May 2024 to May 2025)
1	Googlebot	30%	50%	+20 pp	96%
2	Bingbot	10%	8.7%	-1.3 pp	2%
3	GPTBot	2.2%	7.7%	+5.5 pp	305%
4	ClaudeBot	11.7%	5.4%	-6.3 pp	-46%
5	GoogleOther	4.4%	4.3%	-0.1 pp	14%
6	Amazonbot	7.6%	4.2%	-3.4 pp	-35%
7	Googlebot-Image	4.5%	3.3%	-1.2 pp	-13%
8	Bytespider	22.8%	2.9%	-19.8 pp	-85%
9	Yandex	2.8%	2.2%	-0.7 pp	-10%
10	ChatGPT-User	0.1%	1.3%	+1.2 pp	2,825%
11	Applebot	1.9%	1.2%	-0.7 pp	-26%
12	Timpibot	0.3%	0.6%	+0.3 pp	133%
13	Baiduspider	0.5%	0.4%	-0.1 pp	7%
14	PerplexityBot	<0.01%	0.2%	+0.2 pp	157,490%
15	DuckDuckBot	0.2%	0.1%	-0.1 pp	-16%
16	SeznamBot	0.1%	0.1%		2%
17	Yeti	0.1%	0.1%		47%
18	coccocbot	0.1%	0.1%		-3%
19	Sogou	0.1%	0.1%		-22%
20	Yahoo! Slurp	0.1%	0.0%	-0.1 pp	-8%

Based on this data, two major shifts in web crawling occurred between May 2024 and May 2025:

1. Some AI crawlers rose sharply. GPTBot (from OpenAI) increased its share from 2.2% to 7.7% (+5.5 pp), with a 305% rise in requests. This underscores the data demand for training large language models like ChatGPT. GPTBot jumped from #9 in May 2024 to #3 in May 2025.

Another OpenAI crawler, ChatGPT-User, saw requests surge by 2,825%, reaching a 1.3% share. This reflects a large rise in ChatGPT user activity or API-based interactions that involve accessing web content. PerplexityBot (from Perplexity.ai), despite a small 0.2% share, recorded the highest growth rate: a staggering 157,490% increase in raw requests.

Meanwhile, some AI crawlers saw steep declines. ClaudeBot (Anthropic) fell from 11.7% to 5.4% of total traffic and dropped 46% in requests. Bytespider plummeted 85% in request volume, falling from #2 to #8 in crawler share (now at just 2.9%).

Both Amazonbot and Applebot, also considered AI crawlers, saw decreases in share and in raw requests (–35% and –26%, respectively).

2. Google’s dominance expanded. Googlebot’s share rose from 30% to 50%, supporting search indexing, but potentially also having AI-related purposes (such as new AI Overviews in Google Search). And GoogleOther (the crawler introduced in 2023) also increased in crawling traffic, 14%. Other Google crawlers not in the top 20, like Googlebot-News, also grew significantly (+71% in requests). There’s a clear trend of growth in these Google-related web crawlers at a time when the company is investing heavily in combining AI with search.

Also in the search category, Bingbot’s share (from Microsoft) declined slightly from 10% to 8.7% (-1.3 pp), though its raw requests still grew modestly by 2%.

These trends show that web crawling is increasingly dominated by bots from Google and OpenAI, reflecting clear shifts over the course of a year. Google also appears to be adapting how it collects data to support both traditional search and AI-driven features.

Also worth noting is FriendlyCrawler, which no longer appears in the top 20 list as of May 2025 (now ranked #35). It was #14 in May 2024 with a 0.2% share, but saw a 100% drop in requests by May 2025. This bot is known to index and analyze website content, although its owner and purpose remain unclear. Typically, crawlers like this are used for improving search results, market research, or analytics.

robots.txt & AI bots: GPTBot leads twice

Recent data from June 6, 2025, from Cloudflare Radar shows that out of 3,816 domains (from the top 10,000) where we were able to find a robots.txt file, 546 (about 14%) had “allow” or “disallow” (fully or partially) directives targeting AI bots in particular.

This leaves many site owners in a gray area because it’s not always clear how effective robots.txt is in managing AI crawlers. Some site owners may not think to use it specifically for AI bots, while others might be unsure whether these bots even respect robots.txt rules, especially newer or less transparent crawlers. In other cases, sites use partial rules to fine-tune access, trying to balance visibility and protection without fully opting in or out.

The “disallow” rules appear far more often than “allow” rules. The most frequently blocked bot was GPTBot, disallowed by 312 domains (250 fully, 62 partially), followed by CCBot and Google-Extended, as shown in the following graph.

Although GPTBot was the most blocked, it was also the most explicitly allowed, with 61 domains granting access (18 fully, 43 partially). Still, very few sites openly and explicitly allow AI bots, and when they do, it’s usually for limited sections. Note that bots not listed in a site’s robots.txt are effectively allowed by default.

As AI crawling increases, more websites are moving from passive signals like robots.txt to active protections like Web Application Firewalls. The ecosystem is shifting, with a growing focus on enforceable controls.

Note: When we analyze crawler traffic, we compare user-agent tokens found in robots.txt files (like those for AI crawlers) with the actual user-agent strings in HTTP requests. It's important to note that some robots.txt tokens, such as Google-Extended, aren't user-agent substrings. As described in RFC 9309, one goal of these token may be to signal the purpose of the crawler. For instance, Google uses Google-Extended in robots.txt to see if your content can be used for AI training, but the traffic itself still comes from standard Google user-agents like Googlebot. Because of this, not every robots.txt entry will have a direct match in HTTP request logs.

Conclusion

As AI crawlers reshape the Internet, websites face both new challenges and new opportunities in managing their online presence.

This analysis highlights the growing impact of AI on web crawling, showing a clear shift from traditional search indexing to data collection for training AI models. The detailed statistics, such as Googlebot’s continued growth and the rapid rise of AI-specific crawlers, offer context for understanding how this space is evolving and what it means for the future of web content access.

The trend toward stronger, enforceable blocking methods, something Cloudflare has also been invested, signals a key shift in how websites may control their interactions with AI systems going forward.