The Cloudflare Blog

Defend against frontier cyber models: Cloudflare's architecture as customer zero

Rohit Chenna Reddy — Tue, 09 Jun 2026 06:00:00 GMT

A few weeks ago, we wrote about Project Glasswing and what we observed when we pointed cyber frontier models at our own code. Since then, we’ve seen that the part of the post that has resonated most deeply is the argument that the architecture around the vulnerability matters more than the speed of the patch.

In the conversations we've had with CISOs and security teams since, the questions have been consistent: what does our architecture actually look like, what should we monitor for, where do we start, and how can Cloudflare help?

Before getting into the details: the architecture below is built almost entirely from Cloudflare's own products, because Cloudflare security is customer zero for the security products we build. The Cloudflare stack already exists in front of our code, employees, and customer-facing applications. If you're a Cloudflare customer, every layer below is available to you today. If you're not, the principles still apply to whatever stack you've built.

What a cyber frontier model actually changes

In the previous post, we showed how a cyber frontier model like Mythos changes the attacker’s timeline. It can find vulnerabilities, reason through exploit chains, and generate working proofs faster than earlier models. While models like Mythos do not change the shape of an intrusion — reconnaissance, initial access, lateral movement, persistence, and exfiltration still have to happen — the difference is in the speed and scale. When pointed at the open web, a model can find and hit low-hanging fruit quickly. Against a hardened target, it still has to probe, and adapt, and it often produces more noise than a careful human operator would.

Discovery, exploit chain construction, and proof-of-concept generation used to be the gating constraints on producing a working attack. A frontier model handles all three in a fraction of the time. Work that used to be slow and methodical is now fast and indiscriminate.

While AI is accelerating how fast developer teams at Cloudflare and many other companies can ship code, the security team’s work has not compressed the same way. An attacker only needs one opening to get in, while security teams need to find and close them all. Writing a fix, regressing it, and shipping it without breaking the code around it has constraints that AI doesn't remove. We learned this the hard way when we let an AI coding assistant write its own patches against our own bugs, as we described at the end of the previous post. Some of those patches fixed the original bug while quietly breaking something else the code depended on.

As these models become more competent and capable, our main focus from a threat standpoint comes down to three things. Each one shapes the architecture we walk through in the rest of this post.

The first is the speed of discovery. Frontier models make it easier to search large bodies of public code, including the open-source libraries that many companies depend on. That does not mean every bug in a library is exploitable, or that library bugs are where most vulnerabilities live. Exploitability still depends on how the code is used, whether attacker-controlled input can reach the vulnerable path, and the protections that sit around it. But widely used open-source libraries and frameworks give attackers a shared surface to study at scale. When a real, reachable vulnerability exists there, a model can help find it, reason about possible exploit paths, and generate proof-of-concept variants faster than maintainers and defenders can review every downstream use. The gap between when an attacker discovers a vulnerability and when defenders learn it exists is what worries us most. If you are not running these models against your own code, it is safe to assume someone else is.
The second is exploit volume and adaptation. A model can produce thousands of variations of a single exploit and run reconnaissance at the same scale. All that volume gives an attacker an advantage, but it won’t necessarily get them past signature-based detections. Many of those iterations will have the same underlying signature, so a rule that catches the first one will catch the rest. Adaptation is how they will get past signature-based detections. Ask a model to show you a SQL injection, and it will return a textbook example. Tell it there is a WAF in the way, and it will start probing, learning what gets blocked, and rewriting the payload until it can slip past the rule blocking it.
The third is the impact when a vulnerability is inevitably exploited. No architecture catches everything. After the vulnerability is exploited, the question we ask ourselves is: where can the attacker get to with one identity, one path, or one credential, before something else stops them? If the answer is "anywhere they want," the vulnerability was never the problem. The architecture around the vulnerability was.

Cloudflare’s superpower: visibility

We see roughly a fifth of the web and that tells us, in real time, which payloads are mutating, which patterns are picking up, and where attacker tooling is moving next. Two teams turn that visibility into defense.

First is Cloudforce One, our threat intelligence, research, and operations team, which sits within the Cloudflare security organization. They turn what we see across the network into insights the rest of the stack can act on: tracked adversaries, emerging campaigns, and indicators of compromise (IOCs). The hard part of this work was never knowing what is malicious — it was the delay in mitigation. Knowledge of a new threat normally has to travel from a threat report, into a feed, and then into a company’s defense before it can be used to block anything. Attackers have learned to move faster than that. Our network closes that gap: Cloudflare customers can now use Cloudforce One threat intelligence directly within the WAF to block high-risk traffic.

Second is the team that owns the WAF engine that does the actual detecting: the managed rulesets that run in front of our own properties and are available to every Cloudflare customer, the machine learning behind WAF Attack Score, and the relationships that sometimes let us ship a rule before a CVE is publicly disclosed. The team is globally distributed and moves fast, releasing rules within hours of a proof-of-concept of an attack becoming known. Once a detection is deployed, it reaches our entire network, along with every Cloudflare customer, in under 30 seconds. React2Shell is a recent example: a managed WAF rule was protecting our own properties, and everyone else's on Cloudflare, hours before the official advisory was published.

The scoring layer, the defenses we put in front of the application, and the containment around the vulnerability all build on what these two teams see.

Scores over signatures

Signature-based defenses were built for a world where novel exploits were scarce and variations took weeks. Cloudflare's traditional SLA from a fresh proof-of-concept to a live, deployed rule has been 12 hours. With the advent of frontier models, this is not good enough anymore. Detections need to be in place before a CVE is discovered. This is why we layer ML-based detection in front of the traditional signature-based WAF.

The model is trained on a large body of past attack traffic, and it catches new variants of vulnerabilities before they're publicly known. A novel SQL injection or remote code execution chain is almost always a rearrangement of attack shapes the model has seen before, even when the specific exploit is brand new. We run the model on every request and assign a WAF Attack Score between 1 and 99, based on how closely the request resembles those underlying shapes, not against a list of known-bad signatures. The lower the score, the more aggressively we treat the request. That score determines whether we let the request through. We apply a similar scoring methodology to AI prompts with AI Security for Apps: rather than check each prompt against a list of known malicious prompts, we score how closely a prompt resembles an actual attack.

The architecture around the vulnerability

Those capabilities only matter once they're stacked in front of an application, and the first layer in our defense-in-depth approach is the WAF. Anything that matches a known-bad pattern gets dropped before it reaches the application, which clears the bulk of the obvious traffic and lets the more specialized layers below focus on what's left. On the API surface, we run a positive security model through API Shield. Instead of trying to anticipate every bad request, we describe what a valid request to each API looks like, either from the API's own definition or learned from our real traffic, and anything that doesn't fit doesn't get through. This neutralizes the advantage of frontier AI models: because we only permit validated traffic, generating thousands of new attack variations fails to bypass the system.

^{Cloudflare’s layered architecture}

Bot Management catches probing traffic on our network before frontier models can build a map. It scores every request on how likely it is to be automated, using the same signals across our whole network: how the client behaves, whether it looks like a real browser, and whether the connection matches a known-bad pattern. An attack only lands if it can find a soft spot.

Zero Trust Network Access is used for every internal application. The implicit trust of being inside the network is replaced with explicit per-request identity and policy for every employee accessing every tool. The value of this was clear when one of our engineers shipped a misconfigured tool. A flat network would have exposed everything on the same segment, but in our deployment, the exposure stopped at the tool itself. We built Require Access Protection afterwards so newly deployed or misconfigured applications can't be reachable before an access policy is in place.

IdP Federation makes that secure by default posture easier to keep consistent across every Cloudflare account — which becomes even more necessary when more people are shipping internal tools quickly. Instead of asking each team to wire up SSO separately, we configure our identity provider (IdP) once and share it across the organization. New accounts get SSO automatically, recipient-side IdP connections are read-only, and Access policies in each account still evaluate the resulting identity as part of the normal request flow.

MCP Server Portal gives teams a controlled way to connect AI agents to enterprise systems. Agents access MCP servers that are centrally managed through a single portal, with every action logged. That way when an agent acts on someone's behalf, we know what it did, what it touched, and whether it should have been allowed to. The full picture of how we built it is in our post on enterprise MCP.

AI Gateway runs in front of our internal AI tools the same way AI Security for Apps runs in front of customer-facing AI features, with the same scoring and the same visibility. Inside the company, the visibility piece is more useful than the blocking, because we needed to see what engineers were actually building before we could write meaningful policy on it.

Where your teams can start

Frontier models can help attackers find vulnerabilities, adapt payloads, and move faster, but they still have to pass through the layered defense you deploy in front of your application. That is where teams should start:

Put inspection in front of public applications.
Define what valid API traffic looks like.
Use bot detection to limit automated probing.
Require identity and access policy before any internal tool is reachable.

For AI and agentic systems:

Route model traffic through a gateway.
Keep agents connected through approved MCP servers.
Log what they do.

The goal is to make sure that when one layer misses, the next layer limits what the attacker can see, reach, or change.

That is the point of the architecture around the vulnerability: to limit the scope of an attack. The vulnerability may be what starts the attack, but the architecture determines how far it can go.

How do we know this approach works?

Plenty of security stacks look impenetrable on a whiteboard but fall over in practice. That is why we test ours continuously, both at the perimeter and inside our environment, with our red team involved across both.

At the perimeter, frontier models are one tool we use to test our application security stack as an adaptive attacker. These models sit alongside the rest of our red team and detection workflows including: manual testing, threat intelligence, observed traffic patterns, proof-of-concept analysis, and signals from our own network. Together, those inputs help us decide where to aim testing: newly launched products, recently changed surfaces, and the paths an attacker is most likely to probe first. The most important part is the process that follows. When something gets through, we identify the gap, use the right mix of tools to understand it, write the rule or mitigation, ship the update, and test again to make sure the gap is closed.

Inside the environment, our red team starts from the assumption that the perimeter has already failed. They look at what has changed, where sensitive systems carry risk, and whether one compromised identity, path, or credential can reach farther than it should. When we change the architecture based on what they find, they run the scenario again against the new version to confirm the gap is actually closed.

We confirm that this architecture is working by continuously testing its behavior during failures, rather than relying on the perfection of individual layers.

If your team is working on the same problems and would like to compare notes, reach out to us at security-ai-research@cloudflare.com.

Project Glasswing: what Mythos showed us

Grant Bourzikas — Mon, 18 May 2026 06:00:00 GMT

For the last few months, we've been testing a range of security-focused LLMs on our own infrastructure. These LLMs help identify potential vulnerabilities in our own systems, so we can fix them – and they also show us what attackers are going to be able to do with the latest models.

None of these LLMs has captured more attention than Mythos Preview, from Anthropic. A few weeks ago, we were invited to use Mythos Preview as part of Project Glasswing. We soon pointed it at more than fifty of our own repositories – to see what it would find, and to see how it works.

This post shares what we observed, what the models did well and what they didn't, and how the architecture and process around them needs to change, so they can be used at scale.

What changed with Mythos Preview

Mythos Preview is a real step forward, and it's worth saying that plainly before getting into anything else. We've been running models against our code for a while now, and the jump from what was possible with previous general-purpose frontier models to what Mythos Preview does today is not just a refinement of what came before.

It's a different kind of tool doing a different kind of work, and that makes a clean apples-to-apples comparison to earlier models difficult. So rather than trying to benchmark Mythos Preview against general-purpose frontier models, it's more useful to describe what it can actually do, and two features that stood out across the work we did with Mythos Preview:

Exploit chain construction - A real attack rarely uses one bug. It chains several small attack primitives together into a working exploit. For instance, it might turn a use-after-free bug into an arbitrary read and write primitive, hijack the control flow, and use return-oriented programming (ROP) chains to take full control over a system. Mythos Preview can take several of these primitives and reason about how to combine them into a working proof. The reasoning it shows along the way looks like the work of a senior researcher rather than the output of an automated scanner.
Proof generation - Finding a bug and proving it's exploitable are two different things, and Mythos Preview can do both. It writes code that would trigger the suspected bug, compiles that code in a scratch environment, and runs it. If the program does what the model expected, that's the proof. If it doesn't, the model reads the failure, adjusts its hypothesis, and tries again. The loop matters as much as the bugs it finds, because a suspected flaw without a working proof is speculation, and Mythos Preview closes that gap on its own.

Some of what we describe above is not entirely unique to Mythos Preview. When we ran other frontier models through the same harness, they found a fair number of the same underlying bugs, and in some cases they got further than we expected on the reasoning side too. Where they fell short was at the point of stitching the pieces together. A model would identify an interesting bug, write a thoughtful description of why it mattered, and then stop, leaving the actual chain unfinished and the question of exploitability open. What changed with Mythos Preview is that a model can now take those low-severity bugs (which would traditionally sit invisible in a backlog) and chain them into a single, more severe exploit.

Model refusals in legitimate vulnerability research

The Mythos Preview model provided by Anthropic, as part of Project Glasswing, did not have the additional safeguards that are present in generally available models (like Opus 4.7 or GPT-5.5).

Despite this, the model organically pushes back on certain requests - much like the cyber capabilities that made it useful for vulnerability hunting, the model has its own emergent guardrails that sometimes cause it to push back on legitimate security research requests. But as we found, these organic refusals aren’t consistent - the same task, framed differently or presented in a different context, could produce completely different outcomes as illustrated in the examples below.

^{Example of Mythos Preview pushing back on building a working proof of concept}

For example, the model initially refused to do vulnerability research on a project, then agreed to perform the same research on the same code after an unrelated change to the project’s environment. Nothing about the code being analyzed had changed. In another case, the model found and confirmed several serious memory bugs in a codebase, and then refused to write a demonstration exploit. The same request, framed differently, got a different answer, and even the same request can produce different outcomes across runs due to the probabilistic nature of the model. Semantically equivalent tasks can produce opposite outcomes depending on how and when they’re presented to the model.

This matters because while the model’s organic refusals/guardrails are real, they aren’t consistent enough to serve as a complete safety boundary on their own. That’s precisely why any capable cyber frontier model made generally available in the future must include additional safeguards on top of this baseline behavior - making it appropriate for broader use outside of a controlled research context like Project Glasswing.

The signal-to-noise problem

One of the hardest parts of triaging security vulnerabilities is deciding which bugs are real, which are exploitable, and which need fixing now. This was a hard problem even in the pre-AI world. AI vulnerability scanners and AI-generated code have made it worse, and at Cloudflare we've built multiple post-validation stages to deal with it.

Two factors dominate the noise rate:

Programming language - C and C++ give you direct memory control and, with it, bug classes - buffer overflows, out-of-bounds reads and writes - that memory-safe languages like Rust eliminate at compile time. We saw consistently more false positives from projects written in memory-unsafe languages.
Model bias - A good human researcher tells you what they found and how confident they are. Models don't. Ask a model to find bugs, and it will find them, whether the code has any or not. Findings come back hedged with "possibly," "potentially," "could in theory," and the hedged findings vastly outnumber the solid ones. That's a reasonable bias for an exploratory tool. It's a ruinous one for a triage queue, where every speculative finding spends human attention and tokens to dismiss, and that cost compounds across thousands of findings.

Mythos Preview represents a clear improvement here, particularly in its ability to chain primitives - combining multiple vulnerabilities into a working proof of concept rather than reporting them in isolation. A finding that arrives with a PoC is a finding you can act on, and it means far less time spent asking "is this even real?"

Our harnesses are deliberately tuned to over-report, so we see more (and miss less), which comes with a lot more noise. But at triage time, Mythos Preview's output has noticeably higher quality: fewer hedged findings, clearer reproduction steps, and less work to reach a fix-or-dismiss decision.

Why pointing a generic coding agent at a repo doesn't work

When we first started AI-assisted vulnerability research last year, our instinct was the obvious one: point a generic coding agent at an arbitrary repository and ask it to discover vulnerabilities. This approach works, in the sense that the model will produce findings, but it doesn't work in producing meaningful coverage of a real codebase and identifying findings of value. There are two main reasons for this:

Context - Coding agents are tuned for one focused stream of work: building a feature, fixing a bug, writing a refactor. They ingest a lot of source code, hold a single hypothesis at a time, and iterate against it. That's exactly the wrong shape for vulnerability research, which is narrow and parallel by nature. A human researcher picks one specific thing to look at and investigates it thoroughly. That one thing might be a single complex feature, transitions across security boundaries, or a specific vulnerability class like command injections, where attacker input ends up being run as a shell command. Then they do it again, for a different feature, security boundary, or vulnerability class, several thousand times across the codebase. A single agent session (even with subagents) against a hundred-thousand-line repository can cover maybe a tenth of a percent of the surface in a useful way before the model's context window fills up and compaction kicks in - potentially discarding earlier findings that would have mattered.
Throughput - A single-stream agent does one thing at a time, but real codebases need many hypotheses against many components at once, with the ability to fan out further when something interesting turns up. You can drive a single agent harder, but at some point you stop being limited by the model and start being limited by the shape of the interaction itself. Using the model directly in a coding agent turns out to be fine for manual investigation when a researcher already has a lead and wants a second pair of eyes. However, it's the wrong tool for achieving high coverage. Once we accepted that, we stopped trying to make Mythos Preview do the wrong job and started building the harness around it instead.

What a harness actually fixes

Four lessons came out of running the work at scale, and each one pointed to the need for a harness that manages the overall execution:

Narrow scope produces better findings - Telling the model "Find vulnerabilities in this repository" makes it wander. Telling it "Look for command injection in this specific function, with this trust boundary above it, here's the architecture document and here's prior coverage of this area" makes it do something much closer to what a researcher would actually do.
Adversarial review reduces noise - Adding a second agent between the initial finding and the queue - one with a different prompt, a different model, and no ability to generate its own findings - catches a lot of the noise that the first agent would miss if it just checked its own work. It turns out that putting two agents in deliberate disagreement is way more effective than just telling one agent to be careful.
Splitting the chain across agents produces better reasoning - Asking "Is this code buggy?" and "Can an attacker actually reach this bug from outside the system?" are two different questions, and the model is better at each one when you ask them separately, because each question is narrower than the combined version.
Parallel narrow tasks beat one exhaustive agent - Coverage improves when many agents work on tightly scoped questions and we deduplicate the results afterward, rather than asking one agent to be exhaustive.

Each of those observations is about model behavior, and put together they describe something that isn't a chat interface anymore. It's a harness that helps you achieve the final outcomes. The first steps to building a harness are simple, as you can ask the model to help, which is what we did. We used Mythos Preview to build on, tailor, and improve our original harnesses to suit its strengths. An example of what a harness looks like in practice is described below.

Our vulnerability discovery harness

Here's what our vulnerability discovery harness looks like, stage by stage. It was used to scan live code across our runtime, edge data path, protocol stack, control plane, and the open-source projects we depend on.

Stage	What it does	Why it matters
Recon	An agent reads the repository from the top down, fans out to subagents responsible for each subsystem, and produces an architecture document covering build commands, trust boundaries, entry points, and likely attack surface. It also generates the initial queue of tasks for the next stage.	Gives every downstream agent shared context. Cuts the wander problem.
Hunt	Each task is one attack class paired with a scope hint. Hunters (the agents that actually look for bugs) run concurrently, typically around fifty at once, each fanning out to a handful of exploration subagents. Each hunter has access to tools that compile and run proof-of-concept code in a per-task scratch directory.	This is where most of the work happens. Many narrow tasks in parallel, not one exhaustive agent.
Validate	An independent agent re-reads the code and tries to disprove the original finding. It uses a different prompt and has no ability to emit new findings of its own.	Catches a meaningful fraction of the noise the hunter wouldn't catch when reviewing its own work.
Gapfill	Hunters flag areas they touched but didn't cover thoroughly. Those areas get re-queued for another pass.	Counteracts the model's tendency to drift toward attack classes it has already had success with.
Dedupe	Findings that share the same root cause collapse into a single record.	Variant analysis is a feature, not a way to inflate the queue with duplicates.
Trace	For each confirmed finding in a shared library, a tracer agent fans out (one instance per consumer repository), uses a cross-repo symbol index, and decides whether attacker-controlled input actually reaches the bug from outside the system.	Turns "there is a flaw" into "there is a reachable vulnerability." This is the stage that matters most.
Feedback	Reachable traces become new hunt tasks in the consumer repositories where the bug is actually exposed.	Closes the loop. The pipeline gets better as it runs.
Report	An agent writes a structured report against a predefined schema, fixes any validation errors against that schema itself, and submits the report to an ingest API.	Output is queryable data, not free-form prose.

What this means for security teams

The loudest reaction to Mythos Preview from other security leaders has been about speed - scan faster, patch faster, compress the response cycle. More than one team we have spoken with is now operating under a two-hour SLA from CVE release to patch in production. The instinct is understandable: when the attacker timeline shortens, the defender timeline has to shorten with it. Faster is not going to be enough, and we think a lot of teams are about to spend a lot of time, effort, and money learning that the hard way.

Patching faster does not change the shape of the pipeline that produces the patch. If regression testing takes a day, you cannot get to a two-hour SLA without skipping it, and the bugs you ship when you skip regression testing tend to be worse than the bugs you were trying to patch. We learned a version of this when we tried letting the model write its own patches and watched a few go out that fixed the original bug while quietly breaking something else the code depended on.

The harder question is what the architecture around the vulnerability should look like. The principle is to make exploitation harder for an attacker even when a bug exists, so that the gap between when a vulnerability is disclosed and when it is patched matters less. That means defenses that sit in front of the application and block the bug from being reached. It means designing the application so that a flaw in one part of the code cannot give an attacker access to other parts. It means being able to roll out a fix to every place the code is running at the same moment, rather than waiting on individual teams to deploy it.

We also recognize this topic cuts both ways. The same capabilities that helped us find bugs in our own code will, in the wrong hands, accelerate the attack side against every application on the Internet. Cloudflare sits in front of millions of those applications, and the architectural principles described above are exactly the ones our products are built to apply on behalf of customers. We will share more on what that means for customers in the weeks ahead.

If your team is doing similar work and would like to compare notes, reach out to us at security-ai-research@cloudflare.com.

Our research with Mythos Preview was conducted in a controlled environment against our own code; every vulnerability surfaced through this work was triaged, validated, and remediated where action was needed under Cloudflare's formal vulnerability management process.

This work was a team effort. Thanks to Albert Pedersen, Craig Strubhart, Dan Jones, Irtefa Fairuz, Martin Schwarzl, and Rohit Chenna Reddy for their contributions to the research, engineering, and analysis behind this blog post.

Shifting left at enterprise scale: how we manage Cloudflare with Infrastructure as Code

Chase Catelli — Tue, 09 Dec 2025 06:00:00 GMT

The Cloudflare platform is a critical system for Cloudflare itself. We are our own Customer Zero – using our products to secure and optimize our own services.

Within our security division, a dedicated Customer Zero team uses its unique position to provide a constant, high-fidelity feedback loop to product and engineering that drives continuous improvement of our products. And we do this at a global scale — where a single misconfiguration can propagate across our edge in seconds and lead to unintended consequences. If you've ever hesitated before pushing a change to production, sweating because you know one small mistake could lock every employee out of critical application or take down a production service, you know the feeling. The risk of unintended consequences is real, and it keeps us up at night.

This presents an interesting challenge: How do we ensure hundreds of internal production Cloudflare accounts are secured consistently while minimizing human error?

While the Cloudflare dashboard is excellent for observability and analytics, manually clicking through hundreds of accounts to ensure security settings are identical is a recipe for mistakes. To keep our sanity and our security intact, we stopped treating our configurations as manual point-and-click tasks and started treating them like code. We adopted “shift left” principles to move security checks to the earliest stages of development.

This wasn't an abstract corporate goal for us. It was a survival mechanism to catch errors before they caused an incident, and it required a fundamental change in our governance architecture.

What Shift Left means to us

"Shifting left" refers to moving validation steps earlier in the software development lifecycle (SDLC). In practice, this means integrating testing, security audits, and policy compliance checks directly into the continuous integration and continuous deployment (CI/CD) pipeline. By catching issues or misconfigurations at the merge request stage, we identify issues when the cost of remediation is lowest, rather than discovering them after deployment.

When we think about applying shift left principles at Cloudflare, four key principles stand out:

Consistency: Configurations must be easily copied and reused across accounts.
Scalability: Large changes can be applied rapidly across multiple accounts.
Observability: Configurations must be auditable by anyone for current state, accuracy, and security.
Governance: Guardrails must be proactive — enforced before deployment to avoid incidents.

A production IaC operating model

To support this model, we transitioned all production accounts to being managed with Infrastructure as Code (IaC). Every modification is tracked, tied to a user, commit, and an internal ticket. Teams still use the dashboard for analytics and insights, but critical production changes are all done in code.

This model ensures every change is peer-reviewed, and policies, though set by the security team, are implemented by the owning engineering teams themselves.

This setup is grounded in two major technologies: Terraform and a custom CI/CD pipeline.

Our enterprise IaC stack

We chose Terraform for its mature open-source ecosystem, strong community support, and deep integration with Policy as Code tooling. Furthermore, using the Cloudflare Terraform Provider internally allows us to actively dogfood the experience and improve it for our customers.

To manage the scale of hundreds of accounts and around 30 merge requests per day, our CI/CD pipeline runs on Atlantis, integrated with GitLab. We also use a custom go program, tfstate-butler, that acts as a broker to securely store state files.

tfstate-butler operates as an HTTP backend for Terraform. The primary design driver was security: It ensures unique encryption keys per state file to limit the blast radius of any potential compromise.

All internal account configurations are defined in a centralized monorepo. Individual teams own and deploy their specific configurations and are the designated code owners for their sections of this centralized repository, ensuring accountability. To read more about this configuration, check out How Cloudflare uses Terraform to manage Cloudflare.

^{Infrastructure as Code Data Flow Diagram}

Baselines and Policy as Code

The entire shift left strategy hinges on establishing a strong security baseline for all internal production Cloudflare accounts. The baseline is a collection of security policies that are defined in code (Policy as Code). This baseline is not merely a set of guidelines but rather a required security configuration we enforce across the platform — e.g., maximum session length, required logs, specific WAF configurations, etc.

This setup is where policy enforcement shifts from manual audits to automated gates. We use the Open Policy Agent (OPA) framework and its policy language, Rego, via the Atlantis Conftest Policy Checking feature.

Defining policies as code

Rego policies define the specific security requirements that make up the baseline for all Cloudflare provider resources. We currently maintain approximately 50 policies.

For example, here is a Rego policy that validates only @cloudflare.com emails are allowed to be used in an access policy:

# validate no use of non-cloudflare email
warn contains reason if {
    r := tfplan.resource_changes[_]
    r.mode == "managed"
    r.type == "cloudflare_access_policy"

    include := r.change.after.include[_]
    email_address := include.email[_]
    not endswith(email_address, "@cloudflare.com")

    reason := sprintf("%-40s :: only @cloudflare.com emails are allowed", [r.address])
}
warn contains reason if {
    r := tfplan.resource_changes[_]
    r.mode == "managed"
    r.type == "cloudflare_access_policy"

    require := r.change.after.require[_]
    email_address := require.email[_]
    not endswith(email_address, "@cloudflare.com")

    reason := sprintf("%-40s :: only @cloudflare.com emails are allowed", [r.address])
}

Enforcing the baseline

The policy check runs on every merge request (MR), ensuring configurations are compliant before deployment. Policy check output is shown directly in the GitLab MR comment thread.

Policy enforcement operates in two modes:

Warning: Leaves a comment on the MR, but allows the merge.
Deny: Blocks the deployment outright.

If the policy check determines the configuration being applied in the MR deviates from the baseline, the output will return which resources are out of compliance.

The example below shows an output from a policy check identifying 3 discrepancies in a merge request:

WARN - cloudflare_zero_trust_access_application.app_saas_xxx :: "session_duration" must be less than or equal to 10h

WARN - cloudflare_zero_trust_access_application.app_saas_xxx_pay_per_crawl :: "session_duration" must be less than or equal to 10h

WARN - cloudflare_zero_trust_access_application.app_saas_ms :: you must have at least one require statement of auth_method = "swk"

41 tests, 38 passed, 3 warnings, 0 failures, 0 exception

Handling policy exceptions

We understand that exceptions are necessary, but they must be managed with the same rigor as the policy itself. When a team requires an exception, they submit a request via Jira.

Once approved by the Customer Zero team, the exception is formalized by submitting a pull request to the central exceptions.rego repository. Exceptions can be made at various levels:

Account: Exclude account_x from policy_y.
Resource Category: Exclude all resource_a’s in account_x from policy_y.
Specific Resource: Exclude resource_a_1 in account_x from policy_y.

This example shows a session length exception for five specific applications under two separate Cloudflare accounts:

{  
    "exception_type": "session_length",
    "exceptions": [
        {
            "account_id": "1xxxx",
              "tf_addresses": [
                "cloudflare_access_application.app_identity_access_denied",
                "cloudflare_access_application.enforcing_ext_auth_worker_bypass",
                "cloudflare_access_application.enforcing_ext_auth_worker_bypass_dev",
            ],
        },
        {
            "account_id": "2xxxx",
              "tf_addresses": [
                "cloudflare_access_application.extra_wildcard_application",
                "cloudflare_access_application.wildcard",
            ],
        },
    ],
}

Challenges and lessons learned

Our journey wasn't without obstacles. We had years of clickops (manual changes made directly in the dashboard) scattered across hundreds of accounts. Trying to import the existing chaos into a strict infrastructure as code system felt like trying to change the tires on a moving car. To this day, importing resources continues to be an ongoing process.

We also ran into limitations of our own tools. We found edge cases in the Cloudflare Terraform provider that only appear when you try to manage infrastructure at this scale. These weren't just minor speed bumps. They were hard lessons on the necessity of eating our own dogfood, so we could build even better solutions.

That friction clarified exactly what we were up against, leading us to three hard-earned lessons.

Lesson 1: high barriers to entry stall adoption

The first hurdle for any large-scale IaC rollout is onboarding existing, manually configured resources. We gave teams two options: manually creating Terraform resources and import blocks, or using cf-terraforming.

We quickly discovered that Terraform fluency varies across teams, and the learning curve for manually importing existing resources proved to be much steeper than we anticipated.

Luckily the cf-terraforming command-line utility uses the Cloudflare API to automatically generate the necessary Terraform code and import statements, significantly accelerating the migration process.

We also formed an internal community where experienced engineers could guide teams through the nuances of the provider and help unblock complex imports.

Lesson 2: drift happens

We also had to tackle configuration drift, which occurs when the IaC process is bypassed to expedite urgent changes. While making edits directly in the dashboard is faster during an incident, it leaves the Terraform state out of sync with reality.

We implemented a custom drift detection service that constantly compares the state defined by Terraform with the actual deployed state via the Cloudflare API. When drift is detected, an automated system creates an internal ticket and assigns it to the owning team with varying Service Level Agreements (SLAs) for remediation.

Lesson 3: automation is key

Cloudflare innovates quickly, so our set of products and APIs is ever-growing. Unfortunately, that meant that our Terraform provider was often behind in terms of feature parity with the product.

We solved that issue with the release of our v5 provider, which automatically generates the Terraform provider based on the OpenAPI specification. This transition wasn’t without bumps as we hardened our approach to code generation, but this approach ensures that the API and Terraform stay in sync, reducing the chance of capability drift.

The core lesson: proactive > reactive

By centralizing our security baselines, mandating peer reviews, and enforcing policies before any change hits production, we minimize the possibility of configuration errors, accidental deletions, or policy violations. The architecture not only helps to prevent manual mistakes, but actually increases engineering velocity because teams are confident their changes are compliant.

The key lesson from our work with Customer Zero is this: While the Cloudflare dashboard is excellent for day-to-day operations, achieving enterprise-level scale and consistent governance requires a different approach. When you treat your Cloudflare configurations as living code, you can scale securely and confidently.

Have thoughts on Infrastructure as Code? Keep the conversation going and share your experiences over at community.cloudflare.com.