Cloudflare Workers scale too well and broke our infrastructure, so we are rebuilding it on Workers

While scaling our new Feature Flagging product DevCycle, we’ve encountered an interesting challenge: our Cloudflare Workers-based infrastructure can handle way more instantaneous load than our traditional AWS infrastructure. This led us to rethink how we design our infrastructure to always use Cloudflare Workers for everything.

The origin of DevCycle

For almost 10 years, Taplytics has been a leading provider of no-code A/B testing and feature flagging solutions for product and marketing teams across a wide range of use cases for some of the largest consumer-facing companies in the world. So when we applied ourselves to build a new engineering-focused feature management product, DevCycle, we built upon our experience using Workers which have served over 140 billion requests for Taplytics customers.

The inspiration behind DevCycle is to build a focused feature management tool for engineering teams, empowering them to build their software more efficiently and deploy it faster. Helping engineering teams reach their goals, whether it be continuous deployment, lower change failure rate, or a faster recovery time. DevCycle is the culmination of our vision of how teams should use Feature Management to build high-quality software faster. We've used DevCycle to build DevCycle, enabling us to implement continuous deployment successfully.

DevCycle architecture

One of the first things we asked ourselves when ideating DevCycle was how we could get out of the business of managing 1000’s of vCPUs worth of AWS instances and move our core business logic closer to our end-user devices. Based on our experience with Cloudflare Workers at Taplytics we knew we wanted it to be a core part of our future infrastructure for DevCycle.

By using the global computing power of Workers and moving as much logic to the SDKs as possible with our local bucketing server-side SDKs, we were able to massively reduce or eliminate the latency of fetching feature flag configurations for our users. In addition, we used a shared WASM library across our Workers and local bucketing SDKs to dramatically reduce the amount of code we need to maintain per SDK, and increase the consistency of our platform. This architecture has also fundamentally changed our business's cost structure to easily serve any customer of any scale.

The core architecture of DevCycle revolves around publishing and consuming JSON configuration files per project environment. The publishing side is managed in our AWS services, while Cloudflare manages the consumption of these config files at scale. This split in responsibilities allows for all high-scale requests to be managed by Cloudflare, while keeping our AWS services simple and low-scale.

“Workers are breaking our events pipeline”

One of the primary challenges as a feature management platform is that we don’t have direct control over the load from our customers’ applications using our SDKs; our systems need the ability to scale instantly to match their load. For example, we have a couple of large customers whose mobile traffic is primarily driven by push notifications, which causes massive instantaneous spikes in traffic to our APIs in the range of 10x increases in load. As you can imagine, any traditional auto-scaled API service and the load balancer cannot manage that type of increase in load. Thus, our choices are to dramatically increase the minimum size of our cluster and load balancer to handle these unknown load spikes, accept that some requests will be rate-limited, or move to an architecture that can handle this load.

Given that all our SDK API requests are already served with Workers, they have no problem scaling instantly to 10x+ their base load. Sadly we can’t say the same about the traditional parts of our infrastructure.

For each feature flag configuration request to a Worker, a corresponding events request is sent to our AWS events infrastructure. The events are received by our events API container in Kubernetes, where they are then published to Kafka and eventually ingested by Snowflake. While Cloudflare Workers have no problem handling instantaneous spikes in feature flag requests, the events system can't keep up. Our cluster and events API containers need to be scaled up faster to prevent the existing instances from being overwhelmed. Even the load balancer has issues accepting the sudden increase. Cloudflare Workers just work too well in comparison to EC2 instances + EKS.

To solve this issue we are moving towards a new events Cloudflare Worker which will be able to handle the instantaneous events load from these requests and make use of the Kinesis Data Firehose to write events to our existing S3 bucket which is ingested by Snowflake. In the future, we look forward to testing out Cloudflare Queues writing to R2 once a Snowflake connector has been created. This architecture should allow us to ingest events at almost any scale and withstand instantaneous traffic spikes with a predictable and efficient cost structure.

Building without a database next to your code

Workers provide many benefits, including fast response times, infinite scalability, serverless architecture, and excellent up-time performance. However, if you want to see all these benefits, you need to architect your Workers to assume that you don’t have direct access to a centralized SQL / NoSQL database (or D1) like you would with a traditional API service. For example, suppose you build your workers to require reaching out to a database to fetch and update user data every time a request is made to your Workers. In that case, your request latency will be tied to the geographic distance between your Worker and the database plus the latency of the database. In addition, your Workers will be able to scale significantly beyond the number of database connections you can support, and your uptime will be tied to the uptime of your external database. Therefore, when architecting your systems to use Workers, we advise relying primarily on data sent as part of the API request and cacheable data on Cloudflare’s global network.

Cloudflare provides multiple products and services to help with data on their global network:

KV: “global, low-latency, key-value data store.”
- However, the lowest latency way of retrieving data from within a Worker is limited by a minimum 60-second TTL. So you’ll need to be ok with cached data that is 60 seconds stale.
Durable Objects: “provide low-latency coordination and consistent storage for the Workers platform through two features: global uniqueness and a transactional storage API.”
- Ability to store user-level information closer to the end user.
- Unfamiliar worker interface for accessing data for developers with SQL / NoSQL experience.
R2: “store large amounts of unstructured data.”
- Ability to store arbitrarily large amounts of unstructured data using familiar S3 APIs.
- Cloudflare’s cache can be used to provide low-latency access within workers.
D1: “serverless SQLite database”

Each of these tools that Cloudflare provides has made building APIs far more accessible than when Workers launched initially; however, each service has aspects which need to be accounted for when architecting your systems. Being an open platform, you can also access any publically available database you want from a Worker. For example, we are making use of Macrometa for our EdgeDB product built into our Workers to help customers access their user data.

The predictable cost structure of Workers

One of the greatest advantages of moving most of our workloads towards Cloudflare Workers is the predictable cost structure that can scale 1:1 with our request loads and can be easily mapped to usage-based billing for our customers. In addition, we no longer have to run excess EC2 instances to handle random spikes in load, just in case they happen.

Too many SaaS services have opaque billing based on max usage or other metrics that don’t relate directly to their costs. Moving from our legacy AWS architecture with high fixed costs like databases and caching layers to Workers has resulted in our infrastructure spending is directly tied to using our APIs and SDKs. For DevCycle, this architecture has been over ~5x more cost-efficient to operate.

The future of DevCycle and Cloudflare

With DevCycle we will continue to invest in leveraging serverless computing and moving our core business logic as close to our users as possible, either on Cloudflare’s global network or locally within our SDKs. We’re excited to integrate even more deeply with the Cloudflare developer platform as new services evolve. We already see future use cases for R2, Queues and Durable Objects and look forward to what’s coming next from Cloudflare.

The Cloudflare Blog

Cloudflare Workers scale too well and broke our infrastructure, so we are rebuilding it on Workers

The origin of DevCycle

DevCycle architecture

“Workers are breaking our events pipeline”

Building without a database next to your code

The predictable cost structure of Workers

The future of DevCycle and Cloudflare

Your Worker can now have its own cache in front of it

How we built saga rollbacks for Cloudflare Workflows

Unlocking the Cloudflare app ecosystem with OAuth for all

How we found a bug in the hyper HTTP library