
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Mon, 13 Apr 2026 18:05:10 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Workers AI gets a speed boost, batch workload support, more LoRAs, new models, and a refreshed dashboard]]></title>
            <link>https://blog.cloudflare.com/workers-ai-improvements/</link>
            <pubDate>Fri, 11 Apr 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ We just made Workers AI inference faster with speculative decoding & prefix caching. Use our new batch inference for handling large request volumes seamlessly. ]]></description>
            <content:encoded><![CDATA[ <p>Since the <a href="https://blog.cloudflare.com/workers-ai/"><u>launch of Workers AI</u></a> in September 2023, our mission has been to make inference accessible to everyone.</p><p>Over the last few quarters, our Workers AI team has been heads down on improving the quality of our platform, working on improvements to routing, GPU optimization, and capacity management. Managing a distributed inference platform is not a simple task, but distributed systems are also what we do best. You’ll notice a recurring theme from all these announcements that has always been part of the core Cloudflare ethos — we try to solve problems through clever engineering so that we can do more with less.</p><p>Today, we’re excited to introduce speculative decoding to bring you faster inference, an asynchronous batch API for large workloads, and expanded LoRA support for more customized responses. Lastly, we’ll recap some of our newly added models and updated pricing, and unveil a new dashboard to round out the usability of the platform.</p>
    <div>
      <h2>Speeding up inference by 2-4x with speculative decoding and more</h2>
      <a href="#speeding-up-inference-by-2-4x-with-speculative-decoding-and-more">
        
      </a>
    </div>
    <p>We’re excited to roll out speed improvements to models in our catalog, starting with the Llama 3.3 70b model. These improvements include speculative decoding, prefix caching, an updated inference backend, and more. We’ve previously done a technical deep dive on speculative decoding and how we’re making Workers AI faster, which <a href="https://blog.cloudflare.com/making-workers-ai-faster/"><u>you can read about here</u></a>. With these changes, we’ve been able to improve inference times by 2-4x, without any significant change to the quality of answers generated. We’re planning to incorporate these improvements into more models in the future as we release them. Today, we’re starting to roll out these changes so all Workers AI users of <code>@cf/meta/llama-3.3-70b-instruct-fp8-fast</code> will enjoy this automatic speed boost.</p>
    <div>
      <h3>What is speculative decoding?</h3>
      <a href="#what-is-speculative-decoding">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Jc5CeeOpTW1LSZ7xeZumY/99ced72a25bdabea276f98c03bc17e27/image3.png" />
          </figure><p>LLMs generate text by predicting the next token in a sequence given the previous tokens. Typically, an LLM can predict a single future token (n+1) with one forward pass through the model. These forward passes can be computationally expensive, since they need to work through all the parameters of a model to generate one token (e.g., 70 billion parameters for Llama 3.3 70b).</p><p>With speculative decoding, we put a small model (known as the draft model) in front of the original model to help predict n+x future tokens. The draft model generates a set of candidate tokens, and the original model only has to evaluate and confirm whether they should be incorporated into the generation. Evaluating tokens is less computationally expensive, as the model can evaluate multiple tokens concurrently in a forward pass. As such, inference times can be sped up by 2-4x, meaning that users get responses much faster.</p><p>What makes speculative decoding particularly efficient is that it uses GPU compute that would otherwise be left idle by the memory bottleneck LLMs create. Speculative decoding takes advantage of that spare compute by squeezing in a draft model to generate tokens faster. This means we’re able to use our GPUs to their full extent, without having parts of the GPU sit idle.</p>
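The propose-then-verify loop can be sketched in a few lines of JavaScript. This is a toy illustration only, not our production implementation: the "draft" and "target" models below are made-up deterministic stand-ins, and the draft length `k` is an assumed parameter.

```javascript
// Toy sketch of speculative decoding. A cheap draft model proposes k tokens;
// the large target model verifies them in one (simulated) batched pass and
// keeps the longest agreeing prefix, plus one guaranteed-correct token.

// Stand-in "models": deterministic next-token functions over integer tokens.
const targetNext = (ctx) => (ctx.reduce((a, b) => a + b, 0) + ctx.length) % 7;
const draftNext = (ctx) => (ctx.length % 3 === 0 ? 1 : targetNext(ctx)); // often agrees

function speculativeDecode(prompt, numTokens, k = 4) {
  const out = [...prompt];
  let targetPasses = 0; // expensive full-model passes actually spent
  while (out.length < prompt.length + numTokens) {
    // 1. Draft model proposes k candidate tokens sequentially (cheap).
    const draft = [];
    for (let i = 0; i < k; i++) draft.push(draftNext([...out, ...draft]));
    // 2. Target model evaluates all candidates concurrently in one pass.
    targetPasses++;
    let accepted = 0;
    while (accepted < k && targetNext([...out, ...draft.slice(0, accepted)]) === draft[accepted]) {
      accepted++;
    }
    out.push(...draft.slice(0, accepted));
    // 3. The same pass yields one correct token at the first mismatch.
    if (out.length < prompt.length + numTokens) out.push(targetNext(out));
  }
  return { tokens: out.slice(prompt.length, prompt.length + numTokens), targetPasses };
}

// The output matches plain greedy decoding with the target model alone,
// but uses fewer expensive target passes than tokens generated.
const { tokens, targetPasses } = speculativeDecode([1, 2], 8);
```

Because accepted tokens are exactly those the target model would have produced itself, the generated text is unchanged; the speedup comes from covering several tokens with each expensive pass.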
    <div>
      <h3>What is prefix caching?</h3>
      <a href="#what-is-prefix-caching">
        
      </a>
    </div>
    <p>With LLMs, there are usually two stages of generation: the first, known as “pre-fill”, processes the user’s input tokens, such as the prompt and context; the second, decoding, generates the output tokens one at a time. Prefix caching is aimed at reducing the pre-fill time of a request. As an example, if you were asking a model to generate code based on a given file, you might insert the whole file into the context window of a request. Then, if you want to make a second request to generate the next line of code, you might send us the whole file again in the second request. Prefix caching allows us to cache the pre-fill tokens so we don’t have to process the context twice. With the same example, we would do the pre-fill stage only once for both requests, rather than once per request. This method is especially useful for requests that reuse the same context, such as <a href="https://www.cloudflare.com/learning/ai/retrieval-augmented-generation-rag/"><u>Retrieval Augmented Generation (RAG)</u></a>, code generation, chatbots with memory, and more. Skipping the pre-fill stage for similar requests means faster responses for our users and more efficient usage of resources.</p>
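The idea can be sketched as a simple memo table. The `prefill` and `generate` helpers below are hypothetical stand-ins for the real inference stages, used only to show where the cache lookup happens:

```javascript
// Toy sketch of prefix caching (illustrative only): the expensive pre-fill
// pass over a shared context runs once and is reused by later requests
// that start with the same prefix.
const prefillCache = new Map();
let prefillCalls = 0;

// Hypothetical stand-in for the costly pre-fill stage over the context.
function prefill(context) {
  prefillCalls++;
  return { kvState: `kv-entries-for-${context.length}-chars` }; // pretend KV cache
}

function generate(context, question) {
  // Reuse the cached pre-fill state when this context prefix was seen before.
  if (!prefillCache.has(context)) prefillCache.set(context, prefill(context));
  const { kvState } = prefillCache.get(context);
  return `answer to "${question}" using ${kvState}`;
}

const file = "function add(a, b) { return a + b; }";
const first = generate(file, "explain this code");    // pre-fill happens here
const second = generate(file, "write the next line"); // pre-fill is skipped
```

Both requests share one pre-fill of the file; only the decoding work differs between them.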
    <div>
      <h3>How did you validate that quality is preserved through these optimizations?</h3>
      <a href="#how-did-you-validate-that-quality-is-preserved-through-these-optimizations">
        
      </a>
    </div>
    <p>Since this is an in-place update to an existing model, we were particularly cautious in ensuring that we would not break any existing applications with this update. We did extensive A/B testing through a blind arena with internal employees to validate the model quality, and we asked internal and external customers to test the new version of the model to ensure that response formats were compatible and model quality was acceptable. Our testing concluded that the model performed up to standards, with people being extremely excited about the speed of the model. Most LLMs are not perfectly deterministic even with the same set of inputs, but if you do notice something off, please let us know through <a href="https://discord.com/invite/cloudflaredev"><u>Discord</u></a> or <a href="http://x.com/cloudflaredev"><u>X</u></a>.</p>
    <div>
      <h2>Asynchronous batch API</h2>
      <a href="#asynchronous-batch-api">
        
      </a>
    </div>
    <p>Next up, we’re announcing an asynchronous (async) batch API, which is helpful for users with large workloads. This feature allows customers to receive their inference responses asynchronously, with the promise that the inference will be completed at a later time rather than immediately erroring out due to capacity.</p><p>An example use case for batch workloads is generating summaries of a large number of documents. You probably don’t need to use those summaries immediately, as you’ll likely use them once the whole document is complete rather than one paragraph at a time. For these use cases, we’ve made it super simple for you to start sending us these requests in batches.</p>
    <div>
      <h3>Why batch requests?</h3>
      <a href="#why-batch-requests">
        
      </a>
    </div>
    <p>From talking to our customers, the most common use case we hear about is people creating embeddings or summarizing a large number of documents. Unfortunately, this is also one of the hardest use cases to manage capacity for as a serverless platform.</p><p>To illustrate this, imagine that you want to summarize a 70-page PDF. You typically chunk the document and then send an inference request for each chunk. If each chunk is a few paragraphs on a page, that means we receive around 4 requests per page multiplied by 70 pages, which is 280 requests. Multiply that by tens or hundreds of documents, and multiply that by a handful of concurrent users — this means we get a sudden massive influx of thousands of requests when users start these large workloads.</p><p>The way we originally built Workers AI was to handle incoming requests as quickly as possible, assuming there's a human on the other side that needs an immediate response. The unique thing about batch workloads is that while they're not latency sensitive, they do require completeness guarantees — you don't want to come back the next day to realize none of your inference requests actually executed.</p><p>With the async API, you send us a batch of requests, and we promise to fulfill them as fast as possible and return them to you as a batch. This guarantees that your inference request will be fulfilled, rather than immediately (or eventually) erroring out. The async API also benefits users who have real-time use cases, as model instances won’t be immediately consumed by batch requests that can wait for a response. Inference times will be faster since there won’t be a bunch of competing requests in a queue waiting to reach the inference servers.
</p><p>We have select models that support batch inference today, which include:</p><ul><li><p><a href="https://developers.cloudflare.com/workers-ai/models/llama-3.3-70b-instruct-fp8-fast/"><u>@cf/meta/llama-3.3-70b-instruct-fp8-fast</u></a></p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/bge-small-en-v1.5"><u>@cf/baai/bge-small-en-v1.5</u></a>, <a href="https://developers.cloudflare.com/workers-ai/models/bge-base-en-v1.5"><u>@cf/baai/bge-base-en-v1.5</u></a>, <a href="https://developers.cloudflare.com/workers-ai/models/bge-large-en-v1.5"><u>@cf/baai/bge-large-en-v1.5</u></a></p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/bge-m3/"><u>@cf/baai/bge-m3</u></a></p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/m2m100-1.2b/"><u>@cf/meta/m2m100-1.2b</u></a></p></li></ul>
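To make the document-summarization example concrete, here is a rough sketch of turning one chunked document into a single `requests` array for a batch call. The chunking strategy, the sample text, and the size limit are all illustrative assumptions, not a prescribed approach:

```javascript
// Split a document into paragraph-aligned chunks no larger than maxChars.
function chunkDocument(text, maxChars) {
  const paragraphs = text.split(/\n\s*\n/);
  const chunks = [];
  let current = "";
  for (const p of paragraphs) {
    if (current && current.length + p.length > maxChars) {
      chunks.push(current);
      current = "";
    }
    current = current ? `${current}\n\n${p}` : p;
  }
  if (current) chunks.push(current);
  return chunks;
}

// Hypothetical stand-in for text extracted from a PDF.
const sampleDocument = Array.from({ length: 10 }, (_, i) => `Paragraph ${i} ... some text.`).join("\n\n");

// One summarization request per chunk, collected into a single batch instead
// of hundreds of individual calls (maxChars kept tiny for demonstration).
const requests = chunkDocument(sampleDocument, 80).map((chunk) => ({
  prompt: `Summarize the following passage:\n\n${chunk}`,
}));
```

The resulting `requests` array is exactly the shape the batch call in the next section expects, so a whole document becomes one HTTP request rather than a burst of hundreds.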
    <div>
      <h3>How can I use the batch API?</h3>
      <a href="#how-can-i-use-the-batch-api">
        
      </a>
    </div>
    <p>Users can send a batch request to supported models by passing a flag:</p>
            <pre><code>let res = await env.AI.run("@cf/meta/llama-3.3-70b-instruct-batch", {
  // An array of individual inference requests to process as one batch
  "requests": [{
    "prompt": "Explain mechanics of wormholes"
  }, {
    "prompt": "List different plant species found in America"
  }]
}, {
  // Queue the batch for asynchronous processing instead of running it inline
  queueRequest: true
});</code></pre>
            <p>Check out our <a href="https://developers.cloudflare.com/workers-ai/features/batch-api/"><u>developer docs</u></a> to learn more about the batch API, or use our <a href="https://github.com/craigsdennis/batch-please-workers-ai"><u>template</u></a> to deploy a worker that implements the batch API.</p><p>Today, our batch API can be used by sending us an array of requests, and we’ll return your responses in an array. This is helpful for use cases like summarizing large amounts of data that you know beforehand. This means you can send us a single HTTP request with all of your requests, and receive a single HTTP response back with all of your responses. You can check on the status of the batch by polling with the request ID we return when your batch is submitted. For the next iteration of our async API, we plan to allow queue-based inputs and outputs, where you push requests and pull responses from a queue. This will integrate tightly with <a href="https://developers.cloudflare.com/r2/buckets/event-notifications/"><u>Event Notifications</u></a> and <a href="https://developers.cloudflare.com/workflows/"><u>Workflows</u></a>, so you can execute subsequent actions upon receiving a response.</p>
    <div>
      <h2>Expanded LoRA support</h2>
      <a href="#expanded-lora-support">
        
      </a>
    </div>
    <p>At Birthday Week last year, <a href="https://blog.cloudflare.com/workers-ai-ga-huggingface-loras-python-support/#supporting-fine-tuned-inference-byo-loras"><u>we announced limited LoRA support</u></a> for a handful of models. We’ve iterated on this and now support 8 models, as well as larger ranks of up to 32 and LoRA files up to 300 MB. Models that support LoRA inference now include:</p><ul><li><p><a href="https://developers.cloudflare.com/workers-ai/models/llama-3.2-11b-vision-instruct/"><u>@cf/meta/llama-3.2-11b-vision-instruct</u></a> (soon)</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/llama-3.3-70b-instruct-fp8-fast/"><u>@cf/meta/llama-3.3-70b-instruct-fp8-fast</u></a> (soon)</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/llama-guard-3-8b/"><u>@cf/meta/llama-guard-3-8b</u></a></p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/llama-3.1-8b-instruct-fast/"><u>@cf/meta/llama-3.1-8b-instruct-fast</u></a> (soon)</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/deepseek-r1-distill-qwen-32b/"><u>@cf/deepseek-ai/deepseek-r1-distill-qwen-32b</u></a> (soon)</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/qwen2.5-coder-32b-instruct"><u>@cf/qwen/qwen2.5-coder-32b-instruct</u></a></p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/qwq-32b"><u>@cf/qwen/qwq-32b</u></a></p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/mistral-small-3.1-24b-instruct"><u>@cf/mistralai/mistral-small-3.1-24b-instruct</u></a> (soon)</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/gemma-3-12b-it"><u>@cf/google/gemma-3-12b-it</u></a> (soon)</p></li></ul>
    <div>
      <h3>What is LoRA?</h3>
      <a href="#what-is-lora">
        
      </a>
    </div>
    <p>In essence, a Low Rank Adaptation (LoRA) adapter allows people to take a trained adapter file and use it in conjunction with a model to alter the response of a model. We did a <a href="https://blog.cloudflare.com/fine-tuned-inference-with-loras/"><u>deep dive on LoRAs</u></a> in our Birthday Week blog post, which goes into further technical detail. LoRA adapters are great alternatives to fine-tuning a model, as they aren’t as expensive to train, and adapters are much smaller and more portable. They are also effective enough to tweak the output of a model to fit a certain style of response.</p>
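A quick back-of-the-envelope calculation shows why adapter files stay small. For a weight matrix W of size d×k, a LoRA adapter stores two low-rank factors B (d×r) and A (r×k) and applies W + B·A at inference time. The dimensions below are illustrative, not a specific model's:

```javascript
// Illustrative parameter count for one d×k weight matrix (dimensions assumed).
const d = 4096, k = 4096;
const r = 32; // rank: matches the maximum rank now supported

const fullUpdateParams = d * k;   // a full fine-tune touches every weight
const loraParams = d * r + r * k; // the adapter stores only B and A

// With d === k, the adapter is 2r/d the size of the full update.
const ratio = loraParams / fullUpdateParams;
```

Per matrix, that is a 64× reduction in these assumed dimensions, which is why an adapter can fit within the 300 MB upload limit instead of shipping a whole fine-tuned model.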
    <div>
      <h3>How do I get started?</h3>
      <a href="#how-do-i-get-started">
        
      </a>
    </div>
    <p>To get started, you first need to train your own LoRA adapter or find a public one on HuggingFace. Then, you’ll upload the <code>adapter_model.safetensors</code> and <code>adapter_config.json</code> to your account with the <a href="https://developers.cloudflare.com/workers-ai/fine-tunes/loras/"><u>documented wrangler commands or through the REST API</u></a>. LoRA files are private and scoped to your own account. After that, you can start running fine-tuned inference — check out our <a href="https://developers.cloudflare.com/workers-ai/features/fine-tunes/loras/"><u>LoRA developer docs</u></a> to get started.</p>
            <pre><code>const response = await env.AI.run(
  "@cf/qwen/qwen2.5-coder-32b-instruct", // the model supporting LoRAs
  {
      messages: [{"role": "user", "content": "Hello world"}],
      raw: true, // skip applying the default chat template
      lora: "00000000-0000-0000-0000-000000000000", // the finetune id OR finetune name
  }
);</code></pre>
            
    <div>
      <h2>Quality of life improvements: updated pricing and a new dashboard for Workers AI</h2>
      <a href="#quality-of-life-improvements-updated-pricing-and-a-new-dashboard-for-workers-ai">
        
      </a>
    </div>
    <p>While the team has been focused on large engineering milestones, we’ve also landed some quality of life improvements over the last few months. In case you missed it, we’ve announced <a href="https://developers.cloudflare.com/changelog/2025-02-20-updated-pricing-docs/"><u>an updated pricing model</u></a> where usage is shown in units such as tokens, audio seconds, and image size/steps, but still billed in neurons on the backend.</p><p>Today, we’re unveiling a new dashboard that allows users to see their usage in both units and neurons (built on <a href="https://blog.cloudflare.com/introducing-workers-observability-logs-metrics-and-queries-all-in-one-place/"><u>new Workers Observability</u></a> components!). Model pricing is also available via the dashboard and on the models page in the developer docs. And if you use AI Gateway, Workers AI usage is now also displayed as metrics.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ABZi7EC8dedCY4ru0ffsA/8eb63a9f4a626ca70ce6101760d23900/image1.png" />
          </figure>
    <div>
      <h2>New models available in Workers AI</h2>
      <a href="#new-models-available-in-workers-ai">
        
      </a>
    </div>
    <p>Lastly, we’ve steadily been adding new models on Workers AI, with over 10 new models and a few updates on existing models. Pricing is also now listed directly on the model page in the developer docs. To summarize, here are the new models we’ve added on Workers AI, including four new ones we’re releasing today:</p><ul><li><p><a href="https://developers.cloudflare.com/workers-ai/models/deepseek-r1-distill-qwen-32b/"><u>@cf/deepseek-ai/deepseek-r1-distill-qwen-32b</u></a>: a version of Qwen 32B distilled from Deepseek’s R1 that is capable of doing chain-of-thought reasoning.</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/bge-m3/"><u>@cf/baai/bge-m3</u></a>: a multi-lingual embeddings model that supports over 100 languages. It can also simultaneously perform dense retrieval, multi-vector retrieval, and sparse retrieval, with the ability to process inputs of different granularities.</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/bge-reranker-base/"><u>@cf/baai/bge-reranker-base</u></a>: our first reranker model! Rerankers are a type of text classification model that takes a query and context, and outputs a similarity score between the two. When used in RAG systems, you can use a reranker after the initial vector search to find the most relevant documents to return to a user by reranking the outputs.</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/whisper-large-v3-turbo/"><u>@cf/openai/whisper-large-v3-turbo</u></a>: a faster, more accurate speech-to-text model. 
This model was added earlier but is graduating out of beta with pricing included today.</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/melotts/"><u>@cf/myshell-ai/melotts</u></a>: our first text-to-speech model that allows users to generate an MP3 with voice audio from text input.</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/llama-4-scout-17b-16e-instruct/"><u>@cf/meta/llama-4-scout-17b-16e-instruct</u></a>: 17 billion parameter MoE model with 16 experts that is natively multimodal. Offers industry-leading performance in text and image understanding.</p></li><li><p>[NEW] <a href="https://developers.cloudflare.com/workers-ai/models/mistral-small-3.1-24b-instruct"><u>@cf/mistralai/mistral-small-3.1-24b-instruct</u></a>: a 24B parameter model achieving state-of-the-art capabilities comparable to larger models, with support for vision and tool calling.</p></li><li><p>[NEW] <a href="https://developers.cloudflare.com/workers-ai/models/gemma-3-12b-it"><u>@cf/google/gemma-3-12b-it</u></a>: well-suited for a variety of text generation and image understanding tasks, including question answering, summarization and reasoning, with a 128K context window, and multilingual support in over 140 languages.</p></li><li><p>[NEW] <a href="https://developers.cloudflare.com/workers-ai/models/qwq-32b"><u>@cf/qwen/qwq-32b</u></a>: a medium-sized reasoning model, which is capable of achieving competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1, o1-mini.</p></li><li><p>[NEW] <a href="https://developers.cloudflare.com/workers-ai/models/qwen2.5-coder-32b-instruct"><u>@cf/qwen/qwen2.5-coder-32b-instruct</u></a>: the current state-of-the-art open-source code LLM, with its coding abilities matching those of GPT-4o.</p></li></ul><p>In addition, we are rolling out some in-place updates to existing models in our catalog:</p><ul><li><p><a 
href="https://developers.cloudflare.com/workers-ai/models/llama-3.3-70b-instruct-fp8-fast/"><u>@cf/meta/llama-3.3-70b-instruct-fp8-fast</u></a> - Llama 3.3 70b gets a speed boost with new techniques such as speculative decoding, prefix caching, and an updated server back end (<a href="#speeding-up-inference-by-2-4x-with-speculative-decoding-and-more"><u>see above</u></a>).</p></li><li><p><a href="https://developers.cloudflare.com/workers-ai/models/bge-small-en-v1.5"><u>@cf/baai/bge-small-en-v1.5</u></a>, <a href="https://developers.cloudflare.com/workers-ai/models/bge-base-en-v1.5"><u>@cf/baai/bge-base-en-v1.5</u></a>, <a href="https://developers.cloudflare.com/workers-ai/models/bge-large-en-v1.5"><u>@cf/baai/bge-large-en-v1.5</u></a> - get a new input parameter called “pooling” which takes either “cls” or “mean”</p></li></ul><p>As we release these new models, we’ll be deprecating old models to encourage use of the state-of-the-art models and make room in our catalog. We will send out an email notice on this shortly. Stay up to date with our model releases and deprecation announcements by <a href="https://developers.cloudflare.com/changelog/"><u>subscribing to our Developer Docs changelog</u></a>.</p>
    <div>
      <h2>We’re (still) just getting started</h2>
      <a href="#were-still-just-getting-started">
        
      </a>
    </div>
    <p>Workers AI is one of Cloudflare’s newer products in a nascent industry, but we still operate with very traditional Cloudflare principles – learning how we can do more with less. Our engineering team is focused on solving the difficult problems that come with growing a distributed inference platform at a global scale, and we’re excited to release these new features today that we think will improve the platform as a whole for all our users. With faster inference times, better reliability, more customization possibilities, and better usability, we’re excited to see what you can do with more Workers AI — <a href="https://discord.com/invite/cloudflaredev"><u>let us know what you think</u></a>!</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">5iJwjQcUANzpsgir2tgfNE</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Jesse Kipp</dc:creator>
        </item>
        <item>
            <title><![CDATA[Meta’s Llama 4 is now available on Workers AI]]></title>
            <link>https://blog.cloudflare.com/meta-llama-4-is-now-available-on-workers-ai/</link>
            <pubDate>Sun, 06 Apr 2025 03:22:00 GMT</pubDate>
            <description><![CDATA[ Llama 4 Scout 17B Instruct is now available on Workers AI: use this multimodal, Mixture of Experts AI model on Cloudflare's serverless AI platform to build next-gen AI applications. ]]></description>
            <content:encoded><![CDATA[ <p>As one of Meta’s launch partners, we are excited to make Meta’s latest and most powerful model, Llama 4, available on the Cloudflare <a href="https://developers.cloudflare.com/workers-ai/"><u>Workers AI</u></a> platform starting today. Check out the <a href="https://developers.cloudflare.com/workers-ai/models/llama-4-scout-17b-16e-instruct"><u>Workers AI Developer Docs</u></a> to begin using Llama 4 now.</p>
    <div>
      <h3>What’s new in Llama 4?</h3>
      <a href="#whats-new-in-llama-4">
        
      </a>
    </div>
    <p>Llama 4 is an industry-leading release that pushes forward the frontiers of open-source generative Artificial Intelligence (AI) models. Llama 4 relies on a novel design that combines a <a href="#what-is-a-mixture-of-experts-model"><u>Mixture of Experts</u></a> architecture with an early-fusion backbone that allows it to be natively multimodal.</p><p>The Llama 4 “herd” is made up of two models: Llama 4 Scout (109B total parameters, 17B active parameters) with 16 experts, and Llama 4 Maverick (400B total parameters, 17B active parameters) with 128 experts. The Llama 4 Scout model is available on Workers AI today.</p><p>Llama 4 Scout has a context window of up to 10 million (10,000,000) tokens, which makes it one of the first open-source models to support a window of that size. A larger context window makes it possible to hold longer conversations, deliver more personalized responses, and support better <a href="https://developers.cloudflare.com/workers-ai/guides/tutorials/build-a-retrieval-augmented-generation-ai/"><u>Retrieval Augmented Generation</u></a> (RAG). For example, users can take advantage of that increase to summarize multiple documents or reason over large codebases. At launch, Workers AI supports a context window of 131,000 tokens, and we’ll be working to increase this in the future.</p><p>Llama 4 does not compromise parameter depth for speed. Despite having 109 billion total parameters, the Mixture of Experts (MoE) architecture intelligently uses only a fraction of those parameters during active inference. This delivers faster responses while retaining the quality of the full 109B-parameter model.</p>
    <div>
      <h3>What is a Mixture of Experts model?</h3>
      <a href="#what-is-a-mixture-of-experts-model">
        
      </a>
    </div>
    <p>A Mixture of Experts (MoE) model is a type of <a href="https://arxiv.org/abs/2209.01667"><u>Sparse Transformer</u></a> model that is composed of individual specialized neural networks called “experts”. MoE models also have a “router” component that manages input tokens and which experts they get sent to. These specialized experts work together to provide deeper results and faster inference times, increasing both model quality and performance.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7nQnnpYyTW5pLVPofbW6YD/3f9e79c13a419220cda20e7cae43c578/image2.png" />
          </figure><p>For an illustrative example, let’s say there’s an expert that’s really good at generating code while another expert is really good at creative writing. When a request comes in to write a <a href="https://en.wikipedia.org/wiki/Fibonacci_sequence"><u>Fibonacci</u></a> algorithm in Haskell, the router sends the input tokens to the coding expert. This means that the remaining experts may stay inactive, so the model only needs to use the smaller, specialized neural network to solve the problem.</p><p>In the case of Llama 4 Scout, this means the model is only using one expert (17B parameters) instead of the full 109B total parameters of the model. In reality, the model probably needs to use multiple experts to handle a request, but the point still stands: an MoE model architecture is incredibly efficient for the breadth of problems it can handle and the speed at which it can handle them.</p><p>MoE also makes it more efficient to train models. We recommend reading <a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/"><u>Meta’s blog post</u></a> on how they trained the Llama 4 models. While more efficient to train, hosting an MoE model for inference can sometimes be more challenging. You need to load the full model weights (over 200 GB) into GPU memory. Supporting a larger context window also requires keeping more memory available in a key-value (KV) cache.</p><p>Thankfully, Workers AI solves this by offering Llama 4 Scout as a serverless model, meaning that you don’t have to worry about things like infrastructure, hardware, memory, etc. — we do all of that for you, so you are only one API request away from interacting with Llama 4.</p>
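The routing described above can be sketched with a toy top-1 gate. This is purely illustrative, not Llama 4's actual learned router: the keyword-based scoring and the two "experts" are made-up stand-ins.

```javascript
// Two tiny "experts": only the one selected by the router actually runs.
const experts = {
  code: (tokens) => `code expert handled ${tokens.length} tokens`,
  writing: (tokens) => `writing expert handled ${tokens.length} tokens`,
};

// Stand-in gate: score each expert for the input (a real router is learned).
function route(tokens) {
  const scores = {
    code: tokens.filter((t) => ["function", "haskell", "fibonacci"].includes(t)).length,
    writing: tokens.filter((t) => ["poem", "story", "essay"].includes(t)).length,
  };
  // Top-1 selection: the other experts stay idle, so only a fraction of the
  // model's total parameters are active for this request.
  return Object.keys(scores).reduce((a, b) => (scores[b] > scores[a] ? b : a));
}

const tokens = ["write", "a", "fibonacci", "function", "in", "haskell"];
const expert = route(tokens);           // routes to the coding expert
const output = experts[expert](tokens); // only that expert's network runs
```

A real MoE router activates a small subset of experts per token rather than one per request, but the efficiency argument is the same: most of the network never runs.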
    <div>
      <h3>What is early-fusion?</h3>
      <a href="#what-is-early-fusion">
        
      </a>
    </div>
    <p>One challenge in building AI-powered applications is the need to grab multiple different models, like a Large Language Model (LLM) and a vision model, to deliver a complete experience for the user. Llama 4 solves that problem by being natively multimodal, meaning the model can understand both text and images.</p><p>You might recall that <a href="https://developers.cloudflare.com/workers-ai/models/llama-3.2-11b-vision-instruct/"><u>Llama 3.2 11b</u></a> was also a vision model, but Llama 3.2 actually used separate parameters for vision and text. This means that when you sent an image request to the model, it only used the vision parameters to understand the image.</p><p>With Llama 4, all the parameters natively understand both text and images. This allowed Meta to train the model parameters with large amounts of unlabeled text, image, and video data together. For the user, this means that you don’t have to chain together multiple models like a vision model and an LLM for a multimodal experience — you can do it all with Llama 4.</p>
    <div>
      <h3>Try it out now!</h3>
      <a href="#try-it-out-now">
        
      </a>
    </div>
    <p>We are excited to be a launch partner with Meta, making it effortless for developers to use Llama 4 in Cloudflare Workers AI. The release brings an efficient, multimodal, highly capable, open-source model to anyone who wants to build AI-powered applications.</p><p>Cloudflare’s Developer Platform makes it possible to build complete applications that run alongside our Llama 4 inference. You can rely on our compute, storage, and agent layer running seamlessly with the inference from models like Llama 4. To learn more, head over to our <a href="https://developers.cloudflare.com/workers-ai/models/llama-4-scout-17b-16e-instruct"><u>developer docs model page</u></a> for more information on using Llama 4 on Workers AI, including pricing, additional terms, and acceptable use policies.</p><p>Want to try it out without an account? Visit our <a href="https://playground.ai.cloudflare.com/"><u>AI playground</u></a> or get started building your AI experiences with Llama 4 and Workers AI.</p>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Workers AI]]></category>
            <guid isPermaLink="false">3G2O7IP6rSTIhSEUVmIDkt</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Jesse Kipp</dc:creator>
            <dc:creator>Nikhil Kothari</dc:creator>
        </item>
        <item>
            <title><![CDATA[Making Workers AI faster and more efficient: Performance optimization with KV cache compression and speculative decoding]]></title>
            <link>https://blog.cloudflare.com/making-workers-ai-faster/</link>
            <pubDate>Thu, 26 Sep 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ With a new generation of data center accelerator hardware and using optimization techniques such as KV cache compression and speculative decoding, we’ve made large language model (LLM) ]]></description>
            <content:encoded><![CDATA[ <p>During Birthday Week 2023, <a href="https://blog.cloudflare.com/workers-ai/"><u>we launched Workers AI</u></a>. Since then, we have been listening to your feedback, and one thing we’ve heard consistently is that our customers want Workers AI to be faster. In particular, we hear that large language model (LLM) generation needs to be faster. Users want their interactive chat and agents to go faster, developers want faster help, and users do not want to wait for applications and generated website content to load. Today, we’re announcing three upgrades we’ve made to Workers AI to bring faster and more efficient inference to our customers: upgraded hardware, KV cache compression, and speculative decoding.</p>
    <div>
      <h3>Watch on Cloudflare TV</h3>
      <a href="#watch-on-cloudflare-tv">
        
      </a>
    </div>
    <div>
  
</div><p>Thanks to Cloudflare’s <a href="https://blog.cloudflare.com/gen-12-servers/"><u>12th generation compute servers</u></a>, our network now supports a newer generation of GPUs capable of supporting larger models and faster inference. Customers can now use <a href="https://developers.cloudflare.com/workers-ai/models/llama-3.2-11b-vision-instruct"><u>Meta Llama 3.2 11B</u></a>, Meta’s <a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/"><u>newly released</u></a> multi-modal model with vision support, as well as Meta Llama 3.1 70B on Workers AI. Depending on load and time of day, customers can expect to see two to three times the throughput for Llama 3.1 and 3.2 compared to our previous generation Workers AI hardware. More performance information for these models can be found in today’s post: <a href="https://blog.cloudflare.com/workers-ai-bigger-better-faster"><i><u>Cloudflare’s Bigger, Better, Faster AI platform</u></i></a>.</p>
    <div>
      <h2>New KV cache compression methods, now open source</h2>
      <a href="#new-kv-cache-compression-methods-now-open-source">
        
      </a>
    </div>
    <p>In our effort to deliver low-cost, low-latency inference to the world, Workers AI has been developing novel methods to boost the efficiency of LLM inference. Today, we’re excited to announce a technique for KV cache compression that can help increase the throughput of an inference platform. And we’ve made it open source too, so that everyone can benefit from our research.</p>
    <div>
      <h3>It’s all about memory</h3>
      <a href="#its-all-about-memory">
        
      </a>
    </div>
    <p>One of the main bottlenecks when running LLM inference is the amount of vRAM (memory) available. Every word that an LLM processes generates a set of vectors that encode the meaning of that word in the context of any earlier words in the input; they are used to generate new tokens in the future. These vectors are stored in the <i>KV cache</i>, causing the memory required for inference to scale linearly with the total number of tokens of all sequences being processed. This makes memory a bottleneck for a lot of transformer-based models. Because of this, the amount of memory an instance has available limits the number of sequences it can generate concurrently, as well as the maximum token length of sequences it can generate.</p>
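    <p>To put that linear scaling in concrete terms, here is a rough back-of-the-envelope estimate of KV cache size. The architecture numbers below (32 layers, 8 KV heads, head dimension 128, fp16) match Llama-3.1-8B’s published configuration, but treat this as an illustration rather than a measurement of any particular deployment; the helper function is hypothetical:</p>

```python
# Rough KV-cache size estimate (illustrative; figures below follow
# Llama-3.1-8B's public config, but real usage varies by precision
# and inference engine).
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2 vectors per token (one key, one value), per KV head, per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

per_token = kv_cache_bytes(1)        # 131072 bytes = 128 KiB per token
full_context = kv_cache_bytes(8192)  # one 8K-token sequence: 1 GiB
print(per_token, full_context / 2**30)
```

    <p>At 128 KiB per token, a single 8K-token sequence already consumes about 1 GiB, which is why cache size quickly caps how many sequences a GPU can serve concurrently.</p>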
    <div>
      <h3>So what is the KV cache anyway?</h3>
      <a href="#so-what-is-the-kv-cache-anyway">
        
      </a>
    </div>
    <p>LLMs are made up of layers, with an <a href="https://en.wikipedia.org/wiki/Attention_(machine_learning)"><u>attention</u></a> operation occurring in each layer. Within each layer’s attention operation, information is collected from the representations of all previous tokens that are stored in cache. This means that vectors in the KV cache are organized into layers, so that the active layer’s attention operation can only query vectors from the corresponding layer of KV cache. Furthermore, since attention within each layer is parallelized across multiple attention “heads”, the KV cache vectors of a specific layer are further subdivided into groups corresponding to each attention head of that layer.</p><p>The diagram below shows the structure of an LLM’s KV cache for a single sequence being generated. Each cell represents a KV and the model’s representation for a token consists of all KV vectors for that token across all attention heads and layers. As you can see, the KV cache for a single layer is allocated as an M x N matrix of KV vectors where M is the number of attention heads and N is the sequence length. This will be important later!</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ZagFp9yy3E55SR8GKRBvh/9e37f5890165e758ccaebf77464be483/BLOG-2571_2.png" />
          </figure><p>For a deeper look at attention, see the original “<a href="https://arxiv.org/abs/1706.03762"><u>Attention is All You Need</u></a>” paper. </p>
    <div>
      <h3>KV-cache compression — “use it or lose it”</h3>
      <a href="#kv-cache-compression-use-it-or-lose-it">
        
      </a>
    </div>
    <p>Now that we know what the KV cache looks like, let’s dive into how we can shrink it!</p><p>The most common approach to compressing the KV cache involves identifying vectors within it that are unlikely to be queried by future attention operations and can therefore be removed without impacting the model’s outputs. This is commonly done by looking at the past attention weights for each pair of key and value vectors (a measure of the degree with which that KV’s representation has been queried during past attention operations) and selecting the KVs that have received the lowest total attention for eviction. This approach is conceptually similar to a LFU (least frequently used) cache management policy: the less a particular vector is queried, the more likely it is to be evicted in the future.</p>
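    <p>A minimal sketch of that LFU-like eviction policy in pure Python (hypothetical inputs and function name; real implementations accumulate attention weights on the GPU across many decoding steps):</p>

```python
def evict_lowest_attention(kv_cache, attention_totals, n_evict):
    """LFU-style KV eviction sketch: drop the n_evict vectors that have
    received the least total attention, keeping the rest in token order.

    kv_cache: KV vectors for one attention head, in token order
    attention_totals: cumulative attention weight each KV has received
    """
    ranked = sorted(range(len(kv_cache)), key=lambda i: attention_totals[i])
    keep = sorted(ranked[n_evict:])  # survivors, back in token order
    return [kv_cache[i] for i in keep], [attention_totals[i] for i in keep]

# Toy example: five KVs, evict the two least-attended
kvs = ["kv0", "kv1", "kv2", "kv3", "kv4"]
totals = [5.0, 1.0, 4.0, 0.5, 3.0]
print(evict_lowest_attention(kvs, totals, 2))
# → (['kv0', 'kv2', 'kv4'], [5.0, 4.0, 3.0])
```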
    <div>
      <h3>Different attention heads need different compression rates</h3>
      <a href="#different-attention-heads-need-different-compression-rates">
        
      </a>
    </div>
    <p>As we saw earlier, the KV cache for each sequence in a particular layer is allocated on the GPU as a <i># attention heads X sequence length</i> tensor. This means that the total memory allocation scales with the <i>maximum</i> sequence length for all attention heads of the KV cache. Usually this is not a problem, since each sequence generates the same number of KVs per attention head.</p><p>When we consider the problem of eviction-based KV cache compression, however, this forces us to remove an equal number of KVs from each attention head when doing the compression. If we remove more KVs from one attention head alone, those removed KVs won’t actually contribute to lowering the memory footprint of the KV cache on GPU, but will just add more empty “padding” to the corresponding rows of the tensor. You can see this in the diagram below (note the empty cells in the second row below):</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/68Q5hVbfRF1vhqyeGNzB1Q/91056db8208c5e74be00e0add147b3e9/BLOG-2571_3.png" />
          </figure><p>The extra compression along the second head frees slots for two KVs, but the cache’s shape (and memory footprint) remains the same.</p><p>This forces us to use a fixed compression rate for all attention heads of KV cache, which is very limiting on the compression rates we can achieve before compromising performance.</p>
    <div>
      <h3>Enter PagedAttention</h3>
      <a href="#enter-pagedattention">
        
      </a>
    </div>
    <p>The solution to this problem is to change how our KV cache is represented in physical memory. <a href="https://arxiv.org/abs/2309.06180"><u>PagedAttention</u></a> can represent N x M tensors with padding efficiently by using an N x M block table to index into a series of “blocks”.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Sia3ZKKzBaHEfI8qLYr8o/57edb68d61ff916d322502aeb406c88c/BLOG-2571_4.png" />
          </figure><p>This lets us retrieve the i<sup>th</sup> element of a row by taking the i<sup>th</sup> block number from that row in the block table and using the block number to look up the corresponding block, so we avoid allocating space to padding elements in our physical memory representation. In our case, the elements in physical memory are the KV cache vectors, and the <i>M</i> and <i>N</i> that define the shape of our block table are the number of attention heads and sequence length, respectively. Since the block table only stores integer indices (rather than high-dimensional KV vectors), its memory footprint is negligible in most cases.</p>
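    <p>To make the indirection concrete, here is a toy sketch of a block-table lookup (deliberately simplified to one KV per block and plain Python containers; real PagedAttention packs many tokens into each block and lives in GPU memory):</p>

```python
# Toy block-table lookup: rows are attention heads, columns are positions.
# Only live KVs get a block; a position that was evicted simply has no
# table entry, so no physical memory is spent on padding.
blocks = {}       # block number -> KV vector (the "physical" memory)
block_table = []  # per-head list of block numbers (the logical layout)

def append_kv(head, vector):
    while len(block_table) <= head:
        block_table.append([])
    block_no = len(blocks)          # allocate the next free block
    blocks[block_no] = vector
    block_table[head].append(block_no)

def get_kv(head, i):
    # Indirect through the block table instead of a dense M x N tensor
    return blocks[block_table[head][i]]

append_kv(0, "k0v0"); append_kv(0, "k0v1")
append_kv(1, "k1v0")  # head 1 was compressed harder: fewer KVs, no padding
print(get_kv(0, 1), len(blocks))
```

    <p>The two head rows have different lengths, yet physical memory holds exactly three KVs: per-head compression rates no longer pad out to the longest row.</p>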
    <div>
      <h3>Results</h3>
      <a href="#results">
        
      </a>
    </div>
    <p>Using paged attention lets us apply different rates of compression to different heads in our KV cache, giving our compression strategy more flexibility than other methods. We tested our compression algorithm on <a href="https://arxiv.org/abs/2308.14508"><u>LongBench</u></a> (a collection of long-context LLM benchmarks) with Llama-3.1-8B and found that for most tasks we can retain over 95% task performance while reducing cache size by up to 8x (left figure below). Over 90% task performance can be retained while further compressing up to 64x. That means you have room in memory for 64 times as many tokens!</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/pdz5rPhYdfnMmn6cxhczo/29b69bb65aea8989fc1f50283e8ecbc5/BLOG-2571_5.png" />
          </figure><p>This lets us increase the number of requests we can process in parallel, increasing the total throughput (total tokens generated per second) by 3.44x and 5.18x for compression rates of 8x and 64x, respectively (right figure above).</p>
    <div>
      <h3>Try it yourself!</h3>
      <a href="#try-it-yourself">
        
      </a>
    </div>
    <p>If you’re interested in taking a deeper dive, check out our <a href="https://github.com/IsaacRe/vllm-kvcompress"><u>vLLM fork</u></a> and get compressing!</p>
    <div>
      <h2>Speculative decoding for faster throughput</h2>
      <a href="#speculative-decoding-for-faster-throughput">
        
      </a>
    </div>
    <p>A new inference strategy that we implemented is speculative decoding, which is a very popular way to get faster throughput (measured in tokens per second). LLMs work by predicting the next expected token (a token can be a word, word fragment, or single character) in the sequence with each call to the model, based on everything that the model has seen before. For the first token generated, this means just the initial prompt, but after that each subsequent token is generated based on the prompt plus all other tokens that have been generated. Typically, this happens one token at a time, generating a single word, or even a single letter, depending on what comes next.</p><p>But what about this prompt:</p><blockquote><p><i>Knock, knock!</i></p></blockquote><p>If you are familiar with knock-knock jokes, you could very accurately predict more than one token ahead. For an English language speaker, what comes next is a very specific sequence that is four to five tokens long: “Who’s there?” or “Who is there?” Human language is full of these types of phrases where the next word has only one, or a few, high-probability choices. Idioms, common expressions, and even basic grammar are all examples of this. So for each prediction the model makes, we can take it a step further with speculative decoding to predict the next <i>n</i> tokens. This allows us to speed up inference, as we’re not limited to predicting one token at a time.</p><p>There are several different implementations of speculative decoding, but each in some way uses a cheaper draft mechanism (often a smaller, faster-to-run model) to propose more than one token at a time. For Workers AI, we have applied <a href="https://github.com/apoorvumang/prompt-lookup-decoding"><u>prompt-lookup decoding</u></a> to some of the LLMs we offer. This simple method matches the last <i>n</i> tokens of generated text against text in the prompt/output and proposes the tokens that previously followed those patterns as candidates for continuing the output. 
In the case of knock-knock jokes, it can predict all the tokens for <i>“Who’s there</i>” at once after seeing “<i>Knock, knock!</i>”, as long as this setup occurs somewhere in the prompt or previous dialogue already. Once these candidate tokens have been predicted, the model can verify them all with a single forward-pass and choose to either accept or reject them. This increases the generation speed of llama-3.1-8b-instruct by up to 40% and the 70B model by up to 70%.</p><p>Speculative decoding has tradeoffs, however. Typically, the results of a model using speculative decoding have a lower quality, both when measured using benchmarks like <a href="https://paperswithcode.com/dataset/mmlu"><u>MMLU</u></a> as well as when compared by humans. More aggressive speculation can speed up sequence generation, but generally comes with a greater impact to the quality of the result. Prompt lookup decoding offers one of the smallest overall quality impacts while still providing performance improvements, and we will be adding it to some language models on Workers AI including <a href="https://developers.cloudflare.com/workers-ai/models/llama-3-8b-instruct"><u>@cf/meta/llama-3.1-8b-instruct</u></a>.</p><p>And, by the way, here is one of our favorite knock-knock jokes, can you guess the punchline?</p><blockquote><p><i>Knock, knock!</i></p><p><i>Who’s there?</i></p><p><i>Figs!</i></p><p><i>Figs who?</i></p><p><i>Figs the doorbell, it’s broken!</i></p></blockquote>
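    <p>The core of prompt-lookup decoding can be sketched in a few lines of Python. This is an illustrative toy using word-level tokens; the production implementation operates on tokenizer IDs inside the inference engine, and the candidates it proposes still get verified by the full model in a single forward pass:</p>

```python
def propose_candidates(tokens, ngram=3, max_candidates=5):
    """Prompt-lookup decoding, sketched: match the last `ngram` tokens
    against earlier text, and propose whatever followed the most recent
    earlier occurrence as draft tokens for the model to verify."""
    tail = tokens[-ngram:]
    # Scan for the most recent earlier occurrence of the tail n-gram
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            follow = tokens[start + ngram:start + ngram + max_candidates]
            if follow:
                return follow
    return []  # no repeated pattern: fall back to one-token-at-a-time

dialogue = "Knock , knock ! Who's there ? Figs ! Figs who ? Knock , knock !".split()
print(propose_candidates(dialogue))
# → ["Who's", 'there', '?', 'Figs', '!']
```

    <p>Because the second “Knock, knock!” repeats an n-gram from earlier in the dialogue, the lookup drafts the whole “Who’s there?” continuation at once, exactly the kind of phrase-level predictability the joke exploits.</p>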
    <div>
      <h2>Keep accelerating</h2>
      <a href="#keep-accelerating">
        
      </a>
    </div>
    <p>As the AI industry continues to evolve, there will be new hardware and software that allows customers to get faster inference responses. Workers AI is committed to researching, implementing, and making upgrades to our services to help you get fast inference. As an Inference-as-a-Service platform, you’ll be able to benefit from all the optimizations we apply, without having to hire your own team of ML researchers and SREs to manage inference software and hardware deployments.

We’re excited for you to try out these new releases and let us know what you think! Check out our full suite of AI announcements <a href="https://blog.cloudflare.com/tag/ai/"><u>here</u></a> and the <a href="https://developers.cloudflare.com/workers-ai/"><u>developer docs</u></a> to get started.</p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[LLM]]></category>
            <guid isPermaLink="false">29PAMer5L0do12OtNa557I</guid>
            <dc:creator>Isaac Rehg</dc:creator>
            <dc:creator>Jesse Kipp</dc:creator>
        </item>
        <item>
            <title><![CDATA[Leveling up Workers AI: general availability and more new capabilities]]></title>
            <link>https://blog.cloudflare.com/workers-ai-ga-huggingface-loras-python-support/</link>
            <pubDate>Tue, 02 Apr 2024 13:01:00 GMT</pubDate>
            <description><![CDATA[ Today, we’re excited to make a series of announcements, including Workers AI, Cloudflare’s inference platform becoming GA and support for fine-tuned models with LoRAs and one-click deploys from HuggingFace. Cloudflare Workers now supports the Python programming language, and more ]]></description>
            <content:encoded><![CDATA[
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1YNXJ4s4e47U7MvddTlpz8/3a53be280a5e373b589eba37bc4740d0/Cities-with-GPUs-momentum-update.png" />
            
            </figure><p>Welcome to Tuesday – our AI day of Developer Week 2024! In this blog post, we’re excited to share an overview of our new AI announcements and vision, including news about Workers AI officially going GA with improved pricing, a GPU hardware momentum update, an expansion of our Hugging Face partnership, Bring Your Own LoRA fine-tuned inference, Python support in Workers, more providers in AI Gateway, and Vectorize metadata filtering.</p>
    <div>
      <h3>Workers AI GA</h3>
      <a href="#workers-ai-ga">
        
      </a>
    </div>
    <p>Today, we’re excited to announce that our Workers AI inference platform is now Generally Available. After months of being in open beta, we’ve improved our service with greater reliability and performance, unveiled pricing, and added many more models to our catalog.</p>
    <div>
      <h4>Improved performance &amp; reliability</h4>
      <a href="#improved-performance-reliability">
        
      </a>
    </div>
    <p>With Workers AI, our goal is to make AI inference as reliable and easy to use as the rest of Cloudflare’s network. Under the hood, we’ve upgraded the load balancing that is built into Workers AI. Requests can now be routed to more GPUs in more cities, and each city is aware of the total available capacity for AI inference. If the request would have to wait in a queue in the current city, it can instead be routed to another location, getting results back to you faster when traffic is high. With this, we’ve increased rate limits across all our models – most LLMs now have a limit of 300 requests per minute, up from 50 requests per minute during our beta phase. Smaller models have a limit of 1500-3000 requests per minute. Check out our <a href="https://developers.cloudflare.com/workers-ai/platform/limits/">Developer Docs for the rate limits</a> of individual models.</p>
    <div>
      <h4>Lowering costs on popular models</h4>
      <a href="#lowering-costs-on-popular-models">
        
      </a>
    </div>
    <p>Alongside our GA of Workers AI, we published a <a href="https://ai.cloudflare.com/#pricing-calculator">pricing calculator</a> for our 10 non-beta models earlier this month. We want Workers AI to be one of the most affordable and accessible solutions to run <a href="https://www.cloudflare.com/learning/ai/inference-vs-training/">inference</a>, so we added a few optimizations to our models to make them more affordable. Now, Llama 2 is over 7x cheaper and Mistral 7B is over 14x cheaper to run than we had initially <a href="https://developers.cloudflare.com/workers-ai/platform/pricing/">published</a> on March 1. We want to remain the best platform for AI inference and will continue to roll out optimizations to our customers when we can.</p><p>As a reminder, our billing for Workers AI started on April 1st for our non-beta models, while beta models remain free and unlimited. We offer 10,000 <a href="/workers-ai#:~:text=may%20be%20wondering%20%E2%80%94-,what%E2%80%99s%20a%20neuron">neurons</a> per day for free to all customers. Workers Free customers will encounter a hard rate limit after 10,000 neurons in 24 hours, while Workers Paid customers will incur usage at $0.011 per 1,000 additional neurons. Read our <a href="https://developers.cloudflare.com/workers-ai/platform/pricing/">Workers AI Pricing Developer Docs</a> for the most up-to-date information on pricing.</p>
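    <p>As a worked example of how the neuron billing adds up (using the free allocation and per-neuron rate quoted above; the helper function is illustrative, and the pricing docs are authoritative for current rates):</p>

```python
def daily_cost_usd(neurons_used, free_neurons=10_000, rate_per_1000=0.011):
    """Workers Paid daily cost sketch: the first 10,000 neurons per day
    are free, and additional usage bills at $0.011 per 1,000 neurons."""
    billable = max(0, neurons_used - free_neurons)
    return billable / 1000 * rate_per_1000

print(daily_cost_usd(8_000))    # within the free allocation: 0.0
print(daily_cost_usd(110_000))  # 100,000 billable neurons: ~$1.10 for the day
```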
    <div>
      <h4>New dashboard and playground</h4>
      <a href="#new-dashboard-and-playground">
        
      </a>
    </div>
    <p>Lastly, we’ve revamped our <a href="https://dash.cloudflare.com/?to=/:account/ai/workers-ai">Workers AI dashboard</a> and <a href="https://playground.ai.cloudflare.com/">AI playground</a>. The Workers AI page in the Cloudflare dashboard now shows analytics for usage across models, including neuron calculations to help you better predict pricing. The AI playground lets you quickly test and compare different models and configure prompts and parameters. We hope these new tools help developers start building on Workers AI seamlessly – go try them out!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3uSiHo9pV21DreFiLpUfPX/5aa5c8a2448da881a0872e3f550c39a2/image3-3.png" />
            
            </figure>
    <div>
      <h3>Run inference on GPUs in over 150 cities around the world</h3>
      <a href="#run-inference-on-gpus-in-over-150-cities-around-the-world">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/qyZ48xVu70u80swQKPUau/1e85ea2b833139d277d5769d5a8c8c66/image5-2.png" />
            
            </figure><p>When we announced Workers AI back in September 2023, we set out to deploy GPUs to our data centers around the world. We plan to deliver on that promise and deploy inference-tuned GPUs almost everywhere by the end of 2024, making us the most widely distributed cloud-AI inference platform. We have over 150 cities with GPUs today and will continue to roll out more throughout the year.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/451IgQCfz2j10xM8kXhZy0/67632b1b5f52e3387aa387e43c804324/image7-1.png" />
            
            </figure><p>We also have our next generation of compute servers with GPUs launching in Q2 2024, which means better performance, power efficiency, and reliability over previous generations. We provided a preview of our Gen 12 Compute server design in a <a href="/cloudflare-gen-12-server-bigger-better-cooler-in-a-2u1n-form-factor">December 2023 blog post</a>, with more details to come. With Gen 12 and future planned hardware launches, the next step is to support larger machine learning models and offer fine-tuning on our platform. This will allow us to achieve higher inference throughput, lower latency, and greater availability for production workloads, as well as expand support to new categories of workloads such as fine-tuning.</p>
    <div>
      <h3>Hugging Face Partnership</h3>
      <a href="#hugging-face-partnership">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ATIA1mxaeqHYE31cztdjg/33717b0fd20244f3089cf5d4c8a9c13f/image2-2.png" />
            
            </figure><p>We’re also excited to continue our partnership with Hugging Face in the spirit of bringing the best of open-source to our customers. Now, you can visit some of the most popular models on Hugging Face and easily click to run the model on Workers AI if it is available on our platform.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3geQJnlHZhMktG1eoEWYKE/36c76ee43eca9443b4e2e9c6d3e1df7e/image6-1.png" />
            
            </figure><p>We’re happy to announce that we’ve added 4 more models to our platform in conjunction with Hugging Face. You can now access the new <a href="https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2">Mistral 7B v0.2</a> model with improved context windows, <a href="https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B">Nous Research’s Hermes 2 Pro</a> fine-tuned version of Mistral 7B, <a href="https://huggingface.co/google/gemma-7b-it">Google’s Gemma 7B</a>, and <a href="https://huggingface.co/Nexusflow/Starling-LM-7B-beta">Starling-LM-7B-beta</a> fine-tuned from OpenChat. There are currently 14 models that we’ve curated with Hugging Face to be available for serverless GPU inference powered by Cloudflare’s Workers AI platform, with more coming soon. These models are all served using Hugging Face’s technology with a <a href="https://github.com/huggingface/text-generation-inference/">TGI</a> backend, and we work closely with the Hugging Face team to curate, optimize, and deploy these models.</p><blockquote><p><i>“We are excited to work with Cloudflare to make AI more accessible to developers. Offering the most popular open models with a serverless API, powered by a global fleet of GPUs is an amazing proposition for the Hugging Face community, and I can’t wait to see what they build with it.”</i>- <b>Julien Chaumond</b>, Co-founder and CTO, Hugging Face</p></blockquote><p>You can find all of the open models supported in Workers AI in this <a href="https://huggingface.co/collections/Cloudflare/hf-curated-models-available-on-workers-ai-66036e7ad5064318b3e45db6">Hugging Face Collection</a>, and the “Deploy to Cloudflare Workers AI” button is at the top of each model card. To learn more, read Hugging Face’s <a href="http://huggingface.co/blog/cloudflare-workers-ai">blog post</a> and take a look at our <a href="https://developers.cloudflare.com/workers-ai/models/">Developer Docs</a> to get started. Have a model you want to see on Workers AI? 
Send us a message on <a href="https://discord.cloudflare.com">Discord</a> with your request.</p>
    <div>
      <h3>Supporting fine-tuned inference - BYO LoRAs</h3>
      <a href="#supporting-fine-tuned-inference-byo-loras">
        
      </a>
    </div>
    <p>Fine-tuned inference is one of our most requested features for Workers AI, and we’re one step closer now with Bring Your Own (BYO) LoRAs. Using the popular <a href="https://www.cloudflare.com/learning/ai/what-is-lora/">Low-Rank Adaptation</a> method, researchers have figured out how to take a model and adapt <i>some</i> model parameters to the task at hand, rather than rewriting <i>all</i> model parameters like you would for a fully fine-tuned model. This means that you can get fine-tuned model outputs without the computational expense of fully fine-tuning a model.</p><p>We now support bringing trained LoRAs to Workers AI, where we apply the LoRA adapter to a base model at runtime to give you fine-tuned inference, at a fraction of the cost, size, and speed of a fully fine-tuned model. In the future, we want to be able to support fine-tuning jobs and fully fine-tuned models directly on our platform, but we’re excited to be one step closer today with LoRAs.</p>
            <pre><code>const response = await ai.run(
  "@cf/mistralai/mistral-7b-instruct-v0.2-lora", // the model supporting LoRAs
  {
      messages: [{"role": "user", "content": "Hello world"}],
      raw: true, // skip applying the default chat template
      lora: "00000000-0000-0000-0000-000000000", // the finetune id OR name
  }
);</code></pre>
            <p>BYO LoRAs is in open beta as of today for Gemma 2B and 7B, Llama 2 7B and Mistral 7B models with LoRA adapters up to 100MB in size and max rank of 8, and up to 30 total LoRAs per account. As always, we expect you to use Workers AI and our new BYO LoRA feature with our <a href="https://www.cloudflare.com/service-specific-terms-developer-platform/#developer-platform-terms">Terms of Service</a> in mind, including any model-specific restrictions on use contained in the models’ license terms.</p><p>Read the technical deep dive blog post on <a href="/fine-tuned-inference-with-loras">fine-tuning with LoRA</a> and <a href="https://developers.cloudflare.com/workers-ai/fine-tunes">developer docs</a> to get started.</p>
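    <p>For intuition on why LoRA adapters stay so small, the update applied at runtime is low-rank: the effective weight is W + (alpha / r) · B · A, where B (d_out × r) and A (r × d_in) together hold far fewer parameters than W itself. A toy pure-Python illustration with made-up dimensions (not our serving code):</p>

```python
def matmul(X, Y):
    # Plain-Python matrix multiply, enough for a toy example
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_weight(W, A, B, alpha, r):
    """Effective weight with a LoRA adapter applied: W + (alpha/r) * B @ A.
    W is d_out x d_in; B is d_out x r; A is r x d_in, so the adapter stores
    r * (d_out + d_in) parameters instead of d_out * d_in."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy example: a rank-1 adapter on a 2x2 base weight
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # 2 x r, with r = 1
A = [[0.5, 0.5]]     # r x 2
print(lora_weight(W, A, B, alpha=1, r=1))
# → [[1.5, 0.5], [1.0, 2.0]]
```

    <p>Because only B and A are trained and shipped, an adapter for a 7B-parameter model can fit in tens of megabytes, which is what makes applying it to a shared base model at runtime practical.</p>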
    <div>
      <h3>Write Workers in Python</h3>
      <a href="#write-workers-in-python">
        
      </a>
    </div>
    <p>Python is the second most popular programming language in the world (after JavaScript) and the language of choice for building AI applications. And starting today, in open beta, you can now <a href="https://ggu-python.cloudflare-docs-7ou.pages.dev/workers/languages/python/">write Cloudflare Workers in Python</a>. Python Workers support all <a href="https://developers.cloudflare.com/workers/configuration/bindings/">bindings</a> to resources on Cloudflare, including <a href="https://developers.cloudflare.com/vectorize/">Vectorize</a>, <a href="https://developers.cloudflare.com/d1/">D1</a>, <a href="https://developers.cloudflare.com/kv/">KV</a>, <a href="https://www.cloudflare.com/developer-platform/products/r2/">R2</a> and more.</p><p><a href="https://ggu-python.cloudflare-docs-7ou.pages.dev/workers/languages/python/packages/langchain/">LangChain</a> is the most popular framework for building LLM‑powered applications, and like how <a href="/langchain-and-cloudflare">Workers AI works with langchain-js</a>, the <a href="https://python.langchain.com/docs/get_started/introduction">Python LangChain library</a> works on Python Workers, as do <a href="https://ggu-python.cloudflare-docs-7ou.pages.dev/workers/languages/python/packages/">other Python packages</a> like FastAPI.</p><p>Workers written in Python are just as simple as Workers written in JavaScript:</p>
            <pre><code>from js import Response

async def on_fetch(request, env):
    return Response.new("Hello world!")</code></pre>
            <p>…and are configured by simply pointing at a .py file in your <code>wrangler.toml</code>:</p>
            <pre><code>name = "hello-world-python-worker"
main = "src/entry.py"
compatibility_date = "2024-03-18"
compatibility_flags = ["python_workers"]</code></pre>
            <p>There are no extra toolchain or precompilation steps needed. The <a href="https://pyodide.org/en/stable/">Pyodide</a> Python execution environment is provided for you, directly by the Workers runtime, mirroring how Workers written in JavaScript already work.</p><p>There’s lots more to dive into — take a look at the <a href="https://ggu-python.cloudflare-docs-7ou.pages.dev/workers/languages/python/">docs</a>, and check out our <a href="/python-workers">companion blog post</a> for details about how Python Workers work behind the scenes.</p>
    <div>
      <h2>AI Gateway now supports Anthropic, Azure, AWS Bedrock, Google Vertex, and Perplexity</h2>
      <a href="#ai-gateway-now-supports-anthropic-azure-aws-bedrock-google-vertex-and-perplexity">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6YaNMed9Aw4YjhbZk75SGI/a53b499b743e36ad2635357118e6623f/image4-2.png" />
            
            </figure><p>Our <a href="/announcing-ai-gateway">AI Gateway</a> product helps developers better control and observe their AI applications, with analytics, caching, rate limiting, and more. We are continuing to add more providers to the product, including Anthropic, Google Vertex, and Perplexity, which we’re excited to announce today. We quietly rolled out Azure and Amazon Bedrock support in December 2023, which means that the most popular providers are now supported via AI Gateway, including Workers AI itself.</p><p>Take a look at our <a href="https://developers.cloudflare.com/ai-gateway/">Developer Docs</a> to get started with AI Gateway.</p>
    <div>
      <h4>Coming soon: Persistent Logs</h4>
      <a href="#coming-soon-persistent-logs">
        
      </a>
    </div>
    <p>In Q2 of 2024, we will be adding persistent logs so that you can push your logs (including prompts and responses) to <a href="https://www.cloudflare.com/learning/cloud/what-is-object-storage/">object storage</a>, custom metadata so that you can tag requests with user IDs or other identifiers, and secrets management so that you can securely manage your application’s API keys.</p><p>We want AI Gateway to be the control plane for your AI applications, allowing developers to dynamically evaluate and route requests to different models and providers. With our persistent logs feature, we want to enable developers to use their logged data to fine-tune models in one click, eventually running the fine-tune job and the fine-tuned model directly on our Workers AI platform. AI Gateway is just one product in our AI toolkit, but we’re excited about the workflows and use cases it can unlock for developers building on our platform, and we hope you’re excited about it too.</p>
    <div>
      <h3>Vectorize metadata filtering and future GA of million vector indexes</h3>
      <a href="#vectorize-metadata-filtering-and-future-ga-of-million-vector-indexes">
        
      </a>
    </div>
    <p>Vectorize is another component of our toolkit for AI applications. In open beta since September 2023, Vectorize allows developers to persist embeddings (vectors), like those generated from Workers AI <a href="https://developers.cloudflare.com/workers-ai/models/#text-embeddings">text embedding</a> models, and query for the closest match to support use cases like similarity search or recommendations. Without a vector database, model output is forgotten and can’t be recalled without extra costs to re-run a model.</p><p>Since Vectorize’s open beta, we’ve added <a href="https://developers.cloudflare.com/vectorize/reference/metadata-filtering/">metadata filtering</a>. Metadata filtering lets developers combine vector search with filtering for arbitrary metadata, supporting the query complexity of AI applications. We’re laser-focused on getting Vectorize ready for general availability, with a target launch date of June 2024, which will include support for multi-million vector indexes.</p>
            <pre><code>// Insert vectors with metadata
const vectors: Array&lt;VectorizeVector&gt; = [
  {
    id: "1",
    values: [32.4, 74.1, 3.2],
    metadata: { url: "/products/sku/13913913", streaming_platform: "netflix" }
  },
  {
    id: "2",
    values: [15.1, 19.2, 15.8],
    metadata: { url: "/products/sku/10148191", streaming_platform: "hbo" }
  },
...
];
let upserted = await env.YOUR_INDEX.upsert(vectors);

// Query with metadata filtering
let metadataMatches = await env.YOUR_INDEX.query(&lt;queryVector&gt;, { filter: { streaming_platform: "netflix" } });</code></pre>
            
    <div>
      <h3>The most comprehensive Developer Platform to build AI applications</h3>
      <a href="#the-most-comprehensive-developer-platform-to-build-ai-applications">
        
      </a>
    </div>
    <p>On Cloudflare’s Developer Platform, we believe that all developers should be able to quickly build and ship full-stack applications – and that includes AI experiences as well. With our GA of Workers AI, announcements for Python support in Workers, AI Gateway, and Vectorize, and our partnership with Hugging Face, we’ve expanded the world of possibilities for what you can build with AI on our platform. We hope you are as excited as we are – take a look at all our <a href="https://developers.cloudflare.com">Developer Docs</a> to get started, and <a href="https://discord.cloudflare.com/">let us know</a> what you build.</p> ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[General Availability]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <guid isPermaLink="false">6ItPe1u2j71C4DTSxJdccB</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Jesse Kipp</dc:creator>
            <dc:creator>Syona Sarma</dc:creator>
            <dc:creator>Brendan Irvine-Broque</dc:creator>
            <dc:creator>Vy Ton</dc:creator>
        </item>
        <item>
            <title><![CDATA[Workers AI Update: Hello, Mistral 7B!]]></title>
            <link>https://blog.cloudflare.com/workers-ai-update-hello-mistral-7b/</link>
            <pubDate>Tue, 21 Nov 2023 14:00:58 GMT</pubDate>
            <description><![CDATA[ Today we’re excited to announce that we’ve added the Mistral-7B-v0.1-instruct to Workers AI ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1vOF3ZSPVvD4CPRyt6PCZi/10554230ab60f8f2252ca5ee0eae47c7/Mistral-1.png" />
            
            </figure><p>Today we’re excited to announce that we’ve added the Mistral-7B-v0.1-instruct to Workers AI. Mistral 7B is a 7.3 billion parameter language model with a number of unique advantages. With some help from the founders of Mistral AI, we’ll look at some of the highlights of the Mistral 7B model, and use the opportunity to dive deeper into “attention” and its variations such as multi-query attention and grouped-query attention.</p>
    <div>
      <h2>Mistral 7B tl;dr:</h2>
      <a href="#mistral-7b-tl-dr">
        
      </a>
    </div>
    <p>Mistral 7B is a 7.3 billion parameter model that puts up <a href="https://mistral.ai/news/announcing-mistral-7b/">impressive numbers on benchmarks</a>. The model:</p><ul><li><p>Outperforms Llama 2 13B on all benchmarks,</p></li><li><p>Outperforms Llama 1 34B on many benchmarks,</p></li><li><p>Approaches CodeLlama 7B performance on code, while remaining good at English tasks, and</p></li><li><p>The chat fine-tuned version we’ve deployed outperforms Llama 2 13B chat in the benchmarks provided by Mistral.</p></li></ul><p>Here’s an example of using streaming with the <a href="https://developers.cloudflare.com/workers-ai/get-started/rest-api/">REST API</a>:</p>
            <pre><code>curl -X POST \
"https://api.cloudflare.com/client/v4/accounts/{account-id}/ai/run/@cf/mistral/mistral-7b-instruct-v0.1" \
-H "Authorization: Bearer {api-token}" \
-H "Content-Type: application/json" \
-d '{ "prompt": "What is grouped query attention", "stream": true }'

API Response: { response: "Grouped query attention is a technique used in natural language processing (NLP) and machine learning to improve the performance of models…" }</code></pre>
            <p>And here’s an example using a Worker script:</p>
            <pre><code>import { Ai } from '@cloudflare/ai';

export default {
    async fetch(request, env) {
        const ai = new Ai(env.AI);
        const stream = await ai.run('@cf/mistral/mistral-7b-instruct-v0.1', {
            prompt: 'What is grouped query attention',
            stream: true
        });
        return new Response(stream, { headers: { 'content-type': 'text/event-stream' } });
    }
}</code></pre>
            <p>Mistral takes advantage of <a href="https://arxiv.org/abs/2305.13245">grouped-query attention</a> for faster inference. This recently-developed technique improves the speed of inference without compromising output quality. For 7 billion parameter models, we can generate close to 4x as many tokens per second with Mistral as we can with Llama, thanks to grouped-query attention.</p><p>You don’t need any more information to start using Mistral-7B; you can test it out today at <a href="https://ai.cloudflare.com">ai.cloudflare.com</a>. To learn more about attention and grouped-query attention, read on!</p>
    <div>
      <h2>So what is “attention” anyway?</h2>
      <a href="#so-what-is-attention-anyway">
        
      </a>
    </div>
    <p>The basic mechanism of attention, specifically “Scaled Dot-Product Attention” as introduced in the landmark paper <a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>, is fairly simple:</p><blockquote><p>We call our particular attention “Scaled Dot-Product Attention”. The input consists of queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the query with all the keys, divide each by sqrt(d_k) and apply a softmax function to obtain the weights on the values.</p></blockquote><p>More concretely, this looks like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7gLX5h3XWjtQugoqf4Yqjh/9f4879fc9078fa291da9caacab1925d7/Screenshot-2023-11-21-at-09.12.30.png" />
            
            </figure><p><a href="https://arxiv.org/abs/1706.03762">source</a></p><p>In simpler terms, this allows models to focus on important parts of the input. Imagine you are reading a sentence and trying to understand it. Scaled dot-product attention enables you to pay more attention to certain words based on their relevance. It works by calculating the similarity between each word (K) in the sentence and a query (Q). Then, it scales the similarity scores by dividing them by the square root of the dimension of the query. This scaling helps to avoid very small or very large values. Finally, using these scaled similarity scores, we can determine how much attention or importance each word should receive. This attention mechanism helps models identify crucial information (V) and improve their understanding and translation capabilities.</p><p>Easy, right? To get from this simple mechanism to an AI that can write a “Seinfeld episode in which Jerry learns the bubble sort algorithm,” we’ll need to make it more complex. In fact, everything we’ve just covered doesn’t even have any learned parameters, values set during model training that customize the output of the attention block! Attention blocks in the style of <i>Attention is All You Need</i> add mainly three types of complexity:</p>
    <div>
      <h3>Learned parameters</h3>
      <a href="#learned-parameters">
        
      </a>
    </div>
    <p>Learned parameters refer to values or weights that are adjusted during the training process of a model to improve its performance. These parameters are used to control the flow of information or attention within the model, allowing it to focus on the most relevant parts of the input data. In simpler terms, learned parameters are like adjustable knobs on a machine that can be turned to optimize its operation.</p>
    <div>
      <h3>Vertical stacking - layered attention blocks</h3>
      <a href="#vertical-stacking-layered-attention-blocks">
        
      </a>
    </div>
    <p>Vertical layered stacking is a way to stack multiple attention mechanisms on top of each other, with each layer building on the output of the previous layer. This allows the model to focus on different parts of the input data at different levels of abstraction, which can lead to better performance on certain tasks.</p>
    <div>
      <h3>Horizontal stacking - aka Multi-Head Attention</h3>
      <a href="#horizontal-stacking-aka-multi-head-attention">
        
      </a>
    </div>
    <p>The figure from the paper displays the full multi-head attention module. Multiple attention operations are carried out in parallel, with the Q-K-V input for each generated by a unique linear projection of the same input data (defined by a unique set of learned parameters). These parallel attention blocks are referred to as “attention heads”. The weighted-sum outputs of all attention heads are concatenated into a single vector and passed through another parameterized linear transformation to get the final output.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5IhI1kh7JOS9ERb3aFuVHg/95e10b554d0f5029bf08fb10262cc29b/Screenshot-2023-11-21-at-09.13.49.png" />
            
            </figure><p><a href="https://arxiv.org/abs/1706.03762">source</a></p><p>This mechanism allows a model to focus on different parts of the input data concurrently. Imagine you are trying to understand a complex piece of information, like a sentence or a paragraph. In order to understand it, you need to pay attention to different parts of it at the same time. For example, you might need to pay attention to the subject of the sentence, the verb, and the object, all simultaneously, in order to understand the meaning of the sentence. Multi-headed attention works similarly. It allows a model to pay attention to different parts of the input data at the same time, by using multiple "heads" of attention. Each head of attention focuses on a different aspect of the input data, and the outputs of all the heads are combined to produce the final output of the model.</p>
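To make the mechanism above concrete, here is a small, dependency-free sketch of the computation a single attention head performs: similarity scores between the query and each key, scaled by the square root of the key dimension, softmaxed into weights, then applied as a weighted sum over the value vectors. In multi-head attention, each head runs this same computation on its own learned projection of the input. All vectors, dimensions, and numbers below are made up purely for illustration.

```typescript
// A toy sketch of scaled dot-product attention as described above:
// weights = softmax(Q·Kᵀ / sqrt(d_k)), output = weights · V.

function softmax(xs: number[]): number[] {
  const max = Math.max(...xs); // subtract the max for numerical stability
  const exps = xs.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function dot(a: number[], b: number[]): number {
  return a.reduce((acc, x, i) => acc + x * b[i], 0);
}

// One attention "read": a single query attends over all key/value pairs.
function scaledDotProductAttention(
  q: number[],        // query, dimension d_k
  keys: number[][],   // one key per input position, each of dimension d_k
  values: number[][]  // one value per input position, each of dimension d_v
): number[] {
  const dk = q.length;
  const scores = keys.map((k) => dot(q, k) / Math.sqrt(dk));
  const weights = softmax(scores); // how much attention each position gets
  const dv = values[0].length;
  const out = new Array(dv).fill(0);
  weights.forEach((w, i) => {
    for (let j = 0; j < dv; j++) out[j] += w * values[i][j];
  });
  return out;
}

// Illustrative input: the query is most similar to the second key,
// so the output leans toward the second value vector.
const out = scaledDotProductAttention(
  [1, 0],
  [[0, 1], [1, 0], [0.5, 0.5]],
  [[10, 0], [0, 10], [5, 5]]
);
```

Because the attention weights always sum to one, the output is a convex blend of the value vectors, weighted by how strongly each key matched the query.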
    <div>
      <h2>Styles of attention</h2>
      <a href="#styles-of-attention">
        
      </a>
    </div>
    <p>There are three common arrangements of attention blocks used by large language models developed in recent years: multi-head attention, grouped-query attention and multi-query attention. They differ in the number of K and V vectors relative to the number of query vectors. <b>Multi-head attention</b> uses the same number of K and V vectors as Q vectors, denoted by “N” in the table below. <b>Multi-query attention</b> uses only a single K and V vector. <b>Grouped-query attention</b>, the type used in the Mistral 7B model, divides the Q vectors evenly into groups containing “G” vectors each, then uses a single K and V vector for each group for a total of N divided by G sets of K and V vectors. This summarizes the differences, and we’ll dive into the implications of these below.</p><table>
	<tbody>
		<tr>
			<td> </td>
			<td>
			<p><span><span><span><strong>Number of Key/Value Blocks</strong></span></span></span></p>
			</td>
			<td>
			<p><span><span><span><strong>Quality</strong></span></span></span></p>
			</td>
			<td>
			<p><span><span><span><strong>Memory Usage</strong></span></span></span></p>
			</td>
		</tr>
		<tr>
			<td>
			<p><span><span><span><strong>Multi-head attention (MHA)</strong></span></span></span></p>
			</td>
			<td>
			<p><span><span><span>N</span></span></span></p>
			</td>
			<td>
			<p><span><span><span>Best</span></span></span></p>
			</td>
			<td>
			<p><span><span><span>Most</span></span></span></p>
			</td>
		</tr>
		<tr>
			<td>
			<p><span><span><span><strong>Grouped-query attention (GQA)</strong></span></span></span></p>
			</td>
			<td>
			<p><span><span><span>N / G</span></span></span></p>
			</td>
			<td>
			<p><span><span><span>Better</span></span></span></p>
			</td>
			<td>
			<p><span><span><span>Less</span></span></span></p>
			</td>
		</tr>
		<tr>
			<td>
			<p><span><span><span><strong>Multi-query attention (MQA)</strong></span></span></span></p>
			</td>
			<td>
			<p><span><span><span>1</span></span></span></p>
			</td>
			<td>
			<p><span><span><span>Good</span></span></span></p>
			</td>
			<td>
			<p><span><span><span>Least</span></span></span></p>
			</td>
		</tr>
	</tbody>
</table>
<p><span><span><span>Summary of attention styles</span></span></span></p><p>And this diagram helps illustrate the difference between the three styles:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2n8Qcj00djVDN2KEGUB7Ul/4cbd4f808b5d51e3f470852ecaff8214/image1-6.png" />
            
            </figure><p><a href="https://arxiv.org/pdf/2305.13245.pdf">source</a></p>
    <div>
      <h3>Multi-Query Attention</h3>
      <a href="#multi-query-attention">
        
      </a>
    </div>
    <p>Multi-query attention was described in 2019 in a paper from Google: <a href="https://arxiv.org/abs/1911.02150">Fast Transformer Decoding: One Write-Head is All You Need</a>. The idea is that instead of creating separate K and V entries for every Q vector in the attention mechanism, as in multi-head attention above, only a single K and V vector is used for the entire set of Q vectors. Thus the name: multiple queries combined into a single attention mechanism. In the paper, this was benchmarked on a translation task and showed performance equal to multi-head attention on the benchmark task.</p><p>Originally the idea was to reduce the total size of memory that is accessed when performing inference for the model. Since then, as generalized models have emerged and grown in parameter count, GPU memory has often become the bottleneck for inference; this plays to the strength of multi-query attention, which requires the least accelerator memory of the three types of attention. However, as models grew in size and generality, the performance of multi-query attention fell relative to multi-head attention.</p>
    <div>
      <h3>Grouped-Query Attention</h3>
      <a href="#grouped-query-attention">
        
      </a>
    </div>
    <p>The newest of the bunch — and the one used by Mistral — is grouped-query attention, as described in the paper <a href="https://arxiv.org/abs/2305.13245">GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints</a> that was published on arxiv.org in May 2023. Grouped-query attention combines the best of both worlds: the quality of multi-head attention with the speed and low memory usage of multi-query attention. Instead of either a single set of K and V vectors or one set for every Q vector, a fixed ratio of one set of K and V vectors for every group of Q vectors is used, reducing memory usage while retaining high performance on many tasks.</p><p>Choosing a model for a production task is often not just about picking the best model available, because we must consider tradeoffs between performance, memory usage, batch size, and available hardware (or cloud costs). Understanding these three styles of attention can help guide those decisions and understand when we might choose a particular model given our circumstances.</p>
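The memory tradeoff in the table above is easy to see with a little arithmetic. The sketch below counts the K and V vectors cached per token, per layer, under each attention style; the head counts are illustrative (for reference, Mistral 7B is reported to use 32 query heads sharing 8 K/V heads, i.e. groups of 4).

```typescript
// Count of K and V vectors a decoder must cache per token, per layer, for one
// attention block, under the three styles summarized in the table above.
// The head count N below is illustrative, not taken from any specific model.

function kvVectorsPerToken(queryHeads: number, groupSize: number): number {
  // Each group of `groupSize` query heads shares one K and one V vector:
  //   groupSize = 1          -> multi-head attention   (N K/V sets)
  //   1 < groupSize < N      -> grouped-query attention (N / G sets)
  //   groupSize = queryHeads -> multi-query attention  (1 K/V set)
  return 2 * (queryHeads / groupSize); // K and V each contribute one vector
}

const N = 32; // number of query heads (illustrative)
const mha = kvVectorsPerToken(N, 1); // 64 vectors: best quality, most memory
const gqa = kvVectorsPerToken(N, 4); // 16 vectors: the middle ground
const mqa = kvVectorsPerToken(N, N); //  2 vectors: least memory
```

With groups of 4, the KV cache shrinks 4x relative to multi-head attention, which is exactly the kind of saving that lets an inference platform fit longer contexts and larger batches on the same GPU.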
    <div>
      <h2>Enter Mistral — try it today</h2>
      <a href="#enter-mistral-try-it-today">
        
      </a>
    </div>
    <p>Being one of the first large language models to leverage grouped-query attention and combining it with sliding window attention, Mistral seems to have hit the goldilocks zone — it’s low latency, high-throughput, and it performs really well on benchmarks even when compared to bigger models (13B). All this is to say that it packs a punch for its size, and we couldn't be more excited to make it available to all developers today, via Workers AI.</p><p>Head over to our <a href="https://developers.cloudflare.com/workers-ai/models/text-generation/">developer docs</a> to get started, and if you need help, want to give feedback, or want to share what you’re building, just pop into our <a href="https://discord.com/invite/cloudflaredev">Developer Discord</a>!</p><p>The Workers AI team is also expanding and hiring; check our <a href="https://www.cloudflare.com/careers/jobs/">jobs page</a> for open roles if you’re passionate about AI engineering and want to help us build and evolve our global, serverless GPU-powered inference platform.</p> ]]></content:encoded>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">4QmaH8GOH4hQA5STcgIbJS</guid>
            <dc:creator>Jesse Kipp</dc:creator>
            <dc:creator>Isaac Rehg</dc:creator>
        </item>
        <item>
            <title><![CDATA[Streaming and longer context lengths for LLMs on Workers AI]]></title>
            <link>https://blog.cloudflare.com/workers-ai-streaming/</link>
            <pubDate>Tue, 14 Nov 2023 14:00:33 GMT</pubDate>
            <description><![CDATA[ Workers AI now supports streaming text responses for the LLM models in our catalog, including Llama-2, using server-sent events ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6hqH5G1qi0RIrmIsdkb1Ql/0d7746c5af2fe23d347ef7192d868b36/pasted-image-0--3--2.png" />
            
            </figure><p>Workers AI is our serverless GPU-powered inference platform running on top of Cloudflare’s global network. It provides a growing catalog of off-the-shelf models that run seamlessly with Workers and enable developers to build powerful and scalable AI applications in minutes. We’ve already seen developers doing amazing things with Workers AI, and we can’t wait to see what they do as we continue to expand the platform. To that end, today we’re excited to announce some of our most-requested new features: streaming responses for all <a href="https://www.cloudflare.com/learning/ai/what-is-large-language-model/">Large Language Models</a> (LLMs) on Workers AI, larger context and sequence windows, and a full-precision <a href="https://developers.cloudflare.com/workers-ai/models/llm/">Llama-2</a> model variant.</p><p>If you’ve used ChatGPT before, then you’re familiar with the benefits of response streaming, where responses flow in, token by token. LLMs work internally by generating responses sequentially using a process of repeated inference — the full output of an LLM is essentially a sequence of hundreds or thousands of individual prediction tasks. For this reason, while it only takes a few milliseconds to generate a single token, generating the full response takes longer, on the order of seconds. The good news is we can start displaying the response as soon as the first tokens are generated, and append each additional token until the response is complete. This yields a much better experience for the end user — displaying text incrementally as it's generated not only provides instant responsiveness, but also gives the end-user time to read and interpret the text.</p><p>As of today, you can now use response streaming for any LLM model in our catalog, including the very popular <a href="https://developers.cloudflare.com/workers-ai/models/llm/">Llama-2 model</a>. Here’s how it works.</p>
    <div>
      <h3>Server-sent events: a little gem in the browser API</h3>
      <a href="#server-sent-events-a-little-gem-in-the-browser-api">
        
      </a>
    </div>
    <p><a href="https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events">Server-sent events</a> are easy to use, simple to implement on the server side, standardized, and broadly available across many platforms natively or as a polyfill. Server-sent events fill a niche of handling a stream of updates from the server, removing the need for the boilerplate code that would otherwise be necessary to handle the event stream.</p>
<table>
<thead>
  <tr>
    <th></th>
    <th><span>Easy-to-use</span></th>
    <th><span>Streaming</span></th>
    <th><span>Bidirectional</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>fetch</span></td>
    <td><span>✅</span></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td><span>Server-sent events</span></td>
    <td><span>✅</span></td>
    <td><span>✅</span></td>
    <td></td>
  </tr>
  <tr>
    <td><span>Websockets</span></td>
    <td></td>
    <td><span>✅</span></td>
    <td><span>✅</span></td>
  </tr>
</tbody>
</table><p><sup>Comparing fetch, server-sent events, and websockets</sup></p><p>To get started using streaming on Workers AI’s text generation models with server-sent events, set the “stream” parameter to true in the input of the request. This will change the response format and <code>mime-type</code> to <code>text/event-stream</code>.</p><p>Here’s an example of using streaming with the <a href="https://developers.cloudflare.com/workers-ai/get-started/rest-api/">REST API</a>:</p>
            <pre><code>curl -X POST \
"https://api.cloudflare.com/client/v4/accounts/&lt;account&gt;/ai/run/@cf/meta/llama-2-7b-chat-int8" \
-H "Authorization: Bearer &lt;token&gt;" \
-H "Content-Type:application/json" \
-d '{ "prompt": "where is new york?", "stream": true }'

data: {"response":"New"}

data: {"response":" York"}

data: {"response":" is"}

data: {"response":" located"}

data: {"response":" in"}

data: {"response":" the"}

...

data: [DONE]</code></pre>
            <p>And here’s an example using a Worker script:</p>
            <pre><code>import { Ai } from "@cloudflare/ai";
export default {
    async fetch(request, env, ctx) {
        const ai = new Ai(env.AI, { sessionOptions: { ctx: ctx } });
        const stream = await ai.run(
            "@cf/meta/llama-2-7b-chat-int8",
            { prompt: "where is new york?", stream: true  }
        );
        return new Response(stream,
            { headers: { "content-type": "text/event-stream" } }
        );
    }
}</code></pre>
            <p>If you want to consume the output event-stream from this Worker in a browser page, the client-side JavaScript is something like:</p>
            <pre><code>const source = new EventSource("/worker-endpoint");
source.onmessage = (event) =&gt; {
    if(event.data=="[DONE]") {
        // SSE spec says the connection is restarted
        // if we don't explicitly close it
        source.close();
        return;
    }
    const data = JSON.parse(event.data);
    el.innerHTML += data.response;
}</code></pre>
            <p>You can use this simple code in any plain HTML page, or in complex SPAs built with React or other web frameworks.</p><p>This creates a much more interactive experience for the user, who now sees the page update as the response is incrementally created, instead of waiting with a spinner until the entire response sequence has been generated. Try out streaming on <a href="https://ai.cloudflare.com">ai.cloudflare.com</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6VIIO6crNIkpaz8hG9n8jg/c703ab696213d0fa814aff31d6d36d09/llama-streaming.gif" />
            
            </figure><p>Workers AI supports streaming text responses for the <a href="https://developers.cloudflare.com/workers-ai/models/llm/">Llama-2</a> model and any future LLM models we are adding to our catalog.</p><p>But this is not all.</p>
    <div>
      <h3>Higher precision, longer context and sequence lengths</h3>
      <a href="#higher-precision-longer-context-and-sequence-lengths">
        
      </a>
    </div>
    <p>Another top request we heard from our community after the launch of Workers AI was for longer questions and answers in our Llama-2 model. In LLM terminology, this translates to higher context length (the number of tokens the model takes as input before making the prediction) and higher sequence length (the number of tokens the model generates in the response).</p><p>We’re listening, and in conjunction with streaming, today we are adding a higher 16-bit full-precision Llama-2 variant to the catalog, and increasing the context and sequence lengths for the existing 8-bit version.</p>
<table>
<thead>
  <tr>
    <th><span>Model</span></th>
    <th><span>Context length (in)</span></th>
    <th><span>Sequence length (out)</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>@cf/meta/llama-2-7b-chat-int8</span></td>
    <td><span>2048 (768 before)</span></td>
    <td><span>1800 (256 before)</span></td>
  </tr>
  <tr>
    <td><span>@cf/meta/llama-2-7b-chat-fp16</span></td>
    <td><span>3072</span></td>
    <td><span>2500</span></td>
  </tr>
</tbody>
</table><p>Streaming, higher precision, and longer context and sequence lengths provide a better user experience and enable new, richer applications using large language models in Workers AI.</p><p>Check the Workers AI <a href="https://developers.cloudflare.com/workers-ai">developer documentation</a> for more information and options. If you have any questions or feedback about Workers AI, please come see us in the <a href="https://community.cloudflare.com/">Cloudflare Community</a> and the <a href="https://discord.gg/cloudflaredev">Cloudflare Discord</a>. If you are interested in machine learning and serverless AI, the Cloudflare Workers AI team is building a global-scale platform and tools that enable our customers to run fast, low-latency inference tasks on top of our network. Check our <a href="https://www.cloudflare.com/careers/jobs/">jobs page</a> for opportunities.</p> ]]></content:encoded>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[JavaScript]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[1.1.1.1]]></category>
            <guid isPermaLink="false">4RWvzttPkO6JoYsMwoovJ8</guid>
            <dc:creator>Jesse Kipp</dc:creator>
            <dc:creator>Celso Martinho</dc:creator>
        </item>
        <item>
            <title><![CDATA[Using the power of Cloudflare’s global network to detect malicious domains using machine learning]]></title>
            <link>https://blog.cloudflare.com/threat-detection-machine-learning-models/</link>
            <pubDate>Wed, 15 Mar 2023 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare has developed proprietary models leveraging machine learning and other advanced analytical techniques to detect security threats that take advantage of the domain name system (DNS) ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Cloudflare secures outbound Internet traffic for thousands of organizations every day, protecting users, devices, and data from threats like ransomware and phishing. One way we do this is by intelligently classifying what Internet destinations are risky using the domain name system (DNS). <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS</a> is essential to Internet navigation because it enables users to look up addresses using human-friendly names, like cloudflare.com. For websites, this means translating a <a href="https://www.cloudflare.com/learning/dns/glossary/what-is-a-domain-name/">domain name</a> into the IP address of the server that can deliver the content for that site.</p><p>However, attackers can exploit the DNS system itself, and often use techniques to evade detection and control using domain names that look like random strings. In this blog, we will discuss two techniques threat actors use – DNS tunneling and domain generation algorithms – and explain how Cloudflare uses <a href="https://www.cloudflare.com/learning/ai/what-is-machine-learning/">machine learning</a> to detect them.</p>
    <div>
      <h2>Domain Generation Algorithm (DGA)</h2>
      <a href="#domain-generation-algorithm-dga">
        
      </a>
    </div>
    <p>Most websites don’t change their domain name very often. That is the point, after all: a stable, human-friendly name that can be used to connect to a resource on the Internet. However, as a side effect, stable domain names become a point of control, allowing network administrators to use restrictions on domain names to enforce policies, for example blocking access to malicious websites. <a href="https://www.cloudflare.com/products/zero-trust/gateway/">Cloudflare Gateway</a> – our <a href="https://www.cloudflare.com/learning/access-management/what-is-a-secure-web-gateway/">secure web gateway</a> service for threat defense – makes this easy to do by allowing administrators to block risky and suspicious domains based on integrated threat intelligence.</p><p>But what if, instead of using a stable domain name, an attacker targeting your users generated random domain names to communicate with, making it more difficult to know in advance which domains to block? This is the idea behind Domain Generation Algorithm domains (MITRE ATT&amp;CK technique <a href="https://attack.mitre.org/techniques/T1568/002/">T1568.002</a>).</p><p>After initial installation, malware reaches out to a command-and-control server to receive further instructions; this is called “command and control” (MITRE ATT&amp;CK tactic <a href="https://attack.mitre.org/tactics/TA0011/">TA0011</a>). The attacker may send instructions to perform actions such as gathering and transmitting information about the infected device, downloading additional stages of malware, stealing credentials and private data and sending it to the server, or operating as a bot within a network to perform denial-of-service attacks. Using a domain generation algorithm to frequently generate random domain names for command-and-control communication gives malware a way to bypass blocks on fixed domains or IP addresses. Each day, the malware generates a random set of domain names. To rendezvous with the malware, the attacker registers one of these domain names and awaits communication from the infected device.</p><p>Speed in identifying these domains is important to disrupting an attack. Because the domains rotate each day, by the time the malicious disposition of a domain propagates through the <a href="https://www.cloudflare.com/learning/security/what-is-cyber-security/">cybersecurity</a> community, the malware may have rotated to a new domain name. However, the random nature of these domain names (they are literally a random string of letters!) also gives us an opportunity to detect them using machine learning.</p>
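To illustrate the rendezvous mechanic, here is a deliberately toy domain generation algorithm (it is not the algorithm of any real malware family): both the malware and the attacker seed a deterministic pseudo-random generator with the current date, so each side independently derives the same candidate domain list without ever transmitting it.

```typescript
// Toy DGA sketch: a date-seeded deterministic PRNG means the infected device
// and the attacker compute identical candidate domains for any given day.

function mulberry32(seed: number): () => number {
  // Small, well-known seeded PRNG (mulberry32); any deterministic PRNG would do.
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function dailyDomains(date: string, count: number): string[] {
  // Derive the seed from the date so every run on the same day agrees.
  let seed = 0;
  for (const ch of date) seed = (seed * 31 + ch.charCodeAt(0)) | 0;
  const rand = mulberry32(seed);
  const domains: string[] = [];
  for (let i = 0; i < count; i++) {
    let name = "";
    const len = 10 + Math.floor(rand() * 6); // 10-15 random letters
    for (let j = 0; j < len; j++) {
      name += String.fromCharCode(97 + Math.floor(rand() * 26)); // 'a'..'z'
    }
    domains.push(name + ".com");
  }
  return domains;
}

// Infected device and attacker independently compute the same list:
const todays = dailyDomains("2023-03-15", 5);
```

The asymmetry is the point: the attacker only needs to register one of the day's candidates, while a defender who wants to block them all has to predict or detect every one, which is why fast classification matters.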
    <div>
      <h3>The machine learning model</h3>
      <a href="#the-machine-learning-model">
        
      </a>
    </div>
    <p>To identify DGA domains, we trained a model that extends a pre-trained transformer-based neural network. <a href="https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transformer-model/">Transformer-based neural networks</a> are the state-of-the-art technique in natural language processing, and underlie <a href="https://www.cloudflare.com/learning/ai/what-is-large-language-model/">large language models</a> and services like ChatGPT. They are trained using the adjacent words and context around a word or character to “learn” what is likely to come next.</p><p>Domain names largely contain words and abbreviations that are meaningful in human language. Looking at the <a href="https://radar.cloudflare.com/domains">top domains on Cloudflare Radar</a>, we see that they are mostly composed of words and common abbreviations: “face” and “book”, for example, or “cloud” and “flare”. This makes the knowledge of human language encoded in transformer models a powerful tool for detecting random domain names.</p><p>For the DGA model, we curated ground-truth data consisting of domain names observed by Cloudflare’s 1.1.1.1 DNS resolver for the negative class, and domain names produced by known domain generation algorithms for the positive class (all use of DNS resolver data is in accordance with our <a href="https://developers.cloudflare.com/1.1.1.1/privacy/public-dns-resolver/">privacy commitments</a>).</p><p>Our final training set contained over 250,000 domain names and was weighted to include more negative cases (not DGA domains) than positive ones. We trained three versions of the model with different architectures: an LSTM (Long Short-Term Memory) neural network, LightGBM (binary classification), and a transformer-based model. We selected the transformer-based model because it had the highest accuracy and F1 score, with an accuracy of over 99% on the test data. (The <a href="https://towardsdatascience.com/the-f1-score-bec2bbc38aa6">F1 score</a> is a measure of model fit that penalizes large differences between precision and recall; on an imbalanced data set, the highest-accuracy model might simply predict everything as true or everything as false, which is not what we want.)</p><p>To compute the score for a new domain never seen before by the model, the domain name is tokenized (i.e., broken up into individual components, in this case characters), and the sequence of characters is passed to the model. The <a href="https://huggingface.co/transformers/v3.0.2/index.html">transformers</a> Python package from Hugging Face makes it easy to use these types of models for a variety of applications. The library supports summarization, question answering, translation, text generation, classification, and more. In this case we use <a href="https://huggingface.co/transformers/v3.0.2/index.html">sequence classification</a>, together with a model customized for this task. The output of the model is a score indicating the chance that the domain was generated by a domain generation algorithm. If the score is over our threshold, we label the domain as a domain generation algorithm domain.</p>
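The production classifier is a fine-tuned transformer, but the underlying intuition — random labels do not follow the character statistics of human-readable names — can be demonstrated with a much simpler stand-in. The sketch below is illustrative only (the training list, smoothing constant, and scoring function are invented for the example, not the real model): it scores a domain by the average negative log-likelihood of its character bigrams under a model fit to benign names, so dictionary-like names score low and random strings score high.

```python
import math
from collections import Counter

def train_bigram_model(benign_domains):
    """Estimate character-bigram statistics from a list of benign names."""
    counts, totals = Counter(), Counter()
    for d in benign_domains:
        name = d.split(".")[0]
        for a, b in zip(name, name[1:]):
            counts[(a, b)] += 1
            totals[a] += 1
    return counts, totals

def randomness_score(domain, model, alpha=0.5, vocab=26):
    """Mean negative log-likelihood per bigram; higher = less language-like."""
    counts, totals = model
    name = domain.split(".")[0]
    if len(name) < 2:
        return 0.0
    logp = 0.0
    for a, b in zip(name, name[1:]):
        # Additive smoothing so unseen bigrams don't zero out the product
        p = (counts[(a, b)] + alpha) / (totals[a] + alpha * vocab)
        logp += math.log(p)
    return -logp / (len(name) - 1)
```

A real deployment would compare this score (or the transformer’s output) against a tuned threshold, trading off false positives against detection speed.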
    <div>
      <h3>Deployment</h3>
      <a href="#deployment">
        
      </a>
    </div>
    <p>The expansive view of domain names that Cloudflare has from our 1.1.1.1 resolver means we can quickly observe DGA domains after they become active. Using this model, we process all DNS query names that successfully resolve, so a single successful resolution of a domain name anywhere in Cloudflare’s public resolver network can be detected.</p><p>From the queries observed on 1.1.1.1, we first filter down to new and newly seen domain names. We then apply our DGA classifier to those names, allowing us to detect activated command and control domains as soon as they are observed anywhere in the world by the 1.1.1.1 resolver.</p>
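That filtering step can be sketched as follows. This is a toy in-memory version (at resolver scale the "seen" state would be a distributed system, not a Python set): only names appearing for the first time are forwarded to the classifier.

```python
def newly_seen(queries, seen: set[str]):
    """Yield each domain the first time it appears.

    The (expensive) DGA classifier then only runs on these names,
    not on every query the resolver answers.
    """
    for qname in queries:
        domain = qname.lower().rstrip(".")  # normalize case and trailing dot
        if domain not in seen:
            seen.add(domain)
            yield domain
```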
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/42ST4MV3Qez55tgxi3S4AY/5c98186c63d81fb376ae925c728ebc0b/Deployment.png" />
            
            </figure>
    <div>
      <h2>DNS Tunneling detection</h2>
      <a href="#dns-tunneling-detection">
        
      </a>
    </div>
    <p>In issuing commands or extracting data from an installed piece of malware, attackers seek to avoid detection. One way to send data and bypass traditional detection methods is to encode data within another protocol. When the attacker controls the authoritative name server for a domain, information can be encoded as DNS queries and responses. Instead of making a DNS query for a simple domain name, such as <a href="http://www.cloudflare.com">www.cloudflare.com</a>, and getting a response like 104.16.124.96, attackers can send and receive long DNS queries and responses that contain encoded data.</p><p>Here is an example query made by an application performing DNS tunneling (query shortened and partially redacted):</p><p><code>3rroeuvx6bkvfwq7dvruh7adpxzmm3zfyi244myk4gmswch4lcwmkvtqq2cryyi.qrsptavsqmschy2zeghydiff4ogvcacaabc3mpya2baacabqtqcaa2iaaaaocjb.br1ns.example.com</code></p><p>The response data to a query like the one above can vary in length based on the response record type the server uses and the recursive DNS resolvers in the path. Generally, it is at most 255 characters per response record and looks like a random string of characters.</p>
<table>
<tbody>
  <tr>
    <td><span>TXT</span></td>
    <td><span>jdqjtv64k2w4iudbe6b7t2abgubis</span></td>
  </tr>
</tbody>
</table><p>This ability to take an arbitrary set of bytes, send it to the server as a DNS query, and receive a response in the answer data creates a bi-directional communication channel that can transmit any data. The malware running on the infected host encodes the data it wants to transmit as a DNS query name, and the infected host sends the DNS query to its resolver.</p><p>Since this query is not a true hostname, but actually encodes data the malware wishes to transmit, the query is very likely to be unique, and so it is passed on to the authoritative DNS server for that domain.</p><p>The authoritative DNS server decodes the query back into the original data and, if necessary, can transmit it elsewhere on the Internet. Responses travel back the other direction: the response data is encoded as a query response (for example, a TXT record) and sent back to the malware running on the infected host.</p>
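The encode/decode round trip can be sketched with nothing but Base32 and DNS’s label rules (the domain name and payload here are illustrative): the client packs bytes into subdomain labels of at most 63 characters, and the authoritative server reverses the process.

```python
import base64

MAX_LABEL = 63  # DNS limits each label to 63 bytes

def encode_as_query(payload: bytes, domain: str) -> str:
    """Encode arbitrary bytes as subdomain labels under `domain`."""
    text = base64.b32encode(payload).decode().rstrip("=").lower()
    labels = [text[i:i + MAX_LABEL] for i in range(0, len(text), MAX_LABEL)]
    return ".".join(labels + [domain])

def decode_from_query(qname: str, domain: str) -> bytes:
    """Reverse of encode_as_query, as the authoritative server would do."""
    data = qname[: -(len(domain) + 1)].replace(".", "").upper()
    data += "=" * (-len(data) % 8)  # restore the Base32 padding we stripped
    return base64.b32decode(data)
```

Note how the resulting query names look exactly like the redacted example above: long, random-seeming strings of letters and digits, which is why statistical features over query names are useful for detection.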
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1USLxe7JSl0fwBhhYf2buD/3660b5adcaf710f7a3b7036435361077/DNS-Tunneling-Detection.png" />
            
            </figure><p>One challenge with identifying this type of traffic, however, is that many benign applications also use the DNS system to encode or transmit data. An example of a query that was classified as not DNS tunneling:</p><p><code>00641f74-8518-4f03-adc2-792a34ea2612.bbbb.example.com</code></p><p>As humans, we can see that the leading portion of this DNS query is a UUID. Queries like this are often used by security and monitoring applications and network appliances to check in. The leading portion of the query might be the unique ID of the device or installation that is performing the check-in.</p><p>During the research and training phase, our researchers identified a wide variety of applications that issue large numbers of random-looking DNS queries. Examples include subdomains of <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">content delivery networks</a>, <a href="https://www.cloudflare.com/developer-platform/solutions/live-streaming/">video streaming</a>, advertising and tracking, security appliances, as well as DNS tunneling. Our researchers investigated and labeled many of these domains, and while doing so, identified features that can be used to distinguish between benign applications and true DNS tunneling.</p>
    <div>
      <h3>The model</h3>
      <a href="#the-model">
        
      </a>
    </div>
    <p>For this application, we trained a two-stage model. The first stage makes quick yes/no decisions about whether the domain might be a DNS tunneling domain. The second stage makes finer-grained distinctions between legitimate domains that have large numbers of subdomains, such as security appliances or antivirus false-positive control, and malicious DNS tunneling.</p><p>The first stage is a <a href="https://xgboost.readthedocs.io/">gradient boosted decision tree</a> that gives us an initial classification based on minimal information. A decision tree model is like playing 20 questions – each layer of the decision tree asks a yes or no question, which gets you closer to the final answer. Decision tree models are good at predicting binary yes/no results as well as incorporating binary or nominal attributes into a prediction, and they are fast and lightweight to execute, making them a good fit for this application. <a href="https://en.wikipedia.org/wiki/Gradient_boosting">Gradient boosting</a> is a reliable technique for training models that is particularly good at combining several attributes with weak predictive power into a strong predictor. It can be used to train multiple types of models, including decision trees as well as numeric predictors.</p><p>If the first stage classifies the domain as “yes, potential DNS tunneling”, it is passed to the second stage, which incorporates data observed from Cloudflare’s 1.1.1.1 DNS resolver. This second model is a <a href="https://www.cloudflare.com/learning/ai/what-is-neural-network/">neural network model</a> that refines the categorization of the first, in order to distinguish legitimate applications.</p><p>In this model, the neural network takes 28 features as input and classifies the domain into one of 17 applications, such as DNS tunneling, IT appliance beacons, or email delivery and spam related services. <b>Figure 2</b> shows a diagram generated from the popular Python software package <a href="https://keras.io/">Keras</a> showing the layers of this neural network. We see the 28 input features at the top layer and, at the bottom layer, the 17 output values indicating the prediction for each type of application. This neural network is very small, having about 2,000 individual weights that can be set during the training process. In the previous section we saw an example of a model based on a state-of-the-art pre-trained model from a model family that has tens to hundreds of millions of predefined weights.</p>
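The gating structure of the two stages can be sketched as follows. Everything here is an illustrative stand-in: the real first stage is a gradient boosted decision tree and the second a neural network over 28 features and 17 classes, whereas this toy uses two hand-picked features, invented thresholds, and just three class labels.

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character of the string."""
    counts = Counter(s)
    return -sum(c / len(s) * math.log2(c / len(s)) for c in counts.values())

def stage_one(qname: str) -> bool:
    """Cheap screen applied to every query: flag long, high-entropy labels.
    (Stand-in for the gradient boosted tree; thresholds are illustrative.)"""
    label = qname.split(".")[0]
    return len(label) >= 20 and shannon_entropy(label) > 3.5

def stage_two(features: dict) -> str:
    """Stand-in for the neural network that separates 17 applications;
    here we distinguish only two illustrative classes."""
    if features["unique_subdomains"] > 1000 and features["avg_response_len"] > 100:
        return "dns-tunneling"
    return "it-appliance-beacon"

def classify(qname: str, features: dict) -> str:
    """Only queries flagged by the cheap stage reach the refined stage."""
    if not stage_one(qname):
        return "benign"
    return stage_two(features)
```

The design point this illustrates is cost: the cheap first stage runs on every query, while the feature-hungry second stage only runs on the small fraction of queries the first stage flags.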
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6qkJkL0s3NDzVk4cgEl0hq/2b34d1060f540b0d8eda15ab694a14ad/Screenshot-2023-03-15-at-11.24.14.png" />
            
            </figure><p>Fig. 2: The keras.utils.plot_model() function draws a diagram of the neural network layers.</p><p>Figure 3 shows a plot of the feature values of the applications we are trying to distinguish, in polar coordinates. Each color shows the feature values of all the domains the model classified as a single type of application over a sample period. The position around the circle (theta) is the feature, and the distance from the center (rho) is the value of that feature. We can see how many of the applications have similar feature values.</p><p>When we observe a new domain and compute its feature values, our model uses those feature values to predict which application the new domain resembles. As mentioned, the neural network has 28 inputs, each of which is the value of a single feature, and 17 outputs. The 17 output values represent the prediction that the domain is each of those 17 different types of applications, with malicious DNS tunneling being one of them. The job of the model is to convert the sometimes small differences between the feature values into a prediction. If the value of the malicious DNS tunneling output of the neural network is higher than the other outputs, the domain is labeled as a security threat.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1FaIlCe95na1Jfpx8WzfT7/93729854869830e3e4fbc460f50029f3/Screenshot-2023-03-15-at-11.24.49.png" />
            
            </figure><p>Fig. 3, Domains containing high-entropy DNS subdomains, visualized as feature plots. Each section around the circumference of the plot represents a different feature of the observed DNS queries. The distance from the center represents the value of that feature. Each color line is a distinct application, and machine learning helps us distinguish between these and classify them.</p>
    <div>
      <h3>Deployment</h3>
      <a href="#deployment">
        
      </a>
    </div>
    <p>For the DNS tunneling model, our system consumes the logs from our secure web gateway service. The first stage model is applied to all DNS queries. Domains that are flagged as possible DNS tunneling are then sent to the second stage where the prediction is refined using additional features.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2g99YpOWjODQHlrYssdkLS/6e1a77e0b7a31ec0b33121d6e1aa6db6/Deployment_2.png" />
            
            </figure>
    <div>
      <h2>Looking forward: combining machine learning with human expertise</h2>
      <a href="#looking-forward-combining-machine-learning-with-human-expertise">
        
      </a>
    </div>
    <p>In September 2022, Cloudflare announced the <a href="/cloudforce-one-is-now-ga/">general availability of our threat operations and research team, Cloudforce One</a>, which allows our in-house experts to share insights directly with customers. Layering this human element on top of the ML models that we have already developed helps Cloudflare deliver additional threat protection for our customers, as we plan to explain in the next article in this blog series.</p><p>Until then, <a href="https://dash.cloudflare.com/sign-up/teams">click here to create a free account</a>, with no time limit for up to 50 users, and point just your DNS traffic, or all traffic (layers 4 to 7), to Cloudflare to protect your team, devices, and data with machine learning-driven threat defense.</p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[Zero Trust]]></category>
            <category><![CDATA[Cloudflare Gateway]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Threat Intelligence]]></category>
            <guid isPermaLink="false">5PU2K9rmTCTMLavz6NmIRt</guid>
            <dc:creator>Jesse Kipp</dc:creator>
        </item>
        <item>
            <title><![CDATA[New WAF intelligence feeds]]></title>
            <link>https://blog.cloudflare.com/new-waf-intelligence-feeds/</link>
            <pubDate>Thu, 07 Jul 2022 12:57:12 GMT</pubDate>
            <description><![CDATA[ Cloudflare is expanding our WAF’s threat intelligence capabilities by adding four new managed IP lists that can be used as part of any custom firewall rule ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3viqacx0pyK5KfuztWaVO9/ae921f1c63025506f3709dbdff7c339e/unnamed.png" />
            
            </figure><p>Cloudflare is expanding our <a href="https://www.cloudflare.com/learning/ddos/glossary/web-application-firewall-waf/">WAF’s</a> threat intelligence capabilities by adding four new managed IP lists that can be used as part of any custom firewall rule.</p><p>Managed lists are created and maintained by Cloudflare and are built based on threat intelligence feeds collected by analyzing patterns and trends observed across the Internet. Enterprise customers can already use the Open SOCKS Proxy list (<a href="/protecting-apis-from-abuse-and-data-exfiltration/">launched in March 2021</a>) and today we are adding four new IP lists: “VPNs”, “Botnets, Command and Control Servers”, “Malware” and “Anonymizers”.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/31yUbQ5PMWyQOuR6SKAYVm/e6b624da1f780033213cf902e1e40edb/XkegdawMtkmBmuCmAin8MIzby8BSozlKq1g_EJRwpKwYIkmx_e0t49a3yoc8YYNltTLJBQ3oFxDRmBFxP01RTytGgD-zCwQsfiQr5r2WyFChLu9wsmDjeAx5Rb0i.png" />
            
            </figure><p>You can check what rules are available in your plan by navigating to Manage Account → Configuration → Lists.</p><p>Customers can reference these lists when creating a custom firewall rule or in <a href="/advanced-rate-limiting/">Advanced Rate Limiting</a>. For example, you can choose to block all traffic generated by IPs we categorize as VPNs, or rate limit traffic generated by all Anonymizers. You can simply incorporate managed IP lists in the powerful firewall rule builder. Of course, you can also use your own <a href="/introducing-ip-lists/">custom IP list</a>.</p>
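For example, a custom firewall rule expression referencing these managed lists might look like the following. The <code>$cf.*</code> identifiers shown follow Cloudflare’s documented naming for managed lists, but treat them as an assumption and confirm the exact names in your dashboard’s rule builder:

```
(ip.src in $cf.open_proxies) or (ip.src in $cf.anonymizer)
```

Paired with a block action, this would stop traffic from IPs Cloudflare categorizes as open proxies or anonymizers; the same expression style works in Advanced Rate Limiting to throttle rather than block.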
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/72EGCZbWhomtW9Up3IT9yg/b97c63a343aa7ed580bc0d00987a53ef/WsDGsltjclo0RVf5cZlM3yiQFzdDUIIteuM2jE80480j63zup6IMVvJtrazsG7VSaBTFSgnX0kYqZBpf3xzgqdLOX_VTpxX3sb398t_tj86gO-EiDKFwVoltRR85.png" />
            
            </figure><p>Managed IP Lists can be used in WAF rules to manage incoming traffic from these IPs.</p>
    <div>
      <h3>Where do these feeds come from?</h3>
      <a href="#where-do-these-feeds-come-from">
        
      </a>
    </div>
    <p>These lists are based on Cloudflare-generated threat feeds which are made available as IP lists to be easily consumed in the WAF. Each IP is categorized by combining open source data with analysis of the behavior of each IP, leveraging the scale and reach of Cloudflare’s network. After an IP has been included in one of these feeds, we verify its categorization, feed this information back into our security systems, and make it available to our customers in the form of a managed IP list. The content of each list is updated multiple times a day.</p><p>In addition to generating IP classifications based on Cloudflare’s internal data, Cloudflare curates and combines several data sources that we believe provide reliable coverage of active security threats with a low false positive rate. This matters because classifications change: an IP belonging to a cloud provider might be distributing malware today, but tomorrow might be a critical resource for your company.</p><p>Some IP address classifications are based on publicly available OSINT data, for example Tor exit nodes, and Cloudflare takes care of integrating this into our Anonymizer list so that you don’t have to manage that integration for every asset in your network. Other classifications are determined or vetted using a variety of DNS techniques, such as lookups, PTR record lookups, and observing passive DNS from Cloudflare’s network.</p><p>Our malware and command-and-control focused lists are generated from curated partnerships; one thing we look for when selecting partners is data sources that identify security threats which have no DNS records associated with them.</p><p>Our Anonymizer list encompasses several types of services that perform anonymization, including VPNs, open proxies, and Tor nodes. It is a superset of the more narrowly focused VPN list (known commercial VPN nodes) and the Cloudflare Open Proxies list (proxies that relay traffic without requiring authentication).</p>
    <div>
      <h3>In dashboard IP annotations</h3>
      <a href="#in-dashboard-ip-annotations">
        
      </a>
    </div>
    <p>Using these lists to deploy a preventative security policy for these IPs is great, but what about knowing whether an IP that is interacting with your website or application is part of a botnet or VPN? We first released <a href="/security-center-investigate/">contextual information</a> for Anonymizers as part of Security Week 2022, but we are now closing the circle by extending this feature to cover all the new lists.</p><p>As part of Cloudflare's threat intelligence feeds, we are exposing the IP category directly in the dashboard. Say you are investigating requests that were blocked by the WAF and that appeared to be probing your application for known software vulnerabilities. If the source IP of these requests matches one of our feeds (for example, it is part of a VPN), contextual information will appear directly on the analytics page.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3BEsz0Ts0fqS0o7Rlu0Kh0/a2eeba1c392e9d51a47efac8327e5f98/ba1SUQnRFtLMyaBYf580Fup-l4DJXdqOXEFrBm_KtT6egoEuFy0dh5HSZJvTSokZvDYC1d7US1dlhXMjn2jFgAgNr3Hmf455vhT6sT76JzXpI5ZyTO7bxGrXdj8o.png" />
            
            </figure><p>When the source IP of a WAF event matches one of the threat feeds, we provide contextual information directly onto the Cloudflare dashboard.</p><p>This information can help you see patterns and decide whether you need to use the managed lists to handle the traffic from these IPs in a particular way, for example by creating a rate limiting rule that reduces the amount of requests these actors can perform over a period of time.</p>
    <div>
      <h3>Who gets this?</h3>
      <a href="#who-gets-this">
        
      </a>
    </div>
    <p>The following table summarizes which plans have access to each of these features. All paid plans have access to the contextual in-dashboard information, while Enterprise plans can use the managed lists. Managed lists can be used only on Enterprise zones within an Enterprise account.</p>
<table>
<thead>
  <tr>
    <th></th>
    <th><span> FREE</span></th>
    <th><span>PRO</span></th>
    <th><span>BIZ</span></th>
    <th><span>ENT with WAF Essential</span></th>
    <th><span>ENT with WAF Advanced  *</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>Annotations</span></td>
    <td><span>x</span></td>
    <td><span>✅</span></td>
    <td><span>✅</span></td>
    <td><span>✅</span></td>
    <td><span>✅</span></td>
  </tr>
  <tr>
    <td><span>Open Proxies</span></td>
    <td><span>x</span></td>
    <td><span>x</span></td>
    <td><span>x</span></td>
    <td><span>✅</span></td>
    <td><span>✅</span></td>
  </tr>
  <tr>
    <td><span>Anonymizers</span></td>
    <td><span>x</span></td>
    <td><span>x</span></td>
    <td><span>x</span></td>
    <td><span>x</span></td>
    <td><span>✅</span></td>
  </tr>
  <tr>
    <td><span>VPNs</span></td>
    <td><span>x</span></td>
    <td><span>x</span></td>
    <td><span>x</span></td>
    <td><span>x</span></td>
    <td><span>✅</span></td>
  </tr>
  <tr>
    <td><span>Botnets, command and control</span></td>
    <td><span>x</span></td>
    <td><span>x</span></td>
    <td><span>x</span></td>
    <td><span>x</span></td>
    <td><span>✅</span></td>
  </tr>
  <tr>
    <td><span>Malware</span></td>
    <td><span>x</span></td>
    <td><span>x</span></td>
    <td><span>x</span></td>
    <td><span>x</span></td>
    <td><span>✅</span></td>
  </tr>
</tbody>
</table><p>* Contact your customer success manager to learn how to get access to these lists.</p>
    <div>
      <h3>Future releases</h3>
      <a href="#future-releases">
        
      </a>
    </div>
    <p>We are working on enriching our threat feeds even further. In the coming months we are going to provide more IP lists; specifically, we are looking into lists for cloud providers and Carrier-Grade Network Address Translation (CG-NAT).</p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[WAF]]></category>
            <category><![CDATA[Threat Intelligence]]></category>
            <category><![CDATA[VPN]]></category>
            <category><![CDATA[Botnet]]></category>
            <guid isPermaLink="false">qdVDHWjNU7EFOMA2A5uqb</guid>
            <dc:creator>Daniele Molteni</dc:creator>
            <dc:creator>Jesse Kipp</dc:creator>
        </item>
        <item>
            <title><![CDATA[Bring your own license and threat feeds to use with Cloudflare One]]></title>
            <link>https://blog.cloudflare.com/bring-your-own-threat-feeds-with-cloudflare-one/</link>
            <pubDate>Mon, 20 Jun 2022 13:57:23 GMT</pubDate>
            <description><![CDATA[ Today, we are announcing new integrations that enable our customers to integrate third-party threat intel data with the rich threat intelligence from Cloudflare One products — all within the Cloudflare dashboard ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare, we strive to make our customers’ lives simpler by building products that solve their problems, are extremely easy to use, and integrate well with their existing tech stack. Another element of fitting well into existing deployments is integrating seamlessly with the additional solutions customers subscribe to, and making sure those solutions work together to solve a pain point.</p><p>Today, we are announcing new integrations that enable our customers to integrate third-party threat intel data with the rich threat intelligence from <a href="https://www.cloudflare.com/cloudflare-one/">Cloudflare One</a> products — all within the Cloudflare dashboard. We are releasing this feature in partnership with Mandiant, Recorded Future, and VirusTotal, and will be adding new partners in the coming months.</p><p>Customers of these threat intel partners can upload their API keys to the Cloudflare Security Center to enable the use of additional threat data to create rules within Cloudflare One products such as Gateway and Magic Firewall, and infrastructure security products including the Web Application Firewall and API Gateway. Additionally, search results from Security Center’s threat investigations portal will be automatically enriched with licensed data.</p>
    <div>
      <h3>Entering your API keys</h3>
      <a href="#entering-your-api-keys">
        
      </a>
    </div>
    <p>Customers will be able to enter their keys by navigating to Security Center → Reference Data, and clicking on the ellipsis next to desired rows and selecting “Edit API key”. Once a valid key has been added, the status listed on the row should change from “No key provided” to “Active key”.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7olsgmCCo4KsMyoIsyRY8M/e26259c724d6cf3e3a051471205162d8/image3-17.png" />
            
            </figure>
    <div>
      <h3>Mandiant</h3>
      <a href="#mandiant">
        
      </a>
    </div>
    <p>Mandiant Advantage customers with a Threat Intelligence subscription can enter their API keys and leverage Mandiant’s most popular feeds of FQDN and IP address indicators of security threats, and their related context, throughout Cloudflare One products.</p><p>These include lists organized by threat category and aggregations of the most active malicious infrastructure. By curating the most recent data and the data relevant to your infrastructure on the Cloudflare network, Cloudflare makes it easy to take advantage of active and relevant indicators of malicious activity from Mandiant’s extensive threat intelligence data. Cloudflare takes care of importing the data and refreshing it regularly to help protect you from the latest threats Mandiant sees on the frontlines. Cloudflare products such as Gateway, Magic Firewall, and the Web Application Firewall (WAF) will have access to this threat intelligence data and make it easy to operationalize using the same rule builder you use today.</p><blockquote><p>“As cyber threats continue to rapidly evolve, organizations require up-to-date and relevant intelligence integrated with their preferred technology solutions to comprehensively protect their environments. Together, Mandiant and Cloudflare are enabling our mutual customers to better protect themselves from malicious actors that are active on the front lines right now.” - Robert Wallace, Senior Director, Strategy, Mandiant</p></blockquote>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3sti3ubfVuuTUinM3pEBed/3ed079d7423fb15cc998471e7a1776ee/image1-16.png" />
            
            </figure>
    <div>
      <h3>Recorded Future</h3>
      <a href="#recorded-future">
        
      </a>
    </div>
    <p>Recorded Future customers can upload their API key to unlock use of Security Control Feeds. Once you have set up your API key, Recorded Future intelligence will also be available in the rule builder of Cloudflare Gateway and Magic Firewall. Cloudflare will present the intelligence that is relevant to and actionable by the product being configured. Intelligence will be regularly updated for you, freeing you to focus on the security policies and actions that are relevant for your organization.</p><p>For example, customers will be able to create a rule that blocks connections where the source or destination IP is in the Security Control feed “​​Command and Control - IP Addresses [Prevent]”. This list will be automatically updated daily for each customer who has a valid API key.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4NJzP1xP9jelwjASBg1l33/b7265ffc5d55676991b906b91c5056b2/image2-21.png" />
            
            </figure><blockquote><p>As threats accelerate and converge in the world around us, Recorded Future and Cloudflare are working together to empower customers with the right intelligence at the right time, to keep our people and infrastructure safe.- Craig Adams, Chief Product &amp; Engineering officer, Recorded Future</p></blockquote>
    <div>
      <h3>VirusTotal</h3>
      <a href="#virustotal">
        
      </a>
    </div>
    <p>VirusTotal Premium customers can upload their API key to augment and enrich Security Center search results for IPs, domains, and URLs. In the future we plan to add additional object types such as binary files.</p><p>Results will be automatically populated within a new card in the ‘Investigate’ tab. When searching an IP address, you will see a summary of the IP address information from VirusTotal, including the overall results of the last analysis (e.g., harmless, suspicious, malicious), reputation score, tags, community votes, and the top files (if any) observed communicating with that IP address.</p><blockquote><p>“Cybersecurity teams face a challenging environment as attackers become more sophisticated. They need complete visibility and real-time threat intelligence from multiple sources to combat malicious threats. We are partnering with Cloudflare to help our mutual customers outsmart adversaries.” - Emiliano Martinez Contreras, Head of Product for VirusTotal — Google</p></blockquote>
    <div>
      <h3>Want to get started?</h3>
      <a href="#want-to-get-started">
        
      </a>
    </div>
    <p>If you are interested in gaining access during our beta testing phase, please complete this <a href="https://forms.gle/fJLNCuYueAzUHgy49">form</a>. And if there are additional data vendors you would like to see us integrate with, including your own sources, click <a href="https://forms.gle/fJLNCuYueAzUHgy49">here</a>.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare One Week]]></category>
            <category><![CDATA[Cloudflare One]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Cloudflare Zero Trust]]></category>
            <guid isPermaLink="false">5S4ZSavngXmsFSpNQmonE4</guid>
            <dc:creator>Patrick R. Donahue</dc:creator>
            <dc:creator>Deeksha Lamba</dc:creator>
            <dc:creator>Jesse Kipp</dc:creator>
        </item>
        <item>
            <title><![CDATA[Area 1 threat indicators now available in Cloudflare Zero Trust]]></title>
            <link>https://blog.cloudflare.com/phishing-threat-indicators-in-zero-trust/</link>
            <pubDate>Mon, 20 Jun 2022 13:28:53 GMT</pubDate>
            <description><![CDATA[ Area 1’s massive datasets of phishing campaign TTPs, seed infrastructure and threat models are now combined with Cloudflare’s extensive network and global insight into the origins of DNS, email or web traffic ]]></description>
            <content:encoded><![CDATA[ <p>Over the last several years, both Area 1 and Cloudflare built pipelines for ingesting threat indicator data, for use within our products. During the acquisition process we compared notes, and we discovered that the overlap of indicators between our two respective systems was smaller than we expected. This presented us with an opportunity: as one of our first tasks in bringing the two companies together, we have started bringing Area 1’s threat indicator data into the Cloudflare suite of products. This means that all the products today that use indicator data from Cloudflare’s own pipeline now get the benefit of Area 1’s data, too.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6KzkecE9CozKd2k81Az2i3/5d80c2dc7331a26f7c50f63ca6eb75fa/image1-13.png" />
            
            </figure><p>Area 1 built a data pipeline focused on identifying new and active phishing threats, which now supplements the Phishing category available today in Gateway. If you have a policy that references this category, you’re already benefiting from this additional threat coverage.</p>
    <div>
      <h3>How Cloudflare identifies potential phishing threats</h3>
      <a href="#how-cloudflare-identifies-potential-phishing-threats">
        
      </a>
    </div>
    <p>Cloudflare is able to combine the data, procedures and techniques developed independently by both the Cloudflare team and the Area 1 team prior to acquisition. Customers are able to benefit from the work of both teams across the suite of Cloudflare products.</p><p>Cloudflare curates a set of data feeds from our own network traffic, OSINT sources, and numerous partnerships, and applies custom false positive control. Customers who rely on Cloudflare are spared the software development effort as well as the operational workload to distribute and update these feeds. Cloudflare handles this automatically, with updates happening as often as every minute.</p><p>Cloudflare is able to go beyond this and work to proactively identify phishing infrastructure in multiple ways. With the Area 1 acquisition, Cloudflare is now able to apply the adversary-focused threat research approach of Area 1 across our network. A team of threat researchers tracks state-sponsored and financially motivated threat actors, newly disclosed CVEs, and current phishing trends.</p><p>Cloudflare now operates mail exchange servers for hundreds of organizations around the world, in addition to its DNS resolvers, <a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/">Zero Trust</a> suite, and network services. Each of these products generates data that is used to enhance the security of all of Cloudflare’s products. For example, as part of mail delivery, the mail engine performs domain lookups, scores potential phishing indicators via <a href="https://www.cloudflare.com/learning/ai/what-is-machine-learning/">machine learning</a>, and fetches URLs, generating data that can now be used throughout Cloudflare’s offerings.</p>
    <div>
      <h3>How Cloudflare Area 1 identifies potential phishing threats</h3>
      <a href="#how-cloudflare-area-1-identifies-potential-phishing-threats">
        
      </a>
    </div>
    <p>The Cloudflare Area 1 team operates a suite of web crawling tools designed to <a href="https://www.cloudflare.com/learning/email-security/how-to-prevent-phishing/">identify phishing pages</a>, capture phishing kits, and highlight attacker infrastructure. In addition, Cloudflare Area 1 threat models assess campaigns based on signals gathered from threat actor campaigns; and the associated IOCs of these campaign messages are further used to enrich Cloudflare Area 1 threat data for future campaign discovery. Together these techniques give Cloudflare Area 1 a leg up on identifying the indicators of compromise for an attacker prior to their attacks against our customers. As part of this proactive approach, Cloudflare Area 1 also houses a team of threat researchers that track state-sponsored and financially motivated threat actors, newly disclosed CVEs, and current phishing trends. Through this research, analysts regularly insert phishing indicators into an extensive indicator management system that may be used for our email product or any other product that may query it.</p><p>Cloudflare Area 1 also collects information about phishing threats during our normal operation as the mail exchange server for hundreds of organizations across the world. As part of that role, the mail engine performs domain lookups, scores potential phishing indicators via machine learning, and fetches URLs. For those emails found to be malicious, the indicators associated with the email are inserted into our indicator management system as part of a feedback loop for subsequent message evaluation.</p>
    <div>
      <h3>How Cloudflare data will be used to improve phishing detection</h3>
      <a href="#how-cloudflare-data-will-be-used-to-improve-phishing-detection">
        
      </a>
    </div>
    <p>In order to support Cloudflare products, including Gateway and Page Shield, Cloudflare has a data pipeline that ingests data from partnerships, OSINT sources, as well as threat intelligence generated in-house at Cloudflare. We are always working to curate a threat intelligence data set that is relevant to our customers and actionable in the products Cloudflare supports. This is our North Star: what data can we provide that enhances our customers’ security without requiring our customers to manage the complexity of data, relationships, and configuration. We offer a variety of security threat categories, but some major focus areas include:</p><ul><li><p>Malware distribution</p></li><li><p>Malware and Botnet Command &amp; Control</p></li><li><p>Phishing</p></li><li><p>New and newly seen domains</p></li></ul><p>Phishing is a threat regardless of how the potential phishing link gets entry into an organization, whether via email, SMS, calendar invite or shared document, or other means. As such, detecting and blocking phishing domains has been an area of active development for Cloudflare’s threat data team almost since its inception.</p><p>Looking forward, we will be able to incorporate that work into Cloudflare Area 1’s phishing email detection process. Cloudflare's list of phishing domains can help identify malicious email when those domains appear in the sender, delivery headers, message body or links of an email.</p>
    <div>
      <h3>1+1 = 3: Greater dataset sharing between Cloudflare and Area 1</h3>
      <a href="#1-1-3-greater-dataset-sharing-between-cloudflare-and-area-1">
        
      </a>
    </div>
    <p>Threat actors have long had an unfair advantage — and that advantage is rooted in the knowledge of their target, and the time they have to set up specific campaigns against their targets. That dimension of time allows threat actors to set up the right infrastructure, perform reconnaissance, stage campaigns, perform test probes, observe their results, iterate, improve and then launch their ‘production’ campaigns. This precise element of time gives us the opportunity to discover, assess and proactively filter out campaign infrastructure prior to campaigns reaching critical mass. But to do that effectively, we need visibility and knowledge of threat activity across the public IP space.</p><p>With Cloudflare’s extensive network and global insight into the origins of DNS, email or web traffic, combined with Cloudflare Area 1’s datasets of campaign tactics, techniques, and procedures (TTPs), seed infrastructure and threat models — we are now better positioned than ever to help organizations secure themselves against sophisticated threat actor activity, and regain the advantage that for so long has been heavily weighted towards the bad guys.</p><p>If you’d like to extend Zero Trust to your email security to <a href="https://www.cloudflare.com/zero-trust/products/email-security/">block advanced threats</a>, contact your Customer Success manager, or request a Phishing Risk Assessment <a href="https://www.cloudflare.com/lp/emailsecurity/">here</a>.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare One Week]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Zero Trust]]></category>
            <category><![CDATA[Email Security]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[1.1.1.1]]></category>
            <category><![CDATA[Cloudflare Gateway]]></category>
            <category><![CDATA[Cloudflare Zero Trust]]></category>
            <guid isPermaLink="false">Fu3hogbCQJRzmHcjxXcYr</guid>
            <dc:creator>Jesse Kipp</dc:creator>
        </item>
        <item>
            <title><![CDATA[Investigating threats using the Cloudflare Security Center]]></title>
            <link>https://blog.cloudflare.com/security-center-investigate/</link>
            <pubDate>Mon, 14 Mar 2022 12:59:21 GMT</pubDate>
            <description><![CDATA[ The data we glean from attacks trains our machine learning models and improves the efficacy of our network and application security products, but historically hasn’t been available to query directly. This week, we’re changing that ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare blocks a <i>lot</i> of diverse security threats, with some of the more interesting attacks targeting the “long tail” of the millions of Internet properties we protect. The data we glean from these attacks trains our <a href="https://www.cloudflare.com/learning/ai/what-is-machine-learning/">machine learning models</a> and improves the efficacy of our network and application security products, but historically hasn’t been available to query directly. This week, we’re changing that.</p><p>All customers will soon be granted access to our new threat investigations portal, <i>Investigate</i>, in the Cloudflare Security Center (first launched in December 2021). Additionally, we’ll be annotating threats across our analytics platform with this intelligence to streamline security workflows and tighten feedback loops.</p>
<p>What sorts of data might you want to look up here? Let’s say you’re seeing an IP address in your logs and want to learn which hostnames have pointed to it via <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS</a>, or you’re seeing a cluster of attacks come from an autonomous system (AS) you’re not familiar with. Or maybe you want to investigate a domain name to see how it’s been categorized from a threat perspective. Simply enter any of those items into the omni search box, and we’ll tell you everything we know.</p><p>IPs and hostnames will be available to query this week, followed by AS details to give you insight into the networks that communicate with your Cloudflare accounts. Next month, as we move to general availability, we’ll add additional data types and properties. Integrations with partners will allow you to use your existing license keys to see all your threat data in a single, unified interface. We also plan to show how both your infrastructure and corporate employees are interacting with any objects you look up, e.g., you can see how many times an IP triggers a WAF or API Shield rule, or how many times your employees attempted to resolve a domain that’s known to serve malware.</p>
    <div>
      <h2>Annotations in the dashboard: actionable intelligence in context</h2>
      <a href="#annotations-in-the-dashboard-actionable-intelligence-in-context">
        
      </a>
    </div>
    <p>Looking up threat data on an ad hoc basis is great, but it’s better when that data is annotated directly in logs and analytics. Starting this week, we will begin rolling out intelligence that is available in <i>Investigate</i> in the dashboard where it is relevant to your workflow. We’re starting with the web application firewall analytics for your websites that are behind Cloudflare.</p><p>Say you are investigating a security alert for a large number of requests that are blocked by a web application firewall rule. You might see that the alert was caused by an IP address probing your website for commonly exploited software vulnerabilities. If the IP in question were a cloud IP or flagged as an anonymizer, contextual intelligence will show that information directly on the analytics page.</p><p>This context can help you see patterns. Are attacks coming from anonymizers or the Tor network? Are they coming from cloud virtual machines? An IP is just an IP. But seeing a credential stuffing attack coming from anonymizers is a pattern that enables a proactive response, “Is my bot management configuration up-to-date?”</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/234fVUM8Un89bcl5uuwlDx/b817074e5b9a2ee54ade9a573917720b/image2-11.png" />
            
            </figure>
    <div>
      <h2>Cloudflare’s network vantage point and how this informs our data</h2>
      <a href="#cloudflares-network-vantage-point-and-how-this-informs-our-data">
        
      </a>
    </div>
    <p>The scale at which each product suite operates at Cloudflare is staggering. At peak, Cloudflare handles 44 million HTTP requests a second, from more than <b>250 cities</b> in over <b>100 countries</b>. The Cloudflare network responds to over <b>1.2 trillion DNS queries per day</b>, and it has <b>121 Tbps of network capacity</b> to serve traffic and mitigate denial of service attacks across all products. But on top of this immense scale, Cloudflare’s architecture enables refining raw data and combining intelligence from all of our products to paint a holistic picture of the security landscape.</p><p>We are able to take signals refined from the raw data generated by each product and combine them with signals from other products and capabilities to enhance our network and threat data capabilities. It is a common paradigm for security products to be built to have positive flywheel effects among users of the products. If one customer sees a new piece of malware, an endpoint protection vendor can deploy an update that will detect and block this malware for all their other customers. If a botnet attacks one customer, this provides information that can be used to find the signature of that botnet and protect other customers. If a device participates in a DDoS (Distributed Denial of Service) attack, that information can be used to help the network detect and mitigate future DDoS attacks faster. Cloudflare’s breadth of product offerings means that the flywheel effect benefits to users accumulate not just between users, but <i>between products as well</i>.</p><p>Let’s look at some examples:</p>
    <div>
      <h3>DNS resolution and certificate transparency</h3>
      <a href="#dns-resolution-and-certificate-transparency">
        
      </a>
    </div>
    <p>First, Cloudflare operates 1.1.1.1, one of the largest recursive DNS resolvers in the world. We operate it in a privacy-forward manner, so here at Cloudflare we do not know who or what IP performed a query, nor are we able to correlate queries together to distinct anonymous users. However, through the requests the resolver handles, Cloudflare sees newly registered and newly seen domains. Additionally, Cloudflare has one of the most <a href="https://www.cloudflare.com/application-services/products/ssl/">advanced SSL/TLS encryption products</a> on the market, and as part of that is a member organization helping to maintain the Certificate Transparency logs. These are public logs of every TLS certificate issued by a root certificate authority that is trusted by web browsers. Between these two products, Cloudflare has an unmatched view of what domains are out there on the Internet and when they become active. We use this information not only to populate our new and newly seen domains categories for our Gateway product, but we feed these domains into machine learning models that label suspicious or potentially malicious domains early in their lifecycle.</p>
    <div>
      <h3>Email security</h3>
      <a href="#email-security">
        
      </a>
    </div>
    <p>As another example, with the acquisition of Area 1, Cloudflare will bring a new set of mutually-reinforcing capabilities into its product offering. All the signals we can generate for a domain from our 1.1.1.1 resolver will become available to help <a href="https://www.cloudflare.com/zero-trust/products/email-security/">identify malicious email</a>, and Area 1’s years of expertise in identifying malicious email will be able to feed back into Cloudflare’s Gateway product and 1.1.1.1 for Families DNS resolver. In the past, data integrations like this would have been performed by IT or security teams. Instead, data will be able to flow seamlessly between the points on your organization’s <a href="https://www.cloudflare.com/learning/security/what-is-an-attack-surface/">attack surface</a>, mutually reinforcing the quality of the analysis and classification. The entire Cloudflare Zero Trust toolkit, including request logging, blocking, and <a href="https://www.cloudflare.com/learning/access-management/what-is-browser-isolation/">remote browser isolation</a>, will be available to handle potentially malicious links delivered via email, using the same policies already in place for other security risks.</p><p>Over the last few years, Cloudflare has integrated the use of machine learning in many of our product offerings, but today we’ve launched a new tool that puts the data and signals that power our <a href="https://www.cloudflare.com/network-security/">network security</a> into our customers’ hands as well. Whether responding to security incidents, <a href="https://www.cloudflare.com/learning/security/glossary/what-is-threat-hunting/">threat hunting</a>, or proactively setting security policies to protect your organization, you, human, can now be part of the Cloudflare network as well. Cloudflare’s unique position in the network means that your insights can be fed back into the network, protecting not just your organization across all the Cloudflare products it uses, but also contributing to the mutual insight and defense of all Cloudflare customers.</p>
    <div>
      <h2>Looking forward</h2>
      <a href="#looking-forward">
        
      </a>
    </div>
    <p>Cloudflare can <a href="https://www.cloudflare.com/application-services/products/securitycenter/">cover your organization’s whole attack surface</a>: defending websites, protecting devices and SaaS applications with Cloudflare Zero Trust, your locations and offices with Magic Transit, and your email communications. Security Center is here to make sure you have all the information you need to understand the <a href="https://www.cloudflare.com/learning/security/what-is-cyber-security/">cyber security risks</a> present today, and to help you defend your organization using Cloudflare.</p><p>“What is the wiper malware that I hear about on the news, and how do I protect my company from it?” We hear your questions, and we’re going to give you answers. Not just raw information, but what is relevant to you and how you use the Internet. We have big plans for Security Center. A file scanning portal will provide you with information about JavaScript files seen by Page Shield, executable files scanned by Gateway, and the ability to upload and scan files. Indicators of Compromise like IP addresses and domains will link to information about the relevant threat actors, when known, giving you more information about the techniques and tactics you are faced with, and information about how Cloudflare products can be used to defend against them. CVE search will let you find information on software vulnerabilities, along with the same easy-to-understand Cloudflare perspective you are used to reading on this blog to help decode the jargon and technical language. With today’s release, we’re just getting started.</p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[Threat Intelligence]]></category>
            <category><![CDATA[Security Center]]></category>
            <guid isPermaLink="false">1iIplOzXKw2PKTAmKATBf9</guid>
            <dc:creator>Patrick R. Donahue</dc:creator>
            <dc:creator>Jesse Kipp</dc:creator>
        </item>
        <item>
            <title><![CDATA[A quirk in the SUNBURST DGA algorithm]]></title>
            <link>https://blog.cloudflare.com/a-quirk-in-the-sunburst-dga-algorithm/</link>
            <pubDate>Fri, 18 Dec 2020 00:30:00 GMT</pubDate>
            <description><![CDATA[ On Wednesday, December 16, the RedDrip Team from QiAnXin Technology released their discoveries (tweet, github) regarding the random subdomains associated with the SUNBURST malware which was present in the SolarWinds Orion compromise. ]]></description>
            <content:encoded><![CDATA[ <p>On Wednesday, December 16, the RedDrip Team from QiAnXin Technology <a href="https://mp.weixin.qq.com/s/v-ekPFtVNZG1W7vWjcuVug">released their discoveries</a> (<a href="https://twitter.com/RedDrip7/status/1339168187619790848?s=20">tweet</a>, <a href="https://github.com/RedDrip7/SunBurst_DGA_Decode">github</a>) regarding the random subdomains associated with the SUNBURST malware which was present in the SolarWinds Orion compromise. In studying queries performed by the malware, Cloudflare has uncovered additional details about how the Domain Generation Algorithm (DGA) encodes data and exfiltrates the compromised hostname to the command and control servers.</p>
    <div>
      <h3>Background</h3>
      <a href="#background">
        
      </a>
    </div>
    <p>The RedDrip team discovered that the DNS queries are created by combining the previously reverse-engineered unique GUID (based on hashing of the hostname and MAC address) with a payload that is a custom base32 encoding of the hostname. The article they published includes screenshots of decompiled or reimplemented C# functions that are included in the compromised DLL. This background primer summarizes their work so far (which is published in Chinese).</p><p>RedDrip discovered that the DGA subdomain portion of the query is split into three parts:</p><p><code><b>&lt;encoded_guid&gt; + &lt;byte&gt; + &lt;encoded_hostname&gt;</b></code></p><p>An example malicious domain is:</p><p><code><b>7cbtailjomqle1pjvr2d32i2voe60ce2.appsync-api.us-east-1.avsvmcloud.com</b></code></p><p>Where the domain is split into the three parts as</p>
<table>
<colgroup>
<col></col>
<col></col>
<col></col>
</colgroup>
<thead>
  <tr>
    <th><span>Encoded guid (15 chars)</span></th>
    <th><span>byte</span></th>
    <th><span>Encoded hostname</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>7cbtailjomqle1p</span></td>
    <td><span>j</span></td>
    <td><span>vr2d32i2voe60ce2</span></td>
  </tr>
</tbody>
</table><p>The work from the RedDrip Team focused on the encoded hostname portion of the string; we have made additional insights related to both the encoded hostname and the encoded GUID portions.</p><p>At a high level the encoded hostnames take one of two encoding schemes. If all of the characters in the hostname are contained in the set of domain name-safe characters <code>"0123456789abcdefghijklmnopqrstuvwxyz-_."</code> then the <code>OrionImprovementBusinessLayer.CryptoHelper.Base64Decode</code> algorithm, explained in the article, is used. If there are characters outside of that set in the hostname, then <code>OrionImprovementBusinessLayer.CryptoHelper.Base64Encode</code> is used instead and ‘00’ is prepended to the encoding. This allows us to simply check whether the first two characters of the encoded hostname are ‘00’ and know how the hostname is encoded.</p><p>These function names within the compromised DLL are meant to resemble the names of legitimate functions, but in fact perform the message encoding for the malware. The DLL function Base64Decode is meant to resemble the legitimate function name base64decode, but its purpose is actually to perform the encoding of the query (which is a variant of base32 encoding).</p><p>The RedDrip Team has posted Python code for encoding and decoding the queries, including identifying random characters inserted into the queries at regular character intervals.</p><p>One potential issue we encountered with their implementation is the inclusion of a check clause looking for a ‘0’ character in the encoded hostname (line 138 of the decoding script). This line causes the decoding algorithm to ignore any encoded hostnames that do not contain a ‘0’. We believe this was included because ‘0’ is the encoded value of a ‘0’, ‘.’, ‘-’ or ‘_’. Since fully qualified hostnames are composed of multiple parts separated by ‘.’s, e.g. 
‘example.com’, it makes sense to be expecting a ‘.’ in the unencoded hostname and therefore only consider encoded hostnames containing a ‘0’. However, this causes the decoder to ignore many of the recorded DGA domains.</p><p>As we explain below, we believe that long domains are split across multiple queries where the second half is much shorter and unlikely to include a ‘.’. For example ‘www2.example.c’ takes up one message, meaning that in order to transmit the entire domain ‘www2.example.com’ a second message containing just ‘om’ would also need to be sent. This second message does not contain a ‘.’ so its encoded form does not contain a ‘0’ and it is ignored in the RedDrip team’s implementation.</p>
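<p>To make the layout concrete, here is a short Python sketch of the label split and the ‘00’ check described above. The function names are ours, for illustration only; they do not appear in the DLL or in RedDrip’s tooling:</p>

```python
# Illustrative sketch only; names are ours, not from the malware or RedDrip's code.

# Characters that can appear in a hostname without triggering the escaped scheme.
DOMAIN_SAFE = set("0123456789abcdefghijklmnopqrstuvwxyz-_.")

def split_dga_label(label):
    """Split a DGA subdomain label into (encoded_guid, byte, encoded_hostname)."""
    return label[:15], label[15], label[16:]

def needs_escaping(hostname):
    """True if the hostname contains characters outside the domain-safe set."""
    return not set(hostname) <= DOMAIN_SAFE

def payload_scheme(encoded_hostname):
    """A '00' prefix signals the escaped encoding; otherwise the base32 variant."""
    return "escaped" if encoded_hostname.startswith("00") else "base32-variant"

guid, sep, payload = split_dga_label("7cbtailjomqle1pjvr2d32i2voe60ce2")
# guid == "7cbtailjomqle1p", sep == "j", payload == "vr2d32i2voe60ce2"
```

Running the split on the example domain above reproduces the three parts shown in the table.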
    <div>
      <h3>The quirk: hostnames are split across multiple queries</h3>
      <a href="#the-quirk-hostnames-are-split-across-multiple-queries">
        
      </a>
    </div>
    <p>A list of observed queries performed by the malware was published publicly on <a href="https://github.com/bambenek/research/blob/main/sunburst/uniq-hostnames.txt">GitHub</a>. Applying the decoding script to this set of queries, we see that some queries appear to be truncated, such as <code>grupobazar.loca</code>, while some decoded hostnames are curiously short or incomplete, such as “com”, “.com”, or a single letter such as “m” or “l”.</p><p>When the hostname does not fit into the available payload section of the encoded query, it is split up across multiple queries. Queries are matched up by matching the GUID section after applying a byte-by-byte exclusive-or (xor).</p>
    <div>
      <h3>Analysis of first 15 characters</h3>
      <a href="#analysis-of-first-15-characters">
        
      </a>
    </div>
    <p>Noticing that long domains are split across multiple requests led us to believe that the first 16 characters encoded information to associate multipart messages. This would allow the receiver on the other end to correctly re-assemble the messages and get the entire domain. The RedDrip team identified the first 15 bytes as a GUID; we focused on those bytes and will refer to them subsequently as the header.</p><p>We found the following queries that we believed to be matches without knowing yet the correct pairings between message 1 and message 2 (payload has been altered):</p><p><b>Part 1 - Both decode to “www2.example.c”</b><code>r1q6arhpujcf6jb6qqqb0trmuhd1r0ee.appsync-api.us-west-2.avsvmcloud.com</code><code>r8stkst71ebqgj66qqqb0trmuhd1r0ee.appsync-api.us-west-2.avsvmcloud.com</code></p><p><b>Part 2 - Both decode to “om”</b><code>0oni12r13ficnkqb2h.appsync-api.us-west-2.avsvmcloud.com</code><code>ulfmcf44qd58t9e82h.appsync-api.us-west-2.avsvmcloud.com</code></p><p>This gives us a final combined payload of <b>www2.example.com</b>.</p><p>This example gave us two sets of messages where we were confident the second part was associated with the first part, and allowed us to find the following relationship, where message1 is the header of the first message and message2 is the header of the second:</p><p><code>Base32Decode(message1) XOR KEY = Base32Decode(message2)</code></p><p>The KEY is a single character. That character is xor’d with each byte of the Base32Decoded first header to produce the Base32Decoded second header. We do not currently know how to infer what character is used as the key, but we can still match messages together without that information. Since A XOR B = C where we know A and C but not B, we can instead use A XOR C = B. 
This means that in order to pair messages together we simply need to look for messages where XOR’ing them together results in a repeating character (the key).</p><p><code><i>Base32Decode(message1) XOR Base32Decode(message2) = KEY</i></code></p><p>Looking at the examples above this becomes</p>
<table>
<thead>
  <tr>
    <th></th>
    <th>Message 1</th>
    <th>Message 2</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>Header</span></td>
    <td><span>r1q6arhpujcf6jb</span></td>
    <td><span>0oni12r13ficnkq</span></td>
  </tr>
  <tr>
    <td><span>Base32Decode (binary)</span></td>
    <td><span>101101000100110110111111011</span><br /><span>010010000000011001010111111</span><br /><span>01111000101001110100000101</span></td>
    <td><span>110110010010000011010010000</span><br /><span>001000110110110100111100100</span><br /><span>00100011111111000000000100</span></td>
  </tr>
</tbody>
</table><p>We’ve truncated the results slightly; below, the first two lines show the two binary representations and the third line shows the result of the XOR.</p><pre><code>101101000100110110111111011010010000000011001010111111011110001010011101
110110010010000011010010000001000110110110100111100100001000111111110000
011011010110110101101101011011010110110101101101011011010110110101101101</code></pre><p>We can see the XOR result is the repeating sequence ‘01101101’, meaning the original key was 0x6D or ‘m’.</p><p>We provide the following Python code as an implementation for matching paired messages (Note: the decoding functions are those provided by the RedDrip team):</p>
            <pre><code># string1 is the first 15 characters of the first message
# string2 is the first 15 characters of the second message
# Base32Decode is the decoding function published by the RedDrip team
def is_match(string1, string2):
    decoded1 = Base32Decode(string1)
    decoded2 = Base32Decode(string2)
    # XOR the decoded headers byte by byte; paired messages yield a
    # single repeating character (the xor key)
    xor_result = [chr(ord(a) ^ ord(b)) for a, b in zip(decoded1, decoded2)]
    match_char = xor_result[0]
    # only the first 9 bytes are compared; trailing bytes may differ
    for character in xor_result[0:9]:
        if character != match_char:
            return False, None
    return True, "0x{:02X}".format(ord(match_char))</code></pre>
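<p>The pairing relationship can also be verified without the decoder itself. The following is a self-contained sketch (ours) that operates on the two example headers above after Base32 decoding; the decoded byte values are the ones shown in binary and hex elsewhere in this post:</p>

```python
def recover_xor_key(decoded1, decoded2):
    """XOR two decoded headers byte by byte. If the first 9 bytes of the
    result are a single repeating byte, the messages pair and that byte
    is the key; otherwise the headers do not match."""
    xored = bytes(a ^ b for a, b in zip(decoded1[:9], decoded2[:9]))
    return xored[0] if len(set(xored)) == 1 else None

# Base32-decoded bytes of the example headers r1q6arhpujcf6jb and 0oni12r13ficnkq
header1 = bytes.fromhex("b44dbf6900cafde29d05")
header2 = bytes.fromhex("d920d2046da7908ff004")

recover_xor_key(header1, header2)  # -> 0x6D, i.e. 'm'
```

For two unrelated headers the XOR result is not a repeating byte, and the function returns None.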
            <p>The following are additional headers which, based on the payload content, Cloudflare is confident are pairs (the payload has been redacted because it contains hostname information that is not yet publicly available):</p><p><b>Example 1:</b></p>
<table>
<thead>
  <tr>
    <th><span>vrffaikp47gnsd4a</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>aob0ceh5l8cr6mco</span></td>
  </tr>
</tbody>
</table><p>xorkey: 0x4E</p><p><b>Example 2:</b></p>
<table>
<thead>
  <tr>
    <th><span>vrffaikp47gnsd4a</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>aob0ceh5l8cr6mco</span></td>
  </tr>
</tbody>
</table><p>xorkey: 0x54</p><p><b>Example 3:</b></p>
<table>
<thead>
  <tr>
    <th><span>vvu7884g0o86pr4a</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>6gpt7s654cfn4h6h</span></td>
  </tr>
</tbody>
</table><p>xorkey: 0x2B</p><p>We hypothesize that the xorkey can be derived from the header bytes and/or padding byte of the two messages, though we have not yet determined the relationship.</p><hr />
    <div>
      <h2>Update (12/18/2020):</h2>
      <a href="#update-12-18-2020">
        
      </a>
    </div>
    <p>Erik Hjelmvik published a blog post <a href="https://www.netresec.com/?page=Blog&amp;month=2020-12&amp;post=Reassembling-Victim-Domain-Fragments-from-SUNBURST-DNS">explaining where the XOR key is located</a>. Based on his code, we provide a Python implementation for converting the header (first 16 bytes) into the decoded GUID as a string. Messages can then be paired by matching GUIDs to reconstruct the full hostname.</p>
            <pre><code>def decrypt_secure_string(header):
    # Base32-decode the first 16 characters of the header
    decoded = Base32Decode(header[0:16])
    # The first decoded byte is the XOR key for the remaining bytes
    xor_key = ord(decoded[0])
    decrypted = ["{0:02x}".format(ord(b) ^ xor_key) for b in decoded]
    # Bytes 1-8, hex-encoded, form the GUID used to pair messages
    return ''.join(decrypted[1:9])</code></pre>
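<p>To illustrate, the same key-and-GUID extraction can be applied to the decoded header bytes from the table below (a sketch that starts from already-Base32-decoded bytes, since Base32Decode itself is the RedDrip team’s routine):</p>
<pre><code># Already-decoded header bytes of two paired messages (see the table below)
decoded1 = bytes.fromhex("b44dbf6900cafde29d05")
decoded2 = bytes.fromhex("d920d2046da7908ff004")

def guid_from_decoded(decoded):
    # The first decoded byte is the XOR key; bytes 1-8 are the GUID
    xor_key = decoded[0]
    return ''.join("{0:02x}".format(b ^ xor_key) for b in decoded[1:9])

# Both messages decode to the same GUID, so they form a pair
print(guid_from_decoded(decoded1))  # f90bddb47e495629
print(guid_from_decoded(decoded2))  # f90bddb47e495629</code></pre>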
            <p>Updated example:</p>
<table>
<thead>
  <tr>
    <th></th>
    <th>Message 1</th>
    <th>Message 2</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>Header</span></td>
    <td><span>r1q6arhpujcf6jb</span></td>
    <td><span>0oni12r13ficnkq</span></td>
  </tr>
  <tr>
    <td><span>Base32Decode Header (hex)</span></td>
    <td><span>b44dbf6900cafde29d05</span></td>
    <td><span>d920d2046da7908ff004</span></td>
  </tr>
    <tr>
    <td><span>Base32Decode first byte (xor key)</span></td>
    <td><span>0xb4</span></td>
    <td><span>0xd9</span></td>
  </tr>
    <tr>
    <td><span>XOR result (hex)</span></td>
    <td><span>00f90bddb47e495629</span></td>
    <td><span>00f90bddb47e495629</span></td>
  </tr>
</tbody>
</table> ]]></content:encoded>
            <category><![CDATA[Cloudflare Zero Trust]]></category>
            <category><![CDATA[Cloudflare Gateway]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Threat Intelligence]]></category>
            <guid isPermaLink="false">7vPEi0QIhuwlb050ssMzXB</guid>
            <dc:creator>Nick Blazier</dc:creator>
            <dc:creator>Jesse Kipp</dc:creator>
        </item>
        <item>
            <title><![CDATA[Trend data on the SolarWinds Orion compromise]]></title>
            <link>https://blog.cloudflare.com/solarwinds-orion-compromise-trend-data/</link>
            <pubDate>Wed, 16 Dec 2020 17:00:41 GMT</pubDate>
            <description><![CDATA[ Analyzing SUNBURST malware activity seen on Cloudflare’s public DNS resolver. ]]></description>
            <content:encoded><![CDATA[ <p>On Sunday, December 13, <a href="https://www.fireeye.com/blog/threat-research/2020/12/evasive-attacker-leverages-solarwinds-supply-chain-compromises-with-sunburst-backdoor.html">FireEye released a report</a> on a sophisticated supply chain attack leveraging SolarWinds' Orion IT monitoring software. The malware was distributed as part of regular updates to Orion and had a valid digital signature.</p><p>One of the notable features of the malware is the way it hides its network traffic using a multi-staged approach. First, the malware determines its command and control (C2) server using a domain generation algorithm (DGA) to construct and resolve a subdomain of avsvmcloud[.]com.</p><p>These algorithmically generated strings are added as a subdomain of one of the following <a href="https://www.cloudflare.com/learning/dns/glossary/what-is-a-domain-name/">domain names</a> to create a new fully-qualified domain name to resolve:</p><p><code>.appsync-api[.]eu-west-1[.]avsvmcloud[.]com<br>.appsync-api[.]us-west-2[.]avsvmcloud[.]com<br>.appsync-api[.]us-east-1[.]avsvmcloud[.]com<br>.appsync-api[.]us-east-2[.]avsvmcloud[.]com</code></p><p>An example of such a domain name might look like: <code>hig4gcdkgjkrt24v6isue7ax09nksd[.]appsync-api[.]eu-west-1[.]avsvmcloud[.]com</code></p><p>The <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS query response</a> to a subdomain of one of the above will return a CNAME record that points to another C2 domain, which is used for <a href="https://www.cloudflare.com/learning/security/what-is-data-exfiltration/">data exfiltration</a>. The following domains were identified as the C2 domains used for data exfiltration:</p><p><code>freescanonline[.]com<br>deftsecurity[.]com<br>thedoccloud[.]com<br>websitetheme[.]com<br>highdatabase[.]com<br>incomeupdate[.]com<br>databasegalore[.]com<br>panhardware[.]com<br>zupertech[.]com<br>virtualdataserver[.]com<br>digitalcollege[.]org</code></p>
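<p>The construction of the lookup name is straightforward to sketch (a minimal Python illustration; the DGA string itself encodes victim information, so we reuse the example string above):</p>
<pre><code># Regions observed in the second-level labels: eu-west-1, us-west-2,
# us-east-1, us-east-2
def c2_lookup_name(dga_string, region):
    # Defanged with [.] to match the notation used in this post
    return "{0}[.]appsync-api[.]{1}[.]avsvmcloud[.]com".format(dga_string, region)

print(c2_lookup_name("hig4gcdkgjkrt24v6isue7ax09nksd", "eu-west-1"))
# hig4gcdkgjkrt24v6isue7ax09nksd[.]appsync-api[.]eu-west-1[.]avsvmcloud[.]com</code></pre>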
    <div>
      <h3>Malware activity seen on Cloudflare’s public DNS resolver 1.1.1.1</h3>
      <a href="#malware-activity-seen-on-cloudflares-public-dns-resolver-1-1-1-1">
        
      </a>
    </div>
    <p>Using the published details about the network observables of the malware, we analyzed DNS query traffic to the identified malicious hostnames. Because 1.1.1.1 has a strong, audited privacy policy, we are unable to identify the source IP of users connecting to the malicious hostname — we can only see aggregated trends.</p><p>We first noticed a spike in DNS traffic through Cloudflare’s 1.1.1.1 resolver to avsvmcloud[.]com starting in April 2020:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ZWBXX6eLKIO4Of4Kg39Uv/d67581bab72556724dc9124bfcad9765/image2-38.png" />
            
            </figure><p>Reviewing the subdomain data, a specific pattern of DGA domains emerged as early as April. These subdomains followed a consistent format (e.g. {dga-string}[.]appsync-api[.]{region}[.]avsvmcloud[.]com). As time went on, the attackers added more unique subdomains. The graph below depicts the unique newly observed subdomains of avsvmcloud[.]com on a weekly basis.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1D3doHZQGkyshB57sqmSco/85272385a5a21be1cf94d22c046e2c79/image1-60.png" />
            
            </figure><p>As illustrated in the graphs, we noticed a major rise in activity over the summer, with total subdomains observed reaching steady state in September.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7AABBhH8eD4LqWDdLRFW1S/9c60e34e2eba12abc3171cc5d600b97e/image4-23.png" />
            
            </figure><p>While the growth of unique names slowed down starting in October, the geographic distribution continued to change during the entire course of the attack. During the first few weeks of the attack, queries originated almost entirely from clients in North America and Europe. In May, the source of queries began to spread across the globe. By July, the queries began to cluster again, this time in South America, before returning to originate primarily from North America in November.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4j9ld3C9PcQdXv1Whn9Ir2/22021d3d41a25fedeb497a89fa2412bb/image3.gif" />
            
            </figure>
    <div>
      <h3>Protecting our customers from malicious activity</h3>
      <a href="#protecting-our-customers-from-malicious-activity">
        
      </a>
    </div>
    <p>Cloudflare’s 1.1.1.1 resolver has strict privacy protections, so we can only see trends of this attack. We cannot notify users that they might be compromised, because we intentionally do not know who those users are. For customers of Cloudflare Gateway, however, we can help them block these types of threats, and identify cases where they might be compromised.</p><p>Cloudflare Gateway consists of features that secure how users and devices connect to the Internet. Gateway’s DNS filtering feature is built on the same technology that powers 1.1.1.1, and adds security filtering and logging.</p><p>Following the FireEye report, Cloudflare blocked access to the C2 domains used in this attack for customers using the “Malware” category in Gateway, as well as for customers using 1.1.1.1 for Families (1.1.1.2 &amp; 1.1.1.3).</p><p>Our response team is working with customers to search logs for queries related to the malicious domains. Gateway customers can also download logs of their DNS query traffic and investigate on their own.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Zero Trust]]></category>
            <category><![CDATA[Cloudflare Gateway]]></category>
            <category><![CDATA[Zero Trust]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Trends]]></category>
            <category><![CDATA[Threat Intelligence]]></category>
            <guid isPermaLink="false">30Os7AAonyy1SG1pggD5FD</guid>
            <dc:creator>Malavika Balachandran Tadeusz</dc:creator>
            <dc:creator>Jesse Kipp</dc:creator>
        </item>
    </channel>
</rss>