
Cloudflare is the best place to build realtime voice agents

2025-08-29

6 min read

The way we interact with AI is fundamentally changing. Text-based interfaces like ChatGPT have shown us what's possible, but they're only the beginning. Humans communicate not only by typing, but also by talking: we show things, we interrupt, and we clarify in real time. Voice AI brings these natural interaction patterns to our applications.

Today, we're excited to announce new capabilities that make it easier than ever to build real-time, voice-enabled AI applications on Cloudflare's global network. These new features create a complete platform for developers building the next generation of conversational AI experiences or can function as building blocks for more advanced AI agents running across platforms.

We're launching:

  • Cloudflare Realtime Agents - A runtime for orchestrating voice AI pipelines at the edge

  • Pipe raw WebRTC audio as PCM in Workers - You can now connect WebRTC audio directly to your AI models or to existing complex media pipelines already built on WebSockets

  • Workers AI WebSocket support - Realtime AI inference with models like PipeCat's smart-turn-v2

  • Deepgram on Workers AI - Speech-to-text and text-to-speech running in over 330 cities worldwide

Why realtime AI matters now

Today, building voice AI applications is hard. You need to coordinate multiple services, such as speech-to-text, language models, and text-to-speech, while managing complex audio pipelines, handling interruptions, and keeping latency low enough for natural conversation.

Building production voice AI requires orchestrating a complex symphony of technologies. You need low latency speech recognition, intelligent language models that understand context and can handle interruptions, natural-sounding voice synthesis, and all of this needs to happen in under 800 milliseconds — the threshold where conversation feels natural rather than stilted. This latency budget is unforgiving. Every millisecond counts: 40ms for microphone input, 300ms for transcription, 400ms for LLM inference, 150ms for text-to-speech. Any additional latency from poor infrastructure choices or distant servers transforms a delightful experience into a frustrating one.

That's why we're building real-time AI tools: we want to make real-time voice AI as easy to deploy as a static website. We're also witnessing a critical inflection point where conversational AI moves from experimental demos to production-ready systems that can scale globally. If you're already a developer in the real-time AI ecosystem, we want to provide the best building blocks so you get the lowest latency, leveraging the 330+ data centers Cloudflare has built.

Introducing Cloudflare Realtime Agents

Cloudflare Realtime Agents is a simple runtime for orchestrating voice AI pipelines that run on our global network, as close to your users as possible. Instead of managing complex infrastructure yourself, you can focus on building great conversational experiences.

How it works

When a user connects to your voice AI application, here's what happens:

  1. WebRTC connection - Audio from the user's device streams to the nearest Cloudflare location via WebRTC, using the Cloudflare RealtimeKit mobile or web SDKs

  2. AI pipeline orchestration - Your pre-configured pipeline runs: speech-to-text → LLM → text-to-speech, with support for interruption detection and turn-taking

  3. Custom logic - Your configured runtime options, callbacks, and tools run

  4. Response delivery - Generated audio streams back to the user with minimal latency

The magic is in how we've designed this as composable building blocks. You're not locked into a rigid pipeline — you can configure data flows, add tee and join operations, and control exactly how your AI agent behaves.

Take a look at the MyTextHandler component, for example. It’s just a function that takes in text and returns text, inserted after speech-to-text and before text-to-speech:

class MyTextHandler extends TextComponent {
	env: Env;

	constructor(env: Env) {
		super();
		this.env = env;
	}

	async onTranscript(text: string) {
		const { response } = await this.env.AI.run('@cf/meta/llama-3.1-8b-instruct', {
		prompt: "You are a Wikipedia bot. Answer the user query: " + text,
		});
		this.speak(response!);
	}
}

Your agent is a JavaScript class that extends RealtimeAgent, where you initialize a pipeline consisting of the various text-to-speech, speech-to-text, text-to-text and even speech-to-speech transformations.

export class MyAgent extends RealtimeAgent<Env> {
	constructor(ctx: DurableObjectState, env: Env) {
		super(ctx, env);
	}

	async init(agentId: string, meetingId: string, authToken: string, workerUrl: string, accountId: string, apiToken: string) {
		// Construct your text processor for generating responses to text
		const textHandler = new MyTextHandler(this.env);
		// Construct a Meeting object to join the RTK meeting
		const transport = new RealtimeKitTransport(meetingId, authToken, [
			{
				media_kind: 'audio',
				stream_kind: 'microphone',
			},
		]);
		const { meeting } = transport;

		// Construct a pipeline to take in meeting audio, transcribe it using
		// Deepgram, and pass our generated responses through ElevenLabs to
		// be spoken in the meeting
		await this.initPipeline(
			[transport, new DeepgramSTT(this.env.DEEPGRAM_API_KEY), textHandler, new ElevenLabsTTS(this.env.ELEVENLABS_API_KEY), transport],
			agentId,
			workerUrl,
			accountId,
			apiToken,
		);

		// The RTK meeting object is accessible to us, so we can register handlers
		// on various events like participant joins/leaves, chat, etc.
		// This is optional
		meeting.participants.joined.on('participantJoined', (participant) => {
			textHandler.speak(`Participant Joined ${participant.name}`);
		});
		meeting.participants.joined.on('participantLeft', (participant) => {
			textHandler.speak(`Participant Left ${participant.name}`);
		});

		// Make sure to actually join the meeting after registering all handlers
		await meeting.rtkMeeting.join();
	}

	async deinit() {
		// Add any other cleanup logic required
		await this.deinitPipeline();
	}
}

View a full example in the developer docs and get your own Realtime Agent running. View Realtime Agents on your dashboard.

Built for flexibility

What makes Realtime Agents powerful is its flexibility:

  • Many AI provider options - Use models from Workers AI, OpenAI, Anthropic, or any provider through AI Gateway

  • Multiple input/output modes - Accept audio and/or text and respond with audio and/or text

  • Stateful coordination - Maintain context across the conversation without managing complex state yourself

  • Speed and flexibility - Use RealtimeKit to manage WebRTC sessions and UI for faster development, or, for full control over your stack, connect directly using any standard WebRTC client or raw WebSockets

  • Integrate with the Cloudflare Agents SDK

During the open beta starting today, the Cloudflare Realtime Agents runtime is free to use and works with a variety of AI models:

  • Speech and Audio: Integration with platforms like ElevenLabs and Deepgram.

  • LLM Inference: Flexible options to use large language models through Cloudflare Workers AI and AI Gateway, connect to third-party models like OpenAI, Gemini, Grok, and Claude, or bring your own custom models.

Pipe raw WebRTC audio as PCM in Workers

For developers who need the most flexibility with their applications beyond Realtime Agents, we're exposing the raw WebRTC audio pipeline directly to Workers. 

WebRTC audio in Workers works by leveraging Cloudflare’s Realtime SFU, which converts WebRTC’s Opus-encoded audio to PCM and streams it to any WebSocket endpoint you specify. This means you can use Workers to implement:

  • Live transcription - Stream audio from a video call directly to a transcription service

  • Custom AI pipelines - Send audio to AI models without setting up complex infrastructure

  • Recording and processing - Save, audit, or analyze audio streams in real time
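As a sketch of the recording-and-processing case: assuming the SFU delivers 16-bit little-endian PCM frames over the WebSocket (the exact sample format depends on your configuration), a Worker could compute per-frame loudness before storing or forwarding the audio. The function below is illustrative, not part of the SFU API:

```typescript
// Compute the RMS loudness of one PCM frame, normalized to [0, 1].
// Assumes 16-bit little-endian samples, one frame per WebSocket message.
function rmsOfPcm16(frame: ArrayBuffer): number {
  const samples = new Int16Array(frame);
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  // Normalize against the int16 full-scale magnitude.
  return Math.sqrt(sumSquares / samples.length) / 32768;
}

// In a Worker, this might run inside a WebSocket message handler:
// ws.addEventListener("message", (e) => {
//   const loudness = rmsOfPcm16(e.data as ArrayBuffer);
//   if (loudness > 0.5) console.log("loud frame");
// });
```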

WebSockets vs WebRTC for voice AI

WebSockets and WebRTC can handle audio for AI services, but they work best in different situations. WebSockets are perfect for server-to-server communication and work fine when you don't need super-fast responses, making them great for testing and experimenting. However, if you're building an app where users need real-time conversations with low delay, WebRTC is the better choice.

WebRTC has several advantages that make it superior for live audio streaming. It uses UDP instead of TCP, which prevents audio delays caused by lost packets holding up the entire stream (head of line blocking is a common topic discussed on this blog). The Opus audio codec in WebRTC automatically adjusts to network conditions and can handle packet loss gracefully. WebRTC also includes built-in features like echo cancellation and noise reduction that WebSockets would require you to build separately. 

With this feature, you can use WebRTC for client-to-server communication and leverage Cloudflare to convert it to familiar WebSockets for server-to-server communication and backend processing.

The power of Workers + WebRTC

When WebRTC audio gets converted to WebSockets, you get PCM audio at the original sample rate, and from there, you can run any task in and out of the Cloudflare developer platform:

  • Resample audio and send to different AI providers

  • Run WebAssembly-based audio processing

  • Build complex applications with Durable Objects, Alarms and other Workers primitives

  • Deploy containerized processing pipelines with Workers Containers
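For example, WebRTC audio typically arrives at 48 kHz while many speech models expect 16 kHz input, so the resampling step can be sketched as integer decimation. This is a minimal illustration; a production pipeline would apply an anti-aliasing low-pass filter before dropping samples:

```typescript
// Downsample PCM by an integer factor (e.g. 48 kHz -> 16 kHz is factor 3)
// by keeping every Nth sample. A real resampler should low-pass filter
// first to avoid aliasing; this sketch favors clarity over fidelity.
function decimatePcm(samples: Int16Array, factor: number): Int16Array {
  const out = new Int16Array(Math.ceil(samples.length / factor));
  for (let i = 0; i < out.length; i++) {
    out[i] = samples[i * factor];
  }
  return out;
}
```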

The WebSocket works bidirectionally, so data sent back on the WebSocket becomes available as a WebRTC track on the Realtime SFU, ready to be consumed within WebRTC.

To illustrate this setup, we’ve made a simple WebRTC application demo that uses the ElevenLabs API for text-to-speech.

Visit the Realtime SFU developer docs to learn how to get started.

Realtime AI inference with WebSockets

WebSockets provide the backbone of real-time AI pipelines because they are a low-latency, bidirectional primitive with ubiquitous support in developer tooling, especially for server-to-server communication. Although HTTP works great for many use cases like chat or batch inference, real-time voice AI needs persistent, low-latency connections when talking to AI inference servers. To support your real-time AI workloads, Workers AI now supports WebSocket connections in select models.

Launching with PipeCat SmartTurn V2

The first model with WebSocket support is PipeCat's smart-turn-v2 turn detection model — a critical component for natural conversation. Turn detection models determine when a speaker has finished talking and it's appropriate for the AI to respond. Getting this right is the difference between an AI that constantly interrupts and one that feels natural to talk to.

Below is an example of how to call smart-turn-v2 running on Workers AI.

"""
Cloudflare AI WebSocket Inference - With PipeCat's smart-turn-v2
"""

import asyncio
import websockets
import json
import numpy as np

# Configuration
ACCOUNT_ID = "your-account-id"
API_TOKEN = "your-api-token"
MODEL = "@cf/pipecat-ai/smart-turn-v2"

# WebSocket endpoint
WEBSOCKET_URL = f"wss://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}?dtype=uint8"

async def run_inference(audio_data: bytes) -> dict:
    async with websockets.connect(
        WEBSOCKET_URL,
        additional_headers={
            "Authorization": f"Bearer {API_TOKEN}"
        }
    ) as websocket:
        await websocket.send(audio_data)
        
        response = await websocket.recv()
        result = json.loads(response)
        
        # Response format: {'is_complete': True, 'probability': 0.87}
        return result

def generate_test_audio():
    # Random noise centered at 128, the unsigned 8-bit midpoint.
    # Clip to the valid range *before* converting to uint8; clipping
    # after the cast does nothing, since out-of-range values have
    # already wrapped around.
    noise = np.clip(np.random.normal(128, 20, 8192), 0, 255).astype(np.uint8)
    return noise

async def demonstrate_inference():
    # Generate test audio
    noise = generate_test_audio()
    
    try:
        print("\nTesting noise...")
        noise_result = await run_inference(noise.tobytes())
        print(f"Noise result: {noise_result}")
        
    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    asyncio.run(demonstrate_inference())

Deepgram in Workers AI

On Wednesday, we announced that Deepgram's speech-to-text and text-to-speech models are available on Workers AI, running in Cloudflare locations worldwide. This means:

  • Lower latency - Speech recognition happens at the edge, close to users, on the same network as Workers

  • WebRTC audio processing without leaving the Cloudflare network

  • State-of-the-art audio ML models - powerful, capable, and fast audio models, available directly through Workers AI

  • Global scale - leverages Cloudflare’s global network in 330+ cities automatically

Deepgram is a popular choice for voice AI applications. By building your voice AI systems on the Cloudflare platform, you get access to powerful models and the lowest latency infrastructure to give your application a natural, responsive experience.

Interested in other realtime AI models running on Cloudflare?

If you're developing AI models for real-time applications, we want to run them on Cloudflare's network. Whether you have proprietary models or need ultra-low-latency inference at scale with open-source models, reach out to us.

Get started today

All of these features are available now.

Want to pick the brains of the engineers who built this? Join them for technical deep dives, live demos, and Q&A at Cloudflare Connect in Las Vegas. Explore the full schedule and register.


Follow on X

Renan Dincer|@rrnn
Cloudflare|@cloudflare
