Load Balancing Monitor Groups: Multi-Service Health Checks for Resilient Applications

Modern applications are not monoliths. They are complex, distributed systems where availability depends on multiple independent components working in harmony. A web server might be running, but if its connection to the database is down or the authentication service is unresponsive, the application as a whole is unhealthy. Relying on a single health check is like knowing the “check engine” light is not on, but not knowing that one of your tires has a puncture. It’s great your engine is going, but you’re probably not driving far.

As applications grow in complexity, so does the definition of "healthy." We've heard from customers, big and small, that they need to validate multiple services to consider an endpoint ready to receive traffic. For example, they may need to confirm that an underlying API gateway is healthy and that a specific ‘/login’ service is responsive before routing users there. Until now, this required building custom, synthetic services to aggregate these checks, adding operational overhead and another potential point of failure.

Today, we are introducing Monitor Groups for Cloudflare Load Balancing. This feature provides a new way to create sophisticated, multi-service health assessments directly on our platform. With Monitor Groups, you can bundle multiple health monitors into a single logical entity, define which components are critical, and use an aggregated health score to make more intelligent and resilient failover decisions.

This new capability, available via the API for our Enterprise customers, removes the need for custom health aggregation services and provides a far more accurate picture of your application’s true availability. In the near future this feature will be available in the Dashboard for all Load Balancing users, not just Enterprise!

How Monitor Groups Work

Monitor Groups function as a superset of monitors. Once you have created your monitors, they can be bundled into a single unit: the Monitor Group. When you attach a Monitor Group to an endpoint pool, the health of each endpoint in that pool is determined by aggregating the results of all enabled monitors within the group. Per-member settings, defined within the ‘members’ array of a monitor group, give you granular control over how the collective health is determined.

// Structure of a monitor group with two member monitors
{
  "description": "Test Monitor Group",
  "members": [
    {
      "monitor_id": "string",
      "enabled": true,
      "monitoring_only": false,
      "must_be_healthy": true
    },
    {
      "monitor_id": "string",
      "enabled": true,
      "monitoring_only": false,
      "must_be_healthy": true
    }
  ]
}

Here’s what each property does:

- Critical Monitors (must_be_healthy): You can designate a monitor as critical. If a monitor with this setting fails its health check against an endpoint, that endpoint is immediately marked as unhealthy. This provides a definitive override for essential services, regardless of the status of other monitors in the group.
- Observational Probes (monitoring_only): Mark a monitor as "monitoring only" to receive alerts and data without it affecting a pool's health status or traffic steering. This is perfect for testing new checks or observing non-critical dependencies without impacting production traffic.
- Quorum-Based Health: In the absence of a failure from a critical monitor, an endpoint's health is determined by a quorum of all other active monitors. An endpoint is considered globally unhealthy only if more than 50% of its assigned monitors report it as unhealthy. This system prevents an endpoint from being prematurely marked as unhealthy due to a transient failure from a single, non-critical monitor.

You can add up to five monitors to a group.
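To make the three rules concrete, here is a minimal Python sketch of how one vantage point might combine member results into a single verdict. It is an illustration under stated assumptions, not Cloudflare's implementation: the `MemberResult` type and `endpoint_is_healthy` function are hypothetical names, the quorum is taken over the non-critical voting members, and an exact 50/50 split is treated as healthy because the rule above only marks an endpoint unhealthy when more than 50% of monitors fail.

```python
from __future__ import annotations

from dataclasses import dataclass

MAX_GROUP_MEMBERS = 5  # the post states a group holds up to five monitors


@dataclass(frozen=True)
class MemberResult:
    """One member's settings plus its latest check result (hypothetical type)."""
    monitor_id: str
    healthy: bool
    enabled: bool = True
    monitoring_only: bool = False
    must_be_healthy: bool = False


def endpoint_is_healthy(results: list[MemberResult]) -> bool:
    """Combine member results for one endpoint as seen from one vantage point."""
    if len(results) > MAX_GROUP_MEMBERS:
        raise ValueError("a monitor group holds at most five monitors")

    # Disabled and monitoring-only members never influence the verdict.
    voting = [r for r in results if r.enabled and not r.monitoring_only]

    # A failed critical monitor marks the endpoint unhealthy outright.
    if any(r.must_be_healthy and not r.healthy for r in voting):
        return False

    # Assumption: the quorum is taken over the non-critical voting members.
    others = [r for r in voting if not r.must_be_healthy]
    failing = sum(1 for r in others if not r.healthy)

    # Unhealthy only when MORE than 50% fail, so a tie counts as healthy.
    return failing * 2 <= len(others)
```

For example, a passing critical HTTP monitor with one failing and one passing non-critical monitor yields a healthy verdict, while a failing critical monitor yields an unhealthy one no matter what the others report.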

[image description: A diagram showing three health monitors (HTTP, TCP, and Database) combined into a single Monitor Group. The group is attached to a Cloudflare Load Balancing pool, which assesses the health of three origin servers.]

A Globally Distributed Perspective

The power of Monitor Groups is amplified by the scale of Cloudflare’s global network. Health checks aren't performed from a handful of static locations; they can be configured to execute from data centers in over 300 cities across the globe. While you can configure monitoring from every data center simultaneously ('All Datacenters' mode), we recommend a more targeted approach for most applications. Choosing a few diverse regions, like Western North America and Eastern Europe, or using the 'All Regions' setting provides a robust, global perspective on your application's health while reducing the volume of health monitoring traffic sent to your origins. This creates a distributed consensus on application health, preventing a localized network issue from triggering a false positive and causing an unnecessary global failover. Your application’s health is determined not by a single perspective, but by a global one.

This same principle elevates Dynamic Steering when used in conjunction with Monitor Groups. The latency for a Monitor Group isn't just a single RTT measurement. It's a holistic performance score, averaged from, potentially, hundreds of points of presence, across all the critical services you’ve defined. This means your load balancer steers traffic based on a true, globally-aware understanding of your application’s performance.

For load balancers using Dynamic Steering and a Monitor Group, the latency used to make steering decisions is now calculated as the average Round Trip Time (RTT) of all active, non-monitoring-only members in the group. This provides a more stable and representative performance metric. Rather than relying on the latency of a single service, Dynamic Steering can now make decisions based on the collective performance of all critical components, ensuring traffic is sent to the endpoint that is truly the most performant overall.
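The averaging described above can be sketched in a few lines of Python. This is only an illustration: the `rtt_ms` key and the `group_latency_ms` function are hypothetical names, and the sketch averages one RTT per member rather than modeling how measurements from many points of presence are combined.

```python
from __future__ import annotations

from statistics import mean
from typing import Optional


def group_latency_ms(members: list[dict]) -> Optional[float]:
    """Average RTT across enabled members that are not monitoring-only.

    Each member dict carries a hypothetical "rtt_ms" measurement next to
    the "enabled" and "monitoring_only" flags from the group definition.
    """
    rtts = [
        m["rtt_ms"]
        for m in members
        if m.get("enabled", True) and not m.get("monitoring_only", False)
    ]
    # With no contributing members there is nothing to average.
    return mean(rtts) if rtts else None


members = [
    {"monitor_id": "http", "rtt_ms": 40.0},
    {"monitor_id": "tcp", "rtt_ms": 60.0},
    # Observational probe: excluded from the steering latency.
    {"monitor_id": "experimental", "rtt_ms": 500.0, "monitoring_only": True},
]
print(group_latency_ms(members))  # prints 50.0
```

Note how the slow monitoring-only probe does not drag the group's latency up, which is what keeps experimental checks from influencing steering.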

Health Aggregation in Action

Let's walk through an example to see how Cloudflare aggregates health signals from a Monitor Group to determine the overall health of a single endpoint. In this scenario, our application has three key components we need to check: a public-facing /health endpoint, another service running on a specific TCP port, and a database dependency. Privacy and security are paramount, so to monitor the database without exposing it to the public Internet, you would connect it to Cloudflare using a Cloudflare Tunnel, allowing our health checks to reach it securely.

Setup

Health Monitors in the Group:
- HTTP check for /health (must_be_healthy: true)
- TCP check for Port 3000 connectivity (must_be_healthy: false)
- DB check for database health (must_be_healthy: false)
Health Check Regions:
- Western North America (3 data centers)
- Eastern North America (3 data centers)
Quorum Threshold: An endpoint is considered healthy if more than 50% of checking data centers report it as UP.

First, Cloudflare determines the health from the perspective of each individual data center. If the critical monitor fails, that data center’s result is definitively DOWN. Otherwise, the result is based on the majority status of the remaining monitors.

Here are the results from our six data centers:

[image description: A table showing health check results from six data centers across two regions. One of the six data centers reports a "DOWN" status because the critical HTTP monitor failed. The other five report "UP" because the critical monitor passed and a majority of the remaining monitors were healthy.]

Finally, the results from all six checking data centers are combined to determine the final, global health status for the endpoint.

Global Result: 5 out of the 6 total data centers (83%) report the endpoint as UP.
Conclusion: Because 83% is greater than the 50% quorum threshold, the endpoint is considered globally healthy and will continue to receive traffic.

This multi-layered quorum system provides incredible resilience, ensuring that failover decisions are based on a comprehensive and geographically distributed consensus.
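The two-stage evaluation in this walkthrough can be reproduced with a short Python sketch. The per-monitor values and data center labels below are invented for illustration, since the original table is only available as an image; they are chosen so that one data center fails the critical HTTP check and the other five pass everything. The function names are hypothetical, and this is a simplified model rather than Cloudflare's implementation.

```python
from __future__ import annotations


def data_center_is_up(http_ok: bool, tcp_ok: bool, db_ok: bool) -> bool:
    """Stage 1: one data center's verdict for the endpoint."""
    if not http_ok:  # the HTTP monitor is the critical (must_be_healthy) one
        return False
    others = [tcp_ok, db_ok]
    failing = sum(1 for ok in others if not ok)
    return failing * 2 <= len(others)  # DOWN only if more than 50% fail


# Illustrative results as (http_ok, tcp_ok, db_ok); labels are made up.
OBSERVATIONS = {
    "WNAM-1": (True, True, True),
    "WNAM-2": (True, True, True),
    "WNAM-3": (False, True, True),  # critical HTTP check failed
    "ENAM-1": (True, True, True),
    "ENAM-2": (True, True, True),
    "ENAM-3": (True, True, True),
}


def global_health(observations: dict) -> tuple[bool, int, int]:
    """Stage 2: healthy if more than 50% of data centers report UP."""
    verdicts = [data_center_is_up(*checks) for checks in observations.values()]
    up = sum(verdicts)
    return up * 2 > len(verdicts), up, len(verdicts)


healthy, up, total = global_health(OBSERVATIONS)
print(f"{up}/{total} data centers UP ({up / total:.0%}) -> healthy={healthy}")
# prints: 5/6 data centers UP (83%) -> healthy=True
```

Flipping the HTTP result to failed in three more data centers would drop the share to 2 of 6, below the quorum, and the endpoint would stop receiving traffic.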

Getting Started with Monitor Groups

Monitor Groups are now available via the API for all customers with an Enterprise Cloudflare Load Balancing subscription and will be made available to self-serve customers in the near future. To get started with building more sophisticated health checks for your applications today, check out our developer documentation.

To create a monitor group, you can use a POST request to the new /load_balancers/monitor_groups endpoint.

POST accounts/{account_id}/load_balancers/monitor_groups
{
  "description": "Monitor group for checkout service",
  "members": [
    {
      "monitor_id": "string",
      "must_be_healthy": true,
      "enabled": true
    },
    {
      "monitor_id": "string",
      "monitoring_only": false,
      "enabled": true
    }
  ]
}

Once created, you can attach the group to a pool by referencing its ID in the monitor_group field of the pool object.
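As a rough sketch of both steps, the Python below builds (but does not send) the two HTTP requests using only the standard library. Several details are assumptions rather than facts from this post: the helper names are hypothetical, the base URL and Bearer token header follow the general conventions of the Cloudflare v4 API, and updating the pool with a PATCH that carries only the monitor_group field is a guess at how the attachment might be done. Check the developer documentation for the authoritative request shapes.

```python
from __future__ import annotations

import json
import urllib.request

# Assumption: the standard Cloudflare v4 API base URL applies here.
API_BASE = "https://api.cloudflare.com/client/v4"


def _json_request(method: str, url: str, api_token: str, payload: dict):
    """Build a JSON request with a Bearer token (hypothetical helper)."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method=method,
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


def build_create_group_request(account_id, api_token, description, members):
    """POST a new monitor group; path and body mirror the example above."""
    url = f"{API_BASE}/accounts/{account_id}/load_balancers/monitor_groups"
    payload = {"description": description, "members": members}
    return _json_request("POST", url, api_token, payload)


def build_attach_group_request(account_id, api_token, pool_id, group_id):
    """Reference the group from a pool via its monitor_group field.

    Assumption: the pool accepts a partial update through PATCH.
    """
    url = f"{API_BASE}/accounts/{account_id}/load_balancers/pools/{pool_id}"
    return _json_request("PATCH", url, api_token, {"monitor_group": group_id})


# Placeholder IDs and token; nothing is sent over the network here.
create_req = build_create_group_request(
    "ACCOUNT_ID",
    "API_TOKEN",
    "Monitor group for checkout service",
    [
        {"monitor_id": "MONITOR_ID_1", "must_be_healthy": True, "enabled": True},
        {"monitor_id": "MONITOR_ID_2", "monitoring_only": False, "enabled": True},
    ],
)
attach_req = build_attach_group_request(
    "ACCOUNT_ID", "API_TOKEN", "POOL_ID", "MONITOR_GROUP_ID"
)
```

Passing either request to `urllib.request.urlopen` would send it. Building the requests separately from sending them makes it easy to inspect the URL and payload before touching a production pool.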

What’s Next

We are continuing to build a seamless platform experience that simplifies traffic management for both internal and external applications. Looking ahead, Monitor Groups will be making its way into the Dashboard for all users soon! We are also working on more flexible role-based access controls and even more advanced load-based load balancing capabilities to give you the granular control you need to manage your most complex applications.
