
Evaluating image segmentation models for background removal for Images

2025-08-28

Reading time: 12 min.

Last week, we wrote about face cropping for Images, which runs an open-source face detection model in Workers AI to automatically crop images of people at scale.

It wasn’t too long ago when deploying AI workloads was prohibitively complex. Real-time inference previously required specialized (and costly) hardware, and we didn’t always have standard abstractions for deployment. We also didn’t always have Workers AI to enable developers — including ourselves — to ship AI features without this additional overhead.

And whether you’re skeptical or celebratory of AI, you’ve likely seen its explosive progression. New benchmark-breaking computational models are released every week. We now expect a fairly high degree of accuracy — the more important differentiators are how well a model fits within a product’s infrastructure and what developers do with its predictions.

This week, we’re introducing background removal for Images. This feature runs a dichotomous image segmentation model on Workers AI to isolate subjects in an image from their backgrounds. We took a controlled, deliberate approach to testing models for efficiency and accuracy.

Here’s how we evaluated various image segmentation models to develop background removal.

A primer on image segmentation

In computer vision, image segmentation is the process of splitting an image into meaningful parts.

Segmentation models produce a mask that assigns each pixel to a specific category. This differs from detection models, which don’t classify every pixel but instead mark regions of interest. A face detection model, such as the one that informs face cropping, draws bounding boxes based on where it thinks there are faces. (If you’re curious, our post on face cropping discusses how we use these bounding boxes to perform crop and zoom operations.)

Salient object detection is a type of segmentation that highlights the parts of an image that most stand out. Most salient detection models create a binary mask that categorizes the most prominent (or salient) pixels as the “foreground” and all other pixels as the “background”. In contrast, a multi-class mask considers the broader context and labels each pixel as one of several possible classes, like “dog” or “chair”. These multi-class masks are the basis of content analysis models, which distinguish which pixels belong to specific objects or types of objects.

In this photograph of my dog, a detection model predicts that a bounding box contains a dog; a segmentation model predicts that some pixels belong to a dog, while all other pixels don’t.

For our use case, we needed a model that could produce a soft saliency mask, which predicts how strongly each pixel belongs to either the foreground (objects of interest) or the background. That is, each pixel is assigned a value on a scale of 0–255, where 0 is completely transparent and 255 is fully opaque. Most background pixels are labeled at (or near) 0; foreground pixels may vary in opacity, depending on their degree of saliency.
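To make the soft mask concrete, here is a minimal sketch of how a 0–255 saliency mask can be applied as an alpha channel to produce a cutout. It assumes Pillow is installed; the file names are placeholders, and the mask is whatever single-channel output a segmentation model produced.

```python
# Minimal sketch: apply a soft saliency mask (0-255) as an alpha channel.
# "photo.png" and "mask.png" are hypothetical placeholders for an input image
# and the single-channel mask predicted by a segmentation model.
from PIL import Image

photo = Image.open("photo.png").convert("RGB")
mask = Image.open("mask.png").convert("L")   # grayscale: 0 = transparent, 255 = opaque
mask = mask.resize(photo.size)               # align the mask with the image dimensions

cutout = photo.copy()
cutout.putalpha(mask)                        # soft edges keep partial transparency
cutout.save("cutout.png")
```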

In principle, a background removal feature must be able to accurately predict saliency across a broad range of contexts. For example, e-commerce and retail vendors want to display all products on a uniform, white background; in creative and image editing applications, developers want to enable users to create stickers and cutouts from uploaded content, including images of people or avatars.

In our research, we focused primarily on the following four image segmentation models: U2-Net, IS-Net, BiRefNet, and SAM (the Segment Anything Model).

Different scales of information allow computational models to build a holistic view of an image. Global context considers the overall shape of objects and how areas of pixels relate to the entire image, while local context traces fine details like edges, corners, and textures. If local context focuses on the trees and their leaves, then global context represents the entire forest.

U2-Net extracts information using a multi-scale approach, where it analyzes an image at different zoom levels, then combines its predictions in a single step. The model analyzes global and local context at the same time, so it works well on images with multiple objects of varying sizes.

IS-Net introduces a new, two-step strategy called intermediate supervision. First, the model separates the foreground from the background, identifying potential areas that likely belong to objects of interest — all other pixels are labeled as the background. Second, it refines the boundaries of the highlighted objects to produce a final pixel-level mask.

The initial suppression of the background results in cleaner, more precise edges, as the segmentation focuses only on the highlighted objects of interest and is less likely to mistakenly include background pixels in the final mask. This model especially excels when dealing with complex images with cluttered backgrounds.

Both models move through scales of information in a single direction: U2-Net interprets the global and local context in one pass, while IS-Net begins with the global context, then focuses on the local context.

In contrast, BiRefNet refines its predictions over multiple passes, moving in both contextual directions. Like IS-Net, it initially creates a map that roughly highlights the salient object, then traces the finer details. However, BiRefNet moves from global to local context, then from local context back to global. In other words, after refining the edges of the object, it feeds the output back to the large-scale view. This way, the model can check that the small-scale details align with the broader image structure, providing higher accuracy on high-resolution images.

U2-Net, IS-Net, and BiRefNet are exclusively saliency detection models, producing masks that distinguish foreground pixels from background pixels. However, SAM was designed to be more extensible and general; its primary goal is to segment any object based on specified inputs, not only salient objects. This means that the model can also be used to create multi-class masks that label various objects within an image, even if they aren’t the primary focus of an image.

How we measure segmentation accuracy

In most saliency datasets, the actual location of the object is known as the ground-truth area. These regions are typically defined by human annotators, who manually trace objects of interest in each image. This provides a reliable reference to evaluate model predictions.

Our tests compared the predicted area to the ground-truth area, or the true boundaries of the object.

Photograph by Allen Fang

Each model outputs a predicted area (where it thinks the foreground pixels are), which can be compared against the ground-truth area (where the foreground pixels actually are).

Models are evaluated for segmentation accuracy based on common metrics like Intersection over Union, Dice coefficient, and pixel accuracy. Each score takes a slightly different approach to quantify the alignment between the predicted and ground-truth areas (“P” and “G”, respectively, in the formulas below).

Intersection over Union

Intersection over Union (IoU), also called the Jaccard index, measures how well the predicted area matches the true object. That is, it counts the number of foreground pixels that are shared in both the predicted and ground-truth masks. Mathematically, IoU is written as:

$$\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|}$$

The formula divides the intersection (P∩G), or the pixels where the predicted and ground-truth areas overlap, by the union (P∪G), or the total area of pixels that belong to either area, counting the overlapping pixels only once.


IoU produces a score between 0 and 1. A higher value indicates a closer overlap between the predicted and ground-truth areas. A perfect match, although rare, would score 1, while a smaller overlapping area brings the score closer to 0.


Dice coefficient

The Dice coefficient, also called the Sørensen–Dice index, similarly compares how well the model’s prediction matches reality, but is much more forgiving than the IoU score. It gives more weight to the shared pixels between the predicted and actual foreground, even if the areas differ in size. Mathematically, the Dice coefficient is written as:

$$\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}$$

The formula divides twice the intersection (P∩G) by the sum of pixels in both predicted and ground-truth areas (P+G), counting any overlapping pixels twice.


Like IoU, the Dice coefficient also produces a value between 0 and 1, indicating a more accurate match as it approaches 1.

Pixel accuracy

Pixel accuracy measures the percentage of pixels that were correctly labeled as either the foreground or the background. Mathematically, pixel accuracy is written as:

$$\mathrm{Pixel\ accuracy}(P, G) = \frac{|P \cap G| + |P' \cap G'|}{N}, \quad \text{where } N \text{ is the total number of pixels}$$

The formula divides the number of correctly predicted pixels by the total number of pixels in the image.


The total area of correctly predicted pixels is the sum of foreground and background pixels that accurately match the ground-truth areas.

The correctly predicted foreground is the intersection of the predicted and ground-truth areas (P∩G). The inverse of the predicted area (P’, or 1–P) represents the pixels that the model identifies as the background; the inverse of the ground-truth area (G’, or 1–G) represents the actual boundaries of the background. When these two inverted areas overlap (P’∩G’, or (1–P)∩(1–G)), this intersection is the correctly predicted background.
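Taken together, the three formulas amount to only a few lines of code on binary masks. The sketch below is an illustrative NumPy implementation, assuming the predicted and ground-truth masks have already been binarized (a soft 0–255 mask would first be thresholded).

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union (Jaccard index) between two boolean foreground masks."""
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union if union else 1.0

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Sørensen–Dice coefficient: twice the intersection over the sum of both areas."""
    intersection = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return 2 * intersection / total if total else 1.0

def pixel_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """Fraction of pixels labeled correctly as either foreground or background."""
    return (pred == gt).mean()
```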

Interpreting the metrics

Of the three metrics, IoU is the most conservative measure of segmentation accuracy. Small mistakes, such as including extra background pixels in the predicted foreground, reduce the score noticeably. This metric is most valuable for applications that require precise boundaries, such as autonomous driving systems.

Meanwhile, the Dice coefficient rewards the overlapping pixels more heavily, and subsequently tends to be higher than the IoU score for the same prediction. In model evaluations, this metric is favored over IoU when it’s more important to capture the object than to penalize mistakes. For example, in medical imaging, the risk of missing a true positive substantially outweighs the inconvenience of flagging a false positive.

In the context of background removal, we biased toward the IoU score and Dice coefficient over pixel accuracy. Pixel accuracy can be misleading, especially when processing an image where background pixels comprise the majority of pixels.


For example, consider an image with 900 background pixels and 100 foreground pixels. A model that correctly predicts only 5 foreground pixels — 5% of all foreground pixels — will score deceptively high in pixel accuracy. Intuitively, we’d likely say that this model performed poorly. However, assuming all 900 background pixels were correctly predicted, the model maintains 90.5% pixel accuracy, despite missing the subject almost entirely.
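Plugging the numbers from this example into the metrics makes the imbalance explicit; the snippet below is a quick, self-contained check using NumPy.

```python
import numpy as np

# 1,000 pixels: 100 belong to the foreground in the ground truth.
gt = np.zeros(1000, dtype=bool)
gt[:100] = True

# Hypothetical prediction: only 5 foreground pixels found, every background pixel correct.
pred = np.zeros(1000, dtype=bool)
pred[:5] = True

intersection = np.logical_and(pred, gt).sum()      # 5
union = np.logical_or(pred, gt).sum()              # 100

print((pred == gt).mean())                         # 0.905 -> pixel accuracy looks strong
print(intersection / union)                        # 0.05  -> IoU exposes the miss
print(2 * intersection / (pred.sum() + gt.sum()))  # ~0.095 -> Dice stays low as well
```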

Pixels, predictions, and patterns

To determine the most suitable model for the Images API, we performed a series of tests using the open-source rembg library, which combines all relevant models in a single interface.

Each model was tasked with outputting a prediction mask to label foreground versus background pixels. We pulled images from two saliency datasets: Humans contains over 7,000 images of people with varying skin tones, clothing, and hairstyles, while DIS5K (version 1.5) spans a vast range of objects and scenes. If a model contained variants that were pre-trained on specific types of segmentation (e.g. clothes, humans), then we repeated the tests for the generalized model and each variant.
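As an illustration of the evaluation loop, here is a hedged sketch of scoring a single image with rembg: it generates a soft mask, binarizes it, and compares it against the annotated ground truth. The model identifiers follow rembg's naming and may vary by version; the file paths are placeholders.

```python
import numpy as np
from PIL import Image
from rembg import new_session, remove

# Model identifier follows rembg's naming (e.g. "u2net", "birefnet-general");
# availability may vary by rembg version.
session = new_session("isnet-general-use")

image = Image.open("sample.jpg")                        # placeholder test image
mask = remove(image, session=session, only_mask=True)   # soft saliency mask (0-255)

# Binarize the predicted mask and the human-annotated ground truth at the midpoint.
pred = np.array(mask.convert("L")) > 127
gt = np.array(Image.open("sample_gt.png").convert("L")) > 127   # placeholder annotation

intersection = np.logical_and(pred, gt).sum()
union = np.logical_or(pred, gt).sum()
print("IoU: ", intersection / union)
print("Dice:", 2 * intersection / (pred.sum() + gt.sum()))
```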

Our experiments were executed on a GPU with 23 GB VRAM to mirror realistic hardware constraints, similar to the environment where we already run a face detection model. We also replicated the same tests on a larger GPU instance with 94 GB VRAM; this served as an upper-bound reference point to benchmark potential speed gains if additional compute were available. Cloudflare typically reserves larger GPUs for more compute-intensive AI workloads — we viewed these tests more as an exploration for comparison than as a production scenario.

During our analysis, we started to see key trends emerge:

On the smaller GPU, inference times were generally faster for lightweight models like U2-Net (176 MB) and IS-Net (179 MB). The average inference time across both datasets was 307 milliseconds for U2-Net and 351 milliseconds for IS-Net. On the opposite end, BiRefNet (973 MB) had noticeably slower output times, averaging 821 milliseconds across its two generalized variants.

BiRefNet ran 2.4 times faster on the larger GPU, reducing its average inference time to 351 milliseconds — comparable to the other models, despite its larger size. In contrast, the lighter models did not show any notable speed gain with additional compute, suggesting that scaling hardware configurations primarily benefits heavier models. In Appendix 1 (“Inference Time in Milliseconds”), we compare speed across models and GPU instances.


We also observed distinct patterns when comparing model performance across the two saliency datasets. Most notably, all models ran faster on the Humans dataset, where images of people tend to be single-subject and relatively uniform. The DIS5K dataset, in contrast, includes images with higher complexity — that is, images with more objects, cluttered backgrounds, or multiple objects of varying scales.

Slower predictions suggest a relationship between visual complexity and the computation needed to identify the important parts of an image. In other words, datasets with simpler, well-separated objects can be analyzed more quickly, while complex scenes require more computation to generate accurate masks.

Similarly, complexity challenges accuracy as much as it does efficiency. In our tests, all models demonstrated higher segmentation accuracy with the Humans dataset. In Appendix 2 (“Measures of Model Accuracy”), we present our results for segmentation accuracy across both datasets.

Specialized variants scored slightly higher in accuracy compared to their generalized counterparts. But in broad, practical applications, selecting a specialized model for every input isn’t realistic, at least for our initial beta version. We favored general-purpose models that can produce accurate predictions without prior classification. For this reason, we excluded SAM — while powerful in its intended use cases, SAM is designed to work with additional inputs. On unprompted segmentation tasks, it produced lower accuracy scores (and much higher inference times) amongst the models we tested.

All BiRefNet variants showed greater accuracy compared to other models. The generalized variants (-general and -dis) were just as accurate as more specialized variants like -portrait. The birefnet-general variant, in particular, achieved a high IoU score of 0.87 and Dice coefficient of 0.92, averaged across both datasets.

In contrast, the generalized U2-Net model showed high accuracy on the Humans dataset, reaching an IoU score of 0.89 and a Dice coefficient of 0.94, but received a low IoU score of 0.39 and Dice coefficient of 0.52 on the DIS5K dataset. The isnet-general-use model performed substantially better, obtaining an average IoU score of 0.82 and Dice coefficient of 0.89 across both datasets.

We also examined whether models could interpret both the global and local context of an image. In some scenarios, the U2-Net and IS-Net models captured the overall gist of an image but couldn't accurately trace fine edges. We designed one test to measure how well each model could isolate bicycle wheels; for variety, we included images with both interior and exterior backgrounds. Lower-scoring models, while correctly labeling the area surrounding the wheel, struggled with the pixels between the thin spokes and produced prediction masks that included these background pixels.


Photograph by Yomex Owo on Unsplash

In other scenarios, the models showed the opposite limitation: they produced masks with clean edges, but failed to identify the focus of the image. We ran another test using a photograph of a gray T-shirt against black gym flooring. Both generalized U2-Net and Is-Net models labeled only the logo as the salient object, creating a mask that omitted the rest of the shirt entirely. 

Meanwhile, the BiRefNet model achieved high accuracy across both types of tests. Its architecture passes information bidirectionally, allowing details at the pixel level to be informed by the larger scene (and vice versa). In practice, this means that BiRefNet interprets how fine-grained edges fit into the broader object. For our beta version, we opted to use the BiRefNet model to drive decisions for background removal.


Unlike lower scoring models, the BiRefNet model understood that the entire shirt is the true subject of the image.

Applying background removal with the Images API

The Images API now supports automatic background removal for hosted and remote images. This feature is available in open beta to all Cloudflare users on Free and Paid plans.


Use the segment parameter when optimizing an image through a specially formatted Images URL or a Worker, and Cloudflare will isolate the subject of your image and convert the background into transparent pixels. This can be combined with other optimization operations, as shown in the transformation URL below:

example.com/cdn-cgi/image/gravity=face,zoom=0.5,segment=foreground,background=white/image.png

This request will:

- Detect a face and crop the image around it (gravity=face)
- Apply a zoom factor of 0.5 relative to the detected face (zoom=0.5)
- Isolate the subject and convert the background into transparent pixels (segment=foreground)
- Fill the transparent background with white (background=white)
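For example, a script can request the transformed variant directly; the sketch below simply fetches the URL above and saves the result. The domain and image path are placeholders from the example, and it assumes the requests library is available.

```python
import requests

# Placeholder values taken from the example transformation URL above.
origin = "https://example.com"
options = "gravity=face,zoom=0.5,segment=foreground,background=white"
image_path = "image.png"

url = f"{origin}/cdn-cgi/image/{options}/{image_path}"
response = requests.get(url, timeout=30)
response.raise_for_status()

with open("result.png", "wb") as f:
    f.write(response.content)   # face-cropped subject with the background filled in white
```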

You can also use background removal when overlaying images through the draw() command, like in the image below. Here, we isolate the subjects of multiple images, then overlay them on a background image.


Photographs by Guy Hurst (landscape), Oskar Gackowski (ice cream), and me (dog)

Background removal is another step in our ongoing effort to enable developers to build interactive and imaginative products. These features are an iterative process, and we’ll continue to refine our approach even further. We’re looking forward to sharing our progress with you.

Read more about applying background removal in our documentation.

Appendix 1: Inference Time in Milliseconds

23 GB VRAM GPU


94 GB VRAM GPU


Appendix 2: Measures of Model Accuracy

