Sample segmentations — Pairs of images featuring manual segmentations *(left)* and segmentations produced by the researchers’ new weakly supervised system.

Computer vision

Learning to segment images without manually segmented training data

Machine learning method relies on coarse “bounding-box” image labels but still delivers state-of-the-art segmentation results.

September 01, 2020

Semantic segmentation is the task of automatically labeling every pixel of a digital image as belonging to one of many classes (person, cat, airplane, table, etc.), with applications in content-based image retrieval, medical imaging, and object recognition, among others.

Graphic illustrating segmentation vs. boxes — For annotators, drawing a bounding box around an object is much easier than fully segmenting the same image.

Machine-learning-based semantic segmentation systems are typically trained on images in which object boundaries have been meticulously hand-traced, a time-consuming operation. Object detection systems, on the other hand, can be trained on images in which objects are framed by rectangles known as bounding boxes. For human annotators, hand-segmenting an image takes, on average, 35 times as long as labeling bounding boxes.

In a paper we presented last week at the European Conference on Computer Vision (ECCV), we describe a new system, which we call Box2Seg, that learns to segment images using only bounding-box training data, an example of weakly supervised learning.

In experiments, our system provided a 2% improvement over previous weakly supervised systems on a metric called mean intersection over union (mIoU), which measures the congruence between the system’s segmentation of an image and manual segmentation. Our system’s performance was also comparable to that of one pretrained on general image data and then trained on fully segmented data.

Furthermore, when we trained a system using our weakly supervised approach, then fine-tuned it on fully segmented data, it improved upon the performance of the system pretrained on general image data by 16%. This suggests that, even when segmented training data is available, pretraining with our weakly supervised approach still has advantages.

Noisy labels

Our approach is to treat the bounding boxes as noisy labels. We treat each pixel inside a box as having been labeled as part of the object whose boundary we are trying to find; some of those pixels, however, have been labeled incorrectly. All the pixels outside a box we treat as correctly labeled background pixels.

During training, inputs to our system pass through three convolutional neural networks: an object segmentation network and two ancillary networks. During operation, we discard the ancillary networks, so they don’t add to the complexity of the deployed system.

Architecture of the researchers’ training model — The architecture of the researchers’ training model. The locations of the bounding boxes themselves *(B)* and coarse segmentations provided by the GrabCut segmentation algorithm *(M)* help supervise the training of both the object segmentation network *(θ_y)* and the two ancillary networks *(θ_a* and θ_b).

One of the ancillary networks performs pairwise comparisons between pixels in an image, attempting to learn general methods of distinguishing background and foreground. Intuitively, it’s looking for pixels inside the bounding box that are similar to the correctly labeled background pixels outside the box and clusters of pixels inside the box that are dissimilar from each other. We call this network the embedding network because it learns a vector representation — an embedding — of pixels that captures just those properties useful for distinguishing background and foreground.

We pretrain the embedding network using the relatively coarse segmentations provided by a standard segmentation algorithm called GrabCut. During training, the embedding network’s output provides a supervisory signal to the object segmentation network; that is, one of the criteria we use to evaluate the embedding network’s performance is the accord of its output with that of the embedding network.

Examples of affinities identified by the researchers’ embedding network. — Examples of *affinities* identified by the researchers’ embedding network. Brighter regions indicate pixels that the network concludes have something in common.

The other ancillary network is a label-specific attention network. It learns to identify visual properties that recur frequently across pixels inside bounding boxes with the same labels. It could be thought of as an object detector whose output is not an object label but rather an image map highlighting pixel clusters characteristic of a particular object class.

Manual segmentation of an image with bounding boxes — From left to right: The manual segmentation of an image; the bounding box combined with the coarse segmentation provided by the GrabCut algorithm; and the bounding box combined with the output of the researchers’ label-specific attention network. In the third pair of images, colors toward the red end of spectrum denote image features that occur frequently inside bounding boxes with a particular label. During training, the object segmentation network should pay particular attention to those features.

The label-specific attention network is useful only for object classes that it sees during training; its output could be counterproductive with classes of objects it wasn’t trained on. But during training, it — like the embedding network — provides a useful supervisory signal, which can help the object segmentation network learn to perform more general segmentations.

In experiments using a standard benchmark data set, we found that, using only bounding-box training data, Box2Seg outperformed 12 other systems trained on fully segmented training data. When a network trained using Box2Seg was fine-tuned on fully segmented data, the performance improvement was even more dramatic. This suggests that weakly supervised training of object segmentation could be useful when fully segmented training data is not available — and even when it is.

About the Author

Siddhartha Chandra

Siddhartha Chandra is an applied scientist in Amazon’s Computer Vision-Machine Learning organization.

Learning to segment images without manually segmented training data

Machine learning method relies on coarse “bounding-box” image labels but still delivers state-of-the-art segmentation results.

Noisy labels

Related content

Work with us