Logo recognition is the task of identifying a specific logo and its location in images or videos. It helps create a safe and trustworthy shopping experience, for instance by recognizing images containing offensive symbols or corporate trademarks.
Logo recognition poses challenges that other image classification problems, such as recognizing cat or dog species, do not: the number of logo classes is typically an order of magnitude larger, and new logos, trademarks, and symbols are constantly being created.
In a paper my colleague Mark Hubenthal and I are presenting at the 2023 Winter Conference on Applications of Computer Vision (WACV), which starts next month, we address the problem of zero-shot logo recognition, where we do not have access to all the possible types of logos during model training.
The standard solution to this problem has two stages: (i) detecting all the possible image regions that might contain a logo and (ii) matching the detected regions against an ever-evolving set of logo prototypes. The matching process is challenging, especially for logos that are very similar to other logos or that contain a lot of text.
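Concretely, stage (ii) reduces to a nearest-neighbor search in an embedding space. The sketch below illustrates the idea under some assumptions: the stage (i) detector is not shown, the region and prototype embeddings are precomputed and L2-normalized, and the function name and similarity threshold are hypothetical rather than taken from the paper.

```python
import numpy as np

def match_regions(region_embeddings, prototype_embeddings, prototype_labels, threshold=0.7):
    """Match detected image regions against a gallery of logo prototypes.

    region_embeddings:    (R, D) L2-normalized embeddings of detected regions.
    prototype_embeddings: (P, D) L2-normalized embeddings of cropped prototype logos.
    prototype_labels:     list of P class names, one per prototype.
    Returns one label per region, or None when no prototype is similar enough.
    """
    # For L2-normalized vectors, cosine similarity is just a dot product.
    similarities = region_embeddings @ prototype_embeddings.T  # (R, P)
    best = similarities.argmax(axis=1)
    return [
        prototype_labels[p] if similarities[r, p] >= threshold else None
        for r, p in enumerate(best)
    ]
```

Because supporting a new logo class only means adding rows to the prototype gallery, neither the detector nor the embedder needs to be retrained when new logos appear.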
Our paper makes two major contributions. First, we demonstrate that leveraging image-text contrastive pretraining, which involves aligning the representation of an image with its text description, significantly alleviates the challenges of text-heavy logo matching. Second, we propose a metric-learning loss function — that is, a loss function that learns from the data how to measure similarity — that better separates highly related logo classes.
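For readers unfamiliar with image-text contrastive pretraining, the snippet below sketches the standard symmetric loss popularized by CLIP, in which each image embedding is trained to be most similar to the embedding of its own description. It illustrates the general technique, not the specific pretraining recipe or loss formulation used in the paper.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss aligning each image with its own text description.

    image_emb, text_emb: (N, D) embeddings of N matched image-text pairs.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    # Row i should match column i: the i-th image with the i-th description.
    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.t(), targets)
    return (loss_image_to_text + loss_text_to_image) / 2
```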
In experiments on standard open-source logo recognition datasets, we compared our approach to the existing state of the art. We measured performance according to recall, or the fraction of ground-truth logo instances whose classes the model correctly identifies. Our method achieves a new state of the art on five public logo datasets, with a 3.5% improvement in zero-shot recall on the LogoDet-3K test set, 4% on OpenLogo, 6.5% on FlickrLogos-47, 6.2% on Logos In The Wild, and 0.6% on BelgaLogos.
Contrastive learning
Traditionally, logo recognition is treated as a specific instance of the general object detection problem. However, most commercial object detection systems assume a constant set of classes or categories during both training and inference. That assumption is often violated in logo recognition, due to new design patents and trademarks being registered or new offensive symbols being created in online forums.
Zero-shot logo recognition relies heavily on an embedding model for matching query regions against a constantly evolving set of cropped logo images. In previous work, Amazon researchers discovered that traditional pretrained computer vision models did a poor job representing text-heavy logo classes. They proposed using a separate text pipeline to extract the text in the image via optical character recognition (OCR) and using the text to augment a vision-based embedding.
In a number of recent works, researchers have discovered that image-text contrastive training — a type of metric learning — can help visual embedders implicitly recognize text in images. In contrastive training, a model is fed pairs of training examples; each pair contains either two positive examples or one positive example and one negative. The model learns not only to cluster positive examples together but also to push them away from negative examples.
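The snippet below gives a minimal version of such a pairwise contrastive loss, in its classic margin-based form; the choice of cosine distance and the margin value are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(emb_a, emb_b, is_positive_pair, margin=0.5):
    """Classic margin-based contrastive loss over pairs of embeddings.

    emb_a, emb_b:     (N, D) embeddings of the two members of each pair.
    is_positive_pair: (N,) tensor, 1.0 if the pair shares a class, 0.0 otherwise.
    Positive pairs are pulled together; negative pairs are pushed until they are
    at least `margin` apart in cosine distance.
    """
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    distance = 1.0 - (emb_a * emb_b).sum(dim=-1)  # cosine distance
    positive_term = is_positive_pair * distance.pow(2)
    negative_term = (1.0 - is_positive_pair) * F.relu(margin - distance).pow(2)
    return (positive_term + negative_term).mean()
```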
In contrastive training, negative examples are typically chosen at random. But we further improve the separability of very similar logos by mining the training data for hard-negative examples — logos from different classes whose associated texts are nonetheless similar. For instance, “Heinz” is a hard negative for “Heineken”, since the two names share the same first four letters.
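The exact text-similarity measure used for mining is not spelled out here, so the sketch below uses Python's built-in SequenceMatcher as a stand-in: for each logo class, it ranks all other classes by the similarity of their associated text and keeps the closest ones as hard negatives.

```python
from difflib import SequenceMatcher

def mine_hard_negatives(class_texts, top_k=5):
    """For each logo class, find the other classes with the most similar text.

    class_texts: dict mapping a class name to the text associated with its logo
                 (for example, the brand name or OCR output from prototype crops).
    Returns a dict mapping each class to its `top_k` hardest negative classes.
    """
    hard_negatives = {}
    for cls, text in class_texts.items():
        scored = [
            (SequenceMatcher(None, text.lower(), other_text.lower()).ratio(), other_cls)
            for other_cls, other_text in class_texts.items()
            if other_cls != cls
        ]
        scored.sort(reverse=True)  # most similar text first
        hard_negatives[cls] = [other_cls for _, other_cls in scored[:top_k]]
    return hard_negatives

# With this measure, "Heinz" ranks as the most text-similar class to "Heineken".
print(mine_hard_negatives({"Heineken": "heineken", "Heinz": "heinz", "Pepsi": "pepsi"}))
```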
During training, we explicitly pair positive examples with their hard negatives, to encourage the model to distinguish logos with similar texts. The combination of contrastive training and hard-negative example pairing is what enabled our model to establish new benchmarks in logo recognition.
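One simple way to realize this pairing is at batch-construction time: for every anchor image, add both a positive from the same class and a negative drawn from that class's mined hard negatives, then feed the resulting pairs to a contrastive loss like the one above. The sampler below is a hypothetical illustration of that idea, not our production training code.

```python
import random

def build_training_pairs(images_by_class, hard_negatives, num_anchors=32):
    """Assemble explicit pairs for contrastive training with hard negatives.

    images_by_class: dict mapping each class to its list of training images
                     (every class is assumed to contain at least two images).
    hard_negatives:  dict mapping each class to a list of text-similar classes,
                     e.g. the output of mine_hard_negatives above.
    Returns (image_a, image_b, is_positive_pair) tuples.
    """
    pairs = []
    classes = list(images_by_class)
    for _ in range(num_anchors):
        cls = random.choice(classes)
        anchor, positive = random.sample(images_by_class[cls], 2)
        negative_cls = random.choice(hard_negatives[cls])
        negative = random.choice(images_by_class[negative_cls])
        pairs.append((anchor, positive, 1.0))   # same class: pull together
        pairs.append((anchor, negative, 0.0))   # text-similar class: push apart
    return pairs
```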
Separately, we have used this approach to train a logo embedder on a much larger set of logo images. A currently deployed system that uses this embedding model surfaces Climate Pledge Friendly-eligible products for human review by recognizing sustainability-related logos in product images. The same system is also used to identify images containing certain prohibited content or offensive symbols. Notably, our system can act on new offensive symbols as they are identified, without requiring any updates to its architecture.