When looking for cooking ideas, people often find inspiration on social media and in restaurants, saving screenshots or taking pictures of food they liked. At Amazon, we have built technology that lets people use those images to find the corresponding cooking recipes.
At the 2021 Conference on Computer Vision and Pattern Recognition (CVPR), my colleagues and I are presenting a new method for cross-modal image-to-recipe retrieval that achieves state-of-the-art performance by combining Transformer-based architectures with self-supervised learning.
Self-supervised learning is a paradigm in which automatic manipulation of unannotated data provides supplemental training examples for a machine learning model. In our case, in addition to supervised training using images annotated with the corresponding recipes, we do self-supervised learning using recipe data alone.
Our method uses two separate encoder functions, one for the recipe text and one for the image (left and right, respectively, in the figure below). These functions extract representations that will be used for indexing and search at inference time. To encode recipe components, we use Transformer-based architectures, which are hierarchical for multi-sentence inputs (such as ingredients and instructions) and non-hierarchical for single-sentence inputs (recipe titles). For image inputs, we use the well-established image encoders ResNet and Vision Transformers.
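To make the hierarchical idea concrete, here is a minimal sketch of a two-level text encoder in PyTorch: a first Transformer turns each sentence into a vector, and a second Transformer re-encodes those sentence vectors into one representation per component. The class names, dimensions, layer counts, and mean-pooling choices are illustrative assumptions, not the exact configuration from our paper.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Non-hierarchical Transformer: encodes a single sentence (e.g., a recipe title)."""
    def __init__(self, vocab_size=20000, dim=512, layers=2, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)

    def forward(self, tokens):                 # tokens: (batch, seq_len) of word ids
        h = self.encoder(self.embed(tokens))
        return h.mean(dim=1)                   # pool tokens into one sentence vector

class HierarchicalEncoder(nn.Module):
    """Hierarchical Transformer: sentence vectors are re-encoded by a second
    Transformer to yield one vector per multi-sentence component
    (e.g., the full ingredient list or the instruction list)."""
    def __init__(self, dim=512, layers=2, heads=4):
        super().__init__()
        self.sentence_encoder = SentenceEncoder(dim=dim, layers=layers, heads=heads)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.doc_encoder = nn.TransformerEncoder(block, layers)

    def forward(self, tokens):                 # tokens: (batch, n_sentences, seq_len)
        b, n, t = tokens.shape
        sent = self.sentence_encoder(tokens.reshape(b * n, t)).reshape(b, n, -1)
        return self.doc_encoder(sent).mean(dim=1)   # pool sentences into one vector
```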
Our model is trained with two loss functions, Lpair and Lrec (see figure above). The supervised loss, Lpair, is computed between representations extracted from the recipe (left) and the image (right). This loss ensures that text and image representations are close to each other in a common high-dimensional space if they belong to the same training example (e.g., the image of a chocolate chip cookie and its corresponding recipe text) and far apart otherwise (e.g., the same chocolate chip cookie image and the text from a lasagna recipe).
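One common way to realize this kind of pairwise objective is a bidirectional margin (triplet-style) loss over in-batch negatives, sketched below. The margin value and the exact formulation here are assumptions for illustration and may differ from what we use in the paper; the sketch only captures the "matching pairs close, non-matching pairs far" behavior described above.

```python
import torch
import torch.nn.functional as F

def pairwise_loss(recipe_emb, image_emb, margin=0.3):
    """recipe_emb, image_emb: (batch, dim); row i of each forms a matching pair."""
    recipe_emb = F.normalize(recipe_emb, dim=1)
    image_emb = F.normalize(image_emb, dim=1)
    sims = recipe_emb @ image_emb.t()                # cosine similarities, (batch, batch)
    pos = sims.diag().unsqueeze(1)                   # similarity of the true pairs
    # Hinge on every non-matching pair, in both retrieval directions.
    mask = ~torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    loss_r2i = F.relu(margin + sims - pos)[mask].mean()      # recipe vs. wrong image
    loss_i2r = F.relu(margin + sims.t() - pos)[mask].mean()  # image vs. wrong recipe
    return loss_r2i + loss_i2r
```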
Our novel self-supervised loss, Lrec, is computed between the representations of individual recipe components. This loss ensures that representations of recipe components (e.g., title and ingredients) will be close to each other in the representation space if they belong to the same recipe and far apart otherwise (see figure below). Intuitively, the title of a mac and cheese recipe and the names of its ingredients (macaroni, onion, parmesan cheese, etc.) share semantic cues that can enable a model to learn better recipe representations.
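The same mechanism can be applied within a recipe: the sketch below reuses the hypothetical pairwise_loss from above to pull together the title, ingredient, and instruction embeddings of the same recipe and push apart those of different recipes. Which component pairs are used and how they are weighted are assumptions here, not the paper's exact recipe.

```python
def recipe_component_loss(title_emb, ingredients_emb, instructions_emb):
    # Apply the pairwise margin loss to every pair of recipe components.
    return (pairwise_loss(title_emb, ingredients_emb)
            + pairwise_loss(title_emb, instructions_emb)
            + pairwise_loss(ingredients_emb, instructions_emb))
```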
Since this loss does not require an image as an input, it can be computed for training examples without images, which are very common in web recipe data; in practice, 66% of our training set is composed of text-only recipe samples. Our experiments show that both the new self-supervised loss term (even when applied only to image-recipe training pairs) and the additional training data contribute to an improvement in retrieval performance.
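One way to exploit such text-only samples is to compute the image-recipe term only for samples that come with a photo and the recipe-component term for every sample, as in the sketch below (again reusing the hypothetical functions above). The masking strategy and the weighting factor lambda_rec are assumptions for illustration, not the values used in our training setup.

```python
def total_loss(recipe_emb, image_emb, title_emb, ingr_emb, instr_emb,
               has_image, lambda_rec=1.0):
    """has_image: (batch,) boolean mask marking samples that come with a photo."""
    l_rec = recipe_component_loss(title_emb, ingr_emb, instr_emb)
    if has_image.any():
        l_pair = pairwise_loss(recipe_emb[has_image], image_emb[has_image])
    else:
        l_pair = recipe_emb.new_zeros(())      # batch contains text-only recipes
    return l_pair + lambda_rec * l_rec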
In our experiments, we performed cross-modal retrieval in both directions: finding recipes that match images and images that match recipes. Our method demonstrated state-of-the-art performance on the Recipe1M database, a common benchmark in the field. In the image-to-recipe retrieval task, our method achieved a Recall@10 of 92.9% when searching on a recipe database of 1,000 elements. This means that given a database of 1,000 recipes and 1,000 food image queries, our method is able to find the correct recipe within the top 10 retrieved results for 92.9% of the image queries.
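For readers unfamiliar with the metric, the short sketch below shows how Recall@10 over a 1,000-element database can be computed from the learned embeddings. The random tensors stand in for encoder outputs and are placeholders only; with a trained model they would come from the recipe and image encoders described above.

```python
import torch
import torch.nn.functional as F

def recall_at_k(image_emb, recipe_emb, k=10):
    """Row i of image_emb and recipe_emb form a matching image/recipe pair."""
    image_emb = F.normalize(image_emb, dim=1)
    recipe_emb = F.normalize(recipe_emb, dim=1)
    sims = image_emb @ recipe_emb.t()                    # (queries, database)
    topk = sims.topk(k, dim=1).indices                   # best k recipes per image query
    targets = torch.arange(sims.size(0)).unsqueeze(1)    # true recipe index = row index
    return (topk == targets).any(dim=1).float().mean().item()

# Example with random embeddings standing in for encoder outputs:
images, recipes = torch.randn(1000, 512), torch.randn(1000, 512)
print(f"Recall@10: {recall_at_k(images, recipes):.3f}")  # ~0.01 for random vectors
```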
We show some qualitative results in the figure below, which reveal that our method is able to encode semantics in image and recipe representations and can find recipes that match the query at a fine-grained ingredient level (e.g., “bread”, “garlic”, and “loaf” in row one, or “salmon” and “asparagus” in row six).
Check out our paper to learn the details. Our code and model weights are also publicly available.