In an ideal world, finding a particular section of a video would be as simple as describing it in natural language — saying, for instance, “the person pours ingredients into a mixer”.
At this year’s meeting of the ACM Special Interest Group on Information Retrieval (SIGIR), my colleagues and I are presenting a new method for doing such natural-language-guided video moment retrieval (VMR).
Our method dispenses with the complex iterated message-passing procedure adopted by some of its predecessors, which reduces training time; in one of our experiments, training our model took one-third as long as training the prior state-of-the-art model on the same data and hardware. At the same time, our model outperforms its predecessors, with relative improvements of up to 11% on the relevant metrics and datasets.
Our model has two chief novelties:
- Early fusion/cross-attention: Some prior models use “late fusion”, meaning that the video segments and the query are embedded in a representational space independently, and then the model selects the video segment nearest the query according to some metric distance. We instead use an early-fusion approach, in which the embeddings for the query and the video segments are determined in a cross-coordinated way. And where some prior early-fusion methods used iterated message passing to do cross-coordination, we use a much simpler cross-attention mechanism.
- Multitask training: We train our model on two tasks simultaneously. One is the identification of the start and stop points of a video segment; the other is the binary classification of each frame between those points as part of the segment or not. Annotator disagreement, meaning discrepancies in the start and stop times identified in the training data, can reduce model accuracy; the binary-classification task leverages the continuity of the annotated segment frames, which helps correct such imbalances in the training data (a sketch of the combined objective follows this list).
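To make the two-task setup concrete, here is a minimal PyTorch sketch of how such a combined objective might look. The tensor shapes, the choice of cross-entropy and binary cross-entropy, and the `frame_weight` balancing term are illustrative assumptions, not our model's exact formulation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(start_logits, end_logits, frame_logits,
                   start_target, end_target, frame_targets,
                   frame_weight=1.0):
    """Combine a span-boundary loss with a per-frame classification loss.

    start_logits, end_logits: (batch, num_frames) scores for each frame
        being the segment's start or stop point.
    frame_logits: (batch, num_frames) scores for each frame belonging
        to the segment.
    start_target, end_target: (batch,) indices of the annotated boundaries.
    frame_targets: (batch, num_frames) binary labels (1 = inside segment).
    frame_weight: illustrative weighting between the two tasks (assumption).
    """
    # Task 1: identify the start and stop points of the segment.
    span_loss = (F.cross_entropy(start_logits, start_target)
                 + F.cross_entropy(end_logits, end_target))

    # Task 2: classify every frame as inside or outside the segment.
    frame_loss = F.binary_cross_entropy_with_logits(frame_logits,
                                                    frame_targets.float())

    return span_loss + frame_weight * frame_loss

# Example with random data: a batch of 2 videos, 16 frames each.
B, T = 2, 16
loss = multitask_loss(torch.randn(B, T), torch.randn(B, T), torch.randn(B, T),
                      torch.randint(0, T, (B,)), torch.randint(0, T, (B,)),
                      torch.randint(0, 2, (B, T)))
```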
Cross-attention
In the past, natural-language VMR models have represented both query texts and sequences of video frames as graphs. The models work out correspondences between the words of a text and the frames of a sequence through a message-passing scheme, in which each node of the text graph sends messages to multiple nodes of the video graph, and vice versa. The model refines its embeddings of the query and frames based on correspondences that emerge over several rounds of message passing.
In our model, by contrast, we first encode both the query and a candidate video segment, then use a cross-interaction multi-head attention mechanism to identify which features of the query encoding are most relevant to the video encoding, and vice versa.
On the basis of that cross-interaction, the model outputs a video embedding that factors in aspects of the query and a query embedding that factors in aspects of the video. Those embeddings are concatenated to produce a single fused embedding, which passes to two separate classifiers. One classifier identifies start/stop points, and the other classifies video frames as part of the relevant segment or not.
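To illustrate the data flow, here is a minimal PyTorch sketch of the cross-interaction and fusion steps described above. The use of `nn.MultiheadAttention`, the mean-pooling of the query side, the embedding dimensions, and the two linear heads are simplifying assumptions for the sake of the example, not our model's exact architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch of the cross-interaction step: the video representation
    attends to the query, the query representation attends to the video,
    and the two results are fused and passed to two classifiers."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Cross-attention in both directions.
        self.video_to_query = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Head 1: per-frame start/stop scores (illustrative formulation).
        self.boundary_head = nn.Linear(2 * dim, 2)
        # Head 2: per-frame in-segment / out-of-segment score.
        self.frame_head = nn.Linear(2 * dim, 1)

    def forward(self, video_feats, query_feats):
        # video_feats: (batch, num_frames, dim) frame encodings
        # query_feats: (batch, num_words, dim) word encodings
        video_ctx, _ = self.video_to_query(video_feats, query_feats, query_feats)
        query_ctx, _ = self.query_to_video(query_feats, video_feats, video_feats)

        # Pool the query side, broadcast it over the frames, and
        # concatenate to form a single fused embedding per frame.
        pooled_query = query_ctx.mean(dim=1, keepdim=True)
        fused = torch.cat(
            [video_ctx, pooled_query.expand_as(video_ctx)], dim=-1)

        boundary_logits = self.boundary_head(fused)         # (batch, frames, 2)
        frame_logits = self.frame_head(fused).squeeze(-1)   # (batch, frames)
        return boundary_logits, frame_logits
```

In this sketch, `boundary_logits` would feed the start/stop classifier and `frame_logits` the per-frame classifier, mirroring the two heads and the two-task objective sketched earlier.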
To test our approach, we used two benchmark datasets, both of which contain videos in which some frames have been annotated with descriptive texts. We compared our method to five prior models, three of which have achieved state-of-the-art results.
We evaluated the models’ performance using intersection over union (IoU), the ratio of the number of correctly labeled video segment frames to the total number of frames labeled as belonging to the segment either by the model or in the dataset. A retrieval counted as correct if its IoU met a given threshold; we experimented with three thresholds: 0.3, 0.5, and 0.7.
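For concreteness, here is a small Python sketch of temporal IoU computed from segment endpoints, together with the thresholding step; the endpoints in the example are hypothetical values chosen for illustration.

```python
def temporal_iou(pred_start, pred_end, true_start, true_end):
    """IoU between a predicted and an annotated segment, with boundaries
    given as frame indices (or timestamps)."""
    intersection = max(0.0, min(pred_end, true_end) - max(pred_start, true_start))
    union = (pred_end - pred_start) + (true_end - true_start) - intersection
    return intersection / union if union > 0 else 0.0

# A retrieval counts as correct if its IoU clears the chosen threshold.
for threshold in (0.3, 0.5, 0.7):
    print(threshold, temporal_iou(10, 40, 20, 50) >= threshold)
# IoU = 20 / 40 = 0.5, so this prediction passes at 0.3 and 0.5 but not 0.7.
```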
Across six experiments — two datasets and three IoU thresholds — our approach outperformed all of its predecessors five times. In the sixth case, one prior model had a slight edge (a 1% relative improvement). But that same model fell 37% short of ours on the experiment in which our model showed the biggest gains.