Product retrieval systems, like the one in the Amazon Store, often use the text of product reviews to improve the results of queries. But such systems can be misled by counterfactual statements, which describe events that did not or cannot take place.
For example, consider the counterfactual statement “I would have bought this shirt if it were available in red”. That sentence contains the phrase “available in red”, which a naïve product retrieval system might take as evidence that, indeed, the shirt is available in red.
Counterfactual statements in reviews are rare, but they can lead to frustrating experiences for customers — as when, for instance, a search for “red shirt” pulls up a product whose reviews make clear that it is not available in red. To ease that frustration, we have publicly released a new dataset for training machine learning models to recognize counterfactual statements.
In a paper we presented at the Conference on Empirical Methods in Natural Language Processing (EMNLP), we explain how we assembled the dataset. We also describe the results of experiments to determine what types of machine learning models yield the best results when trained on our dataset.
Dataset construction
At the time we started this project, there were no large-scale datasets that covered counterfactual statements in product reviews in multiple languages. We decided to annotate sentences selected from product reviews in three languages: English, German, and Japanese.
Sentences that express counterfactuals are rare in natural-language texts — only 1-2% of sentences, according to one study. Therefore, simply annotating a randomly selected set of sentences would yield a highly imbalanced dataset with a sparse training signal.
Counterfactual statements can be broken into two parts: the hypothetical condition (if it were available in red), referred to as the antecedent, and its consequence (I would have bought this shirt), referred to as the consequent.
To identify counterfactual statements, we specified relationships that must hold between antecedent and consequent in the presence of particular clue words. For instance, in the sentence “If everyone got along, it would be more enjoyable,” the consequent follows the antecedent and contains a modal verb, while the antecedent consists of a conditional conjunction followed by a past-tense verb.
With the help of professional linguists for all the languages under consideration, we compiled a set of such specifications, covering conjunctive normal sentences, conjunctive converse sentences, modal propositional sentences, and sentences with clue words like “wished” and “hoped”.
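The full specifications are given in the paper, but a simplified sketch conveys the idea. The Python snippet below implements clue-based candidate selection with illustrative regular-expression patterns; these patterns are stand-ins for exposition, not the specifications compiled for the dataset:

import re

# Illustrative English clue patterns, loosely modeled on the kinds of
# specifications described above; the real specifications were compiled
# with professional linguists and are more elaborate.
CLUE_PATTERNS = [
    # Conditional conjunction followed by a past-tense verb form,
    # e.g., "if it were available in red"
    re.compile(r"\bif\b.*\b(were|was|had|got)\b", re.IGNORECASE),
    # Modal verb in the consequent, e.g., "it would be more enjoyable"
    re.compile(r"\b(would|could|should|might)\s+(have|be)\b", re.IGNORECASE),
    # Wish-type clue words, e.g., "I wish it came in red"
    re.compile(r"\b(wish(ed)?|hoped|if only)\b", re.IGNORECASE),
]

def is_candidate(sentence: str) -> bool:
    """Return True if the sentence matches any counterfactual clue pattern."""
    return any(p.search(sentence) for p in CLUE_PATTERNS)

print(is_candidate("I would have bought this shirt if it were available in red."))  # True
print(is_candidate("This shirt fits perfectly."))                                   # False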
However, not all sentences that contain counterfactual clues express counterfactuals. For example, in the sentence “My wish came true when I got the iPhone for my birthday”, the counterfactual clue “wish” does not indicate a counterfactual condition, because the speaker truly received the iPhone. So professional linguists also reviewed the selected sentences to determine whether they truly expressed counterfactuals.
Selecting sentences based on precompiled clue word lists could, however, bias the data. So we also selected sentences that do not contain clue words but are highly similar to sentences that do. As a measure of similarity, we used proximity of sentence embeddings — vector representations of the sentences — computed by a pretrained BERT model.
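The paper does not prescribe a particular implementation, but mean-pooling BERT’s token embeddings is one common way to obtain sentence embeddings. Here is a minimal sketch using the Hugging Face transformers library, with cosine similarity as the proximity measure:

import torch
from transformers import AutoTokenizer, AutoModel

# Mean-pooled BERT token embeddings as sentence embeddings: one common
# recipe, used here for illustration (the paper's exact setup may differ).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state          # (batch, tokens, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)           # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)             # mean over real tokens

clued   = embed(["I would have bought this shirt if it were available in red."])
unclued = embed(["Too bad this shirt only comes in blue."])

# Sentences without clue words whose embeddings scored highly against
# clued sentences were also selected for annotation.
print(torch.nn.functional.cosine_similarity(clued, unclued).item())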
Baseline models
Counterfactual detection can be modeled as a binary classification task: given a sentence, classify it as positive if it expresses a counterfactual statement and negative otherwise.
We experimented with different methods for representing sentences, such as bag-of-words representations, static word-embedding-based representations, and contextualized word-embedding-based representations.
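As an illustration of the simplest of these combinations, a bag-of-words representation fed to a logistic regression classifier, here is a minimal scikit-learn sketch; the sentences and labels are toy placeholders for the annotated dataset:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data; in practice, the annotated dataset described above
# supplies the sentences and binary counterfactual labels.
sentences = [
    "I would have bought this shirt if it were available in red.",
    "If only this came in a larger size, I would have kept it.",
    "Great shirt, fits perfectly.",
    "The color is exactly as pictured.",
]
labels = [1, 1, 0, 0]  # 1 = counterfactual, 0 = not

# Bag-of-words features plus logistic regression: one of the simpler baselines.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(sentences, labels)

print(clf.predict(["I wish it were available in red."]))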
We also evaluated different classification algorithms, ranging from logistic regression and support vector machines to multilayer perceptrons. We found that a cross-lingual language model based on RoBERTa (XLM-RoBERTa), fine-tuned on the counterfactually annotated sentences, performed best overall.
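Fine-tuning XLM-RoBERTa for this binary classification task can follow one standard Hugging Face recipe, sketched below with toy data and illustrative hyperparameters, which are not necessarily the paper’s exact settings:

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Fine-tune XLM-RoBERTa as a binary counterfactual classifier.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)

# Toy stand-in for the annotated review sentences.
train = Dataset.from_dict({
    "text": ["I would have bought this shirt if it were available in red.",
             "Great shirt, fits perfectly."],
    "label": [1, 0],  # 1 = counterfactual, 0 = not
})
train = train.map(lambda b: tokenizer(b["text"], truncation=True,
                                      padding="max_length", max_length=64),
                  batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-counterfactual",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train,
)
trainer.train()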
To study the relationship between our dataset and existing datasets, we trained a counterfactual detection model on our dataset and evaluated it on the public dataset for a counterfactual-detection competition, which contains counterfactual statements from news articles. Models trained on our dataset performed poorly on the competition dataset, indicating that the counterfactual statements in product reviews — the focus of our dataset — are significantly different from those in news articles.
Given that our dataset covers counterfactual statements not only in English but also in German and Japanese, we were also interested in how well a counterfactual detection model trained on one language transfers to another. As a simple baseline, we first trained a model on English training data and then applied it to German and Japanese test data that had been translated into English by a machine translation system. This baseline performed poorly, however, indicating that counterfactuals are highly language-specific and that more-principled approaches will be needed for cross-lingual transfer.
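In code, this translate-test baseline amounts to only a few lines. In the sketch below, translate_to_english is a placeholder for whatever machine translation system is used, and english_model stands for any classifier trained on the English data, such as the pipeline above:

def translate_test_baseline(english_model, target_sentences, translate_to_english):
    """Translate-test baseline: translate target-language test sentences
    into English, then apply the English-trained counterfactual classifier.

    `translate_to_english` is a hypothetical stand-in for any machine
    translation system (a commercial API or an open-source MT model).
    """
    translated = [translate_to_english(s) for s in target_sentences]
    return english_model.predict(translated)

# Usage sketch (all components hypothetical):
# preds = translate_test_baseline(clf, japanese_test_sentences, mt_system.translate)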
In ongoing work, we are investigating filtering for other types of linguistic constructions besides counterfactuals and extending our detection models to additional languages.