Improving search for new product categories via synthetic query generation strategies
2024
Efficient retrieval and ranking of relevant products in e-commerce product search relies on accurate mapping of queries to product categories. This query classification typically uses a combination of textual and customer behavioral signals. However, new product categories often lack customer interaction data, which leads to poor performance. In this paper, we present a novel approach to mitigate this cold-start problem in product ranking via synthetic generation of queries and simulation of customer interactions. Specifically, we study two strategies for synthetic data generation: (i) fine-tuning a generative large language model (LLM) on historical product-query interactions and using it to generate synthetic queries from the product catalog, and (ii) Bayesian prompt optimization with an instruct-tuned LLM to generate queries directly from the catalog. Empirical evaluation of the proposed approaches on public datasets and real-world customer queries demonstrates significant benefits (+2.96% and +2.34% in PR-AUC on e-commerce queries) relative to the baseline approach without synthetic data augmentation. Furthermore, evaluation of the augmented model on the live search page shows a substantial increase in highly relevant product results (+3.35%) and a reduction in irrelevant results (-3.07%).
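The augmentation idea can be illustrated with a minimal sketch, which is not the paper's implementation. It assumes a query classifier trained on (query, category) pairs and shows how catalog entries from a new category could contribute synthetic pairs. The function `template_query_generator` is a hypothetical stand-in for either LLM strategy, and the field names `title`, `brand`, and `category` are assumptions about the catalog schema.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]   # (query text, product category label)
Product = Dict[str, str]    # assumed catalog schema: title, brand, category


def template_query_generator(product: Product) -> List[str]:
    """Hypothetical stand-in for an LLM query generator.

    A real system would call a fine-tuned or prompted LLM here; this
    placeholder only derives simple queries from catalog fields.
    """
    title = product["title"].lower()
    brand = product.get("brand", "").lower()
    queries = [title]
    if brand:
        queries.append(f"{brand} {title}")
    return queries


def augment_training_data(
    real_examples: List[Example],
    catalog: List[Product],
    generate: Callable[[Product], List[str]],
) -> List[Example]:
    """Append synthetic (query, category) pairs generated from the catalog.

    Real examples are kept first; duplicates are skipped so a synthetic
    pair never repeats an existing example.
    """
    augmented = list(real_examples)
    seen = set(real_examples)
    for product in catalog:
        for query in generate(product):
            example = (query, product["category"])
            if example not in seen:
                seen.add(example)
                augmented.append(example)
    return augmented


# Usage: one established category with real data, one new category
# that has catalog entries but no customer queries yet.
real = [("running shoes", "footwear")]
new_catalog = [{"title": "Smart Ring", "brand": "Acme", "category": "wearables"}]
training_set = augment_training_data(real, new_catalog, template_query_generator)
```

Passing the generator as a parameter keeps the augmentation step independent of how queries are produced, so either generation strategy could be swapped in without changing the rest of the pipeline.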