Author: Denis Avetisyan
A new approach leverages the power of artificial intelligence to understand and respond to complex, niche product searches in online retail.

This review details a data synthesis framework using large language models and query rewriting to improve relevance ranking for long-tail, knowledge-intensive queries in e-commerce information retrieval.
Despite advancements in e-commerce search, identifying relevant products for nuanced, long-tail queries (those requiring specific knowledge) remains a persistent challenge. This paper, ‘Synthetic Data Powers Product Retrieval for Long-tail Knowledge-Intensive Queries in E-commerce Search’, introduces a novel data synthesis framework that leverages large language models to address this limitation. By distilling the capabilities of a powerful query-rewriting model into training data, the approach significantly improves retrieval performance without requiring complex architectural changes. Could this method unlock a new paradigm for building robust and adaptable e-commerce search experiences, particularly for specialized product domains?
Deciphering Intent: The Challenge of Nuanced Search
Conventional search algorithms excel at retrieving information matching common keywords, but falter when confronted with nuanced, knowledge-demanding inquiries. These systems are largely trained on frequently occurring search patterns, creating a bias toward popular topics and straightforward requests. Consequently, queries requiring complex reasoning, contextual understanding, or specialized knowledge – those venturing beyond typical search behavior – often yield unsatisfactory results. This limitation stems from the reliance on statistical correlations within massive datasets, rather than a genuine comprehension of the underlying concepts. The systems struggle to synthesize information or infer meaning when faced with questions that deviate from established patterns, highlighting a critical gap in their ability to handle the full spectrum of human information needs.
The difficulty in satisfying nuanced information needs often stems from a lack of representative data for what are termed ‘long-tail’ queries. These are the highly specific, less frequent searches – requests for “vegan leather jackets without polyurethane” or “affordable noise-canceling headphones under $100” – that collectively represent a significant portion of all searches. Machine learning models, typically trained on massive datasets of common queries, struggle with these less-observed requests because the algorithms haven’t ‘seen’ enough examples to reliably discern the user’s intent. This data scarcity means the model may misinterpret the query, provide irrelevant results, or fail to adequately consider the negative constraints or price limitations specified, ultimately hindering the user experience and limiting access to specialized information.
The ability of search engines to effectively handle nuanced, specific queries – those extending beyond common searches – directly impacts user experience and inclusivity. When a system falters with complex requests, such as identifying a product with specific features and a price limit, or finding solutions that exclude certain elements, user frustration increases and valuable information remains inaccessible. Successfully navigating these ‘long-tail’ searches isn’t merely about refining algorithms; it’s about democratizing access to knowledge, ensuring individuals with unique needs or highly specific questions can readily find relevant results. This broadened accessibility fosters greater user satisfaction and transforms search engines from simple information retrieval tools into powerful engines for discovery and problem-solving, benefiting a wider range of users and use cases.
Bridging the Data Gap: A Framework for Synthetic Query Generation
The Data Synthesis Framework addresses the scarcity of labeled data for long-tail and knowledge-intensive queries by programmatically generating synthetic training examples. This is achieved through a pipeline that constructs queries requiring specific knowledge or exhibiting infrequent phrasing. The framework leverages knowledge sources and query transformation techniques to create variations of existing queries and generate novel, yet relevant, examples. This synthetic data augmentation significantly expands the training dataset, improving model performance on complex and less frequent user requests without relying solely on manually labeled examples, which are costly and time-consuming to acquire.
The Data Synthesis Framework extends training data coverage by supporting the generation of synthetic examples for diverse query types beyond standard information retrieval. Specifically, the framework accommodates question-answering queries requiring precise answers, negative constraint queries which specify undesired attributes, and affordable alternative searches that identify lower-cost options meeting user needs. This multi-query-type support allows for the creation of a more robust training dataset, improving model performance across a wider range of user intentions and search scenarios. The system is designed to generate varied synthetic data for each query type, addressing potential biases and increasing generalization capability.
The data synthesis framework generates multiple query rewrites for a single user input, addressing the ambiguity inherent in natural language. This is achieved through techniques like paraphrasing, lexical substitution, and the introduction of synonymous phrases, resulting in a set of candidate queries that represent diverse interpretations of the original user intent. The framework then utilizes these multi-candidate rewrites to expand the training dataset, improving model robustness and performance on variations in query phrasing and semantic understanding. This process is crucial for long-tail queries where limited training examples exist for specific phrasings, and for knowledge-intensive queries requiring nuanced interpretation.
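The multi-candidate idea can be illustrated with a minimal sketch of lexical substitution, one of the techniques named above. The synonym table, example query, and candidate cap are illustrative stand-ins; the actual framework uses a large language model to produce its rewrites.

```python
# Minimal sketch: generate several distinct query rewrites by swapping
# one term at a time against a (toy) synonym table. This illustrates
# lexical substitution only; the paper's framework is LLM-driven.
SYNONYMS = {
    "cheap": ["affordable", "budget", "low-cost"],
    "headphones": ["earphones", "headsets"],
}

def rewrite_candidates(query: str, max_candidates: int = 3) -> list[str]:
    """Produce up to max_candidates rewrites, each differing by one token."""
    tokens = query.split()
    candidates = []
    for i, tok in enumerate(tokens):
        for alt in SYNONYMS.get(tok, []):
            candidates.append(" ".join(tokens[:i] + [alt] + tokens[i + 1:]))
            if len(candidates) == max_candidates:
                return candidates
    return candidates

print(rewrite_candidates("cheap noise-canceling headphones"))
```

Each candidate preserves the query's structure while varying its surface form, which is the property the training-data expansion relies on.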

Sculpting Rewrites: A Multi-Faceted Reward System for Optimization
The query rewriting model is optimized using the REINFORCE++ algorithm, a policy gradient method, to maximize a composite reward function. This function is designed to balance competing objectives during training, allowing the model to learn rewrites that perform well across multiple dimensions. REINFORCE++ enables direct optimization of the expected reward by estimating the gradient of the reward with respect to the model’s parameters. The algorithm iteratively adjusts these parameters to increase the probability of generating high-reward rewrites, effectively shaping the model’s behavior towards desired outcomes as defined by the reward components.
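The core policy-gradient loop can be sketched with a toy softmax policy over a fixed set of candidate rewrites. The rewards, learning rate, and three-candidate setup are illustrative assumptions; REINFORCE++ itself adds refinements (such as improved baselines and clipping) that this vanilla sketch does not model.

```python
import math
import random

# Toy REINFORCE-style update: a softmax policy over three candidate
# rewrites, each with a fixed composite reward (assumed values). The
# update increases the probability of high-reward rewrites.
random.seed(0)

logits = [0.0, 0.0, 0.0]   # one logit per candidate rewrite
reward = [0.2, 1.0, 0.1]   # illustrative composite reward per candidate
lr = 0.5

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

for step in range(200):
    probs = softmax(logits)
    a = random.choices(range(3), weights=probs)[0]    # sample a rewrite
    baseline = sum(p * r for p, r in zip(probs, reward))
    adv = reward[a] - baseline                        # advantage estimate
    # d/d logit_i of log pi(a) is (1[i == a] - probs[i])
    for i in range(3):
        logits[i] += lr * adv * ((1.0 if i == a else 0.0) - probs[i])

print(softmax(logits))  # mass concentrates on the highest-reward rewrite
```

The same mechanism, scaled up to a language model's token-level policy, is what shapes the rewriter toward high-reward outputs.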
The reward function driving query rewrite optimization incorporates both Query Semantic Relevance (QSR) and Product-side Distribution Alignment (PDA). QSR assesses the semantic similarity between the original query and the rewritten query, ensuring that the rewrite maintains the user’s intended meaning and avoids introducing irrelevant concepts. PDA, conversely, measures the alignment between the generated query and the language used in product descriptions; this component encourages rewrites that utilize terminology commonly associated with relevant products, thereby increasing the likelihood of successful product retrieval. Both QSR and PDA are calculated using embedding models, and their weighted sum constitutes a significant portion of the overall reward signal, guiding the model toward generating rewrites that are both semantically accurate and product-focused.
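A minimal version of this composite reward can be written as a weighted sum of two cosine similarities over embeddings. The vectors and the equal 0.5/0.5 weighting below are illustrative assumptions, not values from the paper.

```python
import math

# Sketch of the composite reward: QSR keeps the rewrite close to the
# original query; PDA pulls it toward product-side language. Vectors
# and weights are toy values.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def composite_reward(rewrite_vec, query_vec, product_vec,
                     w_qsr=0.5, w_pda=0.5):
    qsr = cosine(rewrite_vec, query_vec)     # semantic fidelity to the query
    pda = cosine(rewrite_vec, product_vec)   # alignment with product language
    return w_qsr * qsr + w_pda * pda

q = [1.0, 0.0, 0.2]          # original query embedding
p = [0.8, 0.3, 0.0]          # relevant-product embedding
good = [0.9, 0.2, 0.1]       # rewrite close to both
off_topic = [0.0, 1.0, 0.0]  # rewrite that drifted from the query

print(composite_reward(good, q, p) > composite_reward(off_topic, q, p))
```

The joint reward penalizes rewrites that satisfy one objective at the cost of the other, which is the balance the training signal is designed to enforce.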
The Diversity Reward component within the query rewriting optimization framework is designed to promote the generation of multiple, distinct rewrites for a single query. This is achieved by penalizing rewrites that are highly similar to previously generated options, thereby encouraging the model to explore a wider range of potential reformulations. The underlying principle is that a diverse set of rewrites is more likely to comprehensively capture the breadth of user intent, accounting for different phrasing, levels of specificity, and potential information needs expressed through the original query. This approach moves beyond simply identifying a single “best” rewrite and instead aims to provide a spectrum of options, increasing the likelihood of a successful product retrieval.
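One simple way to realize such a penalty, sketched here with token-level Jaccard similarity rather than the paper's actual scoring, is to dock each new candidate by its maximum overlap with rewrites already kept. The example rewrites are illustrative.

```python
# Sketch of a diversity reward: a candidate rewrite is penalized by its
# maximum token-overlap (Jaccard) similarity to previously accepted
# rewrites. The similarity measure and examples are illustrative.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def diversity_reward(candidate: str, accepted: list[str]) -> float:
    if not accepted:
        return 1.0  # first rewrite is free
    return 1.0 - max(jaccard(candidate, prev) for prev in accepted)

kept = ["affordable wireless headphones"]
print(diversity_reward("affordable wireless headphones", kept))  # duplicate -> 0.0
print(diversity_reward("budget bluetooth headsets", kept))       # novel -> 1.0
```

Adding this term to the composite reward makes near-duplicate rewrites unprofitable, pushing the model to cover distinct interpretations of the query.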
Offline Product Retrieval streamlines the creation of query-product pairs crucial for both training and evaluating the query rewriting model. This process involves utilizing an existing product catalog and a set of representative queries to generate pairings without requiring live search queries or user interaction. Specifically, a pre-indexed product catalog is queried using the input queries, and resulting product matches are used to construct the training dataset. This approach significantly reduces the computational cost and latency associated with generating large-scale datasets, and ensures consistent, reproducible results for model evaluation. The method facilitates efficient iteration on the rewriting model by providing a readily available, static dataset for experimentation.
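The pair-mining step can be sketched against a toy catalog, using plain token overlap in place of the production system's pre-built index. The catalog, queries, and scoring function are all illustrative.

```python
# Sketch of offline query-product pair mining: score a static catalog
# against a batch of queries and keep the top match as a training pair.
# Token overlap stands in for the real indexed retrieval.
CATALOG = {
    "p1": "vegan leather jacket black",
    "p2": "wireless noise canceling headphones",
    "p3": "stainless steel water bottle",
}

def overlap(query: str, doc: str) -> int:
    return len(set(query.split()) & set(doc.split()))

def mine_pairs(queries):
    pairs = []
    for q in queries:
        best = max(CATALOG, key=lambda pid: overlap(q, CATALOG[pid]))
        pairs.append((q, best))
    return pairs

print(mine_pairs(["vegan leather jacket", "noise canceling headphones"]))
```

Because the catalog and queries are fixed, the same pairs come out on every run, which is what makes the resulting dataset reproducible for evaluation.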
Validating Performance: Metrics and Human-Centered Evaluation
Rigorous evaluation of retrieval performance necessitates a multifaceted approach, and thus, a suite of automated metrics is employed to comprehensively assess system efficacy. Query Goodrate@N quantifies the proportion of queries for which at least one relevant item appears within the top N results, providing insight into precision at various ranking cutoffs. Complementing this, Item Goodrate measures the fraction of retrieved items that are genuinely relevant, focusing on the quality of individual results. Further refinement comes from the GSB metric, which compares two systems side by side and tallies results judged Good, Same, or Bad relative to the baseline. By combining these metrics, each offering a distinct perspective on retrieval quality, a nuanced and data-driven understanding of system performance is achieved, enabling targeted improvements and optimization of the retrieval pipeline.
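The two goodrate definitions above translate directly into code. The following sketch computes both over toy relevance judgments, where `results` maps each query to its ranked list of `(item_id, is_relevant)` pairs.

```python
# Sketch of the two goodrate metrics on toy data. Query Goodrate@N:
# fraction of queries with at least one relevant item in the top N.
# Item Goodrate: fraction of all retrieved items that are relevant.
results = {
    "q1": [("a", True), ("b", False), ("c", True)],
    "q2": [("d", False), ("e", False), ("f", True)],
    "q3": [("g", False), ("h", False), ("i", False)],
}

def query_goodrate_at_n(results, n):
    hits = sum(any(rel for _, rel in ranked[:n])
               for ranked in results.values())
    return hits / len(results)

def item_goodrate(results):
    items = [rel for ranked in results.values() for _, rel in ranked]
    return sum(items) / len(items)

print(query_goodrate_at_n(results, 1))  # only q1 has a relevant top-1 result
print(query_goodrate_at_n(results, 3))  # q1 and q2 recover within the top 3
print(item_goodrate(results))           # 3 relevant out of 9 retrieved
```

The contrast between the two numbers is the point: a system can satisfy most queries somewhere in the ranking while still surfacing many irrelevant items, and vice versa.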
To validate the automated metrics used in evaluating retrieval performance, a Side-By-Side (SBS) evaluation process was implemented, relying on direct human assessment. In this method, trained annotators were presented with paired search results – one from a control system and one from the treatment system – for the same user query. Annotators then directly compared the relevance and quality of the two result sets, indicating which, if either, better addressed the user’s information need. This human-in-the-loop approach provides a crucial ground truth, enabling researchers to correlate automated metric scores with subjective human judgments of quality and refine the metrics to more accurately reflect real-world user experience. The SBS evaluation serves not just as a validation step, but also as a means of identifying nuanced aspects of retrieval performance that automated systems might overlook, ultimately driving improvements in search relevance and user satisfaction.
The retrieval pipeline’s core functionality relies on the synergistic operation of two key components: the Dense Retriever (Tbstars-3B) and the Query-Product Relevance Classifier (Tbstars-42B-A3B). The Dense Retriever efficiently identifies potentially relevant products from a vast catalog by embedding both queries and products into a shared vector space, allowing for rapid similarity comparisons. This initial retrieval stage is then refined by the Query-Product Relevance Classifier, which assesses the relevance of each retrieved product to the original query with greater precision. By learning complex relationships between query terms and product attributes, the classifier filters out irrelevant results, ensuring that users are presented with a focused and highly relevant selection of items. This combined approach significantly enhances the effectiveness of the retrieval process, particularly for complex, knowledge-intensive queries where semantic understanding is crucial.
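The retrieve-then-classify cascade can be sketched in a few lines. The toy embeddings, the similarity threshold, and the stub classifier below are assumptions standing in for the Tbstars retriever and relevance classifier; only the two-stage shape matches the pipeline described above.

```python
import math

# Sketch of the two-stage pipeline: a fast similarity retriever proposes
# top-k candidates, then a (stubbed) relevance classifier filters them.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

PRODUCT_VECS = {"p1": [1.0, 0.1], "p2": [0.6, 0.6], "p3": [0.0, 1.0]}

def retrieve(query_vec, k=2):
    """Stage 1: rank the catalog by embedding similarity, keep top k."""
    ranked = sorted(PRODUCT_VECS,
                    key=lambda p: cosine(query_vec, PRODUCT_VECS[p]),
                    reverse=True)
    return ranked[:k]

def classify_relevant(query_vec, pid, threshold=0.9):
    """Stage 2 stub: a stricter per-pair relevance check."""
    return cosine(query_vec, PRODUCT_VECS[pid]) >= threshold

def search(query_vec, k=2):
    return [p for p in retrieve(query_vec, k)
            if classify_relevant(query_vec, p)]

print(search([1.0, 0.0]))  # p2 survives retrieval but is filtered out
```

The division of labor is the design choice worth noting: the cheap first stage bounds how many candidates the expensive second stage must judge.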
A critical component of the retrieval pipeline’s enhanced precision lies within the Query-Rewrite Relevance Classifier, powered by the Qwen3-30B-A3B model. This classifier actively filters out irrelevant or poorly formed query rewrites before they reach the retrieval stage. By discerning and discarding these suboptimal rewrites, the system concentrates on higher-quality reformulations of the original query. This selective process minimizes noise and ensures that the retrieval model receives focused and pertinent search requests, ultimately leading to more accurate and relevant results, particularly for complex or nuanced information needs.
A newly implemented data synthesis framework yielded substantial gains in query relevance, as evidenced by an 8.62 percentage point improvement in query goodrate during online A/B testing on the Taobao platform. This enhancement is particularly impactful for long-tail and knowledge-intensive queries, which often present unique challenges for retrieval systems due to their specificity and informational complexity. The observed increase suggests the framework effectively addresses these challenges, delivering more pertinent search results to users and improving the overall search experience.
Analysis of A/B testing on the Taobao platform revealed substantial gains in retrieval relevance, particularly for challenging query types. The data synthesis framework facilitated a 6.97 percentage point increase in Item Goodrate for Negative queries – those where users explicitly state what they don’t want – indicating improved filtering of undesired products. Concurrently, an 8.62 percentage point improvement was registered in Query Goodrate for Alternative queries, reflecting a greater ability to understand user intent when phrasing is varied or ambiguous. These improvements suggest the framework effectively addresses nuanced search requests, enhancing the overall user experience by delivering more relevant results even with complex or indirectly expressed needs.
Rigorous offline evaluation revealed substantial gains in retrieval relevance following implementation of the data synthesis framework. Specifically, the system demonstrated a marked +12.95 improvement in Item Goodrate when processing Negative queries – those that explicitly specify attributes the user wants excluded – suggesting a heightened ability to honor such constraints even with ambiguous phrasing. Furthermore, a +10.73 improvement in Item Goodrate was observed for Knowledge queries, indicating enhanced performance in retrieving information-rich results for complex, fact-seeking requests. These offline results strongly suggest the system’s capacity to address challenging query types, providing a foundation for improved user experience and more effective information access.
Towards Robust Search: Generalization and Future Directions
The development of a framework capable of synthesizing data for a wide spectrum of queries – extending beyond simple keyword matches to encompass nuanced general knowledge questions – represents a significant step toward more resilient search technology. This capacity to generate varied training data allows search engines to better understand the intent behind user queries, even those phrased in uncommon ways or seeking complex information. Consequently, the system demonstrates improved performance across a broader range of search requests, reducing the impact of ambiguous or poorly worded queries. By proactively addressing the diversity of language used in information seeking, this approach promises search experiences that are not only more accurate but also more readily accessible and satisfying for a wider user base, ultimately leading to more comprehensive results.
Search engines traditionally struggle with “long-tail” queries – those highly specific, less frequent searches that collectively represent a significant portion of all user requests. This framework directly confronts this challenge by enhancing the understanding of these nuanced requests, moving beyond simple keyword matching to grasp the underlying intent. Improved long-tail understanding not only broadens search accessibility for users seeking very specific information, but also cultivates greater user satisfaction by delivering more relevant results, even for obscure or complex topics. Consequently, individuals previously encountering dead ends or irrelevant pages can now access precisely the information they need, fostering a more inclusive and effective search experience for everyone.
Continued development centers on enhancing the efficiency of data synthesis, a crucial step in building more adaptable search systems. Researchers are actively investigating methods to automate and accelerate this process, allowing for the creation of significantly larger and more varied datasets. Simultaneously, efforts are directed towards refining query rewriting techniques – intelligently reformulating user inputs to better match relevant information – and implementing sophisticated relevance ranking algorithms. These advanced algorithms will move beyond simple keyword matching to consider semantic meaning and contextual understanding, ultimately delivering more precise and satisfying search results, even for nuanced or ambiguous queries.
The current framework’s potential is significantly amplified by integrating more sophisticated knowledge sources and reasoning capabilities. Future iterations will move beyond simple data generation to incorporate structured knowledge graphs, common-sense reasoning engines, and even external databases. This expansion isn’t merely about accessing more information, but about enabling the system to synthesize knowledge, draw inferences, and understand the nuances of complex queries. By equipping the framework with these advanced cognitive tools, researchers anticipate a substantial improvement in its ability to handle ambiguous or multi-faceted search requests, ultimately leading to more accurate, relevant, and insightful results for users – effectively bridging the gap between information retrieval and true understanding.
The pursuit of enhanced information retrieval, as detailed in this work, mirrors a systemic approach to problem-solving. The framework’s reliance on data synthesis and query rewriting isn’t merely about boosting performance for long-tail queries; it’s about recognizing the interconnectedness of data and query formulation. As Paul Erdős once stated, “A mathematician knows a lot of things, but a physicist knows a few.” This sentiment, while directed at different disciplines, resonates with the paper’s core concept: a deep understanding of the underlying structure – in this case, the relationship between queries and relevant products – is paramount. The model’s ability to generate synthetic data, acting as a kind of structural reinforcement, highlights that even seemingly isolated improvements necessitate a holistic view of the system.
The Road Ahead
The presented framework, while demonstrating promise in addressing long-tail queries, subtly highlights a perennial tension in information retrieval. Each newly synthesized data point, each elegantly rewritten query, introduces a dependency – a commitment to the generative model’s internal representation of knowledge. This is not merely a technical detail, but a structural one. The system’s freedom from sparse training data is purchased with reliance on the model’s inherent biases and potential for drift. The question, then, isn’t solely about maximizing recall, but about understanding the emergent properties of this dependency network.
Future work must move beyond isolated performance metrics and focus on systemic robustness. Evaluating the framework’s resilience to adversarial query rewriting, or its behavior when confronted with evolving product catalogs, will prove critical. A compelling direction lies in exploring methods for ‘decoupling’ the retrieval system from the specifics of the generative model – perhaps through distillation techniques or the development of more interpretable synthetic signals.
Ultimately, the pursuit of knowledge-intensive retrieval necessitates a holistic view. The system is not simply a collection of algorithms, but a complex adaptive system. Improving one component without considering its impact on the whole offers only temporary gains. The true challenge lies in designing for long-term stability, not merely immediate relevance.
Original article: https://arxiv.org/pdf/2602.23620.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/