Shopping Smarter: AI Agents Trained with Synthetic Data Boost E-Commerce Research

Author: Denis Avetisyan


A new framework uses simulated shopping experiences to train artificial intelligence agents to conduct more effective product research online.

The ProductResearch framework simulates a closed-loop system in which a user agent constructs behavioral profiles to anticipate information needs and a research agent iteratively refines its queries and tool use, guided by a supervisory agent that provides real-time verification, before successful research pathways are distilled into optimized, single-role sequences for enhanced learning.

ProductResearch leverages multi-agent synthetic trajectory distillation to enhance the performance of deep research agents in e-commerce environments.

Despite advances in large language models, complex product research remains a challenge for e-commerce agents lacking sufficient interaction depth and contextual understanding. To address this, we introduce ‘ProductResearch: Training E-Commerce Deep Research Agents via Multi-Agent Synthetic Trajectory Distillation’, a novel framework that leverages multi-agent collaboration to generate high-fidelity synthetic data for training robust shopping assistants. Our approach demonstrates substantial improvements in response comprehensiveness, research depth, and user utility, approaching the performance of state-of-the-art proprietary systems. Can this paradigm of multi-agent synthetic trajectory training unlock a new era of scalable and effective LLM-powered e-commerce assistance?


The Illusion of Research: Why LLMs Still Miss the Point

Contemporary research agents, powered by large language models, frequently falter when confronted with investigations demanding sustained effort and comprehensive data collection. These systems often excel at answering direct questions but struggle with tasks requiring numerous sequential steps – formulating a research question, identifying relevant sources across diverse databases, extracting pertinent information, synthesizing findings, and ultimately, drawing well-supported conclusions. The limitations stem from an inability to maintain context over extended interactions and a reliance on pattern matching rather than genuine understanding, resulting in agents that can quickly become lost in a sea of information or pursue irrelevant tangents. This difficulty highlights a critical gap between current capabilities and the nuanced, iterative process characteristic of true research, where adaptability and persistent information seeking are paramount.

Simply increasing the size of existing language models, while demonstrating impressive feats of text generation, ultimately falls short of achieving genuine research capability. These models, trained primarily on pattern recognition within vast datasets, often struggle with the nuanced reasoning, iterative planning, and reliable execution required for complex investigations. A more structured approach is therefore necessary, one that moves beyond simply predicting the next word and instead focuses on building agents capable of formulating research questions, designing experiments – even in simulation – critically evaluating evidence, and adapting strategies based on findings. This demands integrating language models with tools for knowledge representation, symbolic reasoning, and robust action execution, effectively shifting the focus from scale to architecture in the pursuit of truly intelligent research agents.

True research extends far beyond simply finding relevant information; it demands a sophisticated interplay of cognitive skills. A successful research agent must be capable of reasoning about the information it gathers, formulating a coherent plan to address a specific question, and then reliably executing that plan, adapting as needed when encountering obstacles or unexpected data. This necessitates moving beyond pattern recognition – the strength of many large language models – towards systems that can actively synthesize knowledge, identify gaps in understanding, and iteratively refine their approach. The ability to not only retrieve information, but to critically evaluate, integrate, and ultimately apply it is the defining characteristic of genuine research capability, and a significant hurdle in the development of truly autonomous research agents.

Building a Framework: A Multi-Agent Approach to Research

The ProductResearch Framework is a multi-agent system constructed to facilitate the training of deep research agents for e-commerce product research tasks. This system deviates from single-agent approaches by employing multiple interacting agents, enabling a more complex and nuanced learning environment. The architecture comprises three distinct agents, a User Agent, a Research Agent, and a Supervisor Agent, each with defined roles and responsibilities. This multi-agent design is intended to improve the robustness and generalization capabilities of the trained agents, allowing them to perform effective product research in dynamic and unpredictable online marketplaces. The framework's novelty lies in its ability to generate a self-supervised learning signal through the interactions between these agents, bypassing the need for extensive human-labeled data.

The ProductResearch Framework utilizes a three-agent interaction to produce training trajectories for e-commerce research agents. A User Agent initiates research tasks with defined goals and constraints. The Research Agent then executes these tasks, employing search and data extraction techniques. Crucially, a Supervisor Agent monitors the Research Agent’s actions, providing feedback and corrective guidance based on pre-defined quality metrics and research best practices. This iterative process of action and supervision generates detailed, labeled datasets reflecting successful and unsuccessful research strategies, forming the basis for robust agent training.
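The loop the three agents run can be sketched roughly as follows. The agent interfaces, step format, and stopping rule here are illustrative assumptions for exposition, not the framework's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    action: str          # e.g. a search query or a report draft
    observation: str     # what the environment returned
    feedback: str        # the Supervisor Agent's step-level critique

@dataclass
class Trajectory:
    task: str
    steps: list = field(default_factory=list)

def run_episode(user_agent, research_agent, supervisor, max_steps=8):
    """One simulated research episode: task -> act -> supervise -> repeat."""
    task = user_agent.propose_task()          # goals and constraints
    traj = Trajectory(task=task)
    for _ in range(max_steps):
        action, observation = research_agent.act(task, traj.steps)
        feedback = supervisor.review(task, action, observation)
        traj.steps.append(Step(action, observation, feedback))
        if research_agent.done(traj.steps):   # agent decides it can report
            break
    return traj                                # becomes one training trajectory
```

Each completed episode, including the supervisor's per-step feedback, then becomes labeled training material.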

The ProductResearch Framework generates training data by emulating a full e-commerce research cycle, encompassing task initiation, information gathering, and result synthesis. This simulation allows for the creation of trajectories detailing agent interactions with online platforms, including search queries, website navigation, and data extraction from product listings and reviews. The resulting dataset includes both successful and unsuccessful research attempts, providing a diverse range of examples for training agents to handle various scenarios and complexities inherent in real-world e-commerce research. Data points include agent actions, observed states of the online environment, and associated rewards or penalties, enabling reinforcement learning and supervised learning approaches to complex skill development.
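A single data point in such a trajectory might look like the following. The field names, tool names in the payload, and reward values are illustrative assumptions, not the framework's actual schema:

```python
def make_datapoint(action, state, reward):
    """Bundle one (action, observed state, reward) triple as a training record."""
    return {"action": action, "state": state, "reward": reward}

trajectory = [
    # A productive search step earns a positive reward...
    make_datapoint({"tool": "product_search", "query": "noise-cancelling headphones"},
                   {"results_shown": 10},
                   reward=1.0),
    # ...while a dead-end query is penalized.
    make_datapoint({"tool": "web_search", "query": "headphone battery life reviews"},
                   {"results_shown": 0},
                   reward=-0.5),
]
```

Records of this shape support both reinforcement learning (via the rewards) and supervised learning (via the action/state pairs), as the text notes.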

Manufacturing Data: Synthetic Trajectories for Agent Training

The training of research agents within this framework is fundamentally dependent on the creation of synthetic trajectories. These trajectories consist of sequential data representing agent actions and corresponding environmental observations, generated programmatically rather than through real-world interaction. Each trajectory simulates a complete research task, from initial query to final report, and serves as a training example. By generating a large volume of these synthetic sequences, the system can efficiently expose the agent to diverse scenarios and accelerate the learning process, effectively bypassing the limitations and costs associated with gathering extensive real-world data. The variability within these generated trajectories is controlled by parameters defining task complexity, data sources, and agent behavior, allowing for targeted training and performance optimization.

Reflective Internalization is a data synthesis technique that transforms complex, multi-turn dialogues between a Supervisor and Research Agent into simplified, single-role training examples. This process involves distilling the essence of the supervisory feedback – typically provided at each step of a research task – into a condensed format suitable for direct agent training. By reducing lengthy interactions into focused, single-turn examples, the technique significantly improves training efficiency and reduces the computational resources required to achieve comparable performance levels. This streamlined approach allows the agent to learn from the supervisory signals without needing to process the full conversational history, effectively accelerating the learning process and increasing data throughput.
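Under some assumptions about the dialogue format (alternating researcher/supervisor turns, with an explicit approval flag), the distillation step might be sketched like this. The turn schema and the keep-only-approved-actions rule are assumptions for illustration:

```python
def internalize(dialogue):
    """Collapse a multi-turn researcher/supervisor dialogue into single-turn
    (context -> action) examples, keeping only supervisor-approved actions,
    so the trained agent never has to process the critique itself."""
    examples, history = [], []
    pending = None
    for turn in dialogue:
        if turn["role"] == "researcher":
            pending = turn["content"]                 # latest proposed action
        elif turn["role"] == "supervisor" and pending is not None:
            if turn.get("approved", False):
                # Snapshot the history so each example is a single-role pair.
                examples.append({"context": list(history), "action": pending})
                history.append(pending)
            pending = None                            # rejected actions are dropped
    return examples
```

The point of the transformation is that each output example stands alone: the agent learns the corrected behavior without seeing the conversational back-and-forth that produced it.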

The Research Agent operates by executing a predefined plan to fulfill research queries, primarily through the utilization of external tools and subsequent report creation. This process involves initiating Tool Calls, specifically Web Search and Product Search, to gather relevant information from external sources. The results of these searches are then processed and synthesized into a coherent report, which serves as the final output addressing the initial research query. The agent’s plan dictates the sequence and parameters of these Tool Calls, ensuring a focused and efficient information-gathering process before report generation.
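A minimal sketch of that plan-driven loop, with stubbed backends standing in for the real Web Search and Product Search tools (the dispatch mechanism and report format are assumptions, not the paper's implementation):

```python
def web_search(query):
    return [f"web result for '{query}'"]          # stub backend

def product_search(query):
    return [f"product listing for '{query}'"]     # stub backend

TOOLS = {"web_search": web_search, "product_search": product_search}

def execute_plan(plan):
    """Run each planned tool call in order, then synthesize a report."""
    findings = []
    for call in plan:
        tool = TOOLS[call["tool"]]                # the plan fixes tool + parameters
        findings.extend(tool(call["query"]))
    return "Report:\n" + "\n".join(f"- {f}" for f in findings)
```

The plan fixes the sequence and parameters of the calls up front, which is what keeps information gathering focused before the report is drafted.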

The Supervisor Agent employs a defined Evaluation Rubric to provide granular, step-level feedback during the data generation process. This rubric details specific criteria for assessing the quality of each action taken by the Research Agent, focusing on factors like information accuracy, relevance to the research query, and adherence to established research methodologies. By evaluating each step individually, the Supervisor Agent facilitates the creation of high-fidelity training data and ensures that the Research Agent learns to consistently meet defined research standards. The rubric’s structured approach enables consistent and objective quality control, improving the reliability and validity of the synthetic data used for agent training.
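One way such a rubric could be operationalized is as a weighted score over the criteria named above. The three criteria come from the text; the weights, scale, and pass threshold are illustrative assumptions:

```python
# Per-criterion weights (assumed); ratings are expected in [0, 1].
RUBRIC = {
    "accuracy": 0.4,     # is the extracted information correct?
    "relevance": 0.4,    # does it address the research query?
    "methodology": 0.2,  # does the step follow the prescribed procedure?
}

def score_step(ratings):
    """Weighted rubric score in [0, 1] from per-criterion ratings."""
    return sum(RUBRIC[c] * ratings[c] for c in RUBRIC)

def passes(ratings, threshold=0.7):
    """Only steps above the threshold would be kept as positive training data."""
    return score_step(ratings) >= threshold
```

Scoring each step independently, rather than only the final report, is what lets the Supervisor Agent filter training data at the granularity of individual actions.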

RACE scores demonstrate that iteratively refining reports improves their answerability, indicating enhanced reasoning and comprehension capabilities.

Measuring Progress: Validating Agent Performance and Future Directions

Rigorous evaluation of the Deep Research Agent necessitated a standardized and objective metric, leading to the adoption of the RACE (Research Assessment by Comprehensive Evaluation) metric. This rubric-driven approach assesses research report quality across multiple dimensions, including clarity, completeness, relevance, and analytical rigor. By systematically comparing agent-generated reports against a predefined scoring system, researchers can quantify improvements and identify areas for refinement. The use of RACE moves beyond subjective assessments, providing a transparent and reproducible method for validating the agent’s performance and ensuring the quality of automated research outputs. This detailed evaluation framework is crucial for establishing trust and demonstrating the reliability of the Deep Research Agent in complex information-gathering tasks.

The implementation of the ProductResearch framework yielded a substantial performance increase in the Qwen3-30B-A3B model, as evidenced by a rise in the RACE score from 31.78 to 45.40. This improvement wasn’t limited to a single area; the framework demonstrably enhanced performance across all evaluation dimensions of the RACE metric, suggesting a holistic advancement in research report quality. This represents a significant leap toward automated research capabilities, indicating the framework’s efficacy in guiding the language model towards more comprehensive and well-structured outputs. The magnitude of this score increase validates the approach and highlights its potential for broader application in automating complex information gathering and analysis tasks.

A key indicator of the Deep Research Agent’s success lies in its dramatically improved ability to comprehensively cover relevant product information, as evidenced by a more than threefold increase in the Effective Product Coverage score. Initial evaluations revealed a baseline score of 3.58, representing limited product understanding; however, implementation of the ProductResearch framework propelled this metric to 12.45. This substantial increase suggests the agent not only identifies products but also synthesizes a far richer and more nuanced understanding of their features, specifications, and competitive landscape, ultimately leading to more informed and valuable research outputs. The improvement highlights the framework’s efficacy in expanding beyond superficial data gathering towards a genuinely comprehensive analysis of the target subject matter.

The ProductResearch framework distinguishes itself through its capacity to autonomously construct and execute elaborate, multi-step research plans. Unlike conventional approaches that address research tasks in isolation, this framework excels at formulating strategies that extend far beyond immediate objectives – effectively ‘planning ahead’ to anticipate necessary data acquisition and analysis. This long-horizon planning capability enables the agent to navigate complex research landscapes with greater efficiency and thoroughness, synthesizing information from diverse sources over extended periods. The result is a demonstrable shift towards true research automation, moving beyond simple information retrieval to encompass the iterative process of hypothesis refinement and knowledge discovery – a feat previously requiring significant human oversight and strategic direction.

Continued development of the ProductResearch framework prioritizes broadening its applicability to increasingly intricate research areas, moving beyond initial domains to tackle challenges requiring deeper analytical capabilities. Simultaneously, a key focus lies in integrating a robust user feedback mechanism; this iterative process will allow for continuous refinement of the agent’s performance, ensuring alignment with evolving research needs and nuanced human judgment. By actively incorporating insights from researchers, the framework aims to not only automate tasks but also enhance the quality, relevance, and ultimately, the impact of generated research reports, fostering a collaborative synergy between artificial intelligence and human expertise.

RACE scores improve with increasing training context length, demonstrating the benefit of processing longer input sequences.

The pursuit of ever-more-sophisticated deep research agents, as outlined in this paper, feels…predictable. ProductResearch, with its multi-agent synthetic data generation, attempts to solve the data scarcity problem, but it’s merely shifting the complexity. One anticipates the inevitable moment when these synthetic trajectories require their own synthetic validation. As Grace Hopper observed, “It’s easier to ask forgiveness than it is to get permission.” This sentiment rings true; the rush to deploy these agents often precedes a thorough understanding of their limitations, and the resulting ‘technical debt’ will inevitably require another, equally ambitious, framework to address. The core idea of improving product investigation capabilities is laudable, of course, but it’s just another layer of abstraction built atop existing problems.

What Lies Ahead?

The promise of synthetic data for training e-commerce agents is, predictably, not a panacea. ProductResearch offers a compelling demonstration of multi-agent distillation, yet the inherent fragility of any generated dataset remains. The synthetic trajectories, however cleverly constructed, will inevitably diverge from the chaotic reality of production queries: the long tail of user intent is a relentless adversary. The current framework addresses a clear need, but it merely shifts the debugging burden; errors will no longer stem from agent exploration, but from flaws in the synthetic world itself.

Future work will undoubtedly focus on closing that gap: more sophisticated generative models, perhaps, or adaptive distillation techniques that incorporate real-world feedback. However, the fundamental problem persists: every abstraction dies in production. The elegance of a multi-agent system generating training data will be tested not by benchmark scores, but by the unpredictable ways users attempt to break it.

Ultimately, the success of this line of inquiry will be measured not by how well agents can simulate research, but by their ability to gracefully handle the inevitable failures of that simulation. It is a comfortable illusion to believe a perfectly trained agent is possible; the true challenge lies in building one that fails beautifully.


Original article: https://arxiv.org/pdf/2602.23716.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-02 15:56