Author: Denis Avetisyan
A new framework leverages artificial intelligence agents to automatically map product information, unlocking scalable knowledge extraction for online retail.

This paper details an AI agent-driven approach to automated product knowledge graph construction, eliminating the need for manual ontology creation in e-commerce applications.
The increasing volume of unstructured product data in e-commerce presents a paradox: while holding immense potential, it remains challenging to effectively leverage for improved information retrieval and data analytics. This paper introduces an ‘AI Agent-Driven Framework for Automated Product Knowledge Graph Construction in E-Commerce’ that addresses this challenge by automatically building structured product knowledge graphs directly from raw text. Our framework utilizes a novel AI agent architecture to achieve high property coverage and minimal redundancy without relying on predefined schemas or manual rule creation. Could this approach unlock a new era of intelligent product data integration and scalable knowledge extraction for the retail sector?
The Inevitable Fragmentation of Product Knowledge
E-commerce platforms routinely face a significant challenge in managing product data that exists in silos, leading to inconsistencies and incomplete information. This fragmentation arises from multiple data sources – manufacturers, suppliers, internal systems – each with varying formats and levels of detail. Consequently, search functionality often delivers irrelevant or inaccurate results, frustrating customers and diminishing sales. Furthermore, the ability to personalize shopping experiences is severely hampered; without a consolidated view of product attributes, features, and relationships, algorithms struggle to recommend relevant items or suggest appropriate cross-sells. The resulting inefficiencies extend beyond the customer-facing side, impacting inventory management, supply chain optimization, and ultimately, a company’s bottom line. Addressing this data fragmentation is therefore critical for any e-commerce business striving to deliver seamless and engaging customer journeys.
Conventional data integration techniques, often reliant on rigid schemas and manual mapping, struggle to keep pace with the dynamic nature of product information. These methods typically focus on syntactic matching – ensuring data fits a predefined structure – rather than understanding the underlying meaning. Consequently, crucial semantic relationships – such as a “red cotton t-shirt” being a type of “t-shirt,” made of “cotton,” and having the attribute “red” – are lost or poorly represented. This inability to capture nuanced connections hinders effective product discovery; a search for “casual summer tops” might overlook the aforementioned t-shirt simply because its data isn’t explicitly linked to those broader categories. The resulting fragmented knowledge base limits a platform’s ability to provide personalized recommendations or accurately answer complex customer queries, ultimately diminishing the overall user experience and stifling innovation.
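As a concrete illustration, the t-shirt example can be written out as explicit triples. The sketch below uses the RDFLib library with an invented `ex:` namespace and class hierarchy; it is not drawn from the paper, but it shows how an explicit hierarchy lets a broader query ("tops") reach the specific item.

```python
# A minimal sketch of the t-shirt example as explicit RDF triples,
# using a hypothetical ex: namespace (not taken from the paper).
from rdflib import Graph, Namespace, Literal, RDF, RDFS

EX = Namespace("http://example.org/product/")
g = Graph()
g.bind("ex", EX)

# Class hierarchy: a t-shirt is a kind of top, which is a garment.
g.add((EX.TShirt, RDFS.subClassOf, EX.Top))
g.add((EX.Top, RDFS.subClassOf, EX.Garment))

# The specific product and its attributes.
g.add((EX.sku123, RDF.type, EX.TShirt))
g.add((EX.sku123, EX.material, Literal("cotton")))
g.add((EX.sku123, EX.color, Literal("red")))

# A query over EX.Top can now reach the t-shirt through the hierarchy.
print(g.serialize(format="turtle"))
```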
Modern e-commerce operates on a scale previously unimaginable, with product catalogs often encompassing millions of items – a volume that decisively rules out manual data organization. Consequently, automated approaches to knowledge extraction and organization are no longer simply desirable, but essential for maintaining operational efficiency. These systems leverage techniques like natural language processing and machine learning to sift through product descriptions, specifications, and customer reviews, identifying key attributes and relationships. This process moves beyond simple keyword matching to understand semantic meaning, enabling the creation of a structured knowledge graph that accurately represents the product universe. Such automation not only accelerates catalog management but also unlocks opportunities for intelligent search, personalized recommendations, and the discovery of hidden product synergies, ultimately driving both customer satisfaction and business innovation.
The inability to synthesize a comprehensive understanding of product information represents a significant impediment to growth for modern businesses. When product data remains siloed and disconnected, organizations struggle to identify emerging trends, anticipate customer needs, and develop genuinely innovative offerings. This fragmented landscape not only limits the potential for novel product development but also hinders the delivery of personalized customer experiences – crucial in today’s competitive market. Consequently, businesses forfeit opportunities to optimize pricing strategies, refine marketing campaigns, and ultimately, cultivate stronger customer loyalty, as a holistic view of product attributes, relationships, and customer interactions remains elusive.
Constructing a Foundation: Product Ontologies and the Language of Machines
A standardized product ontology is fundamental for representing product information in a consistent and machine-readable format. Our current ontology comprises 42 distinct classes, defining the types of products within our system. These classes are further described by 69 properties, which detail specific characteristics. Of these properties, 20 are data attributes – representing factual values like price or weight – while the remaining 49 define object relationships, indicating how products connect to other entities or categories within the system. This structured approach facilitates automated reasoning, data integration, and improved search capabilities.
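To make this structure concrete, the following sketch declares a handful of classes, one data attribute, and one object relationship using RDFLib and OWL vocabulary. The names are hypothetical placeholders, not the paper's actual 42 classes and 69 properties.

```python
# Illustrative sketch of how classes, data attributes, and object
# relationships in such an ontology could be declared with rdflib.
# Product, Laptop, Brand, hasPrice, and hasBrand are invented names.
from rdflib import Graph, Namespace, RDF, RDFS, OWL, XSD

ONT = Namespace("http://example.org/ontology/")
g = Graph()
g.bind("ont", ONT)

# Classes (the paper's ontology defines 42 of these).
g.add((ONT.Product, RDF.type, OWL.Class))
g.add((ONT.Laptop, RDF.type, OWL.Class))
g.add((ONT.Laptop, RDFS.subClassOf, ONT.Product))
g.add((ONT.Brand, RDF.type, OWL.Class))

# Data attribute: a factual value such as price (20 such properties).
g.add((ONT.hasPrice, RDF.type, OWL.DatatypeProperty))
g.add((ONT.hasPrice, RDFS.domain, ONT.Product))
g.add((ONT.hasPrice, RDFS.range, XSD.decimal))

# Object relationship: a link to another entity (49 such properties).
g.add((ONT.hasBrand, RDF.type, OWL.ObjectProperty))
g.add((ONT.hasBrand, RDFS.domain, ONT.Product))
g.add((ONT.hasBrand, RDFS.range, ONT.Brand))

print(g.serialize(format="turtle"))
```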
Large Language Models (LLMs) facilitate automated ontology creation by processing unstructured text – such as product descriptions, specifications, and user reviews – to identify key entities, attributes, and relationships. This process involves semantic parsing to extract meaningful information, which is then used to populate the ontology schema. LLMs achieve this through techniques like named entity recognition, relation extraction, and topic modeling, effectively converting free-form text into a structured, machine-readable format suitable for knowledge representation. The automation significantly reduces the manual effort traditionally required for ontology development and maintenance, enabling rapid adaptation to evolving product catalogs and data sources.
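A minimal sketch of this extraction step follows. The prompt wording is illustrative, and `call_llm` is a hypothetical placeholder for whichever chat-completion client is used; the point is only that free-form text goes in and structured JSON comes out.

```python
# Hedged sketch of LLM-based extraction from a product description.
# call_llm is a hypothetical stand-in for an actual LLM API client;
# the prompt shape is illustrative, not the paper's.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call; expected to return a JSON string."""
    raise NotImplementedError("wire this to your LLM provider")

EXTRACTION_PROMPT = """Extract entities, attributes, and relationships
from the product description below. Return JSON with keys
"entities", "attributes", and "relations".

Description: {description}
"""

def extract(description: str) -> dict:
    # Named entity recognition and relation extraction are delegated to
    # the model; the structured JSON later populates the ontology schema.
    raw = call_llm(EXTRACTION_PROMPT.format(description=description))
    return json.loads(raw)
```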
The Extract-Define-Canonicalize Framework structures knowledge acquisition for product ontologies through a three-stage process. Extraction utilizes Large Language Models to identify relevant entities and relationships from unstructured product data, such as descriptions and specifications. Definition involves formulating these extracted elements into formal ontological components – classes and properties – with associated labels and definitions. Finally, Canonicalization aligns these newly defined components with existing ontological structures, resolving ambiguities and ensuring consistency with the established schema, which currently comprises 42 classes and 69 properties, including 20 data attributes and 49 object relationships.
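Structurally, the framework can be pictured as three functions chained over each product description. The stubs below are a sketch of that flow under our own naming, not the paper's implementation; in practice each stage is carried out by an LLM-backed agent.

```python
# Skeleton of the Extract-Define-Canonicalize flow as three stages.
# The stubs return empty lists so the skeleton runs; real agents would
# replace them with LLM calls.

def extract(description: str) -> list[dict]:
    """Stage 1: pull candidate entities and relations out of raw text."""
    return []  # stub

def define(candidates: list[dict]) -> list[dict]:
    """Stage 2: turn candidates into draft classes and properties with
    labels and definitions."""
    return []  # stub

def canonicalize(drafts: list[dict], ontology) -> list[dict]:
    """Stage 3: align drafts with the existing schema, merging duplicates
    and resolving ambiguous names."""
    return []  # stub

def build_ontology(descriptions: list[str], ontology) -> None:
    # ontology is a hypothetical mutable schema object with an add() method.
    for text in descriptions:
        drafts = define(extract(text))
        for component in canonicalize(drafts, ontology):
            ontology.add(component)
```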
Large Language Models (LLMs), specifically instances like ChatGPT 4.1 Mini, function as the primary computational component in automated product ontology creation by leveraging existing datasets to identify patterns and relationships. These models are trained on labeled product data, enabling them to extract semantic information, such as product attributes and classifications, and subsequently generalize this knowledge to categorize and define new, previously unseen product entries. This learning process allows the LLM to predict relevant attributes and relationships for novel products, effectively building and extending the product ontology without explicit, rule-based programming. The model’s capacity for generalization is crucial for maintaining ontology consistency and scalability as the product catalog evolves.
Automated Knowledge Graph Population: A Symphony of Data
AI Agents were implemented to automate the population of a knowledge graph with product data, removing the need for manual data entry and linking. These agents function by autonomously extracting relevant information from product descriptions and converting it into a structured format suitable for graph representation. The orchestration involves identifying entities, relationships, and attributes within the textual data, and then systematically creating connections between them within the knowledge graph. This automated process resulted in the generation of 7,459 RDF triples, representing the extracted product information and its interconnections.
The automated Knowledge Graph Population process utilizes Large Language Models (LLMs) to create Resource Description Framework (RDF) triples, which represent product attributes and their interrelationships as subject-predicate-object statements. This Triple Generation resulted in the creation of 7,459 RDF triples derived from product data. Each triple encodes a specific fact – for example, that a particular product (the subject) has a color (the predicate) whose value is “blue” (the object). These triples serve as the foundational elements for constructing the knowledge graph, enabling structured data representation and facilitating advanced data querying and reasoning.
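A single such fact takes only a few lines with RDFLib; the namespace and property name below are invented for illustration.

```python
# Minimal sketch of the color fact as a subject-predicate-object triple.
# The ex: namespace and sku42 identifier are hypothetical.
from rdflib import Graph, Namespace, Literal, RDF

EX = Namespace("http://example.org/product/")
g = Graph()
g.bind("ex", EX)

g.add((EX.sku42, RDF.type, EX.Product))       # the subject is a product
g.add((EX.sku42, EX.color, Literal("blue")))  # subject - predicate - object

print(g.serialize(format="turtle"))
print(len(g), "triples")  # the full pipeline produced 7,459 of these
```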
The methodologies iText2KG and CodeKGC improve knowledge graph population accuracy through the application of zero-shot learning and schema-aware prompts. Zero-shot learning allows these agents to extract information and establish relationships without requiring task-specific training data; instead, they generalize from pre-trained language models. Schema-aware prompts guide the language model by explicitly referencing the target knowledge graph schema, ensuring extracted data conforms to defined classes and properties. This approach reduces the need for labeled training sets and improves the consistency and validity of the generated RDF triples by aligning extractions with the desired ontological structure.
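The sketch below shows what a schema-aware, zero-shot prompt might look like: the allowed classes and properties are injected directly into the instruction so that extractions stay inside the ontology. The schema literals are placeholders, not the paper's schema, and the prompt wording is our own.

```python
# Sketch of a schema-aware prompt in the spirit of iText2KG / CodeKGC.
# The classes and properties listed are hypothetical examples.
SCHEMA = {
    "classes": ["Product", "Brand", "Material"],
    "properties": ["hasBrand", "hasMaterial", "hasPrice", "hasColor"],
}

def schema_aware_prompt(description: str) -> str:
    return (
        "Extract RDF triples from the product description.\n"
        f"Allowed classes: {', '.join(SCHEMA['classes'])}\n"
        f"Allowed properties: {', '.join(SCHEMA['properties'])}\n"
        "Use only these terms; do not invent new ones.\n"
        f"Description: {description}\n"
        "Output one (subject, predicate, object) triple per line."
    )

print(schema_aware_prompt("Blue cotton t-shirt by Acme, 19.99 USD"))
```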
The processing and manipulation of Resource Description Framework (RDF) data was facilitated using the RDFLib Python library. This tool enabled the consistent structuring and validation of extracted information, contributing to data integrity throughout the knowledge graph population process. Evaluation of 291 product descriptions demonstrated a 97% success rate in processing and converting data into RDF triples, with 282 descriptions successfully parsed and integrated into the knowledge graph. This indicates a high degree of robustness and reliability in the data ingestion pipeline.
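The reported success rate can be pictured as a simple parse-and-count loop over the per-description outputs, as in the sketch below. The assumption that each description yields one Turtle document is ours, not the paper's.

```python
# Illustrative tally of the reported success rate: each description's
# extracted triples are parsed with rdflib and failures are counted.
from rdflib import Graph

def count_successes(turtle_outputs: list[str]) -> tuple[int, int]:
    ok, failed = 0, 0
    for doc in turtle_outputs:
        try:
            Graph().parse(data=doc, format="turtle")
            ok += 1
        except Exception:
            failed += 1
    return ok, failed

# With the paper's figures: 282 parsed out of 291, roughly a 97% success rate.
```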
Enhancing Reasoning: Guiding the Machine Mind
Prompt learning represents a significant advancement in harnessing the capabilities of large language models (LLMs) for tasks involving knowledge graphs. Techniques like KG-ICL and PromptKG demonstrate that LLMs, traditionally strong in text generation, can achieve remarkable performance in reasoning over complex relational data. These approaches move beyond simple keyword matching by structuring prompts to include relevant contextual information directly within the input, effectively guiding the LLM’s inference process. This allows the models to not merely recall facts, but to synthesize information and draw conclusions based on the relationships encoded within the knowledge graph – a crucial step towards truly intelligent data processing and analysis. By strategically crafting these prompts, researchers are unlocking the full potential of LLMs to perform sophisticated knowledge graph tasks with increased accuracy and efficiency.
The convergence of in-context learning and prompt graphs represents a significant advancement in leveraging large language models for complex reasoning. Traditional prompting methods often fall short when faced with multi-hop inference or tasks requiring extensive background knowledge. However, by structuring prompts as graphs – where nodes represent concepts and edges define relationships – LLMs gain access to a more organized and interconnected knowledge representation. This allows the model to traverse the prompt graph, identifying relevant information and drawing inferences in a manner analogous to human reasoning. The technique effectively guides the LLM’s attention, mitigating the risk of getting lost in lengthy text and enabling more accurate and reliable results, particularly in areas like knowledge-based question answering and intricate data analysis.
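One way to picture a prompt graph is to linearize a small set of edges into the context window ahead of the question, as in the sketch below; the facts and the question are invented for illustration.

```python
# Sketch of turning a small prompt graph into in-context text: nodes are
# concepts, edges are relations, and the linearized graph precedes the
# question so the model can chain facts. All contents are hypothetical.
PROMPT_GRAPH = [
    ("UltraBook 14", "isA", "Laptop"),
    ("UltraBook 14", "hasBrand", "Acme"),
    ("Acme", "headquarteredIn", "Berlin"),
]

def graph_to_context(edges: list[tuple[str, str, str]]) -> str:
    return "\n".join(f"{s} --{p}--> {o}" for s, p, o in edges)

question = "Which city is the maker of the UltraBook 14 based in?"
prompt = (
    "Use the following facts to answer.\n"
    f"{graph_to_context(PROMPT_GRAPH)}\n"
    f"Question: {question}"
)
print(prompt)  # answering requires chaining hasBrand -> headquarteredIn
```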
Large language models demonstrate a remarkable ability to tackle intricate challenges, such as product recommendation and attribute extraction, when provided with meticulously designed prompts. This technique, known as prompt engineering, moves beyond simple instruction-following; it involves structuring the input to guide the model’s reasoning process. By framing requests with relevant context, examples, and constraints, these models can discern subtle product features, understand user preferences, and deliver more accurate and personalized recommendations. The precision gained through careful prompting allows LLMs to move beyond superficial pattern matching, enabling them to perform complex inference and deliver outputs that align more closely with desired outcomes – a critical factor in enhancing both automated systems and user satisfaction.
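A hedged example of such a prompt is shown below: it combines an in-context example with explicit output constraints for attribute extraction. The wording and fields are illustrative only, not the paper's prompts.

```python
# Sketch of prompt engineering with an example and constraints for
# attribute extraction; the example product and fields are invented.
FEW_SHOT_PROMPT = """You extract structured attributes from product text.

Example:
Text: "Lightweight 14-inch laptop, 16 GB RAM, aluminium body"
Attributes: {"screen_size": "14 inch", "ram": "16 GB", "material": "aluminium"}

Constraints:
- Output valid JSON only.
- Omit attributes that are not explicitly stated in the text.

Text: "{description}"
Attributes:"""

def build_prompt(description: str) -> str:
    # str.replace avoids clashing with the literal braces in the JSON example.
    return FEW_SHOT_PROMPT.replace("{description}", description)

print(build_prompt("Red cotton t-shirt, machine washable"))
```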
The refinement of reasoning within large language models directly impacts the quality of interactions with customers and, consequently, yields tangible business benefits. Enhanced reasoning allows these models to provide more accurate, relevant, and personalized responses, leading to increased customer satisfaction and loyalty. This translates into improved metrics such as higher Net Promoter Scores and reduced customer churn. Furthermore, the ability to extract nuanced information and make informed recommendations, powered by superior reasoning, unlocks opportunities for upselling and cross-selling, driving revenue growth. Businesses leveraging these advancements can expect not only to optimize customer engagement but also to gain a competitive edge through data-driven insights and more effective service delivery.
Validating and Refining: Ensuring the Knowledge Endures
Assessing the reliability of a knowledge graph hinges on quantifiable metrics, notably Ontology Quality and Ontology Coverage, which reveal how comprehensively and coherently information is structured and represented. These measurements are not merely academic; they directly reflect the graph’s utility in downstream applications, such as product discovery and recommendation systems. Recent evaluations demonstrate a high degree of completeness, with the populated knowledge graph achieving 97.1% coverage of defined ontology properties – indicating a robust foundation for accurate data retrieval and insightful connections between entities. This level of coverage suggests the system effectively captures the intended relationships within the product catalog, minimizing gaps and inconsistencies that could lead to flawed inferences or inaccurate results.
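Property coverage itself is a straightforward ratio: the share of ontology properties that actually appear as predicates in the populated graph. The sketch below computes it with RDFLib; the inputs are placeholders rather than the paper's data.

```python
# Rough sketch of the property-coverage metric: ontology properties that
# occur as predicates in the populated graph, divided by all defined
# properties. Inputs are assumed to be rdflib Graph / URIRef objects.
from rdflib import Graph, URIRef

def property_coverage(kg: Graph, ontology_properties: set[URIRef]) -> float:
    used = set(kg.predicates())
    covered = used & ontology_properties
    return len(covered) / len(ontology_properties)

# The paper reports 97.1% coverage of the 69 defined properties.
```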
Maintaining a consistently accurate knowledge graph requires vigilant, ongoing monitoring and refinement processes. Product catalogs are dynamic entities, subject to frequent updates, new additions, and discontinued items; therefore, a static knowledge graph quickly becomes outdated and unreliable. This necessitates continuous assessment of the graph’s data against current catalog information, identifying and rectifying discrepancies as they arise. Automated checks for data consistency and completeness, coupled with human-in-the-loop validation, are crucial components of this iterative process. Such persistent attention not only preserves the integrity of existing information but also ensures the knowledge graph effectively incorporates changes, enabling it to remain a relevant and trustworthy source of product understanding over time.
The knowledge graph’s quality benefits significantly from a combined approach utilizing distant supervision and targeted annotation. Distant supervision automatically generates training data by aligning the knowledge graph with unstructured text, identifying potential relationships without manual labeling. However, this method isn’t always accurate, necessitating the implementation of annotation techniques where human experts validate and correct these automatically identified relationships. This synergistic process not only refines the existing knowledge but also expands its reach, creating a more robust and reliable resource. By combining the scalability of distant supervision with the precision of human annotation, the knowledge graph achieves a higher degree of accuracy and completeness, enabling more effective data retrieval and reasoning.
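A toy version of the distant-supervision step is sketched below: existing graph facts are matched against sentences that mention both entities, producing weak labels that a human annotator later confirms or rejects. The data and the matching heuristic are invented for illustration.

```python
# Toy sketch of distant supervision: KG facts are aligned with sentences
# that mention both the subject and the object, yielding weak labels for
# later human validation. Facts and heuristic are hypothetical.
KG_FACTS = [("UltraBook 14", "hasBrand", "Acme")]

def weak_labels(sentences: list[str]) -> list[tuple[str, tuple]]:
    labels = []
    for sent in sentences:
        for s, p, o in KG_FACTS:
            if s in sent and o in sent:  # naive string-match alignment
                labels.append((sent, (s, p, o)))
    return labels

candidates = weak_labels(["The UltraBook 14 is built by Acme in Berlin."])
print(candidates)  # each candidate is then shown to an annotator for review
```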
Investigations are now directed towards extending these knowledge graph construction techniques to accommodate substantially larger product catalogs and, crucially, incorporating real-time data feeds. This move from static datasets to dynamic information streams promises a continuously updated and increasingly accurate representation of product information. Initial trials demonstrate a high degree of success, with a failure rate of only 3% – the 9 product descriptions out of 291 that could not be processed – suggesting the robustness and scalability of the methodology as it transitions to more complex and voluminous datasets. Further refinement aims to minimize these errors and ensure the knowledge graph remains a reliable source of truth even amidst rapidly changing product offerings and market dynamics.
The pursuit of automated knowledge graph construction, as detailed in the framework, echoes a fundamental truth about systems. They are not static entities, but evolving organisms subject to the relentless march of time. Just as technical debt accumulates as a consequence of past decisions, manually crafted ontologies become brittle and require constant upkeep. Bertrand Russell observed, “The only thing that you can be sure of is that nothing is certain.” This inherent uncertainty underscores the value of an agent-driven approach, allowing the knowledge graph to adapt and refine itself over time, acknowledging that complete and immutable knowledge is an illusion. The framework’s scalability isn’t merely about handling larger datasets, but about extending the lifespan of the knowledge representation itself.
What Lies Ahead?
The automated construction of product knowledge graphs, as demonstrated, represents a momentary reprieve from the inevitable accrual of technical debt. Each new product, each revised description, is a fresh erosion of semantic consistency. This framework offers a means to forestall complete entropy, but it does not eliminate it. Future work will undoubtedly focus on refining the agents themselves, increasing their resilience to the inherent ambiguity of natural language – a task akin to perpetually shoring up a coastline against the tide.
A critical limitation lies in the implicit assumption of a stable ‘product’ concept. Retail, however, is defined by flux. The boundaries of what constitutes a ‘product’ are constantly shifting, driven by innovation and consumer desire. Knowledge graphs, by their nature, are static representations of dynamic systems. The next phase must address the challenge of continuous ontological adaptation, a process that mirrors the slow drift of geological plates.
Ultimately, the pursuit of perfect knowledge representation is a Sisyphean task. Uptime, in this context, is a rare phase of temporal harmony, a fleeting moment before the system begins its inevitable decline. More fruitful avenues of inquiry may lie not in attempting to prevent decay, but in designing systems that degrade gracefully, anticipating and accommodating the constant influx of novelty and obsolescence.
Original article: https://arxiv.org/pdf/2511.11017.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/