Beyond Tables: Semantic Data Boosts Transaction Analysis

Author: Denis Avetisyan


New research demonstrates how incorporating the meaning of categorical data improves the performance of foundation models in understanding financial transactions.

The proposed method offers a comprehensive approach to the problem, laying the groundwork for future iterations despite the inevitable accumulation of technical debt inherent in any complex system.

A framework leveraging large language model-generated sentence embeddings enhances foundation models for transaction understanding and maintains production efficiency.

While foundation models excel at processing structured data, representing categorical variables in transaction analysis often leads to semantic information loss. This limitation motivates our work, ‘Enhancing Foundation Models in Transaction Understanding with LLM-based Sentence Embeddings’, which introduces a hybrid framework leveraging Large Language Model (LLM)-generated embeddings to enrich tabular transaction data. By fusing multi-source information and employing a one-word constraint, we demonstrate significant performance gains across key transaction understanding tasks while maintaining operational efficiency. Could this approach unlock a new generation of foundation models capable of truly understanding the complex narratives hidden within financial transactions?


Beyond Simple Numbers: Uncovering Meaning in Transactions

Conventional financial analysis often relies on representing transactions through categorical codes, such as Merchant Category Codes (MCCs), as simple numerical indices. While computationally efficient, this practice discards crucial semantic information inherent in the transaction itself. Reducing a purchase at a bookstore, a restaurant, or a hardware store to a single number obscures the distinct nature of each expenditure. This loss of granularity prevents a nuanced understanding of consumer behavior and limits the ability to discern subtle patterns that could be valuable for fraud detection, credit risk assessment, or personalized financial services. Consequently, analytical models built upon these index-based representations operate with an incomplete picture, potentially hindering their predictive power and overall accuracy.

The conventional reduction of transactional details into numerical indices, while computationally efficient, significantly compromises the ability to accurately interpret and forecast financial behaviors. By stripping away the semantic content – the ‘what’, ‘where’, and ‘why’ behind each purchase – these methods treat fundamentally different transactions as equivalent, obscuring crucial patterns. This loss of granularity impacts a range of financial applications, from fraud detection – where nuanced purchase descriptions can signal illicit activity – to credit risk assessment, where understanding spending habits provides a more complete picture of an applicant’s financial health. Consequently, models built on index-based representations often exhibit reduced predictive power and may fail to capture the complex relationships inherent in real-world financial data, hindering effective decision-making and potentially leading to inaccurate financial modeling.

Current financial analysis often treats transactional data, such as purchases categorized by Merchant Category Codes, as simple indices, effectively discarding valuable semantic information. Researchers are now exploring a paradigm shift by applying advanced language models to directly interpret the meaning embedded within these transactions. This involves treating transaction descriptions – even seemingly brief ones – as natural language text, enabling the models to understand not just what was purchased, but also the underlying intent and context. By moving beyond numerical categorization, these models can identify subtle patterns and relationships previously obscured, potentially leading to more accurate fraud detection, improved credit risk assessment, and a deeper understanding of consumer behavior. This approach promises to transform raw transactional data into actionable intelligence, unlocking insights that traditional methods simply miss.

The potential to refine financial modeling stems from a more nuanced understanding of transactional data. By moving beyond simple categorization, this approach allows algorithms to discern the intent behind purchases – is a charge at a sporting goods store related to a specific event, or indicative of a long-term hobby? – thereby generating more accurate predictions of future behavior. This deeper semantic analysis enables the identification of subtle patterns previously obscured by broad classifications, potentially leading to improved risk assessment, fraud detection, and personalized financial services. Consequently, models built upon this richer data representation are poised to offer a significant leap in predictive power and analytical sophistication, ultimately driving more informed financial decision-making.

From Codes to Context: LLM-Powered Transaction Embeddings

Transaction data, typically comprised of categorical features, is transformed into numerical vector representations via sentence embeddings. This process involves treating each transaction as a textual description, enabling the application of Large Language Models (LLMs) to generate a vector representation capturing the semantic meaning of the transaction’s characteristics. These embeddings facilitate downstream analytical tasks by converting discrete, non-numerical data into a continuous vector space, allowing for the calculation of similarities and differences between transactions based on their encoded representations. The resulting vectors effectively map categorical attributes into a multidimensional space where proximity indicates semantic relatedness.
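
To make the idea concrete, the sketch below serializes a single transaction record into a short natural-language description that an embedding model can consume. The field names and the sentence template are illustrative assumptions; the paper does not prescribe an exact serialization format.

```python
# Minimal sketch: turning a tabular transaction row into a short textual
# description suitable for sentence embedding. Field names and the template
# are illustrative; the paper does not specify this exact format.

def transaction_to_text(row: dict) -> str:
    """Serialize a transaction's categorical and numeric fields into a sentence."""
    return (
        f"A transaction of {row['amount']} USD at {row['merchant']} "
        f"({row['mcc_description']}) in {row['city']}, {row['country']}."
    )

example = {
    "amount": 42.50,
    "merchant": "City Books",
    "mcc_description": "Book Stores",
    "city": "Boston",
    "country": "US",
}

print(transaction_to_text(example))
# -> "A transaction of 42.5 USD at City Books (Book Stores) in Boston, US."
```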

Sentence embeddings are generated from transaction data by utilizing Large Language Models (LLMs) to encode textual features. Specifically, the models Llama2-7b, Llama2-13b, Llama3-8b, and Mistral-7b are employed to process textual information present within each transaction record. These LLMs transform the textual data into vector representations, capturing semantic meaning and relationships within the text. The resulting embeddings facilitate downstream tasks such as transaction clustering, anomaly detection, and similarity analysis by providing a numerical representation of the textual content associated with each transaction.
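
One common way to extract such embeddings from a decoder-only model with the Hugging Face transformers API is to mean-pool the final hidden states, as sketched below; the paper's exact extraction and pooling choices may differ, so treat this as an assumption-laden example rather than the authors' pipeline.

```python
# Sketch: sentence embedding from a decoder-only LLM (e.g. Mistral-7B) by
# mean-pooling the last hidden states over non-padding tokens.
# Note: 7B-scale models require substantial memory; run on a GPU in practice.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # any Llama-2/3 or Mistral checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(text: str) -> torch.Tensor:
    """Return a single vector representing the input text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state       # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)        # (1, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                  # zero out padding
    return (summed / mask.sum(dim=1)).squeeze(0)         # (dim,)

vector = embed("A transaction of 42.5 USD at City Books (Book Stores) in Boston, US.")
print(vector.shape)  # torch.Size([4096]) for a 7B-class model
```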

Prompt engineering was critical to refining sentence embeddings generated from transaction data. Specifically, prompts were designed to elicit concise and focused vector representations from the Large Language Models (Llama2-7b, Llama2-13b, Llama3-8b, and Mistral-7b). The “One-Word Limitation Principle” dictated that prompts were structured to encourage the LLM to represent each transaction’s textual data with a single, dominant semantic descriptor. This constraint minimizes dimensionality and promotes consistency across embeddings, improving interpretability by reducing the influence of extraneous or ambiguous language within the original transaction descriptions. The resulting embeddings prioritize core semantic meaning, facilitating downstream analysis and model performance.
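
The sketch below illustrates one way such a one-word prompt could be phrased and applied with a Hugging Face causal LM; the prompt wording and model choice are assumptions for illustration, not the paper's exact prompts.

```python
# Illustrative prompt following the one-word constraint: the LLM is asked to
# compress a categorical value into a single dominant descriptor.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
lm = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def one_word_descriptor(feature_name: str, value: str) -> str:
    prompt = (
        f"Describe the {feature_name} '{value}' in exactly one word. "
        "Answer with the single word only:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = lm.generate(**inputs, max_new_tokens=5, do_sample=False)
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return completion.strip().split()[0]  # keep only the first generated word

print(one_word_descriptor("merchant category", "MCC 5812 - Eating Places and Restaurants"))
# e.g. "Dining"
```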

A Tabular Foundation Model serves to integrate LLM-derived sentence embeddings across the entirety of the transactional dataset, addressing limitations inherent in processing purely categorical data. This integration is achieved by utilizing the Tabular Foundation Model to learn representations from both the LLM embeddings – generated from textual transaction details – and the original categorical features. Consequently, the model can generalize embedding benefits beyond transactions with associated text, effectively enriching the representation of all transactional data points and improving downstream task performance such as fraud detection or customer segmentation. The Tabular Foundation Model acts as a bridge, allowing the LLM’s semantic understanding to be applied universally across the tabular dataset.
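
A minimal PyTorch sketch of this fusion idea is shown below: the LLM vector is projected down and concatenated with a learned categorical embedding and raw numeric features before a prediction head. The dimensions and architecture are illustrative assumptions rather than the paper's actual tabular foundation model.

```python
# Toy fusion of LLM sentence embeddings with ordinary tabular features.
import torch
import torch.nn as nn

class FusedTransactionModel(nn.Module):
    def __init__(self, llm_dim=4096, n_categories=1000, cat_dim=32,
                 n_numeric=4, hidden=256, n_classes=10):
        super().__init__()
        self.llm_proj = nn.Linear(llm_dim, cat_dim)          # compress the LLM vector
        self.cat_emb = nn.Embedding(n_categories, cat_dim)   # e.g. MCC index
        self.head = nn.Sequential(
            nn.Linear(cat_dim * 2 + n_numeric, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, llm_vec, cat_idx, numeric):
        fused = torch.cat(
            [self.llm_proj(llm_vec), self.cat_emb(cat_idx), numeric], dim=-1
        )
        return self.head(fused)

model = FusedTransactionModel()
logits = model(
    llm_vec=torch.randn(8, 4096),          # batch of LLM embeddings
    cat_idx=torch.randint(0, 1000, (8,)),  # batch of category indices
    numeric=torch.randn(8, 4),             # amounts, timestamps, etc.
)
print(logits.shape)  # torch.Size([8, 10])
```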

Putting Embeddings to Work: Improved Prediction and Assessment

The implemented framework achieves improved accuracy in transaction prediction across multiple key data points. Specifically, predictions regarding transaction amount, Merchant Category Code (MCC), and location information demonstrate measurable gains over baseline methods. Quantitative results indicate MCC and merchant prediction accuracies of 82% and 100% respectively, alongside reported reductions in Mean Absolute Error (MAE) for transaction amount predictions. This enhanced predictive capability is attributed to the framework’s ability to capture complex relationships within transaction data, offering more reliable forecasts for each of these elements.

Multi-Source Data Fusion is employed to augment categorical transaction data prior to input into Large Language Models (LLMs). This process integrates information from multiple data sources – beyond the core transaction details – to create a more comprehensive representation of each categorical feature. Specifically, this enrichment provides LLMs with additional contextual information regarding Merchant Category Codes (MCCs) and location data. The resulting expanded feature set allows the LLM to develop more nuanced embeddings, improving its ability to accurately predict transaction attributes such as amount, MCC, and location, and ultimately enhancing the performance of downstream tasks like fraud detection and risk assessment.
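
As a toy illustration, the snippet below merges descriptions of the same MCC from several hypothetical reference sources into one richer text before it is handed to the LLM for embedding or one-word summarization; the source names and contents are invented for the example.

```python
# Sketch of multi-source enrichment for a categorical feature (here, an MCC).
# The three "sources" below are placeholders, not real reference datasets.

MCC_OFFICIAL = {"5812": "Eating Places and Restaurants"}
MCC_NOTES = {"5812": "sit-down dining, typically non-fast-food"}
LOCATION_HINT = {"5812": "often co-located with entertainment districts"}

def enrich_mcc(code: str) -> str:
    parts = [
        f"MCC {code}: {MCC_OFFICIAL.get(code, 'unknown category')}",
        MCC_NOTES.get(code, ""),
        LOCATION_HINT.get(code, ""),
    ]
    return ". ".join(p for p in parts if p)

print(enrich_mcc("5812"))
# "MCC 5812: Eating Places and Restaurants. sit-down dining, typically
#  non-fast-food. often co-located with entertainment districts"
```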

Enhanced semantic understanding, achieved through the use of embeddings, directly improves Transaction Metrics Assessment, leading to advancements in fraud detection and risk management capabilities. Quantitative analysis demonstrates this improvement via positive Relative Improvement (RI) values observed across the majority of tested configurations. These RI values indicate that the framework consistently outperforms baseline models in accurately assessing key transaction metrics relevant to risk scoring and anomaly detection. The increased accuracy in metric assessment allows for more precise identification of potentially fraudulent transactions and facilitates improved risk mitigation strategies.
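
For reference, the snippet below computes these metrics under a common convention, with Relative Improvement signed so that positive values favor the enhanced model; the paper's precise definitions are not reproduced here.

```python
# Mean Absolute Error for amount prediction and Relative Improvement of the
# enhanced model over a baseline, under one common convention (assumption).
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def relative_improvement(baseline_metric, enhanced_metric, lower_is_better=True):
    """Positive values mean the enhanced model beats the baseline."""
    if lower_is_better:   # e.g. MAE
        return (baseline_metric - enhanced_metric) / baseline_metric
    return (enhanced_metric - baseline_metric) / baseline_metric  # e.g. accuracy

print(relative_improvement(baseline_metric=12.4, enhanced_metric=10.1))  # ~0.185
```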

Across all tested configurations, the framework achieved Merchant Category Code (MCC) and merchant prediction accuracies of 82% and 100% respectively, indicating strong performance in categorizing transaction types and identifying merchants. Quantitative analysis also revealed improvements in transaction amount prediction, as measured by reductions in Mean Absolute Error (MAE). These results demonstrate the effectiveness of utilizing Large Language Model (LLM)-generated embeddings to enhance the predictive capabilities of transaction data analysis and provide more accurate estimations of transaction values.

Traditional transaction analysis methods often rely on isolated data points and predefined rules, limiting their ability to identify complex correlations. This framework utilizes Large Language Model (LLM)-generated embeddings to represent transaction data in a high-dimensional vector space, allowing for the capture of subtle, non-linear relationships between features like transaction amount, Merchant Category Code, and location. By encoding semantic information, the framework moves beyond simple pattern matching to understand the context of each transaction, resulting in improved prediction accuracy – demonstrated by 82%/100% MCC/Merchant prediction accuracies – and more reliable risk assessment as indicated by improvements in Mean Absolute Error (MAE) and positive Relative Improvement (RI) values. This enhanced understanding translates to more actionable insights for fraud detection and risk management applications.

Beyond the Hype: A Realistic Look at Transactional Intelligence

Traditional financial modeling often relies on indexing transactions – categorizing them by simple codes or identifiers. However, a shift towards more nuanced representations of transactional data is now underway, unlocking a new era of ‘transactional intelligence’. This isn’t some revolutionary breakthrough; it’s a pragmatic step towards recognizing that data isn’t just about numbers, but about the stories behind them. This advancement moves beyond basic categorization to capture the complex relationships and contextual information embedded within each transaction. By leveraging these richer representations, analysts can build more sophisticated models capable of identifying subtle patterns, predicting future behavior, and ultimately, gaining a deeper understanding of financial ecosystems. It’s not about replacing existing systems, but augmenting them with a layer of semantic understanding.

This novel framework promises substantial improvements across critical financial functions, notably in the detection of fraudulent activities. By moving beyond traditional rule-based systems, which are easily circumvented, it can identify subtle anomalies indicative of fraud that might otherwise go unnoticed, leading to reduced financial losses and enhanced security. Simultaneously, the framework’s enhanced analytical capabilities allow for more precise risk assessments, enabling institutions to better quantify and mitigate potential exposures. Furthermore, a deeper understanding of customer behavior, gleaned from detailed transactional analysis, facilitates personalized financial products and services, improved customer retention, and more effective marketing strategies, ultimately reshaping the landscape of financial technology through data-driven insights. It’s about moving beyond reactive measures to proactive prevention.

Continued development centers on expanding the computational scope of this transactional intelligence framework to accommodate increasingly massive datasets – moving beyond manageable samples to encompass real-world financial volumes. Researchers are actively investigating the incorporation of diverse data streams, including alternative financial records, social media sentiment, and macroeconomic indicators, to enrich the analytical depth. This integration aims to move beyond purely transactional data, fostering a more holistic understanding of financial behaviors and systemic risks. Ultimately, scaling and diversification promise to unlock predictive capabilities currently obscured by data limitations, offering refined insights for fraud detection, proactive risk management, and a more nuanced prediction of market trends. It’s a long road, and the challenges are significant, but the potential rewards are substantial.

The convergence of large language models (LLMs) and traditional tabular data analysis is initiating a substantial shift within financial technology. Historically, these domains operated in relative isolation – LLMs excelling at unstructured data like news articles and reports, while tabular analysis focused on structured datasets such as transaction records. Now, the ability to seamlessly integrate these approaches allows for a holistic understanding of financial events, moving beyond simple correlations to uncover nuanced relationships and predictive signals. This synergistic combination unlocks the potential to automate complex financial reasoning, enhance the accuracy of risk assessments, and personalize financial services with unprecedented precision. The resulting advancements promise to redefine capabilities in areas ranging from algorithmic trading and fraud detection to customer relationship management and regulatory compliance, effectively reshaping the financial technology landscape. It’s not a magic bullet, but a powerful tool when used thoughtfully.

The pursuit of enhanced foundation models, as detailed in this work concerning transaction understanding, feels predictably cyclical. This paper attempts to inject semantic meaning – using LLM-generated sentence embeddings – into the notoriously rigid world of tabular data. It’s a clever application, certainly, but one can’t help but recall Marvin Minsky’s observation: “The question isn’t what computers can do, but what we want them to do.” The desire to force nuanced understanding onto systems fundamentally built for rote calculation always seems to introduce unforeseen complications. Data fusion is the current buzzword, but ultimately, it’s just another layer of abstraction destined to become tomorrow’s tech debt. Everything new is just the old thing with worse docs.

What’s Next?

The pursuit of semantic understanding in tabular data, as demonstrated by this work, invariably reveals the fragility of ‘solved’ problems. The integration of LLM-based embeddings offers a temporary reprieve from feature engineering, but it doesn’t eliminate the underlying tension: categorical data is, at its core, a discrete representation of continuous human behavior. Everything optimized will one day be optimized back, likely in a direction that renders current embeddings obsolete. The current focus on performance metrics feels, predictably, like rearranging deck chairs.

The real challenge isn’t improving scores on benchmark datasets, but managing the entropy of production systems. This framework, like all others, will accrue technical debt. The cost of maintaining these embeddings – retraining, drift detection, and the inevitable need for human-in-the-loop validation – will quickly outweigh any initial gains. Architecture isn’t a diagram, it’s a compromise that survived deployment.

Future work will likely center not on novel embedding techniques, but on robust monitoring and automated repair strategies. The goal shouldn’t be to build ‘smarter’ models, but to build systems that are gracefully resilient to the inevitable decay of information. One doesn’t refactor code – one resuscitates hope.


Original article: https://arxiv.org/pdf/2601.05271.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-01-12 18:41