Author: Denis Avetisyan
New research reveals that a simple linear model built on basic data normalization consistently outperforms sophisticated machine learning models when predicting investor behavior.

Market cap-weighted order flow provides a surprisingly effective signal, demonstrating that sufficient signal-to-noise ratio and robust preprocessing are more critical than model complexity in non-stationary financial time series.
Despite the rapid adoption of machine learning in finance, the conditions under which complex models truly outperform simpler alternatives remain poorly understood. This is the central question addressed in ‘The Limits of Complexity: Why Feature Engineering Beats Deep Learning in Investor Flow Prediction’, a study investigating the predictive power of advanced signal processing and deep learning techniques applied to investor order flows. Surprisingly, the authors find that a parsimonious linear model, utilizing market capitalization-normalized flows, substantially outperforms a sophisticated pipeline incorporating Independent Component Analysis, Wavelet transforms, and Long Short-Term Memory networks. These findings suggest that in noisy financial environments, carefully engineered features can yield significantly higher returns than increased algorithmic complexity – but what does this imply for the future of deep learning in financial forecasting?
Decoding the Herd: Understanding Investor Order Flow
Financial markets aren’t monolithic entities; instead, they represent the combined decisions of a remarkably diverse set of investors, each exhibiting distinct behavioral patterns. Institutional investors, like pension funds and hedge funds, typically operate with long-term strategies and extensive research, prioritizing fundamental value. Conversely, retail investors, often driven by shorter-term trends and emotional responses, can introduce volatility and momentum-based shifts. High-frequency traders, leveraging algorithmic strategies, focus on minute price discrepancies, adding liquidity but potentially exacerbating rapid market movements. Even within these broad categories, further segmentation reveals nuanced differences – for instance, the behavior of foreign investors versus domestic ones, or the contrasting approaches of growth versus value-focused funds. Recognizing these varied characteristics is paramount; a comprehensive understanding of who is trading, and why, provides crucial insight into the underlying forces shaping market prices and overall stability.
The predictive power of financial markets hinges on recognizing that not all investors operate with the same logic. Foreign investors, often institutional with significant resources and expertise, typically exhibit behavior aligned with fundamental valuations – earning them the label ‘smart money’. Conversely, retail investors, driven by factors like sentiment, news cycles, or even social media trends, can introduce biases and short-term volatility. Discerning these behavioral differences is not merely academic; it’s essential for building accurate market models. A failure to account for the distinct motivations and reactions of each investor type can lead to misinterpretations of market signals and flawed predictions, as the aggregated order flow reflects a complex interplay of rational analysis and potentially irrational exuberance or panic.
Investor Order Flow analysis reveals market dynamics by examining the aggregated trading activity of different investor groups, yet interpreting this data demands meticulous attention to complex interactions. Simply observing the volume of trades isn’t enough; researchers must account for the nuanced behaviors inherent to each investor type – the long-term perspective of institutional investors versus the often short-term reactions of retail traders. Sophisticated statistical models are employed to disentangle these overlapping signals, identifying patterns that suggest shifts in market sentiment or the anticipation of future price movements. This process often involves separating ‘genuine’ information conveyed through trades from ‘noise’ created by irrational exuberance or panic selling, ultimately aiming to understand whether observed flows represent informed investment decisions or merely speculative behavior. Successfully navigating this complexity provides valuable insights into market trends and potential turning points.

Unmasking Hidden Drivers: ICA and Market Decomposition
Independent Component Analysis (ICA) is a computational technique used to separate multivariate signals into additive subcomponents, assuming the subcomponents are statistically independent. When applied to Investor Order Flow – the aggregate of all buy and sell orders – ICA decomposes this flow into a set of statistically independent factors. This decomposition isn’t based on pre-defined categories, but rather emerges from the data itself, identifying underlying drivers of market activity that might otherwise be obscured by correlations. The process utilizes algorithms to maximize the statistical independence of the extracted components, effectively “unmixing” the observed order flow into its constituent parts and revealing latent, previously hidden, influences on market behavior. This differs from principal component analysis, which only guarantees orthogonality, not statistical independence.
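As a concrete illustration, the sketch below applies scikit-learn's FastICA to a synthetic order-flow matrix (rows are trading days, columns are investor groups). The latent drivers, mixing matrix, component count, and shapes are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Synthetic stand-in for daily net order flow from five investor groups:
# three non-Gaussian latent drivers are linearly mixed, plus observation noise.
rng = np.random.default_rng(0)
latent = rng.laplace(size=(500, 3))         # hidden drivers (ICA needs non-Gaussianity)
mixing = rng.normal(size=(3, 5))            # how drivers blend into observed groups
flows = latent @ mixing + 0.1 * rng.normal(size=(500, 5))

# "Unmix" the observed flows into statistically independent components.
ica = FastICA(n_components=3, whiten="unit-variance", random_state=0)
components = ica.fit_transform(flows)       # shape (500, 3): recovered drivers
print(components.shape, ica.mixing_.shape)  # (500, 3) (5, 3)
```

Because the latent drivers are non-Gaussian, FastICA can recover them up to sign and ordering, which is precisely what PCA's orthogonality guarantee cannot do.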
Independent Component Analysis (ICA) applied to investor order flow identifies latent factors – statistically independent components – that explain variance in the data. These factors are not directly observable but are inferred through the ICA algorithm and represent underlying economic and behavioral drivers. Specifically, components are consistently identified as proxies for Macro Risk, capturing sensitivity to broad economic indicators; Domestic Sentiment, reflecting investor attitudes toward the local market; and Liquidity Provision, quantifying the availability of capital for trading. The extracted components, therefore, provide insight into systemic influences and the collective actions of investors, offering a decomposition of observed order flow into its fundamental drivers.
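One way to attach such labels, sketched below on assumed data, is to correlate each extracted component with candidate reference series (say, a volatility index for macro risk, a sentiment survey, bid-ask spreads for liquidity) and name each component after its strongest match. All series here are synthetic placeholders, not the paper's proxies.

```python
import numpy as np
import pandas as pd

# Synthetic placeholders: three extracted components and three candidate
# reference series a researcher might use to interpret them.
rng = np.random.default_rng(1)
components = pd.DataFrame(rng.normal(size=(500, 3)), columns=["IC0", "IC1", "IC2"])
refs = pd.DataFrame(rng.normal(size=(500, 3)),
                    columns=["macro_risk", "domestic_sentiment", "liquidity"])

# Correlate every component with every reference series, then label each
# component by the reference it matches most strongly in absolute value.
corr = components.apply(lambda ic: refs.corrwith(ic))  # rows: refs, cols: ICs
print(corr.abs().idxmax())                             # e.g. IC0 -> liquidity
```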
Traditional statistical analyses often identify correlations between market movements and observable variables; however, these correlations do not establish causality and can be misleading when multiple underlying factors interact. Isolating latent factors through techniques like Independent Component Analysis (ICA) addresses this limitation by identifying the statistically independent components driving observed market behavior. This decomposition allows for the separation of complex, intertwined signals into their fundamental sources, revealing the relative contribution of each factor – such as macro risk, domestic sentiment, or liquidity provision – to overall market dynamics. Consequently, analysis shifts from observing what moves with the market to understanding why the market moves, enabling more robust modeling and a deeper comprehension of the underlying economic forces at play.

Chasing Ghosts: Predictive Modeling and Validation
The predictive model utilizes a Long Short-Term Memory (LSTM) network augmented with an Attention mechanism. This architecture processes normalized Investor Order Flow as input to forecast future returns. Investor Order Flow represents the net buying or selling pressure from investors, and normalization is applied to standardize the data. The LSTM component is designed to capture temporal dependencies within the order flow data, while the Attention mechanism allows the model to focus on the most relevant time steps when making predictions. The model aims to identify patterns in investor behavior that correlate with subsequent price movements, thereby enabling the prediction of asset returns.
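A minimal PyTorch sketch of this kind of architecture is shown below. The hidden size, window length, and feature count are illustrative guesses rather than the paper's hyperparameters; the attention here is a simple learned softmax pooling over LSTM outputs.

```python
import torch
import torch.nn as nn

class FlowLSTMAttention(nn.Module):
    """Sketch of an LSTM-plus-attention forecaster over order-flow windows.
    Dimensions are illustrative assumptions, not the paper's configuration."""
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)   # scores each time step
        self.head = nn.Linear(hidden, 1)   # maps pooled state to a return forecast

    def forward(self, x):                  # x: (batch, time, n_features)
        out, _ = self.lstm(x)              # (batch, time, hidden)
        weights = torch.softmax(self.attn(out), dim=1)  # attention over time steps
        context = (weights * out).sum(dim=1)            # (batch, hidden)
        return self.head(context).squeeze(-1)           # (batch,) predicted returns

model = FlowLSTMAttention(n_features=5)
window = torch.randn(32, 60, 5)            # 32 samples of 60-day flow windows
print(model(window).shape)                 # torch.Size([32])
```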
Market capitalization normalization is a critical preprocessing technique employed to address the inherent scale differences present in investor order flow data. Raw order flow values are directly influenced by a stock’s price and outstanding shares; larger companies, by virtue of their market capitalization, will naturally exhibit larger absolute flow values than smaller companies, even if the proportional investor interest is identical. Normalizing by market capitalization – dividing the raw flow by the company’s market cap – effectively scales the signal, allowing for a comparison of investor activity relative to the size of the company. This ensures that signals are scale-invariant, enabling meaningful cross-sectional analysis and preventing larger market cap stocks from disproportionately influencing model outputs. Without this normalization, the model would likely interpret absolute flow values as indicators of strength, rather than proportional investor sentiment.
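The transformation itself is a one-liner. The toy example below shows two hypothetical stocks with identical proportional investor interest (net flow equal to 0.1% of market cap) whose raw flows differ by two orders of magnitude but whose normalized flows coincide.

```python
import pandas as pd

# Two hypothetical stocks with the same proportional investor interest
# but very different absolute flow because of their size.
df = pd.DataFrame({
    "stock":      ["MEGA", "SMALL"],
    "net_flow":   [1.0e9, 1.0e7],          # raw buy-minus-sell value
    "market_cap": [1.0e12, 1.0e10],
})
df["norm_flow"] = df["net_flow"] / df["market_cap"]  # scale-invariant signal
print(df)  # both rows show norm_flow == 0.001
```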
Model performance was assessed with metrics designed to quantify a model’s ability to identify predictive signal amidst inherent market noise. Evaluation of the LSTM model revealed a negative Information Ratio of -1.36, indicating that its risk-adjusted returns consistently fell short of the benchmark and demonstrating poor predictive capability. Furthermore, the model achieved a Hit Rate of 47.5% – the percentage of correctly predicted price movements – a result below the 50% expected from random chance, suggesting the model offers no predictive advantage over a coin flip.
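For reference, the sketch below computes both reported metrics under their standard definitions – an annualized Information Ratio of active returns against a benchmark, and a sign-agreement Hit Rate – on synthetic data. The paper's exact conventions (benchmark choice, annualization factor) may differ.

```python
import numpy as np

def information_ratio(strategy, benchmark, periods_per_year=252):
    """Annualized mean active return divided by the volatility of active returns."""
    active = np.asarray(strategy) - np.asarray(benchmark)
    return np.sqrt(periods_per_year) * active.mean() / active.std(ddof=1)

def hit_rate(predicted, realized):
    """Fraction of periods where the predicted sign matches the realized sign."""
    return float(np.mean(np.sign(predicted) == np.sign(realized)))

# Synthetic daily series purely to exercise the functions.
rng = np.random.default_rng(2)
real = rng.normal(0.0002, 0.01, size=250)   # realized strategy returns
pred = rng.normal(size=250)                 # model's directional forecasts
bench = np.zeros(250)                       # benchmark (assumed zero here)
print(information_ratio(real, bench), hit_rate(pred, real))
```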
A momentum strategy implemented using market capitalization-normalized investor order flow data yielded a Hit Rate of 56.5%. This metric represents the percentage of correctly predicted price movements, indicating the strategy’s ability to accurately forecast short-term directional changes. The 56.5% Hit Rate is a marked improvement over the 47.5% achieved by the LSTM-with-Attention deep learning model and exceeds the 50% baseline expected from random chance, suggesting that market capitalization normalization materially enhances predictive accuracy when investor order flow is used as a signal.

Turning Signals into Substance: A Momentum-Based Strategy
A straightforward momentum strategy was developed to showcase the real-world benefits of separating investor order flow. This approach capitalizes on the principle that stocks experiencing positive flow – an influx of buying pressure – tend to continue performing well in the short term. By ranking stocks based on market capitalization-normalized flow, the strategy identifies those most likely to benefit from continued investor interest. The resulting trades, executed based on this simple ranking, reveal a tangible application of the research, transforming theoretical insights into a potentially profitable investment methodology and highlighting the power of understanding the forces driving market movements.
A core element of this strategy involves ranking stocks according to their market cap-normalized order flow, a process that effectively distills trading volume into a signal of investor sentiment. By adjusting for a company’s size, the system avoids being skewed by large-cap stocks with naturally higher trading volumes, instead focusing on relative, rather than absolute, flow. This normalization allows for the identification of smaller-cap companies experiencing significant, yet previously obscured, investor interest, or conversely, flagging larger firms facing unusual selling pressure. Consequently, opportunities for profitable trades emerge as the system pinpoints stocks where demand is demonstrably outpacing supply, or vice versa, creating a dynamic ranking used to construct a momentum-based portfolio.
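A minimal backtest of this ranking logic might look like the sketch below: each period, the top stocks by normalized flow receive equal weight, and the signal is lagged one day so trades rely only on past information. The universe size, holding rule, and data are assumptions for illustration; transaction costs are ignored.

```python
import numpy as np
import pandas as pd

def momentum_weights(norm_flow: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """Equal-weight the top_n stocks by market-cap-normalized flow each day,
    lagged one period to avoid look-ahead bias."""
    ranks = norm_flow.rank(axis=1, ascending=False)
    weights = (ranks <= top_n).astype(float)
    weights = weights.div(weights.sum(axis=1), axis=0)
    return weights.shift(1)

# Synthetic universe: 250 business days x 50 stocks of normalized flows/returns.
rng = np.random.default_rng(3)
dates = pd.date_range("2025-01-01", periods=250, freq="B")
stocks = [f"S{i:02d}" for i in range(50)]
flows = pd.DataFrame(rng.normal(size=(250, 50)), index=dates, columns=stocks)
rets = pd.DataFrame(rng.normal(0.0, 0.01, size=(250, 50)), index=dates, columns=stocks)

weights = momentum_weights(flows)
strategy_returns = (weights * rets).sum(axis=1)   # daily portfolio returns
print(strategy_returns.tail())
```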
The implementation of a market cap-normalized flow ranking system yielded remarkably strong performance metrics. Specifically, the strategy achieved a Sharpe Ratio of 1.30, indicating a compelling risk-adjusted return, and generated a cumulative return of 272.6% over the evaluation period. This substantial gain demonstrates a clear advantage over alternative approaches; in direct comparison, an Independent Component Analysis (ICA)-based strategy resulted in a loss of 5.1%, while a Long Short-Term Memory (LSTM) network-driven strategy failed to produce a positive Sharpe Ratio. These results highlight the effectiveness of leveraging investor order flow as a signal for identifying profitable trading opportunities and suggest a viable pathway for data-driven investment strategies to surpass conventional methods.
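The headline figures translate into code as follows: an annualized Sharpe Ratio (assuming a zero risk-free rate) and a compounded cumulative return, shown here on synthetic daily returns rather than the paper's actual series.

```python
import numpy as np

def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe Ratio, assuming a zero risk-free rate."""
    r = np.asarray(returns)
    return np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)

def cumulative_return(returns):
    """Total compounded return over the whole series."""
    return np.prod(1.0 + np.asarray(returns)) - 1.0

daily = np.random.default_rng(4).normal(0.0008, 0.01, size=750)  # synthetic
print(f"Sharpe: {sharpe_ratio(daily):.2f}  cumulative: {cumulative_return(daily):.1%}")
```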
The findings illuminate a compelling path toward superior investment performance through data-driven methodologies. By meticulously analyzing investor order flow, previously obscured signals within market activity are revealed, offering a nuanced understanding of price movements beyond traditional indicators. This approach doesn’t merely react to price changes; it anticipates them by decoding the intentions embedded in trading behavior. The substantial outperformance – a Sharpe Ratio of 1.30 and a 272.6% cumulative return – compared to ICA and LSTM-based strategies, underscores the efficacy of this method. It suggests a fundamental shift is possible, moving beyond reliance on historical data or complex predictive modeling towards strategies that directly interpret the collective actions of market participants, thereby unlocking previously inaccessible opportunities for profit.

The pursuit of increasingly complex models, as demonstrated by the failed ICA, wavelet, and LSTM approaches, feels…predictable. This paper merely confirms a harsh truth: elegance rarely survives contact with production systems. The insistence on market cap normalization as a surprisingly effective feature underscores a fundamental point about signal-to-noise ratio. It isn’t about finding the signal within complexity, but often about aggressively reducing the noise through thoughtful preprocessing. As Richard Feynman once said, ‘The first principle is that you must not fool yourself – and you are the easiest person to fool.’ The researchers didn’t fool themselves into believing more parameters equated to better prediction; they focused on a clean, interpretable signal, proving that sometimes, the simplest solution is the most robust – before someone inevitably complicates it again.
The Road Ahead
The persistence of signal-to-noise ratio as a dominant factor suggests a future less concerned with algorithmic novelty and more with rigorous data hygiene. The observed performance of normalized order flow data versus complex models isn’t a triumph of simplicity; it’s a demonstration that elaborate architectures often fail to address fundamental data quality issues. The field continues to chase diminishing returns in model complexity, while the low-hanging fruit of effective preprocessing remains largely unharvested.
Future work will inevitably explore variations on these themes – different normalization techniques, alternative feature engineering approaches – but the core challenge isn’t likely to be solved by a new activation function. The non-stationarity inherent in financial markets presents a moving target, and any predictive advantage gained through complex modeling will likely be temporary. The focus should shift from predicting the unpredictable to building systems robust enough to withstand inevitable prediction errors.
The current trajectory suggests a cyclical pattern: innovation creates complexity, complexity generates technical debt, and the eventual solution often involves reverting to more straightforward, interpretable methods. It is not a matter of abandoning machine learning, but of acknowledging its limitations and tempering expectations. The market doesn’t require elegant solutions; it demands functional ones – and often, the most functional solutions are the least glamorous.
Original article: https://arxiv.org/pdf/2601.07131.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/