Uncovering Hidden Signals: AI-Powered Factor Discovery in Finance

Author: Denis Avetisyan


A new framework uses the power of large language models to automatically identify and validate potentially profitable financial factors, offering a safer and more transparent approach to algorithmic trading.

The Hubble framework operates as a closed-loop system, iteratively refining factor expression generation through a three-layer Abstract Syntax Tree (AST) sandbox for validation, a statistical engine for performance evaluation, and a feedback mechanism that directs subsequent generation rounds, a process reflecting the inherent tendency of all systems to converge toward optimized states despite inevitable decay.

Hubble combines a domain-specific language, an abstract syntax tree sandbox, and evolutionary feedback loops to ensure 100% computational stability in automated alpha generation.

Despite the promise of quantitative finance, discovering consistently predictive alpha factors remains challenging due to complex search spaces and noisy data. This paper introduces ‘Hubble: An LLM-Driven Agentic Framework for Safe and Automated Alpha Factor Discovery’, a novel system leveraging Large Language Models within a constrained environment to generate and evaluate financial factors. Hubble achieves robust and interpretable results, demonstrating 100% computational stability and a peak composite score of 0.827, by combining LLM-driven generation with a domain-specific language and an Abstract Syntax Tree-based execution sandbox. Can this agentic framework unlock a new era of automated, reliable, and insightful factor discovery in quantitative finance?


The Inevitable Limits of Conventional Insight

The historical development of investment factors has been largely shaped by human insight, a methodology now facing inherent limitations. Early factor construction depended heavily on researchers manually identifying variables believed to predict asset returns – characteristics like value, momentum, or size. This process, while yielding valuable discoveries, is demonstrably slow, resource-intensive, and susceptible to cognitive biases. Analysts, even with extensive experience, may unconsciously prioritize easily interpretable relationships or favor data confirming pre-existing beliefs, potentially overlooking genuinely predictive, yet less intuitive, signals. Furthermore, manual feature engineering struggles to keep pace with evolving market dynamics, meaning factors identified through this method may exhibit performance decay as relationships shift and new information emerges. The reliance on human intuition, therefore, presents a bottleneck in the continuous pursuit of alpha and underscores the need for more systematic and adaptive approaches to factor discovery.

While automated factor discovery techniques like Genetic Programming offer a seemingly objective alternative to manual construction, their performance often plateaus when confronted with evolving market dynamics. These algorithms, though capable of identifying patterns within historical data, frequently struggle to generalize beyond the specific conditions used during their training period. The inherent rigidity of their evolved solutions means they may fail to adapt to shifts in investor behavior, economic cycles, or even subtle changes in data reporting. Consequently, factors discovered through these methods can exhibit significant out-of-sample performance degradation, requiring frequent retraining and limiting their long-term viability as robust investment strategies. The challenge lies not simply in finding a pattern, but in identifying patterns with lasting predictive power, a feat demanding adaptability that current Genetic Programming implementations often lack.

The challenge of identifying predictive factors in financial markets isn’t simply a matter of finding a needle in a haystack; it’s akin to searching within an infinitely expanding, multi-dimensional space. The sheer number of potential combinations, derived from countless variables, timeframes, and transformations, creates a search space so vast that exhaustive, or ‘brute-force’, methods become computationally impractical and statistically unreliable. This complexity demands innovative exploration strategies: algorithms capable of intelligently navigating this landscape, prioritizing promising areas, and efficiently discarding unproductive ones. Techniques like genetic algorithms, reinforcement learning, and tree-based search methods offer potential solutions by mimicking natural evolutionary processes or focusing exploration based on observed rewards, ultimately increasing the likelihood of discovering genuinely novel and robust factors beyond those identified through traditional, manual approaches.

Composite scores evolved across mining rounds, peaking at 0.827 in round 1, displaying increased variance during the more exploratory round 2, and finally converging toward stable factor structures in round 3.

Leveraging Language Models for Expressive Factor Construction

Large Language Models (LLMs) offer a significant advantage in factor expression generation by automating the creation of numerous candidate factors, thereby substantially expanding the search space beyond what is feasible with manual or rule-based approaches. Traditional methods often rely on pre-defined templates or limited combinations of existing factors, while LLMs, when provided with relevant financial data and contextual information, can generate a diverse set of potential factors representing complex relationships between variables. This capability is particularly valuable in identifying non-obvious or novel factors that might otherwise be overlooked. The scale of factor generation enabled by LLMs allows for more comprehensive testing and validation, increasing the probability of discovering statistically significant and economically meaningful factors.

Employing a Domain-Specific Language (DSL) as a constraint during Large Language Model (LLM) factor expression generation is critical for producing usable results. Without a DSL, LLMs may generate syntactically incorrect or financially nonsensical expressions. A DSL defines the allowable functions, operators, and data types – for example, specifying that factors must utilize only approved risk metrics or adhere to particular valuation models. This ensures computational validity, preventing errors during backtesting or implementation. Furthermore, a DSL enforces financial meaning by restricting the LLM to combinations of variables and functions that represent legitimate investment strategies or risk assessments, thereby significantly reducing the need for manual validation and correction of generated factors.
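As a minimal sketch of this kind of gate, the Python snippet below validates candidate expressions against a whitelist of operators and data fields. The operator and field names are hypothetical illustrations, not Hubble's actual DSL.

```python
# Toy DSL validator: accept an expression only if every function call and
# identifier belongs to an approved vocabulary. The whitelists below are
# invented for illustration.
import ast

ALLOWED_FUNCS = {"rank", "ts_mean", "ts_std", "delay", "log"}   # hypothetical operators
ALLOWED_NAMES = {"close", "open", "high", "low", "volume"}      # hypothetical data fields

def is_valid_factor(expr: str) -> bool:
    """Return True only if `expr` parses and uses whitelisted symbols."""
    try:
        tree = ast.parse(expr, mode="eval")
    except SyntaxError:
        return False                       # malformed expressions are rejected outright
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            # Only plain calls to approved operators are allowed.
            if not (isinstance(node.func, ast.Name) and node.func.id in ALLOWED_FUNCS):
                return False
        elif isinstance(node, ast.Name):
            if node.id not in ALLOWED_FUNCS | ALLOWED_NAMES:
                return False
    return True

print(is_valid_factor("rank(ts_mean(close, 20) / delay(close, 5))"))  # True
print(is_valid_factor("__import__('os').system('ls')"))               # False
```

Because validation happens on the syntax tree rather than on output strings, the check rejects both financially meaningless symbols and attempts to escape the sandbox vocabulary.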

Tree-of-Thought (ToT) prompting enhances Large Language Model (LLM) performance in factor expression generation by decomposing the problem into a series of intermediate reasoning steps. Rather than directly requesting a final expression, ToT prompts the LLM to generate multiple potential reasoning paths – “thoughts” – at each step, evaluating each thought before selecting the most promising one to continue the process. This iterative approach, where the LLM explicitly outlines its reasoning, facilitates the creation of more complex and accurate expressions compared to single-step prompting methods. The evaluation stage often involves predefined criteria or a separate evaluation model to assess the validity and financial relevance of each thought, ensuring that the generated expressions adhere to the required constraints and improve overall solution quality.
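The thought-expand-evaluate-prune pattern can be sketched generically. In the snippet below, `propose` and `score` are hypothetical stand-ins for the LLM's thought generation and the evaluation stage; this illustrates the search pattern, not the paper's prompting code.

```python
# Generic Tree-of-Thought skeleton: expand partial solutions, score each
# intermediate "thought", and keep only the best `beam` at every depth.
def tree_of_thought(root: str, propose, score, depth: int = 3, beam: int = 2):
    """Beam search over reasoning steps; `propose` expands a thought,
    `score` ranks candidates (higher is better)."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for thought in frontier for child in propose(thought)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam]        # prune to the most promising thoughts
    return max(frontier, key=score)

# Toy stand-ins: expansions append a character, scoring prefers longer strings.
best = tree_of_thought("f", lambda t: [t + "a", t + "b"], len)
print(best)  # "faaa" (ties broken by first-seen order)
```

In the framework described here, `propose` would be an LLM call constrained by the DSL and `score` a cheap validity or relevance check, with the full statistical pipeline reserved for surviving candidates.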

The system iteratively refines candidate expressions by leveraging LLM generation, sandbox evaluation, and statistical analysis, using performance feedback to guide subsequent iterations.

Rigorous Validation: Ensuring Computational Integrity

The Abstract Syntax Tree (AST) Sandbox employs a three-layer validation system to guarantee the computational integrity of generated factor expressions. The first layer consists of static type checking, verifying that all operations are applied to compatible data types before execution. The second layer implements a symbolic execution engine, which analyzes the AST to identify potential runtime errors, such as division by zero or out-of-bounds array access, without actually running the code. Finally, a constrained execution environment forms the third layer, limiting resource usage, including CPU time, memory allocation, and I/O operations, during runtime to prevent denial-of-service vulnerabilities or system instability, even if a previously undetected error occurs. This multi-faceted approach ensures both the correctness and safety of all generated factors before deployment.
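A toy version of this layering might look as follows. The paper's sandbox internals are not reproduced here, so the specific checks and the one-second CPU budget are illustrative assumptions (the timeout uses Unix signals and therefore only runs on POSIX systems).

```python
# Illustrative three-layer sandbox: parse, static scan, then constrained
# execution with a time budget. All limits and checks are toy assumptions.
import ast
import signal

def layer1_syntax(expr: str) -> ast.AST:
    """Layer 1: reject malformed input before anything else runs."""
    return ast.parse(expr, mode="eval")

def layer2_static(tree: ast.AST) -> None:
    """Layer 2: static scan for obvious runtime hazards, here a literal
    division by zero, caught without executing the expression."""
    for node in ast.walk(tree):
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Div):
            right = node.right
            if isinstance(right, ast.Constant) and right.value == 0:
                raise ValueError("literal division by zero")

def layer3_execute(expr: str, env: dict, timeout_s: int = 1):
    """Layer 3: run with no builtins and a hard wall-clock cap."""
    def _abort(signum, frame):
        raise TimeoutError("factor exceeded CPU budget")
    signal.signal(signal.SIGALRM, _abort)
    signal.alarm(timeout_s)
    try:
        code = compile(ast.parse(expr, mode="eval"), "<factor>", "eval")
        return eval(code, {"__builtins__": {}}, env)
    finally:
        signal.alarm(0)                     # always clear the alarm

expr = "(close - open) / close"
layer2_static(layer1_syntax(expr))
print(layer3_execute(expr, {"close": 10.0, "open": 9.5}))  # 0.05
```

The key property, mirrored from the description above, is that each layer fails closed: an expression must clear parsing and static analysis before it is ever executed, and execution itself is budgeted.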

The Statistical Evaluation Pipeline employs quantitative metrics to rigorously assess factor performance. RankIC, or Rank Information Coefficient, measures a factor’s ability to correctly rank future observations, with higher values indicating stronger predictive power. The Information Ratio (IR) evaluates risk-adjusted returns by dividing the factor’s excess return over a benchmark by its tracking error, the standard deviation of the difference between the factor’s returns and the benchmark’s returns: IR = \frac{\mu_f - \mu_b}{\sigma_{f-b}}, where \mu_f is the factor’s average return, \mu_b is the benchmark’s average return, and \sigma_{f-b} is the standard deviation of their difference. An IR exceeding 0.5 is generally considered indicative of a skilled factor, while values above 1.0 suggest exceptional performance; these thresholds are context-dependent, however, and must be weighed alongside other metrics and statistical significance testing.
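A minimal Python illustration of both metrics, assuming daily series with no tied factor values and an assumed annualization convention of 252 trading days:

```python
# Sketches of RankIC (Spearman correlation of ranks) and the annualized
# Information Ratio; both conventions here are common assumptions, not
# the paper's exact implementation.
import numpy as np

def rank_ic(factor: np.ndarray, fwd_ret: np.ndarray) -> float:
    """Spearman correlation between factor values and next-period returns.
    Valid when there are no ties: a double argsort yields the ranks."""
    rf = factor.argsort().argsort()
    rr = fwd_ret.argsort().argsort()
    return float(np.corrcoef(rf, rr)[0, 1])

def information_ratio(f_ret: np.ndarray, b_ret: np.ndarray,
                      periods: int = 252) -> float:
    """Annualized IR: mean active return divided by tracking error."""
    active = f_ret - b_ret
    return float(active.mean() / active.std(ddof=1) * np.sqrt(periods))

# A perfectly monotone factor ranks future returns exactly.
print(round(rank_ic(np.arange(10.0), np.arange(10.0) ** 3), 6))  # 1.0
```

Note that RankIC is invariant to any monotone transformation of the factor, which is exactly why it is preferred over plain correlation for ranking-based strategies.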

Monitoring turnover is crucial for evaluating the practical feasibility of a factor-based trading strategy. Turnover, calculated as the total value of assets traded over a period divided by the total value of assets under management, directly indicates trading frequency. High turnover implies more frequent trades, which translate to increased transaction costs – including brokerage fees and potential market impact – that can significantly erode profitability. Conversely, low turnover suggests a more passive investment approach with lower associated costs. Analyzing turnover alongside performance metrics allows for a realistic assessment of a factor’s risk-adjusted returns and helps determine if the anticipated benefits outweigh the practical implementation costs. Factors with strong statistical performance but excessively high turnover may be unsuitable for certain investment contexts or require careful optimization to reduce trading frequency.
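A one-period version of this calculation from portfolio weights might be sketched as follows; the one-sided (halved) convention below is an assumption, since buys and sells are otherwise double-counted relative to the "value traded over value held" definition in the text.

```python
# One-period portfolio turnover from weight vectors; the 0.5 factor is the
# common one-sided convention (an assumption, not taken from the paper).
import numpy as np

def turnover(w_prev: np.ndarray, w_new: np.ndarray) -> float:
    """Half the total absolute weight change between rebalances."""
    return 0.5 * float(np.abs(w_new - w_prev).sum())

w0 = np.array([0.5, 0.3, 0.2])   # weights before rebalancing
w1 = np.array([0.2, 0.5, 0.3])   # weights after rebalancing
print(round(turnover(w0, w1), 6))  # 0.3
```

Multiplying per-period turnover by an estimated round-trip cost gives a quick drag estimate to subtract from gross factor returns before comparing strategies.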

The top five factors demonstrate a divergence in behavior: the first two, f1 and f2, exhibit high annualized Information Ratios and high turnover, suggesting short-term momentum, while the remaining three, f3 through f5, display lower turnover and more stable signal structures.

Hubble: An Automated System for Factor Discovery

The Hubble Framework utilizes a multi-stage process for systematic factor discovery, beginning with Large Language Model (LLM) generation of potential factors expressed as quantifiable trading rules. This generative component is coupled with deterministic safeguards designed to prevent the creation of syntactically invalid or logically inconsistent expressions. Following generation, a rigorous statistical evaluation pipeline assesses each candidate factor’s performance on historical market data, calculating key metrics such as the Information Ratio and conducting robustness testing. This pipeline serves as a filter, identifying factors that meet pre-defined performance thresholds and possess statistical validity before further consideration.

Hubble utilizes evolutionary feedback loops to iteratively improve Large Language Model (LLM) factor generation. Following each round of factor creation and backtesting, performance metrics – specifically, the annualized Information Ratio – are used to assess the generated factors. This performance data is then fed back into the LLM as a training signal, adjusting the model’s parameters to increase the probability of generating higher-performing factors in subsequent iterations. This process, repeated across multiple rounds of experimentation, allows the LLM to learn from its previous outputs, effectively optimizing its factor generation strategy and increasing the likelihood of discovering robust and statistically significant trading signals.
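The round-by-round loop described above can be sketched abstractly. Here `generate` and `backtest_ir` are hypothetical stand-ins for the LLM call and the statistical pipeline, and the elite-retention size is an assumption.

```python
# Skeleton of an evolutionary feedback loop: each round's generation is
# conditioned on the best factors discovered so far.
def mine_factors(generate, backtest_ir, rounds: int = 3, keep: int = 5):
    """Iterate generation -> scoring -> elite selection for `rounds` rounds."""
    elite: list[tuple[float, str]] = []        # (score, expression), best first
    for _ in range(rounds):
        # Feed the current elite expressions back as context for generation.
        candidates = generate(feedback=[expr for _, expr in elite])
        scored = [(backtest_ir(expr), expr) for expr in candidates]
        elite = sorted(elite + scored, reverse=True)[:keep]
    return elite

# Toy run: "generation" extends the best prior expression, "IR" rewards length.
result = mine_factors(
    generate=lambda feedback: [(feedback[0] if feedback else "x") + "y"],
    backtest_ir=len,
)
print(result[0][1])  # "xyyy"
```

In a real deployment, `backtest_ir` would be the full sandbox-plus-statistics pipeline, so only expressions that survive validation ever enter the elite set.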

Hubble consistently maintained computational stability throughout all experimental runs. During testing on a panel of 30 U.S. equities over a 752 trading day period, the system generated factors exhibiting annualized Information Ratios exceeding 1.0. This performance metric indicates a statistically significant risk-adjusted return for the generated factors, demonstrating the system’s capacity to identify potentially profitable trading strategies. The generated factors were also designed to be interpretable, allowing for human understanding of the underlying investment logic driving the identified signals.

The Hubble system’s validation pipeline achieved a 97.3% pass rate for generated factor expressions, indicating a high degree of reliability in translating natural language prompts into quantitatively testable signals. This validation process incorporates a series of deterministic checks ensuring the generated expressions are syntactically correct, free of division-by-zero errors, and utilize only permissible data elements. The high pass rate was observed across 122 unique candidate factors processed during experimentation, confirming the pipeline’s ability to consistently identify and reject invalid or poorly formed expressions before backtesting, thereby contributing to the overall system stability and reducing the potential for erroneous factor evaluation.

Across three rounds of experimentation, the Hubble system processed a total of 122 unique candidate factors. This processing encompassed the full pipeline, including factor generation via the Large Language Model, deterministic validation to ensure computational stability, and rigorous statistical evaluation using historical trading data. The 122 factors represent the complete set of expressions successfully generated and subjected to evaluation criteria during the experimental period, providing a comprehensive dataset for performance analysis and system refinement.

The pipeline’s OK rate, the percentage of evaluated formulas satisfying all constraints, increased to 100% across rounds 2 and 3, indicating that the feedback mechanism effectively refined the large language model’s adherence to the domain-specific language, while maintaining a consistent candidate count of approximately 40 per round.

Scalability and the Pursuit of Adaptive Systems

The architecture underpinning this automated factor generation system is designed for broad applicability beyond equities. The same iterative process of DSL-based strategy construction, LLM-driven refinement, and rigorous backtesting can be readily adapted to diverse asset classes, including fixed income, commodities, and foreign exchange. This portability stems from the framework’s abstraction of market microstructure and reliance on a generalized representation of financial data. Furthermore, the system’s evaluation metrics are not specific to any single market, allowing for consistent performance assessment across different environments. Consequently, the existing infrastructure offers a scalable pathway for rapidly deploying automated investment strategies in new and previously inaccessible markets, potentially unlocking alpha opportunities across the financial landscape.

The capacity for innovation within the framework hinges on a continually evolving Domain-Specific Language (DSL). By introducing novel operators and functionalities to this DSL, the system’s ability to articulate complex investment strategies is significantly amplified. This expansion doesn’t merely add features; it fundamentally alters the landscape of possible factors the Large Language Model (LLM) can generate, fostering a greater diversity of approaches to asset evaluation. A richer DSL allows for the expression of more nuanced conditions, intricate relationships between data points, and ultimately, the discovery of previously unattainable alpha signals, promising a more adaptable and resilient factor library capable of navigating evolving market dynamics.

A critical challenge in quantitative finance is the phenomenon of alpha decay, where initially profitable factors lose their predictive power over time. Integrating techniques like regularized exploration, as pioneered by AlphaAgent, offers a potential solution by systematically diversifying the search for new, robust factors. This approach doesn’t rely on simply discovering factors with immediate returns, but rather prioritizes those exhibiting consistent performance across varying market conditions and time horizons. By penalizing overly complex or specialized factors, regularized exploration encourages the development of a factor library that is less susceptible to overfitting and more resilient to changing market dynamics, ultimately improving the long-term sustainability of any automated investment strategy. This proactive approach to factor maintenance promises to enhance the overall robustness and adaptability of the system, ensuring continued performance even as market conditions evolve.

The pursuit of automated alpha generation, as detailed within this framework, echoes a fundamental truth about all systems: they are inherently subject to entropy. Hubble, with its emphasis on computational stability and a constrained domain-specific language, attempts to mitigate this decay: to engineer a system that ages gracefully rather than collapsing under unforeseen circumstances. This resonates with a sentiment expressed by Carl Friedrich Gauss: “Few things are more deceptive than a seemingly simple problem.” The framework’s rigorous approach to validation and the AST sandbox are not merely technical hurdles; they represent an acknowledgement of the complex, often hidden, vulnerabilities within financial models, a proactive measure against the inevitable march of time and the potential for unforeseen errors.

What Lies Ahead?

The pursuit of automated alpha generation, as demonstrated by Hubble, isn’t about conquering market inefficiency; it’s about documenting the inevitable erosion of signal. Each discovered factor represents a fleeting regularity, a temporary respite from the underlying chaos. The framework’s success in achieving computational stability is noteworthy, but stability is merely a pause, not a prevention, of eventual decay. The system’s chronicle, the logged factor lineage, becomes increasingly important as each generation yields diminishing returns, a testament to the relentless pressure of adaptation.

Future iterations will undoubtedly focus on extending the lifespan of these factors, perhaps through dynamic re-weighting or the discovery of meta-factors: rules governing factor behavior. However, a more fundamental question lingers: at what point does the cost of maintaining these increasingly fragile signals outweigh the benefit? The deployment of each new factor is a moment on the timeline, but the true challenge lies in anticipating the moment of its obsolescence.

The ultimate limitation isn’t computational, but epistemological. Hubble, and systems like it, can identify what is working, but offer little insight into why. The search for explanatory power, for an understanding of the underlying market dynamics, remains the elusive horizon. The framework’s evolution will likely be defined not by increasingly sophisticated factor discovery, but by increasingly accurate predictions of factor mortality.


Original article: https://arxiv.org/pdf/2604.09601.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-04-14 13:10