Author: Denis Avetisyan
A critical look at how we evaluate deep learning models reveals that inconsistent practices are obscuring real progress in time series forecasting.

This review argues that a lack of standardized configurations and evaluation metrics hinders meaningful comparison of deep learning architectures for time series analysis.
Despite the proliferation of deep learning architectures for time series forecasting, inconsistent results and a lack of clear understanding hinder genuine progress in the field. This paper, ‘What Matters in Deep Learning for Time Series Forecasting?’, critically examines current practices, revealing that overlooked implementation details and flawed benchmarking often overshadow the true impact of architectural choices. We demonstrate that focusing on foundational forecasting principles, such as accounting for locality and globality, can yield surprisingly effective, even state-of-the-art, results with simpler models. Ultimately, the question remains: can a shift towards more rigorous evaluation and principled design unlock the full potential of deep learning for accurate and reliable time series prediction?
The Pervasive Need for Accurate Temporal Prediction
The ability to accurately forecast future values within time series data underpins critical operations across a remarkably broad spectrum of industries. In financial markets, predictive modeling informs investment strategies, risk assessment, and algorithmic trading, while in the energy sector, precise forecasting of demand and renewable resource availability is essential for grid stability and efficient resource allocation. Beyond these, accurate time series analysis drives logistical optimization in supply chain management, supports proactive maintenance scheduling in manufacturing, and even enhances the accuracy of weather prediction and climate modeling. Consequently, advancements in time series forecasting aren’t merely academic exercises; they represent tangible improvements with significant economic and societal impact, demanding continuous refinement of predictive methodologies and computational power.
Historically, time series forecasting relied heavily on statistical methods like ARIMA and exponential smoothing, techniques built on the assumption of linear relationships and stationary data. However, real-world time series frequently exhibit complex, non-linear dependencies – influenced by factors like chaotic systems, regime shifts, or external events – that these models struggle to capture accurately. Consequently, performance degrades substantially when attempting long-range forecasts, as even small initial errors are amplified over time due to the inability to model intricate interactions within the data. This limitation is particularly pronounced in areas such as financial markets or climate modeling, where long-term predictions are vital, and the underlying dynamics are inherently non-linear and subject to unpredictable influences.
The proliferation of data-rich time series across numerous fields demands forecasting methods capable of handling unprecedented volume and intricacy; however, established benchmarking procedures often fail to provide a reliable assessment of these techniques. Current practices frequently prioritize short-term predictive accuracy on limited datasets, neglecting the crucial ability of models to generalize to longer horizons or unseen complexities. This emphasis can artificially inflate reported performance, masking deficiencies in a model’s adaptability and leading to misguided conclusions about its true potential. Consequently, evaluations may favor algorithms that excel in controlled settings but falter when applied to the dynamic and often unpredictable realities of real-world time series, hindering progress towards genuinely robust and dependable forecasting solutions.
Deep Learning’s Emergence and the Transformer Architecture
Deep learning models, particularly recurrent neural networks (RNNs) and their variants like Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs), demonstrate improved performance in time series forecasting by automatically learning complex, non-linear relationships within the data. Traditional statistical methods, such as ARIMA, often require manual feature engineering and struggle with high-dimensional or multivariate time series. Deep learning techniques, however, can ingest raw time series data and learn hierarchical representations, effectively capturing long-range dependencies and intricate patterns that influence future values. This automated feature extraction and pattern recognition frequently leads to statistically significant improvements in forecasting accuracy, as measured by metrics such as Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE), particularly in scenarios involving substantial data volume and complexity.
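Because later sections argue that inconsistent use of these metrics distorts benchmarks, it is worth pinning down what they compute. Below is a minimal sketch, using hypothetical ground-truth and forecast arrays rather than any dataset from the paper:

```python
# Minimal sketch: the three error metrics named above, computed with NumPy.
# The y_true and y_pred values are hypothetical, for illustration only.
import numpy as np

y_true = np.array([112.0, 118.0, 132.0, 129.0, 121.0])
y_pred = np.array([110.0, 120.0, 128.0, 131.0, 119.0])

mae = np.mean(np.abs(y_true - y_pred))                      # Mean Absolute Error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))             # Root Mean Squared Error
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0  # Mean Absolute Percentage Error

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  MAPE={mape:.2f}%")
```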
The Transformer architecture, initially developed for sequence-to-sequence tasks in natural language processing, exhibits strong performance when applied to time series data. This is largely due to its ability to model long-range dependencies without the vanishing gradient problems inherent in recurrent neural networks. Unlike traditional methods, Transformers process the entire input sequence in parallel, leveraging self-attention mechanisms to weigh the importance of different time steps. Adaptations for time series often involve positional encoding to provide information about the order of observations and modifications to the attention mechanism to better suit univariate or multivariate time series forecasting. Empirical results demonstrate that Transformer-based models frequently outperform established techniques like ARIMA and LSTM, particularly in scenarios with extended forecast horizons and complex, non-linear dependencies.
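Because the Transformer processes the whole window in parallel, the order of observations must be injected explicitly. A minimal sketch of the standard sinusoidal positional encoding follows; the window length and model dimension are arbitrary illustrative choices, not settings from any particular forecasting model:

```python
# Sinusoidal positional encoding as in the original Transformer; returns one
# vector per time step so the model can distinguish positions in the window.
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix encoding each time step's position."""
    positions = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                            # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                         # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                         # odd dimensions
    return pe

pe = positional_encoding(seq_len=96, d_model=64)                  # 96-step lookback window
print(pe.shape)  # (96, 64)
```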
The attention mechanism is central to the performance of Transformer models on time series data: it enables the model to weight the importance of different input time steps when making predictions. Unlike recurrent neural networks, which process data sequentially, attention allows the model to directly access any part of the input sequence, identifying and prioritizing relevant historical data points. However, evaluating the efficacy of different Transformer-based time series models remains challenging; published benchmarks frequently lack standardization in data preprocessing, evaluation metrics, and model configurations, leading to potentially misleading performance comparisons and hindering objective assessment of architectural advancements.
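A minimal sketch of this weighting, assuming scaled dot-product self-attention over an already-embedded lookback window (the projection matrices and dimensions here are illustrative, not taken from the paper):

```python
# Scaled dot-product self-attention over the time axis: every time step attends
# to every other step in the window and receives a weighted mix of their values.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) window of embedded time steps."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (seq_len, seq_len) relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over time steps
    return weights @ v                                # weighted mix of historical values

rng = np.random.default_rng(0)
x = rng.normal(size=(96, 64))                         # hypothetical embedded lookback window
w_q, w_k, w_v = (rng.normal(size=(64, 64)) * 0.1 for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (96, 64)
```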
Refining the Transformer: Innovations in Efficiency
The standard Transformer architecture, while effective for sequence modeling, exhibits quadratic computational complexity with respect to sequence length, limiting its application to long time series. Models such as Informer and Autoformer mitigate this issue through architectural innovations. Informer reduces complexity by employing ProbSparse Attention, which selectively attends to important time steps based on probabilistic relevance, decreasing the number of attention calculations. Autoformer utilizes a decomposition block to isolate trend and seasonal components of the time series and incorporates an auto-correlation mechanism to capture dependencies within the series, thereby reducing the need for full attention over the entire sequence length and improving computational efficiency.
Informer improves computational efficiency by utilizing ProbSparse Attention, a mechanism that reduces the quadratic complexity of standard attention by predicting a sparse set of important attention weights; this is achieved through the use of probabilistic sparsity modeling. Autoformer, conversely, enhances efficiency and performance through two primary techniques: series decomposition, which breaks down the input time series into trend and seasonal components, and auto-correlation mechanisms. These auto-correlation mechanisms replace the dot-product attention with auto-correlation, reducing computational complexity and enabling better capture of temporal dependencies inherent in time series data.
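The decomposition idea can be illustrated with a very small sketch: a moving average extracts the trend, and the remainder is treated as the seasonal/residual component. This mirrors the decomposition block conceptually but is not Autoformer's actual implementation; the kernel size and synthetic series are assumptions for illustration:

```python
# Toy series decomposition: centered moving average -> trend, remainder -> seasonal.
import numpy as np

def decompose(series: np.ndarray, kernel: int = 25):
    """Split a 1-D series into (trend, seasonal) via a centered moving average."""
    pad = kernel // 2
    padded = np.pad(series, (pad, pad), mode="edge")   # keep output length equal to input
    trend = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")
    seasonal = series - trend
    return trend, seasonal

t = np.arange(200)
series = 0.05 * t + np.sin(2 * np.pi * t / 24) + np.random.default_rng(0).normal(0, 0.1, 200)
trend, seasonal = decompose(series)
print(trend.shape, seasonal.shape)  # (200,) (200,)
```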
PatchTST improves the efficiency and feature extraction of Transformer-based time series models by dividing the input time series into patches. This patching process reduces the sequence length, thereby lowering the computational complexity associated with the attention mechanism. While PatchTST consistently demonstrates performance gains, reported improvements can be misleading due to variations in experimental setups. Specifically, inconsistencies in evaluation metrics, such as the use of different forecasting horizons or error functions, and a lack of standardized hyperparameter configurations across studies contribute to overstated performance benefits. Careful consideration of these methodological factors is necessary when comparing PatchTST to other time series forecasting models.
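A minimal sketch of the patching step described above, with an illustrative patch length and stride rather than the settings used in any specific study:

```python
# Split a lookback window into fixed-length, possibly overlapping patches so that
# attention operates over patches ("tokens") instead of individual time steps.
import numpy as np

def patchify(window: np.ndarray, patch_len: int = 16, stride: int = 8) -> np.ndarray:
    """window: (seq_len,) -> (num_patches, patch_len)."""
    starts = range(0, len(window) - patch_len + 1, stride)
    return np.stack([window[s:s + patch_len] for s in starts])

window = np.random.default_rng(0).normal(size=336)    # hypothetical 336-step lookback
patches = patchify(window)
print(patches.shape)  # (41, 16): attention now scales with 41 tokens instead of 336
```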
Balancing Global Perspective with Local Adaptation
Time series modeling approaches vary significantly in their scope of training. Global models utilize a single model instance trained on the combined data of all time series within a dataset; this allows for the sharing of information and the capture of overarching patterns but can be computationally expensive and struggle with series exhibiting unique characteristics. Conversely, Local models construct and train an independent model for each individual time series, enabling specialization and adaptation to series-specific dynamics. However, this approach lacks the benefit of cross-series learning and can be inefficient when dealing with a large number of time series. The selection between these strategies depends on factors such as the degree of commonality among series, computational resources, and the desired level of model complexity.
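The distinction can be made concrete with a toy AR(1) fit, as in the sketch below: a single global slope estimated from all series pooled together, versus one local slope per series. The data are synthetic and purely illustrative:

```python
# Global vs. local training scope, illustrated with a least-squares AR(1) coefficient.
import numpy as np

rng = np.random.default_rng(0)
series_list = [np.cumsum(rng.normal(size=200)) for _ in range(5)]  # five toy series

def ar1_slope(y: np.ndarray) -> float:
    """Least-squares coefficient of y[t] on y[t-1]."""
    x, t = y[:-1], y[1:]
    return float(np.dot(x, t) / np.dot(x, x))

# Global: one parameter shared by every series (all lag pairs pooled together).
x_all = np.concatenate([y[:-1] for y in series_list])
t_all = np.concatenate([y[1:] for y in series_list])
global_slope = float(np.dot(x_all, t_all) / np.dot(x_all, x_all))

# Local: one parameter per series, free to specialize to that series alone.
local_slopes = [ar1_slope(y) for y in series_list]
print(global_slope, local_slopes)
```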
Hybrid forecasting models attempt to leverage the benefits of both global and local approaches by combining their predictions. These models typically involve training a global component on all time series to capture common patterns and trends, and then training local components specific to each individual time series to address unique characteristics and idiosyncrasies. Prediction is then achieved through a weighted average or more complex combination of the outputs from both the global and local models. This strategy aims to improve accuracy by generalizing across the dataset while still accommodating series-specific behavior, often outperforming either purely global or purely local models, particularly in datasets with diverse and complex temporal dependencies.
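In its simplest form, the hybrid prediction is a weighted blend of the two components, as sketched below; the forecasts and the weight are hypothetical, and in practice the weight would typically be tuned on a validation split or replaced by a learned combination:

```python
# Blending a global forecast with a series-specific local forecast.
import numpy as np

global_forecast = np.array([101.0, 103.5, 104.2])   # from the model trained on all series
local_forecast = np.array([99.0, 100.5, 101.0])     # from the model trained on this series only

alpha = 0.6                                          # trust placed in the global component
hybrid_forecast = alpha * global_forecast + (1 - alpha) * local_forecast
print(hybrid_forecast)                               # [100.2  102.3  102.92]
```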
Time series decomposition and the inclusion of calendar features as exogenous variables are preprocessing techniques used to improve forecasting accuracy by isolating and accounting for underlying temporal patterns. However, inconsistent application of these techniques across different time series within a benchmark significantly impacts comparative results; variations in decomposition methods (e.g., additive vs. multiplicative) or the specific calendar features included (e.g., day of week, holidays) introduce methodological differences that confound performance assessments. Therefore, rigorous standardization of preprocessing steps is crucial for ensuring fair and reproducible benchmarking of forecasting models, as even seemingly minor inconsistencies can lead to substantial variations in reported accuracy metrics.
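To make the point concrete, here is a minimal sketch of deriving calendar features with pandas; the particular features chosen (hour, day of week, month, weekend flag) are exactly the kind of detail that, the paragraph above argues, must be fixed consistently across a benchmark:

```python
# Calendar features as exogenous inputs for a hypothetical hourly series.
import pandas as pd

index = pd.date_range("2024-01-01", periods=96, freq="h")
calendar = pd.DataFrame({
    "hour": index.hour,
    "day_of_week": index.dayofweek,
    "month": index.month,
    "is_weekend": (index.dayofweek >= 5).astype(int),
}, index=index)
print(calendar.head())
```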
Expanding the Scope: Spatial Interplay and Systemic Understanding
Accurate time series forecasting hinges on understanding the intricate relationships that exist not only within a single data stream but also between multiple, interconnected series. Temporal processing focuses on identifying patterns and dependencies within a single time series – recognizing, for instance, that sales figures often peak during holiday seasons or that stock prices exhibit autocorrelation. However, real-world phenomena are rarely isolated; spatial processing acknowledges that these time series are often influenced by, and influence, each other. Consider weather patterns: temperature in one location is highly correlated with temperature in neighboring regions, and predicting rainfall accurately requires accounting for conditions across a broader geographic area. Successfully integrating both temporal and spatial dependencies allows forecasting models to move beyond simple extrapolation and capture the complex dynamics that drive real-world processes, leading to more robust and reliable predictions.
A frequent strategy for managing the complexity of Spatial Processing – analyzing relationships between multiple time series – involves treating each time series, or ‘channel,’ as independent. While this simplification streamlines calculations and reduces computational burden, it potentially overlooks crucial interactions where changes in one channel directly influence others. For example, demand for a product in one geographical region may be strongly correlated with promotional activities in a neighboring region – a connection lost if treated as independent entities. Consequently, models relying on channel independence may fail to capture the full dynamic range of the system, leading to suboptimal forecasts, particularly in scenarios where cross-channel dependencies are significant drivers of change. More sophisticated approaches aim to model these interdependencies, albeit at the cost of increased complexity and computational demand.
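The simplification amounts to a reshaping decision, sketched below with a hypothetical seven-channel window: the channel-independent view feeds each column to the model in isolation, while a channel-mixing view keeps the full matrix so a model can attend across channels:

```python
# Channel-independent vs. channel-mixing treatment of a multivariate window.
import numpy as np

window = np.random.default_rng(0).normal(size=(96, 7))   # (seq_len, n_channels)

# Channel-independent: 7 univariate sequences that never see each other.
independent_inputs = [window[:, c] for c in range(window.shape[1])]

# Channel-mixing: the full matrix, preserving cross-channel structure.
mixed_input = window
print(len(independent_inputs), independent_inputs[0].shape, mixed_input.shape)
```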
The integration of exogenous variables – encompassing external factors impacting a time series – consistently enhances both the precision and reliability of forecasting models. However, a nuanced challenge arises from the sensitivity of performance metrics like Mean Squared Error (MSE) and Mean Absolute Error (MAE) to seemingly minor variations in implementation. Differences in how these external variables are incorporated, alongside subtle changes in model configuration, such as regularization parameters or network architecture, can introduce substantial fluctuations in reported error rates across different benchmarking datasets. This variability complicates the accurate assessment of true forecasting performance, potentially masking genuine improvements or exaggerating minor differences between models and hindering meaningful comparisons of their underlying capabilities. Careful attention to standardization in implementation details and rigorous statistical analysis are therefore crucial for drawing valid conclusions regarding the efficacy of exogenous variable incorporation.
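One common way to incorporate such variables, sketched below as an assumption rather than the paper's pipeline, is simply to align the external covariates with the target history and concatenate them into one input matrix for the forecasting model:

```python
# Concatenating past target values with aligned exogenous covariates.
import numpy as np

rng = np.random.default_rng(0)
target_history = rng.normal(size=(96, 1))        # past values of the series itself
exogenous = rng.normal(size=(96, 3))             # e.g. temperature, price, calendar flags

model_input = np.concatenate([target_history, exogenous], axis=1)
print(model_input.shape)  # (96, 4): one target column plus three exogenous columns
```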
The pursuit of novel architectures in deep learning, as detailed in the paper, often overlooks the foundational importance of consistent methodology. This echoes Ken Thompson’s sentiment: “Turn off the machine. Find the vacuum tube.” The article rightfully points out that inconsistent preprocessing, hyperparameter tuning, and evaluation metrics create a fragmented landscape, obscuring genuine advancements in time series forecasting. Like a malfunctioning vacuum tube disrupting an entire system, these methodological flaws undermine the validity of comparative analyses. The paper emphasizes that a focus on structural integrity – a standardized approach – is crucial for building a reliable foundation for progress, allowing researchers to accurately assess the efficacy of new techniques and avoid being misled by superficial improvements.
The Road Ahead
The persistent inconsistencies in time series forecasting benchmarks reveal a fundamental truth: architecture is often mistaken for understanding. The field has focused on increasingly complex models, yet lacks a corresponding rigor in establishing baselines and controlling for confounding variables. The pursuit of novelty, while valuable, must be tempered by a commitment to reproducibility and a clear delineation of what constitutes genuine progress. Simply achieving a marginal improvement on a leaderboard, given the current state of affairs, offers little insight into the underlying principles at play.
Future work must prioritize the development of standardized evaluation protocols – not as a constraint on innovation, but as a necessary condition for it. This necessitates a shift in focus from purely predictive accuracy to a more holistic assessment of model behavior, including computational cost, robustness to noise, and interpretability. The current emphasis on end-to-end learning, while appealing in its simplicity, obscures the importance of feature engineering and domain expertise – elements that are likely to remain crucial for achieving truly scalable and reliable forecasting systems.
Ultimately, the true challenge lies not in building more elaborate models, but in developing a deeper understanding of the data itself. Dependencies are the true cost of freedom, and a reliance on opaque, black-box architectures will inevitably limit the field’s ability to adapt to new challenges and unexpected shifts in the underlying dynamics. Good architecture is invisible until it breaks, and a focus on elegant simplicity, rather than clever complexity, is likely to yield more enduring results.
Original article: https://arxiv.org/pdf/2512.22702.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/