Forecasting Network Traffic: Which Deep Learning Model Reigns Supreme?

Author: Denis Avetisyan


A rigorous new study benchmarks advanced deep learning architectures to identify the most accurate and efficient approaches for predicting network traffic patterns.

Across diverse datasets, the study demonstrates that model performance invariably trades off against practical efficiency, measured in training time, model size, and energy consumption, with Pareto-optimal models representing the best achievable balance for each dataset and metric when evaluated on the full dataset.

This systematic evaluation reveals that simpler Multi-Layer Perceptrons and Transformer networks utilizing patching techniques deliver the best balance of performance, data efficiency, and resource utilization for time series forecasting.

Accurately forecasting network traffic remains a critical yet challenging task for modern network management, despite advances in deep learning. This study, ‘Which Deep Learner? A Systematic Evaluation of Advanced Deep Forecasting Models Accuracy and Efficiency for Network Traffic Prediction’, systematically benchmarks twelve advanced time series forecasting models, ranging from transformers to traditional deep learning approaches, across diverse network datasets and timescales. Results demonstrate that simpler multilayer perceptron architectures and transformer networks leveraging patching techniques offer the most compelling balance of accuracy, data efficiency, and resource utilization. Will these findings catalyze a shift towards more pragmatic and efficient deep learning deployments for real-world network traffic prediction?


The Limits of Tradition: Why Time Series Forecasting Needed a Rewrite

Historically, accurate forecasting of time series data, from stock prices to weather patterns, relied on statistical models like ARIMA and exponential smoothing. However, these methods often falter when confronted with the intricate, non-linear dependencies inherent in many real-world temporal datasets. Traditional approaches assume relatively simple relationships and struggle to discern patterns spanning extended periods, a limitation often described as a ‘short-term memory’ problem. Consequently, forecasts generated by these models can exhibit significant errors, particularly when predicting events influenced by factors occurring far in the past. This inability to effectively capture complex temporal dependencies (the interplay between current data and historical influences) presents a substantial challenge for applications demanding precise and reliable predictions.

Initially designed to revolutionize the field of Natural Language Processing, Transformer models have proven remarkably adaptable to the challenges of time series forecasting. Traditional statistical methods often falter when confronted with data exhibiting complex, non-linear patterns and long-range dependencies – where events distant in time significantly influence future outcomes. Transformers, however, excel at capturing these intricate relationships thanks to their core architecture, which eschews sequential processing for parallel computation. This allows the model to consider all data points simultaneously, identifying subtle connections and dependencies that would be missed by methods constrained to processing data in order. Consequently, Transformer models are increasingly employed to predict trends in areas like finance, energy demand, and transportation, demonstrating a powerful shift in how sequential data is analyzed and forecasted.

The core innovation enabling Transformer success in time series forecasting lies within its self-attention mechanism. Unlike traditional recurrent or convolutional networks that process data sequentially or with limited receptive fields, self-attention allows the model to directly assess the relationships between all time steps within a sequence. This capability is crucial for identifying subtle, long-range dependencies – for example, how traffic patterns from a week ago might influence congestion today. By assigning varying weights to each time step based on its relevance to the current prediction, the model effectively prioritizes important historical data while downplaying noise. This dynamic weighting process enables Transformers to capture nuanced temporal features and overall network dynamics with a precision previously unattainable, leading to significantly improved forecasting accuracy in complex time series datasets.
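To make the mechanism concrete, here is a minimal NumPy sketch of scaled dot-product self-attention applied across the time axis of a sequence. The sequence length, embedding size, and random projection matrices are purely illustrative assumptions, not taken from any model in the study; the point is only that every time step computes a relevance weight over every other time step.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over the time axis.

    x: (seq_len, d_model) sequence of time-step embeddings.
    w_q, w_k, w_v: (d_model, d_model) projection matrices.
    Returns the attended sequence and the attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[-1])          # (seq_len, seq_len) pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each step weights all others
    return weights @ v, weights

# Toy example: 168 hourly steps (one week) embedded in 16 dimensions.
rng = np.random.default_rng(0)
seq_len, d_model = 168, 16
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)   # (168, 16) (168, 168): every step attends to every other
```

The (168, 168) weight matrix is exactly the quadratic cost discussed in the next section: doubling the lookback quadruples it.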

Network traffic time series exhibit multi-scale characteristics, displaying periodicity and fluctuations at hourly and 10-minute intervals, anomalies at 5-minute intervals, missing data, and variations correlated with external factors like windspeed.

PatchTST: A Necessary Compromise

PatchTST mitigates the computational complexity associated with processing lengthy time series data by dividing the input into a sequence of non-overlapping patches. This approach effectively reduces the sequence length presented to the Transformer model; instead of processing the entire time series at once, PatchTST processes a series of shorter, fixed-length patches. This reduction in sequence length directly translates to a decrease in the quadratic computational cost – O(n^2) – inherent in the self-attention mechanism of Transformers, where ‘n’ represents the sequence length. By operating on these patches, PatchTST maintains the benefits of Transformer architectures while achieving improved scalability and reduced memory requirements for long-range time series forecasting.

PatchTST utilizes a patch-based representation of the input time series to address the quadratic computational complexity inherent in Transformer models when processing long sequences. By dividing the time series into smaller, non-overlapping patches, the effective sequence length presented to the Transformer is reduced. This reduction in sequence length directly translates to lower computational costs for both training and inference, enabling scalability to longer time series. The Transformer architecture is then applied to these patches, allowing PatchTST to retain the benefits of attention mechanisms – such as capturing long-range dependencies – while significantly improving efficiency compared to processing the entire sequence at once.
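As a rough illustration of why patching helps, the sketch below splits a toy series into fixed-length, non-overlapping patches and compares the size of the resulting attention matrix. The lookback window, patch length, and stride are assumptions chosen for the example, not the configurations benchmarked in the paper.

```python
import numpy as np

def to_patches(series, patch_len, stride):
    """Split a 1-D series into fixed-length patches (the tokens fed to the Transformer)."""
    starts = range(0, len(series) - patch_len + 1, stride)
    return np.stack([series[s:s + patch_len] for s in starts])

lookback = 512                                           # illustrative input length
series = np.sin(np.arange(lookback) * 2 * np.pi / 24)    # toy daily-periodic signal
patches = to_patches(series, patch_len=16, stride=16)    # non-overlapping patches

print(series.shape, patches.shape)   # (512,) -> (32, 16): 32 tokens instead of 512
print(f"attention cost ~ {lookback**2} vs {patches.shape[0]**2}")   # 262144 vs 1024
```

Because attention now operates over 32 patch tokens rather than 512 raw time steps, the quadratic term shrinks by two orders of magnitude while each token still carries local temporal detail.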

Evaluations of PatchTST across multiple time series datasets have established state-of-the-art forecasting performance. Specifically, the model demonstrates a 3x improvement in data efficiency when contrasted with conventional sequential models. This increased efficiency is measured by the model’s ability to achieve comparable or superior accuracy with a significantly reduced volume of training data, indicating a more effective utilization of available information and faster convergence during the training process.

A comparison of model performance across 11 datasets reveals variations in accuracy, as measured by Normalized Root Mean Squared Error (NRMSE) with standard deviation indicated by error bars, and highlights differences between models released at different times.

Autoformer: Decomposing the Problem, and Hoping for the Best

Autoformer utilizes a decomposition strategy to separate a time series into its constituent trend and seasonal components as part of forecasting. This decomposition is performed by built-in series decomposition blocks that extract the long-term trend with a moving average and treat the remainder as the seasonal component, effectively isolating the slowly varying trajectory from the periodic patterns. By modeling these components separately, Autoformer aims to improve forecasting accuracy, as distinct representations allow the model to capture and extrapolate each signal more effectively. The trend and seasonal components are then recombined during forecasting to produce the final prediction. This approach contrasts with methods that model the entire time series directly, without explicitly isolating these fundamental characteristics.
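A minimal sketch of this kind of decomposition is shown below, assuming a simple centred moving average as the trend extractor and an illustrative kernel size; it is a simplification of the idea behind Autoformer's decomposition blocks, not a reproduction of them.

```python
import numpy as np

def series_decomp(x, kernel_size=25):
    """Split a series into trend and seasonal parts via a moving average.

    The trend is a centred moving average (edges padded by replication);
    the seasonal component is the residual. kernel_size is illustrative.
    """
    pad = kernel_size // 2
    padded = np.concatenate([np.full(pad, x[0]), x, np.full(pad, x[-1])])
    kernel = np.ones(kernel_size) / kernel_size
    trend = np.convolve(padded, kernel, mode="valid")   # same length as x
    seasonal = x - trend
    return trend, seasonal

t = np.arange(720)                                  # 720 hourly points (30 days), toy data
x = 0.01 * t + np.sin(2 * np.pi * t / 24)           # slow trend plus daily seasonality
trend, seasonal = series_decomp(x)
print(trend.shape, seasonal.shape)                  # (720,) (720,)
```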

In place of standard self-attention, Autoformer employs an Auto-Correlation mechanism that discovers period-based dependencies: it estimates the autocorrelation of the series to find the most informative time delays and then aggregates the similar sub-series found at those lags. By comparing and combining whole sub-series rather than individual time points, the model captures recurring seasonal structure at the level of full periods. This allows Autoformer to learn relationships between repetitions of the same pattern, considering both the overall trajectory and the recurring seasonal components of the time series.
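The period-selection idea can be sketched with a basic autocorrelation computed via the FFT, as below. The toy series, lag range, and number of returned lags are illustrative assumptions, and the snippet only mimics the top-lag selection step, not Autoformer's full sub-series aggregation.

```python
import numpy as np

def top_period_lags(x, k=3):
    """Pick the lags with the highest autocorrelation as candidate periods,
    computing the autocorrelation efficiently via the FFT (Wiener-Khinchin)."""
    x = x - x.mean()
    n = len(x)
    spec = np.fft.rfft(x, n=2 * n)                 # zero-pad to avoid circular wrap-around
    acf = np.fft.irfft(spec * np.conj(spec))[:n]   # autocorrelation for lags 0..n-1
    acf /= acf[0]                                  # normalise so lag 0 equals 1
    candidates = acf[1:n // 2]                     # ignore lag 0 and very long lags
    lags = np.argsort(candidates)[-k:] + 1         # +1 restores the true lag index
    return sorted(lags.tolist())

# Toy hourly series with a daily cycle plus noise; the top lags cluster
# at or near multiples of the 24-hour period.
hours = np.arange(24 * 7 * 4)
x = np.sin(2 * np.pi * hours / 24) + 0.3 * np.random.default_rng(1).normal(size=hours.size)
print(top_period_lags(x))
```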

Autoformer demonstrates performance competitive with PatchTST while offering an improved trade-off between accuracy and computational cost when compared to DLinear. Empirical evaluation across a range of datasets and experimental configurations revealed that Autoformer achieved Pareto optimality in 83% of tested combinations of resource allocation and forecasting timescales. This indicates that, across the majority of tested scenarios, no other model could simultaneously achieve higher accuracy with lower computational requirements than Autoformer, suggesting its efficiency in balancing performance and resource utilization for time series forecasting tasks.
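For readers unfamiliar with the criterion, the short sketch below shows how a Pareto-optimal set over (error, cost) pairs can be identified. The model names and numbers are placeholders for illustration only, not results from the study.

```python
def pareto_front(models):
    """Return models not dominated on (error, cost): no other model is
    at least as good on both axes and strictly better on at least one."""
    front = []
    for name, err, cost in models:
        dominated = any(
            (e <= err and c <= cost) and (e < err or c < cost)
            for n, e, c in models if n != name
        )
        if not dominated:
            front.append(name)
    return front

# Placeholder (error, training-cost) values purely for illustration.
candidates = [
    ("model_a", 0.12, 5.0),
    ("model_b", 0.10, 9.0),
    ("model_c", 0.15, 2.0),
    ("model_d", 0.14, 6.0),   # dominated by model_a (worse error, higher cost)
]
print(pareto_front(candidates))   # ['model_a', 'model_b', 'model_c']
```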

Continuous wavelet transforms (CWT) of traffic datasets reveal time-frequency distributions of dominant daily and weekly frequencies, as confirmed by corresponding Fast Fourier Transforms (FFT) displaying marked peak frequencies.
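The same kind of spectral check can be reproduced on a toy series: the sketch below builds an hourly signal with daily and weekly cycles and recovers both periods from the FFT peaks. The synthetic series and sampling assumptions are illustrative, not the datasets analysed in the paper.

```python
import numpy as np

# Toy hourly "traffic" series with daily and weekly periodicity plus noise.
hours = np.arange(24 * 7 * 8)                              # eight weeks of hourly samples
traffic = (np.sin(2 * np.pi * hours / 24)                  # daily cycle
           + 0.5 * np.sin(2 * np.pi * hours / (24 * 7))    # weekly cycle
           + 0.1 * np.random.default_rng(0).normal(size=hours.size))

spectrum = np.abs(np.fft.rfft(traffic - traffic.mean()))
freqs = np.fft.rfftfreq(hours.size, d=1.0)                 # cycles per hour

# Report the two strongest peaks as periods in hours (expect ~24 and ~168).
top = np.argsort(spectrum)[-2:]
for idx in sorted(top):
    print(f"period ~ {1 / freqs[idx]:.1f} hours")
```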

The pursuit of ever-more-complex forecasting models feels… predictable. This research, diligently comparing various deep learning architectures, confirms a suspicion: the most elegant solution isn’t always the most practical. The finding that simpler MLPs and patched transformers achieve competitive results underscores a recurring truth – diminishing returns are inevitable. It’s a reminder that resource utilization and data efficiency often trump theoretical sophistication. As Ada Lovelace observed, “That brain of mine is something more than merely mortal; as time will show.” The ‘mortal’ part, of course, being the inevitable operational cost of chasing perfect, yet ultimately brittle, models. The article’s focus on balancing accuracy with practicality feels less like a breakthrough and more like acknowledging the constraints of reality – production systems don’t reward theoretical elegance; they reward stability and cost-effectiveness.

What Comes Next?

The pursuit of ever-more-complex architectures for network traffic prediction will undoubtedly continue. Yet, this work suggests the diminishing returns are already asserting themselves. The demonstrated efficacy of comparatively straightforward MLPs, and the surprisingly robust performance gained through patching techniques in transformers, hint at a fundamental truth: often, the most elegant solution isn’t the most effective in production. It’s a memory of better times when a model’s theoretical perfection mattered more than its actual runtime cost.

Future efforts will likely focus not on entirely novel architectures, but on increasingly sophisticated methods for squeezing performance from existing ones. Automated patching strategies, adaptive model selection based on real-time network conditions, and, crucially, a deeper understanding of why these simpler models prove so resilient will be essential. The real challenge isn’t achieving marginal gains in accuracy; it’s maintaining operability in the face of inevitably messy, unpredictable data.

One anticipates the emergence of “traffic fingerprinting” – systems that identify specific network behaviors and dynamically adjust forecasting models accordingly. But, as always, the bugs will serve as proof of life, and the team will be there, prolonging the suffering of another deployment.


Original article: https://arxiv.org/pdf/2601.02694.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
