Author: Denis Avetisyan
New research reveals why deep learning often lags behind tree-based methods on structured data, and introduces feature engineering techniques to level the playing field.

Implicitly categorical features are identified as a key bottleneck, addressed by innovations in feature engineering like Learned Fourier Features and improved categorical encoding.
Despite recent advances in deep learning, performance on tabular datasets continues to lag behind tree-based methods, presenting a persistent challenge for neural networks. This work, ‘Closing the gap on tabular data with Fourier and Implicit Categorical Features’, posits that the discrepancy stems from deep learning’s difficulty in modeling the non-linear interactions inherent in implicitly categorical features within tabular data. By introducing feature engineering techniques – identification of these features and the application of Learned Fourier Features – the authors demonstrate a significant performance boost for deep learning models, achieving competitive or superior results compared to XGBoost on benchmark datasets. Could these findings unlock a new era of deep learning applications for traditionally challenging tabular data?
Deconstructing the Tabular Frontier: Why Deep Learning Struggles
While deep learning has revolutionized fields like computer vision and natural language processing, a persistent performance gap exists when it is applied to tabular data. Algorithms rooted in decision tree ensembles – XGBoost, LightGBM, and Random Forests – consistently demonstrate superior accuracy and efficiency on the structured, row-and-column datasets common in business and scientific applications. This isn’t simply a matter of hyperparameter tuning; even with extensive optimization, deep neural networks often lag behind tree-based methods in predictive power. The discrepancy stems from fundamental differences in how each approach handles characteristics inherent to tables: tree-based models, by their nature, are adept at capturing complex feature interactions and at handling mixed data types without extensive preprocessing, and so they often achieve higher predictive accuracy and better generalization even with relatively simple configurations. For many real-world datasets presented in tabular format, tree-based methods therefore remain the preferred choice for state-of-the-art results.
Traditional deep learning architectures, designed initially for image and text data, often encounter difficulties when applied to tabular datasets due to the fundamentally different data structure. Unlike the grid-like patterns in images or sequential nature of text, tabular data consists of independent features, often requiring the model to discern complex relationships between these features – known as feature interactions. Deep neural networks, with their inherent smoothness bias and reliance on learning hierarchical representations, can struggle to efficiently capture these often non-linear and discrete interactions. This limitation necessitates significantly larger datasets and more complex architectures for deep learning models to achieve comparable performance to tree-based methods, which are specifically designed to partition the feature space and directly model these interactions through decision trees and ensembles.
Deep neural networks, while powerful in domains like image and text processing, exhibit an inherent smoothness bias when applied to tabular data, contributing to a consistent performance gap compared to tree-based methods. This bias arises from the network’s tendency to model functions that change gradually, effectively ‘smoothing out’ potentially sharp decision boundaries crucial for accurately classifying tabular data. Tabular datasets often contain discrete features and complex interactions where abrupt changes in value significantly impact outcomes; the network’s preference for smooth transitions struggles to capture these nuances. Consequently, the model may overgeneralize, failing to recognize important distinctions present in the data. This limitation highlights a fundamental mismatch between the inductive bias of deep learning and the characteristics of many tabular datasets, suggesting that alternative approaches are often better suited for this data type.

Unmasking Hidden Order: The Illusion of Continuous Data
Implicitly categorical features represent a common challenge in tabular data analysis, where numerical values are, in reality, discrete representations of categorical groups. These features appear numerical but lack meaningful interpolation between values; for example, customer IDs, product codes, or zip codes. Deep learning models, designed to identify patterns in continuous numerical data, often fail to recognize these inherent categorical distinctions, treating the numerical values as continuous and potentially introducing noise or misinterpreting relationships. This misinterpretation arises because standard numerical processing techniques, such as normalization or feature scaling, can distort the underlying categorical information, leading to suboptimal model performance and reduced predictive accuracy. Consequently, identifying and appropriately handling these implicitly categorical features is crucial for building effective deep learning models on tabular datasets.
The Categorical Feature Detection method employs statistical tests – specifically, the Kolmogorov-Smirnov test and a modified chi-squared test – to assess the distribution of numerical features and identify those exhibiting characteristics of categorical variables. The Kolmogorov-Smirnov test evaluates whether a feature’s distribution significantly deviates from a continuous uniform distribution, indicating potential categorization. The chi-squared test, adapted for numerical features by binning values, assesses whether observed value frequencies differ significantly from the frequencies expected under a continuous assumption. Features whose distributions reject the continuity assumption under both tests are flagged as potentially categorical, revealing hidden structure that standard deep learning models might otherwise misinterpret as continuous data and process suboptimally.
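A toy version of this detection step can be written with nothing beyond the Python standard library. The thresholds, the choice of a uniform reference on the feature’s observed range, and the decision rule below are illustrative assumptions for the sketch, not the paper’s calibrated procedure:

```python
# Hedged sketch of categorical-feature detection: compare a numeric column's
# empirical distribution against a continuous uniform reference using a
# KS-style distance and a chi-squared statistic over binned counts.
# Thresholds here are illustrative, not the paper's calibrated values.

def ks_statistic_uniform(values):
    """KS distance between the empirical CDF and a uniform CDF on [min, max]."""
    xs = sorted(values)
    n = len(xs)
    lo, hi = xs[0], xs[-1]
    span = (hi - lo) or 1.0
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = (x - lo) / span                      # uniform CDF at x
        d = max(d, abs(i / n - cdf), abs((i - 1) / n - cdf))
    return d

def chi2_statistic(values, bins=10):
    """Chi-squared statistic of binned counts against a flat expectation."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    counts = [0] * bins
    for x in values:
        counts[min(int((x - lo) / span * bins), bins - 1)] += 1
    expected = len(values) / bins
    return sum((c - expected) ** 2 / expected for c in counts)

def looks_categorical(values, ks_thresh=0.25, chi2_thresh=30.0):
    """Flag a numeric column whose distribution departs strongly from a
    continuous uniform reference under both statistics."""
    return (ks_statistic_uniform(values) > ks_thresh
            and chi2_statistic(values) > chi2_thresh)

# Example: an ID-like column of a few repeated values vs. a smooth one.
ids = [float(v) for v in [1, 1, 2, 2, 2, 3, 3] * 30]
smooth = [i / 199 for i in range(200)]
```

On the `ids` column the mass piles up on three discrete values, so both statistics fire; the evenly spread `smooth` column passes as continuous. A production version would replace the hand-set thresholds with proper p-values.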
Accurate identification of implicitly categorical features is crucial because these features exhibit fundamentally different data distributions compared to continuous numerical features. Standard deep learning techniques optimized for continuous data, such as those employing gradient descent, may perform suboptimally when applied directly to categorical data disguised as numerical values. Specifically, categorical features possess discrete values and benefit from techniques like one-hot encoding or embedding layers to represent their distinct categories, whereas continuous features are best processed with methods designed for ordered, scalable values. Failing to properly differentiate and treat these feature types can lead to inaccurate model predictions and reduced generalization performance.
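Once a column is flagged, the discrete treatment described above can be as simple as a one-hot lookup; embedding layers are the learned generalization of the same indexing. A minimal sketch (the function name and interface are ours, not the paper’s):

```python
def one_hot_encode(values):
    """Map each distinct value of a flagged column to a one-hot vector.
    In a deep model this fixed table lookup is typically replaced by a
    trainable embedding layer, but the category-to-index logic is the same."""
    categories = sorted(set(values))
    index = {v: i for i, v in enumerate(categories)}
    return [[1.0 if index[v] == j else 0.0 for j in range(len(categories))]
            for v in values]
```

Note that after this step the model no longer sees any spurious ordering between, say, zip codes 10001 and 10002.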
The Categorical Feature Detection method is designed for integration into diverse deep learning architectures, specifically demonstrating compatibility with both Multilayer Perceptrons (MLPs) and Residual Networks (ResNets). This broad applicability stems from the statistical nature of the detection process, which operates independently of the model’s internal structure. The method assesses feature characteristics prior to model training, allowing identified categorical features to be pre-processed – for example, through one-hot encoding – or handled with appropriate embedding layers. Consequently, the detection process serves as a versatile pre-processing step within existing deep learning pipelines, enhancing model performance on tabular data regardless of architectural choice.
Breaking the Smoothness Bias: Injecting Complexity with Fourier Features
The smoothness bias in deep learning arises from the inherent limitations of neural networks in representing high-frequency, non-smooth functions, leading to underperformance on tabular data where such relationships are common. Learned Fourier Features address this by transforming input features into a higher-dimensional space using a learned combination of Fourier basis functions. This transformation allows the model to represent more complex decision boundaries and capture non-smooth relationships that would otherwise be difficult to approximate with standard neural network layers. Effectively, the model learns to represent data in a space where a smooth function can accurately represent the underlying, non-smooth relationship, circumventing the limitations of directly modeling the non-smooth function in the original input space.
Mapping inputs to a higher-dimensional space via Fourier basis functions allows a model to represent and learn non-smooth relationships that would be difficult to capture in the original input space. Fourier basis functions, comprising sine and cosine waves of varying frequencies, can decompose any periodic function into a sum of these waves. By transforming the input features using these functions, the model effectively creates new features representing different frequency components. This transformation enables the model to approximate functions with discontinuities or sharp changes, which standard deep learning architectures often struggle with due to their inherent smoothness assumptions. The resulting higher-dimensional representation facilitates the creation of more complex decision boundaries capable of modeling non-linear relationships without requiring excessively deep or wide networks.
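The mapping itself is compact: project an input through a set of frequencies and take sines and cosines. In the paper’s learned variant the frequencies are trainable parameters updated by backpropagation; the sketch below fixes an illustrative frequency set (an assumption made for brevity) just to show the lifting:

```python
import math

def fourier_features(x, freqs):
    """Map a scalar input to [sin(2*pi*f*x), cos(2*pi*f*x)] pairs.
    `freqs` stands in for the learned frequency parameters; in Learned
    Fourier Features these would be trained by gradient descent rather
    than fixed as they are here."""
    out = []
    for f in freqs:
        angle = 2.0 * math.pi * f * x
        out.append(math.sin(angle))
        out.append(math.cos(angle))
    return out

# A sharp, step-like target becomes easier to fit in the lifted space,
# because the high-frequency components encode abrupt changes directly.
features = fourier_features(0.25, freqs=[1.0, 2.0, 4.0])
```

Each input dimension thus expands to `2 * len(freqs)` features, and a downstream linear layer over this space can represent far less smooth functions than one over the raw input.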
Standard deep learning architectures, particularly Multi-Layer Perceptrons (MLPs) and ResNets, often underperform on tabular datasets when compared to gradient boosting methods like XGBoost due to their inherent smoothness bias. This bias limits the model’s ability to effectively capture complex, non-linear relationships present in tabular data. However, incorporating Learned Fourier Features (F|C) into ResNet (ResNet+F|C) and MLP (MLP+F|C) architectures mitigates this limitation. Empirical results demonstrate that these modified architectures, utilizing Fourier basis functions to map inputs to a higher-dimensional space, achieve performance levels competitive with, and in some instances exceeding, those of XGBoost on various tabular datasets, indicating a substantial improvement in the ability to model complex relationships.
The integration of learned Fourier features within ResNet architectures is efficiently realized through the application of 1D convolutional layers. These convolutions operate directly on the Fourier feature mappings of the input data, enabling the model to learn complex, non-linear relationships without requiring fully connected layers immediately after the feature mapping. This approach significantly reduces the number of parameters compared to traditional methods and improves computational efficiency. Specifically, the 1D convolutions act as learnable filters that extract relevant information from the higher-dimensional Fourier feature space, facilitating the creation of more expressive and accurate models for tabular data. The convolutional layers effectively replace the need for dense connections in processing the Fourier features, streamlining the architecture and enhancing performance.
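A minimal way to picture this convolutional integration is to treat the Fourier feature vector as a 1-D signal and slide a small filter across it. The sketch below implements only the windowed dot product; in the actual architecture the filter weights are trained jointly with the frequency parameters, and a framework layer such as a 1-D convolution module would be used instead:

```python
def conv1d(signal, kernel):
    """Valid-mode 1-D convolution (no padding): each output element is the
    dot product of the kernel with one window of the feature vector. The
    parameter count is len(kernel), independent of len(signal), which is
    the efficiency argument for replacing dense layers here."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# A length-2 averaging filter over a 4-feature signal yields 3 outputs.
smoothed = conv1d([1.0, 2.0, 3.0, 4.0], [0.5, 0.5])
```

As in deep learning practice, this is technically cross-correlation, but the learnable-filter interpretation is the same.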

Beyond Prediction: Robustness and the Pursuit of True Generalization
A key advancement in deep learning for tabular data lies in enhancing model robustness against irrelevant or uninformative features. Recent research demonstrates that explicitly detecting categorical features and then representing them using learned Fourier features significantly improves a model’s ability to discern meaningful signals. This approach allows the network to effectively filter out noise introduced by extraneous variables, concentrating instead on the features that genuinely contribute to predictive power. By transforming categorical data into a continuous space and applying Fourier-based representations, the model gains a more nuanced understanding of feature relationships, ultimately leading to improved generalization and performance, even when presented with datasets containing numerous redundant or misleading attributes.
A key characteristic of effective machine learning models is their ability to maintain performance consistency regardless of how the input data is presented. Recent research demonstrates that models incorporating specific architectural features exhibit remarkable ‘Data Orientation Preservation’. This means that even when tabular datasets undergo transformations – such as feature reordering or simple permutations – the model’s predictive accuracy remains largely unaffected. Unlike traditional methods that can be sensitive to such changes, these models effectively learn underlying relationships independent of feature arrangement, suggesting a more robust and generalized understanding of the data. This preservation of performance under transformation highlights a significant advancement in the application of deep learning to tabular data, offering a more reliable and adaptable solution for real-world applications where data presentation can vary.
Evaluations across diverse tabular datasets reveal that deep learning architectures, specifically ResNet+F|C and MLP+F|C, exhibit compelling performance gains. Notably, these models frequently outperform established gradient boosting methods like XGBoost, particularly in classification challenges where nuanced feature interactions are critical. While XGBoost maintains a strong presence in numerical regression tasks, the deep learning approaches achieve comparable results, demonstrating a closing performance gap. This suggests that, with targeted architectural modifications – such as those incorporated into ResNet+F|C and MLP+F|C – deep learning is increasingly competitive with, and in some cases surpasses, traditional machine learning techniques on structured data, opening new avenues for tabular data analysis.
Recent advancements demonstrate that deep learning, traditionally favored for image and text processing, holds considerable promise for analyzing tabular data, provided certain architectural considerations are addressed. While historically outperformed by gradient boosting methods like XGBoost on structured datasets, innovative designs – specifically those incorporating techniques like Fourier features and robust categorical feature handling – are closing the performance gap. This suggests that the limitations previously observed aren’t inherent to deep learning itself, but rather a consequence of applying architectures not optimally suited to the unique characteristics of tabular information. The ability of these newly developed models to not only match but, in some cases, exceed the performance of established algorithms on classification tasks signals a potential paradigm shift in how structured data is approached, opening avenues for more complex modeling and feature extraction.
The pursuit of improved model performance, as demonstrated in this work concerning tabular data, inherently involves a degree of controlled demolition. The paper dissects the limitations of deep learning when faced with implicitly categorical features, revealing a performance disparity with tree-based methods. This analytical approach echoes a fundamental tenet: to truly understand a system, one must probe its weaknesses. As Linus Torvalds famously stated, “Most good programmers do programming as a hobby, and then they get paid to do it.” This sentiment encapsulates the drive to not merely use a tool, but to dissect, refine, and ultimately, exploit its capabilities – in this case, feature engineering techniques like ICF and LFF – to overcome inherent limitations and close the gap in predictive accuracy.
Beyond the Table: Charting Future Directions
The persistent advantage of tree-based methods on tabular data isn’t merely an empirical observation; it’s a signal. This work correctly identifies a critical component of that signal – the tacit handling of categorical information. Yet simply representing these implicit categories isn’t a solution; it’s a translation. The real question isn’t whether deep learning can mimic tree-based feature engineering, but whether it can surpass it by discovering entirely new representations: structures unbound by the limitations of explicitly defined features.
Future investigations should not focus solely on refining feature transformations, but on architectural innovations. Can networks be designed to inherently infer categorical boundaries, effectively performing a continuous, data-driven form of feature engineering within the model itself? The challenge lies in building systems that embrace ambiguity, recognizing that information isn’t always neatly packaged, and that the most potent signals often reside in the noise, in the gaps between categories.
Ultimately, closing the gap isn’t about achieving parity. It’s about using tree-based methods as a control – a baseline against which to measure the potential of genuinely novel deep learning architectures. The goal isn’t to replicate the past, but to deconstruct it, to understand why certain approaches succeed, and then to build something that breaks the mold, revealing a deeper, more nuanced understanding of the data itself.
Original article: https://arxiv.org/pdf/2602.23182.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-02-27 12:10