Leveraging structural information in tree ensembles for table representation learning
2025
Tabular data is one of the most common data formats found in the web and used in domains like finance, banking, e-commerce and medical. Although deep neural networks (DNNs) have demonstrated outstanding performance on homogeneous data such as visual, audio, and textual data, tree ensemble methods such as Gradient Boosted Decision Trees (GBDTs) are often the go-to choice for supervised machine learning problems involving heterogeneous tabular data. However, a major limitation of these methods lies in the difficulty of plugging-in other modalities (like text, images), as is achievable with deep learning (DL) models. To bridge this gap, researchers have put forth a multitude of DL approaches tailored specifically for tabular data. In this work, we propose a new path embedding-based method to harness the structural information from tree ensembles to improve tabular data representation. Our approach not only demonstrates superior performance compared to existing DL models for tabular classification tasks but also outperforms competitive baselines when combined with textual data in multimodal tabular transformers.
Research areas