Model Comparison
Architectural comparison and computational characteristics of DeepTab’s model zoo.
Note
Focus on architecture: This document emphasizes computational complexity, architectural design, and qualitative comparisons. Quantitative performance benchmarks will be added when systematic experiments are completed.
Scope: The tables below cover the 15 stable models. The 3 experimental models (ModernNCA, Tangos, Trompt) are documented separately under Model Tiers.
See also
For practical timing and memory measurement guidance, see Model Efficiency and Benchmarking. For a runnable workflow, use the Model Efficiency Benchmarking tutorial and its notebook at docs/tutorials/notebooks/model_efficiency.ipynb.
Computational Characteristics
The table below reports dominant forward-pass scaling for a batch. It is a practical guide, not a FLOP-count benchmark.
| Category | Model | DeepTab Default Shape | Dominant Forward-Time Terms | Memory Driver | Primary References |
|---|---|---|---|---|---|
| State Space Models | Mambular | | Linear in feature sequence: \(O(B \cdot L \cdot P \cdot D)\) plus projection constants | \(O(B \cdot P \cdot D)\) activations | |
| State Space Models | MambaTab | | Linear in feature sequence: \(O(B \cdot L \cdot P \cdot D)\) plus projection constants | \(O(B \cdot P \cdot D)\) activations | |
| State Space Models | MambAttention | | Mamba term \(O(B \cdot L_m \cdot P \cdot D)\) plus feature attention \(O(B \cdot L_a \cdot P^2 \cdot D)\) | Attention maps \(O(B \cdot P^2)\) when attention layers are active | |
| Transformers | FTTransformer | | Feature self-attention \(O(B \cdot L \cdot P^2 \cdot D)\) plus feed-forward blocks | \(O(B \cdot L \cdot P^2)\) attention maps | |
| Transformers | TabTransformer | | Categorical-token self-attention \(O(B \cdot L \cdot P_{\text{cat}}^2 \cdot D)\) plus numerical MLP head | \(O(B \cdot L \cdot P_{\text{cat}}^2)\) attention maps | |
| Transformers | SAINT | | Column attention \(O(B \cdot P^2 \cdot D)\) plus row attention \(O(B^2 \cdot P \cdot D)\) within a batch | \(O(B \cdot P^2 + B^2)\) attention maps | |
| Transformers | AutoInt | | Feature self-attention \(O(B \cdot L \cdot P^2 \cdot D)\); key-value compression reduces constants | \(O(B \cdot L \cdot P^2)\) attention maps | |
| Residual Networks | ResNet | | Dense layers: \(O(B \cdot \sum_\ell d_{\ell-1} d_\ell)\) | Linear in batch and hidden width | |
| Residual Networks | TabR | | Candidate encoding plus exact/FAISS nearest-neighbor search \(O(B \cdot N_c \cdot D)\) and context mixing \(O(B \cdot C \cdot D)\) | Candidate cache \(O(N_c \cdot D)\) | |
| Tree-Inspired | NODE | | Soft oblivious trees evaluate all splits/leaves: \(O(B \cdot L \cdot T \cdot (P \cdot D_t + D_t \cdot 2^{D_t}))\) | Path/leaf activations \(O(B \cdot T \cdot 2^{D_t})\) | |
| Tree-Inspired | ENODE | | NODE-style soft tree evaluation with learned embeddings | Path/leaf activations \(O(B \cdot T \cdot 2^{D_t})\) | |
| Tree-Inspired | NDTF | | Neural decision forest evaluates internal nodes and leaf probabilities for each tree | Leaf probabilities scale with \(O(B \cdot E \cdot 2^{D_t})\) | |
| Other | MLP | | Dense layers: \(O(B \cdot \sum_\ell d_{\ell-1} d_\ell)\) | Linear in batch and hidden width | Standard MLP baseline |
| Other | TabM | | MLP-style dense compute with parameter-efficient batch ensembling | Linear in batch, hidden width, and active ensemble outputs | |
| Other | TabulaRNN | | Recurrent feature-sequence processing \(O(B \cdot L \cdot P \cdot D^2)\) for standard RNN-style cells | \(O(B \cdot P \cdot D)\) activations | |
Notation: \(B\) = batch size, \(P\) = feature tokens after preprocessing/embedding, \(P_{\text{cat}}\) = categorical tokens, \(D\) = hidden dimension, \(L\) = layers, \(L_m\) = Mamba layers, \(L_a\) = attention layers, \(C\) = retrieved context size, \(N_c\) = candidate rows for retrieval, \(T\) = trees per layer, \(E\) = forest ensemble size, \(D_t\) = tree depth, \(d_\ell\) = width of dense layer \(\ell\) (so a dense layer costs \(d_{\ell-1} d_\ell\)).
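To make the scaling terms concrete, the sketch below evaluates a few of the dominant forward-time expressions from the table as plain operation counts. The function names and the values of `B`, `L`, and `D` are illustrative assumptions, not DeepTab defaults or DeepTab APIs, and constants are dropped exactly as in the big-O expressions.

```python
# Rough operation-count proxies for the dominant terms in the table above.
# All names and numbers here are illustrative; constants are ignored.

def mamba_term(B, L, P, D):
    """Stacked Mamba blocks: linear in the number of feature tokens P."""
    return B * L * P * D


def feature_attention_term(B, L, P, D):
    """Feature self-attention (FTTransformer, AutoInt): quadratic in P."""
    return B * L * P ** 2 * D


def saint_term(B, P, D):
    """SAINT: column attention plus row attention within one batch."""
    return B * P ** 2 * D + B ** 2 * P * D


def rnn_term(B, L, P, D):
    """Standard RNN-style cells: linear in P, quadratic in hidden width D."""
    return B * L * P * D ** 2


# Hypothetical batch size, depth, and hidden width.
B, L, D = 256, 4, 64
for P in (10, 20, 40):
    ratio = feature_attention_term(B, L, P, D) / mamba_term(B, L, P, D)
    print(f"P={P:>3}  attention/mamba term ratio = {ratio:.0f}")
```

Under these proxies, doubling \(P\) doubles the Mamba term but quadruples the feature-attention term, which is the main reason the attention-based rows become expensive on wide tables.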
Important
Parameter count assumptions: Parameter counts are not listed because they depend strongly on dataset schema and preprocessing:
Input features: More features increase embedding, tokenizer, and first-layer parameters.
Categorical cardinality: More categories increase embedding-table parameters.
Hidden width: Dense projections usually scale with width squared.
Depth and ensembles: Additional layers, trees, or ensemble members increase parameters and activations.
The “DeepTab Default Shape” column is taken from the current model config defaults in deeptab/configs/models/.
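As a rough illustration of why parameter counts depend on the schema, the sketch below counts embedding-table and dense-layer parameters for a made-up dataset. The cardinalities, embedding size, and layer widths are hypothetical, and real DeepTab models may add tokenizers, normalization layers, and heads that this sketch ignores.

```python
# Back-of-the-envelope parameter counts; all inputs are hypothetical examples.

def embedding_params(cardinalities, d_embed):
    """One embedding table per categorical column: cardinality x d_embed."""
    return sum(c * d_embed for c in cardinalities)


def dense_params(widths):
    """Weights plus biases for a chain of dense layers with the given widths."""
    return sum(w_in * w_out + w_out for w_in, w_out in zip(widths, widths[1:]))


# Three categorical columns with 10, 100, and 1000 categories, 32-dim embeddings.
print(embedding_params([10, 100, 1000], d_embed=32))

# The same input width with hidden width 128 versus 256.
print(dense_params([64, 128, 128, 1]))
print(dense_params([64, 256, 256, 1]))
```

The hidden-to-hidden layer dominates once the width is large, which is where the "width squared" behavior comes from.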
Tip
Practical implications:
Linear in feature sequence: Mambular, MambaTab, TabulaRNN, MLP, ResNet, and TabM avoid feature-attention matrices.
Quadratic in features: FTTransformer, AutoInt, TabTransformer (over its categorical tokens), and the attention layers of MambAttention become expensive as the number of feature tokens grows.
Quadratic in batch rows: SAINT’s row-attention term is controlled by mini-batch size, not directly by total dataset size.
Retrieval-based: TabR can be strong on larger data, but it needs candidate encoding/search memory and depends on the retrieval index.
Soft tree-based: NODE-style models are not logarithmic at inference; differentiable trees evaluate soft paths/leaves, so tree depth matters.
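The memory side of the attention points above can be sketched with simple arithmetic. The helper names below are hypothetical, the estimate assumes float32 maps that are fully materialized for a single layer, and fused attention kernels may never store these maps in full.

```python
# Attention-map memory proxies (one layer, float32). Illustrative only.

BYTES_PER_FLOAT32 = 4


def feature_attention_map_bytes(batch_size, n_tokens, n_heads=1):
    """One P x P map per row and head: grows with the square of feature tokens."""
    return batch_size * n_heads * n_tokens ** 2 * BYTES_PER_FLOAT32


def row_attention_map_bytes(batch_size, n_heads=1):
    """One B x B map per head: grows with the square of the mini-batch size."""
    return n_heads * batch_size ** 2 * BYTES_PER_FLOAT32


for batch_size in (256, 1024):
    feature_mib = feature_attention_map_bytes(batch_size, n_tokens=30) / 2 ** 20
    row_mib = row_attention_map_bytes(batch_size) / 2 ** 20
    print(f"B={batch_size}: feature maps {feature_mib:.2f} MiB, row map {row_mib:.2f} MiB")
```

Because the row map depends only on the batch size, lowering the mini-batch size is the direct lever for SAINT-style row attention, independent of how many rows the dataset has.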
Note
Category guide:
State Space Models: Selective SSM/Mamba-style sequence models adapted to tabular features.
Transformers: Self-attention mechanisms for feature and/or row interactions.
Residual Networks: Deep feedforward MLPs with skip connections.
Tree-Inspired: Differentiable decision trees with gradient optimization.
Other: Standard architectures (MLP, parameter-efficient ensembles, RNNs).
Architecture Categories
State Space Models (SSMs)
Feature-sequence models with linear sequence-length scaling in the Mamba blocks
| Model | Default Layers | Default Hidden Dim | Key Feature | Best Use Case |
|---|---|---|---|---|
| Mambular | 4 Mamba layers | 64 | Stacked Mamba blocks over feature tokens | General-purpose tabular sequence modeling |
| MambaTab | 1 Mamba layer | 64 | Lightweight Mamba block | Small datasets, speed |
| MambAttention | Hybrid | 64 | Mamba blocks plus feature attention | Complex feature interactions |
Transformer-Based
Attention mechanisms for feature and row interactions
| Model | Attention Scope | Default Hidden Dim | Key Feature | Best Use Case |
|---|---|---|---|---|
| FTTransformer | All feature tokens | 128 | Feature tokenization | Feature interactions |
| TabTransformer | Categorical tokens | 128 | Contextual categorical embeddings | Categorical-heavy data |
| SAINT | Row + column | 128 | Intersample (row) plus column attention | Semi-supervised or row-context settings |
| AutoInt | All feature tokens | 128 | Self-attentive feature interaction learning | Automatic interaction modeling |
Tree-Inspired
Differentiable tree and forest structures
| Model | Tree Type | Default Shape | Key Feature | Best Use Case |
|---|---|---|---|---|
| NODE | Oblivious differentiable trees | 4 layers, 128 trees/layer, depth 6 | Soft routing over oblivious trees | Interpretable tree-inspired modeling |
| ENODE | Embedded NODE variant | 4 layers, 64 trees/layer, depth 6 | Feature embeddings before NODE-style blocks | Tree-inspired modeling with embeddings |
| NDTF | Neural decision tree forest | 12 trees, random depths 4 to 15 | Multiple neural decision trees | Tree ensemble-style experiments |
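Soft routing over an oblivious tree can be illustrated in a few lines. The sketch below is a simplification under stated assumptions: it uses a sigmoid at each depth level as a stand-in for whatever split function the actual layers use, and the function names are hypothetical rather than part of DeepTab. It shows that every one of the \(2^{D_t}\) leaves receives a probability, so nothing is pruned at inference.

```python
import itertools
import math


def soft_oblivious_leaf_probs(split_logits):
    """Leaf probabilities for one soft oblivious tree.

    split_logits holds one logit per depth level, because an oblivious tree
    shares a single split across all nodes at the same level.
    """
    p_right = [1.0 / (1.0 + math.exp(-z)) for z in split_logits]
    leaves = []
    # Enumerate every left/right path; each path ends in exactly one leaf.
    for path in itertools.product((0, 1), repeat=len(split_logits)):
        prob = 1.0
        for go_right, p in zip(path, p_right):
            prob *= p if go_right else 1.0 - p
        leaves.append(prob)
    return leaves


def leaf_activations_per_row(trees_per_layer, depth):
    """Leaf activations one input row produces in one layer."""
    return trees_per_layer * 2 ** depth


probs = soft_oblivious_leaf_probs([0.3, -1.2, 0.8, 0.0, 2.1, -0.5])
print(len(probs), round(sum(probs), 6))
print(leaf_activations_per_row(128, 6))  # NODE default shape from the table
print(leaf_activations_per_row(64, 6))   # ENODE default shape from the table
```

With the default shapes listed above, each additional unit of depth doubles the leaf activations per tree, which is why depth is the sensitive hyperparameter for these models.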
Residual Networks
Deep feedforward networks with skip connections
| Model | Default Shape | Key Feature | Best Use Case |
|---|---|---|---|
| ResNet | 3 residual blocks, | Residual blocks | Fast baseline |
| TabR | | Retrieval-augmented prediction | Larger datasets with useful neighbor structure |
Other Architectures
| Model | Type | Default Shape | Key Feature | Best Use Case |
|---|---|---|---|---|
| MLP | Feedforward | | Simple dense baseline | Fastest baseline |
| TabM | Parameter-efficient ensemble | | Batch ensembling | Strong efficient baseline |
| TabulaRNN | RNN | | Sequential feature processing | Sequential feature modeling |
Model Selection by Use Case
Note
General pattern: Simpler models (MLP, ResNet, TabM) are strong practical baselines and often work well on small or medium datasets with proper regularization. More complex models (Transformers, SSMs, retrieval models) are most useful when their inductive bias matches the data or when the dataset is large enough to justify the extra capacity and compute.
By Dataset Size
| Dataset Size | Recommended Models | Reasoning | Key Consideration | Avoid |
|---|---|---|---|---|
| <5K samples | MambaTab, ResNet, MLP, TabM | Lower capacity and fast iteration reduce overfitting risk | Use regularization and validation-driven early stopping | Deep Transformers (SAINT, deep FTTransformer) |
| 5K to 50K samples | Mambular, FTTransformer, TabM, MambAttention | More capacity can pay off when features interact strongly | Balance capacity vs training time | Very high capacity if data is simple |
| >50K samples | Mambular, TabM, TabR, FTTransformer | Larger data can support complex patterns and retrieval | Watch attention/retrieval bottlenecks | SAINT with large batches unless row attention is needed |
Alternatives: MambaTab for speed, NODE/ENODE for tree-inspired interpretability, ResNet/MLP for very fast training.
By Feature Type
| Feature Composition | Best Choice | Good Alternatives | Reasoning | Avoid |
|---|---|---|---|---|
| >60% categorical | TabTransformer | FTTransformer, Mambular | TabTransformer’s attention is focused on categorical contextual embeddings | - |
| >80% numerical | Mambular, TabM | ResNet, NODE | SSM/dense baselines avoid categorical-only assumptions | TabTransformer |
| Balanced mixed | Mambular, FTTransformer | MambAttention, TabM | Unified feature processing supports mixed feature interactions | - |
By Computational Constraints
| Constraint | Recommended Models | Reasoning | Avoid |
|---|---|---|---|
| Memory <8GB GPU | MLP, ResNet, MambaTab, Mambular, TabM | No full feature-attention matrix in the main path | FTTransformer/AutoInt with many feature tokens, SAINT with large batches |
| Fast training needed | MLP, ResNet, MambaTab, TabM | Simple dense or short sequence paths | FTTransformer, TabR, SAINT if retrieval/row attention dominates |
| Low inference latency | MLP, ResNet, Mamba variants, TabM | Avoids retrieval search and full attention over many tokens | TabR with large candidate pools, wide Transformers |
Training speed tiers: Fastest (MLP, ResNet) -> Fast (MambaTab, TabM) -> Moderate (Mambular, NODE) -> Slower or workload-dependent (FTTransformer, TabR, SAINT).
By Task Requirements
| Task | General Purpose | Fast/Efficient | Interpretable | Notes |
|---|---|---|---|---|
| Classification | Mambular, FTTransformer, MambAttention | MambaTab, ResNet, TabM | NODE, ENODE, NDTF | All models support multi-class |
| Regression | Mambular, FTTransformer, TabR (large data) | MambaTab, ResNet, TabM | NODE | Tree models can be useful when tree-like splits fit the data |
| LSS (Distributional) | Mambular, FTTransformer, MambAttention | MambaTab | ENODE | All models support LSS mode |
Special cases: For quantile regression, use any model in LSS mode with an appropriate distribution family.
Recommended Decision Tree
Start Here
|
|- Dataset size <5K? -> Use MambaTab, ResNet, MLP, or TabM with regularization
|
|- Need tree-inspired interpretability? -> Use NODE, ENODE, or NDTF
|
|- Memory constrained (<8GB)? -> Prefer Mambular, MambaTab, MLP, ResNet, or TabM
|
|- Inference latency critical? -> Avoid retrieval/large attention; use MLP, ResNet, TabM, or Mamba variants
|
|- >60% categorical features? -> Consider TabTransformer
|
|- Need retrieval from similar training examples? -> Consider TabR
|
`- General purpose -> Mambular or TabM
`- Alternative -> FTTransformer when GPU memory and feature count permit
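The decision tree above can be written as a small helper for scripts or notebooks. This is a sketch, not a DeepTab API: the `Workload` class, its field names, and the `recommend` function are hypothetical, and the code assumes the branches are checked top to bottom with the first match winning, which the diagram suggests but does not state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Workload:
    """Hypothetical description of a tabular task (not a DeepTab class)."""
    n_rows: int
    categorical_fraction: float = 0.0
    needs_tree_interpretability: bool = False
    gpu_memory_gb: Optional[float] = None  # None means memory is not a constraint
    latency_critical: bool = False
    needs_retrieval: bool = False


def recommend(w: Workload) -> List[str]:
    """Walk the branches in document order and return the first match."""
    if w.n_rows < 5_000:
        return ["MambaTab", "ResNet", "MLP", "TabM"]  # use with regularization
    if w.needs_tree_interpretability:
        return ["NODE", "ENODE", "NDTF"]
    if w.gpu_memory_gb is not None and w.gpu_memory_gb < 8:
        return ["Mambular", "MambaTab", "MLP", "ResNet", "TabM"]
    if w.latency_critical:
        return ["MLP", "ResNet", "TabM", "Mamba variants"]
    if w.categorical_fraction > 0.6:
        return ["TabTransformer"]
    if w.needs_retrieval:
        return ["TabR"]
    # General-purpose default; FTTransformer is the listed alternative
    # when GPU memory and feature count permit.
    return ["Mambular", "TabM"]


print(recommend(Workload(n_rows=2_000)))
print(recommend(Workload(n_rows=80_000, categorical_fraction=0.7)))
print(recommend(Workload(n_rows=80_000)))
```

Treat the result as a starting shortlist; the tables in the preceding sections carry the caveats that a first-match rule cannot express.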
Hardware Requirements by Model
The table below gives practical guidance on whether each model trains comfortably on a CPU-only machine or requires a GPU (CUDA, MPS, or other accelerator). Thresholds are rough estimates based on architecture cost, and the actual boundary depends on the number of features, hidden width, and depth used.
Important
Features matter as much as rows. Transformer-style models grow quadratically with feature-token count, so 20 features with a default FTTransformer config can require as much compute as 50 features with an MLP. The estimates below assume the default DeepTab config for each model and a moderate feature count (10 to 30 columns). Wide datasets shift the GPU threshold lower.
| CPU comfort zone | Models | Primary cost driver | When to reach for a GPU |
|---|---|---|---|
| Up to ~500K rows | MLP, ResNet | Cache-friendly dense and skip-connection layers | Rarely needed; CPU scales well even on large data |
| Up to ~100K rows | TabM, MambaTab | MLP ensemble paths, single lightweight Mamba block | Modest speedup; CPU stays competitive |
| Up to ~20K rows | Mambular, TabulaRNN, TabTransformer, NODE | Stacked sequence/recurrent blocks or categorical attention | Past this size, accelerators give meaningful speedup |
| Up to ~10K rows | MambAttention, FTTransformer, AutoInt, ENODE, NDTF, TabR | Full-feature attention \(O(P^2)\), retrieval, or deep soft trees | GPU strongly recommended as features or rows grow |
| Up to ~2K rows | SAINT | Column plus row attention per batch | GPU effectively required; CPU is impractically slow past a few thousand rows |
The “CPU comfort zone” is where training at default config finishes in reasonable wall-clock time on a modern CPU. Beyond it, a CUDA, MPS, or similar accelerator provides meaningful speedup.
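The comfort-zone thresholds can be kept as a lookup table for quick checks before launching a run. The dictionary and function below are illustrative names, not DeepTab APIs, and the numbers are the rough row counts from the table, which assume default configs and roughly 10 to 30 columns.

```python
# Approximate CPU comfort zones (rows) copied from the table above.
CPU_COMFORT_ROWS = {
    "MLP": 500_000, "ResNet": 500_000,
    "TabM": 100_000, "MambaTab": 100_000,
    "Mambular": 20_000, "TabulaRNN": 20_000,
    "TabTransformer": 20_000, "NODE": 20_000,
    "MambAttention": 10_000, "FTTransformer": 10_000, "AutoInt": 10_000,
    "ENODE": 10_000, "NDTF": 10_000, "TabR": 10_000,
    "SAINT": 2_000,
}


def suggest_hardware(model_name, n_rows):
    """Return 'cpu' inside the comfort zone, otherwise 'gpu'."""
    return "cpu" if n_rows <= CPU_COMFORT_ROWS[model_name] else "gpu"


for name, rows in [("MLP", 400_000), ("Mambular", 50_000), ("SAINT", 5_000)]:
    print(name, rows, suggest_hardware(name, rows))
```

For wide datasets, it is reasonable to lower these thresholds before relying on the result, since the table assumes a moderate feature count.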
Tip
Apple Silicon (MPS): All models can be trained on Apple Silicon through PyTorch’s MPS backend. Set accelerator="mps" in TrainerConfig. Most models see a meaningful speedup; the exception is models that rely on Mamba CUDA kernels, which fall back to CPU on MPS unless a dedicated MPS implementation is available.
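A small helper can choose the accelerator string at runtime so the same script works across machines. This is a sketch: `pick_accelerator` is a hypothetical function, and only the `"mps"` value is confirmed by the text above; the `"cuda"` and `"cpu"` strings are assumptions to check against the TrainerConfig documentation.

```python
def pick_accelerator():
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing on older PyTorch builds, so look it up safely.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_accelerator())
```

The returned string could then be passed as the `accelerator` value in TrainerConfig, keeping in mind the CPU fallback for Mamba kernels noted above.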
Note
Inference vs training: Inference (predict) is cheaper than training because there is no backward pass or optimizer state. A model that needs a GPU for training can often run inference on CPU in production for moderate batch sizes. Use InferenceModel to load artifacts for CPU-only inference environments.
References
Key papers used for the comparison:
Ahamed, M. A., & Cheng, Q. (2024). MambaTab: A Plug-and-Play Model for Learning Tabular Data. arXiv:2401.08867, DOI:10.1109/MIPR62202.2024.00065
Gorishniy, Y., Rubachev, I., Khrulkov, V., & Babenko, A. (2021). Revisiting Deep Learning Models for Tabular Data. NeurIPS 2021. arXiv:2106.11959
Gorishniy, Y., Rubachev, I., Kartashev, N., Shlenskii, D., Kotelnikov, A., & Babenko, A. (2023). TabR: Tabular Deep Learning Meets Nearest Neighbors in 2023. arXiv:2307.14338
Gorishniy, Y., Kotelnikov, A., & Babenko, A. (2024). TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling. ICLR 2025. arXiv:2410.24210
Gu, A., & Dao, T. (2024). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR 2016. arXiv:1512.03385
Huang, X., Khetan, A., Cvitkovic, M., & Karnin, Z. (2020). TabTransformer: Tabular Data Modeling Using Contextual Embeddings. arXiv:2012.06678
Kontschieder, P., Fiterau, M., Criminisi, A., & Rota Bulo, S. (2015). Deep Neural Decision Forests. ICCV 2015. CVF Open Access
Popov, S., Morozov, S., & Babenko, A. (2019). Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data. ICLR 2020. arXiv:1909.06312
Somepalli, G., Goldblum, M., Schwarzschild, A., Bruss, C. B., & Goldstein, T. (2021). SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training. arXiv:2106.01342
Song, W., Shi, C., Xiao, Z., Duan, Z., Xu, Y., Zhang, M., & Tang, J. (2019). AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. CIKM 2019. arXiv:1810.11921
Thielmann, A. F., Kumar, M., Weisser, C., Reuter, A., Säfken, B., & Samiee, S. (2024). Mambular: A Sequential Model for Tabular Deep Learning. arXiv:2408.06291
Thielmann, A. F., & Samiee, S. (2024). On the Efficiency of NLP-Inspired Methods for Tabular Deep Learning. arXiv:2411.17207
Wen, Y., Tran, D., & Ba, J. (2020). BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning. arXiv:2002.06715
See Also
Recommended Configs: Hyperparameter guidelines
Model Efficiency and Benchmarking: Runtime and memory benchmarking protocol
Model Tiers: Stable vs experimental