Model Comparison

Architectural comparison and computational characteristics of DeepTab’s model zoo.

Note

Focus on architecture: This document emphasizes computational complexity, architectural design, and qualitative comparisons. Quantitative performance benchmarks will be added when systematic experiments are completed.

Scope: The tables below cover the 15 stable models. The 3 experimental models (ModernNCA, Tangos, Trompt) are documented separately under Model Tiers.

See also

For practical timing and memory measurement guidance, see Model Efficiency and Benchmarking. For a runnable workflow, use the Model Efficiency Benchmarking tutorial and its notebook at docs/tutorials/notebooks/model_efficiency.ipynb.

Computational Characteristics

The table below reports dominant forward-pass scaling for a batch. It is a practical guide, not a FLOP-count benchmark.

| Category | Model | DeepTab Default Shape | Dominant Forward-Time Terms | Memory Driver | Primary References |
|---|---|---|---|---|---|
| State Space Models | Mambular | d_model=64, n_layers=4 | Linear in feature sequence: \(O(B \cdot L \cdot P \cdot D)\) plus projection constants | \(O(B \cdot P \cdot D)\) activations | Mambular, Mamba |
| State Space Models | MambaTab | d_model=64, n_layers=1 | Linear in feature sequence: \(O(B \cdot L \cdot P \cdot D)\) plus projection constants | \(O(B \cdot P \cdot D)\) activations | MambaTab, Mamba |
| State Space Models | MambAttention | d_model=64, Mamba blocks + attention | Mamba term \(O(B \cdot L_m \cdot P \cdot D)\) plus feature attention \(O(B \cdot L_a \cdot P^2 \cdot D)\) | Attention maps \(O(B \cdot P^2)\) when attention layers are active | Mambular, Mamba |
| Transformers | FTTransformer | d_model=128, n_layers=4, n_heads=8 | Feature self-attention \(O(B \cdot L \cdot P^2 \cdot D)\) plus feed-forward blocks | \(O(B \cdot L \cdot P^2)\) attention maps | Gorishniy et al. 2021 |
| Transformers | TabTransformer | d_model=128, n_layers=4, n_heads=8 | Categorical-token self-attention \(O(B \cdot L \cdot P_{\text{cat}}^2 \cdot D)\) plus numerical MLP head | \(O(B \cdot L \cdot P_{\text{cat}}^2)\) attention maps | Huang et al. 2020 |
| Transformers | SAINT | d_model=128, n_layers=1, n_heads=2 | Column attention \(O(B \cdot P^2 \cdot D)\) plus row attention \(O(B^2 \cdot P \cdot D)\) within a batch | \(O(B \cdot P^2 + B^2)\) attention maps | Somepalli et al. 2021 |
| Transformers | AutoInt | d_model=128, n_layers=4, n_heads=8 | Feature self-attention \(O(B \cdot L \cdot P^2 \cdot D)\); key-value compression reduces constants | \(O(B \cdot L \cdot P^2)\) attention maps | Song et al. 2019 |
| Residual Networks | ResNet | layer_sizes=[256,128,32], num_blocks=3 | Dense layers: \(O(B \cdot \sum_\ell d_{\ell-1} d_\ell)\) | Linear in batch and hidden width | He et al. 2016, Gorishniy et al. 2021 |
| Residual Networks | TabR | d_main=256, context_size=96 | Candidate encoding plus exact/FAISS nearest-neighbor search \(O(B \cdot N_c \cdot D)\) and context mixing \(O(B \cdot C \cdot D)\) | Candidate cache \(O(N_c \cdot D)\) | Gorishniy et al. 2023 |
| Tree-Inspired | NODE | num_layers=4, layer_dim=128, depth=6 | Soft oblivious trees evaluate all splits/leaves: \(O(B \cdot L \cdot T \cdot (P \cdot D_t + D_t \cdot 2^{D_t}))\) | Path/leaf activations \(O(B \cdot T \cdot 2^{D_t})\) | Popov et al. 2019 |
| Tree-Inspired | ENODE | d_model=8, num_layers=4, layer_dim=64, depth=6 | NODE-style soft tree evaluation with learned embeddings | Path/leaf activations \(O(B \cdot T \cdot 2^{D_t})\) | Popov et al. 2019 |
| Tree-Inspired | NDTF | n_ensembles=12, random depths 4 to 15 | Neural decision forest evaluates internal nodes and leaf probabilities for each tree | Leaf probabilities scale with \(O(B \cdot E \cdot 2^{D_t})\) | Kontschieder et al. 2015 |
| Other | MLP | layer_sizes=[256,128,32] | Dense layers: \(O(B \cdot \sum_\ell d_{\ell-1} d_\ell)\) | Linear in batch and hidden width | Standard MLP baseline |
| Other | TabM | layer_sizes=[256,256,128], ensemble_size=32 | MLP-style dense compute with parameter-efficient batch ensembling | Linear in batch, hidden width, and active ensemble outputs | Gorishniy et al. 2024, Wen et al. 2020 |
| Other | TabulaRNN | d_model=128, n_layers=4 | Recurrent feature-sequence processing \(O(B \cdot L \cdot P \cdot D^2)\) for standard RNN-style cells | \(O(B \cdot P \cdot D)\) activations | Thielmann & Samiee 2024 |

Notation: \(B\) = batch size, \(P\) = feature tokens after preprocessing/embedding, \(P_{\text{cat}}\) = categorical tokens, \(D\) = hidden dimension, \(L\) = layers, \(L_m\) = Mamba layers, \(L_a\) = attention layers, \(C\) = retrieved context size, \(N_c\) = candidate rows for retrieval, \(T\) = trees per layer, \(E\) = forest ensemble size, \(D_t\) = tree depth, \(d_\ell\) = width of dense layer \(\ell\) (so a dense layer costs \(d_{\ell-1} d_\ell\)).
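To make the scaling terms concrete, the sketch below evaluates a few of the dominant forward-time expressions for given sizes. The function names and the example batch size are illustrative assumptions and are not part of DeepTab. Constants and lower-order terms are ignored, so the numbers are only meaningful as ratios.

```python
"""Evaluate dominant forward-time terms from the complexity table (constants ignored)."""


def mamba_term(B: int, L: int, P: int, D: int) -> int:
    # Linear in feature tokens: O(B * L * P * D)
    return B * L * P * D


def feature_attention_term(B: int, L: int, P: int, D: int) -> int:
    # Quadratic in feature tokens: O(B * L * P^2 * D)
    return B * L * P * P * D


def saint_term(B: int, P: int, D: int) -> int:
    # Column attention O(B * P^2 * D) plus row attention O(B^2 * P * D)
    return B * P * P * D + B * B * P * D


def dense_term(B: int, widths: list[int]) -> int:
    # Dense layers: O(B * sum_l d_{l-1} * d_l); widths[0] is the input width
    return B * sum(a * b for a, b in zip(widths, widths[1:]))


if __name__ == "__main__":
    B = 256  # hypothetical batch size
    for P in (10, 50, 200):
        mamba = mamba_term(B, L=4, P=P, D=64)              # Mambular default shape
        attn = feature_attention_term(B, L=4, P=P, D=128)  # FTTransformer default shape
        print(f"P={P:>3}  mamba={mamba:.2e}  attention={attn:.2e}  ratio={attn / mamba:.0f}x")
```

With these two default shapes the ratio of the attention term to the Mamba term is \(2P\), so the gap widens in direct proportion to the number of feature tokens.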

Important

Parameter count assumptions: Parameter counts are not listed because they depend strongly on dataset schema, preprocessing, and model configuration:

  • Input features: More features increase embedding, tokenizer, and first-layer parameters.

  • Categorical cardinality: More categories increase embedding-table parameters.

  • Hidden width: Dense projections usually scale with width squared.

  • Depth and ensembles: Additional layers, trees, or ensemble members increase parameters and activations.

The “DeepTab Default Shape” column is taken from the current model config defaults in deeptab/configs/models/.
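The dependence on schema and width can be sketched with two simple counters. This is a rough estimate under stated assumptions: one embedding row per category value, and plain dense layers with biases. Real DeepTab models add tokenizers, normalization layers, and task heads that are not counted here, and both function names are hypothetical.

```python
def embedding_params(cardinalities: list[int], d_model: int) -> int:
    # Assumption: one embedding vector of width d_model per category value
    return sum(c * d_model for c in cardinalities)


def dense_params(widths: list[int]) -> int:
    # Weights plus biases for each consecutive pair of layer widths
    return sum(a * b + b for a, b in zip(widths, widths[1:]))


if __name__ == "__main__":
    # Hypothetical schema: three categorical columns, 20 inputs to the MLP
    print("embeddings:", embedding_params([3, 12, 500], d_model=64))
    print("MLP [256,128,32]:", dense_params([20, 256, 128, 32, 1]))
    # Doubling every hidden width grows the hidden-to-hidden layers about 4x
    print("MLP [512,256,64]:", dense_params([20, 512, 256, 64, 1]))
```

A single high-cardinality column (500 values here) dominates the embedding count, which is why the same model can differ widely in size across datasets.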

Tip

Practical implications:

  • Linear in feature sequence: Mambular, MambaTab, TabulaRNN, MLP, ResNet, and TabM avoid feature-attention matrices. MambAttention avoids them only in its Mamba blocks.

  • Quadratic in features: FTTransformer, AutoInt, MambAttention attention layers, and TabTransformer become expensive as the number of feature tokens grows.

  • Quadratic in batch rows: SAINT’s row-attention term is controlled by mini-batch size, not by the total dataset size directly.

  • Retrieval-based: TabR can be strong on larger data, but it needs candidate encoding/search memory and depends on the retrieval index.

  • Soft tree-based: NODE-style models are not logarithmic at inference; differentiable trees evaluate soft paths/leaves, so tree depth matters.
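Two of the points above, SAINT’s batch-dependent row attention and the depth dependence of soft trees, can be illustrated with the memory-driver expressions from the complexity table. The sketch assumes float32 storage and ignores attention heads, layer count, and anything a framework keeps for the backward pass. The function names are illustrative only.

```python
BYTES_PER_FLOAT32 = 4  # assumption: activations stored as float32


def saint_attention_map_bytes(batch_size: int, n_tokens: int) -> int:
    # O(B * P^2 + B^2): column maps for every row plus one row-attention map per batch
    return (batch_size * n_tokens**2 + batch_size**2) * BYTES_PER_FLOAT32


def soft_tree_leaf_bytes(batch_size: int, n_trees: int, depth: int) -> int:
    # O(B * T * 2^depth): every leaf of every tree receives a soft weight
    return batch_size * n_trees * 2**depth * BYTES_PER_FLOAT32


if __name__ == "__main__":
    for b in (64, 256, 1024):
        mib = saint_attention_map_bytes(b, n_tokens=30) / 2**20
        print(f"SAINT B={b:>4}: {mib:.2f} MiB")
    for d in (6, 10):
        mib = soft_tree_leaf_bytes(256, n_trees=128, depth=d) / 2**20
        print(f"NODE depth={d:>2}: {mib:.0f} MiB")
```

Each extra level of tree depth doubles the leaf term, and the SAINT row-attention term grows with the square of the mini-batch size regardless of how many rows the dataset has.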

Note

Category guide:

  • State Space Models: Selective SSM/Mamba-style sequence models adapted to tabular features.

  • Transformers: Self-attention mechanisms for feature and/or row interactions.

  • Residual Networks: Deep feedforward MLPs with skip connections.

  • Tree-Inspired: Differentiable decision trees with gradient optimization.

  • Other: Standard architectures (MLP, parameter-efficient ensembles, RNNs).

Architecture Categories

State Space Models (SSMs)

Feature-sequence models with linear sequence-length scaling in the Mamba blocks

| Model | Default Layers | Default Hidden Dim | Key Feature | Best Use Case |
|---|---|---|---|---|
| Mambular | 4 Mamba layers | 64 | Stacked Mamba blocks over feature tokens | General-purpose tabular sequence modeling |
| MambaTab | 1 Mamba layer | 64 | Lightweight Mamba block | Small datasets, speed |
| MambAttention | Hybrid | 64 | Mamba blocks plus feature attention | Complex feature interactions |

Transformer-Based

Attention mechanisms for feature and row interactions

| Model | Attention Scope | Default Hidden Dim | Key Feature | Best Use Case |
|---|---|---|---|---|
| FTTransformer | All feature tokens | 128 | Feature tokenization | Feature interactions |
| TabTransformer | Categorical tokens | 128 | Contextual categorical embeddings | Categorical-heavy data |
| SAINT | Row + column | 128 | Intersample (row) plus column attention | Semi-supervised or row-context settings |
| AutoInt | All feature tokens | 128 | Self-attentive feature interaction learning | Automatic interaction modeling |

Tree-Inspired

Differentiable tree and forest structures

| Model | Tree Type | Default Shape | Key Feature | Best Use Case |
|---|---|---|---|---|
| NODE | Oblivious differentiable trees | 4 layers, 128 trees/layer, depth 6 | Soft routing over oblivious trees | Interpretable tree-inspired modeling |
| ENODE | Embedded NODE variant | 4 layers, 64 trees/layer, depth 6 | Feature embeddings before NODE-style blocks | Tree-inspired modeling with embeddings |
| NDTF | Neural decision tree forest | 12 trees, random depths 4 to 15 | Multiple neural decision trees | Tree ensemble-style experiments |

Residual Networks

Deep feedforward networks with skip connections

| Model | Default Shape | Key Feature | Best Use Case |
|---|---|---|---|
| ResNet | 3 residual blocks, [256, 128, 32] layer sizes | Residual blocks | Fast baseline |
| TabR | d_main=256, context_size=96 | Retrieval-augmented prediction | Larger datasets with useful neighbor structure |

Other Architectures

| Model | Type | Default Shape | Key Feature | Best Use Case |
|---|---|---|---|---|
| MLP | Feedforward | [256, 128, 32] layer sizes | Simple dense baseline | Fastest baseline |
| TabM | Parameter-efficient ensemble | [256, 256, 128] layer sizes, 32 ensemble members | Batch ensembling | Strong efficient baseline |
| TabulaRNN | RNN | d_model=128, 4 recurrent layers | Sequential feature processing | Sequential feature modeling |

Model Selection by Use Case

Note

General pattern: Simpler models (MLP, ResNet, TabM) are strong practical baselines and often work well on small or medium datasets with proper regularization. More complex models (Transformers, SSMs, retrieval models) are most useful when their inductive bias matches the data or when the dataset is large enough to justify the extra capacity and compute.

By Dataset Size

| Dataset Size | Recommended Models | Reasoning | Key Consideration | Avoid |
|---|---|---|---|---|
| <5K samples | MambaTab, ResNet, MLP, TabM | Lower capacity and fast iteration reduce overfitting risk | Use regularization and validation-driven early stopping | Deep Transformers (SAINT, deep FTTransformer) |
| 5K to 50K samples | Mambular, FTTransformer, TabM, MambAttention | More capacity can pay off when features interact strongly | Balance capacity vs training time | Very high capacity if data is simple |
| >50K samples | Mambular, TabM, TabR, FTTransformer | Larger data can support complex patterns and retrieval | Watch attention/retrieval bottlenecks | SAINT with large batches unless row attention is needed |

Alternatives: MambaTab for speed, NODE/ENODE for tree-inspired interpretability, ResNet/MLP for very fast training.

By Feature Type

| Feature Composition | Best Choice | Good Alternatives | Reasoning | Avoid |
|---|---|---|---|---|
| >60% categorical | TabTransformer | FTTransformer, Mambular | TabTransformer’s attention is focused on categorical contextual embeddings | - |
| >80% numerical | Mambular, TabM | ResNet, NODE | SSM/dense baselines avoid categorical-only assumptions | TabTransformer |
| Balanced mixed | Mambular, FTTransformer | MambAttention, TabM | Unified feature processing supports mixed feature interactions | - |
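The feature-composition thresholds above can be written as a small lookup. This is a sketch only: the function name is hypothetical, and anything that is neither more than 60% categorical nor more than 80% numerical is treated as "Balanced mixed", which is an assumption about how the rows are meant to partition.

```python
def best_choice_by_composition(n_categorical: int, n_numerical: int) -> list[str]:
    """Return the 'Best Choice' models for a given feature mix."""
    total = n_categorical + n_numerical
    if total == 0:
        raise ValueError("need at least one feature")
    if n_categorical / total > 0.60:
        return ["TabTransformer"]
    if n_numerical / total > 0.80:
        return ["Mambular", "TabM"]
    # Assumption: everything else counts as balanced mixed
    return ["Mambular", "FTTransformer"]


if __name__ == "__main__":
    print(best_choice_by_composition(n_categorical=7, n_numerical=3))
    print(best_choice_by_composition(n_categorical=1, n_numerical=9))
    print(best_choice_by_composition(n_categorical=4, n_numerical=6))
```

The result is a starting shortlist, not a ranking. The "Good Alternatives" column remains worth trying on the same data.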

By Computational Constraints

| Constraint | Recommended Models | Reasoning | Avoid |
|---|---|---|---|
| Memory <8GB GPU | MLP, ResNet, MambaTab, Mambular, TabM | No full feature-attention matrix in the main path | FTTransformer/AutoInt with many feature tokens, SAINT with large batches |
| Fast training needed | MLP, ResNet, MambaTab, TabM | Simple dense or short sequence paths | FTTransformer, TabR, SAINT if retrieval/row attention dominates |
| Low inference latency | MLP, ResNet, Mamba variants, TabM | Avoids retrieval search and full attention over many tokens | TabR with large candidate pools, wide Transformers |

Training speed tiers: Fastest (MLP, ResNet) -> Fast (MambaTab, TabM) -> Moderate (Mambular, NODE) -> Slower or workload-dependent (FTTransformer, TabR, SAINT).

By Task Requirements

| Task | General Purpose | Fast/Efficient | Interpretable | Notes |
|---|---|---|---|---|
| Classification | Mambular, FTTransformer, MambAttention | MambaTab, ResNet, TabM | NODE, ENODE, NDTF | All models support multi-class |
| Regression | Mambular, FTTransformer, TabR (large data) | MambaTab, ResNet, TabM | NODE | Tree models can be useful when tree-like splits fit the data |
| LSS (Distributional) | Mambular, FTTransformer, MambAttention | MambaTab | ENODE | All models support LSS mode |

Special cases: For quantile regression, use any model in LSS mode with an appropriate distribution family.

Hardware Requirements by Model

The table below gives practical guidance on whether each model trains comfortably on a CPU-only machine or requires a GPU (CUDA, MPS, or other accelerator). Thresholds are rough estimates based on architecture cost, and the actual boundary depends on the number of features, hidden width, and depth used.

Important

Features matter as much as rows. The attention cost of Transformer-style models grows quadratically with feature-token count, so a default FTTransformer config on 20 features can require at least as much compute as an MLP on 50 features. The estimates below assume the default DeepTab config for each model and a moderate feature count (10 to 30 columns). Wide datasets shift the GPU threshold lower.
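As a worked illustration of the quadratic term, take the FTTransformer default shape listed earlier (\(L = 4\), \(D = 128\)) and ignore constants and the feed-forward blocks. The per-row attention term \(L \cdot P^2 \cdot D\) then gives

\[
P = 20:\quad 4 \cdot 20^2 \cdot 128 = 204{,}800, \qquad
P = 60:\quad 4 \cdot 60^2 \cdot 128 = 1{,}843{,}200 .
\]

Tripling the feature-token count multiplies this term by \(3^2 = 9\). This is only the dominant term from the complexity table, not a measured cost.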

| CPU comfort zone | Models | Primary cost driver | When to reach for a GPU |
|---|---|---|---|
| Up to ~500K rows | MLP, ResNet | Cache-friendly dense and skip-connection layers | Rarely needed; CPU scales well even on large data |
| Up to ~100K rows | TabM, MambaTab | MLP ensemble paths, single lightweight Mamba block | Modest speedup; CPU stays competitive |
| Up to ~20K rows | Mambular, TabulaRNN, TabTransformer, NODE | Stacked sequence/recurrent blocks or categorical attention | Past this size, accelerators give meaningful speedup |
| Up to ~10K rows | MambAttention, FTTransformer, AutoInt, ENODE, NDTF, TabR | Full-feature attention \(O(P^2)\), retrieval, or deep soft trees | GPU strongly recommended as features or rows grow |
| Up to ~2K rows | SAINT | Column plus row attention per batch | GPU effectively required; CPU is impractically slow past a few thousand rows |

The “CPU comfort zone” is where training at default config finishes in reasonable wall-clock time on a modern CPU. Beyond it, a CUDA, MPS, or similar accelerator provides meaningful speedup.
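For scripted model sweeps, the thresholds above can be encoded as a lookup. The dictionary and function below are a hypothetical helper, not a DeepTab API. The row counts are the rough estimates from the table and inherit its assumptions (default configs, roughly 10 to 30 columns).

```python
# Approximate CPU comfort-zone limits (rows), copied from the table above
CPU_COMFORT_ROWS = {
    "MLP": 500_000, "ResNet": 500_000,
    "TabM": 100_000, "MambaTab": 100_000,
    "Mambular": 20_000, "TabulaRNN": 20_000, "TabTransformer": 20_000, "NODE": 20_000,
    "MambAttention": 10_000, "FTTransformer": 10_000, "AutoInt": 10_000,
    "ENODE": 10_000, "NDTF": 10_000, "TabR": 10_000,
    "SAINT": 2_000,
}


def within_cpu_comfort_zone(model: str, n_rows: int) -> bool:
    """True if n_rows is at or below the table's rough CPU limit for this model."""
    return n_rows <= CPU_COMFORT_ROWS[model]


if __name__ == "__main__":
    n_rows = 50_000  # hypothetical dataset size
    cpu_ok = [m for m in CPU_COMFORT_ROWS if within_cpu_comfort_zone(m, n_rows)]
    print("CPU is likely fine for:", ", ".join(cpu_ok))
```

Because wide datasets shift the threshold lower, a caller with many feature tokens may want to scale these limits down before using them.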

Tip

Apple Silicon (MPS): All models can be run with PyTorch’s MPS backend. Set accelerator="mps" in TrainerConfig. MPS provides meaningful speedup for most models. The exception is models that rely on Mamba CUDA kernels: these fall back to CPU unless a dedicated MPS implementation is available.
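Before setting the accelerator, it can help to check whether the MPS backend is actually usable. The sketch below relies only on `torch.backends.mps.is_available()` from PyTorch and degrades to CPU when PyTorch or MPS is missing. The helper name is hypothetical, and only the "mps" string is taken from this page; how the resulting value is passed to TrainerConfig is left to the caller.

```python
def mps_available() -> bool:
    """Return True if PyTorch is installed and reports a usable MPS device."""
    try:
        import torch
    except ImportError:
        return False
    # torch.backends.mps is absent in older PyTorch builds, so guard the lookup
    mps = getattr(torch.backends, "mps", None)
    return bool(mps is not None and mps.is_available())


if __name__ == "__main__":
    accelerator = "mps" if mps_available() else "cpu"
    print("accelerator:", accelerator)
```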

Note

Inference vs training: Inference (predict) is cheaper than training because there is no backward pass or optimizer state. A model that needs a GPU for training can often run inference on CPU in production for moderate batch sizes. Use InferenceModel to load artifacts for CPU-only inference environments.


References

Key papers used for the comparison:

  • Ahamed, M. A., & Cheng, Q. (2024). MambaTab: A Plug-and-Play Model for Learning Tabular Data. arXiv:2401.08867, DOI:10.1109/MIPR62202.2024.00065

  • Gorishniy, Y., Rubachev, I., Khrulkov, V., & Babenko, A. (2021). Revisiting Deep Learning Models for Tabular Data. NeurIPS 2021. arXiv:2106.11959

  • Gorishniy, Y., Rubachev, I., Kartashev, N., Shlenskii, D., Kotelnikov, A., & Babenko, A. (2023). TabR: Tabular Deep Learning Meets Nearest Neighbors in 2023. arXiv:2307.14338

  • Gorishniy, Y., Kotelnikov, A., & Babenko, A. (2024). TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling. ICLR 2025. arXiv:2410.24210

  • Gu, A., & Dao, T. (2024). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR 2016. arXiv:1512.03385

  • Huang, X., Khetan, A., Cvitkovic, M., & Karnin, Z. (2020). TabTransformer: Tabular Data Modeling Using Contextual Embeddings. arXiv:2012.06678

  • Kontschieder, P., Fiterau, M., Criminisi, A., & Rota Bulo, S. (2015). Deep Neural Decision Forests. ICCV 2015. CVF Open Access

  • Popov, S., Morozov, S., & Babenko, A. (2019). Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data. ICLR 2020. arXiv:1909.06312

  • Somepalli, G., Goldblum, M., Schwarzschild, A., Bruss, C. B., & Goldstein, T. (2021). SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training. arXiv:2106.01342

  • Song, W., Shi, C., Xiao, Z., Duan, Z., Xu, Y., Zhang, M., & Tang, J. (2019). AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. CIKM 2019. arXiv:1810.11921

  • Thielmann, A. F., Kumar, M., Weisser, C., Reuter, A., Säfken, B., & Samiee, S. (2024). Mambular: A Sequential Model for Tabular Deep Learning. arXiv:2408.06291

  • Thielmann, A. F., & Samiee, S. (2024). On the Efficiency of NLP-Inspired Methods for Tabular Deep Learning. arXiv:2411.17207

  • Wen, Y., Tran, D., & Ba, J. (2020). BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning. arXiv:2002.06715

See Also