Model Comparison

Architectural comparison and computational characteristics of DeepTab’s model zoo.

Note

Focus on architecture: This document emphasizes computational complexity, architectural design, and qualitative comparisons. Quantitative performance benchmarks will be added when systematic experiments are completed.

Scope: The tables below cover the 15 stable models. The 3 experimental models (ModernNCA, Tangos, Trompt) are documented separately under Model Tiers.

Computational Characteristics

The table below reports dominant forward-pass scaling for a batch. It is a practical guide, not a FLOP-count benchmark.

Category	Model	DeepTab Default Shape	Dominant Forward-Time Terms	Memory Driver	Primary References
State Space Models	Mambular	`d_model=64`, `n_layers=4`	Linear in feature sequence: \(O(B \cdot L \cdot P \cdot D)\) plus projection constants	\(O(B \cdot P \cdot D)\) activations	Mambular, Mamba
	MambaTab	`d_model=64`, `n_layers=1`	Linear in feature sequence: \(O(B \cdot L \cdot P \cdot D)\) plus projection constants	\(O(B \cdot P \cdot D)\) activations	MambaTab, Mamba
	MambAttention	`d_model=64`, Mamba blocks + attention	Mamba term \(O(B \cdot L_m \cdot P \cdot D)\) plus feature attention \(O(B \cdot L_a \cdot P^2 \cdot D)\)	Attention maps \(O(B \cdot P^2)\) when attention layers are active	Mambular, Mamba
Transformers	FTTransformer	`d_model=128`, `n_layers=4`, `n_heads=8`	Feature self-attention \(O(B \cdot L \cdot P^2 \cdot D)\) plus feed-forward blocks	\(O(B \cdot L \cdot P^2)\) attention maps	Gorishniy et al. 2021
	TabTransformer	`d_model=128`, `n_layers=4`, `n_heads=8`	Categorical-token self-attention \(O(B \cdot L \cdot P_{\text{cat}}^2 \cdot D)\) plus numerical MLP head	\(O(B \cdot L \cdot P_{\text{cat}}^2)\) attention maps	Huang et al. 2020
	SAINT	`d_model=128`, `n_layers=1`, `n_heads=2`	Column attention \(O(B \cdot P^2 \cdot D)\) plus row attention \(O(B^2 \cdot P \cdot D)\) within a batch	\(O(B \cdot P^2 + B^2)\) attention maps	Somepalli et al. 2021
	AutoInt	`d_model=128`, `n_layers=4`, `n_heads=8`	Feature self-attention \(O(B \cdot L \cdot P^2 \cdot D)\); key-value compression reduces constants	\(O(B \cdot L \cdot P^2)\) attention maps	Song et al. 2019
Residual Networks	ResNet	`layer_sizes=[256,128,32]`, `num_blocks=3`	Dense layers: \(O(B \cdot \sum_\ell d_{\ell-1} d_\ell)\)	Linear in batch and hidden width	He et al. 2016, Gorishniy et al. 2021
	TabR	`d_main=256`, `context_size=96`	Candidate encoding plus exact/FAISS nearest-neighbor search \(O(B \cdot N_c \cdot D)\) and context mixing \(O(B \cdot C \cdot D)\)	Candidate cache \(O(N_c \cdot D)\)	Gorishniy et al. 2023
Tree-Inspired	NODE	`num_layers=4`, `layer_dim=128`, `depth=6`	Soft oblivious trees evaluate all splits/leaves: \(O(B \cdot L \cdot T \cdot (P \cdot D_t + D_t \cdot 2^{D_t}))\)	Path/leaf activations \(O(B \cdot T \cdot 2^{D_t})\)	Popov et al. 2019
	ENODE	`d_model=8`, `num_layers=4`, `layer_dim=64`, `depth=6`	NODE-style soft tree evaluation with learned embeddings	Path/leaf activations \(O(B \cdot T \cdot 2^{D_t})\)	Popov et al. 2019
	NDTF	`n_ensembles=12`, random depths 4 to 15	Neural decision forest evaluates internal nodes and leaf probabilities for each tree	Leaf probabilities scale with \(O(B \cdot E \cdot 2^{D_t})\)	Kontschieder et al. 2015
Other	MLP	`layer_sizes=[256,128,32]`	Dense layers: \(O(B \cdot \sum_\ell d_{\ell-1} d_\ell)\)	Linear in batch and hidden width	Standard MLP baseline
	TabM	`layer_sizes=[256,256,128]`, `ensemble_size=32`	MLP-style dense compute with parameter-efficient batch ensembling	Linear in batch, hidden width, and active ensemble outputs	Gorishniy et al. 2024, Wen et al. 2020
	TabulaRNN	`d_model=128`, `n_layers=4`	Recurrent feature-sequence processing \(O(B \cdot L \cdot P \cdot D^2)\) for standard RNN-style cells	\(O(B \cdot P \cdot D)\) activations	Thielmann & Samiee 2024

Notation: \(B\) = batch size, \(P\) = feature tokens after preprocessing/embedding, \(P_{\text{cat}}\) = categorical tokens, \(D\) = hidden dimension, \(L\) = layers, \(L_m\) = Mamba layers, \(L_a\) = attention layers, \(C\) = retrieved context size, \(N_c\) = candidate rows for retrieval, \(T\) = trees per layer, \(E\) = forest ensemble size, \(D_t\) = tree depth, \(d_\ell\) = width of dense layer \(\ell\) (so a dense layer costs \(d_{\ell-1} d_\ell\)).
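The dominant terms in the table can be turned into rough operation counts for a given batch. The sketch below is illustrative only: the function names are hypothetical and not part of DeepTab, constants and projection costs are ignored, and the dense estimate assumes the first layer sees one input per feature, which depends on preprocessing.

```python
# Rough evaluation of the dominant forward-time terms from the table above.
# All function names are hypothetical helpers, not DeepTab API.

def mamba_cost(B, L, P, D):
    """Mambular/MambaTab term: O(B * L * P * D)."""
    return B * L * P * D


def feature_attention_cost(B, L, P, D):
    """FTTransformer/AutoInt term: O(B * L * P^2 * D)."""
    return B * L * P**2 * D


def saint_cost(B, P, D):
    """SAINT: column attention B*P^2*D plus row attention B^2*P*D."""
    return B * P**2 * D + B**2 * P * D


def rnn_cost(B, L, P, D):
    """TabulaRNN term for standard RNN-style cells: O(B * L * P * D^2)."""
    return B * L * P * D**2


def dense_cost(B, n_inputs, layer_sizes):
    """MLP/ResNet term: B * sum over layers of d_{l-1} * d_l.

    Assumes n_inputs is the width fed to the first dense layer.
    """
    widths = [n_inputs, *layer_sizes]
    return B * sum(a * b for a, b in zip(widths, widths[1:]))


if __name__ == "__main__":
    B, P = 256, 20  # example batch size and feature-token count
    print("Mambular     ", mamba_cost(B, L=4, P=P, D=64))
    print("FTTransformer", feature_attention_cost(B, L=4, P=P, D=128))
    print("SAINT        ", saint_cost(B, P=P, D=128))
    print("TabulaRNN    ", rnn_cost(B, L=4, P=P, D=128))
    print("MLP          ", dense_cost(B, n_inputs=P, layer_sizes=[256, 128, 32]))
```

These numbers are only useful for comparing growth rates between models, in line with the note that the table is not a FLOP-count benchmark.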

Important

Parameter count assumptions: Parameter counts are not listed because they depend strongly on dataset schema and preprocessing:

Input features: More features increase embedding, tokenizer, and first-layer parameters.
Categorical cardinality: More categories increase embedding-table parameters.
Hidden width: Dense projections usually scale with width squared.
Depth and ensembles: Additional layers, trees, or ensemble members increase parameters and activations.

The “DeepTab Default Shape” column is taken from the current model config defaults in `deeptab/configs/models/`.
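To see why schema matters, the following minimal sketch counts parameters for two of the factors listed above: categorical embedding tables and dense layers. The helper names are hypothetical, and the counts leave out output heads, normalization layers, and any model-specific blocks, so they are a lower bound rather than what DeepTab would report.

```python
# Hypothetical helpers illustrating how schema and width drive parameter counts.

def embedding_params(cardinalities, d_model):
    """One embedding row of size d_model per category, per categorical column."""
    return sum(c * d_model for c in cardinalities)


def dense_params(n_inputs, layer_sizes, bias=True):
    """Weights (d_{l-1} * d_l) plus optional biases (d_l) for a stack of dense layers."""
    widths = [n_inputs, *layer_sizes]
    return sum(a * b + (b if bias else 0) for a, b in zip(widths, widths[1:]))


if __name__ == "__main__":
    # Three categorical columns with 10, 100, and 1000 categories, d_model=64.
    print(embedding_params([10, 100, 1000], d_model=64))
    # Same dense stack, 20 vs 40 input features: only the first layer changes.
    print(dense_params(20, [256, 128, 32]))
    print(dense_params(40, [256, 128, 32]))
```

In this example the high-cardinality column dominates the embedding count, while doubling the input features only changes the first dense layer.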

Tip

Practical implications:

Linear in feature sequence: The pure Mamba models (Mambular, MambaTab), RNNs, MLPs, ResNets, and TabM avoid feature-attention matrices.
Quadratic in features: FTTransformer, AutoInt, MambAttention attention layers, and TabTransformer become expensive as the number of feature tokens grows.
Quadratic in batch rows: SAINT’s row-attention term is controlled by mini-batch size, not by the total dataset size directly.
Retrieval-based: TabR can be strong on larger data, but it needs candidate encoding/search memory and depends on the retrieval index.
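Soft tree-based: NODE-style models are not logarithmic at inference. Differentiable trees evaluate every soft path and leaf rather than a single root-to-leaf path, so cost grows with \(2^{D_t}\) and each extra level of depth doubles the leaf activations.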
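The scaling behaviors in this list can be checked with a few lines of arithmetic. The sketch below uses hypothetical helper names and the memory and time terms from the table above; it only shows how each term reacts when one quantity changes.

```python
# Hypothetical helpers showing how the dominant terms react to growth.

def feature_attention_map(B, L, P):
    """Attention-map size for feature self-attention: B * L * P^2."""
    return B * L * P**2


def saint_row_term(B, P, D):
    """SAINT row-attention time term within a batch: B^2 * P * D."""
    return B**2 * P * D


def node_leaf_activations(B, T, depth):
    """Soft oblivious trees: B * T * 2^depth leaf activations per layer."""
    return B * T * 2**depth


if __name__ == "__main__":
    # Doubling feature tokens quadruples the attention maps.
    print(feature_attention_map(64, 4, 40) / feature_attention_map(64, 4, 20))
    # Doubling the mini-batch quadruples SAINT's row-attention term.
    print(saint_row_term(512, 20, 128) / saint_row_term(256, 20, 128))
    # One extra level of depth doubles NODE leaf activations.
    print(node_leaf_activations(1, 128, 7) / node_leaf_activations(1, 128, 6))
```

With the NODE defaults listed earlier (128 trees per layer, depth 6), a single sample already produces 128 × 64 = 8192 leaf activations per layer.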

Note

Category guide:

State Space Models: Selective SSM/Mamba-style sequence models adapted to tabular features.
Transformers: Self-attention mechanisms for feature and/or row interactions.
Residual Networks: Deep feedforward MLPs with skip connections. The retrieval-augmented TabR is also grouped here.
Tree-Inspired: Differentiable decision trees with gradient optimization.
Other: Standard architectures (MLP, parameter-efficient ensembles, RNNs).

Architecture Categories

State Space Models (SSMs)

Feature-sequence models with linear sequence-length scaling in the Mamba blocks

Model	Default Layers	Default Hidden Dim	Key Feature	Best Use Case
Mambular	4 Mamba layers	64	Stacked Mamba blocks over feature tokens	General-purpose tabular sequence modeling
MambaTab	1 Mamba layer	64	Lightweight Mamba block	Small datasets, speed
MambAttention	Hybrid	64	Mamba blocks plus feature attention	Complex feature interactions

Transformer-Based

Attention mechanisms for feature and row interactions

Model	Attention Scope	Default Hidden Dim	Key Feature	Best Use Case
FTTransformer	All feature tokens	128	Feature tokenization	Feature interactions
TabTransformer	Categorical tokens	128	Contextual categorical embeddings	Categorical-heavy data
SAINT	Row + column	128	Intersample (row) plus column attention	Semi-supervised or row-context settings
AutoInt	All feature tokens	128	Self-attentive feature interaction learning	Automatic interaction modeling

Tree-Inspired

Differentiable tree and forest structures

Model	Tree Type	Default Shape	Key Feature	Best Use Case
NODE	Oblivious differentiable trees	4 layers, 128 trees/layer, depth 6	Soft routing over oblivious trees	Interpretable tree-inspired modeling
ENODE	Embedded NODE variant	4 layers, 64 trees/layer, depth 6	Feature embeddings before NODE-style blocks	Tree-inspired modeling with embeddings
NDTF	Neural decision tree forest	12 trees, random depths 4 to 15	Multiple neural decision trees	Tree ensemble-style experiments

Residual Networks

Deep feedforward networks with skip connections

Model	Default Shape	Key Feature	Best Use Case
ResNet	3 residual blocks, `[256, 128, 32]` layer sizes	Residual blocks	Fast baseline
TabR	`d_main=256`, `context_size=96`	Retrieval-augmented prediction	Larger datasets with useful neighbor structure

Other Architectures

Model	Type	Default Shape	Key Feature	Best Use Case
MLP	Feedforward	`[256, 128, 32]` layer sizes	Simple dense baseline	Fastest baseline
TabM	Parameter-efficient ensemble	`[256, 256, 128]` layer sizes, 32 ensemble members	Batch ensembling	Strong efficient baseline
TabulaRNN	RNN	`d_model=128`, 4 recurrent layers	Sequential feature processing	Sequential feature modeling

Model Selection by Use Case

Note

General pattern: Simpler models (MLP, ResNet, TabM) are strong practical baselines and often work well on small or medium datasets with proper regularization. More complex models (Transformers, SSMs, retrieval models) are most useful when their inductive bias matches the data or when the dataset is large enough to justify the extra capacity and compute.

By Dataset Size

Dataset Size	Recommended Models	Reasoning	Key Consideration	Avoid
<5K samples	MambaTab, ResNet, MLP, TabM	Lower capacity and fast iteration reduce overfitting risk	Use regularization and validation-driven early stopping	Deep Transformers (SAINT, deep FTTransformer)
5K to 50K samples	Mambular, FTTransformer, TabM, MambAttention	More capacity can pay off when features interact strongly	Balance capacity vs training time	Very high capacity if data is simple
>50K samples	Mambular, TabM, TabR, FTTransformer	Larger data can support complex patterns and retrieval	Watch attention/retrieval bottlenecks	SAINT with large batches unless row attention is needed

Alternatives: MambaTab for speed, NODE/ENODE for tree-inspired interpretability, ResNet/MLP for very fast training.

By Feature Type

Feature Composition	Best Choice	Good Alternatives	Reasoning	Avoid
>60% categorical	TabTransformer	FTTransformer, Mambular	TabTransformer’s attention is focused on categorical contextual embeddings	-
>80% numerical	Mambular, TabM	ResNet, NODE	SSM/dense baselines avoid categorical-only assumptions	TabTransformer
Balanced mixed	Mambular, FTTransformer	MambAttention, TabM	Unified feature processing supports mixed feature interactions	-

By Computational Constraints

Constraint	Recommended Models	Reasoning	Avoid
Memory <8GB GPU	MLP, ResNet, MambaTab, Mambular, TabM	No full feature-attention matrix in the main path	FTTransformer/AutoInt with many feature tokens, SAINT with large batches
Fast training needed	MLP, ResNet, MambaTab, TabM	Simple dense or short sequence paths	FTTransformer, TabR, SAINT if retrieval/row attention dominates
Low inference latency	MLP, ResNet, Mamba variants, TabM	Avoids retrieval search and full attention over many tokens	TabR with large candidate pools, wide Transformers

Training speed tiers: Fastest (MLP, ResNet) -> Fast (MambaTab, TabM) -> Moderate (Mambular, NODE) -> Slower or workload-dependent (FTTransformer, TabR, SAINT).

By Task Requirements

Task	General Purpose	Fast/Efficient	Interpretable	Notes
Classification	Mambular, FTTransformer, MambAttention	MambaTab, ResNet, TabM	NODE, ENODE, NDTF	All models support multi-class
Regression	Mambular, FTTransformer, TabR (large data)	MambaTab, ResNet, TabM	NODE	Tree models can be useful when tree-like splits fit the data
LSS (Distributional)	Mambular, FTTransformer, MambAttention	MambaTab	ENODE	All models support LSS mode

Special cases: For quantile regression, use any model in LSS mode with an appropriate distribution family.

Recommended Decision Tree

Start Here
|
|- Dataset size <5K? -> Use MambaTab, ResNet, MLP, or TabM with regularization
|
|- Need tree-inspired interpretability? -> Use NODE, ENODE, or NDTF
|
|- Memory constrained (<8GB)? -> Prefer Mambular, MambaTab, MLP, ResNet, or TabM
|
|- Inference latency critical? -> Avoid retrieval/large attention; use MLP, ResNet, TabM, or Mamba variants
|
|- >60% categorical features? -> Consider TabTransformer
|
|- Need retrieval from similar training examples? -> Consider TabR
|
`- General purpose -> Mambular or TabM
   `- Alternative -> FTTransformer when GPU memory and feature count permit
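The decision tree above can also be written as a small function. This is a sketch, not DeepTab API: the function name and arguments are hypothetical, the branches are assumed to be checked top to bottom with the first match winning, and "Mamba variants" is read here as Mambular and MambaTab.

```python
# Hypothetical encoding of the decision tree above; first matching branch wins.

def recommend_models(
    n_rows,
    *,
    need_tree_interpretability=False,
    memory_constrained=False,   # e.g. less than 8GB of GPU memory
    latency_critical=False,
    categorical_fraction=0.0,   # share of categorical columns, 0.0 to 1.0
    need_retrieval=False,
):
    if n_rows < 5_000:
        return ["MambaTab", "ResNet", "MLP", "TabM"]  # use with regularization
    if need_tree_interpretability:
        return ["NODE", "ENODE", "NDTF"]
    if memory_constrained:
        return ["Mambular", "MambaTab", "MLP", "ResNet", "TabM"]
    if latency_critical:
        # Avoid retrieval and large attention.
        return ["MLP", "ResNet", "TabM", "Mambular", "MambaTab"]
    if categorical_fraction > 0.6:
        return ["TabTransformer"]
    if need_retrieval:
        return ["TabR"]
    # General purpose; FTTransformer is the alternative when GPU memory
    # and feature count permit.
    return ["Mambular", "TabM"]


if __name__ == "__main__":
    print(recommend_models(2_000))
    print(recommend_models(80_000, categorical_fraction=0.7))
    print(recommend_models(80_000))
```

Treat the result as a starting shortlist; the tables earlier in this section give the reasoning behind each branch.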

Hardware Requirements by Model

The table below gives practical guidance on whether each model trains comfortably on a CPU-only machine or requires a GPU (CUDA, MPS, or other accelerator). Thresholds are rough estimates based on architecture cost, and the actual boundary depends on the number of features, hidden width, and depth used.

Important

Features matter as much as rows. Transformer-style models grow quadratically with feature-token count: doubling the number of feature tokens roughly quadruples the attention cost. As a result, 20 features with a default FTTransformer config can require as much compute as 50 features with an MLP. The estimates below assume the default DeepTab config for each model and a moderate feature count (10 to 30 columns). Wide datasets shift the GPU threshold lower.

CPU comfort zone	Models	Primary cost driver	When to reach for a GPU
Up to ~500K rows	MLP, ResNet	Cache-friendly dense and skip-connection layers	Rarely needed; CPU scales well even on large data
Up to ~100K rows	TabM, MambaTab	MLP ensemble paths, single lightweight Mamba block	Modest speedup; CPU stays competitive
Up to ~20K rows	Mambular, TabulaRNN, TabTransformer, NODE	Stacked sequence/recurrent blocks or categorical attention	Past this size, accelerators give meaningful speedup
Up to ~10K rows	MambAttention, FTTransformer, AutoInt, ENODE, NDTF, TabR	Full-feature attention \(O(P^2)\), retrieval, or deep soft trees	GPU strongly recommended as features or rows grow
Up to ~2K rows	SAINT	Column plus row attention per batch	GPU effectively required; CPU is impractically slow past a few thousand rows

The “CPU comfort zone” is where training at default config finishes in reasonable wall-clock time on a modern CPU. Beyond it, a CUDA, MPS, or similar accelerator provides meaningful speedup.
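For scripting, the table can be expressed as a lookup. The sketch below copies the approximate row thresholds from the table; the dictionary and function names are hypothetical, and the same caveats apply (default configs, roughly 10 to 30 columns).

```python
# Approximate CPU comfort thresholds (rows) copied from the table above.
# Names are hypothetical; thresholds are rough estimates, not guarantees.

CPU_COMFORT_ROWS = {
    "MLP": 500_000, "ResNet": 500_000,
    "TabM": 100_000, "MambaTab": 100_000,
    "Mambular": 20_000, "TabulaRNN": 20_000,
    "TabTransformer": 20_000, "NODE": 20_000,
    "MambAttention": 10_000, "FTTransformer": 10_000, "AutoInt": 10_000,
    "ENODE": 10_000, "NDTF": 10_000, "TabR": 10_000,
    "SAINT": 2_000,
}


def gpu_recommended(model, n_rows):
    """Return True when n_rows exceeds the model's rough CPU comfort zone."""
    return n_rows > CPU_COMFORT_ROWS[model]


if __name__ == "__main__":
    for name in ("MLP", "Mambular", "SAINT"):
        print(name, gpu_recommended(name, 50_000))
```

Wide datasets would justify lowering these thresholds, especially for the attention-based models.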

Tip

Apple Silicon (MPS): All models run on MPS via PyTorch’s MPS backend. Set `accelerator="mps"` in `TrainerConfig`. MPS gives a meaningful speedup for most models. The exception is models that rely on Mamba CUDA kernels, which fall back to CPU execution unless a dedicated MPS implementation is available.
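A script that runs on several machines may need to choose the accelerator at runtime. The following is a minimal sketch using PyTorch's availability checks. Only the `"mps"` value comes from this page; the `"cuda"` and `"cpu"` strings and both helper names are assumptions, so check which values `TrainerConfig` accepts before using them.

```python
# Sketch: choose an accelerator string at runtime.
# "mps" is documented above; "cuda" and "cpu" are assumed values.

def pick_accelerator(cuda_available, mps_available):
    """Prefer CUDA, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_accelerator():
    """Query PyTorch if it is installed; otherwise fall back to CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is missing on older PyTorch releases.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_accelerator(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print(detect_accelerator())
```

The returned string would then be passed as the `accelerator` argument of `TrainerConfig`.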

Note

Inference vs training: Inference (`predict`) is cheaper than training because there is no backward pass or optimizer state. A model that needs a GPU for training can often run inference on CPU in production for moderate batch sizes. Use `InferenceModel` to load artifacts for CPU-only inference environments.

References

Key papers used for the comparison:

Ahamed, M. A., & Cheng, Q. (2024). MambaTab: A Plug-and-Play Model for Learning Tabular Data. arXiv:2401.08867, DOI:10.1109/MIPR62202.2024.00065
Gorishniy, Y., Rubachev, I., Khrulkov, V., & Babenko, A. (2021). Revisiting Deep Learning Models for Tabular Data. NeurIPS 2021. arXiv:2106.11959
Gorishniy, Y., Rubachev, I., Kartashev, N., Shlenskii, D., Kotelnikov, A., & Babenko, A. (2023). TabR: Tabular Deep Learning Meets Nearest Neighbors in 2023. arXiv:2307.14338
Gorishniy, Y., Kotelnikov, A., & Babenko, A. (2024). TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling. ICLR 2025. arXiv:2410.24210
Gu, A., & Dao, T. (2024). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR 2016. arXiv:1512.03385
Huang, X., Khetan, A., Cvitkovic, M., & Karnin, Z. (2020). TabTransformer: Tabular Data Modeling Using Contextual Embeddings. arXiv:2012.06678
Kontschieder, P., Fiterau, M., Criminisi, A., & Rota Bulo, S. (2015). Deep Neural Decision Forests. ICCV 2015. CVF Open Access
Popov, S., Morozov, S., & Babenko, A. (2019). Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data. ICLR 2020. arXiv:1909.06312
Somepalli, G., Goldblum, M., Schwarzschild, A., Bruss, C. B., & Goldstein, T. (2021). SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training. arXiv:2106.01342
Song, W., Shi, C., Xiao, Z., Duan, Z., Xu, Y., Zhang, M., & Tang, J. (2019). AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. CIKM 2019. arXiv:1810.11921
Thielmann, A. F., Kumar, M., Weisser, C., Reuter, A., Säfken, B., & Samiee, S. (2024). Mambular: A Sequential Model for Tabular Deep Learning. arXiv:2408.06291
Thielmann, A. F., & Samiee, S. (2024). On the Efficiency of NLP-Inspired Methods for Tabular Deep Learning. arXiv:2411.17207
Wen, Y., Tran, D., & Ba, J. (2020). BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning. arXiv:2002.06715
