Model Efficiency & Benchmarking

This page explains where efficiency analysis belongs in DeepTab and how to use it when selecting models. It complements the architectural complexity table in Model Comparison with a practical benchmarking protocol.

Important

Efficiency results are hardware- and workload-dependent. Use them to compare candidate models under the same feature schema, batch size, preprocessing, dtype, and device. Do not treat synthetic timing results as an accuracy benchmark or as a universal ranking.

Where This Applies

Efficiency analysis is most useful when researchers or developers need to choose a model under runtime constraints.

Decision	Why efficiency matters	Where to use it
Model selection	Attention, state-space, dense, tree-style, and retrieval models scale differently with feature tokens and batch size	Model Zoo comparison and recommended configs
Experiment planning	Search budget, number of seeds, and architecture grid size depend on training cost	Research protocol and benchmark reports
Production screening	Memory use and inference latency can rule out otherwise accurate models	Deployment and low-latency model choice
Architecture development	New blocks should be compared against strong baselines at controlled feature counts and depths	Developer benchmarking

Efficiency analysis is a poor fit for the API reference, whose pages should document classes, signatures, and methods. It belongs in the Model Zoo because it helps users decide which architecture to try before they write code.

What to Measure

For tabular deep learning, the most informative efficiency variables are usually:

Variable	Why it matters
Feature-token count	Transformer-style feature attention grows roughly quadratically in the number of tokens, while Mamba/RNN/dense paths usually avoid full feature-attention maps
Batch size	Larger batches improve accelerator utilization, but SAINT-style row attention and activation memory can grow quickly
Hidden width	Dense projections often scale with width squared; increasing `d_model` affects attention, Mamba blocks, heads, and embeddings
Depth	More layers increase activation memory and forward/backward time; tree depth in differentiable tree models can be especially expensive
Categorical cardinality	Embedding-table size depends on category counts, not just number of columns
Retrieval candidate size	TabR-style models add candidate encoding, nearest-neighbor search, and context-mixing costs
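
The quadratic growth in the feature-token row can be made concrete with a back-of-the-envelope estimate. This is a minimal sketch that assumes the attention weights are materialized in full as a `(batch, heads, tokens, tokens)` tensor in `float32`; fused attention kernels may never allocate this tensor, so treat the result as an upper-bound intuition rather than a measurement. The function name is illustrative and not part of DeepTab.

```python
def attention_map_bytes(
    batch_size: int, n_tokens: int, n_heads: int, bytes_per_element: int = 4
) -> int:
    """Bytes for one layer's attention weights, shape (batch, heads, tokens, tokens)."""
    return batch_size * n_heads * n_tokens * n_tokens * bytes_per_element


# Doubling the token count quadruples the attention-map size.
for n_tokens in (32, 64, 128, 256):
    mib = attention_map_bytes(batch_size=256, n_tokens=n_tokens, n_heads=8) / 1024**2
    print(f"{n_tokens:>4} tokens -> {mib:8.1f} MiB per layer")
```

With `batch_size=256` and 8 heads, the estimate goes from 8 MiB at 32 tokens to 512 MiB at 256 tokens per layer, before counting any other activations.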

Tip

For model selection, measure forward latency, peak device memory, and parameter count. For training-budget planning, also measure one or more full training epochs because backward pass, optimizer state, data loading, and validation can change the ranking.
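
Parameter count can be sanity-checked by hand before a model is built. The sketch below covers only two contributions, embedding tables and plain dense layers, and ignores extras such as normalization layers or reserved rows for unknown categories, so it will not match a real model exactly. Both helper names are hypothetical.

```python
def embedding_parameters(cardinalities: list[int], d_model: int) -> int:
    """One d_model-wide vector per category, summed over categorical columns."""
    return sum(cardinality * d_model for cardinality in cardinalities)


def dense_parameters(layer_widths: list[int]) -> int:
    """Weights plus biases for a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:])
    )


# Two small columns and one high-cardinality column: the last one dominates.
print(embedding_parameters([10, 10, 50_000], d_model=64))
# A 64 -> 256 -> 256 -> 1 dense stack.
print(dense_parameters([64, 256, 256, 1]))
```

For an instantiated PyTorch model, the exact figure is `sum(p.numel() for p in model.parameters())`; the estimate above is mainly useful for spotting which columns or widths drive the total.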

Expected Scaling Patterns

These are practical expectations from the architecture, not measured leaderboard results.

Family	Main cost driver	Practical implication
MLP, ResNet	Dense layer widths	Fast baselines; good first checks for latency-sensitive workflows
TabM	Dense layer widths plus active ensemble outputs	Strong ensemble-like baseline at lower cost than many independent models
Mambular, MambaTab	Feature sequence length, `d_model`, number of Mamba layers	Attractive when feature-token count is high and full attention is expensive
FTTransformer, AutoInt	Feature-token attention maps	Watch memory when many columns, numerical bins, or embedding tokens are present
TabTransformer	Categorical-token attention	Most relevant when categorical features dominate
SAINT	Column attention plus row attention within each batch	Batch size is part of the architecture cost, not just a loader setting
NODE, ENODE, NDTF	Number of trees, depth, and soft path/leaf evaluations	Tree depth is a compute knob as well as a modeling knob
TabR	Candidate encoding/search and context size	Report candidate-pool construction and retrieval settings with results
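
The SAINT row in the table can be illustrated with a simplified element count. This sketch assumes column attention builds one tokens-by-tokens map per row and head, and row attention builds one batch-by-batch map per head; it ignores projections, feed-forward blocks, and any kernel-level optimizations. The function names are illustrative.

```python
def column_attention_elements(batch_size: int, n_tokens: int, n_heads: int) -> int:
    """One (tokens x tokens) map per row and head."""
    return batch_size * n_heads * n_tokens**2


def row_attention_elements(batch_size: int, n_heads: int) -> int:
    """One (batch x batch) map per head: rows attend to other rows in the batch."""
    return n_heads * batch_size**2


n_tokens, n_heads = 32, 8
for batch_size in (64, 256, 1024, 4096):
    col = column_attention_elements(batch_size, n_tokens, n_heads)
    row = row_attention_elements(batch_size, n_heads)
    print(f"batch={batch_size:>5}  column={col:>12,}  row={row:>12,}")
```

Under these assumptions the column term grows linearly with batch size while the row term grows quadratically, so the row term overtakes once the batch size exceeds the square of the token count.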

Benchmark Protocol

Use a controlled protocol when reporting efficiency numbers.

1. Fix the hardware, PyTorch version, DeepTab version, dtype, and device.
2. Use the same feature schema across models unless the research question is schema-specific.
3. Run warmup iterations before timing GPU code.
4. Use `torch.inference_mode()` and `model.eval()` for inference benchmarks.
5. Synchronize CUDA before and after timed regions.
6. Reset and report peak memory with `torch.cuda.reset_peak_memory_stats()` and `torch.cuda.max_memory_allocated()`.
7. Report median or mean over repeated runs, not a single pass.
8. Separate forward-only, training-step, and full-fit measurements.
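
The warmup, synchronization, and repeated-run steps of the protocol can be sketched as a small framework-agnostic timing helper. `time_callable` and its `sync` parameter are hypothetical names introduced here; the helper times each call separately so that a median can be reported alongside the mean.

```python
import statistics
import time
from typing import Callable, Optional


def time_callable(
    fn: Callable[[], object],
    repeats: int = 50,
    warmup: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> dict:
    """Time fn() per call and return median and mean latency in milliseconds."""
    # Warmup calls are executed but never timed.
    for _ in range(warmup):
        fn()

    samples_ms = []
    for _ in range(repeats):
        if sync is not None:
            sync()  # drain queued device work before starting the clock
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the timed call to finish on the device
        samples_ms.append((time.perf_counter() - start) * 1000)

    return {
        "median_ms": statistics.median(samples_ms),
        "mean_ms": statistics.fmean(samples_ms),
        "repeats": repeats,
        "warmup": warmup,
    }


# CPU-only usage with a stand-in workload.
result = time_callable(lambda: sum(i * i for i in range(10_000)), repeats=20, warmup=3)
print(result)
```

For GPU code, passing `sync=torch.cuda.synchronize` applies the synchronization step around every timed call. Per-call synchronization adds a small overhead, so very short kernels may be better timed in batches.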

Warning

Synthetic forward-pass benchmarks are useful for isolating architecture cost, but they do not include preprocessing, data loading, validation, early stopping, checkpointing, or hyperparameter search. For end-to-end claims, benchmark the sklearn-style estimator workflow too.
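
A minimal sketch of phase-level timing for an estimator workflow, assuming only that the estimator exposes scikit-learn-style `fit` and `predict` methods. `timed` and `MeanRegressor` are hypothetical names used for illustration; swap in a DeepTab estimator and real data to obtain meaningful numbers.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict):
    """Record wall-clock seconds for the enclosed block under results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


class MeanRegressor:
    """Hypothetical stand-in for any estimator with fit/predict methods."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


X = [[float(i)] for i in range(1000)]
y = [2.0 * row[0] for row in X]

timings = {}
model = MeanRegressor()
with timed("fit_s", timings):
    model.fit(X, y)
with timed("predict_s", timings):
    predictions = model.predict(X)

print(timings)
```

Keeping fit and predict in separate entries makes it clear which phase a reported number covers, which matters once preprocessing and validation are part of the fit.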

Using the Efficiency Notebook

The runnable version lives in the Model Efficiency Benchmarking tutorial. Its notebook is stored at docs/tutorials/notebooks/model_efficiency.ipynb, alongside the other tutorial notebooks, so that executable examples live in one place.

Use the notebook when you want to stress-test model families across:

- increasing feature counts,
- increasing model depth,
- fixed feature schemas with different architecture families,
- GPU memory and latency constraints.

The notebook should be run on the same machine and environment used for the reported results. If you publish or share benchmark numbers, include the notebook commit, hardware, CUDA version, PyTorch version, batch size, feature count, model configs, and whether the numbers are forward-only or full-training.
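
Part of that metadata can be captured automatically. The sketch below uses only the standard library; the helper name is hypothetical, and the distribution names `torch` and `deeptab` are assumptions about how the packages are installed in your environment.

```python
import platform
import sys
from importlib import metadata


def collect_environment(packages=("torch", "deeptab")) -> dict:
    """Collect interpreter, platform, and installed package versions."""
    env = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "processor": platform.processor() or platform.machine(),
    }
    for name in packages:
        try:
            env[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence instead of failing the whole report.
            env[name] = "not installed"
    return env


for key, value in collect_environment().items():
    print(f"{key}: {value}")
```

The GPU model and CUDA version are not visible to the standard library; with PyTorch installed they can be read from `torch.cuda.get_device_name()` and `torch.version.cuda`. The notebook commit still has to be recorded by hand or from version control.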

Minimal Forward Benchmark Pattern

The low-level architecture classes are useful for isolating model-body cost because they avoid estimator-level preprocessing and Lightning trainer overhead.

import time

import torch

from deeptab.architectures import FTTransformer, Mambular
from deeptab.configs import FTTransformerConfig, MambularConfig


def make_feature_information(n_features: int):
    n_num = n_features // 2
    n_cat = n_features - n_num

    num_info = {
        f"num_{i}": {"preprocessing": "standard", "dimension": 1, "categories": None}
        for i in range(n_num)
    }
    cat_info = {
        f"cat_{i}": {"preprocessing": "int", "dimension": 1, "categories": 10}
        for i in range(n_cat)
    }
    return num_info, cat_info, {}


def make_batch(feature_information, batch_size: int, device: torch.device):
    num_info, cat_info, _ = feature_information
    num_features = [
        torch.randn(batch_size, info["dimension"], device=device)
        for info in num_info.values()
    ]
    cat_features = [
        torch.randint(0, info["categories"], (batch_size, info["dimension"]), device=device)
        for info in cat_info.values()
    ]
    return num_features, cat_features, []


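def benchmark_forward(model, batch, repeats: int = 50, warmup: int = 10):
    """Return mean forward latency (ms) and peak CUDA memory (MiB, None on CPU)."""
    model.eval()
    device = next(model.parameters()).device

    with torch.inference_mode():
        # Warm up so lazy initialization and kernel selection are not timed.
        for _ in range(warmup):
            model(*batch)

        if device.type == "cuda":
            # Drain pending work, then track peak memory for the timed region only.
            torch.cuda.synchronize()
            torch.cuda.reset_peak_memory_stats(device)

        start = time.perf_counter()
        for _ in range(repeats):
            model(*batch)

        if device.type == "cuda":
            # CUDA kernels run asynchronously; wait for them before stopping the clock.
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start

    if device.type == "cuda":
        memory_mb = torch.cuda.max_memory_allocated(device) / 1024**2
    else:
        memory_mb = None

    latency_ms = elapsed * 1000 / repeats
    return latency_ms, memory_mb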


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
feature_information = make_feature_information(n_features=64)
batch = make_batch(feature_information, batch_size=256, device=device)

models = {
    "Mambular": Mambular(
        feature_information=feature_information,
        config=MambularConfig(d_model=64, n_layers=4),
    ).to(device),
    "FTTransformer": FTTransformer(
        feature_information=feature_information,
        config=FTTransformerConfig(d_model=128, n_layers=4, n_heads=8),
    ).to(device),
}

for name, model in models.items():
    latency_ms, memory_mb = benchmark_forward(model, batch)
    print(name, {"latency_ms": latency_ms, "memory_mb": memory_mb})

Reporting Template

Use this compact template in experiment notes or pull requests:

Field	Value
Hardware	GPU/CPU model, memory, CUDA version
Software	DeepTab commit/version, PyTorch version, Python version
Workload	Task, number of rows, feature count, categorical cardinalities
Config	Model config, preprocessing config, trainer config
Measurement	Forward-only, train-step, epoch, or full fit
Batch size and dtype	Example: `batch_size=256`, `float32`
Repeats	Warmup count and measured repeats
Results	Latency, peak memory, parameter count, optional throughput
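
The optional throughput entry follows directly from batch size and per-batch latency. This is a minimal sketch with hypothetical helper names; the values in the example report are placeholders, not measured results.

```python
def rows_per_second(batch_size: int, latency_ms: float) -> float:
    """Throughput implied by processing one batch in latency_ms milliseconds."""
    return batch_size * 1000.0 / latency_ms


def format_report(fields: dict) -> str:
    """Render the template as tab-separated Field/Value rows."""
    lines = ["Field\tValue"]
    lines += [f"{name}\t{value}" for name, value in fields.items()]
    return "\n".join(lines)


# Placeholder numbers for illustration only.
latency_ms = 8.0
report = {
    "Measurement": "Forward-only",
    "Batch size and dtype": "batch_size=256, float32",
    "Repeats": "warmup=10, repeats=50",
    "Results": f"{latency_ms} ms/batch, {rows_per_second(256, latency_ms):.0f} rows/s",
}
print(format_report(report))
```

A batch of 256 rows at 8 ms per batch corresponds to 32,000 rows per second. Report throughput together with the batch size it was measured at, since the two are not independent.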

References

Gu, A., & Dao, T. (2024). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752
Gorishniy, Y., Rubachev, I., Khrulkov, V., & Babenko, A. (2021). Revisiting Deep Learning Models for Tabular Data. NeurIPS 2021. arXiv:2106.11959
Thielmann, A. F., & Samiee, S. (2024). On the Efficiency of NLP-Inspired Methods for Tabular Deep Learning. arXiv:2411.17207