Model Efficiency Benchmarking Tutorial
This tutorial shows how to benchmark DeepTab model families under controlled synthetic workloads. It focuses on forward-pass latency, peak device memory, and parameter count so researchers and developers can decide which architectures are practical before running full training experiments.
Note
The notebook linked above is generated from this same tutorial content. Use the markdown page to understand the protocol, and use the notebook when you want to run or modify the benchmark cells.
What You Will Learn
How to isolate architecture cost from preprocessing and trainer overhead.
How feature count, depth, and batch size affect different model families.
How to report efficiency results without implying an accuracy ranking.
How to connect runtime measurements back to model selection.
Important
Efficiency numbers are hardware-specific. Report the device, CUDA version, PyTorch version, DeepTab commit, dtype, feature schema, batch size, warmup count, and repeat count whenever you share results.
Benchmark Scope
The cells below profile low-level architecture classes directly. This isolates the model body and avoids estimator-level preprocessing, Lightning training, validation, checkpointing, and data-loading overhead.
Use this tutorial for architecture screening. For end-to-end claims, add a second benchmark around the sklearn-style estimator workflow: fit, predict, and evaluate.
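An end-to-end benchmark of that kind could be sketched as follows. This is a minimal illustration, not part of DeepTab: time_estimator_workflow and MeanRegressor are hypothetical names, and the only assumption is that the estimator exposes sklearn-style fit(X, y) and predict(X) methods. The stand-in estimator keeps the sketch runnable without DeepTab installed.

```python
import time


def time_estimator_workflow(estimator, X_train, y_train, X_test):
    """Time fit and predict separately for any sklearn-style estimator."""
    timings = {}

    # Fit includes preprocessing and training overhead for real estimators.
    start = time.perf_counter()
    estimator.fit(X_train, y_train)
    timings["fit_s"] = time.perf_counter() - start

    # Predict includes preprocessing of the test rows.
    start = time.perf_counter()
    predictions = estimator.predict(X_test)
    timings["predict_s"] = time.perf_counter() - start

    return timings, predictions


class MeanRegressor:
    """Hypothetical stand-in estimator (illustration only)."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


timings, predictions = time_estimator_workflow(
    MeanRegressor(),
    X_train=[[0.0], [1.0], [2.0]],
    y_train=[1.0, 2.0, 3.0],
    X_test=[[3.0], [4.0]],
)
print(timings)
```

To time a real estimator, pass it in place of MeanRegressor. An evaluation step can be timed the same way by wrapping the metric call in another perf_counter pair.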
Setup
import platform
import time
from dataclasses import dataclass

import pandas as pd
import torch

from deeptab.architectures import (
    FTTransformer,
    MLP,
    MambAttention,
    MambaTab,
    Mambular,
    ResNet,
    TabulaRNN,
)
from deeptab.configs import (
    FTTransformerConfig,
    MLPConfig,
    MambAttentionConfig,
    MambaTabConfig,
    MambularConfig,
    ResNetConfig,
    TabulaRNNConfig,
)

print({
    "python": platform.python_version(),
    "torch": torch.__version__,
    "cuda_available": torch.cuda.is_available(),
    "device": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu",
})
Synthetic Feature Schema
The helper below creates a controlled half-numerical, half-categorical schema. Keeping the schema synthetic makes it easier to isolate architecture scaling. It does not replace real-dataset benchmarking.
@dataclass(frozen=True)
class BenchmarkSpec:
    n_features: int
    batch_size: int = 256
    n_layers: int = 4
    repeats: int = 50
    warmup: int = 10
    n_categories: int = 10


def make_feature_information(n_features: int, n_categories: int = 10):
    """Create a half-numerical, half-categorical synthetic feature schema."""
    n_num = n_features // 2
    n_cat = n_features - n_num
    num_info = {
        f"num_{i}": {
            "preprocessing": "standard",
            "dimension": 1,
            "categories": None,
        }
        for i in range(n_num)
    }
    cat_info = {
        f"cat_{i}": {
            "preprocessing": "int",
            "dimension": 1,
            "categories": n_categories,
        }
        for i in range(n_cat)
    }
    return num_info, cat_info, {}


def make_batch(feature_information, batch_size: int, device: torch.device):
    num_info, cat_info, _ = feature_information
    num_features = [
        torch.randn(batch_size, info["dimension"], device=device)
        for info in num_info.values()
    ]
    cat_features = [
        torch.randint(
            low=0,
            high=info["categories"],
            size=(batch_size, info["dimension"]),
            device=device,
        )
        for info in cat_info.values()
    ]
    return num_features, cat_features, []


def count_parameters(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
Tip
Start with synthetic sweeps to understand scaling, then repeat the benchmark using the actual feature schema and preprocessing from your target dataset.
Model Factories
The factory function keeps model construction consistent across sweeps. The configs are intentionally simple: they are not tuned for accuracy.
def model_factories(n_layers: int):
    """Return comparable default-ish architecture configs for profiling."""
    return {
        "Mambular": (
            Mambular,
            MambularConfig(d_model=64, n_layers=n_layers),
        ),
        "MambaTab": (
            MambaTab,
            MambaTabConfig(d_model=64, n_layers=max(1, min(n_layers, 4))),
        ),
        "MambAttention": (
            MambAttention,
            MambAttentionConfig(d_model=64, n_layers=n_layers, n_heads=8),
        ),
        "FTTransformer": (
            FTTransformer,
            FTTransformerConfig(d_model=128, n_layers=n_layers, n_heads=8),
        ),
        "TabulaRNN": (
            TabulaRNN,
            TabulaRNNConfig(d_model=128, n_layers=n_layers),
        ),
        "MLP": (
            MLP,
            MLPConfig(layer_sizes=[512, 256, 128, 32], use_embeddings=True, d_model=64),
        ),
        "ResNet": (
            ResNet,
            ResNetConfig(layer_sizes=[512, 256, 64], use_embeddings=True, d_model=64),
        ),
    }
Forward Benchmark Runner
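This runner uses model.eval() and torch.inference_mode() because it measures inference-style forward cost. CUDA kernels launch asynchronously, so the host clock is only meaningful after torch.cuda.synchronize() has returned; without it the timer mostly measures kernel launch overhead rather than execution time.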
def benchmark_forward(model: torch.nn.Module, batch, repeats: int = 50, warmup: int = 10):
    model.eval()
    device = next(model.parameters()).device
    with torch.inference_mode():
        # Warmup runs absorb one-time costs such as kernel selection and allocator growth.
        for _ in range(warmup):
            model(*batch)
        if device.type == "cuda":
            # Wait for warmup kernels to finish, then start peak-memory tracking from here.
            torch.cuda.synchronize(device)
            torch.cuda.reset_peak_memory_stats(device)
        start = time.perf_counter()
        for _ in range(repeats):
            model(*batch)
        if device.type == "cuda":
            torch.cuda.synchronize(device)
        # Stop the clock before querying memory so the query is not part of the timing.
        elapsed = time.perf_counter() - start
    if device.type == "cuda":
        memory_mb = torch.cuda.max_memory_allocated(device) / 1024**2
    else:
        memory_mb = None
    latency_ms = elapsed * 1000 / repeats
    return latency_ms, memory_mb


def run_benchmark(spec: BenchmarkSpec, selected_models=None):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    feature_information = make_feature_information(spec.n_features, spec.n_categories)
    batch = make_batch(feature_information, spec.batch_size, device)
    factories = model_factories(spec.n_layers)
    if selected_models is not None:
        factories = {name: factories[name] for name in selected_models}
    rows = []
    for name, (model_cls, config) in factories.items():
        model = model_cls(
            feature_information=feature_information,
            num_classes=1,
            config=config,
        ).to(device)
        latency_ms, memory_mb = benchmark_forward(
            model,
            batch,
            repeats=spec.repeats,
            warmup=spec.warmup,
        )
        rows.append({
            "model": name,
            "n_features": spec.n_features,
            "batch_size": spec.batch_size,
            "n_layers": spec.n_layers,
            "latency_ms": latency_ms,
            "peak_memory_mb": memory_mb,
            "parameters": count_parameters(model),
        })
        del model
        if device.type == "cuda":
            torch.cuda.empty_cache()
    return pd.DataFrame(rows)
Warning
Forward-only inference timing does not include backward pass, optimizer state, data loading, validation, early stopping, or hyperparameter search. Use it as an architecture-screening signal, not as a full training-cost claim.
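The runner above reports a single mean over all repeats, which can hide occasional slow iterations. A possible complement is to time each call individually and report the median and a high percentile. The sketch below is an assumption-laden illustration rather than part of the tutorial's protocol: time_per_call and summarize are hypothetical helpers, the percentile uses the simple nearest-rank rule, and synchronizing around every call adds some overhead compared with timing the whole loop once.

```python
import math
import statistics
import time


def time_per_call(fn, repeats=50, warmup=10, sync=None):
    """Return per-call latencies in milliseconds for a zero-argument callable.

    sync is an optional zero-argument callable run before each clock read,
    for example torch.cuda.synchronize when fn launches CUDA work.
    """
    for _ in range(warmup):
        fn()
    latencies_ms = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


def summarize(latencies_ms):
    """Summarize a latency sample with mean, median, nearest-rank p95, and max."""
    ordered = sorted(latencies_ms)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "mean_ms": statistics.fmean(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }


# Placeholder workload; replace the lambda with a model forward call.
latencies = time_per_call(lambda: sum(range(10_000)), repeats=20, warmup=5)
print(summarize(latencies))
```

For a model on GPU, fn could be something like lambda: model(*batch) inside torch.inference_mode(), with sync=torch.cuda.synchronize. The same helper can wrap a callable that runs a forward pass, backward pass, and optimizer step if a training-step measurement is needed.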
Feature-Count Sweep
This sweep is most relevant when deciding whether feature attention is affordable for wide tables. Keep batch size and depth fixed while increasing the number of synthetic feature tokens.
feature_sweep_results = []
for n_features in [10, 20, 40, 80, 160, 320]:
    spec = BenchmarkSpec(n_features=n_features, batch_size=128, n_layers=4, repeats=20, warmup=5)
    feature_sweep_results.append(run_benchmark(spec))

feature_sweep = pd.concat(feature_sweep_results, ignore_index=True)
feature_sweep
Interpret this sweep together with the architecture. Transformer-style feature attention becomes more expensive as feature-token count grows, while dense and state-space paths usually avoid explicit full attention maps.
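A back-of-the-envelope estimate can make the quadratic growth concrete. The sketch below assumes one token per feature, float32 scores, and a fully materialized attention-score tensor of shape (batch_size, n_heads, n_tokens, n_tokens). These are simplifying assumptions: fused attention kernels may never materialize this tensor, and extra tokens such as a CLS token are ignored. attention_map_mib is a hypothetical helper, not a DeepTab function.

```python
def attention_map_mib(batch_size, n_heads, n_tokens, bytes_per_element=4):
    """Size in MiB of one materialized attention-score tensor.

    Assumed shape: (batch_size, n_heads, n_tokens, n_tokens).
    """
    n_elements = batch_size * n_heads * n_tokens * n_tokens
    return n_elements * bytes_per_element / 1024**2


# Same feature counts, batch size, and head count as the sweep above.
for n_tokens in [10, 20, 40, 80, 160, 320]:
    size = attention_map_mib(batch_size=128, n_heads=8, n_tokens=n_tokens)
    print(f"{n_tokens:>4} tokens: {size:8.2f} MiB per layer")
```

Under these assumptions each doubling of the feature count quadruples the estimate, reaching 400 MiB per layer at 320 tokens. Compare the estimate with the measured peak_memory_mb column rather than treating it as a prediction.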
Depth Sweep
This sweep is most relevant when choosing n_layers. It keeps the synthetic feature schema fixed while changing model depth for sequence and attention families.
depth_sweep_results = []
for n_layers in [1, 2, 4, 8, 12]:
    spec = BenchmarkSpec(n_features=64, batch_size=128, n_layers=n_layers, repeats=20, warmup=5)
    depth_sweep_results.append(
        run_benchmark(
            spec,
            selected_models=["Mambular", "MambaTab", "MambAttention", "FTTransformer", "TabulaRNN"],
        )
    )

depth_sweep = pd.concat(depth_sweep_results, ignore_index=True)
depth_sweep
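Depth affects more than latency. It also changes activation memory during training and often changes the amount of regularization needed. Note that the model factory caps MambaTab at four layers, so its rows for n_layers of 8 and 12 reuse the four-layer configuration even though the n_layers column records the requested depth.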
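One way to read a depth sweep is to split latency into a fixed part (embeddings, output head, launch overhead) and a per-layer part by fitting a straight line through the measurements. The sketch below uses a plain least-squares fit; fit_line is a hypothetical helper, the linear model is only an approximation, and the latencies are made-up placeholder values, not measured results.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up latencies for illustration only, not measured results.
layers = [1, 2, 4, 8, 12]
latency_ms = [1.5, 2.0, 3.0, 5.0, 7.0]

fixed_ms, per_layer_ms = fit_line(layers, latency_ms)
print(f"fixed overhead ~ {fixed_ms:.2f} ms, per-layer cost ~ {per_layer_ms:.2f} ms")
```

With real data, the inputs would be the n_layers and latency_ms columns of depth_sweep filtered to a single model. A poor linear fit is itself informative, since it suggests the cost does not scale uniformly with depth on that hardware.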
Batch-Size Sweep
This sweep is most relevant for GPU utilization and memory planning. Larger batches can improve throughput but may hide latency problems for online inference.
batch_sweep_results = []
for batch_size in [32, 64, 128, 256, 512]:
    spec = BenchmarkSpec(n_features=64, batch_size=batch_size, n_layers=4, repeats=20, warmup=5)
    batch_sweep_results.append(run_benchmark(spec))

batch_sweep = pd.concat(batch_sweep_results, ignore_index=True)
batch_sweep
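The trade-off between throughput and latency is easier to see when both are derived from the same rows. The sketch below works on plain dictionaries with the batch_size and latency_ms keys used in the benchmark rows; add_throughput is a hypothetical helper, and the numbers are made-up placeholders, not measured results.

```python
def add_throughput(rows):
    """Return copies of benchmark rows with throughput and per-row latency added."""
    enriched = []
    for row in rows:
        batches_per_s = 1000.0 / row["latency_ms"]
        enriched.append({
            **row,
            "rows_per_s": row["batch_size"] * batches_per_s,
            "ms_per_row": row["latency_ms"] / row["batch_size"],
        })
    return enriched


# Made-up numbers for illustration only, not measured results.
rows = [
    {"model": "example", "batch_size": 32, "latency_ms": 2.0},
    {"model": "example", "batch_size": 512, "latency_ms": 8.0},
]
enriched_rows = add_throughput(rows)
for row in enriched_rows:
    print(row)
```

In this illustration the larger batch processes four times as many rows per second, yet each request waits four times longer for its batch to finish. On the DataFrame itself the same column can be derived as batch_sweep["batch_size"] * 1000 / batch_sweep["latency_ms"].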
Important
For SAINT-style row attention or retrieval-style models, batch size can change the effective algorithmic cost. Do not report efficiency results without the batch size.
Reporting Results
Report benchmark results with enough context that another researcher can reproduce the workload.
| Field | What to record |
|---|---|
| Hardware | CPU/GPU model, GPU memory, CUDA version |
| Software | DeepTab version or commit, PyTorch version, Python version |
| Workload | Number of rows if applicable, feature count, categorical cardinalities |
| Config | Model config, preprocessing config, trainer config if training is measured |
| Measurement | Forward-only, training step, epoch, or full fit |
| Runtime settings | Batch size, dtype, warmup count, repeat count |
| Results | Latency, peak memory, parameter count, throughput if useful |
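One way to keep these fields together is to store them in a single JSON record next to the results. The sketch below is a hypothetical container, not a DeepTab API: BenchmarkReport and its field names simply mirror the reporting fields listed above, and the example values (including the float32 dtype) are placeholders to be replaced with what was actually run.

```python
import json
import platform
from dataclasses import asdict, dataclass, field


@dataclass
class BenchmarkReport:
    """Hypothetical container mirroring the reporting fields listed above."""

    hardware: dict
    software: dict
    workload: dict
    config: dict
    measurement: str
    runtime_settings: dict
    results: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the whole record with stable key order."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values for illustration; record what was actually used.
report = BenchmarkReport(
    hardware={"device": "cpu", "machine": platform.machine()},
    software={"python": platform.python_version()},
    workload={"n_features": 64, "n_categories": 10},
    config={"model": "MLP", "layer_sizes": [512, 256, 128, 32]},
    measurement="forward-only",
    runtime_settings={"batch_size": 128, "dtype": "float32", "warmup": 5, "repeats": 20},
    results=[],  # append benchmark rows (plain dicts) here
)
print(report.to_json())
```

Rows from a results DataFrame can be added with to_dict(orient="records") before serializing, so the measurements and their context travel in one file.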
Tip
If efficiency is part of a research claim, report accuracy or validation loss separately. A faster model is not automatically a better model.