SAINT
Overview
SAINT is an attention architecture for tabular data that combines feature-wise attention with row-wise attention. In DeepTab, SAINT embeds all supported feature types, applies a row/column Transformer block, pools the resulting sequence, and predicts with an MLP head.
Use it when you want a Transformer-style model that can mix information across both columns and samples, especially for research comparisons with FTTransformer and TabTransformer.
Architectural Details
DeepTab’s SAINT implementation uses:
- EmbeddingLayer to build feature tokens.
- Optional class token support through use_cls.
- RowColTransformer, which alternates column-wise attention over feature tokens and row-wise attention after reshaping the batch/feature representation.
- pool_sequence to aggregate tokens.
- Optional final normalization and MLPhead.
feature tokens -> RowColTransformer -> pooling -> optional norm -> MLPhead
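The difference between the two attention directions can be sketched in plain Python. This is a minimal illustration and not DeepTab's RowColTransformer: it uses a single head with no learned projections, no residuals, and no normalization, and the helper names (`self_attention`, `column_attention`, `row_attention`) are hypothetical. It only shows which axis each step attends over.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head self-attention with no learned projections (illustration only)."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot-product score of this query against every vector.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Weighted average of all vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, vectors))
                        for j in range(dim)])
    return outputs


def column_attention(batch):
    """Attend over the feature tokens of each sample independently."""
    return [self_attention(sample) for sample in batch]


def row_attention(batch):
    """Flatten each sample to one vector, attend across samples, restore the shape."""
    n_tokens, dim = len(batch[0]), len(batch[0][0])
    flat = [[x for token in sample for x in token] for sample in batch]
    mixed = self_attention(flat)
    return [[row[i * dim:(i + 1) * dim] for i in range(n_tokens)] for row in mixed]


# Three toy samples, each with two feature tokens of width 2.
s0 = [[1.0, 0.0], [0.0, 1.0]]
s1 = [[0.5, 0.5], [0.5, 0.5]]
s2 = [[2.0, -1.0], [0.0, 3.0]]

col_a = column_attention([s0, s1])
col_b = column_attention([s0, s2])
row_a = row_attention([s0, s1])
row_b = row_attention([s0, s2])

print(col_a[0] == col_b[0])  # True: column attention ignores the batch mates of s0
print(row_a[0] == row_b[0])  # False: row attention mixes s0 with its batch mates
```

Swapping the second sample leaves the column-attention output for `s0` unchanged but alters its row-attention output, which is the sample-level mixing described above.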
Main Building Blocks
| Component | DeepTab implementation | Role |
|---|---|---|
| Tokenizer | | Converts each input feature to a token. |
| Column attention | | Models feature interactions within a row. |
| Row attention | Flattened row representation inside | Mixes sample-level context within a batch. |
| Feed-forward blocks | LayerNorm + Linear + activation + dropout | Adds nonlinear token updates. |
| Prediction head | | Produces final outputs. |
Implementation Notes
The original SAINT paper also emphasizes contrastive pretraining and data augmentation. DeepTab’s stable model page documents only the supervised architecture path implemented in deeptab.architectures.saint, so do not assume contrastive pretraining is active unless it is added explicitly to the training workflow.
The default config uses d_model=128, n_layers=1, n_heads=2, pooling_method="cls", and use_cls=True.
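To make the `pooling_method="cls"` default concrete, the sketch below contrasts class-token pooling with mean pooling for a single sample. It is a hypothetical helper and not DeepTab's pool_sequence: the function name `pool_tokens`, the `"avg"` option name, and the assumption that the class token sits at index 0 are all illustrative choices that the text above does not confirm.

```python
def pool_tokens(tokens, method="cls", cls_index=0):
    """Hypothetical pooling helper for one sample (not DeepTab's pool_sequence).

    tokens: list of token vectors, all of the same width.
    cls_index: assumed position of the class token; the real position may differ.
    """
    if method == "cls":
        # Use only the class token's representation.
        return list(tokens[cls_index])
    if method == "avg":
        # Average every token, including the class token if one is present.
        dim = len(tokens[0])
        return [sum(tok[j] for tok in tokens) / len(tokens) for j in range(dim)]
    raise ValueError(f"unsupported pooling method: {method}")


# One sample: an assumed class token followed by two feature tokens of width 2.
sample = [[0.0, 2.0], [1.0, 3.0], [2.0, 4.0]]

cls_vec = pool_tokens(sample, "cls")
avg_vec = pool_tokens(sample, "avg")

print(cls_vec)  # [0.0, 2.0]
print(avg_vec)  # [1.0, 3.0]
```

With class-token pooling the prediction head sees a single token, so a class token has to exist in the sequence, which is presumably why the defaults pair `pooling_method="cls"` with `use_cls=True`.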
Practical Config
from deeptab.configs import PreprocessingConfig, SAINTConfig, TrainerConfig
from deeptab.models import SAINTClassifier

model = SAINTClassifier(
    model_config=SAINTConfig(
        d_model=128,
        n_layers=2,
        n_heads=4,
        attn_dropout=0.1,
        ff_dropout=0.1,
        pooling_method="cls",
        use_cls=True,
    ),
    preprocessing_config=PreprocessingConfig(numerical_preprocessing="quantile"),
    trainer_config=TrainerConfig(lr=3e-4, batch_size=128, max_epochs=100),
    random_state=101,
)
Key settings:
| Setting | Typical range | Effect |
|---|---|---|
| | | Token width. |
| | | Row/column attention depth. |
| | | Number of attention heads. |
| | | Regularization. |
| | | Token aggregation behavior. |
When To Use
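Use SAINT when modeling interactions across both features and samples is part of the experimental question. It can be more expensive and more batch-sensitive than FTTransformer because row attention attends across the samples in a batch, so both its cost and its output for a given sample depend on the batch size and on which other samples share the batch.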
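A rough way to see the cost difference is to count the multiply-adds needed for the attention score matrices alone. The sketch below is back-of-the-envelope arithmetic under stated assumptions, not a profile of DeepTab: it ignores projections, feed-forward blocks, and multi-head splitting, it assumes row attention operates on each sample flattened to a vector of width `n_tokens * d_model`, and the function name is hypothetical.

```python
def attention_score_macs(batch_size, n_tokens, d_model):
    """Multiply-adds for the query-key score matrices of one batch (illustrative)."""
    # Column attention: one n_tokens x n_tokens score map per sample,
    # each score being a dot product of width d_model.
    column = batch_size * n_tokens * n_tokens * d_model
    # Row attention: one batch_size x batch_size score map per batch,
    # each score being a dot product over the flattened sample.
    row = batch_size * batch_size * n_tokens * d_model
    return {"column": column, "row": row}


for batch_size in (32, 128, 512):
    macs = attention_score_macs(batch_size, n_tokens=20, d_model=128)
    print(batch_size, macs["column"], macs["row"])
```

Under these assumptions the column term grows linearly with batch size while the row term grows quadratically, so the ratio between them is `batch_size / n_tokens`. Larger batches therefore shift more of the attention cost onto the row step.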
References
Somepalli et al., SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training.
Vaswani et al., Attention Is All You Need.