FTTransformer
Overview
FTTransformer is a feature-token Transformer for tabular data. It represents each column as a token, applies Transformer encoder layers over the feature sequence, pools the sequence, and predicts with an MLP head.
Use it when feature interactions are expected to be high-order and nonlocal, especially on medium-to-large datasets where attention layers can be trained reliably.
Architectural Details
DeepTab’s FTTransformer implementation follows the RTDL-style feature-token design:
EmbeddingLayer tokenizes numerical, categorical, and embedding features into (batch, n_features, d_model).
CustomTransformerEncoderLayer is stacked with nn.TransformerEncoder.
pool_sequence converts the token sequence to one vector using pooling_method.
Optional final normalization is applied.
MLPhead maps the pooled vector to the task output.
feature tokens -> TransformerEncoder x n_layers -> pooling -> optional norm -> MLPhead
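The encoder stage of this pipeline mixes information across feature tokens with self-attention. Below is a minimal, dependency-free sketch of that single step, assuming one head and identity query/key/value projections (no learned weights, residual connections, or feed-forward sublayer). The function names are illustrative and are not part of DeepTab.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention over feature tokens.

    Simplification (assumption): queries, keys, and values are the
    tokens themselves, i.e. identity projections.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this feature token to every feature token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Weighted average of all tokens, one output token per feature.
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


# One sample with n_features=3 tokens of width d_model=2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(tokens)  # same shape as the input: 3 tokens of width 2
```

Each output token is a convex combination of all input tokens, which is how every feature can attend to every other feature regardless of column order.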
Main Building Blocks
| Component | DeepTab implementation | Role |
|---|---|---|
| Tokenizer | EmbeddingLayer | Creates one vector per input feature. |
| Encoder block | CustomTransformerEncoderLayer | Multi-head attention plus feed-forward transformation. |
| Encoder stack | nn.TransformerEncoder | Repeats the block n_layers times. |
| Pooling | pool_sequence | Reduces feature tokens to one representation. |
| Head | MLPhead | Task-specific prediction head. |
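To make the pooling row concrete, here is a small sketch of reducing a token sequence to one vector. It is not DeepTab's pool_sequence: pool_tokens is a hypothetical helper. The "avg" option mirrors the pooling_method value used in the config on this page, while "max" is an assumption included only for contrast.

```python
def pool_tokens(tokens, method="avg"):
    """Reduce a list of equal-width feature tokens to one vector."""
    if not tokens:
        raise ValueError("need at least one token")
    # Transpose so each tuple holds one embedding dimension across all tokens.
    columns = list(zip(*tokens))
    if method == "avg":
        return [sum(col) / len(col) for col in columns]
    if method == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unsupported pooling method: {method!r}")


# Three feature tokens of width 2 for a single sample.
tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
avg_vec = pool_tokens(tokens, "avg")  # [2.0, 2.0]
max_vec = pool_tokens(tokens, "max")  # [3.0, 4.0]
```

Average pooling weights every feature equally, whereas max pooling keeps the strongest activation per dimension, so the choice affects which features dominate the vector passed to the head.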
Implementation Notes
Unlike TabTransformer, FTTransformer embeds all supported feature types before attention. This makes it a better default Transformer when the dataset has many numerical features or a balanced mix of numerical and categorical columns.
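The difference from TabTransformer described above comes down to how numerical columns become tokens. The sketch below follows the per-feature linear form from Gorishniy et al. (token_j = x_j * W_j + b_j) alongside a plain embedding lookup for categorical columns. It is an illustration under that assumption; DeepTab's EmbeddingLayer may differ in detail, and both function names are hypothetical.

```python
def tokenize_numerical(x, weights, biases):
    """Map each scalar feature x[j] to a d_model-wide token: x[j] * W[j] + b[j]."""
    return [
        [x_j * w + b for w, b in zip(w_j, b_j)]
        for x_j, w_j, b_j in zip(x, weights, biases)
    ]


def tokenize_categorical(codes, tables):
    """Look up one embedding row per categorical feature."""
    return [table[code] for code, table in zip(codes, tables)]


# Toy row: two numerical features, one categorical feature, d_model = 3.
num_tokens = tokenize_numerical(
    x=[0.5, -2.0],
    weights=[[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]],
    biases=[[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
)
cat_tokens = tokenize_categorical(
    codes=[1],
    tables=[[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]],
)
# All feature types now share one token sequence: 3 tokens of width 3.
tokens = num_tokens + cat_tokens
```

Because numerical and categorical columns end up in the same sequence, attention can model interactions between any pair of features.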
The default configuration uses d_model=128, n_layers=4, n_heads=8, attn_dropout=0.2, and ff_dropout=0.1.
Practical Config
from deeptab.configs import FTTransformerConfig, PreprocessingConfig, TrainerConfig
from deeptab.models import FTTransformerClassifier

model = FTTransformerClassifier(
    model_config=FTTransformerConfig(
        d_model=128,
        n_layers=4,
        n_heads=8,
        attn_dropout=0.2,
        ff_dropout=0.1,
        pooling_method="avg",
    ),
    preprocessing_config=PreprocessingConfig(numerical_preprocessing="quantile"),
    trainer_config=TrainerConfig(lr=3e-4, batch_size=128, max_epochs=100),
    random_state=101,
)
Key settings:
| Setting | Typical range | Effect |
|---|---|---|
| d_model | | Token width and main capacity driver. |
| n_layers | | Transformer depth. |
| n_heads | | Attention heads; must divide d_model. |
| | | Feed-forward capacity. |
| pooling_method | | Sequence aggregation strategy. |
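Standard multi-head attention splits each token into n_heads equal slices, so the head count has to divide d_model evenly. A quick check along these lines can catch an invalid combination before training starts; head_dim is a hypothetical helper, not a DeepTab function.

```python
def head_dim(d_model, n_heads):
    """Return the per-head width, or raise if n_heads does not divide d_model."""
    if d_model <= 0 or n_heads <= 0:
        raise ValueError("d_model and n_heads must be positive")
    if d_model % n_heads != 0:
        raise ValueError(f"n_heads={n_heads} does not divide d_model={d_model}")
    return d_model // n_heads


# With d_model=128 and n_heads=8, each head works on a 16-wide slice.
default_head_dim = head_dim(128, 8)
```

When tuning, it is simplest to change d_model in multiples of n_heads so that every candidate configuration stays valid.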
When To Use
Use FTTransformer for research comparisons involving attention over feature tokens. It is usually a more general Transformer baseline than TabTransformer because it handles numerical tokens directly.
References
Gorishniy et al., Revisiting Deep Learning Models for Tabular Data.
Vaswani et al., Attention Is All You Need.