Data

The data API provides low-level control over data loading, batching, and feature inspection. Most users do not need it, because the sklearn-compatible interface (model.fit(X, y)) handles data management automatically.

Use the data API when you need:

  • Custom training loops outside the sklearn interface

  • Feature schema inspection to understand preprocessing applied to each feature

  • Fine-grained control over batching and data loading

  • Integration with Lightning for advanced training workflows

Core Classes

Class               Description
FeatureSchema       Inspect feature types, preprocessing, and dimensions after fitting a model
FeatureInfo         Metadata for individual features (type, cardinality, preprocessing method)
TabularBatch        Typed container for batches (numerical, categorical features, labels); new in v2.0
TabularDataModule   Lightning DataModule for train/val/test splits and batching (internal use)
TabularDataset      PyTorch Dataset for preprocessed tensors (internal use)

Common Use Cases

Inspecting Feature Schema

After fitting a model, inspect how features were preprocessed:

from deeptab.models import MambularClassifier

model = MambularClassifier()
model.fit(X_train, y_train)

# Access feature schema
schema = model.feature_schema

# Inspect numerical features
for name, info in schema.numerical_features.items():
    print(f"{name}: {info.preprocessing}, dim={info.dimension}")

# Inspect categorical features
for name, info in schema.categorical_features.items():
    print(f"{name}: {len(info.categories)} categories, dim={info.dimension}")

# Get totals
print(f"Total numerical dim: {schema.total_numerical_dim}")
print(f"Total categorical dim: {schema.total_categorical_dim}")

When to use: Debugging feature preprocessing, understanding model input dimensions, verifying feature detection.
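
The inspection loops above can be wrapped in a small helper that collects one row per feature, which is handy for logging or for comparing schemas between runs. The sketch below is duck-typed and relies only on the attributes already shown (numerical_features, categorical_features, preprocessing, categories, dimension). The helper name summarize_schema and the SimpleNamespace stand-in are hypothetical and not part of deeptab; with a fitted model you would pass model.feature_schema instead.

```python
from types import SimpleNamespace


def summarize_schema(schema):
    """Return one (name, kind, detail, dimension) row per feature."""
    rows = []
    for name, info in schema.numerical_features.items():
        rows.append((name, "numerical", str(info.preprocessing), info.dimension))
    for name, info in schema.categorical_features.items():
        detail = f"{len(info.categories)} categories"
        rows.append((name, "categorical", detail, info.dimension))
    return rows


# Hypothetical stand-in that mimics only the attributes used above;
# replace it with model.feature_schema from a fitted model.
demo_schema = SimpleNamespace(
    numerical_features={
        "age": SimpleNamespace(preprocessing="standardization", dimension=1),
    },
    categorical_features={
        "city": SimpleNamespace(categories=["a", "b", "c"], dimension=1),
    },
)

for row in summarize_schema(demo_schema):
    print(row)
```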

Working with TabularBatch

TabularBatch (new in v2.0) replaces raw tuples with a typed container whose fields are accessed by name:

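import torch

from deeptab.data import TabularBatch

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# In custom training loops; dataloader is an existing loader,
# e.g. datamodule.train_dataloader()
for batch in dataloader:
    if isinstance(batch, tuple):
        # Convert the legacy tuple format to a TabularBatch
        batch = TabularBatch.from_tuple(batch)

    # Move the batch to the target device
    batch = batch.to(device)

    # Access features and labels by name
    num_feats = batch.numerical_features
    cat_feats = batch.categorical_features
    labels = batch.labels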

When to use: Custom training loops, cleaner code for batch processing, device management.

Custom Data Loading

For advanced workflows, create data modules directly:

from deeptab.data import TabularDataModule

# Reuse the preprocessor from an already fitted model
datamodule = TabularDataModule(
    preprocessor=model.preprocessor,
    batch_size=512,
    shuffle=True,
    regression=False,
)

datamodule.preprocess_data(
    X_train, y_train,
    X_val=X_val, y_val=y_val,
)

# Access dataloaders
train_loader = datamodule.train_dataloader()
val_loader = datamodule.val_dataloader()

When to use: Custom training loops, hyperparameter tuning with fixed preprocessing, integration with PyTorch Lightning.
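
To make the batch_size and shuffle arguments concrete, here is a plain-Python sketch of what a data loader does with them: optionally permute the row indices, then yield consecutive slices. It only illustrates the semantics and is not how TabularDataModule is implemented; iter_batches and its seed parameter are hypothetical names.

```python
import random


def iter_batches(n_rows, batch_size, shuffle=False, seed=None):
    """Yield lists of row indices, each at most batch_size long."""
    indices = list(range(n_rows))
    if shuffle:
        # Permute the row order; a fixed seed keeps the order reproducible.
        random.Random(seed).shuffle(indices)
    for start in range(0, n_rows, batch_size):
        yield indices[start:start + batch_size]


batches = list(iter_batches(n_rows=10, batch_size=4, shuffle=True, seed=0))
print([len(b) for b in batches])  # prints [4, 4, 2]; the last batch holds the remainder
```

Every row index appears exactly once across the batches, whether or not shuffling is enabled.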

Key Design Principles

Automatic vs. Manual:

The sklearn interface (fit(X, y)) creates data modules automatically. Only use the data API directly for custom workflows.

Internal Representation:

Features are stored as lists of tensors (one per feature), not single concatenated tensors. This supports heterogeneous preprocessing per feature.
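
A small illustration of why a list of per-feature blocks is convenient: each feature keeps its own width after preprocessing, and concatenation can be deferred until a flat vector is actually needed. The sketch uses nested Python lists in place of tensors, and the feature names and encodings are made up for illustration; it does not reproduce deeptab's internals.

```python
# Two rows, three features; each feature keeps its own (n_rows, width) block.
features = [
    [[0.12], [-1.40]],             # "age": scaled value, width 1
    [[1, 0, 0, 0], [0, 0, 1, 0]],  # "income": one-hot bins, width 4
    [[0.5, 0.5], [0.9, 0.1]],      # "score": 2-dimensional encoding, width 2
]

# Per-feature widths and the total input dimension they add up to.
widths = [len(block[0]) for block in features]
total_dim = sum(widths)


def concat_row(row_idx):
    """Concatenate the per-feature blocks of one row into a flat vector."""
    flat = []
    for block in features:
        flat.extend(block[row_idx])
    return flat


print(widths, total_dim)
print(concat_row(0))
```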

Typed Containers:

TabularBatch and FeatureSchema provide type hints and IDE autocompletion, replacing raw tuples and dictionaries.
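
The benefit of a typed container over a raw tuple can be seen in a small sketch. The SketchBatch dataclass below is a hypothetical stand-in, not deeptab's TabularBatch; it borrows only the field names and the from_tuple idea from this page, and it assumes a (numerical, categorical, labels) tuple order purely for illustration.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class SketchBatch:
    """Hypothetical stand-in for a typed batch container."""

    numerical_features: List[Any]
    categorical_features: List[Any]
    labels: Any

    @classmethod
    def from_tuple(cls, batch):
        # Assumed tuple order: (numerical, categorical, labels).
        num_feats, cat_feats, labels = batch
        return cls(list(num_feats), list(cat_feats), labels)


legacy = ([[0.1], [0.2]], [[3], [1]], [0, 1])
batch = SketchBatch.from_tuple(legacy)
print(batch.labels)  # named access instead of legacy[2]
```

Named fields let editors and type checkers flag a misspelled attribute, whereas a wrong tuple index fails silently or only at runtime.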
