Metrics

Evaluation metrics for all three DeepTab task types: regression, classification, and distributional (LSS) regression.

Every metric is a DeepTabMetric subclass with three attributes the framework reads automatically:

Attribute	Type	Purpose
`name`	`str`	Key in `model.evaluate()` results and training-log suffix (e.g. `val_rmse`, `val_crps`).
`higher_is_better`	`bool`	`True` for scores (accuracy, AUROC, R²); `False` for losses/errors (MSE, NLL, deviances). Used by HPO to set the optimisation direction.
`needs_raw`	`bool`	`False` (default): metric receives already-transformed distribution parameters. `True`: metric receives raw model logits and applies transforms itself. Only `NegativeLogLikelihood` uses `True`.

Quick Start

from deeptab.metrics import RootMeanSquaredError, CRPS, Accuracy

rmse = RootMeanSquaredError()
print(rmse.name)              # "rmse"
print(rmse.higher_is_better)  # False

# Pass to model.fit() for live training logging
from deeptab.models import MambularLSS
model = MambularLSS()
model.fit(
    X_train, y_train,
    val_metrics={
        "crps": CRPS(family="normal"),   # logged as "val_crps"
        "rmse": RootMeanSquaredError(),   # logged as "val_rmse"
    },
)

# Post-hoc evaluation
scores = model.evaluate(X_test, y_test)
# Returns e.g. {"crps": 0.32, "rmse": 1.45}

# Auto-select default metrics via the registry
from deeptab.metrics import get_default_metrics
metrics = get_default_metrics("lss", family="normal")
# [CRPS(family='normal'), RootMeanSquaredError(), MeanAbsoluteError()]

Available Metrics

Regression Metrics

Class	`name`	`higher_is_better`	Default	Notes
`MeanSquaredError`	`mse`	`False`		sklearn-backed; lower = better
`RootMeanSquaredError`	`rmse`	`False`	✓	Same units as target; primary regression metric
`MeanAbsoluteError`	`mae`	`False`	✓	Robust to outliers
`R2Score`	`r2`	`True`	✓	1.0 = perfect; higher = better
`MeanAbsolutePercentageError`	`mape`	`False`		% scale; avoid when targets near zero
`PinballLoss`	`pinball`	`False`		Quantile regression; tau in (0, 1)

The Default column marks the metrics returned by get_default_metrics("regression") and reported by model.evaluate() when no metrics argument is given; the first row (RMSE) is the primary metric used for HPO and model selection.

All regression metrics accept 2-D LSS parameter arrays and extract the first column (predicted mean) automatically.

Classification Metrics

Class	`name`	`higher_is_better`	Default	Input	Notes
`Accuracy`	`accuracy`	`True`	✓	labels	sklearn-backed; argmax of probability array
`F1Score`	`f1`	`True`		labels	`average` param: binary / macro / weighted
`AUROC`	`auroc`	`True`	✓	proba	Ranking-based; threshold-free
`AUPRC`	`auprc`	`True`		proba	Better than AUROC for imbalanced data
`LogLoss`	`log_loss`	`False`	✓	proba	Cross-entropy over class probabilities
`BrierScore`	`brier`	`False`		proba	MSE of probability; binary only
`ExpectedCalibrationError`	`ece`	`False`		proba	0 = perfectly calibrated; custom implementation

The Default column marks the metrics returned by get_default_metrics("classification"). The Input column shows which prediction model.evaluate() feeds each metric: proba metrics (auroc, auprc, log_loss, brier, ece) receive the 2-D predict_proba output, while labels metrics receive the 1-D predict output. The dispatch is automatic, keyed on the metric name.

Distributional / LSS Metrics

Class	`name`	`higher_is_better`	`needs_raw`	Notes
`NegativeLogLikelihood`	`nll`	`False`	`True`	Requires distribution object; passes raw logits
`LogScore`	`log_score`	`True`	`True`	= -NLL; higher = better
`CRPS`	`crps`	`False`	`False`	Vectorised via `properscoring`; all continuous families
`IntervalScore`	`interval_score`	`False`	`False`	Winkler score; expects [lower, upper] columns
`EnergyScore`	`energy_score`	`False`	`False`	Multivariate CRPS generalisation
`PoissonDeviance`	`poisson_deviance`	`False`	`False`	poisson, zip families
`GammaDeviance`	`gamma_deviance`	`False`	`False`	gamma, inversegamma families
`TweedieDeviance`	`tweedie_deviance`	`False`	`False`	tweedie family; `p` param (1 < p < 2)
`NegativeBinomialDeviance`	`nb_deviance`	`False`	`False`	negativebinom family
`BetaBrierScore`	`beta_brier`	`False`	`False`	beta family (proportions)
`DirichletError`	`dirichlet_error`	`False`	`False`	dirichlet family; KL divergence
`StudentTLoss`	`studentt_nll`	`False`	`False`	studentt family; proper NLL
`InverseGammaDeviance`	`inversegamma_deviance`	`False`	`False`	inversegamma family
`LogNormalNLL`	`lognormal_nll`	`False`	`False`	lognormal family
`CoverageProbability`	`coverage`	`True`	`False`	Fraction of targets inside prediction interval
`SharpnessScore`	`sharpness`	`False`	`False`	Mean interval width; lower = sharper
`ProbabilityIntegralTransform`	`pit`	`False`	`False`	MAD from uniform CDF; 0 = perfectly calibrated

Registry

The registry maps (task, family) keys to ordered lists of default metrics. The first entry in each list is the primary metric used by HPO and model selection.

from deeptab.metrics import get_default_metrics, get_default_metrics_dict

# Returns list of DeepTabMetric instances
get_default_metrics("regression")
# [RootMeanSquaredError(), MeanAbsoluteError(), R2Score()]

get_default_metrics("classification")
# [Accuracy(), AUROC(), LogLoss()]

get_default_metrics("lss", family="gamma")
# [GammaDeviance(), RootMeanSquaredError()]

# Returns {name: metric} dict, useful for model.evaluate()
get_default_metrics_dict("lss", family="normal")
# {"crps": CRPS(...), "rmse": RootMeanSquaredError(), "mae": MeanAbsoluteError()}

Choosing a Distribution-Specific Metric

For continuous point-estimate regression: use RMSE (default) or MAE for outlier-robustness.

For distributional (LSS) models: use CRPS as the primary metric. CRPS is a proper scoring rule: it rewards both accuracy and calibration, so it cannot be gamed by reporting an over-wide predictive distribution.

For count data (poisson, zip, negativebinom): use the appropriate deviance. Deviances are equivalent to twice the log-likelihood ratio against the saturated model and are the standard criterion for GLM-type models.

For probability / composition (beta, dirichlet): use BetaBrierScore or DirichletError.

For uncertainty quantification: combine CRPS with CoverageProbability and SharpnessScore to get a complete picture of calibration and precision.

Writing a Custom Metric

Subclass DeepTabMetric, set name and higher_is_better, then implement __call__:

from deeptab.metrics import DeepTabMetric
import numpy as np

class MedianAbsoluteError(DeepTabMetric):
    name = "mdae"
    higher_is_better = False    # lower = better
    needs_raw = False           # use transformed predictions

    def __call__(self, y_true, y_pred):
        y_pred = np.asarray(y_pred)
        mean_pred = y_pred[:, 0] if y_pred.ndim == 2 else y_pred.ravel()
        return float(np.median(np.abs(np.asarray(y_true).ravel() - mean_pred)))

# Use it anywhere a standard metric is accepted
model.fit(X_train, y_train, val_metrics={"mdae": MedianAbsoluteError()})
scores = model.evaluate(X_test, y_test, metrics={"mdae": MedianAbsoluteError()})

API Reference

deeptab.metrics