deeptab.metrics

Base Class

class deeptab.metrics.DeepTabMetric[source]

Abstract base class for all DeepTab evaluation metrics.

Every metric in deeptab.metrics subclasses this ABC and exposes three class-level attributes that the training loop and registry read automatically — you never need to set them yourself when using a metric, only when writing a custom one.

name

A short, machine-readable identifier for the metric. It is used as:

the key in the dict returned by model.evaluate()
the suffix in training-log entries (e.g. val_rmse)
the registry lookup key in METRIC_REGISTRY

Examples: "rmse", "crps", "auroc".

Type: str

higher_is_better

Tells the framework whether a larger or smaller value is preferable. This matters in two places:

HPO — hyperparameter search uses it to set the optimisation direction (maximise vs. minimise) when a metric is chosen as the objective.
Early stopping / model selection — callbacks can use it to decide whether a new checkpoint is an improvement.

False (default) means lower is better, which suits loss functions and error metrics (MSE, MAE, NLL, CRPS, deviances). True means higher is better, which suits scores such as R², accuracy, AUROC, and the log score.

Type: bool
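
As a rough illustration of how a callback or search driver might consume this flag, the sketch below compares a new metric value against the best one seen so far. The helper names is_improvement and optimisation_direction are hypothetical and not part of deeptab, and the direction strings are only an assumption about what an HPO backend might expect.

```python
def is_improvement(new_value, best_value, higher_is_better):
    """Return True when new_value beats best_value under the metric's direction."""
    if best_value is None:  # nothing recorded yet, so any value counts as better
        return True
    if higher_is_better:
        return new_value > best_value
    return new_value < best_value


def optimisation_direction(higher_is_better):
    """Map the flag to a direction label (hypothetical strings)."""
    return "maximize" if higher_is_better else "minimize"


# An RMSE of 0.4 improves on 0.5; an accuracy of 0.4 does not.
print(is_improvement(0.4, 0.5, higher_is_better=False))
print(is_improvement(0.4, 0.5, higher_is_better=True))
```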

needs_raw

Controls which form of y_pred the training loop passes to this metric.

False (default) — the metric receives already-transformed distribution parameters, i.e. the output of model.predict(X, raw=False). For example, a Normal distribution model returns [mean, std] where std > 0 is guaranteed. This is the right choice for almost every metric.
True — the metric receives raw model logits before the distribution’s parameter transforms are applied. NegativeLogLikelihood sets this to True because it calls distribution.compute_loss() which applies the transforms itself; passing already-transformed values would double-transform and produce wrong results.

Type: bool
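
The double-transform problem can be seen with a toy positivity transform. This sketch assumes softplus is the transform applied to a scale parameter; the transform a given distribution actually uses may differ, and raw_scale is an invented value.

```python
import math


def softplus(x):
    """Map any real number to a positive value: log(1 + exp(x))."""
    return math.log1p(math.exp(x))


raw_scale = -1.0                    # hypothetical raw network output
scale_once = softplus(raw_scale)    # transformed once, as intended
scale_twice = softplus(scale_once)  # transformed again by mistake

# The second application inflates the scale, so any likelihood computed
# from it no longer matches the model's actual prediction.
print(scale_once, scale_twice)
```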

Examples

Using a built-in metric directly:

>>> from deeptab.metrics import RootMeanSquaredError
>>> import numpy as np
>>> metric = RootMeanSquaredError()
>>> metric.name
'rmse'
>>> metric.higher_is_better
False
>>> metric(np.array([1.0, 2.0, 3.0]), np.array([1.1, 2.0, 2.9]))
0.08164965809277261

Passing metrics to model.fit() for live training logging:

>>> from deeptab.metrics import CRPS, MeanAbsoluteError
>>> model.fit(X_train, y_train,
...           val_metrics={"crps": CRPS(family="normal"),
...                        "mae": MeanAbsoluteError()})
# Logs val_crps and val_mae each epoch.

Writing a custom metric:

>>> from deeptab.metrics import DeepTabMetric
>>> import numpy as np
>>> class MedianAbsoluteError(DeepTabMetric):
...     name = "mdae"
...     higher_is_better = False          # lower error = better
...     needs_raw = False                 # use transformed predictions
...
...     def __call__(self, y_true, y_pred):
...         y_pred = np.asarray(y_pred)
...         mean_pred = y_pred[:, 0] if y_pred.ndim == 2 else y_pred.ravel()
...         return float(np.median(np.abs(np.asarray(y_true).ravel() - mean_pred)))

Registry

deeptab.metrics.METRIC_REGISTRY = {
    'classification': [Accuracy(), AUROC(average='macro'), LogLoss()],
    'lss:beta': [BetaBrierScore(), RootMeanSquaredError()],
    'lss:categorical': [Accuracy(), LogLoss()],
    'lss:dirichlet': [DirichletError()],
    'lss:gamma': [GammaDeviance(), RootMeanSquaredError()],
    'lss:inversegamma': [InverseGammaDeviance(), GammaDeviance()],
    'lss:johnsonsu': [CRPS(family='johnsonsu'), RootMeanSquaredError()],
    'lss:lognormal': [LogNormalNLL(), CRPS(family='lognormal'), RootMeanSquaredError()],
    'lss:mog': [CRPS(family='normal'), RootMeanSquaredError()],
    'lss:multinomial': [LogLoss()],
    'lss:negativebinom': [NegativeBinomialDeviance(default_alpha=1.0), RootMeanSquaredError()],
    'lss:normal': [CRPS(family='normal'), RootMeanSquaredError(), MeanAbsoluteError()],
    'lss:poisson': [PoissonDeviance(), RootMeanSquaredError()],
    'lss:quantile': [PinballLoss(quantile=0.5, col=0)],
    'lss:studentt': [StudentTLoss(default_df=3.0), CRPS(family='studentt')],
    'lss:tweedie': [TweedieDeviance(p=1.5), RootMeanSquaredError()],
    'lss:zip': [PoissonDeviance(), RootMeanSquaredError()],
    'regression': [RootMeanSquaredError(), MeanAbsoluteError(), R2Score()],
}

deeptab.metrics.get_default_metrics(task, family=None)[source]

Return the default list of metrics for a given task and distribution family.

Parameters:

task (str) – One of "regression", "classification", or "lss".
family (str | None) – Distribution family key used for LSS tasks, e.g. "normal", "gamma", "poisson". Ignored for non-LSS tasks.

Returns:

Ordered list of metric instances. The first entry is the primary metric. Returns an empty list when the combination is unknown.

Return type:

list[DeepTabMetric]

deeptab.metrics.get_default_metrics_dict(task, family=None)[source]

Like get_default_metrics() but returns a {name: metric} dict.

Convenience wrapper for code paths that store metrics as dicts.

Return type: dict[str, DeepTabMetric]
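
How the two helpers might resolve a registry key can be sketched with a stand-in registry. Everything here is hypothetical (ToyMetric, TOY_REGISTRY, and the toy_ functions); the only detail taken from the documentation is that LSS entries are keyed as 'lss:<family>' and that unknown combinations yield an empty list.

```python
from collections import namedtuple

# Stand-in for a metric instance; only the name attribute is modelled.
ToyMetric = namedtuple("ToyMetric", ["name"])

TOY_REGISTRY = {
    "regression": [ToyMetric("rmse"), ToyMetric("mae"), ToyMetric("r2")],
    "classification": [ToyMetric("accuracy"), ToyMetric("auroc"), ToyMetric("log_loss")],
    "lss:normal": [ToyMetric("crps"), ToyMetric("rmse"), ToyMetric("mae")],
}


def toy_default_metrics(task, family=None):
    """LSS tasks use an 'lss:<family>' key; other tasks ignore family."""
    key = f"lss:{family}" if task == "lss" else task
    return list(TOY_REGISTRY.get(key, []))  # unknown combination -> empty list


def toy_default_metrics_dict(task, family=None):
    """Same lookup, returned as a {name: metric} dict."""
    return {metric.name: metric for metric in toy_default_metrics(task, family)}


print([m.name for m in toy_default_metrics("lss", "normal")])
```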

Regression Metrics

class deeptab.metrics.MeanSquaredError[source]

Mean Squared Error – delegates to sklearn.metrics.mean_squared_error().

Accepts both point-prediction vectors and 2-D parameter arrays (uses the first column as the predicted mean).

higher_is_better: bool = False

name: str = 'mse'

needs_raw: bool = False

class deeptab.metrics.RootMeanSquaredError[source]

Root Mean Squared Error – sqrt of sklearn.metrics.mean_squared_error().

higher_is_better: bool = False

name: str = 'rmse'

needs_raw: bool = False

class deeptab.metrics.MeanAbsoluteError[source]

Mean Absolute Error – delegates to sklearn.metrics.mean_absolute_error().

higher_is_better: bool = False

name: str = 'mae'

needs_raw: bool = False

class deeptab.metrics.R2Score[source]

Coefficient of Determination (R2) – delegates to sklearn.metrics.r2_score().

Higher is better; perfect prediction gives R2 = 1.

higher_is_better: bool = True

name: str = 'r2'

needs_raw: bool = False

class deeptab.metrics.MeanAbsolutePercentageError[source]

Mean Absolute Percentage Error – delegates to sklearn.metrics.mean_absolute_percentage_error().

sklearn clips the denominator to np.finfo(np.float64).eps internally.

higher_is_better: bool = False

name: str = 'mape'

needs_raw: bool = False

class deeptab.metrics.PinballLoss(quantile=0.5, col=0)[source]

Pinball (Quantile) Loss – delegates to sklearn.metrics.mean_pinball_loss().

Measures calibration at a single quantile level tau in (0, 1).

For LSS quantile family predictions, y_pred is a 2-D array where each column is a predicted quantile. Pass col to select the relevant column (default 0).

Parameters:

quantile (float) – The quantile level, e.g. 0.5 for the median.
col (int) – Column of y_pred to use when predictions are 2-D. Default 0.

higher_is_better: bool = False

name: str = 'pinball'

needs_raw: bool = False
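
For reference, the quantity being computed can be written in a few lines of plain Python. This is a minimal sketch of the standard pinball loss plus column selection, not the library's code; pinball_loss and select_column are illustrative names.

```python
def pinball_loss(y_true, y_pred, quantile=0.5):
    """Mean pinball loss of predicted quantiles at level `quantile`."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by quantile, over-prediction by 1 - quantile.
        total += quantile * diff if diff >= 0 else (quantile - 1.0) * diff
    return total / len(y_true)


def select_column(y_pred_2d, col=0):
    """Pick one predicted-quantile column from a row-major 2-D list."""
    return [row[col] for row in y_pred_2d]


preds = [[0.5, 1.5], [2.0, 2.5], [4.0, 3.5]]  # two quantile columns per row
print(pinball_loss([1.0, 2.0, 3.0], select_column(preds, col=0), quantile=0.5))
```

At quantile=0.5 the loss equals half the mean absolute error, which is a quick sanity check for any implementation.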

Classification Metrics

class deeptab.metrics.Accuracy[source]

Classification accuracy – delegates to sklearn.metrics.accuracy_score().

Accepts 1-D integer labels or 2-D probability arrays (argmax is taken).

higher_is_better: bool = True

name: str = 'accuracy'

needs_raw: bool = False

class deeptab.metrics.F1Score(average='binary')[source]

F1 Score – delegates to sklearn.metrics.f1_score().

Parameters: average (str) – Averaging strategy: "binary" (default), "macro", or "weighted".

higher_is_better: bool = True

name: str = 'f1'

needs_raw: bool = False

class deeptab.metrics.AUROC(average='macro')[source]

Area Under the ROC Curve – delegates to sklearn.metrics.roc_auc_score().

Parameters: average (str) – "macro" (default) or "weighted". Ignored for binary tasks.

higher_is_better: bool = True

name: str = 'auroc'

needs_raw: bool = False

class deeptab.metrics.AUPRC[source]

Area Under the Precision-Recall Curve – delegates to sklearn.metrics.average_precision_score().

higher_is_better: bool = True

name: str = 'auprc'

needs_raw: bool = False

class deeptab.metrics.LogLoss[source]

Cross-Entropy / Log Loss – delegates to sklearn.metrics.log_loss().

higher_is_better: bool = False

name: str = 'log_loss'

needs_raw: bool = False

class deeptab.metrics.BrierScore[source]

Brier Score – delegates to sklearn.metrics.brier_score_loss().

Accepts 1-D probability scores or a 2-D array (second column is used).

higher_is_better: bool = False

name: str = 'brier'

needs_raw: bool = False

class deeptab.metrics.ExpectedCalibrationError(n_bins=10)[source]

Expected Calibration Error (ECE).

sklearn does not provide ECE natively, so this is a custom implementation. Bins predictions by confidence and measures the gap between mean confidence and accuracy per bin.

Parameters: n_bins (int) – Number of confidence bins. Default 10.

higher_is_better: bool = False

name: str = 'ece'

needs_raw: bool = False
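
The binning procedure described above can be sketched as follows. This assumes equal-width bins and the usual weighting of each bin's gap by its share of samples; the library's own binning details are not documented here and may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece


# correct holds 1 for a correct top-class prediction and 0 otherwise.
print(expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0]))
```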

Distributional / LSS Metrics

Proper Scoring Rules

class deeptab.metrics.NegativeLogLikelihood(distribution)[source]

Negative Log-Likelihood computed via the distribution’s compute_loss.

This metric requires raw model logits (needs_raw=True) and the distribution family object, because compute_loss applies parameter transforms internally.

Parameters: distribution (BaseDistribution) – The fitted distribution object (e.g. model.task_model.family).

higher_is_better: bool = False

name: str = 'nll'

needs_raw: bool = True

class deeptab.metrics.LogScore(distribution)[source]

Log Score: the negated NLL, so higher is better.

Convenience wrapper around NegativeLogLikelihood.

Parameters: distribution (BaseDistribution) – The fitted distribution object.

higher_is_better: bool = True

name: str = 'log_score'

needs_raw: bool = True

class deeptab.metrics.CRPS(family='normal')[source]

Continuous Ranked Probability Score (CRPS) for univariate distributions.

Uses vectorised properscoring routines when available. Falls back to a pure-NumPy energy-form approximation when properscoring is not installed.

Expected y_pred format (2-D array, columns are distribution parameters):

Normal / StudentT / LogNormal / JohnsonSU — [loc, scale]
All other families — [mean, ...]; CRPS is approximated from the predicted mean only (less informative).

For the normal family, the exact Gaussian CRPS is computed.

Parameters: family (str) – Distribution family key (e.g. "normal", "studentt"). When provided, enables family-specific CRPS formulas.

higher_is_better: bool = False

name: str = 'crps'

needs_raw: bool = False
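
The exact Gaussian CRPS mentioned above has a well-known closed form, sketched here with the standard library only. The function names are illustrative; this is not the library's implementation, which is described as delegating to properscoring when available.

```python
import math


def crps_normal(y, loc, scale):
    """Closed-form CRPS of N(loc, scale^2) at observation y."""
    z = (y - loc) / scale
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return scale * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


def mean_crps_normal(y_true, y_pred):
    """Average CRPS over rows of [loc, scale] predictions."""
    scores = [crps_normal(y, loc, scale) for y, (loc, scale) in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


print(mean_crps_normal([0.0, 1.0], [[0.0, 1.0], [0.5, 2.0]]))
```

As the predicted scale shrinks towards zero, the score approaches the absolute error |y - loc|, which is why CRPS is often read as a distributional generalisation of MAE.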

class deeptab.metrics.IntervalScore(alpha=0.05)[source]

Winkler Interval Score at coverage level 1 - alpha.

Penalises both width and mis-coverage. Expected y_pred format:

Column 0: lower bound of the prediction interval
Column 1: upper bound of the prediction interval

Parameters: alpha (float) – Significance level, e.g. 0.05 for a 95% prediction interval.

higher_is_better: bool = False

name: str = 'interval_score'

needs_raw: bool = False
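
A minimal sketch of the standard Winkler interval score shows how the width and mis-coverage penalties combine. The function name is illustrative and the code is not taken from the library.

```python
def interval_score(y_true, y_pred, alpha=0.05):
    """Mean Winkler score for rows of [lower, upper] interval bounds."""
    total = 0.0
    for y, (lower, upper) in zip(y_true, y_pred):
        score = upper - lower  # width penalty
        if y < lower:
            score += (2.0 / alpha) * (lower - y)  # observation below interval
        elif y > upper:
            score += (2.0 / alpha) * (y - upper)  # observation above interval
        total += score
    return total / len(y_true)


# First observation is covered; the second misses the interval by 2 units.
print(interval_score([0.5, 3.0], [[0.0, 1.0], [0.0, 1.0]], alpha=0.1))
```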

class deeptab.metrics.EnergyScore[source]

Energy Score — multivariate generalisation of CRPS.

Suitable for multivariate / compositional distributions (e.g. MixtureOfGaussiansDistribution, DirichletDistribution).

Computed via Monte-Carlo sampling from the predicted distribution when samples are provided, or via a closed-form energy distance otherwise.

For simple use-cases where y_pred is a 2-D parameter array, the energy score is approximated as the mean Euclidean distance between y_true and the predicted mean.

higher_is_better: bool = False

name: str = 'energy_score'

needs_raw: bool = False

Distribution-Specific Deviances

class deeptab.metrics.PoissonDeviance[source]

Mean Poisson Deviance.

Suitable for poisson and zip families. Expected y_pred: predicted mean (1-D or first column of 2-D).

higher_is_better: bool = False

name: str = 'poisson_deviance'

needs_raw: bool = False
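
For orientation, the textbook mean Poisson deviance is sketched below in plain Python. It is an illustration of the formula rather than the library's code, and poisson_deviance is an illustrative name.

```python
import math


def poisson_deviance(y_true, mu):
    """Mean of 2 * (y * log(y / mu) - (y - mu)), with y * log(y) = 0 at y = 0."""
    total = 0.0
    for y, m in zip(y_true, mu):
        log_term = y * math.log(y / m) if y > 0 else 0.0
        total += 2.0 * (log_term - (y - m))
    return total / len(y_true)


print(poisson_deviance([0.0, 2.0, 5.0], [1.0, 2.0, 4.0]))
```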

class deeptab.metrics.GammaDeviance[source]

Mean Gamma Deviance.

Suitable for gamma and inversegamma families. Expected y_pred: predicted mean (1-D or first column of 2-D).

higher_is_better: bool = False

name: str = 'gamma_deviance'

needs_raw: bool = False
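
The textbook mean Gamma deviance, sketched in plain Python for strictly positive targets and predictions. The function name is illustrative and the code is not the library's implementation.

```python
import math


def gamma_deviance(y_true, mu):
    """Mean of 2 * (log(mu / y) + y / mu - 1); requires y > 0 and mu > 0."""
    total = 0.0
    for y, m in zip(y_true, mu):
        total += 2.0 * (math.log(m / y) + y / m - 1.0)
    return total / len(y_true)


print(gamma_deviance([1.0, 3.0], [2.0, 3.0]))
```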

class deeptab.metrics.TweedieDeviance(p=1.5)[source]

Mean Tweedie Deviance.

Suitable for the tweedie family where 1 < p < 2.

Parameters: p (float) – Tweedie power parameter. Defaults to 1.5.

higher_is_better: bool = False

name: str = 'tweedie_deviance'

needs_raw: bool = False
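
For the compound Poisson-Gamma range 1 < p < 2, the usual Tweedie unit deviance can be sketched as follows. This is an illustration of the standard formula under that restriction, with an illustrative function name, and not the library's code.

```python
def tweedie_deviance(y_true, mu, p=1.5):
    """Mean Tweedie deviance for 1 < p < 2, with y >= 0 and mu > 0."""
    total = 0.0
    for y, m in zip(y_true, mu):
        total += 2.0 * (
            y ** (2.0 - p) / ((1.0 - p) * (2.0 - p))
            - y * m ** (1.0 - p) / (1.0 - p)
            + m ** (2.0 - p) / (2.0 - p)
        )
    return total / len(y_true)


# Zero targets are allowed in this range, unlike the Gamma deviance.
print(tweedie_deviance([0.0, 4.0], [1.0, 1.0], p=1.5))
```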

class deeptab.metrics.NegativeBinomialDeviance(default_alpha=1.0)[source]

Mean Negative-Binomial Deviance.

Suitable for the negativebinom family.

Expected y_pred: 2-D array where column 0 is the predicted mean mu and column 1 (optional) is the overdispersion parameter alpha. If only one column is present, alpha falls back to the default_alpha constructor argument.

Parameters: default_alpha (float) – Overdispersion parameter used when y_pred has only one column. Defaults to 1.0.

higher_is_better: bool = False

name: str = 'nb_deviance'

needs_raw: bool = False
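
The column fallback can be illustrated with one common form of the NB2 deviance. The exact formula used by the library is not stated here, so treat the expression below as an assumption; nb_deviance is an illustrative name.

```python
import math


def nb_deviance(y_true, y_pred, default_alpha=1.0):
    """Mean NB2 deviance; rows are [mu] or [mu, alpha]."""
    total = 0.0
    for y, row in zip(y_true, y_pred):
        mu = row[0]
        # Fall back to the constructor default when no alpha column exists.
        alpha = row[1] if len(row) > 1 else default_alpha
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        ratio = (1.0 + alpha * y) / (1.0 + alpha * mu)
        total += 2.0 * (log_term - (y + 1.0 / alpha) * math.log(ratio))
    return total / len(y_true)


print(nb_deviance([0.0, 3.0], [[1.0], [3.0, 0.5]]))
```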

class deeptab.metrics.BetaBrierScore[source]

Mean Squared Error of the predicted mean for Beta-distributed targets.

Suitable for the beta family. Expected y_pred: predicted mean in (0, 1) (1-D or first column of 2-D).

higher_is_better: bool = False

name: str = 'beta_brier'

needs_raw: bool = False

class deeptab.metrics.DirichletError[source]

Mean KL Divergence between true and predicted Dirichlet means.

Suitable for the dirichlet family. Both y_true and y_pred are treated as probability vectors (rows must sum to 1 after clipping).

higher_is_better: bool = False

name: str = 'dirichlet_error'

needs_raw: bool = False
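
A minimal sketch of a mean KL divergence between rows of probability vectors. It assumes clipping to a small eps followed by renormalisation; both the eps value and the renormalisation step are assumptions, and mean_kl_divergence is an illustrative name.

```python
import math


def mean_kl_divergence(y_true, y_pred, eps=1e-12):
    """Mean KL(p || q) over rows, after clipping and renormalising each row."""
    total = 0.0
    for p_row, q_row in zip(y_true, y_pred):
        p = [max(v, eps) for v in p_row]  # avoid log(0)
        q = [max(v, eps) for v in q_row]
        p_sum, q_sum = sum(p), sum(q)
        p = [v / p_sum for v in p]        # rows sum to 1 again
        q = [v / q_sum for v in q]
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return total / len(y_true)


print(mean_kl_divergence([[0.5, 0.5]], [[0.25, 0.75]]))
```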

class deeptab.metrics.StudentTLoss(default_df=3.0)[source]

Proper Student-T negative log-likelihood (mean) for the studentt family.

Expected y_pred columns: [loc, scale, (df)]. If only 2 columns are present, df defaults to the constructor argument.

Parameters: default_df (float) – Degrees-of-freedom fallback when not present in y_pred. Defaults to 3.0.

higher_is_better: bool = False

name: str = 'studentt_nll'

needs_raw: bool = False
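
The degrees-of-freedom fallback and the location-scale Student-t density can be sketched with the standard library. This illustrates the standard NLL and is not the library's implementation; studentt_nll is an illustrative name.

```python
import math


def studentt_nll(y_true, y_pred, default_df=3.0):
    """Mean Student-t NLL; rows are [loc, scale] or [loc, scale, df]."""
    total = 0.0
    for y, row in zip(y_true, y_pred):
        loc, scale = row[0], row[1]
        df = row[2] if len(row) > 2 else default_df  # fallback when df is absent
        z = (y - loc) / scale
        log_pdf = (
            math.lgamma((df + 1.0) / 2.0)
            - math.lgamma(df / 2.0)
            - 0.5 * math.log(df * math.pi)
            - math.log(scale)
            - (df + 1.0) / 2.0 * math.log1p(z * z / df)
        )
        total -= log_pdf
    return total / len(y_true)


print(studentt_nll([0.0, 1.0], [[0.0, 1.0], [0.5, 2.0, 5.0]]))
```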

class deeptab.metrics.InverseGammaDeviance[source]

Mean Inverse-Gamma deviance for the inversegamma family.

Expected y_pred columns: [shape (alpha), scale (beta)].

The deviance is computed as -2 * (log p(y | alpha, beta) - log p(y | alpha_sat, beta_sat)), where the saturated-model likelihood is taken to be 1. Its log is therefore 0, and the per-sample deviance reduces to -2 * log p(y | alpha, beta).

higher_is_better: bool = False

name: str = 'inversegamma_deviance'

needs_raw: bool = False
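
With the saturated-model likelihood taken as 1, the per-sample deviance is -2 times the Inverse-Gamma log-density, which can be sketched as follows. The function name is illustrative and the code is a reading of the description above, not the library's source.

```python
import math


def inversegamma_deviance(y_true, y_pred):
    """Mean of -2 * log p(y | alpha, beta); rows are [alpha, beta]."""
    total = 0.0
    for y, (alpha, beta) in zip(y_true, y_pred):
        log_pdf = (
            alpha * math.log(beta)
            - math.lgamma(alpha)
            - (alpha + 1.0) * math.log(y)
            - beta / y
        )
        total += -2.0 * log_pdf
    return total / len(y_true)


print(inversegamma_deviance([1.0, 0.5], [[1.0, 1.0], [2.0, 1.0]]))
```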

class deeptab.metrics.LogNormalNLL[source]

Mean Log-Normal Negative Log-Likelihood for the lognormal family.

Expected y_pred columns: [loc (log-space mean), scale (log-space std)].

higher_is_better: bool = False

name: str = 'lognormal_nll'

needs_raw: bool = False
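
The log-normal NLL with log-space parameters can be sketched as below. It illustrates the standard density and is not the library's code; lognormal_nll is an illustrative name.

```python
import math


def lognormal_nll(y_true, y_pred):
    """Mean log-normal NLL; rows are [loc, scale] in log space, y > 0."""
    total = 0.0
    for y, (loc, scale) in zip(y_true, y_pred):
        log_y = math.log(y)
        total += (
            log_y  # Jacobian term from the log transform
            + math.log(scale)
            + 0.5 * math.log(2.0 * math.pi)
            + (log_y - loc) ** 2 / (2.0 * scale ** 2)
        )
    return total / len(y_true)


print(lognormal_nll([1.0, 2.0], [[0.0, 1.0], [0.5, 0.5]]))
```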

Calibration & Uncertainty

class deeptab.metrics.CoverageProbability(alpha=0.05)[source]

Empirical coverage probability at a given 1 - alpha level.

Expected y_pred columns: [lower_bound, upper_bound].

A well-calibrated model should have coverage close to 1 - alpha. Higher is not unconditionally better — the target is the nominal level.

Parameters: alpha (float) – Significance level, e.g. 0.05 for 95% prediction intervals.

higher_is_better: bool = True

name: str = 'coverage'

needs_raw: bool = False

class deeptab.metrics.SharpnessScore[source]

Mean prediction interval width (sharpness).

Narrower intervals are sharper (lower is better), but must be balanced against calibration. Expected y_pred columns: [lower, upper].

higher_is_better: bool = False

name: str = 'sharpness'

needs_raw: bool = False
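
Coverage and sharpness are usually read together, so the sketch below computes both from the same [lower, upper] rows. Treating the bounds as inclusive is an assumption, and the function names are illustrative.

```python
def coverage_probability(y_true, intervals):
    """Fraction of observations inside their interval (bounds inclusive)."""
    hits = sum(1 for y, (lower, upper) in zip(y_true, intervals) if lower <= y <= upper)
    return hits / len(y_true)


def sharpness(intervals):
    """Mean interval width."""
    return sum(upper - lower for lower, upper in intervals) / len(intervals)


y = [0.5, 2.0, 3.0, 10.0]
bounds = [[0.0, 1.0], [1.0, 3.0], [4.0, 5.0], [9.0, 11.0]]
print(coverage_probability(y, bounds), sharpness(bounds))
```

Widening every interval raises coverage but worsens sharpness, which is why neither number is meaningful on its own.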

class deeptab.metrics.ProbabilityIntegralTransform(n_bins=10, family='normal')[source]

PIT uniformity test — returns the mean absolute deviation from uniformity.

The Probability Integral Transform (PIT) of a well-calibrated forecast should be uniform on [0, 1]. This metric computes the PIT values under a Normal predictive distribution and returns their mean absolute deviation (MAD) from the uniform CDF. Lower is better (0 = perfect calibration).

Expected y_pred columns: [loc, scale] (Normal distribution).

Parameters:

n_bins (int) – Number of histogram bins for the PIT. Defaults to 10.
family (str) – Distribution family for CDF computation. Currently only "normal" is supported.

higher_is_better: bool = False

name: str = 'pit'

needs_raw: bool = False
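
One plausible reading of this metric is sketched below: compute PIT values with the Normal CDF, histogram them into n_bins equal-width bins, and average the absolute gap between each bin's frequency and the uniform frequency 1 / n_bins. The histogram-based deviation is an assumption; the library may measure the departure from uniformity differently.

```python
import math


def pit_deviation(y_true, y_pred, n_bins=10):
    """Mean absolute gap between PIT histogram frequencies and 1 / n_bins."""
    counts = [0] * n_bins
    for y, (loc, scale) in zip(y_true, y_pred):
        # PIT value: the predictive Normal CDF evaluated at the observation.
        u = 0.5 * (1.0 + math.erf((y - loc) / (scale * math.sqrt(2.0))))
        counts[min(int(u * n_bins), n_bins - 1)] += 1
    n = len(y_true)
    return sum(abs(c / n - 1.0 / n_bins) for c in counts) / n_bins


# Two observations placed symmetrically around the mean fill both bins evenly.
print(pit_deviation([-1.0, 1.0], [[0.0, 1.0], [0.0, 1.0]], n_bins=2))
```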