Bug Report: WeightedEnsemble_FULL Fails During Predict With Custom Metric

by StackCamp Team

Introduction

This article covers a bug in AutoGluon's TimeSeriesPredictor that appears when a custom deterministic metric is used with quantile_levels=[] and several models are trained. WeightedEnsemble_FULL fails during prediction, even though individual base models such as DirectTabular_FULL and ETS_FULL predict successfully. The failing assertion in the ensemble's predict() method requires all base-model forecasts to have the same shape, which suggests that the mean-only setting leads to inconsistently shaped forecasts. The sections below describe the bug, the expected behavior, the steps to reproduce it, and the error logs.

Understanding the Bug

The problem is that WeightedEnsemble_FULL cannot aggregate the base-model forecasts when only mean predictions are requested (quantile_levels=[]). This setup is common when point forecasts matter more than probabilistic ones. During prediction, the ensemble raises an AssertionError because the base-model predictions it receives do not all have the same shape, so no final forecast is produced. Which base model returns the mismatched shape is not established in this report.
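The kind of check that fails can be illustrated with plain pandas. The sketch below uses two hypothetical forecasts, one with only a mean column and one with an extra quantile column; the model names and the extra column are assumptions made for illustration, not a diagnosis of which AutoGluon model misbehaves.

```python
import pandas as pd

# Hypothetical base-model outputs for one series over 3 future steps.
# Which model (if any) returns extra columns in the real bug is NOT known here.
pred_a = pd.DataFrame({"mean": [10.0, 11.0, 12.0]})
pred_b = pd.DataFrame({"mean": [9.0, 10.0, 11.0], "0.5": [9.0, 10.0, 11.0]})

preds = {"model_a": pred_a, "model_b": pred_b}

# Same style of check as the assertion in the traceback:
# every prediction must have an identical (rows, columns) shape.
shapes = {name: pred.shape for name, pred in preds.items()}
shapes_match = len(set(shapes.values())) == 1

print(shapes)
print("all shapes equal:", shapes_match)
```

With these toy inputs the shapes differ, so an assertion on `shapes_match` would fail in the same way as the one in the traceback.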

Expected Behavior

WeightedEnsemble_FULL should aggregate the forecasts of its base models without error, even when those forecasts contain only the mean (i.e., when quantile_levels=[] is passed), and return a single combined forecast. An ensemble exists to combine the strengths of its base models, so it should work for every forecast type the base models support, whether mean-only or mean plus quantiles.
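As a rough picture of the expected result, the sketch below combines three mean-only forecasts with a weighted average in plain pandas. The forecast values are invented, the weights are the ones reported in the training log of this report, and `weighted_mean_forecast` is a hypothetical helper, not AutoGluon's implementation.

```python
import pandas as pd

# Hypothetical mean-only forecasts from three base models (3 future steps).
base_preds = {
    "DirectTabular": pd.DataFrame({"mean": [10.0, 12.0, 8.0]}),
    "ETS": pd.DataFrame({"mean": [11.0, 11.0, 9.0]}),
    "RecursiveTabular": pd.DataFrame({"mean": [9.0, 13.0, 10.0]}),
}
# Ensemble weights as printed in the training log of this report.
weights = {"DirectTabular": 0.59, "ETS": 0.30, "RecursiveTabular": 0.11}


def weighted_mean_forecast(preds, weights):
    """Weighted average of forecasts that share the same index and columns."""
    total = sum(weights.values())
    # Normalize the weights so they sum to 1 before combining.
    return sum(preds[name] * (w / total) for name, w in weights.items())


ensemble = weighted_mean_forecast(base_preds, weights)
print(ensemble)
```

The result keeps the single `mean` column and the same number of rows as each input, which is the shape a mean-only ensemble forecast would be expected to have.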

Steps to Reproduce

To reproduce the bug, use the minimal working example (MWE) below. It generates synthetic panel time series data, defines a custom deterministic metric called DemandScore, and trains a TimeSeriesPredictor with quantile_levels=[]. The key steps are:

  1. Defining a Custom Metric: The DemandScore class computes the sum of absolute errors plus the absolute value of the summed bias, divided by the sum of the target values. Lower is better, with an optimum of 0.
  2. Generating Synthetic Data: The MWE generates weekly sales for every combination of 2 clients, 4 warehouses, and 5 products (40 series), drawing each value from a Poisson distribution with a mean of 10.
  3. Creating TimeSeriesDataFrame: The wide table is melted to long format, a unique_id is built for each series from its client, warehouse, and product, and the result is converted into a TimeSeriesDataFrame, the data structure AutoGluon uses for time series tasks. The last 13 time steps of each series are held out from the training data.
  4. Setting Up and Training the Predictor: A TimeSeriesPredictor is created with prediction_length=13, target "y", the DemandScore metric, and quantile_levels=[]. It is then fit with the fast_training preset, num_val_windows=2, and refit_full=True.
  5. Making Predictions: After training, the code calls predict() once for each model name. The error occurs when the loop reaches WeightedEnsemble_FULL.

Running these steps reproduces the AssertionError raised when WeightedEnsemble_FULL tries to aggregate the base-model forecasts.

import pandas as pd
import numpy as np
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor
from autogluon.timeseries.metrics import TimeSeriesScorer

# Custom Error Metric
class DemandScore(TimeSeriesScorer):
    greater_is_better_internal = False
    optimum = 0.0
    def compute_metric(self, data_future, predictions, target, **kwargs):
        mae = abs(predictions["mean"] - data_future[target]).sum()
        bias = (predictions["mean"] - data_future[target]).sum()
        return (mae + abs(bias)) / data_future[target].sum()

# Generate fake panel time series data
np.random.seed(42)
clients = [0, 1]
warehouses = [1, 2, 3, 4]
products = [100, 200, 300, 400, 500]
dates = pd.date_range(start="2023-01-02", end="2024-04-01", freq="W-MON")

records = []
for client in clients:
    for warehouse in warehouses:
        for product in products:
            sales = np.random.poisson(lam=10, size=len(dates)).tolist()
            records.append([client, warehouse, product] + sales)

columns = ["Client", "Warehouse", "Product"] + [d.strftime("%Y-%m-%d") for d in dates]
df = pd.DataFrame(records, columns=columns)

# Reshape from wide (one column per date) to long (one row per series and date)
df = df.melt(id_vars=['Client', 'Warehouse', 'Product'], var_name='ds', value_name='y')
df['unique_id'] = df['Client'].astype(str) + '/' + df['Warehouse'].astype(str) + '/' + df['Product'].astype(str)

# Static features per series (built here but not passed to the predictor in this MWE)
df_static = df[['unique_id', 'Client', 'Warehouse', 'Product']].drop_duplicates()
df_static[['Client', 'Warehouse', 'Product']] = df_static[['Client', 'Warehouse', 'Product']].astype('category')
df = df.drop(['Client', 'Warehouse', 'Product'], axis=1)

df = TimeSeriesDataFrame.from_data_frame(df, id_column="unique_id", timestamp_column="ds")
# Hold out the last 13 time steps of every series for testing
train_data = df.slice_by_timestep(end_index=-13)
test_data = df

predictor = TimeSeriesPredictor(
    prediction_length=13,
    target="y",
    eval_metric=DemandScore(),
    quantile_levels=[]
).fit(
    train_data,
    num_val_windows=2,
    verbosity=2,
    presets="fast_training",
    refit_full=True
)

past_data, known_covariates = test_data.get_model_inputs_for_scoring(
    prediction_length=predictor.prediction_length,
    known_covariates_names=predictor.known_covariates_names
)

pred_per_model = {}
for model in predictor.model_names():
    pred_per_model[model] = predictor.predict(past_data, known_covariates, model=model)
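To make the DemandScore formula concrete, here is a small worked example in plain NumPy. The `demand_score` function is a hypothetical stand-in that mirrors the arithmetic in `compute_metric` above, and the input numbers are invented.

```python
import numpy as np


def demand_score(y_true, y_pred):
    """(sum of absolute errors + |sum of signed errors|) / sum of actuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    return (np.abs(errors).sum() + abs(errors.sum())) / y_true.sum()


# Errors of +1, -2, +1 cancel out, so only the absolute-error term contributes.
score_unbiased = demand_score([10, 12, 8], [11, 10, 9])
# Errors of +2, +2, +2 all point the same way, so the bias term doubles the penalty.
score_biased = demand_score([10, 12, 8], [12, 14, 10])

print(score_unbiased, score_biased)
```

The first case evaluates to 4 / 30 and the second to 12 / 30, showing that the metric penalizes systematic over- or under-forecasting on top of the absolute error.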

Error Logs

The log below shows the training output followed by the traceback. WeightedEnsemble_FULL fails inside its predict() method at an assertion that requires all base-model predictions to have the same shape. The trainer then reports the failure as a RuntimeError naming WeightedEnsemble_FULL.

Beginning AutoGluon training...
AutoGluon will save models to '/content/AutogluonModels/ag-20250707_071548'
=================== System Info ===================
AutoGluon Version:  1.3.1
Python Version:     3.11.13
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP PREEMPT_DYNAMIC Sun Mar 30 16:01:29 UTC 2025
CPU Count:          2
GPU Count:          0
Memory Avail:       8.85 GB / 12.67 GB (69.8%)
Disk Space Avail:   63.84 GB / 107.72 GB (59.3%)
===================================================
Setting presets to: fast_training

Fitting with arguments:
{'enable_ensemble': True,
 'eval_metric': DemandScore,
 'hyperparameters': 'very_light',
 'known_covariates_names': [],
 'num_val_windows': 2,
 'prediction_length': 13,
 'quantile_levels': [],
 'random_seed': 123,
 'refit_every_n_windows': 1,
 'refit_full': True,
 'skip_model_selection': False,
 'target': 'y',
 'verbosity': 2}

Inferred time series frequency: 'W-MON'
Provided train_data has 2120 rows, 40 time series. Median time series length is 53 (min=53, max=53). 

Provided data contains following columns:
	target: 'y'

AutoGluon will gauge predictive performance using evaluation metric: 'DemandScore'
	This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
===================================================

Starting training. Start time is 2025-07-07 07:15:49
Models that will be trained: ['Naive', 'SeasonalNaive', 'RecursiveTabular', 'DirectTabular', 'ETS', 'Theta']
Training timeseries model Naive. 
	-0.4036       = Validation score (-DemandScore)
	0.31    s     = Training runtime
	0.15    s     = Validation (prediction) runtime
Training timeseries model SeasonalNaive. 
	-0.4036       = Validation score (-DemandScore)
	0.15    s     = Training runtime
	0.06    s     = Validation (prediction) runtime
Training timeseries model RecursiveTabular. 
	-0.5010       = Validation score (-DemandScore)
	2.04    s     = Training runtime
	0.16    s     = Validation (prediction) runtime
Training timeseries model DirectTabular. 
	-0.2946       = Validation score (-DemandScore)
	1.58    s     = Training runtime
	0.07    s     = Validation (prediction) runtime
Training timeseries model ETS. 
	-0.2879       = Validation score (-DemandScore)
	0.18    s     = Training runtime
	0.14    s     = Validation (prediction) runtime
Training timeseries model Theta. 
	-0.2979       = Validation score (-DemandScore)
	0.14    s     = Training runtime
	0.11    s     = Validation (prediction) runtime
Fitting simple weighted ensemble.
	Ensemble weights: {'DirectTabular': 0.59, 'ETS': 0.3, 'RecursiveTabular': 0.11}
	-0.2725       = Validation score (-DemandScore)
	1.24    s     = Training runtime
	0.37    s     = Validation (prediction) runtime
Training complete. Models trained: ['Naive', 'SeasonalNaive', 'RecursiveTabular', 'DirectTabular', 'ETS', 'Theta', 'WeightedEnsemble']
Total runtime: 6.54 s
Best model: WeightedEnsemble
Best model score: -0.2725
	WARNING: refit_full functionality for TimeSeriesPredictor is experimental and is not yet supported by all models.
Refitting models via `refit_full` using all of the data (combined train and validation)...
	Models trained in this way will have the suffix '_FULL' and have NaN validation score.
	This process is not bound by time_limit, but should take less time than the original `fit` call.
Fitting model: Naive_FULL | Skipping fit via cloning parent ...
Fitting model: SeasonalNaive_FULL | Skipping fit via cloning parent ...
Fitting model: RecursiveTabular_FULL
	1.07    s     = Training runtime
Fitting model: DirectTabular_FULL
	1.03    s     = Training runtime
Fitting model: ETS_FULL | Skipping fit via cloning parent ...
Fitting model: Theta_FULL | Skipping fit via cloning parent ...
Fitting model: WeightedEnsemble_FULL | Skipping fit via cloning parent ...
Refit complete. Models trained: ['Naive_FULL', 'SeasonalNaive_FULL', 'RecursiveTabular_FULL', 'DirectTabular_FULL', 'ETS_FULL', 'Theta_FULL', 'WeightedEnsemble_FULL']
Total runtime: 2.15 s
Updated best model to 'WeightedEnsemble_FULL' (Previously 'WeightedEnsemble'). AutoGluon will default to using 'WeightedEnsemble_FULL' for predict().
Model WeightedEnsemble_FULL failed to predict with the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/autogluon/timeseries/trainer.py", line 1081, in get_model_pred_dict
    model_pred_dict[model_name] = self._predict_model(
                                  ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/autogluon/timeseries/trainer.py", line 1010, in _predict_model
    return model.predict(model_inputs, known_covariates=known_covariates)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/autogluon/timeseries/models/ensemble/abstract.py", line 91, in predict
    assert len(set(pred.shape for pred in data.values())) == 1
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipython-input-63-216866318.py in <cell line: 0>()
     49 pred_per_model = {}
     50 for model in predictor.model_names():
---> 51     pred_per_model[model] = predictor.predict(past_data, known_covariates, model=model)

3 frames
/usr/local/lib/python3.11/dist-packages/autogluon/timeseries/trainer.py in get_model_pred_dict(self, model_names, data, known_covariates, raise_exception_if_failed, use_cache, random_seed)
   1094 
   1095         if len(failed_models) > 0 and raise_exception_if_failed:
-> 1096             raise RuntimeError(f"Following models failed to predict: {failed_models}")
   1097         if self.cache_predictions and use_cache:
   1098             self._save_cached_pred_dicts(

RuntimeError: Following models failed to predict: ['WeightedEnsemble_FULL']
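One way to narrow down the mismatch is to compare the shapes and column names of the forecasts that did succeed. The sketch below defines a hypothetical helper, `group_models_by_shape`, and runs it on toy DataFrames with invented shapes; in practice it could be applied to the `pred_per_model` dictionary from the MWE, assuming it was at least partly filled before the exception was raised.

```python
from collections import defaultdict

import pandas as pd


def group_models_by_shape(pred_per_model):
    """Group model names by the (shape, column names) of their forecasts."""
    groups = defaultdict(list)
    for name, pred in pred_per_model.items():
        key = (pred.shape, tuple(str(c) for c in pred.columns))
        groups[key].append(name)
    return dict(groups)


# Toy stand-ins for real forecasts; the shapes are invented for illustration.
toy_preds = {
    "ModelA_FULL": pd.DataFrame({"mean": [1.0, 2.0]}),
    "ModelB_FULL": pd.DataFrame({"mean": [1.5, 2.5]}),
    "ModelC_FULL": pd.DataFrame({"mean": [1.0, 2.0], "0.5": [1.0, 2.0]}),
}
groups = group_models_by_shape(toy_preds)
for (shape, columns), names in groups.items():
    print(shape, columns, names)
```

If more than one group appears for the ensemble's base models, the models outside the majority group are the first candidates to inspect.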

Installed Versions

The bug was reproduced in the following environment:

INSTALLED VERSIONS
------------------
date                   : 2025-07-07
time                   : 07:11:10.274920
python                 : 3.11.13.final.0
OS                     : Linux
OS-release             : 6.1.123+
Version                : #1 SMP PREEMPT_DYNAMIC Sun Mar 30 16:01:29 UTC 2025
machine                : x86_64
processor              : x86_64
num_cores              : 2
cpu_ram_mb             : 12977.953125
cuda version           : None
num_gpus               : 0
gpu_ram_mb             : []
avail_disk_size_mb     : 65380

accelerate             : 1.8.1
autogluon              : 1.3.1
autogluon.common       : 1.3.1
autogluon.core         : 1.3.1
autogluon.features     : 1.3.1
autogluon.multimodal   : 1.3.1
autogluon.tabular      : 1.3.1
autogluon.timeseries   : 1.3.1
boto3                  : 1.39.3
catboost               : 1.2.8
coreforecast           : 0.0.15
defusedxml             : 0.7.1
einops                 : 0.8.1
evaluate               : 0.4.4
fastai                 : 2.7.19
fugue                  : 0.9.1
gluonts                : 0.16.2
huggingface-hub        : 0.33.1
hyperopt               : 0.2.7
imodels                : None
jinja2                 : 3.1.6
joblib                 : 1.5.1
jsonschema             : 4.23.0
lightgbm               : 4.5.0
lightning              : 2.5.2
matplotlib             : 3.10.0
mlforecast             : 0.13.6
networkx               : 3.5
nlpaug                 : 1.1.11
nltk                   : 3.8.1
numpy                  : 2.0.2
nvidia-ml-py3          : 7.352.0
omegaconf              : 2.3.0
onnx                   : None
onnxruntime            : None
onnxruntime-gpu        : None
openmim                : 0.3.9
optimum                : None
optimum-intel          : None
orjson                 : 3.10.18
pandas                 : 2.2.2
pdf2image              : 1.17.0
Pillow                 : 11.2.1
psutil                 : 5.9.5
pyarrow                : 18.1.0
pytesseract            : 0.3.13
pytorch-lightning      : 2.5.2
pytorch-metric-learning: 2.8.1
ray                    : 2.44.1
requests               : 2.32.3
scikit-image           : 0.25.2
scikit-learn           : 1.6.1
scikit-learn-intelex   : None
scipy                  : 1.15.3
seqeval                : 1.2.2
skl2onnx               : None
spacy                  : 3.8.7
statsforecast          : 2.0.1
tabpfn                 : None
tensorboard            : 2.18.0
text-unidecode         : 1.3
timm                   : 1.0.3
torch                  : 2.6.0+cu124
torchmetrics           : 1.7.4
torchvision            : 0.21.0+cu124
tqdm                   : 4.67.1
transformers           : 4.49.0
utilsforecast          : 0.2.10
xgboost                : 2.1.4

The key versions to note are:

  • AutoGluon: 1.3.1
  • Python: 3.11.13
  • pandas: 2.2.2
  • numpy: 2.0.2
  • torch: 2.6.0+cu124

These versions describe the environment in which the bug was observed and should help with replication and debugging.
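When comparing against another environment, the key versions can be collected with the standard library alone. The sketch below is a generic approach using `importlib.metadata`, not the tool that produced the listing above; `collect_versions` is a hypothetical helper, and packages that are not installed are reported as None.

```python
from importlib import metadata


def collect_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = collect_versions(["autogluon.timeseries", "pandas", "numpy", "torch"])
for name, version in versions.items():
    print(f"{name:<22}: {version}")
```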

Conclusion

The failure of WeightedEnsemble_FULL during prediction with a custom metric and mean-only forecasts matters because, after refit_full, AutoGluon selects WeightedEnsemble_FULL as the default model for predict(). This article has described the behavior, provided a reproducible example, and included the error logs and version information needed to investigate it. A fix should let the ensemble aggregate mean-only forecasts in the same way it handles forecasts with quantiles, which would benefit users who rely on ensembles but only need point forecasts.