Models#
Model parameters very much depend on the dataset for which they are destined.
PyTorch Forecasting provides a .from_dataset() method for each model that takes a TimeSeriesDataSet and additional parameters that cannot be directly derived from the dataset, such as learning_rate or hidden_size.
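The split between dataset-derived and user-supplied parameters can be illustrated with a toy sketch. It does not use PyTorch Forecasting; ToyDataset and ToyModel are hypothetical stand-ins that only mimic the from_dataset() pattern described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a TimeSeriesDataSet: it knows its own shape."""
    feature_names: List[str]
    max_encoder_length: int
    max_prediction_length: int


@dataclass
class ToyModel:
    """Hypothetical model whose size partly depends on the dataset."""
    n_features: int
    encoder_length: int
    prediction_length: int
    hidden_size: int = 16        # cannot be derived from the data
    learning_rate: float = 1e-3  # cannot be derived from the data

    @classmethod
    def from_dataset(cls, dataset: ToyDataset, **kwargs) -> "ToyModel":
        # Parameters that follow from the dataset are filled in here ...
        params = dict(
            n_features=len(dataset.feature_names),
            encoder_length=dataset.max_encoder_length,
            prediction_length=dataset.max_prediction_length,
        )
        # ... while everything else is supplied (or overridden) by the caller.
        params.update(kwargs)
        return cls(**params)


dataset = ToyDataset(["price", "weekday"], max_encoder_length=24, max_prediction_length=6)
model = ToyModel.from_dataset(dataset, hidden_size=32, learning_rate=0.03)
print(model)
```

Only hidden_size and learning_rate are passed by hand; the remaining fields follow from the dataset, which is the point of the pattern.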
To tune models, optuna can be used. For example, tuning of the TemporalFusionTransformer is implemented by optimize_hyperparameters().
Selecting an architecture#
Criteria for selecting an architecture depend heavily on the use case. There are multiple selection criteria that you should take into account. Here is an overview of the pros and cons of the implemented models:
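| Name | Covariates | Multiple targets | Regression | Classification | Probabilistic | Uncertainty | Interactions between series | Flexible history length | Cold-start | Required computational resources (1-5, 5=most) |
|---|---|---|---|---|---|---|---|---|---|---|
| RecurrentNetwork | x | x | x | | | | | x | | 2 |
| DecoderMLP | x | x | x | x | | x | | x | x | 1 |
| NBeats | | | x | | | | | | | 1 |
| NHiTS | x | x | x | | | | | | | 1 |
| DeepAR | x | x | x | | x | x | x [1] | x | | 3 |
| TemporalFusionTransformer | x | x | x | x | | x | | x | x | 4 |

[1] Accounting for correlations using a multivariate loss function which converts the network into a DeepVAR model.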
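The overview can also be treated as data when shortlisting models. The following sketch assumes the capability sets and cost scores below, which are a hand transcription of the comparison and should be checked against each model's documentation. CAPABILITIES, COST and candidates are hypothetical helpers, not part of PyTorch Forecasting.

```python
from typing import Dict, List, Set

# Assumed transcription of the model comparison (verify before relying on it).
CAPABILITIES: Dict[str, Set[str]] = {
    "RecurrentNetwork": {"covariates", "multiple_targets", "regression",
                         "flexible_history"},
    "DecoderMLP": {"covariates", "multiple_targets", "regression", "classification",
                   "uncertainty", "flexible_history", "cold_start"},
    "NBeats": {"regression"},
    "NHiTS": {"covariates", "multiple_targets", "regression"},
    "DeepAR": {"covariates", "multiple_targets", "regression", "probabilistic",
               "uncertainty", "series_interactions", "flexible_history"},
    "TemporalFusionTransformer": {"covariates", "multiple_targets", "regression",
                                  "classification", "uncertainty",
                                  "flexible_history", "cold_start"},
}

# Required computational resources on the 1-5 scale (5 = most).
COST: Dict[str, int] = {
    "RecurrentNetwork": 2,
    "DecoderMLP": 1,
    "NBeats": 1,
    "NHiTS": 1,
    "DeepAR": 3,
    "TemporalFusionTransformer": 4,
}


def candidates(required: Set[str]) -> List[str]:
    """Return the models that offer every required capability, cheapest first."""
    matching = [name for name, caps in CAPABILITIES.items() if required <= caps]
    return sorted(matching, key=lambda name: COST[name])


print(candidates({"covariates", "uncertainty"}))
print(candidates({"probabilistic"}))
```

A filter like this only narrows the field; the criteria discussed in the following sections still decide which of the remaining models fits the data.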
Size and type of available data#
One should particularly consider five criteria.
Availability of covariates#
If you have covariates, that is, variables in addition to the target variable itself that hold information
about the target, then your case will benefit from a model that can accommodate covariates. A model that
cannot use covariates is NBeats.
Length of timeseries#
The length of the time series has a significant impact on which model will work well. Unfortunately,
most models are created and tested on very long timeseries, while in practice short timeseries or a mix of short and long
timeseries are often encountered. A model that can leverage covariates well, such as the
TemporalFusionTransformer, will typically perform better than other models on short timeseries. It is a significant step
from short timeseries to making cold-start predictions solely based on static covariates, i.e.
making predictions without observed history. For example, this is only supported by the
TemporalFusionTransformer but does not work tremendously well.
Number of timeseries and their relation to each other#
If your time series are related to each other (e.g. all sales of products of the same company),
a model that can learn relations between the timeseries can improve accuracy.
Note that only models that can process covariates can
learn relationships between different timeseries.
If the timeseries denote different entities or exhibit very similar patterns across the board,
a model such as NBeats will work as well.
If you have only one or very few timeseries, they should be very long in order for a deep learning approach to work well. Consider also more traditional approaches.
Type of prediction task#
Not every model can do regression, classification or handle multiple targets. Some are exclusively
geared towards a single task. For example, NBeats can only be used for regression on a single target without covariates, while the
TemporalFusionTransformer supports
multiple targets and even heterogeneous targets where some are continuous variables and others categorical,
i.e. regression and classification at the same time. DeepAR
can handle multiple targets but only works for regression tasks.
For forecasts with a long horizon, NHiTS is an excellent choice
as it uses interpolation capabilities.
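The benefit of interpolation for long horizons can be seen in a small sketch: a network only has to predict a few knots, and the full forecast is filled in between them. This illustrates the idea with plain linear interpolation and is not the NHiTS implementation; interpolate_forecast is a hypothetical helper.

```python
from typing import List


def interpolate_forecast(knots: List[float], horizon: int) -> List[float]:
    """Linearly interpolate a few predicted knots to a forecast of length `horizon`."""
    if len(knots) < 2 or horizon < 2:
        raise ValueError("need at least two knots and a horizon of at least two steps")
    forecast = []
    for step in range(horizon):
        # Position of this step on the knot axis, in [0, len(knots) - 1].
        position = step * (len(knots) - 1) / (horizon - 1)
        # Index of the knot to the left, capped so that left + 1 stays valid.
        left = min(int(position), len(knots) - 2)
        weight = position - left
        forecast.append((1 - weight) * knots[left] + weight * knots[left + 1])
    return forecast


coarse = [10.0, 20.0, 15.0]  # three predicted values ...
print(interpolate_forecast(coarse, horizon=5))  # ... expanded to five steps
```

The number of values the network has to output stays fixed at the number of knots, however long the horizon becomes.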
Supporting uncertainty#
Not all models support uncertainty estimation. Those that do might do so in different fashions. Nonparametric models provide forecasts that are not bound to a given distribution, while parametric models assume that the data follows a specific distribution.
The parametric models will be a better choice if you know how your data (and potentially error) is distributed. However, if you are missing this information or cannot make an educated guess that matches reality rather well, the model’s uncertainty estimates will be adversely impacted. In this case, a nonparametric model will do much better.
DeepAR is an example of a parametric model, while
the TemporalFusionTransformer
can output quantile forecasts that can fit any distribution.
Models based on normalizing flows marry the two worlds by providing a nonparametric estimate
of a full probability distribution. PyTorch Forecasting currently does not provide
support for these, but
Pyro, a package for probabilistic programming, does
if you believe that your problem is uniquely suited to this solution.
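The difference between the two routes can be made concrete with a short sketch. The parametric route reads a quantile off an assumed distribution, whereas the nonparametric route scores a candidate quantile directly against observations with the quantile (pinball) loss. The numbers are made up for illustration and quantile_loss is a hypothetical helper, not the library's metric class.

```python
from statistics import NormalDist
from typing import Sequence


def quantile_loss(y_true: Sequence[float], y_pred: Sequence[float], q: float) -> float:
    """Mean pinball loss for quantile level q (0 < q < 1)."""
    total = 0.0
    for target, prediction in zip(y_true, y_pred):
        error = target - prediction
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += max(q * error, (q - 1) * error)
    return total / len(y_true)


# Parametric route: assume a normal distribution and read the 90% quantile off it.
parametric_q90 = NormalDist(mu=100.0, sigma=10.0).inv_cdf(0.9)

# Nonparametric route: compare two candidate 90% quantile forecasts on skewed data.
observed = [90.0, 100.0, 110.0, 130.0]
loss_low = quantile_loss(observed, [95.0] * len(observed), q=0.9)
loss_high = quantile_loss(observed, [125.0] * len(observed), q=0.9)

print(round(parametric_q90, 2), loss_low, loss_high)
```

For q = 0.9 the loss prefers the higher candidate, because under-predicting is penalised nine times as heavily as over-predicting. No distributional assumption enters that comparison.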
Computational requirements#
Some models have simpler architectures and fewer parameters than others, which can lead to significantly different training times. However, this is not a general rule, as demonstrated by Zhuohan Li et al. in Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers. Because the data for a sample for timeseries models is often far smaller than it is for computer vision or language tasks, GPUs are often underused, and increasing the width of models can be an effective way to fully use a GPU. This can increase the speed of training while also improving accuracy. The other path to pushing up the utilization of a GPU is increasing the batch size. However, increasing the batch size can adversely affect the generalization abilities of a trained network. Also, take into account that computational resources are often mainly necessary for inference/prediction. The upfront task of training a model will require developer time (also expensive!) but might be only a small part of the total computational costs over the lifetime of a model.
The TemporalFusionTransformer is a rather large model but might benefit from being trained with.
For example, NBeats or NHiTS are efficient models.
Autoregressive models such as DeepAR might be quick to train
but might be slow at inference time (in the case of DeepAR, this is
driven by sampling results probabilistically multiple times, effectively increasing the computational burden linearly with the
number of samples).
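Why sampling makes autoregressive inference expensive can be shown with a toy sketch that counts model calls. It is not DeepAR; toy_step is a hypothetical one-step model (a noisy random walk), and real implementations may batch samples, which changes the wall-clock time but not the amount of work.

```python
import random
from typing import Callable, List


def sample_paths(step_fn: Callable[[float], float], last_value: float,
                 horizon: int, n_samples: int) -> List[List[float]]:
    """Draw n_samples trajectories by feeding each sampled value back into the model."""
    paths = []
    for _ in range(n_samples):
        value, path = last_value, []
        for _ in range(horizon):
            value = step_fn(value)  # one model call per step and per sample
            path.append(value)
        paths.append(path)
    return paths


calls = 0


def toy_step(previous: float) -> float:
    """Hypothetical one-step model: a noisy random walk that counts its calls."""
    global calls
    calls += 1
    return previous + random.gauss(0.0, 1.0)


random.seed(0)
paths = sample_paths(toy_step, last_value=10.0, horizon=6, n_samples=100)
print(calls)  # horizon * n_samples model calls
```

Doubling n_samples doubles the number of calls, which is the linear growth described above.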
Implementing new architectures#
Please see the Using custom data and implementing custom models tutorial on how to implement basic and more advanced models.
Every model should inherit from a base model in base_model.
class pytorch_forecasting.models.base_model.BaseModel(log_interval: Union[int, float] = -1, log_val_interval: Optional[Union[float, int]] = None, learning_rate: Union[float, List[float]] = 0.001, log_gradient_flow: bool = False, loss: Metric = SMAPE(), logging_metrics: ModuleList = ModuleList(), reduce_on_plateau_patience: int = 1000, reduce_on_plateau_reduction: float = 2.0, reduce_on_plateau_min_lr: float = 1e-05, weight_decay: float = 0.0, optimizer_params: Optional[Dict[str, Any]] = None, monotone_constaints: Dict[str, int] = {}, output_transformer: Optional[Callable] = None, optimizer='ranger')
BaseModel from which new timeseries models should inherit. The hparams of the created object will default to the parameters indicated in __init__().

The forward() method should return a named tuple with at least the entry prediction that contains the network’s output. See the function’s documentation for more details.

The idea of the base model is that common methods do not have to be reimplemented for every new architecture. The class is a [LightningModule](https://pytorch-lightning.readthedocs.io/en/latest/lightning_module.html) and follows its conventions. However, there are important additions:

- You need to specify a loss attribute that stores the function to calculate the MultiHorizonLoss for backpropagation.
- The from_dataset() method can be used to initialize a network using the specifications of a dataset. Often, parameters such as the number of features can be easily deduced from the dataset. Further, the method will also store how to rescale normalized predictions into the unnormalized prediction space. Override it to pass additional arguments to the __init__ method of your network that depend on your dataset.
- The transform_output() method rescales the network output using the target normalizer from the dataset.
- The step() method takes care of calculating the loss, logging additional metrics defined in the logging_metrics attribute and plots of sample predictions. You can override this method to add custom interpretations or pass extra arguments to the network’s forward method.
- The epoch_end() method can be used to calculate summaries of each epoch such as statistics on the encoder length, etc.
- The predict() method makes predictions using a dataloader or dataset. Override it if you need to pass additional arguments to forward by default.
To implement your own architecture, it is best to go through the Using custom data and implementing custom models tutorial and to look at existing ones to understand what might be a good approach.
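Example

    from pytorch_forecasting.metrics import SMAPE
    from pytorch_forecasting.models import BaseModel


    class Network(BaseModel):

        def __init__(self, my_first_parameter: int = 2, loss=SMAPE()):
            # store the arguments of __init__ in self.hparams
            self.save_hyperparameters()
            super().__init__(loss=loss)

        def forward(self, x):
            # self.module stands for your own network; it is not defined in this example
            normalized_prediction = self.module(x)
            # rescale the normalized output back to the scale of the target
            prediction = self.transform_output(prediction=normalized_prediction, target_scale=x["target_scale"])
            return self.to_network_output(prediction=prediction)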
BaseModel for timeseries forecasting from which to inherit.
Parameters

- log_interval (Union[int, float], optional) – Batches after which predictions are logged. If < 1.0, will log multiple entries per batch. Defaults to -1.
- log_val_interval (Union[int, float], optional) – Batches after which predictions for validation are logged. Defaults to None/log_interval.
- learning_rate (float, optional) – Learning rate. Defaults to 1e-3.
- log_gradient_flow (bool) – Whether to log gradient flow. This takes time and should only be done to diagnose training failures. Defaults to False.
- loss (Metric, optional) – Metric to optimize, can also be a list of metrics. Defaults to SMAPE().
- logging_metrics (nn.ModuleList[MultiHorizonMetric]) – List of metrics that are logged during training. Defaults to [].
- reduce_on_plateau_patience (int) – Patience after which learning rate is reduced by a factor of 10. Defaults to 1000.
- reduce_on_plateau_reduction (float) – Reduction in learning rate when encountering plateau. Defaults to 2.0.
- reduce_on_plateau_min_lr (float) – Minimum learning rate for reduce on plateau learning rate scheduler. Defaults to 1e-5.
- weight_decay (float) – Weight decay. Defaults to 0.0.
- optimizer_params (Dict[str, Any]) – Additional parameters for the optimizer. Defaults to {}.
- monotone_constaints (Dict[str, int]) – Dictionary of monotonicity constraints for continuous decoder variables mapping position (e.g. "0" for first position) to constraint (-1 for negative and +1 for positive, larger numbers add more weight to the constraint vs. the loss but are usually not necessary). This constraint significantly slows down training. Defaults to {}.
- output_transformer (Callable) – Transformer that takes network output and transforms it to prediction space. Defaults to None which is equivalent to lambda out: out["prediction"].
- optimizer (str) – Optimizer, “ranger”, “sgd”, “adam”, “adamw” or class name of optimizer in torch.optim. Alternatively, a class or function can be passed which takes parameters as first argument and a lr argument (optionally also weight_decay). Defaults to “ranger”.
Details and available models#
See the API documentation for further details on available models:

Model with additional methods for autoregressive models. 

Model with additional methods for autoregressive models with covariates. 
BaseModel from which new timeseries models should inherit from. 


Model with additional methods using covariates. 
Baseline model that uses last known target value to make prediction. 

DeepAR Network. 

MLP on the decoder. 

Initialize NBeats Model  use its 

Initialize NHiTS Model  use its 


Embedding layer for categorical variables including groups of categorical variables. 

GRU that can handle zero-length sequences 
LSTM that can handle zero-length sequences 

Get LSTM or GRU. 

Recurrent Network. 


Temporal Fusion Transformer for forecasting timeseries  use its 