Data#
Loading data for timeseries forecasting is not trivial, in particular if covariates are included and values are missing. PyTorch Forecasting provides the TimeSeriesDataSet, which comes with a to_dataloader() method to convert it to a dataloader and a from_dataset() method to create, e.g., a validation or test dataset from a training dataset using the same label encoders and data normalization. Further, timeseries almost always have to be normalized for a neural network to learn efficiently. PyTorch Forecasting provides multiple such target normalizers (some of which can also be used for normalizing covariates).
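For orientation, here is a minimal sketch of that workflow. It is illustrative only: the column names (time_idx, series, value) and all lengths are made up, not part of the API.

```python
import pandas as pd

from pytorch_forecasting import TimeSeriesDataSet

# toy data: two series ("a", "b") with 10 timesteps each (illustrative only)
data = pd.DataFrame(
    {
        "time_idx": list(range(10)) * 2,         # integer time index
        "series": ["a"] * 10 + ["b"] * 10,       # one id per timeseries
        "value": [float(i) for i in range(20)],  # forecasting target
    }
)

# reserve the last 2 steps of each series for validation
training_cutoff = data["time_idx"].max() - 2

training = TimeSeriesDataSet(
    data[data["time_idx"] <= training_cutoff],
    time_idx="time_idx",
    target="value",
    group_ids=["series"],
    max_encoder_length=4,
    max_prediction_length=2,
    time_varying_unknown_reals=["value"],  # real-valued targets go here
)

# validation set reuses the label encoders and normalizers fitted on training
validation = TimeSeriesDataSet.from_dataset(
    training, data, min_prediction_idx=training_cutoff + 1
)

train_dataloader = training.to_dataloader(train=True, batch_size=4)
val_dataloader = validation.to_dataloader(train=False, batch_size=4)
```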
Time series data set#
The time series dataset is the central data-holding object in PyTorch Forecasting. It primarily takes a pandas DataFrame along with some metadata. See the tutorial on passing data to models to learn more about how the dataset is coupled to models.
- class pytorch_forecasting.data.timeseries.TimeSeriesDataSet(data: DataFrame, time_idx: str, target: str | List[str], group_ids: List[str], weight: str | None = None, max_encoder_length: int = 30, min_encoder_length: int = None, min_prediction_idx: int = None, min_prediction_length: int = None, max_prediction_length: int = 1, static_categoricals: List[str] | None = None, static_reals: List[str] | None = None, time_varying_known_categoricals: List[str] | None = None, time_varying_known_reals: List[str] | None = None, time_varying_unknown_categoricals: List[str] | None = None, time_varying_unknown_reals: List[str] | None = None, variable_groups: Dict[str, List[int]] | None = None, constant_fill_strategy: Dict[str, str | float | int | bool] | None = None, allow_missing_timesteps: bool = False, lags: Dict[str, List[int]] | None = None, add_relative_time_idx: bool = False, add_target_scales: bool = False, add_encoder_length: bool | str = 'auto', target_normalizer: TorchNormalizer | NaNLabelEncoder | EncoderNormalizer | str | List[TorchNormalizer | NaNLabelEncoder | EncoderNormalizer] | Tuple[TorchNormalizer | NaNLabelEncoder | EncoderNormalizer] | None = 'auto', categorical_encoders: Dict[str, NaNLabelEncoder] | None = None, scalers: Dict[str, StandardScaler | RobustScaler | TorchNormalizer | EncoderNormalizer] | None = None, randomize_length: None | Tuple[float, float] | bool = False, predict_mode: bool = False)
PyTorch Dataset for fitting timeseries models.
The dataset automates common tasks such as

- scaling and encoding of variables
- normalizing the target variable
- efficiently converting timeseries in pandas dataframes to torch tensors
- holding information about static and time-varying variables known and unknown in the future
- holding information about related categories (such as holidays)
- downsampling for data augmentation
- generating inference, validation and test datasets
The tutorial on passing data to models is helpful to understand the output of the dataset and how it is coupled to models.
Each sample is a subsequence of a full time series. The subsequence consists of encoder and decoder/prediction timepoints for a given time series. This class constructs an index which defines which subsequences exist and can be sampled from (index attribute). The samples in the index are defined by the various parameters to the class (encoder and prediction lengths, minimum prediction length, randomize length and predict keywords). How samples are sampled into batches for training is determined by the DataLoader. The class provides the to_dataloader() method to convert the dataset into a dataloader.

Large datasets: Currently the class is limited to in-memory operations (that can be sped up by an existing installation of numba). If you have extremely large data, however, you can pass prefitted encoders and scalers to it and a subset of sequences to the class to construct a valid dataset (plus, likely the EncoderNormalizer should be used to normalize targets). When fitting a network, you would then need to create a custom DataLoader that rotates through the datasets. There are currently no in-built methods to do this.
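A hedged sketch of that chunked approach, where subset_df and chunk_df are hypothetical DataFrames and the column names are illustrative:

```python
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import EncoderNormalizer

# fit encoders and scalers once on a representative subset of the data
template = TimeSeriesDataSet(
    subset_df,
    time_idx="time_idx",
    target="value",
    group_ids=["series"],
    max_encoder_length=30,
    max_prediction_length=7,
    time_varying_unknown_reals=["value"],
    # EncoderNormalizer normalizes per encoder sequence, so no global
    # statistics over the full data are required
    target_normalizer=EncoderNormalizer(),
)

# further chunks reuse the prefitted encoders and scalers - not fit again
chunk_dataset = TimeSeriesDataSet.from_dataset(template, chunk_df)
```

A custom DataLoader would then rotate through such chunk datasets during training.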
- Parameters:
- data (pd.DataFrame) – dataframe with sequence data; each row can be identified with time_idx and the group_ids.
- time_idx (str) – integer-typed column denoting the time index within data. This column is used to determine the sequence of samples. If there are no missing observations, the time index should increase by +1 for each subsequent sample. The first time_idx for each series does not necessarily have to be 0, but any value is allowed.
- target (Union[str, List[str]]) – column(s) in data denoting the forecasting target. Can be categorical or continuous dtype.
- group_ids (List[str]) – list of column names identifying a time series instance within data. This means that the group_ids identify a sample together with the time_idx. If you have only one timeseries, set this to the name of a column that is constant.
- weight (str, optional, default=None) – column name for weights. Defaults to None.
- max_encoder_length (int, optional, default=30) – maximum length to encode. This is the maximum history length used by the time series dataset.
- min_encoder_length (int, optional, default=max_encoder_length) – minimum allowed length to encode. Defaults to max_encoder_length.
- min_prediction_idx (int, optional, default=first time_idx in data) – minimum time_idx from where to start predictions. This parameter can be useful to create a validation or test set.
- max_prediction_length (int, optional, default=1) – maximum prediction/decoder length (do not choose it too short, as a longer prediction length can help convergence).
- min_prediction_length (int, optional, default=max_prediction_length) – minimum prediction/decoder length.
- static_categoricals (list of str, optional, default=None) – list of categorical variables in data that do not change over time; entries can also be lists which are then encoded together (e.g. useful for product categories).
- static_reals (list of str, optional, default=None) – list of continuous variables that do not change over time.
- time_varying_known_categoricals (list of str, optional, default=None) – list of categorical variables that change over time and are known in the future; entries can also be lists which are then encoded together (e.g. useful for special days or promotion categories).
- time_varying_known_reals (list of str, optional, default=None) – list of continuous variables that change over time and are known in the future (e.g. price of a product, but not demand of a product).
- time_varying_unknown_categoricals (list of str, optional, default=None) – list of categorical variables that are not known in the future and change over time; entries can also be lists which are then encoded together (e.g. useful for weather categories). Target variables should be included here, if categorical.
- time_varying_unknown_reals (list of str, optional, default=None) – list of continuous variables that are not known in the future and change over time. Target variables should be included here, if real.
- variable_groups (Dict[str, List[str]], optional, default=None) – dictionary mapping a name to a list of columns in the data. The name should be present in a categorical or real class argument to be able to encode or scale the columns by group. This will effectively combine categorical variables and is particularly useful if a categorical variable can have multiple values at the same time. An example are holidays, which can be overlapping.
- constant_fill_strategy (dict, optional, default=None) – Keys must be str; values can be str, float, int or bool. Dictionary of column names with constants to fill in missing values if there are gaps in the sequence (by default a forward fill strategy is used). The values will only be used if allow_missing_timesteps=True. A common use case is to denote that demand was 0 if the sample is not in the dataset.
- allow_missing_timesteps (bool, optional, default=False) – whether to allow missing timesteps that are automatically filled up. Missing values refer to gaps in the time_idx, e.g. if a specific timeseries has only samples for 1, 2, 4, 5, the sample for 3 will be generated on-the-fly. Allowing missings does not deal with NA values. You should fill NA values before passing the dataframe to the TimeSeriesDataSet.
- lags (Dict[str, List[int]], optional, default=None) – dictionary of variable names mapped to a list of time steps by which the variable should be lagged. Lags can be useful to indicate seasonality to the models. If the seasonality (or seasonalities) of the data are known, it is recommended to add the target variables with the corresponding lags to improve performance. Lags must not be larger than the shortest time series, as all time series will be cut by the largest lag value to prevent NA values. A lagged variable has to appear in the time-varying variables. If you only want the lagged but not the current value, lag it manually in your input data using data[lagged_varname] = data.sort_values(time_idx).groupby(group_ids, observed=True).shift(lag).
- add_relative_time_idx (bool, optional, default=False) – whether to add a relative time index as a feature, i.e., for each sampled sequence, the index will range from -encoder_length to prediction_length.
- add_target_scales (bool, optional, default=False) – whether to add scales for the target to static real features, i.e., add the center and scale of the unnormalized timeseries as features.
- add_encoder_length (Union[bool, str], optional, default="auto") – whether to add the encoder length to the list of static real variables. Defaults to "auto", which is the same as True if min_encoder_length != max_encoder_length.
- target_normalizer (torch transformer, str, list, tuple, optional, default="auto") – transformer that takes group_ids, target and time_idx to normalize targets. You can choose from TorchNormalizer, GroupNormalizer, NaNLabelEncoder, EncoderNormalizer (on which overfitting tests will fail) or None for using no normalizer. For multiple targets, use a MultiNormalizer. By default, an appropriate normalizer is chosen automatically.
- categorical_encoders (dict[str, BaseEstimator]) – dictionary of scikit-learn label transformers. If you have unobserved categories in the future / a cold-start problem, you can use the NaNLabelEncoder with add_nan=True. Defaults effectively to sklearn's LabelEncoder(). Prefitted encoders will not be fit again.
- scalers (dict with str keys and torch or sklearn scalers as values, optional) – dictionary of scikit-learn or torch scalers. Defaults to sklearn's StandardScaler(). Other options are EncoderNormalizer, GroupNormalizer, scikit-learn's StandardScaler() or RobustScaler(), or None for using no normalizer / a normalizer with center=0 and scale=1 (method="identity"). Prefitted scalers will not be fit again (with the exception of the EncoderNormalizer, which is fit on every encoder sequence).
- randomize_length (None, bool, or tuple of float, optional) – None or False if lengths should not be randomized. Tuple of beta distribution concentrations from which probabilities are sampled that are used to sample new sequence lengths with a binomial distribution. If True, defaults to (0.2, 0.05), i.e. ~1/4 of samples around minimum encoder length. Defaults to False otherwise.
- predict_mode (bool) – if True, the TimeSeriesDataSet will only create one sequence per time series (i.e. only from the latest provided samples). Effectively, this will select for each time series identified by group_ids the last max_prediction_length samples as prediction samples and everything previous, up to max_encoder_length samples, as encoder samples. If False, the TimeSeriesDataSet will create subsequences by sliding a window over the data samples. For training use cases, it is preferable to set predict_mode=False to get all subseries; predict_mode=True is ideal for validation.
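To make the parameter groups concrete, the sketch below wires several of them together. It is a hedged example: sales_df and all column names (agency, sku, price, volume) are invented for illustration and assume roughly two years of monthly data per series.

```python
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer

dataset = TimeSeriesDataSet(
    sales_df,  # assumed DataFrame with the columns used below
    time_idx="time_idx",
    target="volume",
    group_ids=["agency", "sku"],
    max_encoder_length=24,
    max_prediction_length=6,
    static_categoricals=["agency", "sku"],           # constant per series
    time_varying_known_reals=["time_idx", "price"],  # known in the future
    time_varying_unknown_reals=["volume"],           # real target goes here
    lags={"volume": [12]},         # hint a yearly seasonality to the model;
                                   # must not exceed the shortest series
    allow_missing_timesteps=True,  # fill gaps in time_idx on-the-fly
    add_relative_time_idx=True,    # -encoder_length ... prediction_length
    add_target_scales=True,        # center/scale as static real features
    target_normalizer=GroupNormalizer(groups=["agency", "sku"]),
)

# one sequence per series from the latest samples, as with predict_mode=True
predict_dataset = TimeSeriesDataSet.from_dataset(dataset, sales_df, predict=True)
```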
Details#
See the API documentation for further details on available data encoders and the TimeSeriesDataSet:
- EncoderNormalizer – Special Normalizer that is fit on each encoding sequence.
- GroupNormalizer – Normalizer that scales by groups.
- MultiNormalizer – Normalizer for multiple targets.
- NaNLabelEncoder – Labelencoder that can optionally always encode nan and unknown classes (in transform) as class 0.
- TorchNormalizer – Basic target transformer that can be fit also on torch tensors.
- TimeSynchronizedBatchSampler – Samples mini-batches randomly but in a time-synchronised manner.
- TimeSeriesDataSet – PyTorch Dataset for fitting timeseries models.
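As an example of combining these encoders with the dataset, the categorical_encoders parameter above mentions the cold-start problem; a hedged sketch (train_df and the column product_type are invented names) could look like:

```python
from pytorch_forecasting import TimeSeriesDataSet
from pytorch_forecasting.data import NaNLabelEncoder

dataset = TimeSeriesDataSet(
    train_df,  # assumed training DataFrame
    time_idx="time_idx",
    target="value",
    group_ids=["series"],
    max_encoder_length=30,
    max_prediction_length=7,
    static_categoricals=["product_type"],
    time_varying_unknown_reals=["value"],
    # add_nan=True reserves a class for nan/unknown categories, so that
    # transform() does not fail on categories never seen during fitting
    categorical_encoders={"product_type": NaNLabelEncoder(add_nan=True)},
)
```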