TimeSeriesDataSet#

class pytorch_forecasting.data.timeseries.TimeSeriesDataSet(data: DataFrame, time_idx: str, target: str | List[str], group_ids: List[str], weight: str | None = None, max_encoder_length: int = 30, min_encoder_length: int = None, min_prediction_idx: int = None, min_prediction_length: int = None, max_prediction_length: int = 1, static_categoricals: List[str] | None = None, static_reals: List[str] | None = None, time_varying_known_categoricals: List[str] | None = None, time_varying_known_reals: List[str] | None = None, time_varying_unknown_categoricals: List[str] | None = None, time_varying_unknown_reals: List[str] | None = None, variable_groups: Dict[str, List[int]] | None = None, constant_fill_strategy: Dict[str, str | float | int | bool] | None = None, allow_missing_timesteps: bool = False, lags: Dict[str, List[int]] | None = None, add_relative_time_idx: bool = False, add_target_scales: bool = False, add_encoder_length: bool | str = 'auto', target_normalizer: TorchNormalizer | NaNLabelEncoder | EncoderNormalizer | str | List[TorchNormalizer | NaNLabelEncoder | EncoderNormalizer] | Tuple[TorchNormalizer | NaNLabelEncoder | EncoderNormalizer] | None = 'auto', categorical_encoders: Dict[str, NaNLabelEncoder] | None = None, scalers: Dict[str, StandardScaler | RobustScaler | TorchNormalizer | EncoderNormalizer] | None = None, randomize_length: None | Tuple[float, float] | bool = False, predict_mode: bool = False)[source]#

Bases: Dataset

PyTorch Dataset for fitting timeseries models.

The dataset automates common tasks such as

  • scaling and encoding of variables

  • normalizing the target variable

  • efficiently converting timeseries in pandas dataframes to torch tensors

  • holding information about static and time-varying variables known and unknown in the future

  • holding information about related categories (such as holidays)

  • downsampling for data augmentation

  • generating inference, validation and test datasets

The tutorial on passing data to models is helpful to understand the output of the dataset and how it is coupled to models.

Each sample is a subsequence of a full time series. The subsequence consists of encoder and decoder/prediction timepoints for a given time series. This class constructs an index which defines which subsequences exist and can be sampled from (index attribute). The samples in the index are defined by the various parameters to the class (encoder and prediction lengths, minimum prediction length, randomize length and predict keywords). How samples are sampled into batches for training is determined by the DataLoader. The class provides the to_dataloader() method to convert the dataset into a dataloader.

Large datasets:

Currently the class is limited to in-memory operations (which can be sped up by an existing installation of numba). If you have extremely large data, however, you can pass prefitted encoders and scalers, along with a subset of sequences, to the class to construct a valid dataset (plus, likely the EncoderNormalizer should be used to normalize targets). When fitting a network, you would then need to create a custom DataLoader that rotates through the datasets. There are currently no built-in methods to do this.
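
Example

A minimal sketch of constructing a dataset and converting it to a dataloader. The dataframe and the column names "agency", "sales" and "price" are illustrative, not part of the API:

import numpy as np
import pandas as pd

from pytorch_forecasting import TimeSeriesDataSet

# toy data: two series ("a" and "b") with 60 consecutive time steps each
data = pd.DataFrame(
    {
        "time_idx": np.tile(np.arange(60), 2),  # integer typed time index
        "agency": np.repeat(["a", "b"], 60),    # identifies the time series
        "sales": np.random.rand(120),           # forecasting target
        "price": np.random.rand(120),           # known in the future
    }
)

dataset = TimeSeriesDataSet(
    data,
    time_idx="time_idx",
    target="sales",
    group_ids=["agency"],
    max_encoder_length=24,
    max_prediction_length=6,
    time_varying_known_reals=["price"],
    time_varying_unknown_reals=["sales"],  # the target belongs here, if real
)
dataloader = dataset.to_dataloader(train=True, batch_size=32)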

Parameters:
  • data (pd.DataFrame) – dataframe with sequence data - each row can be identified with time_idx and the group_ids

  • time_idx (str) – integer typed column denoting the time index within data. This column is used to determine the sequence of samples. If there are no missing observations, the time index should increase by +1 for each subsequent sample. The first time_idx for each series does not necessarily have to be 0; any value is allowed.

  • target (Union[str, List[str]]) – column(s) in data denoting the forecasting target. Can be categorical or continuous dtype.

  • group_ids (List[str]) – list of column names identifying a time series instance within data. This means that the group_ids identify a sample together with the time_idx. If you have only one time series, set this to the name of a column that is constant.

  • weight (str, optional, default=None) – column name for weights. Defaults to None.

  • max_encoder_length (int, optional, default=30) – maximum length to encode. This is the maximum history length used by the time series dataset.

  • min_encoder_length (int, optional, default=max_encoder_length) – minimum allowed length to encode. Defaults to max_encoder_length.

  • min_prediction_idx (int, optional, default = first time_idx in data) – minimum time_idx from where to start predictions. This parameter can be useful to create a validation or test set.

  • max_prediction_length (int, optional, default=1) – maximum prediction/decoder length (do not choose this too short, as a longer prediction length can help convergence)

  • min_prediction_length (int, optional, default=max_prediction_length) – minimum prediction/decoder length

  • static_categoricals (list of str, optional, default=None) – list of categorical variables in data that do not change over time; entries can also be lists which are then encoded together (e.g. useful for product categories)

  • static_reals (list of str, optional, default=None) – list of continuous variables that do not change over time

  • time_varying_known_categoricals (list of str, optional, default=None) – list of categorical variables that change over time and are known in the future; entries can also be lists which are then encoded together (e.g. useful for special days or promotion categories)

  • time_varying_known_reals (list of str, optional, default=None) – list of continuous variables that change over time and are known in the future (e.g. price of a product, but not demand of a product)

  • time_varying_unknown_categoricals (list of str, optional, default=None) – list of categorical variables that are not known in the future and change over time. Entries can also be lists which are then encoded together (e.g. useful for weather categories). Target variables should be included here, if categorical.

  • time_varying_unknown_reals (list of str, optional, default=None) – list of continuous variables that are not known in the future and change over time. Target variables should be included here, if real.

  • variable_groups (Dict[str, List[str]], optional, default=None) – dictionary mapping a name to a list of columns in the data. The name should be present in a categorical or real class argument to be able to encode or scale the columns by group. This will effectively combine categorical variables and is particularly useful if a categorical variable can have multiple values at the same time. An example is holidays, which can overlap.

  • constant_fill_strategy (dict, optional, default=None) – Keys must be str, values can be str, float, int or bool. Dictionary of column names with constants to fill in missing values if there are gaps in the sequence (by default, a forward fill strategy is used). The values will only be used if allow_missing_timesteps=True. A common use case is to denote that demand was 0 if the sample is not in the dataset.

  • allow_missing_timesteps (bool, optional, default=False) – whether to allow missing timesteps that are automatically filled up. Missing values refer to gaps in the time_idx, e.g. if a specific timeseries has only samples for 1, 2, 4, 5, the sample for 3 will be generated on-the-fly. Allowing missing timesteps does not deal with NA values. You should fill NA values before passing the dataframe to the TimeSeriesDataSet.

  • lags (Dict[str, List[int]], optional, default=None) – dictionary of variable names mapped to a list of time steps by which the variable should be lagged. Lags can be useful to indicate seasonality to the models. If the seasonality (or seasonalities) of the data are known, it is recommended to add the target variables with the corresponding lags to improve performance. Lags must not be larger than the shortest time series, as all time series will be cut by the largest lag value to prevent NA values. A lagged variable has to appear in the time-varying variables. If you only want the lagged but not the current value, lag it manually in your input data using data[lagged_varname] = data.sort_values(time_idx).groupby(group_ids, observed=True)[varname].shift(lag).

  • add_relative_time_idx (bool, optional, default=False) – whether to add a relative time index as feature, i.e., for each sampled sequence, the index will range from -encoder_length to prediction_length.

  • add_target_scales (bool, optional, default=False) – whether to add scales for target to static real features, i.e., add the center and scale of the unnormalized timeseries as features.

  • add_encoder_length (Union[bool, str], optional, default="auto") – whether to add encoder length to the list of static real variables. Defaults to "auto", which is the same as True if and only if min_encoder_length != max_encoder_length.

  • target_normalizer (torch transformer, str, list, tuple, optional, default="auto") – Transformer that takes group_ids, target and time_idx to normalize targets. You can choose from TorchNormalizer, GroupNormalizer, NaNLabelEncoder, EncoderNormalizer (on which overfitting tests will fail) or None for using no normalizer. For multiple targets, use a MultiNormalizer. By default, an appropriate normalizer is chosen automatically.

  • categorical_encoders (dict[str, BaseEstimator]) – dictionary of scikit-learn label transformers. If you have unobserved categories in the future / a cold-start problem, you can use the NaNLabelEncoder with add_nan=True. Defaults effectively to sklearn’s LabelEncoder(). Prefitted encoders will not be fit again.

  • scalers (optional, dict with str keys and torch or sklearn scalers as values) – dictionary of scikit-learn or torch scalers. Defaults to sklearn’s StandardScaler(). Other options are EncoderNormalizer, GroupNormalizer or scikit-learn’s StandardScaler(), RobustScaler() or None for using no normalizer / a normalizer with center=0 and scale=1 (method="identity"). Prefitted scalers will not be fit again (with the exception of the EncoderNormalizer, which is fit on every encoder sequence).

  • randomize_length (None, bool, or Tuple[float, float], optional, default=False) – None or False if lengths should not be randomized. Otherwise, a tuple of beta distribution concentrations from which probabilities are sampled that are used to sample new sequence lengths with a binomial distribution. If True, defaults to (0.2, 0.05), i.e. ~1/4 of samples around minimum encoder length.

  • predict_mode (bool) – If True, the TimeSeriesDataSet will only create one sequence per time series (i.e. only from the latest provided samples). Effectively, this will select, for each time series identified by group_ids, the last max_prediction_length samples as prediction samples and everything previous, up to max_encoder_length samples, as encoder samples. If False, the TimeSeriesDataSet will create subsequences by sliding a window over the data samples. For training use cases, it is preferable to set predict_mode=False to get all subseries. On the other hand, predict_mode=True is ideal for validation cases.

Methods

  • calculate_decoder_length(time_last, ...) – Calculate length of decoder.

  • filter(filter_func[, copy]) – Filter subsequences in dataset.

  • from_dataset(dataset, data[, ...]) – Construct dataset with different data, same variable encoders, scalers, etc.

  • from_parameters(parameters, data[, ...]) – Construct dataset with different data, same variable encoders, scalers, etc.

  • get_parameters() – Get parameters of self as dict.

  • get_transformer(name[, group_id]) – Get transformer for variable.

  • load(fname) – Load dataset from disk.

  • plot_randomization([betas, length, min_length]) – Plot expected randomized length distribution.

  • reset_overwrite_values() – Reset values used to override sample features.

  • save(fname) – Save dataset to disk.

  • set_overwrite_values(values, variable[, target]) – Overwrite values in decoder or encoder (or both) for a specific variable.

  • to_dataloader([train, batch_size, batch_sampler]) – Construct dataloader from dataset, for use in models.

  • transform_values(name, values[, data, ...]) – Scale and encode values.

  • x_to_index(x) – Decode dataframe index from x.

Attributes

  • categoricals – Categorical variables as used for modelling.

  • decoded_index – Get interpretable version of index.

  • dropout_categoricals – List of categorical variables that are unknown when making a forecast without observed history.

  • flat_categoricals – Categorical variables as defined in input data.

  • lagged_targets – Subset of lagged_variables to variables that are lagged targets.

  • lagged_variables – Lagged variables.

  • max_lag – Maximum number of time steps variables are lagged.

  • min_lag – Minimum number of time steps variables are lagged.

  • multi_target – If dataset encodes one or multiple targets.

  • reals – Continuous variables as used for modelling.

  • target_names – List of targets.

  • target_normalizers – List of target normalizers aligned with target_names.

  • variable_to_group_mapping – Mapping from categorical variables to variables in input data.

calculate_decoder_length(time_last: int | Series | ndarray, sequence_length: int | Series | ndarray) int | Series | ndarray[source]#

Calculate length of decoder.

Parameters:
  • time_last (Union[int, pd.Series, np.ndarray]) – last time index of the sequence

  • sequence_length (Union[int, pd.Series, np.ndarray]) – total length of the sequence

Returns:

decoder length(s)

Return type:

Union[int, pd.Series, np.ndarray]

filter(filter_func: Callable, copy: bool = True) TimeSeriesDataSet[source]#

Filter subsequences in dataset.

Uses the interpretable version of the index, decoded_index(), to filter subsequences in the dataset.

Parameters:
  • filter_func (Callable) – function to filter. Should take decoded_index() dataframe as only argument which contains group ids and time index columns.

  • copy (bool, optional, default=True) – whether to return copy of dataset (True) or filter inplace (False).

Returns:

filtered dataset

Return type:

TimeSeriesDataSet
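
Example

A sketch of filtering on the columns of decoded_index() (assumes a constructed dataset as in the class-level example; the condition is illustrative):

# keep only subsequences whose first prediction falls on an even time index
dataset_filtered = dataset.filter(
    lambda index: index.time_idx_first_prediction % 2 == 0
)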

classmethod from_dataset(dataset, data: DataFrame, stop_randomization: bool = False, predict: bool = False, **update_kwargs)[source]#

Construct dataset with different data, same variable encoders, scalers, etc.

Calls from_parameters() under the hood.

May override parameters with update_kwargs.

Parameters:
  • dataset (TimeSeriesDataSet) – dataset from which to copy parameters

  • data (pd.DataFrame) – data from which new dataset will be generated

  • stop_randomization (bool, optional, default=False) – Whether to stop randomizing encoder and decoder lengths, useful for validation set.

  • predict (bool, optional, default=False) – Whether to predict the decoder length on the last entries in the time index (i.e. one prediction per group only).

  • **update_kwargs – keyword arguments overrides, passed to constructor of the new dataset

Returns:

new dataset

Return type:

TimeSeriesDataSet
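
Example

A common pattern is deriving a validation set from a fitted training dataset (a sketch; training and data are assumed to exist as in the class-level example):

# one sequence per series, predicting on the last entries of the time index
validation = TimeSeriesDataSet.from_dataset(
    training, data, predict=True, stop_randomization=True
)
val_dataloader = validation.to_dataloader(train=False, batch_size=64)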

classmethod from_parameters(parameters: Dict[str, Any], data: DataFrame, stop_randomization: bool = None, predict: bool = False, **update_kwargs)[source]#

Construct dataset with different data, same variable encoders, scalers, etc.

Returns TimeSeriesDataSet with same parameters as self, but different data. May override parameters with update_kwargs.

Parameters:
  • parameters (Dict[str, Any]) – dataset parameters which to use for the new dataset

  • data (pd.DataFrame) – data from which new dataset will be generated

  • stop_randomization (bool, optional, default=None) – Whether to stop randomizing encoder and decoder lengths, useful for validation set.

  • predict (bool, optional, default=False) – Whether to predict the decoder length on the last entries in the time index (i.e. one prediction per group only).

  • **update_kwargs – keyword arguments overrides, passed to constructor of the new dataset

Returns:

new dataset

Return type:

TimeSeriesDataSet

get_parameters() Dict[str, Any][source]#

Get parameters of self as dict.

These can be used with from_parameters() to create a new dataset with the same scalers.

Returns:

dictionary of parameters

Return type:

Dict[str, Any]
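
Example

A sketch of re-creating a dataset with the same encoders and scalers from its parameters (new_data is an illustrative dataframe with the same columns):

parameters = dataset.get_parameters()
new_dataset = TimeSeriesDataSet.from_parameters(parameters, new_data)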

get_transformer(name: str, group_id: bool = False)[source]#

Get transformer for variable.

Parameters:
  • name (str) – variable name

  • group_id (bool, optional, default=False) – Whether the passed name refers to a group id, different encoders are used for these.

Return type:

transformer
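
Example

A sketch of retrieving the fitted scaler or encoder for a variable ("price" is an illustrative column name):

scaler = dataset.get_transformer("price")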

classmethod load(fname: str)[source]#

Load dataset from disk

Parameters:

fname (str) – filename to load from

Returns:

TimeSeriesDataSet

plot_randomization(betas: Tuple[float, float] = None, length: int = None, min_length: int = None)[source]#

Plot expected randomized length distribution.

Parameters:
  • betas (Tuple[float, float], optional, default=randomize_length of dataset) – Tuple of betas, e.g. (0.2, 0.05) to use for randomization.

  • length (int, optional, default=max_encoder_length of dataset) – Length of sequence to plot.

  • min_length (int, optional, default=min_encoder_length of dataset) – Minimum length of sequence to plot.

Returns:

tuple of figure and histogram based on 1000 samples

Return type:

Tuple[plt.Figure, torch.Tensor]
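
Example

A sketch of inspecting the sequence-length distribution implied by a pair of beta concentrations:

fig, lengths = dataset.plot_randomization(betas=(0.2, 0.05))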

reset_overwrite_values() None[source]#

Reset values used to override sample features.

save(fname: str) None[source]#

Save dataset to disk

Parameters:

fname (str) – filename to save to
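
Example

A sketch of persisting a dataset with save() and restoring it with load() (the filename is illustrative):

dataset.save("dataset.pkl")
dataset = TimeSeriesDataSet.load("dataset.pkl")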

set_overwrite_values(values: float | Tensor, variable: str, target: str | slice = 'decoder') None[source]#

Overwrite values in decoder or encoder (or both) for a specific variable.

Parameters:
  • values (Union[float, torch.Tensor]) – values to use for overwrite.

  • variable (str) – variable whose values should be overwritten.

  • target (Union[str, slice], optional) – positions to overwrite. One of “decoder”, “encoder” or “all”, or a slice object which is directly used to overwrite indices, e.g., slice(-5, None) will overwrite the last 5 values. Defaults to “decoder”.
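
Example

A sketch of overwriting a variable over the prediction horizon and resetting afterwards ("price" and the value are illustrative):

# pretend prices are constant over the prediction horizon
dataset.set_overwrite_values(values=1.0, variable="price", target="decoder")
# ... run inference ...
dataset.reset_overwrite_values()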

to_dataloader(train: bool = True, batch_size: int = 64, batch_sampler: Sampler | str = None, **kwargs) DataLoader[source]#

Construct dataloader from dataset, for use in models.

Parameters:
  • train (bool, optional, default=True) – whether dataloader is used for training (True) or prediction (False). Will shuffle and drop last batch if True.

  • batch_size (int, optional, default=64) – batch size for training model. Defaults to 64.

  • batch_sampler (Sampler, str, or None, optional, default=None) –

    torch batch sampler or string. One of

    • "synchronized": ensures that samples in the decoder are aligned in time.

      Does not support missing values in the dataset. This only makes sense if the underlying algorithm makes use of values aligned in time.

    • PyTorch Sampler instance: any PyTorch sampler,

      e.g., the WeightedRandomSampler()

    • None: samples are taken randomly from the time series.

  • **kwargs (additional arguments passed to DataLoader constructor)

Returns:

dataloader that returns tuples of (x, y).

First entry is x, a dictionary of tensors with the entries below (shapes in brackets):

  • encoder_cat (long, batch_size x n_encoder_time_steps x n_features)

    long tensor of encoded categoricals for encoder

  • encoder_cont (float, batch_size x n_encoder_time_steps x n_features)

    float tensor of scaled continuous variables for encoder

  • encoder_target (float, batch_size x n_encoder_time_steps, or list thereof)

    if list, each entry is for a different target. float tensor with unscaled continuous target or encoded categorical target; list of tensors for multiple targets

  • encoder_lengths (long, batch_size)

    long tensor with lengths of the encoder time series. No entry will be greater than n_encoder_time_steps

  • decoder_cat (long, batch_size x n_decoder_time_steps x n_features)

    long tensor of encoded categoricals for decoder

  • decoder_cont (float, batch_size x n_decoder_time_steps x n_features)

    float tensor of scaled continuous variables for decoder

  • decoder_target (float, batch_size x n_decoder_time_steps, or list thereof)

    if list, each entry is for a different target. float tensor with unscaled continuous target or encoded categorical target for decoder - this corresponds to the first entry of y; list of tensors for multiple targets

  • decoder_lengths (long, batch_size)

    long tensor with lengths of the decoder time series. No entry will be greater than n_decoder_time_steps

  • group_ids (float, batch_size x number_of_ids)

    encoded group ids that identify a time series in the dataset

  • target_scale (float, batch_size x scale_size, or list thereof)

    if list, each entry is for a different target. Parameters used to normalize the target, typically mean and standard deviation; list of tensors for multiple targets.

Second entry is y, a tuple of the form (target, weight):

  • target (float, batch_size x n_decoder_time_steps, or list thereof)

    if list, each entry is for a different target. Unscaled (continuous) or encoded (categorical) targets; list of tensors for multiple targets

  • weight (None or float, batch_size x n_decoder_time_steps)

    weights for each target; None if no weight is used (= equal weights)

Return type:

DataLoader

Example

Weight by samples for training:

import numpy as np
from torch.utils.data import WeightedRandomSampler

# length of probabilities for sampler has to be equal to the length of the index
probabilities = np.sqrt(1 + data.loc[dataset.index, "target"])
sampler = WeightedRandomSampler(probabilities, len(probabilities))
dataset.to_dataloader(train=True, sampler=sampler, shuffle=False)
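
Inspecting one batch illustrates the structure described above (a sketch; assumes dataset was constructed as in the class-level example):

dataloader = dataset.to_dataloader(train=False, batch_size=64)
x, y = next(iter(dataloader))  # x is a dictionary of tensors, y is (target, weight)
x["encoder_cont"].shape  # batch_size x n_encoder_time_steps x n_features
target, weight = y
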
transform_values(name: str, values: Series | Tensor | ndarray, data: DataFrame = None, inverse=False, group_id: bool = False, **kwargs) ndarray[source]#

Scale and encode values.

Parameters:
  • name (str) – name of variable

  • values (Union[pd.Series, torch.Tensor, np.ndarray]) – values to encode/scale

  • data (pd.DataFrame, optional, default=None) – extra data used for scaling (e.g. dataframe with groups columns)

  • inverse (bool, optional, default=False) – whether to apply the inverse transformation (True) or the forward transformation (False)

  • group_id (bool, optional, default=False) – whether the passed name refers to a group id - different encoders are used for these

  • **kwargs (additional arguments for transform/inverse_transform method)

Returns:

(de/en)coded/(de)scaled values

Return type:

np.ndarray
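
Example

A sketch of scaling raw values with the fitted scaler and inverting the transformation ("price" and data are illustrative):

scaled = dataset.transform_values("price", data["price"], data=data)
unscaled = dataset.transform_values("price", scaled, data=data, inverse=True)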

x_to_index(x: Dict[str, Tensor]) DataFrame[source]#

Decode dataframe index from x.

Returns:

dataframe with time index column for first prediction and group ids
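
Example

A sketch of mapping a batch back to its position in the original data:

x, y = next(iter(dataset.to_dataloader(train=False)))
index = dataset.x_to_index(x)  # group ids and time index of the first prediction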

property categoricals: List[str]#

Categorical variables as used for modelling.

Returns:

list of variables

Return type:

List[str]

property decoded_index: DataFrame#

Get interpretable version of index.

The DataFrame contains

  • group_id columns in original encoding

  • time_idx_first column: first time index of subsequence

  • time_idx_last column: last time index of subsequence

  • time_idx_first_prediction column: first time index which is in the decoder

Returns:

index that can be understood in terms of original data

Return type:

pd.DataFrame

property dropout_categoricals: List[str]#

list of categorical variables that are unknown when making a forecast without observed history

property flat_categoricals: List[str]#

Categorical variables as defined in input data.

Returns:

list of variables

Return type:

List[str]

property lagged_targets: Dict[str, str]#

Subset of lagged_variables to variables that are lagged targets.

Returns:

dictionary of variable names corresponding to lagged variables, mapped to the variable that is lagged

Return type:

Dict[str, str]

property lagged_variables: Dict[str, str]#

Lagged variables.

Returns:

dictionary of variable names corresponding to lagged variables, mapped to the variable that is lagged

Return type:

Dict[str, str]

property max_lag: int#

Maximum number of time steps variables are lagged.

Returns:

maximum lag

Return type:

int

property min_lag: int#

Minimum number of time steps variables are lagged.

Returns:

minimum lag

Return type:

int

property multi_target: bool#

If dataset encodes one or multiple targets.

Returns:

true if multiple targets

Return type:

bool

property reals: List[str]#

Continuous variables as used for modelling.

Returns:

list of variables

Return type:

List[str]

property target_names: List[str]#

List of targets.

Returns:

list of targets

Return type:

List[str]

property target_normalizers: List[TorchNormalizer]#

List of target normalizers aligned with target_names.

Returns:

list of target normalizers

Return type:

List[TorchNormalizer]

property variable_to_group_mapping: Dict[str, str]#

Mapping from categorical variables to variables in input data.

Returns:

dictionary, maps categoricals() to flat_categoricals().

Return type:

Dict[str, str]