TimeSeriesDataSet

class pytorch_forecasting.data.timeseries.TimeSeriesDataSet(data: pandas.core.frame.DataFrame, time_idx: str, target: Union[str, List[str]], group_ids: List[str], weight: Optional[Union[str, List[str]]] = None, max_encoder_length: int = 30, min_encoder_length: int = None, min_prediction_idx: int = None, min_prediction_length: int = None, max_prediction_length: int = 1, static_categoricals: List[str] = [], static_reals: List[str] = [], time_varying_known_categoricals: List[str] = [], time_varying_known_reals: List[str] = [], time_varying_unknown_categoricals: List[str] = [], time_varying_unknown_reals: List[str] = [], variable_groups: Dict[str, List[int]] = {}, dropout_categoricals: List[str] = [], constant_fill_strategy={}, allow_missings: bool = False, add_relative_time_idx: bool = False, add_target_scales: bool = False, add_encoder_length: Union[bool, str] = 'auto', target_normalizer: Union[pytorch_forecasting.data.encoders.TorchNormalizer, pytorch_forecasting.data.encoders.NaNLabelEncoder, str] = 'auto', categorical_encoders={}, scalers={}, randomize_length: Union[None, Tuple[float, float], bool] = False, predict_mode: bool = False)

Bases: torch.utils.data.dataset.Dataset

PyTorch Dataset for fitting timeseries models.

The dataset automates common tasks such as

  • scaling and encoding of variables

  • normalizing the target variable

  • efficiently converting timeseries in pandas dataframes to torch tensors

  • holding information about static and time-varying variables known and unknown in the future

  • holding information about related categories (such as holidays)

  • downsampling for data augmentation

  • generating inference, validation and test datasets

  • etc.

Timeseries dataset holding data for models.

Each sample is a subsequence of a full time series. The subsequence consists of encoder and decoder/prediction timepoints for a given time series. This class constructs an index which defines which subsequences exist and can be sampled from (index attribute). The samples in the index are defined by the various parameters of the class (encoder and prediction lengths, minimum prediction length, randomize length and predict keywords). How samples are sampled into batches for training is determined by the DataLoader. The class provides the to_dataloader() method to convert the dataset into a dataloader.
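The idea of an index over subsequences can be pictured with a minimal pure-Python sketch. It assumes a single gap-free series and fixed encoder and prediction lengths; the function name `build_index` and the tuple layout are illustrative and are not the library's implementation.

```python
from typing import List, Tuple


def build_index(
    series_length: int, encoder_length: int, prediction_length: int
) -> List[Tuple[int, int, int]]:
    """Enumerate (encoder_start, decoder_start, end) for every full-length window."""
    window = encoder_length + prediction_length
    index = []
    # range() is empty when the series is shorter than one window
    for start in range(series_length - window + 1):
        decoder_start = start + encoder_length
        index.append((start, decoder_start, start + window))
    return index


# A series with 10 time steps, 3 encoder steps and 2 prediction steps
index = build_index(series_length=10, encoder_length=3, prediction_length=2)
print(len(index))  # 6 windows fit into the series
print(index[0])    # (0, 3, 5): encoder covers steps 0-2, decoder covers steps 3-4
```

Each tuple corresponds to one sample that could be drawn; batching those samples is left to the DataLoader.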

Large datasets:

Currently the class is limited to in-memory operations. If you have extremely large data, however, you can pass prefitted encoders and scalers together with a subset of sequences to the class to construct a valid dataset (plus, likely the EncoderNormalizer should be used to normalize targets). When fitting a network, you would then need to create a custom DataLoader that rotates through the datasets. There are currently no built-in methods to do this.
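One way to rotate through several datasets is a generator that builds one loader at a time. This is a sketch only: `rotating_batches` and `make_loader` are hypothetical names, and in real use `make_loader` would read a subset of sequences, construct a TimeSeriesDataSet with the prefitted encoders and scalers, and call to_dataloader().

```python
from typing import Callable, Iterable, Iterator, Sequence


def rotating_batches(
    chunk_ids: Sequence[str],
    make_loader: Callable[[str], Iterable],
) -> Iterator:
    """Yield batches chunk by chunk so that only one chunk is held in memory."""
    for chunk_id in chunk_ids:
        loader = make_loader(chunk_id)  # build dataset and dataloader lazily
        yield from loader
        del loader  # drop the reference before the next chunk is loaded


# Stand-in for a real loader factory: two "batches" per chunk
def fake_loader(chunk_id: str) -> list:
    return [f"{chunk_id}-batch{i}" for i in range(2)]


batches = list(rotating_batches(["a", "b"], fake_loader))
```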

Parameters
  • data – dataframe with sequence data - each row can be identified with time_idx and the group_ids

  • time_idx – integer column denoting the time index. This column is used to determine the sequence of samples. If there are no missing observations, the time index should increase by +1 for each subsequent sample. The first time_idx for each series does not have to be 0; any value is allowed.

  • target – column denoting the target or list of columns denoting the target - categorical or continuous.

  • group_ids – list of column names identifying a time series. This means that the group_ids identify a sample together with the time_idx. If you have only one timeseries, set this to the name of a column that is constant.

  • weight – column name for weights or list of column names corresponding to each target

  • max_encoder_length – maximum length to encode

  • min_encoder_length – minimum allowed length to encode. Defaults to max_encoder_length.

  • min_prediction_idx – minimum time_idx from where to start predictions. This parameter can be useful to create a validation or test set.

  • max_prediction_length – maximum prediction/decoder length (choose this not too short as it can help convergence)

  • min_prediction_length – minimum prediction/decoder length. Defaults to max_prediction_length

  • static_categoricals – list of categorical variables that do not change over time. Entries can also be lists, which are then encoded together (e.g. useful for product categories)

  • static_reals – list of continuous variables that do not change over time

  • time_varying_known_categoricals – list of categorical variables that change over time and are known in the future. Entries can also be lists, which are then encoded together (e.g. useful for special days or promotion categories)

  • time_varying_known_reals – list of continuous variables that change over time and are known in the future

  • time_varying_unknown_categoricals – list of categorical variables that change over time and are not known in the future. Entries can also be lists, which are then encoded together (e.g. useful for weather categories)

  • time_varying_unknown_reals – list of continuous variables that change over time and are not known in the future

  • variable_groups – dictionary mapping a name to a list of columns in the data. The name should be present in a categorical or real class argument, to be able to encode or scale the columns by group.

  • dropout_categoricals – list of categorical variables that are unknown when making a forecast without observed history

  • constant_fill_strategy – dictionary of column names with constants to fill in missing values if there are gaps in the sequence (by default forward fill strategy is used). The values will only be used if allow_missings=True. A common use case is to denote that demand was 0 if the sample is not in the dataset.

  • allow_missings – whether to allow missing timesteps that are automatically filled up. Missing values refer to gaps in the time_idx, e.g. if a specific timeseries has only samples for 1, 2, 4, 5, the sample for 3 will be generated on-the-fly. allow_missings does not deal with NA values. You should fill NA values before passing the dataframe to the TimeSeriesDataSet.

  • add_relative_time_idx – whether to add a relative time index as feature (i.e. for each sampled sequence, the index will range from -encoder_length to prediction_length)

  • add_target_scales – whether to add scales for target to static real features (i.e. add the center and scale of the unnormalized timeseries as features)

  • add_encoder_length – whether to add encoder length to list of static real variables. Defaults to “auto”, i.e. yes if min_encoder_length != max_encoder_length.

  • target_normalizer – transformer that takes group_ids, target and time_idx to return normalized targets. You can choose from the classes in encoders. By default an appropriate normalizer is chosen automatically.

  • categorical_encoders – dictionary of scikit learn label transformers. If you have unobserved categories in the future, you can use the NaNLabelEncoder with add_nan=True. Defaults effectively to sklearn’s LabelEncoder(). Prefitted encoders will not be fit again.

  • scalers – dictionary of scikit learn scalers. Defaults to sklearn’s StandardScaler(). Prefitted scalers will not be fit again.

  • randomize_length – None or False to disable length randomization. Otherwise a tuple of beta distribution concentrations; probabilities sampled from this beta distribution are used to draw new sequence lengths from a binomial distribution. If True, defaults to (0.2, 0.05), i.e. ~1/4 of samples around minimum encoder length. Defaults to False.

  • predict_mode – whether to iterate over each timeseries only once (only the last provided samples). Effectively, this will choose, for each time series identified by group_ids, the last max_prediction_length samples as prediction samples and everything before them, up to max_encoder_length samples, as encoder samples.
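The interplay of allow_missings and constant_fill_strategy described in the parameter list can be pictured with a small pure-Python sketch. It is an illustration under simplifying assumptions (one series, one column, stored as a dict keyed by time_idx); the helper `fill_gaps` is hypothetical and not part of the library.

```python
from typing import Dict, Optional


def fill_gaps(observed: Dict[int, float], constant: Optional[float] = None) -> Dict[int, float]:
    """Fill missing time steps between the first and last observed time_idx.

    With constant=None the previous value is carried forward (forward fill);
    otherwise the constant is used for every generated step.
    """
    filled = {}
    previous = None
    for t in range(min(observed), max(observed) + 1):
        if t in observed:
            previous = observed[t]
            filled[t] = previous
        else:
            filled[t] = previous if constant is None else constant
    return filled


demand = {1: 5.0, 2: 7.0, 4: 3.0, 5: 4.0}       # time_idx 3 is missing
forward_filled = fill_gaps(demand)                # step 3 repeats the value of step 2
zero_filled = fill_gaps(demand, constant=0.0)     # step 3 is set to 0, e.g. "no demand"
```

NA values inside existing rows are a separate problem and still have to be filled before the dataframe is passed in.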

Methods

from_dataset(dataset, data[, …])

Generate dataset with different underlying data but same variable encoders and scalers, etc.

from_parameters(parameters, data[, …])

Generate dataset with different underlying data but same variable encoders and scalers, etc.

get_parameters()

Get parameters that can be used with from_parameters() to create a new dataset with the same scalers.

load(fname)

Load dataset from disk

plot_randomization([betas, length, min_length])

Plot expected randomized length distribution.

reset_overwrite_values()

Reset values used to override sample features.

save(fname)

Save dataset to disk

set_overwrite_values(values, variable[, target])

Convenience method to quickly overwrite values in decoder or encoder (or both) for a specific variable.

to_dataloader([train, batch_size, batch_sampler])

Get dataloader from dataset.

transform_values(name, values[, data, inverse])

Scale and encode values.

x_to_index(x)

Decode dataframe index from x.

Attributes

categoricals

Categorical variables as used for modelling.

flat_categoricals

Categorical variables as defined in input data.

reals

Continuous variables as used for modelling.

variable_to_group_mapping

Mapping from categorical variables to variables in input data.

classmethod from_dataset(dataset, data: pandas.core.frame.DataFrame, stop_randomization: bool = False, predict: bool = False, **update_kwargs)

Generate dataset with different underlying data but same variable encoders and scalers, etc.

Calls from_parameters() under the hood.

Parameters
  • dataset (TimeSeriesDataSet) – dataset from which to copy parameters

  • data (pd.DataFrame) – data from which new dataset will be generated

  • stop_randomization (bool, optional) – Whether to stop randomizing encoder and decoder lengths, e.g. useful for validation set. Defaults to False.

  • predict (bool, optional) – Whether to predict the decoder length on the last entries in the time index (i.e. one prediction per group only). Defaults to False.

  • **update_kwargs – keyword arguments overriding parameters in the original dataset

Returns

new dataset

Return type

TimeSeriesDataSet
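What predict=True selects for each group can be sketched in plain Python. The sketch assumes a gap-free list of time indices for one group and max_prediction_length >= 1; `split_for_prediction` is a hypothetical helper, not a library function.

```python
from typing import List, Tuple


def split_for_prediction(
    time_idx: List[int], max_encoder_length: int, max_prediction_length: int
) -> Tuple[List[int], List[int]]:
    """Return (encoder steps, decoder steps) for the single sample of one group."""
    # the last max_prediction_length entries become the prediction window
    decoder = time_idx[-max_prediction_length:]
    # everything before them, capped at max_encoder_length, becomes the encoder window
    encoder = time_idx[:-max_prediction_length][-max_encoder_length:]
    return encoder, decoder


encoder_steps, decoder_steps = split_for_prediction(
    list(range(10)), max_encoder_length=4, max_prediction_length=2
)
print(encoder_steps)  # [4, 5, 6, 7]
print(decoder_steps)  # [8, 9]
```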

classmethod from_parameters(parameters: Dict[str, Any], data: pandas.core.frame.DataFrame, stop_randomization: bool = False, predict: bool = False, **update_kwargs)

Generate dataset with different underlying data but same variable encoders and scalers, etc.

Parameters
  • parameters (Dict[str, Any]) – dataset parameters which to use for the new dataset

  • data (pd.DataFrame) – data from which new dataset will be generated

  • stop_randomization (bool, optional) – Whether to stop randomizing encoder and decoder lengths, e.g. useful for validation set. Defaults to False.

  • predict (bool, optional) – Whether to predict the decoder length on the last entries in the time index (i.e. one prediction per group only). Defaults to False.

  • **update_kwargs – keyword arguments overriding parameters

Returns

new dataset

Return type

TimeSeriesDataSet

get_parameters() → Dict[str, Any]

Get parameters that can be used with from_parameters() to create a new dataset with the same scalers.

Returns

dictionary of parameters

Return type

Dict[str, Any]

classmethod load(fname: str)

Load dataset from disk

Parameters

fname (str) – filename to load from

Returns

TimeSeriesDataSet

plot_randomization(betas: Tuple[float, float] = None, length: int = None, min_length: int = None) → Tuple[matplotlib.figure.Figure, torch.Tensor]

Plot expected randomized length distribution.

Parameters
  • betas (Tuple[float, float], optional) – Tuple of betas, e.g. (0.2, 0.05) to use for randomization. Defaults to randomize_length of dataset.

  • length (int, optional) – maximum sequence length. Defaults to max_encoder_length.

  • min_length (int, optional) – minimum sequence length. Defaults to min_encoder_length.

Returns

tuple of figure and histogram based on 1000 samples

Return type

Tuple[plt.Figure, torch.Tensor]
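The kind of distribution being plotted can be approximated with the standard library alone. The sketch below assumes that a probability is drawn from a beta distribution and that the number of steps above min_length is then drawn from a binomial distribution with that probability; the exact scheme used by the library may differ, and `sample_lengths` is a hypothetical name.

```python
import random
from typing import List, Tuple


def sample_lengths(
    betas: Tuple[float, float],
    length: int,
    min_length: int,
    n_samples: int = 1000,
    seed: int = 0,
) -> List[int]:
    """Draw randomized sequence lengths between min_length and length."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_samples):
        # probability drawn from Beta(betas[0], betas[1])
        p = rng.betavariate(*betas)
        # binomial draw: count successes over the steps above min_length
        extra = sum(rng.random() < p for _ in range(length - min_length))
        lengths.append(min_length + extra)
    return lengths


lengths = sample_lengths((0.2, 0.05), length=30, min_length=10)
print(min(lengths), max(lengths))  # always within [10, 30]
```

A histogram of `lengths` gives a rough picture of how often short and long sequences would occur for a given pair of betas.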

reset_overwrite_values() → None

Reset values used to override sample features.

save(fname: str) → None

Save dataset to disk

Parameters

fname (str) – filename to save to

set_overwrite_values(values: Union[float, torch.Tensor], variable: str, target: Union[str, slice] = 'decoder') → None

Convenience method to quickly overwrite values in decoder or encoder (or both) for a specific variable.

Parameters
  • values (Union[float, torch.Tensor]) – values to use for overwrite.

  • variable (str) – variable whose values should be overwritten.

  • target (Union[str, slice], optional) – positions to overwrite. One of “decoder”, “encoder” or “all” or a slice object which is directly used to overwrite indices, e.g. slice(-5, None) will overwrite the last 5 values. Defaults to “decoder”.
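How the target argument selects positions can be illustrated on a plain list that stands for the concatenated encoder and decoder steps of one sample. This is a sketch under that assumption; `resolve_positions` and `overwrite` are hypothetical helpers and do not reflect the library's internals.

```python
from typing import List, Union


def resolve_positions(target: Union[str, slice], encoder_length: int) -> slice:
    """Translate a target specification into a slice over encoder + decoder steps."""
    if isinstance(target, slice):
        return target
    if target == "all":
        return slice(None)
    if target == "encoder":
        return slice(None, encoder_length)
    if target == "decoder":
        return slice(encoder_length, None)
    raise ValueError(f"unknown target: {target!r}")


def overwrite(
    sequence: List[float], value: float, target: Union[str, slice], encoder_length: int
) -> List[float]:
    """Return a copy of sequence with the selected positions set to value."""
    positions = resolve_positions(target, encoder_length)
    result = list(sequence)
    result[positions] = [value] * len(result[positions])
    return result


sequence = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # 5 encoder + 3 decoder steps
print(overwrite(sequence, 0.0, "decoder", encoder_length=5))
print(overwrite(sequence, 0.0, slice(-5, None), encoder_length=5))
```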

to_dataloader(train: bool = True, batch_size: int = 64, batch_sampler: Union[torch.utils.data.sampler.Sampler, str] = None, **kwargs) → torch.utils.data.dataloader.DataLoader

Get dataloader from dataset.

Parameters
  • train (bool, optional) – whether the dataloader is used for training or prediction. Will shuffle and drop last batch if True. Defaults to True.

  • batch_size (int) – batch size for training model. Defaults to 64.

  • batch_sampler (Union[Sampler, str]) –

    batch sampler or string. One of

    • “synchronized”: ensure that samples in decoder are aligned in time. Does not support missing values in dataset. This only makes sense if the underlying algorithm makes use of values aligned in time.

    • PyTorch Sampler instance: any PyTorch sampler, e.g. the WeightedRandomSampler()

    • None: samples are taken randomly from time series.

  • **kwargs – additional arguments to DataLoader()

Examples

To sample with weights during training:

import numpy as np
from torch.utils.data import WeightedRandomSampler

# length of probabilities for sampler has to be equal to the length of the index
probabilities = np.sqrt(1 + data.loc[dataset.index, "target"])
sampler = WeightedRandomSampler(probabilities, len(probabilities))
# sampler and shuffle are passed through **kwargs to DataLoader()
dataset.to_dataloader(train=True, sampler=sampler, shuffle=False)

Returns

dataloader that returns Tuple.

First entry is a dictionary with the entries

  • encoder_cat

  • encoder_cont

  • encoder_target

  • encoder_lengths

  • decoder_cat

  • decoder_cont

  • decoder_target

  • decoder_lengths

Second entry is target

Return type

DataLoader
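Each batch is a tuple whose first entry is the dictionary described above. The sketch below iterates over such batches and checks that the documented keys are present; it uses a plain list as a stand-in for a real dataloader, and `count_valid_batches` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, Tuple

# keys listed in the Returns section above
EXPECTED_KEYS = (
    "encoder_cat", "encoder_cont", "encoder_target", "encoder_lengths",
    "decoder_cat", "decoder_cont", "decoder_target", "decoder_lengths",
)


def count_valid_batches(dataloader: Iterable[Tuple[Dict[str, Any], Any]]) -> int:
    """Iterate over (x, y) batches and verify that x carries the documented keys."""
    n_batches = 0
    for x, y in dataloader:
        missing = [key for key in EXPECTED_KEYS if key not in x]
        if missing:
            raise KeyError(f"batch {n_batches} lacks keys: {missing}")
        n_batches += 1
    return n_batches


# stand-in for dataset.to_dataloader(...): two batches with empty placeholders
fake_loader = [({key: [] for key in EXPECTED_KEYS}, []) for _ in range(2)]
n_batches = count_valid_batches(fake_loader)
```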

transform_values(name: str, values: Union[pandas.core.series.Series, torch.Tensor, numpy.ndarray], data: pandas.core.frame.DataFrame = None, inverse=False) → numpy.ndarray

Scale and encode values.

Parameters
  • name (str) – name of variable

  • values (Union[pd.Series, torch.Tensor, np.ndarray]) – values to encode/scale

  • data (pd.DataFrame, optional) – extra data used for scaling (e.g. dataframe with groups columns). Defaults to None.

  • inverse (bool, optional) – if to conduct inverse transformation. Defaults to False.

Returns

(de/en)coded/(de)scaled values

Return type

np.ndarray
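The relationship between the forward and the inverse transformation can be shown with a tiny stand-in scaler. It assumes standard scaling (subtract the mean, divide by the population standard deviation); `SimpleStandardScaler` is a hypothetical class written for this illustration and is neither the library's nor scikit-learn's implementation.

```python
from statistics import mean, pstdev
from typing import List, Sequence


class SimpleStandardScaler:
    """Minimal stand-in for a fitted scaler."""

    def fit(self, values: Sequence[float]) -> "SimpleStandardScaler":
        self.center = mean(values)
        self.scale = pstdev(values) or 1.0  # avoid division by zero for constant input
        return self

    def transform(self, values: Sequence[float], inverse: bool = False) -> List[float]:
        if inverse:
            # undo the scaling: back to the original units
            return [v * self.scale + self.center for v in values]
        return [(v - self.center) / self.scale for v in values]


values = [2.0, 4.0, 6.0]
scaler = SimpleStandardScaler().fit(values)
scaled = scaler.transform(values)
restored = scaler.transform(scaled, inverse=True)  # matches values up to rounding
```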

x_to_index(x: Dict[str, torch.Tensor]) → pandas.core.frame.DataFrame

Decode dataframe index from x.

Returns

dataframe with time index column for first prediction and group ids

property categoricals

Categorical variables as used for modelling.

Returns

list of variables

Return type

List[str]

property flat_categoricals

Categorical variables as defined in input data.

Returns

list of variables

Return type

List[str]

property reals

Continuous variables as used for modelling.

Returns

list of variables

Return type

List[str]

property variable_to_group_mapping

Mapping from categorical variables to variables in input data.

Returns

dictionary mapping from categoricals to flat_categoricals.

Return type

Dict[str, str]