Data v2

Warning

Please note that the v2 modules are under active development and currently in beta, so use this API with caution. See the complete documentation for the v2 API here and the stable v1 documentation here.

Loading and managing time series data for deep learning can be complex, especially when handling varying sequence lengths, multiple covariates, and categorical encodings.

In API-v2, the data pipeline follows a strict two-layer architecture: the D1 Layer (Dataset) and the D2 Layer (DataModule) to maintain “separation of responsibilities”.

D1 Layer (Dataset) ingests the raw data and turns it into torch tensors.
D2 Layer (DataModule) performs the pre-processing and the data loading.

D1 Layer: Dataset

The D1 Layer is the foundational data ingestion layer. Its primary responsibilities are to accept raw tabular data (e.g., pandas DataFrames), convert the raw data into PyTorch tensors, and extract base-level metadata such as static variables and basic time series properties.

Unlike the v1 dataset, the D1 layer does not handle complex preprocessing or batching logic, keeping it lightweight and highly modular.

PyTorch Dataset for time series data stored in pandas DataFrame.

Parameters:

data (pd.DataFrame) – data frame with sequence data. Column names must all be str, and contain str as referred to below.
data_future (pd.DataFrame, optional, default=None) – data frame with future data. Column names must all be str, and contain str as referred to below. May contain only columns that are in time, group, weight, known, or static.
time (str, optional, default = first col not in group, weight, target, static) – integer-typed column denoting the time index within data. This column is used to determine the sequence of samples. If there are no missing observations, the time index should increase by +1 for each subsequent sample. The first time index of each series does not have to be 0; any value is allowed.
target (str or List[str], optional, default = last column (at iloc -1)) – column(s) in data denoting the forecasting target. Can be categorical or numerical dtype.
group (List[str], optional, default = None) – list of column names identifying a time series instance within data. The group columns together uniquely identify an instance, and the group columns together with time uniquely identify a single observation within a time series instance. If None, the dataset is assumed to be a single time series.
weight (str, optional, default=None) – column name for weights. If None, it is assumed that there is no weight column.
num (list of str, optional, default = all columns with dtype in "fi") – list of numerical variables in data, list may also contain list of str, which are then grouped together.
cat (list of str, optional, default = all columns with dtype in "Obc") – list of categorical variables in data, list may also contain list of str, which are then grouped together (e.g. useful for product categories).
known (list of str, optional, default = all variables) – list of variables that change over time and are known in the future, list may also contain list of str, which are then grouped together (e.g. useful for special days or promotion categories).
unknown (list of str, optional, default = no variables) – list of variables that are not known in the future, list may also contain list of str, which are then grouped together (e.g. useful for weather categories).
static (list of str, optional, default = all variables not in known, unknown) – list of variables that do not change over time, list may also contain list of str, which are then grouped together.
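A minimal construction sketch, assuming the pytorch_forecasting v2 beta and pandas are installed. The import path is the one listed in the API reference on this page. The column names (`time_idx`, `store`, `price`, `promo`, `sales`) and the helper `build_dataset` are made up for illustration. The library imports sit inside the helper so the toy data can be inspected without them.

```python
# Toy long-format data: two series (store 0 and 1) with four time steps each.
# All column names are illustrative assumptions, not required by the library.
rows = [
    {
        "time_idx": t,
        "store": store,
        "price": 9.5 + t,
        "promo": t % 2,  # integer-coded categorical
        "sales": float(10 * store + t),
    }
    for store in (0, 1)
    for t in range(4)
]

# Keyword arguments follow the parameter list above.
dataset_kwargs = dict(
    time="time_idx",
    target="sales",
    group=["store"],
    num=["price"],
    cat=["promo"],
    known=["price", "promo"],
)


def build_dataset(rows, **kwargs):
    """Hypothetical helper: wrap the rows in a DataFrame and build the D1 dataset."""
    import pandas as pd
    from pytorch_forecasting.data.timeseries._timeseries_v2 import TimeSeries

    return TimeSeries(data=pd.DataFrame(rows), **kwargs)
```

Calling `build_dataset(rows, **dataset_kwargs)` should then return a dataset that a D2 data module can consume. Categorical columns are integer-coded here as a cautious choice, since the D1 layer converts values to tensors.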

D2 Layer: DataModule

The D2 Layer sits on top of D1 and is implemented as a PyTorch Lightning LightningDataModule. This layer is responsible for the heavier lifting:

Preprocessing: Applying normalizers and encoders to the data.
Batching: Creating and managing the train_dataloader, val_dataloader, and test_dataloader.
Model Initialization Metadata: Dynamically collecting necessary architectural information (such as the number of categorical variables, embedding sizes, and vocabulary states) required to properly instantiate the Forecasting models in the Model Layer.
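To make the preprocessing responsibility concrete, here is a pure-Python sketch of robust scaling, the kind of transformation a normalizer applies before batching. It assumes the common definition (center on the median, divide by the interquartile range), which is what scikit-learn's `RobustScaler` does by default. The function names are illustrative and this is not the library's own code.

```python
from statistics import median, quantiles


def fit_robust(values):
    """Return (center, scale) = (median, interquartile range) of the values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1


def transform(values, center, scale):
    """Scale values; assumes scale is non-zero (the series is not constant)."""
    return [(v - center) / scale for v in values]


def inverse_transform(scaled, center, scale):
    """Map scaled values (e.g. model predictions) back to the original units."""
    return [s * scale + center for s in scaled]


target = [1.0, 2.0, 3.0, 4.0, 100.0]
center, scale = fit_robust(target)
scaled = transform(target, center, scale)
print(center, scale)  # 3.0 2.0
print(scaled)         # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier (100.0) stays large after scaling but does not compress the other values, which is the usual motivation for a robust scaler over mean and standard deviation scaling.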

Model Compatibility

Because different forecasting architectures require specific input shapes and structures (e.g., standard sequential batches vs. complex encoder-decoder structures), there are several different types of DataModules available in API-v2.

Each model is designed to work with one or more specific DataModules. You can verify which DataModule pairs with your chosen model by checking the compatibility overview table in the v2 Models documentation.

class pytorch_forecasting.data.data_module._tslib_data_module.TslibDataModule(time_series_dataset: TimeSeries, context_length: int, prediction_length: int, freq: str = 'h', add_relative_time_idx: bool = False, add_target_scales: bool = False, target_normalizer: TorchNormalizer | EncoderNormalizer | NaNLabelEncoder | str | list[TorchNormalizer | EncoderNormalizer | NaNLabelEncoder] | tuple[TorchNormalizer | EncoderNormalizer | NaNLabelEncoder] | None = 'auto', scalers: dict[str, StandardScaler | RobustScaler | TorchNormalizer | EncoderNormalizer] | None = None, shuffle: bool = True, window_stride: int = 1, batch_size: int = 32, num_workers: int = 0, train_val_test_split: tuple[float, float, float] = (0.7, 0.15, 0.15), collate_fn: Callable | None = None, **kwargs)

Experimental data module for integrating tslib time series into PyTorch Forecasting.

This module serves as the D2 layer for tslib models, including transformer-based architectures such as Informer, AutoFormer, and TimeXer, as well as other deep learning model architectures.

Parameters:

time_series_dataset (TimeSeries) – The time series dataset to be used for training and validation. This is the newly implemented D1 layer.
context_length (int) – The length of the context window for the model. This is the number of time steps used as input to the model.
prediction_length (int) – The length of the prediction window for the model. This is the number of time steps to be predicted by the model.
freq (str, default = "h") – The frequency of the time series data. This is used to determine the time steps for the model.
features (str = "MS") – Feature combination mode:
- "S": Single variable forecasting (target only)
- "M": Multivariate forecasting, using all variables
- "MS": Multivariate to single, using all variables to predict target
add_relative_time_idx (bool = False) – Whether to allow the relative time index to be used with the model.
add_target_scales (bool = False) – Whether to add target scaling info.
target_normalizer (Union[NORMALIZER, str, list[NORMALIZER], tuple[NORMALIZER], None], default="auto") – Normalizer for the target variable. If "auto", uses RobustScaler.
scalers (Optional[dict[str, Union[StandardScaler, RobustScaler, TorchNormalizer]]], default=None) – Dictionary of feature scalers.
shuffle (bool, default=True) – Whether to shuffle the data at every epoch.
window_stride (int, default=1) – The stride for the sliding window. This is used to create overlapping windows for the data.
batch_size (int, default=32) – Batch size for dataloader.
num_workers (int, default=0) – Number of workers for dataloader.
train_val_test_split (tuple, default=(0.7, 0.15, 0.15)) – Proportions for train, validation, and test dataset splits.
collate_fn (Optional[callable], default=None) – Custom collate function for the dataloader.
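The interplay of `context_length`, `prediction_length`, and `window_stride` can be sketched in plain Python. This assumes the usual sliding-window scheme, in which each sample is a context slice followed directly by a prediction slice. The function `sliding_windows` is hypothetical, and the data module's actual indexing may differ in detail.

```python
def sliding_windows(n_steps, context_length, prediction_length, window_stride=1):
    """Return (context_range, prediction_range) index pairs for one series."""
    window = context_length + prediction_length
    return [
        (range(start, start + context_length),
         range(start + context_length, start + window))
        for start in range(0, n_steps - window + 1, window_stride)
    ]


# A series of 10 steps, 4 input steps, 2 steps to predict, stride 1.
windows = sliding_windows(n_steps=10, context_length=4, prediction_length=2)
print(len(windows))  # 5
context, prediction = windows[0]
print(list(context), list(prediction))  # [0, 1, 2, 3] [4, 5]

# A larger stride yields fewer, less overlapping windows.
strided = sliding_windows(10, 4, 2, window_stride=2)
print([ctx.start for ctx, _ in strided])  # [0, 2, 4]
```

Under this scheme a series shorter than `context_length + prediction_length` produces no windows at all, which is worth checking before training on short series.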

prepare_data_per_node

If True, each LOCAL_RANK=0 will call prepare data. Otherwise only NODE_RANK=0, LOCAL_RANK=0 will prepare data.

Type: bool

allow_zero_length_dataloader_with_multiple_devices

If True, dataloader with zero length within local rank is allowed. Default value is False.

Type: bool
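A hedged usage sketch for this class, assuming the pytorch_forecasting v2 beta is installed and that a `TimeSeries` dataset has already been built from your data. The keyword names come from the signature above. The values and the helper `build_datamodule` are illustrative only, and since the module is marked experimental, details may change.

```python
# Keyword arguments mirror the documented signature; the values are illustrative.
datamodule_kwargs = dict(
    context_length=24,        # time steps fed to the model
    prediction_length=6,      # time steps to forecast
    freq="h",
    target_normalizer="auto",
    window_stride=1,
    batch_size=32,
    train_val_test_split=(0.7, 0.15, 0.15),
)


def build_datamodule(time_series_dataset, **kwargs):
    """Hypothetical helper: wrap a D1 TimeSeries dataset in the D2 data module."""
    from pytorch_forecasting.data.data_module._tslib_data_module import (
        TslibDataModule,
    )

    datamodule = TslibDataModule(time_series_dataset=time_series_dataset, **kwargs)
    datamodule.setup(stage="fit")  # standard LightningDataModule hook
    return datamodule
```

After `build_datamodule(dataset, **datamodule_kwargs)`, calling `train_dataloader()` on the result should yield training batches. In a typical Lightning workflow the trainer calls `setup` itself, so the explicit call is only needed when inspecting batches by hand.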

API Reference

See the detailed API documentation for the V2 data classes below:

`data.encoders.EncoderNormalizer`([method, ...])	Special Normalizer that is fit on each encoding sequence.
`data.encoders.GroupNormalizer`([method, ...])	Normalizer that scales by groups.
`data.encoders.MultiNormalizer`(normalizers)	Normalizer for multiple targets.
`data.encoders.NaNLabelEncoder`([add_nan, warn])	Labelencoder that can optionally always encode nan and unknown classes (in transform) as class `0`
`data.encoders.TorchNormalizer`([method, ...])	Basic target transformer that can be fit also on torch tensors.
`data.samplers.TimeSynchronizedBatchSampler`(sampler)	Samples mini-batches randomly but in a time-synchronised manner.
`data.samplers.GroupedSampler`(sampler[, ...])	Samples mini-batches randomly but in a grouped manner.
`data.timeseries._timeseries_v2.TimeSeries`(data)	PyTorch Dataset for time series data stored in pandas DataFrame.
`data.data_module._encoder_decoder_data_module.EncoderDecoderTimeSeriesDataModule`(...)	Lightning DataModule for processing time series data in an encoder-decoder format.
`data.data_module._tslib_data_module.TslibDataModule`(...)	Experimental data module for integrating tslib time series into PyTorch Forecasting.