Load Data

Submodules

kale.loaddata.avmnist_datasets module

Dataset setting and data loader for AVMNIST dataset by refactoring https://github.com/pliang279/MultiBench/blob/main/datasets/avmnist/get_data.py

class kale.loaddata.avmnist_datasets.AVMNISTDataset(data_dir, batch_size=40, flatten_audio=False, flatten_image=False, unsqueeze_channel=True, normalize_image=True, normalize_audio=True)

Bases: object

This class loads the AVMNIST data stored in a specified directory and prepares it for training, validation, and testing. It also handles pre-processing steps such as reshaping and normalizing the data according to the provided arguments: flattening the audio and image data, normalizing the image and audio data, and adding a dimension that typically represents the channel. The class also splits the data into training and validation sets and provides separate data loaders for the training, validation, and testing sets, which can be used to iterate over the data during model training and evaluation. This simplifies data preparation for multimodal learning tasks, allowing the user to focus on model architecture and hyperparameter tuning.

Parameters:
  • data_dir (str) – Directory of data.

  • batch_size (int, optional) – Batch size. Defaults to 40.

  • flatten_audio (bool, optional) – Whether to flatten audio data or not. Defaults to False.

  • flatten_image (bool, optional) – Whether to flatten image data or not. Defaults to False.

  • unsqueeze_channel (bool, optional) – Whether to unsqueeze any channels or not. Defaults to True.

  • normalize_image (bool, optional) – Whether to normalize the images before returning. Defaults to True.

  • normalize_audio (bool, optional) – Whether to normalize the audio before returning. Defaults to True.

load_data()
get_train_loader(shuffle=True)
get_valid_loader(shuffle=False)
get_test_loader(shuffle=False)

kale.loaddata.dataset_access module

Dataset Access API adapted from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_access.py

class kale.loaddata.dataset_access.DatasetAccess(n_classes)

Bases: object

This class ensures that a single, uniform API is used to access the training, validation and test splits of any dataset.

Parameters:

n_classes (int) – the number of classes.

n_classes()
get_train()

Returns: a torch.utils.data.Dataset

get_train_valid(valid_ratio)

Randomly split a dataset into non-overlapping training and validation datasets.

Parameters:

valid_ratio (float) – the ratio for validation set

Returns:

a torch.utils.data.Dataset

Return type:

dataset

get_test()
kale.loaddata.dataset_access.get_class_subset(dataset, class_ids)
Parameters:
  • dataset – a torch.utils.data.Dataset

  • class_ids (list, optional) – List of chosen subset of class ids.

Returns:

a torch.utils.data.Dataset

Return type:

dataset
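As a rough illustration of what selecting a class subset means, the sketch below filters a list of (sample, label) pairs by label in plain Python. The helper name select_class_indices and the toy data are hypothetical; this is not the library's implementation, which operates on and returns a torch.utils.data.Dataset.

```python
def select_class_indices(dataset, class_ids):
    """Return indices of samples whose label is in class_ids (hypothetical helper)."""
    wanted = set(class_ids)
    return [i for i, (_, label) in enumerate(dataset) if label in wanted]


# Toy list of (sample, label) pairs standing in for a real Dataset.
toy_dataset = [("a", 0), ("b", 1), ("c", 2), ("d", 1), ("e", 0)]
subset_indices = select_class_indices(toy_dataset, class_ids=[0, 2])
subset = [toy_dataset[i] for i in subset_indices]
```

Keeping indices rather than copies is a common design choice, since an index list can be wrapped by a subset view over the original data.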

kale.loaddata.dataset_access.split_by_ratios(dataset, split_ratios)

Randomly split a dataset into non-overlapping new datasets of given ratios.

Parameters:
  • dataset (torch.utils.data.Dataset, list, or Tensor) – Dataset or data indices to be split.

  • split_ratios (list) – Ratios of splits to be produced, where 0 < sum(split_ratios) <= 1.

Returns:

A list of subsets.

Return type:

[List]

Examples

>>> import torch
>>> from kale.loaddata.dataset_access import split_by_ratios
>>> subset1, subset2 = split_by_ratios(range(10), [0.3, 0.7])
>>> len(subset1)
3
>>> len(subset2)
7
>>> subset1, subset2 = split_by_ratios(range(10), [0.3])
>>> len(subset1)
3
>>> len(subset2)
7
>>> subset1, subset2, subset3 = split_by_ratios(range(10), [0.3, 0.3])
>>> len(subset1)
3
>>> len(subset2)
3
>>> len(subset3)
4
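The behaviour shown in the examples above (an extra subset holding the remainder when the ratios sum to less than 1) can be mimicked in plain Python. The function split_by_ratios_sketch below is a hypothetical stand-in written only to illustrate the idea; its rounding rule and seeding are assumptions, not the library's code.

```python
import random


def split_by_ratios_sketch(items, split_ratios, seed=0):
    """Shuffle items and cut them into consecutive, non-overlapping chunks."""
    total = sum(split_ratios)
    if not 0 < total <= 1 + 1e-9:
        raise ValueError("split_ratios must satisfy 0 < sum(split_ratios) <= 1")
    items = list(items)
    n = len(items)
    random.Random(seed).shuffle(items)

    # Assumption: each length is floor(ratio * n), with a tiny float tolerance.
    lengths = [int(ratio * n + 1e-9) for ratio in split_ratios]
    leftover = n - sum(lengths)
    if abs(total - 1) < 1e-9:
        lengths[-1] += leftover   # ratios cover everything: last subset absorbs rounding
    else:
        lengths.append(leftover)  # ratios sum to < 1: remainder becomes an extra subset

    subsets, start = [], 0
    for length in lengths:
        subsets.append(items[start:start + length])
        start += length
    return subsets
```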

kale.loaddata.image_access module

kale.loaddata.mnistm module

Dataset setting and data loader for MNIST-M, from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_mnistm.py (based on https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py) CREDIT: https://github.com/corenel

class kale.loaddata.mnistm.MNISTM(root, train=True, transform=None, target_transform=None, download=False)

Bases: Dataset

MNIST-M Dataset. Auto-downloads the dataset and provides the torch Dataset API.

Parameters:
  • root (str) – path to the directory where the MNISTM folder will be created (or already exists).

  • train (bool, optional) – defaults to True. If True, loads the training data. Otherwise, loads the test data.

  • transform (callable, optional) – defaults to None. A function/transform that takes in a PIL image and returns a transformed version, e.g., transforms.RandomCrop. This preprocessing function is applied to all images (whether source or target).

  • target_transform (callable, optional) – defaults to None. Similar to transform. This preprocessing function is applied to all target images, after transform.

  • download (bool, optional) – defaults to False. Whether to allow downloading the data if not found on disk.

url = 'https://github.com/VanushVaswani/keras_mnistm/releases/download/1.0/keras_mnistm.pkl.gz'
raw_folder = 'raw'
processed_folder = 'processed'
training_file = 'mnist_m_train.pt'
test_file = 'mnist_m_test.pt'
download()

Download the MNISTM data.

kale.loaddata.molecular_datasets module

Dataset setting and data loader for BindingDB, BioSNAP and Human datasets, by refactoring https://github.com/peizhenbai/DrugBAN/blob/main/dataloader.py

kale.loaddata.molecular_datasets.graph_collate_func(x)

Custom collate function for PyTorch DataLoader to batch drug-protein interaction samples.

Each sample in the input list x is a tuple containing:
  • a PyTorch Geometric Data object representing a drug molecular graph,

  • a protein sequence represented as a tensor or array,

  • a label (e.g., interaction score or binary classification target).

This function:
  • batches the molecular graphs using Batch.from_data_list,

  • stacks the protein tensors into a single tensor,

  • stacks the labels into a single tensor.

Parameters:

x (list of tuples) – Each tuple contains (drug_graph, protein_tensor, label).

Returns:
  • drug (torch_geometric.data.Batch) – A batched PyTorch Geometric Batch object of drug molecular graphs.

  • protein (torch.Tensor) – A 2D tensor of protein sequence features, shape (batch_size, sequence_length).

  • label (torch.Tensor) – A 1D or 2D tensor of labels, depending on the task.
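The regrouping step at the heart of such a collate function can be sketched without PyTorch. In the illustration below, collate_sketch is a hypothetical name, dictionaries stand in for molecular graphs, and plain lists stand in for the batched and stacked tensors that the real function returns.

```python
def collate_sketch(samples):
    """Regroup a list of (drug, protein, label) tuples field by field."""
    drugs, proteins, labels = zip(*samples)
    # A real collate function would batch the graphs and stack the tensors;
    # plain Python lists are used here purely for illustration.
    return list(drugs), [list(p) for p in proteins], list(labels)


toy_batch = [
    ({"num_nodes": 3}, [1, 2, 0], 1.0),
    ({"num_nodes": 5}, [4, 0, 0], 0.0),
]
drug_batch, protein_batch, label_batch = collate_sketch(toy_batch)
```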

kale.loaddata.molecular_datasets.smiles_to_graph(smiles, max_drug_nodes)

Converts a SMILES string into a padded PyTorch Geometric molecular graph.

Parameters:
  • smiles (str) – SMILES representation of a molecule.

  • max_drug_nodes (int) – Maximum number of nodes in the graph. If the actual number is smaller, virtual (zero-feature) nodes are added.

Returns:

A PyTorch Geometric Data object containing:
  • x: Node feature matrix

  • edge_index: Edge connectivity

  • edge_attr: Edge feature matrix

  • num_nodes: Total number of nodes (including virtual nodes)

Return type:

Data
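The padding with virtual nodes described for max_drug_nodes can be pictured on a plain nested list. The helper pad_node_features below is hypothetical, and raising an error for oversized molecules is an assumption of this sketch rather than documented behaviour.

```python
def pad_node_features(node_features, max_drug_nodes):
    """Append zero-feature virtual nodes until there are max_drug_nodes rows."""
    num_real = len(node_features)
    if num_real > max_drug_nodes:
        # Assumption: this sketch rejects molecules that do not fit.
        raise ValueError("molecule has more nodes than max_drug_nodes")
    feature_dim = len(node_features[0]) if node_features else 0
    virtual_nodes = [[0.0] * feature_dim for _ in range(max_drug_nodes - num_real)]
    return node_features + virtual_nodes


real_nodes = [[1.0, 0.0], [0.0, 1.0]]  # two atoms with 2-dimensional toy features
padded_nodes = pad_node_features(real_nodes, max_drug_nodes=4)
```

Padding every graph to the same node count gives all samples in a batch the same shape, at the cost of storing rows that carry no information.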

class kale.loaddata.molecular_datasets.DTIDataset(list_ids, df, max_drug_nodes=290)

Bases: Dataset

kale.loaddata.multi_domain module

kale.loaddata.multiomics_datasets module

kale.loaddata.polypharmacy_datasets module

kale.loaddata.sampler module

kale.loaddata.tabular_access module

Authors: Lawrence Schobs, lawrenceschobs@gmail.com

Functions for accessing tabular data.

kale.loaddata.tabular_access.load_csv_columns(datapath: str, split: str, fold: int | List[int], cols_to_return: str | List[str] = 'All') → DataFrame

Reads a CSV file of data and returns samples where the value of the specified split column is contained in the fold variable. The columns specified in cols_to_return are returned.

Parameters:
  • datapath – The path to the CSV file of data.

  • split – The column name for the split (e.g. “Validation”, “Testing”).

  • fold – The fold/s contained in the split column to return. Can be a single integer or a list of integers.

  • cols_to_return – Which columns to return. If set to “All”, returns all columns.

Returns:

A tuple of two pandas DataFrames: the first is the full DataFrame selected, and the second is the DataFrame with only the columns specified in cols_to_return.
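The row and column selection described above can be sketched with the standard csv module alone. The function select_rows, the toy CSV, and its uid and error columns are hypothetical; the real function reads from a file path and works with pandas.

```python
import csv
import io


def select_rows(csv_text, split, fold, cols_to_return="All"):
    """Keep rows whose value in the split column is one of the requested folds."""
    folds = {fold} if isinstance(fold, int) else set(fold)
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for row in reader if int(row[split]) in folds]
    if cols_to_return == "All":
        return rows
    cols = [cols_to_return] if isinstance(cols_to_return, str) else list(cols_to_return)
    return [{col: row[col] for col in cols} for row in rows]


toy_csv = "uid,error,Validation,Testing\na,0.1,0,1\nb,0.4,1,0\nc,0.2,1,2\n"
validation_rows = select_rows(toy_csv, split="Validation", fold=1, cols_to_return=["uid", "error"])
testing_rows = select_rows(toy_csv, split="Testing", fold=[1, 2])
```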

kale.loaddata.signal_access module

kale.loaddata.signal_access.load_ecg_from_folder(base_path, csv_file)

Loads and preprocesses a batch of ECG signals from a CSV file listing file paths.

Parameters:
  • base_path (str) – Root directory containing ECG files.

  • csv_file (str) – CSV file listing files in column ‘path’.

Returns:

Batch of preprocessed ECG signals, shape (N, 1, total_samples).

Return type:

Tensor

Example

ecg_tensor = load_ecg_from_folder("/data/ecg/", "ecg_files.csv")

kale.loaddata.signal_image_access module

class kale.loaddata.signal_image_access.SignalImageDataset(signal_features, image_features)

Bases: Dataset

SignalImageDataset prepares paired signal (e.g., ECG) and image (e.g., CXR) features for multimodal deep learning tasks.

This class simplifies data preparation by accepting two tensors: one for signal features and one for image features. Each sample returned by the dataset consists of a pair of (signal_features, image_features) at the same index, making it suitable for tasks where both modalities are required as input (such as multimodal classification, reconstruction, or representation learning).

Parameters:
  • signal_features (Tensor or ndarray) – Tensor containing the signal features for all samples.

  • image_features (Tensor or ndarray) – Tensor containing the image features for all samples.

Usage:

dataset = SignalImageDataset(signal_features, image_features)
signal, image = dataset[0]
# Can be used with DataLoader for batching in model training.

Returns:

(signal_features, image_features) for the requested sample index.

Return type:

Tuple

classmethod prepare_data_loaders(signal_features, image_features, train_ratio=0.8, random_seed=None)

Splits the dataset into training and validation subsets.

Parameters:
  • signal_features (Tensor or ndarray) – Tensor containing the signal features.

  • image_features (Tensor or ndarray) – Tensor containing the image features.

  • train_ratio (float, optional) – Ratio of the training set (e.g., 0.8 for 80% train, 20% val). Default is 0.8.

  • random_seed (int, optional) – Seed for reproducibility.

Returns:
  • train_dataset (SignalImageDataset) – Training subset.

  • val_dataset (SignalImageDataset) – Validation subset.
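The pairing-by-index and seeded train/validation split can be illustrated in plain Python. PairedDataset and split_train_val below are hypothetical stand-ins using lists instead of tensors; how the library rounds the split size is an assumption here.

```python
import random


class PairedDataset:
    """Index-aligned pairs of signal and image features (illustrative stand-in)."""

    def __init__(self, signal_features, image_features):
        if len(signal_features) != len(image_features):
            raise ValueError("both modalities need the same number of samples")
        self.signal_features = signal_features
        self.image_features = image_features

    def __len__(self):
        return len(self.signal_features)

    def __getitem__(self, index):
        return self.signal_features[index], self.image_features[index]


def split_train_val(dataset, train_ratio=0.8, random_seed=None):
    """Shuffle indices with a seed, then cut them into train and validation parts."""
    indices = list(range(len(dataset)))
    random.Random(random_seed).shuffle(indices)
    n_train = int(train_ratio * len(dataset))  # assumption: round down
    train = [dataset[i] for i in indices[:n_train]]
    val = [dataset[i] for i in indices[n_train:]]
    return train, val


signals = [[float(i)] for i in range(10)]
images = [[float(i) * 10] for i in range(10)]
paired = PairedDataset(signals, images)
train_pairs, val_pairs = split_train_val(paired, train_ratio=0.8, random_seed=0)
```

Shuffling indices rather than the two feature collections separately keeps each signal aligned with its image after the split.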

kale.loaddata.tdc_datasets module

kale.loaddata.usps module

Dataset setting and data loader for USPS, from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_usps.py (based on https://github.com/mingyuliutw/CoGAN/blob/master/cogan_pytorch/src/dataset_usps.py)

class kale.loaddata.usps.USPS(root, train=True, transform=None, download=False)

Bases: Dataset

USPS Dataset.

Parameters:
  • root (string) – Root directory of the dataset where the dataset file exists.

  • train (bool, optional) – If True, resample from dataset randomly.

  • download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g., transforms.RandomCrop.

url = 'https://raw.githubusercontent.com/mingyuliutw/CoGAN/master/cogan_pytorch/data/uspssample/usps_28x28.pkl'
download()

Download dataset.

load_samples()

Load sample images from dataset.

kale.loaddata.video_access module

kale.loaddata.video_datasets module

class kale.loaddata.video_datasets.BasicVideoDataset(root_path: str, annotationfile_path: str, dataset_split: str, image_modality: str, num_segments: int = 1, frames_per_segment: int = 16, imagefile_template: str = 'img_{:010d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False, n_classes: int = 8)

Bases: VideoFrameDataset

Dataset for GTEA, ADL and KITCHEN.

Parameters:
  • root_path (string) – The root path in which video folders lie.

  • annotationfile_path (string) – The annotation file containing one row per video sample.

  • dataset_split (string) – Split type (train or test)

  • image_modality (string) – Image modality (RGB or Optical Flow)

  • num_segments (int) – The number of segments the video should be divided into to sample frames from.

  • frames_per_segment (int) – The number of frames that should be loaded per segment.

  • imagefile_template (string) – The image filename template.

  • transform (Compose) – Video transform.

  • random_shift (bool) – Whether the frames from each segment should be taken consecutively starting from the center of the segment (False), or consecutively starting from a random location inside the segment range (True).

  • test_mode (bool) – Whether this is a test dataset. If so, chooses frames from segments with random_shift=False.

  • n_classes (int) – The number of classes.

make_dataset()

Load data from the EPIC-Kitchen list file and convert it into the unified format. Different datasets correspond to a different number of classes.

Returns:

data – list of (video_name, start_frame, end_frame, label)

Return type:

list

class kale.loaddata.video_datasets.EPIC(root_path: str, annotationfile_path: str, dataset_split: str, image_modality: str, num_segments: int = 1, frames_per_segment: int = 16, imagefile_template: str = 'img_{:010d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False, n_classes: int = 8)

Bases: VideoFrameDataset

Dataset for EPIC-Kitchen.

make_dataset()

Load data from the EPIC-Kitchen list file and convert it into the unified format. Because the original list files are not the same, this method is a modified version of the one in class BasicVideoDataset.

kale.loaddata.video_multi_domain module

kale.loaddata.videos module

class kale.loaddata.videos.VideoFrameDataset(root_path: str, annotationfile_path: str, image_modality: str = 'rgb', num_segments: int = 3, frames_per_segment: int = 1, imagefile_template: str = 'img_{:05d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False)

Bases: Dataset

A highly efficient and adaptable dataset class for videos. Instead of loading every frame of a video, loads x RGB frames of a video (sparse temporal sampling) and evenly chooses those frames from start to end of the video, returning a list of x PIL images or FRAMES x CHANNELS x HEIGHT x WIDTH tensors where FRAMES=x if the kale.prepdata.video_transform.ImglistToTensor() transform is used.

More specifically, the frame range [START_FRAME, END_FRAME] is divided into NUM_SEGMENTS segments and FRAMES_PER_SEGMENT consecutive frames are taken from each segment.

Note

A demonstration of using this class can be seen in PyKale/examples/video_loading https://github.com/pykale/pykale/tree/master/examples/video_loading

Note

This dataset broadly corresponds to the frame sampling technique introduced in Temporal Segment Networks at ECCV2016 https://arxiv.org/abs/1608.00859.

Note

This class relies on receiving video data in a structure where inside a ROOT_DATA folder, each video lies in its own folder, where each video folder contains the frames of the video as individual files with a naming convention such as img_001.jpg … img_059.jpg. For enumeration and annotations, this class expects to receive the path to a .txt file where each video sample has a row with four (or more in the case of multi-label, see example README on Github) space separated values: VIDEO_FOLDER_PATH     START_FRAME     END_FRAME     LABEL_INDEX. VIDEO_FOLDER_PATH is expected to be the path of a video folder excluding the ROOT_DATA prefix. For example, ROOT_DATA might be home\data\datasetxyz\videos\, inside of which a VIDEO_FOLDER_PATH might be jumping\0052\ or sample1\ or 00053\.
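To make the annotation row layout concrete, the sketch below parses one space-separated row into its fields. The function parse_annotation_row, the toy row, and the returned dictionary keys are hypothetical and only illustrate the format described in the note above.

```python
import os


def parse_annotation_row(row, root_path):
    """Split one annotation row into folder path, frame range, and label(s)."""
    fields = row.split()
    folder, start_frame, end_frame = fields[0], int(fields[1]), int(fields[2])
    labels = [int(value) for value in fields[3:]]  # several values for multi-label rows
    return {
        "path": os.path.join(root_path, folder),
        "start_frame": start_frame,
        "end_frame": end_frame,
        "num_frames": end_frame - start_frame + 1,  # the frame range is inclusive
        "labels": labels,
    }


record = parse_annotation_row("sample1 1 59 3", root_path="videos")
multi_label_record = parse_annotation_row("sample1 1 59 3 7", root_path="videos")
```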

Parameters:
  • root_path – The root path in which video folders lie. This is ROOT_DATA from the description above.

  • annotationfile_path – The .txt annotation file containing one row per video sample as described above.

  • image_modality – Image modality (RGB or Optical Flow).

  • num_segments – The number of segments the video should be divided into to sample frames from.

  • frames_per_segment – The number of frames that should be loaded per segment. For each segment’s frame-range, a random start index or the center is chosen, from which frames_per_segment consecutive frames are loaded.

  • imagefile_template – The image filename template that video frame files have inside of their video folders as described above.

  • transform – Transform pipeline that receives a list of PIL images/frames.

  • random_shift – Whether the frames from each segment should be taken consecutively starting from the center of the segment, or consecutively starting from a random location inside the segment range.

  • test_mode – Whether this is a test dataset. If so, chooses frames from segments with random_shift=False.
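The segment-based sampling controlled by num_segments, frames_per_segment, and random_shift can be sketched as follows. The function sample_frame_indices is hypothetical and its index arithmetic is an assumption chosen for clarity; the class may compute offsets differently.

```python
import random


def sample_frame_indices(start_frame, end_frame, num_segments, frames_per_segment,
                         random_shift=True, seed=None):
    """Pick frames_per_segment consecutive frame indices from each segment."""
    rng = random.Random(seed)
    num_frames = end_frame - start_frame + 1
    segment_length = num_frames // num_segments
    if segment_length < frames_per_segment:
        raise ValueError("segments are shorter than frames_per_segment")
    max_offset = segment_length - frames_per_segment

    indices = []
    for segment in range(num_segments):
        segment_start = start_frame + segment * segment_length
        # Random start inside the segment, or a fixed centered start.
        offset = rng.randint(0, max_offset) if random_shift else max_offset // 2
        first = segment_start + offset
        indices.extend(range(first, first + frames_per_segment))
    return indices


centered = sample_frame_indices(1, 60, num_segments=3, frames_per_segment=2, random_shift=False)
shifted = sample_frame_indices(1, 60, num_segments=3, frames_per_segment=2, random_shift=True, seed=0)
```

With a 60-frame video and three segments, only six frames are read instead of all 60, which is the point of sparse temporal sampling.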

kale.loaddata.few_shot module

Dataset class to load data for few-shot learning problems under \(N\)-way-\(K\)-shot settings.

Author: Wenrui Fan
Email: winslow.fan@outlook.com

class kale.loaddata.few_shot.NWayKShotDataset(path: str, mode: str = 'train', num_support_samples: int = 5, num_query_samples: int = 15, transform: Callable | None = None)

Bases: Dataset

This Dataset class loads data for few-shot learning problems under \(N\)-way-\(K\)-shot settings.

  • \(N\)-way: The number of classes under a particular setting. The model is presented with samples from these \(N\) classes and needs to classify them. For example, 3-way means the model has to classify 3 different classes.

  • \(K\)-shot: The number of samples for each class in the support set. For example, in a 2-shot setting, two support samples are provided per class.

  • Support set: It is a small, labeled dataset used to train the model with a few samples of each class. The support set consists of \(N\) classes (\(N\)-way), with \(K\) samples (\(K\)-shot) for each class. For example, under a 3-way-2-shot setting, the support set has 3 classes with 2 samples per class, totaling 6 samples.

  • Query set: It evaluates the model’s ability to generalize what it has learned from the support set. It contains samples from the same \(N\) classes that are not included in the support set. Continuing with the 3-way-2-shot example, the query set would include additional samples from the 3 classes, which the model must classify after learning from the support set.

In this class, __getitem__() returns a batch of images and labels for one class. When defining the training/validation/testing dataloaders, the batch size should be the number of classes (cfg.TRAIN.NUM_CLASSES/cfg.VAL.NUM_CLASSES). Therefore, __len__() returns the total number of classes in the dataset.
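The per-class sampling behind an \(N\)-way-\(K\)-shot episode can be sketched in plain Python. The function sample_class_episode, the toy file names, and the seeding are hypothetical; the sketch only shows how support and query samples of one class stay disjoint and how batching over classes yields the \(N\)-way episode.

```python
import random


def sample_class_episode(class_items, num_support_samples=5, num_query_samples=15, seed=None):
    """Draw disjoint support and query samples from the items of one class."""
    needed = num_support_samples + num_query_samples
    if len(class_items) < needed:
        raise ValueError("not enough samples in this class")
    chosen = random.Random(seed).sample(class_items, needed)
    return chosen[:num_support_samples], chosen[num_support_samples:]


# Three toy classes: batching over them gives a 3-way episode.
toy_classes = {name: [f"{name}_{i}.png" for i in range(6)] for name in ("cat", "dog", "owl")}
episode = {
    name: sample_class_episode(items, num_support_samples=2, num_query_samples=3, seed=0)
    for name, items in toy_classes.items()
}
support_total = sum(len(support) for support, _ in episode.values())
```

Here the 3-way-2-shot setting yields 6 support samples in total, matching the example in the description above.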

Note

The dataset should be organized as:

  • root
    • train
      • class_name 1
        • xxx.png

        • yyy.png

      • class_name 2
        • xxx.png

        • yyy.png

    • val
      • class_name m
        • xxx.png

        • yyy.png

      • class_name m+1
        • xxx.png

        • yyy.png

    • test
      • class_name n
        • xxx.png

        • yyy.png

      • class_name n+1
        • xxx.png

        • yyy.png

Parameters:
  • path (string) – The root directory of the data.

  • mode (string) – The mode of the type of dataset. It can be “train”, “val”, or “test”. Default: “train”.

  • num_support_samples (int) – Number of samples per class in the support set. It corresponds to \(K\) in the \(N\)-way-\(K\)-shot setting. Default: 5.

  • num_query_samples (int) – Number of samples per class in the query set. Default: 15.

  • transform (callable, optional) – Transform of images. Default: None.

Module contents