Load Data
Submodules
kale.loaddata.avmnist_datasets module
Dataset setting and data loader for AVMNIST dataset by refactoring https://github.com/pliang279/MultiBench/blob/main/datasets/avmnist/get_data.py
- class kale.loaddata.avmnist_datasets.AVMNISTDataset(data_dir, batch_size=40, flatten_audio=False, flatten_image=False, unsqueeze_channel=True, normalize_image=True, normalize_audio=True)
Bases:
object
This class loads the AVMNIST data stored in a specified directory and prepares it for training, validation, and testing. It also handles pre-processing steps such as reshaping and normalizing the data, based on the provided arguments: flattening the audio and image data, normalizing the image and audio data, and adding a dimension to the data, often used to represent the channel in image or audio data. The class also splits the data into training and validation sets and provides separate data loaders for the training, validation, and testing sets, which can be used to iterate over the data during model training and evaluation. This simplifies data preparation for multimodal learning tasks, allowing the user to focus on model architecture and hyperparameter tuning.
- Parameters:
data_dir (str) – Directory of data.
batch_size (int, optional) – Batch size. Defaults to 40.
flatten_audio (bool, optional) – Whether to flatten audio data or not. Defaults to False.
flatten_image (bool, optional) – Whether to flatten image data or not. Defaults to False.
unsqueeze_channel (bool, optional) – Whether to unsqueeze any channels or not. Defaults to True.
normalize_image (bool, optional) – Whether to normalize the images before returning. Defaults to True.
normalize_audio (bool, optional) – Whether to normalize the audio before returning. Defaults to True.
- load_data()
- get_train_loader(shuffle=True)
- get_valid_loader(shuffle=False)
- get_test_loader(shuffle=False)
kale.loaddata.dataset_access module
Dataset Access API adapted from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_access.py
- class kale.loaddata.dataset_access.DatasetAccess(n_classes)
Bases:
object
This class ensures a unique API is used to access training, validation and test splits of any dataset.
- Parameters:
n_classes (int) – the number of classes.
- n_classes()
- get_train()
Returns: a torch.utils.data.Dataset
- get_train_valid(valid_ratio)
Randomly split a dataset into non-overlapping training and validation datasets.
- Parameters:
valid_ratio (float) – the ratio for validation set
- Returns:
a torch.utils.data.Dataset
- Return type:
dataset
- get_test()
- kale.loaddata.dataset_access.get_class_subset(dataset, class_ids)
- Parameters:
dataset – a torch.utils.data.Dataset
class_ids (list, optional) – List of chosen subset of class ids.
- Returns:
a torch.utils.data.Dataset
- Return type:
dataset
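The idea of restricting a dataset to a chosen set of classes can be sketched without PyTorch. The helper below, `class_subset_indices`, is hypothetical and not part of kale. It assumes the labels are available as a plain sequence and returns the positions that a subset wrapper such as torch.utils.data.Subset could then index.

```python
from typing import Iterable, List, Sequence


def class_subset_indices(labels: Sequence[int], class_ids: Iterable[int]) -> List[int]:
    """Return positions of samples whose label is in class_ids (hypothetical helper)."""
    wanted = set(class_ids)  # set membership keeps the filter linear in len(labels)
    return [i for i, label in enumerate(labels) if label in wanted]


labels = [0, 1, 2, 1, 0, 3]
indices = class_subset_indices(labels, [1, 3])
print(indices)  # [1, 3, 5]
```

Whether get_class_subset filters in exactly this way is not stated here; the sketch only illustrates the selection step.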
- kale.loaddata.dataset_access.split_by_ratios(dataset, split_ratios)
Randomly split a dataset into non-overlapping new datasets of given ratios.
- Parameters:
dataset (torch.utils.data.Dataset, list, or Tensor) – Dataset or data indices to be split.
split_ratios (list) – Ratios of splits to be produced, where 0 < sum(split_ratios) <= 1.
- Returns:
A list of subsets.
- Return type:
[List]
Examples
>>> import torch
>>> from kale.loaddata.dataset_access import split_by_ratios
>>> subset1, subset2 = split_by_ratios(range(10), [0.3, 0.7])
>>> len(subset1)
3
>>> len(subset2)
7
>>> subset1, subset2 = split_by_ratios(range(10), [0.3])
>>> len(subset1)
3
>>> len(subset2)
7
>>> subset1, subset2, subset3 = split_by_ratios(range(10), [0.3, 0.3])
>>> len(subset1)
3
>>> len(subset2)
3
>>> len(subset3)
4
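The examples suggest that each ratio is turned into a floored subset size and that any leftover samples form one extra, final subset. A minimal sketch of that size calculation follows. The function name `split_sizes` is hypothetical, and the flooring and tolerance check are assumptions inferred from the documented outputs rather than taken from the library.

```python
import math
from typing import List, Sequence


def split_sizes(n_total: int, split_ratios: Sequence[float]) -> List[int]:
    """Compute subset sizes for the given ratios (hypothetical helper)."""
    ratio_sum = sum(split_ratios)
    sums_to_one = math.isclose(ratio_sum, 1.0)
    if ratio_sum <= 0 or (ratio_sum > 1 and not sums_to_one):
        raise ValueError("sum(split_ratios) must be in (0, 1]")
    # If the ratios already cover the whole dataset, the last subset simply
    # takes whatever remains, so its own ratio is not needed.
    ratios = list(split_ratios[:-1]) if sums_to_one else list(split_ratios)
    sizes = [int(n_total * ratio) for ratio in ratios]
    sizes.append(n_total - sum(sizes))  # the remainder becomes the final subset
    return sizes


print(split_sizes(10, [0.3, 0.7]))  # [3, 7]
print(split_sizes(10, [0.3]))       # [3, 7]
print(split_sizes(10, [0.3, 0.3]))  # [3, 3, 4]
```

The resulting sizes could then be passed to a random splitter such as torch.utils.data.random_split.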
kale.loaddata.image_access module
kale.loaddata.mnistm module
Dataset setting and data loader for MNIST-M, from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_mnistm.py (based on https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py) CREDIT: https://github.com/corenel
- class kale.loaddata.mnistm.MNISTM(root, train=True, transform=None, target_transform=None, download=False)
Bases:
Dataset
MNIST-M Dataset. Auto-downloads the dataset and provides the torch Dataset API.
- Parameters:
root (str) – path to the directory where the MNISTM folder will be created (or already exists).
train (bool, optional) – defaults to True. If True, loads the training data. Otherwise, loads the test data.
transform (callable, optional) – defaults to None. A function/transform that takes in a PIL image and returns a transformed version, e.g., transforms.RandomCrop. This preprocessing function is applied to all images (whether source or target).
target_transform (callable, optional) – defaults to None. Similar to transform. This preprocessing function is applied to all target images, after transform.
download (bool, optional) – defaults to False. Whether to allow downloading the data if not found on disk.
- url = 'https://github.com/VanushVaswani/keras_mnistm/releases/download/1.0/keras_mnistm.pkl.gz'
- raw_folder = 'raw'
- processed_folder = 'processed'
- training_file = 'mnist_m_train.pt'
- test_file = 'mnist_m_test.pt'
- download()
Download the MNISTM data.
kale.loaddata.molecular_datasets module
Dataset setting and data loader for BindingDB, BioSNAP and Human datasets, by refactoring https://github.com/peizhenbai/DrugBAN/blob/main/dataloader.py
- kale.loaddata.molecular_datasets.graph_collate_func(x)
Custom collate function for PyTorch DataLoader to batch drug-protein interaction samples.
- Each sample in the input list x is a tuple containing:
a PyTorch Geometric Data object representing a drug molecular graph,
a protein sequence represented as a tensor or array,
a label (e.g., interaction score or binary classification target).
- This function:
batches the molecular graphs using Batch.from_data_list,
stacks the protein tensors into a single tensor,
stacks the labels into a single tensor.
- Parameters:
x (list of tuples) – Each tuple contains (drug_graph, protein_tensor, label).
- Returns:
drug (torch_geometric.data.Batch) – A batched PyTorch Geometric Batch object of drug molecular graphs.
protein (torch.Tensor) – A 2D tensor of protein sequence features, shape (batch_size, sequence_length).
label (torch.Tensor) – A 1D or 2D tensor of labels, depending on the task.
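The unzip-then-batch pattern described above can be illustrated without PyTorch or PyTorch Geometric. In this sketch, `collate_triples` is a hypothetical stand-in: plain lists take the place of Batch.from_data_list and of tensor stacking, and strings take the place of molecular graphs.

```python
from typing import Any, List, Sequence, Tuple


def collate_triples(
    samples: Sequence[Tuple[Any, Sequence[int], float]]
) -> Tuple[List[Any], List[List[int]], List[float]]:
    """Unzip (drug_graph, protein, label) samples into three parallel batches."""
    drugs, proteins, labels = zip(*samples)
    # In the documented function the graphs are batched with Batch.from_data_list
    # and proteins/labels become tensors; lists stand in for all three here.
    return list(drugs), [list(p) for p in proteins], list(labels)


samples = [("graph_a", [1, 2, 3], 1.0), ("graph_b", [4, 5, 6], 0.0)]
drugs, proteins, labels = collate_triples(samples)
print(drugs)     # ['graph_a', 'graph_b']
print(proteins)  # [[1, 2, 3], [4, 5, 6]]
print(labels)    # [1.0, 0.0]
```

A function of this shape is what a PyTorch DataLoader expects for its collate_fn argument.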
- kale.loaddata.molecular_datasets.smiles_to_graph(smiles, max_drug_nodes)
Converts a SMILES string into a padded PyTorch Geometric molecular graph.
- Parameters:
smiles (str) – SMILES representation of a molecule.
max_drug_nodes (int) – Maximum number of nodes in the graph. If the actual number is smaller, virtual (zero-feature) nodes are added.
- Returns:
A PyTorch Geometric Data object containing: - x: Node feature matrix - edge_index: Edge connectivity - edge_attr: Edge feature matrix - num_nodes: Total number of nodes (including virtual nodes)
- Return type:
Data
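The padding step can be sketched on its own. The helper `pad_node_features` below is hypothetical and works on nested lists rather than tensors. Raising an error when a molecule has more than max_drug_nodes atoms is an assumption made for the sketch, since that case is not described here.

```python
from typing import List


def pad_node_features(node_feats: List[List[float]], max_nodes: int) -> List[List[float]]:
    """Append all-zero 'virtual' node rows until there are max_nodes rows."""
    if len(node_feats) > max_nodes:
        # Assumption: oversize graphs are rejected; the library may behave differently.
        raise ValueError("graph has more nodes than max_nodes")
    feat_dim = len(node_feats[0]) if node_feats else 0
    padding = [[0.0] * feat_dim for _ in range(max_nodes - len(node_feats))]
    return node_feats + padding


padded = pad_node_features([[1.0, 2.0], [3.0, 4.0]], max_nodes=4)
print(len(padded))  # 4
print(padded[-1])   # [0.0, 0.0]
```

Padding every graph to the same node count gives all drug inputs the same shape.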
- class kale.loaddata.molecular_datasets.DTIDataset(list_ids, df, max_drug_nodes=290)
Bases:
Dataset
kale.loaddata.multi_domain module
kale.loaddata.multiomics_datasets module
kale.loaddata.polypharmacy_datasets module
kale.loaddata.sampler module
kale.loaddata.tabular_access module
Authors: Lawrence Schobs, lawrenceschobs@gmail.com
Functions for accessing tabular data.
- kale.loaddata.tabular_access.load_csv_columns(datapath: str, split: str, fold: int | List[int], cols_to_return: str | List[str] = 'All') -> DataFrame
Reads a CSV file of data and returns samples where the value of the specified split column is contained in the fold variable. The columns specified in cols_to_return are returned.
- Parameters:
datapath – The path to the CSV file of data.
split – The column name for the split (e.g. “Validation”, “Testing”).
fold – The fold/s contained in the split column to return. Can be a single integer or a list of integers.
cols_to_return – Which columns to return. If set to “All”, returns all columns.
- Returns:
A tuple of two pandas DataFrames: the first is the full DataFrame selected, and the second is the DataFrame with only the columns specified in cols_to_return.
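The row and column selection described above can be illustrated with the standard library alone. The helper `select_rows` is hypothetical: it reads CSV text rather than a file path, returns plain dictionaries instead of a DataFrame, and returns only the column-filtered rows.

```python
import csv
import io
from typing import Dict, List, Sequence, Union


def select_rows(
    csv_text: str,
    split: str,
    fold: Union[int, Sequence[int]],
    cols_to_return: Union[str, Sequence[str]] = "All",
) -> List[Dict[str, str]]:
    """Keep rows whose `split` column is in `fold`, then keep the requested columns."""
    folds = {fold} if isinstance(fold, int) else set(fold)
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [dict(row) for row in reader if int(row[split]) in folds]
    if cols_to_return == "All":
        return rows
    cols = [cols_to_return] if isinstance(cols_to_return, str) else list(cols_to_return)
    return [{col: row[col] for col in cols} for row in rows]


text = "uid,Validation,error\na,0,0.1\nb,1,0.2\nc,0,0.3\n"
print(select_rows(text, "Validation", 0, ["uid", "error"]))
# [{'uid': 'a', 'error': '0.1'}, {'uid': 'c', 'error': '0.3'}]
```

With pandas, the same filter is typically written with a boolean mask built from isin on the split column.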
kale.loaddata.signal_access module
- kale.loaddata.signal_access.load_ecg_from_folder(base_path, csv_file)
Loads and preprocesses a batch of ECG signals from a CSV file listing file paths.
- Parameters:
base_path (str) – Root directory containing ECG files.
csv_file (str) – CSV file listing files in column ‘path’.
- Returns:
Batch of preprocessed ECG signals, shape (N, 1, total_samples).
- Return type:
Tensor
Example
ecg_tensor = load_ecg_from_folder("/data/ecg/", "ecg_files.csv")
kale.loaddata.signal_image_access module
- class kale.loaddata.signal_image_access.SignalImageDataset(signal_features, image_features)
Bases:
Dataset
SignalImageDataset prepares paired signal (e.g., ECG) and image (e.g., CXR) features for multimodal deep learning tasks.
This class simplifies data preparation by accepting two tensors: one for signal features and one for image features. Each sample returned by the dataset consists of a pair of (signal_features, image_features) at the same index, making it suitable for tasks where both modalities are required as input (such as multimodal classification, reconstruction, or representation learning).
- Parameters:
signal_features (Tensor or ndarray) – Tensor containing the signal features for all samples.
image_features (Tensor or ndarray) – Tensor containing the image features for all samples.
- Usage:
dataset = SignalImageDataset(signal_features, image_features)
signal, image = dataset[0]
# Can be used with DataLoader for batching in model training.
- Returns:
(signal_features, image_features) for the requested sample index.
- Return type:
Tuple
- classmethod prepare_data_loaders(signal_features, image_features, train_ratio=0.8, random_seed=None)
Splits the dataset into training and validation subsets.
- Parameters:
signal_features (Tensor or ndarray) – Tensor containing the signal features.
image_features (Tensor or ndarray) – Tensor containing the image features.
train_ratio (float, optional) – Ratio of the training set (e.g., 0.8 for 80% train, 20% val). Default is 0.8.
random_seed (int, optional) – Seed for reproducibility.
- Returns:
train_dataset (SignalImageDataset) – Training subset.
val_dataset (SignalImageDataset) – Validation subset.
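A seeded train/validation split over sample indices can be sketched as follows. The helper `split_indices` is hypothetical and uses only the standard library. The point it illustrates is that one shared set of indices is applied to both modalities, so each signal stays paired with its image.

```python
import random
from typing import List, Optional, Tuple


def split_indices(
    n_samples: int, train_ratio: float = 0.8, random_seed: Optional[int] = None
) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices with a seed and cut them into train/validation parts."""
    indices = list(range(n_samples))
    random.Random(random_seed).shuffle(indices)  # local RNG, global state untouched
    n_train = int(n_samples * train_ratio)
    return indices[:n_train], indices[n_train:]


signals = [f"ecg_{i}" for i in range(10)]
images = [f"cxr_{i}" for i in range(10)]
train_idx, val_idx = split_indices(len(signals), train_ratio=0.8, random_seed=0)
train_pairs = [(signals[i], images[i]) for i in train_idx]
print(len(train_idx), len(val_idx))  # 8 2
```

Passing the same random_seed reproduces the same split; how the library itself performs the split is not specified here.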
kale.loaddata.tdc_datasets module
kale.loaddata.usps module
Dataset setting and data loader for USPS, from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_usps.py (based on https://github.com/mingyuliutw/CoGAN/blob/master/cogan_pytorch/src/dataset_usps.py)
- class kale.loaddata.usps.USPS(root, train=True, transform=None, download=False)
Bases:
Dataset
USPS Dataset.
- Parameters:
root (string) – Root directory of the dataset, where the dataset file exists.
train (bool, optional) – If True, resample from dataset randomly.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g., transforms.RandomCrop.
- url = 'https://raw.githubusercontent.com/mingyuliutw/CoGAN/master/cogan_pytorch/data/uspssample/usps_28x28.pkl'
- download()
Download dataset.
- load_samples()
Load sample images from dataset.
kale.loaddata.video_access module
kale.loaddata.video_datasets module
- class kale.loaddata.video_datasets.BasicVideoDataset(root_path: str, annotationfile_path: str, dataset_split: str, image_modality: str, num_segments: int = 1, frames_per_segment: int = 16, imagefile_template: str = 'img_{:010d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False, n_classes: int = 8)
Bases:
VideoFrameDataset
Dataset for GTEA, ADL and KITCHEN.
- Parameters:
root_path (string) – The root path in which video folders lie.
annotationfile_path (string) – The annotation file containing one row per video sample.
dataset_split (string) – Split type (train or test)
image_modality (string) – Image modality (RGB or Optical Flow)
num_segments (int) – The number of segments the video should be divided into to sample frames from.
frames_per_segment (int) – The number of frames that should be loaded per segment.
imagefile_template (string) – The image filename template.
transform (Compose) – Video transform.
random_shift (bool) – Whether the frames from each segment should be taken consecutively starting from the center (False) of the segment, or consecutively starting from a random (True) location inside the segment range.
test_mode (bool) – Whether this is a test dataset. If so, chooses frames from segments with random_shift=False.
n_classes (int) – The number of classes.
- make_dataset()
Load data from the EPIC-Kitchen list file and convert it into a unified format. Different datasets correspond to different numbers of classes.
- Returns:
list of (video_name, start_frame, end_frame, label)
- Return type:
data (list)
- class kale.loaddata.video_datasets.EPIC(root_path: str, annotationfile_path: str, dataset_split: str, image_modality: str, num_segments: int = 1, frames_per_segment: int = 16, imagefile_template: str = 'img_{:010d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False, n_classes: int = 8)
Bases:
VideoFrameDataset
Dataset for EPIC-Kitchen.
- make_dataset()
Load data from the EPIC-Kitchen list file and convert it into a unified format. Because the original list files are not the same, this method is adapted from the one in class BasicVideoDataset.
kale.loaddata.video_multi_domain module
kale.loaddata.videos module
- class kale.loaddata.videos.VideoFrameDataset(root_path: str, annotationfile_path: str, image_modality: str = 'rgb', num_segments: int = 3, frames_per_segment: int = 1, imagefile_template: str = 'img_{:05d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False)
Bases:
Dataset
A highly efficient and adaptable dataset class for videos. Instead of loading every frame of a video, loads x RGB frames of a video (sparse temporal sampling) and evenly chooses those frames from start to end of the video, returning a list of x PIL images or FRAMES x CHANNELS x HEIGHT x WIDTH tensors, where FRAMES=x, if the kale.prepdata.video_transform.ImglistToTensor() transform is used.
More specifically, the frame range [START_FRAME, END_FRAME] is divided into NUM_SEGMENTS segments and FRAMES_PER_SEGMENT consecutive frames are taken from each segment.
Note
A demonstration of using this class can be seen in PyKale/examples/video_loading: https://github.com/pykale/pykale/tree/master/examples/video_loading
Note
This dataset broadly corresponds to the frame sampling technique introduced in Temporal Segment Networks at ECCV2016: https://arxiv.org/abs/1608.00859.
Note
This class relies on receiving video data in a structure where inside a ROOT_DATA folder, each video lies in its own folder, where each video folder contains the frames of the video as individual files with a naming convention such as img_001.jpg … img_059.jpg. For enumeration and annotations, this class expects to receive the path to a .txt file where each video sample has a row with four (or more in the case of multi-label, see example README on Github) space separated values: VIDEO_FOLDER_PATH START_FRAME END_FRAME LABEL_INDEX. VIDEO_FOLDER_PATH is expected to be the path of a video folder excluding the ROOT_DATA prefix. For example, ROOT_DATA might be home\data\datasetxyz\videos\, inside of which a VIDEO_FOLDER_PATH might be jumping\0052\ or sample1\ or 00053\.
- Parameters:
root_path – The root path in which video folders lie. This is ROOT_DATA from the description above.
annotationfile_path – The .txt annotation file containing one row per video sample as described above.
image_modality – Image modality (RGB or Optical Flow).
num_segments – The number of segments the video should be divided into to sample frames from.
frames_per_segment – The number of frames that should be loaded per segment. For each segment’s frame-range, a random start index or the center is chosen, from which frames_per_segment consecutive frames are loaded.
imagefile_template – The image filename template that video frame files have inside of their video folders as described above.
transform – Transform pipeline that receives a list of PIL images/frames.
random_shift – Whether the frames from each segment should be taken consecutively starting from the center of the segment, or consecutively starting from a random location inside the segment range.
test_mode – Whether this is a test dataset. If so, chooses frames from segments with random_shift=False.
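The annotation format and the segment-based sampling described above can be illustrated with a small, dependency-free sketch. `VideoRecord`, `parse_annotation_row`, and `sample_start_indices` are all hypothetical names, and the arithmetic is a simplified reading of the description (equal-length segments, a centered or random window inside each) rather than the library's exact implementation. Only the first four values of a row are read, so multi-label rows are not handled.

```python
import random
from typing import List, NamedTuple, Optional


class VideoRecord(NamedTuple):
    path: str
    start_frame: int
    end_frame: int
    label: int


def parse_annotation_row(row: str) -> VideoRecord:
    """Parse 'VIDEO_FOLDER_PATH START_FRAME END_FRAME LABEL_INDEX'."""
    path, start, end, label = row.split()[:4]
    return VideoRecord(path, int(start), int(end), int(label))


def sample_start_indices(
    record: VideoRecord,
    num_segments: int,
    frames_per_segment: int,
    random_shift: bool,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Pick one start frame per segment; frames_per_segment frames follow each start."""
    rng = rng or random.Random()
    num_frames = record.end_frame - record.start_frame + 1
    segment_len = num_frames // num_segments
    if segment_len < frames_per_segment:
        raise ValueError("segments are too short for frames_per_segment")
    slack = segment_len - frames_per_segment  # room to move inside one segment
    starts = []
    for seg in range(num_segments):
        # Random offset when random_shift is True, otherwise a centered window.
        offset = rng.randint(0, slack) if random_shift else slack // 2
        starts.append(record.start_frame + seg * segment_len + offset)
    return starts


record = parse_annotation_row("jumping/0052 1 60 3")
print(sample_start_indices(record, num_segments=3, frames_per_segment=1, random_shift=False))
# [10, 30, 50]
```

With random_shift=True each start falls somewhere inside its own segment, which gives different frames on every pass over the training data.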
kale.loaddata.few_shot module
Dataset class to load data for few-shot learning problems under \(N\)-way-\(K\)-shot settings. Author: Wenrui Fan Email: winslow.fan@outlook.com
- class kale.loaddata.few_shot.NWayKShotDataset(path: str, mode: str = 'train', num_support_samples: int = 5, num_query_samples: int = 15, transform: Callable | None = None)
Bases:
Dataset
This Dataset class loads data for few-shot learning problems under \(N\)-way-\(K\)-shot settings.
\(N\)-way: The number of classes under a particular setting. The model is presented with samples from these \(N\) classes and needs to classify them. For example, 3-way means the model has to classify 3 different classes.
\(K\)-shot: The number of samples for each class in the support set. For example, in a 2-shot setting, two support samples are provided per class.
Support set: It is a small, labeled dataset used to train the model with a few samples of each class. The support set consists of \(N\) classes (\(N\)-way), with \(K\) samples (\(K\)-shot) for each class. For example, under a 3-way-2-shot setting, the support set has 3 classes with 2 samples per class, totaling 6 samples.
Query set: It evaluates the model’s ability to generalize what it has learned from the support set. It contains samples from the same \(N\) classes but not included in the support set. Continuing with the 3-way-2-shot example, the query set would include additional samples from the 3 classes, which the model must classify after learning from the support set.
In this class, __getitem__() returns a batch of images and labels for one class. When defining the training/validation/testing dataloaders, the batch size should be the number of classes (cfg.TRAIN.NUM_CLASSES/cfg.VAL.NUM_CLASSES). Therefore, __len__() returns the total number of classes in the dataset.
Note
The dataset should be organized as:
- root
  - train
    - class_name 1
      xxx.png
      yyy.png
      …
    - class_name 2
      xxx.png
      yyy.png
      …
    …
  - val
    - class_name m
      xxx.png
      yyy.png
      …
    - class_name m+1
      xxx.png
      yyy.png
      …
    …
  - test
    - class_name n
      xxx.png
      yyy.png
      …
    - class_name n+1
      xxx.png
      yyy.png
      …
    …
- Parameters:
path (string) – The root directory of the data.
mode (string) – The mode of the type of dataset. It can be “train”, “val”, or “test”. Default: “train”.
num_support_samples (int) – Number of samples per class in the support set. It corresponds to \(K\) in the \(N\)-way-\(K\)-shot setting. Default: 5.
num_query_samples (int) – Number of samples per class in the query set. Default: 15.
transform (callable, optional) – Transform of images. Default: None.
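To make the \(N\)-way-\(K\)-shot terminology concrete, the sketch below draws one episode from a mapping of class names to file names. The function `sample_episode` and the toy data are hypothetical. The actual class returns image tensors per class and leaves the choice of \(N\) to the dataloader batch size, so this shows only the sampling idea.

```python
import random
from typing import Dict, List, Optional, Tuple

Sample = Tuple[str, int]  # (image name, episode-local label)


def sample_episode(
    images_by_class: Dict[str, List[str]],
    n_way: int,
    k_shot: int,
    num_query: int,
    seed: Optional[int] = None,
) -> Tuple[List[Sample], List[Sample]]:
    """Draw N classes, then K support and num_query query images per class."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(images_by_class), n_way)
    support: List[Sample] = []
    query: List[Sample] = []
    for label, name in enumerate(classes):
        # Sampling without replacement keeps support and query images disjoint.
        picked = rng.sample(images_by_class[name], k_shot + num_query)
        support += [(img, label) for img in picked[:k_shot]]
        query += [(img, label) for img in picked[k_shot:]]
    return support, query


data = {f"class_{c}": [f"class_{c}/img_{i}.png" for i in range(5)] for c in "abcd"}
support, query = sample_episode(data, n_way=3, k_shot=2, num_query=2, seed=0)
print(len(support), len(query))  # 6 6
```

Under the 3-way-2-shot setting used here, the support set has 3 classes with 2 samples each, matching the 6-sample example given in the description above.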