Load Data

Submodules

kale.loaddata.avmnist_datasets module

Dataset setting and data loader for AVMNIST dataset by refactoring https://github.com/pliang279/MultiBench/blob/main/datasets/avmnist/get_data.py

class kale.loaddata.avmnist_datasets.AVMNISTDataset(data_dir, batch_size=40, flatten_audio=False, flatten_image=False, unsqueeze_channel=True, normalize_image=True, normalize_audio=True)

Bases: object

This class loads the AVMNIST data stored in a specified directory and prepares it for training, validation, and testing. It also handles pre-processing steps such as reshaping and normalizing the data, based on the provided arguments: options are available to flatten the audio and image data, to normalize the image and audio data, and to add a dimension to the data, often used to represent the channel in image or audio data. The class also splits the data into training and validation sets and provides separate data loaders for the training, validation, and testing sets, which can be used to iterate over the data during model training and evaluation. This simplifies data preparation for multimodal learning tasks, allowing the user to focus on model architecture and hyperparameter tuning.

Parameters:
  • data_dir (str) – Directory of data.

  • batch_size (int, optional) – Batch size. Defaults to 40.

  • flatten_audio (bool, optional) – Whether to flatten audio data or not. Defaults to False.

  • flatten_image (bool, optional) – Whether to flatten image data or not. Defaults to False.

  • unsqueeze_channel (bool, optional) – Whether to unsqueeze any channels or not. Defaults to True.

  • normalize_image (bool, optional) – Whether to normalize the images before returning. Defaults to True.

  • normalize_audio (bool, optional) – Whether to normalize the audio before returning. Defaults to True.

load_data()
get_train_loader(shuffle=True)
get_valid_loader(shuffle=False)
get_test_loader(shuffle=False)

kale.loaddata.dataset_access module

Dataset Access API adapted from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_access.py

class kale.loaddata.dataset_access.DatasetAccess(n_classes)

Bases: object

This class ensures that a single, unified API is used to access the training, validation and test splits of any dataset.

Parameters:

n_classes (int) – the number of classes.

n_classes()
get_train()
Returns:

a torch.utils.data.Dataset

Return type:

Dataset

get_train_valid(valid_ratio)

Randomly split a dataset into non-overlapping training and validation datasets.

Parameters:

valid_ratio (float) – the ratio for validation set

Returns:

a torch.utils.data.Dataset

Return type:

Dataset

get_test()
kale.loaddata.dataset_access.get_class_subset(dataset, class_ids)
Parameters:
  • dataset – a torch.utils.data.Dataset

  • class_ids (list, optional) – List of chosen subset of class ids.

Returns:

a torch.utils.data.Dataset with only classes in class_ids

Return type:

Dataset
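As a rough illustration of what restricting a dataset to chosen classes involves, the sketch below filters a toy list of (sample, label) pairs in plain Python. The helper name `class_subset_indices` and the toy data are hypothetical; the library function itself operates on torch.utils.data.Dataset objects and may work differently.

```python
from typing import List, Sequence, Tuple


def class_subset_indices(samples: Sequence[Tuple[object, int]], class_ids: List[int]) -> List[int]:
    """Return positions of samples whose label is in class_ids (hypothetical helper)."""
    wanted = set(class_ids)
    return [i for i, (_, label) in enumerate(samples) if label in wanted]


# Toy dataset of (sample, label) pairs; labels play the role of class ids.
toy_samples = [("a", 0), ("b", 1), ("c", 2), ("d", 1), ("e", 0)]
kept = class_subset_indices(toy_samples, class_ids=[0, 2])
subset = [toy_samples[i] for i in kept]
print(subset)
```

Keeping indices rather than copying samples mirrors how index-based subsets are usually built on top of an existing dataset.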

kale.loaddata.dataset_access.split_by_ratios(dataset, split_ratios)

Randomly split a dataset into non-overlapping new datasets of given ratios.

Parameters:
  • dataset (torch.utils.data.Dataset) – Dataset to be split.

  • split_ratios (list) – Ratios of splits to be produced, where 0 < sum(split_ratios) <= 1.

Returns:

A list of subsets.

Return type:

list

Examples

>>> import torch
>>> from kale.loaddata.dataset_access import split_by_ratios
>>> subset1, subset2 = split_by_ratios(range(10), [0.3, 0.7])
>>> len(subset1)
3
>>> len(subset2)
7
>>> subset1, subset2 = split_by_ratios(range(10), [0.3])
>>> len(subset1)
3
>>> len(subset2)
7
>>> subset1, subset2, subset3 = split_by_ratios(range(10), [0.3, 0.3])
>>> len(subset1)
3
>>> len(subset2)
3
>>> len(subset3)
4
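The subset sizes in the examples above follow from simple arithmetic: each ratio claims a share of the items, and anything left over forms one extra subset. The following is a minimal sketch of that logic on plain indices, not the library's implementation; the function name `split_indices_by_ratios` and the `seed` argument are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def split_indices_by_ratios(n_items: int, split_ratios: Sequence[float], seed: int = 0) -> List[List[int]]:
    """Randomly partition range(n_items) into non-overlapping index lists (sketch)."""
    total = sum(split_ratios)
    if not 0 < total <= 1 + 1e-9:  # small tolerance for floating-point sums
        raise ValueError("split_ratios must satisfy 0 < sum(split_ratios) <= 1")
    lengths = [int(n_items * ratio) for ratio in split_ratios]
    remainder = n_items - sum(lengths)
    if remainder > 0:
        lengths.append(remainder)  # leftover items form one extra subset
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    subsets, start = [], 0
    for length in lengths:
        subsets.append(indices[start:start + length])
        start += length
    return subsets


parts = split_indices_by_ratios(10, [0.3, 0.3])
print([len(p) for p in parts])
```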

kale.loaddata.image_access module

class kale.loaddata.image_access.DigitDataset(value)

Bases: Enum

An enumeration.

MNIST = 'MNIST'
MNIST_RGB = 'MNIST_RGB'
MNISTM = 'MNISTM'
USPS = 'USPS'
USPS_RGB = 'USPS_RGB'
SVHN = 'SVHN'
static get_channel_numbers(dataset: DigitDataset)
static get_digit_transform(dataset: DigitDataset, n_channels)
static get_access(dataset: DigitDataset, data_path, num_channels=None)

Gets data loaders for digit datasets

Parameters:
  • dataset (DigitDataset) – dataset name

  • data_path (string) – root directory of dataset

  • num_channels (int) – number of channels, defaults to None

Examples

>>> data_access, num_channel = DigitDataset.get_access(dataset, data_path)
static get_source_target(source: DigitDataset, target: DigitDataset, data_path)

Gets data loaders for source and target datasets

Parameters:
  • source (DigitDataset) – source dataset name

  • target (DigitDataset) – target dataset name

  • data_path (string) – root directory of dataset

Examples

>>> source_access, target_access, num_channel = DigitDataset.get_source_target(source, target, data_path)
class kale.loaddata.image_access.DigitDatasetAccess(data_path, transform_kind)

Bases: DatasetAccess

Common API for digit dataset access

Parameters:
  • data_path (string) – root directory of dataset

  • transform_kind (string) – types of image transforms

class kale.loaddata.image_access.MNISTDatasetAccess(data_path, transform_kind)

Bases: DigitDatasetAccess

MNIST data loader

get_train()
get_test()
class kale.loaddata.image_access.MNISTMDatasetAccess(data_path, transform_kind)

Bases: DigitDatasetAccess

Modified MNIST (MNISTM) data loader

get_train()
get_test()
class kale.loaddata.image_access.USPSDatasetAccess(data_path, transform_kind)

Bases: DigitDatasetAccess

USPS data loader

get_train()
get_test()
class kale.loaddata.image_access.SVHNDatasetAccess(data_path, transform_kind)

Bases: DigitDatasetAccess

SVHN data loader

get_train()
get_test()
class kale.loaddata.image_access.OfficeAccess(root, transform=Compose(     Compose(     Resize(size=256, interpolation=bilinear, max_size=None, antialias=warn)     CenterCrop(size=(256, 256)) )     ToTensor()     Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) ), download=False, **kwargs)

Bases: MultiDomainImageFolder, DatasetAccess

Common API for office dataset access

Parameters:
  • root (string) – root directory of dataset

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version. Defaults to office_transform.

  • download (bool, optional) – Whether to allow downloading the data if not found on disk. Defaults to False.

References

[1] Saenko, K., Kulis, B., Fritz, M. and Darrell, T., 2010, September. Adapting visual category models to new domains. In European Conference on Computer Vision (pp. 213-226). Springer, Berlin, Heidelberg.

[2] Griffin, Gregory and Holub, Alex and Perona, Pietro, 2007. Caltech-256 Object Category Dataset. California Institute of Technology. (Unpublished). https://resolver.caltech.edu/CaltechAUTHORS:CNS-TR-2007-001.

[3] Gong, B., Shi, Y., Sha, F. and Grauman, K., 2012, June. Geodesic flow kernel for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 2066-2073).

static download(path)

Download dataset.

Office-31 source: https://www.cc.gatech.edu/~judy/domainadapt/#datasets_code
Caltech-256 source: http://www.vision.caltech.edu/Image_Datasets/Caltech256/
Data with this library is adapted from: http://www.stat.ucla.edu/~jxie/iFRAME/code/imageClassification.rar

class kale.loaddata.image_access.Office31(root, **kwargs)

Bases: OfficeAccess

class kale.loaddata.image_access.OfficeCaltech(root, **kwargs)

Bases: OfficeAccess

class kale.loaddata.image_access.ImageAccess

Bases: object

static get_multi_domain_images(image_set_name: str, data_path: str, sub_domain_set=None, **kwargs)

Get multi-domain images as a dataset from the given data path.

Parameters:
  • image_set_name (str) – name of image dataset

  • data_path (str) – path to the image dataset

  • sub_domain_set (list, optional) – A list of domain names, which should be a subset of domains under the directory of data path. If None, all available domains will be used. Defaults to None.

Returns:

Multi-domain image dataset

Return type:

MultiDomainImageFolder or MultiDomainAccess

kale.loaddata.image_access.get_cifar(cfg)

Gets training and validation data loaders for the CIFAR datasets

Parameters:

cfg (CfgNode) – hyperparameters from configure file

Examples

>>> train_loader, valid_loader = get_cifar(cfg)
kale.loaddata.image_access.read_dicom_phases(dicom_path, sort_instance=True)

Read dicom images of multiple instances/phases for one patient.

Parameters:
  • dicom_path (str) – Path to DICOM images.

  • sort_instance (bool, optional) – Whether to sort images by InstanceNumber (i.e. phase number). Defaults to True.

Returns:

List of dicom dataset objects

Return type:

list

kale.loaddata.image_access.check_dicom_series_uid(dcm_phases, sort_instance=True)

Check if all dicom images have the same series UID.

Parameters:
  • dcm_phases (list) – List of dicom dataset objects (phases)

  • sort_instance (bool, optional) – Whether to sort images by InstanceNumber (i.e. phase number). Defaults to True.

Returns:

List of list(s) of dicom phases.

Return type:

list
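The list-of-lists return value suggests that phases are grouped by series UID. Below is a minimal sketch of such a grouping, using plain dictionaries as stand-ins for DICOM dataset objects. The keys `SeriesInstanceUID` and `InstanceNumber` follow standard DICOM attribute names; the function `group_by_series_uid` is hypothetical and is not the library's implementation.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_series_uid(phases: List[Dict], sort_instance: bool = True) -> List[List[Dict]]:
    """Group phase records by series UID, optionally sorting each group (sketch)."""
    groups: Dict[str, List[Dict]] = defaultdict(list)
    for phase in phases:
        groups[phase["SeriesInstanceUID"]].append(phase)
    result = list(groups.values())
    if sort_instance:
        for group in result:
            group.sort(key=lambda p: p["InstanceNumber"])
    return result


# Dictionaries stand in for DICOM datasets in this toy example.
phases = [
    {"SeriesInstanceUID": "1.2.3", "InstanceNumber": 2},
    {"SeriesInstanceUID": "1.2.3", "InstanceNumber": 1},
    {"SeriesInstanceUID": "4.5.6", "InstanceNumber": 1},
]
grouped = group_by_series_uid(phases)
print([len(group) for group in grouped])
```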

kale.loaddata.image_access.read_dicom_dir(dicom_path, sort_instance=True, sort_patient=False, check_series_uid=False)
Read dicom files for multiple patients and multiple instances / phases from a given directory arranged in the following structure:

root/patient_a/…/phase_1.dcm
root/patient_a/…/phase_2.dcm
root/patient_a/…/phase_3.dcm

root/patient_b/…/phase_1.dcm
root/patient_b/…/phase_2.dcm
root/patient_b/…/phase_3.dcm

root/patient_m/…/phase_1.dcm
root/patient_m/…/phase_2.dcm
root/patient_m/…/phase_3.dcm

Parameters:
  • dicom_path (str) – Directory of DICOM files.

  • sort_instance (bool, optional) – Whether to sort images by InstanceNumber (i.e. phase number) for each subject. Defaults to True.

  • sort_patient (bool, optional) – Whether to sort subjects’ images by PatientID. Defaults to False.

  • check_series_uid (bool, optional) – Whether to check if all series UIDs are the same. Defaults to False.

Returns:

A list of dicom dataset lists

Return type:

list[list]

kale.loaddata.image_access.dicom2arraylist(dicom_patient_list, return_patient_id=False)

Convert dicom datasets to arrays

Parameters:
  • dicom_patient_list (list) – List of dicom patient lists.

  • return_patient_id (bool, optional) – Whether to return PatientID. Defaults to False.

Returns:

List of array-like tensors. If return_patient_id is True, a list of PatientIDs is also returned.

Return type:

list

kale.loaddata.mnistm module

Dataset setting and data loader for MNIST-M, from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_mnistm.py (based on https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py) CREDIT: https://github.com/corenel

class kale.loaddata.mnistm.MNISTM(root, train=True, transform=None, target_transform=None, download=False)

Bases: Dataset

MNIST-M Dataset. Auto-downloads the dataset and provides the torch Dataset API.

Parameters:
  • root (str) – path to the directory where the MNISTM folder will be created (or already exists).

  • train (bool, optional) – defaults to True. If True, loads the training data. Otherwise, loads the test data.

  • transform (callable, optional) – defaults to None. A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop. This preprocessing function is applied to all images (whether source or target).

  • target_transform (callable, optional) – defaults to None, similar to transform. This preprocessing function is applied to all target images, after transform.

  • download (bool, optional) – defaults to False. Whether to allow downloading the data if not found on disk.

url = 'https://github.com/VanushVaswani/keras_mnistm/releases/download/1.0/keras_mnistm.pkl.gz'
raw_folder = 'raw'
processed_folder = 'processed'
training_file = 'mnist_m_train.pt'
test_file = 'mnist_m_test.pt'
download()

Download the MNISTM data.

kale.loaddata.multi_domain module

Construct a dataset with (multiple) source and target domains, adapted from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/multisource.py

class kale.loaddata.multi_domain.WeightingType(value)

Bases: Enum

An enumeration.

NATURAL = 'natural'
BALANCED = 'balanced'
PRESET0 = 'preset0'
class kale.loaddata.multi_domain.DatasetSizeType(value)

Bases: Enum

An enumeration.

Max = 'max'
Source = 'source'
static get_size(size_type, source_dataset, *other_datasets)
class kale.loaddata.multi_domain.DomainsDatasetBase

Bases: object

prepare_data_loaders()

Handles the train/validation/test split so that there are 3 datasets, each with data from all domains.

get_domain_loaders(split='train', batch_size=32)

Handles the sampling of a dataset containing multiple domains.

Parameters:
  • split (string, optional) – [“train”|“valid”|“test”]. Which dataset to iterate on. Defaults to “train”.

  • batch_size (int, optional) – Defaults to 32.

Returns:

A dataloader with API similar to the torch.dataloader, but returning batches from several domains at each iteration.

Return type:

MultiDataLoader

class kale.loaddata.multi_domain.MultiDomainDatasets(source_access: DatasetAccess, target_access: DatasetAccess, config_weight_type='natural', config_size_type=DatasetSizeType.Max, valid_split_ratio=0.1, source_sampling_config=None, target_sampling_config=None, n_fewshot=None, random_state=None, class_ids=None)

Bases: DomainsDatasetBase

is_semi_supervised()
prepare_data_loaders()
get_domain_loaders(split='train', batch_size=32)
class kale.loaddata.multi_domain.MultiDomainImageFolder(root: str, loader: ~typing.Callable[[str], ~typing.Any] = <function default_loader>, extensions: ~typing.Tuple[str, ...] | None = ('.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.pgm', '.tif', '.tiff', '.webp'), transform: ~typing.Callable | None = None, target_transform: ~typing.Callable | None = None, sub_domain_set=None, sub_class_set=None, is_valid_file: ~typing.Callable[[str], bool] | None = None, return_domain_label: bool | None = False, split_train_test: bool | None = False, split_ratio: float = 0.8)

Bases: VisionDataset

A generic data loader where the samples are arranged in this way:

root/domain_a/class_1/xxx.ext
root/domain_a/class_1/xxy.ext
root/domain_a/class_2/xxz.ext

root/domain_b/class_1/efg.ext
root/domain_b/class_2/pqr.ext
root/domain_b/class_2/lmn.ext

root/domain_k/class_2/123.ext
root/domain_k/class_1/abc3.ext
root/domain_k/class_1/asd932_.ext
Parameters:
  • root (string) – Root directory path.

  • loader (callable) – A function to load a sample given its path.

  • extensions (tuple[string]) – A list of allowed extensions. Either extensions or is_valid_file should be passed.

  • transform (callable, optional) – A function/transform that takes in a sample and returns a transformed version. E.g, transforms.RandomCrop for images.

  • target_transform (callable, optional) – A function/transform that takes in the target and transforms it.

  • sub_domain_set (list) – A list of domain names, which should be a subset of domains (folders) under the root directory. If None, all available domains will be used. Defaults to None.

  • sub_class_set (list) – A list of class names, which should be a subset of classes (folders) under each domain’s directory. If None, all available classes will be used. Defaults to None.

  • is_valid_file – A function that takes the path of a file and checks whether the file is valid (used to detect corrupt files). Either extensions or is_valid_file should be passed.

get_train()
get_test()
kale.loaddata.multi_domain.make_multi_domain_set(directory: str, class_to_idx: Dict[str, int], domain_to_idx: Dict[str, int], extensions: Tuple[str, ...] | None = None, is_valid_file: Callable[[str], bool] | None = None) List[Tuple[str, int, int]]

Generates a list of samples of the form (path_to_sample, class, domain).

Parameters:
  • directory (str) – root dataset directory

  • class_to_idx (Dict[str, int]) – dictionary mapping class name to class index

  • domain_to_idx (Dict[str, int]) – dictionary mapping domain name to domain index

  • extensions (optional) – A list of allowed extensions. Either extensions or is_valid_file should be passed. Defaults to None.

  • is_valid_file (optional) – A function that takes the path of a file and checks if the file is a valid file (to check corrupt files). extensions and is_valid_file should not both be passed. Defaults to None.

Raises:

ValueError – In case extensions and is_valid_file are None or both are not None.

Returns:

Samples of the form (path_to_sample, class, domain)

Return type:

List[Tuple[str, int, int]]
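To make the (path_to_sample, class, domain) triples concrete, here is a minimal sketch that walks a root/domain/class/file layout using only the standard library. The function `list_multi_domain_samples` is a hypothetical stand-in, not the library's code, and it handles only the extensions case, assuming one level of files under each class folder.

```python
import os
import tempfile
from typing import Dict, List, Tuple


def list_multi_domain_samples(
    directory: str,
    class_to_idx: Dict[str, int],
    domain_to_idx: Dict[str, int],
    extensions: Tuple[str, ...],
) -> List[Tuple[str, int, int]]:
    """Collect (path, class index, domain index) triples from root/domain/class/file."""
    samples = []
    for domain in sorted(domain_to_idx):
        for class_name in sorted(class_to_idx):
            class_dir = os.path.join(directory, domain, class_name)
            if not os.path.isdir(class_dir):
                continue
            for file_name in sorted(os.listdir(class_dir)):
                if file_name.lower().endswith(extensions):
                    path = os.path.join(class_dir, file_name)
                    samples.append((path, class_to_idx[class_name], domain_to_idx[domain]))
    return samples


# Build a tiny throwaway directory tree to exercise the function.
with tempfile.TemporaryDirectory() as root:
    for domain, class_name, file_name in [
        ("domain_a", "class_1", "x.png"),
        ("domain_a", "class_2", "y.png"),
        ("domain_b", "class_1", "z.png"),
        ("domain_b", "class_1", "notes.txt"),  # skipped: extension not allowed
    ]:
        os.makedirs(os.path.join(root, domain, class_name), exist_ok=True)
        open(os.path.join(root, domain, class_name, file_name), "w").close()
    samples = list_multi_domain_samples(
        root, {"class_1": 0, "class_2": 1}, {"domain_a": 0, "domain_b": 1}, (".png",)
    )
    labels = [(cls, dom) for _, cls, dom in samples]
print(labels)
```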

class kale.loaddata.multi_domain.ConcatMultiDomainAccess(data_access: dict, domain_to_idx: dict, return_domain_label: bool | None = False)

Bases: Dataset

Concatenate multiple datasets as a single dataset with domain labels

Parameters:
  • data_access (dict) – Dictionary of domain datasets, e.g. {“Domain1_name”: domain1_set, “Domain2_name”: domain2_set}

  • domain_to_idx (dict) – Dictionary of domain name to domain labels, e.g. {“Domain1_name”: 0, “Domain2_name”: 1}

  • return_domain_label (Optional[bool], optional) – Whether to return domain labels in each batch. Defaults to False.

class kale.loaddata.multi_domain.MultiDomainAccess(data_access: dict, n_classes: int, return_domain_label: bool | None = False)

Bases: DatasetAccess

Convert multiple digits-like data accesses to a single data access.

Parameters:
  • data_access (dict) – Dictionary of data accesses, e.g. {“Domain1_name”: domain1_access, “Domain2_name”: domain2_access}

  • n_classes (int) – number of classes.

  • return_domain_label (Optional[bool], optional) – Whether to return domain labels in each batch. Defaults to False.

get_train()
get_test()
class kale.loaddata.multi_domain.MultiDomainAdapDataset(data_access, valid_split_ratio=0.1, test_split_ratio=0.2, random_state: int = 1, test_on_all=False)

Bases: DomainsDatasetBase

The class controlling how the multiple domains are iterated over.

Parameters:
  • data_access (MultiDomainImageFolder, or MultiDomainAccess) – Multi-domain data access.

  • valid_split_ratio (float, optional) – Split ratio for validation set. Defaults to 0.1.

  • test_split_ratio (float, optional) – Split ratio for test set. Defaults to 0.2.

  • random_state (int, optional) – Random state for generator. Defaults to 1.

  • test_on_all (bool, optional) – Whether test model on all target. Defaults to False.

prepare_data_loaders()
get_domain_loaders(split='train', batch_size=32)

kale.loaddata.multiomics_datasets module

Construct a dataset with multiple omics modalities based on PyTorch Geometric.

This code is written by refactoring the MOGONET dataset code (https://github.com/txWang/MOGONET/blob/main/train_test.py) within the ‘Dataset’ class provided in the PyTorch Geometric.

Reference: Wang, T., Shao, W., Huang, Z., Tang, H., Zhang, J., Ding, Z., Huang, K. (2021). MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nature communications. https://www.nature.com/articles/s41467-021-23774-w

class kale.loaddata.multiomics_datasets.MultiomicsDataset(root: str, num_modalities: int, num_classes: int, url: str | None = None, raw_file_names: List[str] | None = None, random_split: bool = False, train_size: float = 0.7, transform: Callable | None = None, pre_transform: Callable | None = None, target_pre_transform: Callable | None = None)

Bases: Dataset

The multiomics data for creating a graph dataset. See the PyTorch Geometric documentation for the accompanying tutorial.

Parameters:
  • root (string) – Root directory where the dataset should be saved.

  • num_modalities (int) – The total number of modalities in the dataset.

  • num_classes (int) – The total number of classes in the dataset.

  • url (string, optional) – The url to download the dataset from.

  • raw_file_names (list[str], optional) – The name of the files in the self.raw_dir folder that must be present in order to skip downloading.

  • random_split (bool, optional) – Whether to split the dataset into random train and test subsets. (default: False)

  • train_size (float, optional) – The proportion of the dataset to include in the train split that should be between 0.0 and 1.0. This parameter is used when random_split is True.

  • transform (callable, optional) – A function/transform that takes in an array_like data and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in an array_like data and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • target_pre_transform (callable, optional) – A function/transform that takes in an array_like of labels and returns a transformed version. The label object will be transformed before being saved to disk. (default: None)

property raw_file_names: List[str] | None

The name of the files in the self.raw_dir folder that must be present in order to skip downloading.

property processed_file_names: str | List[str] | Tuple

The name of the files in the self.processed_dir folder that must be present in order to skip processing.

download() None

Downloads the dataset to the self.raw_dir folder.

process() None

Processes the dataset to the self.processed_dir folder. This function reads input files, creates a Data object, and saves it into the processed_dir.

static get_random_split(labels, num_classes: int, train_size: float = 0.7) Tuple

Split arrays into random train and test indices.

Parameters:
  • labels (array-like) – Array-like object that represents the labels of the dataset.

  • num_classes (int) – The total number of classes in the dataset.

  • train_size (float, optional) – The proportion of the dataset to include in the train split that should be between 0.0 and 1.0. (default: 0.7)

Returns:

A tuple of two arrays containing the indices for the train and test sets.

static get_adjacency_info(data: Tensor) Tuple

Calculate a sparse adjacency matrix of the input dataset defined by edge indices and edge attributes.

Parameters:

data (torch.Tensor) – The input data.

Returns:

A tuple of edge indices and edge attributes.
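A sparse adjacency matrix is commonly stored as a pair of edge indices and matching edge attributes (the coordinate layout used by PyTorch Geometric). The sketch below extracts that pair from a small dense matrix in plain Python. The function name `dense_to_edges` is hypothetical, and how the library derives the adjacency from the input data is not specified here.

```python
from typing import List, Tuple


def dense_to_edges(adjacency: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Convert a dense adjacency matrix to (edge_index, edge_attr) in coordinate form."""
    rows, cols, weights = [], [], []
    for i, row in enumerate(adjacency):
        for j, weight in enumerate(row):
            if weight != 0:  # keep only existing edges
                rows.append(i)
                cols.append(j)
                weights.append(weight)
    return [rows, cols], weights


adjacency = [
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.2],
    [0.0, 0.2, 0.0],
]
edge_index, edge_attr = dense_to_edges(adjacency)
print(edge_index, edge_attr)
```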

extend_data(data: Data) Data

Extend data object by adding additional attributes.

Parameters:

data (Data) – An input data object.

Returns:

Extended data object with additional attributes.

len() int

Returns the number of graphs stored in the dataset.

get(modality_idx) Data

Gets the data object at index modality_idx.

property num_modalities: int

Returns the number of modalities in the dataset.

property num_classes: int

Returns the number of classes in the dataset.

class kale.loaddata.multiomics_datasets.SparseMultiomicsDataset(root: str, raw_file_names: List[str], num_modalities: int, num_classes: int, edge_per_node: int, url: str | None = None, random_split: bool = False, train_size: float = 0.7, equal_weight: bool = False, transform: Callable | None = None, pre_transform: Callable | None = None, target_pre_transform: Callable | None = None)

Bases: MultiomicsDataset

The multiomics data for creating sparse graph dataset based on the settings in the MOGONET paper.

Parameters:
  • root (string) – Root directory where the dataset should be saved.

  • raw_file_names (list[str]) – The name of the files in the self.raw_dir folder that must be present in order to skip downloading.

  • num_modalities (int) – The total number of modalities in the dataset.

  • num_classes (int) – The total number of classes in the dataset.

  • edge_per_node (int) – Predefined number of edges per node used in computing the adjacency matrix.

  • url (string, optional) – The url to download the dataset from.

  • random_split (bool, optional) – Whether to split the dataset into random train and test subsets. (default: False)

  • train_size (float, optional) – The proportion of the dataset to include in the train split that should be between 0.0 and 1.0. This parameter is used when random_split is True.

  • equal_weight (bool, optional) – Whether to use equal weights for all samples. (default: False)

  • transform (callable, optional) – A function/transform that takes in an array_like data and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_transform (callable, optional) – A function/transform that takes in an array_like data and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)

  • target_pre_transform (callable, optional) – A function/transform that takes in an array_like of labels and returns a transformed version. The label object will be transformed before being saved to disk. (default: None)

extend_data(data: Data) Data

Extend data object by adding additional attributes.

Parameters:

data (Data) – An input data object.

Returns:

Extended data object with additional attributes.

kale.loaddata.polypharmacy_datasets module

class kale.loaddata.polypharmacy_datasets.PolypharmacyDataset(url: str, root: str, name: str, mode: str = 'train')

Bases: Dataset

Polypharmacy side effect prediction dataset. Only for full-batch training.

Parameters:
  • url (string) – The url to download the dataset from.

  • root (string) – The root directory containing the dataset file.

  • name (string) – Name of the dataset.

  • mode (string) – “train”, “valid” or “test”. Defaults to “train”.

load_data() Data

Set up the dataset: download it if needed and load it.

kale.loaddata.sampler module

Various sampling strategies for datasets to construct dataloader, from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/sampler.py

class kale.loaddata.sampler.SamplingConfig(balance=False, class_weights=None, balance_domain=False)

Bases: object

create_loader(dataset, batch_size)

Create the data loader

Reference: https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler

Parameters:
  • dataset (Dataset) – dataset from which to load the data.

  • batch_size (int) – how many samples per batch to load

class kale.loaddata.sampler.FixedSeedSamplingConfig(seed=1, balance=False, class_weights=None, balance_domain=False)

Bases: SamplingConfig

create_loader(dataset, batch_size)

Create the data loader with fixed seed.

class kale.loaddata.sampler.MultiDataLoader(dataloaders, n_batches)

Bases: object

Batch Sampler for a MultiDataset. Iterates in parallel over different batch samplers for each dataset. Yields batches [(x_1, y_1), …, (x_s, y_s)] for s datasets.
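The parallel iteration described above can be sketched in plain Python as follows. The class name `ParallelBatches` is hypothetical, and restarting a shorter source once it is exhausted is an assumption made for this sketch rather than documented behaviour.

```python
from typing import Iterable, Iterator, List


class ParallelBatches:
    """Yield one batch from each source per step, restarting exhausted sources (sketch)."""

    def __init__(self, sources: List[Iterable], n_batches: int):
        self.sources = sources
        self.n_batches = n_batches

    def __len__(self) -> int:
        return self.n_batches

    def __iter__(self) -> Iterator[list]:
        iterators = [iter(source) for source in self.sources]
        for _ in range(self.n_batches):
            step = []
            for k, source in enumerate(self.sources):
                try:
                    batch = next(iterators[k])
                except StopIteration:
                    iterators[k] = iter(source)  # assumption: restart the shorter source
                    batch = next(iterators[k])
                step.append(batch)
            yield step


# Lists of (x, y) tuples stand in for per-domain data loaders.
source_batches = [("xs1", "ys1"), ("xs2", "ys2"), ("xs3", "ys3")]
target_batches = [("xt1", "yt1"), ("xt2", "yt2")]
steps = list(ParallelBatches([source_batches, target_batches], n_batches=3))
print(steps[0])
```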

class kale.loaddata.sampler.BalancedBatchSampler(dataset, batch_size)

Bases: BatchSampler

BatchSampler: from an MNIST-like dataset, samples n_samples for each of the n_classes. Returns batches of size n_classes * (batch_size // n_classes). Adapted from https://github.com/adambielski/siamese-triplet/blob/master/datasets.py
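The batch-size formula can be checked with a small sketch: with 3 classes and batch_size=7, each class contributes 7 // 3 = 2 samples, giving batches of 6. The function `balanced_batch` below is a hypothetical illustration on a plain label list, not the sampler's actual code.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def balanced_batch(labels: Sequence[int], batch_size: int, rng: random.Random) -> List[int]:
    """Pick batch_size // n_classes sample indices per class (illustrative only)."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    n_samples = batch_size // len(by_class)
    batch = []
    for label in sorted(by_class):
        # Sample without replacement within each class.
        batch.extend(rng.sample(by_class[label], n_samples))
    return batch


labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
batch = balanced_batch(labels, batch_size=7, rng=random.Random(0))
batch_labels = [labels[i] for i in batch]
print(batch_labels)
```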

class kale.loaddata.sampler.ReweightedBatchSampler(dataset, batch_size, class_weights)

Bases: BatchSampler

BatchSampler: from an MNIST-like dataset, samples batch_size items according to a given input distribution, assuming multi-class labels. Adapted from https://github.com/adambielski/siamese-triplet/blob/master/datasets.py
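One way to sample a batch according to a class distribution is to draw a class for each slot using the class weights, then draw an item from that class. The sketch below does this with the standard library. The function `reweighted_batch` is hypothetical, and sampling with replacement is an assumption of the sketch rather than a documented property of the sampler.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def reweighted_batch(
    labels: Sequence[int], batch_size: int, class_weights: Dict[int, float], rng: random.Random
) -> List[int]:
    """Draw batch_size indices whose classes follow class_weights (illustrative only)."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    classes = sorted(by_class)
    weights = [class_weights[c] for c in classes]
    drawn_classes = rng.choices(classes, weights=weights, k=batch_size)
    # Assumption: items are drawn with replacement inside each class.
    return [rng.choice(by_class[c]) for c in drawn_classes]


labels = [0] * 5 + [1] * 5
batch = reweighted_batch(labels, 8, {0: 1.0, 1: 0.0}, random.Random(0))
print([labels[i] for i in batch])
```

Giving class 1 a weight of zero means every drawn index belongs to class 0, which makes the effect of the weights easy to verify.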

kale.loaddata.sampler.get_labels(dataset)

Get class labels for dataset

class kale.loaddata.sampler.InfiniteSliceIterator(array, class_)

Bases: object

reset()
get(n)
class kale.loaddata.sampler.DomainBalancedBatchSampler(dataset, batch_size)

Bases: BalancedBatchSampler

BatchSampler - samples n_samples for each of the n_domains.

Returns batches of size n_domains * (batch_size / n_domains)

Parameters:
  • dataset (.multi_domain.MultiDomainImageFolder or torch.utils.data.Subset) – Multi-domain data access.

  • batch_size (int) – Batch size

kale.loaddata.tabular_access module

Authors: Lawrence Schobs, lawrenceschobs@gmail.com

Functions for accessing tabular data.

kale.loaddata.tabular_access.load_csv_columns(datapath: str, split: str, fold: int | List[int], cols_to_return: str | List[str] = 'All') DataFrame

Reads a CSV file of data and returns samples where the value of the specified split column is contained in the fold variable. The columns specified in cols_to_return are returned.

Parameters:
  • datapath – The path to the CSV file of data.

  • split – The column name for the split (e.g. “Validation”, “Testing”).

  • fold – The fold/s contained in the split column to return. Can be a single integer or a list of integers.

  • cols_to_return – Which columns to return. If set to “All”, returns all columns.

Returns:

A tuple of two pandas DataFrames: the first is the full DataFrame selected, and the second is the DataFrame with only the columns specified in cols_to_return.
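The row and column selection described above can be sketched with the standard csv module. The function `select_rows`, the column names, and the in-memory CSV text are all hypothetical; the library function reads from a file path and works with pandas DataFrames.

```python
import csv
import io
from typing import Dict, List, Union


def select_rows(
    csv_text: str,
    split: str,
    fold: Union[int, List[int]],
    cols_to_return: Union[str, List[str]] = "All",
) -> List[Dict[str, str]]:
    """Keep rows whose split column value is in fold, restricted to the chosen columns."""
    folds = {fold} if isinstance(fold, int) else set(fold)
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if int(row[split]) not in folds:
            continue
        if cols_to_return == "All":
            rows.append(dict(row))
        else:
            cols = [cols_to_return] if isinstance(cols_to_return, str) else cols_to_return
            rows.append({col: row[col] for col in cols})
    return rows


# Toy CSV content; the "Validation" column holds the fold number of each sample.
csv_text = "uid,Validation,error\na,0,0.1\nb,1,0.4\nc,0,0.3\n"
selected = select_rows(csv_text, split="Validation", fold=0, cols_to_return=["uid", "error"])
print(selected)
```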

kale.loaddata.tdc_datasets module

class kale.loaddata.tdc_datasets.BindingDBDataset(name: str, split='train', path='./data', mode='cnn_cnn', y_log=True, drug_transform=None, protein_transform=None)

Bases: Dataset

A custom dataset for loading and processing original TDC data, which is used as input data for the DeepDTA model.

Parameters:
  • name (str) – TDC dataset name.

  • split (str) – Data split type (train, valid or test).

  • path (str) – dataset download/local load path (default: “./data”)

  • mode (str) – encoding mode (default: cnn_cnn)

  • drug_transform – Transform operation (default: None)

  • protein_transform – Transform operation (default: None)

  • y_log (bool) – Whether to convert y values to log space. (default: True)

kale.loaddata.usps module

Dataset setting and data loader for USPS, from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_usps.py (based on https://github.com/mingyuliutw/CoGAN/blob/master/cogan_pytorch/src/dataset_usps.py)

class kale.loaddata.usps.USPS(root, train=True, transform=None, download=False)

Bases: Dataset

USPS Dataset.

Parameters:
  • root (string) – Root directory of the dataset, where the dataset file exists.

  • train (bool, optional) – If True, resample from dataset randomly.

  • download (bool, optional) – If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.

  • transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop.

url = 'https://raw.githubusercontent.com/mingyuliutw/CoGAN/master/cogan_pytorch/data/uspssample/usps_28x28.pkl'
download()

Download dataset.

load_samples()

Load sample images from dataset.

kale.loaddata.video_access module

Action video dataset loading for EPIC-Kitchen, ADL, GTEA, KITCHEN. The code is based on https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/digits_dataset_access.py

kale.loaddata.video_access.get_image_modality(image_modality)

Change image_modality (string) to rgb (bool) and flow (bool) for efficiency

kale.loaddata.video_access.get_videodata_config(cfg)

Get the configuration parameters for video data from the cfg files

kale.loaddata.video_access.generate_list(data_name, data_params_local, domain)
Parameters:
  • data_name (string) – name of dataset

  • data_params_local (dict) – hyperparameters from configure file

  • domain (string) – domain type (source or target)

Returns:

data_path (string): image directory of dataset
train_listpath (string): training list file directory of dataset
test_listpath (string): test list file directory of dataset

class kale.loaddata.video_access.VideoDataset(value)

Bases: Enum

An enumeration.

EPIC = 'EPIC'
ADL = 'ADL'
GTEA = 'GTEA'
KITCHEN = 'KITCHEN'
static get_source_target(source: VideoDataset, target: VideoDataset, seed, params)

Gets data loaders for source and target datasets. Sets channel_number as 3 for RGB, 2 for flow. Sets class_number as 8 for EPIC, 7 for ADL, 6 for both GTEA and KITCHEN.

Parameters:
  • source (VideoDataset) – source dataset name

  • target (VideoDataset) – target dataset name

  • seed (int) – seed value set manually

  • params (CfgNode) – hyperparameters from the configuration file

Examples

>>> source, target, num_classes = get_source_target(source, target, seed, params)
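
The channel and class numbers stated in the description of get_source_target can be written as two lookup tables. The sketch below only illustrates that mapping: the helper name lookup_dims and the lowercase modality keys are assumptions, and the actual method may organise this differently.

```python
# Values quoted from the description of get_source_target.
CHANNELS_PER_MODALITY = {"rgb": 3, "flow": 2}
CLASSES_PER_DATASET = {"EPIC": 8, "ADL": 7, "GTEA": 6, "KITCHEN": 6}


def lookup_dims(dataset_name: str, modality: str):
    """Hypothetical helper: return (channel_number, class_number)."""
    return CHANNELS_PER_MODALITY[modality], CLASSES_PER_DATASET[dataset_name]


channel_number, class_number = lookup_dims("EPIC", "rgb")
```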
class kale.loaddata.video_access.VideoDatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)

Bases: DatasetAccess

Common API for video dataset access

Parameters:
  • data_path (string) – image directory of dataset

  • train_list (string) – training list file directory of dataset

  • test_list (string) – test list file directory of dataset

  • image_modality (string) – image type (RGB or Optical Flow)

  • frames_per_segment (int) – length of each action sample (in number of frames)

  • n_classes (int) – number of classes

  • transform_kind (string) – types of video transforms

  • seed (int) – seed value set manually

get_train_valid(valid_ratio)

Get the train and validation datasets with a fixed random split. This is used for joint input such as RGB and optical flow, for which get_train_valid is called twice. Fixing the random seed here keeps the split identical across both calls.
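
Why a fixed seed matters for joint input can be shown with a small sketch. It uses Python's standard random module rather than the library's own splitting code, and the helper name split_indices is hypothetical.

```python
import random


def split_indices(n_samples: int, valid_ratio: float, seed: int):
    """Hypothetical helper: split sample indices with a private, seeded generator."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_valid = int(n_samples * valid_ratio)
    return indices[n_valid:], indices[:n_valid]  # (train, valid)


# Two calls with the same seed, e.g. one per modality, yield the same split,
# so the RGB and optical-flow samples stay aligned.
rgb_train, rgb_valid = split_indices(100, 0.1, seed=2020)
flow_train, flow_valid = split_indices(100, 0.1, seed=2020)
```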

class kale.loaddata.video_access.EPICDatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)

Bases: VideoDatasetAccess

EPIC data loader

get_train()
get_test()
class kale.loaddata.video_access.GTEADatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)

Bases: VideoDatasetAccess

GTEA data loader

get_train()
get_test()
class kale.loaddata.video_access.ADLDatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)

Bases: VideoDatasetAccess

ADL data loader

get_train()
get_test()
class kale.loaddata.video_access.KITCHENDatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)

Bases: VideoDatasetAccess

KITCHEN data loader

get_train()
get_test()

kale.loaddata.video_datasets module

class kale.loaddata.video_datasets.BasicVideoDataset(root_path: str, annotationfile_path: str, dataset_split: str, image_modality: str, num_segments: int = 1, frames_per_segment: int = 16, imagefile_template: str = 'img_{:010d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False, n_classes: int = 8)

Bases: VideoFrameDataset

Dataset for GTEA, ADL and KITCHEN.

Parameters:
  • root_path (string) – The root path in which video folders lie.

  • annotationfile_path (string) – The annotation file containing one row per video sample.

  • dataset_split (string) – Split type (train or test)

  • image_modality (string) – Image modality (RGB or Optical Flow)

  • num_segments (int) – The number of segments the video should be divided into to sample frames from.

  • frames_per_segment (int) – The number of frames that should be loaded per segment.

  • imagefile_template (string) – The image filename template.

  • transform (Compose) – Video transform.

  • random_shift (bool) – Whether the frames from each segment should be taken consecutively starting from the center of the segment (False), or consecutively starting from a random location inside the segment range (True).

  • test_mode (bool) – Whether this is a test dataset. If so, chooses frames from segments with random_shift=False.

  • n_classes (int) – The number of classes.

make_dataset()

Load data from the EPIC-Kitchen list file and convert them into a unified format. Different datasets correspond to different numbers of classes.

Returns:

list of (video_name, start_frame, end_frame, label)

Return type:

data (list)

class kale.loaddata.video_datasets.EPIC(root_path: str, annotationfile_path: str, dataset_split: str, image_modality: str, num_segments: int = 1, frames_per_segment: int = 16, imagefile_template: str = 'img_{:010d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False, n_classes: int = 8)

Bases: VideoFrameDataset

Dataset for EPIC-Kitchen.

make_dataset()

Load data from the EPIC-Kitchen list file and convert them into a unified format. Because the original list files are not the same, this method is adapted from the one in class BasicVideoDataset.

kale.loaddata.video_multi_domain module

Construct a dataset for videos with (multiple) source and target domains

class kale.loaddata.video_multi_domain.VideoMultiDomainDatasets(source_access_dict, target_access_dict, image_modality, seed, config_weight_type='natural', config_size_type=DatasetSizeType.Max, valid_split_ratio=0.1, source_sampling_config=None, target_sampling_config=None, n_fewshot=None, random_state=None, class_ids=None)

Bases: MultiDomainDatasets

prepare_data_loaders()
get_domain_loaders(split='train', batch_size=32)

kale.loaddata.videos module

class kale.loaddata.videos.VideoFrameDataset(root_path: str, annotationfile_path: str, image_modality: str = 'rgb', num_segments: int = 3, frames_per_segment: int = 1, imagefile_template: str = 'img_{:05d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False)

Bases: Dataset

A highly efficient and adaptable dataset class for videos. Instead of loading every frame of a video, loads x RGB frames of a video (sparse temporal sampling) and evenly chooses those frames from start to end of the video, returning a list of x PIL images or FRAMES x CHANNELS x HEIGHT x WIDTH tensors where FRAMES=x if the kale.prepdata.video_transform.ImglistToTensor() transform is used.

More specifically, the frame range [START_FRAME, END_FRAME] is divided into NUM_SEGMENTS segments and FRAMES_PER_SEGMENT consecutive frames are taken from each segment.
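
The segment-based sampling described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the class's actual implementation: the function name sample_frame_indices is hypothetical, segments are taken to be equal-length, and each segment is assumed to be at least frames_per_segment frames long.

```python
import random
from typing import List, Optional


def sample_frame_indices(start_frame: int, end_frame: int, num_segments: int,
                         frames_per_segment: int, random_shift: bool,
                         rng: Optional[random.Random] = None) -> List[int]:
    """Hypothetical sketch of segment-based sampling over [start_frame, end_frame]."""
    rng = rng or random.Random()
    num_frames = end_frame - start_frame + 1
    segment_len = num_frames // num_segments  # assumes num_frames >= num_segments
    # Largest start offset that keeps the whole clip inside its segment.
    max_offset = max(segment_len - frames_per_segment, 0)
    indices = []
    for seg in range(num_segments):
        seg_start = start_frame + seg * segment_len
        # Random start if random_shift, else the middle of the valid start range.
        offset = rng.randint(0, max_offset) if random_shift else max_offset // 2
        first = seg_start + offset
        indices.extend(range(first, first + frames_per_segment))
    return indices


print(sample_frame_indices(1, 60, num_segments=3, frames_per_segment=2,
                           random_shift=False))
# -> [10, 11, 30, 31, 50, 51]
```

With 60 frames and 3 segments, each segment covers 20 frames, and two consecutive frames are read from the middle of each one. Passing random_shift=True picks a different start inside every segment on each call.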

Note

A demonstration of using this class can be seen in PyKale/examples/video_loading https://github.com/pykale/pykale/tree/master/examples/video_loading

Note

This dataset broadly corresponds to the frame sampling technique introduced in Temporal Segment Networks at ECCV2016 https://arxiv.org/abs/1608.00859.

Note

This class relies on receiving video data in a structure where, inside a ROOT_DATA folder, each video lies in its own folder, and each video folder contains the frames of the video as individual files with a naming convention such as img_001.jpg … img_059.jpg. For enumeration and annotations, this class expects to receive the path to a .txt file where each video sample has a row with four (or more in the case of multi-label, see the example README on GitHub) space-separated values: VIDEO_FOLDER_PATH     START_FRAME     END_FRAME     LABEL_INDEX. VIDEO_FOLDER_PATH is expected to be the path of a video folder excluding the ROOT_DATA prefix. For example, ROOT_DATA might be home\data\datasetxyz\videos\, inside of which a VIDEO_FOLDER_PATH might be jumping\0052\ or sample1\ or 00053\.
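
As an illustration of the annotation format in the note above, the sketch below parses one row and builds the path of its first frame. The names AnnotationRow and parse_annotation_row are hypothetical, the sample row is made up, and the filename template is the default shown in the class signature.

```python
import os
from typing import List, NamedTuple


class AnnotationRow(NamedTuple):
    """Hypothetical container for one row of the annotation file."""
    path: str
    start_frame: int
    end_frame: int
    labels: List[int]


def parse_annotation_row(row: str, root_data: str) -> AnnotationRow:
    # Split on any whitespace; assumes the folder path itself contains no spaces.
    fields = row.split()
    if len(fields) < 4:
        raise ValueError(f"Expected at least 4 fields, got {len(fields)}: {row!r}")
    video_folder, start, end, *labels = fields
    # Everything after END_FRAME is treated as a label (multi-label rows have several).
    return AnnotationRow(os.path.join(root_data, video_folder), int(start), int(end),
                         [int(label) for label in labels])


record = parse_annotation_row("jumping/0052 1 59 3", root_data="videos")
first_frame = os.path.join(record.path, "img_{:05d}.jpg".format(record.start_frame))
```

Keeping VIDEO_FOLDER_PATH relative to ROOT_DATA means the same annotation file works wherever the dataset is stored.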

Parameters:
  • root_path – The root path in which video folders lie. This is ROOT_DATA from the description above.

  • annotationfile_path – The .txt annotation file containing one row per video sample as described above.

  • image_modality – Image modality (RGB or Optical Flow).

  • num_segments – The number of segments the video should be divided into to sample frames from.

  • frames_per_segment – The number of frames that should be loaded per segment. For each segment’s frame-range, a random start index or the center is chosen, from which frames_per_segment consecutive frames are loaded.

  • imagefile_template – The image filename template that video frame files have inside of their video folders as described above.

  • transform – Transform pipeline that receives a list of PIL images/frames.

  • random_shift – Whether the frames from each segment should be taken consecutively starting from the center of the segment, or consecutively starting from a random location inside the segment range.

  • test_mode – Whether this is a test dataset. If so, chooses frames from segments with random_shift=False.

Module contents