Load Data
Submodules
kale.loaddata.dataset_access module
Dataset Access API adapted from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_access.py
- class kale.loaddata.dataset_access.DatasetAccess(n_classes)
Bases:
object
This class ensures a unique API is used to access training, validation and test splits of any dataset.
- Parameters
n_classes (int) – the number of classes.
- n_classes()
- get_train()
- Returns
a torch.utils.data.Dataset
- Return type
Dataset
- get_train_valid(valid_ratio)
Randomly split a dataset into non-overlapping training and validation datasets.
- Parameters
valid_ratio (float) – the ratio for validation set
- Returns
the training and validation datasets
- Return type
a tuple of two torch.utils.data.Dataset objects
- get_test()
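The intent of this API can be sketched in plain Python. The MyDatasetAccess class below is hypothetical and not part of kale; it mirrors the documented interface (n_classes, get_train, get_train_valid, get_test), using a seeded random shuffle to stand in for the torch random split.

```python
import random

class MyDatasetAccess:
    """Hypothetical DatasetAccess-style class exposing train/valid/test splits."""

    def __init__(self, n_classes, samples):
        self._n_classes = n_classes
        self._samples = samples  # stands in for a torch.utils.data.Dataset

    def n_classes(self):
        return self._n_classes

    def get_train(self):
        return self._samples

    def get_train_valid(self, valid_ratio):
        # Randomly split into non-overlapping train/validation subsets.
        indices = list(range(len(self._samples)))
        random.Random(0).shuffle(indices)
        n_valid = int(len(indices) * valid_ratio)
        valid = [self._samples[i] for i in indices[:n_valid]]
        train = [self._samples[i] for i in indices[n_valid:]]
        return train, valid

    def get_test(self):
        return self._samples

access = MyDatasetAccess(10, list(range(100)))
train, valid = access.get_train_valid(0.1)
# valid holds 10% of the samples; train and valid do not overlap
```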
- kale.loaddata.dataset_access.get_class_subset(dataset, class_ids)
- Parameters
dataset – a torch.utils.data.Dataset
class_ids (list, optional) – List of chosen class ids; only samples with these labels are kept.
- Returns
a torch.utils.data.Dataset with only classes in class_ids
- Return type
Dataset
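The filtering behaviour can be illustrated with a small stand-alone sketch (pure Python, not the kale implementation): keep only the samples whose label appears in class_ids.

```python
def class_subset(dataset, class_ids):
    """Return the subset of (sample, label) pairs whose label is in class_ids."""
    return [(x, y) for (x, y) in dataset if y in class_ids]

toy = [("a", 0), ("b", 1), ("c", 2), ("d", 1)]
subset = class_subset(toy, class_ids=[1])
# subset keeps only the two samples labelled 1
```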
- kale.loaddata.dataset_access.split_by_ratios(dataset, split_ratios)
Randomly split a dataset into non-overlapping new datasets of given ratios.
- Parameters
dataset (torch.utils.data.Dataset) – Dataset to be split.
split_ratios (list) – Ratios of splits to be produced, where 0 < sum(split_ratios) <= 1.
- Returns
A list of subsets.
- Return type
list
Examples
>>> import torch
>>> from kale.loaddata.dataset_access import split_by_ratios
>>> subset1, subset2 = split_by_ratios(range(10), [0.3, 0.7])
>>> len(subset1)
3
>>> len(subset2)
7
>>> subset1, subset2 = split_by_ratios(range(10), [0.3])
>>> len(subset1)
3
>>> len(subset2)
7
>>> subset1, subset2, subset3 = split_by_ratios(range(10), [0.3, 0.3])
>>> len(subset1)
3
>>> len(subset2)
3
>>> len(subset3)
4
kale.loaddata.videos module
- class kale.loaddata.videos.VideoFrameDataset(root_path: str, annotationfile_path: str, image_modality: str = 'rgb', num_segments: int = 3, frames_per_segment: int = 1, imagefile_template: str = 'img_{:05d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False)
Bases:
Dataset
A highly efficient and adaptable dataset class for videos. Instead of loading every frame of a video, it loads x RGB frames (sparse temporal sampling), chosen evenly from the start to the end of the video, and returns either a list of x PIL images or a FRAMES x CHANNELS x HEIGHT x WIDTH tensor (with FRAMES=x) if the kale.prepdata.video_transform.ImglistToTensor() transform is used. More specifically, the frame range [START_FRAME, END_FRAME] is divided into NUM_SEGMENTS segments and FRAMES_PER_SEGMENT consecutive frames are taken from each segment.
Note
A demonstration of using this class can be seen in PyKale/examples/video_loading: https://github.com/pykale/pykale/tree/master/examples/video_loading
Note
This dataset broadly corresponds to the frame sampling technique introduced in Temporal Segment Networks at ECCV 2016: https://arxiv.org/abs/1608.00859
Note
This class relies on receiving video data in a structure where, inside a ROOT_DATA folder, each video lies in its own folder, and each video folder contains the frames of the video as individual files with a naming convention such as img_001.jpg … img_059.jpg. For enumeration and annotations, this class expects to receive the path to a .txt file where each video sample has a row with four (or more in the case of multi-label, see the example README on GitHub) space-separated values: VIDEO_FOLDER_PATH START_FRAME END_FRAME LABEL_INDEX. VIDEO_FOLDER_PATH is expected to be the path of a video folder excluding the ROOT_DATA prefix. For example, ROOT_DATA might be home\data\datasetxyz\videos\, inside of which a VIDEO_FOLDER_PATH might be jumping\0052\, sample1\, or 00053\.
- Parameters
root_path – The root path in which video folders lie. This is ROOT_DATA from the description above.
annotationfile_path – The .txt annotation file containing one row per video sample as described above.
image_modality – Image modality (RGB or Optical Flow).
num_segments – The number of segments the video should be divided into to sample frames from.
frames_per_segment – The number of frames that should be loaded per segment. For each segment’s frame-range, a random start index or the center is chosen, from which frames_per_segment consecutive frames are loaded.
imagefile_template – The image filename template that video frame files have inside of their video folders as described above.
transform – Transform pipeline that receives a list of PIL images/frames.
random_shift – Whether the frames from each segment should be taken consecutively starting from a random location inside the segment range (True), or consecutively starting from the center of the segment (False).
test_mode – Whether this is a test dataset. If so, chooses frames from segments with random_shift=False.
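The sparse temporal sampling scheme described above can be sketched independently of the class: divide the frame range into num_segments equal segments and take frames_per_segment consecutive frames from each, starting at the segment centre (the random_shift=False case). This is an illustrative reimplementation, not the exact kale code.

```python
def sample_frame_indices(num_frames, num_segments, frames_per_segment):
    """Centre-based sparse sampling: one run of consecutive frames per segment."""
    seg_len = num_frames // num_segments
    indices = []
    for seg in range(num_segments):
        seg_start = seg * seg_len
        # centre the frames_per_segment window inside the segment
        start = seg_start + max((seg_len - frames_per_segment) // 2, 0)
        indices.extend(range(start, start + frames_per_segment))
    return indices

# e.g. a 30-frame video, 3 segments, 1 frame per segment -> frames 4, 14, 24
```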
kale.loaddata.image_access module
kale.loaddata.mnistm module
Dataset setting and data loader for MNIST-M, from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_mnistm.py (based on https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py) CREDIT: https://github.com/corenel
- class kale.loaddata.mnistm.MNISTM(root, train=True, transform=None, target_transform=None, download=False)
Bases:
Dataset
MNIST-M Dataset. Auto-downloads the dataset and provides the torch Dataset API.
- Parameters
root (str) – path to the directory where the MNISTM folder will be created (or already exists).
train (bool, optional) – defaults to True. If True, loads the training data. Otherwise, loads the test data.
transform (callable, optional) – defaults to None. A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop. This preprocessing function is applied to all images (whether source or target).
target_transform (callable, optional) – defaults to None. Similar to transform; this preprocessing function is applied to all target images, after transform.
download (bool, optional) – defaults to False. Whether to allow downloading the data if not found on disk.
- url = 'https://github.com/VanushVaswani/keras_mnistm/releases/download/1.0/keras_mnistm.pkl.gz'
- raw_folder = 'raw'
- processed_folder = 'processed'
- training_file = 'mnist_m_train.pt'
- test_file = 'mnist_m_test.pt'
- download()
Download the MNISTM data.
kale.loaddata.multi_domain module
Construct a dataset with (multiple) source and target domains, adapted from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/multisource.py
- class kale.loaddata.multi_domain.WeightingType(value)
Bases:
Enum
An enumeration.
- NATURAL = 'natural'
- BALANCED = 'balanced'
- PRESET0 = 'preset0'
- class kale.loaddata.multi_domain.DatasetSizeType(value)
Bases:
Enum
An enumeration.
- Max = 'max'
- Source = 'source'
- static get_size(size_type, source_dataset, *other_datasets)
- class kale.loaddata.multi_domain.DomainsDatasetBase
Bases:
object
- prepare_data_loaders()
Handles the train/validation/test split to produce three datasets, each with data from all domains.
- get_domain_loaders(split='train', batch_size=32)
Handles the sampling of a dataset containing multiple domains.
- Parameters
split (string, optional) – [“train”|”valid”|”test”]. Which dataset to iterate on. Defaults to “train”.
batch_size (int, optional) – Defaults to 32.
- Returns
A dataloader with an API similar to torch.utils.data.DataLoader, but returning batches from several domains at each iteration.
- Return type
- class kale.loaddata.multi_domain.MultiDomainDatasets(source_access: DatasetAccess, target_access: DatasetAccess, config_weight_type='natural', config_size_type=DatasetSizeType.Max, valid_split_ratio=0.1, source_sampling_config=None, target_sampling_config=None, n_fewshot=None, random_state=None, class_ids=None)
Bases:
DomainsDatasetBase
- is_semi_supervised()
- prepare_data_loaders()
- get_domain_loaders(split='train', batch_size=32)
- class kale.loaddata.multi_domain.MultiDomainImageFolder(root: str, loader: ~typing.Callable[[str], ~typing.Any] = <function default_loader>, extensions: ~typing.Optional[~typing.Tuple[str, ...]] = ('.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.pgm', '.tif', '.tiff', '.webp'), transform: ~typing.Optional[~typing.Callable] = None, target_transform: ~typing.Optional[~typing.Callable] = None, sub_domain_set=None, sub_class_set=None, is_valid_file: ~typing.Optional[~typing.Callable[[str], bool]] = None, return_domain_label: ~typing.Optional[bool] = False, split_train_test: ~typing.Optional[bool] = False, split_ratio: float = 0.8)
Bases:
VisionDataset
A generic data loader where the samples are arranged in this way:
root/domain_a/class_1/xxx.ext
root/domain_a/class_1/xxy.ext
root/domain_a/class_2/xxz.ext
root/domain_b/class_1/efg.ext
root/domain_b/class_2/pqr.ext
root/domain_b/class_2/lmn.ext
root/domain_k/class_2/123.ext
root/domain_k/class_1/abc3.ext
root/domain_k/class_1/asd932_.ext
- Parameters
root (string) – Root directory path.
loader (callable) – A function to load a sample given its path.
extensions (tuple[string]) – A list of allowed extensions. Either extensions or is_valid_file should be passed.
transform (callable, optional) – A function/transform that takes in a sample and returns a transformed version, e.g. transforms.RandomCrop for images.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
sub_domain_set (list) – A list of domain names, which should be a subset of domains (folders) under the root directory. If None, all available domains will be used. Defaults to None.
sub_class_set (list) – A list of class names, which should be a subset of classes (folders) under each domain’s directory. If None, all available classes will be used. Defaults to None.
is_valid_file – A function that takes the path of a file and checks whether the file is a valid file (to filter corrupt files). Either extensions or is_valid_file should be passed.
- get_train()
- get_test()
- kale.loaddata.multi_domain.make_multi_domain_set(directory: str, class_to_idx: Dict[str, int], domain_to_idx: Dict[str, int], extensions: Optional[Tuple[str, ...]] = None, is_valid_file: Optional[Callable[[str], bool]] = None) List[Tuple[str, int, int]]
Generates a list of samples of the form (path_to_sample, class, domain).
- Parameters
directory (str) – root dataset directory
class_to_idx (Dict[str, int]) – dictionary mapping class name to class index
domain_to_idx (Dict[str, int]) – dictionary mapping domain name to domain index
extensions (optional) – A list of allowed extensions. Either extensions or is_valid_file should be passed. Defaults to None.
is_valid_file (optional) – A function that takes the path of a file and checks whether the file is a valid file (to filter corrupt files). Either extensions or is_valid_file should be passed, not both. Defaults to None.
- Raises
ValueError – In case both extensions and is_valid_file are None, or both are not None.
- Returns
samples of the form (path_to_sample, class, domain)
- Return type
List[Tuple[str, int, int]]
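A minimal sketch of the directory walk this function performs (a hypothetical simplification, not the kale implementation; the real function also supports an is_valid_file callback):

```python
import os
import tempfile

def scan_multi_domain(directory, class_to_idx, domain_to_idx, extensions):
    """Collect (path, class_idx, domain_idx) triples from root/domain/class/file."""
    samples = []
    for domain, d_idx in sorted(domain_to_idx.items()):
        for cls, c_idx in sorted(class_to_idx.items()):
            folder = os.path.join(directory, domain, cls)
            if not os.path.isdir(folder):
                continue
            for fname in sorted(os.listdir(folder)):
                if fname.lower().endswith(extensions):
                    samples.append((os.path.join(folder, fname), c_idx, d_idx))
    return samples

# Build a tiny root/domain/class layout and scan it.
root = tempfile.mkdtemp()
for domain in ("domain_a", "domain_b"):
    for cls in ("class_1", "class_2"):
        os.makedirs(os.path.join(root, domain, cls))
        open(os.path.join(root, domain, cls, "img0.jpg"), "w").close()

samples = scan_multi_domain(root, {"class_1": 0, "class_2": 1},
                            {"domain_a": 0, "domain_b": 1}, (".jpg",))
# 2 domains x 2 classes x 1 file each -> 4 samples
```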
- class kale.loaddata.multi_domain.ConcatMultiDomainAccess(data_access: dict, domain_to_idx: dict, return_domain_label: Optional[bool] = False)
Bases:
Dataset
Concatenate multiple datasets as a single dataset with domain labels
- Parameters
data_access (dict) – Dictionary of domain datasets, e.g. {“Domain1_name”: domain1_set, “Domain2_name”: domain2_set}
domain_to_idx (dict) – Dictionary of domain name to domain labels, e.g. {“Domain1_name”: 0, “Domain2_name”: 1}
return_domain_label (Optional[bool], optional) – Whether to return domain labels in each batch. Defaults to False.
- class kale.loaddata.multi_domain.MultiDomainAccess(data_access: dict, n_classes: int, return_domain_label: Optional[bool] = False)
Bases:
DatasetAccess
Convert multiple digits-like data accesses to a single data access.
- Parameters
data_access (dict) – Dictionary of data accesses, e.g. {"Domain1_name": domain1_access, "Domain2_name": domain2_access}
n_classes (int) – number of classes.
return_domain_label (Optional[bool], optional) – Whether to return domain labels in each batch. Defaults to False.
- get_train()
- get_test()
- class kale.loaddata.multi_domain.MultiDomainAdapDataset(data_access, valid_split_ratio=0.1, test_split_ratio=0.2, random_state: int = 1, test_on_all=False)
Bases:
DomainsDatasetBase
The class controlling how the multiple domains are iterated over.
- Parameters
data_access (MultiDomainImageFolder, or MultiDomainAccess) – Multi-domain data access.
valid_split_ratio (float, optional) – Split ratio for validation set. Defaults to 0.1.
test_split_ratio (float, optional) – Split ratio for test set. Defaults to 0.2.
random_state (int, optional) – Random state for generator. Defaults to 1.
test_on_all (bool, optional) – Whether to test the model on all target data. Defaults to False.
- prepare_data_loaders()
- get_domain_loaders(split='train', batch_size=32)
kale.loaddata.sampler module
Various sampling strategies for datasets to construct dataloader, from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/sampler.py
- class kale.loaddata.sampler.SamplingConfig(balance=False, class_weights=None, balance_domain=False)
Bases:
object
- create_loader(dataset, batch_size)
Create the data loader
Reference: https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler
- Parameters
dataset (Dataset) – dataset from which to load the data.
batch_size (int) – how many samples per batch to load
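The effect of class_weights-based sampling can be sketched without torch: draw indices with probability proportional to each sample's class weight. This is a toy analogue of a weighted sampler, not the real create_loader, which builds a torch DataLoader with a Sampler.

```python
import random

def weighted_indices(labels, class_weights, n, seed=0):
    """Draw n sample indices, each label weighted by class_weights[label]."""
    weights = [class_weights[y] for y in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=weights, k=n)

labels = [0] * 90 + [1] * 10          # heavily imbalanced toy labels
idx = weighted_indices(labels, {0: 1.0, 1: 9.0}, n=1000)
frac_minority = sum(labels[i] == 1 for i in idx) / len(idx)
# upweighting the rare class by 9x makes roughly half the draws come from it
```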
- class kale.loaddata.sampler.FixedSeedSamplingConfig(seed=1, balance=False, class_weights=None, balance_domain=False)
Bases:
SamplingConfig
- create_loader(dataset, batch_size)
Create the data loader with fixed seed.
- class kale.loaddata.sampler.MultiDataLoader(dataloaders, n_batches)
Bases:
object
Batch Sampler for a MultiDataset. Iterates in parallel over different batch samplers for each dataset. Yields batches [(x_1, y_1), …, (x_s, y_s)] for s datasets.
- class kale.loaddata.sampler.BalancedBatchSampler(dataset, batch_size)
Bases:
BatchSampler
BatchSampler – from an MNIST-like dataset, samples n_samples for each of the n_classes. Returns batches of size n_classes * (batch_size // n_classes). Adapted from https://github.com/adambielski/siamese-triplet/blob/master/datasets.py
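The batch-size arithmetic above can be illustrated directly: each batch takes batch_size // n_classes samples per class, so the effective batch size is n_classes * (batch_size // n_classes). A toy sketch of one balanced batch (not the kale sampler, which shuffles and cycles through the data):

```python
from collections import defaultdict

def balanced_batch(labels, batch_size, n_classes):
    """Take batch_size // n_classes indices from each class, in class order."""
    per_class = batch_size // n_classes
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    batch = []
    for y in range(n_classes):
        batch.extend(by_class[y][:per_class])
    return batch

labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
batch = balanced_batch(labels, batch_size=8, n_classes=3)
# 8 // 3 = 2 per class -> effective batch size is 3 * 2 = 6, not 8
```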
- class kale.loaddata.sampler.ReweightedBatchSampler(dataset, batch_size, class_weights)
Bases:
BatchSampler
BatchSampler – from an MNIST-like dataset, samples batch_size samples according to the given input distribution, assuming multi-class labels. Adapted from https://github.com/adambielski/siamese-triplet/blob/master/datasets.py
- kale.loaddata.sampler.get_labels(dataset)
Get class labels for a dataset.
- class kale.loaddata.sampler.DomainBalancedBatchSampler(dataset, batch_size)
Bases:
BalancedBatchSampler
- BatchSampler – samples n_samples for each of the n_domains.
Returns batches of size n_domains * (batch_size // n_domains)
- Parameters
dataset (.multi_domain.MultiDomainImageFolder or torch.utils.data.Subset) – Multi-domain data access.
batch_size (int) – Batch size
kale.loaddata.usps module
Dataset setting and data loader for USPS, from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_usps.py (based on https://github.com/mingyuliutw/CoGAN/blob/master/cogan_pytorch/src/dataset_usps.py)
- class kale.loaddata.usps.USPS(root, train=True, transform=None, download=False)
Bases:
Dataset
USPS Dataset.
- Parameters
root (string) – Root directory of dataset where the dataset file exists.
train (bool, optional) – If True, resample from dataset randomly.
download (bool, optional) – If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop.
- url = 'https://raw.githubusercontent.com/mingyuliutw/CoGAN/master/cogan_pytorch/data/uspssample/usps_28x28.pkl'
- download()
Download dataset.
- load_samples()
Load sample images from dataset.
kale.loaddata.tdc_datasets module
- class kale.loaddata.tdc_datasets.BindingDBDataset(name: str, split='train', path='./data', mode='cnn_cnn', y_log=True, drug_transform=None, protein_transform=None)
Bases:
Dataset
A custom dataset for loading and processing original TDC data, which is used as input data in DeepDTA model.
- Parameters
name (str) – TDC dataset name.
split (str) – Data split type (train, valid or test).
path (str) – dataset download/local load path (default: “./data”)
mode (str) – encoding mode (default: cnn_cnn)
drug_transform – Transform operation (default: None)
protein_transform – Transform operation (default: None)
y_log (bool) – Whether to convert y values to log space. (default: True)
kale.loaddata.video_access module
Action video dataset loading for EPIC-Kitchen, ADL, GTEA, KITCHEN. The code is based on https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/digits_dataset_access.py
- kale.loaddata.video_access.get_image_modality(image_modality)
Convert image_modality (string) into rgb (bool) and flow (bool) flags for efficiency.
- kale.loaddata.video_access.get_videodata_config(cfg)
Get the configuration parameters for video data from the cfg files.
- kale.loaddata.video_access.generate_list(data_name, data_params_local, domain)
- Parameters
data_name (string) – name of dataset
data_params_local (dict) – hyperparameters from the configuration file
domain (string) – domain type (source or target)
- Returns
data_path (string): image directory of dataset
train_listpath (string): training list file directory of dataset
test_listpath (string): test list file directory of dataset
- class kale.loaddata.video_access.VideoDataset(value)
Bases:
Enum
An enumeration.
- EPIC = 'EPIC'
- ADL = 'ADL'
- GTEA = 'GTEA'
- KITCHEN = 'KITCHEN'
- static get_source_target(source: VideoDataset, target: VideoDataset, seed, params)
Gets data loaders for source and target datasets. Sets channel_number as 3 for RGB and 2 for flow. Sets class_number as 8 for EPIC, 7 for ADL, and 6 for both GTEA and KITCHEN.
- Parameters
source – (VideoDataset): source dataset name
target – (VideoDataset): target dataset name
seed – (int): seed value set manually.
params – (CfgNode): hyperparameters from the configuration file
- Examples::
>>> source, target, num_classes = get_source_target(source, target, seed, params)
- class kale.loaddata.video_access.VideoDatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)
Bases:
DatasetAccess
Common API for video dataset access
- Parameters
data_path (string) – image directory of dataset
train_list (string) – training list file directory of dataset
test_list (string) – test list file directory of dataset
image_modality (string) – image type (RGB or Optical Flow)
frames_per_segment (int) – length of each action sample (in number of frames)
n_classes (int) – number of classes
transform_kind (string) – types of video transforms
seed – (int): seed value set manually.
- get_train_valid(valid_ratio)
Get the train and validation datasets with a fixed random split. This is used for joint input like RGB and optical flow, which calls get_train_valid twice. Fixing the random seed here ensures both calls produce the same split.
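The fixed-seed rationale can be demonstrated with a stand-alone sketch: seeding the generator with the same value makes two independent calls (one for RGB, one for flow) produce identical index splits. This mimics the behaviour with random.Random; the actual implementation uses a torch generator.

```python
import random

def seeded_split(n, valid_ratio, seed):
    """Shuffle indices with a fixed seed and split off a validation portion."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_valid = int(n * valid_ratio)
    return indices[n_valid:], indices[:n_valid]

# Called once for the RGB frames and once for the optical-flow frames:
rgb_split = seeded_split(100, 0.1, seed=42)
flow_split = seeded_split(100, 0.1, seed=42)
# identical seeds -> identical train/valid index splits for both modalities
```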
- class kale.loaddata.video_access.EPICDatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)
Bases:
VideoDatasetAccess
EPIC data loader
- get_train()
- get_test()
- class kale.loaddata.video_access.GTEADatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)
Bases:
VideoDatasetAccess
GTEA data loader
- get_train()
- get_test()
- class kale.loaddata.video_access.ADLDatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)
Bases:
VideoDatasetAccess
ADL data loader
- get_train()
- get_test()
- class kale.loaddata.video_access.KITCHENDatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)
Bases:
VideoDatasetAccess
KITCHEN data loader
- get_train()
- get_test()
kale.loaddata.video_datasets module
- class kale.loaddata.video_datasets.BasicVideoDataset(root_path: str, annotationfile_path: str, dataset_split: str, image_modality: str, num_segments: int = 1, frames_per_segment: int = 16, imagefile_template: str = 'img_{:010d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False, n_classes: int = 8)
Bases:
VideoFrameDataset
Dataset for GTEA, ADL and KITCHEN.
- Parameters
root_path (string) – The root path in which video folders lie.
annotationfile_path (string) – The annotation file containing one row per video sample.
dataset_split (string) – Split type (train or test)
image_modality (string) – Image modality (RGB or Optical Flow)
num_segments (int) – The number of segments the video should be divided into to sample frames from.
frames_per_segment (int) – The number of frames that should be loaded per segment.
imagefile_template (string) – The image filename template.
transform (Compose) – Video transform.
random_shift (bool) – Whether the frames from each segment should be taken consecutively starting from a random location inside the segment range (True), or consecutively starting from the center of the segment (False).
test_mode (bool) – Whether this is a test dataset. If so, chooses frames from segments with random_shift=False.
n_classes (int) – The number of classes.
- make_dataset()
Load data from the dataset list file and convert it into the unified format. Different datasets correspond to different numbers of classes.
- Returns
list of (video_name, start_frame, end_frame, label)
- Return type
data (list)
- class kale.loaddata.video_datasets.EPIC(root_path: str, annotationfile_path: str, dataset_split: str, image_modality: str, num_segments: int = 1, frames_per_segment: int = 16, imagefile_template: str = 'img_{:010d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False, n_classes: int = 8)
Bases:
VideoFrameDataset
Dataset for EPIC-Kitchen.
- make_dataset()
Load data from the EPIC-Kitchen list file and convert it into the unified format. Because the original list files are not the same, this inherits from class BasicVideoDataset with modifications.
kale.loaddata.video_multi_domain module
Construct a dataset for videos with (multiple) source and target domains
- class kale.loaddata.video_multi_domain.VideoMultiDomainDatasets(source_access_dict, target_access_dict, image_modality, seed, config_weight_type='natural', config_size_type=DatasetSizeType.Max, valid_split_ratio=0.1, source_sampling_config=None, target_sampling_config=None, n_fewshot=None, random_state=None, class_ids=None)
Bases:
MultiDomainDatasets
- prepare_data_loaders()
- get_domain_loaders(split='train', batch_size=32)