Load Data
Submodules
kale.loaddata.dataset_access module
Dataset Access API adapted from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_access.py
- class kale.loaddata.dataset_access.DatasetAccess(n_classes)
Bases:
object
This class ensures a unique API is used to access training, validation and test splits of any dataset.
- Parameters
n_classes (int) – the number of classes.
- n_classes()
- get_train()
- Returns
a torch.utils.data.Dataset
- Return type
Dataset
- get_train_valid(valid_ratio)
Randomly split a dataset into non-overlapping training and validation datasets.
- Parameters
valid_ratio (float) – the ratio for validation set
- Returns
the training and validation datasets
- Return type
a tuple of two torch.utils.data.Dataset objects
- get_test()
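The intent of this API can be sketched in plain Python. The MyDatasetAccess class below is hypothetical and not part of kale; it mirrors the documented interface (n_classes, get_train, get_train_valid, get_test), using a seeded random shuffle to stand in for the torch random split.

```python
import random

class MyDatasetAccess:
    """Hypothetical DatasetAccess-style class exposing train/valid/test splits."""

    def __init__(self, n_classes, samples):
        self._n_classes = n_classes
        self._samples = samples  # stands in for a torch.utils.data.Dataset

    def n_classes(self):
        return self._n_classes

    def get_train(self):
        return self._samples

    def get_train_valid(self, valid_ratio):
        # Randomly split into non-overlapping train/validation subsets.
        indices = list(range(len(self._samples)))
        random.Random(0).shuffle(indices)
        n_valid = int(len(indices) * valid_ratio)
        valid = [self._samples[i] for i in indices[:n_valid]]
        train = [self._samples[i] for i in indices[n_valid:]]
        return train, valid

    def get_test(self):
        return self._samples

access = MyDatasetAccess(10, list(range(100)))
train, valid = access.get_train_valid(0.1)
# valid holds 10% of the samples; train and valid do not overlap
```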
- kale.loaddata.dataset_access.get_class_subset(dataset, class_ids)
- Parameters
dataset – a torch.utils.data.Dataset
class_ids (list, optional) – List of chosen class ids; only samples with these labels are kept.
- Returns
a torch.utils.data.Dataset with only classes in class_ids
- Return type
Dataset
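The filtering behaviour can be illustrated with a small stand-alone sketch (pure Python, not the kale implementation): keep only the samples whose label appears in class_ids.

```python
def class_subset(dataset, class_ids):
    """Return the subset of (sample, label) pairs whose label is in class_ids."""
    return [(x, y) for (x, y) in dataset if y in class_ids]

toy = [("a", 0), ("b", 1), ("c", 2), ("d", 1)]
subset = class_subset(toy, class_ids=[1])
# subset keeps only the two samples labelled 1
```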
- kale.loaddata.dataset_access.split_by_ratios(dataset, split_ratios)
Randomly split a dataset into non-overlapping new datasets of given ratios.
- Parameters
dataset (torch.utils.data.Dataset) – Dataset to be split.
split_ratios (list) – Ratios of splits to be produced, where 0 < sum(split_ratios) <= 1.
- Returns
A list of subsets.
- Return type
list
Examples
>>> import torch
>>> from kale.loaddata.dataset_access import split_by_ratios
>>> subset1, subset2 = split_by_ratios(range(10), [0.3, 0.7])
>>> len(subset1)
3
>>> len(subset2)
7
>>> subset1, subset2 = split_by_ratios(range(10), [0.3])
>>> len(subset1)
3
>>> len(subset2)
7
>>> subset1, subset2, subset3 = split_by_ratios(range(10), [0.3, 0.3])
>>> len(subset1)
3
>>> len(subset2)
3
>>> len(subset3)
4
kale.loaddata.videos module
- class kale.loaddata.videos.VideoFrameDataset(root_path: str, annotationfile_path: str, image_modality: str = 'rgb', num_segments: int = 3, frames_per_segment: int = 1, imagefile_template: str = 'img_{:05d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False)
Bases:
Dataset
A highly efficient and adaptable dataset class for videos. Instead of loading every frame of a video, it loads x RGB frames (sparse temporal sampling), chosen evenly from the start to the end of the video, and returns either a list of x PIL images or a FRAMES x CHANNELS x HEIGHT x WIDTH tensor (with FRAMES=x) if the kale.prepdata.video_transform.ImglistToTensor() transform is used. More specifically, the frame range [START_FRAME, END_FRAME] is divided into NUM_SEGMENTS segments and FRAMES_PER_SEGMENT consecutive frames are taken from each segment.
Note
A demonstration of using this class can be seen in PyKale/examples/video_loading: https://github.com/pykale/pykale/tree/master/examples/video_loading
Note
This dataset broadly corresponds to the frame sampling technique introduced in Temporal Segment Networks at ECCV 2016: https://arxiv.org/abs/1608.00859
Note
This class relies on receiving video data in a structure where, inside a ROOT_DATA folder, each video lies in its own folder, and each video folder contains the frames of the video as individual files with a naming convention such as img_001.jpg … img_059.jpg. For enumeration and annotations, this class expects to receive the path to a .txt file where each video sample has a row with four (or more in the case of multi-label, see the example README on GitHub) space-separated values: VIDEO_FOLDER_PATH START_FRAME END_FRAME LABEL_INDEX. VIDEO_FOLDER_PATH is expected to be the path of a video folder excluding the ROOT_DATA prefix. For example, ROOT_DATA might be home\data\datasetxyz\videos\, inside of which a VIDEO_FOLDER_PATH might be jumping\0052\, sample1\, or 00053\.
- Parameters
root_path – The root path in which video folders lie. This is ROOT_DATA from the description above.
annotationfile_path – The .txt annotation file containing one row per video sample as described above.
image_modality – Image modality (RGB or Optical Flow).
num_segments – The number of segments the video should be divided into to sample frames from.
frames_per_segment – The number of frames that should be loaded per segment. For each segment’s frame-range, a random start index or the center is chosen, from which frames_per_segment consecutive frames are loaded.
imagefile_template – The image filename template that video frame files have inside of their video folders as described above.
transform – Transform pipeline that receives a list of PIL images/frames.
random_shift – Whether the frames from each segment should be taken consecutively starting from a random location inside the segment range (True), or consecutively starting from the center of the segment (False).
test_mode – Whether this is a test dataset. If so, chooses frames from segments with random_shift=False.
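The sparse temporal sampling scheme described above can be sketched independently of the class: divide the frame range into num_segments equal segments and take frames_per_segment consecutive frames from each, starting at the segment centre (the random_shift=False case). This is an illustrative reimplementation, not the exact kale code.

```python
def sample_frame_indices(num_frames, num_segments, frames_per_segment):
    """Centre-based sparse sampling: one run of consecutive frames per segment."""
    seg_len = num_frames // num_segments
    indices = []
    for seg in range(num_segments):
        seg_start = seg * seg_len
        # centre the frames_per_segment window inside the segment
        start = seg_start + max((seg_len - frames_per_segment) // 2, 0)
        indices.extend(range(start, start + frames_per_segment))
    return indices

# e.g. a 30-frame video, 3 segments, 1 frame per segment -> frames 4, 14, 24
```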
kale.loaddata.image_access module
kale.loaddata.mnistm module
Dataset setting and data loader for MNIST-M, from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_mnistm.py (based on https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py) CREDIT: https://github.com/corenel
- class kale.loaddata.mnistm.MNISTM(root, train=True, transform=None, target_transform=None, download=False)
Bases:
Dataset
MNIST-M Dataset. Auto-downloads the dataset and provides the torch Dataset API.
- Parameters
root (str) – path to the directory where the MNISTM folder will be created (or already exists).
train (bool, optional) – defaults to True. If True, loads the training data. Otherwise, loads the test data.
transform (callable, optional) – defaults to None. A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop. This preprocessing function is applied to all images (whether source or target).
target_transform (callable, optional) – defaults to None. Similar to transform; this preprocessing function is applied to all target images, after transform.
download (bool, optional) – defaults to False. Whether to allow downloading the data if not found on disk.
- url = 'https://github.com/VanushVaswani/keras_mnistm/releases/download/1.0/keras_mnistm.pkl.gz'
- raw_folder = 'raw'
- processed_folder = 'processed'
- training_file = 'mnist_m_train.pt'
- test_file = 'mnist_m_test.pt'
- download()
Download the MNISTM data.
kale.loaddata.multi_domain module
Construct a dataset with (multiple) source and target domains, adapted from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/multisource.py
- class kale.loaddata.multi_domain.WeightingType(value)
Bases:
Enum
An enumeration.
- NATURAL = 'natural'
- BALANCED = 'balanced'
- PRESET0 = 'preset0'
- class kale.loaddata.multi_domain.DatasetSizeType(value)
Bases:
Enum
An enumeration.
- Max = 'max'
- Source = 'source'
- static get_size(size_type, source_dataset, *other_datasets)
- class kale.loaddata.multi_domain.DomainsDatasetBase
Bases:
object
- prepare_data_loaders()
Handles the train/validation/test split to produce three datasets, each with data from all domains.
- get_domain_loaders(split='train', batch_size=32)
Handles the sampling of a dataset containing multiple domains.
- Parameters
split (string, optional) – [“train”|”valid”|”test”]. Which dataset to iterate on. Defaults to “train”.
batch_size (int, optional) – Defaults to 32.
- Returns
A dataloader with an API similar to torch.utils.data.DataLoader, but returning batches from several domains at each iteration.
- Return type
- class kale.loaddata.multi_domain.MultiDomainDatasets(source_access: DatasetAccess, target_access: DatasetAccess, config_weight_type='natural', config_size_type=DatasetSizeType.Max, valid_split_ratio=0.1, source_sampling_config=None, target_sampling_config=None, n_fewshot=None, random_state=None, class_ids=None)
Bases:
DomainsDatasetBase
- is_semi_supervised()
- prepare_data_loaders()
- get_domain_loaders(split='train', batch_size=32)
- class kale.loaddata.multi_domain.MultiDomainImageFolder(root: str, loader: ~typing.Callable[[str], ~typing.Any] = <function default_loader>, extensions: ~typing.Optional[~typing.Tuple[str, ...]] = ('.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.pgm', '.tif', '.tiff', '.webp'), transform: ~typing.Optional[~typing.Callable] = None, target_transform: ~typing.Optional[~typing.Callable] = None, sub_domain_set=None, sub_class_set=None, is_valid_file: ~typing.Optional[~typing.Callable[[str], bool]] = None, return_domain_label: ~typing.Optional[bool] = False, split_train_test: ~typing.Optional[bool] = False, split_ratio: float = 0.8)
Bases:
VisionDataset
A generic data loader where the samples are arranged in this way:
root/domain_a/class_1/xxx.ext
root/domain_a/class_1/xxy.ext
root/domain_a/class_2/xxz.ext
root/domain_b/class_1/efg.ext
root/domain_b/class_2/pqr.ext
root/domain_b/class_2/lmn.ext
root/domain_k/class_2/123.ext
root/domain_k/class_1/abc3.ext
root/domain_k/class_1/asd932_.ext
- Parameters
root (string) – Root directory path.
loader (callable) – A function to load a sample given its path.
extensions (tuple[string]) – A list of allowed extensions. Either extensions or is_valid_file should be passed.
transform (callable, optional) – A function/transform that takes in a sample and returns a transformed version, e.g. transforms.RandomCrop for images.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
sub_domain_set (list) – A list of domain names, which should be a subset of domains (folders) under the root directory. If None, all available domains will be used. Defaults to None.
sub_class_set (list) – A list of class names, which should be a subset of classes (folders) under each domain’s directory. If None, all available classes will be used. Defaults to None.
is_valid_file – A function that takes the path of a file and checks whether the file is a valid file (to filter corrupt files). Either extensions or is_valid_file should be passed.
- get_train()
- get_test()
- kale.loaddata.multi_domain.make_multi_domain_set(directory: str, class_to_idx: Dict[str, int], domain_to_idx: Dict[str, int], extensions: Optional[Tuple[str, ...]] = None, is_valid_file: Optional[Callable[[str], bool]] = None) List[Tuple[str, int, int]]
Generates a list of samples of the form (path_to_sample, class, domain).
- Parameters
directory (str) – root dataset directory
class_to_idx (Dict[str, int]) – dictionary mapping class name to class index
domain_to_idx (Dict[str, int]) – dictionary mapping domain name to domain index
extensions (optional) – A list of allowed extensions. Either extensions or is_valid_file should be passed. Defaults to None.
is_valid_file (optional) – A function that takes the path of a file and checks whether the file is a valid file (to filter corrupt files). Either extensions or is_valid_file should be passed, not both. Defaults to None.
- Raises
ValueError – In case both extensions and is_valid_file are None, or both are not None.
- Returns
samples of the form (path_to_sample, class, domain)
- Return type
List[Tuple[str, int, int]]
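A minimal sketch of the directory walk this function performs (a hypothetical simplification, not the kale implementation; the real function also supports an is_valid_file callback):

```python
import os
import tempfile

def scan_multi_domain(directory, class_to_idx, domain_to_idx, extensions):
    """Collect (path, class_idx, domain_idx) triples from root/domain/class/file."""
    samples = []
    for domain, d_idx in sorted(domain_to_idx.items()):
        for cls, c_idx in sorted(class_to_idx.items()):
            folder = os.path.join(directory, domain, cls)
            if not os.path.isdir(folder):
                continue
            for fname in sorted(os.listdir(folder)):
                if fname.lower().endswith(extensions):
                    samples.append((os.path.join(folder, fname), c_idx, d_idx))
    return samples

# Build a tiny root/domain/class layout and scan it.
root = tempfile.mkdtemp()
for domain in ("domain_a", "domain_b"):
    for cls in ("class_1", "class_2"):
        os.makedirs(os.path.join(root, domain, cls))
        open(os.path.join(root, domain, cls, "img0.jpg"), "w").close()

samples = scan_multi_domain(root, {"class_1": 0, "class_2": 1},
                            {"domain_a": 0, "domain_b": 1}, (".jpg",))
# 2 domains x 2 classes x 1 file each -> 4 samples
```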
- class kale.loaddata.multi_domain.ConcatMultiDomainAccess(data_access: dict, domain_to_idx: dict, return_domain_label: Optional[bool] = False)
Bases:
Dataset
Concatenate multiple datasets as a single dataset with domain labels
- Parameters
data_access (dict) – Dictionary of domain datasets, e.g. {“Domain1_name”: domain1_set, “Domain2_name”: domain2_set}
domain_to_idx (dict) – Dictionary of domain name to domain labels, e.g. {“Domain1_name”: 0, “Domain2_name”: 1}
return_domain_label (Optional[bool], optional) – Whether to return domain labels in each batch. Defaults to False.
- class kale.loaddata.multi_domain.MultiDomainAccess(data_access: dict, n_classes: int, return_domain_label: Optional[bool] = False)
Bases:
DatasetAccess
Convert multiple digits-like data accesses to a single data access.
- Parameters
data_access (dict) – Dictionary of data accesses, e.g. {"Domain1_name": domain1_access, "Domain2_name": domain2_access}
n_classes (int) – number of classes.
return_domain_label (Optional[bool], optional) – Whether to return domain labels in each batch. Defaults to False.
- get_train()
- get_test()
- class kale.loaddata.multi_domain.MultiDomainAdapDataset(data_access, valid_split_ratio=0.1, test_split_ratio=0.2, random_state: int = 1, test_on_all=False)
Bases:
DomainsDatasetBase
The class controlling how the multiple domains are iterated over.
- Parameters
data_access (MultiDomainImageFolder, or MultiDomainAccess) – Multi-domain data access.
valid_split_ratio (float, optional) – Split ratio for validation set. Defaults to 0.1.
test_split_ratio (float, optional) – Split ratio for test set. Defaults to 0.2.
random_state (int, optional) – Random state for generator. Defaults to 1.
test_on_all (bool, optional) – Whether to test the model on all target data. Defaults to False.
- prepare_data_loaders()
- get_domain_loaders(split='train', batch_size=32)
kale.loaddata.sampler module
Various sampling strategies for datasets to construct dataloader, from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/sampler.py
- class kale.loaddata.sampler.SamplingConfig(balance=False, class_weights=None, balance_domain=False)
Bases:
object
- create_loader(dataset, batch_size)
Create the data loader
Reference: https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler
- Parameters
dataset (Dataset) – dataset from which to load the data.
batch_size (int) – how many samples per batch to load
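The effect of class_weights-based sampling can be sketched without torch: draw indices with probability proportional to each sample's class weight. This is a toy analogue of a weighted sampler, not the real create_loader, which builds a torch DataLoader with a Sampler.

```python
import random

def weighted_indices(labels, class_weights, n, seed=0):
    """Draw n sample indices, each label weighted by class_weights[label]."""
    weights = [class_weights[y] for y in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=weights, k=n)

labels = [0] * 90 + [1] * 10          # heavily imbalanced toy labels
idx = weighted_indices(labels, {0: 1.0, 1: 9.0}, n=1000)
frac_minority = sum(labels[i] == 1 for i in idx) / len(idx)
# upweighting the rare class by 9x makes roughly half the draws come from it
```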
- class kale.loaddata.sampler.FixedSeedSamplingConfig(seed=1, balance=False, class_weights=None, balance_domain=False)
Bases:
SamplingConfig
- create_loader(dataset, batch_size)
Create the data loader with fixed seed.
- class kale.loaddata.sampler.MultiDataLoader(dataloaders, n_batches)
Bases:
object
Batch Sampler for a MultiDataset. Iterates in parallel over different batch samplers for each dataset. Yields batches [(x_1, y_1), …, (x_s, y_s)] for s datasets.
- class kale.loaddata.sampler.BalancedBatchSampler(dataset, batch_size)
Bases:
BatchSampler
BatchSampler – from an MNIST-like dataset, samples n_samples for each of the n_classes. Returns batches of size n_classes * (batch_size // n_classes). Adapted from https://github.com/adambielski/siamese-triplet/blob/master/datasets.py
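The batch-size arithmetic above can be illustrated directly: each batch takes batch_size // n_classes samples per class, so the effective batch size is n_classes * (batch_size // n_classes). A toy sketch of one balanced batch (not the kale sampler, which shuffles and cycles through the data):

```python
from collections import defaultdict

def balanced_batch(labels, batch_size, n_classes):
    """Take batch_size // n_classes indices from each class, in class order."""
    per_class = batch_size // n_classes
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    batch = []
    for y in range(n_classes):
        batch.extend(by_class[y][:per_class])
    return batch

labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
batch = balanced_batch(labels, batch_size=8, n_classes=3)
# 8 // 3 = 2 per class -> effective batch size is 3 * 2 = 6, not 8
```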
- class kale.loaddata.sampler.ReweightedBatchSampler(dataset, batch_size, class_weights)
Bases:
BatchSampler
BatchSampler – from an MNIST-like dataset, samples batch_size samples according to the given input distribution, assuming multi-class labels. Adapted from https://github.com/adambielski/siamese-triplet/blob/master/datasets.py
- kale.loaddata.sampler.get_labels(dataset)
Get class labels for a dataset.
- class kale.loaddata.sampler.DomainBalancedBatchSampler(dataset, batch_size)
Bases:
BalancedBatchSampler
- BatchSampler – samples n_samples for each of the n_domains.
Returns batches of size n_domains * (batch_size // n_domains)
- Parameters
dataset (.multi_domain.MultiDomainImageFolder or torch.utils.data.Subset) – Multi-domain data access.
batch_size (int) – Batch size
kale.loaddata.usps module
Dataset setting and data loader for USPS, from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_usps.py (based on https://github.com/mingyuliutw/CoGAN/blob/master/cogan_pytorch/src/dataset_usps.py)
- class kale.loaddata.usps.USPS(root, train=True, transform=None, download=False)
Bases:
Dataset
USPS Dataset.
- Parameters
root (string) – Root directory of dataset where the dataset file exists.
train (bool, optional) – If True, resample from dataset randomly.
download (bool, optional) – If true, downloads the dataset from the internet and puts it in root directory. If dataset is already downloaded, it is not downloaded again.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop.
- url = 'https://raw.githubusercontent.com/mingyuliutw/CoGAN/master/cogan_pytorch/data/uspssample/usps_28x28.pkl'
- download()
Download dataset.
- load_samples()
Load sample images from dataset.
kale.loaddata.tdc_datasets module
- class kale.loaddata.tdc_datasets.BindingDBDataset(name: str, split='train', path='./data', mode='cnn_cnn', y_log=True, drug_transform=None, protein_transform=None)
Bases:
Dataset
A custom dataset for loading and processing original TDC data, which is used as input data in DeepDTA model.
- Parameters
name (str) – TDC dataset name.
split (str) – Data split type (train, valid or test).
path (str) – dataset download/local load path (default: “./data”)
mode (str) – encoding mode (default: cnn_cnn)
drug_transform – Transform operation (default: None)
protein_transform – Transform operation (default: None)
y_log (bool) – Whether to convert y values to log space. (default: True)
kale.loaddata.video_access module
Action video dataset loading for EPIC-Kitchen, ADL, GTEA, KITCHEN. The code is based on https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/digits_dataset_access.py
- kale.loaddata.video_access.get_image_modality(image_modality)
Convert image_modality (string) into rgb (bool) and flow (bool) flags for efficiency.
- kale.loaddata.video_access.get_videodata_config(cfg)
Get the configuration parameters for video data from the cfg files.
- kale.loaddata.video_access.generate_list(data_name, data_params_local, domain)
- Parameters
data_name (string) – name of dataset
data_params_local (dict) – hyperparameters from the configuration file
domain (string) – domain type (source or target)
- Returns
data_path (string): image directory of dataset
train_listpath (string): training list file directory of dataset
test_listpath (string): test list file directory of dataset
- class kale.loaddata.video_access.VideoDataset(value)
Bases:
Enum
An enumeration.
- EPIC = 'EPIC'
- ADL = 'ADL'
- GTEA = 'GTEA'
- KITCHEN = 'KITCHEN'
- static get_source_target(source: VideoDataset, target: VideoDataset, seed, params)
Gets data loaders for source and target datasets. Sets channel_number as 3 for RGB and 2 for flow. Sets class_number as 8 for EPIC, 7 for ADL, and 6 for both GTEA and KITCHEN.
- Parameters
source – (VideoDataset): source dataset name
target – (VideoDataset): target dataset name
seed – (int): seed value set manually.
params – (CfgNode): hyperparameters from the configuration file
- Examples::
>>> source, target, num_classes = get_source_target(source, target, seed, params)
- class kale.loaddata.video_access.VideoDatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)
Bases:
DatasetAccess
Common API for video dataset access
- Parameters
data_path (string) – image directory of dataset
train_list (string) – training list file directory of dataset
test_list (string) – test list file directory of dataset
image_modality (string) – image type (RGB or Optical Flow)
frames_per_segment (int) – length of each action sample (in number of frames)
n_classes (int) – number of classes
transform_kind (string) – types of video transforms
seed – (int): seed value set manually.
- get_train_valid(valid_ratio)
Get the train and validation datasets with a fixed random split. This is used for joint input like RGB and optical flow, which calls get_train_valid twice. Fixing the random seed here ensures both calls produce the same split.
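The fixed-seed rationale can be demonstrated with a stand-alone sketch: seeding the generator with the same value makes two independent calls (one for RGB, one for flow) produce identical index splits. This mimics the behaviour with random.Random; the actual implementation uses a torch generator.

```python
import random

def seeded_split(n, valid_ratio, seed):
    """Shuffle indices with a fixed seed and split off a validation portion."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_valid = int(n * valid_ratio)
    return indices[n_valid:], indices[:n_valid]

# Called once for the RGB frames and once for the optical-flow frames:
rgb_split = seeded_split(100, 0.1, seed=42)
flow_split = seeded_split(100, 0.1, seed=42)
# identical seeds -> identical train/valid index splits for both modalities
```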
- class kale.loaddata.video_access.EPICDatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)
Bases:
VideoDatasetAccess
EPIC data loader
- get_train()
- get_test()
- class kale.loaddata.video_access.GTEADatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)
Bases:
VideoDatasetAccess
GTEA data loader
- get_train()
- get_test()
- class kale.loaddata.video_access.ADLDatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)
Bases:
VideoDatasetAccess
ADL data loader
- get_train()
- get_test()
- class kale.loaddata.video_access.KITCHENDatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)
Bases:
VideoDatasetAccess
KITCHEN data loader
- get_train()
- get_test()
kale.loaddata.video_datasets module
- class kale.loaddata.video_datasets.BasicVideoDataset(root_path: str, annotationfile_path: str, dataset_split: str, image_modality: str, num_segments: int = 1, frames_per_segment: int = 16, imagefile_template: str = 'img_{:010d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False, n_classes: int = 8)
Bases:
VideoFrameDataset
Dataset for GTEA, ADL and KITCHEN.
- Parameters
root_path (string) – The root path in which video folders lie.
annotationfile_path (string) – The annotation file containing one row per video sample.
dataset_split (string) – Split type (train or test)
image_modality (string) – Image modality (RGB or Optical Flow)
num_segments (int) – The number of segments the video should be divided into to sample frames from.
frames_per_segment (int) – The number of frames that should be loaded per segment.
imagefile_template (string) – The image filename template.
transform (Compose) – Video transform.
random_shift (bool) – Whether the frames from each segment should be taken consecutively starting from a random location inside the segment range (True), or consecutively starting from the center of the segment (False).
test_mode (bool) – Whether this is a test dataset. If so, chooses frames from segments with random_shift=False.
n_classes (int) – The number of classes.
- make_dataset()
Load data from the dataset list file and convert it into the unified format. Different datasets correspond to different numbers of classes.
- Returns
list of (video_name, start_frame, end_frame, label)
- Return type
data (list)
- class kale.loaddata.video_datasets.EPIC(root_path: str, annotationfile_path: str, dataset_split: str, image_modality: str, num_segments: int = 1, frames_per_segment: int = 16, imagefile_template: str = 'img_{:010d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False, n_classes: int = 8)
Bases:
VideoFrameDataset
Dataset for EPIC-Kitchen.
- make_dataset()
Load data from the EPIC-Kitchen list file and convert it into the unified format. Because the original list files are not the same, this inherits from class BasicVideoDataset with modifications.
kale.loaddata.video_multi_domain module
Construct a dataset for videos with (multiple) source and target domains
- class kale.loaddata.video_multi_domain.VideoMultiDomainDatasets(source_access_dict, target_access_dict, image_modality, seed, config_weight_type='natural', config_size_type=DatasetSizeType.Max, valid_split_ratio=0.1, source_sampling_config=None, target_sampling_config=None, n_fewshot=None, random_state=None, class_ids=None)
Bases:
MultiDomainDatasets
- prepare_data_loaders()
- get_domain_loaders(split='train', batch_size=32)