Load Data
Submodules
kale.loaddata.avmnist_datasets module
Dataset setting and data loader for AVMNIST dataset by refactoring https://github.com/pliang279/MultiBench/blob/main/datasets/avmnist/get_data.py
- class kale.loaddata.avmnist_datasets.AVMNISTDataset(data_dir, batch_size=40, flatten_audio=False, flatten_image=False, unsqueeze_channel=True, normalize_image=True, normalize_audio=True)
Bases:
object
This class loads the AVMNIST data stored in a specified directory and prepares it for training, validation, and testing. It also handles pre-processing, such as reshaping and normalizing the data, based on the provided arguments: the audio and image data can be flattened or normalized, and an extra dimension (typically the channel dimension of image or audio data) can be added. The class also splits the data into training and validation sets and provides separate data loaders for the training, validation, and test sets, which can be used to iterate over the data during model training and evaluation. This simplifies data preparation for multimodal learning tasks, so the user can focus on model architecture and hyperparameter tuning.
- Parameters:
data_dir (str) – Directory of data.
batch_size (int, optional) – Batch size. Defaults to 40.
flatten_audio (bool, optional) – Whether to flatten audio data or not. Defaults to False.
flatten_image (bool, optional) – Whether to flatten image data or not. Defaults to False.
unsqueeze_channel (bool, optional) – Whether to unsqueeze any channels or not. Defaults to True.
normalize_image (bool, optional) – Whether to normalize the images before returning. Defaults to True.
normalize_audio (bool, optional) – Whether to normalize the audio before returning. Defaults to True.
- load_data()
- get_train_loader(shuffle=True)
- get_valid_loader(shuffle=False)
- get_test_loader(shuffle=False)
kale.loaddata.dataset_access module
Dataset Access API adapted from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_access.py
- class kale.loaddata.dataset_access.DatasetAccess(n_classes)
Bases:
object
This class ensures a unique API is used to access training, validation and test splits of any dataset.
- Parameters:
n_classes (int) – the number of classes.
- n_classes()
- get_train()
- Returns:
a torch.utils.data.Dataset
- Return type:
Dataset
- get_train_valid(valid_ratio)
Randomly split a dataset into non-overlapping training and validation datasets.
- Parameters:
valid_ratio (float) – the ratio for validation set
- Returns:
a torch.utils.data.Dataset
- Return type:
Dataset
- get_test()
- kale.loaddata.dataset_access.get_class_subset(dataset, class_ids)
- Parameters:
dataset – a torch.utils.data.Dataset
class_ids (list, optional) – List of chosen subset of class ids.
- Returns:
a torch.utils.data.Dataset with only classes in class_ids
- Return type:
Dataset
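As a rough illustration of the filtering that get_class_subset describes, the sketch below keeps only the samples whose label is in class_ids. It is a plain-Python stand-in and not the library's implementation: the helper name class_subset_indices and the list of (sample, label) pairs used as a dataset are assumptions made for this example.

```python
def class_subset_indices(dataset, class_ids):
    """Return indices of samples whose label is in class_ids.

    `dataset` is assumed to be a sequence of (sample, label) pairs.
    """
    wanted = set(class_ids)
    return [i for i, (_, label) in enumerate(dataset) if label in wanted]


# Toy dataset: six samples with labels 0, 1, 2 repeated.
toy_dataset = [("a", 0), ("b", 1), ("c", 2), ("d", 0), ("e", 1), ("f", 2)]
kept = class_subset_indices(toy_dataset, class_ids=[0, 2])
subset = [toy_dataset[i] for i in kept]
print(subset)
```

With a PyTorch dataset, indices like these could be wrapped in torch.utils.data.Subset to obtain a dataset restricted to the chosen classes.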
- kale.loaddata.dataset_access.split_by_ratios(dataset, split_ratios)
Randomly split a dataset into non-overlapping new datasets of given ratios.
- Parameters:
dataset (torch.utils.data.Dataset) – Dataset to be split.
split_ratios (list) – Ratios of splits to be produced, where 0 < sum(split_ratios) <= 1.
- Returns:
A list of subsets.
- Return type:
[List]
Examples
>>> import torch
>>> from kale.loaddata.dataset_access import split_by_ratios
>>> subset1, subset2 = split_by_ratios(range(10), [0.3, 0.7])
>>> len(subset1)
3
>>> len(subset2)
7
>>> subset1, subset2 = split_by_ratios(range(10), [0.3])
>>> len(subset1)
3
>>> len(subset2)
7
>>> subset1, subset2, subset3 = split_by_ratios(range(10), [0.3, 0.3])
>>> len(subset1)
3
>>> len(subset2)
3
>>> len(subset3)
4
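The examples for split_by_ratios suggest that when the ratios sum to less than 1, the remaining samples form one extra subset. The sketch below reproduces that behaviour on plain index lists. It is only an approximation under stated assumptions: the function name split_indices_by_ratios, the seed argument, and the floor-based rounding are choices made for this example, not details confirmed by the library.

```python
import random


def split_indices_by_ratios(n_samples, split_ratios, seed=0):
    """Randomly partition range(n_samples) into non-overlapping index lists."""
    total = sum(split_ratios)
    if not 0 < total <= 1 + 1e-9:
        raise ValueError("split_ratios must satisfy 0 < sum(split_ratios) <= 1")
    lengths = [int(ratio * n_samples) for ratio in split_ratios]
    leftover = n_samples - sum(lengths)
    if total < 1 - 1e-9:
        lengths.append(leftover)   # the remainder becomes an extra subset
    else:
        lengths[-1] += leftover    # absorb rounding loss in the last subset
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    splits, start = [], 0
    for length in lengths:
        splits.append(indices[start:start + length])
        start += length
    return splits


full = split_indices_by_ratios(10, [0.3, 0.7])
with_remainder = split_indices_by_ratios(10, [0.3, 0.3])
print([len(s) for s in full], [len(s) for s in with_remainder])
```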
kale.loaddata.image_access module
- class kale.loaddata.image_access.DigitDataset(value)
Bases:
Enum
An enumeration.
- MNIST = 'MNIST'
- MNIST_RGB = 'MNIST_RGB'
- MNISTM = 'MNISTM'
- USPS = 'USPS'
- USPS_RGB = 'USPS_RGB'
- SVHN = 'SVHN'
- static get_channel_numbers(dataset: DigitDataset)
- static get_digit_transform(dataset: DigitDataset, n_channels)
- static get_access(dataset: DigitDataset, data_path, num_channels=None)
Gets data loaders for digit datasets
- Parameters:
dataset (DigitDataset) – dataset name
data_path (string) – root directory of dataset
num_channels (int) – number of channels, defaults to None
- Examples::
>>> data_access, num_channel = DigitDataset.get_access(dataset, data_path)
- static get_source_target(source: DigitDataset, target: DigitDataset, data_path)
Gets data loaders for source and target datasets
- Parameters:
source (DigitDataset) – source dataset name
target (DigitDataset) – target dataset name
data_path (string) – root directory of dataset
- Examples::
>>> source_access, target_access, num_channel = DigitDataset.get_source_target(source, target, data_path)
- class kale.loaddata.image_access.DigitDatasetAccess(data_path, transform_kind)
Bases:
DatasetAccess
Common API for digit dataset access
- Parameters:
data_path (string) – root directory of dataset
transform_kind (string) – types of image transforms
- class kale.loaddata.image_access.MNISTDatasetAccess(data_path, transform_kind)
Bases:
DigitDatasetAccess
MNIST data loader
- get_train()
- get_test()
- class kale.loaddata.image_access.MNISTMDatasetAccess(data_path, transform_kind)
Bases:
DigitDatasetAccess
Modified MNIST (MNISTM) data loader
- get_train()
- get_test()
- class kale.loaddata.image_access.USPSDatasetAccess(data_path, transform_kind)
Bases:
DigitDatasetAccess
USPS data loader
- get_train()
- get_test()
- class kale.loaddata.image_access.SVHNDatasetAccess(data_path, transform_kind)
Bases:
DigitDatasetAccess
SVHN data loader
- get_train()
- get_test()
- class kale.loaddata.image_access.OfficeAccess(root, transform=Compose( Compose( Resize(size=256, interpolation=bilinear, max_size=None, antialias=warn) CenterCrop(size=(256, 256)) ) ToTensor() Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) ), download=False, **kwargs)
Bases:
MultiDomainImageFolder, DatasetAccess
Common API for office dataset access
- Parameters:
root (string) – root directory of dataset
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version. Defaults to office_transform.
download (bool, optional) – Whether to allow downloading the data if not found on disk. Defaults to False.
References
[1] Saenko, K., Kulis, B., Fritz, M. and Darrell, T., 2010, September. Adapting visual category models to new domains. In European Conference on Computer Vision (pp. 213-226). Springer, Berlin, Heidelberg.
[2] Griffin, G., Holub, A. and Perona, P., 2007. Caltech-256 Object Category Dataset. California Institute of Technology. (Unpublished). https://resolver.caltech.edu/CaltechAUTHORS:CNS-TR-2007-001.
[3] Gong, B., Shi, Y., Sha, F. and Grauman, K., 2012, June. Geodesic flow kernel for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 2066-2073).
- static download(path)
Download dataset.
Office-31 source: https://www.cc.gatech.edu/~judy/domainadapt/#datasets_code
Caltech-256 source: http://www.vision.caltech.edu/Image_Datasets/Caltech256/
Data with this library is adapted from: http://www.stat.ucla.edu/~jxie/iFRAME/code/imageClassification.rar
- class kale.loaddata.image_access.Office31(root, **kwargs)
Bases:
OfficeAccess
- class kale.loaddata.image_access.OfficeCaltech(root, **kwargs)
Bases:
OfficeAccess
- class kale.loaddata.image_access.ImageAccess
Bases:
object
- static get_multi_domain_images(image_set_name: str, data_path: str, sub_domain_set=None, **kwargs)
Get multi-domain images as a dataset from the given data path.
- Parameters:
image_set_name (str) – name of image dataset
data_path (str) – path to the image dataset
sub_domain_set (list, optional) – A list of domain names, which should be a subset of domains under the directory of data path. If None, all available domains will be used. Defaults to None.
- Returns:
Multi-domain image dataset
- Return type:
- kale.loaddata.image_access.get_cifar(cfg)
Gets training and validation data loaders for the CIFAR datasets
- Parameters:
cfg (CfgNode) – hyperparameters from configure file
Examples
>>> train_loader, valid_loader = get_cifar(cfg)
- kale.loaddata.image_access.read_dicom_phases(dicom_path, sort_instance=True)
Read dicom images of multiple instances/phases for one patient.
- Parameters:
dicom_path (str) – Path to DICOM images.
sort_instance (bool, optional) – Whether to sort images by InstanceNumber (i.e. phase number). Defaults to True.
- Returns:
List of dicom dataset objects
- Return type:
[list]
- kale.loaddata.image_access.check_dicom_series_uid(dcm_phases, sort_instance=True)
Check if all dicom images have the same series UID.
- Parameters:
dcm_phases (list) – List of dicom dataset objects (phases)
sort_instance (bool, optional) – Whether to sort images by InstanceNumber (i.e. phase number). Defaults to True.
- Returns:
List of list(s) of dicom phases.
- Return type:
list
- kale.loaddata.image_access.read_dicom_dir(dicom_path, sort_instance=True, sort_patient=False, check_series_uid=False)
Read dicom files for multiple patients and multiple instances / phases from a given directory arranged in the following structure:
root/patient_a/…/phase_1.dcm
root/patient_a/…/phase_2.dcm
root/patient_a/…/phase_3.dcm
root/patient_b/…/phase_1.dcm
root/patient_b/…/phase_2.dcm
root/patient_b/…/phase_3.dcm
root/patient_m/…/phase_1.dcm
root/patient_m/…/phase_2.dcm
root/patient_m/…/phase_3.dcm
- Parameters:
dicom_path (str) – Directory of DICOM files.
sort_instance (bool, optional) – Whether to sort images by InstanceNumber (i.e. phase number) for each subject. Defaults to True.
sort_patient (bool, optional) – Whether to sort subjects’ images by PatientID. Defaults to False.
check_series_uid (bool, optional) – Whether to check that all series UIDs are the same. Defaults to False.
- Returns:
A list of dicom dataset lists
- Return type:
[list[list]]
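The directory layout described for read_dicom_dir can be traversed with the standard library alone. The sketch below only groups file paths by their top-level patient folder. It does not read any DICOM metadata, so sorting by file name and by folder name here merely stands in for sorting by InstanceNumber and PatientID. The function name group_dicom_paths_by_patient and the intermediate folder name "series" are hypothetical.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def group_dicom_paths_by_patient(root, sort_patient=False):
    """Group *.dcm paths under root by their top-level patient directory."""
    root = Path(root)
    groups = defaultdict(list)
    for path in root.rglob("*.dcm"):
        patient = path.relative_to(root).parts[0]
        groups[patient].append(path)
    patients = sorted(groups) if sort_patient else list(groups)
    # File names stand in for InstanceNumber ordering in this sketch.
    return [sorted(groups[patient]) for patient in patients]


with tempfile.TemporaryDirectory() as tmp:
    # Build root/patient_x/series/phase_n.dcm with empty placeholder files.
    for patient in ("patient_b", "patient_a"):
        for phase in (2, 1, 3):
            file_path = Path(tmp) / patient / "series" / f"phase_{phase}.dcm"
            file_path.parent.mkdir(parents=True, exist_ok=True)
            file_path.touch()
    grouped = group_dicom_paths_by_patient(tmp, sort_patient=True)
    phase_names = [[p.name for p in phases] for phases in grouped]
    patient_names = [phases[0].relative_to(tmp).parts[0] for phases in grouped]

print(patient_names, phase_names)
```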
- kale.loaddata.image_access.dicom2arraylist(dicom_patient_list, return_patient_id=False)
Convert dicom datasets to arrays
- Parameters:
dicom_patient_list (list) – List of dicom patient lists.
return_patient_id (bool, optional) – Whether to return PatientID. Defaults to False.
- Returns:
List of array-like tensors. If return_patient_id is True, a list of PatientIDs is also returned.
- Return type:
list
kale.loaddata.mnistm module
Dataset setting and data loader for MNIST-M, from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_mnistm.py (based on https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py) CREDIT: https://github.com/corenel
- class kale.loaddata.mnistm.MNISTM(root, train=True, transform=None, target_transform=None, download=False)
Bases:
Dataset
MNIST-M Dataset. Auto-downloads the dataset and provides the torch Dataset API.
- Parameters:
root (str) – path to directory where the MNISTM folder will be created (or exists.)
train (bool, optional) – defaults to True. If True, loads the training data. Otherwise, loads the test data.
transform (callable, optional) – defaults to None. A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop. This preprocessing function is applied to all images (whether source or target).
target_transform (callable, optional) – defaults to None, similar to transform. This preprocessing function is applied to all target images, after transform.
download (bool, optional) – defaults to False. Whether to allow downloading the data if not found on disk.
- url = 'https://github.com/VanushVaswani/keras_mnistm/releases/download/1.0/keras_mnistm.pkl.gz'
- raw_folder = 'raw'
- processed_folder = 'processed'
- training_file = 'mnist_m_train.pt'
- test_file = 'mnist_m_test.pt'
- download()
Download the MNISTM data.
kale.loaddata.multi_domain module
Construct a dataset with (multiple) source and target domains, adapted from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/multisource.py
- class kale.loaddata.multi_domain.WeightingType(value)
Bases:
Enum
An enumeration.
- NATURAL = 'natural'
- BALANCED = 'balanced'
- PRESET0 = 'preset0'
- class kale.loaddata.multi_domain.DatasetSizeType(value)
Bases:
Enum
An enumeration.
- Max = 'max'
- Source = 'source'
- static get_size(size_type, source_dataset, *other_datasets)
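The entry for get_size carries no description, so the following is a guess at the intent based on the member names alone: Max would pick the largest dataset length and Source the length of the source dataset. The enum SizeType below is a hypothetical stand-in for DatasetSizeType, and the behaviour is an assumption rather than documented fact.

```python
from enum import Enum


class SizeType(Enum):
    """Hypothetical stand-in mirroring the two members listed above."""

    Max = "max"
    Source = "source"


def get_size(size_type, source_dataset, *other_datasets):
    """Pick a nominal dataset size from the source and other datasets."""
    if size_type is SizeType.Max:
        return max(len(d) for d in (source_dataset, *other_datasets))
    if size_type is SizeType.Source:
        return len(source_dataset)
    raise ValueError(f"Unsupported size type: {size_type}")


source, target = list(range(100)), list(range(250))
size_max = get_size(SizeType.Max, source, target)
size_source = get_size(SizeType.Source, source, target)
print(size_max, size_source)
```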
- class kale.loaddata.multi_domain.DomainsDatasetBase
Bases:
object
- prepare_data_loaders()
handles train/validation/test split to have 3 datasets each with data from all domains
- get_domain_loaders(split='train', batch_size=32)
handles the sampling of a dataset containing multiple domains
- Parameters:
split (string, optional) – [“train”|”valid”|”test”]. Which dataset to iterate on. Defaults to “train”.
batch_size (int, optional) – Defaults to 32.
- Returns:
A dataloader with API similar to the torch.dataloader, but returning batches from several domains at each iteration.
- Return type:
- class kale.loaddata.multi_domain.MultiDomainDatasets(source_access: DatasetAccess, target_access: DatasetAccess, config_weight_type='natural', config_size_type=DatasetSizeType.Max, valid_split_ratio=0.1, source_sampling_config=None, target_sampling_config=None, n_fewshot=None, random_state=None, class_ids=None)
Bases:
DomainsDatasetBase
- is_semi_supervised()
- prepare_data_loaders()
- get_domain_loaders(split='train', batch_size=32)
- class kale.loaddata.multi_domain.MultiDomainImageFolder(root: str, loader: ~typing.Callable[[str], ~typing.Any] = <function default_loader>, extensions: ~typing.Tuple[str, ...] | None = ('.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.pgm', '.tif', '.tiff', '.webp'), transform: ~typing.Callable | None = None, target_transform: ~typing.Callable | None = None, sub_domain_set=None, sub_class_set=None, is_valid_file: ~typing.Callable[[str], bool] | None = None, return_domain_label: bool | None = False, split_train_test: bool | None = False, split_ratio: float = 0.8)
Bases:
VisionDataset
A generic data loader where the samples are arranged in this way:
root/domain_a/class_1/xxx.ext
root/domain_a/class_1/xxy.ext
root/domain_a/class_2/xxz.ext
root/domain_b/class_1/efg.ext
root/domain_b/class_2/pqr.ext
root/domain_b/class_2/lmn.ext
root/domain_k/class_2/123.ext
root/domain_k/class_1/abc3.ext
root/domain_k/class_1/asd932_.ext
- Parameters:
root (string) – Root directory path.
loader (callable) – A function to load a sample given its path.
extensions (tuple[string]) – A list of allowed extensions. Either extensions or is_valid_file should be passed.
transform (callable, optional) – A function/transform that takes in a sample and returns a transformed version, e.g. transforms.RandomCrop for images.
target_transform (callable, optional) – A function/transform that takes in the target and transforms it.
sub_domain_set (list) – A list of domain names, which should be a subset of domains (folders) under the root directory. If None, all available domains will be used. Defaults to None.
sub_class_set (list) – A list of class names, which should be a subset of classes (folders) under each domain’s directory. If None, all available classes will be used. Defaults to None.
is_valid_file – A function that takes the path of a file and checks whether the file is valid (used to detect corrupt files). Either extensions or is_valid_file should be passed.
- get_train()
- get_test()
- kale.loaddata.multi_domain.make_multi_domain_set(directory: str, class_to_idx: Dict[str, int], domain_to_idx: Dict[str, int], extensions: Tuple[str, ...] | None = None, is_valid_file: Callable[[str], bool] | None = None) List[Tuple[str, int, int]]
Generates a list of samples of a form (path_to_sample, class, domain).
- Parameters:
directory (str) – root dataset directory
class_to_idx (Dict[str, int]) – dictionary mapping class name to class index
domain_to_idx (Dict[str, int]) – dictionary mapping domain name to domain index
extensions (optional) – A list of allowed extensions. Either extensions or is_valid_file should be passed. Defaults to None.
is_valid_file (optional) – A function that takes the path of a file and checks whether the file is valid (used to detect corrupt files). extensions and is_valid_file should not both be passed. Defaults to None.
- Raises:
ValueError – In case extensions and is_valid_file are both None or both not None.
- Returns:
samples of a form (path_to_sample, class, domain)
- Return type:
List[Tuple[str, int, int]]
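To make the root/domain/class/file layout and the (path_to_sample, class, domain) tuples concrete, here is a minimal standard-library sketch. It is not the library's code: the function name list_multi_domain_samples, the traversal order, and the case-insensitive extension check are assumptions made for illustration.

```python
import os
import tempfile


def list_multi_domain_samples(directory, class_to_idx, domain_to_idx,
                              extensions=None, is_valid_file=None):
    """Collect (path, class_index, domain_index) for root/domain/class/file."""
    # Exactly one of the two filters must be supplied.
    if (extensions is None) == (is_valid_file is None):
        raise ValueError("Pass exactly one of extensions or is_valid_file")
    if extensions is not None:
        allowed = tuple(extensions)

        def is_valid_file(path):
            return path.lower().endswith(allowed)

    samples = []
    for domain in sorted(domain_to_idx):
        for cls in sorted(class_to_idx):
            class_dir = os.path.join(directory, domain, cls)
            if not os.path.isdir(class_dir):
                continue
            for name in sorted(os.listdir(class_dir)):
                path = os.path.join(class_dir, name)
                if os.path.isfile(path) and is_valid_file(path):
                    samples.append((path, class_to_idx[cls], domain_to_idx[domain]))
    return samples


with tempfile.TemporaryDirectory() as root:
    # Three images plus one text file that the extension filter should skip.
    for rel in ("domain_a/class_1/x.png", "domain_a/class_2/y.png",
                "domain_b/class_1/z.png", "domain_b/class_1/notes.txt"):
        full = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()
    samples = list_multi_domain_samples(
        root, {"class_1": 0, "class_2": 1}, {"domain_a": 0, "domain_b": 1},
        extensions=(".png",))
    labels = [(cls, dom) for _, cls, dom in samples]

print(labels)
```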
- class kale.loaddata.multi_domain.ConcatMultiDomainAccess(data_access: dict, domain_to_idx: dict, return_domain_label: bool | None = False)
Bases:
Dataset
Concatenate multiple datasets as a single dataset with domain labels
- Parameters:
data_access (dict) – Dictionary of domain datasets, e.g. {“Domain1_name”: domain1_set, “Domain2_name”: domain2_set}
domain_to_idx (dict) – Dictionary of domain name to domain labels, e.g. {“Domain1_name”: 0, “Domain2_name”: 1}
return_domain_label (Optional[bool], optional) – Whether to return domain labels in each batch. Defaults to False.
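One way to picture the concatenation with domain labels is a wrapper that maps a global index to a (domain, local index) pair. The sketch below does this with plain Python sequences of (sample, label) pairs. The class name ConcatWithDomainLabels and the exact tuple layout returned by indexing are assumptions for this example, not the library's behaviour.

```python
import bisect
from itertools import accumulate


class ConcatWithDomainLabels:
    """Concatenate several (sample, label) sequences, tagging each item by domain."""

    def __init__(self, data_access, domain_to_idx, return_domain_label=False):
        self.names = list(data_access)
        self.datasets = [data_access[name] for name in self.names]
        self.domain_labels = [domain_to_idx[name] for name in self.names]
        # Running totals of dataset lengths, e.g. sizes [2, 3] give [2, 5].
        self.cumulative_sizes = list(accumulate(len(d) for d in self.datasets))
        self.return_domain_label = return_domain_label

    def __len__(self):
        return self.cumulative_sizes[-1] if self.cumulative_sizes else 0

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        which = bisect.bisect_right(self.cumulative_sizes, index)
        offset = index - (self.cumulative_sizes[which - 1] if which else 0)
        sample, label = self.datasets[which][offset]
        if self.return_domain_label:
            return sample, label, self.domain_labels[which]
        return sample, label


domain1 = [("x0", 0), ("x1", 1)]
domain2 = [("y0", 1), ("y1", 0), ("y2", 1)]
combined = ConcatWithDomainLabels(
    {"Domain1_name": domain1, "Domain2_name": domain2},
    {"Domain1_name": 0, "Domain2_name": 1},
    return_domain_label=True,
)
print(len(combined), combined[0], combined[2])
```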
- class kale.loaddata.multi_domain.MultiDomainAccess(data_access: dict, n_classes: int, return_domain_label: bool | None = False)
Bases:
DatasetAccess
Convert multiple digits-like data accesses to a single data access.
- Parameters:
data_access (dict) – Dictionary of data accesses, e.g. {“Domain1_name”: domain1_access, “Domain2_name”: domain2_access}
n_classes (int) – number of classes.
return_domain_label (Optional[bool], optional) – Whether to return domain labels in each batch. Defaults to False.
- get_train()
- get_test()
- class kale.loaddata.multi_domain.MultiDomainAdapDataset(data_access, valid_split_ratio=0.1, test_split_ratio=0.2, random_state: int = 1, test_on_all=False)
Bases:
DomainsDatasetBase
The class controlling how the multiple domains are iterated over.
- Parameters:
data_access (MultiDomainImageFolder, or MultiDomainAccess) – Multi-domain data access.
valid_split_ratio (float, optional) – Split ratio for validation set. Defaults to 0.1.
test_split_ratio (float, optional) – Split ratio for test set. Defaults to 0.2.
random_state (int, optional) – Random state for generator. Defaults to 1.
test_on_all (bool, optional) – Whether to test the model on all target domains. Defaults to False.
- prepare_data_loaders()
- get_domain_loaders(split='train', batch_size=32)
kale.loaddata.multiomics_datasets module
Construct a dataset with multiple omics modalities based on PyTorch Geometric.
This code is written by refactoring the MOGONET dataset code (https://github.com/txWang/MOGONET/blob/main/train_test.py) within the ‘Dataset’ class provided in the PyTorch Geometric.
Reference: Wang, T., Shao, W., Huang, Z., Tang, H., Zhang, J., Ding, Z., Huang, K. (2021). MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nature communications. https://www.nature.com/articles/s41467-021-23774-w
- class kale.loaddata.multiomics_datasets.MultiomicsDataset(root: str, num_modalities: int, num_classes: int, url: str | None = None, raw_file_names: List[str] | None = None, random_split: bool = False, train_size: float = 0.7, transform: Callable | None = None, pre_transform: Callable | None = None, target_pre_transform: Callable | None = None)
Bases:
Dataset
The multiomics data for creating a graph dataset. See the PyTorch Geometric documentation for the accompanying tutorial.
- Parameters:
root (string) – Root directory where the dataset should be saved.
num_modalities (int) – The total number of modalities in the dataset.
num_classes (int) – The total number of classes in the dataset.
url (string, optional) – The url to download the dataset from.
raw_file_names (list[callable], optional) – The name of the files in the self.raw_dir folder that must be present in order to skip downloading.
random_split (bool, optional) – Whether to split the dataset into random train and test subsets. (default: False)
train_size (float, optional) – The proportion of the dataset to include in the train split that should be between 0.0 and 1.0. This parameter is used when random_split is True.
transform (callable, optional) – A function/transform that takes in an array_like data and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in an array_like data and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
target_pre_transform (callable, optional) – A function/transform that takes in an array_like of labels and returns a transformed version. The label object will be transformed before being saved to disk. (default: None)
- property raw_file_names: List[str] | None
The name of the files in the self.raw_dir folder that must be present in order to skip downloading.
- property processed_file_names: str | List[str] | Tuple
The name of the files in the self.processed_dir folder that must be present in order to skip processing.
- download() None
Downloads the dataset to the self.raw_dir folder.
- process() None
Processes the dataset to the self.processed_dir folder. This function reads input files, creates a Data object, and saves it into the processed_dir.
- static get_random_split(labels, num_classes: int, train_size: float = 0.7) Tuple
Split arrays into random train and test indices.
- Parameters:
labels (array-like) – Array-like object that represents the labels of the dataset.
num_classes (int) – The total number of classes in the dataset.
train_size (float, optional) – The proportion of the dataset to include in the train split that should be between 0.0 and 1.0. (default: 0.7)
- Returns:
A tuple of two arrays containing the indices for the train and test sets.
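Since get_random_split takes num_classes alongside the labels, one plausible reading is that the split is drawn separately within each class. The sketch below follows that reading. This is an assumption, and the function name random_split_indices and the seed argument are hypothetical.

```python
import random


def random_split_indices(labels, num_classes, train_size=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in range(num_classes):
        members = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(members)
        n_train = int(len(members) * train_size)
        train_idx.extend(members[:n_train])
        test_idx.extend(members[n_train:])
    return sorted(train_idx), sorted(test_idx)


labels = [0] * 10 + [1] * 10
train_idx, test_idx = random_split_indices(labels, num_classes=2, train_size=0.7)
print(len(train_idx), len(test_idx))
```

Splitting per class keeps the class proportions of the train and test sets close to those of the full dataset, which matters when classes are imbalanced.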
- static get_adjacency_info(data: Tensor) Tuple
Calculate a sparse adjacency matrix of the input dataset defined by edge indices and edge attributes.
- Parameters:
data (torch.Tensor) – The input data.
- Returns:
A tuple of edge indices and edge attributes.
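A sparse adjacency matrix expressed as edge indices plus edge attributes can be pictured as the coordinate (COO) form of a dense matrix. The sketch below converts a small dense matrix into that form using nested lists. How the adjacency values themselves are computed from the input features is not covered here, and the function name dense_to_edges is hypothetical.

```python
def dense_to_edges(adjacency):
    """Convert a dense square matrix into (edge_index, edge_attr).

    edge_index is [source_nodes, target_nodes]; edge_attr holds the
    matching non-zero weights, in row-major order.
    """
    sources, targets, weights = [], [], []
    for i, row in enumerate(adjacency):
        for j, weight in enumerate(row):
            if weight != 0:
                sources.append(i)
                targets.append(j)
                weights.append(weight)
    return [sources, targets], weights


adjacency = [
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.2],
    [0.0, 0.2, 0.0],
]
edge_index, edge_attr = dense_to_edges(adjacency)
print(edge_index, edge_attr)
```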
- extend_data(data: Data) Data
Extend data object by adding additional attributes.
- Parameters:
data (Data) – An input data object.
- Returns:
Extended data object with additional attributes.
- len() int
Returns the number of graphs stored in the dataset.
- get(modality_idx) Data
Gets the data object at index modality_idx.
- property num_modalities: int
Returns the number of modalities in the dataset.
- property num_classes: int
Returns the number of classes in the dataset.
- class kale.loaddata.multiomics_datasets.SparseMultiomicsDataset(root: str, raw_file_names: List[str], num_modalities: int, num_classes: int, edge_per_node: int, url: str | None = None, random_split: bool = False, train_size: float = 0.7, equal_weight: bool = False, transform: Callable | None = None, pre_transform: Callable | None = None, target_pre_transform: Callable | None = None)
Bases:
MultiomicsDataset
The multiomics data for creating sparse graph dataset based on the settings in the MOGONET paper.
- Parameters:
root (string) – Root directory where the dataset should be saved.
raw_file_names (list[callable], optional) – The name of the files in the self.raw_dir folder that must be present in order to skip downloading.
num_modalities (int) – The total number of modalities in the dataset.
num_classes (int) – The total number of classes in the dataset.
edge_per_node (int) – Predefined number of edges per node used when computing the adjacency matrix.
url (string, optional) – The url to download the dataset from.
random_split (bool, optional) – Whether to split the dataset into random train and test subsets. (default: False)
train_size (float, optional) – The proportion of the dataset to include in the train split that should be between 0.0 and 1.0. This parameter is used when random_split is True.
equal_weight (bool, optional) – Whether to use equal weights for all samples. (default: False)
transform (callable, optional) – A function/transform that takes in an array_like data and returns a transformed version. The data object will be transformed before every access. (default: None)
pre_transform (callable, optional) – A function/transform that takes in an array_like data and returns a transformed version. The data object will be transformed before being saved to disk. (default: None)
target_pre_transform (callable, optional) – A function/transform that takes in an array_like of labels and returns a transformed version. The label object will be transformed before being saved to disk. (default: None)
- extend_data(data: Data) Data
Extend data object by adding additional attributes.
- Parameters:
data (Data) – An input data object.
- Returns:
Extended data object with additional attributes.
kale.loaddata.polypharmacy_datasets module
- class kale.loaddata.polypharmacy_datasets.PolypharmacyDataset(url: str, root: str, name: str, mode: str = 'train')
Bases:
Dataset
Polypharmacy side effect prediction dataset. Only for full-batch training.
- Parameters:
url (string) – The url to download the dataset from.
root (string) – The root directory containing the dataset file.
name (string) – Name of the dataset.
mode (string) – “train”, “valid” or “test”. Defaults to “train”.
- load_data() Data
Set up the dataset: download it if needed, then load it.
kale.loaddata.sampler module
Various sampling strategies for datasets to construct dataloader, from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/sampler.py
- class kale.loaddata.sampler.SamplingConfig(balance=False, class_weights=None, balance_domain=False)
Bases:
object
- create_loader(dataset, batch_size)
Create the data loader
Reference: https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler
- Parameters:
dataset (Dataset) – dataset from which to load the data.
batch_size (int) – how many samples per batch to load
- class kale.loaddata.sampler.FixedSeedSamplingConfig(seed=1, balance=False, class_weights=None, balance_domain=False)
Bases:
SamplingConfig
- create_loader(dataset, batch_size)
Create the data loader with fixed seed.
- class kale.loaddata.sampler.MultiDataLoader(dataloaders, n_batches)
Bases:
object
Batch Sampler for a MultiDataset. Iterates in parallel over different batch samplers for each dataset. Yields batches [(x_1, y_1), …, (x_s, y_s)] for s datasets.
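The parallel iteration described for MultiDataLoader can be sketched with ordinary iterables. In this illustrative version, each step yields one batch per loader, and a loader that runs out before n_batches steps is restarted. That restart policy and the class name ParallelLoaders are assumptions of this example, not documented behaviour.

```python
class ParallelLoaders:
    """Yield a list with one batch per loader for a fixed number of steps."""

    def __init__(self, dataloaders, n_batches):
        if n_batches <= 0:
            raise ValueError("n_batches must be positive")
        self.dataloaders = list(dataloaders)
        self.n_batches = n_batches

    def __len__(self):
        return self.n_batches

    def __iter__(self):
        iterators = [iter(loader) for loader in self.dataloaders]
        for _ in range(self.n_batches):
            batches = []
            for k, loader in enumerate(self.dataloaders):
                try:
                    batch = next(iterators[k])
                except StopIteration:
                    # Restart an exhausted loader (a choice made for this sketch).
                    iterators[k] = iter(loader)
                    batch = next(iterators[k])
                batches.append(batch)
            yield batches


# Two toy "loaders": lists of (x, y) batches of different lengths.
loader_a = [("xa1", "ya1"), ("xa2", "ya2")]
loader_b = [("xb1", "yb1"), ("xb2", "yb2"), ("xb3", "yb3")]
steps = list(ParallelLoaders([loader_a, loader_b], n_batches=3))
print(steps[0])
```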
- class kale.loaddata.sampler.BalancedBatchSampler(dataset, batch_size)
Bases:
BatchSampler
BatchSampler that, from an MNIST-like dataset, samples n_samples for each of the n_classes. Returns batches of size n_classes * (batch_size // n_classes). Adapted from https://github.com/adambielski/siamese-triplet/blob/master/datasets.py
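The batch size formula n_classes * (batch_size // n_classes) follows from drawing the same number of samples from every class. The generator below is a minimal sketch of that idea over a list of labels. The name balanced_batches, the seed argument, and stopping once the smallest class is used up are assumptions for illustration.

```python
import random
from collections import defaultdict


def balanced_batches(labels, batch_size, seed=0):
    """Yield index batches holding batch_size // n_classes items per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    n_samples = batch_size // len(by_class)
    if n_samples == 0:
        raise ValueError("batch_size must be at least the number of classes")
    for indices in by_class.values():
        rng.shuffle(indices)
    # Stop once the smallest class cannot fill another batch.
    n_batches = min(len(v) for v in by_class.values()) // n_samples
    for b in range(n_batches):
        batch = []
        for indices in by_class.values():
            batch.extend(indices[b * n_samples:(b + 1) * n_samples])
        yield batch


labels = [0] * 6 + [1] * 6 + [2] * 6
batches = list(balanced_batches(labels, batch_size=7))
print(len(batches), [len(batch) for batch in batches])
```

Here batch_size=7 with three classes gives two samples per class, so each batch holds 3 * (7 // 3) = 6 indices.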
- class kale.loaddata.sampler.ReweightedBatchSampler(dataset, batch_size, class_weights)
Bases:
BatchSampler
BatchSampler that, from an MNIST-like dataset, samples batch_size items according to a given input distribution, assuming multi-class labels. Adapted from https://github.com/adambielski/siamese-triplet/blob/master/datasets.py
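Sampling a batch according to a class distribution can be done in two stages: first draw a class for each slot using the weights, then draw a sample from that class. The sketch below shows this with the standard library. The function name reweighted_batch and the convention that class_weights is indexed by class id are assumptions, and sampling is done with replacement here.

```python
import random
from collections import defaultdict


def reweighted_batch(labels, batch_size, class_weights, rng):
    """Draw batch_size indices, choosing each item's class by class_weights."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    classes = sorted(by_class)
    weights = [class_weights[c] for c in classes]
    # Stage 1: pick a class per slot; stage 2: pick a sample of that class.
    chosen = rng.choices(classes, weights=weights, k=batch_size)
    return [rng.choice(by_class[c]) for c in chosen]


rng = random.Random(0)
labels = [0, 0, 0, 1, 1, 1]
batch = reweighted_batch(labels, batch_size=8, class_weights=[1.0, 0.0], rng=rng)
batch_labels = [labels[i] for i in batch]
print(batch_labels)
```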
- kale.loaddata.sampler.get_labels(dataset)
Get class labels for dataset
- class kale.loaddata.sampler.DomainBalancedBatchSampler(dataset, batch_size)
Bases:
BalancedBatchSampler
BatchSampler that samples n_samples for each of the n_domains. Returns batches of size n_domains * (batch_size / n_domains).
- Parameters:
dataset (.multi_domain.MultiDomainImageFolder or torch.utils.data.Subset) – Multi-domain data access.
batch_size (int) – Batch size
kale.loaddata.tabular_access module
Authors: Lawrence Schobs, lawrenceschobs@gmail.com
Functions for accessing tabular data.
- kale.loaddata.tabular_access.load_csv_columns(datapath: str, split: str, fold: int | List[int], cols_to_return: str | List[str] = 'All') DataFrame
Reads a CSV file of data and returns samples where the value of the specified split column is contained in the fold variable. The columns specified in cols_to_return are returned.
- Parameters:
datapath – The path to the CSV file of data.
split – The column name for the split (e.g. “Validation”, “Testing”).
fold – The fold/s contained in the split column to return. Can be a single integer or a list of integers.
cols_to_return – Which columns to return. If set to “All”, returns all columns.
- Returns:
A tuple of two pandas DataFrames: the first is the full DataFrame selected, and the second is the DataFrame with only the columns specified in cols_to_return.
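The row and column selection described for load_csv_columns can be illustrated without pandas. The sketch below filters CSV rows whose split column value is in the requested fold(s) and then keeps the requested columns. It returns only the column-restricted rows as a list of dicts, and the function name select_rows_by_fold and the toy column names are hypothetical.

```python
import csv
import io


def select_rows_by_fold(csv_text, split, fold, cols_to_return="All"):
    """Keep rows whose `split` column is in `fold`, then select columns."""
    folds = {fold} if isinstance(fold, int) else set(fold)
    rows = [row for row in csv.DictReader(io.StringIO(csv_text))
            if int(row[split]) in folds]
    if cols_to_return == "All":
        return rows
    if isinstance(cols_to_return, str):
        columns = [cols_to_return]
    else:
        columns = list(cols_to_return)
    return [{col: row[col] for col in columns} for row in rows]


csv_text = "\n".join([
    "uid,Validation,Testing,error",
    "a,0,1,0.5",
    "b,1,2,0.7",
    "c,2,0,0.1",
    "d,0,1,0.3",
])
fold_zero = select_rows_by_fold(csv_text, "Validation", 0, ["uid", "error"])
folds_one_two = select_rows_by_fold(csv_text, "Validation", [1, 2], "uid")
print(fold_zero, folds_one_two)
```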
kale.loaddata.tdc_datasets module
- class kale.loaddata.tdc_datasets.BindingDBDataset(name: str, split='train', path='./data', mode='cnn_cnn', y_log=True, drug_transform=None, protein_transform=None)
Bases:
Dataset
A custom dataset for loading and processing original TDC data, which is used as input data in DeepDTA model.
- Parameters:
name (str) – TDC dataset name.
split (str) – Data split type (train, valid or test).
path (str) – dataset download/local load path (default: “./data”)
mode (str) – encoding mode (default: cnn_cnn)
drug_transform – Transform operation (default: None)
protein_transform – Transform operation (default: None)
y_log (bool) – Whether to convert y values to log space. (default: True)
kale.loaddata.usps module
Dataset setting and data loader for USPS, from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/dataset_usps.py (based on https://github.com/mingyuliutw/CoGAN/blob/master/cogan_pytorch/src/dataset_usps.py)
- class kale.loaddata.usps.USPS(root, train=True, transform=None, download=False)
Bases:
Dataset
USPS Dataset.
- Parameters:
root (string) – Root directory of dataset where the dataset file exists.
train (bool, optional) – If True, resample from dataset randomly.
download (bool, optional) – If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
transform (callable, optional) – A function/transform that takes in a PIL image and returns a transformed version, e.g. transforms.RandomCrop
- url = 'https://raw.githubusercontent.com/mingyuliutw/CoGAN/master/cogan_pytorch/data/uspssample/usps_28x28.pkl'
- download()
Download dataset.
- load_samples()
Load sample images from dataset.
kale.loaddata.video_access module
Action video dataset loading for EPIC-Kitchen, ADL, GTEA, KITCHEN. The code is based on https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/digits_dataset_access.py
- kale.loaddata.video_access.get_image_modality(image_modality)
Change image_modality (string) to rgb (bool) and flow (bool) for efficiency
- kale.loaddata.video_access.get_videodata_config(cfg)
Get the configure parameters for video data from the cfg files
- kale.loaddata.video_access.generate_list(data_name, data_params_local, domain)
- Parameters:
data_name (string) – name of dataset
data_params_local (dict) – hyperparameters from configure file
domain (string) – domain type (source or target)
- Returns:
data_path (string): image directory of dataset
train_listpath (string): training list file directory of dataset
test_listpath (string): test list file directory of dataset
- class kale.loaddata.video_access.VideoDataset(value)
Bases:
Enum
An enumeration.
- EPIC = 'EPIC'
- ADL = 'ADL'
- GTEA = 'GTEA'
- KITCHEN = 'KITCHEN'
- static get_source_target(source: VideoDataset, target: VideoDataset, seed, params)
Gets data loaders for source and target datasets. Sets channel_number to 3 for RGB and 2 for flow. Sets class_number to 8 for EPIC, 7 for ADL, and 6 for both GTEA and KITCHEN.
- Parameters:
source (VideoDataset) – source dataset name
target (VideoDataset) – target dataset name
seed (int) – seed value set manually
params (CfgNode) – hyperparameters from configure file
Examples
>>> source, target, num_classes = get_source_target(source, target, seed, params)
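The channel and class numbers quoted above can be written as two lookup tables. The sketch below only encodes those stated values. The lowercase modality keys "rgb" and "flow" and the helper name lookup_video_settings are assumptions, and how the library combines the class numbers of a source and target pair is not shown.

```python
# Values taken from the description of get_source_target.
CHANNEL_NUMBERS = {"rgb": 3, "flow": 2}
CLASS_NUMBERS = {"EPIC": 8, "ADL": 7, "GTEA": 6, "KITCHEN": 6}


def lookup_video_settings(dataset_name, modality):
    """Return (channel_number, class_number) for one dataset and modality."""
    if modality not in CHANNEL_NUMBERS:
        raise ValueError(f"Unknown modality: {modality}")
    if dataset_name not in CLASS_NUMBERS:
        raise ValueError(f"Unknown dataset: {dataset_name}")
    return CHANNEL_NUMBERS[modality], CLASS_NUMBERS[dataset_name]


epic_rgb = lookup_video_settings("EPIC", "rgb")
gtea_flow = lookup_video_settings("GTEA", "flow")
print(epic_rgb, gtea_flow)
```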
- class kale.loaddata.video_access.VideoDatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)
Bases:
DatasetAccess
Common API for video dataset access
- Parameters:
data_path (string) – image directory of dataset
train_list (string) – training list file directory of dataset
test_list (string) – test list file directory of dataset
image_modality (string) – image type (RGB or Optical Flow)
frames_per_segment (int) – length of each action sample (in frames)
n_classes (int) – number of classes
transform_kind (string) – types of video transforms
seed (int) – seed value set manually
- get_train_valid(valid_ratio)
Get the train and validation datasets with a fixed random split. This is used for joint input such as RGB and optical flow, which calls get_train_valid twice. Fixing the random seed here keeps the split identical across both calls.
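To see why a fixed seed matters for joint input, consider the sketch below. It is not the library's implementation: it uses Python's random module and a hypothetical split_indices helper to show that two calls with the same seed produce the same train/validation partition, so RGB and optical-flow samples stay aligned.

```python
import random
from typing import List, Tuple


def split_indices(n_samples: int, valid_ratio: float, seed: int) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: shuffle indices with a fixed seed, then split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG, so global random state is untouched
    n_valid = int(n_samples * valid_ratio)
    return indices[n_valid:], indices[:n_valid]


# Called once per modality; the same seed gives the same partition both times.
rgb_train, rgb_valid = split_indices(100, valid_ratio=0.1, seed=2020)
flow_train, flow_valid = split_indices(100, valid_ratio=0.1, seed=2020)
assert rgb_train == flow_train and rgb_valid == flow_valid
```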
- class kale.loaddata.video_access.EPICDatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)
Bases:
VideoDatasetAccess
EPIC data loader
- get_train()
- get_test()
- class kale.loaddata.video_access.GTEADatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)
Bases:
VideoDatasetAccess
GTEA data loader
- get_train()
- get_test()
- class kale.loaddata.video_access.ADLDatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)
Bases:
VideoDatasetAccess
ADL data loader
- get_train()
- get_test()
- class kale.loaddata.video_access.KITCHENDatasetAccess(data_path, train_list, test_list, image_modality, frames_per_segment, n_classes, transform_kind, seed)
Bases:
VideoDatasetAccess
KITCHEN data loader
- get_train()
- get_test()
kale.loaddata.video_datasets module
- class kale.loaddata.video_datasets.BasicVideoDataset(root_path: str, annotationfile_path: str, dataset_split: str, image_modality: str, num_segments: int = 1, frames_per_segment: int = 16, imagefile_template: str = 'img_{:010d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False, n_classes: int = 8)
Bases:
VideoFrameDataset
Dataset for GTEA, ADL and KITCHEN.
- Parameters:
root_path (string) – The root path in which video folders lie.
annotationfile_path (string) – The annotation file containing one row per video sample.
dataset_split (string) – Split type (train or test)
image_modality (string) – Image modality (RGB or Optical Flow)
num_segments (int) – The number of segments the video should be divided into to sample frames from.
frames_per_segment (int) – The number of frames that should be loaded per segment.
imagefile_template (string) – The image filename template.
transform (Compose) – Video transform.
random_shift (bool) – Whether the frames from each segment should be taken consecutively starting from the center of the segment (False), or consecutively starting from a random location inside the segment range (True).
test_mode (bool) – Whether this is a test dataset. If so, chooses frames from segments with random_shift=False.
n_classes (int) – The number of classes.
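The default imagefile_template has the form of a Python format string into which a frame index can be substituted. A brief illustration, assuming the frame index is passed as a single positional integer (the helper name frame_filename is hypothetical):

```python
def frame_filename(template: str, frame_index: int) -> str:
    """Hypothetical helper: build a frame file name from a template."""
    return template.format(frame_index)


print(frame_filename("img_{:010d}.jpg", 42))  # img_0000000042.jpg
print(frame_filename("img_{:05d}.jpg", 42))   # img_00042.jpg
```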
- make_dataset()
Load data from the EPIC-Kitchen list file and convert them into the unified format. Different datasets correspond to a different number of classes.
- Returns:
list of (video_name, start_frame, end_frame, label)
- Return type:
data (list)
- class kale.loaddata.video_datasets.EPIC(root_path: str, annotationfile_path: str, dataset_split: str, image_modality: str, num_segments: int = 1, frames_per_segment: int = 16, imagefile_template: str = 'img_{:010d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False, n_classes: int = 8)
Bases:
VideoFrameDataset
Dataset for EPIC-Kitchen.
- make_dataset()
Load data from the EPIC-Kitchen list file and convert them into the unified format. Because the original list files are not the same, this method is adapted from the one in class BasicVideoDataset.
kale.loaddata.video_multi_domain module
Construct a dataset for videos with (multiple) source and target domains
- class kale.loaddata.video_multi_domain.VideoMultiDomainDatasets(source_access_dict, target_access_dict, image_modality, seed, config_weight_type='natural', config_size_type=DatasetSizeType.Max, valid_split_ratio=0.1, source_sampling_config=None, target_sampling_config=None, n_fewshot=None, random_state=None, class_ids=None)
Bases:
MultiDomainDatasets
- prepare_data_loaders()
- get_domain_loaders(split='train', batch_size=32)
kale.loaddata.videos module
- class kale.loaddata.videos.VideoFrameDataset(root_path: str, annotationfile_path: str, image_modality: str = 'rgb', num_segments: int = 3, frames_per_segment: int = 1, imagefile_template: str = 'img_{:05d}.jpg', transform=None, random_shift: bool = True, test_mode: bool = False)
Bases:
Dataset
A highly efficient and adaptable dataset class for videos. Instead of loading every frame of a video, it loads x RGB frames of a video (sparse temporal sampling), chosen evenly from the start to the end of the video, and returns either a list of x PIL images or FRAMES x CHANNELS x HEIGHT x WIDTH tensors, where FRAMES=x, if the kale.prepdata.video_transform.ImglistToTensor() transform is used.
More specifically, the frame range [START_FRAME, END_FRAME] is divided into NUM_SEGMENTS segments and FRAMES_PER_SEGMENT consecutive frames are taken from each segment.
Note
A demonstration of using this class can be seen in PyKale/examples/video_loading: https://github.com/pykale/pykale/tree/master/examples/video_loading
Note
This dataset broadly corresponds to the frame sampling technique introduced in Temporal Segment Networks at ECCV2016: https://arxiv.org/abs/1608.00859.
Note
This class relies on receiving video data in a structure where, inside a ROOT_DATA folder, each video lies in its own folder, and each video folder contains the frames of the video as individual files with a naming convention such as img_001.jpg … img_059.jpg. For enumeration and annotations, this class expects to receive the path to a .txt file where each video sample has a row with four (or more in the case of multi-label, see example README on GitHub) space-separated values: VIDEO_FOLDER_PATH START_FRAME END_FRAME LABEL_INDEX. VIDEO_FOLDER_PATH is expected to be the path of a video folder excluding the ROOT_DATA prefix. For example, ROOT_DATA might be home\data\datasetxyz\videos\, inside of which a VIDEO_FOLDER_PATH might be jumping\0052\ or sample1\ or 00053\.
- Parameters:
root_path – The root path in which video folders lie. This is ROOT_DATA from the description above.
annotationfile_path – The .txt annotation file containing one row per video sample as described above.
image_modality – Image modality (RGB or Optical Flow).
num_segments – The number of segments the video should be divided into to sample frames from.
frames_per_segment – The number of frames that should be loaded per segment. For each segment’s frame-range, a random start index or the center is chosen, from which frames_per_segment consecutive frames are loaded.
imagefile_template – The image filename template that video frame files have inside of their video folders as described above.
transform – Transform pipeline that receives a list of PIL images/frames.
random_shift – Whether the frames from each segment should be taken consecutively starting from the center of the segment, or consecutively starting from a random location inside the segment range.
test_mode – Whether this is a test dataset. If so, chooses frames from segments with random_shift=False.
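The annotation row format and the segment-based sampling described for this class can be sketched as follows. This is an illustration under stated assumptions, not the class's actual implementation: the helper names are hypothetical, segment boundaries are computed by integer division, and every segment is assumed to hold at least frames_per_segment frames.

```python
import random
from typing import List, Optional, Tuple


def parse_annotation_row(row: str) -> Tuple[str, int, int, int]:
    """Hypothetical parser for 'VIDEO_FOLDER_PATH START_FRAME END_FRAME LABEL_INDEX'."""
    path, start, end, label = row.split()[:4]  # extra multi-label columns are ignored here
    return path, int(start), int(end), int(label)


def sample_frame_indices(
    start_frame: int,
    end_frame: int,
    num_segments: int,
    frames_per_segment: int,
    random_shift: bool = True,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Hypothetical sampler: pick consecutive frames from each of num_segments segments."""
    rng = rng or random.Random()
    n_frames = end_frame - start_frame + 1
    seg_len = n_frames // num_segments  # assumption: trailing remainder frames are unused
    slack = seg_len - frames_per_segment  # room to move the window inside a segment
    if slack < 0:
        raise ValueError("each segment must contain at least frames_per_segment frames")
    indices: List[int] = []
    for seg in range(num_segments):
        seg_start = start_frame + seg * seg_len
        # Random window position when random_shift is True, centered window otherwise.
        offset = rng.randint(0, slack) if random_shift else slack // 2
        first = seg_start + offset
        indices.extend(range(first, first + frames_per_segment))
    return indices


path, start, end, label = parse_annotation_row("sample1 1 60 3")
print(sample_frame_indices(start, end, num_segments=3, frames_per_segment=2, random_shift=False))
# [10, 11, 30, 31, 50, 51]
```

With random_shift=False the result is deterministic, which matches the behaviour described for test_mode.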