Preprocess Data

Submodules

kale.prepdata.chem_transform module

Functions for labeling and encoding chemical characters, such as compound SMILES and atom strings. Adapted from https://github.com/hkmztrk/DeepDTA and https://github.com/thinng/GraphDTA.

kale.prepdata.chem_transform.integer_label_smiles(smiles, max_length=85, isomeric=False)

Integer encoding for SMILES string sequence.

Parameters:
  • smiles (str) – Simplified molecular-input line-entry system (SMILES), a line notation for describing the structure of chemical species using short ASCII strings.

  • max_length (int) – Maximum encoding length of input SMILES string. (default: 85)

  • isomeric (bool) – Whether the input SMILES string includes isomeric information (default: False).
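
Example

A minimal usage sketch; the return is assumed here to be a fixed-length integer encoding of length max_length (the return type is not documented above).

>>> from kale.prepdata.chem_transform import integer_label_smiles
>>> encoding = integer_label_smiles("CC(=O)OC1=CC=CC=C1C(=O)O", max_length=85)  # aspirin
>>> len(encoding)
85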

kale.prepdata.chem_transform.integer_label_protein(sequence, max_length=1200)

Integer encoding for protein string sequence.

Parameters:
  • sequence (str) – Protein string sequence.

  • max_length (int) – Maximum encoding length of the input protein string. (default: 1200)
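
Example

A matching sketch for protein sequences, under the same assumption about the returned fixed-length encoding.

>>> from kale.prepdata.chem_transform import integer_label_protein
>>> encoding = integer_label_protein("MKTAYIAKQRQISFVKSHFSRQ", max_length=1200)
>>> len(encoding)
1200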

kale.prepdata.graph_negative_sampling module

kale.prepdata.graph_negative_sampling.negative_sampling(pos_edge_index: Tensor, num_nodes: int) → Tensor

Negative sampling for link prediction. Copied from https://github.com/NYXFLOWER/GripNet.

Parameters:
  • pos_edge_index (torch.Tensor) – edge indices in COO format with shape [2, num_edges].

  • num_nodes (int) – the number of nodes in the graph.

Returns:

Sampled negative edge indices in COO format with shape [2, num_edges].

Return type:

torch.Tensor
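
Example

A usage sketch; the sampled edges are random, so only the documented output shape is checked.

>>> import torch
>>> from kale.prepdata.graph_negative_sampling import negative_sampling
>>> pos_edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
>>> neg_edge_index = negative_sampling(pos_edge_index, num_nodes=4)
>>> neg_edge_index.shape
torch.Size([2, 3])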

kale.prepdata.graph_negative_sampling.typed_negative_sampling(pos_edge_index: Tensor, num_nodes: int, range_list: Tensor) → Tensor

Typed negative sampling for link prediction. Copied from https://github.com/NYXFLOWER/GripNet.

Parameters:
  • pos_edge_index (torch.Tensor) – edge indices in COO format with shape [2, num_edges].

  • num_nodes (int) – the number of nodes in the graph.

  • range_list (torch.Tensor) – the ranges of the edge types, as [[start_index, end_index], …].

Returns:

Sampled negative edge indices in COO format with shape [2, num_edges].

Return type:

torch.Tensor
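
Example

A sketch assuming edges are grouped by type and each row of range_list gives the [start_index, end_index) column range of one edge type within pos_edge_index.

>>> import torch
>>> from kale.prepdata.graph_negative_sampling import typed_negative_sampling
>>> pos_edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
>>> range_list = torch.tensor([[0, 2], [2, 3]])  # edges 0-1 are type 0, edge 2 is type 1
>>> neg_edge_index = typed_negative_sampling(pos_edge_index, num_nodes=4, range_list=range_list)
>>> neg_edge_index.shape
torch.Size([2, 3])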

kale.prepdata.image_transform module

Preprocessing of image datasets, i.e., transforms, adapted from https://github.com/criteo-research/pytorch-ada/blob/master/adalib/ada/datasets/preprocessing.py.

References for processing stacked images:

Swift, A. J., Lu, H., Uthoff, J., Garg, P., Cogliano, M., Taylor, J., … & Kiely, D. G. (2020). A machine learning cardiac magnetic resonance approach to extract disease features and automate pulmonary arterial hypertension diagnosis. European Heart Journal-Cardiovascular Imaging.

kale.prepdata.image_transform.get_transform(kind, augment=False)

Define transforms (for commonly used datasets)

Parameters:
  • kind (str) – the dataset (transformation) name

  • augment (bool, optional) – whether to do data augmentation (random crop and flipping). Defaults to False. (Not implemented for digits yet.)
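
Example

A usage sketch; the accepted kind values are not listed above, so “cifar” is an assumption here.

>>> from kale.prepdata.image_transform import get_transform
>>> transform = get_transform("cifar", augment=True)  # "cifar" assumed to be a supported name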

kale.prepdata.image_transform.reg_img_stack(images, coords, target_coords)

Registration for stacked images

Parameters:
  • images (list) – Input data, where each sample is of shape (n_phases, dim1, dim2).

  • coords (array-like) – Coordinates for registration, shape (n_samples, n_landmarks * 2).

  • target_coords (array-like) – Target coordinates for registration.

Returns:

Registered images, where each sample in the list has shape (n_phases, dim1, dim2).

array-like: Maximum distance of the transformed source coordinates from the destination coordinates, shape (n_samples,).

Return type:

list
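
Example

A sketch with random data; the shape of target_coords is not documented above, so a single landmark set of shape (n_landmarks * 2,) is assumed.

>>> import numpy as np
>>> from kale.prepdata.image_transform import reg_img_stack
>>> images = [np.random.rand(4, 64, 64) for _ in range(3)]  # 3 samples, 4 phases each
>>> coords = np.random.rand(3, 8) * 64      # 4 (x, y) landmarks per sample
>>> target_coords = np.random.rand(8) * 64  # assumed single target landmark set
>>> registered, max_dist = reg_img_stack(images, coords, target_coords)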

kale.prepdata.image_transform.rescale_img_stack(images, scale=0.5)

Rescale stacked images by a given factor

Parameters:
  • images (list) – Input data list, where each sample is of shape (n_phases, dim1, dim2).

  • scale (float, optional) – Scale factor. Defaults to 0.5.

Returns:

Rescaled images, where each sample in the list has shape (n_phases, dim1 * scale, dim2 * scale).

Return type:

list

kale.prepdata.image_transform.mask_img_stack(images, mask)

Mask stacked images with a given mask

Parameters:
  • images (list) – Input image data, where each sample is of shape (n_phases, dim1, dim2).

  • mask (array-like) – Mask of shape (dim1, dim2).

Returns:

Masked images, where each sample in the list has shape (n_phases, dim1, dim2).

Return type:

list

kale.prepdata.image_transform.normalize_img_stack(images)

Normalize pixel values to (0, 1) for stacked images.

Parameters:

images (list) – Input data, where each sample is of shape (n_phases, dim1, dim2).

Returns:

Normalized images, where each sample in the list has shape (n_phases, dim1, dim2).

Return type:

list
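
Example

The stacked-image helpers compose naturally; a sketch with random data chaining rescaling, masking, and normalization.

>>> import numpy as np
>>> from kale.prepdata.image_transform import (
...     mask_img_stack, normalize_img_stack, rescale_img_stack)
>>> images = [np.random.rand(4, 64, 64) * 255 for _ in range(2)]  # 2 samples, 4 phases each
>>> images = rescale_img_stack(images, scale=0.5)  # each sample now (4, 32, 32)
>>> mask = np.ones((32, 32))                       # trivial all-ones mask for illustration
>>> images = mask_img_stack(images, mask)
>>> images = normalize_img_stack(images)           # pixel values now in (0, 1)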

kale.prepdata.string_transform module

Author: Lawrence Schobs (lawrenceschobs@gmail.com). This file contains functions for string manipulation.

kale.prepdata.string_transform.strip_for_bound(string_: str) → list

Convert a string containing comma-separated floats into a list of floats (bracketed lists in the string are preserved as nested lists, as in the example below).

Parameters:

string_ (str) – A string containing floats, separated by commas.

Returns:

A list of floats.

Return type:

list

Example

>>> strip_for_bound("[1.0, 2.0], [3.0, 4.0]")
[[1.0, 2.0], [3.0, 4.0]]

kale.prepdata.supergraph_construct module

The supergraph structure from the Pattern Recognition 2022 paper “GripNet: Graph Information Propagation on Supergraph for Heterogeneous Graphs” (https://doi.org/10.1016/j.patcog.2022.108973).

class kale.prepdata.supergraph_construct.SuperVertex(name: str, node_feat: Tensor, edge_index: Tensor, edge_type: Tensor | None = None, edge_weight: Tensor | None = None)

Bases: object

The supervertex structure in GripNet. Each supervertex is a subgraph containing nodes of the same category that are semantically coherent. Supervertices can be homogeneous or heterogeneous.

Parameters:
  • name (str) – the name of the supervertex.

  • node_feat (torch.Tensor) – node features of the supervertex with shape [#nodes, #features]. We recommend using torch.sparse.FloatTensor() if the node feature matrix is sparse.

  • edge_index (torch.Tensor) – edge indices in COO format with shape [2, #edges].

  • edge_type (torch.Tensor, optional) – one-dimensional relation type for each edge, indexed from 0. Defaults to None.

  • edge_weight (torch.Tensor, optional) – one-dimensional weight for each edge. Defaults to None.

Examples

>>> import torch
>>> node_feat = torch.randn(4, 20)
>>> edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])
>>> edge_type = torch.tensor([0, 0, 1, 1])
>>> edge_weight = torch.randn(4)
>>> # create a supervertex with homogeneous edges
>>> supervertex_homo = SuperVertex("sv1", node_feat, edge_index)
>>> # create a supervertex with heterogeneous edges
>>> supervertex_hete = SuperVertex("sv2", node_feat, edge_index, edge_type)
>>> # create a supervertex with weighted edges
>>> supervertex_weight1 = SuperVertex("sv3", node_feat, edge_index, edge_weight=edge_weight)
>>> supervertex_weight2 = SuperVertex("sv4", node_feat, edge_index, edge_type, edge_weight)

add_in_supervertex(vertex_name: str)

add_out_supervertex(vertex_name: str)
class kale.prepdata.supergraph_construct.SuperEdge(source_supervertex: str, target_supervertex: str, edge_index: Tensor, edge_weight: Tensor | None = None)

Bases: object

The superedge structure in GripNet. Each superedge is a bipartite subgraph containing nodes from two categories forming two node sets, connected by edges between them. A superedge can be regarded as a heterogeneous graph connecting two supervertices.

Parameters:
  • source_supervertex (str) – the name of the source supervertex.

  • target_supervertex (str) – the name of the target supervertex.

  • edge_index (torch.Tensor) – edge indices in COO format with shape [2, #edges]. The first row is the index of source nodes, and the second row is the index of target nodes.

  • edge_weight (torch.Tensor, optional) – one-dimensional weight for each edge. Defaults to None.

class kale.prepdata.supergraph_construct.SuperVertexParaSetting(supervertex_name: str, inter_feat_channels: int, inter_agg_channels_list: List[int], exter_agg_channels_dict: Dict[str, int] | None = None, mode: str | None = None, num_bases: int = 32, concat_output: bool = True)

Bases: object

Parameter settings for each supervertex.

Parameters:
  • supervertex_name (str) – the name of the supervertex.

  • inter_feat_channels (int) – the dimension of the output of the internal feature layer.

  • inter_agg_channels_list (List[int]) – the output dimensions of a sequence of internal aggregation layers.

  • exter_agg_channels_dict (Dict[str, int], optional) – the dimensions of the message vectors received from parent supervertices. Defaults to None.

  • mode (str, optional) – the allowed GripNet mode: ‘cat’ or ‘add’. Defaults to None.

  • num_bases (int, optional) – the number of bases used for basis-decomposition if the supervertex is multi-relational. Defaults to 32.

  • concat_output (bool, optional) – whether to concatenate the outputs of each layer. Defaults to True.

class kale.prepdata.supergraph_construct.SuperGraph(supervertex_list: List[SuperVertex], superedge_list: List[SuperEdge], supervertex_setting_dict: Dict[str, SuperVertexParaSetting] | None = None)

Bases: object

The supergraph structure in GripNet. Each supergraph is a directed acyclic graph (DAG) containing supervertices and superedges.

Parameters:
  • supervertex_list (list[SuperVertex]) – a list of supervertices.

  • superedge_list (list[SuperEdge]) – a list of superedges.

  • supervertex_setting_dict (dict[str, SuperVertexParaSetting], optional) – the parameter settings for each supervertex. Defaults to None.

set_supergraph_para_setting(supervertex_setting_list: List[SuperVertexParaSetting])

Set the parameters of the supergraph.

Parameters:

supervertex_setting_list (list[SuperVertexParaSetting]) – a list of parameter settings for each supervertex.
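
Examples

A sketch assembling a minimal supergraph from two supervertices and one superedge; all names and shapes are illustrative.

>>> import torch
>>> from kale.prepdata.supergraph_construct import SuperEdge, SuperGraph, SuperVertex
>>> sv_gene = SuperVertex("gene", torch.randn(4, 20), torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]]))
>>> sv_drug = SuperVertex("drug", torch.randn(3, 20), torch.tensor([[0, 1], [1, 2]]))
>>> # superedge from "gene" nodes (first row) to "drug" nodes (second row)
>>> se = SuperEdge("gene", "drug", torch.tensor([[0, 1, 3], [0, 1, 2]]))
>>> supergraph = SuperGraph([sv_gene, sv_drug], [se])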

kale.prepdata.tabular_transform module

Functions for manipulating/transforming tabular data

class kale.prepdata.tabular_transform.ToTensor(dtype: dtype | None = None, device: device | None = None)

Bases: object

Convert array_like data to a tensor of the same shape. This class wraps the functionality of torch.tensor in a callable object, so instances can be used as a transform.

Parameters:
  • dtype (torch.dtype, optional) – The desired data type of returned tensor. Default: if None, infers data type from data.

  • device (torch.device, optional) – The device of the constructed tensor. If None and data is a tensor then the device of data is used. If None and data is not a tensor then the result tensor is constructed on the CPU.
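
Example

A usage sketch, since ToTensor simply wraps torch.tensor:

>>> import numpy as np
>>> import torch
>>> from kale.prepdata.tabular_transform import ToTensor
>>> to_tensor = ToTensor(dtype=torch.float32)
>>> to_tensor(np.array([[1, 2], [3, 4]])).dtype
torch.float32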

class kale.prepdata.tabular_transform.ToOneHotEncoding(num_classes: int | None = -1, dtype: dtype | None = None, device: device | None = None)

Bases: object

Convert an array_like of class values of shape (*,) to a tensor of shape (*, num_classes) that is zero everywhere except where the index of the last dimension matches the corresponding class value, in which case it is 1.

Note that this class wraps the functionality of PyTorch’s one_hot function in a callable object, so instances can be used as a transform.

Parameters:
  • num_classes (int, optional) – Total number of classes. If set to -1, the number of classes will be inferred as one greater than the largest class value in the input data.

  • dtype (torch.dtype, optional) – The desired data type of returned tensor. Default: if None, infers data type from data.

  • device (torch.device, optional) – The device of the constructed tensor. If None and data is a tensor then the device of data is used. If None and data is not a tensor then the result tensor is constructed on the CPU.
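
Example

A usage sketch based on the description above (the output dtype is assumed to follow PyTorch’s one_hot unless dtype is given):

>>> import torch
>>> from kale.prepdata.tabular_transform import ToOneHotEncoding
>>> to_onehot = ToOneHotEncoding(num_classes=3)
>>> to_onehot(torch.tensor([0, 2, 1]))
tensor([[1, 0, 0],
        [0, 0, 1],
        [0, 1, 0]])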

kale.prepdata.tabular_transform.apply_confidence_inversion(data: DataFrame, uncertainty_measure: str) → Tuple[Any, Any]

Invert the values stored under a given key, adding a small constant to avoid division by zero.

Parameters:
  • data (Dict) – Dictionary of data to invert.

  • uncertainty_measure (str) – Key of dict to invert.

Returns:

Dictionary with inverted data.

Return type:

Dict
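
Example

A sketch; the signature above says DataFrame while the parameter text says Dict, so a dataframe is assumed here, and the column name is illustrative.

>>> import pandas as pd
>>> from kale.prepdata.tabular_transform import apply_confidence_inversion
>>> data = pd.DataFrame({"uncertainty": [0.5, 0.25, 0.1]})
>>> inverted = apply_confidence_inversion(data, "uncertainty")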

kale.prepdata.tabular_transform.generate_struct_for_qbin(models_to_compare: List[str], targets: List[int], saved_bins_path_pre: str, dataset: str) → Tuple[Dict[str, DataFrame], Dict[str, DataFrame], Dict[str, DataFrame], Dict[str, DataFrame]]

Returns dictionaries of pandas dataframes for:
  1. all error and prediction info (all prediction data across targets for each model),

  2. target indices for separated error and prediction info (prediction data for each model and each target),

  3. all estimated error bounds (estimated error bounds across targets for each model),

  4. target separated estimated error bounds (estimated error bounds for each model and each target).

Parameters:
  • models_to_compare – List of models to include in the data structures.

  • targets – List of target indices to include in the data structures.

  • saved_bins_path_pre – Path prefix to where the predicted quantile bins are saved.

  • dataset – Name of the dataset being measured.

Returns:

data_structs: Dictionary where keys are model names and values are pandas dataframes containing all prediction data across targets for that model.

data_struct_sep: Dictionary where keys are a combination of model names and target indices (e.g., “model1 T1”), and values are pandas dataframes containing prediction data for the corresponding model and target.

data_struct_bounds: Dictionary where keys are a combination of model names and the string “Error Bounds” (e.g., “model1 Error Bounds”), and values are pandas dataframes containing all estimated error bounds across targets for that model.

data_struct_bounds_sep: Dictionary where keys are a combination of model names, target indices, and the string “Error Bounds” (e.g., “model1 Error Bounds L1”), and values are pandas dataframes containing estimated error bounds for the corresponding model and target.

Return type:

Tuple[Dict[str, DataFrame], Dict[str, DataFrame], Dict[str, DataFrame], Dict[str, DataFrame]]

kale.prepdata.tensor_reshape module

kale.prepdata.tensor_reshape.spatial_to_seq(image_tensor: Tensor)

Takes a torch tensor of shape (batch_size, channels, height, width), as used and output by CNNs, and creates a sequence view of shape (sequence_length, batch_size, channels), as required by torch’s transformer module. In other words, unrolls the spatial grid into the sequence length and rearranges the dimension ordering.

Parameters:

image_tensor – tensor of shape (batch_size, channels, height, width) (required).

kale.prepdata.tensor_reshape.seq_to_spatial(sequence_tensor: Tensor, desired_height: int, desired_width: int)

Takes a torch tensor of shape (sequence_length, batch_size, num_features), as used and output by Transformers, and creates a view of shape (batch_size, num_features, height, width), as used and output by CNNs. In other words, rearranges the dimension ordering and rolls sequence_length into (height, width). desired_height * desired_width must equal the input’s sequence length.

Parameters:
  • sequence_tensor – sequence tensor of shape (sequence_length, batch_size, num_features) (required).

  • desired_height – the height into which the sequence length should be rolled (required).

  • desired_width – the width into which the sequence length should be rolled (required).
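
Example

A roundtrip sketch; exact equality assumes the two functions are mutual inverses, as the descriptions above imply.

>>> import torch
>>> from kale.prepdata.tensor_reshape import seq_to_spatial, spatial_to_seq
>>> feature_map = torch.randn(8, 32, 7, 7)  # (batch_size, channels, height, width)
>>> seq = spatial_to_seq(feature_map)
>>> seq.shape
torch.Size([49, 8, 32])
>>> restored = seq_to_spatial(seq, 7, 7)
>>> torch.equal(restored, feature_map)
True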

kale.prepdata.video_transform module

kale.prepdata.video_transform.get_transform(kind, image_modality)

Define transforms (for commonly used datasets)

Parameters:
  • kind (str) – the dataset (transformation) name

  • image_modality (str) – image type (RGB or Optical Flow)

class kale.prepdata.video_transform.ImglistToTensor(*args, **kwargs)

Bases: Module

Converts a list of PIL images in the range [0,255] to a torch.FloatTensor of shape (NUM_IMAGES x CHANNELS x HEIGHT x WIDTH) in the range [0,1]. Can be used as first transform for kale.loaddata.videos.VideoFrameDataset.

forward(img_list)

For RGB input, converts each PIL image in a list to a torch Tensor and stacks them into a single tensor. For flow input, converts every pair of PIL images (x(u)_img, y(v)_img) in a list to a torch Tensor and stacks them. For example, if the input list contains 16 flow frames, its stacked form has dimension [16, 1, 224, 224] with frame order [frame 1_x, frame 1_y, frame 2_x, frame 2_y, frame 3_x, …, frame 8_x, frame 8_y]; the output pairs them as [[frame 1_x, frame 1_y], [frame 2_x, frame 2_y], …, [frame 8_x, frame 8_y]] with dimension [8, 2, 224, 224].

Parameters:

img_list – list of PIL images.

Returns:

tensor of size ``NUM_IMAGES x CHANNELS x HEIGHT x WIDTH``

class kale.prepdata.video_transform.TensorPermute(*args, **kwargs)

Bases: Module

Convert a torch.FloatTensor of shape (NUM_IMAGES x CHANNELS x HEIGHT x WIDTH) to a torch.FloatTensor of shape (CHANNELS x NUM_IMAGES x HEIGHT x WIDTH).
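
Example

A sketch chaining the two modules on dummy RGB frames; the frame count and size are illustrative.

>>> import torch
>>> from PIL import Image
>>> from kale.prepdata.video_transform import ImglistToTensor, TensorPermute
>>> frames = [Image.new("RGB", (224, 224)) for _ in range(16)]  # dummy RGB frames
>>> clip = ImglistToTensor()(frames)  # (NUM_IMAGES, CHANNELS, HEIGHT, WIDTH)
>>> clip.shape
torch.Size([16, 3, 224, 224])
>>> clip = TensorPermute()(clip)      # (CHANNELS, NUM_IMAGES, HEIGHT, WIDTH)
>>> clip.shape
torch.Size([3, 16, 224, 224])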
