Preprocess Data

Submodules

kale.prepdata.chem_transform module

Functions for labeling and encoding chemical characters such as compound SMILES strings and atom strings. Refer to https://github.com/hkmztrk/DeepDTA and https://github.com/thinng/GraphDTA.

kale.prepdata.chem_transform.integer_label_smiles(smiles, max_length=85, isomeric=False)

Integer encoding for SMILES string sequence.

Parameters:
  • smiles (str) – Simplified molecular-input line-entry system (SMILES), a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings.

  • max_length (int) – Maximum encoding length of input SMILES string. (default: 85)

  • isomeric (bool) – Whether the input SMILES string includes isomeric information (default: False).
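
As an illustration of what integer encoding with a fixed maximum length can look like, here is a minimal pure-Python sketch. It is not the kale implementation: the vocabulary `TOY_VOCAB`, the helper name `toy_integer_label`, and the choice to truncate long inputs and pad short ones with zeros are all assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; the character-to-integer mapping used by kale is not shown here.
TOY_VOCAB: Dict[str, int] = {"C": 1, "N": 2, "O": 3, "(": 4, ")": 5, "=": 6}


def toy_integer_label(text: str, vocab: Dict[str, int], max_length: int) -> List[int]:
    """Map each character to an integer label, truncating or zero-padding to max_length."""
    encoding = [0] * max_length
    for idx, char in enumerate(text[:max_length]):
        # Assumption: characters outside the toy vocabulary stay 0 in this sketch.
        encoding[idx] = vocab.get(char, 0)
    return encoding


encoded = toy_integer_label("CC(=O)N", TOY_VOCAB, max_length=10)
print(encoded)
```

The output always has exactly `max_length` entries, which is what lets a batch of strings of different lengths be stacked into one array.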

kale.prepdata.chem_transform.integer_label_protein(sequence, max_length=1200)

Integer encoding for protein string sequence.

Parameters:
  • sequence (str) – Protein string sequence.

  • max_length (int) – Maximum encoding length of input protein string. (default: 1200)

kale.prepdata.graph_negative_sampling module

kale.prepdata.graph_negative_sampling.negative_sampling(pos_edge_index: Tensor, num_nodes: int) Tensor

Negative sampling for link prediction. Copied from https://github.com/NYXFLOWER/GripNet.

Parameters:
  • pos_edge_index (torch.Tensor) – edge indices in COO format with shape [2, num_edges].

  • num_nodes (int) – the number of nodes in the graph.

Returns:

edge indices in COO format with shape [2, num_edges].

Return type:

torch.Tensor
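
The idea behind negative sampling can be sketched without tensors: draw random node pairs and keep those that are not known positive edges, until there are as many negatives as positives. The helper `toy_negative_sampling` below is hypothetical and uses plain Python tuples; whether the kale function also rejects self-loops or duplicate negatives is not stated above, so this sketch does neither.

```python
import random
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def toy_negative_sampling(pos_edges: List[Edge], num_nodes: int, seed: int = 0) -> List[Edge]:
    """Draw as many random node pairs as there are positive edges, skipping known positives."""
    positive: Set[Edge] = set(pos_edges)
    if len(positive) >= num_nodes * num_nodes:
        raise ValueError("every node pair is a positive edge; nothing to sample")
    rng = random.Random(seed)
    negatives: List[Edge] = []
    while len(negatives) < len(pos_edges):
        candidate = (rng.randrange(num_nodes), rng.randrange(num_nodes))
        if candidate not in positive:  # rejection step: discard positive edges
            negatives.append(candidate)
    return negatives


pos_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
neg_edges = toy_negative_sampling(pos_edges, num_nodes=4)
print(neg_edges)
```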

kale.prepdata.graph_negative_sampling.typed_negative_sampling(pos_edge_index: Tensor, num_nodes: int, range_list: Tensor) Tensor

Typed negative sampling for link prediction. Copied from https://github.com/NYXFLOWER/GripNet.

Parameters:
  • pos_edge_index (torch.Tensor) – edge indices in COO format with shape [2, num_edges].

  • num_nodes (int) – the number of nodes in the graph.

  • range_list (torch.Tensor) – the range of edge types. [[start_index, end_index], …]

Returns:

edge indices in COO format with shape [2, num_edges].

Return type:

torch.Tensor
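
One plausible reading of range_list is that the positive edges are sorted by edge type and each [start_index, end_index] pair selects the block of edges of one type. Under that assumption (including an exclusive end index, which is a guess), typed sampling amounts to sampling negatives separately for each block. The function `toy_typed_negative_sampling` is a hypothetical pure-Python sketch, not the kale code.

```python
import random
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def toy_typed_negative_sampling(
    pos_edges: List[Edge], num_nodes: int, range_list: Sequence[Tuple[int, int]], seed: int = 0
) -> List[Edge]:
    """Sample negatives per edge-type block; end indices are treated as exclusive (assumption)."""
    rng = random.Random(seed)
    negatives: List[Edge] = []
    for start, end in range_list:
        block_edges = pos_edges[start:end]  # positive edges of a single type
        block = set(block_edges)
        if len(block) >= num_nodes * num_nodes:
            raise ValueError("no negative edge exists for this edge type")
        sampled: List[Edge] = []
        while len(sampled) < len(block_edges):
            candidate = (rng.randrange(num_nodes), rng.randrange(num_nodes))
            if candidate not in block:  # only positives of the same type are excluded
                sampled.append(candidate)
        negatives.extend(sampled)
    return negatives


typed_pos = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
typed_ranges = [(0, 3), (3, 5)]  # type 0 owns edges 0..2, type 1 owns edges 3..4
typed_neg = toy_typed_negative_sampling(typed_pos, num_nodes=4, range_list=typed_ranges)
print(typed_neg)
```

Sampling per block keeps the number of negatives for each edge type equal to the number of positives of that type.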

kale.prepdata.image_transform module

kale.prepdata.string_transform module

Author: Lawrence Schobs, lawrenceschobs@gmail.com

This file contains functions for string manipulation.

kale.prepdata.string_transform.strip_for_bound(string_: str) list

Convert a string containing comma-separated floats into a list of floats.

Parameters:

string_ (str) – A string containing floats, separated by commas.

Returns:

A list of floats.

Return type:

list

Example

>>> strip_for_bound("[1.0, 2.0], [3.0, 4.0]")
[[1.0, 2.0], [3.0, 4.0]]

kale.prepdata.string_transform.convert_to_float(value: str) float

Convert a string to a float, handling NumPy float constructors like ‘np.float32(…)’, ‘np.float64(…)’, etc.

Parameters:

value (str) – The string to convert.

Returns:

The converted float value.

Return type:

float
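
A minimal sketch of this kind of parsing, assuming the NumPy constructor appears literally in the string as `np.float32(...)` or `np.float64(...)`. The regular expression and the helper name `toy_convert_to_float` are illustrative choices, not the kale implementation.

```python
import re

# Matches strings such as "np.float32(1.5)" and captures the text inside the parentheses.
_NP_FLOAT = re.compile(r"^\s*np\.float\d*\((.*)\)\s*$")


def toy_convert_to_float(value: str) -> float:
    """Strip an optional np.floatXX(...) wrapper, then parse the remainder as a float."""
    match = _NP_FLOAT.match(value)
    inner = match.group(1) if match else value
    return float(inner)


print(toy_convert_to_float("np.float32(1.5)"))
print(toy_convert_to_float("2.25"))
```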

kale.prepdata.signal_transform module

kale.prepdata.signal_transform.normalize_signal(signal)

Normalizes a multi-channel ECG signal by removing mean and scaling to unit variance per channel.

Parameters:

signal (ndarray) – Array of shape (samples, channels)

Returns:

Normalized signal, same shape as input.

Return type:

ndarray
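
Per-channel normalization can be illustrated with the standard library alone. The sketch below represents the signal as a list of rows (one row per sample, one column per channel) instead of an ndarray, uses the population standard deviation, and leaves constant channels at zero; all three are assumptions of this example, and `toy_normalize_signal` is a hypothetical name.

```python
from statistics import fmean, pstdev
from typing import List


def toy_normalize_signal(signal: List[List[float]]) -> List[List[float]]:
    """Subtract the per-channel mean and divide by the per-channel standard deviation."""
    channels = list(zip(*signal))  # one tuple of samples per channel
    means = [fmean(channel) for channel in channels]
    # Assumption: a constant channel (std of 0) is divided by 1 to avoid division by zero.
    stds = [pstdev(channel) or 1.0 for channel in channels]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in signal]


toy_signal = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # 3 samples, 2 channels
normalized = toy_normalize_signal(toy_signal)
print(normalized)
```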

kale.prepdata.signal_transform.interpolate_signal(signal)

Linearly interpolates missing or NaN values in the ECG signal.

Parameters:

signal (ndarray) – Array of shape (samples, channels)

Returns:

Interpolated signal, same shape as input.

Return type:

ndarray
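
Linear interpolation of missing values replaces each NaN with a value on the straight line between its nearest known neighbours. The following pure-Python sketch handles a single channel; filling leading or trailing NaNs with the nearest known value is an assumption of this example, and `toy_interpolate_channel` is a hypothetical helper rather than the kale function.

```python
import math
from typing import List


def toy_interpolate_channel(values: List[float]) -> List[float]:
    """Fill NaN entries by linear interpolation between the nearest known samples."""
    known = [i for i, v in enumerate(values) if not math.isnan(v)]
    if not known:
        return list(values)  # nothing to interpolate from
    result = list(values)
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:  # leading gap: copy the first known value (assumption)
            result[i] = values[right]
        elif right is None:  # trailing gap: copy the last known value (assumption)
            result[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            result[i] = values[left] + weight * (values[right] - values[left])
    return result


nan = float("nan")
filled = toy_interpolate_channel([1.0, nan, 3.0, nan, nan, 6.0])
print(filled)
```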

kale.prepdata.signal_transform.prepare_ecg_tensor(signal)

Converts a preprocessed ECG signal (NumPy array or PyTorch tensor) to a torch tensor of shape (1, -1).

Parameters:

signal (ndarray or Tensor) – Preprocessed and normalized ECG array (samples, channels).

Returns:

Flattened ECG tensor, shape (1, total_samples).

Return type:

Tensor

kale.prepdata.supergraph_construct module

The supergraph structure from the Pattern Recognition 2022 paper “GripNet: Graph Information Propagation on Supergraph for Heterogeneous Graphs” <https://doi.org/10.1016/j.patcog.2022.108973>.

class kale.prepdata.supergraph_construct.SuperVertex(name: str, node_feat: Tensor, edge_index: Tensor, edge_type: Tensor = None, edge_weight: Tensor = None)

Bases: object

The supervertex structure in GripNet. Each supervertex is a subgraph containing nodes of the same category that are semantically coherent. Supervertices can be homogeneous or heterogeneous.

Parameters:
  • name (str) – the name of the supervertex.

  • node_feat (torch.Tensor) – node features of the supervertex with shape [#nodes, #features]. We recommend using torch.sparse.FloatTensor() if the node feature matrix is sparse.

  • edge_index (torch.Tensor) – edge indices in COO format with shape [2, #edges].

  • edge_type (torch.Tensor, optional) – one-dimensional relation type for each edge, indexed from 0. Defaults to None.

  • edge_weight (torch.Tensor, optional) – one-dimensional weight for each edge. Defaults to None.

Examples

>>> import torch
>>> from kale.prepdata.supergraph_construct import SuperVertex
>>> node_feat = torch.randn(4, 20)
>>> edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])
>>> edge_type = torch.tensor([0, 0, 1, 1])
>>> edge_weight = torch.randn(4)
>>> # the first argument is the supervertex name; the names used below are illustrative
>>> # create a supervertex with homogeneous edges
>>> supervertex_homo = SuperVertex("homo", node_feat, edge_index)
>>> # create a supervertex with heterogeneous edges
>>> supervertex_hete = SuperVertex("hete", node_feat, edge_index, edge_type)
>>> # create a supervertex with weighted edges
>>> supervertex_weight1 = SuperVertex("weight1", node_feat, edge_index, edge_weight=edge_weight)
>>> supervertex_weight2 = SuperVertex("weight2", node_feat, edge_index, edge_type, edge_weight)

add_in_supervertex(vertex_name: str)
add_out_supervertex(vertex_name: str)

class kale.prepdata.supergraph_construct.SuperEdge(source_supervertex: str, target_supervertex: str, edge_index: Tensor, edge_weight: Tensor = None)

Bases: object

The superedge structure in GripNet. Each superedge is a bipartite subgraph containing nodes from two categories forming two node sets, connected by edges between them. A superedge can be regarded as a heterogeneous graph connecting two supervertices.

Parameters:
  • source_supervertex (str) – the name of the source supervertex.

  • target_supervertex (str) – the name of the target supervertex.

  • edge_index (torch.Tensor) – edge indices in COO format with shape [2, #edges]. The first row is the index of source nodes, and the second row is the index of target nodes.

  • edge_weight (torch.Tensor, optional) – one-dimensional weight for each edge. Defaults to None.

class kale.prepdata.supergraph_construct.SuperVertexParaSetting(supervertex_name: str, inter_feat_channels: int, inter_agg_channels_list: List[int], exter_agg_channels_dict: Dict[str, int] | None = None, mode: str | None = None, num_bases: int = 32, concat_output: bool = True)

Bases: object

Parameter settings for each supervertex.

Parameters:
  • supervertex_name (str) – the name of the supervertex.

  • inter_feat_channels (int) – the dimension of the output of the internal feature layer.

  • inter_agg_channels_list (List[int]) – the output dimensions of a sequence of internal aggregation layers.

  • exter_agg_channels_dict (Dict[str, int], optional) – the dimension of received message vector from parent supervertices. Defaults to None.

  • mode (str, optional) – the GripNet mode, either ‘cat’ or ‘add’. Defaults to None.

  • num_bases (int, optional) – the number of bases used for basis-decomposition if the supervertex is multi-relational. Defaults to 32.

  • concat_output (bool, optional) – whether to concatenate the output of each layer. Defaults to True.

class kale.prepdata.supergraph_construct.SuperGraph(supervertex_list: List[SuperVertex], superedge_list: List[SuperEdge], supervertex_setting_dict: Dict[str, SuperVertexParaSetting] | None = None)

Bases: object

The supergraph structure in GripNet. Each supergraph is a directed acyclic graph (DAG) containing supervertices and superedges.

Parameters:
  • supervertex_list (list[SuperVertex]) – a list of supervertices.

  • superedge_list (list[SuperEdge]) – a list of superedges.

  • supervertex_setting_dict (dict[str, SuperVertexParaSetting], optional) – the parameter settings for each supervertex.

set_supergraph_para_setting(supervertex_setting_list: List[SuperVertexParaSetting])

Set the parameters of the supergraph.

Parameters:

supervertex_setting_list (list[SuperVertexParaSetting]) – a list of parameter settings for each supervertex.
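
Because a supergraph is a DAG, its supervertices can be visited in topological order, so that every source supervertex is handled before the targets it feeds. The sketch below illustrates only this ordering idea with the standard-library `graphlib` module (Python 3.9+); it does not use or reproduce the kale classes, and the supervertex names and the helper `toy_supervertex_order` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import List, Tuple

# Hypothetical superedges, written as (source_supervertex, target_supervertex) name pairs.
toy_superedges: List[Tuple[str, str]] = [("gene", "drug"), ("gene", "disease"), ("drug", "disease")]


def toy_supervertex_order(edges: List[Tuple[str, str]]) -> List[str]:
    """Return supervertex names so that each source comes before its targets."""
    sorter: TopologicalSorter = TopologicalSorter()
    for source, target in edges:
        sorter.add(target, source)  # the target depends on the source
    return list(sorter.static_order())  # raises CycleError if the edges contain a cycle


order = toy_supervertex_order(toy_superedges)
print(order)

try:
    toy_supervertex_order([("a", "b"), ("b", "a")])
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```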

kale.prepdata.tabular_transform module

Functions for manipulating/transforming tabular data

class kale.prepdata.tabular_transform.ToTensor(dtype: dtype | None = None, device: device | None = None)

Bases: object

Convert array_like data to a tensor of the same shape. Instances of this class are callable: the class wraps the functionality of torch.tensor so that it can be used as a function-like transform.

Parameters:
  • dtype (torch.dtype, optional) – The desired data type of returned tensor. Default: if None, infers data type from data.

  • device (torch.device, optional) – The device of the constructed tensor. If None and data is a tensor then the device of data is used. If None and data is not a tensor then the result tensor is constructed on the CPU.

class kale.prepdata.tabular_transform.ToOneHotEncoding(num_classes: int | None = -1, dtype: dtype | None = None, device: device | None = None)

Bases: object

Convert an array_like of class values of shape (*,) to a tensor of shape (*, num_classes) that has zeros everywhere except where the index of the last dimension matches the corresponding value of the input tensor, in which case it will be 1.

Note that instances of this class are callable: the class wraps the functionality of the one_hot function in PyTorch so that it can be used as a function-like transform.

Parameters:
  • num_classes (int, optional) – Total number of classes. If set to -1, the number of classes will be inferred as one greater than the largest class value in the input data.

  • dtype (torch.dtype, optional) – The desired data type of returned tensor. Default: if None, infers data type from data.

  • device (torch.device, optional) – The device of the constructed tensor. If None and data is a tensor then the device of data is used. If None and data is not a tensor then the result tensor is constructed on the CPU.
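
The encoding rule, including the inference of num_classes when it is set to -1, can be sketched in plain Python for a one-dimensional list of class values. `toy_one_hot` is a hypothetical helper that returns nested lists rather than a tensor.

```python
from typing import List


def toy_one_hot(labels: List[int], num_classes: int = -1) -> List[List[int]]:
    """One-hot encode class values; with num_classes=-1, infer it as max(labels) + 1."""
    if num_classes == -1:
        num_classes = max(labels) + 1
    return [[1 if column == label else 0 for column in range(num_classes)] for label in labels]


print(toy_one_hot([0, 2, 1]))
print(toy_one_hot([0, 2, 1], num_classes=4))
```

Passing num_classes explicitly matters when a batch happens not to contain the largest class, since the inferred width would otherwise be too small.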

kale.prepdata.tabular_transform.apply_confidence_inversion(data: DataFrame, uncertainty_measure: str) Tuple[Any, Any]

Invert a list of numbers, adding a small number to avoid division by zero.

Parameters:
  • data (DataFrame) – Data containing the values to invert.

  • uncertainty_measure (str) – Key of data to invert.

Returns:

Dictionary with inverted data.

Return type:

Dict
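
A minimal sketch of the inversion, using a plain dict of lists in place of a DataFrame. Where exactly the small constant enters the computation is not specified above; this example adds it to the denominator, and both that choice and the value `eps=1e-10` are assumptions. The helper name and the key `"uncertainty"` are hypothetical.

```python
from typing import Dict, List


def toy_confidence_inversion(
    data: Dict[str, List[float]], uncertainty_measure: str, eps: float = 1e-10
) -> Dict[str, List[float]]:
    """Return a copy of data in which the selected values are replaced by 1 / (value + eps)."""
    if uncertainty_measure not in data:
        raise KeyError(f"The key {uncertainty_measure} is not in the data provided")
    inverted = dict(data)  # shallow copy, so the input mapping is left unchanged
    inverted[uncertainty_measure] = [1.0 / (value + eps) for value in data[uncertainty_measure]]
    return inverted


toy_data = {"uncertainty": [0.5, 2.0, 0.0], "error": [1.0, 3.0, 0.2]}
inverted_data = toy_confidence_inversion(toy_data, "uncertainty")
print(inverted_data["uncertainty"])
```

After inversion, larger values correspond to higher confidence, and a zero uncertainty maps to a very large finite number instead of raising an error.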

kale.prepdata.tabular_transform.generate_struct_for_qbin(models_to_compare: List[str], targets: List[int], saved_bins_path_pre: str, dataset: str) Tuple[Dict[str, DataFrame], Dict[str, DataFrame], Dict[str, DataFrame], Dict[str, DataFrame]]

Returns dictionaries of pandas dataframes for:
  1. all error and prediction info (all prediction data across targets for each model),

  2. target indices for separated error and prediction info (prediction data for each model and each target),

  3. all estimated error bounds (estimated error bounds across targets for each model),

  4. target separated estimated error bounds (estimated error bounds for each model and each target).

Parameters:
  • models_to_compare (List[str]) – List of models to add to the data struct.

  • targets (List[int]) – List of targets to add to the data struct.

  • saved_bins_path_pre (str) – Preamble to the path where the predicted quantile bins are saved.

  • dataset (str) – Name of the dataset being evaluated.

Returns:

A tuple of four dictionaries:

  • data_struct: Dictionary where keys are model names and values are pandas dataframes containing all prediction data across targets for that model.

  • data_struct_sep: Dictionary where keys are a combination of model names and target indices (e.g., “model1 T1”), and values are pandas dataframes containing prediction data for the corresponding model and target.

  • data_struct_bounds: Dictionary where keys are a combination of model names and the string “ Error Bounds” (e.g., “model1 Error Bounds”), and values are pandas dataframes containing all estimated error bounds across targets for that model.

  • data_struct_bounds_sep: Dictionary where keys are a combination of model names, target indices and the string “Error Bounds” (e.g., “model1 Error Bounds L1”), and values are pandas dataframes containing estimated error bounds for the corresponding model and target.

Return type:

data_structs

kale.prepdata.tensor_reshape module

kale.prepdata.tensor_reshape.spatial_to_seq(image_tensor: Tensor)

Takes a torch tensor of shape (batch_size, channels, height, width), as used and output by CNNs, and creates a sequence view of shape (sequence_length, batch_size, channels), as required by PyTorch’s Transformer module, where sequence_length = height * width. In other words, it unrolls the spatial grid into the sequence length and rearranges the dimension ordering.

Parameters:

image_tensor – tensor of shape (batch_size, channels, height, width) (required).

kale.prepdata.tensor_reshape.seq_to_spatial(sequence_tensor: Tensor, desired_height: int, desired_width: int)

Takes a torch tensor of shape (sequence_length, batch_size, num_features), as used and output by Transformers, and creates a view of shape (batch_size, num_features, height, width), as used and output by CNNs. In other words, it rearranges the dimension ordering and rolls sequence_length into (height, width). height * width must equal the sequence length of the input sequence.

Parameters:
  • sequence_tensor – sequence tensor of shape (sequence_length, batch_size, num_features) (required).

  • desired_height – the height into which the sequence length should be rolled (required).

  • desired_width – the width into which the sequence length should be rolled (required).
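
The index bookkeeping behind these two reshapes can be shown with nested Python lists. The sketch assumes row-major unrolling, so that sequence position s corresponds to grid cell (s // width, s % width); that ordering and the helper names `toy_spatial_to_seq` and `toy_seq_to_spatial` are assumptions of this example, which copies data and does not create views.

```python
from typing import List

Grid = List[List[List[List[int]]]]  # indexed as [batch][channel][height][width]
Seq = List[List[List[int]]]  # indexed as [sequence][batch][channel]


def toy_spatial_to_seq(image: Grid) -> Seq:
    """Unroll (batch, channels, height, width) into (height * width, batch, channels)."""
    batch, channels = len(image), len(image[0])
    height, width = len(image[0][0]), len(image[0][0][0])
    return [
        [[image[b][c][s // width][s % width] for c in range(channels)] for b in range(batch)]
        for s in range(height * width)
    ]


def toy_seq_to_spatial(sequence: Seq, height: int, width: int) -> Grid:
    """Roll (sequence, batch, features) back into (batch, features, height, width)."""
    if len(sequence) != height * width:
        raise ValueError("height * width must equal the sequence length")
    batch, features = len(sequence[0]), len(sequence[0][0])
    return [
        [[[sequence[h * width + w][b][f] for w in range(width)] for h in range(height)] for f in range(features)]
        for b in range(batch)
    ]


# One image with 2 channels on a 2 x 3 grid; each value encodes its own (channel, row, column).
toy_image = [[[[c * 100 + h * 10 + w for w in range(3)] for h in range(2)] for c in range(2)]]
toy_seq = toy_spatial_to_seq(toy_image)
restored = toy_seq_to_spatial(toy_seq, height=2, width=3)
print(len(toy_seq), restored == toy_image)
```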

kale.prepdata.tensor_reshape.normalize_tensor(tensor, eps=1e-08)

Normalize a PyTorch tensor to the range [0, 1].
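
A common way to map values into [0, 1] is min-max scaling with a small eps in the denominator to guard against a zero range. Whether the kale function uses exactly this formula is an assumption; the sketch below works on a flat Python list, and `toy_min_max_normalize` is a hypothetical name.

```python
from typing import List


def toy_min_max_normalize(values: List[float], eps: float = 1e-8) -> List[float]:
    """Scale values to [0, 1] as (x - min) / (max - min + eps)."""
    lowest, highest = min(values), max(values)
    return [(value - lowest) / (highest - lowest + eps) for value in values]


scaled = toy_min_max_normalize([2.0, 4.0, 6.0])
print(scaled)
print(toy_min_max_normalize([3.0, 3.0]))  # constant input: eps keeps the division defined
```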

kale.prepdata.video_transform module

Module contents