Preprocess Data
Submodules
kale.prepdata.chem_transform module
Functions for labeling and encoding chemical characters such as compound SMILES and atom strings. See https://github.com/hkmztrk/DeepDTA and https://github.com/thinng/GraphDTA.
- kale.prepdata.chem_transform.integer_label_smiles(smiles, max_length=85, isomeric=False)
Integer encoding for SMILES string sequence.
- Parameters:
smiles (str) – A SMILES (simplified molecular-input line-entry system) string, a line notation that describes the structure of a chemical species using short ASCII strings.
max_length (int) – Maximum encoding length of input SMILES string. (default: 85)
isomeric (bool) – Whether the input SMILES string includes isomeric information (default: False).
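The integer encoding described above can be illustrated with a minimal sketch. It is not the library's implementation: the character table `CHAR_TO_INT`, the helper name `integer_label`, the use of 0 as the padding index, and the handling of unknown characters are all assumptions made for illustration, and the `isomeric` option is not modeled.

```python
# Hypothetical, abbreviated character-to-integer table. The table used by
# kale.prepdata.chem_transform is larger and its indices may differ.
CHAR_TO_INT = {"C": 1, "N": 2, "O": 3, "(": 4, ")": 5, "=": 6, "1": 7, "c": 8}


def integer_label(sequence, char_to_int, max_length):
    """Map each character to an integer, truncating or zero-padding to max_length."""
    encoding = [0] * max_length  # 0 is assumed to be the padding index
    for i, char in enumerate(sequence[:max_length]):
        # Characters missing from the table stay 0 in this sketch.
        encoding[i] = char_to_int.get(char, 0)
    return encoding


aspirin = "CC(=O)Oc1ccccc1C(=O)O"  # 21 characters
encoded = integer_label(aspirin, CHAR_TO_INT, max_length=25)
print(len(encoded))  # always max_length, whatever the input length
```

A fixed output length lets encoded strings of different lengths be stacked into one batch. The same pattern applies to protein sequences with a different character table.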
- kale.prepdata.chem_transform.integer_label_protein(sequence, max_length=1200)
Integer encoding for protein string sequence.
- Parameters:
sequence (str) – Protein string sequence.
max_length (int) – Maximum encoding length of input protein string. (default: 1200)
kale.prepdata.graph_negative_sampling module
- kale.prepdata.graph_negative_sampling.negative_sampling(pos_edge_index: Tensor, num_nodes: int) Tensor
Negative sampling for link prediction. Copied from https://github.com/NYXFLOWER/GripNet.
- Parameters:
pos_edge_index (torch.Tensor) – edge indices in COO format with shape [2, num_edges].
num_nodes (int) – the number of nodes in the graph.
- Returns:
edge indices in COO format with shape [2, num_edges].
- Return type:
torch.Tensor
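As a rough illustration of negative sampling for link prediction, the sketch below draws one non-existing edge per positive edge. It uses plain Python lists in the COO layout instead of tensors. The helper name `sample_negative_edges`, the `seed` argument, and the choice to allow self-loops are assumptions, not the behaviour of the GripNet code.

```python
import random


def sample_negative_edges(pos_edges, num_nodes, seed=0):
    """Draw one negative (non-existing) edge per positive edge.

    pos_edges mirrors the COO layout: [sources, targets], two equal-length lists.
    """
    num_wanted = len(pos_edges[0])
    if num_wanted == 0:
        return [[], []]
    existing = set(zip(pos_edges[0], pos_edges[1]))
    if num_nodes * num_nodes - len(existing) < num_wanted:
        raise ValueError("not enough non-edges to sample from")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < num_wanted:
        candidate = (rng.randrange(num_nodes), rng.randrange(num_nodes))
        if candidate not in existing:  # reject pairs that are real edges
            negatives.add(candidate)
    sources, targets = zip(*sorted(negatives))
    return [list(sources), list(targets)]


pos = [[0, 1, 2, 3], [1, 2, 3, 0]]
neg = sample_negative_edges(pos, num_nodes=4)
print(neg)  # same [2, num_edges] layout as the input
```

Rejection sampling like this is simple but slows down as the graph approaches full density.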
- kale.prepdata.graph_negative_sampling.typed_negative_sampling(pos_edge_index: Tensor, num_nodes: int, range_list: Tensor) Tensor
Typed negative sampling for link prediction. Copied from https://github.com/NYXFLOWER/GripNet.
- Parameters:
pos_edge_index (torch.Tensor) – edge indices in COO format with shape [2, num_edges].
num_nodes (int) – the number of nodes in the graph.
range_list (torch.Tensor) – the range of edge types. [[start_index, end_index], …]
- Returns:
edge indices in COO format with shape [2, num_edges].
- Return type:
torch.Tensor
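The role of `range_list` can be sketched as follows: edges are assumed to be grouped by type, and each `[start_index, end_index]` pair selects the columns belonging to one type. The sketch treats `end_index` as exclusive and samples negatives independently per type. Both choices, and the helper name `sample_typed_negative_edges`, are assumptions for illustration.

```python
import random


def sample_typed_negative_edges(pos_edges, num_nodes, range_list, seed=0):
    """For every edge type, draw as many negative edges as it has positive ones."""
    rng = random.Random(seed)
    neg_sources, neg_targets = [], []
    for start, end in range_list:  # end is treated as exclusive (assumption)
        existing = set(zip(pos_edges[0][start:end], pos_edges[1][start:end]))
        if num_nodes * num_nodes - len(existing) < end - start:
            raise ValueError("not enough non-edges for this edge type")
        drawn = set()
        while len(drawn) < end - start:
            candidate = (rng.randrange(num_nodes), rng.randrange(num_nodes))
            if candidate not in existing:  # only edges of the same type are excluded
                drawn.add(candidate)
        for source, target in sorted(drawn):
            neg_sources.append(source)
            neg_targets.append(target)
    return [neg_sources, neg_targets]


pos = [[0, 1, 2, 3], [1, 2, 3, 0]]
ranges = [[0, 2], [2, 4]]  # two edge types with two edges each
neg = sample_typed_negative_edges(pos, num_nodes=4, range_list=ranges)
print(neg)
```

Because the output keeps the per-type ordering, the same `range_list` can index the negative edges.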
kale.prepdata.image_transform module
kale.prepdata.string_transform module
Author: Lawrence Schobs, lawrenceschobs@gmail.com
This file contains functions for string manipulation.
- kale.prepdata.string_transform.strip_for_bound(string_: str) list
Convert a string containing comma-separated floats into a list of floats.
- Parameters:
string_ (str) – A string containing floats, separated by commas.
- Returns:
A list of floats.
- Return type:
list
Example
>>> strip_for_bound("[1.0, 2.0], [3.0, 4.0]")
[[1.0, 2.0], [3.0, 4.0]]
- kale.prepdata.string_transform.convert_to_float(value: str) float
Convert a string to a float, handling NumPy float constructors like ‘np.float32(…)’, ‘np.float64(…)’, etc.
- Parameters:
value (str) – The string to convert.
- Returns:
The converted float value.
- Return type:
float
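One way to handle strings such as 'np.float64(0.25)' is to strip the constructor wrapper with a regular expression before calling `float`. The sketch below shows that idea only. The helper name `parse_float` and the exact pattern are assumptions, and the library function may parse its input differently.

```python
import re

# Matches "np.float32(...)", "np.float64(...)" and similar, capturing the inner literal.
_NP_FLOAT = re.compile(r"^\s*np\.float\d*\((.*)\)\s*$")


def parse_float(value):
    """Return the float held in a plain or NumPy-constructor-style string."""
    match = _NP_FLOAT.match(value)
    if match:
        value = match.group(1)  # keep only the text inside the parentheses
    return float(value)


print(parse_float("np.float64(0.25)"))
print(parse_float("3.5"))
```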
kale.prepdata.signal_transform module
- kale.prepdata.signal_transform.normalize_signal(signal)
Normalizes a multi-channel ECG signal by removing mean and scaling to unit variance per channel.
- Parameters:
signal (ndarray) – Array of shape (samples, channels)
- Returns:
Normalized signal, same shape as input.
- Return type:
ndarray
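Per-channel normalization to zero mean and unit variance can be sketched with the standard library alone. The version below works on a nested list of shape (samples, channels) rather than an ndarray. The helper name `normalize_channels`, the use of the population standard deviation, and the guard for constant channels are assumptions.

```python
from statistics import fmean, pstdev


def normalize_channels(signal):
    """Z-score each channel (column) of a (samples, channels) nested list."""
    channels = list(zip(*signal))  # transpose to (channels, samples)
    normalized = []
    for channel in channels:
        mean = fmean(channel)
        # Assumed guard: a constant channel has zero spread, so divide by 1 instead.
        std = pstdev(channel) or 1.0
        normalized.append([(x - mean) / std for x in channel])
    return [list(row) for row in zip(*normalized)]  # back to (samples, channels)


ecg = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
scaled = normalize_channels(ecg)
print(scaled)
```

Normalizing each channel separately keeps leads with different amplitudes on a comparable scale.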
- kale.prepdata.signal_transform.interpolate_signal(signal)
Linearly interpolates missing or NaN values in the ECG signal.
- Parameters:
signal (ndarray) – Array of shape (samples, channels)
- Returns:
Interpolated signal, same shape as input.
- Return type:
ndarray
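Linear interpolation of missing values can be illustrated for a single channel as below. The helper name `interpolate_channel` is hypothetical, and filling leading or trailing gaps with the nearest valid value is an assumption; the library function may treat edges differently.

```python
import math


def interpolate_channel(values):
    """Linearly fill NaN entries of one channel from the nearest valid neighbours."""
    known = [i for i, v in enumerate(values) if not math.isnan(v)]
    if not known:
        return list(values)  # nothing to interpolate from
    filled = list(values)
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]  # leading gap: hold the first valid value
        elif right is None:
            filled[i] = values[left]  # trailing gap: hold the last valid value
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


nan = float("nan")
print(interpolate_channel([1.0, nan, 3.0]))
```

For a multi-channel signal, the same function would be applied to each column in turn.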
- kale.prepdata.signal_transform.prepare_ecg_tensor(signal)
Converts a preprocessed ECG signal (NumPy array or PyTorch tensor) to a torch tensor of shape (1, -1).
- Parameters:
signal (ndarray or Tensor) – Preprocessed and normalized ECG array (samples, channels).
- Returns:
Flattened ECG tensor, shape (1, total_samples).
- Return type:
Tensor
kale.prepdata.supergraph_construct module
The supergraph structure from the Pattern Recognition 2022 paper “GripNet: Graph Information Propagation on Supergraph for Heterogeneous Graphs” <https://doi.org/10.1016/j.patcog.2022.108973>.
- class kale.prepdata.supergraph_construct.SuperVertex(name: str, node_feat: Tensor, edge_index: Tensor, edge_type: Tensor = None, edge_weight: Tensor = None)
Bases:
object
The supervertex structure in GripNet. Each supervertex is a subgraph containing nodes of the same category that are semantically-coherent. Supervertices can be homogeneous or heterogeneous.
- Parameters:
name (str) – the name of the supervertex.
node_feat (torch.Tensor) – node features of the supervertex with shape [#nodes, #features]. We recommend using torch.sparse.FloatTensor() if the node feature matrix is sparse.
edge_index (torch.Tensor) – edge indices in COO format with shape [2, #edges].
edge_type (torch.Tensor, optional) – one-dimensional relation type for each edge, indexed from 0. Defaults to None.
edge_weight (torch.Tensor, optional) – one-dimensional weight for each edge. Defaults to None.
Examples
>>> import torch
>>> from kale.prepdata.supergraph_construct import SuperVertex
>>> node_feat = torch.randn(4, 20)
>>> edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])
>>> edge_type = torch.tensor([0, 0, 1, 1])
>>> edge_weight = torch.randn(4)
>>> # the first argument is the supervertex name; the names below are placeholders
>>> # create a supervertex with homogeneous edges
>>> supervertex_homo = SuperVertex("homo", node_feat, edge_index)
>>> # create a supervertex with heterogeneous edges
>>> supervertex_hete = SuperVertex("hete", node_feat, edge_index, edge_type)
>>> # create a supervertex with weighted edges
>>> supervertex_weight1 = SuperVertex("weight1", node_feat, edge_index, edge_weight=edge_weight)
>>> supervertex_weight2 = SuperVertex("weight2", node_feat, edge_index, edge_type, edge_weight)
- add_in_supervertex(vertex_name: str)
- add_out_supervertex(vertex_name: str)
- class kale.prepdata.supergraph_construct.SuperEdge(source_supervertex: str, target_supervertex: str, edge_index: Tensor, edge_weight: Tensor = None)
Bases:
object
The superedge structure in GripNet. Each superedge is a bipartite subgraph containing nodes from two categories forming two node sets, connected by edges between them. A superedge can be regarded as a heterogeneous graph connecting two supervertices.
- Parameters:
source_supervertex (str) – the name of the source supervertex.
target_supervertex (str) – the name of the target supervertex.
edge_index (torch.Tensor) – edge indices in COO format with shape [2, #edges]. The first row is the index of source nodes, and the second row is the index of target nodes.
edge_weight (torch.Tensor, optional) – one-dimensional weight for each edge. Defaults to None.
- class kale.prepdata.supergraph_construct.SuperVertexParaSetting(supervertex_name: str, inter_feat_channels: int, inter_agg_channels_list: List[int], exter_agg_channels_dict: Dict[str, int] | None = None, mode: str | None = None, num_bases: int = 32, concat_output: bool = True)
Bases:
object
Parameter settings for each supervertex.
- Parameters:
supervertex_name (str) – the name of the supervertex.
inter_feat_channels (int) – the dimension of the output of the internal feature layer.
inter_agg_channels_list (List[int]) – the output dimensions of a sequence of internal aggregation layers.
exter_agg_channels_dict (Dict[str, int], optional) – the dimension of received message vector from parent supervertices. Defaults to None.
mode (str, optional) – the GripNet mode, either ‘cat’ or ‘add’. Defaults to None.
num_bases (int, optional) – the number of bases used for basis-decomposition if the supervertex is multi-relational. Defaults to 32.
concat_output (bool, optional) – whether to concatenate the output of each layer. Defaults to True.
- class kale.prepdata.supergraph_construct.SuperGraph(supervertex_list: List[SuperVertex], superedge_list: List[SuperEdge], supervertex_setting_dict: Dict[str, SuperVertexParaSetting] | None = None)
Bases:
object
The supergraph structure in GripNet. Each supergraph is a directed acyclic graph (DAG) containing supervertices and superedges.
- Parameters:
supervertex_list (list[SuperVertex]) – a list of supervertices.
superedge_list (list[SuperEdge]) – a list of superedges.
supervertex_setting_dict (dict[str, SuperVertexParaSetting], optional) – the parameter settings for each supervertex. Defaults to None.
- set_supergraph_para_setting(supervertex_setting_list: List[SuperVertexParaSetting])
Set the parameters of the supergraph.
- Parameters:
supervertex_setting_list (list[SuperVertexParaSetting]) – a list of parameter settings for each supervertex.
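Because a supergraph is a DAG, its supervertices can be ordered so that every source supervertex precedes the targets it sends information to. The sketch below shows that idea with the standard-library `graphlib` module (Python 3.9+). The supervertex names are hypothetical, and this is not how `SuperGraph` is implemented internally.

```python
from graphlib import TopologicalSorter

# Hypothetical superedges, each given as (source_supervertex, target_supervertex).
superedges = [("gene", "drug"), ("gene", "disease"), ("drug", "disease")]

sorter = TopologicalSorter()
for source, target in superedges:
    sorter.add(target, source)  # the target depends on its source (parent)

# static_order() raises graphlib.CycleError if the superedges contain a cycle.
order = list(sorter.static_order())
print(order)  # parents always appear before their children
```

An ordering like this is what allows information to be propagated from parent supervertices to their children in a single pass.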
kale.prepdata.tabular_transform module
Functions for manipulating/transforming tabular data
- class kale.prepdata.tabular_transform.ToTensor(dtype: dtype | None = None, device: device | None = None)
Bases:
object
Convert an array_like data to a tensor of the same shape. This class provides a callable object that allows instances of the class to be called as a function. In other words, this class wraps the functionality of torch.tensor and allows users to use it as a callable instance.
- Parameters:
dtype (torch.dtype, optional) – The desired data type of returned tensor. Default: if None, infers data type from data.
device (torch.device, optional) – The device of the constructed tensor. If None and data is a tensor then the device of data is used. If None and data is not a tensor then the result tensor is constructed on the CPU.
- class kale.prepdata.tabular_transform.ToOneHotEncoding(num_classes: int | None = -1, dtype: dtype | None = None, device: device | None = None)
Bases:
object
Convert an array_like of class values of shape (*,) to a tensor of shape (*, num_classes) that have zeros everywhere except where the index of last dimension matches the corresponding value of the input tensor, in which case it will be 1.
Note that this class provides a callable object that allows instances of the class to be called as a function. In other words, this class wraps the functionality of the one_hot method in the PyTorch and allows users to use it as a callable instance.
- Parameters:
num_classes (int, optional) – Total number of classes. If set to -1, the number of classes will be inferred as one greater than the largest class value in the input data.
dtype (torch.dtype, optional) – The desired data type of returned tensor. Default: if None, infers data type from data.
device (torch.device, optional) – The device of the constructed tensor. If None and data is a tensor then the device of data is used. If None and data is not a tensor then the result tensor is constructed on the CPU.
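The callable-instance pattern and the `num_classes=-1` inference rule can be sketched without PyTorch. The class `OneHot` below is a hypothetical stand-in that works on a flat list of non-negative integers and ignores `dtype` and `device`.

```python
class OneHot:
    """Callable that one-hot encodes a flat list of non-negative class indices."""

    def __init__(self, num_classes=-1):
        self.num_classes = num_classes

    def __call__(self, labels):
        num_classes = self.num_classes
        if num_classes == -1:
            # Infer the class count as one greater than the largest label.
            num_classes = max(labels) + 1
        return [[1 if k == label else 0 for k in range(num_classes)] for label in labels]


to_one_hot = OneHot()  # configure once, then call like a function
print(to_one_hot([0, 2, 1]))  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Wrapping the conversion in a callable object lets it be passed wherever a transform function is expected, for example as a dataset's label transform.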
- kale.prepdata.tabular_transform.apply_confidence_inversion(data: DataFrame, uncertainty_measure: str) Tuple[Any, Any]
Invert the values stored under uncertainty_measure, adding a small number to each value to avoid division by zero.
- Parameters:
data (Dict) – Dictionary of data to invert.
uncertainty_measure (str) – Key of dict to invert.
- Returns:
Dictionary with inverted data.
- Return type:
Dict
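The inversion can be illustrated on a plain dictionary of lists. The helper name `invert_confidence`, the `eps` value, and returning a shallow copy rather than modifying the input are assumptions of this sketch, which does not use pandas.

```python
def invert_confidence(data, key, eps=1e-10):
    """Return a copy of data in which the values under key are replaced by 1 / (value + eps)."""
    if key not in data:
        raise KeyError(key)
    inverted = dict(data)  # shallow copy so the input mapping is left unchanged
    inverted[key] = [1.0 / (value + eps) for value in data[key]]
    return inverted


records = {"confidence": [0.0, 0.5], "uid": [1, 2]}
result = invert_confidence(records, "confidence")
print(result["confidence"])  # large values now mean low confidence
```

After inversion, a high confidence score becomes a small number, so the column can be used where an uncertainty-like quantity (larger means less certain) is expected.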
- kale.prepdata.tabular_transform.generate_struct_for_qbin(models_to_compare: List[str], targets: List[int], saved_bins_path_pre: str, dataset: str) Tuple[Dict[str, DataFrame], Dict[str, DataFrame], Dict[str, DataFrame], Dict[str, DataFrame]]
- Returns dictionaries of pandas dataframes for:
all error and prediction info (all prediction data across targets for each model),
target indices for separated error and prediction info (prediction data for each model and each target),
all estimated error bounds (estimated error bounds across targets for each model),
target separated estimated error bounds (estimated error bounds for each model and each target).
- Parameters:
models_to_compare – List of models to add to the data struct.
targets – List of targets to add to the data struct.
saved_bins_path_pre – Prefix of the path where the predicted quantile bins are saved.
dataset – Name of the dataset being measured.
- Returns:
- data_struct: Dictionary where keys are model names and values are pandas dataframes containing all prediction data across targets for that model.
- data_struct_sep: Dictionary where keys are a combination of model names and target indices (e.g., “model1 T1”), and values are pandas dataframes containing prediction data for the corresponding model and target.
- data_struct_bounds: Dictionary where keys are a combination of model names and the string “ Error Bounds” (e.g., “model1 Error Bounds”), and values are pandas dataframes containing all estimated error bounds across targets for that model.
- data_struct_bounds_sep: Dictionary where keys are a combination of model names, target indices and the string “Error Bounds” (e.g., “model1 Error Bounds L1”), and values are pandas dataframes containing estimated error bounds for the corresponding model and target.
- Return type:
data_structs
kale.prepdata.tensor_reshape module
- kale.prepdata.tensor_reshape.spatial_to_seq(image_tensor: Tensor)
Takes a torch tensor of shape (batch_size, channels, height, width) as used and outputted by CNNs and creates a sequence view of shape (sequence_length, batch_size, channels) as required by torch’s transformer module. In other words, unrolls the spatial grid into the sequence length and rearranges the dimension ordering.
- Parameters:
image_tensor – tensor of shape (batch_size, channels, height, width) (required).
- kale.prepdata.tensor_reshape.seq_to_spatial(sequence_tensor: Tensor, desired_height: int, desired_width: int)
Takes a torch tensor of shape (sequence_length, batch_size, num_features) as used and outputted by Transformers and creates a view of shape (batch_size, num_features, height, width) as used and outputted by CNNs. In other words, rearranges the dimension ordering and rolls sequence_length into (height,width). height*width must equal the sequence length of the input sequence.
- Parameters:
sequence_tensor – sequence tensor of shape (sequence_length, batch_size, num_features) (required).
desired_height – the height into which the sequence length should be rolled (required).
desired_width – the width into which the sequence length should be rolled (required).
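The two reshaping directions can be sketched on nested lists to make the index mapping explicit. The helper names `to_sequence` and `to_spatial` are hypothetical, and unrolling the grid row by row (sequence position = row * width + column) is an assumption. Unlike the library functions, this version copies the data instead of creating a view.

```python
def to_sequence(x):
    """x[b][c][h][w] -> seq[s][b][c], with s = h * width + w."""
    batch, channels = len(x), len(x[0])
    height, width = len(x[0][0]), len(x[0][0][0])
    return [
        [[x[b][c][s // width][s % width] for c in range(channels)] for b in range(batch)]
        for s in range(height * width)
    ]


def to_spatial(seq, height, width):
    """seq[s][b][f] -> x[b][f][h][w]; height * width must equal len(seq)."""
    if height * width != len(seq):
        raise ValueError("height * width must equal the sequence length")
    batch, features = len(seq[0]), len(seq[0][0])
    return [
        [
            [[seq[h * width + w][b][f] for w in range(width)] for h in range(height)]
            for f in range(features)
        ]
        for b in range(batch)
    ]


# A (batch=2, channels=3, height=2, width=2) grid with distinct entries.
grid = [
    [[[b * 100 + c * 10 + h * 2 + w for w in range(2)] for h in range(2)] for c in range(3)]
    for b in range(2)
]
sequence = to_sequence(grid)
print(len(sequence), len(sequence[0]), len(sequence[0][0]))  # 4 2 3
print(to_spatial(sequence, 2, 2) == grid)  # True
```

The round trip recovering the original grid shows that the two operations are inverses when the same height and width are used.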
- kale.prepdata.tensor_reshape.normalize_tensor(tensor, eps=1e-08)
Normalize a PyTorch tensor to the range [0, 1].
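A common way to map values to [0, 1] is min-max scaling, with `eps` added to the denominator so that a constant input does not cause division by zero. The sketch below assumes that formula and uses a flat Python list. The helper name `min_max_normalize` is hypothetical, and the library function may place `eps` differently.

```python
def min_max_normalize(values, eps=1e-8):
    """Scale values to [0, 1] using (v - min) / (max - min + eps)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low + eps) for v in values]


print(min_max_normalize([2.0, 4.0, 6.0]))
print(min_max_normalize([3.0, 3.0]))  # constant input maps to zeros, not NaN
```

With this placement of `eps`, the maximum maps to a value marginally below 1 rather than exactly 1.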