Preprocess Data

Submodules

kale.prepdata.chem_transform module

Functions for labeling and encoding chemical character strings, such as compound SMILES and atom strings. See https://github.com/hkmztrk/DeepDTA and https://github.com/thinng/GraphDTA.

kale.prepdata.chem_transform.integer_label_smiles(smiles, max_length=85, isomeric=False)

Integer encoding for SMILES string sequence.

Parameters
  • smiles (str) – Simplified molecular-input line-entry system, which is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings.

  • max_length (int) – Maximum encoding length of input SMILES string. (default: 85)

  • isomeric (bool) – Whether the input SMILES string includes isomeric information (default: False).

kale.prepdata.chem_transform.integer_label_protein(sequence, max_length=1200)

Integer encoding for protein string sequence.

Parameters
  • sequence (str) – Protein string sequence.

  • max_length (int) – Maximum encoding length of input protein string. (default: 1200)
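
Both encoders above follow the same pattern: each character is mapped to an integer, and the result is padded or truncated to max_length. The following is a minimal sketch of that pattern only. The function name integer_label_sketch, the tiny vocabularies, the zero-padding on the right, and the KeyError on unknown characters are all assumptions made for illustration; they are not the library's implementation or its character tables.

```python
# Hypothetical vocabularies for illustration only; a real encoder uses a
# fixed, much larger character table. Index 0 is reserved for padding.
SMILES_VOCAB = {"C": 1, "N": 2, "O": 3, "(": 4, ")": 5, "=": 6}
PROTEIN_VOCAB = {"M": 1, "K": 2, "V": 3}


def integer_label_sketch(text, char_to_int, max_length):
    """Encode `text` as a list of `max_length` integers.

    Longer inputs are truncated; shorter inputs are zero-padded on the right.
    Characters missing from `char_to_int` raise KeyError in this sketch.
    """
    encoding = [0] * max_length
    for position, char in enumerate(text[:max_length]):
        encoding[position] = char_to_int[char]
    return encoding


acetic_acid = integer_label_sketch("CC(=O)O", SMILES_VOCAB, max_length=10)
truncated = integer_label_sketch("CC(=O)O", SMILES_VOCAB, max_length=4)
peptide = integer_label_sketch("MKV", PROTEIN_VOCAB, max_length=5)

print(acetic_acid)  # [1, 1, 4, 6, 3, 5, 3, 0, 0, 0]
print(truncated)    # [1, 1, 4, 6]
print(peptide)      # [1, 2, 3, 0, 0]
```

A fixed output length lets encoded sequences of different sizes be stacked into a single batch.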

kale.prepdata.image_transform module

kale.prepdata.supergraph_construct module

kale.prepdata.tensor_reshape module

kale.prepdata.tensor_reshape.spatial_to_seq(image_tensor: Tensor)

Takes a torch tensor of shape (batch_size, channels, height, width), as used and produced by CNNs, and creates a sequence view of shape (sequence_length, batch_size, channels), as required by torch’s transformer module, where sequence_length equals height * width. In other words, it unrolls the spatial grid into the sequence length and rearranges the dimension ordering.

Parameters

image_tensor – tensor of shape (batch_size, channels, height, width) (required).
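
The reshaping described above can be sketched with a permute followed by a reshape. This is an illustrative sketch under the shapes stated in the description, not the library's code; the name spatial_to_seq_sketch and the row-by-row flattening order are assumptions.

```python
import torch


def spatial_to_seq_sketch(image_tensor: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: (batch, channels, height, width) -> (height*width, batch, channels)."""
    batch_size, channels, height, width = image_tensor.shape
    # Move the spatial axes to the front: (height, width, batch_size, channels).
    moved = image_tensor.permute(2, 3, 0, 1)
    # Flatten the grid row by row into a single sequence axis.
    return moved.reshape(height * width, batch_size, channels)


features = torch.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
sequence = spatial_to_seq_sketch(features)
print(tuple(sequence.shape))  # (20, 2, 3)
```

In this sketch, the sequence position of grid cell (row, column) is row * width + column.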

kale.prepdata.tensor_reshape.seq_to_spatial(sequence_tensor: Tensor, desired_height: int, desired_width: int)

Takes a torch tensor of shape (sequence_length, batch_size, num_features), as used and produced by Transformers, and creates a view of shape (batch_size, num_features, height, width), as used and produced by CNNs. In other words, it rearranges the dimension ordering and rolls sequence_length into (height, width). The product height * width must equal the sequence length of the input sequence.

Parameters
  • sequence_tensor – sequence tensor of shape (sequence_length, batch_size, num_features) (required).

  • desired_height – the height into which the sequence length should be rolled (required).

  • desired_width – the width into which the sequence length should be rolled (required).
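
The reverse direction can be sketched in the same way. The function below is a hypothetical illustration, not the library's code: the name seq_to_spatial_sketch and the explicit ValueError are assumptions, added to make the height * width constraint visible.

```python
import torch


def seq_to_spatial_sketch(sequence_tensor: torch.Tensor,
                          desired_height: int,
                          desired_width: int) -> torch.Tensor:
    """Hypothetical sketch: (seq_len, batch, features) -> (batch, features, height, width)."""
    sequence_length, batch_size, num_features = sequence_tensor.shape
    # Explicit check added for this sketch: the grid must hold every position.
    if desired_height * desired_width != sequence_length:
        raise ValueError("desired_height * desired_width must equal sequence_length")
    # Move the sequence axis to the end: (batch_size, num_features, sequence_length).
    moved = sequence_tensor.permute(1, 2, 0)
    # Split the sequence axis into rows and columns.
    return moved.reshape(batch_size, num_features, desired_height, desired_width)


sequence = torch.arange(20 * 2 * 3).reshape(20, 2, 3)
grid = seq_to_spatial_sketch(sequence, desired_height=4, desired_width=5)
print(tuple(grid.shape))  # (2, 3, 4, 5)
```

Sequence position p lands at row p // desired_width and column p % desired_width, so position 7 in a 4 x 5 grid maps to cell (1, 2).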

kale.prepdata.video_transform module

kale.prepdata.video_transform.get_transform(kind, image_modality)

Defines transforms for commonly used datasets.

Parameters
  • kind (str) – the dataset (transformation) name

  • image_modality (str) – image type (RGB or Optical Flow)

class kale.prepdata.video_transform.ImglistToTensor

Bases: Module

Converts a list of PIL images in the range [0,255] to a torch.FloatTensor of shape (NUM_IMAGES x CHANNELS x HEIGHT x WIDTH) in the range [0,1]. Can be used as the first transform for kale.loaddata.videos.VideoFrameDataset.

forward(img_list)

For RGB input, converts each PIL image in a list to a torch Tensor and stacks them into a single tensor. For flow input, converts every two PIL images (x(u)_img, y(v)_img) in a list to a torch Tensor and stacks them. For example, if the input list size is 16, the dimension is [16, 1, 224, 224] and the frame order is [frame 1_x, frame 1_y, frame 2_x, frame 2_y, frame 3_x, …, frame 8_x, frame 8_y]. The output will be [[frame 1_x, frame 1_y], [frame 2_x, frame 2_y], …, [frame 8_x, frame 8_y]] and the dimension is [8, 2, 224, 224].

Parameters

img_list – list of PIL images.

Returns

tensor of size NUM_IMAGES x CHANNELS x HEIGHT x WIDTH
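
The flow pairing described for forward can be illustrated with a stack followed by a reshape. This is a sketch under assumptions, not the class's implementation: random single-channel tensors stand in for converted PIL images, and a small 4 x 4 frame size replaces 224 x 224 to keep the example light.

```python
import torch

num_images, height, width = 16, 4, 4

# Stand-ins for 16 single-channel flow images already converted to tensors,
# ordered [frame 1_x, frame 1_y, frame 2_x, frame 2_y, ...].
single_channel = [torch.rand(1, height, width) for _ in range(num_images)]

# Stack into one tensor of shape (16, 1, 4, 4).
stacked = torch.stack(single_channel)

# Group consecutive (x, y) images into the two channels of one flow frame.
flow = stacked.reshape(num_images // 2, 2, height, width)

print(tuple(stacked.shape))  # (16, 1, 4, 4)
print(tuple(flow.shape))     # (8, 2, 4, 4)
```

After the reshape, flow[k, 0] holds the x image and flow[k, 1] holds the y image of frame k + 1.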

class kale.prepdata.video_transform.TensorPermute

Bases: Module

Convert a torch.FloatTensor of shape (NUM_IMAGES x CHANNELS x HEIGHT x WIDTH) to a torch.FloatTensor of shape (CHANNELS x NUM_IMAGES x HEIGHT x WIDTH).
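
The axis swap amounts to a single permute call, sketched below with a small random clip. This illustrates the shape change only and is not the class's code. A channels-first clip layout is, for instance, what torch.nn.Conv3d expects per sample once a batch dimension is added in front.

```python
import torch

# A clip of 8 RGB frames: (NUM_IMAGES, CHANNELS, HEIGHT, WIDTH).
clip = torch.rand(8, 3, 4, 4)

# Swap the first two axes: (CHANNELS, NUM_IMAGES, HEIGHT, WIDTH).
permuted = clip.permute(1, 0, 2, 3)

print(tuple(permuted.shape))  # (3, 8, 4, 4)
```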

Module contents