selector.methods.similarity

`selector.methods.similarity`#

Module for Similarity-Based Selection Methods.

This module contains the classes and functions for the similarity-based selection methods. To select a diverse subset of molecules, these methods choose molecules so that the similarity among the molecules in the subset is minimized. The similarity of a set of molecules is calculated using an n-ary similarity index. These indexes compare n molecules at a time and return a value between 0 and 1, where 0 means that all the molecules in the set are completely different and 1 means that all the molecules are identical.

The ideas behind the similarity-based selection methods are described in the following papers:

(esim)
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00505-3
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00504-4

(isim)
TODO: Add paper

class selector.methods.similarity.NSimilarity(method: str = 'isim', inv_order: int = 1, similarity_index: str = 'RR', w_factor: str = 'fraction', c_threshold: Union[None, str, int] = None, preprocess_data: bool = True)#

Select samples of vectors using n-ary similarity indexes between vectors.

The algorithms in this class select a diverse subset of vectors such that the similarity between the vectors in the subset is minimized. The similarity of a set of vectors is calculated using an n-ary similarity index. These indexes compare n vectors (e.g. molecular fingerprints) at a time and return a value between 0 and 1, where 0 means that all the vectors in the set are completely different and 1 means that the vectors are identical.

The algorithm starts by selecting a starting reference data point. Then, the next data point is selected so that the similarity value of the group of selected data points is minimized. The process is repeated until the desired number of data points has been selected.
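The greedy loop described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `set_similarity` and `greedy_diverse_selection` are hypothetical names, and the set similarity used here is a simple stand-in (the average pairwise Russell-Rao similarity of binary vectors, computed from column sums) rather than one of the indexes provided by the class. The sketch keeps a running column-wise sum of the selected rows, so each candidate can be scored without revisiting every selected row.

```python
import numpy as np


def set_similarity(col_sum: np.ndarray, n: int) -> float:
    """Stand-in set similarity: average pairwise Russell-Rao similarity.

    For binary data, a column with k ones contributes k*(k-1)/2 coinciding
    pairs out of n*(n-1)/2 possible pairs.
    """
    m = col_sum.size
    return float(np.sum(col_sum * (col_sum - 1)) / (n * (n - 1) * m))


def greedy_diverse_selection(X: np.ndarray, size: int, start: int = 0) -> list:
    """Greedily add the row that minimizes the similarity of the selected set."""
    X = np.asarray(X, dtype=float)
    selected = [start]
    condensed = X[start].copy()  # column-wise sum of the rows selected so far
    while len(selected) < size:
        n_new = len(selected) + 1
        best_idx, best_sim = None, np.inf
        for i in range(X.shape[0]):
            if i in selected:
                continue
            sim = set_similarity(condensed + X[i], n_new)
            if sim < best_sim:  # ties keep the lowest index
                best_idx, best_sim = i, sim
        selected.append(best_idx)
        condensed += X[best_idx]
    return selected


# Toy binary fingerprints: row 1 duplicates row 0, row 2 shares no bits with row 0.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
])
picked = greedy_diverse_selection(X, size=2, start=0)
```

Starting from row 0, the sketch picks row 2 next, because adding it gives the selected pair the lowest stand-in similarity.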

Notes#

The ideas behind the similarity-based selection methods are described in the papers listed in the module description above.

_abc_impl = <_abc._abc_data object>#

_get_new_index(X: ndarray, selected_condensed: ndarray, num_selected: int, select_from: ndarray) → int#

Select a new diverse sample from the data.

The function selects a new sample such that the similarity of the new set of selected samples is minimized.

Parameters#

X: np.ndarray: Array of features (columns) for each sample (rows).
selected_condensed: np.ndarray: Columnwise sum of all the samples selected so far.
num_selected: int: Number of samples selected so far.
select_from: np.ndarray: Array of integers representing the indices of the samples that have not been selected yet.

Returns#

selected: int: Index of the new selected sample.

_scale_data(X: ndarray)#

Scale the data so it can be used with the similarity indexes.

First, each feature (column) is min-max normalized so that its values lie between 0 and 1.

\[x_{ij} = \frac{x_{ij} - \min(x_j)}{\max(x_j) - \min(x_j)}\]

Then, the average of each column is calculated. Finally, each element of the final working array will be defined as

\[w_{ij} = 1 - | x_{ij} - a_j |\]

where \(x_{ij}\) is the element of the normalized array, and \(a_j\) is the average of the j-th column of the normalized array.
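The two steps above can be written in a few lines of NumPy. This is a sketch under stated assumptions, not the body of `_scale_data`: the function name `scale_for_similarity` is hypothetical, and the handling of constant columns (where the maximum equals the minimum) is an assumption added here to avoid dividing by zero.

```python
import numpy as np


def scale_for_similarity(X: np.ndarray) -> np.ndarray:
    """Min-max normalize each column, then return w_ij = 1 - |x_ij - a_j|."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Assumption: a constant column normalizes to all zeros instead of NaN.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    normalized = (X - col_min) / safe_range
    col_mean = normalized.mean(axis=0)  # a_j in the formula above
    return 1.0 - np.abs(normalized - col_mean)


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
W = scale_for_similarity(X)
```

In the first column the normalized values are 0, 0.5, and 1 with mean 0.5, so the working values are 0.5, 1, and 0.5: the sample closest to the column average gets the largest value.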

Parameters#

X: np.ndarray: Array of features (columns) for each sample (rows).

calculate_medoid(X: ndarray, c_total=None) → int#

Calculate the medoid of a set of real-valued vectors or binary objects.

Parameters#

X: np.ndarray: Array of all the real-valued vectors or binary objects.
c_total: np.ndarray, optional: Array with the column-wise sums of the data.

calculate_outlier(X: ndarray = None, c_total=None) → int#

Calculate the outlier of a set of real-valued vectors or binary objects.

The outlier is determined using the similarity index provided in the class initialization.

Parameters#

X: np.ndarray: Array of all the real-valued vectors or binary objects.
c_total: np.ndarray, optional: Array with the column-wise sums of the data.
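The role of `c_total` in `calculate_medoid` and `calculate_outlier` can be illustrated with a short sketch. It rests on an assumption that this page does not state: that the medoid is taken as the object whose removal leaves the least similar set (lowest complementary similarity) and the outlier as the object whose removal leaves the most similar set. The function names are hypothetical, and `set_similarity` is a stand-in index (average pairwise Russell-Rao similarity for binary data), not the index configured in the class.

```python
import numpy as np


def set_similarity(col_sum: np.ndarray, n: int) -> float:
    """Stand-in set similarity computed from column sums of n binary rows."""
    m = col_sum.size
    return float(np.sum(col_sum * (col_sum - 1)) / (n * (n - 1) * m))


def complementary_similarities(X: np.ndarray, c_total=None) -> np.ndarray:
    """Similarity of the set with each row left out, one value per row."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if c_total is None:
        c_total = X.sum(axis=0)  # column-wise sums, computed once
    # Subtracting a row from c_total gives the column sums of the other n - 1 rows.
    return np.array([set_similarity(c_total - row, n - 1) for row in X])


X = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
])
comp = complementary_similarities(X)
medoid = int(np.argmin(comp))   # removing it lowers the similarity the most
outlier = int(np.argmax(comp))  # removing it raises the similarity the most
```

Because the column sums are computed once and reused for every row, passing a precomputed `c_total` saves a full pass over the data, which is presumably why the argument is optional.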

select(x: ndarray, size: int, labels: ndarray = None, proportional_selection: bool = True) → Union[List, Iterable]#

Return indices representing subset of sample points.

Parameters#

x: ndarray of shape (n_samples, n_features) or (n_samples, n_samples): Feature matrix of n_samples samples in n_features dimensional feature space. If fun_distance is None, this x is treated as a square pairwise distance matrix.
size: int: Number of sample points to select (i.e. size of the subset).
labels: np.ndarray, optional: Array of integers or strings representing the labels of the clusters that each sample belongs to. If None, the samples are treated as one cluster. If labels are provided, selection is made from each cluster.
proportional_selection: bool, optional: If True, the number of samples to be selected from each cluster is proportional. Otherwise, the number of samples to be selected from each cluster is equal. Default is True.

Returns#

selected: list: Indices of the selected sample points.
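To make the `proportional_selection` option concrete, the sketch below shows one plausible way to split `size` across clusters. It is an assumption for illustration only: the helper name `samples_per_cluster` and the rounding rule (largest fractional remainder first) are hypothetical and may differ from what `select` actually does, and the sketch does not cap the allocation at the cluster size.

```python
import numpy as np


def samples_per_cluster(labels, size: int, proportional: bool = True) -> dict:
    """Return a hypothetical {label: number of samples to pick} allocation."""
    unique, counts = np.unique(labels, return_counts=True)
    if proportional:
        raw = size * counts / counts.sum()  # share follows the cluster size
    else:
        raw = np.full(len(unique), size / len(unique))  # same share everywhere
    alloc = np.floor(raw).astype(int)
    # Hand out the samples lost to rounding, largest fractional part first.
    leftover = int(size - alloc.sum())
    order = np.argsort(-(raw - alloc), kind="stable")
    alloc[order[:leftover]] += 1
    return dict(zip(unique.tolist(), alloc.tolist()))


labels = [0] * 8 + [1] * 4
proportional = samples_per_cluster(labels, size=6, proportional=True)
equal = samples_per_cluster(labels, size=6, proportional=False)
```

With eight samples in cluster 0 and four in cluster 1, a proportional split of six samples gives four and two, while an equal split gives three and three.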

select_from_cluster(X: ndarray, size: int, labels: Optional[ndarray] = None, start: Union[str, List[int]] = 'medoid') → List[int]#

Select points from a cluster using the n-ary similarity selection algorithm.

Parameters#

X: np.ndarray: Array of features (columns) for each sample (rows).
size: int: Number of sample points to select (i.e. size of the subset).
labels: np.ndarray, optional: Array of integers or strings representing the points ids of the data that belong to the current cluster. If None, all the samples in the data are treated as one cluster.
start: str or list: If a str, the key that determines how the selection is started, one of {'medoid', 'random', 'outlier'}. If a list, the indices of the points that are included in the selection from the beginning.

Returns#

selected: list: Indices of the selected sample points.

class selector.methods.similarity.SimilarityIndex(method: str = 'isim', inv_order: int = 1, sim_index: str = 'RR', c_threshold: Union[None, str, int] = None, w_factor: str = 'fraction')#

Calculate the n-ary similarity index of a set of vectors.

This class provides methods for calculating the similarity index of a set of vectors represented as a matrix. Each vector is a row in the matrix, and each column represents a feature of the vector. The features in the vectors must be binary or real numbers between 0 and 1.

_calculate_counters(X: ndarray, nsamples: Optional[int] = None) → dict#

Calculate 1-similarity, 0-similarity, and dissimilarity counters.

Arguments#

X: np.ndarray: Array of arrays, where each sub-array contains a binary or real-valued vector. The values must be between 0 and 1. If the number of rows is 1, the data is treated as the column-wise sum of the objects. If the number of rows is greater than 1, the data is treated as the objects themselves.
nsamples: int: Number of objects. Only necessary if the column-wise sum of the objects is provided instead of the data (number of rows == 1). If the data is provided, the number of objects is calculated as the length of the data.

Returns#

counters: dict: Dictionary with the weighted and non-weighted counters.
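The sketch below illustrates how the three kinds of counters could be derived from column sums for binary data. It is an assumption based on the extended similarity (esim) idea of a coincidence threshold and is not taken from this page: the function name, the dictionary keys, the default threshold of `n % 2`, and the classification rule are all hypothetical, and the weighted counters are left out. It accepts either the full data or a single row of column sums plus `nsamples`, mirroring the two input modes described above.

```python
import numpy as np


def count_columns(X, nsamples=None, c_threshold=None) -> dict:
    """Classify columns as 1-similar, 0-similar, or dissimilar (hypothetical rule)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    if X.shape[0] == 1:
        # A single row is read as the column-wise sum of the objects.
        if nsamples is None:
            raise ValueError("nsamples is required when column sums are given")
        c_total, n = X[0], nsamples
    else:
        c_total, n = X.sum(axis=0), X.shape[0]
    if c_threshold is None:
        c_threshold = n % 2  # assumed default: the smallest usable threshold
    delta = 2 * c_total - n  # ones minus zeros in each column
    return {
        "one_similar": int(np.sum(delta > c_threshold)),
        "zero_similar": int(np.sum(-delta > c_threshold)),
        "dissimilar": int(np.sum(np.abs(delta) <= c_threshold)),
    }


X = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
])
from_data = count_columns(X)
from_sums = count_columns(X.sum(axis=0), nsamples=4)
```

Both calls give the same counts here: one column where all objects share a 1, one where all share a 0, and two where the objects are evenly split.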

_f_d(d, n) → float#

Calculate the dissimilarity weight factor for a given number of similar objects in a set.

Parameters#

d: int: Number of similar objects.
n: int: Total number of objects.

Returns#

w_s: float: Weight factor for the dissimilarity depending on the number of objects that are similar (d) in a set of (n) objects.

_f_s(d, n) → float#

Calculate the similarity weight factor for a given number of similar objects in a set.

Parameters#

d: int: Number of similar objects.
n: int: Total number of objects.

Returns#

w_s: float: Weight factor for the similarity depending on the number of objects that are similar (d) in a set of (n) objects.
