selector.methods.similarity
Module for Similarity-Based Selection Methods.
This module contains the classes and functions for the similarity-based selection methods. To select a diverse subset of molecules, these methods choose molecules such that the similarity among the molecules in the subset is minimized. The similarity of a set of molecules is calculated using an n-ary similarity index. These indexes compare n molecules at a time and return a value between 0 and 1, where 0 means that all the molecules in the set are completely different and 1 means that all the molecules are identical.
The ideas behind the similarity-based selection methods are described in the following papers:
- (esim) https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00505-3
- (esim) https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00504-4
- (isim) TODO: Add paper
- class selector.methods.similarity.NSimilarity(method: str = 'isim', inv_order: int = 1, similarity_index: str = 'RR', w_factor: str = 'fraction', c_threshold: Union[None, str, int] = None, preprocess_data: bool = True)#
Select samples of vectors using n-ary similarity indexes between vectors.
The algorithms in this class select a diverse subset of vectors such that the similarity between the vectors in the subset is minimized. The similarity of a set of vectors is calculated using an n-ary similarity index. These indexes compare n vectors (e.g. molecular fingerprints) at a time and return a value between 0 and 1, where 0 means that all the vectors in the set are completely different and 1 means that the vectors are identical.
The algorithm starts by selecting a reference data point. Then, the next data point is selected such that the similarity value of the group of selected data points is minimized. The process is repeated until the desired number of data points has been selected.
Notes#
The ideas behind the similarity-based selection methods are described in the papers listed in the module description above.
- _get_new_index(X: ndarray, selected_condensed: ndarray, num_selected: int, select_from: ndarray) -> int #
Select a new diverse sample from the data.
The function selects a new sample such that the similarity of the new set of selected samples is minimized.
Parameters#
- X: np.ndarray
Array of features (columns) for each sample (rows).
- selected_condensed: np.ndarray
Columnwise sum of all the samples selected so far.
- num_selected: int
Number of samples selected so far.
- select_from: np.ndarray
Array of integers representing the indices of the samples that have not been selected yet.
Returns#
- selected: int
Index of the new selected sample.
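The selection step described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it assumes binary features and uses the average pairwise Russell-Rao similarity computed from column sums as a stand-in for the configured similarity index. The helper names `rr_from_column_sums` and `get_new_index` are hypothetical, and plain lists replace NumPy arrays so the sketch runs without dependencies.

```python
def rr_from_column_sums(c_total, n_objects):
    """Average pairwise Russell-Rao similarity of n_objects binary rows,
    computed only from their column-wise sums (illustrative stand-in index)."""
    n_features = len(c_total)
    pairs = n_objects * (n_objects - 1) / 2
    shared = sum(c * (c - 1) / 2 for c in c_total)  # pairs sharing a 1 per column
    return shared / (n_features * pairs)


def get_new_index(X, selected_condensed, num_selected, select_from):
    """Return the candidate index whose addition gives the least similar set."""
    best_index, best_sim = None, float("inf")
    for i in select_from:
        # Column sums of the selected set if sample i were added to it.
        candidate_sums = [s + x for s, x in zip(selected_condensed, X[i])]
        sim = rr_from_column_sums(candidate_sums, num_selected + 1)
        if sim < best_sim:  # ties keep the first candidate found
            best_index, best_sim = i, sim
    return best_index


X = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 0]]
# Sample 0 is already selected, so the condensed vector is just row 0.
new_index = get_new_index(X, selected_condensed=X[0], num_selected=1,
                          select_from=[1, 2, 3])
print(new_index)  # 2: row 2 shares no "on" bits with row 0
```

Working from the column sums means each candidate is evaluated in time proportional to the number of features, regardless of how many samples have already been selected.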
- _scale_data(X: ndarray)#
Scales the data to the range [0, 1] so it can be used with the similarity indexes.
First, each feature (column) is min-max normalized to lie between 0 and 1:
\[x_{ij} = \frac{x_{ij} - \min(x_j)}{\max(x_j) - \min(x_j)}\]
Then, the average of each column is calculated. Finally, each element of the final working array is defined as
\[w_{ij} = 1 - | x_{ij} - a_j |\]
where \(x_{ij}\) is the element of the normalized array, and \(a_j\) is the average of the j-th column of the normalized array.
Parameters#
- X: np.ndarray
Array of features (columns) for each sample (rows).
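The two scaling formulas can be checked with a small dependency-free sketch. The function name `scale_data` is hypothetical, and the handling of constant columns (where the min-max formula would divide by zero) is an assumption of this sketch, not something the documentation specifies.

```python
def scale_data(X):
    """Min-max normalize each column, then map each value to 1 - |x - column mean|."""
    columns = list(zip(*X))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    # Assumption: a constant column (max == min) is mapped to 0.0 to avoid
    # dividing by zero; the documented formula does not cover this case.
    normalized = [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, mins, maxs)]
        for row in X
    ]
    # a_j: average of column j of the normalized array.
    averages = [sum(col) / len(col) for col in zip(*normalized)]
    # w_ij = 1 - |x_ij - a_j|
    return [[1 - abs(x - a) for x, a in zip(row, averages)] for row in normalized]


X = [[0, 10], [5, 20], [10, 40]]
W = scale_data(X)
for row in W:
    print(row)
```

Values close to the column average end up near 1, and values far from it end up closer to 0, so every entry of the working array stays within [0, 1].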
- calculate_medoid(X: ndarray, c_total=None) -> int #
Calculate the medoid of a set of real-valued vectors or binary objects.
Parameters#
- X: np.array
np.array of all the real-valued vectors or binary objects.
- c_total: np.array, optional
np.array with the column-wise sums of the data.
- calculate_outlier(X: ndarray = None, c_total=None) -> int #
Calculate the outlier of a set of real-valued vectors or binary objects.
The outlier is determined using the similarity index provided in the class initialization.
Parameters#
- X: np.array
np.array of all the real-valued vectors or binary objects.
- c_total: np.array, optional
np.array with the column-wise sums of the data.
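One common way to locate a medoid and an outlier with n-ary indexes is through complementary similarity: the similarity of the set after removing one object. The sketch below assumes that approach, binary data, and the average pairwise Russell-Rao similarity as the index; whether the class follows exactly these steps is an assumption, and all helper names are hypothetical.

```python
def rr_from_column_sums(c_total, n_objects):
    """Average pairwise Russell-Rao similarity from column sums (stand-in index)."""
    n_features = len(c_total)
    pairs = n_objects * (n_objects - 1) / 2
    return sum(c * (c - 1) / 2 for c in c_total) / (n_features * pairs)


def complementary_similarities(X, c_total=None):
    """Similarity of the set with each object removed in turn (needs >= 3 rows)."""
    if c_total is None:
        c_total = [sum(col) for col in zip(*X)]
    n = len(X)
    return [
        rr_from_column_sums([c - x for c, x in zip(c_total, row)], n - 1)
        for row in X
    ]


def calculate_medoid(X, c_total=None):
    # Removing the most central object leaves the least similar set.
    comp = complementary_similarities(X, c_total)
    return comp.index(min(comp))


def calculate_outlier(X, c_total=None):
    # Removing the most atypical object leaves the most similar set.
    comp = complementary_similarities(X, c_total)
    return comp.index(max(comp))


X = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 1]]
medoid = calculate_medoid(X)
outlier = calculate_outlier(X)
print(medoid, outlier)  # 2 3
```

Passing precomputed column sums as `c_total` avoids recomputing them when both the medoid and the outlier are needed for the same data.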
- select(x: ndarray, size: int, labels: ndarray = None, proportional_selection: bool = True) -> Union[List, Iterable] #
Return indices representing subset of sample points.
Parameters#
- x: ndarray of shape (n_samples, n_features) or (n_samples, n_samples)
Feature matrix of n_samples samples in n_features dimensional feature space. If fun_distance is None, this x is treated as a square pairwise distance matrix.
- size: int
Number of sample points to select (i.e. size of the subset).
- labels: np.ndarray, optional
Array of integers or strings representing the labels of the clusters that each sample belongs to. If None, the samples are treated as one cluster. If labels are provided, selection is made from each cluster.
- proportional_selection: bool, optional
If True, the number of samples to be selected from each cluster is proportional. Otherwise, the number of samples to be selected from each cluster is equal. Default is True.
Returns#
- selected: list
Indices of the selected sample points.
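To illustrate the difference between proportional and equal selection across clusters, the sketch below splits a budget of `size` samples among labelled clusters. It is only one plausible allocation scheme: the function `allocate_sizes`, the largest-remainder rounding, and the tie-breaking order are assumptions of this example and are not taken from the library.

```python
from collections import Counter


def allocate_sizes(labels, size, proportional=True):
    """Split `size` among clusters, proportionally to cluster size or equally."""
    counts = Counter(labels)
    keys = sorted(counts)
    if proportional:
        quotas = {k: size * counts[k] / len(labels) for k in keys}
    else:
        quotas = {k: size / len(keys) for k in keys}
    # Start from the integer part of each quota.
    alloc = {k: int(quotas[k]) for k in keys}
    leftover = size - sum(alloc.values())
    # Hand the remaining picks to the clusters with the largest fractional parts.
    by_remainder = sorted(keys, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
proportional = allocate_sizes(labels, size=5, proportional=True)
equal = allocate_sizes(labels, size=5, proportional=False)
print(proportional)
print(equal)
```

Note that this sketch does not cap an allocation at the number of samples available in a cluster, which a production implementation would need to handle.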
- select_from_cluster(X: ndarray, size: int, labels: Optional[ndarray] = None, start: Union[str, List[int]] = 'medoid') -> List[int] #
Algorithm of n-ary similarity selection for selecting points from a cluster.
Parameters#
- X: np.ndarray
Array of features (columns) for each sample (rows).
- size: int
Number of sample points to select (i.e. size of the subset).
- labels: np.ndarray, optional
Array of integers or strings representing the points ids of the data that belong to the current cluster. If None, all the samples in the data are treated as one cluster.
- start: str or list
str: key specifying how the selection is started, one of {‘medoid’, ‘random’, ‘outlier’}. list: indices of points that are included in the selection from the beginning.
Returns#
- selected: list
Indices of the selected sample points.
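The overall selection loop can be sketched as follows, assuming `start` is given as a list of indices (a string key such as 'medoid' would first have to be resolved to an index). As before, this is an illustration on binary data with the average pairwise Russell-Rao similarity standing in for the configured index; `select_greedy` is a hypothetical name, not the library's method.

```python
def rr_from_column_sums(c_total, n_objects):
    """Average pairwise Russell-Rao similarity from column sums (stand-in index)."""
    n_features = len(c_total)
    pairs = n_objects * (n_objects - 1) / 2
    return sum(c * (c - 1) / 2 for c in c_total) / (n_features * pairs)


def select_greedy(X, size, start):
    """Grow the selection from `start`, always adding the point that
    minimizes the similarity of the selected set."""
    selected = list(start)
    # Running column-wise sum of the selected rows.
    condensed = [sum(X[i][k] for i in selected) for k in range(len(X[0]))]
    remaining = [i for i in range(len(X)) if i not in selected]
    while len(selected) < size and remaining:
        best = min(
            remaining,
            key=lambda i: rr_from_column_sums(
                [s + x for s, x in zip(condensed, X[i])], len(selected) + 1
            ),
        )
        selected.append(best)
        remaining.remove(best)
        condensed = [s + x for s, x in zip(condensed, X[best])]
    return selected


X = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 0]]
subset = select_greedy(X, size=3, start=[0])
print(subset)  # [0, 2, 3]
```

Row 1 is left out because it overlaps heavily with row 0, while rows 2 and 3 add little shared information to the selected set.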
- class selector.methods.similarity.SimilarityIndex(method: str = 'isim', inv_order: int = 1, sim_index: str = 'RR', c_threshold: Union[None, str, int] = None, w_factor: str = 'fraction')#
Calculate the n-ary similarity index of a set of vectors.
This class provides methods for calculating the similarity index of a set of vectors represented as a matrix. Each vector is a row in the matrix, and each column represents a feature of the vector. The features in the vectors must be binary or real numbers between 0 and 1.
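The key property of an n-ary index is that it can be evaluated from column-wise sums instead of from every pair of rows. The sketch below demonstrates this for binary data with the Russell-Rao (RR) index averaged over all pairs, and checks the result against an explicit pairwise loop. The function names are hypothetical, and the class may compute its indexes differently depending on `method`.

```python
from itertools import combinations


def column_sums(X):
    """Column-wise sums of a list of equal-length binary rows."""
    return [sum(col) for col in zip(*X)]


def rr_from_column_sums(c_total, n_objects):
    """Average pairwise RR similarity using only the column sums."""
    n_features = len(c_total)
    pairs = n_objects * (n_objects - 1) / 2
    # A column with c ones contributes c * (c - 1) / 2 pairs that share a 1.
    shared = sum(c * (c - 1) / 2 for c in c_total)
    return shared / (n_features * pairs)


def rr_brute_force(X):
    """Same quantity, computed by looping over every pair of rows."""
    n_features = len(X[0])
    sims = [
        sum(a & b for a, b in zip(x, y)) / n_features
        for x, y in combinations(X, 2)
    ]
    return sum(sims) / len(sims)


X = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 0]]
fast = rr_from_column_sums(column_sums(X), len(X))
slow = rr_brute_force(X)
print(fast, slow)  # both equal 5/12
```

The column-sum version scales linearly with the number of rows, whereas the pairwise loop scales quadratically.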
- _calculate_counters(X: ndarray, nsamples: Optional[int] = None) -> dict #
Calculate 1-similarity, 0-similarity, and dissimilarity counters.
Arguments#
- X: np.ndarray
Array of arrays, each sub-array contains the binary or real valued vector. The values must be between 0 and 1. If the number of rows == 1, the data is treated as the columnwise sum of the objects. If the number of rows > 1, the data is treated as the objects.
- nsamples: int
Number of objects, only necessary if the columnwise sum of the objects is provided instead of the data (number of rows == 1). If the data is provided, the number of objects is calculated as the length of the data.
Returns#
- counters: dict
Dictionary with the weighted and non-weighted counters.
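A rough sketch of how the three non-weighted counters could be obtained from column sums is shown below. The classification rule (comparing the imbalance between ones and zeros in a column against a coincidence threshold), the default threshold of `n_objects % 2`, and the dictionary keys are all assumptions made for illustration; the weighted counters, which would additionally multiply each column by a weight factor, are omitted.

```python
def calculate_counters(c_total, n_objects, c_threshold=None):
    """Count columns that are 1-similar, 0-similar, or dissimilar."""
    if c_threshold is None:
        # Assumed default: 0 for an even number of objects, 1 for an odd number.
        c_threshold = n_objects % 2
    counters = {"a": 0, "d": 0, "total_dis": 0}  # hypothetical key names
    for c in c_total:
        delta = 2 * c - n_objects  # ones minus zeros in this column
        if delta > c_threshold:
            counters["a"] += 1          # mostly ones: 1-similarity
        elif -delta > c_threshold:
            counters["d"] += 1          # mostly zeros: 0-similarity
        else:
            counters["total_dis"] += 1  # no clear majority: dissimilarity
    return counters


# Column sums of 5 binary objects described by 6 features.
counters = calculate_counters([5, 4, 3, 1, 0, 2], n_objects=5)
print(counters)  # {'a': 2, 'd': 2, 'total_dis': 2}
```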
- _f_d(d, n) -> float #
Calculate the dissimilarity weight factor for a given number of similar objects in a set.
Parameters#
- d: int
Number of similar objects.
- n: int
Total number of objects.
Returns#
- w_s: float
Weight factor for the dissimilarity depending on the number of objects that are similar (d) in a set of (n) objects.
- _f_s(d, n) -> float #
Calculate the similarity weight factor for a given number of similar objects in a set.
Parameters#
- d: int
Number of similar objects.
- n: int
Total number of objects.
Returns#
- w_s: float
Weight factor for the similarity depending on the number of objects that are similar (d) in a set of (n) objects.
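For illustration, the sketch below shows one possible form of the two weight factors for the 'fraction' setting of `w_factor`. The exact expressions are an assumption based on the fraction-style weighting used in the n-ary similarity literature and are not confirmed by this documentation; treat the functions as hypothetical stand-ins.

```python
def f_s(d, n):
    """Assumed 'fraction' similarity weight: grows with the number of similar objects."""
    return d / n


def f_d(d, n):
    """Assumed 'fraction' dissimilarity weight: shrinks as more objects agree."""
    return 1 - (d - n % 2) / n


n = 5
similarity_weights = [f_s(d, n) for d in (1, 3, 5)]
dissimilarity_weights = [f_d(d, n) for d in (1, 3, 5)]
print(similarity_weights)
print(dissimilarity_weights)
```

Under this assumption, a column where all objects agree gets a similarity weight of 1, and the dissimilarity weight decreases as the agreement grows.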