selector.methods.distance
Module for Distance-Based Selection Methods.
- class selector.methods.distance.DISE(r0=None, ref_index=None, tol=0.05, n_iter=10, p=2.0, eps=0.0, fun_dist=None)
Select samples using Directed Sphere Exclusion (DISE) algorithm.
In a nutshell, this algorithm iteratively excludes any sample within a given radius of any already selected sample. The radius of the exclusion sphere is an adjustable parameter. Compared to the Sphere Exclusion algorithm, the Directed Sphere Exclusion algorithm achieves a more evenly distributed subset selection by abandoning the random selection approach and instead imposing a directed selection.
The reference sample is chosen based on ref_index and is excluded from the selected subset. All samples are sorted in ascending order of their Minkowski p-norm distance from the reference sample. Looping through the sorted samples, a sample is selected if it has not already been excluded. Once a sample is selected, all its neighboring samples within a sphere of radius r (i.e., the exclusion sphere) are excluded from selection. The selection process terminates when the number of selected points exceeds the specified subset size. The r0 parameter sets the initial radius of the exclusion sphere; the radius is then optimized to select the desired number of samples.
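The single pass described above can be sketched in plain Python for a fixed radius. This is an illustrative sketch, not the library's implementation: the names minkowski and directed_sphere_exclusion are hypothetical, the radius is not optimized, and the reference sample is assumed to be used only for ordering (whether its neighbors are also excluded is not stated, so the sketch does not exclude them).

```python
def minkowski(a, b, p=2.0):
    """Minkowski p-norm distance between two equal-length points."""
    return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1.0 / p)


def directed_sphere_exclusion(x, r, ref_index=0, max_size=None, p=2.0):
    """Return indices picked by one fixed-radius directed sphere exclusion pass."""
    # Visit samples in ascending order of distance from the reference sample.
    order = sorted(range(len(x)), key=lambda i: minkowski(x[i], x[ref_index], p))
    excluded = set()
    selected = []
    for i in order:
        # Assumption: the reference sample itself is never selected.
        if i == ref_index or i in excluded:
            continue
        selected.append(i)
        # Exclude every sample inside the sphere of radius r around sample i.
        for j in range(len(x)):
            if j != i and minkowski(x[i], x[j], p) <= r:
                excluded.add(j)
        # Stop as soon as more than max_size samples have been selected.
        if max_size is not None and len(selected) > max_size:
            break
    return selected
```

Returning more than max_size samples signals that the radius was too small for the requested subset size, which is the information a radius search would need.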
References
Gobbi, A., and Lee, M.-L. (2002). DISE: directed sphere exclusion. Journal of Chemical Information and Computer Sciences, 43(1), 317–323. https://doi.org/10.1021/ci025554v
- algorithm(x, max_size) -> Union[List, Iterable]
Return selected samples based on directed sphere exclusion algorithm.
Parameters
- x: ndarray of shape (n_samples, n_features)
Feature matrix of n_samples samples in n_features dimensional space.
- max_size: int
Maximum number of samples to select.
Returns
- selected: Union[List, Iterable]
List of indices of selected samples.
- select(x: ndarray, size: int, labels: ndarray = None, proportional_selection: bool = True) -> Union[List, Iterable]
Return indices representing subset of sample points.
Parameters
- x: ndarray of shape (n_samples, n_features) or (n_samples, n_samples)
Feature matrix of n_samples samples in n_features dimensional feature space. If fun_dist is None, x is treated as a square pairwise distance matrix.
- size: int
Number of sample points to select (i.e. size of the subset).
- labels: np.ndarray, optional
Array of integers or strings representing the labels of the clusters that each sample belongs to. If None, the samples are treated as one cluster. If labels are provided, selection is made from each cluster.
- proportional_selection: bool, optional
If True, the number of samples to be selected from each cluster is proportional. Otherwise, the number of samples to be selected from each cluster is equal. Default is True.
Returns
- selected: list
Indices of the selected sample points.
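How the per-cluster counts might follow from proportional_selection can be sketched as below. The helper name samples_per_cluster and the largest-remainder rounding are assumptions made for illustration; the library's own rounding and its handling of clusters smaller than their quota may differ.

```python
from collections import Counter


def samples_per_cluster(labels, size, proportional=True):
    """Split `size` across clusters, proportionally to cluster size or equally."""
    counts = Counter(labels)
    clusters = sorted(counts)
    if proportional:
        # Each cluster's ideal share scales with its number of samples.
        shares = {c: size * counts[c] / len(labels) for c in clusters}
    else:
        # Every cluster gets the same ideal share.
        shares = {c: size / len(clusters) for c in clusters}
    # Round down first, then give the leftover picks to the clusters
    # with the largest fractional remainders so the total equals `size`.
    quota = {c: int(shares[c]) for c in clusters}
    leftover = size - sum(quota.values())
    by_remainder = sorted(clusters, key=lambda c: shares[c] - quota[c], reverse=True)
    for c in by_remainder[:leftover]:
        quota[c] += 1
    return quota
```

The sketch does not cap a quota at the cluster's own size; a real implementation would need to decide what to do when a cluster has fewer samples than its quota.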
- select_from_cluster(x, size, labels=None) -> Union[List, Iterable]
Return selected samples from a cluster based on directed sphere exclusion algorithm.
Parameters
- x: ndarray of shape (n_samples, n_features)
Feature matrix of n_samples samples in n_features dimensional space.
- size: int
Number of samples to be selected.
- labels: np.ndarray, optional
Indices of samples that form a cluster.
Returns
- selected: Union[List, Iterable]
List of indices of selected samples.
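The class description says the exclusion radius starts at r0 and is then optimized to reach the requested number of samples. One way such a search could work is a bracketing and bisection loop, sketched below. This is an assumption, not the library's documented procedure: tune_radius and count_selected are hypothetical names, tol is read here as an allowed relative deviation from the requested size, and a larger radius is assumed to select fewer samples.

```python
def tune_radius(count_selected, size, r0, tol=0.05, n_iter=10):
    """Search for a radius whose selection count is within tol of `size`.

    count_selected(r) must return how many samples a radius r selects.
    """
    too_small = None  # largest radius seen that selected too many samples
    too_large = None  # smallest radius seen that selected too few samples
    r = r0
    for _ in range(n_iter):
        n = count_selected(r)
        if abs(n - size) <= tol * size:
            break  # close enough to the requested subset size
        if n > size:
            too_small = r
        else:
            too_large = r
        if too_small is None:
            r = r / 2  # only "too few" seen so far: keep shrinking
        elif too_large is None:
            r = r * 2  # only "too many" seen so far: keep growing
        else:
            r = (too_small + too_large) / 2  # bracketed: bisect
    return r
```

In practice count_selected would wrap one pass of the selection algorithm at the given radius and return the length of its result.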
- class selector.methods.distance.MaxMin(fun_dist=None, ref_index=None)
Select samples using MaxMin algorithm.
MaxMin is possibly the most widely used method for dissimilarity-based compound selection. When presented with a dataset of samples, the initial point is chosen as the dataset’s medoid center. Next, the second point is chosen to be that which is furthest from this initial point. Subsequently, all following points are selected via the following logic:
1. Find the minimum distance from every point to the already-selected ones.
2. Select the point which has the maximum distance among those calculated in the previous step.
In the current implementation, this method requires or computes the full pairwise-distance matrix, so it is not recommended for large datasets.
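The two steps above can be sketched on a precomputed distance matrix. This is a minimal illustration in plain Python rather than the library's code: pairwise_distances and maxmin are hypothetical names, Euclidean distance is assumed, and the medoid is taken to be the sample with the smallest total distance to all others.

```python
import math


def pairwise_distances(x):
    """Full Euclidean distance matrix as a list of lists."""
    return [[math.dist(a, b) for b in x] for a in x]


def maxmin(dist, size, start=None):
    """Greedy MaxMin selection on a square distance matrix."""
    n = len(dist)
    if start is None:
        # Medoid: the sample with the smallest summed distance to all samples.
        start = min(range(n), key=lambda i: sum(dist[i]))
    selected = [start]
    # min_dist[i] is the distance from sample i to its nearest selected sample.
    min_dist = list(dist[start])
    while len(selected) < min(size, n):
        candidates = [i for i in range(n) if i not in selected]
        # Step 2: pick the candidate farthest from its nearest selected sample.
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        # Step 1 for the next round: refresh the nearest-selected distances.
        min_dist = [min(m, d) for m, d in zip(min_dist, dist[nxt])]
    return selected
```

Because only one sample is selected at the start, the first iteration automatically picks the point farthest from the medoid, matching the description of the second point.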
References
[1] Ashton, Mark, et al., Identification of diverse database subsets using property‐based and fragment‐based molecular descriptions, Quantitative Structure‐Activity Relationships 21.6 (2002): 598-604.
- select(x: ndarray, size: int, labels: ndarray = None, proportional_selection: bool = True) -> Union[List, Iterable]
Return indices representing subset of sample points.
Parameters
- x: ndarray of shape (n_samples, n_features) or (n_samples, n_samples)
Feature matrix of n_samples samples in n_features dimensional feature space. If fun_dist is None, x is treated as a square pairwise distance matrix.
- size: int
Number of sample points to select (i.e. size of the subset).
- labels: np.ndarray, optional
Array of integers or strings representing the labels of the clusters that each sample belongs to. If None, the samples are treated as one cluster. If labels are provided, selection is made from each cluster.
- proportional_selection: bool, optional
If True, the number of samples to be selected from each cluster is proportional. Otherwise, the number of samples to be selected from each cluster is equal. Default is True.
Returns
- selected: list
Indices of the selected sample points.
- select_from_cluster(x, size, labels=None) -> Union[List, Iterable]
Return selected samples from a cluster based on MaxMin algorithm.
Parameters
- x: ndarray of shape (n_samples, n_features) or (n_samples, n_samples)
Feature matrix of n_samples samples in n_features dimensional feature space, or the pairwise distance matrix between n_samples samples. If fun_dist is None, x is assumed to be a square pairwise distance matrix.
- size: int
Number of sample points to select (i.e. size of the subset).
- labels: np.ndarray
Indices of samples that form a cluster.
Returns
- selected: Union[List, Iterable]
List of indices of selected samples.
- class selector.methods.distance.MaxSum(fun_dist=None, ref_index=None)
Select samples using MaxSum algorithm.
Whereas the goal of the MaxMin algorithm is to maximize the minimum distance between any pair of distinct elements in the selected subset of a dataset, the MaxSum algorithm aims to maximize the sum of distances between all pairs of elements in the selected subset. When presented with a dataset of samples, the initial point is chosen as the dataset’s medoid center. Next, the second point is chosen to be that which is furthest from this initial point. Subsequently, all following points are selected via the following logic:
1. Determine the sum of distances from every point to the already-selected ones.
2. Select the point which has the maximum sum of distances among those calculated in the previous step.
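The two steps above can be sketched on a precomputed distance matrix. As with any illustration here, this is not the library's code: pairwise_distances and maxsum are hypothetical names, Euclidean distance is assumed, and the medoid is taken to be the sample with the smallest total distance to all others.

```python
import math


def pairwise_distances(x):
    """Full Euclidean distance matrix as a list of lists."""
    return [[math.dist(a, b) for b in x] for a in x]


def maxsum(dist, size, start=None):
    """Greedy MaxSum selection on a square distance matrix."""
    n = len(dist)
    if start is None:
        # Medoid: the sample with the smallest summed distance to all samples.
        start = min(range(n), key=lambda i: sum(dist[i]))
    selected = [start]
    # sum_dist[i] is the summed distance from sample i to all selected samples.
    sum_dist = list(dist[start])
    while len(selected) < min(size, n):
        candidates = [i for i in range(n) if i not in selected]
        # Step 2: pick the candidate with the largest summed distance.
        nxt = max(candidates, key=lambda i: sum_dist[i])
        selected.append(nxt)
        # Step 1 for the next round: add distances to the newly selected sample.
        sum_dist = [s + d for s, d in zip(sum_dist, dist[nxt])]
    return selected
```

The only difference from a MaxMin loop is the running statistic: a sum over selected samples instead of a minimum.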
References
[1] Borodin, Allan, Hyun Chul Lee, and Yuli Ye, Max-sum diversification, monotone submodular functions and dynamic updates, Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems. 2012.
- select(x: ndarray, size: int, labels: ndarray = None, proportional_selection: bool = True) -> Union[List, Iterable]
Return indices representing subset of sample points.
Parameters
- x: ndarray of shape (n_samples, n_features) or (n_samples, n_samples)
Feature matrix of n_samples samples in n_features dimensional feature space. If fun_dist is None, x is treated as a square pairwise distance matrix.
- size: int
Number of sample points to select (i.e. size of the subset).
- labels: np.ndarray, optional
Array of integers or strings representing the labels of the clusters that each sample belongs to. If None, the samples are treated as one cluster. If labels are provided, selection is made from each cluster.
- proportional_selection: bool, optional
If True, the number of samples to be selected from each cluster is proportional. Otherwise, the number of samples to be selected from each cluster is equal. Default is True.
Returns
- selected: list
Indices of the selected sample points.
- select_from_cluster(x, size, labels=None) -> Union[List, Iterable]
Return selected samples from a cluster based on MaxSum algorithm.
Parameters
- x: ndarray of shape (n_samples, n_features) or (n_samples, n_samples)
Feature matrix of n_samples samples in n_features dimensional feature space, or the pairwise distance matrix between n_samples samples. If fun_dist is None, x is assumed to be a square pairwise distance matrix.
- size: int
Number of sample points to select (i.e. size of the subset).
- labels: np.ndarray
Indices of samples that form a cluster.
Returns
- selected: Union[List, Iterable]
List of indices of selected samples.
- class selector.methods.distance.OptiSim(r0=None, k=10, tol=0.01, n_iter=10, eps=0, p=2, random_seed=42, ref_index=None, fun_dist=None)
Select samples using OptiSim algorithm.
The OptiSim algorithm selects samples from a dataset by first choosing the medoid center as the initial point. Next, points are randomly chosen and added to a subsample if they lie outside of radius r from all previously selected points (otherwise, they are discarded). Once k points have been added to the subsample, the point with the greatest minimum distance to the previously selected points is chosen. Then, the subsample is cleared and the process is repeated.
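The loop described above can be sketched in plain Python for a fixed radius. This is an illustrative sketch under stated assumptions, not the library's implementation: the name optisim is hypothetical, Euclidean distance is used, the radius is not optimized, and unchosen subsample members are returned to the candidate pool when the subsample is cleared (the description does not say what happens to them).

```python
import math
import random


def optisim(x, size, r, k=10, seed=42):
    """Fixed-radius OptiSim-style selection; returns a list of sample indices."""
    rng = random.Random(seed)
    n = len(x)
    # Start from the medoid: smallest summed distance to all samples.
    start = min(range(n), key=lambda i: sum(math.dist(x[i], y) for y in x))
    selected = [start]
    pool = [i for i in range(n) if i != start]

    def min_dist_to_selected(i):
        return min(math.dist(x[i], x[s]) for s in selected)

    while pool and len(selected) < size:
        rng.shuffle(pool)
        subsample = []
        # Draw random candidates until the subsample holds k points.
        while pool and len(subsample) < k:
            i = pool.pop()
            if min_dist_to_selected(i) > r:
                subsample.append(i)
            # Candidates within radius r of a selected point are discarded.
        if not subsample:
            break  # every remaining candidate was too close
        # Keep the subsample member farthest from the selected set.
        best = max(subsample, key=min_dist_to_selected)
        selected.append(best)
        # Assumption: the other subsample members go back into the pool.
        pool.extend(i for i in subsample if i != best)
    return selected
```

With k=1 the loop reduces to random picking subject to the radius constraint, while a very large k approaches MaxMin behavior on the candidates outside the radius.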
Notes
When ref_index is a list and the data contain multiple classes, the same list is shared among all clusters. To use different reference indices for each class, run the subset selection for each class separately, passing a different ref_index each time. For example, with two classes, pass ref_index=[0, 1] when selecting samples from class 0 and ref_index=[3, 6] when selecting samples from class 1.
References
[1] J. Chem. Inf. Comput. Sci. 1997, 37, 6, 1181–1188. https://doi.org/10.1021/ci970282v
- algorithm(x, max_size) -> Union[List, Iterable]
Return selected sample indices based on OptiSim algorithm.
Parameters
- x: ndarray of shape (n_samples, n_features)
Feature matrix of n_samples samples in n_features dimensional feature space.
- max_size: int
Maximum number of samples to select.
Returns
- selected: Union[List, Iterable]
List of indices of selected samples.
- select(x: ndarray, size: int, labels: ndarray = None, proportional_selection: bool = True) -> Union[List, Iterable]
Return indices representing subset of sample points.
Parameters
- x: ndarray of shape (n_samples, n_features) or (n_samples, n_samples)
Feature matrix of n_samples samples in n_features dimensional feature space. If fun_dist is None, x is treated as a square pairwise distance matrix.
- size: int
Number of sample points to select (i.e. size of the subset).
- labels: np.ndarray, optional
Array of integers or strings representing the labels of the clusters that each sample belongs to. If None, the samples are treated as one cluster. If labels are provided, selection is made from each cluster.
- proportional_selection: bool, optional
If True, the number of samples to be selected from each cluster is proportional. Otherwise, the number of samples to be selected from each cluster is equal. Default is True.
Returns
- selected: list
Indices of the selected sample points.
- select_from_cluster(x, size, labels=None) -> Union[List, Iterable]
Return selected samples from a cluster based on OptiSim algorithm.
Parameters
- x: ndarray of shape (n_samples, n_features)
Feature matrix of n_samples samples in n_features dimensional feature space.
- size: int
Number of samples to be selected.
- labels: np.ndarray
Indices of samples that form a cluster.
Returns
- selected: Union[List, Iterable]
List of indices of selected samples.