selector.methods.partition
#
Module for Partition-Based Selection Methods.
- class selector.methods.partition.GridPartition(nbins_axis: int, bin_method: str = 'equisized_independent', random_seed: int = 42)#
Select subset of sample points using the grid partitioning algorithms.
Given the number of bins along each axis, samples are partitioned using various methods 1:
The equisized_independent partitions the feature space into bins of equal size along each dimension.
The equisized_dependent partitions the space where the bins can have different length in each dimension. I.e., the `l-`th dimension bins depend on the previous dimensions. So, the order of features affects the outcome.
The equifrequent_independent divides the space into bins with approximately equal number of sample points in each bin.
The equifrequent_dependent is similar to equisized_dependent where the partition in each dimension will depend on the previous dimensions.
References#
- 1
Bayley, Martin J., and Peter Willett. “Binning schemes for partition-based compound selection.” Journal of Molecular Graphics and Modelling 17.1 (1999): 10-18.
- _abc_impl = <_abc._abc_data object>#
- get_bins_from_method(X)#
Assign sample points to bins based on the partitioning method.
Parameters#
- X: ndarray of shape (n_samples, n_features)
Feature matrix of n_samples samples in n_features dimensional space.
Returns#
- bins: dict[Tuple(int), List[int]]
Dictionary of bins where keys are the unique bin indices (that contain at least one sample point) and the values are the list of sample indices in that bin.
- static partition_points_to_bins_equifrequent(X, nbins_axis)#
Find all bins ids that contains points using the equifrequent method.
The equifrequent method partitions each bin to have equal number of points. This is done by doing a linear interpolation from integer indices and points, where it is then evaluated on a uniform grid with number of bins as the spacing in each axis.
Parameters#
- X: ndarray of shape (n_samples, n_features)
Feature matrix of n_samples samples in n_features dimensional space.
- nbins_axis: int
Number of bins along each axis or feature dimension.
Returns#
- unique_bin_indices: ndarray(int,)
Unique (without duplication) bin indices that have at least one sample point. These are integer tuples \((i_1, \cdot, i_\text{n_features})\) with elements corresponding to the bin index along each axis/dimension of feature space. inverse_ids contains indices of unique_bins_ids for each of the \(N\) points that it is assigned to.
- inverse_indices: ndarray(int,)
Indices of the unique bins (along specified axis) that can be used to reconstruct bin index of each sample (unique_bin_indices[inverse_indices] gives bin index array).
- static partition_points_to_bins_equisized(X, nbins_axis)#
Find all bins ids that has points in them and assign each point to each of those bins.
For each n_features dimensions, get the minimum and maximum of feature to and use nbins_axis to compute length of the bin. Then assign sample points to bins along each axis/dimension of feature space.
Parameters#
- X: ndarray of shape (n_samples, n_features)
Feature matrix of n_samples samples in n_features dimensional space.
- nbins_axis: int
Number of bins along each axis/dimension of feature space.
Returns#
- unique_bin_indices: ndarray(int,)
Unique (without duplication) bin indices that have at least one sample point. These are integer tuples \((i_1, \cdot, i_\text{n_features})\) with elements corresponding to the bin index along each axis/dimension of feature space. inverse_ids contains indices of unique_bins_ids for each of the \(N\) points that it is assigned to.
- inverse_indices: ndarray(int,)
Indices of the unique bins (along specified axis) that can be used to reconstruct bin index of each sample (unique_bin_indices[inverse_indices] gives bin index array).
- select(x: ndarray, size: int, labels: ndarray = None, proportional_selection: bool = True) Union[List, Iterable] #
Return indices representing subset of sample points.
Parameters#
- x: ndarray of shape (n_samples, n_features) or (n_samples, n_samples)
Feature matrix of n_samples samples in n_features dimensional feature space. If fun_distance is None, this x is treated as a square pairwise distance matrix.
- size: int
Number of sample points to select (i.e. size of the subset).
- labels: np.ndarray, optional
Array of integers or strings representing the labels of the clusters that each sample belongs to. If None, the samples are treated as one cluster. If labels are provided, selection is made from each cluster.
- proportional_selection: bool, optional
If True, the number of samples to be selected from each cluster is proportional. Otherwise, the number of samples to be selected from each cluster is equal. Default is True.
Returns#
- selected: list
Indices of the selected sample points.
- select_from_bins(X, bins, num_selected, diversity_type='hypersphere_overlap', cs=None)#
From the bins, select a certain number of points of the bins.
- Points are selected in an iterative manner. If the number of points needed to be selected
is greater than number of bins left then randomly select points from each of the bins. If any of the bins are empty then remove the bins. If it is less than the number of bins left, then calculate the diversity of each bin and choose points of bins with the highest diversity.
Parameters#
- X: ndarray of shape (n_samples, n_features)
Feature matrix of n_samples samples in n_features dimensional space.
- bins: dict(tuple(int), list[int])
The bins that map to the id the bin (as a tuple of integers) and returns the indices of the points that are contained in that bin.
- num_selected: int
Number of points to select from the bins.
- diversity_type: str, optional
Type of diversity to use. Default=”hypersphere_overlap”.
- csint, optional
Number of common substructures in molecular compound dataset. Used only if calculating explicit_diversity_index. Default is “None”.
Returns#
- List[int]:
Indices of the points that were selected.
- select_from_cluster(X: ndarray, num_selected: int, cluster_ids: ndarray = None)#
Grid partitioning algorithm for selecting points from cluster.
Parameters#
- X: ndarray of shape (n_samples, n_features)
Feature matrix of n_samples samples in n_features dimensional space.
- num_selected: int
Number of molecules that need to be selected.
- cluster_ids: ndarray
Indices of molecules that form a cluster
Returns#
- selected: list[int]
List of ids of selected molecules with size num_selected.
- class selector.methods.partition.Medoid(func_distance=<function Medoid.<lambda>>, ref_index=0, scaling=10)#
Selecting points using an algorithm adapted from KDTree.
Points are initially used to construct a KDTree. Euclidean distances are used for this algorithm. The first point selected is based on the ref_index provided and becomes the first query point. An approximation of the furthest point to the query point is found using find_furthest_neighbor and is selected. find_nearest_neighbor is then done to eliminate close neighbors to the new selected point. Medoid is then calculated from previously selected points and is used as the new query point for find_furthest_neighbor, repeating the process. Terminates upon selecting requested number of points or if all available points exhausted.
Adapted from: https://en.wikipedia.org/wiki/K-d_tree#Construction
- _abc_impl = <_abc._abc_data object>#
- _eliminate(tree, point, threshold, num_eliminate, bv)#
Eliminates points from being selected in future rounds.
Parameters#
- tree: scipy.spatial.KDTree
KDTree organizing coordinates.
- point: list
Point where close neighbors should be eliminated.
- threshold: float
An average of all the furthest distances found using find_furthest_neighbor
- num_eliminate: int
Maximum number of points permitted to be eliminated.
- bv: bitarray
Bitvector marking picked/eliminated points.
Returns#
- num_eliminate: int
Maximum number of points permitted to be eliminated.
- _find_furthest_neighbor(kdtree, point, selected_bitvector)#
Find approximately the furthest neighbor in a k-d tree for a given point.
Parameters#
- kdtree: collections.namedtuple
KDTree organizing coordinates.
- point: list
Query point for search.
- selected_bitvector: bitarray
Bitvector to keep track of previously selected points from array.
Returns#
- best: collections.namedtuple
The furthest point found in search.
- _kdtree(arr)#
Construct a k-d tree from an iterable of points.
Parameters#
- arr: list or np.ndarray
Coordinate array of points.
Returns#
- kdtree: collections.namedtuple
KDTree organizing coordinates.
- select(x: ndarray, size: int, labels: ndarray = None, proportional_selection: bool = True) Union[List, Iterable] #
Return indices representing subset of sample points.
Parameters#
- x: ndarray of shape (n_samples, n_features) or (n_samples, n_samples)
Feature matrix of n_samples samples in n_features dimensional feature space. If fun_distance is None, this x is treated as a square pairwise distance matrix.
- size: int
Number of sample points to select (i.e. size of the subset).
- labels: np.ndarray, optional
Array of integers or strings representing the labels of the clusters that each sample belongs to. If None, the samples are treated as one cluster. If labels are provided, selection is made from each cluster.
- proportional_selection: bool, optional
If True, the number of samples to be selected from each cluster is proportional. Otherwise, the number of samples to be selected from each cluster is equal. Default is True.
Returns#
- selected: list
Indices of the selected sample points.
- select_from_cluster(arr, num_selected, cluster_ids=None)#
Main function for selecting points using the KDTree algorithm.
Parameters#
- arr: np.ndarray
Coordinate array of points
- num_selected: int
Number of molecules that need to be selected.
- cluster_ids: np.ndarray
Indices of molecules that form a cluster
Returns#
- selected: list
List of ids of selected molecules