selector.methods.partition

`selector.methods.partition`#

Module for Partition-Based Selection Methods.

class selector.methods.partition.GridPartition(nbins_axis: int, bin_method: str = 'equisized_independent', random_seed: int = 42)#

Select subset of sample points using the grid partitioning algorithms.

Given the number of bins along each axis, samples are partitioned using various methods 1:

The equisized_independent partitions the feature space into bins of equal size along each dimension.
The equisized_dependent partitions the space where the bins can have different length in each dimension. I.e., the `l-`th dimension bins depend on the previous dimensions. So, the order of features affects the outcome.
The equifrequent_independent divides the space into bins with approximately equal number of sample points in each bin.
The equifrequent_dependent is similar to equisized_dependent where the partition in each dimension will depend on the previous dimensions.

References#

1: Bayley, Martin J., and Peter Willett. “Binning schemes for partition-based compound selection.” Journal of Molecular Graphics and Modelling 17.1 (1999): 10-18.

_abc_impl = <_abc._abc_data object>#

get_bins_from_method(X)#

Assign sample points to bins based on the partitioning method.

Parameters#

X: ndarray of shape (n_samples, n_features): Feature matrix of n_samples samples in n_features dimensional space.

Returns#

bins: dict[Tuple(int), List[int]]: Dictionary of bins where keys are the unique bin indices (that contain at least one sample point) and the values are the list of sample indices in that bin.

static partition_points_to_bins_equifrequent(X, nbins_axis)#

Find all bins ids that contains points using the equifrequent method.

The equifrequent method partitions each bin to have equal number of points. This is done by doing a linear interpolation from integer indices and points, where it is then evaluated on a uniform grid with number of bins as the spacing in each axis.

Parameters#

X: ndarray of shape (n_samples, n_features): Feature matrix of n_samples samples in n_features dimensional space.
nbins_axis: int: Number of bins along each axis or feature dimension.

Returns#

unique_bin_indices: ndarray(int,): Unique (without duplication) bin indices that have at least one sample point. These are integer tuples \((i_1, \cdot, i_\text{n_features})\) with elements corresponding to the bin index along each axis/dimension of feature space. inverse_ids contains indices of unique_bins_ids for each of the \(N\) points that it is assigned to.
inverse_indices: ndarray(int,): Indices of the unique bins (along specified axis) that can be used to reconstruct bin index of each sample (unique_bin_indices[inverse_indices] gives bin index array).

static partition_points_to_bins_equisized(X, nbins_axis)#

Find all bins ids that has points in them and assign each point to each of those bins.

For each n_features dimensions, get the minimum and maximum of feature to and use nbins_axis to compute length of the bin. Then assign sample points to bins along each axis/dimension of feature space.

Parameters#

X: ndarray of shape (n_samples, n_features): Feature matrix of n_samples samples in n_features dimensional space.
nbins_axis: int: Number of bins along each axis/dimension of feature space.

Returns#

unique_bin_indices: ndarray(int,): Unique (without duplication) bin indices that have at least one sample point. These are integer tuples \((i_1, \cdot, i_\text{n_features})\) with elements corresponding to the bin index along each axis/dimension of feature space. inverse_ids contains indices of unique_bins_ids for each of the \(N\) points that it is assigned to.
inverse_indices: ndarray(int,): Indices of the unique bins (along specified axis) that can be used to reconstruct bin index of each sample (unique_bin_indices[inverse_indices] gives bin index array).

select(x: ndarray, size: int, labels: ndarray = None, proportional_selection: bool = True) → Union[List, Iterable]#

Return indices representing subset of sample points.

Parameters#

x: ndarray of shape (n_samples, n_features) or (n_samples, n_samples): Feature matrix of n_samples samples in n_features dimensional feature space. If fun_distance is None, this x is treated as a square pairwise distance matrix.
size: int: Number of sample points to select (i.e. size of the subset).
labels: np.ndarray, optional: Array of integers or strings representing the labels of the clusters that each sample belongs to. If None, the samples are treated as one cluster. If labels are provided, selection is made from each cluster.
proportional_selection: bool, optional: If True, the number of samples to be selected from each cluster is proportional. Otherwise, the number of samples to be selected from each cluster is equal. Default is True.

Returns#

selected: list: Indices of the selected sample points.

select_from_bins(X, bins, num_selected, diversity_type='hypersphere_overlap', cs=None)#

From the bins, select a certain number of points of the bins.

Points are selected in an iterative manner. If the number of points needed to be selected: is greater than number of bins left then randomly select points from each of the bins. If any of the bins are empty then remove the bins. If it is less than the number of bins left, then calculate the diversity of each bin and choose points of bins with the highest diversity.

Parameters#

X: ndarray of shape (n_samples, n_features): Feature matrix of n_samples samples in n_features dimensional space.
bins: dict(tuple(int), list[int]): The bins that map to the id the bin (as a tuple of integers) and returns the indices of the points that are contained in that bin.
num_selected: int: Number of points to select from the bins.
diversity_type: str, optional: Type of diversity to use. Default=”hypersphere_overlap”.
csint, optional: Number of common substructures in molecular compound dataset. Used only if calculating explicit_diversity_index. Default is “None”.

Returns#

List[int]:: Indices of the points that were selected.

select_from_cluster(X: ndarray, num_selected: int, cluster_ids: ndarray = None)#

Grid partitioning algorithm for selecting points from cluster.

Parameters#

X: ndarray of shape (n_samples, n_features): Feature matrix of n_samples samples in n_features dimensional space.
num_selected: int: Number of molecules that need to be selected.
cluster_ids: ndarray: Indices of molecules that form a cluster

Returns#

selected: list[int]: List of ids of selected molecules with size num_selected.

class selector.methods.partition.Medoid(func_distance=<function Medoid.<lambda>>, ref_index=0, scaling=10)#

Selecting points using an algorithm adapted from KDTree.

Points are initially used to construct a KDTree. Euclidean distances are used for this algorithm. The first point selected is based on the ref_index provided and becomes the first query point. An approximation of the furthest point to the query point is found using find_furthest_neighbor and is selected. find_nearest_neighbor is then done to eliminate close neighbors to the new selected point. Medoid is then calculated from previously selected points and is used as the new query point for find_furthest_neighbor, repeating the process. Terminates upon selecting requested number of points or if all available points exhausted.

Adapted from: https://en.wikipedia.org/wiki/K-d_tree#Construction

_abc_impl = <_abc._abc_data object>#

_eliminate(tree, point, threshold, num_eliminate, bv)#

Eliminates points from being selected in future rounds.

Parameters#

tree: scipy.spatial.KDTree: KDTree organizing coordinates.
point: list: Point where close neighbors should be eliminated.
threshold: float: An average of all the furthest distances found using find_furthest_neighbor
num_eliminate: int: Maximum number of points permitted to be eliminated.
bv: bitarray: Bitvector marking picked/eliminated points.

Returns#

num_eliminate: int: Maximum number of points permitted to be eliminated.

_find_furthest_neighbor(kdtree, point, selected_bitvector)#

Find approximately the furthest neighbor in a k-d tree for a given point.

Parameters#

kdtree: collections.namedtuple: KDTree organizing coordinates.
point: list: Query point for search.
selected_bitvector: bitarray: Bitvector to keep track of previously selected points from array.

Returns#

best: collections.namedtuple: The furthest point found in search.

_kdtree(arr)#

Construct a k-d tree from an iterable of points.

Parameters#

arr: list or np.ndarray: Coordinate array of points.

Returns#

kdtree: collections.namedtuple: KDTree organizing coordinates.

select(x: ndarray, size: int, labels: ndarray = None, proportional_selection: bool = True) → Union[List, Iterable]#

Return indices representing subset of sample points.

Parameters#

x: ndarray of shape (n_samples, n_features) or (n_samples, n_samples): Feature matrix of n_samples samples in n_features dimensional feature space. If fun_distance is None, this x is treated as a square pairwise distance matrix.
size: int: Number of sample points to select (i.e. size of the subset).
labels: np.ndarray, optional: Array of integers or strings representing the labels of the clusters that each sample belongs to. If None, the samples are treated as one cluster. If labels are provided, selection is made from each cluster.
proportional_selection: bool, optional: If True, the number of samples to be selected from each cluster is proportional. Otherwise, the number of samples to be selected from each cluster is equal. Default is True.

Returns#

selected: list: Indices of the selected sample points.

select_from_cluster(arr, num_selected, cluster_ids=None)#

Main function for selecting points using the KDTree algorithm.

Parameters#

arr: np.ndarray: Coordinate array of points
num_selected: int: Number of molecules that need to be selected.
cluster_ids: np.ndarray: Indices of molecules that form a cluster

Returns#

selected: list: List of ids of selected molecules

selector.methods.partition

Contents

selector.methods.partition#

References#

Parameters#

Returns#

Parameters#

Returns#

Parameters#

Returns#

Parameters#

Returns#

Parameters#

Returns#

Parameters#

Returns#

Parameters#

Returns#

Parameters#

Returns#

Parameters#

Returns#

Parameters#

Returns#

Parameters#

Returns#

`selector.methods.partition`#