selector.measures.diversity#

Molecule dataset diversity calculation module.

selector.measures.diversity.compute_diversity(feature_subset: array, div_type: str = 'shannon_entropy', normalize: bool = False, truncation: bool = False, features: array = None, cs: int = None) float#

Compute diversity metrics.

Parameters#

feature_subset : np.ndarray

Feature matrix.

div_type : str, optional

Method of calculating the diversity of a given molecule set, which includes “entropy”, “logdet”, “shannon entropy”, “wdud”, “gini coefficient”, “hypersphere_overlap”, and “explicit diversity index”. The default is “shannon_entropy”.

normalize : bool, optional

Normalize the entropy to [0, 1]. Default is “False”.

truncation : bool, optional

Use the truncated Shannon entropy. Default is “False”.

features : np.ndarray, optional

Feature matrix of entire molecule library, used only if calculating hypersphere_overlap_of_subset. Default is “None”.

cs : int, optional

Number of common substructures in molecular compound dataset. Used only if calculating explicit_diversity_index. Default is “None”.

Returns#

float :

Computed diversity.
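
A minimal usage sketch with toy data; the bit matrix below is hypothetical, and only the import path and keyword names documented above are assumed.

```python
import numpy as np
from selector.measures.diversity import compute_diversity

# Toy bit-string feature matrix: 4 molecules x 6 bits (hypothetical data).
subset = np.array(
    [
        [1, 0, 1, 0, 1, 1],
        [0, 1, 1, 0, 0, 1],
        [1, 1, 0, 1, 0, 0],
        [0, 0, 1, 1, 1, 0],
    ]
)

# Normalized Shannon entropy of the subset (higher means more diverse).
h = compute_diversity(subset, div_type="shannon_entropy", normalize=True)
print(h)
```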

selector.measures.diversity.explicit_diversity_index(x: ndarray, cs: int) float#

Compute the explicit diversity index.

Parameters#

x: ndarray of shape (n_samples, n_features)

Feature matrix of n_samples samples in n_features dimensional feature space.

cs : int

Number of common substructures in the compound set.

Returns#

edi_scaled : float

Explicit diversity index.

Notes#

This method hasn’t been tested.

This method is used only for datasets of molecular compounds.

Papp, Á., Gulyás-Forró, A., Gulyás, Z., Dormán, G., Ürge, L., and Darvas, F. (2006) Explicit Diversity Index (EDI): A Novel Measure for Assessing the Diversity of Compound Databases. Journal of Chemical Information and Modeling 46, 1898-1904.
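
A hedged usage sketch with a toy bit matrix and a toy substructure count; as noted above, the method itself has not been tested, so treat the call only as an illustration of the documented signature.

```python
import numpy as np
from selector.measures.diversity import explicit_diversity_index

# Toy binary fingerprints: 4 molecules x 5 bits (hypothetical data).
x = np.array(
    [
        [1, 0, 1, 1, 0],
        [0, 1, 1, 0, 1],
        [1, 1, 0, 0, 1],
        [0, 0, 1, 1, 1],
    ]
)

# cs: number of common substructures in the compound set (toy value).
edi = explicit_diversity_index(x, cs=3)
print(edi)
```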

selector.measures.diversity.gini_coefficient(x: ndarray)#

Gini coefficient of bit-wise fingerprints of a database of molecules.

Measures the chemical diversity of a database of molecules defined by the following formula:

\[G = \frac{2 \sum_{i=1}^L i ||y_i||_1 }{N \sum_{i=1}^L ||y_i||_1} - \frac{L+1}{L},\]

where \(y_i \in \{0, 1\}^N\) is the vector of zeros and ones of length \(N\) (the number of molecules) for the \(i\)-th feature, and \(L\) is the feature length.

Lower values mean more diversity.

Parameters#

x : ndarray of shape (N, L)

Feature matrix of N molecules, each represented by L bits.

Returns#

float :

Gini coefficient in the range [0,1].

References#

Weidlich, I. E., & Filippov, I. V. (2016). Using the Gini coefficient to measure the chemical diversity of small-molecule libraries. Journal of Computational Chemistry, 37(22), 2091-2097.
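
A short usage sketch on a hypothetical bit matrix, assuming only the documented call signature.

```python
import numpy as np
from selector.measures.diversity import gini_coefficient

# Toy bit-wise fingerprints: N = 4 molecules, L = 8 bits (hypothetical data).
x = np.array(
    [
        [1, 0, 1, 1, 0, 0, 1, 0],
        [0, 1, 1, 0, 1, 0, 0, 1],
        [1, 1, 0, 0, 0, 1, 1, 0],
        [0, 0, 1, 1, 1, 1, 0, 1],
    ]
)

# Gini coefficient in [0, 1]; lower values indicate more diversity.
print(gini_coefficient(x))
```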

selector.measures.diversity.hypersphere_overlap_of_subset(x: ndarray, x_subset: array) float#

Compute the overlap of a subset with hyperspheres around each point.

The edge penalty is also included, which disregards areas outside of the boundary of the full feature space/library. This is calculated as:

\[g(S) = \sum_{i < j}^k O(i, j) + \sum^k_m E(m)\]

where \(i, j\) is over the subset of molecules, \(O(i, j)\) is the approximate overlap between hyperspheres, \(k\) is the number of features and \(E\) is the edge penalty of a molecule.

Lower values mean more diversity.

Parameters#

x : ndarray

Feature matrix of all molecules.

x_subset : ndarray

Feature matrix of selected subset of molecules.

Returns#

float :

The approximate overlapping volume of hyperspheres drawn around the selected points/molecules.

Notes#

The hypersphere overlap volume is calculated using an approximation formula from Agrafiotis (1997).

Agrafiotis, D. K. (1997) Stochastic Algorithms for Maximizing Molecular Diversity. Journal of Chemical Information and Computer Sciences 37, 841-851.
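
A usage sketch with hypothetical data, where x plays the role of the full library and x_subset is a selection of its rows, as the parameter descriptions above require.

```python
import numpy as np
from selector.measures.diversity import hypersphere_overlap_of_subset

rng = np.random.default_rng(42)

# Toy library of 20 molecules in a 3-dimensional feature space (hypothetical data).
x = rng.random((20, 3))

# A selected subset of 5 molecules taken from the library.
x_subset = x[:5]

# Lower overlap values indicate a more diverse (less crowded) subset.
print(hypersphere_overlap_of_subset(x, x_subset))
```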

selector.measures.diversity.logdet(x: ndarray) float#

Compute the log determinant function.

Given an \(n_s \times n_f\) feature matrix \(x\), where \(n_s\) is the number of samples and \(n_f\) is the number of features, the log determinant function is defined as:

\[F_\text{logdet} = \ln \left( \det \left( x x^T + I \right) \right),\]

where \(I\) is the \(n_s \times n_s\) identity matrix. Higher values of \(F_\text{logdet}\) mean more diversity.

Parameters#

x: ndarray of shape (n_samples, n_features)

Feature matrix of n_samples samples in n_features dimensional feature space.

Returns#

f_logdet: float

The volume of the parallelotope spanned by the matrix.

Notes#

The log-determinant function is based on the formula in Nakamura, T., Sci Rep 2022. Please note that we used the natural logarithm to avoid numerical stability issues, theochem/Selector#229.

References#

Nakamura, T., Sakaue, S., Fujii, K., Harabuchi, Y., Maeda, S., and Iwata, S. (2022). Selecting molecules with diverse structures and properties by maximizing submodular functions of descriptors learned with graph neural networks. Scientific Reports 12.
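
A plain NumPy sketch of the definition above, assuming the natural-logarithm convention mentioned in the notes; it illustrates the formula rather than reproducing the library's exact implementation.

```python
import numpy as np

def logdet_sketch(x: np.ndarray) -> float:
    """ln det(x x^T + I), following the definition above (illustrative only)."""
    n_samples = x.shape[0]
    gram = x @ x.T + np.identity(n_samples)
    # slogdet is numerically safer than det() for computing a log-determinant;
    # gram is positive definite, so the sign is always +1.
    sign, log_abs_det = np.linalg.slogdet(gram)
    return float(log_abs_det)

# Toy feature matrix: 4 molecules x 6 features (hypothetical data).
x = np.random.default_rng(1).random((4, 6))
print(logdet_sketch(x))
```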

selector.measures.diversity.nearest_average_tanimoto(x: ndarray) float#

Compute the average Tanimoto coefficient of the nearest-neighbor molecules.

Parameters#

x : ndarray

Feature matrix.

Returns#

nat : float

Average Tanimoto coefficient of the closest pairs.

Notes#

This computes the Tanimoto coefficient of the pairs with the shortest distances, then returns their average. This calculation is used specifically for the explicit diversity index.

Papp, Á., Gulyás-Forró, A., Gulyás, Z., Dormán, G., Ürge, L., and Darvas, F. (2006) Explicit Diversity Index (EDI): A Novel Measure for Assessing the Diversity of Compound Databases. Journal of Chemical Information and Modeling 46, 1898-1904.
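
A usage sketch on hypothetical binary fingerprints, assuming only the documented signature.

```python
import numpy as np
from selector.measures.diversity import nearest_average_tanimoto

# Toy binary fingerprints: 4 molecules x 6 bits (hypothetical data).
x = np.array(
    [
        [1, 0, 1, 0, 1, 0],
        [1, 1, 1, 0, 0, 0],
        [0, 0, 1, 1, 0, 1],
        [0, 1, 0, 1, 1, 1],
    ]
)

# Average Tanimoto coefficient over each molecule's nearest neighbor.
print(nearest_average_tanimoto(x))
```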

selector.measures.diversity.shannon_entropy(x: ndarray, normalize=True, truncation=False) float#

Compute the Shannon entropy of a binary matrix.

Higher values mean more diversity.

Parameters#

x : ndarray

Bit-string matrix.

normalize : bool, optional

Normalize the entropy to [0, 1]. Default=True.

truncation : bool, optional

Use the truncated Shannon entropy by only counting the contributions of one-bits. Default=False.

Returns#

float :

The Shannon entropy of the matrix.

Notes#

Suppose we have \(m\) compounds and each compound has an \(n\)-bit binary fingerprint. The binary (feature) matrix is \(\mathbf{x} \in \{0, 1\}^{m \times n}\), where each row is a compound's \(n\)-bit binary fingerprint. The equation for Shannon entropy is given in [1] and [3],

\[H = \sum_i^n \left[ - p_i \log_2{p_i } - (1 - p_i)\log_2(1 - p_i) \right]\]

where \(p_i\) represents the relative frequency of 1-bits at fingerprint position \(i\). When \(p_i = 0\) or \(p_i = 1\), the contribution of position \(i\) to the entropy is zero. When truncation is True, the entropy is instead calculated as in [2],

\[H = \sum_i^n \left[ - p_i \log_2{p_i } \right]\]

When normalize is True, the entropy is normalized by a scaling factor so that it lies in the range [0, 1] [2],

\[H = \frac{ \sum_i^n \left[ - p_i \log_2{p_i } - (1 - p_i)\log_2(1 - p_i) \right]} {n \log_2{2} / 2}\]

Please note that when truncation is False and normalize is True, this formula has not been used in any literature; it is a simple normalization of the entropy, and users may use it at their own risk.

References#

[1]

Wang, Y., Geppert, H., & Bajorath, J. (2009). Shannon entropy-based fingerprint similarity search strategy. Journal of Chemical Information and Modeling, 49(7), 1687-1691.

[2]

Leguy, J., Glavatskikh, M., Cauchy, T., & Da Mota, B. (2021). Scalable estimator of the diversity for de novo molecular generation resulting in a more robust QM dataset (OD9) and a more efficient molecular optimization. Journal of Cheminformatics, 13(1), 1-17.

[3]

Weidlich, I. E., & Filippov, I. V. (2016). Using the Gini coefficient to measure the chemical diversity of small molecule libraries. Journal of Computational Chemistry, 37(22), 2091-2097.
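
A NumPy sketch of the non-truncated, unnormalized formula above, summing the per-position binary entropies of a hypothetical bit matrix; it illustrates the math and is not necessarily identical to the library's implementation.

```python
import numpy as np

def shannon_entropy_sketch(x: np.ndarray) -> float:
    """Unnormalized, non-truncated Shannon entropy of a bit matrix (illustrative only)."""
    # p: relative frequency of 1-bits at each fingerprint position (column-wise mean).
    p = x.mean(axis=0)
    # Positions with p = 0 or p = 1 contribute zero entropy, so drop them.
    p = p[(p > 0) & (p < 1)]
    return float(np.sum(-p * np.log2(p) - (1 - p) * np.log2(1 - p)))

# Toy bit-string matrix: 4 compounds x 5 bits (hypothetical data).
x = np.array([[1, 0, 1, 0, 1], [0, 1, 1, 0, 0], [1, 1, 0, 1, 0], [0, 0, 1, 1, 1]])
print(shannon_entropy_sketch(x))
```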

selector.measures.diversity.wdud(x: ndarray) float#

Compute the Wasserstein Distance to Uniform Distribution (WDUD).

The equation for the Wasserstein distance between a single feature and the uniform distribution is

\[WDUD(x) = \int_{0}^{1} |U(x) - V(x)|dx\]

where the feature is normalized to [0, 1], \(U(x)=x\) is the cumulative distribution function of the uniform distribution on [0, 1], and \(V(x) = \sum_{y_i \leq x} 1 / N\) is the empirical cumulative distribution of the \(N\) values \(y_i\) of the feature. This integral is calculated iteratively between consecutive values \(y_i\) and \(y_{i+1}\) using the trapezoidal method.

Lower values of the WDUD mean more diversity because the features of the selected set are more evenly distributed over the range of feature values.

Parameters#

x: ndarray of shape (n_samples, n_features)

Feature matrix of n_samples samples in n_features dimensional feature space.

Returns#

float :

The mean of the per-feature WDUD values, computed over all molecules.

Notes#

Nakamura, T., Sakaue, S., Fujii, K., Harabuchi, Y., Maeda, S., and Iwata, S. (2022) Selecting molecules with diverse structures and properties by maximizing submodular functions of descriptors learned with graph neural networks. Scientific Reports 12.
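
A usage sketch with hypothetical continuous features scaled to [0, 1], assuming only the documented call.

```python
import numpy as np
from selector.measures.diversity import wdud

rng = np.random.default_rng(7)

# Toy feature matrix: 10 molecules x 3 continuous features in [0, 1) (hypothetical data).
x = rng.random((10, 3))

# Lower mean WDUD means the selected features cover their ranges more evenly.
print(wdud(x))
```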