selector.measures.similarity

Similarity Module.

selector.measures.similarity.modified_tanimoto(a: array, b: array) -> float

Compute the modified Tanimoto coefficient from the bitstring vectors of data points A and B.

Adjusts the Tanimoto coefficient to counter its natural bias towards shorter vectors, using a Bernoulli probability model.

\[{mt} = \frac{2-p}{3} T_1 + \frac{1+p}{3} T_0\]

where \(p\) is the success probability of the independent trials, \(T_1\) is the number of common ‘1’ bits between the data points (\(T_1 = | A \cap B |\)), and \(T_0\) is the number of common ‘0’ bits between the data points (\(T_0 = |(1-A) \cap (1-B)|\)).

Parameters

a : ndarray of shape (n_features,)

The 1D bitstring feature array of sample \(A\) in an n_features dimensional space.

b : ndarray of shape (n_features,)

The 1D bitstring feature array of sample \(B\) in an n_features dimensional space.

Returns

mt : float

Modified Tanimoto coefficient between bitstring feature arrays \(A\) and \(B\).

Notes

The equation above has been derived from

\[{mt}_{\alpha} = {\alpha}T_1 + (1-\alpha)T_0\]

where \(\alpha = \frac{2-p}{3}\). This choice of \(\alpha\) keeps the expected value of the modified Tanimoto, \(E(mt)\), constant as the success probability \(p\) changes.
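To check that the two forms agree, substitute \(\alpha\) into the second weight:

\[1 - \alpha = 1 - \frac{2-p}{3} = \frac{3 - (2-p)}{3} = \frac{1+p}{3}\]

so \({mt}_{\alpha}\) reduces to the expression for \({mt}\) given at the top of this entry, and the two weights sum to 1 for any \(p\).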

Fligner, M. A., Verducci, J. S., and Blower, P. E. (2002). A Modification of the Jaccard-Tanimoto Similarity Index for Diverse Selection of Chemical Compounds Using Binary Strings. Technometrics 44, 110-119.
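As an illustration of the formula for \({mt}\), here is a minimal NumPy sketch. It is not the library's implementation, and the name `modified_tanimoto_sketch` is hypothetical. The sketch makes two assumptions that the description above does not spell out: \(T_1\) and \(T_0\) are turned into ratios in [0, 1] by dividing each common-bit count by the number of positions that are not common bits of the opposite kind, and \(p\) is estimated as the fraction of '1' bits across both vectors.

```python
import numpy as np


def modified_tanimoto_sketch(a, b):
    """Illustrative modified Tanimoto for two 1D bit arrays (hypothetical helper)."""
    a = np.asarray(a, dtype=int)
    b = np.asarray(b, dtype=int)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("a and b must be 1D arrays of the same length")
    n = a.size
    n_11 = int(np.sum(a * b))              # positions where both bits are 1
    n_00 = int(np.sum((1 - a) * (1 - b)))  # positions where both bits are 0
    # Assumption: T_1 and T_0 are ratios in [0, 1]; a term whose denominator
    # would be zero (all common '0' bits or all common '1' bits) is set to 1.
    t_1 = 1.0 if n_00 == n else n_11 / (n - n_00)
    t_0 = 1.0 if n_11 == n else n_00 / (n - n_11)
    # Assumption: p is the fraction of '1' bits over both vectors.
    p = (a.sum() + b.sum()) / (2 * n)
    return float((2 - p) / 3 * t_1 + (1 + p) / 3 * t_0)


a = np.array([1, 1, 0, 0, 1])
b = np.array([1, 0, 0, 1, 1])
mt = modified_tanimoto_sketch(a, b)
```

For these vectors there are two common '1' bits, one common '0' bit, and \(p = 0.6\), so under the stated assumptions the sketch gives \(T_1 = 1/2\), \(T_0 = 1/3\), and \(mt = 3.7/9 \approx 0.411\). Two identical vectors give 1, because the weights sum to 1.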

selector.measures.similarity.pairwise_similarity_bit(X: array, metric: str) -> ndarray

Compute pairwise similarity coefficient matrix.

Parameters

X : ndarray of shape (n_samples, n_features)

Feature matrix of n_samples samples in n_features dimensional space.

metric : str

The metric used to calculate the similarity coefficient between samples in the feature matrix. Options: “tanimoto”, “modified_tanimoto”.

Returns

s : ndarray of shape (n_samples, n_samples)

A symmetric similarity matrix between each pair of samples in the feature matrix. The diagonal elements are computed directly rather than assumed to be 1.
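A pairwise matrix of this kind can be sketched with NumPy as follows. This is only an illustration for the “tanimoto” case, using the dot-product form of the Tanimoto coefficient given in the `tanimoto` entry; the helper name `pairwise_tanimoto_sketch` is hypothetical and the code is not taken from the library. The diagonal comes out of the same expression as every other element rather than being set to 1.

```python
import numpy as np


def pairwise_tanimoto_sketch(X):
    """Illustrative pairwise Tanimoto matrix (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        raise ValueError("X must be a 2D array of shape (n_samples, n_features)")
    gram = X @ X.T            # gram[i, j] = X_i . X_j
    sq_norms = np.diag(gram)  # sq_norms[i] = ||X_i||^2
    # Denominator ||X_i||^2 + ||X_j||^2 - X_i . X_j for every pair (i, j).
    denom = sq_norms[:, None] + sq_norms[None, :] - gram
    return gram / denom


X = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]])
s = pairwise_tanimoto_sketch(X)
```

Each pair of distinct rows here shares one '1' bit out of three set positions in total, so every off-diagonal entry is 1/3 and every diagonal entry is 1. The sketch does not guard against all-zero rows, for which the denominator is zero.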

selector.measures.similarity.scaled_similarity_matrix(X: array) -> ndarray

Compute the scaled similarity matrix.

\[X(i,j) = \frac{X(i,j)}{\sqrt{X(i,i)X(j,j)}}\]

Parameters

X : ndarray of shape (n_samples, n_samples)

Similarity matrix between n_samples samples.

Returns

s : ndarray of shape (n_samples, n_samples)

A scaled symmetric similarity matrix.
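The scaling formula in this entry can be sketched in a few lines of NumPy. This is an illustration under the assumption that the diagonal of the input is strictly positive; the name `scaled_similarity_sketch` is hypothetical and the code is not the library's implementation.

```python
import numpy as np


def scaled_similarity_sketch(X):
    """Illustrative scaling X(i, j) / sqrt(X(i, i) * X(j, j)) (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.shape[0] != X.shape[1]:
        raise ValueError("X must be a square matrix")
    diag = np.diag(X)
    # np.outer(diag, diag)[i, j] = X(i, i) * X(j, j)
    return X / np.sqrt(np.outer(diag, diag))


X = np.array([[4.0, 2.0],
              [2.0, 9.0]])
s = scaled_similarity_sketch(X)
```

Here the off-diagonal entry becomes \(2 / \sqrt{4 \cdot 9} = 1/3\), and the diagonal of the scaled matrix is exactly 1.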

selector.measures.similarity.tanimoto(a: array, b: array) -> float

Compute the Tanimoto coefficient or index (a.k.a. Jaccard similarity coefficient).

For two binary or non-binary arrays \(A\) and \(B\), the Tanimoto coefficient is defined as the size of their intersection divided by the size of their union:

\[T(A, B) = \frac{| A \cap B|}{| A \cup B |} = \frac{| A \cap B|}{|A| + |B| - | A \cap B|} = \frac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B}\]

where \(A \cdot B = \sum_i{A_i B_i}\) and \(\|A\|^2 = \sum_i{A_i^2}\).

Parameters

a : ndarray of shape (n_features,)

The 1D feature array of sample \(A\) in an n_features dimensional space.

b : ndarray of shape (n_features,)

The 1D feature array of sample \(B\) in an n_features dimensional space.

Returns

coeff : float

Tanimoto coefficient between feature arrays \(A\) and \(B\).

Bajusz, D., Rácz, A., and Héberger, K. (2015). Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? Journal of Cheminformatics 7.
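The dot-product form of \(T(A, B)\) translates directly into NumPy. The following is a minimal sketch rather than the library's code; the name `tanimoto_sketch` is hypothetical, and the sketch assumes that at least one of the inputs is non-zero so that the denominator does not vanish.

```python
import numpy as np


def tanimoto_sketch(a, b):
    """Illustrative Tanimoto coefficient for two 1D arrays (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("a and b must be 1D arrays of the same length")
    dot = np.dot(a, b)  # A . B
    # ||A||^2 + ||B||^2 - A . B
    return float(dot / (np.dot(a, a) + np.dot(b, b) - dot))


binary = tanimoto_sketch([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])  # 2 / (3 + 3 - 2)
weighted = tanimoto_sketch([1.0, 2.0], [2.0, 1.0])          # 4 / (5 + 5 - 4)
```

For the binary pair the result is 0.5, which matches the set view: two shared '1' bits out of four positions set in either vector. The second call shows the same expression applied to non-binary features, giving 2/3.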