DistanceAlgorithm

class datafold.pcfold.distance.DistanceAlgorithm(metric, is_symmetric, cut_off=None, k=None)

Bases: object

Abstract base class for distance matrix algorithms (dense or sparse).

Important aspects and conventions for the distance algorithms:

  • The terms “pair-wise” (pdist) and “component-wise” (cdist) are adopted from SciPy’s distance matrix computations scipy.spatial.distance.pdist and scipy.spatial.distance.cdist.

  • A sparse distance matrix with a restricted neighborhood (either k-nearest-neighbors or a radius cut-off) does not store all distance pairs. An entry that is not stored therefore means “outside the neighborhood”, not “distance zero”. Importantly, this means that the sparse matrix must explicitly store “real distance zeros” (introduced by duplicate points or by self-distances in the pdist case). Matrix operations sometimes need a workaround so that these stored distance zeros are not removed.
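The difference between a stored zero and a missing entry can be illustrated with plain SciPy sparse matrices. The following is a minimal sketch, not datafold's implementation; the three points and the cut-off radius of 1.5 are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import cdist

# Hypothetical data: the first two points are duplicates, the third is far away.
X = np.array([[0.0, 0.0], [0.0, 0.0], [5.0, 0.0]])
cut_off = 1.5  # illustrative radius, not a datafold default

dense = cdist(X, X)  # full pair-wise Euclidean distances
rows, cols = np.nonzero(dense <= cut_off)  # index pairs inside the radius

# Building from (data, (row, col)) keeps explicit zeros as stored entries.
sparse_dist = csr_matrix((dense[rows, cols], (rows, cols)), shape=dense.shape)
n_stored = sparse_dist.nnz  # counts stored entries, including explicit zeros

# eliminate_zeros() drops the stored zeros, so real zero distances become
# indistinguishable from pairs that lie outside the cut-off.
pruned = sparse_dist.copy()
pruned.eliminate_zeros()
n_after_pruning = pruned.nnz
```

In this toy case every stored distance is a real zero (self-distances and the duplicate pair), so pruning removes all stored entries. This is the kind of operation the workaround mentioned above has to guard against.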

Parameters:
  • metric (str) – distance metric to compute

  • is_symmetric (bool) – indicates whether the distance matrix is symmetric (the standard k-nearest-neighbor distance matrix is typically not symmetric)

  • k (Optional[float]) –

    • for k-nearest-neighbors the number of neighbors

    • for radius-based distance algorithms, there is a follow-up routine to make sure that each sample has at least kmin neighbors (including distance pairs that are larger than radius)
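Why a k-nearest-neighbor distance matrix is typically not symmetric can be seen in a small NumPy sketch. It is an illustration under assumptions, not datafold code: the sample positions are made up, and self-distances are excluded from the neighbor search here purely to keep the example short.

```python
import numpy as np

# Hypothetical one-dimensional samples at positions 0, 1 and 3.
X = np.array([[0.0], [1.0], [3.0]])
k = 1  # number of neighbors per sample, excluding the sample itself

# Dense Euclidean distances via broadcasting.
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Exclude self-distances from the neighbor search, then pick the k nearest.
masked = dist.copy()
np.fill_diagonal(masked, np.inf)
neighbor_idx = np.argsort(masked, axis=1)[:, :k]

# Boolean mask of the pairs a k-NN distance matrix would store.
stored = np.zeros(dist.shape, dtype=bool)
stored[np.arange(len(X))[:, None], neighbor_idx] = True

is_symmetric = bool(np.array_equal(stored, stored.T))
```

The sample at position 3 has the sample at position 1 as its nearest neighbor, but the nearest neighbor of the sample at position 1 is the one at position 0. The pair is therefore stored in one direction only, and the sparsity pattern is not symmetric.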

Attributes Summary

dist_type

is_sparse

Methods Summary

__call__(X[, Y])

Compute distance matrix.

Attributes Documentation

dist_type
is_sparse

Methods Documentation

abstract __call__(X, Y=None)

Compute distance matrix.

Return type: Union[ndarray, csr_matrix]

If only the reference dataset (X) is given, then the distances are pair-wise. From this the following distance matrix properties follow:

  • square

  • the diagonal contains each point’s distance to itself and is therefore zero

  • symmetric

If an additional query dataset (Y) is given, then the following distance matrix properties hold:

  • rectangular matrix of shape (n_samples_Y, n_samples_X)

  • outlier points can lead to columns / rows of zero

  • duplicated points between X and Y produce zero-distance entries (these lie on the diagonal only if the duplicates have the same index in X and Y)

Parameters:
  • X – Reference dataset of shape (n_samples_X, n_features).

  • Y – Query dataset of shape (n_samples_Y, n_features). If set, the computation is component-wise; if None, the reference dataset is also used as the query points (i.e. Y=X).
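The pair-wise and component-wise conventions described for __call__ can be reproduced with the SciPy functions the terms come from. This is a sketch of the conventions only, using dense SciPy routines rather than a DistanceAlgorithm subclass; the random data and array sizes are arbitrary.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))  # reference dataset, (n_samples_X, n_features)
Y = rng.normal(size=(3, 2))  # query dataset, (n_samples_Y, n_features)

# Pair-wise case (only X): square, symmetric, zero diagonal.
# squareform expands SciPy's condensed pdist vector to the full matrix.
D_pair = squareform(pdist(X, metric="euclidean"))

# Component-wise case (X and Y): passing Y first yields rows indexed by the
# query points, i.e. shape (n_samples_Y, n_samples_X).
D_comp = cdist(Y, X, metric="euclidean")
```

Note the argument order in the cdist call: the query set comes first so that the result matches the (n_samples_Y, n_samples_X) layout documented above.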