DistanceAlgorithm
- class datafold.pcfold.distance.DistanceAlgorithm(metric, is_symmetric, cut_off=None, k=None)
Bases: object
Abstract base class for distance matrix algorithms (dense or sparse).
Important aspects and conventions for the distance algorithms:
The terms “pair-wise” (pdist) and “component-wise” (cdist) are adopted from SciPy’s distance matrix functions scipy.spatial.distance.pdist and scipy.spatial.distance.cdist.
A sparse distance matrix (restricted either to k nearest neighbors or to a radius cut-off) does not store all distance pairs. Importantly, this means that the sparse matrix must explicitly store “real” zero distances (introduced by duplicate points or, in the case of pdist, self-distances): an entry that is not stored means the pair was excluded, not that its distance is zero. This sometimes requires a workaround in matrix operations, so that stored zero distances are not removed.
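The distinction between a stored zero and a missing entry can be illustrated with plain SciPy. This is a minimal sketch, not datafold's implementation: the points, the `cut_off` value and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Four 1-D points; the first two are duplicates (true distance zero).
points = np.array([0.0, 0.0, 1.0, 10.0])
cut_off = 2.0  # illustrative radius, not a datafold default

# Dense distances, then keep only the pairs within the cut-off radius.
dense = np.abs(points[:, None] - points[None, :])
rows, cols = np.nonzero(dense <= cut_off)
dist = csr_matrix((dense[rows, cols], (rows, cols)), shape=dense.shape)

# nnz counts every stored entry, including explicitly stored zeros.
n_stored = dist.nnz
n_stored_nonzero = dist.count_nonzero()

# Both reads return 0.0, but only the first pair is actually stored.
zero_duplicate = dist[0, 1]   # stored zero: duplicate points
zero_excluded = dist[0, 3]    # not stored: pair lies outside the radius

# The sparsity pattern tells the two cases apart.
pattern = dist.copy()
pattern.data = np.ones_like(pattern.data)
stored_mask = pattern.toarray().astype(bool)

# eliminate_zeros() drops the stored zeros and with them the information
# that the duplicate pair and the self-distances were within the radius.
pruned = dist.copy()
pruned.eliminate_zeros()
```

Operations that silently prune explicit zeros, such as `eliminate_zeros()`, are the kind of step that calls for the workaround mentioned above.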
- Parameters:
metric (str) – distance metric to compute
is_symmetric (bool) – indicate whether the distance matrix is symmetric (typically the standard k-nearest-neighbor is not symmetric)
k –
- for k-nearest-neighbors, the number of neighbors
- for radius-based distance algorithms, there is a follow-up routine to make sure that each sample has at least kmin neighbors (including distance pairs that are larger than radius)
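The remark that a standard k-nearest-neighbor relation is typically not symmetric can be checked with a small NumPy sketch. The points and the value of `k` below are assumptions for illustration, and the code does not use datafold.

```python
import numpy as np

# Four 1-D points; the last one lies far away from the others.
points = np.array([0.0, 1.0, 3.0, 10.0])
k = 1  # illustrative number of neighbors, self excluded

dist = np.abs(points[:, None] - points[None, :])
np.fill_diagonal(dist, np.inf)  # ignore self-distances when ranking

# Column indices of the k nearest neighbors of every row.
nearest = np.argsort(dist, axis=1)[:, :k]

# Boolean adjacency: knn[i, j] is True if j is among the k nearest of i.
knn = np.zeros(dist.shape, dtype=bool)
knn[np.arange(len(points))[:, None], nearest] = True

# Point 3 selects point 2 as its nearest neighbor, but not vice versa.
is_symmetric = np.array_equal(knn, knn.T)

# One common way to obtain a symmetric relation: keep a pair if either
# point selects the other.
knn_symmetric = knn | knn.T
```

Here `is_symmetric` evaluates to `False`, because the outlying point chooses a neighbor that does not choose it back.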
Attributes Summary
Methods Summary
__call__(X[, Y]) – Compute distance matrix.
Attributes Documentation
- dist_type
- is_sparse
Methods Documentation
- abstract __call__(X, Y=None)
Compute distance matrix.
If only the reference dataset (X) is given, then the distances are pair-wise. The distance matrix then has the following properties:
- square
- the diagonal contains the distance of each point to itself and is therefore zero
- symmetric
If an additional query dataset (Y) is given, then the distance matrix has the following properties:
- rectangular matrix of shape (n_samples_Y, n_samples_X)
- outlier points can lead to all-zero rows or columns
- duplicate points between X and Y have zero entries on the diagonal
- Variables:
X – Reference dataset of shape (n_samples_X, n_features).
Y – Query dataset of shape (n_samples_Y, n_features). If set, then the computation is component-wise; if None, the reference dataset is taken as the query points (i.e. Y=X).
- Return type:
Union[ndarray, csr_matrix]
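The listed properties of the pair-wise and component-wise cases can be reproduced for the dense case with SciPy's own functions. This is a sketch under assumptions: the random data and variable names are illustrative, and SciPy is called directly, not through a DistanceAlgorithm subclass.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # reference dataset, (n_samples_X, n_features)
Y = rng.normal(size=(2, 3))  # query dataset, (n_samples_Y, n_features)

# Pair-wise: pdist returns the condensed upper triangle, squareform
# expands it to a square, symmetric matrix with a zero diagonal.
D_pair = squareform(pdist(X, metric="euclidean"))

# Component-wise: passing Y first makes rows index the query points,
# which gives the shape (n_samples_Y, n_samples_X) stated above.
D_comp = cdist(Y, X, metric="euclidean")

# Using the reference dataset as the query (Y=X) reproduces the
# pair-wise matrix up to floating-point error.
D_self = cdist(X, X, metric="euclidean")
```

Note that `cdist(XA, XB)` returns a matrix of shape `(len(XA), len(XB))`, so the argument order determines which dataset indexes the rows.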