Internal Interface

The internal interface for UMAP.jl. Some of these may be moved to the public interface as they stabilize.

Contents

Interface

Nearest Neighbors Interface

UMAP.jl allows for an extensible way to find the nearest neighbors of your data. This is controlled by the subtypes of NeighborParams.

UMAP.DescentNeighborsType
DescentNeighbors{M, K}(n_neighbors, metric::M, kwargs::K)

Parameters for finding approximate nearest neighbors using NearestNeighborDescent.

source
UMAP.NeighborParamsType

Structs for parameterizing the knn search step of UMAP.

Subtyping NeighborParams allows for different methods of finding nearest neighbors. The knn_search function will dispatch on the type of knn_params to find the nearest neighbors in the data.

source
UMAP.PrecomputedNeighborsType
PrecomputedNeighbors{M}(n_neighbors, dists_or_graphs::M)

Parameters for finding nearest neighbors from precomputed distances. dists_or_graphs can either be a matrix (pairwise distances) or a NearestNeighborDescent.ApproximateKNNGraph.

source

The neighbor parameters control dispatch for KNN search functionality, both when fitting and when transforming new data.

UMAP._knn_from_distsMethod
_knn_from_dists(dist_mat::AbstractMatrix{S}, k::Integer; ignore_diagonal=true) where {S <: Real}

Construct k-nearest neighbors and distances from a distance matrix. If ignore_diagonal is true, the diagonal elements (which are 0) will be ignored when determining the k-nearest neighbors. This is useful when the distance matrix represents pairwise distances of the same set, where the diagonal elements are the distances to themselves and will always be 0.

Returns a tuple of two matrices: the first contains the indices of the k-nearest neighbors, and the second contains the corresponding distances to those neighbors.

source
UMAP.knn_searchFunction
knn_search(data, knn_params)
knn_search(data, queries, knn_params, results)

Find the (potentially approximate) k-nearest neighbors for each point in the dataset, according to knnparams. The data may consist of one or more 'views', passed in either directly (for a single view only) or as a NamedTuple, with corresponding knnparams (either an instance of <: NeighborParams or a NamedTuple of such parameters).

For each (dataview, knnparams) pair, two KxN matrices are returned (knns, dists), holding the indices of the k-nearest neighbors and distances to those neighbors for each point in the dataset. When knnparams isa DescentNeighbors, these matrices are computed using NearestNeighborDescent. When knnparams isa PrecomputedNeighbors, the knn matrices are created from user-provided distances.

If data and knnparams are NamedTuples, the returned knnmatrices will also be in a NamedTuple with the same keys as the inputs.

source
UMAP.knn_searchMethod
knn_search(data, queries, knn_params::DescentNeighbors, result_knns_dists) -> (knns, dists)

Search for approximate nearest neighbors of queries in data using nndescent. The (knns, dists) are used to reconstruct the original KNN graph.

source
UMAP.knn_searchMethod
knn_search(_, __, knn_params::PrecomputedNeighbors, ___)

(TRANSFORM) Get neighbors from a precomputed distance matrix.

source
UMAP.knn_searchMethod
knn_search(data, knn_params::DescentNeighbors) -> (knns, dists)

Find approximate nearest neighbors using nndescent.

source
UMAP.knn_searchMethod
knn_search(_, knn_params::PrecomputedNeighbors)

(FIT) Get neighbors from a precomputed distance matrix.

source
UMAP.knn_searchMethod
knn_search(_, knn_params::PrecomputedNeighbors{M}) where {M <: ApproximateKNNGraph}

Get neighbors from a precomputed KNNGraph. This method is used when the KNN graph has been precomputed using NearestNeighborDescent

source
UMAP.knn_searchMethod
knn_search(data::NamedTuple{T}, queries::NamedTuple{T}, knn_params::NamedTuple{T}, result_knns_dists::NamedTuple{T}) -> NamedTuple{T}

(TRANSFORM) Map knn_search over each view of the data, queries, knnparams, and resultknns_dists.

source
UMAP.knn_searchMethod
knn_search(data::NamedTuple{T}, knn_params::NamedTuple{T}) -> NamedTuple{T}

(FIT) Map knn_search over each view of the data and corresponding knn_params.

source

Fuzzy Simplicial Sets Interface

The second step of the UMAP algorithm involves calculating a topological representation of the data (potentially from multiple "views"). This behavior is controlled by the following types and functions.

UMAP.SourceGlobalParamsType
SourceGlobalParams{T}(mix_ratio)

Parameters for merging the fuzzy simplicial sets for each dataset view into one fuzzy simplicial set, otherwise known as the UMAP graph.

source
UMAP.SourceViewParamsType
SourceViewParams{T}(set_operation_ratio, local_connectivity, bandwidth)

Struct for parameterizing the representation of the data in the source (original) manifold; i.e. constructing fuzzy simplicial sets of each view of the dataset.

source

These config structs parameterize the fuzzy simplicial sets created for each view, and the global fuzzy simplicial set after combining all views (sometimes called the UMAP graph).

UMAP optimizes the difference between the source fuzzy set representation and a target representation. Therefore, if there are multiple source views, they need to be combined. This is done by a function coalesce_views that combines the view simplicial sets via set union and intersections, controlled by the SourceGlobalParams.

UMAP._mix_valuesMethod
_mix_values(x, y, ratio)

For ratio in [0, 1], return a modified multiplication of x and y. For ratio = 0, return x For ratio = 1, return y For ratio = 0.5, return x*y

source
UMAP._norm_sparseMethod
_norm_sparse(fsset)

For each column (i.e. each point), normalize the membership values (divide by the maximum). This creates a copy of the matrix.

source
UMAP._reset_fuzzy_set_cardinalityFunction

Like smoothknndists, but now operating on an existing fuzzy set edge probabilities.

NOTE: k is fixed to 15 here, which is what the Python UMAP implementation does. How we would set this otherwise isn't clear (the global UMAP graph k value isn't determined.)

source
UMAP.coalesce_viewsFunction
coalesce_views(view_fuzzy_sets, params)

Merge the fuzzy simplicial sets for each view of the data. This returns a single fuzzy simplicial set - the weighted, undirected UMAP graph - that captures the spatial relationships of the data points approximated by the manifold.

source
UMAP.compute_membership_strengthsMethod
compute_membership_strengths(knns, dists, σs, ρs) -> rows, cols, vals

Compute the membership strengths for the 1-skeleton of each fuzzy simplicial set.

source
UMAP.fuzzy_simplicial_setFunction

Construct the UMAP graph, i.e. the global fuzzy simplicial set. This is represented as a symmetric, sparse matrix where each value is the probability that an edge exists between points (row, col).

For multiple views, there will be one UMAP graph per view - later combined via coalesce_views.

For transforming new data, this graph is notably not symmetric.

source
UMAP.fuzzy_simplicial_setMethod
fuzzy_simplicial_set(data, knns_dists, knn_params, src_params)

(Transform) Construct a global fuzzy simplicial set for a single data view, the data argument gives us the number of points in the original dataset.

source
UMAP.fuzzy_simplicial_setMethod
fuzzy_simplicial_set((knns, dists), knn_params, src_params::SourceViewParams)

(Fit) Construct a global fuzzy simplicial set for a single data view.

source
UMAP.fuzzy_simplicial_setMethod
fuzzy_simplicial_set(knns_dists, n_points, knn_params, src_params, combine=true) -> membership_graph::SparseMatrixCSC

Construct the local fuzzy simplicial set of each point represented by its distances to its nearest neighbors, stored in knns and dists, normalizing the distances, and converting the metric space to a simplicial set (a weighted graph).

n_points indicates the total number of points of the original data, while knns contains indices of some subset of those points (ie some subset of 1:n_points). If knns represents neighbors of the elements of some set with itself, then knns should have n_points number of columns. Otherwise, these two values may be different.

If combine is true, use intersections and unions to combine local fuzzy sets of neighbors. The returned graph has size (n_points, size(knns, 2)).

source
UMAP.fuzzy_simplicial_setMethod
fuzzy_simplicial_set(data::NamedTuple{T}, knns_dists::NamedTuple{T}, knn_params::NamedTuple{T}, src_params::NamedTuple{T}) -> NamedTuple{T}

(Transform) Construct a fuzzy simplicial set for each view of the data, returning a named tuple of fuzzy simplicial sets, one for each view.

source
UMAP.fuzzy_simplicial_setMethod
fuzzy_simplicial_set(knns_dists::NamedTuple{T}, knn_params::NamedTuple{T}, src_params::NamedTuple{T}) -> NamedTuple{T}

(Fit) Construct a fuzzy simplicial set for each view of the data, returning a named tuple of fuzzy simplicial sets, one for each view. The keys of the named tuple are the keys of the input NamedTuples.

source
UMAP.general_simplicial_set_intersectionMethod
general_simplicial_set_intersection(left_view, right_view, global_params)

Take the weighted intersection of two fuzzy simplicial sets (represented as (sparse) matrices). Since we don't want to completely lose edges that are only present in one of the two sets, we multiply by at least 1e-8. Furthermore, if the same edge in both sets has a strength below 1e-8, these are added together instead of multiplying.

source
UMAP.reset_local_connectivityFunction
reset_local_connectivity(simplicial_set, reset_local_metrics)

Reset the local connectivity requirement – each data sample should have complete confidence in at least one 1-simplex in the simplicial set. We can enforce this by locally rescaling confidences, and then remerging the different local simplicial sets together.

A FS set may lose this property when combining multiple views of the source data; this function restores it.

source
UMAP.reset_local_metrics!Method

Reset the cardinality of the fuzzy sets (usually a column in a simplicial set) to be approximately log2(k).

This step is necessary after we've combined the simplicial sets for multiple views of the same data.

source
UMAP.smooth_knn_distsMethod
smooth_knn_dists(dists, k, src_params) -> knn_dists, nn_dists

Compute the distances to the nearest neighbors for a continuous value k. Returns the approximated distances to the kth nearest neighbor (knn_dists) and the nearest neighbor (nn_dists) from each point.

source

Remaining

UMAP.create_view_configMethod

Create NeighborParams, SourceViewParams for a view of the data based on the params previously used to fit this same view. Parameters can be overridden by keyword arguments.

source
UMAP.TargetParamsType
TargetParams{M, D, I, P}(manifold::M, metric::D, init::I, memb_params::P)

Parameters for controlling the target embedding, e.g. the manifold, distance metric, initialization method.

source
UMAP._EuclideanManifoldType

A simple, singleton type representing Euclidean space with dimension N. Points in this manifold are N-dimensional vectors.

source
UMAP.initialize_embeddingFunction
initialize_embedding(umap_graph, tgt_params) -> embedding

Initialize the embedding according to tgt_params. Return a list of points (e.g. Vectors) representing the initial embedded dataset. This could be made generic to different target manifolds.

source
UMAP.initialize_embeddingMethod
initialize_embedding(ref_embedding, umap_graph, tgt_params) -> embedding

Initialize an embedding of points corresponding to the columns of the umap_graph, by taking weighted mean of the columns of ref_embedding, where weights are values from the columns of umap_graph.

source
UMAP.spectral_layoutMethod
spectral_layout(graph, embed_dim) -> embedding

Initialize the graph layout with spectral embedding.

source
UMAP.MembershipFnParamsType
MembershipFnParams{T}(min_dist, spread)

A smooth approximation for the membership strength of a 1-simplex between two points x, y, can be given by the following, with dissimilarity function dist, and constants a, b: ϕ(x, y, dist, a, b) = (1 + a*(dist(x, y))^b)^(-1)

The approximation parameters a, b are chosen by non-linear least squares fitting of the following function ψ:

ψ(x, y, dist, mindist, spread) = dist(x, y) ≤ mindist ? 1 : exp(-(dist(x, y) - min_dist)/spread)

source
UMAP.fit_abMethod
fit_ab(min_dist, spread) -> a, b

Find a smooth approximation to the membership function of points embedded in ℜᵈ. This fits a smooth curve that approximates an exponential decay offset by min_dist, returning the parameters (a, b).

source
UMAP.OptimizationParamsType
OptimizationParams(n_epochs, learning_rate, repulsion_strength, neg_sample_rate)

Parameters for controlling the optimization process.

source
UMAP._optimize_embedding!Method
_optimize_embedding!(embedding, ref_embedding, umap_graph, tgt_params, opt_params; move_ref)

Optimize the embedding for one epoch, calculating the distances between neighbors and updating the embedding via gradient descent. If embedding and ref_embedding are the same object, then this is fitting a new embedding. Otherwise, ref_embedding is the result of a previous call to fit, and we are transforming new data.

In both cases, the dimensions of umap_graph have to match embedding and ref_embedding: umapgraph in R^{n, m}, embedding length m, refembedding length n.

This optimizes the default case, where the embeddings are in vector of vectors format, the target manifold is a Euclidean manifold of some dimension N, and the target metric is squared euclidean.

source
UMAP._ϕMethod

A smooth approximation for the membership strength of the 1-simplex between two points x, y.

source
UMAP.cross_entropyMethod
cross_entropy(umap_graph, embedding, tgt_params)

Calculate the fuzzy cross entropy loss of the embedding for this umapgraph. (NOTE: μ(x, y) = umapgraph[x, y]; ν(x, y) = )

source
UMAP.target_metricFunction
target_metric(tgt_params, x, y) -> dist, grad_dist_x, grad_dist_y

Calculate the distance between x and y on the manifold tgt_params.manifold according to tgt_params.metric as well as the gradient of that distance with respect to x and y.

source
UMAP.update_embedding_neg!Method

Calculate the gradients of the negative 1-simplices in the simplicial set, and update the embeddings. This assumes embedded in R^d with the squared euclidean metric.

source
UMAP.update_embedding_pos!Method

Calculate the gradients of the positive 1-simplices in the simplicial set, and update the embeddings. This assumes embedded in R^d with the squared euclidean metric.

source
UMAP.trustworthinessMethod
trustworthiness(X, X_embed, n_neighbors, metric) -> [0, 1]

Compute the trustworthiness of an embedding X_embed compared to X.

https://scikit-learn.org/stable/modules/generated/sklearn.manifold.trustworthiness.html

source

Index