Internal Interface
The internal interface for UMAP.jl. Some of these may be moved to the public interface as they stabilize.
Contents
Interface
Nearest Neighbors Interface
UMAP.jl allows for an extensible way to find the nearest neighbors of your data. This is controlled by the subtypes of NeighborParams.
UMAP.DescentNeighbors — Type
DescentNeighbors{M, K}(n_neighbors, metric::M, kwargs::K)Parameters for finding approximate nearest neighbors using NearestNeighborDescent.
UMAP.NeighborParams — Type
Structs for parameterizing the knn search step of UMAP.
Subtyping NeighborParams allows for different methods of finding nearest neighbors. The knn_search function will dispatch on the type of knn_params to find the nearest neighbors in the data.
UMAP.PrecomputedNeighbors — Type
PrecomputedNeighbors{M}(n_neighbors, dists_or_graphs::M)Parameters for finding nearest neighbors from precomputed distances. dists_or_graphs can either be a matrix (pairwise distances) or a NearestNeighborDescent.ApproximateKNNGraph.
The neighbor parameters control dispatch for KNN search functionality, both when fitting and when transforming new data.
UMAP._knn_from_dists — Method
_knn_from_dists(dist_mat::AbstractMatrix{S}, k::Integer; ignore_diagonal=true) where {S <: Real}Construct k-nearest neighbors and distances from a distance matrix. If ignore_diagonal is true, the diagonal elements (which are 0) will be ignored when determining the k-nearest neighbors. This is useful when the distance matrix represents pairwise distances of the same set, where the diagonal elements are the distances to themselves and will always be 0.
Returns a tuple of two matrices: the first contains the indices of the k-nearest neighbors, and the second contains the corresponding distances to those neighbors.
UMAP.knn_search — Function
knn_search(data, knn_params)
knn_search(data, queries, knn_params, results)Find the (potentially approximate) k-nearest neighbors for each point in the dataset, according to knnparams. The data may consist of one or more 'views', passed in either directly (for a single view only) or as a NamedTuple, with corresponding knnparams (either an instance of <: NeighborParams or a NamedTuple of such parameters).
For each (dataview, knnparams) pair, two KxN matrices are returned (knns, dists), holding the indices of the k-nearest neighbors and distances to those neighbors for each point in the dataset. When knnparams isa DescentNeighbors, these matrices are computed using NearestNeighborDescent. When knnparams isa PrecomputedNeighbors, the knn matrices are created from user-provided distances.
If data and knnparams are NamedTuples, the returned knnmatrices will also be in a NamedTuple with the same keys as the inputs.
UMAP.knn_search — Method
knn_search(data, queries, knn_params::DescentNeighbors, result_knns_dists) -> (knns, dists)Search for approximate nearest neighbors of queries in data using nndescent. The (knns, dists) are used to reconstruct the original KNN graph.
UMAP.knn_search — Method
knn_search(_, __, knn_params::PrecomputedNeighbors, ___)(TRANSFORM) Get neighbors from a precomputed distance matrix.
UMAP.knn_search — Method
knn_search(data, knn_params::DescentNeighbors) -> (knns, dists)Find approximate nearest neighbors using nndescent.
UMAP.knn_search — Method
knn_search(_, knn_params::PrecomputedNeighbors)(FIT) Get neighbors from a precomputed distance matrix.
UMAP.knn_search — Method
knn_search(_, knn_params::PrecomputedNeighbors{M}) where {M <: ApproximateKNNGraph}Get neighbors from a precomputed KNNGraph. This method is used when the KNN graph has been precomputed using NearestNeighborDescent
UMAP.knn_search — Method
knn_search(data::NamedTuple{T}, queries::NamedTuple{T}, knn_params::NamedTuple{T}, result_knns_dists::NamedTuple{T}) -> NamedTuple{T}(TRANSFORM) Map knn_search over each view of the data, queries, knnparams, and resultknns_dists.
UMAP.knn_search — Method
knn_search(data::NamedTuple{T}, knn_params::NamedTuple{T}) -> NamedTuple{T}(FIT) Map knn_search over each view of the data and corresponding knn_params.
Fuzzy Simplicial Sets Interface
The second step of the UMAP algorithm involves calculating a topological representation of the data (potentially from multiple "views"). This behavior is controlled by the following types and functions.
UMAP.SMOOTH_K_TOLERANCE — Constant
SMOOTH_K_TOLERANCETolerance for the smooth k-distance calculation.
UMAP.SourceGlobalParams — Type
SourceGlobalParams{T}(mix_ratio)Parameters for merging the fuzzy simplicial sets for each dataset view into one fuzzy simplicial set, otherwise known as the UMAP graph.
UMAP.SourceViewParams — Type
SourceViewParams{T}(set_operation_ratio, local_connectivity, bandwidth)Struct for parameterizing the representation of the data in the source (original) manifold; i.e. constructing fuzzy simplicial sets of each view of the dataset.
These config structs parameterize the fuzzy simplicial sets created for each view, and the global fuzzy simplicial set after combining all views (sometimes called the UMAP graph).
UMAP optimizes the difference between the source fuzzy set representation and a target representation. Therefore, if there are multiple source views, they need to be combined. This is done by a function coalesce_views that combines the view simplicial sets via set union and intersections, controlled by the SourceGlobalParams.
UMAP._mix_values — Method
_mix_values(x, y, ratio)For ratio in [0, 1], return a modified multiplication of x and y. For ratio = 0, return x For ratio = 1, return y For ratio = 0.5, return x*y
UMAP._norm_sparse — Method
_norm_sparse(fsset)For each column (i.e. each point), normalize the membership values (divide by the maximum). This creates a copy of the matrix.
UMAP._reset_fuzzy_set_cardinality — Function
Like smoothknndists, but now operating on an existing fuzzy set edge probabilities.
NOTE: k is fixed to 15 here, which is what the Python UMAP implementation does. How we would set this otherwise isn't clear (the global UMAP graph k value isn't determined.)
UMAP.coalesce_views — Function
coalesce_views(view_fuzzy_sets, params)Merge the fuzzy simplicial sets for each view of the data. This returns a single fuzzy simplicial set - the weighted, undirected UMAP graph - that captures the spatial relationships of the data points approximated by the manifold.
UMAP.compute_membership_strengths — Method
compute_membership_strengths(knns, dists, σs, ρs) -> rows, cols, valsCompute the membership strengths for the 1-skeleton of each fuzzy simplicial set.
UMAP.fuzzy_simplicial_set — Function
Construct the UMAP graph, i.e. the global fuzzy simplicial set. This is represented as a symmetric, sparse matrix where each value is the probability that an edge exists between points (row, col).
For multiple views, there will be one UMAP graph per view - later combined via coalesce_views.
For transforming new data, this graph is notably not symmetric.
UMAP.fuzzy_simplicial_set — Method
fuzzy_simplicial_set(data, knns_dists, knn_params, src_params)(Transform) Construct a global fuzzy simplicial set for a single data view, the data argument gives us the number of points in the original dataset.
UMAP.fuzzy_simplicial_set — Method
fuzzy_simplicial_set((knns, dists), knn_params, src_params::SourceViewParams)(Fit) Construct a global fuzzy simplicial set for a single data view.
UMAP.fuzzy_simplicial_set — Method
fuzzy_simplicial_set(knns_dists, n_points, knn_params, src_params, combine=true) -> membership_graph::SparseMatrixCSCConstruct the local fuzzy simplicial set of each point represented by its distances to its nearest neighbors, stored in knns and dists, normalizing the distances, and converting the metric space to a simplicial set (a weighted graph).
n_points indicates the total number of points of the original data, while knns contains indices of some subset of those points (ie some subset of 1:n_points). If knns represents neighbors of the elements of some set with itself, then knns should have n_points number of columns. Otherwise, these two values may be different.
If combine is true, use intersections and unions to combine local fuzzy sets of neighbors. The returned graph has size (n_points, size(knns, 2)).
UMAP.fuzzy_simplicial_set — Method
fuzzy_simplicial_set(data::NamedTuple{T}, knns_dists::NamedTuple{T}, knn_params::NamedTuple{T}, src_params::NamedTuple{T}) -> NamedTuple{T}(Transform) Construct a fuzzy simplicial set for each view of the data, returning a named tuple of fuzzy simplicial sets, one for each view.
UMAP.fuzzy_simplicial_set — Method
fuzzy_simplicial_set(knns_dists::NamedTuple{T}, knn_params::NamedTuple{T}, src_params::NamedTuple{T}) -> NamedTuple{T}(Fit) Construct a fuzzy simplicial set for each view of the data, returning a named tuple of fuzzy simplicial sets, one for each view. The keys of the named tuple are the keys of the input NamedTuples.
UMAP.general_simplicial_set_intersection — Method
general_simplicial_set_intersection(left_view, right_view, global_params)Take the weighted intersection of two fuzzy simplicial sets (represented as (sparse) matrices). Since we don't want to completely lose edges that are only present in one of the two sets, we multiply by at least 1e-8. Furthermore, if the same edge in both sets has a strength below 1e-8, these are added together instead of multiplying.
UMAP.general_simplicial_set_union — Method
general_simplicial_set_union(left_view, right_view)Take the union of two global fuzzy simplicial sets.
UMAP.reset_local_connectivity — Function
reset_local_connectivity(simplicial_set, reset_local_metrics)Reset the local connectivity requirement – each data sample should have complete confidence in at least one 1-simplex in the simplicial set. We can enforce this by locally rescaling confidences, and then remerging the different local simplicial sets together.
A FS set may lose this property when combining multiple views of the source data; this function restores it.
UMAP.reset_local_metrics! — Method
Reset the cardinality of the fuzzy sets (usually a column in a simplicial set) to be approximately log2(k).
This step is necessary after we've combined the simplicial sets for multiple views of the same data.
UMAP.smooth_knn_dists — Method
smooth_knn_dists(dists, k, src_params) -> knn_dists, nn_distsCompute the distances to the nearest neighbors for a continuous value k. Returns the approximated distances to the kth nearest neighbor (knn_dists) and the nearest neighbor (nn_dists) from each point.
Remaining
UMAP.create_config — Function
Utility for creating the configuration structs for UMAP.
UMAP.create_view_config — Function
Utility for creating the view-specific structs for UMAP, i.e. NeighborParams and SourceViewParams.
UMAP.create_view_config — Method
Create NeighborParams, SourceViewParams for a view of the data based on the params previously used to fit this same view. Parameters can be overridden by keyword arguments.
UMAP.create_view_config — Method
Create new NeighborParams, SourceViewParams for a view of the data given the arguments.
UMAP.TargetParams — Type
TargetParams{M, D, I, P}(manifold::M, metric::D, init::I, memb_params::P)Parameters for controlling the target embedding, e.g. the manifold, distance metric, initialization method.
UMAP._EuclideanManifold — Type
A simple, singleton type representing Euclidean space with dimension N. Points in this manifold are N-dimensional vectors.
UMAP.initialize_embedding — Function
initialize_embedding(umap_graph, tgt_params) -> embeddingInitialize the embedding according to tgt_params. Return a list of points (e.g. Vectors) representing the initial embedded dataset. This could be made generic to different target manifolds.
UMAP.initialize_embedding — Method
initialize_embedding(ref_embedding, umap_graph, tgt_params) -> embeddingInitialize an embedding of points corresponding to the columns of the umap_graph, by taking weighted mean of the columns of ref_embedding, where weights are values from the columns of umap_graph.
UMAP.spectral_layout — Method
spectral_layout(graph, embed_dim) -> embeddingInitialize the graph layout with spectral embedding.
UMAP.MembershipFnParams — Type
MembershipFnParams{T}(min_dist, spread)A smooth approximation for the membership strength of a 1-simplex between two points x, y, can be given by the following, with dissimilarity function dist, and constants a, b: ϕ(x, y, dist, a, b) = (1 + a*(dist(x, y))^b)^(-1)
The approximation parameters a, b are chosen by non-linear least squares fitting of the following function ψ:
ψ(x, y, dist, mindist, spread) = dist(x, y) ≤ mindist ? 1 : exp(-(dist(x, y) - min_dist)/spread)
UMAP.fit_ab — Method
fit_ab(min_dist, spread) -> a, bFind a smooth approximation to the membership function of points embedded in ℜᵈ. This fits a smooth curve that approximates an exponential decay offset by min_dist, returning the parameters (a, b).
UMAP.OptimizationParams — Type
OptimizationParams(n_epochs, learning_rate, repulsion_strength, neg_sample_rate)Parameters for controlling the optimization process.
UMAP._optimize_embedding! — Method
_optimize_embedding!(embedding, ref_embedding, umap_graph, tgt_params, opt_params; move_ref)Optimize the embedding for one epoch, calculating the distances between neighbors and updating the embedding via gradient descent. If embedding and ref_embedding are the same object, then this is fitting a new embedding. Otherwise, ref_embedding is the result of a previous call to fit, and we are transforming new data.
In both cases, the dimensions of umap_graph have to match embedding and ref_embedding: umapgraph in R^{n, m}, embedding length m, refembedding length n.
This optimizes the default case, where the embeddings are in vector of vectors format, the target manifold is a Euclidean manifold of some dimension N, and the target metric is squared euclidean.
UMAP.cross_entropy — Method
cross_entropy(umap_graph, embedding, tgt_params)Calculate the fuzzy cross entropy loss of the embedding for this umapgraph. (NOTE: μ(x, y) = umapgraph[x, y]; ν(x, y) = )
UMAP.target_metric — Function
target_metric(tgt_params, x, y) -> dist, grad_dist_x, grad_dist_yCalculate the distance between x and y on the manifold tgt_params.manifold according to tgt_params.metric as well as the gradient of that distance with respect to x and y.
UMAP.update_embedding_neg! — Method
Calculate the gradients of the negative 1-simplices in the simplicial set, and update the embeddings. This assumes embedded in R^d with the squared euclidean metric.
UMAP.update_embedding_pos! — Method
Calculate the gradients of the positive 1-simplices in the simplicial set, and update the embeddings. This assumes embedded in R^d with the squared euclidean metric.
UMAP.trustworthiness — Method
trustworthiness(X, X_embed, n_neighbors, metric) -> [0, 1]Compute the trustworthiness of an embedding X_embed compared to X.
https://scikit-learn.org/stable/modules/generated/sklearn.manifold.trustworthiness.html
Index
UMAP.SMOOTH_K_TOLERANCEUMAP.DescentNeighborsUMAP.MembershipFnParamsUMAP.NeighborParamsUMAP.OptimizationParamsUMAP.PrecomputedNeighborsUMAP.SourceGlobalParamsUMAP.SourceViewParamsUMAP.TargetParamsUMAP._EuclideanManifoldUMAP._knn_from_distsUMAP._mix_valuesUMAP._norm_sparseUMAP._optimize_embedding!UMAP._reset_fuzzy_set_cardinalityUMAP._ϕUMAP.coalesce_viewsUMAP.compute_membership_strengthsUMAP.create_configUMAP.create_view_configUMAP.create_view_configUMAP.create_view_configUMAP.cross_entropyUMAP.fit_abUMAP.fuzzy_simplicial_setUMAP.fuzzy_simplicial_setUMAP.fuzzy_simplicial_setUMAP.fuzzy_simplicial_setUMAP.fuzzy_simplicial_setUMAP.fuzzy_simplicial_setUMAP.general_simplicial_set_intersectionUMAP.general_simplicial_set_unionUMAP.initialize_embeddingUMAP.initialize_embeddingUMAP.knn_searchUMAP.knn_searchUMAP.knn_searchUMAP.knn_searchUMAP.knn_searchUMAP.knn_searchUMAP.knn_searchUMAP.knn_searchUMAP.reset_local_connectivityUMAP.reset_local_metrics!UMAP.smooth_knn_distsUMAP.spectral_layoutUMAP.target_metricUMAP.trustworthinessUMAP.update_embedding_neg!UMAP.update_embedding_pos!