Internal Interface

The internal interface for UMAP.jl. Some of these may be moved to the public interface as they stabilize.

Internal Interface

Interface

Nearest Neighbors Interface

UMAP.jl allows for an extensible way to find the nearest neighbors of your data. This is controlled by the subtypes of NeighborParams.

UMAP.DescentNeighbors — Type

DescentNeighbors{M, K}(n_neighbors, metric::M, kwargs::K)

Parameters for finding approximate nearest neighbors using NearestNeighborDescent.

source

UMAP.NeighborParams — Type

Structs for parameterizing the knn search step of UMAP.

Subtyping NeighborParams allows for different methods of finding nearest neighbors. The knn_search function will dispatch on the type of knn_params to find the nearest neighbors in the data.

source

UMAP.PrecomputedNeighbors — Type

PrecomputedNeighbors{M}(n_neighbors, dists_or_graphs::M)

Parameters for finding nearest neighbors from precomputed distances. dists_or_graphs can either be a matrix (pairwise distances) or a NearestNeighborDescent.ApproximateKNNGraph.

source

The neighbor parameters control dispatch for KNN search functionality, both when fitting and when transforming new data.

UMAP._knn_from_dists — Method

_knn_from_dists(dist_mat::AbstractMatrix{S}, k::Integer; ignore_diagonal=true) where {S <: Real}

Construct k-nearest neighbors and distances from a distance matrix. If ignore_diagonal is true, the diagonal elements (which are 0) will be ignored when determining the k-nearest neighbors. This is useful when the distance matrix represents pairwise distances of the same set, where the diagonal elements are the distances to themselves and will always be 0.

Returns a tuple of two matrices: the first contains the indices of the k-nearest neighbors, and the second contains the corresponding distances to those neighbors.

source

UMAP.knn_search — Function

knn_search(data, knn_params)
knn_search(data, queries, knn_params, results)

Find the (potentially approximate) k-nearest neighbors for each point in the dataset, according to knnparams. The data may consist of one or more 'views', passed in either directly (for a single view only) or as a NamedTuple, with corresponding knnparams (either an instance of <: NeighborParams or a NamedTuple of such parameters).

For each (dataview, knnparams) pair, two KxN matrices are returned (knns, dists), holding the indices of the k-nearest neighbors and distances to those neighbors for each point in the dataset. When knnparams isa DescentNeighbors, these matrices are computed using NearestNeighborDescent. When knnparams isa PrecomputedNeighbors, the knn matrices are created from user-provided distances.

If data and knnparams are NamedTuples, the returned knnmatrices will also be in a NamedTuple with the same keys as the inputs.

source

UMAP.knn_search — Method

knn_search(data, queries, knn_params::DescentNeighbors, result_knns_dists) -> (knns, dists)

Search for approximate nearest neighbors of queries in data using nndescent. The (knns, dists) are used to reconstruct the original KNN graph.

source

UMAP.knn_search — Method

knn_search(_, __, knn_params::PrecomputedNeighbors, ___)

(TRANSFORM) Get neighbors from a precomputed distance matrix.

source

UMAP.knn_search — Method

knn_search(data, knn_params::DescentNeighbors) -> (knns, dists)

Find approximate nearest neighbors using nndescent.

source

UMAP.knn_search — Method

knn_search(_, knn_params::PrecomputedNeighbors)

(FIT) Get neighbors from a precomputed distance matrix.

source

UMAP.knn_search — Method

knn_search(_, knn_params::PrecomputedNeighbors{M}) where {M <: ApproximateKNNGraph}

Get neighbors from a precomputed KNNGraph. This method is used when the KNN graph has been precomputed using NearestNeighborDescent

source

UMAP.knn_search — Method

knn_search(data::NamedTuple{T}, queries::NamedTuple{T}, knn_params::NamedTuple{T}, result_knns_dists::NamedTuple{T}) -> NamedTuple{T}

(TRANSFORM) Map knn_search over each view of the data, queries, knnparams, and resultknns_dists.

source

UMAP.knn_search — Method

knn_search(data::NamedTuple{T}, knn_params::NamedTuple{T}) -> NamedTuple{T}

(FIT) Map knn_search over each view of the data and corresponding knn_params.

source

Fuzzy Simplicial Sets Interface

The second step of the UMAP algorithm involves calculating a topological representation of the data (potentially from multiple "views"). This behavior is controlled by the following types and functions.

UMAP.SMOOTH_K_TOLERANCE — Constant

SMOOTH_K_TOLERANCE

Tolerance for the smooth k-distance calculation.

source

UMAP.SourceGlobalParams — Type

SourceGlobalParams{T}(mix_ratio)

Parameters for merging the fuzzy simplicial sets for each dataset view into one fuzzy simplicial set, otherwise known as the UMAP graph.

source

UMAP.SourceViewParams — Type

SourceViewParams{T}(set_operation_ratio, local_connectivity, bandwidth)

Struct for parameterizing the representation of the data in the source (original) manifold; i.e. constructing fuzzy simplicial sets of each view of the dataset.

source

These config structs parameterize the fuzzy simplicial sets created for each view, and the global fuzzy simplicial set after combining all views (sometimes called the UMAP graph).

UMAP optimizes the difference between the source fuzzy set representation and a target representation. Therefore, if there are multiple source views, they need to be combined. This is done by a function coalesce_views that combines the view simplicial sets via set union and intersections, controlled by the SourceGlobalParams.

UMAP._mix_values — Method

_mix_values(x, y, ratio)

For ratio in [0, 1], return a modified multiplication of x and y. For ratio = 0, return x For ratio = 1, return y For ratio = 0.5, return x*y

source

UMAP._norm_sparse — Method

_norm_sparse(fsset)

For each column (i.e. each point), normalize the membership values (divide by the maximum). This creates a copy of the matrix.

source

UMAP._reset_fuzzy_set_cardinality — Function

Like smoothknndists, but now operating on an existing fuzzy set edge probabilities.

NOTE: k is fixed to 15 here, which is what the Python UMAP implementation does. How we would set this otherwise isn't clear (the global UMAP graph k value isn't determined.)

source

UMAP.coalesce_views — Function

coalesce_views(view_fuzzy_sets, params)

Merge the fuzzy simplicial sets for each view of the data. This returns a single fuzzy simplicial set - the weighted, undirected UMAP graph - that captures the spatial relationships of the data points approximated by the manifold.

source

UMAP.compute_membership_strengths — Method

compute_membership_strengths(knns, dists, σs, ρs) -> rows, cols, vals

Compute the membership strengths for the 1-skeleton of each fuzzy simplicial set.

source

UMAP.fuzzy_simplicial_set — Function

Construct the UMAP graph, i.e. the global fuzzy simplicial set. This is represented as a symmetric, sparse matrix where each value is the probability that an edge exists between points (row, col).

For multiple views, there will be one UMAP graph per view - later combined via coalesce_views.

For transforming new data, this graph is notably not symmetric.

source

UMAP.fuzzy_simplicial_set — Method

fuzzy_simplicial_set(data, knns_dists, knn_params, src_params)

(Transform) Construct a global fuzzy simplicial set for a single data view, the data argument gives us the number of points in the original dataset.

source

UMAP.fuzzy_simplicial_set — Method

fuzzy_simplicial_set((knns, dists), knn_params, src_params::SourceViewParams)

(Fit) Construct a global fuzzy simplicial set for a single data view.

source

UMAP.fuzzy_simplicial_set — Method

fuzzy_simplicial_set(knns_dists, n_points, knn_params, src_params, combine=true) -> membership_graph::SparseMatrixCSC

Construct the local fuzzy simplicial set of each point represented by its distances to its nearest neighbors, stored in knns and dists, normalizing the distances, and converting the metric space to a simplicial set (a weighted graph).

n_points indicates the total number of points of the original data, while knns contains indices of some subset of those points (ie some subset of 1:n_points). If knns represents neighbors of the elements of some set with itself, then knns should have n_points number of columns. Otherwise, these two values may be different.

If combine is true, use intersections and unions to combine local fuzzy sets of neighbors. The returned graph has size (n_points, size(knns, 2)).

source

UMAP.fuzzy_simplicial_set — Method

fuzzy_simplicial_set(data::NamedTuple{T}, knns_dists::NamedTuple{T}, knn_params::NamedTuple{T}, src_params::NamedTuple{T}) -> NamedTuple{T}

(Transform) Construct a fuzzy simplicial set for each view of the data, returning a named tuple of fuzzy simplicial sets, one for each view.

source

UMAP.fuzzy_simplicial_set — Method

fuzzy_simplicial_set(knns_dists::NamedTuple{T}, knn_params::NamedTuple{T}, src_params::NamedTuple{T}) -> NamedTuple{T}

(Fit) Construct a fuzzy simplicial set for each view of the data, returning a named tuple of fuzzy simplicial sets, one for each view. The keys of the named tuple are the keys of the input NamedTuples.

source

UMAP.general_simplicial_set_intersection — Method

general_simplicial_set_intersection(left_view, right_view, global_params)

Take the weighted intersection of two fuzzy simplicial sets (represented as (sparse) matrices). Since we don't want to completely lose edges that are only present in one of the two sets, we multiply by at least 1e-8. Furthermore, if the same edge in both sets has a strength below 1e-8, these are added together instead of multiplying.

source

UMAP.general_simplicial_set_union — Method

general_simplicial_set_union(left_view, right_view)

Take the union of two global fuzzy simplicial sets.

source

UMAP.reset_local_connectivity — Function

reset_local_connectivity(simplicial_set, reset_local_metrics)

Reset the local connectivity requirement – each data sample should have complete confidence in at least one 1-simplex in the simplicial set. We can enforce this by locally rescaling confidences, and then remerging the different local simplicial sets together.

A FS set may lose this property when combining multiple views of the source data; this function restores it.

source

UMAP.reset_local_metrics! — Method

Reset the cardinality of the fuzzy sets (usually a column in a simplicial set) to be approximately log2(k).

This step is necessary after we've combined the simplicial sets for multiple views of the same data.

source

UMAP.smooth_knn_dists — Method

smooth_knn_dists(dists, k, src_params) -> knn_dists, nn_dists

Compute the distances to the nearest neighbors for a continuous value k. Returns the approximated distances to the kth nearest neighbor (knn_dists) and the nearest neighbor (nn_dists) from each point.

source

Remaining

UMAP.create_config — Function

Utility for creating the configuration structs for UMAP.

source

UMAP.create_view_config — Function

Utility for creating the view-specific structs for UMAP, i.e. NeighborParams and SourceViewParams.

source

UMAP.create_view_config — Method

Create NeighborParams, SourceViewParams for a view of the data based on the params previously used to fit this same view. Parameters can be overridden by keyword arguments.

source

UMAP.create_view_config — Method

Create new NeighborParams, SourceViewParams for a view of the data given the arguments.

source

UMAP.TargetParams — Type

TargetParams{M, D, I, P}(manifold::M, metric::D, init::I, memb_params::P)

Parameters for controlling the target embedding, e.g. the manifold, distance metric, initialization method.

source

UMAP._EuclideanManifold — Type

A simple, singleton type representing Euclidean space with dimension N. Points in this manifold are N-dimensional vectors.

source

UMAP.initialize_embedding — Function

initialize_embedding(umap_graph, tgt_params) -> embedding

Initialize the embedding according to tgt_params. Return a list of points (e.g. Vectors) representing the initial embedded dataset. This could be made generic to different target manifolds.

source

UMAP.initialize_embedding — Method

initialize_embedding(ref_embedding, umap_graph, tgt_params) -> embedding

Initialize an embedding of points corresponding to the columns of the umap_graph, by taking weighted mean of the columns of ref_embedding, where weights are values from the columns of umap_graph.

source

UMAP.spectral_layout — Method

spectral_layout(graph, embed_dim) -> embedding

Initialize the graph layout with spectral embedding.

source

UMAP.MembershipFnParams — Type

MembershipFnParams{T}(min_dist, spread)

A smooth approximation for the membership strength of a 1-simplex between two points x, y, can be given by the following, with dissimilarity function dist, and constants a, b: ϕ(x, y, dist, a, b) = (1 + a*(dist(x, y))^b)^(-1)

The approximation parameters a, b are chosen by non-linear least squares fitting of the following function ψ:

ψ(x, y, dist, mindist, spread) = dist(x, y) ≤ mindist ? 1 : exp(-(dist(x, y) - min_dist)/spread)

source

UMAP.fit_ab — Method

fit_ab(min_dist, spread) -> a, b

Find a smooth approximation to the membership function of points embedded in ℜᵈ. This fits a smooth curve that approximates an exponential decay offset by min_dist, returning the parameters (a, b).

source

UMAP.OptimizationParams — Type

OptimizationParams(n_epochs, learning_rate, repulsion_strength, neg_sample_rate)

Parameters for controlling the optimization process.

source

UMAP._optimize_embedding! — Method

_optimize_embedding!(embedding, ref_embedding, umap_graph, tgt_params, opt_params; move_ref)

Optimize the embedding for one epoch, calculating the distances between neighbors and updating the embedding via gradient descent. If embedding and ref_embedding are the same object, then this is fitting a new embedding. Otherwise, ref_embedding is the result of a previous call to fit, and we are transforming new data.

In both cases, the dimensions of umap_graph have to match embedding and ref_embedding: umapgraph in R^{n, m}, embedding length m, refembedding length n.

This optimizes the default case, where the embeddings are in vector of vectors format, the target manifold is a Euclidean manifold of some dimension N, and the target metric is squared euclidean.

source

UMAP._ϕ — Method

A smooth approximation for the membership strength of the 1-simplex between two points x, y.

source

UMAP.cross_entropy — Method

cross_entropy(umap_graph, embedding, tgt_params)

Calculate the fuzzy cross entropy loss of the embedding for this umapgraph. (NOTE: μ(x, y) = umapgraph[x, y]; ν(x, y) = )

source

UMAP.target_metric — Function

target_metric(tgt_params, x, y) -> dist, grad_dist_x, grad_dist_y

Calculate the distance between x and y on the manifold tgt_params.manifold according to tgt_params.metric as well as the gradient of that distance with respect to x and y.

source

UMAP.update_embedding_neg! — Method

Calculate the gradients of the negative 1-simplices in the simplicial set, and update the embeddings. This assumes embedded in R^d with the squared euclidean metric.

source

UMAP.update_embedding_pos! — Method

Calculate the gradients of the positive 1-simplices in the simplicial set, and update the embeddings. This assumes embedded in R^d with the squared euclidean metric.

source

UMAP.trustworthiness — Method

trustworthiness(X, X_embed, n_neighbors, metric) -> [0, 1]

Compute the trustworthiness of an embedding X_embed compared to X.

https://scikit-learn.org/stable/modules/generated/sklearn.manifold.trustworthiness.html

source

Index

UMAP.SMOOTH_K_TOLERANCE
UMAP.DescentNeighbors
UMAP.MembershipFnParams
UMAP.NeighborParams
UMAP.OptimizationParams
UMAP.PrecomputedNeighbors
UMAP.SourceGlobalParams
UMAP.SourceViewParams
UMAP.TargetParams
UMAP._EuclideanManifold
UMAP._knn_from_dists
UMAP._mix_values
UMAP._norm_sparse
UMAP._optimize_embedding!
UMAP._reset_fuzzy_set_cardinality
UMAP._ϕ
UMAP.coalesce_views
UMAP.compute_membership_strengths
UMAP.create_config
UMAP.create_view_config
UMAP.create_view_config
UMAP.create_view_config
UMAP.cross_entropy
UMAP.fit_ab
UMAP.fuzzy_simplicial_set
UMAP.fuzzy_simplicial_set
UMAP.fuzzy_simplicial_set
UMAP.fuzzy_simplicial_set
UMAP.fuzzy_simplicial_set
UMAP.fuzzy_simplicial_set
UMAP.general_simplicial_set_intersection
UMAP.general_simplicial_set_union
UMAP.initialize_embedding
UMAP.initialize_embedding
UMAP.knn_search
UMAP.knn_search
UMAP.knn_search
UMAP.knn_search
UMAP.knn_search
UMAP.knn_search
UMAP.knn_search
UMAP.knn_search
UMAP.reset_local_connectivity
UMAP.reset_local_metrics!
UMAP.smooth_knn_dists
UMAP.spectral_layout
UMAP.target_metric
UMAP.trustworthiness
UMAP.update_embedding_neg!
UMAP.update_embedding_pos!

Internal Interface

Contents

Interface

Nearest Neighbors Interface

Fuzzy Simplicial Sets Interface

Remaining

Index