SWINN¶

Sliding WIndow-based Nearest Neighbor (SWINN) search using Graphs.

Extends the NNDescent algorithm¹ to handle vertex addition and removal in a FIFO data ingestion policy. SWINN builds and keeps a directed graph where edges connect the nearest neighbors. Any distance metric can be used to build the graph. By using a directed graph, the user must set the desired number of neighbors. More neighbors imply more accurate search queries at the cost of increased running time and memory usage. Note that although the number of directed neighbors is limited by the user, there is no direct control on the number of reverse neighbors, i.e., the number of vertices that have an edge to a given vertex.

The basic idea of SWINN and NNDescent is that "the neighbor of my neighbors might as well be my neighbor". Hence, the connections are constantly revisited to improve the graph structure. The algorithm for creating and maintaining the search graph can be described in general lines as follows:

Start with a random neighborhood graph;
For each node in the search graph: refine the current neighborhood by checking if there are better neighborhood options among the neighbors of the current neighbors;
If the total number of neighborhood changes is smaller than a given stopping criterion, then stop.

SWINN adds strategies to remove vertices from the search graph and pruning redundant edges. SWINN is more efficient when the selected maxlen is greater than 500. For small sized data windows, using the lazy/exhaustive search, i.e., neighbors.LazySearch might be a better idea.

Parameters¶

graph_k

Type → int

Default → 20

The maximum number of direct nearest neighbors each node has.
dist_func

Type → DistanceFunc | FunctionWrapper | None

Default → None

The distance function used to compare two items. If not set, use the Minkowski distance with p=2.
maxlen

Type → int

Default → 1000

The maximum size of the data buffer.
warm_up

Type → int

Default → 500

How many data instances to observe before starting the search graph.
max_candidates

Type → int | None

Default → None

The maximum number of vertices to consider when performing local neighborhood joins. If not set SWINN will use min(50, max(50, self.graph_k)).
delta

Type → float

Default → 0.0001

Early stop parameter for the neighborhood refinement procedure. NNDescent will stop running if the maximum number of iterations is reached or the number of edge changes after an iteration is smaller than or equal to delta * graph_k * n_nodes. In the last expression, n_nodes refers to the number of graph nodes involved in the (local) neighborhood refinement.
prune_prob

Type → float

Default → 0.0

The probability of removing redundant edges. Must be between 0 and 1. If set to zero, no edge will be pruned. When set to one, every potentially redundant edge will be dropped.
n_iters

Type → int

Default → 10

The maximum number of NNDescent iterations to perform to refine the search index.
seed

Type → int | None

Default → None

Random seed for reproducibility.

Methods¶

append

Add a new item to the search index.

Data is stored using the FIFO strategy. Both the data buffer and the search graph are updated. The addition of a new item will trigger the removal of the oldest item, if the maximum size was reached. All edges of the removed node are also dropped and safety procedures are applied to ensure its neighbors keep accessible. The addition of a new item also trigger local neighborhood refinement procedures, to ensure the search index is effective and the node degree constraints are met.

Parameters

item — 'typing.Any'
kwargs

connectivity

Get a list with the size of each connected component in the search graph.

This metric provides an overview of reachability in the search index by using Kruskal's algorithm to build a forest of connected components. We want our search index to have a single connected component, i.e., the case where we get a list containing a single number which is equal to maxlen. If that is not the case, not every node in the search graph can be reached from any given starting point. You may want to try increasing graph_k to improve connectivity. However, keep in mind the following aspects: 1) computing this metric is a costly operation (\(O(E\log V)\)), where \(E\) and \(V\) are, respectively, the number of edges and vertices in the search graph; 2) often, connectivity comes at the price of increased computational costs. Tweaking the sample_rate might help in such situations. The best possible scenario is to decrease the value of graph_k while keeping a single connected component.

Returns

list[int]: A list of the number of elements in each connected component of the graph.

search

Search the underlying nearest neighbor graph given a query item.

In case not enough samples were observed, i.e., the number of stored samples is smaller than warm_up, then the search switches to a brute force strategy.

Parameters

item — 'typing.Any'
n_neighbors — 'int'
epsilon — 'float' — defaults to 0.1
kwargs

Returns

tuple[list, list]: neighbors, dists

Notes¶

There is an accuracy/speed trade-off between graph_k and sample_rate. To ensure a single connected component, and thus an effective search index, one can increase graph_k. The connectivity method is a helper to determine whether the search index has a single connected component. However, search accuracy might come at the cost of increased memory usage and slow processing. To alleviate that, one can rely on decreasing the sample_rate to avoid exploring all the undirected edges of a node during search queries and local graph refinements. Moreover, the edge pruning procedures also help decreasing the computational costs. Note that, anything that limits the number of explored neighbors or prunes edges might have a negative impact on search accuracy.

Dong, W., Moses, C., & Li, K. (2011, March). Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th international conference on World wide web (pp. 577-586). ↩