Jason Tian

Suck out the marrow of data

K Dimension Tree and K Nearest Neighbors

K Dimension Tree

In some cases we may be not interested in the question “is X in this container,” but rather “what value in the container is X most similar to?” At a high level, a kd-tree is a generalization of a binary search tree that stores points in k-dimensional space.

This is quoted from this nice introduction, and you can find more detailed interpretation in it. Here I just excerpt some contents from it.

When we just consider the balanced K-D tree, we always use median value to be split boundaries.

Nearest-Neighbor

  • The first target should be in the smallest cell and we randomly choose one point in that cell, then we can construct a candidate hy- persphere centered at the test point and running through the guess point. The nearest neighbor to the test point must lie inside this hypersphere.
  • If this hypersphere is fully to one side of a splitting hyperplane, then all points on the other side of the splitting hyperplane cannot be contained in the sphere and thus cannot be the nearest neighbor.
  • To determine whether the candidate hypersphere crosses a splitting hyperplane that com- pares coordinate i, we check whether \(|b_i – a_i|<r\).
  • If this hypersphere crosses some boundaries, we need to go back to the parent node (may go up to multiple upleavls). Then get a point to compute the distance and compare it with the previous distance.

This algorithm can be shown to run in O(log n) time on a balanced kd-tree with n data points provided that those points are randomly distributed. In the worst case, though, the entire tree might have to be searched. However, in low-dimensional spaces, such as the Cartesian plane or three-di- mensional space, this is rarely the case.

k-Nearest Neighbor Searches

We will apply K-D tree with a bounded priority queue (or BPQ for short) to tackle this problem. A bounded priority queue is similar to a regular priority queue, except that there is a fixed upper bound on the number of elements that can be stored in the BPQ. When- ever a new element is added to the queue, if the queue is at capacity, the element with the highest priority value is ejected from the queue.