Carnegie Mellon University
Browse

Efficient and Accurate Non-Metric k-NN Search with Applications to Text Matching

Download (4.38 MB)
thesis
posted on 2022-12-02, 20:44 authored by Leonid BoytsovLeonid Boytsov

In this thesis we advance state-of-the-art of the non-metric k-NN search by carrying out an extensive empirical evaluation (both and intrinsic) of generic methods for k-NN search. This work contributes to establishing a collection of strong benchmarks for data sets with generic distances.

We start with intrinsic evaluations and demonstrate that non metric k-NN search is a practical and reasonably accurate tool for a wide variety of complex distances. However, somewhat surprisingly, achieving good performance does not require distance mapping/proxying via metric learning or distance symmetrization. Existing search methods can often work directly with non-metric and non-symmetric distances. They outperform the filter-and-refine approach relying on the distance symmetrization in the filtering step. 

Intrinsic evaluations are complemented with extrinsic evaluations in a realistic text retrieval task. In doing so, we make a step towards replacing/complementing classic term-based retrieval with a generic k-NN search algorithm. To this end we use a similarity function that takes into account subtle term associations, which are learned from a parallel monolingual corpus. An exact brute-force k-NN search using this similarity function is quite slow. We show that an approximate search algorithm can be 100-300 times faster at the expense of only a small loss in accuracy (10%). On one data set, a retrieval pipeline using an approximate k-NN search is twice as efficient as the C++ baseline while being as accurate as the Lucene-based fusion pipeline. We note, however, that it is necessary to compare our methods against more recent ranking algorithms. 

The thesis is concluded with a summary of learned lessons and open research questions (relevant to this work). We also discuss potential challenges facing a retrieval system designer. 

History

Date

2018-05-17

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Dr. Eric Nyberg

Usage metrics

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC