Episode Summary: Leo Boytsov, a PhD researcher from the Language Technologies Institute of Carnegie Mellon University, talks about fast approximate search in modern information retrieval. We discuss the curse of dimensionality, hard-to-beat baselines and NMSLIB, Leo's super fast library for nearest-neighbour search. How does NMSLIB compare to Facebook's FAISS and Spotify's Annoy? Warning: very technical.
Links & resources:
NMSLIB: Leonid's Non-Metric Space Library (NMSLIB) for efficient similarity search.
Erik Bernhardsson's excellent repo for benchmarking kNN search libraries.
My original 2014 benchmark series of Python libraries for nearest neighbour search in high dimensional spaces.
Off the Beaten Path: Let’s Replace Term-Based Retrieval with k-NN Search (YouTube): Leo's talk at the Allen Institute for Artificial Intelligence (AI2).
PyFastPFor: Fast integer compression—for anyone who needs fast integer compression :-)
A few extra links Leo wanted you to see:
Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors. D Baranchuk, A Babenko, Y Malkov. arxiv.org, 2018.
Combining Retrieval, Statistics, and Inference to Answer Elementary Science Questions. P Clark, O Etzioni, T Khot, A Sabharwal, O Tafjord… aaai.org, 2017.
Revisiting unreasonable effectiveness of data in deep learning era. C Sun, A Shrivastava, S Singh. arxiv.org, 2017.