optimization | RARE Technologies

Word2vec Tutorial

Radim Řehůřek 2014-02-02 gensim, programming 158 Comments

I never got round to writing a tutorial on how to use word2vec in gensim. It’s simple enough and the API docs are straightforward, but I know some people prefer more verbose formats. Let this post be a tutorial and a reference example.

Performance Shootout of Nearest Neighbours: Querying

Radim Řehůřek 2014-01-12 gensim, programming 38 Comments

Previous posts explained the whys & whats of nearest-neighbour search, the available OSS libraries and Python wrappers. We converted the English Wikipedia to vector space, to be used as our testing dataset for retrieving “similar articles”. In this post, I finally get to some hard performance numbers, plus a live demo near the end.
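For readers who want a concrete picture of what "retrieving similar articles" means in a vector space, here is a minimal sketch of the exhaustive (brute-force) baseline that approximate nearest-neighbour libraries are usually measured against. It is not code from the benchmark. The article ids, the three-dimensional toy vectors, and the helper names `cosine_similarity` and `most_similar` are all made up for illustration, and cosine similarity is assumed as the measure.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def most_similar(query, index, topn=2):
    """Scan every vector in `index` and return the topn (id, similarity) pairs, best first."""
    scored = ((doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items())
    return heapq.nlargest(topn, scored, key=lambda pair: pair[1])


# Hypothetical toy "article vectors"; real ones would have hundreds of dimensions.
index = {
    "article_a": [1.0, 0.0, 0.0],
    "article_b": [0.9, 0.1, 0.0],
    "article_c": [0.0, 1.0, 0.0],
    "article_d": [0.0, 0.0, 1.0],
}

results = most_similar([1.0, 0.0, 0.0], index, topn=2)
for doc_id, similarity in results:
    print(doc_id, round(similarity, 4))
```

The scan touches every stored vector on each query, so its cost grows linearly with the size of the collection. That linear cost is what makes indexing libraries worth benchmarking in the first place.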

Performance Shootout of Nearest Neighbours: Contestants

Radim Řehůřek 2013-12-08 gensim, programming 12 Comments

Continuing the benchmark of libraries for nearest-neighbour similarity search, part 2. What is the best software out there for similarity search in high dimensional vector spaces?

Document Similarity @ English Wikipedia

I’m not very fond of benchmarks on artificial datasets, and similarity search in particular is sensitive to actual data densities and distance profiles. Using fake “random gaussian datasets” seemed …

Parallelizing word2vec in Python

Radim Řehůřek 2013-10-04 gensim, programming 21 Comments

The final instalment on optimizing word2vec in Python: how to make use of multicore machines. You may want to read Part One and Part Two first.

Word2vec in Python, Part Two: Optimizing

Radim Řehůřek 2013-09-21 gensim, programming 46 Comments

Last weekend, I ported Google’s word2vec into Python. The result was clean, concise and readable code that plays well with other Python NLP packages. One problem remained: the performance was 20x slower than the original C code, even after all the obvious NumPy optimizations.