One year ago, Tomáš Mikolov (together with his colleagues at Google) made some ripples by releasing word2vec, an unsupervised algorithm for learning the meaning behind words. In this blog post, I’ll evaluate some extensions that have appeared over the past year, including GloVe and matrix factorization via SVD.
The latest gensim release, 0.10.3, has a new class named Doc2Vec. All credit for this class, which is an implementation of Quoc Le & Tomáš Mikolov: “Distributed Representations of Sentences and Documents”, as well as for this tutorial, goes to the illustrious Tim Emerick. Doc2vec (aka paragraph2vec, aka sentence embeddings) modifies the word2vec algorithm to unsupervised learning of continuous …
Latent Dirichlet Allocation (LDA), one of the most used modules in gensim, has received a major performance revamp recently. The new LdaMulticore class now uses all your machine cores at once, so chances are it is limited by the speed at which you can feed it input data. Make sure your CPU fans are in working order!
There are tools and concepts in computing that are very powerful but potentially confusing even to advanced users. One such concept is data streaming (aka lazy evaluation), which can be realized neatly and natively in Python. Do you know when and how to use generators, iterators and iterables?
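To make the distinction concrete, here is a minimal sketch in plain Python. The names `Corpus` and `token_stream` and the two sample lines are made up for illustration and do not come from gensim or from the post. It shows the one difference that tends to trip people up: a generator can be consumed only once, while an iterable hands out a fresh iterator on every pass.

```python
class Corpus:
    """Iterable: every call to __iter__ starts a fresh pass over the data."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        # A new generator is created each time the object is iterated over.
        for line in self.lines:
            yield line.lower().split()


def token_stream(lines):
    """Generator function: the generator it returns can be consumed only once."""
    for line in lines:
        yield line.lower().split()


# Hypothetical sample data, standing in for a file too large to fit in RAM.
lines = ["Human machine interface", "Graph minors survey"]

gen = token_stream(lines)
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # the generator is already exhausted

corpus = Corpus(lines)
pass_one = list(corpus)  # first full pass
pass_two = list(corpus)  # second full pass, same content again
```

An algorithm that needs several passes over its input would want something like `Corpus`; a one-shot generator is enough when the data is read exactly once.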
MALLET, “MAchine Learning for LanguagE Toolkit”, is a brilliant software tool. Unlike gensim, “topic modelling for humans”, which uses Python, MALLET is written in Java and spells “topic modeling” with a single “l”. Dandy.
The end of the year is proving crazy busy as usual, but gensim acquired a cool new feature that I just had to blog about. Ben Trahan sent a patch that allows automatic tuning of Latent Dirichlet Allocation (LDA) hyperparameters in gensim. This means that an optimal, asymmetric alpha can now be trained directly from your data.
Gensim, the machine learning library for unsupervised learning I started in late 2008, will be celebrating its fifth anniversary this November. Time to reminisce and mull over its successes and failures 🙂