New Gensim feature: Author-topic modeling. LDA with metadata.

Ólavur Mortensen gensim

The author-topic model is an extension of Latent Dirichlet Allocation that allows data scientists to build topic representations of attached author labels. These author labels can represent any kind of discrete metadata attached to documents, for example, tags on posts on the web. In December of 2016, I wrote a blog post explaining that a Gensim implementation was on its …

Author Topic Model

Author-topic models: why I am working on a new implementation

Ólavur Mortensen gensim, Machine Learning, Open Source, programming, Student Incubator

Author-topic models promise to give data scientists a tool to simultaneously gain insight about authorship and content in terms of latent topics. The model is closely related to Latent Dirichlet Allocation (LDA). Basically, each author can be associated with multiple documents, and each document can be attributed to multiple authors. The model learns topic representations for each author, so that …


FastText and Gensim word embeddings

Jayant Jain gensim

Facebook Research open sourced a great project recently – fastText, a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious about comparing these embeddings to other commonly used embeddings, so word2vec seemed like the obvious choice, especially considering fastText embeddings are an extension of word2vec. The main goal of the Fast Text …

Pycon Entrance 2

Pycon 2016 and Gensim Sprint Recap

Lev Konstantinovskiy gensim, Machine Learning, PyCon 2 Comments

Our team was on site representing RaRe Technologies and Gensim at this year’s PyCon 2016 hosted in Portland, Oregon, from May 28th to June 5th. It was a packed, outright massive event of over 3000 attendees which included two days of focused tutorials, sponsor workshops and talks from some of the industry’s renowned experts. RaRe was a sponsor of the …


Doc2vec tutorial

Radim Rehurek gensim, programming 87 Comments

The latest gensim release of 0.10.3 has a new class named Doc2Vec. All credit for this class, which is an implementation of Quoc Le & Tomáš Mikolov: “Distributed Representations of Sentences and Documents”, as well as for this tutorial, goes to the illustrious Tim Emerick. Doc2vec (aka paragraph2vec, aka sentence embeddings) modifies the word2vec algorithm to unsupervised learning of continuous …