Chinmaya’s Google Summer of Code 2017 Live-Blog : a Chronicle of Integrating Gensim with scikit-learn and Keras

Chinmaya Pancholi gensim, Student Incubator

1st August, 2017 In the last two weeks, I worked mainly on updating and adding the sklearn API for models in Gensim and on updating tests in shorttext. I was not able to post a blog entry last week because my college semester has started and I was travelling at the time. In PR #1473, I removed the BaseTransformer class and refactored the …

Parul’s Google Summer of Code 2017 Live-Blog : a chronicle of adding training and topic visualizations in gensim

Parul Sethi gensim, Student Incubator

2nd August 2017 PR-1484 is nearly complete. It adds the dendrogram visualization which I talked about in my last post. I added a parameter to define text annotations on the upper hierarchy levels as well, which lets the user see the common and differing terms at the cluster heads, which are made up of the topics in the leaves. It also …


Google Summer of Code 2017 – Performance improvement in Gensim and fastText

Prakhar Pratyush gensim, Student Incubator

July 20, 2017 This week, I’ve mostly worked on implementing native unsupervised fastText (PR #1482) in gensim. It’s quite challenging, as I had to look into the fastText C++ code and read the research paper to properly understand how it works, and then figure out the similarities with the word2vec code. After lots of discussion with mentors, we …

Google Summer of Code 2017 – Week 1 of Integrating Gensim with scikit-learn and Keras

Chinmaya Pancholi gensim, Student Incubator

This is my first post as part of Google Summer of Code 2017 working with Gensim. I will be working on the project ‘Gensim integration with scikit-learn and Keras’ this summer. I stumbled upon Gensim while working on a project which utilized the Word2Vec model. I was looking for functionality to suggest words semantically similar to a given input word and Gensim’s …
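As a rough illustration of what "suggesting semantically similar words" means, the sketch below ranks words by cosine similarity between their vectors. The vectors are invented toy values and `most_similar` here is a hypothetical stand-in written for this example; it is not Gensim's implementation, and real Word2Vec embeddings are learned from a corpus.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only; real Word2Vec
# embeddings are learned from text and have far more dimensions.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
    "pear": [0.15, 0.1, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=2):
    """Hypothetical helper: rank all other words by similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


if __name__ == "__main__":
    print(most_similar("king", toy_vectors, topn=1))
```

With these toy values, the nearest neighbour of "king" is "queen", because their vectors point in almost the same direction.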


Text Summarization in Python: Extractive vs. Abstractive techniques revisited

Pranay, Aman and Aayush gensim, Student Incubator, summarization

This blog is a gentle introduction to text summarization and can serve as a practical summary of the current landscape. It describes how we, a team of three students in the RaRe Incubator programme, have experimented with existing algorithms and Python tools in this domain. We compare modern extractive methods like LexRank, LSA, Luhn and Gensim’s existing TextRank summarization module …


Gensim switches to semantic versioning

Lev Konstantinovskiy gensim, Open Source

Starting with release 1.0.0, Gensim adopts semantic versioning. The time went by in a flash, but Gensim has reached maturity. It has been cited in nearly 500 academic papers, used commercially in dozens of companies, featured in many coding sprints and meetups, and has generally withstood the test of time. Between the continued Gensim support by our parent company, and our open Student ...
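For readers unfamiliar with semantic versioning, here is a minimal sketch of how MAJOR.MINOR.PATCH version strings can be parsed and compared. The helper names and every version number other than 1.0.0 are hypothetical, chosen only for illustration; pre-release and build suffixes are not handled.

```python
def parse_version(text):
    """Split a plain 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def is_breaking_upgrade(old, new):
    """Under semantic versioning, a higher MAJOR number signals
    backwards-incompatible changes."""
    return parse_version(new)[0] > parse_version(old)[0]


if __name__ == "__main__":
    # Hypothetical version numbers, not actual Gensim releases.
    print(is_breaking_upgrade("1.0.0", "1.1.0"))
    print(is_breaking_upgrade("1.0.0", "2.0.0"))
```

Comparing the parsed tuples rather than the raw strings matters: as text, "1.10.0" sorts before "1.9.0", while as tuples (1, 10, 0) correctly comes after (1, 9, 0).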

WordRank embedding: “crowned” is most similar to “king”, not word2vec’s “Canute”

Parul Sethi gensim, Student Incubator

Comparisons to Word2Vec and FastText with TensorBoard visualizations. With various embedding models coming up recently, it can be difficult to choose one. Should you simply go with the ones widely used in the NLP community, such as Word2Vec, or is it possible that some other model could be more accurate for your use case? There are some evaluation metrics …


New Gensim feature: Author-topic modeling. LDA with metadata.

Ólavur Mortensen gensim

The author-topic model is an extension of Latent Dirichlet Allocation that allows data scientists to build topic representations of attached author labels. These author labels can represent any kind of discrete metadata attached to documents, for example, tags on posts on the web. In December of 2016, I wrote a blog post explaining that a Gensim implementation was on its …


Topic Modelling with Latent Dirichlet Allocation: How to pre-process data and tune your model. New tutorial.

Ólavur Mortensen gensim, Machine Learning, Open Source, programming, Student Incubator

If you’ve learned how to train topic models in Gensim, but aren’t able to get satisfying results, then we have a new tutorial on GitHub that will help you get on the right track. Primarily, you will learn some things about pre-processing text data for the LDA model. You will also get some tips about how to set the parameters …

Author-topic models: why I am working on a new implementation

Ólavur Mortensen gensim, Machine Learning, Open Source, programming, Student Incubator

Author-topic models promise to give data scientists a tool to simultaneously gain insight about authorship and content in terms of latent topics. The model is closely related to Latent Dirichlet Allocation (LDA). Basically, each author can be associated with multiple documents, and each document can be attributed to multiple authors. The model learns topic representations for each author, so that …
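The many-to-many relationship between authors and documents described above can be sketched as a pair of mappings. The data below is invented, and `invert_author2doc` is a hypothetical helper written for this illustration rather than part of any library.

```python
from collections import defaultdict

# Hypothetical toy data: author name -> indices of the documents they wrote.
author2doc = {
    "alice": [0, 1],
    "bob": [1, 2],
    "carol": [2],
}


def invert_author2doc(author2doc):
    """Build the reverse mapping: document index -> list of its authors."""
    doc2author = defaultdict(list)
    for author, doc_ids in author2doc.items():
        for doc_id in doc_ids:
            doc2author[doc_id].append(author)
    # Return a plain dict with sorted keys and author lists for stable output.
    return {
        doc_id: sorted(authors)
        for doc_id, authors in sorted(doc2author.items())
    }


doc2author = invert_author2doc(author2doc)

if __name__ == "__main__":
    print(doc2author)
```

Here document 1 has two authors and "alice" has two documents, so neither side of the relationship is one-to-one.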