2016 Student Incubator – Week 1 Implementing Topic Coherence Metrics in Gensim

Here’s my first post as part of the RaRe Technologies Incubator Programme! Over the course of this summer I will be working on (and hopefully improving) the functionality of gensim, an open-source library for topic modelling.

My interest in machine learning and natural language processing started when I took an online course on machine learning by BerkeleyX. I was fascinated by how maths can be used to “learn” a language! Towards the end of last year I also started collaborating with Bhargav Srinivasa (who is currently doing his GSoC with gensim) on building a WhatsApp Chat Analyser, which is hosted on GitHub and is still a work in progress. This year, while searching for tools that I could use out of the box for this project, I came across gensim. I found their work to be stellar! If you’re still here and have been hearing a lot about “Deep Learning” (psst Google vs Lee Sedol), you should definitely check out what gensim can do! Gensim is a very easy-to-use, robust library that works at lightning-fast speeds. So if you’re into natural language processing and haven’t used it yet, I strongly recommend you try it. Seriously, topic modelling can’t get easier than this.

I started work on gensim by making normalization an explicit transformation (issue #69). The problem was that every normalization in gensim was an L2 normalization, applied implicitly without giving the user a choice. My first pull request adds an L1 normalization option and lets the user pick which norm to apply, which makes normalization an “explicit transformation”. It also offers the option of passing a corpus and storing it as a normalized corpus, or of normalizing individual documents in place.
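
To make the difference between the two norms concrete, here is a minimal sketch in plain Python. It is not the code from my pull request: the function name `normalize` and its `norm` parameter are made up for illustration, and the only thing borrowed from gensim is the idea of representing a document as a sparse list of `(id, weight)` pairs.

```python
import math


def normalize(vec, norm="l2"):
    """Scale a sparse (id, weight) vector to unit length under the chosen norm.

    Hypothetical helper for illustration only, not gensim's API.
    """
    if norm == "l1":
        # L1 norm: sum of absolute weights
        length = sum(abs(weight) for _, weight in vec)
    elif norm == "l2":
        # L2 norm: square root of the sum of squared weights
        length = math.sqrt(sum(weight * weight for _, weight in vec))
    else:
        raise ValueError("norm must be 'l1' or 'l2', got %r" % (norm,))
    if length == 0:
        # An all-zero (or empty) vector cannot be scaled; return a copy unchanged
        return list(vec)
    return [(term_id, weight / length) for term_id, weight in vec]


doc = [(0, 3.0), (1, 4.0)]
l1_doc = normalize(doc, norm="l1")  # weights now sum to 1
l2_doc = normalize(doc, norm="l2")  # weights now have Euclidean length 1
```

With L1 the weights of a document add up to 1, so they can be read as proportions; with L2 the vector has unit Euclidean length, which is convenient for cosine similarity. Having the choice exposed is the whole point of the change.
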
My second pull request aims to raise warnings when unexpected input is encountered while using Word2Vec and Doc2Vec. This makes users more aware of what is going on inside the “box” and helps them fix mistakes in how they initialize these models.
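
As a rough illustration of the idea, the sketch below warns about one plausible kind of unexpected input: a plain string passed where a list of tokens is expected. This particular check is my assumption for the example, and `check_sentences` is a hypothetical helper rather than anything in gensim; it only uses Python's standard `warnings` module.

```python
import warnings


def check_sentences(sentences):
    """Return False (and warn) if any sentence is a plain string, else True.

    Hypothetical helper for illustration only, not part of gensim.
    """
    for sentence in sentences:
        if isinstance(sentence, str):
            warnings.warn(
                "Each sentence should be a list of tokens, but a plain string "
                "was found; iterating over it would yield single characters.",
                UserWarning,
            )
            return False
    return True


# Capture the warnings so the example can inspect them instead of printing them
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ok_raw = check_sentences(["this sentence was never tokenized"])
    ok_tokens = check_sentences([["already", "tokenized"]])
```

A warning rather than an exception keeps existing code running while still telling the user that the model is probably not being trained on what they think it is.
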
Once I finish work on these two pull requests, I will move on to my main project: adding a four-stage topic coherence pipeline (more on that later!) to gensim.

A big shout-out to Radim, Lev and Gordon for helping me out with the PRs and giving me the opportunity to work on this project! It’s been a brilliant learning experience so far, and hopefully by the end of this project topic modelling will become even better for humans!
