Hey everyone! Here’s a short reflection on what I set out to do last month and how it panned out.
My agenda for last month was to complete my normalization PR, finish my doc2vec/word2vec warning PR, code two modules required by the topic coherence API, and resolve any other bugs I encountered in the process.
The normalization PR has been completed and merged into gensim. Users now have explicit control over which normalization to apply and can normalize documents in-place by creating a “normalization model”.
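To make the idea concrete, here’s a minimal sketch of what L1 and L2 normalization do to a single bag-of-words document. It’s plain Python, not the code that was merged into gensim, and the helper name `normalize_bow` is made up for illustration. It also returns a new list instead of modifying the document in-place, just to keep the example short.

```python
import math


def normalize_bow(bow, norm="l2"):
    """Return a normalized copy of a bag-of-words document.

    `bow` is a list of (token_id, weight) pairs. `normalize_bow` is a
    hypothetical helper for illustration, not part of gensim.
    """
    if norm == "l1":
        # L1 norm: sum of absolute weights.
        length = sum(abs(weight) for _, weight in bow)
    elif norm == "l2":
        # L2 norm: square root of the sum of squared weights.
        length = math.sqrt(sum(weight * weight for _, weight in bow))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")

    if length == 0:
        # An empty or all-zero document cannot be scaled; return it unchanged.
        return list(bow)
    return [(token_id, weight / length) for token_id, weight in bow]


doc = [(0, 3.0), (1, 4.0)]
l2_doc = normalize_bow(doc, norm="l2")  # weights scaled so their squares sum to 1
l1_doc = normalize_bow(doc, norm="l1")  # weights scaled so they sum to 1
```

Choosing the norm explicitly matters because the two give different document vectors: L2 keeps cosine-style comparisons well behaved, while L1 turns the weights into proportions.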
The word2vec/doc2vec warning PR has also been completed and merged. Previously, users were often unaware of the mistakes they were making while training doc2vec and word2vec models. Warnings are now raised for some common mistakes.
Coming to my topic coherence project: topic coherence is a measure that quantifies how interpretable a set of topics is to humans. It can therefore be used to judge the quality of a topic modelling algorithm by checking the coherence of the topics it produces. The AKSW group of Michael Röder did a nice comparison of various methods for evaluating topic coherence and came up with this wonderful paper. My project is to add the topic coherence pipeline described in this paper to gensim.
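To give a feel for what such a pipeline computes, here’s a rough sketch in plain Python. It loosely follows the four stages the paper describes (segmentation, probability estimation, confirmation measure, aggregation), using a UMass-style confirmation measure that I picked only because it’s short. This is not the code in my PR, and all function names here are made up for illustration.

```python
import math


def segment_pairs(top_words):
    # Segmentation: pair each word with every word ranked before it.
    return [
        (top_words[i], top_words[j])
        for i in range(1, len(top_words))
        for j in range(i)
    ]


def topic_coherence(documents, top_words):
    """UMass-style coherence of one topic (hypothetical helper, not gensim API).

    `documents` is a list of token lists; `top_words` are the topic's top
    words, most probable first. Every top word is assumed to occur in at
    least one document.
    """
    # Probability estimation: boolean document counts.
    doc_sets = [set(doc) for doc in documents]
    doc_count = {w: sum(1 for d in doc_sets if w in d) for w in top_words}

    scores = []
    for w_i, w_j in segment_pairs(top_words):
        if doc_count[w_j] == 0:
            raise ValueError("top word %r never occurs in the corpus" % w_j)
        co_count = sum(1 for d in doc_sets if w_i in d and w_j in d)
        # Confirmation measure: smoothed log conditional probability.
        scores.append(math.log((co_count + 1.0) / doc_count[w_j]))

    # Aggregation: arithmetic mean over all word pairs.
    return sum(scores) / len(scores)


docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["stock", "market"],
    ["stock", "market", "trade"],
    ["cat", "market"],
]
coherent_score = topic_coherence(docs, ["cat", "dog"])
mixed_score = topic_coherence(docs, ["cat", "stock"])
```

On this toy corpus `coherent_score` comes out higher than `mixed_score`, since “cat” and “dog” appear together in documents while “cat” and “stock” never do.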
The topic coherence API is also coming along well. You can check out the open PR here. You can also check out a sample usage of this API here. This is still a work in progress and far from being merged, but it’s starting to take shape. My next step is to benchmark it against Palmetto using some popular datasets such as the English Wikipedia. My next blog post will be dedicated to this very interesting part of my project, so stay tuned for more!
Lastly, thanks a lot to Lev and Radim for their help with this project! It’s been a great experience with Gensim so far!