This is my first post as part of Google Summer of Code 2017 working with Gensim. I will be working on the project ‘Gensim integration with scikit-learn and Keras’ this summer. I stumbled upon Gensim while working on a project that used the Word2Vec model. I was looking for functionality to suggest words semantically similar to a given input word and Gensim’s …
Text Summarization in Python: Extractive vs. Abstractive techniques revisited
This blog post is a gentle introduction to text summarization and can serve as a practical summary of the current landscape. It describes how we, a team of three students in the RaRe Incubator programme, experimented with existing algorithms and Python tools in this domain. We compare modern extractive methods like LexRank, LSA, Luhn and Gensim’s existing TextRank summarization module …
WordRank embedding: “crowned” is most similar to “king”, not word2vec’s “Canute”
Comparisons to Word2Vec and FastText with TensorBoard visualizations. With various embedding models coming up recently, it can be difficult to choose one. Should you simply go with the ones widely used in the NLP community, such as Word2Vec, or is it possible that some other model could be more accurate for your use case? There are some evaluation metrics …
Topic Modelling with Latent Dirichlet Allocation: How to pre-process data and tune your model. New tutorial.
If you’ve learned how to train topic models in Gensim but aren’t able to get satisfying results, we have a new tutorial on GitHub that will help you get on the right track. Primarily, you will learn about pre-processing text data for the LDA model. You will also get some tips about how to set the parameters …
Author-topic models: why I am working on a new implementation
Author-topic models promise to give data scientists a tool to simultaneously gain insight about authorship and content in terms of latent topics. The model is closely related to Latent Dirichlet Allocation (LDA). Basically, each author can be associated with multiple documents, and each document can be attributed to multiple authors. The model learns topic representations for each author, so that …
Three Sprints in India (To Say Nothing of PyCon)
I was very happy to visit India this October to run three Gensim coding sprints, give workshops and attend the PyCon India conference. Many thanks to our Incubator programme student Devashish Deshpande for being my host. PyCon India was a very friendly event of 500 attendees, with workshops on Friday and conference talks on Saturday and Sunday. My favorite PyCon moment was the keynote …
More topic coherence use-cases
Recently, while doing some topic modelling, I encountered a few problems, such as: How can the topic coherence (TC) pipeline be used with other topic models (e.g. HDP)? How do you find the optimal number of topics for LDA? LSI is brilliant since it ranks its topics; can LDA do that too? If you face such problems, this blog might be able …
Dynamic NMF and Dynamic Topics
While hunting for a data set to try my DTM Python port, I came across this paper and this repository. The paper itself was quite an interesting read and analysed trends of topics in the European Parliament, but what caught my attention was the algorithm they used to perform this analysis, which they called Dynamic Non-Negative Matrix Factorisation (NMF). The …
Validating gensim’s topic coherence pipeline
Sorry for not posting in such a long while. It has been a turbulent few weeks, with some sharp twists and turns involving mails flying back and forth and a few pivots here and there. To validate the topic coherence pipeline in gensim, my plan was to work with the RTL-Wiki corpus and reproduce the results stated in the paper. …
The craziness that is Dynamic Topic Models
Every week, I’d end up having ‘fit DTM’ as my weekly goal. And I would try, converting the C++ GSL code line by line, only to have it fail miserably and fall back on me. (You can see my gripe about it in my live blog here.) The task in itself was quite straightforward – rewrite the Dynamic Topic Model code originally written by …