Word2Vec became popular mainly because of its huge improvement in training speed: it produces high-quality word vectors of much higher dimensionality than the neural network language models widely used at the time. Word2Vec is an unsupervised method that can process potentially huge amounts of data without the need for manual labeling. There is really no limit to the size of a dataset that can …
FastText and Gensim word embeddings
Facebook Research recently open-sourced a great project, fastText: a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious how these embeddings compare to other commonly used embeddings, and word2vec seemed like the obvious choice, especially since fastText embeddings are an extension of word2vec. The main goal of the fastText …
More topic coherence use-cases
Recently, while doing some topic modelling, I encountered a few questions: How do you use the topic coherence (TC) pipeline with other topic models (e.g. HDP)? How do you find the optimal number of topics for LDA? LSI is brilliant since it ranks its topics; can LDA do that too? If you face such problems, this blog might be able …
Validating gensim’s topic coherence pipeline
Sorry for not posting in such a long while. It has been a turbulent few weeks, with some sharp twists and turns, emails flying back and forth, and a few pivots here and there. To validate the topic coherence pipeline in gensim, my plan was to work with the RTL-Wiki corpus and reproduce the results stated in the paper. …
The craziness that is Dynamic Topic Models
Every week, I’d end up having ‘fit DTM’ as my weekly goal. And I would try, converting the C++ GSL code line by line, only to have it fail miserably. (You can see my gripe about it in my live blog here.) The task in itself was quite straightforward: rewrite the Dynamic Topic Model code originally written by …
What is Topic Coherence?
What exactly is this topic coherence pipeline thing? Why is it even important? Moreover, what is the advantage of having this pipeline at all? In this post I will try to answer those questions in language that is as non-technical as possible. This is meant for the general reader as much as the technical one, so I will try to engage …
Radim, Gensim and RaRe Technologies
We are racing through 2016 with so much on the front burner, and yet it is a good time to pause for a quick update on the launch of my new machine learning company, RaRe Technologies. The Start of Something Exciting: I’ve heard from a few people who were confused when they received a recent newsletter from “RaRe Technologies”, when they signed up for …
Devashish’s Student Incubator Live-Blog: a Chronicle of Implementing Topic Coherence Metrics in Gensim
10th August: PyCon Delhi. Planning to give some open space and lightning talks on gensim at PyCon India in September. Hopefully we’ll also be able to organize a sprint there. 1st August: Plugging in your own model. You can use the topic coherence pipeline to plug in your own topic model too. If you can extract the topics …
PyCon 2016 and Gensim Sprint Recap
Our team was on site representing RaRe Technologies and Gensim at PyCon 2016, hosted in Portland, Oregon, from May 28th to June 5th. It was a packed, outright massive event of over 3000 attendees, with two days of focused tutorials, sponsor workshops and talks from some of the industry’s renowned experts. RaRe was a sponsor of the …
Topic Coherence API Project – Week 2
Hey everyone! Here’s a small reflection on what I set out to do and how it panned out over the last month. My agenda for last month was to complete my normalization PR, finish my doc2vec/word2vec warning PR, code two modules required by the topic coherence API and resolve any other bugs I encountered in the process. The …