Recently while doing some topic modelling, I encountered a few problems such as:
- How to use the topic coherence (TC) pipeline with other topic models (e.g. HDP).
- How to find the optimal number of topics for LDA.
- LSI is brilliant since it ranks its topics. Can LDA do that too?
If you face such problems, this blog might be able to help you out. I will also demonstrate how you can build your own TC pipeline using the different methods of the TC modules in gensim.
Using the pipeline with other topic models
The TC pipeline has built-in support for LDA and for the VW and Mallet wrappers. But can it be used with other models such as LSI or HDP? Most certainly. Instead of passing a model, you provide the `topics` parameter when initializing the pipeline. In other words, if you can get the topic keywords out of your model, you can use the TC pipeline. I’ve added an example of how to use the pipeline with gensim HdpModel in this pull request. The pipeline can also be used with DTM (as demonstrated by Bhargav) or any other topic model, since only the topic keywords are needed.
Finding the optimal number of topics in LDA
When I encountered this problem and googled it, I found that quite a lot of people face it. The suggested answers range from manually inspecting the topics for different values of ‘N’ to finding the best ‘N’ through a series of mathematical operations. I feel topic coherence emulates the former really well. The optimal number of topics can be found by simply iterating through a range of integers, plotting the coherence value for each, and then selecting the best value from the graph. Algorithms also exist to find the elbow of the graph automatically.
Making LDA behave like LSI
LSI is a very useful topic model if you want to display the topics in a ranked manner. However, suppose you trained your LDA model with 100 topics but only want to display the top 5. The show_topics method can only display 5 arbitrary topics, because LDA has no natural ordering of its topics. If you want to make your LDA behave like LSI, you can calculate the coherence of each individual topic and display the topics in decreasing order of coherence, thus showing only the 5 most meaningful topics.
Making your own pipeline
In my opinion, this is one of the biggest advantages of the gensim TC pipeline. Users can customize the pipeline to suit their needs and are not bound by the measures provided by gensim. To make a pipeline, all you need to do is define the segmentation, probability estimation, confirmation measure and aggregation steps manually and then evaluate the coherence value. Each of these modules in gensim already contains some functions, but a lot more can be added (contributions welcome!).
I have added a small gist giving all the code for the above operations here: https://gist.github.com/dsquareindia/ac9d3bf57579d02302f9655db8dfdd55
Enjoy your topic modelling!