Author-topic models promise to give data scientists a tool to simultaneously gain insight about authorship and content in terms of latent topics.
The model is closely related to Latent Dirichlet Allocation (LDA). Basically, each author can be associated with multiple documents, and each document can be attributed to multiple authors. The model learns topic representations for each author, so that we may ask questions like “what topic does author X typically write about?” or “how similar are authors X and Y?”. More generally, the model enables us to learn topic representations of any labels on documents (more on this later).
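To make the second question concrete: once each author has a distribution over topics, any distance between probability vectors can serve as a similarity measure. The sketch below uses the Hellinger distance. The author names and the numbers are invented for illustration and are not output from any model; `hellinger`, `similarity` and `dominant_topic` are hypothetical helper names.

```python
import math

# Hypothetical per-author topic distributions (each row sums to 1).
# These values are made up for illustration, not learned from data.
author_topics = {
    "author_x": [0.70, 0.20, 0.10],
    "author_y": [0.60, 0.30, 0.10],
    "author_z": [0.05, 0.15, 0.80],
}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Returns 0 for identical distributions and 1 for distributions
    with no overlapping support.
    """
    return math.sqrt(
        0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    )


def similarity(author_a, author_b):
    """Similarity in [0, 1]: one minus the Hellinger distance."""
    return 1.0 - hellinger(author_topics[author_a], author_topics[author_b])


def dominant_topic(author):
    """Index of the topic with the highest weight for this author."""
    weights = author_topics[author]
    return max(range(len(weights)), key=weights.__getitem__)


print(dominant_topic("author_x"))
print(similarity("author_x", "author_y"))
print(similarity("author_x", "author_z"))
```

With these made-up numbers, authors X and Y come out as more similar to each other than X and Z, which matches the intuition that they put most of their weight on the same topic.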
While implementations of the author-topic model exist, they are slow, do not scale, and have poor documentation and interfaces. So for the past few months, I have been working on getting us an implementation with all the nice properties we have become used to in Gensim’s LDA, that is,
- fast training
- streaming corpora
among other things. In this post I will share my progress in this endeavor, and hopefully get some of you interested in trying the implementation once it’s released.
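For readers unfamiliar with the term, a streaming corpus is one that hands documents to the model one at a time, so the full dataset never has to sit in memory. The following is a minimal sketch of that idea in plain Python. `StreamedCorpus` is a hypothetical class written for this example, not part of the implementation discussed here, and the tiny vocabulary is invented.

```python
from collections import Counter


class StreamedCorpus:
    """Yield one bag-of-words document at a time.

    `lines` can be any re-iterable source of text lines (a list here;
    a file would need to be reopened for each pass). `vocab` maps each
    known word to an integer id; unknown words are skipped.
    """

    def __init__(self, lines, vocab):
        self.lines = lines
        self.vocab = vocab

    def __iter__(self):
        for line in self.lines:
            counts = Counter(
                self.vocab[word]
                for word in line.lower().split()
                if word in self.vocab
            )
            # One document as a sorted list of (word_id, count) pairs.
            yield sorted(counts.items())


# Illustrative data only.
vocab = {"topic": 0, "model": 1, "author": 2}
raw_lines = ["topic model topic", "author writes model"]

docs = list(StreamedCorpus(raw_lines, vocab))
print(docs)
```

Only one document is materialised per iteration step; the `list(...)` call above is just there to print the result.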
Short review of available software
There are, to the best of my knowledge, three implementations of the author-topic model:
- The python-topic-model module.
- The MatLab Topic Modeling Toolbox.
- An LDA package in C++ by Yoshua Bengio.
The python-topic-model version is quite slow. I tried training an author-topic model on a small dataset. After about a minute my own algorithm started giving reasonable results, whereas the python-topic-model version yielded poor results even after 15 minutes of training.
The C++ implementation is poorly documented, and its GitHub page describes the package as “Practice of LDA and other Topic Model based Collapsed Gibbs Sampling”, which does not suggest that it is geared towards users.
The MatLab implementation seems to produce decent results quite quickly. Porting this algorithm to Python would not be a bad idea, so that data scientists without MathWorks licenses (MatLab is a proprietary piece of software) could use it. However, as the algorithm is based on Gibbs sampling, the scalability of the implementation is questionable (more on this in the next section). Since this implementation does not compute any kind of likelihood, it will be difficult to compare its scalability objectively with that of my own implementation.
Models like the author-topic model and LDA are typically trained using either Gibbs sampling or variational inference. All the implementations discussed in the previous section apply some form of Gibbs sampling (perhaps blocked and/or collapsed Gibbs sampling).
Variational Bayes (VB) is known to enable fast and scalable inference. For this exact reason, Gensim’s LDA is trained using VB. It is also easy to formulate an online algorithm (streamable corpus) within the VB framework. For these reasons, it was decided to implement an online VB algorithm to train the model. This required some theoretical work, as a VB algorithm has never been developed for the author-topic model. Technically, the algorithm that was developed is a blocking variational Bayes algorithm.
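To give a feel for what "online" means here: in online VB for LDA (as formulated by Hoffman and co-authors), each chunk of documents produces an estimate of the global variational parameters, which is then blended into the running estimate with a decaying step size. The sketch below shows only that blending step. It is not the author-topic algorithm described in this post, and the names `step_size`, `online_update`, `offset` and `decay` are chosen for illustration.

```python
def step_size(t, offset=1.0, decay=0.7):
    """Step size rho_t = (offset + t) ** (-decay) for update number t.

    In Hoffman et al.'s analysis, decay in (0.5, 1] is required for
    convergence; a larger offset down-weights the earliest chunks.
    """
    return (offset + t) ** (-decay)


def online_update(current, chunk_estimate, rho):
    """Blend the estimate from one chunk into the running parameters.

    `current` and `chunk_estimate` are flat lists of the same length,
    standing in for a variational parameter matrix.
    """
    return [
        (1.0 - rho) * old + rho * new
        for old, new in zip(current, chunk_estimate)
    ]


# Illustrative numbers only: two parameters, three chunks.
params = [1.0, 1.0]
chunk_estimates = [[3.0, 5.0], [2.0, 4.0], [2.5, 4.5]]

for t, estimate in enumerate(chunk_estimates):
    params = online_update(params, estimate, step_size(t))
    print(t, params)
```

Because the step size shrinks with `t`, early chunks move the parameters a lot and later chunks refine them, which is what lets the corpus be consumed as a stream.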
VB lends itself quite well to parallel computation. While I have not implemented this yet, it shouldn’t be a great challenge.
Collapsed variational Bayes
Another candidate algorithm that was considered was collapsed variational Bayes (CVB). The theoretical groundwork for applying CVB to this model was laid out by Ngo and co-authors in 2016, although no code has been made publicly available. Online CVB is also possible.
CVB converges faster than standard VB, and would therefore be preferred. However, the natural first step would be to migrate the LDA implementation over to CVB, before doing the same in the author-topic model. Therefore standard VB was applied rather than CVB. It would be interesting to see a scalable, fast, online CVB implementation of LDA in Gensim, as it should be faster to train than the current implementation.
Algorithmic development has been the focal point of this project for the majority of its duration. Recently, everything has been coming together nicely, and a refactor of the code is in progress so that it may be released. The refactored code will have a similar interface and structure to Gensim’s LDA, and thus jumping into the model will be relatively easy for users and developers alike. At this point, it is not certain when the code will be released, but it shouldn’t be long.
Not only authors
While the name of the model suggests that it has something to do with authors, it is really a much more general abstraction of the standard topic model. The “authors” can represent any kind of label on documents, in particular when there are several labels per document and several documents per label; for example, tags on internet posts could be used. Indeed, the model can be applied to any kind of data that have a similar structure, such as genomics data (LDA is used for this as well) or neuroimaging data (see the CVB paper mentioned earlier).
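As a small illustration of the tags example: if each post comes with a list of tags, inverting that list gives a mapping from every tag to the documents that carry it, which has exactly the "several labels per document, several documents per label" shape described above. The sketch below is plain Python; `invert_labels` and the sample tags are hypothetical, and the mapping format is an assumption rather than the input format of any particular implementation.

```python
from collections import defaultdict

# Hypothetical tags for three posts; the list index is the document id.
doc_tags = [
    ["python", "nlp"],
    ["nlp"],
    ["python", "databases"],
]


def invert_labels(doc_labels):
    """Turn per-document label lists into a label -> document ids mapping."""
    label2doc = defaultdict(list)
    for doc_id, labels in enumerate(doc_labels):
        for label in labels:
            label2doc[label].append(doc_id)
    return dict(label2doc)


label2doc = invert_labels(doc_tags)
print(label2doc)
```

Here each tag plays the role of an "author": the tag `python` is attached to documents 0 and 2, and document 0 has two "authors".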
Gensim has for a long time been at the forefront of bridging the gap between research and practice in natural language processing. While the author-topic model has not drawn a lot of attention yet, I hope that this implementation will provide people with the opportunity to experiment with it, and that we will see some interesting use cases as a result.