Bhargav’s Google Summer of Code 2016 Live-Blog: a Chronicle of Dynamic Topic Models

September 2nd, 2016

It’s celebration time – I’ve officially cleared Google Summer of Code 2016! 😀 😀
It’s been an absolutely awesome experience, I’ve had great mentors in Lev and Radim and I’ve learned so, so much.

You can see the result of my work in this notebook tutorial – link.
You can also follow the extra features it still needs, and the open issues, here – link.

There is still some work to do, but I will be sticking around with gensim for a long while. 😉
Once again, I’m very thankful to Gensim, RaRe Technologies and the GSoC team for giving me this opportunity.

August 22nd, 2016

It’s officially the last day for GSoC today, and I am glad to say things are finally ship-shape. 🙂
The last couple of days have mainly been focused on cleaning up code and documentation, and on writing the tutorial for using DTM.

There are still some ways to go ahead – improving performance in terms of speed, and in making the code more pythonic. I think I’ve been staring at my code way too much though – need some fresh perspective (in terms of more code reviews, bugs found, or user reviews) to make more progress. But hopefully it’s ready to be merged!

August 9th, 2016

It’s an exciting time now – getting my code to be gensim ready. This means inspecting the ldamodel.py code and better understanding the structure and design. One of the things I’ve implemented now is dealing with returning values of methods with side effects, as well as better marking out the E-step and M-step.

The biggest exercise is docstrings, so I’ve gone back and read the original DTM and LDA papers so I can better explain what exactly is going on in each method. There’s still a long way to go in this, but the mathematical parts are better explained now.

It’s now time to restructure my code so it becomes easier to maintain and understand. Running DTM on other data sets is giving me satisfactory results – so at least the functionality is more or less fine!

August 5th, 2016

It’s again been quite a while since my last update here; the last week was a lot of work. Now that I had checked my coherence values for the news corpus and they were fine, I knew that my code could more or less replicate the wrapper code. But I was still using the sufficient statistics created by the DTM code for the modelling – I had to remedy this and it took a while to do so. My first attempt was to recreate the LDA with gensim, but I was getting strange values for some reason. So my next attack at it was to recreate the Blei LDA C code in python – which I painstakingly and slowly did. But despite doing an exact python port, I was still getting weird values! I checked, and double-checked, and then found it – I was using a dictionary with different indexing, so the right words weren’t coming up. :face palm:
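To make the indexing pitfall concrete, here is a minimal toy sketch (not the actual DTM or gensim code). The two id-to-word mappings, the weights and the `top_words` helper are all hypothetical, made up purely for illustration: the same topic weights decode to different words depending on which dictionary is used to look up the ids.

```python
# Two hypothetical id -> word mappings built from the same vocabulary,
# but with the words assigned to ids in a different order.
id2word_model = {0: "economy", 1: "election", 2: "football"}
id2word_other = {0: "football", 1: "economy", 2: "election"}

# Toy topic: one weight per word id, indexed according to id2word_model.
topic_weights = [0.7, 0.25, 0.05]


def top_words(weights, id2word, n=2):
    """Return the n highest-weighted words, decoding ids with `id2word`."""
    ranked = sorted(range(len(weights)), key=lambda word_id: weights[word_id], reverse=True)
    return [id2word[word_id] for word_id in ranked[:n]]


right = top_words(topic_weights, id2word_model)  # decoded with the matching dictionary
wrong = top_words(topic_weights, id2word_other)  # same numbers, mismatched dictionary
print(right, wrong)
```

The numbers are identical in both calls; only the lookup table changes, which is why this kind of bug can survive an otherwise exact port.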

Soon I fixed this and was happy to see that both my Blei LDA port and the gensim code were giving good results.
Now that this was done, my next task was to make sure my corpus is iterable. I spent a while attempting some solutions before modifying my classes and code in parts to just be able to integrate any gensim corpus – and voilà, I have streaming DTM code.

After this was done I added docstrings for classes and basic (very basic) explanations for each of the methods. Need to spend some more time and add more details to each of the methods! I also had to refactor and clean up code to engineer a clean API. Now all you need is a gensim corpus, and an array with information of new time-slices and DTM can be run!
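As a rough illustration of the "corpus plus time-slice array" idea, the sketch below assumes the array holds the number of documents in each consecutive time slice and that the corpus is ordered by time. The helper name `iter_time_slices` and the toy bag-of-words corpus are hypothetical and not part of gensim; the point is only that the corpus can be consumed once, as a stream.

```python
from itertools import islice


def iter_time_slices(corpus, time_slice):
    """Yield (slice_index, documents), consuming `corpus` once and in order.

    `time_slice` is assumed to list how many documents fall in each slice.
    """
    documents = iter(corpus)
    for index, size in enumerate(time_slice):
        chunk = list(islice(documents, size))
        if len(chunk) != size:
            raise ValueError("corpus ended before time slice %d was filled" % index)
        yield index, chunk


# Toy bag-of-words corpus: each document is a list of (word_id, count) pairs.
toy_corpus = [
    [(0, 1), (1, 2)],
    [(1, 1)],
    [(0, 3), (2, 1)],
    [(2, 2)],
    [(0, 1)],
    [(1, 1), (2, 1)],
]
toy_time_slice = [2, 1, 3]  # 2 documents in slice 0, 1 in slice 1, 3 in slice 2

slices = list(iter_time_slices(toy_corpus, toy_time_slice))
for index, docs in slices:
    print(index, len(docs))
```

Because the helper only calls `iter()` on the corpus, any iterable of documents would do, which is the property a streamed corpus needs.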

There’s still more work to be done – I’ve got my hands on two more corpuses to test DTM on. I will be running them over the weekend to see the results. And of course, Lev will be reviewing my code – I expect a lot of changes to come in there as well. Still, it feels nice for the code to finally take some readable shape – and it works.

July 29th, 2016

Now that the code is ready, I started testing my DTM port on a corpus. I compared the results of my DTM port with the already present gensim DTM wrapper, using pyLDAvis to visualise my results and the new coherence measures to see how they match up. Surprisingly, the DTM port seems to outperform the wrapper in terms of coherence, at least for this corpus. This notebook outlines the methods I used to do this, and how to plug in DTM values for the same.
There is still a lot left to be desired, as it is about 10 times slower, as well as not being integrated into gensim’s code base. I also need to make sure that gensim’s LDA class is used to generate the sufficient statistics, which I am grappling with right now.

In the meantime, I have also written a blog post discussing Dynamic NMF and how its dynamic topic modelling algorithm compares to DTM.

July 26th, 2016

It’s been 2 weeks since my last update – this delay was because of a lot of refactoring and work I needed to catch up on. Testing and comparing my code to the C++ DTM code made me realize that there were lots of small bugs along the way – some were silly mistakes in variable names, others something gigantic like accidentally swapping a ‘+’ for a ‘-’.

This further means that I have to write iterative tests, something I am doing right away.
In the meantime I have cleaned a corpus and kept it ready for testing – particularly the sample data used in this dynamic nmf implementation. The DTM wrapper results on this corpus look fine – I must run my own code on it to make sure the results look good.

July 12th, 2016

I have written methods to print topics (luckily the already present DTM wrapper code helped push me in the right direction!), and now need to find a corpus to see if my ‘evolution’ of topics makes more sense.

I have a few corpuses in mind and will run the code on them to see how they look.

July 11th, 2016

I have finished writing a blog post about my experiences over the last couple of weeks – https://topicmodel2016.wordpress.com/2016/07/11/the-craziness-that-is-dynamic-topic-models/.

This is on my personal blog page and not on the RaRe blog page – for some reason, with my internet at home I am not able to edit or publish posts on the website (it takes too long to load). It works fine on other connections, so I’ll upload it through a different connection (or talk to Chris about this.)

I am now writing methods to print topics so I can have a better way of figuring out if my DTM is working fine.

July 8th, 2016

I apologize for the lack of updates last week – it’s been a wild one!
I’m finally wiping sweat off my brow: my DTM code compiles and fits (somewhat) the data I fed it.
Even as it runs I notice things aren’t perfect; my bounds are off by a bit, and there is some more fine tuning and tweaking I have to do until it perfectly replicates the C code.
There is still a lot to do, but it’s quite a relief to see the code run. Will put out a more detailed update of last week and how it has been so far in a blog post over the weekend.

July 4th, 2016

It is Monday now, and I am a day behind my goal of fitting DTM by Sunday. While introspecting last night about my slow progress in this regard, I think I found out what exactly is slowing me down:

1) Testing methods take a lot longer than I thought.

– mathematical functions like update_phi() and update_gamma() update variational parameters, and to test it with regard to DTM I have to set up a lot of things before I do the actual value matching. While doing this, I end up realizing a lot of things about the code, such as where the actual inference is going on, where is the time based evolution, and more importantly, how to structure my code based on a time-based approach, especially my corpus. Re-thinking things, as well as wondering how to actually test it in the first place… it isn’t straightforward.

2) Annoying roadblocks along the way.

– Sometimes the scipy methods don’t mirror the GSL ones and I have to spend a lot of time wondering how to make things work (looking at you, optimize_fdf). I am also not super comfortable with C++ and CLion, so testing slows down here as well.

While I am happy that I have essentially finished the work of converting and understanding the C++ DTM code, there is still a long way to go. I have asked Lev to wait till Wednesday to expect a fully running DTM – I hope I can meet this.

July 2nd, 2016

Found replacement methods for lngamma and psi and pushed those changes. Still slowly trying to figure out what exactly the optimize method does with respect to DTM, this is slowing me down. To push more code I’m starting to test all individual modules which can be tested right away. Getting closer to fitting it.
Opened two new PRs which address the comments on old ones and fix some other stuff – but I need help with these regarding backward compatibility and such.

I’m also following the mailing list and GitHub issues daily. It’s fun to fix small issues while doing DTM, and I hope to keep doing this for a long time.

July 1st, 2016

The mathematical methods to help with DTM are all done. I have been testing scipy methods for lngamma() and psi() and seeing if they are the same as the gsl methods. Once that is done and the optimize_fdf method is figured out, DTM can be fit.
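For context, scipy exposes these two functions as scipy.special.gammaln and scipy.special.psi. The sketch below is a dependency-free way to sanity-check any candidate pair of replacements, not the check used in the project: it relies only on the fact that psi is the derivative of lngamma. The `digamma` approximation (recurrence plus asymptotic series) and the helper `lgamma_slope` are illustrative assumptions, with math.lgamma standing in for lngamma.

```python
import math

EULER_GAMMA = 0.5772156649015329  # psi(1) equals minus this constant


def digamma(x):
    """Approximate psi(x) for x > 0 (illustrative, not a library routine)."""
    if x <= 0:
        raise ValueError("this sketch only handles x > 0")
    result = 0.0
    # Recurrence psi(x) = psi(x + 1) - 1/x, applied until x is large enough.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    # Asymptotic series: ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6).
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def lgamma_slope(x, h=1e-5):
    """Central-difference estimate of d/dx lngamma(x), which should match psi(x)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)


for x in (0.5, 1.0, 2.5, 10.0):
    print(x, digamma(x), lgamma_slope(x))
```

The same pattern (evaluate both implementations on a handful of points and compare within a tolerance) applies when matching one library's special functions against another's.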

I have also opened two PRs to fix small bugs to do with DTM wrapper and LdaMallet. They are done from my side, and need reviews before merging.

June 29th, 2016

Added four tests for mathematical helper methods. The DTM code structure is done, I have a few mathematical functions (update_gamma, update_phi) left to implement and I should be done. I don’t think I have caught up with where I wanted to be right now but I hope to reach my goal of fitting dtm on something small by this Sunday.

Learning to better estimate my schedule now – it isn’t always straightforward and I stumble along the way often, especially this week. Working to speed this up.

June 27th, 2016

Couldn’t code much the last 3 days because I was recovering from a fever (crazy weather changes in both Bangalore and Pune!). Today I managed to put in some extra hours though, and I have only 3 functions left and will have finished converting all the code I need to actually run a python port of DTM. Of course, it isn’t as straightforward – need to figure out how to properly incorporate the ldamodel and corpus classes of gensim into my code. I intend to be done with all the DTM code (at least structured as modules) by tomorrow, finish testing all individual modules by Wednesday and have decent results by Thursday. I had intended to get all this done by Sunday, so I am set back by 4 days, and need to buck up!

Outside of DTM I made further changes to my Distance Metrics notebook and pushed it. I also have been ignoring issue #698 which deals with bytes-like objects and strings being messed around – need to have a look into this asap, propose a solution and put in a new PR!

Looks like a busy but exciting week is ahead. 🙂

June 23rd, 2016

I am traveling today so couldn’t really do any heavy work – but to pass the time I am trying to think of ways of fixing the failing get_term_topics() and get_document_topics() tests. My hunches are that it is because of dictionary indexes being different in Windows and Mac OS, or that the LDA model is giving slightly different values. Must think of a smarter way to test them. My Distance Metric notebook also needs some work, so I’ll put in some time on that. Back to DTM tomorrow!

June 22nd, 2016

Today I finished up my DTM blog post. This is the link to it – https://topicmodel2016.wordpress.com/2016/06/22/understanding-and-coding-dynamic-topic-models/. In the meantime I continued working on converting fit_lda_seq. In particular, I made a class for LDA post and made some headway with what they call the M-Step in the Blei code. I feel I may be writing code which may be already present in the ldamodel.py class, especially with the bound and inference methods, but I haven’t gone around to checking it up yet. Either way, just because it will be easier to test and reproduce results, I am continuing with translating the code – though I am unsure if this is the smartest way of doing things.

June 21st, 2016

Yesterday and today I spent my time coding fit_lda_seq in python. There were a few questions and doubts along the way, like how and when I should log the gamma and likelihood matrices to files (and whether it is necessary at all), and how I should go about implementing methods which use the Blei LDA class – should I make a dummy class to use for now, or figure out how to integrate the already existing ldamodel into this? The LDA posterior class also is the source of similar doubts for me.

I have finished the skeletons of most of these functions and hope to finish it by Sunday.

In the meantime I am also adding some information on metrics and on when to use KL and Hellinger. Can make a blog post about it too – I even used the similarity metrics a bit in my hackathon bot to find similar documents.
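A minimal pure-Python sketch of the two metrics, assuming both inputs are dense probability distributions over the same vocabulary. This is not gensim's implementation, and the two toy topics are made up for illustration.

```python
import math


def kullback_leibler(p, q):
    """KL(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hellinger(p, q):
    """Hellinger distance between two distributions; always lies in [0, 1]."""
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(0.5 * squared)


# Two toy topic-word distributions over a three-word vocabulary.
topic_a = [0.7, 0.2, 0.1]
topic_b = [0.1, 0.3, 0.6]

print(kullback_leibler(topic_a, topic_b), kullback_leibler(topic_b, topic_a))
print(hellinger(topic_a, topic_b), hellinger(topic_b, topic_a))
```

The practical difference: KL is asymmetric and unbounded, so the order of the arguments matters, while Hellinger is symmetric and bounded, which makes it easier to use as a plain distance between documents or topics.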

June 19th, 2016

Yesterday was a slower code day; the only commits I pushed were for the Distance functions PR, where I made changes to the Notebook to illustrate how Kullback-Leibler can be used in finding the distance between two topic-word distributions. I finally got around to re-watching the Blei Google talk on State Space Models. There are a few small hiccups in testing the entirety of init_lda_seq_ss which should hopefully be wrapped up today, so that fit_lda_seq can proceed smoothly.

June 17th, 2016

As of last night – I was working on an all night hackathon at Intuit Inc. where we built a Chat Bot which heavily uses LSI, LDA, and Doc2Vec to pick out similar queries, and call a domain ‘expert’ based on which topic the query falls under. It’s an all python bot web-app which also uses Django, ChatterBot and Theano’s RNN to generate sentences. It was great to be able to use parts of gensim I previously haven’t, like LSI, Similarity Index, and Doc2Vec. This is the github link of our project: https://github.com/thegyro/qandabot.

About the GSOC work – I will be working on the DTM fit_lda_seq method the rest of the day, while also finishing pushing all the tests written for init_lda_seq_ss to the PR. Will update again tonight about how that goes!
