The craziness that is Dynamic Topic Models

Every week, I’d end up having ‘fit DTM’ as my weekly goal.
And I would try, converting the C++ GSL code line by line, only to have it fail miserably. (You can see my gripe about it in my live blog here.)

The task in itself was quite straightforward – rewrite the Dynamic Topic Model code originally written by Blei in Python – but what I wasn’t prepared for was thousands of lines of mathematical C++ code which I had to make sure translated perfectly. This meant writing tests after each module, making sure they passed each time, and checking that the modules built from these math functions still held up once they were stitched together.

Most of the problems I faced were in the smaller details – a result that is off by a few decimal places can thoroughly mess up what the final output is supposed to look like. Most of the time, the crux was checking that the math library’s log-gamma and scipy’s digamma and optimize methods matched the C++ GSL code, and making sure I hadn’t misinterpreted any pointers.
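A minimal sketch of that kind of value check, using only the standard library. The helper name `find_mismatches` and the tolerances are made up for illustration, and closed-form log-gamma values stand in for the reference numbers, which in the real port would come from running the C++ code.

```python
import math


def find_mismatches(ported, reference, rel_tol=1e-9, abs_tol=1e-12):
    """Return (index, ported, reference) for every pair that disagrees.

    Hypothetical helper: the tolerances are illustrative defaults, not
    values taken from the original project.
    """
    return [
        (i, p, r)
        for i, (p, r) in enumerate(zip(ported, reference))
        if not math.isclose(p, r, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Stand-in reference values: closed forms of log-gamma at a few points.
# log Gamma(1) = log Gamma(2) = 0, log Gamma(5) = log(4!), log Gamma(1/2) = log(sqrt(pi))
inputs = [1.0, 2.0, 5.0, 0.5]
reference = [0.0, 0.0, math.log(24.0), 0.5 * math.log(math.pi)]

# "Ported" values: here simply Python's own log-gamma.
ported = [math.lgamma(x) for x in inputs]
mismatches = find_mismatches(ported, reference)

# Simulate the classic slip of a minus where a plus should be.
buggy = [-value for value in ported]
sign_slips = find_mismatches(buggy, reference)

print("mismatches:", mismatches)
print("indices hit by the sign slip:", [i for i, _, _ in sign_slips])
```

Getting back the indices of the disagreeing values, rather than a bare pass or fail, makes it easier to see which input triggered the problem.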

Often the values wouldn’t add up, and I would have to painfully go line by line only to notice I had put a damned minus instead of a plus – doing this also taught me the valuable lesson of using breakpoints and a debugger, and checking variables through a handy IDE instead of printing them to the console like a neanderthal.

But finally, last Friday night, I had my own little eureka moment – not only did my crude little DTM compile, the values weren’t embarrassingly off!

The happiness is in the ‘finished in 9.2s’ more than anything, here. Finished!

Of course, this is a short-lived happiness (though it didn’t stop me from treating myself to a weekend off!) – there is a lot of fixing and fine-tuning to do.
The obvious first step is to make sure my DTM results match the original C++ code’s results as closely as possible. And ‘values’ might not always be the best judge of DTM – my task for the night is to write a small method which will print out my topics and how they evolve, so I can see if they at least make sense. And when it comes to making sense of things – in the pain of just converting the code, I’m not always sure why I’m writing some of the code I’m writing (an embarrassing confession – I still don’t know exactly what my update_zeta method does!).
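A rough sketch of what such a topic-printing method could look like. The function name `top_words`, the data layout (one list of word weights per time slice) and the toy numbers are all assumptions for illustration, not the structures used in the actual port.

```python
def top_words(topic_over_time, vocab, top_n=3):
    """Return the top_n words of one topic for each time slice.

    topic_over_time[t][w] is assumed to hold the weight of word w at
    time slice t; this layout is hypothetical.
    """
    evolution = []
    for weights in topic_over_time:
        # Rank word indices by weight, highest first.
        ranked = sorted(range(len(weights)), key=lambda w: weights[w], reverse=True)
        evolution.append([vocab[w] for w in ranked[:top_n]])
    return evolution


# Made-up vocabulary and weights, one row per time slice.
vocab = ["economy", "bank", "internet", "war", "software"]
topic = [
    [0.40, 0.30, 0.05, 0.20, 0.05],
    [0.30, 0.25, 0.20, 0.10, 0.15],
    [0.15, 0.10, 0.40, 0.05, 0.30],
]

for t, words in enumerate(top_words(topic, vocab)):
    print(f"time slice {t}: {', '.join(words)}")
```

Reading the top words slice by slice is a quick sanity check: a topic should drift gradually over time rather than jump to unrelated words.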

So while there is a lot to do (I haven’t even started thinking about how this will smoothly integrate with the gensim code base!), at a rough half-way mark it feels quite heartening to know I’m edging in the right direction.

So on this Monday evening – a good amount of caffeine by my side, my benchmarking and testing libraries ready, armed with the sensibilities of my mentors Lev and Radim – it’s time for me to take on the remaining bit of my Google Summer of Code project.
