Google Summer of Code 2017 – Performance improvement in Gensim and fastText

Prakhar Pratyush (gensim, Student Incubator)

July 20, 2017

This week, I’ve mostly worked on implementing native unsupervised fastText in gensim (PR #1482). It’s quite challenging: I had to dig into the fastText C++ code and read the research paper to properly understand how it works, and then figure out the overlap with the word2vec code. After lots of discussion with my mentors, we are finally on the right track.

So far, we have finalized the API, which closely mirrors that of word2vec. After writing the code for the complete fastText structure, for both CBOW and skip-gram, we have now implemented the subwords feature, and we are currently discussing the model-update code for negative sampling, softmax, etc. I think that by the end of the upcoming second evaluation, we will have working code for unsupervised fastText (pure Python) in gensim.
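
For context, the subwords feature represents each word by the set of character n-grams it contains, in addition to the word itself. Here is a minimal sketch of that extraction step (the function name and signature are mine, not gensim's; the defaults min_n=3, max_n=6 follow the fastText paper):

```python
def compute_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams fastText would extract for `word`.

    fastText wraps each word in angle brackets first, so "where"
    becomes "<where>" and yields n-grams like "<wh", "whe", ..., "re>".
    """
    extended = '<%s>' % word
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(extended) - n + 1):
            ngrams.append(extended[i:i + n])
    return ngrams

print(compute_ngrams('where', min_n=3, max_n=4))
# ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>']
```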

July 14, 2017

For the last two weeks, I have been working on PR #1226, PR #1413, and a new PR #1482, which implements native (unsupervised) fastText in gensim (see issue #1471). After a lot of discussion, we decided that it’s better to first implement the native unsupervised version of fastText in gensim, and later incorporate the supervised feature (labeled w2v) into it.

As for the earlier PRs #1226 and #1413, we are in the final stage of clean-up discussion, and the few issues we encountered have almost been resolved. For fastText, we had so far been discussing the structure and API; having finalized the architecture (which is very similar to word2vec’s), we are starting to write the training and subwords code.

June 29, 2017

This week has been a little hectic so far, as we need to finish up the Phrases optimization before moving on to the next phase.

Takeaway from the first month – software development is less about the number of lines of code and just getting things done, and more about impact and strategy. After lots of discussion about the any2utf8 optimization (#1413) covered in previous blog posts, it has finally been accepted as a genuine major bottleneck, and the utf8 conversion will now be optional, since users with more RAM might prefer higher speed. The updated PR adds an optional parameter, recode_to_utf8, and after lots of debugging and testing the PR is finally ready to be merged. So much work, just for a little change (though a 1.8-2x speed-up is not that bad)!!
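
To illustrate, here is how the flag described above would be used. This is a hypothetical usage sketch based on the PR discussion; the final parameter name or default may change before merge:

```python
from gensim.models.phrases import Phrases

# Default behaviour: every incoming token is recoded to utf8 (safe, slower).
bigram = Phrases(sentences)

# Proposed option from PR #1413: skip the per-token recoding, trading a
# larger in-memory vocab for roughly 1.8-2x faster vocab building.
bigram_fast = Phrases(sentences, recode_to_utf8=False)
```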

Unfortunately, most of the optimizations I attempted for the Phrases module (cython static typing and joblib threading) didn’t deliver the performance improvement we expected, so I’ve closed those PRs. Still, they were quite intense and a great learning experience.

The updated benchmark for the text8 corpus –

Optimization              Python 2.7    Python 3.6
original                  ~36-38 sec    ~32-34 sec
recode_to_utf8=False      ~19-21 sec    ~20-22 sec

As for the fastText PR #1341, it finally got merged. While removing unused code and files (the clean-up part), we encountered a strange flake8 error, which was fixed by @jayantj in the same PR.

June 21, 2017

In the last blog post, I mentioned a memory-for-speed trade-off: applying the unicode-to-utf8 conversion (any2utf8) only before saving, rather than on every incoming word. But apparently memory is more critical here, so to tackle the speed bottleneck we now apply the conversion to an entire sentence in one go (joining the words with a delimiter), and the result is the significant time improvement we had expected. PR #1413 has been updated and is ready for review.
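
Here is a minimal sketch of that idea, assuming a delimiter that never appears inside tokens (the actual PR may pick a different delimiter or code path):

```python
from gensim import utils

sentence = [u'human', u'machine', u'interface']
DELIM = u'\t'  # assumed: tokens never contain a tab

# Before: one Python-level encode call per token.
utf8_slow = [utils.any2utf8(word) for word in sentence]

# After: join once, encode once, split once.
utf8_fast = utils.any2utf8(DELIM.join(sentence)).split(DELIM.encode('utf8'))

assert utf8_slow == utf8_fast
```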

Apart from the any2utf8 optimization, I continued cythonizing (static typing) the learn_vocab and export_phrases functions in the Phrases module. With the cythonized code plus the any2utf8 optimization, we get almost a 2-2.5x speed-up.

Also, as I mentioned in the last blog post, PR #1341 is finally ready to be merged, and it will become the default way of loading fastText models, i.e., using only the .bin file.

Best-of-3 results for Phrases on the text8 corpus –

Optimization                  Time
Original                      ~36-38 sec
cython (static typing)        ~30-32 sec
any2utf8 (without cython)     ~20-22 sec
any2utf8 (with cython)        ~15-18 sec

June 14, 2017

Almost two weeks into the coding period, and it’s all fab so far. I just opened a new PR #1413 for optimizing the Phrases module. Earlier, every incoming word was converted to utf8 encoding, which caused a significant speed bottleneck. Now we only convert the words that finally make it into the vocab (after pruning and discarding), right before saving to disk. Basically, trading memory for speed!

As I discussed earlier, the first part of my GSoC project is to optimize the Phrases module. PR #1385 for cythonizing is almost finished apart from a little clean-up, and discussion of a multi-core implementation is underway with my mentors.

Apart from that, I was also working on fastText PR #1341 (loading a fastText model using the .bin file only). This was quite challenging and consumed most of the second week. It’s finally finished and ready to be merged, and a discussion about making it the default way of loading fastText models is due at today’s hangouts meeting.
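
For reference, loading through the gensim wrapper looks roughly like this (a sketch based on the wrapper API at the time; 'wiki.fr' is a placeholder path prefix, and with PR #1341 the .vec file would no longer be required):

```python
from gensim.models.wrappers import FastText

# The loader picks up wiki.fr.bin (and, before PR #1341, wiki.fr.vec too).
model = FastText.load_fasttext_format('wiki.fr')

# Word vectors are assembled from subword n-grams plus the word itself.
print(model['bonjour'])
```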

May 31, 2017

It’s been around a month since I was selected for Google Summer of Code with my mentor organizations NumFOCUS and Gensim, and it has already been a great learning experience. This is the first of a series of blog posts that I’ll publish over the next three months as part of my project, Performance improvement in Gensim and fastText.

I’m an aspiring entrepreneur and a final-year student in the Department of Electronics & Communication Engineering at IIT Roorkee. I’ve always been excited to learn about new technologies, and over the last six months I’ve been increasingly drawn to Machine Learning and AI. I had already worked with a few ML libraries like scikit-learn, TensorFlow, and Gensim (I got introduced to Gensim through this tutorial at Kaggle), so when I found Gensim in the GSoC organization list, I approached the community, and after working through my first PR (#1186) and the intro talk with Lev, I was more than certain that I wanted to stick with this organization. I would like to extend my gratitude to the entire Gensim community for their constructive feedback, and especially to Lev and Giacomo Berardi for guiding me through the proposal drafting phase and beyond.

One thing I’ve realized while working with Gensim is that I really enjoy working on bugs and fixing issues. I have solved quite a few issues and learned a lot in the process – the most significant lessons being about Test Driven Development (TDD) and the importance of writing neat code. I’ve been assigned three cool mentors – Ivan, Jayant, and Lev – and it’s really great to learn from such experienced ninjas. Following is a glimpse of my contributions so far –

A few more PRs are in the pipeline, which I’ll be finishing very soon.

Facebook Research recently released fastText, a new word-embedding and classification library that performs better than word2vec on syntactic tasks and trains much faster for supervised text classification. Gensim has a wrapper for getting word representations from fastText (supervised text classification will be implemented in Python as a part of this GSoC project). A few weeks ago, the fastText code was updated with some additional variables and functions, which broke loading of fastText-trained models through the Gensim wrapper (and other available wrappers too; see this issue in the fastText repo). This was resolved in #1319, so Gensim is now capable of loading both the old and the new fastText formats.

There is another fastText issue regarding a mismatch between the .bin and .vec files of the French pretrained vectors (I opened an issue in Facebook’s repo). I have discussed the problem thoroughly with my mentors, and Gensim will soon support loading fastText with only the .bin file (look out for PR #1341).

As I mentioned in my proposal, the project consists of three milestones: a multi-core implementation of the Phrases module in Cython, refactoring of labeled w2v (supervised fastText, from PR #1153 by Giacomo) and implementation of its remaining parts in Gensim, and further optimization/multi-core/GPU work. So far, I’ve profiled and benchmarked the code and located the potential bottlenecks. I put a lot of thought into drafting my proposal, so I’m planning to stick to the proposed timeline. In the first week, the plan is to work on the learn_vocab function in the phrases.py module, using the NumPy library wherever applicable and cythonizing the for loop; a simplified sketch of what learn_vocab does is shown below. After that, BLAS optimization and a multi-core implementation should further enhance the performance. I’ve also been looking into the underlying data structure, and, as suggested by Lev, I’m reading more about cuckoo hashing. I’ll be writing more about it as I go along.
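
For readers unfamiliar with it, learn_vocab is essentially a single pass over the corpus that counts unigrams and candidate bigrams, pruning the vocab whenever it grows too large. A simplified Python sketch of that loop (the real gensim implementation differs in details such as delimiter handling and the exact pruning rule):

```python
from collections import defaultdict

def learn_vocab_sketch(sentences, delimiter='_', max_vocab_size=40000000):
    """Count unigrams and bigrams in one pass, pruning rare entries
    whenever the vocab exceeds max_vocab_size."""
    vocab = defaultdict(int)
    min_reduce = 1
    for sentence in sentences:
        for word_a, word_b in zip(sentence, sentence[1:]):
            vocab[word_a] += 1
            vocab[delimiter.join((word_a, word_b))] += 1
        if sentence:  # the last token starts no bigram, count it separately
            vocab[sentence[-1]] += 1
        if len(vocab) > max_vocab_size:
            # drop entries seen <= min_reduce times, then raise the bar
            vocab = defaultdict(
                int, {k: v for k, v in vocab.items() if v > min_reduce})
            min_reduce += 1
    return vocab

print(learn_vocab_sketch([['new', 'york', 'city'], ['new', 'york']]))
# {'new': 2, 'new_york': 2, 'york': 2, 'york_city': 1, 'city': 1}
```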

Community bonding period is over now, and it has perfectly set the tone for a fruitful summer ahead. I’ve thoroughly enjoyed working so far, and I hope it continues to be so. 🙂