Last weekend, I ported Google’s word2vec into Python. The result was a clean, concise and readable code that plays well with other Python NLP packages. One problem remained: the performance was 20x slower than the original C code, even after all the obvious NumPy optimizations.
There are two major optimization directions: re-obfuscate (parts of) the Python code by converting it back into C, and parallelizing the computation (the original C tool uses threads).
Selecting which part to optimize was an easy task — even without profiling, it’s clear the bulk of the work is done in the nested loop that goes through each sentence, and for each sentence position (word) tries to predict all the other words within its window. It’s a tiny part of the overall code, but accounts for most of the time spent — mere three lines (namely 480, 487 and 489 in r26) eat up 90% of the entire runtime!
The rest of the code is mostly there to prepare the input words and sentences. In the original C word2vec, the words are assumed to reside in a single file on disk, one sentence per line, with words delimited by whitespace. In Python, the sentences can come from anywhere, and be pre-processed in any way you like. The methods accept an iterable of sentences, so we can plug in the Brown Corpus from NLTK like this:
>>> class BrownCorpus(object): ... """Iterate over sentences from the Brown corpus (part of NLTK data).""" ... def __init__(self, dirname): ... self.dirname = dirname ... ... def __iter__(self): ... for fname in os.listdir(self.dirname): ... fname = os.path.join(self.dirname, fname) ... if not os.path.isfile(fname): ... continue ... for line in open(fname): ... # each file line is a single sentence in the Brown corpus ... # each token is WORD/POS_TAG ... token_tags = [t.split('/') for t in line.split() if len(t.split('/')) == 2] ... # ignore words with non-alphabetic tags like ",", "!" etc (punctuation, weird stuff) ... words = ["%s/%s" % (token.lower(), tag[:2]) for token, tag in token_tags if tag[:2].isalpha()] ... if not words: # don't bother sending out empty sentences ... continue ... yield words >>> for sentence in BrownCorpus('/Users/kofola/nltk_data/corpora/brown'): ... print sentence ['the/at', 'fulton/np', 'county/nn', 'grand/jj', 'jury/nn', 'said/vb', 'friday/nr', 'an/at', 'investigation/nn', 'of/in', "atlanta's/np", 'recent/jj', 'primary/nn', 'election/nn', 'produced/vb', 'no/at', 'evidence/nn', 'that/cs', 'any/dt', 'irregularities/nn', 'took/vb', 'place/nn'] ['the/at', 'jury/nn', 'further/rb', 'said/vb', 'in/in', 'term-end/nn', 'presentments/nn', 'that/cs', 'the/at', 'city/nn', 'executive/jj', 'committee/nn', 'which/wd', 'had/hv', 'over-all/jj', 'charge/nn', 'of/in', 'the/at', 'election/nn', 'deserves/vb', 'the/at', 'praise/nn', 'and/cc', 'thanks/nn', 'of/in', 'the/at', 'city/nn', 'of/in', 'atlanta/np', 'for/in', 'the/at', 'manner/nn', 'in/in', 'which/wd', 'the/at', 'election/nn', 'was/be', 'conducted/vb'] ['the/at', 'september-october/np', 'term/nn', 'jury/nn', 'had/hv', 'been/be', 'charged/vb', 'by/in', 'fulton/np', 'superior/jj', 'court/nn', 'judge/nn', 'durwood/np', 'pye/np', 'to/to', 'investigate/vb', 'reports/nn', 'of/in', 'possible/jj', 'irregularities/nn', 'in/in', 'the/at', 'hard-fought/jj', 'primary/nn', 'which/wd', 'was/be', 'won/vb', 'by/in', 'mayor-nominate/nn', 'ivan/np', 'allen/np', 'jr./np'] ...
The sentences are constructed on the fly, from various files in the Brown Corpus directory, without any extra preprocessing steps or loading everything into RAM. Plugging in different training data is pretty straightforward: adjust the iterator to fit the data format. Gensim’s word2vec module already contains this BrownCorpus iterator, plus a few others, as blueprint examples.
For testing the training speed, I’ll be using the Brown corpus above. With ~1 million words in 57k sentences, it’s too small for any quality training, but already big enough so we can evaluate improvements/regressions in performance. The speed is tested with a hidden layer size of 200 and with ignoring vocabulary which appears less than 5 times (size=200, min_count=5 constructor parameters). The final vocabulary contains 15,079 words, so the projection weights form a 15,079×200 matrix.
All results below are a best-of-ten on my MacbookPro 2.3GHz laptop, meaning each test was run 10 times and only the best result is reported, to mitigate random noise/OS multitasking influences.
The baseline performance of the NumPy code under these conditions is 1.4k words per second. Rewriting the training loop in Cython improves this to 33.3k words/sec, which is a 24x speedup.
EDIT: I originally used memoryviews to pass arrays around in Cython, and converting to a memoryview is a lot slower than plain casting arrays to pointers. I’ve since updated the code, re-ran all tests and updated the timings in this post.
While rewriting the loops in Cython, I noticed the logic contained therein could be expressed in terms of BLAS. BLAS (Basic Linear Algebra Subprograms) are routines for stuff like vector_y += scalar_alpha * vector_x, with funky names like saxpy or dgemm. The reason why people use BLAS is that these basic operations can be heavily optimized, so that calling saxpy is way faster than what a generic compiler would produce from an equivalent, naive C loop.
NumPy itself links against BLAS internally, but unfortunately offers no way to plug directly into the C routines (or I couldn’t find a way). Calling BLAS via NumPy would be crazy slow, because it would have to go through Python calls, acquiring GIL and all.
Fortunately, SciPy contains a little known gem, hidden inside scipy.linalg.blas, which allows us to call the C BLAS routines directly. After some poking in SciPy’s bowels, scratching my head and googling around, I came up with:
from cpython cimport PyCObject_AsVoidPtr from scipy.linalg.blas import cblas ctypedef void (*saxpy_ptr) (const int *N, const float *alpha, const float *X, const int *incX, float *Y, const int *incY) nogil cdef saxpy_ptr saxpy=<saxpy_ptr>PyCObject_AsVoidPtr(cblas.saxpy._cpointer)
Which gives us saxpy, aka vectorized y = alpha*x in single precision. After plugging in saxpy, sdot & co. in place of lines 480, 487 and 489, I got a ~3x speedup in the crunch code, which translated into 89.8k words/s in the overall training speed. This is a 64x improvement over NumPy, and a 2.7x improvement over plain Cython.
Important note on BLAS: This improvement hinges on the quality of the BLAS library installed. On my MacBook Pro, SciPy automatically links against Apple’s vecLib, which contains an excellent BLAS. Similarly, Intel’s MKL, AMD’s AMCL, Sun’s SunPerf or the automatically tuned ATLAS are all good choices. Installing a generic, pre-packaged BLAS which is not optimized for your machine (no SSE, unknown CPU cache sizes…) is not a good choice.
The original C code contains an extra optimization, where instead of computing y = 1 / (1 + e-x), it looks up the value of y directly, in a precomputed table in RAM. This is a classic optimization from the times when float operations were expensive: a trade-off between using more memory & less precision vs. freeing up CPU. With dedicated FPUs and complex cache architectures in modern computers, it’s not such a clear hit anymore.
I actually expected no performance gain at all, but profiling proved me wrong. Using the precomputed EXP_TABLE with Cython, I got 34.7k words/s, a 4% improvement. Using both the BLAS optimization above and the precomputed sigmoid table, it’s 101.8k words/s. This is the best result so far, almost 73x faster than the NumPy code I started with. It’s also 3.5 times faster than the original C code, which runs at 29k words/s on this corpus (with a single thread).
All the timings above in one table:
|optimization||words per second||speed-up|
|original C word2vec||29.0k||20.7x|
|Cython + BLAS||89.8k||64.1x|
|Cython + sigmoid table||34.7k||24.8x|
|Cython + BLAS + sigmoid table||101.8k||72.7x|
With the hotspot optimized, gensim’s word2vec is now both fast and easy to use. The cost is an extra dependency on Cython. I realize this may be an issue for Windows users, so I added fallback code where if the fast Cython fails to compile (because there’s no compiler or no Cython…), it will use the slower, NumPy code. This way, noob users can still use word2vec without going into compilation setup and technicalities.
The next step is making the code run in parallel, to make use of multicore machines. This will require some thought, because very fine parallelization may not work well because of thread collisions (there is no locking in the C word2vec, so it’s best if the threads work on very different words and sentences and avoid stepping on each other’s toes). On the other hand, having coarse-grained threading blocks like in the original C word2vec will not work in Python, because GIL can only be released in the low-level methods. I’ll have to give this some thought :-)
Part III: Parallelizing word2vec in Python