Last weekend, I ported Google’s word2vec into Python. The result was a clean, concise and readable code that plays well with other Python NLP packages. One problem remained: the performance was 20x slower than the original C code, even after all the obvious NumPy optimizations.
Selecting the hotspots
There are two major optimization directions: re-obfuscate (parts of) the Python code by converting it back into C, and parallelizing the computation (the original C tool uses threads).
Selecting which part to optimize was an easy task — even without profiling, it’s clear the bulk of the work is done in the nested loop that goes through each sentence, and for each sentence position (word) tries to predict all the other words within its window. It’s a tiny part of the overall code, but accounts for most of the time spent — mere three lines (namely 480, 487 and 489 in r26) eat up 90% of the entire runtime!
The rest of the code is mostly there to prepare the input words and sentences. In the original C word2vec, the words are assumed to reside in a single file on disk, one sentence per line, with words delimited by whitespace. In Python, the sentences can come from anywhere, and be pre-processed in any way you like. The methods accept an iterable of sentences, so we can plug in the Brown Corpus from NLTK like this:
>>> class BrownCorpus(object): ... """Iterate over sentences from the Brown corpus (part of NLTK data).""" ... def __init__(self, dirname): ... self.dirname = dirname ... ... def __iter__(self): ... for fname in os.listdir(self.dirname): ... fname = os.path.join(self.dirname, fname) ... if not os.path.isfile(fname): ... continue ... for line in open(fname): ... # each file line is a single sentence in the Brown corpus ... # each token is WORD/POS_TAG ... token_tags = [t.split('/') for t in line.split() if len(t.split('/')) == 2] ... # ignore words with non-alphabetic tags like ",", "!" etc (punctuation, weird stuff) ... words = ["%s/%s" % (token.lower(), tag[:2]) for token, tag in token_tags if tag[:2].isalpha()] ... if not words: # don't bother sending out empty sentences ... continue ... yield words >>> for sentence in BrownCorpus('/Users/kofola/nltk_data/corpora/brown'): ... print sentence ['the/at', 'fulton/np', 'county/nn', 'grand/jj', 'jury/nn', 'said/vb', 'friday/nr', 'an/at', 'investigation/nn', 'of/in', "atlanta's/np", 'recent/jj', 'primary/nn', 'election/nn', 'produced/vb', 'no/at', 'evidence/nn', 'that/cs', 'any/dt', 'irregularities/nn', 'took/vb', 'place/nn'] ['the/at', 'jury/nn', 'further/rb', 'said/vb', 'in/in', 'term-end/nn', 'presentments/nn', 'that/cs', 'the/at', 'city/nn', 'executive/jj', 'committee/nn', 'which/wd', 'had/hv', 'over-all/jj', 'charge/nn', 'of/in', 'the/at', 'election/nn', 'deserves/vb', 'the/at', 'praise/nn', 'and/cc', 'thanks/nn', 'of/in', 'the/at', 'city/nn', 'of/in', 'atlanta/np', 'for/in', 'the/at', 'manner/nn', 'in/in', 'which/wd', 'the/at', 'election/nn', 'was/be', 'conducted/vb'] ['the/at', 'september-october/np', 'term/nn', 'jury/nn', 'had/hv', 'been/be', 'charged/vb', 'by/in', 'fulton/np', 'superior/jj', 'court/nn', 'judge/nn', 'durwood/np', 'pye/np', 'to/to', 'investigate/vb', 'reports/nn', 'of/in', 'possible/jj', 'irregularities/nn', 'in/in', 'the/at', 'hard-fought/jj', 'primary/nn', 'which/wd', 'was/be', 'won/vb', 'by/in', 'mayor-nominate/nn', 'ivan/np', 'allen/np', 'jr./np'] ...
The sentences are constructed on the fly, from various files in the Brown Corpus directory, without any extra preprocessing steps or loading everything into RAM. Plugging in different training data is pretty straightforward: adjust the iterator to fit the data format. Gensim’s word2vec module already contains this BrownCorpus iterator, plus a few others, as blueprint examples.
Is it faster?
For testing the training speed, I’ll be using the Brown corpus above. With ~1 million words in 57k sentences, it’s too small for any quality training, but already big enough so we can evaluate improvements/regressions in performance. The speed is tested with a hidden layer size of 200 and with ignoring vocabulary which appears less than 5 times (size=200, min_count=5 constructor parameters). The final vocabulary contains 15,079 words, so the projection weights form a 15,079×200 matrix.
All results below are a best-of-ten on my MacbookPro 2.3GHz laptop, meaning each test was run 10 times and only the best result is reported, to mitigate random noise/OS multitasking influences.
Cython
The baseline performance of the NumPy code under these conditions is 1.4k words per second. Rewriting the training loop in Cython improves this to 33.3k words/sec, which is a 24x speedup.
EDIT: I originally used memoryviews to pass arrays around in Cython, and converting to a memoryview is a lot slower than plain casting arrays to pointers. I’ve since updated the code, re-ran all tests and updated the timings in this post.
BLAS
While rewriting the loops in Cython, I noticed the logic contained therein could be expressed in terms of BLAS. BLAS (Basic Linear Algebra Subprograms) are routines for stuff like vector_y += scalar_alpha * vector_x, with funky names like saxpy or dgemm. The reason why people use BLAS is that these basic operations can be heavily optimized, so that calling saxpy is way faster than what a generic compiler would produce from an equivalent, naive C loop.
NumPy itself links against BLAS internally, but unfortunately offers no way to plug directly into the C routines (or I couldn’t find a way). Calling BLAS via NumPy would be crazy slow, because it would have to go through Python calls, acquiring GIL and all.
Fortunately, SciPy contains a little known gem, hidden inside scipy.linalg.blas, which allows us to call the C BLAS routines directly. After some poking in SciPy’s bowels, scratching my head and googling around, I came up with:
from cpython cimport PyCObject_AsVoidPtr from scipy.linalg.blas import cblas ctypedef void (*saxpy_ptr) (const int *N, const float *alpha, const float *X, const int *incX, float *Y, const int *incY) nogil cdef saxpy_ptr saxpy=<saxpy_ptr>PyCObject_AsVoidPtr(cblas.saxpy._cpointer)
Which gives us saxpy, aka vectorized y = alpha*x in single precision. After plugging in saxpy, sdot & co. in place of lines 480, 487 and 489, I got a ~3x speedup in the crunch code, which translated into 89.8k words/s in the overall training speed. This is a 64x improvement over NumPy, and a 2.7x improvement over plain Cython.
Important note on BLAS: This improvement hinges on the quality of the BLAS library installed. On my MacBook Pro, SciPy automatically links against Apple’s vecLib, which contains an excellent BLAS. Similarly, Intel’s MKL, AMD’s AMCL, Sun’s SunPerf or the automatically tuned ATLAS are all good choices. Installing a generic, pre-packaged BLAS which is not optimized for your machine (no SSE, unknown CPU cache sizes…) is not a good choice.
Precomputed sigmoid table
The original C code contains an extra optimization, where instead of computing y = 1 / (1 + e-x), it looks up the value of y directly, in a precomputed table in RAM. This is a classic optimization from the times when float operations were expensive: a trade-off between using more memory & less precision vs. freeing up CPU. With dedicated FPUs and complex cache architectures in modern computers, it’s not such a clear hit anymore.
I actually expected no performance gain at all, but profiling proved me wrong. Using the precomputed EXP_TABLE with Cython, I got 34.7k words/s, a 4% improvement. Using both the BLAS optimization above and the precomputed sigmoid table, it’s 101.8k words/s. This is the best result so far, almost 73x faster than the NumPy code I started with. It’s also 3.5 times faster than the original C code, which runs at 29k words/s on this corpus (with a single thread).
Summary
All the timings above in one table:
optimization | words per second | speed-up |
---|---|---|
NumPy baseline | 1.4k | 1.0x |
original C word2vec | 29.0k | 20.7x |
Cython | 33.3k | 23.8x |
Cython + BLAS | 89.8k | 64.1x |
Cython + sigmoid table | 34.7k | 24.8x |
Cython + BLAS + sigmoid table | 101.8k | 72.7x |
Where next?
With the hotspot optimized, gensim’s word2vec is now both fast and easy to use. The cost is an extra dependency on Cython. I realize this may be an issue for Windows users, so I added fallback code where if the fast Cython fails to compile (because there’s no compiler or no Cython…), it will use the slower, NumPy code. This way, noob users can still use word2vec without going into compilation setup and technicalities.
The next step is making the code run in parallel, to make use of multicore machines. This will require some thought, because very fine parallelization may not work well because of thread collisions (there is no locking in the C word2vec, so it’s best if the threads work on very different words and sentences and avoid stepping on each other’s toes). On the other hand, having coarse-grained threading blocks like in the original C word2vec will not work in Python, because GIL can only be released in the low-level methods. I’ll have to give this some thought 🙂
Part III: Parallelizing word2vec in Python
Comments 44
Congratulations – seems like a miracle that your Python single threaded code is actually faster than paralel C code from Google 🙂
Author
Hah, only than the single-threaded C code. Parallelization is coming next weekend — this weekend I’m going home to Břeclav for Hody 🙂
Hi Radim,
thanks for publishing this! Interesting read and full of useful information! Id like to check your code, did you release it already somewhere?
greets
Stefan
Author
Hello Stefan, sure — search for “word2vec module” on this page.
Hi,
Nice work :).
I do not think you need to implement any locking mechanism. Locking does not buy you anything when it comes to stochastic optimization. Also if you use adaptaive learning rates (adagrad), most of your concents will be mitigated.
Check hogwild for the benfits of lock free implementation. (http://pages.cs.wisc.edu/~brecht/papers/hogwildTR.pdf)
Google’s pre-trained vectors (eg “freebase-vectors-skipgram1000-en.bin.gz”) gzip-compress so well (2.4GB vs. 5.4 GB)… I wonder if perhaps a compressed in-memory format might be another opportunity for optimization, lowering RAM requirements and maybe even speeding lookups/calculations (via better cache-efficiency and/or core utilization). Just a vague idea at this point, though.
Just test your code on my machine.
Numpy baseline: 1.5k
original C code: 159k (1 thread)
Cython + BLAS + sigmoid table: 67.8k
Did I get anything wrong?
Author
Hard to say. Can you describe your setup/data on the gensim mailing list? Let’s look into this, I’m curious.
When you specify -threads 8, the per-thread speed is Words/thread/sec: 76.96k. However the total running time becomes real 0m31.470s from real 1m48.107s.
Hi,
When you say that ditching memoryviews ended up with a performance gain – how much are we talking? The pyarray casting is really ugly and makes the code a lot less readable. However, if it results in a massive speed increase then I could learn to live with it.
Cheers
Author
17k words/sec with memoryviews, where it now says 33.3k.
But note that this was in a tight loop (hundreds of thousands of calls per second), where even a little overhead makes a big difference. If you do more work inside the called fnc, I expect the difference will be much less pronounced. I decided uglifying ~10 lines of crunch C code was worth the gain.
Thanks for the quick response. Interesting to note. I don’t usually deal with data quite that massive, but I do frequently iterate in tight loops. I’ll have to see how much of a win I get!
Thanks again!
Hi. Radim
Since I have a problem on the speed, I downloaded the dev version of gensim and did the following to install.
sudo python setup.py install
A problem happened to me is word2vec_inner.pyx did not go into the models fold. So I have to copy as follows: sudo cp gensim/models/word2vec_inner.pyx /usr/local/lib/python2.7/dist-packages/gensim-0.8.7-py2.7.egg/gensim/models/
After that, I can run the program, but I have even serious speed problem. The single thread speed drops to 21k (previously version 67.8k). When I set 8 workers, it goes up to 140K. The corpus for all the test are text8.
Another trivial thing to speed up I would like to mention is, comiple word2vec.py to c-so lib, I got around 50% speed-up. Say 8 workers, from 140K to 210K.
Author
Hi jeffery, the easiest way is to just add the gensim develop dir to your
PYTHONPATH
, or install it withpip install -e .
. The code will change a lot until the next release (unit tests, packaging), so there’s no point doing a “hard” install.Why your speed varies I don’t know. Like I said, for support come to the gensim mailing list.
Hi. Radim
Thanks for your quick response. I filed that on gensim google group.
Pingback: 3 Google Engineers observe consistant hyperspace between languages when dimensionality is reduced by a deep neural net | jMHz | Jason Malcolm Herzmark
Heads up for anyone trying the code – matutils.zeros_aligned is only on the develop branch, so if you update gensim, it still won’t run. I installed gensim from the zip file of the develop branch, and it ran. 103k words/sec, whohoo!
This is really cool. I’m wondering if something like numexpr (http://code.google.com/p/numexpr/) might be useful here as well?
Hi Radim, nice works. Thanks for all these stuffs.
Recently, I still have a problem about the speed. I first installed the cython (“easy_install cython”) to use optimized word2vec training, but when i run the word2vec in gensim, the value of “FAST_VERSION” is -1, which indicates that the optimized one is not used. And I can’t figure out how to solve this.
Any suggestions?? Thanks..
Author
Hi Leon, hard to say, not enough info. Come on the gensim mailing list or open a github issue. Those are better mediums for debugging.
Hi, Radim, thanks for your response. Sorry to disturb you again, i am a fresher in python and gensim.
I run the program on the win7 (64bit) operating systems and the cython is installed before the code running. And the speed is similar with the NumPy baseline case (1.4k/s).
And When I debug the code “from word2vec_inner import train_sentence, FAST_VERSION”, an exception is throwed: “ImportError: No module named word2vec_inner”. (The word2vec_inner.pyx file is also in the models directory).
Thanks..
Author
Hi again Leon, for the mailing list, google “gensim mailing list”.
Hello Leon, I come across the same problem as you, did you finally solve it?
Hi Radim, nice work, porting this over to SciPy!
I can [not – pls read on…] reproduce your claim of your port being faster on a 4-processor (Nehalem Xeon X5550; cuad-cores @ 2.67GHz) machine with “unlimited” RAM (64 GB).
First, whether I run your PYX code (using anaconda with their provided atlas package statically linked) or the original C code, the training time with 16 threads (“workers” in your code) is more or less the same for both word2vec and word2vec.py on text8 with otherwise all (word2vec) default parameters (~1 min). To be precise, the python version is actually a tiny bit slower (approx. 12 sec, for a time of ~1.2 min).
Next, if I train single-threaded/with one worker only, the outcome is that the C version requires ~5.5 min, but your PYX code only takes 4.2 min to train on text8, so indeed your version it is a bit faster in this scenario.
Third, I checked training on a single CPU with four threads/workers. In this scenario, both implementations take virtually the same time (~2.25 min).
Last but not least, I tried to use your blas.patch (on the word2vec mailing list) with the latest OpenBLAS from GitHub, using NEHALEM as architecture target an compiling with the “-msse4.2” instruction set. I added the openblas_set_num_threads call to the main routine, and ran word2vec using 4 threads. Still I could not reproduce your speed improvements – as a matter of fact, word2vec performance is quite a bit *slower* when using the BLAS routines on 4 cores (~3 vs. 2.25 min).
Finally, I checked the runtime with a single threaded run using OpenBLAS via your patch, and voilá, I finally could reproduce your 2x speed improvement: ~2.83 vs 5.5 min!
Care to elaborate? It seems, with OpenBLAS you do get a speed improvement, but it does not improve speed when using multiple threads (actually, it got worse!), and neither does the anaconda/atlas-based version of Python “beat” the native version in multi-threaded mode…
Author
Hi Florian, thanks for the performance report. The port doesn’t really use SciPy (except for the BLAS bindings).
The exact performance depends on a million factors: your HW, OS, compiler… It’s also quite possible you’re hitting your RAM throughput limits. All threads update the same matrix concurrently (both the C and py version), so there’s a lot of cache contention.
In any case, ±similar performance is what I was going for. Actually, even 2x slower than C wouldn’t bother me — check out the motivation section in blog post Part 1. If it’s a little bit faster, so much the better 🙂
True; I secretly was hoping there is something I did wrong, but probably not… By now I’ve even tried to recompile numpy/scipy with the latest OpenBLAS on because I did not trust anaconda, but it did not make your code any faster, either.
However, as you state, the real point is that it is always impressive when it turns out that Python can keep up with a pure C library. Great work!
BTW, I’ve “ported” your word2vec.py and _inner.pyx code to make it work “standalone” (without all of gensim) in Py3k; I’ll ping you on GitHub once I put it online. Maybe it’s worth it distributing “word2vec for Py3” via PyPi on its own or are you close to having all of gensim ported to Py3?
* I should add to my reply above that using OpenBLAS (compiled with SSE4.2 and NEHALEM optimizations) + numpy with your code in single threaded mode did finally generate the speedup you report (about twice as fast as word2vec), but this advantage is lost as soon as I use multiple threads.
Author
I’m not thrilled about a py3k fork of a (part of) gensim, because it just means more support and headaches =) But of course, as long as you respect the license, it’s your call.
Author
For the record, gensim is now (as of 0.10.0) fully python 3 compatible. Including word2vec.
Great work!
when I trained the data, I got 101k words/s speed on Macbook Pro 2.4GHZ Dual Core, But when I ran the model.most_similar(), I sometimes got the python crashed with the following report:
Thread 0 Crashed:: Dispatch queue: com.apple.root.default-priority
0 libBLAS.dylib
Do u know why? I didn’t use openblas.
Author
No idea; maybe faulty RAM? Problems with your SciPy installation? Post your specs/log at the gensim mailing list.
Hi Radim, have you tried NumbaPro (support GPU computing) to parallelize the code?
Author
No. But I’d be curious to hear the results — if you try, please let me know.
Sure, I’m modifying a SGD algorithm using NumbaPro (GPU). After this, I will try to modify word2vec and send you the code.
Hi Jie, did you get gensim working with NumbaPro?
Daniel
Hello ,
I am a huge follower of you work.
I am trying to develop UI for this work.But I am not an expert in python so I have a littl e issue.
I tried saving the model to a file but encountered the following error :
Traceback (most recent call last):
File “word2Vec_impl.py”, line 39, in
model.save(fo)
File “C:Python27libsite-packagesgensim-0.9.0-py2.7.egggensimmodelsword2
vec.py”, line 667, in save
super(Word2Vec, self).save(*args, **kwargs)
File “C:Python27libsite-packagesgensim-0.9.0-py2.7.egggensimutils.py”, l
ine 269, in save
pickle(self, fname)
File “C:Python27libsite-packagesgensim-0.9.0-py2.7.egggensimutils.py”, l
ine 608, in pickle
with smart_open(fname, ‘wb’) as fout: # ‘b’ for binary, needed on Windows
File “C:Python27libsite-packagesgensim-0.9.0-py2.7.egggensimutils.py”, l
ine 596, in smart_open
_, ext = path.splitext(fname)
File “C:Python27libntpath.py”, line 190, in splitext
return genericpath._splitext(p, sep, altsep, extsep)
File “C:Python27libgenericpath.py”, line 91, in _splitext
sepIndex = p.rfind(sep)
AttributeError: ‘file’ object has no attribute ‘rfind’
Any idea what could be the solution?
Author
Hi, you need to pass a filename (=string) to `save()`, not a file object.
Exponents are still very expensive unless you use GPU or the Intel compiler. May take dozens of CPU cycles.
I checked: Visual Studio with the fast floating point mode is fast as well. But on Linux with the GNU math library, this optimization still makes some sense.
Author
Leonid, with your optimization expertise, it would be interesting if you could find more places in word2vec to optimize.
(the core routine is here, in C/Cython: https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec_inner.pyx#L580 )
Hi Radim,
How would I use word2vec to calculating document similarity? I have a corpus of ~300 documents and need to compare them to another set of ~100,000 and pick the most similar ones.
thanks!
priya
Hi Radim,
I have a problem. I have used “gensim._six import iteritems, itervalues” then I got an error “ImportError: No module named _six” . But I already have “six” in my machine. How can I resolve that one.
Best Regards,
Partha
Author
Hello Partha, it’s more convenient to report problems on the gensim mailing list / github issues: https://groups.google.com/d/forum/gensim
In any case, “gensim._six import iteritems, itervalues” doesn’t sound like valid Python. Where does that line come from?
Pingback: 科学计算之Python与MKL共舞 | iJiaer