Making sense of word2vec

Radim Řehůřek gensim, programming 27 Comments

One year ago, Tomáš Mikolov (together with his colleagues at Google) made some ripples by releasing word2vec, an unsupervised algorithm for learning the meaning behind words. In this blog post, I’ll evaluate some extensions that have appeared over the past year, including GloVe and matrix factorization via SVD.

In case you missed the buzz, word2vec was widely featured as a member of the “new wave” of machine learning algorithms based on neural networks, commonly referred to as deep learning (though word2vec itself is rather shallow). Using large amounts of unannotated plain text, word2vec learns relationships between words automatically. The output is a set of vectors, one per word, with remarkable linear relationships that allow us to do things like vec(“king”) – vec(“man”) + vec(“woman”) =~ vec(“queen”), or find that vec(“Montreal Canadiens”) – vec(“Montreal”) + vec(“Toronto”) resembles the vector for “Toronto Maple Leafs”.

Apparently crows are good at that stuff, too: Crows Can Understand Analogies.

Check out my online word2vec demo and the blog series on optimizing word2vec in Python for more background.

So, what’s changed?

For one, Tomáš Mikolov no longer works for Google :-)

More relevantly, there was a lovely piece of research done by the good people at Stanford: Jeffrey Pennington, Richard Socher and Christopher Manning. They explicitly identified the objective that word2vec optimizes through its asynchronous stochastic gradient descent, and neatly connected it to the well-established field of matrix factorizations.

And in case you’ve never heard of matrix factorizations: in short, word2vec ultimately learns word vectors and word context vectors. These can be viewed as two 2D matrices (of floats), of size #words x #dim each. Their method GloVe (Global Vectors) identified a matrix which, when factorized using the particular SGD algorithm of word2vec, yields exactly these two matrices. So where word2vec was a bit hazy about what’s going on underneath, GloVe explicitly names the “objective” matrix, identifies the factorization, and provides some intuitive justification as to why this should give us working similarities.

Very nice and clear paper, go read it if you haven’t!

For example, if we have the following nine preprocessed sentences, and set window=5, the co-occurrence matrix looks like this:

# nine input sentences
texts = [['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

# word-word co-occurrence matrix, with context window size of 5
[[0 1 1 1 1 1 1 1 0 0 0 0]
 [1 0 1 0 0 2 0 0 1 0 0 0]
 [1 1 0 0 0 1 0 1 1 0 0 0]
 [1 0 0 0 1 1 2 2 0 0 0 0]
 [1 0 0 1 0 1 1 1 0 0 1 1]
 [1 2 1 1 1 2 1 2 3 0 0 0]
 [1 0 0 2 1 1 0 2 0 0 0 0]
 [1 0 1 2 1 2 2 0 1 0 0 0]
 [0 1 1 0 0 3 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 2 1]
 [0 0 0 0 1 0 0 0 0 2 0 2]
 [0 0 0 0 1 0 0 0 0 1 2 0]]
# (rows/columns represent words:
# "computer human interface response survey system time user eps trees graph minors",
# in that order)

Note how the matrix is very sparse and symmetrical; the implementation we’ll use below takes advantage of both these properties to train GloVe more efficiently.
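
To make the counting concrete, here is a minimal sketch of how such raw, symmetric counts can be collected with a sliding window. This is just an illustration, not the implementation used below; it assumes the texts list from above and lists the vocabulary in the same row/column order as the matrix:

from collections import defaultdict
import numpy as np

# vocabulary in the same order as the matrix rows/columns above
vocab = "computer human interface response survey system time user eps trees graph minors".split()
word2id = {word: idx for idx, word in enumerate(vocab)}

def cooccurrence_counts(texts, window=5):
    """Count how many times each pair of words appears within `window` positions
    of each other; each pair is recorded symmetrically, as both (w, c) and (c, w)."""
    counts = defaultdict(int)
    for text in texts:
        for pos, word in enumerate(text):
            for prev_pos in range(max(0, pos - window), pos):
                counts[(word2id[word], word2id[text[prev_pos]])] += 1
                counts[(word2id[text[prev_pos]], word2id[word])] += 1
    return counts

# densify the sparse dict of counts into a numpy array, just for printing
matrix = np.zeros((len(vocab), len(vocab)), dtype=int)
for (i, j), count in cooccurrence_counts(texts, window=5).items():
    matrix[i, j] = count
print(matrix)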

The GloVe algorithm then transforms such raw integer counts into a matrix where the co-occurrences are weighted based on their distance within the window (word pairs farther apart get less co-occurrence weight):

# same row/column names as above
[[ 0.    0.5   1.    0.5   0.5   1.    0.33  1.    0.    0.    0.    0.  ]
 [ 0.    0.    1.    0.    0.    2.    0.    0.    0.5   0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    1.    0.    1.    0.5   0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.25  1.    2.    1.33  0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.33  0.2   1.    0.    0.    0.5   1.  ]
 [ 0.    0.    0.    0.    0.    0.    0.5   1.    1.67  0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.75  0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    1.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    1.5   1.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    2.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]]

then takes a log, and factorizes this matrix to produce the final word vectors.
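
The weighting itself is easy to mimic: instead of adding 1 for every pair inside the window, add 1/distance, and keep only the upper triangle since the matrix is symmetric. Here is a rough sketch of that idea, reusing the word2id mapping from the counting sketch above (again an illustration only, not glove-python's actual code; in particular, I'm not reproducing how it treats same-word pairs):

from collections import defaultdict

def weighted_cooccurrence_counts(texts, window=5):
    """Like the raw counts above, but each co-occurring pair contributes
    1.0 / distance instead of 1, and only the upper triangle (i <= j) is kept."""
    counts = defaultdict(float)
    for text in texts:
        for pos, word in enumerate(text):
            for prev_pos in range(max(0, pos - window), pos):
                i, j = sorted((word2id[text[prev_pos]], word2id[word]))
                counts[(i, j)] += 1.0 / (pos - prev_pos)
    return counts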

This was really exciting news — it means the plain softmax word2vec essentially reduces to counting how many times words occur together, with some scaling thrown in. Technically, this is just a glorified cooccurrence_counts[word, other_word]++ in a loop, followed by any of the standard matrix factorization algorithms, both of which are well understood processes with efficient implementations.

GloVe vs word2vec

Oddly, the evaluation section of the GloVe paper didn’t match the quality of the rest. It had serious flaws in how the experiments compared GloVe to other methods. Several people called the authors out on the weirdness, most lucidly the Levy & Goldberg research duo from Bar-Ilan University; check out their “apples to apples” blog post for a bit of academic drama. To summarize, when evaluated properly, paying attention to parameter settings, GloVe doesn’t really seem to outperform the original word2vec, let alone by the 11% claimed in the GloVe paper.

Luckily, Maciej Kula implemented GloVe in Python, using Cython for performance. Using his neat implementation, we can try to make sense of the performance and accuracy ourselves.

Code to train GloVe in Python:

from gensim import utils, corpora, matutils, models
import glove

# Restrict dictionary to the 30k most common words.
wiki = models.word2vec.LineSentence('/data/shootout/title_tokens.txt.gz')
id2word = corpora.Dictionary(wiki)
id2word.filter_extremes(keep_n=30000)
word2id = dict((word, id) for id, word in id2word.iteritems())

# Filter all wiki documents to contain only those 30k words.
filter_text = lambda text: [word for word in text if word in word2id]
filtered_wiki = lambda: (filter_text(text) for text in wiki)  # generator

# Get the word co-occurrence matrix -- needs lots of RAM!!
cooccur = glove.Corpus()
cooccur.fit(filtered_wiki(), window=10)

# and train GloVe model itself, using 10 epochs
model_glove = glove.Glove(no_components=600, learning_rate=0.05)
model_glove.fit(cooccur.matrix, epochs=10)
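
Once trained, the GloVe model can be queried for nearest neighbours. A small sketch, under the assumption that glove-python's add_dictionary and most_similar helpers behave as I recall (attach the word-to-id mapping built by the Corpus object first, then query):

# attach the word <-> id mapping from the co-occurrence Corpus, then query
model_glove.add_dictionary(cooccur.dictionary)
print(model_glove.most_similar('physics'))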

And similarly for training word2vec:

model_word2vec = models.Word2Vec(size=600, window=10)
model_word2vec.build_vocab(filtered_wiki())
model_word2vec.train(filtered_wiki())
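
Either model can then answer the famous analogy queries. With gensim, for example (the query words are just an illustration; they have to be present in the 30k vocabulary):

# "king" - "man" + "woman" = ?
print(model_word2vec.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))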

The reason why we restricted the vocabulary to only 30,000 words is that Maciej’s implementation of GloVe requires memory quadratic in the number of words: it keeps that sparse matrix of all word x word co-occurrences in RAM. In contrast, the gensim word2vec implementation is happy with linear memory, so millions of words are not a problem there. This is not an intrinsic limitation of GloVe though; with a different implementation, the co-occurrence matrix could be assembled out-of-core (Map/Reduce seems ideal for the job), and the factorization could just stream over it with constant memory too, in a more gensim-like fashion.

 

Results for 600 dims, context window of 10, 1.9B words of EN Wikipedia:

algorithm | accuracy on the word analogy task | wallclock time | peak RAM [MB]
I/O only (iterating over wiki with sum(len(text) for text in filtered_wiki())) | N/A | 3m | 25
GloVe, 10 epochs, learning rate 0.05 | 67.1% | 4h12m | 9,414
GloVe, 100 epochs, learning rate 0.05 | 67.3% | 18h39m | 9,452
word2vec, hierarchical skipgram, 1 epoch | 57.4% | 3h10m | 266
word2vec, negative sampling with 10 samples, 1 epoch | 68.3% | 8h38m | 628
word2vec, pre-trained GoogleNews model released by Tomáš Mikolov (300 dims, 3,000,000 vocabulary) | 55.3% | ? | ?

Basically, where GloVe precomputes the large word x word co-occurrence matrix in memory and then quickly factorizes it, word2vec sweeps through the sentences in an online fashion, handling each co-occurrence separately. So, there is a tradeoff between taking more memory (GloVe) vs. taking longer to train (word2vec). Also, once computed, GloVe can re-use the co-occurrence matrix to quickly factorize with any dimensionality, whereas word2vec has to be trained from scratch after changing its embedding dimensionality.

Note that both implementations are fairly optimized, running on 8 threads (on an 8 core machine), using the exact same input corpus, text preprocessing, vocabulary and evaluation code, so that the numbers are directly comparable. Code here.

SPPMI and SVD

In a manner analogous to GloVe, Levy and Goldberg (the same researchers mentioned above) analyzed the objective function of word2vec with negative sampling. That’s the one that performed best in the table above, so I decided to check it out too.

Again, they manage to derive a beautifully simple connection to matrix factorization. This time, the word x context objective “source” matrix is computed differently from GloVe’s. Each matrix cell, corresponding to word w and context word c, is computed as max(0.0, PMI(w, c) - log(k)), where k is the number of negative samples in word2vec (for example, k=10). PMI is the standard pointwise mutual information: if we write \#(w, c) for the number of times word w and context c occurred together in the training corpus, then

$$\mathrm{PMI}(w, c) = \log \frac{\#(w, c) \cdot \sum_{w', c'} \#(w', c')}{\sum_{c'} \#(w, c') \cdot \sum_{w'} \#(w', c)}$$

(no smoothing).

The funky “SPPMI” name simply reflects that we’re subtracting log(k) from PMI (“shifting”) and that we’re taking the max(0.0, SPMI) (“positive”; should be non-negative, really). So, Shifted Positive Pointwise Mutual Information.
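
Computed naively, this is only a few lines of numpy. A minimal sketch (my own illustration, not Levy & Goldberg's code), which takes a dense array of raw co-occurrence counts such as the window=5 matrix from the beginning of this post:

import numpy as np

def sppmi(counts, k=1):
    """Shifted positive PMI of a dense array of raw co-occurrence counts,
    where counts[w, c] = number of times word w co-occurred with context word c."""
    counts = np.asarray(counts, dtype=np.float64)
    total = counts.sum()  # total number of co-occurrence events
    word_sums = counts.sum(axis=1, keepdims=True)     # sum over contexts, per word
    context_sums = counts.sum(axis=0, keepdims=True)  # sum over words, per context
    with np.errstate(divide='ignore'):
        pmi = np.log(counts * total / (word_sums * context_sums))
    return np.maximum(0.0, pmi - np.log(k))  # zero counts give -inf PMI, clipped to 0 here

Applying this with k=1 to the raw counts from the first matrix above should reproduce, up to rounding, the SPPMI matrix shown next.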

For example, for the same nine texts we used above and k=1, the SPPMI matrix looks like this:

[[ 0.    0.83  0.83  0.49  0.49  0.    0.49  0.13  0.    0.    0.    0.  ]
 [ 0.83  0.    1.16  0.    0.    0.83  0.    0.    0.98  0.    0.    0.  ]
 [ 0.83  1.16  0.    0.    0.    0.13  0.    0.47  0.98  0.    0.    0.  ]
 [ 0.49  0.    0.    0.    0.49  0.    1.18  0.83  0.    0.    0.    0.  ]
 [ 0.49  0.    0.    0.49  0.    0.    0.49  0.13  0.    0.    0.83  1.05]
 [ 0.    0.83  0.13  0.    0.    0.    0.    0.13  1.05  0.    0.    0.  ]
 [ 0.49  0.    0.    1.18  0.49  0.    0.    0.83  0.    0.    0.    0.  ]
 [ 0.13  0.    0.47  0.83  0.13  0.13  0.83  0.    0.29  0.    0.    0.  ]
 [ 0.    0.98  0.98  0.    0.    1.05  0.    0.29  0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    2.37  1.9 ]
 [ 0.    0.    0.    0.    0.83  0.    0.    0.    0.    2.37  0.    2.08]
 [ 0.    0.    0.    0.    1.05  0.    0.    0.    0.    1.9   2.08  0.  ]]

No neural network training, no parameter tuning, we can directly take rows of this SPPMI matrix to be the word vectors. Very fast and simple. How does raw SPPMI compare to word2vec’s and GloVe’s factorizations though?

Comparison on 600 dims, context window 10, 1.9B words of EN Wikipedia:

algorithm | accuracy on the analogy task | wallclock time | peak RAM [MB]
word2vec, negative sampling k=10, 1 epoch | 68.3% | 8h38m | 628
GloVe, learning rate 0.05, 10 epochs | 67.1% | 4h12m | 9,414
SPPMI, k=1 | 48.7% | 50m | 3,433
SPPMI, k=10 | 30.3% | 50m | 3,429
SPPMI-SVD, k=1 | 39.4% | 1h23m | 3,426
SPPMI-SVD, k=10 | 3.8% | 1h23m | 3,444

The SPPMI-SVD method simply factorizes the sparse SPPMI matrix using Singular Value Decomposition (SVD), rather than the gradient descent methods of word2vec/GloVe, and uses the (dense) left singular vectors as the final word embeddings. SVD is a fast, scalable method with a straightforward geometric interpretation, and it performed very well in the NIPS experiments of Levy & Goldberg, who suggested SPPMI-SVD.
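
For completeness, here is roughly what that factorization looks like with scipy's sparse SVD (my own sketch, not necessarily the exact setup benchmarked above; note that the number of requested dimensions must be smaller than the vocabulary size, so dim=600 only makes sense for the full 30k vocabulary, not the 12-word toy example):

from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def sppmi_svd(sppmi_matrix, dim=600):
    """Truncated SVD of the (sparse) SPPMI matrix; the left singular vectors serve
    as the word embeddings (variants also scale them by the singular values,
    or by their square root)."""
    U, S, Vt = svds(csr_matrix(sppmi_matrix), k=dim)
    return U  # shape: #words x dim, one embedding per word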

In the table above, the quality of both SPPMI and SPPMI-SVD models is atrocious, especially for higher values of k (more “shift”). I’m not sure why this is; I’ll try to get the original implementation of Levy & Goldberg to compare.

Also, I originally tried to get this table on 1,000 dims, rather than 600. But the GloVe implementation started failing, producing word vectors with NaNs in them, and weird <1% accuracies when I tried to combat that by decreasing its learning rate. Maciej is still working on that one, so if you’re thinking of using GloVe in production, beware. EDIT: Successfully resolved, Maciej’s GloVe handles that fine now.

To make experiments easier, I wrote and published a script that takes the parsed English Wikipedia and computes accuracy on the analogy task, using each of these different algorithms in turn. You can get it from GitHub and experiment with the various methods yourself, trying them on your own data / application.
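
If you only care about the word2vec numbers, gensim has a built-in analogy evaluation you can run directly on a trained model. A sketch (the path to the standard questions-words.txt file is an assumption; the restrict_vocab argument, if your gensim version has it, mirrors the 30k vocabulary used above):

# evaluate on the standard analogy questions, considering only the 30k most frequent words
analogy_results = model_word2vec.accuracy('questions-words.txt', restrict_vocab=30000)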

What does that all mean?

Playing with Wikipedia is fun, but usually clients require more concrete insights from us.

How do we tweak word2vec to better model what we want?

How do we tune word2vec model quality on a specific task (which is, in all likelihood, not “word analogies”)?

I’ll postpone that until the next post. Suffice to say that the performance depends on tuning the methods’ internal parameters, in non-obvious ways. The Bar Ilan powerhouse of Levy, Goldberg & Dagan wrote a full paper on that, exploring the various parameter combinations (dynamic vs. fixed context window, subsampling, negative distribution smoothing, taking context vectors into account as well….). Their paper is under review now — I’ll post a link here as soon as it becomes public. EDIT: ACL paper here.

In the meantime, there has been some tentative research into word2vec error analysis and tuning. Check out this web demo (by Levy & Goldberg again, from this paper) for investigating which contexts get activated for different words. This can lead to visual insights into which co-occurrences are responsible for a particular class of errors.

TL;DR: the word2vec implementation is still fine and state-of-the-art, you can continue using it :-)


If you like such machine learning shenanigans, sign up for my newsletter, get in touch for commercial projects, or check out my older posts: efficient nearest neighbour similarity search, optimizing word2vec, its doc2vec extension.

Comments 27

  1. Mr. Powerhouse

    Nice work, Radim! Few comments:

    – in your first table, you report 55.3% accuracy for the pretrained vectors from word2vec page – I’m getting >70%

    – the training time & accuracy you obtained with your Gensim implementation of word2vec seem to be worse than the C version (when I used the script demo-train-big-model-v1.sh from word2vec, the vectors themselves are trained in about three hours using 8 billion words, and the accuracy is 10% higher than yours – big difference!) – maybe you used the wrong hyper-parameters? or the C version is faster and more accurate?

    – finally, Omer and Yoav report much better results with SPPMI – any idea why your results are worse?

    Thanks!

    1. Radim (post author)

      Hi Mr. Powerhouse 🙂

      1. All experiments were done using the same 30k vocabulary, so that the numbers can be compared directly. The accuracy there is ~55% using the GoogleNews model, not >70%. The model performed very well on the syntactic tasks, and poorly on the semantic ones.

      2. No, the two are pretty much identical. Check out the word2vec porting series. Your results are most likely due to a different training corpus / hardware / level of parallelization. “Word2vec” is actually many algorithms, depending on the parameter settings. You’d have to control for all of those to get a meaningful apples-to-apples comparison.

      3. I’m in touch with them, looking for an answer. Will update when we’ve found out the reason. It’s probably not the SPPMI code itself, which is very straightforward (see the github repo).

      1. sebastien-j

        The accuracy function in gensim is unfair to the GoogleNews model, which was trained on non-lowercased data. For a better evaluation, the first letter of countries, cities, etc. should not be lowercased. Arguably, words that differ from “a”, “b” and “c” by letter case only should also be rejected, while answers that are identical to the expected word, except for case, should be accepted.

        Using the full vocabulary, the accuracy with these modifications is 73.8% (14,222/19,544), while the current gensim implementation gives 55.3% (7581/13705), with many questions rejected.

        1. Radim (post author)

          No, the evaluation procedure is the same for both the C and gensim version. Can you post your code? I’m not sure what you’re talking about.

          Both evaluations treat words as case insensitive, meaning if the model suggests “Montreal”, and the expected answer is “montreal”, it’s recorded as a success = correct answer.

          The only difference is the Python version can handle unicode, whereas the C version is ASCII-only. (So if a word contained “Č” or “Á”, the C version wouldn’t be able to handle the lower case correctly. No words in the eval set do, though, so it makes no difference here.)

          1. sebastien-j

            https://gist.github.com/anonymous/06e060cdf53257fde1ef

            For the GoogleNews model, `lowercase=False` should be used in the function above.

            If a word is represented by many vectors, which one we choose when asking the analogy questions matters. In general, using the vector corresponding to the most common surface form will lead to higher accuracy. The vector for “Montreal” (common) would be `better` than the one for “montreal” (rare).

            For example, the GoogleNews model gives the wrong answer to “athens:greece::beijing:?”. However, if we ask “Athens:Greece::Beijing:?”, then the prediction is correct.

          2. Radim (post author)

            I think I see what you mean. You’re proposing a different evaluation technique, one where letter case matters.

            That makes sense, and you’re right in that such evaluation may give different results for models trained with case sensitivity. However, neither the C tool nor gensim implement such evaluation. So if they’re “unfair” in their evaluation, they are both equally unfair.

            Looking at the C source, it will ASCII-uppercase all eval words and all model vocabulary; gensim will unicode-lowercase all eval words and leave vocabulary intact.

          3. Radim (post author)

            You left me wondering how the C tool handles conflicts when uppercasing its model vocabulary: “Montreal” and “montreal” will both map to “MONTREAL”, so which word’s embedding is actually chosen during the accuracy evaluation?

            It seems that in this case, the C word match during eval is set up (maybe accidentally, maybe by design, there are no comments) so that the most frequent surface form is used. Which is probably what we want. Or maybe being explicit, raising an error, is preferable when there’s a casing conflict? Or your approach, where there’s a bool flag that says “either ignore case, or use my case exactly”? I’m not sure.

            I think I’ll change gensim to use one of these options & make the docs clear about what’s going on. This only affects models with case-sensitive vocabulary, of course.

            Thanks for bringing this up Sebastien.

  2. Leonid Boytsov

    A very interesting post, thank you.

    I haven’t seen the original papers. However, off the top of my head, it seems very strange to me that you can use SPPMI matrix directly without any dimensionality reduction. The number of columns (each one corresponds to a context word) should be huge. Did I miss something?

    1. Radim (post author)

      Hi Leonid — you missed nothing. The SPPMI matrix is huge.

      It’s basically #words x #words, like I described in the post. In theory, it’s also very sparse.

      In this particular case though, after using only the most frequent 30k words, the sparsity is pretty low = the SPPMI matrix is almost dense.

  3. Pingback: Condensed News | Data Analytics & R

  4. Pingback: 语义分析的一些方法(二) (“Some Methods for Semantic Analysis, Part 2”) | 火光摇曳

  5. Marcel

    Hi Radim,
    thank you for your great post!
    Do you have an update on this for us?
    Eg. you write
    “How do we tweak word2vec to better model what we want? How to tune word2vec model quality on a specific task (which is, in all likelihood, not “word analogies”)? I’ll postpone that until the next post. ”
    and
    “Levy & Goldberg wrote a full paper on that, exploring the various parameter combinations (dynamic vs. fixed context window, subsampling, negative distribution smoothing, taking context vectors into account as well….). Their paper is under review now — I’ll post a link here as soon as it becomes public.”

    Thank you for your great work!
    Marcel

    1. Radim (post author)

      Thanks Marcel!

      Too much “real” work, I know I’ve been neglecting the blog a bit… I’ll try to post again soon.

      Levy & Goldberg: I don’t know, I’ll check for you.

  6. Giuseppe Attardi

    Have you considered the approach of Hellinger PCA, which applies SVD to the co-occurrence matrix and is also cited in the GloVe paper?

    Lebret, Rémi, and Ronan Collobert. “Word Embeddings through Hellinger PCA.” EACL 2014 (2014): 482.

    1. Radim (post author)

      No we haven’t, I haven’t seen that paper yet. Do you want to add the method & report back?

      Our evaluation here is open source, and gensim supports large scale SVD, so I’m thinking the implementation may be straightforward.

  7. Pingback: GloVe: Global Vectors for Word Representations « Building Babylon

  8. Pingback: CodieNerd (2) – Toying with Word2Vec | Everything about Data Analytics

  9. Pingback: » Compressed Representation: A Really Good Idea Jo's Blog

  10. Dmitriy Selivanov

    Hi, Radim. Do you have exactly the same data on which you performed the benchmarks? I implemented the GloVe algorithm in the text2vec R package (from scratch), and on the latest 2.1B-word Wikipedia dump it produces much better results than you reported here – 75% on the word analogy task (top 30k words, dim = 600, 10 SGD epochs, 12,182 questions).

    1. Radim (post author)

      Hello Dmitriy!

      The script used to preprocess the data (xml.bz2 Wikipedia dump) is linked to from this post. It comes from the “nearest neighbour shootout” blog series and is public.

      The raw data itself was taken as the latest Wikipedia dump at the time of writing that “shootout” series. Also public.

      The GloVe implementation used was Maciej Kula’s, also linked to in this post. It is the version from the time of writing this blog post, obviously (it may have changed since).

      If you get much better results on the same inputs + parameters, it may be worth checking back with Maciej. I’m sure he’d be interested in hearing comparisons + improving / fixing his implementation.

      EDIT: here’s some file stats I found on disk:

      $ ls -l title_tokens.txt
      -rw-r--r-- 1 radim develop 11566440066 Apr 15 2015 title_tokens.txt

        1. Radim (post author)

          I don’t see how that would affect accuracy. Or in your package, in your evaluation, do you combine the word and context vectors together?

          Anyway, I think your analysis and findings are very interesting! Please keep me posted if you write it up in a blog post (and I’ll ping Maciej as well). There’s such a lack of practical tools in this space (as opposed to media hype).
