Word2vec Tutorial

Radim Řehůřek · gensim, programming

I never got round to writing a tutorial on how to use word2vec in gensim. It’s simple enough and the API docs are straightforward, but I know some people prefer more verbose formats. Let this post be a tutorial and a reference example.

UPDATE: the complete HTTP server code for the interactive word2vec demo below is now open sourced on Github. For a high-performance similarity server for documents, see ScaleText.com.

Preparing the Input

Starting from the beginning, gensim’s word2vec expects a sequence of sentences as its input. Each sentence is a list of words (utf8 strings):

# import modules & set up logging
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)

Keeping the input as a Python built-in list is convenient, but can use up a lot of RAM when the input is large.

Gensim only requires that the input provide the sentences sequentially, when iterated over. No need to keep everything in RAM: we can provide one sentence, process it, forget it, load another sentence…

For example, if our input is strewn across several files on disk, with one sentence per line, then instead of loading everything into an in-memory list, we can process the input file by file, line by line:

import os

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('/some/directory') # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)

Say we want to further preprocess the words from the files — convert to unicode, lowercase, remove numbers, extract named entities… All of this can be done inside the MySentences iterator and word2vec doesn’t need to know. All that is required is that the input yields one sentence (list of utf8 words) after another.
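
For illustration, here is one possible sketch of such a preprocessing iterator. The lowercasing, the Python 2 style utf8 decode and the number-dropping are just examples of what you might do, not something word2vec requires:

import os

class MyPreprocessedSentences(object):
    """Example only: yield lowercased, number-free unicode tokens, one sentence per line."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                tokens = line.decode('utf8').lower().split()  # convert to unicode, lowercase
                yield [token for token in tokens if not token.isdigit()]  # drop plain numbers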

Note to advanced users: calling Word2Vec(sentences, iter=1) will run two passes over the sentences iterator (or, in general iter+1 passes; default iter=5). The first pass collects words and their frequencies to build an internal dictionary tree structure. The second and subsequent passes train the neural model. These two (or, iter+1) passes can also be initiated manually, in case your input stream is non-repeatable (you can only afford one pass), and you’re able to initialize the vocabulary some other way:

model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
model.build_vocab(some_sentences)  # can be a non-repeatable, 1-pass generator
model.train(other_sentences)  # can be a non-repeatable, 1-pass generator

In case you’re confused about iterators, iterables and generators in Python, check out our tutorial on Data Streaming in Python.
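
As a quick toy illustration of why this matters: a plain generator can only be consumed once, while an object with an __iter__ method can be iterated over repeatedly, which is what the Word2Vec constructor needs for its multiple passes:

# a generator expression: good for a single streamed pass only
gen = (line.split() for line in ['first sentence', 'second sentence'])
print(sum(1 for _ in gen))  # 2
print(sum(1 for _ in gen))  # 0 -- already exhausted, nothing left for a second pass

# an iterable object: every for-loop over it starts afresh
class RepeatableSentences(object):
    def __iter__(self):
        for line in ['first sentence', 'second sentence']:
            yield line.split()

rs = RepeatableSentences()
print(sum(1 for _ in rs))  # 2
print(sum(1 for _ in rs))  # 2 -- can be consumed again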

Training

Word2vec accepts several parameters that affect both training speed and quality.

One of them is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to make any meaningful training on those words, so it’s best to ignore them:

model = Word2Vec(sentences, min_count=10)  # default value is 5

A reasonable value for min_count is between 0 and 100, depending on the size of your dataset.

Another parameter is the size of the NN layers, which corresponds to the “degrees” of freedom the training algorithm has:

model = Word2Vec(sentences, size=200)  # default value is 100

Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.

The last of the major parameters (full list here) is for training parallelization, to speed up training:

model = Word2Vec(sentences, workers=4) # default = 1 worker = no parallelization

The workers parameter only has an effect if you have Cython installed. Without Cython, you’ll only be able to use one core because of the GIL (and word2vec training will be miserably slow).
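
These parameters can of course be combined in a single call; the values below are purely illustrative, not recommendations:

model = Word2Vec(sentences, min_count=10, size=200, workers=4)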


Memory

At its core, word2vec model parameters are stored as matrices (NumPy arrays). Each array is #vocabulary (controlled by the min_count parameter) times #size (the size parameter) of floats (single precision, aka 4 bytes).

Three such matrices are held in RAM (work is underway to reduce that number to two, or even one). So if your input contains 100,000 unique words, and you asked for layer size=200, the model will require approx. 100,000*200*4*3 bytes = ~229MB.

There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above.
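
If you want to sanity-check the estimate yourself, a rough back-of-the-envelope helper (using the three matrices and 4 bytes per float described above) could look like this:

def estimate_w2v_memory(vocab_size, size, matrices=3, bytes_per_float=4):
    """Rough lower bound (in bytes) for the model's weight matrices alone."""
    return vocab_size * size * bytes_per_float * matrices

print(estimate_w2v_memory(100000, 200) / 1024.0 / 1024.0)  # ~229 MB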

Evaluating

Word2vec training is an unsupervised task; there’s no good way to objectively evaluate the result. Evaluation depends on your end application.

Google have released their testing set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task: https://raw.githubusercontent.com/RaRe-Technologies/gensim/develop/gensim/test/test_data/questions-words.txt.

Gensim supports the same evaluation set, in exactly the same format:

model.accuracy('/tmp/questions-words.txt')
2014-02-01 22:14:28,387 : INFO : family: 88.9% (304/342)
2014-02-01 22:29:24,006 : INFO : gram1-adjective-to-adverb: 32.4% (263/812)
2014-02-01 22:36:26,528 : INFO : gram2-opposite: 50.3% (191/380)
2014-02-01 23:00:52,406 : INFO : gram3-comparative: 91.7% (1222/1332)
2014-02-01 23:13:48,243 : INFO : gram4-superlative: 87.9% (617/702)
2014-02-01 23:29:52,268 : INFO : gram5-present-participle: 79.4% (691/870)
2014-02-01 23:57:04,965 : INFO : gram7-past-tense: 67.1% (995/1482)
2014-02-02 00:15:18,525 : INFO : gram8-plural: 89.6% (889/992)
2014-02-02 00:28:18,140 : INFO : gram9-plural-verbs: 68.7% (482/702)
2014-02-02 00:28:18,140 : INFO : total: 74.3% (5654/7614)

This accuracy method takes an optional parameter restrict_vocab, which limits which test examples are considered.
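
For example, to only consider analogy questions whose words are among the 100,000 most frequent words in the model:

model.accuracy('/tmp/questions-words.txt', restrict_vocab=100000)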

Once again, good performance on this test set doesn’t mean word2vec will work well in your application, or vice versa. It’s always best to evaluate directly on your intended task.

Storing and loading models

You can store/load models using the standard gensim methods:

model.save('/tmp/mymodel')
new_model = gensim.models.Word2Vec.load('/tmp/mymodel')

which uses pickle internally, optionally mmap‘ing the model’s internal large NumPy matrices into virtual memory directly from disk files, for inter-process memory sharing.
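
For example, to load a stored model with its large arrays memory-mapped read-only, so that several processes can share the same matrices:

model = gensim.models.Word2Vec.load('/tmp/mymodel', mmap='r')  # arrays mapped from disk, shared between processes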

In addition, you can load models created by the original C tool, using both its text and binary formats:

model = Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)
# using gzipped/bz2 input works too, no need to unzip:
model = Word2Vec.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)

Online training / Resuming training

Advanced users can load a model and continue training it with more sentences:

model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.train(more_sentences)

You may need to tweak the total_words parameter to train(), depending on what learning rate decay you want to simulate.
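
For instance, something along these lines (a sketch only: the word count is made up, and the exact train() keyword arguments may differ between gensim versions):

model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.train(more_sentences, total_words=1000000)  # pretend the new stream contains ~1M words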

Note that it’s not possible to resume training with models generated by the C tool and loaded via load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.

Using the model

Word2vec supports several word similarity tasks out of the box:

model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
model.similarity('woman', 'man')
0.73723527

If you need the raw output vectors in your application, you can access these either on a word-by-word basis

model['computer']  # raw NumPy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

…or en masse, as a 2D NumPy matrix from model.syn0.
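
For example, for a hypothetical model with a 100,000-word vocabulary and size=200:

model.syn0.shape  # (100000, 200): one row of 200 floats per vocabulary word
model.syn0[model.vocab['computer'].index]  # the same vector as model['computer']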

Bonus app

As before with finding similar articles in the English Wikipedia with Latent Semantic Analysis, here’s a bonus web app for those who managed to read this far. It uses the word2vec model trained by Google on the Google News dataset, on about 100 billion words:


If you don’t get “queen” back, something went wrong and baby SkyNet cries.
Try more examples too: “he” is to “his” as “she” is to ?, “Berlin” is to “Germany” as “Paris” is to ?

Try: U.S.A.; Monty_Python; PHP; Madiba.

Also try: “monkey ape baboon human chimp gorilla”; “blue red green crimson transparent”.

The model contains 3,000,000 unique phrases built with layer size of 300.

Note that the similarities were trained on a news dataset, and that Google did very little preprocessing there. So the phrases are case sensitive: watch out! Especially with proper nouns.

On a related note, I noticed about half the queries people entered into the LSA@Wiki demo contained typos/spelling errors, so they found nothing. Ouch.

To make it a little less challenging this time, I added phrase suggestions to the forms above. Start typing to see a list of valid phrases from the actual vocabulary of Google News’ word2vec model.

The “suggested” phrases are simply the ten phrases starting at the position that bisect_left(all_model_phrases_alphabetically_sorted, prefix_you_typed_so_far), from Python’s built-in bisect module, returns.
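
In other words, roughly this (a simplified sketch of the idea, not the actual server code):

from bisect import bisect_left

def suggest_phrases(all_model_phrases_alphabetically_sorted, prefix_you_typed_so_far, n=10):
    """Return the n phrases that sort immediately at/after the typed prefix."""
    pos = bisect_left(all_model_phrases_alphabetically_sorted, prefix_you_typed_so_far)
    return all_model_phrases_alphabetically_sorted[pos:pos + n]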

See the complete HTTP server code for this “bonus app” on github (using CherryPy).

Outro

Full word2vec API docs here; get gensim here. Original C toolkit and word2vec papers by Google here.

And here’s me talking about the optimizations behind word2vec at PyData Berlin 2014.

Comments 90

  1. pritpal

    Hi radim,
    Impressive tutorial. I have a query that the output Word2Vec model is returning in an array. How can we use that as an input to recursive neural network??

    Thanks

  2. Suzana

    model = gensim.models.Word2Vec(sentences) will not work as shown in the tutorial, because you will receive the error message: “RuntimeError: you must first build vocabulary before training the model”. You also have to set down the min_count manually, like model = gensim.models.Word2Vec(sentences, min_count=1).

    1. Post
      Author
      Radim

      Default `min_count=5` if you don’t set it explicitly. Vocabulary is built automatically from the sentences.

      What version of gensim are you using? It should really work simply with `Word2Vec(sentences)`, there are even unit tests for that.

      1. Claire

        if you don’t set ‘min_count=1’, it will remove all the words in sentences in the example given –
        logging:
        ‘INFO : total 0 word types after removing those with count<5'

  3. Pavel

    Hi Radim,

    Is there any way to obtain the similarity of phrases out of the word2vec? I’m trying to get 2-word phrases to compare, but don’t know how to do it.

    Thanks!
    Pavel

    1. Post
      Author
      Radim

      Hello Pavel, yes, there is a way.

      First, you must detect phrases in the text (such as 2-word phrases).

      Then you build the word2vec model like you normally would, except some “tokens” will be strings of multiple words instead of one (example sentence: [“New York”, “was”, “founded”, “16th century”]).

      Then, to get similarity of phrases, you do `model.similarity(“New York”, “16th century”)`.

      It may be a good idea to replace spaces with underscores in the phrase-tokens, to avoid potential parsing problems (“New_York”, “16th_century”).

      As for detecting the phrases, it’s a task unrelated to word2vec. You can use existing NLP tools like the NLTK/Freebase, or help finish a gensim pull request that does exactly this: https://github.com/piskvorky/gensim/pull/135 .
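
      A minimal sketch of that workflow, assuming the phrases were already detected and joined with underscores (toy data; min_count is lowered only so the example runs):

      import gensim

      sentences = [
          ['New_York', 'was', 'founded', 'in', 'the', '17th', 'century'],
          ['the', '16th_century', 'saw', 'the', 'founding', 'of', 'other', 'cities'],
      ]
      model = gensim.models.Word2Vec(sentences, min_count=1)
      print model.similarity('New_York', '16th_century')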

  4. luopuya

    Hi Radim,

    The Word2Vec function split my words as:
    u'\u4e00\u822c' ==>> u'\u4e00' and u'\u822c'

    How could I fix it?

    Thanks, luopuya

    1. luopuya

      Sorry, I did not read the blog carefully.
      Every time reading a line of file, I should split it like what “MySentences” do

  5. Max

    Hi Radim,
    you are awesome, thank you so much for gensim and this tutorial!!

    I have a question. I read in the docs that by default you utilize Skip-Gram, which can be switched to CBOW. From what I gathered in the NIPS slides, CBOW is faster, more effective and gets better results. So why use Skip-Gram in the first place? I’m sure I’m missing something obvious here 🙂

    Thanks,
    Max

    1. Max

      Whoops, I just realized the parameter “sg” is not supported anymore in the Word2Vec constructor. Is that true? So what is used by default?

    2. Post
      Author
      Radim

      Hello Max,

      thanks 🙂

      Skip-gram is used because it gives better (more accurate) results.

      CBOW is faster though. If you have lots of data, it can be advantageous to run the simpler but faster model.

      There’s a pull request under way, to enable all the word2vec options and parameters. You can try out the various models and their performance for yourself 🙂

      1. Max

        Thanks for your answer Radim! I only saw that one experiment in the slides that said that CBOW was faster AND more accurate, but that might have been an outlier or my misinterpretation. I’m excited for that pull request! 🙂

        Anyway, I have another question (or bug report?). I changed a bunch of training parameters and added input data and suddenly got segfaults on Python when asking for similarities for certain words… So I tried which of the changes caused this, and it turned out that the cause was that I set the output size to 200! Setting it to (apparently) any other number doesn’t cause any trouble, but 200 does… If you hear this from anyone else or are able to reproduce it yourself, consider it a bug 🙂

  6. Bogdan

    Hi Radim,

    Indeed, a great tutorial! Thank you for that!

    Played a bit with Word2Vec and it’s quite impressive. Couldn’t figure out how the first part of the demo app works. Can you provide some insights please?

    Thanks! 😀

  7. Indian

    Hello,
    i’d like to ask You, if this all can be done with other languages too, like Korean, Russian, Arabic and so, or whether is this toolkit fixed to the English only.

    Thank You in advance for the answer

    1. Post
      Author
      Radim

      Hi, it can be done directly for languages where you can split text into sentences and tokens.

      The only concern would be that the `window` parameter has to be large enough to capture syntactic/semantic relationships. For English, the default `window=5`, and I wouldn’t expect it to be dramatically different for other languages.

  8. Sebastian

    Hey, I wanted to know if the version you have in gensim is the same that you got after “optimizing word2vec in Python”.. I am using the pre-trained model of the Google News vector (found in the page of word2vec) and then I run model.accuracy(‘file_questions’) but it runs really slow… Just wanted to know if this is normal or I have to do some things to speed up the version of gensim.. Thanks in advance and great work!

    1. Post
      Author
      Radim

      It is — gensim always contains the latest, most optimized version (=newer than this blog post).

      However, the accuracy computations (unlike training) are NOT optimized 🙂 I never got to optimizing that part. If you want to help, let me know, I don’t think I’ll ever get to it myself. (Massive optimizations can be done directly in Python, no need to go C/Cython).

  9. Zigi

    Hi,
    could you please explain how do CBOW and skip-gram models actually do the learning. I’ve read ‘Efficient estimation…’ but it doesn’t really explain how does the actual training happen.
    I’ve taken a look at the original source code and your implementation, and while I can understand the code I cannot understand the logic behind it.

    I don’t understand these lines in your implementation (word2vec.py, gensim 0.9) CBOW:
    ————————————
    l2a = model.syn1[word.point] # 2d matrix, codelen x layer1_size
    fa = 1.0 / (1.0 + exp(-dot(l1, l2a.T))) # propagate hidden -> output
    ga = (1 - word.code - fa) * alpha # vector of error gradients multiplied by the learning rate
    model.syn1[word.point] += outer(ga, l1) # learn hidden -> output
    —————————————-
    I see that it has something to do with the Huffman-tree word representation, but despite the comments, I don’t understand what is actually happening, what does syn1 represent, why do we multiply l1 with l2a… why are we multiplying ga and l1 etc…

    Could you please explain in a sentence or two what is actually happening in there.
    I would be very grateful.

  10. Vimos

    Hi, this is great, but I have a question about unknown words.
    After loading a model, when train more sentences, new words will be ignored, not added to the vocab automatically.
    I am not quite sure about this. Is that true?
    Thank you very much!

  11. Katja

    Hi,
    Unfortunately I’m not sufficiently versed in programming (yet) to solve this by myself. I would like to be able to add a new vector to the model after it has been trained.
    I realize I could export the model to a text file, add the vector there and load the modified file. Is there a way to add the vector within Python, though? As in: create a new entry for the model such that model[‘new_entry’] is assigned the new_vector.
    Thanks in advance!

    1. Post
      Author
      Radim

      The weights are a NumPy matrix — have a look at `model.syn0` and `model.syn1` matrices. You can append new vectors easily.

      Off the top of my head I’m not sure whether there are any other variables in `model` that need to be modified when you do this. There probably are. I’d suggest checking the code to make sure everything’s consistent.

  12. Pingback: Motorblog » [Review] PyData Berlin 2014 – Satellitenevent zur EuroPython

  13. Xu

    I am new to word2vec. Can I ask you two questions?
    1. When I apply the pre-trained model to my own dataset, do you have any suggestion about how to deal with the unknown words?
    2. Do you have any suggestion about aggregating the word embeddings of words in a sentence into one vector to represent that sentence?
    Thanks a lot!

  14. Xiao Zhibo

    Hi,

    Could you tell me how to find the most similar word as in web app 3? Calculating the cosine similarity between each word seems like a no-brainer way to do it? Is there any API in gensim to do that?

    Another question, I want to represent a sentence using word vectors; right now I only add up all the words in the sentence to get a new vector. I know this method doesn’t make sense, since each word has a coordinate in the semantic space, and adding up coordinates is not an ideal way to represent a sentence. I have read some papers talking about this problem. Could you tell me what would be an ideal way to represent a sentence to do sentence clustering?

    Thank you very much!

  15. neo

    Does the workers = x work to multithread the iteration over the sentences or just the training of the model ?

    1. Post
      Author
      Radim

      It parallelizes training.

      How you iterate over sentences is your business — word2vec only expects an iterator on input. What the iterator does internally to iterate over sentences is up to you and not part of word2vec.

      1. neo

        I have a collection of 1500000 text files (with 10 lines each on average) and a machine with 12 cores/16G of ram(not sure if it is relevant for reading files).

        How would you suggest me to build the iterator to utilize all the computing resources I have?

        1. Post
          Author
          Radim

          No, not relevant.

          I’d suggest you loop over your files inside __iter__() and yield out your sentences (lines?), one after another.

  16. satarupa

    If I am using the model pre-trained with Google News data set, is there any way to control the size of the output vector corresponding to a word?

  17. suvir

    for “model.build_vocab(sentences)” command to work, we need to add “import os”. without that, i was getting error for ‘os’ not defined.

  18. Pingback: How to grow a list of related words based on initial keywords? | CL-UAT

  19. T Zheng

    I am using the train function as described in the api doc. I notice that the training might have terminated “prematuredly”, according to the logging output below. Not sure if I understand the output properly. When it said “PROGRESS: at 4.10% words”, does it mean 4.1% of the corpus or 4.1% of the vocabs? I suspect the former, so it would suggest it only processed 4.1% of the words. Please enlighten me. Thanks!

    2015-02-11 19:34:40,894 : INFO : Got records: 20143
    2015-02-11 19:34:40,894 : INFO : training model with 4 workers on 67186 vocabulary and 200 features, using ‘skipgram’=1 ‘hierarchical softmax’=0 ‘subsample’=0 and ‘negative sampling’=15
    2015-02-11 19:34:41,903 : INFO : PROGRESS: at 0.45% words, alpha 0.02491, 93073 words/s
    2015-02-11 19:34:42,925 : INFO : PROGRESS: at 0.96% words, alpha 0.02477, 97772 words/s
    2015-02-11 19:34:43,930 : INFO : PROGRESS: at 1.48% words, alpha 0.02465, 100986 words/s
    2015-02-11 19:34:44,941 : INFO : PROGRESS: at 2.00% words, alpha 0.02452, 102187 words/s
    2015-02-11 19:34:45,960 : INFO : PROGRESS: at 2.51% words, alpha 0.02439, 102371 words/s
    2015-02-11 19:34:46,966 : INFO : PROGRESS: at 3.05% words, alpha 0.02426, 104070 words/s
    2015-02-11 19:34:48,006 : INFO : PROGRESS: at 3.55% words, alpha 0.02413, 103439 words/s
    2015-02-11 19:34:48,625 : INFO : reached the end of input; waiting to finish 8 outstanding jobs
    2015-02-11 19:34:49,026 : INFO : PROGRESS: at 4.10% words, alpha 0.02400, 104259 words/s

  20. Sasha Kacanski

    Hi Radim,
    Is there a whole example that I can use to understand the whole concept and to walk through the code.
    Thanks much,

  21. wade

    Hi Radim,

    I’m wondering about the difference between a model trained in C (the original way) and one trained in gensim.
    When I try to use the model.most_similar function, loading the model I’ve trained in C, I get a totally different result than when I do the same thing with word-analogy.sh. So I just want to know if the model.most_similar function uses the same method when calculating ‘man’ - ‘king’ + ‘woman’ ≈ ‘queen’ as Mikolov achieved in his C code (word-analogy), thanks!!!

    1. Post
      Author
      Radim

      Yes, exactly the same (cosine similarity).

      The training is almost the same too, up to different randomized initialization of weights IIRC.

      Maybe you’re using different data (preprocessing)?

      1. wade

        Sorry to bother you again,here are two kinds of way when I try to do:
        The way when i use gensim:
        model = Word2Vec.load_word2vec_format('vectors_200.bin', binary=True)
        # Chinese
        word1 = u'石家庄'
        word2 = u'河北'
        word3 = u'河南'

        le = model.most_similar(positive=[word2, word3], negative=[word1])

        The way using the C code:
        ./word-analogy vectors_200.bin
        the input: '石家庄' '河北' '河南'

        totally different results…

        the same model loaded, how could that happened?

        1. Post
          Author
          Radim

          Oh, non-ASCII characters.

          IIRC, the C code doesn’t handle unicode in any way, all text is treated as binary. Python code (gensim) uses Unicode for strings.

          So, perhaps some encoding mismatch?

          How was your model trained — with C code? Is so, what was the encoding?

  22. Anton

    Hello Radim,

    Is there a way to extract the output feature vector (or, sort of, predicted probabilities) from the model, just like while it’s training?

    Thanks

  23. Anupama

    Hey Radim

    Thanks for the wonderful tutorial.
    I am new to word2vec and I am trying to generate n-grams of words for an Indian script. I have 2 queries:
    Q1. Should the input be in plain text:
    ସୁଯୋଗ ଅସଟାର or unicodes 2860 2825 2858 2853 2821
    Q2. Is there any code available to do clustering of the generated vectors to form word classes?

    Please help

  24. Cong

    Hi Radim,

    For this example: “woman king man”:
    I run with bonus web app, and got the results:
    521.9ms [[“kings”,0.6490576267242432],[“clown_prince”,0.5009066462516785],[“prince”,0.4854174852371216],[“crown_prince”,0.48162946105003357],[“King”,0.47213971614837646]]

    The above result is the same with word2vec by Tomas Mikolov.

    However, when I run example above in gensim, the output is:
    [(u’queen’, 0.7118195295333862), (u’monarch’, 0.6189675331115723), (u’princess’, 0.5902432203292847), (u’crown_prince’, 0.5499461889266968), (u’prince’, 0.5377322435379028), (u’kings’, 0.523684561252594), (u’Queen_Consort’, 0.5235946178436279), (u’queens’, 0.5181134939193726), (u’sultan’, 0.5098595023155212), (u’monarchy’, 0.5087413191795349)]

    So why is this the case?
    Your web app’s result is different to gensim ???

    Thanks!

    1. Post
      Author
      Radim

      Hi Cong

      no, both are the same.

      In fact, the web app just calls gensim under the hood. There’s no extra magic happening regarding word2vec queries, it’s just gensim wrapped in cherrypy web server.

      1. Cong

        Thank you for your reply.

        I loaded the pre-trained model: GoogleNews-vectors-negative300.bin by Tomas Mikolov.

        Then, I used word2vec in gensim to find the output.
        This is my code when using gensim:

        from gensim.models import word2vec
        model_path = "…/GoogleNews-vectors-negative300.bin"
        model = word2vec.Word2Vec.load_word2vec_format(model_path, binary=True)
        stringA = 'woman'
        stringB = 'king'
        stringC = 'man'
        print model.most_similar(positive=[stringA, stringB], negative=[stringC], topn=10)

        –> Output is:
        [(u’queen’, 0.7118195295333862), (u’monarch’, 0.6189675331115723), (u’princess’, 0.5902432203292847), (u’crown_prince’, 0.5499461889266968), (u’prince’, 0.5377322435379028), (u’kings’, 0.523684561252594), (u’Queen_Consort’, 0.5235946178436279), (u’queens’, 0.5181134939193726), (u’sultan’, 0.5098595023155212), (u’monarchy’, 0.5087413191795349)]

        The see that the output above is different to the web app?
        So can you check it for me?

        Thanks so much.

  25. Boquan Tang

    Hi Radim,

    Thank you for the great tool and tutorial.
    I have one question regarding learning rate of the online training. You mentioned to adjust total_words in train(), but could you give a more detailed explanation about how this parameter will affect the learning rate?
    Thank you in advance.

  26. Barry Dillon

    Fantastic tool and tutorial. Thanks for sharing.

    I’m wondering about compounding use of LSI. Take a large corpus and perform LSI to map words into some space. Now, having a document, when you hit a word, look up the point in the space and use that rather than just the word. Words of similar meaning then start out closer together and more sensibly influence the document classification. Would the model just reverse out those initial weights? Thanks for any ideas.

  27. Nima

    Hi Radim,

    First of all, thanks for you great job on developing this tool. I am new in word2vec and unfortunately literature do not explain the details clearly. I would be grateful if you could answer my simple questions.

    1- for CBOW (sg=0), does the method uses negative sampling as well? or this is something just related to skip-gram model.

    2-what about the window size? is the window size also applicable when one uses CBOW? or all the words in 1 sentences is considered as bag-of-words?

    3- what happens if the window size is larger than the size of a sentence? Is the sentence ignored or simply a smaller window size is chosen which fits the size of the sentence?

    4- what happens if the word sits at the end of the sentence? there is no word after that for the skip-gram model !

  28. Jesse Berwald

    Hi Radim,

    Thanks for such a nice package! It may be bold to suggest, but I ran across what I think might be a bug. It’s likely a feature :), but I thought I’d point it out since I needed to fix it in an unintuitive way.

    If I train a word2vec model using a list of sentences:

    sentences = MySentences(fname) # generator that yields sentences
    mysentences = list(sentences)
    model = gensim.models.Word2Vec(sentences=mysentences, **kwargs)

    then the model finishes training. Eg., the end of the logging shows

    …snip…
    2015-05-13 22:12:07,329 : INFO : PROGRESS: at 97.17% words, alpha 0.00075, 47620 words/s
    2015-05-13 22:12:08,359 : INFO : PROGRESS: at 98.25% words, alpha 0.00049, 47605 words/s
    2015-05-13 22:12:09,362 : INFO : PROGRESS: at 99.32% words, alpha 0.00019, 47603 words/s
    2015-05-13 22:12:09,519 : INFO : reached the end of input; waiting to finish 16 outstanding jobs
    2015-05-13 22:12:09,901 : INFO : training on 4427131 words took 92.9s, 47648 words/s

    I’m training on many GB of data, so I need to pass in a generator that yields sentences line by line (like your MySentences class above). But when I try it as suggested with, say, iter=5:

    sentences = MySentences(fname) # generator that yields sentences
    model = gensim.models.Word2Vec(sentences=None, **kwargs) # iter=10 defined in kwargs
    model.build_vocab(sentences_vocab)
    model.train(sentences_train)

    the model stops training 1/20 of the way through. If iter=10, it stops 1/10 of the way, etc. Eg., the end of the logging looks like,

    …snip…
    2015-05-13 22:31:37,265 : INFO : PROGRESS: at 18.21% words, alpha 0.02049, 49695 words/s
    2015-05-13 22:31:38,266 : INFO : PROGRESS: at 19.29% words, alpha 0.02022, 49585 words/s
    2015-05-13 22:31:38,452 : INFO : reached the end of input; waiting to finish 16 outstanding jobs
    2015-05-13 22:31:38,857 : INFO : training on 885538 words took 17.8s, 49703 words/s

    Looking in word2vec.py, around line 316 I noticed

    sentences = gensim.utils.RepeatCorpusNTimes(sentences, iter)

    so I added

    sentences_train = gensim.utils.RepeatCorpusNTimes(Sentences(fname), model.iter)

    before calling model.train() in the above code snippet. Does this seem like the correct course of action, or am I missing something fundamental about the way one should stream sentences to build the vocab and train the model?

    Thanks for your help,
    Jesse

  29. Abdullah Kiwan

    Hello,
    It is a great tutorial, thank you very much….

    but i have a problem,
    I used the function ( accuracy ) to print the evaluation of the model , but nothing is printed to me

    how to solve this problem??

    thanks a lot

    1. Post
      Author
      Radim

      Try turning on logging — the accuracy may be printed to log.

      See the beginning of this tutorial for how to do that.

  30. Shuai Wang

    Great tutorial, Radim! Is it possible to download your trained model of 100 billion Google words?

  31. Swami Iyer

    Hi Radim,

    I was wondering if it is possible to train a Word2Vec model, not with sentences, but with input and output vectors built from the sentences in an application-specific manner?

    Thanks.

    Swami

  32. Burness Duan

    Hi, I’ve got a problem ‘OverflowError: Python int too large to convert to C long’ when I run ‘model = gensim.models.Word2Vec(sentences, min_count=1)’. Could you help me with it?!

  33. Pingback: Word2vec Tutorial » RaRe Technologies | D...

  34. Mike

    I ran:
    model = gensim.models.Word2Vec(sentences, min_count=1)

    and got the following error:

    model = gensim.models.Word2Vec(sentences, min_count=1)
    Traceback (most recent call last):

    File “”, line 1, in
    model = gensim.models.Word2Vec(sentences, min_count=1)

    File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 312, in __init__
    self.build_vocab(sentences)

    File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 414, in build_vocab
    self.reset_weights()

    File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 521, in reset_weights
    random.seed(uint32(self.hashfxn(self.index2word[i] + str(self.seed))))

    OverflowError: Python int too large to convert to C long

    I am using Python 3.4.3 in the Anaconda 2.3.0-64bit distribution.

    I’d really like to be able to use this module, but it seems like there’s some fundamental issue for my computer.

    Thanks!!

    1. Post
      Author
      Radim

      Hello Mike, the fix was a part of gensim 0.12.1, released some time ago.

      What version of gensim are you using?

      1. Mike

        Found the error…I was using “conda update gensim” but it looks like their Anaconda repository has not been updated. I’ll let them know, since many people use Anaconda distrib.

        I ran “pip install –upgrade gensim” and it got 0.12.1. I had 10.1!!!

      2. Mike

        Ok, I updated and ran with the following input list of sentences:

        sentences
        Out[17]:
        [[‘human’,
        ‘machine’,
        ‘interface’,
        ‘for’,
        ‘lab’,
        ‘abc’,
        ‘computer’,
        ‘applications’],
        [‘a’,
        ‘survey’,
        ‘of’,
        ‘user’,
        ‘opinion’,
        ‘of’,
        ‘computer’,
        ‘system’,
        ‘response’,
        ‘time’],
        [‘the’, ‘eps’, ‘user’, ‘interface’, ‘management’, ‘system’],
        [‘system’, ‘and’, ‘human’, ‘system’, ‘engineering’, ‘testing’, ‘of’, ‘eps’],
        [‘relation’,
        ‘of’,
        ‘user’,
        ‘perceived’,
        ‘response’,
        ‘time’,
        ‘to’,
        ‘error’,
        ‘measurement’],
        [‘the’, ‘generation’, ‘of’, ‘random’, ‘binary’, ‘unordered’, ‘trees’],
        [‘the’, ‘intersection’, ‘graph’, ‘of’, ‘paths’, ‘in’, ‘trees’],
        [‘graph’,
        ‘minors’,
        ‘iv’,
        ‘widths’,
        ‘of’,
        ‘trees’,
        ‘and’,
        ‘well’,
        ‘quasi’,
        ‘ordering’],
        [‘graph’, ‘minors’, ‘a’, ‘survey’]]

        Still got the same error:

        in [16]: model = gensim.models.word2vec.Word2Vec(sentences)

        Traceback (most recent call last):

        File “”, line 1, in
        model = gensim.models.word2vec.Word2Vec(sentences)

        File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 417, in __init__
        self.build_vocab(sentences)

        File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 483, in build_vocab
        self.finalize_vocab() # build tables & arrays

        File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 611, in finalize_vocab
        self.reset_weights()

        File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 888, in reset_weights
        self.syn0[i] = self.seeded_vector(self.index2word[i] + str(self.seed))

        File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 900, in seeded_vector
        once = random.RandomState(uint32(self.hashfxn(seed_string)))

        OverflowError: Python int too large to convert to C long

        1. Post
          Author
          Radim

          Is Python picking up the right gensim?

          AFAIK anaconda has its own packaging system, I’m not sure how it plays with your `pip install`.

          What does `import gensim; print gensim.__version__` say?

  35. Mike

    Ok…I was able to make a couple changes to word2vec.py to get it to run on my computer:

    The current version uses numpy.uint32 on lines 83, 327, 373, and 522. This was causing an overflow error when converting to C long.

    I changed these to reference numpy.uint64 and it *almost* worked….the use of uint64 on line 522 for setting the seed of the random number generator resulted in a seed value being out of bounds. I addressed this by truncating the seed to the max allowable seed:

    “random.seed(min(uint64(self.hashfxn(self.index2word[i] + str(self.seed))),4294967295))”

    now everything runs fine (except that my version is not compiled under C so I may see some performance issues for large corpora)…

  36. Mike

    actually, there is a solution on Kaggle for 64-bit machines that worked really well (do not use my solution…it results in all word vectors being collinear).

    def hash32(value):
        return hash(value) & 0xffffffff

    Then pass the following argument to Word2Vec: hashfxn=hash32

    This will overwrite the base hashfxn and resolve the issues. Also, my cosine similarities are no longer all 1!!

  37. Alexis C

    Beware of how you go through your training data :

    When, in your class “MySentences” you use :
    “for line in open(os.path.join(self.dirname, fname)): ”

    As far as I know, it won’t close your file. You’re letting the garbage collector of Python deal with the leak in memory.

    Use :

    “with gzip.open(os.path.join(self.dirname, fname)) as f:”

    instead (Ref : http://stackoverflow.com/questions/7395542/is-explicitly-closing-files-important )

    For training on large dataset, it can be a major bottleneck (it was for me 😉 ).

    Thank you very much for your fast implementation of Word2vec and Doc2vec !

    1. Post
      Author
      Radim

      No, CPython closes the file immediately after the object goes out of scope. There is no leak (though that’s a common misconception and a favourite nitpick).

      With gzip it makes more sense, but then you should be using smart_open anyway (also to work around missing context managers in python 2.6).
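
      A sketch of what that might look like inside the iterator, assuming the smart_open package is installed (it opens plain, gzipped or bz2 files alike and closes them for you):

      import os
      from smart_open import smart_open

      class MySentences(object):
          def __init__(self, dirname):
              self.dirname = dirname

          def __iter__(self):
              for fname in os.listdir(self.dirname):
                  # smart_open handles compressed files transparently and acts as a context manager
                  with smart_open(os.path.join(self.dirname, fname)) as fin:
                      for line in fin:
                          yield line.split()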

  38. Rodolpho Rosa

    Hi, Radim.

    Great tutorial.

    I have a doubt. Is there any way I can represent a phrase as a vector so that I can calculate similarity between phrases just as what we do with words?

  39. jin

    Hi Radim,
    first of thank you very much for your amazing work and even more amazing tutorial.

    I am currently trying to compare two set of phrases.
    I am using GoogleNews model as my model

    Split all the words into individual words by using .split()

    ie.
    [‘golf’,’field’] and [‘country’,’club’]
    [‘gas’,’station’] and [‘fire’,’station’]

    as per feature in your app “phrase suggestions ”
    I can see that GoogleNews model have
    County_Club
    Fire_station
    gas_station
    golf_field

    But it’s difficult to scan for those words because of Capitalization in GN model.

    I tried.
    model.vocab.keys()

    which would convert all available names into a list.
    but couldn’t get any close to your example above.

    I also looked at gensim.models.phrase.Phrases
    hoping that it can help me detect above example with bigram
    or trigram

    for those who are using GN model,
    how could we detect bigram or trigram?

    Thank you in advance.

    1. Radim

      As far as I know, Google didn’t release their vocabulary/phrase model, nor their text preprocessing method.

      The only thing you have to go by are the phrases inside the model itself (3 million of them), sorted by frequency.

      You can lowercase the model vocabulary and match against that, but note that you’ll lose some vectors (no way to tell County_Club from county_club from County_club).

      You can also try asking at the gensim mailing list, or Tomas Mikolov at his mailing list — better chance someone may have an answer or know something.

      1. jin

        Hi Radim,
        Thanks for your reply,

        I went ahead and created a small function which would create bigram and replace the original words if bigram exist in GoogleNews model.

        and yes, I will join google mailing group.

        ####################################################
        # bigram creator
        # try to capture fire_station or Fire_Station rather than using 'fire' 'station' separately

        # creating bigram
        def create_bigram(list):
            for i in range(0, len(list)-1):
                # ie. country_club
                word1 = list[i]+'_'+list[i+1]
                # ie. Country_club
                word2 = list[i].capitalize()+'_'+list[i+1]
                # ie. Country_Clue
                word3 = list[i].capitalize()+'_'+list[i+1].capitalize()
                # ie. COUNTRY_CLUE
                word4 = (list[i]+'_'+list[i+1]).upper()

                word_list = [word1, word2, word3, word4]

                for item in word_list:
                    print item
                    if item in model.vocab:
                        list.pop(i)
                        list.pop(i)
                        list.append(item)
                        break

        1. jin

          here’s fixed code,
          added a line that will not append new word if len(list) gets shorter

          ####################################################
          # bigram creator
          # try to capture fire_station or Fire_Station rather than using 'fire' 'station' separately

          # creating bigram
          def create_bigram(list):
              for i in range(0, len(list)-1):
                  if i < len(list)-1:
                      # ie. country_club
                      word1 = list[i]+'_'+list[i+1]
                      # ie. Country_club
                      word2 = list[i].capitalize()+'_'+list[i+1]
                      # ie. Country_Clue
                      word3 = list[i].capitalize()+'_'+list[i+1].capitalize()
                      # ie. COUNTRY_CLUE
                      word4 = (list[i]+'_'+list[i+1]).upper()

                      word_list = [word1, word2, word3, word4]

                      for item in word_list:
                          print i
                          if item in model.vocab:
                              list.pop(i)
                              list.pop(i)
                              list.append(item)
                              break

  40. Lis

    I trained 2 Doc2vec models with the same data, and parameters:
    model = Doc2Vec(sentences, dm=1, size=300, window=5, negative=10, hs=1, sample=1e-4, workers=20, min_count=3)

    But I got 2 different models in each time. Is this true?

    Can you explain more details for me?
    Is that the case for Word2vec model?

    Thanks Radim!

