I never got round to writing a tutorial on how to use word2vec in gensim. It’s simple enough and the API docs are straightforward, but I know some people prefer more verbose formats. Let this post be a tutorial and a reference example.
UPDATE: the complete HTTP server code for the interactive word2vec demo below is now open sourced on Github. For a high-performance similarity server for documents, see ScaleText.com.
Preparing the Input
Starting from the beginning, gensim’s word2vec expects a sequence of sentences as its input. Each sentence is a list of words (utf8 strings):
# import modules & set up logging import gensim, logging logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) sentences = [['first', 'sentence'], ['second', 'sentence']] # train word2vec on the two sentences model = gensim.models.Word2Vec(sentences, min_count=1)
Keeping the input as a Python built-in list is convenient, but can use up a lot of RAM when the input is large.
Gensim only requires that the input provide sentences sequentially when iterated over. No need to keep everything in RAM: we can provide one sentence, process it, forget it, load another sentence…
For example, if our input is strewn across several files on disk, with one sentence per line, then instead of loading everything into an in-memory list, we can process the input file by file, line by line:
import os

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('/some/directory')  # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)
Say we want to further preprocess the words from the files — convert to unicode, lowercase, remove numbers, extract named entities… All of this can be done inside the MySentences iterator and word2vec doesn’t need to know. All that is required is that the input yields one sentence (list of utf8 words) after another.
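For illustration, here is a minimal sketch of such a preprocessing iterator (the lowercasing and the alphabetic-token filter are just example choices, not requirements, and the line.decode('utf8') assumes Python 2 style byte strings on input):

class MyPreprocessedSentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                # decode to unicode, lowercase, keep only alphabetic tokens
                words = line.decode('utf8').lower().split()
                yield [word for word in words if word.isalpha()]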
Note to advanced users: calling Word2Vec(sentences, iter=1) will run two passes over the sentences iterator (or, in general iter+1 passes; default iter=5). The first pass collects words and their frequencies to build an internal dictionary tree structure. The second and subsequent passes train the neural model. These two (or, iter+1) passes can also be initiated manually, in case your input stream is non-repeatable (you can only afford one pass), and you’re able to initialize the vocabulary some other way:
model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
model.build_vocab(some_sentences)  # can be a non-repeatable, 1-pass generator
model.train(other_sentences)  # can be a non-repeatable, 1-pass generator
In case you’re confused about iterators, iterables and generators in Python, check out our tutorial on Data Streaming in Python.
Training
Word2vec accepts several parameters that affect both training speed and quality.
One of them is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to do any meaningful training on those words, so it’s best to ignore them:
model = Word2Vec(sentences, min_count=10) # default value is 5
A reasonable value for min_count is between 0 and 100, depending on the size of your dataset.
Another parameter is the size of the NN layers, which corresponds to the “degrees” of freedom the training algorithm has:
model = Word2Vec(sentences, size=200) # default value is 100
Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.
The last of the major parameters (full list here) is for training parallelization, to speed up training:
model = Word2Vec(sentences, workers=4) # default = 1 worker = no parallelization
The workers parameter only has an effect if you have Cython installed. Without Cython, you’ll only be able to use one core because of the GIL (and word2vec training will be miserably slow).
Memory
At its core, the word2vec model parameters are stored as matrices (NumPy arrays). Each array is #vocabulary (controlled by the min_count parameter) times #size (the size parameter) of floats (single precision, aka 4 bytes).
Three such matrices are held in RAM (work is underway to reduce that number to two, or even one). So if your input contains 100,000 unique words, and you asked for layer size=200, the model will require approx. 100,000*200*4*3 bytes = ~229MB.
There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above.
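As a quick back-of-the-envelope check, the estimate above works out like this (a sketch; the factor of three matches the three matrices currently held in RAM):

vocab_size = 100000   # unique words left after min_count pruning
layer_size = 200      # the `size` parameter
bytes_per_float = 4   # single precision

total_bytes = vocab_size * layer_size * bytes_per_float * 3
print '%.0f MB' % (total_bytes / 1024.0 ** 2)   # ~229 MB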
Evaluating
Word2vec training is an unsupervised task, so there’s no good way to objectively evaluate the result. Evaluation depends on your end application.
Google have released their testing set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task: https://raw.githubusercontent.com/RaRe-Technologies/gensim/develop/gensim/test/test_data/questions-words.txt.
Gensim supports the same evaluation set, in exactly the same format:
model.accuracy('/tmp/questions-words.txt')
2014-02-01 22:14:28,387 : INFO : family: 88.9% (304/342)
2014-02-01 22:29:24,006 : INFO : gram1-adjective-to-adverb: 32.4% (263/812)
2014-02-01 22:36:26,528 : INFO : gram2-opposite: 50.3% (191/380)
2014-02-01 23:00:52,406 : INFO : gram3-comparative: 91.7% (1222/1332)
2014-02-01 23:13:48,243 : INFO : gram4-superlative: 87.9% (617/702)
2014-02-01 23:29:52,268 : INFO : gram5-present-participle: 79.4% (691/870)
2014-02-01 23:57:04,965 : INFO : gram7-past-tense: 67.1% (995/1482)
2014-02-02 00:15:18,525 : INFO : gram8-plural: 89.6% (889/992)
2014-02-02 00:28:18,140 : INFO : gram9-plural-verbs: 68.7% (482/702)
2014-02-02 00:28:18,140 : INFO : total: 74.3% (5654/7614)
The accuracy method takes an optional restrict_vocab parameter, which limits which test examples are considered.
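For example (a hedged usage sketch; 30000 is just an illustrative cutoff, restricting the evaluation to the 30,000 most frequent words):

# only evaluate test examples whose words are among the 30,000 most frequent
model.accuracy('/tmp/questions-words.txt', restrict_vocab=30000)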
Once again, good performance on this test set doesn’t mean word2vec will work well in your application, or vice versa. It’s always best to evaluate directly on your intended task.
Storing and loading models
You can store/load models using the standard gensim methods:
model.save('/tmp/mymodel')
new_model = gensim.models.Word2Vec.load('/tmp/mymodel')
which uses pickle internally, optionally mmap'ing the model’s internal large NumPy matrices into virtual memory directly from disk files, for inter-process memory sharing.
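For example, a hedged sketch of loading with memory-mapping (assuming the model was saved with save() as above, so its big arrays sit in separate files on disk):

# load the stored model, memory-mapping its large arrays read-only from disk
model = gensim.models.Word2Vec.load('/tmp/mymodel', mmap='r')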
In addition, you can load models created by the original C tool, both using its text and binary formats:
model = Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)
# using gzipped/bz2 input works too, no need to unzip:
model = Word2Vec.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)
Online training / Resuming training
Advanced users can load a model and continue training it with more sentences:
model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.train(more_sentences)
You may need to tweak the total_words parameter to train(), depending on what learning rate decay you want to simulate.
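For example, a hedged sketch (the word count below is a made-up placeholder; adjust it to however many words you want the learning rate decay to assume):

# pretend training covers this many words in total, which controls how
# quickly the learning rate alpha decays over the new sentences
model.train(more_sentences, total_words=10000000)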
Note that it’s not possible to resume training with models loaded from the C tool via load_word2vec_format(). You can still use them for querying/similarity, but the information vital for training (the vocab tree) is missing there.
Using the model
Word2vec supports several word similarity tasks out of the box:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]

model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'

model.similarity('woman', 'man')
0.73723527
If you need the raw output vectors in your application, you can access these either on a word-by-word basis
model['computer']  # raw NumPy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)
…or en masse, as a 2D NumPy matrix from model.syn0.
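As a small sketch of how that matrix lines up with the vocabulary (using the model.vocab and model.index2word attributes present in this gensim version; treat it as illustrative):

vectors = model.syn0                  # shape: (vocabulary size, layer size)
row = model.vocab['computer'].index   # matrix row belonging to the word 'computer'
assert (vectors[row] == model['computer']).all()
print model.index2word[row]           # prints: computer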
Bonus app
As before with finding similar articles in the English Wikipedia with Latent Semantic Analysis, here’s a bonus web app for those who managed to read this far. It uses the word2vec model trained by Google on the Google News dataset, on about 100 billion words:
If you don’t get “queen” back, something went wrong and baby SkyNet cries.
Try more examples too: “he” is to “his” as “she” is to ?, “Berlin” is to “Germany” as “Paris” is to ? (click to fill in).
Try: U.S.A.; Monty_Python; PHP; Madiba (click to fill in).
Also try: “monkey ape baboon human chimp gorilla”; “blue red green crimson transparent” (click to fill in).
The model contains 3,000,000 unique phrases, built with a layer size of 300.
Note that the similarities were trained on a news dataset, and that Google did very little preprocessing there. So the phrases are case sensitive: watch out! Especially with proper nouns.
On a related note, I noticed about half the queries people entered into the LSA@Wiki demo contained typos/spelling errors, so they found nothing. Ouch.
To make it a little less challenging this time, I added phrase suggestions to the forms above. Start typing to see a list of valid phrases from the actual vocabulary of Google News’ word2vec model.
The “suggested” phrases are simply the ten phrases starting at whatever position bisect_left(all_model_phrases_alphabetically_sorted, prefix_you_typed_so_far) from Python’s built-in bisect module returns.
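In other words, roughly this (a sketch; the variable and function names are made up for illustration):

from bisect import bisect_left

def suggest(sorted_phrases, prefix, count=10):
    # find the first phrase >= prefix, then return the next `count` phrases
    pos = bisect_left(sorted_phrases, prefix)
    return sorted_phrases[pos : pos + count]

suggestions = suggest(sorted(model.vocab.keys()), 'Monty')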
See the complete HTTP server code for this “bonus app” on github (using CherryPy).
Outro
Full word2vec API docs here; get gensim here. Original C toolkit and word2vec papers by Google here.
And here’s me talking about the optimizations behind word2vec at PyData Berlin 2014
Comments 90
Hi radim,
Impressive tutorial. I have a query that the output Word2Vec model is returning in an array. How can we use that as an input to recursive neural network??
Thanks
model = gensim.models.Word2Vec(sentences) will not work as shown in the tutorial, because you will receive the error message: “RuntimeError: you must first build vocabulary before training the model”. You also have to lower min_count manually, like model = gensim.models.Word2Vec(sentences, min_count=1).
Author
Default `min_count=5` if you don’t set it explicitly. Vocabulary is built automatically from the sentences.
What version of gensim are you using? It should really work simply with `Word2Vec(sentences)`, there are even unit tests for that.
if you don’t set ‘min_count=1’, it will remove all the words in sentences in the example given –
logging:
‘INFO : total 0 word types after removing those with count<5'
Author
Ah ok, thanks Claire. I’ve added the `min_count=1` parameter.
Hi Radim,
Is there any way to obtain the similarity of phrases out of the word2vec? I’m trying to get 2-word phrases to compare, but don’t know how to do it.
Thanks!
Pavel
Author
Hello Pavel, yes, there is a way.
First, you must detect phrases in the text (such as 2-word phrases).
Then you build the word2vec model like you normally would, except some “tokens” will be strings of multiple words instead of one (example sentence: [“New York”, “was”, “founded”, “16th century”]).
Then, to get similarity of phrases, you do `model.similarity(“New York”, “16th century”)`.
It may be a good idea to replace spaces with underscores in the phrase-tokens, to avoid potential parsing problems (“New_York”, “16th_century”).
As for detecting the phrases, it’s a task unrelated to word2vec. You can use existing NLP tools like the NLTK/Freebase, or help finish a gensim pull request that does exactly this: https://github.com/piskvorky/gensim/pull/135 .
Hi Radim,
The Word2Vec function split my words as:
u'\u4e00\u822c' ==>> u'\u4e00' and u'\u822c'
How could I fix it?
Thanks, luopuya
Sorry, I did not read the blog carefully.
Every time I read a line of the file, I should split it like “MySentences” does.
Hi Radim,
you are awesome, thank you so much for gensim and this tutorial!!
I have a question. I read in the docs that by default you utilize Skip-Gram, which can be switched to CBOW. From what I gathered in the NIPS slides, CBOW is faster, more effective and gets better results. So why use Skip-Gram in the first place? I’m sure I’m missing something obvious here 🙂
Thanks,
Max
Whoops, I just realized the parameter “sg” is not supported anymore in the Word2Vec constructor. Is that true? So what is used by default?
Author
Hello Max,
thanks 🙂
Skip-gram is used because it gives better (more accurate) results.
CBOW is faster though. If you have lots of data, it can be advantageous to run the simpler but faster model.
There’s a pull request under way, to enable all the word2vec options and parameters. You can try out the various models and their performance for yourself 🙂
Thanks for you answer Radim! I only saw that one experiment in the slides that said that CBOW was faster AND more accurate, but that might have been an outlier or my misinterpretation. I’m excited for that pull request! 🙂
Anyway, I have another question (or bug report?). I changed a bunch of training parameters and added input data and suddenly got segfaults on Python when asking for similarities for certain words… So I tried which of the changes caused this, and it turned out that the cause was that I set the output size to 200! Setting it to (apparently) any other number doesn’t cause any trouble, but 200 does… If you hear this from anyone else or are able to reproduce it yourself, consider it a bug 🙂
Author
Hey Max — are you on OS X? If so, it may be https://github.com/numpy/numpy/issues/4007
If not, please file a bug report at https://github.com/piskvorky/gensim/issues with system/sw info. Cheers!
Hi Radim,
Indeed, a great tutorial! Thank you for that!
Played a bit with Word2Vec and it’s quite impressive. Couldn’t figure out how the first part of the demo app works. Can you provide some insights please?
Thanks! 😀
Author
Hello Bogdan, you’re welcome 🙂
This demo app just calls the “model.most_similar(positive, negative)” method in the background. Check out the API docs: http://radimrehurek.com/gensim/models/word2vec.html
Hello,
I’d like to ask you whether all this can be done with other languages too, like Korean, Russian, Arabic and so on, or whether this toolkit is fixed to English only.
Thank You in advance for the answer
Author
Hi, it can be done directly for languages where you can split text into sentences and tokens.
The only concern would be that the `window` parameter has to be large enough to capture syntactic/semantic relationships. For English, the default `window=5`, and I wouldn’t expect it to be dramatically different for other languages.
Hey, I wanted to know if the version you have in gensim is the same that you got after “optimizing word2vec in Python”. I am using the pre-trained model of the Google News vectors (found on the word2vec page) and then I run model.accuracy(‘file_questions’), but it runs really slow… Just wanted to know if this is normal or if I have to do some things to speed up the gensim version. Thanks in advance and great work!
Author
It is — gensim always contains the latest, most optimized version (=newer than this blog post).
However, the accuracy computations (unlike training) are NOT optimized 🙂 I never got to optimizing that part. If you want to help, let me know, I don’t think I’ll ever get to it myself. (Massive optimizations can be done directly in Python, no need to go C/Cython).
Hi,
could you please explain how the CBOW and skip-gram models actually do the learning. I’ve read ‘Efficient estimation…’ but it doesn’t really explain how the actual training happens.
I’ve taken a look at the original source code and your implementation, and while I can understand the code I cannot understand the logic behind it.
I don’t understand these lines in your implementation (word2vec.py, gensim 0.9) CBOW:
————————————
l2a = model.syn1[word.point] # 2d matrix, codelen x layer1_size
fa = 1.0 / (1.0 + exp(-dot(l1, l2a.T))) # propagate hidden -> output
ga = (1 - word.code - fa) * alpha # vector of error gradients multiplied by the learning rate
model.syn1[word.point] += outer(ga, l1) # learn hidden -> output
—————————————-
I see that it has something to do with the Huffman-tree word representation, but despite the comments, I don’t understand what is actually happening, what does syn1 represent, why do we multiply l1 with l2a… why are we multiplying ga and l1 etc…
Could you please explain in a sentence or two what is actually happening in there.
I would be very grateful.
Hi, this is great, but I have a question about unknown words.
After loading a model, when training on more sentences, new words will be ignored, not added to the vocab automatically.
I am not quite sure about this. Is that true?
Thank you very much!
Author
Yes, it is true. The word2vec algorithm doesn’t support adding new words dynamically.
Hi,
Unfortunately I’m not sufficiently versed in programming (yet) to solve this by myself. I would like to be able to add a new vector to the model after it has been trained.
I realize I could export the model to a text file, add the vector there and load the modified file. Is there a way to add the vector within Python, though? As in: create a new entry for the model such that model[‘new_entry’] is assigned the new_vector.
Thanks in advance!
Author
The weights are a NumPy matrix — have a look at `model.syn0` and `model.syn1` matrices. You can append new vectors easily.
Off the top of my head I’m not sure whether there are any other variables in `model` that need to be modified when you do this. There probably are. I’d suggest checking the code to make sure everything’s consistent.
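For illustration only, a hedged sketch of appending one row (the names are made up; as noted above, the bookkeeping in model.vocab and model.index2word would most likely need updating as well, so treat this as a starting point):

import numpy as np

new_vector = np.random.uniform(-0.1, 0.1, model.syn0.shape[1]).astype(np.float32)
model.syn0 = np.vstack([model.syn0, new_vector])   # append the new row
# remember to also register the new word in model.vocab and model.index2word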
Pingback: Motorblog » [Review] PyData Berlin 2014 – Satellitenevent zur EuroPython
I am new to word2vec. Can I ask you two questions?
1. When I apply the pre-trained model to my own dataset, do you have any suggestion about how to deal with the unknown words?
2. Do you have any suggestion about aggregating the word embeddings of words in a sentence into one vector to represent that sentence?
Thanks a lot!
Author
Good questions, but I don’t have any insights beyond the standard advice:
1. unknown words are ignored; or you can build a model with one special “word” to represent all OOV words.
2. for short sentences/phrases, you can average the individual vectors; for longer texts, look into something like paragraph2vec: https://github.com/piskvorky/gensim/issues/204#issuecomment-52093328
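To illustrate the averaging option, a minimal sketch (it simply skips words missing from the model’s vocabulary):

import numpy as np

def average_vector(model, words):
    vectors = [model[word] for word in words if word in model.vocab]
    if not vectors:
        return None            # none of the words is in the vocabulary
    return np.mean(vectors, axis=0)

sentence_vector = average_vector(model, ['first', 'sentence'])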
Thanks for your advice 😀
Hi,
Could you tell me how to find the most similar word as in web app 3? Calculating the cosine similarity between each word seems like a no-brainer way to do it? Is there any API in gensim to do that?
Another question: I want to represent sentences using word vectors. Right now I just add up all the words in the sentence to get a new vector. I know this method doesn’t make much sense, since each word has a coordinate in the semantic space and adding up coordinates is not an ideal way to represent a sentence. I have read some papers talking about this problem. Could you tell me what would be an ideal way to represent a sentence for sentence clustering?
Thank you very much!
Author
“Find most similar word”: look at the API docs.
“Ideal way to represent a sentence”: I don’t know about ideal, but another way to represent sentences is using “paragraph2vec”: https://github.com/piskvorky/gensim/issues/204
Radim, this is great stuff. I have posted a link to your software on our AI community blog in the USA at http://www.smesh.net/pages/980191274#54262e6cdb3c2facb5a41579 , so that other people can benefit from it too.
Does the workers = x work to multithread the iteration over the sentences or just the training of the model ?
Author
It parallelizes training.
How you iterate over sentences is your business — word2vec only expects an iterator on input. What the iterator does internally to iterate over sentences is up to you and not part of word2vec.
I have a collection of 1500000 text files (with 10 lines each on average) and a machine with 12 cores/16G of RAM (not sure if it is relevant for reading files).
How would you suggest me to build the iterator to utilize all the computing resources I have?
Author
No, not relevant.
I’d suggest you loop over your files inside __iter__() and yield out your sentences (lines?), one after another.
ok thanks!
If I am using the model pre-trained with Google News data set, is there any way to control the size of the output vector corresponding to a word?
for “model.build_vocab(sentences)” command to work, we need to add “import os”. without that, i was getting error for ‘os’ not defined.
Author
Not sure what you are talking about suvir, you don’t need any “import os”. If you run into problems, send us the full traceback (preferably to the gensim mailing list, not this blog). See http://radimrehurek.com/gensim/support.html. Cheers.
hello,
Where can I find the code (in python) of the Bonus App?
Pingback: How to grow a list of related words based on initial keywords? | CL-UAT
I am using the train function as described in the API doc. I notice that the training might have terminated “prematurely”, according to the logging output below. Not sure if I understand the output properly. When it said “PROGRESS: at 4.10% words”, does it mean 4.1% of the corpus or 4.1% of the vocab? I suspect the former, so it would suggest it only processed 4.1% of the words. Please enlighten me. Thanks!
2015-02-11 19:34:40,894 : INFO : Got records: 20143
2015-02-11 19:34:40,894 : INFO : training model with 4 workers on 67186 vocabulary and 200 features, using ‘skipgram’=1 ‘hierarchical softmax’=0 ‘subsample’=0 and ‘negative sampling’=15
2015-02-11 19:34:41,903 : INFO : PROGRESS: at 0.45% words, alpha 0.02491, 93073 words/s
2015-02-11 19:34:42,925 : INFO : PROGRESS: at 0.96% words, alpha 0.02477, 97772 words/s
2015-02-11 19:34:43,930 : INFO : PROGRESS: at 1.48% words, alpha 0.02465, 100986 words/s
2015-02-11 19:34:44,941 : INFO : PROGRESS: at 2.00% words, alpha 0.02452, 102187 words/s
2015-02-11 19:34:45,960 : INFO : PROGRESS: at 2.51% words, alpha 0.02439, 102371 words/s
2015-02-11 19:34:46,966 : INFO : PROGRESS: at 3.05% words, alpha 0.02426, 104070 words/s
2015-02-11 19:34:48,006 : INFO : PROGRESS: at 3.55% words, alpha 0.02413, 103439 words/s
2015-02-11 19:34:48,625 : INFO : reached the end of input; waiting to finish 8 outstanding jobs
2015-02-11 19:34:49,026 : INFO : PROGRESS: at 4.10% words, alpha 0.02400, 104259 words/s
Hi Radim,
Is there a whole example that I can use to understand the whole concept and to walk through the code.
Thanks much,
Author
Hello Sasha,
not sure what concept / code you need, but there is one example right there in the word2vec.py source file:
https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec.py#L997
(you can download the text8 corpus used there from http://mattmahoney.net/dc/text8.zip )
Hi Radim,
I’m wondering about the difference between a model trained in C (the original way) and one trained in gensim.
When I try to use the model.most_similar function, loading the model I’ve trained in C, I get a totally different result than when I do the same thing with word-analogy.sh. So I just want to know if the model.most_similar function uses the same approach when calculating ‘man’ - ‘king’ + ‘woman’ ≈ ‘queen’ as Mikolov’s C code (word-analogy). Thanks!!!
Author
Yes, exactly the same (cosine similarity).
The training is almost the same too, up to different randomized initialization of weights IIRC.
Maybe you’re using different data (preprocessing)?
Sorry to bother you again. Here are the two ways I try it:
The way when i use gensim:
model = Word2Vec.load_word2vec_format('vectors_200.bin', binary=True)
# Chinese
word1 = u'石家庄'
word2 = u'河北'
word3 = u'河南'
le = model.most_similar(positive=[word2, word3], negative=[word1])
The way use C code:
./word-analogy vectors_200.bin
the input: '石家庄' '河北' '河南'
totally different results…
The same model is loaded, so how could that happen?
Author
Oh, non-ASCII characters.
IIRC, the C code doesn’t handle unicode in any way, all text is treated as binary. Python code (gensim) uses Unicode for strings.
So, perhaps some encoding mismatch?
How was your model trained — with C code? If so, what was the encoding?
The training corpus encoding in utf-8, that’s the reason?
Hello Radim,
Is there a way to extract the output feature vector (or, sort of, predicted probabilities) from the model, just like while it’s training?
Thanks
Hey Radim
Thanks for the wonderful tutorial.
I am new to word2vec and I am trying to generate n-grams of words for an Indian script. I have 2 queries:
Q1. Should the input be in plain text:
ସୁଯୋଗ ଅସଟାର or unicodes 2860 2825 2858 2853 2821
Q2. Is there any code available to do clustering of the generated vectors to form word classes?
Please help
Hi Radim,
For this example: “woman king man”:
I run with bonus web app, and got the results:
521.9ms [[“kings”,0.6490576267242432],[“clown_prince”,0.5009066462516785],[“prince”,0.4854174852371216],[“crown_prince”,0.48162946105003357],[“King”,0.47213971614837646]]
The above result is the same with word2vec by Tomas Mikolov.
However, when I run example above in gensim, the output is:
[(u’queen’, 0.7118195295333862), (u’monarch’, 0.6189675331115723), (u’princess’, 0.5902432203292847), (u’crown_prince’, 0.5499461889266968), (u’prince’, 0.5377322435379028), (u’kings’, 0.523684561252594), (u’Queen_Consort’, 0.5235946178436279), (u’queens’, 0.5181134939193726), (u’sultan’, 0.5098595023155212), (u’monarchy’, 0.5087413191795349)]
So why is this the case?
Your web app’s result is different to gensim ???
Thanks!
Author
Hi Cong
no, both are the same.
In fact, the web app just calls gensim under the hood. There’s no extra magic happening regarding word2vec queries, it’s just gensim wrapped in cherrypy web server.
Thank you for your reply.
I loaded the pre-trained model: GoogleNews-vectors-negative300.bin by Tomas Mikolov.
Then, I used word2vec in gensim to find the output.
This is my code when using gensim:
from gensim.models import word2vec
model_path = "…/GoogleNews-vectors-negative300.bin"
model = word2vec.Word2Vec.load_word2vec_format(model_path, binary=True)
stringA = 'woman'
stringB = 'king'
stringC = 'man'
print model.most_similar(positive=[stringA, stringB], negative=[stringC], topn=10)
–> Output is:
[(u’queen’, 0.7118195295333862), (u’monarch’, 0.6189675331115723), (u’princess’, 0.5902432203292847), (u’crown_prince’, 0.5499461889266968), (u’prince’, 0.5377322435379028), (u’kings’, 0.523684561252594), (u’Queen_Consort’, 0.5235946178436279), (u’queens’, 0.5181134939193726), (u’sultan’, 0.5098595023155212), (u’monarchy’, 0.5087413191795349)]
You can see that the output above is different from the web app?
So can you check it for me?
Thanks so much.
I found that in gensim, the order should be:
…positive=[stringB, stringC], negative=[stringA]..
Hi Radim,
Thank you for the great tool and tutorial.
I have one question regarding learning rate of the online training. You mentioned to adjust total_words in train(), but could you give a more detailed explanation about how this parameter will affect the learning rate?
Thank you in advance.
Fantastic tool and tutorial. Thanks for sharing.
I’m wondering about compounding use of LSI. Take a large corpus and perform LSI to map words into some space. Now, having a document, when you hit a word, look up the point in the space and use that rather than just the word. Words of similar meaning then start out closer together and more sensibly influence the document classification. Would the model just reverse out those initial weights? Thanks for any ideas.
Hi Radim,
First of all, thanks for you great job on developing this tool. I am new in word2vec and unfortunately literature do not explain the details clearly. I would be grateful if you could answer my simple questions.
1- for CBOW (sg=0), does the method uses negative sampling as well? or this is something just related to skip-gram model.
2-what about the window size? is the window size also applicable when one uses CBOW? or all the words in 1 sentences is considered as bag-of-words?
3- what happens if the window size is larger than the size of a sentence? Is the sentence ignored or simply a smaller window size is chosen which fits the size of the sentence?
4- what happens if the word sits at the end of the sentence? there is no word after that for the skip-gram model !
Hi Radim,
Thanks for such a nice package! It may be bold to suggest, but I ran across what I think might be a bug. It’s likely a feature :), but I thought I’d point it out since I needed to fix it in an unintuitive way.
If I train a word2vec model using a list of sentences:
sentences = MySentences(fname) # generator that yields sentences
mysentences = list(sentences)
model = gensim.models.Word2Vec(sentences=mysentences, **kwargs)
then the model finishes training. Eg., the end of the logging shows
…snip…
2015-05-13 22:12:07,329 : INFO : PROGRESS: at 97.17% words, alpha 0.00075, 47620 words/s
2015-05-13 22:12:08,359 : INFO : PROGRESS: at 98.25% words, alpha 0.00049, 47605 words/s
2015-05-13 22:12:09,362 : INFO : PROGRESS: at 99.32% words, alpha 0.00019, 47603 words/s
2015-05-13 22:12:09,519 : INFO : reached the end of input; waiting to finish 16 outstanding jobs
2015-05-13 22:12:09,901 : INFO : training on 4427131 words took 92.9s, 47648 words/s
I’m training on many GB of data, so I need to pass in a generator that yields sentences line by line (like your MySentences class above). But when I try it as suggested with, say, iter=5:
sentences = MySentences(fname) # generator that yields sentences
model = gensim.models.Word2Vec(sentences=None, **kwargs) # iter=10 defined in kwargs
model.build_vocab(sentences_vocab)
model.train(sentences_train)
the model stops training 1/20 of the way through. If iter=10, it stops 1/10 of the way, etc. Eg., the end of the logging looks like,
…snip…
2015-05-13 22:31:37,265 : INFO : PROGRESS: at 18.21% words, alpha 0.02049, 49695 words/s
2015-05-13 22:31:38,266 : INFO : PROGRESS: at 19.29% words, alpha 0.02022, 49585 words/s
2015-05-13 22:31:38,452 : INFO : reached the end of input; waiting to finish 16 outstanding jobs
2015-05-13 22:31:38,857 : INFO : training on 885538 words took 17.8s, 49703 words/s
Looking in word2vec.py, around line 316 I noticed
sentences = gensim.utils.RepeatCorpusNTimes(sentences, iter)
so I added
sentences_train = gensim.utils.RepeatCorpusNTimes(Sentences(fname), model.iter)
before calling model.train() in the above code snippet. Does this seem like the correct course of action, or am I missing something fundamental about the way one should stream sentences to build the vocab and train the model?
Thanks for your help,
Jesse
Author
Hello Jesse,
for your sentences, are you using a generator (=can be iterated over only once), or an iterable (can be iterated over many times)?
It is true that for multiple passes, generator is not enough. Anyway better ask at the gensim mailing list / github, that’s a better medium for this:
http://radimrehurek.com/gensim/support.html
Hello,
It is a great tutorial, thank you very much….
but i have a problem,
I used the accuracy function to print the evaluation of the model, but nothing is printed for me.
How do I solve this problem?
thanks a lot
Author
Try turning on logging — the accuracy may be printed to log.
See the beginning of this tutorial for how to do that.
Great tutorial, Radim! Is it possible to download your trained model of 100 billion Google words?
Author
Thanks Shuai.
Yes, it is possible to download it:
https://code.google.com/p/word2vec/#Pre-trained_word_and_phrase_vectors
(The model is not mine, it was trained by Tomas while at Google).
Awesome! I just found it too. Cheers.
Hi Radim,
I was wondering if it is possible to train a Word2Vec model, not with sentences, but with input and output vectors built from the sentences in an application-specific manner?
Thanks.
Swami
Hi, I’ve got a problem ‘OverflowError: Python int too large to convert to C long’ when I run ‘model = gensim.models.Word2Vec(sentences, min_count=1)’. Could you help me with it?!
Author
Hello!
It’s best to report such things on the mailing list, or on GitHub, not on the blog:
http://radimrehurek.com/gensim/support.html
For this particular error, check out this GitHub issue.
thank you very much.
Thanks for writing this, it was quite helpful and told a lot
Pingback: Word2vec Tutorial » RaRe Technologies | D...
I ran:
model = gensim.models.Word2Vec(sentences, min_count=1)
and got the following error:
model = gensim.models.Word2Vec(sentences, min_count=1)
Traceback (most recent call last):
File “”, line 1, in
model = gensim.models.Word2Vec(sentences, min_count=1)
File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 312, in __init__
self.build_vocab(sentences)
File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 414, in build_vocab
self.reset_weights()
File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 521, in reset_weights
random.seed(uint32(self.hashfxn(self.index2word[i] + str(self.seed))))
OverflowError: Python int too large to convert to C long
I am using Python 3.4.3 in the Anaconda 2.3.0-64bit distribution.
I’d really like to be able to use this module, but it seems like there’s some fundamental issue for my computer.
Thanks!!
Author
Hello Mike, the fix was a part of gensim 0.12.1, released some time ago.
What version of gensim are you using?
Found the error…I was using “conda update gensim” but it looks like their Anaconda repository has not been updated. I’ll let them know, since many people use Anaconda distrib.
I ran “pip install --upgrade gensim” and it got 0.12.1. I had 10.1!!!
Ok, I updated and ran with the following input list of sentences:
sentences
Out[17]:
[[‘human’,
‘machine’,
‘interface’,
‘for’,
‘lab’,
‘abc’,
‘computer’,
‘applications’],
[‘a’,
‘survey’,
‘of’,
‘user’,
‘opinion’,
‘of’,
‘computer’,
‘system’,
‘response’,
‘time’],
[‘the’, ‘eps’, ‘user’, ‘interface’, ‘management’, ‘system’],
[‘system’, ‘and’, ‘human’, ‘system’, ‘engineering’, ‘testing’, ‘of’, ‘eps’],
[‘relation’,
‘of’,
‘user’,
‘perceived’,
‘response’,
‘time’,
‘to’,
‘error’,
‘measurement’],
[‘the’, ‘generation’, ‘of’, ‘random’, ‘binary’, ‘unordered’, ‘trees’],
[‘the’, ‘intersection’, ‘graph’, ‘of’, ‘paths’, ‘in’, ‘trees’],
[‘graph’,
‘minors’,
‘iv’,
‘widths’,
‘of’,
‘trees’,
‘and’,
‘well’,
‘quasi’,
‘ordering’],
[‘graph’, ‘minors’, ‘a’, ‘survey’]]
Still got the same error:
in [16]: model = gensim.models.word2vec.Word2Vec(sentences)
Traceback (most recent call last):
File “”, line 1, in
model = gensim.models.word2vec.Word2Vec(sentences)
File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 417, in __init__
self.build_vocab(sentences)
File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 483, in build_vocab
self.finalize_vocab() # build tables & arrays
File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 611, in finalize_vocab
self.reset_weights()
File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 888, in reset_weights
self.syn0[i] = self.seeded_vector(self.index2word[i] + str(self.seed))
File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 900, in seeded_vector
once = random.RandomState(uint32(self.hashfxn(seed_string)))
OverflowError: Python int too large to convert to C long
Author
Is Python picking up the right gensim?
AFAIK anaconda has its own packaging system, I’m not sure how it plays with your `pip install`.
What does `import gensim; print gensim.__version__` say?
Ok…I was able to make a couple changes to word2vec.py to get it to run on my computer:
The current version uses numpy.uint32 on lines 83, 327, 373, and 522. This was causing an overflow error when converting to C long.
I changed these to reference numpy.uint64 and it *almost* worked….the use of uint64 on line 522 for setting the seed of the random number generator resulted in a seed value being out of bounds. I addressed this by truncating the seed to the max allowable seed:
“random.seed(min(uint64(self.hashfxn(self.index2word[i] + str(self.seed))),4294967295))”
now everything runs fine (except that my version is not compiled under C so I may see some performance issues for large corpora)…
actually, there is a solution on Kaggle for 64-bit machines that worked really well (do not use my solution…it results in all word vectors being collinear).
def hash32(value):
    return hash(value) & 0xffffffff
Then pass the following argument to Word2Vec: hashfxn=hash32
This will overwrite the base hashfxn and resolve the issues. Also, all my cosine similarities were not 1 now!!
Beware of how you go through your training data :
When, in your class “MySentences” you use :
“for line in open(os.path.join(self.dirname, fname)): ”
As far as I know, it won’t close your file. You’re letting the garbage collector of Python deal with the leak in memory.
Use :
“with gzip.open(os.path.join(self.dirname, fname)) as f:”
instead (Ref : http://stackoverflow.com/questions/7395542/is-explicitly-closing-files-important )
For training on large dataset, it can be a major bottleneck (it was for me 😉 ).
Thank you very much for your fast implementation of Word2vec and Doc2vec !
Author
No, CPython closes the file immediately after the object goes out of scope. There is no leak (though that’s a common misconception and a favourite nitpick).
With gzip it makes more sense, but then you should be using smart_open anyway (also to work around missing context managers in python 2.6).
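For reference, a hedged sketch of the iterator rewritten with smart_open (assuming the smart_open package is installed; it handles plain, gzipped and bz2 files transparently and closes them for you):

import os
from smart_open import smart_open

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            with smart_open(os.path.join(self.dirname, fname)) as fin:
                for line in fin:
                    yield line.split()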
Hi, Radim.
Great tutorial.
I have a doubt. Is there any way I can represent a phrase as a vector so that I can calculate similarity between phrases just as what we do with words?
Author
Thanks Rodolpho!
Yes there is; check out the doc2vec tutorial.
Hi Radim,
first of thank you very much for your amazing work and even more amazing tutorial.
I am currently trying to compare two set of phrases.
I am using GoogleNews model as my model
Split all the words into individual words by using .split()
ie.
[‘golf’,’field’] and [‘country’,’club’]
[‘gas’,’station’] and [‘fire’,’station’]
as per feature in your app “phrase suggestions ”
I can see that GoogleNews model have
County_Club
Fire_station
gas_station
golf_field
But it’s difficult to scan for those words because of Capitalization in GN model.
I tried.
model.vocab.keys()
which would convert all available names into a list.
but couldn’t get any close to your example above.
I also looked at gensim.models.phrase.Phrases
hoping that it can help me detect above example with bigram
or trigram
for those who are using GN model,
how could we detect bigram or trigram?
Thank you in advance.
As far as I know, Google didn’t release their vocabulary/phrase model, nor their text preprocessing method.
The only thing you have to go by are the phrases inside the model itself (3 million of them), sorted by frequency.
You can lowercase the model vocabulary and match against that, but note that you’ll lose some vectors (no way to tell County_Club from county_club from County_club).
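A sketch of that lowercase matching (a hypothetical helper; when several original phrases collapse to the same lowercase key, only one of them survives):

# map lowercased phrases back to (one of) their original spellings
lower_to_original = {}
for phrase in model.vocab:
    lower_to_original.setdefault(phrase.lower(), phrase)

def lookup(phrase):
    return lower_to_original.get(phrase.lower())   # e.g. 'fire_station' -> 'Fire_station'

print lookup('country_club')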
You can also try asking at the gensim mailing list, or Tomas Mikolov at his mailing list — better chance someone may have an answer or know something.
Hi Radim,
Thanks for your reply,
I went ahead and created a small function which would create bigram and replace the original words if bigram exist in GoogleNews model.
and yes, I will join google mailing group.
####################################################
# bigram creator
# try to capture fire_station or Fire_Station rather than using ‘fire’ ‘station’ seperately
# creating bigram
def create_bigram(list):
    for i in range(0, len(list) - 1):
        # ie. country_club
        word1 = list[i] + '_' + list[i+1]
        # ie. Country_club
        word2 = list[i].capitalize() + '_' + list[i+1]
        # ie. Country_Club
        word3 = list[i].capitalize() + '_' + list[i+1].capitalize()
        # ie. COUNTRY_CLUB
        word4 = (list[i] + '_' + list[i+1]).upper()
        word_list = [word1, word2, word3, word4]
        for item in word_list:
            print item
            if item in model.vocab:
                list.pop(i)
                list.pop(i)
                list.append(item)
                break
Here’s the fixed code; I added a line so it won’t append a new word once len(list) gets shorter:
added a line that will not append new word if len(list) gets shorter
####################################################
# bigram creator
# try to capture fire_station or Fire_Station rather than using ‘fire’ ‘station’ seperately
# creating bigram
def create_bigram(list):
    for i in range(0, len(list) - 1):
        if i < len(list) - 1:
            # ie. country_club
            word1 = list[i] + '_' + list[i+1]
            # ie. Country_club
            word2 = list[i].capitalize() + '_' + list[i+1]
            # ie. Country_Club
            word3 = list[i].capitalize() + '_' + list[i+1].capitalize()
            # ie. COUNTRY_CLUB
            word4 = (list[i] + '_' + list[i+1]).upper()
            word_list = [word1, word2, word3, word4]
            for item in word_list:
                print i
                if item in model.vocab:
                    list.pop(i)
                    list.pop(i)
                    list.append(item)
                    break
I am following a tutorial of doc2vec from http://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis
But while calling
model_dm.build_vocab(np.concatenate((x_train, x_test, unsup_reviews)))
I am getting an error:
‘numpy.ndarray’ object has no attribute ‘words’
It seems like this error occurs at document.words in doc2vec.py.
What am I missing here?
I trained 2 Doc2vec models with the same data, and parameters:
model = Doc2Vec(sentences, dm=1, size=300, window=5, negative=10, hs=1, sample=1e-4, workers=20, min_count=3)
But I got 2 different models in each time. Is this true?
Can you explain more details for me?
Is that the case for Word2vec model?
Thanks Radim!
Author
Hello Lis,
it’s best to use the mailing list for gensim support:
http://radimrehurek.com/gensim/support.html
You’ll get the quickest and most qualified answers there 🙂