I never got round to writing a tutorial on how to use word2vec in gensim. It’s simple enough and the API docs are straightforward, but I know some people prefer more verbose formats. Let this post be a tutorial and a reference example.
UPDATE: the complete HTTP server code for the interactive word2vec demo below is now open sourced on Github. For a high-performance similarity server for documents, see ScaleText.com.
Preparing the Input
Starting from the beginning, gensim’s word2vec expects a sequence of sentences as its input. Each sentence is a list of words (utf8 strings):
# import modules & set up logging
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)
Keeping the input as a Python built-in list is convenient, but can use up a lot of RAM when the input is large.
Gensim only requires that the input provide sentences sequentially when iterated over. There’s no need to keep everything in RAM: we can provide one sentence, process it, forget it, load another sentence…
For example, if our input is strewn across several files on disk, with one sentence per line, then instead of loading everything into an in-memory list, we can process the input file by file, line by line:
import os

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('/some/directory')  # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)
Say we want to further preprocess the words from the files — convert to unicode, lowercase, remove numbers, extract named entities… All of this can be done inside the MySentences iterator and word2vec doesn’t need to know. All that is required is that the input yields one sentence (list of utf8 words) after another.
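For instance, here is a minimal sketch of such a preprocessing iterator (the class name, the lowercasing and the digit filtering are illustrative choices, not part of gensim):

import os

class MyPreprocessedSentences(object):
    """Yield one preprocessed sentence (a list of tokens) per line, file by file."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname)) as fin:
                for line in fin:
                    # lowercase, then drop purely numeric tokens -- adapt to your own needs
                    tokens = line.lower().split()
                    yield [token for token in tokens if not token.isdigit()]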
Note to advanced users: calling Word2Vec(sentences, iter=1) will run two passes over the sentences iterator (or, in general iter+1 passes; default iter=5). The first pass collects words and their frequencies to build an internal dictionary tree structure. The second and subsequent passes train the neural model. These two (or, iter+1) passes can also be initiated manually, in case your input stream is non-repeatable (you can only afford one pass), and you’re able to initialize the vocabulary some other way:
model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
model.build_vocab(some_sentences)  # can be a non-repeatable, 1-pass generator
model.train(other_sentences)  # can be a non-repeatable, 1-pass generator
In case you’re confused about iterators, iterables and generators in Python, check out our tutorial on Data Streaming in Python.
Training
Word2vec accepts several parameters that affect both training speed and quality.
One of them is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to do any meaningful training on such words, so it’s best to ignore them:
model = Word2Vec(sentences, min_count=10) # default value is 5
A reasonable value for min_count is between 0 and 100, depending on the size of your dataset.
Another parameter is the size of the NN layers, which correspond to the “degrees” of freedom the training algorithm has:
model = Word2Vec(sentences, size=200) # default value is 100
Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.
The last of the major parameters (full list here) is for training parallelization, to speed up training:
model = Word2Vec(sentences, workers=4) # default = 1 worker = no parallelization
The workers parameter only has an effect if you have Cython installed. Without Cython, you’ll only be able to use one core because of the GIL (and word2vec training will be miserably slow).
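To check whether the fast Cython routines are actually in use on your install, you can inspect the FAST_VERSION flag (a value of -1 means the slow, pure-Python code path); this is just a quick sanity check:

from gensim.models import word2vec
if word2vec.FAST_VERSION < 0:
    print("warning: C extension not loaded; training will be slow and single-threaded")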
Memory
At its core, word2vec model parameters are stored as matrices (NumPy arrays). Each array is #vocabulary (controlled by the min_count parameter) times #size (the size parameter) floats (single precision, i.e. 4 bytes each).
Three such matrices are held in RAM (work is underway to reduce that number to two, or even one). So if your input contains 100,000 unique words, and you asked for layer size=200, the model will require approx. 100,000*200*4*3 bytes = ~229MB.
There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above.
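As a sketch, the back-of-the-envelope estimate above can be reproduced directly (the factor of three stands for the three matrices mentioned):

vocab_size, layer_size = 100000, 200
bytes_per_float = 4  # single precision
num_matrices = 3     # matrices held in RAM, as discussed above
megabytes = vocab_size * layer_size * bytes_per_float * num_matrices / 1024.0 / 1024.0
print(megabytes)  # ~228.9 MB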
Evaluating
Word2vec training is an unsupervised task; there’s no good way to objectively evaluate the result. Evaluation depends on your end application.
Google has released a test set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task: https://raw.githubusercontent.com/RaRe-Technologies/gensim/develop/gensim/test/test_data/questions-words.txt.
Gensim supports the same evaluation set, in exactly the same format:
model.accuracy('/tmp/questions-words.txt')
2014-02-01 22:14:28,387 : INFO : family: 88.9% (304/342)
2014-02-01 22:29:24,006 : INFO : gram1-adjective-to-adverb: 32.4% (263/812)
2014-02-01 22:36:26,528 : INFO : gram2-opposite: 50.3% (191/380)
2014-02-01 23:00:52,406 : INFO : gram3-comparative: 91.7% (1222/1332)
2014-02-01 23:13:48,243 : INFO : gram4-superlative: 87.9% (617/702)
2014-02-01 23:29:52,268 : INFO : gram5-present-participle: 79.4% (691/870)
2014-02-01 23:57:04,965 : INFO : gram7-past-tense: 67.1% (995/1482)
2014-02-02 00:15:18,525 : INFO : gram8-plural: 89.6% (889/992)
2014-02-02 00:28:18,140 : INFO : gram9-plural-verbs: 68.7% (482/702)
2014-02-02 00:28:18,140 : INFO : total: 74.3% (5654/7614)
The accuracy() method takes an optional restrict_vocab parameter, which limits which test examples are considered.
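For example, to evaluate only against the most frequent 30,000 words (the particular number here is just an illustration):

model.accuracy('/tmp/questions-words.txt', restrict_vocab=30000)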
Once again, good performance on this test set doesn’t mean word2vec will work well in your application, or vice versa. It’s always best to evaluate directly on your intended task.
Storing and loading models
You can store/load models using the standard gensim methods:
model.save('/tmp/mymodel')
new_model = gensim.models.Word2Vec.load('/tmp/mymodel')
which uses pickle internally, optionally mmap'ing the model’s large internal NumPy matrices into virtual memory directly from disk files, for inter-process memory sharing.
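For example, a sketch of loading a stored model with its big arrays memory-mapped read-only, so several processes can share the same matrices (note that you typically cannot continue training a read-only-mapped model):

model = gensim.models.Word2Vec.load('/tmp/mymodel', mmap='r')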
In addition, you can load models created by the original C tool, using both its text and binary formats:
model = Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)
# using gzipped/bz2 input works too, no need to unzip:
model = Word2Vec.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)
Online training / Resuming training
Advanced users can load a model and continue training it with more sentences:
model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.train(more_sentences)
You may need to tweak the total_words parameter to train(), depending on what learning rate decay you want to simulate.
Note that it’s not possible to resume training with models generated by the C tool and loaded via load_word2vec_format(). You can still use them for querying/similarity, but the information vital for training (the vocab tree) is missing.
Using the model
Word2vec supports several word similarity tasks out of the box:
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]

model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'

model.similarity('woman', 'man')
0.73723527
If you need the raw output vectors in your application, you can access these either on a word-by-word basis
model['computer']  # raw NumPy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)
…or en masse, as a 2D NumPy matrix, from model.syn0.
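As a quick sketch of what you can do with the raw vectors, here is the cosine similarity computed by hand with NumPy; it should agree with model.similarity():

import numpy
v1, v2 = model['woman'], model['man']
print(numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2)))  # same value as model.similarity('woman', 'man')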
Bonus app
As before with finding similar articles in the English Wikipedia with Latent Semantic Analysis, here’s a bonus web app for those who managed to read this far. It uses the word2vec model trained by Google on the Google News dataset, on about 100 billion words:
If you don’t get “queen” back, something went wrong and baby SkyNet cries.
Try more examples too: “he” is to “his” as “she” is to ?, “Berlin” is to “Germany” as “Paris” is to ? (click to fill in).
Try: U.S.A.; Monty_Python; PHP; Madiba (click to fill in).
Also try: “monkey ape baboon human chimp gorilla”; “blue red green crimson transparent” (click to fill in).
The model contains 3,000,000 unique phrases built with layer size of 300.
Note that the similarities were trained on a news dataset, and that Google did very little preprocessing there. So the phrases are case sensitive: watch out! Especially with proper nouns.
On a related note, I noticed about half the queries people entered into the earlier Wikipedia LSA demo contained typos/spelling errors, so they found nothing. Ouch.
To make it a little less challenging this time, I added phrase suggestions to the forms above. Start typing to see a list of valid phrases from the actual vocabulary of Google News’ word2vec model.
The “suggested” phrases are simply the ten phrases starting at the position returned by bisect_left(all_model_phrases_alphabetically_sorted, prefix_you_typed_so_far), using Python’s built-in bisect module.
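A minimal sketch of that lookup (the variable and function names here are illustrative):

from bisect import bisect_left

all_model_phrases_alphabetically_sorted = sorted(model.vocab.keys())

def suggest(prefix_you_typed_so_far, topn=10):
    # find the first phrase >= the typed prefix, then return the next ten phrases
    pos = bisect_left(all_model_phrases_alphabetically_sorted, prefix_you_typed_so_far)
    return all_model_phrases_alphabetically_sorted[pos : pos + topn]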
See the complete HTTP server code for this “bonus app” on github (using CherryPy).
Outro
Full word2vec API docs here; get gensim here. Original C toolkit and word2vec papers by Google here.
And here’s me talking about the optimizations behind word2vec at PyData Berlin 2014:
Comments 158
Hi radim,
Impressive tutorial. I have a query: the output of the Word2Vec model is returned as an array. How can we use that as input to a recursive neural network?
Thanks
model = gensim.models.Word2Vec(sentences) will not work as shown in the tutorial, because you will receive the error message: “RuntimeError: you must first build vocabulary before training the model”. You also have to set min_count manually, like model = gensim.models.Word2Vec(sentences, min_count=1).
Author
Default `min_count=5` if you don’t set it explicitly. Vocabulary is built automatically from the sentences.
What version of gensim are you using? It should really work simply with `Word2Vec(sentences)`, there are even unit tests for that.
if you don’t set ‘min_count=1’, it will remove all the words in sentences in the example given –
logging:
‘INFO : total 0 word types after removing those with count<5'
Author
Ah ok, thanks Claire. I’ve added the `min_count=1` parameter.
It’s showing “RuntimeError: you must first build vocabulary before training the model” even though I changed min_count=1.
Please help me correct this.
How can I train the vocabulary?
I suspect the content of your sentences may be in the wrong format; you may need to remove punctuation and make sure the words are separated by spaces.
Hi Radim,
Is there any way to obtain the similarity of phrases out of the word2vec? I’m trying to get 2-word phrases to compare, but don’t know how to do it.
Thanks!
Pavel
Author
Hello Pavel, yes, there is a way.
First, you must detect phrases in the text (such as 2-word phrases).
Then you build the word2vec model like you normally would, except some “tokens” will be strings of multiple words instead of one (example sentence: [“New York”, “was”, “founded”, “16th century”]).
Then, to get similarity of phrases, you do `model.similarity(“New York”, “16th century”)`.
It may be a good idea to replace spaces with underscores in the phrase-tokens, to avoid potential parsing problems (“New_York”, “16th_century”).
As for detecting the phrases, it’s a task unrelated to word2vec. You can use existing NLP tools like the NLTK/Freebase, or help finish a gensim pull request that does exactly this: https://github.com/piskvorky/gensim/pull/135 .
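A minimal sketch of the idea, on toy data, once the phrase detection step has already merged tokens with underscores:

sentences = [['New_York', 'was', 'founded', 'in', 'the', '16th_century'],
             ['Boston', 'was', 'founded', 'in', 'the', '17th_century']]
model = gensim.models.Word2Vec(sentences, min_count=1)
print(model.similarity('New_York', '16th_century'))  # phrase tokens behave like ordinary words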
Hi Radim,
The Word2Vec function split my words as:
u'\u4e00\u822c' ==> u'\u4e00' and u'\u822c'
How could I fix it?
Thanks, luopuya
Sorry, I did not read the blog carefully.
Every time I read a line of the file, I should split it like “MySentences” does.
Hi Radim,
you are awesome, thank you so much for gensim and this tutorial!!
I have a question. I read in the docs that by default you utilize Skip-Gram, which can be switched to CBOW. From what I gathered in the NIPS slides, CBOW is faster, more effective and gets better results. So why use Skip-Gram in the first place? I’m sure I’m missing something obvious here 🙂
Thanks,
Max
Whoops, I just realized the parameter “sg” is not supported anymore in the Word2Vec constructor. Is that true? So what is used by default?
Author
Hello Max,
thanks 🙂
Skip-gram is used because it gives better (more accurate) results.
CBOW is faster though. If you have lots of data, it can be advantageous to run the simpler but faster model.
There’s a pull request under way, to enable all the word2vec options and parameters. You can try out the various models and their performance for yourself 🙂
Thanks for you answer Radim! I only saw that one experiment in the slides that said that CBOW was faster AND more accurate, but that might have been an outlier or my misinterpretation. I’m excited for that pull request! 🙂
Anyway, I have another question (or bug report?). I changed a bunch of training parameters and added input data and suddenly got segfaults on Python when asking for similarities for certain words… So I tried which of the changes caused this, and it turned out that the cause was that I set the output size to 200! Setting it to (apparently) any other number doesn’t cause any trouble, but 200 does… If you hear this from anyone else or are able to reproduce it yourself, consider it a bug 🙂
Author
Hey Max — are you on OS X? If so, it may be https://github.com/numpy/numpy/issues/4007
If not, please file a bug report at https://github.com/piskvorky/gensim/issues with system/sw info. Cheers!
Hi Radim,
Indeed, a great tutorial! Thank you for that!
Played a bit with Word2Vec and it’s quite impressive. Couldn’t figure out how the first part of the demo app works. Can you provide some insights, please?
Thanks! 😀
Author
Hello Bogdan, you’re welcome 🙂
This demo app just calls the “model.most_similar(positive, negative)” method in the background. Check out the API docs: http://radimrehurek.com/gensim/models/word2vec.html
Hello,
I’d like to ask you if this can all be done with other languages too, like Korean, Russian, Arabic and so on, or whether this toolkit is fixed to English only.
Thank You in advance for the answer
Author
Hi, it can be done directly for languages where you can split text into sentences and tokens.
The only concern would be that the `window` parameter has to be large enough to capture syntactic/semantic relationships. For English, the default `window=5`, and I wouldn’t expect it to be dramatically different for other languages.
Hi I just wanted to ask if anyone tried it on different languages yet? I was planning to test it out so I thought I should ask if someone already did it!
Author
Hello Disha,
the gensim mailing list is a much better place to ask. I doubt anyone regularly scans the blog comments.
Best,
Radim
Hey, I wanted to know if the version you have in gensim is the same one you got after “optimizing word2vec in Python”. I am using the pre-trained Google News vectors (found on the word2vec page) and then I run model.accuracy(‘file_questions’), but it runs really slow… Just wanted to know if this is normal or if I have to do something to speed up the gensim version. Thanks in advance and great work!
Author
It is — gensim always contains the latest, most optimized version (=newer than this blog post).
However, the accuracy computations (unlike training) are NOT optimized 🙂 I never got to optimizing that part. If you want to help, let me know, I don’t think I’ll ever get to it myself. (Massive optimizations can be done directly in Python, no need to go C/Cython).
Hi,
could you please explain how the CBOW and skip-gram models actually do the learning? I’ve read ‘Efficient Estimation…’ but it doesn’t really explain how the actual training happens.
I’ve taken a look at the original source code and your implementation, and while I can understand the code I cannot understand the logic behind it.
I don’t understand these lines in your implementation (word2vec.py, gensim 0.9) CBOW:
————————————
l2a = model.syn1[word.point]  # 2d matrix, codelen x layer1_size
fa = 1.0 / (1.0 + exp(-dot(l1, l2a.T)))  # propagate hidden -> output
ga = (1 - word.code - fa) * alpha  # vector of error gradients multiplied by the learning rate
model.syn1[word.point] += outer(ga, l1)  # learn hidden -> output
—————————————-
I see that it has something to do with the Huffman-tree word representation, but despite the comments, I don’t understand what is actually happening, what does syn1 represent, why do we multiply l1 with l2a… why are we multiplying ga and l1 etc…
Could you please explain in a sentence or two what is actually happening in there.
I would be very grateful.
Hi, this is great, but I have a question about unknown words.
After loading a model, when I train on more sentences, new words are ignored, not added to the vocab automatically.
I am not quite sure about this. Is that true?
Thank you very much!
Author
Yes, it is true. The word2vec algorithm doesn’t support adding new words dynamically.
Hi,
Unfortunately I’m not sufficiently versed in programming (yet) to solve this by myself. I would like to be able to add a new vector to the model after it has been trained.
I realize I could export the model to a text file, add the vector there and load the modified file. Is there a way to add the vector within Python, though? As in: create a new entry for the model such that model[‘new_entry’] is assigned the new_vector.
Thanks in advance!
Author
The weights are a NumPy matrix — have a look at `model.syn0` and `model.syn1` matrices. You can append new vectors easily.
Off the top of my head I’m not sure whether there are any other variables in `model` that need to be modified when you do this. There probably are. I’d suggest checking the code to make sure everything’s consistent.
I am new to word2vec. Can I ask you two questions?
1. When I apply the pre-trained model to my own dataset, do you have any suggestion about how to deal with the unknown words?
2. Do you have any suggestion about aggregating the word embeddings of words in a sentence into one vector to represent that sentence?
Thanks a lot!
Author
Good questions, but I don’t have any insights beyond the standard advice:
1. unknown words are ignored; or you can build a model with one special “word” to represent all OOV words.
2. for short sentences/phrases, you can average the individual vectors; for longer texts, look into something like paragraph2vec: https://github.com/piskvorky/gensim/issues/204#issuecomment-52093328
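A minimal sketch of the averaging approach, skipping out-of-vocabulary words (a common but by no means ideal choice):

import numpy

def average_vector(words, model):
    # average the vectors of the in-vocabulary words; None if none of them are known
    vectors = [model[word] for word in words if word in model.vocab]
    return numpy.mean(vectors, axis=0) if vectors else None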
Thanks for your advice 😀
Hi,
Could you tell me how to find the most similar word as in web app 3? Calculating the cosine similarity between each word seems like a no-brainer way to do it? Is there any API in gensim to do that?
Another question: I want to represent a sentence using word vectors. Right now I just add up all the words in the sentence to get a new vector. I know this method doesn’t make much sense, since each word has a coordinate in the semantic space and adding up coordinates is not an ideal way to represent a sentence. I have read some papers talking about this problem. Could you tell me what would be an ideal way to represent a sentence for sentence clustering?
Thank you very much!
Author
“Find most similar word”: look at the API docs.
“Ideal way to represent a sentence”: I don’t know about ideal, but another way to represent sentences is using “paragraph2vec”: https://github.com/piskvorky/gensim/issues/204
Radim, this is great stuff. I have posted a link to your software on our AI community blog in the USA at http://www.smesh.net/pages/980191274#54262e6cdb3c2facb5a41579 , so that other people can benefit from it too.
Does the workers = x work to multithread the iteration over the sentences or just the training of the model ?
Author
It parallelizes training.
How you iterate over sentences is your business — word2vec only expects an iterator on input. What the iterator does internally to iterate over sentences is up to you and not part of word2vec.
I have a collection of 1500000 text files (with 10 lines each on average) and a machine with 12 cores/16G of ram(not sure if it is relevant for reading files).
How would you suggest me to build the iterator to utilize all the computing resources I have?
Author
No, not relevant.
I’d suggest you loop over your files inside __iter__() and yield out your sentences (lines?), one after another.
ok thanks!
If I am using the model pre-trained with Google News data set, is there any way to control the size of the output vector corresponding to a word?
For the “model.build_vocab(sentences)” command to work, we need to add “import os”. Without that, I was getting an error for ‘os’ not being defined.
Author
Not sure what you are talking about suvir, you don’t need any “import os”. If you run into problems, send us the full traceback (preferably to the gensim mailing list, not this blog). See http://radimrehurek.com/gensim/support.html. Cheers.
hello,
Where can I find the code (in python) of the Bonus App?
I am using the train function as described in the API doc. I notice that the training might have terminated “prematurely”, according to the logging output below. I’m not sure I understand the output properly. When it says “PROGRESS: at 4.10% words”, does it mean 4.1% of the corpus or 4.1% of the vocab? I suspect the former, so it would suggest it only processed 4.1% of the words. Please enlighten me. Thanks!
2015-02-11 19:34:40,894 : INFO : Got records: 20143
2015-02-11 19:34:40,894 : INFO : training model with 4 workers on 67186 vocabulary and 200 features, using ‘skipgram’=1 ‘hierarchical softmax’=0 ‘subsample’=0 and ‘negative sampling’=15
2015-02-11 19:34:41,903 : INFO : PROGRESS: at 0.45% words, alpha 0.02491, 93073 words/s
2015-02-11 19:34:42,925 : INFO : PROGRESS: at 0.96% words, alpha 0.02477, 97772 words/s
2015-02-11 19:34:43,930 : INFO : PROGRESS: at 1.48% words, alpha 0.02465, 100986 words/s
2015-02-11 19:34:44,941 : INFO : PROGRESS: at 2.00% words, alpha 0.02452, 102187 words/s
2015-02-11 19:34:45,960 : INFO : PROGRESS: at 2.51% words, alpha 0.02439, 102371 words/s
2015-02-11 19:34:46,966 : INFO : PROGRESS: at 3.05% words, alpha 0.02426, 104070 words/s
2015-02-11 19:34:48,006 : INFO : PROGRESS: at 3.55% words, alpha 0.02413, 103439 words/s
2015-02-11 19:34:48,625 : INFO : reached the end of input; waiting to finish 8 outstanding jobs
2015-02-11 19:34:49,026 : INFO : PROGRESS: at 4.10% words, alpha 0.02400, 104259 words/s
Hi Radim,
Is there a whole example that I can use to understand the whole concept and to walk through the code.
Thanks much,
Author
Hello Sasha,
not sure what concept / code you need, but there is one example right there in the word2vec.py source file:
https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec.py#L997
(you can download the text8 corpus used there from http://mattmahoney.net/dc/text8.zip )
Hi Radim,
I’m wondering about the difference between a model trained in C (the original way) and one trained in gensim.
When I try to use the model.most_similar function, loading the model I trained in C, I get a totally different result than when I do the same thing with word-analogy.sh. So I just want to know if the model.most_similar function uses the same calculation for ‘man’ - ‘king’ + ‘woman’ ≈ ‘queen’ as Mikolov’s C code (word-analogy). Thanks!!!
Author
Yes, exactly the same (cosine similarity).
The training is almost the same too, up to different randomized initialization of weights IIRC.
Maybe you’re using different data (preprocessing)?
Sorry to bother you again; here are the two ways I try to do it:
The way I do it using gensim:
model = Word2Vec.load_word2vec_format('vectors_200.bin', binary=True)
# Chinese
word1 = u'石家庄'
word2 = u'河北'
word3 = u'河南'
le = model.most_similar(positive=[word2, word3], negative=[word1])
The way using the C code:
./word-analogy vectors_200.bin
with the input: '石家庄' '河北' '河南'
I get totally different results…
The same model was loaded in both cases, so how could that happen?
Author
Oh, non-ASCII characters.
IIRC, the C code doesn’t handle unicode in any way, all text is treated as binary. Python code (gensim) uses Unicode for strings.
So, perhaps some encoding mismatch?
Was your model trained with the C code? If so, what was the encoding?
The training corpus is encoded in utf-8; could that be the reason?
Hello Radim,
Is there a way to extract the output feature vector (or, sort of, predicted probabilities) from the model, just like while it’s training?
Thanks
Hey Radim
Thanks for the wonderful tutorial.
I am new to word2vec and I am trying to generate n-grams of words for an Indian script. I have 2 queries:
Q1. Should the input be in plain text:
ସୁଯୋଗ ଅସଟାର or unicodes 2860 2825 2858 2853 2821
Q2. Is there any code available to do clustering of the generated vectors to form word classes?
Please help
Hi Radim,
For this example: “woman king man”:
I ran it with the bonus web app, and got these results:
521.9ms [[“kings”,0.6490576267242432],[“clown_prince”,0.5009066462516785],[“prince”,0.4854174852371216],[“crown_prince”,0.48162946105003357],[“King”,0.47213971614837646]]
The above result is the same as word2vec by Tomas Mikolov.
However, when I run example above in gensim, the output is:
[(u’queen’, 0.7118195295333862), (u’monarch’, 0.6189675331115723), (u’princess’, 0.5902432203292847), (u’crown_prince’, 0.5499461889266968), (u’prince’, 0.5377322435379028), (u’kings’, 0.523684561252594), (u’Queen_Consort’, 0.5235946178436279), (u’queens’, 0.5181134939193726), (u’sultan’, 0.5098595023155212), (u’monarchy’, 0.5087413191795349)]
So why is this the case?
Your web app’s result is different from gensim’s?
Thanks!
Author
Hi Cong
no, both are the same.
In fact, the web app just calls gensim under the hood. There’s no extra magic happening regarding word2vec queries, it’s just gensim wrapped in cherrypy web server.
Thank you for your reply.
I loaded the pre-trained model: GoogleNews-vectors-negative300.bin by Tomas Mikolov.
Then, I used word2vec in gensim to find the output.
This is my code when using gensim:
from gensim.models import word2vec
model_path = "…/GoogleNews-vectors-negative300.bin"
model = word2vec.Word2Vec.load_word2vec_format(model_path, binary=True)
stringA = 'woman'
stringB = 'king'
stringC = 'man'
print model.most_similar(positive=[stringA, stringB], negative=[stringC], topn=10)
–> Output is:
[(u’queen’, 0.7118195295333862), (u’monarch’, 0.6189675331115723), (u’princess’, 0.5902432203292847), (u’crown_prince’, 0.5499461889266968), (u’prince’, 0.5377322435379028), (u’kings’, 0.523684561252594), (u’Queen_Consort’, 0.5235946178436279), (u’queens’, 0.5181134939193726), (u’sultan’, 0.5098595023155212), (u’monarchy’, 0.5087413191795349)]
You can see that the output above is different from the web app’s.
So can you check it for me?
Thanks so much.
I found that in gensim, the order should be:
…positive=[stringB, stringC], negative=[stringA]..
Hi Radim,
Thank you for the great tool and tutorial.
I have one question regarding learning rate of the online training. You mentioned to adjust total_words in train(), but could you give a more detailed explanation about how this parameter will affect the learning rate?
Thank you in advance.
Fantastic tool and tutorial. Thanks for sharing.
I’m wondering about compounding use of LSI. Take a large corpus and perform LSI to map words into some space. Now, when you hit a word in a document, look up its point in that space and use that rather than just the word. Words of similar meaning then start out closer together and more sensibly influence the document classification. Would the model just reverse out those initial weights? Thanks for any ideas.
Hi Radim,
First of all, thanks for your great job developing this tool. I am new to word2vec and unfortunately the literature does not explain the details clearly. I would be grateful if you could answer my simple questions.
1- For CBOW (sg=0), does the method use negative sampling as well, or is that something related only to the skip-gram model?
2- What about the window size? Is the window size also applicable when one uses CBOW, or are all the words in one sentence considered as a bag of words?
3- What happens if the window size is larger than the size of a sentence? Is the sentence ignored, or is a smaller window that fits the sentence used instead?
4- What happens if the word sits at the end of the sentence? There is no word after it for the skip-gram model!
Hi Radim,
Thanks for such a nice package! It may be bold to suggest, but I ran across what I think might be a bug. It’s likely a feature :), but I thought I’d point it out since I needed to fix it in an unintuitive way.
If I train a word2vec model using a list of sentences:
sentences = MySentences(fname) # generator that yields sentences
mysentences = list(sentences)
model = gensim.models.Word2Vec(sentences=mysentences, **kwargs)
then the model finishes training. Eg., the end of the logging shows
…snip…
2015-05-13 22:12:07,329 : INFO : PROGRESS: at 97.17% words, alpha 0.00075, 47620 words/s
2015-05-13 22:12:08,359 : INFO : PROGRESS: at 98.25% words, alpha 0.00049, 47605 words/s
2015-05-13 22:12:09,362 : INFO : PROGRESS: at 99.32% words, alpha 0.00019, 47603 words/s
2015-05-13 22:12:09,519 : INFO : reached the end of input; waiting to finish 16 outstanding jobs
2015-05-13 22:12:09,901 : INFO : training on 4427131 words took 92.9s, 47648 words/s
I’m training on many GB of data, so I need to pass in a generator that yields sentences line by line (like your MySentences class above). But when I try it as suggested with, say, iter=5:
sentences = MySentences(fname) # generator that yields sentences
model = gensim.models.Word2Vec(sentences=None, **kwargs) # iter=10 defined in kwargs
model.build_vocab(sentences_vocab)
model.train(sentences_train)
the model stops training 1/20 of the way through. If iter=10, it stops 1/10 of the way, etc. Eg., the end of the logging looks like,
…snip…
2015-05-13 22:31:37,265 : INFO : PROGRESS: at 18.21% words, alpha 0.02049, 49695 words/s
2015-05-13 22:31:38,266 : INFO : PROGRESS: at 19.29% words, alpha 0.02022, 49585 words/s
2015-05-13 22:31:38,452 : INFO : reached the end of input; waiting to finish 16 outstanding jobs
2015-05-13 22:31:38,857 : INFO : training on 885538 words took 17.8s, 49703 words/s
Looking in word2vec.py, around line 316 I noticed
sentences = gensim.utils.RepeatCorpusNTimes(sentences, iter)
so I added
sentences_train = gensim.utils.RepeatCorpusNTimes(Sentences(fname), model.iter)
before calling model.train() in the above code snippet. Does this seem like the correct course of action, or am I missing something fundamental about the way one should stream sentences to build the vocab and train the model?
Thanks for your help,
Jesse
Author
Hello Jesse,
for your sentences, are you using a generator (=can be iterated over only once), or an iterable (can be iterated over many times)?
It is true that for multiple passes, generator is not enough. Anyway better ask at the gensim mailing list / github, that’s a better medium for this:
http://radimrehurek.com/gensim/support.html
Hello,
It is a great tutorial, thank you very much….
but i have a problem,
I used the accuracy() function to print the evaluation of the model, but nothing is printed.
How do I solve this problem?
thanks a lot
Author
Try turning on logging — the accuracy may be printed to log.
See the beginning of this tutorial for how to do that.
Great tutorial, Radim! Is it possible to download your trained model of 100 billion Google words?
Author
Thanks Shuai.
Yes, it is possible to download it:
https://code.google.com/p/word2vec/#Pre-trained_word_and_phrase_vectors
(The model is not mine, it was trained by Tomas while at Google).
Awesome! I just found it too. Cheers.
Hi Radim,
I was wondering if it is possible to train a Word2Vec model, not with sentences, but with input and output vectors built from the sentences in an application-specific manner?
Thanks.
Swami
Hi, I’ve got a problem: ‘OverflowError: Python int too large to convert to C long’ when I run ‘model = gensim.models.Word2Vec(sentences, min_count=1)’. Could you help me with it?
Author
Hello!
It’s best to report such things on the mailing list, or on GitHub, not on the blog:
http://radimrehurek.com/gensim/support.html
For this particular error, check out this GitHub issue.
thank you very much.
Thanks for writing this, it was quite helpful and told a lot
I ran:
model = gensim.models.Word2Vec(sentences, min_count=1)
and got the following error:
model = gensim.models.Word2Vec(sentences, min_count=1)
Traceback (most recent call last):
File “”, line 1, in
model = gensim.models.Word2Vec(sentences, min_count=1)
File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 312, in __init__
self.build_vocab(sentences)
File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 414, in build_vocab
self.reset_weights()
File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 521, in reset_weights
random.seed(uint32(self.hashfxn(self.index2word[i] + str(self.seed))))
OverflowError: Python int too large to convert to C long
I am using Python 3.4.3 in the Anaconda 2.3.0-64bit distribution.
I’d really like to be able to use this module, but it seems like there’s some fundamental issue for my computer.
Thanks!!
Author
Hello Mike, the fix was a part of gensim 0.12.1, released some time ago.
What version of gensim are you using?
Found the error…I was using “conda update gensim” but it looks like their Anaconda repository has not been updated. I’ll let them know, since many people use Anaconda distrib.
I ran “pip install --upgrade gensim” and it got 0.12.1. I had 10.1!!!
Ok, I updated and ran with the following input list of sentences:
sentences
Out[17]:
[[‘human’,
‘machine’,
‘interface’,
‘for’,
‘lab’,
‘abc’,
‘computer’,
‘applications’],
[‘a’,
‘survey’,
‘of’,
‘user’,
‘opinion’,
‘of’,
‘computer’,
‘system’,
‘response’,
‘time’],
[‘the’, ‘eps’, ‘user’, ‘interface’, ‘management’, ‘system’],
[‘system’, ‘and’, ‘human’, ‘system’, ‘engineering’, ‘testing’, ‘of’, ‘eps’],
[‘relation’,
‘of’,
‘user’,
‘perceived’,
‘response’,
‘time’,
‘to’,
‘error’,
‘measurement’],
[‘the’, ‘generation’, ‘of’, ‘random’, ‘binary’, ‘unordered’, ‘trees’],
[‘the’, ‘intersection’, ‘graph’, ‘of’, ‘paths’, ‘in’, ‘trees’],
[‘graph’,
‘minors’,
‘iv’,
‘widths’,
‘of’,
‘trees’,
‘and’,
‘well’,
‘quasi’,
‘ordering’],
[‘graph’, ‘minors’, ‘a’, ‘survey’]]
Still got the same error:
in [16]: model = gensim.models.word2vec.Word2Vec(sentences)
Traceback (most recent call last):
File “”, line 1, in
model = gensim.models.word2vec.Word2Vec(sentences)
File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 417, in __init__
self.build_vocab(sentences)
File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 483, in build_vocab
self.finalize_vocab() # build tables & arrays
File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 611, in finalize_vocab
self.reset_weights()
File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 888, in reset_weights
self.syn0[i] = self.seeded_vector(self.index2word[i] + str(self.seed))
File "C:\Anaconda3\lib\site-packages\gensim\models\word2vec.py", line 900, in seeded_vector
once = random.RandomState(uint32(self.hashfxn(seed_string)))
OverflowError: Python int too large to convert to C long
Author
Is Python picking up the right gensim?
AFAIK anaconda has its own packaging system, I’m not sure how it plays with your `pip install`.
What does `import gensim; print gensim.__version__` say?
Ok…I was able to make a couple changes to word2vec.py to get it to run on my computer:
The current version uses numpy.uint32 on lines 83, 327, 373, and 522. This was causing an overflow error when converting to C long.
I changed these to reference numpy.uint64 and it *almost* worked….the use of uint64 on line 522 for setting the seed of the random number generator resulted in a seed value being out of bounds. I addressed this by truncating the seed to the max allowable seed:
“random.seed(min(uint64(self.hashfxn(self.index2word[i] + str(self.seed))),4294967295))”
now everything runs fine (except that my version is not compiled under C, so I may see some performance issues for large corpora)…
actually, there is a solution on Kaggle for 64-bit machines that worked really well (do not use my solution…it results in all word vectors being collinear).
def hash32(value):
    return hash(value) & 0xffffffff

Then pass the following argument to Word2Vec: hashfxn=hash32
This will overwrite the base hashfxn and resolve the issues. Also, all my cosine similarities were not 1 now!!
Beware of how you go through your training data :
When, in your class “MySentences” you use :
“for line in open(os.path.join(self.dirname, fname)): ”
As far as I know, it won’t close your file. You’re letting the garbage collector of Python deal with the leak in memory.
Use :
“with gzip.open(os.path.join(self.dirname, fname)) as f:”
instead (Ref : http://stackoverflow.com/questions/7395542/is-explicitly-closing-files-important )
For training on large dataset, it can be a major bottleneck (it was for me 😉 ).
Thank you very much for your fast implementation of Word2vec and Doc2vec !
Author
No, CPython closes the file immediately after the object goes out of scope. There is no leak (though that’s a common misconception and a favourite nitpick).
With gzip it makes more sense, but then you should be using smart_open anyway (also to work around missing context managers in python 2.6).
Hi, Radim.
Great tutorial.
I have a doubt. Is there any way I can represent a phrase as a vector so that I can calculate similarity between phrases just as what we do with words?
Author
Thanks Rodolpho!
Yes there is; check out the doc2vec tutorial.
Hi Radim,
first of thank you very much for your amazing work and even more amazing tutorial.
I am currently trying to compare two set of phrases.
I am using GoogleNews model as my model
Split all the words into individual words by using .split()
ie.
[‘golf’,’field’] and [‘country’,’club’]
[‘gas’,’station’] and [‘fire’,’station’]
as per feature in your app “phrase suggestions ”
I can see that GoogleNews model have
County_Club
Fire_station
gas_station
golf_field
But it’s difficult to scan for those words because of Capitalization in GN model.
I tried.
model.vocab.keys()
which would convert all available names into a list.
but couldn’t get any close to your example above.
I also looked at gensim.models.phrases.Phrases,
hoping that it can help me detect the above example with bigrams
or trigrams.
For those who are using the GN model,
how could we detect bigrams or trigrams?
Thank you in advance.
As far as I know, Google didn’t release their vocabulary/phrase model, nor their text preprocessing method.
The only thing you have to go by are the phrases inside the model itself (3 million of them), sorted by frequency.
You can lowercase the model vocabulary and match against that, but note that you’ll lose some vectors (no way to tell County_Club from county_club from County_club).
You can also try asking at the gensim mailing list, or Tomas Mikolov at his mailing list — better chance someone may have an answer or know something.
Hi Radim,
Thanks for your reply,
I went ahead and created a small function which creates bigrams and replaces the original words if the bigram exists in the GoogleNews model.
and yes, I will join google mailing group.
####################################################
# bigram creator
# try to capture fire_station or Fire_Station rather than using ‘fire’ ‘station’ seperately
# creating bigram
def create_bigram(list):
    for i in range(0, len(list) - 1):
        # ie. country_club
        word1 = list[i] + '_' + list[i+1]
        # ie. Country_club
        word2 = list[i].capitalize() + '_' + list[i+1]
        # ie. Country_Club
        word3 = list[i].capitalize() + '_' + list[i+1].capitalize()
        # ie. COUNTRY_CLUB
        word4 = (list[i] + '_' + list[i+1]).upper()
        word_list = [word1, word2, word3, word4]
        for item in word_list:
            print item
            if item in model.vocab:
                list.pop(i)
                list.pop(i)
                list.append(item)
                break
Here’s the fixed code; I added a check so that it will not append a new word if len(list) gets shorter.
####################################################
# bigram creator
# try to capture fire_station or Fire_Station rather than using ‘fire’ ‘station’ seperately
# creating bigram
def create_bigram(list):
    for i in range(0, len(list) - 1):
        if i < len(list) - 1:
            # ie. country_club
            word1 = list[i] + '_' + list[i+1]
            # ie. Country_club
            word2 = list[i].capitalize() + '_' + list[i+1]
            # ie. Country_Club
            word3 = list[i].capitalize() + '_' + list[i+1].capitalize()
            # ie. COUNTRY_CLUB
            word4 = (list[i] + '_' + list[i+1]).upper()
            word_list = [word1, word2, word3, word4]
            for item in word_list:
                print i
                if item in model.vocab:
                    list.pop(i)
                    list.pop(i)
                    list.append(item)
                    break
I am following a tutorial of doc2vec from http://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis
But while calling
model_dm.build_vocab(np.concatenate((x_train, x_test, unsup_reviews)))
I am getting an error:
‘numpy.ndarray’ object has no attribute ‘words’
It seems like this error occurs at document.words in doc2vec.py.
What am I missing here?
I trained 2 Doc2vec models with the same data, and parameters:
model = Doc2Vec(sentences, dm=1, size=300, window=5, negative=10, hs=1, sample=1e-4, workers=20, min_count=3)
But I got 2 different models each time. Is this expected?
Can you explain it in more detail for me?
Is that also the case for the Word2vec model?
Thanks Radim!
Author
Hello Lis,
it’s best to use the mailing list for gensim support:
http://radimrehurek.com/gensim/support.html
You’ll get the quickest and most qualified answers there 🙂
Hi Radim,
Thanks for this amazing python version of Word2Vec!
I have come to a strange behaviour after training; and I wanted to mention it here to you.
So when I trained a word2vec model with default parameters (namely the skip-gram model), the results were coherent with what is reported (in this blog and in papers).
When I used the pre-trained “vectors.bin” model from C version of Word2Vec from Tomas, loaded in gensim, everything seems fine as well (notice that the default model of C version is CBOW).
Then I tried to train the gensim Word2Vec with the default parameters used in the C version (which are: size=200, workers=8, window=8, hs=0, sampling=1e-4, sg=0 (using CBOW), negative=25 and iter=15) and I got a strange “squeezed” or shrunken vector representation where most of the computed “most_similar” words shared a value of roughly 0.97!! (And for the classical “king”, “man”, “woman” example the most similar word was “and” with 0.98, and the top 10 didn’t even contain “queen”…). Everything was trained on the SAME text8 dataset.
So I wondered if you have seen such “wrong” training before, with those atypical characteristics (all words in roughly one direction in vector space), and if you know where the issue might be.
I am trying different parameters setting to hopefully figure out what is wrong (workers>1? iter?).
Thanks for any comment,
Cheers
Author
Thanks Hug!
I didn’t see such behaviour. Would you mind posting this info (plus any other version/data info you might have) on the gensim mailing list?
I’ll do thanks for the reply!
Can’t import word2vec.
RuntimeError: can anyone kindly help?
Traceback (most recent call last):
File “/Users/apple/Documents/w2c.py”, line 15, in
model = word2vec.Word2Vec(sentences, size=100, window=4, min_count=1, workers=4)
File “/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/word2vec.py”, line 432, in __init__
self.train(sentences)
File “/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/word2vec.py”, line 690, in train
raise RuntimeError(“you must first build vocabulary before training the model”)
RuntimeError: you must first build vocabulary before training the model
Hi Radim,
Great tutorial.
I have trained on 17 million sentences with my i5 core and 4GB RAM. During the process it hung a bit, but somehow I managed to save the model so that I can load it in the future. The saved model is pretty huge, with three files summing up to 1.2GB. So whenever I load the model, the system hangs and gets super slow. Is there any workaround for this problem, or is it just about upgrading the RAM?
Is there any command available to determine the vocabulary frequencies from the saved model, without having to import the training dataset again?
Thanks.
Author
Hello Siva,
yes, you have several options there:
* model.init_sims(replace=True), to remove unneeded files and save memory, if you don’t want to continue training.
* estimate_memory, to estimate required memory
* Look at the log from training, which contains a lot of useful, detailed information and numbers. Always a good idea to store and inspect the log.
Hope that helps,
Radim
I will work on it. Thanks for the reply…
Hi Radim,
Can’t import Word2vec in Python; it’s showing:
Traceback (most recent call last):
File “/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/word2vec.py”, line 690, in train
raise RuntimeError(“you must first build vocabulary before training the model”)
RuntimeError: you must first build vocabulary before training the model
How do I train the vocabulary?
Thank you
Hey Rahul,
you need to build vocab before training.
You need to execute the command written in the 5th line of the “Preparing the Input” section.
For example, if you have the input sentences: [I like ice cream], [I enjoy sleeping]
you need to split every word of each sentence:
sentence = [['I', 'like', 'ice', 'cream'], ['I', 'enjoy', 'sleeping']]
You can split sentences into words with the nltk library.
I just saw your doubt, thought this will be helpful 🙂
baby skynet is in distress, it makes me laugh no more, please fix it
what exactly does the output represent when I do model[“someword”]?
Hi Radim,
Thanks for your wonderful tutorial and the package!
I was trying to execute the sample code (given in the tutorial) as shown below:
=====================CODE====================================
In[2]: import gensim, logging
In[3]: logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
In[4]: sentences = [['first', 'sentence'], ['second', 'sentence']]
In[5]: model = gensim.models.Word2Vec(sentences, min_count=1)
==============================================================
But it didn’t work properly and it terminated with the following error:
=======================OUTPUT==============================
2015-12-19 03:36:41,976 : INFO : collecting all words and their counts
2015-12-19 03:36:41,976 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2015-12-19 03:36:41,977 : INFO : collected 3 word types from a corpus of 4 raw words and 2 sentences
2015-12-19 03:36:41,977 : INFO : min_count=1 retains 3 unique words (drops 0)
2015-12-19 03:36:41,977 : INFO : min_count leaves 4 word corpus (100% of original 4)
2015-12-19 03:36:41,977 : INFO : deleting the raw counts dictionary of 3 items
2015-12-19 03:36:41,977 : INFO : sample=0 downsamples 0 most-common words
2015-12-19 03:36:41,977 : INFO : downsampling leaves estimated 4 word corpus (100.0% of prior 4)
2015-12-19 03:36:41,977 : INFO : estimated required memory for 3 words and 100 dimensions: 4500 bytes
2015-12-19 03:36:41,978 : INFO : constructing a huffman tree from 3 words
2015-12-19 03:36:41,978 : INFO : built huffman tree with maximum node depth 2
2015-12-19 03:36:41,978 : INFO : resetting layer weights
Traceback (most recent call last):
File “/usr/lib/python2.7/dist-packages/IPython/core/interactiveshell.py”, line 2538, in run_code
exec code_obj in self.user_global_ns, self.user_ns
File “”, line 1, in
model = gensim.models.Word2Vec(sentences, min_count=1)
File “/home/sahisnu/.local/lib/python2.7/site-packages/gensim/models/word2vec.py”, line 431, in __init__
self.build_vocab(sentences, trim_rule=trim_rule)
File “/home/sahisnu/.local/lib/python2.7/site-packages/gensim/models/word2vec.py”, line 497, in build_vocab
self.finalize_vocab() # build tables & arrays
File “/home/sahisnu/.local/lib/python2.7/site-packages/gensim/models/word2vec.py”, line 627, in finalize_vocab
self.reset_weights()
File “/home/sahisnu/.local/lib/python2.7/site-packages/gensim/models/word2vec.py”, line 958, in reset_weights
self.syn0[i] = self.seeded_vector(self.index2word[i] + str(self.seed))
File “/home/sahisnu/.local/lib/python2.7/site-packages/gensim/models/word2vec.py”, line 970, in seeded_vector
once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
File “mtrand.pyx”, line 561, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:4716)
File “mtrand.pyx”, line 597, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:4941)
ValueError: object of too small depth for desired array
=================================================================
I searched a lot on the web but couldn’t figure the problem out. I also tried to execute it with my own corpus and got the same error. Am I missing something required for the successful execution of the code? Can you please tell me how I can eliminate the error?
Please help!
With Thanks,
Sahisnu
Solved the problem!
Actually, I was using older versions of gensim, numpy and scipy….
Upgraded them all and the problem got solved.
Hi SAHISNU,
recently I have had the same problem as you, and I’m using the latest version of gensim. I have searched a lot on the web but couldn’t figure it out. Can you please help me? Please contact me at my e-mail: [email protected]
Thank you very much!
lilingao
Hi Radim,
Thanks for your great work. I downloaded your newest package and read through the word2vec code. I saw that in lines 255-260, you simultaneously update model.syn1neg[word_indices] or l2b (using l1) and l1 (using l2b). I think it is necessary to add deepcopy in line 255!!! Is it correct???
Thank you,
Best regards,
Tin
Could you help me with this problem ?
I use a simple python function load_bin_vec shown as follows to load the Google pretrained .bin model. But I find that the outputs are different from the results using the load_word2vec_format function in gensim.models.Word2Vec.
For example:
for the word ‘woman’,
the vectors loaded by load_bin_vec function return:
[ 2.43164062e-01 -7.71484375e-02 -1.03027344e-01 -1.07421875e-01 …]
while the vectors loaded by load_word2vec_format function return:
[ 9.15656984e-02 -2.90509649e-02 -3.87959108e-02 …]
def load_bin_vec(fname):
“””
Loads 300×1 word vecs from Google (Mikolov) word2vec
“””
word_vecs = {}
with open(fname, “rb”) as f:
header = f.readline()
vocab_size, layer1_size = map(int, header.split())
binary_len = np.dtype(‘float32’).itemsize * layer1_size
for line in xrange(vocab_size):
word = []
while True:
ch = f.read(1)
if ch == ‘ ‘:
word = ”.join(word)
break
if ch != ‘n’:
word.append(ch)
word_vecs[word] = np.fromstring(f.read(binary_len), dtype=’float32′)
return word_vecs
Author
Hello Xiaoshan Yang,
the best medium for gensim support is its mailing list:
http://radimrehurek.com/gensim/support.html
thanks for your replay
Thanks for the tutorial. I’m very new to word2vec and so greatly appreciate help here. Do I have to remove stopwords from my input text? I ask because I see words like ‘of’, ‘when’ when I do model.most_similar(‘someword’).
But from what I’ve read, I haven’t seen stopword removal being mentioned for word2vec?
Author
Hello sam,
you can remove stopwords (or any other words) either as part of your sentences (MySentences in the code above).
Or, keep sentences as-is and add a post-filtering step over them:
>>> model = gensim.models.Word2Vec()
>>> stopword_set = set(…)  # your set of words you don't want
>>> discard_stopwords = lambda: ((word for word in sentence if word not in stopword_set) for sentence in sentences)
>>> model.build_vocab(discard_stopwords())
>>> model.train(discard_stopwords())
In general, questions like this are best posted to the gensim mailing list, so others can benefit from the discussion.
Best,
Radim
Hello Radim,
Thanks for the quick response. I actually have made a post there (https://groups.google.com/forum/#!topic/gensim/nJX3PmLZAws) but didn’t get any replies.
Just one more thing, is it technically wrong to apply model.most_similar(‘Apple’,topn=20) on a small text file, say of 2000 tweets?
I want to know the internal working of word2vec, with an example: what do we have to pass as input (a one-hot representation of a word, or the word frequency), what do we get as output, and how?
Author
Hello Rachana,
I didn’t really understand your question, but the best place to ask is the gensim mailing list:
https://groups.google.com/d/forum/gensim
Best,
Radim
Thanks for the reply. My question is very simple: how does word2vec work internally as a neural network, i.e. what do we pass as input and what do we get as output for further deep learning processing, like this https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors? And what is a good example of Word2vec?
Hi, I am getting the following error when I try to load a saved model. Could you please let me know how to fix it?
IOError: [Errno 2] No such file or directory: ‘wiki_trained.model.syn1neg.npy’
I am getting the same error on my Notebook (Ubuntu 14.04). Have you found a solution?
Hello,
Is there a way to have two different values for min_count in the same model?
For example: I would like min_count=3, but I also need some representation for words which occur only once in the sentences. Is this possible within a single model?
Thanks,
Shreya
Author
Hello Shreya,
yes, there is. Have a look at the `trim_rule` parameter of Word2Vec.
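For illustration, a sketch of a trim_rule that always keeps a small whitelist of words and otherwise defers to the normal min_count logic (the whitelist here is hypothetical):

from gensim import utils

must_keep = set(['rare_but_important_word'])

def my_trim_rule(word, count, min_count):
    # keep whitelisted words regardless of frequency; default behaviour otherwise
    return utils.RULE_KEEP if word in must_keep else utils.RULE_DEFAULT

model = gensim.models.Word2Vec(sentences, min_count=3, trim_rule=my_trim_rule)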
For further questions, please use the gensim mailing list.
Hello,
I have a few questions. Where do you find such huge data? Is there any pre-processing to be done on textual data before training word2vec?
Author
The 100-billion-word GoogleNews corpus was actually prepared by Google themselves. I don’t think it’s public.
Preprocessing: depends on the app. We usually do careful preprocessing; Google did almost none for the GoogleNews model 🙂 You’ll see it in the word suggestions in the “Bonus App” above. There are lots of words like “###”, rubbish characters, typos, uppercase variants etc.
Thank you for replying, sir. So can you suggest where I can get huge textual data that is publicly available?
Author
Wikipedia is one example. You can find a script for automatically preprocessing Wikipedia into plain text here: https://github.com/piskvorky/sim-shootout
Then there’s various web crawls etc. Depends what “huge” means for you… if 20newsgroups does it for you, that’s great.
Can you recommend a small dataset which yields good word2vec results?
To add to the above comment, I found data here: http://qwone.com/~jason/20Newsgroups/. Each file contains some metadata and data: fields like from, subject, organization etc. Am I supposed to remove such entities and just keep the raw text data?
Hi Radim,
I would like to understand the right way to resume a Word2Vec model and continue the training process. There was no issue that I could save and load the model, and continue training. But, I couldn’t keep the old vocabs built. I used the simple script to test and got confused. Could you help?
When I did
===
some_sentences = [['first', 'sentence'], ['second', 'sentence']]
model = Word2Vec(min_count=1)
model.build_vocab(some_sentences)
model.train(some_sentences)
print model.similarity('first', 'second')  # no problem

other_sentences = [['third', 'sentence'], ['fourth', 'sentence']]
model.build_vocab(other_sentences)
model.train(other_sentences)
print model.similarity('third', 'fourth')  # no problem
print model.similarity('first', 'second')  # the vocabs in some_sentences are no longer available???
===
It complained
-0.0450417522552
0.00356975799328
Traceback (most recent call last):
File “test.py”, line 16, in
print model.similarity(‘first’,’second’)
File “/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py”, line 1233, in similarity
return dot(matutils.unitvec(self[w1]), matutils.unitvec(self[w2]))
File “/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py”, line 1213, in __getitem__
return self.syn0[self.vocab[words].index]
KeyError: ‘first’
===
Furthermore,
model = Word2Vec() # an empty model, no training
model.build_vocab(some_sentences)
model.train(other_sentences)
What is this for? After the execution, I can find the similarity between vocabs in some_sentences, but no similarity between vocabs in other_sentences. Vocabs in some_sentences were built but not trained; vocabs in other_sentences were trained but not built. Do you have a use case to explain the relationship to me? In general, we should have all vocabs built and trained first so that we are able to get the relation between any of them, right? After loading, the old vocab tree should be there, and should allow adding new vocabs to the existing tree and training them. Is my understanding correct?
Best,
Henry
Author
Hello Henry,
the best place for such questions is the mailing list: http://radimrehurek.com/gensim/support.html
In this case, the problem comes from the fact the vocabulary scan is only done once. You can continue training on new sentences, but cannot add any new vocabulary. There is an ongoing work (pull request) to allow dynamic training including new vocabulary in gensim, but it’s not finished yet.
If you have any follow up questions, please use the mailing list.
Many thanks for the clarification, Radim.
I also read the post at
http://rutumulkar.com/blog/2015/word2vec/
last Friday. I thought what I needed could be done by
model.build_vocab(some_sentences, keep_raw_vocab=True)
after model loading.
Best,
Henry
I have made my model and saved it. I can see the model in my folder, but the content is gibberish. Also, when I try to run a query like model.similarity('iphone', 'battery'), I get a “KeyError: iphone” error.
How am I supposed to query it?
Give me some way to input my entire txt file instead of individual sentences. My file is '../script/join.txt'.
Can anyone tell me how to find out which words are in the vocab?
How are the words in the vocab stored? Is the vocab built from the description we give as input?
Kindly reply.
Author
Hello Atul,
such questions are best answered by the gensim commmunity on our mailing list:
http://radimrehurek.com/gensim/support.html
Best,
Radim
Online training / Resuming training
It cannot add new words to the model by default; you can use
new_model.build_vocab(new_tl, update=True), setting ‘update=True’ to add the new words to your model.
ref:
http://www.muzhen.tk/2017/06/21/machine%20learning/NLP/gensim%20w2v/