Doc2vec tutorial

Radim Řehůřek 2014-12-15 gensim, programming 47 Comments

The latest gensim release, 0.10.3, has a new class named Doc2Vec. All credit for this class, which is an implementation of Quoc Le & Tomáš Mikolov: “Distributed Representations of Sentences and Documents”, as well as for this tutorial, goes to the illustrious Tim Emerick.

Doc2vec (aka paragraph2vec, aka sentence embeddings) modifies the word2vec algorithm for unsupervised learning of continuous representations for larger blocks of text, such as sentences, paragraphs or entire documents.

IMPORTANT NOTE: the doc2vec functionality received a major facelift in gensim 0.12.0. The API is now cleaner, training is faster, more tuning parameters are exposed, etc. While the basic ideas explained below still apply, see this IPython notebook for a more up-to-date tutorial on using doc2vec. For a commercial document similarity engine, see our scaletext.com.

Continuing in Tim’s own words:

Input

Since the Doc2Vec class extends gensim’s original Word2Vec class, many of the usage patterns are similar. You can easily adjust the dimension of the representation, the size of the sliding window, the number of workers, or almost any other parameter that you can change with the Word2Vec model.

The one exception to this rule is the set of parameters relating to the training method used by the model. In the word2vec architecture, the two algorithms are “continuous bag of words” (cbow) and “skip-gram” (sg); in the doc2vec architecture, the corresponding algorithms are “distributed memory” (dm) and “distributed bag of words” (dbow). Since the distributed memory model performed noticeably better in the paper, that algorithm is the default when running Doc2Vec. You can still force the dbow model if you wish, by passing the dm=0 flag to the constructor.
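
To make the difference between the two training modes concrete, here is a minimal, gensim-free sketch of the (input, target) pairs each mode would learn from for a single labeled sentence. It is only an illustration based on the description in the paper: the function names dm_examples and dbow_examples are made up for this example, the symmetric context window is an assumption borrowed from word2vec, and the real implementation differs in detail.

```python
def dm_examples(words, label, window=2):
    """Distributed memory: the label joins the context words to predict the centre word."""
    examples = []
    for i, target in enumerate(words):
        start = max(0, i - window)
        # Context = up to `window` words on each side of the target (assumed symmetric).
        context = words[start:i] + words[i + 1:i + 1 + window]
        examples.append(([label] + context, target))
    return examples


def dbow_examples(words, label):
    """Distributed bag of words: the label alone predicts each word of the sentence."""
    return [(label, target) for target in words]


words = ["some", "words", "here"]
dm = dm_examples(words, "SENT_1", window=1)
dbow = dbow_examples(words, "SENT_1")
print(dm[1])    # (['SENT_1', 'some', 'here'], 'words')
print(dbow[1])  # ('SENT_1', 'words')
```

In both modes the label takes part in every training example drawn from its sentence, which is why a vector is learned for it alongside the word vectors.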

The input to Doc2Vec is an iterator of LabeledSentence objects. Each such object represents a single sentence, and consists of two simple lists: a list of words and a list of labels:
sentence = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
The algorithm then runs through the sentences iterator twice: once to build the vocab, and once to train the model on the input data, learning a vector representation for each word and for each label in the dataset.

Although this architecture permits more than one label per sentence (and I myself have used it this way), I suspect the most popular use case would be to have a single label per sentence which is the unique identifier for the sentence. One could implement this kind of use case for a file with one sentence per line by using the following class as training data:
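from gensim.models.doc2vec import LabeledSentence

class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        # re-open the file on every pass, so the corpus can be iterated more than once
        for uid, line in enumerate(open(self.filename)):
            yield LabeledSentence(words=line.split(), labels=['SENT_%s' % uid])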
A more robust version of the LabeledLineSentence class above is also included in the doc2vec module, so you can use that. Read the doc2vec API docs for all constructor parameters.
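
Because the algorithm makes two passes over the input, the sentences object has to be restartable: a class with an __iter__ method works, whereas a plain generator would be exhausted after the vocabulary pass. The following sketch demonstrates this without gensim. The namedtuple is a hypothetical stand-in for gensim's LabeledSentence, and the two-line temporary file is made up for the demonstration.

```python
import collections
import os
import tempfile

# Hypothetical stand-in for gensim's LabeledSentence, so the sketch runs without gensim.
LabeledSentence = collections.namedtuple("LabeledSentence", ["words", "labels"])


class LabeledLineSentence(object):
    """Restartable iterable: every call to __iter__ re-opens the file."""

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename) as handle:
            for uid, line in enumerate(handle):
                yield LabeledSentence(words=line.split(), labels=["SENT_%s" % uid])


# Write a tiny two-line corpus to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("some words here\nmore words there\n")
    corpus_path = tmp.name

sentences = LabeledLineSentence(corpus_path)
first_pass = list(sentences)   # stands in for the vocabulary-building pass
second_pass = list(sentences)  # stands in for the training pass
os.remove(corpus_path)

print(first_pass[1].labels)       # ['SENT_1']
print(first_pass == second_pass)  # True
```

Both passes see the same sentences with the same labels, which is what the vocabulary and training steps rely on.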

Training

Doc2Vec learns representations for words and labels simultaneously. If you wish to only learn representations for words, you can use the flag train_lbls=False in your Doc2Vec class. Similarly, if you only wish to learn representations for labels and leave the word representations fixed, the model also has the flag train_words=False.

One caveat of the way this algorithm runs is that, since the learning rate decreases over the course of iterating over the data, a label which is only seen in a single LabeledSentence during training is effectively trained at one fixed learning rate: whatever the rate happens to be when that sentence is processed. This frequently produces less than optimal results. I have obtained better results by iterating over the data several times and either

randomizing the order of input sentences, or

manually controlling the learning rate over the course of several iterations.

For example, if one wanted to manually control the learning rate over the course of 10 epochs, one could use the following:
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(alpha=0.025, min_alpha=0.025)  # use fixed learning rate
model.build_vocab(sentences)
for epoch in range(10):
    model.train(sentences)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
The code runs on optimized C (via Cython), just like the original word2vec, so it’s fairly fast.
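
The other option mentioned above, randomizing the order of the input sentences, can be combined with the manual learning rate schedule. The sketch below only produces the per-epoch learning rate and shuffled order, with no gensim involved. The name epoch_schedule and its parameters are hypothetical, and the comment in the loop marks where a real call to model.train() would go.

```python
import random


def epoch_schedule(sentences, epochs=10, alpha=0.025, step=0.002, seed=0):
    """Yield (epoch, alpha, shuffled_sentences) for each pass over the data."""
    rng = random.Random(seed)   # seeded only to make the sketch reproducible
    order = list(sentences)
    for epoch in range(epochs):
        rng.shuffle(order)      # randomize the input order for this pass
        yield epoch, alpha, list(order)
        alpha -= step           # lower the rate for the next pass


toy_sentences = ["SENT_0", "SENT_1", "SENT_2", "SENT_3"]
alphas = []
for epoch, alpha, shuffled in epoch_schedule(toy_sentences):
    # A real run would set model.alpha = model.min_alpha = alpha
    # and then call model.train(shuffled) here.
    alphas.append(round(alpha, 3))

print(alphas[0], alphas[-1])  # 0.025 0.007
```

With a step of 0.002, the ten passes use rates from 0.025 down to 0.007, the same values the loop above feeds to the model.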

Note from Radim: I wanted to include the obligatory run-on-English-Wikipedia-example at this point, with some timings and code. But I couldn’t get reasonable results out of Doc2Vec, and didn’t want to delay publishing Tim’s write up any longer while I experiment. Despair not; the caravan goes on, and we’re working on a more scalable version of doc2vec, one which doesn’t require a vector in RAM for each document, and with a simpler API for inference on new documents. Ping me if you want to help.

Memory Usage

With the current implementation, all label vectors are stored separately in RAM. In the case above with a unique label per sentence, this causes memory usage to grow linearly with the size of the corpus, which may or may not be a problem depending on the size of your corpus and the amount of RAM available on your box. For example, I’ve successfully run this over a collection of over 2 million sentences with no problems whatsoever; however, when I tried to run it on 20x that much data my box ran out of RAM since it needed to create a new vector for each sentence.
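
A back-of-the-envelope estimate shows why this happens. The sketch assumes 100-dimensional vectors stored as 4-byte floats; both numbers are assumptions for illustration, not measurements. It counts only the label vectors themselves, ignoring word vectors, the model's other weights and any Python overhead.

```python
def label_vector_bytes(num_labels, vector_size=100, bytes_per_float=4):
    """Bytes needed to hold one dense vector per label."""
    return num_labels * vector_size * bytes_per_float


small = label_vector_bytes(2 * 10**6)    # roughly the 2 million sentence corpus
large = label_vector_bytes(40 * 10**6)   # 20x as many sentences

print(small / 1e9)  # 0.8  (GB)
print(large / 1e9)  # 16.0 (GB)
```

Under these assumptions the label vectors alone grow from under 1 GB to 16 GB, before counting anything else the model keeps in memory.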

I/O

The usage for Doc2Vec is the same as for gensim’s Word2Vec. One can save and load gensim Doc2Vec instances in the usual ways: directly with Python’s pickle, or using the optimized Doc2Vec.save() and Doc2Vec.load() methods:
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(sentences)
...
# store the model to mmap-able files
model.save('/tmp/my_model.doc2vec')
# load the model back
model_loaded = Doc2Vec.load('/tmp/my_model.doc2vec')
Helper functions like model.most_similar(), model.doesnt_match() and model.similarity() also exist. The raw word and label vectors are also accessible either individually via model['word'], or all at once via model.syn0.
See the docs.

The main point is, labels act in the same way as words in Doc2Vec. So, to get the most similar words/sentences to the first sentence (label SENT_0, for example), you’d do:
print model.most_similar("SENT_0")
[('SENT_48859', 0.2516525387763977),
 (u'paradox', 0.24025458097457886),
 (u'methodically', 0.2379375547170639),
 (u'tongued', 0.22196565568447113),
 (u'cosmetics', 0.21332012116909027),
 (u'Loos', 0.2114654779434204),
 (u'backstory', 0.2113303393125534),
 ('SENT_60862', 0.21070502698421478),
 (u'gobble', 0.20925869047641754),
 ('SENT_73365', 0.20847654342651367)]
or to get the raw embedding for that sentence as a NumPy vector:
print model["SENT_0"]
etc. More functionality coming soon!
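
The mixed list of labels and words returned by most_similar() above follows from labels and words sharing one vector table. As a rough, gensim-free illustration of such a ranking, the sketch below scores every other entry by cosine similarity to the query. The toy 2-d vectors are invented for the example, and the functions are simplified stand-ins, not gensim's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, key, topn=10):
    """Rank every other key by cosine similarity to `key`, best first."""
    query = vectors[key]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != key]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 2-d vectors; sentence labels and words live in the same table.
toy_vectors = {
    "SENT_0": [1.0, 0.0],
    "SENT_1": [0.0, 1.0],
    "paradox": [1.0, 1.0],
    "gobble": [-1.0, 0.0],
}
ranking = most_similar(toy_vectors, "SENT_0")
print([name for name, _ in ranking])  # ['paradox', 'SENT_1', 'gobble']
```

To get only similar sentences, one could filter the ranking down to keys that start with the label prefix, such as SENT_.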

If you liked this article, you may also enjoy the Optimizing word2vec series and the Word2vec tutorial.

Comments 47

Simon Smith
2014-12-16 at 4:55 pm

Tim, Radim,

Thank you for this – it looks great. I’m a big fan of Gensim and it is now even better.

Simon

1. Radim (post author)
  2015-03-06 at 10:28 am
  
  Thanks Simon, I appreciate it.
  
Wen
2014-12-30 at 8:25 am

Hi, thanks for your awesome tutorial. What should I do if I want to input new test sentences to find similar sentences after training?

Thank you

1. Samuel Rönnqvist
  2015-01-28 at 3:44 pm
  
  You should be able to add new sentences to an existing model using train(), and then run most_similar(['SENT_XX']), where SENT_XX is the label of one of your new sentences.
  
  1. Yikang
    2015-03-06 at 9:18 am
    
    Thanks for your awesome work. But I find that train() doesn’t add new sentences to an existing model. I get a KeyError when I try. Maybe train() just updates the weights?
    
    1. Ólavur Mortensen
      2015-04-02 at 2:24 pm
      
      I would very much like to know this too. I can’t figure out how to train on more data in word2vec either.
      
dhruv shah
2015-01-15 at 9:44 am

Hi Radim,

I just had a quick question.
The doc2vec is an unsupervised algorithm. So why do we have to provide labels when we are training the model?

1. Bach
  2015-01-19 at 8:11 am
  
  Hi, I think that the labels here are not the “y”, they are like tags attached to each sentence, so you can access the vector that represents the sentence.
  
  1. dhruv shah
    2015-01-20 at 10:22 am
    
    Does this mean that every single document label is unique?
    
Bach
2015-01-19 at 8:09 am

Hi,

How can I get the vector representing a paragraph? My understanding is that I can access the vectors representing sentences via their labels. So how do I label a paragraph that contains a few sentences?

Thanks.

1. Denis
  2015-01-22 at 11:18 pm
  
  I guess you should use the same label for all sentences in your paragraph
  
  1. Bach
    2015-01-27 at 11:18 am
    
    Hi,
    So will each sentence in a paragraph have one label of its own and one label for the paragraph? E.g., for sentence i of paragraph 1:
    sentence.labels = [u'SENT_i', u'PARA_1']
    
    1. Samuel Rönnqvist
      2015-01-28 at 3:54 pm
      
      You could also feed entire paragraphs (or arbitrarily long sequences of tokens) together with a PARA_X label, to reduce memory usage (i.e., number of unique labels), unless you’re interested in separating the semantics of paragraphs and individual sentences.
      
Jeff
2015-01-26 at 9:57 pm

Hi Radim,

First, thank you for this tutorial. I followed the steps in it, but it didn’t work: after training, I get a KeyError from
print model['SENT_0']

I checked all of the keys and couldn’t find any labels. Then I checked the source code and found:
self.train_words = train_words
self.train_lbls = train_lbls
if sentences is not None:
    self.build_vocab(sentences)
    self.train(sentences)
The Doc2Vec class just calls the build_vocab() method inherited from the Word2Vec class, so I wonder how this can generate a key list that includes 'SENT_0', 'SENT_1', ...

Hoping for your reply.
Thanks

1. Samuel Rönnqvist
  2015-01-28 at 4:12 pm
  
  If training was successful, you should be able to access the sentence vector by label like that. Here’s a minimal example:
  
  sentence = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
  model = Doc2Vec([sentence], min_count=0)
  model['SENT_1']
  
  1. Mark
    2015-09-02 at 12:16 pm
    
    That example code doesn’t work for me:
    
    >>> sentence = gensim.models.doc2vec.LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: __new__() got an unexpected keyword argument 'labels'
    
    Step one doesn’t work. But it appears that LabeledSentence now wants ‘tags’ instead of labels, so I’ll update it.
    
    >>> sentence = gensim.models.doc2vec.LabeledSentence(words=[u'some', u'words', u'here'], tags=[u'SENT_1'])
    >>> model = gensim.models.Doc2Vec([sentence], min_count=0)
    >>> model['SENT_1']
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Users\u772700\AppData\Local\Continuum\Anaconda (x86)\lib\site-packages\gensim\models\word2vec.py", line 1204, in __getitem__
        return self.syn0[self.vocab[words].index]
    KeyError: 'SENT_1'
    
    I can build the model, BUT the sentence tag doesn’t exist in the vocabulary.
    
    1. Radim (post author)
      2015-09-02 at 12:26 pm
      
      Mark, did you read the IMPORTANT NOTE above? Doc2vec API has changed considerably since this blog post.
      
    2. wolfgang
      2015-09-27 at 7:59 pm
      
      I believe you need to use model.docvecs['SENT_1']; the API has changed, as Radim mentioned.
      
Zach
2015-01-29 at 8:24 pm

I have a model trained with Doc2Vec. (It’s very cool!)

Is there an easy way to extract JUST the vectors for all of the sentences?

Basically, I want to save a matrix of all the original documents and their corresponding vectors.

1. Claudio
  2015-02-19 at 6:12 am
  
  I think you only have to read the labels which start with “SENT_”, because those are the labels which hold the vector representations. However, I’m experimenting with Doc2Vec, and after a successful training run my model has some missing sentence labels. Did you solve it?
  
  1. Nate
    2015-02-23 at 6:28 pm
    
    Let me start off by saying I really like Doc2Vec and I appreciate the efforts that have gone into making it!
    
    After successfully training a Doc2Vec model, many of my sentence labels are missing in the trained model and many new labels have appeared representing individual words (presumably extracted from some of the sentences, themselves). Maybe I’m not correctly understanding the purpose of the labels, but I thought that perhaps they represented the “hooks” for extracting the vector representations for each sentence I trained on?
    
    1. Claudio
      2015-02-26 at 11:57 pm
      
      Nate, how are you training your model? I fixed the missing sentence labels by setting the parameter min_count=1. According to the documentation, “min_count = ignore all words with total frequency lower than this.” However, the problem with this solution is that we keep the noise in the dataset (very specific words).
      
Paul F
2015-02-01 at 11:47 pm

Searching for golden needles in unlabeled email haystacks... I was astounded to see in Table 2 of the Le and Mikolov article that LDA demonstrates a 32.58% error rate, compared to only 7.42% for the authors’ Paragraph Vector model. I am searching an unlabeled corpus of 10+ million emails to determine what exists, by topic. We ran our initial test of LDA (single core) a few days ago on the Enron email test file but have not had time to thoroughly analyze the results. This difference in the error rate means I likely must add Doc2Vec to the process. Under LDA our goal was a very homogeneous group of clusters, so that we could quickly eliminate the NOT RELEVANT clusters and the documents they include from further processing, based upon a manual review of the 5-10 top-ranked topics in the 10-20 top-ranked documents in each cluster (recognizing that there could be MANY thousands of homogeneous clusters, we are thinking of ways that the machine can determine 80%+ of the NOT RELEVANT clusters based on rules yet to be determined). The NOT RELEVANT clusters and their documents become a training set, as do specific paragraphs and sentences of the POSSIBLY RELEVANT clusters, when we move to supervised learning using LSA. LDA gets me the top-ranked topics of the top-ranked documents in each cluster but apparently has limitations when using distributed computing. LSA is apparently more effective in distributed computing and is a comfortable, known process for training with manually determined LSA training sets.
Doc2Vec apparently brings sentence, paragraph and possibly even document labeling, still with the benefit of sentiment analysis, at an apparently vastly increased processing time. Since my 12-core machine won’t cut it on Doc2Vec, I need to acquire a considerable number of cores and build a LAN. What suggestions can anyone offer concerning the least cost per core (I am at about US$45 each) and how to determine the amount of RAM required per worker core for a good trade-off between processing speed and core cost? Will each worker need to have identical processors and RAM? Since they are plentiful, I am contemplating Dell Precision T7500s with dual 6-core processors and RAM based upon your comments. Please confirm that under all three models only actual cores, and not hyperthreaded fake cores, are usable?

Thanks folks, Your efforts are appreciated.
Paul F.

Alex
2015-02-04 at 6:35 pm

I am interested in the following application:
Can I use an existing word2vec C-format binary file (e.g. GoogleNews-vectors-negative300.bin) to help train paragraph vectors? I assume I should follow these steps:
Step 1:
model = Doc2Vector.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
Step 2:
model.train(sentences)

But when I do it this way, errors pop up like
  File "", line 1, in
  File "/homes/xx302/lib/python2.7/site-packages/gensim-0.10.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 466, in train
    total_words = total_words or int(sum(v.count * v.sample_probability for v in itervalues(self.vocab)) * self.iter)
  File "/homes/xx302/lib/python2.7/site-packages/gensim-0.10.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 466, in
    total_words = total_words or int(sum(v.count * v.sample_probability for v in itervalues(self.vocab)) * self.iter)
AttributeError: 'Vocab' object has no attribute 'sample_probability'

So is it caused by using the binary word2vec format, in which some information is lost?

1. Manuel Reis
  2015-04-25 at 4:46 pm
  
  Hello Alex,
  
  I am having the same problem. How did you manage to solve it?
  
yuka
2015-02-28 at 10:19 am

Hello Radim, thank you for your tutorials, they are really interesting and enlightening. I am a rookie in Python and therefore have some issues with doc2vec. Here is one: I have a corpus (folder) with five classes (folders) inside, and each document is simply a txt file. In this case, how should I start training on all these documents? Thanks.

Huo
2015-03-09 at 1:37 pm

Hi,
In the class LabeledLineSentence, every sentence line gets a unique label, which is 'SENT_%s' % item_no. But what if two lines have duplicate contents? In your code, they will have two different labels. Is that reasonable? Thanks.

JohnDannl
2015-03-18 at 3:01 am

Hi Radim,
I really appreciate your good work! I have successfully completed the first step, training a model, but I have the same problem Yikang posted: “the train() method doesn’t add new sentences to an existing model” when I want to predict some new sentences. According to Mikolov’s paper, in “the inference stage” we can get paragraph vectors D for new paragraphs. Is there something I missed? Looking forward to your reply.

Huanliang Wang
2015-04-08 at 5:43 am

Hi, Radim:
I want to get the vector of an input sentence after training has succeeded, but I can’t find any method to get it. Could you tell me how to do this in gensim?

Huanliang Wang
2015-04-08 at 7:55 am

Hi, Radim:
After training succeeded, I find that some SENT_?? labels do not exist. For example,
model(['SENT_22'])
throws an error: KeyError: 'SENT_22'.
But both model(['SENT_21']) and model(['SENT_23']) work.

Is this a bug? How do I avoid this error?

Andrew Beam
2015-04-08 at 10:41 pm

Thanks for implementing this, I’ve had fun with the new module. Is there any way to assess convergence? Can we access the log-likelihood value somehow? I have no idea how many epochs I should be running this for or if my alpha size is reasonable.

Debasis Ganguly
2015-04-09 at 10:13 pm

I’m using doc2vec on one of the SEMEVAL ’14 datasets. I’ve got 755 lines (sentences) in a text file.
However, after executing the following code:

sentences = gensim.models.doc2vec.LabeledLineSentence('test.txt')
model = gensim.models.doc2vec.Doc2Vec(sentences, size=10, window=5, min_count=5, workers=4)
model.save_word2vec_format('svectors.txt')

when I wanted to check if all sentences have been stored in the vector file, to my surprise I found that some sentences are missing! More precisely,
grep -c SENT_ svectors.txt
gave me an output of 735.

Wondering what might be the cause of the 20 missing sentences?

1. Debasis Ganguly
  2015-04-09 at 11:14 pm
  
  Setting min_count to 2 fixes this problem. It was ignoring sentences where one of the constituent words had a frequency lower than min_count.
  
Erick
2015-04-15 at 6:04 am

I’m not sure I understood the point in having two labels in a sentence. Does it mean that sentence can be used to train the vectors of two paragraphs?

hj
2015-04-16 at 5:40 am

hello.

Thank you so much for this helpful tutorial.
I have a question about the speed.
I have a 366MB text file and wanted to create the doc2vec model. However, it seems it stopped since the log stopped at 30.23% for the last ten hours. So I was wondering, what is the max size doc2vec code can handle? Also, any suggestions on how I can increase speed and size limit?

Thanks again, have nice day:)

Silvia
2015-04-17 at 10:23 am

I am using the Doc2vec class with a corpus containing very short sentences, some with only 1 word. I observed that for many sentences, especially the short ones, Doc2vec does not provide any representation. Could you explain why? And can I solve this?

Thanks in advance!

R
2015-05-05 at 12:09 am

Hi Radim

Great port. Have you an example on how to use Doc2vec for Information Retrieval or any advice on the subject.

PA
2015-05-12 at 3:10 am

Hi Radim,

Great software. I have a question. Suppose I have created a model and wish to find similarities with new sentences. I create new labels for each new sentence. I can do the model.train, but it does not update the vocabulary from the loaded model to include the words in the new sentences. If I do

model.load("orig.model") # this has 100000 labels

sentences = []
currentlines = 100000
for line in lines:
    currentlines += 1
    label = 'SENT_' + str(currentlines)
    sentence = models.doc2vec.LabeledSentence(unicodewords, labels=[unicode(label)])
    sentences.append(sentence)

model.train(sentences)

print model.most_similar("SENT_109900")

Since this label is not in the original model, it complains with

KeyError: "word 'SENT_109900' not in vocabulary"

The mechanism updates the model with train but does not allow me to update the vocabulary. How is this possible?

1. Radim
  2015-06-20 at 10:52 pm
  
  Exactly right — the model doesn’t allow adding new vocabulary (only updating existing one).
  
  But I have good news for you! One of gensim users is creating a new pull request on github, which will allow adding new words too. This way, the model should become fully online.
  
sky
2015-07-16 at 6:34 pm

Thanks! This tutorial seems to be a bit outdated, though: a significant update has been merged (https://github.com/piskvorky/gensim/pull/356) and the LabeledSentence class has been fully replaced by TaggedDocument. This leads to major changes in how input sentences are created.

Rob
2015-07-20 at 11:27 am

Does pull 356 mean that you no longer need each document vector in RAM?

Also, which pull request was going to make doc2vec online? Is it the same one?

Thanks!! These implementations are amazing.

Pingback: How to train p(category|title) model with word2vec - codeengine
Sagar Arora
2015-08-12 at 8:02 am

Any updates on scalable version of doc2vec, the one which does not require all label vectors to be stored separately in RAM?

Looking forward to it

arandomuser
2015-09-15 at 1:47 am

Sorry, but why have you published a “””tutorial””” that (a) is out of date, (b) is hard to follow, (c) is full of errors, AND (d) you do not help people AT ALL? Someone posted a link to GitHub with a “new” tutorial. It looks like your company is funny and cannot be taken seriously. I am looking forward to your response, if you have something to say.

1. Radim (post author)
  2015-09-15 at 3:59 am
  
  Yeah, we should probably update this tutorial with the new one, instead of just putting a disclaimer on top.
  
  What errors did you find here?
  
Pingback: Python:How to calculate the sentence similarity using word2vec model of gensim with python – IT Sprite
Ola Gustafsson
2015-10-31 at 2:37 pm

I’ve done some experimenting with doc2vec for recommendation purposes, looking for documents that share similar meaning, and the results seem very promising.

However, when training on a corpus and extracting vectors, these are not the same vectors as I get from the infer method, used on the same texts. I was expecting them to be identical, but was I right to do so? There are some alpha parameters both in training a model and at the infer stage. Are these or something else the reason behind different vectors?

