The latest gensim release of 0.10.3 has a new class named Doc2Vec. All credit for this class, which is an implementation of Quoc Le & Tomáš Mikolov: “Distributed Representations of Sentences and Documents”, as well as for this tutorial, goes to the illustrious Tim Emerick.
Doc2vec (aka paragraph2vec, aka sentence embeddings) modifies the word2vec algorithm for unsupervised learning of continuous representations of larger blocks of text, such as sentences, paragraphs or entire documents.
IMPORTANT NOTE: the doc2vec functionality received a major facelift in gensim 0.12.0. The API is now cleaner, training is faster, and more tuning parameters are exposed. While the basic ideas explained below still apply, see this IPython notebook for a more up-to-date tutorial on using doc2vec. For a commercial document similarity engine, see our scaletext.com.
Continuing in Tim’s own words:
Since the Doc2Vec class extends gensim’s original Word2Vec class, many of the usage patterns are similar. You can easily adjust the dimension of the representation, the size of the sliding window, the number of workers, or almost any other parameter that you can change with the Word2Vec model.
The one exception to this rule is the set of parameters relating to the training method used by the model. In the word2vec architecture, the two algorithm names are “continuous bag of words” (cbow) and “skip-gram” (sg); in the doc2vec architecture, the corresponding algorithms are “distributed memory” (dm) and “distributed bag of words” (dbow). Since the distributed memory model performed noticeably better in the paper, that algorithm is the default when running Doc2Vec. You can still force the dbow model if you wish, by using the dm=0 flag in the constructor.
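To make the difference between the two training methods concrete, here is a minimal sketch of which inputs predict which target word in each mode, following the description in the paper. The function names dm_examples and dbow_examples are hypothetical, and the sketch deliberately ignores details a real implementation has to deal with (how the inputs are combined, window shrinking, the output layer), so treat it as an illustration rather than gensim's actual code.

```python
def dm_examples(words, label, window=2):
    """Distributed memory: the label plus the surrounding words predict the target."""
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        yield [label] + left + right, target


def dbow_examples(words, label):
    """Distributed bag of words: the label alone predicts each word of the sentence."""
    for target in words:
        yield [label], target


words = ["some", "words", "here"]
for inputs, target in dm_examples(words, "SENT_1", window=1):
    print("dm  ", inputs, "->", target)
for inputs, target in dbow_examples(words, "SENT_1"):
    print("dbow", inputs, "->", target)
```

In the dm case the label behaves like one extra context word that is shared by every window of the sentence; in the dbow case the word order and context are ignored entirely.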
The input to Doc2Vec is an iterator of LabeledSentence objects. Each such object represents a single sentence, and consists of two simple lists: a list of words and a list of labels:

    sentence = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1'])
The algorithm then runs through the sentences iterator twice: once to build the vocab, and once to train the model on the input data, learning a vector representation for each word and for each label in the dataset.
Although this architecture permits more than one label per sentence (and I myself have used it this way), I suspect the most popular use case would be to have a single label per sentence which is the unique identifier for the sentence. One could implement this kind of use case for a file with one sentence per line by using the following class as training data:

    from gensim.models.doc2vec import LabeledSentence

    class LabeledLineSentence(object):
        def __init__(self, filename):
            self.filename = filename

        def __iter__(self):
            # one sentence per line; the line number becomes the unique label
            for uid, line in enumerate(open(self.filename)):
                yield LabeledSentence(words=line.split(), labels=['SENT_%s' % uid])
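Because the algorithm runs through the sentences more than once, the training data has to be restartable: a plain generator would be exhausted after the vocabulary pass. The sketch below checks that a class with its own __iter__ can be walked twice. To keep it runnable without gensim, LabeledSentence is replaced by a namedtuple stand-in (an assumption made only for this illustration), and the temporary file and its contents are made up.

```python
import os
import tempfile
from collections import namedtuple

# Stand-in for gensim's LabeledSentence so the sketch runs without gensim.
LabeledSentence = namedtuple("LabeledSentence", ["words", "labels"])


class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated repeatedly.
        with open(self.filename) as handle:
            for uid, line in enumerate(handle):
                yield LabeledSentence(words=line.split(), labels=["SENT_%s" % uid])


# Write a tiny two-sentence corpus to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("some words here\nanother short sentence\n")

sentences = LabeledLineSentence(tmp.name)
first_pass = [s.labels[0] for s in sentences]   # stands in for the vocabulary pass
second_pass = [s.labels[0] for s in sentences]  # stands in for the training pass
os.remove(tmp.name)
print(first_pass, second_pass)
```

Both passes see the same labels in the same order, which is what lets a label learned during training be traced back to its line in the file.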
Doc2Vec learns representations for words and labels simultaneously. If you wish to only learn representations for words, you can use the flag train_lbls=False in your Doc2Vec class. Similarly, if you only wish to learn representations for labels and leave the word representations fixed, the model also has a flag for that.
One caveat of the way this algorithm runs is that, since the learning rate decreases over the course of iterating over the data, labels which are only seen in a single LabeledSentence during training will only be trained with a fixed learning rate. This frequently produces less than optimal results. I have obtained better results by iterating over the data several times and either
- randomizing the order of input sentences, or
- manually controlling the learning rate over the course of several iterations.
For example, if one wanted to manually control the learning rate over the course of 10 epochs, one could use the following:

    model = Doc2Vec(alpha=0.025, min_alpha=0.025)  # use fixed learning rate
    model.build_vocab(sentences)
    for epoch in range(10):
        model.train(sentences)
        model.alpha -= 0.002  # decrease the learning rate
        model.min_alpha = model.alpha  # fix the learning rate, no decay
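To see what such a schedule looks like without training anything, the following sketch lists the learning rate used in each epoch and also reshuffles the sentence order per epoch. It combines both remedies purely for illustration; the helper name epoch_plan and the toy labels are hypothetical. In a real run, each epoch would set model.alpha and model.min_alpha to the yielded value and pass the shuffled list to model.train().

```python
import random


def epoch_plan(sentences, epochs=10, start_alpha=0.025, step=0.002, seed=0):
    """Yield (alpha, shuffled_sentences) for each epoch."""
    rng = random.Random(seed)
    for epoch in range(epochs):
        order = list(sentences)
        rng.shuffle(order)
        # Deriving alpha from the start value avoids accumulating float error.
        yield round(start_alpha - epoch * step, 6), order


toy = ["SENT_%d" % i for i in range(5)]
plan = list(epoch_plan(toy))
alphas = [alpha for alpha, _ in plan]
print(alphas)
```

With these numbers the rate falls from 0.025 in the first epoch to 0.007 in the tenth, so every label, including one that appears in only a single sentence, is trained at ten different learning rates.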
The code runs on optimized C (via Cython), just like the original word2vec, so it’s fairly fast.
Note from Radim: I wanted to include the obligatory run-on-English-Wikipedia-example at this point, with some timings and code. But I couldn’t get reasonable results out of Doc2Vec, and didn’t want to delay publishing Tim’s write up any longer while I experiment. Despair not; the caravan goes on, and we’re working on a more scalable version of doc2vec, one which doesn’t require a vector in RAM for each document, and with a simpler API for inference on new documents. Ping me if you want to help.
With the current implementation, all label vectors are stored separately in RAM. In the case above with a unique label per sentence, this causes memory usage to grow linearly with the size of the corpus, which may or may not be a problem depending on the size of your corpus and the amount of RAM available on your box. For example, I’ve successfully run this over a collection of over 2 million sentences with no problems whatsoever; however, when I tried to run it on 20x that much data my box ran out of RAM since it needed to create a new vector for each sentence.
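A back-of-the-envelope estimate shows why the larger corpus did not fit. The sketch below assumes 100-dimensional vectors stored as 4-byte floats; both values are assumptions chosen for illustration, not figures from the experiment described above. It counts only one dense matrix of label vectors, so it is a lower bound that leaves out the word vectors, any additional weight matrices and the vocabulary bookkeeping.

```python
def label_vector_bytes(num_labels, vector_size=100, bytes_per_float=4):
    """Lower bound on RAM for one dense matrix of label vectors."""
    return num_labels * vector_size * bytes_per_float


# 2 million sentences, and 20x that many
for count in (2000000, 40000000):
    gib = label_vector_bytes(count) / float(1024 ** 3)
    print("%d labels: about %.1f GiB" % (count, gib))
```

Under these assumptions, 2 million labels need well under 1 GiB, while 40 million need roughly 15 GiB for the label vectors alone.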
The usage for Doc2Vec is the same as for gensim’s Word2Vec. One can save and load gensim Doc2Vec instances in the usual ways: directly with Python’s pickle, or using the optimized save() and load() methods:

    model = Doc2Vec(sentences)
    ...
    # store the model to mmap-able files
    model.save('/tmp/my_model.doc2vec')
    # load the model back
    model_loaded = Doc2Vec.load('/tmp/my_model.doc2vec')
Helper functions like model.similarity() also exist. The raw word and label vectors are also accessible, either individually via model['word'] or all at once. See the docs.
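The similarity helpers are based on cosine similarity between the raw vectors. As a rough sketch of the idea (not gensim's implementation), the following pure-Python version ranks every other entry by its cosine similarity to a query key. The vectors are made up for illustration, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, key, topn=3):
    """Rank all other keys by cosine similarity to `key`, best first."""
    query = vectors[key]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != key]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors: labels and words live side by side in the same lookup.
vectors = {
    "SENT_0": [1.0, 0.0, 0.0],
    "SENT_1": [0.9, 0.1, 0.0],
    "paradox": [0.0, 1.0, 0.0],
    "gobble": [-1.0, 0.0, 0.0],
}
print(most_similar(vectors, "SENT_0"))
```

Because labels and words share one lookup, a query for a label can return a mix of other labels and ordinary words.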
The main point is, labels act in the same way as words in Doc2Vec. So, to get the most similar words/sentences to the first sentence (label SENT_0, for example), you’d do:

    print(model.most_similar("SENT_0"))
    [('SENT_48859', 0.2516525387763977),
     (u'paradox', 0.24025458097457886),
     (u'methodically', 0.2379375547170639),
     (u'tongued', 0.22196565568447113),
     (u'cosmetics', 0.21332012116909027),
     (u'Loos', 0.2114654779434204),
     (u'backstory', 0.2113303393125534),
     ('SENT_60862', 0.21070502698421478),
     (u'gobble', 0.20925869047641754),
     ('SENT_73365', 0.20847654342651367)]

or to get the raw embedding for that sentence as a NumPy vector:

    print(model["SENT_0"])
etc. More functionality coming soon!