Sent2Vec: An unsupervised approach towards learning sentence embeddings

Prerna Kashyap, Gensim Student Incubator

A comparison of sentence embedding techniques by Prerna Kashyap, our RARE Incubator student. As her graduation project, Prerna implemented sent2vec, a new document embedding model in Gensim, and compared it to existing models like doc2vec and fasttext.

What are sentence embeddings?

Many machine learning algorithms require the input to be represented as a fixed-length feature vector. Word embeddings are representations of words in an N-dimensional vector space, so that semantically similar words (e.g. “king” — “monarch”) or semantically related words (e.g. “bird” — “fly”) come closer, depending on the training method (using words as context or using documents as context). When it comes to texts, one of the most common fixed-length representations is bag-of-words. But this method discards a lot of information, such as word order and semantics. For example, two sentences can have identical representations but entirely different meanings: ‘I’m going to study Math instead of English’ and ‘I’m going to study English instead of Math’.
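The order-blindness of bag-of-words is easy to demonstrate directly. A minimal sketch using only the Python standard library (word counts stand in for a full vectorizer): the two example sentences get identical representations.

```python
from collections import Counter

def bag_of_words(sentence):
    """Map a sentence to an unordered multiset of its lowercased tokens."""
    return Counter(sentence.lower().split())

s1 = "I'm going to study Math instead of English"
s2 = "I'm going to study English instead of Math"

# Both sentences contain exactly the same words, so their
# bag-of-words representations are indistinguishable.
print(bag_of_words(s1) == bag_of_words(s2))  # True
```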

What is Sent2Vec?

In the paper Unsupervised Learning of Sentence Embeddings using Compositional N-Gram Features, a new model for sentence embeddings called Sent2Vec is introduced. Sent2Vec offers a simple but efficient unsupervised objective for training distributed representations of sentences. It can be thought of as an extension of FastText and word2vec (CBOW) to sentences. The sentence embedding is defined as the average of the source embeddings of the sentence's constituent words. The model is further augmented by learning source embeddings not only for unigrams but also for the word n-grams present in each sentence, and averaging the n-gram embeddings along with the word embeddings.
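At the level of the composition step, the sentence vector is just the mean of the learned vectors for all unigrams and word n-grams in the sentence. A minimal NumPy sketch (the toy lookup table below stands in for the learned source embeddings; the real model learns them with SGD):

```python
import numpy as np

def word_ngrams(tokens, n=2):
    """All contiguous word n-grams of orders 2..n (unigrams handled separately)."""
    grams = []
    for k in range(2, n + 1):
        grams += [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    return grams

def sent2vec_embed(tokens, emb, dim, n=2):
    """Sentence vector = mean of the source embeddings of all unigrams
    and word n-grams in the sentence; unknown units are skipped."""
    units = [(t,) for t in tokens] + word_ngrams(tokens, n)
    vecs = [emb[u] for u in units if u in emb]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Toy embedding table: these vectors are illustrative, not learned.
dim = 4
rng = np.random.default_rng(0)
emb = {u: rng.normal(size=dim) for u in
       [('the',), ('cat',), ('sat',), ('the', 'cat'), ('cat', 'sat')]}

v = sent2vec_embed(['the', 'cat', 'sat'], emb, dim)
print(v.shape)  # (4,)
```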

Sent2Vec seemed like an interesting model that could be incorporated in Gensim and so my task for the student incubator was to come up with a correct and efficient native implementation of Sent2Vec in Gensim.

How is Sent2Vec different from FastText?

Sent2Vec predicts target words from sequences of source words (word n-grams), whereas FastText predicts target words from sequences of characters (character n-grams).
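The contrast is easiest to see in the context units each model averages. A hedged sketch (the actual C++ implementations differ in details such as hashing buckets; the `<`/`>` boundary markers below follow FastText's convention):

```python
def char_ngrams(word, n=3):
    """FastText-style character n-grams of one word, with boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

def word_ngrams(tokens, n=2):
    """Sent2Vec-style contiguous word n-grams of a sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(char_ngrams("where"))                # ['<wh', 'whe', 'her', 'ere', 're>']
print(word_ngrams(["the", "cat", "sat"]))  # [('the', 'cat'), ('cat', 'sat')]
```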

Native implementation of Sent2Vec in Gensim

I started off by reading the paper and going through the original C++ code open-sourced by the authors, which builds upon Facebook’s FastText. The first version of the code I came up with was a pure Python/NumPy implementation, and consequently pretty slow. So I profiled the code to find the hotspots that took up the most time and reimplemented those in Cython. Profiling made it easy to select what to optimize: the bulk of the work was done in the nested loop that went through each sentence and, for each word in the sentence, tried to predict all the other words within its window. Cythonising these hotspots yielded a significant speedup.
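The profiling itself needs nothing beyond the standard library: running cProfile and sorting by cumulative time is enough to expose such a nested loop. A sketch with a stand-in for the hot function (train_sentence below is illustrative, not the actual Gensim code):

```python
import cProfile
import io
import pstats

def train_sentence(sentence, window=2):
    """Stand-in for the hot nested loop: for every word, visit every
    other word inside its window (the real code updates embeddings here)."""
    pairs = 0
    for pos in range(len(sentence)):
        lo, hi = max(0, pos - window), min(len(sentence), pos + window + 1)
        for ctx in range(lo, hi):
            if ctx != pos:
                pairs += 1  # placeholder for a gradient update
    return pairs

def train_corpus(corpus):
    return sum(train_sentence(s) for s in corpus)

corpus = [["w%d" % i for i in range(30)]] * 200
profiler = cProfile.Profile()
profiler.enable()
train_corpus(corpus)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue().splitlines()[0])  # e.g. "... function calls in ... seconds"
```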

Parallelization of cythonised code

The next step was to make the code run in parallel, to make use of multicore machines. Due to the simplicity of the model, parallel training is straightforward using parallelized or distributed SGD. In standard Python, the Global Interpreter Lock (GIL) enforces single-threaded execution for CPU-bound tasks. But in Sent2Vec’s cythonised version I was already working with low-level C++ calls, so the GIL was no issue. Sent2Vec’s multithreaded version is inspired by Gensim’s Word2Vec: it accepts a generic sequence of sentences, a fixed number of sentences at a time is put into a job queue, and worker threads repeatedly pick up jobs from that queue for training. The core computation is done inside _do_train_job_util, which is the function I optimized with Cython and BLAS. For multithreading to be practical the GIL must be released, and this is also done inside _do_train_job_util, using Cython’s nogil syntax.
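The producer/consumer structure around that function is plain Python; only the numeric inner loop needs Cython. A simplified sketch of the job-queue pattern, with a trivial word-counting stand-in for _do_train_job_util:

```python
import queue
import threading

JOB_QUEUE = queue.Queue(maxsize=8)
results = []
results_lock = threading.Lock()

def do_train_job(sentences):
    """Stand-in for the Cython _do_train_job_util: the real function
    releases the GIL (nogil) and calls BLAS; here we just count words."""
    return sum(len(s) for s in sentences)

def worker():
    while True:
        job = JOB_QUEUE.get()   # blocks until a chunk of sentences arrives
        if job is None:         # sentinel: no more jobs
            break
        n = do_train_job(job)
        with results_lock:
            results.append(n)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

corpus = [["some", "tokens"]] * 100
chunk = 10
for i in range(0, len(corpus), chunk):  # producer: enqueue fixed-size jobs
    JOB_QUEUE.put(corpus[i:i + chunk])
for _ in threads:                       # one sentinel per worker
    JOB_QUEUE.put(None)
for t in threads:
    t.join()

print(sum(results))  # 200 words processed in total
```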

Benchmarking and Evaluation

For benchmarking, I used the Toronto Book Corpus, a private corpus containing the sentences of 11,038 books: about 70 million sentences and 0.9 billion words. Only 7,087 of the 11,038 books are unique; among the rest, 2,089 books have one duplicate, 733 have two, and 95 have more than two. A maximum sentence length of 300 is used, i.e. sentences longer than 300 words are ignored. Comparison is done against Gensim’s Doc2Vec (DBOW and DM) and Gensim’s FastText models.

Examples of usage:

```python
from gensim.models import Sent2Vec
from gensim.test.utils import common_texts

model = Sent2Vec(common_texts, size=100, min_count=1)

# Online training: extend the vocabulary and continue training
# the same model on new sentences.
new_sentences = [
    ['computer', 'artificial', 'intelligence'],
    ['artificial', 'trees'],
    ['human', 'intelligence'],
    ['artificial', 'graph'],
    ['intelligence'],
    ['artificial', 'intelligence', 'system'],
]
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=model.corpus_count, epochs=model.epochs)
```

Here common_texts is a list/stream of sentences, size is the dimensionality of the sentence embeddings, and min_count is the minimum word frequency threshold. Several other parameters can also be specified, such as epochs (number of iterations over the training corpus), dropout_k (number of n-grams dropped while training the model), and workers (number of worker threads, for faster training on multicore machines). For a full list of parameters that Sent2Vec supports, take a look at the documentation.

Sentence embeddings are evaluated on various tasks. I evaluated classification of product reviews (CR) (Hu and Liu, 2004), opinion polarity (MPQA) (Wiebe et al., 2005), movie review sentiment (MR) (Pang and Lee, 2005), subjectivity classification (SUBJ) (Pang and Lee, 2004) and question type classification (TREC) (Voorhees, 2002). For classification I used the SentEval evaluation toolkit for sentence embeddings.

The sentence embeddings are inferred from input sentences and directly fed to a logistic regression classifier. Accuracy scores are obtained using 10-fold cross-validation for the MR, CR, SUBJ, MPQA datasets. For these datasets nested cross-validation is used to tune the L2 penalty. For the TREC dataset, the accuracy is computed on the test set.
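This protocol can be sketched with scikit-learn, standing in for what SentEval does internally (random vectors replace the real sentence embeddings here, so the accuracy is chance-level; the real setup also tunes the L2 penalty C with nested cross-validation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Pretend embeddings: in the real setup, each row would be a Sent2Vec
# sentence vector and each label the class of that sentence.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 100))   # 200 sentences, 100-dim embeddings
y = rng.integers(0, 2, size=200)  # binary labels (e.g. MR sentiment)

# Embeddings are fed directly to a logistic regression classifier and
# scored with 10-fold cross-validation, as for MR/CR/SUBJ/MPQA.
clf = LogisticRegression(C=1.0, max_iter=1000)
scores = cross_val_score(clf, X, y, cv=10)
print(round(scores.mean(), 3))
```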

Evaluation Results

| S.No. | Model | MR | CR | SUBJ | MPQA | TREC |
|-------|-------|-----|-----|------|------|------|
| 1. | Gensim Sent2Vec | 63.72 | 73.38 | 79.45 | 75.66 | 58.2 |
| 2. | Gensim Doc2Vec DM | 50.23 | 62.41 | 51.09 | 66.74 | 22.2 |
| 3. | Gensim Doc2Vec DBOW | 51.62 | 57.4 | 53.68 | 67.15 | 26.2 |
| 4. | Gensim FastText | 68.39 | 74.12 | 84.95 | 81.18 | 61.4 |

Examples where Sent2Vec outperforms Doc2Vec

For example, in the question type classification task (TREC), Doc2Vec performs quite poorly. The TREC dataset contains six question classes, namely ENTY (entity), ABBR (abbreviation), LOC (location), HUM (human), NUM (numeric) and DESC (description). Some questions which Sent2Vec classifies correctly but Doc2Vec does not:

  • How far is it from Denver to Aspen? (actual label: NUM, predicted label: DESC)
  • Who was Galileo? (actual label: HUM, predicted label: DESC)
  • What is an atom? (actual label: DESC, predicted label: HUM)
  • How tall is the Sears building? (actual label: NUM, predicted label: DESC)

The number of correctly predicted labels out of the total number of labels for Doc2Vec and Sent2Vec is as follows. (Note: these numbers were computed without 10-fold cross-validation, in contrast to the evaluation table above; in this case the TREC test accuracy for Sent2Vec and Doc2Vec is 55.4% and 30.6%, respectively.)

| S.No. | Model | LOC | HUM | NUM | ABBR | DESC | ENTY |
|-------|-------|------|------|------|------|------|------|
| 1. | Doc2Vec DBOW | 16/81 | 35/65 | 20/113 | 2/9 | 38/138 | 27/94 |
| 2. | Sent2Vec | 38/81 | 49/65 | 69/113 | 7/9 | 69/138 | 45/94 |


Evaluation Summary

A noticeable improvement in accuracy is seen as we use larger datasets. Sent2Vec clearly performs better than Gensim’s Doc2Vec. However, Gensim’s FastText slightly outperforms Gensim’s Sent2Vec on all evaluation tasks and is clearly the stronger model on these benchmarks.

Why use Sent2Vec over other sentence embedding models?

  • The computational complexity of Sent2Vec embeddings is only O(1) vector operations per word processed, both during training and inference of the sentence embeddings. This is in contrast to models like SkipThought, whose training is slow because it involves an encoder-decoder model, where the encoder maps the input sentence to a sentence vector and the decoder generates the sentences surrounding the original one. Other neural network based models like recursive NNs or recurrent NNs also have a time complexity of O(n) or O(n^2), where n is the length of the text.
  • The low computational complexity allows the Sent2Vec model to learn from extremely large datasets, which is a crucial advantage in the unsupervised setting.
  • Due to the simplicity of the model, parallel training is straightforward using parallelized or distributed SGD.
  • Great generalizability: Sent2Vec performs well on both supervised and unsupervised tasks, as opposed to models which perform extremely well on one kind of task and end up faring very poorly on the others.
  • Sent2Vec models are also faster to train when compared to methods like SkipThought and Doc2Vec, owing to the SGD step allowing a high degree of parallelizability.
  • The Sent2Vec model gives better accuracy than Doc2Vec on various downstream supervised evaluation tasks. It also uses significantly less memory than Doc2Vec (which incidentally had to resort to memory mapping to store its bulky model while training on the Toronto Book Corpus).

When to use Sent2Vec?

  • When training is to be done using extremely large datasets.
  • For a wide range of supervised tasks like subjectivity classification, classification of movie reviews, etc., because Sent2Vec has great generalizability.