Text Summarization in Python: Extractive vs. Abstractive techniques revisited

This blog is a gentle introduction to text summarization and can serve as a practical summary of the current landscape. It describes how we, a team of three students in the RaRe Incubator programme, have experimented with existing algorithms and Python tools in this domain.

We compare modern extractive methods like LexRank, LSA, Luhn and Gensim’s existing TextRank summarization module on the Opinosis dataset of 51 article-summary pairs. We also had a try with an abstractive technique using Tensorflow’s Text Summarization algorithm, but didn’t obtain good results due to its extremely high hardware demands (7000 GPU hours, ~$30k cloud credits).

Why text summarization?

With push notifications and article digests gaining more and more traction, the task of generating intelligent and accurate summaries for long pieces of text has become a popular research as well as industry problem.

There are two fundamental approaches to text summarization: extractive and abstractive. The former extracts words and word phrases from the original text to create a summary. The latter learns an internal language representation to generate more human-like summaries, paraphrasing the intent of the original text.

There are two fundamental approaches to text summarization: extractive and abstractive.

Extractive Text Summarization

First, a quick description of some popular algorithms & implementations for text summarization that exist today:

Text Summarization in Gensim

gensim.summarization module implements TextRank, an unsupervised algorithm based on weighted-graphs from a paper by Mihalcea et al. It was added by another incubator student Olavur Mortensen – see his previous post on this blog. It is built on top of the popular PageRank algorithm that Google used for ranking webpages. TextRank works as follows:

Pre-process the text: remove stop words and stem the remaining words.
Create a graph where vertices are sentences.
Connect every sentence to every other sentence by an edge. The weight of the edge is how similar the two sentences are.
Run the PageRank algorithm on the graph.
Pick the vertices(sentences) with the highest PageRank score

In original TextRank the weights of an edge between two sentences is the percentage of words appearing in both of them. Gensim’s TextRank uses Okapi BM25 function to see how similar the sentences are. It is an improvement from a paper by Barrios et al..

PyTeaser

PyTeaser is a Python implementation of the Scala project TextTeaser, which is a heuristic approach for extractive text summarization.

TextTeaser associates a score with every sentence. This score is a linear combination of features extracted from that sentence. Features that TextTeaser looks at are:

titleFeature: The count of words which are common to title of the document and sentence.
sentenceLength: Authors of TextTeaser defined a constant “ideal” (with value 20), which represents the ideal length of the summary, in terms of number of words. sentenceLength is calculated as a normalized distance from this value.
sentencePosition: Normalized sentence number (position in the list of sentences).
keywordFrequency: Term frequency in the bag-of-words model (after removing stop words).

More on the sentence features for summarization see Sentence Extraction Based Single Document Summarization by Jagadeesh et al.

PyTextRank

PyTextRank is a python implementation of the original TextRank algorithm with a few enhancements like using lemmatization instead of stemming, incorporating Part-Of-Speech tagging and Named Entity Resolution, extracting key phrases from the article and extracting summary sentences based on them. Along with a summary of the article, PyTextRank also extracts meaningful key phrases from the article. PyTextRank works in four stages, each feeding its output to the next:

In the first stage, Part-of-Speech Tagging and lemmatization is performed for every sentence in the document.
In the second stage, key phrases are extracted along with their counts, and are normalized.
Calculates a score for each sentence by approximating jaccard distance between the sentence and key phrases.
Summarizes the document based on most significant sentences and key phrases.

Luhn’s Algorithm

Published in 1958, this algorithm [PDF] ranks sentences for summarization extracts by considering “significant” words, which are frequently occurring words in a document, and the linear distance between these words due to non-significant words.

LexRank

LexRank is an unsupervised graph based approach similar to TextRank. LexRank uses IDF-modified Cosine as the similarity measure between two sentences. This similarity is used as weight of the graph edge between two sentences. LexRank also incorporates an intelligent post-processing step which makes sure that top sentences chosen for the summary are not too similar to each other.

More on LexRank Vs. TextRank can be found here.

Latent Semantic Analysis (LSA) in Text Summarization

LSA works by projecting the data into a lower dimensional space without any significant loss of information. One way to interpret this spatial decomposition operation is that singular vectors can capture and represent word combination patterns which are recurring in the corpus. The magnitude of the singular value indicates the importance of the pattern in a document.

If terms like singular vectors and singular values seem unfamiliar, we recommend this tutorial, which covers the theory of LSA, including an instructive, if naive, Python implementation (for a robust and fast implementation, use LSA in gensim, of course).

How to evaluate text summarization quality?

The ROUGE-N metric

For LexRank, Luhn and LSA methods we made use of the Sumy summarization library which implements these algorithms. We used the ROUGE-1 metric to compare the discussed techniques.

Rouge-N is a word N-gram measure between the model and the gold summary.

Specifically, it is the ratio of the count of N-gram phrases which occur in both the model and gold summary, to the count of all N-gram phrases that are present in the gold summary.

Another way to interpret it is as the recall value which measures how many N-grams from the gold summaries appeared in the model summaries.

Generally for summarization evaluation, only ROUGE-1 and ROUGE-2 (sometimes ROUGE-3, if we have really long gold and model summaries) metrics are used, rationale being that as we increase N, we increase the length of the N-gram word phrase that needs to be matched completely in both the gold and model summary.

As an example, consider two semantically similar phrases “apples bananas” and “bananas apples”. If we use ROUGE-1 we only consider uni-grams, which are the same for both phrases. But if we use ROUGE-2, we use 2-word phrases, so “apples bananas” become a single entity which is different from “bananas apples”, leading to a “miss” and lower evaluation score.

Example:

Gold Summary: A good diet must have apples and bananas.
Model Apples and bananas are must for a good diet.

If we use the ROUGE-1, the score is 7/8 = 0.875.

For ROUGE-2, it is 4/7 = ~0.57.

The above ratios can be interpreted as the amount of relevant information that our algorithm managed to extract from the set of all the relevant information, which is exactly the definition of recall, and hence Rouge is recall based.

More examples of how to calculate the scores are in this gist.

The BLEU metric

BLEU metric is a modified form of precision, extensively used in machine translation evaluation.

Precision is the ratio of the number of words that co-occur in both gold and model translation/summary to the number of words in the model summary. Unlike ROUGE, BLEU directly accounts for variable length phrases – unigrams, bigrams, trigrams etc., by taking a weighted average.

The actual metric is just precision which is modified to avoid the problem when a model’s translation/summary contains repeated relevant information.

Example:

Gold Summary: A good diet must have apples and bananas.
Model Summary: Apples and bananas are must for a good diet.

If we use the BLEU score considering only unigrams, i.e., weight of unigram is 1 and 0 for all other N-grams, our ratio for BLEU is calculated as 7/9 = 0.778.

For weights [0.6, 0.4] for unigram and bigram respectively, the ratio becomes 0.6 * (7/9) + 0.4 * (4/8) = 0.667.

BLEU with modified N-gram precision

The key intuition behind modified N-gram precision is that a reference phrase/word should be considered exhausted once it has been identified in the model summary. This idea addresses the problem of repeated/over-generated words in the model summary.

Modified N-gram precision is computed by first finding the maximum number of times a word/phrase occurs in any single reference. This count becomes the maximum reference count for that word/phrase. We then clip the total count of each model word/phrase by its maximum reference count, add the clipped counts for each word in the model translation/summary and divide the sum by the total number of words/phrases in the model translation/summary.

The link to the paper on BLEU (see above) has good examples on its modified N-gram precision.

TL;DR: The greater the ROUGE and BLEU score, the better the summary.

Dataset

Comparison was performed using the Opinosis dataset of 51 articles. Each article is about a product’s feature, like iPod’s Battery Life, etc. and is a collection of reviews by customers who purchased that product. Each article in the dataset has 5 manually written “gold” summaries. Usually the 5 gold summaries are different but they can also be the same text repeated 5 times.

Model parameters

For Gensim TextRank, the count of words in the output summary, word_count was set to 75.
For Sumy-LSA and Sumy-Lex_rank the count of sentences in the output summary(sentence_count) was set to 2.

Results

The mean and standard deviation of ROUGE-1 and BLEU scores obtained are shown in the table below

Model	Maximum ROUGE-1 Score	Std deviation of ROUGE-1 Score	BLEU Score	Std deviation of BLEU score
Sumy-Lex_rank	0.26	0.06	0.391	0.095
Sumy-LSA	0.211	0.059	0.374	0.086
Sumy-Luhn	0.126	0.038	0.157	0.044
PyTeaser	0.221	0.053	0.343	0.072
Gensim TextRank	0.230	0.058	0.388	0.095
PyTextRank	0.197	0.045	0.269	0.059

ROUGE scores for every summary is the maximum ROUGE score amongst the five (individual gold summary) scores.

For BLEU score we used NLTK’s bleu_score module with weights for unigrams, bigrams and trigrams as 0.4, 0.3, 0.2 respectively.

For a concrete example, have a look at this review of the Garmin 255W navigation device. See both the human and model-generated summaries.

Qualitative Assessment

LexRank is the winner here as it yields a better ROUGE and BLEU score. Unfortunately we found the summaries generated by it to be less informative than summaries by Gensim’s TextRank and Luhn’s model. Furthermore, LexRank doesn’t always beat TextRank in the ROUGE score – for example, TextRank performs marginally better than LexRank on the DUC 2002 dataset. So the choice between LexRank and TextRank depends on your dataset, it’s worth trying both.

Another conclusion from the data is that Gensim’s Textrank outperforms the plain PyTextRank because it uses the BM25 function instead of Cosine IDF in plain TextRank.

Another point from the table is that Luhn’s algorithm has a lower BLEU score. This is because it extracts a longer summary and hence covers more reviews of the product. Unfortunately, we couldn’t make it shorter because the wrapper for Luhn’s algorithm in Sumy doesn’t provide the parameters to change the word limit.

Abstractive Text Summarization

A Neural Network Approach

Google’s Textsum is a state of the art open-source abstractive text summarization architecture. It can create headlines for news articles based on their first two sentences.

It has shown good results after training on 4 million pairs from the Gigaword dataset of the form (first two sentences, headline). During training it optimises the likelihood of the summary given the article’s first two sentences. Both the encoding layer and language model are trained at the same time. In order to generate a summary it searches the space of all possible summaries to find the most likely sequence of words for the given article.

Here is an example of the data used for training the TextSum model together with the model-generated summary.

Article	novell inc. chief executive officer eric schmidt has been named chairman of the internet search-engine company google .
Human summary	novell ceo named google chairman
Textsum	novell chief executive named to head internet company

Notice that the word “head” doesn’t appear in the original text. The model has generated it. This would never happen in an extractive algorithm above.

We ran the Tensorflow network provided by Google and tweaked some of its hyperparameters. Unfortunately we could only train the model for 10% of the time of what was needed and got summaries of very low quality. We couldn’t even use the ROUGE and BLEU scores above due to the summaries not making any sense.

To compare different tweaks to the neural network architecture we had to resort to using a mathematical measure of the model fit on the training set “running average loss”. The average running loss graphs for the models can be found in this gist.

How much is “sufficiently trained”?
It is advised by the authors of Tensorflow’s implementation to train for over million time-steps to successfully reproduce their results. This would mean weeks of training time on GPU enabled clusters. Google themselves used 10 machines with 4 GPUs each, training for a
week. That is equivalent to 7000 GPU hours or $30k AWS cloud credits. We didn’t have such hardware resources at our disposal.

Also Google TextSum authors use the Annotated English Gigaword dataset which requires a $3000 license. So instead, we use a relatively small but free news article data set: CNN and DailyMail. These 320k articles are converted into a Textsum compatible format and vocabulary. You can generate your own TextSum compatible pre-processed CNN and DailyMail Data by using our code from github.

Initially, the training with default parameters was done on an NVIDIA GTX 950M laptop but the algorithm did not seem to converge even after training for more than 48 hours. To speed up the process and generate meaningful summaries we switched to a g2.2xlarge Amazon EC2 instance, equipped with NVIDIA’s K520 GPU.

Observations

Some examples of the very bad summaries generated by our insufficently trained TextSum model. This is similar to the attempt to train TextSum in Pavel Surmenok’s blog.

Human Written Summary	TextSum Summary (trained for less than 50k steps)	TextSum Generated Summary (trained for 100k+ steps)
criticised over ‘ incompetent ‘ handling of bank ‘s tax evasion scandal	<UNK> <UNK> the to to to in a	manchester united face manchester city in the premier league on sunday night
alleged hoax on manti te’o is compared with the documentary ” catfish “	<UNK> <UNK> miss , to	the supreme anniversary of the number of people are less than 200 days from the u . s . airstrikes
queensland opposition leader annastacia palaszczukr has opened up about the heartbreak of losing her baby at 11 weeks	said is to to to to to in to to	ex – marine scott olsen now awake after being hit on tuesday

The generated summaries for a major portion of the test set are redundant and do not resemble the actual summaries in the test set.

Certain words recur in many summaries irrespective of whether or not these words are present in the the actual articles and their summaries in the test set, e.g. the phrase “manchester united” and “manchester city” repeat a lot of time in the generated summaries.

Another observation is that initially (global_steps< 50000) the model was not generating grammatically correct sentences, as we trained the model for a greater duration the generated summaries began to make some sense and the grammar was somewhat more correct. However, the generated summaries were still completely irrelevant to the original articles as well as the corresponding human-written summaries.

There was a visible improvement in loss (and in the semantic quality of summaries) only after 50,000 time-steps. After having trained for 100,000 time-steps for close to one day, we observed the quality – here we use our subjective understanding to judge said quality – of summaries improve ever so slightly. Even so the summaries are clearly not up to the mark. This is to be expected, given the training time. The authors of the model claim that it is possible to get much better results provided the user is willing to trade-off in terms of required time and compute.

To “Summarize”

For extractive techniques, our metrics tell us that LexRank outperforms Gensim’s TextRank by a narrow margin but we’ve also observed cases where TextRank gives higher quality of summaries. We believe that the dataset which is used affects the quality of obtained summaries. A good practise would be to run both the algorithms and use the one which gives more satisfactory summaries. A future direction is to compare Gensim’s TextRank implementation with Paco Nathan’s PyTextRank.

Due to lack of GPU resources and a lot of parameters to tune we end our research on abstractive summarization at a point where we cannot conclude with absolute certainty that the model can be used as an alternative to current extractive implementations. But of course, one can always try training the model for a few million (more) timesteps and tweak some parameters to see whether the results get better on CNN-Dailymail Dataset or on another dataset.