A Brief Introduction to Text Summarization


Text summarization can be divided into two types: extraction and abstraction. With the power of AI, summarization is becoming more popular and accessible.

Text summarization is the process of taking pieces from a longer text and putting together a (shorter) summary that preserves the key elements and meaning of the original. Doing this manually is a time-consuming and strenuous task. However, powered by the data and AI revolution, automating it is gaining popularity.

We can distinguish two types of text summarization: extraction and abstraction.

Extractive Summarization

Extractive summarization is the simplest approach to automatic text summarization, as it requires little linguistic analysis. Sentences are picked directly from the document based on a score and then put together to form a coherent summary: important sections of the text are identified, cropped out, and stitched together to produce a condensed version of the full document.

Extractive summarization consists of three steps:

  1. The first step is to construct an intermediate representation of the input text. There are two ways to do this: a topic representation or an indicator representation.
    In a topic representation, the text is transformed into constituent topics. The techniques used for this differ in terms of their complexity and representation model, and are divided into frequency-driven approaches, topic word approaches, latent semantic analysis, and Bayesian topic models.
    In an indicator representation, each sentence is represented as a list of indicators of importance (sentence length, location in the document, presence of certain phrases, etc.). Examples of indicator representations are graph-based models and machine learning models.

  2. In the second step, each sentence in the representation is assigned a score that indicates its importance.
    For topic representations, the score is usually related to how well the sentence expresses some of the most important topics in the document or to what extent it combines information about the different topics.
    For indicator representations, the score of each sentence is determined by combining the outcome of the different indicators.

  3. In the final step, the summarizer selects the best combination of important sentences to form a summary of the desired length. Usually, the highest-scoring sentences are put together until that length is reached. Ideally, the system maximizes overall importance and coherence while minimizing redundancy.
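The three steps above can be sketched in a deliberately tiny, frequency-driven extractive summarizer. This is an illustrative toy, not a production system: the intermediate representation is just content-word counts, the stopword list is a small hand-picked assumption, and sentence splitting is naive.

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    """Toy extractive summarizer using frequency-driven scoring."""
    # Step 1: build an intermediate representation of the input --
    # here, frequencies of content words across the whole text.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = re.findall(r'[a-z]+', text.lower())
    stopwords = {"the", "a", "an", "is", "are", "of", "to",
                 "and", "in", "it", "that", "this", "my"}
    freq = Counter(w for w in words if w not in stopwords)

    # Step 2: score each sentence by how strongly it expresses
    # the frequent "topics" of the document.
    def score(sentence):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        return sum(freq[t] for t in tokens)

    # Step 3: select the top-scoring sentences, restored to
    # their original document order for coherence.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Running this on a short paragraph keeps the sentences that share the document's dominant vocabulary and drops off-topic ones. Real extractive systems add the refinements described above, such as graph-based scoring or learned indicator weights.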

Abstractive Summarization

Abstractive summarization requires more advanced NLP techniques, as it aims to produce a summary by interpreting the text. In abstractive summarization, AI models incorporate the important information into newly generated, rephrased sentences, parts of which may not appear in the original text. These generated summaries tend to be more linguistically fluent and closer to human-made summaries.

Abstractive summarization can be regarded as a “sequence mapping task”, in which the source text is mapped to the target summary, and it can take advantage of advances in deep learning, in particular “sequence-to-sequence” models. Just like machine translation models, these sequence-to-sequence models consist of an encoder and a decoder: a neural network reads and encodes the source text, and a decoder then generates the target text.
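To make the encoder–decoder shape concrete, here is a structural sketch in plain Python. Both halves are hypothetical stand-ins: in a real system the encoder and decoder are trained neural networks (e.g. Transformers), whereas here the "encoder" just compresses the source into word counts and the "decoder" greedily emits one token per step conditioned on that context and its previous output.

```python
from collections import Counter

def encode(source_tokens):
    # Stand-in for a neural encoder: compress the source sequence
    # into a fixed-size "context" (here, simple token counts).
    return Counter(source_tokens)

def decode(context, max_len=5):
    # Stand-in for a neural decoder: emit one token per step,
    # conditioned on the context and the tokens generated so far.
    output = []
    for _ in range(max_len):
        # Greedy step: pick the most salient token not yet emitted.
        candidates = [(count, tok) for tok, count in context.items()
                      if tok not in output]
        if not candidates:
            break
        output.append(max(candidates)[1])
    return output
```

The point of the sketch is the division of labor, not the output quality: the decoder never copies a sentence verbatim, which is what lets abstractive systems produce wording absent from the source.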

Because it involves complex language modeling, automatically generating human-like abstractive summaries remains a challenging task.

There are some free online tools available for automated extractive and abstractive summarization, such as SummarizeBot, Resoomer, SMMRY, TextSummarization, and Text Compactor.



Anne-Maj van der Meer is a marketing professional with over 10 years of experience in event organization and management. She has a BA in English Language and Culture from the University of Amsterdam and a specialization in Creative Writing from Harvard University. Before her position at TAUS, she was a teacher at primary schools in regular as well as special needs education. Anne-Maj started her career at TAUS in 2009 as the first TAUS employee, where she became a jack of all trades, taking care of bookkeeping and accounting as well as creating and managing the website and customer services. For the past 5 years, she has worked as Events Director, chief content editor, and designer of publications. Anne-Maj has helped in the organization of more than 35 LocWorld conferences, where she takes care of the program for the TAUS track and hosts and moderates these sessions.
