Grouping sentences as better language unit for extractive text summarization

Abstract

Most existing methods for extractive text summarization extract important sentences with statistical or linguistic techniques and concatenate these sentences into a summary. However, the extracted sentences are usually incoherent. The problem becomes worse when the source text and the summary are long and based on logical reasoning. This paper is motivated by two related questions: What is the best language unit for constructing a summary that is coherent and understandable? How should the extractive summarization process be built on that language unit? Extracting larger language units, such as a group of sentences or a paragraph, is a natural way to improve the readability of a summary, as it is reasonable to assume that the original sentences within a larger language unit are coherent. This paper proposes a framework for group-based text summarization that clusters semantically related sentences into groups based on a Semantic Link Network (SLN), then ranks the groups and concatenates the top-ranked ones into a summary. A two-layer SLN model is used to generate and rank groups with semantic links including the is-part-of link, sequential link, similar-to link, and cause–effect link. The experimental results show that summaries composed of groups or paragraphs tend to contain more key words or phrases than summaries composed of sentences, and that summaries composed of groups contain more key words or phrases than those composed of paragraphs, especially when the average length of the source texts is between 7,000 and 17,000 words, which is the usual length of scientific papers. Further, we compare seven clustering algorithms for generating groups and propose five strategies for generating groups with the four types of semantic links.
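
The group-then-rank-then-concatenate pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's two-layer SLN model: it assumes that adjacency between sentences can stand in for the sequential link and that bag-of-words cosine similarity can stand in for the similar-to link, and it ignores the is-part-of and cause–effect links entirely. The function names, the `threshold` value, and the frequency-based group score are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase word tokens; a stand-in for real linguistic preprocessing."""
    return re.findall(r"[a-z]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def group_sentences(sentences, threshold=0.2):
    """Return groups as lists of sentence indices.

    Assumption: a sentence joins the current group when it is similar enough
    to the sentence before it; otherwise it starts a new group.
    """
    groups = []
    prev = None
    for idx, sent in enumerate(sentences):
        vec = Counter(tokenize(sent))
        if groups and cosine(prev, vec) >= threshold:
            groups[-1].append(idx)
        else:
            groups.append([idx])
        prev = vec
    return groups


def rank_groups(sentences, groups):
    """Score each group by the mean document-level frequency of its words
    (a hypothetical score, not the SLN-based ranking)."""
    doc_freq = Counter(w for s in sentences for w in tokenize(s))
    scores = []
    for group in groups:
        words = [w for i in group for w in tokenize(sentences[i])]
        scores.append(sum(doc_freq[w] for w in words) / len(words) if words else 0.0)
    return scores


def summarize(sentences, top_k=1, threshold=0.2):
    """Concatenate the top_k highest-scoring groups, kept in source order."""
    groups = group_sentences(sentences, threshold)
    scores = rank_groups(sentences, groups)
    best = sorted(range(len(groups)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Sentences inside a selected group are never separated or reordered.
    return " ".join(sentences[i] for gi in sorted(best) for i in groups[gi])


if __name__ == "__main__":
    text = ["Cats chase mice.", "Mice fear cats.", "Stocks fell sharply today."]
    print(group_sentences(text))
    print(summarize(text, top_k=1))
```

Because whole groups are selected rather than individual sentences, adjacent related sentences stay together in the output, which is the coherence argument the abstract makes for larger language units.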

Publication DOI: https://doi.org/10.1016/j.future.2020.03.046
Divisions: College of Engineering & Physical Sciences > Systems analytics research institute (SARI)
College of Engineering & Physical Sciences
Funding Information: This work was supported by the National Natural Science Foundation of China (No. 61640212 and No. 61876048).
Additional Information: © 2020, Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/
Uncontrolled Keywords: Clustering, Natural language processing, Semantic Link Network, Text summarization, Software, Hardware and Architecture, Computer Networks and Communications
Publication ISSN: 1872-7115
Last Modified: 11 Mar 2024 08:42
Date Deposited: 23 Apr 2020 08:56
Related URLs: http://www.scop ... tnerID=8YFLogxK (Scopus URL)
https://www.sci ... 8989?via%3Dihub (Publisher URL)
PURE Output Type: Article
Published Date: 2020-08-01
Published Online Date: 2020-04-01
Accepted Date: 2020-03-23
Authors: Cao, Mengyun
Zhuge, Hai (ORCID Profile 0000-0001-8250-6408)
