Automatic text summarization aims to identify and present the most important information in a text. In this research, we investigate automatic multi-document summarization using a hybrid of extractive and “shallow” abstractive methods. We use a graph-based representation of text proposed in prior work as part of our approach, with the goal of producing concise, informative, and coherent summaries. First, we score sentences by significance and extract the top-scoring ones from each document in the set being summarized. In this step, we examine several scoring criteria: the presence of words that are highly frequent in the document, the presence of words that are highly frequent across the document set, the presence of words found in the first and last sentences of the document, and combinations of these features. Our experiments show that the best combination uses the presence of the document's highly frequent words together with the presence of words found in its first and last sentences. On average, the F-score of this combination was 7.9% higher than the F-scores of the other features. Second, we address redundancy by clustering sentences that carry the same or similar information; each cluster is then compressed into a single sentence, which avoids repeated information as much as possible. We investigated clustering the extracted sentences under two similarity criteria: the first uses word frequency vectors as the similarity measure, and the second uses word semantic similarity. Our experiments show that the word frequency vectors yield much better clusters in terms of sentence similarity.
The word frequency vector feature produced 20% more clusters labeled as containing similar sentences than the word semantic similarity feature. We then adopted a graph-based representation of text proposed in prior work to represent the sentences in each cluster and, using the k-shortest paths, selected the shortest path as the compressed sentence to include in the summary. In a human evaluation of grammatical correctness, almost 74% of the 51 sentences evaluated received the top score of 2, which denotes a perfect or near-perfect sentence. Finally, we propose a method for scoring the compressed sentences according to the order in which they should appear in the final summary. We used the Document Understanding Conference dataset for year 2014 to evaluate our final system. For evaluation we used ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which compares automatic summaries to “ideal” human references. We also compared our summaries' ROUGE scores to those of summaries generated with the MEAD summarization tool. Our system achieved better precision and F-score and comparable recall. On average, our system's precision was 2% higher and its F-score 1.6% higher than MEAD's, while MEAD's recall was 0.8% higher. In addition, our system produced a more compressed summary than the one generated by MEAD. Lastly, we ran an experiment to evaluate the order of sentences in the final summary and its comprehensibility, which showed that our ordering method produces comprehensible summaries. On average, 72% of the evaluated summaries received a perfect comprehensibility score. Evaluators were also asked to count the ungrammatical and incomprehensible sentences in the evaluated summaries; on average, these made up only 10.9% of the summaries' sentences.
We believe our system provides a “shallow” abstractive summary of multiple documents that does not require intensive natural language processing.
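The sentence-scoring step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the stopword list, the equal weighting of the two features, the `top_k` cutoff for "highly frequent" words, and the function names `tokenize`, `score_sentences`, and `extract_top` are all assumptions made for illustration. The sketch combines the two features the abstract reports as the best combination: words that are highly frequent in the document and words found in its first and last sentences.

```python
import re
from collections import Counter

# Hypothetical stopword list; the abstract does not describe stopword handling.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "it", "on", "for"}


def tokenize(sentence):
    """Lowercase a sentence and return its non-stopword word tokens."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def score_sentences(sentences, top_k=5):
    """Score each sentence of a single document.

    A sentence earns one point per distinct word that is among the document's
    top_k most frequent words, plus one point per distinct word that also
    occurs in the first or last sentence. Equal weighting and top_k are
    assumptions, not values taken from the thesis.
    """
    if not sentences:
        return []
    tokens = [tokenize(s) for s in sentences]
    counts = Counter(w for sent in tokens for w in sent)
    frequent = {w for w, _ in counts.most_common(top_k)}
    boundary = set(tokens[0]) | set(tokens[-1])
    return [len(set(sent) & frequent) + len(set(sent) & boundary) for sent in tokens]


def extract_top(sentences, n=2, top_k=5):
    """Return the n highest-scoring sentences, kept in document order."""
    scores = score_sentences(sentences, top_k)
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:n]
    return [sentences[i] for i in sorted(ranked)]


doc = [
    "Graph methods summarize text.",
    "Cats sleep all day.",
    "Graph methods compress text well.",
    "Summaries of text use graph methods.",
]
print(score_sentences(doc, top_k=3))
print(extract_top(doc, n=2, top_k=3))
```

Note that under this simple scheme the first and last sentences always match all of their own words on the second feature, so they tend to rank highly; a real system would likely normalize or weight the features differently.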
Computer Science & Engineering Department
MS in Computer Science
The author retains all rights with regard to copyright. The author certifies that written permission from the owner(s) of third-party copyrighted matter included in the thesis, dissertation, paper, or record of study has been obtained. The author further certifies that IRB approval has been obtained for this thesis, or that IRB approval is not necessary for this thesis. Insofar as this thesis, dissertation, paper, or record of study is an educational record as defined in the Family Educational Rights and Privacy Act (FERPA) (20 USC 1232g), the author has granted consent to disclosure of it to anyone who requests a copy.
Institutional Review Board (IRB) Approval
Approval has been obtained for this item
Mahmoud, M. S. (2014). Utilizing graph-based representation of text in a hybrid approach to multiple documents summarization [Master's Thesis, the American University in Cairo]. AUC Knowledge Fountain.
Mahmoud, May Sayed. Utilizing graph-based representation of text in a hybrid approach to multiple documents summarization. 2014. American University in Cairo, Master's Thesis. AUC Knowledge Fountain.