Abstract

This thesis investigates the Information Retrieval (IR) field of research, with a focus on Text Categorization (TC). We study the available techniques and models for classifying text documents into a predefined set of categories. We use a subset of the Reuters-21578 test collection for our research: the first 150 documents for training and the following 100 for testing. Pre-processing consists of parsing, stop-word removal, and stemming with the Porter stemmer. We identify all possible phrases in a document during the pre-processing stage, using only adjacent phrases of size two. We also collect frequency information for the training-set documents during this stage. We construct the index lists after applying a feature reduction function to the extracted features (terms and phrases). We assign weights that relate index features to categories. The weights and index lists are then used to classify test-set documents. Categorization results are compared with relevance judgements to evaluate the performance of our categorization methodology. We propose a feature reduction technique that reduces the number of initially extracted features by selecting features of high categorization quality. The result of the feature reduction process is a set of index features. The function combines feature frequency and feature document frequency with a new feature category frequency technique. The coefficients of the proposed formula control its output, allowing more or fewer features to be selected. We try several coefficient combinations and achieve over 98% reduction for terms and over 99% reduction for phrases while still achieving high precision and recall values. Our primary goal is to provide a method that achieves high-precision categorization based on phrases. We want to prove that phrases can be used alone as an independent, highly precise classifier.
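As a rough illustration of the phrase extraction step described above, the following Python sketch removes stop words and pairs each remaining term with its neighbour. It is a minimal sketch under stated assumptions, not the thesis's implementation: the stop-word list and the function names are hypothetical, Porter stemming is omitted, and forming phrases after stop-word removal (rather than before) is an assumption.

```python
import re

# Tiny illustrative stop-word list (an assumption; the thesis's list is not given).
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_features(text):
    """Return (terms, phrases).

    terms   : tokens left after stop-word removal (stemming is omitted here)
    phrases : two-word phrases built from each pair of adjacent remaining terms
    """
    terms = [t for t in tokenize(text) if t not in STOP_WORDS]
    phrases = [f"{a} {b}" for a, b in zip(terms, terms[1:])]
    return terms, phrases


terms, phrases = extract_features("The price of crude oil rose in the Gulf")
print(terms)
print(phrases)
```

Counting how often each term and phrase occurs per training document and per category would then supply the frequency information that the feature reduction function relies on.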
Our secondary goal is to propose other term-based techniques that perform at least as well as term classifiers with less complexity. We propose a classifier based on the presence of category names in a document: it considers a document relevant to a category if the category name occurs in the document. We also propose a title classifier that gives higher weight to index terms found in a document's title. We also apply the phrase classifier concept to terms, obtaining categorization results based on terms only, so that we can compare our proposed techniques with term categorization in the same testing environment. We evaluate precision and recall for each technique individually so that the techniques can be compared, and we also study how combining them affects overall system performance. Our phrase classifier achieves an average of 89% precision and 35% recall. Other research on phrase-based categorization achieves much lower precision. For example, a study of statistical and syntactic phrases achieves an average of 54% precision at a recall of 30% and a maximum of 85% at the 0% recall level [19]. Using the category-term classifier alone achieves 67% precision and 37% recall. Using the title classifier independently achieves an average of 26% precision and 46% recall. Categorization based on the term classifier alone achieves averages of 12% precision and 89% recall.
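The category-name idea and the precision and recall figures quoted above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the toy documents, and the category list are hypothetical, and pooling counts over all documents (micro-averaging) is an assumption, since the abstract does not say how its averages are computed.

```python
def category_name_classifier(text, categories):
    """Assign every category whose name occurs as a word in the text."""
    words = set(text.lower().split())
    return {c for c in categories if c.lower() in words}


def precision_recall(assigned, relevant):
    """Pooled precision and recall over parallel lists of category sets.

    assigned : categories the classifier gave each document
    relevant : categories the relevance judgements give each document
    """
    true_pos = sum(len(a & r) for a, r in zip(assigned, relevant))
    n_assigned = sum(len(a) for a in assigned)
    n_relevant = sum(len(r) for r in relevant)
    precision = true_pos / n_assigned if n_assigned else 0.0
    recall = true_pos / n_relevant if n_relevant else 0.0
    return precision, recall


# Toy data, invented for illustration only.
categories = ["grain", "crude", "trade"]
docs = ["Crude prices rose as grain exports fell", "Trade talks resume"]
relevant = [{"crude"}, {"trade", "grain"}]

assigned = [category_name_classifier(d, categories) for d in docs]
precision, recall = precision_recall(assigned, relevant)
print(assigned, precision, recall)
```

In this toy run the first document is wrongly assigned "grain" because the word merely appears in it, and the second misses "grain" because the name never appears, which shows how a name-matching classifier can lose both precision and recall.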

Department

Computer Science & Engineering Department

Degree Name

MS in Computer Science

Date of Award

2-1-2004

Online Submission Date

1-1-2003

First Advisor

Amr Goneid

Committee Member 1

Amr El Kadi

Committee Member 2

A. Khalil

Committee Member 3

Mokhtar Boshra

Document Type

Thesis

Extent

120 leaves

Library of Congress Subject Heading 1

Text processing (Computer science)

Rights

The American University in Cairo grants authors of theses and dissertations a maximum embargo period of two years from the date of submission, upon request. After the embargo elapses, these documents are made available publicly. If you are the author of this thesis or dissertation, and would like to request an exceptional extension of the embargo period, please write to thesisadmin@aucegypt.edu

Call Number

Thesis 2003/61

Location

mgfth;mrs2
