Influenza-A's ability to mutate constantly has resulted in recurring seasonal epidemics and pandemics. Recently, the spread of the virus has been enhanced by its ability to infect multiple hosts simultaneously. Fast identification of the subtype and host of an Influenza-A virus is thus crucial for quickly assessing its drug resistance and virulence. Research on data mining techniques for Influenza-A host and subtype classification is already underway. The main goal of earlier studies was to improve the accuracy, speed, and safety of virus analyses. With newer infectious strains of Influenza-A appearing yearly, these techniques still leave room for improvement. This research aims to improve existing machine learning techniques for classifying Influenza-A in three ways: (a) Exploring the effectiveness of using RNA/cDNA data rather than protein data for virus classification. (b) Measuring the impact on classifier performance and speed of preprocessing the virus sequences by selecting the most informative positions; both neural networks (NNs) and decision trees (DTs) were analyzed. (c) Testing this method on more than one classification problem; host identification experiments were conducted on subtypes H1 and H5, while antiviral resistance identification was conducted on the H1N1 strain. Accuracy, sensitivity, specificity, precision, and time were used as performance measures.

The final results showed that: (a) DNA data is more sensitive than protein data for both subtypes. (b) Using the 100 and 10 most informative positions with DTs yielded an overall speed improvement of 92-100% when identifying hosts for segments of subtype H1, with an insignificant decrease in performance. Using the 100 and 60 most informative positions with NNs yielded a speed improvement of 88% when identifying hosts of both subtypes H1 and H5, with no significant drop in overall performance. Of the two classifiers, NNs had better performance, while DTs had better efficiency. (c) Testing the method on antiviral resistance identification in Influenza-A showed promising results: using the 100 most informative positions with DTs yielded an overall performance of no less than 95% in no more than 3 seconds for all 8 segments. The method has the potential to improve the efficiency of other Influenza-A classification problems, as well as other viral classification problems in bioinformatics.

The thesis makes the following contributions: (a) A way to extract informative positions directly from DNA sequences without converting the DNA data to protein data, which can aid in detecting silent mutations in Influenza-A virus. (b) Identification of Adamantane resistance using all eight segments of the virus; previously, one viral segment was known to be mainly responsible for antiviral resistance. (c) A measurement, in terms of speed, of the efficiency of using informative positions as a preprocessing step. (d) A clear comparison of the performance of two classifiers when using the information gain algorithm.
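
The abstract describes ranking sequence positions by information gain and keeping only the most informative ones before classification. The sketch below illustrates that general idea and is not the thesis's actual pipeline. It assumes the sequences are already aligned to equal length, and the function names (`entropy`, `information_gain`, `top_informative_positions`) and the toy sequences and host labels are hypothetical, invented purely for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(column, labels):
    """Information gain of one alignment column with respect to the labels."""
    total = len(labels)
    groups = {}
    for symbol, label in zip(column, labels):
        groups.setdefault(symbol, []).append(label)
    # Weighted entropy that remains after splitting on the symbol at this position.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def top_informative_positions(sequences, labels, k):
    """Return the indices of the k highest-gain positions, plus all gains.

    Assumes every sequence has the same length (i.e. they are aligned).
    """
    length = len(sequences[0])
    gains = [information_gain([s[i] for s in sequences], labels)
             for i in range(length)]
    ranked = sorted(range(length), key=lambda i: gains[i], reverse=True)
    return ranked[:k], gains


# Toy data, invented for illustration only (not real influenza sequences).
sequences = ["ATGCA", "ATGCT", "ACGGA", "ACGGT"]
labels = ["human", "human", "avian", "avian"]
top, gains = top_informative_positions(sequences, labels, k=2)
print(top)
```

A classifier such as a decision tree or neural network would then be trained only on the selected columns, which is where the reported speed improvement would be expected to come from.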


Computer Science & Engineering Department

Degree Name

MS in Computer Science

Graduation Date


Submission Date

August 2014

First Advisor

Drs. Rafea, Ahmed; El-Hefnawi, Mahmoud; Moustafa, Ahmed

Committee Member 1

Mahmoud, Wael

Committee Member 2

ElKafrawi, Passent


115 p.

Document Type

Master's Thesis

Library of Congress Subject Heading 1


Library of Congress Subject Heading 2

Machine learning.


The author retains all rights with regard to copyright. The author certifies that written permission from the owner(s) of third-party copyrighted matter included in the thesis, dissertation, paper, or record of study has been obtained. The author further certifies that IRB approval has been obtained for this thesis, or that IRB approval is not necessary for this thesis. Insofar as this thesis, dissertation, paper, or record of study is an educational record as defined in the Family Educational Rights and Privacy Act (FERPA) (20 USC 1232g), the author has granted consent to disclosure of it to anyone who requests a copy.

Institutional Review Board (IRB) Approval

Approval has been obtained for this item


I would like to thank my supervisors, Dr. Ahmed Rafea, Dr. Ahmed Moustafa, and Dr. Mahmoud El-Hefnawi, for their continuous support throughout my four years of study. I would also like to thank Mrs. Magda Aboul Ela for her advisory role during my academic years. I would like to thank both the Biology department and the Computer Science and Engineering department for collaborating to allow me to produce the research presented in this document. I would like to extend my gratitude to my readers, Dr. Wael Mohamed and Dr. Yousra Alkabani, for their supportive role and for the comments that helped shape this thesis. I would like to extend my compliments to Dr. Passent El Kafrawi for her thorough analysis of my thesis and for her flexibility and support on such short notice. I would also like to thank Dr. Mikhail Mikhail, Dr. Sherif Aly, and Dr. Mohamed Moustafa for their supportive roles as directors of the graduate Master's program, as well as Dr. Tarek Shawki for his supportive role as dean of the Sciences and Engineering program at AUC. Finally, yet importantly, I would like to acknowledge my family and friends who were with me every step of the way: Ashraf Shaltout, Lob Shaltout, Stephanie Dirlam, and Rhiannon Dirlam.