May Shalaby


Social influence can be described as the ability to affect the thoughts or actions of others. Influential members of online communities are becoming the new media for marketing products and swaying opinions. Their guidance and recommendations can also save people search time and assist their decision making. The objective of this research is to detect the influential users in a specific topic on Twitter. More specifically, given a collection of tweets matching a specified query, we want to detect the influential users in an online fashion. To address this objective, we first focus our search on individuals who write in their personal accounts, so we investigate how to differentiate between personal and non-personal accounts. Second, we investigate which set of features best leads us to the topic-specific influential users, and how these features can be expressed in a model that produces a ranked list of influential users. Finally, we examine whether language use can serve as a supporting feature for detecting an author's influence.
To decide how to differentiate between personal and non-personal accounts, we compared the effectiveness of an SVM with that of a manually assembled list of non-personal accounts. To decide on the features that best lead us to the influential users, we ran experiments on a set of features inspired by the literature. Two ranking methods were then developed, using feature combinations, to identify candidate influential users. For evaluation, we manually examined the users, looking at their tweets and profile pages to decide on their influence. To address our final objective, we ran experiments to investigate whether the SLM could be used to identify the influential users' tweets.
For classifying user accounts as personal or non-personal, the SVM was found to be domain independent, reliable and consistent, with a precision of over 0.9. The results showed that the list's performance deteriorates over time, and that when the domain of the test data was changed, the SVM performed better than the list, with higher precision and specificity values. We extracted eight independent features from a set of 12, ran experiments on these eight, and found the best features for identifying influential users to be the Followers count, the Average Retweets count, the Average Retweets Frequency and the Age_Activity combination. Two ranking methods were developed and tested on a set of tweets retrieved using a specific query. In the first method, the best four features were combined in different ways. The best combination took the average of the Followers count and the Average Retweets count, producing a precision at 10 of 0.9. In the second method, the users were ranked according to each of the eight independent features, and the top 50 users for each feature were placed in separate lists. The users were then ranked according to how many of these lists they appeared in. The best result was obtained for the users who appeared in six or more of the lists, which gave a precision of 1.0. Both ranking methods were then applied to 20 different collections of retrieved tweets to verify their effectiveness in detecting influential users and to compare their performance. The best result was obtained by the second method, for the set of users who appeared in six or more of the lists, with the highest mean precision of 0.692. Finally, for the SLM, we found a correlation between the users' average Retweets counts and their tweets' perplexity values, which supports the hypothesis that an SLM can be trained to detect highly retweeted tweets. However, using perplexity to identify influential users resulted in very low precision values.
The contributions of this thesis can be summarized as follows. A method for classifying personal accounts was proposed. The features that help detect influential users were identified as the Followers count, the Average Retweets count, the Average Retweets Frequency and the Age_Activity combination. Two methods for identifying influential users were proposed. Finally, the simplistic approach using the SLM did not produce good results, and considerable work remains before the SLM can be used to identify influential users.
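The second ranking method described above (one top-50 list per feature, then ranking users by how many lists they appear in) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the input layout (a dictionary of per-user feature values), the assumption that a larger feature value means a stronger candidate, and the tie-breaking by user id are all hypothetical choices made for illustration.

```python
from collections import Counter


def rank_by_list_votes(users, features, top_n=50, min_lists=6):
    """Rank users by the number of per-feature top-N lists they appear in.

    users:    dict mapping a user id to a dict of feature name -> numeric value
              (hypothetical layout, not taken from the thesis).
    features: feature names to rank by (eight in the abstract).
    Returns (user, list_count) pairs for users appearing in at least
    min_lists lists, highest count first.
    """
    votes = Counter()
    for feature in features:
        # Assumption: a larger value of every feature indicates more influence.
        ranked = sorted(users, key=lambda u: users[u][feature], reverse=True)
        votes.update(ranked[:top_n])
    selected = [(u, c) for u, c in votes.items() if c >= min_lists]
    # Ties are broken by user id only to make the output deterministic.
    return sorted(selected, key=lambda pair: (-pair[1], pair[0]))


def precision_at_k(ranked_users, influential, k=10):
    """Fraction of the first k ranked users that were judged influential."""
    top = ranked_users[:k]
    if not top:
        return 0.0
    return sum(1 for u in top if u in influential) / len(top)
```

With the defaults `top_n=50` and `min_lists=6`, the first function mirrors the "six or more of the lists" setting mentioned in the abstract; the set of influential users passed to `precision_at_k` stands in for the manual judgments used in the evaluation.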


Computer Science & Engineering Department

Degree Name

MS in Computer Science

Graduation Date


Submission Date

May 2014

First Advisor

Rafea, Ahmed

Committee Member 1

Rafea, Ahmed

Committee Member 2

Aly, Sherif


131 p.

Document Type

Master's Thesis

Library of Congress Subject Heading 1


Library of Congress Subject Heading 2

Network analysis (Planning) -- Computer programs.


The author retains all rights with regard to copyright. The author certifies that written permission from the owner(s) of third-party copyrighted matter included in the thesis, dissertation, paper, or record of study has been obtained. The author further certifies that IRB approval has been obtained for this thesis, or that IRB approval is not necessary for this thesis. Insofar as this thesis, dissertation, paper, or record of study is an educational record as defined in the Family Educational Rights and Privacy Act (FERPA) (20 USC 1232g), the author has granted consent to disclosure of it to anyone who requests a copy.

Institutional Review Board (IRB) Approval

Not necessary for this item