UW: 0728840
BI TECH Data Mining
Lec. Erin Shellman
5/15/2015
Introduction
Twitter is a social media platform that lets users share thoughts in a public space. Twitter earns revenue by selling ad
space to retailers, but because there are many Twitter bots, retailers who buy ad space do not want to waste their
marketing on bots. The goal of this project is to use data mining methods and practices to build a classifier that
identifies which Twitter users are bots, so that those accounts can be eliminated.
The data set contains 5,000+ rows of Twitter data with the following columns:
- Id: user ID
- Bot: bot identifier (1 means the user is a bot)
- Followers_count: how many followers the user has
- Friends_count: how many friends the user has
- Default_profile: whether the user has kept the default profile theme
- Default_profile_image: whether the user has kept the default profile image
- Favourites_count: how many tweets the user has favorited
- Geo_enabled: whether geo-location services are enabled
- Account_age: age of the account
- Days_since_last_tweet: number of days since the user's last tweet
- Listed_count: how many public lists the user appears on
- Profile_background_tile: whether the profile background image is tiled
- Status_favorite_count: how many times the user's last tweet was favorited
- Verified: whether the user is verified
Exploratory Data Analysis
The variables above were used as the factors for the analysis. They were then summarized and plotted to identify
trends for exploratory data analysis. The output of ‘summary(data)’ shows that followers_count, friends_count,
favourites_count, days_since_last_tweet, listed_count and status_favorite_count have the widest spread of values.
The data in these columns is very scattered and covers a very large range, so it is important to identify the reasons
for this scatter and to determine whether Twitter bots contribute to it.
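As a rough illustration of what "widely scattered" means for a count column, the sketch below computes a few of the statistics that a summary would report. It is written in Python rather than the R used for this project, the values are made up for illustration and are not taken from the project data, and the helper name spread_summary is hypothetical.

```python
import statistics

# Hypothetical values for one skewed count column (not taken from the project data).
followers_count = [3, 12, 45, 80, 150, 900, 25000]


def spread_summary(values):
    """Return min, median, mean and max, similar in spirit to a summary table."""
    return {
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


summary = spread_summary(followers_count)
print(summary)
```

A mean far above the median, as in this made-up column, is one sign that a few very large values are stretching the range.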
The ggpairs function produced a pairwise plot of the variables, which shows how each factor trends against the
others. The account_age factor, when paired with factors such as follower count and friend count, showed the most
variance. This suggests that Twitter usage patterns differ at different account ages.
Additional inspections were made to identify how many bots fall under each variable filter (blue or ‘1’ means the user
is a bot). Filtering on follower count, it looks as though Twitter users with fewer than 100 followers tend to be bots,
and those with 100 or more followers tend not to be bots.
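The follower-count filter described above can be sketched as a simple cross-tabulation. This is a minimal Python sketch, not the R code used in the project: the rows are hypothetical, the helper crosstab_by_threshold is an illustrative name, and only the threshold of 100 comes from the text.

```python
from collections import Counter

# Hypothetical (followers_count, bot) pairs; bot is 1 for a bot and 0 for a human.
rows = [(12, 1), (40, 1), (95, 1), (60, 0), (150, 0), (800, 0), (2300, 0), (30, 1)]


def crosstab_by_threshold(rows, threshold=100):
    """Count rows by (followers below threshold?, bot label)."""
    counts = Counter()
    for followers, bot in rows:
        counts[(followers < threshold, bot)] += 1
    return counts


counts = crosstab_by_threshold(rows)
below = counts[(True, 1)] + counts[(True, 0)]
bot_share_below = counts[(True, 1)] / below
print(counts)
print(f"Share of bots among users below the threshold: {bot_share_below:.2f}")
```

Reading the share of bots on each side of the threshold, rather than only the plot colours, makes it easier to see how strong the pattern actually is.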
Comparing friend counts for bots versus non-bot Twitter users, it looks as though Twitter users that are bots have
more friends and Twitter users that are not bots (humans) have fewer friends.
Looking at the account age factor by bot or non-bot users, it looks like Twitter users that are bots have younger
accounts and Twitter users that are not bots have older accounts.
Methods
After performing exploratory analysis on the data, the next step is to split the data into training and test sets.
This step is important so that predictive models can be built on the training data and their predictions compared
against the true labels in the test data. The ID column in the bot_vs_not_bot Twitter data was removed because it
only identifies unique users, which is irrelevant to identifying the factors that could classify Twitter bots.
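The split and the removal of the ID column can be sketched as follows. This is a minimal Python illustration and not the R code used in the project: the rows are hypothetical, and the 25% test fraction and the seed are assumptions, since the report does not state the split that was actually used.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle rows with a fixed seed and split them into train and test lists."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical rows with three of the columns from the data set.
rows = [{"id": i, "bot": i % 2, "followers_count": 10 * i} for i in range(20)]

# Drop the id column: it identifies a user but carries no signal about bots.
features = [{k: v for k, v in row.items() if k != "id"} for row in rows]

train, test = train_test_split(features)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so that every model is trained and evaluated on the same rows.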
Models
After setting up the training and test sets, the next step is to build classification and logistic regression models
on the ‘bot’ factor to help identify which factors accurately identify Twitter bots. Several models were implemented:
the tree model, bagging model, boosting model, step model and logistic model. These models are then compared against
each other to identify which is most accurate in separating Twitter bot users from non-bot Twitter users.
After implementing the different models, the next step is to compare them to see which was most accurate in
determining the factors that are indicative of Twitter bots. After running ‘resamples’ on the tree model, bagged
model, boost model, logistic model and step model, the most accurate model is the boost model. It has the highest
accuracy and kappa values, and it also shows the least variance when plotted against the other models with dotplot.
The model that I would use to identify which factors determine Twitter bots is therefore the boost model.
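Accuracy and kappa are the two values used above to compare the models. As a reminder of what they measure, the Python sketch below computes both from scratch on made-up labels (1 for bot, 0 for human). It is not the R code used in the project, and the function name accuracy_and_kappa is hypothetical.

```python
def accuracy_and_kappa(actual, predicted):
    """Return (accuracy, Cohen's kappa) for two equal-length label lists."""
    n = len(actual)
    # Observed agreement: the share of rows where prediction and truth match.
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    # Expected agreement if predictions were independent of the truth.
    labels = set(actual) | set(predicted)
    expected = sum(
        (actual.count(label) / n) * (predicted.count(label) / n) for label in labels
    )
    # Kappa is undefined when expected == 1 (only one label on both sides).
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Made-up labels for illustration only.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

accuracy, kappa = accuracy_and_kappa(actual, predicted)
print(f"accuracy={accuracy:.2f} kappa={kappa:.2f}")
```

Kappa corrects accuracy for the agreement that would be expected by chance, which is why it is useful to check alongside accuracy when one class is more common than the other.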
Results
Boost Model
The boosting model was chosen to determine which factors are important for identifying Twitter bots, and it was
built using the ‘bot’ values as the outcome. The boosting model gives more weight to values that were misclassified
and then decides where to split values in order to better classify which users are bots and which are not.
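The idea of giving more weight to misclassified values can be illustrated with a small AdaBoost-style sketch in Python. This is an assumption-laden illustration and not the R model used in this project, which may work differently internally: it uses a single hypothetical feature, made-up values, one-split decision stumps as the weak learners, and labels of +1 for bot and -1 for human.

```python
import math


def stump_predict(x, threshold, polarity):
    """Predict polarity below the threshold and the opposite label otherwise."""
    return polarity if x < threshold else -polarity


def best_stump(xs, ys, weights):
    """Find the (weighted error, threshold, polarity) with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=3):
    """Fit `rounds` stumps, re-weighting the rows after each round."""
    n = len(xs)
    weights = [1 / n] * n
    model = []
    for _ in range(rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, threshold, polarity))
        # Misclassified rows (y * prediction == -1) are scaled up, correct rows down.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model, weights


def predict(model, x):
    """Combine the stumps by a weighted vote."""
    score = sum(alpha * stump_predict(x, thr, pol) for alpha, thr, pol in model)
    return 1 if score >= 0 else -1


# Made-up follower counts; +1 marks a bot and -1 marks a human.
xs = [10, 20, 30, 400, 50, 600, 700, 800]
ys = [1, 1, 1, 1, -1, -1, -1, -1]

_, weights_after_one = adaboost(xs, ys, rounds=1)
model, _ = adaboost(xs, ys, rounds=3)
print([round(w, 3) for w in weights_after_one])
print([predict(model, x) for x in xs])
```

After the first round, the single row that the first stump gets wrong (the bot with 400 followers) carries half of the total weight, so the next stump is pushed to classify it correctly.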
The results and the graph for the boost model indicate that accuracy and kappa both increase as the number of trees
(boosting iterations) increases. Likewise, the greater the depth of the trees, the more accurate the model is and the
larger the kappa. This means that the model with the most boosting iterations and the largest interaction depth has
the highest accuracy.
To look in more detail, predictions were made with the boost model to test further whether it is predictively
accurate. The predictions show that the model does have high accuracy: the accuracy value is 0.97.
The boost model is therefore able to classify Twitter bots with high predictive accuracy. To identify which factors in
the boost model are indicative of Twitter bots, the model output was reviewed in more detail, both in the data and in
the visual trend of the factors.
According to the boosting model, the factors with the highest importance for identifying Twitter bots are follower
count, days since last tweet and friend count. The same three factors also have the highest relative influence. Both
outputs from the boost model therefore point to these factors as the most indicative in separating Twitter bots from
human Twitter users.
The boost model can accurately predict which users are Twitter bots and identify which factors are influential in
doing so. The next step is to characterize types of Twitter users with the following factors: follower count, days
since last tweet and friend count. For each of these factors, the data was explored to identify which characteristics
mark a Twitter bot. Plots were generated and analyzed for each factor (blue or ‘1’ means bot).
The graph took a snapshot of 200 followers, with the trend segmented by bot versus not bot. The follower count trend
suggests that the lower the follower count, the more likely the Twitter user is a bot. However, Twitter users marked
in red (not bots, i.e. humans) appear at a fairly consistent level across the whole range of follower counts. That
means that Twitter users that are not bots vary widely in follower count, whereas Twitter users that are bots have the
lowest follower counts.
In the days since last tweet comparison, bots and human Twitter users share a similar trend in tweeting periods. With
this factor, it is difficult to identify a usage pattern that is specific to Twitter bots.
Finally, in the friend count trend, there is a stark difference between Twitter bots and human Twitter users. In this
graph, Twitter bots have larger friend counts, while human Twitter users have lower friend counts. This characteristic
helps identify which users are Twitter bots and which are not.
To conclude, the best model for identifying which factors reflect Twitter bot activity, and therefore for helping to
identify Twitter bots, is the boost model. Of the three most important factors (follower count, days since last tweet
and friend count), follower count shows that Twitter bots generally have few followers, while human Twitter users span
a wide range of follower counts. Twitter bots and human Twitter users have very similar tweeting periods, which makes
it hard to tell a bot from a human based on days since last tweet. The most compelling identifier is that Twitter bots
generally have larger friend counts. The next step for Twitter would be to identify Twitter users that show this
behavior and investigate their Twitter activity further to determine whether they are indeed Twitter bots. However,
before moving forward and deleting Twitter users, more data is needed to test and evaluate which additional factors
identify Twitter bots.
Discussion
The process above was effective in identifying which factors are important for determining Twitter bots. However,
there were some limitations in the data. There was not enough data, or not enough unique Twitter users, for the
logistic regression and classification tree models. If the sample size were bigger, we might be able to identify more
accurately which model is best at identifying the factors that contribute to Twitter bot activity. The time range over
which this data was retrieved was not available, so there is no clear indication of whether the data is one year old,
three years old or more. This would be important to know, because it would help establish whether there is enough
sample data to effectively identify the factors that reflect Twitter bot activity. Another process that could have
affected the results was the filter on how much data to review in the trends. Because only a small amount of data was
reviewed in the ggplot trends in the ‘Results’ section, the rest of the data was left out, which could affect how the
data was analyzed. I believe this is another limitation of the overall process. Overall, there was enough data to
begin the preliminary data mining process and implement the methodologies in R to review Twitter data and find factors
that could predict and reflect Twitter bot activity. More research and more data are needed, however, to effectively
determine which Twitter user accounts should be permanently closed in order to reassure Twitter’s advertising
stakeholders that human Twitter users are looking at their ads.