UW: 0728840 BI TECH Data Mining Lec. Erin Shellman 5/15/2015

Introduction

Twitter is a social media platform that lets users share their thoughts in a social space. Twitter earns revenue by selling ad space to retailers, but because there are many Twitter bots, retailers who buy ad space do not want their marketing wasted on bots. The goal of this project is to use data mining methods and practices to build a classifier that identifies which Twitter users are bots, so that those accounts can be eliminated.

The data set contains 5000+ rows of Twitter data with the following columns:

- Id: user ID
- Bot: bot identifier (1 means the account is a bot)
- Followers_count: number of followers
- Friends_count: number of friends (accounts the user follows)
- Default_profile: whether the account still uses the default profile
- Default_profile_image: whether the account still uses the default profile image
- Favourites_count: how many tweets the user has favorited
- Geo_enabled: whether geo-location services are enabled
- Account_age: age of the account
- Days_since_last_tweet: number of days since the last tweet
- Listed_count: number of lists the user appears on
- Profile_background_tile: whether the profile background image is tiled
- Status_favorite_count: how many times the last tweet was favorited
- Verified: whether the user is verified

Exploratory Data Analysis

The variables listed above were used as the factors for the analysis. Several of them were then plotted to identify trends in the data.

The output of ‘summary(data)’ shows that followers_count, friends_count, favourites_count, days_since_last_tweet, listed_count and status_favorite_count have the widest ranges of values. The data in these columns is very scattered, so it is important to identify the reasons for this spread and to determine whether Twitter bots contribute to it.

The ggpairs call produced a chart of pairwise comparisons, which shows the overall trend of each factor against the others.
The account_age factor, when paired with other factors such as follower count and friend count, showed the most variation. This suggests that Twitter usage patterns differ with account age.

Additional plots were made to see how many bots fall within each variable filter (blue or ‘1’ means the account is a bot). For follower count, users with fewer than 100 followers tend to be bots, while users with 100 or more followers tend not to be. For friend count, bots have more friends and non-bot (human) users have fewer friends. For account age, bots have younger accounts and non-bot users have older accounts.

Methods

After the exploratory analysis, the next step was to split the data into training and test sets. This step is important because it allows predictive models to be built on one part of the data and their predictions to be compared against the true labels in the other part. The ID column in the bot_vs_not_bot Twitter data was removed, because it only identifies unique users and is irrelevant to finding the factors that classify Twitter bots.

Models

After setting up the training and test sets, the next step was to create classification and logistic regression models to find which factors accurately identify Twitter bots. Several models were implemented: the tree model, bagging model, boosting model, step model and logistic model.
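As a rough illustration of the train/test split described under Methods, the sketch below drops the ID column, shuffles the rows, and holds out a fraction for testing. It is written in Python rather than the R used for this project, and the 25% test fraction, the seed, the function names and the toy rows are all assumptions made for the example. The report does not state the actual split ratio.

```python
import random


def drop_column(rows, name):
    """Return copies of the rows without the given column."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle rows reproducibly and return (train, test)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy rows standing in for the bot_vs_not_bot data.
accounts = [
    {"id": i, "bot": i % 2, "followers_count": 10 * i} for i in range(20)
]

# The id column is removed before splitting, as in the Methods section.
train, test = train_test_split(drop_column(accounts, "id"))
```

Fixing the seed keeps the split reproducible, so every model is trained and evaluated on the same rows.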
Each model was fit with the ‘bot’ factor as the response, and the models were compared against each other to identify which one most accurately distinguishes Twitter bot users from non-bot Twitter users.

After running the ‘resamples’ syntax for the tree model, bagged model, boost model, logistic model and step model, the most accurate model is the boost model. It has the highest accuracy and kappa values, which means that it is the most accurate at indicating which factors and variables identify Twitter bots. The boost model also shows the least variance when plotted against the other models with the dotplot syntax. The model that I would use to identify what factors determine Twitter bots is the boost model.

Results

Boost Model

The boosting model was chosen to determine which factors are important variables for identifying Twitter bots, and it was fit using the bot values as the response. The boosting model gives more weight to values that were misclassified and then decides where to split values in order to better classify which data is a bot and which is not.

The tuning results and plot for the boost model indicate that accuracy and kappa both increase with the number of trees (boosting iterations). Likewise, the deeper the trees, the more accurate the model and the larger the kappa. Of the settings tried, the model with the most boosting iterations and the largest interaction depth has the highest accuracy.
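Because the models are compared on both accuracy and kappa, it may help to see how the two measures differ. The sketch below is a plain Python illustration (not the R code used in this project) of accuracy and Cohen's kappa, where kappa = (p_o - p_e) / (1 - p_e), p_o is the observed agreement and p_e is the agreement expected by chance. The function name and the example labels are hypothetical.

```python
def accuracy_and_kappa(actual, predicted):
    """Return (accuracy, Cohen's kappa) for two equal-length label lists."""
    actual, predicted = list(actual), list(predicted)
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)

    # Observed agreement: the fraction of matching labels (the accuracy).
    observed = sum(a == p for a, p in zip(actual, predicted)) / n

    # Chance agreement: for each label, the product of its frequency in the
    # actual labels and its frequency in the predicted labels.
    labels = set(actual) | set(predicted)
    expected = sum(
        (actual.count(label) / n) * (predicted.count(label) / n)
        for label in labels
    )

    if expected == 1.0:
        # Only one label occurs on both sides, so kappa is undefined.
        return observed, float("nan")
    return observed, (observed - expected) / (1.0 - expected)


# Hypothetical example: always predicting "not bot" (0) on imbalanced labels.
acc, kappa = accuracy_and_kappa([1, 0, 0, 0], [0, 0, 0, 0])
# accuracy is 0.75, yet kappa is 0.0 because the agreement is no better than chance
```

A classifier that always predicts "not bot" can reach a high accuracy on imbalanced data while its kappa stays at zero, which is why it is useful to report both values.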
Next, the predictive value of the boost model was checked to further test whether the model is predictively accurate. The predictive syntax run on the boost model shows that the model does have high accuracy: the accuracy value is 0.97. This shows that the boost classification model can identify, with high predictive accuracy, which factors are indicative of Twitter bots.

To identify which factors in the boost model indicate whether a user is a bot, two further syntaxes were run to review the data in more detail and to visualize the trend of the factors. The factors with the highest level of importance for identifying Twitter bots are follower count, days since last tweet and friend count. The factors with the highest relative influence are the same three. Both outputs therefore agree that these factors are the most indicative for telling Twitter bots apart from other Twitter users.

The next step was to classify Twitter users using these three factors: follower count, days since last tweet and friend count. The data was explored to identify which characteristics of each factor identify a Twitter bot, and the following plots were generated and analyzed (blue or ‘1’ means bot). The graph took a snapshot of 200 followers, with the trend segmented by bot versus not bot. The follower count trend suggests that the lower the follower count, the more likely the Twitter user is a bot.
However, the same trend shows that Twitter users marked in red (not bots, or humans) have varying follower counts across the board, at a very consistent level. In other words, human users show a large variation in follower count, whereas bots have the lowest follower counts.

In the days since last tweet comparison, bots and human Twitter users share a similar trend in tweeting periods. With this factor, it is difficult to identify Twitter bot usage patterns.

Finally, in the friend count trend there is a stark difference between bots and human Twitter users. In this graph, Twitter bots have a larger friend count than human users. This characteristic helps identify which users are Twitter bots and which are not.

To conclude, the best model for identifying which factors reflect Twitter bot activity, and therefore for helping to identify Twitter bots, is the boost model. Based on follower count, days since last tweet and friend count, Twitter bots generally have few followers, while users whose follower counts vary over a wide range are generally human. Twitter bots and human users have very similar tweeting periods, which makes it hard to tell them apart based on days since last tweet. The most compelling identifier is that Twitter bots generally have a larger friend count.

The next step for Twitter would be to identify users who show this behavior and to investigate their Twitter activity further to determine whether they are indeed bots. However, before moving forward and deleting Twitter users, more data is needed to test and evaluate what other factors identify Twitter bots.
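The proposed next step, finding users whose behavior matches the bot profile and investigating them further, could start from a simple screening rule like the sketch below. It is an illustration in Python only: the 100-follower cutoff echoes the exploratory analysis, while the friend-count cutoff of 1000, the function name and the sample accounts are hypothetical and would need to be tuned on real data. Flagged accounts are candidates for review, not confirmed bots.

```python
def flag_for_review(accounts, max_followers=100, min_friends=1000):
    """Return the ids of accounts with few followers and many friends.

    Both thresholds are illustrative assumptions, not values from the report's
    fitted model.
    """
    return [
        account["id"]
        for account in accounts
        if account["followers_count"] < max_followers
        and account["friends_count"] > min_friends
    ]


# Hypothetical sample accounts.
sample_accounts = [
    {"id": "a", "followers_count": 12, "friends_count": 2400},
    {"id": "b", "followers_count": 850, "friends_count": 300},
    {"id": "c", "followers_count": 40, "friends_count": 90},
]

flagged = flag_for_review(sample_accounts)  # only "a" meets both conditions
```

Days since last tweet is left out of the rule because, as noted above, it did not separate bots from human users.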
Discussion

The process above was effective in identifying which factors are important for detecting Twitter bots. However, there were some limitations in the data. There were not enough data or unique Twitter users for the logistic regression and classification tree models. With a bigger sample, we might be able to identify more accurately which model is best at finding the factors that contribute to Twitter bot activity.

The time range over which the data was retrieved was not available, so there is no clear indication of whether the data is one year old, three years old or older. This would be important to know, because it would show whether there is enough sample data to effectively identify the factors that reflect Twitter bot activity.

Another process that could have affected the results was the filter on how much data to review in the trends. Because only a small amount of data was reviewed in the ggplot trends in the ‘Results’ section, additional data was left out, which could affect how the data was analyzed. This I believe is another limitation of the overall process.

Overall, there was enough data to begin the preliminary data mining process and to implement the methodologies in R to review Twitter data and find factors that could predict and identify Twitter bot activity. However, more research and more data are needed to effectively determine which Twitter user accounts should be permanently closed, in order to reassure Twitter’s advertising stakeholders that human Twitter users are looking at their ads.