Detecting Illicit Drug Usage on Twitter

Darrel Pollard, Christopher Homan

January 2016

Abstract

Social media accounts for a large amount of the data produced on the Internet, informing and connecting many users across the world. Some social media, like Twitter, reveal valuable information about their users and user interests. In this project, we use social media to study opioid abuse, an emerging epidemic that resulted in over 20,000 deaths in 2014 alone. We build tools to extract information about opioid use in the Rochester area and answer questions such as: in which regions of the metropolitan area do opioid abusers live? And what leads them to opioid abuse? As a first step, we use supervised and unsupervised learning to build classifiers for detecting mentions of opioid use and abuse in Twitter posts.

1 Introduction

Because of the increased accessibility of the Internet across the globe, many people are connecting to social sites more frequently. Social media users join large communities and share valuable information about themselves, such as their interests, behaviors, sentiments toward certain topics, future plans, relationships with others, and locations. This type of information can be very useful for understanding trends in society or the current mindset of a community. Using social media analysis, we attempt to identify illicit and semi-legal drug use. By building a classifier, we can identify posts about illicit drugs and track the spread of these posts in specific communities. With the increase in drug-related deaths among young people at events like festivals and parties, it is important that we take steps to identify these users and understand why they use these drugs. Is there preventive help we can supply? We would also like to compare the sentiments people hold toward illicit drugs versus drugs like marijuana that are on a road to legalization or viewed as acceptable by society.
Since we are still in the infancy of this research, it is not an easy task to correctly identify these users. We aim to investigate the following questions as the guideline of this project:

RQ1: With all the syntax, slang, and emojis embedded in social media sites, can we build an accurate classifier to identify illicit drug users?

RQ2: How does the sentiment toward drugs viewed as highly illegal differ from that toward drugs with a more open-minded public perception?

2 Related Work

Buntain and Golbeck focus on how illicit drugs spread from one place to another, while tracking which illicit drugs are popular on Twitter over time [BG15]. They use Twitter4j for sampling and filtering the Twitter stream. Using a list of keywords covering 11 drug types, 21 drugs, and 300 slang terms, they sampled tweets from April to July 2014, yielding approximately 247 GB of data, and from October 30 to November 26, 2014, yielding approximately 81 GB. Since their data set was relatively large, they used Twitter's ElephantBird library to help with processing. To address the ambiguity of slang words like OXY, they used Mallet for topic modeling. They showed that the most popular drug to tweet about during the April-to-July period was cocaine. They also performed some location-based analysis: they took the top four drugs (cocaine, heroin, methamphetamine, and OxyContin) and mapped them by geographical location across the United States. According to their results, heroin is much more popular than OxyContin in our primary region of interest, Rochester, New York.
A common question that may be asked is, "Can we be sure a user is tweeting about their own behavior or interests?" Paul and Dredze adopt an approach used by health care facilities to collect data (known as sentinel surveillance) to determine whether tweets may contain information about the health status of a Twitter user [PD11]. They then used an extended ailment topic aspect model (ATAM+) to create structured disease information from tweets. ATAM is a topic model that associates symptom and treatment words with latent ailments; ATAM+ extends it by incorporating prior knowledge about known ailments. This was used to model how users express their feelings toward their ailments and illnesses, using a list of keywords to identify specific ailments, symptoms, and treatments in tweets. They also used latent Dirichlet allocation and scored a tweet's topic identification with mean reciprocal rank (MRR). Two independent evaluators tested the ATAM+ model, and both agreed that ATAM+ produced better results than the regular ATAM.

Being able to correctly identify a drug user is important, but monitoring the influence or spread of drugs is equally important. Sadilek et al. [SKS12] focused on identifying the spread of diseases from social interaction, using data sources like Twitter and Google. By collecting tweets and classifying them with an SVM to detect illness-related tweets, they monitored these users, their friends (i.e., those users who follow each other), and their geo-locations. All information was obtained from the Twitter search API.
The researchers were able to model the spread of disease based on the proximity of co-located tweets, i.e., those within a 100-meter cell and a time window of 2 days. Unlike Google Flu Trends, which identifies influenza specifically, they were able to detect influenza-like illness. They also observed an exponential relationship between the probability of disease spread and physical encounters inferred from the Twitter location data.

The ability to detect sentiment can reveal much about a community's intentions and views on a certain matter. Nakov et al. focused on sentiment analysis in Twitter [NKR+13]. Their goal was to build a system that could correctly detect and label sentiment in a large amount of data. The work was organized into two tasks. Task A was contextual polarity disambiguation: given a particular phrase, determine whether it carries positive, negative, or neutral sentiment; for messages that contained more than one sentiment, annotators were asked to determine which sentiment was stronger. Task B was message polarity classification: using both Twitter tweets and SMS messages, participants were tasked with labeling messages as positive, negative, neutral, or containing no subjective words or phrases. They used SentiWordNet to filter the tweets collected via the Streaming API, removing messages that did not contain at least one positive or negative word. They scored both tasks by testing the precision of the annotated messages. Due to its complexity, Task B yielded lower precision than Task A, which implies that searching a message for sentiment words is easier than determining the full sentiment of a message. Kouloumpis et al. also pursued sentiment analysis in their research [KWM11].
They developed a system to classify three data sets: hashtag data from the Edinburgh Twitter Corpus, emoticon data from Stanford University, and hand-annotated tweets from the iSieve Corporation. They preprocessed the data in three steps: tokenization (which identified emoticons and abbreviations), normalization, and part-of-speech tagging. For classification, they used four kinds of features: n-grams (mainly unigrams and bigrams), lexicons, parts of speech, and microblog-specific features such as positive, negative, or neutral emoticons and abbreviations. Their results showed that parts of speech contribute very little to classifying the sentiment of a tweet compared to the combination of the other three.

One of our main goals is to classify tweets as drug-related or non-drug-related. Pennacchiotti and Popescu used a machine learning approach to tackle the task of classifying tweets [PP11]. In order to classify a user, they focus on certain features: regular expressions for self-identification and location; tweeting behavior, such as the frequency of tweets and shared URLs; linguistic content, including prototypical words, prototypical hashtags, and sentiment words; and the social connections between users. They evaluated their system by detecting political affiliation and ethnicity and by identifying Starbucks fans.

3 Data

Our main source of data is the Twitter streaming API, accessed with the help of Tweepy. Through Tweepy we set up two main streams. One stream collected tweets geo-located in the United States. The other pulled tweets matching a list of known drug keywords and hashtags. Tweets were also collected from the Detroit and Rochester areas. We partitioned the data we obtained into two different data sets. The first data set was constructed from a portion of the drug keyword tweets and the tweets gathered across North America; it contained about 1,500 tweets.
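As an illustration of the keyword-based stream, its matching logic can be sketched as follows. The keyword set below is a small hypothetical stand-in for the study's actual list of drug terms and hashtags, and the matching is simplified; in practice the filtering was delegated to Tweepy's streaming filter.

```python
# Hypothetical, abbreviated stand-in for the study's drug keyword list.
DRUG_KEYWORDS = {"kush", "oxy", "heroin", "molly", "#weed"}

def matches_keywords(tweet_text, keywords=DRUG_KEYWORDS):
    """Return True if any tracked keyword or hashtag appears in the tweet."""
    tokens = tweet_text.lower().split()
    # Check each token both raw (for hashtags) and with punctuation stripped.
    return any(tok in keywords or tok.strip("#.,!?") in keywords
               for tok in tokens)
```

A tweet that matches is retained for the drug keyword data set; note that, as discussed in Section 4, a keyword match alone does not guarantee the tweet is actually about drug use.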
The second data set was constructed from another portion of the drug keyword tweets, Rochester data from January 2015, and Detroit tweets from November 2015; it contained 1,605 tweets in total. It was not enough to label tweets simply based on whether a drug word was present. To correctly label a tweet, a few things had to be taken into consideration. What is the context in which the drug keyword was used? Is the tweet about drugs or about drug usage? For instance, tweets such as "Drugs are bad" or "legalization of weed will not be a good thing" express sentiment toward drugs, while tweets such as "I am having kush withdraw" relate closely to drug usage. Once a tweet is marked as drug-related through a Boolean value, it is then examined closely so that the sentiment within the tweet can be extracted.

4 Methods

Although only a small set of tweets was used for machine learning, we analyzed a fair amount more of the data, and two things drew our attention. First, a keyword search was not sufficient to identify tweets related to drugs: drug words are often used descriptively rather than to describe drug use. Second, drug-related tweets are rare. For roughly every 2,000 tweets in the drug keyword data there were about 100-200 tweets about drug use, and this number does not exclude duplicate tweets. In the general tweet data there were about 2 drug-related tweets for every 2,000. In order to gather drug-related tweets correctly and at scale, we decided to create a classifier to identify them.

4.1 Pre-processing

In order to make the data compatible with the machine learning API we chose, we had to prepare it. The first step was to annotate all the data, updating each tweet's JSON with a new field describing whether the tweet was drug-related.
This field was called "Drug relation" and held a value of true or false. Although we tried to keep the labeling as binary as possible, some tweets did not fall easily into the two categories; tweets that proved difficult to label were stored in an "unsure" file for further review. In deciding whether a tweet is drug-related, two aspects are of great importance: the sentiment of the tweet, and whether the drug words were used only for the purpose of description. The next step was to parse all tweets, removing stop words and punctuation. The list of stop words came from NLTK's stop-words corpus.

4.2 SVM

For classification we chose a Support Vector Machine with a linear kernel, evaluated with the F1-score. This was a simple task using the scikit-learn libraries. First we used CountVectorizer to extract features from each tweet, which served as the x values of a sparse matrix. The features we extracted from each tweet were trigrams. Once the feature matrix was created, the drug-relation fields were extracted from the tweets to be used as y values. The y values were stored as a separate list of true or false values, later converted to the numerical values 0 and 1. To make sure the y values lined up with the rows of the sparse matrix, the x values were filled in and stored as a NumPy array at the same time. The vectorizer, y-value list, and x-value array were stored for the next stage in building the classifier.

4.3 Classifier A

The first classifier was created using scikit-learn's grid-search cross-validation (GridSearchCV).
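The feature-extraction and training steps of Section 4.2 can be sketched with scikit-learn as follows. The four in-line tweets and their labels are fabricated placeholders for the annotated data, and `class_weight="balanced"` stands in for the grid-searched class weights described in Section 4.3.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Placeholder corpus standing in for the annotated tweets; labels are
# the "Drug relation" field converted from true/false to 1/0.
tweets = ["having kush withdraw again",
          "legalization of weed will not be a good thing",
          "great day at the park",
          "need my oxy fix right now"]
labels = np.array([1, 0, 0, 1])

# Extract trigram counts as the sparse feature matrix (Section 4.2).
vectorizer = CountVectorizer(ngram_range=(3, 3))
X = vectorizer.fit_transform(tweets)

# Linear-kernel SVM; class_weight counters the drug/non-drug imbalance.
clf = LinearSVC(class_weight="balanced")
clf.fit(X, labels)
```

In the full pipeline, GridSearchCV would additionally search over the candidate class weights and score candidates by F1.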
To train and test the classifier, data set 1, which contained tweets from the United States and drug keyword data, was split into two parts: 80 percent of the data was randomly selected for training and the rest was used for testing. As mentioned, there is an imbalance between drug-related and non-drug-related tweets. To address this, a set of candidate class weights was constructed to find the best weights for training. The classifier was trained with the parameter grid settings, including the list of class weights. The remaining 20 percent of the data was used to test the classifier's performance.

4.4 Classifier B

After analyzing the performance of classifier A, we decided we needed to improve it. Due to the rarity of drug-related tweets, we passed the second data set through classifier A. A more refined set was then created from data set 2, containing all of the tweets classifier A predicted as true and half of those predicted as false. Using this refined data together with all of the tweets from data set 1, we trained a new classifier B, again using 80 percent of the data for training and 20 percent for testing. This method is known as bootstrapping: use one iteration of classification to train a new and improved classifier. See Figure 1 for classifier B's results.

Figure 1: Classification prediction performance. (a) Classifier A: data set 1. (b) Classifier B: bootstrap results.

5 Results

After creating the first classifier and analyzing the results, we noticed a few things. The performance of classifier A was affected by the random seed used to split the data into training and test sets. We believe that certain tweets are so distinctive that, when they are selected for training, they shift scores across the board by about 5 percent. This led to the development of the second classifier.
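The refinement rule used to build classifier B's training data (Section 4.4) can be sketched as follows. Here `classifier_a` is any object with a per-tweet `predict` method standing in for the trained classifier A, and the function name is ours, not the project's.

```python
import random

def build_refined_set(classifier_a, data_set_2, seed=0):
    """Refine data set 2 using classifier A's predictions: keep every
    tweet predicted drug-related, plus half of those predicted
    non-drug-related, to soften the class imbalance."""
    predicted_true = [t for t in data_set_2 if classifier_a.predict(t) == 1]
    predicted_false = [t for t in data_set_2 if classifier_a.predict(t) == 0]
    rng = random.Random(seed)
    kept_false = rng.sample(predicted_false, len(predicted_false) // 2)
    return predicted_true + kept_false
```

Classifier B is then trained on data set 1 plus this refined set, again with an 80/20 train/test split.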
Given the lack of positive hand-labeled drug-related tweets, we bootstrapped classifier A to create classifier B as a way to accommodate the shortage. Even if we obtained more data, we did not want to lose the progress already made. By bootstrapping we were able to increase recall and F1-score by 10 percent over classifier A. Ideally, bootstrapping should be repeated several times, each time using the previous classifier's predictions together with all previous data to train a new classifier.

6 Conclusion

By using the bootstrapping method we were able to improve classification performance. Bootstrapping has proven to improve the performance of a new classifier to a certain extent, but more data to balance out the class imbalance would also improve performance. In the future we would also like to add sentiment analysis as a way to improve detection of drug-related topics.

References

[BG15] Cody Buntain and Jennifer Golbeck. This is your Twitter on drugs: Any questions? In Proceedings of the 24th International Conference on World Wide Web Companion, pages 777-782. International World Wide Web Conferences Steering Committee, 2015.

[Bil13] Kerry Mason. Electric Zoo festival canceled after two deaths. Billboard, 2013.

[BNG11] Hila Becker, Mor Naaman, and Luis Gravano. Beyond trending topics: Real-world event identification on Twitter. 2011.

[Fel13] Ronen Feldman. Techniques and applications for sentiment analysis. Communications of the ACM, 56(4):82-89, April 2013.

[GLG+13] Abhishek Gattani, Digvijay S. Lamba, Nikesh Garera, Mitul Tiwari, Xiaoyong Chai, Sanjib Das, Sri Subramaniam, Anand Rajaraman, Venky Harinarayan, and AnHai Doan. Entity extraction, linking, classification, and tagging for social media: A Wikipedia-based approach. Proceedings of the VLDB Endowment, 6(11):1126-1137, August 2013.

[HCBGC13] Carl Lee Hanson, Ben Cannon, Scott Burton, and Christophe Giraud-Carrier. An exploration of social circles and prescription drug abuse through Twitter. Journal of Medical Internet Research, 15(9):e189, 2013.

[KPB11] Sheila Kinsella, Alexandre Passant, and John G. Breslin. Topic classification in social media using metadata from hyperlinked objects. In Advances in Information Retrieval, pages 201-206. Springer, 2011.

[KWM11] Efthymios Kouloumpis, Theresa Wilson, and Johanna D. Moore. Twitter sentiment analysis: The good the bad and the OMG! ICWSM, 11:538-541, 2011.

[NKR+13] Preslav Nakov, Zornitsa Kozareva, Alan Ritter, Sara Rosenthal, Veselin Stoyanov, and Theresa Wilson. SemEval-2013 Task 2: Sentiment analysis in Twitter. 2013.

[PD11] Michael J. Paul and Mark Dredze. You are what you tweet: Analyzing Twitter for public health. ICWSM, 20:265-272, 2011.

[PD12] Michael J. Paul and Mark Dredze. A model for mining public health topics from Twitter. Health, 11:16-6, 2012.

[PP10] Alexander Pak and Patrick Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In LREC, volume 10, pages 1320-1326, 2010.

[PP11] Marco Pennacchiotti and Ana-Maria Popescu. A machine learning approach to Twitter user classification. ICWSM, 11(1):281-288, 2011.

[QAH13] Guo-Jun Qi, Charu C. Aggarwal, and Thomas S. Huang. Online community detection in social sensing. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM '13, pages 617-626, New York, NY, USA, 2013. ACM.

[RC12] Timo Reuter and Philipp Cimiano. Event-based classification of social media streams. In Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, ICMR '12, pages 22:1-22:8, New York, NY, USA, 2012. ACM.

[SKS12] Adam Sadilek, Henry A. Kautz, and Vincent Silenzio. Modeling spread of disease from social interactions. In ICWSM, 2012.

[SNP+15] Lukas Shutler, Lewis S. Nelson, Ian Portelli, Courtney Blachford, and Jeanmarie Perrone. Drug use in the twittersphere: A qualitative contextual analysis of tweets about prescription drugs. Journal of Addictive Diseases, 34(4):303-310, 2015.

[SOG+16] Abeed Sarker, Karen O'Connor, Rachel Ginn, Matthew Scotch, Karen Smith, Dan Malone, and Graciela Gonzalez. Social media mining for toxicovigilance: Automatic monitoring of prescription medication abuse from Twitter. Drug Safety, pages 1-10, 2016.