Detecting Illicit Drug Usage on Twitter
Darrel Pollard, Christopher Homan
January 2016
Abstract
Social media accounts for a large amount of the data produced on the
Internet, informing and connecting many users across the world. Some
social media, like Twitter, reveal valuable information about their users
and user interests. In this project, we use social media to study opioid
abuse, an emerging epidemic resulting in over 20,000 deaths in 2014 alone.
We build tools to extract information about opioid use in the Rochester
area and answer questions such as: In which regions of the metropolitan
area do opioid abusers live? What leads them to opioid abuse? As a
first step we use supervised and unsupervised learning to build classifiers
for detecting mentions of opioid use and abuse in Twitter posts.
1 Introduction
As Internet access has expanded across the globe, people are connecting to
social sites with increasing frequency. Social media users join large communities
and share valuable information about themselves, such as their interests,
behaviors, sentiment toward certain topics, future plans, relationships with
others, and locations. This type of information can be very useful for
understanding trends in society or the current mindset of a community.
Using social media analysis, we attempt to identify illicit and semi-legal drug
use. By building a classifier, we can identify posts about illicit drugs and track
the spread of these posts in specific communities. With the increase in
drug-related deaths at youth-oriented events such as festivals and parties, it is
important that we take steps to identify these users and understand why they
use these drugs. Is there preventive help we can supply? We would also like to
compare the sentiments people hold toward illicit drugs versus drugs like
marijuana that are on a road to legalization or are viewed as acceptable by society.
Since we are still in the infancy of this research, it is not an easy task to
correctly identify these users. We aim to investigate the following questions as
the guidelines of this project:
RQ1: With all the syntax, slang, and emojis embedded in social media sites,
can we build an accurate classifier to identify illicit drug users?
RQ2: How does sentiment toward drugs viewed as highly illegal differ from
sentiment toward drugs perceived with a more open mind?
2 Related Work
Buntain and Golbeck focus on how illicit drugs spread from one place to another,
while tracking which illicit drugs are popular on Twitter over time [BG15]. They
use Twitter4j for sampling and filtering the Twitter stream.
They used a list of keywords covering 11 drug types, 21 drugs, and 300 slang
terms to sample tweets from April to July 2014, yielding approximately 247 GB
of data, and from October 30 to November 26, 2014, yielding approximately 81 GB.
Since their data set was relatively large, they used Twitter's Elephant Bird
library to help process it.
To address the ambiguity of slang words like OXY, they used Mallet for topic
modeling. They showed that the most popular drug to tweet about during the April
to July period was cocaine. The paper also includes some location-based
analysis: they took the top four drugs (cocaine, heroin, methamphetamine, and
OxyContin) and mapped them by geographical location within the United States.
According to their results, heroin is much more popular than OxyContin in
our primary region of interest, Rochester, New York.
A common question is, "Can we be sure that a user is tweeting about their own
behavior or interests?" Paul and Dredze adopt an approach used by health care
facilities to collect data (known as sentinel surveillance) to determine whether
tweets may contain information about a Twitter user's health status [PD11]. They
then used the ailment topic aspect model (ATAM), a topic model that associates
symptoms and treatments with latent ailments, and an extension, ATAM+, which
incorporates prior knowledge about known ailments. These models capture how
users express their feelings toward their ailments and illnesses, using lists of
keywords to identify specific ailments, symptoms, and treatments in tweets. They
also used latent Dirichlet allocation and scored each model's topic assignments
by mean reciprocal rank (MRR). Two independent evaluators tested the ATAM+
model, and both agreed that ATAM+ produced better results than the original ATAM.
Being able to correctly identify a drug user is important, but monitoring the
influence or spread of drugs is equally important. Sadilek et al. focused on
identifying the spread of disease through social interaction, using data sources
like Twitter and Google [SKS12]. They collected tweets and classified them with
an SVM to detect illness-related tweets, then monitored these users, their
friends (i.e., those users who follow each other), and their geo-locations, all
obtained from the Twitter search API. The researchers were able to model the
spread of disease based on the proximity of co-located tweets, i.e., those
within a 100-meter cell and a time window of 2 days. Unlike Google Flu Trends,
which identifies influenza specifically, they were able to detect influenza-like
illness. They also observed an exponential relationship between the probability
of disease spread and physical encounters inferred from Twitter location data.
The ability to detect sentiment can reveal much about a community's intentions
and views on a certain matter. Nakov et al. focused on sentiment analysis in
Twitter [NKR+13]. Their task was to build a system that could correctly detect
and label sentiment over a large amount of data. The work was organized into two
tasks. Task A was contextual polarity disambiguation: given a particular phrase
within a message, determine whether it expresses positive, negative, or neutral
sentiment; for messages that contained more than one sentiment, annotators were
asked to determine which sentiment was stronger. Task B was message polarity
classification: using both Twitter tweets and SMS messages, participants were
tasked with labeling whole messages as positive, negative, neutral, or
containing no subjective words/phrases. They also used SentiWordNet to filter
the tweets collected via the Streaming API, removing messages that did not
contain at least one positive or negative word. They scored both tasks against
the annotated messages. Due to its complexity, Task B yielded lower precision
than Task A, which implies that searching a message for sentiment words is
easier than determining the full sentiment of a message.
Kouloumpis et al. also pursued sentiment analysis in their research [KWM11].
They developed a system that classified three data sets: hashtag data from the
Edinburgh Twitter Corpus, emoticon data from Stanford University, and
hand-annotated tweets from the iSieve Corporation. They preprocessed the data in
three steps: tokenization (which identified emoticons and abbreviations),
normalization, and part-of-speech tagging. For classification, they used four
feature types: n-grams (mainly unigrams and bigrams), lexicons, parts of speech,
and microblog-specific features such as positive, negative, or neutral emoticons
and abbreviations. Their results showed that part-of-speech features have very
little effect on classifying the sentiment of a tweet compared to the
combination of the other three.
One of our main goals is to classify tweets as drug-related or non-drug-related.
Pennacchiotti and Popescu used a machine learning approach to tackle the task of
classifying Twitter users [PP11]. To classify a user they focus on several
feature types: regular expressions over profile fields for self-identification
and location; tweeting behavior, such as tweet frequency and shared URLs;
linguistic content, including prototypical words, prototypical hashtags, and
sentiment words; and the social connections between users. They evaluated their
system by detecting political affiliation and ethnicity and by identifying
Starbucks fans.
3 Data
Our main source of data is the Twitter Streaming API, accessed with the help of
Tweepy. Through Tweepy we set up two main streams: one collected tweets
geo-located in North America, and the other pulled tweets matching a list of
known drug keywords and hashtags. Tweets were also collected from the Detroit
and Rochester areas. We partitioned all the data we obtained into two data sets.
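As a rough illustration, the two streams can be set up as in the following
minimal sketch. It uses the Tweepy 3.x StreamListener interface (current at the
time of this work; removed in Tweepy 4); the credential placeholders, file
names, bounding box, and keyword list are illustrative assumptions, not our
exact configuration.

```python
import tweepy

CONSUMER_KEY = "..."      # placeholder credentials
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_SECRET = "..."

class SaveListener(tweepy.StreamListener):
    """Append the raw JSON of each incoming tweet to a file."""
    def __init__(self, path):
        super(SaveListener, self).__init__()
        self.path = path

    def on_data(self, data):
        with open(self.path, "a") as f:
            f.write(data)
        return True  # keep the stream open

    def on_error(self, status_code):
        return status_code != 420  # disconnect when rate-limited

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

# filter() blocks, so in practice each stream runs in its own process.
# Stream 1: tweets geo-located in North America (bounding box illustrative).
geo_stream = tweepy.Stream(auth, SaveListener("geo_tweets.json"))
geo_stream.filter(locations=[-130.0, 20.0, -60.0, 55.0])

# Stream 2: tweets matching drug keywords and hashtags (list illustrative).
kw_stream = tweepy.Stream(auth, SaveListener("keyword_tweets.json"))
kw_stream.filter(track=["oxy", "kush", "#molly"])
```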
The first data set was constructed from a portion of the drug keyword tweets and
the tweets gathered across North America; it contained about 1,500 tweets. The
second data set was constructed from another portion of the drug keyword tweets,
Rochester data from January 2015, and Detroit tweets from November 2015; it
contained 1,605 tweets. It was not enough to label tweets simply on whether a
drug word was present. To correctly label a tweet, a few things had to be taken
into consideration: In what context was the drug keyword used? Is the tweet
about drugs or about drug usage? For instance, tweets such as "Drugs are bad" or
"legalization of weed will not be a good thing" express sentiment toward drugs,
while tweets such as "I am having kush withdraw" relate closely to drug usage.
Once a tweet is marked as drug-related through a Boolean value, it is then
examined more closely so that the sentiment within it can be extracted.
4 Methods
Although only a small set of tweets was used for machine learning, we analyzed a
fair amount more of the data, and two things drew our attention. First, a
keyword search alone was not sufficient to identify tweets related to drugs:
drug words are often used descriptively rather than to describe actual drug use.
Second, drug-related tweets are rare. For roughly every 2,000 tweets in the drug
keyword data there were about 100 to 200 tweets about drug use, and this count
does not exclude duplicate tweets; in the general tweet data, only about 2 in
every 2,000 tweets were drug-related. To gather drug-related tweets correctly at
a large scale, we decided to create a classifier to identify them.
4.1 Pre-processing
To make the data compatible with the machine learning API we chose, we first had
to prepare it. The first step was to annotate all the data, updating each
tweet's JSON with a new field describing whether the tweet was drug-related or
not. This field was called "Drug relation" and held a value of true or false.
Although we tried to keep the labeling as binary as possible, tweets did not
always fall easily into the two categories; tweets that proved difficult to
label were stored in a separate "unsure" file for further review. In deciding
whether a tweet is drug-related, two aspects are of great importance: the
sentiment of the tweet, and whether the drug words were used only for
description. The next step was to parse all tweets, removing stop words and
punctuation. The list of stop words came from NLTK's stop-words corpus.
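A minimal sketch of this step is shown below, assuming one tweet JSON object per
line with a "text" field; the file names are illustrative, and the "Drug
relation" value was assigned by hand during annotation, so a fixed placeholder
stands in for that judgment here.

```python
import json
import string
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOP = set(stopwords.words("english"))
PUNCT = str.maketrans("", "", string.punctuation)

def clean(text):
    """Lower-case a tweet, strip punctuation, and drop NLTK stop words."""
    words = text.lower().translate(PUNCT).split()
    return " ".join(w for w in words if w not in STOP)

with open("tweets.json") as src, open("labeled_tweets.json", "w") as dst:
    for line in src:
        tweet = json.loads(line)
        tweet["text"] = clean(tweet["text"])
        tweet["Drug relation"] = False  # placeholder for the manual label
        dst.write(json.dumps(tweet) + "\n")
```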
4.2 SVM
For classification we chose a support vector machine with a linear kernel,
evaluated using F1 score. This was straightforward with the scikit-learn
libraries. First we used CountVectorizer to extract features from each tweet,
which became the columns of a sparse matrix; the features extracted from each
tweet were trigrams. Once the feature matrix was created, we extracted the "Drug
relation" field from each tweet to use as the y values. The y values were stored
as a separate list of true/false values, later converted to the numerical values
0 and 1. To make sure the y values lined up with the sparse matrix rows, the x
values were filled in and stored as a NumPy array at the same time. The
vectorizer, the y value list, and the x value array were saved for the next
stage of building the classifier.
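The feature extraction stage might look like the following sketch, assuming the
labeled JSON format described in Section 4.1; the pure-trigram setting mirrors
the description above.

```python
import json
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Load the annotated tweets (one JSON object per line).
with open("labeled_tweets.json") as f:
    tweets = [json.loads(line) for line in f]
texts = [t["text"] for t in tweets]

# Trigram count features; each trigram becomes a column of the sparse matrix.
vectorizer = CountVectorizer(ngram_range=(3, 3))
X = vectorizer.fit_transform(texts)

# Convert the true/false "Drug relation" labels to 1/0.
y = np.array([1 if t["Drug relation"] else 0 for t in tweets])
```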
4.3 Classifier A
The first classifier was created using scikit-learn's grid search
cross-validation library. To train and test the classifier, data set 1, which
contained tweets from the United States and drug keyword data, was split into
two parts: 80 percent of the data was randomly selected for training and the
rest was held out for testing. As mentioned, there is an imbalance between
drug-related and non-drug-related tweets. To address this we constructed a set
of candidate class weights and searched for the best weights during training;
the classifier was trained with this list of class weights in the parameter
grid. The remaining 20 percent of the data was used to test the classifier's
performance.
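A sketch of this stage follows, continuing from the X and y built above; the C
range and candidate class weights in the grid are illustrative assumptions, not
our exact settings.

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# 80/20 random split of data set 1.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Up-weight the rare positive (drug-related) class during the search;
# candidate weights and C values are illustrative.
param_grid = {
    "C": [0.1, 1, 10],
    "class_weight": [{0: 1, 1: w} for w in (1, 5, 10, 20)],
}
grid = GridSearchCV(SVC(kernel="linear"), param_grid, scoring="f1", cv=5)
grid.fit(X_train, y_train)

# Evaluate on the held-out 20 percent.
print(classification_report(y_test, grid.predict(X_test)))
```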
4.4 Classifier B
After analyzing the performance of classifier A, we decided we needed a way to
improve it. Because of the rarity of drug-related tweets, we passed the second
data set through the first classifier. A more refined set of data was then
created from data set 2, containing all the tweets classifier A predicted to be
true and half of those predicted false. Using this refined data together with
all of the tweets from data set 1, we trained a new classifier B, again using 80
percent of the data for training and 20 percent for testing. This method is
known as bootstrapping: use one iteration of classification to train a new and
improved classifier. See Figure 1 for classifier B's results.
[Figure 1: Classification prediction performance. (a) Classifier A: data set 1. (b) Classifier B: bootstrap results.]
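The bootstrapping step can be sketched as follows, reusing the fitted vectorizer
and grid-searched classifier from above; the data set 2 file name and the
every-other-row subsampling of negatives are illustrative assumptions.

```python
import json
import numpy as np
import scipy.sparse as sp
from sklearn.svm import SVC

# Featurize data set 2 with the same trigram vectorizer (file name illustrative).
with open("dataset2_tweets.json") as f:
    texts_set2 = [json.loads(line)["text"] for line in f]
X2 = vectorizer.transform(texts_set2)
pred = grid.predict(X2)

# Keep every tweet classifier A predicted true, plus half of the falses.
pos = np.where(pred == 1)[0]
neg = np.where(pred == 0)[0][::2]
keep = np.concatenate([pos, neg])

# Train classifier B on all of data set 1 plus the refined part of data set 2.
X_b = sp.vstack([X, X2[keep]])
y_b = np.concatenate([y, pred[keep]])
clf_b = SVC(kernel="linear", **grid.best_params_)
clf_b.fit(X_b, y_b)
```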
5 Results
After creating the first classifier and analyzing the results, we noticed a few
things. The performance of classifier A was affected by the random seed used to
split the data into training and test sets; we believe that certain tweets are
so distinctive that whether they are selected for training shifts scores across
the board by about 5 percent. This led to the development of the second
classifier. Given the scarcity of hand-labeled positive (drug-related) tweets,
we bootstrapped classifier A to create classifier B as a way to accommodate the
lack of data; even if we obtained more data, we did not want to lose the
progress already made. Bootstrapping increased recall and F1 score by 10 percent
over classifier A. Ideally, bootstrapping should be done several times, each
time using the previous classifier's predictions and all previous data to train
a new classifier.
6 Conclusion
Using the bootstrapping method, we were able to improve classification
performance. Bootstrapping improves the performance of a new classifier to a
certain extent, but more data to offset the class imbalance would also help. In
the future we would also like to add sentiment analysis as a way to improve the
detection of drug-related topics.
References

[BG15] Cody Buntain and Jennifer Golbeck. This is your Twitter on drugs: Any questions? In Proceedings of the 24th International Conference on World Wide Web Companion, pages 777-782. International World Wide Web Conferences Steering Committee, 2015.

[Bil13] Kerry Mason. Electric Zoo festival canceled after two deaths. Billboard, 2013.

[BNG11] Hila Becker, Mor Naaman, and Luis Gravano. Beyond trending topics: Real-world event identification on Twitter. 2011.

[Fel13] Ronen Feldman. Techniques and applications for sentiment analysis. Communications of the ACM, 56(4):82-89, April 2013.

[GLG+13] Abhishek Gattani, Digvijay S. Lamba, Nikesh Garera, Mitul Tiwari, Xiaoyong Chai, Sanjib Das, Sri Subramaniam, Anand Rajaraman, Venky Harinarayan, and AnHai Doan. Entity extraction, linking, classification, and tagging for social media: A Wikipedia-based approach. Proceedings of the VLDB Endowment, 6(11):1126-1137, August 2013.

[HCBGC13] Carl Lee Hanson, Ben Cannon, Scott Burton, and Christophe Giraud-Carrier. An exploration of social circles and prescription drug abuse through Twitter. Journal of Medical Internet Research, 15(9):e189, 2013.

[KPB11] Sheila Kinsella, Alexandre Passant, and John G. Breslin. Topic classification in social media using metadata from hyperlinked objects. In Advances in Information Retrieval, pages 201-206. Springer, 2011.

[KWM11] Efthymios Kouloumpis, Theresa Wilson, and Johanna D. Moore. Twitter sentiment analysis: The good the bad and the OMG! ICWSM, 11:538-541, 2011.

[NKR+13] Preslav Nakov, Zornitsa Kozareva, Alan Ritter, Sara Rosenthal, Veselin Stoyanov, and Theresa Wilson. SemEval-2013 Task 2: Sentiment analysis in Twitter. 2013.

[PD11] Michael J. Paul and Mark Dredze. You are what you tweet: Analyzing Twitter for public health. ICWSM, 20:265-272, 2011.

[PD12] Michael J. Paul and Mark Dredze. A model for mining public health topics from Twitter. Health, 11:16-6, 2012.

[PP10] Alexander Pak and Patrick Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In LREC, volume 10, pages 1320-1326, 2010.

[PP11] Marco Pennacchiotti and Ana-Maria Popescu. A machine learning approach to Twitter user classification. ICWSM, 11(1):281-288, 2011.

[QAH13] Guo-Jun Qi, Charu C. Aggarwal, and Thomas S. Huang. Online community detection in social sensing. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM '13, pages 617-626, New York, NY, USA, 2013. ACM.

[RC12] Timo Reuter and Philipp Cimiano. Event-based classification of social media streams. In Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, ICMR '12, pages 22:1-22:8, New York, NY, USA, 2012. ACM.

[SKS12] Adam Sadilek, Henry A. Kautz, and Vincent Silenzio. Modeling spread of disease from social interactions. In ICWSM, 2012.

[SNP+15] Lukas Shutler, Lewis S. Nelson, Ian Portelli, Courtney Blachford, and Jeanmarie Perrone. Drug use in the twittersphere: A qualitative contextual analysis of tweets about prescription drugs. Journal of Addictive Diseases, 34(4):303-310, 2015.

[SOG+16] Abeed Sarker, Karen O'Connor, Rachel Ginn, Matthew Scotch, Karen Smith, Dan Malone, and Graciela Gonzalez. Social media mining for toxicovigilance: Automatic monitoring of prescription medication abuse from Twitter. Drug Safety, pages 1-10, 2016.