Tutorial 7
Text Mining
CSE, HKUST
March 27
Recap
What have we done for the project?
Phase I: Data Collection
1. Collect data from the Web (at least 4 data sources)
2. Parse the data using different tools
3. Design integrated schema for collected data
4. Store collected data into the unified schema
Pipeline (from the slide diagram):
1. Get the website (HTML) (Tutorial 2)
2. Parse out the useful info with BeautifulSoup or regular expressions (Tutorial 2)
3. Store in a MySQL database (Tutorial 3); schema matching (Tutorial 4)
BeautifulSoup tutorial: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Regular expression (regex) tutorial: https://www.summet.com/dmsi/html/readingTheWeb.html
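As a minimal sketch of the parsing step, here is how regular expressions from Python's standard library might pull fields out of a page. The HTML snippet, tag classes, and field names below are made up for illustration; a real data source would need its own patterns (or BeautifulSoup selectors).

```python
import re

# A hypothetical product-page snippet; in the project the HTML
# would come from one of the collected websites (Tutorial 2).
html = ('<div class="item"><span class="name">Phone X</span>'
        '<span class="price">$299</span></div>')

# Extract (name, price) pairs with regular expressions.
names = re.findall(r'<span class="name">(.*?)</span>', html)
prices = re.findall(r'<span class="price">\$(\d+)</span>', html)
records = list(zip(names, prices))
print(records)  # [('Phone X', '299')]
```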
What have we done for the project?
Phase II: Entity Resolution
1. Identify records that refer to the same entity across different data sets
2. Test different combinations of similarity measures
Phase III: Data Fusion
1. Merge data and resolve data conflicts
2. Experiment with different conflict resolution strategies
Possible basic methods:
1. String operations (Tutorial 1, Tutorial 5)
2. Text similarity / String based similarity (Tutorial 6)
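For instance, one simple string-based similarity (among the several measures covered in Tutorial 6) can be computed with Python's standard-library difflib; the example strings below are hypothetical record values.

```python
from difflib import SequenceMatcher

# SequenceMatcher.ratio() returns a similarity score in [0, 1];
# lowercasing first makes the comparison case-insensitive.
def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("HKUST", "hkust"))  # 1.0
print(similarity("iPhone 7", "iphone7"))
print(similarity("iPhone 7", "Galaxy S7"))
```

Records whose similarity exceeds a chosen threshold can then be treated as candidate matches.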
What have we done for the project?
Phase IV: Mining on your Portal
1. Sentiment Analysis
2. Recommendation
3. Frequent pattern, top-K usage, etc.
Sentiment Analysis
The simplest task
Input: a piece of text
Output: positive or negative?
Example inputs:
“This is very easy to set up for computers”
“Good quality”
“The price is too high”
Sentiment Analysis
1. Simplest task: Is the attitude of this text positive or negative?
2. More complex: Rank the attitude of this text from 1 to 5
3. Advanced: Detect the target, source, or complex attitude types
We will only introduce the simplest part today!
How to get this “positive” or “negative” answer?
Step 1: Tokenization
1. Filter unnecessary information/stop words/numbers
2. Chopping up sentences into smaller pieces (words or tokens)
How?
• The delimiter will in most cases be whitespace (“We’re going
to Barcelona” -> [“We’re”, “going”, “to”, “Barcelona”])
• What should you do with punctuation marks? ! puts extra emphasis on the
negative/positive sentiment of the sentence, while ? can mean uncertainty (no
sentiment)
• Quotation marks (“ ”, ‘ ’), brackets ([ ]), and parentheses (( )) can mean that the
enclosed words belong together and should be treated as a separate unit. The same
goes for words that are bold, italic, underlined, or inside a link.
How to get this “positive” or “negative” answer?
Step 2: Word Normalization
Reduce each word to its base/stem form!
How?
• Capital letters should be normalized to lowercase, unless a capitalized word occurs
in the middle of a sentence; this could indicate the name of a writer, place, brand, etc.
• What should be done with the apostrophe (’)? “George’s phone” should
obviously be tokenized as “George” and “phone”, but I’m, we’re, they’re
should be expanded to I am, we are, and they are.
• Watch out for ambiguous cases like High-tech, The Hague, Ph.D., USA, U.S.A., US and us.
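A minimal normalization sketch along these lines, assuming a deliberately tiny contraction table; a real pipeline would cover many more contractions and also stem words (e.g. with NLTK's PorterStemmer).

```python
# Lowercase each token and expand a few common contractions.
CONTRACTIONS = {"i'm": "i am", "we're": "we are", "they're": "they are"}

def normalize(tokens):
    out = []
    for tok in tokens:
        tok = tok.lower()
        # An expanded contraction yields several tokens, hence split().
        out.extend(CONTRACTIONS.get(tok, tok).split())
    return out

print(normalize(["We're", "going", "to", "Barcelona"]))
# ['we', 'are', 'going', 'to', 'barcelona']
```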
How to get this “positive” or “negative” answer?
Step 3: Bags of words
A text (such as a sentence or a document) is represented as the bag (multiset) of
its words, disregarding grammar and even word order but keeping multiplicity.
Example implementation:
Input (two simple text documents):
(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.
Output:
One vocabulary list of all the distinct words: [ "John", "likes", "to", "watch", "movies", "also", "football",
"games", "Mary", "too" ]
One count vector per document recording the term frequencies of the distinct words:
(1) [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
(2) [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
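The example above can be reproduced in a few lines of Python; the vocabulary is fixed to the word order used on the slide so the vectors match.

```python
from collections import Counter

docs = ["John likes to watch movies. Mary likes movies too.",
        "John also likes to watch football games."]
vocab = ["John", "likes", "to", "watch", "movies",
         "also", "football", "games", "Mary", "too"]

def to_vector(doc):
    # Strip periods, split on whitespace, count term frequencies.
    counts = Counter(doc.replace(".", "").split())
    return [counts[w] for w in vocab]

print(to_vector(docs[0]))  # [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
print(to_vector(docs[1]))  # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```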
How to get this “positive” or “negative” answer?
Step 4: Sentiment Lexicons
In this bag-of-words representation you only take individual words into account and give
each word a specific subjectivity score. This subjectivity score can be looked up in a
sentiment lexicon.
What is a sentiment lexicon?
Put simply, a sentiment lexicon assigns scores to individual words.
Example:
“Good”  Score: 3.97
“Bad”  Score: -6.6
Sentiment Lexicons
Try these interesting demos!
1. Give scores to words
http://sentiment.christopherpotts.net/lexicon/
2. Give scores to text
http://sentiment.christopherpotts.net/textscores/
Other interesting demos
3. Tokenize
http://sentiment.christopherpotts.net/tokenizing/
Sentiment Lexicons: Popular Tools
Tool 1: GI(The General Inquirer) http://www.wjh.harvard.edu/~inquirer/
Contains about 11,780 words and has a more complex way of ‘scoring’ words; each word can be scored
in 15+ categories; words can be Positive-Negative, Strong-Weak, Active-Passive, Pleasure-Pain.
Paper: A computer approach to content analysis: studies using the General Inquirer system
Philip J. Stone, Dexter C Dunphy, Marshall S. Smith, Daniel M. Ogilvie. 1966.
http://dl.acm.org/citation.cfm?id=1461583
Sentiment Lexicons: Popular Tools
Tool 2: LIWC (Linguistic Inquiry and Word Count) Not free!
http://liwc.wpengine.com/ (There is an interesting demo here)
Paper: Pennebaker, J.W., Booth, R.J., & Francis, M.E. (2007). Linguistic Inquiry and Word Count:
LIWC 2007. Austin, TX
Sentiment Lexicons: Popular Tools
Tool 3: SentiWordNet http://sentiwordnet.isti.cnr.it/
SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity.
It gives words a positive or negative score between 0 and 1. It contains about 117,660 words; however,
only ~29,000 of these words have been scored (either positive or negative).
Paper: Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An Enhanced
Lexical Resource for Sentiment Analysis and Opinion Mining. LREC-2010.
http://www.lrec-conf.org/proceedings/lrec2010/summaries/769.html
How to get this “positive” or “negative” answer?
Step 4 (continued): Scoring with the lexicon
The simplest method: sum up the lexicon scores of all words in the text.
If the total score is negative, the text is classified as negative;
if it is positive, the text is classified as positive.
Then we are done!
Then we finish!
The more complex tasks…
Step 5: Text Classification
• This Classifier first has to be trained with a training dataset,
• And then it can be used to actually classify documents.
• Training means that we have to determine its model parameters.
• If the set of training examples is chosen correctly, the Classifier should predict the
class probabilities of the actual documents with a similar accuracy (as the training
examples).
• In classification tasks we are trying to learn a classification function that maps
the ‘features’ of a document D to a class C.
The more complex tasks…
Step 5: Text Classification
• Example:
o Document containing the words like “Good” “Popular” “Cheap” should be
categorized as positive
o Documents containing the words “Poor quality” and “Expensive service”
should be categorized as negative
The more complex tasks…
Step 5: Text Classification - Popular classifiers
• Naive Bayes
• Sebastian Raschka’s blog-post:
http://sebastianraschka.com/Articles/2014_naive_bayes_1.html
• Maximum Entropy
• A brief Maxent tutorial:
http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html
• SVM
• Article: http://biblio.telecom-paristech.fr/cgi-bin/download.cgi?id=6694
• Youtube: https://www.youtube.com/watch?v=YsiWisFFruY
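As an illustration of the first of these, here is a tiny Naive Bayes classifier with add-one smoothing, trained on a made-up four-document dataset; a real project would train on a much larger labelled corpus (or use an existing library).

```python
import math
from collections import Counter

# Hypothetical labelled training data.
train = [("good popular cheap", "pos"), ("good quality", "pos"),
         ("poor quality", "neg"), ("expensive service", "neg")]

counts = {"pos": Counter(), "neg": Counter()}  # word counts per class
docs = Counter()                               # document counts per class
for text, label in train:
    docs[label] += 1
    counts[label].update(text.split())

vocab = set(w for c in counts.values() for w in c)

def predict(text):
    # Score each class by log prior + summed log likelihoods
    # with add-one (Laplace) smoothing; return the best class.
    def log_prob(label):
        total = sum(counts[label].values())
        lp = math.log(docs[label] / len(train))
        for w in text.split():
            lp += math.log((counts[label][w] + 1) / (total + len(vocab)))
        return lp
    return max(counts, key=log_prob)

print(predict("good cheap phone"))  # pos
print(predict("poor expensive"))    # neg
```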
Some Useful Materials
Introduction to sentiment analysis:
• http://ataspinar.com/2015/11/16/text-classification-and-sentimentanalysis/#SL_literature
• https://web.njit.edu/~da225/NetHelp/default.htm?turl=Documents%2Fcurrentexamplesofsentimentanalysis.htm
Sentiment Symposium Tutorial: Lexicons (A lot of interesting demos!)
• http://sentiment.christopherpotts.net/lexicons.html#overview
Review…
How to do sentiment analysis?
• Step 1: Tokenization
• Chopping up sentences into smaller pieces (words or tokens)
• Step 2: Word Normalization
• Reduce each word to its base/stem form
• Step 3: Bags of words
• Represent the text as the bag (multiset) of its words
• Step 4: Sentiment Lexicons
• Give each word a specific subjectivity score