Analyzing Websites' Reviews
(Text Mining)
Objective
◦ Analyze reviews from different websites:
  - Amazon
  - IMDb
◦ Determine the writer's attitude toward the cell phone products and movies they reviewed
◦ Create a model to classify reviews as either positive or negative
Description of the Problem
What is text mining?
- A type of data mining
- Focuses on extracting useful information from text or sentences
Sentiment Analysis
- A type of text mining
- Determines people's opinions and attitudes based on the words used in a sentence
Dataset
The dataset is from 'From Group to Individual Labels using Deep Features' by Kotzias, Denil, de Freitas, and Smyth:
◦ Cell Phones and Accessories reviews from Amazon
◦ Movie reviews from IMDb
The original dataset includes:
◦ Score: 0 or 1, where 0 is a negative review and 1 is positive
◦ Sentence: the text of the review from the website
◦ 1000 observations from each website: 500 positive and 500 negative reviews each
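
In the commonly distributed release of this dataset, each line holds a review, a tab character, and its Score. A minimal loading sketch in Python (the file names follow the UCI "Sentiment Labelled Sentences" release and are assumptions about your copy):

import pandas as pd

def load_reviews(path):
    # Each line is "<sentence>\t<score>"; quoting=3 (QUOTE_NONE) keeps
    # quotation marks inside reviews intact
    return pd.read_csv(path, sep="\t", header=None,
                       names=["Sentence", "Score"], quoting=3)

amazon = load_reviews("amazon_cells_labelled.txt")  # assumed file name
imdb = load_reviews("imdb_labelled.txt")            # assumed file name
print(amazon.shape)  # expected (1000, 2): 500 positive + 500 negative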
Created Predictors
• Length: the length of the review in characters
• Period, Exclamation, Question, etc.: the number of periods, exclamation points, question marks, etc. in each review
• Words: the number of times each of the 50 most common words is used in each review
Note: We exclude common words that do not appear to carry any connotation: articles, prepositions, pronouns, and auxiliary verbs (see the sketch below)
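
A minimal sketch of building these predictors with pandas, reusing the amazon DataFrame from the loading sketch above (the stop-word list here is illustrative; the slides do not give the exact exclusion list):

import re

df = amazon.copy()
df["Length"] = df["Sentence"].str.len()
for name, pattern in [("Period", r"\."), ("Exclamation", "!"), ("Question", r"\?")]:
    df[name] = df["Sentence"].str.count(pattern)

# 50 most common words across all reviews, excluding connotation-free words
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "it", "i", "is", "was"}  # illustrative
words = df["Sentence"].str.lower().str.findall(r"[a-z']+").explode()
top50 = words[~words.isin(STOP_WORDS)].value_counts().head(50).index

# One count column per common word (whole-word matches here; the Excel
# formula shown later counts raw substrings instead)
for w in top50:
    df[w] = df["Sentence"].str.lower().str.count(rf"\b{re.escape(w)}\b")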
Understanding the Data
A review can contain positive words used with a negative connotation (e.g., "not good"), so individual words alone do not determine the Score
Contractions involving the word not (isn't, wasn't, doesn't, etc.) are added to the not word count
Misspelled words cannot be added to the correct word count (e.g., "baad" is not counted toward "bad")
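
One way to implement the contraction rule is to fold every n't form into the word not before counting. A sketch (the regex is an illustration, not the authors' method):

import re

NOT_CONTRACTION = re.compile(r"\b\w+n't\b")  # isn't, wasn't, doesn't, can't, ...

def normalize(sentence):
    # Lowercase, then rewrite each n't contraction as "not" so it is
    # picked up by the `not` word count
    return NOT_CONTRACTION.sub("not", sentence.lower())

print(normalize("This phone isn't good."))  # this phone not good.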
Creating the New Predictors
Microsoft Excel function used to count how many times a word occurs in a review:
=(LEN(<Sentence>)-LEN(SUBSTITUTE(<Sentence>,<word>,"")))/LEN(<word>)
The formula deletes every occurrence of the word and divides the drop in length by the word's length.
Note: SUBSTITUTE is case sensitive, so before searching the Sentence for a word, convert it to lowercase using:
=LOWER(<Sentence>)
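
The same length-difference trick in Python, for reference (a sketch, not the authors' code):

def count_occurrences(sentence, word):
    # Delete every occurrence of `word`, then divide the drop in length
    # by the word's length; counts substrings, exactly like the Excel formula
    s = sentence.lower()  # mirrors =LOWER(<Sentence>)
    return (len(s) - len(s.replace(word, ""))) // len(word)

print(count_occurrences("Good phone, really good value.", "good"))  # 2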
Data Analysis – Amazon
Interaction
Fit a logistic regression model for Amazon that included interaction terms
not was crossed with five common positive and negative words
The full model included all 56 created predictors and these 5 interactions
Data Analysis – Amazon
Interaction
One by one, the least significant predictors (highest p-value) were removed, leaving 9 significant predictors
Two interactions were significant: not*great and not*problem (see the table below)
Misclassification Error Rate with interactions: 0.504
Misclassification Error Rate with no interactions: 0.498
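
A sketch of this fit-and-prune loop with statsmodels (the 5% cutoff, the column names, and the term list are assumptions; the slides do not say which software was used, and amazon_df here stands for a DataFrame holding Score plus the created predictors):

import statsmodels.formula.api as smf

def backward_eliminate(df, response, terms, alpha=0.05):
    # Refit after dropping the highest-p-value term until every remaining
    # term is significant at `alpha`
    terms = list(terms)
    while True:
        model = smf.logit(f"{response} ~ {' + '.join(terms)}", data=df).fit(disp=0)
        pvals = model.pvalues.drop("Intercept")
        if pvals.max() <= alpha or len(terms) == 1:
            return model
        terms.remove(pvals.idxmax())  # interaction coefficients are named "a:b"

# Illustrative call: not_:great is the not*great interaction in formula
# syntax ("not" is renamed not_ since `not` is a Python keyword)
fit = backward_eliminate(amazon_df, "Score",
                         ["not_", "great", "good", "very", "but", "problem",
                          "not_:great", "not_:problem"])
err = ((fit.predict(amazon_df) > 0.5) != amazon_df["Score"]).mean()  # misclassification rate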
Amazon Reduced Logistic Regression with Interaction

Coefficient    Estimate   p-value
Intercept      0.068      0.4515
not*great      -0.299     0.0081
not*problem    0.398      0.0508
not            -0.6083    approx. 0
great          1.6488     approx. 0
good           0.69       0.0004
very           0.3531     0.014
but            -0.2824    0.0461
problem        -0.5265    0.1184

Reviews containing words with positive estimates are more likely to be positive:
◦ not*problem, great, good, very
Reviews containing words with negative estimates are more likely to be negative:
◦ not*great, not, but, problem
Data Analysis – IMDb
Interaction
The same process was run for the IMDb dataset
not*good was the only significant interaction
Misclassification Error Rate with interactions: 0.363
Misclassification Error Rate with no interaction: 0.351
The interaction is significant, but does not appear to help much in classifying a Sentence into a Score
IMDb Reduced Logistic Regression with Interaction

Coefficient   Estimate   p-value
Intercept     -0.2926    0.0313
Length        0.0060     0.0002
question      -1.7617    0.0357
not*good      -3.9140    0.0185
not           -0.7525    approx. 0
bad           -2.7193    approx. 0
great         1.9625     0.0001
would         -2.5397    0.0001
love          3.3622     0.0001
good          1.5352     0.0004
plot          -1.8441    0.0017
best          1.3858     0.0058
film          0.5106     0.0068
make          -1.1487    0.0078
cast          1.7905     0.0081
just          -0.7504    0.0139
well          1.1971     0.0146
play          1.2409     0.0209
even          -0.9931    0.0265
how           -0.5838    0.0316
script        -1.1221    0.0384
work          -0.8065    0.0491
IMDb Reduced Logistic Regression with Interaction

Positive Estimates
◦ Length
◦ great
◦ love
◦ good
◦ best
◦ film
◦ cast
◦ well
◦ play

Negative Estimates
◦ question
◦ not*good
◦ not
◦ bad
◦ would
◦ plot
◦ make
◦ just
◦ even
◦ how
◦ script
◦ work
Comparing Reduced Logistic Regression Models
Positive estimates in both models:
◦ good
◦ great
Negative estimate in both models:
◦ not
IMDb has Length and question as significant predictors:
◦ Length is the length of the review in characters
◦ question is the number of question marks used in the review
Amazon had no significant punctuation predictors
Amazon Misclassification Error Rates

Method                                    Error Rate
Logistic Regression with Interaction      0.504
Logistic Regression                       0.498
LDA                                       0.499
QDA                                       0.486
KNN with k = 2                            0.500
Unpruned Decision Tree                    0.379
Decision Tree pruned to 5                 0.371
Random Forest with mtry = p/3             0.509
Bagging                                   0.489
SVM Linear with cost = 0.1                0.377
SVM Radial with cost = 10, gamma = 0.5    0.451
SVM Polynomial with cost = 10             0.370
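
For reference, a sketch of producing such a comparison with scikit-learn analogues of the listed methods (the slides do not name the software or the validation scheme, so the hold-out split and the hyperparameter mappings below are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.svm import SVC

# X = created predictors, y = Score; the split itself is an assumption
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN, k = 2": KNeighborsClassifier(n_neighbors=2),
    "Unpruned Decision Tree": DecisionTreeClassifier(),
    "Tree pruned to 5": DecisionTreeClassifier(max_leaf_nodes=5),  # assumed analogue
    "Random Forest, mtry = p/3": RandomForestClassifier(max_features=1/3),
    "Bagging": BaggingClassifier(),
    "SVM Linear, cost = 0.1": SVC(kernel="linear", C=0.1),
    "SVM Radial, cost = 10, gamma = 0.5": SVC(kernel="rbf", C=10, gamma=0.5),
    "SVM Polynomial, cost = 10": SVC(kernel="poly", C=10),
}

for name, clf in models.items():
    error = 1 - clf.fit(X_tr, y_tr).score(X_te, y_te)  # misclassification rate
    print(f"{name:38s} {error:.3f}")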
IMDb Misclassification Error Rates

Method                                    Error Rate
Logistic Regression with Interaction      0.363
Logistic Regression                       0.351
LDA                                       0.356
QDA                                       0.322
KNN with k = 2                            0.452
Unpruned Decision Tree                    0.385
Decision Tree pruned to 3                 0.384
Random Forest with mtry = p/3             0.426
Bagging                                   0.413
SVM Linear with cost = 1                  0.354
SVM Radial with cost = 10, gamma = 2      0.473
SVM Polynomial with cost = 1              0.375
Best Models
Amazon
◦ SVM Polynomial with cost = 10
◦ Error Rate = 0.370
IMDb
◦ QDA
◦ Error Rate = 0.322
Conclusion
The Score was much harder to predict for Amazon than for IMDb
◦ Average error rate:
  Amazon: 0.4495
  IMDb: 0.3855
◦ Number of predictors in the reduced logistic regression model with interaction (the full model has 61 predictors):
  Amazon: 9
  IMDb: 22
Conclusion
Both Amazon and IMDb had good and great as positive predictors and not as a negative predictor
The best positive and negative predictors are generally intuitive:
◦ Positive: good, great, best, love
◦ Negative: not, bad, problem, not*good, not*great
Any Questions?
Thank you for your attention