Analyzing Websites' Reviews (Text Mining)

Objective
• Analyze reviews from two websites:
  - Amazon
  - IMDb
• Determine the writer's attitude toward the cell phone products and movies they reviewed
• Create a model to classify reviews as either positive or negative

Description of the Problem
What is text mining?
- A type of data mining
- Focuses on pulling useful information from text or sentences
What is sentiment analysis?
- A type of text mining
- Determines people's opinions and attitudes based on the words used in a sentence

Dataset
The dataset comes from "From Group to Individual Labels using Deep Features" by Kotzias, Denil, de Freitas, and Smyth:
- Cell Phones and Accessories reviews from Amazon
- Movie reviews from IMDb
The original dataset includes:
◦ Score: 0 or 1, where 0 is a negative review and 1 is a positive review
◦ Sentence: the review text from the website
◦ 1,000 observations from each website (500 positive and 500 negative reviews each)

Created Predictors
• Length: the length of the review in characters
• Period, Exclamation, Question, etc.: the number of periods, exclamation points, question marks, etc. in each review
• Words: the number of times each of the 50 most common words is used in each review
Note: We exclude common words that do not appear to carry any connotation: articles, prepositions, pronouns, and auxiliary verbs.

Understanding the Data
• A review can contain positive words used with a negative connotation (e.g., "not good")
• Contractions involving the word not (isn't, wasn't, doesn't, etc.) are added to the not word count
• Misspelled words cannot be added to the correct word count (e.g., "baad" is not counted as "bad")

Creating the New Predictors
Microsoft Excel function used to create each word-count predictor:
=(LEN(<Sentence>)-LEN(SUBSTITUTE(<Sentence>,<word>,"")))/LEN(<word>)
Note: The function is case sensitive. Before searching the Sentence for a word, convert it to lowercase letters using:
=LOWER(<Sentence>)

Data Analysis – Amazon Interaction
• Fit a logistic regression model for Amazon with interaction terms present: not was crossed with common positive and negative words
• The full model included all 56 created predictors plus five such interactions
• One by one, the least significant predictors (highest p-value) were removed, leaving 9 significant predictors
• Two interactions remained significant: not*great and not*problem (see the coefficient table below; a code sketch of the predictor construction and model fitting follows it)
• Misclassification error rate using interactions: 0.504
• Misclassification error rate with no interactions: 0.498

Amazon Reduced Logistic Regression with Interaction
Coefficient    Estimate    p-value
Intercept       0.0680     0.4515
not*great      -0.2990     0.0081
not*problem     0.3980     0.0508
not            -0.6083     approx. 0
great           1.6488     approx. 0
good            0.6900     0.0004
very            0.3531     0.0140
but            -0.2824     0.0461
problem        -0.5265     0.1184
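The slides describe building the word-count predictors in Excel and fitting the models in an unnamed statistics package. As a rough illustration only, the sketch below redoes the same steps in Python with pandas and statsmodels; the file name amazon_cells_labelled.txt, the shortened word list, and the column names are assumptions, and only the two interactions that survived the backward elimination are included.

import re
import pandas as pd
import statsmodels.api as sm

# Hypothetical input: one row per review, tab-separated Sentence and Score (0/1),
# matching the layout of the Kotzias et al. sentence files.
df = pd.read_csv("amazon_cells_labelled.txt", sep="\t", names=["Sentence", "Score"])

# Illustrative subset of the 50 most common words used as predictors on the slides.
words = ["not", "great", "good", "very", "but", "problem"]

def build_features(sentence):
    s = sentence.lower()                      # mirrors Excel's =LOWER(<Sentence>)
    feats = {
        "Length": len(sentence),              # review length in characters
        "Period": s.count("."),
        "Exclamation": s.count("!"),
        "Question": s.count("?"),
    }
    s = re.sub(r"n't\b", " not", s)           # fold isn't/wasn't/doesn't into the "not" count
    for w in words:
        feats[w] = len(re.findall(r"\b%s\b" % re.escape(w), s))
    return pd.Series(feats)

X = df["Sentence"].apply(build_features)
X["not*great"] = X["not"] * X["great"]        # interactions: "not" crossed with other words
X["not*problem"] = X["not"] * X["problem"]

# Logistic regression with an intercept; the summary gives the p-values used
# for one-by-one backward elimination of the least significant predictors.
fit = sm.Logit(df["Score"], sm.add_constant(X)).fit(disp=False)
print(fit.summary())

# Misclassification error rate of the fitted model on the training data.
pred = (fit.predict(sm.add_constant(X)) >= 0.5).astype(int)
print("misclassification error rate:", (pred != df["Score"]).mean())

Unlike the Excel LEN/SUBSTITUTE substring count, this sketch counts whole-word matches, which avoids counting, say, "notable" toward "not"; the resulting coefficients would therefore not reproduce the table above exactly.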
Coefficients with positive estimates are associated with positive reviews:
◦ not*problem, great, good, very
Coefficients with negative estimates are associated with negative reviews:
◦ not*great, not, but, problem

Data Analysis – IMDb Interaction
• The same process was run for the IMDb dataset, again crossing not with common positive and negative words (e.g., not*bad)
• not*good was the only interaction that remained significant
• Misclassification error rate using the interaction: 0.363
• Misclassification error rate with no interaction: 0.351
• The interaction is significant, but it does not appear to help much in classifying a Sentence into a Score

IMDb Reduced Logistic Regression with Interaction
Coefficient    Estimate    p-value
Intercept      -0.2926     0.0313
Length          0.0060     0.0002
question       -1.7617     0.0357
not*good       -3.9140     0.0185
not            -0.7525     approx. 0
bad            -2.7193     approx. 0
great           1.9625     0.0001
would          -2.5397     0.0001
love            3.3622     0.0001
good            1.5352     0.0004
plot           -1.8441     0.0017
best            1.3858     0.0058
film            0.5106     0.0068
make           -1.1487     0.0078
cast            1.7905     0.0081
just           -0.7504     0.0139
well            1.1971     0.0146
play            1.2409     0.0209
even           -0.9931     0.0265
how            -0.5838     0.0316
script         -1.1221     0.0384
work           -0.8065     0.0491

Positive estimates:
◦ Length, great, love, good, best, film, cast, well, play
Negative estimates:
◦ question, not*good, not, bad, would, plot, make, just, even, how, script, work

Comparing Reduced Logistic Regression Models
• Positive estimates in both models: good, great
• Negative estimate in both models: not
• IMDb has Length and question as significant predictors
  ◦ Length is the length of the review in characters
  ◦ question is the number of question marks used in the review
• Amazon had no significant punctuation predictors

Amazon Misclassification Error Rates
Method                                      Error Rate
Logistic Regression with Interaction        0.504
Logistic Regression                         0.498
LDA                                         0.499
QDA                                         0.486
KNN with k = 2                              0.500
Unpruned Decision Tree                      0.379
Decision Tree pruned to 5                   0.371
Random Forest with mtry = p/3               0.509
Bagging                                     0.489
SVM Linear with cost = 0.1                  0.377
SVM Radial with cost = 10, gamma = 0.5      0.451
SVM Polynomial with cost = 10               0.370

IMDb Misclassification Error Rates
Method                                      Error Rate
Logistic Regression with Interaction        0.363
Logistic Regression                         0.351
LDA                                         0.356
QDA                                         0.322
KNN with k = 2                              0.452
Unpruned Decision Tree                      0.385
Decision Tree pruned to 3                   0.384
Random Forest with mtry = p/3               0.426
Bagging                                     0.413
SVM Linear with cost = 1                    0.354
SVM Radial with cost = 10, gamma = 2        0.473
SVM Polynomial with cost = 1                0.375

Best Models
Amazon
◦ SVM Polynomial with cost = 10
◦ Error rate = 0.370
IMDb
◦ QDA
◦ Error rate = 0.322
(A code sketch of this model comparison is given after the conclusion.)

Conclusion
The Score was much harder to predict for Amazon than for IMDb:
◦ Average error rate: Amazon 0.4495, IMDb 0.3855
◦ Predictors in the reduced logistic regression model with interactions (the full model has 61 predictors): Amazon 9, IMDb 22
Both Amazon and IMDb had good and great as positive predictors and not as a negative predictor.
The best positive and negative predictors are generally intuitive:
◦ Positive: good, great, best, love
◦ Negative: not, bad, problem, not*good, not*great

Any Questions?
Thank you for your attention
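For completeness, here is one way the misclassification error rate comparison behind the tables above could be set up. This is a sketch only, written in Python with scikit-learn rather than the unnamed software actually used; the file amazon_predictors.csv is hypothetical, and 10-fold cross-validation is assumed because the slides do not state how the error rates were estimated.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.svm import SVC

# Hypothetical file holding the created predictors plus the 0/1 Score column,
# built as in the earlier feature-construction sketch.
X = pd.read_csv("amazon_predictors.csv")
y = X.pop("Score")
p = X.shape[1]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN (k = 2)": KNeighborsClassifier(n_neighbors=2),
    "Unpruned Decision Tree": DecisionTreeClassifier(),
    "Random Forest (mtry = p/3)": RandomForestClassifier(max_features=max(1, p // 3)),
    "Bagging": BaggingClassifier(),
    "SVM Linear (cost = 0.1)": SVC(kernel="linear", C=0.1),
    "SVM Radial (cost = 10, gamma = 0.5)": SVC(kernel="rbf", C=10, gamma=0.5),
    "SVM Polynomial (cost = 10)": SVC(kernel="poly", C=10),
}

# Estimate each method's misclassification error rate as 1 - cross-validated accuracy.
for name, clf in models.items():
    accuracy = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name:40s} error rate = {1 - accuracy:.3f}")

Each entry mirrors a method and tuning value from the Amazon error-rate table; the exact numbers obtained will differ from the slides depending on the resampling scheme and on how the predictors were built.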