Yelp Dataset Project
TEAM #1:
NIENTZU KUAN
KATHY APPLEBAUM
Why Are Reviews Interesting?
We all use reviews to make decisions
Does the sentiment of the text match the rating?
Do reviews change over time?
Yelp Academic Dataset
A small subset of data made available for
research
10 cities in 4 countries
Over 2 million reviews and ½ million tips from ½
million users
https://www.yelp.com/dataset_challenge
The Data
Nested JSON format:
{ "yelping_since":"2004-10",
"votes":{
"funny":167,
"useful":280,
"cool":245
},
"review_count":108,
"name":"Russel",
"user_id":"18kPq7GPye-YQ3LyKyAZPw",
…
"friends":[ "rpOyqD_893cqmDAtJLbdog",
Data Preprocessing
Examine data structure
Design traditional SQL database
Write Java program to convert JSON into
multiple SQL tables
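The conversion step can be sketched as follows. This is only an illustration in Python with SQLite, not the team's Java program: the table names (yelp_user, user_friend), the column names, and the second friend ID are assumptions made up for the example, while the other field values come from the record excerpt shown on the slide.

```python
import json
import sqlite3

# Hypothetical target schema: one row per user, one row per (user, friend) pair.
SCHEMA = """
CREATE TABLE yelp_user (
    user_id       TEXT PRIMARY KEY,
    name          TEXT,
    yelping_since TEXT,
    review_count  INTEGER,
    votes_funny   INTEGER,
    votes_useful  INTEGER,
    votes_cool    INTEGER
);
CREATE TABLE user_friend (
    user_id   TEXT REFERENCES yelp_user(user_id),
    friend_id TEXT
);
"""

def load_user(conn, record):
    """Flatten one nested user record into the two tables."""
    votes = record.get("votes", {})  # nested object becomes three columns
    conn.execute(
        "INSERT INTO yelp_user VALUES (?, ?, ?, ?, ?, ?, ?)",
        (record["user_id"], record.get("name"), record.get("yelping_since"),
         record.get("review_count"), votes.get("funny"),
         votes.get("useful"), votes.get("cool")),
    )
    # The nested array becomes one row per element in a child table.
    conn.executemany(
        "INSERT INTO user_friend VALUES (?, ?)",
        [(record["user_id"], friend) for friend in record.get("friends", [])],
    )

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# One JSON object per line; the second friend ID is a made-up placeholder.
sample_line = (
    '{"yelping_since": "2004-10", '
    '"votes": {"funny": 167, "useful": 280, "cool": 245}, '
    '"review_count": 108, "name": "Russel", '
    '"user_id": "18kPq7GPye-YQ3LyKyAZPw", '
    '"friends": ["rpOyqD_893cqmDAtJLbdog", "hypothetical_friend_id"]}'
)
load_user(conn, json.loads(sample_line))
```

Splitting the friends array into its own table is what turns one nested document into multiple SQL tables.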
Data Warehouse
What questions can we answer?
Design schema to help us do that
Schema
• Snowflake schema
• Allows both data mart
and data mining
SQL Ninja
Athena is very slow
Adding an index made queries slower on
Athena
Question: Why is this?
Data Cube
Partially materialized data cube
Worst query went from 2.4
seconds to 0.01 seconds
Added little complexity to front
end
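A rough sketch of the idea behind partial materialization, assuming a simplified review table with made-up rows and a hypothetical summary table named cube_city_year (the slides do not show the actual cube design). Only one aggregate level is precomputed; coarser rollups are derived from it instead of scanning the full review table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified fact table; the rows are invented for illustration.
CREATE TABLE review (
    review_id INTEGER PRIMARY KEY,
    city      TEXT,
    year      INTEGER,
    stars     INTEGER
);
INSERT INTO review (city, year, stars) VALUES
    ('Phoenix', 2013, 5), ('Phoenix', 2013, 4), ('Phoenix', 2014, 2),
    ('Las Vegas', 2013, 5), ('Las Vegas', 2014, 1), ('Las Vegas', 2014, 4);

-- Materialize only the (city, year) level of the cube.
-- Counts and sums are stored, not averages, so rollups stay exact.
CREATE TABLE cube_city_year AS
    SELECT city, year, COUNT(*) AS n_reviews, SUM(stars) AS star_sum
    FROM review
    GROUP BY city, year;
""")

def avg_stars_by_city(conn):
    """Roll the stored (city, year) cells up to the city level."""
    rows = conn.execute(
        "SELECT city, SUM(star_sum) * 1.0 / SUM(n_reviews) "
        "FROM cube_city_year GROUP BY city ORDER BY city"
    ).fetchall()
    return dict(rows)

def avg_stars_by_city_direct(conn):
    """Same answer computed from the raw fact table, for comparison."""
    rows = conn.execute(
        "SELECT city, AVG(stars) FROM review GROUP BY city ORDER BY city"
    ).fetchall()
    return dict(rows)
```

The front end only has to query the summary table instead of the fact table, which is consistent with the small amount of added complexity reported above.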
Data Mart
Data Mining
Does the subjective response of Yelp review text
match the star rating?
Conduct sentiment analysis and classification on
Yelp Academic Dataset text to help us do that.
Method
Used the data mining tool RapidMiner.
Utilized RapidMiner extension, Text Processing,
for statistical text analysis.
Applied Support Vector Machine to predict
whether the text is a positive or negative review.
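To make the classification step concrete, here is a minimal sketch of a linear SVM over bag-of-words features, trained by stochastic subgradient descent on the hinge loss. It is not RapidMiner's SVM operator, whose kernel and settings are not stated in the slides; the four training sentences, the function names, and the hyperparameters are all made up for illustration.

```python
def train_linear_svm(samples, epochs=20, lr=0.1, lam=0.01):
    """samples: list of (tokens, label) pairs with label +1 or -1."""
    weights = {}  # one weight per word
    bias = 0.0
    for _ in range(epochs):
        for tokens, label in samples:
            score = bias + sum(weights.get(t, 0.0) for t in tokens)
            # L2 regularization: shrink every weight a little.
            for t in weights:
                weights[t] *= 1.0 - lr * lam
            # Hinge loss: update only when the margin is violated.
            if label * score < 1:
                for t in tokens:
                    weights[t] = weights.get(t, 0.0) + lr * label
                bias += lr * label
    return weights, bias

def predict(model, text):
    """Return +1 (positive review) or -1 (negative review)."""
    weights, bias = model
    score = bias + sum(weights.get(t, 0.0) for t in text.lower().split())
    return 1 if score >= 0 else -1

# Toy training data (invented sentences, not Yelp reviews).
training = [
    ("great food love it".split(), 1),
    ("awful service terrible food".split(), -1),
    ("love the great staff".split(), 1),
    ("terrible awful wait".split(), -1),
]
model = train_linear_svm(training)
```

Calling predict(model, "great staff") classifies a new text with the learned word weights; words never seen in training contribute nothing to the score.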
Data Collection for Mining
Collected data for Mining from the MySQL
database created in our Data Warehouse part.
Executed several MySQL queries and stored
results in CSV format.
SELECT text FROM review WHERE stars IN (4, 5)
ORDER BY RAND() LIMIT 20000;   -- results saved to pos_reviews.csv
SELECT text FROM review WHERE stars IN (1, 2, 3)
ORDER BY RAND() LIMIT 20000;   -- results saved to neg_reviews.csv
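The sampling and export step could be scripted along these lines. This sketch uses SQLite in place of MySQL (so RANDOM() instead of RAND()) and a tiny made-up review table; the helper names sample_texts and write_csv are assumptions, not part of the original project.

```python
import csv
import io
import sqlite3

def sample_texts(conn, stars, limit):
    """Return up to `limit` randomly chosen review texts with the given ratings."""
    placeholders = ",".join("?" for _ in stars)
    query = (
        f"SELECT text FROM review WHERE stars IN ({placeholders}) "
        "ORDER BY RANDOM() LIMIT ?"
    )
    return [row[0] for row in conn.execute(query, (*stars, limit))]

def write_csv(texts, handle):
    """Write one review text per row under a 'text' header."""
    writer = csv.writer(handle)
    writer.writerow(["text"])
    writer.writerows([t] for t in texts)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review (text TEXT, stars INTEGER)")
conn.executemany(
    "INSERT INTO review VALUES (?, ?)",
    [("loved it", 5), ("pretty good", 4), ("very nice", 4),
     ("just ok", 3), ("not great", 2), ("never again", 1)],  # invented rows
)

pos = sample_texts(conn, (4, 5), 2)     # stands in for pos_reviews.csv
neg = sample_texts(conn, (1, 2, 3), 2)  # stands in for neg_reviews.csv

buffer = io.StringIO()  # a real run would open a file instead
write_csv(pos, buffer)
```

Drawing the same number of rows from each class keeps the training set balanced, which is what the equal LIMIT values in the queries achieve.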
Input Data
The dataset consisted of 40,000 review texts
20,000 positive reviews (Yelp star rating 4 and 5)
20,000 negative reviews (Yelp star rating 1, 2 and 3)
Data Preprocessing for Mining
Used RapidMiner to perform data preprocessing
for classification.
RapidMiner With Opinion Mining
Training: SVM
Testing: 10-fold Cross Validation
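For readers unfamiliar with the testing scheme, 10-fold cross validation can be sketched as below. The function name k_fold_indices and the seed are assumptions; RapidMiner performs this split internally, and its exact sampling strategy is not described in the slides.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train_idx = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(40, k=10))
```

Each example is used for testing exactly once and for training nine times; the reported accuracy is the average over the ten held-out folds.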
The Result
Accuracy achieved is 84.46%
What is recall?
What is precision?
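Both measures follow from the standard definitions: precision is the share of reviews predicted positive that really are positive, and recall is the share of truly positive reviews that the classifier finds. A small sketch with made-up labels (the project's own precision and recall values are not given in the slides):

```python
def precision_recall(actual, predicted, positive="pos"):
    """Compute precision and recall for the class named `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented example: 4 truly positive and 2 truly negative reviews.
actual    = ["pos", "pos", "pos", "pos", "neg", "neg"]
predicted = ["pos", "pos", "pos", "neg", "pos", "pos"]
precision, recall = precision_recall(actual, predicted)
# Here tp = 3, fp = 2, fn = 1, so precision = 3/5 and recall = 3/4.
```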
Conclusion
The result is satisfactory: the classifier reaches an accuracy of
84.46% in predicting whether a review is positive or negative.
This means that roughly 84 out of 100 Yelp text reviews are correctly
predicted as either rating 4/5 (positive review) or rating 1/2/3
(negative review).
Since the sentiment of Yelp review text
largely agrees with its star rating, users can trust the
Yelp rating system.
Data Mining Learning Experience
Had difficulty learning the tool and its
sentiment analysis support.
Ran into issues handling the large amount of data
for training and validation.
Tried a more ambitious goal, multiclass
classification, without success.
Any Questions?