Yelp Dataset Project
TEAM #1:
NIENTZU KUAN
KATHY APPLEBAUM
Why Are Reviews Interesting?

We all use reviews to make decisions

Does the sentiment of the text match the rating?

Do reviews change over time?
Yelp Academic Dataset

A small subset of data made available for
research

10 cities in 4 countries

Over 2 million reviews and ½ million tips from ½
million users

https://www.yelp.com/dataset_challenge
The Data
Nested JSON format:
{ "yelping_since":"2004-10",
"votes":{
"funny":167,
"useful":280,
"cool":245
},
"review_count":108,
"name":"Russel",
"user_id":"18kPq7GPye-YQ3LyKyAZPw",
…
"friends":[ "rpOyqD_893cqmDAtJLbdog",
Data Preprocessing

Examine data structure

Design traditional SQL database

Write Java program to convert JSON into
multiple SQL tables
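
The conversion step can be illustrated with a small sketch. The slides say the actual converter was written in Java; the version below uses Python for brevity, and the three target tables (user, votes, friends) are a hypothetical layout, not the team's real schema. The sample record is cut down to the fields visible on the slide.

```python
import json

def flatten_user(record):
    """Split one nested user record into rows for three hypothetical tables."""
    # Scalar fields map directly onto columns of a "user" table.
    user_row = {
        "user_id": record["user_id"],
        "name": record["name"],
        "yelping_since": record["yelping_since"],
        "review_count": record["review_count"],
    }
    # The nested "votes" object becomes one row keyed by user_id.
    votes = record.get("votes", {})
    vote_row = {
        "user_id": record["user_id"],
        "funny": votes.get("funny", 0),
        "useful": votes.get("useful", 0),
        "cool": votes.get("cool", 0),
    }
    # The "friends" array becomes one row per (user, friend) pair.
    friend_rows = [(record["user_id"], friend) for friend in record.get("friends", [])]
    return user_row, vote_row, friend_rows

# Abbreviated sample: only the fields shown on the slide, one friend id.
line = (
    '{"yelping_since": "2004-10", '
    '"votes": {"funny": 167, "useful": 280, "cool": 245}, '
    '"review_count": 108, "name": "Russel", '
    '"user_id": "18kPq7GPye-YQ3LyKyAZPw", '
    '"friends": ["rpOyqD_893cqmDAtJLbdog"]}'
)
user_row, vote_row, friend_rows = flatten_user(json.loads(line))
```

Each returned piece maps to an INSERT into its own table, which is what turns one nested document into several flat relational rows.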
Data Warehouse

What questions can we answer?

Design schema to help us do that
Schema
• Snowflake schema
• Allows both data mart
and data mining
SQL Ninja

Athena is very slow

Adding an index made queries slower on
Athena

Question: Why is this?
Data Cube

Partially materialized data cube

Worst query went from 2.4
seconds to 0.01 seconds

Added little complexity to front
end
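
A rough sketch of the partial-materialization idea follows. It uses SQLite in memory so it runs anywhere; the table names, columns, and rows are invented for illustration and are not the project's schema or data. One cuboid (city by year) is precomputed, and a coarser query is answered from it instead of from the raw reviews.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified fact table with made-up rows.
conn.execute("CREATE TABLE review (city TEXT, year INTEGER, stars INTEGER)")
conn.executemany(
    "INSERT INTO review VALUES (?, ?, ?)",
    [
        ("Phoenix", 2013, 5),
        ("Phoenix", 2013, 3),
        ("Phoenix", 2014, 4),
        ("Las Vegas", 2014, 2),
    ],
)

# Materialize only one cuboid (city x year). Storing counts and sums,
# rather than averages, lets coarser aggregates be derived from it.
conn.execute(
    """
    CREATE TABLE cube_city_year AS
    SELECT city, year, COUNT(*) AS n_reviews, SUM(stars) AS sum_stars
    FROM review
    GROUP BY city, year
    """
)

# Roll up to city level from the small summary table instead of
# rescanning every review.
rows = conn.execute(
    """
    SELECT city, SUM(n_reviews), 1.0 * SUM(sum_stars) / SUM(n_reviews)
    FROM cube_city_year
    GROUP BY city
    ORDER BY city
    """
).fetchall()
```

The front end only needs to point its queries at the summary table, which is consistent with the small amount of added complexity reported on the slide.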
Data Mart
Data Mining

Does the subjective response of Yelp review text
match the star rating?

Conduct sentiment analysis and classification on
Yelp Academic Dataset review text to answer this question.
Method

Used the data mining tool RapidMiner.

Used the RapidMiner Text Processing extension
for statistical text analysis.

Applied a Support Vector Machine (SVM) to predict
whether a text is a positive or negative review.
Data Collection for Mining

Collected the mining data from the MySQL
database built in the Data Warehouse part of the project.

Ran several MySQL queries and stored the
results in CSV format.

SELECT text FROM review WHERE stars=4 OR stars=5
ORDER BY RAND() LIMIT 20000 > pos_reviews.csv

SELECT text FROM review WHERE stars=1 OR stars=2 OR stars=3
ORDER BY RAND() LIMIT 20000 > neg_reviews.csv
Input Data

The dataset consisted of 40,000 review texts

20,000 positive reviews (Yelp star rating 4 and 5)

20,000 negative reviews (Yelp star rating 1, 2 and 3)
Data Preprocessing for Mining

Used RapidMiner to perform data preprocessing
for classification.
RapidMiner With Opinion Mining
Training: SVM
Testing: 10-fold Cross Validation
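
For readers unfamiliar with 10-fold cross-validation, the splitting logic can be sketched in plain Python. This is a generic illustration, not RapidMiner's implementation; the function name `k_fold_indices` is hypothetical, and a real run would normally shuffle the samples before splitting.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# With the 40,000 texts described in the slides, each fold holds out
# 4,000 texts for testing and trains on the remaining 36,000.
folds = list(k_fold_indices(40000, k=10))
```

Every text is used for testing exactly once, and the reported accuracy is the average over the ten held-out folds.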
The Result

Accuracy achieved is 84.46%

What is recall?

What is precision?
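
Both questions have standard answers. Precision is the share of predicted positives that are truly positive, and recall is the share of true positives that the classifier finds. The sketch below computes them from confusion-matrix counts. The counts are invented purely for illustration and are not the project's results; the function assumes the denominators are non-zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts.

    tp: positive reviews predicted positive
    fp: negative reviews predicted positive
    fn: positive reviews predicted negative
    tn: negative reviews predicted negative
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of everything predicted positive, how much is right
    recall = tp / (tp + fn)     # of all real positives, how much was found
    return accuracy, precision, recall

# Made-up counts, NOT taken from the slides.
accuracy, precision, recall = classification_metrics(tp=80, fp=10, fn=20, tn=90)
```

Accuracy alone can hide an imbalance between the two kinds of error, which is why precision and recall are usually reported alongside it.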
Conclusion

The result is satisfactory, with an accuracy of
84.46% in predicting the rating class of a review from its text.


This means that about 84 out of 100 Yelp text reviews are correctly
predicted as either rating 4/5 (positive review) or rating 1/2/3
(negative review).
Since the sentiment of Yelp review text
largely agrees with its star rating, this suggests users can trust the
Yelp rating system.
Data Mining Learning Experience

Difficulty learning a tool
with sentiment analysis support.

Problems handling the large amount of data for training
and validation.

Attempted a more ambitious goal, multiclass
classification, without success.
Any Questions?