Yelp Dataset Project
TEAM #1: NIENTZU KUAN, KATHY APPLEBAUM

Why Are Reviews Interesting?
• We all use reviews to make decisions
• Does the sentiment of the text match the rating?
• Do reviews change over time?

Yelp Academic Dataset
• A small subset of data made available for research
• 10 cities in 4 countries
• Over 2 million reviews and ½ million tips from ½ million users
• https://www.yelp.com/dataset_challenge

The Data
Nested JSON format:
{
  "yelping_since":"2004-10",
  "votes":{ "funny":167, "useful":280, "cool":245 },
  "review_count":108,
  "name":"Russel",
  "user_id":"18kPq7GPye-YQ3LyKyAZPw",
  …
  "friends":[ "rpOyqD_893cqmDAtJLbdog",

Data Preprocessing
• Examine data structure
• Design traditional SQL database
• Write Java program to convert JSON into multiple SQL tables

Data Warehouse
• What questions can we answer?
• Design schema to help us do that

Schema
• Snowflake schema
• Allows both data mart and data mining

SQL Ninja
• Athena is very slow
• Adding an index made queries slower on Athena
• Question: Why is this?

Data Cube
• Partially materialized data cube
• Worst query went from 2.4 seconds to 0.01 seconds
• Added little complexity to front end

Data Mart

Data Mining
• Does the subjective response of Yelp review text match the star rating?
• Conduct sentiment analysis and classification on Yelp Academic Dataset text to help us do that.

Method
• Used the data mining tool RapidMiner.
• Utilized RapidMiner extension, Text Processing, for statistical text analysis.
• Applied Support Vector Machine to predict whether the text is a positive or negative review.

Data Collection for Mining
• Collected data for Mining from the MySQL database created in our Data Warehouse part.
• Executed several MySQL queries and stored results in CSV format.
• SELECT text FROM review WHERE stars=4 OR stars=5 ORDER BY RAND() LIMIT 20000 > pos_reviews.csv
• SELECT text FROM review WHERE stars=1 OR stars=2 OR stars=3 ORDER BY RAND() LIMIT 20000 > neg_reviews.csv

Input Data
• The dataset consisted of 40,000 review texts
• 20,000 positive reviews (Yelp star rating 4 and 5)
• 20,000 negative reviews (Yelp star rating 1, 2 and 3)

Data Preprocessing for Mining
• Used RapidMiner to perform data preprocessing for classification.

RapidMiner With Opinion Mining
• Training: SVM
• Testing: 10-fold Cross Validation

The Result
• Accuracy achieved is 84.46%
• What is recall?
• What is precision?

Conclusion
• The result is satisfactory, with an accuracy of 84.46% in predicting whether a review's star rating is positive (4 or 5) or negative (1, 2 or 3).
• This means that roughly 84 out of 100 Yelp text reviews are correctly predicted as either rating 4/5 (positive review) or rating 1/2/3 (negative review). Since the sentiment of Yelp review text correlates with its star rating, users can trust the Yelp rating system.

Data Mining Learning Experience
• Difficulties encountered in learning the tool, which contained sentiment analysis support.
• Issues with handling a large amount of data for training and validation.
• Attempted a more ambitious goal, treating it as a multiclass classification problem, without success.

Any Questions?
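
The Result slide asks what recall and precision are. As a brief sketch, taking "positive review" as the positive class: precision is the share of reviews predicted positive that really are positive, and recall is the share of truly positive reviews that the classifier finds. The slides report only the overall accuracy (84.46%), not per-class counts, so the confusion-matrix numbers below are hypothetical and chosen only for illustration. The `ConfusionMatrix` class is likewise an assumed helper, not part of RapidMiner.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    """Counts for a binary classifier; 'positive review' is the positive class."""
    tp: int  # positive reviews predicted as positive
    fp: int  # negative reviews predicted as positive
    fn: int  # positive reviews predicted as negative
    tn: int  # negative reviews predicted as negative

    def accuracy(self) -> float:
        # Fraction of all reviews that were classified correctly.
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    def precision(self) -> float:
        # Of everything predicted positive, how much really is positive.
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        # Of everything truly positive, how much was found.
        return self.tp / (self.tp + self.fn)


# HYPOTHETICAL counts for a balanced set of 40,000 reviews (20,000 per class).
# These are NOT the project's actual results.
cm = ConfusionMatrix(tp=17000, fp=3200, fn=3000, tn=16800)

print(f"accuracy:  {cm.accuracy():.4f}")
print(f"precision: {cm.precision():.4f}")
print(f"recall:    {cm.recall():.4f}")
```

With balanced classes, as in this 20,000/20,000 split, accuracy is a reasonable summary, but precision and recall show whether the errors fall mostly on one class.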
