ST810 Course Project

This is a group project. You can work with up to two other people (i.e., groups of no more than three). Your group is encouraged to do a course project related to your own research or research interests. The only requirement is that the work is new and that it involves a strong computational component. Alternatively, your group can work on the recently released Yelp Dataset Challenge. The idea is to apply the material learned in this class to the analysis of a real-world “big data” set. To obtain a passing grade on the project, your group must finish the following tasks.

1. Download and preprocess the data. This data set contains information about 11,537 businesses, 43,873 users, and 229,907 reviews in the greater Phoenix, AZ metropolitan area. Preprocessing can be time-consuming but is critical for analyzing big data. A scripting language such as Python, JavaScript, or Perl helps. Both R and Matlab also have packages for importing data from JSON files.

2. Identify a research question. Be liberal here. Many meaningful questions are worth “mining” in this rich data set. Just a few sample questions:
• A good data analysis always starts from descriptive statistics. Are there any good visualization tools to help explore this data set and reveal interesting patterns?
• Each user has rated and/or written reviews for a number of businesses. How can you predict his/her potential ratings of other businesses? This facilitates personalized recommendations.
• Users of Yelp are worried about how reliable the reviews are. How can you identify quirky reviewers?
• Is there any relation between the star rating of a business and its location?
• Here are a few questions hinted at on the Yelp Dataset Challenge webpage. How well can you guess a review’s rating from its text alone? Can you take all of the reviews of a business and predict when it will be the most busy? What makes a review useful, funny, or cool? Can you figure out which business a user is likely to review next? How much of a business’s success is really just location, location, location? What businesses deserve their own subcategory (i.e., Szechuan or Hunan versus just “Chinese restaurants”), and can you learn this from the review text?
There are certainly numerous others to be explored.

3. Develop and implement a method to address your research question. Stick to all the principles you learned in this class.
• Organize all the project files using a version control system. Your svn log will be checked.
• Ensure the reproducibility of your results.
• Stick to a good coding style. E.g., check your R code against the Google R style guide. Make sure your code is readable with sufficient comments.
• If your problem can be formulated as an optimization problem, consider using a professional optimizer such as CPLEX if applicable, or propose and implement your own optimization algorithm.
• High performance computing is encouraged. This includes but is not limited to coding the bottleneck routine in a low-level language, calling optimized math libraries, and parallel computing.

4. Give a 15-minute presentation in the last week of class (April 22 and 24). Your group should present your research question, the method, and preliminary results.

5. Send your group project to instructors ([email protected] and [email protected]) by May 9, 2013. It can be a written project report, video, slides, website, blog, or any other medium.

6. (Optional) Submit your project to Yelp by Monday, May 20, 2013.
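
As a starting point for the preprocessing in task 1 and the descriptive statistics suggested in task 2, here is a minimal Python sketch. It assumes the review data is stored with one JSON object per line and that each review has a numeric `stars` field; both are assumptions to check against the files you actually download. The helper names `read_json_lines` and `star_distribution` and the sample records are made up for illustration.

```python
import json
from collections import Counter
from io import StringIO


def read_json_lines(handle):
    """Yield one dict per non-empty line of a JSON-lines stream.

    Reading line by line avoids loading the whole file into memory.
    """
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def star_distribution(reviews):
    """Count how many reviews carry each star rating.

    Assumes every review dict has a "stars" key (an assumption about the data).
    """
    return Counter(review["stars"] for review in reviews)


# Tiny made-up sample standing in for the real review file.
sample = StringIO(
    '{"business_id": "b1", "user_id": "u1", "stars": 5, "text": "Great"}\n'
    '{"business_id": "b1", "user_id": "u2", "stars": 3, "text": "Fine"}\n'
    '{"business_id": "b2", "user_id": "u1", "stars": 5, "text": "Loved it"}\n'
)

counts = star_distribution(read_json_lines(sample))
for stars in sorted(counts):
    print(stars, counts[stars])
```

To run this on the real data, replace `sample` with a file handle opened on your downloaded review file, e.g. `open(path, encoding="utf-8")`, where `path` is wherever you saved it.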