Course Project - NCSU Statistics

ST810 Course Project
This is a group project. You can work with up to two other people (i.e., groups of no more
than three).
Your group is encouraged to do a course project related to your own research or research interests.
The only requirements are that the work is new and that it involves a strong computational component.
Alternatively, your group can work on the recently released Yelp Dataset Challenge. The idea is to
apply the material learnt in this class to the analysis of a real-world “big data” set. To obtain a
passing grade on the project, your group must finish the following tasks.
1. Download and preprocess the data. This data set contains information about 11,537 businesses,
43,873 users, and 229,907 reviews in the greater Phoenix, AZ metropolitan area. Preprocessing
can be time consuming but is critical for analyzing big data. A scripting language such as
Python, JavaScript, or Perl helps. Both R and Matlab also have packages for importing data
from JSON files.
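As a minimal sketch of the import step in Python, the reader below assumes the data files store one JSON object per line; check the actual layout after downloading. The field names business_id and stars in the sample are placeholders for illustration only, not a statement about the real schema.

```python
import json
from io import StringIO


def read_json_lines(handle):
    """Yield one dict per non-empty line of a JSON-lines text stream."""
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on them
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            # Report which line is malformed so it can be inspected by hand.
            raise ValueError(f"line {line_number}: {err}") from err


# Hypothetical two-record sample standing in for a real data file.
sample = StringIO(
    '{"business_id": "b1", "stars": 4}\n'
    "\n"
    '{"business_id": "b2", "stars": 2}\n'
)
records = list(read_json_lines(sample))
```

For a file on disk, pass an open file object instead of the StringIO sample, e.g. open(path, encoding="utf-8"), where path is wherever your group saved the download.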
2. Identify a research question. Be liberal here. Many meaningful questions are worth “mining”
in this rich data set. Just a few sample questions:
• A good data analysis always starts from descriptive statistics. Are there any good visualization tools to help explore this data set and reveal interesting patterns?
• Each user has rated and/or written reviews for a number of businesses. How can we predict
his/her potential ratings of other businesses? This facilitates personalized recommendations.
• Users of Yelp are worried about how reliable the reviews are. How can quirky reviewers
be identified?
• Is there any relation between the star rating of a business and its location?
• Here are a few questions hinted at on the Yelp Dataset Challenge webpage. How well can
you guess a review’s rating from its text alone? Can you take all of the reviews of a
business and predict when it will be the most busy? What makes a review useful, funny,
or cool? Can you figure out which business a user is likely to review next? How much
of a business’s success is really just location, location, location? What businesses deserve
their own subcategory (i.e., Szechuan or Hunan versus just “Chinese restaurants”), and
can you learn this from the review text?
There are certainly numerous others to be explored.
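For the rating prediction question, one simple starting point is a baseline that adds a user offset and a business offset to the overall mean rating. The sketch below is only an illustration on made-up data: the function names fit_baseline and predict_stars and the toy ratings are hypothetical, and it is not a method prescribed by the course or by Yelp.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """Fit mean + user offset + business offset.

    ratings: list of (user, business, stars) tuples.
    Returns (overall_mean, user_offsets, business_offsets).
    """
    mu = sum(stars for _, _, stars in ratings) / len(ratings)

    # User offset: average deviation of the user's ratings from the mean.
    user_devs = defaultdict(list)
    for user, _, stars in ratings:
        user_devs[user].append(stars - mu)
    user_off = {u: sum(d) / len(d) for u, d in user_devs.items()}

    # Business offset: average residual left after removing the user offset.
    biz_devs = defaultdict(list)
    for user, biz, stars in ratings:
        biz_devs[biz].append(stars - mu - user_off[user])
    biz_off = {b: sum(d) / len(d) for b, d in biz_devs.items()}

    return mu, user_off, biz_off


def predict_stars(model, user, business):
    """Predict a rating, clipped to the 1-5 star range."""
    mu, user_off, biz_off = model
    raw = mu + user_off.get(user, 0.0) + biz_off.get(business, 0.0)
    return min(5.0, max(1.0, raw))


# Made-up ratings, for illustration only.
toy_ratings = [
    ("u1", "b1", 5), ("u1", "b2", 3),
    ("u2", "b1", 4), ("u2", "b3", 2),
]
model = fit_baseline(toy_ratings)
guess = predict_stars(model, "u1", "b3")  # u1 has not rated b3
```

Unseen users or businesses fall back to a zero offset, so a completely new pair is predicted at the overall mean. A baseline like this mainly gives more elaborate models something to beat.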
3. Develop and implement a method to address your research question. Stick to all the principles
you learnt from this class.
• Organize all the project files using a version control system. Your svn log will be checked.
• Ensure the reproducibility of your results.
• Stick to a good coding style. E.g., check your R code against the Google R style guide. Make
sure your code is readable, with sufficient comments.
• If your problem can be formulated as an optimization problem, consider using a professional
optimizer such as CPLEX if applicable, or propose and implement your own optimization
algorithm.
• High performance computing is encouraged. This includes but is not limited to coding the
bottleneck routine in a low level language, calling optimized math libraries, and parallel
computing.
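On the reproducibility point above, one common habit is to drive every random step from an explicit, recorded seed. The sketch below shows a seeded train/test split in Python; the function name split_train_test, the seed value, and the review IDs are hypothetical choices for illustration.

```python
import random


def split_train_test(items, test_fraction=0.2, seed=810):
    """Shuffle with a private, seeded generator and split into two lists.

    Using random.Random(seed) rather than the module-level functions keeps
    the split independent of any other random calls in the program.
    """
    rng = random.Random(seed)
    shuffled = list(items)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical review IDs standing in for real data.
review_ids = [f"r{i}" for i in range(10)]
train_a, test_a = split_train_test(review_ids)
train_b, test_b = split_train_test(review_ids)  # same seed, same split
```

Running the split twice with the same seed gives the same partition on a given Python version; recording the seed and the software versions alongside the results makes it possible to rerun the analysis later.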
4. Give a 15-minute presentation in the last week of class (April 22 and 24). Your group should present
your research question, the method, and preliminary results.
5. Send your group project to the instructors ([email protected] and [email protected]) by May
9, 2013. It can be a written project report, video, slides, website, blog, or any other medium.
6. (Optional) Submit your project to Yelp by Monday, May 20, 2013.