LECTURE 15: BEYOND LINEARITY PT. 1
November 8, 2016
SDS 293 Machine Learning

Announcements
• Assignments
- Feedback for A5/A6 should be out shortly
- Solution posted
- A7 release delayed by 1 week, due 11-17 by 11:59pm
• T-minus 6 weeks until the end of the semester!

Final project
• Goal: apply the ML techniques we’ve learned to solve a real-world problem you care about
• Teams of 2-3 people (recommended) or on your own
• Deliverable: a poster (or interactive visualization) that will be demonstrated during our end-of-semester reception and a 2-page write-up of the methods you used
• Example problems

Yelp dataset challenge
www.yelp.com/dataset_challenge

Yelp dataset challenge: example questions
• Cultural Trends: What makes a particular city different? For example, in which countries are Yelpers sticklers for service quality?
• Location Mining: How much of a business' success is really just location, location, location? Do reviewers' behaviors change when they travel?
• Seasonal Trends: What about seasonal effects: are there more reviews for sports bars on major game days and if so, could you predict that?
• Infer Categories: Do you see any non-intuitive correlations between business categories, e.g., how many karaoke bars also offer Korean food, and vice versa?
• Natural Language Processing (NLP): How well can you guess a review's rating from its text alone?
• Change Points and Events: Can you detect when things change suddenly (e.g. a business coming under new management)?
• Social Graph Mining: Can you figure out who the trend-setters are? For example, who found the best waffle joint before waffles were cool?
Final project
• Activity: topic generation

Activity: real world problems
Step 1: Write a quick description of a data set you think would be interesting to explore at the top of the page, and write your 99 number at the bottom
Step 2: Pass your description clockwise to the next person
Step 3: Read the description of the dataset, and underneath the description, write a question you think someone might want to answer using it
Step 4: Fold over the top of the paper (leaving just your question visible), and pass it clockwise. Now repeat!

For next class: pick a topic
• Before class next Tuesday, write a quick Piazza post about a potential final project topic
• Please include:
- A description of the domain
- The problem(s) you're trying to solve / question(s) you're trying to answer
- The audience
- The data you’ll be using (if you know)
• Not 100% sure? Try a couple and get some feedback (you’re free to change your mind later)
• See a topic you like? Reply to the post and form a team!

Outline
• Final project overview / activity
• Moving beyond linearity
- Polynomial regression
- Step functions
- Splines
- Local regression
- Generalized additive models (GAMs)
• Lab

So far: linear models
• The good:
- Easy to describe & implement
- Straightforward interpretation & inference
• The bad:
- Linearity assumption is (almost) always an approximation
- Sometimes it’s a pretty poor one
• Ridge regression (RR), the lasso, PCA, etc. all try to improve on least squares by controlling the variance of a linear model
• … but linear models can only stretch so far

Flashback: Auto dataset

Polynomial regression
• One simple fix is to use polynomial transformations:
  $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i$
• This example is a quadratic regression
• Big idea: extend the linear model by adding extra predictors that are powers of the original predictors
• Note: this is still a linear model! (and so we can find its coefficients using regular ol’ least squares)

Polynomial regression in practice
• For large enough degree d, a polynomial regression allows us to produce an extremely non-linear curve
• As d increases, this can produce some really weird shapes
• Question: what’s happening in terms of bias vs. variance?
• Answer: increased flexibility → less bias, more variance; in practice, we generally only go to degree 3 or 4 unless we have additional knowledge that more will help

Example: Wage dataset
• Degree 4 polynomial, with 95% confidence interval (i.e. 2x std. error)
• 79 “high earners”
• What’s going on here?
• Relatively sparse = less confident

Global structure in polynomial regression
• Polynomial regression gives us added flexibility, but imposes global structure on the non-linear function of X
• Question: what’s the problem with this?
• Answer: when our data has different behavior in different areas, we wind up with a messy, complicated function trying to describe both parts at once

Step functions
• Big idea: if our data exhibits different behavior in different parts, we can fit a separate “mini-model” on each piece and then glue them together to describe the whole
• Process:
1. Create K cutpoints $c_1, c_2, \ldots, c_K$ in the range of X
2. Construct (K+1) dummy variables, where $I(\cdot)$ is 1 if the condition holds and 0 otherwise:
   $C_0(X) = I(X < c_1)$
   $C_1(X) = I(c_1 \le X < c_2)$
   $\ldots$
   $C_{K-1}(X) = I(c_{K-1} \le X < c_K)$
   $C_K(X) = I(c_K \le X)$
3. Fit a least squares model using $C_1(X), \ldots, C_K(X)$ as predictors (we can exclude $C_0(X)$ because it is redundant with the intercept)

Example: Wage dataset

Granularity in step functions
• Step functions give us added flexibility by letting us model different parts of X independently
• Question: what’s the problem with this?
• Answer: if our data doesn’t have natural breaks, choosing the wrong step size might mean that we “miss the action”

Lab: Polynomials and Step Functions
• To do today’s lab in R: <nothing new>
• To do today’s lab in python: <nothing new>
• Instructions and code: http://www.science.smith.edu/~jcrouser/SDS293/labs/lab12/
• Full version can be found beginning on p. 287 of ISLR
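
Sketch: both techniques as ordinary least squares
The two ideas from this lecture, polynomial regression and step functions, both reduce to building extra predictor columns and running least squares. Below is a minimal sketch of that, assuming NumPy and synthetic data. It is not the lab code linked above: the data is made up (not the Wage or Auto data), and the names fit_polynomial, step_design, and fit_step are hypothetical helpers introduced for illustration.

```python
import numpy as np

# Synthetic data: an assumption for illustration, not the Wage or Auto data.
rng = np.random.default_rng(0)
x = rng.uniform(18, 80, size=200)
y = 60 + 2.5 * x - 0.03 * x**2 + rng.normal(0, 5, size=200)


def fit_polynomial(x, y, degree):
    """Least-squares fit of y on the predictors 1, x, x^2, ..., x^degree."""
    # Powers of x are just extra columns, so this is still a linear model.
    X = np.vander(x, degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def step_design(x, cutpoints):
    """Design matrix: intercept plus dummies C_1(X), ..., C_K(X)."""
    cutpoints = np.sort(np.asarray(cutpoints, dtype=float))
    # bins == 0 means x < c_1; bins == k means c_k <= x < c_{k+1};
    # bins == K means x >= c_K.
    bins = np.searchsorted(cutpoints, x, side="right")
    # C_0 is left out because it is redundant with the intercept.
    dummies = (bins[:, None] == np.arange(1, len(cutpoints) + 1)).astype(float)
    return np.column_stack([np.ones(len(x)), dummies])


def fit_step(x, y, cutpoints):
    """Least-squares fit of y on the step-function dummies."""
    X = step_design(x, cutpoints)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


poly_beta = fit_polynomial(x, y, degree=2)
step_beta = fit_step(x, y, cutpoints=[30, 45, 60])
print("quadratic coefficients (beta_0, beta_1, beta_2):", poly_beta)
print("step coefficients (intercept, C_1, C_2, C_3):", step_beta)
```

In the step fit, the intercept is the mean response in the lowest bin (X < c_1), and each remaining coefficient is the difference between that bin's mean and the lowest bin's mean. The cutpoints 30, 45, 60 are arbitrary choices here, which is exactly the granularity problem raised above.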