Lecture 1.1

important stuff
• Read the syllabus (on CCLE)
• Office Hours: Wednesday 1-1:50
• Midterm in class: May 16
• Final: due Saturday of finals week, with in-class presentation on June 11, 8am-11am
• Send all academic questions to Piazza, not email. Feel free to answer/participate in discussions.

more important stuff
• Please turn homework in on time. Late homework will not be accepted. And “late” means any moment after the deadline.
• We will do lots of work in teams/pairs. Get used to it.

goals
• What are the purposes and uses of statistical modeling?
• What is mean squared error, and how do we estimate it?
• But first: some notation and terminology!

starting slow…
• x represents an explanatory variable, and sometimes we’ll write X to represent a whole collection of such variables.
• y represents the response variable: the thing we’re trying to predict/explain.

For instance:
What factors explain the variation in Type II diabetes?
• y is 1 if a person has Type II diabetes, 0 if not.
• The predictors might include: income, zipcode, age, weight, height, gender, race, etc.

parametric
f(X) is the function that relates the predictors, X, to the response, y.
• Situations like linear regression, in which we KNOW the functional form of f(X) (for example, we know that f(X) is linear), are called parametric problems.
• Once we know that $f(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, it becomes a “simple” matter of estimating the parameters $\beta_0$, $\beta_1$, and $\beta_2$ (and $\sigma$).

In 101A, the function was (almost) always
$f(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
or, using matrix notation,
$f(X) = X\beta$

non-parametric
• In this course, we’ll consider situations in which we do not know the shape of f(X).
• Our goal is to estimate it from the data.

Cross Validation
• One technique we’ll make frequent use of is cross validation.
• In cross validation, you set aside some data and use it to test your model.

testing data vs. training data
• “Training” data are data used to fit a model.
• “Testing” data are data that were NOT used in the fitting process, but are used to test how well your model performs on unseen data.
• For example, for your final exam, we’ve set aside some testing data that you’ll never see.
• You’ll use the data we give you to fit a model, and make predictions for a given set of covariates. Only we will know what the correct response values are, and we will use them to test your model.

Truth vs. data
• f(X) represents the TRUTH: the true relationship between the x’s and y.
• $\hat{f}(X)$ represents a model of that truth, based on data.

many ways to go wrong
• From regression analysis, you know that if the truth is linear, then our estimated model is $\hat{f}(X) = \hat{\beta}_0 + \hat{\beta}_1 x$.
• And you know that $\hat{\beta}_0$ and $\hat{\beta}_1$ are not exactly equal to the true values. Instead, they are the true values plus-or-minus some error.
• You also know that perhaps $\hat{f}(X) = \hat{\beta}_0 + \hat{\beta}_1 x$ isn’t exactly right. Maybe the truth has additional predictors as well.
• And maybe it isn’t even linear. You might need to transform one or all variables.
• Or maybe you need a completely different functional form that is extremely non-linear.

Purpose of Analysis
• Take a moment to consider a small subset of the American Time-Use Survey (“atus copy.txt”).
• With a partner, come up with 4 things you’d like to learn from these data.
• See “data not in textbook” under Site Info on CCLE.
• Now, can you classify these in terms of “inference” and “prediction”?

Inference
Sometimes, we want to know about the parameters.
• All things being equal, how does the mean of y differ when x differs? (i.e., what’s the slope?)
• Which variables are associated with y and which are not? (i.e., which parameters are 0 and which are not?)

Examples
• Do people sleep longer on the weekends than weekdays? (If so, the slope of $\text{predicted sleep} = \beta_0 + \beta_1 \cdot \text{weekend}$ will be positive.)
• What factors are associated with an increased time spent watching TV?
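The weekend-sleep question can be sketched numerically. The example below is only an illustration on synthetic data, not the ATUS file: the names `weekend` and `sleep`, the sample size, and the simulated coefficients (480 and 40 minutes) are all assumptions made up for the demo. It fits the two-parameter linear model by least squares and checks that, with a single 0/1 predictor, the estimated slope equals the difference between the weekend and weekday sample means.

```python
import numpy as np

# Synthetic stand-in for the time-use data; every number here is invented.
rng = np.random.default_rng(0)
n = 200
weekend = rng.integers(0, 2, size=n)  # 1 = weekend day, 0 = weekday
# Assumed "truth" for the demo: 480 min on weekdays, 40 min more on weekends.
sleep = 480 + 40 * weekend + rng.normal(0, 60, size=n)

# Design matrix with an intercept column, so that f(X) = X @ beta.
X = np.column_stack([np.ones(n), weekend])
beta_hat, *_ = np.linalg.lstsq(X, sleep, rcond=None)
b0_hat, b1_hat = beta_hat

# With one 0/1 predictor, the least-squares slope is exactly the
# difference between the two group means.
mean_diff = sleep[weekend == 1].mean() - sleep[weekend == 0].mean()

print(f"intercept estimate: {b0_hat:.1f}")
print(f"slope estimate:     {b1_hat:.1f}")
print(f"weekend - weekday:  {mean_diff:.1f}")
```

A positive estimate is consistent with longer weekend sleep, but the estimate is the true value plus or minus some error, so the sign of one fitted slope does not settle the question by itself.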
Prediction
• Sometimes, we wish to classify objects or predict future y values.

Examples of prediction
• Given how much time someone spends on certain activities, can we predict whether or not they are unemployed?
• Given that someone is unemployed, can we predict how much time they will spend providing child care?

pair programming
• https://youtu.be/vgkahOzFH2Q

For Wednesday
• Read Chapter 2
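To connect prediction with the testing-vs-training idea, here is a minimal sketch, again on synthetic data with invented names (`f_true`, `f_hat`, `x_train`, `x_test`). It assumes a non-linear truth, fits a straight line to the training data only, and estimates mean squared error on held-out testing data as the average squared prediction error. The 75/25 split and the noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def f_true(x):
    # Hypothetical "truth" chosen for the demo; in practice f is unknown.
    return np.sin(x) + 0.5 * x

n = 200
x = rng.uniform(0, 6, size=n)
y = f_true(x) + rng.normal(0, 0.3, size=n)

# Set aside the last 25% of the (already random) sample as testing data.
n_train = int(0.75 * n)
x_train, y_train = x[:n_train], y[:n_train]
x_test, y_test = x[n_train:], y[n_train:]

# Fit a straight line using the training data only.
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def f_hat(x_new):
    return intercept + slope * x_new

def mse(y_obs, y_pred):
    # Mean squared error: average squared gap between observed and predicted.
    return np.mean((y_obs - y_pred) ** 2)

train_mse = mse(y_train, f_hat(x_train))
test_mse = mse(y_test, f_hat(x_test))

print(f"training MSE: {train_mse:.3f}")
print(f"testing MSE:  {test_mse:.3f}")
```

The testing MSE is the more honest estimate of performance on unseen data, because the training MSE is computed on the same points that were used to choose the line.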