STATS 101C

Lecture 1.1
important stuff
•
Read the syllabus (on CCLE)
•
Office Hours: Wednesday 1-1:50
•
Midterm in class: May 16
•
Final: due Saturday of finals week with in-class
presentation on June 11, 8am-11am
•
Send all academic questions to piazza, not email.
Feel free to answer/participate in discussions.
more important stuff
•
Please turn homework in on time. Late homework
will not be accepted. And “late” means any
moment after the deadline.
•
We will do lots of work in teams/pairs. Get used
to it.
goals
•
What are the purposes and uses of statistical
modeling?
•
What is mean squared error (MSE), and how do we
estimate it?
•
But first: some notation and terminology!
starting slow…
•
x represents an explanatory variable, and
sometimes we’ll write X to represent a whole
collection of such variables.
•
y represents the response variable: the thing we’re
trying to predict/explain.
For instance:
•
What factors explain the variation in Type II
diabetes?
•
y is a 1 if a person has Type II diabetes, a 0 if
not.
•
The predictors might include: income, zipcode,
age, weight, height, gender, race, etc.
parametric
$f(X)$ is the function that relates the predictors, X, to the
response, y.
•
Situations like linear regression, in which we KNOW
the functional form of $f(X)$ (for example, we know
that $f(X)$ is linear) are called parametric problems.
•
Once we know that $f(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$,
it becomes a “simple” matter of estimating the
parameters $\beta_0$, $\beta_1$, and $\beta_2$ (and
$\sigma$).
In 101A, the function was (almost) always
$f(X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
or, using matrix notation,
$f(X) = X\beta$
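As a rough illustration of the parametric case, the sketch below simulates data from a linear f(X) with two predictors and then estimates the intercept, the two slopes, and sigma by least squares. The "true" parameter values, the sample size, and all variable names are made up for the example; they do not come from the course materials.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters, chosen only for this illustration.
beta_true = np.array([1.0, 2.0, -0.5])  # beta_0, beta_1, beta_2
sigma_true = 0.3

n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Design matrix X with a leading column of ones for the intercept,
# so that f(X) = X @ beta in matrix notation.
X = np.column_stack([np.ones(n), x1, x2])
y = X @ beta_true + rng.normal(scale=sigma_true, size=n)

# Least-squares estimate of beta.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Estimate sigma from the residuals, using n - p degrees of freedom.
residuals = y - X @ beta_hat
sigma_hat = np.sqrt(residuals @ residuals / (n - X.shape[1]))

print("beta_hat: ", beta_hat)
print("sigma_hat:", sigma_hat)
```

The estimates land close to the values used to simulate the data, but not exactly on them, because they are computed from a finite, noisy sample.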
non-parametric
•
In this course, we’ll consider situations in which we
do not know the shape of f(x)
•
Our goal is to estimate it from the data.
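One way to estimate f from data without assuming its shape is to average the responses of nearby training points. The sketch below uses k-nearest-neighbor averaging as an example of a non-parametric estimator; the slides do not name this method at this point, so treat the choice, the helper name knn_predict, and the sine-shaped "truth" as assumptions made for illustration.

```python
import numpy as np

def knn_predict(x_train, y_train, x_new, k=10):
    """Predict y at each point of x_new by averaging the responses
    of the k nearest training points (single predictor)."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
    # Distance from every new point to every training point.
    dist = np.abs(x_new[:, None] - x_train[None, :])
    # Indices of the k closest training points for each new point.
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(1)
x = rng.uniform(0, 2 * np.pi, size=400)
# A made-up non-linear truth plus noise; the estimator is never told the shape.
y = np.sin(x) + rng.normal(scale=0.2, size=400)

x_grid = np.array([1.0, 3.0, 5.0])
f_hat = knn_predict(x, y, x_grid, k=20)
print("estimate:", f_hat)
print("truth:   ", np.sin(x_grid))
```

No parameters like beta_0 or beta_1 are estimated here. The only tuning choice is k, which controls how local the averaging is.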
Cross Validation
•
One technique we’ll make frequent use of is cross
validation.
•
In cross validation, you set aside some data and
use it to test your model.
testing data vs. training data
•
“Training” data are data used to fit a model
•
“Testing” data are data that were NOT used in the
fitting process, but are used to test how well your
model performs on unseen data.
•
For example, for your final exam, we’ve set aside
some testing data that you’ll never see.
•
You’ll use the data we give you to fit a model, and
make predictions for a given set of covariates. Only
we will know what the correct response values are,
and will use this to test your model.
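The training/testing idea above can be sketched with a single hold-out split: fit on the training rows, then estimate mean squared error on rows the fit never saw. The data here are simulated, and the 75/25 split and the helper names design and mse are arbitrary choices for the example, not something the course prescribes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data; the linear truth and noise level are assumptions.
n = 200
x = rng.uniform(0, 10, size=n)
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=n)

# Randomly set aside 25% of the rows as testing data.
idx = rng.permutation(n)
n_test = n // 4
test_idx, train_idx = idx[:n_test], idx[n_test:]

def design(x):
    """Design matrix with an intercept column."""
    return np.column_stack([np.ones_like(x), x])

def mse(y_true, y_pred):
    """Mean squared error: the average squared prediction error."""
    return float(np.mean((y_true - y_pred) ** 2))

# Fit using the training data only.
beta_hat, *_ = np.linalg.lstsq(design(x[train_idx]), y[train_idx], rcond=None)

train_mse = mse(y[train_idx], design(x[train_idx]) @ beta_hat)
test_mse = mse(y[test_idx], design(x[test_idx]) @ beta_hat)
print("training MSE:", train_mse)
print("testing MSE: ", test_mse)
```

The testing MSE is the more honest estimate of how the model performs on unseen data, since the training MSE is computed on the same rows that were used to choose beta_hat.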
Truth vs. data
•
$f(X)$ represents the TRUTH: the true relationship
between the x’s and y.
•
$\hat{f}(X)$ represents a model of that truth, based on data.
many ways to go wrong
•
From regression analysis, you know that if the truth is linear,
then our estimated model is
$\hat{f}(X) = \hat{\beta}_0 + \hat{\beta}_1 x$
And you know that $\hat{\beta}_0$ and $\hat{\beta}_1$ are not
exactly equal to the true values. Instead, they are the true
values plus-or-minus some error.
•
You also know that perhaps $\hat{f}(X) = \hat{\beta}_0 + \hat{\beta}_1 x$
isn’t exactly right. Maybe the truth has additional
predictors as well.
•
And maybe it isn’t even linear. You might need to
transform one or all variables.
•
Or maybe you need a completely different
functional form that is extremely non-linear.
Purpose of Analysis
•
Take a moment to consider a small subset of the
American Time-Use Survey (“atus copy.txt”).
•
With a partner, come up with 4 things you’d like to
learn from these data.
•
See “data not in textbook” under Site Info on CCLE.
•
Now, can you classify these in terms of “inference”
and “prediction”?
Inference
•
Sometimes, we want to know about the parameters.
•
All things being equal, how does the mean of y
differ when x differs? (i.e., what’s the slope?)
•
Which variables are associated with y and which
are not? (i.e., which parameters are 0 and which
are not?)
Examples
•
Do people sleep longer on the weekends than
on weekdays? (If so, the slope $\beta_1$ in
$\text{predicted sleep} = \beta_0 + \beta_1 \cdot \text{weekend}$
will be positive.)
•
What factors are associated with an increased time
spent watching TV?
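To see why the weekend question is a question about a slope, the sketch below regresses sleep on a 0/1 weekend indicator. With a single indicator predictor, the fitted slope equals the difference between the weekend and weekday sample means. The numbers are simulated and are NOT taken from the American Time-Use Survey; the 480-minute baseline and 45-minute weekend effect are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data, not ATUS: sleep in minutes, weekend coded 1/0.
n = 300
weekend = rng.integers(0, 2, size=n)  # 1 = weekend, 0 = weekday
sleep = 480 + 45 * weekend + rng.normal(scale=60, size=n)

# Fit predicted sleep = beta_0 + beta_1 * weekend by least squares.
X = np.column_stack([np.ones(n), weekend])
(beta0_hat, beta1_hat), *_ = np.linalg.lstsq(X, sleep, rcond=None)

# With a single 0/1 predictor, the slope is the difference in group means
# and the intercept is the weekday mean.
diff_in_means = sleep[weekend == 1].mean() - sleep[weekend == 0].mean()

print("beta1_hat:          ", beta1_hat)
print("difference in means:", diff_in_means)
```

A positive estimated slope says weekend sleep is longer on average in the sample. Deciding whether the true slope is positive is the inference step, which also needs a measure of the estimate's uncertainty.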
Prediction
•
Sometimes, we wish to classify objects or predict
future y values.
Examples of prediction
•
Given how much time someone spends on certain
activities, can we predict whether or not they are
unemployed?
•
Given that someone is unemployed, can we predict
how much time they will spend providing child
care?
pair programming
•
https://youtu.be/vgkahOzFH2Q
For Wednesday
•
Read Chapter 2