Lecture 15 – Tues., Oct. 28
• Review example of one-way layout
• Simple Linear Regression:
– Simple Linear Regression Model, 7.2
– Least Squares Regression Estimation, 7.3.1-7.3.2, 7.3.4
– Causation, 7.5.3
• Next time: Inference for simple linear
regression, 7.3.3, 7.3.5, 7.4.
Review of One-way layout
• Assumptions of ideal model
– All populations have same standard deviation.
– Each population is normal
– Observations are independent
• Planned comparisons: Usual t-test but use all groups to
estimate $\sigma$. If many planned comparisons, use Bonferroni
to adjust for multiple comparisons.
• Test of $H_0: \mu_1 = \mu_2 = \cdots = \mu_I$ vs. alternative that at least
two means differ: one-way ANOVA F-test
• Unplanned comparisons: Use Tukey-Kramer procedure to
adjust for multiple comparisons.
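As a rough illustration of the Bonferroni adjustment mentioned above, the sketch below splits a familywise significance level evenly across the planned comparisons. The 5% level and the choice of all pairwise comparisons among four groups are assumptions made for illustration; they are not taken from the lecture.

```python
from math import comb


def bonferroni_level(familywise_alpha: float, num_comparisons: int) -> float:
    """Per-comparison significance level under the Bonferroni adjustment."""
    if num_comparisons < 1:
        raise ValueError("need at least one comparison")
    return familywise_alpha / num_comparisons


# Assumed for illustration: 4 groups, every pairwise comparison planned,
# and a 5% familywise level.
num_groups = 4
num_pairs = comb(num_groups, 2)  # number of distinct pairs of groups
per_test_alpha = bonferroni_level(0.05, num_pairs)

print(f"{num_pairs} comparisons, each tested at level {per_test_alpha:.4f}")
```

With four groups there are six pairs, so each comparison would be tested at roughly 0.0083 instead of 0.05.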
Review Example
• A developmental psychologist is interested
in the extent to which children’s memory
for facts improves as children get older.
• Ten children of ages 4, 6, 8 and 10 are
randomly selected to participate in the
study.
• Each child is given a 30 item memory test;
the scores are recorded in memorytest.JMP.
Regression for memorytest
• Let Y = score, X = age.
• Each age is a subpopulation.
• The regression of Y on X is the mean of Y as a
function of the subpopulation X, denoted by $\mu\{Y \mid X\}$
• Simple linear regression model:
$\mu\{Y \mid X\} = \beta_0 + \beta_1 X$
– $\beta_1$ = slope = change in mean number of items
remembered for each additional year of age
– $\beta_0$ = intercept = mean number of items
remembered at age 0
• Least squares estimates: $\hat{\beta}_0 = 4.74$, $\hat{\beta}_1 = 1.96$
Regression – General Setup
• General setup: We have data (yi, xi), i=1,…,n. [Later we
will look at setting where we have multiple x’s].
• Y is called the response variable, X is called the
explanatory variable.
• Regression: the mean of Y given X = x, $\mu\{Y \mid X = x\}$
• Regression model: an ideal formula to approximate the
regression $\mu\{Y \mid X\}$
• Simple linear regression model:
$\mu\{Y \mid X\} = \beta_0 + \beta_1 X$
Uses of Regression Analysis
• Description: Describe the association between Y and X,
e.g., case study 7.1.1: What is the relationship between the
distance from Earth (Y) and the recession velocity of extragalactic nebulae (X)? The relationship can be used to
estimate the age of the universe using the theory of the big
bang.
• Passive prediction. Predict y based on x where you do not
plan to manipulate x, e.g., predict today’s stock price based
on yesterday’s stock price.
• Control. Predict what y will be if you change x, e.g.,
predict what your earnings will be if you obtain different
levels of education.
Example (Problem 30)
• Studies over the past two decades have shown that
activity can affect the reorganization of the human
central nervous system.
• Psychologists used magnetic source imaging
(MSI) to measure neuronal activity in the brains of
nine string players and six controls when thumb
and fifth finger of left hand were exposed to mild
stimulation.
• Research hypothesis: String players, who use
fingers of left hand extensively, should show
different brain behavior (in particular more
neuronal activity).
Example Continued
• Two-sided t-test: p-value = 0.0003, CI =
(7.51,18.92), strong evidence that string players
have higher neuron activity than controls
• More interesting question: How much does neuron
activity index increase per extra year of playing
the instrument?
• Y= neuron activity index, X = years playing.
Simple linear regression model: $\mu\{Y \mid X\} = \beta_0 + \beta_1 X$
• What is the interpretation of $\beta_0$ and $\beta_1$ here?
Ideal Model
• Assumptions of ideal simple linear regression
model
– There is a normally distributed subpopulation of
responses for each value of the explanatory variable
– The means of the subpopulations fall on a straight-line
function of the explanatory variable.
– The subpopulation standard deviations are all equal (to
$\sigma$)
– The selection of an observation from any of the
subpopulations is independent of the selection of any
other observation.
Estimating the coefficients
• We want to make the predictions of Y based on X as good
as possible. The best prediction of Y based on X is $\mu\{Y \mid X\}$
• Least Squares Method: Choose coefficients to minimize
the sum of squared prediction errors.
• Fitted value for observation i is its estimated mean:
$\text{fit}_i = \hat{\mu}\{Y \mid X_i\} = \hat{\beta}_0 + \hat{\beta}_1 X_i$
• Residual for observation i is the prediction error of using X
to predict Y: $\text{res}_i = y_i - \text{fit}_i$
• Least squares method: Find estimates that minimize the
sum of squared residuals, solution on page 182.
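The least squares method described above can be sketched in a few lines of Python. The sketch uses the standard closed-form solution for simple linear regression (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x). The function names and the small data set are hypothetical and chosen only for illustration; they are not the lecture's data.

```python
def least_squares_fit(xs, ys):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """res_i = y_i - fit_i, where fit_i = intercept + slope * x_i."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def sum_squared_residuals(xs, ys, intercept, slope):
    return sum(r * r for r in residuals(xs, ys, intercept, slope))


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

b0, b1 = least_squares_fit(xs, ys)
res = residuals(xs, ys, b0, b1)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
print(f"sum of squared residuals = {sum_squared_residuals(xs, ys, b0, b1):.4f}")
```

Moving either coefficient away from the fitted value can only increase the sum of squared residuals, which is what "least squares" means.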
Regression Analysis in JMP
• Use Analyze, Fit Y by X. Put response
variable in Y and explanatory variable in X
(make sure X is continuous).
• Click on fit line under red triangle next to
Bivariate Fit of Y by X.
JMP output for example
[Figure: Bivariate Fit of Neuron activity index By Years playing. Scatterplot of Neuron activity index (vertical axis, 0 to 30) against Years playing (horizontal axis, 0 to 20).]
Linear Fit
Neuron activity index = 7.9715909 + 1.0268308 Years playing

Summary of Fit
RSquare                       0.866986
RSquare Adj                   0.855902
Root Mean Square Error        3.025101
Mean of Response              15.89286
Observations (or Sum Wgts)    14

Parameter Estimates
Term             Estimate     Std Error    t Ratio    Prob>|t|
Intercept        7.9715909    1.206598     6.61       <.0001
Years playing    1.0268308    0.116105     8.84       <.0001
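As a quick consistency check on the output above, the sketch below recomputes two columns from the others: each t ratio is the estimate divided by its standard error, and the adjusted R-square for a single explanatory variable is 1 - (1 - R²)(n - 1)/(n - 2). The numbers are copied from the JMP output; the variable names are only illustrative.

```python
# Values copied from the JMP output above: term -> (estimate, standard error).
estimates = {
    "Intercept": (7.9715909, 1.206598),
    "Years playing": (1.0268308, 0.116105),
}
r_square = 0.866986
n_obs = 14

# t ratio = estimate / standard error
t_ratios = {term: est / se for term, (est, se) in estimates.items()}

# Adjusted R-square with one explanatory variable (n - 2 degrees of freedom).
r_square_adj = 1 - (1 - r_square) * (n_obs - 1) / (n_obs - 2)

for term, t in t_ratios.items():
    print(f"{term}: t ratio = {t:.2f}")
print(f"RSquare Adj = {r_square_adj:.4f}")
```

Both recomputed t ratios (6.61 and 8.84) and the adjusted R-square (about 0.8559) agree with the reported values.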
The standard deviation $\sigma$
• $\sigma$ is the standard deviation in each
subpopulation.
• $\sigma$ measures the accuracy of predictions from the
regression.
• If the simple linear regression model holds, then
approximately
– 68% of the observations will fall within $\sigma$ of the
regression line
– 95% of the observations will fall within $2\sigma$ of the
regression line
Estimating $\sigma$
• Residuals provide basis for an estimate of $\sigma$:
$\hat{\sigma} = \sqrt{\dfrac{\text{sum of all squared residuals}}{\text{degrees of freedom}}}$
• Degrees of freedom for simple linear regression =
n - 2
• If the simple linear regression model holds, then
approximately
– 68% of the observations will fall within $\hat{\sigma}$ of the least
squares line
– 95% of the observations will fall within $2\hat{\sigma}$ of the least
squares line
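The estimate of the residual standard deviation can be sketched as follows: take the residuals from a fitted line, divide their sum of squares by n - 2, and take the square root. The residual values below are hypothetical (chosen so the arithmetic is easy to follow), and the helper names are assumptions, not part of the lecture or of JMP.

```python
from math import sqrt


def estimate_sigma(residuals, num_coefficients=2):
    """sqrt(sum of squared residuals / degrees of freedom).

    For simple linear regression two coefficients are estimated,
    so the degrees of freedom are n - 2.
    """
    df = len(residuals) - num_coefficients
    if df <= 0:
        raise ValueError("need more observations than coefficients")
    return sqrt(sum(r * r for r in residuals) / df)


def fraction_within(residuals, half_width):
    """Share of observations whose residual is at most half_width in size."""
    return sum(abs(r) <= half_width for r in residuals) / len(residuals)


# Hypothetical residuals from a fitted line, for illustration only.
res = [-1.5, 0.5, 2.0, -0.5, 1.0, -2.5, 0.0, 1.0]

sigma_hat = estimate_sigma(res)
print(f"sigma-hat = {sigma_hat:.3f}")
print(f"within 1 sigma-hat: {fraction_within(res, sigma_hat):.0%}")
print(f"within 2 sigma-hat: {fraction_within(res, 2 * sigma_hat):.0%}")
```

With only eight hypothetical residuals the shares (75% and 100%) are only loosely comparable to the 68% and 95% guidelines, which are approximations that hold when the ideal model is appropriate.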
JMP commands
• $\hat{\sigma}$ is found under Summary of Fit and is labeled
“Root Mean Square Error”
• To look at a plot of residuals versus X, click Plot
Residuals under the red triangle next to Linear Fit
after fitting the line.
• To save the residuals or fitted values (predicted
values), click Save Residuals or Save Predicteds
under the red triangle next to Linear Fit after
fitting the line.
Interpolation and Extrapolation
• The simple linear regression model makes it
possible to draw inference about any mean
response, $\hat{\mu}\{Y \mid X\} = \hat{\beta}_0 + \hat{\beta}_1 X$
• Interpolation: Drawing inference about mean
response for X within range of observed X; strong
advantage of regression model is ability to
interpolate.
• Extrapolation: Drawing inference about mean
response for X outside of range of observed X;
dangerous. Straight-line model may hold
approximately over region of observed X but not
for all X.
Extrapolation in Memory Test
• Y=Score on test of 30 items, X = Age.
• Least squares estimates:
$\hat{\mu}\{Y \mid X\} = 4.74 + 1.96 X$
• Predicted Mean of Y at age 0: 4.7
Predicted Mean of Y at age 20: 43.9
Predicted Mean of Y at age 90: 181.1
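The predictions above can be reproduced with a short sketch that also flags when a prediction is an extrapolation or is impossible on a 30-item test. The coefficients, the sampled ages (4, 6, 8, 10) and the 30-item maximum come from the slides; the function names and the extra age 8 (an interpolation, for contrast) are illustrative assumptions.

```python
def predict_mean(x, intercept, slope):
    """Estimated mean response at x from a fitted straight line."""
    return intercept + slope * x


def is_extrapolation(x, observed_xs):
    """True when x lies outside the range of the observed explanatory values."""
    return x < min(observed_xs) or x > max(observed_xs)


intercept, slope = 4.74, 1.96   # least squares estimates from the slide
observed_ages = [4, 6, 8, 10]   # ages sampled in the study
max_score = 30                  # the memory test has 30 items

for age in [0, 8, 20, 90]:
    mean = predict_mean(age, intercept, slope)
    notes = []
    if is_extrapolation(age, observed_ages):
        notes.append("extrapolation")
    if mean > max_score:
        notes.append("exceeds maximum possible score")
    label = ", ".join(notes) if notes else "within observed range"
    print(f"age {age:>2}: predicted mean = {mean:.1f} ({label})")
```

The predicted means at ages 20 and 90 exceed the 30 items on the test, which shows that the straight-line model cannot hold far outside the observed ages.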
Difficulties of extrapolation
• Mark Twain: “In the space of one hundred and seventy-six years, the
Lower Mississippi has shortened itself two hundred and forty-two
miles. That is an average of a trifle over one mile and a third per year.
Therefore, any calm person, who is not blind or idiotic, can see that in
the old Oolitic Silurian period, just a million years ago next November,
the Lower Mississippi River was upward of one million three hundred
thousand miles long, and stuck out over the Gulf of Mexico like a
fishing-rod. And by the same token any person can see that seven
hundred and forty-two years from now the Lower Mississippi will be
only a mile and three-quarters long, and Cairo and New Orleans will
have joined their streets together and be plodding comfortably along
under a single mayor and a mutual board of aldermen. There is
something fascinating about science. One gets such wholesale return
of conjecture out of such a trifling investment of fact.”
Cause and Effect?
• The regression summarizes the association between the
mean response of Y and the value of the explanatory
variable X.
• No cause-and-effect relationship can be inferred unless X is
randomly assigned to units in a randomized experiment.
• A researcher measures the number of television sets per
person X and the average life expectancy Y for the world’s
nations. The regression line has a positive slope – nations
with many TV sets have higher life expectancies. Could
we lengthen the lives of people in Rwanda by shipping
them TV sets?
Brain activity in string players
• Y=neuron activity, X = years playing string
instrument
• Least squares estimates: $\hat{\mu}\{Y \mid X\} = 7.97 + 1.03 X$
• Is this a randomized experiment?
• What is an alternative explanation for the
association between Y and X other than that
playing string instruments causes an
increase in the neuron activity index?