Lecture 16 – Thurs., March 4
• Chi squared test for M&M experiment
• Simple linear regression (Chapter 7.2)
• Next class after spring break: Inference for
simple linear regression (Chapter 7.3-7.4)
Chi-squared test for M&M experiment
• Data in MandM.JMP.
• According to the M&M’s web site, the color
distribution in peanut butter M&M’s is 20%
brown, 20% yellow, 20% red, 20% blue,
10% green and 10% orange. Test
$H_0: p_{\text{brown}} = p_{\text{yellow}} = p_{\text{red}} = p_{\text{blue}} = 0.2,\ p_{\text{green}} = p_{\text{orange}} = 0.1$
$H_a$: at least one of the above probabilities differs from its hypothesized value.
Distributions: Color
[Figure: distribution plot of Color; levels Blue, Brown, Green, Orange, Red, Yellow]
Frequencies
Level     Count     Prob
Blue         92  0.15107
Brown       125  0.20525
Green       117  0.19212
Orange       62  0.10181
Red          79  0.12972
Yellow      134  0.22003
Total       609  1.00000
N Missing: 0
6 Levels
Test Probabilities
Level    Estim Prob  Hypoth Prob
Blue        0.15107      0.20000
Brown       0.20525      0.20000
Green       0.19212      0.10000
Orange      0.10181      0.10000
Red         0.12972      0.20000
Yellow      0.22003      0.20000

Test              ChiSquare  DF  Prob>Chisq
Likelihood Ratio    67.0421   5      <.0001
Pearson             75.3350   5      <.0001
Method: Fix hypothesized values, rescale omitted
We reject the null hypothesis that the distribution of colors of peanut
butter M&M’s is the claimed one,
$H_0: p_{\text{brown}} = p_{\text{yellow}} = p_{\text{red}} = p_{\text{blue}} = 0.2,\ p_{\text{green}} = p_{\text{orange}} = 0.1$
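The two test statistics in the JMP output can be reproduced by hand from the observed counts. The sketch below uses only the counts and hypothesized probabilities shown in the output; the variable names are illustrative, and the critical value 11.07 is the 0.95 quantile of a chi-squared distribution with 5 degrees of freedom taken from a standard table (an addition, not part of the JMP output).

```python
import math

# Observed counts from the JMP Frequencies output.
observed = {"Blue": 92, "Brown": 125, "Green": 117,
            "Orange": 62, "Red": 79, "Yellow": 134}
# Hypothesized probabilities under H0 (from the M&M's web site claim).
hypothesized = {"Blue": 0.2, "Brown": 0.2, "Green": 0.1,
                "Orange": 0.1, "Red": 0.2, "Yellow": 0.2}

n = sum(observed.values())
# Expected count for each color if H0 were true.
expected = {color: n * p for color, p in hypothesized.items()}

# Pearson statistic: sum of (observed - expected)^2 / expected.
pearson = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)
# Likelihood ratio statistic: 2 * sum of observed * log(observed / expected).
likelihood_ratio = 2 * sum(observed[c] * math.log(observed[c] / expected[c])
                           for c in observed)
df = len(observed) - 1

# Assumption: tabled 0.95 quantile of chi-squared with 5 df.
CRITICAL_VALUE_5DF = 11.07
reject_h0 = pearson > CRITICAL_VALUE_5DF

print(f"Pearson = {pearson:.3f}, LR = {likelihood_ratio:.3f}, df = {df}")
print("Reject H0 at the 5% level:", reject_h0)
```

Most of the Pearson statistic comes from Green, whose observed count (117) is nearly double its expected count (60.9).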
Regression Analysis: Motivating Example
• Heart catheterization is performed on children
with congenital heart defects. A Teflon tube
(catheter) is passed into a major vein or artery and
pushed into heart to obtain information about
heart’s physiology and functional ability.
• It would be desirable to accurately predict the
needed length of catheter (Y) based on the height
(X) of the child [Weindling, 1977].
• In a small study of 12 children, the exact catheter
length required was determined by using a
fluoroscope to check that tip of catheter had
reached pulmonary artery.
[Figure: Bivariate Fit of Catheter length required By Height; scatterplot, x-axis Height, y-axis Catheter length required]
Predicting Y based on X
• For each X=height, there is a population of
children with height X.
• What is a good prediction of Y=catheter length
required if we know that a child’s height is X (e.g.,
48)?
• A good prediction is the mean catheter length
required for the population of children with height
X, $\mu\{Y \mid X\}$ (e.g., mean catheter length required for
population of children with height 48, $\mu\{Y \mid X = 48\}$).
Regression Analysis
• The goal of regression analysis is to estimate the
mean of Y for the population with characteristic X,
$\mu\{Y \mid X\}$, called the mean of Y given X [sometimes
the conditional mean of Y given X].
• Simple regression: There is only one characteristic
X.
• Multiple regression: There are several
characteristics $X_1, \ldots, X_p$ [e.g., child’s height and
weight].
• The Y variable that we want to predict is called the
response variable. The X variables that we use to
make the prediction are called the explanatory
variables or predictor variables.
Simple Linear Regression Model
• Simple linear regression model: The mean of Y
given X is a straight line:
$\mu\{Y \mid X\} = \beta_0 + \beta_1 X$
(this is called the regression line)
• $\beta_0$ = Intercept. The mean of Y given X = 0.
• $\beta_1$ = Slope. The amount by which the mean of Y
given X increases for each one unit increase in X.
• Example: Suppose $\mu\{Y \mid X\} = 12 + 0.6X$ for catheter
data. For each additional inch of height, the mean
catheter length required increases by 0.6 cm.
Estimating the coefficients
• We want to make the predictions of Y based on X as good
as possible. The best prediction of Y based on X is $\mu\{Y \mid X\}$.
• Least Squares Method: Choose coefficients to minimize
the sum of squared prediction errors.
• Fitted value for observation i is its estimated mean given
X:
$\text{fit}_i = \hat{\mu}\{Y \mid X_i\} = \hat{\beta}_0 + \hat{\beta}_1 X_i$
• Residual for observation i is the prediction error of using
$\hat{\mu}\{Y \mid X = X_i\}$ to predict $Y_i$: $\text{res}_i = y_i - \text{fit}_i$
• Least squares method: Find estimates that minimize the
sum of squared residuals, solution on page 182.
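As a minimal sketch of the least squares method, the code below uses the standard closed-form solution for simple linear regression: slope = Sxy / Sxx and intercept = mean(y) minus slope times mean(x). The function name and the five-point data set are made up for illustration; they are not the catheter data, which is not listed in these notes.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sxy: sum of cross-products of deviations from the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Sxx: sum of squared deviations of x from its mean.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Toy data, invented for this example only.
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(toy_x, toy_y)
fits = [b0 + b1 * xi for xi in toy_x]                # fitted values
residuals = [yi - fi for yi, fi in zip(toy_y, fits)]  # res_i = y_i - fit_i
sum_sq_res = sum(r ** 2 for r in residuals)

print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print(f"sum of squared residuals = {sum_sq_res:.3f}")
```

With an intercept in the model, the least squares residuals always sum to zero, which is a quick sanity check on any fit.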
Regression Analysis in JMP
• Use Analyze, Fit Y by X. Put response
variable in Y and explanatory variable in X
(make sure X is continuous).
• Click Fit Line under the red triangle next to
Bivariate Fit of Y by X.
[Figure: Bivariate Fit of Catheter length required By Height; scatterplot with linear fit, x-axis Height, y-axis Catheter length required]
Linear Fit
Catheter length required = 12.124045 + 0.5967612 Height
Summary of Fit
RSquare                       0.776459
RSquare Adj                   0.754105
Root Mean Square Error        4.008309
Mean of Response              36.20833
Observations (or Sum Wgts)    12
Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio
Model      1       558.06374      558.064  34.7345
Error     10       160.66543       16.067
C. Total  11       718.72917
Prob > F: 0.0002
Parameter Estimates
Term        Estimate   Std Error  t Ratio  Prob>|t|
Intercept  12.124045    4.247174     2.85    0.0171
Height     0.5967612    0.101256     5.89    0.0002
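Several numbers in the JMP report are simple functions of others. The sketch below recomputes them from the sums of squares, estimates, and standard errors printed above, using the usual definitions (R² = model SS / total SS, F = model mean square / error mean square, t = estimate / standard error); variable names are illustrative.

```python
# Values copied from the JMP Analysis of Variance and Parameter Estimates output.
n = 12
ss_model, ss_error, ss_total = 558.06374, 160.66543, 718.72917
df_model, df_error = 1, n - 2

# Proportion of total variation in Y explained by the line.
r_square = ss_model / ss_total
# Adjusted version penalizes for the number of fitted coefficients.
r_square_adj = 1 - (1 - r_square) * (n - 1) / df_error

mean_square_error = ss_error / df_error
f_ratio = (ss_model / df_model) / mean_square_error

# (estimate, standard error) for each term.
estimates = {"Intercept": (12.124045, 4.247174),
             "Height": (0.5967612, 0.101256)}
t_ratios = {term: est / se for term, (est, se) in estimates.items()}

print(f"RSquare = {r_square:.6f}, RSquare Adj = {r_square_adj:.6f}")
print(f"F Ratio = {f_ratio:.4f}")
for term, t in t_ratios.items():
    print(f"t Ratio for {term} = {t:.2f}")
```

In simple linear regression the F ratio equals the square of the slope's t ratio (up to rounding), which is why both carry the same p-value of 0.0002.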
Predicting Y based on X
• Least squares regression line:
$\hat{\mu}\{Y \mid X\} = 12.124 + 0.597X$
• The estimated mean catheter length required for
children who are 4 feet (48 inches) tall is
$\hat{\mu}\{Y \mid X = 48\} = 12.124 + 0.597 \times 48 = 40.78$ cm.
• A good prediction of the catheter length required
for a child who is 4 feet tall is
$\hat{\mu}\{Y \mid X = 48\} = 40.78$ cm.
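The prediction is just the fitted line evaluated at the child's height. A minimal sketch, using the coefficients rounded to three decimals as on the slide; the function name is illustrative.

```python
# Least squares coefficients, rounded as in the lecture.
B0_HAT = 12.124  # intercept, cm
B1_HAT = 0.597   # slope, cm per inch of height


def predict_catheter_length(height_in):
    """Estimated mean catheter length (cm) for a child of the given height (inches)."""
    return B0_HAT + B1_HAT * height_in


prediction_48 = predict_catheter_length(48)
print(f"Predicted catheter length at 48 inches: {prediction_48:.2f} cm")
```

Using the unrounded JMP estimates (12.124045 and 0.5967612) gives about 40.77 cm instead; the small difference is rounding in the slope.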
Ideal Simple Linear Regression Model
• Assumptions of ideal simple linear regression
model
– There is a normally distributed subpopulation of
responses for each value of the explanatory variable
– The means of the subpopulations fall on a straight-line
function of the explanatory variable.
– The subpopulation standard deviations are all equal (to
$\sigma$)
– The selection of an observation from any of the
subpopulations is independent of the selection of any
other observation.
The standard deviation $\sigma$
• $\sigma$ is the standard deviation in each
subpopulation.
• $\sigma$ measures how accurate the predictions of y
based on x from the regression will be.
• If the simple linear regression model holds, then
approximately
– 68% of the observations will fall within $\sigma$ of the
regression line
– 95% of the observations will fall within $2\sigma$ of the
regression line
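The 68% and 95% figures follow from the normality assumption, and a quick simulation makes them concrete. The sketch below draws observations from an ideal simple linear regression model; the parameter values, the range of X, and the seed are all hypothetical choices made only for this illustration.

```python
import random

# Hypothetical model parameters, chosen only for illustration.
BETA0, BETA1, SIGMA = 12.0, 0.6, 4.0

rng = random.Random(0)  # fixed seed so the run is repeatable
n_draws = 100_000
within_one_sigma = 0
within_two_sigma = 0

for _ in range(n_draws):
    x = rng.uniform(20, 70)              # explanatory variable
    mean_y = BETA0 + BETA1 * x           # point on the regression line
    y = rng.gauss(mean_y, SIGMA)         # normal subpopulation with SD sigma
    deviation = abs(y - mean_y)          # distance from the regression line
    if deviation <= SIGMA:
        within_one_sigma += 1
    if deviation <= 2 * SIGMA:
        within_two_sigma += 1

frac_within_one = within_one_sigma / n_draws
frac_within_two = within_two_sigma / n_draws
print(f"within 1 sigma: {frac_within_one:.3f}, within 2 sigma: {frac_within_two:.3f}")
```

The fractions do not depend on the particular values of the intercept, slope, or X range, only on the errors being normal with a common standard deviation.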
Estimating $\sigma$
• Residuals, $\text{res}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$, are an estimate
of the deviation of $y_i$ from its estimated mean
given $x_i$
• Residuals provide the basis for an estimate of $\sigma$:
$\hat{\sigma} = \sqrt{\dfrac{\text{sum of all squared residuals}}{n-2}}$
• Degrees of freedom for $\hat{\sigma}$ for simple linear
regression = n-2
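A short sketch of this estimate: compute the residuals from a fitted line, sum their squares, divide by n - 2, and take the square root. The helper name and the toy data (with its least squares coefficients 2.2 and 0.6) are invented for illustration; the last lines apply the same formula to the Error sum of squares reported by JMP for the catheter data.

```python
import math


def sigma_hat(x, y, b0, b1):
    """Estimate sigma as sqrt(sum of squared residuals / (n - 2))."""
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sum_sq_res = sum(r ** 2 for r in residuals)
    return math.sqrt(sum_sq_res / (len(x) - 2))


# Toy data, invented for this example; 2.2 and 0.6 are its least squares estimates.
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]
toy_sigma = sigma_hat(toy_x, toy_y, b0=2.2, b1=0.6)

# Catheter data: Error sum of squares and n taken from the JMP output.
catheter_sigma = math.sqrt(160.66543 / (12 - 2))

print(f"toy sigma-hat = {toy_sigma:.4f}")
print(f"catheter sigma-hat = {catheter_sigma:.4f}")
```

The catheter value agrees with the Root Mean Square Error in the Summary of Fit.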
JMP commands
• $\hat{\sigma}$ is found under Summary of Fit and is labeled
“Root Mean Square Error”
• To look at a plot of residuals versus X, click Plot
Residuals under the red triangle next to Linear Fit
after fitting the line.
• To save the residuals or fitted values (predicted
values), click Save Residuals or Save Predicteds
under the red triangle next to Linear Fit after
fitting the line.
Accuracy of predictions
• If the simple linear regression model holds, then
approximately
– 68% of the observations will fall within $\sigma$ of the
regression line
– 95% of the observations will fall within $2\sigma$ of the
regression line
• For catheter data, $\hat{\sigma} = 4.01$. Approximately 68%
of the time the predicted catheter length given
height will be off by at most 4.01 cm;
approximately 95% of the time the predicted
catheter length given height will be off by at most
$2 \times 4.01 = 8.02$ cm.
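As a rough illustration of these rules of thumb, the sketch below puts bands of one and two estimated standard deviations around the prediction for a 48-inch child. It treats the fitted line as if it were the true regression line, so it ignores the uncertainty in the estimated coefficients; the function name is illustrative.

```python
SIGMA_HAT = 4.01              # Root Mean Square Error, rounded as in the lecture
B0_HAT, B1_HAT = 12.124, 0.597


def rough_band(predicted, k):
    """Interval predicted +/- k * sigma-hat (coefficient uncertainty ignored)."""
    return predicted - k * SIGMA_HAT, predicted + k * SIGMA_HAT


predicted_48 = B0_HAT + B1_HAT * 48
band_68 = rough_band(predicted_48, 1)
band_95 = rough_band(predicted_48, 2)

print(f"about 68%: {band_68[0]:.2f} to {band_68[1]:.2f} cm")
print(f"about 95%: {band_95[0]:.2f} to {band_95[1]:.2f} cm")
```

A 95% band roughly 16 cm wide is large relative to catheter lengths of 20 to 50 cm, which suggests height alone does not predict the required length very precisely.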
Interpolation and Extrapolation
• The simple linear regression model makes it possible to
draw inference about any mean response, $\hat{\mu}\{Y \mid X\} = \hat{\beta}_0 + \hat{\beta}_1 X$
• Interpolation: Drawing inference about mean response for
X within range of observed X; strong advantage of
regression model is ability to interpolate (e.g., predict
mean catheter length required for child who is 42.0 inches,
height not observed in sample).
• Extrapolation: Drawing inference about mean response for
X outside of range of observed X; dangerous. Straight-line
model may hold approximately over region of observed X
but not for all X.
• Extrapolation in catheter data:
$\hat{\mu}\{Y \mid X = 6\} = 15.706$ cm $= 6.18$ inches, a predicted catheter
length greater than the 6-inch height itself.
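One way to guard against careless extrapolation is to have the prediction routine report whether the requested X lies inside the range of the observed data. The sketch below does this for the fitted catheter line. The observed height range is an assumption: the individual data values are not listed in these notes, so the limits of 20 and 70 inches are hypothetical placeholders, as is the function name.

```python
CM_PER_INCH = 2.54
B0_HAT, B1_HAT = 12.124, 0.597

# Hypothetical observed height range (inches); replace with the data's min and max.
ASSUMED_MIN_HEIGHT = 20.0
ASSUMED_MAX_HEIGHT = 70.0


def predict_with_flag(height_in, observed_min, observed_max):
    """Return (predicted length in cm, True if this is an extrapolation)."""
    prediction_cm = B0_HAT + B1_HAT * height_in
    is_extrapolation = not (observed_min <= height_in <= observed_max)
    return prediction_cm, is_extrapolation


# Interpolation: 42 inches lies inside the assumed range.
pred_42_cm, flag_42 = predict_with_flag(42, ASSUMED_MIN_HEIGHT, ASSUMED_MAX_HEIGHT)
# Extrapolation: 6 inches lies far below it.
pred_6_cm, flag_6 = predict_with_flag(6, ASSUMED_MIN_HEIGHT, ASSUMED_MAX_HEIGHT)
pred_6_in = pred_6_cm / CM_PER_INCH

print(f"X = 42: {pred_42_cm:.3f} cm, extrapolation = {flag_42}")
print(f"X = 6: {pred_6_cm:.3f} cm = {pred_6_in:.2f} inches, extrapolation = {flag_6}")
```

The flag does not make an extrapolated prediction any more reliable; it only marks where the straight-line model has no data to support it.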
Difficulties of extrapolation
• Mark Twain: “In the space of one hundred and seventy-six years, the
Lower Mississippi has shortened itself two hundred and forty-two
miles. That is an average of a trifle over one mile and a third per year.
Therefore, any calm person, who is not blind or idiotic, can see that in
the old Oolitic Silurian period, just a million years ago next November,
the Lower Mississippi River was upward of one million three hundred
thousand miles long, and stuck out over the Gulf of Mexico like a
fishing-rod. And by the same token any person can see that seven
hundred and forty-two years from now the Lower Mississippi will be
only a mile and three-quarters long, and Cairo and New Orleans will
have joined their streets together and be plodding comfortably along
under a single mayor and a mutual board of aldermen. There is
something fascinating about science. One gets such wholesale return
of conjecture out of such a trifling investment of fact.”