Lecture 16 – Thurs., March 4
• Chi-squared test for M&M experiment
• Simple linear regression (Chapter 7.2)
• Next class after spring break: Inference for simple linear regression (Chapter 7.3-7.4)

Chi-squared test for M&M experiment
• Data in MandM.JMP.
• According to the M&M’s web site, the color distribution in peanut butter M&M’s is 20% brown, 20% yellow, 20% red, 20% blue, 10% green and 10% orange.
• Test
  $H_0: p_{brown} = p_{yellow} = p_{red} = p_{blue} = 0.2,\ p_{green} = p_{orange} = 0.1$
  $H_a$: at least one of the above probabilities is not as claimed.

Distributions: Color
[Bar chart of Color: Blue, Brown, Green, Orange, Red, Yellow]

Frequencies
Level     Count   Prob
Blue         92   0.15107
Brown       125   0.20525
Green       117   0.19212
Orange       62   0.10181
Red          79   0.12972
Yellow      134   0.22003
Total       609   1.00000
N Missing 0, 6 Levels

Test Probabilities
Level     Estim Prob   Hypoth Prob
Blue      0.15107      0.20000
Brown     0.20525      0.20000
Green     0.19212      0.10000
Orange    0.10181      0.10000
Red       0.12972      0.20000
Yellow    0.22003      0.20000

Test               ChiSquare   DF   Prob>Chisq
Likelihood Ratio     67.0421    5   <.0001
Pearson              75.3350    5   <.0001
Method: Fix hypothesized values, rescale omitted

• We reject the null hypothesis that the distribution of colors of peanut butter M&M’s is the claimed
  $H_0: p_{brown} = p_{yellow} = p_{red} = p_{blue} = 0.2,\ p_{green} = p_{orange} = 0.1$

Regression Analysis: Motivating Example
• Heart catheterization is performed on children with congenital heart defects. A Teflon tube (catheter) is passed into a major vein or artery and pushed into the heart to obtain information about the heart’s physiology and functional ability.
• It would be desirable to accurately predict the needed length of catheter (Y) based on the height (X) of the child [Weindling, 1977].
• In a small study of 12 children, the exact catheter length required was determined by using a fluoroscope to check that the tip of the catheter had reached the pulmonary artery.
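Looking back at the M&M test, the two chi-squared statistics in the JMP output can be reproduced by hand from the frequency table. The sketch below uses only the counts and hypothesized probabilities shown on the slide; the variable names are illustrative, and the critical value 11.07 is the standard upper 5% point of the chi-squared distribution with 5 degrees of freedom, not a number taken from the slides.

```python
import math

# Observed counts from the JMP "Frequencies" table.
observed = {"Blue": 92, "Brown": 125, "Green": 117,
            "Orange": 62, "Red": 79, "Yellow": 134}

# Hypothesized proportions under H0 (from the M&M's web site claim).
hypothesized = {"Blue": 0.2, "Brown": 0.2, "Green": 0.1,
                "Orange": 0.1, "Red": 0.2, "Yellow": 0.2}

n = sum(observed.values())
expected = {color: n * p for color, p in hypothesized.items()}

# Pearson statistic: sum of (observed - expected)^2 / expected.
pearson = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)

# Likelihood ratio statistic: 2 * sum of observed * ln(observed / expected).
likelihood_ratio = 2 * sum(
    observed[c] * math.log(observed[c] / expected[c]) for c in observed
)

# Degrees of freedom = number of categories - 1.
df = len(observed) - 1

# Standard table value, used here instead of computing an exact p-value.
CRITICAL_5PCT_DF5 = 11.07

print(f"n = {n}, df = {df}")
print(f"Pearson chi-square     = {pearson:.4f}")
print(f"Likelihood ratio (G^2) = {likelihood_ratio:.4f}")
print("Reject H0 at the 5% level:", pearson > CRITICAL_5PCT_DF5)
```

Most of the Pearson statistic comes from the Green category, where 117 candies were observed against an expected count of 60.9.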
[Figure: Bivariate Fit of Catheter length required By Height. Scatterplot of catheter length required (vertical axis, 15 to 50) against height (horizontal axis, 20 to 70).]

Predicting Y based on X
• For each X = height, there is a population of children with height X.
• What is a good prediction of Y = catheter length required if we know that a child’s height is X (e.g., 48)?
• A good prediction is the mean catheter length required for the population of children with height X, $\mu\{Y \mid X\}$ (e.g., mean catheter length required for the population of children with height 48, $\mu\{Y \mid X = 48\}$).

Regression Analysis
• The goal of regression analysis is to estimate the mean of Y for the population with characteristic X, $\mu\{Y \mid X\}$, called the mean of Y given X (sometimes the conditional mean of Y given X).
• Simple regression: There is only one characteristic X.
• Multiple regression: There are several characteristics $X_1, \ldots, X_p$ (e.g., child’s height and weight).
• The Y variable that we want to predict is called the response variable. The X variables that we use to make the prediction are called the explanatory variables or predictor variables.

Simple Linear Regression Model
• Simple linear regression model: The mean of Y given X is a straight line
  – $\mu\{Y \mid X\} = \beta_0 + \beta_1 X$ (this is called the regression line)
• $\beta_0$ = Intercept. The mean of Y given X = 0.
• $\beta_1$ = Slope. The amount by which the mean of Y given X increases for each one-unit increase in X.
• Example: Suppose $\mu\{Y \mid X\} = 12 + 0.6X$ for the catheter data. For each additional inch of height, the mean catheter length required increases by 0.6 cm.

Estimating the coefficients
• We want to make the predictions of Y based on X as good as possible. The best prediction of Y based on X is $\mu\{Y \mid X\}$.
• Least Squares Method: Choose coefficients to minimize the sum of squared prediction errors.
• Fitted value for observation i is its estimated mean given X: $fit_i = \hat{\mu}\{Y \mid X_i\} = \hat{\beta}_0 + \hat{\beta}_1 X_i$
• Residual for observation i is the prediction error of using $\hat{\mu}\{Y \mid X = X_i\}$ to predict $Y_i$: $res_i = y_i - fit_i$
• Least squares method: Find estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the sum of squared residuals; solution on page 182.

Regression Analysis in JMP
• Use Analyze, Fit Y by X. Put the response variable in Y and the explanatory variable in X (make sure X is continuous).
• Click on Fit Line under the red triangle next to Bivariate Fit of Y by X.

[Figure: Bivariate Fit of Catheter length required By Height, with the fitted line. Catheter length required (vertical axis, 15 to 50) against height (horizontal axis, 20 to 70).]

Linear Fit
Catheter length required = 12.124045 + 0.5967612 Height

Summary of Fit
RSquare                       0.776459
RSquare Adj                   0.754105
Root Mean Square Error        4.008309
Mean of Response              36.20833
Observations (or Sum Wgts)    12

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio
Model       1        558.06374       558.064   34.7345
Error      10        160.66543        16.067
C. Total   11        718.72917
Prob > F   0.0002

Parameter Estimates
Term        Estimate     Std Error   t Ratio   Prob>|t|
Intercept   12.124045    4.247174    2.85      0.0171
Height      0.5967612    0.101256    5.89      0.0002

Predicting Y based on X
• Least squares regression line: $\hat{\mu}\{Y \mid X\} = 12.124 + 0.597X$
• The estimated mean catheter length required for children who are 4 feet (48 inches) tall is $\hat{\mu}\{Y \mid X = 48\} = 12.124 + 0.597 \times 48 = 40.78$ cm.
• A good prediction of the catheter length required for a child who is 4 feet tall is $\hat{\mu}\{Y \mid X = 48\} = 40.78$ cm.

Ideal Simple Linear Regression Model
• Assumptions of the ideal simple linear regression model
  – There is a normally distributed subpopulation of responses for each value of the explanatory variable.
  – The means of the subpopulations fall on a straight-line function of the explanatory variable.
  – The subpopulation standard deviations are all equal (to $\sigma$).
  – The selection of an observation from any of the subpopulations is independent of the selection of any other observation.
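The least squares idea and the prediction at 48 inches can be sketched in a few lines of Python. The closed-form slope and intercept used here are the standard textbook solution for simple linear regression (the slides only point to page 182, so the formula is an assumption about what appears there). The 12 catheter observations are not listed in the slides, so the function is exercised on a small hypothetical data set, and the catheter prediction uses the coefficients reported by JMP.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# HYPOTHETICAL toy data for illustration only (NOT the catheter study data).
toy_x = [20.0, 30.0, 40.0, 50.0]
toy_y = [25.0, 29.0, 37.0, 41.0]

toy_b0, toy_b1 = least_squares(toy_x, toy_y)
toy_fits = [toy_b0 + toy_b1 * x for x in toy_x]
toy_residuals = [y - fit for y, fit in zip(toy_y, toy_fits)]

print(f"toy fit: intercept = {toy_b0:.2f}, slope = {toy_b1:.2f}")
print("toy residuals:", [round(r, 2) for r in toy_residuals])

# Coefficients reported in the JMP "Parameter Estimates" table.
B0_HAT = 12.124045   # intercept, cm
B1_HAT = 0.5967612   # slope, cm per inch of height

# Estimated mean catheter length for children 48 inches tall.
predicted_48 = B0_HAT + B1_HAT * 48
print(f"estimated mean catheter length at 48 in: {predicted_48:.2f} cm")
```

With the unrounded coefficients the estimate is about 40.77 cm; the slide's 40.78 cm comes from using the rounded values 12.124 and 0.597. The toy residuals sum to zero, which always holds for a least squares line that includes an intercept.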
The standard deviation $\sigma$
• $\sigma$ is the standard deviation in each subpopulation.
• $\sigma$ measures how accurate the predictions of y based on x from the regression will be.
• If the simple linear regression model holds, then approximately
  – 68% of the observations will fall within $\sigma$ of the regression line
  – 95% of the observations will fall within $2\sigma$ of the regression line

Estimating $\sigma$
• Residuals, $res_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$, are an estimate of the deviation of $y_i$ from its estimated mean given $x_i$.
• Residuals provide the basis for an estimate of $\sigma$:
  $\hat{\sigma} = \sqrt{\dfrac{\text{sum of all squared residuals}}{n-2}}$
• Degrees of freedom for $\hat{\sigma}$ for simple linear regression = n - 2.

JMP commands
• $\hat{\sigma}$ is found under Summary of Fit and is labeled “Root Mean Square Error”.
• To look at a plot of residuals versus X, click Plot Residuals under the red triangle next to Linear Fit after fitting the line.
• To save the residuals or fitted values (predicted values), click Save Residuals or Save Predicteds under the red triangle next to Linear Fit after fitting the line.

Accuracy of predictions
• If the simple linear regression model holds, then approximately
  – 68% of the observations will fall within $\hat{\sigma}$ of the regression line
  – 95% of the observations will fall within $2\hat{\sigma}$ of the regression line
• For the catheter data, $\hat{\sigma} = 4.01$. Approximately 68% of the time the predicted catheter length given height will be off by at most 4.01 cm; approximately 95% of the time it will be off by at most 2 × 4.01 = 8.02 cm.

Interpolation and Extrapolation
• The simple linear regression model makes it possible to draw inference about any mean response, $\hat{\mu}\{Y \mid X\} = \hat{\beta}_0 + \hat{\beta}_1 X$.
• Interpolation: Drawing inference about the mean response for X within the range of observed X; a strong advantage of the regression model is the ability to interpolate (e.g., predict mean catheter length required for a child who is 42.0 inches tall, a height not observed in the sample).
• Extrapolation: Drawing inference about mean response for X outside of range of observed X; dangerous.
Straight-line model may hold approximately over the region of observed X but not for all X.
• Extrapolation in catheter data: $\hat{\mu}\{Y \mid X = 6\} = 15.706$ cm $\approx 6.18$ inches (a predicted catheter length longer than the child’s 6-inch height).

Difficulties of extrapolation
• Mark Twain: “In the space of one hundred and seventy-six years, the Lower Mississippi has shortened itself two hundred and forty-two miles. That is an average of a trifle over one mile and a third per year. Therefore, any calm person, who is not blind or idiotic, can see that in the old Oolitic Silurian period, just a million years ago next November, the Lower Mississippi River was upward of one million three hundred thousand miles long, and stuck out over the Gulf of Mexico like a fishing-rod. And by the same token any person can see that seven hundred and forty-two years from now the Lower Mississippi will be only a mile and three-quarters long, and Cairo and New Orleans will have joined their streets together and be plodding comfortably along under a single mayor and a mutual board of aldermen. There is something fascinating about science. One gets such wholesale return of conjecture out of such a trifling investment of fact.”
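Returning to the catheter example, the estimate of the subpopulation standard deviation and the extrapolated prediction at a height of 6 inches can both be recomputed from the JMP output alone. This is a minimal sketch: the constants are copied from the Analysis of Variance and Parameter Estimates tables, the function name `predicted_mean` is illustrative, and the conversion 1 inch = 2.54 cm is the only outside fact used.

```python
import math

# Values copied from the JMP output for the catheter data.
SS_ERROR = 160.66543     # error sum of squares (sum of squared residuals)
N_OBS = 12               # number of children in the study
B0_HAT = 12.124045       # estimated intercept, cm
B1_HAT = 0.5967612       # estimated slope, cm per inch of height

CM_PER_INCH = 2.54

# sigma-hat = sqrt(sum of squared residuals / (n - 2)).
df_error = N_OBS - 2
sigma_hat = math.sqrt(SS_ERROR / df_error)

# Approximate 68% and 95% half-widths around the regression line.
band_68 = sigma_hat
band_95 = 2 * sigma_hat


def predicted_mean(height_in):
    """Estimated mean catheter length (cm) for a given height (inches)."""
    return B0_HAT + B1_HAT * height_in


# Extrapolation far below the observed heights: a 6-inch-tall "child".
extrap_cm = predicted_mean(6.0)
extrap_in = extrap_cm / CM_PER_INCH

print(f"sigma-hat = {sigma_hat:.3f} cm on {df_error} degrees of freedom")
print(f"68% band: +/- {band_68:.2f} cm, 95% band: +/- {band_95:.2f} cm")
print(f"prediction at 6 in: {extrap_cm:.2f} cm = {extrap_in:.2f} in")
```

The recomputed value matches JMP's Root Mean Square Error. The extrapolated catheter length comes out longer than the 6-inch height it was predicted from, which is the kind of nonsensical result the Twain passage is making fun of.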