ST 380 Probability and Statistics for the Physical Sciences
Simple Linear Regression

Regression Models

The value of a response Y may be influenced by the levels of one or more factors. Some factors are qualitative: type of solvent, or brand of gasoline. Others are quantitative: temperature, or pressure. Regression modeling deals largely with quantitative factors, also called explanatory variables or predictors.

When only two levels of a quantitative factor are used, we can detect whether it influences the response, but not how. When several levels of a factor are used, we can begin to describe the way in which it influences the response. In observational data, we do not control the levels, and many different levels may be observed. The simplest type of influence is linear dependence.

Example 12.2

In a particular process for removing arsenic from drinking water, the percentage removed (Y) is affected by the pH (x) of the water. In R:

    arsenic <- read.table("Data/Example-12-02.txt", header = TRUE)
    plot(arsenic)

Clearly the percentage of arsenic removed is influenced by the pH: the percentage removed declines roughly linearly as the pH increases.

The Regression Model

The general idea of a regression model is that the distribution of the response Y depends on the level of the quantitative factor x. In the simple linear regression model, we assume that:

the expected value of Y is a linear function of x: E(Y) = β0 + β1 x;
the variance of Y is constant: V(Y) = σ².
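These assumptions can be illustrated with a small simulation (a Python sketch rather than the course's R; the function name `simulate_slr` and the parameter values are illustrative, not from the text): each response is its linear mean β0 + β1 x plus an independent error with constant variance σ².

```python
import random

def simulate_slr(x_values, beta0, beta1, sigma, seed=0):
    """Draw one response per x from the simple linear regression model:
    Y = beta0 + beta1 * x + eps, with eps ~ Normal(0, sigma^2),
    errors independent across observations."""
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in x_values]

# Illustrative values only (not the arsenic data): E(Y) falls linearly in x,
# as in the example, while the error scale sigma stays constant.
ys = simulate_slr([1, 2, 3, 4, 5], beta0=10.0, beta1=-2.0, sigma=0.5)
```

With sigma = 0 the simulated responses fall exactly on the line, which is a quick sanity check on the mean function.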
Equivalently, if we write ε = Y − E(Y) for the random error in Y, then

    Y = β0 + β1 x + ε,  where E(ε) = 0 and V(ε) = σ².

We also assume that the Y's (or ε's) are independent.

The assumptions appear to be valid for the arsenic example. In other situations, any of the three assumptions might be violated:

The expected value E(Y) is often a nonlinear function of x.
The variance V(Y) may not be constant; it often increases with E(Y).
The responses may be correlated; measurements of the same variable collected over time are usually correlated with each other.

Also, the response is typically influenced by more than one factor. For now, we ignore these complications.

Statistical Inference

The parameters of the simple linear regression model are:

β0, the intercept of the line;
β1, the slope of the line;
σ², the error variance.

As always, we want point estimators and interval estimators of them, and we want to test hypotheses about them.

Least Squares

The most commonly used estimators of β0 and β1 are the least squares estimators. They are also maximum likelihood estimators if Y has a normal distribution.

Suppose that we observe n pairs (x1, y1), (x2, y2), ..., (xn, yn). For any trial values b0 and b1, we can use the line y = b0 + b1 x to predict what the value of Y should have been at each xi:

    ŷi(b0, b1) = b0 + b1 xi.
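The trial-line prediction ŷi = b0 + b1 xi is a one-line computation (a Python sketch; the helper name and the trial values b0 = 1, b1 = 2 are illustrative only):

```python
def predict(b0, b1, xs):
    """Predicted responses yhat_i(b0, b1) = b0 + b1 * x_i at each observed x_i."""
    return [b0 + b1 * x for x in xs]

# A trial line y = 1 + 2x evaluated at x = 0, 1, 2:
yhat = predict(1.0, 2.0, [0.0, 1.0, 2.0])   # [1.0, 3.0, 5.0]
```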
Since we observed the actual value yi, we can also calculate the prediction error, or residual:

    ri(b0, b1) = yi − ŷi(b0, b1) = yi − (b0 + b1 xi).

Good values of b0 and b1 should make good predictions, with small residuals. The best values, in the least squares sense, give the smallest value of the sum of squared residuals,

    S(b0, b1) = Σⁿᵢ₌₁ ri(b0, b1)² = Σⁿᵢ₌₁ [yi − (b0 + b1 xi)]².

These are the least squares estimators:

    β̂1 = Σⁿᵢ₌₁ (xi − x̄)(yi − ȳ) / Σⁿᵢ₌₁ (xi − x̄)²  and  β̂0 = ȳ − β̂1 x̄.

Estimating σ²

The error sum of squares, or residual sum of squares (for the least squares estimates), is SSE = S(β̂0, β̂1). Under the regression model, E(SSE) = (n − 2)σ², so

    s² = SSE / (n − 2)

is an unbiased estimator of σ². The maximum likelihood estimator has a divisor of n instead of n − 2.

Example 12.2 continued

In R, the lm() function produces least squares estimates of the parameters, and much more:

    summary(lm(Percent ~ pH, arsenic))

Output:

    Call:
    lm(formula = Percent ~ pH, data = arsenic)

    Residuals:
        Min      1Q  Median      3Q     Max
    -9.0421 -4.5110 -0.7635  3.8326 11.1382

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  190.268     12.587   15.12 6.81e-11 ***
    pH           -18.034      1.474  -12.23 1.55e-09 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 6.126 on 16 degrees of freedom
    Multiple R-squared:  0.9034,    Adjusted R-squared:  0.8974
    F-statistic: 149.7 on 1 and 16 DF,  p-value: 1.552e-09

As before, we need to translate the output to the regression model; the "Coefficients" are the estimated parameters:

    (Intercept)   β̂0
    pH            β̂1

Each line gives the parameter estimate, its standard error, the t-statistic for testing the null hypothesis H0: βi = 0, and the P-value for that statistic.

Note that β0 is E(Y) when x = 0; in this case, no drinking water has a pH of zero, so β0 has no physical meaning, and testing hypotheses about it is a waste of time (and a source of Type I errors!).

β1 is the slope of the line, and measures how strongly pH affects the removal of arsenic; in particular, if β1 = 0, pH has no effect, so H0: β1 = 0 is of substantive interest. The output shows that the association of pH with the removal of arsenic is highly significant.

Coefficient of Determination

The coefficient of determination, denoted R², measures how much of the variation in Y is explained by the regression model. It is defined as

    R² = 1 − SSE / SST,

where SST = Σ (yi − ȳ)² is the sum of squared residuals around a horizontal line (β1 = 0) at height ȳ (β̂0 = ȳ). In the R output, the coefficient of determination is labeled "Multiple R-squared".

In the arsenic removal example, R² = 0.9034, which would often be stated as "90.34% of the variance of Y is explained by the linear regression on x", or more loosely as "90.34% of the variation in the effectiveness of removal of arsenic is explained by the effect of pH".
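The formulas above — the least squares estimators β̂0 and β̂1, the unbiased variance estimate s² = SSE/(n − 2), and R² = 1 − SSE/SST — can be collected into one short sketch (Python rather than the course's R; `slr_fit` is a hypothetical helper, not part of any library):

```python
def slr_fit(xs, ys):
    """Closed-form least squares fit of y = b0 + b1*x, with s^2 and R^2."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    # beta1-hat = sum (xi - xbar)(yi - ybar) / sum (xi - xbar)^2
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    b0 = ybar - b1 * xbar                      # beta0-hat = ybar - beta1-hat * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - ybar) ** 2 for y in ys)     # squared residuals around ybar
    return {"b0": b0, "b1": b1,
            "s2": sse / (n - 2),               # unbiased estimator of sigma^2
            "r2": 1 - sse / sst}
```

Run on the arsenic data, this should reproduce the Estimate column and the Multiple R-squared shown in the lm() output above; R's Residual standard error is the square root of s².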