Simple Linear Regression (Session 02)
SADC Course in Statistics

Learning Objectives
At the end of this session, you will be able to
• understand the meaning of a simple linear regression model, its aims and terminology
• determine the best-fitting line describing the relationship between a quantitative response (y) and a quantitative explanatory variable (x)
• interpret the unknown parameters of the regression line

An illustrative example
The data on the next slide show the average number of cigarettes smoked per adult in 1930 and the death rate per million in 1952 for sixteen countries.
The question of interest is whether there is a relationship between the death rate (y) and the level of smoking (x). Here both y and x are quantitative measurements.

The Data
Country              Cig. smoked (x)   Death rate (y)
England and Wales    1378              461
Finland              1662              433
Austria              960               380
Netherlands          632               276
Belgium              1066              254
Switzerland          706               236
New Zealand          478               216
U.S.A.               1296              202
Denmark              465               179
Australia            504               177
Canada               760               176
France               585               140
Italy                455               110
Sweden               388               89
Norway               359               77
Japan                723               40

Start by plotting - shows pattern
[Figure: scatter plot of death rate (vertical axis, 0 to 500) against cigarettes smoked (x) (horizontal axis, 0 to 2000).]
A straight-line relationship seems plausible here.

Recall reasons for modelling
• To determine which of (often) several factors explain variability in the key response of interest;
• To summarise the relationship(s);
• For predictive purposes, e.g. predicting y for given x's, or identifying x's that optimise y in some way.
Note: Presence of an association between variables does not necessarily imply causation.
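
The "start by plotting" step can be sketched in Python as below. This is only an illustration: the slides do not say which software produced the plot, the names SMOKING_DATA and save_scatter are hypothetical, and matplotlib is an assumed third-party dependency that is imported only when the plotting function is called. The values are transcribed from the data table above.

```python
# Sixteen observations transcribed from the data table in the slides:
# (country, cigarettes smoked per adult in 1930, death rate per million in 1952)
SMOKING_DATA = [
    ("England and Wales", 1378, 461),
    ("Finland", 1662, 433),
    ("Austria", 960, 380),
    ("Netherlands", 632, 276),
    ("Belgium", 1066, 254),
    ("Switzerland", 706, 236),
    ("New Zealand", 478, 216),
    ("U.S.A.", 1296, 202),
    ("Denmark", 465, 179),
    ("Australia", 504, 177),
    ("Canada", 760, 176),
    ("France", 585, 140),
    ("Italy", 455, 110),
    ("Sweden", 388, 89),
    ("Norway", 359, 77),
    ("Japan", 723, 40),
]


def save_scatter(path="smoking_scatter.png"):
    """Save a scatter plot of death rate (y) against cigarettes smoked (x).

    Hypothetical helper; requires matplotlib (assumed to be installed).
    """
    import matplotlib
    matplotlib.use("Agg")  # draw to a file, so no display is needed
    import matplotlib.pyplot as plt

    x = [row[1] for row in SMOKING_DATA]
    y = [row[2] for row in SMOKING_DATA]

    fig, ax = plt.subplots()
    ax.scatter(x, y)
    ax.set_xlabel("Cigarettes smoked (x)")
    ax.set_ylabel("Death rate (y)")
    ax.set_xlim(0, 2000)  # axis ranges chosen to match the slide's plot
    ax.set_ylim(0, 500)
    fig.savefig(path)
    plt.close(fig)
    return path
```

Calling save_scatter() writes the plot to a PNG file; inspecting it for a roughly straight-line pattern is the check the slide describes before any model is fitted.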
Describing the Regression Model
Describe variation in the response (here death rate) in terms of its relationship with the explanatory variable (here number of cigarettes smoked).
Model: data = pattern + residual
– the pattern can be described as a + bx if a straight-line relationship seems reasonable
– the residual is unexplained variation, assumed to be random.

Simple Linear Regression Model
If there is only one explanatory variable, we have a Simple Linear Regression Model. Here data = pattern + residual becomes:
$$y = \alpha + \beta x + \varepsilon$$
where $\alpha + \beta x$ = pattern and $\varepsilon$ = residual.
• $\alpha$ is called the intercept
• $\beta$ is called the slope
• the $\varepsilon$'s represent the departures of the observed values from the true line.

A Diagrammatic Representation
[Figure: data points (×) scattered about the line $y = \alpha + \beta x$; at $x = x_i$ the residual $\varepsilon_i$ is the vertical distance between the observed $y_i$ and the line, and $\alpha$ is the intercept on the y axis.]

Parameters of Model & Assumptions
• $\alpha$ and $\beta$ are the unknown parameters in the model. They are estimated from the data.
• The random error, $\varepsilon$, is assumed to have
– a normal distribution
– with constant variance (whatever the value of x).
We shall return to these assumptions later.

Results of model fitting
------------------------------------------------------------------
deathrate |   Coef.   Std.Err.      t    P>|t|    [95% Conf.Int.]
----------+-------------------------------------------------------
Cigars    |   .2410    .0544      4.43   0.001     .1245    .3577
Const.    |   28.31    46.92      0.60   0.556    -72.34   128.95
------------------------------------------------------------------
These are estimates of the coefficients of the regression equation, since this is a sample of data; their precision is quantified by the standard errors.
Estimated equation is: y = 28.31 + 0.241 * x
Note: The t and P>|t| columns will be discussed in the next session.
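
As a rough check on the output table, the following Python sketch fits the line to the sixteen observations from the data slide and splits each observation into pattern plus residual. It uses only the standard library. The variable names are illustrative, and the standard-error formulas are the usual textbook ones for simple linear regression (residual variance on n - 2 degrees of freedom); the slides do not state them, so treat that part as an assumption.

```python
import math

# Observations transcribed from the data slide, in table order.
cigarettes = [1378, 1662, 960, 632, 1066, 706, 478, 1296,
              465, 504, 760, 585, 455, 388, 359, 723]   # x
death_rate = [461, 433, 380, 276, 254, 236, 216, 202,
              179, 177, 176, 140, 110, 89, 77, 40]      # y

n = len(cigarettes)
x_bar = sum(cigarettes) / n
y_bar = sum(death_rate) / n

# Least-squares estimates of the slope and intercept.
s_xx = sum((x - x_bar) ** 2 for x in cigarettes)
s_xy = sum((x - x_bar) * (y - y_bar)
           for x, y in zip(cigarettes, death_rate))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# data = pattern + residual
fitted = [intercept + slope * x for x in cigarettes]          # pattern
residuals = [y - f for y, f in zip(death_rate, fitted)]       # residual

# Textbook standard errors (assumed formulas, not given in the slides).
residual_variance = sum(r ** 2 for r in residuals) / (n - 2)
se_slope = math.sqrt(residual_variance / s_xx)
se_intercept = math.sqrt(residual_variance * (1 / n + x_bar ** 2 / s_xx))

print(f"slope     = {slope:.4f}  (std. err. {se_slope:.4f})")
print(f"intercept = {intercept:.2f}  (std. err. {se_intercept:.2f})")
```

With the data as transcribed here, the estimates agree with the Coef. and Std.Err. columns of the output table to the precision shown, and the residuals sum to zero apart from floating-point rounding.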
The fitted line
[Figure: scatter plot of death rate (y) against cigarettes smoked (x), with the fitted values drawn as a line; x axis 0 to 2000, y axis 0 to 500.]

Interpreting model parameters
• Slope (regression coefficient): if cigarettes smoked increases by 1 unit, the estimated death rate increases by 0.24 units on average. In other words, an increase of 100 units in cigarettes smoked corresponds to an estimated increase of about 24 units in death rate.
• Intercept: the value 28.31 only has a direct meaning if the range of x values (cigarettes smoked) under study includes zero. Here the observed x values run from 359 to 1662, so the estimated death rate of 28.3 at zero cigarettes smoked is an extrapolation beyond the data.

Predictions from the line
The model equation can also be used to predict y at a given value of x.
Thus, from y = 28.31 + 0.241 x, the predicted death rate ($\hat{y}$) in a country where the number of cigarettes smoked is x = 1000 is given by
$$\hat{y} = \hat{\alpha} + \hat{\beta} x = 28.31 + 0.241(1000) = 269.3$$
Note: Predictions will be discussed in greater detail in Session 9.

Computation of model estimates (for reference only)
$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n}$$
$$\hat{\alpha} = \frac{\sum y_i - \hat{\beta} \sum x_i}{n} = \bar{y} - \hat{\beta}\bar{x}$$
Note: Can also write
$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

Practical work follows to ensure learning objectives are achieved…
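
The computational formulas and the prediction step can be tried out with the short Python sketch below. It applies the raw-sum form of Sxy and Sxx to the sixteen observations from the data slide, confirms that the deviation form gives the same quantities, and then predicts the death rate at x = 1000. The data are repeated so that the block runs on its own; the names beta_hat, alpha_hat and predict are illustrative and do not come from the slides.

```python
# Observations transcribed from the data slide, in table order.
cigarettes = [1378, 1662, 960, 632, 1066, 706, 478, 1296,
              465, 504, 760, 585, 455, 388, 359, 723]   # x
death_rate = [461, 433, 380, 276, 254, 236, 216, 202,
              179, 177, 176, 140, 110, 89, 77, 40]      # y

n = len(cigarettes)

# Raw sums used in the computational form of the estimates.
sum_x = sum(cigarettes)
sum_y = sum(death_rate)
sum_xy = sum(x * y for x, y in zip(cigarettes, death_rate))
sum_xx = sum(x * x for x in cigarettes)

# Sxy and Sxx from raw sums.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_xx - sum_x ** 2 / n

beta_hat = s_xy / s_xx                      # estimated slope
alpha_hat = (sum_y - beta_hat * sum_x) / n  # estimated intercept

# Equivalent deviation form: sums of products about the means.
x_bar = sum_x / n
y_bar = sum_y / n
s_xy_dev = sum((x - x_bar) * (y - y_bar)
               for x, y in zip(cigarettes, death_rate))
s_xx_dev = sum((x - x_bar) ** 2 for x in cigarettes)


def predict(x):
    """Predicted death rate at x cigarettes smoked, from the fitted line."""
    return alpha_hat + beta_hat * x


print(f"beta_hat  = {beta_hat:.4f}")
print(f"alpha_hat = {alpha_hat:.2f}")
print(f"predicted death rate at x = 1000: {predict(1000):.1f}")
```

Because predict uses the unrounded estimates, its value at x = 1000 is close to, but not exactly, the 269.3 obtained from the rounded coefficients 28.31 and 0.241. Note also that x = 1000 lies inside the observed range of x, whereas a prediction at x = 0 would not.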