3 Linear modelling and residual analysis

Contents:
A Slopes and equations of lines
B Correlation
C Measuring correlation
D Line of best fit
E Residual analysis

In the next two chapters we will study mathematical modelling. This is the process of finding an equation to describe the relationship between two variables using data obtained by observation or experiment.

In this chapter we will be concerned with linear modelling. We will use linear equations to describe the relationship between two variables. We will also consider how to determine whether it is appropriate to use a linear equation to model the data.

We will begin by reviewing some important properties of lines.

A SLOPES AND EQUATIONS OF LINES

In previous courses we established that:

The slope of a straight line passing through the points \((x_1, y_1)\) and \((x_2, y_2)\) is
\[ m = \frac{y\text{-step}}{x\text{-step}} = \frac{y_2 - y_1}{x_2 - x_1}. \]

If the graph of y against x is linear, then x and y are connected by the rule y = mx + c, where m and c are constants.

y = mx + c is the equation of the line with slope m and y-intercept c.

[Figure: the line y = mx + c, with slope m and y-intercept c]

Example 1
Find the slope and y-intercept of the illustrated line:
[Figure: line through (0, 2) and (4, 3)]

(0, 2) and (4, 3) lie on the line
∴ \( m = \frac{y\text{-step}}{x\text{-step}} = \frac{3 - 2}{4 - 0} = \frac{1}{4} \)
So, the slope is \(\frac{1}{4}\) and the y-intercept is 2.

To find the equation of the line passing through two points \((x_1, y_1)\) and \((x_2, y_2)\), we first find the slope m of the line.
The equation of the line is \( \frac{y - y_1}{x - x_1} = m \).

[Figure: line through \((x_1, y_1)\) and \((x_2, y_2)\), with a general point P(x, y) on it]

Example 2
Find the equation of the line passing through (1, 4) and (3, −2).

The line has slope \( m = \frac{y\text{-step}}{x\text{-step}} = \frac{-2 - 4}{3 - 1} = \frac{-6}{2} = -3 \).
So, the equation of the line is
\( \frac{y - y_1}{x - x_1} = m \)
∴ \( \frac{y - 4}{x - 1} = -3 \)
∴ y − 4 = −3(x − 1)
∴ y = −3x + 7

Suppose we are given or have determined the equation of a line. Given the value of one variable, we can use substitution to find the value of the other.

Example 3
For the line with equation y = 2x + 5, find:
a y given that x = 6
b x given that y = 11.

a Substituting x = 6 into y = 2x + 5 gives
y = 2(6) + 5
∴ y = 17
b Substituting y = 11 into y = 2x + 5 gives
11 = 2x + 5
∴ 6 = 2x
∴ x = 3

EXERCISE 3A

1 Determine the slope and y-intercept of the line with equation:
a y = 2x + 5
b y = 0.8x
c y = 8
d y = 0.52x + 10.3

2 Give the slope and y-intercept of the following lines:
[Figures a, b, c: three lines; the labels shown on the graphs are 4, (4, 4), (5, 5), 3, (0, 2) and (6, 0)]

3 Write down the equation of the line:
a with slope 0.75 and y-intercept 2.13
b with y-intercept 0.75 and slope 2.13.

4 Find the equations of the illustrated lines:
[Figures a, b, c: three lines; the labels shown on the graphs are (20, 3), 3.1, 3, (5.2, 2.9), (8, 2) and 4]

5 For the line with equation y = −5x + 17, find:
a y when x = 3
b x when y = 9.5.

6 A TV repair business charges $40 for a call-out plus an hourly rate of $35.
a Copy and complete the table below:
Hours (x)           0   1   2   3
Total charge ($y)
b Graph y against x.
c Find and interpret the slope of the line.
d Find and interpret the y-intercept.
e Find the equation of the line.
f What is the charge for a \(2\tfrac{1}{4}\) hour repair job?

7 Next month's projected sales of toasters, y, can be modelled by the equation y = 9250 + 8.2x, where x is the advertising expenditure in dollars.
a What increase in sales should result from each $1 of advertising?
b What increase in sales should result from increasing advertising by $2000?

8 When the price of an electric kettle is $x, the demand is modelled by y = 18 000 − 350x kettles.
a How will the demand change if the price per kettle is increased by $1?
b How will the demand be affected if:
i the price is increased by $8
ii the price is decreased by $4?

9 Which of the following statements are true concerning the straight line equation y = mx + c?
A The slope is m and the y-intercept is c.
B If an increase in x results in an increase in y, then m > 0.
C If m < 0, constant increases in x result in constant decreases in y.
D If c < 0, the graph of y = mx + c cuts the horizontal axis to the left of the origin.

B CORRELATION

Often, we wish to know how two variables are associated or related. To find such a relationship we construct and observe a scatter plot.

A scatter plot consists of points plotted on a set of axes. The independent variable is placed on the horizontal axis. The dependent variable is placed on the vertical axis.

Examples of typical plots are:
• weight (kg) against height (cm), for a soccer team where weight is dependent on height
• profit ($) against advertising ($), for a sports goods store where profit is dependent on the amount of advertising done
• IQ against weight (kg), for a study investigating whether a person's weight has any effect on their IQ.

Consider the following experiment:

We wish to examine the relationship between the length of a helical spring and the mass that is hung from the spring. The force of gravity on the mass causes the spring to stretch. As the length of the spring depends on the force applied, the dependent variable is the length.

[Figure: a mass of w grams hung from a spring of length L cm]

The following experimental results are obtained when objects of varying mass are hung from the spring:

Mass (w grams)    0      50     100    150    200    250
Length (L cm)     17.7   20.4   22.0   25.0   26.0   27.8

[Figure: scatter plot of length (cm) against mass (g)]

For each addition of 50 grams in mass, the consecutive increases in length are roughly constant. The points are approximately linear.

CORRELATION

Correlation refers to the relationship or association between two variables.

When looking at the correlation between two variables, we should follow these steps.

Step 1: Look at the scatter plot for any pattern.
For a generally upward shape we say that the correlation is positive. As the independent variable increases, the dependent variable generally increases.
For a generally downward shape we say that the correlation is negative. As the independent variable increases, the dependent variable generally decreases.
For randomly scattered points, with no upward or downward trend, there is usually no correlation.

Step 2: Look at the spread of points to make a judgement about the strength of the correlation. This is a measure of how closely the data follows a pattern or trend.
For positive relationships we would classify the following scatter plots as:
[Figure: three scatter plots with an upward trend, labelled strong, moderate and weak]

Similarly there are strength classifications for negative relationships:
[Figure: three scatter plots with a downward trend, labelled strong, moderate and weak]

Step 3: Look at the pattern of points to see whether it is linear.
[Figure: two scatter plots, captioned "These points are roughly linear." and "These points do not appear to be linear."]

Step 4: Look for and investigate any outliers. These appear as isolated points which do not fit in with the general trend of the data. Outliers should be investigated, as they are sometimes mistakes made in recording or plotting the data. Genuine extraordinary data should be included.
[Figure: scatter plot with one isolated point labelled "outlier" and another point labelled "not an outlier"]

Looking at the scatter plot for the spring data, we can say that there appears to be a strong positive correlation between the mass of the object hung from the spring, and the length of the spring. The relationship appears to be linear, with no obvious outliers.

EXERCISE 3B

1 Describe what is meant by:
a a scatter plot
b correlation
c positive correlation
d negative correlation
e an outlier.

2 a What is meant by independent and dependent variables?
b When drawing a scatter plot, which variable is placed on the horizontal axis?

3 For the following scatter plots, comment on:
i the existence of any pattern (positive, negative or no association)
ii the relationship strength (zero, weak, moderate or strong)
iii whether the relationship is linear
iv whether there are any outliers.
[Figures a to f: six scatter plots of y against x]

4 Ten students participated in a typing contest, where the students were given one minute to type as many words as possible.
The table below shows how many words each student typed, and how many errors they made:

Student                 A    B    C    D    E    F    G    H    I    J
Number of words (x)     40   53   20   65   35   60   85   49   35   76
Number of errors (y)    11   15   2    20   4    22   30   16   27   25

a Draw a scatter plot for this data.
b Name the student who is best described as:
i slow but accurate
ii fast but inaccurate
iii an outlier.
c Describe the direction and strength of correlation between these variables.
d Is the data linear?

You can use technology to construct scatter plots. Consult the graphics calculator instructions at the front of the book.

C MEASURING CORRELATION

In order to measure more precisely the degree to which two variables are linearly related, we can calculate Pearson's correlation coefficient. We denote this coefficient r.

PEARSON'S CORRELATION COEFFICIENT r

For a set of n data given as ordered pairs \((x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)\), Pearson's correlation coefficient is
\[ r = \frac{\sum xy - n\bar{x}\bar{y}}{\sqrt{\left(\sum x^2 - n\bar{x}^2\right)\left(\sum y^2 - n\bar{y}^2\right)}} \]
where \(\bar{x}\) and \(\bar{y}\) are the means of the x and y data respectively, and \(\sum\) means the sum over all the data values.

You are not required to learn this formula.

The values of r range from −1 to +1.

If r = +1, the data are perfectly positively correlated. The data lie exactly on a straight line with positive slope.
If r = 0, the data show no correlation.
If r = −1, the data are perfectly negatively correlated. The data lie exactly on a straight line with negative slope.
[Figure: three scatter plots of y against x illustrating r = +1, r = 0 and r = −1]

POSITIVE CORRELATION

A positive value for r indicates the variables are positively correlated. The closer r is to +1, the stronger the correlation.
Here are some examples of scatter plots for positive correlation:
[Figure: four scatter plots of y against x with r = +1, r = +0.8, r = +0.5 and r = +0.2]

NEGATIVE CORRELATION

A negative value for r indicates the variables are negatively correlated. The closer r is to −1, the stronger the correlation.

Here are some examples of scatter plots for negative correlation:
[Figure: four scatter plots of y against x with r = −1, r = −0.8, r = −0.5 and r = −0.2]

EXERCISE 3C.1

1 Estimate Pearson's correlation coefficient r for the data in the scatter plots below:
[Figures a to f: six scatter plots of y against x]

2 The table below shows the ages of six children, and the number of times they visited the doctor in the last year:

Age                     2    5    7    5    8    3
No. of doctor visits    10   6    5    4    3    8

a Draw a scatter plot of the data.
b Estimate the correlation coefficient.
c Describe the correlation between age and number of doctor visits.

COEFFICIENT OF DETERMINATION r²

To help determine the strength of correlation between two variables, we calculate the coefficient of determination r². This is simply the square of Pearson's correlation coefficient r.

Squaring r eliminates the direction of the correlation, and gives us a value from 0 to 1 which measures the strength of correlation.

The table below is a guide for assessing the strength of linear correlation between two variables.

Value                  Strength of association
r² = 0                 no correlation
0 < r² < 0.25          very weak correlation
0.25 ≤ r² < 0.50       weak correlation
0.50 ≤ r² < 0.75       moderate correlation
0.75 ≤ r² < 0.90       strong correlation
0.90 ≤ r² < 1          very strong correlation
r² = 1                 perfect correlation

USING TECHNOLOGY TO FIND r AND r²

We can use technology to find r and r².
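As an illustration only, the calculation could also be carried out in a general-purpose language instead of on a graphics calculator. The Python sketch below is not part of the original text: the function names `pearson_r` and `strength` are hypothetical, the calculation follows the formula for r stated in Section C, and the labels follow the strength table above. The spring data from Section B is used as input.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r using the sum formula stated in Section C."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - n * x_bar * y_bar
    denominator = sqrt((sum_x2 - n * x_bar ** 2) * (sum_y2 - n * y_bar ** 2))
    return numerator / denominator

def strength(r_squared):
    """Hypothetical helper: look up the strength label from the guide table."""
    if r_squared == 0:
        return "no correlation"
    if r_squared < 0.25:
        return "very weak correlation"
    if r_squared < 0.50:
        return "weak correlation"
    if r_squared < 0.75:
        return "moderate correlation"
    if r_squared < 0.90:
        return "strong correlation"
    if r_squared < 1:
        return "very strong correlation"
    return "perfect correlation"

# Spring data from Section B
mass = [0, 50, 100, 150, 200, 250]
length = [17.7, 20.4, 22.0, 25.0, 26.0, 27.8]

r = pearson_r(mass, length)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}: {strength(r * r)}")
```

The sign of r gives the direction of the correlation, while the label is chosen from r² alone, which matches the way the guide table is used in this section.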
For help, consult the graphics calculator instructions at the front of the book.

Example 4
A group of adults was weighed, and their maximum speed when sprinting was measured:

Weight (x kg)             85   60   78   100   83   67   79   62   88   68
Max. speed (y km h⁻¹)     26   29   24   17    22   30   25   24   19   27

a Use technology to find r and r² for the data.
b Describe the correlation between weight and maximum speed.

a [Screenshots: Casio fx-9860G Plus, TI-nspire, TI-84 Plus]
Using technology, r ≈ −0.813 and r² ≈ 0.662.
b There is a moderate negative correlation between weight and maximum speed.

EXERCISE 3C.2

1 Jill hangs her clothes out to dry every Saturday, and notices that the clothes dry faster some days than others. She investigates the relationship between temperature and the time her clothes take to dry:

Temperature (x °C)      25    32   27   39   35   24    30   36   29   35
Drying time (y min)     100   70   95   25   38   105   70   35   75   40

a Draw a scatter plot for this data.
b Calculate r and r².
c Describe the correlation between temperature and drying time.

2 The table below shows the ticket and beverage sales for each day of a 12 day music festival:

Ticket sales ($x × 1000)      25   22   15   19   12   17   24   20   18   23   29   26
Beverage sales ($y × 1000)    9    7    4    8    3    4    8    10   7    7    9    8

a Draw a scatter plot for this data.
b Calculate r and r².
c Describe the correlation between ticket sales and beverage sales.

3 A local council collected data from a number of parks in the area, recording the size of the parks and the number of gum trees each contained:

Size (hectares)     2.8   6.9   7.4   4.3   8.5   2.3   9.4   5.2   8.0   4.9   6.2   3.3   4.5
No. of gum trees    18    31    33    24    13    17    40    32    37    30    32    25    28

a Draw a scatter plot for this data.
b Would you expect r to be positive or negative?
c Calculate r and r².
d Are there any outliers?
e Remove the outlier, and re-calculate r and r².

D LINE OF BEST FIT

Consider again the scatter plot of the spring data.

[Figure: scatter plot of length (cm) against mass (g) for the spring data]

Since the data is approximately linear, it is reasonable to draw a line of best fit through the data. This line can be used to predict the value of one variable given the value of the other.

There are several ways to fit a straight line to a data set. We will examine two of them:
• the line of best fit 'by eye'
• the 'least squares' regression line.

LINE OF BEST FIT 'BY EYE'

Given a scatter plot for a data set, we can draw a line of best fit 'by eye', which should have about the same number of points above as below it. Its direction should follow the general trend of the data.

To find the equation of this line, we first select two points which lie on the line. We then find the equation of the line passing through these points using the techniques revised in Section A.

Example 5
For the spring data from Section B:
a draw the scatter plot, and draw a line of best fit through the data
b find the equation of the line you have drawn.

a [Figure: scatter plot of length (cm) against mass (g) with a line of best fit drawn by eye]
b The line of best fit above passes through (100, 22) and (200, 26).
So, the line has slope \( m = \frac{26 - 22}{200 - 100} = 0.04 \) and equation \( \frac{y - 22}{x - 100} = 0.04 \)
∴ y − 22 = 0.04x − 4
∴ y = 0.04x + 18
or in this case L = 0.04w + 18.

EXERCISE 3D.1

1 For the following data sets:
i draw the scatter plot
ii draw a line of best fit through the data
iii find the equation of the line you have drawn.
a x y 11 16 7 12 16 32 4 5 8 7
b x y 13 10 18 6 7 17 1 18 10 19 12 13 17 30 6 14 5 14 15 6 12 19 4 15 2 6 17 5 8 17 3 14 13 24 10 10 9 15 18 34 5 6 12 26 5 13

2 Over 10 days the maximum temperature and number of car break-ins was recorded for a city:

Max. temperature (x °C)     22   17   14   18   24   29   33   32   26   22
No. of car break-ins (y)    30   18   9    20   31   38   47   40   29   25

a Draw a scatter plot for the data.
b Describe the correlation between temperature and number of break-ins.
c Draw a line of best fit through the data.
d Find the equation of the line of best fit.
e Use your equation to estimate the number of car break-ins you would expect to occur on a 25°C day.

THE LEAST SQUARES REGRESSION LINE

The problem with finding the line of best fit by eye is that the line drawn will vary from one person to the next.

Instead, mathematicians use a method known as linear regression to find the equation of the line which best fits the data.

Consider the set of points below.
[Figure: four points and a line, with vertical distances \(d_1, d_2, d_3, d_4\) between each point and the line]

For any line we draw to model the points, we can find the vertical distances \(d_1, d_2, d_3, \ldots\) between each point and the line.

We can then square the distances and find their sum \(d_1^2 + d_2^2 + d_3^2 + \ldots\). If all the points are close to the line, this value will be small.

The least squares regression line is the line which minimises this value.
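The sum of squared vertical distances can be computed directly for any candidate line. The following Python sketch is an illustration, not part of the original text. It compares the by-eye line L = 0.04w + 18 from Example 5 with a second candidate, L = 0.05w + 17, which is a made-up line chosen only for comparison; the function name `sum_of_squares` is likewise hypothetical.

```python
# Spring data from Section B
mass = [0, 50, 100, 150, 200, 250]
length = [17.7, 20.4, 22.0, 25.0, 26.0, 27.8]

def sum_of_squares(m, c, xs, ys):
    """Sum of the squared vertical distances between the points and y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

by_eye = sum_of_squares(0.04, 18, mass, length)   # line from Example 5
steeper = sum_of_squares(0.05, 17, mass, length)  # made-up comparison line

print(f"by eye:  {by_eye:.2f}")
print(f"steeper: {steeper:.2f}")
```

Of these two candidates, the by-eye line gives the smaller sum, so it is the better fit in the least squares sense. The least squares regression line is the line for which this sum is as small as possible.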

This demonstration allows you to experiment with various data sets. Use trial and error to find the least squares line of best fit for each set.

In practice, rather than finding this line by experimentation, we use the following formula:

The least squares line has equation y = mx + c where
\[ m = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2} \quad \text{and} \quad c = \bar{y} - m\bar{x}. \]

USING TECHNOLOGY TO FIND THE LINE OF BEST FIT

Instead of using the above formula, we can use technology to find the least squares line of best fit. For help, consult the graphics calculator instructions at the start of the book.

Example 6
Use technology to find the least squares line of best fit for the spring data.

[Screenshots: Casio fx-9860G Plus, TI-nspire, TI-84 Plus]
So, the least squares line of best fit is y ≈ 0.0402x + 18.1, or L ≈ 0.0402w + 18.1.

Compare this equation with the one we obtained when we found the line of best fit by eye.

INTERPOLATION AND EXTRAPOLATION

Suppose we have gathered data to investigate the association between two variables. We obtain the scatter diagram shown below.

The data values with the lowest and highest values of x are called the poles.

We use least squares regression to obtain a line of best fit. We can use the line of best fit to estimate values of one variable given a value for the other.

[Figure: scatter diagram of y against x with the line of best fit; the lower pole and upper pole are marked, with the interpolation region between them and extrapolation regions on either side]

If we use values of x in between the poles, we say we are interpolating between the poles.

If we use values of x outside the poles, we say we are extrapolating outside the poles.
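The formula for m and c can be checked numerically against the calculator result in Example 6. The Python sketch below is illustrative only and is not part of the original text; the function names `least_squares` and `is_interpolation` are hypothetical. The second function simply tests whether an x-value lies between the poles.

```python
# Spring data from Section B
mass = [0, 50, 100, 150, 200, 250]
length = [17.7, 20.4, 22.0, 25.0, 26.0, 27.8]

def least_squares(xs, ys):
    """Slope m and intercept c of the least squares line, from the formula above."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    c = y_bar - m * x_bar
    return m, c

def is_interpolation(x, xs):
    """True when x lies between the poles (the lowest and highest x-values)."""
    return min(xs) <= x <= max(xs)

m, c = least_squares(mass, length)
print(f"L = {m:.4f}w + {c:.1f}")  # L = 0.0402w + 18.1, as in Example 6

for w in (120, 300):
    kind = "interpolation" if is_interpolation(w, mass) else "extrapolation"
    print(f"w = {w}: predicted L = {m * w + c:.1f} cm ({kind})")
```

A mass of 120 g lies between the poles at 0 g and 250 g, so that prediction is an interpolation, whereas 300 g lies outside them and the prediction is an extrapolation.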

The accuracy of an interpolation depends on how linear the original data was. This can be gauged by determining the correlation coefficient and ensuring that the data is randomly scattered around the line of best fit.

The accuracy of an extrapolation depends not only on how linear the original data was, but also on the assumption that the linear trend will continue past the poles. The validity of this assumption depends greatly on the situation under investigation.

As a general rule, it is reasonable to interpolate between the poles, but unreliable to extrapolate outside them.

Example 7
The table below shows how far a group of students live from school, and how long it takes them to travel there each day.

Distance from school (x km)         7.2   4.5   13   1.3   9.9   12.2   19.6   6.1   23.1
Time to travel to school (y min)    17    13    29   2     25    27     41     15    53

a Draw a scatter plot of the data.
b Use technology to find:
i r²
ii the equation of the line of best fit.
c Pam lives 15 km from school.
i Estimate how long it takes Pam to travel to school.
ii Comment on the reliability of your estimate.

a [Figure: scatter plot of y against x]
b i r² ≈ 0.987
ii The line of best fit is y ≈ 2.16x + 1.42.
c i When x = 15, y ≈ 2.16(15) + 1.42 ≈ 33.8
So, it will take Pam approximately 33.8 minutes to travel to school.
ii The estimate is an interpolation, and the r² value indicates a very strong correlation. This suggests that the estimate is reliable.

EXERCISE 3D.2

1 Use technology to find the equation of the least squares regression line for this data set:
3 9 x y 7 5 8 2 4 7 6 5 3 10 8 4 1 15

2 Consider the temperature vs drying time problem in Exercise 3C.2, question 1.
a Use technology to find the equation of the line of best fit.
b Estimate the time it will take for Jill's clothes to dry on a 28°C day.
c How reliable is your estimate in b?

3 Consider the ticket sales vs beverage sales problem in Exercise 3C.2, question 2.
a Find the equation of the line of best fit.
b The music festival is extended by one day, and $35 000 worth of tickets are sold.
i Predict the beverage sales for this day.
ii Comment on the reliability of your prediction.

4 The table below shows the amount of time a collection of families spend preparing homemade meals each week, and the amount of money they spend each week on fast food.

Time on homemade meals (x hours)    3.3   6.0   4.0   8.5   7.2   2.5   9.1   6.9   3.8   7.7
Money on fast food ($y)             85    0     60    0     27    100   15    40    59    29

a Draw a scatter plot of the data.
b Use technology to find the line of best fit.
c State the values of r and r².
d Interpret the slope and y-intercept of the line of best fit.
e Another family spends 5 hours per week preparing homemade meals. Estimate how much money they spend on fast food each week. Comment on the reliability of your estimate.

5 The ages and heights of children at a playground are given below:

Age (x years)    3    9     7     4     4     12    8     6     5     10    13
Height (y cm)    94   132   123   102   109   150   127   110   115   145   157

a Draw a scatter plot of the data.
b Use technology to find the line of best fit.
c At what age would you expect children to reach a height of 140 cm?
d Interpret the slope of the line of best fit.
e Use the line to predict the height of a 20 year old. Do you think this prediction is reliable?

6 Once a balloon has been blown up, it slowly starts to deflate. A balloon's diameter was recorded at various times after it was blown up:

Time (t hours)     0      10     25     40     55     70     90     100    110
Diameter (D cm)    40.2   37.8   34.5   30.2   26.1   23.9   19.8   17.2   14.0

a Draw a scatter plot of the data.
b Describe the correlation between D and t.
c Find the equation of the least squares regression line.
d Use this equation to predict:
i the diameter of the balloon after 80 hours
ii the time it took for the balloon to completely deflate.
e Which of your predictions in d is more likely to be reliable?

7 Each year in AFL football, the Brownlow medal is awarded to the 'Fairest and Best' player in the competition. Scott has a theory that he can predict the winner from the average number of disposals (kicks and handballs) per game. He wants to test his theory on the results for the top 20 vote-getters for the 2009 season.

Player                   A    B    C    D    E    F    G    H    I    J
Disposals per game, x    34   27   28   25   16   28   21   27   27   25
Brownlow votes, y        30   22   20   19   19   17   17   16   15   15

Player                   K    L    M    N    O    P    Q    R    S    T
Disposals per game, x    17   27   29   26   25   27   30   28   28   25
Brownlow votes, y        15   14   14   13   13   13   13   13   13   12

a Construct a scatter plot for the data in the table above, using disposals per game as the independent variable.
b Describe the correlation between disposals per game and Brownlow votes.
c Find Pearson's correlation coefficient for the data.
d Find the equation of the least squares line of best fit.
e Use the line in d to predict the Brownlow votes for a player who averaged 25 disposals per game.
f There are four players in the top 20 who averaged 25 disposals per game. Identify these players and their actual Brownlow votes.
g How reliable is the variable disposals per game as a predictor of Brownlow votes?

E RESIDUAL ANALYSIS

Given a set of data, we have seen how we can draw a scatter plot, then find the line of best fit to model it.
[Figure: scatter plot of y against x with a line of best fit]

However, it is not always appropriate to model data using a straight line.
For example, the data in the figure below exhibits strong positive correlation. However, it is clearly not linear.
[Figure: scatter plot of y against x showing a curved, non-linear trend]

The values of r and r² can be used to determine how well the linear model fits the data. However, to further assess the appropriateness of the linear model, we need to analyse the residuals.

RESIDUALS

For each data point, the residual is given by \( y_{\text{obs}} - y_{\text{pred}} \)
where \( y_{\text{obs}} \) is the observed y-value of the data point,
and \( y_{\text{pred}} \) is the y-value predicted by the line of best fit for the x-value of the data point.

[Figure: line of best fit with one point above the line (positive residual) and one point below the line (negative residual), showing \( y_{\text{obs}} \) and \( y_{\text{pred}} \)]

A positive residual indicates that the data point is above the line of best fit.
A negative residual indicates that the data point is below the line of best fit.

We can then plot the residuals against the x-values to form a residual plot. The residual plot shows how the points vary about the line of best fit.

Here is the residual plot for the data points above:
[Figure: residual plot of residual against x]

Example 8
Consider the data set:

x    3   4   6    9    11
y    7   4   10   11   20

a Find the equation of the line of best fit.
b Calculate the residuals.
c Draw the residual plot.

a Using technology, the line of best fit is y ≈ 1.61x − 0.230.
b We find \( y_{\text{pred}} \) for each data point by evaluating y = 1.61x − 0.230 for each of the x-values.

x     y_obs   y_pred   residual = y_obs − y_pred
3     7       4.60     2.40
4     4       6.21     −2.21
6     10      9.43     0.57
9     11      14.27    −3.27
11    20      17.49    2.51

[Figure: scatter plot of y against x with the line of best fit]

c [Figure: residual plot of residual against x, with the vertical axis marked from −4 to 4]

EXERCISE 3E.1

1 Match the following scatter plots with the correct residual plot:
[Figures a, b, c: three scatter plots of y against x]
[Figures A, B, C: three residual plots of residual against x]

2 A least squares regression line is shown on the scatter plot below.
[Figure: scatter plot of y against x with the regression line]
Which one of the following would be the residual plot for the regression line?
[Figures A to D: four residual plots of residual against x]

3 For the following data sets:
i draw the scatter plot
ii find the line of best fit
iii calculate the residuals
iv draw the residual plot.
3 8 x y 7 10 8 12 11 13 15 17 95 100 50 75 25 0 5 95 12 6 100 50 4 10 75 15 2 25 6 12 0 95 9 14 b 10 16 5 7 13 100 1 18 50 5 9 75 x y 25 2 3 0 95 100 50 75 25 0 5 c x y 5 a

4 Check your residual plots from 3 using technology. For help, consult the graphics calculator instructions at the front of the book.

5 The equation of the least squares regression line applied to the data graphed below is
Diastolic blood pressure = 68.5 + 0.2 × weight.
[Figure: scatter plot of diastolic blood pressure (mm Hg), from 70 to 100, against weight (kg), from 60 to 110]
a Draw the least squares regression line on the graph.
b Estimate the residual for the following points from the graph, then check the value using the equation.
i (82, 87.5)
ii (103.3, 81.4)
c Sketch the residual plot for this regression line.

ANALYSING RESIDUAL PLOTS

We can use residual plots to determine whether it is appropriate to fit a linear model to a data set.

Consider the set of data points below, which appear linear. The line of best fit is also shown.
[Figure: scatter plot of y against x with the line of best fit]

Here is the residual plot for these data points.
[Figure: residual plot of residual against x]
Notice that the points are randomly scattered about the x-axis, with no obvious pattern. This indicates that the data varies randomly about the line of best fit, and so the linear model is appropriate for the data.

Now consider this second set of points, which do not appear to be linear. Again, the line of best fit is given.
[Figure: scatter plot of y against x with the line of best fit]

On the residual plot for these data points, we see the points are not random, but show a clear pattern. This indicates that the linear model is not appropriate for the data.
[Figure: residual plot of residual against x]

In general, a residual plot with points randomly scattered about the x-axis indicates the linear model is appropriate for the data. A residual plot which exhibits a clear pattern indicates the linear model is not appropriate for the data.

Example 9
Consider the data set:

x    1      2      3     4     5     6      7
y    0.75   1.95   3.1   4.2   5.1   5.95   6.7

a Find the line of best fit, and state the value of r².
b Construct a residual plot for the data.
c Is the linear model appropriate for the data?

a Using technology, the line of best fit is y ≈ 0.995x − 0.0143, and r² ≈ 0.993.
[Figure: scatter plot of y against x with the line of best fit]
b Using technology, the residual plot is:
[Figure: residual plot of residual against x]
c The residual plot shows a clear pattern, and does not appear random. This indicates that the linear model is not appropriate for the data.
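The pattern in the residuals of Example 9 can also be seen numerically. The Python sketch below is an illustration and not part of the original text; the variable names are hypothetical. It uses the least squares formula from Section D and the definition residual = y_obs − y_pred.

```python
# Data set from Example 9
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [0.75, 1.95, 3.1, 4.2, 5.1, 5.95, 6.7]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums used by the least squares and correlation formulas
s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
s_xx = sum(x * x for x in xs) - n * x_bar ** 2
s_yy = sum(y * y for y in ys) - n * y_bar ** 2

m = s_xy / s_xx                      # slope of the line of best fit
c = y_bar - m * x_bar                # y-intercept
r_squared = s_xy ** 2 / (s_xx * s_yy)

# residual = observed y minus the y predicted by the line
residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
signs = ["+" if e > 0 else "-" for e in residuals]

print(f"y = {m:.3f}x + ({c:.4f}), r^2 = {r_squared:.3f}")
for x, e, s in zip(xs, residuals, signs):
    print(f"x = {x}: residual = {e:+.3f} ({s})")
```

The residuals are negative at both ends of the data and positive in the middle. Random scatter about the x-axis would not be arranged in this way, which is the clear pattern described in part c.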

This example shows how a high value of r² does not necessarily mean that the line of best fit is appropriate. The original scatter plot, the coefficient of determination r², and the residual plot should all be considered before concluding that a linear model is appropriate.

EXERCISE 3E.2

1 Which one of the following residual plots shows a regression line that is not a good fit for the data? Explain your answer.
[Figures A to D: four residual plots of residual against x]

2 For each of the following data sets:
i draw the scatter plot
ii use technology to find the line of best fit, and state the value of r²
iii use technology to construct the residual plot
iv determine whether the line of best fit is appropriate to model the data.
a x y 1 33 2 29 b x y 2.2 3.6 3.7 7.1 c x y 5 13 9 1 3 25 4 24 9.5 22.5 1 6 12 14 5 20 6 18 7 13 6.2 13.3 1.4 1.9 3.9 7.6 6 2 9 10 7 8 5 9 8 9 9 8 7.5 16.8 2 4 8 18.2 5.5 11.5 10 11

3 In a 60 minute Art lesson, students had to make as many paper cranes as possible. The table shows how long it took each student to make a paper crane, and how many cranes they made during the lesson:

Time taken (t min)    6    8.5   4    5    8   7.5   10   7
Cranes made (C)       10   7     15   12   7   8     6    8

a Draw a scatter plot of the data.
b Use technology to find the line of best fit, and state the value of r².
c Draw the line of best fit on your scatter diagram.
d Use technology to construct the residual plot.
e Is the line of best fit appropriate to model the data?
4 Ten people were asked how many text messages they had sent and received in the last week:

Text messages sent (x)        18  3  7  22  15  5  20  30  7  25
Text messages received (y)    22  2  9  21  16  9  23  33  7  24

a Draw a scatter plot for the data.
b Find the line of best fit, and state the value of r².
c Describe the correlation between text messages sent and text messages received.
d Construct the residual plot.
e Is the line of best fit appropriate to model the data?
f Ted received 10 text messages in the last week.
i Estimate the number of text messages he sent.
ii How reliable is this estimate?

REVIEW SET 3

1 A storage bin for chicken food contains 1000 kg of food pellets. Exactly 13 kg of food is removed each day for the chickens.
a Copy and complete the following table:

Days elapsed (t)         0     1    2  3  4
Pellets in bin (F kg)    1000  987

b What is the dependent variable? What is the independent variable?
c If a graph were to be drawn, on which axis should F be plotted?
d Graph the relationship between F and t.
e What is the slope of the line through these points?
f Write down the function for F in terms of t.
g Interpret the slope and the vertical intercept of the function.
h Using the function from f, determine the amount of food left after a fortnight.
i Determine when the food supply will run out.

2 Temperatures can be expressed in a variety of units. Consider the following table, which shows the relationship between the temperature TC in degrees Celsius, and the temperature TF in degrees Fahrenheit:

Temperature (TC °C)    10  20  30  40
Temperature (TF °F)    50  68  86  104

a Show that the relationship between TF and TC is linear.
b Find the change in TF per unit increase in TC.
c Write down the function for TF in terms of TC.
d Convert 100 °C to °F.
e Convert 32 °F to °C.
3 Thomas rode for an hour each day for eleven days and recorded the number of kilometres travelled against the temperature that day.

Temp. (T °C)       32.9  33.9  35.2  37.1  38.9  30.3  32.5  31.7  35.7  36.3  34.7
Distance (d km)    26.5  26.7  24.4  19.8  18.5  32.6  28.7  29.4  23.8  21.2  29.7

a Draw a scatter plot for this data.
b Describe the association between the distance travelled and temperature.
c Determine the equation of the line of best fit.
d Interpret the slope and vertical intercept of this line.
e Determine the values of r and r².
f Use your equation to predict how hot it must get before Thomas does not ride at all. Comment on the reasonableness of this prediction.

4 Eight identical garden beds were watered a varying number of times each week, and the number of flowers each bed produced is recorded in the table below:

Number of waterings (n)    0   1   2   3    4    5    6    7
Flowers produced (f)       18  52  86  123  158  191  228  250

a Draw a scatter plot for this data.
b Describe the association between the number of waterings and the flowers produced.
c Find the equation of the least squares regression line, and state the values of r and r².
d Interpret the slope and vertical intercept of this line.
e Violet has two flower beds. She waters one five times a fortnight, and the other ten times a week.
i How many flowers can she expect from each bed?
ii Which is the more reliable estimate?

5 After an outbreak of the flu at a school, medical authorities begin recording the number of people diagnosed with the flu.

Days after outbreak (n)    2  3   4   5   6   7   8    9    10   11
People diagnosed (d)       8  14  33  47  80  97  118  123  105  83

a Draw a scatter plot for this data.
b Determine:
i the least squares regression line
ii the values of r and r².
c Construct a residual plot for the linear relationship between d and n.
d Is the line of best fit an appropriate model for the data? Explain your answer.

6 Two supervillains, Silent Predator and the Furry Reaper, terrorise Metropolis by abducting fair maidens. Superman believes that they are collaborating, alternately abducting fair maidens so as not to compete with each other for ransom money. He records their abduction rate below, in dozens of maidens.

Silent Predator (s)    4   6   5   9  3   5  8  11  3   7  7   4
Furry Reaper (f)       13  10  11  8  11  9  6  6   12  7  10  8

a Draw a scatter plot for this data. Plot s on the horizontal axis.
b Determine:
i the least squares regression line
ii the values of r and r².
c Construct a residual plot for the data.
d Is the line of best fit an appropriate model for the data? Explain your answer.