COLLIN COLLEGE
Chapter 4
Describing the relation between two variables
Math 1342.S06
Prof. Ntchobo
Section 4.1 – Scatter Diagrams and Correlation
Correlation – There is a correlation between two variables when one of them is related to the other in
some way.
The response variable is the variable whose value can be explained by the value of the predictor or
explanatory variable.
Example: Create a scatterplot of the following data:
x    3  4  3  1  5  2  5
y    3  2  4  5  2  5  1
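As a rough sketch of what the scatter diagram for this example looks like, the Python snippet below prints a text grid with one `*` per data point. The function name `text_scatter` is purely illustrative, and the code assumes small positive integer data like the values in this example.

```python
def text_scatter(xs, ys):
    """Return a text scatter diagram for small positive integer data."""
    points = set(zip(xs, ys))
    columns = range(1, max(xs) + 1)
    lines = []
    for y in range(max(ys), 0, -1):  # top row holds the largest y-value
        row = "".join(" *" if (x, y) in points else " ." for x in columns)
        lines.append(f"{y} |{row}")
    # x-axis labels, aligned under the grid columns
    lines.append("   " + "".join(f" {x}" for x in columns))
    return "\n".join(lines)

x = [3, 4, 3, 1, 5, 2, 5]
y = [3, 2, 4, 5, 2, 5, 1]
print(text_scatter(x, y))
```

The points run from the upper left to the lower right, which suggests a negative association between x and y.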
Various Types of Relations in a Scatter Diagram
The linear correlation coefficient (r) measures the strength and direction of the linear relationship
between the two variables. It is also called the Pearson product moment correlation coefficient.
$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n(\sum x^2) - (\sum x)^2}\;\sqrt{n(\sum y^2) - (\sum y)^2}} $$
(ALWAYS ROUND TO 3 DECIMAL PLACES)
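To make the formula for r concrete, here is a minimal Python sketch that evaluates it term by term. The function name `pearson_r` is an illustrative choice, and the data are the seven (x, y) pairs from the scatterplot example in Section 4.1.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient via the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

x = [3, 4, 3, 1, 5, 2, 5]
y = [3, 2, 4, 5, 2, 5, 1]
print(round(pearson_r(x, y), 3))  # -0.941
```

Here n = 7, Σx = 23, Σy = 22, Σxy = 59, Σx² = 89, and Σy² = 84, so r = −93 / (√94 · √104) ≈ −0.941, a strong negative association.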
Properties of r :
1. –1 ≤ r ≤ 1
2. The closer r is to +1, the stronger is the evidence of positive association.
3. The closer r is to –1, the stronger is the evidence of negative association.
4. If r is close to 0, then little or no evidence exists of a linear relationship
between the two variables.
5. The value of r is unitless. The units of x and y play no role in the interpretation of r.
6. r is not resistant, i.e., outliers or points that do not follow the pattern will affect the value of r.
To “turn on” r on your calculator go to 2nd, 0 (catalog) and scroll down to “DiagnosticOn” then
hit enter twice. It should say “done” on the screen.
To find r, enter the data in L1 and L2, then go to Stat, Calc, LinReg (#4)
Example: Find r for the following set of data. Describe the association, if any.
x    11   8   4   1   7
y    13  11  10   4  11
How can we use the value of r to determine whether the correlation between the two variables is strong
enough to conclude that there is a linear relationship between them (i.e., a significant linear
correlation)?
Use Table II in Appendix A.
If the absolute value of r exceeds the critical value in Table II for the given sample size n, then a
linear relationship exists between the two variables.
Example: Assume that 20 pairs of data result in a value of r = 0.565.
Is there a linear relationship between x and y ?
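When Table II is not at hand, a critical value can be approximated numerically. The sketch below rests on an assumption the handout does not state: that Table II lists critical values for a two-tailed test at the 0.05 significance level, computed as t / √(t² + n − 2), where t is the Student t critical value with n − 2 degrees of freedom. The function names are illustrative, and the t quantile is found with simple numerical integration and bisection rather than a statistics library.

```python
import math

def t_cdf(t, df, steps=2000):
    """P(T <= t) for t >= 0, using Simpson's rule on the t density."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: coef * (1 + u * u / df) ** (-(df + 1) / 2)
    h = t / steps
    total = pdf(0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3

def critical_r(n, alpha=0.05):
    """Approximate two-tailed critical value of r for n data pairs (assumed alpha)."""
    df = n - 2
    lo, hi = 0.0, 100.0
    for _ in range(60):  # bisection for the t quantile
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    t_crit = (lo + hi) / 2
    return t_crit / math.sqrt(t_crit ** 2 + df)

r, n = 0.565, 20
cutoff = critical_r(n)
print(round(cutoff, 3))   # about 0.444
print(abs(r) > cutoff)    # True
```

Under that assumption, |r| = 0.565 exceeds the cutoff for n = 20, so the example would conclude that a linear relationship exists. The value printed in Table II remains the authority for coursework.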
Example: The following measurements represent the chest size of a bear and its weight.
x Chest (in)      26   45   54   49   41   49   44   19
y Weight (lbs)    90  344  416  348  262  360  332   34
a) Draw a scatter diagram of the data.
b) Calculate r.
c) Determine if there is a linear relationship between a bear’s weight and its chest size.
d) Do the results change if you convert inches to feet?
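Parts b) and d) can be checked with a short Python sketch. It computes r from deviations about the means, which is algebraically equivalent to the computational formula in Section 4.1, once with chest size in inches and once in feet. The function and variable names are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

chest_in = [26, 45, 54, 49, 41, 49, 44, 19]
weight_lb = [90, 344, 416, 348, 262, 360, 332, 34]
chest_ft = [c / 12 for c in chest_in]  # convert inches to feet

r_inches = pearson_r(chest_in, weight_lb)
r_feet = pearson_r(chest_ft, weight_lb)
print(round(r_inches, 3), round(r_feet, 3))  # both about 0.993
```

The two values agree because r is unitless (property 5): rescaling x by a positive constant changes the numerator and the denominator by the same factor.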
Difference Between Correlation and Causation
According to data obtained from the Statistical Abstract of the United States, the correlation between
the percentage of the female population with a bachelor’s degree and the percentage of births to
unmarried mothers since 1990 is 0.940.
Does this mean that a higher percentage of females with bachelor’s degrees causes a higher
percentage of births to unmarried mothers?
Certainly not! The correlation exists only because both percentages have been increasing since 1990.
It is this relation that causes the high correlation. In general, time series data (data collected over time)
will have high correlations because each variable is moving in a specific direction over time (both going
up or down over time; one increasing, while the other is decreasing over time).
When data are observational, we cannot claim a causal relation exists between two variables. We
can only claim causality when the data are collected through a designed experiment.
Another way that two variables can be related even though there is not a causal relation is through a
lurking variable.
A lurking variable is a variable that is related to both the explanatory and response variables.
For example, ice cream sales and crime rates have a very high correlation. Does this mean that local
governments should shut down all ice cream shops? No! The lurking variable is temperature. As air
temperatures rise, both ice cream sales and crime rates rise.
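The claim about time series can be illustrated with a small simulation. The two series below are made-up numbers, not real data: each is a straight upward trend plus its own independent random noise, so neither depends on the other. The seed, slopes, and noise ranges are arbitrary choices for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Linear correlation coefficient from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the run is repeatable
years = range(20)
# Two hypothetical series that both rise over time but do not influence each other.
series_a = [20 + 2.0 * t + rng.uniform(-2, 2) for t in years]
series_b = [5 + 1.0 * t + rng.uniform(-1, 1) for t in years]

r = pearson_r(series_a, series_b)
print(round(r, 3))  # close to +1 even though the series are unrelated
```

The high correlation comes entirely from the shared upward movement over time, which is exactly why a large r in observational time series data does not establish causation.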
Section 4.2 – Least-Squares Regression
Once the linear correlation coefficient has indicated that a linear relationship exists between two
variables, our next step is to find a linear equation that describes the relationship between the two
variables.
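For reference, the least-squares regression line can be written in the standard textbook form below. The symbols are assumptions in the sense that this handout does not define them: b_1 is the slope, b_0 is the y-intercept, s_x and s_y are the sample standard deviations of x and y, and x̄ and ȳ are the sample means.

```latex
\hat{y} = b_1 x + b_0, \qquad
b_1 = r \cdot \frac{s_y}{s_x}, \qquad
b_0 = \bar{y} - b_1 \bar{x}
```

The slope takes the sign of r, and the line always passes through the point (x̄, ȳ).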
Or, you can use your calculator: Stat, Calc, LinReg (#4 or #8)
Using Regression Lines to Make Predictions: When predicting a y-value based on some value of x,
*If there is no significant linear correlation between the variables, the best prediction for y is ȳ, the mean of the observed y-values.
*If there is a significant linear correlation between the variables, the best predicted y-value is found by
substituting the x-value into the regression equation.
*When using the regression equation to predict, stay within the scope of the sample data.
Example:
a) Use your calculator to find the least-squares regression line for the following set of data:
Car Weight (lbs)                 3175  3450  3225  3985  2440  2500  2290
Fuel Consumption (mi per gal)      27    29    27    24    37    34    37
b) Predict the fuel consumption of a car that weighs 3000 lbs.
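As a cross-check on the calculator, the sketch below computes the least-squares slope and intercept directly from the data and then predicts fuel consumption at 3000 lbs. The function name `least_squares` is illustrative; the formulas used are the standard ones (slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², intercept = ȳ − slope · x̄).

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

weight = [3175, 3450, 3225, 3985, 2440, 2500, 2290]  # lbs
mpg = [27, 29, 27, 24, 37, 34, 37]                   # miles per gallon

slope, intercept = least_squares(weight, mpg)
predicted = intercept + slope * 3000
print(f"y-hat = {slope:.5f}x + {intercept:.2f}")
print(f"predicted mpg at 3000 lbs: {predicted:.1f}")  # about 30.8
```

The slope is negative (heavier cars get fewer miles per gallon), and 3000 lbs lies between the lightest (2290 lbs) and heaviest (3985 lbs) cars in the sample, so the prediction stays within the scope of the data.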
Example:
a) Use your calculator to find the least-squares regression line for the following set of data:
x    2   3   3   5   8
y    5   9  10  12  25
b) Calculate r
c) Is there a linear relationship between the variables?
d) What is the best predicted value of y when x = 4?
Example: Suppose ŷ = 0.25 + 3.2x, n = 8,
r = 0.613, and ȳ = 10.
a) Is there a linear relationship between the variables?
b) What is the best predicted value of y when x = 6?
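The decision rule for predictions can be sketched as a small Python function. Two assumptions are made here: the example is read as ŷ = 0.25 + 3.2x with ȳ = 10, and the critical value for n = 8 is taken to be 0.707, which is the usual 0.05-level value but should be confirmed against Table II. The function name `best_prediction` is hypothetical.

```python
def best_prediction(x, intercept, slope, y_bar, r, critical_value):
    """Use the regression line only when the linear correlation is significant."""
    if abs(r) > critical_value:
        return intercept + slope * x
    return y_bar  # otherwise fall back on the mean of the observed y-values

CRITICAL_R_N8 = 0.707  # assumed value for n = 8; confirm against Table II

print(best_prediction(6, 0.25, 3.2, 10, 0.613, CRITICAL_R_N8))  # 10
```

Because 0.613 does not exceed 0.707, the correlation is not significant under this assumption, so the best predicted value at x = 6 is ȳ = 10 rather than 0.25 + 3.2(6) = 19.45.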
A residual is the difference between an actual observed y-value (y) and the predicted y-value (ŷ)
found using the regression equation: residual = y – ŷ.
EX: If the regression equation predicts ŷ = 5.2 at x = 3 and the observed value is y = 4.5, the
residual is 4.5 – 5.2 = –0.7.
A positive residual indicates that the observed value is ABOVE the value predicted by the regression line (above average for that x).
A negative residual indicates that the observed value is BELOW the value predicted by the regression line (below average for that x).
Example:
a) Use your calculator to find the least-squares regression line for the following set of data:
Club-head speed (mph)    100  102  103  101  105  100   99  105
Distance (yd)            257  264  274  266  277  263  258  275
b) Interpret the slope and y-intercept.
c) Predict the distance that a golf ball will travel if the club-head speed is 103 mph.
d) Suppose that a golf ball is hit with a club-head speed of 103 mph and travels 274 yards.
Is this distance above or below average among all golf balls hit with a club-head speed of 103 mph?
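Parts a), c), and d) can be verified with the sketch below, which fits the line, predicts the distance at 103 mph, and computes the residual for the observed 274 yards. The helper name `least_squares` is illustrative and uses the standard least-squares formulas.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

speed = [100, 102, 103, 101, 105, 100, 99, 105]     # mph
distance = [257, 264, 274, 266, 277, 263, 258, 275]  # yards

slope, intercept = least_squares(speed, distance)
predicted = intercept + slope * 103
residual = 274 - predicted  # observed minus predicted

print(f"y-hat = {slope:.4f}x + ({intercept:.4f})")  # about 3.1661x - 55.7966
print(f"predicted distance at 103 mph: {predicted:.1f} yd")  # about 270.3
print(f"residual for 274 yd: {residual:.1f}")                # about 3.7
```

The residual is positive, so a 274-yard shot is above average for a club-head speed of 103 mph. The slope says each additional 1 mph of club-head speed is associated with roughly 3.17 more yards; the y-intercept has no practical meaning here because a club-head speed of 0 mph is far outside the observed data.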
CAUTION: If the least-squares regression line is used to make predictions based on values of the
explanatory variable that are much larger or much smaller than the observed values, we say the
researcher is working outside the scope of the model. Never use a least-squares regression line to
make predictions outside the scope of the model because we can’t be sure the linear relation continues
to exist.