Chapter 4 Describing the Relation between Two Variables
4.1 Scatter Diagrams and Correlation
The response variable is the variable whose value can be explained by the value of the explanatory or
predictor variable.
A scatter diagram is a graph that shows the relationship between two quantitative variables measured
on the same individual. Each individual in the data set is represented by a point in the scatter diagram.
I. Scatter plot
Hours of sleep (x)    Performance (y)
6                     3
8                     5
10                    4
2                     1
II. Linear Correlation coefficient (r)
The linear correlation coefficient or Pearson product moment correlation coefficient is a measure of the
strength and direction of the linear relation between two quantitative variables. The Greek letter ρ (rho)
represents the population correlation coefficient, and r represents the sample correlation coefficient.
We present only the formula for the sample correlation coefficient.
Sample Linear Correlation Coefficient
$$ r = \frac{\sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)}{n - 1} $$
where x̄ is the sample mean of the explanatory variable
s_x is the sample standard deviation of the explanatory variable
ȳ is the sample mean of the response variable
s_y is the sample standard deviation of the response variable
n is the number of individuals in the sample
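As a rough illustration, the formula can be evaluated directly on the hours-of-sleep data from the scatter plot. This is only a sketch: the function name pearson_r is made up for the example, and the pairing of x and y values assumes they are listed in matching order.

```python
import math

def pearson_r(xs, ys):
    """Sample linear correlation coefficient from the z-score formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations use the n - 1 divisor.
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Sum of the products of z-scores, divided by n - 1.
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Hours of sleep (x) and performance (y), assumed paired in the order listed.
hours = [6, 8, 10, 2]
performance = [3, 5, 4, 1]
r = pearson_r(hours, performance)
print(round(r, 4))  # 0.8857
```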
Properties of the Linear Correlation Coefficient
1. The linear correlation coefficient is always between –1 and 1, inclusive. That is, –1 ≤ r ≤ 1.
2. If r = + 1, then a perfect positive linear relation exists between the two variables.
3. If r = –1, then a perfect negative linear relation exists between the two variables.
4. The closer r is to +1, the stronger is the evidence of positive association between the two
variables.
5. The closer r is to –1, the stronger is the evidence of negative association between the two
variables.
6. If r is close to 0, then little or no evidence exists of a linear relation between the two variables.
An r close to 0 does not imply that no relation exists, only that no linear relation exists.
7. The linear correlation coefficient is a unitless measure of association. So the unit of measure for
x and y plays no role in the interpretation of r.
8. The correlation coefficient is not resistant. Therefore, an observation that does not follow the
overall pattern of the data could affect the value of the linear correlation coefficient.
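Properties 7 and 8 can be checked numerically. The sketch below reuses the hours-of-sleep data, converts hours to minutes, and then appends one made-up observation, (12, 0), that does not follow the pattern. The helper pearson_r is illustrative and uses an algebraically equivalent form of the formula in which the n - 1 factors cancel.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient via sums of squares (the n - 1 factors cancel)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

hours = [6, 8, 10, 2]
performance = [3, 5, 4, 1]

r_hours = pearson_r(hours, performance)

# Property 7: changing units (hours -> minutes) leaves r unchanged.
minutes = [60 * h for h in hours]
r_minutes = pearson_r(minutes, performance)

# Property 8: a single hypothetical off-pattern observation moves r a lot.
r_outlier = pearson_r(hours + [12], performance + [0])

print(round(r_hours, 4), round(r_minutes, 4), round(r_outlier, 4))
```

With these numbers the outlier pulls r from roughly 0.89 down to nearly 0, which is what "not resistant" means in practice.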
EXAMPLE Determining the Linear Correlation Coefficient
Use StatCrunch to determine the linear correlation coefficient of the drilling data.
1. Open StatCrunch and input the data.
2. Select Stat > Summary Stats > Correlation.
3. Under "Select column(s)", click the first variable, then hold the Ctrl key and click the second variable.
4. Click Compute.
Testing for a Linear Relation
• Step 1 Determine the absolute value of the correlation coefficient
• Step 2 Find the critical value in Table II from Appendix A for the given sample size
• Step 3 If the absolute value of the correlation coefficient is greater than the critical
value, we say a linear relation exists between the two variables. Otherwise, no linear
relation exists.
EXAMPLE Does a Linear Relation Exist?
The correlation between drilling depth and time to drill is 0.773. The critical value for n = 12
observations is 0.576. Since 0.773 > 0.576, there is a positive linear relation between time to drill
five feet and depth at which drilling begins.
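The three-step test is a single comparison, sketched below. The function names are hypothetical, and the critical value still has to be looked up in Table II for the sample size at hand; the code does not compute it.

```python
def linear_relation_exists(r, critical_value):
    """Steps 1-3: compare |r| with the critical value from the table."""
    return abs(r) > critical_value

def describe(r, critical_value):
    """Report the direction of the relation when the test finds one."""
    if not linear_relation_exists(r, critical_value):
        return "no linear relation"
    return "positive linear relation" if r > 0 else "negative linear relation"

# Values from the drilling example: r = 0.773, critical value 0.576 for n = 12.
print(describe(0.773, 0.576))  # positive linear relation
```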
4.2 Least-Squares Regression
EXAMPLE Finding an Equation that Describes Linearly Related Data
Using the following sample data:
(a) Find a linear equation that relates x (the explanatory variable) and y (the response
variable) by selecting two points and finding the equation of the line containing the
points.
Using (2, 5.7) and (6, 1.9)
$$ m = \frac{5.7 - 1.9}{2 - 6} = -0.95 $$
(b) Graph the equation on the scatter diagram.
(c) Use the equation to predict y if x = 3.
The difference between the observed value of y and the predicted value of y is the error, or residual.
Using the line from the last example, and the predicted value at x = 3:
residual = observed y – predicted y
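A short sketch of parts (a) and (c): it finds the line through (2, 5.7) and (6, 1.9) and predicts y at x = 3. The helper names are made up for illustration. The observed y at x = 3 is not reproduced in these notes, so the residual helper is defined but not applied to the sample data.

```python
def line_through(p, q):
    """Return (slope, intercept) of the line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    slope = (y1 - y2) / (x1 - x2)
    intercept = y1 - slope * x1
    return slope, intercept

def residual(observed_y, predicted_y):
    """Residual = observed y - predicted y."""
    return observed_y - predicted_y

m, b = line_through((2, 5.7), (6, 1.9))
predicted = m * 3 + b
print(round(m, 2), round(b, 2), round(predicted, 2))  # -0.95 7.6 4.75
```

So the line is y = -0.95x + 7.6, and the predicted value at x = 3 is 4.75.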
Least-Squares Regression Criterion
If there is a positive or negative linear correlation between x and y, find the line that best fits the data.
The least-squares regression line is the line that minimizes the sum of the squared errors (or residuals).
This line minimizes the sum of the squared vertical distances between the observed values of y and those
predicted by the line, ŷ (“y-hat”). We represent this as “minimize Σ residuals²”.
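To make the criterion concrete, the sketch below computes the sum of squared residuals for two candidate lines on the hours-of-sleep data: the least-squares line and a line drawn through two of the data points, (2, 1) and (10, 4). The function names are illustrative. The slope is computed as Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², which is algebraically the same as r · s_y / s_x.

```python
def sse(xs, ys, b0, b1):
    """Sum of squared residuals for the line y-hat = b1 * x + b0."""
    return sum((y - (b1 * x + b0)) ** 2 for x, y in zip(xs, ys))

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1

hours = [6, 8, 10, 2]
performance = [3, 5, 4, 1]

b0, b1 = least_squares(hours, performance)
best = sse(hours, performance, b0, b1)

# Competing line through the data points (2, 1) and (10, 4): y = 0.375x + 0.25.
two_point = sse(hours, performance, 0.25, 0.375)

print(round(best, 4), round(two_point, 4))  # 1.8857 3.3125
```

Any other choice of slope and intercept gives a sum of squared residuals at least as large as the first number.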
The Least-Squares Regression Line
The equation of the least-squares regression line is given by
$$ \hat{y} = b_1 x + b_0 $$
where $b_1 = r \cdot \frac{s_y}{s_x}$ is the slope of the least-squares regression line
and $b_0 = \bar{y} - b_1 \bar{x}$ is the y-intercept of the least-squares regression line.
Note: x̄ is the sample mean and s_x is the sample standard deviation of the explanatory variable x; ȳ is
the sample mean and s_y is the sample standard deviation of the response variable y.
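The slope and intercept formulas can be applied step by step. The sketch below does so for the hours-of-sleep data from Section 4.1, using Python's standard statistics module for the means and sample standard deviations. It illustrates the formulas only and is not a replacement for the StatCrunch output.

```python
import statistics

hours = [6, 8, 10, 2]
performance = [3, 5, 4, 1]

n = len(hours)
x_bar, y_bar = statistics.mean(hours), statistics.mean(performance)
# statistics.stdev is the sample standard deviation (n - 1 divisor).
s_x, s_y = statistics.stdev(hours), statistics.stdev(performance)

# Sample correlation coefficient from the z-score formula.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(hours, performance)) / (n - 1)

b1 = r * s_y / s_x        # slope
b0 = y_bar - b1 * x_bar   # y-intercept
print(round(b1, 4), round(b0, 4))  # 0.4429 0.3714
```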
EXAMPLE Finding the Least-squares Regression Line
Using the drilling data
(a) Find the least-squares regression line using StatCrunch.
1. Open StatCrunch and input the data.
2. Select Stat > Regression > Simple Linear.
3. Select the first variable as the x variable and the second variable as the y variable.
4. Click Compute.
(b) Predict the drilling time if drilling starts at 130 feet.
(c) Is the observed drilling time at 130 feet above or below average?
The observed drilling time is 6.93 minutes. The predicted drilling time is 7.035 minutes.
Because the observed time is less than the predicted time, the drilling time of 6.93 minutes is below average.
(d) Draw the least-squares regression line on the scatter diagram of the data.
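Parts (b) and (c) can be reproduced from the fitted coefficients reported in these notes (slope 0.0116, y-intercept 5.5273). The sketch below takes minutes as the unit of time, as in the slope interpretation. The function name is made up for the example.

```python
SLOPE = 0.0116      # minutes per foot of starting depth, from the notes
INTERCEPT = 5.5273  # minutes, from the notes

def predict_time(depth_ft):
    """Predicted time to drill five feet when starting at depth_ft feet."""
    return SLOPE * depth_ft + INTERCEPT

predicted = predict_time(130)
observed = 6.93
residual = observed - predicted

# A negative residual means the observation lies below the regression line.
verdict = "above average" if residual > 0 else "below average"
print(round(predicted, 3), round(residual, 3), verdict)  # 7.035 -0.105 below average
```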
Interpretation of Slope:
The slope of the regression line is 0.0116. For each additional foot of depth we start drilling,
the time to drill five feet increases by 0.0116 minutes, on average.
Interpretation of the y-Intercept:
The y-intercept of the regression line is 5.5273. To interpret the y-intercept, we must first ask two
questions:
1. Is 0 a reasonable value for the explanatory variable?
2. Do any observations near x = 0 exist in the data set?
A value of 0 is reasonable for the drilling data (this indicates that drilling begins at the surface of Earth).
The smallest observation in the data set is x = 35 feet, which is reasonably close to 0. So, interpretation
of the y-intercept is reasonable.
The time to drill five feet when we begin drilling at the surface of Earth is 5.5273 minutes.
Caution: If the least-squares regression line is used to make predictions based on values of the
explanatory variable that are much larger or much smaller than the observed values, we say the
researcher is working outside the scope of the model. Never use a least-squares regression line to make
predictions outside the scope of the model because we can’t be sure the linear relation continues to
exist.
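One way to guard against this in code is to return a flag alongside each prediction. The sketch below is deliberately strict: it flags anything outside the observed range, whereas the caution above concerns values much larger or much smaller than those observed. The smallest depth (35 feet) and the coefficients come from these notes. The largest observed depth is not given here, so 200 is a placeholder assumption, and the function name is hypothetical.

```python
def predict_in_scope(x, b0, b1, x_min, x_max):
    """Return (prediction, in_scope) for the line y-hat = b1 * x + b0."""
    in_scope = x_min <= x <= x_max
    return b1 * x + b0, in_scope

# Smallest observed depth is 35 ft (from the notes).
# The largest observed depth is NOT in the notes; 200 is a placeholder assumption.
X_MIN, X_MAX_ASSUMED = 35, 200

for depth in (130, 1000):
    y_hat, ok = predict_in_scope(depth, 5.5273, 0.0116, X_MIN, X_MAX_ASSUMED)
    note = "" if ok else " (outside the observed range: do not rely on this)"
    print(f"{depth} ft -> {y_hat:.3f} min{note}")
```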
Predictions When There is No Linear Relation:
When the correlation coefficient indicates no linear relation between the explanatory and response
variables, and the scatter diagram indicates no relation at all between the variables, then we use the
mean value of the response variable as the predicted value, so that ŷ = ȳ.
Summary
1. Use StatCrunch to draw a scatter diagram.
2. Use StatCrunch to calculate r
3. Determine whether there is a positive/negative linear correlation between X and Y.
4. If there is a linear correlation between X and Y, use StatCrunch to find the least squares regression
line. Otherwise, do not find the least squares regression line.
5. When a value is assigned to X and there is a linear correlation between X and Y, use the least squares
regression line to find the best predicted Y.
6. When a value is assigned to X and there is no linear correlation between X and Y, use StatCrunch to find
ȳ; the best predicted Y is ȳ for any X.
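Steps 2 through 6 can be sketched as one function, shown here on the hours-of-sleep data from Section 4.1. The function name is made up. The critical value 0.950 for n = 4 is an assumption taken from a standard critical-value table for the correlation coefficient; confirm it against Table II before relying on it.

```python
import math

def best_prediction(xs, ys, x_new, critical_value):
    """Use the regression line only if |r| exceeds the critical value;
    otherwise predict the mean of y. Returns (r, predicted y)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    r = s_xy / math.sqrt(s_xx * s_yy)
    if abs(r) > critical_value:
        b1 = s_xy / s_xx            # same value as r * s_y / s_x
        b0 = y_bar - b1 * x_bar
        return r, b1 * x_new + b0
    return r, y_bar

hours = [6, 8, 10, 2]
performance = [3, 5, 4, 1]

# 0.950 is assumed to be the table value for n = 4; confirm it in Table II.
r, y_pred = best_prediction(hours, performance, x_new=7, critical_value=0.950)
print(round(r, 4), y_pred)  # 0.8857 3.25
```

Even though r looks large, it does not exceed the assumed critical value for only four observations, so the sketch falls back to ȳ = 3.25 as the best prediction.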