CORRELATION & REGRESSION
Chapter 10
Introduction
• Another area of inferential statistics involves determining
whether a relationship exists between two or more
quantitative variables
• For example:
• Business person deciding whether volume of sales for given month
is related to amount of advertising the firm does that month
• Educators interested in how number of hours a student studies is
related to student’s score on an exam
• Medical researchers interested in determining if caffeine is related
to heart damage
Introduction cont.
• Correlation
• Statistical method used to determine whether a relationship
between variables exists
• Regression
• Statistical method used to describe nature of relationship between
variables, that is, positive or negative, linear or nonlinear
• Questions to be answered
1. Are two or more variables related?
2. If so, what is strength of relationship?
3. What type of relationship exists?
4. What kind of predictions can be made from relationship?
Types of Relationships
• Two types of relationships: simple and multiple
• Simple relationship
• One independent (explanatory) variable, and one dependent
(response) variable
• Simple relationship analysis is called simple regression
• Positive relationship – exists when both variables increase or
decrease at the same time
• Negative relationship – exists when one variable increases as the
other decreases, and vice versa
• Multiple relationship
• Two or more independent variables are used to predict one
dependent variable
10.1 – Scatter Plots & Regression
• In simple correlation and regression studies, researcher
collects data on two quantitative variables to see whether
a relationship exists between them
• Independent variable can be controlled or manipulated
(designated as x-axis variable)
• Dependent variable cannot be controlled or manipulated
(designated as y-axis variable)
Scatter Plots
• Scatter plot
• Graph of ordered pairs (x, y) of numbers consisting of independent
variable x and the dependent variable y
• Visual way to describe nature of relationship between independent and
dependent variables
• After plot is drawn, it should be analyzed to determine which type of
relationship, if any, exists
• Example 10 – 1
• P. 536
• Example 10 – 2
• P. 537
• Example 10 – 3
• P. 538
Correlation
• Statisticians use correlation coefficient to determine
strength of linear relationship between two variables
• Pearson product moment correlation coefficient
(PPMC)
• Named after statistician Karl Pearson, who pioneered research in
this area
• Correlation coefficient
• Computed from sample data; measures strength and direction of
linear relationship between two variables
• Symbol for sample correlation coefficient is r
• Symbol for population correlation coefficient is ρ (Greek letter rho)
Formula for Correlation Coefficient
• Range of the correlation coefficient is from -1 to +1
• Value of r close to +1 suggests strong positive linear relationship
• Value of r close to -1 suggests strong negative linear relationship
• Value of r close to 0 suggests weak or no linear relationship
• Formula for Correlation Coefficient r
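$$r = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}}$$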
Where n is the number of data pairs
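The formula for r can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's worked example: the function name `pearson_r` is made up for this sketch, and the data are the small hypothetical x and y values that appear in the Regression Model slide of Section 10.3 of these notes.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r from the computational formula."""
    n = len(xs)                                   # number of data pairs
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of the products xy
    sum_x2 = sum(x * x for x in xs)               # sum of x squared
    sum_y2 = sum(y * y for y in ys)               # sum of y squared
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Hypothetical data (assumption: borrowed from the Section 10.3 slide)
x_values = [1, 2, 3, 4, 5]
y_values = [10, 8, 12, 16, 20]

r = pearson_r(x_values, y_values)
print(round(r, 3))  # 0.919
```

The result is close to +1, which suggests a strong positive linear relationship for these five pairs.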
Example 10 – 4
• Compute the correlation coefficient for data in example
10-1
Significance of Correlation Coefficient
• Question arises, when is value of r due to chance, and
when does it suggest a significant linear relationship
between the variables?
• Since value of r is computed from samples, two
possibilities exist when r is not equal to zero
• Either value of r is high enough to conclude there is significant
linear relationship OR
• Value of r is due to chance
• To make a decision, use a hypothesis-testing procedure
similar to the traditional method
Population Correlation Coefficient
• Sample correlation coefficient can be used as an
estimator of ρ (rho) if following assumptions are valid
1. Variables x and y are linearly related
2. Variables are random variables
3. Two variables have a bivariate normal distribution
• Population correlation coefficient
• Correlation computed by using all possible pairs of data values
(x,y) taken from a population
Hypothesis Testing
• In hypothesis testing, one of these is true
• $H_0: \rho = 0$ (no linear relationship in the population)
OR
$H_1: \rho \neq 0$ (significant linear relationship in the population)
• When null hypothesis is rejected at a specific level, it means
there is a significant difference between the value of r and 0
• When null hypothesis is not rejected, it means value of r is not
significantly different from 0 and is probably due to chance
• Do not have to identify claim, since question will always be
whether there is significant linear relationship between variables
Formula for t Test
• Formula for t Test for Correlation Coefficient
$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$
with degrees of freedom equal to n – 2
• Example 10 – 7
• Test the significance of the correlation coefficient found in example
10 – 4. Use α = 0.05 and r = 0.982
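The t test for the correlation coefficient can be sketched as follows. The sample size for Example 10 – 7 is not given in these notes, so this sketch does not reproduce that example. It assumes instead r = 0.919 and n = 5, taken from the hypothetical data in Section 10.3. The function name `t_for_correlation` is illustrative, and the critical value 3.182 is the standard two-tailed t-table entry for α = 0.05 with 3 degrees of freedom.

```python
import math

def t_for_correlation(r, n):
    """Test value t = r * sqrt((n - 2) / (1 - r^2)), with d.f. = n - 2."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Assumed inputs (not from Example 10-7, whose n is not given here)
r = 0.919
n = 5

t = t_for_correlation(r, n)

# Two-tailed critical value for alpha = 0.05 and d.f. = n - 2 = 3,
# read from a standard t table
critical_value = 3.182

# Reject H0: rho = 0 when the test value falls in the critical region
reject_null = abs(t) > critical_value
print(round(t, 2), reject_null)  # 4.04 True
```

Because the test value exceeds the critical value, the null hypothesis would be rejected for these assumed inputs, indicating a significant linear relationship.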
Correlation and Causation
• When a hypothesis test indicates that a significant linear
relationship exists between variables, researchers must
consider possibilities outlined next.
• Possible Relationships Between Variables
• When null hypothesis has been rejected for a specific α value, any of
the following five possibilities can exist:
1. There is a direct cause-and-effect relationship between variables
2. There is a reverse cause-and-effect relationship between variables
3. Relationship between variables may be caused by a third variable
4. There may be a complexity of interrelationships among many
variables
5. Relationship may be coincidental
• Remember, correlation does not necessarily imply causation
10.2 – Regression
• If value of correlation coefficient is significant, next step is
to determine equation of regression line
• Regression line
• Data’s line of best fit
• Allows researcher to see trend and make predictions on basis of the
data
Line of Best Fit
• Given a scatter plot, you must be able to draw the line of
best fit
• Line of best fit
• Line drawn so that sum of squares of vertical distances from each
point in scatter plot to line is at a minimum
Determination of Regression Line Equation
• Linear equation in algebra is written as 𝑦 = 𝑚𝑥 + 𝑏
• In statistics, regression line is written as
$$y' = a + bx$$
where a is the y' intercept and b is the slope of the line
• Formula for Regression Line y’= a + bx
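$$a = \frac{\left(\sum y\right)\left(\sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2}$$
and
$$b = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2}$$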
• Rounding rule: round values of a and b to three decimal places
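The formulas for a and b translate directly into code. The sketch below is illustrative only: `regression_line` is a made-up helper name, and the data are the hypothetical values from the Regression Model slide in Section 10.3, not the data from Example 10 – 9.

```python
def regression_line(xs, ys):
    """Return (a, b) for y' = a + bx, rounded to three decimal places."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2          # shared by a and b
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return round(a, 3), round(b, 3)                # rounding rule from the notes

# Hypothetical data (assumption: borrowed from the Section 10.3 slide)
a, b = regression_line([1, 2, 3, 4, 5], [10, 8, 12, 16, 20])
print(a, b)  # 4.8 2.8
```

These values agree with the regression line y' = 4.8 + 2.8x quoted for the same data later in the notes.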
Examples
• 10 – 9
• Find the equation of the regression line for data in example 10 – 4
and graph the line on the scatter plot of the data
• 10 – 11
• Use the equation of the regression line to predict the income of a
car rental agency that has 200,000 automobiles
Assumptions
• Marginal change
• Magnitude of change in one variable when the other variable
changes exactly 1 unit
• When r is not significantly different from 0, best predictor
of y is mean of data values of y
• For valid predictions, value of correlation coefficient must
be significant; two other assumptions must also be met:
1. For any specific value of the independent variable x, the value of
the dependent variable y must be normally distributed about the
regression line
2. The standard deviation of the dependent variable y must
be the same for each value of the independent variable x
Checking for Outliers
• All scatter plots should be checked for outliers
• Influential points/ influential observations
• Points that can affect equation of regression line
• When point on scatter plot seems to be an outlier, it should be checked
to see if it is an influential point, because influential points tend to
“pull” regression line toward themselves
• Researchers should use their judgment whether to include
influential observations in final analysis of data
• If researcher feels observation is not necessary, then it should be
excluded so it does not influence results of study
• If researcher feels that it is necessary, he or she may want to obtain
additional data values whose x values are near x value of influential
point
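The effect of an influential point can be illustrated by fitting the slope with and without one extreme observation. Everything here is hypothetical: the base data are the five pairs from the Section 10.3 slide, the added point (10, 5) is invented for this sketch, and `slope` is an illustrative helper.

```python
def slope(xs, ys):
    """Slope b of the least-squares regression line."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    return (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Hypothetical base data with a clear upward trend
base_x = [1, 2, 3, 4, 5]
base_y = [10, 8, 12, 16, 20]

# Invented extreme point, far to the right of the rest and well below the trend
outlier_x, outlier_y = 10, 5

slope_without = slope(base_x, base_y)
slope_with = slope(base_x + [outlier_x], base_y + [outlier_y])

print(round(slope_without, 3))  # 2.8
print(round(slope_with, 3))     # -0.39
```

A single point is enough to flip the sign of the slope here, which is why such observations deserve a judgment call before the final analysis.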
10.3 – Coefficient of Determination &
Standard Error of the Estimate
• If correlation coefficient is significant, then equation of
regression line can be determined
• Other measures are associated with correlation and
regression techniques:
• Coefficient of determination
• Standard error of the estimate
• Prediction interval
Regression Model
• Consider this hypothetical regression model
• X values: {1, 2, 3, 4, 5}
• Y values: {10, 8, 12, 16, 20}
• Regression line equation is: 𝑦 ′ = 4.8 + 2.8𝑥 and r = 0.919
• For each value of x there is an observed y value and a
predicted y’ value
• When x = 1, y = 10, and y’ = 7.6
• Recall that the closer the y’ values are to actual y values,
the better the fit and the closer r is to +1 or -1
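The observed and predicted values for this hypothetical model can be listed side by side with a short sketch. The variable names are illustrative; the data and the line y' = 4.8 + 2.8x come from the slide above.

```python
# Hypothetical regression model from the notes
x_values = [1, 2, 3, 4, 5]
y_values = [10, 8, 12, 16, 20]
a, b = 4.8, 2.8          # intercept and slope of y' = a + bx

# Predicted y' for each x, rounded to one decimal for display
predicted = [round(a + b * x, 1) for x in x_values]

for x, y, y_pred in zip(x_values, y_values, predicted):
    print(x, y, y_pred)

print(predicted)  # [7.6, 10.4, 13.2, 16.0, 18.8]
```

The first row reproduces the case stated in the notes: when x = 1, y = 10 and y' = 7.6.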
Total Variation
• Total variation
• Sum of squares of vertical distances each point is from mean
• $\sum (y - \bar{y})^2$
• Explained variation
• Variation obtained from the relationship (y’ predicted values)
• $\sum (y' - \bar{y})^2$
• Unexplained variation
• Variation due to chance
• $\sum (y - y')^2$
• *Total variation = Explained variation + unexplained variation*
• $\sum (y - \bar{y})^2 = \sum (y' - \bar{y})^2 + \sum (y - y')^2$
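The three variations can be checked numerically on the hypothetical model from the Regression Model slide (x = 1 to 5, y' = 4.8 + 2.8x). This is a minimal sketch with illustrative variable names. Note that the identity holds because the line is the least-squares line; it would not hold for an arbitrary line.

```python
# Hypothetical regression model from the notes
x_values = [1, 2, 3, 4, 5]
y_values = [10, 8, 12, 16, 20]
a, b = 4.8, 2.8                                   # y' = a + bx

y_mean = sum(y_values) / len(y_values)            # mean of the observed y values
predicted = [a + b * x for x in x_values]         # y' values

# Sum of squared distances of each observed y from the mean
total = sum((y - y_mean) ** 2 for y in y_values)
# Sum of squared distances of each predicted y' from the mean
explained = sum((yp - y_mean) ** 2 for yp in predicted)
# Sum of squared distances of each observed y from its predicted y'
unexplained = sum((y - yp) ** 2 for y, yp in zip(y_values, predicted))

print(round(total, 1), round(explained, 1), round(unexplained, 1))  # 92.8 78.4 14.4
```

Here 78.4 + 14.4 = 92.8, so the explained and unexplained variations add up to the total variation.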
Residuals & Least-Squares
• Residual
• Difference between actual value of y and predicted y’ value for a
given x value
• Least-squares line
• Another name for a regression line, because it is the line for which
the sum of squares of residuals is the smallest possible value
Coefficient of Determination
• Coefficient of determination
• Measure of the variation of the dependent variable that is
explained by the regression line and the independent variable
• Ratio of explained variation to total variation
• $r^2 = \frac{\text{explained variation}}{\text{total variation}}$
• Can also be found by squaring the r value
• Coefficient of nondetermination
• Found by subtracting coefficient of determination from 1
• $1 - r^2$
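As a rough check, the ratio definition can be computed for the hypothetical model from the Regression Model slide and compared with the square of r = 0.919. Variable names are illustrative and the data are the notes' hypothetical values.

```python
# Hypothetical regression model from the notes
x_values = [1, 2, 3, 4, 5]
y_values = [10, 8, 12, 16, 20]
a, b = 4.8, 2.8                                   # y' = a + bx

y_mean = sum(y_values) / len(y_values)
predicted = [a + b * x for x in x_values]

explained = sum((yp - y_mean) ** 2 for yp in predicted)
total = sum((y - y_mean) ** 2 for y in y_values)

r_squared = explained / total                     # coefficient of determination
nondetermination = 1 - r_squared                  # coefficient of nondetermination

print(round(r_squared, 3), round(nondetermination, 3))  # 0.845 0.155
```

About 84.5% of the variation in y is explained by the regression line for these data, and squaring the rounded r = 0.919 gives the same value to three decimal places.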
Standard Error of the Estimate
• When a y’ value is predicted for a specific x value,
prediction is a point estimate
• Standard error of the estimate
• Denoted by $s_{est}$; the standard deviation of the observed y values
about the predicted y’ values
• Prediction interval uses this statistic
• Formula for standard error of estimate is
$$s_{est} = \sqrt{\frac{\sum (y - y')^2}{n - 2}}$$
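The standard error of the estimate can be sketched as below. The function name `standard_error_of_estimate` is illustrative, and the inputs are the hypothetical data and line y' = 4.8 + 2.8x from the Regression Model slide, not the copy machine data of Example 10 – 12, which is not reproduced in these notes.

```python
import math

def standard_error_of_estimate(xs, ys, a, b):
    """s_est = sqrt( sum (y - y')^2 / (n - 2) ) for the line y' = a + bx."""
    n = len(xs)
    # Sum of squared residuals: observed y minus predicted y'
    sum_sq_residuals = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sum_sq_residuals / (n - 2))

# Hypothetical data and regression line from the notes
s_est = standard_error_of_estimate(
    [1, 2, 3, 4, 5], [10, 8, 12, 16, 20], a=4.8, b=2.8
)
print(round(s_est, 3))  # 2.191
```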
Examples
• 10 – 12
• A researcher collects the following data (page 569) and determines
that there is a significant relationship between age of a copy
machine and its monthly maintenance cost. The regression line is
• 𝑦 ′ = 55.57 + 8.13𝑥
Find the standard error of the estimate
Prediction Interval
• Prediction interval
• Similar to a confidence interval where the standard error of the
estimate is used to create an interval about a y’ value
• By selecting an α value, you can achieve a $(1 - \alpha) \cdot 100\%$
confidence that the interval contains the actual mean of the y
values that correspond to the given x value
• Formula for the Prediction Interval about a Value y’
$$y = y' \pm t_{\alpha/2}\, s_{est} \sqrt{1 + \frac{1}{n} + \frac{n\left(x - \bar{X}\right)^2}{n\left(\sum x^2\right) - \left(\sum x\right)^2}}$$
• With d.f. = n – 2
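The prediction interval formula can be put together in one short sketch. It is an illustration under stated assumptions, not the solution to Example 10 – 14: the data are the hypothetical pairs from the Regression Model slide, the helper name `prediction_interval` is made up, and the critical value 3.182 is the standard t-table entry for a 95% interval with d.f. = 5 − 2 = 3.

```python
import math

def prediction_interval(xs, ys, x_new, t_critical):
    """Interval y' +/- t * s_est * sqrt(1 + 1/n + n(x - xbar)^2 / (n*sum x^2 - (sum x)^2))."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2

    # Regression line y' = a + bx
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator

    # Standard error of the estimate
    s_est = math.sqrt(
        sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    )

    x_mean = sum_x / n
    y_pred = a + b * x_new
    margin = t_critical * s_est * math.sqrt(
        1 + 1 / n + n * (x_new - x_mean) ** 2 / denominator
    )
    return y_pred - margin, y_pred + margin

# Hypothetical data; t = 3.182 for 95% confidence with d.f. = 3 (t table)
lower, upper = prediction_interval(
    [1, 2, 3, 4, 5], [10, 8, 12, 16, 20], x_new=3, t_critical=3.182
)
print(round(lower, 2), round(upper, 2))  # 5.56 20.84
```

The interval is centered on the point estimate y' = 13.2 and is quite wide here because only five data pairs are available.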
Example 10 – 14
• For the data in Example 10 – 12, find the 95% prediction
interval for the monthly maintenance cost of a machine
that is 3 years old