Download Chapter 9 Correlation and Regression

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Sufficient statistic wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Degrees of freedom (statistics) wikipedia , lookup

Psychometrics wikipedia , lookup

Resampling (statistics) wikipedia , lookup

Student's t-test wikipedia , lookup

Transcript
STATISTICS
Chapter 5
Correlation/Regression
MVS 250: V. Katch
1
Overview
Paired Data
 is there a relationship
 if so, what is the equation
 use the equation for prediction
2
Correlation
3
Definition
Correlation
exists between two variables
when one of them is related to
the other in some way
4
Assumptions
1. The sample of paired data (x,y) is a
random sample.
2. The pairs of (x,y) data have a
bivariate normal distribution.
5
Definition
Scatterplot (or scatter diagram)
is a graph in which the paired
(x,y) sample data are plotted with
a horizontal x axis and a vertical
y axis. Each individual (x,y) pair
is plotted as a single point.
6
Scatter Diagram of Paired Data
7
Scatter Diagram of Paired Data
8
Positive Linear Correlation
y
y
y
(a) Positive
x
x
x
(b) Strong
positive
(c) Perfect
positive
Scatter Plots
9
Negative Linear Correlation
y
y
y
(d) Negative
x
x
x
(e) Strong
negative
(f) Perfect
negative
Scatter Plots
10
No Linear Correlation
y
y
x
(g) No Correlation
x
(h) Nonlinear Correlation
Scatter Plots
11
Definition
Linear Correlation Coefficient r
measures strength of the linear relationship
between paired x and y values in a sample
xy/n - (x/n)(y/n)
r=
(SDx) (SDy)
Where xy/n is the mean of the cross products;
(x/n) is the mean of the x variable; (y/n) is the
mean of the y variable; SDx is the standard
deviation of the x variable and SDy is the
standard deviation of the x variable
12
Notation for the
Linear Correlation Coefficient
n
number of pairs of data presented

denotes the addition of the items indicated.
x/n denotes the mean of all x values.
y/n denotes the mean of all y values.
xy/n denotes the mean of the cross products [x
times y, summed; divided by n]
r
linear correlation coefficient for a sample

linear correlation coefficient for a
population
13
Rounding the
Linear Correlation Coefficient r
 Round to three decimal places
Use calculator or computer if possible
14
Properties of the
Linear Correlation Coefficient r
1. -1  r  1
2. Value of r does not change if all values of
either variable are converted to a different
scale.
3. The r is not affected by the choice of x and y.
Interchange x and y and the value of r will
not
change.
4. r measures strength of a linear relationship.
15
Interpreting the Linear
Correlation Coefficient
If the absolute value of r exceeds the
value in Sig. Table, conclude that there is
a significant linear correlation.
Otherwise, there is not sufficient
evidence to support the conclusion of
significant linear correlation.
Remember to use n-2
16
Common Errors Involving Correlation
1. Causation: It is wrong to conclude that
correlation implies causality.
2. Averages: Averages suppress individual
variation and may inflate the correlation
coefficient.
3. Linearity: There may be some relationship
between x and y even when there is no
significant linear correlation.
17
Common Errors Involving Correlation
250
Distance
(feet)
200
150
100
50
0
0
1
2
3
4
5
6
7
8
Time (seconds)
18
Correlation is Not Causation
A
B
C
19
Correlation Calculations
Rank Order Correlation - Rho
Pearson’s - r
20
Rank Order Correlation
Hits
1
Rank
10
HR
3
Rank
8
D
2
D2
4
2
3
4
5
9
8
7
6
4
5
1
7
7
6
10
4
2
2
-3
2
4
4
9
4
6
7
8
5
4
3
6
2
10
5
9
1
0
-5
2
0
25
4
9
10
2
1
9
8
2
3
0
2
0
4
21
Rank Order Correlation, cont
Rho = 1- [6
2
(∑D )
Hits
Rank
HR
Rank
D
D2
1
10
3
8
2
4
2
9
4
7
2
4
3
8
5
6
2
4
4
7
1
10
-3
9
5
6
7
4
2
4
6
5
6
5
0
0
7
4
2
9
-5
25
8
3
10
1
2
4
9
2
9
2
0
0
10
1
8
3
2
4
N=10
/N
2
(N -1)]
Rho = 1- [6(58)/10(102-1)]
Rho = 1- [348 / 10 (100 -1)]
Rho = 1- [348 / 990]
Rho = 1- 0.352
Rho = 0.648
(∑D2 = 58)
22
Pearson’s r
Hits
HR
xy
1
3
3
2
4
8
3
5
15
4
1
4
5
7
35
6
6
36
7
2
14
8
10
80
9
9
81
10
8
80
x/n x/n xy/n
=5.5 = 5.5 =32.86
xy/n - (x/n)(y/n)
r=
(SDx) (SDy)
r = 32.86 - (5.5) (5.5)/(3.03) (3.03)
r = 35.86 - 30.25 / 9.09
r = 5.61 / 9.09
r = 0.6172
23
Pearson’s r
Excel Demonstration
24
Is there a significant linear correlation?
Data from the Garbage Project
x Plastic (lb)
y Household
0.27 1.41
2
3
2.19
2.83
2.19
1.81
0.85
3.05
3
6
4
2
1
5
25
Is there a significant linear correlation?
Data from the Garbage Project
x Plastic (lb)
y Household
0.27 1.41
2
3
Plastic
0.27
1.41
2.19
2.83
2.19
1.81
0.85
3.05
2.19
2.83
2.19
1.81
0.85
3.05
3
6
4
2
1
5
Household
2
3
3
6
4
2
1
5
26
Is there a significant linear correlation?
Data from the Garbage Project
x Plastic (lb)
y Household
0.27 1.41
2
3
2.19
2.83
2.19
1.81
0.85
3.05
3
6
4
2
1
5
Household size
Plastic Garbage v Household size
7
6
5
4
3
2
1
0
r = 0.842
R2 2= 0.7096
R = 0.71
0
0.5
1
1.5
2
2.5
3
3.5
Plastic (lbs)
27
Is there a significant linear correlation?
n=8
 = 0.05
=0
:  0
H 0:
H1
Test statistic is r = 0.842
Critical values are r = - 0.707 and 0.707
(Table R with n = 8 and  = 0.05)
TABLE R Critical Values of the Pearson Correlation Coefficient r
n
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
25
30
35
40
45
50
60
70
80
90
100
 = .05
.950
.878
.811
.754
.707
.666
.632
.602
.576
.553
.532
.514
.497
.482
.468
.456
.444
.396
.361
.335
.312
.294
.279
.254
.236
.220
.207
.196
 = .01
.999
.959
.917
.875
.834
.798
.765
.735
.708
.684
.661
.641
.623
.606
.590
.575
.561
.505
.463
.430
.402
.378
.361
.330
.305
.286
.269
.256
28
Is there a significant linear correlation?
0.842 > 0.707, That is the test statistic does fall within the
critical region.
Therefore, we REJECT H0:  = 0 (no correlation) and conclude
there is a significant linear correlation between the weights of
discarded plastic and household size.
Reject
 =0
-1
r = - 0.707
Fail to reject
=0
0
Reject
 =0
r = 0.707
1
Sample data:
r = 0.842
29
Method 1: Test Statistic is t
(follows format of earlier chapters)
30
Formal Hypothesis Test
 To determine whether there is a
significant linear correlation
between two variables
 Two methods
 Both methods let H0:  =
(no significant linear correlation)
H1:  
(significant linear correlation)
31
Method 2: Test Statistic is r
(uses fewer calculations)
Test statistic: r
Critical values: Refer to Table R
(no degrees of freedom)
32
Method 2: Test Statistic is r
(uses fewer calculations)
Test statistic: r
Critical values: Refer to Table A-6
(no degrees of freedom)
Reject
 =0
-1
r = - 0.811
Fail to reject
=0
0
Reject
 =0
r = 0.811
1
Sample data:
r = 0.828
33
Method 1: Test Statistic is t
(follows format of earlier chapters)
Test statistic:
t=
r
1-r2
n-2
Critical values:
use Table T with
degrees of freedom = n - 2
34
Start
Testing for a
Linear Correlation
Let H0:  = 0
H1:   0
Select a
significance
level 
Calculate r using
Formula 9-1
METHOD 1
METHOD 2
The test statistic is
t=
The test statistic is
r
r
Critical values of t are from
Table A-6
1-r2
n -2
Critical values of t are from Table A-3
with n -2 degrees of freedom
If the absolute value of the
test statistic exceeds the
critical values, reject H0:  = 0
Otherwise fail to reject H0
If H0 is rejected conclude that there
is a significant linear correlation.
If you fail to reject H0, then there is
not sufficient evidence to conclude
that there is linear correlation.
35
Why does the critical value of r
increase as sample size decreases?
A correlation by chance is more likely.
36
Coefficient of Determination
(Effect Size)
r2
The part of variance of one variable that can be
explained by the variance of a related variable.
37
Justification for r Formula
r=
 (x -x) (y -y)
(n -1) Sx Sy
(x, y)
centroid of sample points
x=3
y
x - x = 7- 3 = 4
(7, 23)
•
24
20
y - y = 23 - 11 = 12
Quadrant 1
Quadrant 2
16
•
12
8
•
Quadrant 3
••
4
y = 11
(x, y)
Quadrant 4
x
0
0
1
2
3
4
5
6
7
38