Relations Between Two Variables
Regression and Correlation
In both cases, y is a random variable beyond the control of the experimenter.
In the case of correlation, x is also a random variable.
In the case of regression, x is treated as a fixed variable. (As if there is no
sampling error in x.)
Regression: you wish to predict the value of y on the basis of the value of x.
Correlation: you wish to express the degree of the relation between x and y.
Scatter Diagram or Scatter Plot
X axis (abscissa) = predictor variable
Y axis (ordinate) = criterion variable
[Figure: scatter plots illustrating positive, negative, perfect, and no relation.]
Covariance
$COV_{xy}$ is a number reflecting the degree to which two variables vary or change in value together.
$$COV_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$$
n = the number of xy pairs.
Consider an example in which RT (x) and error scores (y) are collected. A d score is a score's deviation from its mean.
If a subject is slow (high x) and accurate (low y), then the d score for the x will be
positive and the d score for the y will be negative; their product will be negative.
If a subject is slow (high x) and inaccurate (high y), then the d score for the x will be
positive and the d score for the y will be positive; their product will be positive.
If a subject is fast (low x) and accurate (low y), then the d score for the x will be
negative and the d score for the y will be negative; their product will be positive.
If a subject is fast (low x) and inaccurate (high y), then the d score for the x will be
negative and the d score for the y will be positive; their product will be negative.
Illustrative Trends

Sub.   x     $(x - \bar{x})$   y    $(y - \bar{y})$   $(x - \bar{x})(y - \bar{y})$
1      100   -200              20   10                -2000
2      200   -100              15   5                 -500
3      300   0                 10   0                 0
4      400   100               5    -5                -500
5      500   200               0    -10               -2000
Those subjects who are fast make more errors.
Total = -5000

Sub.   x     $(x - \bar{x})$   y    $(y - \bar{y})$   $(x - \bar{x})(y - \bar{y})$
1      100   -200              0    -10               2000
2      200   -100              5    -5                500
3      300   0                 10   0                 0
4      400   100               15   5                 500
5      500   200               20   10                2000
Those subjects who are fast make fewer errors.
Total = 5000

Sub.   x     $(x - \bar{x})$   y    $(y - \bar{y})$   $(x - \bar{x})(y - \bar{y})$
1      100   -200              10   0                 0
2      200   -100              5    -5                500
3      300   0                 20   10                0
4      400   100               5    -5                -500
5      500   200               10   0                 0
There is no trend.
Total = 0
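
The three totals above can be checked with a short script. This is only a sketch: the function name `covariance` and the variable names are illustrative choices, and it divides each total by n - 1 to turn the sum of products into a covariance, following the formula given earlier.

```python
def covariance(xs, ys):
    """Sample covariance: sum of deviation products divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# x (RT) is the same in all three illustrative data sets.
rt = [100, 200, 300, 400, 500]
errors_by_set = {
    "fast subjects make more errors": [20, 15, 10, 5, 0],
    "fast subjects make fewer errors": [0, 5, 10, 15, 20],
    "no trend": [10, 5, 20, 5, 10],
}

for label, errors in errors_by_set.items():
    # The sign of the covariance matches the sign of the total of products.
    print(label, covariance(rt, errors))
```

The sign of each result follows the sign of the corresponding total: negative, positive, and zero.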
[Figure: scatter plots of the data from the previous page, with x running from 100 to 500.]
We can see a trend after all.
Scale Issues

x    $(x - \bar{x})$   y (Min.)   $(y - \bar{y})$   $(x - \bar{x})(y - \bar{y})$
1    -4                5          -8                32
3    -2                13         0                 0
5    0                 9          -4                0
7    2                 17         4                 8
9    4                 21         8                 32
Total = 72

x    $(x - \bar{x})$   y (Sec.)   $(y - \bar{y})$   $(x - \bar{x})(y - \bar{y})$
1    -4                300        -480              1920
3    -2                780        0                 0
5    0                 540        -240              0
7    2                 1020       240               480
9    4                 1260       480               1920
Total = 4320
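
The effect of the measurement scale can be sketched in a few lines. The example assumes the second y column is the first one converted from minutes to seconds, which is what the numbers suggest; the names `covariance`, `y_minutes`, and `y_seconds` are illustrative.

```python
def covariance(xs, ys):
    """Sample covariance: sum of deviation products divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

x = [1, 3, 5, 7, 9]
y_minutes = [5, 13, 9, 17, 21]
# Assumption: the second table is the same y measured in seconds.
y_seconds = [60 * minutes for minutes in y_minutes]

cov_minutes = covariance(x, y_minutes)
cov_seconds = covariance(x, y_seconds)

# The relation between x and y is unchanged, yet the covariance
# grows by the same factor as the change of units.
print(cov_minutes, cov_seconds, cov_seconds / cov_minutes)
```

The same data therefore yield very different covariances depending only on the units chosen for y.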
Sub   X   Y
1     2   10
2     3   12
3     2   12
4     4   15
5     4   12
$$COV_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$$
What is the covariance?
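
One way to check an answer to this exercise is to lay out the steps in code. This is a sketch with illustrative variable names, following the covariance formula given above.

```python
xs = [2, 3, 2, 4, 4]
ys = [10, 12, 12, 15, 12]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# One deviation product per subject, as in the tables above.
products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
cov_xy = sum(products) / (n - 1)

print(f"mean x = {mean_x}, mean y = {mean_y}")
print(f"covariance = {cov_xy:.2f}")
```

The products sum to 5, so the covariance works out to 5 / 4 = 1.25.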
The absolute value of the covariance is a function of the variance of x and the
variance of y. Thus, a given covariance could reflect a strong relation when the two
variances are small, but a weak relation when the variances are large.
A linear relation is one in which the relation can be most accurately represented by
a straight line.
Remember a linear transformation:
$$x_{new} = c_1 (x_{old}) + c_2$$
The general equation for a straight line:
$$y = bx + a$$
(a is the y intercept and b is the slope of the line.)
$$b = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} = \frac{3 - 2}{3 - 1} = \frac{1}{2} = .5$$
a = 1.5
If x = 8, then y = .5(8) + 1.5 = 5.5
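
The slope and intercept above can be reproduced from the two points used in the calculation, (1, 2) and (3, 3). A minimal sketch; the helper name `line_through` is hypothetical.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    b = (y2 - y1) / (x2 - x1)   # slope: change in y over change in x
    a = y1 - b * x1             # intercept: solve y1 = b * x1 + a for a
    return b, a

b, a = line_through((1, 2), (3, 3))
print(b, a)         # slope and y intercept
print(b * 8 + a)    # value of y on the line at x = 8
```

Because the relation here is perfect, any two points on the line give the same slope and intercept.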
When the relation is imperfect:
(not all points fall on a straight line.)
Why are the points not on the line?
We draw the “best fit” using what is called the “least-squares” criterion.
Why squares?
See optional link on simultaneous equations for a
closer look at the idea of least-squares.
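
As a brief sketch of where the simultaneous equations come from (standard calculus, not necessarily the presentation used in the optional link): write the sum of squared prediction errors as a function of the intercept a and the slope b, and set both partial derivatives to zero.

```latex
Q(a, b) = \sum (y - bx - a)^2

\frac{\partial Q}{\partial a} = -2 \sum (y - bx - a) = 0
\quad\Longrightarrow\quad
\sum y = b \sum x + n a

\frac{\partial Q}{\partial b} = -2 \sum x (y - bx - a) = 0
\quad\Longrightarrow\quad
\sum xy = b \sum x^2 + a \sum x

% Solving the two equations together:
a = \bar{y} - b \bar{x},
\qquad
b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}
```

Squaring keeps positive and negative errors from cancelling and makes Q smooth, so the minimum can be found by solving these two equations.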
Regression Line: Example

Subject   Stat. Score (x)   GPA (y)
1         110               1.0
2         112               1.6
3         118               1.2
4         119               2.1
5         122               2.6
6         125               1.8
7         127               2.6
8         130               2.0
9         132               3.2
10        134               2.6
11        136               3.0
12        138               3.6

[Figure: scatter plot of GPA (y axis, 1 to 4) against Statistics Score (x axis, 110 to 140).]
We wish to minimize
$$\sum (y - \hat{y})^2$$
The predicted value of y for a given value of x:
$$\hat{y} = b_y x + a_y$$
$b_y$ = the slope minimizing the errors predicting y
$a_y$ = the y-axis intercept minimizing the errors predicting y
$$b_y = \frac{COV_{xy}}{s_x^2} = \frac{\sum (x - \bar{x})(y - \bar{y}) / (n - 1)}{\sum (x - \bar{x})^2 / (n - 1)}$$
For our example:
$b_y = 0.074$
What does this mean?
$$a_y = \bar{y} - b \bar{x} = \frac{\sum y - b \sum x}{n}$$
Our working example:
$a_y$ = 2.275 - 0.074(125.25) = -7.006 (computed with the unrounded slope, 0.0741)
The regression line for our data:
$$\hat{y} = 0.074x - 7.006$$
Using the regression formula to predict: e.g., x = 124
$$\hat{y} = 0.074(124) - 7.006$$
$$\hat{y} = 2.17$$
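
The slope, intercept, and prediction for the working example can be recomputed from the data table. This is a sketch using plain Python; `fit_line` is a hypothetical helper name, and the formulas are the ones given above.

```python
stat_scores = [110, 112, 118, 119, 122, 125, 127, 130, 132, 134, 136, 138]
gpas = [1.0, 1.6, 1.2, 2.1, 2.6, 1.8, 2.6, 2.0, 3.2, 2.6, 3.0, 3.6]

def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx              # the (n - 1) terms cancel
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(stat_scores, gpas)
prediction = slope * 124 + intercept     # predicted GPA for x = 124

print(round(slope, 3), round(intercept, 3), round(prediction, 2))
```

Carrying the unrounded coefficients through gives a prediction of about 2.18; the 2.17 above comes from using the coefficients rounded to three decimals.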
Note: If the x value you are inserting is beyond the range of the values used to
construct the formula, caution must be used.
Remember: To minimize the sum of the squared deviations about a point, the mean is best.
GPA (y)   $(y - \bar{y})^2$
1.0       1.69
1.6       .49
1.2       1.21
2.1       .04
2.6       .09
1.8       .25
2.6       .09
2.0       .09
3.2       .81
2.6       .09
3.0       .49
3.6       1.69

$\sum y = 27.3$
$\bar{y} \approx 2.3$
$\sum (y - \bar{y})^2 = 7.03$
Note: Using our GPA and Statistics Score data
$$s_y = \sqrt{\frac{7.03}{11}} = .80$$
We could call this a type of "Standard Error" of y.
Using only the mean of y to predict y, all y values would be the mean.
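
The claim that the mean minimizes the sum of squared deviations, and the value of $s_y$ for the GPA data, can be checked numerically. A sketch with illustrative names; the comparison points 2.0, 2.3, and 2.5 are arbitrary choices made for this example.

```python
import math

gpas = [1.0, 1.6, 1.2, 2.1, 2.6, 1.8, 2.6, 2.0, 3.2, 2.6, 3.0, 3.6]

def sum_sq_about(values, point):
    """Sum of squared deviations of the values about a given point."""
    return sum((v - point) ** 2 for v in values)

mean_gpa = sum(gpas) / len(gpas)
ss_mean = sum_sq_about(gpas, mean_gpa)
s_y = math.sqrt(ss_mean / (len(gpas) - 1))

# Any point other than the mean gives a larger sum of squares.
for point in (2.0, 2.3, mean_gpa, 2.5):
    print(round(point, 3), round(sum_sq_about(gpas, point), 4))

print(round(s_y, 2))
```

The sum of squares about the exact mean (2.275) is slightly smaller than the 7.03 obtained with the rounded mean of 2.3, which is consistent with the mean being the best single predictor.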
Using X, $s_{y.x} = ?$
Which MODEL is superior? Why?
Is there a reliable difference?
Standard Error of the Estimate: similar to a standard deviation
Where the relation is imperfect, there will be prediction error, whether one
uses the mean or the regression line.
$$s_{y.x} = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$$
Transformed:
$$s_{y.x} = s_y \sqrt{(1 - r^2) \left( \frac{n - 1}{n - 2} \right)}$$
What is r?
Residual Variance =
What might create residual variance?
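
The two forms of the standard error of the estimate given above can be compared on the GPA and Statistics Score data. This sketch assumes r is the Pearson correlation coefficient, computed here as the sum of deviation products divided by the square root of the product of the two sums of squares; all variable names are illustrative.

```python
import math

xs = [110, 112, 118, 119, 122, 125, 127, 130, 132, 134, 136, 138]
ys = [1.0, 1.6, 1.2, 2.1, 2.6, 1.8, 2.6, 2.0, 3.2, 2.6, 3.0, 3.6]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sum_xx = sum((x - mean_x) ** 2 for x in xs)
sum_yy = sum((y - mean_y) ** 2 for y in ys)

# Regression line and its residuals (y minus predicted y).
slope = sum_xy / sum_xx
intercept = mean_y - slope * mean_x
residual_ss = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Direct form: residual sum of squares over n - 2.
se_direct = math.sqrt(residual_ss / (n - 2))

# Transformed form: built from s_y and r (assumed Pearson correlation).
r = sum_xy / math.sqrt(sum_xx * sum_yy)
s_y = math.sqrt(sum_yy / (n - 1))
se_transformed = s_y * math.sqrt((1 - r ** 2) * (n - 1) / (n - 2))

print(round(r, 2), round(s_y, 2), round(se_direct, 2), round(se_transformed, 2))
```

Both forms give the same value, and it is noticeably smaller than $s_y$, which bears on the question of which model is superior.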