Download Prediction

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Forecasting wikipedia , lookup

Expectation–maximization algorithm wikipedia , lookup

Time series wikipedia , lookup

Lasso (statistics) wikipedia , lookup

Instrumental variables estimation wikipedia , lookup

Linear regression wikipedia , lookup

Regression analysis wikipedia , lookup

Coefficient of determination wikipedia , lookup

Transcript
Prediction
Prediction
• Prediction is a scientific guess of unknown by
using the known data.
• Statistical prediction is based on correlation. If
the correlation among variables is high, the
success in prediction will be high.
– So, if there is a perfect correlation among variables,
then the prediction will be perfect.
• Today, we will study on the bivariate linear
correlation and regression. That is, linear relation
between two variables.
Prediction
• Today, we will study on the bivariate linear correlation
and regression. That is, linear relation between two
variables.
– In a graduate level of statistic class, you will learn
curvilinear correlation/regression for U-shaped relations
and Cannonical Correlation for the relations among three
or more variables
• By using statistical predictions (regression), we can
answer such questions:
– What is the height of a person whose weight is 80 kg?
– If a students’ OSS score is 168, what will be his/her GPA in
university?
Prediction
• As you remember, correlation coefficient
provides us a measure of how strongly two
variables are related.
– Pearson r is one of the correlation coefficients and
it is calculated on the least squared deviations.
• By using that method, we can draw a
hypothetic line between the scores on a
scatter diagram.
– This line represents the best fit of the predicted
scores
Ali
Kezban
Can
Sevgi
Kenan
Arda
Zehra
Selin
Nalan
Serap
Weight
54
55
65
45
65
66
54
56
67
67
r=1
Height
154
155
165
145
165
166
154
156
167
167
170
165
160
155
150
145
140
0
20
40
60
• In this example, the correlation between weight and height is perfect
(r= +1). So, all the dots in the diagram are on the hypothetic line (the line
of best fit).
• The formula of the line is Y=aX+c. For these distributions, it is Y=1x + 100.
80
Weight
Height
Ali
54
154
Kezban
55
155
Can
65
165
Sevgi
45
145
Kenan
65
165
Arda
66
166
Zehra
54
154
Selin
56
156
Nalan
67
167
Serap
67
167
170
165
160
155
150
145
140
r=1
0
20
40
60
80
Terminology:
• The line in this diagram is regression line.
• The equation of the line is regression equation (Y=aX+c).
– For correlation, it is not important which axis represents which variable, since there is no
direction in correlation.
– For regression, Y axis is always shows the predicted (dependent) variable, and X axis represents
the predictor (independent variable)
•
Be careful about the language of regression. The equation is always read as
regression of Y on X.
Ali
Kezban
Can
Sevgi
Kenan
Arda
Zehra
Selin
Nalan
Serap
Weight Height
53
154
54
155
65
165
43
145
68
165
67
166
53
154
54
156
65
167
66
167
r=.986
170
165
160
155
150
145
140
0
10
20
30
40
50
60
70
• Now let’s focus on the other examples in which the
relation between two variables is less perfect.
• As you can see, when the correlation become less
perfect, dots in the diagram starts to diverge from the
regression line
80
Weight
Ali
Kezban
Can
Sevgi
Kenan
Arda
Zehra
Selin
Nalan
Serap
54
61
45
46
72
68
57
71
47
49
r=.086
Height
154
155
165
145
165
166
154
156
167
167
170
165
160
155
150
145
140
0
10
20
30
40
50
60
70
• In this example, the correlation is far from being
perfect.
– So, the dots are far away from the regression line.
80
The Criterion of Best Fit
• By using the regression equation,
we can calculate predicted values
of Y.
– Let’s say these predicted values
are Yl
• As you can see in the scatter
diagram, these predicted values
of Yl are not the same of Y.
– The discrepancy with Y and Yl is
the error of prediction.
• These discrepancies are
presented as vertical lines in the
figure.
– As you can see, the higher the r,
the lower the error is.
The Regression Equation
• The regression of Y on X (Standart-Score Formula)
– zly = rzx
• zly = the predicted standart-score value of Y
• r = correlation coefficient
• rzx = the standart-score value of X from which zly is predicted
• As you can see,
– 1. zly = rzx when r=1
– 2. zly is 0 when zx is 0. That is mean of X overlaps with
mean of Yl
• To calculate the predicted values of Y from raw
scores of X, we can use a simpler formula
The Regression Equation
• In this formula
– Yl = the predicted raw
score in Y
– Sx and Sy= the two
standart deviations
– r= correlation
coefficient
• Now, lets try to
calculate Yl for the
distribution presented
in Table 1.
Note that this form of the formula is similar to
Y=aX+c
Error of Prediction
The Standard Error of Estimate
• The regression line is
a kind of a mean of
the score in the
scatter plot
• The discrepancy from
regression line is then
a kind of a deviation
form mean (line)
• So, we can use a
similar formula to SD
in order to calculate
standard discrepancy
Error of Prediction
The Standard Error of Estimate
• A preferred formula for Syx is
Syx = Sy
Limits of Error in Estimating Y from X
• Assume that actual scores of Y are normally
distributed about predicted scores
– So, we can use normal curve to calculate the limits of
error in prediction
• In a Normal Distribuion
– 68% of the Y scores fall within the limits Y(mean) +/1 SD
– 95 % of the Y scores fall within the limits Y(mean) +/1.96 SD
– 99 % of the Y scores fall within the limits Y(mean) +/2.58 SD
Limits of Error in Estimating Y from X
• Given that regression line is a kind of a mean
and standard error is a kind of a SD
• We can see that
– 68% of actual Y values fall within the limits
Y(mean) +/- 1 Syx
– 95 % of actual Y values fall within the limits
Y(mean) +/- 1.96 Syx
– 99 % of actual Y values fall within the limits
Y(mean) +/- 2.58 Syx