Prediction
• Prediction is a scientific guess about an unknown value, made from known data.
• Statistical prediction is based on correlation. The higher the correlation between the variables, the more accurate the prediction.
– If the correlation between the variables is perfect, the prediction is perfect.
• Today we will study bivariate linear correlation and regression, that is, the linear relation between two variables.
– In a graduate-level statistics class, you will learn curvilinear correlation/regression for U-shaped relations and canonical correlation for relations among three or more variables.
• With statistical prediction (regression), we can answer questions such as:
– What is the height of a person whose weight is 80 kg?
– If a student's OSS score is 168, what will his or her university GPA be?

• As you remember, the correlation coefficient measures how strongly two variables are related.
– Pearson r is one such coefficient, and it is based on the least-squares criterion (minimizing the squared deviations).
• Using that method, we can draw a hypothetical line through the scores on a scatter diagram.
– This line represents the best fit of the predicted scores.

Name      Weight (kg)   Height (cm)
Ali       54            154
Kezban    55            155
Can       65            165
Sevgi     45            145
Kenan     65            165
Arda      66            166
Zehra     54            154
Selin     56            156
Nalan     67            167
Serap     67            167
r = 1

[Figure: scatter diagram of height (Y axis) against weight (X axis) with the line of best fit, r = 1]

• In this example, the correlation between weight and height is perfect (r = +1), so all the dots in the diagram lie on the hypothetical line (the line of best fit).
• The formula of the line is Y = aX + c. For these distributions, it is Y = 1X + 100.
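To make the line of best fit concrete, the sketch below fits Y = aX + c by least squares to the weight and height values from the perfectly correlated example. The function name `least_squares_line` and the variable names are illustrative choices, not part of the slides, and the height for 80 kg is an extrapolation beyond the observed weights.

```python
from math import sqrt

# Weight (X) and height (Y) of the ten people in the r = 1 example.
weight = [54, 55, 65, 45, 65, 66, 54, 56, 67, 67]
height = [154, 155, 165, 145, 165, 166, 154, 156, 167, 167]


def least_squares_line(x, y):
    """Return (a, c, r): slope, intercept and Pearson r for Y = aX + c."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and of cross-products about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx              # slope of the regression of Y on X
    c = mean_y - a * mean_x    # intercept: the line passes through the two means
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return a, c, r


a, c, r = least_squares_line(weight, height)
print(f"Y = {a:.2f}X + {c:.2f}, r = {r:.3f}")

# One of the questions posed in the slides: the height of a person weighing 80 kg.
predicted_height_80kg = a * 80 + c
print(f"Predicted height at 80 kg: {predicted_height_80kg:.1f}")
```

For these data the fit recovers a slope of 1 and an intercept of 100, matching the equation given for the example.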
Terminology:
• The line in this diagram is the regression line.
• The equation of the line is the regression equation (Y = aX + c).
– For correlation, it does not matter which axis represents which variable, since correlation has no direction.
– For regression, the Y axis always shows the predicted (dependent) variable, and the X axis shows the predictor (independent) variable.
• Be careful with the language of regression. The equation is always read as the regression of Y on X.

Name      Weight (kg)   Height (cm)
Ali       53            154
Kezban    54            155
Can       65            165
Sevgi     43            145
Kenan     68            165
Arda      67            166
Zehra     53            154
Selin     54            156
Nalan     65            167
Serap     66            167
r = .986

[Figure: scatter diagram of height (Y axis) against weight (X axis) with the regression line, r = .986]

• Now let's turn to examples in which the relation between the two variables is less than perfect.
• As you can see, when the correlation becomes less perfect, the dots in the diagram start to diverge from the regression line.

Name      Weight (kg)   Height (cm)
Ali       54            154
Kezban    61            155
Can       45            165
Sevgi     46            145
Kenan     72            165
Arda      68            166
Zehra     57            154
Selin     71            156
Nalan     47            167
Serap     49            167
r = .086

[Figure: scatter diagram of height (Y axis) against weight (X axis) with the regression line, r = .086]

• In this example, the correlation is far from perfect.
– So the dots lie far from the regression line.

The Criterion of Best Fit
• Using the regression equation, we can calculate predicted values of Y.
– Let's call these predicted values $Y'$.
• As you can see in the scatter diagram, the predicted values $Y'$ are not the same as the actual values of Y.
– The discrepancy between Y and $Y'$ is the error of prediction.
• These discrepancies are shown as vertical lines in the figure.
– As you can see, the higher the r (in absolute value), the smaller the error.
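The link between r and the size of the prediction errors can be checked on the three example distributions. The sketch below fits a least-squares line to each one and sums the squared vertical discrepancies between actual and predicted heights. The helper `fit_and_errors` and the dictionary labels are hypothetical names introduced only for this illustration.

```python
from math import sqrt

# Heights (Y) are the same in all three examples; only the weights (X) change.
height = [154, 155, 165, 145, 165, 166, 154, 156, 167, 167]
weights = {
    "r = 1": [54, 55, 65, 45, 65, 66, 54, 56, 67, 67],
    "r = .986": [53, 54, 65, 43, 68, 67, 53, 54, 65, 66],
    "r = .086": [54, 61, 45, 46, 72, 68, 57, 71, 47, 49],
}


def fit_and_errors(x, y):
    """Fit Y = aX + c by least squares; return (r, errors), errors = actual - predicted."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    c = mean_y - a * mean_x
    r = sxy / sqrt(sxx * syy)
    # Vertical distance of each dot from the regression line.
    errors = [yi - (a * xi + c) for xi, yi in zip(x, y)]
    return r, errors


results = {}
for label, x in weights.items():
    r, errors = fit_and_errors(x, height)
    sse = sum(e ** 2 for e in errors)  # sum of squared errors of prediction
    results[label] = (r, sse)
    print(f"{label}: computed r = {r:.3f}, sum of squared errors = {sse:.2f}")
```

The sum of squared errors is essentially zero for the perfect correlation and grows as r falls, which is the pattern the scatter diagrams show visually.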
The Regression Equation
• The regression of Y on X (standard-score formula):
– $z'_Y = r\,z_X$
• $z'_Y$ = the predicted standard-score value of Y
• r = the correlation coefficient
• $z_X$ = the standard-score value of X from which $z'_Y$ is predicted
• As you can see,
– 1. $z'_Y = z_X$ when r = 1
– 2. $z'_Y$ is 0 when $z_X$ is 0. That is, at the mean of X the predicted value is the mean of Y.
• To calculate the predicted values of Y from raw scores of X, we can use the raw-score formula, obtained by substituting $z_X = (X - \bar{X})/S_X$ and $z'_Y = (Y' - \bar{Y})/S_Y$ into the standard-score formula:
– $Y' = \left(r\,\frac{S_Y}{S_X}\right) X - \left(r\,\frac{S_Y}{S_X}\right) \bar{X} + \bar{Y}$
• In this formula
– $Y'$ = the predicted raw score in Y
– $S_X$ and $S_Y$ = the two standard deviations
– $\bar{X}$ and $\bar{Y}$ = the two means
– r = the correlation coefficient
• Now, let's try to calculate $Y'$ for the distribution presented in Table 1. Note that this form of the formula is similar to Y = aX + c, with $a = r\,S_Y/S_X$ and $c = \bar{Y} - a\bar{X}$.

Error of Prediction: The Standard Error of Estimate
• The regression line is a kind of mean of the scores in the scatter plot.
• The discrepancy from the regression line is then a kind of deviation from a mean (the line).
• So we can use a formula similar to that of the SD to calculate the standard discrepancy:
– $S_{YX} = \sqrt{\frac{\sum (Y - Y')^2}{n}}$
• A preferred formula for $S_{YX}$ is
– $S_{YX} = S_Y \sqrt{1 - r^2}$

Limits of Error in Estimating Y from X
• Assume that the actual scores of Y are normally distributed about the predicted scores.
– Then we can use the normal curve to calculate the limits of error in prediction.
• In a normal distribution
– 68% of the Y scores fall within the limits mean ± 1 SD
– 95% of the Y scores fall within the limits mean ± 1.96 SD
– 99% of the Y scores fall within the limits mean ± 2.58 SD
• Given that the regression line is a kind of mean and the standard error of estimate is a kind of SD, we can see that
– 68% of actual Y values fall within the limits $Y' \pm 1\,S_{YX}$
– 95% of actual Y values fall within the limits $Y' \pm 1.96\,S_{YX}$
– 99% of actual Y values fall within the limits $Y' \pm 2.58\,S_{YX}$
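The raw-score prediction, the standard error of estimate, and the limits of error can be traced end to end on the r = .986 example. This is a sketch under stated assumptions:
• The raw-score form used here is Y' = r(S_Y/S_X)(X − mean of X) + mean of Y, which follows from substituting z = (score − mean)/SD into the standard-score formula.
• Standard deviations are computed with n in the denominator, and the shortcut S_YX = S_Y·sqrt(1 − r²) is the standard textbook identity.
• The 60 kg query value and all function and variable names are hypothetical.

```python
from math import sqrt

# Weight (X) and height (Y) from the r = .986 example.
x = [53, 54, 65, 43, 68, 67, 53, 54, 65, 66]
y = [154, 155, 165, 145, 165, 166, 154, 156, 167, 167]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
# Assumption: standard deviations use n (not n - 1) in the denominator.
s_x = sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
s_y = sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)
# Pearson r as the mean product of paired standard scores.
r = sum(((xi - mean_x) / s_x) * ((yi - mean_y) / s_y) for xi, yi in zip(x, y)) / n


def predict(x_value):
    """Raw-score regression of Y on X: Y' = r * (S_Y / S_X) * (X - mean_X) + mean_Y."""
    return r * (s_y / s_x) * (x_value - mean_x) + mean_y


# Standard error of estimate, computed two ways.
s_yx_direct = sqrt(sum((yi - predict(xi)) ** 2 for xi, yi in zip(x, y)) / n)
s_yx_shortcut = s_y * sqrt(1 - r ** 2)

# Approximate 95% limits of error for a (hypothetical) person weighing 60 kg,
# assuming actual Y scores are normally distributed about the predicted score.
y_pred = predict(60)
lower = y_pred - 1.96 * s_yx_shortcut
upper = y_pred + 1.96 * s_yx_shortcut

print(f"r = {r:.3f}")
print(f"S_YX (direct) = {s_yx_direct:.3f}, S_YX (shortcut) = {s_yx_shortcut:.3f}")
print(f"Predicted height at 60 kg: {y_pred:.1f} (95% limits {lower:.1f} to {upper:.1f})")
```

The two values of S_YX agree up to floating-point rounding, which is why the shortcut is usually preferred: it needs only S_Y and r rather than every individual prediction.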