INTRODUCTION TO STATISTICS - ST. PAUL'S UNIVERSITY

CORRELATION

Definition: Correlation is the existence of some definite relationship between two or more variables. Correlation analysis is a statistical tool used to describe the degree to which one variable is linearly related to another.

Types of Correlation
Correlation may be classified in the following ways:
(a) Positive and negative correlation. Correlation is said to be positive if two series move in the same direction; otherwise it is negative (opposite directions).
(b) Linear and non-linear correlation. Correlation is linear if the amount of change in one variable tends to bear a constant ratio to the amount of change in the other variable; otherwise it is non-linear.
(c) Simple, partial and multiple correlation. Simple correlation involves two variables, while partial and multiple correlation involve three or more variables.

Methods of calculating simple correlation
1. Scatter diagram
2. Karl Pearson's coefficient of correlation
3. Spearman's rank correlation coefficient
4. Method of least squares

Karl Pearson's coefficient of correlation (product moment coefficient of correlation)
The coefficient of correlation (r) is a measure of the strength of the linear relationship between two variables:

    r = (ΣXY − nX̄Ȳ) / √[(ΣX² − nX̄²)(ΣY² − nȲ²)]

Interpretation of the coefficient of correlation
1. When r = +1, there is a perfect positive correlation between the variables.
2. When r = -1, there is a perfect negative correlation between the variables.
3. When r = 0, there is no correlation between the variables.
4. The closer r is to +1 or to -1, the stronger the relationship between the variables; the closer r is to 0, the weaker the relationship.
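As an illustration, Pearson's r can be computed directly from the formula above. The short Python sketch below uses made-up data (the x and y values are illustrative only, not from the notes):

```python
import math

def pearson_r(x, y):
    """Karl Pearson's coefficient of correlation,
    r = (ΣXY − nX̄Ȳ) / √[(ΣX² − nX̄²)(ΣY² − nȲ²)]."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y)) - n * mean_x * mean_y
    sxx = sum(a * a for a in x) - n * mean_x ** 2
    syy = sum(b * b for b in y) - n * mean_y ** 2
    return sxy / math.sqrt(sxx * syy)

# Illustrative data (not from the notes)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(pearson_r(x, y), 4))  # → 0.7746
```

A value of about 0.77 would be read off the interpretation table above as a "strong" positive correlation.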
The following table lists rough interpretations for various values of the correlation coefficient:

    Value        Comment
    0.8 to 1.0   Very strong
    0.6 to 0.8   Strong
    0.4 to 0.6   Moderate
    0.2 to 0.4   Weak
    0.0 to 0.2   Very weak

Method of least squares

    r = SSxy / √(SSxx · SSyy)

Coefficient of determination (r²)
It is the square of the correlation coefficient. It shows the proportion of the total variation in the dependent variable Y that is explained or accounted for by the variation in the independent variable X. For example, if r = 0.9 then r² = 0.81, which means 81% of the variation in the dependent variable is explained by the independent variable.

Spearman's Rank Correlation
This is the correlation between the ranks assigned to individuals by two different judges. It is a non-parametric technique for measuring the strength of the relationship between paired observations of two variables when the data are in ranked form. It is denoted by R (or ρ):

    R = 1 − 6Σd² / [N(N² − 1)] = 1 − 6Σd² / (N³ − N)

In rank correlation there are two types of problems:
i.  Where actual ranks are given
ii. Where actual ranks are not given

Where actual ranks are given
Steps:
1. Take the differences of the two ranks, i.e. (R1 − R2), and denote these differences by d.
2. Square these differences and obtain the total Σd².
3. Use the formula R = 1 − 6Σd² / (N³ − N).

Example
The ranks given by two judges to 10 individuals are given below.

    Individual   1  2  3  4   5  6  7  8  9   10
    Judge 1 (X)  1  2  7  9   8  6  4  3  10  5
    Judge 2 (Y)  7  5  8  10  9  4  1  6  3   2

Calculate:
(a) The Spearman's rank correlation.
(b) The coefficient of correlation.

Where ranks are not given
Ranks can be assigned by taking either the highest value as 1 or the lowest value as 1; the same method should be followed for all the variables.
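The judges example can be checked with a short Python sketch of the formula R = 1 − 6Σd²/(N³ − N), using the two judges' rankings of the 10 individuals:

```python
def spearman_R(rank1, rank2):
    """Spearman's rank correlation for two complete rankings
    (no ties): R = 1 − 6Σd² / (N³ − N)."""
    n = len(rank1)
    d2 = sum((a - b) ** 2 for a, b in zip(rank1, rank2))
    return 1 - 6 * d2 / (n ** 3 - n)

judge1 = [1, 2, 7, 9, 8, 6, 4, 3, 10, 5]
judge2 = [7, 5, 8, 10, 9, 4, 1, 6, 3, 2]
print(round(spearman_R(judge1, judge2), 4))  # → 0.2242
```

Here Σd² = 128, so R = 1 − 768/990 ≈ 0.224, a very weak positive agreement between the two judges.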
Example
Calculate the rank correlation coefficient for the following marks given to 1st-year B.Com students:

    CMS 100:  45  47  60  38  50
    CAC 100:  60  61  58  48  46

EQUAL RANKS OR TIES IN RANKS
Where equal ranks are assigned to some entries, an adjustment is made in the formula for calculating the rank coefficient of correlation. The adjustment consists of adding (m³ − m)/12 to the value of Σd² for each group of tied items, where m stands for the number of items whose ranks are common.

Example
An examination of eight applicants for a clerical post was conducted by a firm. From the marks obtained by the applicants in the accounting and statistics papers, compute the rank coefficient of correlation.

    Applicant            A   B   C   D   E   F   G   H
    Marks in accounting  15  20  28  12  40  60  20  80
    Marks in statistics  40  30  50  30  20  10  30  60

EXAMPLE
A real estate agent would like to predict the selling price of single-family homes. After careful consideration, she concludes that the variable most closely related to the selling price is the size of the house. As an experiment, she takes a random sample of 15 recently sold houses and records the selling price (in Sh'000) and size (in 100 ft²) of each. The data are shown in the table below:

    House size (100 ft²)   Selling price (Sh'000)
    20.0                    89.5
    14.8                    79.9
    20.5                    83.1
    12.5                    56.9
    18.0                    66.6
    14.3                    82.5
    27.5                   126.3
    16.5                    79.3
    24.3                   119.9
    20.2                    87.6
    22.0                   112.6
    19.0                   120.8
    12.3                    78.5
    14.0                    74.3
    16.7                    74.8

Required:
(a) Find the sample regression line for the data.
(b) Estimate the variance of the error variable and the standard error of estimate.
(c) Can we conclude at the 1% significance level that the size of a house is linearly related to its selling price?
(d) Estimate the 99% confidence interval estimate of β1.
(e) Compute the coefficient of correlation and interpret its value.
(f) Can we conclude at the 1% significance level that the two variables are correlated?
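The core computations behind parts (a), (b) and (e) of the house-price example can be sketched in Python using the standard sums-of-squares formulas (SSxy, SSxx, SSyy). This is a minimal sketch, not a full answer: it fits the line, computes the standard error of estimate and r, and does no significance testing. The size/price pairs are transcribed from the table as listed.

```python
import math

size  = [20.0, 14.8, 20.5, 12.5, 18.0, 14.3, 27.5, 16.5,
         24.3, 20.2, 22.0, 19.0, 12.3, 14.0, 16.7]   # 100 ft^2
price = [89.5, 79.9, 83.1, 56.9, 66.6, 82.5, 126.3, 79.3,
         119.9, 87.6, 112.6, 120.8, 78.5, 74.3, 74.8]  # Sh'000

n = len(size)
ss_xy = sum(x * y for x, y in zip(size, price)) - sum(size) * sum(price) / n
ss_xx = sum(x * x for x in size) - sum(size) ** 2 / n
ss_yy = sum(y * y for y in price) - sum(price) ** 2 / n

b1 = ss_xy / ss_xx                        # slope of the least-squares line
b0 = sum(price) / n - b1 * sum(size) / n  # intercept
sse = ss_yy - b1 * ss_xy                  # residual (error) sum of squares
s = math.sqrt(sse / (n - 2))              # standard error of estimate
r = ss_xy / math.sqrt(ss_xx * ss_yy)      # coefficient of correlation

print(f"price = {b0:.2f} + {b1:.3f} * size")
print(f"standard error of estimate s = {s:.2f}, r = {r:.3f}")
```

This gives a slope of roughly 3.9 (each extra 100 ft² adds about Sh 3,900 to the predicted price) and r around 0.80, a strong positive correlation by the table given earlier.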
(g) Compute the coefficient of determination and interpret its value.
(h) Predict with 95% confidence the selling price of a house that occupies 2,000 ft².
(i) In a certain part of the city, a developer built several thousand houses whose floor plans and exteriors differ but whose sizes are all 2,000 ft². To date they have been rented, but the builder now wants to sell them and wants to know approximately how much money in total he can expect from the sale. Help him by estimating a 95% confidence interval estimate of the mean selling price of the houses.

Interpretation of the standard error of estimate
The smallest value the standard error of estimate can assume is zero, which occurs when SSE = 0, i.e. when all the points fall on the regression line. If S is close to zero, the fit is excellent and the linear model is likely to be a useful and effective analytical and forecasting tool. If S is large, the model is a poor one and the statistician should either improve it or discard it. In general, the standard error of estimate cannot be used as an absolute measure of the model's utility; nonetheless, it is useful in comparing models.

ANALYSIS OF VARIANCE

One-way analysis of variance
ANOVA is a method which measures variability within and between samples. These measures of variability are used as the basis for comparing the means of a number of samples. ANOVA is a procedure used to test the null hypothesis that the means of three or more populations are equal.

Assumptions of one-way ANOVA
1. The populations from which the samples are drawn are (approximately) normally distributed.
2. The populations from which the samples are drawn have the same variance (or standard deviation).
3. The samples drawn from different populations are random and independent.

When carrying out a one-way ANOVA, it is important to identify:
The dependent variable: the random variable under study.
The treatment variable: the random variable which is assumed to influence the outcome of the dependent variable.
The level of the treatment variable: each category of the treatment variable.
The experimental units.

In the one-factor ANOVA model, the null hypothesis states that the treatment means μ1, μ2, ..., μn are all equal to some value μ. If we assume that the independent variable (treatment) has no effect on the response, then the only reason that Xij differs from μ is random effects. Under the null hypothesis the model is therefore

    Xij = μ + εij

where εij is a random variable with zero mean which measures the unexplainable effects that influence Xij. On the other hand, if the treatment or independent variable affects the response, then Xij will differ from μ because of the random effect εij and also because of the treatment effect tj. If the treatment effect tj is non-zero, the model becomes

    Xij = μ + tj + εij

Under this model, the mean value of the response in population j is μj = μ + tj. In the model, μ measures the common effect, tj measures the treatment effect and εij measures the random effect.

In the one-factor model, the following hypothesis is tested:

    H0: μ1 = μ2 = μ3 = ... = μn
    HA: not all of μ1, μ2, ..., μn are equal

Test statistic
The test statistic for one-way ANOVA is

    F = MSTR / MSE

where MSTR is the mean square between treatments and MSE is the mean square for error.

Rejection region
The one-way ANOVA test is always right-tailed, with the rejection region in the right tail of the F distribution curve:

    F > Fα, r−1, N−r

where r is the number of treatment levels and N is the total sample size.

Characteristics of the F distribution
1. It is continuous and skewed to the right.
2. It has two degrees of freedom, one for the numerator and one for the denominator.
3. The values of an F distribution are non-negative.

Computation of the value of the test statistic
General format of a one-way ANOVA table:

    Source of variation        Sum of squares   d.f.    Mean square           F-ratio
    Between treatment levels   SSTR             r − 1   MSTR = SSTR/(r − 1)   F = MSTR/MSE
    Within treatment levels    SSE              N − r   MSE = SSE/(N − r)
    Total                      SST

There are three measures of variability:

1. The total sum of squares (SST): a measure of the total variability. It reflects the variability of the individual Xij values about the overall grand mean X̄:

       SST = Σj Σi (Xij − X̄)²  =  ΣX² − (ΣX)²/n

2. The treatment sum of squares (SSTR), or explained variation: describes the variability observed between the individual sample means. Deviations between individual sample means can be attributed to the effect of the different treatments:

       SSTR = Σj nj (X̄j − X̄)²    where X̄j is the j-th treatment mean

   Alternatively:

       SSTR = (ΣX1)²/n1 + (ΣX2)²/n2 + ... + (ΣXj)²/nj − (ΣX)²/n

3. The error sum of squares (SSE), also called the residual sum of squares or unexplained variation: measures how the individual Xij observations within each sample vary about their respective sample means X̄j:

       SSE = Σj Σi (Xij − X̄j)²  =  ΣX² − [(ΣX1)²/n1 + (ΣX2)²/n2 + ... + (ΣXj)²/nj]

These are related by SST = SSTR + SSE.

Calculation of the mean squares
When SSTR and SSE are divided by their appropriate degrees of freedom, they become variance measures:

    MSTR = SSTR / (r − 1)        MSE = SSE / (N − r)

where N is the total sample size and r is the number of treatment levels.

Two-way analysis of variance
In two-factor ANOVA models, two hypotheses are tested: the primary hypothesis and the secondary hypothesis. The primary hypothesis tests for a difference in the means across the treatment levels; the secondary hypothesis tests for a difference in the means across the different blocking levels.

In two-factor ANOVA, the total variability is partitioned into three parts:
SSTR - sum of squares attributed to the treatment variable
SSB  - sum of squares attributed to the blocking variable
SSE  - sum of squares due to error (residual)

That is, SST = SSTR + SSB + SSE.
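The one-way ANOVA computations (SST, SSTR, SSE, the mean squares and the F ratio) can be sketched in Python. The three treatment samples below are made up for illustration only:

```python
# Three made-up treatment samples (illustrative data only)
groups = [
    [5, 7, 6, 8],     # treatment 1
    [9, 11, 10, 10],  # treatment 2
    [4, 6, 5, 5],     # treatment 3
]

all_x = [x for g in groups for x in g]
N = len(all_x)            # total sample size
r = len(groups)           # number of treatment levels
grand_mean = sum(all_x) / N

# Total, treatment and error sums of squares
sst = sum((x - grand_mean) ** 2 for x in all_x)
sstr = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
sse = sst - sstr          # SST = SSTR + SSE

mstr = sstr / (r - 1)     # mean square between treatments
mse = sse / (N - r)       # mean square for error
F = mstr / mse            # test statistic
print(round(F, 2))        # → 26.33
```

A value of F this large would be compared against the right-tail critical value Fα with (r − 1, N − r) = (2, 9) degrees of freedom; here it would lead to rejecting H0 at the usual significance levels.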