INTRODUCTION TO STATISTICS ST.PAULS UNIVERSITY
CORRELATION
Definition: Correlation is the existence of some definite relationship between two or more variables.
Correlation analysis is a statistical tool used to describe the degree to which one variable is
linearly related to another variable.
Types of Correlation
Correlation may be classified in the following ways:
(a) Positive and negative correlation
Correlation is positive if two series move in the same direction, and negative if they move in opposite directions.
(b) Linear and non-linear correlation
Correlation is linear if the amount of change in one variable tends to bear a constant ratio to the amount of change in the other variable; otherwise it is non-linear.
(c) Simple, partial and multiple correlation
Simple correlation involves two variables, while partial and multiple correlation involve three or more variables.
Methods of calculating simple correlation
1. Scatter diagram
2. Karl Pearson’s coefficient of correlation
3. Spearman’s rank correlation coefficient
4. Method of least squares
Karl Pearson’s coefficient of correlation (Product moment coefficient of correlation)
The coefficient of correlation (r) is a measure of the strength of the linear relationship between two variables.
 XY  n X Y
r
2
2
 X 2  n X  Y 2  nY
Interpretation of the coefficient of correlation
1. When r = +1, there is a perfect positive correlation between the variables
2. When r = -1, there is a perfect negative correlation between the variables
3. When r = 0, there is no correlation between the variables
4. The closer r is to +1 or to –1, the stronger the relationship between the variables and the
closer r is to 0, the weaker the relationship.
5. The following table lists the interpretations for various correlation coefficients:

Value        Comment
0.8 to 1.0   Very strong
0.6 to 0.8   Strong
0.4 to 0.6   Moderate
0.2 to 0.4   Weak
0.0 to 0.2   Very weak
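The product-moment formula above can be computed directly. The following is a minimal Python sketch; the data and the helper name pearson_r are hypothetical, for illustration only.

```python
# Pearson's coefficient of correlation from the raw-score formula
# r = (ΣXY - nX̄Ȳ) / √[(ΣX² - nX̄²)(ΣY² - nȲ²)].
import math

def pearson_r(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - n * x_bar * y_bar
    denominator = math.sqrt((sum_x2 - n * x_bar ** 2) * (sum_y2 - n * y_bar ** 2))
    return numerator / denominator

# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(pearson_r(x, y), 4))   # → 0.7746
```

A value of about 0.77 would be read off the table above as a strong positive correlation.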
Method of least squares
r = SS_xy / √(SS_xx · SS_yy)
Coefficient of determination (r²)
It is the square of the correlation coefficient. It shows the proportion of the total variation in the dependent variable Y that is explained or accounted for by the variation in the independent variable X. For example, if r = 0.9 then r² = 0.81, meaning 81% of the variation in the dependent variable has been explained by the independent variable.
Spearman’s Rank Correlation
• It is the correlation between the ranks assigned to individuals by two different people.
• It is a non-parametric technique for measuring the strength of the relationship between paired observations of two variables when the data are in ranked form.
It is denoted by R (or ρ):
R = 1 − (6Σd_i²)/(N(N² − 1)) = 1 − (6Σd²)/(N³ − N)
In rank correlation, there are two types of problems:
i. Where actual ranks are given
ii. Where actual ranks are not given
Where actual ranks are given
Steps:
• Take the differences of the two ranks, i.e. (R1 − R2), and denote these differences by d.
• Square these differences and obtain the total Σd².
• Use the formula R = 1 − (6Σd²)/(N³ − N)
Example
The ranks given by two judges to 10 individuals are given below.

Individual    1  2  3  4   5  6  7  8  9   10
Judge 1 (X)   1  2  7  9   8  6  4  3  10  5
Judge 2 (Y)   7  5  8  10  9  4  1  6  3   2

Calculate:
(a) The Spearman’s rank correlation.
(b) The coefficient of correlation.
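Part (a) can be sketched in Python by applying the formula from the steps above to the two judges' ranks; the variable names are illustrative.

```python
# Spearman's rank correlation for the two judges' rankings,
# using R = 1 - 6Σd²/(N³ - N).
x = [1, 2, 7, 9, 8, 6, 4, 3, 10, 5]    # ranks by Judge 1
y = [7, 5, 8, 10, 9, 4, 1, 6, 3, 2]    # ranks by Judge 2
n = len(x)
d_squared = sum((a - b) ** 2 for a, b in zip(x, y))
R = 1 - 6 * d_squared / (n ** 3 - n)
print(round(R, 4))   # → 0.2242
```

Since both series are already ranks with no ties, the Pearson coefficient of the ranks in part (b) equals the Spearman value.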
Where ranks are not given
Ranks can be assigned by taking either the highest value as 1 or the lowest value as 1. The same method should be followed for all the variables.
Example
Calculate the rank correlation coefficient for the following data of marks given to 1st year B.Com students:

Marks in CMS 100:  45  47  60  38  50
Marks in CAC 100:  60  61  58  48  46
EQUAL RANKS OR TIES IN RANKS
• Where equal ranks are assigned to some entries, an adjustment is made in the formula for calculating the rank coefficient of correlation.
• The adjustment consists of adding (1/12)(m³ − m) to the value of Σd², where m stands for the number of items whose ranks are common. This correction is added once for each group of tied ranks.
Example
A firm examined eight applicants for a clerical post. From the marks obtained by the applicants in the accounting and statistics papers, compute the rank coefficient of correlation.

Applicant             A   B   C   D   E   F   G   H
Marks in accounting   15  20  28  12  40  60  20  80
Marks in statistics   40  30  50  30  20  10  30  60
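The tie-adjusted calculation for this example can be sketched as follows, assuming ranks are assigned with the lowest mark as 1 (the choice of direction does not affect R); the helper names are illustrative.

```python
# Rank correlation with tied ranks for the eight applicants.
# Tied observations share the average of the ranks they occupy, and
# (m³ - m)/12 is added to Σd² for each group of m tied items.
def average_ranks(values):
    """Assign ranks (1 = lowest), giving tied values their average rank."""
    sorted_vals = sorted(values)
    ranks = {}
    for v in set(values):
        positions = [i + 1 for i, s in enumerate(sorted_vals) if s == v]
        ranks[v] = sum(positions) / len(positions)
    return [ranks[v] for v in values]

def tie_correction(values):
    """Sum of (m³ - m)/12 over every group of m tied values."""
    return sum(m ** 3 - m
               for m in (values.count(v) for v in set(values))
               if m > 1) / 12

accounting = [15, 20, 28, 12, 40, 60, 20, 80]
statistics = [40, 30, 50, 30, 20, 10, 30, 60]

rx = average_ranks(accounting)
ry = average_ranks(statistics)
n = len(rx)
d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
adjusted = d_squared + tie_correction(accounting) + tie_correction(statistics)
R = 1 - 6 * adjusted / (n ** 3 - n)
print(R)   # → 0.0
```

Here Σd² = 81.5, and the corrections for the two marks of 20 and the three marks of 30 add 0.5 + 2 = 2.5, giving an adjusted total of 84 and R = 0: no rank correlation between the two papers.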
EXAMPLE
A real estate agent would like to predict the selling price of single-family homes. After careful consideration, she concludes that the variable most closely related to the selling price is the size of the house. As an experiment, she takes a random sample of 15 recently sold houses and records the selling price in Sh ’000 and the size in 100 ft² of each. The data are shown in the table below:

House size (100 ft²):    20.0  14.8  20.5  12.5  18.0  14.3  27.5   16.5  24.3   20.2  22.0   19.0   12.3  14.0  16.7
Selling price (Sh ’000): 89.5  79.9  83.1  56.9  66.6  82.5  126.3  79.3  119.9  87.6  112.6  120.8  78.5  74.3  74.8
Required:
(a) Find the sample regression line for the data.
(b) Estimate the variance of the error variable and the standard error of estimate.
(c) Can we conclude at the 1% significance level that the size of a house is linearly related to
its selling price?
(d) Find the 99% confidence interval estimate of β₁.
(e) Compute the coefficient of correlation and interpret its value
(f) Can we conclude at the 1% significance level that the two variables are correlated?
(g) Compute the coefficient of determination and interpret its value
(h) Predict with 95% confidence the selling price of a house that occupies 2,000 ft².
(i) In a certain part of the city, a developer built several thousand houses whose floor plans
and exteriors differ but whose sizes are all 2,000 ft². To date they have been rented, but the
builder now wants to sell them and wants to know approximately how much money in total
he can expect from the sale of the houses. Help him by estimating a 95% confidence
interval for the mean selling price of the houses.
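Parts (a), (b), (e) and (g) can be sketched numerically with the least-squares quantities defined in the earlier sections; a minimal Python sketch using the sample data above (variable names are illustrative):

```python
# Least-squares regression for the house-price data:
# slope b1 = SSxy/SSxx, intercept b0 = ȳ - b1·x̄,
# r = SSxy/√(SSxx·SSyy), and standard error of estimate √(SSE/(n-2)).
import math

size  = [20.0, 14.8, 20.5, 12.5, 18.0, 14.3, 27.5, 16.5, 24.3, 20.2,
         22.0, 19.0, 12.3, 14.0, 16.7]                  # 100 ft²
price = [89.5, 79.9, 83.1, 56.9, 66.6, 82.5, 126.3, 79.3, 119.9, 87.6,
         112.6, 120.8, 78.5, 74.3, 74.8]                # Sh '000

n = len(size)
x_bar = sum(size) / n
y_bar = sum(price) / n
ss_xy = sum(x * y for x, y in zip(size, price)) - n * x_bar * y_bar
ss_xx = sum(x * x for x in size) - n * x_bar ** 2
ss_yy = sum(y * y for y in price) - n * y_bar ** 2

b1 = ss_xy / ss_xx                # slope
b0 = y_bar - b1 * x_bar           # intercept
r = ss_xy / math.sqrt(ss_xx * ss_yy)
sse = ss_yy - b1 * ss_xy          # error sum of squares
s_e = math.sqrt(sse / (n - 2))    # standard error of estimate

print(f"y-hat = {b0:.2f} + {b1:.2f}x, r = {r:.3f}, r^2 = {r*r:.3f}, s_e = {s_e:.2f}")
```

The slope estimates how many Sh ’000 the selling price rises per additional 100 ft², and r² gives the proportion of price variation explained by size, as defined under the coefficient of determination.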
Interpretation of the standard error of estimate
• The smallest value the standard error of estimate can assume is zero, which occurs when SSE = 0, i.e. when all the points fall on the regression line.
• If Sε is close to zero, the fit is excellent and the linear model is likely to be a useful and effective analytical and forecasting tool.
• If Sε is large, the model is a poor one and the statistician should either improve it or discard it.
• In general, the standard error of estimate cannot be used as an absolute measure of the model’s utility. Nonetheless, it is useful in comparing models.
ANALYSIS OF VARIANCE
One-way analysis of variance
ANOVA is a method which measures variability within and between samples. These measures of variability are used as the basis for comparing the means of a number of samples. ANOVA is a procedure used to test the null hypothesis that the means of three or more populations are equal.
Assumptions of one-way ANOVA
 The populations from which the samples are drawn are (approximately) normally
distributed.
 The populations from which the samples are drawn have the same variance or standard
deviation.
 The samples drawn from different populations are random and independent.
When carrying out a one-way ANOVA, it is important to identify:
• The dependent variable: This is the random variable under study.
• The treatment variable: It is the random variable which is assumed to influence the outcome of the dependent variable.
• The level of the treatment variable: Refers to each category of the treatment variable.
• The experimental units
In the one-factor ANOVA model, the null hypothesis states that the treatment means μ₁, μ₂, …, μ_n are all equal to some value μ. If we assume that the independent variable or treatment has no effect on the response, then the only reason that X_ij differs from μ is random effects. Under the null hypothesis the model is therefore X_ij = μ + ε_ij, where ε_ij is a random variable having zero mean which measures the unexplainable effects that influence X_ij.
On the other hand, if the treatment or independent variable affects the response, then X_ij will differ from μ because of the random effects ε_ij and also because of the treatment effect t_j. If the treatment effect t_j is non-zero, the model becomes X_ij = μ + t_j + ε_ij.
Under this model, the mean response in population j is μ_j = μ + t_j. Here μ measures the common effect, t_j measures the treatment effect and ε_ij measures the random effect.
In the one-factor model, the following hypothesis is tested:
H₀: μ₁ = μ₂ = μ₃ = … = μ_n
H_A: not all the μ_j are equal
Test statistic
The test statistic for one-way ANOVA is F:
F = MSTR / MSE
Where MSTR – Mean Square between treatments
MSE – Mean Square for Error
Rejection region
The one-way ANOVA test is always right-tailed, with the rejection region in the right tail of the F distribution curve:
F > F(α, r − 1, N − r), where r is the number of treatment levels and N is the total sample size.
Characteristics of the F distribution
• It is continuous and skewed to the right.
• It has two degrees of freedom, i.e. one for the numerator and one for the denominator.
• The values of an F distribution are non-negative.
Computation of the value of the test statistic
General format of a one-way ANOVA table

Source of variation        Sum of squares   d.f.    Mean square            F-ratio
Between treatment levels   SSTR             r − 1   MSTR = SSTR/(r − 1)    F = MSTR/MSE
Within treatment levels    SSE              N − r   MSE = SSE/(N − r)
Total                      SST              N − 1
There are three measures of variability.
The total sum of squares (SST): This is a measure of the total variability. It reflects the variability of the individual X_ij values about an overall grand mean X̄:
SST = Σ_{j=1..r} Σ_{i=1..n_j} (X_ij − X̄)²   or, equivalently,   SST = ΣX² − (ΣX)²/n
Treatment sum of squares (SSTR) – Explained variation
This describes the variability which is observed between all the individual sample means. Deviations observed between individual sample means can be attributed to the effect of the different sample treatments.
SSTR = Σ_{j=1..r} n_j (X̄_j − X̄)², where X̄_j is the j-th treatment mean
Alternatively:
SSTR = [(ΣX₁)²/n₁ + (ΣX₂)²/n₂ + … + (ΣX_j)²/n_j] − (ΣX)²/n
Error sum of squares (SSE): also called the residual sum of squares or unexplained variation. It measures how the individual X_ij observations within each sample vary about their respective sample means X̄_j.
SSE = Σ_{j=1..r} Σ_{i=1..n_j} (X_ij − X̄_j)²
Alternatively:
SSE = ΣX² − [(ΣX₁)²/n₁ + (ΣX₂)²/n₂ + … + (ΣX_j)²/n_j]
Note that SST = SSTR + SSE.
Calculation of the mean squares
When SSTR and SSE are divided by their appropriate degrees of freedom, they become variance measures:
MSTR = SSTR/(r − 1)     MSE = SSE/(N − r)
where N is the total sample size and r is the number of treatment levels.
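The computations above can be sketched end to end on three small samples; the data here are hypothetical, purely to illustrate the formulas:

```python
# One-way ANOVA: SSTR, SSE, SST and the F statistic, computed
# directly from the definitional formulas above.
samples = [
    [48, 52, 50, 49, 51],   # treatment 1 (hypothetical data)
    [55, 57, 54, 56, 58],   # treatment 2
    [50, 53, 51, 52, 49],   # treatment 3
]

N = sum(len(s) for s in samples)                      # total sample size
r = len(samples)                                      # treatment levels
grand_mean = sum(x for s in samples for x in s) / N

sstr = sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in samples)
sse  = sum((x - sum(s) / len(s)) ** 2 for s in samples for x in s)
sst  = sum((x - grand_mean) ** 2 for s in samples for x in s)

mstr = sstr / (r - 1)
mse  = sse / (N - r)
F = mstr / mse
print(f"SSTR={sstr:.2f}  SSE={sse:.2f}  SST={sst:.2f}  F={F:.2f}")
```

The identity SST = SSTR + SSE holds by construction, and the computed F would be compared against F(α, r − 1, N − r) from tables.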
Two-way analysis of variance
In two-factor ANOVA models, two hypotheses are tested: the primary hypothesis and the secondary hypothesis.
The primary hypothesis tests for a difference in the means across the treatment levels.
The secondary hypothesis tests for any difference in the means across the different blocking levels.
In two-factor ANOVA, the total variability is partitioned into three parts:
• SSTR – sum of squares attributed to the treatment variable
• SSB – sum of squares attributed to the blocking variable
• SSE – sum of squares due to error (residual)
so that SST = SSTR + SSB + SSE.
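A minimal sketch of this three-way partition for a randomized block layout, using a small hypothetical table with rows as blocks and columns as treatment levels:

```python
# Two-way (randomized block) ANOVA: partitioning SST into
# SSTR (treatments), SSB (blocks) and SSE (residual).
data = [
    [31, 35, 33],   # block 1 (hypothetical observations)
    [29, 32, 30],   # block 2
    [34, 38, 36],   # block 3
    [30, 33, 32],   # block 4
]

b = len(data)            # number of blocks
r = len(data[0])         # number of treatment levels
N = b * r
grand_mean = sum(x for row in data for x in row) / N

treat_means = [sum(row[j] for row in data) / b for j in range(r)]
block_means = [sum(row) / r for row in data]

sst  = sum((x - grand_mean) ** 2 for row in data for x in row)
sstr = b * sum((m - grand_mean) ** 2 for m in treat_means)
ssb  = r * sum((m - grand_mean) ** 2 for m in block_means)
sse  = sst - sstr - ssb   # residual, by the identity above

print(f"SST={sst:.2f}  SSTR={sstr:.2f}  SSB={ssb:.2f}  SSE={sse:.2f}")
```

Removing the block variability from the error term is what makes the blocked design more sensitive to treatment differences than a one-way layout on the same data.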