Chi-square GOF

χ² Tests for Goodness of Fit:
• General Notion: We often wish to know whether the distribution of a
particular population fits a general form (uniform, normal, etc.)
• Example: To use t tests, we must suppose that the
population is normally distributed
• If a sample is drawn from, say, a normal distribution, the
sample values should reflect the population
distribution
• This allows us to state the number in the sample that should
be in a particular range
• Example: 68% of a normal distribution is within +/- 1
standard deviation of the mean. About 68% of the
values in a sample from a normal distribution should be
within +/- 1 standard deviation of the mean
• Comparison of actual and expected numbers is the
province of the χ² distribution
• Let O_j be the number observed in the sample in
range j
• Let E_j be the number that would be expected if
the population had a given distribution, such as
uniform, Poisson, normal, etc.
• Then

$$\chi^2 = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j}$$

• where k is the number of categories
• degrees of freedom = k – 1 – m where m is the
number of parameter estimates used in the
calculation
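
The statistic and its degrees of freedom can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `chi_square_statistic` and `degrees_of_freedom` and the three-category counts are hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O_j - E_j)^2 / E_j over the k categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def degrees_of_freedom(k, m=0):
    """k categories, m parameters estimated from the sample."""
    return k - 1 - m


# Hypothetical counts, for illustration only.
observed = [12, 8, 10]
expected = [10, 10, 10]

stat = chi_square_statistic(observed, expected)
df = degrees_of_freedom(len(observed))  # no parameters estimated, so m = 0
print(stat, df)
```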
• Example: Are the answers to Dr. Dinwiddie’s multiple-choice tests
random? If so, the answers should conform to a uniform distribution
and P(A) = P(B) = P(C) = P(D) = ¼. (For the uniform distribution
P(E) = 1/n, where n is the number of possible values.)
On a recent exam there were sixty questions with correct answers:
A-20, B-5, C-17, and D-18.
H0: the distribution of answers is uniform
H1: the distribution is not uniform

| Correct Answer | Observed (O_j) | Expected (E_j) | Squared Difference (O_j − E_j)²/E_j |
|---|---|---|---|
| A | 20 | 15 | 1.667 |
| B | 5 | 15 | 6.667 |
| C | 17 | 15 | 0.267 |
| D | 18 | 15 | 0.600 |

$$\chi^2 = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j}$$

Then χ² = 138/15 = 9.2, and no parameters were estimated, so degrees of
freedom = 4 – 1 = 3
• Excel and the chi-square distribution
– CHIDIST(x value, df) returns the area in the
right-hand tail of the chi-square distribution
   • goodness of fit tests are all upper one-tail tests, so
   CHIDIST gives the p-value of the test
– CHIINV(probability, df) gives the chi-square
value that cuts off the entered probability in the
upper tail
   • use to find the critical value for a chi-square test
• For the Dinwiddie problem:
CHIDIST(9.2, 3) gives the p-value of the
test
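
For readers working outside Excel, the Dinwiddie calculation can be sketched in Python using only the standard library. The helper `chi2_sf` is a hypothetical stand-in for CHIDIST, written here from the closed-form upper-tail expressions for integer degrees of freedom; it is not an Excel or library function. Under these assumptions the p-value comes out at about 0.027.

```python
import math


def chi2_sf(x, df):
    """Upper-tail area of the chi-square distribution for integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        # Even df: e^{-y} * sum over i < df/2 of y^i / i!
        return math.exp(-y) * sum(y ** i / math.factorial(i) for i in range(df // 2))
    # Odd df: start from erfc(sqrt(y)) (the df = 1 case) and add one
    # correction term for each additional 2 degrees of freedom.
    tail = math.erfc(math.sqrt(y))
    for j in range(1, (df - 1) // 2 + 1):
        tail += math.exp(-y) * y ** (j - 0.5) / math.gamma(j + 0.5)
    return tail


observed = [20, 5, 17, 18]            # correct answers A, B, C, D
expected = [sum(observed) / 4] * 4    # uniform: 15 in each category
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                # no parameters estimated
p_value = chi2_sf(chi2, df)           # plays the role of CHIDIST(chi2, df)

print(round(chi2, 3), df, round(p_value, 4))
```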
• EXAMPLE: Hamish suspects that the dice at Black Bart’s are not
fair, so he spirits one out of the casino one night. After rolling the
stolen die 120 times, he has the following result:

| No. of Dots | No. of Times |
|---|---|
| 1 | 27 |
| 2 | 24 |
| 3 | 18 |
| 4 | 11 |
| 5 | 27 |
| 6 | 13 |

$$\chi^2 = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j}$$

– What are the null and alternative hypotheses?
– Is Hamish right to be suspicious of Black Bart?
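
One way to check an answer to the die problem is sketched below, taking the null hypothesis to be a fair die (20 expected rolls per face). The helpers `chi2_sf` and `chi2_critical` are hypothetical stand-ins for CHIDIST and CHIINV, and the 5% significance level is an assumption, since the slide does not state one.

```python
import math


def chi2_sf(x, df):
    """Upper-tail chi-square area for integer df >= 1 (stand-in for CHIDIST)."""
    y = x / 2.0
    if df % 2 == 0:
        return math.exp(-y) * sum(y ** i / math.factorial(i) for i in range(df // 2))
    tail = math.erfc(math.sqrt(y))
    for j in range(1, (df - 1) // 2 + 1):
        tail += math.exp(-y) * y ** (j - 0.5) / math.gamma(j + 0.5)
    return tail


def chi2_critical(alpha, df):
    """Value with upper-tail area alpha, found by bisection (stand-in for CHIINV)."""
    lo, hi = 0.0, 200.0
    while hi - lo > 1e-9:
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > alpha:
            lo = mid   # tail still too large, so move right
        else:
            hi = mid
    return (lo + hi) / 2.0


observed = [27, 24, 18, 11, 27, 13]            # faces 1 through 6
expected = [sum(observed) / 6] * 6             # fair die: 20 per face
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                         # no parameters estimated
alpha = 0.05                                   # assumed significance level

critical = chi2_critical(alpha, df)
p_value = chi2_sf(chi2, df)
print(round(chi2, 2), round(critical, 3), p_value < alpha)
```

With these inputs the statistic exceeds the 5% critical value, so the sketch rejects the hypothesis of a fair die at that level.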
• Testing for normality
– Suppose that nationally auto insurance has a mean price of $700
with standard deviation $135. We have a sample of 80 NC
drivers, and we’d like to know whether their insurance bills are
normally distributed with the national parameters.
– How many would we expect in the range 700 to 835?
– HINT: how many standard deviations? What proportion are
within that range of standard deviations?
• Answer: 835 is one standard deviation above the mean. On a normal
distribution, about 0.34 of the values are between the mean and
+1 st dev, so we’d expect to find 0.34 * 80 = 27.2 in that range
• Setting up a spreadsheet: use NORMSDIST
– NORMSDIST(-2) gives the proportion more than two standard
deviations below the mean
– NORMSDIST(-1) – NORMSDIST(-2) gives the proportion between 1 and 2
st devs below the mean
• Continuing in that fashion, we’d have the following

| St Devs | Range | Prop. | Expected freq |
|---|---|---|---|
| < -2 | < 430 | 0.02275 | 1.82 |
| -2 to -1 | 430-565 | 0.1359 | 10.87 |
| -1 to 0 | 565-700 | 0.3413 | 27.31 |
| 0 to 1 | 700-835 | 0.3413 | 27.31 |
| 1 to 2 | 835-970 | 0.1359 | 10.87 |
| > 2 | > 970 | 0.02275 | 1.82 |

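
The proportions and expected frequencies for the six ranges can be reproduced with Python's standard library, where `NormalDist().cdf` plays the role of NORMSDIST. This is a sketch for checking the spreadsheet; the variable names are illustrative.

```python
from statistics import NormalDist

n = 80                       # sample size from the example
mean, sd = 700, 135          # national parameters from the example
z = NormalDist()             # standard normal, analogous to NORMSDIST

cuts = [-2, -1, 0, 1, 2]     # bin edges measured in standard deviations
dollar_edges = [mean + c * sd for c in cuts]

# Cumulative proportions at each edge, padded with 0 and 1 for the open tails.
cdf = [0.0] + [z.cdf(c) for c in cuts] + [1.0]
proportions = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
expected = [n * p for p in proportions]

print(dollar_edges)
for p, e in zip(proportions, expected):
    print(f"{p:.5f}  {e:.2f}")
```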
• To find the observed values in the sample,
use the HISTOGRAM tool
• An elaborated solution appears under
“Study Aids” on my web site. Click on the
link to normaltest.xls
• Issue: how many degrees of freedom does
the χ² statistic have?
– df = k – 1 – m = 6 – 1 – 0 = 5 (no parameters were estimated)
• Alternate technique: determine whether
the sample was drawn from a normal
population, without specifying its parameters in advance
• First, calculate the sample mean and standard
deviation and use those numbers in the
calculation
• Issue: how many degrees of freedom does
the χ² statistic have?
– df = k – 1 – m = 6 – 1 – 2 = 3 (the mean and standard deviation were estimated)
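
The alternate technique can be sketched as follows. The sample is hypothetical: it is generated with `random.gauss` purely for illustration and is not the NC data from normaltest.xls. The bins are the same six standard-deviation ranges used above, now centred on the sample mean.

```python
import random
from bisect import bisect_right
from statistics import NormalDist, mean, stdev

random.seed(1)
# Hypothetical sample of 80 insurance bills, for illustration only.
sample = [random.gauss(720, 140) for _ in range(80)]

xbar, s = mean(sample), stdev(sample)    # two parameters estimated, so m = 2
cuts = [-2, -1, 0, 1, 2]                 # bin edges in standard deviations
edges = [xbar + c * s for c in cuts]

# Count how many observations fall into each of the six ranges.
observed = [0] * (len(cuts) + 1)
for x in sample:
    observed[bisect_right(edges, x)] += 1

# Expected counts use the same normal proportions as before.
z = NormalDist()
cdf = [0.0] + [z.cdf(c) for c in cuts] + [1.0]
expected = [len(sample) * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 2               # k - 1 - m
print(observed, round(chi2, 3), df)
```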
A problem and an alternate solution
• Each cell should have an expected frequency of at least 5; otherwise the
chi-square value is not reliable (the two tail cells in the table above
have an expected frequency of only 1.82)
• One solution: choose ranges with equal expected frequencies
• Divide the data into, say, 10 ranges, each expected to contain 8
observations
• So we define ranges that each contain 1/10 of the total
• Remember that NORMINV(probability, mean, standard deviation)
returns the value below which the given proportion of the distribution
lies, for the specified mean and standard deviation
• Example: NORMINV(.1, 300, 20) = 274.37. 10% of this distribution
is ≤ 274.37
• NORMINV(1/10, x̄, s) will find the boundary of the lowest 10% of
the distribution
• NORMINV(4/10, x̄, s) finds the boundary of the lowest 40%, and so
on
• Look carefully at sheet 2 of the workbook normaltest.xls as posted
• The boundaries thus found are the bin range
• Each bin will have an expected number equal to n/c, where n is the amount
of data and c the number of categories
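
The equal-frequency boundaries can be sketched in Python, where `NormalDist.inv_cdf` plays the role of NORMINV. The mean and standard deviation used for the bins (700 and 135) are assumed stand-ins for the sample values x̄ and s, which the slides do not give.

```python
from statistics import NormalDist

# Check of the slide's example: NORMINV(.1, 300, 20)
example = NormalDist(300, 20).inv_cdf(0.1)
print(round(example, 2))

n, c = 80, 10                  # amount of data, number of categories
xbar, s = 700.0, 135.0         # assumed stand-ins for the sample mean and sd
dist = NormalDist(xbar, s)

# Interior boundaries at the 10%, 20%, ..., 90% points of the distribution.
boundaries = [dist.inv_cdf(i / c) for i in range(1, c)]
expected_per_bin = n / c       # every bin has the same expected count

print([round(b, 2) for b in boundaries], expected_per_bin)
```

Because every bin has an expected count of n/c = 8, the minimum-of-5 rule is satisfied by construction.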
• Testing for conformity to an observed
distribution:
– The national distribution of pets is as follows:

| Number of Pets | Percentage of Households |
|---|---|
| 0 | 55 |
| 1 | 25 |
| 2 | 10 |
| 3 | 5 |
| 4 | 3 |
| 5 or more | 2 |

– A marketing company wants to know whether Boone
conforms to the national pattern. In a sample of 300
Boone households, they found the following:

| No. of Pets | No. of Households | Expected No. | Squares |
|---|---|---|---|
| 0 | 128 | | |
| 1 | 75 | | |
| 2 | 50 | | |
| 3 | 20 | | |
| 4 | 18 | | |
| 5 or more | 9 | | |

$$\chi^2 = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j}$$
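
The two empty columns can be filled in with a short Python sketch. It assumes that "Expected No." means the sample size times the national percentage and that "Squares" means the per-category term (O_j − E_j)²/E_j; the variable names are illustrative.

```python
categories = ["0", "1", "2", "3", "4", "5 or more"]
percent = [55, 25, 10, 5, 3, 2]          # national percentages
observed = [128, 75, 50, 20, 18, 9]      # Boone sample

n = sum(observed)                        # 300 households
expected = [n * p / 100 for p in percent]
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]

chi2 = sum(terms)
df = len(categories) - 1                 # percentages are given, so m = 0

for cat, o, e, t in zip(categories, observed, expected, terms):
    print(f"{cat:>9}  {o:>4}  {e:>6.1f}  {t:>7.3f}")
print(round(chi2, 3), df)
```

Every expected count here is at least 5 (the smallest is 6), so no categories need to be combined before applying the test.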