STP231 Brief Class Notes, Instructor: Ela Jackiewicz
Chapter 9 Categorical Data: One-Sample Distributions
Estimating population proportion:
In the first part of this chapter we will consider a dichotomous categorical variable (2 classes: A, not A)
in a large population. We will discuss a sampling distribution of an estimate of a population proportion
p=P(A) in our population. Suppose we take a random sample of size n, and let y = number of subjects
with characteristic A in our sample. We can then estimate p using either of two sample statistics:
ordinary sample proportion (p-hat): p̂ = y/n, or
Wilson-adjusted proportion (p-tilde): p̃ = (y+2)/(n+4), which gives CIs more reliable than those based on p-hat.
We will only use p-tilde in our computations.
Sampling Distribution of p̃
Ex1. Suppose a certain population has 39% mutants, so p = population proportion of mutants = 0.39.
If we take a random sample of 6 individuals from that population, what is the sampling distribution of p̃?
Let Y = number of mutants in our sample. Y has a binomial distribution with n=6 and p=0.39.
The possible values of Y are 0 through 6; the probability of each value can be computed using the binomial
model. The values are displayed in the table below:
Y    probability                    p̃
0    binompdf(6, 0.39, 0)=0.0515    (0+2)/(6+4)=0.2
1    binompdf(6, 0.39, 1)=0.1976    (1+2)/(6+4)=0.3
2    binompdf(6, 0.39, 2)=0.3159    (2+2)/(6+4)=0.4
3    binompdf(6, 0.39, 3)=0.2693    (3+2)/(6+4)=0.5
4    binompdf(6, 0.39, 4)=0.1291    (4+2)/(6+4)=0.6
5    binompdf(6, 0.39, 5)=0.0330    (5+2)/(6+4)=0.7
6    binompdf(6, 0.39, 6)=0.0035    (6+2)/(6+4)=0.8
We can use the above probability distribution to assess the following probabilities:
a) Probability that p̃ will estimate p within 4 percentage points:
P(.39−.04 ≤ p̃ ≤ .39+.04) = P(.35 ≤ p̃ ≤ .43) = P(p̃ = .4) = .3159
b) Probability that p̃ will overestimate p by more than 5 percentage points:
P(p̃ ≥ .39+.05) = P(p̃ ≥ .44) = .2693+.1291+.0330+.0035 = .4349
c) What is the % of samples for which p̃ will overestimate p by more than 5 percentage points?
The answer is 43.49%, the same as in part b).
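As a cross-check on the table and on parts a) and b), the sampling distribution of p̃ can be rebuilt with a few lines of Python. This is only a sketch: the function names binom_pmf and p_tilde_distribution are made up for illustration, and binom_pmf plays the role of the calculator's binompdf.

```python
from math import comb

def binom_pmf(n, p, k):
    """Binomial probability P(Y = k), the same quantity as binompdf(n, p, k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_tilde_distribution(n, p):
    """Map each possible value of p-tilde = (y+2)/(n+4) to its probability."""
    return {(y + 2) / (n + 4): binom_pmf(n, p, y) for y in range(n + 1)}

dist = p_tilde_distribution(6, 0.39)

# Small tolerance so that floating-point rounding at the boundaries does not matter.
EPS = 1e-9
within_4 = sum(pr for v, pr in dist.items() if 0.35 - EPS <= v <= 0.43 + EPS)
over_5 = sum(pr for v, pr in dist.items() if v >= 0.44 - EPS)

for value, prob in dist.items():
    print(f"p-tilde = {value:.1f}   probability = {prob:.4f}")
print(f"P(.35 <= p-tilde <= .43) = {within_4:.4f}")
print(f"P(p-tilde >= .44)        = {over_5:.4f}")
```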
As n increases, the sampling distribution of p̃ becomes more concentrated around the value
p=0.39, so the probability that p̃ is within ±4 percentage points of p becomes greater, and
overestimating p by 5 or more percentage points using p̃ becomes less likely.
For large n, the sampling distribution of p̃ is approximately normal, with mean p and standard deviation
√( p(1−p) / (n+4) ).
We will use that fact in constructing a CI for p. The approximation gets better with increasing n.
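The effect of sample size can be illustrated by computing, for a few values of n, the exact probability that p̃ lands within 4 percentage points of p=0.39. The sketch below uses the binomial model directly; the function name prob_within and the sample sizes tried are arbitrary choices for illustration.

```python
from math import comb

def prob_within(n, p, margin):
    """Exact P(|p-tilde - p| <= margin) when Y ~ Binomial(n, p)."""
    eps = 1e-9  # guards against floating-point rounding at the boundaries
    total = 0.0
    for y in range(n + 1):
        p_tilde = (y + 2) / (n + 4)
        if abs(p_tilde - p) <= margin + eps:
            total += comb(n, y) * p**y * (1 - p)**(n - y)
    return total

for n in (6, 50, 400):
    print(n, round(prob_within(n, 0.39, 0.04), 4))
```

For n=6 this reproduces the .3159 from part a) of Ex1, and for n=400 the probability is much larger.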
95% Confidence Interval for p = unknown population proportion.
Standard error of p̃: SE_p̃ = √( p̃(1−p̃) / (n+4) )
95% CI for p: p̃ ± 1.96 SE_p̃
We can use this CI if the sample size is at least 5.
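The 95% interval can be computed mechanically from y and n. The following is a minimal sketch of the formulas above; the function name wilson_adjusted_ci95 is hypothetical, and the inputs 23 and 180 are the counts from Ex2 in these notes.

```python
from math import sqrt

def wilson_adjusted_ci95(y, n):
    """Return (p_tilde, se, (low, high)) for the 95% CI based on p-tilde."""
    p_tilde = (y + 2) / (n + 4)
    se = sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    margin = 1.96 * se
    return p_tilde, se, (p_tilde - margin, p_tilde + margin)

p_tilde, se, (low, high) = wilson_adjusted_ci95(23, 180)
print(f"p-tilde = {p_tilde:.4f}, SE = {se:.4f}, CI = ({low:.4f}, {high:.4f})")
```

Small differences in the last digit compared with hand computation come from rounding intermediate values.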
__________________________________________________________________________________
Optional:
For other confidence levels: p̃ = (y + 0.5·z_{α/2}²) / (n + z_{α/2}²), SE_p̃ = √( p̃(1−p̃) / (n + z_{α/2}²) ),
and the (1−α)·100% CI for p is: p̃ ± z_{α/2} SE_p̃

__________________________________________________________________________________
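The optional formulas for other confidence levels can be sketched in Python. The function name wilson_adjusted_ci is hypothetical; the critical value z_{α/2} is taken from the standard library's statistics.NormalDist, and the counts 23 and 180 are borrowed from Ex2 only as sample input.

```python
from math import sqrt
from statistics import NormalDist

def wilson_adjusted_ci(y, n, conf_level=0.95):
    """(1 - alpha)*100% CI for p using the general p-tilde formulas."""
    alpha = 1 - conf_level
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    p_tilde = (y + 0.5 * z**2) / (n + z**2)
    se = sqrt(p_tilde * (1 - p_tilde) / (n + z**2))
    return p_tilde - z * se, p_tilde + z * se

for level in (0.90, 0.95, 0.99):
    low, high = wilson_adjusted_ci(23, 180, level)
    print(f"{level:.0%} CI: ({low:.4f}, {high:.4f})")
```

At the 95% level z_{α/2}² is about 3.84, so the result is close to, but not identical with, the simpler "add 2 and 4" version used in the rest of the notes.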

Sample size considerations for desired Standard Error:
Selecting sample size: n = (Guessed p̃)(1 − Guessed p̃) / (Desired SE)² − 4, rounded up to the next integer.
If no suitable guess is available for p̃, use 50% (0.5).
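The sample-size rule is a one-line computation. In this sketch the function name sample_size_for_se is hypothetical, "rounded to the next integer" is read as rounding up (math.ceil), and the inputs are the ones used in part b of Ex2.

```python
from math import ceil

def sample_size_for_se(desired_se, guessed_p=0.5):
    """Sample size n giving approximately the desired standard error of p-tilde."""
    return ceil(guessed_p * (1 - guessed_p) / desired_se**2 - 4)

# Halving the standard error .0253 from Ex2, with a guess of 0.14 for p-tilde.
print(sample_size_for_se(0.01265, guessed_p=0.14))
```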
Ex2. Gene mutations have been found in patients with MD. In one study of patients with MD, 23 out of
180 patients had a certain defect in the gene coding.
a. Construct and interpret a 95% CI for the true proportion of all patients with MD who have that defect.
Answer: p̃ = 25/184 = .1359
SE_p̃ = √( .1359(1−.1359) / 184 ) = .0253
.1359 ± 1.96(.0253) gives .1359 ± .0496, CI: (.0863, .1855)
We have 95% confidence that the true proportion of MD patients with that gene mutation is in the above
interval.
b. Compute the sample size needed for the standard error to be cut in half; assume a reasonable guess for p̃
is 0.14.
Answer: SE = .0253, .5(SE) = .01265, n = .14(1−.14) / (.01265)² − 4 = 748.39, so n = 749
Inference for proportions: Goodness-of-Fit Test:
In this part of the chapter we will consider one categorical variable with k categories, not necessarily
dichotomous. The distribution of that variable in a random sample is compared to a specified fixed
distribution, and the hypotheses are:
H0: The variable has the specified distribution
(probability p_i of each category i is specified)
Ha: The variable does not have the specified distribution
(probability p_i in some or all categories is not as specified)
O = observed counts: the counts of sampled observations in each category
E = expected counts: E_i = n·p_i
We assume that all expected counts E are greater than 5 (and therefore at least 1).
Test statistic: χ²_s = Σ (O−E)²/E has a Chi-square distribution with k-1 degrees of freedom (under the
null hypothesis), where k = number of categories.
P-Value: To obtain a P-value (P) of a hypothesis test, we compute, assuming the null hypothesis is
true, the probability of observing a value of the test statistic as extreme or more extreme than
that observed. By extreme we mean far from what we would expect to observe if the null
hypothesis were true.
P-value = area to the right of the observed test statistic under the Chi-square curve with df=k-1
Note: Our alternative is nondirectional, but in the case of a dichotomous variable we
can also have a directional hypothesis. We can test a hypothesis specifying that the
probability in one category is smaller/larger than in the other. We first have to check
that the data deviate from H0 in the direction stated in Ha. In that case the p-value
computed above is divided by 2. See Ex2 below.
Ex1. The offspring produced by a cross between two given types of plants can be any one of three
genotypes A, B or C. A simple inheritance model suggests that the offspring of types A,B and C should
be in a ratio 1:2:1 respectively. An experiment was conducted in which 100 plants were bred by
crossing the two parent types. The genetic classification of offspring are recorded below. Do these data
support the hypothesis that the offspring follow the predicted ratio? Test using α=.05.
Genotype:                 A     B     C
O=observed frequency:     18    55    27
This is a GOF test; variable = genotype of offspring, 3 classes.
Notice that if the predicted ratio is a:b:c, then
p1 = a/(a+b+c), p2 = b/(a+b+c) and p3 = c/(a+b+c)
H0: p1=1/4, p2=1/2, p3=1/4 (data follow the predicted ratio)
Ha: not all probabilities are as stated in the null hypothesis (data do not follow the predicted ratio)
E = 25, 50 and 25, χ²_s = (18−25)²/25 + (55−50)²/50 + (27−25)²/25 = 2.62, df=2
p-value = χ²cdf(2.62, 10^6, 2) = .27 > .05
Do not reject the null; the data are consistent with the hypothesis that offspring follow the predicted ratio.
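The goodness-of-fit computation for this genotype example can be reproduced without a calculator. The sketch below is one possible implementation, not the method used in class: gof_test and chi2_sf are hypothetical helper names, and chi2_sf computes the right-tail area with a closed-form recurrence that holds only for whole-number degrees of freedom.

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(x, df):
    """Area to the right of x under the chi-square curve (integer df >= 1)."""
    if x <= 0:
        return 1.0
    half = x / 2
    # Closed forms for df = 2 (even case) and df = 1 (odd case) ...
    if df % 2 == 0:
        area, nu = exp(-half), 2
    else:
        area, nu = erfc(sqrt(half)), 1
    # ... then step up two degrees of freedom at a time.
    while nu < df:
        area += half ** (nu / 2) * exp(-half) / gamma(nu / 2 + 1)
        nu += 2
    return area

def gof_test(observed, probs):
    """Return (chi-square statistic, df, nondirectional p-value)."""
    n = sum(observed)
    expected = [n * p for p in probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return stat, df, chi2_sf(stat, df)

ratio = [1, 2, 1]                           # predicted ratio a:b:c
probs = [r / sum(ratio) for r in ratio]     # p1, p2, p3
stat, df, p_value = gof_test([18, 55, 27], probs)
print(f"chi-square = {stat:.2f}, df = {df}, p-value = {p_value:.3f}")
```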
Ex2. People who harvest wild mushrooms sometimes accidentally eat the toxic ones. In reviewing 205
European cases of mushroom poisoning from 1971 through 1980, researchers found that 45 of the
victims had died. Does this present evidence that mortality has decreased since 1970, when it was
recorded to be 30%? Use a Chi-square test with an appropriate directional hypothesis and α=.05.
GOF test again. We have 1 variable: status after eating toxic mushrooms (Dead or Alive), and we
compare its distribution after 1970 to the fixed distribution P(dead)=0.3, P(alive)=0.7 as recorded in
1970.
Let p = proportion of dead since 1970; our hypotheses are then: H0: p=0.3 vs Ha: p<0.3.
We can have a directional hypothesis here, since there are only 2 classes, and 45/205=.22<0.3, so the
directionality is correct. We can, but do not have to, specify both probabilities. We have:
        Dead            Alive
O:      45              160
E:      .3(205)=61.5    .7(205)=143.5
χ²_s = (45−61.5)²/61.5 + (160−143.5)²/143.5 = 6.32
p-value = (1/2)·χ²cdf(6.32, 10^6, 1) = .5(.012) = .006 < .05
Reject H0; there is evidence that mortality has decreased since 1970.
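The directional version of the test for two classes can be sketched as below. The function name directional_gof is hypothetical. For df=1 the right-tail chi-square area equals erfc(√(x/2)), which is what the code uses. The notes only cover the case where the data deviate in the direction of Ha; returning 1 − P/2 in the opposite case is an assumption based on the usual one-sided convention.

```python
from math import erfc, sqrt

def directional_gof(count, n, p0, direction):
    """Two-class chi-square test of H0: p = p0 against Ha: p < p0 ("less")
    or Ha: p > p0 ("greater"). Returns (statistic, directional p-value)."""
    observed = [count, n - count]
    expected = [n * p0, n * (1 - p0)]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    nondirectional_p = erfc(sqrt(stat / 2))  # right-tail area, df = 1
    p_hat = count / n
    right_direction = p_hat < p0 if direction == "less" else p_hat > p0
    if right_direction:
        return stat, nondirectional_p / 2
    # Assumption (not from the notes): data deviate in the wrong direction.
    return stat, 1 - nondirectional_p / 2

stat, p_value = directional_gof(45, 205, 0.3, "less")
print(f"chi-square = {stat:.2f}, directional p-value = {p_value:.3f}")
```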
Ex3. In a study of spatial orientation of certain fish, 50 individuals were caught in various locations and
later tested in an artificial pool to see which direction they would choose when released. Use the following
data and a Chi-square test to test the null hypothesis that the directional choice of these fish is random. Use
α=.05.
Directional choice:      # of fish=O
Toward shore             18
Away from shore          12
Along shore (right)      13
Along shore (left)       7
GOF test,
H0: p1=p2=p3=p4=.25, i.e. directions are randomly selected (all equally likely)
Ha: not all p_i are as stated in the null hypothesis (selections are not random; some
directions are preferred over others)
E = np = .25(50) = 12.5 for each category
χ²_s = 4.88, df=3, p-value = χ²cdf(4.88, 10^6, 3) = 0.180, do not
reject H0; no evidence that choices are not random.
Ex4. In 2000, workplace accidents were distributed on workdays as follows:
Day    Monday    Tuesday    Wednesday    Thursday    Friday
%      25        15         15           15          30
In 2005, a random sample of 120 workplace accidents yielded the following data:
Day          Number of accidents=O    E=Expected number of accidents under H0
Monday       33                       .25(120)=30
Tuesday      20                       .15(120)=18
Wednesday    12                       .15(120)=18
Thursday     17                       .15(120)=18
Friday       38                       .3(120)=36
Do the data present sufficient evidence to indicate that the distribution of workplace accidents in 2005
differs from the 2000 distribution? Test the appropriate hypotheses by means of a Chi-square test and
α=.05.
H0: p1=.25, p2=p3=p4=.15, p5=.30, i.e. distributions are the same in both years
Ha: not all p_i are as stated in the null hypothesis; distributions are different
This is again a GOF test.
χ²_s = 2.69, df=4, p=.611, so do not reject H0; no evidence that the distributions are different.
Using Calculator (TI 83, 84)
1 Proportion Z interval:
use the STAT menu, then TESTS;
option A is 1-PropZInt
It uses the p-hat method; just input x and n.
If we want a 95% CI using p-tilde, we can input x+2 for x and n+4 for n;
for other confidence levels this will not work.
Chi-square GOF Test (only newer calculators):
1. Place observed and expected frequencies in 2 different lists (STAT EDIT
option).
2. Use χ²GOF-Test; make sure to set the appropriate degrees of freedom. The P-value computed
by the test is for a nondirectional alternative.
Alternatively, if you have an older TI:
STAT EDIT, input O in L1, E in L2, then compute the test statistic as follows:
(L1-L2)^2/L2, STO L3, 1-Var Stats L3, test statistic = ∑x
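For readers without a TI, the list-based procedure for older calculators translates almost line for line into Python. The sketch below keeps the calculator's list names L1, L2, L3 as variable names and uses the observed and expected counts from Ex4 as sample data.

```python
# L1 = observed counts, L2 = expected counts (Ex4: workplace accidents in 2005).
L1 = [33, 20, 12, 17, 38]
L2 = [30, 18, 18, 18, 36]

# (L1-L2)^2/L2 STO L3: one contribution per category.
L3 = [(o - e) ** 2 / e for o, e in zip(L1, L2)]

# The sum of L3 (the value 1-Var Stats reports as the sum of x) is the test statistic.
test_statistic = sum(L3)
print(f"chi-square = {test_statistic:.2f}")
```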