Categorical data 1
Single proportion and comparison of 2 proportions
Dr. Seyed Ebrahim Jabarifar
Date: 2010 / 1388
Associate Professor, Department of Community Dentistry, Isfahan University of Medical Sciences
The objectives of the session
Sampling distribution of a single proportion
Calculation of a 95% confidence interval for a proportion
The comparison of two proportions (or percentages)
Statistical test of significance for the comparison of two proportions
Calculation of a 95% confidence interval for the difference in two proportions
Categorical data
•What is categorical data?
•Examples?
Examples of categorical data
Education: primary, secondary, university
Marital status: married, single, divorced, widowed
Cigarette smoking history: never smoker, ex-smoker, current smoker
More examples of categorical data
Endpoint in a study
Person is dead or alive
Person with MI or without MI
Person can rate their own health as very good, good,
average, bad or very bad
More examples of categorical data
Quantitative measurements or assessments can be used as categorical data:
Hypertension: yes (for example, systolic BP ≥ 160 or diastolic BP ≥ 90 mm Hg) or no
Alcohol consumption: none, light (< 200 ml of ethanol/week), heavy (≥ 200 ml of ethanol/week)
Proportions and percentages
In this session, we will concentrate on the use of binomial data (= data with just two categories).
Example: in a survey, interviews were conducted with 5335 middle-aged women. Of these, 1476 were current smokers while 3859 were not.
Proportion of smokers = 1476 / 5335 = 0.277
Percentage of smokers = 0.277 × 100 = 27.7%
Sampling variability of a proportion
•It is important to take into account the number of subjects included.
•The greater the number of subjects, the more reliable our estimates are.
•Example: if we want to estimate the proportion of men in a population who smoke cigarettes, a study of 1000 men will be more trustworthy than a study of 10 men.
Important assumption
We need to know that the sample of
individuals studied has been randomly
selected from some population of interest
Sampling distribution of single proportion
Let's continue with the example of middle-aged women.
Among 5335 women, there were 1476 smokers.
If we want to say something about the population which this study sample represents, we need the concept of a sampling distribution.
•Let's assume that we repeatedly took a sample of 5335 women.
•For each sample, we calculate the proportion of smokers and then construct a histogram of these values.
•This histogram represents the sampling distribution of the proportion and will take the following shape.
[Figure not reproduced: histogram of the sample proportions with a fitted curve]
The curve is centred over the value of the proportion of smokers in the population, often referred to as the true proportion and represented by π. Some of the sample proportions will be larger than π, others will be smaller. Many will be close in value to π; a few will be a lot larger or a lot smaller.
In practice we only conduct one survey, from which we have a sample proportion represented by P.
Is P close to π, or is it very different from π?
Only if we are very lucky will P actually be equal to π.
In any random sample, there will be some sampling variation in P.
The larger the sample, the smaller the extent of such sampling variation.
Consider $(P - \pi)^2$ as a measure of variation in P from the true proportion π.
Then it can be shown mathematically that if you took lots of random samples, each of n subjects, then the average value of $(P - \pi)^2$ is equal to
$$\frac{\pi(1 - \pi)}{n}$$
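The repeated-sampling idea can be tried out on a computer. The sketch below is only an illustration under assumed values: it treats 0.277 as if it were the true proportion, uses a hypothetical sample size of 500 (smaller than the survey, to keep it quick) and draws 2000 simulated surveys. The function name `simulate_sample_proportions` is made up for this example.

```python
import random


def simulate_sample_proportions(pi, n, n_samples, seed=0):
    """Draw n_samples random samples of size n; return each sample proportion."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    proportions = []
    for _ in range(n_samples):
        # Each subject is a "smoker" with probability pi
        successes = sum(1 for _ in range(n) if rng.random() < pi)
        proportions.append(successes / n)
    return proportions


pi = 0.277  # assumed "true" proportion (illustrative only)
n = 500     # hypothetical sample size per simulated survey
props = simulate_sample_proportions(pi, n, n_samples=2000)

# Average squared deviation of the sample proportions from pi
mean_sq_dev = sum((p - pi) ** 2 for p in props) / len(props)
theoretical = pi * (1 - pi) / n

print(f"average (P - pi)^2 : {mean_sq_dev:.6f}")
print(f"pi(1 - pi)/n       : {theoretical:.6f}")
```

The two printed values should be close to each other, and they get closer as the number of simulated surveys increases.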
Variance and standard error of proportion
 (1   )
is the vaiance of a proportion
n
  (1   )
n
is the standard error of a proportion
It is a measure of the average extent of error in P, that is, how far we can expect the observed proportion to differ from π on average.
Example:
π = 0.4:
n = 100, then SE = 0.049
n = 1000, then SE = 0.0155 (SE smaller)
SE does not depend much on π:
n = 1000, π = 0.5: SE = 0.0158
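The three standard errors in this example can be checked with a few lines of Python. This is a minimal sketch; the helper name `standard_error` is chosen here for illustration.

```python
import math


def standard_error(pi, n):
    """Standard error of a proportion: sqrt(pi * (1 - pi) / n)."""
    return math.sqrt(pi * (1 - pi) / n)


# The three (pi, n) combinations used in the example
for pi, n in [(0.4, 100), (0.4, 1000), (0.5, 1000)]:
    print(f"pi = {pi}, n = {n}: SE = {standard_error(pi, n):.4f}")
```

Increasing n from 100 to 1000 shrinks the standard error by a factor of about 3.2 (the square root of 10), whereas moving π from 0.4 to 0.5 hardly changes it.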
Back to the example: 5335 women, 1476 current smokers
It means that 27.7% of the women are smokers.
The estimated standard error of the proportion of smokers is
$$\sqrt{\frac{0.277 \times (1 - 0.277)}{5335}} = 0.0061$$
We can also use percentages:
$$\sqrt{\frac{27.7 \times (100 - 27.7)}{5335}} = 0.61\%$$
95% confidence limits for a proportion
We want to get an interval of possible values within which the true population proportion might lie.
This can be done using the theoretical properties of the Normal distribution.
It can be shown that P will be within 1.96 standard errors of π with probability 0.95.
That is, there is just a 2.5% risk that the observed proportion will exceed the true population proportion π by more than 1.96 standard errors, and another 2.5% risk that P will underestimate π by more than 1.96 standard errors.
95% Confidence limits for a proportion
We use this fact to define a 95% confidence interval, from
$$P - 1.96 \times \sqrt{\frac{P(1 - P)}{n}} \quad \text{to} \quad P + 1.96 \times \sqrt{\frac{P(1 - P)}{n}}$$
Usually written as P ± 1.96 × standard error of P
Back to example
The true population percentage of smokers has the following 95% confidence interval:
$$27.7\% \pm 1.96 \times \sqrt{\frac{27.7 \times 72.3}{5335}}$$
This means that the 95% confidence interval is from 26.5% to 28.9%.
These two values are the lower and upper confidence limits, respectively.
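The same interval can be reproduced from the raw counts. The sketch below assumes nothing beyond the formula P ± 1.96 × SE; the function name `proportion_ci` is hypothetical.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Return (P, SE, lower limit, upper limit) for a single proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, p - z * se, p + z * se


# Survey of middle-aged women: 1476 current smokers out of 5335
p, se, lower, upper = proportion_ci(1476, 5335)
print(f"P = {p:.3f}, SE = {se:.4f}")
print(f"95% CI: {100 * lower:.1f}% to {100 * upper:.1f}%")
```

Run on the survey counts, this gives the limits 26.5% and 28.9% quoted above.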
95% confidence interval
95% confidence intervals = the most common statistical technique for displaying the degree of uncertainty that should be attached to any proportion.
There is a 5% risk that the true population proportion lies outside the interval.
That is, you can anticipate that one in every 20 confidence intervals you calculate will not include π.
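The "one in every 20" statement can be illustrated by simulation. The following sketch uses made-up settings (a true proportion of 0.3, samples of 400 subjects, 2000 simulated surveys) and counts how often the interval P ± 1.96 × SE contains the true proportion. The helper `ci_covers` is a name invented for this example.

```python
import math
import random


def ci_covers(pi, n, rng, z=1.96):
    """Simulate one survey of n subjects; report whether its 95% CI contains pi."""
    successes = sum(1 for _ in range(n) if rng.random() < pi)
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se <= pi <= p + z * se


rng = random.Random(1)            # fixed seed so the run is repeatable
pi, n, n_surveys = 0.3, 400, 2000  # illustrative values, not from the study
covered = sum(ci_covers(pi, n, rng) for _ in range(n_surveys))
coverage = covered / n_surveys
print(f"{coverage:.1%} of the intervals include the true proportion")
```

The printed percentage should be close to 95%. It will not be exactly 95%, because the simulation itself is subject to sampling variation and the Normal approximation is not exact.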
Two proportions
Example
Smoking   Men            Women          Total
Yes       566 (45.1%)    313 (23.8%)    879 (34.2%)
No        690            1001           1691
Total     1256           1314           2570
Question
From the table, we want to evaluate how strong is
the evidence that men smoke more than women
The null hypothesis
We need to define the null hypothesis.
In our case, the null hypothesis is that smoking is as frequent among women as it is among men (same proportion of smokers among men and women).
If the null hypothesis were true, then men and women in the whole population would have the same percentage (%) of smokers.
Alternatively, one can say that if the null hypothesis were true, then for any randomly selected person (man or woman) the probability of being a smoker is the same, independent of the sex of the person selected.
Significance testing for comparing 2 proportions
After defining the null hypothesis, the main question is:
If the null hypothesis is true, what are the chances of getting a difference between the two percentages as large as that observed?
For example, in the Czech study, what is the probability of getting a sex difference in smoking as large as (or larger than) 45% versus 24%?
Observed difference in percentages
= P1 − P2
= 45.1% − 23.8% = 21.3%
The overall percentage response = 879/2570 = 34.2%
If the null hypothesis is true, then the only reason that P1 − P2 differs from 0 is sampling variation.
Under the null hypothesis we are assuming that the two samples of size n1 = 1256 and n2 = 1314 are random samples of people with equal true probabilities of response.
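We need to calculate the standard error of the difference in two percentages, where P is the overall percentage:
$$\sqrt{P \times (100 - P) \times \left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{34.2 \times 65.8 \times \left(\frac{1}{1256} + \frac{1}{1314}\right)} = 1.9\%$$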
Now, we compare the observed difference with the standard error of the difference, simply by dividing one by the other.
Thus, we compute
$$Z = \frac{\text{Observed difference in percentages}}{\text{Standard Error of difference}} = \frac{21.3}{1.9} = 11.2$$
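The calculation of Z can be sketched in Python directly from the counts in the table. The function name `two_proportion_z` is invented for this illustration; it works in percentages and uses the pooled (overall) percentage in the standard error, as the null hypothesis requires.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Return (difference, SE, Z) in percentage points, using the pooled SE."""
    p1 = 100 * x1 / n1
    p2 = 100 * x2 / n2
    pooled = 100 * (x1 + x2) / (n1 + n2)  # overall percentage response
    se = math.sqrt(pooled * (100 - pooled) * (1 / n1 + 1 / n2))
    diff = p1 - p2
    return diff, se, diff / se


# Men: 566 smokers of 1256; women: 313 smokers of 1314
diff, se, z = two_proportion_z(566, 1256, 313, 1314)
print(f"difference = {diff:.1f}%, SE = {se:.1f}%, Z = {z:.1f}")
```

Working from unrounded values gives a Z of about 11.3 rather than 11.2. The small discrepancy arises because the hand calculation divides the rounded figures 21.3 and 1.9. The conclusion is unaffected.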
•How large does Z have to be in order for us to assert that we have strong evidence that the null hypothesis is untrue?
•We need to make use of the fact that the difference between two observed proportions has approximately a Normal distribution, since this enables us to convert any value of Z into a probability P (as we have already learnt in previous sessions).
Z exceeds    With probability
0.674        0.5
1.282        0.2
1.645        0.1
1.960        0.05
2.576        0.01
3.291        0.001
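The probabilities in this table are two-sided: they give the chance that Z exceeds the stated value in either direction. Under that reading, they can be reproduced with the complementary error function from Python's standard library. The helper name `two_sided_p` is chosen here for illustration.

```python
import math


def two_sided_p(z):
    """Two-sided probability that a standard Normal value exceeds |z|."""
    return math.erfc(abs(z) / math.sqrt(2))


# Reproduce the tabulated cut-off values
for z in [0.674, 1.282, 1.645, 1.960, 2.576, 3.291]:
    print(f"|Z| > {z:.3f}: P = {two_sided_p(z):.3f}")

# The value obtained in the smoking example
print(f"Z = 11.2: P = {two_sided_p(11.2):.1e}")
```

For Z = 11.2 the probability is many orders of magnitude below 0.001.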
In our example, Z = 11.2 and so the probability P is (substantially) less than 0.001. That means, if the proportion of smokers is the same among men and women, the chance of getting such a big percentage difference in our study is less than 0.001.
We therefore have strong evidence that the proportion of smokers in men and women in the defined population is different (and is lower in women). We may also say the difference between the percentages is statistically significant at the 0.1% level.
Exercise: we want to know whether smoking depends on marital status
              Married        Unmarried      Total
Smoking yes   732 (34.1%)    147 (34.9%)    879 (34.2%)
Smoking no    1417           274            1691
Total         2149           421            2570
The observed difference in percentages is
The standard error of the difference (using the
formula given above) is
$$Z = \frac{d}{SE} = \frac{0.8}{2.53} = 0.32$$
P =
95% confidence interval for a difference in two percentages
While giving the actual P-value is useful, we also need to give attention to estimating the magnitude of the difference and to express the uncertainty in such an estimate by using a confidence interval.
The 95% confidence interval for the difference between two percentages is
Observed difference ± 1.96 × Standard Error of difference
In the calculation of the confidence interval, the formula for the standard error of the difference does not assume the null hypothesis of the two proportions being equal. A slightly different formula is used for the standard error.
$$SE(\text{difference in proportions}) = \sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}}$$
In our study, for the smoking difference between men and women, the 95% confidence interval is
$$(45.1\% - 23.8\%) \pm 1.96 \times \sqrt{\frac{45.1 \times 54.9}{1256} + \frac{23.8 \times 76.2}{1314}}$$
= 17.7% to 24.9%
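This interval can be recomputed with a short function. The sketch below takes the two percentages and sample sizes as inputs, as the hand calculation does, and uses the standard error that does not assume equal proportions. The name `difference_ci` is hypothetical.

```python
import math


def difference_ci(p1, n1, p2, n2, z=1.96):
    """95% CI for a difference in two percentages (unpooled standard error).

    p1 and p2 are percentages (0 to 100); n1 and n2 are the sample sizes.
    Returns (difference, SE, lower limit, upper limit).
    """
    se = math.sqrt(p1 * (100 - p1) / n1 + p2 * (100 - p2) / n2)
    diff = p1 - p2
    return diff, se, diff - z * se, diff + z * se


# Men: 45.1% of 1256; women: 23.8% of 1314
diff, se, lower, upper = difference_ci(45.1, 1256, 23.8, 1314)
print(f"difference = {diff:.1f}%, SE = {se:.2f}%")
print(f"95% CI: {lower:.1f}% to {upper:.1f}%")
```

Here the standard error is about 1.83%, slightly smaller than the 1.9% used for the significance test, because the two formulas differ.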
Exercise
Calculate the 95% confidence interval for the difference in the percentage of smokers among married and unmarried individuals
SE = 2.54
CI = 0.8 ± 1.96 × 2.54 = −4.2% to 5.8%
Note that if such a 95% confidence interval for a difference includes the value 0.0 (i.e. one limit is positive and the other is negative), then P is greater than 0.05.
Conversely, if the 95% confidence interval does not include 0.0, then P is less than 0.05.
This illustrates that there is a close link between
significance testing and confidence intervals.
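The link can be seen by computing both the P-value and the confidence interval for the two comparisons in this session. The sketch below is illustrative only: `compare_percentages` is a made-up name, the pooled percentage is taken as the sample-size-weighted average of the two percentages, and the P-value is two-sided.

```python
import math


def compare_percentages(p1, n1, p2, n2, z_crit=1.96):
    """Return (two-sided P, CI lower limit, CI upper limit) for p1 - p2.

    p1 and p2 are percentages; the test uses the pooled SE,
    the confidence interval uses the unpooled SE.
    """
    diff = p1 - p2
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se_test = math.sqrt(pooled * (100 - pooled) * (1 / n1 + 1 / n2))
    p_value = math.erfc(abs(diff / se_test) / math.sqrt(2))
    se_ci = math.sqrt(p1 * (100 - p1) / n1 + p2 * (100 - p2) / n2)
    return p_value, diff - z_crit * se_ci, diff + z_crit * se_ci


studies = {
    "men vs women": (45.1, 1256, 23.8, 1314),
    "unmarried vs married": (34.9, 421, 34.1, 2149),
}
results = {}
for name, args in studies.items():
    p_value, lower, upper = compare_percentages(*args)
    results[name] = (p_value, lower, upper)
    includes_zero = lower <= 0 <= upper
    print(f"{name}: P = {p_value:.3g}, "
          f"95% CI = {lower:.1f}% to {upper:.1f}%, includes 0: {includes_zero}")
```

For men versus women the interval excludes 0 and P is far below 0.05. For unmarried versus married individuals the interval runs from a negative to a positive value and P is well above 0.05.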