Categorical data 1: Single proportion and comparison of 2 proportions

Dr. Seyed Ebrahim Jabarifar
Associate Professor, Isfahan University of Medical Sciences, Department of Community Dentistry
Date: 2010 / 1388

The objectives of the session
• Sampling distribution of a single proportion
• Calculation of a 95% confidence interval for a proportion
• The comparison of two proportions (or percentages)
• Statistical test of significance for the comparison of two proportions
• Calculation of a 95% confidence interval for the difference in two proportions

Categorical data
• What is categorical data?
• Examples?

Examples of categorical data
• Education: primary, secondary, university
• Marital status: married, single, divorced, widowed
• Cigarette smoking history: never smoker, ex-smoker, current smoker

More examples of categorical data
Endpoint in a study:
• Person is dead or alive
• Person with MI or without MI
• Person can rate their own health as very good, good, average, bad or very bad

More examples of categorical data
Quantitative measurements or assessments can be used as categorical data:
• Hypertension: yes (for example, systolic BP ≥ 160 or diastolic BP ≥ 90 mm Hg) or no
• Alcohol consumption: none, light (< 200 ml of ethanol/week), heavy (≥ 200 ml of ethanol/week)

Proportions and percentages
In this session, we will concentrate on the use of binomial data (data with just two categories).
Example: in a survey, interviews were conducted with 5335 middle-aged women. Of these, 1476 were current smokers while 3859 were not.
Proportion of smokers = 1476 / 5335 = 0.277
Percentage of smokers = 0.277 × 100 = 27.7%

Sampling variability of a proportion
• It is important to take into account the number of subjects included
• The greater the number of subjects, the more reliable our estimates are
• Example: if we want to estimate the proportion of men in a population who smoke cigarettes, a study of 1000 men will be more trustworthy than a study of 10 men

Important assumption
We need to know that the sample of individuals studied has been randomly selected from some population of interest.

Sampling distribution of a single proportion
Let's continue with the example of middle-aged women. Among 5335 women, there were 1476 smokers. If we want to say something about the population which this study sample represents, we need the concept of a sampling distribution.
• Let's assume that we repeatedly took a sample of 5335 women
• For each sample, we calculate the proportion of smokers and then construct a histogram of these values
• This histogram represents the sampling distribution of the proportion

[Figure: histogram of the sample proportions]

The curve is centred over the value of the proportion of smokers in the population, often referred to as the true proportion and represented by π. Some of the sample proportions will be larger than π, others will be smaller. Many will be close in value to π; a few will be a lot larger or a lot smaller.

In practice we only conduct one survey, from which we have a sample proportion represented by P. Is P close to π, or is it very different from π? Only if we are very lucky will P actually be equal to π. In any random sample, there will be some sampling variation in P. The larger the sample, the smaller the extent of such sampling variation.

Consider $(P - \pi)^2$ as a measure of the variation in P from the true proportion π.
Then it can be shown mathematically that if you took lots of random samples, each of n subjects, the average value of $(P - \pi)^2$ is equal to $\pi(1 - \pi)/n$.

Variance and standard error of a proportion
• $\pi(1 - \pi)/n$ is the variance of a proportion
• $\sqrt{\pi(1 - \pi)/n}$ is the standard error (SE) of a proportion
• It is a measure of the average extent of error in P, i.e. how far we can expect the observed proportion to differ from π on average

Example:
• π = 0.4, n = 100: SE = 0.049
• π = 0.4, n = 1000: SE = 0.0155 (SE smaller)
• SE does not depend much on π: with n = 1000 and π = 0.5, SE = 0.0158

Back to the example: 5335 women, 1476 current smokers
This means that 27.7% of the women are smokers. The estimated standard error of the proportion of smokers is
$\sqrt{0.277 \times (1 - 0.277) / 5335} = 0.0061$
We can also use percentages:
$\sqrt{27.7 \times (100 - 27.7) / 5335} = 0.61\%$

95% confidence limits for a proportion
• We want to get an interval of possible values within which the true population proportion might lie
• This can be done using the theoretical properties of the Normal distribution
• It can be shown that P will be within 1.96 standard errors of π with probability 0.95
• That is, there is just a 2.5% risk that the observed proportion will exceed the true population proportion by more than 1.96 standard errors, and another 2.5% risk that P will underestimate π by more than 1.96 standard errors

95% confidence limits for a proportion
We use this fact to define a 95% confidence interval:
$P - 1.96 \times \sqrt{P(1 - P)/n}$ to $P + 1.96 \times \sqrt{P(1 - P)/n}$
Usually written as P ± 1.96 × standard error of P.

Back to the example
The true population percentage of smokers has the following 95% confidence interval:
$27.7\% \pm 1.96 \times \sqrt{27.7 \times 72.3 / 5335}$
This means that the 95% confidence interval is from 26.5% to 28.9%. These two values are the lower and upper confidence limits, respectively.

95% confidence interval
95% confidence intervals are the most common statistical technique for displaying the degree of uncertainty that should be attached to any proportion.
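The standard error and 95% confidence limits described above can be checked with a few lines of Python. This is only a minimal sketch using the Normal approximation from the slides; the function name `proportion_ci` and its parameters are illustrative and do not come from the lecture.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Hypothetical helper: proportion, standard error and confidence limits.

    Uses the Normal approximation P +/- z * sqrt(P(1 - P) / n).
    """
    p = successes / n                   # observed proportion P
    se = math.sqrt(p * (1 - p) / n)     # estimated standard error of P
    return p, se, p - z * se, p + z * se

# Survey example from the slides: 1476 current smokers among 5335 women.
p, se, lower, upper = proportion_ci(1476, 5335)
print(f"P = {p:.3f}, SE = {se:.4f}, 95% CI = {lower:.3f} to {upper:.3f}")
```

For the survey data this gives a proportion of about 0.277, a standard error of about 0.0061 and limits of about 0.265 and 0.289, in line with the hand calculation.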
There is a 5% risk that the true population proportion lies outside the interval. That is, you can anticipate that one in every 20 confidence intervals you calculate will not include the true proportion.

Two proportions
Example

Smoking    Men            Women          Total
Yes        566 (45.1%)    313 (23.8%)    879 (34.2%)
No         690            1001           1691
Total      1256           1314           2570

Question
From the table, we want to evaluate how strong the evidence is that men smoke more than women.

The null hypothesis
• We need to define the null hypothesis
• In our case, the null hypothesis is that smoking is as frequent among women as it is among men (same proportion of smokers among men and women)
• If the null hypothesis were true, then the whole population would have an identical percentage of smokers
• Alternatively, one can say that if the null hypothesis were true, then for any randomly selected person (man or woman) the probability of being a smoker is the same, independent of the sex of the person selected

Significance testing for comparing 2 proportions
After defining the null hypothesis, the main question is: if the null hypothesis is true, what are the chances of getting as big a difference between the two percentages as that observed?
For example, in the Czech study, what is the probability of getting a sex difference in smoking as large as (or larger than) 45% versus 24%?

Observed difference in percentages = P1 − P2 = 45.1% − 23.8% = 21.3%
The overall percentage response = 879 / 2570 = 34.2%

If the null hypothesis is true, then the only reason that P1 − P2 differs from 0 is sampling variation. Under the null hypothesis we are assuming that the two samples of size n1 = 1256 and n2 = 1314 are random samples of people with equal true probabilities of response.

We need to calculate the standard error of the difference in two percentages, where P is the overall percentage:
$\sqrt{P(100 - P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} = \sqrt{34.2 \times 65.8 \times \left(\frac{1}{1256} + \frac{1}{1314}\right)} = 1.9\%$

Now we compare the observed difference with the standard error of the difference, simply by dividing one by the other.
Thus, we compute
Z = Observed difference in percentages / Standard error of difference = 21.3 / 1.9 = 11.2

• How large does Z have to be in order for us to assert that we have strong evidence that the null hypothesis is untrue?
• We need to make use of the fact that the difference between two observed proportions has approximately a Normal distribution, since this enables us to convert any value of Z into a probability P (as we have already learnt in previous sessions)

Z exceeds (in absolute value)    With probability
0.674                            0.5
1.282                            0.2
1.645                            0.1
1.960                            0.05
2.576                            0.01
3.291                            0.001

In our example, Z = 11.2 and so the probability P is (substantially) less than 0.001. That means that if the proportion of smokers is the same among men and women, the chance of getting such a big percentage difference in our study is less than 0.001.
We therefore have strong evidence that the proportion of smokers in men and women in the defined population is different (and is lower in women). We may also say that the difference between the percentages is statistically significant at the 0.1% level.

Exercise: we want to know whether smoking depends on marital status

Smoking    Married        Unmarried      Total
Yes        732 (34.1%)    147 (34.9%)    879 (34.2%)
No         1417           274            1691
Total      2149           421            2570

The observed difference in percentages is:
The standard error of the difference (using the formula given above) is:
Z =
P =

Z = d / SE = 0.8 / 2.53 = 0.32

95% confidence interval for a difference in two percentages
While giving the actual P-value is useful, we also need to give attention to estimating the magnitude of the difference, and to expressing the uncertainty in such an estimate by using a confidence interval.
The 95% confidence interval for the difference between two percentages is
Observed difference ± 1.96 × Standard error of difference
In the calculation of the confidence interval, the formula for the standard error of the difference does not assume the null hypothesis of the two proportions being equal. A slightly different formula is used for the standard error.
SE (difference in proportions) = $\sqrt{\frac{P_1(1 - P_1)}{n_1} + \frac{P_2(1 - P_2)}{n_2}}$

In our study, for the smoking difference between men and women, the 95% confidence interval is
$(45.1\% - 23.8\%) \pm 1.96 \times \sqrt{\frac{45.1 \times 54.9}{1256} + \frac{23.8 \times 76.2}{1314}}$ = 17.7% to 24.9%

Exercise
Calculate the 95% confidence interval for the difference in the percentage of smokers among married and unmarried individuals.
SE = 2.54
CI = 0.8 ± 1.96 × 2.54 = −4.2% to 5.8%

Note that if such a 95% confidence interval for a difference includes the value 0.0 (i.e. one limit is positive and the other is negative), then P is greater than 0.05. Conversely, if the 95% confidence interval does not include 0.0, then P is less than 0.05.
This illustrates that there is a close link between significance testing and confidence intervals.
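The significance test and the confidence interval for a difference in two proportions can be sketched together in Python. This is an illustrative sketch only, assuming the Normal approximation used in the slides; the function name `compare_proportions` is hypothetical. It uses the pooled standard error for the test and the unpooled standard error for the interval, and converts Z to a two-sided probability with `math.erfc`.

```python
import math

def compare_proportions(x1, n1, x2, n2, z_crit=1.96):
    """Hypothetical helper: Z test and 95% CI for a difference in proportions."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Under the null hypothesis both groups share the overall proportion.
    pooled = (x1 + x2) / (n1 + n2)
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_null
    # Two-sided probability that a standard Normal exceeds |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # The confidence interval does not assume equal proportions.
    se_ci = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, z, p_value, diff - z_crit * se_ci, diff + z_crit * se_ci

# Men versus women (566 of 1256 and 313 of 1314 smokers).
d_sex, z_sex, p_sex, lo_sex, hi_sex = compare_proportions(566, 1256, 313, 1314)
print(f"sex: diff={d_sex:.3f}, Z={z_sex:.1f}, P={p_sex:.2g}, "
      f"CI={lo_sex:.3f} to {hi_sex:.3f}")

# Unmarried versus married (147 of 421 and 732 of 2149 smokers).
d_mar, z_mar, p_mar, lo_mar, hi_mar = compare_proportions(147, 421, 732, 2149)
print(f"marital status: diff={d_mar:.3f}, Z={z_mar:.2f}, P={p_mar:.2f}, "
      f"CI={lo_mar:.3f} to {hi_mar:.3f}")
```

Because the code works with unrounded proportions, its results differ slightly from the hand calculations, which use rounded percentages: for example, Z for the sex difference comes out at roughly 11.3 rather than 11.2. The conclusions are unchanged. The interval for the sex difference excludes 0 and P is far below 0.001, while the interval for marital status includes 0 and P is greater than 0.05.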