More Statistics: measures of uncertainty and small number issues
Fiona Reid

Acknowledgements
• This presentation has been adapted from the original presentation provided by the following contributors:
• Mark Dancox
• Shelley Bradley
• Jacq Clarkson

Things to be covered
• Summarising data
  – Common measures
  – Correlation
• Common Distributions
• Measuring uncertainty
  – Standard error
  – Confidence Intervals
• Significance and p-values
• Small samples

Summary Statistics

Summarising data
• Impractical to look at every single piece of data so need to use summary measures
• Need to reduce a lot of information into compact measures
• Look at location and spread

How do you describe an elephant?
How big is it? LOCATION
How varied? SPREAD

Measures of Location
• Mean
  – Commonly used measure of location
  – Is the sum of values divided by the number of values
• The sample mean is given by: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}[x_1 + x_2 + \dots + x_n]$
• Can be drastically affected by unusual observations (called ‘outliers’) so it is not very robust.
• Excel function: AVERAGE()

Measures of Location
• Median: the “middle value”, for which 50% of the data lie above (or below)
  – For an odd number of observations, the median is the observation exactly in the middle of the ordered list
  – For an even number of observations, the median is the mean of the two middle observations in the ordered list
• Less sensitive to outliers and gives a ‘real’ value (unlike the mean) but does ‘throw away’ a lot of information in the sample
• Excel function: MEDIAN()
• Mode: the most frequently occurring value
• Sometimes too simplistic and not always unique

Measures of Spread
• Variance: the sum of the squared deviations of each sample value from the sample mean, divided by n − 1: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{X})^2$
• Excel function: VARA()
• Standard deviation: the square root of the sample variance
• Excel function: STDEVA()

Measures of Spread
• We can describe the spread of a distribution by using percentiles.
• The pth percentile of a distribution is the value such that approximately p percent of the values are equal to or less than that number.
• Excel function: PERCENTILE()
• Quartiles divide data into four equal parts.
  – First quartile (Q1), 25th percentile: 25% of observations are below Q1 and 75% above Q1
  – Second quartile (Q2), 50th percentile: 50% of observations are below Q2 and 50% above Q2
  – Third quartile (Q3), 75th percentile: 75% of observations are below Q3 and 25% above Q3

Measures of Spread
• Range: the difference between the largest and smallest values
• Can be misleading if the data set contains outliers.
• The interquartile range is the difference between the third and the first quartiles in a dataset (50% of the data lie in this range).
• The interquartile range is more robust to outliers.

[Figure: a normal curve marking the IQR from Q1 to Q3 (50% of the data), the interval $\bar{X} \pm 2s$ (95%) and the range (100%).]
If the data is approximately normal, we can use the mean $\bar{X}$ and the standard deviation s to find intervals within which a given percentage of the data lie.
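The measures of location and spread described above can also be computed outside Excel. The following is a minimal sketch using only Python's standard `statistics` module. The class sizes are hypothetical values invented for illustration (they are not the exercise data), the function name `summarise` is an assumption, and the quartiles use the module's default 'exclusive' method, which may differ slightly from Excel's PERCENTILE().

```python
import statistics

# Hypothetical class sizes (illustration only); 58 is an unusually large value.
class_sizes = [22, 25, 25, 27, 28, 30, 31, 33, 58]


def summarise(values):
    """Return common measures of location and spread for a sample."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # quartiles, 'exclusive' method
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),    # may contain more than one value
        "variance": statistics.variance(values),  # sample variance, divides by n - 1
        "sd": statistics.stdev(values),           # square root of the sample variance
        "range": max(values) - min(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }


summary = summarise(class_sizes)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With these values the single large class pulls the mean (31) above the median (28), which is the sensitivity to outliers noted above.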
Skewness
• Values in a distribution may not be spread evenly. This will affect symmetry.
• Skewness is a measure of asymmetry.
  – If skewness = 0 the distribution is symmetrical
  – If skewness > 0 the distribution has a longer tail of larger values (right skew)
  – If skewness < 0 the distribution has a longer tail of smaller values (left skew)

Some skewed distributions
[Figure: histogram of the rate of people killed or seriously injured in road traffic accidents, 2003-2005, all LAs. x-axis: rate per 100,000; y-axis: number of LAs.]

Some skewed distributions
[Figure: histogram of the prevalence of diabetes, 2005/06, all LAs. x-axis: prevalence of diabetes (percentage); y-axis: number of LAs.]

Some skewed distributions
[Figure: histogram of the under 18 conception rate, 2003-2005, all LAs. x-axis: under 18 conception rate; y-axis: number of LAs.]

[Figure: relative positions of the mean and median for (a) right-skewed, (b) symmetric, and (c) left-skewed distributions.]
Note: The mean is most informative when the data are roughly symmetric (for example, normally distributed). If this is not the case it is better to report the median as the measure of location.

Skewness
• The degree of skewness affects measures of location.
• If no skew
  – Mean = Median
• If skew > 0 (right or +ve skew)
  – Mean > Median
• If skew < 0 (left or -ve skew)
  – Mean < Median

Exercise 1
• Calculate some summary statistics for the class size data in sheet one of the exercises.
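The relationship between skewness and the relative positions of the mean and median can be checked numerically. This is a sketch under stated assumptions: the slides do not say which skewness formula is meant, so the moment-based definition below is one common choice, and both data sets are hypothetical.

```python
import statistics


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (one common definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data with a long right tail (illustration only).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
# Mirror image of the same data, giving a long left tail.
left_skewed = [-x for x in right_skewed]

for label, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(
        label,
        "skewness:", round(skewness(data), 2),
        "mean:", statistics.mean(data),
        "median:", statistics.median(data),
    )
```

For the right-skewed data the mean lies above the median, and for the mirrored data it lies below, matching the rules on the Skewness slide.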
Correlation
• Correlation is a measure of association between two continuous variables
• Correlation is best visualised graphically, plotting one variable against the other:
[Figure: three scatterplots of Y against X.]
  – Positive correlation: one variable increases with the other
  – Negative correlation: one variable increases as the other decreases
  – No correlation: Y neither increases nor decreases with X

Correlation coefficient
• Correlation coefficients measure the strength of (linear) association between continuous variables
• Pearson’s correlation coefficient r measures linear association, i.e. do the points lie on a straight line?
• If the points form a perfect straight line, then we have perfect correlation. The closer r is to 0, the weaker the correlation
  – r = 1: perfect positive correlation
  – r = -1: perfect negative correlation
  – r = 0: no (linear) correlation
$r = \frac{\sum xy - n\bar{x}\bar{y}}{(n-1)\, s_x s_y}$
where $s_x$ and $s_y$ denote standard deviations.
• Excel function: PEARSON()

Example
[Figure: six example scatterplots: (a) r = +1, (b) r = -1, (c) r = 0.3, (d) r = 0, (e) r = -0.5, (f) r = 0.7.]

Spearman’s rank correlation coefficient
• Measures association whenever one or both variables are on an ordinal scale.
• The association does not need to be linear: does one variable increase/decrease with the other?
  – Positive correlation: $0 < r \le 1$
  – Negative correlation: $-1 \le r < 0$
$r = 1 - \frac{6\sum d^2}{n^3 - n}$
where d is the difference between the rank orderings of the data.
• Not an inbuilt function in Excel

WARNING
Spurious correlations can arise from:
• Change of direction of association
• Subgroups
• Outliers

Exercise
• Produce scatterplots of life expectancy for both deprivation measures.
• Calculate the correlation coefficient for LE and the deprivation measures
• Which measure of deprivation shows the strongest association with life expectancy? Is this the same for both men and women?

Ecological Fallacy
• When correlations based on grouped data are incorrectly assumed to hold for individuals.
• E.g. investigating the relationship between food consumption and cancer risk.
• One way to begin such an investigation would be to look at data on the country level and construct a plot of overall cancer risk against per capita daily caloric intake.
• But it is people, not countries, who get cancer.
• It could be that within countries those who eat more are less likely to develop cancer. On the country level, per capita food intake may just be an indicator of overall wealth and industrialization. The ecological fallacy was in studying countries when one should have been studying people.

Distributions for Public Health Analysts

Types of distributions
• Normal distribution
  – Used for continuous measures such as height, weight, blood pressure
• Poisson distribution
  – Used for discrete counts of things: violent crimes, number of serious accidents, number of horse kicks
• Binomial distribution
  – Used to analyse data where the response is a discrete count of a category: success/failure, response/non-response

Normal Distribution
• Distribution of natural phenomena
• Continuous
• Family of distributions with same shape
• Area under curve is the same (=1)
• Symmetrical
• Defined by mean (μ) and standard deviation (σ)
• Widely assumed for statistical inference

Normal Distribution, changes in mean (μ)
[Figure: normal probability density curves with σ = 1 and μ = 0, μ = 0.7 and μ = 2.]
Keeping the standard deviation constant, changing the mean of a distribution moves the distribution to the left or right…

Normal Distribution, changes in standard deviation (σ)
[Figure: normal probability density curves with μ = 0 and σ = 0.3, σ = 0.6 and σ = 1.]
Keeping the mean constant but changing the standard deviation affects the ‘narrowness’ of the curve…

Distribution of values
[Figure: normal probability density curve. A proportion of 0.68 of values lies between μ − σ and μ + σ, with 0.16 in each tail.]

Distribution of values
[Figure: normal probability density curve. A proportion of 0.95 of values lies between μ − 2σ and μ + 2σ, with 0.025 in each tail.]

Poisson
• A discrete distribution taking on the values X = 0, 1, 2, 3, …
• Often used to model the number of events occurring in a defined period of time.
• Determined by a single parameter, lambda (λ), which is the mean of the process.
• Shape of distribution changes with the value of λ

Example of Poisson distribution
[Figure: Poisson distribution (mean = 5). x-axis: number of events, 0 to 15; y-axis: probability.]

Binomial Distribution
• Used to analyse discrete counts
• Consider when interested in a count expressed as a proportion of a total sample size.
  – “the proportion of brown-eyed persons in a building”
• Defined by the probability of an outcome (p) and the sample size (n)

Binomial Distribution
[Figure: binomial distribution, n = 10, p = 0.5. x-axis: number of successes, 0 to 10; y-axis: probability.]

Which distribution would best describe?
• Number of abortions by gestational age
• Percentage of patients with diabetes mellitus treated with ACE inhibitor therapy for Acute sickness
• Number of Adults on prescribed medication
• Proportion of Adults who are overweight
• Average weekly alcohol units consumed

Choice of distribution
• No hard and fast rules about which distribution should be used.
• If a sample size is big enough, choice of distribution may be less important
  – “everything tends to normality”
  – Normal distribution will be a good approximation to Poisson and Binomial distributions given a big enough sample.

Standard Error
• Summary statistics, such as the mean, are based on samples
• Different samples from the same population give rise to different values
• This variation is quantified by the standard error

An example…
68 102 51 46 69 35 114 171 130
Population mean = 87.33

An example… continued
Three samples drawn from the population above give different means:
• Sample 1 mean = 71.25
• Sample 2 mean = 101.25
• Sample 3 mean = 100

Standard Errors for some common distributions
• Normal distribution: $se = \frac{s}{\sqrt{n}}$, where s is the standard deviation and n the sample size
• Poisson distribution: $se = \sqrt{\text{mean}}$
• Binomial distribution: $se(p) = \sqrt{\frac{p(1-p)}{n}}$

Confidence Intervals
• How to calculate
• How to interpret

Confidence Intervals
• Summary statistics are point estimates based on samples
• Confidence Intervals quantify the degree of uncertainty in these estimates
• Quoted as a lower limit and an upper limit which provide a range of values within which the population value is likely to lie

Calculating Confidence Intervals
• General form of any 95% CI: Point Estimate ± 1.96 × (Estimated SE)
• For 99% CIs we use 2.58
• For 90% CIs we use 1.64

Interpretation
• A 95% Confidence Interval is a random interval such that, in repeated sampling, 95 out of every 100 intervals succeed in covering the parameter
• Loose interpretation
  – “95% chance true value inside interval”
[Figure: intervals from repeated samples plotted against the true value; in 5% of cases the interval does not cover it.]

Interpretation of confidence intervals
• Non-overlapping intervals are indicative of real differences
• Overlapping intervals need to be considered with caution
• Need to be careful about using confidence intervals as a means of testing.
• The smaller the sample size, the wider the confidence interval

Example
If the mean weight for a given sample of 43 men aged 55 is 81.4 kg and the standard deviation is 12.7 kg, then a 95% confidence interval is given by
81.4 − 1.96 × (stdev/√n), 81.4 + 1.96 × (stdev/√n)
Here stdev/√n = 12.7/√43 ≈ 1.94, so the interval evaluates to (77.6, 85.2) kg

Exercise 2
Using the CI calculator provided:
1. Calculate the 95% CI for the mean class size from exercise 1.
2. If 20% in a sample of 400 are smokers, calculate a 95% confidence interval around this proportion

Exercise 3, which areas are significantly higher than England?
[Figure: Life Expectancy (2002-04). Life expectancy at birth (years), male and female, for local authorities from Bristol to East Dorset, with the England male and England female values shown for comparison. Source: ONS 2004.]

Measuring uncertainty

Types of hypotheses
• Null Hypothesis (H0)
  – The hypothesis under consideration
  – “there is no difference between groups”
  – The accused is innocent
• Alternative Hypothesis (Ha)
  – The hypothesis we accept if we reject the null hypothesis
  – “there is a difference between groups”
  – Or the accused is guilty

Hypothesis Testing
• Inferences about a population are often based upon a sample.
• Want to be able to use sample statistics to draw conclusions about the underlying population values
• Hypothesis testing provides some criteria for reaching these conclusions

General principles of hypothesis testing
• Formulate null (H0) and alternative (Ha) hypotheses (simple or composite)
• Choose test statistic
• Decide on rule for choosing between the null and alternative hypotheses
• Calculate test statistic and compare against the decision rule.
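The general principles above can be sketched as a two-sided one-sample z-test. This is a minimal illustration only: the sample figures reuse the weight example from the confidence interval slides (n = 43, mean 81.4 kg, standard deviation 12.7 kg), while the null value of 80 kg, the 5% significance level and the function name `z_test` are assumptions introduced here, not part of the original presentation. It uses the normal approximation rather than a t-distribution.

```python
import math
from statistics import NormalDist


def z_test(sample_mean, sample_sd, n, null_mean, alpha=0.05):
    """Two-sided one-sample z-test using the normal approximation."""
    se = sample_sd / math.sqrt(n)                   # standard error of the mean
    z = (sample_mean - null_mean) / se              # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    reject_h0 = abs(z) > critical                   # decision rule
    return z, p_value, reject_h0


# H0: population mean weight is 80 kg (hypothetical null value).
# Ha: population mean weight differs from 80 kg.
z, p_value, reject_h0 = z_test(sample_mean=81.4, sample_sd=12.7, n=43, null_mean=80.0)
print(f"z = {z:.2f}, p = {p_value:.2f}, reject H0: {reject_h0}")
```

Here |z| is below the critical value, so H0 is not rejected at the 5% level.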
Illustration of acceptance regions
[Figure: Principles of Testing. Probability density curve centred on μ0, with a central “accept null hypothesis” region and a “reject null hypothesis” region in each tail.]

Significance Levels
• Used as the criteria to accept or reject H0
• A p-value < 0.05 (or 0.01) indicates that the observed result would be unlikely if H0 were true
• Usually 5% or 1%
• Chosen a priori

P-values
• Criteria to judge statistical significance of results. Quoted as values between 0 and 1
• Probability of a result at least as extreme as that observed, assuming H0 is true
• Values less than 0.05 (or 0.01) indicate an observation unlikely under the assumption that H0 is true

Illustration of P-value under H0
[Figure: P values. The distribution assumed under H0, centred on μ0, with the observed value marked and the tail area beyond it labelled as the probability of a value as extreme as observed.]

Sample size
• Results may indicate no difference between groups
• This may be because there is truly no difference between groups or because there was an insufficiently large sample size for this to be detected

Determining Sample Size
• Choice of sample size depends on:
  – Anticipated size of effect / required precision
  – Variability in measurement
  – Power
  – Significance levels

Sample size formula
• It is possible to combine information on variability, significance and power with the size of the effect we are trying to detect.
• Example from Clinical Trials: $N = \frac{2\sigma^2 (z_{\alpha/2} + z_{\beta})^2}{(\mu_0 - \mu_a)^2}$

Small samples
• The smaller the sample, the higher the degree of uncertainty in results.
• Increased variability in small samples
• Confidence Intervals for estimates are wider
• Low numbers may affect the calculation of directly standardised rates (for instance)
• Distribution assumptions may be affected.
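The clinical trials sample size formula, and the widening of confidence intervals in small samples, can be sketched as follows. All inputs are hypothetical (standard deviation 10, difference to detect 5, 5% two-sided significance, 80% power). Reading N as the number needed per group and δ as μ0 − μa are assumptions made here, as are the function names.

```python
import math
from statistics import NormalDist


def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """N = 2*sigma^2*(z_{alpha/2} + z_beta)^2 / delta^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    n = 2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2
    return math.ceil(n)


def ci_half_width(sd, n, z=1.96):
    """Half-width of a 95% confidence interval for a mean: z * sd / sqrt(n)."""
    return z * sd / math.sqrt(n)


# Hypothetical inputs: standard deviation 10, difference to detect 5.
n_required = sample_size_per_group(sigma=10, delta=5)
print("required per group:", n_required)

# The interval widens as the sample shrinks (same hypothetical sd).
for n in (400, 100, 25):
    print(n, round(ci_half_width(10, n), 2))
```

Halving the difference to be detected roughly quadruples the required sample size, because δ enters the formula squared.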
Dealing with small numbers
• Confidentiality can be an issue
• Can combine several years of data
  – Mortality pooled over several years for rare conditions
  – Suicide, infant mortality, cancers in the young
• Combine counts across categories of data
  – Low cell counts in cross-classifications of the data
• Exact methods may be needed.

Problems with small numbers
[Figure: Trends in deaths from accidents 1993-2005, persons. y-axis: DSR/100,000 (0 to 14); x-axis: year.]

Year   England DSR   Region DSR   PCT DSR   England OBS   Region OBS   PCT OBS
1993   4.80          5.98         6.75      442           30           4
1994   4.71          5.60         12.25     437           29           7
1995   4.40          4.42         7.16      407           22           4
1996   4.04          5.17         4.82      374           26           3
1997   3.81          3.65         1.61      354           18           1
1998   3.62          5.09         5.65      337           26           3
1999   3.78          4.61         0.00      349           22           0
2000   3.26          2.70         3.18      300           13           2
2001   2.97          3.95         4.86      272           18           3
2002   3.09          4.36         1.71      280           19           1
2003   2.99          1.46         10.41     268           7            6
2004   2.66          4.94         1.79      238           21           1
2005   2.50          1.40         2.13      227           7            1

Finding out more: APHO
http://www.apho.org.uk/resource/item.aspx?RID=48457

Finding out more
• Lots of useful information can be found at the HealthKnowledge website…

Finding out more
• The NCHOD website also contains useful information on methodology… http://www.nchod.nhs.uk/

Finding out more
• Some further references of interest:
  – Bland, M. Introduction to Medical Statistics. Third Edition. Oxford University Press, 2000.
  – Hennekens, C.H., Buring, J.E. Epidemiology in Medicine. Lippincott Williams & Wilkins, 1987.
  – Larson, H.J. Introduction to Probability Theory and Statistical Inference. Third Edition. Wiley, 1982.