Introductory Statistics

John Matthews, Professor of Medical Statistics, School of Mathematics and Statistics
Janine Gray, Senior Lecturer and Deputy Director, Newcastle Clinical Trials Unit
University of Newcastle-upon-Tyne

Course Outline
Data Description
Mean, Median, Standard Deviation
Graphs
The Normal Distribution
Populations and Samples
Confidence intervals and p-values
Estimation and Hypothesis testing
Continuous data
Categorical data
Regression and Correlation

Course Objectives
To have an understanding of the Normal distribution and its relationship to common statistical analyses
To have an understanding of basic statistical concepts such as confidence intervals and p-values
To know which analysis is appropriate for different types of data

Recommended Textbooks
Swinscow TDV and Campbell MJ. Statistics at Square One (10th edn). BMJ Books
Altman DG. Practical Statistics for Medical Research. Chapman and Hall
Bland M. An Introduction to Medical Statistics. Oxford Medical Publications
Campbell MJ & Machin D. Medical Statistics: A Commonsense Approach. Wiley

Other reading
Chinn S. Statistics for the European Respiratory Journal. Eur Respir J 2001; 18:393-401
www.mas.ncl.ac.uk/~njnsm/medfac/MDPhD/notes.htm
BMJ statistics notes

Types of Data
Numerical Data
– discrete: number of lesions, number of visits to GP
– continuous: height, lesion area

Types of Data
Categorical
– unordered: pregnant/not pregnant; married/single/divorced/separated/widowed
– ordered (ordinal): minimal/moderate/severe/unbearable; stage of breast cancer: I II III IV

Exercise
What type are the following variables?
a) sex
b) diastolic blood pressure
c) diagnosis
d) height
e) family size
f) cancer stage

Types of Data
Outcome/Dependent variable
– outcome of interest
– e.g.
survival, recovery
Explanatory/Independent variable
– treatment group
– age
– sex

[Figure: histogram of birthweight (grams) at 40 weeks GA]

Summary Statistics
Location
– Mean (average value)
– Median (middle value)
– Mode (most frequently occurring value)
Variability
– Variance/SD
– Range
– Centiles

Birthweights (g) at 40 weeks Gestation
mean = 3441g
median = 3428g
sd = 434g
min = 2050g
max = 4975g
range = 2925g

Boxplot
[Figure: boxplot of T4 cells/mm³ in blood samples by group (Hodgkin's, N = 20; Non-Hodgkin's, N = 20)]

Symmetric Data
mean = median (approx)
standard deviation

Skew Data
median = "typical" value
mean affected by extreme values - larger than median
SD fairly meaningless
centiles (less affected by extreme values/outliers)

Half of all doctors are below average...
Even if all surgeons are equally good, about half will have below average results, one will have the worst results, and the worst results will be a long way below average
Ref. BMJ 1998; 316:1734-1736

Discrete Data
Principal diagnosis of patients in Tooting Bec Hospital

Diagnosis                 Number of patients
Schizophrenia             474 (32%)
Affective Disorders       277 (19%)
Organic Brain Syndrome    405 (28%)
Subnormality               58 (4%)
Alcoholism                 57 (4%)
Other/Not Known           196 (13%)
Total                    1467

Bar Chart
[Figure: bar chart of principal diagnosis of patients in Tooting Bec Hospital (count by diagnosis)]

Summarising data - Summary
Choosing the appropriate summary statistics and graph depends upon the type of variable you have
– Categorical (unordered/ordered)
– Continuous (symmetric/skew)

The Normal Distribution
N(μ, σ²)
μ = unknown population mean - estimate using sample mean
σ = unknown population SD - estimate using sample SD
Birthweight is N(3441, 434²)

N(0,1) - Standard Normal Distribution
z - SD units
68% within ± 1 SD units
95% within ± 1.96 SD units
99% within ± 2.58 SD units

Birthweight (g) at 40 weeks
95% within 1.96 SDs: 2590 - 4292 grams
99% within 2.58 SDs: 2321 -
4561 grams

Further Reading
http://www.mas.ncl.ac.uk/~njnsm/medfac/docs/intro.pdf
Altman DG, Bland JM (1996) Presentation of numerical data. BMJ 312, 572
Altman DG, Bland JM (1995) The normal distribution. BMJ 310, 298

Samples and Populations
Use samples to estimate population quantities (parameters) such as disease prevalence, mean cholesterol level etc.
Samples are not interesting in their own right - only to infer information about the population from which they are drawn
Sampling Variation: populations are unique - samples are not

Samples and Populations
How much might these estimates vary from sample to sample?
Determine precision of estimates (how close to/far away from the population value?)

(Artificial) example
Have 5000 measurements of diastolic blood pressure from airline pilots. This accounts for ALL airline pilots and is the population of airline pilots.
(Artificial example - if we had the whole population we wouldn't need to sample!)
Since we have the population, we know the true population characteristics. It is these we are trying to estimate from a sample.

Population distribution of diastolic BP from Airline Pilots (in mmHg)
True mean = 78.2
True SD = 9.4

Example
Write each measurement on a piece of paper and put into a hat.
Draw 5 pieces of paper and calculate the mean of the BP.
Replace and repeat 49 more times.
End up with 50 (different) estimates of mean BP

Sampling Distribution
Each estimate of the mean will be different.
Treat this as a random sample of means.
Plot a histogram of the means. This is an estimate of the sampling distribution of the mean.
Can get the sampling distribution of any parameter in a similar way.

Distribution of the mean
μ = 78.2, σ = 9.4
[Figure: histograms for the population and for the means of 50 samples with N = 5, N = 10 and N = 100]

Distribution of the Mean
BUT!
Don't need to take multiple samples
Standard error of the mean = Sample SD / √N
SE of the mean is the SD of the distribution of the sample mean

Distribution of Sample Mean
Distribution of sample mean is Normal regardless of distribution of sample (unless small or very skew sample)
SO
Can apply Normal theory to sample mean also

Distribution of Sample Mean
i.e. 95% of sample means lie within 1.96 SEs of (unknown) true mean
This is the basis for a 95% confidence interval (CI)
95% CI is an interval which on 95% of occasions includes the population mean

Example
57 measurements of FEV1 in male medical students

Example
X̄ = 4.06 litres, SD = 0.67 litres
95% of population lie within X̄ ± 1.96 SDs
i.e. within 4.06 ± 1.96 × 0.67, from 2.75 to 5.38 litres

Example
SE = 0.67 / √57 = 0.09
Thus for FEV1 data, 95% chance that the interval 4.06 ± 1.96 × 0.09 contains the true population mean
i.e. between 3.89 and 4.23 litres
This is the 95% confidence interval for the mean

Confidence Intervals
The confidence interval (CI) measures uncertainty. The 95% confidence interval is the range of values within which we can be 95% sure that the true value lies for the whole of the population of patients from whom the study patients were selected. The CI narrows as the number of patients on which it is based increases.
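The FEV1 arithmetic above can be checked with a few lines of Python. This is only a minimal sketch built from the summary figures quoted on the slides (mean 4.06 litres, SD 0.67 litres, N = 57); the variable names are illustrative assumptions, and 1.96 is the Normal multiplier used in the example.

```python
import math

# Summary figures quoted in the FEV1 example (57 male medical students)
mean_fev1 = 4.06   # litres
sd_fev1 = 0.67     # litres
n = 57
z = 1.96           # Normal multiplier for 95% coverage

# Reference range: where about 95% of individuals are expected to lie (uses SD)
ref_low = mean_fev1 - z * sd_fev1
ref_high = mean_fev1 + z * sd_fev1

# Standard error of the mean: SD / sqrt(N)
se = sd_fev1 / math.sqrt(n)

# 95% confidence interval for the population mean (uses SE)
ci_low = mean_fev1 - z * se
ci_high = mean_fev1 + z * se

print(f"SE = {se:.2f} litres")
print(f"95% reference range: {ref_low:.2f} to {ref_high:.2f} litres")
print(f"95% CI for the mean: {ci_low:.2f} to {ci_high:.2f} litres")
```

The reference range is built from the SD and describes individuals; the confidence interval is built from the SE and describes the precision of the mean, which is why it is so much narrower. Because the inputs here are rounded, the last digit may differ slightly from the slide figures.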
Standard Deviations & Standard Errors
The SE is the SD of the sampling distribution (of the mean, say)
SE = SD/√N
Use SE to describe the precision of estimates (for example confidence intervals)
Use SD to describe the variability of samples, populations or distributions (for example reference ranges)

The t-distribution
When N is small, estimate of SD is particularly unreliable and the distribution of sample mean is not Normal
Distribution is more variable - longer tails
Shape of distribution depends upon sample size
This distribution is called the t-distribution

[Figure: t-distributions plotted against N(0,1)]
N = 2: t(1), 95% within ± 12.7
N = 10: t(9), 95% within ± 2.26
N = 30: t(29), 95% within ± 2.04

t-distribution
As N becomes larger, t-distribution becomes more similar to Normal distribution
Degrees of Freedom (DF) = sample size - 1
DF is a measure of the amount of information contained in the data set

Implications
Confidence interval for the mean
» Sample size < 30: use t-distribution
» Sample size > 30: use either Normal or t distribution
Note: Stats packages (generally) will automatically use the correct distribution for confidence intervals

Example
Numbers of hours of relief obtained by 7 arthritic patients after receiving a new drug: 2.2, 2.4, 4.9, 3.3, 2.5, 3.7, 4.3
Mean = 3.33, SD = 1.03, DF = 6, t(5%) = 2.45
95% CI = 3.33 ± 2.45 × 1.03/√7 = 2.38 to 4.28 hours
Normal 95% CI = 3.33 ± 1.96 × 1.03/√7 = 2.57 to 4.09 hours - TOO NARROW!!

Hypothesis Testing
Enables us to measure the strength of evidence supplied by the data concerning a proposition of interest
In a trial comparing two treatments there will ALWAYS be a difference between the estimates for each treatment - a real difference or random variation?
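The point that two estimates will always differ, even when there is no real difference, can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the course material: both groups are drawn from the same Normal population, using the airline pilot figures quoted earlier (mean 78.2, SD 9.4), and the group size of 20, the 2000 repetitions and the seed are arbitrary illustrative choices.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Assumed population: the airline pilot figures from the earlier example
POP_MEAN, POP_SD = 78.2, 9.4
N_PER_GROUP = 20   # hypothetical group size
N_TRIALS = 2000    # number of simulated two-group "trials"


def simulated_difference():
    """Difference in sample means for two groups drawn from the SAME population."""
    group_a = [random.gauss(POP_MEAN, POP_SD) for _ in range(N_PER_GROUP)]
    group_b = [random.gauss(POP_MEAN, POP_SD) for _ in range(N_PER_GROUP)]
    return statistics.mean(group_a) - statistics.mean(group_b)


differences = [simulated_difference() for _ in range(N_TRIALS)]

# SE of the difference in means when the population SD is known
se_diff = POP_SD * math.sqrt(2 / N_PER_GROUP)

# How many simulated trials show no difference at all, and how many show a
# difference beyond 1.96 SEs purely through random variation
n_exact_zero = sum(1 for d in differences if d == 0)
frac_extreme = sum(1 for d in differences if abs(d) > 1.96 * se_diff) / N_TRIALS

print(f"Differences exactly zero: {n_exact_zero} of {N_TRIALS}")
print(f"SD of simulated differences: {statistics.stdev(differences):.2f} (theory {se_diff:.2f})")
print(f"Fraction beyond 1.96 SEs: {frac_extreme:.3f}")
```

Every simulated trial shows some difference, and roughly 1 in 20 shows one larger than 1.96 SEs even though the true difference is zero. A hypothesis test asks whether an observed difference is larger than this kind of random variation would plausibly produce.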
Null Hypothesis
Study hypothesis - hypothesis in the mind of the investigator (patients with diabetes have raised blood pressure)
Null hypothesis is the converse of the study hypothesis - aim to disprove it (patients with diabetes do not have raised blood pressure)
Hypothesis of no effect/difference

Two-Sample t-test
Two independent samples
Can the two samples be considered to be the same with respect to the variable you are measuring or are they different?
Sample means will ALWAYS be different - real difference or random variation?
ASSUMPTION: Data are normally distributed and SD in each group similar

Two-Sample t-test
24 hour total energy expenditure (MJ/day) in groups of lean and obese women
Do the women differ in their energy expenditure?
Null hypothesis: energy expenditure in lean and obese women is the same

[Figure: boxplot of energy expenditure (MJ/day) by group (lean, N = 13; obese, N = 9)]

Two-sample t-test
Summary statistics

        lean    obese
Mean     8.1    10.3
SD       1.2     1.4
N       13       9

Difference in means = 10.3 - 8.1 = 2.2
SE difference = 0.57 (weighted average)

Two Sample t-test
Test statistic is 2.2/0.57 = 3.9
N1 + N2 - 2 DF (= 20)
Calculate the probability of observing a value at least as extreme as 3.9 if the null hypothesis is true
If the null hypothesis is true, the test statistic should have a t-distribution with 20 df (df = N1 + N2 - 2)

Two Sample t-test
95% of values from t-distribution with 20 DF lie between -2.09 and +2.09
Probability of observing a value as extreme or more extreme than 3.9 in a t-distribution with 20 df is 0.001
Only a very small probability that the value of 3.9 fits reasonably with a t-distribution with 20 df
Conclude that energy expenditure is significantly different between lean and obese women

The P-value
The P-value is the probability of observing a test statistic at least as extreme as that observed if the null hypothesis is true
[Figure: t distribution with 20 df, probability density against x from -4 to 4]

Confidence Interval for the difference in
two means
95% CI = 2.2 - 2.09 × 0.57 to 2.2 + 2.09 × 0.57
or from 1.05 to 3.41 MJ/day
(the quoted limits appear to be based on unrounded values; the rounded figures shown give roughly 1.01 to 3.39)
Thus we are 95% confident that obese women use between 1.05 and 3.41 MJ/day more energy than lean women

Confidence Interval or P-value?
Confidence interval!!!
P-value will tell you whether or not there is a statistically significant difference
Confidence interval will give information about the size of the difference and the strength of the evidence

Paired t-test
Obvious pairing between observations
– two measurements on each subject (before-after study)
– case-control pairs
Assumption - paired data are normally distributed
Example - Systolic blood pressure (SBP) measured in 16 middle aged men before and after a standard exercise. Post-exercise SBP - Pre-exercise SBP calculated for each man

[Figure: boxplot of differences (post-exercise SBP minus pre-exercise SBP), N = 16]

Paired t-test
Mean difference = 6.6
SE(Mean) = 1.5
t = 6.6/1.5 = 4.4
Compare with t(15)
P < 0.001
Conclusion - mean systolic blood pressure is higher after exercise than before

Paired t-test
95% confidence interval for the mean difference
6.6 ± 2.13 × 1.5 = 3.4 to 9.8

Categorical Variables
To investigate the relationship between two categorical variables form contingency table
Hypothesis tests
– Chi-squared test (χ² test)
– Fisher's exact test (small samples)
– McNemar's test (paired data)

Chi-squared test
Used to test for associations between categorical variables (2 or more distinct outcomes)
Example - a comparison between psychotherapy and usual care for major depression in primary care

Patient Reported Recovery at 8 months

                 Recovered    Not Recovered    Total
Psychotherapy    47 (51%)     46 (49%)          93
Usual Care       18 (20%)     73 (80%)          91
Total            65 (35%)    119 (65%)         184

P<0.001, Chi-square test

Patient Reported Recovery at 8 months
Difference between proportions 30.8%
95% confidence interval for difference 17.7% to 43.8%

Larger tables
Similar methods can be applied to larger tables to test the association between two categorical variables
Example - Is there an association between housing
tenure and time of delivery of baby (preterm/term)?
Null hypothesis: There is no relationship between housing tenure and time of delivery

Relationship between housing tenure and time of delivery
(expected counts in brackets)

Housing Tenure        Preterm       Term            Total
Owner-occupier        50 (61.7)     849 (837.3)      899
Council Tenant        29 (17.7)     229 (240.3)      258
Private Tenant        11 (12.0)     164 (163.0)      175
Lives with Parents     6 (4.9)       66 (67.1)        72
Other                  3 (2.7)       36 (36.3)        39
Total                 99           1344             1443

Relationship between housing tenure and time of delivery
Test Statistic
χ² = (50 - 61.7)²/61.7 + (849 - 837.3)²/837.3 + ....... + (3 - 2.7)²/2.7 + (36 - 36.3)²/36.3 = 10.5
DF = (5-1)(2-1) = 4
P = 0.03
Thus we have strong evidence of a relationship between housing tenure and time of delivery

Notes
Chi-squared test not valid if expected values are small (<5)
– Combine rows or columns to obtain a smaller table with larger expected values
– Use Fisher's exact test for small tables

McNemar's test
Appropriate for use with paired or matched (case-control) data with a dichotomous outcome

Example - McNemar's test
Skaane compared the use of mammography and ultrasound in the assessment of 327 (228 palpable and 99 non-palpable) consecutive malignant tumours confirmed at histology.
Acta radiologica vol 40; 486-490 (1999)

McNemar's test - example

                 Mammogram
US        Yes     No     Tot.
Yes       267     11     278
No         41      8      49
Tot.      308     19     327

McNemar's test - example
308/327 (94%) were picked up by mammography compared with 278/327 (85%) picked up by ultrasound
P<0.001
Conclusion: Mammography is significantly more sensitive in diagnosing tumours than ultrasound in a population of mixed malignant tumours

Hypothesis testing - summary
Continuous quantitative data
– Paired design: paired (one-sample) t-test; Wilcoxon signed rank test
– Unpaired design: unpaired (independent samples) t-test; Mann-Whitney U test
Ordered categorical data
– Paired design: Wilcoxon signed rank test
– Unpaired design: Mann-Whitney U test
Unordered categorical data
– Paired design: McNemar's test (2 categories only)
– Unpaired design: Chi-squared test; Fisher's exact test
Adapted from Chinn S.
Statistics for the European Respiratory Journal.

Correlation and Regression
Relationship between two continuous variables
– regression
– correlation

Relationship between two continuous variables
3 main purposes for doing this
– to assess whether the two variables are associated (correlation)
– to enable the value of one variable to be predicted from any known value of the other variable (regression)
– to assess the amount of agreement between two variables (method comparison study)

Example
Women from a pre-defined geographical area were invited to have their haemoglobin (Hb) level and packed cell volume (PCV) measured. They were also asked their age.

[Figure: scatter plot of haemoglobin against packed cell volume (%)]

Example - relationships between variables
Association between Hb and PCV?
– Hb affects PCV or PCV affects Hb?
– Use correlation to measure the strength of an association
Association between Hb and age?
– age must affect Hb and not vice versa
– Use regression to predict Hb from age

Correlation
Not interested in causation, i.e. does a high PCV cause a high Hb level?
Interested in association, i.e. is a high PCV associated with a high Hb level?
sample correlation coefficient
– summarises strength of relationship
– can be used to test the hypothesis that the population correlation coefficient is 0

Correlation Coefficient
dimensionless, from -1 to 1
measures the strength of a linear relationship
+ve - high value of one variable associated with high value of the other
-ve - high value of one variable associated with low value of the other
±1 = exact linear relationship
strictly called Pearson correlation coefficient

Example Data
[Figure: four scatter plots of Y against X, with r = -0.4, r = 1, r = 0 and r = 0.7]

When not to use the correlation coefficient
If the relationship is non-linear
With caution in the presence of outliers
When the variables are measured over more than one distinct group (e.g. disease groups)
When one of the variables is fixed in advance
When assessing agreement

Correlation - example data
[Figure: four scatter plots, y1 against x1, y2 against x2, y3 against x3 and y4 against x4]

Is there an alternative?
If the data are non-linear or there is an outlier
– use Spearman rank correlation coefficient

Haemoglobin and Packed Cell Volume
Without outlier: Pearson = 0.67, Spearman = 0.63
With outlier: Pearson = 0.34, Spearman = 0.48
[Figure: scatter plot of haemoglobin against packed cell volume (%), including the outlier]

Regression
Assume a change in x will cause a change in y
Predict y for a given value of x
Usually not logical to believe y causes x
y is the dependent variable (vertical axis)
x is the independent variable (horizontal axis)

Example - Haemoglobin vs Age
[Figure: scatter plot of haemoglobin against age (years)]

Regression
Logical to assume that increasing age leads to increasing Hb
Not logical to assume Hb affects age!
Assume underlying true linear relationship
Make an estimate of what that true linear relationship is

Estimating a regression line
How do I identify the 'best' straight line?
least squares estimate
straight line determined by slope and intercept
y = a + bx
a and b are estimates of the true intercept and slope and are subject to sampling variation

Regression line of haemoglobin on age
[Figure: scatter plot of haemoglobin against age (years) with fitted regression line]

Regression of haemoglobin on age

Variable(s) Entered on Step Number 1..  AGE  Age (Years)

Multiple R            .87959
R Square              .77367
Adjusted R Square     .76110
Standard Error       1.17398

Analysis of Variance
              DF    Sum of Squares    Mean Square
Regression     1          84.80397       84.80397
Residual      18          24.80803        1.37822
F = 61.53133    Signif F = .0000

Regression of haemoglobin on age

Variables in the Equation
Variable             B       SE B    95% Confdnce Intrvl B         T    Sig T
AGE            .134251    .017115     .098295     .170208      7.844    .0000
(Constant)    8.239786    .794261    6.571104    9.908467     10.374    .0000

What does this tell us?
Hb = 8.2 + 0.13 AGE
95% CI for the slope goes from 0.098 to 0.170
P < 0.0001
Significant relationship between Hb and age
77% of the variability in Hb can be accounted for by age

How can it be used?
Predict mean Hb for a given age
E.g. what is the mean Hb of a 50 year old?
Mean Hb = 8.2 + 0.13 × 50 = 14.7 g/dl
(using the rounded coefficients; the unrounded coefficients in the output give about 14.95 g/dl, the midpoint of the interval below)
95% CI for the estimate from 14.4 to 15.5 g/dl

How can it be used?
To calculate reference ranges for the population
E.g. what range would you expect 95% of 50 year olds to lie within? (reference range)
Between 12.4 and 17.5 g/dl

95% Confidence Interval for the Mean & 95% prediction interval for individuals
[Figure: haemoglobin against age (years) with regression line, 95% confidence interval for the mean and 95% prediction interval for individuals]

Definitions
Predicted value
– the value predicted by the regression line
– an estimate of the mean value
Residual
– Observed value - predicted value

What assumptions have I made?
The relationship is approximately linear
The residuals have a normal distribution

Multiple Regression
One outcome variable with multiple predictor variables
Residuals assumed to be normally distributed
Predictor variables can be continuous or categorical
No assumptions made about distribution of continuous predictor variables

Multiple Regression
Example. Does the value of packed cell volume improve the prediction of Hb?
Model fitted
Mean Hb = 5.2 + 0.1 × age (years) + 0.1 × packed cell volume (%)
R² = 83%
Knowledge of packed cell volume improves the prediction of haemoglobin

Summary
Regression can be used to estimate the numerical relationship between an outcome variable and one or more predictor variables
Correlation coefficient alone is of limited use
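The figures in the regression output for haemoglobin on age hang together in ways that can be checked by hand. The sketch below is an illustration only: it copies the numbers from that output, and the variable and function names are assumptions introduced here.

```python
# Figures copied from the regression output for haemoglobin on age
intercept = 8.239786      # (Constant) B
slope = 0.134251          # AGE B
se_slope = 0.017115       # AGE SE B
ss_regression = 84.80397  # regression sum of squares
ss_residual = 24.80803    # residual sum of squares
df_regression, df_residual = 1, 18


def predicted_mean_hb(age_years):
    """Predicted mean Hb (g/dl) from the fitted line y = a + b*x."""
    return intercept + slope * age_years


# R Square: share of the total variability accounted for by the regression
r_square = ss_regression / (ss_regression + ss_residual)

# F statistic: ratio of the regression and residual mean squares
f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)

# t statistic for the slope; with a single predictor, t squared equals F
t_slope = slope / se_slope

print(f"Predicted mean Hb at age 50: {predicted_mean_hb(50):.2f} g/dl")
print(f"R Square = {r_square:.3f}")
print(f"F = {f_stat:.2f}, t = {t_slope:.3f}, t squared = {t_slope ** 2:.2f}")
```

With the unrounded coefficients the predicted mean Hb at age 50 is about 14.95 g/dl, slightly higher than the 14.7 g/dl obtained from the rounded equation Hb = 8.2 + 0.13 AGE. The R Square value is the 77% of variability quoted on the slide.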