Basic statistical concepts
Susanne Rosthøj
Section of Biostatistics, Institute of Public Health
Faculty of Health Sciences, University of Copenhagen
Department of Biostatistics

Structure of teaching sessions
[Diagram: for one topic, the elements Introduction to training activities, Videos, Reading, Training exercises, Automated feedback, Discussions and Collective feedback, arranged around two teaching sessions.]

Why all these exercises?
Learning R (and statistics) requires training!
Gains of training:
• focus on statistical rather than technical issues
• increased learning
Several hours of studying and training are necessary before each session.
You need to complete 80% of the tests for the training exercises (you don't need 80% correct answers!).

Outline day 1
• Discussion of the training activities
• Descriptive statistics
• Inferential statistics
• Means and confidence intervals
• Test of hypotheses

Applied statistics
The basis for statistics is data / observations observed with random variation.
We want to quantify the variation in the observations due to
• systematic variation
• random variation due to factors we cannot control
We need to
• summarize many observations as simply as possible
• quantify that conclusions based on many observations are more precise than conclusions based on few observations

Statistical approaches
Descriptive statistics:
• Summarizing observations
• Typically represented
  • graphically
  • in tables
  • as summary statistics (single values)
Inferential statistics:
• Procedures allowing us to infer / generalize / conclude from observations
• Typically based on
  • models, confidence intervals, hypotheses, tests
• Need mathematical results and assumptions

Descriptive statistics
Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data.
Descriptive statistics tell us about the distribution of data points in a data set.
Quantitative data are summarized by:
Median, range, quartiles, inter quartile range (IQR), mean and standard deviation.
Graphics: Histograms and scatter plots.
Categorical data are summarized by:
Tables, proportions.

Small exercise (1/2)
Data: $X = (X_1, \ldots, X_n)$
$X = (50, 52, 56, 61, 64, 71, 72, 73, 75, 79)$ ($n = 10$)
The k'th percentile is the point below which k% of the values of the distribution lie:
k'th percentile = the $\frac{k}{100} \times (n+1)$'th value of the ordered observations.
By hand: Find the median (k=50) and the lower and upper quartiles (k=25,75).
• median (middle or $\frac{50 \times (n+1)}{100}$'th observation):
• min, max, range:
• lower and upper quartiles:
• IQR (Inter Quartile Range):

Small exercise (2/2)
Now find the median and the quartiles using R:
First enter the data in a vector named x in R.
    x <- c(50,52,56,61,64,71,72,73,75,79)
Experiment with the commands and explain the output from
    median(x)
    quantile(x)
    quantile(x, type=6)
How can you ask R to find the 20% quantile?
Further determine
• mean = $\bar{X} = \frac{1}{10}\sum_{i=1}^{10} X_i$ =
• standard deviation = sd = $\sqrt{\frac{1}{10-1}\sum_{i=1}^{10}(X_i - \bar{X})^2}$ =
Use Google to find out how to determine the standard deviation in R.

Inferential statistics
What can we say about the rest of the world using the observations we have seen?
[Figure: Population, Sample, Estimates]
Assumption: Unknown parameters describe the population.
Sample estimates are guesses of population parameters.

The normal distribution
The normal distribution is the most important distribution for describing continuous variables.
Examples:
• Body temperature
• Hemoglobin level
• Weight
It is widely used in statistical inference because
• it has many mathematically convenient properties
• the Central Limit Theorem: The average of a sufficiently large number of independent variables with the same distribution will be approximately normally distributed.

The 95% reference interval
Hemoglobin levels for adult women are normally distributed with mean 14 g/100 ml and an SD of 1 g/100 ml.
Reference range / normal range: $\mu \pm 1.96 \cdot SD$
[Figure: normal density of hemoglobin; x-axis "Hemoglobin" from 11 to 17, y-axis "Density".]
Suppose we measure the Hgb in a group of women.
What happens to the SD of the sample distribution as n is increased?

Mean and standard deviation of an average
The mean of an average equals the mean of the variable we are averaging:
$\mathrm{mean}(\bar{X}) = \mathrm{mean}(X)$ $(= \mu)$.
The standard deviation of the average is the standard deviation of the variable divided by the square root of the number of observations:
$\mathrm{SD}(\bar{X}) = \frac{\mathrm{SD}(X)}{\sqrt{n}}$ $\left(= \frac{\sigma}{\sqrt{n}}\right)$
This is termed the standard error of the mean (SE or SEM) and measures the amount of variation in averages of n observations.

The distribution of the sample mean ($\bar{X}$)
[Figure: density of $\bar{X}$ centred at $\mu$; 95% of the area lies between $\mu - 1.96\,\frac{\sigma}{\sqrt{n}}$ and $\mu + 1.96\,\frac{\sigma}{\sqrt{n}}$, with 2.5% in each tail.]

The 95% confidence interval
The 95% confidence interval (CI) for $\mu$:
$\bar{X} \pm 1.96 \cdot SE$
[Figure: the same density of $\bar{X}$ with two example sample means marked.]

Understanding confidence intervals
The population mean $\mu$ is a fixed unknown number.
The confidence intervals vary between samples.
[Figure: mean and 95% confidence interval for 20 samples of size 100 from a population with mean 24.2 and s.d. 5.9. The sample means vary around the population mean $\mu$. One of the twenty 95% CIs does not contain $\mu$.]

Interpretation of CI
A 95% CI ranges from 3 to 4. How many of the following statements are true?
A. The probability that the population mean is greater than 0 is at least 95%.
B. The probability that the population mean equals 0 is smaller than 5%.
C. There is a 95% probability that the population mean lies between 3 and 4.
D. We may be 95% confident that the interval (3;4) contains the population mean.
E. If we were to repeat the experiment over and over, then 95% of the time the population mean falls between 3 and 4.
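The repeated-sampling idea behind the figure with 20 samples can be tried out in R. The following is only a minimal sketch and not part of the original slides: it assumes a normally distributed population with the mean 24.2 and s.d. 5.9 quoted for the figure, and the seed, the 1000 repetitions and the helper name `ci_covers` are illustrative choices.

```r
# Sketch: how often does the 95% CI (mean +/- 1.96 * SE) contain the true mean?
# Assumptions (not from the slides): normal population, seed, 1000 repetitions.
set.seed(1)
mu    <- 24.2   # population mean quoted for the figure
sigma <- 5.9    # population s.d. quoted for the figure
n     <- 100    # sample size

ci_covers <- function() {
  x  <- rnorm(n, mean = mu, sd = sigma)   # draw one sample
  se <- sd(x) / sqrt(n)                   # standard error of the mean
  lower <- mean(x) - 1.96 * se
  upper <- mean(x) + 1.96 * se
  lower <= mu && mu <= upper              # TRUE if this interval contains mu
}

covered  <- replicate(1000, ci_covers())  # one TRUE/FALSE per simulated sample
coverage <- mean(covered)                 # proportion of intervals containing mu
print(coverage)
```

The proportion should come out close to 0.95. Note that the population mean is the same in every repetition; it is the interval that changes from sample to sample.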

Why do we need confidence intervals?
We want to estimate a parameter, e.g.
• the mean value for dice rolling
• the mean effect of a sleeping pill
• the mean IQ
Based on a sample we suggest a qualified guess (estimate)
• we are uncertain about the guess and suggest an interval of plausible values
• the interval has to be narrow
• we want a large probability (95%) of guessing right.

Small sample confidence intervals
For small samples (n ≤ 60) the CIs are better approximated by the t-distribution with df = n − 1.
The 95% CI for $\mu$ is
$\bar{X} \pm t_{0.025,\,df=n-1} \cdot SE$
with $t_{0.025,\,df=n-1}$ being the lower 2.5% quantile.
See a selection of quantiles in KS table A3 or calculate quantiles in R with qt(0.025,df=n-1).

How to make conclusions based on data?
The purpose of most experiments is to prove or disprove a hypothesis.
This is done by collecting data, analyzing it and drawing a conclusion.
The original hypothesis is tested against the data to find out whether or not it is right.

Example of a hypothesis
636 children from Peru had their lung capacity examined.
Response: FEV (Forced Expiratory Volume, L/1s).
Scientific question: Do boys and girls have different lung capacity?
Hypothesis: H0: There is no difference in lung capacity for boys and girls.
We observe:
Girls: mean(FEV) = 1.54
Boys: mean(FEV) = 1.66.
Observed difference = 0.12.
What can we conclude?

Examples of hypotheses
We always formulate hypotheses as no difference or no association.
Investigation of a single population (one group):
• H0: The mean is equal to a specific number (e.g.
mean FEV for boys is $\mu_1 = 1.5$)
  HA: The mean is not equal to that number.
Comparison of two populations (two groups):
• H0: The means are equal (i.e. $\mu_1 - \mu_2 = 0$)
  HA: The means are not equal.
If there is sufficient evidence against the hypothesis, we reject H0.

Test statistics
We use test statistics to find evidence against the hypothesis.
Often test statistics are given by
$\frac{\text{estimate} - \text{hypothetical value}}{\mathrm{SD}(\text{estimate})}$
We expect the test statistic to be
• numerically small if the hypothesis is true
• numerically large if the hypothesis is false.

Example: Lung capacity
Parameters:
$\mu_1$: mean FEV for girls
$\mu_2$: mean FEV for boys
Do boys and girls have different lung capacity?
Hypothesis: $H_0: \mu_2 = \mu_1$.
$\mu_2 - \mu_1$ is the parameter we investigate.
0 is the hypothetical value.

Two sample t-test
Can be used when data arise from two groups, the variances in the two groups are equal and all observations are independent.
Hypothesis: The population means in the two groups are equal
$H_0: \mu_1 = \mu_2$ against $H_A: \mu_1 \neq \mu_2$
Summary data from the two groups:
Group 1: $n_1$, $\bar{X}_1$, $SD_1$
Group 2: $n_2$, $\bar{X}_2$, $SD_2$

Two sample t-test
First calculate a 'pooled' SD:
$SD = \sqrt{\frac{(n_1 - 1)\,SD_1^2 + (n_2 - 1)\,SD_2^2}{(n_1 - 1) + (n_2 - 1)}}$
Standard error of the difference in means ($\bar{X}_1 - \bar{X}_2$):
$SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \cdot SD$
Test statistic: t = ??
If H0 is true, $t \sim t(df = n_1 + n_2 - 2)$.

Example: Lung capacity
        Girls   Boys
n       335     301
mean    1.538   1.657
SD      0.291   0.308
$\bar{X}_2 - \bar{X}_1 = 0.119$.
This is an estimate of the difference $\mu_2 - \mu_1$.
Pooled SD assuming equal variances:
$SD = \sqrt{\frac{(335 - 1) \times 0.291^2 + (301 - 1) \times 0.308^2}{(335 - 1) + (301 - 1)}} = 0.299$
The test statistic:
$T = \frac{0.119}{0.299 \times \sqrt{\frac{1}{335} + \frac{1}{301}}} = 5.01$
Small or large???

P values
We use p-values to assess the size of test statistics.
If the hypothesis is true and we replicate the sampling many times:
How often will we obtain a test statistic numerically larger than the observed test statistic?
The p-value
P(|test statistic| > |observed test statistic|)
is calculated assuming the hypothesis is true.
A small p-value corresponds to the observed test statistic being unlikely if the hypothesis is true.

Example: Lung capacity
Under the null, $T \sim t(df = 335 + 301 - 2)$.
P-value:
$P(|T| > 5.01) = P(T < -5.01) + P(T > 5.01) = 2 \cdot 3.54 \times 10^{-7} = 7.09 \times 10^{-7}$
Conclusion: The observed test statistic of 5.01 would be unlikely if there were no difference between boys and girls with respect to lung capacity.
NB: If we instead use the normal distribution to calculate the p-value, the test is termed a Wald test.

Interpretation of P values
The smaller the P-value, the lower the chance of getting a difference as big as the one observed if the null hypothesis is true.
Large P-value:
• The difference between observed and hypothetical value is small compared to the statistical uncertainty.
• The observed difference could be due to chance.
Small P-value:
• The difference between observed and hypothetical value is large compared to the statistical uncertainty.
• It is unlikely that the observed difference is due to chance.
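
The lung capacity calculation can be retraced in R directly from the summary data. This is a minimal sketch rather than the course's own solution: the variable names are illustrative, the difference is taken as boys minus girls (so the test statistic is positive), and equal variances are assumed as in the two sample t-test slides.

```r
# Sketch: two-sample t-test from summary data (equal variances assumed).
# The numbers are the lung capacity summaries from the slides;
# the variable names are illustrative, not from the slides.
n1 <- 335; m1 <- 1.538; sd1 <- 0.291   # girls
n2 <- 301; m2 <- 1.657; sd2 <- 0.308   # boys

# Pooled SD
sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) /
                  ((n1 - 1) + (n2 - 1)))

# Standard error of the difference in means
se_diff <- sd_pooled * sqrt(1 / n1 + 1 / n2)

# Test statistic: (estimate - hypothetical value) / SE, hypothetical value 0
t_stat <- ((m2 - m1) - 0) / se_diff

# Two-sided p-value from the t-distribution with n1 + n2 - 2 degrees of freedom
df      <- n1 + n2 - 2
p_value <- 2 * pt(-abs(t_stat), df = df)

print(round(c(sd_pooled = sd_pooled, se_diff = se_diff, t = t_stat), 3))
print(p_value)
```

With the raw data available, the same test would normally be run with `t.test(..., var.equal = TRUE)`; working from the summaries simply makes each step of the calculation visible.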