Basic statistical concepts

Susanne Rosthøj
Section of Biostatistics, Institute of Public Health, University of Copenhagen
University of Copenhagen, Faculty of Health Sciences, Department of Biostatistics

Structure of teaching sessions
[Diagram: the activities for one topic, placed between two teaching sessions: introduction to training activities, videos, reading, training exercises, automated feedback, collective feedback, discussions.]

Why all these exercises?
Learning R (and statistics) requires training!
Gains of training:
• focus on statistical rather than technical issues
• increased learning
Several hours of studying and training are necessary before each session.
You need to complete 80% of the tests for the training exercises (you don't need 80% correct answers!).

Outline day 1
• Discussion of the training activities
• Descriptive statistics
• Inferential statistics
• Means and confidence intervals
• Test of hypotheses

Applied statistics
The basis for statistics is data / observations observed with random variation.
We want to quantify the variation in the observations due to
• systematic variation
• random variation due to factors we cannot control
We need to
• summarize many observations as simply as possible
• quantify that conclusions based on many observations are more precise than conclusions based on few observations

Statistical approaches
Descriptive statistics:
• Summarizing observations
• Typically represented
  • graphically
  • in tables
  • as summary statistics (single values)
Inferential statistics:
• Procedures allowing us to infer / generalize / draw conclusions from observations
• Typically based on
  • models, confidence intervals, hypotheses, tests
• Need mathematical results and assumptions

Descriptive statistics
Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data. Descriptive statistics tell us about the distribution of data points in a data set.
Quantitative data are summarized by: median, range, quartiles, interquartile range (IQR), mean and standard deviation. Graphics: histograms and scatter plots.
Categorical data are summarized by: tables, proportions.

Small exercise (1/2)
Data: $X = (X_1, \ldots, X_n)$
$X = (50, 52, 56, 61, 64, 71, 72, 73, 75, 79)$ ($n = 10$)
The k'th percentile is the point below which k% of the values of the distribution lie:
k'th percentile = the $\frac{k \times (n+1)}{100}$'th value of the ordered observations.
By hand: find the median (k = 50) and the lower and upper quartiles (k = 25, 75).
• median (middle or $\frac{50 \times (n+1)}{100}$'th observation):
• min, max, range:
• lower and upper quartiles:
• IQR (interquartile range):

Small exercise (2/2)
Now find the median and the quartiles using R. First enter the data in a vector named x in R.
    x <- c(50, 52, 56, 61, 64, 71, 72, 73, 75, 79)
Experiment with the commands and explain the output from
    median(x)
    quantile(x)
    quantile(x, type = 6)
How can you ask R to find the 20% quantile?
Further determine
• mean = $\bar{X} = \frac{1}{10} \sum_{i=1}^{10} X_i =$
• standard deviation = $sd = \sqrt{\frac{1}{10-1} \sum_{i=1}^{10} (X_i - \bar{X})^2} =$
Use Google to find out how to determine the standard deviation in R.

Inferential statistics
What can we say about the rest of the world using the observations we have seen?
[Diagram: Population, Sample, Estimates.]
Assumption: Unknown parameters describe the population. Sample estimates are guesses of population parameters.

The normal distribution
The normal distribution is the most important distribution for describing continuous variables.
Examples:
• Body temperature
• Hemoglobin level
• Weight
It is widely used in statistical inference because
• it has many mathematically convenient properties
• the Central Limit Theorem: the average of a sufficiently large number of independent variables with the same distribution will be approximately normally distributed.

The 95% reference interval
Hemoglobin levels for adult women are normally distributed with mean 14 g/100 ml and an SD of 1 g/100 ml.
Reference range / normal range: $\mu \pm 1.96 \cdot SD$
[Figure: density of hemoglobin; x-axis "Hemoglobin" from 11 to 17, y-axis "Density".]
Suppose we measure the Hgb in a group of women. What happens to the SD of the sample distribution as n is increased?

Mean and standard deviation of an average
The mean of an average equals the mean of the variable we are averaging: $\text{mean}(\bar{X}) = \text{mean}(X)$ ($= \mu$).
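
A small simulation may help make the behaviour of averages concrete. The sketch below assumes the hemoglobin distribution from the reference-interval example (normal with mean 14 and SD 1), draws many samples of size n with `rnorm`, averages each sample, and reports the mean and SD of those averages. The sample sizes, the number of repetitions, the seed and the helper name `simulate_averages` are illustrative choices, not part of the course material.

```r
# Sketch: how do averages of n hemoglobin measurements behave?
# Assumed population (from the reference-interval example): normal, mean 14, SD 1.
set.seed(1)  # arbitrary seed, only to make the run reproducible

simulate_averages <- function(n, reps = 10000, mu = 14, sigma = 1) {
  # Draw `reps` samples of size n and return the average of each sample
  replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))
}

for (n in c(4, 25, 100)) {
  averages <- simulate_averages(n)
  cat("n =", n,
      " mean of averages =", round(mean(averages), 3),
      " SD of averages =", round(sd(averages), 3), "\n")
}
```

The mean of the averages should stay close to 14 for every n, while the spread of the averages shrinks as n grows.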
The standard deviation of the average is the standard deviation of the variable divided by the square root of the number of observations: $SD(\bar{X}) = \frac{SD(X)}{\sqrt{n}}$ ($= \frac{\sigma}{\sqrt{n}}$).
This is termed the standard error of the mean (SE or SEM) and measures the amount of variation in averages of n observations.

The distribution of the sample mean ($\bar{X}$)
[Figure: density of the sample mean, centred at $\mu$, with 95% of the area between $\mu - 1.96 \frac{\sigma}{\sqrt{n}}$ and $\mu + 1.96 \frac{\sigma}{\sqrt{n}}$ and 2.5% in each tail.]

The 95% confidence interval
The 95% confidence interval (CI) for $\mu$: $\bar{X} \pm 1.96 \cdot SE$
[Figure: the same density of the sample mean, with two example values of $\bar{X}$ marked.]

Understanding confidence intervals
The population mean (µ) is a fixed unknown number: it is the confidence interval that will vary between samples.
[Figure: mean and 95% confidence interval for 20 samples of size 100 from a population with mean 24.2 and s.d. 5.9; x-axis "Sample" (1 to 20), y-axis from 21 to 27. The sample means vary around the population mean µ. One of the twenty 95% CIs does not contain µ.]

Interpretation of CI
A 95% CI ranges from 3 to 4. How many of the following statements are true?
A. The probability that the population mean is greater than 0 is at least 95%.
B. The probability that the population mean equals 0 is smaller than 5%.
C. There is a 95% probability that the population mean lies between 3 and 4.
D. We may be 95% confident that the interval (3;4) contains the population mean.
E. If we were to repeat the experiment over and over, then 95% of the time the population mean falls between 3 and 4.
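
The repeated-sampling idea behind the figure with 20 confidence intervals can be sketched as a simulation. Assuming a normal population with mean 24.2 and s.d. 5.9 (the values quoted for the figure; the normal shape is an assumption made here), the code draws many samples of size 100, computes $\bar{X} \pm 1.96 \cdot SE$ for each, and counts how often the interval contains µ. The helper name `ci_95`, the seed and the number of repetitions are illustrative.

```r
# Sketch: how often does a 95% CI contain the population mean?
# Assumed population: normal with mean 24.2 and s.d. 5.9; samples of size 100.
set.seed(2)  # arbitrary seed, only to make the run reproducible
mu <- 24.2
sigma <- 5.9
n <- 100

ci_95 <- function(x) {
  # Normal-based 95% CI: sample mean +/- 1.96 * standard error of the mean
  se <- sd(x) / sqrt(length(x))
  mean(x) + c(-1.96, 1.96) * se
}

# Repeat the sampling 2000 times and record whether each CI contains mu
covers <- replicate(2000, {
  ci <- ci_95(rnorm(n, mean = mu, sd = sigma))
  ci[1] <= mu && mu <= ci[2]
})
coverage <- mean(covers)
cat("Proportion of intervals containing mu:", coverage, "\n")
```

The proportion is expected to be close to 0.95. Note that µ never changes in the simulation; only the intervals do.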

Why do we need confidence intervals?
We want to estimate a parameter, e.g.
• the mean value for dice rolling
• the mean effect of a sleeping pill
• the mean IQ
Based on a sample we suggest a qualified guess (estimate)
• we are uncertain about the guess and suggest an interval of plausible values
• the interval has to be narrow
• we want a large probability (95%) of guessing right.

Small sample confidence intervals
For small samples ($n \le 60$) the CIs are better approximated by the t-distribution with df = n − 1. The 95% CI for µ is $\bar{X} \pm t_{0.025,\,df=n-1} \cdot SE$, with $t_{0.025,\,df=n-1}$ being the lower 2.5% quantile.
See a selection of quantiles in KS table A3 or calculate quantiles in R: qt(0.025, df = n - 1).

How to make conclusions based on data?
The purpose of most experiments is to prove or disprove a hypothesis. This is done by collecting data, analyzing it and drawing a conclusion. The original hypothesis is tested against the data to find out whether or not it is right.

Example of a hypothesis
636 children from Peru had their lung capacity examined.
Response: FEV (Forced Expiratory Volume, L/1s).
Scientific question: Do boys and girls have different lung capacity?
Hypothesis: $H_0$: There is no difference in lung capacity for boys and girls.
We observe: Girls: mean(FEV) = 1.54. Boys: mean(FEV) = 1.66. Observed difference = 0.12.
What can we conclude?

Examples of hypotheses
We always formulate hypotheses as no difference or no association.
Investigation of a single population (one group):
• $H_0$: The mean is equal to a specific number (e.g. mean FEV for boys is $\mu_1 = 1.5$)
  $H_A$: The mean is not equal to a specific number.
Comparison of two populations (two groups):
• $H_0$: The means are equal (i.e. $\mu_1 - \mu_2 = 0$)
  $H_A$: The means are not equal.
If there is sufficient evidence against the hypothesis, we reject $H_0$.

Test statistics
We use test statistics to find evidence against the hypothesis. Often test statistics are given by
$\frac{\text{estimate} - \text{hypothetical value}}{SD(\text{estimate})}$
We expect the test statistic to be
• numerically small if the hypothesis is true
• numerically large if the hypothesis is false.

Example: Lung capacity
Parameters: $\mu_1$: mean FEV for girls, $\mu_2$: mean FEV for boys.
Do boys and girls have different lung capacity?
Hypothesis: $H_0: \mu_2 = \mu_1$.
$\mu_2 - \mu_1$ is the parameter we investigate. 0 is the hypothetical value.

Two sample t-test
Can be used when data arise from two groups, the variances in the two groups are equal and all observations are independent.
Hypothesis: The population means in the two groups are equal, $H_0: \mu_1 = \mu_2$ against $H_A: \mu_1 \neq \mu_2$.
Summary data from the two groups:
Group 1: $n_1$, $\bar{X}_1$, $SD_1$
Group 2: $n_2$, $\bar{X}_2$, $SD_2$

Two sample t-test
First calculate a 'pooled' SD:
$SD = \sqrt{\frac{(n_1 - 1) SD_1^2 + (n_2 - 1) SD_2^2}{(n_1 - 1) + (n_2 - 1)}}$
Standard error of the difference in means ($\bar{X}_1 - \bar{X}_2$):
$SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \cdot SD$
Test statistic: t = ??
If $H_0$ is true, $t \sim t(df = n_1 + n_2 - 2)$.

Example: Lung capacity
        Girls   Boys
n       335     301
mean    1.538   1.657
SD      0.291   0.308
An estimate of the difference: $\bar{X}_2 - \bar{X}_1 = 0.119$.
Pooled SD assuming equal variances:
$SD = \sqrt{\frac{(335 - 1) \times 0.291^2 + (301 - 1) \times 0.308^2}{(335 - 1) + (301 - 1)}} = 0.299$
The test statistic:
$T = \frac{0.119}{0.299 \times \sqrt{\frac{1}{335} + \frac{1}{301}}} = 5.01$
Small or large???

P values
We use p values to assess the size of test statistics.
If the hypothesis is true and we replicate the sampling many times: How often will we obtain a test statistic numerically larger than the observed test statistic?
The p-value P(|test statistic| > |observed test statistic|) is calculated assuming that the hypothesis is true.
A small p-value corresponds to the observed test statistic being unlikely if the hypothesis is true.

Example: Lung capacity
Under the null, $T \sim t(df = 335 + 301 - 2)$.
P-value: $P(|T| > 5.01) = P(T < -5.01) + P(T > 5.01) = 2 \cdot 3.54 \times 10^{-7} = 7.09 \times 10^{-7}$
Conclusion: The observed test statistic of 5.01 would be unlikely if there were no difference between boys and girls with respect to lung capacity.
NB: If we instead use the normal distribution to calculate the p value, the test is termed a Wald test.

Interpretation of P values
The smaller the P-value, the lower the chance of getting a difference as big as the one observed if the null hypothesis is true.
Large P-value:
• The difference between observed and hypothetical value is small compared to the statistical uncertainty.
• The observed difference may be due to chance.
Small P-value:
• The difference between observed and hypothetical value is large compared to the statistical uncertainty.
• It is unlikely that the observed difference is due to chance.
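
The lung capacity calculation can be reproduced in R from the summary data alone. The sketch below follows the pooled-SD and standard-error formulas of the two sample t-test and takes the test statistic to be of the general form (estimate − hypothetical value) / SE, with hypothetical value 0. The helper name `pooled_t_test` is hypothetical, and the difference is computed as boys minus girls; reversing the order only flips the sign of the statistic, not the p-value.

```r
# Sketch: two sample t-test (equal variances) from summary statistics only.
# Group 1 = girls, group 2 = boys; the difference is taken as group 2 minus group 1.
pooled_t_test <- function(n1, m1, sd1, n2, m2, sd2) {
  df <- n1 + n2 - 2
  # Pooled SD: weighted average of the two group variances, then square root
  sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / df)
  # Standard error of the difference in means
  se_diff <- sd_pooled * sqrt(1 / n1 + 1 / n2)
  # (estimate - hypothetical value) / SE, with hypothetical value 0
  t_stat <- (m2 - m1 - 0) / se_diff
  # Two-sided p-value from the t-distribution with n1 + n2 - 2 df
  p_value <- 2 * pt(-abs(t_stat), df = df)
  list(sd_pooled = sd_pooled, se_diff = se_diff, t = t_stat, df = df, p = p_value)
}

res <- pooled_t_test(n1 = 335, m1 = 1.538, sd1 = 0.291,
                     n2 = 301, m2 = 1.657, sd2 = 0.308)
print(res)
```

With the raw observations available, one would normally call `t.test()` with `var.equal = TRUE` instead of computing the pieces by hand.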
