Statistics Workshop 2011
Ramsey A. Foty, Ph.D., Department of Surgery, UMDNJ-RWJMS

"An unsophisticated forecaster uses statistics as a drunkard uses lamp-posts: for support rather than for illumination." (Andrew Lang, 1844-1912, Scottish poet and novelist)

"Then there is the man who drowned crossing a stream with an average depth of six inches." (W.I.E. Gates, German author)

"Statistics: the only science that enables different experts using the same figures to draw different conclusions." (Evan Esar, American humorist)

Topics
• Why do we need statistics?
• Sample vs. population.
• The Gaussian (normal) distribution.
• Descriptive statistics.
 – Measures of location: mean, median, mode.
 – Measures of dispersion: range, variance, standard deviation.
 – Precision of the mean: standard error, confidence interval.
 – Outliers: Grubbs' test.
• The null hypothesis.
• Significance testing.
• Variability.
• Comparing two means: t-test, group exercise.
• Comparing three or more groups: ANOVA, group exercise.
• Linear regression.
• Power analysis.

Why do we need statistics?
• Variability can obscure important findings.
• We naturally assume that observed differences are real and not due to natural variability.
• Variability is the norm.
• Statistics allow us to draw, from the sample, conclusions about the general population.

Sample vs. Population
• Taking samples of information can be an efficient way to draw conclusions when the cost of gathering all the data is impractical.
• If you measure the concentration of Factor X in the blood of 10 people, does that accurately reflect the concentration of Factor X in the human race in general? How about in 100, 1,000, or 10,000 people? How about if you sampled everyone on the planet?
• Statistical methods were developed based on a simple model: assume that an infinitely large population of values exists and that your sample was randomly selected from it. Then use the rules of probability to make inferences about the general population.

The Gaussian Distribution
• If samples are large enough, the sample distribution will be bell-shaped.
• The Gaussian function describing this shape is $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$, where $\mu$ represents the population mean and $\sigma$ the standard deviation.
• (Figure: an example of a Gaussian distribution.)

Descriptive Statistics
• Measures of location: a typical or central value that best describes the data (mean, median, mode).
• Measures of dispersion: describe the spread (variation) of the data around that central value (range, variance, standard deviation, standard error, confidence interval).
• No single parameter can fully describe the distribution of data in the sample. Most statistics software will provide a comprehensive table describing the distribution.

Measures of Location: Mean
• More commonly referred to as "the average": the sum of the data points divided by the number of data points.
• Migration assay, distance travelled (microns) by each of nine cells:

 Cell #     1   2    3   4   5   6   7   8    9
 Distance  49  27  132  24  78  80  62  39  200

• $M = \frac{49 + 27 + 132 + 24 + 78 + 80 + 62 + 39 + 200}{9} = 76.78 \approx 77$ microns.

Measures of Location: Median
Median for odd sample size
• The value which has half the data smaller than that point and half the data larger.
• For an odd number of observations, first rank-order the data, then pick the middle number.
• Ranked migration data: 24, 27, 39, 49, 62, 78, 80, 132, 200.
• The 5th number in the ranked sequence is the median = 62 microns.
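As a quick check, the mean and median above can be reproduced with Python's standard statistics module (a minimal sketch; the variable names are mine, not the workshop's):

```python
import statistics

# Migration distances (microns) for the nine cells in the table above.
distances = [49, 27, 132, 24, 78, 80, 62, 39, 200]

mean = statistics.mean(distances)      # sum of the points / number of points
median = statistics.median(distances)  # middle value after ranking (n is odd)

print(f"mean   = {mean:.2f} microns")  # 76.78, reported as ~77
print(f"median = {median} microns")    # 62
```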
Median for even sample size
• Find the middle two numbers, add them together, and divide by 2.
• Unranked: 3, 13, 7, 5, 21, 23. Ranked: 3, 5, 7, 13, 21, 23.
• The median is (7 + 13)/2 = 10.
• The median is less sensitive to extreme scores than the mean, which makes it useful for skewed data.

Measures of Location: Mode
• The value of the sample which occurs most frequently.

 Marble color  Frequency
 Black              6
 Brown              2
 Blue              34
 Purple            72
 Pink              71
 Green             58
 Rainbow           34

• The mode of this data set is Purple, the value with the highest frequency (72).
• A measure of central tendency, though one that is only useful in very limited situations. Not all data sets have a single mode; data sets can be bimodal.

Boxplots
• Boxplots are used to display summary statistics: the median, the 25th and 75th percentiles (the box), and the smallest and largest observed values that are not outliers (the whiskers).
• Example data: 12, 13, 5, 8, 9, 20, 16, 14, 14, 6, 9, 12, 12; ranked: 5, 6, 8, 9, 9, 12, 12, 12, 13, 14, 14, 16, 20.

Measures of location do not provide information on the spread or variability of the data.

Measures of Dispersion
• Describe the spread or variability within the data.
• Two distinct samples can have the same mean but completely different levels of variability. Which mean has a higher level of variability: 110 ± 5 or 110 ± 25?
• Typical measures of dispersion include range, variance, and standard deviation.

Measures of Dispersion: Range
• The difference between the largest and smallest sample values.
• It depends only on the extreme values and provides no information about how the remaining data are distributed.
• For the cell migration data: largest distance = 200 microns, smallest distance = 24 microns, range = 200 − 24 = 176 microns.
• The range is NOT a reliable measure of dispersion of the whole data set.

Measures of Dispersion: Variance
• Defined as the average of the squared distances of each value from the mean. To calculate variance, first calculate the mean, then measure the amount that each score deviates from it:

 $s^2 = \frac{\sum (X - M)^2}{N - 1}$

• Why square? Squaring makes all the deviations positive (otherwise the negative deviations would cancel the positive ones), and it makes the bigger differences stand out: $100^2$ (10,000) is a lot bigger than $50^2$ (2,500).
• N vs. N − 1: divide by N when the data are the entire population, by N − 1 when they are a sample.
• For the cell migration data, the deviations from the mean of 77 are −28, −50, 55, −53, 1, 3, −15, −38, and 123, so the sample variance is:

 $s^2 = \frac{(-28)^2 + (-50)^2 + (55)^2 + (-53)^2 + (1)^2 + (3)^2 + (-15)^2 + (-38)^2 + (123)^2}{8} \approx 3241$

• NOT a very user-friendly statistic: its units are microns squared.

Measures of Dispersion: Standard Deviation
• The most common and useful measure of dispersion. It tells you how tightly the samples are clustered around the mean.
• When the samples are tightly bunched together, the Gaussian curve is narrow and the standard deviation is small. When the samples are spread apart, the curve is flat and the standard deviation is large.
• SD = the square root of the variance.
• For this data set, the mean and standard deviation are 77 ± 57 microns. Conclusion: there is a lot of scatter in this data set.

But then again…
• This is a fairly small sample (n = 9).
• What if we were to measure the migration of 90, 900, or 9,000 cells? Would this give us a better sense of what the average migration distance is?
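The range, variance, and standard deviation for the same migration data can be checked in the same way (again a sketch using only the standard library; statistics.variance and statistics.stdev divide by N − 1, matching the sample formulas above):

```python
import statistics

distances = [49, 27, 132, 24, 78, 80, 62, 39, 200]

data_range = max(distances) - min(distances)  # depends only on the extremes
variance = statistics.variance(distances)     # sum of squared deviations / (N - 1)
sd = statistics.stdev(distances)              # square root of the variance

print(f"range    = {data_range} microns")       # 176
print(f"variance = {variance:.1f} sq microns")  # ~3240.7
print(f"SD       = {sd:.1f} microns")           # ~56.9, reported as 57
```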
• In other words, how can we determine whether our mean is precise?

Precision of the Mean: Standard Error
• A measure of how far the sample mean is likely to be from the population mean:

 $SEM = \frac{SD}{\sqrt{N}} = \frac{57}{\sqrt{9}} = \frac{57}{3} = 19$

• The SEM gets smaller as sample size increases, since the mean of a larger sample is likely to be closer to the population mean.
• Increasing sample size does not change the scatter in the data; the SD may increase or decrease. Increasing sample size will, however, predictably reduce the standard error.

Should we show standard deviation or standard error?
• Use standard deviation if the scatter is caused by biological variability and you want to show that variability. For example: you aliquot 10 plates, each with a different cell line, and measure the integrin expression of each.
• Use standard error if the variability is caused by experimental imprecision and you want to show the precision of the calculated mean. For example: you aliquot 10 plates of the same cell line and measure the integrin expression of each.

Precision of the Mean: Confidence Intervals
• Combine the scatter in a sample with the size of that sample.
• Generate an interval in which the probability that the sample mean reflects the population mean is high.
• The formula: $CI = \bar{X} \pm (SEM \times Z)$, where $\bar{X}$ is the sample mean and Z is the critical value for the normal distribution. For the 95% CI, Z = 1.96.
• For our data set: 95% CI = 77 ± (19 × 1.96) = 77 ± 37, i.e., roughly 40 to 114 microns.
• This means that there is a 95% chance that the CI you calculated contains the population mean.

CI: A Practical Example

 Data set A: 80, 85, 90, 88, 79, 92, 88, 85, 88, 86
 Data set B: 90, 52, 30, 44, 68, 77, 55, 62, 75, 88

              Data set A   Data set B
 Mean             86.1         64.1
 SD                4.1         19.3
 SEM               1.3          6.1
 Low 95% CI       83.2         50.3
 High 95% CI      89.0         77.9

• Between these two data sets, which mean do you think best reflects the population mean, and why?
• (Figure: the same data plotted with SD, SEM, and 95% CI error bars.)

Outliers
• An outlier is an observation that is numerically distant from the rest of the data.
• Outliers can be caused by systematic error, by a flaw in the theory that generated the data point, or by natural variability.
• How to deal with outliers? In general, we first quantify the difference between the mean and the outlier, then divide by the scatter (usually the SD).

Grubbs' test

 $Z = \frac{|\text{mean} - \text{value}|}{SD}$

• For the cell migration data set: the mean is 77 microns, the sample furthest from the mean is the 200-micron point, and the SD is 57. So:

 $Z = \frac{|77 - 200|}{57} = 2.16$

What does a Z value of 2.16 mean?
• To answer this question, we ask: "If all the values were really sampled from a normal population, what is the chance of randomly obtaining an outlier so far from the other values?"
• We compare the Z value obtained with a table listing the critical value of Z at the 95% probability level. If the computed Z is larger than the critical value of Z in the table, then the P value is less than 5% and you can delete the outlier.
• For our data set: Zcalc (2.16) is less than Ztab (2.21), so P is greater than 5% and the outlier must be retained.
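A sketch tying the SEM, 95% CI, and Grubbs' Z together for the migration data; note that the critical value 2.21 for n = 9 is taken from the slide's table, not computed here:

```python
import math
import statistics

distances = [49, 27, 132, 24, 78, 80, 62, 39, 200]
n = len(distances)
mean = statistics.mean(distances)
sd = statistics.stdev(distances)

sem = sd / math.sqrt(n)                                 # 57 / 3 = 19
ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem

# Grubbs' test for the value furthest from the mean.
suspect = max(distances, key=lambda x: abs(x - mean))   # the 200-micron point
z = abs(mean - suspect) / sd                            # ~2.16
z_crit = 2.21  # 95% critical value for n = 9, taken from a Grubbs table

print(f"SEM = {sem:.1f}, 95% CI = {ci_low:.0f} to {ci_high:.0f}")
print(f"Grubbs Z = {z:.2f}; retain the outlier: {z < z_crit}")
```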
The Null Hypothesis
• Appears in the form H0: μ1 = μ2, where H0 is the null hypothesis, μ1 the mean of population 1, and μ2 the mean of population 2.
• An alternate form is H0: μ1 − μ2 = 0.
• The null hypothesis is presumed true until statistical evidence in the form of a hypothesis test proves otherwise.

Statistical Significance
• When a statistic is significant, it simply means that the statistic is reliable.
• It does not mean that it is biologically important or interesting.
• When testing the relationship between two parameters we might be sure that the relationship exists, but is it weak or strong?
• (Figure: strong vs. weak relationships, r² = 0.2381 vs. r² = 1.000.)

Significance Testing: Type I and Type II errors
• Type I error: a true null hypothesis is incorrectly rejected (a false positive).
• Type II error: a false null hypothesis fails to be rejected (a false negative).

 Statistical decision   H0 true        H0 false
 Reject H0              Type I error   Correct
 Do not reject H0       Correct        Type II error

A Practical Example
• Type I error: a pregnancy test produces a "positive" result (indicating that the woman taking the test is pregnant); if the woman is actually not pregnant, we say the test produced a "false positive".
• Type II error: the error of failing to reject a null hypothesis when the alternative hypothesis is the true state of nature, i.e., a "false negative", as when a pregnancy test reports "negative" while the woman is, in fact, pregnant.
• In significance testing we must be able to reduce the chance of rejecting a true null hypothesis to as low a value as desired. The test must be devised so that it rejects the hypothesis tested when that hypothesis is likely to be false.

Sources of Variability
• Random error: caused by inherently unpredictable fluctuations in the readings of a measurement apparatus or in the experimenter's interpretation of the instrumental reading. Can occur in either direction.
• Systematic error: predictable, and typically constant or proportional to the true value. Caused by imperfect calibration of measurement instruments or imperfect methods of observation. Typically occurs in only one direction.

Some Examples
• Random error: you measure the mass of a ring three times using the same balance and get slightly different values: 17.46 g, 17.42 g, 17.44 g. To minimize: take more data. Random errors can be evaluated through statistical analysis and reduced by averaging over a large number of observations.
• Systematic error: the electronic scale you use reads 0.05 g too high for all your mass measurements (because it is improperly tared throughout your experiment). Systematic errors are difficult to detect and cannot be analyzed statistically, because all of the data are off in the same direction (either too high or too low).

Repeatability/Reproducibility
• Repeatability: the variation in measurements taken by a single person or instrument on the same item under the same conditions. An experiment, if performed by the same person using the same equipment, reagents, and supplies, must yield the same result.
• Reproducibility: the ability of a test or experiment to be accurately reproduced or replicated by someone else working independently. Cold fusion is an example of an unreproducible experiment.

Hypothesis Testing
• Observe phenomenon → propose hypothesis → design study → collect and analyze data → interpret results → draw conclusions.
• Statistics are an important part of the study design.

Comparing Two Means
• Are these two means significantly different? Variability can strongly influence whether the means are different.
• Consider three scenarios with the same difference between means but increasing scatter: which of these will likely yield significant differences?
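To make the effect of variability concrete, here is a small simulation of my own (not from the workshop): two pairs of samples share the same true difference in means, but only the low-scatter pair typically reaches significance. It assumes NumPy and SciPy are available:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, true_diff = 10, 20            # samples per group, true difference in means

for sd in (5, 40):               # tight scatter vs. wide scatter
    a = rng.normal(100, sd, n)
    b = rng.normal(100 + true_diff, sd, n)
    t, p = stats.ttest_ind(a, b)
    # Same true difference both times; only the tight data reliably gives p < 0.05.
    print(f"SD = {sd:2d}: t = {t:6.2f}, p = {p:.4f}")
```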
Comparing Two Means: the Student t-test
• Introduced in 1908 by William Sealy Gosset, a chemist working for the Guinness Brewery in Dublin.
• He devised the t-test as a way to cheaply monitor the quality of stout.
• He was forced to use a pen name by his employer; he chose the name "Student".

Assumptions and notes:
• N < 30.
• Independent data points, except when using a paired t-test.
• Normal distribution (versions exist for equal and unequal variance).
• Random sampling.
• Equal sample size.
• Degrees of freedom are important.
• Most useful when comparing two sample means.

The Student t-test
• Given two data sets, each characterized by its mean, standard deviation, and number of samples, we can determine whether the means differ significantly by using a t-test.
• (Figure: two pairs of distributions with the same difference between means but very different variability.)

An Example

 Drop #   Sample 1   Sample 2
 1           345        134
 2           376        116
 3           292        154
 4           415        142
 5           359        177
 6           364        111
 7           298        189
 8           295        187
 9           352        166
 10          316        184

• The null hypothesis states that there is no difference in the means between samples. The procedure:
 1) Calculate the means.
 2) Calculate the SDs.
 3) Calculate the SEs.
 4) Calculate the t-value.
 5) Compare tcalc to ttab.
 6) Accept or reject H0.
• (Plot the data first: box plot, bar graph.)

1) Calculate the means:

 $M_1 = \frac{345 + 376 + 292 + 415 + 359 + 364 + 298 + 295 + 352 + 316}{10} = 341$
 $M_2 = \frac{134 + 116 + 154 + 142 + 177 + 111 + 189 + 187 + 166 + 184}{10} = 156$

2) Calculate the SDs, using $SD = \sqrt{\frac{\sum (x_i - M)^2}{N - 1}}$. The deviations from $M_1 = 341$ are 4, 35, −49, 74, 18, 23, −43, −46, 11, −25; the deviations from $M_2 = 156$ are −22, −40, −2, −14, 21, −45, 33, 31, 10, 28:

 $SD_1 = \sqrt{\frac{16 + 1225 + 2401 + 5476 + 324 + 529 + 1849 + 2116 + 121 + 625}{9}} = \sqrt{1631} \approx 40$
 $SD_2 = \sqrt{\frac{484 + 1600 + 4 + 196 + 441 + 2025 + 1089 + 961 + 100 + 784}{9}} = \sqrt{854} \approx 29$

3) Calculate the SEs:

 $SE_1 = \frac{SD_1}{\sqrt{N}} = \frac{40}{\sqrt{10}} \approx 13$
 $SE_2 = \frac{SD_2}{\sqrt{N}} = \frac{29}{\sqrt{10}} \approx 9$

4) Calculate the t-statistic (Sample 1: mean 341, SD 40, SE 13, N 10; Sample 2: mean 156, SD 29, SE 9, N 10):

 $t = \frac{M_1 - M_2}{\sqrt{SE_1^2 + SE_2^2}} = \frac{341 - 156}{\sqrt{13^2 + 9^2}} = \frac{185}{\sqrt{250}} \approx 11.7$

Now we have to compare our t-value to a table of critical t-values to determine whether the sample means differ. But first we have to determine the degrees of freedom.

Degrees of freedom
• Describe the number of values in the final calculation of a statistic that are free to vary.
• For our data set, the degrees of freedom are 2N − 2 = 2(10) − 2 = 18.
• Why 18? To calculate the SD we must first calculate the mean and then compute the sum of the squared deviations from that mean. While there are n deviations, only n − 1 are actually free to assume any value whatsoever, because we used the n values to calculate the mean. Since we have two data sets, df = 2n − 2 = 18.
• Did you hear the one about the statistician who was thrown in jail? He now has zero degrees of freedom.

5) Compare tcalc to ttab for 18 df:
• For the 95% confidence level and 18 df, ttab = 2.101. Our t-value was 11.7.
• Since tcalc > ttab, we reject H0 and conclude that the sample means are significantly different.

Your Turn…

One-tailed vs. two-tailed t-tests
• One-tailed: tests either whether the mean is significantly greater than x or whether it is significantly less than x, but not both. The one-tailed test provides more power to detect an effect in one direction by not testing the effect in the other direction.
• Two-tailed: tests both whether the mean is significantly greater than x and whether it is significantly less than x. The mean is considered significantly different from x if the test statistic is in the top 2.5% or bottom 2.5% of its probability distribution, resulting in a p-value less than 0.05.
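The entire worked example collapses to one SciPy call; this sketch assumes scipy.stats.ttest_ind with equal_var=True, which is the classic Student test described above:

```python
from scipy import stats

sample1 = [345, 376, 292, 415, 359, 364, 298, 295, 352, 316]
sample2 = [134, 116, 154, 142, 177, 111, 189, 187, 166, 184]

# Two-sided, unpaired Student t-test assuming equal variances.
t, p = stats.ttest_ind(sample1, sample2, equal_var=True)

df = len(sample1) + len(sample2) - 2
print(f"t = {t:.1f}, df = {df}")   # t ~ 11.7 with 18 degrees of freedom
print(f"p = {p:.2e}")              # far below 0.05, so reject H0
```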
Paired vs. unpaired t-tests
• Paired: the observed data are from the same subject or from a matched subject, and are drawn from a population with a normal distribution. Example: measuring glucose concentration in diabetic patients before and after insulin injection.
• Unpaired: the observed data are from two independent, random samples from a population with a normal distribution. Example: measuring glucose concentration in diabetic patients versus non-diabetics.

P values
• "If the P value is low, the null hypothesis must go."

Comparing Three or More Means
Why not just do multiple t-tests?
• If you set the significance level at 5% and do repeated t-tests, you will eventually reject the null hypothesis when you shouldn't, i.e., you increase your chance of making a Type I error. With k comparisons at α = 0.05, the chance of at least one false positive is 1 − 0.95^k:

 Number of groups   Number of comparisons   Chance of at least one Type I error (α = 0.05)
 2                    1                       0.05
 3                    3                       0.14
 4                    6                       0.26
 5                   10                       0.40
 6                   15                       0.54
 7                   21                       0.66
 8                   28                       0.76
 9                   36                       0.84
 10                  45                       0.90
 11                  55                       0.94
 12                  66                       0.97

Frog Germ Layer Experiment: multiple t-test results for germ layer surface tensions
• Endo vs. Meso: p = 0.0293, significant.
• Endo vs. Ecto: p = 0.0045, significant.
• Endo vs. Ecto under: p = 0.0028, significant.
• Meso vs. Ecto: p = 0.0512, not significant.
• Meso vs. Ecto under: p = 0.0018, significant.
• Ecto vs. Ecto under: p = 0.0007, significant.
• With 4 groups and 6 possible comparisons, there is a 26% chance of detecting a significant difference when none exists.

To compare three or more means we must use Analysis of Variance (ANOVA)
• In ANOVA we don't actually measure variance; we measure a term called the "sum of squares".
• There are three sums of squares we need to measure:
 1) Total sum of squares: the total scatter around the grand mean.
 2) Between-group sum of squares: the scatter of the group means with respect to the grand mean.
 3) Within-group sum of squares: the scatter of the scores around their group means.

Frog Germ Layer Experiment: germ layer surface tensions, t-test vs. ANOVA with a multiple comparisons test (MCT)

 Comparison            t-test   ANOVA/MCT
 Endo vs. Meso         Yes      No
 Endo vs. Ecto         Yes      No
 Endo vs. Ecto under   Yes      Yes
 Meso vs. Ecto         No       No
 Meso vs. Ecto under   Yes      Yes
 Ecto vs. Ecto under   Yes      Yes

ANOVA
• The fundamental equation for ANOVA is $SS_{Tot} = SS_{BG} + SS_{WG}$.
• From this we can calculate the mean sums of squares by dividing each sum of squares by its degrees of freedom: $MS_{BG} = \frac{SS_{BG}}{df_{BG}}$ and $MS_{WG} = \frac{SS_{WG}}{df_{WG}}$.
• We can then calculate the F statistic: $F = \frac{MS_{BG}}{MS_{WG}}$.
• To calculate the sums of squares we first need to calculate two types of means: 1) the group means ($\bar{X}_g$) and 2) the grand mean ($\bar{X}$):
 – $SS_{total} = \sum (X - \bar{X})^2$: the sum of squares of each sample value minus the grand mean.
 – $SS_{BG} = \sum n_g (\bar{X}_g - \bar{X})^2$: the sum of squares of each group mean minus the grand mean, weighted by the number of samples in each group.
 – $SS_{WG} = \sum (x - \bar{X}_g)^2$: the sum of squares of each sample value minus its group mean.

df for ANOVA
• To calculate $MS_{BG}$ and $MS_{WG}$, we need the degrees of freedom, which we partition as follows:
 – $df_{BG}$ = (number of groups) − 1. For 3 groups, df = 2.
 – $df_{WG}$ = n − 1 summed over all groups. For 30 samples (10 in each of 3 groups), df = 27.
• We can then compare Fcalc to Ftab to determine whether significant differences exist in the entire data set.

Your Turn…

One-way versus two-way ANOVA
• One-way ANOVA: 1 measurement variable and 1 nominal variable. For example, you might measure glycogen content for multiple samples of heart, liver, kidney, lung, etc.
• Two-way ANOVA: 1 measurement variable and 2 nominal variables. For example, you might measure a response to three different drugs in both men and women: drug treatment is one factor and gender is the other.
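A one-way ANOVA is similarly a single SciPy call. The three groups below are invented numbers used only to show the mechanics (they are not the frog germ layer data); tukey_hsd requires SciPy 1.8 or later:

```python
from scipy import stats

# Invented measurements for three groups of 10 (not the workshop's data).
group1 = [18, 21, 19, 22, 20, 17, 21, 19, 20, 18]
group2 = [24, 26, 23, 27, 25, 24, 26, 25, 23, 26]
group3 = [30, 28, 31, 29, 32, 30, 29, 31, 28, 30]

# f_oneway computes F = MS_BG / MS_WG with df = (k - 1, N - k) = (2, 27).
f, p = stats.f_oneway(group1, group2, group3)
print(f"F(2, 27) = {f:.1f}, p = {p:.2e}")

# If p < 0.05, follow up with a post hoc test such as Tukey's HSD.
print(stats.tukey_hsd(group1, group2, group3))  # SciPy >= 1.8
```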
• ANOVA only tells us that the smallest and largest means likely differ from each other. But what about the other means? To test them, we have to run post hoc multiple comparisons tests.

Post hoc tests
• Are only used if the null hypothesis is rejected.
• There are many, including Tukey's, Bonferroni's, Scheffé's, Dunn's, and Newman-Keuls.
• All test whether any of the group means differ significantly.
• These tests don't suffer from the same issues as performing multiple t-tests: they all apply different "corrections" to account for the multiple comparisons. Accordingly, some post hoc tests are more "stringent" than others.

Linear Regression
• The goal of linear regression is to adjust the values of slope and intercept to find the line that best predicts Y from X.
• More precisely, the goal is to minimize the sum of the squares of the vertical distances of the points from the line.
• Note that linear regression does not test whether your data are linear. It assumes that your data are linear, and finds the slope and intercept that make a straight line that best fits your data.

r², a measure of goodness-of-fit of linear regression
• The value r² is a fraction between 0.0 and 1.0, and has no units.
• An r² value of 0.0 means that knowing X does not help you predict Y.
• When r² equals 1.0, all points lie exactly on a straight line with no scatter; knowing X lets you predict Y perfectly.

How is r² calculated?
• (Figure, left panel): the best-fit linear regression line. In this example, the sum of squares of the vertical distances from that line (SSreg) equals 0.86.
• (Figure, right panel): the null hypothesis, a horizontal line through the mean of all the Y values. The goodness-of-fit of this model (SStot) is 4.907.
• The fit is compared to the null model: r² = 1 − SSreg/SStot = 1 − 0.86/4.907 ≈ 0.82.
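scipy.stats.linregress returns the least-squares slope, intercept, and r; squaring r gives the r² discussed above. The (x, y) pairs here are invented for illustration:

```python
from scipy import stats

# Invented (x, y) pairs for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

# Least-squares fit: minimizes the sum of squared vertical distances.
fit = stats.linregress(x, y)

print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}")
print(f"r^2 = {fit.rvalue ** 2:.4f}")  # close to 1.0: x predicts y almost perfectly
```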
Power Analysis: how many samples are enough?
• If the sample size is too low, the experiment will lack the precision to provide reliable answers to the questions it is investigating.
• If the sample size is too large, time and resources will be wasted, often for minimal gain.
• Calculation of power requires three pieces of information:
 1) A research hypothesis. This will determine how many control and treatment groups are required.
 2) The variability of the outcome measure. The standard deviation is the best option.
 3) An estimate of the clinically (or biologically) relevant difference: a difference between groups that is large enough to be considered important. By convention, this is set at 0.8 SD.

An Example
• We would like to design a study to compare two skin barriers for burn patients.
• We are interested in "pain" as the clinical outcome, using the "Oucher" scale (1-5).
• We know from previous studies that the Oucher scale has an SD of 1.5.
• What sample size is needed to detect a difference (D) of 1 unit on the Oucher scale? Here's the equation:

 $n = \frac{(\sigma_1^2 + \sigma_2^2)\,(z_{1-\alpha/2} + z_{1-\beta})^2}{D^2}$

• Here $z_{1-\alpha/2}$ is the critical value of z at 0.975 (1.96) and $z_{1-\beta}$ is the value for 80% power (0.84).

 $n = \frac{(1.5^2 + 1.5^2)(1.96 + 0.84)^2}{1^2} = 35.3$, rounded up to 36.

• What would happen to n if our clinically relevant difference were set at 2 Oucher units?

 $n = \frac{(1.5^2 + 1.5^2)(1.96 + 0.84)^2}{2^2} = 8.8$, rounded up to 9.

• What would happen to n if our clinically relevant difference were set at 0.5 Oucher units?

 $n = \frac{(1.5^2 + 1.5^2)(1.96 + 0.84)^2}{0.5^2} = 141.12$, rounded up to 142.

Another Example
• You want to measure whether aggregates of invasive cell lines are less cohesive than those generated from their noninvasive counterparts.
• You know that the SD for the control group is 3 dynes/cm and for the invasive group is 2 dynes/cm.
• You set α at 0.05 (z = 1.96), power at 80% (z = 0.84), and D at 2 dynes/cm. How many aggregates from each group would you need?

 $n = \frac{(3^2 + 2^2)(1.96 + 0.84)^2}{2^2} = \frac{13 \times 7.84}{4} = \frac{101.9}{4} = 25.5$, rounded up to 26.

• Therefore, we need 26 aggregates in each group to be able to reliably detect a difference of 2 dynes/cm in cohesivity between invasive and non-invasive cells.

In general, how do variability, detection difference, significance level, and power influence n?

 More variability in the data                Higher n required
 Less variability in the data                Fewer n required
 Detect small differences between groups     Higher n required
 Detect large differences between groups     Fewer n required
 Smaller α (e.g., 0.01)                      Higher n required
 Less power (smaller 1 − β)                  Fewer n required
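The sample-size formula translates directly into a small function; this sketch hard-codes the z values for α = 0.05 and 80% power, as on the slides, and reproduces all four calculations:

```python
import math

def n_per_group(sd1, sd2, delta, z_alpha=1.96, z_beta=0.84):
    """Samples per group: (sd1^2 + sd2^2)(z_alpha + z_beta)^2 / delta^2, rounded up."""
    return math.ceil((sd1 ** 2 + sd2 ** 2) * (z_alpha + z_beta) ** 2 / delta ** 2)

print(n_per_group(1.5, 1.5, 1.0))  # 36:  Oucher scale, detect 1 unit
print(n_per_group(1.5, 1.5, 2.0))  # 9:   detect 2 units
print(n_per_group(1.5, 1.5, 0.5))  # 142: detect 0.5 units
print(n_per_group(3.0, 2.0, 2.0))  # 26:  aggregate cohesivity, detect 2 dynes/cm
```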