University of Copenhagen, Department of Biostatistics
Faculty of Health Sciences

Descriptive statistics
Basic statistics for experimental researchers, Fall 2015
Julie Lyng Forman
Department of Biostatistics, University of Copenhagen

Outline
- Introduction
- Describing categorical and discrete data
- Describing continuous data
- Tiny datasets

Descriptive statistics
Methods for summarizing raw data:
- Numerical descriptive statistics: numbers.
- Graphical descriptive statistics: figures.
[Figure: Folate with confidence limits, by group (N2O+O2,24h; N2O+O2,op; O2,24h); vertical axis from 200 to 350.]

Datatypes
In order to select an appropriate statistical method for analyzing your data, you need to consider what type of data you have.
Quantitative variables:
- Continuous: can (in principle) take any value on a continuous scale, e.g. concentrations.
- Discrete: most often counts (0, 1, 2, ...).
Categorical variables:
- Nominal, i.e. unordered grouping such as gender, genotype.
- Ordinal, i.e. ordered grouping such as tumor stage.

Overview of descriptive methods
For categorical / discrete variables:
- Numerical: tabulated values, percentages.
- Graphical: barplots.
For continuous / discrete variables:
- Numerical: mean (average), standard deviation, quantiles.
- Graphical: histogram, boxplot, stripchart.
Relation between two variables:
- Numerical: cross tabulation, correlation.
- Graphical: scatterplot, side by side plots.
Repeated measurements:
- Graphical: spaghettiplots.
Descriptive statistics with SAS
Compute summary statistics with
- PROC MEANS or PROC UNIVARIATE
- PROC FREQ for tabulation or cross-tabulation
A large range of plots can be made with PROC SGPLOT
- Scatterplots, histograms, boxplots, barplots, spaghettiplots
- PROC SGPANEL makes side by side plots over multiple groups.
More about this at the SAS introduction later today ...

Summarizing categorical data
Example: Type 1 diabetes in NOD mice.
- Does treatment delay onset of diabetes?
Crosstabulated data: treatment by diabetes (at 250 days follow-up)

Frequency  |
Row Pct    |   no   |  yes   | Total
-----------+--------+--------+------
control    |      3 |      8 |    11
           |  27.27 |  72.73 |
-----------+--------+--------+------
vehicle    |      7 |      7 |    14
           |  50.00 |  50.00 |
-----------+--------+--------+------
vorinostat |     15 |      2 |    17
           |  88.24 |  11.76 |
-----------+--------+--------+------
Total      |     25 |     17 |    42

Reference: Christensen et al., Lysine deacetylase inhibition prevents diabetes by chromatin-independent immunoregulation and β-cell protection, PNAS, www.pnas.org/cgi/doi/10.1073/pnas.1320850111

Barplot for crosstabulated data
Used both for exploratory data analysis and for presentation.
- Side by side or stacked.
- Note: Total numbers, not percentages.
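The row percentages in the cross-tabulation of treatment by diabetes status can be checked by hand. The following is a minimal Python sketch, not the SAS workflow used in the course; the dictionary layout and the helper name `row_percentages` are illustrative assumptions, while the counts are copied from the table.

```python
# Counts copied from the cross-tabulation (diabetes status at 250 days follow-up).
# The nested-dict layout and function name are hypothetical, chosen for illustration.
counts = {
    "control": {"no": 3, "yes": 8},
    "vehicle": {"no": 7, "yes": 7},
    "vorinostat": {"no": 15, "yes": 2},
}


def row_percentages(table):
    """Express each cell as a percentage of its row total (2 decimals)."""
    result = {}
    for group, cells in table.items():
        row_total = sum(cells.values())
        result[group] = {
            status: round(100 * count / row_total, 2)
            for status, count in cells.items()
        }
    return result


row_pct = row_percentages(counts)
row_totals = {group: sum(cells.values()) for group, cells in counts.items()}
column_totals = {
    status: sum(cells[status] for cells in counts.values())
    for status in ("no", "yes")
}

for group in counts:
    print(group, counts[group], row_pct[group], "total:", row_totals[group])
print("column totals:", column_totals)
```

Row percentages answer "what share of each treatment group developed diabetes", which is the comparison of interest here; column percentages would answer a different question.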
Continuous data
Example: Hemoglobin in blood samples from 70 women (g/100ml)

10.2 12.9 13.7 12.7 11.7 11.6 13.3 13.3 10.5 11.4
11.2 10.2 10.8  9.7 10.6 12.9 14.6  8.8 14.7 13.1
11.0 12.1 13.5 11.1 11.3 11.6 12.3 12.2  9.3 12.9
10.9 13.0 13.4 11.8 12.0 12.1 12.5 12.9 13.1 11.0
13.4 11.4 10.7 10.9 12.0 11.7 11.9 15.1 13.5 11.5
10.8 13.6 11.2 11.1 14.9 13.2 10.3 11.9 14.6 10.4
 9.4 14.1 11.4 10.4 13.7 12.1 11.8 10.6 10.9 12.5

We are interested in the distribution, not the individual measurements.
Reference: Kirkwood & Sterne, Essential Medical Statistics, Blackwell Publishing, 2003.

Histogram
Used when exploring data and for model checking.
- The area of each bar matches the proportion of data within the interval.
Do we see a normal distribution?

Cumulative distribution
Main usage: Illustration in a basic statistics book.
Each step represents an observation at this value.
- Stepsize for hemoglobin is 1/70 × 100% = 1.43%.

Cumulative distribution and quantiles
- Median: 11.85. Quartiles: 10.9 and 13.1.

QQ-plots
Used for model checking.
Empirical quantiles vs quantiles of a theoretical distribution.
- Hemoglobin data agree well with the normal distribution.

Box and whiskers plot
- The box shows the median and the quartiles.
- The whiskers extend to the minimum and maximum, but no further from the box than 1.5 times the length of the box.
- Data points beyond the whiskers are highlighted as potential outliers.

Overview of numerical methods
The most common summary statistics.
Central tendency: What is the most "typical" measurement?
- Median (50% above, 50% below).
- Average (aka the sample mean).
Variability: What is the typical distance to the center?
- Range (minimum to maximum).
- Quartiles (interval containing the central 50% of the data).
- Standard deviation (SD).

Mean and standard deviation
The sample mean (or average) is:
\[ \bar{x} = \frac{\sum x}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} \]
The standard deviation is:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
- Note: division by the degrees of freedom n − 1 rather than n.
Small values indicate that the data are gathered closely around the sample mean, while larger values indicate a larger variation between subjects.

Interpretation of the standard deviation
The normal range x̄ ± 2s contains ≈ 95% of the data in a normal distribution.
WARNING: not useful if data is skewed (next slide).
Example: For hemoglobin 11.98 ± 2 × 1.42 ≈ (9.14; 14.82).

About normal ranges
The formula x̄ ± 2s is only valid if the data have been sampled from a normal distribution.
If the distribution is not normal, the computed normal range may be misleading, especially if the distribution is markedly skewed.
Instead, estimate the normal range from the quantiles of the data (2.5% quantile; 97.5% quantile).
- This demands a fairly large sample size.

Which summary statistics should I report?
Two descriptions of hemoglobin data:
Mean (SD): 11.98 (1.42).
Median (quartiles): 11.85 (10.9; 13.1).
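Both descriptions of the hemoglobin data can be recomputed from the 70 listed measurements. The sketch below uses Python's standard `statistics` module rather than SAS; the choice of `method="exclusive"` for the quartiles is an assumption (quartile definitions differ between software packages), and the variable names are illustrative.

```python
import statistics

# The 70 hemoglobin measurements (g/100ml) listed in the example.
hemoglobin = [
    10.2, 12.9, 13.7, 12.7, 11.7, 11.6, 13.3, 13.3, 10.5, 11.4,
    11.2, 10.2, 10.8, 9.7, 10.6, 12.9, 14.6, 8.8, 14.7, 13.1,
    11.0, 12.1, 13.5, 11.1, 11.3, 11.6, 12.3, 12.2, 9.3, 12.9,
    10.9, 13.0, 13.4, 11.8, 12.0, 12.1, 12.5, 12.9, 13.1, 11.0,
    13.4, 11.4, 10.7, 10.9, 12.0, 11.7, 11.9, 15.1, 13.5, 11.5,
    10.8, 13.6, 11.2, 11.1, 14.9, 13.2, 10.3, 11.9, 14.6, 10.4,
    9.4, 14.1, 11.4, 10.4, 13.7, 12.1, 11.8, 10.6, 10.9, 12.5,
]

n = len(hemoglobin)
mean = statistics.mean(hemoglobin)
sd = statistics.stdev(hemoglobin)  # sample SD: divides by n - 1
median = statistics.median(hemoglobin)

# Quartiles; the interpolation method is an assumption, other software may differ.
q1, _, q3 = statistics.quantiles(hemoglobin, n=4, method="exclusive")

# Normal range mean +/- 2 SD, meaningful only for roughly normal data.
normal_range = (mean - 2 * sd, mean + 2 * sd)

print(f"n = {n}")
print(f"Mean (SD): {mean:.2f} ({sd:.2f})")
print(f"Median (quartiles): {median:.2f} ({q1:.1f}; {q3:.1f})")
print(f"Normal range: ({normal_range[0]:.2f}; {normal_range[1]:.2f})")
```

The normal range printed here is computed from unrounded values of the mean and SD, so its last digits can differ slightly from a range computed from the rounded figures 11.98 and 1.42.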
Mean (SD) is most often reported, but
- it is sensitive to extreme observations (outliers),
- if the distribution is skewed, the inferred normal range is wrong.
In these cases the median and quartiles should be reported instead.

Describing tiny datasets
Summary statistics are not sensible when the sample size is very small!
You get a reasonably short and very accurate description of the data by showing the full data:
6.11, 6.54, 8.75
while using summary statistics gives a false impression of what the distribution looks like.
Mean (SD): 7.13 (1.41), but is the data normal?
Median (quartiles): 6.54 (6.32; 7.64), but what is 25% of 3?

Visualisation of tiny datasets
Stripchart:
[Figure: stripchart of folate with confidence limits, by group (N2O+O2,24h; N2O+O2,op; O2,24h); vertical axis from 200 to 350.]
- Note: Overlaid confidence intervals for group means.
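The question "what is 25% of 3?" from the tiny datasets example can be made concrete: with three observations, the quartiles depend entirely on the interpolation convention. The sketch below compares the two conventions offered by Python's `statistics.quantiles`; these are not necessarily the definitions used by SAS, and the variable names are illustrative.

```python
import statistics

# The three observations from the tiny dataset example.
tiny = [6.11, 6.54, 8.75]

# Two quartile conventions available in the standard library.
inclusive = statistics.quantiles(tiny, n=4, method="inclusive")
exclusive = statistics.quantiles(tiny, n=4, method="exclusive")

print("inclusive quartiles:", [round(q, 3) for q in inclusive])
print("exclusive quartiles:", [round(q, 3) for q in exclusive])
print("difference in lower quartile:", round(inclusive[0] - exclusive[0], 3))
```

Under the "inclusive" convention the quartiles fall halfway between neighbouring observations, close to the values quoted in the example; under the "exclusive" convention they collapse onto the minimum and maximum. Both conventions agree on the median, but neither pair of quartiles tells the reader more than the three raw values do.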