university of copenhagen
department of biostatistics
Faculty of Health Sciences
Descriptive statistics
Basic statistics for experimental researchers, Fall 2015
Julie Lyng Forman
Department of Biostatistics, University of Copenhagen
Outline
Introduction
Describing categorical and discrete data
Describing continuous data
Tiny datasets
Descriptive statistics
Methods for summarizing raw data:
- Numerical descriptive statistics: Numbers.
- Graphical descriptive statistics: Figures.
[Figure: Folate with confidence limits for the groups N2O+O2,24h; N2O+O2,op; and O2,24h.]
Data types
In order to select an appropriate statistical method for analyzing
your data, you need to consider what type of data you have.
Quantitative variables:
- Continuous: can (in principle) take any value on a continuous scale, e.g. concentrations.
- Discrete: most often counts (0, 1, 2, ...).
Categorical variables:
- Nominal, i.e. unordered grouping such as gender, genotype.
- Ordinal, i.e. ordered grouping such as tumor stage.
Overview of descriptive methods
For categorical / discrete variables:
- Numerical: Tabulated values, percentages.
- Graphical: Barplots.
For continuous / discrete variables:
- Numerical: Mean (average), standard deviation, quantiles.
- Graphical: Histogram, boxplot, stripchart.
Relation between two variables:
- Numerical: Cross tabulation, correlation.
- Graphical: Scatterplot, side by side plots.
Repeated measurements:
- Graphical: Spaghettiplots.
Descriptive statistics with SAS
Compute summary statistics with
- PROC MEANS or PROC UNIVARIATE
- PROC FREQ for tabulation or cross-tabulation
A large range of plots can be made with PROC SGPLOT
- Scatterplots, histograms, boxplots, barplots, spaghettiplots
- PROC SGPANEL makes side by side plots over multiple groups.
More about this at the SAS introduction later today ...
Summarizing categorical data
Example: Type 1 diabetes in NOD mice.
- Does treatment delay onset of diabetes?
Crosstabulated data:

Frequency  |  diabetes (at 250 days follow-up)
Row Pct    |    no    |   yes    | Total
-----------+----------+----------+------
control    |     3    |     8    |   11
           |   27.27  |   72.73  |
-----------+----------+----------+------
vehicle    |     7    |     7    |   14
           |   50.00  |   50.00  |
-----------+----------+----------+------
vorinostat |    15    |     2    |   17
           |   88.24  |   11.76  |
-----------+----------+----------+------
Total      |    25    |    17    |   42

Reference: Christensen et al., Lysine deacetylase inhibition prevents diabetes by
chromatin-independent immunoregulation and β-cell protection, PNAS,
www.pnas.org/cgi/doi/10.1073/pnas.1320850111
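The row percentages in the table follow directly from the counts. A minimal sketch in Python (the course itself uses SAS, where PROC FREQ produces such tables; the names `counts` and `row_percentages` are illustrative assumptions) that recomputes them:

```python
# Counts from the cross-tabulation: diabetes status at 250 days follow-up.
counts = {
    "control":    {"no": 3,  "yes": 8},
    "vehicle":    {"no": 7,  "yes": 7},
    "vorinostat": {"no": 15, "yes": 2},
}


def row_percentages(table):
    """Return, for each row, the percentage of the row total in each column."""
    result = {}
    for group, row in table.items():
        total = sum(row.values())
        result[group] = {outcome: 100 * n / total for outcome, n in row.items()}
    return result


row_pct = row_percentages(counts)
for group, row in row_pct.items():
    print(f"{group:<10}  no: {row['no']:5.2f}%  yes: {row['yes']:5.2f}%")
```

Row percentages are the natural choice here because the question is how the outcome is distributed within each treatment group.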
Barplot for crosstabulated data
Used both for exploratory data analysis and for presentation.
- Side by side or stacked.
- Note: The bars show total numbers, not percentages.
Continuous data
Example: Hemoglobin in blood samples from 70 women (g/100ml)
10.2  12.9  13.7  12.7  11.7  11.6  13.3  13.3  10.5  11.4
11.2  10.2  10.8   9.7  10.6  12.9  14.6   8.8  14.7  13.1
11.0  12.1  13.5  11.1  11.3  11.6  12.3  12.2   9.3  12.9
10.9  13.0  13.4  11.8  12.0  12.1  12.5  12.9  13.1  11.0
13.4  11.4  10.7  10.9  12.0  11.7  11.9  15.1  13.5  11.5
10.8  13.6  11.2  11.1  14.9  13.2  10.3  11.9  14.6  10.4
 9.4  14.1  11.4  10.4  13.7  12.1  11.8  10.6  10.9  12.5
We are interested in the distribution, not the individual
measurements.
Reference: Kirkwood & Sterne, Essential Medical Statistics, Blackwell
Publishing, 2003.
Histogram
Used when exploring data and for model checking.
- The area of each box matches the proportion of data within
the interval. Do we see a normal distribution?
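To make the link between box size and proportion concrete, the Python sketch below (not SAS) bins the hemoglobin values and computes the proportion of data in each interval. The bin width of 1 g/100ml and the bin edges at whole numbers are assumptions made for illustration, not necessarily those used in the figure.

```python
import math
from collections import Counter

hemoglobin = [
    10.2, 12.9, 13.7, 12.7, 11.7, 11.6, 13.3, 13.3, 10.5, 11.4,
    11.2, 10.2, 10.8,  9.7, 10.6, 12.9, 14.6,  8.8, 14.7, 13.1,
    11.0, 12.1, 13.5, 11.1, 11.3, 11.6, 12.3, 12.2,  9.3, 12.9,
    10.9, 13.0, 13.4, 11.8, 12.0, 12.1, 12.5, 12.9, 13.1, 11.0,
    13.4, 11.4, 10.7, 10.9, 12.0, 11.7, 11.9, 15.1, 13.5, 11.5,
    10.8, 13.6, 11.2, 11.1, 14.9, 13.2, 10.3, 11.9, 14.6, 10.4,
     9.4, 14.1, 11.4, 10.4, 13.7, 12.1, 11.8, 10.6, 10.9, 12.5,
]

BIN_WIDTH = 1.0  # assumed bin width (g/100ml)

# Count observations per interval [left, left + BIN_WIDTH).
bin_counts = Counter(math.floor(x / BIN_WIDTH) * BIN_WIDTH for x in hemoglobin)

n = len(hemoglobin)
# Bar height on the density scale, so that height * width = proportion.
density = {left: count / (n * BIN_WIDTH) for left, count in bin_counts.items()}

for left in sorted(bin_counts):
    proportion = density[left] * BIN_WIDTH
    print(f"[{left:4.1f}, {left + BIN_WIDTH:4.1f}): "
          f"{bin_counts[left]:2d} obs, proportion {proportion:.3f}")

# The areas of all bars add up to 1 (100% of the data).
total_area = sum(height * BIN_WIDTH for height in density.values())
```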
Cumulative distribution
Main usage: Illustration in a basic statistics book.
Each step represents an observation at this value.
- Step size for hemoglobin is 1/70 × 100% ≈ 1.43%.
Cumulative distribution and quantiles
- Median: 11.85. Quartiles: 10.9 and 13.1.
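The cumulative distribution and the quantiles read from it can be sketched in Python (not the course's SAS). The helper `ecdf` is an illustrative name, and `method="inclusive"` in `statistics.quantiles` is only one of several quartile conventions; others can give slightly different values.

```python
import statistics

hemoglobin = [
    10.2, 12.9, 13.7, 12.7, 11.7, 11.6, 13.3, 13.3, 10.5, 11.4,
    11.2, 10.2, 10.8,  9.7, 10.6, 12.9, 14.6,  8.8, 14.7, 13.1,
    11.0, 12.1, 13.5, 11.1, 11.3, 11.6, 12.3, 12.2,  9.3, 12.9,
    10.9, 13.0, 13.4, 11.8, 12.0, 12.1, 12.5, 12.9, 13.1, 11.0,
    13.4, 11.4, 10.7, 10.9, 12.0, 11.7, 11.9, 15.1, 13.5, 11.5,
    10.8, 13.6, 11.2, 11.1, 14.9, 13.2, 10.3, 11.9, 14.6, 10.4,
     9.4, 14.1, 11.4, 10.4, 13.7, 12.1, 11.8, 10.6, 10.9, 12.5,
]


def ecdf(data, x):
    """Empirical cumulative distribution: proportion of observations <= x."""
    return sum(value <= x for value in data) / len(data)


n = len(hemoglobin)
step_percent = 100 / n  # each observation raises the curve by this much

median = statistics.median(hemoglobin)
q1, _, q3 = statistics.quantiles(hemoglobin, n=4, method="inclusive")

print(f"step size: {step_percent:.2f}%")
print(f"median: {median:.2f}, quartiles: {q1:.1f} and {q3:.1f}")
print(f"ECDF at the median: {ecdf(hemoglobin, median):.2f}")
```

The median is the value where the cumulative distribution crosses 50%, and the quartiles are where it crosses 25% and 75%.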
QQ-plots
Used for model checking.
Empirical quantiles vs quantiles of a theoretical distribution.
- Hemoglobin data agree well with the normal distribution.
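A QQ-plot pairs each sorted observation with the corresponding quantile of the theoretical distribution. The Python sketch below (not SAS) computes these pairs for the hemoglobin data against the standard normal distribution. The plotting positions (i - 0.5)/n are one common convention and an assumption here, and the correlation of the pairs is only a rough numerical stand-in for looking at the plot.

```python
import math
from statistics import NormalDist

hemoglobin = [
    10.2, 12.9, 13.7, 12.7, 11.7, 11.6, 13.3, 13.3, 10.5, 11.4,
    11.2, 10.2, 10.8,  9.7, 10.6, 12.9, 14.6,  8.8, 14.7, 13.1,
    11.0, 12.1, 13.5, 11.1, 11.3, 11.6, 12.3, 12.2,  9.3, 12.9,
    10.9, 13.0, 13.4, 11.8, 12.0, 12.1, 12.5, 12.9, 13.1, 11.0,
    13.4, 11.4, 10.7, 10.9, 12.0, 11.7, 11.9, 15.1, 13.5, 11.5,
    10.8, 13.6, 11.2, 11.1, 14.9, 13.2, 10.3, 11.9, 14.6, 10.4,
     9.4, 14.1, 11.4, 10.4, 13.7, 12.1, 11.8, 10.6, 10.9, 12.5,
]

n = len(hemoglobin)
empirical = sorted(hemoglobin)

# Standard normal quantiles at the assumed plotting positions (i - 0.5) / n.
standard_normal = NormalDist(mu=0, sigma=1)
theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Points close to a straight line (correlation near 1) support normality.
qq_correlation = pearson(theoretical, empirical)
print(f"QQ correlation: {qq_correlation:.3f}")
```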
Box and whiskers plot
- The box shows the median and the quartiles.
- The whiskers extend to the minimum and maximum,
but no further than 1.5 times the length of the box from the box edges.
- Data points beyond the whiskers are highlighted as potential outliers.
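The whisker rule can be sketched in Python (not SAS) for the hemoglobin data. This assumes the common convention in which each whisker stops at the most extreme observation within 1.5 box lengths of the box, and it uses the "inclusive" quartile method of `statistics.quantiles`; plotting software may differ in both choices.

```python
import statistics

hemoglobin = [
    10.2, 12.9, 13.7, 12.7, 11.7, 11.6, 13.3, 13.3, 10.5, 11.4,
    11.2, 10.2, 10.8,  9.7, 10.6, 12.9, 14.6,  8.8, 14.7, 13.1,
    11.0, 12.1, 13.5, 11.1, 11.3, 11.6, 12.3, 12.2,  9.3, 12.9,
    10.9, 13.0, 13.4, 11.8, 12.0, 12.1, 12.5, 12.9, 13.1, 11.0,
    13.4, 11.4, 10.7, 10.9, 12.0, 11.7, 11.9, 15.1, 13.5, 11.5,
    10.8, 13.6, 11.2, 11.1, 14.9, 13.2, 10.3, 11.9, 14.6, 10.4,
     9.4, 14.1, 11.4, 10.4, 13.7, 12.1, 11.8, 10.6, 10.9, 12.5,
]

q1, median, q3 = statistics.quantiles(hemoglobin, n=4, method="inclusive")
box_length = q3 - q1  # the interquartile range

# Whiskers may not reach beyond these limits.
lower_fence = q1 - 1.5 * box_length
upper_fence = q3 + 1.5 * box_length

inside = [x for x in hemoglobin if lower_fence <= x <= upper_fence]
lower_whisker, upper_whisker = min(inside), max(inside)
outliers = sorted(x for x in hemoglobin if x < lower_fence or x > upper_fence)

print(f"box: {q1:.1f} to {q3:.1f}, median {median:.2f}")
print(f"whiskers: {lower_whisker} to {upper_whisker}")
print(f"potential outliers: {outliers}")
```

For these data no observation lies beyond the limits, so the whiskers end at the minimum and maximum and no potential outliers are flagged.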
Overview of numerical methods
The most common summary statistics.
Central tendency: What is the most "typical" measurement?
- Median (50% above, 50% below).
- Average (aka the sample mean).
Variability: What is the typical distance to the center?
- Range (minimum – maximum).
- Quartiles (interval containing the central 50% of the data).
- Standard deviation (SD).
Mean and standard deviation
The sample mean (or average) is:
\[ \bar{x} = \frac{\sum x}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} \]
The standard deviation is:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
- Note: division by degrees of freedom n − 1 rather than n.
Small values indicate that data is gathered closely around the
sample mean, while larger values indicate a larger variation
between subjects.
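As a check on the formulas, the following Python sketch (not SAS) computes the sample mean and standard deviation of the hemoglobin data directly from the definitions and compares the result with the standard library.

```python
import math
import statistics

hemoglobin = [
    10.2, 12.9, 13.7, 12.7, 11.7, 11.6, 13.3, 13.3, 10.5, 11.4,
    11.2, 10.2, 10.8,  9.7, 10.6, 12.9, 14.6,  8.8, 14.7, 13.1,
    11.0, 12.1, 13.5, 11.1, 11.3, 11.6, 12.3, 12.2,  9.3, 12.9,
    10.9, 13.0, 13.4, 11.8, 12.0, 12.1, 12.5, 12.9, 13.1, 11.0,
    13.4, 11.4, 10.7, 10.9, 12.0, 11.7, 11.9, 15.1, 13.5, 11.5,
    10.8, 13.6, 11.2, 11.1, 14.9, 13.2, 10.3, 11.9, 14.6, 10.4,
     9.4, 14.1, 11.4, 10.4, 13.7, 12.1, 11.8, 10.6, 10.9, 12.5,
]

n = len(hemoglobin)
mean = sum(hemoglobin) / n

# Sum of squared deviations divided by the degrees of freedom n - 1.
sd = math.sqrt(sum((x - mean) ** 2 for x in hemoglobin) / (n - 1))

# statistics.stdev also divides by n - 1 (statistics.pstdev divides by n).
sd_library = statistics.stdev(hemoglobin)

print(f"mean = {mean:.2f}, SD = {sd:.2f} (library: {sd_library:.2f})")
```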
Interpretation of the standard deviation
The normal range x̄ ± 2s contains ≈ 95% of the data in a normal
distribution. WARNING: not useful if data is skewed (next slide).
Example: For hemoglobin, 11.98 ± 2 × 1.42 ≈ (9.14; 14.82).
About normal ranges
The formula x̄ ± 2s is only valid if data have been sampled from a
normal distribution.
If the distribution is not normal, the computed normal range may
be misleading, especially if the distribution is markedly skewed.
Instead estimate the normal range from the quantiles of the data
(2.5% quantile; 97.5% quantile).
- This demands a fairly large sample size.
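The two ways of estimating a normal range can be compared in a short Python sketch (not SAS) using the hemoglobin data. The quantile interpolation is the default of `statistics.quantiles`, which is an assumption; other software may interpolate differently.

```python
import statistics

hemoglobin = [
    10.2, 12.9, 13.7, 12.7, 11.7, 11.6, 13.3, 13.3, 10.5, 11.4,
    11.2, 10.2, 10.8,  9.7, 10.6, 12.9, 14.6,  8.8, 14.7, 13.1,
    11.0, 12.1, 13.5, 11.1, 11.3, 11.6, 12.3, 12.2,  9.3, 12.9,
    10.9, 13.0, 13.4, 11.8, 12.0, 12.1, 12.5, 12.9, 13.1, 11.0,
    13.4, 11.4, 10.7, 10.9, 12.0, 11.7, 11.9, 15.1, 13.5, 11.5,
    10.8, 13.6, 11.2, 11.1, 14.9, 13.2, 10.3, 11.9, 14.6, 10.4,
     9.4, 14.1, 11.4, 10.4, 13.7, 12.1, 11.8, 10.6, 10.9, 12.5,
]

mean = statistics.mean(hemoglobin)
sd = statistics.stdev(hemoglobin)

# Normal range based on the normal distribution: mean +/- 2 SD.
normal_range = (mean - 2 * sd, mean + 2 * sd)

# Quantile-based range: with n=40 the cut points are 2.5% apart, so the
# first and last cut points are the 2.5% and 97.5% quantiles.
cut_points = statistics.quantiles(hemoglobin, n=40)
quantile_range = (cut_points[0], cut_points[-1])

print(f"mean +/- 2 SD:      ({normal_range[0]:.2f}; {normal_range[1]:.2f})")
print(f"2.5% to 97.5%:      ({quantile_range[0]:.2f}; {quantile_range[1]:.2f})")
```

With only 70 observations, each of the two quantiles is interpolated between the two most extreme values at its end of the data, which illustrates why the quantile-based range needs a fairly large sample.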
Which summary statistics should I report?
Two descriptions of hemoglobin data:
Mean (SD): 11.98 (1.42).
Median (quartiles): 11.85 (10.9;13.1).
Mean (SD) is most often reported, but
- it is sensitive to extreme observations (outliers).
- if the distribution is skewed, the inferred normal range is wrong.
In these cases median and quartiles should be reported instead.
Describing tiny datasets
Summary statistics are not sensible when sample size is very small!
You get a reasonably short and very accurate description of the
data by showing the full data:
6.11, 6.54, 8.75
while using summary statistics gives a false impression of what the
distribution looks like.
Mean (SD): 7.13 (1.41), but are the data normal?
Median (quartiles): 6.54 (6.32; 7.64), but what is 25% of 3?
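The question "what is 25% of 3?" can be made concrete with a Python sketch (not SAS): two common quartile conventions, both available in `statistics.quantiles`, disagree markedly for three observations. Small differences in the last digit relative to the values quoted above are presumably due to rounding of the displayed data, which is an assumption.

```python
import statistics

tiny = [6.11, 6.54, 8.75]

mean = statistics.mean(tiny)
sd = statistics.stdev(tiny)
median = statistics.median(tiny)

# Two common quartile conventions applied to the same three numbers.
inclusive = statistics.quantiles(tiny, n=4, method="inclusive")
exclusive = statistics.quantiles(tiny, n=4, method="exclusive")

print(f"mean (SD): {mean:.2f} ({sd:.2f})")
print(f"median: {median}")
print(f"quartiles, inclusive method: {inclusive[0]:.3f}; {inclusive[2]:.3f}")
print(f"quartiles, exclusive method: {exclusive[0]:.3f}; {exclusive[2]:.3f}")
```

The "inclusive" method places the quartiles halfway between neighbouring observations, while the "exclusive" method returns the minimum and the maximum, so the reported quartiles say more about the software convention than about the data.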
Visualisation of tiny datasets
Stripchart:
[Figure: stripchart of folate with confidence limits for the groups N2O+O2,24h; N2O+O2,op; and O2,24h.]
- Note: Overlaid confidence intervals for group means.