Descriptive Statistics
B.H. Robbins Scholars Series
May 27, 2010
Outline
- Measurement
- Random Variables and Distributions
  - Commonly used distributions
- Descriptive Statistics
- Graphics
  - Categorical Variables
  - Continuous Variables
Measurement
Types of measurements
- Binary: a two-category variable
  - yes/no, present/absent, 0/1
- Categorical (nominal, polytomous, discrete): at least 2 values that are not necessarily ordered
  - Race, eye color, etc.
- Ordinal: a categorical variable whose values are in a special order
  - Disease severity, Likert score, cancer stage, ASA status
- Count: a discrete value that (in theory) has no upper limit
  - Number of ER visits in a day, number of CABG surgeries in a year.
Types of measurements (cont.)
- Continuous: a numerical variable that has many possible values that represent some underlying distribution.
  - Tends to have the most information (assuming not a lot of preprocessing)
  - Turning continuous data into categories is risky: loss of information (i.e., lower power to detect effects, more people needed to have the same power).
  - Errors are not reduced by categorization unless that is the only way to get someone to answer the question (e.g., with income)
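
The information lost by categorizing a continuous variable can be illustrated with a small simulation. The sketch below is not part of the original slides: the sample size, the true correlation of 0.5, the median split, and the helper name `pearson` are all assumptions chosen for illustration.

```python
import math
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation computed from its definition."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2010)          # fixed seed so the run is repeatable
n, rho = 10_000, 0.5               # assumed sample size and true correlation

x = [rng.gauss(0, 1) for _ in range(n)]
# y is constructed so that its correlation with x is rho
y = [rho * xi + math.sqrt(1 - rho**2) * rng.gauss(0, 1) for xi in x]

# Dichotomize x at its sample median (a "median split")
cut = statistics.median(x)
x_binary = [1 if xi > cut else 0 for xi in x]

r_continuous = pearson(x, y)
r_binary = pearson(x_binary, y)
print(f"correlation using continuous x:   {r_continuous:.3f}")
print(f"correlation using dichotomized x: {r_binary:.3f}")
```

The same subjects show a weaker association once x is split into two groups, which is one way the loss of power appears in practice.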
Random Variables and Distributions
Random Variables: X
- A potential measurement
- A quantity that may vary from subject to subject
- A variable whose values are random but whose probability distribution is known.
- A function that assigns a unique value to every outcome of an experiment.
- Once the random variable X is observed, it is a sample value (x).
Distributions of Random Variables (RV)
- The distribution of the RV, X, is a profile of its tendencies
- Depending upon the type of RV, the distribution may be completely characterized by one or a few parameters.
- Binary variable:
  - The probability of a 'yes' or 'present' or 1
  - In the sample this is a proportion
- K-category categorical variable (multinomial):
  - The probability that a randomly chosen subject will be from category 1, 2, ..., K
  - In a sample this is the proportion falling into each category
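
As a minimal sketch of how these probabilities are estimated from a sample, the following computes sample proportions for a binary and a three-category variable. Both data sets are hypothetical and were made up for this illustration.

```python
from collections import Counter

# Hypothetical sample of a binary variable (1 = 'yes', 0 = 'no')
binary_sample = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
p_hat = sum(binary_sample) / len(binary_sample)

# Hypothetical sample of a 3-category variable
eye_color = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "brown"]
counts = Counter(eye_color)
proportions = {category: count / len(eye_color) for category, count in counts.items()}

print(p_hat)          # sample proportion of 'yes'
print(proportions)    # sample proportion in each category
```

The category proportions always sum to 1, so a K-category variable needs only K − 1 of them to be specified.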
Distributions of Random Variables (RV)
- Continuous variable:
  1. Probability density: the value of x is on the x-axis, and the relative probability of observing values close to it is on the y-axis. In a sample this is a histogram.
  2. Cumulative probability distribution: the y-axis is the probability of X ≤ x. This function only rises or stays flat. For a sample it is a cumulative histogram.
  3. All percentiles of X
  4. All moments of X (e.g., mean(X), mean(X^2), mean(X^3), ...)
  5. If we know one of these, we can derive the others.
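
The sample version of the cumulative distribution can be sketched in a few lines. The helper name `ecdf` and the sample values below are assumptions for illustration only; they do not come from the slides.

```python
def ecdf(sample):
    """Return a function F where F(x) is the fraction of the sample <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        return sum(1 for value in ordered if value <= x) / n

    return F

# Hypothetical sample values
sample = [2.1, 0.4, 3.3, 1.7, 0.9, 2.8, 1.2, 4.0]
F = ecdf(sample)

grid = [0, 1, 2, 3, 4]
heights = [F(x) for x in grid]
print(heights)   # only rises or stays flat as x increases
```

Reading across from a height of 0.5 to the curve and down to the x-axis gives the sample median, which is how percentiles can be recovered from the cumulative distribution.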
Distributions of Random Variables (RV)
- If the distribution is known, we may be able to make reasonable guesses about future observations
- It is much harder to guess a single value x than it is to estimate a mean from a distribution.
- At the very least, the distribution tells us the proportions of people we expect to see within each interval of interest
Commonly used distributions
Normal distribution
- Characterized by two parameters: mean (µ) and variance (σ²)
- Mean is a measure of central tendency
- Variance is the mean of the squared deviations from the mean.
- µ and σ² are completely independent of each other.
- Symmetric distribution
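
Written out, with E denoting expectation, the two parameters and the density they determine are as follows. The density is the standard textbook form and does not appear on the original slide.

```latex
\mu = \mathrm{E}[X], \qquad
\sigma^2 = \mathrm{E}\!\left[(X - \mu)^2\right], \qquad
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}
       \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
```

Because x enters only through (x − µ)², the density takes the same value at µ + d and µ − d, which is the symmetry noted in the last bullet.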
[Figure: four density curves, Normal(−2,4), Normal(−2,1), Normal(2,4), and Normal(2,1); x-axis: x, from −10 to 10; y-axis: Density]
Bernoulli distribution
- The random variable is binary
- Characterized by one parameter: probability (p)
- The mean is p and the variance is p(1 − p)
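
The mean and variance can be checked directly from the two-point distribution. In this sketch the function name `bernoulli_mean_variance` is an assumption; the values of p are the ones used in the accompanying figure.

```python
import math

def bernoulli_mean_variance(p):
    """Mean and variance computed directly from the two-point distribution."""
    outcomes = {0: 1 - p, 1: p}            # value -> probability
    mean = sum(value * prob for value, prob in outcomes.items())
    variance = sum((value - mean) ** 2 * prob for value, prob in outcomes.items())
    return mean, variance

for p in (0.1, 0.25, 0.5, 0.9):
    mean, variance = bernoulli_mean_variance(p)
    print(f"p={p}: mean={mean:.4f}, variance={variance:.4f}, p(1-p)={p * (1 - p):.4f}")
```

The variance is largest at p = 0.5 and shrinks toward 0 as p approaches 0 or 1.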
[Figure: four bar charts of the proportion of No and Yes outcomes, for Bernoulli(0.5), Bernoulli(0.1), Bernoulli(0.9), and Bernoulli(0.25); y-axis: Proportion, from 0 to 1]
Poisson distribution
- Characterized by a single parameter: mean (λ)
- The Poisson distribution is commonly used in modeling a count per interval of time
- In a Poisson distribution, the variance is equal to the mean.
- Generally right skewed, but approaches a symmetric distribution as λ increases
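
The equality of mean and variance can be verified numerically from the Poisson probabilities Pr(X = k) = e^(−λ) λ^k / k!. This is a sketch under stated assumptions: the function names are made up for illustration, and the sum is truncated at k = 100, beyond which the probabilities are negligible for the values of λ used here.

```python
import math

def poisson_pmf(lam, k_max):
    """Probabilities Pr(X = k) for k = 0..k_max, built up recursively."""
    probs = [math.exp(-lam)]                 # Pr(X = 0)
    for k in range(1, k_max + 1):
        probs.append(probs[-1] * lam / k)    # Pr(X = k) = Pr(X = k-1) * lam / k
    return probs

def mean_and_variance(probs):
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, variance

for lam in (1, 2, 5, 10):
    # k_max = 100 is an assumed truncation point
    mean, variance = mean_and_variance(poisson_pmf(lam, 100))
    print(f"lambda={lam}: mean={mean:.6f}, variance={variance:.6f}")
```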
[Figure: four probability plots, Poisson(5), Poisson(2), Poisson(10), and Poisson(1); x-axis: x, from 0 to 20; y-axis: Density]
Exponential distribution
- Characterized by a single parameter: rate (λ)
- It is used to describe the time until an event occurs (a train arrives, a light bulb burns out, a patient has an MI)
- The mean of an exponential is 1/λ and the variance is 1/λ²
- A relaxed version of this distribution is very commonly used in medical research for survival analysis.
- Right or positively skewed
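
A short simulation gives a feel for the mean 1/λ and variance 1/λ². The seed, the number of simulated times, and the choice of λ = 0.5 are assumptions made for this sketch.

```python
import random

rng = random.Random(34)      # fixed seed so the run is repeatable
rate = 0.5                   # assumed rate (lambda)
n = 50_000                   # assumed number of simulated waiting times

times = [rng.expovariate(rate) for _ in range(n)]

sample_mean = sum(times) / n
sample_variance = sum((t - sample_mean) ** 2 for t in times) / (n - 1)

print(f"sample mean     {sample_mean:.3f}  (theory: {1 / rate:.3f})")
print(f"sample variance {sample_variance:.3f}  (theory: {1 / rate**2:.3f})")
```

The simulated values are close to, but not exactly, 2 and 4, because they are estimates from a finite sample.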
[Figure: four density curves, Exponential(0.2), Exponential(0.5), Exponential(0.05), and Exponential(1); x-axis: x, from 0 to 100; y-axis: Density]
Descriptive Statistics
Continuous Variables
- Let $x_1, x_2, \ldots, x_n$ denote sample values
- Measures of location or the 'center' of a sample
  - Arithmetic average: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
    - The population mean is the value that $\bar{x}$ converges to as $n \to \infty$
    - Highly influenced by a single value (which may be a bad thing if that value is not real)
  - Median: the middle among all sorted values (i.e., half the sample is greater and half is smaller)
    - Usually very descriptive
    - Not affected by an outlying value
Continuous Variables: Measures of location
[-1.52 -1.36 -1.00 -0.93 -0.83 -0.59 -0.05 0.03 0.62 1.18 1.32]
- Sample average = -0.28
- Median = -0.59
[-1.52 -1.36 -1.00 -0.93 -0.83 -0.59 -0.05 0.03 0.62 11.8 13.2]
- Sample average = 1.76
- Median = -0.59
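
The two summaries on this slide can be reproduced with Python's standard `statistics` module. The variable names below are assumptions; the data are the two samples shown on the slide.

```python
import statistics

original = [-1.52, -1.36, -1.00, -0.93, -0.83, -0.59, -0.05, 0.03, 0.62, 1.18, 1.32]
# Same sample, but the two largest values are ten times larger
with_outliers = original[:-2] + [11.8, 13.2]

for label, sample in (("original", original), ("with outliers", with_outliers)):
    print(f"{label}: mean = {statistics.mean(sample):.2f}, "
          f"median = {statistics.median(sample):.2f}")
```

Two extreme values move the average from -0.28 to 1.76, while the median stays at -0.59.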
Continuous Variables (cont.)
- Quantiles or percentiles: can generally be used to describe central tendency, spread, symmetry, heavy tailedness, etc.
  - The p-th sample quantile, $x_p$, is the value such that the fraction p of observations fall below the value.
  - The p-th population quantile is the value x such that Pr(X ≤ x) = p
  - The sample median is the 50th percentile or 0.5 quantile
  - Quartiles: (Q1, Q2, Q3) are the (25th, 50th, 75th) percentiles
  - Quintiles: (Q1, Q2, Q3, Q4) are the (20th, 40th, 60th, 80th) percentiles
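
As a minimal sketch, quartiles and quintiles can be computed with `statistics.quantiles` from Python's standard library. The sample (the integers 0 through 100) is hypothetical, and the `"inclusive"` method is only one of several sample quantile definitions in common use; others give slightly different cut points.

```python
import statistics

# Hypothetical sample: the integers 0 through 100
sample = list(range(101))

# method="inclusive" treats the sample minimum and maximum as the 0th and
# 100th percentiles; other definitions give slightly different answers.
quartiles = statistics.quantiles(sample, n=4, method="inclusive")
quintiles = statistics.quantiles(sample, n=5, method="inclusive")

print(quartiles)   # three cut points: Q1, Q2 (the median), Q3
print(quintiles)   # four cut points
```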
Continuous Variables (cont.)
- Measures of spread or variability
  - Inter-quartile range, (Q1, Q3): an interval that contains half of all subjects
  - Variance: the average of the squared deviations between each observed value and the overall average value: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
    - Most people think these are meaningful for continuous distributions
    - n − 1 is used because we have estimated, and do not know, the true value of µ, the population mean.
  - Standard deviation: the square root of the variance
    - A normal distribution can be defined in terms of the proportion of people falling within a SD of the mean
    - About 68% of the population falls within 1 SD of the mean, and approximately 95% falls within 2 SDs of the mean.
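
The 68% and 95% figures for a normal distribution can be checked with `statistics.NormalDist` from Python's standard library. The helper name `prob_within` is an assumption made for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def prob_within(k):
    """Probability that a normal variable falls within k SDs of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

print(f"within 1 SD: {prob_within(1):.4f}")
print(f"within 2 SD: {prob_within(2):.4f}")
```

Using the standard normal loses no generality: the proportion within k SDs of the mean is the same for every normal distribution, whatever its µ and σ.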
Continuous Variables: Measures of scale or spread
[-1.52 -1.36 -1.00 -0.93 -0.83 -0.59 -0.05 0.03 0.62 1.18 1.32]
- Sample Standard Deviation = 0.98
- Interquartile range: (-0.965, 0.325)
[-1.52 -1.36 -1.00 -0.93 -0.83 -0.59 -0.05 0.03 0.62 11.8 13.2]
- Sample Standard Deviation = 5.36
- Interquartile range: (-0.965, 0.325)
Continuous Variables (cont.)
- With highly skewed or asymmetric distributions, variance and SD may not be very useful summaries of spread
  - If the median length of stay at Vanderbilt Hospital is 3 days, then having a SD of 10 due to some very sick people is difficult to interpret.
- Range: may not be useful because it can only grow as n increases and it is dominated by a single value.
- Coefficient of variation (standard deviation divided by the mean): there are few situations where this is useful, because it depends heavily on the value of the mean (especially if the mean is close to 0).
Discrete Variables
- Categorical Variables
  - The proportion of observations falling within each category
- Count Variables
  - Best to report the average.
Graphics
Categorical Variables
Chart Junk
- Chart junk (Tufte): non-data-ink or redundant data-ink
  - Not part of the minimum set of visuals needed to communicate the message understandably.
  - Elements of a display that are not necessary to understand the information contained in it
  - Can be distracting and can skew the depiction, making it difficult to understand (3-D pie and bar charts are examples)
- Ink:information ratio: the ratio of the amount of ink used to the amount of information portrayed
Categorical Variables
- Pie chart
  - High ink:information ratio
  - Can create optical illusions (perception depends on orientation vs the horizon)
  - With a lot of categories, difficult to label
- Bar chart
  - High ink:information ratio
  - Hard to depict uncertainty
  - Hard to interpret subcategories
  - Labels hard to read if bars are vertical
Categorical Variables
- Dot chart
  - Leads to the most accurate perception
  - Easy to show labels
  - Allows for multiple levels of categorization
  - Multiple categories within a single line of dots
  - Easy to show 2-sided error bars
  - Dot chart and figure 3
[Figure: dot chart of Y (x-axis, from 0.0 to 0.8) for Category a through Category m, in two panels (North and South), with the sample size n listed beside each category]
n, North panel (in the order extracted): 770, 9565, 7439, 2378, 8093, 10341, 7610, 1484, 14320, 10518, 1461, 9231, 3241
n, South panel (in the order extracted): 494, 13849, 10810, 6529, 375, 13410, 11764, 14450, 3392, 13591, 13841, 1384, 14911
Continuous Variables
- Histograms show relative frequencies (requires binning of data)
  - Not optimal for comparing multiple distributions
- Cumulative distribution functions
  - Can read all quantiles directly off the plot
- Box plot: good way to compare many groups
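
The binning step behind a histogram can be sketched as follows. The sample is the 11-value data set used earlier in these slides; the bin width of 1.0 and the text rendering of the bars are assumptions for illustration.

```python
import math
from collections import Counter

sample = [-1.52, -1.36, -1.00, -0.93, -0.83, -0.59, -0.05, 0.03, 0.62, 1.18, 1.32]
bin_width = 1.0   # assumed bin width; a different choice changes the picture

# Assign each value to the bin [left, left + bin_width) that contains it
bin_left_edges = [math.floor(x / bin_width) * bin_width for x in sample]
counts = Counter(bin_left_edges)

for left in sorted(counts):
    relative_frequency = counts[left] / len(sample)
    bar = "#" * counts[left]
    print(f"[{left:5.1f}, {left + bin_width:5.1f})  {relative_frequency:.2f}  {bar}")
```

The shape of a histogram depends on the bin width and bin edges chosen, which is part of why it is harder to compare across distributions than a cumulative distribution function.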
[Figure: three cumulative distribution plots, with Pr(X < x) on the y-axis from 0.0 to 1.0. Normal CDF: Normal(0, 1) and Normal(−1, 2), x from −4 to 4. Exponential CDF: Exponential(.5) and Exponential(.2), x from 0 to 20. Poisson CDF: Poisson(2) and Poisson(5), x from 0 to 20]
Box and Whisker Plot
[Figure: box and whisker plots for groups A through F; x-axis: Value, from −2 to 2]
Box Percentile Plot
[Figure: box percentile plots for groups A through F; x-axis: Value, from −2 to 2]
Decile Plot
[Figure: decile plots for groups A through F; x-axis: Value, from −2 to 2]
Decile Plot: Exponential Distribution
[Figure: decile plots for groups A through F, drawn from an exponential distribution; x-axis: Value, from 0.5 to 2.5]
Summary
- Measurements
- Random variables and distributions
- Numerical descriptive statistics
- Graphical summaries
  - Categorical variables
  - Continuous variables