Download By M. Pagano

MD 5108
Summarizing data: statistical indices
(Modified from the notes of Prof. A. Kuk)
Topic 4: Summary index (summary statistic)
Using a single value to summarize some characteristic of a dataset.
For example, the arithmetic mean (or average) is a summary statistic because it gives the average value of a dataset, such as the average of a set of blood pressure readings.
4.1 Indices of Central Tendency (or location)
(Arithmetic) Mean: average of a set of values

Blood Pressure Readings
  X_i      Value
  X_1        95
  X_2        98
  X_3       101
  X_4        87
  X_5       105
  --------------
  Sum       486

Arithmetic mean:
$$ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = 486 / 5 = 97.2 \text{ mm Hg} $$
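The arithmetic above can be reproduced with a few lines of Python. This is only an illustrative sketch; the variable names are not part of the original notes.

```python
# Blood pressure readings from the table above (mm Hg)
readings = [95, 98, 101, 87, 105]

n = len(readings)        # number of observations
total = sum(readings)    # sum of the X_i
mean = total / n         # arithmetic mean = sum / n

print(f"n = {n}, sum = {total}, mean = {mean:.1f} mm Hg")
```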

Approx. 4 million singleton births, 1991:

Variable           Mean
Mother's age       26.4 years
Gestational age    39.15 weeks
Birth weight       3358.6 grams
Weight gain*       30.4 lbs

For those who die in the first year (31,417 of them):
Survival           49.4 days

[Figure: mother's age (years); mean = 26.4 years]
4.2 Robust Measure of Location
The mean is very sensitive (not robust) to extreme values.

Blood Pressure Readings
  X_i      Correct    Decimal overlooked
  X_1         87            87
  X_2         95            95
  X_3         98            98
  X_4        101           101
  X_5      105.0          1050
  Mean      97.2         286.2
Robust measure of location
The median (the middle value of an ordered data set) is less sensitive (more robust) to extreme values in the data.

Blood Pressure Readings
  X_i      Correct    Decimal overlooked
  X_1         87            87
  X_2         95            95
  X_3         98            98
  X_4        101           101
  X_5        105          1050
  Median      98            98   (median value is unchanged)
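A short Python sketch of the overlooked-decimal example, using the standard `statistics` module, shows how far the mean moves while the median stays put. The list names are illustrative only.

```python
from statistics import mean, median

correct = [87, 95, 98, 101, 105]
typo = [87, 95, 98, 101, 1050]   # same readings with the decimal point overlooked

mean_correct, mean_typo = mean(correct), mean(typo)
median_correct, median_typo = median(correct), median(typo)

print(f"mean:   {mean_correct:.1f} -> {mean_typo:.1f}")
print(f"median: {median_correct} -> {median_typo}")
```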
The trimmed mean is also less affected by extreme values (e.g. the 10% trimmed mean is the average after deleting 10% of the data at each end).
Intervals between failures of an air
conditioner (in operating hours)
413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201,
118, 34, 31, 18, 19, 67, 57, 62, 7, 22, 34, 90, 10
Mean = ?
8% trimmed mean = ?
Median = ?
Ordered values
7, 9, 10, 14, 18, 19, 22, 31, 34, 34, 36, 37, 57, 58,
62, 65, 67, 90, 100, 118, 169, 184, 201, 413, 447
Measures of location
Sample size = 25; mean = 2302/25 = 92.1 hrs
8% of 25 = 2, so leave out 2 observations at each end
8% trimmed mean = 1426/21 = 67.9 hrs
Median = 13th ordered value = 57 hrs, so median (57) < 8% trimmed mean (67.9) < mean (92.1)
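The three measures of location for the air-conditioner data can be checked with the sketch below. The helper `trimmed_mean` is a hypothetical function written for this illustration (it is not a standard-library call), and it assumes the trimming percentage is given as a whole number.

```python
from statistics import mean, median

hours = [413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201,
         118, 34, 31, 18, 19, 67, 57, 62, 7, 22, 34, 90, 10]

def trimmed_mean(values, percent):
    """Average after dropping `percent`% of the observations at each end."""
    ordered = sorted(values)
    k = len(ordered) * percent // 100     # whole observations to drop per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

overall_mean = mean(hours)
trimmed = trimmed_mean(hours, 8)          # drops 2 observations at each end
middle = median(hours)

print(f"mean = {overall_mean:.1f}, 8% trimmed mean = {trimmed:.1f}, median = {middle}")
```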
Desirable properties of the median
• Not sensitive to extreme values in data
• More suitable for describing skewed distributions (e.g., median income vs average income)
• The relative positions of the data points are unchanged when log-transformed. As a result, the median of the log-transformed data is just the log of the median of the original data
• Not so for the mean: the mean of log X is not obtainable from the mean of X

87 < 95 < 98 < 101 < 105
ln 87 < ln 95 < ln 98 < ln 101 < ln 105
Median of X = 98
Median of ln X = ln 98 = 4.585
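The log-transform property can be checked numerically. The sketch below uses the five blood pressure readings; with an odd number of observations the median is a single data point, so the equality holds exactly. The variable names are illustrative.

```python
import math
from statistics import mean, median

readings = [87, 95, 98, 101, 105]
logs = [math.log(x) for x in readings]     # natural logs keep the same ordering

median_of_logs = median(logs)
log_of_median = math.log(median(readings))

mean_of_logs = mean(logs)
log_of_mean = math.log(mean(readings))

print(f"median of ln X = {median_of_logs:.3f}, ln(median of X) = {log_of_median:.3f}")
print(f"mean of ln X   = {mean_of_logs:.4f}, ln(mean of X)   = {log_of_mean:.4f}")
```

The two median lines agree, while the mean of the logs differs from the log of the mean.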
Relative positions of median and mean for skewed distributions
Positively skewed, or skewed to the right (where the longer tail is): Mean > Median
Negatively skewed, or skewed to the left (where the longer tail is): Mean < Median
Singleton births, 1991:

Variable              Mean    Median
Mom's age (yrs)       26.4      25
Gest. age (wks)       39.2      39
Birth weight (gms)    3359    3374
Weight gain (lbs)     30.4      30
Survival (days)       49.4       7
[Figure: gestational age (weeks). Mean = 39.2, Median = 39]
[Figure: birth weight. Mean = 3359, Median = 3374]
(Note on the plot: it fluctuates strongly because narrow class intervals are used.)
[Figure: weight gain. Mean = 30.4, Median = 30]
[Figure: Mortality in the first year of baby's life (for those who die in their first year). Proportion (0.00 to 0.40) against survival days.]
[Figure: Mortality in the first year of baby's life (for those who die in their first year). Proportion on a logarithmic scale (0.0001 to 1) against survival days. Mean = 49.4, Median = 7]
By M. Pagano
When to use mean or median:
Use both by all means.
The mean performs best when we have a normal or symmetric distribution with thin tails.
If the distribution is skewed, or when we want to limit the influence of outliers, use the median.
4.3 Indices of Dispersion / Spread
Besides indices of central tendency (location), it is also useful to have indices that summarise the spread (or dispersion) of values in a dataset. These indices give a measure of variation.

Indices of Dispersion or Spread
Range: difference between the largest and the smallest value
Problem: does not consider how values in between are scattered.
In the following two data sets, the numbers of observations, means, medians and ranges are all equal. Which one has more scatter?

Data sets with the same range but different scatter of values:
10, 12, 13, 14, 15, 16, 17, 18, 20
10, 15, 15, 15, 15, 15, 15, 15, 20
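One way to see that the range misses the difference is to compute it next to a measure that uses every value. The sketch below uses the sample standard deviation (defined later in these notes) via `statistics.stdev`; the names `set_a` and `set_b` are illustrative.

```python
from statistics import mean, median, stdev

set_a = [10, 12, 13, 14, 15, 16, 17, 18, 20]
set_b = [10, 15, 15, 15, 15, 15, 15, 15, 20]

range_a = max(set_a) - min(set_a)
range_b = max(set_b) - min(set_b)
sd_a = stdev(set_a)    # divides by n - 1
sd_b = stdev(set_b)

for name, data, rng, sd in (("A", set_a, range_a, sd_a), ("B", set_b, range_b, sd_b)):
    print(f"set {name}: mean={mean(data)}, median={median(data)}, range={rng}, SD={sd:.2f}")
```

Both sets share the same mean, median and range, but the first has the larger standard deviation.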
Spread: Singleton births, 1991:

Variable        Min     Max    Range
Mom's age        10      49      39
Gest. age        17      47      30
Birth weight    227    8164    7937
Weight gain       0      98      98
Survival          0     363     363
Indices of Dispersion
A good index of dispersion should be one that summarises the dispersion of individual values from some central value like the mean.
[Figure: individual values (X) scattered around the mean]
Indices of Dispersion
The problem with averaging the deviations of individual values from the mean is that the average is always 0.

$X_i - \bar{X}$
 87 - 97.2 = -10.2
 95 - 97.2 =  -2.2
 98 - 97.2 =   0.8
101 - 97.2 =   3.8
105 - 97.2 =   7.8
-------------------
Sum        =   0     (so the average of the deviations from the mean is 0)

where 97.2 is the mean of the values 87, 95, 98, 101, 105
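The reason the deviations always sum to zero follows directly from the definition of the mean, as this one-line derivation shows:

$$ \sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - n\bar{X} = n\bar{X} - n\bar{X} = 0 $$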
Indices of Dispersion
Usual approach: consider the squared deviations from the mean and take their average.

$X_i - \bar{X}$              $(X_i - \bar{X})^2$
 87 - 97.2 = -10.2            104.04
 95 - 97.2 =  -2.2              4.84
 98 - 97.2 =   0.8              0.64
101 - 97.2 =   3.8             14.44
105 - 97.2 =   7.8             60.84
--------------------------------------
Sum        =   0              184.80   (sum of squares of deviations from the mean)
Variance calculation from a sample: it is customary to divide by n-1 (the default option in most software) rather than by n.

$(X_i - \bar{X})^2$: 104.04, 4.84, 0.64, 14.44, 60.84; sum = 184.80

$$ \text{Variance} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = 184.8 / 4 = 46.2 $$

Here n-1 is the effective sample size, also called the degrees of freedom.
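The same calculation can be written out step by step in Python and compared with `statistics.variance`, which also divides by n-1. This is a sketch with illustrative variable names.

```python
from statistics import mean, variance

readings = [87, 95, 98, 101, 105]

x_bar = mean(readings)
squared_devs = [(x - x_bar) ** 2 for x in readings]   # (X_i - mean)^2
sum_sq = sum(squared_devs)                            # sum of squared deviations
sample_var = sum_sq / (len(readings) - 1)             # divide by n - 1

print(f"sum of squares = {sum_sq:.2f}, variance = {sample_var:.1f}")
print(f"statistics.variance gives {variance(readings):.1f}")
```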
Variance of a sample
Can be shown mathematically:
$$ \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \frac{\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2 / n}{n-1} $$
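As a numerical check of this identity, the sketch below computes the variance of the blood pressure readings from the sum of the values and the sum of their squares only. The variable names are illustrative.

```python
readings = [87, 95, 98, 101, 105]

n = len(readings)
sum_x = sum(readings)                    # sum of X_i
sum_x2 = sum(x * x for x in readings)    # sum of X_i squared

# Computational form: no deviations from the mean are needed
shortcut_var = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(f"sum X = {sum_x}, sum X^2 = {sum_x2}, variance = {shortcut_var:.1f}")
```

The result agrees with the value 46.2 obtained from the squared deviations.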
Why subtract 1?
• Results in a better (unbiased) estimator of the population variance
• Acknowledges the fact that the population mean is unknown and has to be estimated by the sample mean (effective sample size decreased by 1 for every parameter estimated)
• No need to subtract 1 if we calculate the variance using deviations from the population mean
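The first point can be illustrated by brute force. In the sketch below the five blood pressure readings are treated as a complete population (an assumption made purely for illustration), and every possible ordered sample of size 2, drawn with replacement, is enumerated. Averaging the n-1 version over all samples recovers the population variance, while the divide-by-n version falls short.

```python
from itertools import product
from statistics import mean, pvariance, variance

# Assumption for illustration: these five values ARE the whole population
population = [87, 95, 98, 101, 105]
sigma2 = pvariance(population)     # population variance (divides by N)

# All 25 ordered samples of size 2, drawn with replacement
samples = list(product(population, repeat=2))

avg_div_n_minus_1 = mean(variance(s) for s in samples)   # sample variance, divisor n - 1
avg_div_n = mean(pvariance(s) for s in samples)          # same samples, divisor n

print(f"population variance     = {sigma2:.2f}")
print(f"average with divisor n-1 = {avg_div_n_minus_1:.2f}")
print(f"average with divisor n   = {avg_div_n:.2f}")
```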
Variance of a sample
• The problem with the variance is its awkward unit of measurement, as the values have been squared
• The problem is overcome by taking the square root of the variance, which reverts to the original unit of measurement
The square root of the variance gives the standard deviation.
The Sample Standard Deviation (S or SD)
$$ S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2 / n}{n-1}} $$
Singleton births, 1991:

Variable              Mean    Std dev
Mom's age (yrs)       26.4      5.84
Gest. age (wks)       39.2      2.61
Birth weight (gms)    3359       227
Weight gain (lbs)     30.4     12.13
Survival (days)       49.4      76.1
Empirical Rule:
If dealing with a unimodal and symmetric distribution, then
Mean ± 1 sd covers approx. 67% of observations
Mean ± 2 sd covers approx. 95% of observations
Mean ± 3 sd covers approx. all observations
For a normal distribution the exact coverages are 68.3%, 95.4% and 99.7%.
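For a normal distribution the coverage of mean ± k sd can be computed exactly from the error function, since P(|Z| < k) = erf(k/√2). The sketch below uses only the standard `math` module; the function name is illustrative.

```python
import math

def normal_coverage(k):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

coverage = {k: normal_coverage(k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"mean ± {k} sd covers {p:.2%}")
```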
Mother's age: mean = 26.4 yrs, s.d. = 5.84 yrs
Table of $\bar{x} \pm k$ s.d.s

k    left limit    right limit    Emp.
1      20.56          32.24       67%
2      14.72          38.08       95%
3       8.88          43.92       all
[Figure: mother's age (years); area between 20.56 and 32.24 = 0.6475]
[Figure: mother's age (years); area between 14.72 and 38.08 = 0.963]
Mother's age: mean = 26.4 yrs, s.d. = 5.84 yrs
Table of $\bar{x} \pm k$ s.d.s

k    left limit    right limit    Emp.    Actual
1      20.56          32.24       67%     64.75%
2      14.72          38.08       95%     96.3%
3       8.88          43.92       all     99.89%
Chebychev's Inequality
Table of $\bar{x} \pm k$ s.d.s
The proportion of observations within k s.d.s of the mean is at least $1 - 1/k^2$ (true for any distribution).

k    1/k^2    1-1/k^2    Emp.    Actual
1    1        0          67%     64.75%
2    0.25     0.75       95%     96.3%
3    0.11     0.89       all     99.89%
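The bound is easy to tabulate. The sketch below computes $1 - 1/k^2$ and compares it with the "Actual" proportions for mother's age quoted in the notes; the function name and dictionaries are illustrative.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k s.d.s of the mean (any distribution)."""
    return max(0.0, 1 - 1 / k ** 2)

# "Actual" proportions for mother's age, copied from the notes
actual = {1: 0.6475, 2: 0.963, 3: 0.9989}
bounds = {k: chebyshev_lower_bound(k) for k in actual}

for k in actual:
    print(f"k={k}: at least {bounds[k]:.2f}, actual {actual[k]:.4f}")
```

The bound is deliberately conservative: it holds for any distribution, so the actual proportions are well above it.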
When not to use the standard deviation?
• Heavy-tailed distribution
• Presence of outliers
• Skewed distribution
• Comparing variables with vastly different magnitudes or different units of measurement
Weights of newborn      Weights of newborn
elephants (kg)          mice (kg)
  929                     0.72
  878                     0.63
  895                     0.59
  937                     0.79
  801                     1.06
  853                     0.42
  939                     0.31
  972                     0.38
  841                     0.96
  826                     0.89
n = 10                  n = 10
$\bar{X}$ = 887.1       $\bar{X}$ = 0.68
SD = 56.50              SD = 0.255

The difference in the magnitude of the measurements has contributed to the large difference in the SD.
The solution is to use the Coefficient of Variation (CV), which is given by:
$$ CV = \frac{SD}{\bar{X}} $$
The CV expresses the standard deviation (s) relative to its mean.
Also known as relative dispersion.
As a ratio, it is unit-free.
Indices of Dispersion

             Newborn elephants (kg)    Newborn mice (kg)
n                    10                      10
$\bar{X}$           887.1                    0.68
SD                   56.50                   0.255
CV                   0.0637                  0.375

Newborn mice show greater relative birthweight variation!
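The comparison can be reproduced from the raw weights. This is a sketch; `coefficient_of_variation` is a hypothetical helper written for this illustration, built on `statistics.stdev` and `statistics.mean`.

```python
from statistics import mean, stdev

elephants = [929, 878, 895, 937, 801, 853, 939, 972, 841, 826]
mice = [0.72, 0.63, 0.59, 0.79, 1.06, 0.42, 0.31, 0.38, 0.96, 0.89]

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unit-free)."""
    return stdev(values) / mean(values)

cv_elephants = coefficient_of_variation(elephants)
cv_mice = coefficient_of_variation(mice)

print(f"elephants: SD = {stdev(elephants):.2f}, CV = {cv_elephants:.4f}")
print(f"mice:      SD = {stdev(mice):.3f}, CV = {cv_mice:.3f}")
```

Because the code uses the unrounded mean of the mice weights (0.675 rather than 0.68), its CV for mice comes out at about 0.378, slightly above the 0.375 in the notes; the conclusion is unchanged.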
When to use the Coefficient of Variation (CV):
• when the means of the comparison groups differ greatly
  (CV is suitable as it expresses the std dev relative to its corresponding mean)
• when different units of measurement are involved, e.g. group 1 units are mm and group 2 units are gm
  (CV is suitable for comparison as it is unit-free, being a ratio)
4.4 Robust Measure of Dispersion
• The variance is defined as the mean of the squared deviations and as such is even less robust to extreme values than the mean (an extreme deviation becomes even more extreme after squaring)
• A robust measure of dispersion is IQR/1.35, where
  IQR = 3rd quartile - 1st quartile = inter-quartile range
The reason for dividing the IQR by 1.35 is to make it comparable with the standard deviation when the underlying distribution is normal (for a normal distribution, IQR ≈ 1.35 × SD).
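The factor 1.35 can be checked from the quartiles of the standard normal distribution. The sketch below uses `statistics.NormalDist` (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

q1 = standard_normal.inv_cdf(0.25)   # 1st quartile of N(0, 1)
q3 = standard_normal.inv_cdf(0.75)   # 3rd quartile of N(0, 1)
iqr_in_sd_units = q3 - q1            # IQR expressed in standard deviations

print(f"IQR of a normal distribution = {iqr_in_sd_units:.3f} x SD")
```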
Intervals between failures of an air
conditioner (in operating hours)
413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201,
118, 34, 31, 18, 19, 67, 57, 62, 7, 22, 34, 90, 10
Mean = ?
8% trimmed mean = ?
SD=?
IQR/1.35 = ?
Median = ?
Ordered values
7, 9, 10, 14, 18, 19, 22, 31, 34, 34, 36, 37, 57, 58,
62, 65, 67, 90, 100, 118, 169, 184, 201, 413, 447
Measures of location
Sample size = 25; mean = 2302/25 = 92.1 hrs
8% of 25 = 2, so leave out 2 observations at each end
8% trimmed mean = 1426/21 = 67.9 hrs
Median = 13th ordered value = 57 hrs, so median (57) < 8% trimmed mean (67.9) < mean (92.1)
Measures of dispersion
SD = 115.5 hrs
1st quartile = 7th ordered value = 22 hrs
3rd quartile = 19th ordered value = 100 hrs
IQR/1.35 = (100 - 22)/1.35 = 78/1.35 = 57.8 hrs
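Both dispersion measures can be computed with the standard `statistics` module. Quartile conventions differ between software packages; the sketch below assumes `method="inclusive"`, which for these 25 values picks the 7th and 19th ordered values, matching the notes.

```python
from statistics import quantiles, stdev

hours = [413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201,
         118, 34, 31, 18, 19, 67, 57, 62, 7, 22, 34, 90, 10]

sd = stdev(hours)                                        # sample SD (divisor n - 1)
q1, q2, q3 = quantiles(hours, n=4, method="inclusive")   # quartiles
robust_sd = (q3 - q1) / 1.35                             # IQR / 1.35

print(f"SD = {sd:.1f} hrs")
print(f"Q1 = {q1}, Q3 = {q3}, IQR/1.35 = {robust_sd:.1f} hrs")
```

The robust measure is about half the standard deviation here, reflecting how strongly the two largest intervals (413 and 447) inflate the SD.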
5-Number Summary of a data set
• Min
• 1st quartile
• Median
• 3rd quartile
• Max
Represent graphically by a box plot
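As an illustration, the 5-number summary of the air-conditioner failure intervals can be assembled as below. The quartile convention (`method="inclusive"`) is an assumption chosen to match the quartiles used earlier in these notes; other software may report slightly different quartiles.

```python
from statistics import quantiles

hours = [413, 14, 58, 37, 100, 65, 9, 169, 447, 184, 36, 201,
         118, 34, 31, 18, 19, 67, 57, 62, 7, 22, 34, 90, 10]

q1, med, q3 = quantiles(hours, n=4, method="inclusive")
five_number_summary = (min(hours), q1, med, q3, max(hours))

labels = ("Min", "1st quartile", "Median", "3rd quartile", "Max")
for label, value in zip(labels, five_number_summary):
    print(f"{label:>12}: {value}")
```

These five values are exactly what a box plot draws: the box spans the quartiles, the line inside marks the median, and the whiskers reach towards the minimum and maximum.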