Measures of center
The mean
The mean of a distribution is the arithmetic average of the observations:
\[ \bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
The median
The median is the midpoint of a distribution: the number M such that
◦ half the observations are smaller and
◦ half are larger.
How to find the median
Suppose the observations are x1, x2, ..., xn.
1. Arrange the data in increasing order and let x(i) denote the ith smallest observation.
2. If the number of observations n is odd, the median is the center observation in the ordered list:
\[ M = x_{((n+1)/2)} \]
3. If the number of observations n is even, the median is the average of the two center observations in the ordered list:
\[ M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} \]
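The definition of the mean and the three steps for the median can be sketched in Python as follows. This is a minimal illustration, not part of the original slides; the function names `mean` and `median` and the sample values are assumptions chosen for the example. The only subtlety is that the slides index x(i) from 1 while Python lists start at 0.

```python
def mean(xs):
    """Arithmetic average: the sum of the observations divided by n."""
    return sum(xs) / len(xs)


def median(xs):
    """Median via the three steps above (1-based x_(i) mapped to 0-based lists)."""
    ordered = sorted(xs)               # step 1: arrange in increasing order
    n = len(ordered)
    if n % 2 == 1:
        # step 2: odd n, take the center observation x_((n+1)/2)
        return ordered[(n + 1) // 2 - 1]
    # step 3: even n, average x_(n/2) and x_(n/2+1)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


data = [7, 1, 3, 10]                   # made-up illustration values
print(mean(data), median(data))        # 5.25 5.0
```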
Numerical Description of Data, Jan 7, 2004
Examples:
Data set 1:
x1   x2   x3   x4   x5   x6   x7   x8   x9
2    4    3    4    6    5    4    -6   5
Arrange in increasing order:
x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)
-6   2    3    4    4    4    5    5    6
There is an odd number of observations, so the median is
M = x((n+1)/2) = x(5) = 4.
The mean is given by
\[ \bar{x} = \frac{2 + 4 + 3 + 4 + 6 + 5 + 4 + (-6) + 5}{9} = \frac{27}{9} = 3. \]
Data set 2:
x1   x2   x3   x4   x5   x6   x7   x8   x9   x10
2.3  8.8  3.9  4.1  6.4  5.9  4.2  2.9  1.3  5.1
Arrange in increasing order:
x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9) x(10)
1.3 2.3 2.9 3.9 4.1 4.2 5.1 5.9 6.4 8.8
There is an even number of observations, so the median is
\[ M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2} = \frac{x_{(5)} + x_{(6)}}{2} = \frac{4.1 + 4.2}{2} = 4.15. \]
The mean is given by
\[ \bar{x} = \frac{2.3 + 8.8 + 3.9 + 4.1 + 6.4 + 5.9 + 4.2 + 2.9 + 1.3 + 5.1}{10} = \frac{44.9}{10} = 4.49. \]
Mean versus median
◦ The mean is easy to work with algebraically, while the median is not.
◦ The mean is sensitive to extreme observations, while the median is more
robust.
Example:
[Figure: the observations marked on a number line from 0 to 10]
The original mean and median are
\[ \bar{x} = \frac{0 + 1 + 2}{3} = 1 \quad \text{and} \quad M = x_{((n+1)/2)} = 1 \]
The modified mean and median are
\[ \bar{x} = \frac{0 + 1 + 10}{3} = 3\tfrac{2}{3} \quad \text{and} \quad M = x_{((n+1)/2)} = 1 \]
◦ If the distribution is exactly symmetric, then mean = median.
◦ In a skewed distribution, the mean is further out in the longer tail than
the median.
◦ The median is preferable for strongly skewed distributions, or when
outliers are present.
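The sensitivity of the mean and the robustness of the median can be checked numerically. The sketch below uses Python's standard `statistics` module on the two small data sets from the example ({0, 1, 2} and {0, 1, 10}); the variable names are illustrative assumptions.

```python
from statistics import mean, median

original = [0, 1, 2]
modified = [0, 1, 10]   # the largest observation moved out to 10

for label, xs in (("original", original), ("modified", modified)):
    # The mean follows the extreme value; the median stays at the center observation.
    print(f"{label}: mean = {mean(xs):.2f}, median = {median(xs)}")
```

Moving a single observation changes the mean from 1 to about 3.67, while the median remains 1.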
Measures of spread
Example: Monthly returns on two stocks
[Figure: histograms of the returns of Stock A and Stock B; horizontal axis: Daily returns (in %), vertical axis: Frequency]
         Stock A   Stock B
Mean     4.95      4.82
Median   4.99      4.68
The distributions of the two stocks have approximately the same mean and
median, but stock B is more volatile and thus more risky.
◦ Measures of center alone are an insufficient description of a distribution
and can be misleading
◦ The simplest useful numerical description of a distribution consists of
both a measure of center and a measure of spread.
Common measures of spread are
◦ the quartiles and the interquartile range
◦ the standard deviation
Quartiles
Quartiles divide the data into four equal parts.
◦ Lower (or first) quartile QL:
median of all observations to the left of the median M in the ordered list
◦ Middle (or second) quartile M = QM:
median of all observations
◦ Upper (or third) quartile QU:
median of all observations to the right of the median M in the ordered list
◦ Interquartile range: IQR = QU − QL,
the distance between the upper and lower quartiles
How to find the quartiles
1. Arrange the data in increasing order and find the median M.
2. Find the median of the observations to the left of M; this is the lower quartile, QL.
3. Find the median of the observations to the right of M; this is the upper quartile, QU.
Examples:
Data set:
x1   x2   x3   x4   x5   x6   x7   x8   x9
2    4    3    4    6    5    4    -6   5
Arrange in increasing order:
x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)
-6   2    3    4    4    4    5    5    6
◦ QL is the median of {−6, 2, 3, 4}: QL = 2.5
◦ QU is the median of {4, 5, 5, 6}: QU = 5
◦ IQR = 5 − 2.5 = 2.5
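The quartile computation above can be reproduced with a short Python sketch. It assumes the convention used in the example: when n is odd, the median observation itself is left out of both halves. Other textbooks and software packages use different quartile conventions and may give slightly different values. The function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(xs):
    """Return (Q_L, M, Q_U) using the median-of-halves rule described above."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # observations to the left of M
    upper = ordered[n - half:]    # observations to the right of M
    return median(lower), median(ordered), median(upper)


q_l, m, q_u = quartiles([2, 4, 3, 4, 6, 5, 4, -6, 5])
print(q_l, m, q_u, q_u - q_l)     # 2.5 4 5.0 2.5
```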
Percentiles
More generally we might be interested in the value which is exceeded only
by a certain percentage of observations:
The pth percentile of a set of observations is the value such that
◦ p% of the observations are less than or equal to it and
◦ (100 − p)% of the observations are greater than or equal to it.
How to find the percentiles
1. Arrange the data in increasing order.
2. If np/100 is not an integer, then x(k+1) is the pth percentile, where k is the largest integer less than np/100.
3. If np/100 is an integer, the pth percentile is the average of x(np/100) and x(np/100+1).
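The two-case percentile rule translates directly into code. The sketch below is an illustration under stated assumptions: the function name `percentile` is hypothetical, p is taken to be an integer strictly between 0 and 100, and integer arithmetic (`divmod`) is used to decide whether np/100 is an integer so that floating-point rounding does not interfere.

```python
def percentile(xs, p):
    """p-th percentile following the two-case rule above (integer p, 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must lie strictly between 0 and 100")
    ordered = sorted(xs)                   # step 1: increasing order
    n = len(ordered)
    k, remainder = divmod(n * p, 100)      # n*p/100 = k + remainder/100
    if remainder != 0:
        # step 2: np/100 is not an integer, so k is the largest integer below it
        return ordered[k]                  # 0-based index k is x_(k+1)
    # step 3: np/100 = k is an integer, so average x_(k) and x_(k+1)
    return (ordered[k - 1] + ordered[k]) / 2


data = [2, 4, 3, 4, 6, 5, 4, -6, 5]
print(percentile(data, 25), percentile(data, 50), percentile(data, 75))   # 3 4 5
```

For this data set the 50th percentile agrees with the median. The 25th percentile under this rule (3) differs from the lower quartile obtained by the median-of-halves rule (2.5), which shows that the two conventions need not coincide.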
Five-number summary
A numerical summary of a distribution {x1, ..., xn} is given by
\[ x_{(1)} \quad Q_L \quad M \quad Q_U \quad x_{(n)} \]
A simple boxplot is a graph of the five-number summary.
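As a rough illustration, the five-number summary can be computed as below. The sketch assumes the median-of-halves quartile rule from the quartiles section (the median observation is excluded from both halves when n is odd); the function name `five_number_summary` is hypothetical.

```python
from statistics import median


def five_number_summary(xs):
    """Return (minimum, Q_L, M, Q_U, maximum) for a list of observations."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    q_l = median(ordered[:half])        # median of the lower half
    q_u = median(ordered[n - half:])    # median of the upper half
    return ordered[0], q_l, median(ordered), q_u, ordered[-1]


print(five_number_summary([2, 4, 3, 4, 6, 5, 4, -6, 5]))   # (-6, 2.5, 4, 5.0, 6)
```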
Boxplots
A common “rule” for discovering outliers is the 1.5 × IQR rule:
An observation is a suspected outlier if it falls more than 1.5 × IQR below QL or above QU.
How to draw a boxplot (box-and-whisker plot)
1. A box is drawn from the lower to the upper quartile (QL and QU).
2. The median of the data is shown by a line in the box.
3. Lines (the whiskers) are drawn from the ends of the box to the most extreme observations within a distance of 1.5 IQR (interquartile range).
4. Measurements falling outside 1.5 IQR from the ends of the box are potential outliers and are marked by ◦ or ∗.
[Figure: boxplots of Stock A and Stock B]
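The quantities needed to draw a boxplot by hand (box ends, median line, whisker ends, and suspected outliers under the 1.5 × IQR rule) can be computed with a sketch like the following. The function name `boxplot_stats` and the returned dictionary keys are assumptions for illustration, and the quartiles use the median-of-halves rule from the quartiles section.

```python
from statistics import median


def boxplot_stats(xs):
    """Box, median, whisker ends and suspected outliers under the 1.5 x IQR rule."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    q_l = median(ordered[:half])
    q_u = median(ordered[n - half:])
    iqr = q_u - q_l
    low_fence = q_l - 1.5 * iqr         # anything below this is a suspected outlier
    high_fence = q_u + 1.5 * iqr        # anything above this is a suspected outlier
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "box": (q_l, q_u),
        "median": median(ordered),
        "whiskers": (inside[0], inside[-1]),   # most extreme observations within the fences
        "outliers": outliers,
    }


stats = boxplot_stats([2, 4, 3, 4, 6, 5, 4, -6, 5])
print(stats)
```

For the data set used in the earlier examples, the fences are at −1.25 and 8.75, so the observation −6 is flagged as a suspected outlier and the whiskers end at 2 and 6.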
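Plotting a boxplot with Stata:
* read the two return series from a plain-text file into variables A and B
infile A B using stocks.txt, clear
* attach readable labels to the two variables
label var A "Stock A"
label var B "Stock B"
* draw the two boxplots side by side in a tall, narrow graph (xsize 2, ysize 5)
graph box A B, xsize(2) ysize(5)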
Interpretation of Box Plots
◦ The IQR is a measure of the sample’s variability.
◦ If the whiskers differ in length, the distribution of the data is probably skewed in the direction of the longer whisker.
◦ Very extreme observations (more than 3 IQR below the lower quartile or above the upper quartile) are outliers, with one of the following explanations:
a) The measurement is incorrect (error in measurement process or data
processing).
b) The measurement belongs to a different population.
c) The measurement is correct, but represents a rare (chance) event.
We accept the last explanation only after carefully ruling out all others.
Variance and standard deviation
Suppose there are n observations x1, x2, ..., xn.
The variance of the n observations is:
\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
This is (approximately) the average of the squared distances of the observations from the mean.
The standard deviation is:
\[ s = \sqrt{s^2} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]
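The two formulas can be written out directly in Python. This is a minimal sketch with hypothetical function names (`sample_variance`, `sample_sd`); the data are the nine observations used in the earlier examples, whose mean is 3. The standard library functions `statistics.variance` and `statistics.stdev` also divide by n − 1 and serve as a cross-check.

```python
import math
import statistics


def sample_variance(xs):
    """s^2: sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


def sample_sd(xs):
    """s: the square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


data = [2, 4, 3, 4, 6, 5, 4, -6, 5]
print(sample_variance(data))     # 12.75 (squared deviations sum to 102, divided by 8)
print(sample_sd(data))           # roughly 3.57
print(statistics.stdev(data))    # library cross-check
```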
Why n − 1?
Division by n − 1 instead of n in the variance calculation is a common
cause of confusion. Why n − 1? Note that
\[ \sum_{i=1}^{n} (x_i - \bar{x}) = 0 \]
Thus, if you know any n − 1 of the differences, the last difference can be
determined from the others. The number of “freely varying” observations,
n − 1 in this case, is called the “degrees of freedom”.
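A small numerical check makes the degrees-of-freedom argument concrete: the deviations from the mean sum to zero, so the last deviation can be recovered from the other n − 1. The sketch uses the nine observations from the earlier examples; the comparison is exact here because the mean is a whole number, whereas in general floating-point rounding calls for a tolerance.

```python
data = [2, 4, 3, 4, 6, 5, 4, -6, 5]
x_bar = sum(data) / len(data)                 # 3.0

deviations = [x - x_bar for x in data]
print(sum(deviations))                        # 0.0

# Knowing any n - 1 deviations fixes the remaining one.
known = deviations[:-1]
recovered_last = -sum(known)
print(recovered_last, deviations[-1])         # 2.0 2.0
```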
Properties of s
◦ s measures spread around the mean, so use it only if the mean is used as the measure of center.
◦ s = 0 if and only if all observations are the same.
◦ s is in the same units as the measurements, while s^2 is in the square of these units.
◦ s, like x̄, is not resistant to outliers.
Five-number summary versus standard deviation
◦ The 5-number summary is better for describing skewed distributions,
since each side has a different spread.
◦ x̄ and s are preferred for symmetric distributions with no outliers.