Download Numerical Summaries

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Numerical Summaries
COMP 245 STATISTICS
Dr N A Heard
1
Introduction
Once a sample of data has been drawn from the population of interest, the first task of the
statistical analyst might be to calculate various numerical summaries of these data.
This procedure serves two purposes:
• The first is exploratory. Calculating statistics which characterise general properties of the
sample, such as location, dispersion, or symmetry, helps us to understand the data we
have gathered. This aim can be greatly aided by the use of graphical displays representing the data.
• The second, as we shall see later, is that these summaries will commonly provide the
means for relating the sample we have learnt about to the wider population in which we
are truly interested. Later, we will assess the properties of these numerical summaries as
estimators of population parameters.
2
2.1
Summary Statistics
Measures of Location
Sample Mean
The arithmetic mean (or just mean for short) of a sample of real values ( x1 , . . . , xn ) is the
sum of the values divided by their number. That is,
x̄ =
1 n
xi
n i∑
=1
This is often colloquially referred to as the average.
Ex. The mean of (7, 2, 4, 12, 5) is
7 + 2 + 4 + 12 + 5
30
=
= 6.
5
5
Order Statistics
For a sample of real values ( x1 , . . . , xn ), define the ith order statistic x(i) to be the ith smallest
value of the sample.
So
1
• x(1) ≡ min( x1 , . . . , xn ) is the smallest value;
• x(2) is the next smallest, and so on, up to
..
.
• x(n) ≡ max( x1 , . . . , xn ) being the largest value.
Furthermore, in an abuse of notation it will be useful to define x(i+α) for integer 1 ≤ i < n
and non-integer α ∈ (0, 1) as the linear interpolant
x ( i + α ) = ( 1 − α ) x ( i ) + α x ( i +1) ,
where the order statistics x(i) are defined as before.
r
x ( i +1)
x (i + α )
x (i )
r
-
i+1
i+α
i
Ex.
x(4.2) = 0.8 × x(4) + 0.2 × x(5) .
Sample Median
The median of a sample of real values ( x1 , . . . , xn ) is the middle value of the order statistics.
That is, using our extended notation,
median = x({n+1}/2)
(
x({n+1}/2)
=
x(n/2) + x(n/2+1)
2
if n is odd
if n is even
Ex.
• The median of (7, 2, 4, 12, 5) is 5.
• The median of (7, 2, 4, 12, 5, 15) is 6.
Mean vs. Median
The mean is sensitive to outlying points, whilst the median is not.
Ex.
(1, 2, 3, 4, 5) has median = mean = 3.
(1, 2, 3, 4, 40) again has median = 3, but now mean = 10.
The arithmetic mean is the most commonly used location statistic, followed by the median.
2
Sample Mode
The mode of a sample of real values ( x1 , . . . , xn ) is the value of the xi which occurs most
frequently in the sample.
Ex.
The mode of (3, 5, 7, 2, 10, 14, 12, 2, 5, 2) is 2.
Some data sets are multimodal.
Geometric Mean
Two other useful measures of location (other averages).
For positive data, the geometric mean
s
xG =
n
n
∏ xi .
i =1
(
Note: It is easy to show that xG = exp
)
1 n
log xi , the exponential of the arithmetic
n i∑
=1
mean of the logs of the data.
=⇒ It is thus less severely affected by exceptionally large values.
Harmonic Mean
The harmonic mean
(
xH =
1 n 1
n i∑
x
=1 i
) −1
=
n
∑in=1 x1i
.
which is most useful when averaging rates.
Note: For positive data ( x1 , . . . , xn ),
Arithmetic mean ≥ geometric mean ≥ harmonic mean.
2.2
Measures of Dispersion
Range
The range of a sample of real values ( x1 , . . . , xn ) is the difference between the largest and
the smallest values. That is
range = x(n) − x(1)
Ex.
The range of (7, 2, 4, 12, 5) is 12 − 2 = 10.
3
1
2
3
Mean
4
5
Arithmetic
Geometric
Harmonic
x1
2
4
6
8
10
x2
Arithmetic, geometric and harmonic means for two data points points ( x1 , x2 ), where x1 = 1.
Quartiles
Consider again the order statistics of a sample, x(1) , . . . , x(n) .
1
of the way through the ordered sam2
ple. (Not necessarily exactly or uniquely since there may be tied values or n even.)
1
3
Similarly, we can define the first and third quartiles respectively as being values and
4
4
of the way through the ordered sample.
We defined the median so that it lay approximately
Interquartile Range
first quartile = x({n+1}/4)
third quartile = x(3{n+1}/4)
and thus we define the interquartile range as the range of the data lying between the first
and third quartiles,
interquartile range = third quartile − first quartile
= x(3{n+1}/4) − x({n+1}/4)
Five Figure Summary
The five figure summary of a set of data lists, in order:
• The minimum value in the sample
• The lower quartile
• The sample median
4
• The upper quartile
• The maximum value
Sample Variance
The most widely used measure of dispersion is based on the squared differences between
the data points and their mean, ( xi − x̄ )2 .
The average (the mean) of these squared differences is the mean square or sample variance
s2 =
1 n
( xi − x̄ )2 .
n i∑
=1
Equivalently, it is often more convenient to rewrite this formula as
s2 =
1 n 2
xi
n i∑
=1
!
−
1 n
xi
n i∑
=1
!2
That is, the mean of the squares minus the square of the mean.
Sample Standard Deviation
The square root of the variance is the root mean square or sample standard deviation
s
s=
1 n
( xi − x̄ )2
n i∑
=1
Unlike the variance, the standard deviation is in the same units as the xi .
Summary
We can see analogies between the numerical summaries for location and dispersion, and
their robustness properties are comparable.
Location
Dispersion
Least Robust
x (1) + x ( n )
2
x ( n ) − x (1)
More Robust
Most Robust
x̄
x({n+1}/2)
s2
x(3{n+1}/4) − x({n+1}/4)
x (1) + x ( n )
would be the midpoint of our data halfway between the minimum and
2
maximum values in the sample, which provides another alternative descriptor of location.)
(where
5
2.3
Skewness
Sample Skewness
Skewness is a measure of asymmetry. The skewness of a sample of real values ( x1 , . . . , xn )
is given by
1 n
skewness = ∑
n i =1
xi − x̄
s
3
.
A sample is positively (negatively) or right (left) skewed if the upper tail of the histogram
of the sample is longer (shorter) than the lower tail.
Skewness
Positive skew
Negative skew
Since the mean is more sensitive to outlying points than is the median, one might choose
the median as a more suitable measure of ’average value’ if the sample is skewed.
Can expect skewness for example when the data can only take positive (or only negative
values) and if the values are not far from zero.
Can remove skewness by transforming the data. In the case above, we need a transformation which has larger effect on the larger values: e.g. square root, log (though beware 0
values).
Note: in a positively skewed sample the mean is often greater than the median.
2.4
Covariance and Correlation
Sample Covariance
Suppose we have a sample made up of ordered pairs of real values (( x1 , y1 ), . . . , ( xn , yn )).
The value xi might correspond to the measurement of one quantity x of individual i, and yi to
another quantity y of the same individual.
The covariance between the samples of x and y is given by
s xy =
1 n
( xi − x̄ )(yi − ȳ).
n i∑
=1
6
It gives a measurement of relatedness between the two quantities x and y.
As before, the covariance can be rewritten equivalently as
∑in=1 xi yi
− x̄ ȳ.
n
s xy =
Sample Correlation
Note that the magnitude of s xy varies according to the scale on which the data have been
measured. The correlation between the samples of x and y is defined to be
r xy =
=
s xy
s x sy
∑in=1 xi yi − n x̄ ȳ
.
ns x sy
where s x and sy are the sample standard deviations of ( x1 , . . . , xn ) and (y1 , . . . , yn ) respectively.
Unlike covariance, correlation gives a measurement of relatedness between the two quantities x and y which is scale-invariant.
3
3.1
Related Graphical Displays
Box-and-Whisker Plots
Box-and-Whisker/Box Plots
Based on the five figure summary.
• Median - middle line in the box
• 3rd & 1st Quartiles - top and bottom of the box
• ’Whiskers’ - extend out to any points which are within ( 23 × interquartile range) from the
box
• Any extreme points out to the maximum and minimum which are beyond the whiskers
are plotted individually.
3.2
Empirical pmf & cdf
Empirical pmf
The empirical probability mass function (epmf) of a sample of real values ( x1 , . . . , xn ) is
given by
pn ( x ) =
1 n
I ( x i = x ).
n i∑
=1
For any real number x, pn ( x ) returns the proportion of the data which take the value x. The
non-zero values of pn ( x ) are the normalised frequency counts of the data.
Note that a mode of the sample ( x1 , . . . , xn ) yields a maximal value of pn ( x ).
7
25
20
15
Count
10
●
0
5
●
A
B
C
D
E
F
Insecticide
Ex. Box plots of the counts of insects found in agricultural experimental units treated with six
different insecticides (A-F).
Empirical cdf
The empirical cumulative distribution function (ecdf) of a sample of real values ( x1 , . . . , xn )
is given by
Fn ( x ) =
1 n
I ( x i ≤ x ).
n i∑
=1
For any real number x, Fn ( x ) returns the proportion of the data having values which do
not exceed x.
1.0
Note this is a step function, with change points at the sampled values.
●
●
●
●
●
●
0.8
●
●
●
●
●
●
0.6
●
●
●
Fn
●
●
●
0.4
●
●
0.2
●
●
0.0
●
●
0
5
10
15
20
25
Count
Ex. Collecting the insecticide data from before together across the different treatments.
8
Related documents