Chapter 3
Summarizing Quantitative Data: Statistics
• statistic
Any quantity computed from the data values in the
sample; used to quantify some distributional feature.
• parameter
A quantity used to describe some distributional feature
of the larger population.
There are two chief systems of statistics in common use:
ordered statistics, based on an ordering of the data
values from lowest to highest; and weighted statistics, which locate features relative to how they balance
against the rest of the data set on the measuring scale.
Feature             Ordered statistic            Weighted statistic
Center              median                       mean (x̄)
Spread              interquartile range (IQR)    standard deviation (s)
Relative standing   percentile                   z-score (z)
• mode
the data value (or values) with the highest frequency
(which applies also to qualitative data); there may be
one mode, no mode or multiple modes in a set of data
• median
the middle observation in a sorted list of the data values
(for an even number of values, average the two middle
observations) [Excel:=MEDIAN]
• percentile
the pth percentile is a number that is greater than or equal to
p% of the data values and less than the remaining (100 − p)% of
the data (the median is the 50th percentile)
There are two common methods for computing percentiles, and they typically produce slightly different
values:
– the exclusive method (used by the authors of our
textbook) locates the position of the pth percentile
using the formula
$L_p = \frac{p}{100}(n + 1)$
So, for example, if $L_{30} = 13.42$, the 30th percentile
will be the number 42% of the way between the 13th
and 14th numbers in the list that sorts the data from
lowest to highest. However, the method suffers from
the disadvantage that it fails to return any value for
very small or very large values of p (those for which
$L_p < 1$ or $L_p > n$).
– the inclusive method locates the position of the pth
percentile using the formula
$L_p = \frac{p}{100}(n - 1) + 1$
This method always produces a definite value for the
pth percentile.
[Excel:=PERCENTILE.EXC or =PERCENTILE.INC]
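The two position formulas above can be sketched in a few lines of Python. This is only an illustration: the function name `percentile` and the sample data are assumptions, not part of the course materials, and interpolation between neighboring ranks follows the "42% of the way" description given for the exclusive method.

```python
def percentile(data, p, method="exclusive"):
    """Return the pth percentile, or None when the position falls off the list."""
    xs = sorted(data)
    n = len(xs)
    if method == "exclusive":
        pos = p / 100 * (n + 1)        # L_p = (p/100)(n + 1)
    else:
        pos = p / 100 * (n - 1) + 1    # L_p = (p/100)(n - 1) + 1
    if pos < 1 or pos > n:
        return None                    # happens only for the exclusive method
    k = int(pos)                       # 1-based rank at or just below the position
    frac = pos - k                     # fraction of the way to the next rank
    if k == n:
        return float(xs[-1])
    return xs[k - 1] + frac * (xs[k] - xs[k - 1])

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]     # hypothetical sample, n = 9
print(percentile(data, 25, "exclusive"))  # position 2.5
print(percentile(data, 25, "inclusive"))  # position 3.0
print(percentile(data, 5, "exclusive"))   # position 0.5, so no value
```

For this sample the two methods give different 25th percentiles, and the exclusive method returns nothing for p = 5 because its position falls below 1.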
• lower/upper quartiles (Q1 and Q3)
the observations which are one quarter (Q1) and three
quarters (Q3) of the way up the list; also equal to
the median values of each half of the data located below/above the median; or the 25th and 75th percentiles
(The 0th quartile is the minimum value: Q0 = min; the
second quartile is the median: Q2 = median; and the
fourth quartile is the maximum value: Q4 = max.)
[Excel:=QUARTILE.EXC or =QUARTILE.INC;
=MAX; =MIN]
• interquartile range (IQR)
difference IQR = Q3 − Q1 between the two quartiles
• five-number summary of a data set is the list of its
five quartiles:
– minimum (min)
– lower quartile (Q1)
– median
– upper quartile (Q3)
– maximum (max)
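As a rough sketch, the five-number summary and the IQR can be computed with Python's standard `statistics` module. The choice of the inclusive method and the sample data are assumptions made here for illustration; the exclusive method would generally give slightly different quartiles.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the inclusive quartile method."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical sample
low, q1, median, q3, high = five_number_summary(data)
iqr = q3 - q1                        # interquartile range
print(low, q1, median, q3, high, iqr)
```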
• boxplot
display of the five-number summary formed by
– drawing a box over a number line so that the sides
of the box are located at the two quartiles (the width
of the box equals the IQR),
– drawing the wall (a line across the box) at the location of the median, and
– drawing whiskers (lines parallel to the number line)
extended from the sides of the box to the min and
max.
[Excel: Compute Q1, Q2 − Q1, Q3 − Q2, then. . .
Insert > Charts > Bar > Stacked Bar;
Select Data > Switch Row/Column;
ChartTools > Layout > Analysis > Error
Bars > Options; Display > Plus/Minus;
Error Amount > Custom]
• modified boxplot
same as above, except:
– whiskers extend from the sides of the box to the
fences, points positioned 1.5 IQR from each end
of the box; and
– outliers (any values lying outside the fences) are individually marked with symbols; far outliers, which
lie more than 3 IQR from the ends of the box, are
often marked with different symbols for emphasis.
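The fence rule for a modified boxplot can be sketched as follows. The helper name `classify_outliers`, the sample data, and the use of inclusive quartiles are assumptions for illustration only.

```python
import statistics

def classify_outliers(data):
    """Return the fences, the outliers, and the far outliers of a data set."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    lower_far, upper_far = q1 - 3 * iqr, q3 + 3 * iqr
    # Outliers lie outside the fences; far outliers lie more than 3 IQR from the box.
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    far_outliers = [x for x in outliers if x < lower_far or x > upper_far]
    return (lower_fence, upper_fence), outliers, far_outliers

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]   # hypothetical sample with one extreme value
fences, outliers, far_outliers = classify_outliers(data)
print(fences, outliers, far_outliers)
```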
• resistance to outliers
moving the extreme values of a data set either further
away or closer to the center of the distribution does
not change the median value; hence, the median (and
other ordered statistics) is often preferred when describing skewed data sets (income data, housing prices, etc.).
• (arithmetic) mean (x̄, µ)
a data set that includes repeated measures of some variable quantity x (sample size = n; population size = N )
has mean value equal to its arithmetical average, the
sum of the values divided by the number of values:
sample mean: $\bar{x} = \frac{\sum x_i}{n}$
population mean: $\mu = \frac{\sum x_i}{N}$
It represents the point on the number scale where the
distribution “balances” (as if the histogram were made
of some massive substance) [Excel:=AVERAGE]
• sensitivity to outliers
moving the extreme values of a data set either further
away or closer to the center of the distribution can substantially alter the mean value; hence, the mean (and
other weighted statistics) is used to describe only symmetric data sets or those without much skew. In skewed
distributions, the mean is pulled in the direction of the
skewness (the longer tail)
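A small numerical illustration of the contrast between the median's resistance and the mean's sensitivity, using made-up data in which only the largest value is moved:

```python
import statistics

original = [20, 25, 30, 35, 40]    # hypothetical symmetric data
shifted = [20, 25, 30, 35, 400]    # same data with the maximum moved far out

print(statistics.mean(original), statistics.median(original))
print(statistics.mean(shifted), statistics.median(shifted))
```

The median stays at 30 in both lists, while the mean moves from 30 to 102, toward the longer tail.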
• geometric mean
mean value appropriate for multiplicative data, those
that are combined by multiplication (e.g., rates of growth)
If repeated percentage growth rates $g_i$ over various time
intervals are collected, their corresponding growth factors have the form $(1 + g_i)$, whence the geometric mean
of these factors satisfies
$(1 + G_g) = [(1 + g_1)(1 + g_2) \cdots (1 + g_n)]^{1/n}$.
Therefore, the average growth rate equals
$G_g = \sqrt[n]{(1 + g_1)(1 + g_2) \cdots (1 + g_n)} - 1$.
If various financial rates of return $R_i$ are given, their
geometric mean return is
$G_R = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$.
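If instead of growth rates $g_i$, the periodic amounts $x_i$
are given, then as consecutive amounts satisfy
$x_{i+1} = (1 + g_i)x_i$,
the n amounts $x_1, \ldots, x_n$ determine n − 1 growth factors, and
the average growth rate simplifies to
$G_g = \sqrt[n-1]{\frac{x_2}{x_1} \cdot \frac{x_3}{x_2} \cdots \frac{x_n}{x_{n-1}}} - 1 = \sqrt[n-1]{\frac{x_n}{x_1}} - 1$.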
The geometric mean is generally somewhat less sensitive to outliers than the arithmetic mean for the same
data. [Excel:=GEOMEAN]
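A minimal sketch of both routes to the average growth rate. The function names and figures are illustrative assumptions. When amounts are given, the root taken is the number of growth periods, which is one less than the number of amounts.

```python
import math

def mean_growth_rate(rates):
    """Geometric mean growth rate from a list of per-period growth rates."""
    factors = [1 + g for g in rates]
    return math.prod(factors) ** (1 / len(factors)) - 1

def mean_growth_from_amounts(amounts):
    """Average growth rate from consecutive periodic amounts."""
    periods = len(amounts) - 1          # one growth factor per consecutive pair
    return (amounts[-1] / amounts[0]) ** (1 / periods) - 1

# Growth of +100% followed by -50% leaves the amount unchanged,
# so the geometric mean rate is 0 even though the arithmetic mean is 0.25.
print(mean_growth_rate([1.0, -0.5]))
print(mean_growth_from_amounts([100, 110, 121]))   # close to 0.10
```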
• range
difference between the max and min values of the data;
coarsest measure of spread, and highly sensitive to outliers
• deviation from the mean (x − x̄)
the difference between a single data value x and the
mean x̄ of the data set; values greater than the mean
have positive deviations, while those below the mean
have negative deviations. Each number in the data set
has its own deviation from the mean.
• mean absolute deviation (MAD)
average of the absolute values of the deviations from
the mean:
sample MAD: $\bar{d} = \frac{\sum |x_i - \bar{x}|}{n}$
population MAD: $\delta = \frac{\sum |x_i - \mu|}{N}$
Statistical theory that includes the MAD as the measure
of spread is surprisingly difficult, so it is not commonly
used in practice.
• variance (s², σ²)
an estimate of the average squared deviation from the
mean:
sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
[Excel:=VAR.S or =VAR.P]
Variance is in (meaningless) squared units, so the more
important measure of spread is. . .
• standard deviation (s, σ)
square root of the variance, a measure of spread that
estimates the size of a typical deviation from the mean:
sample standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
population standard deviation: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
[Excel:=STDEV.S or =STDEV.P]
The greater the standard deviation, the farther from the mean
most values will be found. Also, because the deviations are squared, it
weights large deviations from the mean more heavily than does
the MAD.
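The measures of spread defined above can be compared on one small data set. This is a sketch with made-up data; `statistics.variance` and `statistics.stdev` divide by n − 1 (sample versions), while `statistics.pvariance` and `statistics.pstdev` divide by N (population versions).

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]              # hypothetical data, mean = 5
mean = statistics.mean(data)

deviations = [x - mean for x in data]        # one deviation per data value
mad = sum(abs(d) for d in deviations) / len(data)

sample_var = statistics.variance(data)       # divides by n - 1
pop_var = statistics.pvariance(data)         # divides by N
sample_sd = statistics.stdev(data)
pop_sd = statistics.pstdev(data)

print(sum(deviations))                       # deviations always sum to zero
print(mad, pop_var, pop_sd)
print(sample_var, sample_sd)
```

Here the MAD is 1.5 while the population standard deviation is 2, reflecting the extra weight that squaring gives to the largest deviations.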
• coefficient of variation (CV)
a measure of relative spread that determines how
large the standard deviation is relative to the mean
value; used exclusively for values of x measured on a
ratio scale:
sample coefficient of variation: $CV = \frac{s}{\bar{x}}$
population coefficient of variation: $CV = \frac{\sigma}{\mu}$
Note that the notation CV is used for both sample and
population values; only context will distinguish which
measure is being used.
• z-score, or standardized value
a measure of relative standing that determines how far
each data value is from the mean measured in units of
standard deviations:
$z = \frac{x - \bar{x}}{s}$
Positive z-scores correspond to values larger than the
mean; negative z-scores to values below the mean.
[Excel:=STANDARDIZE]
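A short sketch of the coefficient of variation and z-scores for a sample. The function names and data are assumptions for illustration; both use the sample standard deviation s.

```python
import statistics

def coefficient_of_variation(data):
    """Sample CV: standard deviation relative to the mean."""
    return statistics.stdev(data) / statistics.mean(data)

def z_scores(data):
    """Standardize each value: distance from the mean in units of s."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return [(x - mean) / s for x in data]

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample, mean = 5
z = z_scores(data)
print(coefficient_of_variation(data))
print(z)   # negative below the mean, zero at the mean, positive above it
```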
• mean-variance analysis
In finance, an investment I that fluctuates in value over
time will produce variability in its rates of return $x_i$.
This variability is summarized by the mean rate of return $\bar{x}_I$ and the standard deviation of the rates of return
$s_I$ for the investment. The mean measures the investment’s reward, and the standard deviation its risk.
Further, the degree to which the return on the investment compensates for the risk that the investor takes
can be measured by taking the difference between
its reward $\bar{x}_I$ and the return $R_f$ of a risk-free investment (like
a treasury bill), expressed in units of risk $s_I$; this measure is
called the Sharpe ratio:
$\text{Sharpe ratio} = \frac{\bar{x}_I - R_f}{s_I}$
The higher the Sharpe ratio, the better the investment
will compensate its investors for the risk they are taking
by investing.
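The Sharpe ratio formula translates directly into code. The returns and the risk-free rate below are invented numbers used only to show the arithmetic.

```python
import statistics

def sharpe_ratio(returns, risk_free_rate):
    """Excess mean return per unit of risk (sample standard deviation)."""
    reward = statistics.mean(returns)
    risk = statistics.stdev(returns)
    return (reward - risk_free_rate) / risk

returns = [0.02, 0.04, 0.06]          # hypothetical periodic rates of return
print(sharpe_ratio(returns, 0.01))    # mean 0.04, s = 0.02, so about 1.5
```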
General properties of distributions
• Chebyshev’s Theorem
states that for any data set, the proportion of values
that lie within k standard deviations of the mean
is at least $1 - \frac{1}{k^2}$ whenever k > 1.
• Empirical Rule
states that for data sets that are nearly normally distributed, i.e., symmetric and bell-shaped (with a prominent peak describing the central cluster of data and tails
of more extreme values that thin out as one moves away from
the center),
– approximately 68% of the data will lie within one
standard deviation of the mean, within the interval
x̄ ± s;
– approximately 95% of the data will lie within two
standard deviations of the mean, within the interval
x̄ ± 2s; and
– nearly all of the data will lie within three standard
deviations of the mean, within the interval x̄ ± 3s.
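Chebyshev's bound can be checked numerically on any data set. The sketch below uses made-up data and the population standard deviation; the helper names are assumptions.

```python
import statistics

def proportion_within(data, k):
    """Fraction of values within k standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    inside = [x for x in data if abs(x - mean) <= k * sd]
    return len(inside) / len(data)

def chebyshev_bound(k):
    """Minimum proportion guaranteed by Chebyshev's Theorem (k > 1)."""
    return 1 - 1 / k**2

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data
for k in (1.5, 2, 3):
    print(k, proportion_within(data, k), chebyshev_bound(k))
```

The observed proportions are never smaller than the bound, although they are usually much larger; the Empirical Rule gives sharper figures only for nearly normal data.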
Working with grouped data
When raw data is processed by being aggregated into
intervals along its scale of measure, the mean and standard deviation of the set can be approximated by assuming that all values in an interval are concentrated
at the midpoint m of the interval.
If the n data points of the sample – or the N data points
of the population – are partitioned into k intervals with
midpoints $m_1, m_2, \ldots, m_k$, and the intervals contain
frequencies of $f_1, f_2, \ldots, f_k$, respectively, then
mean: $\bar{x} = \frac{\sum m_i f_i}{n}$, $\mu = \frac{\sum m_i f_i}{N}$
variance: $s^2 = \frac{\sum (m_i - \bar{x})^2 f_i}{n - 1}$, $\sigma^2 = \frac{\sum (m_i - \mu)^2 f_i}{N}$
standard deviation: $s = \sqrt{s^2}$, $\sigma = \sqrt{\sigma^2}$
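The grouped-data approximations for a sample can be sketched as below. The function name, midpoints, and frequencies are illustrative assumptions; the variance divides by n − 1, matching the sample formula.

```python
import math

def grouped_sample_stats(midpoints, freqs):
    """Approximate sample mean, variance, and standard deviation from grouped data."""
    n = sum(freqs)                                   # total number of data points
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    variance = sum((m - mean) ** 2 * f for m, f in zip(midpoints, freqs)) / (n - 1)
    return mean, variance, math.sqrt(variance)

midpoints = [5, 15, 25]    # hypothetical interval midpoints
freqs = [2, 5, 3]          # hypothetical interval frequencies, n = 10
mean, variance, sd = grouped_sample_stats(midpoints, freqs)
print(mean, variance, sd)
```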
Weighted mean
We can reinterpret the mean value formula above by
considering the midpoints $m_i$ as a data set to which
we assign weights given by the corresponding relative
frequencies $f_i/n$; a greater weight gives the midpoint
value a larger contribution to the value of the mean.
More generally, if data values $x_i$ are assigned corresponding weights $w_i$, where $\sum w_i = 1$, then we find
the weighted mean by the similar formulas
$\bar{x} = \sum w_i x_i$, $\mu = \sum w_i x_i$
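A minimal sketch of the weighted mean, assuming the weights have already been normalized to sum to 1. The function name and the numbers are made up for illustration.

```python
def weighted_mean(values, weights):
    """Weighted mean for weights that sum to 1."""
    if abs(sum(weights) - 1) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, values))

# Two hypothetical scores weighted 25% and 75%.
print(weighted_mean([80, 90], [0.25, 0.75]))

# Midpoints weighted by relative frequencies f_i / n reproduce a grouped mean.
print(weighted_mean([5, 15, 25], [0.2, 0.5, 0.3]))
```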
Statistics for paired data
• covariance ($s_{xy}$, $\sigma_{xy}$)
a measure of the direction and strength of the linear
association between paired data variables, where x is
the explanatory and y the response variable:
sample covariance: $s_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$
population covariance: $\sigma_{xy} = \frac{\sum (x - \mu_x)(y - \mu_y)}{N}$
Note that covariance is measured in product units (the units of x times the units of y).
[Excel:=COVARIANCE.S or =COVARIANCE.P]
• correlation ($r_{xy}$, $\rho_{xy}$)
a standardized version of covariance:
sample correlation: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
population correlation: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
[Excel:=CORREL (for samples only)]
Analyzing Data: Paired Quantitative Data
• positive values of $s_{xy}$, $r_{xy}$ indicate a positive association
between x and y; negative values of $s_{xy}$, $r_{xy}$ indicate a
negative association between x and y
• $r_{xy}$ always lies between −1 and 1, with values close to 0
indicating weak association, values close to 1 a strong
positive association, and values close to −1 a strong negative association
• while it is possible to compute the covariance and correlation for any pair of quantitative variables, only linear
associations are evaluated by these statistics
• the statistics $s_{xy}$, $r_{xy}$ are highly sensitive to outliers, so
the presence of an outlier can dramatically alter their
values
• there may be a strong association between variables
without there being any cause/effect relation between
them: association does not signify causation. Sometimes, there is a third lurking variable (one that is
not measured by the investigator) standing behind the
other two variables as a common and hidden determining factor.
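The sample covariance and correlation for paired data, and their sensitivity to a single outlier, can be sketched as follows. The function names and the data are assumptions for illustration only.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Covariance standardized by the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))   # covariance of x with itself is s_x squared
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]             # hypothetical perfectly linear data
ys_outlier = [2, 4, 6, 8, -10]    # same data with one value replaced by an outlier

print(sample_covariance(xs, ys), sample_correlation(xs, ys))
print(sample_covariance(xs, ys_outlier), sample_correlation(xs, ys_outlier))
```

Changing a single y value is enough here to turn a perfect positive correlation into a negative one.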