Treatment and analysis of data – Applied statistics
Lecture 3: Sampling and descriptive statistics
Topics covered:
• Parameters and statistics
• Sample mean and sample standard deviation
• Order statistics and quantiles
• Confidence intervals and confidence levels
• Error bars and box plots
• Histograms
• Cumulative and percentile plots
• Probability plots
Sept-Oct 2006
Statistics for astronomers (L. Lindegren, Lund Observatory)
Lecture 3, p. 1
Population, parameters, sample and statistics
Sample space (in probability theory) ≡ population (in statistics)
A (random) sample is drawn from the population.
[Diagram: the population, described by certain parameters such as μ and σ, yields the data through sampling (data collection); data analysis turns the data into statistics such as m and s; inference leads from the data back to the population.]
Parameters and statistics
A parameter is a quantity that describes a population (e.g. the population mean μ
and population standard deviation σ).
Data are obtained by sampling the population (e.g., x_1, x_2, ..., x_n).
Any function of the data is called a statistic. Examples of statistics:
• n - the number of data points
• min(x_1, x_2, ..., x_n) - the smallest data value
• x_1 + n^{1/3} - not a very useful statistic
• m = (x_1 + x_2 + ... + x_n)/n - the sample mean
• s = [ Σ_i (x_i – m)^2 / (n – 1) ]^{1/2} - the sample standard deviation
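The statistics listed above can be computed directly from their definitions. The following is a minimal sketch in Python; the function names and the data values are hypothetical and chosen only for illustration.

```python
import math

def sample_mean(x):
    """Sample mean m = (x_1 + ... + x_n) / n."""
    return sum(x) / len(x)

def sample_std(x):
    """Sample standard deviation s, using the (n - 1) divisor."""
    m = sample_mean(x)
    return math.sqrt(sum((xi - m) ** 2 for xi in x) / (len(x) - 1))

# Hypothetical data values, for illustration only.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(x)            # the number of data points
smallest = min(x)     # the smallest data value
m = sample_mean(x)    # the sample mean
s = sample_std(x)     # the sample standard deviation
print(f"n = {n}, min = {smallest}, m = {m:.3f}, s = {s:.3f}")
```

Each of n, smallest, m and s is a function of the data only, which is what makes it a statistic rather than a parameter.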
Descriptive statistics
Simple calculations on the data condense them into a form that is useful, e.g., in order to
• summarize results in a way that is quickly grasped
• assess the quality of the data
• compare different sets of data
• explore what kind of information the data may contain
• support a statement (make a conclusion more convincing)
When the data represent a more or less unknown distribution, the most important statistics may be
• some measure of location, such as the sample mean or median
• some measure of scale (or scatter, or precision), such as the sample standard deviation or interquartile range
This is often supported by graphics, which give much more complete information on distributions. (A graph is also a statistic.)
Sample mean and sample standard deviation
Important comment
Be careful to distinguish between:
• the sample standard deviation
$$ s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - m)^2 } $$
which measures the dispersion among the values x_1, x_2, ..., x_n around the sample mean value m, and
• the standard deviation of the sample mean, which is usually estimated as
$$ D[m] = \frac{s}{\sqrt{n}} = \sqrt{ \frac{1}{n(n-1)} \sum_{i=1}^{n} (x_i - m)^2 } $$
and which may be quoted as the standard error (1σ uncertainty) of m.
E.g.: "the mean value and dispersion of the data are 12.3 ± 2.5" is ambiguous!
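The difference between the two quantities shows up clearly when the sample size changes: s stays roughly constant while D[m] shrinks as 1/√n. A minimal sketch, assuming simulated Gaussian data (the function name, the seed and the simulated values are illustrative assumptions):

```python
import math
import random

def mean_and_errors(x):
    """Return (m, s, D[m]): sample mean, sample standard deviation,
    and the estimated standard deviation of the sample mean."""
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((xi - m) ** 2 for xi in x) / (n - 1))
    return m, s, s / math.sqrt(n)

random.seed(1)
# Simulated data with true mean 12.3 and true standard deviation 2.5.
small = [random.gauss(12.3, 2.5) for _ in range(25)]
large = [random.gauss(12.3, 2.5) for _ in range(2500)]

for data in (small, large):
    m, s, se = mean_and_errors(data)
    print(f"n = {len(data):4d}: m = {m:.2f}, s = {s:.2f}, D[m] = {se:.3f}")
```

Quoting "m ± s" describes the scatter of individual values, while "m ± D[m]" describes how well the mean itself is known, so the choice should always be stated.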
Alternative measures of location and scale
The sample mean and sample standard deviation are very sensitive to outliers or
strongly deviating points. In manual data analysis one can often cope interactively
with these cases, but for automatic analysis it is better to use a more robust
method.
In such cases, or when the distribution is known or suspected to be non-gaussian,
there are many other useful measures of location and scale.
Instead of the sample mean m we may use the sample median xmed (see below).
Instead of the sample standard deviation (= RMS deviation from the sample
mean), we may use the mean absolute deviation from the mean:
$$ \mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - m| $$
Often the sample median is used instead of the sample mean when calculating the
MAD. In fact, for any fixed sample the median minimizes the MAD, so it is
logical to use the median and MAD together.
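A small numerical check of the statement that the median minimizes the MAD. This is a sketch with hypothetical data containing one outlier; the helper name mean_abs_dev is an assumption.

```python
import statistics

def mean_abs_dev(x, center):
    """Mean absolute deviation of the values x from a chosen center."""
    return sum(abs(xi - center) for xi in x) / len(x)

# Hypothetical data; the last value is a gross outlier.
x = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]

m = statistics.mean(x)
x_med = statistics.median(x)
mad_mean = mean_abs_dev(x, m)       # MAD about the sample mean
mad_med = mean_abs_dev(x, x_med)    # MAD about the sample median
print(f"mean = {m:.2f}, MAD about mean = {mad_mean:.2f}")
print(f"median = {x_med:.2f}, MAD about median = {mad_med:.2f}")
```

The outlier pulls the mean far from the bulk of the data, while the median and the MAD about the median change much less.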
Order statistics
Sample quantiles
Quantiles for the normal (Gaussian) distribution
32% of the area is outside ±1σ
68% of the area is within ±1σ
4.6% of the area is outside ±2σ
0.3% of the area is outside ±3σ
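These fractions follow from the error function: the area outside ±kσ is erfc(k/√2). A minimal sketch using Python's standard math module (the function name is an assumption):

```python
import math

def fraction_outside(k):
    """Fraction of a normal distribution lying outside +/- k sigma."""
    return math.erfc(k / math.sqrt(2.0))

for k in (1, 2, 3):
    outside = fraction_outside(k)
    print(f"+/-{k} sigma: {100 * outside:.2f}% outside, "
          f"{100 * (1 - outside):.2f}% within")
```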
[Figure: normal distribution, frequency versus value, with the value axis marked at −3σ, −2σ, −1σ, 0, +1σ, +2σ, +3σ.]
Confidence intervals and levels – normal case (1)
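[Figure: normal distribution (frequency versus value), with percentiles marked against the corresponding number of standard deviations.]
percentile:            0.5     5     25    50    75    95    99.5
standard deviations:  -2.58  -1.65  -0.67   0   0.67  1.65   2.58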
Alternatively, the precision can be specified as a confidence interval,
with an associated confidence level (CL):
x = 3.7 ± 2.5 (90% CL) or 1.2 < x < 6.2 (90% CL)
x > 1.2 (95% CL) [one-sided confidence interval]
Confidence intervals and levels – normal case (2)
Confidence level    Two-sided confidence interval (for normal distr.)
50%                 [ −0.67σ, +0.67σ ]
68%                 [ −1.00σ, +1.00σ ]
90%                 [ −1.65σ, +1.65σ ]
95%                 [ −1.96σ, +1.96σ ]
99%                 [ −2.58σ, +2.58σ ]
99.9%               [ −3.29σ, +3.29σ ]
Caution: older astronomical literature (before 1960) often uses “probable error” (p.e.),
which corresponds to 50% CL or ±0.67σ.
Thus:
(standard error) ≈ 1.5 × (probable error)
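The multipliers for the two-sided intervals, and the factor relating standard error to probable error, can be checked with the inverse normal cdf. A minimal sketch using statistics.NormalDist from the Python standard library (the function name is an assumption). Note that exactly 68% gives 0.99σ; the familiar ±1.00σ corresponds to 68.27%.

```python
from statistics import NormalDist

def two_sided_multiplier(confidence_level):
    """Return k such that [-k*sigma, +k*sigma] contains the given
    probability for a normal distribution."""
    return NormalDist().inv_cdf(0.5 + confidence_level / 2.0)

for cl in (0.50, 0.68, 0.90, 0.95, 0.99, 0.999):
    k = two_sided_multiplier(cl)
    print(f"{100 * cl:5.1f}% CL: +/- {k:.2f} sigma")

# Probable error = 0.6745 sigma, so sigma = (1 / 0.6745) * p.e.
standard_over_probable = 1.0 / two_sided_multiplier(0.50)
print(f"standard error / probable error = {standard_over_probable:.3f}")
```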
Deviations from the normal distribution
Actual errors rarely follow the normal distribution:
• usually points beyond ±3σ are much more frequent than expected for a normal distribution (0.3%)
• the distribution is often skew, especially in the tails
• sometimes the distribution is completely different, e.g. exponential
Although the standard deviation is applicable to many non-normal cases, it could
be misleading without further specification of the distribution.
For instance, given only the information
x = 3.7 ± 1.5 (s.e.)
one might conclude that x > 8.2 is very unlikely (0.15%). However, if x has a lognormal distribution, the probability is in fact 2 – 3%.
Quantiles (fractiles), percentiles, quartiles, etc
Other names for quantiles at certain q-values:
Q(0.5) = median (or 50th percentile)
Q(0.25) = lower quartile (or 25th percentile)
Q(0.75) = upper quartile (or 75th percentile)
Q(0.1) = first decile, Q(0.2) = second decile, etc. [not so often used]
The interquartile range IQR = Q(0.75) – Q(0.25) is sometimes used as a measure
of precision (equal to 1.35σ for a normal distribution).
Half the "intersextile range" (not a standard term), [Q(5/6) – Q(1/6)]/2, equals 0.97σ for
a normal distribution and is useful as a robust assessment of the dispersion.
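Both ranges can be turned into robust estimates of σ by dividing by their value for a normal distribution. The sketch below is one possible way to do this with the Python standard library; the function name, the simulated data and the added outliers are assumptions, and statistics.quantiles uses its default ("exclusive") interpolation rule, which is only one of several conventions for sample quantiles.

```python
import random
import statistics
from statistics import NormalDist

# Values of the two ranges, in units of sigma, for a normal distribution.
IQR_PER_SIGMA = 2.0 * NormalDist().inv_cdf(0.75)        # about 1.35
HALF_ISR_PER_SIGMA = NormalDist().inv_cdf(5.0 / 6.0)    # about 0.97

def robust_sigmas(x):
    """Estimate sigma from the IQR and from half the intersextile range."""
    quartiles = statistics.quantiles(x, n=4)   # Q(1/4), Q(2/4), Q(3/4)
    sextiles = statistics.quantiles(x, n=6)    # Q(1/6), ..., Q(5/6)
    iqr = quartiles[2] - quartiles[0]
    half_isr = (sextiles[4] - sextiles[0]) / 2.0
    return iqr / IQR_PER_SIGMA, half_isr / HALF_ISR_PER_SIGMA

random.seed(3)
clean = [random.gauss(0.0, 2.0) for _ in range(1000)]   # true sigma = 2
dirty = clean + [200.0, -150.0, 300.0]                  # three gross outliers

print("sample standard deviation:", round(statistics.stdev(dirty), 2))
print("from IQR, from intersextile range:",
      [round(v, 2) for v in robust_sigmas(dirty)])
```

The sample standard deviation is inflated by the three outliers, while both quantile-based estimates stay close to the true value of 2.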
NOTE: The terms quantile, fractile, and percentile are used almost synonymously
in the literature, while median, quartile, decile etc have very specific meanings.
Error bars and box plots
Error bars usually indicate ±1σ (i.e. the confidence interval at 68% CL).
If not, the exact meaning must definitely be stated in the figure caption.
Box plots (or box-whisker plots) show, from top to bottom:
• “outliers” (>1.5×IQR from median)
• highest “non-outlier”
• upper quartile
• median
• lower quartile
• lowest “non-outlier”
• “outlier”
Histograms
One-dimensional sample distributions are often shown as histograms. A histogram
displays the number of data points per bin, versus the position of the bin (or the
density of data points, if unequal bin sizes are used).
E.g., define the sequence x_0, x_1, ..., x_n, which are the boundaries of n bins.
Equal bins of size Δx are obtained as x_i = x_0 + i Δx, i = 1, 2, ..., n.
Let h_i be the number of data points with x_{i–1} ≤ x < x_i. (Note position of <)
In the histogram, h_i (or sometimes h_i/Δx_i) is plotted as a bar from x_{i–1} to x_i.
Things to consider when constructing a histogram:
• Which bin size to use? - compromise between resolution and noise. In any case, be careful to specify the bin size if it is not clear from the graph!
• Where to start (x_0)? - often arbitrary!
• What to do with points outside x_0, x_n (if any)?
A difficulty with histograms is that they look radically different depending on the
choices you make!
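The binning rule and the effect of the starting value can be made concrete in a few lines. This is a minimal sketch for equal bins; the function name, its arguments and the data are hypothetical, and counting the out-of-range points separately is just one possible answer to the question of what to do with them.

```python
import math

def histogram(data, x0, dx, nbins):
    """Count data points in nbins equal bins of width dx starting at x0.

    Bin i (counted from 0) holds the points with
    x0 + i*dx <= x < x0 + (i + 1)*dx, i.e. half-open intervals.
    Points outside the binned range are counted separately.
    """
    counts = [0] * nbins
    below = above = 0
    for x in data:
        i = math.floor((x - x0) / dx)
        if i < 0:
            below += 1
        elif i >= nbins:
            above += 1
        else:
            counts[i] += 1
    return counts, below, above

# Hypothetical data values, for illustration only.
data = [0.5, 1.9, 2.0, 2.1, 3.9, 4.0, 7.5]

print(histogram(data, x0=0.0, dx=2.0, nbins=3))    # bins start at 0
print(histogram(data, x0=-1.0, dx=2.0, nbins=3))   # same bin size, shifted start
```

The two calls use the same data and the same bin size, yet give different bar heights, because the points near the bin boundaries move to a neighbouring bin when the starting value is shifted.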
Different histograms of the same data... (1)
[Figure: four histograms, each with bin size = 2.]
These histograms (of the same 200 points) differ only in the choice of starting value x_0.
Different histograms of the same data... (2)
[Figure: four histograms with bin size = 2, 1, 1 and 0.5.]
These histograms (of the same 200 points) differ in bin size as well. It is better to make the
bins too narrow than too wide: the eye can smooth out the noise but cannot recover lost
resolution! Note that the uncertainty of any histogram value h_i is of order ±√h_i.
Cumulative plots
An alternative to the histogram is to plot the cumulative fraction, analogous to the
cumulative distribution function (cdf):
theoretical distributions             empirical data
cumulative distribution function  ⇔  cumulative fraction
probability density function      ⇔  histogram
The cumulative fraction is a step function that increments by 1/n for each data
point, starting from 0 and ending at 1.
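The step function is easy to construct from the sorted data. A minimal sketch (the function names and data values are hypothetical):

```python
from bisect import bisect_right

def cumulative_fraction_steps(data):
    """Return (x, fraction) pairs: the height reached by the step
    function just after each sorted data point."""
    xs = sorted(data)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

def cumulative_fraction_at(data, x):
    """Fraction of data points that are <= x."""
    xs = sorted(data)
    return bisect_right(xs, x) / len(xs)

# Hypothetical data values, for illustration only.
data = [3.0, 1.0, 2.0, 2.0]
print(cumulative_fraction_steps(data))
print(cumulative_fraction_at(data, 2.0))
```

Repeated values (here the two points at 2.0) simply produce a larger step at that position.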
Cumulative plot, example
Cumulative fraction plot for the same 200 data points as in the histograms. The
two modes can be seen as the steeper parts of the curve around 10 and 15.
Cumulative plots, some more examples (1)
You can transform the scale of data values to emphasize important intervals.
For example, for strictly positive data it often makes sense to use a logarithmic
scale (this and following examples from bardeen.physics.csbsju.edu/stats/).
Cumulative plots, some more examples (2)
[Figure: cumulative plots of two samples, B1 and B2.]
Cumulative plots are excellent for comparing two samples: do they have the same
distribution? (Cf. K-S test.) This also works for samples of unequal size.
The two samples B1 and B2 are clearly drawn from different populations. This is
also evident from the box plot, but not from the mean/dispersion plot (right).
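The visual comparison of two cumulative plots can be quantified by the largest vertical distance between them, which is the statistic used by the two-sample K-S test. The sketch below computes only that distance, not the associated significance level; the function name and the data are hypothetical.

```python
from bisect import bisect_right

def max_cumulative_difference(a, b):
    """Largest vertical distance between the cumulative fractions of
    two samples (which may have unequal sizes)."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    # Both step functions change only at data points, so it is enough
    # to compare them at every data value of either sample.
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)
        fb = bisect_right(sb, x) / len(sb)
        d = max(d, abs(fa - fb))
    return d

# Hypothetical samples of unequal size, for illustration only.
b1 = [1.0, 2.0, 3.0, 4.0]
b2 = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(max_cumulative_difference(b1, b2))
```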
Percentile plots
The ragged appearance of the cumulative plot can be disturbing to the eye,
especially for small n.
It may then be better to use a percentile plot (red line), which simply connects the
n points with x(i) as abscissa and p = i/(n+1) as ordinate. This is actually a better
estimate of the cumulative distribution function than the cumulative fraction plot.
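The points of a percentile plot, and a quantile read off from it by linear interpolation between neighbouring points, can be sketched as follows. The function names and data are hypothetical, and the interpolation is one simple choice rather than the only possible one.

```python
def percentile_points(data):
    """Return the (x_(i), i/(n+1)) pairs of a percentile plot."""
    xs = sorted(data)
    n = len(xs)
    return [(x, i / (n + 1)) for i, x in enumerate(xs, start=1)]

def quantile_from_plot(points, p):
    """Read off the value at level p by linear interpolation between
    neighbouring points of the percentile plot."""
    for (x_lo, p_lo), (x_hi, p_hi) in zip(points, points[1:]):
        if p_lo <= p <= p_hi:
            w = (p - p_lo) / (p_hi - p_lo)
            return x_lo + w * (x_hi - x_lo)
    raise ValueError("p lies outside the plotted range 1/(n+1) ... n/(n+1)")

# Hypothetical data values, for illustration only.
points = percentile_points([30.0, 10.0, 20.0])
print(points)
print(quantile_from_plot(points, 0.5))
```

Unlike the cumulative fraction, the ordinates never reach 0 or 1, so the plot does not claim that the smallest and largest data values are the limits of the distribution.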
Percentile plot, example
Percentile plot for the same 200 data points as in the histograms and in the
cumulative fraction plot (see “Cumulative plot, example”).
Transformed percentiles...
[Figure: two panels, each with n = 200.]
Sometimes it is useful to transform the percentile scale to bring out more clearly the
important parts of the distribution.
In this example (a sample drawn from $\chi^2_3$) we are concerned about the tail of
large values, which is difficult to see in the standard percentile plot (left). By
plotting 1 – p instead of p and using a logarithmic scale, the tail is emphasized.
Probability plots
As n → ∞ the percentile plot converges to the cdf F(x).
To see if the data follow a given distribution F(x), we could make a percentile plot
with F^{–1}(i/(n+1)) on the y-axis instead of i/(n+1). If the data follow F(x) we should
then get (approximately) a straight line. This is a probability plot.
The nice thing about probability plots is that any linear transformation a x_i + b of the
data will just shift and change the slope of the curve, but a straight line (for
example) remains straight.
The most common type of this plot is the normal probability plot, using the
standard normal cdf
$$ \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{t^2}{2} \right) dt $$
The abscissae are x_(i) and the ordinates are Φ^{–1}(i/(n+1)) for i = 1, 2, ..., n.
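The coordinates of a normal probability plot can be sketched as below, using statistics.NormalDist from the Python standard library for Φ^{–1}. The straight-line fit is an added illustration, not part of the lecture: for Gaussian data x ≈ μ + σz, so regressing the data values on the normal scores gives rough estimates of μ (intercept) and σ (slope). Function names, the seed and the simulated sample are assumptions.

```python
import random
from statistics import NormalDist

def normal_probability_points(data):
    """Return (x_(i), z_i) pairs with z_i = Phi^{-1}(i / (n + 1))."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()
    return [(x, std_normal.inv_cdf(i / (n + 1)))
            for i, x in enumerate(xs, start=1)]

def fit_location_scale(points):
    """Least-squares fit of x = intercept + slope * z to the plot points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    z_bar = sum(z for _, z in points) / n
    sxz = sum((z - z_bar) * (x - x_bar) for x, z in points)
    szz = sum((z - z_bar) ** 2 for _, z in points)
    slope = sxz / szz
    return x_bar - slope * z_bar, slope

random.seed(0)
# Simulated sample: 50 normal random numbers with mean 2 and s.d. 5.
sample = [random.gauss(2.0, 5.0) for _ in range(50)]
intercept, slope = fit_location_scale(normal_probability_points(sample))
print(f"intercept (location) = {intercept:.2f}, slope (scale) = {slope:.2f}")
```

A linear transformation of the data changes only the abscissae, so the ordinates z_i are untouched and a straight line stays straight.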
The inverse standard normal cdf
To make normal probability plots you need to be able to compute the inverse
standard normal cdf Φ^{–1}(p) for any 0 < p < 1. Routines for this are available in
most numerical/statistical packages (can be found e.g. in Numerical Recipes).
If not readily available, use the following approximation, which is always good
enough for probability plots (maximum error is 0.003; Abramowitz & Stegun,
Handbook of Mathematical Functions):
$$ \Phi^{-1}(p) = \begin{cases} \dfrac{2.30753 + 0.27061\,t}{1 + 0.99229\,t + 0.04481\,t^2} - t & \text{if } 0 < p \le 0.5 \\[2ex] -\Phi^{-1}(1 - p) & \text{if } 0.5 \le p < 1 \end{cases} $$
where
$$ t = \sqrt{-2 \ln p} $$
The values Φ–1(p) are sometimes called the normal scores.
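The approximation quoted above is short enough to code directly. The sketch below implements it and compares it with statistics.NormalDist().inv_cdf from the Python standard library on a grid of p values; the function name and the choice of grid are assumptions.

```python
import math
from statistics import NormalDist

def inverse_normal_cdf_approx(p):
    """Approximate Phi^{-1}(p) with the rational formula quoted above."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must satisfy 0 < p < 1")
    if p > 0.5:
        # Use the symmetry Phi^{-1}(p) = -Phi^{-1}(1 - p).
        return -inverse_normal_cdf_approx(1.0 - p)
    t = math.sqrt(-2.0 * math.log(p))
    return (2.30753 + 0.27061 * t) / (1.0 + 0.99229 * t + 0.04481 * t * t) - t

# Compare with the library routine on a grid of p values.
exact = NormalDist()
max_error = max(abs(inverse_normal_cdf_approx(i / 1000) - exact.inv_cdf(i / 1000))
                for i in range(1, 1000))
print(f"maximum absolute error on the grid: {max_error:.4f}")
```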
Percentile vs. probability plot (1)
[Figure, n = 50: percentile plot with the levels p = 1/6, 0.5 and 5/6 marked; read off: 1st sextile ≈ -3, median ≈ 1, 5th sextile ≈ 7.]
Percentile plot for 50 random numbers from a normal distribution with mean = 2
and s.d. = 5.
Note that you can use the percentile plot to estimate quantiles, e.g. the median and
the first/last sextiles.
Percentile vs. probability plot (2)
[Figure, n = 50: normal probability plot with the normal scores -1, 0 and +1 marked; read off: -1σ ≈ -3, median ≈ 1, +1σ ≈ 7.]
Normal probability plot for the same 50 random numbers.
The approximately straight relationship suggests that the data are indeed gaussian.
The median and the quantiles corresponding to ±1σ for the normal distribution are
easily found.
Normal probability plots, expected variation (n = 20)
Normal probability plots, expected variation (n = 200)
Normal probability plot for a non-normal sample
[Figure: n = 200; bin size = 0.5.]
Normal probability plot for the bimodal sample plotted earlier in the histograms
(see “Different histograms of the same data...”).
Normal probability plot for a non-normal sample
Typical normal probability plot (n = 200) for a sample that is nearly gaussian, but with some
outliers.
Normal probability plot for a non-normal sample
Normal probability plot (n = 200) for a sample drawn from the Cauchy distribution with
location α = 2 and scale β = 5.
Cauchy probability plot for the Cauchy sample
[Figure: n = 200. Annotation: probability plots may not be very useful for extreme distributions like Cauchy!]
Cauchy probability plot for the same sample as in the previous slide. The inverse
cdf for the standard Cauchy distribution is F^{–1}(p) = tan[(p – 0.5)π].
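The same recipe as for the normal probability plot applies with the Cauchy inverse cdf in place of Φ^{–1}. A minimal sketch; the function names and the seed are assumptions, and the sample is simulated with location 2 and scale 5 by passing uniform random numbers through the inverse cdf.

```python
import math
import random

def cauchy_inverse_cdf(p):
    """Inverse cdf of the standard Cauchy distribution."""
    return math.tan((p - 0.5) * math.pi)

def cauchy_probability_points(data):
    """Return (x_(i), F^{-1}(i / (n + 1))) pairs for a Cauchy probability plot."""
    xs = sorted(data)
    n = len(xs)
    return [(x, cauchy_inverse_cdf(i / (n + 1)))
            for i, x in enumerate(xs, start=1)]

random.seed(4)
alpha, beta = 2.0, 5.0   # location and scale of the simulated sample
sample = [alpha + beta * cauchy_inverse_cdf(random.random()) for _ in range(200)]
points = cauchy_probability_points(sample)
print(points[0], points[len(points) // 2], points[-1])
```

The first and last points typically lie very far from the bulk of the data, which is why a few extreme values dominate the scale of such a plot.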
Related to probability plots...
[Figure: “20th Century’s 100 largest disasters worldwide”, plotted on logarithmic axes (tick labels 10^0 to 10^2 and 10^-2 to 10^0). Curves: Technological ($10B), Natural ($100B), US Power outages (10M of customers, 1985-1997). Reference line: slope = -1 (α = 1).]
A histogram plot from Hipparcos data analysis
ESA SP-1200
Vol. 3, Fig. 16.28
Normalised differences between the FAST and NDAC parallax estimates for
successive solutions (12, 18, 30, 37 months of data). n = 40,000 - 100,000.
The same data in a normal probability plot
Real data are sometimes surprisingly Gaussian!
ESA SP-1200
Vol. 3, Fig. 16.29
Normalised differences between the FAST and NDAC parallax estimates for
successive solutions (12, 18, 30, 37 months of data). n = 40,000 - 100,000.