Treatment and analysis of data – Applied statistics
Lecture 3: Sampling and descriptive statistics

Topics covered:
• Parameters and statistics
• Sample mean and sample standard deviation
• Order statistics and quantiles
• Confidence intervals and confidence levels
• Error bars and box plots
• Histograms
• Cumulative and percentile plots
• Probability plots

Sept-Oct 2006 Statistics for astronomers (L. Lindegren, Lund Observatory) Lecture 3, p. 1

Population, parameters, sample and statistics

Sample space (in probability theory) ≡ population (in statistics).
A (random) sample is drawn from the population.
[Diagram: the population (described by certain parameters such as μ and σ) yields data through data sampling (data collection); the data yield statistics such as m and s through data analysis; the statistics lead back to the population through inference.]

Parameters and statistics

A parameter is a quantity that describes a population (e.g. the population mean μ and population standard deviation σ).
Data are obtained by sampling the population (e.g. $x_1, x_2, \ldots, x_n$).
Any function of the data is called a statistic. Examples of statistics:
• $n$: the number of data points
• $\min(x_1, x_2, \ldots, x_n)$: the smallest data value
• $x_1 + n^{1/3}$: not a very useful statistic
• $m = (x_1 + x_2 + \cdots + x_n)/n$: the sample mean
• $s = \left[ \sum_i (x_i - m)^2 / (n-1) \right]^{1/2}$: the sample standard deviation

Descriptive statistics

Simple calculations on the data allow us to condense them into a form that is useful, e.g. in order to:
• summarize results in a way that is quickly grasped
• assess the quality of the data
• compare different sets of data
• explore what kind of information the data may contain
• support a statement (make a conclusion more convincing)

When the data represent a more or less unknown distribution, the most important statistics may be:
• some measure of location, such as the sample mean or median
• some measure of scale (or scatter, or precision), such as the sample standard deviation or interquartile range

This is often supported by graphics, which give much more complete information on distributions. (A graph is also a statistic.)

Sample mean and sample standard deviation

Important comment

Be careful to distinguish between:
• the sample standard deviation
$$ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - m)^2} $$
which measures the dispersion among the values $x_1, x_2, \ldots, x_n$ around the sample mean value $m$, and
• the standard deviation of the sample mean, which is usually estimated as
$$ D[m] = \frac{s}{\sqrt{n}} = \sqrt{\frac{1}{n(n-1)} \sum_{i=1}^{n} (x_i - m)^2} $$
and which may be quoted as the standard error (1σ uncertainty) of $m$.

E.g.: "the mean value and dispersion of the data are 12.3 ± 2.5" is ambiguous! It does not say whether 2.5 is $s$ or $D[m]$.

Alternative measures of location and scale

The sample mean and sample standard deviation are very sensitive to outliers or strongly deviating points. In manual data analysis one can often cope interactively with these cases, but for automatic analysis it is better to use a more robust method.
In such cases, or when the distribution is known or suspected to be non-gaussian, there are many other useful measures of location and scale.
Instead of the sample mean $m$ we may use the sample median $x_{\rm med}$ (see below).
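
The distinction between $s$ and $D[m]$, and the sensitivity of the mean to outliers, can be illustrated with a minimal Python sketch using only the standard library. The data values and the helper name `summarize` are made up for illustration and are not part of the lecture.

```python
import math
import statistics

def summarize(data):
    """Return n, sample mean m, median, sample s.d. s, and standard error D[m]."""
    n = len(data)
    m = statistics.mean(data)
    s = statistics.stdev(data)      # uses the (n - 1) denominator, as in the lecture
    return {
        "n": n,
        "mean": m,
        "median": statistics.median(data),
        "s": s,                     # dispersion of the data around m
        "sem": s / math.sqrt(n),    # estimated standard deviation of m itself
    }

# Hypothetical measurements; the second set adds one strongly deviating point.
clean = [12.1, 11.8, 12.5, 12.3, 11.9, 12.4, 12.0, 12.2]
with_outlier = clean + [25.0]

for label, data in (("clean", clean), ("with outlier", with_outlier)):
    r = summarize(data)
    print(f"{label:13s} n={r['n']}  mean={r['mean']:.3f}  median={r['median']:.3f}  "
          f"s={r['s']:.3f}  D[m]={r['sem']:.3f}")
```

In this example the single outlier shifts the mean and inflates $s$ considerably, while the median barely moves. When quoting a result as "value ± something", state explicitly whether the second number is $s$ or $D[m]$.
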
Instead of the sample standard deviation (= RMS deviation from the sample mean), we may use the mean absolute deviation from the mean:
$$ \mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - m| $$
Often the sample median is used instead of the sample mean when calculating the MAD. In fact, for any fixed sample the median minimizes the MAD, so it is logical to use the median and MAD together.

Order statistics

Sample quantiles

Quantiles for the normal (Gaussian) distribution

• 68% of the area is within ±1σ (32% is outside ±1σ)
• 4.6% of the area is outside ±2σ
• 0.3% of the area is outside ±3σ
[Figure: normal distribution curve; horizontal axis "value" from −3σ to +3σ, vertical axis "frequency".]

Confidence intervals and levels – normal case (1)

[Figure: normal distribution curve (vertical axis "frequency") with two horizontal scales, percentile value and standard deviations:]
Percentile value       0.5     5      25     50    75     95     99.5
Standard deviations    −2.58   −1.65  −0.67  0     0.67   1.65   2.58

Alternatively, the precision can be specified as a confidence interval, with an associated confidence level (CL):
• x = 3.7 ± 2.5 (90% CL), or 1.2 < x < 6.2 (90% CL)
• x > 1.2 (95% CL) [one-sided confidence interval]

Confidence intervals and levels – normal case (2)

Confidence level    Two-sided confidence interval (for normal distribution)
50%                 [ −0.67σ, +0.67σ ]
68%                 [ −1.00σ, +1.00σ ]
90%                 [ −1.65σ, +1.65σ ]
95%                 [ −1.96σ, +1.96σ ]
99%                 [ −2.58σ, +2.58σ ]
99.9%               [ −3.29σ, +3.29σ ]

Caution: older astronomical literature (< 1960) often uses "probable error" (p.e.), which corresponds to 50% CL or ±0.67σ. Thus: (standard error) ≈ 1.5 × (probable error).

Deviations from the normal distribution

Actual errors rarely follow the normal distribution:
• usually points beyond ±3σ are much more frequent than expected for a normal distribution (0.3%)
• the distribution is often skew, especially in the tails
• sometimes the distribution is completely different, e.g. exponential

Although the standard deviation is applicable to many non-normal cases, it could be misleading without further specification of the distribution.
For instance, given only the information x = 3.7 ± 1.5 (s.e.), one might conclude that x > 8.2 (a deviation of more than +3σ) is very unlikely (0.15%). However, if x has a lognormal distribution, the probability is in fact 2 – 3%.

Quantiles (fractiles), percentiles, quartiles, etc.

Other names for quantiles at certain q-values:
• Q(0.5) = median (or 50th percentile)
• Q(0.25) = lower quartile (or 25th percentile)
• Q(0.75) = upper quartile (or 75th percentile)
• Q(0.1) = first decile, Q(0.2) = second decile, etc. [not so often used]

The interquartile range IQR = Q(0.75) – Q(0.25) is sometimes used as a measure of precision (equal to 1.35σ for a normal distribution).
Half the "intersextile range" (not a standard term), [Q(5/6) – Q(1/6)]/2, equals 0.97σ for a normal distribution and is useful as a robust assessment of the dispersion.
NOTE: The terms quantile, fractile, and percentile are used almost synonymously in the literature, while median, quartile, decile, etc. have very specific meanings.

Error bars and box plots

Error bars usually indicate ±1σ (i.e. the confidence interval at 68% CL). If not, the exact meaning must definitely be stated in the figure caption.
Box plots (or box-whisker plots) show, from top to bottom:
• "outliers" (>1.5×IQR from median)
• highest "non-outlier"
• upper quartile
• median
• lower quartile
• lowest "non-outlier"
• "outlier"

Histograms

One-dimensional sample distributions are often shown as histograms. A histogram displays the number of data points per bin, versus the position of the bin (or the density of data points, if unequal bin sizes are used).
E.g., define the sequence $x_0, x_1, \ldots, x_n$ which are the boundaries of $n$ bins. Equal bins of size $\Delta x$ are obtained as $x_i = x_0 + i\,\Delta x$, $i = 1, 2, \ldots, n$.
Let $h_i$ be the number of data points with $x_{i-1} \le x < x_i$. (Note the position of <.)
In the histogram, $h_i$ (or sometimes $h_i/\Delta x_i$) is plotted as a bar from $x_{i-1}$ to $x_i$.
Things to consider when constructing a histogram:
• Which bin size to use? This is a compromise between resolution and noise. In any case, be careful to specify the bin size if it is not clear from the graph!
• Where to start ($x_0$)? This is often arbitrary!
• What to do with points outside $x_0$, $x_n$ (if any)?
A difficulty with histograms is that they look radically different depending on the choices you make!

Different histograms of the same data... (1)

[Figure: four histograms, each with bin size = 2.]
These histograms (of the same 200 points) differ only in the choice of starting value $x_0$.

Different histograms of the same data... (2)

[Figure: four histograms with bin size = 2, 1, 1 and 0.5.]
These histograms (of the same 200 points) differ in bin size as well.
It is better to make the bins too narrow than too wide: the eye can smooth out the noise but cannot recover lost resolution!
Note that the uncertainty of any histogram value $h_i$ is of order $\pm\sqrt{h_i}$.

Cumulative plots

An alternative to the histogram is to plot the cumulative fraction, analogous to the cumulative distribution function (cdf):
Theoretical distributions           Empirical data
cumulative distribution function ⇔  cumulative fraction
probability density function     ⇔  histogram
The cumulative fraction is a step function that increments by 1/n for each data point, starting from 0 and ending at 1.

Cumulative plot, example

[Figure: cumulative fraction plot, n = 200.]
Cumulative fraction plot for the same 200 data points as in the histograms. The two modes can be seen as the steeper parts of the curve around 10 and 15.

Cumulative plots, some more examples (1)

You can transform the scale of data values to emphasize important intervals. For example, for strictly positive data it often makes sense to use a logarithmic scale (this and following examples from bardeen.physics.csbsju.edu/stats/).

Cumulative plots, some more examples (2)

[Figure: cumulative plots, box plots and mean/dispersion plot for two samples, B1 and B2.]
Cumulative plots are excellent for comparing two samples: do they have the same distribution? (Cf. K-S test.) This also works for samples of unequal size.
The two samples B1 and B2 are clearly drawn from different populations. This is also evident from the box plot, but not from the mean/dispersion plot (right).

Percentile plots

The ragged appearance of the cumulative plot can be disturbing to the eye, especially for small n. It may then be better to use a percentile plot (red line), which simply connects the n points with $x_{(i)}$ as abscissa and $p = i/(n+1)$ as ordinate.
This is actually a better estimate of the cumulative distribution function than the cumulative fraction plot.

Percentile plot, example

[Figure: percentile plot, n = 200.]
Percentile plot for the same 200 data points as in the histograms and in the cumulative fraction plot ("Cumulative plot, example").

Transformed percentiles...

[Figure: two percentile plots of the same sample, n = 200; standard scale (left) and transformed scale (right).]
Sometimes it is useful to transform the percentile scale to bring out more clearly the important parts of the distribution. In this example (a sample drawn from $\chi^2_3$) we are concerned about the tail of large values, which is difficult to see in the standard percentile plot (left). By plotting $1 - p$ instead of $p$ and using a logarithmic scale, the tail is emphasized.

Probability plots

As $n \to \infty$ the percentile plot converges to the cdf $F(x)$.
To see if the data follow a given distribution $F(x)$, we could make a percentile plot with $F^{-1}(i/(n+1))$ on the y-axis instead of $i/(n+1)$. If the data follow $F(x)$ we should then get (approximately) a straight line. This is a probability plot.
The nice thing about probability plots is that any linear transformation $a x_i + b$ of the data will just shift and change the slope of the curve, but a straight line (for example) remains straight.
The most common type of this plot is the normal probability plot, using the standard normal cdf
$$ \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{t^2}{2} \right) dt $$
The abscissae are $x_{(i)}$ and the ordinates are $\Phi^{-1}(i/(n+1))$ for $i = 1, 2, \ldots, n$.

The inverse standard normal cdf

To make normal probability plots you need to be able to compute the inverse standard normal cdf $\Phi^{-1}(p)$ for any $0 < p < 1$.
Routines for this are available in most numerical/statistical packages (and can be found e.g. in Numerical Recipes).
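
As one example of such a routine, Python's standard library (version 3.8 or later) provides `statistics.NormalDist.inv_cdf`. The sketch below uses it to compute the points of a normal probability plot, with $x_{(i)}$ as abscissa and $\Phi^{-1}(i/(n+1))$ as ordinate. The helper name `normal_probability_points`, the random seed and the simulated sample (50 values with mean 2 and s.d. 5) are assumptions for illustration.

```python
import random
from statistics import NormalDist

def normal_probability_points(data):
    """Return (x_(i), Phi^-1(i/(n+1))) pairs for a normal probability plot."""
    xs = sorted(data)                   # order statistics x_(1) <= ... <= x_(n)
    n = len(xs)
    std_normal = NormalDist()           # standard normal: mean 0, s.d. 1
    return [(x, std_normal.inv_cdf(i / (n + 1)))
            for i, x in enumerate(xs, start=1)]

# Simulated sample; the seed is arbitrary and only makes the run repeatable.
rng = random.Random(2006)
sample = [rng.gauss(2.0, 5.0) for _ in range(50)]

points = normal_probability_points(sample)
for x, score in points[:3] + points[-3:]:
    print(f"x = {x:8.3f}   normal score = {score:+.3f}")
```

Plotting the second element of each pair against the first (with any plotting tool) gives the normal probability plot. For gaussian data the points should fall close to a straight line.
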
If not readily available, use the following approximation, which is always good enough for probability plots (maximum error is 0.003; Abramowitz & Stegun, Handbook of Mathematical Functions):
$$ \Phi^{-1}(p) = \begin{cases} \dfrac{2.30753 + 0.27061\,t}{1 + 0.99229\,t + 0.04481\,t^2} - t & \text{if } 0 < p \le 0.5 \\[2ex] -\Phi^{-1}(1-p) & \text{if } 0.5 < p < 1 \end{cases} $$
where
$$ t = \sqrt{-2 \ln p} $$
The values $\Phi^{-1}(p)$ are sometimes called the normal scores.

Percentile vs. probability plot (1)

[Figure: percentile plot, n = 50, with the levels p = 1/6, 0.5 and 5/6 marked: 1st sextile ≈ −3, median ≈ 1, 5th sextile ≈ 7.]
Percentile plot for 50 random numbers from a normal distribution with mean = 2 and s.d. = 5. Note that you can use the percentile plot to estimate quantiles, e.g. the median and the first/last sextiles.

Percentile vs. probability plot (2)

[Figure: normal probability plot, n = 50, with the levels −1, 0 and +1 marked: −1σ ≈ −3, median ≈ 1, +1σ ≈ 7.]
Normal probability plot for the same 50 random numbers. The approximately straight relationship suggests that the data are indeed gaussian. The median and the quantiles corresponding to ±1σ for the normal distribution are easily found.

Normal probability plots, expected variation (n = 20)

Normal probability plots, expected variation (n = 200)

Normal probability plot for a non-normal sample

[Figure: histogram (bin size = 0.5) and normal probability plot, n = 200.]
Normal probability plot for the bimodal sample plotted earlier in the histograms ("Different histograms of the same data...").

Normal probability plot for a non-normal sample

[Figure: normal probability plot, n = 200.]
Typical normal probability plot for a sample that is nearly gaussian, but with some outliers.

Normal probability plot for a non-normal sample

[Figure: normal probability plot, n = 200.]
Normal probability plot for a sample drawn from the Cauchy distribution with location α = 2 and scale β = 5.

Cauchy probability plot for the Cauchy sample

[Figure: Cauchy probability plot, n = 200.]
Probability plots may not be very useful for extreme distributions like Cauchy!
Cauchy probability plot for the same sample as in the previous slide. The inverse cdf for the standard Cauchy distribution is $F^{-1}(p) = \tan[(p - 0.5)\pi]$.

Related to probability plots...

[Figure: log-log plot, "20th Century's 100 largest disasters worldwide", with the series Technological ($10B), Natural ($100B) and US Power outages (10M of customers, 1985-1997), and a reference line with slope = −1 (α = 1).]

A histogram plot from Hipparcos data analysis

ESA SP-1200 Vol. 3, Fig. 16.28
Normalised differences between the FAST and NDAC parallax estimates for successive solutions (12, 18, 30, 37 months of data). n = 40,000 - 100,000.

The same data in a normal probability plot

Real data are sometimes surprisingly Gaussian!
ESA SP-1200 Vol. 3, Fig. 16.29
Normalised differences between the FAST and NDAC parallax estimates for successive solutions (12, 18, 30, 37 months of data). n = 40,000 - 100,000.
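
Returning to the slide "The inverse standard normal cdf": the Abramowitz & Stegun approximation quoted there can be checked numerically. The following is a minimal Python sketch, not a reference implementation. The function name `inv_norm_cdf_approx` and the comparison grid are assumptions for illustration, and Python's `statistics.NormalDist.inv_cdf` is used as the reference value.

```python
import math
from statistics import NormalDist

def inv_norm_cdf_approx(p):
    """Approximate Phi^-1(p) with the rational formula quoted in the lecture."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must satisfy 0 < p < 1")
    if p > 0.5:
        # Use the symmetry Phi^-1(p) = -Phi^-1(1 - p) for the upper half.
        return -inv_norm_cdf_approx(1.0 - p)
    t = math.sqrt(-2.0 * math.log(p))
    return (2.30753 + 0.27061 * t) / (1.0 + 0.99229 * t + 0.04481 * t * t) - t

# Compare with the library routine on an (arbitrary) grid of probabilities.
exact = NormalDist().inv_cdf
grid = [k / 1000 for k in range(1, 1000)]
max_err = max(abs(inv_norm_cdf_approx(p) - exact(p)) for p in grid)
print(f"largest absolute error on the grid: {max_err:.4f}")
```

On this grid the largest deviation should be consistent with the quoted maximum error of about 0.003, which is negligible on the scale of a probability plot.
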