Chapter 3 Summarizing Quantitative Data: Statistics

• statistic Any quantity computed from the data values in the sample; used to quantify some distributional feature.

• parameter A quantity used to describe some distributional feature of the larger population.

There are two chief systems of statistics in common use: ordered statistics, based on an ordering of the data values from lowest to highest; and weighted statistics, which locate features relative to how they balance against the rest of the data set on the measuring scale.

Feature             Ordered statistic            Weighted statistic
Center              median                       mean (x̄)
Spread              interquartile range (IQR)    standard deviation (s)
Relative standing   percentile                   z-score (z)

• mode the data value (or values) with the highest frequency (which applies also to qualitative data); there may be one mode, no mode, or multiple modes in a set of data

• median the middle observation in a sorted list of the data values (for an even number of values, average the two middle observations)

• percentile the pth percentile is a data value that is greater than or equal to p% of the data (the median is the 50th percentile)

• lower/upper quartiles (Q1 and Q3) the observations which are one quarter (Q1) and three quarters (Q3) of the way up the sorted list; also equal to the medians of the two halves of the data located below and above the median; or the 25th and 75th percentiles. (The 0th quartile is the minimum value: Q0 = min; the second quartile is the median: Q2 = median; and the fourth quartile is the maximum value: Q4 = max.)
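As a minimal sketch of the median and quartile definitions above, the following Python computes the quartiles Q0 through Q4 using the "median of each half" rule. The function names and the sample data are illustrative assumptions, and other quartile conventions (for example, interpolated percentiles) can give slightly different values.

```python
def median(values):
    """Middle value of the sorted data; average the two middle values if n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


def five_quartiles(values):
    """Return (Q0, Q1, Q2, Q3, Q4) = (min, Q1, median, Q3, max).

    Assumption: when the number of values is odd, the overall median
    is left out of both halves. Other conventions exist.
    """
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]             # values below the median position
    upper = s[half + (n % 2):]   # values above the median position
    return (s[0], median(lower), median(s), median(upper), s[-1])


data = [7, 1, 3, 9, 5, 11, 13, 15]   # hypothetical sample
quartiles = five_quartiles(data)
print(quartiles)
```

The sketch needs at least two data values, since each half must be non-empty.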
• interquartile range (IQR) the difference IQR = Q3 − Q1 between the two quartiles

• five-number summary of a data set is the list of its five quartiles:
– minimum (min)
– lower quartile (Q1)
– median
– upper quartile (Q3)
– maximum (max)

• boxplot display of the five-number summary formed by
– drawing a box over a number line so that the sides of the box are located at the two quartiles (the width of the box equals the IQR),
– drawing the wall (a line across the box) at the location of the median, and
– drawing whiskers (lines parallel to the number line) extended from the sides of the box to the min and max.

• modified boxplot same as above, except:
– whiskers extend from the sides of the box to the fences, points positioned 1.5 IQR from each end of the box; and
– outliers (any values lying outside the fences) are individually marked with symbols; far outliers, which lie more than 3 IQR from the ends of the box, are often marked with different symbols for emphasis.

• resistance to outliers moving the extreme values of a data set either further away from or closer to the center of the distribution does not change the median value; hence, the median (and other ordered statistics) is often preferred when describing skewed data sets (income data, housing prices, etc.).
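The fence rule for the modified boxplot and the resistance of the median can be illustrated with a short Python sketch. The helper names, the sample data, and the choice to pass Q1 and Q3 in by hand are assumptions made for illustration only.

```python
import statistics


def fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences located k * IQR beyond the box."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3):
    """Split values into all outliers and the subset of far outliers."""
    low, high = fences(q1, q3, k=1.5)          # ordinary fences
    far_low, far_high = fences(q1, q3, k=3.0)  # far-outlier fences
    outliers = [x for x in values if x < low or x > high]
    far = [x for x in outliers if x < far_low or x > far_high]
    return outliers, far


# Hypothetical sample; Q1 = 5 and Q3 = 15 are the medians of its two halves.
data = [1, 3, 5, 7, 9, 11, 13, 15, 40, 60]
outliers, far = find_outliers(data, q1=5, q3=15)
print(outliers, far)

# Resistance: pushing the largest value further out moves the mean, not the median.
stretched = data[:-1] + [600]
print(statistics.median(data), statistics.median(stretched))
print(statistics.mean(data), statistics.mean(stretched))
```

Here the IQR is 10, so the fences sit at −10 and 30, and the far-outlier limits at −25 and 45.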
• (arithmetic) mean (x̄, µ) a data set that includes repeated measures of some variable quantity x (sample size = n; population size = N) has mean value equal to its arithmetical average, the sum of the values divided by the number of values:
sample mean: $\bar{x} = \frac{\sum x_i}{n}$
population mean: $\mu = \frac{\sum x_i}{N}$
It represents the point on the number scale where the distribution "balances" (as if the histogram were made of some massive substance).

• sensitivity to outliers moving the extreme values of a data set either further away from or closer to the center of the distribution can substantially alter the mean value; hence, the mean (and other weighted statistics) is used to describe only symmetric data sets or those without much skew. In skewed distributions, the mean is pulled in the direction of the skewness (the longer tail).

• geometric mean mean value appropriate for multiplicative data, those that are combined by multiplication (e.g., rates of growth)

If repeated percentage growth rates $g_i$ over various time intervals are collected, their corresponding growth factors have the form $(1 + g_i)$, whence the geometric mean of these factors satisfies
$(1 + G_g) = [(1 + g_1)(1 + g_2) \cdots (1 + g_n)]^{1/n}$.
Therefore, the average growth rate equals
$G_g = \sqrt[n]{(1 + g_1)(1 + g_2) \cdots (1 + g_n)} - 1$.
If various financial rates of return $R_i$ are given, their geometric mean return is
$G_R = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$.
If instead of growth rates $g_i$, the periodic amounts $x_1, x_2, \ldots, x_n$ are given, then consecutive amounts satisfy $x_{i+1} = (1 + g_i)x_i$. The n amounts determine only n − 1 growth factors, so the root is taken over n − 1 periods and the average growth rate simplifies to
$G_g = \sqrt[n-1]{\frac{x_2}{x_1} \cdot \frac{x_3}{x_2} \cdots \frac{x_n}{x_{n-1}}} - 1 = \sqrt[n-1]{\frac{x_n}{x_1}} - 1$.
The geometric mean is generally somewhat less sensitive to outliers than the arithmetic mean for the same data.
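A short Python sketch of the average growth rate follows, computed once from the growth rates and once from the periodic amounts. The function names and figures are hypothetical, and the second function assumes the amounts are consecutive, so that n amounts span n − 1 growth periods.

```python
import math


def average_growth_rate(rates):
    """Geometric-mean growth rate from per-period rates g_i (as decimals)."""
    factors = [1 + g for g in rates]
    return math.prod(factors) ** (1 / len(factors)) - 1


def average_growth_from_amounts(amounts):
    """Same quantity from consecutive amounts: only the first and last matter."""
    periods = len(amounts) - 1   # n amounts span n - 1 growth periods
    return (amounts[-1] / amounts[0]) ** (1 / periods) - 1


rates = [0.10, -0.05, 0.20]   # hypothetical growth rates: +10%, -5%, +20%

# Build the matching sequence of amounts, starting from 100.
amounts = [100.0]
for g in rates:
    amounts.append(amounts[-1] * (1 + g))

from_rates = average_growth_rate(rates)
from_amounts = average_growth_from_amounts(amounts)
arithmetic = sum(rates) / len(rates)
print(from_rates, from_amounts, arithmetic)
```

Both routes give the same average rate, which here is smaller than the arithmetic mean of the three rates.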
• range difference between the max and min values of the data; coarsest measure of spread, and highly sensitive to outliers

• deviation from the mean (x − x̄) the difference between a single data value x and the mean x̄ of the data set; values greater than the mean have positive deviations, while those below the mean have negative deviations. Each number in the data set has its own deviation from the mean.

• mean absolute deviation (MAD) average of the absolute values of the deviations from the mean:
sample MAD: $\bar{d} = \frac{\sum |x_i - \bar{x}|}{n}$
population MAD: $\delta = \frac{\sum |x_i - \mu|}{N}$
Statistical theory that includes the MAD as the measure of spread is surprisingly difficult, so it is not commonly used in practice.

• variance (s², σ²) an estimate of the average squared deviation from the mean:
sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
(We divide by n − 1 instead of n for technical reasons that will be explained later.) Variance is in (often meaningless) squared units, so the more important measure of spread is...

• standard deviation (s, σ) the square root of the variance, a measure of spread that estimates the size of a typical deviation from the mean:
sample standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
pop. standard deviation: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
The larger the standard deviation, the farther from the mean most values tend to lie. In addition, the standard deviation weighs large deviations from the mean more heavily than does the MAD (since the deviations are squared before they are averaged).

• coefficient of variation (CV) a measure of relative spread that determines how large the standard deviation is relative to the mean value; used exclusively for values of x measured on a ratio scale:
sample coefficient of variation: $CV = \frac{s}{\bar{x}}$
pop. coefficient of variation: $CV = \frac{\sigma}{\mu}$
Note that the notation CV is used for both sample and population values; only context will distinguish which measure is being used.
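The spread measures defined above can be computed side by side in a few lines of Python. This is a sketch on hypothetical data; the function name and the dictionary keys are assumptions, and the CV line presumes ratio-scale data with a nonzero mean.

```python
import math


def spread_measures(values):
    """Compute the MAD, variances, standard deviations, and CV of a sample."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]          # one deviation per value
    squared_sum = sum(d * d for d in deviations)

    mad = sum(abs(d) for d in deviations) / n        # mean absolute deviation
    sample_variance = squared_sum / (n - 1)          # divides by n - 1
    population_variance = squared_sum / n            # treats the data as a whole population
    sample_std = math.sqrt(sample_variance)
    return {
        "mean": mean,
        "mad": mad,
        "sample_variance": sample_variance,
        "population_variance": population_variance,
        "sample_std": sample_std,
        "population_std": math.sqrt(population_variance),
        "cv": sample_std / mean,                     # ratio-scale data only
    }


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample
result = spread_measures(data)
for name, value in result.items():
    print(name, value)
```

For this data the sample standard deviation exceeds the MAD, reflecting the extra weight that squaring gives to the largest deviations.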
• z-score, or standardized value a measure of relative standing that determines how far each data value is from the mean, measured in units of standard deviations:
$z = \frac{x - \bar{x}}{s}$
Positive z-scores correspond to values larger than the mean; negative z-scores to values below the mean.

• mean-variance analysis In finance, an investment I that fluctuates in value over time will produce variability in its rates $x_i$ of return. This variability is summarized by the mean rate of return $\bar{x}_I$ and the standard deviation $s_I$ of the rates of return for the investment. The mean measures the investment's reward, and the standard deviation its risk. Further, the degree to which the return on the investment compensates for the risk that the investor takes can be measured by comparing the difference between its reward $\bar{x}_I$ and the return $R_f$ of a risk-free investment (like a treasury bill), but in units of risk $s_I$; this measure is called the Sharpe ratio:
$\text{Sharpe ratio} = \frac{\bar{x}_I - R_f}{s_I}$
The higher the Sharpe ratio, the better the investment will compensate its investors for the risk they are taking by investing.

General properties of distributions

• Chebyshev's Theorem states that for any data set, the proportion of values that lie within k standard deviations of the mean is at least $1 - \frac{1}{k^2}$ whenever k > 1.

• Empirical Rule states that for data sets that are nearly normally distributed, i.e., symmetric and bell-shaped (with a prominent peak describing the central cluster of data and tails of more extreme values that dissipate as one moves away from the center),
– approximately 68% of the data will lie within one standard deviation of the mean, within the interval x̄ ± s;
– approximately 95% of the data will lie within two standard deviations of the mean, within the interval x̄ ± 2s; and
– nearly all of the data will lie within three standard deviations of the mean, within the interval x̄ ± 3s.
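The z-score, the Sharpe ratio, and Chebyshev's Theorem can be checked numerically with a small Python sketch. The data, the returns, the risk-free rate, and the function names are all hypothetical; the sketch uses the sample standard deviation from Python's `statistics` module.

```python
import statistics


def z_scores(values):
    """Standardize each value: z = (x - mean) / s, with s the sample std dev."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # divides by n - 1
    return [(x - mean) / s for x in values]


def proportion_within(values, k):
    """Fraction of values lying within k standard deviations of the mean."""
    zs = z_scores(values)
    return sum(1 for z in zs if abs(z) <= k) / len(zs)


def chebyshev_bound(k):
    """Lower bound 1 - 1/k^2 on that fraction, valid for k > 1."""
    return 1 - 1 / k**2


def sharpe_ratio(returns, risk_free):
    """Excess mean return per unit of risk (sample std dev of the returns)."""
    return (statistics.mean(returns) - risk_free) / statistics.stdev(returns)


data = [2, 4, 4, 4, 5, 5, 7, 9, 30]   # hypothetical, with one outlier
for k in (1.5, 2, 3):
    print(k, proportion_within(data, k), chebyshev_bound(k))

returns = [0.04, -0.01, 0.06, 0.02, 0.03]  # hypothetical periodic returns
print(sharpe_ratio(returns, risk_free=0.01))
```

Because the data set is skewed by an outlier, the Empirical Rule percentages need not hold here, but the Chebyshev bound still does.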
Working with grouped data

When raw data is processed by being aggregated into intervals along its scale of measure, the mean and standard deviation of the set can be approximated by assuming that all values in an interval are concentrated at the midpoint m of the interval. If the n data points of the sample (or the N data points of the population) are partitioned into k intervals with midpoints $m_1, m_2, \ldots, m_k$, and the intervals contain frequencies of $f_1, f_2, \ldots, f_k$, respectively, then
mean: $\bar{x} = \frac{\sum m_i f_i}{n}$, $\mu = \frac{\sum m_i f_i}{N}$
variance: $s^2 = \frac{\sum (m_i - \bar{x})^2 f_i}{n - 1}$, $\sigma^2 = \frac{\sum (m_i - \mu)^2 f_i}{N}$
stand. dev.: $s = \sqrt{s^2}$, $\sigma = \sqrt{\sigma^2}$

Weighted mean

We can reinterpret the mean value formula above by considering the midpoints $m_i$ as a data set to which we assign weights given by the corresponding relative frequencies $f_i/n$; a greater weight gives the midpoint value a larger contribution to the value of the mean. More generally, if data values $x_i$ are assigned corresponding weights $w_i$, where $\sum w_i = 1$, then we find the weighted mean by the similar formulas
$\bar{x} = \sum w_i x_i$, $\mu = \sum w_i x_i$

Statistics for paired data

• covariance ($s_{xy}$, $\sigma_{xy}$) a measure of the direction and strength of the linear association between paired data variables, where x is the explanatory and y the response variable:
sample covariance: $s_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{n - 1}$
population covariance: $\sigma_{xy} = \frac{\sum (x - \mu_x)(y - \mu_y)}{N}$
Note that covariance is measured in the product of the units of x and y (square units when both variables share the same unit).
• correlation ($r_{xy}$, $\rho_{xy}$) a standardized version of covariance:
sample correlation: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
population correlation: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$

Analyzing Data: Paired Quantitative Data

• positive values of $s_{xy}$, $r_{xy}$ indicate a positive association between x and y; negative values of $s_{xy}$, $r_{xy}$ indicate a negative association between x and y

• $r_{xy}$ always lies between −1 and 1, with values close to 0 indicating weak association, values close to 1 a strong positive association, and values close to −1 a strong negative association

• while it is possible to compute the covariance and correlation for any pair of quantitative variables, only linear associations are evaluated by these statistics

• the statistics $s_{xy}$, $r_{xy}$ are highly sensitive to outliers, so the presence of an outlier can dramatically alter their values

• there may be a strong association between variables without there being any cause/effect relation between them: association does not signify causation. Sometimes, there is a third lurking variable (one that is not measured by the investigator) standing behind the other two variables as a common and hidden determining factor.
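The sample covariance and correlation formulas, together with two of the cautions listed above, can be sketched in Python as follows. All data sets and function names are hypothetical and chosen only to make the effects visible.

```python
import math


def sample_covariance(xs, ys):
    """Sum of products of paired deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_std(values):
    """Sample standard deviation: the covariance of a variable with itself, rooted."""
    return math.sqrt(sample_covariance(values, values))


def sample_correlation(xs, ys):
    """Covariance standardized by both standard deviations."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = sample_correlation(xs, ys)

# Sensitivity to outliers: one extra point is enough to flip the sign.
r_with_outlier = sample_correlation(xs + [20], ys + [0])

# Only linear association is measured: y = x^2 on symmetric x gives r = 0.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x * x for x in curve_x]
r_curve = sample_correlation(curve_x, curve_y)

print(sample_covariance(xs, ys), r, r_with_outlier, r_curve)
```

In the last case y is completely determined by x, yet the correlation is zero because the relationship is not linear.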