Chapters 4-5 Displaying Quantitative Data

Numerical data can be visualized with a histogram. The data are separated into equal-width intervals along a numerical scale, and the number of data values falling in each interval is tallied. A rectangle is built over each interval, with its height, measured along a vertical scale, given by the frequency (or relative frequency) of data within that interval. [TI83: STAT Edit, STATPLOT, ZoomStat, and Window settings.]

A quicker way to display numerical data by hand is with a stem-and-leaf display. All but the rightmost digit (or digits) of each measurement become the stems; each stem heads a row in which the remaining digit(s), the leaves, are listed, carefully lined up vertically in columns. (List all intermediate stems, even if they contain no leaves!)

Describing Quantitative Data: Features of Interest

• The shape of a histogram or stem-and-leaf display describes the distribution of the data: where the data are concentrated and how they spread out across the entire range of values.
• Where is the center of the distribution located?
• How much spread is there in the distribution? How tightly are the data clustered about the center?
• Is there more than one cluster, or mode? Is the distribution unimodal, bimodal, or multimodal? Note: The location of modes can change with the scaling unit of a display (width of a bar).
• Is the distribution uniform (has a flat contour), indicating that every value is (roughly) equally represented? Is it roughly symmetric, with equally frequent values on either side of the center (the distribution to the right of the center is the mirror image of what appears to the left)? Or is it skewed (heavier on one side of the center than the other) to the left or right, in the direction of the tail (region of most extreme values)?
• Are there any outliers (values located very far from the center)? Can we explain why they appear?
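The histogram tally and the stem-and-leaf construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set is invented, the function names histogram_counts and stem_and_leaf are hypothetical, and the stem-and-leaf version assumes non-negative whole numbers with the tens digits as stems and the units digit as the leaf.

```python
from collections import defaultdict

# Hypothetical data set, invented for illustration only.
data = [12, 15, 21, 23, 23, 27, 30, 34, 38, 41, 45, 63]


def histogram_counts(values, start, width, n_bins):
    """Tally how many values fall in each equal-width interval.

    Interval k covers [start + k*width, start + (k+1)*width).
    Values outside the covered range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts


def stem_and_leaf(values):
    """Build display rows with tens digits as stems and units as leaves.

    Assumes non-negative integers. Every intermediate stem is listed,
    even when it has no leaves.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {row}")
    return rows


# Text histogram: one '#' per data value in each interval.
counts = histogram_counts(data, start=10, width=10, n_bins=6)
for k, c in enumerate(counts):
    print(f"[{10 + 10 * k}, {20 + 10 * k}): {'#' * c}")

# Stem-and-leaf display of the same data.
print("\n".join(stem_and_leaf(data)))
```

Both displays show the same shape for this data: a single cluster in the 20s and 30s, a gap in the 50s (the empty stem 5), and one value, 63, sitting apart on the right.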
Summarizing Quantitative Data: Statistics

• statistic: any quantity computed from the data values; typically used to quantify some feature of the distribution of the data.

There are two chief systems of statistics in common use: ordered statistics, which locate features of the distribution in terms of an ordering of the data values from lowest to highest; and weighted statistics, which locate features relative to how they balance against the rest of the data set on the measuring scale.

Feature            Ordered statistic           Weighted statistic
Center             median                      mean (x̄)
Spread             interquartile range (IQR)   standard deviation (s)
Relative standing  percentile                  z-score (z)

• median: the middle observation in a sorted list of the data values (for an even number of values, average the two middle observations); because it is resistant to the effects of outliers, it is often the preferred measure of center, especially for skewed data
• percentile: the pth percentile is a data value greater than or equal to p% of the data (the median is the 50th percentile)
• lower/upper quartiles (Q1 and Q3): the observations which are one quarter (Q1) and three quarters (Q3) of the way up the sorted list; also equal to the medians of the halves of the data located below/above the median; or the 25th and 75th percentiles (The 0th quartile is the minimum value: Q0 = min; the second quartile is the median: Q2 = median; and the fourth quartile is the maximum value: Q4 = max.)
• interquartile range (IQR): the difference IQR = Q3 − Q1 between the upper and lower quartiles
• five-number summary: the list of the five quartiles (Q0 through Q4) of a data set:
  ◦ minimum (min)
  ◦ lower quartile (Q1)
  ◦ median
  ◦ upper quartile (Q3)
  ◦ maximum (max)
• boxplot: display of the five-number summary formed by
  ◦ drawing a box over a number line so that the sides of the box are located at the two quartiles Q1 and Q3,
  ◦ drawing the wall (a line across the box) at the location of the median, and
  ◦ drawing whiskers (lines parallel to the number line) extended from the sides of the box to the min and max.
  [TI83: STATPLOT, ZoomStat]
• modified boxplot: same as above, except:
  ◦ whiskers extend from the sides of the box only as far as the most extreme data values lying within the fences, points positioned 1.5 IQR beyond each end of the box; and
  ◦ outliers (any values lying outside the fences) are individually marked with symbols; far outliers, which lie more than 3 IQR from the ends of the box, are often marked with different symbols for emphasis.
• resistance to outliers: moving the extreme values of a data set either further away from or closer to the center of the distribution does not change the median value; hence, the median (and other ordered statistics) is often preferred when describing skewed data sets (income data, housing prices, etc.).
• mean (x̄): a data set that includes n repeated measures of some variable quantity x has mean value equal to its arithmetical average, the sum of the values divided by the number of values:
    $$\bar{x} = \frac{\sum x}{n}$$
  It represents the point on the number scale where the distribution "balances" (as if the histogram were made of some massive substance).
• sensitivity to outliers: moving the extreme values of a data set either further away from or closer to the center of the distribution can substantially alter the mean value; hence, the mean (and other weighted statistics) is best used to describe only symmetric data sets or those without much skew.
In skewed distributions, the mean is pulled in the direction of the skewness (the longer tail).
• deviation from the mean (x − x̄): the difference between a single data value x and the mean x̄ of the data set; values greater than the mean have positive deviations, while those below the mean have negative deviations. Each number in the data set has its own deviation from the mean.
• variance (s²): an estimate of the average squared deviation from the mean:
    $$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
  (we divide by n − 1 instead of n for technical reasons)
• standard deviation (s): the square root of the variance, a measure of spread that estimates the size of a typical deviation from the mean:
    $$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
  The larger the standard deviation, the farther from the mean most values tend to lie.
• z-score, or standardized value: a measure of relative standing that tells how far each data value is from the mean, measured in units of standard deviations:
    $$z = \frac{x - \bar{x}}{s}$$
  Positive z-scores correspond to values larger than the mean; negative z-scores to values below the mean. For example, a value with z = −2.15 lies below the mean by exactly 2.15 standard deviations.
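The ordered and weighted statistics defined above can be computed side by side in a short Python sketch. Several things here are assumptions rather than part of the notes: the data set is invented (with 40 as a deliberate outlier), the helper names are hypothetical, and the quartiles are taken as medians of the lower and upper halves with the middle value left out when n is odd. Calculators and software may use slightly different quartile conventions.

```python
import math
from statistics import mean, median, stdev

# Hypothetical data set, invented for illustration; 40 is a deliberate outlier.
data = [2, 4, 4, 5, 7, 9, 10, 12, 40]


def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves.

    Assumption: for an odd number of values the median itself is
    excluded from both halves.
    """
    xs = sorted(values)
    half = len(xs) // 2
    return median(xs[:half]), median(xs[-half:])


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, q3 = quartiles(values)
    return min(values), q1, median(values), q3, max(values)


def fences(values, k=1.5):
    """Return the lower and upper fences, k * IQR beyond the box ends."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


def z_scores(values):
    """Standardize each value: (x - mean) / standard deviation."""
    x_bar = mean(values)
    s = math.sqrt(sample_variance(values))
    return [(x - x_bar) / s for x in values]


# Ordered statistics: five-number summary, fences, and flagged outliers.
low, high = fences(data)
outliers = [x for x in data if x < low or x > high]
print("five-number summary:", five_number_summary(data))
print("fences:", (low, high), "outliers:", outliers)

# Weighted statistics: mean, standard deviation, z-scores.
s = math.sqrt(sample_variance(data))
print("mean:", mean(data), "s:", s, "library check:", stdev(data))
print("z-scores:", [round(z, 2) for z in z_scores(data)])

# Resistance vs. sensitivity: push the largest value further out.
shifted = data[:-1] + [400]
print("median:", median(data), "->", median(shifted))
print("mean:", mean(data), "->", mean(shifted))
```

Replacing 40 with 400 leaves the median unchanged while the mean moves substantially, which is the contrast between resistance and sensitivity to outliers described above.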