Describing Distributions Numerically

Center
**Whenever you find the center of data, always ask yourself how well it actually summarizes the data!

Median: the middle value, which divides the histogram into two equal areas. It has the same units as the data (be sure to include the units when describing the median).
Finding the Median (with the data in numerical order):
  When n is odd: the median is the value in position $\frac{n+1}{2}$.
  When n is even: there are two middle values; the median is the average of the values in positions $\frac{n}{2}$ and $\frac{n}{2} + 1$.

Mean: $\bar{y} = \frac{\text{Total}}{n} = \frac{\sum y}{n}$
  Can be affected by outliers; the mean is “pulled” in the direction of the longer tail.
  Only appropriate when the shape is symmetric and there are no outliers.
  To check for symmetry and outliers, plot the data!
  For fairly symmetric distributions, the mean and median are very close. In this situation, the mean will be most useful (as we’ll see in later chapters).

Spread
*The more the data vary, the less the median alone can tell us.

Range: Range = max - min
  The range is always a single number (not an interval of values).
  The range is not often used in statistics because it is too susceptible to outliers.

Interquartile Range:
  Lower quartile (25th percentile): the median of the lower half of the data; 25% of the data lies below it.
  Upper quartile (75th percentile): the median of the upper half of the data; 25% of the data lies above it.
  Interquartile Range (IQR): the spread of the middle half of the data; IQR = upper quartile - lower quartile

Standard Deviation:
  Although the IQR is a reasonable summary of spread, it ignores much of the information about how the individual data values vary.
  The standard deviation is a better measure, but it is ONLY appropriate for symmetric data.
  Variance: the “average” of the squared deviations from the mean (squaring keeps positive and negative deviations from cancelling out), but its units are also squared.
  To get back to the original (and more useful) units, we take the square root of the variance to get the standard deviation.
  Variance: $s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$
  Standard Deviation: $s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$

Shape, Center, and Spread:
  If the shape is skewed, report the median and the IQR. You may want to include the mean and standard deviation, but you should point out why the mean and median differ. The fact that the mean and median do not agree is a sign that the distribution may be skewed. A histogram will help you make that point.
  If the shape is symmetric, report the mean and standard deviation, and possibly the median and IQR as well. For symmetric data the IQR is usually a bit larger than the standard deviation. If that is not true for your data set, look again to make sure the distribution isn’t skewed and there are no outliers.
  If there are any clear outliers and you are reporting the mean and standard deviation, report them both with the outliers present and with the outliers removed. The differences may be revealing. (Of course, the median and IQR are not likely to be affected by the outliers.)
  Always pair the median with the IQR and the mean with the standard deviation.
  Generally, report summary statistics to one or two more decimal places than the original data.

Categorical Data:
**Step-by-Step: pg. 85
**Read “What Can Go Wrong?”, pg. 86-87

Re-expressing to Equalize the Spread of Groups
  Useful when comparing groups that have very different spreads.
  For measurements that can’t be negative and whose distributions are skewed to the high end, a good first guess at a re-expression is the logarithm. This can also improve the symmetry of the distribution and pull in most of the apparent outliers.
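The measures of center and spread defined in these notes can be sketched in a few lines of Python. This is only an illustration, not a required method: the function names and the data values are made up for the example, and the quartile rule used here (leave the middle value out of both halves when n is odd) is one common convention that the notes do not spell out. Other textbooks and calculators may handle that case differently.

```python
from math import sqrt

def median(values):
    """Median by the position rule from the notes."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # 1-based position (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2  # positions n/2 and n/2 + 1

def quartiles(values):
    """Lower and upper quartiles as medians of the lower and upper halves.

    Assumption: when n is odd, the middle value is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(upper)

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    ybar = mean(values)
    return sum((y - ybar) ** 2 for y in values) / (len(values) - 1)

def std_dev(values):
    return sqrt(variance(values))

# Made-up data with one high outlier (40)
data = [2, 4, 4, 5, 7, 9, 11, 40]

q1, q3 = quartiles(data)
print("median:", median(data))   # 6.0
print("Q1, Q3:", q1, q3)         # 4.0 10.0
print("IQR:", q3 - q1)           # 6.0
print("mean:", mean(data))       # 10.25, pulled toward the outlier
print("std dev:", round(std_dev(data), 2))
```

In this made-up data set the mean (10.25) sits well above the median (6.0) because of the single large value, which is the kind of disagreement the notes say to look for when judging skewness.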
Changing Units:
  Adding/Subtracting a constant: shifts the measures of center but leaves the measures of spread unchanged
  Multiplying/Dividing by a constant: affects all measures of center and spread

Notations:
  n = number of values
  $\bar{y}$ (y-bar) = the mean; a bar over any variable generally means “find the mean” of that variable
  $\sum$ (sigma) = sum of the observations
  SQRT(y) or y^0.5 = common computer representations for square root
  s = standard deviation
  $s^2$ = variance

Boxplots
5-Number Summary: a description of a distribution that reports the median, quartiles, and extremes (max and min).
Boxplot: a visual display of the 5-Number Summary; useful for comparison.

To make a boxplot:
  1. Draw a single axis (vertical or horizontal) spanning the extent of the data.
  2. Draw short lines (horizontal or vertical, respectively) at the lower and upper quartiles and at the median. Connect the lines to form a box.
  3. Outliers: construct “fences.” Fences are just for construction, not part of the display. Do not include them in your boxplot!
     o Upper fence = Q3 + 1.5(IQR)
     o Lower fence = Q1 - 1.5(IQR)
  4. Draw lines from the ends of the box up and down to the most extreme data values found within the fences. If a data value falls outside one of the fences, do not connect it.
  5. Any data values beyond the fences are represented by special symbols (often dots or x’s). These are the outliers. “Far outliers” (more than 3 IQRs from the quartiles) often use a different symbol.

Summarizing a Boxplot
  If the median is roughly centered between the quartiles, then the middle half of the data is roughly symmetric. If it is not centered, the distribution is skewed.
  The “whiskers” show skewness as well if they are not roughly the same length.

**Step-by-Step: pg. 78-79
**TI Tips: pg. 80
**Just Checking: pg. 84

Suggested Practice (Old Book): #3, 5-8, 12-14, 17, 19, 21, 24, 27, 28, 30, 34, 47
Suggested Practice (New Book):
  pg. 160 #4.1-4.16
  pg. 169 #4.17*, 4.19*, 4.21, 4.22, 4.23, 4.25, 4.29, 4.30
  pg. 176 #4.32-4.37
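The fence rule from the boxplot construction in these notes can be checked with a short Python sketch. The function names and the data are hypothetical, and the quartiles Q1 = 4 and Q3 = 10 are assumed to have been found already as the medians of the lower and upper halves of the data.

```python
def fences(q1, q3):
    """Lower and upper fences: 1.5 IQRs beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def classify(values, q1, q3):
    """Find the whisker ends, the outliers, and the far outliers."""
    iqr = q3 - q1
    lower, upper = fences(q1, q3)
    inside = [y for y in values if lower <= y <= upper]
    outliers = [y for y in values if y < lower or y > upper]
    # Far outliers lie more than 3 IQRs beyond the quartiles
    far = [y for y in outliers if y < q1 - 3 * iqr or y > q3 + 3 * iqr]
    return {
        "whiskers": (min(inside), max(inside)),  # most extreme values within the fences
        "outliers": outliers,
        "far_outliers": far,
    }

# Made-up data; Q1 = 4 and Q3 = 10 are assumed to be computed beforehand
data = [2, 4, 4, 5, 7, 9, 11, 40]
result = classify(data, 4.0, 10.0)

print("fences:", fences(4.0, 10.0))        # (-5.0, 19.0)
print("whiskers end at:", result["whiskers"])
print("outliers:", result["outliers"])
print("far outliers:", result["far_outliers"])
```

Here the whiskers stop at 2 and 11, the most extreme values inside the fences, and 40 is plotted separately. Because 40 is also more than 3 IQRs above Q3 (10 + 18 = 28), it would get the “far outlier” symbol.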