Describing distributions with numbers
• A large number of numerical methods are available for describing quantitative data sets. Most of these methods measure one of two data characteristics:
  • The central tendency of the set of observations: the tendency of the data to cluster, or center, about certain numerical values.
  • The variability of the set of observations: the spread of the data.
week2 1

Measuring center
• The mode is the observation that occurs most frequently.
• The mode of a categorical variable is the label of the category with the highest count.
• Two common measures of center are the mean and the median. These two measures behave differently: the mean is the “average value” and the median is the “middle value”.

Measuring center: the median
• The median M is the midpoint of the distribution, the number such that half the observations are smaller than it and the other half are larger.
• To find the median of a distribution:
  1. Arrange the observations in order of size, from smallest to largest.
  2. If the number of observations n is odd, the median is the center observation in the ordered list.
  3. If the number of observations n is even, the median is the average of the two center observations in the ordered list.

Example
The annual salaries (in thousands of $) of a random sample of five employees of a company are: 40, 30, 25, 200, 28.
Arranging the values in increasing order gives 25 28 30 40 200, so the median = 30.
Excluding 200, the median = (28 + 30)/2 = 29.

• MINITAB commands: Stat > Basic Statistics > Display Descriptive Statistics
• MINITAB output for the data in the example above is given below:

Variable  N  Median
salary    5  30.0

Measuring center: mean
• To find the mean $\bar{x}$ of a set of observations, add their values and divide by the number of observations.
If the n observations are $x_1, x_2, \ldots, x_n$, their mean is given by

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum x_i$$

• Example: Find the mean of the following observations: 4, 5, 9, 3, 5.
  Solution: $\bar{x} = (4 + 5 + 9 + 3 + 5)/5 = 26/5 = 5.2$

Example
• The annual salaries (in thousands of $) of a random sample of five employees of a company are: 40, 30, 25, 200, 28. The mean of all five salaries is 323/5 = 64.6. If we exclude 200 as an outlier, the mean of the remaining four is 123/4 = 30.75.
• The mean is sensitive to the influence of a few extreme observations. Because the mean cannot resist the influence of extreme values, we say that it is NOT a resistant measure of center.

Mean versus median
• The median and mean are the most common measures of the center of a distribution.
• If the distribution is exactly symmetric, the mean and median are exactly the same.
• The median is less influenced by extreme values than the mean.
• If the distribution is skewed to the right, then typically mode < median < mean.
• If the distribution is skewed to the left, then typically mean < median < mode.

Questions
1. You are asked to recommend a measure of center to characterize the following data: 0.6, 0.2, 0.1, 0.2, 0.2, 0.3, 0.7, 0.1, 0.0, 22.5, 0.4. What is your recommendation and why?
2. The mean is ____ sensitive to extreme values than the median.
   (a) more (b) less (c) equally (d) can’t say without data
3. Changing the value of a single score in a data set will necessarily cause the mean to change. (T/F)
4. Changing the value of a single score in a data set will necessarily cause the median to change. (T/F)

Quartiles
• The 25th percentile is called the first quartile (Q1).
• The first quartile (Q1) is the median of the observations whose position in the ordered list is to the left of the location of the overall median.
• The 75th percentile is called the third quartile (Q3).
• The third quartile (Q3) is the median of the observations whose position in the ordered list is to the right of the location of the overall median.
NOTE: The median is the second quartile, Q2.
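The median rule and the mean-versus-median comparison can be sketched in a few lines of Python. This is only an illustration on the salary example from the slides: the function name `median_by_rule` and the variable names are made up here, and the course itself uses MINITAB rather than Python.

```python
from statistics import mean

def median_by_rule(values):
    """Median via the three-step rule: sort, then take the center
    observation (n odd) or average the two center observations (n even)."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2

# Annual salaries in thousands of $, from the salary example.
salaries = [40, 30, 25, 200, 28]
trimmed = [x for x in salaries if x != 200]  # same data without the extreme value

for label, data in (("all five", salaries), ("without 200", trimmed)):
    print(f"{label}: mean = {mean(data)}, median = {median_by_rule(data)}")
```

Dropping the single extreme salary moves the mean from 64.6 to 30.75, while the median only moves from 30 to 29, which is what "resistant" means in practice.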
Example
The highway mileages of 20 cars, arranged in increasing order, are:
13 15 16 16 17 19 20 22 23 23 | 23 24 25 25 26 28 28 28 29 32.
The median is …
The first quartile Q1 is …
The third quartile Q3 is …

Measuring Spread
• The range (max - min) is a measure of spread, but it is very sensitive to the influence of extreme values.
• The distance between the first and third quartiles is called the interquartile range (IQR), i.e. IQR = Q3 - Q1.
• The IQR is another measure of spread that is less sensitive to the influence of extreme values. It measures the spread of the middle 50% of the data.

The five-number summary
• The five-number summary of a set of observations consists of the smallest observation, the first quartile, the median, the third quartile and the largest observation.
• These five numbers give a reasonably complete description of both the center and the spread of the distribution.
• MINITAB commands: Stat > Basic Statistics > Display Descriptive Statistics

Example
• The highway mileages of 20 cars, arranged in increasing order, are:
  13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32.
  Give the five-number summary.
• Answer: From example 1.14 p45 we have min = 13, first quartile = 18, median = 23, third quartile = 27, max = 32.
  The MINITAB output using the above commands is as follows:

Variable  N   Minimum  Q1     Median  Q3     Maximum
mileage   20  13.00    17.50  23.00   27.50  32.00

  (MINITAB locates the quartiles with a slightly different rule, so its Q1 and Q3 differ slightly from the hand calculation.)

Box-plot
• A box-plot is a graph of the five-number summary.
• Example: Make a box-plot for the data in the above example.
  [Figure: Boxplot of Mileages; vertical axis: Mileages]
• MINITAB commands: Graph > Boxplot

Outliers
An outlier is an observation that is unusually large or small relative to the other values in a data set. Outliers are typically attributable to one of the following causes:
1. The observation is observed, recorded, or entered incorrectly.
2. The observation comes from a different population.
3.
The observation is correct but represents a rare event.

The 1.5×IQR Criterion for outliers
• Call an observation a suspected outlier if it falls more than 1.5×IQR above the 3rd quartile or below the 1st quartile.
• Example: Consider the data given in example 1.13 on page 43 in IPS (mileage data with an extra observation of 66).

Variable  N   Mean   Min  Q1  Median  Q3  Max
Mileages  21  24.67  13   18  23      28  66

  The IQR = 28 - 18 = 10, so 1.5×IQR = 15 and the upper cutoff is Q3 + 15 = 43. The largest observation, 66, falls more than 1.5×IQR above Q3 and is therefore a suspected outlier.

Exercise
The stemplot for a set of 50 observations is given below. Draw a box-plot for the data.

Stem-and-leaf of Fees  N = 50
Leaf Unit = 1.0
   2   0  89
   7   1  00234
  15   1  55558899
 (28)  2  0000000000000111112222222223
   7   2  59
   5   3  0
   4   3  5
   3   4  00
   1   4
   1   5  0

Exercise
• The box-plot, histogram and stem-and-leaf plot for a data set are given below. Describe the distribution.

Stem-and-leaf of C2  N = 50
Leaf Unit = 1.0
 (29)  0  00011111111122222222233444444
  21   0  55555666788
  10   1  0234
   6   1  66
   4   2  1
   3   2  88
   1   3
   1   3  8

[Figures: histogram of C2 (vertical axis: Frequency) and box-plot of C2]

Exercise
• Consider the following MINITAB-generated box-plots of coagulation times in seconds for samples of blood drawn from animals receiving three different diets, denoted 1, 2, and 3:
  [Figure: side-by-side box-plots of coagtimes by Diet (1, 2, 3)]
• State whether the following statements are true or false:
  a) The animal that had the longest coagulation time was given diet 3.
  b) The greatest variability occurs with diet 2.
  c) Diet 1 shows evidence of right (positive) skewness but diet 2 shows evidence of left (negative) skewness.
  d) Approximately 25% of animals on diet 2 had coagulation times less than 63.
  e) The smallest upper (third) quartile is for diet 3.
  f) We can see that the mean for diet 1 is less than 62 seconds.
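The five-number summary and the 1.5×IQR criterion can be checked with a short Python sketch. It assumes the quartile rule stated on the slides (Q1 and Q3 are the medians of the observations to the left and right of the overall median's location); software such as MINITAB may interpolate differently. The function names are illustrative only.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), taking Q1 and Q3 as the medians
    of the observations left and right of the overall median's location."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # For odd n the middle observation belongs to neither half.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def suspected_outliers(values):
    """Observations more than 1.5*IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

# Highway mileages of the 20 cars from the example.
mileages = [13, 15, 16, 16, 17, 19, 20, 22, 23, 23,
            23, 24, 25, 25, 26, 28, 28, 28, 29, 32]

print(five_number_summary(mileages))         # (13, 18.0, 23.0, 27.0, 32)
print(suspected_outliers(mileages + [66]))   # [66]
```

With the extra observation of 66 the quartiles become 18 and 28, so the cutoffs are 3 and 43 and only 66 is flagged, matching the worked example.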
Measuring spread: Standard deviation
• The variance ($s^2$) of a set of n observations $x_1, x_2, \ldots, x_n$ is

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\sum (x_i - \bar{x})^2$$

• The standard deviation (s) is the square root of the variance ($s^2$).
• The standard deviation (SD) is based on how far each value is from the mean. If many values are far from the mean (or some are very far), the SD will be large. If all values are close to the mean, the SD will be small.

• The deviations $x_i - \bar{x}$ display the spread of the values $x_i$ about their mean. Some of these deviations will be positive and some negative because the observations fall on each side of the mean.
• The sum of the deviations of the observations from their mean will always be zero.
• Squaring the deviations makes them all nonnegative, so that observations far from the mean in either direction have large positive squared deviations.
• The variance is an average of the squared deviations, computed by dividing their sum by n - 1 rather than n.
• The variance, $s^2$, and the standard deviation, s, will be large if the observations are widely spread about their mean, and small if the observations are all close to the mean.

Properties of standard deviation (s)
• s measures the spread about the mean and should be used only when the mean is chosen as the measure of center.
• s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise, s > 0.
• s, like the mean, is not resistant to extreme values. A few outliers can make s very large.

Choosing a summary
• The five-number summary is usually better than the mean and the standard deviation for describing skewed distributions or distributions with strong outliers.
• Use the mean and standard deviation for reasonably symmetric distributions that are free of outliers.

Questions
1. How do the mean, median, and mode usually compare when a distribution is positively skewed? Negatively skewed? Draw a picture and try to estimate the locations of these measures.
2.
Which type of display is the most useful for clear, direct comparisons of the key characteristics of several data sets (e.g. blood cholesterol changes for several different treatments)?
3. In a frequency table of 300 scores, the mean is reported as 80 and the median as 65. One would expect this distribution to be
   a. positively skewed.
   b. negatively skewed.
   c. symmetrical.
   d. rectangular.
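As a numerical check on the variance and standard deviation defined earlier, here is a small Python sketch that follows the formula term by term (deviations from the mean, squared, summed, divided by n - 1). The function names are illustrative, and the data are the five observations 4, 5, 9, 3, 5 and the salary sample used in the earlier examples.

```python
from math import sqrt
from statistics import mean

def deviations(values):
    """Deviations x_i - xbar of each observation from the mean."""
    xbar = mean(values)
    return [x - xbar for x in values]

def sample_variance(values):
    """s^2: sum of squared deviations divided by n - 1."""
    return sum(d * d for d in deviations(values)) / (len(values) - 1)

def sample_sd(values):
    """s: square root of the variance."""
    return sqrt(sample_variance(values))

obs = [4, 5, 9, 3, 5]
print("sum of deviations:", round(sum(deviations(obs)), 10))
print("variance:", round(sample_variance(obs), 4))
print("standard deviation:", round(sample_sd(obs), 4))

# s is not resistant: one extreme salary inflates it greatly.
salaries = [40, 30, 25, 200, 28]
trimmed = [x for x in salaries if x != 200]
print("SD with 200:", round(sample_sd(salaries), 2))
print("SD without 200:", round(sample_sd(trimmed), 2))
```

The deviations sum to zero up to rounding error, and removing the single salary of 200 shrinks s by roughly an order of magnitude, which illustrates why s is paired with the mean only for reasonably symmetric, outlier-free data.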