Chapter 1: Looking at Data - Distributions (Part 2)
1.2 Describing Distributions with Numbers
Dr. Nahid Sultana

Objectives

- Measures of center: mean, median
- Mean versus median
- Measures of spread: quartiles, standard deviation
- Five-number summary and boxplot
- Choosing among summary statistics
- Changing the unit of measurement

Measures of Center: The Mean

The most common measure of center is the arithmetic average, also called the mean or sample mean. To calculate the mean, add all the values, then divide by the number of individuals. The mean is the "center of mass" of the data.

If the n observations are $x_1, x_2, x_3, \ldots, x_n$, their mean is

$$\bar{x} = \frac{\text{sum of observations}}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

or, in more compact notation,

$$\bar{x} = \frac{1}{n} \sum x_i$$

Measures of Center: The Mean (cont.)

Find the mean: Here are the scores on the first exam in an introductory statistics course for 10 students:

80 73 92 85 75 98 93 55 80 90

Find the mean first-exam score for these students.

Solution: $\bar{x} = (80 + 73 + \cdots + 90)/10 = 821/10 = 82.1$

Measuring Center: The Median

Another common measure of center is the median. The median M is the midpoint of a distribution, the number such that half of the observations are smaller and the other half are larger.

To find the median of a distribution:

1. Arrange all observations from smallest to largest.
2. If the number of observations n is odd, the median M is the center observation in the ordered list.
3. If the number of observations n is even, the median M is the average of the two center observations in the ordered list.

Measuring Center: The Median (cont.)

Find the median: Here are the scores on the first exam in an introductory statistics course for 10 students:

80 73 92 85 75 98 93 55 80 90

Find the median first-exam score for these students.

Solution: In order, the scores are 55 73 75 80 80 85 90 92 93 98. Because n = 10 is even, the median is the average of the two center observations: M = (80 + 85)/2 = 82.5.

Comparing Mean and Median

If a distribution is exactly symmetric, the mean and the median are the same. In a skewed distribution, the mean is usually farther out in the long tail than is the median.
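The two measures of center can be checked with a short Python sketch. It follows the rules stated above for the ten first-exam scores; the function names `sample_mean` and `sample_median` are illustrative choices, and the altered data set at the end is a hypothetical variation added only to show how a single extreme value affects each measure.

```python
def sample_mean(values):
    """Add all values, then divide by the number of observations."""
    return sum(values) / len(values)


def sample_median(values):
    """Midpoint of the ordered list; average the two center values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# First-exam scores for the 10 students in the example
scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]

x_bar = sample_mean(scores)
m = sample_median(scores)
print(x_bar, m)  # 82.1 82.5

# Hypothetical variation (not in the source data): the lowest score 55 becomes 5
skewed = [5 if x == 55 else x for x in scores]
print(sample_mean(skewed), sample_median(skewed))  # 77.1 82.5
```

In the hypothetical variation the mean drops by five points while the median does not move, which is the behavior the comparison of mean and median describes for a long left tail.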
The median is a measure of center that is resistant to skew and outliers. The mean is not.

Measuring Spread: The Quartiles

A measure of center alone can be misleading. A useful numerical description of a distribution requires both a measure of center and a measure of spread.

We describe the spread or variability of a distribution by giving several percentiles. The median divides the data in two: half of the observations are above the median and half are below it. We could call the median the 50th percentile. The lower quartile is the median of the lower half of the data; the upper quartile is the median of the upper half of the data. Together with the median, the quartiles divide the data into four equal parts; 25% of the data are in each part.

Measuring Spread: The Quartiles (cont.)

To calculate the quartiles:

1. Arrange the observations in increasing order and locate the median M.
2. The first quartile Q1 is the median of the lower half of the data, excluding M.
3. The third quartile Q3 is the median of the upper half of the data, excluding M.

Measuring Spread: The Quartiles (cont.)

Example: Here are the scores on the first exam in an introductory statistics course for 10 students:

80 73 92 85 75 98 93 55 80 90

Find the quartiles for these first-exam scores.

Solution: In order, the scores are:

55 73 75 80 80 85 90 92 93 98

The median is M = 82.5.
Q1 = 75, the median of the first five numbers: 55, 73, 75, 80, 80.
Q3 = 92, the median of the last five numbers: 85, 90, 92, 93, 98.

The Five-Number Summary

The five-number summary of a distribution consists of

- the smallest observation (Min)
- the first quartile (Q1)
- the median (M)
- the third quartile (Q3)
- the largest observation (Max)

written in order from smallest to largest: Minimum, Q1, M, Q3, Maximum.

Boxplots

A boxplot is a graph of the five-number summary. Draw a central box from Q1 to Q3. Draw a line inside the box to mark the median M.
Extend lines from the box out to the minimum and maximum values that are not outliers.

Boxplots (cont.)

Example: Here are the scores on the first exam in an introductory statistics course for 10 students:

80 73 92 85 75 98 93 55 80 90

Make a boxplot for these first-exam scores.

Solution: In order, the scores are: 55, 73, 75, 80, 80, 85, 90, 92, 93, 98

Min = 55, Q1 = 75, M = 82.5, Q3 = 92, Max = 98

[Figure: boxplots for skewed data]

[Figure: comparing boxplots to histograms]

Suspected Outliers: 1.5 × IQR Rule

Outliers are troublesome data points, and it is important to be able to identify them. The interquartile range IQR is the distance between the first and third quartiles:

IQR = Q3 − Q1

The IQR is used as part of a rule of thumb for identifying outliers.

The 1.5 × IQR Rule for Outliers: Call an observation a suspected outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.

Suspected Outliers: 1.5 × IQR Rule (cont.)

Individual #25 has a value of 7.9 years, which is 3.55 years above the third quartile. This is more than 1.5 × IQR = 3.225 years. Thus, individual #25 is a suspected outlier.

Suspected Outliers: 1.5 × IQR Rule (cont.)

Modified boxplots plot suspected outliers individually. The 8 largest call lengths are

438, 465, 479, 700, 700, 951, 1148, 2631

They are plotted as individual points, though 2 of them are identical and so do not appear separately.

Measuring Spread: The Standard Deviation

The most common measure of spread looks at how far each observation is from the mean. This measure is called the standard deviation. The standard deviation s measures, roughly, the average distance of the observations from their mean. It is calculated by finding an average of the squared distances and then taking the square root. This average squared distance is called the variance.

Calculating the Standard Deviation

1. Calculate the mean.
2. Calculate each deviation: deviation = observation − mean.
3. Square each deviation.
4. Calculate the sum of the squared deviations.
5. Divide by the degrees of freedom, df = n − 1. The result is the variance.
6. Take the square root of the variance. The result is the standard deviation.

| x_i | x_i − mean | (x_i − mean)^2 |
|---|---|---|
| 1 | 1 − 5 = −4 | (−4)^2 = 16 |
| 3 | 3 − 5 = −2 | (−2)^2 = 4 |
| 4 | 4 − 5 = −1 | (−1)^2 = 1 |
| 4 | 4 − 5 = −1 | (−1)^2 = 1 |
| 4 | 4 − 5 = −1 | (−1)^2 = 1 |
| 5 | 5 − 5 = 0 | (0)^2 = 0 |
| 7 | 7 − 5 = 2 | (2)^2 = 4 |
| 8 | 8 − 5 = 3 | (3)^2 = 9 |
| 9 | 9 − 5 = 4 | (4)^2 = 16 |
| Mean = 5 | Sum = 0 | Sum = 52 |

Variance = 52/(9 − 1) = 6.5
Standard deviation = √6.5 ≈ 2.55

Properties of the Standard Deviation

- s measures spread about the mean and should be used only when the mean is the measure of center.
- s = 0 only when all observations have the same value and there is no spread. Otherwise, s > 0.
- s is not resistant to outliers.
- s has the same units of measurement as the original observations.

Choosing Measures of Center and Spread

We now have a choice between two descriptions of center and spread:

- Mean and standard deviation
- Median and interquartile range

The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers. Use the mean and standard deviation only for reasonably symmetric distributions that don't have outliers.

NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!

Changing the Unit of Measurement

Variables can be recorded in different units of measurement. Most often, one measurement unit is a linear transformation of another measurement unit: $x_{\text{new}} = a + bx$.

Example 1: If a distance x is measured in kilometers, the same distance in miles is

$x_{\text{new}} = 0.62x$

This transformation changes the units without changing the origin: a distance of 0 kilometers is the same as a distance of 0 miles.
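Looking back at the measures of spread covered so far, the quartile rule, the 1.5 × IQR rule, and the six-step standard deviation calculation can be sketched in Python as follows. The helper names `quartiles`, `suspected_outliers`, and `sample_std` are illustrative, and the sketch assumes at least two observations. It uses the first-exam scores and the nine values from the standard deviation table.

```python
import math
from statistics import median


def quartiles(values):
    """Return (Q1, M, Q3); Q1 and Q3 are the medians of the lower and upper
    halves of the ordered data, excluding M itself when n is odd."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


def suspected_outliers(values):
    """Observations more than 1.5 x IQR below Q1 or above Q3."""
    q1, _, q3 = quartiles(values)
    fence = 1.5 * (q3 - q1)
    return [x for x in values if x < q1 - fence or x > q3 + fence]


def sample_std(values):
    """Standard deviation following the six steps listed in the text."""
    n = len(values)
    mean = sum(values) / n                      # step 1
    deviations = [x - mean for x in values]     # step 2
    squared = [d ** 2 for d in deviations]      # step 3
    total = sum(squared)                        # step 4
    variance = total / (n - 1)                  # step 5
    return math.sqrt(variance)                  # step 6


scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
q1, m, q3 = quartiles(scores)
print(q1, m, q3, q3 - q1)          # 75 82.5 92 17
print(suspected_outliers(scores))  # []

table_data = [1, 3, 4, 4, 4, 5, 7, 8, 9]
print(round(sample_std(table_data), 2))  # 2.55
```

For the exam scores, 1.5 × IQR = 25.5, so only scores below 49.5 or above 117.5 would be flagged; the lowest score, 55, is therefore not a suspected outlier under this rule.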
Example 2: A temperature x measured in degrees Fahrenheit can be expressed in degrees Celsius by the transformation

$x_{\text{new}} = \frac{5}{9}(x - 32) = -\frac{160}{9} + \frac{5}{9}x$

This transformation changes both the unit size and the origin of the measurements: the origin of the Celsius scale (0°C, the temperature at which water freezes) is 32° on the Fahrenheit scale.

Changing the Unit of Measurement (cont.)

Linear transformations do not change the basic shape of a distribution (skewness, symmetry, number of modes). But they do change the measures of center and spread:

- Multiplying each observation by a positive number b multiplies both measures of center (mean, median) and measures of spread (IQR, s) by b.
- Adding the same number a (positive or negative) to each observation adds a to measures of center and to quartiles, but it does not change measures of spread (IQR, s).
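The two rules above can be checked numerically. The following Python sketch applies the Fahrenheit-to-Celsius transformation, with a = −160/9 and b = 5/9, to a small set of temperatures and compares the summaries before and after. The temperature values are hypothetical, made up for illustration, and the helper names `iqr` and `linear_transform` are not from the source.

```python
import math
from statistics import mean, median, stdev


def iqr(values):
    """Interquartile range, with Q1 and Q3 taken as the medians of the
    lower and upper halves of the ordered data (excluding M when n is odd)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(upper) - median(lower)


def linear_transform(values, a, b):
    """Apply x_new = a + b * x to every observation."""
    return [a + b * x for x in values]


# Hypothetical Fahrenheit readings (illustrative data, not from the source)
temps_f = [50, 59, 68, 77, 86, 95, 104]

# Fahrenheit to Celsius: x_new = -160/9 + (5/9) x
a, b = -160 / 9, 5 / 9
temps_c = linear_transform(temps_f, a, b)

# Measures of center are multiplied by b and shifted by a
center_ok = (math.isclose(mean(temps_c), a + b * mean(temps_f))
             and math.isclose(median(temps_c), a + b * median(temps_f)))

# Measures of spread are multiplied by b only; the shift a has no effect
spread_ok = (math.isclose(iqr(temps_c), b * iqr(temps_f))
             and math.isclose(stdev(temps_c), b * stdev(temps_f)))

print(center_ok, spread_ok)  # True True
```

The comparisons use `math.isclose` rather than `==` because 5/9 and 160/9 are not exactly representable as floating-point numbers.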