Chapter 5: Describing Distributions Numerically

Important Values
• The minimum and maximum (the extremes)
• The midrange is the average of the maximum and minimum values
  – Do not use the midrange to describe the distribution.
• The range of the data is the difference between the maximum and minimum values
• The median is the middle value; it divides the histogram into two equal areas
• The quartiles are the points that divide the data into quarters
• The interquartile range (IQR) of a data set is a measure of variation that gives the range of the middle portion (about half) of the data

More on Quartiles
• One quarter of the data lies below the lower quartile (also known as the 25th percentile)
• One quarter of the data lies above the upper quartile (also known as the 75th percentile)
• Half the data lies between the lower quartile and the upper quartile
• The difference between the quartiles is called the interquartile range (IQR)

Finding the Important Numerical Values of a Data Set
1.) Order the data set numerically.
2.) Find the extremes, midrange, and range.
3.) Find the median, $Q_2$ ("cue-two").
4.) Find $Q_1$. It is the median of the data entries to the left of $Q_2$.
5.) Find $Q_3$. It is the median of the data entries to the right of $Q_2$.
6.) Find the IQR, $IQR = Q_3 - Q_1$, and any outliers.

Identify Outliers
A data entry is an outlier if
  data entry $< Q_1 - 1.5(IQR)$   or   data entry $> Q_3 + 1.5(IQR)$

Practice
The numbers of nuclear power plants in the top 15 nuclear power-producing countries in the world are listed. Find the five-number summary.
  7 20 16 6 58 9 20 50 23 33 8 10 15 16 104
Reorder
  6 7 8 9 10 15 16 16 20 20 23 33 50 58 104
Minimum: 6   Maximum: 104   Midrange: 55   Range: 98
$Q_1 = 9$   $Q_2 = 16$   $Q_3 = 33$   $IQR = 33 - 9 = 24$
Outliers
• $Q_1 - 1.5(IQR) = 9 - 36 = -27$. No data entries are less than -27.
• $Q_3 + 1.5(IQR) = 33 + 36 = 69$, and 104 > 69.
The country with 104 nuclear power plants is an outlier.
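The six steps above can be sketched in a few lines of Python. This is only an illustration of the median-of-halves rule described on the slides; the function names `five_number_summary` and `outliers` are made up for this example, and other textbooks or packages may compute quartiles slightly differently.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the median-of-halves rule."""
    data = sorted(values)                      # step 1: order the data
    n = len(data)
    half = n // 2
    lower = data[:half]                        # entries to the left of Q2
    # For odd n the middle entry is Q2 itself and belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]

def outliers(values):
    """Return entries below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

plants = [7, 20, 16, 6, 58, 9, 20, 50, 23, 33, 8, 10, 15, 16, 104]
print(five_number_summary(plants))  # (6, 9, 16, 33, 104)
print(outliers(plants))             # [104]
```

The output agrees with the hand calculation: the only entry outside the interval from -27 to 69 is 104.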
Use a box plot to display your data
  6 7 8 9 10 15 16 16 20 20 23 33 50 58 104
By hand first
Then by calculator
And by Alcula

Interpret
• The box represents about half of the data: about 50% of the data entries are between 9 and 33.
• The left whisker represents about one-quarter of the data: about 25% of the data entries are less than 9.
• The right whisker represents about one-quarter of the data: about 25% of the data entries are greater than 33. These are the entries above the 75th percentile.

More Interpretations
• The length (or height) of the box is the IQR.
• If the median is roughly in the middle of the box, the distribution is roughly symmetric. If not, the distribution is skewed.

Summary
The number of power plants in the top 15 nuclear power-producing countries in the world.
As of May 2016, 30 countries worldwide are operating 444 nuclear reactors for electricity generation, and 63 new nuclear plants are under construction in 15 countries. In 2015, 13 countries relied on nuclear energy to supply at least one-quarter of their total electricity. Choose the top 15 to compare.

Country               Number of Operating Nuclear Power Plants
USA                   104
France                59
Japan                 45
Russian Federation    43
China                 55
Republic of Korea     28
India                 27
Canada                19
Ukraine               17
United Kingdom        15
Sweden                10
Germany               8
Belgium               7
Spain                 7
Czech Rep.            6

Homework Answers: Dollars for Students

Comparing Groups with Boxplots
• When we plot two (or more) boxplots side by side on the same axis, we can "see" a lot
  – Which group has the greater median?
  – Which group has the higher IQR?
  – Which group has the bigger range?
  – Do the groups have similar spreads?
• Symmetry?
• Spread?
• Outliers?

2006   Min: 6   Q1: 9   Q2: 16   Q3: 33   Max: 104   Outlier: 104
2016   Min: 6   Q1: 8   Q2: 19   Q3: 45   Max: 104   Outlier: 104

Comparing the number of power plants in the top 15 nuclear power-producing countries in the world in 2006 and 2016.
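As a check on the two summaries above, the sketch below recomputes the quartiles, IQR, and range for both years from the listed counts. It assumes the same median-of-halves rule used in the practice example; the helper name `quartiles` and the dictionary `groups` are illustrative only.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of each half of the sorted data."""
    data = sorted(values)
    half = len(data) // 2
    # data[:half] and data[-half:] both exclude the middle entry when n is odd.
    return median(data[:half]), median(data), median(data[-half:])

groups = {
    "2006": [7, 20, 16, 6, 58, 9, 20, 50, 23, 33, 8, 10, 15, 16, 104],
    "2016": [104, 59, 45, 43, 55, 28, 27, 19, 17, 15, 10, 8, 7, 7, 6],
}

for year, counts in groups.items():
    q1, q2, q3 = quartiles(counts)
    print(year,
          "min", min(counts), "Q1", q1, "Q2", q2, "Q3", q3, "max", max(counts),
          "IQR", q3 - q1, "range", max(counts) - min(counts))
# 2006 has quartiles 9, 16, 33 (IQR 24); 2016 has quartiles 8, 19, 45 (IQR 37).
```

Both years share the same range (98), so the range alone hides the change; the difference shows up in the median and the IQR.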
The distributions are skewed to the right because the USA has 104 nuclear power plants. In 2016 the middle half of the data runs from 8 to 45 nuclear power plants (IQR = 37), noticeably wider than in 2006, when it ran from 9 to 33 (IQR = 24). One reason is that China went from 33 to 55 nuclear power plants.

Removing Outliers
2006   Min: 6   Q1: 9   Q2: 16   Q3: 23   Max: 58   Outliers: 50, 58
2016   Min: 6   Q1: 8   Q2: 18   Q3: 43   Max: 59

When the outlier, the USA, is removed from the 2006 data, France and Japan lie well above the 75th percentile; the two largest remaining values, 50 and 58, are flagged as outliers. In 2016, if we remove the USA from the data, France and Japan are still above the 75th percentile but are no longer considered outliers. In conclusion, countries have built more nuclear power plants within the past 10 years.

Summary
Comparing the number of power plants in the top 15 nuclear power-producing countries in the world in 2006 and 2016, there is evidence that countries have built more plants and are continuing to build more. The distributions of both data sets are skewed to the right because the USA has 104 nuclear power plants. In 2016 the middle half of the data runs from 8 to 45 nuclear power plants, noticeably wider than in 2006, when it ran from 9 to 33. One reason is that China had 33 nuclear power plants and now has 55. When the outlier, the USA, is removed from the 2006 data, France and Japan lie well above the 75th percentile and are flagged as outliers. In 2016, if we remove the USA from the data, France and Japan are still above the 75th percentile but are not considered outliers. In conclusion, these top 15 countries have built more nuclear power plants within the past 10 years. We may expect an even higher accumulation in the next 10, considering those countries that have not yet begun building nuclear power plants.

The Formula for Averaging
• While we know how to find the mean, the notation here is key:
  $\bar{y} = \frac{\text{Total}}{n} = \frac{\sum y}{n}$
  – Sigma ($\sum$) means to sum the observations.
  – $\bar{y}$ is pronounced "y-bar"; in general, a bar over any symbol or variable denotes its mean.
• The mean is located at the balancing point of the histogram.
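To make the averaging formula concrete, the sketch below applies $\bar{y} = \sum y / n$ to the 2006 power plant counts, once with the USA (104) included and once with it removed. The helper name `mean` and the variable names are chosen for this illustration only.

```python
from statistics import median

def mean(values):
    """y-bar: the sum of the observations divided by their count."""
    return sum(values) / len(values)

plants_2006 = [6, 7, 8, 9, 10, 15, 16, 16, 20, 20, 23, 33, 50, 58, 104]
without_usa = [x for x in plants_2006 if x != 104]   # drop the outlier

print(round(mean(plants_2006), 2), median(plants_2006))   # 26.33 16
print(round(mean(without_usa), 2), median(without_usa))   # 20.79 16.0
```

Removing a single extreme value moves the mean by more than five plants while the median stays at 16, which is why the median is the safer measure of center for this right-skewed data set.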
Mean or Median?
[Figure: histogram of HALE (yr) on the horizontal axis against # of Countries on the vertical axis]
• Since the distribution in the histogram is skewed to the left, the mean is lower than the median.
• When data are skewed, it's better to report the median than the mean as a measure of center.

Standard Deviation
• The IQR is a reasonable summary of spread, but because it uses only the two quartiles of the data, it ignores much of the information about how individual values vary.
• The standard deviation takes into account how far each value is from the mean.
• Just like the mean, the standard deviation is only appropriate for symmetric data.

Variance
If we summed the deviations from the mean, we would get 0 (which won't help much). However, when we add the squared deviations from the mean and find their average (almost: we divide by n - 1 instead of n), we call the result the variance.

Some Formulas
• Variance: $s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$
  – Subtract the mean from each data value and square the result. Then sum the squared differences.
  – We divide by n - 1 instead of n because one degree of freedom is used up in estimating the mean, leaving n - 1. Degrees of freedom come up in depth in a later chapter.
• Standard Deviation: $s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$
  – Standard deviation is the square root of the variance.

Guidelines: Finding the Sample Variance and Standard Deviation
1.) Find the mean of the sample data set: $\bar{x} = \frac{\sum x}{n}$
2.) Find the deviation of each entry: $x - \bar{x}$
3.) Square each deviation: $(x - \bar{x})^2$
4.) Add to get the sum of squares: $SS_x = \sum (x - \bar{x})^2$
5.) Divide by n - 1 to get the sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
6.) Find the square root of the variance to get the sample standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Thinking about Variance
• Always report the spread along with any summary of the center
• If data values are close to the center, the measures of spread (variance and standard deviation) will be small
• If data values are far from the center, the measures of spread will be large

Just Checking
1. The U.S. Census Bureau reports the median family income in its summary of census data.
Why do you suppose they use the median instead of the mean? What might be the disadvantages of reporting the mean?
2. You've just bought a new car that claims to get a highway fuel efficiency of 31 mpg. Of course, your mileage will "vary." If you had to guess, would you expect the IQR of gas mileage attained by all cars like yours to be 30 mpg, 3 mpg, or 0.3 mpg? Why?
3. A company selling a new MP3 player advertises that the player has a mean lifetime of 5 years. If you were in charge of quality control at the factory, would you prefer that the standard deviation of lifespans of the players you produce be 2 years or 2 months? Why?

More S.O.C.S.
How do we know which "center" to report?
• If the shape is skewed, report the median and IQR. You may want to include the mean and standard deviation, but you should point out why the mean and median differ. In fact, when the mean and median differ, it's a sign that the distribution may be skewed.
• If the shape is symmetric, report the mean, standard deviation, and possibly the median and IQR. For symmetric data, the IQR is usually a little larger than the standard deviation.
• If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be revealing. The median and IQR are less sensitive to the outliers.
• We always pair the median with the IQR and the mean with the standard deviation. It's not useful to report one without the other.
• Reporting a center without a spread (and vice versa) is dangerous.

What Can Go Wrong?
• Don't forget to do a reality check.
  – Verify that your results make sense; it's easy to make a quick calculator error!
• Don't forget to sort the values when finding the median or percentiles by hand.
• Don't compute numerical summaries of categorical data (even if it has numbers in it!).
• Watch out for multiple modes
  – Consider separating the data into different groups

What (Else) Could Go Wrong?
• Be aware of slightly different methods
  – While it won't make a difference in our course, different statistics packages/books do things differently
• Beware of outliers
• ALWAYS make a picture!!
• Be careful when comparing groups that have very different spreads
  – We can re-express data to address major differences in spread as well as shape
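The warning about slightly different methods can be seen directly in software. The sketch below uses Python's standard `statistics` module on the 2006 power plant counts: its two quartile conventions give different values for $Q_1$ and $Q_3$, and its "population" variance divides by n where the sample variance from the guidelines divides by n - 1. The helper `sample_variance` is written here for illustration and simply follows the six guideline steps.

```python
import math
import statistics

plants = [6, 7, 8, 9, 10, 15, 16, 16, 20, 20, 23, 33, 50, 58, 104]

# Two quartile conventions offered by the standard library.
exclusive = statistics.quantiles(plants, n=4, method="exclusive")
inclusive = statistics.quantiles(plants, n=4, method="inclusive")
print(exclusive)  # [9.0, 16.0, 33.0], matches the median-of-halves rule for this data
print(inclusive)  # [9.5, 16.0, 28.0]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n                             # step 1: mean
    ss_x = sum((x - x_bar) ** 2 for x in values)        # steps 2-4: sum of squares
    return ss_x / (n - 1)                               # step 5: divide by n - 1

s2 = sample_variance(plants)
s = math.sqrt(s2)                                       # step 6: standard deviation

# statistics.variance divides by n - 1; statistics.pvariance divides by n.
print(math.isclose(s2, statistics.variance(plants)))            # True
print(statistics.pvariance(plants) < statistics.variance(plants))  # True
```

Neither convention is wrong; the point is to know which one a calculator or package uses before comparing its output with a hand calculation.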