Chapter 1 (Yates et al. 2002)

Introduction
Individuals – the objects described by a set of data. Individuals may be people, animals, or things.
Variable – any characteristic of an individual. A variable can take different values for different individuals.
A data set contains information (organized as variables) about some group of individuals.
When you view a new data set, ask the following questions:
Who?
○ Individuals?
○ How many individuals?
What?
○ How many variables?
○ Definitions for each variable?
○ Units for numerical values?
○ Reason to mistrust information?
Why?
○ For what reason was the data gathered?
Categorical variable
Quantitative variable, or numerical variable
Litmus test: Does it make sense to calculate an average for the variable? If so, it is a quantitative variable.

Example 1.1 Education in the United States.

State  Region  Pop. (1000s)  SAT Verbal  SAT Math  Percent taking  Percent no HS  Teachers’ pay ($1000)
:
CA     PAC     33,871        497         514       49              23.8           43.7
CO     MTN     4,301         536         540       32              15.6           37.1
CT     NE      3,406         510         509       80              20.8           50.7
:

Each row of data is called a case.
Can you answer the three “W” questions about this data set?
Who?
○ Individuals?
○ How many individuals?
What?
○ How many variables?
○ Definitions for each variable?
○ Units for numerical values?
○ Reason to mistrust information?
Why?
○ For what reason was the data gathered?

Distribution – the distribution of a variable tells us what values the variable takes and how often it takes these values. Think of it as the pattern of variation.
Examining data in order to describe its main features is called exploratory data analysis.
Examine variables individually, then look for relationships among variables.
○ These examinations start with graph(s), then move to numerical summaries of specific aspects of the data.
Do exercises 1.1 to 1.3

1.1 Displaying Distributions with Graphs

Displaying categorical variables: bar graphs and pie charts
We already practiced displaying frequency distributions and relative frequency distributions with tables and bar graphs.
According to Yates et al. (2002), the distribution of a categorical variable lists the categories and gives either the count or the percent of individuals that fall in each category.

Example 1.2. The most popular soft drink.
The following table displays the sales figures and market share (percent of total sales) achieved by several major soft drink companies in 1999.

Company                     Cases sold (millions)  Market share (percent)
Coca-Cola Co.               4377.5                 44.1
Pepsi-Cola Co.              3119.5                 31.4
Dr. Pepper/7-UP (Cadbury)   1455.1                 14.7
Cott Corp.                  310.0                  3.1
National Beverage           205.0                  2.1
Royal Crown                 115.4                  1.2
Other                       347.5                  3.4

[Bar graph: 1999 Soft Drink Sales. Vertical axis from 0 to 5000; one bar per company.]
[Pie chart: 1999 Soft Drink Sales: Market Shares. One slice per company.]

How to construct a bar graph and pie chart: see p. 8.
Bar graphs and pie charts help an audience grasp the distribution quickly.
Example 1.3, p. 10, gives an example of categorical data that would not be appropriate for a pie chart.
Do exercises 1.5 to 1.6

Displaying quantitative variables: dotplots and stemplots
Example 1.4, p. 11, shows how to construct a dotplot.
Figure 1.3, p. 11, is an example of a dotplot.
Making a graph is not an end in itself. The graph must be interpreted.
Look for the overall pattern and also for striking deviations from that pattern.
○ Give the center and the spread.
○ Describe the shape of the distribution. Symmetrical? Bimodal? Skewed right or skewed left?
Is there an outlier, an individual observation that falls outside the pattern?
When the values are too spread out for a dotplot, a stemplot may be useful.
Example 1.5, p. 13, shows how to construct a stemplot. See Figure 1.4, p. 14.
Sometimes stems can be split to better resolve the distribution.
○ Too few stems will result in a skyscraper-shaped plot.
○ Too many stems will give a “pancake” graph.
Round if the data have too many digits.
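As a rough sketch of the stemplot idea (not the textbook's own steps from p. 13), the Python below uses all but the final digit of each value as the stem and the final digit as the leaf. The function name `stemplot` is hypothetical, the data are the male doctors' counts read off Figure 1.23 later in these notes, and the code assumes non-negative whole numbers.

```python
from collections import defaultdict


def stemplot(values):
    """Return one text line per stem; leaves are the final digits, in order."""
    data = sorted(values)
    leaves = defaultdict(list)
    for v in data:
        stem, leaf = divmod(v, 10)  # e.g. 37 -> stem 3, leaf 7
        leaves[stem].append(str(leaf))
    lines = []
    # Walk every stem from smallest to largest so that empty stems
    # (gaps in the distribution) still appear in the plot.
    for stem in range(data[0] // 10, data[-1] // 10 + 1):
        lines.append(f"{stem:>2} | {''.join(leaves[stem])}".rstrip())
    return lines


# Male doctors' counts, read from the back-to-back stemplot in Figure 1.23.
male = [20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86]

for line in stemplot(male):
    print(line)
```

The empty stems 6 and 7 are kept on purpose: they make the gap before the two largest values visible, which is the kind of deviation from the overall pattern the notes ask you to look for.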
See Technology Toolbox for interpreting data from computer output.
Do exercises 1.8 to 1.11

Displaying quantitative variables: histogram
Quantitative variables often take on many values. A graph of the distribution is clearer if we group nearby values, as in a histogram.
Example 1.6, p. 19, shows how to make a histogram.
Let’s interpret Figure 1.7, p. 20.
Do Technology Toolbox, p. 21.
Do exercises 1.12 and 1.15

More about shape
Look for:
○ Major peaks
○ Clear outliers
○ Rough symmetry or clear skewness
Symmetric – describes a distribution whose histogram has right and left sides that are approximately mirror images of each other.
Skewed right
Skewed left
Example 1.7 Lightning Flashes and Shakespeare.
Figure 1.8, p. 26
Figure 1.9, p. 26
○ Note that the vertical scale here is a percent, not a count. This is convenient when counts are large or when we want to compare several distributions.
The overall shape of a distribution provides important information about a variable.
Some types of data regularly produce distributions that are symmetric or that are skewed. Some display neither.
Do exercises 1.16 and 1.17

Relative frequency, cumulative frequency, and ogives
So you received a test score report that said you were in the 85th percentile. So what?
The pth percentile of a distribution is the value such that p percent of the observations fall at or below it.
Relative cumulative frequency graph, or ogive – shows the relative standing of an individual observation.
How to construct a relative cumulative frequency graph, pp. 28-30.
• Decide on intervals and make a frequency table.
• Add three columns: relative frequency, cumulative frequency, and relative cumulative frequency.
• To get the values of the relative frequency, divide the count in each class by the total number of individuals.
• To fill the cumulative frequency column, add the counts in the frequency column that fall in or below the current class interval.
• For relative cumulative frequency, divide the entries in the cumulative frequency column by the total number of individuals.

Class   Frequency  Relative Frequency       Cumulative Frequency  Relative Cumulative Freq
40-44   2          2/43 = 0.047, or 4.7%    2                     2/43 = 0.047, or 4.7%
45-49   6          6/43 = 0.140, or 14.0%   8                     8/43 = 0.186, or 18.6%
50-54   13         13/43 = 0.302, or 30.2%  21                    21/43 = 0.488, or 48.8%
55-59   12         12/43 = 0.279, or 27.9%  33                    33/43 = 0.767, or 76.7%
60-64   7          7/43 = 0.163, or 16.3%   40                    40/43 = 0.930, or 93.0%
65-69   3          3/43 = 0.070, or 7.0%    43                    43/43 = 1.000, or 100.0%
Total   43

Next, graph the relative cumulative frequency against the variable of interest.
The vertical axis is scaled from 0% to 100%.
The horizontal axis is scaled to fit the highest value, and each major tick is labeled with the left end-point of the next class interval.
How do you locate an individual within the distribution?
How do you locate a value corresponding to a percentile?
How can you find the center of the distribution?

[Ogive. Vertical axis: Rel. Cum. Freq., 0% to 100%; horizontal axis: Age at Inauguration, 30 to 75.]
Figure 1.12 Ogive of presidents’ ages at inauguration.
Do exercise 1.19.

Time plots
Many variables are measured at intervals over time.
Example: Height of a growing child.
A time plot of a variable plots each observation against the time at which it was measured.
○ Time is always placed on the horizontal axis.
○ The variable of interest goes on the vertical axis.
○ If there are not too many points, connect the points to show the pattern of change over time.
When examining a time plot, look for:
○ Overall pattern
○ Strong deviations from the pattern
A common overall pattern is called a trend.
A pattern that repeats itself at regular time intervals is called seasonal variation.
Do exercise 1.21.

1.2 Describing Distributions with Numbers
Shape, center, and spread provide a good description of the overall pattern of any distribution of a quantitative variable.
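Looking back at the relative cumulative frequency table for the presidents' ages in Section 1.1, the column arithmetic can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the class labels and counts are copied from the table.

```python
from itertools import accumulate

# Class intervals and counts copied from the presidents' ages table.
classes = ["40-44", "45-49", "50-54", "55-59", "60-64", "65-69"]
counts = [2, 6, 13, 12, 7, 3]


def cumulative_table(counts):
    """Return total, relative, cumulative and relative cumulative frequencies."""
    total = sum(counts)
    rel = [c / total for c in counts]    # count / total
    cum = list(accumulate(counts))       # running sum of the counts
    rel_cum = [c / total for c in cum]   # running sum / total
    return total, rel, cum, rel_cum


def class_reaching_percentile(p, classes, rel_cum):
    """Return the first class whose relative cumulative frequency reaches p percent."""
    for label, rc in zip(classes, rel_cum):
        if rc * 100 >= p:
            return label
    return classes[-1]


total, rel, cum, rel_cum = cumulative_table(counts)
for label, f, r, c, rc in zip(classes, counts, rel, cum, rel_cum):
    print(f"{label}  {f:>2}  {r:6.1%}  {c:>2}  {rc:6.1%}")

# The 50th percentile (the center) is reached inside this class.
print(class_reaching_percentile(50, classes, rel_cum))
```

Reading the center off the ogive amounts to the same lookup: follow the 50% line across to the curve and then down to the age axis.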
Measuring center: the mean
The mean is the most common measure of center; it is the ordinary arithmetic average.
$\bar{x} = \frac{1}{n} \sum x_i$
Example 1.10 Barry Bonds vs. Hank Aaron illustrates that the mean is sensitive to extreme values (outliers) and also to skewed distributions.
The mean is not a resistant measure of center.

Measuring center: the median
The median, M, is the formal version of the midpoint, with a specific rule for calculation.
It is the number such that half the observations are smaller and the other half are larger. See p. 39.
Example 1.11 Finding Medians tells a different story in the comparison of Barry Bonds’s and Hank Aaron’s home runs.

Comparing the mean and the median
The mean and median of a symmetric distribution are close together.
In a skewed distribution, the mean is farther out in the long tail than the median.
Reports of house prices, incomes, and other strongly skewed distributions usually give the median rather than the mean.
However, if you are a tax assessor interested in figuring out the total value of all the homes in your area, use the mean.
Do exercises 1.31 to 1.35

Measuring spread: the quartiles
A measure of center alone does not tell us how variable the values are.
The simplest useful numerical description of a distribution contains both a measure of center and a measure of spread.
The range is the difference between the largest and smallest observations. But this depends heavily on outliers.
A measure of spread can be improved by looking at the spread of the middle half of the data.
Quartiles mark out the middle half.
○ First quartile, Q1, lies one-quarter of the way up an ordered list of values for a variable. Larger than 25% of the observations.
○ Third quartile, Q3, lies three-quarters of the way up an ordered list of values for a variable. Larger than 75% of the observations.
○ Second quartile is the median, M, previously defined. Larger than 50% of the observations.
Calculating quartiles, how to, p. 42.
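A minimal sketch of the median and quartile calculations follows. The notes defer the exact rule to p. 42, so the code assumes one common convention: Q1 and Q3 are the medians of the observations below and above the position of the overall median. Function names are hypothetical, and the data are read off the back-to-back stemplot in Figure 1.23 later in these notes.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, M, Q3) under the assumed convention described above."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # With an odd count, the middle observation belongs to neither half.
    upper = data[half + 1:] if len(data) % 2 == 1 else data[half:]
    return median(lower), median(data), median(upper)


male = [20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86]
female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]

print(quartiles(male))         # (Q1, M, Q3) for the male doctors
print(quartiles(female))       # (Q1, M, Q3) for the female doctors
print(sum(male) / len(male))   # mean, pulled toward the long right tail
```

For the male doctors the mean (about 41.3) sits well above the median (34), which is the right-skew behavior described under "Comparing the mean and the median". Software that uses a different quartile rule may give slightly different Q1 and Q3.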
Calculators and computer software may use slightly different rules, but the slight differences are no problem.
Example 1.12, p. 43.
The distance between the first and third quartiles is a simple measure of spread called the interquartile range (IQR).
$IQR = Q_3 - Q_1$
If an observation falls between Q1 and Q3, it is not unusual.
The IQR is the basis for determining outliers.
Criterion for determining outliers: a value is a suspected outlier if it lies more than 1.5 × IQR above the third quartile or below the first quartile.
○ Upper cutoff: $Q_3 + 1.5 \times IQR$
○ Lower cutoff: $Q_1 - 1.5 \times IQR$

The five-number summary and boxplots
The smallest and largest values give us information about the tails of the distribution.
The five-number summary gives a quick summary of center and spread:
Minimum  Q1  M  Q3  Maximum
A graph called a boxplot can be used to view the five-number summary for a variable and to compare groups of data for the same variable.
Figure 1.17, p. 45. Comparing Barry Bonds and Hank Aaron.
When looking at a boxplot:
○ Locate the median.
○ Identify the spread: the quartiles and the extreme values.
The boxplot also gives an indication of symmetry.
In a right-skewed distribution, the third quartile will be farther above the median than the first quartile is below it. What about a left-skewed distribution?
A modified boxplot plots the outliers as isolated points. See Figure 1.18, p. 46.
Do Technology Toolbox and exercises 1.36 to 1.39

Measuring spread: the standard deviation
The most common numerical description of a distribution combines the mean with the standard deviation, which measures spread by looking at how far the observations are from their mean.
Variance ($s^2$) – the average of the squares of the deviations of the observations from their mean.
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard deviation ($s$) – the square root of the variance.
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Example 1.14 Metabolic Rate, p. 49
Figure 1.20, p. 50, illustrates the idea of a deviation using the metabolic rate data.
The variance (and standard deviation) is large if the observations are widely spread about their mean; the opposite is true if the observations are all close to the mean.
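The variance and standard deviation formulas can be traced step by step in Python. This is a sketch with hypothetical function names; the data are the female doctors' counts read off Figure 1.23 later in these notes, chosen because the notes also report the standard deviation for that group.

```python
from math import sqrt


def variance(values):
    """Sample variance: sum of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


def std_dev(values):
    """Sample standard deviation: square root of the variance."""
    return sqrt(variance(values))


female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]

print(round(sum(female) / len(female), 1))  # mean
print(round(variance(female), 2))           # s squared
print(round(std_dev(female), 2))            # s
```

The rounded result agrees with the SD of 10.13 listed for the female doctors in the descriptive statistics table near Figure 1.23. If every value in the list were the same, each deviation would be zero and `std_dev` would return 0.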
Degrees of freedom

Basic properties of the standard deviation as a measure of spread
○ Use s only when the mean is chosen as the measure of center. Why?
○ s = 0 only when there is no spread, i.e. all observations have the same value. Otherwise, s > 0.
○ As observations become more spread out about their mean, s gets larger.
○ s, like the mean, is not resistant.

Choosing measures of center and spread
Use the five-number summary when describing a skewed distribution or a distribution with strong outliers.
Use the mean and standard deviation when describing a reasonably symmetric distribution that is free of outliers.
Always plot your data!
Do Exercises 1.40 to 1.43

Changing the unit of measurement
Done by a linear transformation.
$x_{new} = a + bx$
Adding a constant a shifts x up or down by the same amount.
Multiplying by a positive constant b changes the size of the unit of measurement.
Example: $T_F = \frac{9}{5} T_C + 32$
Effects of linear transformation on center and spread (p. 55)
○ Multiplying each observation by a positive number b multiplies both measures of center (mean and median) and measures of spread (s and IQR) by b.
○ Adding the same number a (either positive or negative) to each observation adds a to the measures of center and to the quartiles but does NOT change measures of spread.
○ Overall, a linear transformation does not change the shape of the distribution.
Example 1.15 Los Angeles Lakers’ Salaries (pp.
53-55)

Stem-and-leaf displays of Base Salary, +.1 Bonus, and 10% Bonus (N = 14 and Leaf Unit = 0.10 for each)

Stem  Base Salary  +.1 Bonus  10% Bonus
 0    378          489        378
 1    00           11         11
 2    01           12         23
 3    1            2          4
 4    235          346        679
 5    0            1          5
 6
 7
 8
 9
10
11    8            9
12                            9
13
14
15
16
17    1            2
18                            8

Figure 1.21 Stemplots of the salaries of LA Laker players before linear transformation, after adding 0.1, and after a 10% bonus (multiplying by 1.1).
Figure 1.21A Boxplots of the salaries of LA Laker players before linear transformation, after adding 0.1, and after a 10% bonus (multiplying by 1.1).
Do Exercises 1.44 to 1.46

Comparing distributions
○ Side-by-side bar graphs for categorical data.
○ Back-to-back stemplots for small quantitative data sets.
○ Side-by-side boxplots.

[Side-by-side bar graph. Vertical axis: Percent Purchased, 0 to 25; horizontal axis: Color (medium or dark green, white, light brown, silver, black); series: full-sized or intermediate-sized car, light truck or van.]
Figure 1.22 Favorite car and truck colors for 1998.

 Male  | Stem | Female
       |  0   | 57
       |  1   | 0489
 87550 |  2   | 59
 76431 |  3   | 13
     4 |  4   |
    90 |  5   |
       |  6   |
       |  7   |
    65 |  8   |

Figure 1.23 Back-to-back stemplot of the number of cesarean sections performed by male and female Swiss doctors.

Descriptive statistics for the number of cesarean sections performed by male and female Swiss doctors. Compare the distributions.

                Mean  SD     Min  Q1  M     Q3  Max  IQR
Male doctors    41.3  20.61  20   27  34    50  86   23
Female doctors  19.1  10.13  5    10  18.5  29  33   19

Do Exercises 1.47 to 1.49
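As one way to start the comparison, the 1.5 × IQR criterion from Section 1.2 can be applied to both groups. The sketch below takes Q1 and Q3 from the descriptive statistics table and the raw counts from Figure 1.23; the function names are hypothetical.

```python
def fences(q1, q3):
    """Lower and upper cutoffs: 1.5 x IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def suspected_outliers(values, q1, q3):
    """Values lying beyond either cutoff."""
    low, high = fences(q1, q3)
    return [x for x in values if x < low or x > high]


# Counts read off the back-to-back stemplot (Figure 1.23).
male = [20, 25, 25, 27, 28, 31, 33, 34, 36, 37, 44, 50, 59, 85, 86]
female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]

# Q1 and Q3 copied from the descriptive statistics table.
print(fences(27, 50), suspected_outliers(male, 27, 50))
print(fences(10, 29), suspected_outliers(female, 10, 29))
```

Under this criterion the two largest male counts are flagged and no female counts are. That suggests the five-number summary is the safer description for the male doctors, whose mean and standard deviation are inflated by those two values.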