Chapter 1 Exploring Data
1.0 Data Analysis
1.1 Analyzing Categorical Data
1.2 Displaying Quantitative Data with Graphs
1.3 Describing Quantitative Data with Numbers

1.0 Data Analysis
Objectives SWBAT:
1) Identify the individuals and variables in a set of data.
2) Classify variables as categorical or quantitative.

What’s the difference between categorical and quantitative variables?
• A variable is any characteristic of an individual.
• A quantitative (numerical) variable takes numerical values for which it makes sense to find an average.
• Examples include height, weight, speed, age, number of oranges in a bowl, number of stolen bases, etc.
• A categorical variable places an individual into one of several categories or groups (think qualitative).
• Examples include gender, blood type, ethnicity, outcome of a plate appearance in baseball, etc.

Do we ever use numbers to describe the values of a categorical variable? Do we ever divide the distribution of a quantitative variable into categories?
• A word of caution: not every variable that takes number values is quantitative.
• Take zip code, for example. It describes the location where you live, so it is really a categorical variable. It wouldn’t make sense to find the average zip code.
• Another example is a Social Security number. You could just as easily use letters instead of numbers to represent someone’s identity.
• Often, quantitative variables like age and weight are divided into categories and treated as categorical variables.
• An example would be age categories used to classify people, such as 0-9, 10-19, etc.

What is a distribution?
• A variable generally takes on many different values.
In data analysis, we are interested in how often a variable takes on each value.
• A distribution of a variable tells us what values the variable takes and how often it takes these values.
• Note: the values can be words or numbers.

Example: 2009 Fuel Economy Guide (variable of interest: MPG)

MODEL                MPG    MODEL                MPG    MODEL                 MPG
Acura RL              22    Dodge Avenger         30    Mercury Milan          29
Audi A6 Quattro       23    Hyundai Elantra       33    Mitsubishi Galant      27
Bentley Arnage        14    Jaguar XF             25    Nissan Maxima          26
BMW 528I              28    Kia Optima            32    Rolls Royce Phantom    18
Buick Lacrosse        28    Lexus GS 350          26    Saturn Aura            33
Cadillac CTS          25    Lincoln MKZ           28    Toyota Camry           31
Chevrolet Malibu      33    Mazda 6               29    Volkswagen Passat      29
Chrysler Sebring      30    Mercedes-Benz E350    24    Volvo S80              25

[Figure: dotplot of the MPG distribution]

Example: US Census Data, 10 randomly selected US residents from the 2000 census.
a) Who are the individuals in this data set?
The 10 randomly selected US residents who participated in the 2000 US census.
b) What variables are measured?
1) state; categorical
2) number of family members; quantitative; units: people
3) age; quantitative; units: years
4) gender; categorical
5) marital status; categorical
6) total income; quantitative; units: dollars
7) travel time to work; quantitative; units: minutes
c) Describe the individual in the first row.
The individual lives in Kentucky, has 2 members in her family, is 61 years old, is female, is married, makes $21,000 a year, and travels 20 minutes to work.

1.1 Analyzing Categorical Data
Objectives SWBAT:
1) Display categorical data with a bar graph. Decide if it would be appropriate to make a pie chart.
2) Identify what makes some graphs of categorical data deceptive.
3) Calculate and display the marginal distribution of a categorical variable from a two-way table.
4) Calculate and display the conditional distribution of a categorical variable for a particular value of the other categorical variable in a two-way table.
5) Describe the association between two categorical variables by comparing appropriate conditional distributions.

What is the difference between a frequency table and a relative frequency table? When is it better to use relative frequency?
• A frequency table displays the count (frequency) of observations in each category or class.
• A relative frequency table shows the percent (relative frequency) of observations in each category or class.

Frequency table and relative frequency table

Format                Count of Stations    Percent of Stations
Adult Contemporary                 1556                   11.2
Adult Standards                    1196                    8.6
Contemporary Hit                    569                    4.1
Country                            2066                   14.9
News/Talk                          2179                   15.7
Oldies                             1060                    7.7
Religious                          2014                   14.6
Rock                                869                    6.3
Spanish Language                    750                    5.4
Other Formats                      1579                   11.4
Total                             13838                   99.9

• When the number of observations is not the same (or close to the same) between distributions, we should make a relative frequency histogram.
Example: Here are two frequency histograms comparing the number of points scored by players on the LA Lakers and players not on the Lakers in the 2008-2009 regular season.
• Because there are many more players not on the Lakers, it is hard to compare these distributions.
• With relative frequency histograms, the comparison is much easier to make.

What is the most important thing to remember when making pie charts and bar graphs? Why do statisticians prefer bar graphs?
• The most important thing to remember is to make sure everything is properly labeled!
• Statisticians prefer bar graphs because 1) they’re easier to make and read and 2) they allow for a comparison of quantities that are measured in the same units.
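
The step from a frequency table to a relative frequency table is a single division per category. As a minimal sketch, the Python below recomputes the percents for the station-format table in these notes; the function name `relative_frequencies` is just an illustrative choice, not something from the textbook.

```python
# Station counts by format, copied from the frequency table in these notes.
counts = {
    "Adult Contemporary": 1556,
    "Adult Standards": 1196,
    "Contemporary Hit": 569,
    "Country": 2066,
    "News/Talk": 2179,
    "Oldies": 1060,
    "Religious": 2014,
    "Rock": 869,
    "Spanish Language": 750,
    "Other Formats": 1579,
}


def relative_frequencies(freq):
    """Return each category's share of the total as a percent, rounded to 0.1."""
    total = sum(freq.values())
    return {category: round(100 * count / total, 1) for category, count in freq.items()}


percents = relative_frequencies(counts)

# Print the two tables side by side: count, then percent.
for fmt, pct in percents.items():
    print(f"{fmt:<20}{counts[fmt]:>6}{pct:>7.1f}%")
print(f"{'Total':<20}{sum(counts.values()):>6}{sum(percents.values()):>7.1f}%")
```

The rounded percents add to 99.9 rather than 100 because each one is rounded separately; this is ordinary roundoff error, not a mistake in the table.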
What are some common ways to make a misleading graph?
• When making any graph, avoid adding embellishments that are potentially misleading.
• One way to make a graph misleading is to violate the area principle, which says that the area representing each category in a graph should be proportional to the number of observations in that category (so all bars should be equally wide).
• Another way is to not start the frequency axis at 0.
• This graph makes it look as if LeBron missed almost all of his shots.
• A third way to make graphs misleading is by making them 3D. The 3D design makes the slices closer to the reader appear larger than those in the back. The red and purple slices are both 42%, but the purple looks much larger.

What is wrong with the following graph?
First, the heights of the bars are not accurate. According to the graph, the difference between 81 and 95 is much greater than the difference between 56 and 81. Also, the extra width for the DIRECTV bar is deceptive since our eyes respond to the area, not just the height.

What is a two-way table? What is a marginal distribution?
• Two-way table: describes two categorical variables, organizing counts according to a row variable and a column variable.

Example, p. 12
Young adults by gender and chance of getting rich

                                 Female    Male    Total
Almost no chance                     96      98      194
Some chance, but probably not       426     286      712
A 50-50 chance                      696     720     1416
A good chance                       663     758     1421
Almost certain                      486     597     1083
Total                              2367    2459     4826

The variables described by this table are gender and opinion about getting rich.
• The marginal distribution of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table.
• Note: Percents are often more informative than counts, especially when comparing groups of different sizes.
• To examine a marginal distribution:
1) Use the data in the table to calculate the marginal distribution (in percents) of the row or column totals.
2) Make a graph to display the marginal distribution.

Examine the marginal distribution of chance of getting rich.

Response            Percent
Almost no chance    194/4826 = 4.0%
Some chance         712/4826 = 14.8%
A 50-50 chance      1416/4826 = 29.3%
A good chance       1421/4826 = 29.4%
Almost certain      1083/4826 = 22.4%

What is a segmented bar graph? Why are they good to use?
• A segmented bar graph displays the possible outcomes of a categorical variable as slices of a rectangle, with the area of each slice proportional to how often each corresponding outcome occurred (each bar must total 100%).
• It is also known as a “stacked” bar chart.
• Segmented bar graphs are good to use because they force us to use percents.
• Note that they aren’t the best for comparison purposes. A better graph would be a side-by-side bar graph like the one on page 17.

What does it mean for two variables to have an association? How can you tell by looking at a graph?
• Two variables have an association if knowing the value of one variable helps predict the value of the other.
• For example, if knowing that a person is male makes one of the responses more likely, there is an association between gender and response.
• In the graph for this example, there is an association between gender and opinion. Knowing that a young adult is male helps us predict his opinion: he is more likely than a female to say “good chance” or “almost certain”.
Continuing with the same example, if there were no association between gender and opinion, then knowing a young adult is male would NOT help us predict his opinion.
He would be no more or less likely than a female to say “good chance” or “almost certain” or any other response. Males and females would have the same distribution of opinions. In other words, the bars for each response would be almost equal in height for the two genders.

Example: The Pew Research Center asked a random sample of 2024 adult cell phone owners from the US which type of cell phone they own: iPhone, Android, or other (including non-smart phones). Here are the results, broken down by age category.
a) Explain what it would mean if there was no association between age and cell phone type.
No association would mean that knowing someone’s age would not help us predict what type of phone they own.
b) Based on these data, can we conclude there is an association between age and cell phone type? Justify.
Yes, there is an association between age and cell phone type, because knowing someone’s age helps us predict their phone type: 18-34 year olds are most likely to own an Android, while 35-54 year olds and those 55+ are most likely to own some other type of phone.

1.2 Displaying Quantitative Data with Graphs
Objectives SWBAT:
1) Make and interpret dotplots and stemplots of quantitative data.
2) Describe the overall pattern (shape, center, and spread) of a distribution and identify any major departures from the pattern (outliers).
3) Identify the shape of a distribution from a graph as roughly symmetric or skewed.
4) Make and interpret histograms of quantitative data.
5) Compare distributions of quantitative data using dotplots, stemplots, and histograms.

The dotplots show the daily high temperatures for 7 cities in June, July, and August.
1) What is the most important difference between cities A, B, and C?
Their centers
2) What is the most important difference between cities C and D?
Their spreads
3) What are two important differences between cities D and E?
Their spreads (but not range) and unusual values (outliers)
4) What is the most important difference between cities C, F, and G?
Their shapes

When describing the distribution of a quantitative variable, what characteristics should be addressed?
• You want to address patterns and departures from patterns. The acronym to remember is SOCS: Shape, Outliers, Center, and Spread.

Shape
• When you describe a distribution’s shape, concentrate on the main features. Look for rough symmetry or clear skewness.
Definitions:
A distribution is roughly symmetric if the right and left sides of the graph are approximately mirror images of each other.
A distribution is skewed to the right (right-skewed) if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side.
It is skewed to the left (left-skewed) if the left side of the graph is much longer than the right side.

To help remember skewed right and skewed left, think about your feet.
A distribution is skewed to the right when the right side of the graph is more spread out than the left side. Think about your right foot. The toes are tall on the left side and get progressively smaller as you move to the right.
A distribution is skewed to the left when the left side of the graph is more spread out than the right side. Think about your left foot. The toes are tall on the right side and get progressively smaller as you move to the left.

Other terms to describe shape:
Unimodal: A distribution is unimodal when it shows one distinct peak.
Bimodal: A distribution is bimodal if it has two distinct peaks. Note: we don’t worry about little bumps. They have to be distinct.
Uniform: A distribution is uniform when the heights of the bars are all about the same.

How would you describe the shapes of these distributions?
Skewed right, unimodal
Symmetric, unimodal

• An outlier is an individual value that falls outside the overall pattern of a distribution.
• For now we’ll use an eye test to determine outliers.
Looking at this distribution, there are two unusually high values that appear to be outliers, at approximately 57 and 91.
• The center is the middle value in the distribution (either the mean or median).
• The spread is the variability of a distribution (how spread out the data are).
• Common measures of spread are range and IQR.
• Here is an example of Tom Brady’s passer ratings in the 2001 NFL season. Describe the spread.
The range is 148.3 - 57.1 = 91.2

Frozen Pizza Example
Here are the number of calories per serving for 16 brands of frozen cheese pizza, along with a dotplot of the data.
340 340 310 320 310 360 350 330 260 380 340 320 360 290 320 330
Shape: roughly symmetric and unimodal
Center: median at 330 calories
Spread: the values vary from 260 calories to 380 calories (a range of 120)
Outliers: there appears to be one unusually small value (260 calories)

What is the most important thing to remember when you are asked to compare two distributions?
• You need to actually compare the distributions using explicit comparison words!
• Examples:
• The center for distribution A is larger than the center for distribution B.
• Carucci’s cat meows less than Mr. Fal’s cat.
• Prestige Worldwide makes the same amount of money as the South Pole Elf Corporation.
• Needless to say, this is only applicable to certain parts of SOCS (center and spread). One shape cannot necessarily be better than another shape.

How do the annual energy costs (in dollars) compare for refrigerators with top freezers and refrigerators with bottom freezers? The data below are from the May 2010 issue of Consumer Reports.
• Shape: The distribution for bottom freezers looks skewed right and possibly bimodal (modes near $58 and $70 per year). The distribution for top freezers looks roughly symmetric, with its main peak centered near $55.
• Outliers: There appear to be two bottom freezers with unusually high energy costs (over $140). There are no outliers for the top freezers.
• Center: The typical energy cost for bottom freezers is greater than the typical cost for the top freezers (midpoint of $69 vs midpoint of $56).
• Spread: There is much more variability in the energy costs for bottom freezers.

What is the most important thing to remember when making a stemplot?
• Stemplots (aka stem-and-leaf plots) are simple graphical displays for fairly small data sets.
• Stemplots give us a quick picture of the distribution while including the actual numerical values.
• Just like with all displays, it is important to remember the LABELS (and a key)!

How to Make a Stemplot
1) Separate each observation into a stem (all but the final digit) and a leaf (the final digit).
2) Write all possible stems from the smallest to the largest in a vertical column and draw a vertical line to the right of the column.
3) Write each leaf in the row to the right of its stem.
4) Arrange the leaves in increasing order out from the stem.
5) Provide a key that explains in context what the stems and leaves represent.

• Stemplots (Stem-and-Leaf Plots)
• These data represent the responses of 20 female AP Statistics students to the question, “How many pairs of shoes do you have?” Construct a stemplot.
50 26 26 31 57 19 24 22 23 38 13 50 13 34 23 30 49 13 15 51

Stems    Add leaves       Order leaves
1 |      1 | 93335        1 | 33359
2 |      2 | 664233       2 | 233466
3 |      3 | 1840         3 | 0148
4 |      4 | 9            4 | 9
5 |      5 | 0701         5 | 0017

Add a key
Key: 4|9 represents a female student who reported having 49 pairs of shoes.

Sometimes it may be beneficial to split stems, which is a method for spreading out a stemplot that has too few stems (the data tend to be bunched up). Each stem appears twice: leaves 0-4 go on the first copy of the stem and leaves 5-9 go on the second.

Example: Which gender is taller, males or females? A sample of 14-year-olds from the UK was randomly selected using the CensusAtSchool website. Here are the heights of the students (in cm). Make a back-to-back stemplot and compare the distributions.
Male: 154, 157, 187, 163, 167, 159, 169, 162, 176, 177, 151, 175, 174, 165, 165, 183, 180
Female: 160, 169, 152, 167, 164, 163, 160, 163, 169, 157, 158, 153, 161, 165, 165, 159, 168, 153, 166, 158, 158, 166

[Figure: back-to-back stemplot without split stems]
[Figure: back-to-back stemplot with split stems]

Shape: The female distribution is skewed left and unimodal. The male distribution is symmetric and unimodal.
Outliers: Neither distribution appears to contain outliers.
Centers: The males have a larger center than the females (median of 167 centimeters vs median of 162 centimeters; for the 22 females, average the middle two values).
Spread: The male distribution has greater variability than the female distribution.

• Histograms
• Quantitative variables often take many values. A graph of the distribution may be clearer if nearby values are grouped together.
• The most common graph of the distribution of one quantitative variable is a histogram.

How to Make a Histogram
1) Divide the range of data into classes of equal width.
2) Find the count (frequency) or percent (relative frequency) of individuals in each class.
3) Label and scale your axes and draw the histogram. The height of each bar equals the frequency (or relative frequency) of its class. Adjacent bars should touch, unless a class contains no individuals.
• The smallest observation is 93.2 and the largest is 106.1. We could choose classes of width 2 starting at 93.

Why would we prefer a relative frequency histogram to a frequency histogram?
When comparing distributions with different sample sizes! When the numbers of observations are not equal, a fair comparison cannot be made using just the frequencies.
• Follow pages 36-37 to make a histogram on the TI-84!
• Note:
• To change the boundaries, press WINDOW.
• Xmin defines where the first class begins and Xscl defines the class width.
• Xmax, Ymin, and Ymax define how big the window will be.

1.3 Describing Quantitative Data with Numbers
Objectives SWBAT:
1) Calculate measures of center (mean, median).
2) Calculate and interpret measures of spread (range, IQR, standard deviation).
3) Choose the most appropriate measure of center and spread in a given setting.
4) Identify outliers using the 1.5 × IQR rule.
5) Make and interpret boxplots of quantitative data.
6) Use appropriate graphs and numerical summaries to compare distributions of quantitative variables.

What is a resistant measure? Is the mean a resistant measure of center? Is the median a resistant measure of center?
• A resistant measure is a measure that can resist the influence of extreme observations.
• Think about calculating the mean salary for students in this classroom.
• Let’s say Adam Sandler finds out he is one class short of graduating high school, and that class happens to be AP Statistics. He moves to Lyndhurst and transfers into this class. What effect would his salary have on the mean?
• What type of effect would it have on the median?
• Because the mean cannot resist the influence of extreme observations, we say that it is not a resistant measure of center. However, the median is a resistant measure of center.

How does the shape of a distribution affect the relationship between mean and median?
There is a connection between the shape of a distribution and the relationship between the mean and median of the distribution.
• When a distribution is symmetric, the mean and median will be approximately the same.
• When a distribution is skewed right, the mean will be greater than the median.
• When a distribution is skewed left, the mean will be smaller than the median.
• This distribution of stolen bases is skewed right, with a median of 5, as noted on the histogram.
• It does not seem plausible that the balancing point (mean) is also 5.
Because the distribution is stretched out to the right, the mean must be greater than 5. Think of all the extremely large values that will pull the mean up.
• Two common measures of spread (variability) are range and IQR.

What is range? Is it a resistant measure of spread?
• The range of a distribution is the distance between the minimum value and the maximum value.
• Do you think it is a resistant measure of spread? Let’s go back to our Adam Sandler example.
• Range can be a bit deceptive if there is an unusually high or unusually low value in a distribution. It is not a resistant measure of spread.

What are quartiles? How do you find them?
• Quartiles are the values that divide a distribution into four groups of roughly the same size.
• Find the quartiles: 4 6 8 12 20 22 27

How To Calculate The Quartiles And The IQR:
To calculate the quartiles:
1. Arrange the observations in increasing order and locate the median.
2. The first quartile Q1 is the median of the observations located to the left of the median in the ordered list.
3. The third quartile Q3 is the median of the observations located to the right of the median in the ordered list.

What is the interquartile range (IQR)? Is the IQR a resistant measure of spread?
• The interquartile range (IQR) is a single number that measures the range of the middle half of the distribution, ignoring the values in the lowest quarter of the distribution and the values in the highest quarter of the distribution.
The interquartile range (IQR) is defined as: IQR = Q3 - Q1
• Since IQR essentially discards the lowest and highest 25% of the distribution, any outliers would have a minimal effect on IQR. As a result, we can state that IQR is a resistant measure of spread.

Here are data on the amount of fat (in grams) in 9 different McDonald’s fish and chicken sandwiches. Calculate the median and the IQR.
Median = 19

What is an outlier? How do you identify them? Are there any outliers in the chicken/fish distribution?
In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers.

The 1.5 × IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile. That is, an outlier is any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.

Since no values fall below the lower boundary of 0.75 or above the upper boundary of 38.75, the chicken/fish distribution contains no outliers. (These boundaries correspond to Q1 = 15, Q3 = 24.5, and IQR = 9.5.)

Are there any outliers in the beef distribution?
No values fall below the lower boundary of 9, so there are no small outliers. 43 falls above the upper boundary of 41, so the Double Quarter Pounder with Cheese is an outlier.

The Five-Number Summary
The minimum and maximum values alone tell us little about the distribution as a whole. Likewise, the median and quartiles tell us little about the tails of a distribution. To get a quick summary of both center and spread, combine all five numbers.
The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest.
Minimum   Q1   Median   Q3   Maximum
Five-number summaries are displayed with box-and-whisker plots.

Boxplots (Box-and-Whisker Plots)
The five-number summary divides the distribution roughly into quarters. This leads to a new way to display quantitative data, the boxplot.

How To Make A Boxplot:
• A central box is drawn from the first quartile (Q1) to the third quartile (Q3).
• A line in the box marks the median.
• Lines (called whiskers) extend from the box out to the smallest and largest observations that are not outliers.
• Outliers are marked with a special symbol such as an asterisk (*).
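
The quartile method, the five-number summary, and the 1.5 × IQR rule can be sketched in a few lines of Python. This is only an illustration under the median-of-each-half method described in these notes; the function names are made up for this sketch, and the data are the New York travel times from the boxplot example plus the seven-value quartile exercise.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-each-half method from these notes."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # observations to the left of the median
    upper = data[half + n % 2:]  # observations to the right of the median
    return median(lower), median(data), median(upper)


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = quartiles(values)
    return min(values), q1, med, q3, max(values)


def outliers(values):
    """Return the values flagged by the 1.5 x IQR rule."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


# New York travel times (minutes), from the boxplot example in these notes.
travel = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
          15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

print(five_number_summary(travel))        # min, Q1, median, Q3, max
print(outliers(travel))                   # values beyond the 1.5 x IQR fences
print(quartiles([4, 6, 8, 12, 20, 22, 27]))
```

For the travel times this gives Q1 = 15, median = 22.5, and Q3 = 42.5, so the upper fence is 42.5 + 1.5(27.5) = 83.75 and the value 85 is flagged. Calculators and library routines such as `statistics.quantiles` use interpolation-based conventions, so their quartiles can differ slightly from this hand method.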
Construct a Boxplot
Consider our New York travel time data:
10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
In increasing order:
5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
Min = 5   Q1 = 15   Median = 22.5   Q3 = 42.5   Max = 85
Recall that 85 is an outlier by the 1.5 × IQR rule.
• In the distribution above, how far are the values from the mean, on average?
• The concept of mean absolute deviation is similar to standard deviation.

What does the standard deviation measure?
• Standard deviation measures the average distance of observations from their mean (typical deviation from the mean).
• Note: The formulas for standard deviation and variance are on the formula sheet. However, the calculations can be done right on the TI-84.

What are some similarities and differences between range, IQR, and standard deviation?
• Similarity: All three measure variability.
• Differences:
• Only IQR is resistant to outliers.
• Only standard deviation uses all the data.

How is the standard deviation calculated? What is the variance?
• For now, just know that variance is standard deviation squared.
• To understand how to calculate standard deviation, we need to understand what a deviation is first.
• Deviation from the mean: A deviation from the mean, $x - \bar{x}$, is the difference between the value of x and the mean, $\bar{x}$ (how far an observation is from the mean).
• For example, let’s say our data set consists of the values 4, 5, 8, and 11.
• The mean would be 28/4 = 7.
• Our deviations would be -3, -2, 1, and 4.
• A positive deviation indicates a value is larger than the mean.
• A negative deviation indicates a value is smaller than the mean.
• A deviation of 0 indicates the value is equal to the mean.
• The sum of the deviations, $\sum (x - \bar{x})$, is always zero because the deviations of x values smaller than the mean (which are negative) cancel out the deviations of x values larger than the mean (which are positive).
• We can remove this neutralizing effect if we do something to make all the deviations positive.
This can be accomplished by squaring each of the deviations; squared deviations will all be nonnegative (positive or zero) values. The squared deviations are used to find the variance.
• Variance is the average of the squared deviations.
• Variance of a population is denoted by $\sigma^2$:
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$
• Variance of a sample is denoted by $s^2$:
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Note: variance is calculated slightly differently for populations vs samples. Unless specified, we assume we are working with a sample.

Steps to find the variance:
1) Find the sum of your data set.
2) Find the mean.
3) Calculate the deviations.
4) Square the deviations.
5) Sum the squared deviations.
6) Divide the sum of the squared deviations by n if working with the population or n - 1 if working with the sample.

Example: Find the variance for the sample data: 25, 26, 30.
The mean is 81/3 = 27, so the deviations are -2, -1, and 3, and the squared deviations 4, 1, and 9 sum to 14.
Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{14}{3 - 1} = \frac{14}{2} = 7$

• The standard deviation is the positive square root of the variance.
• Standard deviation measures a typical deviation from the mean.
• In the previous example, the variance was 7, so the standard deviation is: $s = \sqrt{s^2} = \sqrt{7} \approx 2.65$
• The standard deviation of a population is: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$
• The standard deviation of a sample is: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

What are some properties of standard deviation?
• SD measures the spread about the mean and should be used only when the mean is chosen as the center (if median is chosen, use IQR).
• SD is always greater than or equal to 0. [An SD of 0 would mean all observations are the same value.]
• SD has the same units of measurement as the original observations.
• SD is not resistant to outliers. A few outliers can make SD very large.

A random sample of 5 students was asked how many minutes they spent doing HW the previous night. Here are their responses (in minutes): 0, 25, 30, 60, 90. Calculate and interpret the standard deviation.
The number of minutes spent doing HW typically varies by about 34.71 minutes from the mean of 41 minutes.
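
The six steps for the variance translate almost line for line into code. The sketch below assumes sample data (dividing by n - 1), uses helper names invented for this illustration, and cross-checks the hand-rolled result against Python's built-in `statistics.stdev`.

```python
from math import sqrt
from statistics import stdev


def sample_variance(values):
    """Steps from the notes: mean, deviations, squares, sum, divide by n - 1."""
    n = len(values)
    mean = sum(values) / n                                  # steps 1-2
    squared_deviations = [(x - mean) ** 2 for x in values]  # steps 3-4
    return sum(squared_deviations) / (n - 1)                # steps 5-6


def sample_sd(values):
    """Standard deviation is the positive square root of the variance."""
    return sqrt(sample_variance(values))


# Variance example from the notes: 25, 26, 30.
print(sample_variance([25, 26, 30]))
print(round(sample_sd([25, 26, 30]), 2))

# Homework example from the notes (minutes): 0, 25, 30, 60, 90.
homework = [0, 25, 30, 60, 90]
print(round(sample_sd(homework), 2))

# The library routine also divides by n - 1, so the two should agree.
assert abs(sample_sd(homework) - stdev(homework)) < 1e-9
```

For the homework data the squared deviations sum to 4820, so the variance is 4820/4 = 1205 and the standard deviation is about 34.71 minutes, matching the interpretation above.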
What factors should you consider when choosing summary statistics?
We now have a choice between two descriptions for center and spread:
• Mean and Standard Deviation
• Median and Interquartile Range

Choosing Measures of Center and Spread
• The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers.
• Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers.
• NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!
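
To see why the median and IQR are the safer choice when outliers are present, here is a small sketch in the spirit of the Adam Sandler example. The earnings values are hypothetical, invented purely for illustration, and the `iqr` helper uses the median-of-each-half quartile method from these notes.

```python
from statistics import mean, median, stdev

# Hypothetical weekly earnings (dollars) for a small class; invented for illustration only.
earnings = [0, 50, 80, 100, 120, 150, 200]
with_outlier = earnings + [400_000]  # one extreme earner joins the class


def iqr(values):
    """IQR = Q3 - Q1, with quartiles found as the medians of each half."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    return median(data[half + n % 2:]) - median(data[:half])


for label, data in [("original", earnings), ("with outlier", with_outlier)]:
    print(
        f"{label:<13} mean={mean(data):>10.1f} sd={stdev(data):>10.1f} "
        f"median={median(data):>6.1f} IQR={iqr(data):>6.1f}"
    )
```

In this made-up data set the single extreme value moves the mean from 100 to 50087.5 and inflates the standard deviation enormously, while the median only moves from 100 to 110 and the IQR from 100 to 110.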