Download Graphing Categorical Variables

AP Statistics Take-Home Package

Complete these notes by reading Chapter P and Chapter 1 of the text. The notes below do not necessarily follow the reading sequentially but are arranged to group related information together. Then complete the homework problems listed at the end. You can expect a quiz/test over this material within the first few days of the school year. Come prepared with questions.

AP Statistics: Chapter P

Statistics is ____________________________________________________________

Data is ____________________________________________

Data consists of information about some group of individuals (these may be people, animals or even inanimate objects), and the characteristics we measure on each individual are called variables.

Example 1: Give an example of each of the following "types" of individuals along with a corresponding variable for the individual.
A person:
An animal:
An inanimate object:

Variables fall into two main categories:
1. A categorical, or qualitative, variable ____________________________________________________________
2. A quantitative variable ____________________________________________________________

Example 2: Consider the three "types" of individuals listed in Example 1 and give a possible categorical and a possible quantitative variable for each.
A person:
An animal:
An inanimate object:

Ideally, any set of data is accompanied by background information that helps us understand it. When you meet a new set of data, ask yourself the following key questions (write notes for each).
WHO
WHAT
WHY
WHEN, WHERE, HOW and BY WHOM

The distribution of a variable tells us ____________________________________________________________

Example 3: Take a standard deck of 52 playing cards, randomly select a card and record whether it is an ace, two, three, etc. Return the card to the deck, randomly select a second card and record whether it is an ace, two, etc. Repeat 20 times.

Example 4: Take a regular 6-sided die and roll it 20 times. Each time, record what number appears.

Statistical inference involves drawing conclusions about a large group, called the ________________, by gathering information from a smaller subgroup, called the ________________. You may wonder why we do not just gather information about everyone in the population (this is called a _________________) rather than bother with a sample. The reason is simple: it takes too much time and too much money!

The main statistical designs for producing data are _______________, __________________, and _________________________________________.

Example P3 on p. 9 of the text illustrates one concern when using a survey to gather data. What is this concern?

In an observational study, ____________________________________________________________

In an experiment, we ____________________________________________________________

What is the key difference between an observational study and an experiment?

Data analysis is ____________________________________________________________

What is a side-by-side bar graph best used for?

What type of data is a dotplot used for?

HW:
p11 / P.1 – P.5
p19 / P.7 – P.12
p25 / P.13 – P.15, P.18
p30 / P.19, P.21 – P.24, P.28

AP Statistics: Chapter 1

What two types of graphs are typically used for categorical variables?
If a particular section of a pie chart is to represent 17% of the data, what should be the central angle measure that determines that particular wedge?

What two types of graphs are typically used for quantitative variables?

Steps for constructing a stem-and-leaf plot:
1. ____________________________________________________________
2. ____________________________________________________________
3. ____________________________________________________________

On the bottom of page 45 there is a Minitab (a statistical computer package) version of a stem-and-leaf plot. The far left column simply shows a cumulative total of the number of leaves for that stem and all the stems before it. It also shows a leaf unit, so that you know what digit in the data value the leaf represents. For example, the stem and leaf 0 9 represents 9,000 while the stem and leaf 2 5 represents 25,000.

Compare and contrast stem-and-leaf plots and histograms as to their advantages and when each should be used.

Two techniques that are helpful when using a stem-and-leaf plot for a moderately large set of data are _______________________ and _______________________. Describe each technique.

A __________________________________ is very useful when you wish to compare two related distributions.

A histogram ____________________________________________________________

You can choose any convenient number of classes but ____________________________________________________________.
If you choose too few classes, you get a ____________________ graph while too many classes will yield a _________________ graph.

Count the number of data values that fall into each class. These counts are called _________________ and a table that lists the class and the frequency for each class is called a _______________________.

A relative frequency histogram gives percents instead of frequencies and is very useful when comparing two sets of data where one set has many more values than the other. In a cumulative frequency histogram, each class's frequency is the sum of the frequencies for that class and all the classes before it.

Constructing a graph to represent our data is only the first step. The next step is to interpret what we see. When you describe the distribution, pay special attention to the following.

shape ____________________________________________________________
The length of the "tails" will tell us whether a graph (i.e. distribution) is left-skewed (the left tail is the longest) or right-skewed (the right tail is the longest).

modes ____________________________________________________________
__________________ - one major peak, __________________ - two major peaks

center ____________________________________________________________
The two most common measures of center are the mean and the median. These will be discussed in greater detail later in these notes.

spread ____________________________________________________________
The IQR and standard deviation are probably the two most common measures of spread. Both will be discussed in greater detail later in these notes.

outliers ____________________________________________________________
Outliers will be discussed in greater detail later in these notes.

When you have to describe the shape of a distribution, don't get mad, C U S S:
Center: for the center, refer to mean, median or, perhaps, mode.
Unusual: refers to outliers or gaps in the data.
Spread: refers to IQR, standard deviation or range.
Shape: refers to symmetrical or skewed as well as any peaks.

To look at the relative standing of an individual observation, we use a relative cumulative frequency graph, which is called an __________________ (pronounced o-jive).

Here's how to make an ogive.
1. Decide on class intervals and make a frequency table. Add 3 columns to your frequency table labeled: relative frequency, cumulative frequency and relative cumulative frequency (divide the cumulative frequency by the total).
2. Complete the frequency table below, which shows the ages of U.S. Presidents at their inauguration.

Class      Frequency   Relative Frequency   Cumulative Frequency   Relative Cumulative Frequency
40 – 44        2       __________           __________             __________
45 – 49        6       __________           __________             __________
50 – 54       13       __________           __________             __________
55 – 59       12       __________           __________             __________
60 – 64        7       __________           __________             __________
65 – 69        3       __________           __________             __________
Total:        43

3. Label and scale your axes and title your graph. Label the horizontal axis "Age at Inauguration" and the vertical axis "Relative Cumulative Frequency". Scale the horizontal axis according to your choice of class intervals and the vertical axis from 0% to 100%.
4. Plot a point corresponding to the relative cumulative frequency in each class interval at the left endpoint of the next class interval. Connect consecutive points with a line segment to form the ogive. The last point you plotted should be at a height of ____________.

[Blank grid for the ogive. Title: Ages of U.S. Presidents at the Time of Their Inauguration. Horizontal axis: Age at Inauguration.]

Ogives can be used to locate an individual within the distribution.

Example 1: Determine Bill Clinton's relative standing when he took office at the age of 46.

Ogives can also be used to locate a value corresponding to a percentile.

Example 2: What is the center of the distribution? __________

A time plot of a variable plots each observation against time. Always put ____________ on the horizontal scale and the variable you are measuring on the vertical scale.
Connecting the data points by line segments helps emphasize any change over time. A good use of a time plot would be to graph stock market prices over time.

The tables below (from page 70 in your text) give the EPA city and highway mileage for cars in the "two-seater" and "minicompact" categories.

Fuel economy (mpg) for 2004 model motor vehicles

Two-seater Cars
Model                     City   Highway
Acura NSX                  17      24
Audi TT Roadster           20      28
BMW Z4 Roadster            20      28
Cadillac XLR               17      25
Chevrolet Corvette         18      25
Dodge Viper                12      20
Ferrari 360 Modena         11      16
Ferrari Maranello          10      16
Ford Thunderbird           17      23
Honda Insight              60      66
Lamborghini Gallardo        9      15
Lamborghini Murcielago      9      13
Lotus Esprit               15      22
Maserati Spyder            12      17
Mazda Miata                22      28
Mercedes-Benz SL500        16      23
Mercedes-Benz SL600        13      19
Nissan 350Z                20      26
Porsche Boxster            20      29
Porsche Carrera 911        15      23
Toyota MR2                 26      32

Minicompact Cars
Model                     City   Highway
Aston Martin Vanquish      12      19
Audi TT Coupe              21      29
BMW 325CI                  19      27
BMW 330CI                  19      28
BMW M3                     16      23
Jaguar XK8                 18      26
Jaguar XKR                 16      23
Lexus SC 430               18      23
Mini Cooper                25      32
Mitsubishi Eclipse         23      31
Mitsubishi Spyder          20      29
Porsche Cabriolet          18      26
Porsche Turbo 911          14      22

Measuring Center: The Mean & Median

To calculate the mean, add the values of the observations and divide by the number of observations.
• The mean of a sample is denoted x̄, pronounced x-bar.
• The mean of a population is denoted μ, the Greek letter mu.

Example 3: Determine the mean highway mileage for two-seaters.

What outlier do you see in the data? ____________________

Example 4: Determine the mean highway mileage for two-seaters without the outlier.

Examples 3 & 4 illustrate an important weakness of the mean as a measure of center: the mean is sensitive to the influence of a few extreme observations. These may be outliers, but a skewed distribution that has no outliers will also pull the mean toward its long tail.

The median (denoted by _____) is the __________________ of a distribution.

To calculate the median:
1. Order the observations from smallest to largest.
2. If the number of observations is odd, the median is simply the middle value in the list. You can find the location by counting __________ observations from the bottom (or top).
3. If the number of observations is even, you should average the two middle numbers. The location of the median is again __________ from the bottom or top of the list.

Example 5: Find the median highway mileage for 2004 model two-seater cars.

Example 6: Drop the Honda Insight (the outlier) and find the median.

Is the median sensitive to the influence of an extreme observation? _____

We say that the median is an _____________________________________________ of center.

Mean versus Median

The mean and median of a roughly symmetrical distribution will be ___________________________. If the distribution is exactly symmetric, the mean and median are _______________. In a skewed distribution, the mean is __________________________ in the long tail than the median.
• In a skewed distribution, the ____________ is the more accurate measure of center.

In descriptions of data, the "average" value of a variable is usually referred to as the __________ whereas the "typical" value is usually referred to as the __________________.

Measuring Spread: The Quartiles

A measure of center alone can be misleading.

Example 7: Find the mean and median of:
6000   7000   8000   9000   15000
x̄ = __________   M = __________   range (see below) = __________

Example 8: Find the mean and median of:
1000   1000   8000   8000   27000
x̄ = __________   M = __________   range = __________

One way to measure spread, or variability, is to calculate the range, which is ____________________________________________________________

Another way to describe the spread of a distribution is by considering different percentiles. The pth percentile of a distribution is the value that has ____________________________________________________________.
The median is the ________ percentile. The 25th percentile is called the ______________________________ while the 75th percentile is called the _________________________________.

Example 9: Find the median and quartiles of the 21 gasoline-powered two-seater cars below.
13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32 66

Example 10: Find the median and quartiles of the 13 minicompact cars below.
19 22 23 23 23 26 26 27 28 29 29 31 32

The Five-Number Summary and Boxplots

The five-number summary of a set of observations consists of the _____________________, the ____________________________, the _____________, the __________________________ and the _____________________. These five numbers give a fairly complete description of center and spread.

Example 11: Find the five-number summary for the highway gas mileage for two-seaters and minicompacts in Examples 9 & 10.
two-seaters:    _____  _____  _____  _____  _____
minicompacts:   _____  _____  _____  _____  _____

Remember that the median describes _________________________________________________, the quartiles show _________________________________________________ and the minimum and maximum values show _________________________________________________

The five-number summary can be presented visually by a boxplot. These are the steps for constructing a boxplot.
1. ____________________________________________________________
2. ____________________________________________________________
3. ____________________________________________________________
• You should also place the exact values of the five-number summary above the appropriate line.
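The five-number summary described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, the data are hypothetical (not from the text), and the quartiles are taken as the medians of the observations below and above the position of the overall median. Calculators and software packages may use slightly different quartile rules, so their answers can differ a little.

```python
def median(sorted_values):
    """Return the median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]                          # odd count: middle value
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2  # even count: average the two middle values


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the observations below and
    above the position of the overall median.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]             # observations below the median position
    upper = data[half + (n % 2):]   # observations above the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])


# Hypothetical data, invented for illustration only.
sample = [12, 15, 15, 18, 21, 24, 30]
print(five_number_summary(sample))  # prints (12, 15, 18, 24, 30)
```

Once you have worked Example 11 by hand, you could pass your own sorted lists to five_number_summary as a check.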
Example 12: On the graph below, construct a boxplot for the highway gas mileage for two-seaters from Example 9.

[Number line for the boxplot, scaled from 0 to 70 in steps of 10.]

The 1.5 × IQR Rule for Outliers

The distance between the 1st and 3rd quartiles is called the ___________________________________, which is abbreviated IQR for obvious reasons. The quartiles and IQR are resistant to changes in either tail of a distribution. Note, however, that no single numerical measure of spread, such as IQR, is very useful for describing skewed distributions.

• We will call a data value a "suspected" outlier if ____________________________________________________________

* The IQR rule for outliers is the only one given in this text. A commonly used rule that uses the mean instead of the median is "a data value is an outlier if it lies more than 2 standard deviations above or below the mean".

In a modified boxplot, ____________________________________________________________ and asterisks are used to denote any outliers.

Example 13: Consider the highway gas mileage for two-seaters from Example 9.
(a) Show that the Honda Insight is a suspected outlier.
(b) Find the lower bound in order for an observation to be an outlier.
(c) Draw a modified boxplot.

HW Assignments:
P46 / 1.1ab, 1.3, 1.4, 1.5 (dot plot and stem-and-leaf)
P55 / 1.7 - 1.9, 1.11, 1.12
P64 / 1.13 – 1.15, 1.18
P74 / 1.27 – 1.32
P82 / 1.33, 1.34a-d, 1.35, 1.36a, 1.37

Measuring Spread: The Standard Deviation

While the five-number summary certainly gives a great deal of information about the distribution of a set of numbers, the most common numerical description of a distribution is the combination of the mean to measure ________________ and the standard deviation to measure ________________.
The standard deviation measures spread by ____________________________________________________________

• The standard deviation of a sample is denoted by s.
• The standard deviation of a population is denoted σ, the Greek letter sigma.

The following formula is used to compute the standard deviation of a sample.

s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}

What does all this mean?

The deviations x_i - x̄ measure _______________________________________________________. Some of these deviations will be positive and some negative. Why?

The sum of the deviations (the Greek letter sigma, Σ, means find the sum) of the observations from their mean will always be ______.

Squaring the deviations makes them all positive.

After adding the now positive deviations, we find their average by dividing by n - 1. Why n - 1? This number n - 1 is called the _______________________________ (see Example 14 below).

Finally, taking the square root undoes the squaring of the deviations that we did initially.

Example 14: I have 6 numbers whose sum is 0. Five of the numbers are 2, 5, -3, 6 and -4. What is the sixth number? ______
Notice that if you only have 5 of the numbers, you can determine the sixth.

The variance of a set of observations, s² or σ², is simply the square of the standard deviation.

Example 15: Find the standard deviation of the following metabolic rates (in calories per 24 hours) of 7 men.
1795 1666 1362 1614 1460 1867 1439

step 1: Find x̄.
step 2: Determine x_i - x̄.
step 3: Square each number in step 2, add the squares together and divide by n - 1 to find the variance.
step 4: Take the square root of the answer to step 3 to find the standard deviation.

x̄ = __________

x_i      x_i - x̄      (x_i - x̄)²
1795     ________     ________
1666     ________     ________
1362     ________     ________
1614     ________     ________
1460     ________     ________
1867     ________     ________
1439     ________     ________

Sum of the squared deviations = ________
Sum ÷ (n - 1) = ________ = s² (variance)
Square root = ________ = s (standard deviation)

There is a shortcut formula for computing the standard deviation of a sample.
Use it to find the standard deviation of the numbers in Example 15.

s = \sqrt{\frac{n\sum x^2 - (\sum x)^2}{n(n-1)}}

Properties of the Standard Deviation

1. s measures spread about the _______________ and should be used only when the mean is used as the measure of center.
2. s = 0 only when there is ___________________________________ (i.e. ___________________________________). Otherwise, __________. As the observations become more spread out about their mean, s gets ________________.
3. s, like the mean x̄, is not resistant to outliers. A few outliers can make s very large. Distributions with outliers and strongly skewed distributions have very large standard deviations. As such, the number s does not give much helpful information about such distributions.

Choosing Measures of Center and Spread

The five-number summary, in particular the median and the IQR, is usually better than the mean and standard deviation for describing ____________________________________________________________

Use x̄ and s only for reasonably _________________ distributions that are free of outliers.

In the United States we commonly use feet and inches to measure height, while much of the rest of the world uses the metric system. How does converting data values from one unit of measure to another affect the various measures of center and spread that we have discussed? Let's consider an example.

The following numbers are test scores out of 50 for 8 statistics students.
40, 42, 47, 32, 39, 29, 41, 45
x̄ = _________   M = __________   s = ___________

What if the teacher added 3 points to each test grade as a curve?
x̄ = _________   M = __________   s = ___________

What if the teacher decided to make the test worth 100 points instead?
The new scores would be 80, 84, 94, 64, 78, 58, 82, 90
x̄ = _________   M = __________   s = ___________

A linear transformation changes the original variable x into a new variable x_new by an equation of the form ________________________ where the constant a ____________________________________________________________ while the constant b ____________________________________________________________.

Note:
Adding the same number, a, to each observation ____________________________________________________________
Multiplying each observation by the same number, b, ____________________________________________________________

HW Assignments:
P89 / 1.39, 1.40, 1.42, 1.43
P97 / 1.45, 1.46, 1.50, 1.54
P100 / 1.51, 1.54, 1.55, 1.57, 1.58
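If you would like to check your hand calculations from this section, the following is a minimal Python sketch of the standard deviation steps from Example 15, the shortcut formula, and a linear transformation. Several things here are assumptions made for illustration: the function names are invented, the data are hypothetical (not from the text), and the transformation is written as x_new = a + b·x, with a as the number added and b as the number multiplied, matching the roles of a and b in the note above.

```python
import math
import statistics


def sample_sd(values):
    """Standard deviation from the definition: sum of squared deviations over n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    variance = sum(squared_deviations) / (n - 1)
    return math.sqrt(variance)


def sample_sd_shortcut(values):
    """Shortcut form: sqrt((n * sum of x^2 - (sum of x)^2) / (n * (n - 1)))."""
    n = len(values)
    sum_x = sum(values)
    sum_x_squared = sum(x * x for x in values)
    return math.sqrt((n * sum_x_squared - sum_x ** 2) / (n * (n - 1)))


def transform(values, a, b):
    """Apply the assumed form x_new = a + b * x to every observation."""
    return [a + b * x for x in values]


def describe(values):
    """Return (mean, median, standard deviation) for a list of numbers."""
    return (statistics.mean(values), statistics.median(values), sample_sd(values))


# Hypothetical data, invented for illustration only.
scores = [2, 4, 4, 6, 9]

print(describe(scores))                    # original values
print(describe(transform(scores, 3, 1)))   # add 3 to each value
print(describe(transform(scores, 0, 2)))   # multiply each value by 2
print(sample_sd_shortcut(scores))          # should agree with sample_sd(scores)
```

Replacing scores with the test scores from this section lets you compare the three printed rows with the answers you wrote in the blanks.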
