Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
For categorical variables it is easy to draw the distribution because each category is a natural “pile”. We can use a bar chart or pie chart. For quantitative variables, data is often displayed using either a histogram, stem and leaf plot or dot plot. Dot Plot – Places a dot along an axis for each case in the data Good for small data sets Shows basic facts about the distribution Can be horizontal or vertical Here is a dot plot of the winning times of the Kentucky Derby in each race from 1875 to the 2011 Derby. Basic facts about the distribution: Easily see the fastest and slowest times There are two clusters of points In 1896 the distance of the Derby race was changed from 1.5 miles to the current 1.25 miles How good was the 2012 U.S. women’s soccer team? With players like Abby Wambach, Megan Rapinoe and Hope Solo, the team put on an impressive showing en route to winning the gold medal at the 2012 Olympics in London. Here are the data on the number of goals scored by the team in the 12 months prior to the 2012 Olympics. 1 3 1 14 13 4 3 4 2 5 2 0 4 1 3 4 3 4 2 4 3 1 2 4 2 Create a dot plot to represent the data. Draw a horizontal or vertical axis (a number line) and label it with the variable name Scale the axis. Start by looking at the minimum and the maximum values of the variable. Mark your axis with tick marks Mark a dot above the location on the axis corresponding to each data value. Histogram is useful when working with large sets of data, stemplot is better for smaller data sets Easier to make by hand Give a quick picture of the shape of a distribution while including the actual numerical values in the graph 8 8 means a pulse of 88 bpm Take the tens place of the number and make that the stem, the ones place of the number is the leaf 8 000044 means four pulse rates of 80 and two of 84 Conclusions: All of the numbers are even and all are multiples of 4. Something you could have never seen in a histogram Gives insight into how the data was collected Do you think the nurse counted the pulse for a full minute, or counted for 15 seconds and multiplied by 4? How many pairs of shoes does a typical teenager have? To find out, a group of AP Statistics students conducted a survey. They selected a random sample of 20 female students from their school. Then they recorded the number of pairs of shoes that each respondent reported having. Here are the data: 50 38 15 26 13 51 26 50 31 13 57 34 19 23 24 30 22 49 23 13 Steps: Separate each observation into a stem, and a leaf. Write the stems in a vertical column with the smallest number at the top, and draw a vertical line at the right of this column. Do not skip any stems even if there is no data value for a particular stem. Write each leaf in the row to the right of its stem. Arrange the leaves in increasing order out from the stem Provide a key that explains in context what the stems and leaves represent. The AP Statistics students also collected data from a random sample of 20 male students at their school. Here are the numbers of pairs of shoes reported by each male in the sample: 14 7 6 5 12 38 8 7 10 10 10 11 4 5 22 7 5 10 35 7 What would happen if we tried the same approach as before with creating a stem plot? There are other ways to create your graph so you can see a clearer picture of the shape and distribution. To get more stems and a better picture we can split the stems, so 0 to 4 are placed on one stem and 5 to 9 are placed on another stem. We can also use a back to back stem plot to compare the number of shoes that males and females have. When you describe distribution you should always tell about four key things: Shape Center Spread Outliers Concentrate on the main features. Look for major peaks, not for minor ups and downs. Look for clusters, obvious gaps and potential outliers. Look for symmetry and skewness: A distribution is roughly symmetric if the right and left sides of the graph are approximately mirror images of each other A distribution is skewed to the right if the right side of the graph (the larger values) is much longer than the left side. It is skewed to the left if the left side of the graph is much longer than the right side. **The direction of skewness is the direction of the long tail, not where most observations are clustered! Skewed left Skewed right Unimodal Bimodal Uniform Multimodal Do any unusual features stick out? Always mention any stragglers, called outliers, that stand off away from the body of the distribution Example: If your collecting data on nose lengths and Pinocchio is in the group you would definitely want to mention that. An outlier can me the most informative part of your data, or it might just be an error. Here is a stemplot of the percents of residents aged 65 and older in the 50 states and the District of Columbia. The stems are whole percents and the leaves are tenths of a percent. The low outlier is Alaska. What percent of Alaska residents are 65 or older? Ignoring the outlier, the shaper of the distribution is…? The center of the distribution is close to…? Slice up the possible values into equal-width intervals called bins Histogram displays the bin counts as the height of the bars (like a bar chart) Unlike a bar chart, the bars in a histogram touch one another. An empty space between bars represents a gap in data values If a value falls on the border between two bars, it is placed in the bin on the right. Magnitudes (on the Richter scale) of the 1,318 earthquakes in the NGDC data: Conclusions: What is the interval of each bar? Count the number of bars in between each number and you will see the interval is 0.2 The tallest bar says there were about 200 earthquakes between 7.0 and 7.2 Earthquakes typically have magnitudes around 7, most are between 5.5 and 8.5 some are as small as 4 or as large as 9 Ones in Japan and Sumatra were some of the biggest ever recorded. Replace the counts on the vertical axis with percentage of total number of cases. Don’t confuse histograms with bar graphs. Histograms display quantitative variables. The horizontal axis is marked in the units of measurement of the variable A bar graph is used to display categorical variables. The horizontal axis identifies the categories being compared Outlier – there are three cities in the leftmost bar Are there any gaps in the distribution? Gaps help us see multiple modes and encourage us to notice when the data may come from different sources or contain more than one group. A credit card company wants to see how much customers in a particular segment of their market use their credit card. They have provided you with data on the amount spent by 500 selected customers during a 3month period and have asked you to summarize the expenditures . Of course, you begin by making a histogram. Question: Describe the shape of this distribution. Answer: Shape: The distribution is unimodal Center: It is skewed to the right, the high end of monthly spending Spread: There is an extraordinarily large value at about $7,000, and some values are negative. Let’s return to the tsunami earthquakes. Let’s look at just 25 years of data. 207 earthquakes Occurred from 1987 to 2011 The data is symmetric so it is easy for us to see the center Median: the middle value that divides the histogram into two equal areas The median has the same units as the data, be sure to include the units whenever you talk about median. For this example, there are 207 earthquakes, so the median is found at 207+1 2 = 104𝑡ℎ place in the sorted data. The median earthquake magnitude is 7.2 Finding the median of a batch of n numbers: Order the values If n is odd, the median is the middle value Mathematically we find this position using If n is even, there are two middle values The median is the average of the two values The positions are found by 𝑛 2 𝑛 𝑎𝑛𝑑 2 +1 𝑛+1 2 Suppose the data has these values: 14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8 First, order the values: -17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8 Since n is 7 (odd), the median is (7+1)/2 = 4th value, which counting from either side is 13.9 Notice there are 3 values lower and 3 values higher. Suppose the data has the same values, but add 35.7: 14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8, 35.7 First, order the values: -17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 35.7, 45.8 8 = 4th 2 13.9+14.1 2 Now n=8 (even) so the median is the average of place and 14.0 8 2 + 1 = 5th place. So the median is Four data values are lower and four are higher = The median is one way to find the center of the data, but there are many other ways. Knowing the median, we can say that a typical tsunamicausing earthquake was about 7.2 on the Richter scale. How well does the median describe the data? Whenever we find the center of the data, the next step is always to ask how well it actually summarizes the data. Medians are a good measure of the center when the data is skewed or there are outliers Because the median considers only the order of the values it is resistant to values that are extraordinarily large or small as it ignores their distance from the center When we have symmetric data an alternative (better) measure is to use the mean When the data is symmetric the mean and median will be close Why not just always use the median? The median can sometimes be too resistant and can be unaffected by changes in many data values, but the mean considers every data point and gives them each equal weight. When the data is unimodal and roughly symmetric you should use the mean If you are unsure which is the better option, state them both and explain why they differ. ⅀ means “sum” and is pronounced sigma 𝑇𝑜𝑡𝑎𝑙 𝑦 𝑦= = 𝑛 𝑛 Number of values Sum of values You want to summarize the expenditures of 500 credit card company customers. Question: You have found the mean expenditure to be $478.19 and the median to be $216.28. Which is the more appropriate measure of center and why? Answer: Because the distribution is skewed, the median is the more appropriate measure of center. Unlike the mean, it’s not affected by the large outlying value or by the skewness. Even without making a histogram, we can expect some variables to be skewed. When values of a quantitative variable are bounded on one side but not the other, the distribution may be skewed. Examples: incomes and waiting times can’t be less than zero, so they are often skewed to the right. Amounts of things (dollars, employees) are also often skewed to the right for the same reason. Combinations of things are also often skewed For example, a histogram showing the number of cancelled flights in a month. More flights are likely to be cancelled in January (due to snowstorms) and August (thunderstorms). Combining values across months leads to a skewed distribution. The more the data vary, the less the median alone can tell us. We need to measure how the data values vary around the center, how spread out they are. The range of the data is the difference between the maximum and minimum values: Range = max – min Range is a single number, not an interval of values The range has the disadvantage that a single extreme value can make it very large. What is the range of the earthquake data? The maximum magnitude of these earthquakes is 9.1 and the minimum is 3.7 so the range is 9.1 – 3.7 = 5.4 A better way to describe the spread of a variable might be to ignore the extremes and concentrate on the middle of the data. We can split the data into 4 quartiles. The lower and upper quartiles are also known as the 25th and 75th percentiles of the data and the median is the 50th percentile. To find the quartiles: Divide the data in half at the median Divide both halves in half again, which will cut the data into four quarters One quarter of the data lies below the lower quartile One quarter of the data lies above the upper quartile Interquartile Range: The difference between the quartiles that tells us how much territory the middle half of the data covers: IQR = Upper – Lower Quartile When n is odd, you can either include the median in both halves, or omit it from either half Example data from the median: n is odd -17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 45.8 Example data from the median: n is even -17.5, 2.8, 3.2, 13.9, 14.1, 25.3, 37.5, 45.8 The IQR is almost always a reasonable summary of the spread of a distribution. Even if the distribution itself is skewed or has some outliers, the IQR should provide useful information. The one exception to this is when the data is strongly bimodal, such as the Kentucky Derby data. The five number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum) The 5 – number summary for the earthquake data would look like this Max 9.1 Q3 7.6 Median 7.2 Q1 6.7 Min 3.7 It is good practice to report the number of data values and the identity of the cases. Here there are 207 earthquakes. Conclusions from the 5-number summary: Provides a good overview of the Max 9.1 distribution of magnitudes of these tsunami-causing earthquakes. The median magnitude is 7.2. The IQR is 7.6 – 6.7 = 0.9 Q3 7.6 Median 7.2 Q1 6.7 Min 3.7 Because this is small we see that many quakes are close to the median magnitude 25% of the earthquakes had a magnitude above 7.6, although one tsunami was cause by a quake measuring only 3.6 on the Richter scale. Once we have a 5-number summary of a quantitative variable, we can display that information in a boxplot Step 1: Draw a single axis spanning the extent of the data Step 2: Draw short lines to mark the lower, median, and upper quartiles. Connect the lines to form a box Step 3: Construct “fences” around the main part of the data. The upper fence is 1.5 IQR above the upper quartile and the lower fence is 1.5 IQR below the lower quartile Upper fence = Q3 + 1.5 IQR Lower fence = Q1 + 1.5 IQR The fences are just for construction they are not part of the display. We use the fences to grow the “whiskers” Step 4: Draw lines from the ends of the box up and down to the most extreme data values found within the fences. If data value falls outside one of the fences, we do not connect it with a whisker. Step 5: Place a dot to display any outliers beyond the fences A box plot highlights several features of the distribution. The central box shows the middle 50% of the data, between the quartiles. The height of the box is equal to the IQR. If the median is roughly centered between the quartiles, then the middle half of the data is roughly symmetric If the median is not centered then the distribution is skewed. The whiskers show skewness as well if they are not roughly the same length. For the recent tsunami earthquake data, the central box contains all earthquakes whose magnitudes are between 6.7 and 7.6 on the Richter scale. From the shape of the box, it looks like the central part of the distribution of earthquakes is roughly symmetric At the low end the longer whisker and outliers indicate that the distribution stretches out slightly to the left We also see the two large quakes that we have discussed State the 5-number summary and create a box plot for the following data set: 1, 1, 2, 3, 4, 5, 6, 6,15 In Algebra you used letters to represent values in a problem and it did not matter what letter you chose. You could call the width of a rectangle x, or you could use w or any other letter you wanted. In Statistics, the notation is part of the vocabulary Example, n is always the number of data values. ALWAYS Here is a new one: Whenever there is a bar over the symbol, it means “find the mean (average)” IQR is always a reasonable summary of spread, but because it only uses two quartiles it ignores much of the information about how individual values vary. A more powerful approach uses the standard deviation, which takes into account how far each value is from the mean. Like the mean, standard deviation is most appropriate for symmetric data. One way to think about spread is to determine how far each data value is from the mean. The difference between a value and the mean is called deviation We could average these deviations, but the positives would cancel with the negatives and we would always get zero To keep them from cancelling out we square each deviation, resulting in all positive values. When we average the squared deviations the result is called the variance Sum Variance Difference of value and mean (𝑦 − 𝑦) 𝑠2 = 𝑛−1 n – 1 instead of just n The problem with using the variance as a measure of spread is the units. We want the units to match the data, but the units of the variance are squared. For example, squared dollars or mpg2 To get back to the original units we take the square root of the variance. This is called the standard deviation Standard deviation is a very important concept to understand and will be used throughout the course. Sum Difference of value and mean Standard Deviation 𝑠= (𝑦 − 𝑦) 𝑛−1 n – 1 instead of just n To find the standard deviation: Step 1 – Find the mean Step 2 – Subtract each data value from the mean Step 3 – Square all the differences Step 4 – Add all squares from step 3 Step 5 – Divide the sum by n – 1 This gives the variance Step 6 – Take the square root Suppose the batch of values is 14, 13, 20, 22, 18, 19 and 13. Find the standard deviation. Consider the following histogram representing resting pulse rates of adults. The distribution is roughly symmetric so we will use the mean and standard deviation The mean pulse rate is 72.7 bpm. We can see that some heart rates are higher and some are lower, but how much? The standard deviation of 6.5 bpm indicates that on average we can expect people’s heart rate to differ from the mean by about 6.5 bpm Measures of spread tell how well other summaries describe the data. ALWAYS report a spread along with any summary of the center. If the data is skewed or has outliers: Median and IQR are your best choices for summaries If the data is roughly symmetric: Mean and standard deviation are your best options