
Continuous Data

The Median

The median is the 50th percentile. The idea is that the median splits the data set in half. The median is a commonly used single-value measure of the center of a distribution.

To find the median, first sort the data. The minimum value is in position 1, the maximum value in position n. The median is the value in position ½(1 + n) (in other words: the value in the middle position). If this position is not a whole number, the median is obtained by averaging the values in the two positions on either side of position ½(1 + n).

Example 1

At right you see the times until failure of 28 industrial machines (in hours). The data must be sorted low to high (this is the part that takes time when doing problems by hand). Positions are shown to the right of the sorted list.

[Figure 1: histogram of # of Machines against Failure Time (hours). Mean = 235.76 hr; SD = 34.19 hr.]

The mean and standard deviation are generally reported with one more decimal place of accuracy than are data values. Under-rounding them (too much precision) is fine. Over-rounding them is not fine. The mode is around 210 hours.

There are any number of ways to sort the data. (You generally do not want to do it by hand!) You can use Excel’s autofill function to quickly obtain a position for each value in the sorted list.

Since n = 28, the median is found in position ½(1 + 28) = 14.5. The values in positions 14 and 15 are 228.5 and 236.1 respectively. So the median is (228.5 + 236.1)/2 = 232.3 hours.

The median, like the mean and mode, has the same measurement units as the data. It is appropriate to round the median to at least the same accuracy as the data.

Percentiles

What if we want to split the data some other way? For instance, we want to offer scholarship money to the top 30% of students based on GPA. What GPA cuts the top 30% from the bottom 70%? This GPA is the 70th percentile.
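Before going further with percentiles, here is a minimal Python sketch of the position rule for the median (the 50th percentile) from the section above: sort, then take position ½(1 + n), averaging the two neighbours when that position is not a whole number. The function name median_by_position is made up for this illustration; the list is the Example 1 failure-time data as given in the sorted list later in this document.

```python
def median_by_position(values):
    """Median via the 1-based position rule: position = (1 + n) / 2."""
    data = sorted(values)
    n = len(data)
    pos = (1 + n) / 2      # e.g. n = 28 gives position 14.5
    k = int(pos)           # whole-number part of the position
    if pos == k:           # odd n: a single middle value
        return data[k - 1]  # 1-based position -> 0-based index
    # even n: average the values in positions k and k + 1
    return (data[k - 1] + data[k]) / 2


# Example 1 failure times (hours), copied from the handout's data list.
failure_times = [
    189.1, 193.2, 200.4, 202.6, 202.7, 206.7, 211.4,
    211.6, 212.1, 215.7, 216.6, 218.7, 219.3, 228.5,
    236.1, 236.5, 237.0, 239.7, 244.3, 251.5, 253.3,
    257.0, 259.9, 265.2, 267.4, 287.7, 308.0, 329.0,
]

print(round(median_by_position(failure_times), 1))  # 232.3
```

Positions 14 and 15 hold 228.5 and 236.1, so the sketch agrees with the hand calculation.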
“3.257 is the 70th percentile.” ⇔ “The percentile rank of 3.257 is 70.”

These two statements are equivalent. Keeping in mind that the units of observation are students and the variable is GPA: 70% of students have GPA below 3.257; the remaining 30% of students have GPA above 3.257.

Working interpretation of percentile. Identify the units and variable, then use the appropriate descriptions to replace [units] and [variable] in this statement:

“(Approximately) k% of [units] have [variable] below x; the remaining (100 – k)% of [units] have [variable] above x.” [1]

You do not need to use the word “approximately” in your statements.

The following are equivalent:
“the kth percentile of the data is x”
“the percentile rank of x is k”

Consider our illustration with GPAs. Suppose we learn (by looking at the data) that the GPA that cuts off the bottom 70% from the top 30% is 3.257. Then 3.257 is the 70th percentile. The percentile rank of 3.257 is 70.

It is not correct to say “Out of 100 graduating seniors, 70 have GPA below 3.257; the other 30 have GPA above 3.257.” First of all, there aren’t exactly 100 graduating seniors. And secondly, if you chose 100, you would be unlikely to get exactly a 70/30 split. (When you look at all students you get a 70%/30% split.) A statement that references 100 units is only true on average – assuming you averaged over all possible samples [2] of 100 students. Expressing this is more difficult and confusing, so just say it the correct way: “70% of graduating seniors have GPA below 3.257; the other 30% have GPA above 3.257.”

We pretty much ignore students with GPA exactly 3.257. This is no big deal, because very few have precisely this GPA. Fewer than you think, actually, because the 70th percentile is really 3.25782132… (it’s rounded reasonably for ease of display and reading) and virtually no one has this GPA.
Percentiles are uniquely suited to continuous data – when there are at most a few units for which the values of the variable are exactly the same. (Other ways of saying this: If you randomly choose two units, chances are small that they tie; there are few duplicates/multiples in the data.)

A good deal of what remains of this document concerns itself with how to obtain percentiles from a data set. How to compute them is the minor issue: How to interpret them is what’s important.

[1] There is no flexibility here. This – or something immediately and obviously equivalent – is the proper statement. It is not acceptable to use the terms “variable” and “unit” – replace them with precise descriptions of the variable and unit for the situation at hand.

[2] In statistics a sample is a collection of units drawn from the collection of all units. The point here is technical (and is addressed in a later part of your course): Different samples yield different results. If you look at results for all conceivable samples, and average them, you get the result for the entire set of data.

Computing Percentiles and Percentile Ranks

We will use spreadsheets to work with percentiles.

Finding Percentile Ranks

Each data value can be assigned a percentile rank. Use

=PERCENTRANK(array, x, 9)

• array is the location of the sorted data; it can be selected with the mouse.
• x is a data value, which must be between the minimum and maximum.
• 9 is connected to rounding precision, and is sufficiently large to always work. Don’t use a number < 9 in this slot; don’t leave out the 9.

Example 1

Consider the value of 216.6. This is one of the data values. What is the percentile rank of 216.6 hours?

Notice that the sorted data occupy cells A2 through A29. In spreadsheets this is written A2:A29 and is called an array. (You do not need to capitalize the As, nor the function name.)

Double-click into any empty cell.
Then start typing

=PERCENTRANK(A2:A29, 216.6, 9)

As you start to type the function, the program will help you complete it, and will also suggest how you should organize information about the data array and data value you are inputting. (In Excel you will replace [significance] with 9; Google spreadsheets will not cue you about the “significance,” but it still accepts a value.)

You do not have to type A2:A29 – you can use the mouse to select the array. When you do this:
• In Excel, A2:A29 and the outline of the selected array will be shown in blue.
• In Google spreadsheets, the array will be shaded.

At right see what it looks like in an Excel sheet. The formula is entered into cell C4. You can see the formula in any cell as you enter it (the formula also appears in the “formula bar” following the fx).

When you’ve typed the entire formula, evaluate it with the [Enter] / [Return] key. Cell C4 now shows a result: 0.3704.

0.3704 is the same as 37.04%. A failure time of 216.6 hours has a percentile rank of 37.04. The 37.04 percentile of this distribution is 216.6 hours.

An interpretation reads like this (the units are machines; the variable is failure time): 37.04% of machines have a failure time less than 216.6 hours; the remaining 62.96% do not fail until after 216.6 hours.

Look at the data: Of the 27 failure times other than the 216.6, 10 are less than 216.6: 10/27 = 37.04%. And that is how it works!

Aside: Rounding

In this class, our convention will be to round percents to the nearest 0.01%. This is overkill in many cases – but it is far better to round too little than to round too much. Here’s a guide:

Percent                                              Maximum rounding
Between 10% and 90%                                  nearest 1%
Between 1% and 10% or between 90% and 99%            nearest 0.1%
Between 0.1% and 1.0% or between 99.0% and 99.9%     nearest 0.01%
Etc.                                                 Etc.

It’s always better to under-round, rather than to over-round.

Example 1

What are the percentile ranks for 211.4 hours and 211.6 hours?
=PERCENTRANK(A2:A29,211.4,9) → 0.2222
The percentile rank of 211.4 hours is 22.22; 211.4 is the 22.22 percentile.

=PERCENTRANK(A2:A29,211.6,9) → 0.2593
The percentile rank of 211.6 hours is 25.93; 211.6 is the 25.93 percentile.

You can also find a percentile rank for a value not in the list – as long as it doesn’t fall below the minimum or above the maximum. Let’s find the percentile rank for 211.5:

=PERCENTRANK(A2:A29,211.5,9) → 0.2407
The percentile rank of 211.5 hours is 24.07; 211.5 is the 24.07 percentile.

This makes sense: 211.5 is halfway between 211.4 and 211.6, and the associated percent (24.07) is halfway between 22.22 and 25.93: (22.222 + 25.926)/2 = 24.074. Percentile ranks for values not in the list are linearly interpolated from those in the list.

Find the percentile ranks for 308.0, 309.0 and 329.0.

=PERCENTRANK(A2:A29,308.0,9) → 0.9630
=PERCENTRANK(A2:A29,309.0,9) → 0.9647
=PERCENTRANK(A2:A29,329.0,9) → 1.0000

308.0 hours is the 96.30 percentile; 329.0 hours is the 100.00 percentile – the maximum. These two values (308.0 and 329.0) are both in the data set. 309.0 hours is not in the data set. But 309.0 is between 308.0 and 329.0, and so its percentile rank is between 96.30% and 100.00%. Since 309.0 is much closer to 308.0, its percentile rank is very close to 96.30.

318.5 hours is halfway between 308.0 and 329.0. Its percentile rank is 98.15, which is halfway between 96.30 and 100.00.

One more: What is the percentile rank of the mean?

Examining Figure 1, see that the mean is 235.8. We need the percentile rank of 235.8.

=PERCENTRANK(A2:A29,235.757,9) → 0.5168

The mean (235.8) is the 51.68th percentile. A more gnarly way to write this, without even having to see the mean, is like this:

=PERCENTRANK(A2:A29,AVERAGE(A2:A29),9) → 0.5168

Skew and percentiles

It’s often the case that the discrepancy between median and mean hints at the shape of the distribution. (A histogram supplies information too.)
• For a fairly symmetric distribution, the mean and median will be quite close. (If the distribution is exactly symmetric – which is quite rare for real data sets – the two will be identical.) The percentile rank of the mean will be quite close to 50.
• For a right skewed distribution, the mean will fall to the right of (above) the median; the percentile rank of the mean will be above 50.
• For a left skewed distribution, the mean will fall to the left of (below) the median; the percentile rank of the mean will be below 50.

In Example 1 the mean is above the median – although not by that much. (An effective visual comparison is to mark both under the horizontal axis of the histogram. They are rather close.) This suggests a distribution with a little bit of right skew. The histogram bears this out.

Finding Percentiles

We now discuss what to do when the percentile rank (a percent from 0% – 100%) is given. To do this job use PERCENTILE. Again you specify the array that is the data. The second input is a value on the relative frequency scale. Spreadsheets input and output relative frequency, not percent. So be careful when feeding percentages into PERCENTILE. You must either first divide by 100 (see the examples below), or type the % sign after the percent.

To find the kth percentile, where 0 < k < 100: Double-click into any empty cell. Then either

=PERCENTILE(array, k/100)
OR
=PERCENTILE(array, k%)

(You can have Excel do the division by 100; you may also do it yourself by simply moving the decimal point two places left. The value that is input in the second position must be between 0 and 1.)

Example 1

Find the 37.04 percentile.

k = 37.04% = 0.3704 and you have two very similar ways to input this.

=PERCENTILE(A2:A29, 37.04/100)
OR
=PERCENTILE(A2:A29, 37.04%)

Then after [Enter] or [Return] you’ll see 216.60168. The output of this function should be rounded to the precision with which data are recorded. Here that is to the nearest 0.1 hour. The 37.04 percentile is 216.6.

You can round this yourself.
Or use the spreadsheet’s rounding options to display the proper precision:
• In Excel, use the decimal-places toolbar button, or Format Cells, (Category) Number, and set decimal places. (When you have selected a cell or block of cells, Format Cells is available off the right click. You can also type Alt o e in Windows and Command 1 in OS X.)
• In Google spreadsheets, the 123 button, or Format, Number, will allow you to format the selected cells.

So 216.6 is the 37.04 percentile. Of course we already knew this. (See above where we computed the percentile rank for 216.6 hours.)

Aside: Inverses

We’ve discussed two functions in spreadsheets that deal with percentiles and percentile ranks.

PERCENTRANK: Given a data value x, this gives the appropriate percentile rank k%.
PERCENTILE: Given a percentile rank k%, this gives the appropriate data value x.

These two functions are inverses of one another. [3]

=PERCENTRANK(A2:A29, 216.6, 9) → 0.3704
=PERCENTILE(A2:A29, 0.3704) → 216.6

Example 1

Find the median. The median is (by definition) the 50th percentile.

=PERCENTILE(A2:A29, 0.5) → 232.3

The median is 232.3 hours.

Find the 75th percentile. (The 75th percentile is also called the 3rd quartile or upper quartile.)

=PERCENTILE(A2:A29, 0.75) → 254.2

[3] Sometimes it appears these are not exactly inverses. This is because of rounding. Notice that our input into PERCENTILE is 0.3704. However, 216.6 gives a percentile rank of 0.37037037 when displayed with more precision. As you might guess, this decimal is repeating: 0.37037037037037037037… which is 10/27 (see the discussion above, where 10/27 is explicitly discussed relative to 216.6). If one takes advantage of this exact expression then PERCENTILE(A2:A29, 10/27) → 216.6. This demonstrates that technically these are exactly inverses.

Here’s a good place to stop and check that things make sense. Here’s our list of data, arranged horizontally for easier viewing. We have 254.2 as the 75th percentile.
189.1 193.2 200.4 202.6 202.7 206.7 211.4
211.6 212.1 215.7 216.6 218.7 219.3 228.5
236.1 236.5 237.0 239.7 244.3 251.5 253.3
257.0 259.9 265.2 267.4 287.7 308.0 329.0

Notice that of the 28 failure times, exactly 21 are below 254.2: 21 / 28 = 0.75 = 75%. This makes good sense. Again, here’s what we say:

“Approximately 75% of the machines have failure time below 254.2 hr; the other 25% of machines have failure time above 254.2 hr.”

(Here the units are the machines and the variable is failure time.)

To determine the 90th percentile:

=PERCENTILE(A2:A29, 0.90) → 273.5

The 90th percentile of failure times is 273.5 hours.

“Approximately 90% of the machines have failure time below 273.5 hr; the other 10% have failure time above 273.5 hr.”

Actually, the percent of data values below 273.5 is more precisely 89.3% (25 of the 28). This discrepancy occurs because there is no way to get exactly 90% of the data below any value in a list of 28 values: 90% of 28 is 25.2. These functions generally yield such small discrepancies. It’s unavoidable. (To have precise matching for all percents from 0.1% to 99.9% would require a data set with size an exact multiple of 1001.) It is not worth discussing these issues: They are minor. Simply learn how to compute percentiles and percentile ranks, and (especially) learn how to interpret them.

Our Standard for Percentiles

There are other ways of defining percentiles and percentile ranks. All reduce to the same thing for the median. Elsewhere the differences are minor, and an interpretation is the same no matter which method is used. (Your instructor has read a 50-page paper examining nothing more than the many different ways of determining only the first and third quartiles – the 25th and 75th percentiles. While the paper had its interesting technical points, for the most part it was quite dull.)

There are other standards for obtaining percentiles and percentile ranks – this is the one that you and your class are adopting. All the methods produce virtually identical values with large data sets.
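For readers who want to see the standard used in this document outside a spreadsheet, here is a minimal Python sketch. It assumes the spreadsheet functions place the pth percentile at 0-based position p(n − 1) in the sorted list and interpolate linearly between neighbours; that assumption reproduces every value quoted above, but it is an illustration, not a statement of how any particular spreadsheet is implemented. The names percentile_inc and percent_rank_inc are made up for this sketch.

```python
from bisect import bisect_left


def percentile_inc(sorted_data, p):
    """Value at relative frequency p (0 <= p <= 1), interpolating linearly.

    Assumed rule: 0-based position = p * (n - 1).
    """
    n = len(sorted_data)
    pos = p * (n - 1)
    i = int(pos)                      # index of the value just below
    if i >= n - 1:                    # p = 1 lands on the maximum
        return sorted_data[-1]
    frac = pos - i                    # how far toward the next value
    return sorted_data[i] + frac * (sorted_data[i + 1] - sorted_data[i])


def percent_rank_inc(sorted_data, x):
    """Relative-frequency rank of x, the inverse of percentile_inc."""
    n = len(sorted_data)
    if not sorted_data[0] <= x <= sorted_data[-1]:
        raise ValueError("x must lie between the minimum and maximum")
    i = bisect_left(sorted_data, x)   # first index with value >= x
    if sorted_data[i] == x:           # x is in the list
        return i / (n - 1)
    lo, hi = sorted_data[i - 1], sorted_data[i]
    # x falls between two listed values: interpolate their ranks
    return (i - 1 + (x - lo) / (hi - lo)) / (n - 1)


# Example 1 failure times (hours), already sorted.
failure_times = [
    189.1, 193.2, 200.4, 202.6, 202.7, 206.7, 211.4,
    211.6, 212.1, 215.7, 216.6, 218.7, 219.3, 228.5,
    236.1, 236.5, 237.0, 239.7, 244.3, 251.5, 253.3,
    257.0, 259.9, 265.2, 267.4, 287.7, 308.0, 329.0,
]

print(round(percentile_inc(failure_times, 0.75), 1))     # 254.2
print(round(percentile_inc(failure_times, 0.90), 1))     # 273.5
print(round(percent_rank_inc(failure_times, 216.6), 4))  # 0.3704
print(round(percent_rank_inc(failure_times, 211.5), 4))  # 0.2407
```

Feeding 10/27 into percentile_inc returns 216.6 (up to floating-point error), which is the inverse relationship described in footnote 3.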
For small data sets there are some differences, but they are not important – especially relative to the uncertainty resulting from the lack of information inherent to a small data set. [4] Learn to use this standard. Learn to interpret percentiles.

[4] For the failure time data, Minitab gives 211.5 as the 25th percentile (we have 211.6) and 256.1 as the 75th percentile (we have 254.2). These are very minor differences relative to a) the small size of the data set and b) the large amount of variability in this data.

Frequently Used Percentiles

Some sets of percentiles are commonly reported:

Quartiles (used to obtain a boxplot):
• 25th percentile: “lower quartile” or “first quartile”: Q1
• 50th percentile: “median” or “second quartile”: M (occasionally Q2)
• 75th percentile: “upper quartile” or “third quartile”: Q3

Quintiles: 20th, 40th, 60th, 80th

Deciles: 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, 90th

Boxplots

The 5 number summary

The five-number summary consists of the minimum, quartiles, and maximum value. To display this summary, place the values in {curly brackets}, listed from low (minimum) to high (maximum), with commas separating values.

Example 1

The minimum is 189.1; the first quartile (25th percentile) is 211.6; the median is 232.3; the third quartile is 254.2; the maximum is 329.0. The five number summary is written as follows:

{189.1, 211.6, 232.3, 254.2, 329.0}

It is appropriate to display each value with the same accuracy – here to the nearest 0.1, which is the accuracy of the data. Notice how the 329.0 is displayed. This is: a) for consistency, and b) because the data are measured to the nearest 0.1 – the value itself gives an idea of how precise the measurement scale is.

Interquartile Range

The interquartile range (IQR) is the distance between the third and first quartiles:

IQR = Q3 – Q1

Example 1

IQR = Q3 – Q1 = 42.7. (This uses the unrounded quartiles: 254.225 – 211.55 = 42.675. Subtracting the rounded values, 254.2 – 211.6, gives 42.6.)
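The five-number summary and IQR for Example 1 can be reproduced with Python's standard library. This is a sketch under one assumption: that the "inclusive" method of statistics.quantiles matches the spreadsheet standard used here (it does reproduce the quartiles quoted above). The "exclusive" method is shown only for contrast; it happens to give values that round to the Minitab figures in footnote 4, but whether Minitab uses exactly that rule is not something this document establishes.

```python
from statistics import quantiles

# Example 1 failure times (hours).
failure_times = [
    189.1, 193.2, 200.4, 202.6, 202.7, 206.7, 211.4,
    211.6, 212.1, 215.7, 216.6, 218.7, 219.3, 228.5,
    236.1, 236.5, 237.0, 239.7, 244.3, 251.5, 253.3,
    257.0, 259.9, 265.2, 267.4, 287.7, 308.0, 329.0,
]

# Quartiles under the "inclusive" rule (minimum = 0th percentile,
# maximum = 100th percentile), assumed to match the spreadsheet standard.
q1, median, q3 = quantiles(failure_times, n=4, method="inclusive")

five_number = (min(failure_times), q1, median, q3, max(failure_times))
iqr = q3 - q1

print("Five-number summary (unrounded):", five_number)
print("IQR (unrounded):", iqr)          # about 42.675

# A different standard, for comparison with footnote 4.
q1_alt, _, q3_alt = quantiles(failure_times, n=4, method="exclusive")
print("Alternative quartiles:", q1_alt, q3_alt)  # about 211.45 and 256.075
```

Working from the unrounded quartiles is why the IQR comes out as 42.7 rather than the 42.6 obtained by subtracting the rounded values.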
The IQR is a measure of the variability in a data set: It tells you the length of the interval that includes the middle 50% of all the data. Generally speaking, data sets with larger IQR have more variability.

Rule of Thumb

The standard deviation is often around ¾ of the IQR.

For Example 1, the ratio of standard deviation to IQR is 34.19 / 42.7 = 0.801, which is reasonably close to 0.75.

There are occasions when this rule of thumb comes nowhere close to truth. This usually happens when the corresponding distribution has a huge amount of skew, or outliers, or some other unusual feature.

The simple boxplot

The boxplot is a graphical display of the five number summary. A scale extending from below the minimum to above the maximum is drawn. A box is placed, with edges located at the first and third quartiles. A line is drawn through the box at the location of the median. Then lines – called “whiskers” – are drawn from the first quartile to the minimum, and from the third quartile to the maximum.

Figure 2

When you ask statistical software to construct a boxplot it will produce a modified boxplot. [5] The modified boxplot is constructed using a mathematical rule for identifying outliers – values that are extreme relative to the bulk of the distribution.

Outliers in boxplots

Here is the rule:
• Any value more than 1.5 × IQR below Q1 is an outlier.
• Any value more than 1.5 × IQR above Q3 is an outlier.

For Example 1, IQR = 42.7, so 1.5 × IQR = 64.0.

Any value more than 64.0 below Q1 is an outlier. Q1 – 64.0 = 147.5 (from the unrounded values, 211.55 – 64.01 = 147.54). There are no values below 147.5.

Any value more than 64.0 above Q3 is an outlier. 254.2 + 64.0 = 318.2. There is one value above 318.2 – it is 329.0.

Modified Boxplot

Plot the box as in a simple boxplot. But, extend the whiskers only as far as the most extreme observations that are not considered outliers. Then use special symbols to plot the outliers.
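The outlier rule and the whisker endpoints of a modified boxplot can be sketched in a few lines of Python. The function name boxplot_outliers is made up for this illustration, and the quartiles come from the "inclusive" method of statistics.quantiles, which is assumed here to match the spreadsheet standard used in this document.

```python
from statistics import quantiles


def boxplot_outliers(values):
    """Apply the 1.5 x IQR rule.

    Returns (low_fence, high_fence, outliers, whisker_ends), where
    whisker_ends are the most extreme values that are not outliers.
    """
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    step = 1.5 * (q3 - q1)
    low_fence, high_fence = q1 - step, q3 + step
    outliers = [v for v in data if v < low_fence or v > high_fence]
    inside = [v for v in data if low_fence <= v <= high_fence]
    return low_fence, high_fence, outliers, (inside[0], inside[-1])


# Example 1 failure times (hours).
failure_times = [
    189.1, 193.2, 200.4, 202.6, 202.7, 206.7, 211.4,
    211.6, 212.1, 215.7, 216.6, 218.7, 219.3, 228.5,
    236.1, 236.5, 237.0, 239.7, 244.3, 251.5, 253.3,
    257.0, 259.9, 265.2, 267.4, 287.7, 308.0, 329.0,
]

low, high, outliers, whiskers = boxplot_outliers(failure_times)
print(round(low, 1), round(high, 1))  # 147.5 318.2
print(outliers)                       # [329.0]
print(whiskers)                       # (189.1, 308.0)
```

The only value beyond a fence is 329.0, so the upper whisker of the modified boxplot stops at 308.0 and the lower whisker still reaches the minimum, 189.1.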
Example 1

Since 329.0 is considered an outlier, the whisker on the high side is drawn to 308.0 – which is not considered an outlier. 329.0 is plotted on its own.

[5] If you know the software reasonably well, you can get it to construct a “regular” (unmodified) boxplot such as in Figure 2.

Figure 3

From the boxplot you get a fairly accurate gauge of the five number summary. From this you can quickly deduce some other values: The range and the interquartile range (which quantify how spread out the data are), as well as the median (which quantifies the center). It also helps identify extreme values – values that are somewhat unusual, and perhaps require some looking into.

You can also “guess” the standard deviation in two ways:
• Range / k, where k is something between 4 and 6 (our textbook uses 4)
• 0.75 × IQR

Exercises

The data for the exercises (as well as the examples above) can be accessed from the instructor’s shared folder (Continuous Data Sets). The different data sets are placed within different tabs in the spreadsheet. (Tabs are accessed at the bottom of the window. See the instructor if this is confusing you. It’s easy, but not until you “get it.”)

1. In discussing investment opportunities, a financial advisor speaks about a company’s “price to earnings” ratio (PE) – the price of a share of stock divided by the amount of profit the company makes annually (i.e., how much it costs to purchase $1 of annual profit). A stock market analyst says “For the ECC Company, its PE of 7.3 is at the 15th percentile among companies in the industrial sector.”
a) What is the percentile rank for a PE of 7.3?
b) Write a sentence explaining what this means, without using the word “percentile.” Your statement must identify the units and variable. You may use the word “percent,” and you must use the numbers 7.3 and 15.

2. For girls 1 year of age, the 5th percentile of weights is 17.5 pounds.
a) Write a sentence interpreting this. (Speak in terms of units / variables.)
b) For girls 6 months of age, how does the 5th percentile compare to 17.5 pounds? Is it larger, smaller, or the same as 17.5 pounds? Why?

3. The 90th percentile of CEO salaries is $5.052 million.
a) What percent of CEOs make more than this amount? Less? What percent (to the nearest 1%) make exactly this amount?
b) How does the 80th percentile of CEO salaries compare to $5.052 million?

4. For SUNY Oswego students, the 65th percentile on the combined SAT score is 1250. Write a sentence interpreting this. (Speak in terms of units and variable.)

5. (GPA tab) Here are the GPAs of 8 math majors (sorted).

1.98 2.10 2.58 2.69 2.94 3.05 3.65 3.83

a) Determine the percentile rank for GPAs of 3.05, 3.35 and 3.65.
b) Determine values for Q1, Q2 and Q3, the 1st, 2nd and 3rd quartiles, respectively.
c) Obtain values for the Range and interquartile range (IQR).
d) Determine values for the mean and standard deviation. Compare the standard deviation to both Range/4 and 0.75 IQR (both these expressions will “guess” the standard deviation).
e) State the 5 number summary for this data set.

6. (Patient Waiting Time tab) The data are the amounts of time patients waited in an emergency room at a local hospital prior to seeing a doctor (in minutes).
a) What are the units? What is the variable?
b) Obtain a histogram. Identify the shape of this histogram. Are there any outliers?
c) Determine values for Q1, Q2 and Q3, the 1st, 2nd and 3rd quartiles, respectively.
d) What is the percentile rank for a waiting time of 2 hours? How about 75 minutes?
e) Make sure you can interpret all your results to c and d.
f) Here are two questions you just answered: In part c) “Find Q3 – the 75th percentile.” In part d) “What is the percentile rank for a waiting time of 75 minutes?” Explain why the answers are different, even though the number (75) is the same.
g) Determine values for the mean and standard deviation.
Compare the standard deviation to both Range/4 and 0.75 IQR (both these expressions will “guess” the standard deviation).

7. You have a data set where the variable is the waist measurements for a random sample of 236 men. The data are located in Excel cells A2 through A237.
a) What do you type in an empty cell in order to obtain the 35th percentile?
b) What do you type in an empty cell in order to obtain the percentile rank of a waist measurement of 35 inches?

8. A couple is researching the cost of completing an international adoption. This cost varies from adoption to adoption. From a U.S. government source they learn that the 20th percentile of costs is $19,312. Identify the variable and statistical units. Then: Which of the following properly explains this to someone unfamiliar with the term “percentile?”
• 20% of the cost of an adoption is less than $19,312.
• An adoption costs 20% of $19,312.
• 20% of those who adopt pay exactly $19,312 to do so.
• 20% of those who adopt pay more than $19,312 to do so.
• 20% of those who adopt pay less than $19,312 to do so.

9. (Jan Temps tab) Average January temperatures in Oswego over the last 150 years.
a) Identify the units and the variable.
b) Obtain a histogram. Identify the shape. Are there any outliers? If so: In what year(s) were the outlying temperatures obtained?
c) State the five number summary. Compute values for the Range and IQR. Compare Range/4 and 0.75 IQR to the standard deviation.
d) Determine the 80th percentile.
e) Without a spreadsheet: Look at your result to part d. What is the percentile rank for a temperature of 27.8 degrees? (Now check your answer by computing it.)
f) Determine the percentile rank of the mean. Compare the mean to the median. Are they close?

10. (Children tab) In order to properly apply percentiles, replicates of the same value should not be common, as this exercise illustrates.
Consider this data set, the number of children in 20 local families:

1 1 1 1 1 2 2 2 2 2 3 3 3 4 4 4 4 4 5 6

Determine the 1st percentile and the 21st percentile.

Because of the large number of replicates, 1 is technically both the 1st and 21st percentiles. In fact, you can fill in the blank in the following sentence with any number from 0 to 21: 1 is the ____ percentile. There are so many replicates that percentiles are not useful.

With highly discrete data (when there are many ties) do not bother with percentiles. Instead, simply tabulate values and their relative frequencies.

# of children in family    1    2    3    4    5    6
% of families             25   25   15   25    5    5

11. Percentiles are suited to continuous quantitative data. (For an example of data that is too discrete for percentiles, see #10 above.) In each of the following situations: 1) identify the variable that is of interest; 2) state whether the variable is quantitative or categorical; 3) decide whether the 25th, 50th and 75th percentiles would be meaningful measures for summarizing data.
a) The number of fire department calls to fires in Oswego on a day.
b) The daily total mass of the garbage an industrial company produces.
c) The colors of people’s cars.
d) The size of men’s feet.
e) The zip code of students’ hometowns.
f) The unused hard drive space on a group of computers that have been used a year.
g) Student response to this questionnaire item on a statistics instructor’s teaching: How effective was the instructor at helping you learn the course material?
1 (not at all)    2    3    4    5 (very)

Solutions

1. a) 15
b) 15% of companies in the industrial sector have PE below 7.3; the remaining 85% of companies have PE above 7.3. Units: companies in the industrial sector; Variable: PE.

2. a) 5% of 1-year-old girls weigh less than 17.5 pounds; the remaining 95% weigh more than 17.5 pounds. Units: 1-year-old girls; Variable: weight.
b) At 6 months girls will tend to be smaller.
(In fact, almost all ½-year-olds are smaller than almost all 1-year-olds. You cannot come even close to saying that sort of thing if comparing, say, 6-year-olds to 7-year-olds. Some – not a lot, but some – 6-year-olds are heavier than a good portion of 7-year-olds.) So the 5th percentile will be less than 17.5 pounds.

3. a) 90% make less; 10% more; about 0% make exactly this amount.
b) The 80th percentile is less than $5.052 million.

4. 65% of SUNY Oswego students have combined SAT below 1250; the other 35% have SAT above 1250. Units: SUNY Oswego students; Variable: combined SAT.

5. a) 3.05 has percentile rank 71.43 (the 71.43 percentile is 3.05)
3.35 has percentile rank 78.57 (the 78.57 percentile is 3.35)
3.65 has percentile rank 85.71 (the 85.71 percentile is 3.65)
Notice that the percentile rank of 3.35 is exactly halfway between those for 3.05 and 3.65. That’s because 3.35 is exactly halfway between 3.05 and 3.65.
b) Q1 = 2.46; Q2 = 2.82; Q3 = 3.20.
c) Range = 1.85; IQR = 3.20 – 2.46 = 0.74.
d) The mean is 2.853; the standard deviation is 0.662. Range/4 = 0.463; 0.75 IQR = 0.75 × 0.74 = 0.555. Neither of these precisely anticipates the standard deviation; both fall below it. On the other hand, these rules of thumb are often imprecise – in particular with really small data sets like this.
e) {1.98, 2.46, 2.82, 3.20, 3.83}

6. a) Patient arrivals are the units. Each arrival is timed: Waiting time is the variable.
b) The histogram is a little bit right skewed with a fairly prominent outlier (the waiting time of 201 minutes).
c) The 25th percentile is Q1 = 71. The 50th is Q2 = M = 86. The 75th is Q3 = 105.
d) 2 hours = 120 minutes has percentile rank of 84.38. The 84.38 percentile is 120 minutes. 75 minutes has a percentile rank of 31.25.
e) For example: 84.4 percent of patients (that’s the units) have waiting times (that’s the variable) less than two hours; the other 15.6 percent wait longer than 2 hours.
f) "Find the 75th percentile." This means: Determine the waiting time x such that 75% of the waiting times are less than x. "What percentile is a waiting time of 75 minutes?" This means essentially: What % of waiting times are below 75 minutes?
g) The mean is 92.82 minutes; the standard deviation is 32.82. The range is 163; dividing by 4 gives 40.75. The IQR is 34; ¾ IQR = 25.5. One might observe that while neither of these is an excellent guess, their average is: (40.75 + 25.5) / 2 = 33.13.

7. a) To obtain the 35th percentile: =PERCENTILE(A2:A237, 0.35)
b) To obtain the percentile rank of a waist measurement of 35 inches: =PERCENTRANK(A2:A237, 35, 9)

8. The variable is “cost of adoption.” The units are the “couples.” Correct: “20% of couples who adopt pay less than $19,312 to do so.” Different people pay different amounts to adopt. The official is communicating that 20% of people pay less than $19,312 – because the 20th percentile is always the amount separating the lowest 20% of data from the highest 80%. The first two statements are irrelevant to varying costs of adoption. (Percentile doesn’t refer to “an” (a single) adoption. It refers to the variation in costs among all adoptions.)

9. a) The units are the Januaries of each year. (“Years” or “Januaries” is fine.) The variable here is the average temperature for the entire month. (“Average temperature for the entire month varies among the Januaries of each year.”)
b) The distribution is rather symmetric with no outliers.
c) {13.6, 20.6, 23.9, 27.1, 35.8} Range = 22.2; IQR = 6.5. Range/4 = 5.55; 0.75 IQR = 4.875. The second of these is quite close to the standard deviation of 4.678.
d) The 80th percentile is 27.8 degrees.
e) 27.8 degrees has percentile rank of 80.
f) The mean is 23.980 degrees and has percentile rank of 50.54.
The mean is very close in value to the median (50th percentile) of 23.9 degrees. This suggests symmetry.

10. The 1st percentile is 1; the 21st percentile is also 1.

11. a) The variable “number of fires in a day” is quantitative, and is highly discrete. There would be many ties (a tie would occur when two days had the same number of fires – and this would happen often in a small city like Oswego). Percentiles are not a good choice here.
b) The variable is the amount of garbage in a day. This is quantitative data, and is continuous. Percentiles would be a good choice here.
c) The variable is car color. This is categorical data. Percentiles would make no sense at all.
d) The variable is foot size. It’s quantitative. First, if we use shoe size (and only length – not width) we do have a quantitative variable. Based on standard shoe sizes (done on a nonstandard scale – the “shoe size scale”), the variable would be fairly discrete, and percentiles would not be so useful. But if one takes the time to measure foot length accurately with a ruler, then foot size is continuous, and percentiles are a good choice.
e) The variable is hometown zip code. Zip codes are formed with digits, but they are categorical. (Working with them as numbers makes no sense. It makes no sense to operate (add, etc.) on them.) Percentiles would make no sense at all.
f) The variable “unused hard drive space” is quantitative and continuous. (Technically it is discrete – there’s a fixed number of bits of space. A small hard drive these days holds 200 gigabytes, which is 1717986918400 bits: it’s virtually impossible for two drives to have the same number of bits of storage used.) Percentiles would be a good choice here.
g) The variable “rating” is again essentially categorical. The choices are presented as numbers, but they aren’t “numbers” in the usual sense. If you answered that it’s discrete quantitative data, that’s OK. (But: any set of categories can be labeled with numbers.
Just because it’s numbers doesn’t mean it’s quantitative.) Percentiles are not a good choice here.
