Continuous Data
The Median
The median is the 50th percentile. The idea is that the median splits the data set in half. The
median is a commonly used single-value measure of the center of a distribution. To find the
median, first sort the data. The minimum value is in position 1, the maximum value in position n.
The median is the value in position ½(1 + n) (in other words: the value in the middle position). If
this position is not a whole number, the median is obtained by averaging the values in the two
positions on either side of position ½(1 + n).
Example 1
At right you see the times until failure of 28 industrial machines (in
hours). The data x must be sorted low to high (this is the part that
takes time when doing problems by hand). Positions are shown to
the right of the sorted list.
Figure 1: Histogram of the failure times (horizontal axis: Failure Time (hours); vertical axis: # of Machines). Mean = 235.76 hr; SD = 34.19 hr.
The mean and standard deviation are generally reported with one
more decimal place of accuracy than are data values. Under-rounding
them (keeping too much precision) is fine. Over-rounding them is
not fine.
The mode is around 210 hours.
There are any number of ways to sort the data. (You generally do not want to do it by hand!)
You can use Excel’s autofill function to quickly obtain a position for each value in the sorted list.
Since n = 28, the median is found in position ½(1 + 28) = 14.5. The values in positions 14 and 15
are 228.5 and 236.1 respectively. So the median is (228.5 + 236.1)/2 = 232.3 hours. The median,
like the mean and mode, has the same measurement units as the data. It is appropriate to round
the median to at least the same accuracy as is the data.
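The position rule can be checked with a short Python sketch. The list failure_times holds the 28 failure times of Example 1 (they are listed in full later in this document); the function name median_by_position is made up for illustration and is not a spreadsheet function.

```python
failure_times = [
    189.1, 193.2, 200.4, 202.6, 202.7, 206.7, 211.4, 211.6, 212.1, 215.7,
    216.6, 218.7, 219.3, 228.5, 236.1, 236.5, 237.0, 239.7, 244.3, 251.5,
    253.3, 257.0, 259.9, 265.2, 267.4, 287.7, 308.0, 329.0,
]


def median_by_position(values):
    """Median via the position rule: the value in position (1 + n) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    position = (1 + n) / 2            # 1-based position in the sorted list
    lower = ordered[int(position) - 1]
    if position == int(position):     # whole-number position: take that value
        return lower
    upper = ordered[int(position)]    # otherwise average the two neighbours
    return (lower + upper) / 2


median = median_by_position(failure_times)
print(round(median, 1))  # 232.3
```

With n = 28 the position is 14.5, so the sketch averages the 14th and 15th sorted values.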
Percentiles
What if we want to split the data some other way? For instance, we want to offer scholarship
money to the top 30% of students based on GPA. What GPA cuts the top 30% from the bottom
70%? This GPA is the 70th percentile.
“3.257 is the 70th percentile.”

“The percentile rank of 3.257 is 70.”
These two statements are equivalent.
Keeping in mind that the units of observation are students and the variable is GPA:
70% of students have GPA below 3.257; the remaining 30% of students have GPA above
3.257.
Working interpretation of percentile.
Identify the units and variable, then use the appropriate descriptions to replace [units] and
[variable] in this statement:
“(Approximately) k% of [units] have [variable] below x; the remaining (100 – k)% of
[units] have [variable] above x.” [1]
You do not need to use the word “approximately” in your statements.
The following are equivalent:
“the kth percentile of the data is x”
“the percentile rank of x is k”
Consider our illustration with GPAs. Suppose we learn (by looking at the data) that the GPA that
cuts off the bottom 70% from the top 30% is 3.257. Then
3.257 is the 70th percentile.
The percentile rank of 3.257 is 70.
It is not correct to say “Out of 100 graduating seniors, 70 have GPA below 3.257; the other 30
have GPA above 3.257.” First of all, there aren’t exactly 100 graduating seniors. And secondly,
if you chose 100, you would be unlikely to get exactly a 70/30 split. (When you look at all
students you get a 70%/30% split.) A statement that references 100 units is only true on average
– assuming you averaged over all possible samples [2] of 100 students. Expressing this is more
difficult and confusing, so just say it the correct way: “70% of graduating seniors have GPA
below 3.257; the other 30% have GPA above 3.257.”
We pretty much ignore students with GPA exactly 3.257. This is no big deal, because very few
have precisely this GPA. Fewer than you think, actually, because the 70th percentile is really
3.25782132… (it’s rounded reasonably for ease of display and reading) and virtually no one has
this GPA.
Percentiles are uniquely suited to continuous data – when there are at most a few units for which
the values of the variable are exactly the same. (Other ways of saying this: If you randomly
choose two units, chances are small that they tie; there are few duplicates/multiples in the data.)
A good deal of what remains of this document concerns itself with how to obtain percentiles
from a data set. How to compute them is the minor issue: How to interpret them is what’s
important.
[1] There is no flexibility here. This – or something immediately and obviously equivalent – is the proper statement. It
is not acceptable to use the terms “variable” and “unit” – replace them with precise descriptions of the variable and
unit for the situation at hand.
[2] In statistics a sample is a collection of units drawn from the collection of all units. The point here is technical (and
is addressed in a later part of your course): Different samples yield different results. If you look at results for all
conceivable samples, and average them, you get the result for the entire set of data.
Computing Percentiles and Percentile Ranks
We will use spreadsheets to work with percentiles.
Finding Percentile Ranks
Each data value can be assigned a percentile rank. Use
=PERCENTRANK(array, x, 9)
array is the location of the sorted data; it can be selected with the mouse.
x is a data value which must be between the minimum and maximum
9 is connected to rounding precision, and is sufficiently large to always work.
Don’t use a number < 9 in this slot; don’t leave out the 9
Example 1
Consider the value of 216.6. This is one of the data values. What is the percentile rank of 216.6
hours?
Notice that the sorted data occupy cells A2 through A29. In
spreadsheets this is written A2:A29 and is called an array.
(You do not need to capitalize the As, nor the function
name.) Doubleclick into any empty cell. Then start typing
=PERCENTRANK(A2:A29, 216.6, 9)
As you start to type the function, the program will help you
complete it, and will also suggest how you should organize
information about the data array and data value you are
inputting. (In Excel you will replace [significance] with 9;
Google spreadsheet will not cue you about the
“significance,” but it still accepts a value.) You do not have
to type A2:A29 – you can use the mouse to select the array.
When you do this: In Excel A2:A29 and the outline of the
selected array will be shown in blue; In google spreadsheets
the array will be shaded. At right see what it looks like in an
Excel sheet.
The formula is entered into cell C4. You can see the formula
in any cell as you enter it (the formula also appears in the
“formula bar” following the fx.)
When you’ve typed the entire formula, evaluate it with the [Enter] /
[Return] key. Cell C4 now shows a result: 0.3704
0.3704 is the same as 37.04%. A failure time of 216.6 hours has a
percentile rank of 37.04. The 37.04 percentile of this distribution is 216.6 hours. An
interpretation reads like this (units are underlined; the variable is in italics):
37.04% of machines have a failure time less than 216.6 hours; the remaining 62.96% do
not fail until after 216.6 hours.
Look at the data: Of the 27 failure times other than the 216.6, 10 are less than 216.6: 10/27 =
37.04%. And that is how it works!
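The count behind 10/27 can be reproduced directly. This is only a sketch of the counting idea, assuming the value is one of the data values and has no ties; percent_rank_of_data_value is a hypothetical helper name, not a spreadsheet function.

```python
failure_times = [
    189.1, 193.2, 200.4, 202.6, 202.7, 206.7, 211.4, 211.6, 212.1, 215.7,
    216.6, 218.7, 219.3, 228.5, 236.1, 236.5, 237.0, 239.7, 244.3, 251.5,
    253.3, 257.0, 259.9, 265.2, 267.4, 287.7, 308.0, 329.0,
]


def percent_rank_of_data_value(values, x):
    """Fraction of the other values that lie below x (x must be in values)."""
    if x not in values:
        raise ValueError("x must be one of the data values")
    below = sum(1 for v in values if v < x)
    return below / (len(values) - 1)   # compare x with the n - 1 other values


rank = percent_rank_of_data_value(failure_times, 216.6)
print(f"{100 * rank:.2f}")  # 37.04
```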
Aside: Rounding
In this class, our convention will be to round percents to the nearest 0.01%. This is overkill in
many cases – but it is far better to round too little than to round too much. Here’s a guide:
Percent                                              Maximum rounding
Between 10% and 90%                                  nearest 1%
Between 1% and 10% or between 90% and 99%            nearest 0.1%
Between 0.1% and 1.0% or between 99.0% and 99.9%     nearest 0.01%
Etc.                                                 Etc.
It’s always better to under-round, rather than to over-round.
Example 1
What are the percentile ranks for 211.4 hours and 211.6 hours?
=PERCENTRANK(A2:A29, 211.4, 9) → 0.2222
The percentile rank of 211.4 hours is 22.22; 211.4 is the 22.22 percentile.
=PERCENTRANK(A2:A29, 211.6, 9) → 0.2593
The percentile rank of 211.6 hours is 25.93; 211.6 is the 25.93 percentile.
You can also find a percentile rank for a value not in the list – as long as it doesn’t fall below the
minimum or above the maximum. Let’s find the percentile rank for 211.5:
=PERCENTRANK(A2:A29, 211.5, 9) → 0.2407
The percentile rank of 211.5 hours is 24.07; 211.5 is the 24.07 percentile.
This makes sense: 211.5 is halfway between 211.4 and 211.6, and the associated percent (24.07)
is halfway between 22.22 and 25.93: (22.222 + 25.926)/2 = 24.074.
Percentile ranks for values not in the list are linearly interpolated from those in the list.
Find the percentile ranks for 308.0, 309.0 and 329.0.
=PERCENTRANK(A2:A29, 308.0, 9) → 0.9630
=PERCENTRANK(A2:A29, 309.0, 9) → 0.9647
=PERCENTRANK(A2:A29, 329.0, 9) → 1.0000
308.0 hours is the 96.30 percentile; 329.0 hours is the 100.00 percentile – the maximum. These
two values (308.0 and 329.0) are both in the data set. 309.0 hours is not in the data set. But 309.0
is between 308.0 and 329.0, and so its percentile rank is between 96.30 and 100.00. Since
309.0 is much closer to 308.0, its percentile rank is very close to 96.30.
318.5 hours is halfway between 308.0 and 329.0. Its percentile rank is 98.15, which is halfway
between 96.30 and 100.00.
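The interpolation just described can be sketched in Python. The function percent_rank below is an illustrative stand-in, not the spreadsheet's own code; whether a spreadsheet computes its result in exactly this way is an assumption, although the sketch agrees with the values quoted above.

```python
from bisect import bisect_left

failure_times = [
    189.1, 193.2, 200.4, 202.6, 202.7, 206.7, 211.4, 211.6, 212.1, 215.7,
    216.6, 218.7, 219.3, 228.5, 236.1, 236.5, 237.0, 239.7, 244.3, 251.5,
    253.3, 257.0, 259.9, 265.2, 267.4, 287.7, 308.0, 329.0,
]


def percent_rank(values, x):
    """Percentile rank of x as a fraction, interpolating between data values."""
    ordered = sorted(values)
    n = len(ordered)
    if not ordered[0] <= x <= ordered[-1]:
        raise ValueError("x must lie between the minimum and the maximum")
    i = bisect_left(ordered, x)            # number of values strictly below x
    if ordered[i] == x:                    # x is itself a data value
        return i / (n - 1)
    lo, hi = ordered[i - 1], ordered[i]    # data values on either side of x
    return (i - 1 + (x - lo) / (hi - lo)) / (n - 1)


for x in (211.5, 309.0, 318.5):
    print(x, round(percent_rank(failure_times, x), 4))
# ranks printed: 0.2407, 0.9647 and 0.9815
```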
One more: What is the percentile rank of the mean?
Examining Figure 1, see that the mean is 235.8. We need the percentile rank of 235.8.
=PERCENTRANK(A2:A29, 235.757, 9) → 0.5168
The mean (235.8) is the 51.68th percentile. A more gnarly way to write this, without even having
to see the mean, is like this:
=PERCENTRANK(A2:A29, AVERAGE(A2:A29), 9) → 0.5168
Skew and percentiles
It’s often the case that the discrepancy between median and mean hints at the shape of the
distribution. (A histogram supplies information too.)

For a fairly symmetric distribution, the mean and median will be quite close. (If the
distribution is exactly symmetric – which is quite rare for real data sets – the two will be
identical.) The percentile of the mean will be quite close to 50.

For a right skewed distribution, the mean will fall to the right of (above) the median; the
percentile rank of the mean will be above 50.

For a left skewed distribution, the mean will fall to the left of (below) the median; the
percentile rank of the mean will be below 50.
In Example 1 the mean is above the median – although not by that much. (An effective visual
comparison is to mark both under the horizontal axis of the histogram. They are rather close.)
This suggests a distribution with a little bit of right skew. The histogram bears this out.
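The mean-versus-median comparison can be written as a small Python check. The helper skew_hint is hypothetical, and, as the text says, the result is only a hint about shape; a histogram should still be consulted.

```python
from statistics import mean, median

failure_times = [
    189.1, 193.2, 200.4, 202.6, 202.7, 206.7, 211.4, 211.6, 212.1, 215.7,
    216.6, 218.7, 219.3, 228.5, 236.1, 236.5, 237.0, 239.7, 244.3, 251.5,
    253.3, 257.0, 259.9, 265.2, 267.4, 287.7, 308.0, 329.0,
]


def skew_hint(values):
    """Direction of skew suggested by comparing the mean with the median."""
    avg, mid = mean(values), median(values)
    if avg > mid:
        return "right"   # mean pulled above the median
    if avg < mid:
        return "left"    # mean pulled below the median
    return "none"        # mean equals median: no skew suggested


print(skew_hint(failure_times))  # right
```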
Finding Percentiles
We now discuss what to do when the percentile rank (a percent from 0% – 100%) is given. To do
this job use PERCENTILE. Again you specify the array that is the data. The second input is a
value on the relative frequency scale. Spreadsheets input and output relative frequency, not
percent. So be careful when feeding percentages into PERCENTILE. You must either first divide
by 100 (see the examples below), or type the % sign after the percent.
To find the kth percentile, where 0 < k < 100: Doubleclick into any empty cell. Then either
=PERCENTILE(array, k/100)
OR
=PERCENTILE(array, k%)
(You can have Excel do the division by 100; you may also do it yourself by simply moving the
decimal point two places left. The value that is input in the second position must be between 0
and 1.)
Example 1
Find the 37.04 percentile. k = 37.04% = 0.3704 and you have two very similar ways to input this.
=PERCENTILE(A2:A29, 37.04/100)
OR
=PERCENTILE(A2:A29, 37.04%)
Then after [Enter] or [Return] you’ll see 216.60168.
The output of this function should be rounded to the precision with
which data are recorded. Here that is to the nearest 0.1 hour.
The 37.04 percentile is 216.6. You can round this yourself. Or use the spreadsheet’s rounding
options to display the proper precision:
In Excel, use the Increase Decimal / Decrease Decimal buttons, or Format Cells (Category) Number and set decimal
places. (When you have selected a cell or block of cells, Format Cells is available off the
right click. You can also type Alt o e in Windows and Command 1 in OS X.)
In google spreadsheets the 123 button, or Format Number will allow you to format the
selected cells.
So 216.6 is the 37.04 percentile. Of course we already knew this. (See above where we computed
the percentile rank for 216.6 hours.)
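For readers who want to see the arithmetic behind the 216.60168, here is a Python sketch of a percentile lookup by position. It assumes the kth percentile sits at position 1 + (k/100)(n – 1) in the sorted list, with linear interpolation between neighbouring positions; that assumption reproduces the spreadsheet values quoted in this document, but the function percentile itself is illustrative and not the spreadsheet's code.

```python
failure_times = [
    189.1, 193.2, 200.4, 202.6, 202.7, 206.7, 211.4, 211.6, 212.1, 215.7,
    216.6, 218.7, 219.3, 228.5, 236.1, 236.5, 237.0, 239.7, 244.3, 251.5,
    253.3, 257.0, 259.9, 265.2, 267.4, 287.7, 308.0, 329.0,
]


def percentile(values, p):
    """Data value at relative frequency p (0 <= p <= 1), by interpolation."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    position = 1 + p * (len(ordered) - 1)   # 1-based position in the sorted list
    below = int(position)                   # whole-number part of the position
    fraction = position - below
    if fraction == 0:
        return ordered[below - 1]
    lo, hi = ordered[below - 1], ordered[below]
    return lo + fraction * (hi - lo)


print(round(percentile(failure_times, 0.3704), 5))  # 216.60168
```

Note that the input is a relative frequency (0.3704), not a percent (37.04), matching the caution given for PERCENTILE.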
Aside: Inverses
We’ve discussed two functions in spreadsheets that deal with percentiles and percentile ranks.
PERCENTRANK: Given a data value x this gives the appropriate percentile rank k%
PERCENTILE: Given a percentile rank k% this gives the appropriate data value x
These two functions are inverses of one another. [3]
=PERCENTRANK(A2:A29, 216.6, 9) → 0.3704
=PERCENTILE(A2:A29, 0.3704) → 216.6
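The round trip can also be traced with exact fractions, which avoids the rounding in 0.3704. This sketch assumes the same position rule as the spreadsheet results quoted in this document (rank = values below x divided by n – 1; position = 1 + rank × (n – 1)); the variable names are illustrative.

```python
from fractions import Fraction

failure_times = [
    189.1, 193.2, 200.4, 202.6, 202.7, 206.7, 211.4, 211.6, 212.1, 215.7,
    216.6, 218.7, 219.3, 228.5, 236.1, 236.5, 237.0, 239.7, 244.3, 251.5,
    253.3, 257.0, 259.9, 265.2, 267.4, 287.7, 308.0, 329.0,
]
ordered = sorted(failure_times)
n = len(ordered)

x = 216.6
rank = Fraction(ordered.index(x), n - 1)   # values below x over (n - 1)
position = 1 + rank * (n - 1)              # back to a 1-based position
recovered = ordered[int(position) - 1]     # the value sitting at that position

print(rank, position, recovered)  # 10/27 11 216.6
```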
Example 1
Find the median.
The median is (by definition) the 50th percentile.
=PERCENTILE(A2:A29, 0.5) → 232.3
The median is 232.3 hours.
Find the 75th percentile. (The 75th percentile is also called the 3rd quartile or upper quartile.)
=PERCENTILE(A2:A29, 0.75) → 254.2
[3] Sometimes it appears these are not exactly inverses. This is because of rounding. Notice that our input into
PERCENTILE is 0.3704. However, 216.6 gives a percentile rank of 0.37037037 when displayed with more
precision. As you might guess, this decimal is repeating: 0.37037037037037037037…which is 10/27 (see the
discussion above, where 10/27 is explicitly discussed relative to 216.6). If one takes advantage of this exact
expression then
=PERCENTILE(A2:A29, 10/27) → 216.6
This demonstrates that technically these are exactly inverses.
Here’s a good place to stop and check that things make sense. Here’s our list of data, arranged
horizontally for easier viewing. We have 254.2 as the 75th percentile.
189.1 193.2 200.4 202.6 202.7 206.7 211.4 211.6 212.1 215.7 216.6 218.7 219.3
228.5 236.1 236.5 237.0 239.7 244.3 251.5 253.3 257.0 259.9 265.2 267.4 287.7
308.0 329.0
Notice that of the 28 failure times, exactly 21 are below 254.2: 21 / 28 = 0.75 = 75%. This makes
good sense.
Again, here’s what we say:
“Approximately 75% of the machines have failure time below 254.2 hr; the other 25% of
machines have failure time above 254.2 hr.”
(Units: machines. Variable: failure time.)
To determine the 90th percentile:
=PERCENTILE(A2:A29, 0.90) → 273.5
The 90th percentile of failure times is 273.5 hours. “Approximately 90% of the machines have
failure time below 273.5; the other 10% have failure time above 273.5 hr.”
Actually, the percent of data values below 273.5 is more precisely 89.3%. This discrepancy
occurs because there is no way to get exactly 90% of the data below any value in a list of 28
values: 90% of 28 is 25.2.
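The 89.3% figure is a direct count, which a few lines of Python can confirm. The variable names are illustrative.

```python
failure_times = [
    189.1, 193.2, 200.4, 202.6, 202.7, 206.7, 211.4, 211.6, 212.1, 215.7,
    216.6, 218.7, 219.3, 228.5, 236.1, 236.5, 237.0, 239.7, 244.3, 251.5,
    253.3, 257.0, 259.9, 265.2, 267.4, 287.7, 308.0, 329.0,
]

cutoff = 273.5                                   # the reported 90th percentile
below = sum(1 for v in failure_times if v < cutoff)
share_below = below / len(failure_times)

print(below, f"{100 * share_below:.1f}%")  # 25 89.3%
```

With 28 values the only achievable shares are multiples of 1/28, so 25/28 is the closest the data can come to 90%.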
These functions generally yield such small discrepancies. It’s unavoidable. (To have precise
matching for all percents from 0.1% to 99.9% would require a data set with size an exact
multiple of 1001). It is not worth discussing these issues: They are minor. Simply learn how to
compute percentiles and percentile ranks, and (especially) learn how to interpret them.
Our Standard for Percentiles
There are other ways of defining percentiles and percentile ranks. All reduce to the same thing
for the median. Elsewhere the differences are minor, and an interpretation is the same no matter
which method is used. (Your instructor has read a 50-page paper examining nothing more than
the many different ways of determining only the first and third quartiles – the 25th and 75th
percentiles. While the paper had its interesting technical points, for the most part it was quite
dull.)
There are other standards for obtaining percentiles and percentile ranks – this is the one that you
and your class are adopting. All the methods produce virtually identical values with large data
sets. For small data sets there are some differences, but they are not important – especially
relative to the uncertainty resulting from the lack of information inherent to a small data set. [4]
Learn to use this standard. Learn to interpret percentiles.
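To see how small the differences between standards are, Python's statistics.quantiles offers two conventions. In this sketch method="inclusive" reproduces the spreadsheet quartiles used in this document, while method="exclusive" gives 211.45 and 256.075, which round to the Minitab values quoted in the footnote. Whether Minitab uses exactly that rule is an assumption here.

```python
from statistics import quantiles

failure_times = [
    189.1, 193.2, 200.4, 202.6, 202.7, 206.7, 211.4, 211.6, 212.1, 215.7,
    216.6, 218.7, 219.3, 228.5, 236.1, 236.5, 237.0, 239.7, 244.3, 251.5,
    253.3, 257.0, 259.9, 265.2, 267.4, 287.7, 308.0, 329.0,
]

# n=4 asks for the three cut points that split the data into quarters.
q1_inc, med_inc, q3_inc = quantiles(failure_times, n=4, method="inclusive")
q1_exc, med_exc, q3_exc = quantiles(failure_times, n=4, method="exclusive")

print("inclusive:", q1_inc, med_inc, q3_inc)
print("exclusive:", q1_exc, med_exc, q3_exc)
```

Both conventions return the same median; only the outer quartiles shift slightly.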
[4] For the failure time data, Minitab gives 211.5 as the 25th percentile (we have 211.6) and 256.1 as the 75th percentile
(we have 254.2). These are very minor differences relative to a) the small size of the data set and b) the large amount
of variability in this data.
Frequently Used Percentiles
Some sets of percentiles are commonly reported:
Quartiles (used to obtain a boxplot):
25th percentile: “lower quartile” or “first quartile:” Q1
50th percentile: “median” or “second quartile:” M (occasionally Q2)
75th percentile: “upper quartile” or “third quartile:” Q3
Quintiles: 20th, 40th, 60th, 80th
Deciles: 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, 90th
Boxplots
The 5 number summary
The five-number summary consists of the minimum, quartiles, and maximum value. To display
this summary, place the values in {curly brackets}, listed from low (minimum) to high
(maximum), with commas separating values.
Example 1
The minimum is 189.1; the first quartile (25th percentile) is 211.6; the median is 232.3; the third
quartile is 254.2; the maximum is 329.0. The five number summary is written as follows:
{189.1, 211.6, 232.3, 254.2, 329.0}
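A five-number summary is easy to assemble in Python. The function five_number_summary is a hypothetical helper; it leans on statistics.quantiles with method="inclusive", which is assumed here to match the spreadsheet quartiles because it reproduces the values above. The sketch prints unrounded values; rounding them to the nearest 0.1 gives the summary shown.

```python
from statistics import quantiles

failure_times = [
    189.1, 193.2, 200.4, 202.6, 202.7, 206.7, 211.4, 211.6, 212.1, 215.7,
    216.6, 218.7, 219.3, 228.5, 236.1, 236.5, 237.0, 239.7, 244.3, 251.5,
    253.3, 257.0, 259.9, 265.2, 267.4, 287.7, 308.0, 329.0,
]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum), unrounded."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))


summary = five_number_summary(failure_times)
print("{" + ", ".join(f"{v:.3f}" for v in summary) + "}")
```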
It is appropriate to display each value with the same accuracy – here to the nearest 0.1, which is
the accuracy of the data. Notice how the 329.0 is displayed. This is: a) for consistency, and b)
because the data are measured to the nearest 0.1 – the value itself gives an idea of how precise
the measurement scale is.
Interquartile Range
The interquartile range (IQR) is the distance between the third and first quartiles:
IQR = Q3 – Q1
Example 1
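IQR = 254.2 – 211.6 = 42.6 using the rounded quartiles; carrying the unrounded quartiles (254.225 – 211.55) gives 42.675, which rounds to the 42.7 used below.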
The IQR is a measure of the variability in a data set: It tells you the length of the interval that
includes the middle 50% of all the data. Generally speaking, data sets with larger IQR have more
variability.
Rule of Thumb
The standard deviation is often around ¾ of the IQR. For Example 1, the ratio of standard
deviation to IQR is 34.19 / 42.7 = 0.801, which is reasonably close to 0.75.
There are occasions when this rule of thumb comes nowhere close to truth. This usually happens
when the corresponding distribution has a huge amount of skew, or outliers, or some other
unusual feature.
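The ratio in the rule of thumb can be recomputed from the raw data. This sketch assumes the reported SD is the sample standard deviation (statistics.stdev) and that the quartiles follow the inclusive convention; both assumptions agree with the figures quoted in this document.

```python
from statistics import quantiles, stdev

failure_times = [
    189.1, 193.2, 200.4, 202.6, 202.7, 206.7, 211.4, 211.6, 212.1, 215.7,
    216.6, 218.7, 219.3, 228.5, 236.1, 236.5, 237.0, 239.7, 244.3, 251.5,
    253.3, 257.0, 259.9, 265.2, 267.4, 287.7, 308.0, 329.0,
]

q1, _, q3 = quantiles(failure_times, n=4, method="inclusive")
iqr = q3 - q1                  # length of the interval holding the middle 50%
sd = stdev(failure_times)      # sample standard deviation
ratio = sd / iqr               # rule of thumb says this is often near 0.75

print(round(iqr, 1), round(sd, 2), round(ratio, 2))  # 42.7 34.19 0.8
```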
The simple boxplot
The boxplot is a graphical display of the five number summary. A scale extending from below
the minimum to above the maximum is drawn. A box is placed, with edges located at the first
and third quartiles. A line is drawn through the box at the location of the median. Then lines –
called “whiskers” are drawn from the first quartile to the minimum, and from the third quartile to
the maximum.
Figure 2
When you ask statistical software to construct a boxplot it will produce a modified boxplot. [5] The
modified boxplot is constructed using a mathematical rule for identifying outliers – values that
are extreme relative to the bulk of the distribution.
Outliers in boxplots
Here is the rule:
Any value more than 1.5 × IQR below Q1 is an outlier.
Any value more than 1.5 × IQR above Q3 is an outlier.
For Example 1, IQR = 42.7, so 1.5 × IQR = 64.0.
Any value more than 64.0 below 211.6 is an outlier. 211.6 – 64.0 = 147.6. There are no
values below 147.6.
Any value more than 64.0 above 254.2 is an outlier. 254.2 + 64.0 = 318.2. There is one
value above 318.2 – it is 329.0.
Modified Boxplot
Plot the box as in a simple boxplot. But, extend the whiskers only as far as the most extreme
observations that are not considered outliers. Then use special symbols to plot the outliers.
Example 1
Since 329.0 is considered an outlier, the whisker on the high side is drawn to 308.0 – which is
not considered an outlier. 329.0 is plotted on its own.
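The outlier rule and the whisker ends of the modified boxplot can be sketched in Python. The names lower_fence and upper_fence are illustrative labels for the two cutoffs; the sketch works from unrounded quartiles (inclusive convention assumed), so the cutoffs can differ in the last digit from hand arithmetic on rounded values.

```python
from statistics import quantiles

failure_times = [
    189.1, 193.2, 200.4, 202.6, 202.7, 206.7, 211.4, 211.6, 212.1, 215.7,
    216.6, 218.7, 219.3, 228.5, 236.1, 236.5, 237.0, 239.7, 244.3, 251.5,
    253.3, 257.0, 259.9, 265.2, 267.4, 287.7, 308.0, 329.0,
]

q1, _, q3 = quantiles(failure_times, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr   # values below this are outliers
upper_fence = q3 + 1.5 * iqr   # values above this are outliers

outliers = [v for v in failure_times if v < lower_fence or v > upper_fence]
inside = [v for v in failure_times if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)   # whisker ends

print(outliers, whisker_low, whisker_high)  # [329.0] 189.1 308.0
```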
[5] If you know the software reasonably well, you can get it to construct a “regular” (unmodified) boxplot such as in
Figure 2.
Figure 3
From the boxplot you get a fairly accurate gauge of the five number summary. From this you can
quickly deduce some other values: The range and the interquartile range (which quantify how
spread the data are), as well as the median (which quantifies the center). It also helps identify
extreme values – values that are somewhat unusual, and perhaps require some looking in to. You
can also “guess” the standard deviation in two ways:
Range / k, where k is something between 4 and 6 (our textbook uses 4)
0.75 IQR
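Both guesses can be tried on the failure time data, whose standard deviation is reported as 34.19 hr. This is a sketch only; the variable names are illustrative and the quartiles assume the inclusive convention.

```python
from statistics import quantiles

failure_times = [
    189.1, 193.2, 200.4, 202.6, 202.7, 206.7, 211.4, 211.6, 212.1, 215.7,
    216.6, 218.7, 219.3, 228.5, 236.1, 236.5, 237.0, 239.7, 244.3, 251.5,
    253.3, 257.0, 259.9, 265.2, 267.4, 287.7, 308.0, 329.0,
]

data_range = max(failure_times) - min(failure_times)
# Guess 1: Range / k for each k from 4 to 6.
guesses_from_range = {k: data_range / k for k in (4, 5, 6)}

q1, _, q3 = quantiles(failure_times, n=4, method="inclusive")
# Guess 2: three quarters of the interquartile range.
guess_from_iqr = 0.75 * (q3 - q1)

for k, guess in guesses_from_range.items():
    print(f"Range / {k} = {guess:.2f}")
print(f"0.75 IQR = {guess_from_iqr:.2f}")
```

For this data set Range / 4 comes out near 35 and 0.75 IQR near 32, on either side of the reported standard deviation.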
Exercises
The data for the exercises (as well as the examples above) can be accessed from the instructor’s
shared folder (Continuous Data Sets). The different data sets are placed within different tabs in
the spreadsheet. (Tabs are accessed at the bottom of the window. See the instructor if this is
confusing you. It’s easy, but not until you “get it.”)
1. In discussing investment opportunities, a financial advisor speaks about a company’s “price
to earnings” ratio (PE) – the price of a share of stock divided by the amount of profit the
company makes annually (i.e., how much it costs to purchase $1 of annual profit). A stock
market analyst says “For the ECC Company, its PE of 7.3 is at the 15th percentile among
companies in the industrial sector.”
a) What is the percentile rank for a PE of 7.3?
b) Write a sentence explaining what this means, without using the word “percentile.” Your
statement must identify the units and variable. You may use the word “percent,” and you
must use the numbers 7.3 and 15.
2. For girls 1 year of age, the 5th percentile of weights is 17.5 pounds.
a) Write a sentence interpreting this. (Speak in terms of units / variables.)
b) For girls 6 months of age, how does the 5th percentile compare to 17.5 pounds? Is it
larger, smaller, or the same as 17.5 pounds? Why?
3. The 90th percentile of CEO salaries is $5.052 million.
a) What percent of CEOs make more than this amount? Less? What percent (to the nearest
1%) make exactly this amount?
b) How does the 80th percentile of CEO salaries compare to $5.052 million?
4. For SUNY Oswego students, the 65th percentile on the combined SAT score is 1250. Write a
sentence interpreting this. (Speak in terms of units and variable.)
5. (GPA tab) Here are the GPAs of 8 math majors (sorted).
1.98 2.10 2.58 2.69 2.94 3.05 3.65 3.83
a) Determine the percentile rank for GPAs of 3.05, 3.35 and 3.65.
b) Determine values for Q1, Q2 and Q3, the 1st, 2nd and 3rd quartiles, respectively.
c) Obtain values for the Range and interquartile range (IQR).
d) Determine values for the mean and standard deviation. Compare the standard deviation to
both Range/4 and 0.75 IQR (both these expressions will “guess” the standard deviation).
e) State the 5 number summary for this data set.
6. (Patient Waiting Time tab) The data are the amounts of time patients waited in an emergency
room at a local hospital prior to seeing a doctor (in minutes).
a) What are the units? What is the variable?
b) Obtain a histogram. Identify the shape of this histogram. Are there any outliers?
c) Determine values for Q1, Q2 and Q3, the 1st, 2nd and 3rd quartiles, respectively.
d) What is the percentile rank for a waiting time of 2 hours? How about 75 minutes?
e) Make sure you can interpret all your results to c and d.
f) Here are two questions you just answered: In part c) “Find Q3 – the 75th percentile.” In
part d) “What is the percentile rank for a waiting time of 75 minutes?” Explain why the
answers are different, even though the number (75) is the same.
g) Determine values for the mean and standard deviation. Compare the standard deviation to
both Range/4 and 0.75 IQR (both these expressions will “guess” the standard deviation).
7. You have a data set where the variable is the waist measurements for a random sample of
236 men. The data are located in Excel cells A2 through A237.
a) What do you type in an empty cell in order to obtain the 35th percentile?
b) What do you type in an empty cell in order to obtain the percentile rank of a waist
measurement of 35 inches?
8. A couple is researching the cost of completing an international adoption. This cost varies
from adoption to adoption. From a U.S. government source they learn that the 20th percentile of
costs is $19,312. Identify the variable and statistical units. Then: Which of the following
properly explains this to someone unfamiliar with the term “percentile?”

20% of the cost of an adoption is less than $19,312.

An adoption costs 20% of $19,312.

20% of those who adopt pay exactly $19,312 to do so.

20% of those who adopt pay more than $19,312 to do so.

20% of those who adopt pay less than $19,312 to do so.
9. (Jan Temps tab) Average January temperatures in Oswego over the last 150 years.
a) Identify the units and the variable.
b) Obtain a histogram. Identify the shape. Are there any outliers? If so: In what year(s) were
the outlying temperatures obtained?
c) State the five number summary. Compute values for the Range and IQR. Compare
Range/4 and 0.75IQR to the standard deviation.
d) Determine the 80th percentile.
e) Without a spreadsheet: Look at your result to part c. What is the percentile rank for a
temperature of 27.8 degrees? (Now check your answer by computing it.)
f) Determine the percentile rank of the mean. Compare the mean to the median. Are they
close?
10. (Children tab) In order to properly apply percentiles, replicates of the same value should not
be common, as this exercise illustrates.
Consider this data set, the number of children in 20 local families:
1 1 1 1 1 2 2 2 2 2 3 3 3 4 4 4 4 4 5 6
Determine the 1st percentile and the 21st percentile.
Because of the large number of replicates, 1 is technically both the 1st and 21st percentiles. In
fact, you can fill in the blank in the following sentence with any number from 0 to 21:
1 is the ____ percentile.
There are so many replicates that percentiles are not useful. With highly discrete data (when
there are many ties) do not bother with percentiles. Instead, simply tabulate values and their
relative frequencies.
# of children in family    1    2    3    4    5    6
% of families             25   25   15   25    5    5
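Tabulating values and relative frequencies takes only a few lines of Python. This sketch uses the 20 family sizes from this exercise; the variable names are illustrative.

```python
from collections import Counter

# Number of children in each of the 20 families.
children = [1] * 5 + [2] * 5 + [3] * 3 + [4] * 5 + [5, 6]

counts = Counter(children)
n = len(children)
# Percent of families for each distinct value, in increasing order of value.
relative_frequency = {
    value: 100 * count / n for value, count in sorted(counts.items())
}

print(relative_frequency)
# {1: 25.0, 2: 25.0, 3: 15.0, 4: 25.0, 5: 5.0, 6: 5.0}
```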
11. Percentiles are suited to continuous quantitative data. (For an example of data that is too
discrete for percentiles, see #10 above.) In each of the following situations: 1) identify the
variable that is of interest; 2) state whether the variable is quantitative or categorical; 3) decide
whether the 25th, 50th and 75th percentiles would be meaningful measures for summarizing data.
a) The number of fire department calls to fires in Oswego on a day.
b) The daily total mass of the garbage an industrial company produces.
c) The colors of people’s cars.
d) The size of men’s feet.
e) The zip code of students’ hometowns.
f) The unused hard drive space on a group of computers that have been used a year.
g) Student response to this questionnaire item on a statistics instructor’s teaching:
How effective was the instructor at helping you learn the course material?
1 (not at all)    2    3    4    5 (very)
Solutions
1.
a) 15
b) 15% of companies in the industrial sector have PE below 7.3; the remaining 85% of
companies have PE above 7.3. Units: companies in the industrial sector; Variable: PE.
2.
a) 5% of 1-year-old girls weigh less than 17.5 pounds; the remaining 95% weigh more than
17.5 pounds. Units: 1-year-old girls; Variable: weight.
b) At 6 months girls will tend to be smaller. (In fact, almost all ½-year-olds are smaller than
almost all 1-year-olds. You cannot come even close to saying that sort of thing if comparing,
say, 6-year-olds to 7-year-olds. Some – not a lot, but some – 6-year-olds are heavier than a
good portion of 7-year-olds.) So the 5th percentile will be less than 17.5 pounds.
3.
a) 90% make less; 10% more; about 0% make exactly this amount.
b) The 80th percentile is less than $5.052 million.
4. 65% of SUNY Oswego students have combined SAT below 1250; the other 35% have SAT
above 1250. Units: SUNY Oswego students; Variable: combined SAT.
5.
a) 3.05 has percentile rank 71.43 (the 71.43 percentile is 3.05)
3.35 has percentile rank 78.57 (the 78.57 percentile is 3.35)
3.65 has percentile rank 85.71 (the 85.71 percentile is 3.65)
Notice that the percentile rank of 3.35 is exactly halfway between those for 3.05 and
3.65. That’s because 3.35 is exactly halfway between 3.05 and 3.65.
b) Q1 = 2.46; Q2 = 2.82; Q3 = 3.20.
c) Range = 1.85; IQR = 3.20 – 2.46 = 0.74.
d) The mean is 2.853; the standard deviation is 0.662. Range/4 = 0.463; 0.75IQR = 0.555.
Neither of these precisely anticipates the standard deviation; both fall short of it. On the other hand, these
rules of thumb are not that often really precise – in particular with really small data sets like
this. The average of these two guesses, (0.463 + 0.555) / 2 = 0.509, also falls short.
e) {1.98, 2.46, 2.82, 3.20, 3.83}
6.
a) Patient arrivals are the units. Each arrival is timed: Waiting time is the variable.
b) The histogram is a little bit right skewed with a fairly prominent outlier (the waiting time
of 201 minutes).
c) The 25th percentile is Q1 = 71. The 50th is Q2 = M = 86. The 75th is Q3 = 105.
d) 2 hours = 120 minutes has percentile rank of 84.38. The 84.38 percentile is 120 minutes. 75 minutes
has a percentile rank of 31.25.
e) For example: 84.4 percent of patients (that’s the units) have waiting times (that’s the variable)
less than two hours; the other 15.6 percent wait longer than 2 hours.
f) "Find the 75th percentile." This means: Determine the waiting time x such that 75% of the waiting
times are less than x. "What percentile is a waiting time of 75 minutes?" This means essentially:
What % of waiting times are below 75 minutes?
g) The mean is 92.82 minutes; the standard deviation is 32.82. The range is 163; dividing by
4 gives 40.75. The IQR is 34; 0.75 IQR = 25.5. One might observe that while neither of these
is an excellent guess, their average is: (40.75 + 25.5) / 2 = 33.13.
7.
a) To obtain the 35th percentile:
=PERCENTILE(A2:A237, 0.35)
b) To obtain the percentile rank of a waist measurement of 35 inches:
=PERCENTRANK(A2:A237, 35, 9)
8. The variable is “cost of adoption.” The units are the “couples.” Correct: “20% of couples
who adopt pay less than $19,312 to do so.” Different people pay different amounts to adopt. The
official is communicating that 20% of people pay less than $19,312 – because the 20th percentile
is always the amount separating the lowest 20% of data from the highest 80%. The first two
statements are irrelevant to varying costs of adoption. (Percentile doesn’t refer to “an” (a single)
adoption. It refers to the variation in costs among all adoptions.)
9.
a) The units are the Januaries of each year. (“Years” or “Januaries” is fine.) The variable here is the
average temperature for the entire month. (“Average temperature for the entire month varies
among the Januaries of each year.”)
b) The distribution is rather symmetric with no outliers.
c) {13.6, 20.6, 23.9, 27.1, 35.8} Range = 22.2; IQR = 6.5. Range/4 = 5.55;
0.75IQR = 4.875. The second of these is quite close to the standard deviation of 4.678.
d) The 80th percentile is 27.8 degrees.
e) 27.8 degrees has percentile rank of 80.
f) The mean is 23.980 degrees and has percentile rank of 50.54. The mean is very close in
value to the median (50th percentile) of 23.9 degrees. This suggests symmetry.
10. The 1st percentile is 1; the 21st percentile is also 1.
11.
a) The variable “number of fires in a day” is quantitative, and is highly discrete. There
would be many ties (a tie would occur when two days had the same number of fires – and
this would happen often in a small city like Oswego.) Percentiles are not a good choice here.
b) The variable is the amount of garbage in a day. This is quantitative data, and is
continuous. Percentiles would be a good choice here.
c) The variable is car color. This is categorical data. Percentiles would make no sense at all.
d) The variable is foot size. It’s quantitative. First, if we use shoe size (and only length – not
width) we do have a quantitative variable. Based on standard shoe sizes (done on a
nonstandard scale – the “shoe size scale”), the variable would be fairly discrete, and
percentiles would not be so useful. But if one takes the time to measure foot length
accurately with a ruler, then foot size is continuous, and percentiles are a good choice.
e) The variable is hometown zip code. Zip codes are formed with digits, but they are
categorical. (Working with them as numbers makes no sense. It makes no sense to operate
(add, etc.) on them.) Percentiles would make no sense at all.
f) The variable “unused hard drive space” is quantitative and continuous. (Technically it is
discrete – there’s a fixed number of bits of space. A small hard drive these days holds 200
gigabytes, which is 1717986918400 bits: it’s virtually impossible for two drives to have the
same number of bits of storage used.) Percentiles would be a good choice here.
g) The variable “rating” is again essentially categorical. The choices are presented as
numbers, but they aren’t “numbers” in the usual sense. If you answered that it’s discrete
quantitative data, that’s OK. (But: any set of categories can be labeled with numbers. Just
because it’s numbers doesn’t mean it’s quantitative.) Percentiles are not a good choice here.