Basic Statistics Review: Quartiles and Box Plots
Quartiles are dividers that sort data into four groups of approximately equal size: Q1 (the lowest 25% of the
data lies below it), Q2 (the lowest 50% lies below it; Q2 is also the median), and Q3 (the lowest 75% lies below
it, so the upper 25% lies above it).
           Q1             Q2            Q3
Lower 25%  │  Second 25%  │  Third 25%  │  Upper 25%
You can think of Q1 as the “median” of the values below Q2, and similarly of Q3 as the “median” of the values
above Q2. One note on quartiles: the method shown in this textbook is the most common for introductory courses.
However, upon close inspection you can see that this method does not guarantee exactly the same number of
observations in all four groups. There are many other methods that get closer to this goal but add complexity
to the calculations. Therefore, you may (will) get different values for Q1 and Q3 depending on what software
(formula) is used.
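As a small illustration of this point, Python's standard-library function statistics.quantiles offers two
interpolation methods, neither of which is the rule used in this handout. The sketch below uses illustrative
sample values and simply prints both results so they can be compared; it is not a statement about which
method is correct.

```python
import statistics

# Illustrative sample values (an assumption for this sketch)
sample = [10, 11, 12, 14, 15, 17, 19]

# n=4 asks for the three cut points that split the data into quartiles
exclusive = statistics.quantiles(sample, n=4, method="exclusive")
inclusive = statistics.quantiles(sample, n=4, method="inclusive")

print("exclusive method:", exclusive)
print("inclusive method:", inclusive)
```

Both methods agree on the median here, but they give different values for Q1 and Q3.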
How to find the quartiles:
Step 1: Sort the observations from smallest to largest.
Step 2: To find the location of each quartile use:
Location = n(P/100), where P is the percentile (25, 50, or 75) and n = sample size

If Location is NOT an integer - round Location up to the next integer (4.2, 4.5, 4.7 would all round up to
5); the quartile is the data value at that position, X_Location.

If Location is an integer - average the two data values X_Location and X_(Location+1). The quartile is computed
using this formula: (X_Location + X_(Location+1)) / 2
Example #1: Suppose you have this data set:
14 15 10 11 19 12 17
The first step is to sort from smallest to largest:
10 11 12 14 15 17 19
To find the quartiles use:
For Q1, location = 7(25/100) = 1.75; round up to 2, so Q1 = second observation = 11
For Q2, location = 7(50/100) = 3.5; round up to 4, so Q2 = fourth observation = 14
For Q3, location = 7(75/100) = 5.25; round up to 6, so Q3 = sixth observation = 17
Example #2: Suppose you have this data set:
14 15 10 11 19 12 17 24
The first step is to sort from smallest to largest:
10 11 12 14 15 17 19 24
To find the quartiles:
For Q1, location = 8(25/100) = 2. Since 2 is an integer, add the 2nd and 3rd observations and divide by 2 to get
Q1 = (11 + 12)/2 = 11.5
For Q2, location = 8(50/100) = 4. Since 4 is an integer, add the 4th and 5th observations and divide by 2 to get
Q2 = (14 + 15)/2 = 14.5
For Q3, location = 8(75/100) = 6. Since 6 is an integer, add the 6th and 7th observations and divide by 2 to get
Q3 = (17 + 19)/2 = 18
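The location rule and both examples above can be checked with a short Python sketch. The function name
quartile_by_location is made up for this illustration, and the sketch assumes P is one of 25, 50, or 75 and
that the data set has at least one value.

```python
def quartile_by_location(data, percentile):
    """Quartile via the rule Location = n * (P / 100) described above."""
    values = sorted(data)                      # Step 1: sort ascending
    n = len(values)
    # Split n * P / 100 into whole part and remainder using exact integer math
    whole, remainder = divmod(n * percentile, 100)
    if remainder:
        # Location is not an integer: round up to position whole + 1 (1-based),
        # which is index `whole` in Python's 0-based indexing
        return values[whole]
    # Location is an integer: average positions Location and Location + 1 (1-based)
    return (values[whole - 1] + values[whole]) / 2


example_1 = [14, 15, 10, 11, 19, 12, 17]
example_2 = [14, 15, 10, 11, 19, 12, 17, 24]

for name, data in (("Example #1", example_1), ("Example #2", example_2)):
    quartiles = [quartile_by_location(data, p) for p in (25, 50, 75)]
    print(name, quartiles)
```

Example #1 yields 11, 14, 17 and Example #2 yields 11.5, 14.5, 18.0, matching the hand calculations.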
Five Point Summary of a Sample Data Set.
In a five point summary, the following five characteristic numbers are used to summarize the data.
1. The minimum data value
2. First Quartile (Q1)
3. Median (Q2)
4. Third Quartile (Q3)
5. The maximum data value
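As a minimal sketch, the five numbers can be collected in one Python function. The names five_number_summary
and quartile_by_location are hypothetical helpers written for this illustration; the quartiles follow the
location rule from this handout, so other software may report different values for Q1 and Q3.

```python
def quartile_by_location(data, percentile):
    """Quartile via Location = n * (P / 100), for P in 25, 50, 75."""
    values = sorted(data)
    whole, remainder = divmod(len(values) * percentile, 100)
    if remainder:                                   # non-integer location: round up
        return values[whole]
    return (values[whole - 1] + values[whole]) / 2  # integer location: average two values


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a non-empty data set."""
    return (
        min(data),
        quartile_by_location(data, 25),
        quartile_by_location(data, 50),
        quartile_by_location(data, 75),
        max(data),
    )


# Data set from Example #1
print(five_number_summary([14, 15, 10, 11, 19, 12, 17]))
```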
Box and Whiskers Display (Box Plot)
This is an approach to graphically summarizing data that allows you to study it by quartile groupings.
How to construct a box and whiskers display:
1. Compute the five point summary (smallest, Q1, Q2, Q3, largest)
2. Compute the interquartile range.
The interquartile range contains the middle fifty percent of the data.
IQR = Q3 - Q1
3. Compute the upper and lower fences. The fences will be a distance of 1.5 times the IQR below and above
Q1 and Q3 respectively. Note: Fences help determine if there are outliers or extreme values in the data.
These fences are typically not shown in the box plot.
The lower fence is Q1 - 1.5(IQR)
The upper fence is Q3 + 1.5(IQR)
4. Construct the horizontal axis (number line) and place the five number summary on it (the axis should
extend past the smallest and largest data values). Next draw the box with ends at Q1 and Q3. Next mark
the median (Q2) with a vertical line within the box.
5. NOTE: The ends of the whiskers can represent several possible alternative values depending on what
textbook (or software!) you use, among them:
a) the lowest and highest observations still within the fences
b) the minimum and maximum of all of the data
c) one standard deviation above and below the mean of the data
d) the 9th percentile and the 91st percentile
e) the 2nd percentile and the 98th percentile.
6. To draw the whiskers we will use method (a) above. The lower dashed line (whisker) starts at the lower end
of the box and continues to the smallest value in the data set that is MORE THAN the lower fence value.
Likewise, the upper dashed line (whisker) starts at the upper end of the box and continues to the largest
value in the data set that is LESS THAN the upper fence value. Optional: add a small vertical line at the
end of each whisker.
7. Add asterisks (*) to represent any data points less than the lower fence or larger than the upper fence.
Consider all *’s outside the fences as possible outliers requiring attention.
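Steps 2, 3, 6, and 7 can be sketched in Python as follows. The function name box_plot_stats is made up for
this illustration, and it takes Q1 and Q3 as inputs rather than computing them. One assumption is not settled
by the steps above: a value lying exactly on a fence is treated here as inside the fences.

```python
def box_plot_stats(data, q1, q3):
    """Compute IQR, fences, whisker ends (method a), and possible outliers."""
    iqr = q3 - q1                                   # step 2: interquartile range
    lower_fence = q1 - 1.5 * iqr                    # step 3: fences
    upper_fence = q3 + 1.5 * iqr
    # Assumption: values exactly on a fence count as inside
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
    return {
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "lower_whisker": min(inside),               # step 6: whisker ends
        "upper_whisker": max(inside),
        "outliers": outliers,                       # step 7: points to mark with *
    }


# Data set and quartiles from Example #2 (Q1 = 11.5, Q3 = 18)
stats = box_plot_stats([10, 11, 12, 14, 15, 17, 19, 24], q1=11.5, q3=18)
for key, value in stats.items():
    print(key, "=", value)
```

For this data set the fences are 1.75 and 27.75, so the whiskers end at 10 and 24 and no points are flagged.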
Example: Study the following example carefully as it explains how to calculate the percentiles.
Consider the following sample data set. Note that it is arranged in ascending order.
5 5.4 6.3 8.2 8.7 9.3 9.5 9.8 10.5 11.6 11.8 12.1 12.6 20
Find the five number summary:
Smallest = 5
Q1: location = (percentile/100)n = (25/100)14 = 3.5. Since 3.5 is not an integer we round up to 4 and look for
the number in the 4th position in the ordered data, which is 8.2. About twenty-five percent of the data is below 8.2.
Q2: location = (percentile/100)n = (50/100)14 = 7 (notice that Q2 is also the median). Since 7 is an integer we
take the 7th and 8th numbers in the ordered data, add them, and divide by 2 to find the answer, which is (9.5 +
9.8)/2 = 9.65. Fifty percent of the data is below 9.65.
Q3: location = (percentile/100)n = (75/100)14 = 10.5. Since 10.5 is not an integer we handle it as we did the
first quartile and round up to 11. The number in the 11th position in the ordered data is 11.8. Therefore about
seventy-five percent of the data is below 11.8.
Largest = 20
To draw the box plot first draw the horizontal axis to scale and place the five number summary on it. Next draw
the box with ends at quartile 1 and 3 and mark the median or Q2.
[Figure: number line marked at 5, Q1 = 8.2, Q2 = 9.65, Q3 = 11.8, and 20, with a box drawn from Q1 to Q3 and a
vertical line at Q2]
Now we need to find the interquartile range and the fences.
Interquartile range (IQR): The interquartile range contains the middle fifty percent of the data.
IQR = Q3 - Q1 = 11.8 - 8.2 = 3.6
Fences are markers we can use to help determine if there are outliers or extreme values in the data. If there
are, it may be appropriate to throw out some or all of the outliers before continuing to study the data. Fences
are sometimes called limits.
Fences = 1.5(IQR) above Q3 and below Q1
1.5(IQR) = 1.5(3.6) = 5.4
Upper fence: 5.4 above Q3 = 11.8 + 5.4 = 17.2
Lower fence: 5.4 below Q1 = 8.2 - 5.4 = 2.8
Note: there is one potential outlier above 17.2 (the value 20)
Note: there are no potential outliers below 2.8
Here’s how our box plot looks (typically you do not actually show the fences on the box plot). The dashed lines
represent the “whiskers” and show the range for the data within the fences. There is one potential outlier
designated with an asterisk (*) outside the upper fence.
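[Figure: box plot with a box from Q1 = 8.2 to Q3 = 11.8, the median Q2 = 9.65 marked inside the box, a lower
whisker from 8.2 down to 5, an upper whisker from 11.8 up to 12.6, and an asterisk (*) at 20]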
NOTE: If a fifteenth observation (23) were added, you would have the same value for Q1 but different values
for Q2 and Q3:
5 5.4 6.3 8.2 8.7 9.3 9.5 9.8 10.5 11.6 11.8 12.1 12.6 20 23
For Q1, location = (25/100)15 = 3.75, so Q1 is still the fourth number = 8.2
For Q2, location = (50/100)15 = 7.5, so Q2 is the eighth number = 9.8
For Q3, location = (75/100)15 = 11.25, so Q3 is the twelfth number = 12.1
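The effect of the extra observation can be reproduced with a short Python sketch. The helper name
quartile_by_location is hypothetical; it implements the location rule from this handout for P = 25, 50, 75.

```python
def quartile_by_location(data, percentile):
    """Quartile via Location = n * (P / 100), for P in 25, 50, 75."""
    values = sorted(data)
    whole, remainder = divmod(len(values) * percentile, 100)
    if remainder:                                   # non-integer location: round up
        return values[whole]
    return (values[whole - 1] + values[whole]) / 2  # integer location: average two values


fourteen = [5, 5.4, 6.3, 8.2, 8.7, 9.3, 9.5, 9.8, 10.5, 11.6, 11.8, 12.1, 12.6, 20]
fifteen = fourteen + [23]                           # add the fifteenth observation

for label, data in (("n = 14", fourteen), ("n = 15", fifteen)):
    q1, q2, q3 = (quartile_by_location(data, p) for p in (25, 50, 75))
    print(label, "Q1 =", q1, "Q2 =", q2, "Q3 =", q3)
```

With 14 observations Q2 is an average of two data values, while with 15 observations every quartile is a
single data value, which is why Q2 and Q3 shift while Q1 stays the same.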