Download Tabulate Qualitative Data

Transcript
Chapter 2
Descriptive Statistics: Organizing, Displaying and Summarizing Data
Objectives
Student should be able to
• Organize data –
Tabulate data into frequency/relative frequency tables
• Display data graphically –
Qualitative data – pie charts, bar charts, Pareto charts.
Quantitative data – histograms, stemplots, dot plots and boxplots.
Describe the shape of the plot.
• Summarize data numerically –
Quantitative data only –
Measure of center – mean, median, midrange, and mode.
Measure of position – quartiles and percentiles.
Measure of spread/variation – range, variance, standard deviation, and inter-quartile range.
• Use TI graphing calculator to obtain statistics.
Tabulate Qualitative Data
Organize Data
Tabulate data into frequency and
relative frequency Tables
Frequency Table
• A simple data set is
blue, blue, green, red, red, blue, red, blue
• A frequency table for this qualitative data
is
Color    Frequency
Blue     4
Green    1
Red      3
• Qualitative data values can be organized
by a frequency distribution
• A frequency distribution lists
– Each of the categories
– The frequency/counts for each category
What Is A Relative Frequency?
• The relative frequencies are the proportions (or
percents) of the observations out of the total
• A relative frequency distribution lists
– Each of the categories
– The relative frequency for each category
Relative frequency = Frequency / Total
• The most commonly occurring color is blue
Relative Frequency Table
• A relative frequency table for this
qualitative data is
Color    Relative Frequency
Blue     .500 (= 4/8)
Green    .125 (= 1/8)
Red      .375 (= 3/8)
• A relative frequency table can also be
constructed with percents (50%, 12.5%,
and 37.5% for the above table)
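The tabulation above can be sketched in a few lines of Python. This is only an illustration using the simple color data set; the function name `frequency_table` is a made-up helper, not part of any course software.

```python
from collections import Counter

def frequency_table(values):
    """Map each category to (frequency, relative frequency)."""
    counts = Counter(values)          # frequency of each category
    total = sum(counts.values())      # total number of observations
    return {cat: (n, n / total) for cat, n in counts.items()}

colors = ["blue", "blue", "green", "red", "red", "blue", "red", "blue"]
table = frequency_table(colors)

# Print one row per category: name, frequency, relative frequency
for color, (freq, rel) in sorted(table.items()):
    print(f"{color:<6} {freq:>2} {rel:.3f}")
```

Multiplying each relative frequency by 100 gives the percent version of the table.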
Tabulate Quantitative Data
• Suppose we recorded the number of customers
served each day for a total of 40 days as below:
• We would like to compute the frequencies and
the relative frequencies
Frequency/Relative Frequency Table
The resulting frequencies and the relative
frequencies:
Display Data graphically
Qualitative data – Bar, Pareto, Pie Charts
Quantitative data – Histograms, Stemplots,
Dot plots
Graphic Display for Qualitative Data
Bar Charts, Pareto Charts, Pie Charts
Note: Always label the axes, provide category and numeric scales,
and title when you present graphs.
Bar and Pie Charts for Qualitative Data
• Bar charts for our simple data (generated with Chart
command in Excel)
– Frequency bar chart
– Relative frequency bar chart
[Figure: Frequency Bar Chart (x-axis: Color; y-axis: Frequency) and Relative Frequency Bar Chart (x-axis: Color; y-axis: Relative Frequency), each with bars for Blue, Green, Red]
Pareto Charts
• A Pareto chart is a particular type of bar graph
• A Pareto differs from a bar chart only in that the
categories are arranged in order
– The category with the highest frequency is placed first
(on the extreme left)
– The second highest category is placed second
– Etc.
• Pareto charts are often used when there are many
categories but only the top few are of interest
Here is a Pareto chart for the simple data set:
Color    Relative Frequency
Blue     0.5
Red      0.375
Green    0.125
[Figure: Pareto Chart (x-axis: Color, in the order Blue, Red, Green; y-axis: Relative Frequency, 0% to 60%)]
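The only extra step a Pareto chart needs over a bar chart is ordering the categories from highest to lowest frequency. A minimal sketch of that ordering, assuming the same simple color data (the helper name `pareto_order` is hypothetical):

```python
from collections import Counter

def pareto_order(values):
    """Return (category, relative frequency) pairs, highest frequency first."""
    counts = Counter(values)
    total = len(values)
    # most_common() lists categories from most to least frequent
    return [(cat, n / total) for cat, n in counts.most_common()]

colors = ["blue", "blue", "green", "red", "red", "blue", "red", "blue"]
pareto = pareto_order(colors)
print(pareto)
```

The resulting list is the left-to-right bar order for the chart.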
Side-by-Side Bar Charts
• Use it to compare multiple bar charts.
• An example side-by-side bar chart comparing
educational attainment in 1990 versus 2003
Pie Charts
A pie chart is used to display qualitative data. It shows
the amount of data that belongs to each category as a
proportional part of a circle.
[Figure: Pie Chart of the color data – Blue, 50%; Red, 38%; Green, 13%]
Notice that Bar charts show the amount of data that belong to each
category as a proportionally sized rectangular area.
Pie Charts
• Another example of a pie chart
Summary
• Qualitative data can be organized in
several ways
– Tables are useful for listing the data, its
frequencies, and its relative frequencies
– Charts such as bar charts, Pareto charts, and
pie charts are useful visual methods for
organizing data
– Side-by-side bar charts are useful for
comparing multiple sets of qualitative data
Graphic Display Quantitative Data
Histograms, Stemplots, Dot Plots
Histogram
A histogram is a bar graph which represents a frequency distribution of
a quantitative variable. It is a term used only for a bar graph of
quantitative data. A histogram is made up of the following
components:
1. A title, which identifies the population of interest
2. A vertical scale, which identifies the frequencies or relative frequencies in the various classes
3. A horizontal scale, which identifies the variable x. Values or ranges of values may be labeled along the x-axis. Use whichever method of labeling the axis best presents the variable.
When you make a graph, make sure you label (give descriptions to)
both axes clearly, and give a title for the graph too.
Histogram for discrete Quantitative
data
• Example of histograms for discrete data
– Frequencies
– Relative frequencies
Note: The term “histogram” is used only for a bar graph that
summarizes quantitative data. The bar chart for qualitative data
cannot be called a histogram. Also, there are no gaps between
bars in a histogram.
Categorize/Group Continuous
Quantitative Data
• Continuous type of quantitative data cannot be
put directly into frequency tables since they do
not have any obvious categories
• Categories are created using classes, or
intervals/ranges of numbers
• The continuous data is then put into the classes
Categorize/Group Continuous
Quantitative Data
• For ages of adults, a possible set of classes is
20 – 29
30 – 39
40 – 49
50 – 59
60 and older
• For the class 30 – 39
– 30 is the lower class limit
– 39 is the upper class limit
• The class width is the difference between two adjacent lower class limits
• For the class 30 – 39, the class width is 40 – 30 = 10
• The class midpoint = average of the lower limits of the two adjacent classes (for the class 30 – 39, the midpoint is (30 + 40) / 2 = 35)
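As a small illustration of these definitions, the sketch below computes the class width and class midpoints from the lower class limits of the age classes. The function names are hypothetical, and the open-ended class "60 and older" gets no midpoint because it has no following lower limit.

```python
def class_width(lower_limits):
    """Class width = difference between adjacent lower class limits."""
    widths = {b - a for a, b in zip(lower_limits, lower_limits[1:])}
    if len(widths) != 1:
        raise ValueError("classes do not all have the same width")
    return widths.pop()

def class_midpoints(lower_limits):
    """Midpoint = average of the lower limits of two adjacent classes."""
    return [(a + b) / 2 for a, b in zip(lower_limits, lower_limits[1:])]

lower_limits = [20, 30, 40, 50, 60]   # 60 starts the open-ended class
width = class_width(lower_limits)
midpoints = class_midpoints(lower_limits)
print(width, midpoints)
```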
Categorize/Group Continuous
Quantitative Data
• All the classes should have the same
widths, except for the last class
• The class “60 and older” is an open-ended class because it has no upper limit
• Classes with no lower limits are also called
open-ended classes
Categorize/Group Continuous
Quantitative Data
• The classes and the number of values in
each can be put into a frequency table
Age             Number (frequency)
20 – 29         533
30 – 39         1147
40 – 49         1090
50 – 59         493
60 and older    110
• In this table, there are 1147 subjects
between 30 and 39 years old
Histogram for continuous
Quantitative data
• Just as for discrete data, a histogram can be created from the
frequency table
• Instead of individual data values, the categories are the classes –
the intervals of data
• You can label/scale the bars with the lower class limits or class
midpoints.
Categorize/Group Continuous
Quantitative Data
• Good practices for constructing tables for
continuous variables
– The classes should not overlap
– The classes should not have any gaps between them
– The classes should have the same width (except for
possible open-ended classes at the extreme low or
extreme high ends)
– The class boundaries should be “reasonable”
numbers
– The class width should be a “reasonable” number
Stemplots
• A stem-and-leaf plot (or simply stemplot) is a
different way to represent data that is similar to a
histogram
• To draw a stem-and-leaf plot, each data value
must be broken up into two components
– The stem consists of all the digits except for the
rightmost one
– The leaf consists of the rightmost digit
– For the number 173, for example, the stem would be
“17” and the leaf would be “3”
Example of a Stemplot
• In the stem-and-leaf plot below
– The smallest value is 56
– The largest value is 180
– The second largest value is 178
Stemplots Construction
• To draw a stem-and-leaf plot
– Write all the values in ascending order
– Find the stems and write them vertically in ascending
order
– For each data value, write its leaf in the row next to its
stem
– The resulting leaves will also be in ascending order
• The list of stems with their corresponding leaves
is the stem-and-leaf plot
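The construction steps can be sketched as follows. This is a minimal illustration for non-negative whole numbers; the data values are hypothetical (not the data behind the slide's plot), and `stemplot` is a made-up helper name.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stem-and-leaf plot for non-negative integers."""
    rows = defaultdict(list)
    for v in sorted(values):              # ascending order keeps leaves sorted
        stem, leaf = divmod(v, 10)        # leaf = rightmost digit, stem = the rest
        rows[stem].append(leaf)
    lines = []
    # List every stem from smallest to largest, even those with no leaves
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(d) for d in rows.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines

data = [173, 156, 162, 168, 171, 175, 180, 159]   # hypothetical values
plot = stemplot(data)
print("\n".join(plot))
```

Stems with no leaves are kept as empty rows so that gaps in the data remain visible, as they would in a histogram.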
Modification to Stemplots
• Modifications to stem-and-leaf plots
– Sometimes there are too many values with
the same stem … we would need to split the
stems (such as having 10-14 in one stem and
15-19 in another)
– If we wanted to compare two sets of data, we
could draw two stem-and-leaf plots using the
same stem, with leaves going left (for one set
of data) and right (for the other set) – a side-by-side stemplot
Dot Plots
• A dot plot is a graph where a dot is placed
over the observation each time it is
observed
• The following is an example of a dot plot
Shapes of Plots for Quantitative Data
• The pattern of variability displayed by the data of a variable is called
its distribution. The distribution displays how frequently each value of the
variable occurs.
• A useful way to describe a quantitative variable is by the shape of its
distribution
• Some common distribution shapes are
– Uniform
– Bell-shaped (or normal)
– Skewed right
– Skewed left
– Bimodal
Uniform Distribution
• A variable has a uniform distribution when
– Each of the values tends to occur with the
same frequency
– The histogram looks flat
Note: We are not concerned about the shapes of the plots for qualitative
data, because there is no particular order arrangement for the
categories of the nominal data. Once we change the order, the shape
of the graph will be changed.
Normal Distribution
• A variable has a bell-shaped (normal)
distribution when
– Most of the values fall in the middle
– The frequencies tail off to the left and to the
right
– It is symmetric
Right-skewed Distribution
• A variable has a skewed right distribution when
– The distribution is not symmetric
– The tail to the right is longer than the tail to the left
– The arrow from the middle to the long tail points right
In other words: The direction of skewness is determined by the side of the distribution
with the longer tail. That is, if a distribution has a longer tail on its right side, it is
called a right-skewed distribution.
Left-skewed Distribution
• A variable has a skewed left distribution when
– The distribution is not symmetric
– The tail to the left is longer than the tail to the right
– The arrow from the middle to the long tail points left
Bimodal Distribution
• There are two peaks/humps or highest points in the distribution.
• Often implies two populations are sampled.
The graph below shows a bimodal distribution for body mass. It
implies that the data come from two populations, each with its own
separate average. Here, one group has an average body mass
of 147 grams and the other has an average body mass of
178 grams.
Summary
• Quantitative data can be organized in
several ways
– Histogram is the most used graphical tool.
– Histograms based on data values are good
for discrete data
– Histograms based on classes (intervals) are
good for continuous data
– The shape of a distribution describes a
variable … histograms are useful for
identifying the shapes
Summarize data numerically
Measure of Center, Spread, and Position
Measure of Center
Mean, Median, Mode, Midrange
Measures of Center
• Numerical values used to locate the
middle of a set of data, or where the data
is most clustered
• The term mean/average is often
associated with the measure of center of a
distribution.
Mean
• An arithmetic mean
• For a population … the population mean
– Is computed using all the observations in a population
– Is denoted by a Greek letter µ (called mu)
– Is a parameter
• For a sample … the sample mean
– Is computed using only the observations in a sample
– Is denoted x̄ (called x bar)
– Is a statistic
Note: We usually cannot measure µ (due to the size of the
population) but would like to estimate its value with a sample
mean x̄
Formula for Means
• The sample mean is the sum of all the values divided by the size of
the sample, n:
$$\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{n}(x_1 + x_2 + \dots + x_n)$$
• The population mean is the sum of all the values divided by the size
of the population, N:
$$\mu = \frac{1}{N}\sum x_i = \frac{1}{N}(x_1 + x_2 + \dots + x_N)$$
Note: ∑ is called “summation” and means summing all values.
It is a short-cut notation for adding a set of numbers.
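A minimal sketch of the sample-mean formula, using the accident counts from the example in this section and the same data with one value changed to 26 to show how a single extreme value pulls the mean. The function name `mean` is just an illustrative helper.

```python
def mean(values):
    """Sum of all the values divided by the number of values."""
    return sum(values) / len(values)

accidents = [8, 9, 3, 5, 2, 6, 4, 5]
with_outlier = [8, 9, 3, 5, 2, 26, 4, 5]   # the value 6 replaced by 26

mean_original = mean(accidents)
mean_outlier = mean(with_outlier)
print(mean_original, mean_outlier)
```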
Example
Example: The following sample data represents the number
of accidents in each of the last 8 years at a dangerous
intersection. Find the mean number of accidents: 8, 9, 3, 5,
2, 6, 4, 5.
Solution:
$$\bar{x} = \frac{1}{8}(8 + 9 + 3 + 5 + 2 + 6 + 4 + 5) = 5.25$$
In the data above, change 6 to 26:
Solution:
$$\bar{x} = \frac{1}{8}(8 + 9 + 3 + 5 + 2 + 26 + 4 + 5) = 7.75$$
Note: The mean can be greatly influenced by outliers (extremely large or
small values)
Median
• The median, denoted by M, of a variable is the
“center”. The median splits the data into halves
• When the data is sorted in order, the median is
the middle value
• The calculation of the median of a variable is
slightly different depending on
– If there are an odd number of points, or
– If there are an even number of points
How to Obtain a Median?
• To calculate the median of a data set
– Arrange the data in order
– Count the number of observations, n
• If n is odd
– There is a value that’s exactly in the middle
– That value is the median
• If n is even
– There are two values on either side of the exact
middle
– Take their mean to be the median
Example
• An example with an odd number of observations
(5 observations)
• Compute the median of
6, 1, 11, 2, 11
• Sort them in order
1, 2, 6, 11, 11
• The middle number is 6, so the median is 6
Example
• An example with an even number of observations (4
observations)
• Compute the median of
6, 1, 11, 2
• Sort them in order
1, 2, 6, 11
• Take the mean of the two middle values
(2 + 6) / 2 = 4
• The median is 4
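The odd and even cases can be sketched in one short function. This is an illustrative helper (Python's standard library also offers `statistics.median`), applied to the two small data sets above.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                 # odd n: one value sits exactly in the middle
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the middle pair

median_odd = median([6, 1, 11, 2, 11])
median_even = median([6, 1, 11, 2])
print(median_odd, median_even)
```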
Example 1
Suppose we want to find the median of the data set
4, 8, 3, 8, 2, 9, 2, 11, 3
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11
2. Find the position of the median using the formula $\frac{n+1}{2}$
For the data given, n is 9 (because the size of the
sample is 9, that is, there are 9 data values given),
so the median position is $\frac{9+1}{2} = 5$
The median is the 5th smallest or 5th largest value, which
is 4.
Mode
The mode of a variable is the most frequently occurring
value.
For instance, find the mode of the data
6, 1, 2, 6, 11, 7, 3
The data contain 6 distinct values:
1, 2, 3, 6, 7, 11
The value 6 occurs twice and all the other values occur
only once, so the mode is 6
Quick Way to Locate Median
1. Rank the data (suppose the sample size is n)
2. Find the position of the median (counting from
either end) using the formula:
$$i = \frac{n+1}{2}$$
Then, the median is the ith smallest value.
Example 2
Consider this data set: 4, 8, 3, 8, 2, 9, 2, 11, 3, 15
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11, 15
2. Find the position of the median using the formula $\frac{n+1}{2}$
For the data given, n is 10 (because the size of the
sample is 10, that is, there are 10 data values given),
so the median position is $\frac{10+1}{2} = 5.5$
The median is the 5.5th smallest or largest value. In other words,
it is in the middle of the 5th and 6th smallest or largest values.
Since the 5th value is 4 and the 6th value is 8, we average 4
and 8, so the median is 6.
Midrange
Another useful measure of the center of the distribution
is the midrange, which is the number exactly midway
between the lowest data value L and the highest data
value H. It is found by averaging the low and the
high values:
$$\text{midrange} = \frac{L + H}{2}$$
Note on the mode: If two or more values in a sample are tied for the
highest frequency (number of occurrences), there is no
mode
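A short sketch of the midrange and the mode, following the convention stated in this section that a tie for the highest frequency means there is no mode. The function names are illustrative, and returning `None` for "no mode" is an assumption of this sketch.

```python
from collections import Counter

def midrange(values):
    """Average of the lowest and highest data values."""
    return (min(values) + max(values)) / 2

def mode(values):
    """Most frequent value, or None when two or more values tie for the top."""
    counts = Counter(values)
    top = max(counts.values())
    winners = [v for v, c in counts.items() if c == top]
    return winners[0] if len(winners) == 1 else None

data = [6, 1, 2, 6, 11, 7, 3]
data_mode = mode(data)
data_midrange = midrange(data)
tied_mode = mode([1, 1, 2, 2])     # 1 and 2 tie, so no mode
print(data_mode, data_midrange, tied_mode)
```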
Comparing mean and Median
• The mean and the median are often
different
• This difference gives us clues about the
shape of the distribution
– Is it symmetric?
– Is it skewed left?
– Is it skewed right?
– Are there any extreme values?
Symmetric Distribution
• If a distribution is symmetric, the data values above and
below the mean will balance
– The mean will be in the “middle”
– The median will be in the “middle”
• Thus the mean will be close to the median, in general,
for a distribution that is symmetric
Mean and Median
• Symmetric – the mean will usually be close to the
median
• Skewed left – the mean will usually be smaller than the
median
• Skewed right – the mean will usually be larger than the
median
Left-skewed Distribution
• If a distribution is skewed left, there will be some data
values that are smaller than the others
– The mean will decrease
– The median will not decrease as much
• Thus the mean will be smaller than the median, in
general, for a distribution that is skewed left
Right-skewed Distribution
• If a distribution is skewed right, there will be some data
values that are larger than the others
– The mean will increase
– The median will not increase as much
• Thus the mean will be larger than the median, in general,
for a distribution that is skewed right
Mean and Median
What if one value in a data set is extremely different
from the others?
For instance, if we made a mistake and
6, 1, 2
was recorded as
6000, 1, 2
• The mean is now ( 6000 + 1 + 2 ) / 3 = 2001
• The median is still 2
• The median is more “resistant to extreme values” than
the mean.
Round-off Rule
When rounding off an answer, a common
rule-of-thumb is to keep one more decimal
place in the answer than was present in
the original data
To avoid round-off buildup, round off only
the final answer, not intermediate steps
Measure of Spread
Range, Variance, Standard Deviation
Measures of Spread/Dispersion
• Measures of central tendency alone cannot
completely characterize a set of data. Two
very different data sets may have similar
measures of central tendency.
• Measures of dispersion are used to describe
the spread, or variability, of a distribution
• Common measures of dispersion: range,
variance, and standard deviation
Range
• The range of a variable is the largest data value minus
the smallest data value
• Compute the range of
6, 1, 2, 6, 11, 7, 3, 3
• The largest value is 11
• The smallest value is 1
• Subtracting the two … 11 – 1 = 10 … the range is 10
Note: Please do not confuse the range with the midrange,
which is a measure of the center of the data distribution
Range
• The range only uses two values in the data set –
the largest value and the smallest value
• The range is affected easily by extreme values
in the data. (i.e., not resistant to outliers)
• If we made a mistake and
6, 1, 2
was recorded as
6000, 1, 2
• The range is now ( 6000 – 1 ) = 5999
Deviations From The Mean
• The variance is based on the deviation from the mean
– $(x_i - \mu)$ for populations
– $(x_i - \bar{x})$ for samples
• A deviation may be positive or negative depending on whether the value is
above the mean or below the mean. So, the sum of all deviations
will be zero. To avoid the cancellation of the positive deviations
and the negative deviations when we add them up, we square the
deviations first:
– $(x_i - \mu)^2$ for populations
– $(x_i - \bar{x})^2$ for samples
Population Variance
• The population variance of a variable is the average of these
squared deviations, i.e., the sum of these squared deviations
divided by the number in the population
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \dots + (x_N - \mu)^2}{N}$$
• The population variance is represented by σ² (namely sigma squared)
Note: For accuracy, use as many decimal places as allowed by your
calculator during the calculation of the squared deviations, if the
average is not a whole number.
Example
• Compute the population variance of
6, 1, 2, 11
• Compute the population mean first
µ = (6 + 1 + 2 + 11) / 4 = 5
• Now compute the squared deviations
(1–5)² = 16, (2–5)² = 9, (6–5)² = 1, (11–5)² = 36
• Average the squared deviations
(16 + 9 + 1 + 36) / 4 = 15.5
• The population variance σ² is 15.5
Sample Variance
• The sample variance of a variable is based on the squared
deviations for the sample data, i.e., it is the sum of these
squared deviations divided by one less than the number
in the sample
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$$
• The sample variance is represented by s²
Note: we use n – 1 as the divisor.
Example
• Compute the sample variance of
6, 1, 2, 11
• Compute the sample mean first
x̄ = (6 + 1 + 2 + 11) / 4 = 5
• Now compute the squared deviations
(1–5)² = 16, (2–5)² = 9, (6–5)² = 1, (11–5)² = 36
• Sum the squared deviations and divide by n – 1
(16 + 9 + 1 + 36) / 3 = 20.7
• The sample variance s² is 20.7
Compare Population and Sample Variances
• Why are the population variance (15.5) and the
sample variance (20.7) different for the same set
of numbers?
• In the first case, { 6, 1, 2, 11 } was the entire
population (divide by N)
• In the second case, { 6, 1, 2, 11 } was just a
sample from the population (divide by n – 1)
• These are two different situations
Computational Formulas for the Sample Variance
A shortcut formula (a quick way to compute) for the sample
variance (because you do not need to compute all the deviations
from the mean):
$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$$
$\sum x^2$ is the sum of the squares of each data value.
$(\sum x)^2$ is the square of the sum of all data values.
For the above example, $\sum x^2 = 6^2 + 1^2 + 2^2 + 11^2 = 162$ and $(\sum x)^2 = (6 + 1 + 2 + 11)^2 = 400$, so
$$s^2 = \frac{162 - \frac{400}{4}}{4 - 1} = 20.7$$
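The three variance calculations can be compared side by side in a short sketch. The function names are illustrative only; the data are the four values { 6, 1, 2, 11 } used in the examples.

```python
def population_variance(values):
    """Average of the squared deviations from the mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

def sample_variance_shortcut(values):
    """Shortcut form: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

data = [6, 1, 2, 11]
pop_var = population_variance(data)
samp_var = sample_variance(data)
short_var = sample_variance_shortcut(data)
print(pop_var, round(samp_var, 1), round(short_var, 1))
```

The definitional and shortcut forms agree up to floating-point rounding, which is why either can be used by hand.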
Why Are Population and Sample
Variances Different?
• Why do we use different formulas?
• The reason is that using the sample mean is not
quite as accurate as using the population mean
• If we used “n” in the denominator for the sample
variance calculation, we would get a “biased”
result
• Bias here means that we would tend to
underestimate the true variance
Standard Deviation
• The standard deviation is the square root of the variance
• The population standard deviation
– Is the square root of the population variance (σ²)
– Is represented by σ
• The sample standard deviation
– Is the square root of the sample variance (s²)
– Is represented by s
Note: Standard deviation can be interpreted as the average deviation
of the data. It has the same measuring unit as the original data
(e.g. inches). The variance has a squared unit (e.g. inches²).
Example
• If the population is { 6, 1, 2, 11 }
– The population variance σ² = 15.5
– The population standard deviation σ = √15.5 = 3.9
• If the sample is { 6, 1, 2, 11 }
– The sample variance s² = 20.7
– The sample standard deviation s = √20.7 = 4.5
• The population standard deviation and the
sample standard deviation apply in
different situations
Compute mean and Variance for A Frequency Distribution
To calculate the mean and variance for a set of
sample data:
• In a grouped frequency distribution, we use the
frequency of occurrence associated with each
class midpoint
• In an ungrouped frequency distribution, use the
frequency of occurrence, f, of each observation
$$\bar{x} = \frac{\sum xf}{\sum f}, \qquad s^2 = \frac{\sum x^2 f - \frac{(\sum xf)^2}{\sum f}}{\sum f - 1}$$
Grouped Data
• To compute the mean, variance, and standard deviation for grouped
data
– Assume that, within each class, the mean of the data is equal to
the class midpoint (which is an average of two adjacent lower
class limits)
– Use the class midpoint as an approximated value for all data in
the same class, since their actual values are not provided.
– The number of times the class midpoint value is used is equal to
the frequency of the class
• For instance, if 6 values are in the interval [ 8, 10 ], then we assume
that all 6 values are equal to 9 (the midpoint of [ 8, 10 ])
Example of Grouped Data
• As an example, for the following frequency table,
Class      Midpoint   Frequency
0 – 1.9    1          3
2 – 3.9    3          7
4 – 5.9    5          6
6 – 7.9    7          1
we calculate the mean as if
– The value 1 occurred 3 times
– The value 3 occurred 7 times
– The value 5 occurred 6 times
– The value 7 occurred 1 time
The calculation for the mean would be
$$\frac{1+1+1+3+3+3+3+3+3+3+5+5+5+5+5+5+7}{17}$$
Or
$$\frac{(1 \times 3) + (3 \times 7) + (5 \times 6) + (7 \times 1)}{17} = 3.6$$
Which follows the formula
$$\bar{x} = \frac{\sum xf}{\sum f}$$
Since
the sample size = $\sum f = 3 + 7 + 6 + 1 = 17$
the sum of squared values = $\sum x^2 f = 1^2 \times 3 + 3^2 \times 7 + 5^2 \times 6 + 7^2 \times 1 = 265$
the square of the sum = $(\sum xf)^2 = (1 \times 3 + 3 \times 7 + 5 \times 6 + 7 \times 1)^2 = 61^2 = 3721$
Following the short-cut formula for the sample variance, we obtain
the sample variance
$$s^2 = \frac{265 - \frac{3721}{17}}{17 - 1} = \frac{265 - 218.88235}{16} = 2.882$$
the sample standard deviation
$$s = \sqrt{2.882} = 1.7$$
Summary
• The mean for grouped data
– Use the class midpoints
– Obtain an approximation for the mean
• The variance and standard deviation for
grouped data
– Use the class midpoints
– Obtain an approximation for the variance and
standard deviation
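The grouped-data calculation can be sketched with the class midpoints and frequencies from the table in this example. The function name `grouped_stats` is a hypothetical helper; it applies the shortcut formula with each midpoint weighted by its frequency.

```python
from math import sqrt

def grouped_stats(midpoints, freqs):
    """Approximate mean, sample variance and standard deviation from a frequency table."""
    n = sum(freqs)                                        # sample size = sum of f
    sum_xf = sum(x * f for x, f in zip(midpoints, freqs))
    sum_x2f = sum(x * x * f for x, f in zip(midpoints, freqs))
    mean = sum_xf / n
    variance = (sum_x2f - sum_xf ** 2 / n) / (n - 1)      # shortcut formula
    return mean, variance, sqrt(variance)

midpoints = [1, 3, 5, 7]
freqs = [3, 7, 6, 1]
g_mean, g_var, g_sd = grouped_stats(midpoints, freqs)
print(round(g_mean, 1), round(g_var, 3), round(g_sd, 1))
```

The same function works for an ungrouped frequency distribution if the observed values are passed in place of the midpoints.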
Example of Ungrouped Data
Example: A survey of students in the first grade at a local school
asked for the number of brothers and/or sisters for each child. The
results are summarized in the table below. Here, we see 15 students
responded 0 siblings, 17 students responded 1 sibling, etc. The total
number of students in this survey is 62, which is $n = \sum f$.
Find 1) the mean, 2) the variance, and 3) the standard deviation:
Solutions:
First:
x       f     xf    x²f
0       15    0     0
1       17    17    17
2       23    46    92
4       5     20    80
5       2     10    50
Sum:    62    93    239
1) $\bar{x} = 93 / 62 = 1.5$
2) $s^2 = \frac{239 - \frac{(93)^2}{62}}{62 - 1} = 1.63$
3) $s = \sqrt{1.63} = 1.28$
Measure of Position
Percentiles, Quartiles
Measures of Position
• Measures of position are used to describe the
relative location of an observation within a data
set.
• Quartiles and percentiles are two of the most
popular measures of position
• Quartiles are part of the 5-number summary
Percentile
• The median divides the lower 50% of the data
from the upper 50%
• The median is the 50th percentile
• If a number divides the lower 34% of the data
from the upper 66%, that number is the 34th
percentile
Quartiles
• Quartiles divide the data set into four equal parts
• The quartiles are the 25th, 50th, and 75th percentiles
– Q1 = 25th percentile
– Q2 = 50th percentile = median
– Q3 = 75th percentile
• Quartiles are the most commonly used percentiles
• The 50th percentile and the second quartile Q2 are both other ways
of defining the median
How to Find Quartiles?
1. Order the data from smallest to largest.
2. Find the median Q2.
3. The first quartile (Q1) is then the median of the lower half of the data;
that is, it is the median of the data falling below the median (Q2)
position (and not including Q2).
4. The third quartile (Q3) is the median of the upper half of the data; that
is, it is the median of the data falling above the Q2 position (not
including Q2).
Note: Excel uses a different set of rules to compute these quartiles than the
TI graphing calculator, which follows the rules stated above. So,
different software may give different quartiles, particularly if the
sample size is odd. However, for a large data set, the
values are often not much different. In our class, we will only follow
the rules stated here.
Example
The following data represents the pH levels of a random sample of
swimming pools in a California town. Find the three quartiles.
5.6    6.0    6.7    7.0
5.6    6.1    6.8    7.3
5.8    6.2    6.8    7.4
5.9    6.3    6.8    7.4
6.0    6.4    6.9    7.5
Solutions:
1) Median = Q2 = the average of the 10th and 11th smallest values = (6.4 + 6.7)/2 = 6.55
2) The first quartile = Q1 = the median of the 10 values below the median
= the average of the 5th and 6th smallest values = (6.0 + 6.0)/2 = 6.0
3) The third quartile = Q3 = the median of the 10 values above the median
= the average of the 15th and 16th smallest values = (6.9 + 7.0)/2 = 6.95
Outliers
• Extreme observations in the data are
referred to as outliers
• Outliers should be investigated
• Outliers could be
– Chance occurrences
– Measurement errors
– Data entry errors
– Sampling errors
• Outliers are not necessarily invalid data
How To Detect Outliers?
• One way to check for outliers uses the quartiles
• Outliers can be detected as values that are
significantly too high or too low, based on the
known spread
• The fences used to identify outliers are
– Lower fence = LF = Q1 – 1.5 × IQR
– Upper fence = UF = Q3 + 1.5 × IQR
• Values less than the lower fence or more than
the upper fence could be considered outliers
Example
• Is the value 54 an outlier?
1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54
• Calculations
– Q1 = 7 (the median of the lower seven values)
– Q3 = 27 (the median of the upper seven values)
– IQR = 27 – 7 = 20
– UF = Q3 + 1.5 × IQR = 27 + 1.5 × 20 = 57
• Using the fence rule, the value 54 is not an
outlier
Another Measure of the Spread
Inter-quartile Range (IQR)
• The inter-quartile range (IQR) is the difference
between the third and first quartiles
IQR = Q3 – Q1
• The IQR is a resistant measurement of spread.
Its value will not be affected easily by extremely
large or small values in a data set, since the IQR
covers only the middle 50% of values.
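The quartile rule stated in this section (Q1 and Q3 are the medians of the lower and upper halves, excluding the median position when n is odd) and the fence rule can be sketched together. The function names are illustrative, and the sketch assumes at least two data values.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1, Q2, Q3 using the median-of-halves rule (assumes len(values) >= 2)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]        # values below the median position
    upper = ordered[-half:]       # values above the median position
    return median(lower), median(ordered), median(upper)

def fences(values):
    """Lower and upper fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54]
q1, q2, q3 = quartiles(data)
lower_fence, upper_fence = fences(data)
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q2, q3, lower_fence, upper_fence, outliers)
```

Other software may use different quartile rules and so report slightly different fences for the same data.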
Another Graphical Tool to Summarize Data
Five-number Summary & Boxplot
Five-number Summary
• The five-number summary is the collection
of
– The smallest value
– The first quartile (Q1 or P25)
– The median (M or Q2 or P50)
– The third quartile (Q3 or P75)
– The largest value
• These five numbers give a concise
description of the distribution of a variable
Why These Five Numbers?
• The median
– Information about the center of the data
– Resistant measure of a center
• The first quartile and the third quartile
– Information about the spread of the data
– Resistant measure of a spread
• The smallest value and the largest value
– Information about the tails of the data
Example
• Compute the five-number summary for the
ordered data:
1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54
• Calculations
– The minimum = 1
– Q1 = P25 = 7
– M = Q2 = P50 = (16 + 19) / 2 = 17.5
– Q3 = P75 = 27
– The maximum = 54
• The five-number summary is
1, 7, 17.5, 27, 54
How to draw A Boxplot?
To draw a (basic) boxplot:
1. Calculate the five-number summary
2. Draw and scale a horizontal number line which will cover all the
data from the minimum to the maximum
3. Mark the 5 numbers on the number line according to the scale.
4. Copy these five marked points some distance
above the number line.
5. Draw a box with the left edge at Q1 and the right edge at Q3
6. Draw a line inside the box at M = Q2
7. Draw a horizontal line from the Q1 edge of the box to the
minimum and one from the Q3 edge of the box to the maximum
A Modified Boxplot
• An example of a more sophisticated boxplot is
• The middle box shows Q1, Q2, and Q3
• The horizontal lines (sometimes called
“whiskers”) show the minimum and maximum
• The asterisk on the right shows an outlier
(determined by using the upper fence)
Boxplot
• The five-number summary can be
illustrated using a graph called the boxplot
• An example of a (basic) boxplot is
• The middle box shows Q1, Q2, and Q3
• The horizontal lines (sometimes called
“whiskers”) show the minimum and
maximum
Example
• To draw a (basic) boxplot
Draw the middle box
Draw in the median
Draw the minimum and maximum
Voila!
How To Draw A Modified Boxplot?
To draw a modified boxplot
1. Draw the center box and mark the median, as before
2. Compute the upper fence and the lower fence
3. Temporarily remove the outliers as identified by the
upper fence and the lower fence (but we will add
them back later with asterisks)
4. Draw the horizontal lines to the new minimum and
new maximum (These are the minimum and
maximum within the fence)
5. Mark each of the outliers with an asterisk
Note: Sometimes, data contain no outliers. You will obtain
a basic boxplot.
Example
• To draw this boxplot
– Draw the middle box and the median
– Draw in the fences, remove the outliers (temporarily)
– Draw the minimum and maximum
– Draw the outliers as asterisks
Interpret a Boxplot
• The distribution shape and boxplot are related
– Symmetry (or lack of symmetry)
– Quartiles
– Maximum and minimum
• Relate the distribution shape to the boxplot for
– Symmetric distributions
– Skewed left distributions
– Skewed right distributions
Symmetric Distribution
• Q1 is equally far from the median as Q3 is
– Boxplot: The median line is in the center of the box
• The min is equally far from the median as the max is
– Boxplot: The left whisker is equal to the right whisker
Left-skewed Distribution
• Q1 is further from the median than Q3 is
– Boxplot: The median line is to the right of center in the box
• The min is further from the median than the max is
– Boxplot: The left whisker is longer than the right whisker
Right-skewed Distribution
• Q1 is closer to the median than Q3 is
– Boxplot: The median line is to the left of center in the box
• The min is closer to the median than the max is
– Boxplot: The left whisker is shorter than the right whisker
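The comparisons above can be turned into a small check on a five-number summary. The sketch below is only an illustration: the function name `describe_shape`, the tolerance `tol` used to decide when two distances count as "equal", and the way the two comparisons are combined are all assumptions, not rules from the slides.

```python
def describe_shape(minimum, q1, med, q3, maximum, tol=0.1):
    """Guess the shape of a distribution from its five-number summary.

    Two comparisons are made, as in the slides:
      - Q1-to-median distance versus median-to-Q3 distance
      - min-to-median distance versus median-to-max distance
    `tol` (an assumed parameter) is the relative difference below
    which two distances are treated as equal.
    """
    def longer_side(left, right):
        total = left + right
        if total == 0 or abs(left - right) <= tol * total:
            return "equal"
        return "left" if left > right else "right"

    sides = {
        longer_side(med - q1, q3 - med),
        longer_side(med - minimum, maximum - med),
    }
    if "right" in sides and "left" not in sides:
        return "skewed right"
    if "left" in sides and "right" not in sides:
        return "skewed left"
    if sides == {"equal"}:
        return "roughly symmetric"
    return "mixed"


# Five-number summary from the earlier example: 1, 7, 17.5, 27, 54
print(describe_shape(1, 7, 17.5, 27, 54))
```

For that summary the box is nearly balanced around the median, but the maximum is much further from the median than the minimum is, so the sketch reports a right skew.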
Side-by-side Boxplot
• We can compare two distributions by
examining their boxplots
• We draw the boxplots on the same
horizontal scale
– We can visually compare the centers
– We can visually compare the spreads
– We can visually compare the extremes
Example
Comparing the “flight” with the “control” samples
Center
Spread
Summary
• 5-number summary
– Minimum, first quartile, median, third quartile,
maximum
– Resistant measures of center (median) and spread
(interquartile range)
• Boxplots
– Visual representation of the 5-number summary
– Related to the shape of the distribution
– Can be used to compare multiple distributions
Using Technology for Statistics
Instruction for TI Graphing Calculator
Entering Data into TI Calculator
Enter data in lists: Press STAT, then choose the EDIT menu. (We’ll
denote this sequence of keystrokes by STAT EDIT.) Enter the
data one value at a time (press ENTER after each entry) in a blank
column, which represents a variable (a list).
Note:
1. Clear a list: on the EDIT screen, use the up arrow to place the cursor on the list name, press
CLEAR, then ENTER (that is, CLEAR ENTER). Always clear a list before entering
a new set of data into it.
Warning! Pressing the DEL key instead of CLEAR will delete the list from the calculator. You can
get it back with the INS key. See "Insert a new list" below.
2. List name: there are six built-in lists, L1 through L6, and you can add more with your own
names. You can get the L1 symbol by pressing the 2ND key, then the 1 key [ 2nd 1 ]. (The instruction
in the brackets shows the sequence of keys you need to press; here, you press the 2ND key, then the 1
key, to get the L1 symbol.)
3. Insert a new list (optional): STAT EDIT, use the up arrow to place the cursor on a list name,
then press INS [ 2nd DEL ]. Type the name of the list using the alpha character keys. The ALPHA
key is locked down for you. Press ENTER. The new list is placed just before the point where the
cursor was. To obtain quick statistics, just enter the data in one of the built-in lists L1 through L6;
you do not need to create a new named list.
Obtain Numeric Measures from TI Calculator
1. After entering data, return to the home screen by
pressing QUIT [2nd MODE].
2. Press the STAT key, select the CALC menu, then
choose operation number 1: 1-Var Stats,
then ENTER. Enter the name of the list, say L1.
That is,
STAT CALC 1 ENTER L1
Note: L1 is the default list. You do not need
to enter it if the data are in L1.
Obtain Statistics from a Frequency Distribution
• Enter the values in one list, say L1, and their
corresponding frequencies in another list, say L2.
Then,
STAT CALC 1 ENTER L1, L2
Note: You need to enter a comma and L2 after L1. The calculator
will use the second list as the frequencies for the values
in the first list when calculating the
statistics.
Example 1
Example: A random sample of students in a sixth grade class was selected.
Their weights are given in the table below. Find the mean, variance,
standard deviation, and 5-number summary for this data using the TI calculator:
63 64 76 76 81 83 85 86 88 89 90 91 92 93 93 93
94 97 99 99 99 101 108 109 112
The output shows:
x̄ = 90.44
∑x = 2261
∑x² = 208083
Sx = 12.244...
σx = 11.996...
n = 25
minX = 63
Q1 = 84
Med = 92
Q3 = 99
maxX = 112
Note:
1. Since this is sample data, we take Sx as
the standard deviation.
2. You may need to press the arrow key
on the calculator several times to view
all of these statistics.
Example 2
Consider the grouped data we considered previously:
Class      Midpoint   Frequency
0 – 1.9    1          3
2 – 3.9    3          7
4 – 5.9    5          6
6 – 7.9    7          1
Use TI calculator to obtain the statistics:
The output shows:
x̄ = 3.588..
∑x = 61
∑x² = 265
Sx = 1.697..
σx = 1.647..
n = 17
minX = 1
Q1 = 3
Med = 3
Q3 = 5
maxX = 7
Note: Here, the notations used in the
calculator correspond to the notations
used in the formula for computing mean,
variance and standard deviation of a
frequency distribution:
n = ∑f
∑x = ∑xf
∑x² = ∑x²f
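The two calculator outputs can be cross-checked with a short Python sketch. This is not how the calculator is implemented. The function name `one_var_stats` and the dictionary keys are hypothetical, and the quartile rule (medians of the lower and upper halves, leaving out the overall median when n is odd) is an assumption that happens to reproduce the values shown in both examples.

```python
from math import sqrt
from statistics import median


def one_var_stats(values, freqs=None):
    """Mimic the 1-Var Stats summary for a list, optionally with frequencies."""
    if freqs is None:
        freqs = [1] * len(values)
    # Repeat each value according to its frequency, then sort.
    xs = sorted(x for x, f in zip(values, freqs) for _ in range(f))
    n = len(xs)                              # n = sum of f
    sum_x = sum(xs)                          # sum of x*f
    sum_x2 = sum(x * x for x in xs)          # sum of x^2*f
    ss = sum_x2 - sum_x ** 2 / n             # sum of squared deviations
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return {
        "mean": sum_x / n,
        "sum_x": sum_x,
        "sum_x2": sum_x2,
        "Sx": sqrt(ss / (n - 1)),            # sample standard deviation
        "sigma_x": sqrt(ss / n),             # population standard deviation
        "n": n,
        "minX": xs[0],
        "Q1": median(lower),
        "Med": median(xs),
        "Q3": median(upper),
        "maxX": xs[-1],
    }


# Example 1: the 25 weights
weights = [63, 64, 76, 76, 81, 83, 85, 86, 88, 89, 90, 91, 92, 93, 93, 93,
           94, 97, 99, 99, 99, 101, 108, 109, 112]
print(one_var_stats(weights))

# Example 2: class midpoints (as L1) with frequencies (as L2)
print(one_var_stats([1, 3, 5, 7], [3, 7, 6, 1]))
```

Passing the midpoints with their frequencies plays the same role as entering L1, L2 on the calculator: each midpoint is counted as many times as its frequency.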