Download (qq) Plots - Online Stats Book

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
8. Advanced Graphs
A. Q-Q Plots
B. Contour Plots
C. 3D Plots
273
Quantile-Quantile (q-q) Plots
by David Scott
Prerequisites
• Chapter 1: Distributions
• Chapter 1: Percentiles
• Chapter 2: Histograms
• Chapter 4: Introduction to Bivariate Data
• Chapter 7: Introduction to Normal Distributions
Learning Objectives
1. State what q-q plots are used for.
2. Describe the shape of a q-q plot when the distributional assumption is met.
3. Be able to create a normal q-q plot.
Introduction
The quantile-quantile or q-q plot is an exploratory graphical device used to check
the validity of a distributional assumption for a data set. In general, the basic idea
is to compute the theoretically expected value for each data point based on the
distribution in question. If the data indeed follow the assumed distribution, then the
points on the q-q plot will fall approximately on a straight line.
Before delving into the details of q-q plots, we first describe two related
graphical methods for assessing distributional assumptions: the histogram and the
cumulative distribution function (CDF). As will be seen, q-q plots are more general
than these alternatives.
Assessing Distributional Assumptions
As an example, consider data measured from a physical device such as the spinner
depicted in Figure 1. The red arrow is spun around the center, and when the arrow
stops spinning, the number between 0 and 1 is recorded. Can we determine if the
spinner is fair?
274
lots
11/7/10 1:09 PM
u=0.876
0
0.9
0.1
0.8
0.2
●
0.7
0.3
Figure 1. A physical device that gives samples from a uniform distribution.
If the spinner is fair, then these numbers should follow a uniform distribution. To
0.6
0.4
investigate whether the spinner is fair, spin the arrow n times, and record the
0.5
measurements by {μ 1, μ2, ..., μn}. In this example we collect n = 100 samples. The
If the spinner is fair, then these numbers should follow a uniform distribution. To
investigate whether the spinner is fair, spin the arrow n times, and record the
measurements by {μ , μ , ..., μ }. In this example, we collect n = 100 samples. The
histogram provides a useful visualization of these uniform
data. In distribution.
Figure 2, we display
three different
histograms
on
a probability
scale.
Theahistogram
should be flat for a
the spinner
is fair,
then
these
numbers should
follow
uniform distribuFigure 2. Three If
histograms
of a sample
of 100
uniform
points.
uniform
butn the
visual
perception
varies depending
tion. Spinsample,
the arrow
times,
and record
the measurements
by {u1 ,on
u2 , whether
. . . , un }. the
Collect n =has
10010,
samples.
provides a useful
of
histogram
5, or 3 The
bins.histogram
The last histogram
looks visualization
flat, but the other
two
these
data.
In
Figure
2,
we
display
three
di↵erent
histograms
on
a
probhistograms are not obviously flat. It is not clear which histogram we should base
ability scale. The histogram should be flat for a uniform sample, but the
our
conclusion on.
visual perception varies whether the histogram has 10, 5, or 3 bins. The last
histogram provides a useful visualization of these data. In Figure 2, we display three
different histograms on a probability scale. The histogram should be flat for a uniform
sample, but the visual perception varies depending on whether the histogram has 10, 5,
1
2
n
or 3 bins. The last histogram looks flat, but the other two histograms are not obviously
Figure 1: A physical device that gives samples from a
flat. It is not clear which histogram we should base our conclusion on.
histogram looks flat, but the other two histograms are not obviously flat.
1.4
1.2
Probability
Alternatively, we might use the cumulative distribution function (CDF), which is
denoted by F(μ).1.0
The CDF gives the probability that the spinner gives a value less than
.8
/onlinestatbook.com/2/advanced_graphs/q-q_plots.html
Page 2 of 13
.6
.4
.2
0
0
.2
.4
.6
.8
1
0
.2
.4
u
.6
u
.8
1
0
.2
.4
.6
.8
1
u
Figure 2: Three histograms of a sample of 100 uniform points.
Figure 2. Three histograms of a sample of 100 uniform points.
2
275
Alternatively, we might use the cumulative distribution function (CDF), which is
denoted by F(μ). The CDF gives the probability that the spinner gives a value less
than or equal to μ, that is, the probability that the red arrow lands in the interval [0,
atively, we might use the cumulative distribution function (CDF),
μ]. By simple arithmetic, F(μ) = μ, which is the diagonal straight line y = x. The
denoted by CDF
F (u).
The
CDF
gives data
the isprobability
that the
based
upon
the sample
called the empirical
CDFspinner
(ECDF), is denoted
alue less than
or
equal
to
u,
that
is,
the
probability
that
the
red
by
ds in the interval
[0, u]. By simple arithmetic, F (u) = u, which is
Alternatively, we might use the cumulative distribution function (CDF),
nal straightwhich
line isydenoted
= x. by
The
CDF
based
upon
the sample
data
is
F (u).
The CDF
gives
the probability
that the
spinner
gives a(ECDF),
value less than
or equal toby
u, F̂
that
is, the
probability
that to
the be
red
empirical CDF
is denoted
and
is defined
n (u),
arrow lands in the interval [0, u]. By simple arithmetic, F (u) = u, which is
and isthan
defined
be theto
fraction
of the
f the data less
or toequal
u; that
is, data less than or equal to μ; that is,
the diagonal straight line y = x. The CDF based upon the sample data is
called the empirical CDF (ECDF), is denoted by F̂n (u), and is defined to be
ui 
u to u; that is,
fraction of the data less#than
or equal
F̂n (u) =
.
n
# ui  u
.
n
staircase appearance.
F̂n (u) =
, the ECDF takes on a ragged
In general,
general,
the
onon
a ragged
appearance.
In
theECDF
ECDF
takes
a2,ragged
staircase
appearance.
e spinner sample
analyzed
intakes
Figure
we staircase
computed
the ECDF and
For the spinner sample analyzed in Figure 2, we computed the ECDF and
For
spinner
analyzed
in Figure 2,ECDF
we computed
the ECDF and
ch are displayed
in the
Figure
3.sample
Inin the
left
frame,
appears
CDF, which
are displayed
Figure
3. In
the left the
frame, the ECDF
appears
which
are
in in
Figure
3. In In
the
left frame,
the
ECDF
appears
close to
close
to
the line
= x, middle
shown
the middle
frame.
In
the right
frame,we
we
he line y =CDF,
x, shown
inydisplayed
the
frame.
the
right
frame,
the
line
y = verify
x,two
shown
the
middle
frame.
rightquite
frame,
these two
overlay
these
curvesinand
verify
that
they In
arethe
indeed
closewe
tooverlay
each
ese two curves
and
that
they
are
indeed
quite
close
to
each
other. Observe
thatthat
we they
do not
to specify
the number
bins as
with that we do
curves
and verify
areneed
indeed
quite close
to eachofother.
Observe
bserve that the
wehistogram.
do not need to specify the number of bins as with
not need to specify the number of bins as with the histogram.
ram.
1
Empirical
CDF
.8
Probability
Empirical
CDF
Theoretical
CDF
Overlaid
CDFs
Theoretical
CDF
.6
Overlaid
CDFs
.4
.2
0
0
.2
.4
.6
.8
1 0
.2
.4
u
.2
.4
.6
.6
.8
1 0
.2
.4
u
.6
.8
1
u
Figure 3: The empirical and theoretical cumulative distribution functions of
Figure
3. The empirical and theoretical cumulative distribution functions of
a sample of 100 uniform points.
a sample of 100 uniform points.
.8
u
1 0
.2
.4
.6
.8
1 0
u
3
.2
.4
.6
.8
1
u
q-q plot for uniform data
The empirical
and theoretical cumulative distribution functions of
The q-q plot for uniform data is very similar to the empirical cdf graphic,
of 100 uniform
exceptpoints.
with the axes reversed. The q-q plot provides a visual comparison of
276
3
q-q plot for uniform data
The q-q plot for uniform data is very similar to the empirical CDF graphic, except
with the axes reversed. The q-q plot provides a visual comparison of the sample
quantiles to the corresponding theoretical quantiles. In general, if the points in a qq plot depart from a straight line, then the assumed distribution is called into
question.
Here we define the qth quantile of a batch of n numbers as a number ξq such
that a fraction q x n of the sample is less than ξq, while a fraction (1 - q) x n of the
sample is greater than ξq. The best known quantile is the median, ξ0.5, which is
located in the middle of the sample.
Consider a small sample of 5 numbers from the spinner
µ1 = 0.41, µ2 = 0.24, µ3 = 0.59, µ4 = 0.03, µ5 =0.67.
Q-Q plots
Based upon our description of the spinner, we expect a uniform distribution to
model these data. If the sample data were “perfect,” then on average there would
be an observation in the middle of each of the 5 intervals: 0 to .2, .2 to .4, .4 to .6,
and so on. Table 1 shows the 5 data points (sorted in ascending order) and the
theoretically expected value of each based on the assumption that the distribution
is uniform (the middle of the interval).
11/7/10 1:09 P
on. Table
1 shows the
data points
(sorted
in ascending order) and the theoretically
Table
1. Computing
the5Expected
Quantile
Values.
expected value of each based on the assumption that the distribution is uniform (the
middle of the interval).
Data (µ)
Rank (i)
Middle of the
ith Interval
Table 1. Computing the Expected Quantile Values.
0.03
Data (μ)
0.24
.03
.24
0.41
.41
0.59
.59
.67
0.67
1 Middle of the ith
0.1
Rank ( i)
Interval
1
2
3
4
5
.1
.3
.5
.7
.9
2
3
4
5
0.3
0.5
0.7
0.9
The theoretical and empirical CDF's are shown in Figure 4 and the q-q plot is shown in
the left frame of Figure 5.
277
Figure 4: The theoretical and empirical CDFs of a small sample of 5
uniform points, together with the expected values of the 5 points
The theoretical and empirical CDFs are shown in Figure 4 and the q-q plot is
shown in1the left frame of Figure 5. 1
Theoretical
CDF
1
.8
Probability
.8
Empirical
CDF
●
1
Theoretical
CDF
.8
.6
●
.8
.6
.6
.6
.4
.4
1
.8
Empirical
●
CDF
●
.6
●
●
.4
.4
.4
●
●
.2
.2
.2
.2
.2
●
0
.2
.4
.6
.8
0
10
u
●
.2
0
●
●
●
●
.4.2 .6 .4 .8 .6 1
u
u
0
●
.8 0
●
1.2
0
●
●
.40
●
u
●
●
●
●
.6 .2 .8 .4 1
u
0
.6 0
● ●
●●
.8.2
●
.4
1
●
.6
●
.8
1
u
Figure
of of
a small
sample
of 5 together
uniform
Figure4.4:The
Thetheoretical
empirical and
CDFempirical
of a smallCDFs
sample
5 uniform
points,
The empirical
CDF
of together
a small
sample
of5 5points
uniform
points,
together
points,
with
values
of the
points
dots in
with
the expected
values
ofthe
theexpected
(red dots
in 5
the
right(red
frame).
thethe
right
expected values of
5 frame).
points (red dots in the right frame).
In general, we consider the full set of sample quantiles to be the sorted
Indata
general,
valueswe consider the full set of sample quantiles to be the sorted data values
neral, we consider the full set of sample quantiles to be the sorted
u(1) < u(2) < u(3) < · · · < u(n 1) < u(n) ,
es
µ(1) < µ(2) < µ(3) < ··· < µ(n-1) < µ(n) ,
the<parentheses
indicate the data have been ordered.
u(1)where
< u(2)
u(3) < · · ·in<the
u(nsubscript
1) < u(n) ,
Roughly speaking, we expect the first ordered value to be in the middle of
where
the parentheses
in
subscript
indicate
data have
been
ordered.
e parentheses
in interval
the
subscript
thetodata
been
ordered.
the
(0, 1/n),indicate
thethesecond
be
inhave
thethe
interval
(1/n,
2/n),
and the
Roughly
speaking,
we
expect
the
first
ordered
value
to
be
in
the
middle
the
to be the
in the
middle
of thevalue
interval
Thus we
speaking, welast
expect
first
ordered
to((n
be in1)/n,
the1).middle
of take asofthe
(0, 1/n),
second
to be in the middle of the interval (1/n, 2/n), and the
theoretical
quantile
theinvalue
val (0, 1/n),interval
the second
tothebe
the
interval (1/n, 2/n), and the
last to be in the middle of the interval ((ni - 1)/n,
1). Thus, we take as the theoretical
0.5 we
e in the middle of the interval ((n 1)/n,
1).
Thus
take as the
⇠q = q ⇡
,
quantile the value
n
al quantile the value
where q corresponds to the ith ordered sample value. We subtract the quani 0.5
tity 0.5 so⇠qthat
= qwe⇡are exactly
, in the middle of the interval ((i 1)/n, i/n).
These ideas are depictedn in the right frame of Figure 4 for our small sample
of size n = 5.
orresponds to the ith ordered sample value. We subtract the quanWeq corresponds
are now prepared
define
thesample
q-q plot
precisely.
First, we
where
to thetoithof
ordered
value.
We
subtract
the compute
quantity 0.5
o that we are
exactly
in
the
middle
the
interval
((i
1)/n,
i/n).
the n expected values of the data, which we pair with the n data points
so thatinwe
areright
exactly
in the middle
of the
interval
((i - 1)/n,
i/n). These ideas are
as are depicted
the
frame
4 for
our small
sample
sorted in
ascending
order. of
ForFigure
the uniform
density,
the q-q
plot is composed
depicted
the right
frame of Figure 4 for our small sample of size n = 5.
of the ninordered
pairs
= 5.
We are now prepared
to define
plot precisely. First, we compute the n
◆ the q-qFirst,
✓ q-q plot
e now prepared to
define the
precisely.
we compute
i
0.5
expected values of the data, ,which
we
pair
with
the
n. .data
u
,
for
i
=
1,
2,
. , n .points sorted in
(i)
pected values of the data, which
n we pair with the n data points
ascending order. For the uniform density, the q-q plot is composed of the n ordered
ascending order. For the uniform density, the q-q plot is composed
This definition is slightly di↵erent from the ECDF, which includes the points
pairs
ordered pairs(u(i) , i/n). In the left frame of Figure 5, we display the q-q plot of the 5
◆
✓
i 0.5
, u(i) ,
for i = 1, 2, . . .5, n .
n
278
We are now prepared to define the q-q plot precisely. First, we compute
he n expected values of the data, which we pair with the n data points
orted in ascending order. For the uniform density, the q-q plot is composed
of the n ordered pairs
◆
✓
i 0.5
, u(i) ,
for i = 1, 2, . . . , n .
n
This definition is slightly di↵erent from the ECDF, which includes the points
definition
slightlyofdifferent
which
the points
(u(i)5,
u(i) , i/n). This
In the
left is
frame
Figurefrom
5, the
we ECDF,
display
theincludes
q-q plot
of the
i/n). In the left frame of Figure 5, we display the q-q plot of the 5 points in Table 1.
Inpoints
the right
two frames of Figure 5,frames
we display
the q-q plot of the same batch of
in Table 1. In the right two
of Figure 5, we display the q-q plot
5
numbers
usedbatch
in Figure
2. In theused
finalin
frame,
we2.add
linewe
y =add
x as a
of the same
of numbers
Figure
In the
thediagonal
final frame
point
of reference.
the diagonal
line y = x as a point of reference.
1
●●●
●●
●●
●●●
.8
Sample quantile
●●●
●●
●●
●●●
●●
●●●
●
●
●
●●●
●●
●●
●●
●●
●●
●
●●
●●●●
●
●●
●●●●
●●●
●
●
.6
●●
●●●
●
●
●
●●●
●●
●●
●●
●
●●
●●
●
●●
●●●●
●
●●
●●●●
●●●
●
●
●
●
.4
●
.2
0
●●
●●●●
●●●
●●●
●●
●●
●●
●●●
●
●●●
●●●
●●●●●
●
●
●●
●
●●
●
●
●
●
●●
●●●
●
0
●
●●
●●●●
●●●
●●●
●●
●●
●●
●●●
●
●●●
●●●
●●●●●
●
●
.2
.4
.6
.8
Theoretical quantile
1
0
●
●●
●
●●
●
●
●
●
●●
●●●
.2
.4
.6
.8
Theoretical quantile
1 0
.2
.4
.6
.8
Theoretical quantile
1
Figure
Figure5.5:(Left)
(Left)q-q
q-qplot
plotofofthe
the55uniform
uniform points.
points. (Right)
(Right)q-q
q-q plot
plot of
of aa sample
sample
of
100
uniform
points.
of 100 uniform points.
The sample
should
be taken
account
when
judging
close
The sample
size size
should
be taken
intointo
account
when
judging
howhow
close
thethe
q-q plot
is to the
straight
line.two
Weother
showuniform
two other
uniform
samples
of size
isq-q
to plot
the straight
line.
We show
samples
of size
n = 10
and n =
n
=
10
and
n
=
1000
in
Figure
6.
Observe
that
the
q-q
plot
when
n
=
1000
1000 in Figure 6. Observe that the q-q plot when n = 1000 is almost identical to the
is almost identical to the line y = x, while such is not the case when the
line y = x, while such is not the case when the sample size is only n = 10.
sample size is only n = 10.
1
●
●
Sample quantile
.8
●
●
.6
●
.4
●
●
●
●
.2
0
.2
.4
.6
.8
Theoretical quantile
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0
.2
.4
.6
.8
Theoretical quantile
1
Figure 6: q-q plot of a sample of 10 and 1000 uniform points.
What if the data are in fact not uniform? In Figure 7 we show the
279
The sample size should be taken into account when judging how close the
q-q plot is to the straight line. We show two other uniform samples of size
n = 10 and n = 1000 in Figure 6. Observe that the q-q plot when n = 1000
is almost identical to the line y = x, while such is not the case when the
sample size is only n = 10.
1
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Sample quantile
.8
●
●
●
.6
●
.4
●
●
●
●
.2
0
.2
.4
.6
.8
Theoretical quantile
0
.2
.4
.6
.8
Theoretical quantile
1
Figure 6. q-q plots of a sample of 10 and 1000 uniform points.
Figure 6: q-q plot of a sample of 10 and 1000 uniform points.
In Figure 7, we show the q-q plots of two random samples that are not uniform. In
ifof the
are samples
in fact not
In Figure
we show
the
q-qWhat
plotsexamples,
two data
random
thatuniform?
are
notthe
uniform.
In 7both
examples,
both
the sample
quantiles
match
theoretical
quantiles
only at the
the sample quantiles match the theoretical quantiles only at the median and
median and at the extremes. Both samples seem to be symmetric around the
6 be symmetric around the median. But
at the extremes. Both samples seem to
median.
But
the
data
in
the
left
frame
are closer to the median than would be
the data in the left frame are closer to the median than would be expected
expected
if theuniform.
data were
uniform.
in frame
the right
arefrom
further
if the
data were
The
data inThe
thedata
right
areframe
further
thefrom the
median
than
would
expected
if the
datawere
were
uniform.
median
than
would
be be
expected
if the
data
uniform.
1
●
●
●
●
●
●
●
●
●
●
●
●
●
Sample quantile
.8
.6
.4
.2
0
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0
.2
.4
.6
.8
Theoretical quantile
1
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0
.2
.4
.6
.8
Theoretical quantile
1
Figure 7. q-q plots of two samples of size 1000 that are not uniform.
Figure 7: q-q plots of two samples of size 1000 that are not uniform.
In fact, the data were generated in the R language from beta distributions with
In fact, the adata
generated
in the
beta In
distributions
parameters
= b were
= 3 on
the left and
a =Rblanguage
= 0.4 on from
the right.
Figure 8 we display
with parameters a = b = 3 on the left and a = b = 0.4 on the right. In Figure
histograms of these two data sets, which serve to clarify the true shapes of the
8 we display histograms of these two data sets, which serve to clarify the true
densities.
are clearly
shapes
of theThese
densities.
These non-uniform.
are clearly non-uniform.
200
200
100
Frequency
Frequency
150
150
100
280
In fact, the data were generated in the R language from beta distributions
with parameters a = b = 3 on the left and a = b = 0.4 on the right. In Figure
8 we display histograms of these two data sets, which serve to clarify the true
shapes of the densities. These are clearly non-uniform.
200
200
Frequency
Frequency
150
100
50
150
100
50
0
0
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
Figure 8. Histograms of the two non-uniform data sets.
Figure 8: Histograms of the two non-uniform data sets.
q-q plot for normal data
7
The definition of the q-q plot may
be extended to any continuous density. The q-q
plot will be close to a straight line if the assumed density is correct. Because the
cumulative distribution function of the uniform density was a straight line, the q-q
plot was very easy to construct. For data that are not uniform, the theoretical
quantiles must be computed in a different manner.
Let {z1, z2, ..., zn} denote a random sample from a normal distribution
with mean μ = 0 and standard deviation σ = 1. Let the ordered values be
denoted by
z(1) < z(2) < z(3) < ... < z(n-1) < z(n).
These n ordered values will play the role of the sample quantiles.
Let us consider a sample of 5 values from a distribution to see how they
compare with what would be expected for a normal distribution. The 5 values in
ascending order are shown in the first column of Table 2.
281
Table 2. Computing the Expected Quantile Values for Normal Data.
of the Two Non-Uniform Data Sets.
Data (z)
Rank (i)
Middle of the ith
Interval
-1.96
1
0.1
-1.28
-0.78
2
0.3
-0.52
0.31
3
0.5
0.00
1.15
4
0.7
0.52
1.62
5
0.9
1.28
z
Just as in the case of the uniform distribution, we have 5 intervals. However, with a
normal distribution the theoretical quantile is not the middle of the interval but
rather the inverse of the normal distribution for the middle of the interval. Taking
the first interval as an example, we want to know the z value such that 0.1 of the
area in the normal distribution is below z. This can be computed using the Inverse
Normal Calculator as shown in Figure 9. Simply set the “Shaded Area” field to the
middle of the interval (0.1) and click on the “Below” button. The result is -1.28.
Therefore, 10% of the distribution is below a z value of -1.28.
282
igure 9. Example of the inverse normal calculator for finding a value
of the expected quantile from a normal distribution.
Figure 9. Example of the Inverse Normal Calculator for finding a value of
e q-q plot for the data
Table 2 isquantile
shown infrom
the left
frame distribution.
of Figure 11.
theinexpected
a normal
In general, what should we take as the corresponding theoretical quanitiles? Let the
mulative distribution
function
thedata
normal
density
beshown
denoted
the of Figure
The q-q
plot forofthe
in Table
2 is
in by
theΦ(z).
left In
frame
11.
In general, what should we take as the corresponding theoretical quantiles?
is the qth quantile of a normal distribution, then
Let the cumulative distribution function of the normal density be denoted by Φ(z).
In the previous example, Φ(-1.28) = 0.10 and Φ(0.00) = 0.50. Using the quantile
"(#q)= q.
notation, if ξq is the qth quantile of a normal distribution, then
evious example, Φ(-1.28) = 0.10 and Φ(0.50) = 0.50. Using the quantile notation, if
at is, the probability a normal sample is less than ξq is in fact just q.
Φ(ξvalue
q) =z q.
Consider the first ordered
(1) . What might we expect the value of Φ(z (1) ) to
? Intuitively, we expect this probability to take on a value in the interval (0, 1/n).
That is, the probability a normal sample is less than ξ is in fact just
ewise, we expect Φ(z (2) ) to take on a value in the interval (1/n, 2/n). qContinuing, we
q.
Consider the first ordered value, z(1). What might we expect the value of
pect Φ(z (n) ) to fall in the interval ((n - 1)/n, 1/n). Thus the theoretical quantile we
Φ(z(1)) to be? Intuitively, we expect this probability to take on a value in the
sire is defined by the inverse (not reciprocal) of the normal CDF. In particular, the
interval (0, 1/n). Likewise, we expect Φ(z(2)) to take on a value in the interval (1/n,
om/2/advanced_graphs/q-q_plots.html
Page 8 ofThus,
13
2/n). Continuing, we expect Φ(z(n)) to fall in the interval ((n - 1)/n, 1/n).
the
theoretical quantile we desire is defined by the inverse (not reciprocal) of the
normal CDF. In particular, the theoretical quantile corresponding to the empirical
quantile z(i) should be
283
he interval (1/n, 2/n). Continuing, we expect (z(n) ) to fall in the interval
n 1)/n, 1/n). Thus the theoretical quantile we desire is defined by the
verse (not reciprocal) of the normal CDF. In particular, the theoretical
uantile corresponding to the empirical quantile z(i) should be
✓
◆
i
0.5
1
for i = 1, 2, . . . , n .
n
he empirical CDF and theoretical quantile construction for the small sample
ve in TableThe
2 are
displayed
in Figure
10.quantile
For the
larger sample
of size
100,
empirical
CDF and
theoretical
construction
for the small
sample
he first few given
expected
quantiles
are -2.576,
-2.170,
-1.960.
in Table
2 are displayed
in Figure
10. Forand
the larger
sample of size 100, the
first few
expected quantiles 1are -2.576, -2.170, and1 -1.960.
1
Theoretical
CDF
1.8
Probability
Probability
Theoretical
CDF
.8.6
1.8
9 Empirical
CDF
1.8
.8.6
Empirical
CDF
.8.6
●
● ●
● ●
.6.4
.6.4
.6.4
.4.2
.4.2
.4.2
.2 0
.2 0
.2 0
● ●
● ●
−2
−1
0
z
0
−2
−1
1
2
●
0
0
z
1
2
●
−2
●
−1
●
●
−2
●
0
z●
−1
●
1
2
●
0
z
●
●
−2
0
●
1
●
●
2
●
−1
−2
●
●
−1
0
z
●
●
1
●
0
z
2
●
1
2
Figure 10: The empirical CDF of a small sample of 5 normal points, together
Figure
10. The empirical
of a small(red
sample
of 5the
normal
points,
together
with
the
valuesCDF
ofCDF
the
dots
right
frame).
Figure
10:expected
The empirical
of 5a points
small sample
of 5innormal
points,
together
the expected
the 5(red
points
dotsright
in theframe).
right frame).
with the with
expected
values ofvalues
the 5 of
points
dots(red
in the
In the left frame of Figure 11, we display the q-q plot of small normal
In the left frame of Figure 11, we display the q-q plot of the small normal sample
sample
in Table
2. The11,
remaining
frames
in Figure
display
the
In thegiven
left frame
of Figure
we display
the q-q
plot of 11
small
normal
given
in
Table
2.
The
remaining
frames
in
Figure
11
display
the
q-q
plots
q-q plots
of normal
random
samples
of sizeframes
n = 100
and n =
Asthe
theof normal
sample
given
in Table
2. The
remaining
in Figure
111000.
display
random
samples
ofrandom
size
=samples
100 in
and
n =q-q1000.
As
sample
sizeline
increases,
sample
increases,
thenpoints
the
liethe
closer
the
= x the
q-q
plotssize
of normal
of size
nplots
= 100
and
n to
= 1000.
Asy the
in the
q-q
closerintothe
theq-q
lineplots
y = x.
aspoints
the sample
size plots
n increases.
sample
size
increases,
theliepoints
lie closer to the line y = x
as the sample size n increases.
●
3
●
●
●●
●●●
●
n=5
32
n = 100
Sample quantile
Sample quantile
●
21
●
n=5
●
●●
0−1
● ●
−1−2
●
●
−2−3
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
n = 100
●
●
10
n = 1000
●
●
●●●
●●
●●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●●
●●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
● ●
●
●
●
● ●
●● ●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●
●
n = 1000
●
●●
●●●
●
●
−3
−3
−2
−1
0
1
2
3
−3
−2
−1
0
1
2
−2
−1
0
1
2
3
−3
−2
−1
0
1
2
Theoretical
quantile
Figure 11. q-q plots of normal
data.
●
● ●
3
−3
−2
−1
0
1
2
3
●
Theoretical quantile
−3
●
●●
●●
●
3
−3
−2
−1
0
1
2
3
Figure 11: q-q plots of normal data.
As before, a normal
q-q plot
canplots
indicate
departures
Figure
11: q-q
of normal
data.from normality. The two most
common
examples
areq-q
skewed
dataindicate
and data
with heavy
tailsnormality.
(large kurtosis).
As before,
a normal
plot can
departures
from
The In
twoAsmost
common
examples
arecan
skewed
datadepartures
and data with
tails (large
before,
a normal
q-q plot
indicate
fromheavy
normality.
The
kurtosis).
In Figure
12 we show
normaldata
q-q and
plotsdata
for with
a chi-squared
(skewed)
two
most common
examples
are skewed
heavy tails
(large
data set and
a Student’s-t
(kurtotic)
set, both
size n = 1000.
The
kurtosis).
In Figure
12 we show
normaldata
q-q plots
for aofchi-squared
(skewed)
dataset
were
first
The red data
line isset,
again
y =ofx.size
Notice
particular
data
and
a standardized.
Student’s-t (kurtotic)
both
n = in
1000.
The
thatwere
the data
from the t distribution
follow
the ynormal
curve in
fairly
closely
data
first standardized.
The red line
is again
= x. Notice
particular
284
4
Figure 12 we show normal q-q plots for a chi-squared (skewed) data set and a
Student’s-t (kurtotic) data set, both of size n = 1000. The data were first
standardized. The red line is again y = x. Notice, in particular, that the data from
the t distribution follow the normal curve fairly closely until the last dozen or so
points on each extreme.
●
●●●
3
4
−2
Sample quantile
Sample quantile
−1
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
χ (8)
●
0
0
−2
−1
−4
−3
●
●
−4
●
−3
−3
−2
●
−2
−3
●
t5
2
0
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●●
●
●
2
1
●
t5
4
4
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
● ●●
2
1
●
●
●●
●
●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●
●
●
2
3
2
0
χ2(8)
●
−2
−2
−1
0
−1
0
1
1
Figure 12. q-q plots for
2
3
4
Theoretical quantile
−4
−2
0
2
4
3
4
−4
−2
Theoretical
quantile
standardized non-normal data (n =
0
●
2
2
4
1000).
Figure 12: q-q plots for standardized non-normal data (n = 1000).
q-q plots for normal data with general mean and scale
Figure
12:plots
q-q plots
standardized
non-normal
data
(n = 1000).
5 q-q
for for
normal
data with
general
mean
5
Our previous discussion of q-q plots for normal data all assumed that our data were
and scale
standardized.
One approach to constructing q-q plots is to first standardize the data
proceed as described
previously.
AnThe
alternative
constructofthe plot
Whyand
didthen
we standardize
the data in
Figure 12?
q-q plot isis to
comprised
directly
the n
pointsfrom raw data.
✓ section
✓ we present
◆
In this
a◆general approach for data that are not
i 0.5
1
standardized. Why did we standardize
the
in2,Figure
, z(i)
for data
i = 1,
. . . , n 12?
. The q-q plot is
n
didcomprised
we standardize
the
data
in
Figure
12?
The
q-q plot is comprised
of the n points
q-q plots for normal data with general mean
and scale
Why
the original data {zi } are normal, but have an arbitrary mean µ and stanthe n Ifpoints
dard deviation
✓ , then
✓ the line y◆= x will◆not match the expected theoretical
quantile. Clearly the
transformation
i 0.5
1 linear
, z(i)
for i = 1, 2, . . . , n .
n
µ + · ⇠q
of
th
provide
the {z
q }
theoretical quantile on the transformed scale. In pracIf the would
original
data
i are normal, but have an arbitrary mean µ and stantice, with a new data set
dard deviation
, then
line
y = xbutwill
not
match the
theoretical
If the original
data the
{zi} are
have
an arbitrary
meanexpected
μ and standard
{xnormal,
1 , x2 , . . . , x n } ,
deviation
then linear
the line transformation
y = x will not match the expected theoretical quantile.
quantile.
Clearlyσ,the
the normal q-q plot would consist of the n points
Clearly, the✓linear✓transformation
◆
◆
1
i
0.5
n
· ⇠iq= 1, 2, . . . , n .
, x(i)µ + for
285
wouldInstead
provide
the q ththetheoretical
scale. In pracof plotting
line y = x as quantile
a referenceon
line,the
the transformed
line
tice, with a new data set
y = x̄ + s · x
µ + σ ξq
would provide the qth theoretical quantile on the transformed scale. In practice,
with a new data set
{x1,x2,...,xn} ,
the normal q-q plot would consist of the n points
Instead of plotting the line y = x as a reference line, the line
y = M + s · x
should be composed, where M and s are the sample moments (mean and standard
deviation) corresponding to the theoretical moments μ and σ. Alternatively, if the
data are standardized, then the line y = x would be appropriate, since now the
sample mean would be 0 and the sample standard deviation would be 1.
Example: SAT Case Study
The SAT case study followed the academic achievements of 105 college students
majoring in computer science. The first variable is their verbal SAT score and the
second is their grade point average (GPA) at the university level. Before we
compute inferential statistics using these variables, we should check if their
distributions are normal. In Figure 13, we display the q-q plots of the verbal SAT
and university GPA variables.
286
academic achievements of 105 college students. The first variable is their
verbal SAT score and the second is their grade point average (GPA) at the
university level. Before we compute a correlation coefficient, we should check
if the distributions of the two variables are both normal. In Figure 13, we
display the q-q plots of the verbal SAT and university GPA variables.
●
●
Verbal SAT
700
●
●
●
3.75
●
●●
●
●●●
●●●●●
●●
GPA
3.50
●
Sample quantile
650
3.25
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
600
●
●
●
2.75
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●●●
●
●
●
●
2.50
●●
●●
●●
●●●
●●●
● ●
500
●
●
●
●
●
●
3.00
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
550
●● ●
●●●
●
●●●
●●●●
●
●●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
2.25
●
● ●●
●●●●
●●●
●●●
●
●
●
●
●
−2
−1
0
1
2
Theoretical quantile
−2
−1
0
1
2
Figure 13. q-q plots for the student data (n = 105).
Figure 13: q-q plots for the student data (n = 105).
The verbal
reasonably
well,
except in
The
verbal SAT
SAT seems
seems to
tofollow
followaanormal
normaldistribution
distribution
reasonably
well,
the in
extreme
tails. However,
the university
GPA variable
is highlyisnon-normal.
except
the extreme
tails. However,
the university
GPA variable
highly
Compare
the
GPA
q-q
plot
to
the
simulation
in
the
right
frame
of
Figure
non-normal. Compare the GPA q-q plot to the simulation in the right frame7. These
figures 7.are These
very similar,
for the
regionexcept
wherefor
x ≈the
-1. To
follow
these ideas,
of Figure
figuresexcept
are very
similar,
region
where
x ⇡we1.computed
To followhistograms
these ideas,
of thediagram
variables
ofwe
thecomputed
variables histograms
and their scatter
inand
Figure 14.
theirThese
scatter
diagram
in Figure
14. These
tell quite GPA
a di↵erent
story.with about
figures
tell quite
a different
story.figures
The university
is bimodal,
The20%
university
GPA is bimodal,
with
about 20%
of the
students
into scatter
of the students
falling into
a separate
cluster
with
a gradefalling
of C. The
a separate cluster with a grade of C. The scatter diagram is quite unusual.
diagram is quite unusual. While the students in this cluster all have below average
While the students in this cluster all have below average verbal SAT scores,
verbal
SAT
scores,
there are
as low
many
students
lowGPA’s
SAT scores
whose GPAs
there
are as
many
students
with
SAT
scoreswith
whose
were quite
were quite
We might
as todi↵erent
the cause(s):
different
respectable.
Werespectable.
might speculate
as to speculate
the cause(s):
majors,
di↵erent
distractions,
different
study
but it wouldBut
onlyobserve
be speculation.
study
habits, but
it would
onlyhabits,
be speculation.
that the But
raw observe
correlation
verbal SAT
and verbal
GPA isSAT
a rather
high is
0.65,
but when
we but
that the between
raw correlation
between
and GPA
a rather
high 0.65,
when we exclude the cluster, the correlation for the remaining 86 students falls a
12
little to 0.59.
287
exclude the cluster, the correlation for the remaining 86 students falls a little
to 0.59.
30
25
25
3.5
●
20
15
15
10
10
5
5
0
0
gpa
Frequency
Frequency
20
3.0
2.5
●
●
●
●
●
●●
● ●
●
●
● ●
●●
●
● ● ●●
●●
●
●●
●
● ● ●●
●
●
● ●
●●
●
450
500
550
600
650
700
verbal
750
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●●
●
●●
● ● ● ●●
● ● ●●
●
●
● ●
●
●●●
●●
●
●
●
●●
● ●
●
●
● ●
●
●
●
●●
●
●●
●
●
●
●
●
2.0
2.5
3.0
3.5
4.0
gpa
500
550
600
650
700
verbal
Figure 14. Histograms and scatter diagram of the verbal SAT and GPA
Figure 14: Histograms and scatter diagram of the verbal SAT and GPA
for the
105 students.
variablesvariables
for the 105
students.
Discussion
6Parametric
Discussion
modeling usually involves making assumptions about the shape of data,
or the shape
of residuals
from
a regression
fit.assumptions
Verifying such
assumptions
Parametric
modeling
usually
involves
making
about
the shape can take
but anofexploration
of the
shape using
and assumpq-q plots is very
ofmany
data, forms,
or the shape
residuals from
a regression
fit.histograms
Verifying such
tions
can take
forms,
but
exploration
of the
shape using
effective.
Themany
q-q plot
does
notanhave
any design
parameters
suchhistograms
as the number of
and
are very e↵ective. The q-q plot does not have any design pabinsq-q
forplots
a histogram.
rametersInsuch
as the number
of bins
histogram.
an advanced
treatment,
thefor
q-qa plot
can be used to formally test the null
In an advanced course, the q-q plot can be used to formally test the
hypothesis that the data are normal. This is done by computing the correlation
null hypothesis that the data are normal. This is done by computing the
coefficientcoefficient
of the n points
q-qin
plot.
upon n, the
nulln,hypothesis
is
correlation
of theinn the
points
theDepending
q-q plot. Depending
upon
the
rejected
if the iscorrelation
is less
than a threshold.
The
threshold is
null
hypothesis
rejected if coefficient
the correlation
coefficient
is less than
a threshold.
already
quite is
close
to 0.95
forclose
modest
sample
The
threshold
already
quite
to 0.95
for sizes.
modest sample sizes.
We have
seenseen
thatthat
the the
q-q q-q
plotplot
for for
uniform
data
is very
close
related
to to the
We have
uniform
data
is very
closely
related
the
empirical
cumulativedistribution
distribution
function.For
For
generaldensity
densityfunctions,
functions,the soempirical
cumulative
function.
general
the
so-called
probability
integral
transform
a random
variable
andit to the
called
probability
integral
transform
takes atakes
random
variable
X and X
maps
maps it to the interval (0, 1) through the CDF of X itself, that is,
interval (0, 1) through the CDF of X itself, that is,
Y = FX(X)
Y = FX (X)
which has been shown to be a uniform density. This explains why the q-q plot
onwhich
standardized
is to
always
close to density.
the lineThis
y = explains
x when why
the model
has been data
shown
be a uniform
the q-qisplot on
correct.
standardized data is always close to the line y = x when the model is correct.
Finally, scientists have used special graph paper for years to make relationships
13
linear (straight lines). The most common example used to be semi-log paper, on
which points following the formula y = aebx appear linear. This follows of course
since log(y) = log(a) + bx, which is the equation for a straight line. The q-q plots
288
may be thought of as being “probability graph paper” that makes a plot of the
ordered data values into a straight line. Every density has its own special
probability graph paper.
289
Contour Plots
by David Lane
Prerequisites
• none
Learning Objectives
1. Describe a contour plot.
2. Interpret a contour plot
Contour plots portray data for three variables in two dimensions. The plot contains
a number of contour lines. Each contour line is shown in an X-Y plot and has a
constant value on a third variable. Consider the Figure 1 that contains data on the
fat, non-sugar carbohydrates, and calories present in a variety of breakfast cereals.
Each line shows the carbohydrate and fat levels for cereals with the same number
of calories. Note that the number of calories is not determined exactly by the fat
and non-sugar carbohydrates since cereals also differ in sugar and protein.
290
10
9
8
200
7
175
Fat
6
5
4
125
3
100
2
75
1
150
150
0
10
20
30
40
50
Carbohydrates
Figure 1. A contour plot showing calories as a function of fat and
carbohydrates.
An alternative way to draw the plot is shown in Figure 2. The areas with the same
number of calories are shaded.
291
cereal contour: Contour Plot of Calories
Page 1 of 2
Contour Plot for Calories
10
9
8
200
7
175
Fat
6
5
4
125
3
100
2
75
1
150
150
0
10
20
30
40
50
Carbohydrates
Figure 2. A contour plot showing calories as a function of fat and
carbohydrates with areas shaded. An area represents values less than
or equal to the label to the right of the area.
292
3D Plots
by David Lane
Prerequisites
• Chapter 4: Introduction to Bivariate Data
Learning Objectives
1. Describe a 3D Plot.
2. Give an example of the value of a 3D plot.
Just as two-dimensional scatter plots show the data in two dimensions, 3D plots
show data in three dimensions. Figure 1 shows a 3D scatter plot of the fat, nonsugar carbohydrates, and calories from a variety of cereal types.
Figure 1. A 3D scatter plot showing fat, non-sugar carbohydrates, and
calories from a variety of cereal types.
293
Many statistical packages allow you to rotate the axes interactively to view the data
from a different vantage point. Figure 2 is an example.
Figure 2. An alternative 3D scatter plot showing fat, non-sugar
carbohydrates, and calories.
A fourth dimension can be represented as long as it is represented as a nominal
variable. Figure 3 represents the different manufacturers by using different colors.
294
Figure 3. The different manufacturers are color coded.
Interactively rotating 3D plots can sometimes reveal aspects of the data not
otherwise apparent. Figure 4 shows data from a pseudo random number generator.
Figure 4 does not show anything systematic and the random number generator
appears to generate data with properties similar to those of true random numbers.
295
Figure 4. A 3D scatter plot showing 400 values of X, Y, and Z from a pseudo
random number generator.
Figure 5 shows a different perspective on these data. Clearly they were not
generated by a random process.
296
Figure 5. A different perspective on the 3D scatter plot showing 400 values
of X, Y, and Z from a pseudo random number generator.
Figures 4 and 5 are reproduced with permission from R snippets by Bogumil
Kaminski.
297
Statistical Literacy
by David M. Lane
Prerequisites
• Chapter 8: Contour Plots
This web page portrays altitudes in the United States.
What do you think?
What part of the state of Texas (North, South, East, or West) contains the highest
elevation?
West Texas
298
Exercises
1. What are Q-Q plots useful for?
2. For the following data, plot the theoretically expected z score as a function of
the actual z score (a Q-Q plot).
0
0.5
0.8
1.3
2.1
0
0.6
0.9
1.4
2.1
0
0.6
1
1.4
2.1
0
0.6
1
1.5
2.1
0
0.6
1.1
1.6
2.1
0
0.6
1.1
1.7
2.1
0.1
0.6
1.2
1.7
2.3
0.1
0.6
1.2
1.7
2.5
0.1
0.6
1.2
1.8
2.7
0.1
0.6
1.2
1.8
3
0.1
0.7
1.2
1.9
4.2
0.2
0.7
1.2
1.9
5
0.2
0.8
1.3
2
5.7
0.3
0.8
1.3
2
12.4
0.3
0.8
1.3
2
15.2
0.4
0.8
1.3
2.1
3. For the data in problem 2, describe how the data differ from a normal
distribution.
4. For the “SAT and College GPA” case study data, create a contour plot looking at
College GPA as a function of Math SAT and High School GPA. Naturally, you
should use a computer to do this.
5. For the “SAT and College GPA” case study data, create a 3D plot using the
variables College GPA, Math SAT, and High School GPA. Naturally, you should
use a computer to do this.
299