Part One
Exploratory Data Analysis
Probability
Distributions
Charles A. Rohde
Fall 2001
Contents

1 Numeracy and Exploratory Data Analysis
  1.1 Numeracy
      1.1.1 Numeracy
  1.2 Discrete Data
  1.3 Stem and leaf displays
  1.4 Letter Values
  1.5 Five Point Summaries and Box Plots
  1.6 EDA Example
  1.7 Other Summaries
      1.7.1 Classical Summaries
  1.8 Transformations for Symmetry
  1.9 Bar Plots and Histograms
      1.9.1 Bar Plots
      1.9.2 Histograms
      1.9.3 Frequency Polygons
  1.10 Sample Distribution Functions
  1.11 Smoothing
      1.11.1 Smoothing Example
  1.12 Shapes of Batches
  1.13 References

2 Probability
  2.1 Mathematical Preliminaries
      2.1.1 Sets
      2.1.2 Counting
  2.2 Relating Probability to Responses and Populations
  2.3 Probability and Odds - Basic Definitions
      2.3.1 Probability
      2.3.2 Properties of Probability
      2.3.3 Methods for Obtaining Probability Models
      2.3.4 Odds
  2.4 Interpretations of Probability
      2.4.1 Equally Likely Interpretation
      2.4.2 Relative Frequency Interpretation
      2.4.3 Subjective Probability Interpretation
      2.4.4 Does it Matter?
  2.5 Conditional Probability
      2.5.1 Multiplication Rule
      2.5.2 Law of Total Probability
  2.6 Bayes Theorem
  2.7 Independence
  2.8 Bernoulli trial models; the binomial distribution
  2.9 Parameters and Random Sampling
  2.10 Probability Examples
      2.10.1 Randomized Response
      2.10.2 Screening

3 Probability Distributions
  3.1 Random Variables and Distributions
      3.1.1 Introduction
      3.1.2 Discrete Random Variables
      3.1.3 Continuous or Numeric Random Variables
      3.1.4 Distribution Functions
      3.1.5 Functions of Random Variables
      3.1.6 Other Distributions
  3.2 Parameters of Distributions
      3.2.1 Expected Values
      3.2.2 Variances
      3.2.3 Quantiles
      3.2.4 Other Expected Values
      3.2.5 Inequalities involving Expectations

4 Joint Probability Distributions
  4.1 General Case
      4.1.1 Marginal Distributions
      4.1.2 Conditional Distributions
      4.1.3 Properties of Marginal and Conditional Distributions
      4.1.4 Independence and Random Sampling
  4.2 The Multinomial Distribution
  4.3 The Multivariate Normal Distribution
  4.4 Parameters of Joint Distributions
      4.4.1 Means, Variances, Covariances and Correlation
      4.4.2 Joint Moment Generating Functions
  4.5 Functions of Jointly Distributed Random Variables
      4.5.1 Linear Combinations of Random Variables
  4.6 Approximate Means and Variances
  4.7 Sampling Distributions of Statistics
  4.8 Methods of Obtaining Sampling Distributions or Approximations
      4.8.1 Exact Sampling Distributions
      4.8.2 Asymptotic Distributions
      4.8.3 Central Limit Theorem
      4.8.4 Central Limit Theorem Example
      4.8.5 Law of Large Numbers
      4.8.6 The Delta Method - Univariate
      4.8.7 The Delta Method - Multivariate
      4.8.8 Computer Intensive Methods
      4.8.9 Bootstrap Example
Chapter 1
Numeracy and Exploratory Data Analysis

1.1 Numeracy

1.1.1 Numeracy
Since most of statistics involves the use of numerical data to draw conclusions we first discuss
the presentation of numerical data.
Numeracy may be broadly defined as the ability to effectively think about and present
numbers.
• One of the most common forms of presentation of numerical information is in tables.
• There are some simple guidelines which allow us to improve tabular presentation of
numbers.
• In certain situations the guidelines presented here will need to be modified if the audience (e.g. readers of a professional journal) expects the results to be presented in a specified format.
Guidelines
• Round to two significant figures.
◦ A table of numbers is almost always easier to understand if the numbers do not contain too many significant figures (see the short sketch after these guidelines).
• Add averages or totals.
◦ Adding row and/or column averages, proportions or totals to a table, when appropriate, often provides a useful focus for establishing trends or patterns.
• Numbers are easier to compare in columns.
• Order by size.
◦ A more effective presentation is often achieved by rearranging so that the largest (and presumably most important) numbers appear first.
• Spacing and layout.
◦ It is useful to present tables in single-space format and not have a lot of “empty space” to distract the reader from concentrating on the numbers in the table.
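A minimal sketch (Python, not part of the original notes) of the first guideline, rounding figures to two significant digits before tabulating them; the raw numbers below are hypothetical.

from math import floor, log10

def round_sig(x, digits=2):
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0
    return round(x, -int(floor(log10(abs(x)))) + (digits - 1))

raw = [0.00836, 0.00412, 123456, 0.0307]
print([round_sig(v) for v in raw])    # [0.0084, 0.0041, 120000, 0.031]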
1.2 Discrete Data
For discrete data present tables of the numbers of responses at the various values, possibly
grouped by factors. Also one can produce bar graphs and histograms for graphical presentation. Thus in the first example in the introduction we might present the results as
follows:
            Proportion   Cases Studied
Placebo        .008         200,745
Vaccine        .004         201,229
A sensible description might be 4 cases per thousand for the vaccinated group and 8 cases
per thousand for the placebo group.
For the alcohol use data in the Overview Section, e.g.

Group        Use Alcohol   Surveyed   Proportion
Clergy            32          300        .11
Educators         51          250        .20
Executives        67          300        .22
Merchants         83          350        .24
we might present the data as
Figure 1.1:
For the self classification data in the Overview Section e.g.
Class     Lower   Working   Middle   Upper
Number      72      714       655      41

we might present the data as

Figure 1.2:
1.3 Stem and leaf displays
Suppose we have a batch or collection of numbers. Stem and leaf displays provide a simple,
yet informative way to
• Develop summaries or descriptions of the batch either to learn about it in isolation or
to compare it with other batches. The fundamental summaries are
◦ location of the batch (a center concept)
◦ scale or spread of the batch (a variability concept).
• Explore (note) characteristics of the batch including
◦ symmetry and general shape
◦ exceptional values
◦ gaps
◦ concentrations
Consider the following batch of 62 numbers which give the ages in years of graduate
students, post-docs, staff and faculty of a large academic department of statistics:
33  20  41  52  35  25  43  61  37  29  44  64  40  32  50  76
33  22  42  55  36  26  43  61  37  30  46  65  40  32  50  79
34  23  43  59  37  27  43  61  39  31  46  67  41  32  51  81
37  28  44  64  37  29  44  64  40  31  49  74  51  52
Not much can be learned by looking at the numbers in this form.
A simple display which begins to describe this collection of numbers is as follows:
( 1)  1    8 | 1
( 4)  3    7 | 6 9 4
(12)  8    6 | 1 4 1 5 1 7 4 4
(20)  8    5 | 2 0 5 0 9 1 1 2
(42) 16    4 | 1 3 4 0 2 3 6 0 3 3 6 1 4 4 0 9
(26) 17    3 | 3 5 7 2 3 6 7 0 2 4 7 9 1 2 7 7 1
( 9)  9    2 | 0 5 9 2 6 3 7 8 9
Interpretation: 1 at 8 means 81, 4 at 7 means 74, 6 at 7 means 76, 9 at 7 means 79,
etc.
A more refined version of this display is:
( 1)  1    8 | 1
( 4)  3    7 | 4 6 9
(12)  8    6 | 1 1 1 4 4 4 5 7
(20)  8    5 | 0 0 1 1 2 2 5 9
(42) 16    4 | 0 0 0 1 1 2 3 3 3 3 4 4 4 6 6 9
(26) 17    3 | 0 1 1 2 2 2 3 3 4 5 6 7 7 7 7 7 9
( 9)  9    2 | 0 2 3 5 6 7 8 9 9
Interpretation: 1 at 8 means 81, 4 at 7 means 74, 6 at 7 means 76, 9 at 7 means 79, etc.
To construct a stem and leaf display we perform the following steps:
• To the left of the solid line we put the stem of the number
• To the right of the solid line we put the leaf of the number.
The remaining entries in the display are discussed in the next section. Note that a stem and
leaf display provides a quick and easy way to display a batch of numbers. Every statistical
package now has a program to draw stem and leaf displays.
Some additional comments on stem and leaf displays:
• Number of stems. Understanding Robust and Exploratory Data Analysis suggests $\sqrt{n}$ stems for n less than 100 and $10\log_{10}(n)$ for n larger than 100. (Usually batches of more than 50 are done using a computer and each statistical package has its own default method.)
• Stems can be double (or more) digits and there can be stems such as 5* and 5· which
divide the numbers with stem 5 into two groups (0,1,2,3,4) and (5,6,7,8,9). Large
displays could use 5 or 10 divisions per stem. The important idea is to display the
numbers effectively.
• For small batches, when working by hand, the use of stem and leaf displays is a simple
way to obtain the ordered values of the batch.
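A minimal sketch (Python, not part of the original notes) of how a stem and leaf display is built for two-digit data: the tens digit is the stem and the units digit is the leaf. Only the first row of the ages table above is used, to keep the example short.

from collections import defaultdict

row1 = [33, 20, 41, 52, 35, 25, 43, 61, 37, 29, 44, 64, 40, 32, 50, 76]

stems = defaultdict(list)
for x in row1:
    stems[x // 10].append(x % 10)           # stem = tens digit, leaf = units digit

for stem in sorted(stems, reverse=True):    # largest stems on top, as in the display above
    leaves = " ".join(str(leaf) for leaf in sorted(stems[stem]))
    print(f"({len(stems[stem]):2d})  {stem} | {leaves}")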
1.4 Letter Values
The stem and leaf display can be used to determine a collection of derived numbers, called
statistics, which can be used to summarize some additional features of the batch. To do this
we need to determine the total size of the batch and where the individual numbers are located
in the display.
• To the left of the stem we count the number of leaves on each stem.
• The numbers in parentheses are the cumulative numbers counting up and counting
down.
• Using the stem and leaf display we can easily “count in” from either end of the batch.
◦ The associated count is called the depth of the number.
◦ Thus at depth 4 we have the number 74 if we count down (largest to smallest)
and the number 25 if we count up (smallest to largest).
• It is easier to understand the concept of depth if the numbers are written in a column
from largest to smallest.
• A measure of location is provided by the median, defined as that number in the display
with depth equal to
$$\text{median depth} = \frac{1}{2}(1 + \text{batch size})$$
◦ If the size of the batch is even (n = 2m) the depth of the median will not be an
integer.
◦ In such a case the median is defined to be halfway between the numbers with
depth m and depth m + 1.
◦ In the example
$$\text{median depth} = \frac{1}{2}(1 + 62) = \frac{63}{2} = 31.5$$
thus the median is given by:
$$\frac{(\#\text{ with depth }31) + (\#\text{ with depth }32)}{2} = \frac{41 + 42}{2} = 41.5$$
◦ The median has the property that 1/2 of the numbers in the batch are above it
and 1/2 of the numbers in the batch are below it, i.e., it is halfway from either
end of the batch.
• The median is just one example of a letter value. Other letter values enable us to
describe variability, shape and other characteristics of the batch.
◦ The simplest sequence of letter values divides the lower half in two and the upper
half in two, each of these halves in two, and so on.
◦ To obtain these letter values we first find their depths by the formula
$$\text{next letter value depth} = \frac{1}{2}\left(1 + [\text{previous letter value depth}]\right)$$
where [ ] means we discard any fraction in the calculation (called the “floor function”).
◦ Thus the upper and lower quartiles have depths equal to
$$\frac{1}{2}\left(1 + [\text{depth of median}]\right)$$
The quartiles are sometimes called fourths.
◦ The eighths have depths equal to
$$\frac{1}{2}\left(1 + [\text{depth of hinge}]\right)$$
◦ We proceed down to the extremes which have depth 1.
◦ The median, quartiles and extremes often describe a batch of numbers quite well.
◦ The remaining letter values are used to describe more subtle features of the data
(illustrated later).
In the example we thus have
$$\text{F depth} = \frac{1}{2}(1 + 31) = \frac{32}{2} = 16$$
$$\text{E depth} = \frac{1}{2}(1 + 16) = \frac{17}{2} = 8.5$$
$$\text{Extreme depth} = \frac{1}{2}(1 + 1) = \frac{2}{2} = 1$$
The corresponding letter values are
M    41.5             depth 31.5
F    33        52     depth 16
Ex   20        81     depth 1
We can display the letter values as follows:
Value   Depth   Lower   Upper   Spread
M        31.5    41.5    41.5       0
F        16      33      52        19
E         8.5    29      64        35
Ex        1      20      81        61
where the spread of a letter value is defined as:
upper letter value − lower letter value
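A small sketch (Python, not the notes' Stata lv command) of the letter-value calculation for the 62 ages of Section 1.3: each depth comes from the previous one via the floor-and-halve formula above, and a half-integer depth averages the two neighbouring order statistics.

ages = [33, 33, 34, 37, 20, 22, 23, 28, 41, 42, 43, 44, 52, 55, 59, 64,
        35, 36, 37, 37, 25, 26, 27, 29, 43, 43, 43, 44, 61, 61, 61, 64,
        37, 37, 39, 40, 29, 30, 31, 31, 44, 46, 46, 49, 64, 65, 67, 74,
        40, 40, 41, 51, 32, 50, 76, 32, 50, 79, 32, 51, 81, 52]
asc = sorted(ages)
desc = asc[::-1]

def value_at_depth(xs, depth):
    """Order statistic at a (possibly half-integer) depth, counting into xs."""
    return (xs[int(depth) - 1] + xs[int(depth + 0.5) - 1]) / 2

labels = iter(["M", "F", "E", "D", "C", "B"])
depth = (1 + len(ages)) / 2                      # median depth = 31.5
while True:
    label = "Ex" if depth == 1 else next(labels)
    lower, upper = value_at_depth(asc, depth), value_at_depth(desc, depth)
    print(f"{label:>2}  depth {depth:5.1f}  lower {lower:5.1f}"
          f"  upper {upper:5.1f}  spread {upper - lower:5.1f}")
    if depth == 1:
        break
    depth = (1 + int(depth)) / 2                 # [ ] = floor, as in the formula above

The output reproduces the table just given (M 41.5/41.5, F 33/52, Ex 20/81) together with the deeper letter values used later.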
1.5 Five Point Summaries and Box Plots
• A useful summary of a batch of numbers is the five point summary in which we list
the upper and lower extremes, the upper and lower hinges and the median. Thus for
the example we have the five point summary given by
20, 33, 41.5, 52, 81
• A five point summary can be displayed graphically as a box plot in which we picture
only the median, the lower fourth, the upper fourth and the extremes as on the following
page:
For this batch of numbers there is evidence of asymmetry or skewness as can be
observed from the stem-leaf display or the box plot.
Figure 1.3:
To measure spread we can use the interquartile range which is simply the difference between the upper quartile and the lower quartile.
1.6 EDA Example
The following are the heights in centimeters of 351 elderly female patients. The data set is
elderly.raw (from Hand et al., pages 120-121)
156 150 156 155 164 160 156 153 159 157 167 166 151 157 155
160 158 164 157 169 154 157 169 155 158 150 163 158 161 160
163 164 156 161 163 162 151 162 159 163 162 159 171 158 161
161 164 145 158 163 158 167 154 168 167 165 166 155 155 152
169 159 153 158 164 155 165 163 158 166 153 157 162 153 156
167 163 153 168 164 162 142 155 152 164 165 162 168 158 153
161 157 178 163 157 160 169 162 160 165 156 152 158 155 153
162 155 169 161 150 164 166 167 165 170 147 163 160 161 154
166 161 158 152 151 157 164 165 155 163 159 152 161 156 158
155 160 165 154 158 163 164 158 164 162 160 153 163 156 163
164 162 154 163 152 155 152 151 157 166 157 160 158 163 158
159 167 165 165 163 170 162 166 165 162 163 157 163 153 158
163 173 160 164 155 157 157 147 160 162 160 164 147 165 159
158 158 158 151 174 173 170 158 153 161 156 164 161 158 152
154 165 166 161 149 156 163 157 168 170 160 153 176 163 158
161 156 163 155 154 160 145 168 145 152 156 170 162 173 162
166 160 162 169 160 161 153 155 163 157 155 158 148 161 156
162 153 157 167 148 150 163 161 156 166 159 160 159 163 178
165 156 154 170 161 159 155 153 158 159 155 171 160 171 160
157 170 158 168 164 160 166 165 177 170 150 154 163 153 163
169 146 158 153 156 155 159 157 156 150 158 163 163 164 159
159 159 151 161 165 154 159 158 157 162 155 165 160 158 159
156 165 155 152 161 169 156 161 154 158 163 170 165 152 170
152 153 157 156 147 170
STATA log for EDA of Heights of Elderly Women
. infile height using c:\courses\b651201\datasets\elderly.raw
(351 observations read)
. stem height
Stem-and-leaf plot for height
14t | 2
14f | 555
14s | 67777
14. | 889
15* | 000000111111
15t | 22222222222233333333333333333
15f | 44444444444555555555555555555555
15s | 6666666666666666666677777777777777777777
15. | 888888888888888888888888888888899999999999999999
16* | 00000000000000000000011111111111111111111
16t | 222222222222222222333333333333333333333333333333
16f | 44444444444444444555555555555555555
16s | 666666666667777777
16. | 88888899999999
17* | 00000000000111
17t | 333
17f | 4
17s | 67
17. | 88
. summarize height, detail
                              height
-------------------------------------------------------------
      Percentiles      Smallest
 1%       145             142
 5%       150             145
10%       152             145        Obs            351
25%       156             145        Sum of Wgt.    351

50%       160                        Mean       159.7749
                       Largest       Std. Dev.   6.02974
75%       164             176
90%       168             177        Variance   36.35777
95%       170             178        Skewness   .1289375
99%       176             178        Kurtosis   3.160595
. display 3.49*6.02974*(351^(-1/3))
2.9832408
. display 3.49*sqrt(r(Var))*(351^(-1/3))
2.983241
. display (178-142)/2.98
12.080537
. display min(sqrt(351),10*log(10))
18.734994
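A short sketch (Python, not part of the original notes) paralleling the display calculations in the Stata log above: the 3.49·s·n^(-1/3) histogram bin width (Scott's rule, listed in the references), the implied number of bins, and the stem-count guideline of Section 1.3.

import math

n, s = 351, 6.02974                  # sample size and standard deviation from the log
width = 3.49 * s * n ** (-1 / 3)     # bin width, about 2.98
print(round(width, 4))
print((178 - 142) / width)           # range / width, about 12 bins
print(10 * math.log10(n))            # 10*log10(n) stem guideline for n > 100, about 25.5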
. graph height, normal xlabel ylabel ti(Heights of Elderly Women 5 Bins)
. graph height, normal xlabel ylabel ti(Heights of Elderly Women 5 Bins) saving
> (g1,replace)
. graph height, bin(12) normal xlabel ylabel ti(Heights of Elderly Women 5 Bins
> ) saving(g2,replace)
. graph height, bin(12) normal xlabel ylabel ti(Heights of Elderly Women 12 Bin
> s) saving(g2,replace)
. graph height, bin(18) normal xlabel ylabel ti(Heights of Elderly Women 18 Bin
> s) saving(g3,replace)
. graph height, bin(25) normal xlabel ylabel ti(Heights of Elderly Women 25 Bin
> s) saving(g4,replace)
. graph using g1 g2 g3 g4
Histograms of Data on Elderly Women
Figure 1.4: Histograms
. lv height
#  351                          height
---------------------------------------------------------------------------
              lower        mid         upper       spread    pseudosigma
M   176                    160
F    88.5     156          160         164            8        5.95675
E    44.5     153          159.5       166           13        5.667454
D    22.5     151          160.25      169.5         18.5      6.048453
C    11.5     148.5        159.5       170.5         22        5.929273
B     6       147          160         173           26        6.071367
A     3.5     145          160.75      176.5         31.5      6.659417
Z     2       145          161.5       178           33        6.360923
Y     1.5     143.5        160.75      178           34.5      6.355203
      1       142          160         178           36        6.246375

      inner fence     144          176        # below   1    # above   4
      outer fence     132          188        # below   0    # above   0

. format height %9.2f

. lv height

#  351                          height
---------------------------------------------------------------------------
              lower        mid         upper       spread    pseudosigma
M   176                    160.00
F    88.5     156.00       160.00      164.00        8.00        5.96
E    44.5     153.00       159.50      166.00       13.00        5.67
D    22.5     151.00       160.25      169.50       18.50        6.05
C    11.5     148.50       159.50      170.50       22.00        5.93
B     6       147.00       160.00      173.00       26.00        6.07
A     3.5     145.00       160.75      176.50       31.50        6.66
Z     2       145.00       161.50      178.00       33.00        6.36
Y     1.5     143.50       160.75      178.00       34.50        6.36
      1       142.00       160.00      178.00       36.00        6.25

      inner fence     144.00       176.00     # below   1    # above   4
      outer fence     132.00       188.00     # below   0    # above   0
. graph height, box
. graph height, box ylabel
. graph height, box ylabel l1(Height in Centimeters) ti(Box Plot of Heights of
> Elderly Women)
. cumul height, gen(cum)
. graph cum height,s(i) c(l) ylabel xlabel ti(Empirical Distribution Function O
> f Heights of Elderly Women) rlabel yline(.25,.5,.75)
. kdensity height
. kdensity height,normal ti(Kdensity Estimate of Heights)
. log close
1.7 Other Summaries
Other measures of location are

• mid = $\frac{1}{2}$(UQ + LQ)

• tri-mean = $\frac{1}{2}$(mid + median) = $\frac{LQ + 2M + UQ}{4}$

where UQ is the upper quartile, M is the median and LQ is the lower quartile.
It is often useful to identify exceptional values that need special attention. We do this
using fences.
• The upper and lower fences are defined by
$$\text{upper fence} = UF = \text{upper hinge} + \tfrac{3}{2}(\text{H-spread})$$
$$\text{lower fence} = LF = \text{lower hinge} - \tfrac{3}{2}(\text{H-spread})$$
• Values above the upper fence or below the lower fence can be considered as exceptional
values and need to be examined closely for validity.
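An illustrative sketch (Python, not part of the original notes) of these summaries for the ages batch, using the letter values found in Section 1.4 (fourths 33 and 52, median 41.5).

LQ, M, UQ = 33, 41.5, 52
h_spread = UQ - LQ                      # H-spread (interquartile range) = 19

mid = (UQ + LQ) / 2                     # 42.5
trimean = (mid + M) / 2                 # (LQ + 2M + UQ)/4 = 42.0
upper_fence = UQ + 1.5 * h_spread       # 80.5
lower_fence = LQ - 1.5 * h_spread       # 4.5

extremes = [20, 22, 23, 28, 74, 76, 79, 81]          # a few low and high values from the batch
outside = [x for x in extremes if x > upper_fence or x < lower_fence]
print(mid, trimean, lower_fence, upper_fence, outside)   # 42.5 42.0 4.5 80.5 [81]

Only the value 81 lies beyond a fence, matching the single value flagged above the inner fence in the Stata lv output shown later.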
1.7.1 Classical Summaries
The summary quantities developed in the previous sections are examples of statistics, formally defined as functions of a sample data set. There are other summary measures of a
sample data set.
• For location, the traditional summary measure is the sample mean defined by
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
where n is the number of observations in the data set and $(x_1, x_2, \ldots, x_n)$ is the sample data set.
• For spread or variability the sample variance, $s^2$, and the sample standard deviation, s, are defined by
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \quad\text{and}\quad s = \sqrt{s^2}$$
• Note that
$$\bar{x} = \left(1 - \frac{1}{n}\right)\bar{x}_{(i)} + \frac{1}{n}\,x_i$$
where $\bar{x}_{(i)}$ is the sample mean of the data set with the ith observation removed.
◦ It follows that a single observation can greatly influence the magnitude of the
sample mean which explains why other summaries such as the median or tri-mean
for location are often used.
◦ Similarly the sample variance and sample standard deviation are greatly influenced by single observations.
• For distributions which are “bell-shaped” the interquartile range is approximately equal to 1.34s, where s is the sample standard deviation.
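A short sketch (Python, not part of the original notes) checking the updating identity for the sample mean and the 1.34s rule of thumb on a made-up bell-shaped batch.

import statistics, random

random.seed(1)
x = [random.gauss(100, 10) for _ in range(500)]
n = len(x)

xbar = statistics.mean(x)
xbar_drop_last = statistics.mean(x[:-1])            # mean with the last observation removed
print(abs(xbar - ((1 - 1/n) * xbar_drop_last + x[-1] / n)) < 1e-9)   # True

s = statistics.stdev(x)
q = statistics.quantiles(x, n=4)                    # quartiles
print(q[2] - q[0], 1.34 * s)                        # IQR and 1.34*s are close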
1.8 Transformations for Symmetry
Data can be easier to understand if it is nearly symmetric and hence we sometimes transform
a batch to make it approximately symmetric. The reasons for transformations are:
• For symmetric batches we have an unambiguous measure of center (the mean or the
median).
• Transformed data may have a scientific meaning.
• Many statistical methods are more reliable for symmetric data.
As examples of transformed data with scientific meaning we have
• For income and population changes the natural logarithm is often useful since both money and populations grow exponentially, i.e.
$$N_t = N_0 \exp(rt)$$
where r is the interest rate or growth rate.
• In measuring consumption e.g. miles per gallon or BTU per gallon the reciprocal is a
measure of power.
The fundamental use of transformations is to change shape, which can be loosely described as everything about the batch other than location and scale. Desirable features of a transformation are that it preserve order and be a simple and smooth function of the data. We first note that a linear transformation does not change shape, only the location and scale of the batch, since
$$t(y_i) = a + by_i,\quad t(y_j) = a + by_j \implies t(y_i) - t(y_j) = b(y_i - y_j)$$
shows that a linear transformation does not change the relative distances between observations. Thus a linear transformation does not change the shape of the batch.
To choose a transformation for symmetry we first need to determine whether the data
are skewed right or skewed left. A simple way to do this is to examine the “mid-list” defined
as
$$\text{mid letter value} = \frac{\text{lower letter value} + \text{upper letter value}}{2}$$
If the values in the mid-list increase as the letter values increase then the batch is skewed
right. Conversely if the values in the mid-list decrease as the letter values increase the batch
is skewed left.
A convenient collection of transformations is the power family of transformations defined by
$$t_k(y) = \begin{cases} y^k & k \neq 0 \\ \ln(y) & k = 0 \end{cases}$$
For this family of transformations we have the following ladder of re-expression or transformation:

   k       t_k(y)
   2       y²
   1       y
   1/2     √y
   0       ln(y)
  −1/2    −1/√y
  −1      −1/y
  −2      −1/y²
The rule for using this ladder is to start at the transformation where k = 1. If the data are skewed to high values, go down the ladder to find a transformation. If skewed towards low values of y go up the ladder. For the data set on ages the complete set of letter values as produced by STATA is
#   62                          y
---------------------------------------------------------------
              lower        mid        upper       spread
M   31.5                   41.5
F   16        33           42.5        52            19
E    8.5      29           46.5        64            35
D    4.5      25.5         48          70.5          45
C    2.5      22.5         50          77.5          55
B    1.5      21           50.5        80            59
     1        20           50.5        81            61

      inner fence      4.5         80.5       # below   0    # above   1
      outer fence    -24          109         # below   0    # above   0
Thus the mid-list is

               mid letter value
median               41.5
fourth               42.5
eighth               46.5
D                    48
C                    50
B                    50.5
Extreme              50.5
Since the values increase we need to go down the ladder. Hence we try square roots or
natural logarithms first.
Note: There are some rather sophisticated symmetry plots now available, e.g. STATA has a command symplot which determines the value of k. Often, however, this results in k = .48 or k = .52. Try to choose a k which is simple, e.g. k = 1/2, and hope for a scientific justification.
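An illustrative sketch (Python, not part of the original notes) of applying the ladder to the ages batch and recomputing two mid-summaries (mid of the fourths and mid of the extremes) for the raw, log and square-root data.

import math

ages = sorted([33, 33, 34, 37, 20, 22, 23, 28, 41, 42, 43, 44, 52, 55, 59, 64,
               35, 36, 37, 37, 25, 26, 27, 29, 43, 43, 43, 44, 61, 61, 61, 64,
               37, 37, 39, 40, 29, 30, 31, 31, 44, 46, 46, 49, 64, 65, 67, 74,
               40, 40, 41, 51, 32, 50, 76, 32, 50, 79, 32, 51, 81, 52])

def mids(y):
    """Mid of the fourths and mid of the extremes for an ordered batch of 62."""
    lower_f, upper_f = y[15], y[46]          # depth 16 from each end
    return (lower_f + upper_f) / 2, (y[0] + y[-1]) / 2

for name, t in [("raw", lambda v: v), ("log", math.log), ("sqrt", math.sqrt)]:
    transformed = sorted(t(v) for v in ages)
    print(name, [round(m, 3) for m in mids(transformed)])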
Here are the stem and leaf plots of the natural logarithm and square root of the age data
30* | 09
31* | 4
32* | 26
33* | 0377
34* | 033777
35* | 00368
36* | 111116999
37* | 1146666888
38* | 339
39* | 113355
40* | 18
41* | 1116667
42* | 0
43* | 0379
lnage
4** | 47
4** | 69
4** | 80
5** | 00,10
5** | 20,29,39,39
5** | 48,57,57
5** | 66,66,66,74,74
5** | 83,92
6** | 00,08,08,08,08,08
6** | 24,32,32,32
6** | 40,40,48,56,56,56,56
6** | 63,63,63,78,78
6** |
7** | 00,07,07,14,14
7** | 21,21
7** | 42
7** | 68
7** | 81,81,81
8** | 00,00,00,06,19
8** |
8** |
8** | 60,72
8** | 89
9** | 00
square root of age
1.9 Bar Plots and Histograms
Two other useful graphical displays for describing the shape of a batch of data are provided
by bar plots and histograms.
1.9.1 Bar Plots
• Barplots are very useful for describing relative proportions and frequencies defined for
different groups or intervals.
• The key concept in constructing bar plots is to remember that the plot must be such
that the area of the bar is proportional to the quantity being plotted.
• This causes no problems if the intervals are of equal length but presents real problems
if the intervals are not of equal length.
• Such incorrect graphs are examples of “lying graphics” and must be avoided.
1.9.2 Histograms
• Histograms are similar to bar plots and are used to graph the proportion of data set
values in specified intervals.
• These graphs give insight into the distributional patterns of the data set.
• Unlike stem-leaf plots, histograms sacrifice the individual data values.
• In constructing histograms the same basic principle used in constructing bar plots
applies: the area over an interval must be proportional to the number or proportion of
data values in the interval. The total area is often scaled to be one.
• Smoothed histograms are available in most software packages. (more later when we
discuss distributions).
The following pages show the histogram of the first data set of 62 values with equal intervals
and the kdensity graph.
Histogram
Figure 1.5:
Smoothed histogram
Figure 1.6:
1.9.3 Frequency Polygons
• Closely related to histograms are frequency polygons in which the proportion or
frequency of an interval is plotted at the mid point of the interval and the resulting
points connected.
• Frequency polygons are also useful in visualizing the general shape of the distribution
of a data set.
Here is a small data set giving the number of reported suicide attempts in a major US city
in 1971:
Age         6-15   16-25   26-35   36-45   46-55   56-65
Frequency      4      28      16       8       4       1
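A minimal sketch (Python, not part of the original notes) of the points that make up the frequency polygon: each frequency is plotted at the midpoint of its age interval and the points are then connected.

intervals = [(6, 15), (16, 25), (26, 35), (36, 45), (46, 55), (56, 65)]
freqs = [4, 28, 16, 8, 4, 1]

points = [((lo + hi) / 2, f) for (lo, hi), f in zip(intervals, freqs)]
print(points)   # [(10.5, 4), (20.5, 28), (30.5, 16), (40.5, 8), (50.5, 4), (60.5, 1)]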
The frequency polygon for this data set is as follows:
Figure 1.7:
1.10 Sample Distribution Functions
• Another useful graphical display is the sample distribution function or empirical
distribution function which is a plot of the proportion of values less than or equal
to y versus y where y represents the ordered values of the data set.
• These plots can be conveniently made using current software but usually involve too
much computation to be done by hand.
• They represent a very valuable technique for comparing observed data sets to theoretical models as we will see later.
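A short sketch (Python, not the notes' Stata cumul command) of the sample (empirical) distribution function: the proportion of values less than or equal to each ordered value.

def ecdf(batch):
    ordered = sorted(batch)
    n = len(ordered)
    return [(y, (i + 1) / n) for i, y in enumerate(ordered)]

batch = [20, 22, 23, 25, 41, 42, 52, 81]      # a few of the ages, for illustration
for y, p in ecdf(batch):
    print(y, p)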
Here is the sample distribution function for the first data set on ages.
Figure 1.8:
1.11 Smoothing
Time series data of the form $y_t:\ t = 0, 1, 2, \ldots, n$, which we abbreviate to $\{y_t\}$, can usefully be separated into two additive parts, $\{z_t\}$ and $\{r_t\}$, where
• {zt } is the smooth or signal and represents that part of the data which is slowly varying
and structured.
• {rt } is the rough or noise and represents that part of the data which is rapidly varying
and unstructured.
$\{z_t\}$, the smooth, tells us about long-run patterns while $\{r_t\}$, the rough, tells us about
exceptional points. The operator which converts the data {yt } into the smooth is called a
data smoother. The smoothed data may then be written as Sm{yt }. The corresponding
rough is then given by
Ro{yt } = {yt } − Sm{yt }
There are many smoothers, defined by their properties. For our purposes two general
types are important:
• Linear smoothers defined by the property
Sm{axt + byt } = aSm{xt } + bSm{yt }
• Semi-linear smoothers defined by the property
Sm{ayt + b} = aSm{yt } + b
Examples of linear smoothers include moving averages e.g.
$$\mathrm{Sm}\{y_t\} = \frac{y_{t-1} + y_t + y_{t+1}}{3}$$
and weighted moving averages such as Hanning defined by
$$\mathrm{Sm}\{y_t\} = \frac{1}{4}y_{t-1} + \frac{1}{2}y_t + \frac{1}{4}y_{t+1}$$
(Special adjustments are made at the ends of the series.)
Examples of semi-linear smoothers include running medians of length 3 or 5 when smoothing without a computer, or even lengths if using a statistical package with the right programs, e.g.
$$\mathrm{Sm}\{y_t\} = \mathrm{med}\{y_{t-1}, y_t, y_{t+1}\}$$
is a smoother of running medians of length 3 with the ends replicated (copied). These kinds
of smoothers are applied several times until they “settle down”. Then end adjustments are
made.
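A hedged sketch (Python, not the notes' Stata smooth command) of the two building blocks: a running median of length 3 with the ends copied, repeated until it settles (the "R"), followed by the Hanning weights. The short input series is just the first few unemployment values used in the next subsection.

def median3(y):
    """One pass of running medians of length 3, with the end values replicated."""
    z = y[:]
    for t in range(1, len(y) - 1):
        z[t] = sorted(y[t - 1:t + 2])[1]
    return z

def hanning(y):
    z = y[:]
    for t in range(1, len(y) - 1):
        z[t] = 0.25 * y[t - 1] + 0.5 * y[t] + 0.25 * y[t + 1]
    return z

y = [4.9, 6.0, 4.9, 5.0, 4.6, 4.1, 3.3, 3.4, 3.2, 3.1]
smooth = median3(y)
while smooth != median3(smooth):      # repeat the medians until the series no longer changes
    smooth = median3(smooth)
print([round(v, 2) for v in hanning(smooth)])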
The two basic types of smoothers are usually combined to form compound smoothers.
The nomenclature for these smoothers is rather bewildering at first but informative: e.g.
3RSSH,twice
refers to the smoother which
• takes running medians of length 3 until the series stabilizes (R)
• the S refers to splitting the repeated values, using the endpoint operator on them and
then replaces the original smooth with these values
• H applies the Hanning smoother to the series which remains
• twice refers to using the smoother on the rough and then adding the rough back to the
smooth to form the final smoothed version
A little trial and error is needed in using these smoothers. Velleman has recommended
the smoother
4253H,twice
for general use.
1.11.1 Smoothing Example
To illustrate the smoothing techniques we use data on unemployment percent for the years
1960 to 1990.
. infile year unempl using c:\courses\b651201\datasets\unemploy.raw
(31 observations read)
. smooth 3 unempl, gen(sm1)
. smooth 3 sm1, gen(sm2)
. smooth 3R unempl, gen(sm3)
. smooth 3RE unempl, gen(sm4)
. smooth 4253H,twice unempl, gen(sm5)
. gen sm5r=round(sm5,.1)
. list year unempl sm1 sm2 sm3 sm4
  year   unempl   sm1   sm2   sm3   sm4
  1960     4.9    4.9   4.9   4.9   4.9
  1961     6      4.9   4.9   4.9   4.9
  1962     4.9    5     4.9   4.9   4.9
  1963     5      4.9   4.9   4.9   4.9
  1964     4.6    4.6   4.6   4.6   4.6
  1965     4.1    4.1   4.1   4.1   4.1
  1966     3.3    3.4   3.4   3.4   3.4
  1967     3.4    3.3   3.3   3.3   3.3
  1968     3.2    3.2   3.2   3.2   3.2
  1969     3.1    3.2   3.2   3.2   3.2
  1970     4.4    4.4   4.4   4.4   4.4
  1971     5.4    5     5     5     5
  1972     5      5     5     5     5
  1973     4.3    5     5     5     5
  1974     5      5     5     5     5
  1975     7.8    7     7     7     7
  1976     7      7     7     7     7
  1977     6.2    6.2   6.2   6.2   6.2
  1978     5.2    5.2   5.2   5.2   5.2
  1979     5.1    5.2   5.2   5.2   5.2
  1980     6.3    6.3   6.3   6.3   6.3
  1981     6.7    6.7   6.7   6.7   6.7
  1982     8.6    8.4   8.4   8.4   8.4
  1983     8.4    8.4   8.4   8.4   8.4
  1984     6.5    6.5   6.5   6.5   6.5
  1985     6.2    6.2   6.2   6.2   6.2
  1986     6      6     6     6     6
  1987     5.3    5.3   5.3   5.3   5.3
  1988     4.7    4.7   4.7   4.7   4.7
  1989     4.5    4.5   4.5   4.5   4.5
  1990     4.1    4.1   4.1   4.1   4.1
. list year unempl sm5r
  year   unempl   sm5r
  1960     4.9     4.9
  1961     6       5
  1962     4.9     5
  1963     5       4.9
  1964     4.6     4.6
  1965     4.1     4
  1966     3.3     3.6
  1967     3.4     3.4
  1968     3.2     3.4
  1969     3.1     3.6
  1970     4.4     4.1
  1971     5.4     4.6
  1972     5       4.8
  1973     4.3     5.1
  1974     5       5.5
  1975     7.8     6
  1976     7       6.2
  1977     6.2     6.1
  1978     5.2     5.9
  1979     5.1     5.8
  1980     6.3     6.2
  1981     6.7     7
  1982     8.6     7.4
  1983     8.4     7.3
  1984     6.5     7
  1985     6.2     6.4
  1986     6       5.8
  1987     5.3     5.3
  1988     4.7     4.8
  1989     4.5     4.4
  1990     4.1     4.1
. graph unempl sm4 year,s(oi) c(ll) ti(Unemployment and 3RE Smooth) xlab
. graph unempl sm5r year,s(oi) c(ll) ti(Unemployment and 4253H,twice Smooth) xlab
. log close
The graphs on the following two pages show the smoothed versions and the original data.
Graph of Unemployment Data and 3RE smooth
Figure 1.9:
Graph of Unemployment Data and 4253H,twice Smooth.
Figure 1.10:
1.12 Shapes of Batches
Figure 1.11:
1.13 References
1. Bound, J. A. and A. S. C. Ehrenberg (1989). Significant Sameness. J. R. Statis. Soc.
A 152(Part 2): pp. 241-247.
2. Chakrapani, C. Numeracy. Encyclopedia of Statistics.
3. Chambers, J. M., W. S. Cleveland, et al. (1983). Graphical Methods for Data Analysis,
Wadsworth International Group.
4. Chatfield, C. (1985). The Initial Examination of Data. J.R.Statist. Soc. A 148(3):
214-253.
5. Cleveland, W. S. and R. McGill (1984). The Many Faces of a Scatterplot. JASA
79(388): 807-822.
6. Doksum, K. A. (1977). Some Graphical Methods in Statistics. Statistica Neerlandica 31(2): 53-68.
7. Draper, D., J. S. Hodges, et al. (1993). Exchangeability and Data Analysis. J. R.
Statist. Soc. A 156(Part 1): pp. 9-37.
8. Ehrenberg, A. S. C. (1977). Graphs or Tables ? The Statistician Vol. 27(No.2): pp.
87-96.
9. Ehrenberg, A. S. C. (1986). Reading a Table: An Example. Applied Statistics 35(3):
237-244.
10. Ehrenberg, A. S. C. (1977). Rudiments of Numeracy. J. R. Statis. Soc. A 140(3):
277-297.
11. Ehrenberg, A. S. C. Reduction of Data. Johnson and Kotz.
12. Ehrenberg, A. S. C. (1981). The Problem of Numeracy. American Statistician 35(3):
67-71.
13. Finlayson, H. C. The Place of ln x Among the Powers of x. American Mathematical
Monthly: 450.
14. Gan, F. F., K. J. Koehler, et al. (1991). Probability Plots and Distribution Curves for
Assessing the Fit of Probability Models. American Statistician 45(1): 14-21.
15. Goldberg, K. and B. Iglewicz (1992). Bivariate Extensions of the Boxplot. Technometrics 34(3): 307-320.
16. Hand, D. J. (1996). Statistics and the Theory of Measurement. J. R. Statist. Soc. A
159(Part 3): pp. 445-492.
17. Hand, D. J. (1998). Data Mining: Statistics and More? American Statistician 52(2):
112-118.
18. Hoaglin, D. C., F. Mosteller, et al. (1991). Fundamentals of Exploratory Analysis of
Variance, John Wiley & Sons, Inc.
19. Hoaglin, D. C., F. Mosteller, et al., Eds. (1983). Understanding Robust and Exploratory Data Analysis, John Wiley & Sons, Inc.
20. Hunter, J. S. (1988). The Digidot Plot. American Statistician 42(1): 54.
21. Hunter, J. S. (1980). The National System of Scientific Measurement. Science 210:
869-874.
22. Kafadar, K. Notched Box-and-Whisker Plots. Encyclopedia of Statistics. Johnson and
Kotz.
23. Kruskal, W. (1978). Taking Data Seriously. Toward a Metric of Science, John Wiley
& Sons: 139-169.
24. Mallows, C. L. and D. Pregibon (1988). Some Principles of Data Analysis, Statistical
Research Reports No. 54 AT&T Bell Labs.
25. McGill, R., J. W. Tukey, et al. (1978). Variations of Box Plots. American Statistician
32(1): 12-16.
26. Mosteller, F. (1977). Assessing Unknown Numbers: Order of Magnitude Estimation.
Statistical Methods for Policy Analysis. W. B. Fairley and F. Mosteller, Addison-Wesley.
27. Paulos, J. A. (1988). Innumeracy: Mathematical Illiteracy and Its Consequences, Hill
and Wang.
28. Paulos, J. A. (1991). Beyond Numeracy: Ruminations of a Numbers Man, Alfred A.
Knopf.
29. Preece, D. A. (1987). The language of size, quantity and comparison. The Statistician
36: 45-54.
30. Rosenbaum, P. R. (1989). Exploratory Plots for Paired Data. American Statistician
43(2): 108-109.
31. Scott, D. W. (1979). On optimal and data-based histograms. Biometrika 66(3): pp.
605-610.
32. Scott, D. W. (1985). Frequency Polygons: Theory and Applications. JASA 80(390):
348-354.
33. Sievers, G. L. Probability Plotting. Encyclopedia of Statistics. Johnson and Kotz:
232-237.
34. Snee, R. D. and C. G. Pfeifer.. Graphical Representation of Data. Encyclopedia of
Statistics. Johnson and Kotz: 488-511.
35. Stevens, S. S. (1968). Measurement, Statistics and the Schemapiric View. Science 161(3844): 849-856.
36. Stirling, W. D. (1982). Enhancements to Aid Interpretation of Probability Plots. The Statistician 31(3): 211.
37. Sturges, H. A. (1926). The Choice of Class Interval. JASA 21: 65-66.
38. Terrell, G. R. and D. W. Scott (1985). Oversmoothed Nonparametric Density Estimates. JASA 80(389): 209-213.
39. Tukey, J. W. (1980). We Need Both Exploratory and Confirmatory. American Statistician 34(1): 23-25.
40. Tukey, J. W. (1986). Sunset Salvo. American Statistician 40(1): 72-76.
41. Tukey, J. W. (1977). Exploratory Data Analysis, Addison Wesley.
42. Tukey, J. W. and C. L. Mallows An Overview of Techniques of Data Analysis, Emphasizing Its Exploratory Aspects: 111-172.
43. Velleman, P. F. Applied Nonlinear Smoothing. Sociological Methodology 1982. San Francisco: Jossey-Bass.
44. Velleman, P. F. and L. Wilkinson (1993). Nominal, Ordinal, Interval, and Ratio Typologies Are Misleading. American Statistician 47(1): 65-72.
45. Wainer, H. (1997). Improving Tabular Displays, With NAEP Tables as Examples and
Inspirations. Journal of Educational and Behavioral Statistics 22(1): 1-30.
46. Wand, M. P. (1997). Data-Based Choice of Histogram Bin Width. American Statistician Vol. 51(No. 1): pp. 59-64.
47. Wilk, M. B. and R. Gnanadesikan (1968). Probability plotting methods for the analysis of data. Biometrika 55(1): 1-17.
Chapter 2
Probability

2.1 Mathematical Preliminaries

2.1.1 Sets
To study statistics effectively we need to learn some probability. There are certain elementary
mathematical concepts which we use to increase the precision of our discussions. The use
of set notation provides a convenient and useful way to be precise about populations and
samples.
Definition: A set is a collection of objects called points or elements.
Examples of sets include:
• set of all individuals in this class
• set of all individuals in Baltimore
• set of integers including 0 i.e. {0, 1, . . .}
• set of all non-negative numbers i.e. [0, +∞)
• set of all real numbers i.e. (−∞, +∞)
To describe the contents of a set we will follow one of two conventions:
• Convention 1: Write down all of the elements in the set and enclose them in curly
brackets. Thus the set consisting of the four numbers 1, 2, 3 and 4 is written as
{1, 2, 3, 4}
• Convention 2: Write down a rule which determines or defines which elements are in
the set and enclose the result in curly brackets. Thus the set consisting of the four
numbers 1, 2, 3 and 4 is written as
{x : x = 1, 2, 3, 4}
and is read as “the set of all x such that x = 1, 2, 3, or 4”. The general convention is
thus
{x : C(x)}
and is read as “the set of all x such that the condition C(x) is satisfied”.
Obviously convention 2 is more useful for complicated and large sets.
Notation and Definitions:
• x ∈ A means that the point x is a point in the set A
• x ∉ A means that the point x is not a point in the set A. Thus
1 ∈ {1, 2, 3, 4} but 5 ∉ {1, 2, 3, 4}
• A ⊂ B means that each a ∈ A implies that a ∈ B. Such an A is said to be a
{1, 2} ⊂ {1, 2, 3, 4}
• A = B means that every point in A is also in B and conversely. More precisely A = B
means that A ⊂ B and B ⊂ A.
• The union of two sets A and B is denoted by A ∪ B and is the set of all points x
which are in at least one of the sets. Thus if A = {1, 2} and B = {2, 3, 4} then
A ∪ B = {1, 2, 3, 4}
• The intersection of two sets A and B is denoted by A∩B and is the set of all points x
which are in both of the sets. Thus if A = {1, 2} and B = {2, 3, 4} then A ∩ B = {2}.
• If there are no points x which are in both A and B we say that A and B are disjoint
or mutually exclusive and we write
A ∩ B = ∅
where ∅ is called the empty set (the set containing no points).
• Each set under discussion is usually considered to be a subset of a larger set Ω called
the sample space.
• The complement of a set A, denoted A^c, is the set of all points not in A, i.e.
A^c = {x : x ∉ A}
Thus if Ω = {1, 2, 3, 4, 5} and A = {1, 2, 4} then A^c = {3, 5}.
• If B ⊂ A then A − B = A ∩ B^c = {x : x ∈ A ∩ B^c}
• If a and b are elements or points we call (a, b) an ordered pair. a is called the first
coordinate and b is called the second coordinate. Two ordered pairs are defined
to be equal if and only if both their first and second coordinates are equal. Thus
(a, b) = (c, d) if and only if a = c and b = d
Thus if we record for an individual their blood pressure and their age the result may
be written as (age, blood pressure).
• The Cartesian product of two sets A and B is written as A × B and is the set of
all ordered pairs having as first coordinate an element of A and second coordinate an
element of B. More precisely
A × B = {(a, b) : a ∈ A; b ∈ B}
Thus if A = {1, 2, 3} and B = {3, 4} then
A × B = {(1, 3), (1, 4), (2, 3), (2, 4), (3, 3), (3, 4)}
• Extension of Cartesian products to three or more sets is useful. Thus
A1 × A2 × A3 = {(a1 , a2 , a3 ) : a1 ∈ A1 , a2 ∈ A2 , a3 ∈ A3 }
defines a set of triples. Two triples are equal if and only if they are equal coordinatewise.
Most computer based storage systems (data base programs) implicitly use Cartesian
products to label and store data values.
• An n-tuple is an ordered collection of n elements of the form $(a_1, a_2, \ldots, a_n)$.
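An illustrative sketch (Python, not part of the original notes) of the set operations just defined, using Python's built-in set type.

A = {1, 2}
B = {2, 3, 4}
omega = {1, 2, 3, 4, 5}

print(A | B)                            # union A ∪ B = {1, 2, 3, 4}
print(A & B)                            # intersection A ∩ B = {2}
print(omega - A)                        # complement of A within the sample space = {3, 4, 5}
print(A <= B)                           # is A a subset of B?  False here
print({(a, b) for a in A for b in B})   # Cartesian product A × B as a set of ordered pairs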
example: Consider the set (population) of all individuals in the United States. If
• A is all those who carry the AIDS virus
• B is all homosexuals
• C is all IV drug users
Then
• The set of all individuals who carry the AIDS virus and satisfy only one of the other
two conditions is
(A ∩ B ∩ C c ) ∪ (A ∩ B c ∩ C)
• The set of all individuals satisfying at least two of the conditions is
(A ∩ B) ∪ (A ∩ C) ∪ (B ∩ C)
• The set of individuals satisfying exactly two of the conditions is
(A ∩ B ∩ C c ) ∪ (A ∩ B c ∩ C) ∪ (Ac ∩ B ∩ C)
• The set of all individuals satisfying all three conditions is
A∩B∩C
• The set of all individuals satisfying at least one of the conditions is
A∪B∪C
2.1.2 Counting
Many probability problems involve “counting the number of ways” something can occur.
Basic Principle of Counting: Given two sets A and B with $n_1$ and $n_2$ elements respectively of the form
$$A = \{a_1, a_2, \ldots, a_{n_1}\} \qquad B = \{b_1, b_2, \ldots, b_{n_2}\}$$
then the set A × B consisting of all ordered pairs of the form $(a_i, b_j)$ contains $n_1 n_2$ elements.
• To see this consider the table

             b_1              b_2              ···      b_{n_2}
a_1          (a_1, b_1)       (a_1, b_2)       ···      (a_1, b_{n_2})
a_2          (a_2, b_1)       (a_2, b_2)       ···      (a_2, b_{n_2})
...          ...              ...              ...      ...
a_{n_1}      (a_{n_1}, b_1)   (a_{n_1}, b_2)   ···      (a_{n_1}, b_{n_2})
The conclusion is thus obvious.
• Equivalently: If there are n1 ways to perform operation 1 and n2 ways to perform
operation 2 then there are n1 n2 ways to perform first operation 1 and then operation
2.
• In general if there are r operations in which the ith operation can be performed in ni
ways then there are n1 n2 · · · nr ways to perform the r operations in sequence.
• Permutations: If a set S contains n elements, there are
$$n! = n \times (n-1) \times \cdots \times 3 \times 2 \times 1$$
different n-tuples which can be formed from the n elements of S.
– By convention 0! = 1.
– If r ≤ n there are
$$(n)_r = (n-r+1)(n-r+2) \cdots (n-1)n$$
r-tuples composed of elements of S.
• Combinations: If a set S contains n elements and r ≤ n, there are
$$C_r^n = \binom{n}{r} = \frac{n!}{r!(n-r)!}$$
subsets of size r containing elements of S.
To see this we note that if we have a subset of size r from S there are r! permutations of its elements, each of which is an r-tuple of elements from S. Therefore we have the equation
$$r!\,C_r^n = (n)_r$$
and the conclusion follows.
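A small sketch (Python, not part of the original notes) evaluating these counting formulas with the standard library.

from math import factorial, comb, perm

n, r = 52, 3
print(factorial(5))        # 5! = 120
print(perm(n, r))          # (n)_r = 52*51*50 = 132600 ordered 3-tuples
print(comb(n, r))          # C(52, 3) = 22100 unordered subsets of size 3
print(perm(n, r) == factorial(r) * comb(n, r))   # r! * C_r^n = (n)_r  -> True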
examples:
(1) For an ordinary deck of 52 cards there are 52 × 51 × 50 ordered ways to deal three cards, and hence 52 × 51 × 50/3! = 22,100 ways to choose a “hand” of three cards.
(2) If we toss two dice (each six-sided with sides numbered 1-6) there are 36 possible outcomes.
(3) The use of the convention that 0! = 1 can be considered a special case of the Gamma function defined by
$$\Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x}\,dx$$
defined for any positive α. We note by integration by parts that
$$\Gamma(\alpha) = \Big[-x^{\alpha-1}e^{-x}\Big]_0^\infty + (\alpha-1)\int_0^\infty x^{\alpha-2}e^{-x}\,dx = (\alpha-1)\Gamma(\alpha-1)$$
It follows that if α = n where n is an integer then
$$\Gamma(n) = (n-1)!$$
and hence with n = 1
$$0! = \Gamma(1) = \int_0^\infty e^{-x}\,dx = 1$$
2.2 Relating Probability to Responses and Populations
Probability is a measure of the uncertainty associated with the occurrence of events.
• In applications to statistics probability is used to model the uncertainty associated
with the response of a study.
• Using probability models and observed responses (data) we make statements (statistical
inferences) about the study:
◦ The probability model allows us to relate the uncertainty associated with sample
results to statements about population characteristics.
◦ Without such models we can say little about the population and virtually nothing
about the reliability or generalizability of our results.
• The term experiment or statistical experiment or random experiment denotes
the performance of an observational study, a census or sample survey or a designed
experiment.
◦ The collection, Ω, of all possible results of an experiment will be called the sample
space.
◦ A particular result of an experiment will be called an elementary event and
denoted by ω.
◦ An event is a collection of elementary events.
◦ Events are thus sets of elementary events.
• Notation and interpretations:
◦ ω ∈ E means that E occurs when ω occurs
◦ ω 6∈ E means that E does not occur when ω occurs
◦ E ⊂ F means that the occurrence of E implies the occurrence of F
◦ E ∩ F means the event that both E and F occur
◦ E ∪ F means the event that at least one of E or F occur
◦ φ denotes the impossible event
◦ E ∩ F = φ means that E and F are mutually exclusive
◦ E c is the event that E does not occur
◦ Ω is the sample space
2.3 Probability and Odds - Basic Definitions

2.3.1 Probability
Definition: Probability is an assignment to each event of a number called its probability
such that the following three conditions are satisfied:
(1) P (Ω) = 1 i.e. the probability assigned to the certain event or sample space is 1
(2) 0 ≤ P (E) ≤ 1 for any event E i.e. the probability assigned to any event must be
between 0 and 1
(3) If E1 and E2 are mutually exclusive then
P (E1 ∪ E2 ) = P (E1 ) + P (E2 )
i.e. the probability assigned to the union of mutually exclusive events equals the sum
of the probabilities assigned to the individual events.
P (E) is called the probability of the event E
Note: In considering probabilities for continuous responses we need a stronger form of (3):
$$P\left(\bigcup_i E_i\right) = \sum_i P(E_i)$$
for any countable collection of events which are mutually exclusive.
2.3.2 Properties of Probability
Important properties of probabilities are:
• P (E c ) = 1 − P (E)
• P (∅) = 0
• E1 ⊂ E2 implies P (E1 ) ≤ P (E2 )
• P (E1 ∪ E2 ) = P (E1 ) + P (E2 ) − P (E1 ∩ E2 )
Rather than develop the theory of probability we will:
• Develop the most important probability models used in statistics.
• Learn to use these models to make calculations according to the definitions and properties listed above
• Learn how to interpret probabilities.
examples:
• Suppose that P (A) = .4, P (B) = .3 and P (A ∩ B) = .2 then
P (A ∪ B) = .4 + .3 − .2 = .5
• For any three events A, B and C we have
P (A∪B ∪C) = P (A)+P (B)+P (C)−P (A∩B)−P (A∩C)−P (B ∩C)+P (A∩B ∩C)
and hence
P (A ∪ B ∪ C) ≤ P (A) + P (B) + P (C)
2.3.3 Methods for Obtaining Probability Models
The four most important sample spaces for statistical applications are
◦ {0, 1, 2, . . . , n} (discrete-finite)
◦ {0, 1, 2, . . .} (discrete-countable)
◦ [0, ∞) (continuous)
◦ (−∞, +∞) (continuous)
For these sample spaces probabilities are defined by probability mass functions (discrete
case) and probability density functions (continuous case). We shall call both of these probability density functions (pdfs).
◦ For the discrete cases a pdf assigns a number f(x) to each x in the sample space such that
$$f(x) \ge 0 \quad\text{and}\quad \sum_x f(x) = 1$$
Then P(E) is defined by
$$P(E) = \sum_{x \in E} f(x)$$
◦ For the continuous cases a pdf assigns a number f(x) to each x in the sample space such that
$$f(x) \ge 0 \quad\text{and}\quad \int_x f(x)\,dx = 1$$
Then P(E) is defined by
$$P(E) = \int_{x \in E} f(x)\,dx$$
Since sums and integrals over disjoint sets are additive, probabilities can be assigned using pdfs (i.e. the probabilities so assigned obey the three axioms of probability).
examples:
◦ If
$$f(x) = \binom{n}{x} p^x (1-p)^{n-x} \qquad x = 0, 1, 2, \ldots, n$$
where 0 ≤ p ≤ 1 we have a binomial probability model with parameter p. The fact that
$$\sum_x f(x) = \sum_{x=0}^{n} \binom{n}{x} p^x (1-p)^{n-x} = 1$$
follows from the fact (Newton's binomial expansion) that
$$(a+b)^n = \sum_{x=0}^{n} \binom{n}{x} a^x b^{n-x}$$
for any a and b.
◦ If
$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!} \qquad x = 0, 1, 2, \ldots$$
where λ ≥ 0 we have a Poisson probability model with parameter λ. The fact that
$$\sum_x f(x) = \sum_{x=0}^{\infty} \frac{\lambda^x e^{-\lambda}}{x!} = 1$$
follows from the fact that
$$\sum_{x=0}^{\infty} \frac{\lambda^x}{x!} = e^{\lambda}$$
◦ If
$$f(x) = \lambda e^{-\lambda x} \qquad 0 \le x < \infty$$
where λ ≥ 0 we have an exponential probability model with parameter λ. The fact that
$$\int_x f(x)\,dx = \int_0^\infty \lambda e^{-\lambda x}\,dx = 1$$
follows from the fact that
$$\int_0^\infty e^{-\lambda x}\,dx = \frac{1}{\lambda}$$
◦ If
$$f(x) = (2\pi\sigma^2)^{-1/2} \exp\{-(x-\mu)^2/2\sigma^2\} \qquad -\infty < x < +\infty$$
where −∞ < µ < +∞ and σ > 0 we have a normal or Gaussian probability model with parameters µ and σ². The fact that
$$\int_x f(x)\,dx = \int_{-\infty}^{+\infty} (2\pi\sigma^2)^{-1/2} \exp\{-(x-\mu)^2/2\sigma^2\}\,dx = 1$$
is shown in the supplemental notes.
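A hedged sketch (Python, not part of the original notes) checking numerically that these pdfs assign total probability one; the parameter values are arbitrary choices for illustration.

from math import comb, exp, factorial, pi, sqrt

n, p = 10, 0.3
print(sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)))   # binomial: 1.0

lam = 2.5
print(sum(lam**x * exp(-lam) / factorial(x) for x in range(200)))        # Poisson: ~1.0

dx, mu, sigma = 0.001, 0.0, 1.0
grid = [i * dx for i in range(-10000, 10001)]
normal = sum(dx / (sqrt(2 * pi) * sigma) * exp(-(x - mu)**2 / (2 * sigma**2)) for x in grid)
expo = sum(dx * lam * exp(-lam * x) for x in grid if x >= 0)
print(round(normal, 3), round(expo, 3))                                  # both ~1.0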
Each of the above probability models plays a major role in the statistical analysis of data from experimental studies. The binomial is used to model prospective (cohort) and retrospective (case-control) studies in epidemiology, the Poisson is used to model accident data, the exponential is used to model failure time data and the normal distribution is used for measurement data which has a bell-shaped distribution as well as to approximate the binomial and Poisson. The normal distribution also figures in the calculation of many common statistics used for inference via the Central Limit Theorem. All of these models are special cases of the exponential family of distributions defined as having pdfs of the form:
$$f(x; \theta_1, \theta_2, \ldots, \theta_p) = C(\theta_1, \theta_2, \ldots, \theta_p)\,h(x)\exp\left\{\sum_{j=1}^{p} t_j(x)\,q_j(\theta)\right\}$$
2.3.4 Odds
Closely related to probabilities are odds.
• If the odds of an event E occurring are given as a to b this means, by definition, that
$$\frac{P(E)}{P(E^c)} = \frac{P(E)}{1 - P(E)} = \frac{a}{b}$$
We can solve for P(E) to obtain
$$P(E) = \frac{a}{a+b}$$
◦ Thus we can go from odds to probabilities and vice-versa.
◦ Thinking about probabilities in terms of odds sometimes provides useful interpretation of probability statements.
• Odds can also be given as the odds against E are c to d. This means that
$$\frac{P(E^c)}{P(E)} = \frac{1 - P(E)}{P(E)} = \frac{c}{d}$$
so that in this case
$$P(E) = \frac{d}{c+d}$$
• example: The odds against disease 1 are 9 to 1. Thus
$$P(\text{disease 1}) = \frac{1}{1+9} = .1$$
• example: The odds of thundershowers this afternoon are 2 to 3. Thus
$$P(\text{thundershowers}) = \frac{2}{2+3} = .4$$
• Ratios of odds are called odds ratios and play an important role in modern epidemiology where they are used to quantify the risk associated with exposure.
◦ example: Let OR be the odds ratio for the occurrence of a disease in an exposed
population relative to an unexposed or control population. Thus
$$OR = \frac{\text{odds of disease in exposed population}}{\text{odds of disease in control population}} = \frac{p_2/(1-p_2)}{p_1/(1-p_1)}$$
where $p_2$ is the probability of the disease in the exposed population and $p_1$ is the probability of the disease in the control population.
◦ Note that if OR = 1 then
$$\frac{p_2}{1-p_2} = \frac{p_1}{1-p_1}$$
which implies that $p_2 = p_1$, i.e. that the probability of disease is the same in the exposed and control populations.
◦ If OR > 1 then
$$\frac{p_2}{1-p_2} > \frac{p_1}{1-p_1}$$
which can be shown to imply that $p_2 > p_1$, i.e. that the probability of disease in the exposed population exceeds the probability of the disease in the control population.
◦ If OR < 1 the reverse conclusion holds, i.e. the probability of disease in the control population exceeds the probability of disease in the exposed population.
• The odds ratio, while useful in comparing the relative magnitude of risk of disease, does not convey the absolute magnitude of the risk (unless the risk is small).
◦ Note that
$$\frac{p_2/(1-p_2)}{p_1/(1-p_1)} = OR$$
implies that
$$p_2 = OR\left[\frac{p_1}{1 + (OR-1)p_1}\right]$$
◦ Consider a situation in which the odds ratio is 100 for exposed vs control. Thus if OR = 100 and $p_1 = 10^{-6}$ (one in a million) then $p_2$ is approximately $10^{-4}$ (one in ten thousand). If $p_1 = 10^{-2}$ (one in a hundred) then
$$p_2 = 100\left[\frac{\frac{1}{100}}{1 + 99\left(\frac{1}{100}\right)}\right] = \frac{100}{199} = .50$$
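An illustrative sketch (Python, not part of the original notes) of the odds calculations above: converting odds to probabilities and recovering $p_2$ from an odds ratio and the control risk $p_1$.

def prob_from_odds_for(a, b):
    """Odds of a to b *for* an event -> probability of the event."""
    return a / (a + b)

def p2_from_or(odds_ratio, p1):
    """Risk in the exposed group given the odds ratio and the control risk p1."""
    return odds_ratio * p1 / (1 + (odds_ratio - 1) * p1)

print(prob_from_odds_for(2, 3))        # thundershowers example: 0.4
print(1 - prob_from_odds_for(9, 1))    # odds *against* disease of 9 to 1: 0.1
print(p2_from_or(100, 1e-6))           # about 1e-4
print(p2_from_or(100, 0.01))           # 100/199, about 0.5025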
2.4 Interpretations of Probability
Philosophers have discussed for several centuries at various levels what constitutes “probability”. For our purposes probability has three useful operational interpretations.
2.4.1 Equally Likely Interpretation
Consider an experiment where the sample space consists of a finite number of elementary events
$$e_1, e_2, \ldots, e_N$$
If, before the experiment is performed, we consider each of the elementary events to be “equally likely” or exchangeable then an assignment of probability is given by
$$p(\{e_i\}) = \frac{1}{N}$$
This allows an interpretation of statements such as “we selected an individual at random
from a population” since in ordinary language at random means that each individual has the
same chance of being selected. Although defining probability via this recipe is circular it is
a useful interpretation in any situation where the sample space is finite and the elementary
events are deemed equally likely. It forms the basis of much of sample survey theory where
we select individuals at random from a population in order to investigate properties of the
population.
Summary: The equally likely interpretation assumes that each element in the sample
space has the same chance of occurring.
2.4.2 Relative Frequency Interpretation
Another interpretation of probability is the so called relative frequency interpretation.
• Imagine a long series of trials in which the event of interest either occurs or does not
occur.
• The relative frequency (number of trials in which the event occurs divided by the total
number of trials) of the event in this long series of trials is taken to be the probability
of the event.
• This interpretation of probability is the most widely used interpretation in scientific
studies. Note, however, that it is also circular.
• It is often called the “long run frequency interpretation”.
2.4.3 Subjective Probability Interpretation
This interpretation of probability requires the personal evaluation of probabilities using indifference between two wagers (bets).
Suppose that you are interested in determining the probability of an event E. Consider
two wagers defined as follows:
Wager 1 : You receive $100 if the event E occurs and nothing if it does not occur.
Wager 2 : There is a jar containing x white balls and N − x red balls. You receive $100 if
a white ball is drawn and nothing otherwise.
You are required to make one of the two wagers. Your probability of E is taken to be
the ratio x/N at which you are indifferent between the two wagers.
2.4.4 Does it Matter?
• For most applications of probability in modern statistics the specific interpretation of
probability does not matter all that much.
• What matters is that probabilities have the properties given in the definition and those
properties derived from them.
• In this course we will take probability as a primitive concept leaving it to philosophers
to argue the merits of particular interpretations.
• Each of the interpretations discussed above satisfies the three basic axioms of the
definition of probability.
2.5 Conditional Probability
• Conditional probabilities possess all the properties of probabilities.
• Conditional probabilities provide a method to revise probabilities in the light of additional information (the process itself is called conditioning).
• Conditional probabilities are important because almost all probabilities are conditional
probabilities.
example:
Suppose a coin is flipped twice and you are told that at least one of the flips is a head. What is the chance or probability that both flips are heads? Assuming a fair coin and a good toss
each of the four possibilities
{(H, H), (H, T ), (T, H), (T, T )}
which constitutes the sample space for this experiment has the same probability i.e. 1/4.
Since the information given rules out (T, T ), a logical answer for the conditional probability
of two heads given at least one head is 1/3.
example:
A family has three children. What is the probability that exactly two of the children are boys?
Assuming that gender distributions are equally likely the eight equally likely possibilities
are:
{(B, B, B), (B, B, G), (B, G, B), (G, B, B),
(G, G, B), (G, B, G), (B, G, G), (G, G, G)}
Thus the probability of two boys is
1/8 + 1/8 + 1/8 = 3/8
Depending on the conditioning information the probability of two boys is modified e.g.
• What is the probability of two boys if you are told that at least one child in the family
is a boy?
Answer: 3/7
• What is the probability of two boys if you are told that at least one child in the family
is a girl?
Answer: 3/7
• What is the probability of two boys if you are told that the oldest child is a boy?
Answer: 1/2
• What is the probability of two boys if you are told that the oldest child is a girl?
Answer: 1/4
We generalize to other situations using the following definition:
Definition: The conditional probability of event B given event A is
P(B|A) = P(B ∩ A)/P(A)
provided that P (A) > 0
example: The probability of two boys given that the oldest child is a boy is the probability
of the event “two boys in the family and the oldest child in the family is a boy” divided
by the probability of the event “the oldest child in the family is a boy”. Thus the required
conditional probability is given by
P({(B, G, B), (G, B, B)}) / P({(B, B, B), (B, G, B), (G, B, B), (G, G, B)}) = (2/8)/(4/8) = 1/2
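These conditional probabilities can also be checked by brute-force enumeration of the eight equally likely birth sequences. A minimal Python sketch (written for these notes; coding the oldest child as the last coordinate follows the example above):

from itertools import product

outcomes = list(product("BG", repeat=3))      # 8 equally likely sequences
def prob(event):
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

two_boys = lambda o: o.count("B") == 2
oldest_boy = lambda o: o[2] == "B"            # oldest child = last coordinate here

# P(exactly two boys | oldest child is a boy)
p_cond = prob(lambda o: two_boys(o) and oldest_boy(o)) / prob(oldest_boy)
print(p_cond)                                 # 0.5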
2.5.1 Multiplication Rule
The multiplication rule for probabilities is as follows:
P (A ∩ B) = P (A)P (B|A)
which can immediately be extended to
P (A ∩ B ∩ C) = P (A)P (B|A)P (C|A ∩ B)
and in general to:
P (E1 ∩ E2 ∩ · · · ∩ En ) = P (E1 )P (E2 |E1 ) · · · P (En |E1 ∩ E2 ∩ · · · ∩ En−1 )
example: There are n people in a room. What is the probability that at least two of the
people have a common birthday?
Solution: We first note that
P (common birthday) = 1 − P (no common birthday)
If there are just two people in the room then
P(no common birthday) = (365/365)(364/365)
while for three people we have
P(no common birthday) = (365/365)(364/365)(363/365)
It follows that the probability of no common birthday with n people in the room is given by
(365/365)(364/365) · · · ((365 − (n − 1))/365)
Simple calculations show that if n = 23 then the probability of no common birthday is slightly less than 1/2. Thus if the number of people in a room is 23 or larger the probability of a common birthday exceeds 1/2. The following is a short table of the results for other values of n:
n    Prob      n    Prob
2    .003      17   .315
3    .008      18   .347
4    .016      19   .379
5    .027      20   .411
6    .041      21   .444
7    .056      22   .476
8    .074      23   .507
9    .095      24   .538
10   .117      25   .569
11   .141      26   .598
12   .167      27   .627
13   .194      28   .654
14   .223      29   .681
15   .253      30   .706
16   .284      31   .730
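A short script reproduces the table. This is a sketch in Python (written for these notes, not part of the original):

def p_common_birthday(n):
    # probability that at least two of n people share a birthday
    p_no_match = 1.0
    for i in range(n):
        p_no_match *= (365 - i) / 365
    return 1 - p_no_match

for n in (22, 23, 30):
    print(n, round(p_common_birthday(n), 3))   # 22 -> 0.476, 23 -> 0.507, 30 -> 0.706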
2.5.2 Law of Total Probability
Law of Total Probability:
For any event E we have
P(E) = Σ_i P(E|Ei)P(Ei)
where the Ei form a partition of the sample space, i.e. the Ei are mutually exclusive and their
union is the sample space.
example: An examination consists of multiple choice questions, each with 5 alternative answers, only one of which is correct. If a
student has diligently done his or her homework he or she is certain to select the correct
answer. If not he or she has only a one in five chance of selecting the correct answer (i.e.
they choose an answer at random). Let
• p be the probability that the student does their homework
• A the event that they do their homework
• B the event that they select the correct answer
(i) What is the probability that the student selects the correct answer to a question?
Solution: We are given
P(A) = p ; P(B|A) = 1 and P(B|A^c) = 1/5
By the Law of Total Probability
P(B) = P(A)P(B|A) + P(A^c)P(B|A^c)
     = p × 1 + (1 − p) × (1/5)
     = (5p + 1 − p)/5
     = (4p + 1)/5
(ii) What is the probability that the student did his or her homework given that they
selected the correct answer to the question?
Solution: In this case we want P (A|B) so that
P(A|B) = P(A ∩ B)/P(B)
       = P(A)P(B|A)/P(B)
       = (1 × p)/((4p + 1)/5)
       = 5p/(4p + 1)
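As a quick numerical check, both answers can be evaluated for any homework probability p. A minimal Python sketch (the value p = 0.8 is only an illustration, not from the notes):

def p_correct(p):
    # law of total probability: P(B) = p*1 + (1 - p)*(1/5)
    return (4 * p + 1) / 5

def p_homework_given_correct(p):
    # Bayes: P(A|B) = p / P(B)
    return p / p_correct(p)

p = 0.8
print(p_correct(p))                   # 0.84
print(p_homework_given_correct(p))    # ~0.952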
example: Cross-Sectional Study
Suppose a population of individuals is classified into four categories defined by
• their disease status (D is diseased and Dc is not diseased)
• their exposure status (E is exposed and E c is not exposed).
If we observe a sample of n individuals so classified we have the following population
probabilities and observed data.
Population Probabilities
            D^c             D             Total
E^c         P(E^c, D^c)     P(E^c, D)     P(E^c)
E           P(E, D^c)       P(E, D)       P(E)
Total       P(D^c)          P(D)          1

Sample Numbers
            D^c             D             Total
E^c         n(E^c, D^c)     n(E^c, D)     n(E^c)
E           n(E, D^c)       n(E, D)       n(E)
Total       n(D^c)          n(D)          n

The law of total probability then states that
P(D) = P(E, D) + P(E^c, D) = P(D|E)P(E) + P(D|E^c)P(E^c)
Define the following quantities:
Population Parameters
  prob of exposure:                 P(E) = P(E, D) + P(E, D^c)
  prob of disease given exposed:    P(D|E) = P(E, D)/P(E)
  odds of disease if exposed:       O(D|E) = P(D, E)/P(D^c, E)
  odds of disease if not exposed:   O(D|E^c) = P(D, E^c)/P(D^c, E^c)
  odds ratio (relative odds):       OR = O(D|E)/O(D|E^c)
  relative risk:                    RR = P(D|E)/P(D|E^c)

Sample Estimates
  prob of exposure:                 p(E) = [n(E, D) + n(E, D^c)]/n
  prob of disease given exposed:    p(D|E) = n(D, E)/n(E)
  odds of disease if exposed:       o(D|E) = n(D, E)/n(D^c, E)
  odds of disease if not exposed:   o(D|E^c) = n(D, E^c)/n(D^c, E^c)
  odds ratio (relative odds):       or = o(D|E)/o(D|E^c)
  relative risk:                    rr = p(D|E)/p(D|E^c)
It can be shown that if the disease is rare in both the exposed group and the non exposed
group then
OR ≈ RR
The above population parameters are fundamental to the epidemiological approach to
the study of disease as it relates to exposure.
example: In demography the crude death rate is defined as
CDR = Total Deaths/Population Size = D/N
If the population is divided into k age groups or other strata defined by gender, ethnicity,
etc. then D = D1 + D2 + · · · + Dk and N = N1 + N2 + · · · + Nk and hence
CDR = D/N = (Σ_{i=1}^k Di)/N = (Σ_{i=1}^k Ni Mi)/N = Σ_{i=1}^k pi Mi
where Mi = Di/Ni is the age specific death rate for the ith age group and pi = Ni/N is the proportion of the population in the ith age group. This is directly analogous to the law of total probability.
2.6 Bayes Theorem
Bayes theorem combines the definition of conditional probability, the multiplication rule and
the law of total probability and asserts that
P(Ei|E) = P(Ei)P(E|Ei) / Σ_j P(Ej)P(E|Ej)
• where E is any event
• the Ej constitute a partition of the sample space
• Ei is any event in the partition.
Since
P(Ei|E) = P(Ei ∩ E)/P(E)
P(Ei ∩ E) = P(Ei)P(E|Ei)
P(E) = Σ_j P(Ej)P(E|Ej)
Bayes theorem is obviously true.
Note: A partition of the sample space is a collection of mutually exclusive events such that
their union is the sample space.
example: The probability of disease given exposure is .5 while the probability of disease
given non-exposure is .1. Suppose that 10% of the population is exposed. If a diseased
individual is detected what is the probability that the individual was exposed?
Solution: By Bayes theorem
P(Ex|Dis) = P(Ex)P(Dis|Ex) / [P(Ex)P(Dis|Ex) + P(No Ex)P(Dis|No Ex)]
          = (.1)(.5) / [(.1)(.5) + (.9)(.1)]
          = 5/(5 + 9)
          = 5/14
The intuitive explanation for this result is as follows:
• Given 1,000 individuals 100 will be exposed and 900 not exposed
• Of the 100 individuals exposed 50 will have the disease.
• of the 900 non exposed individuals 90 will have the disease
Thus of the 140 individuals with the disease, 50 will have been exposed which yields a proportion of 5/14.
example: Diagnostic Tests
In this type of study we are interested in the performance of a diagnostic test designed
to determine whether a person has a disease. The test has two possible results:
• + positive test (the test indicates presence of disease).
• − negative test (the test does not indicate presence of disease).
We thus have the following setup:
Population Probabilities
            D^c            D            Total
−           P(−, D^c)      P(−, D)      P(−)
+           P(+, D^c)      P(+, D)      P(+)
Total       P(D^c)         P(D)         1

Sample Numbers
            D^c            D            Total
−           n(−, D^c)      n(−, D)      n(−)
+           n(+, D^c)      n(+, D)      n(+)
Total       n(D^c)         n(D)         n
We define the following quantities:
Population Parameters
  sensitivity:                 P(+|D) = P(+, D)/[P(+, D) + P(−, D)]
  specificity:                 P(−|D^c) = P(−, D^c)/[P(−, D^c) + P(+, D^c)]
  positive test probability:   P(+) = P(+, D) + P(+, D^c)
  negative test probability:   P(−) = P(−, D) + P(−, D^c)
  positive predictive value:   P(D|+) = P(+, D)/P(+)
  negative predictive value:   P(D^c|−) = P(−, D^c)/P(−)

Sample Estimates
  sensitivity:                 p(+|D) = n(+, D)/[n(+, D) + n(−, D)]
  specificity:                 p(−|D^c) = n(−, D^c)/[n(−, D^c) + n(+, D^c)]
  proportion positive test:    p(+) = n(+)/n
  proportion negative test:    p(−) = n(−)/n
  positive predictive value:   p(D|+) = p(+, D)/p(+)
  negative predictive value:   p(D^c|−) = p(−, D^c)/p(−)
As an example consider the performance of a blood sugar diagnostic test to determine
whether a person has diabetes. The test has two possible results:
• + positive test (the test indicates presence of diabetes).
• − negative test (the test does not indicate presence of diabetes).
The following numerical example is from Epidemiology (1996) Gordis, L. W. B. Saunders.
We have the following setup:
Population Probabilities
            D^c            D            Total
−           P(−, D^c)      P(−, D)      P(−)
+           P(+, D^c)      P(+, D)      P(+)
Total       P(D^c)         P(D)         1

Sample Numbers
            D^c      D      Total
−           7600     150    7750
+           1900     350    2250
Total       9500     500    10,000
We calculate the following quantities:
Population Parameters
  sensitivity:                 P(+|D) = P(+, D)/[P(+, D) + P(−, D)]
  specificity:                 P(−|D^c) = P(−, D^c)/[P(−, D^c) + P(+, D^c)]
  positive test probability:   P(+) = P(+, D) + P(+, D^c)
  negative test probability:   P(−) = P(−, D) + P(−, D^c)
  positive predictive value:   P(D|+) = P(+, D)/P(+)
  negative predictive value:   P(D^c|−) = P(−, D^c)/P(−)

Sample Estimates
  sensitivity:                 p(+|D) = 350/500 = .70
  specificity:                 p(−|D^c) = 7600/9500 = .80
  proportion positive test:    p(+) = 2250/10,000 = .225
  proportion negative test:    p(−) = 7750/10,000 = .775
  positive predictive value:   p(D|+) = 350/2250 = 0.156
  negative predictive value:   p(D^c|−) = 7600/7750 = 0.98
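A short sketch in Python reproduces these estimates from the 2 × 2 counts (written for these notes, using the Gordis numbers above):

# counts: rows are test result (-, +), columns are (D^c, D)
n_neg_Dc, n_neg_D = 7600, 150
n_pos_Dc, n_pos_D = 1900, 350

sensitivity = n_pos_D / (n_pos_D + n_neg_D)        # 0.70
specificity = n_neg_Dc / (n_neg_Dc + n_pos_Dc)     # 0.80
ppv = n_pos_D / (n_pos_D + n_pos_Dc)               # ~0.156
npv = n_neg_Dc / (n_neg_Dc + n_neg_D)              # ~0.98
print(sensitivity, specificity, round(ppv, 3), round(npv, 3))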
2.7 Independence
Closely related to the concept of conditional probability is the concept of independence of
events.
Definition Events A and B are said to be independent if
P (B|A) = P (B)
Thus knowledge of the occurrence of A does not influence the assignment of probabilities to
B.
Since
P(B|A) = P(A ∩ B)/P(A)
it follows that if A and B are independent then
P (A ∩ B) = P (A)P (B)
This last formulation of independence is the definition used in building probability models.
2.8 Bernoulli trial models; the binomial distribution
• One of the most important probability models is the binomial. It is widely used in
epidemiology and throughout statistics.
• The binomial model is based on the assumption of Bernoulli trials.
The assumptions for a Bernoulli trial model are
(1) The result of the experiment or study can be thought of as the result of n smaller
experiments called trials each of which has only two possible outcomes e.g. (dead,
alive), (diseased, non-diseased), (success, failure)
(2) The outcomes of the trials are independent
(3) The probabilities of the outcomes of the trials remain the same from trial to trial
(homogeneous probabilities).
example 1: A group of n individuals are tested to see if they have elevated levels of
cholesterol. Assuming the results are recorded as elevated or not elevated and we can justify
(2) and (3) we may apply the Bernoulli trial model.
example 2: A population of n individuals is found to have d deaths during a given period
of time. Assuming we can justify (2) and (3) we may use the Bernoulli model to describe
the results of the study.
In Bernoulli trial models the quantity of interest is the number of successes x which
occur in the n trials. It can be shown that the following formula gives the probability of
obtaining x successes in n Bernoulli trials
P(x) = (n choose x) p^x (1 − p)^(n−x)
where
• x can be 0, 1, 2, . . . , n
• p is the probability of success on a given trial
• (n choose x), read as "n choose x", is defined by
(n choose x) = n!/[x!(n − x)!]
In this last formula r! = r(r − 1)(r − 2) · · · 3 · 2 · 1 for any integer r and 0! = 1.
Note: The term distribution is used because the formula describes how to distribute probability over the possible values of x.
example: The chance or probability of having an elevated cholesterol level is 1/100. If 10 individuals are examined, what is the probability that one or more of them will have an elevated level?
Solution: The binomial model applies so that
P(0) = (10 choose 0)(.01)^0 (1 − .01)^(10−0) = (.99)^10
Thus
P(1 or more elevated) = 1 − P(0 elevated) = 1 − (.99)^10 = .096
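A quick check of this calculation, as a Python sketch (not part of the original notes):

from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

p_none = binom_pmf(0, 10, 0.01)
print(1 - p_none)    # ~0.0956, roughly a 9.6% chance of at least one elevated level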
2.9 Parameters and Random Sampling
• The numbers n and p which appear in the formula for the binomial distribution are
examples of what statisticians call parameters.
• Different values of n and p give different assignments of probabilities each of the binomial type.
• Thus a parameter can be considered as a label which identifies the particular assignment of probabilities.
• In applications of the binomial distribution the parameter n is known and can be fixed
by the investigator - it is thus a study design parameter.
• The parameter p, on the other hand, is unknown and obtaining information about it
is the reason for performing the experiment.
We use the observed data and the model to tell us something about p. This same set-up
applies in most applications of statistics.
To summarize:
• Probability distributions relate observed data to parameters.
• Statistical methods use data and probability models to make statements
about the parameters of interest.
In the case of the binomial the parameter of interest is p, the probability of success on a
given trial.
example: Random sampling and the binomial distribution. In many circumstances we are
given the results of a survey or study in which the investigators state that they examined a
“random sample” from the population of interest.
Suppose we have a population containing N individuals or objects. We are presented
with a “random sample” consisting of n individuals from the population. What does this
mean? We begin by defining what we mean by a sample.
Definition: A sample of size n from a target population T containing N objects is an
ordered collection of n objects each of which is an object in the target population.
In set notation a sample is just an n-tuple with each coordinate being an element of the
target population. In symbols then a sample s is
s = (a1 , a2 , . . . , an )
where a1 ∈ T, a2 ∈ T, . . . , an ∈ T .
Specific example:
If T = {a, b, c, d} then a possible sample of size 2 is (a, b) while some others are (b, a) and
(c, d). What about (a, a)? Clearly, this is a sample according to the definition.
To distinguish between these two types of samples:
• A sample is taken with replacement if an element in the population can appear
more than once in the sample
• A sample is taken without replacement if an element in the population can appear
at most once in the sample.
Thus in our example the possible samples of size 2 with replacement are
(a, a)  (a, b)  (a, c)  (a, d)
(b, a)  (b, b)  (b, c)  (b, d)
(c, a)  (c, b)  (c, c)  (c, d)
(d, a)  (d, b)  (d, c)  (d, d)
while without replacement the possible samples are
(a, b)  (a, c)  (a, d)
(b, a)  (c, a)  (d, a)
(b, c)  (c, b)  (b, d)
(d, b)  (c, d)  (d, c)
Definition: A random sample of size n from a population of size N is a sample which is
selected such that each sample has the same chance of being selected i.e.
P(sample selected) = 1/(number of possible samples)
Thus in the example each sample with replacement would be assigned a chance of 1/16 while each sample without replacement would be assigned a chance of 1/12 for random sampling.
In the general case,
• For sampling with replacement the probability assigned to each sample is 1/N^n
• For sampling without replacement the probability assigned to each sample is 1/(N)_n where (N)_n is given by
(N)_n = N(N − 1)(N − 2) · · · (N − n + 1)
In our example we see that
N^n = 4^2 = 16 and (N)_n = (4)_2 = 4(4 − 2 + 1) = 4 × 3 = 12
To summarize: A random sample is the result of a selection process in which
each sample has the same chance of being selected.
Suppose now that each object in the population can be classified into one of two categories
e.g. (exposed, not exposed), (success, failure), (A, not A), (0, 1) etc. For definiteness let us
call the two outcomes success and failure and denote them by S and F .
In the example suppose that a and b are successes while c and d are failures. The target
population is now
T = {a(S), b(S), c(F ), d(F )}
In general D of the objects will be successes and N − D will be failures.
The question of interest is: If we select a random sample of size n from a population of
size N consisting of D successes and N − D failures, what is the probability that x successes
will be observed in the sample?
In the example we see that with replacement the samples are
(a(S), a(S))  (a(S), b(S))  (a(S), c(F))  (a(S), d(F))
(b(S), a(S))  (b(S), b(S))  (b(S), c(F))  (b(S), d(F))
(c(F), a(S))  (c(F), b(S))  (c(F), c(F))  (c(F), d(F))
(d(F), a(S))  (d(F), b(S))  (d(F), c(F))  (d(F), d(F))
Thus if sampling is at random with replacement the probabilities of 0 successes, 1 success
and 2 successes are given by
P(0) = 4/16, P(1) = 8/16, P(2) = 4/16
If sampling is at random without replacement the probabilities are given by
P(0) = 2/12, P(1) = 8/12, P(2) = 2/12
These probabilities can, in the general case, be shown to be
without replacement:   P(x successes) = (n choose x) (D)_x (N − D)_(n−x) / (N)_n
with replacement:      P(x successes) = (n choose x) (D/N)^x (1 − D/N)^(n−x)
The distribution without replacement is called the hypergeometric distribution with
parameters N, n and D. The distribution with replacement is the binomial distribution with
parameters n and p = D/N .
In many applications the sample size, n, is small relative to the population size N . In
this situation it can be shown that the formula
(n choose x) (D/N)^x (1 − D/N)^(n−x)
provides an adequate approximation to the probabilities for sampling without replacement.
Thus for most applications, random sampling from a population in which each individual
is classified as a success or a failure results in a binomial distribution for the probability of
obtaining x successes in the sample.
The interpretation of the parameter p = D/N is thus:
• “the proportion of successes in the target population”
• “the chance that an individual selected at random will be classified as a success”.
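For the small example above (N = 4, D = 2, n = 2) the two formulas can be checked numerically. A sketch in Python using scipy.stats (using scipy here is my own choice; the notes do not prescribe any software):

from scipy.stats import hypergeom, binom

N, D, n = 4, 2, 2   # population size, successes in population, sample size
for x in range(n + 1):
    # scipy's hypergeom takes (k, M, n, N) = (x, population, successes, draws)
    p_without = hypergeom.pmf(x, N, D, n)    # 2/12, 8/12, 2/12
    p_with = binom.pmf(x, n, D / N)          # 4/16, 8/16, 4/16
    print(x, round(p_without, 4), round(p_with, 4))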
example: Prospective (Cohort) Study
In this type of study
• we observe n(E) individuals who are exposed and n(E c ) individuals who are not exposed.
• These individuals are followed and the number in each group who develop the disease
are recorded.
We thus have the following setup:
Population Probabilities
            D^c             D            Total
E^c         P(D^c|E^c)      P(D|E^c)     1
E           P(D^c|E)        P(D|E)       1

Sample Numbers
            D^c             D            Total
E^c         n(D^c, E^c)     n(D, E^c)    n(E^c)
E           n(D^c, E)       n(D, E)      n(E)
We can model this situation as two independent binomial distributions as follows:
n(D, E) is binomial (n(E), P (D|E))
n(D, E c ) is binomial (n(E c ), P (D|E c ))
We define the following quantities:
Population Parameters
  prob of disease given exposed:    P(D|E) = P(E, D)/P(E)
  odds of disease if exposed:       O(D|E) = P(D, E)/P(D^c, E)
  odds of disease if not exposed:   O(D|E^c) = P(D, E^c)/P(D^c, E^c)
  odds ratio (relative odds):       OR = O(D|E)/O(D|E^c)
  relative risk:                    RR = P(D|E)/P(D|E^c)

Sample Estimates
  prob of disease given exposed:    p(D|E) = n(D, E)/n(E)
  odds of disease if exposed:       o(D|E) = n(D, E)/n(D^c, E)
  odds of disease if not exposed:   o(D|E^c) = n(D, E^c)/n(D^c, E^c)
  odds ratio (relative odds):       or = o(D|E)/o(D|E^c)
  relative risk:                    rr = p(D|E)/p(D|E^c)
As an example consider the following hypothetical study in which we follow smokers
and non smokers to see which individuals develop coronary heart disease (CHD). Thus E is
smoker and E c is non smoker. This example is from Epidemiology (1996) Gordis, L. W. B.
Saunders.
We have the following setup:
Population Probabilities
            D^c             D            Total
E^c         P(D^c|E^c)      P(D|E^c)     1
E           P(D^c|E)        P(D|E)       1

Sample Numbers
            No CHD     CHD     Total
E^c         4,913      87      5,000
E           2,916      84      3,000
We calculate the following quantities:
Population Parameters
  prob of disease given exposed:    P(D|E) = P(E, D)/P(E)
  odds of disease if exposed:       O(D|E) = P(D, E)/P(D^c, E)
  odds of disease if not exposed:   O(D|E^c) = P(D, E^c)/P(D^c, E^c)
  odds ratio (relative odds):       OR = O(D|E)/O(D|E^c)
  relative risk:                    RR = P(D|E)/P(D|E^c)

Sample Estimates
  prob of disease given exposed:    p(CHD|S) = 84/3,000 = 0.028
  odds of disease if exposed:       o(CHD|S) = 84/2916 = 0.0288
  odds of disease if not exposed:   o(CHD|NS) = 87/4913 = 0.0177
  odds ratio (relative odds):       or = (84/2916)/(87/4913) = 1.63
  relative risk:                    rr = (84/3000)/(87/5000) = 1.61
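These estimates are easy to reproduce from the four counts. A minimal Python sketch (written for these notes):

# prospective (cohort) data: exposed = smokers, unexposed = non-smokers
d_exp, n_exp = 84, 3000        # CHD cases among smokers, number of smokers
d_unexp, n_unexp = 87, 5000    # CHD cases among non-smokers, number of non-smokers

rr = (d_exp / n_exp) / (d_unexp / n_unexp)                          # relative risk ~1.61
or_ = (d_exp / (n_exp - d_exp)) / (d_unexp / (n_unexp - d_unexp))   # odds ratio ~1.63
print(round(rr, 2), round(or_, 2))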
example: Retrospective (Case-Control) Study
In this type of study we
• Select n(D) individuals who have the disease (cases) and n(Dc ) individuals who do not
have the disease (controls).
• Then the number of individuals in each group who were exposed is determined.
We thus have the following setup:
Population Probabilities
            D^c            D
E^c         P(E^c|D^c)     P(E^c|D)
E           P(E|D^c)       P(E|D)
Total       1              1

Sample Numbers
            D^c            D
E^c         n(D^c, E^c)    n(D, E^c)
E           n(D^c, E)      n(D, E)
Total       n(D^c)         n(D)
We can model this situation as two independent binomials as follows:
n(D, E) is binomial (n(D), P (E|D))
n(Dc , E) is binomial (n(Dc ), P (E|Dc ))
Define the following quantities:
Population Parameters
  prob of exposed given diseased:   P(E|D)
  odds of exposed if diseased:      O(E|D) = P(E|D)/P(E^c|D)
  odds of exposed if not diseased:  O(E|D^c) = P(E|D^c)/P(E^c|D^c)
  odds ratio (relative odds):       OR = O(E|D)/O(E|D^c)

Sample Estimates
  prob of exposed given diseased:   p(E|D) = n(D, E)/n(D)
  odds of exposed if diseased:      o(E|D) = n(D, E)/n(D, E^c)
  odds of exposed if not diseased:  o(E|D^c) = n(E, D^c)/n(E^c, D^c)
  odds ratio (relative odds):       or = o(E|D)/o(E|D^c)
As an example consider the following hypothetical study in which we examine individuals with coronary heart disease (CHD) (cases) and individuals without coronary heart disease (controls). We then determine which individuals were smokers and which were not. Thus E is smoker and E^c is non-smoker. This example is from Epidemiology (1996) Gordis, L. W. B. Saunders.
Population Probabilities
            Controls       Cases
E^c         P(E^c|D^c)     P(E^c|D)
E           P(E|D^c)       P(E|D)
Total       1              1

Sample Numbers
            Controls    Cases
E^c         224         88
E           176         112
Total       400         200
We calculate the following quantities:
Population Parameters
  prob of exposed given diseased:   P(E|D)
  odds of exposed if diseased:      O(E|D) = P(E|D)/P(E^c|D)
  odds of exposed if not diseased:  O(E|D^c) = P(E|D^c)/P(E^c|D^c)
  odds ratio (relative odds):       OR = O(E|D)/O(E|D^c)

Sample Estimates
  prob of exposed given diseased:   p(E|D) = 112/200 = 0.56
  odds of exposed if diseased:      o(E|D) = 112/88 = 1.27
  odds of exposed if not diseased:  o(E|D^c) = 176/224 = 0.79
  odds ratio (relative odds):       or = (112/88)/(176/224) = 1.62
2.10 Probability Examples
The following two examples illustrate the importance of probability in solving real problems.
Each of the topics presented has been extended and generalized since its introduction.
2.10.1 Randomized Response
Suppose that a sociologist is interested in determining the prevalence of child abuse in a
population. Obviously if individual parents are asked a question such as “have you abused
your child” the reliability of the answer is in doubt. The sociologist would ideally like the
parent to respond with an honest choice between the following two questions:
(i) Have you ever abused your children?
(ii) Have you not abused your children?
A clever method for determining prevalence in such a situation is to provide the respondent with a randomization device such as a deck of cards in which a proportion P of the
cards are marked with the number 1 and the remainder with the number 2. The respondent
selects a card at random and replaces it with the result unknown to the interviewer. Thus
confidentiality of the respondent is protected. If the card drawn is 1 the respondent answers
truthfully to question 1 whereas if the card drawn is a 2 the respondent answers truthfully
to question 2.
It follows that the probability λ that the respondent answers yes is given by
λ = P(yes)
  = P(yes|Q1)P(Q1) + P(yes|Q2)P(Q2)
  = πP + (1 − π)(1 − P)
where π is the prevalence (the proportion in the population who abuse their children) and P is the proportion of 1's in the deck of cards. We assume P ≠ 1/2.
If we use this procedure on n respondents and observe x yes answers then the observed
proportion x/n is a natural estimate of πP + (1 − π)(1 − P ) i.e.
λ̂ = x/n = πP + (1 − π)(1 − P)
Since we know P we can solve for π giving us the estimate
π̂ = (λ̂ + P − 1)/(2P − 1)
Reference: Encyclopedia of Biostatistics.
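A small sketch in Python of the estimator above (the counts used are made up purely for illustration):

def prevalence_estimate(x, n, P):
    # lambda_hat = x/n estimates pi*P + (1 - pi)*(1 - P); solve for pi
    lam_hat = x / n
    return (lam_hat + P - 1) / (2 * P - 1)

# e.g. 340 "yes" answers from 1000 respondents with 70% of the cards marked 1
print(prevalence_estimate(340, 1000, 0.7))   # ~0.10 with these hypothetical numbers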
2.10.2 Screening
As another simple application of probability consider the following situation. We have a
fixed amount of money available to test individuals for the presence of a disease, say $1,000.
The cost of testing one sample of blood is $5. We have to test a population of size 1,000 in
which we suspect the prevalence of the disease is 3/1,000. Can we do it? If we divide the
population into 100 groups of size 10 then there should be 1 diseased individual in 3 of the
groups and the remaining 97 groups will be disease free. If we pool the samples from each
group and test each grouped sample we would need 100 + 30 = 130 tests instead of 1,000
tests to screen everyone.
The probabilistic version is as follows: A large number N of individuals are subject to a
blood test which can be administered in one of two ways
(i) Each individual is to be tested separately so that N tests are required.
(ii) The samples of n individuals can be pooled or combined and tested. If this test is negative then the one test suffices to clear all of these n individuals. If this test is positive then each of the n individuals in that group must be tested. Thus n + 1 tests are required if the pooled sample tests positive.
Assume that individuals are independent and that each has probability p of testing
positive. Clearly we have a Bernoulli trial model and hence the probability that the combined
sample will test positive is
P(combined test positive) = 1 − P(combined test negative) = 1 − (1 − p)^n
Thus we have for any group of size n
P(1 test) = (1 − p)^n ; P(n + 1 tests) = 1 − (1 − p)^n
It follows that the expected number of tests if we combine samples is
(1 − p)^n + (n + 1)[1 − (1 − p)^n] = n + 1 − n(1 − p)^n
Thus if there are N/n groups we expect to run
N[1 + 1/n − (1 − p)^n]
tests if we combine samples instead of the N tests if we test each individual. Given a value
of p we can choose n to minimize the total number of tests.
As an example with N = 1, 000 and p = .01 we have the following numbers
Group Size    Number of Tests
2             519.9
3             363.0343
4             289.404
5             249.0099
6             225.1865
7             210.7918
8             202.2553
9             197.5939
10            195.6179
11            195.5708
12            196.9485
13            199.4021
14            202.6828
15            206.6083
16            211.0422
17            215.8803
18            221.0418
19            226.463
20            232.0931
Thus we should combine individuals into groups of size 10 or 11, in which case we expect to run about 196 tests instead of 1,000. Clearly we achieve real savings.
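The expected-test calculation is easy to script. A Python sketch that reproduces the table and finds the best group size (written for these notes):

def expected_tests(N, p, n):
    # expected number of tests when N people are screened in pools of size n
    return N * (1 + 1 / n - (1 - p) ** n)

N, p = 1000, 0.01
best = min(range(2, 21), key=lambda n: expected_tests(N, p, n))
print(best, round(expected_tests(N, p, best), 1))   # 11, ~195.6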
Reference: Feller, W. (1950). An Introduction to Probability Theory and Its Applications. John Wiley & Sons.
Figure 2.1: Graph of Expected Number of Tests vs Group Size (N = 1,000 and p = .01)
Chapter 3
Probability Distributions

3.1 Random Variables and Distributions

3.1.1 Introduction
Most of the responses we model in statistics are numerical. It is useful to have a notation for
real valued responses. Real valued responses are called random variables. The notation
is not only convenient, it is imperative when we consider statistics, defined as functions of
sample data. The probability models for these random variables are called their sampling
distributions and form the foundation of the modern theory of statistics.
Definition:
• Before the experiment is performed the possible numerical response is denoted by X,
X is called a random variable.
• After the experiment is performed the observed value of X is denoted by x. We call
x the realized or observed value of X.
Notation:
• The set of all possible values of a random variable X is called the sample space of
X and is denoted by X .
• The probability model of X is denoted by PX and we write
PX (B) = P (X ∈ B)
for the probability that the event X ∈ B occurs.
• The probability model for X is called the probability distribution of X.
There are two types of random variables which are of particular importance: discrete and
continuous. These correspond to the two types of numbers introduced in the overview section
and the two types of probability density functions introduced in the probability section.
• A random variable is discrete if its possible values (sample space) constitute a finite
or countable set e.g.
X = {0, 1} ; X = {0, 1, 2, . . . , n} ; X = {0, 1, 2, . . .}
◦ Discrete random variables arise when we consider response variables which are
categorical or counts.
• A random variable is continuous or numeric if its possible values (sample space) is
an interval of real numbers e.g.
X = [0, ∞) ; X = (−∞, ∞)
◦ Continuous random variables arise when we consider response variables which are
recorded on interval or ratio scales.
3.1.2 Discrete Random Variables
Probabilities for discrete random variables are specified by the probability density function
p(x) :
PX(B) = P(X ∈ B) = Σ_{x∈B} p(x)
Probability density functions for discrete random variables have the properties
• 0 ≤ p(x) ≤ 1 for all x in the sample space X
• Σ_{x∈X} p(x) = 1
Binomial Distribution
A random variable is said to have a binomial distribution if its probability density function
is of the form:
p(x) = (n choose x) p^x (1 − p)^(n−x) for x = 0, 1, 2, . . . , n
where 0 ≤ p ≤ 1.
If we define X as the number of successes in n Bernoulli trials then X is a random
variable with a binomial distribution. The parameters are n and p where p is the probability
of success on a given trial. The term distribution is used because the formula describes how
to distribute probability over the possible values of x.
Recall that the assumptions necessary for a Bernoulli trial model to apply are:
• The result of the experiment or study consists of the result of n smaller experiments
called trials each of which has only two possible outcomes e.g. (dead, alive), (diseased,
non-diseased), (success, failure).
• The outcomes of the trials are independent.
• The probabilities of the outcomes of the trials remain the same from trial to trial
(homogeneous probabilities).
Figure 3.1: Histograms of Binomial Distributions
Note that as n ↑ the binomial distribution becomes more symmetric.
Poisson Distribution
A random variable is said to have a Poisson distribution if its probability distribution is
given by
p(x) = λ^x e^(−λ)/x! for x = 0, 1, 2, . . .
• The parameter of the Poisson distribution is λ.
• The Poisson distribution is one of the most important distributions in the applications
of statistics to public health problems. The reasons are:
◦ It is ideally suited for modelling the occurrence of “rare events”.
◦ It is also particularly useful in modelling situations involving person-time.
◦ Specific examples of situations in which the Poisson distribution applies include:
◦ Number of deaths due to a rare disease
◦ Spatial distribution of bacteria
◦ Accidents
The Poisson distribution is also useful in modelling the occurrence of events over time. Suppose that we are interested in modelling a process where:
(1) The occurrences of the event in an interval of time are independent.
(2) The probability of a single occurrence of the event in an interval of time is proportional
to the length of the interval.
(3) In any extremely short time interval, the probability of more than one occurrence of
the event is approximately zero.
Under these assumptions:
• The distribution of the random variable X, defined as the number of occurrences of
the event in the interval is given by the Poisson distribution.
• The parameter λ in this case is the average number of occurrences of the event in the
interval i.e.
λ = µt where µ is the rate per unit time
example: Suppose that the suicide rate in a large city is 2 per week. Then the probability
of two suicides in one week is
P(2 suicides in one week) = 2^2 e^(−2)/2! = .2707 ≈ .271
The probability of two suicides in three weeks is
P(2 suicides in three weeks) = 6^2 e^(−6)/2! = .0446 ≈ .045
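A quick numerical check of these two Poisson probabilities, as a Python sketch (not part of the original notes):

from math import exp, factorial

def poisson_pmf(x, lam):
    return lam**x * exp(-lam) / factorial(x)

print(round(poisson_pmf(2, 2), 4))   # 0.2707  (rate 2 per week, one week)
print(round(poisson_pmf(2, 6), 4))   # 0.0446  (rate 2 per week, three weeks)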
example: The Poisson distribution is often used as a model for the probability of automobile
or other accidents for the following reasons:
(1) The population exposed is large.
(2) The number of people involved in accidents is small.
(3) The risk for each person is small.
(4) Accidents are “random”.
(5) The probability of being in two or more accidents in a short time period is approximately zero.
Approximations using the Poisson Distribution
Poisson probabilities can be used to approximate binomial probabilities when n is large, p is small and λ is taken to be np. Thus for n = 150 and p = .02 we have the following table:
x     Binomial (n = 150, p = .02)    Poisson (λ = 150(.02) = 3)
0     0.04830                        0.04979
1     0.14784                        0.14936
2     0.22478                        0.22404
3     0.22631                        0.22404
4     0.16974                        0.16803
5     0.10115                        0.10082
6     0.04989                        0.05041
7     0.02094                        0.02160
8     0.00764                        0.00810
9     0.00246                        0.00270
10    0.00071                        0.00081
11    0.00018                        0.00022
12    0.00004                        0.00006
13    0.00001                        0.00001
14    0.00000                        0.00000
Note the closeness of the approximation. The supplementary notes contain a “proof” of the proposition that the Poisson approximates the binomial when n is large and p is small.
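The table can be regenerated in a few lines. A Python sketch using scipy.stats (scipy is my own choice here, not something the notes require):

from scipy.stats import binom, poisson

n, p = 150, 0.02
lam = n * p
for x in range(6):
    print(x, round(binom.pmf(x, n, p), 5), round(poisson.pmf(x, lam), 5))
# e.g. x = 0: ~0.0483 (binomial) vs ~0.0498 (Poisson)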
Figure 3.2: Histograms of Poisson Distributions
Note that as λ ↑ the Poisson distribution becomes more symmetric.
3.1.3 Continuous or Numeric Random Variables
Probabilities for numeric or continuous random variables are given by the area under the
curve of its probability density function f (x).
P(E) = ∫_E f(x) dx
• f (x) has the properties:
◦ f (x) ≥ 0
◦ The total area under the curve is one
• Probabilities for numeric random variables are tabled or can be calculated using a
statistical software package.
The Normal Distribution
By far the most important continuous probability distribution is the normal or Gaussian.
The probability density function is given by:
p(x) = (1/(√(2π) σ)) exp{ −(x − µ)^2 / (2σ^2) }
• The normal distribution is used as a basic model when the observed data has a histogram which is symmetric and bell-shaped.
• In addition the normal distribution provides useful approximations to other distributions by the Central Limit Theorem.
• The Central Limit Theorem also implies that a variety of statistics have distributions
that can be approximated by normal distributions.
• Most statistical methods were originally developed for the normal distribution and
then extended to other distributions.
• The parameter µ is the natural center of the distribution (since the distribution is
symmetric about µ).
• The parameter σ 2 or σ provides a measure of spread or scale.
• The special case where µ = 0 and σ 2 = 1 is called the standard normal or Z
distribution
The following quote indicates the importance of the normal distribution:
The
normal
law of error
stands out in the
experience of mankind
as one of the broadest
generalizations of natural
philosophy. It serves as the
guiding instrument in researches
in the physical and social sciences and
in medicine, agriculture and engineering.
It is an indispensible tool for the analysis and the
interpretation of the basic data obtained by observation and experimentation.
W. J. Youden
The principal characteristics of the normal distribution are
• The curve is bell-shaped.
• The possible values for x are between −∞ and +∞
• The distribution is symmetric about µ
• median = mode (point of maximum height of the curve)
• area under the curve is 1.
• area under the curve over an interval I gives the probability of I
• 68% of the probability is between µ − σ and µ + σ
• 95% of the probability is between µ − 2σ and µ + 2σ
• 99.7% of the probability is between µ − 3σ and µ + 3σ
• For the standard normal distribution we have
◦ P (Z ≥ z) = 1 − P (Z ≤ z)
◦ P (Z ≥ z0 ) = P (Z ≤ −z0 ) for z0 ≥ 0. Thus we have
P (Z ≤ 1.645) = .95
P (Z ≥ 1.645) = .05
P (Z ≤ −1.645) = .05
• Probabilities for any normal distribution can be calculated by converting to the standard normal distribution (µ = 0 and σ = 1) as follows:
P(X ≤ x) = P(Z ≤ (x − µ)/σ)
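A short check of the standardization rule in Python using scipy.stats (scipy, and the values µ = 100, σ = 15, are my own illustration):

from scipy.stats import norm

print(round(norm.cdf(1.645), 3))             # ~0.95, as quoted above
mu, sigma = 100, 15
x = 120
print(round(norm.cdf((x - mu) / sigma), 4))  # P(X <= 120) via standardization
print(round(norm.cdf(x, mu, sigma), 4))      # the same probability computed directly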
Figure 3.3: Plot of Z Distribution
Figure 3.4: Plots of Normal Distributions
Approximating Binomial Probabilities Using the Normal Distribution
If n is large we may approximate binomial probabilities using the normal distribution as follows:
P(X ≤ x) ≈ P(Z ≤ (x − np + 1/2)/√(np(1 − p)))
• The 1/2 in the approximation is called a continuity correction since it improves the
approximation for modest values of n.
• A guideline is to use the normal approximation when
n ≥ 9 (p/(1 − p)) and n ≥ 9 ((1 − p)/p)
and use the continuity correction.
The Supplementary Notes give a brief discussion of the appropriateness of the continuity correction.
For the Binomial distribution with n = 30 and p = .3 we find the following probabilities:
x    P(X = x)    P(X ≤ x)
0    0.00002     0.00002
1    0.00029     0.00031
2    0.00180     0.00211
3    0.00720     0.00932
4    0.02084     0.03015
5    0.04644     0.07659
6    0.08293     0.15952
7    0.12185     0.28138
8    0.15014     0.43152
9    0.15729     0.58881
10   0.14156     0.73037
11   0.11031     0.84068
12   0.07485     0.91553
13   0.04442     0.95995
14   0.02312     0.98306
15   0.01057     0.99363
16   0.00425     0.99788
17   0.00150     0.99937
Thus P(X ≤ 12) is exactly 0.91553. Using the normal approximation without the continuity correction yields a value of 0.88400. Using the continuity correction yields a value of 0.91841, close enough for most work. However, using STATA or other statistical packages makes it easy to get exact probabilities.
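These three numbers can be verified directly. A Python sketch using scipy.stats (scipy is an assumption on my part; the notes mention STATA):

from math import sqrt
from scipy.stats import binom, norm

n, p, x = 30, 0.3, 12
exact = binom.cdf(x, n, p)                    # 0.91553
sd = sqrt(n * p * (1 - p))
approx = norm.cdf((x - n * p) / sd)           # ~0.884, no continuity correction
approx_cc = norm.cdf((x - n * p + 0.5) / sd)  # ~0.918, with continuity correction
print(round(exact, 5), round(approx, 5), round(approx_cc, 5))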
Approximating Poisson Probabilities Using the Normal Distribution
If λ ≥ 10 we can use the normal (Z) distribution to approximate the Poisson distribution as
follows:
P(X ≤ x) ≈ P(Z ≤ (x − λ)/√λ)
The following are some Poisson probabilities for λ = 10
x    P(X = x)    P(X ≤ x)
0    0.00005     0.00005
1    0.00045     0.00050
2    0.00227     0.00277
3    0.00757     0.01034
4    0.01892     0.02925
5    0.03783     0.06709
6    0.06306     0.13014
7    0.09008     0.22022
8    0.11260     0.33282
9    0.12511     0.45793
10   0.12511     0.58304
11   0.11374     0.69678
12   0.09478     0.79156
13   0.07291     0.86446
14   0.05208     0.91654
15   0.03472     0.95126
16   0.02170     0.97296
17   0.01276     0.98572
18   0.00709     0.99281
19   0.00373     0.99655
20   0.00187     0.99841
For x = 15 we find that P(X ≤ 15) = 0.95126. Using the normal approximation yields a value of 0.94308. A continuity correction can again be used to improve the approximation.
3.1.4 Distribution Functions
For any random variable the probability that it assumes a value less than or equal to a
specified value, say x, is called its distribution function and denoted by F i.e.
F (x) = P (X ≤ x)
The distribution function F is between 0 and 1 and does not decrease as x increases. The
graph of F is a step function for discrete random variables (the height of the step at x is the
probability of the value x) and is a differentiable function for continuous random variables
(the derivative equals the density function).
Distribution functions are the model analogue to the empirical distribution function introduced in the exploratory data analysis section. They play an important role in goodness
of fit tests and in finding the distribution of functions of continuous random variables. In addition, the natural estimate of the distribution function is the empirical distribution function
which forms the basis for the substitution method of estimation.
3.1.5 Functions of Random Variables
It is often necessary to find the distribution of a function of a random variable(s).
Functions of Discrete Random Variables
In this case to find the pdf of Y = g(X) we find the probability density function directly
using the formula
f (y) = P (Y = y) = P ({x : g(x) = y})
Thus if X has a binomial pdf with parameters n and p and represents the number of successes
in n trials what is the pdf of Y = n − X, the number of failures? We find that
P(Y = y) = P({x : x = n − y}) = (n choose n−y) p^(n−y) (1 − p)^(n−(n−y)) = (n choose y) (1 − p)^y p^(n−y)
i.e. binomial with parameters n and 1 − p.
Functions of Continuous Random Variables
Here we find the distribution function of Y
P (Y ≤ y) = P ({x : g(x) ≤ y})
and then differentiate to find the density function of Y .
example: Let Z be standard normal and let Y = Z 2 . The distribution function of Y is
given by
F(y) = P(Y ≤ y) = P({z : −√y ≤ z ≤ √y}) = ∫_{−√y}^{√y} φ(z) dz
where φ(z) is the standard normal density i.e.
φ(z) = (2π)^(−1/2) e^(−z^2/2)
It follows that the density function of Y is equal to
dF(y)/dy = (1/(2√y)) φ(√y) + (1/(2√y)) φ(−√y)
or
f(y) = (1/√y)(2π)^(−1/2) e^(−y/2) = y^(1/2−1) e^(−y/2) / (2^(1/2) √π)
which is called the chi-square distribution with one degree of freedom. That is, if Z is
standard normal then Z 2 is chi-square with one degree of freedom.
3.1.6 Other Distributions
A variety of other distributions arise in statistical problems. These include the log-normal,
the chi-square, the Gamma, the Beta, the t, the F , and the negative binomial. We will
discuss these as they arise.
3.2 Parameters of Distributions

3.2.1 Expected Values
In exploratory data analysis we emphasized the importance of a measure of location (center)
and spread (variability) for a batch of numbers. There are analogous measures for probability
distributions.
Definition: The expected value, E(X), of a random variable is the weighted average of
its values, the weights being the probability assigned to the values.
• For a discrete random variable we have
E(X) = Σ_x x p(x)
where p(x) is the probability density function of X.
• For continuous random variables
E(X) = ∫_x x f(x) dx
Some important expected values are:
(1) The expected value of the binomial distribution is np
(2) The expected value of the Poisson distribution is λ
(3) The expected value of the normal distribution is µ
Using the properties of sums and integrals we have the following properties of expected
values
• E(c) = c where c is a constant.
In words: The expected value of a constant is equal to the constant.
• E(cX) = cE(X) where c is a constant.
In words: The expected value of a constant times a random variable is equal
to the constant times the expected value of the random variable.
• E(X + Y ) = E(X) + E(Y )
In words: The expected value of the sum of two random variables is the sum
of their expected values.
• If X ≥ 0 then E(X) ≥ 0
In words: The expected value of a non-negative random variable is nonnegative.
Note: The result that the expected value of the sum of two random variables is the sum of
their expected values is non trivial in the sense that one must show that the distribution of
the sum has expected value equal to the sum of the individual expected values.
3.2.2 Variances
Definition: The variance of a random variable is
var (X) = E(X − µ)2 where µ = E(X)
• If we write
X = µ + (X − µ) or X = µ + error
we see that the variance of a random variable is a measure of the average size of the
squared error made when using µ to predict the value of X.
• The square root of var (X) is called the standard deviation of X and is used as a
basic measure of variability for X.
(1) For the binomial var (X) = npq where q = 1 − p
(2) For the Poisson var (X) = λ
(3) For the normal var (X) = σ 2
Using the properties of sums and integrals we have the following properties of variances:
• var (c) = 0 where c is a constant.
In words: The variance (variability) of a constant is 0.
• var (c + X) = var (X) where c is a constant.
In words: The variance of a random variable is unchanged by the addition
of a constant.
• var (cX) = c2 var (X) where c is a constant.
In words: The variance of a constant times a random variable equals the
constant squared times the variance of the random variable.
• var (X) ≥ 0
In words: The variance of a random variable cannot be negative.
3.2.3 Quantiles
Recall that
• The median of a batch of numbers is the value which divides the batch in half.
• Similarly the upper quartile has one fourth of the numbers above it while the lower
quartile has one fourth of the numbers below it.
• There are analogs for probability distributions of random variables.
Definition: The pth quantile, Qp of X is defined by
P (X ≤ Qp ) = p
where 0 < p < 1.
• Q.5 is called the median of X
• Q.25 is called the lower quartile of X
• Q.75 is called the upper quartile of X
• Q.75 − Q.25 is called the interquartile range of X
3.2.4 Other Expected Values
If Y = g(X) is a function of X then Y is also a random variable and has expected value
given by
E[Y] = E[g(X)] = Σ_x g(x)f(x) if X is discrete, and E[Y] = E[g(X)] = ∫_x g(x)f(x) dx if X is continuous
Definition: The moment generating function of X, M (t), is defined as the expected
value of Y = etX where t is a real number.
The moment generating function has two important theoretical properties:
(1) The rth derivative of M (t) with respect to t, evaluated at t = 0 gives the rth moment
of X, E(X r ) for any integer r. This often provides an easy method to find the mean,
variance, etc. of a random variable.
(2) The moment generating function is unique: that is, if two distributions have the same
moment generating function then they have the same distribution.
example: For the binomial distribution we have that
M(t) = E[e^(tX)] = Σ_{x=0}^n e^(tx) (n choose x) p^x (1 − p)^(n−x) = Σ_{x=0}^n (n choose x) (pe^t)^x (1 − p)^(n−x) = (pe^t + q)^n
where q = 1 − p. The first and second derivatives are
dM(t)/dt = npe^t (pe^t + q)^(n−1)
d^2M(t)/dt^2 = n(n − 1)p^2 e^(2t) (pe^t + q)^(n−2) + npe^t (pe^t + q)^(n−1)
Thus we have
E(X) = np ; E(X^2) = n(n − 1)p^2 + np
and hence
var(X) = n(n − 1)p^2 + np − (np)^2 = np(1 − p)
example: For the Poisson distribution we have that
M(t) = E(e^(tX)) = Σ_{x=0}^∞ e^(tx) e^(−λ) λ^x/x! = e^(−λ) Σ_{x=0}^∞ (λe^t)^x/x! = e^(λ(e^t − 1))
The first and second derivatives are
dM(t)/dt = λe^t M(t)
d^2M(t)/dt^2 = λ^2 e^(2t) M(t) + λe^t M(t)
Thus we have
E(X) = λ ; E(X^2) = λ^2 + λ
and hence
var(X) = (λ^2 + λ) − λ^2 = λ
example: For the normal distribution we have that
M(t) = exp{tµ + t^2σ^2/2}
The first two derivatives are
dM(t)/dt = (µ + tσ^2)M(t)
d^2M(t)/dt^2 = (µ + tσ^2)^2 M(t) + σ^2 M(t)
Thus we have
E(X) = µ ; E(X^2) = µ^2 + σ^2
and hence
var(X) = (µ^2 + σ^2) − µ^2 = σ^2
3.2.5 Inequalities involving Expectations
Markov’s Inequality: If Y is any non-negative random variable then
P(Y ≥ c) ≤ E(Y)/c
where c is any positive constant. To see this define a discrete random variable by the equation
Z = c if Y ≥ c and Z = 0 if Y < c
Note that Z ≤ Y so that
E(Y) ≥ E(Z) = 0·P(Z = 0) + c·P(Z = c) = c·P(Y ≥ c)
Tchebychev’s Inequality: If X is any random variable then
P(−δ < X − µ < δ) ≥ 1 − σ^2/δ^2
where σ^2 is the variance of X and δ is any positive number. To see this define
Y = (|X − µ|)^2
Then Y is non-negative with expected value equal to σ^2 and by Markov’s Inequality we have that
P(Y ≥ δ^2) ≤ σ^2/δ^2
and hence
1 − P(Y < δ^2) ≤ σ^2/δ^2 or P(Y < δ^2) ≥ 1 − σ^2/δ^2
But
P(Y < δ^2) = P(|X − µ| < δ) = P(−δ < X − µ < δ)
so that
P(−δ < X − µ < δ) ≥ 1 − σ^2/δ^2
example: Consider n Bernoulli trials and let Sn be the number of successes. Then X = Sn /n
has
E(Sn/n) = np/n = p and var(Sn/n) = npq/n^2 = pq/n
Thus Tchebychev’s Inequality says that
1 ≥ P(−δ < Sn/n − p < δ) ≥ 1 − pq/(nδ^2)
In other words, if the number of trials is large, the probability that the observed frequency
of successes will be close to the true probability of success is close to 1. This is used as the
justification for the relative frequency interpretation of probability. It is also a special case
of the Weak Law of large Numbers.
Chapter 4
Joint Probability Distributions

4.1 General Case
Often we want to consider several responses simultaneously. We model these using random
variables X1 , X2 , . . . and we have joint probability distributions. There are again two
major types.
(i) Joint discrete distributions have the property that the sample space for each random
variable is discrete and probabilities are assigned using the joint probability density
function defined by
0 ≤ f(x1, x2, . . . , xk) ≤ 1 ;  Σ_{x1} Σ_{x2} · · · Σ_{xk} f(x1, x2, . . . , xk) = 1
(ii) Joint continuous distributions have the property that the sample space for each random variable is continuous and probabilities are assigned using the probability density
function which has the properties that
f(x1, x2, . . . , xk) ≥ 0 ;  ∫_{x1} ∫_{x2} · · · ∫_{xk} f(x1, x2, . . . , xk) dx1 dx2 · · · dxk = 1
4.1.1 Marginal Distributions
Marginal distributions are distributions of subsets of random variables which have a joint
distribution. In particular the marginal distribution of one of the components, say Xi , is said
to be the marginal distribution of Xi . Marginal distributions are obtained by “summing”
or “integrating” out the other variables in the joint density. Thus if X and Y have a joint
distribution which is discrete the marginal distribution of X is given by
fX(x) = Σ_y f(x, y)
If X and Y have a joint distribution which is continuous the marginal distribution of X is
given by
fX(x) = ∫_y f(x, y) dy
4.1.2 Conditional Distributions
Conditional distributions are distributions of subsets of random variables which have
a joint distribution given that other components of the random variables are fixed. The
conditional distribution of Y given X = x is obtained by
fY|X(y|x) = f(y, x)/fX(x)
where f (y, x) is the joint distribution of Y and X and fX (x) is the marginal distribution of
X.
Conditional distributions are of fundamental importance in regression and prediction
problems.
4.1.3 Properties of Marginal and Conditional Distributions
• The joint distribution of X1 , X2 , . . . , Xk can be obtained as
f(x1, x2, . . . , xk) = f1(x1)f2(x2|x1)f3(x3|x1, x2) · · · fk(xk|x1, x2, . . . , xk−1)
which is a generalization of the multiplication rule for probabilities.
• The marginal distribution of Y can be obtained via the formula
fY(y) = Σ_x f(y|x)fX(x) if X, Y are discrete, and fY(y) = ∫_x f(y|x)fX(x) dx if X, Y are continuous
which is a generalization of the law of total probability.
• The conditional density of y given X = x can be obtained as
fY|X(y|x) = f(y, x)/fX(x) = fY(y)fX|Y(x|y)/fX(x)
which is a version of Bayes Theorem.
4.1.4 Independence and Random Sampling
If X and Y have a joint distribution they are independent if
f (x, y) = fX (x)fY (y) or if fY |X (y|x) = fY (y)
In general X1 , X2 , . . . , Xn are independent if
f (x1 , x2 , . . . , xn ) = fX1 (x1 )fX2 (x2 ) · · · fXn (xn )
i.e. the joint distribution is the product of the marginal distributions.
Definition: We say that x1 , x2 , . . . , xn constitute a random sample from f if they are
realized values of independent random variables X1 , X2 , . . . , Xn , each of which has the same
probability distribution f .
Random sampling from a distribution is fundamental to many applications of modern
statistics.
4.2 The Multinomial Distribution
The most important joint discrete distribution is the multinomial defined as
f(x1, x2, · · · , xk) = n! Π_{i=1}^k p_i^{x_i}/x_i!
where
xi = 0, 1, 2, . . . , n , i = 1, 2, . . . , k , Σ_{i=1}^k xi = n
0 ≤ pi ≤ 1 , i = 1, 2, . . . , k , Σ_{i=1}^k pi = 1
The multinomial is the basis for the analysis of trials where the outcomes are not binary but of
k distinct types and in the analysis of tables of data which consist of counts of the number
of times certain response patterns occur. Note that if k = 2 the multinomial reduces to the
binomial.
example: Suppose we are interested in the daily pattern of “accidents” in a manufacturing firm. Assuming individuals in the firm have accidents independently of others then the probability of accidents by day has the multinomial distribution
P(x1, x2, x3, x4, x5) = [n!/(x1!x2!x3!x4!x5!)] p1^x1 p2^x2 p3^x3 p4^x4 p5^x5
where pi is the probability of an accident on day i and i indexes working days. Of interest
is whether or not the pi are equal. If they are not we might be interested in which seem too
large.
example: This data set consists of the cross classification of 12,763 applications for admission to graduate programs at the University of California at Berkeley in 1973. The data were
classified by gender and admission outcome. Of interest is the possibility of gender bias in
the admissions policy of the university.
                      Admissions Outcome
Gender       Admitted       Not Admitted
Male           3738              4704
Female         1494              2827
In general we have that n individuals are investigated and their gender and admission outcome are recorded. The data are thus of the form:

Gender       Admitted       Not Admitted
Male            n00              n01
Female          n10              n11
To model these data we assume that individuals are independent and that the possible response patterns for an individual are given by one of the following:

(male, admitted) = (0, 0)
(female, admitted) = (1, 0)
(male, not admitted) = (0, 1)
(female, not admitted) = (1, 1)

Denoting the probability of response pattern (i, j) by p_{ij}, the multinomial model applies and the probability of the observed responses is given by

\frac{n!}{n_{00}!\, n_{01}!\, n_{10}!\, n_{11}!}\, p_{00}^{n_{00}} p_{01}^{n_{01}} p_{10}^{n_{10}} p_{11}^{n_{11}}
The random variables are thus N00 , N01 , N10 and N11 .
In the model above the probabilities are thus given by
Gender                          Admitted   Not Admitted   Marginal of Gender
Male                              p00          p01              p0+
Female                            p10          p11              p1+
Marginal of Admission Status      p+0          p+1               1
Note that p+0 gives the probability of admission and that p0+ gives the probability of being male. It is clear (why?) that the marginal distribution of the number admitted is binomial with parameters n and p = p+0.
The probability that N00 = n00 and N01 = n01 given that N00 + N01 = n0+ gives the probability of the admission outcomes among the males and is

P(N_{00} = n_{00}, N_{01} = n_{01} \mid N_{00} + N_{01} = n_{0+})

This conditional probability is given by:

\frac{P(N_{00} = n_{00}, N_{01} = n_{0+} - n_{00})}{P(N_{00} + N_{01} = n_{0+})}
= \frac{\frac{n!}{n_{00}!\,(n_{0+}-n_{00})!\,(n-n_{0+})!}\, p_{00}^{n_{00}}\, p_{01}^{n_{0+}-n_{00}}\, (1-p_{0+})^{n-n_{0+}}}{\frac{n!}{n_{0+}!\,(n-n_{0+})!}\, p_{0+}^{n_{0+}}\, (1-p_{0+})^{n-n_{0+}}}
= \frac{n_{0+}!}{n_{00}!\,(n_{0+}-n_{00})!} \left(\frac{p_{00}}{p_{0+}}\right)^{n_{00}} \left(\frac{p_{01}}{p_{0+}}\right)^{n_{0+}-n_{00}}
= \binom{n_{0+}}{n_{00}}\, p_*^{n_{00}} (1-p_*)^{n_{0+}-n_{00}}
which is a binomial distribution with parameters n_{0+}, the number of males, and

p_* = \frac{p_{00}}{p_{0+}}

Note that the odds of admission given male are

\frac{p_*}{1 - p_*} = \frac{p_{00}}{p_{01}}
Similarly, the conditional distribution of the number of admitted females is binomial with parameters n_{1+}, the number of females, and P_*, where

P_* = \frac{p_{10}}{p_{1+}}

Note that the odds in this case are given by

\frac{P_*}{1 - P_*} = \frac{p_{10}}{p_{11}}
Thus the odds ratio of admission (female to male) is given by

\frac{p_{10}/p_{11}}{p_{00}/p_{01}} = \frac{p_{01}\, p_{10}}{p_{00}\, p_{11}}

If the odds ratio is one, gender and admission are independent. (Why?) It follows that the odds ratio is a natural measure of association for categorical data.
In the example the odds of admission for males are estimated by

\text{odds of admission for males} = \frac{3738/8442}{4704/8442} = \frac{3738}{4704} = 0.79

while the odds of admission for females are estimated by

\text{odds of admission for females} = \frac{1494/4321}{2827/4321} = \frac{1494}{2827} = 0.53
Thus the odds of admission are lower for females. The odds ratio is estimated by

\text{odds ratio of admission (females to males)} = \frac{1494/2827}{3738/4704} = \frac{1494 \times 4704}{2827 \times 3738} = 0.67
Is this odds ratio different enough from 1 to claim that females are discriminated against
in the admissions policy? More later!!!
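The arithmetic above is easy to reproduce. A minimal Python sketch using the Berkeley counts is given below; the standard error of the log odds ratio uses the sum-of-reciprocals formula derived later in this chapter and is included only as a pointer toward the "More later" question.

import math

# Berkeley admissions counts from the table above
male_adm, male_not = 3738, 4704
fem_adm, fem_not = 1494, 2827

odds_male = male_adm / male_not        # about 0.79
odds_female = fem_adm / fem_not        # about 0.53
odds_ratio = odds_female / odds_male   # about 0.67 (female to male)

# Log odds ratio and its approximate standard error
# (sum of reciprocals of the four counts; see Section 4.8.7)
log_or = math.log(odds_ratio)
se_log_or = math.sqrt(1/male_adm + 1/male_not + 1/fem_adm + 1/fem_not)

print(round(odds_male, 2), round(odds_female, 2), round(odds_ratio, 2))
print(round(log_or, 3), round(se_log_or, 3))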
4.3 The Multivariate Normal Distribution
The most important joint continuous distribution is the multivariate normal distribution. The density function of X is given by

f(x) = (2\pi)^{-k/2} [\det(V)]^{-1/2} \exp\left\{ -\frac{1}{2} (x - \mu)^T V^{-1} (x - \mu) \right\}

where we assume that V is a non-singular, symmetric, positive definite matrix of rank k. The two parameters of this distribution are \mu and V.
• It can be shown that the marginal distribution of any X_i is normal with parameters \mu_i and v_{ii}, where

\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix}, \qquad
V = \begin{pmatrix} v_{11} & v_{12} & \cdots & v_{1k} \\ v_{12} & v_{22} & \cdots & v_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ v_{1k} & v_{2k} & \cdots & v_{kk} \end{pmatrix}
• It can also be shown that linear combinations of multivariate normal random variables are also multivariate normal. More precisely, let W = a + BX where a is p × 1 and B is a p × k matrix with p ≤ k. Then the joint distribution of W is multivariate normal with parameters

\mu_W = a + B\mu \quad \text{and} \quad V_W = B V B^T

where B^T is the transpose of B.
• It can also be shown that the conditional distribution of any subset of X given any other subset is multivariate normal. More precisely, let

X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}; \quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad V = \begin{pmatrix} V_{11} & V_{12} \\ V_{12}^T & V_{22} \end{pmatrix}

where A^T denotes the transpose of A. Then the conditional distribution of X_2 given X_1 = x_1 is also multivariate normal with

\mu_* = \mu_2 + V_{12}^T V_{11}^{-1} (x_1 - \mu_1); \quad V_* = V_{22} - V_{12}^T V_{11}^{-1} V_{12}

• It follows that if X_1 and X_2 have a multivariate normal distribution then they are independent if and only if

V_{12} = 0
The multivariate normal distribution forms the basis for regression analysis, analysis
of variance and a variety of other statistical methods including factor analysis and latent
variable analysis.
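A short numerical sketch of the conditional-distribution formula (Python/NumPy; the mean vector and covariance matrix are arbitrary illustrative values) computes the conditional mean and covariance of X2 given X1 = x1 from the partitioned blocks above.

import numpy as np

# Illustrative parameters for (X1, X2), with X1 of dimension 1 and X2 of dimension 2
mu = np.array([1.0, 2.0, 0.5])
V = np.array([[2.0, 0.6, 0.3],
              [0.6, 1.0, 0.2],
              [0.3, 0.2, 1.5]])

V11, V12, V22 = V[:1, :1], V[:1, 1:], V[1:, 1:]
mu1, mu2 = mu[:1], mu[1:]

x1 = np.array([1.8])   # conditioning value for X1

# Conditional distribution of X2 given X1 = x1
mu_star = mu2 + V12.T @ np.linalg.solve(V11, x1 - mu1)
V_star = V22 - V12.T @ np.linalg.solve(V11, V12)

print(mu_star)   # conditional mean
print(V_star)    # conditional covariance (if V12 were 0 these would just be mu2 and V22)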
4.4 Parameters of Joint Distributions

4.4.1 Means, Variances, Covariances and Correlation
The collection of expected values of the marginal distributions of Y is called the expected value of Y and is written as

E(Y) = \mu = \begin{pmatrix} E(Y_1) \\ E(Y_2) \\ \vdots \\ E(Y_k) \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix}
The covariance between X and Y, where X and Y have a joint distribution, is defined by

cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]

The correlation between X and Y is defined as

\rho(X, Y) = \frac{cov(X, Y)}{\sqrt{var(X)\, var(Y)}}

and is simply a standardized covariance. Correlations have the property that

-1 \le \rho(X, Y) \le 1
Using the properties of expected values we see that covariances have the following properties
• cov (X, Y ) = cov (Y, X)
• cov (X, X) = var (X)
• cov (X + a, Y + b) = cov (X, Y )
• cov (aX, bY ) = ab cov (X, Y )
• cov (aX + bY, cW + dZ) = ac cov (X, W ) + ad cov (X, Z) + bc cov (Y, W ) + bd cov (Y, Z)
We define the variance covariance matrix of Y as

V_Y = \begin{pmatrix}
var(Y_1) & cov(Y_1, Y_2) & \cdots & cov(Y_1, Y_k) \\
cov(Y_2, Y_1) & var(Y_2) & \cdots & cov(Y_2, Y_k) \\
\vdots & \vdots & \ddots & \vdots \\
cov(Y_k, Y_1) & cov(Y_k, Y_2) & \cdots & var(Y_k)
\end{pmatrix}
Note that for the multivariate normal distribution with parameters µ and V we have
that
E(Y) = µ and VY = V
Thus the two parameters in the multivariate normal are respectively the mean vector and
the variance covariance matrix.
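The sample analogues of these two parameters are computed routinely. The sketch below (Python/NumPy; the parameters and sample size are illustrative) draws from a bivariate normal and estimates the mean vector and the variance covariance matrix.

import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, 2.0])
V = np.array([[1.0, 0.8],
              [0.8, 2.0]])

# 5000 draws from the bivariate normal with parameters mu and V
Y = rng.multivariate_normal(mu, V, size=5000)

print(Y.mean(axis=0))           # estimate of the mean vector mu
print(np.cov(Y, rowvar=False))  # estimate of the variance covariance matrix V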
4.4.2 Joint Moment Generating Functions
The joint moment generating function of X_1, X_2, \ldots, X_k is defined as

M_X(t) = E\left( e^{\sum_{i=1}^{k} t_i X_i} \right)
• Partial derivatives with respect to t_i evaluated at t_1 = t_2 = \cdots = t_k = 0 give the moments of X_i, and mixed partial derivatives (e.g. with respect to t_i and t_j) give the covariances, etc.
• Joint moment generating functions are unique (if two distributions have the same
moment generating function then the two distributions are the same).
• The joint moment generating function for the multivariate normal distribution is given by

M_X(t) = \exp\left\{ \mu^T t + \frac{1}{2} t^T V t \right\} = \exp\left\{ \sum_{i=1}^{k} t_i \mu_i + \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} t_i t_j v_{ij} \right\}
• If random variables are independent then their joint moment generating function is
equal to the product of the individual moment generating functions.
4.5 Functions of Jointly Distributed Random Variables
If Y = g(X) is any function of random variables X we can find its distribution exactly as in the one variable case, i.e.

f_Y(y) = \sum_{x: g(x) = y} f(x_1, x_2, \ldots, x_k) \quad \text{if } X \text{ is discrete}

f_Y(y) = \frac{dF_Y(y)}{dy} \quad \text{if } X \text{ is continuous, where}

F_Y(y) = \int_{\{x: g(x) \le y\}} f(x_1, x_2, \ldots, x_k)\, dx_1\, dx_2 \cdots dx_k
Thus we can find the distribution of the sum, the difference, a linear combination, a
ratio, a product, etc. We shall not derive all of the results we use in later sections but we
shall record a few of the most important results here
• If X has a multivariate normal distribution with mean \mu and variance covariance matrix V then the distribution of

Y = a + b^T X = a + \sum_{i=1}^{k} b_i X_i

is normal with

E(Y) = a + b^T \mu = a + \sum_{i=1}^{k} b_i E(X_i) \quad \text{and} \quad var(Y) = b^T V b = \sum_{i=1}^{k} \sum_{j=1}^{k} b_i b_j\, cov(X_i, X_j)
• If Z_1, Z_2, \ldots, Z_r are independent, each N(0, 1), then the distribution of

Z_1^2 + Z_2^2 + \cdots + Z_r^2

is chi-square with r degrees of freedom.
• If Z is N(0, 1), W is chi-square with r degrees of freedom, and Z and W are independent, then

T = \frac{Z}{\sqrt{W/r}}

has a Student's t distribution with r degrees of freedom.
• If Z_1 and Z_2 are each N(0, 1) and independent then the distribution of the ratio

C = \frac{Z_1}{Z_2}

is Cauchy with parameters 0 and 1.
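These distributional facts are easy to check by simulation. The following Python/NumPy sketch (sample sizes and degrees of freedom are arbitrary) builds each statistic from standard normals and prints a few summaries for comparison with the stated distributions.

import numpy as np

rng = np.random.default_rng(2)
r, n = 5, 100_000

Z = rng.standard_normal((n, r))

# Sum of r squared standard normals: chi-square with r degrees of freedom
chi2 = (Z ** 2).sum(axis=1)
print(chi2.mean(), chi2.var())   # chi-square theory: mean r, variance 2r

# Independent N(0,1) divided by sqrt(chi-square / r): Student's t with r df
t = rng.standard_normal(n) / np.sqrt(chi2 / r)
print(t.mean(), t.var())         # t theory: mean 0, variance r/(r - 2) for r > 2

# Ratio of two independent N(0,1): Cauchy(0, 1), heavy tails and no finite mean
cauchy = rng.standard_normal(n) / rng.standard_normal(n)
print(np.median(cauchy))         # the median (0) is a sensible summary here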
4.5.1 Linear Combinations of Random Variables
If X_1, X_2, \ldots, X_n have a joint distribution with means \mu_1, \mu_2, \ldots, \mu_n and variances and covariances given by

cov(X_i, X_j) = v_{ij}

then the expected value of \sum_{i=1}^{n} a_i X_i is given by

E\left( \sum_{i=1}^{n} a_i X_i \right) = \sum_{i=1}^{n} a_i E(X_i) = \sum_{i=1}^{n} a_i \mu_i

and the variance of \sum_{i=1}^{n} a_i X_i is given by

var\left( \sum_{i=1}^{n} a_i X_i \right) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j\, cov(X_i, X_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j v_{ij}

If we write

\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{pmatrix}; \quad
V = \begin{pmatrix} v_{11} & v_{12} & \cdots & v_{1n} \\ v_{21} & v_{22} & \cdots & v_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ v_{n1} & v_{n2} & \cdots & v_{nn} \end{pmatrix}

we see that the above results may be written as

E(a^T X) = a^T \mu; \quad var(a^T X) = a^T V a
As special cases we have
• var (X + Y ) = var (X) + var (Y ) + 2 cov (X, Y )
• var (X − Y ) = var (X) + var (Y ) − 2 cov (X, Y )
• Thus if X and Y are uncorrelated with the same variance σ² we have
– var(X + Y) = 2σ²
– var(X − Y) = 2σ²
• More generally, if X_1, X_2, \ldots, X_n are uncorrelated then

var\left( \sum_{i=1}^{n} a_i X_i \right) = \sum_{i=1}^{n} a_i^2\, var(X_i)

– In particular, if each X_i has variance σ² and we take each a_i = 1/n, we have

var(\bar{X}) = \frac{\sigma^2}{n}
4.6 Approximate Means and Variances
In some problems we cannot find the expected value, variance, or distribution of Y = g(X) exactly. It is useful to have approximations for the means and variances in such cases. If the function g is reasonably linear in a neighborhood of \mu_X, the expected value of X, then we can write

Y = g(X) \approx g(\mu_X) + g^{(1)}(\mu_X)(X - \mu_X)

by Taylor's Theorem. Hence we have

E(Y) \approx g(\mu_X)
var(Y) \approx [g^{(1)}(\mu_X)]^2 \sigma_X^2

We can get an improved approximation to the expected value of Y by writing

Y = g(X) \approx g(\mu_X) + g^{(1)}(\mu_X)(X - \mu_X) + \frac{1}{2} g^{(2)}(\mu_X)(X - \mu_X)^2

Thus

E(Y) \approx g(\mu_X) + \frac{1}{2} g^{(2)}(\mu_X) \sigma_X^2
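To see how these approximations behave, the sketch below (Python/NumPy; the choice g(x) = log(x) and a gamma-distributed X are purely illustrative) compares the simulated mean and variance of g(X) with the first- and second-order approximations above.

import numpy as np

rng = np.random.default_rng(3)

# Illustrative X: gamma with shape 10 and scale 0.5, so mu_X = 5 and sigma_X^2 = 2.5
shape, scale = 10.0, 0.5
mu_X, var_X = shape * scale, shape * scale ** 2

X = rng.gamma(shape, scale, size=200_000)
Y = np.log(X)                       # g(X) with g(x) = log(x)

# g'(x) = 1/x and g''(x) = -1/x^2, evaluated at mu_X
g, g1, g2 = np.log(mu_X), 1 / mu_X, -1 / mu_X ** 2

print(Y.mean(), g, g + 0.5 * g2 * var_X)  # simulated mean vs first- and second-order approximations
print(Y.var(), g1 ** 2 * var_X)           # simulated variance vs the delta approximation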
If Z = g(X, Y) is a function of two random variables then we can write

Z = g(X, Y) \approx g(\mu) + \frac{\partial g(\mu)}{\partial x}(X - \mu_X) + \frac{\partial g(\mu)}{\partial y}(Y - \mu_Y)

where \mu denotes the point (\mu_X, \mu_Y) and

\frac{\partial g(\mu)}{\partial x} = \left. \frac{\partial g(x, y)}{\partial x} \right|_{x = \mu_X,\, y = \mu_Y}

Thus we have that

E(Z) \approx g(\mu)

var(Z) \approx \left[ \frac{\partial g(\mu)}{\partial x} \right]^2 \sigma_X^2 + \left[ \frac{\partial g(\mu)}{\partial y} \right]^2 \sigma_Y^2 + 2 \left[ \frac{\partial g(\mu)}{\partial x} \right] \left[ \frac{\partial g(\mu)}{\partial y} \right] cov(X, Y)
As in the single variable case we can obtain an improved approximation for the expected value by using Taylor's Theorem with second order terms, e.g.

E(Z) \approx g(\mu) + \frac{1}{2} \frac{\partial^2 g(\mu)}{\partial x^2} \sigma_X^2 + \frac{1}{2} \frac{\partial^2 g(\mu)}{\partial y^2} \sigma_Y^2 + \frac{\partial^2 g(\mu)}{\partial x\, \partial y}\, cov(X, Y)
• Note 1: The improved approximation is needed for the expected value because in general E[g(X)] \neq g(\mu), e.g. E(X^2) \neq \mu^2.
• Note 2: Some care is needed when working with discrete variables and certain functions.
Thus if X is binomial with parameters n and p the expected value of log(X) is not
defined so that no approximation can be correct.
4.7 Sampling Distributions of Statistics
Definition: A statistic is a numerical quantity calculated from a set of data. Typically a
statistic is designed to provide information about some parameter of the population.
• If x1, x2, . . . , xn are the data, some statistics are
– x̄, the sample mean
– the median
– the upper quartile
– s2 , the sample variance
– the range
• Since the data are realized values of random variables, a statistic is also a realized value of a random variable.
• The probability distribution of this random variable is called the sampling distribution of the statistic.
In most contemporary applications of statistics the sampling distribution is used to assess the performance of a statistic used for inference about population parameters. The following is a schematic diagram of the concept of the sampling distribution of a statistic.

Figure 4.1: Schematic of the sampling distribution of a statistic.
Figure 4.2: Illustration of sampling distributions: sampling distribution of the sample mean, sample size 25.

Figure 4.3: Illustration of sampling distributions: sampling distribution of (n − 1)s²/σ², sample size n = 10.

Figure 4.4: Illustration of sampling distributions: sampling distribution of t = √n(x̄ − µ)/s, sample size n = 10.
example: Given a sample of data suppose we calculate the sample mean x̄ and the sample median q_{.5}. Which of these is a better measure of the center of the population?

• If we assume that the data represent a random sample from a probability distribution which is N(µ, σ²) then it is known that:
  ◦ the sampling distribution of X̄ is N(µ, σ²/n)
  ◦ the sampling distribution of the sample median is approximately N(µ, (π/2) σ²/n).
• Thus the sample mean will, on average, be closer to the population mean than will the sample median, so the sample mean is preferred as an estimate of the population mean.
• If the underlying population is not N(µ, σ²) then the above result does not hold and the sample median may be the preferred estimate.
• It follows that the role of assumptions about the underlying probability model is crucial in the development and assessment of statistical procedures.
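A small simulation along the lines of this example (Python/NumPy; the sample size and number of replications are arbitrary) estimates the sampling variances of the mean and the median for normal data and compares their ratio with π/2.

import numpy as np

rng = np.random.default_rng(4)
n, reps = 25, 20_000
mu, sigma = 0.0, 1.0

samples = rng.normal(mu, sigma, size=(reps, n))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

print(means.var())                  # theory: sigma^2 / n = 0.04
print(medians.var())                # theory: roughly (pi/2) * sigma^2 / n
print(medians.var() / means.var())  # should be near pi/2, about 1.57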
4.8 Methods of Obtaining Sampling Distributions or Approximations
There are three methods used to obtain information on sampling distributions:
• Exact sampling distributions. Statisticians have, over the last 100 years, developed the
sampling distributions for a variety of useful statistics for specific parametric models.
For the most part these statistics are simple functions of the sample data such as the
sample mean, the sample variance, etc.
• Asymptotic (approximate) distributions. When exact sampling distributions are not
tractable we may find the distribution of the statistic for large sample sizes. These are
called asymptotic methods and are surprisingly useful.
• Computer intensive methods. These are based on resampling from the empirical distribution of the data and have been shown to have useful properties. The most important
of these methods is called the bootstrap.
4.8.1 Exact Sampling Distributions
Here we find the exact sampling distribution of the statistic using the methods previously
discussed. The most famous example of this method is the result that if we have a random
sample from a normal distribution then the distribution of the sample mean is also normal.
Other examples include the distribution of the sample variance from a normal sample, the t
distribution and the F distribution.
4.8.2 Asymptotic Distributions

4.8.3 Central Limit Theorem
If we cannot find the exact sampling distribution of a statistic we may be able to find its
mean and variance. If the sampling distribution were approximately normal then we would
be able to make approximate statements using just the mean and variance. In the discussion
of the Binomial and Poisson distributions we noted that for large n the distributions could
be approximated by the normal distribution.
• In fact, the sampling distribution of X̄ for almost any population distribution becomes
more and more similar to the normal distribution regardless of the shape of the original
distribution as n increases.
More precisely:
Central Limit Theorem: If X_1, X_2, \ldots, X_n are independent, each with the same distribution having expected value µ and variance σ², then the sampling distribution of X̄ is approximately N(µ, σ²/n), i.e.

P\left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z \right) \approx P(Z \le z)

where P(Z ≤ z) is the area under the standard normal curve up to z.
The Central Limit Theorem has been extended and refined over the last 75 years.
• Many statistics have sampling distributions that are approximately normal.
• This explains the great use of the normal distribution in statistics.
• In particular, whenever a measurement can be thought of as a sum of individual components we may expect it to be approximately normal.
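A simulation sketch of the Central Limit Theorem (Python/NumPy; the exponential population and the sample sizes are illustrative) shows the sampling distribution of X̄ centering on the population mean, with variance shrinking like 1/n and skewness fading as n grows.

import numpy as np

rng = np.random.default_rng(5)
reps = 10_000

for n in (10, 25, 100):
    # Sample means of n draws from a skewed population (exponential with mean 1)
    xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    # Mean stays near 1, variance near 1/n, skewness shrinks toward 0
    skew = ((xbar - xbar.mean()) ** 3).mean() / xbar.std() ** 3
    print(n, round(xbar.mean(), 3), round(xbar.var(), 4), round(skew, 3))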
4.8.4 Central Limit Theorem Example
We now illustrate the Central Limit Theorem and some other results on sampling distributions. The data set consists of a population of 1826 children whose blood lead values (milligrams per deciliter) were recorded at the Johns Hopkins Hospital. The data are courtesy of Dr. Janet Serwint. Lead in children is a serious public health problem; lead levels exceeding 15 milligrams per deciliter are considered to have implications for learning disabilities, are implicated in violent behavior, and are the concern of major governmental efforts aimed at reducing exposure.

The distribution in real populations is often assumed to follow a log-normal distribution, i.e. the natural logarithm of blood lead values is normally distributed.

Note the asymmetry of the distribution of blood lead values, and that the log transformation results in a decided improvement in symmetry, indicating that the log-normal assumption is probably appropriate.

We select random samples from the population of blood lead readings and log blood lead readings: 100 random samples each of size 10, 25 and 100. As the histograms indicate, the distributions of the sample means of the blood lead values do indeed appear normal even though the distribution of blood lead values is highly skewed.
Figure 4.5: Histograms of blood lead and log blood lead values.

Figure 4.6: Histograms of sample means of blood lead values.

Figure 4.7: Histograms of sample means of log blood lead values.
The summary statistics for blood lead values and the samples are as follows:

> summary(blpb)
   Min. 1st Qu. Median  Mean 3rd Qu. Max.
      0       5      8 9.773      12  128
> var(blpb)
71.79325

Sample Size    Mean   Variance
        10     9.93       9.53
        25     9.75       2.91
       100     9.87        .72

The summary statistics for log blood lead values and the samples are as follows:

> summary(logblpb)
   Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 -1.386   1.658   2.11 2.084   2.506 4.854
> var(logblpb)
[1] 0.4268104

Sample Size    Mean   Variance
        10     2.07       .037
        25     2.08       .017
       100     2.08       .004
4.8.5 Law of Large Numbers
Under rather weak conditions, the average of a sample is "close" to the population average if the sample size is large. More precisely:

Law of Large Numbers: If we have a random sample X_1, X_2, \ldots, X_n from a distribution with expected value µ and variance σ², then

P(X̄ ≈ µ) ≈ 1

for n sufficiently large. The approximation becomes closer the larger the value of n.

We write X̄ →^p µ and say that X̄ converges in probability to µ. If g is a continuous function and X̄ converges in probability to µ, then g(X̄) converges in probability to g(µ).

Some idea of the value of n needed can be obtained from Chebychev's inequality, which states that

P(-k \le \bar{X} - \mu \le k) \ge 1 - \frac{\sigma^2}{n k^2}

where k is any positive constant.
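Both the Law of Large Numbers and the Chebychev bound can be illustrated in a few lines (Python/NumPy; the uniform population and the value of k are arbitrary choices).

import numpy as np

rng = np.random.default_rng(6)
mu, sigma2 = 0.5, 1.0 / 12.0     # mean and variance of the Uniform(0, 1) population
k, reps = 0.05, 5_000

for n in (10, 100, 1000):
    xbar = rng.uniform(0.0, 1.0, size=(reps, n)).mean(axis=1)
    prob = np.mean(np.abs(xbar - mu) <= k)          # estimate of P(-k <= Xbar - mu <= k)
    bound = max(0.0, 1.0 - sigma2 / (n * k ** 2))   # Chebychev lower bound
    print(n, round(prob, 3), round(bound, 3))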
Figure 4.8: Law of Large Numbers examples.
4.8.6 The Delta Method - Univariate
For statistics S_n which are normal or approximately normal the delta method can be used to find the approximate distribution of g(S_n), a function of S_n.

The technique is based on approximating g by a linear function, as in obtaining approximations to expected values and variances of functions, i.e.

g(S_n) \approx g(\mu) + g^{(1)}(\mu)(S_n - \mu)

where S_n converges in probability to \mu and g^{(1)}(\mu) is the derivative of g evaluated at \mu. Thus we have that

g(S_n) - g(\mu) \approx g^{(1)}(\mu)(S_n - \mu)

If \sqrt{n}(S_n - \mu) has an exact or approximate normal distribution with mean 0 and variance \sigma^2 then

\sqrt{n}\, [g(S_n) - g(\mu)]

has an approximate normal distribution with mean 0 and variance

[g^{(1)}(\mu)]^2 \sigma^2

It follows that we may make approximate calculations by treating g(S_n) as if it were normal with mean g(\mu) and variance [g^{(1)}(\mu)]^2 \sigma^2 / n, i.e.

P(g(S_n) \le s) = P\left( \frac{g(S_n) - g(\mu)}{\sqrt{[g^{(1)}(\mu)]^2 \sigma^2 / n}} \le \frac{s - g(\mu)}{\sqrt{[g^{(1)}(\mu)]^2 \sigma^2 / n}} \right) = P\left( Z \le \frac{s - g(\mu)}{\sqrt{[g^{(1)}(\mu)]^2 \sigma^2 / n}} \right)

where Z is N(0, 1). In addition, if g^{(1)}(\mu) is continuous then we can replace \mu by S_n in the formula for the variance.
example: Let X be binomial with parameters n and p and let S_n = X/n. Then we know by the Central Limit Theorem that the approximate distribution of

\sqrt{n}(S_n - p)

is N(0, pq). If we define

g(x) = \ln\left( \frac{x}{1 - x} \right) = \ln(x) - \ln(1 - x)

then

g^{(1)}(x) = \frac{1}{x} + \frac{1}{1 - x} = \frac{(1 - x) + x}{x(1 - x)} = \frac{1}{x(1 - x)}

Thus

g^{(1)}(\mu) = \frac{1}{pq}

and hence

\sqrt{n} \left[ \ln\left( \frac{S_n}{1 - S_n} \right) - \ln\left( \frac{p}{1 - p} \right) \right]

is approximately normal with mean 0 and variance

\frac{pq}{(pq)^2} = \frac{1}{p} + \frac{1}{q}

Since g^{(1)}(\mu) is continuous we may treat \ln(S_n / (1 - S_n)) as if it were normal with

mean \ln\left( \frac{p}{1 - p} \right) and variance \frac{1}{n} \left[ \frac{1}{S_n} + \frac{1}{1 - S_n} \right] = \frac{1}{X} + \frac{1}{n - X}

Thus the distribution of the sample log odds in a binomial may be approximated by a normal distribution with mean equal to the population log odds and variance equal to the sum of the reciprocals of the number of successes and the number of failures.
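The following Python/NumPy sketch (n and p are illustrative) checks this approximation by simulating the sample log odds and comparing the simulated mean and standard deviation with the population log odds and the delta-method standard error.

import numpy as np

rng = np.random.default_rng(7)
n, p, reps = 200, 0.3, 50_000

X = rng.binomial(n, p, size=reps)
X = X[(X > 0) & (X < n)]             # log odds undefined when X = 0 or X = n
log_odds = np.log(X / (n - X))

# Delta-method SD evaluated at the population p: sqrt(1/(np) + 1/(nq))
delta_sd = np.sqrt(1 / (n * p) + 1 / (n * (1 - p)))

print(log_odds.mean(), np.log(p / (1 - p)))  # simulated mean vs population log odds
print(log_odds.std(), delta_sd)              # simulated SD vs delta-method SD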
4.8.7 The Delta Method - Multivariate
More generally, if we have a collection of statistics S_1, S_2, \ldots, S_k then we say that they are approximately multivariate normally distributed with mean \mu and variance covariance matrix V if

\sqrt{n}\, a^T (S_n - \mu)

has an approximate normal distribution with mean 0 and variance a^T V a for any a.

In this case the distribution of g(S_n) is also approximately normal, i.e.

\sqrt{n}\, [g(S_n) - g(\mu)]

is approximately normal with mean 0 and variance \sigma_g^2 = \nabla(\mu)^T V \nabla(\mu) where

\nabla(\mu) = \begin{pmatrix} \frac{\partial g(\mu)}{\partial \mu_1} \\ \frac{\partial g(\mu)}{\partial \mu_2} \\ \vdots \\ \frac{\partial g(\mu)}{\partial \mu_k} \end{pmatrix}

Thus we may make approximate calculations by treating g(S_n) as if it were normal with mean g(\mu) and variance \sigma_g^2 / n, i.e.

P(g(S_n) \le s) = P\left( \frac{g(S_n) - g(\mu)}{\sqrt{\sigma_g^2 / n}} \le \frac{s - g(\mu)}{\sqrt{\sigma_g^2 / n}} \right) = P\left( Z \le \frac{s - g(\mu)}{\sqrt{\sigma_g^2 / n}} \right)

where Z is N(0, 1). In addition, if each partial derivative is continuous we may replace \mu by S_n in the formula for the variance.
example: Let X_1 be binomial with parameters n and p_1, let X_2 be binomial with parameters n and p_2, and let them be independent. Then the joint distribution of

S_n = \begin{pmatrix} S_{1n} \\ S_{2n} \end{pmatrix} = \begin{pmatrix} X_1/n \\ X_2/n \end{pmatrix}

is such that

\sqrt{n}(S_n - p)

is approximately multivariate normal with mean 0 and variance covariance matrix V where

V = \begin{pmatrix} p_1 q_1 & 0 \\ 0 & p_2 q_2 \end{pmatrix}

Thus if

g(p) = \ln\left( \frac{p_2}{1 - p_2} \right) - \ln\left( \frac{p_1}{1 - p_1} \right) = \ln(p_2) - \ln(1 - p_2) - \ln(p_1) + \ln(1 - p_1)

we have that

\frac{\partial g(p)}{\partial p_1} = -\frac{1}{p_1} - \frac{1}{1 - p_1} = -\frac{1}{p_1 q_1}
\frac{\partial g(p)}{\partial p_2} = \frac{1}{p_2} + \frac{1}{1 - p_2} = \frac{1}{p_2 q_2}

It follows that

\sigma_g^2 = \begin{pmatrix} -\frac{1}{p_1 q_1} & \frac{1}{p_2 q_2} \end{pmatrix}
\begin{pmatrix} p_1 q_1 & 0 \\ 0 & p_2 q_2 \end{pmatrix}
\begin{pmatrix} -\frac{1}{p_1 q_1} \\ \frac{1}{p_2 q_2} \end{pmatrix}
= \frac{1}{p_1 q_1} + \frac{1}{p_2 q_2}

Since the partial derivatives are continuous we may treat the sample log odds ratio as if it were normal with mean equal to the population log odds ratio

\ln\left( \frac{p_2/(1 - p_2)}{p_1/(1 - p_1)} \right)

and variance

\frac{1}{X_1} + \frac{1}{n - X_1} + \frac{1}{X_2} + \frac{1}{n - X_2}
If we write the sample data as

sample 1:   X_1 = a,   n - X_1 = b
sample 2:   X_2 = c,   n - X_2 = d

then the above formula reads as

\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}

a very widely used formula in epidemiology.
Technical Notes
(1) Only a minor modification is needed to show that the result is true when the sample
size in the two binomials is different provided that the ratio of the sample sizes does
not tend to 0.
(2) The log odds ratio is much more nearly normally distributed than the odds ratio.
We generate 1000 samples of size 20 from each of two binomial populations, one with parameter .3 and the other with parameter .5. It follows that the population odds ratio and the population log odds ratio are given by

\text{odds ratio} = \frac{.5/.5}{.3/.7} = \frac{7}{3} = 2.333; \quad \text{log odds ratio} = .8473

The asymptotic variance for the log odds ratio is given by the formula

(1/6) + (1/14) + (1/10) + (1/10) = .4381

which leads to an asymptotic standard deviation of .6618.

The mean of the 1000 sample log odds ratios is .9127 with variance .5244 and standard deviation .7241.
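A sketch of a simulation along these lines (Python/NumPy, with the same n = 20, p1 = .3, p2 = .5 set-up; being a fresh set of random numbers it will not reproduce the figures above exactly) is:

import numpy as np

rng = np.random.default_rng(8)
n, p1, p2, reps = 20, 0.3, 0.5, 1000

a = rng.binomial(n, p1, size=reps)   # successes in sample 1
c = rng.binomial(n, p2, size=reps)   # successes in sample 2
b, d = n - a, n - c

# Keep only replications where all four counts are positive so the log odds ratio is defined
ok = (a > 0) & (b > 0) & (c > 0) & (d > 0)
a, b, c, d = a[ok], b[ok], c[ok], d[ok]

log_or = np.log((c / d) / (a / b))   # sample log odds ratio (sample 2 to sample 1)

print(log_or.mean(), np.log((p2 / (1 - p2)) / (p1 / (1 - p1))))  # compare with .8473
print(log_or.var(), log_or.std())    # compare with the values reported above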
Figure 4.9: Graphs of the simulated distributions.
4.8.8 Computer Intensive Methods
• The sampling distribution of a statistic which is a complicated function of the observations can be approximated using the Delta Method.
• With the advent of fast modern computing techniques, other methods of obtaining sampling distributions have been developed. One of these, called the bootstrap, is of great importance in estimation and in interval estimation.
The Bootstrap Method
Given data x1 , x2 , . . . , xn , a random sample from p(x; θ) we estimate θ by the statistic θ̂. Of
interest is the standard error of θ̂. We may not be able to obtain the standard error if θ̂
is a complicated function of the data, nor do we want an asymptotic result which may be
suspect if used for small samples.
The bootstrap method, introduced in 1979 by Bradley Efron, is a computer intensive
method for obtaining the standard error of θ̂ which has been shown to be valid in most situations.
The bootstrap method for estimating the standard error of θ̂ is as follows:
(1) Draw a random sample of size n with replacement from the observed data x1 , x2 , . . . , xn
and compute θ̂.
(2) Repeat step 1 a large number, B, of times obtaining B separate estimates of θ denoted
by
θ̂1 , θ̂2 , . . . , θ̂B
(3) Calculate the mean of the estimates in step 2, i.e.

\bar{\theta} = \frac{\sum_{i=1}^{B} \hat{\theta}_i}{B}

(4) The bootstrap estimate of the standard error of \hat{\theta} is given by

\hat{\sigma}_{BS}(\hat{\theta}) = \sqrt{ \frac{\sum_{i=1}^{B} (\hat{\theta}_i - \bar{\theta})^2}{B - 1} }
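The four steps translate directly into code. Here is a minimal Python/NumPy sketch of the bootstrap standard error, with the statistic supplied by the user; the helper function name is of course arbitrary.

import numpy as np

def bootstrap_se(data, statistic, B=250, seed=0):
    """Bootstrap estimate of the standard error of statistic(data)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    # Steps 1-2: B resamples of size n drawn with replacement, statistic computed on each
    estimates = np.array([statistic(rng.choice(data, size=n, replace=True))
                          for _ in range(B)])
    # Steps 3-4: mean of the bootstrap estimates, then their standard deviation
    theta_bar = estimates.mean()
    return np.sqrt(np.sum((estimates - theta_bar) ** 2) / (B - 1))

# Illustration: bootstrap standard error of the sample mean for the first ten
# ratios from the bootstrap example later in this section
x = np.array([0.693, 0.748, 0.654, 0.670, 0.662, 0.672, 0.615, 0.606, 0.690, 0.628])
print(bootstrap_se(x, np.mean, B=1000))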
The bootstrap is computationally intensive but is easy to use except in very complex
problems. Efron suggests about 250 samples be drawn (i.e. B=250) in order to obtain
reliable estimates of the standard error. To obtain percentiles of the bootstrap distribution
it is suggested that 500 to 1000 bootstrap samples be taken. The following is a schematic of
the bootstrap procedure.
Figure 4.10: Schematic of the bootstrap procedure.
It is interesting to note that the current citation index for statistics lists about 600 papers
involving use of the bootstrap!
References:
1. A Leisurely Look at the Bootstrap, the Jackknife and Cross-Validation (1983) B. Efron
and G. Gong; The American Statistician, February 1983, Vol. 37, No. 1
2. Bootstrapping (1993) C. Mooney and R. Duval; Sage Publications. This is a very
readable introduction designed for applications in the Social Sciences.
3. The STATA Manual has an excellent section on the bootstrap and a bootstrap command is available.
The Jackknife Method
The jackknife is another procedure for obtaining estimates and standard errors in situations
where
• The exact sampling distribution of the estimate is not known.
• We want an estimate of the standard error of the estimate which is robust against
model failure and the assumption of large sample sizes.
The jackknife is computer intensive but relatively easy to implement.
Assume that we have n observations x_1, x_2, \ldots, x_n which are a random sample from a distribution p. Assume the parameter of interest is θ and that its estimate is θ̂.
The jackknife procedure is as follows:
1. Let θ̂(i) denote the estimate of θ determined by eliminating the ith observation.
2. The jackknife estimate of θ is defined by

\hat{\theta}_{(JK)} = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{(i)}

i.e. the average of the \hat{\theta}_{(i)}.

3. The jackknife estimate of the standard error of \hat{\theta} is given by

\hat{\sigma}_{JK} = \left[ \frac{n - 1}{n} \sum_{i=1}^{n} \left( \hat{\theta}_{(i)} - \hat{\theta}_{(JK)} \right)^2 \right]^{1/2}
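A matching Python/NumPy sketch of the jackknife standard error (again, the function name and illustrative data are arbitrary) is:

import numpy as np

def jackknife_se(data, statistic):
    """Jackknife estimate of the standard error of statistic(data)."""
    data = np.asarray(data)
    n = len(data)
    # Leave-one-out estimates theta_hat_(i)
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    theta_jk = loo.mean()
    return np.sqrt((n - 1) / n * np.sum((loo - theta_jk) ** 2))

# Illustration on the same ten ratios used in the bootstrap sketch above
x = np.array([0.693, 0.748, 0.654, 0.670, 0.662, 0.672, 0.615, 0.606, 0.690, 0.628])
print(jackknife_se(x, np.mean))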
4.8.9 Bootstrap Example
In ancient Greece a rectangle was called a "Golden Rectangle" if the length to width ratio was

\frac{2}{\sqrt{5} + 1} = 0.618034

This ratio was a design feature of their architecture. The following data set gives the breadth to length ratio of beaded rectangles used by the Shoshani Indians in the decoration of leather goods. Were they also using the Golden Rectangle?
.693  .748  .654  .670  .662
.672  .615  .606  .690  .628
.668  .611  .606  .609  .601
.553  .570  .844  .576  .933
We now use the bootstrap method for the sample mean and the sample median.
. infile ratio using "c:\courses\b651201\datasets\shoshani.raw"
(20 observations read)

. stem ratio

Stem-and-leaf plot for ratio
ratio rounded to nearest multiple of .001
plot in units of .001

  5** | 53,70,76
  6** | 01,06,06,09,11,15,28
  6** | 54,62,68,70,72,90,93
  7** | 48
  7** |
  8** | 44
  8** |
  9** | 33

. summarize ratio

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+------------------------------------------------------
   ratio |      20      .66045    .0924608       .553       .933
. bs "summarize ratio" "r(mean)", reps(1000) saving(mean)

command:      summarize ratio
statistic:    r(mean)
(obs=20)

Bootstrap statistics

Variable |  Reps  Observed      Bias  Std. Err.  [95% Conf. Interval]
---------+------------------------------------------------------------
     bs1 |  1000    .66045  .0017173   .0197265   .6217399  .6991601  (N)
         |                                         .626775    .70365  (P)
         |                                           .6264     .7021  (BC)
------------------------------------------------------------------------
N = normal, P = percentile, BC = bias-corrected
. use mean, clear
(bs: summarize ratio)
. kdensity bs1
. kdensity bs1,saving(g1,replace)
. drop _all
. infile ratio using "c:\courses\b651201\datasets\shoshani.raw"
(20 observations read)
. bs "summarize ratio,detail" "r(p50)", reps(1000) saving(median)

command:      summarize ratio,detail
statistic:    r(p50)
(obs=20)

Bootstrap statistics

Variable |  Reps  Observed      Bias  Std. Err.  [95% Conf. Interval]
---------+------------------------------------------------------------
     bs1 |  1000      .641  -.001711   .0222731   .5972925  .6847075  (N)
         |                                           .6075      .671  (P)
         |                                            .609      .679  (BC)
------------------------------------------------------------------------
N = normal, P = percentile, BC = bias-corrected
. use median,clear
(bs: summarize ratio,detail)
. kdensity bs1,saving(g2,replace)
. graph using g1 g2
The bootstrap distributions of the sample mean and the sample median are given below:

Figure 4.11: Bootstrap distributions of the sample mean and the sample median.