Introduction to Probability and Statistics: notes for a short course
Jonathan G. Campbell
Department of Computing,
Letterkenny Institute of Technology,
Co. Donegal, Ireland.
email: jonathan dot campbell (at) gmail.com
URL: http://www.jgcampbell.com/stats/stats.pdf
Report No: jc/09/0004/r
Revision 0.3
18th August 2009
Contents
1 Introduction
  1.1 Purpose and Scope
  1.2 Why use R?
  1.3 Relevant textbooks and web sources
    1.3.1 General Books on Probability and Statistics
    1.3.2 Books on R and Statistics using R
    1.3.3 Bayesian Statistics
    1.3.4 Web Links
  1.4 Outline
2 Simple Data Analysis and Visualisation and Introduction to R
  2.1 Introduction
    2.1.1 Installation of R
    2.1.2 Running R
  2.2 Visualisation and Exploratory Data Analysis
3 Averages
  3.1 Introduction
  3.2 Arithmetic Mean
    3.2.1 Arithmetic Mean using Frequencies
  3.3 Median
  3.4 Mode
  3.5 Other Means
4 Measures of Data Variability
  4.1 Introduction
  4.2 Variance and Standard Deviation
    4.2.1 Equalising the means
    4.2.2 Variability and spread
    4.2.3 Variance and Standard Deviation
  4.3 Standard Scores and Normalising Marks
    4.3.1 Standard Scores
5 Probability and Random Variables
  5.1 Introduction
  5.2 Basic Probability and Random Variables
    5.2.1 Introduction
    5.2.2 Probability and Events
    5.2.3 A Point on Terminology
    5.2.4 Probability of Non-disjoint Events
    5.2.5 Finite Sample Spaces
  5.3 Random Variables
  5.4 Computing probabilities
  5.5 Enumerating more complex events and sample spaces
    5.5.1 Multiplication of outcomes
    5.5.2 Addition of outcomes
    5.5.3 Permutations
    5.5.4 Combinations
  5.6 Conditional Probability
    5.6.1 Venn diagrams
    5.6.2 Probability Trees
    5.6.3 Joint Probability
  5.7 Bayes’ Rule
  5.8 Independent Events
  5.9 Betting and Odds
  5.10 Classical versus Bayesian Interpretations of Probability
6 One Dimensional Random Variables
  6.1 Introduction
    6.1.1 Definition: Random Variable
    6.1.2 Probability associated with a Random Variable
  6.2 Probability Mass Function (pmf) of a Discrete r.v.
  6.3 Some Discrete Random Variables
    6.3.1 Point Mass Distribution
    6.3.2 Discrete Uniform Distribution
    6.3.3 Bernoulli Distribution
    6.3.4 Binomial Distribution
    6.3.5 Geometric Distribution
    6.3.6 Poisson Distribution
  6.4 Some Continuous Random Variables
    6.4.1 Probability Density Function (PDF)
    6.4.2 Cumulative Distribution Function (cdf)
    6.4.3 Uniform Distribution
    6.4.4 Normal (Gaussian) Distribution
    6.4.5 Exponential Distribution
    6.4.6 Gamma Distribution
    6.4.7 Beta Distribution
    6.4.8 Student t Distribution
    6.4.9 Cauchy Distribution
    6.4.10 Chi-squared Distribution
  6.5 Range spaces: terminology
  6.6 Parameters
7 Two- and Multi-Dimensional Random Variables
  7.1 Introduction
  7.2 Probability Function of a Discrete Two-dimensional r.v.
  7.3 PDF of a Continuous Two-dimensional r.v.
  7.4 Marginal Probability Distributions
  7.5 Conditional Probability Distributions
  7.6 Independent Random Variables
  7.7 Two-dimensional (Bivariate) Normal Distribution
8 Characterisations of Random Variables
  8.1 Introduction
  8.2 Expected Value (Mean) of a Random Variable
  8.3 Variance of a Random Variable
  8.4 Expectations in Two-dimensions
    8.4.1 Mean
    8.4.2 Covariance
9 The Normal Distribution
  9.1 Introduction
  9.2 Cumulative Distribution Function (cdf)
  9.3 Normal Cdf
  9.4 Using the Normal Cdf
  9.5 Sum of Independent Normal Random Variables
  9.6 Differences of Normal Random Variables
  9.7 Linear Transformations of Normal Random Variables
  9.8 The Central Limit Theorem
10 Statistical Inference
  10.1 Introduction
11 Statistical Estimation
  11.1 Introduction
  11.2 Populations and Samples
  11.3 Estimating the Mean
  11.4 Estimating the Standard Deviation
  11.5 Sampling Distributions
    11.5.1 Sampling Distribution of the mean
    11.5.2 Sampling Distribution for Estimates of the Standard Deviation
  11.6 Confidence Intervals
12 Hypothesis Testing
  12.1 Introduction
13 Sampling
  13.1 Introduction
14 Classification and Pattern Recognition
  14.1 Introduction
15 Simple Classifier Methods
  15.1 Thresholding for one-dimensional data
  15.2 Linear separating lines/planes for two-dimensions
  15.3 Nearest mean classifier
  15.4 Normal form of the separating line, projections, and linear discriminants
  15.5 Projection and linear discriminant
  15.6 Projections and linear discriminants in p dimensions
  15.7 Template Matching and Discriminants
  15.8 Nearest neighbour methods
16 Statistical Classifier Methods
  16.1 One-dimensional classification revisited
  16.2 Bayes’ Rule for the Inversion of Conditional Probabilities
  16.3 Parametric Methods
  16.4 Discriminants based on Normal Density
  16.5 Bayes-Gauss Classifier – Special Cases
    16.5.1 Equal and Diagonal Covariances
    16.5.2 Equal but General Covariances
  16.6 Least square error trained classifier
  16.7 Generalised linear discriminant function
17 Linear Discriminant Analysis and Principal Components Analysis
  17.1 Principal Components Analysis
  17.2 Fisher’s Linear Discriminant Analysis
18 Neural Network Methods
  18.1 Neurons for Boolean Functions
  18.2 Three-layer neural network for arbitrarily complex decision regions
  18.3 Sigmoid activation functions
19 Unsupervised Classification (Clustering)
20 Regression
  20.1 Linear Regression
A Basic Mathematical Notation
  A.1 Sets
    A.1.1 Set Definition and Membership
    A.1.2 Important Number Sets
    A.1.3 Set Operations
    A.1.4 Venn Diagrams
  A.2 Iterated Summation and Product Notation
  A.3 Iterated Union and Intersection
  A.4 Cartesian Product Sets
B Matrices and Linear Algebra
  B.1 Introduction
  B.2 Linear Simultaneous Equations
  B.3 Vectors and Matrices
  B.4 Basic Matrix Arithmetic
    B.4.1 Matrix Multiplication
    B.4.2 Multiplication by a Scalar
    B.4.3 Addition
  B.5 Special Matrices
    B.5.1 Identity Matrix
    B.5.2 Orthogonal Matrix
    B.5.3 Diagonal
    B.5.4 Transpose of a Matrix
  B.6 Inverse Matrix
  B.7 Multidimensional (Multivariate) Random Variables
Chapter 1
Introduction
1.1 Purpose and Scope
This report is written as the basis for a short course on statistics to be presented for postgraduate
students at Letterkenny Institute of Technology.
The notes have a mixed objective. I started writing a set of notes based on the traditional approach
to probability and statistics, namely: basic probability, up to and including conditional probability,
independence, Bayes’ Law; then some one-dimensional discrete and continuous distributions and
some of their properties, et cetera; and then on to sampling, parameter estimation, point estimates,
confidence intervals, and hypothesis testing.
However, after discussion with someone who knows potential consumers of the course, I was
persuaded to start with a more gentle introduction. Hence I start off with simple visualisation, then
look at averages (central tendency), then variance, and then return to the main line.
As I say, the notes have a mixed objective. One objective is to serve as notes for a gentle introduction to
statistics; another is to include a set of reference results that one would refer to during a course.
That is, a course presenter might not want to spend time on the details of, for example, the Binomial
distribution, or even full details of the Normal, but it would be useful for students to have access
to some of these details without having to consult one or more textbooks.
When I give a course, I may give attendees a printout of all the notes — including an outline of
the objective of the course and the plan of coverage, mentioning the chapters that will be used.
Or, alternatively, I may do a specialised printout that includes only the chapters to be covered.
The notes you see here include everything.
1.2 Why use R?
Let me quote from the R website http://www.r-project.org/:
R is a language and environment for statistical computing and graphics. It is a GNU
project which is similar to the S language and environment which was developed at
Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and
colleagues. R can be considered as a different implementation of S. There are some
important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, . . . ) and graphical techniques,
and is highly extensible. The S language is often the vehicle of choice for research in
statistical methodology, and R provides an Open Source route to participation in that
activity.
One of R’s strengths is the ease with which well-designed publication-quality plots can
be produced, including mathematical symbols and formulae where needed. Great care
has been taken over the defaults for the minor design choices in graphics, but the user
retains full control.
When I have to choose a software package for teaching or for practical use (I mean generally, it
could be a development system for a programming language, a computer games engine, a statistics
package, . . . ) I look primarily at the following criteria:
• Is it easily available, i.e. is it already installed on our laboratory machines, or is it easy (and
cheap) to acquire?
R does well on this criterion: it is free to download and install, see 2.1.1.
• Is it well supported by textbooks and online documentation?
Again, R does well. In the past ten years, and this has greatly accelerated in the last five years,
a great many top class books have appeared on R and on particular statistical techniques using R; see 1.3.
I notice that books that used to have just numerical examples now, in recent editions, give
R examples.
There is a top class mailing list supported by volunteers of the highest calibre:
https://stat.ethz.ch/mailman/listinfo/r-help
Via that mailing list, I have received assistance from world-class statisticians.
• Is it widely used? Yes.
1.3 Relevant textbooks and web sources
1.3.1 General Books on Probability and Statistics
These notes are mostly based on (Meyer 1966) (which was used for a college course on statistics
that I attended), (Wasserman 2004), which is a good summary of all the statistics you might
ever need, but is not an introduction, (Griffiths 2009) and (Milton 2009) which are excellent
introductions though very wordy, (Crawley 2005), (Spiegel & Stephens 2008). The latter, (Spiegel
& Stephens 2008), has plenty of examples including some examples on the use of the Excel
spreadsheet.
(Dytham 2009) seems to be a good introduction for biologists and the more advanced (Quinn &
Keough 2002) receives a lot of recommendations.
Hacking’s book (Hacking 2001) is maybe a good introduction to probability and the philosophy
and practice of probabilistic inference.
The bibliography contains books that are in my collection, that I may have used in some small way,
and/or that may be useful to users of these notes.
1.3.2 Books on R and Statistics using R
Crawley may be the best general book (Crawley 2005); for bio-scientists it has the advantage that
Crawley’s research area is bio-science.
Venables and Ripley’s MASS (Venables & Ripley 2002) is top class — note, do not be confused
by the title Modern Applied Statistics with S; R is an open-source version of S (and S-Plus) and
the book covers any differences, which are minimal. Maindonald (Maindonald & Braun 2007) is
good for R graphics; R code for all his diagrams is available online (free).
Matloff’s R for Programmers (Matloff 2008) has the advantage that it is available online.
See also the extensive list at
http://www.r-project.org/doc/bib/R-books.html
1.3.3 Bayesian Statistics
Not that we’ll be emphasising the Bayesian approach.
(Sivia 2006) (best introduction to Bayesian statistics), (MacKay 2002), (Lee 2004).
1.3.4 Web Links
• General: http://www.jgcampbell.com/links/stats.html;
• R: http://www.r-project.org/.
1.4 Outline
Chapter 5 gives an introduction to probability; if you want to understand basic statistics you must
have a basic understanding of probability — however we note that probability is to a great extent
common sense. Before starting you should have a quick run through Appendix A just to familiarise
yourself with basic mathematical notation; we note that the mathematical notation used is no
more than shorthand; it would be difficult to write these notes without employing that shorthand;
in addition, you will encounter similar shorthand in books and research papers.
Chapter 2 gives a very brief introduction to simple statistical techniques and visualisation and to
the statistical package R.
Chapter 3 gives a brief introduction to averages or what statisticians call central tendency.
Chapter 4 introduces methods of describing data variability, most notably variance
and standard deviation.
Chapter 6 introduces random variables and lists the common one-dimensional probability distributions.
Chapter 7 gives a brief introduction to multivariate random variables and some distributions. Note
that Appendix B gives a gentle introduction to vector and matrix mathematics which are necessary
in multivariate statistics.
Chapter 8 discusses important characteristics of random variables such as mean and variance.
Chapter 9 gives specialised treatment to the normal distribution — in view of its importance in
applications.
Chapter 10 introduces statistical inference, that is, how can we infer properties of a population
from statistics derived from a sample. One aspect of statistical inference is parameter estimation;
Chapter 11 introduces point estimation and confidence interval estimation. Hypothesis testing is
strongly related to estimation; Chapter 12 gives an introduction to hypothesis testing.
Chapter 13 discusses some of the intricacies of sampling.
As of 2009-08-18 this is work in progress and will remain so for the foreseeable future.
Chapter 2
Simple Data Analysis and Visualisation and Introduction to R
2.1 Introduction
The objectives of this chapter are to give a very brief introduction to simple statistical techniques
and visualisation and to the statistical package R.
2.1.1 Installation of R
Click on http://www.r-project.org/ and find the Download link. For Windows users there is
an exe file which does everything. You may need Administrator rights on your machine; contact
Computer Services as necessary.
Linux users are probably best advised to rely on the installer of their particular Linux distribution.
2.1.2 Running R
Start R by clicking on the R desktop icon. R will open up a window with something like the following
in it.
R version 2.7.1 (2008-06-23)
Copyright (C) 2008 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type ’license()’ or ’licence()’ for distribution details.
Type ’demo()’ for some demos, ’help()’ for on-line help, or
’help.start()’ for an HTML browser interface to help.
Type ’q()’ to quit R.
>
The > is R asking you to enter something as on a calculator; R can operate as a simple
calculator, but of course we are interested in its use as a powerful statistical calculator.
> 2 + 3
[1] 5
> sqrt(26)
[1] 5.09902
> 3^4
[1] 81
>
For the remainder of this chapter we’ll look at a significant example involving visualisation and
exploratory data analysis on a data set.
2.2 Visualisation and Exploratory Data Analysis
We’re going to read in some examination result data and analyse them. The file exam.txt contains
data as follows:
exam
65
60
47
... etc. 66 results in total
The name of the column is exam and we tell R to pay attention to that by using header= T.
In what follows, # is a comment symbol and R ignores anything after the # until the next line.
Anything after > is something that you typed, a request to R. If something appears without
a >, that is an R response.
> ex <- read.table("exam.txt", header= T)
> attach(ex)
> exam # print 'exam' data on the screen
 [1] 65 60 47 43 51 32 62 71  0 56 52 59 15 49 54 67 44  2 47 61 45 95 62 80 46
[26] 52 61 12 62 69 78 62 48 56 56 58 60  0 48 71 50 90 51 53  5 51 63 35 39 10
[51] 57 53 20 54 22 44 53 52 25 60 55 39 30 53 67 50
>
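As a hedged aside: attach is convenient in a short session, but the column can also be reached directly with the $ operator, which avoids clashes between column names and other objects. The sketch below is self-contained: rather than assuming exam.txt exists, it writes a stand-in file holding only the first three marks shown above to a temporary location. The names tmp and ex_demo are made up for this illustration.

```r
# Write a tiny stand-in for exam.txt to a temporary file
# (hypothetical file; only the first three marks are used).
tmp <- tempfile(fileext = ".txt")
writeLines(c("exam", "65", "60", "47"), tmp)

# header = TRUE tells R that the first line holds the column name.
ex_demo <- read.table(tmp, header = TRUE)

str(ex_demo)          # a data frame with one column called 'exam'
ex_demo$exam          # the marks, reached without attach()
mean(ex_demo$exam)    # functions work on the column directly
```

Writing TRUE in full rather than T is slightly safer, because T is an ordinary variable that can be reassigned.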
That printout is quite uninformative, for example you have no idea what the maximum is, nor the
range, nor have you an even rough idea of what the average mark is, etc.
Let us look at a histogram.
> hist(exam)
And we get Figure 2.1.
Often, like me here, you want to save the diagram to a file so that you can include it in a report.
Here is how to do that; vis1-1.pdf is a filename that I made up.
> pdf("vis1-1.pdf", onefile=FALSE, height=8, width=6, pointsize=8,
      paper="special")
> hist(exam)
> devoff()
Error: could not find function "devoff" # R complaining ...
> dev.off() # do this to finalise and close the file
            # if you don't it's like forgetting to save in a wordprocessor.
>
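The open, draw, close pattern can be tried without touching your working directory. The following is a minimal sketch under stated assumptions: marks_demo is a made-up handful of marks and out_file is a hypothetical file name in R's temporary directory.

```r
# Hypothetical data and file name, for illustration only.
marks_demo <- c(65, 60, 47, 43, 51, 32, 62, 71)
out_file <- file.path(tempdir(), "hist-demo.pdf")

pdf(out_file, height = 4, width = 6)  # open the PDF device; plots now go to the file
hist(marks_demo)                      # draw into the file, nothing appears on screen
dev.off()                             # close the device; this finalises the file

file.exists(out_file)                 # check that the file was written
```

Every pdf() call should be matched by a dev.off(); until the device is closed, later plots keep going to the file.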
[Plot: "Histogram of exam"; x-axis: exam (0 to 100); y-axis: Frequency]
Figure 2.1: Histogram of exam marks.
Let us see what the average mark is and the range of marks:
> mean(exam)
[1] 49.07576
> range(exam)
[1]  0 95
>
We could have used:
> length(exam)
[1] 66 # 66 results in 'exam'
> sum(exam)/length(exam)
[1] 49.07576
Let us see the data in sorted order, a good deal more informative than unsorted:
> sort(exam)
 [1]  0  0  2  5 10 12 15 20 22 25 30 32 35 39 39 43 44 44 45 46 47 47 48 48 49
[26] 50 50 51 51 51 52 52 52 53 53 53 53 54 54 55 56 56 56 57 58 59 60 60 60 61
[51] 61 62 62 62 62 63 65 67 67 69 71 71 78 80 90 95
>
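The individual calls above can be gathered into one: summary reports the minimum, the quartiles, the median, the mean and the maximum together. This is a small sketch, not part of the original session; the vector simply re-enters the 66 marks printed earlier so that the block runs on its own.

```r
# The 66 examination marks printed earlier in this chapter.
exam <- c(65, 60, 47, 43, 51, 32, 62, 71,  0, 56, 52, 59, 15, 49, 54, 67, 44,
           2, 47, 61, 45, 95, 62, 80, 46, 52, 61, 12, 62, 69, 78, 62, 48, 56,
          56, 58, 60,  0, 48, 71, 50, 90, 51, 53,  5, 51, 63, 35, 39, 10, 57,
          53, 20, 54, 22, 44, 53, 52, 25, 60, 55, 39, 30, 53, 67, 50)

summary(exam)            # minimum, quartiles, median, mean and maximum in one call
diff(range(exam))        # width of the range: largest mark minus smallest
median(exam)             # the middle value of the sorted marks
```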
Now read in the corresponding continuous assessment (CA) marks (coursework); they came from a
spreadsheet so there’s a load of digits after the decimal point and that makes the data even more
incomprehensible, so we use round to round them to the nearest integer. It looks like
the CA marks are more generous than the exam marks, and mean(ca) confirms this, as does the
histogram in Figure 2.2.
> cw <- read.table("ca.txt", header= T)
> attach(cw)
> ca
 [1] 91.34390 85.54622 72.65543 63.10473 73.22074 50.99642 85.69151 97.06528
 [9] 18.58191 83.30836 78.78221 77.68898 21.07860 76.04457 76.56793 86.90106
[17] 61.70048 16.28892 69.57387 83.08058 74.19594 97.12300 81.58833 98.12345
[25] 60.17263 79.49133 89.35610 27.89478 98.06673 92.34510 96.19500 88.69131
[33] 69.70333 85.23094 86.99767 82.89807 77.35877 15.12655 72.41332 90.07670
[41] 75.20815 97.17500 65.78075 70.29256 14.20315 73.02363 87.38178 52.74194
[49] 60.66164 20.05529 78.16085 73.58862 34.07182 78.03601 39.31353 69.57565
[57] 77.53929 77.20521 52.67979 89.10232 76.78222 54.16873 40.23080 81.09443
[65] 89.12518 67.58763
> car = round(ca)
> car
 [1] 91 86 73 63 73 51 86 97 19 83 79 78 21 76 77 87 62 16 70 83 74 97 82 98 60
[26] 79 89 28 98 92 96 89 70 85 87 83 77 15 72 90 75 97 66 70 14 73 87 53 61 20
[51] 78 74 34 78 39 70 78 77 53 89 77 54 40 81 89 68
>
> sort(car)
 [1] 14 15 16 19 20 21 28 34 39 40 51 53 53 54 60 61 62 63 66 68 70 70 70 70 72
[26] 73 73 73 74 74 75 76 77 77 77 77 78 78 78 78 79 79 81 82 83 83 83 85 86 86
[51] 87 87 87 89 89 89 89 90 91 92 96 97 97 97 98 98
>
> mean(ca)
[1] 70.10692
>
> hist(ca)
# and save another one to a file
> pdf("vis1-ca.pdf", onefile=FALSE, height=4, width=6, pointsize=8, paper="special")
> hist(ca)
> dev.off()
[Plot: "Histogram of ca"; x-axis: ca (20 to 100); y-axis: Frequency]
Figure 2.2: Histogram of CA marks.
Boxplots are another way of examining a data set. Figure 2.3 shows boxplots for the examination
and CA results.
The construction of the boxplot is as follows: (a) the heavy line across the interior of the box
corresponds to the median value (see Chapter 3); (b) the bottom and top of the box correspond
to, respectively, the lower quartile and upper quartile, i.e. 25% of the data are below the lower
quartile and 25% are above the upper quartile (or, if you like, 75% are below it).
The so-called whiskers show the smallest and largest values, excluding boxplot’s interpretation
of outliers. The outliers are then shown as single points.
Quartile is a specialisation of the general term quantile, see Chapter 4. In Chapters 9, 11 and
12, we’ll come across, for example, 5% and 95% quantiles. The median is the centre of the data,
i.e. as many of the data are above the median as are below it; see Chapter 3.
To determine what are outliers, boxplot (with its default settings) uses the interquartile range, i.e. the
height of the box: any data that lie more than 1.5 times the interquartile range below the lower quartile
or above the upper quartile are labelled as outliers.
[Plot: two boxplots side by side, vertical axes running from 0 to 100, with outliers drawn as single points]
Figure 2.3: Boxplot of: left, examination marks; right, CA marks.
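The numbers behind a boxplot can be inspected directly. The sketch below is an illustration rather than the code used for Figure 2.3: it re-enters the 66 exam marks so that it runs on its own, and uses boxplot.stats, the helper that boxplot relies on. By default the points drawn individually are those lying more than 1.5 times the height of the box beyond the box. Note that boxplot.stats works with hinges, which can differ slightly from the quartiles returned by quantile.

```r
# The 66 examination marks printed earlier in this chapter.
exam <- c(65, 60, 47, 43, 51, 32, 62, 71,  0, 56, 52, 59, 15, 49, 54, 67, 44,
           2, 47, 61, 45, 95, 62, 80, 46, 52, 61, 12, 62, 69, 78, 62, 48, 56,
          56, 58, 60,  0, 48, 71, 50, 90, 51, 53,  5, 51, 63, 35, 39, 10, 57,
          53, 20, 54, 22, 44, 53, 52, 25, 60, 55, 39, 30, 53, 67, 50)

quantile(exam, c(0.25, 0.50, 0.75))  # lower quartile, median, upper quartile

bs <- boxplot.stats(exam)  # the statistics that boxplot draws
bs$stats   # lower whisker end, box bottom, median, box top, upper whisker end
bs$out     # the marks that are drawn as individual points

boxplot(exam, main = "Examination marks")  # the picture itself
```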
How to look at the two data sets together? There must be a way of superimposing one histogram
on another, but I haven’t found that yet.
So let us display a two-dimensional scatter plot of the two data sets, see Figure 2.4.
> library(lattice) # first we must load a library that has 'xyplot' in it
> xyplot(exam ~ ca)
[Plot: scatter of points; x-axis: ca (0 to 100); y-axis: exam (0 to 80)]
Figure 2.4: Scatter plot of Exam. marks versus CA marks.
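Returning to the question of superimposing one histogram on another: one possible approach, offered as a sketch and not as the only way, is to compute both histograms on the same bins and draw the second with add = TRUE, using semi-transparent colours so that the overlap stays visible. The vectors re-enter the exam marks and the rounded CA marks printed above; brk, h_exam, h_ca and top are names made up for this example.

```r
# Exam marks and rounded CA marks, as printed earlier in this chapter.
exam <- c(65, 60, 47, 43, 51, 32, 62, 71,  0, 56, 52, 59, 15, 49, 54, 67, 44,
           2, 47, 61, 45, 95, 62, 80, 46, 52, 61, 12, 62, 69, 78, 62, 48, 56,
          56, 58, 60,  0, 48, 71, 50, 90, 51, 53,  5, 51, 63, 35, 39, 10, 57,
          53, 20, 54, 22, 44, 53, 52, 25, 60, 55, 39, 30, 53, 67, 50)
car  <- c(91, 86, 73, 63, 73, 51, 86, 97, 19, 83, 79, 78, 21, 76, 77, 87, 62,
          16, 70, 83, 74, 97, 82, 98, 60, 79, 89, 28, 98, 92, 96, 89, 70, 85,
          87, 83, 77, 15, 72, 90, 75, 97, 66, 70, 14, 73, 87, 53, 61, 20, 78,
          74, 34, 78, 39, 70, 78, 77, 53, 89, 77, 54, 40, 81, 89, 68)

brk <- seq(0, 100, by = 10)                 # identical bins for both data sets
h_exam <- hist(exam, breaks = brk, plot = FALSE)
h_ca   <- hist(car,  breaks = brk, plot = FALSE)
top <- max(h_exam$counts, h_ca$counts)      # tallest bar, so that nothing is clipped

blue <- rgb(0, 0, 1, 0.4)                   # semi-transparent colours
red  <- rgb(1, 0, 0, 0.4)
plot(h_exam, col = blue, ylim = c(0, top),
     main = "Exam and CA marks", xlab = "mark")
plot(h_ca, col = red, add = TRUE)           # superimpose on the first histogram
legend("topleft", legend = c("exam", "CA"), fill = c(blue, red))
```

Computing the histograms first with plot = FALSE is a design choice: it lets the y-axis be sized for both data sets before anything is drawn.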
Someone says: those CA and exam marks look quite correlated, I wonder how accurately we could
have predicted the exam results using the CA? This is regression territory, and given that
Figure 2.4 shows a sort of straight line relationship, we’ll try linear regression, your old friend
y = mx + c, or in this case exam = m × ca + c. It is more usual to use a and b: exam = a + b × ca, where a
is the intercept, where the fitted straight line meets the y-axis at x = 0, and b is the slope.
> fitres = lm(exam ~ ca)
> summary(fitres)

Call:
lm(formula = exam ~ ca)

Residuals:
     Min       1Q   Median       3Q      Max
-10.9697  -3.1181  -0.7405   3.1036  22.8368

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.83639    2.21002  -4.903 6.77e-06 ***
ca            0.85458    0.03002  28.469  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.482 on 64 degrees of freedom
Multiple R-squared: 0.9268, Adjusted R-squared: 0.9257
F-statistic: 810.5 on 1 and 64 DF,  p-value: < 2.2e-16
>
R prints a lot of information that we’ll find out about in Chapter 20; for now all we need to know
are a = −10.83639 (intercept) and b = 0.85458 (coefficient multiplying ca), i.e. the fitted line is
exam = −10.83639 + 0.85458 × ca. Figure 2.5 shows the results of the straight line fitting.
[Plot: scatter of points with the fitted straight line; x-axis: ca (20 to 100); y-axis: exam (0 to 80)]
Figure 2.5: Straight line fitting Exam. marks versus CA marks.
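A fitted model can be used for more than printing a summary. The sketch below shows one way to extract the coefficients, draw the fitted line on a scatter plot and predict an exam mark from a CA mark. It is an illustration under stated assumptions: to keep the block short it uses the rounded CA marks (car) rather than ca, so the coefficients will differ slightly from those printed above, and the CA mark of 70 is a made-up example value.

```r
# Exam marks and rounded CA marks, as printed earlier in this chapter.
exam <- c(65, 60, 47, 43, 51, 32, 62, 71,  0, 56, 52, 59, 15, 49, 54, 67, 44,
           2, 47, 61, 45, 95, 62, 80, 46, 52, 61, 12, 62, 69, 78, 62, 48, 56,
          56, 58, 60,  0, 48, 71, 50, 90, 51, 53,  5, 51, 63, 35, 39, 10, 57,
          53, 20, 54, 22, 44, 53, 52, 25, 60, 55, 39, 30, 53, 67, 50)
car  <- c(91, 86, 73, 63, 73, 51, 86, 97, 19, 83, 79, 78, 21, 76, 77, 87, 62,
          16, 70, 83, 74, 97, 82, 98, 60, 79, 89, 28, 98, 92, 96, 89, 70, 85,
          87, 83, 77, 15, 72, 90, 75, 97, 66, 70, 14, 73, 87, 53, 61, 20, 78,
          74, 34, 78, 39, 70, 78, 77, 53, 89, 77, 54, 40, 81, 89, 68)

fit <- lm(exam ~ car)   # fit exam = a + b * car
ab <- coef(fit)         # ab[1] is the intercept a, ab[2] is the slope b
ab

plot(car, exam, xlab = "ca", ylab = "exam")
abline(fit)             # draw the fitted straight line on the scatter plot

# Predicted exam mark for a hypothetical student with a CA mark of 70 ...
pred <- predict(fit, newdata = data.frame(car = 70))
pred
# ... and the same prediction by hand, a + b * 70.
by_hand <- ab[1] + ab[2] * 70
by_hand
```

With the coefficients printed in the text, the same hand calculation gives −10.83639 + 0.85458 × 70 ≈ 48.98, i.e. a CA mark of 70 predicts an exam mark of about 49.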
Finally, we can save all those commands:
> savehistory("20090508-3.txt")
# which we could load again at a later time with
> loadhistory("20090508-3.txt")
# but in any case, when you use q() to quit, R will offer you the
# option of saving and these saved commands will be loaded the
# next time you run R.
> q()
Save workspace image? [y/n/c]: y
That’s enough for an introduction.
Chapter 3
Averages
3.1 Introduction
This chapter gives a brief introduction to averages, or what statisticians call central tendency.
These are often, but not always, useful in summarising a set of data, especially when we wish to
compare the data set with another.
There are some pitfalls in using the common-or-garden average and we will note some of these.
3.2 Arithmetic Mean
The most familiar average value is the arithmetic mean, i.e. sum the values and divide by the
number of data. Just to get used to some mathematical notation, see A.2, we’ll write this as you’ll
see it in textbooks (the data are x_i, i = 1, . . . , n):
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \tag{3.1} \]
R-Example 1.
As before, we’ll read the data and print them. This time they are already sorted, so much easier
to read, even in list form.
> ex2 = read.table("exam2.txt", header = T)
> attach(ex2)
> exam2
 [1] 43 43 43 44 46 48 48 50 51 53 53 53 55 56 56 57 57 58 58 59 59 59 60 60 60
[26] 61 62 62 64 69
We can compute the mean by summing and dividing, see below, but not unexpectedly, R has a
function mean that does it for us.
> sum(exam2)
[1] 1647
> length(exam2)
[1] 30
> sum(exam2)/length(exam2)
[1] 54.9
> mean(exam2)
[1] 54.9
>
In spite of its simplicity, it is possible to compute the arithmetic mean wrongly.
R-Example 2. The following are a set of homework marks, marked out of 10. We read the data
in and print them. Then we produce a summarising table, marks versus frequency, which tells us
that we have three students with four (4) marks, three with five, six with six, etc.
> df.homew <- read.table("hw.txt", header = T)
> attach(df.homew)
> hw
 [1] 6 8 5 7 6 5 6 4 6 5 8 4 8 8 7 6 6 7 4 7
> table(hw)
hw
4 5 6 7 8    # marks
3 3 6 4 4    # frequencies
If we were not using a computer, we might think that we have a quick way to compute the mean: we
have just five distinct marks, namely 4, 5, 6, 7, 8, so we’ll take the average of those, 4 + 5 + 6 + 7 + 8 = 30,
so mean = 30/5 = 6. But R thinks differently:
> mean(hw)
[1] 6.15
The method we used works only if the frequencies are the same for each mark; it would be a rare
fluke if this were the case.
But we’ll pursue the matter further, because (a) computing an arithmetic mean using a frequency
table — done properly — can be a (correct) shortcut if you have a lot of numbers and just a
calculator or pencil and paper; (b) using frequencies prepares the ground for topics covered in later
chapters.
3.2.1 Arithmetic Mean using Frequencies
We’ll rewrite the table, now calling the data (marks) x; we’ll label them with i so that we have
x_i, i = 1, . . . , n, and n = 5.
> table(hw)
hw
 i =  1  2  3  4  5
-------------------
 xi   4  5  6  7  8    # marks
 fi   3  3  6  4  4    # frequencies
>
If we want to use the frequency table, we have to replace eqn. 3.1 with
\[ \bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}. \tag{3.2} \]
Applying eqn. 3.2 to our frequency table above gives
(3×4+3×5+6×6+4×7+4×8)/(3+3+6+4+4) = (12+15+36+28+32)/20 = 123/20 = 6.15.
If we look at the sum divided by number calculation in R, we see that the frequency calculation
ends up with not only the same result, but the same division,
> length(hw)
[1] 20
> sum(hw)
[1] 123
> sum(hw)/length(hw)
[1] 6.15
If you look at the sum of f_i × x_i you will see that it is the same as
4 + 4 + 4 + 5 + . . . + 8 + 8 + 8 + 8;
the sorted hw marks are below:
> sort(hw)
 [1] 4 4 4 5 5 5 6 6 6 6 6 6 7 7 7 7 8 8 8 8
And the sum of the frequencies is 20, i.e. the number of data.
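The frequency calculation of eqn. 3.2 can be sketched in R as follows. The marks and frequencies are copied from the table above; the variable names are assumptions made for this illustration.

```r
# Distinct homework marks and their frequencies, copied from table(hw) above.
x <- c(4, 5, 6, 7, 8)
f <- c(3, 3, 6, 4, 4)

# Eqn. 3.2 written out: sum of f_i * x_i over sum of f_i.
mean_from_freq <- sum(f * x) / sum(f)

# weighted.mean() from base R's stats package computes the same quantity.
mean_weighted <- weighted.mean(x, f)

# rep() expands the table back into the 20 raw marks, so mean() agrees too.
mean_raw <- mean(rep(x, times = f))

print(c(mean_from_freq, mean_weighted, mean_raw))
```

All three values should agree with mean(hw), because each is the same sum, 123, divided by the same count, 20.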
3.3 Median
Sometimes neither the mean nor the mode gives us what we would expect from a central value.
Look at the following speed data (speed of cars at a speed check). Here the mean, 37.1, is well off
the centre, and that offset is caused by an outlier, the 75. The offset would be a lot worse if the
outlier were 1000; that is not likely in the case of speeds, but outliers of this magnitude are possible
in the case of some measurement systems. A common example is a mineralisation survey taken
across an area of land. For the sake of argument, assume that we are looking for zinc. A sample
that coincides with the dumping of an old bucket will produce a huge outlier. Now if we want to
produce contour plots based on smoothed values (averages over regions), then mean smoothing
will show a (false) hot-spot, while median smoothing will not.
> sp = read.table("cars.txt", header = T)
> attach(sp)
> speed
[1] 25 31 33 31 30 35 75
> mean(speed)
[1] 37.14286
The median gives a truer central value. If we sort the speeds, we see that the central value (the
fourth) is 31; median gives the same result.
> sort(speed)
[1] 25 30 31 31 33 35 75
> median(speed)
[1] 31
> sort(speed)[4]
[1] 31
In the example above there are seven values, so the central one is the fourth; if we had an even
number of values, we would take the average of the two central values.
It can be said that the median is a measure of central tendency that is robust against outliers.
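A short sketch of the two points just made: the even-count rule, and robustness against outliers. The speed values are those of the example; removing the outlier to obtain an even count and replacing it by 1000 are assumptions made purely for illustration.

```r
# Speed data from the example above.
speed <- c(25, 31, 33, 31, 30, 35, 75)

# Odd number of values: the median is the middle value of the sorted data.
med_odd <- median(speed)

# Drop the outlier to leave an even number (6) of values; the median is then
# the average of the two central sorted values.
speed_even <- speed[speed != 75]
s <- sort(speed_even)
med_even_by_hand <- (s[3] + s[4]) / 2
med_even <- median(speed_even)

# Robustness: replace the outlier 75 by 1000 and compare mean and median.
speed_big <- replace(speed, speed == 75, 1000)
print(c(mean = mean(speed_big), median = median(speed_big)))
```

The mean of speed_big is dragged far above every ordinary speed, while the median stays at the same central value as before.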
3.4 Mode
Sometimes the mean does not give us what we would expect from a central value; for example,
in the homework example, the mean (6.15) gives us a value that appears nowhere in the original
data; that’s normally not a big deal, but it suggests the mode as a possible “average value”.
The mode is the most frequent value, i.e. obtained from a frequency table or from a histogram,
Figure 3.1.
> table(hw)
hw
xi  4 5 6 7 8    # marks
fi  3 3 6 4 4    # frequencies
[Histogram of hw: Frequency (vertical axis) against hw (horizontal axis).]
Figure 3.1: Histogram of hw.
Multimodal Data Now that we’ve mentioned the mode, we’d better take the opportunity of
warning about multi-modal data.
File hw2.txt contains data which has two peaks in its histogram, Figure 3.2.
> df.homew2 <- read.table("hw2.txt", header = T)
> attach(df.homew2)
> sort(hw2)
 [1] 3 4 4 4 4 5 7 8 8 8 8 8 9
> hist(hw2)
> mean(hw2)
[1] 6.153846
[Histogram of hw2: Frequency (vertical axis) against hw2 (horizontal axis).]
Figure 3.2: Histogram of hw2, multimodal.
We can calculate the mean, but does it convey much about the centre of the data? No, and
using the mean as such may be quite misleading. For example, an average of 6.15 may indicate
that the homework was, on average, completed satisfactorily; however, in fact, we had two sets of
results, one good, one poor and the average of 6.15 adequately represents neither.
Multimodality is pretty obvious in that small and one-dimensional data set. In much larger data
sets and especially in multidimensional data, multimodality may be difficult to detect.
Much later, Chapter 19, we’ll look at methods for separating multimodal data into different classes
or clusters.
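R's built-in mode function reports the storage mode of an object, not the statistical mode, so the most frequent value has to be read from a frequency table. The sketch below does that; the helper name stat_mode is hypothetical, and the data are the hw and hw2 values printed above.

```r
hw  <- c(6, 8, 5, 7, 6, 5, 6, 4, 6, 5, 8, 4, 8, 8, 7, 6, 6, 7, 4, 7)
hw2 <- c(3, 4, 4, 4, 4, 5, 7, 8, 8, 8, 8, 8, 9)

# Hypothetical helper: return every value that attains the maximum frequency.
stat_mode <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[as.vector(tab) == max(tab)])
}

mode_hw  <- stat_mode(hw)
mode_hw2 <- stat_mode(hw2)
print(mode_hw)
print(mode_hw2)
print(table(hw2))   # the second peak is only visible in the full table
```

For hw2 the helper reports a single most frequent value, although the table shows a second peak almost as high; a single "mode" therefore hides multimodality just as the mean does.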
3.5 Other Means
Read up in (Crawley 2005) on: geometric mean and harmonic mean.
Chapter 4
Measures of Data Variability
4.1 Introduction
This chapter introduces methods of describing data variability, most notably variance and standard
deviation.
4.2 Variance and Standard Deviation
We are now going to work through an example based on two examination results, exam3 and
exam4, see below.
> df.exam3 = read.table("exam3.txt", header = T)
> attach(df.exam3)
> df.exam4 = read.table("exam4.txt", header = T)
> attach(df.exam4)
> exam3
 [1] 68 70 71 72 72 73 73 73 74 75 75 75 75 75 76 76 76 76 76 77 77 78 78 80 82
> exam4
 [1] 43 43 43 44 46 48 48 50 51 53 53 53 55 56 56 57 57 58 58 59 59 59 60 60 60
[26] 61 62 62 64 69 73
>
We are going to assume that these examinations are from two optional modules that final year
BSc Honours students can take, that is, students take one or other of these modules and not both.
Final Honours classifications depend on these results; but we can see already that the students
who took exam3 are at an advantage; except for one, they all achieved first class honours in that
examination. If we assume that the exam3 students are as capable as the exam4 students, then
can we correct the imbalance? Before you start to be incredulous, this technique was practised at
a well-known university where I worked.
First of all let us look at the histograms, Figure 4.1 and the box-plots, Figure 4.2.
> hist(exam3)
> hist(exam4)
[Two histograms, “Histogram of exam3” and “Histogram of exam4”: Frequency (vertical axis) against mark (horizontal axis).]
Figure 4.1: Histograms of exam3 and exam4.
> boxplot(exam3)
> boxplot(exam4)
[Two boxplots, one of exam3 and one of exam4.]
Figure 4.2: Boxplots of exam3 and exam4.
The means confirm the difference.
> mean(exam3)
[1] 74.92
> mean(exam4)
[1] 55.48387
>
> diff <- mean(exam3) - mean(exam4)
> diff
[1] 19.43613
>
4.2.1 Equalising the means
Can we shift one of the means so that the two data sets have the same mean?
> diff
[1] 19.43613
> exam4new <- round(exam4 + diff)
> exam4new
 [1] 62 62 62 63 65 67 67 69 70 72 72 72 74 75 75 76 76 77 77 78 78 78 79 79 79
[26] 80 81 81 83 88 92
> fpdfsmall()
> hist(exam4new)
[Two histograms, “Histogram of exam3” and “Histogram of exam4new”: Frequency (vertical axis) against mark (horizontal axis).]
Figure 4.3: Histograms of exam3 and exam4 shifted by 19.
4.2.2 Variability and spread
That is a bit better, but there remains a greater spread in exam4new (mean shifted). Can we
quantify spread? range gives us the minimum and maximum, but we would like one
number.
> range(exam3)
[1] 68 82
> range(exam4new)
[1] 62 92
>
From our experience with the mean, maybe we can take the mean (expected value) of deviations
from the means,
> mean(exam3 - mean(exam3))
[1] -1.705372e-15 # effectively zero
> mean(exam4new - mean(exam4new))
[1] -4.586385e-16
Not much good; from the definition of the mean we should have known in advance that these
means (or sums) of deviations would be zero, because the negative deviations cancel the positive.
Squaring the deviations before averaging removes the cancellation,
> mean((exam4new - mean(exam4new))^2)
[1] 53.6691
> mean((exam3 - mean(exam3))^2)
[1] 9.0336
We can achieve the same using sum and length,
> sum((exam3 - mean(exam3))^2)/length(exam3)
[1] 9.0336
4.2.3 Variance and Standard Deviation
What we have just computed is the variance, the expected value of the squared deviations from the
mean, see eqn. 4.1; the built-in function to use is var in R,
\[ Var(X) = E[(X - \mu)^2] = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2. \tag{4.1} \]
> var(exam3)
[1] 9.41
> var(exam4new)
[1] 55.45806
Immediately, we see that it is not an illusion that the variability of exam4new is much greater than
that of exam3. Note that the variance as calculated by var is slightly different from that calculated
using mean — we’ll return to that below.
The variance values, since they are sums of squares, give us a measure of squared variability; that
can be hard to interpret and use; what we want is the square-root of the variance, or the standard
deviation (sd in R), see eqn. 4.2,
\[ \sigma_X = SD(X) = \sqrt{Var[X]}. \tag{4.2} \]
> sqrt(var(exam4new))
[1] 7.447017
> sqrt(var(exam3))
[1] 3.067572
> sd(exam4new)
[1] 7.447017
> sd(exam3)
[1] 3.067572
>
Variance different from mean of squared deviations? We return to the problem of variance
being different from the mean of squared deviations. The clue is given below,
> sum((exam3 - mean(exam3))^2)/length(exam3)
[1] 9.0336
> sum((exam3 - mean(exam3))^2)/(length(exam3) -1)
[1] 9.41
In fact, rather than eqn. 4.1, this particular implementation of var computes what is called the
sample variance using eqn. 4.3,
\[ Var(X) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \mu)^2. \tag{4.3} \]
This gives an unbiased estimate of the variance.
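The two divisors can be compared side by side. The sketch below re-enters the exam3 marks printed earlier so that it runs on its own; the variable names are assumptions for illustration.

```r
exam3 <- c(68, 70, 71, 72, 72, 73, 73, 73, 74, 75, 75, 75, 75, 75,
           76, 76, 76, 76, 76, 77, 77, 78, 78, 80, 82)
n  <- length(exam3)
ss <- sum((exam3 - mean(exam3))^2)   # sum of squared deviations

var_n   <- ss / n         # eqn. 4.1, divisor n
var_nm1 <- ss / (n - 1)   # eqn. 4.3, divisor n - 1 (what var() uses)

print(c(var_n, var_nm1, var(exam3)))
```

The two estimates differ by the factor n/(n − 1), so the gap shrinks as the number of data grows.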
4.3 Standard Scores and Normalising Marks
We now return to our desire to manipulate (fairly) the two data sets, exam3, exam4, such that
students in each class have roughly the same opportunity; see section 4.2.1 where we equalised
the means, but where we noted that the difference in variability remained a problem.
4.3.1 Standard Scores
The normal way to equalise data sets like these (the proper term is either standardise or normalise)
is to use the standard score as in,
\[ X_{ss} = \frac{X - \mu}{\sigma}. \tag{4.4} \]
Eqn. 4.4 gives a set of scores with mean zero and standard deviation one, µss = 0, σss = 1. Thus,
if we apply eqn. 4.4 to the two sets of marks, using the mean and standard-deviations of each, we
get two sets of marks with the same mean (0) and the same spread (standard-deviation 1).
That is fine for purely comparison purposes, but what if we need marks to publish? What we are
going to do is: (i) use eqn. 4.4 to standardise the scores; then (ii) multiply by whatever (new)
standard-deviation, call it σ_new, that we require; finally, (iii) add the (new) mean, µ_new, that we
require. The whole operation is given in eqn. 4.5,
\[ X_{new} = \frac{X_{old} - \mu_{old}}{\sigma_{old}} \times \sigma_{new} + \mu_{new}. \tag{4.5} \]
We’ll now apply this to exam4, i.e. we want to make exam4 as close as possible to exam3 (in terms
of mean and standard deviation).
> sd3 <- sd(exam3)
> sd3
[1] 3.067572
> m3 <- mean(exam3)
> sd4 <- sd(exam4)
> sd4
[1] 7.447017
> m4 <- mean(exam4)
> m4
[1] 55.48387
> m3
[1] 74.92
> exam4new = round(((exam4 - m4)/sd4)*sd3 + m3)
> exam4new
 [1] 70 70 70 70 71 72 72 73 73 74 74 74 75 75 75 76 76 76 76 76 76 76 77 77 77
[26] 77 78 78 78 80 82
> mean(exam3)
[1] 74.92
> mean(exam4new)
[1] 74.96774 # difference due to rounding
> sd(exam3)
[1] 3.067572
> sd(exam4new)
[1] 2.99426 # difference due to rounding
>
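Eqn. 4.5 can also be wrapped in a small reusable function. The sketch below is one possible form, not the author's code: the name rescale_marks is hypothetical, and the target mean and standard deviation are the exam3 values reported above. Leaving out round() shows that the small discrepancies in the session are due to rounding alone.

```r
# Hypothetical helper implementing eqn. 4.5 without rounding.
rescale_marks <- function(x, new_mean, new_sd) {
  (x - mean(x)) / sd(x) * new_sd + new_mean
}

exam4 <- c(43, 43, 43, 44, 46, 48, 48, 50, 51, 53, 53, 53, 55, 56, 56, 57,
           57, 58, 58, 59, 59, 59, 60, 60, 60, 61, 62, 62, 64, 69, 73)

# Target mean and standard deviation: the exam3 values reported above.
exam4_scaled <- rescale_marks(exam4, new_mean = 74.92, new_sd = 3.067572)

print(c(mean(exam4_scaled), sd(exam4_scaled)))
print(round(exam4_scaled))
```

For the standard scores of eqn. 4.4 alone, base R's scale(x) performs the centring and division by the standard deviation.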
And let us compare the histograms in Figure 4.4.
[Two histograms, “Histogram of exam3” and “Histogram of exam4new”: Frequency (vertical axis) against mark (horizontal axis).]
Figure 4.4: Histograms of exam3 and exam4new (exam4 equalised with exam3).
Chapter 5
Probability and Random Variables
5.1 Introduction
This chapter gathers together some basic definitions, symbols and terminology to do with probability, random variables, and random processes; the topics are chosen according to their applicability
to basic statistics for bio-scientists, as well as pattern recognition, image processing and data compression. We will use some of the notation from Appendix A; you should have a quick look at that
first. We emphasise that such notation is merely shorthand for common sense concepts which
would otherwise be confusing and long-winded if written in English.
5.2 Basic Probability and Random Variables
5.2.1 Introduction
Let there be a set of outcomes to an experiment {ω_1, ω_2, . . . , ω_n} = Ω, where, to each ω_i, we
associate a probability p_i. The definition of probability includes the following constraints:
\[ 0 \le p_i \le 1, \tag{5.1} \]
\[ \sum_{i=1}^{n} p_i = 1. \tag{5.2} \]
The above simple definition of probability over outcomes is satisfactory for simple applications, but
for many applications we need to extend it to apply to subsets of Ω.
We could call the outcomes above elementary events, i.e. indivisible events, and we could call the
subsets below composite, i.e. they are a composition of one or more outcomes.
Ω is often called the sample space, i.e. as defined above, the set of all possible outcomes of the
experiment. Elements of Ω are called outcomes, sample outcomes, or realisations. One of the
problems of learning probability and statistics is the confusion caused by the multiplicity of terms
for the same concept. In addition, different fields of study, e.g. bio-science, engineering, social
science, . . . have their own terminology.
Example 1 Six sided dice. Ω = {i | i ∈ {1, . . . , 6}} = {1, 2, . . . , 6}.
Example 2 Toss two six sided dice. Ω = {(i, j) | i, j ∈ {1, . . . , 6}} = {(1, 1), (1, 2), . . . , (1, 6), (2, 1), . . . , (6, 6)}.
Example 3 Two sided coin. Ω = {H, T}. Outcomes need not be numbers.
5.2.2 Probability and Events
Let there be subsets of Ω called events, with a general event a_i; the set of all a_i s is A. We define
a probability measure P on A; P assigns a number to each event and satisfies the following axioms:
\[ P(a) \ge 0, \tag{5.3} \]
\[ P(\Omega) = 1 \quad \text{(certain event, something happens).} \tag{5.4} \]
If a_1, a_2, . . . are disjoint, i.e. a_i ∩ a_j = ∅, ∀ i, j, i ≠ j, then
\[ P\Big( \bigcup_{i=1}^{\infty} a_i \Big) = \sum_{i=1}^{\infty} P(a_i). \tag{5.5} \]
Disjoint (subsets) is another term for mutually exclusive, i.e. they cannot possibly happen together.
∩ denotes set intersection, i.e. in eqn. 5.5 we are requiring that there is no overlap between any of
the subsets, and ∪ denotes union. Put simply, eqn. 5.5 says that probabilities add for events that
do not overlap. ∅ denotes the empty set.
There is a fourth property, which is a corollary of eqns. 5.4 and 5.5 rather than an independent axiom,
\[ P(\emptyset) = 0 \quad \text{(impossible event).} \tag{5.6} \]
Example 4 Six sided dice. Ω = {1, 2, . . . , 6}. Let a be the event score greater than three; i.e.
a = {4, 5, 6}.
Example 5 Toss two six sided dice. Ω = {(i, j) | i, j ∈ {1, . . . , 6}}. Let a be the event total score less
than four. Then a = {(1, 1), (1, 2), (2, 1)}.
Partition When a_1 ∪ a_2 ∪ . . . ∪ a_n = Ω and a_1, a_2, . . . , a_n are disjoint, we say that {a_1, a_2, . . . , a_n}
form a partition of Ω.
5.2.3 A Point on Terminology
Above we have P (ai ) for probability that the outcome is in set ai . “The outcome is in set ai ” is
what is called a proposition. A proposition is a sentence which may be true or false — but only
one or the other and not in between.
We should note that in most textbooks and later in these notes the arguments of probability
functions, P (.) will be propositions, e.g. P (A) means the probability that A will occur, or that A
will be true.
Then, when we write P (AB) or P (A, B) (they mean the same), we mean probability of A and B
being both true; logical and.
Not or set complement We may want to talk about the probability that A will be false, i.e. the
probability that the outcome will be in the complement set to A, i.e. any of the outcomes in Ω
that are not in A’s set. Not A is denoted Ā.
We now can write a further rule,
\[ P(\bar{A}) = 1 - P(A). \tag{5.7} \]
Example 6 Six sided dice. Ω = {1, 2, . . . , 6}. Let A = {1, 2, 3, 4}, so Ā = {5, 6}.
P(Ā) = 1 − P(A) = 1 − 4/6 = 2/6 = 1/3.
5.2.4 Probability of Non-disjoint Events
We saw in eqn. 5.5 that to compute the probability of the union of two disjoint events you can add
probabilities. For events A and B that are not necessarily disjoint (there may be overlap), we can write
\[ P(A \cup B) = P(A) + P(B) - P(AB). \tag{5.8} \]
Example 7 Six sided dice. Ω = {1, 2, . . . , 6}. Let A = {1, 2, 3, 4} and B = {4, 5}; so A ∪ B =
{1, 2, 3, 4, 5} and A ∩ B = {4}.
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 4/6 + 2/6 − 1/6 = 5/6, and we can see that, computed directly,
P(A ∪ B) = P({1, 2, 3, 4, 5}) = 5/6.
We note that eqn. 5.8 collapses to eqn. 5.5 when AB is false (no overlap, the two cannot be true
together), because of eqn. 5.6, i.e. P(∅) = 0, and
\[ P(A \cup B) = P(A) + P(B) - P(\emptyset) = P(A) + P(B) - 0 = P(A) + P(B). \]
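Examples 6 and 7 can be checked with R's set functions. This is a sketch under the assumption of equally likely dice outcomes; the helper prob is hypothetical and simply counts outcomes.

```r
omega <- 1:6   # sample space of one six sided dice

# Hypothetical helper: probability of an event when outcomes are equally likely.
prob <- function(event) length(event) / length(omega)

A <- c(1, 2, 3, 4)
B <- c(4, 5)

p_union_direct <- prob(union(A, B))                            # count A union B directly
p_union_rule   <- prob(A) + prob(B) - prob(intersect(A, B))    # eqn. 5.8
p_not_A        <- prob(setdiff(omega, A))                      # complement of A, eqn. 5.7

print(c(p_union_direct, p_union_rule, p_not_A))
```

The first two numbers agree because subtracting P(AB) removes the outcome 4, which would otherwise be counted twice.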
5.2.5 Finite Sample Spaces
In Example 1 we could identify and list all possible outcomes and we have a finite sample space.
On the other hand, if the outcome was a weight, for example of a precipitate, then we could not
list all possible weights and we would have an infinite sample space.
5.3 Random Variables
If, to every outcome, ω, of an experiment, we assign a number, X(ω), X is called a random variable
(r.v.). X is a function over the set Ω = {ω_1, ω_2, . . .} of outcomes; if the range of X is the real
numbers or some subset of them, X is a continuous r.v.; if the range of X is some integer set,
then X is a discrete r.v. Chapter 6 contains an extensive discussion on random variables and an
introduction to probability distributions.
5.4 Computing probabilities
We have already done this in examples, but we need to formalise a bit. The number of elements
in a (finite) set, say a, is called its cardinality and written |a|.
Example 8 Six sided dice. Ω = {1, 2, . . . , 6}, |Ω| = 6.
Let a = {4, 5, 6}, |a| = 3.
If the outcomes are equally likely (which {1, 2, . . . , 6} are), then we can compute the probability of
an event a as the ratio:
\[ P(a) = \frac{|a|}{|\Omega|}. \tag{5.9} \]
Example 9 Six sided dice. Ω = {1, 2, . . . , 6}, |Ω| = 6.
Let a = {4, 5, 6}, |a| = 3, so
\[ P(a) = \frac{|a|}{|\Omega|} = \frac{3}{6} = \frac{1}{2}. \]
5.5 Enumerating more complex events and sample spaces
We see above P(a) = |a|/|Ω|. But |a| or |Ω| may not be simple to enumerate or count.
5.5.1 Multiplication of outcomes
Let an event correspond to the combined outcomes of two experiments performed in sequence.
Let the first have n1 outcomes and the second n2 outcomes.
Any of the n1 outcomes of the first may be followed by any of the n2 outcomes of the second, so
the number of outcomes in the combined experiment is n1 × n2 .
Example 10 Toss two six sided dice in sequence (but the result is the same if we throw them
together). n1 = |Ω1 | = 6, n2 = |Ω2 | = 6, so, for the combined experiment, |Ω| = n1 × n2 = 36,
which we can also compute by counting the elements in Ω = {(i, j) | i, j ∈ {1, . . . 6}}.
5.5.2 Addition of outcomes
Suppose again that we have two experiments. Let the first have n1 outcomes and the second n2
outcomes. This time we perform the first experiment or the second, but not both and which of
them gets performed is chosen randomly; how many outcomes?
We have n1 outcomes of the first, or the n2 outcomes of the second, so the total number of
outcomes in the combined experiment is n1 + n2 .
Example 11 Toss one six sided dice or toss a two sided coin. n1 = |Ω1 | = 6, n2 = |Ω2 | = 2,
so, for the combined experiment, |Ω| = n1 + n2 = 8, which we can also compute by counting the
elements in Ω = {1, 2, 3, 4, 5, 6, H, T }.
5.5.3 Permutations
Suppose we have n items and we wish to place them in a sequence; just any sequence, not
ordered according to size or any other attribute. How many ways to do this?
The first position may be filled by any of the n items; the second position may be filled by any of the
remaining n − 1 items, and so on, so that the number of possible different sequences (orderings) is
\[ n(n-1)(n-2) \cdots 1 = n! \quad \text{(n-factorial).} \tag{5.10} \]
Suppose now we have n items and we wish to choose any r of them and place these in a sequence. How
many ways to do this? The first position may be filled by any of the n items; the second position
may be filled by any of the remaining n − 1 items, and so on until we have r in the sequence. The
number of possible different sequences (orderings) is
\[ n(n-1)(n-2) \cdots (n-(r-1)) = n(n-1)(n-2) \cdots (n-r+1) = \frac{n!}{(n-r)!} = {}^{n}P_{r}. \tag{5.11} \]
^nP_r is the name for the number of permutations of r from n.
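The counts in eqns. 5.10 and 5.11 are easy to evaluate in R. In this sketch n_perm is a hypothetical helper and n = 5, r = 2 are arbitrary illustrative values; factorial and choose are base R functions, the latter counting selections where order is ignored (the combinations of section 5.5.4).

```r
# Hypothetical helper for eqn. 5.11: number of ordered selections of r from n.
n_perm <- function(n, r) factorial(n) / factorial(n - r)

n <- 5
r <- 2

orderings_all <- factorial(n)    # eqn. 5.10: orderings of all n items
orderings_r   <- n_perm(n, r)    # eqn. 5.11: n * (n - 1) for r = 2
unordered_r   <- choose(n, r)    # base R: selections ignoring order

print(c(orderings_all, orderings_r, unordered_r))
```

Each unordered selection of r items can be arranged in r! ways, so orderings_r equals unordered_r multiplied by factorial(r).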
5.5.4 Combinations
Suppose again we have n items and we wish to choose any r of them, but we do not need to place
the r in a sequence. How many ways, ^nC_r, to do this? We can appeal to eqns. 5.11 and 5.10,
\[ \frac{n!}{(n-r)!} = {}^{n}C_{r} \times (\text{number of ways of permuting } r) = r!\, {}^{n}C_{r}, \]
which leads to
\[ {}^{n}C_{r} = \frac{n!}{r!(n-r)!} = \binom{n}{r}. \tag{5.12} \]
5.6 Conditional Probability
Example 12 Ω = {1, 2, 3, 4, 5, 6}. I throw the dice. What is the probability of getting greater-than-three, P(> 3)? Let A be greater-than-three so that A = {4, 5, 6}, and the cardinality of
this set is n_A = |A| = 3, and n_dice = |Ω| = 6, see section 5.4; there are three possibilities
greater-than-3, so P(A) = P(> 3) = n_A/n_dice = 3/6 = 1/2.
Now, I have a peek and I tell you that we have an odd number; let us call this event B (odd). What
now is the probability of A (> 3)? The probability surely has changed because the only possibilities
now are B = odd = {1, 3, 5}. Within this set, 5 is the only (one) possibility that satisfies greater-than-three, so, forgetting about any ideas we had before, we say that the conditional probability
of greater-than-three, given that we already know that an odd number has occurred, is 1/3, i.e. the
probability has fallen from 1/2 to 1/3 based on the information that an odd has occurred.
We write this P(> 3|odd), the conditional probability of > 3 conditional on the fact that we
already know that an odd number has occurred.
This is conditional probability; we computed the probability of A conditional on B, P(A|B).
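The counting argument of Example 12 can be sketched in R by restricting the sample space to the conditioning event. Equally likely outcomes are assumed, and the variable names are illustrative.

```r
omega <- 1:6
A <- omega[omega > 3]          # greater-than-three: 4, 5, 6
B <- omega[omega %% 2 == 1]    # odd: 1, 3, 5

# Unconditional probability: count A within the whole sample space.
p_A <- length(A) / length(omega)

# Conditional probability: count A within the reduced sample space B.
p_A_given_B <- length(intersect(A, B)) / length(B)

print(c(p_A, p_A_given_B))
```

The only change between the two lines is the denominator set: Ω for P(A), and B for P(A|B).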
5.6.1 Venn diagrams
Venn diagrams, see section A.1.4, can be used to think about conditional probabilities such as
the one in Example 12. Here Ω = {1, 2, 3, 4, 5, 6} corresponds to the universal set (the set of all
possibilities).
Once we have been told that the number is odd, we can reduce our sample space to set odd; then
odd ∩ (> 3) = {5}.
Example 13 If after hearing first that we have an odd number, then secondly we are told that
greater-than-three has occurred, we are then asked (a) what is the probability of a six?, (b) what
is the probability of a five?
Think about it, once we have the two pieces of information: odd, then greater-than-three, the
possibilities are very greatly reduced. To what?
[Venn diagrams, panels (a), (b), (c): outcomes 1 to 6 with regions labelled odd, even, <= 3, > 3, odd & <= 3, odd & > 3, even & <= 3 and even & > 3.]
Figure 5.1: Dice: (a) universal set; (b) sets odd, even; (c) sets (> 3) and (<= 3) superimposed
to show that, for example, odd & (> 3) = (set-odd) ∩ (set > 3) = {1, 3, 5} ∩ {4, 5, 6} = {5}.
5.6.2 Probability Trees
Probability trees, see (Griffiths 2009, p. 158), are another way to think graphically about conditional probability. In mathematics, trees can grow sideways or even upside down.
Figure 5.2 shows a probability tree for Example 12.
When we split into branches as in Figure 5.2, any branching must represent all possibilities; in this
case we first have odd and even; if we call odd B, we have even = not-odd = B̄. In the diagram
we have no bar symbol, so we use B′ = B̄. Next we have (> 3) and (<= 3).
Thus, at any branching the probabilities in the branches must sum to one.
The diagram shows how to compute joint probabilities using conditional probabilities and the
probability of the conditioning event, for example P(> 3 & odd) = P(> 3 | odd) × P(odd).
Figure 5.3 shows a general probability tree.
The following may help us to think about conditional probability and joint probability. Think of the
tree as having probability flowing in its branches.
We start off at the root with all the probability (one, 1); proportions of the probability flow into
the first set of branches (the proportions sum to one); follow one of those branches and, at the next
branching point, we split the remaining probability into proportions that again sum to one (it is
just the proportions that sum to one; if there is, for example, 0.4 flowing into the branching point,
and the proportions are 0.4, 0.4, 0.2 in a three-way branch, then we will have probability flows of
0.16, 0.16, 0.08). And so on.
[Probability tree, root on the left:
  P(odd) = 1/2, branch B (odd has occurred)
    P(>3|odd) = 1/3: >3 and odd has occurred, P(>3 & odd) = P(>3|odd) x P(odd)
    P(<=3|odd) = 2/3: <=3 and odd has occurred, P(<=3 & odd) = P(<=3|odd) x P(odd)
  P(even) = 1/2, branch B’ (even has occurred)
    P(>3|even) = 2/3: P(>3 & even) = P(>3|even) x P(even) = 1/2 x 2/3 = 2/6 = 1/3 [P(4 or 6)]
    P(<=3|even) = 1/3: P(<=3 & even) = P(<=3|even) x P(even) = 1/2 x 1/3 = 1/6 [P(2)]
]
Figure 5.2: Probability tree for the dice example. We start off on the left with the root and
everything possible. Then we split into branches odd and even. Next we split odd into (> 3) and
(<= 3); same for the even branch.
[Probability tree, root on the left:
  P(B), branch B (we know B has occurred)
    P(A | B): A has occurred, i.e. A & B have occurred; P(A & B) = P(AB) = P(A | B) x P(B)
    P(A’ | B): not A has occurred, i.e. not A & B = A’ & B; P(A’B) = P(A’ | B) x P(B)
  P(B’), branch B’ (B has not occurred, i.e. not B has occurred)
    P(A | B’): P(AB’) = P(A | B’) x P(B’)
    P(A’ | B’): P(A’B’) = P(A’ | B’) x P(B’)
]
Figure 5.3: Probability tree.
Symbolically, and referring to Figure 5.3: if we have proportion P(B) in a branch and
that splits into proportions P(A|B) and P(Ā|B) (these (relative) proportions again sum to one,
but their total probability sums to whatever flowed into the branching point), then the P(A|B)
branch must carry an absolute amount of probability equal to P(A|B) × P(B), and this is P(AB).
Formula for Conditional Probability We now give the formula for computing conditional probabilities,
\[ P(A|B) = \frac{P(AB)}{P(B)}, \tag{5.13} \]
provided that P(B) > 0.
Alternatively, as in Figure 5.3,
\[ P(AB) = P(A|B) P(B). \tag{5.14} \]
5.6.3 Joint Probability
P(AB) is the joint probability of A and B happening together.
Sometimes we write P(AB), sometimes P(A&B), sometimes P(A and B), and sometimes, using
set notation, P(A ∩ B).
5.7 Bayes’ Rule
If we reverse the conditionality in eqn. 5.13, noting that P(AB) = P(BA), we have
\[ P(B|A) = \frac{P(AB)}{P(A)}, \tag{5.15} \]
leading to
\[ P(A) P(B|A) = P(AB), \tag{5.16} \]
and eqn. 5.13 gives us
\[ P(B) P(A|B) = P(AB), \tag{5.17} \]
so that
\[ P(A) P(B|A) = P(B) P(A|B), \tag{5.18} \]
leading to Bayes’ rule:
\[ P(A|B) = P(A) P(B|A) / P(B). \tag{5.19} \]
Eqn. 5.19 allows us to invert or reverse the conditionality.
Example 14 Let A be has disease-X; let B be has swollen ankles. From a sample of former
disease-X patients, we can estimate P(B|A); say it is P(B|A) = 0.3. Let us assume that we also
know the proportion of the general population that have swollen ankles, P(B) = 0.01. Also we
assume that we have the incidence of disease-X in the general population, P(A) = 0.005.
Eqn. 5.19 allows us to compute the probability that the patient has disease-X given that the swollen
ankles symptom (B) is present, P(A|B). Of course, in general, P(A|B) ≠ P(B|A).
\[ P(A|B) = P(A) P(B|A) / P(B) = 0.005 \times 0.3 / 0.01 = 0.15. \tag{5.20} \]
Bayes’ rule may be written in a more general manner. First we need a result called the law of total
probability.
Let A_1, A_2, . . . , A_n be a partition of Ω (see section 5.2.2 for a definition of partition), then
\[ P(B) = \sum_{i=1}^{n} P(B|A_i) P(A_i). \tag{5.21} \]
We write the more general form of Bayes’ rule as
\[ P(A_i|B) = P(B|A_i) P(A_i) \Big/ \sum_{j=1}^{n} P(B|A_j) P(A_j). \tag{5.22} \]
Let us return to Example 14 and apply eqn. 5.22. When we said proportion of the general population
that have swollen ankles, P(B) = 0.01, we strictly meant probability of people with disease-X together with those without disease-X = 0.01. We can restate the problem with A_1 = has disease-X
and A_2 = has not disease-X, so that they form a partition of the general population.
Assume that we now have P(B|A_2) = 0.01 (i.e. we are changing the story slightly to associate
this probability with people who do not have disease-X) and, as before, P(B|A_1) = 0.3; we need
also P(A_1) = 0.005, as before. What is P(A_2)? It is P(Ā_1) (probability that a general person does
not have disease-X) and this is 1 − P(A_1) = 0.995.
Eqn. 5.21 now gives a revised figure for P(B),
\[ P(B) = \sum_{i=1}^{n} P(B|A_i) P(A_i) = P(B|A_1) P(A_1) + P(B|A_2) P(A_2) = 0.3 \times 0.005 + 0.01 \times 0.995 = 0.01145, \]
and we can rework Example 14 (or use eqn. 5.22),
\[ P(A_1|B) = P(A_1) P(B|A_1) / P(B) = 0.005 \times 0.3 / 0.01145 = 0.131. \]
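The reworked example maps directly onto vector arithmetic in R. This is a sketch using the probabilities assumed in the text; the vector names prior and likelihood are illustrative labels, not terminology the notes have introduced.

```r
prior      <- c(disease = 0.005, no_disease = 0.995)   # P(A_1), P(A_2)
likelihood <- c(disease = 0.3,   no_disease = 0.01)    # P(B | A_1), P(B | A_2)

p_B       <- sum(likelihood * prior)     # eqn. 5.21, law of total probability
posterior <- likelihood * prior / p_B    # eqn. 5.22, evaluated for every A_i

print(p_B)
print(round(posterior, 3))
```

Because A_1 and A_2 form a partition, the two entries of posterior sum to one, which is a useful check on the arithmetic.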
5.8 Independent Events
We have already discussed disjoint events, i.e. events which cannot occur simultaneously; thus,
for disjoint events A and B, A ∩ B = ∅. Consequently, we can state that P(A|B) = 0 (if B has occurred,
A cannot).
At the opposite extreme, let A ⊂ B, i.e. A is a subset of B and if A has occurred, then so must
B, with certainty, so in this case P (B|A) = 1.
Example 15 Ω = {1, 2, 3, 4, 5, 6}. Let B = {2, 4, 6} (even number) and A = {6}. If we know
that a 6 has been thrown (A has occurred), what is P (B|A)? The answer is 1 — we know that 6
is even so B is a sure thing — in punter parlance :-).
But there are cases where A and B are totally unrelated — they are independent events.
Example 16 Throw a dice (1) and toss a coin (2). Ω_1 = {1, 2, 3, 4, 5, 6}, Ω_2 = {H, T} and
the combined sample space Ω = {(1, H), (1, T), (2, H), . . . , (6, H), (6, T)} and |Ω| = 12. Let
A = {4, 6} and B = {H}, so that AB = {(4, H), (6, H)} (two out of 12 equally likely events), so
P(AB) = 2/12 = 1/6. Also P(A) = 1/3, P(B) = 1/2.
From eqn. 5.13 we have
\[ P(B|A) = \frac{P(AB)}{P(A)} = \frac{1}{6} \Big/ \frac{1}{3} = \frac{1}{2}. \]
Because the result of the dice throw is unrelated to the result of the coin toss we are not surprised
to find that
\[ P(B|A) = P(B) = \frac{1}{2}. \]
This leads us to a more general definition of independent events,
\[ P(B|A) = P(B) = \frac{P(AB)}{P(A)}, \]
so that A and B are independent events if and only if
\[ P(AB) = P(A) P(B). \tag{5.23} \]
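Example 16 can be verified by listing the combined sample space. The sketch below uses expand.grid from base R to build the 12 outcomes and assumes they are equally likely; the column and variable names are illustrative.

```r
# Combined sample space of Example 16: 6 dice faces x 2 coin faces = 12 outcomes.
omega <- expand.grid(dice = 1:6, coin = c("H", "T"), stringsAsFactors = FALSE)

in_A <- omega$dice %in% c(4, 6)   # event A: dice shows 4 or 6
in_B <- omega$coin == "H"         # event B: coin shows heads

# The mean of a logical vector is the proportion of TRUE values,
# i.e. the proportion of equally likely outcomes in the event.
p_A  <- mean(in_A)
p_B  <- mean(in_B)
p_AB <- mean(in_A & in_B)

print(c(p_A, p_B, p_AB, p_A * p_B))
```

The last two printed values coincide, which is exactly the condition of eqn. 5.23.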
5.9 Betting and Odds
In circumstances where the terms have meaning, probability of A can be computed as the ratio
of the number of equal probability events favourable to A, n_A, versus the total number of equal
probability events, n_T,
\[ P(A) = n_A / n_T. \tag{5.24} \]
Odds, on the other hand, are computed as the ratio of the number of equal probability events
favourable to A, n_A, versus the number of equal probability events unfavourable to A, n_Ā,
\[ O(A) = n_A / n_{\bar{A}}. \tag{5.25} \]
Thus, the probability of a 1 on the throw of a dice is 1/6, whilst the odds are 1/5; bookmakers express
this as five-to-one against.
The probability for any number less than five (1–4) would be 4/6, whilst the odds are 4/2 = 2/1;
bookmakers express this as two-to-one on.
You can calculate probability from odds using
\[ P(A) = \frac{O(A)}{1 + O(A)}. \tag{5.26} \]
Thus, for any number less than five (1–4) on a dice throw,
\[ P(A) = \frac{O(A)}{1 + O(A)} = \frac{2/1}{1 + 2/1} = \frac{2}{3}. \]
You can calculate odds from probability using
\[ O(A) = \frac{P(A)}{1 - P(A)}, \tag{5.27} \]
that is, the ratio of probability-for (favourable) to probability-against (unfavourable).
Thus, for one on a dice throw,
\[ O(A) = \frac{1/6}{1 - 1/6} = \frac{1}{5}. \]
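The two conversions in eqns. 5.26 and 5.27 can be checked with a small R sketch. The function names odds_from_prob and prob_from_odds are illustrative assumptions, not part of the notes or of R.

```r
# Hypothetical helper names, introduced only for this illustration.
odds_from_prob <- function(p) p / (1 - p)   # eqn. 5.27
prob_from_odds <- function(o) o / (1 + o)   # eqn. 5.26

p_one <- 1 / 6                   # probability of a 1 on a fair dice
o_one <- odds_from_prob(p_one)   # 1/5, i.e. five-to-one against
o_low <- 4 / 2                   # odds for any number 1-4, two-to-one on
p_low <- prob_from_odds(o_low)   # 2/3
print(c(odds_of_one = o_one, prob_of_1_to_4 = p_low))
```

Applying one function after the other returns the starting value, which is a quick way to confirm that the two formulas are inverses.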
Bookmakers' odds and probabilities Bookmakers' “probabilities” do not add to 1, unlike proper probabilities, which add to one over all possible events, see eqn. 5.2.
Let’s say we have four horses, each with an equal probability of winning, P(A_i) = 1/4, for i = 1, 2, 3, 4. We would expect odds of
\[ O(A_i) = \frac{1/4}{1 - 1/4} = \frac{1}{3}, \]
or three-to-one against. But the bookmaker has to make a living, and not just provide a mutual
service for his punters. In this case, if four punters each bet 10 Euro, one on each horse (bookie gets 40
Euro), one punter gets paid 30 Euro plus his stake returned = 40 Euro, and the bookie makes
nothing for his work.
The bookie is likely to give odds of something like two-to-one against, O'(A) = 1/2, and, computing probabilities, we find
\[ P'(A) = \frac{O'(A)}{1 + O'(A)} = \frac{1/2}{1 + 1/2} = \frac{1}{3}, \]
and the sum of “probabilities” is 4/3.
In this amended case, if four punters each bet 10 Euro, one on each horse (bookie gets 40 Euro), one punter
gets paid 20 Euro plus his stake returned = 30 Euro, and the bookie makes 10 Euro.
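The four-horse bookmaker example can be reproduced as a short R sketch. All variable names here are illustrative assumptions; the figures are those used in the text.

```r
odds_offered <- rep(1 / 2, 4)                  # two-to-one against on each of four horses
implied <- odds_offered / (1 + odds_offered)   # eqn. 5.26 applied to each horse
total <- sum(implied)                          # 4/3, more than 1
stake <- 10
takings <- 4 * stake                           # one 10 Euro bet on each horse
payout <- 2 * stake + stake                    # winnings at two-to-one plus returned stake
profit <- takings - payout                     # what the bookie keeps
print(c(sum_of_probabilities = total, bookie_profit = profit))
```

The excess of the summed “probabilities” over 1 is what pays the bookie.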
5.10
Classical versus Bayesian Interpretations of Probability
In many books and discussions you will see a distinction made between the classical and the
Bayesian interpretation of probability; also, in this context the term frequentist may be used as a
synonym for classical. As an interpretation of probability, the term Bayesian has little to do with
Bayes’ rule, section 5.7, that is until we get to statistical inference, Chapter 10.
Broadly speaking, Bayesians interpret probability as belief ; frequentists interpret probability as
relative frequency.
Bayesian (belief) interpretation Take the case of the tossed (fair) dice. If you were asked to rate, on a scale of [0, 1], your belief that 2 will be the outcome, you would, I hope, agree that the probability is 1/6; for an even number of dots: 3/6 = 1/2; and for any number 1–6, a sure thing, the probability is 1.
Here 0 corresponds to complete disbelief and 1 to complete belief.
Relative frequency interpretation The frequentist says that the probability of 2 is the relative
frequency with which 2 occurs in a large number of hypothetical throws.
Let us then run an experiment involving a large number (n = 600) of throws, and let y_i be the count of each outcome i obtained. We might expect to obtain something like y_1 = 95, y_2 = 110, y_3 = 90, y_4 = 97, y_5 = 105, y_6 = 103. We then use p̂(i) = y_i/n; the hat, ˆ, indicates that p̂(i) is an approximation to p(i); however, p̂(i) → p(i) as n → ∞.
We have p̂(i) = y_i/n = {95/600, 110/600, 90/600, 97/600, 105/600, 103/600} = {0.158, 0.183, 0.15, 0.162, 0.175, 0.172}. The correct value is p(i) = 1/6 = 0.1667.
The errors above are not a real indictment of the frequentist method; a thought experiment allows
us to reason that p(i) = 1/6.
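The relative-frequency experiment can be simulated rather than imagined. The following R sketch assumes a fair dice and an arbitrary random seed; the counts it produces will differ from the illustrative counts quoted in the text.

```r
set.seed(1)                                # arbitrary seed, only for repeatability
n <- 600
throws <- sample(1:6, n, replace = TRUE)   # simulate n throws of a fair dice
y <- tabulate(throws, nbins = 6)           # counts y_1, ..., y_6
p_hat <- y / n                             # relative frequencies, estimates of p(i)
print(round(p_hat, 3))
```

Increasing n should bring every entry of p_hat closer to 1/6.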
On the other hand, when you want to bet on a football match and would like to estimate the
probability and hence the odds, it makes no sense to think of an infinity of matches.
Chapter 6
One Dimensional Random Variables
6.1
Introduction
We have already introduced the notion of a random variable in section 5.3, i.e. where we associate
a number with the outcome of an experiment governed by probability.
In most cases, your (scientific) data will already be numerical, but it nonetheless remains worthwhile
to be cognisant of the details of probability and sample space described in Chapter 5.
In some of the examples in Chapter 5, namely those involving the dice, the outcome already is a
number, i.e. {1, . . . , 6}; in some considerations, this number is more a label than a number, but
in any case, the association of a number with the outcome is made trivial. In the coin example we
had {H, T }; in this case we could use the association {H → 1, T → 0}.
6.1.1
Definition: Random Variable
If, to every outcome, ω, of an experiment, we assign a number, X(ω), X is called a random variable
(r.v.). X is a function over the set Ω = {ω1 , ω2 , . . .} of outcomes; if the range of X is the real
numbers or some interval of them, X is a continuous r.v.; if the range of X is some integer set,
then X is a discrete r.v. The space of all possible values of X is called the range space of X, RX .
In discussing random variables we label the r.v. with an upper case letter, e.g. X, but particular
values of it are labelled with lower case, e.g. x, or xi .
Example 17 Toss two coins. Ω = {T T, T H, HT, HH}. Let a r.v. X be defined as the number of
heads in the outcome, i.e. {T T → 0, T H → 1, HT → 1, HH → 2}. Notice that two outcomes
map to the same number (1); this is not a problem or a mistake. RX = {0, 1, 2}.
6.1.2
Probability associated with a Random Variable
Suppose we have an event B with respect to a range space R_X, and let the event A with respect to Ω be defined as
\[ A = \{\omega \in \Omega \mid X(\omega) \in B\}. \tag{6.1} \]
Then A and B are equivalent events and we can carry the definitions and equations of Chapter 5
over to random variables.
Example 18 Two coins as in Example 17. Examples of equivalent events are: A = {T T }, B = {0};
A = {T H, HT }, B = {1}; A = {HH}, B = {2}.
In the case of eqn. 6.1, we can say
P (B) = P (A).
(6.2)
Example 19 Two coins as in Example 18. A = {TT}, P(A) = 1/4, B = {0}, P(X = 0) = 1/4; A = {TH, HT}, P(A) = 1/2, B = {1}, P(X = 1) = 1/2; A = {HH}, P(A) = 1/4, B = {2}, P(X = 2) = 1/4.
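The mapping from outcomes to values of the random variable, and the probabilities carried over by eqn. 6.2, can be sketched in R as follows. The object names are assumptions made for this illustration.

```r
# The four equally likely outcomes of tossing two coins.
omega <- expand.grid(c1 = c("T", "H"), c2 = c("T", "H"),
                     stringsAsFactors = FALSE)
X <- (omega$c1 == "H") + (omega$c2 == "H")   # r.v.: number of heads per outcome
p_outcome <- rep(1 / 4, nrow(omega))         # probability of each outcome
pmf <- tapply(p_outcome, X, sum)             # P(X = 0), P(X = 1), P(X = 2)
print(pmf)
```

The two outcomes TH and HT both map to X = 1, so their probabilities are added.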
6.2
Probability Mass Function (pmf) of a Discrete r.v.
Let a r.v. X have a range space RX = {x1 , x2 , . . . , xn }. We denote the probability of a particular
value X = xi as pX (xi ) = P (X = xi ). The probabilities pX (xi ), i = 1, 2, . . . , n, in keeping with
eqns. 5.3 and 5.4, must satisfy
\[ p_X(x_i) \ge 0, \quad i = 1, 2, \ldots, n, \tag{6.3} \]
\[ \sum_{i=1}^{n} p_X(x_i) = 1. \tag{6.4} \]
pX is called the probability function or the probability mass function of the r.v. X. We’ll attempt to
standardise on probability mass function and its abbreviation pmf. We use the shorthand X ∼ pX
to state that the r.v. X has a pmf pX . Often, where there is no ambiguity, you will find the
subscript X omitted — pX (x) → p(x).
6.3
Some Discrete Random Variables
This section identifies and describes the pmfs of some commonly occurring discrete random variables.
6.3.1
Point Mass Distribution
If X can take on only one value, a, it has a point mass distribution at a; X ∼ δa .
\[ p_X(x) = 1 \text{ for } x = a, \text{ and } 0 \text{ elsewhere}. \tag{6.5} \]
6.3.2
Discrete Uniform Distribution
X has a discrete uniform distribution on {1, . . . , k}, U(1, k), if
\[ p_X(x) = \frac{1}{k} \text{ for } x = 1, \ldots, k; \text{ and } 0 \text{ elsewhere}. \tag{6.6} \]
Example 20 . Lottery machine, k balls. First draw, X ∼ U(1, k).
6.3.3
Bernoulli Distribution
Let X be the result of a (binary outcome) experiment with probability p of one outcome, X = 1,
say, and 1 − p for the other, X = 0; for example a coin flip. There’s overuse of the symbol p
here, but we need to keep to standard notation; context should resolve any ambiguities between
the parameter p = P(X = 1) and the pmf p_X(x).
\[ p_X(x) = p^x (1 - p)^{1-x}, \text{ for } x \in \{0, 1\}. \tag{6.7} \]
6.3.4
Binomial Distribution
Repeat the experiment above (Bernoulli distribution — coin flip) n times and let X be the number
of 1s (e.g. heads) obtained.
\[ p_X(x) = \binom{n}{x} p^x (1 - p)^{n-x}, \text{ for } x \in \{0, 1, \ldots, n\}; \; 0 \text{ otherwise}. \tag{6.8} \]
Where does the $\binom{n}{x}$ come from? We have already introduced it in eqn. 5.12; it is the number of ways of selecting x items from n. For one particular sequence of flips, the probability of its x 1s is $p^x$ and the probability of its n − x 0s is $(1-p)^{n-x}$; the flips are independent so we can multiply the probabilities to get $p^x (1-p)^{n-x}$. However, there are $\binom{n}{x} = \frac{n!}{x!(n-x)!}$ possible ways of getting the X = x 1s.
Take n = 3; the sample space is Ω = {TTT, TTH, THT, THH, HTT, HTH, HHT, HHH} and the event corresponding to x = 2 (two heads, any two heads) is A = {THH, HTH, HHT}, i.e. there are three outcomes that give two heads.
\[ \binom{n}{x} = \binom{3}{2} = \frac{3!}{2!\,1!} = \frac{6}{2} = 3. \]
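As a check on eqn. 6.8, the following R sketch evaluates the formula by hand for n = 3 fair coin flips and compares it with R's built-in binomial pmf, dbinom. The variable names are illustrative.

```r
n <- 3
p <- 0.5
x <- 0:n
by_hand <- choose(n, x) * p^x * (1 - p)^(n - x)   # eqn. 6.8 term by term
built_in <- dbinom(x, size = n, prob = p)         # R's binomial pmf
print(rbind(x, by_hand, built_in))
```

choose(3, 2) returns 3, the number of outcomes with exactly two heads counted in the text.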
6.3.5
Geometric Distribution
X has a geometric distribution with parameter p, X ∼ Geom(p), p ∈ (0, 1), if
\[ P(X = k) = p(1 - p)^{k-1}, \quad k = 1, 2, \ldots \tag{6.9} \]
Example 21. Distribution of the number of coin flips until the first head.
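A point worth knowing when checking eqn. 6.9 in R: the built-in dgeom counts the number of failures before the first success, not the number of trials, so its argument is k − 1. A minimal sketch, with illustrative variable names:

```r
p <- 0.5                             # probability of a head on each flip
k <- 1:10                            # flip on which the first head appears
by_hand <- p * (1 - p)^(k - 1)       # eqn. 6.9
built_in <- dgeom(k - 1, prob = p)   # dgeom counts failures, hence k - 1
print(round(rbind(k, by_hand, built_in), 4))
```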
6.3.6
Poisson Distribution
X has a Poisson distribution with parameter λ, X ∼ Poisson(λ), if
\[ p_X(x) = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots \tag{6.10} \]
Example 22 . Distribution of rare events like traffic accidents; there can be long periods of
inactivity, but clumping of events is possible, e.g. waiting a long time for a town bus and three
arrive in quick succession!
6.4
Some Continuous Random Variables
This section identifies and describes the probability density functions of some commonly occurring
continuous random variables. First we must introduce a continuous alternative to the probability
mass function.
6.4.1
Probability Density Function (PDF)
When we discussed discrete r.v.’s we let X have a range space RX = {x1 , x2 , . . . , xn };
the number of values in the range space was countable. Let the range space be RX =
{0, 0.01, 0.02, . . . , 0.99, 1.0}; this is still a discrete r.v.
But what if RX = [0, 1], i.e. all real numbers in the range 0 to 1? A number of problems arise,
the chief of which are:
• the random variable is now continuous, i.e. the elements of the range space are not countable;
• the probability of any particular value of the r.v. is in fact zero. Example: you buy 0.5-kg
of cheese in Tesco; what is the chance of it being exactly 0.5-kg? Zero. Same goes for
the weight of a product of a chemical experiment. Hence we cannot use probability mass
functions.
We now must use a different probability function called a probability density function (pdf). A pdf,
over a range space RX , must satisfy (c.f. eqns. 6.3 and 6.4 for discrete r.v.’s)
\[ f_X(x) \ge 0, \quad \text{all } x \in R_X, \tag{6.11} \]
\[ \int_{R_X} f_X(x)\,dx = 1. \tag{6.12} \]
We emphasise that f_X(x) is not a probability, but f_X(x)dx is. If you want to speak of a probability over a continuous r.v. you must state something like: the probability that X is in the range a to b, inclusive, is P(a ≤ X ≤ b), i.e. $\int_a^b f_X(x)\,dx$.
The term probability density function is used (in contrast to probability mass function (for discrete
r.v.’s)) because, with a continuous r.v. you simply cannot pick a value (X = x), say, and state
P (X = x), which is in fact zero.
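To make the distinction between density and probability concrete, here is a small R sketch using a made-up density f(x) = 2x on [0, 1]. This density is an assumption chosen for illustration and does not appear in the notes. Its value exceeds 1 near x = 1, which a probability could never do, yet it integrates to 1.

```r
f <- function(x) 2 * x                  # illustrative pdf on [0, 1], not from the notes
total <- integrate(f, 0, 1)$value       # eqn. 6.12: total probability
p_ab <- integrate(f, 0.2, 0.5)$value    # P(0.2 <= X <= 0.5) = 0.25 - 0.04
print(c(density_at_0.9 = f(0.9), total = total, p_ab = p_ab))
```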
Discrete probability mass versus Continuous probability density Think of a ruler upon which
we place (stick with Blue-tack) ball bearings of various sizes along its length; the ball bearings
represent discrete masses and we can state that we have a mass m_1 at ruling x_1; we can also compute the total mass as $\sum_i m_i$.
Now think of a rod of varying diameter laid along the ruler; we cannot pick a point x and say that the mass at precisely that point is m(x), but we can say that the mass in a little length, from x to x + ∆x, is d(x)∆x, where d(x) is the mass per unit length at x (the density). In this case we can compute the total mass as $\int_{\text{length}} d(x)\,dx$.
6.4.2
Cumulative Distribution Function (cdf)
Many textbooks base their treatment of continuous r.v.’s on the cumulative distribution function
(cdf); the cdf does give a probability.
\[ F_X(x) = P(X \le x), \tag{6.13} \]
\[ F_X(x) = \int_{-\infty}^{x} f_X(u)\,du. \tag{6.14} \]
6.4.3
Uniform Distribution
X has a uniform distribution on [a, b], X ∼ Uniform(a, b), if
\[ f_X(x) = \begin{cases} \frac{1}{b-a}, & \text{for } x \in [a, b], \\ 0, & \text{otherwise.} \end{cases} \tag{6.15} \]
The cumulative distribution function (cdf) is
\[ F_X(x) = \begin{cases} 0, & x < a, \\ \frac{x-a}{b-a}, & x \in [a, b], \\ 1, & x > b. \end{cases} \tag{6.16} \]
6.4.4
Normal (Gaussian) Distribution
X has a Normal (Gaussian) distribution with parameters µ and σ, X ∼ N(µ, σ), if
\[ f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right), \quad -\infty < x < \infty. \tag{6.17} \]
The Normal distribution is often used to model measurements taken in the presence of error or
noise. If the true value of a variable X is µ, then the measurement (random) variable is distributed as
N(µ, σ) where σ (the standard deviation) is a measure of the ‘size’ of the errors.
We say X has a standard Normal distribution if µ = 0 and σ = 1; standard Normal r.v.’s are typically denoted by Z; Z ∼ N(0, 1). The CDF for Z is denoted by Φ(z); although there is no closed-form formula for Φ(z), it is tabulated. In the days before widespread use of computers, tables such as those for Φ(z) were of great importance to those involved in statistics and statistical inference. Nowadays statistics packages and even some calculators will compute Φ(z) for you or even remove the necessity by calculating the thing that required Φ(z) as an intermediate value.
If X ∼ N(µ, σ) then Z = (X − µ)/σ ∼ N(0, 1).
Conversely, if Z ∼ N(0, 1) then X = σZ + µ ∼ N(µ, σ).
Also, if X ∼ N(µ, σ) and Y = aX + b, then Y ∼ N(aµ + b, |a|σ).
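The standardisation rule can be checked numerically. In this R sketch the values of mu, sigma and x are arbitrary assumptions; the point is only that a probability for X computed directly agrees with the one computed from the standardised Z.

```r
mu <- 10       # illustrative mean
sigma <- 2     # illustrative standard deviation
x <- 13
z <- (x - mu) / sigma                          # standardised value
p_direct <- pnorm(x, mean = mu, sd = sigma)    # P(X <= x) for X ~ N(mu, sigma)
p_std <- pnorm(z)                              # P(Z <= z) for Z ~ N(0, 1), i.e. Phi(z)
print(c(z = z, p_direct = p_direct, p_std = p_std))
```

Note that R's pnorm and dnorm take the standard deviation as their second parameter, matching the N(µ, σ) convention used here.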
6.4.5
Exponential Distribution
X has an Exponential distribution with parameter β, β > 0, X ∼ Exp(β), if
\[ f_X(x) = \frac{1}{\beta} \exp(-x/\beta), \quad x > 0. \tag{6.18} \]
The Exponential distribution is used to model the waiting times between infrequent events, c.f.
the Poisson distribution, see section 6.3.6.
6.4.6
Gamma Distribution
X has a Gamma distribution with parameters α, β; α, β > 0, X ∼ Gamma(α, β), if
\[ f_X(x) = \frac{1}{\beta^\alpha \Gamma(\alpha)} x^{\alpha-1} \exp(-x/\beta), \quad x > 0. \tag{6.19} \]
The Gamma function, for parameter α > 0, is given by
\[ \Gamma(\alpha) = \int_0^\infty y^{\alpha-1} e^{-y}\,dy. \tag{6.20} \]
The Exponential distribution is Gamma with parameter α = 1, Gamma(1, β).
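The relationship between the Exponential and Gamma densities can be verified in R. One caution: R's dexp is parameterised by a rate, which is 1/β in the notation used here, while dgamma accepts a scale argument equal to β. The values of b and x below are arbitrary assumptions.

```r
b <- 3                                        # the parameter beta (scale); illustrative value
x <- c(0.5, 1, 2, 5)
exp_by_hand <- exp(-x / b) / b                # eqn. 6.18
exp_r <- dexp(x, rate = 1 / b)                # R uses rate = 1/beta
gam_r <- dgamma(x, shape = 1, scale = b)      # Gamma(1, beta), eqn. 6.19 with alpha = 1
print(rbind(x, exp_by_hand, exp_r, gam_r))
```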
6.4.7
Beta Distribution
X has a Beta distribution with parameters α, β; α, β > 0, X ∼ Beta(α, β), if
\[ f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1}, \quad 0 < x < 1. \tag{6.21} \]
6.4.8
Student t Distribution
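X has a Student t distribution (or just t distribution), with ν degrees of freedom, X ∼ t_ν, if
\[ f_X(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \cdot \frac{1}{\left(1 + \frac{x^2}{\nu}\right)^{(\nu+1)/2}}. \tag{6.22} \]
6.4.9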
Cauchy Distribution
The Cauchy distribution, X ∼ Cauchy, is a special case of the t distribution with ν = 1,
\[ f_X(x) = \frac{1}{\pi(1 + x^2)}. \tag{6.23} \]
6.4.10
Chi-squared Distribution
X has a χ² distribution with n degrees of freedom, X ∼ χ²_n, if
\[ f_X(x) = \frac{1}{\Gamma(n/2)\,2^{n/2}} x^{(n/2)-1} e^{-x/2}, \quad x > 0. \tag{6.24} \]
6.5
Range spaces — terminology
In discussing discrete r.v.’s we mentioned, for example, a range space RX = {x1 , x2 , . . . , xn }. If the
range space is all the integers, we could use the common symbol RX = Z. If the range space is
all the real numbers, we could use the common symbol RX = R. If the range space is a subset of
R, we use, for example, RX = [0, 1] to state that the r.v. can take any value from 0 to 1 inclusive. For a discrete
(integer) subset we use, for example, {1, 2, . . . , 10}.
6.6
Parameters
In discussing the Binomial distribution, eqn. 6.8, and the Normal, eqn. 6.17, see below,
\[ p_X(x) = \binom{n}{x} p^x (1 - p)^{n-x}, \text{ for } x \in \{0, 1, \ldots, n\}; \; 0 \text{ otherwise}, \]
\[ f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right), \quad -\infty < x < \infty, \]
we note that n and p for the Binomial, and µ, σ for the Normal, completely specify the distributions. We call these parameters and we will see distributions written as, for example, f_X(x; θ_1, θ_2), where θ is a common symbol for parameter.
A lot of practical statistics involves parameter estimation, where, for example, we may have a set
(sample) of data x1 , x2 , . . . , xn , which we know to be drawn from a population with distribution
f_X(x; θ_1, θ_2) and we want to compute an estimate θ̂_1 for θ_1.
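A minimal R sketch of parameter estimation, under the assumption that the data really are Normal: we draw a sample with known µ and σ (values chosen arbitrarily here) and see how close the usual estimates come. The seed and the variable names are also assumptions.

```r
set.seed(42)                          # arbitrary seed, only for repeatability
mu <- 5                               # "true" parameters, chosen for illustration
sigma <- 2
n <- 1000
x <- rnorm(n, mean = mu, sd = sigma)  # sample from N(mu, sigma)
mu_hat <- mean(x)                     # estimate of mu
sigma_hat <- sd(x)                    # estimate of sigma (sd uses the n - 1 divisor)
print(c(mu_hat = mu_hat, sigma_hat = sigma_hat))
```

With real data the true parameters are unknown, which is why the estimates carry a hat.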
Chapter 7
Two- and Multi-Dimensional Random
Variables
7.1
Introduction
Chapter 6 has introduced one dimensional random variables and certain well known distributions.
Both discrete and continuous r.v.’s were covered.
In many cases, your (scientific) data will consist not just of single numbers, for example, the weight
of a chemical in a mixture, but two or more numbers. If the numbers correspond to independent
events, see section 5.8, it may be possible or desirable to treat them separately as individual
one-dimensional r.v.’s, but, generally, you will want to treat pairs or triples or multiple numbers
together.
In section 5.6 and eqn. 5.13 we introduced the notion of the probability of two events happening
together, P (AB), the joint probability of A and B.
Here we introduce first two-dimensional r.v.’s and then go on to generalise to multi-dimensional
r.v.’s.
Range spaces — terminology for two and more dimensions See section 6.5 where we introduced some symbols and terminology used in describing range spaces for one-dimensional r.v.’s.
If we have a two-dimensional continuous random variable — a pair (X, Y )— each member of which
can take on any real value, we say that the range space is R × R; for general multi-dimensions, say
p-dimensions, where the random variable is a random vector, we use R^p. For subsets of R, we
use, for example, [0, 1] × [0, 1] and [0, 1]^p. The term for a combination (product) of sets such as
[0, 1] × [0, 1] is Cartesian product.
Two-dimensional (Bivariate) Random Variables If, to every outcome, ω, of an experiment,
we assign two numbers, X(ω), Y(ω), then the pair (X, Y) is called a two-dimensional random variable.
As with one dimension, we have discrete and continuous two-dimensional random variables; the term random vector is also used, especially when there are more than two dimensions.
Much of what we present here is just a two-dimensional analogue of what was covered in Chapter 6. Also, what is described here in terms of two-dimensions transfers immediately to multiple
dimensions.
7.2
Probability Function of a Discrete Two-dimensional r.v.
By analogy with eqns. 6.3 and 6.4, for one-dimension, we have pX,Y (xi , yj ) = P (X = xi , Y = yj )
(or just p(xi , yj )) and it must satisfy the following
\[ p(x_i, y_j) \ge 0, \quad i = 1, 2, \ldots; \; j = 1, 2, \ldots \tag{7.1} \]
\[ \sum_{j=1}^{m} \sum_{i=1}^{n} p(x_i, y_j) = 1. \tag{7.2} \]
As with one-d., pX,Y or just p is called the probability function or the joint probability function for
the r.v. (X, Y ).
Example 23 From (Meyer 1966, p. 85). There are two production lines; the first has a capacity
to produce up to five items in a day; its actual production is a random variable X; the second has
a capacity to produce up to three items in a day and its actual production is a random variable Y .
The pair of random variables is the two-dimensional random variable (X, Y ) and the joint probability
function is given in Table 7.1. Each entry represents P (X = xi , Y = yj ); so p(2, 3) = 0.04. Such
a table could be estimated by noting (X, Y ) over a large number of days.
        X = 0     1      2      3      4      5
Y = 0    0.00   0.01   0.03   0.05   0.07   0.09
Y = 1    0.01   0.02   0.04   0.05   0.06   0.08
Y = 2    0.01   0.03   0.05   0.05   0.05   0.06
Y = 3    0.01   0.02   0.04   0.06   0.06   0.05
Table 7.1: Example of a two-dimensional probability function
We can verify that the table does represent a proper probability function in that requirement eqn.
7.1 is satisfied, and, by summing over all entries, that requirement eqn. 7.2 is satisfied — the
entries sum to 1.
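The two requirements can be checked mechanically. The R sketch below stores Table 7.1 as a matrix with rows indexed by Y and columns by X; the object name p and the row-by-column layout are choices made for this illustration.

```r
# Joint probability function of Table 7.1: rows are Y = 0..3, columns are X = 0..5.
p <- matrix(c(0.00, 0.01, 0.03, 0.05, 0.07, 0.09,
              0.01, 0.02, 0.04, 0.05, 0.06, 0.08,
              0.01, 0.03, 0.05, 0.05, 0.05, 0.06,
              0.01, 0.02, 0.04, 0.06, 0.06, 0.05),
            nrow = 4, byrow = TRUE,
            dimnames = list(Y = as.character(0:3), X = as.character(0:5)))
nonneg <- all(p >= 0)     # requirement of eqn. 7.1
total <- sum(p)           # requirement of eqn. 7.2
p_2_3 <- p["3", "2"]      # P(X = 2, Y = 3), indexed as [Y, X]
print(c(nonneg = nonneg, total = total, p_2_3 = p_2_3))
```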
7.3
PDF of a Continuous Two-dimensional r.v.
By analogy with eqns. 6.11 and 6.12, for one-dimension, we have the (joint) PDF f (x, y ) and it
must satisfy the following
\[ f(x, y) \ge 0, \quad \text{all } (x, y) \in R \times R, \tag{7.3} \]
\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1. \tag{7.4} \]
We emphasise again that f (x, y ) is not a probability, but f (x, y )dxdy is.
7.4
Marginal Probability Distributions
Example 24 Suppose in Example 23 (Table 7.1) we want to compute the probability functions for
X and Y on their own. These are called marginal probability functions. The marginal probability
function for X is given by
\[ p_X(x_i) = P(X = x_i) = P(X = x_i, Y = y_1, \text{ or } \ldots, \text{ or } X = x_i, Y = y_m) = \sum_{j=1}^{m} p(x_i, y_j). \tag{7.5} \]
Similarly, the marginal probability function of Y is given by
\[ p_Y(y_j) = \sum_{i=1}^{n} p(x_i, y_j). \]
Table 7.2 shows the corresponding sums.
        X = 0     1      2      3      4      5     Sum
Y = 0    0.00   0.01   0.03   0.05   0.07   0.09   0.25
Y = 1    0.01   0.02   0.04   0.05   0.06   0.08   0.26
Y = 2    0.01   0.03   0.05   0.05   0.05   0.06   0.25
Y = 3    0.01   0.02   0.04   0.06   0.06   0.05   0.24
Sum      0.03   0.08   0.16   0.21   0.24   0.28   1.00
Table 7.2: Example
We can verify that the sums corresponding to p(xi ) and p(yj ) do represent proper probability
functions in that requirement 6.3 is satisfied, and, by summing the marginals, that requirement
6.4 is satisfied — both sets of marginals sum to 1.
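The marginal sums are simply column and row sums of the joint table. A short R sketch, with the table re-entered so that the block is self-contained (object names are illustrative):

```r
# Joint probability function: rows are Y = 0..3, columns are X = 0..5.
p <- matrix(c(0.00, 0.01, 0.03, 0.05, 0.07, 0.09,
              0.01, 0.02, 0.04, 0.05, 0.06, 0.08,
              0.01, 0.03, 0.05, 0.05, 0.05, 0.06,
              0.01, 0.02, 0.04, 0.06, 0.06, 0.05),
            nrow = 4, byrow = TRUE,
            dimnames = list(Y = as.character(0:3), X = as.character(0:5)))
pX <- colSums(p)   # marginal of X: sum over the values of Y (eqn. 7.5)
pY <- rowSums(p)   # marginal of Y: sum over the values of X
print(pX)
print(pY)
```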
For continuous random variables, we can state the equivalent equation for marginal PDFs:
\[ f_X(x) = \int_{Y} f_{X,Y}(x, y)\,dy. \tag{7.6} \]
7.5
Conditional Probability Distributions
In section 5.6 we introduced conditional probability, i.e. the probability of an event B when we
know that event A has occurred:
\[ P(B|A) = \frac{P(AB)}{P(A)}. \tag{7.7} \]
We can do the same for probability functions.
Example 25 Suppose in Example 24 (Table 7.2) we want to compute the conditional probability
P (X = 2|Y = 1). Applying eqn. 7.7 we have
\[ P(X = 2 | Y = 1) = \frac{P(X = 2, Y = 1)}{P(Y = 1)} = \frac{0.04}{0.26} = 0.154. \]
We can give general rules, noting that q(yj ), p(xi ) are marginal probability functions given by
eqn. 7.5,
\[ p(x_i | y_j) = \frac{p(x_i, y_j)}{q(y_j)} \quad \text{if } q(y_j) > 0, \tag{7.8} \]
\[ p(y_j | x_i) = \frac{p(x_i, y_j)}{p(x_i)} \quad \text{if } p(x_i) > 0. \tag{7.9} \]
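The conditional probability of Example 25, and the whole conditional distribution of X given Y = 1, can be computed from the joint table. This R sketch re-enters the table so it runs on its own; the object names are assumptions.

```r
# Joint probability function: rows are Y = 0..3, columns are X = 0..5.
p <- matrix(c(0.00, 0.01, 0.03, 0.05, 0.07, 0.09,
              0.01, 0.02, 0.04, 0.05, 0.06, 0.08,
              0.01, 0.03, 0.05, 0.05, 0.05, 0.06,
              0.01, 0.02, 0.04, 0.06, 0.06, 0.05),
            nrow = 4, byrow = TRUE,
            dimnames = list(Y = as.character(0:3), X = as.character(0:5)))
q_y1 <- sum(p["1", ])              # marginal P(Y = 1)
cond <- p["1", "2"] / q_y1         # P(X = 2 | Y = 1), eqn. 7.8
cond_dist <- p["1", ] / q_y1       # P(X = x | Y = 1) for every x
print(round(cond, 3))
print(round(cond_dist, 3))
```

Dividing a whole row by its marginal gives a proper probability function: the conditional probabilities sum to 1.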
We can give similar general rules for continuous random variables, noting that g(x), h(y) are marginal pdfs given by eqn. 7.6,
\[ f(x | y) = \frac{f(x, y)}{h(y)} \quad \text{if } h(y) > 0, \tag{7.10} \]
\[ f(y | x) = \frac{f(x, y)}{g(x)} \quad \text{if } g(x) > 0. \tag{7.11} \]
7.6
Independent Random Variables
We can define the notion of independent random variables using the definition of independent
events given in section 5.8; we had: A and B are independent events if and only if
\[ P(AB) = P(A)P(B). \tag{7.12} \]
(The occurrence of event A in no way influences the occurrence of B and vice-versa.)
Independent Discrete Random Variables Given the two-d. discrete random variable (X, Y ), X
and Y are said to be independent if and only if
\[ p(x_i, y_j) = q(x_i)\,r(y_j), \tag{7.13} \]
noting that q(x_i), r(y_j) are marginal probability functions given by eqn. 7.5.
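Eqn. 7.13 gives a direct test: build the table that independence would imply (the outer product of the marginals) and compare it with the actual joint table. The R sketch below applies this to the production-line table of Example 23; the names and the tolerance are illustrative choices.

```r
# Joint probability function: rows are Y = 0..3, columns are X = 0..5.
p <- matrix(c(0.00, 0.01, 0.03, 0.05, 0.07, 0.09,
              0.01, 0.02, 0.04, 0.05, 0.06, 0.08,
              0.01, 0.03, 0.05, 0.05, 0.05, 0.06,
              0.01, 0.02, 0.04, 0.06, 0.06, 0.05),
            nrow = 4, byrow = TRUE,
            dimnames = list(Y = as.character(0:3), X = as.character(0:5)))
pX <- colSums(p)
pY <- rowSums(p)
implied <- outer(pY, pX)              # table implied by independence, eqn. 7.13
max_gap <- max(abs(p - implied))      # largest departure from the product form
independent <- max_gap < 1e-12
print(c(max_gap = max_gap, independent = independent))
```

For this table the test fails: for instance P(X = 0, Y = 0) = 0 whereas the product of the marginals is 0.03 × 0.25, so X and Y are not independent.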
Independent Continuous Random Variables Similarly, given the two-d. continuous random
variable (X, Y ), X and Y are said to be independent if and only if
f (x, y ) = g(x)h(y ),
(7.14)
where g(x), h(y ) are marginal pdfs.
7.7
Two-dimensional (Bivariate) Normal Distribution
We can extend the one-d. Normal (Gaussian) distribution to two-d.
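\[ f(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}} \exp\left( -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x-\mu_x}{\sigma_x}\right)^2 - 2\rho\frac{(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y} + \left(\frac{y-\mu_y}{\sigma_y}\right)^2 \right] \right), \tag{7.15} \]
for −∞ < x < ∞, −∞ < y < ∞.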
Before you start protesting that eqn. 7.15 is incomprehensible, (i) it isn’t and I can explain it; (ii) there is a much better way of handling multivariate random variables, which is preferable even in two-d. See Chapter B and section B.7.
Chapter 8
Characterisations of Random Variables
8.1
Introduction
We introduced the notion of a random variable in Chapters 6 and 7. We identified probability
functions (for discrete r.v.’s) and probability density functions for some commonly occurring r.v.’s.
Here we identify and define some parameters (numbers) that characterise some aspects of r.v.
distributions.
Generally, the expected value or expectation of some function of the r.v. is found useful and the
expected value of the r.v. itself (the mean) is first amongst these.
8.2
Expected Value (Mean) of a Random Variable
The expected value of a r.v. X, or expectation, or mean, is the average value of X.
Definition: Expected Value, Discrete R.V. Discrete r.v., range space R_X = {x_1, . . . , x_N}; probability mass function p(x_i) = P(X = x_i). The expected value or expectation (E(X)), or mean of X is given by
\[ E(X) = \mu_X = \sum_{i=1}^{N} x_i\,p(x_i). \tag{8.1} \]
Continuous r.v., range space R_X = R; probability density function f(x). The expected value or expectation (E(X)), or mean of X is given by
\[ E(X) = \mu_X = \int_{R} x f(x)\,dx. \tag{8.2} \]
Example 26 Toss two coins as in Example 18. X = number of heads. A = {TT}, P(A) = 1/4, X = 0, P(X = 0) = 1/4; A = {TH, HT}, P(A) = 1/2, X = 1, P(X = 1) = 1/2; A = {HH}, P(A) = 1/4, X = 2, P(X = 2) = 1/4.
\[ E(X) = \mu_X = \sum_{i=1}^{N} x_i\,p(x_i) = 0 \cdot \frac{1}{4} + 1 \cdot \frac{1}{2} + 2 \cdot \frac{1}{4} = 0 + 0.5 + 0.5 = 1. \]
Example 27 Toss a dice and take X = the number of dots obtained; p(x_i) = 1/6, i = 1, . . . , 6.
\[ E(X) = \mu_X = \sum_{i=1}^{N} x_i\,p(x_i) = \frac{1}{6} \sum_{i=1}^{6} x_i = 21/6 = 3.5. \tag{8.3} \]
Note that in Example 27 µx = 3.5 is not one of the possible values of X.
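Eqn. 8.1 is a weighted sum, which in R is a one-liner. The sketch below repeats Examples 26 and 27; the variable names are illustrative.

```r
x_coin <- 0:2                     # number of heads in two tosses
p_coin <- c(1/4, 1/2, 1/4)        # pmf from Example 26
e_coin <- sum(x_coin * p_coin)    # eqn. 8.1

x_dice <- 1:6                     # dots on a fair dice
p_dice <- rep(1/6, 6)             # pmf from Example 27
e_dice <- sum(x_dice * p_dice)    # eqn. 8.1
print(c(e_coin = e_coin, e_dice = e_dice))
```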
It is useful, particularly in two-d. cases, to think of µx as the centre of mass, where p(xi ) is a mass
and xi is a position along a lever arm; µx is the position to place the fulcrum in order to achieve a
balance.
Aside — Sample Averages In later chapters we will encounter samples and sample averages.
By sample we mean that we run an experiment and take some example values, say n of them, of
the r.v., x1 , x2 , . . . , xn .
Here we use n for the size of the sample rather than N as in eqn. 8.1 and note that the range space R_X = {x_1, . . . , x_N} denotes the population, rather than a sample of it.
Then we can compute a sample mean, X̄, (pronounced x-bar) as
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i. \tag{8.4} \]
That is, compute the average like we learned in early arithmetic.
Ordinarily, we’ll make a strong distinction between sample mean and true mean. But let us consider the case of a large sample of dice throws, say n = 600. Let y_i = the count of each value x_i obtained. We might expect to obtain something like y_1 = 95, y_2 = 110, y_3 = 90, y_4 = 97, y_5 = 105, y_6 = 103, so that for eqn. 8.4 the sum of the observations is 95 × 1 + 110 × 2 + 90 × 3 + 97 × 4 + 105 × 5 + 103 × 6 = 2116, giving X̄ = 2116/600 ≈ 3.53.
If we look more carefully at eqn. 8.4 for this example, we can interpret it as a sample version of eqn. 8.1. Summing over the N = 6 distinct values,
\[ \bar{X} = \sum_{i=1}^{N} \frac{1}{n} y_i \times x_i = \sum_{i=1}^{N} \frac{y_i}{n} x_i, \tag{8.5} \]
and, comparing with eqn. 8.1, we have y_i/n in place of p(x_i); we note that y_i/n = p̄(x_i) = {95/600, 110/600, 90/600, 97/600, 105/600, 103/600} = {0.158, 0.183, 0.15, 0.162, 0.175, 0.172}, i.e. we have sample estimates of the probability mass function, which are inexact. The error, X̄ ≠ µ_X, is due to the errors in the p̄(x_i). Generally, as n → ∞, p̄(x_i) → p(x_i) and X̄ → µ_X.
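The sample mean from the counts, and its reading as a weighted sum with estimated probabilities, can be worked through in R. The counts are those quoted in the text; the variable names are assumptions.

```r
y <- c(95, 110, 90, 97, 105, 103)   # counts of the outcomes 1..6
x <- 1:6
n <- sum(y)                         # sample size, 600
xbar <- sum(y * x) / n              # sample mean computed from the counts
p_bar <- y / n                      # sample estimates of p(x_i)
xbar_weighted <- sum(x * p_bar)     # same mean written as a weighted sum
print(c(n = n, xbar = round(xbar, 3)))
print(round(p_bar, 3))
```

Both routes give the same number, about 3.53, slightly above the true mean of 3.5.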
Definition: Expected Value of a function of a r.v. The expected value (E(r(X))) of a function of X, Y = r(X), is given by
\[ E(Y) = E(r(X)) = \sum_{i=1}^{N} r(x_i)\,p(x_i). \tag{8.6} \]
Example 28 Let us use a dice as a one number slot-machine (one-armed-bandit). We pay 5c to
play and the machine pays whatever number comes up (1–6); thus our net payout for each play is x_i − 5. What is the expected value of the payout? (Think play for an hour, 1000 plays, inserting 5000c, what do we expect to win or lose?)
\[ E(Y) = \sum_{i=1}^{N} r(x_i)\,p(x_i) = \sum_{i=1}^{6} (x_i - 5) \frac{1}{6} = -4/6 - 3/6 - 2/6 - 1/6 + 0/6 + 1/6 = -9/6 = -1.5. \]
That is, we lose on average 1.5c for every play and would lose 1500c in 1000 plays. (Maybe better
than the average slot-machine?)
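The slot-machine calculation is an instance of eqn. 8.6 with r(x) = x − 5. A minimal R sketch, in which the function name r and the other names are illustrative:

```r
x <- 1:6                            # possible dice values
p <- rep(1/6, 6)                    # fair dice pmf
r <- function(x) x - 5              # net payout in cents for one play
e_payout <- sum(r(x) * p)           # eqn. 8.6
loss_1000 <- -1000 * e_payout       # expected loss over 1000 plays, in cents
print(c(e_payout = e_payout, loss_1000 = loss_1000))
```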
Expected values for two-dimensions and higher dimensions. Eqns. 8.1 and 8.2 carry over to two and more dimensions.
Discrete r.v., range space R_{X,Y} = {x_1, . . . , x_N} × {y_1, . . . , y_M}; probability mass function p(x_i, y_j) = P(X = x_i, Y = y_j). The expected value or expectation, E[(X, Y)], or mean of the pair (X, Y) is given by
\[ E[(X, Y)] = \mu_{X,Y} = (\mu_X, \mu_Y) = \sum_{i=1}^{N} \sum_{j=1}^{M} (x_i, y_j)\,p(x_i, y_j). \tag{8.7} \]
And similarly for two-d. (and multidimensional) continuous, where multiple integrals replace single
integrals.
Useful facts
For X_1, . . . , X_n random variables and constants a_1, . . . , a_n,
\[ E\left(\sum_i a_i X_i\right) = \sum_i a_i E(X_i). \tag{8.8} \]
For X_1, . . . , X_n independent random variables
\[ E\left(\prod_{i=1}^{n} X_i\right) = \prod_{i=1}^{n} E(X_i). \tag{8.9} \]
8.3
Variance of a Random Variable
Variance gives the spread of a distribution. The variance is the expected value (mean value) of
the squared deviation from the mean.
Definition: Variance Discrete r.v., range space R_X = {x_1, . . . , x_N}; probability mass function p(x_i), mean µ_X. The variance is given by
\[ V(X) = \sigma^2 = E[(X - \mu_X)^2] = \sum_{i=1}^{N} (x_i - \mu_X)^2 p(x_i). \tag{8.10} \]
Continuous r.v.
\[ V(X) = \sigma^2 = E[(X - \mu_X)^2] = \int_{R} (x - \mu_X)^2 f(x)\,dx. \tag{8.11} \]
The following formula is sometimes useful
\[ V(X) = E(X^2) - (E(X))^2 = E(X^2) - \mu_X^2. \tag{8.12} \]
Aside — Sample Variance Eqn. 8.4 gives the sample mean of a random variable; the sample variance is given by
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2. \tag{8.13} \]
You may wonder about the (n − 1) instead of n; if we divided by n, the estimate would be biassed.
Standard Deviation Standard deviation: $\sigma_X = \sqrt{V(X)}$.
Useful facts about variance For constants a, b,
\[ V(aX + b) = a^2 V(X). \tag{8.14} \]
For X_1, . . . , X_n independent random variables and constants a_1, . . . , a_n,
\[ V\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i^2 V(X_i). \tag{8.15} \]
If X_1, . . . , X_n are independent and identically distributed (IID) random variables with µ = E(X), σ² = V(X), then
\[ E(\bar{X}) = \mu, \quad V(\bar{X}) = \sigma^2/n, \quad E(s^2) = \sigma^2. \tag{8.16} \]
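The variance formulas can be compared numerically for the fair dice, together with the n − 1 divisor of the sample variance. In this R sketch the small sample s is made up purely for illustration, and the variable names are assumptions.

```r
x <- 1:6
p <- rep(1/6, 6)
mu <- sum(x * p)                       # mean, 3.5
v_def <- sum((x - mu)^2 * p)           # eqn. 8.10, variance by definition
v_short <- sum(x^2 * p) - mu^2         # eqn. 8.12, the shortcut formula
sd_x <- sqrt(v_def)                    # standard deviation

s <- c(2, 4, 4, 5)                     # made-up sample, for illustration only
s2 <- sum((s - mean(s))^2) / (length(s) - 1)   # eqn. 8.13
print(c(v_def = v_def, v_short = v_short, sd_x = sd_x, s2 = s2))
```

R's built-in var uses the same n − 1 divisor, so var(s) agrees with s2.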
8.4
Expectations in Two-dimensions
8.4.1
Mean
Two-d. discrete r.v., range space R_{X,Y} = {x_1, . . . , x_N} × {y_1, . . . , y_M}; probability mass function p(x_i, y_j). The expected value or expectation (E[(X, Y)]), or mean of (X, Y) is given by
\[ E[(X, Y)] = \mu_{X,Y} = \sum_{j=1}^{M} \sum_{i=1}^{N} (x_i, y_j)\,p(x_i, y_j). \tag{8.17} \]
Similarly for a continuous r.v. — double integral replaces summation, pdf replaces probability mass
function.
8.4.2
Covariance
Let X, Y be r.v.’s with means µX , µY and standard deviations σX , σY . The covariance between X
and Y is defined as
\[ \mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]. \tag{8.18} \]
Cov(X, Y) = Cov(Y, X).
The correlation between X and Y is defined as
\[ \rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}. \tag{8.19} \]
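As a worked instance of eqns. 8.18 and 8.19, the R sketch below computes the covariance and correlation for the production-line joint table of Example 23. The table is re-entered so the block runs on its own; all object names are illustrative, and the cross-check formula E(XY) − µ_X µ_Y is a standard identity rather than something stated in the notes.

```r
# Joint probability function: rows are Y = 0..3, columns are X = 0..5.
p <- matrix(c(0.00, 0.01, 0.03, 0.05, 0.07, 0.09,
              0.01, 0.02, 0.04, 0.05, 0.06, 0.08,
              0.01, 0.03, 0.05, 0.05, 0.05, 0.06,
              0.01, 0.02, 0.04, 0.06, 0.06, 0.05),
            nrow = 4, byrow = TRUE)
x <- 0:5
y <- 0:3
pX <- colSums(p)
pY <- rowSums(p)
mu_x <- sum(x * pX)
mu_y <- sum(y * pY)
cov_def <- sum(outer(y - mu_y, x - mu_x) * p)      # eqn. 8.18
cov_alt <- sum(outer(y, x) * p) - mu_x * mu_y      # E(XY) - mu_X mu_Y, a cross-check
sd_x <- sqrt(sum((x - mu_x)^2 * pX))
sd_y <- sqrt(sum((y - mu_y)^2 * pY))
rho <- cov_def / (sd_x * sd_y)                     # eqn. 8.19
print(c(mu_x = mu_x, mu_y = mu_y, cov = cov_def, rho = rho))
```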
Chapter 9
The Normal Distribution
9.1
Introduction
Here we introduce some uses of the Normal distribution, eqn. 6.17. The Normal distribution can
be used as a model or approximate model in so many cases that a large amount of mathematics
has been built up around it. Note: we use Normal (capitalised) to distinguish from the word normal
(expected, typical) and because most other distribution names are capitalised.
The probability density function (pdf) is given by:
\[ f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right), \quad -\infty < x < \infty. \tag{9.1} \]
We say X ∼ N(µ, σ); note: some writers use X ∼ N(µ, σ²), i.e. they use the variance for the second parameter of N; we will attempt to standardise on N(µ, σ). It is well worth checking carefully when reading books and papers, because there can be a great difference between σ and σ²!
Because the pdf is different for each µ, σ, it is convenient to create a standardised Normal in which
µ = 0, σ = 1. We standardise the r.v. X as follows; first we shift to zero mean, and then we divide
by σ to obtain unit standard deviation.
Z = (X − µ)/σ.
(9.2)
When we standardise X, we obtain Z = (X − µ)/σ ∼ N(0, 1), and eqn. 9.1 becomes eqn. 9.3,
\[ f_Z(z) = \frac{1}{\sqrt{2\pi}} \exp(-z^2/2). \tag{9.3} \]
The pdf for N(0, 1) is shown in Figure 9.1. As you can see, most of the probability is located in −3 < Z < 3; between these limits we have probability 0.9973, i.e. P(−3 < Z < 3) = 0.9973, that is, if we have a random variable Z, we can be pretty sure it will fall between these limits; you may have heard the term three-sigma to denote nearly all occurrences. Likewise P(−1.96 < Z < 1.96) = 0.95, so that the probability outside these limits is 0.05 or 5%.
R-Example 3 The following R code computes and plots Figure 9.1.
> z = seq(-6, 6, length.out = 200)
> pdf = dnorm(z, 0, 1) ## dnorm for d(ensity) normal
> plot(z, pdf, type = "l", lwd=3)
9.2
Cumulative Distribution Function (cdf)
As we indicated in section 6.4.2, the pdf does not represent a probability, but a probability density; the numbers we refer to above, for example, P(−1.96 < Z < 1.96) = 0.95, are obtained by integration,
\[ P(-1.96 < Z < 1.96) = 0.95 = \int_{-1.96}^{1.96} f_Z(z)\,dz. \tag{9.4} \]
However, for the Normal distribution, there is no easy way to compute $\int_a^b f_X(x)\,dx$, which is where the cdf comes in; we recall that the cdf is given by eqns. 9.5 and 9.6,
\[ F_Z(z) = P(Z \le z), \tag{9.5} \]
\[ \Phi(z) = F_Z(z) = \int_{-\infty}^{z} f_Z(u)\,du = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp(-u^2/2)\,du. \tag{9.6} \]
Because it is so commonly used, the standardised Normal cdf gets its own symbol, Φ(z). Φ(z) is
plotted in Figure 9.2 which was created using the code in R-Example 4.
R-Example 4 The following R code computes and plots Figure 9.2.
> z = seq(-6, 6, length.out = 200)
> cdf = pnorm(z, 0, 1) ## pnorm for p(robability) normal
> plot(z, cdf, type = "l", lwd=3)
### add these if you want a figure for a report
> pdf("normcdf.pdf", onefile=FALSE, height=4, width=4, pointsize=8, paper="special")
> plot(z, cdf, type = "l", lwd=3)
> dev.off() ### necessary to flush diagram into the file "normcdf.pdf"
Following the discussion above on how most of the probability is located in −3 < Z < 3, we are not surprised to see that Φ(z) is close to zero at z = −3; it rises to 0.5 at z = 0 (one half of the probability is below 0, the other above 0) and then flattens out close to 1 at z = 3, after which there is almost no probability for the integral to add in.
Figure 9.1: Standardised Normal distribution, N(0, 1), probability density function (pdf).
Figure 9.2: Normal cumulative distribution function (cdf).
9.3 Normal Cdf
Traditionally, statistics books, and books of tables contained tabulations of the Normal cdf, Φ(z).
We will see below how these tables are used. However, because most statistics is now conducted
using software packages, tables may be less frequently used, and may be less commonly encountered
in textbooks.
R-Example 5 The following R code computes Table 9.1.
> z = seq(-4, 4, length = 9)
> cdf = pnorm(z, 0, 1)
> z
[1] -4 -3 -2 -1  0  1  2  3  4
> cdf
[1] 3.167124e-05 1.349898e-03 2.275013e-02 1.586553e-01 5.000000e-01
[6] 8.413447e-01 9.772499e-01 9.986501e-01 9.999683e-01

z        -4        -3         -2         -1      0     1      2       3       4
Phi(z)   3.2e-05   1.35e-03   2.28e-02   0.159   0.5   0.84   0.977   0.999   0.99997

Table 9.1: Φ(z) for z = −4 to +4.
What does Φ(z = −2) = 2.28 × 10^−2 = 0.0228 mean? Referring to Figure 9.1, it means that the amount of probability to the left of Z = −2 is 0.0228, as indicated by eqn. 9.5.
Owing to the symmetry of Figure 9.1, we can state that the amount of probability to the right of Z = +2 is also 0.0228. Hence the probability P(Z < −2 or Z > +2) = 2 × 0.0228 = 0.0456 or 4.56%. If we move a little closer to the mean, we get P(Z < −1.96 or Z > +1.96) = 2 × 0.025 = 0.05 or 5%. These 5% limits (±1.96) are used a lot in statistics.
If P(Z < −1.96 or Z > +1.96) = 0.05 then P(−1.96 < Z < +1.96) = 0.95.
In a similar way, we can determine that P(Z < −1 or Z > +1) = 2 × 0.159 = 0.318; that is, a standard Normal random variable Z is more than one standard deviation away from the mean 31.8% of the time, and within one standard deviation of the mean the remaining 68.2% of the time. The 0.159 number is used below in Example 29.
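The probabilities quoted in this section can be checked directly with pnorm. The following is a minimal sketch; the helper name prob_within is an assumption introduced here for illustration and is not part of R.

```r
# Probability that Z ~ N(0, 1) lies within k standard deviations of the mean.
# prob_within is a hypothetical helper name, not a built-in R function.
prob_within <- function(k) pnorm(k) - pnorm(-k)

p1   <- prob_within(1)      # about 0.683
p196 <- prob_within(1.96)   # about 0.950
p3   <- prob_within(3)      # about 0.997

# Two-tailed probability outside +/- 2; about 0.0455
# (the 0.0456 in the text comes from rounding Phi(-2) to 0.0228 first)
tails2 <- 2 * pnorm(-2)

print(round(c(p1, p196, p3, tails2), 4))
```

Subtracting two cdf values is the general way to obtain the probability of any interval from Φ(z).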
9.4 Using the Normal Cdf
Example 29 Suppose we have a manufacturing process which takes fixed quantities of raw materials A (1000 g) and B (500 g) which react together to produce a product C in the form of a solid cake. The weights of the cakes, X, are monitored and those below a certain weight are set aside as B-grade. The manufacturer of the machine gives the yield expected value as E(X) = 165 grams with a variance of 9 and has determined that the yield follows the Normal distribution; that is, µX = 165, σX = √9 = 3 and X ∼ N(165, 3). We have decided that cakes below 162 grams should be marked as B-grade.
What is the probability that a randomly selected output will be less than 162 grams?
We have no tables for N(165, 3), but we do have them for N(0, 1), that is, the cdf for the standardised Normal, Φ(z).
Solution.
(i) First we standardise using eqn. 9.2, Z = (X − µ)/σ. Our standardisation formula is
Z = (X − 165)/3,
in which case the standardised weight corresponding to 162 is Z162 = (162 − 165)/3 = −1.
(ii) The probability that Z < Z162 is just Φ(Z162) = Φ(−1) and we can read that from Table 9.1, i.e. the probability is 0.159 and 15.9% of the output will be B-grade.
(iii) Or, we can use R.
> pnorm(-1, 0, 1) ## here explicitly giving mu and sigma.
[1] 0.1586553
> pnorm(-1) ## if none given, R assumes mu = 0, sigma = 1
[1] 0.1586553
(iv) We can even let R handle the standardisation.
> pnorm(162, 165, 3) ## here explicitly giving mu and sigma.
[1] 0.1586553
Normal distribution appropriate? In Example 29 there can be an immediate objection to the Normal model. X can never be less than zero, but N(165, 3) assigns a probability greater than zero (though extremely small) to X < 0. In defence, we can argue that this probability is negligibly small, so that use of the Normal model should not introduce significant errors. If we had a weight with E(X) = 4, V(X) = 9, σ = 3, then we would have to question the Normal model.
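To see how different the two cases are, we can ask R for the probability that each model assigns to a negative weight. This is only a sketch using the two parameter settings mentioned above; the variable names are illustrative.

```r
# Probability of a (physically impossible) negative weight under each model
p_neg_example29 <- pnorm(0, mean = 165, sd = 3)  # mean is 55 standard deviations above zero
p_neg_lowmean   <- pnorm(0, mean = 4,   sd = 3)  # mean is only 1.33 standard deviations above zero

print(c(p_neg_example29, p_neg_lowmean))
```

The first probability is indistinguishable from zero, whereas the second is roughly 9%, which is far too large to ignore.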
9.5 Sum of Independent Normal Random Variables
If X1 ∼ N(µ1, σ1) and X2 ∼ N(µ2, σ2) are independent random variables,
$$X = X_1 + X_2 \sim N(\mu, \sigma), \tag{9.7}$$
where $\mu = \mu_1 + \mu_2$ and $Var(X) = \sigma^2 = \sigma_1^2 + \sigma_2^2$.
Add the means and add the variances; note that we do not add the standard deviations.
Need example here.
Eqn. 9.7 generalises to give the distribution of a sum of n independent observations of the same random variable. If Xi ∼ N(µ, σ),
$$X = X_1 + X_2 + \dots + X_n = \sum_{i=1}^{n} X_i \sim N(n\mu, \sqrt{n}\,\sigma). \tag{9.8}$$
That is, add n means, and add n variances, so that $\sigma_{sum} = \sqrt{n\,Var(X_i)} = \sqrt{n}\,\sigma$.
Need example here.
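One possible illustration of eqns. 9.7 and 9.8 follows. The second product (mean 120, standard deviation 4) and the weight limits 290 and 1470 are assumptions chosen for the sketch; only the N(165, 3) cake comes from Example 29.

```r
# Eqn 9.7: sum of two independent Normal random variables (hypothetical second yield)
mu1 <- 165; sigma1 <- 3
mu2 <- 120; sigma2 <- 4
mu_sum    <- mu1 + mu2                   # add the means: 285
sigma_sum <- sqrt(sigma1^2 + sigma2^2)   # add the variances, then take the root: 5
p_over <- 1 - pnorm(290, mu_sum, sigma_sum)   # P(X1 + X2 > 290)

# Eqn 9.8: total weight of n = 9 independent cakes, each N(165, 3)
n <- 9
mu_total    <- n * 165         # 1485
sigma_total <- sqrt(n) * 3     # 9, not 27
p_light <- pnorm(1470, mu_total, sigma_total)  # P(total < 1470)

print(c(p_over, p_light))
```

Note that adding the standard deviations instead would give 7 rather than 5 in the first case and 27 rather than 9 in the second, badly overstating the spread.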
9.6 Differences of Normal Random Variables
If X1 ∼ N(µ1, σ1) and X2 ∼ N(µ2, σ2) are independent random variables,
$$X = X_1 - X_2 \sim N(\mu, \sigma), \tag{9.9}$$
where $\mu = \mu_1 - \mu_2$ and $Var(X) = \sigma^2 = \sigma_1^2 + \sigma_2^2$.
Take the difference of the means and add the variances (not the difference of the variances).
Need example here.
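As a possible illustration of eqn. 9.9, suppose two independent machines produce cakes with weights N(165, 3) and N(162, 4); the second machine is an assumption made for this sketch.

```r
# Difference D = X1 - X2 of two independent Normal random variables
mu1 <- 165; sigma1 <- 3
mu2 <- 162; sigma2 <- 4                  # hypothetical second machine
mu_diff    <- mu1 - mu2                  # difference of the means: 3
sigma_diff <- sqrt(sigma1^2 + sigma2^2)  # variances still add: 5

# Probability that a cake from machine 2 is heavier than one from machine 1
p_negative <- pnorm(0, mu_diff, sigma_diff)
print(p_negative)
```

Even though machine 1 is heavier on average, a cake from machine 2 outweighs one from machine 1 a little over a quarter of the time.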
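9.7 Linear Transformations of Normal Random Variables
If X ∼ N(µ, σ), then for constants a ≠ 0 and b,
$$Y = aX + b \sim N(a\mu + b, |a|\sigma). \tag{9.10}$$
The absolute value is needed because a standard deviation cannot be negative; for a > 0 this is simply aσ.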
Need example here.
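A possible illustration of eqn. 9.10: convert the cake weight of Example 29 from grams to kilograms and add a fixed packaging weight. The conversion factor a = 0.001 and the 10 g packaging are assumptions made for this sketch.

```r
mu <- 165; sigma <- 3        # X ~ N(165, 3), as in Example 29
a <- 0.001; b <- 0.010       # hypothetical: grams to kilograms, plus 10 g of packaging
mu_y    <- a * mu + b        # 0.175 kg
sigma_y <- abs(a) * sigma    # 0.003 kg

# P(Y < 0.172) computed directly and by transforming back to X
p_direct <- pnorm(0.172, mu_y, sigma_y)
p_via_x  <- pnorm((0.172 - b) / a, mu, sigma)
print(c(p_direct, p_via_x))
```

Both routes give the same probability, because a linear change of units does not change the standardised value.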
9.8 The Central Limit Theorem
Why is the Normal distribution (a) so common and (b) so popular amongst statisticians? First, the Central Limit Theorem (CLT) states, roughly speaking, that if a random variable has been created by summing a large number of independent random variables, then the sum will have an approximately Normal distribution. Second, it is popular not just because of its common occurrence but because mathematics involving the distribution, eqn. 9.1, and its multivariate counterpart is in many cases rather easy, or at least a good deal easier than mathematics involving some other distributions.
A compact statement of the CLT, from (Wasserman 2004), is as follows.
Let X1, X2, ..., Xn be independent and identically distributed r.v.’s with mean µ and standard deviation σ. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then, as n → ∞,
$$Z_n = \frac{\bar{X}_n - \mu}{\sqrt{Var(\bar{X}_n)}} = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \to Z, \tag{9.11}$$
where Z ∼ N(0, 1) and the convergence is in distribution.
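The CLT can be explored by simulation. The sketch below standardises the means of repeated samples from a Uniform(0, 1) distribution, which is far from Normal; the choice of distribution, the sample size n = 30, the number of repetitions and the seed are all assumptions made for illustration.

```r
set.seed(1)                         # for reproducibility
n <- 30                             # size of each sample
m <- 10000                          # number of repeated samples
mu <- 0.5; sigma <- sqrt(1 / 12)    # mean and standard deviation of Uniform(0, 1)

xbar <- replicate(m, mean(runif(n)))     # m sample means
zn <- (xbar - mu) / (sigma / sqrt(n))    # standardise as in eqn 9.11

frac_within <- mean(abs(zn) < 1.96)      # should be close to 0.95 if zn is nearly N(0, 1)
print(c(mean(zn), sd(zn), frac_within))

hist(zn, breaks = 40, freq = FALSE)      # histogram of the standardised means
curve(dnorm(x), add = TRUE, lwd = 2)     # N(0, 1) pdf for comparison
```

The histogram should lie close to the N(0, 1) curve even though the individual observations are uniformly distributed.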
Chapter 10 Statistical Inference
10.1 Introduction
We use the Normal distribution, eqn. 6.17, repeated here, to introduce statistical inference.
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right), \quad -\infty < x < \infty. \tag{10.1}$$
We may write fX as fX (x; µ, σ) or fX (x; θ1 , θ2 ), where θ1 , θ2 are parameters. We may think of a
family of Normal distributions, N, parametrised or labelled or indexed by θ1 , θ2 .
Let us say we have performed an experiment and have collected a sample of observations of the random variable X, x1, x2, ..., xn; we assume that X ∼ N(µ, σ) but we do not know one or other (or both) of the parameters.
Point Estimation Parameter estimation is concerned with estimating parameters. A point estimate for, say, µ is an approximate value µ̂ computed from the sample. Typically, in addition to the estimate, µ̂, we give some qualification such as the variance of the estimate, that is, an indication of how variable we think µ̂ might be if we repeated the experiment a number of times.
Interval Estimation An interval estimate (set estimate, confidence interval) for, say, µ is an interval [µ1, µ2] computed from the sample which we claim contains the real µ. Typically, we give some indication of how plausible the interval is in the form of some sort of probability value.
Hypothesis Testing A typical hypothesis testing example is when a scientist needs to test the efficacy of a new method.
An experiment is performed where there are two methods, M1 and M2. Often, M1 is a control (say the old method) and M2 is the new method whose efficacy we wish to test.
Let us keep the hypothesis simple by assuming that we wish to test whether M2 will give a better yield than M1.
Chapter 11 Statistical Estimation
11.1 Introduction
When we state for example X ∼ fX (x; θ1 , θ2 ), we indicate that the distribution depends on parameters θ1 , θ2 . For example, we may think of a family of Normal distributions, N(θ1 , θ2 ), parametrised
or labelled or indexed by θ1 = µ, θ2 = σ.
11.2 Populations and Samples
When we quote values of parameters, for example the mean and standard deviation of a Normally
distributed r.v., X ∼ N(µ, σ), we are talking about population parameters.
Let us collect a sample of random variables X, x1 , x2 , . . . , xn ; we assume that X ∼ N(µ, σ) but we
do not know either of the parameters. We must estimate them and an obvious first attempt is to
use sample mean and standard deviation.
Note the difference: population versus sample. A population includes all possible random variables;
a sample contains, well, a sample taken from the population. If you wanted a quick estimate of
the mean salary of lecturers in the college, you could ask a number of lecturers you know and take
the average of that sample.
The Human Resources Department could give you an exact figure, because they have the data for the (complete) population of N lecturers. They would compute the true population parameters as,
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \tag{11.1}$$
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2. \tag{11.2}$$
You could imagine that the larger your sample, the better the sample mean would approximate the
population mean.
Random Sample However, apart from being a small sample, lecturers you know could contain
another source of inaccuracy, namely that the sample is not random and so it may contain a bias
due to the fact that, for example, the lecturers in your sample tend to be younger.
By random sample we mean that each member of the population has an equal chance of being
sampled. Achieving a random sample is not always easy, see Chapter 13.
11.3 Estimating the Mean
A point estimate for say µ is an approximate value µ̂ computed from the sample. Typically, in
addition to the estimate, µ̂, we give some qualifications such as the variance of the estimate, that
is, an indication of how variable we think µ̂ might be if we repeated the experiment a number of
times. The hat symbol, θ̂, is used to indicate that we have an estimate of θ.
The most obvious estimate for µ is to copy eqn. 11.1, noting that we use capital N for the size of the population and lower-case n for the size of the sample,
$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i. \tag{11.3}$$
In this context the bar, as in x̄ (x bar), indicates mean or average.
Need example here.
11.4 Estimating the Standard Deviation
The “best” estimate for σ is less obvious and eqn. 11.2 is modified slightly to,
$$\hat{\sigma}^2 = s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2. \tag{11.4}$$
Thus, we not only replace µ by its estimate, x̄, we divide by n − 1 instead of n. It is usual to use
s 2 to denote sample variance.
The reason for the n − 1 is that dividing by n would generally lead to a systematic underestimate
— a so-called bias. This may be discussed in a later chapter; (reference it if we do).
Need example here.
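As a possible illustration of eqns. 11.3 and 11.4, the sketch below computes the estimates by hand and compares them with R's built-in mean and var. The six cake weights are made-up values, not data from the notes.

```r
x <- c(164.2, 167.9, 161.5, 166.3, 163.8, 168.1)  # hypothetical cake weights (g)
n <- length(x)

xbar <- sum(x) / n                    # eqn 11.3; the same as mean(x)
s2   <- sum((x - xbar)^2) / (n - 1)   # eqn 11.4; the same as var(x)
s    <- sqrt(s2)                      # the same as sd(x)

s2_biased <- sum((x - xbar)^2) / n    # dividing by n always gives a smaller value

print(c(xbar = xbar, s2 = s2, s = s, s2_biased = s2_biased))
```

R's var and sd use the n − 1 divisor, so they return the sample estimates of eqn. 11.4 rather than the population formula of eqn. 11.2.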
11.5 Sampling Distributions
11.5.1 Sampling Distribution of the mean
The estimate of the mean given by eqn. 11.3 is itself a random variable; we can imagine taking m samples, each of size n, and each of these yielding an x̄_j for j = 1, 2, ..., m. Its expected value and variance are
$$E(\bar{x}) = \mu, \tag{11.5}$$
$$Var(\bar{x}) = \sigma^2/n. \tag{11.6}$$
Therefore, the standard deviation of the estimate of the mean is σ/√n. We already encountered this in section 9.5 and eqn. 9.8.
Eqns. 11.5 and 11.6 are rather comforting: (a) the expected value of x̄ is µ and (b) the standard deviation of x̄ is σ/√n, that is, as n increases the standard deviation decreases and will decrease to zero as n → ∞.
Finally, we can state that the sampling distribution of µ̂ is N(µ, σ/√n). This means that if we conduct a number of sample experiments (take a sample of n Xs and compute the mean x̄), then x̄ will be found to have a Normal distribution centred on the true mean µ.
We note emphatically that we do not know µ. In the first part of the discussion below, we assume
that σ 2 is known. However, this is typically untrue, and we must use an estimate for the standard
deviation, as in eqn. 11.4.
Figure 11.1 (Maindonald & Braun 2007, p. 103) shows two sampling distributions, for a random
variable X which has µX = 10, σ = 1; Figure 11.1(a) shows the sampling distribution for a sample
size of n = 4, while Figure 11.1(b) shows the sampling distribution for a sample size of n = 9; the
distribution of X, corresponding to a sample size of n = 1 is shown for comparison.
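The narrowing of the sampling distribution can be reproduced by simulation. The sketch below uses the same µ = 10, σ = 1 and sample sizes 1, 4 and 9 as Figure 11.1; the number of repetitions and the seed are assumptions.

```r
set.seed(2)                 # for reproducibility
m <- 20000                  # number of repeated samples per sample size
sizes <- c(1, 4, 9)

# Standard deviation of the sample mean, estimated by simulation
sd_of_means <- sapply(sizes, function(n) {
  xbar <- replicate(m, mean(rnorm(n, mean = 10, sd = 1)))
  sd(xbar)
})

theory <- 1 / sqrt(sizes)   # sigma / sqrt(n) with sigma = 1
print(round(rbind(simulated = sd_of_means, theory = theory), 3))
```

The simulated values should be close to 1, 0.5 and 0.333, in line with eqn. 11.6.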
The useful formula now is, including standardisation:
If the estimator for µ (unknown) is x̄ and σ is known then
$$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1). \tag{11.7}$$
On the other hand, if σ is unknown, and we must replace σ with an estimate, s, see eqn. 11.4, then
$$\frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{n-1}, \tag{11.8}$$
where t_{n−1} is the Student t distribution with n − 1 degrees of freedom; see section 6.22. As with N(0, 1), we have tables for the t distribution.
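In R, the tables are replaced by the quantile functions qnorm and qt. The short sketch below compares the 97.5% points of the t distribution with the Normal value of 1.96; the particular degrees of freedom are chosen only for illustration.

```r
z_crit <- qnorm(0.975)                 # about 1.96
df     <- c(4, 9, 29, 99)              # degrees of freedom n - 1 for n = 5, 10, 30, 100
t_crit <- qt(0.975, df = df)           # about 2.776, 2.262, 2.045, 1.984

print(round(rbind(df = df, t_crit = t_crit), 3))
```

The t values are always larger than 1.96, reflecting the extra uncertainty from estimating σ, and they approach 1.96 as n grows.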
Figure 11.1: (a) Sampling distribution for a sample size of n = 4; (b) sampling distribution for a
sample size of n = 9; the distribution of X, corresponding to a sample size of n = 1 is shown for
comparison.
11.5.2 Sampling Distribution for Estimates of the Standard Deviation
If the estimator for σ (unknown) is s, see eqn. 11.4, and µ is also unknown, with estimate x̄, then
$$\frac{(n-1)s^2}{\sigma^2} = \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{\sigma}\right)^2 \sim \chi^2_{n-1}, \tag{11.9}$$
where χ²_n is the Chi-squared distribution with n degrees of freedom; see section 6.4.10. As with N(0, 1) and t_ν, we have tables for the χ²_n distribution.
11.6 Confidence Intervals
In section 11.5.1 we established that the distribution of the sample mean is x̄ ∼ N(µ, σ/√n) or, equivalently (eqn. 11.7), (x̄ − µ)/(σ/√n) ∼ N(0, 1). This tells us that the estimate has a distribution that is centred on the mean, that the expected value of the estimate is the mean, and that the distribution will have a standard deviation (spread) of σ/√n.
Thus, referring to Figure 11.1(a), we can say that the mean of x̄4 is µ, the true mean (which we do not know), and that different samples would vary between about 1.5σ above and below the true mean. Hence if the true mean is 10 as in the diagram, and we kept repeating our sampling experiment, we would expect the estimate x̄4 to vary between about 8.5 and 11.5.
On the other hand, if we used sample size n = 9, we would expect the estimate x̄9 to vary between
about 9.0 and 11.0, see Figure 11.1(b).
The previous few sentences should be suggesting that we should be able to give a plausible interval
estimate such as we estimate that the mean is between 9 and 11, together with a probability for
that assertion, e.g. about 0.95 as discussed in section 9.3 for P (−1.96 < Z < +1.96). But
unfortunately we cannot, for we do not know the true mean.
What can we say? Well, for example, that P(−1.96 < (x̄ − µ)/(σ/√n) < +1.96) = 0.95. Still not much good, for we do not know µ, and we must be satisfied with the less useful statement that the estimate x̄ is within plus-or-minus 1.96 × σ/√n of µ, with a probability of 0.95.
More explanation may be needed. What if x̄ is at one of these extremes, namely µ − 1.96 × σ/√n? This would correspond to about 9 in Figure 11.1(a). We can then say that x̄ + 1.96 × σ/√n just about reaches up to µ. If we repeat the sampling, x̄ + 1.96 × σ/√n will reach at least as far up as µ with probability 1 − 0.025 = 0.975, since the amount of probability up to Z = −1.96 is 0.025.
Similarly, take the case that x̄ is at the other extreme, namely µ + 1.96 × σ/√n; this would correspond to about 11 in Figure 11.1(a). We can now say that x̄ − 1.96 × σ/√n just about reaches down to µ. If we repeat the sampling, x̄ − 1.96 × σ/√n will reach at least as far down as µ with probability 1 − 0.025 = 0.975 (recall the symmetry argument in section 9.3).
Consequently, if we take x̄ ± 1.96 × σ/√n we can say that this interval will capture µ with probability 0.95.
This allows us to construct a confidence interval which we can claim contains µ; that is, we compute not µ̂, but (L, U), an interval between (L)ower and (U)pper limits which we believe contains µ. In the case of confidence (probability) 0.95 = 95%, we can compute
$$(L, U) = \left(\bar{x} - 1.96 \times \frac{\sigma}{\sqrt{n}},\; \bar{x} + 1.96 \times \frac{\sigma}{\sqrt{n}}\right). \tag{11.10}$$
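Eqn. 11.10 is straightforward to compute in R. In the sketch below the nine weights are made-up values and σ = 3 is assumed known, as in Example 29; qnorm(0.975) supplies the 1.96.

```r
x <- c(164.2, 167.9, 161.5, 166.3, 163.8, 168.1, 165.4, 162.7, 166.0)  # hypothetical sample
sigma <- 3                    # assumed known
n <- length(x)
xbar <- mean(x)               # point estimate, 165.1

half <- qnorm(0.975) * sigma / sqrt(n)     # 1.96 * sigma / sqrt(n)
ci <- c(L = xbar - half, U = xbar + half)  # eqn 11.10
print(round(ci, 2))
```

For a different confidence level, say 99%, only the quantile changes: qnorm(0.995) replaces qnorm(0.975).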
Summary on Point Estimation and Confidence Interval for the Mean when Variance Known
Refer to Figure 11.1, part (b) of which is based on a sample size of n = 9.
• If we take a point estimate for the mean, it will be distributed according to the narrow distribution, i.e. if the true mean is 10, our estimate can be anywhere between 9 and 11.
• If we decide to give an interval estimate, we need to decide on a confidence (probability); the wider the interval, the greater the confidence we can have in it, but a huge interval with confidence of 100% is not much use to anyone. The usual confidence that is chosen is 95%.
• We would like to be able to look at Figure 11.1(a), for which n = 4 and σ/√n = 0.5, and say that our interval for the mean is 9 to 11 with confidence 95% (based on the diagram this is approximate; 10 − 1.96 × 0.5 to 10 + 1.96 × 0.5 are the precise values for 95%). But we cannot make a statement like the latter, for we do not know that µ = 10.
• The best we can do is (a) take our estimate, x̄; (b) place a distribution like that in Figure 11.1(b) about it; (c) compute the interval x̄ ± 1.96 × σ/√n, roughly x̄ ± 2σ/√n (eqn. 11.10). This allows us to state: if we repeated our sampling a large number of times, and we computed eqn. 11.10 each time (getting a different interval), then 95% of these intervals would contain the true mean µ.
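The repeated-sampling statement can be checked by simulation. The sketch below uses µ = 10, σ = 1 and n = 9 as in Figure 11.1(b); the number of repetitions and the seed are assumptions.

```r
set.seed(3)                    # for reproducibility
mu <- 10; sigma <- 1; n <- 9   # values from Figure 11.1(b)
m <- 10000                     # number of repeated samples

half <- qnorm(0.975) * sigma / sqrt(n)   # half-width of the interval in eqn 11.10

# For each repeated sample, does the interval contain the true mean?
covered <- replicate(m, {
  xbar <- mean(rnorm(n, mu, sigma))
  (xbar - half < mu) && (mu < xbar + half)
})

coverage <- mean(covered)      # should be close to 0.95
print(coverage)
```

In the simulation we know µ, so we can count how often the interval succeeds; in a real experiment we have just one interval and do not know whether it is one of the 95%.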
Excel-Example 1 Need Excel example here.
Need section on t-distribution and small sample sampling distrib. for mean with std.-dev.
unknown.
Chapter 12 Hypothesis Testing
12.1 Introduction
In Chapter 11 we discussed estimation of parameters, both point estimates and interval estimates
(with confidence value attached). This chapter is also based on sampling theory but here we are
interested in decisions rather than estimates. For example, based on a sample of occurrences of
heads and tails in a sample of n = 10 tosses of a coin, we might wish to come to the decision
whether the coin is fair. We might want to decide whether application of a new fertiliser really
does increase cropping yield, based on samples involving (i) the current fertiliser and (ii) the new
one.
The hypothesis testing technique involves the postulation of a hypothesis (an assumption, a statement about population distributions or their parameters) and then designing an experiment which
will yield a sample upon which we can decide whether the hypothesis is true — based on sample
data.
A typical hypothesis test is as follows. We make a hypothesis that a random variable is distributed
according to fX (x), e.g. X ∼ N(µ, σ), where we assume that σ is known.
We identify a null hypothesis, H0 : µ = µ0 and an alternative hypothesis, HA : µ > µ0 .
We compute a test statistic (a sample estimate with sample size n), for example µ̂ = X̄n and
reject H0 if X̄n > c, where c is some constant to be determined; X̄n > c is the critical region;
X̄n ≤ c is called the acceptance region.
The greater we make c, the smaller the probability of rejecting H0 when it is in fact true, i.e. the more stringent the test X̄n > c. We can set c using the same considerations we used in setting confidence levels for a confidence interval in section 11.6. As in eqn. 11.7, we know that
$$Z = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1), \tag{12.1}$$
so that, taking µ = µ0 as H0 asserts, we can use the Normal cdf Φ(z) to choose a c′ such that
$$P(Z > c') = P\left(\frac{\bar{X}_n - \mu_0}{\sigma/\sqrt{n}} > c'\right) = P\left(\bar{X}_n > \frac{c'\sigma}{\sqrt{n}} + \mu_0\right) = 0.025,$$
say, for a 2.5% significance level. (I have chosen 2.5% = 0.025 because it corresponds to a cutoff point, Z = 1.96, that we have already encountered.)
That is, Z > c′ would occur only 2.5% of the time if H0 is true; in other words the critical region stretches from c′ to the right of it. The acceptance region stretches to the left of c′, i.e. it includes everywhere that X̄n ≤ c, where c = c′σ/√n + µ0.
Recalling P(Z > +1.96) = 0.025, we can set c′ = 1.96 for a significance level of 0.025.
The latter corresponds to a one sided test.
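The one sided test can be sketched in R as follows. The values µ0 = 165, σ = 3, n = 9 and the observed mean 167.5 are assumptions made for illustration; the variable names are also illustrative.

```r
mu0 <- 165; sigma <- 3; n <- 9    # hypothetical null mean, known sigma, sample size
alpha <- 0.025                    # significance level

c_prime <- qnorm(1 - alpha)                   # cutoff on the Z scale, about 1.96
c_crit  <- mu0 + c_prime * sigma / sqrt(n)    # cutoff on the scale of the sample mean

xbar <- 167.5                                 # hypothetical observed sample mean
z <- (xbar - mu0) / (sigma / sqrt(n))         # eqn 12.1 with mu = mu0, gives 2.5
reject <- xbar > c_crit                       # equivalent to z > c_prime
print(c(c_crit = c_crit, z = z, reject = reject))
```

Here the observed mean lies in the critical region, so H0 would be rejected at the 2.5% level.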
The standard normal pdf and the relevant critical region is shown in Figure 12.1 (Maindonald &
Braun 2007, p. 106).
Figure 12.1: One side hypothesis test, significance level = 0.025; critical region is shaded to the
right of 1.96. For a two sided test with significance level = 0.05, we include in the critical region
also the marked region to the left of -1.96.
Let us keep the original null hypothesis, H0 : µ = µ0, and now choose a different alternative hypothesis, namely HA : µ ≠ µ0. A suitable acceptance region for this might be cl < X̄n < ch, with the critical (rejection) region being all points below cl and all points above ch.
If we now choose a significance level of 0.05, we arrive at the familiar P(Z < −1.96 or Z > +1.96) = 0.05; that is, if we have µ = µ0, then values of Z < −1.96 or Z > +1.96, i.e.
$$\frac{\bar{X}_n - \mu_0}{\sigma/\sqrt{n}} < -1.96 \quad \text{or} \quad \frac{\bar{X}_n - \mu_0}{\sigma/\sqrt{n}} > +1.96,$$
should occur only 5% of the time and this is a sufficiently significant deviation for us to reject the null hypothesis.
This is a two sided test.
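A two sided decision rule differs from the one sided rule only in splitting the significance level between the two tails. In the sketch below the four standardised test statistics are made-up values.

```r
alpha  <- 0.05
z_crit <- qnorm(1 - alpha / 2)          # about 1.96; alpha/2 = 0.025 in each tail

z <- c(-2.3, 0.7, 1.8, 2.5)             # hypothetical standardised test statistics
reject <- abs(z) > z_crit               # TRUE where Z falls in either tail
print(data.frame(z = z, reject = reject))

# Total probability of the critical region when H0 is true
p_critical <- pnorm(-z_crit) + (1 - pnorm(z_crit))
print(p_critical)
```

Only the values −2.3 and 2.5 fall in the critical region; 1.8 is in the acceptance region.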
The significance level, usually denoted α, corresponds to the probability of rejecting H0 when H0
is true, that is, the extreme values in the critical region could occur, but with a small probability,
α.
Table 12.1 shows the possible outcomes of the hypothesis test.

           Accept H0               Reject H0
H0 true    correct                 Type 1 error, prob. α
HA true    Type 2 error, prob. β   correct

Table 12.1: Outcomes of a hypothesis test.
Chapter 13 Sampling
13.1 Introduction
To be completed.
Chapter 14 Classification and Pattern Recognition
14.1 Introduction
The terms classification and pattern recognition are used almost synonymously; statisticians tend to favour classification, while engineers tend to use pattern recognition. This chapter merely introduces the concepts; Chapters 15, 16, 17, 18 and 19 fill in the details.
These chapters are a reworking of some of the basic pattern recognition and neural network material covered in (Campbell 2005), (Campbell & Murtagh 1998) and (Campbell 2000).
We define/summarize a pattern recognition system using the block diagram in Figure 14.1.
[Block diagram: sensed data x → Pattern Recognition System (Classifier) → class ω]
Figure 14.1: Pattern recognition system; x is a tuple of p measurements, and the output ω is the class label.
Typically textbooks distinguish between supervised classification and unsupervised classification.
Supervised classification Supervised (trained) classification may be posed as a prediction problem rather like regression. The prediction involves class labels.
We have a set of examples, a sample, which we call training data, $X_T = \{x_i, \omega_i\}_{i=1}^{n}$. We learn population parameters from the sample of x’s.
Warning: in some classification and pattern recognition literature, the term sample takes on a
different meaning from the standard statistical term — where a statistical sample means a set of
random vectors taken from a population; in the pattern recognition literature a sample may mean
a single random vector, so that a statistical sample will have to be termed a set of samples.
x is the pattern vector — of course in certain situations x is a simple scalar. ω is the class label,
ω ∈ Ω = {ω1 , . . . , ωc }.
Then, given an unseen pattern x (a random vector), we predict ω. In general, x = (x0 x1 . . . xp−1 )T ,
a p-dimensional vector; T denotes transposition.
Unsupervised classification Unsupervised classification is more of an exploratory data analysis technique than is supervised classification.
In this case we have a set of patterns (random vectors) $X_T = \{x_i\}_{i=1}^{n}$ and we want to explore structure in the set. For example, are they clustered, thereby suggesting that the clusters identify a number of classes? Clustering involves assigning class labels to the $\{x_i\}_{i=1}^{n}$ based not on training data but on proximity of the x’s or some other criterion.
Chapter 15 Simple Classifier Methods
15.1 Thresholding for one-dimensional data
Let us assume that we want to classify a chemical product, for example fake pharmaceutical drugs, according to the results of a chemical analysis. The analysis data comprise a vector x where x1 might be percentage mass of component 1, x2 that of component 2, etc. The label ω might be country of origin, and it is this that we want to predict, given the results x from an analysis of a newly seized batch.
For the moment, we’ll assume just two classes ω0 and ω1; two-class problems are easy to describe, yet extension to n-class problems is easy.
In our simplistic recognition system we need to recognise two sources, country 0 and country 1, ω0 and ω1. We start off with two components x = (x1 x2)T.
As described in Chapter 14, we have earlier obtained examples of the drug from both countries, $X_T = \{x_i, \omega_i\}_{i=1}^{n}$, i.e. we have training data, or a sample.
Let us see whether we can recognise using component 1 alone (x1). Figure 15.1 shows some (training) data. We see that a threshold (T) set at about x1 = 2.8 is the best we can do; the classification algorithm is:
$$\omega = 1 \text{ when } x_1 \ge T, \tag{15.1}$$
$$\omega = 0 \text{ otherwise.} \tag{15.2}$$
Use of histograms, see Figure 15.2 might be a more methodical way of determining the threshold,
T.
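The threshold rule of eqns. 15.1 and 15.2 can be sketched in a few lines of R. The training values and labels below are made up for illustration and are not the data of Figure 15.1; the function name classify_threshold is likewise an assumption.

```r
# Threshold classifier, eqns 15.1 and 15.2 (function name is illustrative)
classify_threshold <- function(x1, threshold) ifelse(x1 >= threshold, 1, 0)

# Hypothetical training data with some overlap between the classes
x1    <- c(1.2, 1.8, 2.1, 2.5, 3.0, 2.6, 3.2, 3.9, 4.4, 5.1)
label <- c(0,   0,   0,   0,   0,   1,   1,   1,   1,   1)

pred   <- classify_threshold(x1, threshold = 2.8)
errors <- sum(pred != label)     # number of misclassified training points
print(table(label = label, predicted = pred))
```

Because the classes overlap in x1, no threshold classifies every training point correctly; with these values, T = 2.8 misclassifies two points.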
If enough training data were available, n → ∞, the histograms, h0(x1), h1(x1), properly normalised, would approach probability densities p0(x1), p1(x1), more properly called class conditional probability densities (pdfs): p(x1 | ω), ω = 0, 1, see Figure 15.3.
When the random vector is three-dimensional (p = 3) or more, it becomes impractical to estimate the pdfs using histogram binning: there are a great many bins, and most of them contain no data. In such cases it is usual to assume a distribution family, for example Normal, and to represent the class conditional pdfs using parameters estimated from a sample (training data; estimation = training); see Chapter 11.
Figure 15.1: Component 1 x1. [Class 0 and class 1 training points plotted along the x1 axis, 1 to 6, with the threshold T marked.]
Figure 15.2: Histogram of component 1 x1. [Histograms h0(x1) and h1(x1) of frequency against x1, with the threshold T marked.]
Figure 15.3: Class conditional pdfs. [Densities p(x1 | 0) and p(x1 | 1) against x1, with the threshold T marked.]
The use of explicitly statistical methods is described in Chapter 16 but for now we’ll try some intuitive methods; as you will see, we are never far from statistics.
15.2 Linear separating lines/planes for two-dimensions
Since there is overlap in the component-1, x1 , measurement, let us use the two components,
x = (x1 x2 )T , i.e. (component-1, component-2). Figure 15.4 shows a scatter plot of these data
(the sample).
Figure 15.4: Two dimensions, scatter plot. [Class 0 and class 1 points plotted with x1 (1 to 6) on the horizontal axis and x2 (1 to 5) on the vertical axis, with a dotted separating line.]
The dotted line shows that the data are separable by a straight line; it intercepts the axes at
x1 = 4.5 and x2 = 6.
Apart from plotting the data and drawing the line, how could we derive the separating line from the data? (Thinking of a computer program.)
15.3 Nearest mean classifier
First we estimate the class conditional means µ0 = E(x | ω = ω0) and µ1 = E(x | ω = ω1).
Figure 15.5 shows the line joining the class means and the perpendicular bisector of this line; the perpendicular bisector turns out to be the separating line. We can derive the equation of the separating line using the fact that points on it are equidistant from both means, µ0, µ1, and expand using Pythagoras’s theorem,
$$|x - \mu_0|^2 = |x - \mu_1|^2, \tag{15.3}$$
$$(x_1 - \mu_{01})^2 + (x_2 - \mu_{02})^2 = (x_1 - \mu_{11})^2 + (x_2 - \mu_{12})^2. \tag{15.4}$$
Expanding the squares, cancelling the x1² and x2² terms and collecting what remains, we eventually obtain
$$2(\mu_{01} - \mu_{11})x_1 + 2(\mu_{02} - \mu_{12})x_2 - (\mu_{01}^2 + \mu_{02}^2 - \mu_{11}^2 - \mu_{12}^2) = 0, \tag{15.5}$$
Figure 15.5: Two dimensional scatter plot showing means and separating line. [Class 0 and class 1 points plotted with x1 (1 to 6) on the horizontal axis and x2 (1 to 5) on the vertical axis.]
which is of the form
b1 x1 + b2 x2 − b0 = 0.
(15.6)
In Figure 15.5, µ01 = 4, µ02 = 3, µ11 = 2, µ12 = 1.5; with these values, eqn 15.6 becomes
4x1 + 3x2 − 18.75 = 0,
(15.7)
which intercepts the x1 axis at 18.75/4 ≈ 4.7 and the x2 axis at 18.75/3 = 6.25.
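The nearest mean classifier is easily sketched in R. The class means are those quoted for Figure 15.5; the function name nearest_mean and the two test points are assumptions made for illustration. The sketch also recovers the coefficients of eqn 15.7 from the means.

```r
mu0 <- c(4, 3)      # class 0 mean quoted for Figure 15.5
mu1 <- c(2, 1.5)    # class 1 mean quoted for Figure 15.5

# Coefficients of the separating line b1*x1 + b2*x2 - b0 = 0 (eqn 15.7)
b  <- 2 * (mu0 - mu1)            # (4, 3)
b0 <- sum(mu0^2) - sum(mu1^2)    # 18.75

# Assign x to the class whose mean is nearer (squared Euclidean distance)
nearest_mean <- function(x) {
  d0 <- sum((x - mu0)^2)
  d1 <- sum((x - mu1)^2)
  if (d0 < d1) 0 else 1
}

class_a <- nearest_mean(c(4.5, 3.5))   # hypothetical point, nearer to mu0
class_b <- nearest_mean(c(1, 1))       # hypothetical point, nearer to mu1
print(c(b, b0, class_a, class_b))
```

Comparing the two distances is equivalent to checking the sign of b1x1 + b2x2 − b0, so no line needs to be drawn.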
15.4 Normal form of the separating line, projections, and linear discriminants
Eqn 15.6 becomes more interesting and useful in its normal form,
$$a_1 x_1 + a_2 x_2 - a_0 = 0, \tag{15.8}$$
where $a_1^2 + a_2^2 = 1$; eqn 15.8 can be obtained from eqn 15.6 by dividing across by $\sqrt{b_1^2 + b_2^2}$.
Figure 15.6 shows interpretations of the normal form straight line equation, eqn 15.8. The coefficients of the unit vector normal to the line are n = (a1 a2)T and a0 is the perpendicular distance from the line to the origin. Incidentally, the components correspond to the direction cosines of n = (a1 a2)T = (cos θ sin θ)T. Thus (Foley, van Dam, Feiner, Hughes & Phillips 1994), n corresponds to one row of a (frame) rotation matrix; in other words, see section 15.5 below, the dot product of the vector expression of a point with n corresponds to projection onto n. (Note that cos(π/2 − θ) = sin θ.)
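Figure 15.6: Normal form of a straight line, interpretations. [The line a1x1 + a2x2 − a0 = 0 with normal vector (a1, a2) at angle θ to the x1 axis, perpendicular distance a0 from the origin, and axis intercepts a0/a1 and a0/a2; a point (x1′, x2′) on the normal side has a1x1′ + a2x2′ − a0 > 0 and a point (x1′′, x2′′) on the other side has a1x1′′ + a2x2′′ − a0 < 0.]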
Also as shown in Figure 15.6, points x = (x1 x2 )T on the side of the line to which n = (a1 a2 )T
points have a1 x1 + a2 x2 − a0 > 0, whilst points on the other side have a1 x1 + a2 x2 − a0 < 0; as we
know, points on the line have a1 x1 + a2 x2 − a0 = 0.
15.5 Projection and linear discriminant
We know that a1x1 + a2x2 = aT x, the dot product of n = (a1 a2)T and x, represents the projection of points x onto n, yielding the scalar value along n, with a0 fixing the origin. This is plausible: projecting onto n yields optimum separability.
Such a projection,
$$g(x) = a_1 x_1 + a_2 x_2, \tag{15.9}$$
is called a linear discriminant; now we can adapt eqns. 15.1 and 15.2,
$$\omega = 0 \text{ when } g(x) > a_0, \tag{15.10}$$
$$\omega = 1 \text{ when } g(x) < a_0, \tag{15.11}$$
$$\text{tie when } g(x) = a_0. \tag{15.12}$$
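The normal form and the discriminant rule can be sketched together in R, starting from the coefficients of eqn 15.7. The function names g and classify are assumptions made for illustration, and NA is used here to mark a tie.

```r
b  <- c(4, 3); b0 <- 18.75       # coefficients of eqn 15.7
norm_b <- sqrt(sum(b^2))         # sqrt(b1^2 + b2^2) = 5

a  <- b / norm_b                 # unit normal n = (0.8, 0.6)
a0 <- b0 / norm_b                # 3.75, perpendicular distance of the line from the origin

g <- function(x) sum(a * x)      # projection of x onto n, eqn 15.9

classify <- function(x) {        # eqns 15.10 to 15.12
  gx <- g(x)
  if (gx > a0) 0 else if (gx < a0) 1 else NA   # NA marks a tie
}

print(c(a, a0, classify(c(4, 3)), classify(c(2, 1.5))))
```

The class means themselves project to 5 and 2.5, which lie on either side of a0 = 3.75, as expected for a perpendicular bisector.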
Linear discriminants, eqn. 15.9, are often written as
$$g(x) = a_1 x_1 + a_2 x_2 - a_0, \tag{15.13}$$
whence eqns. 15.10 to 15.12 become
$$\omega = 0 \text{ when } g(x) > 0, \tag{15.14}$$
$$\omega = 1 \text{ when } g(x) < 0, \tag{15.15}$$
$$\text{tie when } g(x) = 0. \tag{15.16}$$
15.6 Projections and linear discriminants in p dimensions
Equation 15.13 readily generalises to p dimensions; n is then a unit vector in p-dimensional space, normal to the (p − 1)-dimensional separating hyperplane. For example, when p = 3, n is the unit vector normal to the separating plane.
Other important projections used in pattern recognition are Principal Components Analysis (PCA) and Fisher’s Linear Discriminant Analysis (LDA), see Chapter 17.
15.7 Template Matching and Discriminants
An intuitive (but well founded) classification method is that of template matching or correlation matching. Here we have perfect or average examples of classes stored in vectors $\{z_j\}_{j=1}^{c}$, one for each class. Without loss of generality, we assume that all vectors are normalised to unit length.
Classification of a newly arrived vector x entails computing its template/correlation match with all c templates:
$$x^T z_j; \tag{15.17}$$
class ω is chosen as the j corresponding to the maximum of eqn. 15.17.
Yet again we see that classification involves dot product, projection, and a linear discriminant.
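Template matching amounts to one matrix-vector product. In the sketch below the two templates and the test vector are made-up values, and normalise is a helper name introduced here for illustration.

```r
normalise <- function(v) v / sqrt(sum(v^2))   # scale a vector to unit length

# Hypothetical templates z_1 and z_2, one per row, normalised to unit length
templates <- rbind(normalise(c(4, 3)),
                   normalise(c(1, 3)))

x <- normalise(c(3.5, 3.2))               # hypothetical newly arrived vector

scores <- as.vector(templates %*% x)      # x^T z_j for each template, eqn 15.17
chosen <- which.max(scores)               # index j of the best match
print(round(scores, 3))
```

With unit-length vectors each score is the cosine of the angle between x and the template, so the maximum score picks the template that points in the most similar direction.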
15.8 Nearest neighbour methods
Obviously, we may not always have the linear separability of Figure 15.5. One non-parametric
method is to go beyond nearest mean, see eqn. 15.4, to compute the nearest neighbour in the
entire training data set, and to decide class according to the class of the nearest neighbour.
A variation is k-nearest neighbour, where a vote is taken over the classes of the k nearest neighbours.
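A k-nearest neighbour classifier can be sketched in base R as below. The function knn_classify, the six training points and the two query points are all assumptions made for illustration; ties in the vote are resolved in favour of the first class in the vote table.

```r
# k-nearest neighbour vote using Euclidean distance (illustrative implementation)
knn_classify <- function(x, train_x, train_label, k = 1) {
  d <- sqrt(rowSums(sweep(train_x, 2, x)^2))   # distance from x to every training point
  nearest <- order(d)[1:k]                     # indices of the k nearest neighbours
  votes <- table(train_label[nearest])         # count the classes among them
  as.numeric(names(votes)[which.max(votes)])   # majority class
}

# Hypothetical training data: one pattern vector per row
train_x <- rbind(c(4, 3), c(4.5, 3.5), c(3.5, 2.8),
                 c(2, 1.5), c(1.5, 1), c(2.5, 1.8))
train_label <- c(0, 0, 0, 1, 1, 1)

class_a <- knn_classify(c(3.8, 3.1), train_x, train_label, k = 3)
class_b <- knn_classify(c(2.2, 1.6), train_x, train_label, k = 1)
print(c(class_a, class_b))
```

Unlike the nearest mean classifier, nothing is estimated in advance: the whole training set is kept and searched for each new pattern.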
Chapter 16 Statistical Classifier Methods
16.1 One-dimensional classification revisited
Recall Figure 15.3, repeated here as Figure 16.1.
Figure 16.1: Class conditional densities. [Densities p(x1 | 0) and p(x1 | 1) against x1, with the threshold T marked.]
We have class conditional pdfs $p(x_1 \mid \omega)$, $\omega = 0, 1$; given a newly arrived $x_1'$ we might decide
on its class according to the maximum class conditional pdf at $x_1'$, i.e. set a threshold $T$ where
$p(x_1 \mid 0)$ and $p(x_1 \mid 1)$ cross, see Figure 16.1.
This is not completely correct. What we want is the probability of each class, its posterior
probability, based on the evidence supplied by the data, combined with any prior evidence.
In what follows, $P(\omega_i \mid x)$ is the posterior probability or a posteriori probability of class $\omega_i$ given the
observation $x$; $P(\omega_i)$ is the prior probability or a priori probability. We use upper case $P(.)$ for
discrete probabilities, whilst lower case $p(.)$ denotes probability densities.
Bayes’ Rule
Recall Bayes’ rule from eqn. 5.22, repeated here,
$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{k=1}^{n} P(B \mid A_k)\,P(A_k)}. \tag{16.1}$$
This says that the posterior probability of $A_i$ given $B$ (conditional on $B$ having occurred) is the
product of the conditional probability of $B$ given $A_i$ and the prior probability of $A_i$, all divided by
$P(B) = \sum_{k=1}^{n} P(B \mid A_k)\,P(A_k)$.
We can rewrite eqn. 16.1 in terms of our random variable $x$ ($= B$) and our classes $\omega_0, \omega_1$ ($= A_i$, $i = 0, 1$) to get
$$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\,P(\omega_i)}{\sum_{k=0}^{1} P(x \mid \omega_k)\,P(\omega_k)}. \tag{16.2}$$
$P(\omega_i \mid x)$ is the posterior probability of class $\omega_i$ given that our analysis has yielded $x$; $P(\omega_i)$ is the
prior probability; if we have no prior preference, then $P(\omega_0) = P(\omega_1) = 0.5$.
Eqn. 16.2 forms a Bayes decision rule: compute the two posterior probabilities and take the class
which has the maximum.
Let the Bayes decision rule be represented by a function $g(.)$ of the feature vector $x$:
$$g(x) = \arg\max_{\omega_j \in \Omega} \, [P(\omega_j \mid x)] \tag{16.3}$$
To show that the Bayes decision rule, eqn. 16.3, achieves the minimum probability of error, we
compute the probability of error conditional on the feature vector x — the conditional risk —
associated with it:
$$R(g(x) = \omega_j \mid x) = \sum_{k=1,\,k \neq j}^{c} P(\omega_k \mid x). \tag{16.4}$$
That is to say, for the point x we compute the posterior probabilities of all the c − 1 classes not
chosen.
Since $\Omega = \{\omega_1, \ldots, \omega_c\}$ form a partition (they are mutually exclusive and exhaustive) and the
$\{P(\omega_k \mid x)\}_{k=1}^{c}$ are probabilities and so sum to unity, eqn. 16.4 reduces to:
$$R(g(x) = \omega_j \mid x) = 1 - P(\omega_j \mid x). \tag{16.5}$$
It immediately follows that, to minimise $R(g(x) = \omega_j \mid x)$, we maximise $P(\omega_j \mid x)$, thus establishing
the optimality of eqn. 16.3.
The problem now is to determine P (ω | x) which brings us to Bayes’ rule.
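The decision rule and its conditional risk can be illustrated numerically. In the sketch below the likelihood values and priors are invented; they are chosen so that the class with the larger class conditional value is not the class with the larger posterior, which is the situation described at the start of this chapter.

```python
def posteriors(likelihoods, priors):
    """Bayes' rule: P(w_j | x) from p(x | w_j) and P(w_j)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x), the normalising denominator
    return [j / evidence for j in joint]

def bayes_decide(likelihoods, priors):
    """Return (chosen class index, conditional risk 1 - max posterior)."""
    post = posteriors(likelihoods, priors)
    j = max(range(len(post)), key=post.__getitem__)
    return j, 1.0 - post[j]

# Hypothetical values: class 1 has the larger likelihood at x,
# but class 0 has a much larger prior.
likelihoods = [0.2, 0.5]
priors = [0.8, 0.2]
chosen, risk = bayes_decide(likelihoods, priors)
```

Here the products are 0.16 and 0.10, so class 0 is chosen even though class 1 has the larger likelihood; the risk returned is the posterior probability of the class not chosen.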
16.2 Bayes’ Rule for the Inversion of Conditional Probabilities
From the definition of conditional probability, we have:
$$p(\omega, x) = P(\omega \mid x)\,p(x), \tag{16.6}$$
and, owing to the fact that the events in a joint probability are interchangeable, we can equate the
joint probabilities:
$$p(\omega, x) = p(x, \omega) = p(x \mid \omega)\,P(\omega). \tag{16.7}$$
Therefore, equating the right hand sides of these equations, and rearranging, we arrive at Bayes’
rule for the posterior probability P (ω | x):
$$P(\omega \mid x) = \frac{p(x \mid \omega)\,P(\omega)}{p(x)}. \tag{16.8}$$
$P(\omega)$ expresses our belief that $\omega$ will occur, prior to any observation. If we have no prior knowledge,
we can assume equal priors for each class: $P(\omega_1) = P(\omega_2) = \ldots = P(\omega_c)$, with $\sum_{j=1}^{c} P(\omega_j) = 1$. Although
we avoid further discussion here, we note that the choice of prior probabilities is the
subject of considerable discussion, especially in the literature on Bayesian inference; see, for example,
(Sivia 1996).
$p(x)$ is the unconditional probability density of $x$, and can be obtained by summing the conditional
densities, each weighted by its prior probability:
$$p(x) = \sum_{j=1}^{c} p(x \mid \omega_j)\,P(\omega_j). \tag{16.9}$$
Thus, to solve eqn. 16.8, it remains to estimate the conditional densities.
16.3 Parametric Methods
Where we can assume that the densities follow a particular form, for example Gaussian, the density
estimation problem is reduced to that of estimation of parameters.
The $p$-dimensional multivariate normal density, see section B.7, is given by:
$$p(x \mid \omega_j) = \frac{1}{(2\pi)^{p/2}\,|K_j|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu_j)^T K_j^{-1} (x - \mu_j)\right] \tag{16.10}$$
$p(x \mid \omega_j)$ is completely specified by $\mu_j$, the $p$-dimensional mean vector, and $K_j$, the corresponding
$p \times p$ covariance matrix:
$$\mu_j = E[x]_{\omega = \omega_j}, \tag{16.11}$$
$$K_j = E[(x - \mu_j)(x - \mu_j)^T]_{\omega = \omega_j}. \tag{16.12}$$
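The respective estimates from the training data are:
$$\hat{\mu}_j = \frac{1}{N_j} \sum_{n=1}^{N_j} x_n, \tag{16.13}$$
and,
$$\hat{K}_j = \frac{1}{N_j - 1} \sum_{n=1}^{N_j} (x_n - \hat{\mu}_j)(x_n - \hat{\mu}_j)^T, \tag{16.14}$$
where we have separated the training data $X_T = \{x_n, \omega_n\}_{n=1}^{N}$ into sets according to class.
Eqn. 16.13 is the maximum likelihood estimate of the mean; the divisor $N_j - 1$ in eqn. 16.14 gives the
unbiased estimate of the covariance, whereas the strict maximum likelihood estimate divides by $N_j$.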
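A short sketch of eqns. 16.13 and 16.14 follows: the training data are split by class label, then a mean vector and a covariance matrix (divisor $N_j - 1$) are computed for each class. The data and the function name are hypothetical.

```python
def class_estimates(samples):
    """Mean vector and covariance matrix (divisor N_j - 1) for one class."""
    n = len(samples)
    p = len(samples[0])
    mean = [sum(x[i] for x in samples) / n for i in range(p)]
    cov = [[sum((x[i] - mean[i]) * (x[k] - mean[k]) for x in samples) / (n - 1)
            for k in range(p)] for i in range(p)]
    return mean, cov

# Hypothetical training set of (feature vector, class label) pairs.
training = [([1.0, 2.0], 0), ([2.0, 4.0], 0), ([3.0, 6.0], 0),
            ([6.0, 1.0], 1), ([8.0, 1.0], 1)]

# Separate the training data into sets according to class.
by_class = {}
for x, label in training:
    by_class.setdefault(label, []).append(x)

estimates = {label: class_estimates(xs) for label, xs in by_class.items()}
```

Each class needs at least two samples for the divisor $N_j - 1$ to be non-zero, and in practice many more than $p$ samples if the estimated covariance matrix is to be invertible.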
16.4 Discriminants based on Normal Density
We may write eqn. 16.8 as a discriminant function:
$$g_j(x) = P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}, \tag{16.15}$$
so that classification, eqn. 16.3, becomes a matter of assigning $x$ to class $\omega_j$ if,
$$g_j(x) > g_k(x), \quad \forall\, k \neq j. \tag{16.16}$$
Since $p(x)$, the denominator of eqn. 16.15, is the same for all $g_j(x)$ and since eqn. 16.16 involves
comparison only, we may rewrite eqn. 16.15 as
$$g_j(x) = p(x \mid \omega_j)\,P(\omega_j). \tag{16.17}$$
We may derive a further possible discriminant by taking the logarithm of eqn. 16.17; since the
logarithm is a monotonically increasing function, applying it preserves the relative order of its
arguments:
$$g_j(x) = \log p(x \mid \omega_j) + \log P(\omega_j). \tag{16.18}$$
In the multivariate Gaussian case, eqn. 16.18 becomes (Duda & Hart 1973),
$$g_j(x) = -\frac{1}{2}(x - \mu_j)^T K_j^{-1} (x - \mu_j) - \frac{p}{2}\log 2\pi - \frac{1}{2}\log |K_j| + \log P(\omega_j) \tag{16.19}$$
Henceforth, we refer to eqn. 16.19 as the Bayes-Gauss classifier.
The multivariate normal (Gaussian) density provides a good characterisation of pattern (vector)
distribution where we can model the generation of patterns as ideal pattern plus measurement
noise; for an instance of a measured vector x from class ωj :
$$x_n = \mu_j + e_n, \tag{16.20}$$
where $e_n \sim N(0, K_j)$, that is, the noise covariance is class dependent.
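To make eqn. 16.19 concrete, the sketch below evaluates the Bayes-Gauss discriminant for $p = 2$, writing out the inverse and determinant of the $2 \times 2$ covariance matrix by hand. The class means, covariances and priors are hypothetical values chosen for illustration.

```python
import math

def bayes_gauss_2d(x, mean, cov, prior):
    """Eqn. 16.19 for p = 2, with the 2x2 inverse and determinant written out."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mean[0], x[1] - mean[1]]
    quad = sum(dx[i] * inv[i][k] * dx[k] for i in range(2) for k in range(2))
    p = 2
    return (-0.5 * quad - 0.5 * p * math.log(2 * math.pi)
            - 0.5 * math.log(det) + math.log(prior))

# Hypothetical class parameters.
classes = [
    {"mean": [0.0, 0.0], "cov": [[1.0, 0.0], [0.0, 1.0]], "prior": 0.5},
    {"mean": [3.0, 3.0], "cov": [[2.0, 0.5], [0.5, 1.0]], "prior": 0.5},
]
x = [1.0, 0.5]
scores = [bayes_gauss_2d(x, cl["mean"], cl["cov"], cl["prior"]) for cl in classes]
chosen = max(range(len(scores)), key=scores.__getitem__)
```

The scores are log-domain quantities, so they are negative and only their relative order matters.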
16.5 Bayes-Gauss Classifier – Special Cases
(Duda & Hart 1973, pp. 26–31)
Revealing comparisons with the other learning paradigms which play an important role in these notes
are made possible if we examine particular forms of noise covariance in which the Bayes-Gauss
classifier reduces to certain interesting limiting forms:
• Equal and Diagonal Covariances ($K_j = \sigma^2 I, \forall j$, where $I$ is the unit matrix); in this case certain
important equivalences with eqn. 16.19 can be demonstrated:
– Nearest mean classifier;
– Linear discriminant;
– Template matching;
– Matched filter;
– Single layer neural network classifier.
• Equal but Non-diagonal Covariance Matrices:
– Nearest mean classifier using Mahalanobis distance;
and, as in the case of diagonal covariance,
– Linear discriminant function;
– Single layer neural network.
16.5.1 Equal and Diagonal Covariances
When each class has the same covariance matrix, and these are diagonal with equal variances, we have $K_j = \sigma^2 I$, so
that $K_j^{-1} = \frac{1}{\sigma^2} I$. Since the covariance matrices are equal, we can eliminate the $\frac{1}{2}\log |K_j|$ term; the
$\frac{p}{2}\log 2\pi$ term is constant in any case; thus, including the simplification of the $(x - \mu_j)^T K_j^{-1} (x - \mu_j)$ term,
eqn. 16.19 may be rewritten:
$$g_j(x) = -\frac{1}{2\sigma^2}(x - \mu_j)^T (x - \mu_j) + \log P(\omega_j) \tag{16.21}$$
$$= -\frac{1}{2\sigma^2}\|x - \mu_j\|^2 + \log P(\omega_j). \tag{16.22}$$
Nearest mean classifier. If we assume equal prior probabilities $P(\omega_j)$, the second term in
eqn. 16.22 may be eliminated for comparison purposes and we are left with a nearest mean classifier.
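Linear discriminant. If we further expand the squared distance term, we have,
$$g_j(x) = -\frac{1}{2\sigma^2}(x^T x - 2\mu_j^T x + \mu_j^T \mu_j) + \log P(\omega_j), \tag{16.23}$$
which, since the term $x^T x$ is the same for every class and may be dropped for comparison purposes,
may be rewritten as a linear discriminant:
$$g_j(x) = w_{j0} + w_j^T x \tag{16.24}$$
where
$$w_{j0} = -\frac{1}{2\sigma^2}\mu_j^T \mu_j + \log P(\omega_j), \tag{16.25}$$
and
$$w_j = \frac{1}{\sigma^2}\mu_j. \tag{16.26}$$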
Template matching. In this latter form the Bayes-Gauss classifier may be seen to be performing
template matching or correlation matching, where $w_j = \text{constant} \times \mu_j$; that is, the prototypical
pattern for class $j$, the mean $\mu_j$, is the template.
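The equivalence between the distance form (eqn. 16.22) and the linear form (eqns. 16.24 to 16.26) can be checked numerically. In this sketch, with invented means, priors and variance, the two scores differ by the same amount, $x^T x / (2\sigma^2)$, for every class, so they rank the classes identically.

```python
import math

def nearest_mean_score(x, mean, sigma2, prior):
    """Eqn. 16.22: -||x - mu_j||^2 / (2 sigma^2) + log P(w_j)."""
    d2 = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -d2 / (2 * sigma2) + math.log(prior)

def linear_score(x, mean, sigma2, prior):
    """Eqns. 16.24 to 16.26: w_j0 + w_j^T x, with the x^T x term dropped."""
    w0 = -sum(m * m for m in mean) / (2 * sigma2) + math.log(prior)
    w = [m / sigma2 for m in mean]
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

means = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]   # hypothetical class means
priors = [0.5, 0.25, 0.25]                     # hypothetical priors
sigma2 = 1.5                                   # hypothetical common variance
x = [1.0, 2.5]

a = [nearest_mean_score(x, m, sigma2, p) for m, p in zip(means, priors)]
b = [linear_score(x, m, sigma2, p) for m, p in zip(means, priors)]
offset = sum(xi * xi for xi in x) / (2 * sigma2)   # the dropped class-independent term
```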
Matched filter. In radar and communications systems a matched filter detector is an optimum
detector of (subsequence) signals, for example, communication symbols. If the vector $x$ is written
as a time series (a digital signal), $x[n]$, $n = 0, 1, \ldots$, then the matched filter for each signal $j$ may
be implemented as a convolution:
$$y_j[n] = x[n] * h_j[n] = \sum_{m=0}^{N-1} x[n - m]\, h_j[m], \tag{16.27}$$
where the kernel $h_j[.]$ is a time-reversed template; that is, at each time instant, the correlation
between the template and the last $N$ samples of $x[.]$ is computed. Provided some threshold is exceeded,
the signal achieving the maximum correlation is detected.
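A minimal sketch of eqn. 16.27 for a single template is given below. The template values and the position at which it is embedded in the signal are invented; samples before time zero are assumed to be zero.

```python
def matched_filter(x, template):
    """Convolve x with the time-reversed template (eqn. 16.27).

    y[n] = sum_m x[n - m] h[m], with h[m] = template[N - 1 - m]; samples of x
    before time 0 are treated as zero.
    """
    n_taps = len(template)
    h = template[::-1]
    y = []
    for n in range(len(x)):
        acc = 0.0
        for m in range(n_taps):
            if n - m >= 0:
                acc += x[n - m] * h[m]
        y.append(acc)
    return y

template = [1.0, -1.0, 1.0, 1.0]                 # hypothetical symbol
x = [0.0, 0.0] + template + [0.0, 0.0, 0.0]      # symbol embedded at offset 2
y = matched_filter(x, template)
peak = max(range(len(y)), key=y.__getitem__)
```

The output peaks at the sample where the last $N$ inputs line up exactly with the template, and the peak value equals the energy of the template.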
Single Layer Neural Network. If we restrict the problem to two classes, we can write the classification rule as:
$$g(x) = g_1(x) - g_2(x) \geq 0 : \omega_1, \text{ otherwise } \omega_2 \tag{16.28}$$
$$= w_0 + w^T x, \tag{16.29}$$
where $w_0 = -\frac{1}{2\sigma^2}(\mu_1^T \mu_1 - \mu_2^T \mu_2) + \log \frac{P(\omega_1)}{P(\omega_2)}$
and $w = \frac{1}{\sigma^2}(\mu_1 - \mu_2)$.
In other words, eqn. 16.29 implements a linear combination, adds a bias, and thresholds the result;
that is, a single layer neural network with a hard-limit activation function.
(Duda & Hart 1973) further demonstrate that eqn. 16.22 implements a hyper-plane partitioning
of the feature space.
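The two-class rule of eqns. 16.28 and 16.29 can be sketched as a single thresholded linear unit. The means, variance and priors used here are hypothetical.

```python
import math

def two_class_neuron(mu1, mu2, sigma2, prior1, prior2):
    """Weights of eqn. 16.29 for K_j = sigma^2 I; returns (w0, w)."""
    w0 = (-(sum(a * a for a in mu1) - sum(b * b for b in mu2)) / (2 * sigma2)
          + math.log(prior1 / prior2))
    w = [(a - b) / sigma2 for a, b in zip(mu1, mu2)]
    return w0, w

def classify(x, w0, w):
    """Hard-limit output: class 1 if w0 + w^T x >= 0, otherwise class 2."""
    return 1 if w0 + sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 2

# Hypothetical means placed symmetrically about the origin, equal priors.
w0, w = two_class_neuron([2.0, 0.0], [-2.0, 0.0], 1.0, 0.5, 0.5)
```

With these symmetric means and equal priors the bias is zero and the separating hyperplane is the line $x_1 = 0$; unequal priors would shift it towards the less probable class.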
16.5.2 Equal but General Covariances
When each class has the same covariance matrix, $K$, eqn. 16.19 reduces to:
$$g_j(x) = -\frac{1}{2}(x - \mu_j)^T K^{-1} (x - \mu_j) + \log P(\omega_j) \tag{16.30}$$
Nearest Mean Classifier, Mahalanobis Distance. If we have equal prior probabilities $P(\omega_j)$, we
arrive at a nearest mean classifier where the distance calculation is weighted. The (squared) Mahalanobis
distance $(x - \mu_j)^T K^{-1} (x - \mu_j)$ effectively weights contributions according to inverse variance. Points
of equal Mahalanobis distance correspond to points of equal conditional density $p(x \mid \omega_j)$.
Linear Discriminant. Eqn. 16.30 may be rewritten as a linear discriminant, see also section 15.5:
$$g_j(x) = w_{j0} + w_j^T x \tag{16.31}$$
where
$$w_{j0} = -\frac{1}{2}\mu_j^T K^{-1} \mu_j + \log P(\omega_j), \tag{16.32}$$
and
$$w_j = K^{-1}\mu_j. \tag{16.33}$$
Weighted template matching, matched filter. In this latter form the Bayes-Gauss classifier may
be seen to be performing weighted template matching.
Single Layer Neural Network. As for the diagonal covariance matrix, it can be easily demonstrated that, for two classes, eqns. 16.31–16.33 may be implemented by a single neuron. The
only difference from eqn. 16.29 is that the non-bias weight vector, instead of being simply a difference
between means, is now that difference weighted by the inverse of the covariance matrix.
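The effect of the inverse covariance weighting can be seen in a small example. The sketch below, using an invented shared covariance with a large variance along $x_1$ and a small variance along $x_2$, shows a point that is nearest to one mean in Euclidean terms but nearest to the other in Mahalanobis terms.

```python
def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mu)^T K^{-1} (x - mu) for a 2x2 K."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx0, dx1 = x[0] - mean[0], x[1] - mean[1]
    return (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det

cov = [[4.0, 0.0], [0.0, 0.25]]            # hypothetical shared covariance K
means = [[0.0, 0.0], [3.0, 1.0]]           # hypothetical class means
x = [3.0, 0.0]

euclid = [sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for m in means]
mahal = [mahalanobis_sq(x, m, cov) for m in means]
```

A displacement of 3 along the high-variance axis counts for less than a displacement of 1 along the low-variance axis, so the Mahalanobis rule assigns the point to the first class while the Euclidean rule would assign it to the second.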
16.6 Least square error trained classifier
We can formulate the problem of classification as a least-square-error problem. If we require the
classifier to output a class membership indicator $\in [0, 1]$ for each class, we can write:
$$d = f(x) \tag{16.34}$$
where $d = (d_1, d_2, \ldots, d_c)^T$ is the $c$-dimensional vector of class indicators and $x$, as usual, the
$p$-dimensional feature vector.
We can express individual class membership indicators as:
$$d_j = b_0 + \sum_{i=1}^{p} b_i x_i + e. \tag{16.35}$$
In order to continue the analysis we need to refer to the theory of linear regression, see Chapter 20.
We repeat eqn. 20.12 here,
$$\hat{B} = (X^T X)^{-1} X^T Y \tag{16.36}$$
$X^T Y$ is a $(p + 1) \times c$ matrix, and $\hat{B}$ is a $(p + 1) \times c$ matrix of coefficients; that is, one column of
$p + 1$ coefficients for each class.
Eqn. 16.36 defines the training algorithm of our classifier.
Application of the classifier to an augmented feature vector $x = (1, x_1, \ldots, x_p)^T$ may be expressed as:
$$\hat{y} = \hat{B}^T x. \tag{16.37}$$
It remains to find the maximum of the $c$ components of $\hat{y}$.
In a two-class case, this least-square-error training algorithm yields an identical discriminant to
Fisher’s linear discriminant (Duda & Hart 1973). Fisher’s linear discriminant is described in Chapter 17.
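As an illustrative sketch of eqns. 16.36 and 16.37, the fragment below treats the simplest case of one feature plus a bias term, so that $X^T X$ is $2 \times 2$ and can be inverted by hand. The data, function names and the restriction to one feature are assumptions made to keep the example short.

```python
def fit_indicator_regression(xs, labels, n_classes):
    """Least-squares fit of class indicators on one feature plus a bias term.

    Returns B as a list of columns, one (b0, b1) pair per class, from
    B = (X^T X)^{-1} X^T Y with rows of X equal to (1, x_n).
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx * sx                      # determinant of X^T X
    columns = []
    for j in range(n_classes):
        ys = [1.0 if lab == j else 0.0 for lab in labels]   # indicator targets
        sy = sum(ys)
        sxy = sum(x * y for x, y in zip(xs, ys))
        b0 = (sxx * sy - sx * sxy) / det
        b1 = (n * sxy - sx * sy) / det
        columns.append((b0, b1))
    return columns

def classify(x, columns):
    """Pick the class whose fitted indicator b0 + b1 * x is largest."""
    outputs = [b0 + b1 * x for b0, b1 in columns]
    return max(range(len(outputs)), key=outputs.__getitem__)

xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]       # hypothetical one-dimensional features
labels = [0, 0, 0, 1, 1, 1]
B = fit_indicator_regression(xs, labels, n_classes=2)
```

Because the indicator targets sum to one for every training vector, the fitted outputs also sum to one, although individual outputs are not constrained to lie in $[0, 1]$.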
16.7 Generalised linear discriminant function
Eqn. 15.13 may be adapted to cope with any function(s) of the features $x_i$; we can define a new
feature vector $x'$ where:
$$x'_k = f_k(x). \tag{16.38}$$
In the pattern recognition literature, the linear discriminant, now applied to the vector $x'$ of eqn. 16.38, is called
the generalised linear discriminant function (Duda & Hart 1973).
It is desirable to escape from the fixed model of eqn. 16.38, in which the form of the $f_k(x)$ must be
known in advance. Multilayer perceptron (MLP) neural networks provide such a solution. We have
already shown the correspondence between the linear model, eqn. 20.8, and a single layer neural
network with a single output node and linear activation function. An MLP with appropriate non-linear activation functions, e.g. sigmoid, provides a model-free and arbitrary non-linear solution to
learning the mapping between $x$ and $y$ (Bishop 1995).
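A small sketch may help to show what a fixed feature map buys. Here the map appends the product $x_1 x_2$ to the two binary inputs, and a discriminant that is linear in the new features then reproduces exclusive-or, a function that no discriminant linear in $x_1, x_2$ alone can represent. The feature map and the hand-picked weights are illustrative assumptions.

```python
def features(x1, x2):
    """Fixed feature map f(x): append the product x1 * x2 (an illustrative choice)."""
    return [x1, x2, x1 * x2]

def discriminant(x1, x2, a0=-0.5, a=(1.0, 1.0, -2.0)):
    """Linear discriminant in the new feature space; weights chosen by hand."""
    return a0 + sum(ai * fi for ai, fi in zip(a, features(x1, x2)))

xor_outputs = {(x1, x2): int(discriminant(x1, x2) > 0)
               for x1 in (0, 1) for x2 in (0, 1)}
```

The drawback noted above is visible here: the product feature had to be chosen in advance by someone who already knew the structure of the problem.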
Chapter 17
Linear Discriminant Analysis and Principal
Components Analysis
17.1 Principal Components Analysis
Principal component analysis (PCA), also called the Karhunen-Loève transform (Duda, Hart & Stork
2000), is a linear transformation which maps a $p$-dimensional feature vector $x \in R^p$ to another
vector $y \in R^p$, where the transformation is optimised such that the components of $y$ contain
maximum information in a least-square-error sense. In other words, if we take the first $q \leq p$
components ($y' \in R^q$), then using the inverse transformation, we can reproduce $x$ with minimum
error. Yet another view is that the first few components of $y$ contain most of the variance, that is,
in those components, the transformation stretches the data maximally apart. It is this that makes
PCA good for visualisation of the data in two dimensions, i.e. the first two principal components
give an optimum view of the spread of the data.
We note however, unlike linear discriminant analysis, see section 17.2, PCA does not take account
of class labels. Hence it is typically a more useful visualisation of the inherent variability of the
data.
In general $x$ can be represented, without error, by the following expansion:
$$x = Uy = \sum_{i=1}^{p} y_i u_i \tag{17.1}$$
where $y_i$ is the $i$th component of $y$ and
$$U = (u_1, u_2, \ldots, u_p) \tag{17.3}$$
is an orthonormal matrix:
$$u_j^t u_k = \delta_{jk} = 1, \text{ when } j = k; \text{ otherwise } = 0. \tag{17.4}$$
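If we truncate the expansion at $i = q$,
$$x' = U_q y' = \sum_{i=1}^{q} y_i u_i, \tag{17.5}$$
where $U_q$ holds the first $q$ columns of $U$ and $y'$ the first $q$ components of $y$,
we obtain a least square error approximation of $x$, i.e.
$$|x - x'| = \text{minimum}. \tag{17.6}$$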
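The optimum transformation matrix $U$ turns out to be the eigenvector matrix of the sample
covariance matrix $C$:
$$C = \frac{1}{N} A^t A, \tag{17.7}$$
where $A$ is the $N \times p$ sample matrix (with the mean of each column removed). With the eigenvectors
as the columns of $U$, as in eqn. 17.1,
$$U^t C U = \Lambda, \tag{17.8}$$
the diagonal matrix of eigenvalues.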
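For two-dimensional data the eigen-decomposition of eqn. 17.8 can be written in closed form, which allows a self-contained sketch without a linear algebra library. The data below are invented, and the closed-form route is an assumption made for brevity; for general $p$ one would normally call a library eigen-solver.

```python
import math

def pca_2d(samples):
    """Eigen-decomposition of the 2x2 sample covariance C = A^t A / N.

    Returns (eigenvalues, eigenvectors), largest eigenvalue first; the data
    are mean-centred here before forming C.
    """
    n = len(samples)
    m0 = sum(s[0] for s in samples) / n
    m1 = sum(s[1] for s in samples) / n
    a = sum((s[0] - m0) ** 2 for s in samples) / n
    b = sum((s[0] - m0) * (s[1] - m1) for s in samples) / n
    d = sum((s[1] - m1) ** 2 for s in samples) / n
    # Closed-form eigenvalues of the symmetric matrix [[a, b], [b, d]].
    half_trace = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b * b)
    lams = [half_trace + radius, half_trace - radius]
    if abs(b) < 1e-12:
        # C is already diagonal: the axes themselves are the eigenvectors.
        vecs = [[1.0, 0.0], [0.0, 1.0]] if a >= d else [[0.0, 1.0], [1.0, 0.0]]
    else:
        vecs = []
        for lam in lams:
            v = [b, lam - a]          # solves (C - lam I) v = 0 when b != 0
            norm = math.hypot(v[0], v[1])
            vecs.append([v[0] / norm, v[1] / norm])
    return lams, vecs

samples = [[-2.0, -1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, 1.0], [2.0, 1.0]]
lams, vecs = pca_2d(samples)
```

The two eigenvalues sum to the total variance, and the ratio of the first to the sum indicates how much of the spread is retained when only the first principal component is kept.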
17.2 Fisher’s Linear Discriminant Analysis
In contrast with PCA (see section 17.1), linear discriminant analysis (LDA) transforms the data
to provide optimal class separability (Duda et al. 2000) (Fisher 1936).
Fisher’s original LDA, for two-class data, is obtained as follows. We introduce a linear discriminant
u (a p-dimensional vector of weights — the weights are very similar to the weights used in neural
networks) which, via a dot product, maps a feature vector x to a scalar,
$$y = u^t x. \tag{17.9}$$
$u$ is optimised to maximise simultaneously (a) the separability of the classes (between-class separability), and (b) the clustering together of same-class data (within-class clustering). Mathematically,
this criterion can be expressed as:
$$J(u) = \frac{u^t S_B u}{u^t S_W u}, \tag{17.10}$$
where $S_B$ is the between-class covariance,
$$S_B = (m_1 - m_2)(m_1 - m_2)^t, \tag{17.11}$$
and
$$S_W = C_1 + C_2, \tag{17.12}$$
the sum of the class-conditional covariance matrices, see section 17.1.
$m_1$ and $m_2$ are the class means.
$u$ is now computed as:
$$u = S_W^{-1}(m_1 - m_2). \tag{17.13}$$
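Eqn. 17.13 can be sketched for two classes of two-dimensional data, again inverting the $2 \times 2$ matrix by hand. The sample points are invented; the class covariances here use divisor $N$, which changes only the length of $u$ and not its direction.

```python
def fisher_direction(class1, class2):
    """u = S_W^{-1} (m1 - m2) for two classes of 2-D samples."""
    def mean(samples):
        n = len(samples)
        return [sum(s[0] for s in samples) / n, sum(s[1] for s in samples) / n]

    def covariance(samples, m):
        n = len(samples)
        return [[sum((s[i] - m[i]) * (s[k] - m[k]) for s in samples) / n
                 for k in range(2)] for i in range(2)]

    m1, m2 = mean(class1), mean(class2)
    c1, c2 = covariance(class1, m1), covariance(class2, m2)
    sw = [[c1[i][k] + c2[i][k] for k in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    # Multiply the explicit 2x2 inverse of S_W by the difference of means.
    return [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
            (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]

class1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]   # hypothetical samples
class2 = [[6.0, 5.0], [7.0, 7.0], [8.0, 6.0], [7.0, 5.0]]
u = fisher_direction(class1, class2)
proj1 = [u[0] * s[0] + u[1] * s[1] for s in class1]   # y = u^t x, eqn. 17.9
proj2 = [u[0] * s[0] + u[1] * s[1] for s in class2]
```

After projection the two classes occupy disjoint intervals on the line, so a single threshold on $y$ separates them.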
There are other formulations of LDA (Duda et al. 2000) (Venables & Ripley 2002), particularly
extensions from two-class to multi-class data.
In addition, there are extensions (Duda et al. 2000) (Venables & Ripley 2002) which form a second
discriminant, orthogonal to the first, which optimises the separability and clustering criteria, subject
to the orthogonality constraint. The second dimension/discriminant is useful to allow the data to
be viewed as a two-dimensional scatter plot.
Chapter 18
Neural Network Methods
Here we show that a single neuron implements a linear discriminant (and hence also implements
a separating hyperplane). Then we proceed to a discussion which indicates that a neural network
comprising three processing layers can implement any arbitrarily complex decision region.
Recall eqn. 15.12, with $a_i \to w_i$, and now (arbitrarily) allocating discriminant value zero to class 0,
$$g(x) = \sum_{i=1}^{p} w_i x_i + w_0 \begin{cases} \leq 0, & \omega = 0 \\ > 0, & \omega = 1. \end{cases} \tag{18.1}$$
Figure 18.1 shows a single artificial neuron which implements precisely eqn. 18.1.
[Figure 18.1: Single neuron, with inputs $x_1, \ldots, x_p$ weighted by $w_1, \ldots, w_p$ and a bias input $+1$ weighted by $w_0$.]
The signal flows into the neuron (circle) are weighted; the neuron receives $w_i x_i$. The neuron sums
and applies a hard limit (output = 1 when sum > 0, otherwise 0). Later we will introduce a sigmoid
activation function (softer transition) instead of the hard limit.
The threshold term in the linear discriminant ($a_0$ in eqn. 15.13) is provided by $w_0 \times (+1)$. Another
interpretation of bias, useful in mathematical analysis of neural networks, see section 16.6, is to
represent it by a constant component, $+1$, as the zeroth component of the augmented feature
vector.
Just to reemphasise the linear boundary nature of linear discriminants (and hence neural networks),
examine the two-dimensional case,
$$w_1 x_1 + w_2 x_2 + w_0 \begin{cases} \leq 0, & \omega = 0 \\ > 0, & \omega = 1. \end{cases} \tag{18.2}$$
The boundary between classes, given by $w_1 x_1 + w_2 x_2 + w_0 = 0$, is a straight line with $x_1$-axis
intercept at $-w_0/w_1$ and $x_2$-axis intercept at $-w_0/w_2$, see Figure 18.2.
[Figure 18.2: Separating line implemented by two-input neuron, crossing the $x_1$ axis at $-w_0/w_1$ and the $x_2$ axis at $-w_0/w_2$.]
18.1 Neurons for Boolean Functions
A neuron with weights $w_0 = -0.5$ and $w_1 = w_2 = 0.35$ implements a Boolean AND:
x1  x2  AND(x1,x2)  Neuron summation                         Hard-limit (>0?)
-----------------------------------------------------------------------------
0   0   0           sum = -0.5 + 0.35x1 + 0.35x2 = -0.5      => output = 0
1   0   0           sum = -0.5 + 0.35x1 + 0.35x2 = -0.15     => output = 0
0   1   0           sum = -0.5 + 0.35x1 + 0.35x2 = -0.15     => output = 0
1   1   1           sum = -0.5 + 0.35x1 + 0.35x2 = +0.2      => output = 1
-----------------------------------------------------------------------------
Similarly, a neuron with weights $w_0 = -0.25$ and $w_1 = w_2 = 0.35$ implements a Boolean OR.
Figure 18.3 shows the x1 -x2 -plane representation of AND, OR, and XOR (exclusive-or).
[Figure 18.3: AND, OR, XOR, each plotted in the $x_1$-$x_2$ plane with the function value marked at the four corner points.]
It is noted that XOR cannot be implemented by a single neuron; in fact it requires two layers.
Two layers were a big problem in the first wave of neural network research in the 1960s, when it
was not known how to train more than one layer.
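The AND and OR neurons above, and a two-layer construction of XOR, can be written down directly. The AND and OR weights are those given in the text; the second-layer weights used for XOR are an illustrative choice of my own, not taken from the notes.

```python
def neuron(inputs, weights, bias):
    """Hard-limit neuron: output 1 when bias + sum(w_i x_i) > 0, otherwise 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

def AND(x1, x2):
    return neuron([x1, x2], [0.35, 0.35], -0.5)

def OR(x1, x2):
    return neuron([x1, x2], [0.35, 0.35], -0.25)

def XOR(x1, x2):
    """Two layers: OR and AND in the first, 'OR and not AND' in the second.

    The second-layer weights (0.35, -0.7, bias -0.25) are an illustrative
    choice, not taken from the notes.
    """
    return neuron([OR(x1, x2), AND(x1, x2)], [0.35, -0.7], -0.25)

truth_table = [(x1, x2, AND(x1, x2), OR(x1, x2), XOR(x1, x2))
               for x1 in (0, 1) for x2 in (0, 1)]
```

Each row of `truth_table` holds the inputs followed by the AND, OR and XOR outputs.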
18.2 Three-layer neural network for arbitrarily complex decision regions
The purpose of this section is to give an intuitive argument as to why three processing layers can
implement an arbitrarily complex decision region.
Figure 18.4 shows such a decision region in two-dimensions.
As shown in the figure, however, each ‘island’ of class 1 may be delineated using a series of
boundaries, d11 , d12 , d13 , d14 and d21 , d22 , d23 , d24 .
Figure 18.5 shows a three-layer network which can implement this decision region.
First, just as before, input neurons implement separating lines (hyperplanes), d11, etc. Next, in
layer 2, we AND together the decisions from the separating hyperplanes to obtain decisions, ‘in
island 1’, ‘in island 2’. Finally, in the output layer, we OR together the latter decisions; thus we
can construct an arbitrarily complex partitioning.
[Figure 18.4: Complex decision region required: two ‘islands’ of class 1 in the $x_1$-$x_2$ plane, delineated by boundaries $d_{11}, \ldots, d_{14}$ and $d_{21}, \ldots, d_{24}$, surrounded by class 0.]
Of course, this is merely an intuitive argument. A three layer neural network trained with backpropagation or some other technique might well achieve the partitioning in quite a different manner.
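The intuitive argument can be turned into a small hand-wired network. In the sketch below each island is assumed, for simplicity, to be an axis-aligned rectangle, so that the four first-layer neurons per island implement horizontal and vertical separating lines; the island coordinates and all weights are hypothetical.

```python
def neuron(inputs, weights, bias):
    """Hard-limit neuron: output 1 when bias + sum(w_i x_i) > 0, otherwise 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

def in_box(x, box):
    """Layers 1 and 2: four half-plane neurons ANDed together."""
    x1_min, x1_max, x2_min, x2_max = box
    lines = [
        neuron(x, [1.0, 0.0], -x1_min),    # fires when x1 > x1_min
        neuron(x, [-1.0, 0.0], x1_max),    # fires when x1 < x1_max
        neuron(x, [0.0, 1.0], -x2_min),    # fires when x2 > x2_min
        neuron(x, [0.0, -1.0], x2_max),    # fires when x2 < x2_max
    ]
    # AND of four binary inputs: fires only when all four are 1.
    return neuron(lines, [1.0, 1.0, 1.0, 1.0], -3.5)

def classify(x, islands):
    """Layer 3: OR over the 'in island' decisions."""
    hits = [in_box(x, box) for box in islands]
    return neuron(hits, [1.0] * len(hits), -0.5)

# Hypothetical axis-aligned islands of class 1: (x1_min, x1_max, x2_min, x2_max).
islands = [(1.0, 3.0, 1.0, 2.0), (4.0, 5.5, 3.0, 4.5)]
```

Adding further islands only adds neurons to the first two layers and inputs to the final OR, which is the sense in which the construction is arbitrarily extensible.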
18.3 Sigmoid activation functions
If a neural network is to be trained using backpropagation or a similar technique, hard limit activation
functions cause problems: the hard limit is not differentiable at the threshold and has zero gradient everywhere else. Sigmoid activation functions are used
instead. A sigmoid activation function corresponding to the hard limit progresses from output
value 0 at $-\infty$, passes through value 0.5 at input 0, and flattens out at value 1 at $+\infty$.
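One common choice with this shape is the logistic function, sketched below; the notes do not say which sigmoid is intended, so treating it as the logistic function is an assumption. Its derivative has a simple closed form, which is what makes it convenient for gradient-based training.

```python
import math

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a)), arranged to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def sigmoid_derivative(a):
    """d/da sigmoid(a) = sigmoid(a) * (1 - sigmoid(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)
```

The derivative is largest at zero and falls towards zero in both tails, so the function is smooth everywhere, in contrast with the hard limit.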
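[Figure 18.5: Three-layer neural network implementing an arbitrarily complex decision region: inputs $x_1, \ldots, x_p$ feed first-layer neurons $d_{11}, \ldots, d_{14}$ and $d_{21}, \ldots, d_{24}$ (each with a $+1$ bias input), whose outputs are combined by AND neurons and then by an OR neuron giving the class.]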
Chapter 19
Unsupervised Classification (Clustering)
Chapter 20
Regression
20.1 Linear Regression
The simplest linear model, y = mx + c, of school mathematics, is given by:
y = b0 + b1 x + e,
(20.1)
which shows the dependence of the dependent variable y on the independent variable x. In other
words, y is a linear function of x and the observation is subject to noise, e; e is assumed to be
a zero-mean random process. Strictly eqn. 20.1 is affine, since b0 is included, but common usage
dictates the use of linear. Taking the nth observation of (x, y ), we have (Beck & Arnold 1977, p.
133):
yn = b0 + b1 xn + en
(20.2)
Least square error estimators for $b_0$ and $b_1$, $\hat{b}_0$ and $\hat{b}_1$, may be obtained from a set of paired
observations $\{x_n, y_n\}_{n=1}^{N}$ by minimising the sum of squared residuals:
$$S = \sum_{n=1}^{N} r_n^2 = \sum_{n=1}^{N} (y_n - \hat{y}_n)^2 \tag{20.3}$$
$$S = \sum_{n=1}^{N} (y_n - b_0 - b_1 x_n)^2 \tag{20.4}$$
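Minimising with respect to $b_0$ and $b_1$, and replacing these with their estimators, $\hat{b}_0$ and $\hat{b}_1$, gives
the familiar result:
$$\hat{b}_1 = \frac{N \sum y_n x_n - (\sum y_n)(\sum x_n)}{N \sum x_n^2 - (\sum x_n)^2} \tag{20.5}$$
$$\hat{b}_0 = \frac{\sum y_n}{N} - \hat{b}_1 \frac{\sum x_n}{N} \tag{20.6}$$
where all sums run over $n = 1, \ldots, N$.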
The validity of these estimates does not depend on the distribution of the errors en ; that is, assumption of Gaussianity is not essential. On the other hand, all the simplest estimation procedures,
including eqns. 20.5 and 20.6, assume the xn to be error free, and that the error en is associated
with yn .
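Eqns. 20.5 and 20.6 translate almost line for line into code. The data below are invented and lie roughly on the line $y = 1 + 2x$; the function name is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0_hat, b1_hat) from eqns. 20.5 and 20.6."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b0 = sy / n - b1 * sx / n
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]     # hypothetical data, roughly y = 1 + 2x
b0, b1 = fit_line(xs, ys)
```

The denominator is zero when all the $x_n$ are equal, in which case no slope can be estimated.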
In the case where $y$, still one-dimensional, is a function of many independent variables ($p$ in our
usual formulation of $p$-dimensional feature vectors), eqn. 20.2 becomes:
$$y_n = b_0 + \sum_{i=1}^{p} b_i x_{in} + e_n \tag{20.7}$$
where $x_{in}$ is the $i$-th component of the $n$-th feature vector.
Eqn. 20.7 can be written compactly as:
$$y_n = x_n^T b + e_n \tag{20.8}$$
where $b = (b_0, b_1, \ldots, b_p)^T$ is a $(p + 1)$-dimensional vector of coefficients, and $x_n =
(1, x_{1n}, x_{2n}, \ldots, x_{pn})^T$ is the augmented feature vector. The constant 1 in the augmented vector corresponds to the coefficient $b_0$, that is, it is the so-called bias term of neural networks, see
section 15.5 and Chapter 18.
All $N$ observation equations may now be collected together:
$$y = Xb + e \tag{20.9}$$
where $y = (y_1, y_2, \ldots, y_n, \ldots, y_N)^T$ is the $N \times 1$ vector of observations of the dependent variable,
and $e = (e_1, e_2, \ldots, e_n, \ldots, e_N)^T$. $X$ is the $N \times (p + 1)$ matrix formed by $N$ rows of $p + 1$ independent
variables.
Now, the sum of squared residuals, eqn. 20.3, becomes:
$$S = (y - X\hat{b})^T (y - X\hat{b}). \tag{20.10}$$
Minimising with respect to $b$, just as eqn. 20.3 was minimised with respect to $b_0$ and $b_1$, leads
to a solution for $\hat{b}$ (Beck & Arnold 1977, p. 235):
$$\hat{b} = (X^T X)^{-1} X^T y. \tag{20.11}$$
The $jk$-th element of the $(p + 1) \times (p + 1)$ matrix $X^T X$ is $\sum_{n=1}^{N} x_{nj} x_{nk}$, in other words, just $N \times$ the
$jk$-th element of the autocorrelation matrix, $R$, of the vector of independent variables $x$ estimated
from the $N$ sample vectors.
If we have multiple dependent variables ($y$), in this case $c$ of them, we can replace $y$ in eqn. 20.11
with an appropriate $N \times c$ matrix $Y$ formed by $N$ rows each of $c$ observations. Now,
eqn. 20.11 becomes:
$$\hat{B} = (X^T X)^{-1} X^T Y \tag{20.12}$$
$X^T Y$ is a $(p + 1) \times c$ matrix, and $\hat{B}$ is a $(p + 1) \times c$ matrix of coefficients.
Eqn. 20.12 has one significant weakness: it depends on the condition of the matrix $X^T X$. As with
any autocorrelation or auto-covariance matrix, good conditioning cannot be guaranteed; for example, linearly
dependent features will render the matrix singular. In fact, there is an elegant indirect implementation of eqn. 20.12 involving the singular value decomposition (SVD) (Press, Flannery, Teukolsky &
Vetterling 1992), (Golub & Van Loan 1989). The Widrow-Hoff iterative gradient-descent training
procedure (Widrow & Lehr 1990), developed in the early 1960s, tackles the problem in a different
manner.
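For small, well-conditioned problems eqn. 20.11 can be applied directly by forming $X^T X$ and $X^T y$ and solving the resulting linear system. The sketch below does this with Gaussian elimination rather than an explicit inverse; it is a plain illustration, not the SVD route mentioned above, and the data are generated without noise from an invented model.

```python
def solve(a, rhs):
    """Solve a z = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular or badly conditioned")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z

def fit(features, y):
    """b_hat = (X^T X)^{-1} X^T y, with X built from augmented vectors (1, x_1, ..., x_p)."""
    X = [[1.0] + list(f) for f in features]
    cols = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(cols)] for j in range(cols)]
    xty = [sum(row[j] * yn for row, yn in zip(X, y)) for j in range(cols)]
    return solve(xtx, xty)

features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 3.0, 0.0, 2.0, 4.0]        # generated from y = 1 + 2 x1 - x2, no noise
b_hat = fit(features, y)
```

With exactly dependent features a pivot becomes zero (to within rounding) and the solver raises an error, which is the singularity referred to above.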
Bibliography
Beck, J. & Arnold, K. (1977). Parameter Estimation in Engineering and Science, John Wiley &
Sons, New York.
Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd edn, Springer Verlag.
Berry, D. (1996). Statistics — a Bayesian Perspective, Duxbury Press.
Bishop, C. (1995). Neural Networks for Pattern Recognition, Oxford University Press, Oxford,
U.K.
Boslaugh, S. & Watters, P. A. (2008). Statistics in a Nutshell, O’Reilly.
Campbell, J. (2000). Fuzzy Logic and Neural Network Techniques in Data Analysis, PhD thesis,
University of Ulster.
Campbell, J. (2005). Lecture notes on pattern recognition and image processing, Technical report,
Letterkenny Institute of Technology. http://www.jgcampbell.com/ip/pr.pdf (accessed 2009-05-01).
Campbell, J. & Murtagh, F. (1998). Image processing and pattern recognition, Technical report,
Computer Science, Queen’s University Belfast. available at: http://www.jgcampbell.com/ip
(2009-05-01).
Casella, G. & Berger, R. (2001). Statistical Inference, 2nd edn, McGraw-Hill.
Crawley, M. J. (2005). Statistics: an introduction using R, John Wiley. Good introduction to
statistics using R.
Duda, R. & Hart, P. (1973). Pattern Classification and Scene Analysis, Wiley-Interscience, New
York.
Duda, R., Hart, P. & Stork, D. (2000). Pattern Classification, Wiley-Interscience.
Duntsch, I. & Gediga, G. (2000). Sets, Relations, Functions, Methodos Publishers. Available via
http://www.cosc.brocku.ca/ duentsch/papers/methprimer1.html (2009-04-30).
Dytham, C. (2009). Choosing and Using Statistics: A Biologist’s Guide, 2nd edn, Blackwell
Publishing. ISBN-13: 978-1-4051-0243-8.
Feller, W. (1968). An Introduction to Probability Theory and its Applications, volume 1, 3rd edn,
John Wiley & Sons, New York.
Fisher, R. (1936). The use of multiple measurements in taxonomic problems, Annals of Eugenics
7: 179–188.
Foley, J., van Dam, A., Feiner, S., Hughes, J. & Phillips, R. (1994). Introduction to Computer
Graphics, Addison Wesley.
Frey, B. (2006). Statistics Hacks, O’Reilly.
Gelman, A., Carlin, J., Stern, H. & Rubin, D. (1995). Bayesian Data Analysis, Chapman and Hall.
Gelman, A. & Nolan, D. (2002). Teaching statistics: a bag of tricks, Oxford University Press.
Golub, G. & Van Loan, C. (1989). Matrix Computations, 2nd edn, Johns Hopkins University Press,
Baltimore.
Griffiths, D. (2009). Head First Statistics, O’Reilly. ISBN-10: 0596527586. Excellent introduction.
Hacking, I. (2001). An Introduction to Probability and Inductive Logic, Oxford University Press.
Hastie, T., Tibshirani, R. & Friedman, J. (2001). The Elements of Statistical Learning, Springer.
Hsu, H. (1997). Theory and Problems of Probability, Random Variables, and Random Processes
(Schaum’s Outlines), McGraw-Hill.
Jaynes, E. (2003). Probability Theory: The Logic of Science (ed. G. L. Bretthorst), Cambridge University Press. Jaynes was one of the chief advocates of the Bayesian method.
Jeffreys, H. (1961/1998). Theory of Probability, 3rd edn, Oxford University Press (Oxford Classics
Series – 1998), Oxford, U.K.
Larson, H. (1982). Introduction to Probability and Statistical Inference, 3rd edn, John Wiley.
Lee, P. M. (2004). Bayesian Statistics: an introduction, 3rd edn, Arnold. Reputedly one of the
best introductions to Bayesian statistics; Contains examples in R.
MacKay, D. J. C. (2002). Information Theory, Inference and Learning Algorithms, Cambridge
University Press. MacKay is a major advocate of Bayesian methods.
Maindonald, J. & Braun, J. (2007). Data Analysis and Graphics Using R: an example-based
approach, 2nd edn, Cambridge University Press, Cambridge, U.K. ISBN: 978-0-521-86116-8;
good R examples, including graphics.
Matloff, N. (2008). R for programmers, Technical report, University of California, Davis.
http://heather.cs.ucdavis.edu/ matloff/R/RProg.pdf (accessed 2009-04-25).
Meyer, P. L. (1966). Introductory Probability and Statistical Applications, Addison-Wesley, Reading, MA. Excellent introduction, but now out of print.
Milton, M. (2009). Head First Data Analysis: A learner’s guide to big numbers, statistics, and
good decisions, O’Reilly. ISBN-10: 0596153937. Another excellent introduction. Uses R.
Murtagh, F. (2005). Correspondence Analysis and data Coding with Java and R, Chapman and
Hall/CRC Press.
O’Hagan, A. (1994). Kendall’s Advanced Theory of Statistics, Vol. 2B, Bayesian Inference, Edward
Arnold.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems (revised second printing), Morgan
Kaufmann, San Francisco, CA.
Press, W., Flannery, B., Teukolsky, S. & Vetterling, W. (1992). Numerical Recipes in C, 2nd edn,
Cambridge University Press, Cambridge, UK.
Quinn, G. P. & Keough, M. J. (2002). Experimental Design and Data Analysis for Biologists,
Cambridge University Press. ISBN-13: 978-0521009768.
Ripley, B. (1996). Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, U.K.
Rosenkrantz, R. D. (ed.) (1983). E.T. Jaynes. Papers on Probability, Statistics and Statistical
Physics, Kluwer, Dordrecht.
Salsburg, D. (2001). The Lady Tasting Tea: How Statistics Revolutionized Science in the 20th
Century, W.H. Freeman. Great introduction to the origins of statistics.
Silvey, S. (1975). Statistical Inference, Chapman and Hall.
Sivia, D. (1996). Data Analysis, A Bayesian Tutorial, Oxford University Press, Oxford, U.K.
Sivia, D. (2006). Data Analysis, A Bayesian Tutorial, 2nd edn, Oxford University Press. Best
introduction to Bayesian inference there is.
Spiegel, M. R., Schiller, J. & Srinivasan, R. A. (2009). Theory and Problems of Probability and
Statistics (Schaum’s Outlines), 3rd edn, McGraw-Hill.
Spiegel, M. R. & Stephens, L. J. (2008). Statistics (Schaum’s Outlines), 4th edn, McGraw-Hill.
Highly recommended; if you have to buy one book, this is the one; has examples using a few
packages, most notably Excel.
Taylor, P. (2008). Probability (manuscript notes on mathematical foundations), Technical report,
University of Manchester. http://www.paultaylor.eu/tripos/Probability.pdf (accessed 2009-04-25).
Therrien, C. (1989). Decision, Estimation, and Classification, Chichester, UK: John Wiley and
Sons.
Thisted, R. (1988). Elements of Statistical Computing, Chapman and Hall/CRC Press.
Venables, W. & Ripley, B. (2000). S Programming, Springer-Verlag.
Venables, W. & Ripley, B. (2002). Modern Applied Statistics with S, 4th edn, Springer-Verlag.
Highly recommended for learning R (R is a free version of S).
Wasserman, L. (2004). All of Statistics: a concise course in statistical inference, Springer Verlag,
New York, NY. ISBN: 0-387-40272-1; top class encyclopedic reference.
Widrow, B. & Lehr, M. (1990). 30 Years of Adaptive Neural Networks, Proc. IEEE 78(9): 1415–
1442.
Appendix A
Basic Mathematical Notation
The notation described here is merely shorthand for common sense concepts which would
otherwise be confusing and long-winded if written in English. Casual familiarity with the most
important items will also allow you to read papers using statistics without becoming confused. The
online book Sets, Relations, Functions (Duntsch & Gediga 2000) is an ideal introduction; we take
these notes from that book.
A.1 Sets
A.1.1 Set Definition and Membership
A set is a very basic mathematical entity and hence is a bit hard to define. Let’s say that a set is
a collection of objects; there cannot be repetition (duplication) of objects. We can specify a set
by writing all its members within curly brackets, { }.
Example 30 Six-sided die, set of possible faces (identified by the number of spots); call the set
$D$. We can write $D$ as $D = \{1, 2, 3, 4, 5, 6\}$. When there is an obvious sequence, we can write
$D = \{1, 2, \ldots, 6\}$.
Sometimes we specify a rule for making the set; we have, for example, the trivial rule-generated set
$D = \{i \mid i \in \{1, \ldots, 6\}\} = \{1, \ldots, 6\}$; the set of even numbers between 1 and 6 is given by
$D_{even} = \{i \mid i \in \{1, \ldots, 6\} \text{ and } i \text{ even}\} = \{2, 4, 6\}$.
We use the membership symbol $\in$ to state that an object is a member of a set, for example,
$1 \in \{1, 2, 3\}$; we can state non-membership by $\notin$, for example, $6 \notin \{1, 2, 3\}$.
There is no ordering of position in a set: $\{1, 2, 3\}$ and $\{2, 3, 1\}$ represent the same set. If there is
repetition, it is understood that the repeated elements have no effect, so that $\{1, 2, 3\}$ and $\{2, 3, 1, 1, 2\}$
represent the same set.
A.1.2 Important Number Sets
• Natural numbers: N = {0, 1, 2, . . .}.
• Positive natural numbers: N+ = {1, 2, . . .}.
• Integers: Z = {. . . , −2, −1, 0, 1, 2, . . .}.
• Real numbers: R.
A.1.3 Set Operations
• Intersection. The set formed by the intersection of sets A, B is written
C = A ∩ B = {x : x ∈ A and x ∈ B}.
Example 31 A = {1, 2, 3, 4}, B = {3, 4, 5}, A ∩ B = {3, 4}.
• Union. The set formed by the union of sets A, B is written
C = A ∪ B = {x : x ∈ A or x ∈ B}, where “or” means inclusive or, that is, a or b means either
a or b, or both.
Example 32 A = {1, 2, 3, 4}, B = {3, 4, 5}, A ∪ B = {1, 2, 3, 4, 5}.
• Set difference. The set formed by the difference of sets A, B is written
C = A \ B = {x : x ∈ A and x ∉ B}.
That is, remove from A any members of B.
Example 33 A = {1, 2, 3, 4}, B = {3, 4}, A \ B = {1, 2}.
• Set complement (with respect to a universal set, U).
Ā = {x : x ∉ A and x ∈ U}.
Example 34 U = {1, 2, 3, 4, 5, 6}; A = {3, 4, 5}, Ā = {1, 2, 6}.
Comment. In case the notion of a universal set causes difficulty: the universal set depends
on the problem at hand; when talking about a class of students, then U would be the set of
all students in the class. You might have A as the set of all students (in that class — in that
universal set) from County Donegal; then Ā is the set of all students from outside County
Donegal — that is not from County Donegal.
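Examples 31 to 34 can be reproduced with Python's set operators. This is a minimal sketch; the helper name complement is hypothetical and simply makes the dependence on the universal set U explicit.

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}

intersection = A & B        # Example 31: members of both A and B
union = A | B               # Example 32: members of A or B (inclusive or)
difference = A - {3, 4}     # Example 33: remove the members of {3, 4} from A


def complement(A, U):
    """Complement of A with respect to the universal set U (hypothetical helper)."""
    if not A <= U:
        raise ValueError("A must be a subset of the universal set U")
    return U - A


U = {1, 2, 3, 4, 5, 6}
A_bar = complement({3, 4, 5}, U)   # Example 34

print(intersection, union, difference, A_bar)
```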
A.1.4 Venn Diagrams
Set operations such as intersection, union, difference and complement are often illustrated using
Venn diagrams such as those shown in Figure A.1.
[Figure A.1 panels: (a) intersection of A, B; (b) union of A, B (all shaded area); (c) complement of A within the universal set U.]
Figure A.1: Set operations illustrated using Venn diagrams; (a) intersection, (b) union, (c) complement.
Subset. When every member of a set A is also a member of B (A may have none, some or all of
the members of B, but no others), we say that A is a subset of B, written A ⊆ B.
Example 35 B = {1, 2, 3, 4, 5, 6}; A = {3, 4, 5}, so that A ⊆ B.
Equality of sets. When a set A has the same members as B, or each is empty, we say that they
are equal: A = B. Another way of looking at this is, if A ⊆ B and B ⊆ A, then A = B.
Empty Set
If a set contains no members, we call it the empty set; symbol ∅.
Cardinality of a Set
The number of elements in a set A is called its cardinality and written |A|.
Example 36 A = {1, 2, . . . , 6}, |A| = 6.
B = {John, Mary, Jean}, |B| = 3.
Power Set
(Probably not necessary for basic probability.)
Given a set A, the power set of A, P(A), is the set of all subsets of A. Its cardinality is $|P(A)| = 2^{|A|}$. Notice that
you can have a set of sets, for example, the set of all classes in the computing department.
Example 37 A = {a, b, c}, |A| = 3.
P(A) = {∅, {c}, {b}, {a}, {b, c}, {a, c}, {a, b}, {a, b, c}}.
Verify that $|P(A)| = 2^{|A|} = 2^3 = 8$.
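Example 37 can be checked mechanically. The sketch below builds the power set from combinations of every size; the function name power_set is hypothetical, and frozenset is used only because a Python set cannot contain ordinary (mutable) sets.

```python
from itertools import combinations


def power_set(A):
    """Return the set of all subsets of A, each represented as a frozenset."""
    items = list(A)
    return {
        frozenset(subset)
        for size in range(len(items) + 1)      # subset sizes 0, 1, ..., |A|
        for subset in combinations(items, size)
    }


A = {"a", "b", "c"}
P = power_set(A)
print(len(A), len(P))
```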
Finite and Infinite Sets. Roughly speaking, if |A| = n where n is some number we can identify,
then we say that A is a finite set; otherwise the set is infinite. Most of the sets in our examples
are finite sets.
N, Z, R are infinite sets.
A = {1, 2, . . . , n} is an example of a finite set of integers; in contrast, an infinite set
of integers would be written A = {1, 2, . . .}, meaning that the list continues without end.
Disjoint Sets
We say that A_1, A_2, . . . are disjoint if A_i ∩ A_j = ∅, ∀i, j, i ≠ j.
∀ denotes for all.
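The disjointness condition says that every pair of distinct sets has an empty intersection. A small sketch of that check follows; the function name pairwise_disjoint is an assumption made for illustration.

```python
from itertools import combinations


def pairwise_disjoint(sets):
    """True if A_i and A_j share no member for every pair with i != j."""
    return all(a.isdisjoint(b) for a, b in combinations(sets, 2))


print(pairwise_disjoint([{1, 2}, {3}, {4, 5}]))
print(pairwise_disjoint([{1, 2}, {2, 3}, {4}]))
```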
A.2 Iterated Summation and Product Notation
If we want to write down the operation of summing the numbers from 1 to 6, we could write
s = 1 + 2 + 3 + 4 + 5 + 6 or s = 1 + 2 + . . . + 6. But this becomes tedious or impossible for larger
lists. We have the summation notation $\sum_{i=1}^{6} i$.
Similarly, if we want to write down the operation of multiplying together all the numbers from 1
to 6, we use the product notation $\prod_{i=1}^{6} i$.
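In a program, both notations correspond to a loop over the index i. A minimal sketch, assuming Python 3.8 or later for math.prod:

```python
import math

# range(1, 7) yields i = 1, 2, ..., 6 (the upper limit is exclusive).
s = sum(range(1, 7))        # summation: 1 + 2 + ... + 6
p = math.prod(range(1, 7))  # product:   1 * 2 * ... * 6

print(s, p)  # prints: 21 720
```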
A.3 Iterated Union and Intersection
If we want to write down the operation of taking the union (see section A.1.3) of a list of sets
$A_1$ to $A_6$, we could write $B = A_1 \cup A_2 \cup \ldots \cup A_6$. But this becomes tedious or
impossible for larger lists. Similar to the summation notation we have $B = \bigcup_{i=1}^{6} A_i$.
For intersection we have $B = \bigcap_{i=1}^{6} A_i$.
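The iterated union and intersection can be sketched in the same way. The six sets below are made up purely for illustration (each A_i contains i, i + 1 and the common member 10).

```python
# Hypothetical sets A_1 ... A_6: A_i = {i, i + 1, 10}.
sets = [{i, i + 1, 10} for i in range(1, 7)]

union_all = set().union(*sets)                 # A_1 ∪ A_2 ∪ ... ∪ A_6
intersection_all = set.intersection(*sets)     # A_1 ∩ A_2 ∩ ... ∩ A_6

print(sorted(union_all), intersection_all)
```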
A.4 Cartesian Product Sets
Quite often we need to make new sets by making pairs (or triples or n-tuples) from existing sets.
Example 38 Let B = {1, 2, 3, 4, 5, 6} be the set of outcomes from throwing a six-sided dice and
A = {H, T} the set of outcomes of a coin toss. If we perform an experiment where we
throw the dice and toss a coin, the set of all possible pairs is C =
{(1, H), (1, T), (2, H), . . . , (6, H), (6, T)}; we call set C the Cartesian product of B and A.
In general, the Cartesian product of A and B is written A × B and consists of all pairs (a, b) with
a ∈ A and b ∈ B. The pairs in Example 38 have the dice outcome first, so C = B × A.
The cardinality of A × B is |A × B| = |A| × |B|. So in Example 38, we have |B × A| = |B| × |A| =
6 × 2 = 12.
Note: pairs such as (1, H), (1, T), or generally n-tuples, enclosed in round brackets ( ), are
not sets; in a tuple the order of the entries matters.
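Example 38 can be reproduced with itertools.product, which generates every pair in turn. This is a sketch only; the names B, A and C follow the example.

```python
from itertools import product

B = [1, 2, 3, 4, 5, 6]   # outcomes of the dice throw
A = ["H", "T"]           # outcomes of the coin toss

# Each element of C is a tuple (dice outcome, coin outcome).
C = set(product(B, A))

print(len(C), (1, "H") in C, ("H", 1) in C)
```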
Appendix B
Matrices and Linear Algebra
B.1 Introduction
In Chapters 7 and 8 we introduce two-dimensional random variables, that is, pairs of random
variables which, for one reason or another, we want to treat as pairs rather than separately. Much
of what we do in one dimension generalises to two dimensions, and likewise from two dimensions
to many dimensions.
B.2 Linear Simultaneous Equations
Eqns. B.1 and B.2 are a pair of linear simultaneous equations,
\[ y_1 = 3x_1 + 1x_2, \tag{B.1} \]
\[ y_2 = 2x_1 + 4x_2. \tag{B.2} \]
Practically, these equations could express the following:
Price of an apple = x1 , price of an orange = x2 (both unknown). Person A buys 3 apples, and 1
orange and the total bill is 5c (y1 ). Person B buys 2 apples and 4 oranges and the total bill is 10c
(y2 ).
Now, what is $x_1$, the price of apples, and $x_2$, the price of oranges? We want to solve for the
unknowns $x_1$, $x_2$. Matrix algebra gives us a nice technique for solving such problems, see section B.6,
but first we'll see how to solve it without matrices.
Substitute $y_1 = 5$ and $y_2 = 10$ into eqns. B.1 and B.2:
\[ 5 = 3x_1 + 1x_2, \tag{B.3} \]
\[ 10 = 2x_1 + 4x_2. \tag{B.4} \]
Eqn. B.3 gives $x_2 = 5 - 3x_1$, which, substituted into eqn. B.4, gives:
\[ 10 = 2x_1 + 4(5 - 3x_1), \]
\[ 10 = 2x_1 + 20 - 12x_1, \]
\[ -10 = -10x_1, \]
\[ x_1 = 1. \]
Now, substitute $x_1 = 1$ into eqn. B.3:
\[ 5 = 3 + x_2, \]
\[ x_2 = 2. \]
We have determined our unknowns $x_1 = 1$ and $x_2 = 2$.
Ex. Substitute $x_1 = 1$ and $x_2 = 2$ into eqns. B.3 and B.4 to check the correctness of the result.
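The substitution steps can be written as a short program. The sketch below is one possible way to do it, not a general-purpose solver: the function name solve_by_substitution is hypothetical, it assumes the coefficient of x2 in the first equation is non-zero, and it uses exact fractions so that the check is not blurred by rounding.

```python
from fractions import Fraction


def solve_by_substitution(a11, a12, a21, a22, y1, y2):
    """Solve y1 = a11*x1 + a12*x2, y2 = a21*x1 + a22*x2 by substitution."""
    a11, a12, a21, a22, y1, y2 = (Fraction(v) for v in (a11, a12, a21, a22, y1, y2))
    if a12 == 0:
        raise ValueError("this sketch assumes a12 != 0")
    # First equation gives x2 = (y1 - a11*x1) / a12.
    # Substituting into the second: y2 = a21*x1 + a22*(y1 - a11*x1)/a12.
    coeff = a21 - a22 * a11 / a12
    if coeff == 0:
        raise ValueError("the equations do not determine a unique solution")
    x1 = (y2 - a22 * y1 / a12) / coeff
    x2 = (y1 - a11 * x1) / a12
    return x1, x2


x1, x2 = solve_by_substitution(3, 1, 2, 4, 5, 10)
print(x1, x2)
# Check against eqns. B.3 and B.4.
print(3 * x1 + 1 * x2, 2 * x1 + 4 * x2)
```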
B.3 Vectors and Matrices
Eqns. B.1 and B.2 can be written in matrix form as
\[ y = Ax \tag{B.5} \]
where A is a 2 row × 2 column matrix,
\[ A = \begin{pmatrix} 3 & 1 \\ 2 & 4 \end{pmatrix}, \]
y is a one column two row matrix, representing a tuple, and what we will from now on call a vector,
\[ y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \]
and x is another one column two row matrix,
\[ x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}. \]
Vectors We could be extra careful and continue to call objects like x and y tuples. But everyone
in the statistical world uses the term vector for tuple, and, because we are using vector and matrix
arithmetic and algebra, this gives another reason to use vector.
A vector is nothing more than an ordered collection of one-dimensional variables; however, vector
and matrix mathematics have been developed to allow us to do mathematics on vectors without
having to deal with each of the elements of (X1 , X2 , . . . , Xp ) separately.
It will rarely be helpful to think of these vectors as being like vectors of physics and having magnitude
and direction; but it is often helpful to think of two-dimensional vectors as representing points in a
Euclidean plane and to think of general multidimensional vectors (p-dimensions, say) as representing
points in p-dimensional space.
Generally, a system of m equations, in n variables, $x_1, x_2, \ldots, x_n$,
\[
\begin{aligned}
y_1 &= a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n \\
y_2 &= a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n \\
&\;\;\vdots \\
y_r &= a_{r1} x_1 + a_{r2} x_2 + \cdots + a_{rn} x_n \\
&\;\;\vdots \\
y_m &= a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n
\end{aligned}
\tag{B.6}
\]
can be written in matrix form as
\[ y = Ax, \tag{B.7} \]
where y is an m × 1 vector,
\[ y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}, \]
x is an n × 1 vector,
\[ x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \]
and A is an m-row × n-column matrix
\[
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & & \vdots \\
\cdots & a_{rc} & \cdots & \cdots \\
\vdots & \vdots & & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}.
\]
That is, the matrix A is a rectangular array of numbers whose element in row r, column c is $a_{rc}$
(rows are horizontal, think rows of teeth; columns are vertical). The matrix A is said to be m × n,
i.e. m rows, n columns.
Eqn. B.7 can be interpreted as the definition of a function which takes n arguments (x1 , x2 , . . . , xn )
and returns m variables (y1 , y2 . . . ym ). Such a function is also called a transformation: it transforms
n-dimensional vectors to m-dimensional vectors.
Such equations are linear transformations because there are no terms in $x_r^2$ or higher powers, only in
$x_r = x_r^1$, and no constant terms such as 5 ($5x_r^0 = 5 \times 1 = 5$).
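Viewed as a function, eqn. B.7 takes a list of n numbers and returns a list of m numbers. A minimal sketch follows, using plain Python lists of rows for the matrix; the function name mat_vec is an assumption, not notation from these notes.

```python
def mat_vec(A, x):
    """Return y = Ax, where A is a list of m rows, each of length n = len(x)."""
    if any(len(row) != len(x) for row in A):
        raise ValueError("each row of A must have as many entries as x")
    # y_r is the sum over c of a_rc * x_c.
    return [sum(a_rc * x_c for a_rc, x_c in zip(row, x)) for row in A]


# The 2 x 2 system of eqns. B.1 and B.2 applied to x = (1, 2).
print(mat_vec([[3, 1], [2, 4]], [1, 2]))

# A 3 x 2 matrix transforms a 2-dimensional vector into a 3-dimensional one.
print(mat_vec([[1, 0], [0, 1], [1, 1]], [1, 2]))
```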
Uses of Vectors and Transformations in Statistics
Instead of denoting a two-d. random variable as (X, Y), it is much more convenient to denote it as the vector
\[ X = \begin{pmatrix} X_1 = X \\ X_2 = Y \end{pmatrix}. \]
This is particularly true when we get to larger dimensions, when equations like eqn. 7.15 get
enormous or impossible.
Why transformations?
In other places, we have used combinations of random variables such as U = aX + bY ; and we
might have also V = cX + d Y . Thus, we create a new two-d. random variable (U, V ) using linear
combinations of (X, Y ); we transform (X, Y ) to yield (U, V ). This can be neatly expressed using
matrix notation.
Here y is a 2 × 1 vector,
\[ y = \begin{pmatrix} U \\ V \end{pmatrix}, \]
x is a 2 × 1 vector,
\[ x = \begin{pmatrix} X \\ Y \end{pmatrix}, \]
and A is a 2-row × 2-column matrix
\[ A = \begin{pmatrix} a_{11} = a & a_{12} = b \\ a_{21} = c & a_{22} = d \end{pmatrix}. \]
The general form, eqn. B.7, allows us to create an m-dimensional random variable, y, as linear
combinations of the n random variables in the n-dimensional vector x.
B.4 Basic Matrix Arithmetic
B.4.1 Matrix Multiplication
We may multiply two matrices A, m × n, and B, q × p, as long as n = q. Such a multiplication
produces an m × p result. Thus,
\[ \underset{m \times p}{C} = \underset{m \times n}{A} \; \underset{n \times p}{B}. \tag{B.8} \]
Method: The element at the rth row and cth column of C is the product (sum of component-wise
products) of the rth row of A with the cth column of B. Pictorially:
[Diagram: A (m rows × n columns) times B (n rows × p columns) equals C (m rows × p columns); a row of A is combined with a column of B.]
For example, C = AB with
\[ A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \quad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}, \]
so, the product
\[ C = \begin{pmatrix} a_{11} b_{11} + a_{12} b_{21} & a_{11} b_{12} + a_{12} b_{22} \\ a_{21} b_{11} + a_{22} b_{21} & a_{21} b_{12} + a_{22} b_{22} \end{pmatrix}. \]
Example. Consider Eqn. B.7, y = Ax. Thus the product of A (m × n) and x (n × 1) is
\[ y_1 = a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n, \quad \ldots, \quad y_m = a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n. \]
In summation notation, $y_r = \sum_{c=1}^{n} a_{rc} x_c$.
The product is (m × n) × (n × 1) so the result is (m × 1), which checks okay, for y is (m × 1).
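The row-by-column rule translates directly into a short function. This is a plain-Python sketch for illustration (matrices as lists of rows; the name mat_mul is hypothetical); in practice one would normally rely on a numerical library.

```python
def mat_mul(A, B):
    """Return C = AB for A (m x n) and B (n x p), both given as lists of rows."""
    n = len(A[0])
    if len(B) != n:
        raise ValueError("columns of A must equal rows of B")
    p = len(B[0])
    # c_rc is the sum over k of a_rk * b_kc (row r of A with column c of B).
    return [
        [sum(A[r][k] * B[k][c] for k in range(n)) for c in range(p)]
        for r in range(len(A))
    ]


print(mat_mul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# A (2 x 2) times x written as a 2 x 1 matrix gives a 2 x 1 result.
print(mat_mul([[3, 1], [2, 4]], [[1], [2]]))
```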
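B.4.2 Multiplication by a Scalar
As with vectors (when represented as components), we simply multiply each component by the
scalar,
\[ c \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = \begin{pmatrix} c a_{11} & c a_{12} \\ c a_{21} & c a_{22} \end{pmatrix}. \]
B.4.3 Addition
As with vectors (when represented as components), we add component-wise,
\[ \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} + \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} = \begin{pmatrix} a_{11} + b_{11} & a_{12} + b_{12} \\ a_{21} + b_{21} & a_{22} + b_{22} \end{pmatrix}. \]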
Clearly, the matrices must be the same size, i.e. row and column dimensions must be equal.
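Both operations act on one component at a time, as the sketch below shows. The function names scalar_mul and mat_add are assumptions made for this illustration.

```python
def scalar_mul(c, A):
    """Multiply every component of the matrix A by the scalar c."""
    return [[c * a for a in row] for row in A]


def mat_add(A, B):
    """Add two matrices of the same size component by component."""
    same_size = len(A) == len(B) and all(len(ra) == len(rb) for ra, rb in zip(A, B))
    if not same_size:
        raise ValueError("matrices must have equal row and column dimensions")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


print(scalar_mul(2, [[1, 2], [3, 4]]))
print(mat_add([[1, 2], [3, 4]], [[10, 20], [30, 40]]))
```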
B.5 Special Matrices
B.5.1 Identity Matrix
\[ I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \]
i.e. produces no transformation effect. Thus, IA = A.
We can define the matrix inverse as follows: if AB = I then $B = A^{-1}$, see section B.6.
B.5.2 Orthogonal Matrix
A matrix which satisfies the property:
\[ A A^t = I \]
i.e. the transpose of the matrix is its inverse, see section B.6.
Another way of viewing orthogonality in matrices is:
For each row of the matrix $(a_{r1}\ a_{r2}\ \ldots\ a_{rn})$, the scalar product with itself is 1, and with all other
rows, 0. I.e.
\[ \sum_{c=1}^{n} a_{rc} a_{pc} = \begin{cases} 1 & \text{for } r = p, \\ 0 & \text{otherwise.} \end{cases} \]
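The row-by-row condition gives a direct test for orthogonality. The sketch below applies it to a 90 degree rotation matrix, chosen here only as a convenient example with integer entries; the function name is_orthogonal and the tolerance tol are assumptions.

```python
def is_orthogonal(A, tol=1e-12):
    """True if the scalar product of row r with row p is 1 when r == p, else 0."""
    n = len(A)
    if any(len(row) != n for row in A):
        return False  # only square matrices can satisfy A A^t = I
    for r in range(n):
        for p in range(n):
            dot = sum(A[r][c] * A[p][c] for c in range(n))
            expected = 1 if r == p else 0
            if abs(dot - expected) > tol:
                return False
    return True


rotation_90 = [[0, -1], [1, 0]]
print(is_orthogonal(rotation_90), is_orthogonal([[3, 1], [2, 4]]))
```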
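B.5.3 Diagonal
\[ A = \begin{pmatrix} S_x & 0 \\ 0 & S_y \end{pmatrix} \]
is diagonal, i.e. the only non-zero elements are on the diagonal.
The inverse of a diagonal matrix
\[ \begin{pmatrix} a_{11} & 0 \\ 0 & a_{22} \end{pmatrix} \]
is
\[ \begin{pmatrix} 1/a_{11} & 0 \\ 0 & 1/a_{22} \end{pmatrix}. \]
B.5.4 Transpose of a Matrix
$A^t$, spoken ‘A-transpose’.
If
\[ A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \]
then
\[ A^t = \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix} \]
i.e. replace column 1 with row 1 etc.
The transpose is sometimes written $A^T$ or $A'$.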
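Transposition and the inverse of a diagonal matrix are both one-liners in code. A small sketch, with hypothetical function names, using exact fractions for the reciprocals and assuming every diagonal element is non-zero:

```python
from fractions import Fraction


def transpose(A):
    """Swap rows and columns: row r of A becomes column r of the result."""
    return [list(column) for column in zip(*A)]


def diagonal_inverse(A):
    """Invert a diagonal matrix by taking the reciprocal of each diagonal element."""
    n = len(A)
    inverse = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        if A[i][i] == 0:
            raise ValueError("a zero on the diagonal means there is no inverse")
        inverse[i][i] = 1 / Fraction(A[i][i])
    return inverse


print(transpose([[1, 2], [3, 4]]))
print(diagonal_inverse([[2, 0], [0, 4]]))
```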
B.6 Inverse Matrix
Only for square matrices (m = n). Consider again Eqns. B.1 and B.2:
\[ y_1 = 3x_1 + 1x_2, \qquad y_2 = 2x_1 + 4x_2, \]
i.e. y = Ax with
\[ A = \begin{pmatrix} 3 & 1 \\ 2 & 4 \end{pmatrix}. \]
Apply this to
\[ x = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \]
to get
\[ y_1 = 3 \times 1 + 1 \times 2 = 5, \qquad y_2 = 2 \times 1 + 4 \times 2 = 10. \]
What if you know $y = (5\ \ 10)^t$ and you want to retrieve $x = (x_1\ \ x_2)^t$? In other words, can
matrices help us solve for $x_1$, $x_2$ as we did in section B.2?
The answer is yes. Find the inverse of A, written $A^{-1}$, and then apply the inverse transformation to y,
that is, multiply y by the inverse of the matrix,
\[ x = A^{-1} y. \tag{B.9} \]
In the case of a 2 × 2 matrix
\[ A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \]
\[ A^{-1} = \frac{1}{|A|} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix} \tag{B.10} \]
where the determinant of the array, A, is $|A| = a_{11} a_{22} - a_{12} a_{21}$.
If $|A| = 0$, then A is not invertible; it is singular.
Inverse matrices give us the equivalent of division. If $|A| = 0$, attempting to find the inverse is
the equivalent to calculating 1/0.
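Thus for
\[ A = \begin{pmatrix} 3 & 1 \\ 2 & 4 \end{pmatrix} \]
we have $|A| = 3 \times 4 - 1 \times 2 = 10$ so
\[ A^{-1} = \frac{1}{10} \begin{pmatrix} 4 & -1 \\ -2 & 3 \end{pmatrix} = \begin{pmatrix} 0.4 & -0.1 \\ -0.2 & 0.3 \end{pmatrix}. \]
Therefore, apply $A^{-1}$ to
\[ y = \begin{pmatrix} 5 \\ 10 \end{pmatrix}. \]
We find:
\[ A^{-1} y = \begin{pmatrix} 0.4 & -0.1 \\ -0.2 & 0.3 \end{pmatrix} \begin{pmatrix} 5 \\ 10 \end{pmatrix} = \begin{pmatrix} 5 \times 0.4 + 10 \times (-0.1) \\ 5 \times (-0.2) + 10 \times 0.3 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \]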
which is the answer we got in section B.2. In fact, in section B.2 what we did was something very
similar to how one inverts a matrix in a computer program.
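Eqns. B.9 and B.10 can be sketched in a few lines. The function names inverse_2x2 and mat_vec are hypothetical, and exact fractions are used so that the result matches the hand calculation without rounding error.

```python
from fractions import Fraction


def inverse_2x2(A):
    """Inverse of a 2 x 2 matrix via eqn. B.10; raises if the determinant is zero."""
    (a11, a12), (a21, a22) = A
    det = Fraction(a11) * a22 - Fraction(a12) * a21
    if det == 0:
        raise ValueError("matrix is singular (determinant is 0)")
    return [[a22 / det, -a12 / det], [-a21 / det, a11 / det]]


def mat_vec(A, x):
    """Return Ax for a matrix given as a list of rows."""
    return [sum(a * xc for a, xc in zip(row, x)) for row in A]


A = [[3, 1], [2, 4]]
A_inv = inverse_2x2(A)
x = mat_vec(A_inv, [5, 10])   # eqn. B.9: x = A^{-1} y
print(A_inv, x)
```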
B.7 Multidimensional (Multivariate) Random Variables
We can now generalise two-d. random variables to p dimensions by extending (X, Y ) to
(X1 , X2 , . . . , Xp ). It is usual to call the p-dimensional (multivariate) random variable a random
vector and to use vector notation: X = (X1 , X2 , . . . , Xp ).
The multivariate Normal pdf, p-dimensional, is given by:
\[ f(x) = \frac{1}{(2\pi)^{p/2} |K|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^T K^{-1} (x - \mu) \right]. \tag{B.11} \]
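As a concrete instance of eqn. B.11, the sketch below evaluates the pdf for p = 2 using the 2 × 2 inverse of section B.6. It assumes, as is standard, that µ is the mean vector and K the covariance matrix (symmetric and positive definite); the function name mvn_pdf_2d is hypothetical.

```python
import math


def mvn_pdf_2d(x, mu, K):
    """Evaluate eqn. B.11 for p = 2, with K given as a 2 x 2 list of rows."""
    (k11, k12), (k21, k22) = K
    det = k11 * k22 - k12 * k21
    if det <= 0:
        raise ValueError("K must have a positive determinant")
    # Inverse of K via the 2 x 2 formula (eqn. B.10).
    inv = [[k22 / det, -k12 / det], [-k21 / det, k11 / det]]
    d = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T K^{-1} (x - mu).
    q = (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
         + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))
    p = 2
    return math.exp(-0.5 * q) / ((2 * math.pi) ** (p / 2) * math.sqrt(det))


identity = [[1.0, 0.0], [0.0, 1.0]]
print(mvn_pdf_2d([0.0, 0.0], [0.0, 0.0], identity))
```

With K equal to the identity and x at the mean, the exponent is zero and the value reduces to 1/(2π).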