Notes on Artificial Intelligence
Last Updated 06.02.2016
Francis Tseng (@frnsys)
2
2
3
CONTENTS
Contents
Introduction
5
I
7
Foundations
1 Functions
9
1.0.1
Identity functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9
1.0.2
The inverse of a function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10
1.0.3
Surjective functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10
1.0.4
Injective functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10
1.0.5
Surjective & injective functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10
1.0.6
Convex and non-convex functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11
1.0.7
Transcendental functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11
1.0.8
Logarithms
12
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2 Other useful concepts
13
2.1
Solving analytically vs numerically . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
13
2.2
Linear vs nonlinear models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
13
2.3
Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14
2.4
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14
3 Linear Algebra
3.1
15
Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
15
3.1.1
Real coordinate spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
16
3.1.2
Column and row vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
16
3.1.3
Transposing a vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
16
3.1.4
Vectors operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17
CONTENTS
3
CONTENTS
3.2
3.3
3.4
4
4
3.1.5
Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18
3.1.6
Unit vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19
3.1.7
Angles between vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
20
3.1.8
Perpendicular vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21
3.1.9
Normal vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21
3.1.10 Orthonormal vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21
3.1.11 Additional vector operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
22
Linear Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23
3.2.1
Parametric representations of lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23
3.2.2
Linear combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24
3.2.3
Spans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
25
3.2.4
Linear independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
26
Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
28
3.3.1
Matrix operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
28
3.3.2
Hadamard product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
31
3.3.3
The identity matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
32
3.3.4
Diagonal matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
32
3.3.5
Triangular matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
33
3.3.6
Some properties of matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
33
3.3.7
Matrix inverses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
33
3.3.8
Matrix determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
34
3.3.9
Transpose of a matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
36
3.3.10 Symmetric matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
36
3.3.11 The Trace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
38
3.3.12 Orthogonal matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
38
3.3.13 Adjoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
39
Subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
39
3.4.1
Spans and subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
40
3.4.2
Basis of a subspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
41
3.4.3
Dimension of a subspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
41
3.4.4
Nullspace of a matrix
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
41
3.4.5
Columnspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
43
3.4.6
Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
44
CONTENTS
5
CONTENTS
3.4.7
The standard basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
44
3.4.8
Orthogonal compliments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
44
3.4.9
Coordinates with respect to a basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
45
3.4.10 Orthonormal bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
48
Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
49
3.5.1
Linear transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
49
3.5.2
Kernels
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
51
Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
51
3.6.1
Image of a subset of a domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
51
3.6.2
Image of a subspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
52
3.6.3
Image of a transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
52
3.6.4
Preimage of a set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
53
Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
53
3.7.1
Projections onto subspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
55
Identifying transformation properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
57
3.8.1
Determining if a transformation is surjective . . . . . . . . . . . . . . . . . . . . . . . . .
57
3.8.2
Determining if a transformation is injective . . . . . . . . . . . . . . . . . . . . . . . . .
57
3.8.3
Determining if a transformation is invertible . . . . . . . . . . . . . . . . . . . . . . . . .
57
3.8.4
Inverse transformations of linear transformations . . . . . . . . . . . . . . . . . . . . . .
58
Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
58
3.9.1
Properties of eigenvalues and eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . .
59
3.9.2
Diagonalizable matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
60
3.9.3
Eigenvalues & eigenvectors of symmetric matrices . . . . . . . . . . . . . . . . . . . . . .
60
3.9.4
Eigenspace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
61
3.9.5
Eigenbasis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
61
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
61
3.11 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
61
3.5
3.6
3.7
3.8
3.9
3.10 Tensors
4 Calculus
4.1
63
Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
63
4.1.1
Computing derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
63
4.1.2
Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65
4.1.3
Differentiation rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65
4.1.4
Higher order derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
66
CONTENTS
5
CONTENTS
4.2
4.3
4.4
4.5
6
6
4.1.5
Explicit differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
67
4.1.6
Implicit differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
67
4.1.7
Derivatives of trigonometric functions . . . . . . . . . . . . . . . . . . . . . . . . . . . .
68
4.1.8
Derivatives of exponential and logarithmic functions . . . . . . . . . . . . . . . . . . . .
69
4.1.9
Extreme Value Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
69
4.1.10 Rolle’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
71
4.1.11 Mean Value Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
71
4.1.12 L’Hopital’s Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
72
4.1.13 Taylor Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
72
Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
72
4.2.1
Definite integral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
72
4.2.2
Basic properties of the integral . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
73
4.2.3
Mean Value Theorem for Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
74
4.2.4
Antiderivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
74
4.2.5
The fundamental theorem of calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . .
75
4.2.6
Improper integrals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
77
Multivariable Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
80
4.3.1
Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
80
4.3.2
Partial derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
83
4.3.3
Directional derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
84
4.3.4
Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
85
4.3.5
The Jacobian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
86
4.3.6
The Hessian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
86
4.3.7
Scalar and vector fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
87
4.3.8
Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
87
4.3.9
Curl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
89
4.3.10 Optimization with eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
89
Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
89
4.4.1
Solving simple differential equations . . . . . . . . . . . . . . . . . . . . . . . . . . . .
90
4.4.2
Basic first order differential equations . . . . . . . . . . . . . . . . . . . . . . . . . . . .
90
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
93
CONTENTS
7
CONTENTS
5 Probability
95
5.1
Probability space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
95
5.2
Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
96
5.3
Joint and disjoint probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
97
5.4
Conditional Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
97
5.5
Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
98
5.5.1
99
Conditional Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.6
The Chain Rule of Probability
5.7
Combinations and Permutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.8
5.9
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
99
5.7.1
Permutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.7.2
Combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.7.3
Combinations, permutations, and probability . . . . . . . . . . . . . . . . . . . . . . . . 101
Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.8.1
Probability Mass Functions (PMF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.8.2
Probability Density Functions (PDF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.8.3
Distribution Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Cumulative Distribution Functions (CDF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.9.1
Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.9.2
Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.9.3
Using CDFs
5.9.4
Survival function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.10 Expected Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.10.1 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.10.2 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.10.3 The expectation rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.10.4 Jensen’s Inequality
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.10.5 Properties of expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.11 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.11.1 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.12 Common Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.12.1 Probability mass functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.12.2 Probability density functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.13 Pareto distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
CONTENTS
7
CONTENTS
8
5.14 Multiple random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.14.1 Conditional distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.14.2 Multivariate Gaussian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.15 Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.15.1 Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.15.2 A Visual Explanation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.15.3 An Example Bayes’ Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.15.4 Solving the problem with Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.15.5 Another Example
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.15.6 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.16 The log trick . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.17 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.17.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.17.2 Specific Conditional Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.17.3 Conditional Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.17.4 Information Gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.17.5 Kullback-Leibler (KL) divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.18 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6 Statistics
139
6.0.1
Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.1
Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.1.1
Scales of Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.1.2
Averages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.1.3
Population vs Sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.1.4
Independent and Identically Distributed . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.1.5
The Law of Large Numbers (LLN)
6.1.6
Regression to the mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.1.7
Central Limit Theorem (CLT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.1.8
Dispersion (Variance and Standard Deviation) . . . . . . . . . . . . . . . . . . . . . . . . 143
6.1.9
Moments
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.1.10 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.1.11 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.1.12 Degrees of Freedom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
8
CONTENTS
9
CONTENTS
6.1.13 Time Series Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.1.14 Survival Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.2
6.3
Inferential Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.2.1
Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.2.2
Estimates and estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.2.3
Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.2.4
Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6.2.5
Nuisance Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.2.6
Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.2.7
Kernel Density Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Experimental Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.3.1
Statistical Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.3.2
Sample Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.3
The Null Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3.4
Type 1 Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.3.5
P Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.3.6
The Base Rate Fallacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.3.7
False Discovery Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.3.8
Alpha Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.3.9
The Benjamini-Hochberg Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.3.10 Sum of Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.3.11 Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.3.12 Effect Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.3.13 Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.3.14 Agreement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.4
6.5
Handling Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.4.1
Transforming data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.4.2
Dealing with missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.4.3
Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
CONTENTS
9
CONTENTS
10
7 Bayesian Statistics
7.0.1
7.1
7.3
Frequentist vs Bayesian approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
7.1.1
7.2
Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Choosing a prior distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
7.2.1
Conjugate priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
7.2.2
Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
7.2.3
Empirical Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Markov Chain Monte Carlo (MCMC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7.3.1
Monte Carlo Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7.3.2
Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
7.3.3
Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
7.4
Variational Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
7.5
Bayesian point estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
7.6
Credible Intervals (Credible Regions) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
7.7
Bayesian Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
7.8
A Bayesian example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
7.8.1
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
8 Graphs
8.1
191
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
9 Probabilistic Graphical Models
10
173
197
9.1
Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
9.2
Belief (Bayesian) Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
9.2.1
Conditional independence assumptions
. . . . . . . . . . . . . . . . . . . . . . . . . . 207
9.2.2
Properties of belief networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
9.2.3
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
9.2.4
Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
9.2.5
Conditional independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
9.2.6
Template models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
9.2.7
Temporal models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
9.2.8
Markov Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
9.2.9
Dynamic Bayes Networks (DBNs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
CONTENTS
11
CONTENTS
9.2.10 Plate models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
9.2.11 Structured CPDs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
9.2.12 Querying Bayes’s nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
9.2.13 Inference in Bayes’ nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
9.3
9.4
Markov Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
9.3.1
Gibbs distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
9.3.2
Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
9.3.3
Log-linear models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
10 Optimization
257
10.0.1 Convex optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
10.0.2 Constrained optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
10.1 Gradient vs non-gradient methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
10.2 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
10.2.1 Stochastic gradient descent (SGD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
10.2.2 Epochs vs iterations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
10.2.3 Learning rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
10.2.4 Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
10.3 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
10.4 Nelder-Mead (aka Simplex or Amoeba optimization) . . . . . . . . . . . . . . . . . . . . . . . . . 263
10.5 Particle Swarm Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
10.6 Evolutionary Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
10.6.1 Genetic Algorithms
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
10.6.2 Evolution Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
10.6.3 Evolutionary Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
10.7 Derivative-Free Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
10.8 Hessian optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
10.9 Advanced optimization algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
10.10 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
CONTENTS
11
CONTENTS
12
11 Algorithms
271
11.1 Algorithm design paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
11.2 Algorithmic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
11.2.1 Asymptotic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
11.2.2 Loop examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
11.2.3 Big-Oh formal definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
11.2.4 Big-Omega notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
11.2.5 Big-Theta notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
11.2.6 Little-Oh notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
11.3 The divide-and-conquer paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
11.3.1 The Master Method/Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
11.4 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
11.4.1 Heaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
11.4.2 Balanced Binary Search Tree
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
11.4.3 Hash Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
11.4.4 Bloom Filters
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
11.5 P vs NP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
11.5.1 NP-hard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
11.5.2 NP-completeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
11.6 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
II Machine Learning
12 Overview
285
287
12.1 Representation vs Learning: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
12.2 Types of learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
12.3 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
13 Supervised Learning
289
13.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
13.1.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
13.1.2 Classification
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
13.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
13.2.1 Cost functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
12
CONTENTS
13
CONTENTS
13.2.2 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
13.2.3 Normal Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
13.2.4 Deciding between Gradient Descent and the Normal Equation . . . . . . . . . . . . . . . 297
13.2.5 Advanced optimization algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
13.3 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
13.3.1 Feature selection
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
13.3.2 Feature engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
13.3.3 Scaling (normalization) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
13.3.4 Mean subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
13.3.5 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
13.3.6 Bagging (“Bootstrap aggregating”) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
13.4 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
13.4.1 Univariate (simple) Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
13.4.2 How are the parameters determined? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
13.4.3 Multivariate linear regression
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
13.4.4 Example implementation of linear regression with gradient descent . . . . . . . . . . . . . 307
13.4.5 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
13.4.6 Polynomial Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
13.5 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
13.5.1 One-vs-All . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
13.6 Softmax regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
13.6.1 Hierarchical Softmax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
13.7 Generalized linear models (GLMs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
13.7.1 Linear Mixed Models (Mixed Models/Hierarchical Linear Models) . . . . . . . . . . . . . . . 314
13.8 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
13.8.1 Kernels
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
13.8.2 more on support vector machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
13.9 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
13.9.1 Measures of impurity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
13.9.2 Random forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
13.9.3 Classification loss functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
13.10 Ensemble models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
13.10.1 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
CONTENTS
13
CONTENTS
14
13.10.2 Stacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
13.11 Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
13.12 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
13.12.1 Ridge regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
13.12.2 LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
13.12.3 Regularized Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
13.12.4 Regularized Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
13.13 Probabilistic modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
13.13.1 Discriminative vs Generative learning algorithms . . . . . . . . . . . . . . . . . . . . . . 336
13.13.2 Maximum Likelihood Estimation (MLE)
. . . . . . . . . . . . . . . . . . . . . . . . . . . 336
13.13.3 Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
13.14 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
14 Neural Nets
14.1 Biological basis
343
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
14.2 Perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
14.3 Sigmoid (logistic) neurons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
14.4 Activation functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
14.4.1 Common activation functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
14.4.2 Softmax function
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
14.4.3 Radial basis functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
14.5 Feed-forward neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
14.6 Training neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
14.6.1 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
14.6.2 Statistical (stochastic) training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
14.6.3 Learning rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
14.6.4 Training algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
14.6.5 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
14.6.6 Cost (loss/objective/error) functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
14.6.7 Weight initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
14.6.8 Shuffling & curriculum learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
14.6.9 Gradient noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
14.6.10 Adversarial examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
14.6.11 Gradient Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
14
CONTENTS
15
CONTENTS
14.6.12 Training tips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
14.6.13 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
14.7 Network architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
14.8 Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
14.8.1 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
14.8.2 Artificially expanding the training set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
14.9 Hyperparameters
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
14.9.1 Choosing hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
14.9.2 Tweaking hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
14.10 Deep neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
14.10.1 Unstable gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
14.10.2 Rmsprop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
14.11 Convolutional Neural Networks (CNNs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
14.11.1 Local receptive fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
14.11.2 Shared weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
14.11.3 Pooling layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
14.11.4 Network architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
14.11.5 Training CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
14.11.6 Convolution kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
14.12 Recurrent Neural Networks (RNNs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
14.12.1 Network architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
14.12.2 RNN inputs
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
14.12.3 Training RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
14.12.4 LSTMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
14.12.5 BI-RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
14.12.6 Attention mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
14.13 Unsupervised neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
14.13.1 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
14.13.2 Sparse Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
14.13.3 Restricted Boltzmann machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
14.13.4 Deep Belief Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
14.14 Other neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
14.14.1 Modular Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
CONTENTS
15
CONTENTS
16
14.14.2 Recursive Neural Networks
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
14.14.3 Nonlinear neural nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
14.14.4 Neural Turing Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
14.15 Neuroevolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
14.16 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
14.16.1 Training generative adversarial networks
. . . . . . . . . . . . . . . . . . . . . . . . . . 414
14.17 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
15 Model Selection
419
15.1 Model evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
15.1.1 Validation vs Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
15.2 Evaluating regression models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
15.2.1 Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
15.2.2 Coefficient of determination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
15.3 Evaluating classification models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
15.3.1 Area under the curve (AUC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
15.3.2 Confusion Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
15.3.3 Log-loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
15.3.4 F1 score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
15.4 Metric selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
15.5 Hyperparameter selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
15.5.1 Grid search
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
15.5.2 Random search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
15.5.3 Bayesian Hyperparameter Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
15.5.4 Choosing the Learning Rate α . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
15.6 CASH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
15.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
15.8 bayes nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
16 Bayesian Learning
437
16.1 Bayesian models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
16.1.1 Hidden Markov Models
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
16.1.2 Model-based clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
16.1.3 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
16
CONTENTS
17
CONTENTS
16.2 Inference in Bayesian models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
16.2.1 Maximum a posteriori (MAP) estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
16.3 Maximum A Posteriori (MAP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
16.3.1 Markov Chain Monte Carlo (MCMC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
16.4 Nonparametric models
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
16.4.1 What is a nonparametric model? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
16.4.2 An example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
16.4.3 Parametric models vs nonparametric models . . . . . . . . . . . . . . . . . . . . . . . . 446
16.4.4 Why use a Bayesian nonparametric approach? . . . . . . . . . . . . . . . . . . . . . . . 447
16.5 The Dirichlet Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
16.5.1 Dirichlet Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
16.5.2 Finite Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
16.5.3 Chinese Restaurant Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
16.6 Infinite Mixture Models and the Dirichlet Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
16.6.1 Chinese Restaurant Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
16.6.2 Polya Urn Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
16.6.3 Stick-Breaking Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
16.6.4 Dirichlet Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
16.7 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
16.7.1 Model fitting vs Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
16.7.2 Model fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
16.7.3 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
16.7.4 Model averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
16.8 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
17 NLP
459
17.1 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
17.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
17.3 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
17.4 Data preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
17.4.1 Sentence segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
17.4.2 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
17.4.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
17.4.4 Term Frequency-Inverse Document Frequency (tf-idf) Weighting . . . . . . . . . . . . . . 462
CONTENTS
17
CONTENTS
18
17.4.5 The Vector Space Model (VSM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
17.4.6 Normalizing vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
17.5 Measuring similarity between text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
17.5.1 Minimum edit distance
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
17.5.2 Jaccard coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
17.5.3 Euclidean Distance
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
17.5.4 Cosine similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
17.6 (Probabilistic) Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
17.6.1 A naive method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
17.6.2 A less naive method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
17.6.3 n-gram Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
17.6.4 Log-Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
17.6.5 History-based models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
17.6.6 Global Linear Models
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
17.6.7 Evaluating language models: perplexity . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
17.7 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
17.7.1 Context-free grammars (CFGs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
17.7.2 Probabilistic Context-Free Grammars (PCFGs) . . . . . . . . . . . . . . . . . . . . . . . . 475
17.8 Text Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
17.8.1 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
17.8.2 Evaluating text classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
17.9 Tagging
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
17.9.1 Generative models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
17.9.2 Hidden Markov Models (HMM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483
17.9.3 The Viterbi algorithm
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
17.10 Named Entity Recognition (NER) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
17.11 Relation Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
17.11.1 Ontological Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
17.11.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
17.12 Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
17.12.1 Sentiment Lexicons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
17.12.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
17.13 Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
18
CONTENTS
19
CONTENTS
17.13.1 The general approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
17.14 Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
17.14.1 Challenges in machine translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
17.14.2 Classical machine translation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
17.14.3 Statistical machine translation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
17.14.4 Phrase-Based Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
17.15 Word Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
17.16 Neural Networks and NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
17.16.1 Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
17.16.2 CNNs for NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
17.16.3 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
18 Unsupervised Learning
505
18.1 k-Nearest Neighbors (kNN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
18.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
18.2.1 K-Means Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
18.2.2 Hierarchical Agglomerative Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
18.2.3 Affinity Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
18.2.4 Spectral Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
18.2.5 Mean Shift Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509
18.2.6 Non-Negative Matrix Factorization (NMF) . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
18.2.7 DBSCAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
18.2.8 HDBSCAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
18.2.9 CURE (Clustering Using Representatives) . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
18.3 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
19 In Practice
19.1 Machine Learning System Design
519
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
19.2 Machine learning diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
19.2.1 Learning curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
19.2.2 Important training figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
19.3 Large Scale Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
19.3.1 Map Reduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
19.4 Online (live/streaming) machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
CONTENTS
19
CONTENTS
20
19.4.1 Distribution Drift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
19.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
III
Artificial Intelligence
525
19.6 State-space and situation-space representations . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
19.6.1 Search problems (planning) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
19.6.2 Problem formulation
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
19.6.3 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
19.7 Search algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
19.8 Uninformed search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
19.8.1 Exhaustive (“British Museum”) search . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
19.8.2 Depth-First Search (DFS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
19.8.3 Breadth-First Search (BFS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
19.8.4 Uniform Cost Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
19.8.5 Branch & Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
19.8.6 Iterative deepening DFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
19.9 Search enhancements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
19.9.1 Extended list filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
19.10 Informed (heuristic) search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
19.10.1 Greedy best-first search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
19.10.2 Beam Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
19.10.3 A* Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
19.10.4 Iterative Deepening A* (IDA*) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
19.11 Local search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
19.11.1 Hill-Climbing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
19.11.2 Other local search algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
19.12 Graph search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
19.12.1 Consistent heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
19.13 Adversarial search (games) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
19.13.1 Minimax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
19.13.2 Alpha-Beta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
19.14 Non-deterministic search
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
19.14.1 Expectimax search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
19.14.2 Monte Carlo Tree Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
19.14.3 Markov Decision Processes (MDPs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544
19.14.4 Decision Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
19.15 Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
19.15.1 Policy evaluation
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
19.15.2 Policy extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
19.15.3 Policy iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
19.16 Constraint satisfaction problems (CSPs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
19.16.1 Varieties of CSPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
19.16.2 Search formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
19.16.3 Backtracking search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
19.16.4 Iterative improvement algorithms for CSPs
. . . . . . . . . . . . . . . . . . . . . . . . . 559
19.17 Online Evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
19.18 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
20 Planning
561
20.1 An example planning problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
20.2 State-space planning vs plan-space (partial-order) planning . . . . . . . . . . . . . . . . . . . . . 563
20.3 State-space planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
20.3.1 Representing plans and systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
20.3.2 STRIPS (Stanford Research Institute Problem Solver)
. . . . . . . . . . . . . . . . . . . . 565
20.3.3 Other representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
20.3.4 Applicability and state transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
20.3.5 Searching for plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570
20.3.6 The FF Planner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
20.4 Plan-space (partial-order) planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
20.4.1 Plan refinement operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
20.4.2 The Plan-Space Search Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
20.4.3 Threats and flaws . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
20.4.4 Partial order solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
20.4.5 The Plan-Space Planning (PSP) algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 575
20.4.6 The UC Partial-Order Planning (UCPoP) Planner . . . . . . . . . . . . . . . . . . . . . . . 578
20.5 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578
20.5.1 Simple Task Networks (STN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
20.5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
20.5.3 Planning Domains and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581
20.5.4 Planning with task networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
20.5.5 Hierarchical Task Network (HTN) planning . . . . . . . . . . . . . . . . . . . . . . . . . . 583
20.6 Graphplan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
20.6.1 Action independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
20.6.2 Independent action execution order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
20.6.3 Layered plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
20.6.4 Mutual exclusivity (mutex) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
20.6.5 Forward planning graph expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590
20.6.6 Backward graph search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
20.7 Other considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
20.7.1 Planning under uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
20.7.2 Planning with time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
20.7.3 Multi-agent planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
20.7.4 Scheduling: Dealing with resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
20.8 Learning plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
20.8.1 Apprenticeship
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
20.8.2 Case-Based Goal Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
20.9 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
21 Reinforcement learning
599
21.1 Model-based learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
21.1.1 Temporal Difference Learning (TDL or TD Learning) . . . . . . . . . . . . . . . . . . . . . 601
21.1.2 Exploration agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
21.2 Model-free learning
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
21.2.1 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
21.2.2 Exploration vs exploitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
21.2.3 Approximate Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
21.2.4 Policy Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608
21.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608
21.3 Deep Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
21.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 610
22 Filtering
611
22.1 Particle filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
22.1.1 DBN particle filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
22.2 Kalman Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
22.3 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
23 In Practice
625
23.1 Starcraft . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
23.2 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
IV
Simulation
627
24 Agent-Based Models
629
24.1 Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
24.1.1 Brownian agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
24.2 Multi-task and multi-scale problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
24.3 Utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
24.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631
25 Nonlinear Dynamics
633
25.1 Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634
25.1.1 Bifurcations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635
25.1.2 Return maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635
25.1.3 Bifurcation diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635
25.1.4 Feigenbaum number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
25.1.5 Sensitive to initial conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
25.2 Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
25.2.1 Ordinary differential equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
25.2.2 More on ODEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641
25.2.3 Reminder on distinction b/w difference and differential equations
. . . . . . . . . . . . . 643
25.2.4 ODE Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
25.2.5 More on stable and unstable manifolds . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
25.2.6 Lyapunov exponents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
25.2.7 Sections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646
25.2.8 Unstable periodic orbits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646
25.3 Nonlinear Time Series Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646
25.3.1 Delay-Coordinate Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647
25.3.2 Fractal dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 650
25.4 Estimating Lyapunov exponents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
25.4.1 Wolf’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
25.4.2 The Kantz algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
25.5 Noise filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
25.6 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
25.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
V In Practice 655
26 Process 657
26.1 Data analysis approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657
26.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658
26.3 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658
26.4 Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658
26.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659
26.6 Learning: Tips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659
26.6.1 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659
26.7 Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 660
27 Data Visualization
661
27.1 Bivariate charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661
27.2 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661
27.3 Scatterplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661
27.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661
28 Anonymization
663
28.1 k-anonymity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664
28.2 l-diversity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664
28.3 t-closeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664
28.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 665
VI Appendices 667
29 Data analysis with pandas 669
29.1 Dealing with datetimes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674
29.2 Loading data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674
29.3 Plotting
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
29.3.1 Initial setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
29.3.2 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
29.3.3 Plot a cross tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
29.3.4 Plot subplots as a grid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
29.3.5 Plot overlays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
29.3.6 Other plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
29.3.7 Decorating plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
29.3.8 Saving a figure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
29.4 iPython Notebooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
Introduction
These are my personal notes which are broadly intended to cover the basics necessary for data
science, machine learning, and artificial intelligence. They have been collected from a variety of
different sources, which I include as references when I remember to - so take this as a disclaimer that
most of this material is adapted, sometimes directly copied, from elsewhere. Maybe it’s better to call
this a “remix” or “katamari” sampled from resources elsewhere. I have tried to give credit where it
is due, but sometimes I forget to include all my references, so I will generally just say that I take no
credit for any material here.
Many of the graphics and illustrations are of my own creation or have been re-created from others,
but plenty have also been sourced from elsewhere - again, I have tried to give credit where it is due,
but some things slip through.
Data science, machine learning, and artificial intelligence are huge fields that share some foundational
overlap but go in quite different directions. These notes are not comprehensive but aim to cover a
significant portion of that common ground (and a bit beyond too). They are intended to provide
intuitive understandings rather than rigorous proofs; if you are interested in those there are many
other resources which will help with that.
Since mathematical concepts typically have many different applications and interpretations and often
are arrived at through different disciplines and perspectives, I try to explain these concepts in as many
ways as possible.
Some caveats:
• These are my personal notes; while I hope that they are helpful, they may not be helpful for
you in particular!
• This is still very much a work in progress and it will be changing a lot - a lot may be out of
order, missing, littered with TO DOs, etc.
• These notes are compiled from many sources, so there may be sudden shifts in notation/convention - one day I hope to do a deep pass and fix that, but who knows when that will be.
• These notes are generated from markdown files, so they unfortunately lack any snazzy interactivity. I tried many ways to write markdown post-processors to add some, but it’s a big time
sink…
The raw notes and graphics are open source - if you encounter errors or have a better way of
explaining something, please don't hesitate to submit a pull request.
~ Francis Tseng ([@frnsys](https://twitter.com/frnsys))
Part I
Foundations
1
Functions
Fundamentally, a function is a relationship (mapping) between the values of some set X and some
set Y :
f : X → Y

A function can map a set to itself. For example, f(x) = x², also notated f : x ↦ x², is the mapping of all real numbers to all real numbers, or f : R → R.
The set you are mapping from is called the domain.
The set that is being mapped to is called the codomain.
The range is the subset of the codomain which the function actually maps to (a function doesn’t necessarily map
to every value in the codomain. But where it does, the
range equals the codomain).
A function is a mapping between domains.
Functions which map to R are known as scalar-valued or real-valued functions.
Functions which map to Rn where n > 1 are known as vector-valued functions.
1.0.1
Identity functions
An identity function maps something to itself:
I_X : X → X
That is, for every a in X, I_X(a) = a:

I_X(a) = a, ∀ a ∈ X
1.0.2
The inverse of a function
Say we have a function f : X → Y , where f (a) = b for any a ∈ X.
We say f is invertible if and only if there exists a function f⁻¹ : Y → X such that f⁻¹ ◦ f = I_X and f ◦ f⁻¹ = I_Y. Note that ◦ denotes function composition, i.e. (f ◦ g)(x) = f(g(x)).
If an inverse exists, it is unique. An invertible function is both surjective and injective (described below); that is, there is exactly one x for each y.
1.0.3
Surjective functions
A surjective function, also called “onto”, is a function f : X → Y where, for every y ∈ Y there
exists at least one x ∈ X such that f (x) = y . That is, every y has at least one corresponding x
value.
This is equivalent to:
range(f ) = Y
1.0.4
Injective functions
An injective function, also called “one-to-one”, is a function f : X → Y where, for every y ∈ Y ,
there exists at most one x ∈ X such that f (x) = y .
That is, not every y necessarily has a corresponding x, but those that do have only one corresponding x.
1.0.5
Surjective & injective functions
A function can be both surjective and injective, which just means that for every y ∈ Y there exists
exactly one x ∈ X such that f (x) = y , that is, every y has exactly one corresponding x.
As mentioned before, the inverse of a function is both surjective and injective!
1.0.6
Convex and non-convex functions
A convex function is a continuous function whose value at the midpoint of every interval
in its domain does not exceed the arithmetic mean of its values at the ends of the interval.
(Convex Function. Weisstein, Eric W. Wolfram MathWorld)
A convex region is one in which any two points in the region can be joined by a straight line that
does not leave the region.
Which is to say that a convex function has a single minimum (if it has one at all), and this is the only point where the derivative is 0.
More formally, a twice-differentiable function is convex if its second derivative is non-negative everywhere (and strictly convex if it is positive everywhere). A function can be convex on a range [a, b] if its second derivative is non-negative everywhere in that range.
In higher dimensions, these derivatives aren’t scalar values, so we instead define convexity if the
Hessian H (the matrix of second derivatives) is positive semidefinite (notated H ⪰ 0). It is strictly
convex if H is positive definite (notated H ≻ 0). Refer to the Calculus section for more details on
this.
Convex and non-convex functions
1.0.7
Transcendental functions
Transcendental functions are those that are not polynomial, e.g. sin, exp, log, etc.
1.0.8
Logarithms
Logarithms are frequently encountered. They have many useful properties, such as turning multiplication into addition:
log(xy ) = log(x) + log(y )
Multiplying many small numbers is problematic with computers, leading to underflow errors. Logarithms are commonly used to turn this kind of multiplication into addition and avoid underflow
errors.
Note that log(x), without any base, typically implies the natural log, i.e. log_e(x), sometimes notated ln(x), which has the inverse exp(x), more commonly seen as eˣ.
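As a quick illustration of the underflow trick (a minimal sketch of mine, not from the original notes): multiplying many small probabilities directly underflows to zero, while summing their logs stays representable.

```python
import numpy as np

probs = np.full(1000, 1e-10)         # 1000 tiny probabilities

naive = np.prod(probs)               # underflows to 0.0
log_prob = np.sum(np.log(probs))     # 1000 * log(1e-10), perfectly representable

print(naive)      # 0.0
print(log_prob)   # ~ -23025.85, the log of the true product
```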
2
Other useful concepts
2.1
Solving analytically vs numerically
Often you may see a distinction made between solving a problem analytically (sometimes "algebraically" is used) and solving a problem numerically.
Solving a problem analytically means you can exploit properties of the objects and equations,
e.g. through methods from calculus, avoiding substituting numerical values for the variables you are
manipulating (that is, you only need to manipulate symbols). If a problem may be solved analytically,
the resulting solution is called a closed form solution (or the analytic solution) and is an exact
solution.
Not all problems can be solved analytically; generally more complex mathematical models have no
closed form solution. These problems are also often the ones of most interest. Such problems need
to be approximated numerically, which involves evaluating the equations many times by substituting
different numerical values for variables. The result is an approximate (numerical) solution.
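A minimal sketch of the distinction (my own illustrative example, not from the source): the roots of x² − 2 = 0 have a closed form solution (±√2), while a numerical approach such as Newton's method only approximates it by repeatedly evaluating the equation.

```python
import math

# Analytic (closed form) solution of x^2 - 2 = 0
analytic = math.sqrt(2)

# Numerical solution via Newton's method: x_{k+1} = x_k - f(x_k) / f'(x_k)
x = 1.0
for _ in range(10):
    x = x - (x**2 - 2) / (2 * x)

print(analytic)  # 1.4142135623730951
print(x)         # converges to the same value, but only approximately
```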
2.2 Linear vs nonlinear models
You’ll often see a caveat with algorithms that they only work for linear models. On the other hand,
some models are touted for their capacity for nonlinear models.
A linear model is a model which takes the general form:
y = β₀ + β₁x₁ + · · · + βₙxₙ

Note that this function does not need to produce a literal line. The "linear" constraint does not apply to the predictor variables x₁, . . . , xₙ. For instance, the model y = β₀ + β₁x² is still linear, even though it is quadratic in x.
“Linear” refers to the parameters; i.e. the function must be “linear in the parameters”, meaning that
the parameters β0 , . . . , βn themselves must form a line (or its equivalent in whatever dimensional
space you’re working in).
A nonlinear model includes terms such as β² or β₀β₁ (that is, a parameter raised to a power, or multiple parameters in the same term) or parameters inside transcendental functions.
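As a hedged illustration of "linear in the parameters" (my own sketch, not from the notes): a model with a squared predictor is still linear, so it can be fit directly with ordinary least squares.

```python
import numpy as np

# Data generated from y = 1 + 2*x^2 plus noise (assumed for illustration)
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1 + 2 * x**2 + rng.normal(0, 0.1, size=x.shape)

# Design matrix [1, x^2]: the model is linear in (beta0, beta1)
# even though it is quadratic in x
X = np.column_stack([np.ones_like(x), x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta)  # approximately [1, 2]
```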
2.3 Metrics
Many artificial intelligence and machine learning algorithms are based on or benefit from some kind
of metric. In this context the term has a concrete definition.
The typical case for metrics is around similarity. Say you have a bunch of random variables Xi which
take on values in a label space V . If Xi and Xj are connected by an edge, we want them to take on
“similar” values.
How do we define “similar”?
We’ll use a distance function µ : V × V → R+ , which needs to satisfy:
• reflexivity: µ(v , v ) = 0 for all v
• symmetry: µ(v1 , v2 ) = µ(v2 , v1 ) for all v1 , v2
• triangle inequality: µ(v1 , v2 ) ≤ µ(v1 , v3 ) + µ(v3 , v2 ) for all v1 , v2 , v3
If all these are satisfied, we say that µ is a metric.
If only reflexivity and symmetry are satisfied, we have a semi-metric instead.
So we can create a feature f_ij(X_i, X_j) = µ(X_i, X_j) and weight it in a factor of the form:

exp(−w_ij f_ij(X_i, X_j)), with w_ij > 0

so that the lower the distance (metric), the higher the probability.
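A small sketch of mine (with Euclidean distance assumed as the candidate µ) checking the three metric properties numerically on a few points:

```python
import numpy as np

def mu(v1, v2):
    """Euclidean distance, used here as the candidate metric."""
    return np.linalg.norm(np.asarray(v1) - np.asarray(v2))

v1, v2, v3 = np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.array([1.0, -2.0])

print(mu(v1, v1) == 0)                        # reflexivity
print(np.isclose(mu(v1, v2), mu(v2, v1)))     # symmetry
print(mu(v1, v2) <= mu(v1, v3) + mu(v3, v2))  # triangle inequality
```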
2.4
References
• Convex Function. Weisstein, Eric W. Wolfram MathWorld.
3
Linear Algebra
When working with data, we typically deal with many data points consisting of many dimensions.
That is, each data point may have several components; e.g. if people are your data points, they may be represented by their height, weight, and age, which constitutes three dimensions altogether.
These data points of many components are called vectors. These are contrasted with individual
values, which are called scalars.
We deal with these data points - vectors - simultaneously in aggregates known as matrices.
Linear algebra provides the tools needed to manipulate vectors and matrices.
3.1
Vectors
Vectors have magnitude and direction, e.g. 5 miles/hour going east. The magnitude can be thought
of, in some sense, as the “length” of the vector (this isn’t quite right however, as there are many
concepts of “length” - see norms).
Formally, this example would be represented:
\vec{v} = \begin{bmatrix} 5 \\ 0 \end{bmatrix}

since we are "moving" 5 on the x-axis and 0 on the y-axis.
Note that often the arrow is dropped, i.e. the vector is notated as just v .
3.1.1 Real coordinate spaces
Vectors are plotted and manipulated in space. A two-dimensional vector, such as the previous example, may be represented in a two-dimensional space. A vector with three components would be represented in a three-dimensional space, and so on for any arbitrary n dimensions.
A real coordinate space (that is, a space consisting of real numbers) of n dimensions is notated Rⁿ. Such a space encapsulates all possible vectors of that dimensionality, i.e. all possible vectors of the form [v₁, v₂, . . . , vₙ].
To denote a vector of n dimensions, we write x ∈ Rⁿ.
For example: the two-dimensional real coordinate space is notated R², which is all possible real-valued 2-tuples (i.e. all 2D vectors whose components are real numbers, e.g. [1, 2], [−0.4, 21.4], . . . ). If we wanted to describe an arbitrary two-dimensional vector, we could do so with v⃗ ∈ R².
3.1.2
Column and row vectors
A vector x ∈ Rⁿ typically denotes a column vector, i.e. with n rows and 1 column. A row vector xᵀ ∈ Rⁿ has 1 row and n columns. The notation xᵀ is described below.
3.1.3
Transposing a vector
Transposing a vector means turning its rows into columns:

\vec{a} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}, \quad \vec{a}^T = \begin{bmatrix} x_1 & x_2 & x_3 & x_4 \end{bmatrix}

So a column vector x can be represented as a row vector with xᵀ.
3.1.4 Vector operations
Vector addition
Vectors are added by adding the individual corresponding
components:
\begin{bmatrix} 6 \\ 2 \end{bmatrix} + \begin{bmatrix} -4 \\ 4 \end{bmatrix} = \begin{bmatrix} 6 + (-4) \\ 2 + 4 \end{bmatrix} = \begin{bmatrix} 2 \\ 6 \end{bmatrix}
Multiplying a vector by a scalar
To multiply a vector with a scalar, you just multiply the
individual components of the vector by the scalar:
3 \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \times 2 \\ 3 \times 1 \end{bmatrix} = \begin{bmatrix} 6 \\ 3 \end{bmatrix}
This changes the magnitude of the vector, but not the
direction.
(Figure: the red vector is before multiplying by a scalar, blue is after.)

Vector dot products

The dot product (also called inner product) of two vectors a⃗, b⃗ ∈ Rⁿ (note that this implies they must be of the same dimension) is notated:

a⃗ · b⃗
It is calculated:

\vec{a} \cdot \vec{b} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} \cdot \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n = \sum_{i=1}^{n} a_i b_i

Which results in a scalar value.
Note that sometimes the dot operator is dropped, so a dot product may be notated as just a⃗b⃗.
Also note that the dot product x · y is equivalent to the matrix multiplication xᵀy.
Properties of vector dot products:
• Commutative property: The order of the dot product doesn't matter: a⃗ · b⃗ = b⃗ · a⃗
• Distributive property: You can distribute terms in dot products: (v⃗ + w⃗) · x⃗ = v⃗ · x⃗ + w⃗ · x⃗
• Associative property (with scalar multiplication): (cv⃗) · w⃗ = c(v⃗ · w⃗)
3.1.5
Norms
The norm of a vector x ∈ Rn , denoted ||x||, is the “length” of the vector. That is, norms are a
generalization of “distance” or “length”.
There are many different norms, the most common of which is the Euclidean norm (also known as
the ℓ2 norm), denoted ||x||2 , computed:
||x||_2 = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{x^T x}
This is the “as-the-crow-flies” distance that we are all familiar with.
Generally, a norm is just any function f : Rn → R which satisfies the following properties:
1. non-negativity: For all x ∈ Rⁿ, f(x) ≥ 0
2. definiteness: f(x) = 0 if and only if x = 0
3. homogeneity: For all x ∈ Rⁿ, t ∈ R, f(tx) = |t| f(x)
4. triangle inequality: For all x, y ∈ Rⁿ, f(x + y) ≤ f(x) + f(y)
Another norm is the ℓ1 norm:

||x||_1 = \sum_{i=1}^{n} |x_i|

and the ℓ∞ norm:

||x||_\infty = \max_i |x_i|

These three norms are part of the family of ℓp norms, which are parameterized by a real number p ≥ 1 and defined as:

||x||_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}
There are also norms for matrices, the most common of which is the Frobenius norm, analogous to the Euclidean (ℓ2) norm for vectors:

||A||_{Fro} = \sqrt{\sum_{n=1}^{N} \sum_{m=1}^{M} A_{n,m}^2}
Lengths and dot products
You may notice that the dot product of a vector with itself is the square of that vector’s length:
a⃗ · a⃗ = a₁² + a₂² + · · · + aₙ² = ||a⃗||²

So the length (Euclidean norm) of a vector can be written:

||a⃗|| = \sqrt{\vec{a} \cdot \vec{a}}

3.1.6 Unit vectors
Each dimension in a space has a unit vector, generally denoted with a hat, e.g. û, which is a vector
constrained to that dimension (that is, it has 0 magnitude in all other dimensions), with length 1,
e.g. ||û|| = 1.
Unit vectors exist for all Rⁿ.
The unit vector is also called a normalized vector (which is not to be confused with a normal vector, which is something else entirely).
The unit vector in the same direction as some vector v⃗ is found by computing:

\hat{u} = \frac{\vec{v}}{||\vec{v}||}
For instance, in R2 space, we would have two unit vectors:
\hat{i} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \hat{j} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}

In R³ space, we would have three unit vectors:

\hat{i} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad \hat{j} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad \hat{k} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}

But you can have unit vectors in any direction. Say you have a vector:

\vec{a} = \begin{bmatrix} 5 \\ -6 \end{bmatrix}

You can find a unit vector û in the direction of this vector like so:

\hat{u} = \frac{\vec{a}}{||\vec{a}||}

so, with our example:

\hat{u} = \frac{1}{||\vec{a}||}\vec{a} = \frac{1}{\sqrt{61}} \begin{bmatrix} 5 \\ -6 \end{bmatrix} = \begin{bmatrix} \frac{5}{\sqrt{61}} \\ \frac{-6}{\sqrt{61}} \end{bmatrix}
3.1.7
Angles between vectors
Say you have two non-zero vectors, a⃗, ⃗b ∈ Rn .
We often notate the angle between two vectors as θ.
The law of cosines tells us that, for a triangle:

C² = A² + B² − 2AB cos θ

Using this law, we can get the angle between our two vectors:

||a⃗ − b⃗||² = ||b⃗||² + ||a⃗||² − 2||a⃗|| ||b⃗|| cos θ

which simplifies to:

a⃗ · b⃗ = ||a⃗|| ||b⃗|| cos θ

(Figure: the angle θ between two vectors.)
There are two special cases if the vectors are collinear, that is if a⃗ = c⃗b:
• If c > 0, then θ = 0.
• If c < 0, then θ = 180◦
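A minimal sketch of mine recovering θ from the dot product identity above:

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))  # clip guards against rounding error

print(np.degrees(theta))  # 45.0
```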
3.1.8
Perpendicular vectors
With the above angle calculation, you can see that if a⃗ and b⃗ are non-zero and their dot product is 0, that is, a⃗ · b⃗ = 0, then they are perpendicular to each other.
Whenever a pair of vectors satisfies the condition a⃗ · b⃗ = 0, it is said that the two vectors are orthogonal (i.e. perpendicular).
Note that the dot product of any vector with the zero vector equals zero: 0⃗ · x⃗ = 0. Thus the zero vector is orthogonal to everything.
Thus the zero vector is orthogonal to everything.
Technical detail: So if the vectors are both non-zero and orthogonal, then the vectors are both
perpendicular and orthogonal. But of course, since the zero vector is not non-zero, it cannot be
perpendicular to anything, but it is orthogonal to everything.
3.1.9
Normal vectors
A normal vector is one which is perpendicular to all the points/vectors on a plane.
That is, for any vector a⃗ on the plane and a normal vector n⃗ to that plane, we have:

n⃗ · a⃗ = 0

For example: given an equation of a plane, Ax + By + Cz = D, a normal vector is simply:

n⃗ = Aî + Bĵ + Ck̂
3.1.10
Orthonormal vectors
Given V = {v⃗₁, v⃗₂, . . . , v⃗ₖ} where:

• ||v⃗ᵢ|| = 1 for i = 1, 2, . . . , k. That is, the length of each vector in V is 1 (that is, they have all been normalized).
• v⃗ᵢ · v⃗ⱼ = 0 for i ≠ j. That is, these vectors are all orthogonal to each other.

This can be summed up as:

\vec{v}_i \cdot \vec{v}_j = \begin{cases} 0 & i \neq j \\ 1 & i = j \end{cases}
This is an orthonormal set. The term comes from the fact that these vectors are all orthogonal to
each other, and they have all been normalized.
3.1.11 Additional vector operations
These vector operations are less common, but included for reference.
Vector outer products
For the outer product, the two vectors do not need to be of the same dimension (i.e. x ∈ Rn , y ∈
Rm ), and the result is a matrix instead of a scalar:

x \otimes y \in \mathbb{R}^{n \times m} = \begin{bmatrix} x_1 y_1 & \cdots & x_1 y_m \\ \vdots & \ddots & \vdots \\ x_n y_1 & \cdots & x_n y_m \end{bmatrix}

Note that the outer product x ⊗ y is equivalent to the matrix multiplication xyᵀ.
Vector cross products
Cross products are much more limited than dot products. Dot products can be calculated for any
Rn . Cross products are only defined in R3 .
Unlike the dot product, which results in a scalar, the cross product results in a vector which is
orthogonal to the original vectors (i.e. it is orthogonal to the plane defined by the two original
vectors).




\vec{a} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}, \quad \vec{b} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}

\vec{a} \times \vec{b} = \begin{bmatrix} a_2 b_3 - a_3 b_2 \\ a_3 b_1 - a_1 b_3 \\ a_1 b_2 - a_2 b_1 \end{bmatrix}

For example:

\begin{bmatrix} 1 \\ -7 \\ 1 \end{bmatrix} \times \begin{bmatrix} 5 \\ 2 \\ 4 \end{bmatrix} = \begin{bmatrix} -7 \times 4 - 1 \times 2 \\ 1 \times 5 - 1 \times 4 \\ 1 \times 2 - (-7) \times 5 \end{bmatrix} = \begin{bmatrix} -30 \\ 1 \\ 37 \end{bmatrix}
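The same example checked with numpy (a sketch of mine, not from the notes):

```python
import numpy as np

a = np.array([1, -7, 1])
b = np.array([5, 2, 4])

c = np.cross(a, b)
print(c)                           # [-30   1  37]
print(np.dot(c, a), np.dot(c, b))  # both 0: the result is orthogonal to a and b
```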
3.2
Linear Combinations
3.2.1
Parametric representations of lines
Any line in an n-dimensional space can be represented using vectors.
Say you have a vector v⃗ and a set S consisting of all scalar multiplications of that vector (where the scalar c is any real number):

S = {cv⃗ | c ∈ R}

This set S represents a line, since multiplying a vector by a scalar does not change its direction, only its magnitude, so that set of vectors covers the entirety of the line.

(Figure: a few of the infinite scalar multiplications which define the line.)

But that line passes through the origin. If you wanted to shift it, you need only add a vector, which we'll call x⃗. So we could define a line as:

L = {x⃗ + cv⃗ | c ∈ R}
For example: say you are given two vectors:
\vec{a} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \quad \vec{b} = \begin{bmatrix} 0 \\ 3 \end{bmatrix}
Say you want to find the line that goes through them. First
you need to find the vector along that intersecting line,
which is just ⃗b − a⃗.
Although in standard form, that vector originates at the origin. Thus you still need to shift it by finding the appropriate vector x⃗ to add to it. But as you can probably see, we can use our a⃗ to shift it, giving us:

L = {a⃗ + c(b⃗ − a⃗) | c ∈ R}

(Figure: calculating the intersecting line for two vectors.)
And this works for any arbitrary n dimensions! In R³ space, for instance, this still defines a line through the two points; a second parameter would be needed to define a plane.
You can convert this form to a parametric equation, where the equation for a dimension of the
vector ai looks like:
aᵢ + (bᵢ − aᵢ)c
Say you are in R², so you might have:

L = \left\{ \begin{bmatrix} 0 \\ 3 \end{bmatrix} + c \begin{bmatrix} -2 \\ 2 \end{bmatrix} \;\middle|\; c \in \mathbb{R} \right\}

you can write it as the following parametric equation:

x = 0 + (−2)c = −2c
y = 3 + 2c = 2c + 3
3.2.2
Linear combinations
Say you have the following vectors in Rm :
v⃗1 , v⃗2 , . . . , v⃗n
A linear combination is just some sum of these vectors, scaled by arbitrary constants (c₁, . . . , cₙ ∈ R):

c₁v⃗₁ + c₂v⃗₂ + · · · + cₙv⃗ₙ

For example:

\vec{a} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \quad \vec{b} = \begin{bmatrix} 0 \\ 3 \end{bmatrix}

A linear combination would be:

0\vec{a} + 0\vec{b} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
Any vector in the space R² can be represented by some linear combination of these two vectors.
3.2.3
Spans
The set of all linear combinations for some vectors is called the span.
The span of some vectors can define an entire space. For instance, using our previously-defined vectors:

span(a⃗, b⃗) = R²

But this is not always true for the span of any arbitrary set of vectors. For instance, this does not represent all the vectors in R²:

span\left( \begin{bmatrix} -2 \\ -2 \end{bmatrix}, \begin{bmatrix} 2 \\ 2 \end{bmatrix} \right)
These two vectors are collinear (that is, they lie along the same line), so combinations of them will
only yield other vectors along that line.
As another example, the span of the zero vector, span(⃗0), cannot represent all vectors in a space.
Formally, the span is defined as:
span(v⃗₁, v⃗₂, . . . , v⃗ₙ) = {c₁v⃗₁ + c₂v⃗₂ + · · · + cₙv⃗ₙ | cᵢ ∈ R ∀ 1 ≤ i ≤ n}
3.2.4 Linear independence
The set of vectors in the previous collinear example:
\left\{ \begin{bmatrix} -2 \\ -2 \end{bmatrix}, \begin{bmatrix} 2 \\ 2 \end{bmatrix} \right\}
are called a linearly dependent set, which means that some vector in the set can be represented
as the linear combination of some of the other vectors in the set.
In this example, we could represent [−2, −2] as a linear combination of the other vector, i.e. −1 · [2, 2].
You can think of a linearly dependent set as one that contains a redundant vector - one that doesn’t
add any more information to the set.
As another example:
\left\{ \begin{bmatrix} 2 \\ 3 \end{bmatrix}, \begin{bmatrix} 7 \\ 2 \end{bmatrix}, \begin{bmatrix} 9 \\ 5 \end{bmatrix} \right\}
is linearly dependent because v⃗1 + v⃗2 = v⃗3 .
Naturally, a set that is not linearly dependent is called a linearly independent set.
For a more formal definition of linear dependence, a set of vectors:
S = {v⃗1 , v⃗2 , . . . , v⃗n }
is linearly dependent iff (if and only if)

c_1\vec{v}_1 + c_2\vec{v}_2 + \cdots + c_n\vec{v}_n = \vec{0} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}
for some ci ’s where at least one is non-zero.
To put the previous examples in context, if you can show that at least one of the vectors can be
described by the linear combination of the other vectors in the set, that is:
v⃗1 = a2 v⃗2 + a3 v⃗3 + · · · + an v⃗n
then you have a linearly dependent set because that can be reduced to show:
⃗0 = −1v⃗1 + a2 v⃗2 + a3 v⃗3 + · · · + an v⃗n
Thus you can calculate the zero vector as a linear combination of the vectors where at least one
constant is non-zero, which satisfies the definition for linear dependence.
So then a set is linearly independent if, to calculate the zero vector as a linear combination of the
vectors, the coefficients must all be zero.
Going back to spans, the span of a linearly independent set of n vectors in Rⁿ describes that entire space Rⁿ.
An example problem:
Say you have the set:

   



S = \left\{ \begin{bmatrix} 1 \\ -1 \\ 2 \end{bmatrix}, \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix}, \begin{bmatrix} -1 \\ 0 \\ 2 \end{bmatrix} \right\}
and you want to know:
• does span(S) = R3 ?
• is S linearly independent?
For the first question, you want to see if some linear combination of the set yields any arbitrary vector in R³:

c_1 \begin{bmatrix} 1 \\ -1 \\ 2 \end{bmatrix} + c_2 \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix} + c_3 \begin{bmatrix} -1 \\ 0 \\ 2 \end{bmatrix} = \begin{bmatrix} a \\ b \\ c \end{bmatrix}

You can distribute the coefficients:

\begin{bmatrix} 1c_1 \\ -1c_1 \\ 2c_1 \end{bmatrix} + \begin{bmatrix} 2c_2 \\ 1c_2 \\ 3c_2 \end{bmatrix} + \begin{bmatrix} -1c_3 \\ 0c_3 \\ 2c_3 \end{bmatrix} = \begin{bmatrix} a \\ b \\ c \end{bmatrix}

So you can break that out into a system of equations:

c_1 + 2c_2 - c_3 = a
-c_1 + c_2 + 0 = b
2c_1 + 3c_2 + 2c_3 = c
And solve it, which gives you:
c_3 = \frac{1}{11}(3c - 5a + b)
c_2 = \frac{1}{3}(a + b + c_3)
c_1 = a - 2c_2 + c_3
So it looks like you can get these coefficients from any a, b, c, so we can say span(S) = R3 .
For the second question, we want to see whether the only way to produce the zero vector is with all coefficients equal to zero:

c_1 \begin{bmatrix} 1 \\ -1 \\ 2 \end{bmatrix} + c_2 \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix} + c_3 \begin{bmatrix} -1 \\ 0 \\ 2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}
We can just reuse the previous equations we derived for the coefficients, substituting a = 0, b =
0, c = 0, which gives us:
c1 = c2 = c3 = 0
So we know this set is linearly independent.
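A sketch of mine of the same check with numpy: stack the vectors as columns and test whether the only solution of Ac = 0 is c = 0 (equivalently, whether det(A) ≠ 0):

```python
import numpy as np

# columns are the vectors of S
A = np.array([[ 1, 2, -1],
              [-1, 1,  0],
              [ 2, 3,  2]], dtype=float)

print(np.linalg.det(A))   # 11, non-zero => the columns are linearly independent

# solving A c = [a, b, c] for an arbitrary right-hand side shows span(S) = R^3
target = np.array([1.0, 2.0, 3.0])   # any vector works
coeffs = np.linalg.solve(A, target)
print(coeffs)
print(np.allclose(A @ coeffs, target))  # True
```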
3.3
Matrices
A matrix can, in some sense, be thought of as a vector of vectors.
The notation m × n in terms of matrices means there are m rows and n columns.
So that matrix would look like:

A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}
A matrix of these dimensions may also be notated Rm×n to indicate its membership in that set.
We refer to the entry in the i th row and jth column with the notation Aij .
3.3.1
Matrix operations
Matrix addition
Matrices must have the same dimensions in order to be added (or subtracted).

A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix}, \quad B = \begin{bmatrix} b_{11} & \cdots & b_{1n} \\ \vdots & \ddots & \vdots \\ b_{m1} & \cdots & b_{mn} \end{bmatrix}

A + B = \begin{bmatrix} a_{11} + b_{11} & \cdots & a_{1n} + b_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} + b_{m1} & \cdots & a_{mn} + b_{mn} \end{bmatrix}

A + B = B + A
Matrix-scalar multiplication
Just distribute the scalar:

A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix}, \quad cA = \begin{bmatrix} ca_{11} & \cdots & ca_{1n} \\ \vdots & \ddots & \vdots \\ ca_{m1} & \cdots & ca_{mn} \end{bmatrix}
Matrix-vector products
To multiply an m × n matrix with a vector, the vector must have n components (that is, the same number of components as there are columns in the matrix, i.e. x⃗ ∈ Rⁿ):

\vec{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}

The product would be:

A\vec{x} = \begin{bmatrix} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n \\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n \\ \vdots \\ a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n \end{bmatrix}

This results in an m × 1 matrix.
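A minimal numpy sketch of a matrix-vector product (my own example; numpy's `@` operator performs the multiplication):

```python
import numpy as np

A = np.array([[1, 3, 2],
              [4, 0, 1]])
x = np.array([1, 0, 5])

print(A @ x)  # [1*1 + 3*0 + 2*5, 4*1 + 0*0 + 1*5] = [11, 9]
```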
Matrix-vector products as linear combinations
If you interpret each column in a matrix A as its own vector v⃗i , such that:
A = \begin{bmatrix} \vec{v}_1 & \vec{v}_2 & \cdots & \vec{v}_n \end{bmatrix}

Then the product of a matrix and vector can be rewritten simply as a linear combination of those vectors:

A\vec{x} = x_1\vec{v}_1 + x_2\vec{v}_2 + \cdots + x_n\vec{v}_n
Matrix-vector products as linear transformations
A matrix-vector product can also be seen as a linear transformation. You can describe it as a
transformation:
A = \begin{bmatrix} \vec{v}_1 & \vec{v}_2 & \cdots & \vec{v}_n \end{bmatrix}

T : \mathbb{R}^n \to \mathbb{R}^m
T(\vec{x}) = A\vec{x}
It satisfies the conditions for a linear transformation (not shown here), so a matrix-vector product is
always a linear transformation.
Just to be clear: the transformation of a vector can always be expressed as that vector’s product with
some matrix; that matrix is referred to as the transformation matrix.
So in the equations above, A is the transformation matrix.
To reiterate:
• any matrix-vector product is a linear transformation
• any linear transformation can be expressed in terms of a matrix-vector product
Matrix-matrix products
To multiply two matrices, one must have the same number of columns as the other has rows. That
is, you can only multiply an m × n matrix with an n × p matrix. The resulting matrix will be of m × p
dimensions.
That is, if A ∈ Rm×n , B ∈ Rn×p , then C = AB ∈ Rm×p .
The resulting matrix is defined as such:
C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}
You can break the terms out into individual matrix-vector products. Then you combine the resulting
vectors to get the final matrix.
More formally, the ith column of the resulting product matrix is obtained by multiplying A with the ith column of B, for i = 1, 2, . . . , p.
\begin{bmatrix} 1 & 3 & 2 \\ 4 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} 1 & 3 \\ 0 & 1 \\ 5 & 2 \end{bmatrix}

The product would be:

\begin{bmatrix} 1 & 3 & 2 \\ 4 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} 1 \\ 0 \\ 5 \end{bmatrix} = \begin{bmatrix} 11 \\ 9 \end{bmatrix}

\begin{bmatrix} 1 & 3 & 2 \\ 4 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} 3 \\ 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 10 \\ 14 \end{bmatrix}

\begin{bmatrix} 1 & 3 & 2 \\ 4 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} 1 & 3 \\ 0 & 1 \\ 5 & 2 \end{bmatrix} = \begin{bmatrix} 11 & 10 \\ 9 & 14 \end{bmatrix}
Properties of matrix multiplication
Matrix multiplication is not commutative. That is, for matrices A and B, in general A × B ≠ B × A.
They may not even be of the same dimension.
Matrix multiplication is associative. For example, for matrices A, B, C, we can say that:
A × B × C = A × (B × C) = (A × B) × C
There is also an identity matrix I. For any matrix A, we can say that:
A×I=I×A=A
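The worked example above, plus the non-commutativity caveat, in numpy (my own sketch, not from the notes):

```python
import numpy as np

A = np.array([[1, 3, 2],
              [4, 0, 1]])
B = np.array([[1, 3],
              [0, 1],
              [5, 2]])

print(A @ B)   # [[11 10], [ 9 14]]
print(B @ A)   # a 3x3 matrix: reversing the order changes the result (and the shape)

I = np.eye(2)
print(np.allclose(I @ (A @ B), A @ B))  # the identity matrix leaves a matrix unchanged
```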
3.3.2
Hadamard product
The Hadamard product, sometimes called the element-wise product, is another way of multiplying matrices, but only for matrices of the same size. It is usually denoted with ⊙. It is simply:
(A ⊙ B)n,m = An,m Bn,m
It returns a matrix of the same size as the input matrices.
Hadamard multiplication has the following properties:
• commutativity: A ⊙ B = B ⊙ A
• associativity: A ⊙ (B ⊙ C) = (A ⊙ B) ⊙ C
• distributivity: A ⊙ (B + C) = A ⊙ B + A ⊙ C
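In numpy the Hadamard product is just the `*` operator on equal-sized arrays (a sketch of mine):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[10, 20],
              [30, 40]])

print(A * B)                      # element-wise: [[10 40], [90 160]]
print(np.allclose(A * B, B * A))  # commutative, unlike the matrix product A @ B
```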
3.3.3
The identity matrix
The identity matrix is an n × n matrix where every component is 0, except for those along the
diagonal:

I_n = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}

When you multiply the identity matrix by any vector:

I_n \vec{x} = \vec{x}, \quad \vec{x} \in \mathbb{R}^n
That is, a vector multiplied by the identity matrix equals itself.
3.3.4
Diagonal matrices
A diagonal matrix is a matrix where all non-diagonal elements are 0, typically denoted diag(d₁, d₂, . . . , dₙ), where

D_{ij} = \begin{cases} d_i & i = j \\ 0 & i \neq j \end{cases}
So the identity matrix is I = diag(1, 1, . . . , 1).
3.3.5
Triangular matrices
We say that a matrix is upper-triangular if all of its elements below the diagonal are zero.
Similarly, a matrix is lower-triangular if all of its elements above the diagonal are zero.
A matrix is diagonal if it is both upper-triangular and lower-triangular.
3.3.6
Some properties of matrices
Associative property
(AB)C = A(BC)
i.e. it doesn’t matter where the parentheses are.
This applies to compositions as well:
(h ◦ f ) ◦ g = h ◦ (f ◦ g)
Distributive property
A(B + C) = AB + AC
(B + C)A = BA + CA
3.3.7
Matrix inverses
If A is an m × m matrix, and if it has an inverse, then:
AA−1 = A−1 A = I
Only square matrices can have inverses. An inverse does not exist for all square matrices, but those
that have one are called invertible or non-singular, otherwise they are non-invertible or singular.
The inverse exists if and only if A is full rank.
The invertible matrices A, B ∈ Rn×n have the following properties:
• (A⁻¹)⁻¹ = A
• If Ax = b, we can multiply by A⁻¹ on both sides to obtain x = A⁻¹b
• (AB)⁻¹ = B⁻¹A⁻¹
• (A⁻¹)ᵀ = (Aᵀ)⁻¹; this matrix is often denoted A⁻ᵀ
Pseudo-inverses
A† is a pseudo-inverse (sometimes called a Moore-Penrose inverse) of A, which may be nonsquare, if the following are satisfied:
• AA†A = A
• A†AA† = A†
• (AA†)ᵀ = AA†
• (A†A)ᵀ = A†A
A pseudo-inverse exists and is unique for any matrix A. If A is invertible, A−1 = A† .
3.3.8
Matrix determinants
The determinant of a square matrix A ∈ Rn×n is a function det : Rn×n → R, denoted |A|, det(A),
or sometimes with the parentheses dropped, det A.
A more intuitive interpretation:
Say we are in a 2D space and we have some shape. It has some area. Then we apply transformations
to that space. The determinant describes how that shape’s area has been scaled as a result of the
transformation.
This can be extended to 3D, replacing “area” with “volume”.
With this interpretation it’s clear that a determinant of 0 means scaling area or volume to 0, which
indicates that space has been “compressed” to a line or a point (or a plane in the case of 3D).
Inverse and determinant for a 2 × 2 matrix
Say you have the matrix:
A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}

You can calculate the inverse of this matrix as:

A^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}
Note that A−1 is undefined if ad − bc = 0, which means that A is not invertible.
Intuitively:
• the inverse of a matrix essentially “undoes” the transformation that matrix represents
• a determinant of 0 implies a transformation that squishes everything together in some way
(e.g. into a line). This means that some vectors occupy the same position on the line.
• by definition, a function takes one input and maps it to one output. So if we have (what used
to be) different vectors mapped to the same position, we can’t take that one same position and
re-map it back to different vectors - that would require a function that gives different outputs
for the same input.
The denominator ad − bc is called the determinant. It is notated as:
det(A) = |A| = ad − bc
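A small numpy check of the 2 × 2 formulas (my own sketch):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

det = np.linalg.det(A)   # ad - bc = 1*4 - 2*3 = -2
inv = np.linalg.inv(A)

print(det)
print(inv)                               # [[-2. ,  1. ], [ 1.5, -0.5]]
print(np.allclose(A @ inv, np.eye(2)))   # A A^{-1} = I
```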
Inverse and determinant for an n × n matrix
Say we have an n × n matrix A.
A submatrix of A is an (n − 1) × (n − 1) matrix constructed from A by ignoring the ith row and the jth column of A, which we denote by A_{¬i,¬j}.
You can calculate the determinant of an n × n matrix A by using some i th row of A, where 1 ≤ i ≤ n:
\det(A) = \sum_{j=1}^{n} (-1)^{i+j} a_{ij} \det(A_{\neg i, \neg j})
All the det(A_{¬i,¬j}) terms eventually reduce to the determinant of a 2 × 2 matrix.
Scalar multiplication of determinants
For an n × n matrix A,
det(kA) = kⁿ det(A)
Determinant of diagonal or triangular matrix
The determinant of a diagonal or triangular matrix is simply the product of the elements along its
diagonal.
Properties of determinants
• For A ∈ Rn×n , t ∈ R, multiplying a single row by the scalar t yields a new matrix B, for which
|B| = t|A|.
• For A ∈ Rn×n , |A| = |AT |
• For A, B ∈ R^{n×n}, |AB| = |A||B|
• For A ∈ R^{n×n}, |A| = 0 if A is singular (i.e. non-invertible).
• For A ∈ R^{n×n}, |A⁻¹| = 1/|A| if A is non-singular (i.e. invertible).
• For A, B ∈ R^{n×n}, if two rows of A are swapped to produce B, then det(A) = − det(B)
• The determinant of a matrix A ∈ R^{n×n} is non-zero if and only if it has full rank; this also means you can check if A is invertible by checking that its determinant is non-zero.
3.3.9
Transpose of a matrix
The transpose of a matrix A is that matrix with its columns and rows swapped, denoted AT .
More formally, let A be an m × n matrix, and let B = AT . Then B is an n × m matrix, and Bij = Aji .
• Transpose of determinants: The determinant of a transpose is the same as the determinant
of the original matrix: det(AT ) = det(A)
• Transposes of sums: With matrices A, B, C where C = A + B, then C T = (A + B)T =
AT + B T
• Transposes of inverses: The transpose of the inverse is equal to the inverse of the transpose:
(A−1 )T = (AT )−1
• Transposes of multiplication: (AB)T = B T AT
• Transpose of a vector: for two column vectors a⃗, b⃗, we know that a⃗ · b⃗ = b⃗ · a⃗ = a⃗ᵀb⃗, from which we can derive: (Ax⃗) · y⃗ = x⃗ · (Aᵀy⃗) (proof omitted).
3.3.10
Symmetric matrices
A square matrix A ∈ Rn×n is symmetric if A = AT .
It is anti-symmetric if A = −AT .
For any square matrix A ∈ Rn×n , the matrix A + AT is symmetric and the matrix A − AT is antisymmetric. Thus any such A can be represented as a sum of a symmetric and an anti-symmetric
matrix:
A = \frac{1}{2}(A + A^T) + \frac{1}{2}(A - A^T)
Symmetric matrices have many nice properties.
The set of all symmetric matrices of dimension n is often denoted as Sn , so you can denote a
symmetric n × n matrix A as A ∈ Sn .
The quadratic form
Given a square matrix A ∈ R^{n×n} and a vector x ∈ Rⁿ, the scalar value xᵀAx is called a quadratic form:
x^T A x = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j
Here A is typically assumed to be symmetric.
Types of symmetric matrices
Given a symmetric matrix A ∈ Sn …
• A is positive definite (PD) if for all non-zero vectors x ∈ Rn , x T Ax > 0.
– This is often denoted A ≻ 0 or A > 0.
– The set of all positive definite matrices is denoted Sn++ .
• A is positive semidefinite (PSD) if for all vectors x ∈ Rn , x T Ax ≥ 0.
– This is often denoted A ⪰ 0 or A ≥ 0.
– The set of all positive semidefinite matrices is denoted Sn+ .
• A is negative definite (ND) if for all non-zero vectors x ∈ Rn , x T Ax < 0.
– This is often denoted A ≺ 0 or A < 0.
• A is negative semidefinite (NSD) if for all vectors x ∈ Rn , x T Ax ≤ 0.
– This is often denoted A ⪯ 0 or A ≤ 0.
• A is indefinite if it is neither positive semidefinite nor negative semidefinite, that is, if there
exists x1 , x2 ∈ Rn such that x1T Ax1 > 0 and x2T Ax2 < 0.
Some other properties of note:
• If A is positive definite, then −A is negative definite and vice versa.
• If A is positive semidefinite, then −A is negative semidefinite and vice versa.
• If A is indefinite, then −A is also indefinite and vice versa.
• Positive definite and negative definite matrices are always invertible.
• For any matrix A ∈ R^{m×n}, which does not need to be symmetric or square, the matrix G = AᵀA, called a Gram matrix, is always positive semidefinite.
  – If m ≥ n and A is full rank, then G is positive definite.
Essentially, “positive semidefinite” is to matrices as “non-negative” is to scalar values (and “positive
definite” as “positive” is to scalar values).
3.3.11 The Trace
The trace of a square matrix A ∈ Rn×n is denoted tr(A) and is the sum of the diagonal elements in
the matrix:
\operatorname{tr}(A) = \sum_{i=1}^{n} A_{ii}
The trace has the following properties:
• tr(A) = tr(Aᵀ)
• For B ∈ R^{n×n}, tr(A + B) = tr(A) + tr(B)
• For t ∈ R, tr(tA) = t tr(A)
• If AB is square, then tr(AB) = tr(BA)
• If ABC is square, then tr(ABC) = tr(BCA) = tr(CAB), and so on for the product of more matrices
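A quick numpy check of the cyclic property (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 3))

print(np.trace(A @ B))                               # AB is 3x3
print(np.trace(B @ A))                               # BA is 4x4, but the trace is the same
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))  # True
```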
3.3.12
Orthogonal matrix
Say we have an n × k matrix C whose columns form an orthonormal set.
If k = n then C is a square matrix (n × n) and since C’s columns are linearly independent, C is
invertible.
For an orthonormal matrix:

C^T C = I_n
C^{-1} C = I_n
∴ C^T = C^{-1}
When C is an n × n matrix (i.e. square) whose columns form an orthonormal set, we say that C is
an orthogonal matrix.
Orthogonal matrices have the property of C T C = I = CC T .
Orthogonal matrices also have the property that operating on a vector with an orthogonal matrix will
not change its Euclidean norm, i.e. ||Cx||2 = ||x||2 for any x ∈ Rn .
Orthogonal matrices preserve angles and lengths
For an orthogonal matrix C, when you multiply C by some vector, the length and angle of the vector
is preserved:
||x⃗|| = ||Cx⃗||
cos θ = cos θ_C
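A sketch of mine using a 2D rotation matrix, which is orthogonal, to check that lengths are preserved:

```python
import numpy as np

theta = np.pi / 3
C = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # rotation matrices are orthogonal

x = np.array([3.0, 4.0])

print(np.allclose(C.T @ C, np.eye(2)))           # C^T C = I
print(np.linalg.norm(x), np.linalg.norm(C @ x))  # both 5.0: the norm is preserved
```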
3.3.13
Adjoints
The classical adjoint, often just called the adjoint of a matrix A ∈ Rn×n is denoted adj(A) and
defined as:
adj(A)_{ij} = (-1)^{i+j} |A_{\neg j, \neg i}|
Note that the indices are switched in A¬j,¬i .
3.4
Subspaces
Say we have a set of vectors V which is a subset of Rⁿ, that is, every vector in the set has n components. V is a linear subspace of Rⁿ if:

• V contains the zero vector 0⃗
• for a vector x⃗ in V, cx⃗ (where c ∈ R) must also be in V, i.e. closure under scalar multiplication.
• for a vector a⃗ in V and a vector b⃗ in V, a⃗ + b⃗ must also be in V, i.e. closure under addition.
Example:
Say we have the set of vectors:
S = \left\{ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \in \mathbb{R}^2 : x_1 \geq 0 \right\}

which is the right half of the plane.
Is S a subspace of R2 ?
• It does contain the zero vector
• It is closed under addition:
\begin{bmatrix} a \\ b \end{bmatrix} + \begin{bmatrix} c \\ d \end{bmatrix} = \begin{bmatrix} a + c \\ b + d \end{bmatrix}

Since a and c are both ≥ 0 (that was the criterion for the set), a + c will also be ≥ 0, so the sum will also be in the set (there were no constraints on the second component, so it doesn't matter what that is).
• It is NOT closed under scalar multiplication:

-1 \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} -a \\ -b \end{bmatrix}

Since a ≥ 0, −a ≤ 0; for any a > 0 this falls outside the constraint of the set and thus is not contained within the set.
So no, this set is not a subspace of R2 .
3.4.1 Spans and subspaces
Let's say we have the set:

U = span(v⃗1, v⃗2, v⃗3)

where each vector has n components. Is this a valid subspace of Rn?

Since the span represents all the linear combinations of those vectors, we can define an arbitrary vector in the set as:

x⃗ = c1 v⃗1 + c2 v⃗2 + c3 v⃗3

• the set does contain the zero vector:

0 v⃗1 + 0 v⃗2 + 0 v⃗3 = 0⃗

• it is closed under multiplication, since the following is just another linear combination:

a x⃗ = a c1 v⃗1 + a c2 v⃗2 + a c3 v⃗3

• it is closed under addition, since if we take another arbitrary vector in the set:

y⃗ = d1 v⃗1 + d2 v⃗2 + d3 v⃗3

and add them:

x⃗ + y⃗ = (c1 + d1) v⃗1 + (c2 + d2) v⃗2 + (c3 + d3) v⃗3

that's also just another linear combination in the set.
3.4.2 Basis of a subspace
If we have a subspace V = span(S) where the set of vectors S = v⃗1 , v⃗2 , . . . , v⃗n is linearly independent,
then we can say that S is a basis for V .
A set S is the basis for a subspace V if S is linearly independent and its span defines V . In other
words, the basis is the minimum set of vectors that spans the subspace that it is a basis of.
All bases for a subspace will have the same number of elements.
Intuitively, the basis of a subspace is a set of vectors that can be linearly combined to describe any
vector in that subspace. For example, the vectors [0,1] and [1,0] form a basis for R2 .
3.4.3 Dimension of a subspace
The dimension of a subspace is the number of elements in a basis for that subspace.
3.4.4 Nullspace of a matrix
Say we have:

Ax⃗ = 0⃗

If you have a set N of all x⃗ ∈ Rn that satisfy this equation, do you have a valid subspace?

Of course if x⃗ = 0⃗ this equation is satisfied, so we know the zero vector is part of this set (which is a requirement for a valid subspace).

The other two properties (closure under addition and multiplication) necessary for a subspace also hold:

A(v⃗1 + v⃗2) = Av⃗1 + Av⃗2 = 0⃗
A(c v⃗1) = c Av⃗1 = 0⃗

and of course 0⃗ is in the set N.

So yes, the set N is a valid subspace, and it is a special subspace: the nullspace of A, notated:

N(A)
That is, the nullspace for a matrix A is the subspace described by the set of vectors which yield the zero vector when multiplied by A, that is, the set of vectors which are the solutions for x⃗ in:

Ax⃗ = 0⃗

Or, more formally, if A is an m × n matrix:

N(A) = {x⃗ ∈ Rn | Ax⃗ = 0⃗}

The nullspace for a matrix A may also be notated N(A).

To put the nullspace (or "kernel") another way, it is the space of all vectors that map to the zero vector after applying the transformation the matrix represents.
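As a concrete sketch (assuming SciPy is available), `scipy.linalg.null_space` returns an orthonormal basis for N(A):

```python
import numpy as np
from scipy.linalg import null_space

# A maps R^3 -> R^2; its columns are linearly dependent, so N(A) is non-trivial.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])

N = null_space(A)             # columns form an orthonormal basis for N(A)
print(N.shape)                # (3, 2): the nullspace here is 2-dimensional (nullity = 2)
print(np.allclose(A @ N, 0))  # True: every basis vector maps to the zero vector
```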
Nullspace and linear independence
If you take each column in a matrix A as a vector v⃗i , that set of vectors is linearly independent if the
nullspace of A consists of only the zero vector. That is, if:
N(A) = {⃗0}
The intuition behind this is that, if the linear combination of a set of vectors can only equal the zero vector when all of its coefficients are zero (that is, when the coefficients are the components of the zero vector), then the set is linearly independent:

x1 v⃗1 + x2 v⃗2 + · · · + xn v⃗n = 0⃗ iff x⃗ = [0, 0, ..., 0]^T
Nullity
The nullity of a nullspace is its dimension, that is, it is the number of elements in a basis for that
nullspace.
dim(N(A)) = nullity(N(A))
Left nullspace
The left nullspace of a matrix A is the nullspace of its transpose, that is N(AT ) :
N(A^T) = {x⃗ | x⃗^T A = 0⃗^T}
3.4.5 Columnspace
Again, a matrix can be represented as a set of column vectors. The columnspace of a matrix (also
called the range of the matrix) is all the linear combinations (i.e. the span) of these column vectors:
A = [v⃗1 v⃗2 ... v⃗n]

C(A) = span(v⃗1, v⃗2, ..., v⃗n)
Because any span is a valid subspace, the columnspace of a matrix is a valid subspace.
So if you expand out the matrix-vector product, you’ll see that every matrix-vector product is within
that matrix’s columnspace:
{Ax⃗ | x⃗ ∈ Rn}

Ax⃗ = x1 v⃗1 + x2 v⃗2 + · · · + xn v⃗n

{Ax⃗ | x⃗ ∈ Rn} = C(A)
That is, for any vector in the space Rn , multiplying the matrix by it just yields another linear combination of that matrix’s column vectors. Therefore it is also in the columnspace.
The columnspace (range) for a matrix A may be notated R(A).
Rank of a columnspace
The column rank of a columnspace is its dimension, that is, it is the number of elements in a basis
for that columnspace (i.e. the largest number of columns of the matrix which constitute a linearly
independent set):
dim(C(A)) = rank(C(A))
Rowspace
The rowspace of a matrix A is the columnspace of AT , i.e. C(AT ).
The row rank of a matrix is similarly the number of elements in a basis for that rowspace.
3.4.6 Rank
Note that for any matrix A, the column rank and the row rank are equal, so they are typically just
referred to as rank(A).
The rank has some properties:

• For A ∈ Rm×n, rank(A) ≤ min(m, n). If rank(A) = min(m, n), then A is said to be full rank.
• For A ∈ Rm×n, rank(A) = rank(A^T)
• For A ∈ Rm×n, B ∈ Rn×p, rank(AB) ≤ min(rank(A), rank(B))
• For A, B ∈ Rm×n, rank(A + B) ≤ rank(A) + rank(B)
The rank of a transformation refers to the number of dimensions in the output.
A matrix is full rank if its rank equals the number of dimensions of its originating space, i.e. it represents a transformation that preserves dimensionality (it does not collapse the space into a lower-dimensional one).
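A quick illustration of these facts (a sketch, assuming NumPy):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [0.0, 1.0]])

print(np.linalg.matrix_rank(A))     # 2: full column rank
print(np.linalg.matrix_rank(A.T))   # 2: row rank equals column rank

B = np.random.randn(2, 5)
lhs = np.linalg.matrix_rank(A @ B)
rhs = min(np.linalg.matrix_rank(A), np.linalg.matrix_rank(B))
print(lhs <= rhs)                   # True: rank(AB) <= min(rank(A), rank(B))
```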
3.4.7 The standard basis
The set of column vectors in an identity matrix In is known as the standard basis for Rn .
Each of those column vectors is notated e⃗i. E.g., in an identity matrix, the column vector:

[1, 0, 0, ..., 0]^T = e⃗1

For a transformation T(x⃗), its transformation matrix A can be expressed as:

A = [T(e⃗1) T(e⃗2) ... T(e⃗n)]

3.4.8 Orthogonal complements
Given that V is some subspace of Rn, the orthogonal complement of V, notated V⊥, is:

V⊥ = {x⃗ ∈ Rn | x⃗ · v⃗ = 0 ∀ v⃗ ∈ V}

That is, the orthogonal complement of a subspace V is the set of all vectors whose dot product with every vector in V is 0, that is, where all vectors in the set are orthogonal to all vectors in V.
V ⊥ is a subspace (proof omitted).
Columnspaces, nullspaces, and transposes
C(A) is the orthogonal complement of N(A^T), and vice versa:

N(A^T) = C(A)⊥
N(A^T)⊥ = C(A)

C(A^T) is the orthogonal complement of N(A), and vice versa:

N(A) = C(A^T)⊥
N(A)⊥ = C(A^T)

As a reminder, columnspaces and nullspaces are spans, i.e. sets of linear combinations (subspaces), and these subspaces are orthogonal to each other.
Dimensionality and orthogonal complements
For V , a subspace of Rn :
dim(V ) + dim(V ⊥ ) = n
(proof omitted here)
The intersection of orthogonal complements

Since every vector in a subspace is orthogonal to every vector in its orthogonal complement:

V ∩ V⊥ = {0⃗}

That is, the only vector which exists both in a subspace and its orthogonal complement is the zero vector.
3.4.9 Coordinates with respect to a basis
With a subspace V of Rn, we have V's basis B as:

B = {v⃗1, v⃗2, ..., v⃗k}

We can describe any vector a⃗ ∈ V as a linear combination of the vectors in its basis B:

a⃗ = c1 v⃗1 + c2 v⃗2 + · · · + ck v⃗k

We can take these coefficients c1, c2, ..., ck as the coordinates of a⃗ with respect to B, notated as:

[a⃗]B = [c1, c2, ..., ck]^T

Basically, what has happened here is that a new coordinate system, based on the basis B, is being used.
Example

Say we have v⃗1 = [2, 1]^T and v⃗2 = [1, 2]^T, where B = {v⃗1, v⃗2} is a basis for R2.

The point (8, 7) in R2 is equal to 3v⃗1 + 2v⃗2. If we set:

a⃗ = 3v⃗1 + 2v⃗2

then we can describe a⃗ with respect to B:

[a⃗]B = [3, 2]^T
Change of basis matrix

Given the basis:

B = {v⃗1, v⃗2, ..., v⃗k}

and:

[a⃗]B = [c1, c2, ..., ck]^T

say there is some n × k matrix whose column vectors are the basis vectors:

C = [v⃗1, v⃗2, ..., v⃗k]

We can do:

C[a⃗]B = a⃗

The matrix C is known as the change of basis matrix and allows us to get a⃗ in standard coordinates.
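To make this concrete, here is a minimal NumPy sketch using the basis from the example above (v⃗1 = [2, 1], v⃗2 = [1, 2]):

```python
import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])    # change of basis matrix: columns are the basis vectors

a_B = np.array([3.0, 2.0])    # coordinates of a with respect to B
a = C @ a_B                   # back to standard coordinates
print(a)                      # [8. 7.]

# Going the other way (C is square and invertible here):
print(np.linalg.solve(C, a))  # [3. 2.] = [a]_B
```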
Invertible change of basis matrix
Given the basis of some subspace:
B = {v⃗1 , v⃗2 , . . . , v⃗k }
where v⃗1 , v⃗2 , . . . , v⃗k ∈ Rn , and we have a change of basis matrix:
C = [v⃗1 , v⃗2 , . . . , v⃗k ]
Assume:
• C is invertible
• C is square (that is, k = n, which implies that we have n basis vectors, that is, B is a basis
for Rn )
• C’s columns are linearly independent (which they are because it is formed out of basis vectors,
which by definition are linearly independent)
Under these assumptions:
• If C is invertible, the span of B is equal to Rn .
• If the span of B is equal to Rn , C is invertible.
Thus:
[a⃗]B = C^{-1} a⃗
Transformation matrix with respect to a basis
Say we have a linear transformation T : Rn → Rn, which we can express as T(x⃗) = Ax⃗. This is with respect to the standard basis; we can say A is the transformation matrix for T with respect to the standard basis.

Say we have another basis B = {v⃗1, v⃗2, ..., v⃗n} for Rn.

We could write:

[T(x⃗)]B = D[x⃗]B

and we call D the transformation matrix for T with respect to the basis B.

Then we have (proof omitted):

D = C^{-1} A C
where:
• D is the transformation matrix for T with respect to the basis B
• A is the transformation matrix for T with respect to the standard basis
• C is the change of basis matrix for B
3.4.10 Orthonormal bases
If B is an orthonormal set, it is linearly independent, and thus it could be a basis. If B is a basis,
then it is an orthonormal basis.
Coordinates with respect to orthonormal bases

Orthonormal bases make good coordinate systems - it is much easier to find [x⃗]B if B is an orthonormal basis. It is just:

[x⃗]B = [c1, c2, ..., ck]^T = [v⃗1 · x⃗, v⃗2 · x⃗, ..., v⃗k · x⃗]^T
Note that the standard basis for Rn is an orthonormal basis.
3.5 Transformations
A transformation is just a function which operates on vectors, which, instead of using f , is usually
denoted T .
3.5.1 Linear transformations
A linear transformation is a transformation:
T : Rn → Rm
where we can take two vectors a⃗, ⃗b ∈ Rn and the following conditions are satisfied:
T(a⃗ + b⃗) = T(a⃗) + T(b⃗)
T(c a⃗) = c T(a⃗)
Put another way, a linear transformation is a transformation in which lines are preserved (they don’t
become curves) and the origin remains at the origin. This can be thought of as transforming space
such that grid lines remain parallel and evenly-spaced.
A linear transformation of a space can be described in terms of transformations of the space’s basis
vectors, e.g. î, ĵ. For example, if the basis vectors î, ĵ end up at [a, c], [b, d] respectively, an arbitrary
vector [x, y ] would be transformed to:
x [a, c]^T + y [b, d]^T = [ax + by, cx + dy]^T

which is equivalent to the matrix-vector product:

[a b; c d] [x, y]^T
In this way, we can think of matrices as representing a transformation of space and the transformation
itself as a product with that matrix.
Extending this further, you can think of matrix multiplication as a composition of transformations
- each matrix represents one transformation; the resulting matrix product is a composition of those
transformations.
The matrix does not have to be square; i.e. it does not have to share dimensionality (in terms of the
matrix’s rows) with the space it’s being applied to. The resulting transformation will have different
dimensions.
For example, a 3x2 matrix will transform a 2D space to a 3D space.
Linear transformation examples
These examples are all in R2 since it’s easier to visualize. But you can scale them up to any Rn .
Reflection

To take the triangle on the left and reflect it over the y-axis to get the triangle on the right, all you're doing is changing the sign of all the x values. So the transformation would look like:

T([x, y]^T) = [−x, y]^T

Scaling

Say you want to double the size of the triangle instead of flipping it. You'd just scale up all of its values:

T([x, y]^T) = [2x, 2y]^T
Compositions of linear transformations

The composition of linear transformations S(x⃗) = Ax⃗ and T(x⃗) = Bx⃗ is denoted:

T ◦ S(x⃗) = T(S(x⃗))

This is read: "the composition of T with S".

If T : Y → Z and S : X → Y, then T ◦ S : X → Z.

A composition of linear transformations is also a linear transformation (proof omitted here). Because of this, the composition can also be expressed:

T ◦ S(x⃗) = Cx⃗

where C = BA (proof omitted), so:

T ◦ S(x⃗) = BAx⃗
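A minimal sketch of this fact (assuming NumPy; the example matrices are mine):

```python
import numpy as np

A = np.array([[0.0, -1.0], [1.0, 0.0]])   # S: rotate 90 degrees
B = np.array([[2.0, 0.0], [0.0, 2.0]])    # T: scale by 2

x = np.array([1.0, 3.0])
print(B @ (A @ x))     # T(S(x)), applying the transformations one after the other
print((B @ A) @ x)     # same result: the composition's single matrix is C = BA
```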
3.5.2 Kernels
The kernel of T, denoted ker(T ), is all of the vectors in the domain such that the transformation of
those vectors is equal to the zero vector:
ker(T) = {x⃗ ∈ Rn | T(x⃗) = 0⃗}

You may notice that, because T(x⃗) = Ax⃗:

ker(T) = N(A)
That is, the kernel of the transformation is the same as the nullspace of the transformation matrix.
3.6 Images

3.6.1 Image of a subset of a domain
When you pass a set of vectors (i.e. a subset of a domain Rn ) through a transformation, the result
is called the image of the set under the transformation. E.g. T (S) is the image of S under T .
For example, say we have some vectors which define the triangle on the left. When a transformation is applied to that set, the image is the result on the right.

Another example: if we have a transformation T : X → Y and A which is a subset of X, then T(A) is the image of A under T, which is equivalent to the set of transformations of each vector in A:

T(A) = {T(x⃗) ∈ Y | x⃗ ∈ A}
Images describe surjective functions, that is, a surjective function f : X → Y can also be written:
im(f ) = Y
since the image of the transformation encompasses the entire codomain Y .
3.6.2 Image of a subspace
The image of a subspace under a transformation is also a subspace. That is, if V is a subspace, T (V )
is also a subspace.
3.6.3 Image of a transformation
If, instead of a subset or subspace, you take the transformation of an entire space, i.e. T (Rn ), the
terminology is different: that is called the image of T , notated im(T ).
Because we know matrix-vector products are linear transformations:
T(x⃗) = Ax⃗
The image of a linear transformation matrix A is equivalent to its column space, that is:
im(T ) = C(A)
3.6.4 Preimage of a set
The preimage is the inverse image. For instance, consider a transformation mapping from the domain
X to the codomain Y :
T :X→Y
And say you have a set S which is a subset of Y . You want to find the set of values in X which map
to S, that is, the subset of X for which S is the image.
For a set S, this is notated:
T −1 (S)
Note that not every point in S needs to map back to X. That is, S may contain some points for
which there are no corresponding points in X. Because of this, the image of the preimage of S is
not necessarily equivalent to S, but we can be sure that it is at least a subset:
T (T −1 (S)) ⊆ S
3.7 Projections
A projection can kind of be thought of as a “shadow” of a vector:
Alternatively, it can be thought of as answering "how far does one vector go in the direction of another vector?". In the accompanying figure, the projection of b onto a tells us how far b goes in the direction of a.

The projection of x⃗ onto line L is notated:

Proj_L(x⃗)
Here, we have the projection of x⃗ (the red vector) onto the green line L. The projection is the dark red vector. This example is in R2 but this works in any Rn.

More formally, a projection of a vector x⃗ onto a line L is some vector in L where x⃗ − Proj_L(x⃗) is orthogonal to L.

A line can be expressed as the set of all scalar multiples of a vector, i.e.:

L = {cv⃗ | c ∈ R}

So we know that "some vector in L" can be represented as cv⃗:

Proj_L(x⃗) = cv⃗

By our definition of a projection, we also know that x⃗ − Proj_L(x⃗) is orthogonal to L, which can now be rewritten as:

(x⃗ − cv⃗) · v⃗ = 0

(This is the definition of orthogonal vectors.)
Written in terms of c, this simplifies down to:

c = (x⃗ · v⃗) / (v⃗ · v⃗)

So then we can rewrite:

Proj_L(x⃗) = ((x⃗ · v⃗) / (v⃗ · v⃗)) v⃗

or, better:

Proj_L(x⃗) = ((x⃗ · v⃗) / ||v⃗||²) v⃗

And you can pick whatever vector for v⃗, so long as it is part of line L.

However, if v⃗ is a unit vector û, then the projection simplifies even further:

Proj_L(x⃗) = (x⃗ · û)û

Projections are linear transformations (they satisfy the requirements, proof omitted), so you can represent them as matrix-vector products:

Proj_L(x⃗) = Ax⃗

where, in R2, the transformation matrix A is:

A = [u1² u2u1; u1u2 u2²]

where the ui are the components of the unit vector û.

Also note that, when the vector being projected onto is a unit vector, the length of a projection (i.e. the scalar component of the projection) is given by the dot product of the two vectors. For example, in the accompanying figure, the length of Proj_a⃗(b⃗) is a⃗ · b⃗ (with a⃗ of unit length).
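A small sketch of these formulas (assuming NumPy; the example vectors are mine):

```python
import numpy as np

v = np.array([3.0, 1.0])      # any vector on the line L
x = np.array([2.0, 4.0])

proj = (x @ v) / (v @ v) * v  # Proj_L(x) = ((x·v)/(v·v)) v
print(proj)                   # [3. 1.]

u = v / np.linalg.norm(v)     # unit-vector version
print((x @ u) * u)            # same projection
print(np.outer(u, u) @ x)     # as a matrix-vector product, with A = u u^T
```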
3.7.1 Projections onto subspaces
Given that V is a subspace of Rn, we know that V⊥ is also a subspace of Rn. If we have a vector x⃗ ∈ Rn, we know that x⃗ = v⃗ + w⃗ where v⃗ ∈ V and w⃗ ∈ V⊥. Then:

Proj_V(x⃗) = v⃗
Proj_{V⊥}(x⃗) = w⃗
Reminder: a projection onto a subspace is the same as a projection onto a line (a line through the origin is a subspace):

Proj_V(x⃗) = ((x⃗ · v⃗) / (v⃗ · v⃗)) v⃗

where:

V = span(v⃗) = {cv⃗ | c ∈ R}

So Proj_V(x⃗) is the unique vector v⃗ ∈ V such that x⃗ = v⃗ + w⃗, where w⃗ is a unique member of V⊥.
Projection onto a subspace as a linear transform

If V is a subspace of Rn and A is a matrix whose columns form a basis for V, then:

Proj_V(x⃗) = A(A^T A)^{-1} A^T x⃗

Note that A(A^T A)^{-1} A^T is just some matrix, which we can call B, so this is in the form of a linear transform, Bx⃗.

Also v⃗ = Proj_V(x⃗), so:

x⃗ = Proj_V(x⃗) + w⃗

where w⃗ is a unique member of V⊥.
Projections onto subspaces with orthonormal bases

Suppose V is a subspace of Rn and B = {v⃗1, v⃗2, ..., v⃗k} is an orthonormal basis for V. We have a vector x⃗ ∈ Rn, so x⃗ = v⃗ + w⃗ where v⃗ ∈ V and w⃗ ∈ V⊥.

We know (see previously) that, with A the matrix whose columns are the basis vectors:

Proj_V(x⃗) = A(A^T A)^{-1} A^T x⃗

which is quite complicated. It is much simpler for orthonormal bases:

Proj_V(x⃗) = AA^T x⃗
3.8 Identifying transformation properties

3.8.1 Determining if a transformation is surjective

A transformation T(x⃗) = Ax⃗ is surjective ("onto") if the column space of A equals the codomain:
span(a1 , a2 , . . . , an ) = C(A) = Rm
which can also be stated as:
rank(A) = m
3.8.2 Determining if a transformation is injective

A transformation T(x⃗) = Ax⃗ is injective ("one-to-one") if the nullspace of A contains only the zero vector:
N(A) = {⃗0}
which is true if the set of A’s column vectors is linearly independent.
This can also be stated as:
rank(A) = n
3.8.3 Determining if a transformation is invertible

A transformation is invertible if it is both injective and surjective.

For a transformation to be surjective:

rank(A) = m

And for a transformation to be injective:

rank(A) = n
Therefore for a transformation to be invertible:
rank(A) = m = n
So the transformation matrix A must be a square matrix.
3.8.4 Inverse transformations of linear transformations

Inverse transformations are linear transformations if the original transformation is both linear and invertible. That is, if T is invertible and linear, T^{-1} is linear:

T^{-1}(x⃗) = A^{-1}x⃗

(T^{-1} ◦ T)(x⃗) = A^{-1}Ax⃗ = I_n x⃗ = AA^{-1}x⃗ = (T ◦ T^{-1})(x⃗)
3.9 Eigenvalues and Eigenvectors

Say we have a linear transformation T : Rn → Rn such that:

T(v⃗) = Av⃗ = λv⃗

That is, the transformation matrix A merely scales v⃗ by the scalar λ.

We say that:

• v⃗ is an eigenvector of T
• λ is the eigenvalue associated with that eigenvector
Eigenvectors are vectors for which matrix multiplication is equivalent to only a scalar multiplication, nothing more. λ, the eigenvalue, is the scalar by which the transformation matrix A scales the eigenvector.

Another way to put this: given a square matrix A ∈ Rn×n, we say λ ∈ C is an eigenvalue of A and x ∈ Cn is the corresponding eigenvector if:

Ax = λx, x ≠ 0

Note that C refers to the set of complex numbers.

So this means that multiplying A by x just results in a new vector which points in the same direction as x but scaled by a factor λ.
For any eigenvector x ∈ Cn and scalar c ∈ C, A(cx) = cAx = cλx = λ(cx); that is, cx is also an eigenvector. But when talking about "the" eigenvector associated with λ, it is assumed that the eigenvector is normalized to length 1 (though you still have the ambiguity that both x and −x are eigenvectors in this sense).
Eigenvalues and eigenvectors come up when maximizing some function of a matrix.
So what are our eigenvectors? What v⃗ satisfies:

Av⃗ = λv⃗, v⃗ ≠ 0⃗

We can do:

Av⃗ = λv⃗
0⃗ = λv⃗ − Av⃗

We know that v⃗ = I_n v⃗, so we can do:

0⃗ = λI_n v⃗ − Av⃗ = (λI_n − A)v⃗

The first term, λI_n − A, is just some matrix which we can call B, so we have:

0⃗ = Bv⃗

which, by our definition of nullspace, indicates that v⃗ is in the nullspace of B. That is:

v⃗ ∈ N(λI_n − A)
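In practice eigenvalues and eigenvectors are computed numerically; a minimal sketch (assuming NumPy; the example matrix is mine):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)    # columns of eigvecs are unit-length eigenvectors
for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, lam * v)) # A v = λ v holds for each pair

print(np.isclose(np.trace(A), eigvals.sum()))        # tr(A) = sum of eigenvalues
print(np.isclose(np.linalg.det(A), eigvals.prod()))  # |A| = product of eigenvalues
```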
3.9.1 Properties of eigenvalues and eigenvectors

• The trace of A is equal to the sum of its eigenvalues: tr(A) = ∑_{i=1}^{n} λi.
• The determinant of A is equal to the product of its eigenvalues: |A| = ∏_{i=1}^{n} λi.
• The rank of A is equal to the number of non-zero eigenvalues of A.
• If A is non-singular, then 1/λi is an eigenvalue of A^{-1} with associated eigenvector xi, i.e. A^{-1}xi = (1/λi)xi.
• The eigenvalues of a diagonal matrix D = diag(d1, ..., dn) are just the diagonal entries d1, ..., dn.
3.9.2 Diagonalizable matrices
All eigenvector equations can be written simultaneously as:
AX = XΛ
where the columns of X ∈ Rn×n are the eigenvectors of A and Λ is a diagonal matrix whose entries
are the eigenvalues of A, i.e. Λ = diag(λ1 , . . . , λn ).
If the eigenvectors of A are linearly independent, then the matrix X will be invertible, so that
A = XΛX −1 . A matrix that can be written in this form is called diagonalizable.
3.9.3 Eigenvalues & eigenvectors of symmetric matrices

For a symmetric matrix A ∈ Sn, the eigenvalues of A are all real and the eigenvectors of A can be chosen to be orthonormal.

If for all of A's eigenvalues λi:

• λi > 0, then A is positive definite.
• λi ≥ 0, then A is positive semidefinite.
• λi < 0, then A is negative definite.
• λi ≤ 0, then A is negative semidefinite.
• If the eigenvalues include both positive and negative values, then A is indefinite.
Example

Say we have a linear transformation T(x⃗) = Ax⃗. Here are some example values of x⃗ being input and the output vectors they yield (it's not important here what A actually looks like; it's just to help distinguish what is and isn't an eigenvector).

• A[1, 0]^T = [1, 1]^T
  – x⃗ is not an eigenvector; it was not merely scaled by A.
• A[4, 7]^T = [8, 14]^T
  – x⃗ is an eigenvector; it was only scaled by A. This is a simple example where the vector was scaled up by 2, so the eigenvalue here is 2.
3.9.4 Eigenspace
The eigenvectors that correspond to an eigenvalue λ form the eigenspace for that λ, notated Eλ :
Eλ = N(λIn − A)
3.9.5 Eigenbasis
Say we have an n × n matrix A. An eigenbasis is a basis for Rn consisting entirely of eigenvectors
for A.
3.10 Tensors
Tensors are generalizations of scalars, vectors, and matrices. A tensor is distinguished by its rank, which is the number of indices it has. A scalar is a 0th-rank tensor (it has no indices), a vector is a 1st-rank tensor, i.e. its components are accessed by one index, e.g. x_i, a matrix is a 2nd-rank tensor, i.e. its components are accessed by two indices, e.g. X_{i,j}, and so on.
Just as we have scalar and vector fields, we also have tensor fields.
3.11 References
• Math for Machine Learning. Hal Daumé III. August 28, 2009.
• Tensor. Rowland, Todd and Weisstein, Eric W. Wolfram MathWorld.
• Essence of Linear Algebra. 3Blue1Brown.
4 Calculus

4.1 Differentiation
The slope of a two-dimensional function (in higher dimensions, the term gradient is used instead of
“slope”; in particular, the gradient is the vector of partial derivatives) can be thought of as the rate
of change for that function.
For a linear function f (x) = ax + b, a is the slope; it is constant for all x throughout the function.
But for non-linear functions, e.g. f (x) = 3x 2 , the slope varies along with x.
Differentiation is a way to find another function, called the derivative of the original function, that gives us the rate of change (slope) of one variable with respect to another variable.
It tells us how to change the input in order to get a change in the output:
f (x + ϵ) ≈ f (x) + ϵf ′ (x)
This will become useful later on - many machine learning training methods use derivatives (in particular, multidimensional partial derivatives, i.e. gradients) to determine how to update weights (inputs)
in order to reduce error (the output).
4.1.1 Computing derivatives
Say that we want to compute the rate of change (slope) at a single point. How? It takes two points
to define a line, which we can easily compute the slope for.
Instead of a single point, we can consider two points that are very, very close together:
(x, f(x)) and (x + h, f(x + h))
Note that sometimes δ or δx is used instead of h.
Their slope is then given by:

(f(x + h) − f(x)) / (x + h − x) = (f(x + h) − f(x)) / h

We want the two points as close as possible, so we can look at the limit as h → 0:

lim_{h→0} (f(x + h) − f(x)) / h

That is the derivative of f:

f'(x) = lim_{h→0} (f(x + h) − f(x)) / h
If this limit exists, we say that f is differentiable at x and that its derivative at x is f ′ (x).
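This limit definition translates directly into a numerical approximation; a minimal Python sketch (the helper name `derivative` and the choice of h are mine), using the f(x) = 3x² example from above:

```python
def derivative(f, x, h=1e-6):
    """Finite-difference approximation of f'(x) using a small step h."""
    return (f(x + h) - f(x)) / h

f = lambda x: 3 * x ** 2
print(derivative(f, 2.0))   # ≈ 12, since f'(x) = 6x
```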
Example

We have a car and a variable x which describes its position at a given point in time t. That is, f(t) = x.

With differentiation we can get dx/dt, which is the rate of change of the car's position with respect to time, i.e. the speed (velocity) of the car.

Note that this is not the same as ∆x/∆t, which gives us the change in x over a time interval ∆t. This is the average velocity over that time interval.

If instead we want instantaneous velocity - the velocity at a given point in time - we need to have the time interval ∆t approach 0 (we can't set ∆t to 0 because then we have division by 0). This is equivalent to the derivative described previously:

lim_{∆t→0} ∆x/∆t = dx/dt
This can be read as:

• "the rate of change in x with respect to t", or
• "an infinitesimal change in x divided by an infinitesimal change in t"

For a given function f(x), this can also be written as:

d/dx f(x) = lim_{∆→0} (f(x + ∆) − f(x)) / ∆
4.1.2 Notation

A derivative of a function y = f(x) may be notated:

• f'(x)
• D_x[f(x)]
• Df(x)
• dy/dx
• d/dx [y]

As a special case, if we are looking at a variable with respect to time t, we can use Newton's dot notation:

• ḟ = df/dt

4.1.3 Differentiation rules
• Derivative of a constant function: For any fixed real number c, d/dx [c] = 0.
  – This is because a constant function is just a horizontal line (it has a slope of 0).
• Derivative of a linear function: For any fixed real numbers m and c, d/dx [mx + c] = m
• Constant multiple rule: For any fixed real number c, d/dx [cf(x)] = c d/dx [f(x)]
• Addition rule: d/dx [f(x) ± g(x)] = d/dx [f(x)] ± d/dx [g(x)]
• The power rule: d/dx [x^n] = nx^{n−1}
• Product rule: d/dx [f(x) · g(x)] = f(x) · g'(x) + f'(x) · g(x)
• Quotient rule: d/dx [f(x)/g(x)] = (g(x)f'(x) − f(x)g'(x)) / g(x)²
Example

d/dx [6x⁵ + 3x² + 3x + 1]

• Apply the addition rule: d/dx [6x⁵] + d/dx [3x²] + d/dx [3x] + d/dx [1]
• Apply the linear and constant rules: d/dx [6x⁵] + d/dx [3x²] + 3 + 0
• Apply the constant multiple rule: 6 d/dx [x⁵] + 3 d/dx [x²] + 3
• Then the power rule: 6(5x⁴) + 3(2x) + 3
• And finally: 30x⁴ + 6x + 3
Chain rule

If a function f is composed of two differentiable functions y(x) and u(x), so that f(x) = y(u(x)), then f(x) is differentiable and:

df/dx = dy/du · du/dx

This rule can be applied sequentially to nestings (compositions) of many functions, e.g. f(g(h(x))):

df/dx = df/dg · dg/dh · dh/dx

The chain rule is very useful when you recompose functions in terms of nested functions.
Example

Given f(x) = (x² + 1)³, we can define another function u(x) = x² + 1, thus we can rewrite f(x) in terms of u(x), that is: f(x) = u(x)³.

• We can apply the chain rule: df/dx = df/du · du/dx.
• Then substitute: df/dx = d/du [u³] · d/dx (x² + 1).
• Then we can just apply the rest of our rules: df/dx = 3u² · 2x.
• Then substitute again: df/dx = 3(x² + 1)² · 2x and simplify.
4.1.4 Higher order derivatives
The derivative of a function as described above is the first derivative.
The second derivative, or second order derivative, is the derivative of the first derivative, denoted
f ′′ (x).
There’s also the third derivative, f ′′′ (x), and a fourth, and so on.
Any derivative beyond the first is a higher order derivative.
Notation
The above notation gets unwieldy, so there are alternate notations.
For the nth derivative:

• f^(n)(x) (this is to distinguish from f^n(x), which is the quantity f(x) raised to the nth power)
• d^n f / dx^n (Leibniz notation)
• d^n/dx^n [f(x)] (another form of Leibniz notation)
• D^n f (Euler's notation)
4.1.5 Explicit differentiation
When dealing with multiple variables, there is sometimes the option of explicit differentiation. This
simply involves expressing one variable in terms of the other.
For example: x² + y² = 1. This can be rewritten in terms of x like so: y = ±(1 − x²)^{1/2}.

Here it is easy to apply the chain rule:

u(x) = 1 − x²
y = u(x)^{1/2}

dy/dx = d/du [u^{1/2}] · d/dx [1 − x²]
      = d/du [u^{1/2}] · (d/dx [1] − d/dx [x²])
      = d/du [u^{1/2}] · (−d/dx [x²])
      = d/du [u^{1/2}] · (−2x)
      = (1/2) u^{−1/2} · (−2x)
      = (1/2)(1 − x²)^{−1/2} · (−2x)
      = −x(1 − x²)^{−1/2}
      = −x / (1 − x²)^{1/2}
      = −x/y
4.1.6 Implicit differentiation
Implicit differentiation is useful for differentiating equations which cannot be explicitly differentiated
because it is impossible to isolate variables. With implicit differentiation, you do not need to define
one of the variables in terms of the other.
For example, using the same equation from before: x 2 + y 2 = 1.
First, differentiate with respect to x on both sides of the equation:

d/dx [x² + y²] = d/dx [1]
d/dx [x² + y²] = 0
d/dx [x²] + d/dx [y²] = 0

To differentiate d/dx [y²], we can define a new function f(y(x)) = y² and then apply the chain rule:
df/dx = df/dy · dy/dx = d/dy [y²] · dy/dx = 2y · y'

So returning to our other in-progress derivative:

d/dx [x²] + d/dx [y²] = 0

We can substitute and bring it to completion:

d/dx [x²] + d/dx [y²] = 0
d/dx [x²] + 2y y' = 0
2x + 2y y' = 0
2y y' = −2x
y' = −2x/2y
y' = −x/y
4.1.7 Derivatives of trigonometric functions

d/dx sin(x) = cos(x)
d/dx cos(x) = −sin(x)
d/dx tan(x) = sec²(x)
d/dx sec(x) = sec(x) tan(x)
d/dx csc(x) = −csc(x) cot(x)
d/dx cot(x) = −csc²(x)
d/dx arcsin(x) = 1/√(1 − x²)
d/dx arccos(x) = −1/√(1 − x²)
d/dx arctan(x) = 1/(1 + x²)
4.1.8 Derivatives of exponential and logarithmic functions
d/dx e^x = e^x
d/dx a^x = ln(a) a^x
d/dx ln(x) = 1/x
d/dx log_b(x) = 1/(x ln(b))
4.1.9 Extreme Value Theorem
A global maximum (or absolute maximum) of a function f on a closed interval I is a value f (c)
such that f (c) ≥ f (x) for all x in I.
A global minimum (or absolute minimum) of a function f on a closed interval I is a value f(c)
such that f (c) ≤ f (x) for all x in I.
The extreme value theorem states that if f is a function that is continuous on the closed interval
[a, b], then f has both a global minimum and a global maximum on [a, b]. It is assumed that a and
b are both finite.
Extrema and inflection/inflexion points
Note that at any extremum (i.e. a minimum or a maximum), global or local, the slope is 0 because
the graph stops rising/falling and “turns around”. For this reason, extrema are also called stationary
points or turning points.
Thus, the first derivative of a function is equal to 0 at extrema. But the converse does not hold true: a point is not always an extremum when the first derivative equals 0. This is because a slope of 0 may also be found at a point of inflection:
To discern extrema from inflection points, you can use the extremum test, aka the second derivative
test.
If the second derivative at the stationary point is positive (increasing) or negative (decreasing), then
we know we have a minimum or a maximum, respectively.
The intuition here is that the rate of change is also changing at extrema (e.g. it is going from a
positive slope to a negative slope, which indicates a maximum, or the reverse, which indicates a
minimum).
However, if the second derivative is also 0, then we still have not distinguished the point. It may be a
saddle point or on a flat region. What you can do is continue differentiating until you get a non-zero
result.
If we take n to be the order of the derivative yielding the non-zero result, then if n − 1 is odd, we have a true extremum. Again, if the non-zero result is positive, then it is a minimum; if it is negative, it is a maximum.
However, if n − 1 is even, then we have a point of inflection.
Critical points

A critical point is a point where the function's derivative is 0 or not defined. So stationary points are critical points.
4.1.10 Rolle's Theorem
If a function f (x) is continuous on the closed interval [a, b], is differentiable on the open interval
(a, b), and f (a) = f (b), then there exists at least one number c in the interval (a, b) such that
f ′ (c) = 0.
This is basically saying that if you have an interval which ends with the same value it starts with, at
some point in that curve the slope will be 0:
4.1.11 Mean Value Theorem
If f (x) is continuous on the closed interval [a, b] and differentiable on the open interval (a, b), there
exists a number c in the open interval (a, b) such that
f'(c) = (f(b) − f(a)) / (b − a)
This is basically saying that there is some point on the interval where its instantaneous slope is equal
to the average slope of the interval.
Rolle’s Theorem is a special case of the Mean Value Theorem where f (a) = f (b).
4.1.12 L'Hopital's Rule
An indeterminate limit is one which results in 0/0 or ±∞/±∞.

If lim_{x→c} f(x)/g(x) is indeterminate of type 0/0 or ±∞/±∞, then lim_{x→c} f(x)/g(x) = lim_{x→c} f'(x)/g'(x).
If the resulting limit here is also indeterminate, you can re-apply L’Hopital’s rule until it is not.
Note that c can be a finite value, ∞, or −∞.
4.1.13 Taylor Series

Certain functions can be expressed as an expansion around a point a. This expansion is known as a Taylor series and is an infinite sum involving the function and its derivatives at a:

f(x) = ∑_{n=0}^{∞} (f^(n)(a) / n!) (x − a)^n
When a = 0, the series is known as a Maclaurin series.
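A minimal sketch (the helper name `taylor_exp` and the choice of e^x are mine) of truncating this series to approximate a function, here the Maclaurin series of e^x, whose nth derivative at a is e^a:

```python
import math

def taylor_exp(x, terms=10, a=0.0):
    """Approximate e^x by the first `terms` terms of its Taylor series around a."""
    return sum(math.exp(a) * (x - a) ** n / math.factorial(n) for n in range(terms))

print(taylor_exp(1.0, terms=10))  # ≈ 2.71828...
print(math.exp(1.0))              # reference value
```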
4.2 Integration

4.2.1 Definite integral
How can we find the area under a graph?
We can try to approximate the area using a finite number (n) of rectangles. The areas of rectangles are easy to calculate, so we can just add up their areas.

The more rectangles (i.e. increasing n) we fit, the better the approximation. So we can have n → ∞ to get the best approximation of the area under the curve.

Say we have a function f(x) that is positive over some interval [a, b]. The width of each rectangle over that interval, divided into n rectangles (subintervals), is ∆x = (b − a)/n. The endpoints of the subintervals can be denoted x_i, for i = 0, 1, ..., n. For each ith subinterval, we pick some sample point x_i* in the interval [x_{i−1}, x_i]. The function value f(x_i*) at this sample point gives the height of the ith rectangle.

Thus, for the ith rectangle, we have as its area:
a_i = f(x_i*) · (b − a)/n

or

a_i = f(x_i*) ∆x

So the total area for the interval is:

A_n = ∑_{i=1}^{n} f(x_i*) ∆x

This kind of area approximation is called a Riemann sum.

The best approximation then is:

lim_{n→∞} ∑_{i=1}^{n} f(x_i*) ∆x
So we define the definite integral:
Suppose f is a continuous function on [a, b] and ∆x = (b − a)/n. Then the definite integral of f between a and b is:

∫_a^b f(x)dx = lim_{n→∞} A_n = lim_{n→∞} ∑_{i=1}^{n} f(x_i*)∆x
where xi∗ are any sample points in the interval [xi−1 , xi ] and xk = a + k · ∆x for k = 0, . . . , n.
In the expression ∫_a^b f(x)dx, f is the integrand, a is the lower limit, and b is the upper limit of integration.
Left-handed vs right-handed Riemann sums
A right-handed Riemann sum is just one where xi∗ = xi , and a left-handed Riemann sum is just one
where xi∗ = xi−1 .
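A minimal sketch of a right- or left-handed Riemann sum in Python (the helper name `riemann_sum` is mine):

```python
def riemann_sum(f, a, b, n=100000, right=True):
    """Approximate the definite integral of f on [a, b] with n rectangles."""
    dx = (b - a) / n
    total = 0.0
    for i in range(1, n + 1):
        # right-handed: sample at x_i; left-handed: sample at x_{i-1}
        x_star = a + i * dx if right else a + (i - 1) * dx
        total += f(x_star) * dx
    return total

print(riemann_sum(lambda x: x ** 2, 0.0, 1.0))   # ≈ 1/3
```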
4.2.2 Basic properties of the integral

• The constant rule: ∫_a^b c f(x)dx = c ∫_a^b f(x)dx
  – A special case rule for integrating constants is: ∫_a^b c dx = c(b − a)
• Addition and subtraction rule: ∫_a^b (f(x) ± g(x))dx = ∫_a^b f(x)dx ± ∫_a^b g(x)dx
• The comparison rule:
  – Suppose f(x) ≥ 0 for all x in [a, b]. Then ∫_a^b f(x)dx ≥ 0.
  – Suppose f(x) ≥ g(x) for all x in [a, b]. Then ∫_a^b f(x)dx ≥ ∫_a^b g(x)dx.
  – Suppose M ≥ f(x) ≥ m for all x in [a, b]. Then M(b − a) ≥ ∫_a^b f(x)dx ≥ m(b − a).
• Additivity with respect to endpoints: Suppose a < c < b. Then ∫_a^b f(x)dx = ∫_a^c f(x)dx + ∫_c^b f(x)dx.
  – This is basically saying the area under the graph from a to b is equal to the area under the graph from a to c plus the area under the graph from c to b, so long as c is some point between a and b.
• Power rule of integration: As long as n ≠ −1 and 0 ∉ [a, b], or n > 0, ∫_a^b x^n dx = x^{n+1}/(n+1) |_a^b = (b^{n+1} − a^{n+1})/(n+1)
4.2.3 Mean Value Theorem for Integration

Suppose f(x) is continuous on [a, b]. Then (∫_a^b f(x)dx) / (b − a) = f(c) for some c in [a, b].
4.2.4 Antiderivatives
If we have a function f which is the derivative of another function F , i.e. f = F ′ , then F is an
antiderivative of f .
Generally, a function f has many antiderivatives because of how constants work in derivatives.
So we usually include a +C term, i.e. F(x) + C, to indicate that any constant can be added and the result still differentiates to f. Thus F often refers to a set of functions rather than a unique function.
We say that the integral of f is equal to this set of functions:
∫ f(x)dx = F(x) + C
This is the indefinite integral since we are not specifying a range the integral is computed over.
Thus, we are not given an explicit value but rather the function(s) that results (this typically includes
the ambiguous C term).
Here f is known as the integrand.
In a definite integral, we specify the upper and lower limits:

∫_a^b f(x)dx

which, as shown below, evaluates to a number (F(b) − F(a)) rather than a family of functions.
4.2.5 The fundamental theorem of calculus
The fundamental theorem of calculus connects the concept of a derivative to that of an integral.

Suppose that f is continuous on [a, b]. We can define a function F like so:

F(x) = ∫_a^x f(t)dt for x ∈ [a, b]

Suppose f is continuous on [a, b] and F is defined by F(x) = ∫_a^x f(t)dt. Then F is differentiable on (a, b), and for all x ∈ (a, b):

F'(x) = f(x)

Thus F is an antiderivative of f.

Suppose f is continuous on [a, b] and F is any antiderivative of f. Then:

∫_a^b f(x)dx = F(b) − F(a)

Note that F(b) − F(a) may be notated as F(x)|_a^b.
To understand why this is so, say that F(x) gives the area under f(x) from 0 to x.

The area under f(x) between x and x + h is F(x + h) − F(x); dividing by the width h gives the average height of f over that interval:

(F(x + h) − F(x)) / h

Note that this ratio is also how the derivative is calculated. So as we take the limit of h → 0:

lim_{h→0} (F(x + h) − F(x)) / h = f(x)

Thus we have shown that F'(x) = f(x), that is, that F is an antiderivative of f.
Given some arbitrary F(b) − F(a), we know that this is also equal to:

∫_a^b f(x)dx

Therefore:

∫_a^b f(x)dx = F(b) − F(a)
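A quick numerical sanity check of this relationship (a sketch, assuming SciPy; the example functions are mine):

```python
from scipy.integrate import quad

f = lambda x: 3 * x ** 2          # integrand
F = lambda x: x ** 3              # an antiderivative of f

a, b = 1.0, 2.0
numeric, _ = quad(f, a, b)        # numerical definite integral
print(numeric)                    # ≈ 7.0
print(F(b) - F(a))                # 7.0, matching F(b) − F(a)
```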
Basic properties of indefinite integrals
The integral rules defined above still apply.
• Power rule for indefinite integrals: For all n ≠ −1, ∫ x^n dx = x^{n+1}/(n+1) + C
• Integral of the inverse function: For f(x) = 1/x, remember that d/dx ln x = 1/x, so ∫ dx/x = ln |x| + C
• Integral of the exponential function: Because d/dx e^x = e^x, ∫ e^x dx = e^x + C
• The substitution rule for indefinite integrals: Assume u is differentiable with a continuous derivative and that f is continuous on the range of u. Then ∫ f(u(x)) (du/dx) dx = ∫ f(u)du.
  – Remember that du/dx is not a fraction, so you're not just "canceling" things out here.
Integration by parts

Suppose f and g are differentiable and their derivatives are continuous. Then:

∫ f(x)g(x)dx = f(x)(∫ g(x)dx) − ∫ (f'(x) ∫ g(x)dx) dx

You set f(x) in the following order, called ILATE:

• I for inverse trigonometric functions
• L for log functions
• A for algebraic functions
• T for trigonometric functions
• E for exponential functions
4.2.6 Improper integrals

There are two types of improper integrals:

1. Those on an unbounded function, e.g.:

∫_a^b f(x)dx, where f is unbounded at b

2. Those on an unbounded interval, e.g.:

∫_a^{+∞} f(x)dx

The integral on an unbounded function is known as an "improper integral with infinite integrand at b". To compute such an integral, we just consider a point infinitesimally previous to b:

lim_{ϵ→0+} ∫_a^{b−ϵ} f(x)dx
The integral on an unbounded interval is known as an “improper integral on an infinite interval”. Here
we just consider the limit:
lim_{N→+∞} ∫_a^N f(x)dx

If the interval is unbounded in both directions, we consider instead two separate intervals:

∫_{−∞}^{+∞} f(x)dx = ∫_{−∞}^{0} f(x)dx + ∫_{0}^{+∞} f(x)dx
Say we have the integral:

∫_1^∞ (1/x²) dx

If we set the upper bound to be a finite value b and have it approach infinity, we get:

lim_{b→∞} ∫_1^b (1/x²) dx = lim_{b→∞} (1/1 − 1/b) = lim_{b→∞} (1 − 1/b) = 1
The formal definition:

1. Suppose ∫_a^b f(x)dx exists for all b ≥ a. Then we define

∫_a^∞ f(x)dx = lim_{b→∞} ∫_a^b f(x)dx

as long as this limit exists and is finite. If it does exist we say the integral is convergent and otherwise we say it is divergent.

2. Similarly, if ∫_a^b f(x)dx exists for all a ≤ b, we define

∫_{−∞}^b f(x)dx = lim_{a→−∞} ∫_a^b f(x)dx

3. Finally, suppose c is a fixed real number and that ∫_{−∞}^c f(x)dx and ∫_c^∞ f(x)dx are both convergent. Then we define

∫_{−∞}^∞ f(x)dx = ∫_{−∞}^c f(x)dx + ∫_c^∞ f(x)dx
Improper integrals with a finite number of discontinuities
Suppose f is continuous on [a, b] except at points c1 < c2 < · · · < cn in [a, b].
We define

∫_a^b f(x)dx = ∫_a^{c1} f(x)dx + ∫_{c1}^{c2} f(x)dx + · · · + ∫_{cn}^b f(x)dx
as long as each integral on the right converges.
Improper integral with one discontinuity
As a simpler example, say we have an improper integral with a single discontinuity.
If f is continuous on the interval [a, b) and is discontinuous at b, we define
∫_a^b f(x)dx = lim_{c→b−} ∫_a^c f(x)dx

If this limit exists, we say the integral converges and otherwise we say it diverges.

Similarly, if f is continuous on the interval (a, b] and is discontinuous at a, we define

∫_a^b f(x)dx = lim_{c→a+} ∫_c^b f(x)dx

Finally, if f has a discontinuity at a point c in (a, b) and is continuous at all other points in [a, b], and if both ∫_a^c f(x)dx and ∫_c^b f(x)dx converge, we define

∫_a^b f(x)dx = ∫_a^c f(x)dx + ∫_c^b f(x)dx

4.3 Multivariable Calculus
We are frequently dealing with data in many dimensions, so we must expand the previous concepts
of derivatives and integrals to higher-dimensional spaces.
4.3.1 Integration
Double integrals
A definite integral for y = f (x) is the area under the curve of f (x), which is the sum of the areas of
infinitely small rectangles assembled in the shape of the curve.
But say we are working with three dimensions, i.e. we have z = f (x, y ). Then the volume under the
surface of f (x, y ) is the sum of the volumes of infinitely small chunks in the shape of the surface.
The area of one face of that chunk is the area under the curve, with respect to x, from x = 0 to
x = b (in the illustration below), i.e. the integral:
∫_0^b f(x, y)dx

Because this is with respect to x, this integral will be some function of y, e.g. g(y).

To get the volume of this chunk, we multiply that area by some depth dy, so the volume of a chunk is:

(∫_0^b f(x, y)dx) dy

So if we want to get the volume in the bounds of y = 0, y = a, then we integrate again:

∫_0^a (∫_0^b f(x, y)dx) dy
It is also written without the parentheses:

∫_0^a ∫_0^b f(x, y) dx dy
Note that here we first integrated wrt to x and then y , but you can do it the other way around as
well (integrate wrt y first, then x).
Note: the lower bounds here were 0 but that’s just an example.
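A minimal numerical sketch (assuming SciPy; the integrand and bounds are mine). Note that `dblquad` integrates over the inner variable first and accepts the inner bounds as functions of the outer variable, which is exactly what the variable-boundary case below needs:

```python
from scipy.integrate import dblquad

f = lambda y, x: x * y            # dblquad expects func(y, x): inner variable first
# Volume under f over 0 <= x <= 2, 0 <= y <= 1:
volume, _ = dblquad(f, 0, 2, lambda x: 0.0, lambda x: 1.0)
print(volume)                     # 1.0, since the inner integral gives x/2, and ∫_0^2 x/2 dx = 1
```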
Another way of conceptualizing double integrals
You could instead conceptualize the double integral as the sum of the volumes of infinitely small
columns:
The area of each column’s base, dx · dy , is sometimes notated as dA.
Variable boundaries
In the previous example we had fixed boundaries (see accompanying illustration, on the left).
What if instead we have a variable boundary (see accompanying illustration, on the right. The lower
x boundary varies now).
Well, you express variable boundaries as functions. As is the case in the example above, the lower x boundary is some function of y, g(y). So the volume would be:

∫_0^a ∫_{g(y)}^b f(x, y) dx dy
That’s if you first integrate wrt to x. If you first integrate wrt to y , instead the upper y boundary is
varying and that would be some function of x, h(x), i.e.:
∫_0^b ∫_0^{h(x)} f(x, y) dy dx
Triple integrals
Triple integrals also involve infinitely small volumes and in many cases are no different than double
integrals.
So why use triple integrals? Well, they are good for calculating the mass of something - if the density within the volume is not uniform. The density at a given point is expressed as f(x, y, z), so the mass of a variably dense volume can be expressed as:

∫_{x0}^{xfinal} ∫_{y0}^{yfinal} ∫_{z0}^{zfinal} f(x, y, z) dz dy dx
4.3.2 Partial derivatives
Say you have a function z = f (x, y ).
With two variables, we are now working in three dimensions. How does differential calculus work in 3
(or more) dimensions? In three dimensions, what is the slope at a given point? Any given point has
an infinite number of tangent lines (only one tangent plane though). So when you take a derivative
in three dimensions, you have to specify what direction that derivative is in.
Say we have z = x² + xy + y². If we want to take a derivative of this function, we have to hold one variable constant and derive with respect to the other variable. This derivative is called a partial derivative. If we were doing it with respect to x, then it would be notated as:

∂z/∂x or f_x(x, y)

So we could work this out as:

y = C
z = x² + xC + C²
∂z/∂x = 2x + C
∂z/∂x = 2x + y
Then you could get the partial derivative with respect to y, i.e.:

x = C
z = C² + Cy + y²
∂z/∂y = C + 2y
∂z/∂y = x + 2y
The plane that these two functions define together for a given point (x, y ) is the tangent plane at
that point.
More generally, for a function f (x, y ), the partial derivatives would be:
∂/∂x f(x, y) = lim_{∆→0} (f(x + ∆, y) − f(x, y)) / ∆
∂/∂y f(x, y) = lim_{∆→0} (f(x, y + ∆) − f(x, y)) / ∆
The partial derivative tells us how much the output of a function f changes with the given variable.
As alluded to earlier, this is important for machine learning because it tells us how a change in each
weight in a multidimensional problem will affect f .
4.3.3 Directional derivatives

Partial derivatives can be generalized into directional derivatives, which are derivatives with respect to any arbitrary line (it does not have to be, for example, with respect to the x or y axis). That is, with respect to any arbitrary direction. We represent a direction as a unit vector.
4.3.4 Gradients

A gradient is a vector of all the partial derivatives at a given point, which is to say it is a generalization of the derivative from two dimensions to higher dimensions.

The gradient of a function f, with a vector input w = [w1, ..., wd], is notated:

∇_w f = [∂f/∂w1, ..., ∂f/∂wd]^T
Sometimes it is just notated as ∇.
The gradient of some function f (x, y ), i.e. z = f (x, y ) is:
∇f = fx î + fy ĵ
That is, the partial derivative of f wrt to x times the unit vector in the x direction, î , plus the partial
of f wrt to y times the unit vector in the y direction, ĵ.
It can also be written (this is just different notation):
∇f = ∂/∂x f(x, y) î + ∂/∂y f(x, y) ĵ
It’s worth noting that this can be thought of in terms of matrices, i.e. given some function f :
Rm×n → R (that is, it takes a matrix A ∈ Rm×n and returns a real value), then the gradient of f ,
with respect to the matrix A, is the matrix of partial derivatives:
∇_A f(A) ∈ Rm×n =
⎡ ∂f(A)/∂A11  ∂f(A)/∂A12  ···  ∂f(A)/∂A1n ⎤
⎢ ∂f(A)/∂A21  ∂f(A)/∂A22  ···  ∂f(A)/∂A2n ⎥
⎢     ⋮            ⋮       ⋱        ⋮     ⎥
⎣ ∂f(A)/∂Am1  ∂f(A)/∂Am2  ···  ∂f(A)/∂Amn ⎦

Which is to say that (∇_A f(A))_{ij} = ∂f(A)/∂A_{ij}.
Properties
Some properties, taken from equivalent properties of partial derivatives, are:
• ∇x (f (x) + g(x)) = ∇x f (x) + ∇x g(x)
• For t ∈ R, ∇x (tf (x)) = t∇x f (x)
Example
Say we have the function f (x, y ) = x 2 + xy + y 2 .
Using the partials we calculated previously, the gradient is:
∇f = (2x + y )î + (2y + x)ĵ
So what we’re really calculating here is a vector field, which gives an x and a y vector, with the
magnitude of the partial derivative of f wrt to x and the partial derivative of f wrt to y , respectively,
then getting the vector which is the sum of those two vectors.
What the gradient tells us is, for a given point, what direction to travel to get the maximum slope
for z.
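A small numerical sketch of the gradient of this example function (the helper name `grad` and the finite-difference step are mine), evaluated at the point (4, 3) used in the divergence example later:

```python
import numpy as np

def grad(f, x, y, h=1e-6):
    """Numerical gradient of f(x, y) via forward finite differences."""
    dfdx = (f(x + h, y) - f(x, y)) / h
    dfdy = (f(x, y + h) - f(x, y)) / h
    return np.array([dfdx, dfdy])

f = lambda x, y: x ** 2 + x * y + y ** 2
print(grad(f, 4.0, 3.0))   # ≈ [2x + y, x + 2y] = [11, 10]
```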
4.3.5 The Jacobian

For the vector-valued function F(x) = [f(x)_1, ..., f(x)_k]^T, the Jacobian, notated ∇_x F(x) or just J, is:

∇_x F(x) =
⎡ ∂f(x)_1/∂x1  ···  ∂f(x)_1/∂xd ⎤
⎢      ⋮        ⋱        ⋮      ⎥
⎣ ∂f(x)_k/∂x1  ···  ∂f(x)_k/∂xd ⎦
That is, it is an m × n matrix of the first-order partial derivatives for a function f : Rn → Rm , (i.e. for
a function that defines a vector field).
To clarify, the difference between the gradient and the Jacobian is that the gradient is for a single
function, thus yielding a vector, whereas the Jacobian is for multiple functions, thus yielding a matrix.
4.3.6 The Hessian

Say we have a function f : Rn → R, which takes as input some vector x ∈ Rn and returns a real number (that is, it defines a scalar field).

The Hessian matrix with respect to x, written ∇²_x f(x) or just H, is the n × n matrix of second-order partial derivatives:

∇²_x f(x) ∈ Rn×n =
⎡ ∂²f(x)/∂x1²     ∂²f(x)/∂x1∂x2  ···  ∂²f(x)/∂x1∂xn ⎤
⎢ ∂²f(x)/∂x2∂x1   ∂²f(x)/∂x2²    ···  ∂²f(x)/∂x2∂xn ⎥
⎢      ⋮               ⋮          ⋱        ⋮        ⎥
⎣ ∂²f(x)/∂xn∂x1   ∂²f(x)/∂xn∂x2  ···  ∂²f(x)/∂xn²   ⎦

Which is to say (∇²_x f(x))_{ij} = ∂²f(x)/∂xi∂xj.
Wherever the second partial derivatives are continuous, the Hessian is symmetric, i.e. ∂²f(x)/∂xi∂xj = ∂²f(x)/∂xj∂xi. In machine learning, the Hessian is typically completely symmetric.
Just as the second derivative test is used to check if a critical point is a maximum, a minimum, or
still ambiguous, as the Hessian is composed of second-order partial derivatives, it does the same for
multiple dimensions.
This is accomplished as follows. If the Hessian matrix is real and symmetric, it can be decomposed
into a set of real eigenvalues and an orthogonal basis of eigenvectors. At critical points of the function
we can look at the Hessian’s eigenvalues:
• If the Hessian is positive definite, we have a local minimum (because movement in any direction
is positive)
• If the Hessian is negative definite, we have a local maximum (because movement in any direction
is negative)
• When at least one eigenvalue is positive and at least one is negative, we have a saddle point
• When all non-zero eigenvalues are of the same sign, but at least one is zero, we still have an
ambiguous critical point
The Jacobian and the Hessian are related by:
H(f )(x) = J(∇f )(x)
Intuitively, the (i, j)th element of the Hessian tells how the ith and jth dimensions accelerate together. For example, if the element is negative, then as one dimension accelerates, the other decelerates.
4.3.7 Scalar and vector fields
A scalar field just means a space where, for any point, you can get a scalar value.
For example, with f (x, y ) = x 2 + xy + y 2 , for any (x, y ) you get a scalar value.
A vector field is similar but instead of just a scalar value, you get a value and a direction.
For example, V⃗ = 2x î + 5y ĵ or V⃗ = x 2 y î + y ĵ.
4.3.8 Divergence
Say we have a vector field V⃗ = x 2 y î + 3y ĵ.
The divergence of that vector field is:
div(V⃗) = ∇ · V⃗
That is, it is the dot product (which tells us how much two vectors move together) of the gradient
and the vector field.
So for our example:

∇ = ∂/∂x î + ∂/∂y ĵ

∇ · V⃗ = ∂/∂x (x²y) + ∂/∂y (3y) = 2xy + 3
The divergence, which is a scalar number for any point in a vector field, represents the change in volume density from an infinitesimal volume around a given point in that field. A positive divergence means the volume density is decreasing (more going out than coming in); a negative divergence means the volume density is increasing (more coming in than going out; this is also called convergence). A divergence of 0 means the volume density is not changing.

Using our previously calculated divergence, say we want to look at the point (4, 3). We get the divergence 2 · 4 · 3 + 3 = 27. This means that, in an infinitesimal volume around the point (4, 3), the volume density is decreasing.
4.3.9 Curl
The curl measures the rotational effect of a vector field at a given point. Unlike divergence, where
we are seeing how much the gradient and the vector field move together, we are interested in seeing
how they move against each other. So we use their cross product:
curl(V⃗ ) = ∇ × V⃗
4.3.10 Optimization with eigenvalues

Consider, for a symmetric matrix A ∈ Sn, the following equality-constrained optimization problem:

max_{x ∈ Rn} x^T Ax, subject to ||x||²₂ = 1

Optimization problems with equality constraints are typically solved by forming the Lagrangian, an objective function which includes the equality constraints. For this particular problem (i.e. with the quadratic form), the Lagrangian is:

L(x, λ) = x^T Ax − λx^T x
Where λ is the Lagrange multiplier associated with the equality constraint. For x ∗ to be the optimal
point to the problem, the gradient of the Lagrangian has to be zero at x ∗ (among other conditions),
i.e.:
∇x L(x, λ) = ∇x (x T Ax − λx T x) = 2AT x − 2λx = 0
This is just the linear equation Ax = λx, so the only points which can maximize (or minimize) x T Ax,
assuming x T x = 1, are the eigenvectors of A.
4.4 Differential Equations
Differential equations are simply just equations that contain derivatives.
Ordinary differential equations (ODEs) involve equations containing:
• variables
• functions
• their derivatives
• their solutions
This is contrasted to partial differential equations (PDEs), which contain partial derivatives instead
of ordinary derivatives.
4.4.1
Solving simple differential equations
Say we have:
f ′′ (x) = 2
First we can integrate both sides:

∫ f′′(x) dx = ∫ 2 dx

f′(x) = 2x + C1
Then we can integrate once more:

∫ f′(x) dx = ∫ (2x + C1) dx

f(x) = x² + C1x + C2
So our solution is f (x) = x 2 + C1 x + C2 . For all values of C1 and C2 , we will get f ′′ = 2.
The values C1 and C2 represent initial conditions, e.g. the starting conditions of a model.
4.4.2
Basic first order differential equations
There are four main types of first order differential equations (though there are many others):
• separable
• homogeneous
• linear
• exact
Separable differential equations
A separable equation is in the form:

dy/dx = f(x)/g(y)
You can group the terms together like so:
g(y )dy = f (x)dx
And then integrate both sides to obtain the solution:

∫ g(y) dy = ∫ f(x) dx + C
Example
Say we want to solve

dy/dx = 3x²y
Separate the terms:

dy/y = (3x²) dx
Then integrate:

∫ dy/y = ∫ 3x² dx

ln y = x³ + C

y = e^{x³ + C}
If we let k = e^C (so k is a constant), we can write the solution as:

y = ke^{x³}
Homogeneous differential equations
A homogeneous equation is in the form:

dy/dx = f(y/x)
To make things easier, we can use the substitution

v = y/x

so

dy/dx = f(v)
Then we can set y = xv and use the product rule, so that we get:

dy/dx = v + x dv/dx

v + x dv/dx = f(v)

x dv/dx = f(v) − v

dv/dx = (f(v) − v) / x
so now the equation is in separable form and can be solved as a separable equation.
Linear differential equations
A linear first order differential equation is a differential equation in the form:

dy/dx + f(x)y = g(x)
To solve, you multiply both sides by I = e^{∫ f(x)dx} and integrate. I is known as the integrating factor.
Example
y ′ − 2xy = x
So in this case, f (x) = −2x, and g(x) = x, so the equation could be written:
y ′ + f (x)y = g(x)
So, we calculate the integrating factor:

I = e^{∫ −2x dx} = e^{−x²}
and multiply both sides by I, i.e.:
e^{−x²}(y′ − 2xy) = xe^{−x²}

(e^{−x²} · y′) − 2xe^{−x²}y = xe^{−x²}

∫ ((e^{−x²} · y′) − 2xe^{−x²}y) dx = ∫ xe^{−x²} dx
and work out the integration.
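If you want to check the work, sympy's dsolve handles first order linear equations like this one directly (a minimal sketch; the arbitrary constant appears as C1 in sympy's output):

```python
import sympy as sp

x = sp.symbols('x')
y = sp.Function('y')

# y' - 2xy = x
ode = sp.Eq(y(x).diff(x) - 2 * x * y(x), x)
solution = sp.dsolve(ode, y(x))
print(solution)   # something like y(x) = C1*exp(x**2) - 1/2
```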
Exact differential equations
An exact equation is in the form:

f(x, y) + g(x, y) dy/dx = 0

such that df/dy = dg/dx.

There exists some function h(x, y) where

dh/dx = f(x, y)

dh/dy = g(x, y)

so long as f, g, df/dy, and dg/dx are continuous on a connected region.

4.5 References
• Calculus. Revised 14 October 2013. Wikibooks.
• Multivariable Calculus. Khan Academy.
• Linear Algebra Review and Reference. Zico Kolter. October 16, 2007.
• Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
• Math for Machine Learning. Hal Daumé III. August 28, 2009.
5
Probability
Probability theory is the study of uncertainty.
5.1
Probability space
We typically talk about the probability of an event. The probability space defines the possible
outcomes for the event, and is defined by the triple (Ω, F, P ), where
• Ω is the space of possible outcomes, i.e. the outcome space (sometimes called the sample
space).
• F ⊆ 2Ω , where 2Ω is the power set of Ω (i.e. the set of all subsets of Ω, including the empty
set ∅ and Ω itself, the latter of which is called the trivial event), is the space of measurable
events or the event space.
• P is the probability measure, i.e. the probability distribution, that maps an event E ∈ F to
a real value between 0 and 1 (that is, P is a function that outputs a probability for the input
event).
For example, we have a six-sided dice, so the space of possible outcomes Ω = {1, 2, 3, 4, 5, 6}.
We are interested in whether or not the dice roll is odd or even, so the event space is F =
{∅, {1, 3, 5}, {2, 4, 6}, Ω}.
The outcome space Ω may be finite, in which case the event space F is typically taken to be 2^Ω, or it may be infinite, in which case the probability measure P must satisfy the following axioms:
• non-negativity: for all α ∈ F, P (α) ≥ 0.
• trivial event: P (Ω) = 1.
• additivity: For all α, β ∈ F and α ∩ β = ∅, P (α ∪ β) = P (α) + P (β).
Other axioms include:
• 0 ≤ P (a) ≤ 1
• P (True) = 1 and P (False) = 0
We refer to an event whose outcome is unknown as a trial, or an experiment, or an observation. An
event is a trial which has resolved (we know the outcome), and we say “the event has occurred” or
that the trial has “satisfied the event”.
The complement of an event is everything in the outcome space that is not the event, and may be notated in a few ways: ¬E, E^C, Ē.
If two events cannot occur together, they are mutually exclusive.
More concisely (from Probability, Paradox, and the Reasonable Person Principle):
• Experiment: An occurrence with an uncertain outcome that we can observe.
– For example, rolling a die.
• Outcome: The result of an experiment; one particular state of the world. Synonym
for “case.”
– For example: 6.
• Sample Space: The set of all possible outcomes for the experiment. (For now,
assume each outcome is equally likely.)
– For example, {1, 2, 3, 4, 5, 6}.
• Event: A subset of possible outcomes that together have some property we are
interested in.
– For example, the event “even die roll” is the set of outcomes {2, 4, 6}.
• Probability: The number of possible outcomes in the event divided by the number
in the sample space.
– For example, the probability of an even outcome from a six-sided die is |{2, 4,
6}| / |{1, 2, 3, 4, 5, 6}| = 3/6 = 1/2.
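These counting definitions map directly to code. A tiny sketch using the die example above:

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}                    # all outcomes, assumed equally likely
even = {o for o in sample_space if o % 2 == 0}       # the event "even die roll"

probability = Fraction(len(even), len(sample_space))
print(probability)   # 1/2
```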
5.2
Random Variables
A random variable (sometimes called a stochastic variable) is a function which maps outcomes
to real values (that is, they are technically not variables but rather functions), dependent on some
other probabilistic factor. Random variables represent uncertain events we are interested in with a
numerical value.
Random variables are typically denoted by a capital letter, e.g. X. Values they may take are typically
represented with a lowercase letter, e.g. x.
When we use P (X) we are referring to the distribution of the random variable X, which describes
the probabilities that X takes on each of its possible values.
Contrast this to P (x) which describes the probability of some arbitrary single value x; this is shorthand
for describing the probability that some random variable (e.g. X) takes on that value x, which is
more explicitly denoted P (X = x) or PX (x). This represents some real value.
For example, we may be flipping a coin and have a random variable X which takes on the value 0 if the flip results in heads and 1 if it results in tails. If it’s a fair coin, then we could say that P(X = 0) = P(X = 1) = 0.5.
Random variables may be:
• Discrete: the variable can only have specific values, e.g. on a 5 star rating system, the random
variable could only be one of the values [0, 1, 2, 3, 4, 5]. Another way of describing this is that
the space of the variable’s possible values (i.e. the outcome space) is countable and finite. For
discrete random variables which are not numeric, e.g. gender (male, female, etc), we use an
indicator function I to map non-numeric values to numbers, e.g. male = 0, female = 1, . . . ;
we call variables from such functions indicator variables.
• Continuous: the variable can have arbitrarily exact values, e.g. time, speed, distance. That is,
the outcome space is infinite.
• Mixed: these variables assign probabilities to both discrete and continuous random variables.
5.3
Joint and disjoint probabilities
• Joint probability: P (a ∩ b) = P (a ∧ b) = P (a, b) = P (a)P (b|a), the probability of both a
and b occurring.
• Disjoint probability: P (a ∪ b) = P (a ∨ b) = P (a) + P (b) − P (a, b), the probability of a or
b occurring.
Probabilities can be visualized as a Venn diagram:
Probability
The overlap is where both a and b occur (P (a, b)).
The previous axiom, describing P (a ∪ b), adds the spaces where a and b each occur, but this counts
their overlap twice, so we subtract it once.
5.4
Conditional Probability
The conditional probability is the probability of A given B, notated P (A|B).
Formally, this is:

P(A|B) = P(A ∩ B) / P(B)

Where P(B) > 0.
For example, say you have two dice. Say A = {snake eyes} and B = {double}.

P(A) = 1/36, since out of the 36 possible dice pairings, only one is snake eyes. P(B) = 1/6, since 6 of the 36 possible dice pairings are doubles.

Now what is P(A|B)? Well, intuitively, if B has happened, we have reduced our possible event space to just the 6 doubles, one of which is snake eyes, so P(A|B) = 1/6.
Intuitively, this can be thought of as the probability of a given a universe where b occurs.
The overlapping region is P (a|b)
So we ignore the part of the world in which b does not occur.
This can be re-written as:
P (a, b) = P (a|b)P (b)
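A short enumeration of the two-dice example (counting outcomes rather than simulating), confirming that P(A|B) = P(A, B)/P(B) = 1/6:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all 36 dice pairings
A = {o for o in outcomes if o == (1, 1)}          # snake eyes
B = {o for o in outcomes if o[0] == o[1]}         # doubles

P = lambda event: Fraction(len(event), len(outcomes))
print(P(A & B) / P(B))   # 1/6
```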
5.5
Independence
Events X and Y are independent if:
P (X ∩ Y ) = P (X)P (Y )
That is, their outcomes are unrelated.
Another way of saying this is that:
P (X|Y ) = P (X)
Knowing something about Y tells us nothing more about X.
The independence of X and Y can also be notated as X ⊥ Y .
From this we can infer that:
P (X, Y ) = P (X)P (Y )
More generally we can say that events A_1, . . . , A_n are mutually independent if P(∩_{i∈S} A_i) = ∏_{i∈S} P(A_i) for any S ⊂ {1, . . . , n}. That is, the joint probability of any subset of these events is just equal to the product of their individual probabilities.
Mutual independence implies pairwise independence, but note that the converse is not true (that is,
pairwise independence does not imply mutual independence).
5.5.1
Conditional Independence
Conditional independence is defined as:
P (X|Y, Z) = P (X|Z)
If X is independent of Y conditioned on Z. Which is to say X is independent of Y if Z is true or
known.
From this we can infer that:
P (X, Y |Z) = P (X|Z)P (Y |Z)
Note that mutual independence does not imply conditional independence.
Similarly, we can say that events A_1, . . . , A_n are conditionally independent given C if P(∩_{i∈S} A_i|C) = ∏_{i∈S} P(A_i|C) for any S ⊂ {1, . . . , n}.

5.6 The Chain Rule of Probability
Say we have the joint probability P (a, b, c). How do we turn this into conditional probabilities?
We can set y = b, c (that is, the intersection of b and c), then we have P (a, b, c) = P (a, y ), and
we can just apply the previous equation:
P (a, y ) = P (a|y )P (y )
P (a, b, c) = P (a|b, c)P (b, c)
And we can again apply the previous equation to P (b, c), which gets us:
P (a, b, c) = P (a|b, c)P (b|c)P (c)
This is generalized as the chain rule of probability:

P(x_1, . . . , x_n) = ∏_{i=1}^{n} P(x_i | x_{i−1}, . . . , x_1)
5.7
Combinations and Permutations
5.7.1
Permutations
With permutations, order matters. For instance, AB and BA are different permutations (though
they are the same combination, see below).
Permutations are notated:

xPy = P_y^x = P(x, y)
Where:
• x = total number of “items”
• y = “spots” or “spaces” or “positions” available for the items.
A permutation can be expanded like so:

nPk = n · (n − 1) · (n − 2) · · · (n − (k − 1))
And generalized to the following formula:

nPk = n! / (n − k)!
For example, consider 7P3. 7! = 7 × 6 × 5 × 4 × 3 × 2 × 1, but we only have 3 positions, i.e. 7 × 6 × 5, so we divide 7! by 4! to get our final answer, 210.
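In Python 3.8+, the standard library computes these counts directly (math.comb, used in the next subsection, is the order-insensitive counterpart):

```python
import math

print(math.perm(7, 3))                            # 210, i.e. 7!/(7-3)! = 7*6*5
print(math.factorial(7) // math.factorial(4))     # 210, the same thing spelled out
print(math.comb(7, 3))                            # 35, i.e. 7!/(3! * 4!)
```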
5.7.2
Combinations
With combinations, the order doesn’t matter.
The notation is basically the same as for permutations, except with a C instead of a P:

nCk = nPk / k!
or, expanded:

nCk = (n choose k) = n! / (k!(n − k)!)
The n and k pairing together is known as the binomial coefficient, and read as “n choose k”.
5.7.3
Combinations, permutations, and probability
Example
Say you have a coin. What is the probability of flipping exactly 3 heads out of 8 flips?

This count involves the binomial coefficient (8 choose 3). That is, we are basically trying to find all the combinations of 3 head flips out of the total 8 flips.

8C3 = 56

So there are 56 possible outcomes that result in exactly 3 heads. Because a coin has two possible outcomes, and we’re flipping 8 times, we know there are 2⁸ total possible outcomes.

So to figure out the probability, we can just take the ratio of these outcomes:

P(3 heads in 8 flips) = 56 / 2⁸ = 7/32
Example
Given outcomes with P(A) = 0.8 and P(B) = 0.2, what is the probability that exactly 3 out of 5 trials are A?

Basically, like before, we’re looking for the possible combinations of 3 A’s in 5 trials, that is 5 choose 3, i.e. 5C3 = 10.

So we know there are 10 possible outcomes resulting in 3 A’s out of 5. But what is the probability of a single combination that results in 3 A’s out of 5? We were given the probabilities so it’s just multiplying:

P(A)P(A)P(A)P(B)P(B) = 0.8³ × 0.2²

So then we just multiply the number of these combinations, 10, by this resulting probability to get the final answer.
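Both worked examples reduce to one line each with math.comb:

```python
from math import comb

# 8 fair coin flips, exactly 3 heads
print(comb(8, 3) / 2**8)                 # 0.21875, i.e. 56/256 = 7/32

# 5 trials with P(A) = 0.8, exactly 3 A's
print(comb(5, 3) * 0.8**3 * 0.2**2)      # ~0.2048
```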
5.8
Probability Distributions
For some random variable X, there is a probability distribution function P (X) (usually just called
the probability distribution); the particular kind depends on what kind of random variable X is
(i.e. discrete or continuous). A probability distribution describes the probability of a random variable
taking on its possible values.
If the random variable X is distributed according to, say, a Poisson distribution, we say that “X is
Poisson-distributed”.
Distributions themselves are described by parameters - variables which determine the specifics of
a distribution. Different kinds of distributions are described by different sets of parameters. For
instance, the particular shape of a normal distribution is determined by µ (the mean) and σ (the
standard deviation); we say the normal distribution is parameterized by µ and σ. Often we are
given a set of data and assume it comes from a particular kind of distribution, such as a normal
distribution, but we don’t know the specific parameterization of the distribution; that is, with the
normal distribution example, we don’t know what µ and σ are, so we use the data we have and try
and infer these unknown parameters.
5.8.1
Probability Mass Functions (PMF)
For discrete random variables, the distribution is a probability mass function.
It is called a “mass function” because it divides a unit mass (the total probability) across the different
values the random variable can take.
In the example figure, the random variable can take on one of three discrete values, {1, 3, 7}, with
the corresponding probabilities illustrated in the PMF.
5.8.2 Probability Density Functions (PDF)
For continuous random variables we have a probability density function. A probability density function f is a non-negative, integrable function such that:

∫_{Val(X)} f(x) dx = 1

Where Val(X) denotes the range of the random variable X, so this is the integral over the range of possible values for X.
The total area under the curve sums to 1 (which
is to say that the aggregate probability of all possible values for X sums to 1).
The probability of a random variable X distributed according to a PDF f is computed:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx
It’s worth noting that this implies that, for a continuous random variable X, the probability of taking
on any given single value is zero (when dealing with continuous random variables there are infinitely
precise values). Rather, we compute probabilities for a range of values of X; the probability is given
by the area under the curve over this range.
5.8.3
Distribution Patterns
There are a few ways we can describe distributions.
• Unimodal: The distribution has one main peak.
• Bimodal: The distribution has two (approximately) equivalent main peaks.
• Multimodal: The distribution has more than two (approximately) equivalent main peaks.
• Symmetrical: The distribution falls in equal numbers on both sides of the middle.
Skewness
Skewness describes distributions that have greater density on one side of the distribution. The side
with less is the direction of the skew.
Skewness is defined:

skewness = E[((X − µ)/σ)³]
Where σ is just the standard deviation.
The normal distribution has a skewness of 0.
Kurtosis
Kurtosis describes how the shape differs from a normal curve (if the tails are lighter or heavier).
Kurtosis is defined:

kurtosis = E[(X − µ)⁴] / (E[(X − µ)²])²
The standard normal distribution has a kurtosis of 3, so sometimes kurtosis is standardized by subtracting 3; this standardized kurtosis measure is called the excess kurtosis.
5.9
Cumulative Distribution Functions (CDF)
A cumulative distribution function CDF(x), often also denoted F (x), describes the cumulative probability up to where the random variable takes the value x, which is to say it tells you P (X ≤ x).
The complementary distribution (CCDF) of a distribution is 1 − CDF(x).
5.9.1 Discrete random variables
The cumulative distribution function of a discrete random variable is just the sum of the probabilities
for the values up to x.
Say our discrete random variable X can take values from {x_1, . . . , x_n}. We can define the CDF for X as:

CDF(x) = ∑_{x_i | x_i ≤ x} P(X = x_i)
The complete discrete CDF is a step function, as you might expect because the CDF is constant
between discrete values.
An example discrete CDF
5.9.2
Continuous random variables
The cumulative distribution function of a continuous random variable is:

CDF(x) = ∫_{−∞}^{x} P(x) dx
That is, it is the integral of the PDF up to the value in question.
Probability values for a specific range x_1 → x_2 can be calculated with:

CDF(x_2) − CDF(x_1)

or, equivalently:

P(x_1 ≤ X ≤ x_2) = ∫_{x_1}^{x_2} P(x) dx
5.9.3
Using CDFs
Visually, there are a few tricks you can do with CDFs.
You can estimate the median by looking at where CDF(x) = 0.5, i.e.:
You can estimate the probability that your x falls between two values:
You can estimate a confidence interval as well. For example, you can find the 90% confidence interval by looking at the x values in the range which produces CDF(x) = 0.05 and CDF(x) = 0.95.
An example continuous CDF (for the normal distribution)
Estimating the median value with a CDF
Estimating the probability of a value falling in a range with a CDF
Estimating the 80% confidence interval with a CDF

5.9.4 Survival function
The survival function of a random variable X is the complement of the CDF, that is, it is the
probability that the random variable is greater than some value x, i.e. P (X > x). So the survival
function is:
S(x) = P (X > x) = 1 − CDF(x)
5.10 Expected Values
The expected value of a random variable X is:
E[X] = µ
That is, it is the average (mean) value.
It can be thought of as a way of “summarizing” a random variable to a single value.
It can be thought of as a sample from a potentially infinite population. A sample from that population
is expected to be the mean of that population. The value of that mean depends on the distribution
of that population.
5.10.1
Discrete random variables
For a discrete random variable X with a sample space X(Ω) (i.e. all possible values X can take) and
with a PMF p, the expected value is:
E[X] = ∑_{x∈X(Ω)} x P(X = x)
The expected value exists only if this sum is well-defined, which basically means it has to aggregate
in some clear way, as either a finite value or positive or negative infinity. But it can’t, for instance,
contain a positive infinity and a negative infinity term simultaneously, because it’s undefined how
those combine.
For example, consider the infinite sum 1 − 1 + 1 − 1 + . . . . The −1 terms go to negative infinity
and the +1 terms go to positive infinity - this is an undefined sum.
5.10.2 Continuous random variables
For a continuous random variable X and a PDF f, the expected value is:

E[X] = ∫_{−∞}^{∞} x f(x) dx
The expected value exists only when this integral is well-defined.
5.10.3
The expectation rule
A function g(X) of a random variable X is itself a random variable. The expected value of that function can be expressed based on the random variable X, like so:

E[g(X)] = ∑_{x∈X(Ω)} g(x)p(x) (discrete)

E[g(X)] = ∫_{−∞}^{∞} g(x)f(x) dx (continuous)
Using whichever is appropriate, depending on if X is discrete or continuous.
5.10.4
Jensen’s Inequality
Jensen’s Inequality states that given a convex function g, then E[g(X)] ≥ g(E[X]).
5.10.5
Properties of expectations
For random variables X and Y where E[|X|] < ∞ and E[|Y |] < ∞ (that is, E[X], E[Y ] are finite),
we have the following properties:
1. E(a) = a for all a ∈ R. That is, the expected value of a constant is just the constant. This is
called the normalization property.
2. E(aX) = aE(X) for all a ∈ R
3. E(X + Y ) = E(X) + E(Y )
4. If X ≥ 0, that is, all possible values of X are greater than 0, then E[X] ≥ 0
5. If X ≤ Y , that is, each possible value of X is less than each possible value of Y , then
E[X] ≤ E[Y ]. This is called the order property.
6. If X and Y are independent, then E[XY ] = E[X]E[Y ]. Note that the converse is not true;
that is, if E[XY ] = E[X]E[Y ], this does not necessarily mean that X and Y are independent.
7. E[IA(X)] = P(X ∈ A), that is, the expected value of an indicator function:

IA(x) = 1 if x ∈ A, and 0 otherwise
is the probability that the random variable X is in A.
Properties 2 and 3 are called linearity.
To put linearity another way: Let X1 , X2 , . . . , Xn be random variables, which may be dependent or
independent:
E(X1 + X2 + · · · + Xn ) = E(X1 ) + E(X2 ) + · · · + E(Xn )
5.11 Variance
The variance of a distribution is the “spread” of a distribution.
The variance of a random variable X tells us how spread out the data is along that variable’s
axis/dimension.
It can be defined in a couple ways:
Var(X) = E[(X − E[X])2 ] = E[X 2 ] − E[X]2
Variance is not a linear function of X, for instance:
Var(aX + b) = a2 Var(X)
If random variables X and Y are independent, then:
Var(X + Y) = Var(X) + Var(Y)
5.11.1
Covariance
The covariance of two random variables is a measure of how “closely related” they are:
Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])]
With more than two variables, a covariance matrix is used.
Covariance matrices show two things:
• the variance of a variable i , located at the i , i element
• the covariance of variables i , j, located at the i , j and j, i elements
If the covariance between two variables is negative, then we have a downward slope, if it is positive,
then we have an upward slope.
So the covariance matrix tells us a lot about the shape of the data.
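A small numpy sketch (with made-up data) showing where the variances and covariances sit in the matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2 * x + rng.normal(size=500)     # y trends upward with x

C = np.cov(np.stack([x, y]))         # rows are variables, columns are observations
print(C[0, 0], C[1, 1])              # Var(x), Var(y) on the diagonal
print(C[0, 1], C[1, 0])              # Cov(x, y) off the diagonal (positive: upward slope)
```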
Example covariances
5.12 Common Probability Distributions
Here a few distributions you are likely to encounter are described in more detail.
5.12.1
Probability mass functions
Bernoulli Distribution
A random variable distributed according to the Bernoulli distribution can take on two possible values
0, 1, typically described as “failure” and “success” respectively. It has one parameter p, the probability
of success, which is taken to be P (X = 1). Such a random variable is sometimes called a Bernoulli
random variable.
The distribution is described as:
P(X = x) = p^x (1 − p)^{1−x}
And for a Bernoulli random variable X, it is notated X ∼ Ber(p). X is 1 with probability p and X is 0 with probability 1 − p.

The mean of a Bernoulli distribution is µ = p, and the standard deviation is σ = √(p(1 − p)).
A Bernoulli distribution describes a single trial, though often you may consider multiple trials, each
with its own random variable.
Geometric Distribution
Say we have a set of iid Bernoulli random variables, each representing a trial. What is the probability of finding the first success at the nth trial?
This can be described with a geometric distribution, which is a distribution where the probabilities
decrease exponentially fast.
It is formalized as:
P (n) = (1 − p)n−1 p
With the mean µ = 1/p and the standard deviation σ = √((1 − p)/p²).
Binomial Distribution
Suppose you have a binomial experiment (i.e. one with two mutually exclusive outcomes, such as “success” or “failure”) of n trials (that is, n Bernoulli trials), where p is the probability of success on an individual trial and is the same across all trials (that is, the trials are n iid Bernoulli random variables). You want to determine the probability of k successes in those n trials.

Note that binomial is in contrast to multinomial, in which a random variable can take on more than just two discrete values. This shouldn’t be confused with multivariate, which refers to a situation where there are multiple variables.

The resulting distribution is a binomial distribution, such as:

A Binomial Distribution histogram
The binomial distribution has the following properties:
µ = np

σ = √(np(1 − p))
The binomial distribution is expressed as:

P(X = k) = (n choose k) p^k (1 − p)^{n−k}
A binomial random variable X is denoted:
X ∼ Bin(n, p)
Here X ends up being the number of events that occurred over our trials.
Its expected value is:
E[Z|N, p] = Np
The binomial distribution has two parameters:
• n - a positive integer representing the number of trials
• p - the probability of an event occurring in a single trial
The special case N = 1 corresponds to the Bernoulli distribution.
If we have Z1 , Z2 , . . . , ZN Bernoulli random variables with the same p, then X = Z1 +Z2 +· · ·+ZN ∼
Bin(N, p).
Thus the expected value of a Bernoulli random variable is p (because N = 1).
Some example questions that can be answered with a binomial distribution:
• Out of ten tosses, how many times will this coin be heads?
• From the children born in a given hospital on a given day, how many of them will be girls?
• How many students in a given class room will have green eyes?
• How many mosquitoes, out of a swarm, will die when sprayed with insecticide?
(Source)
When the number of trials n gets large, the shape of the binomial distribution starts to approximate a normal distribution with the parameters µ = np and σ = √(np(1 − p)).
Negative Binomial Distribution
The negative binomial distribution is a more general form of the geometric distribution; instead of
giving the probability of the first success in the nth trial, it gives the probability of an arbitrary kth
success in the nth trial. Like the geometric and binomial distribution, it is expected that the trials
are iid Bernoulli random variables. The other requirement is that the last trial is a success.
This distribution is described as:
P(k|n) = (n−1 choose k−1) p^k (1 − p)^{n−k}
Poisson Distribution
The Poisson distribution is useful for describing the number of rare (independent) events in a large
population (of independent individuals) during some time span. It looks at how many times a discrete
event occurs, over a period of continuous space or time; without a fixed number of trials.
P(X = k) = (λ^k e^{−λ}) / k!, k = 0, 1, 2, . . .
If X is Poisson-distributed, we notate it:
X ∼ Poisson(λ)
For the Poisson distribution, λ is any positive real number. Its size is proportional to the probability of larger values in the distribution. That is, increasing λ assigns more probability to large values; decreasing it assigns more probability to small values. It is sometimes called the intensity of the distribution.
For the Poisson distribution, λ is known as the “(average) arrival rate” or sometimes just the “rate”.
k must be a non-negative integer (e.g. 0, 1, 2, . . . ).
A shorthand for saying that X has a Poisson mass distribution is X ∼ Poisson(λ).
For Poisson distributions, the expected value of our random variable is equal to the parameter λ, that
is:
E[X|λ] = µ = λ
In the Poisson distribution figure, although it looks like the values fall off at some point, it actually
has an infinite tail, so that every positive integer has some positive probability.
Poisson Distributions
Example
On average, 9 cars pass this intersection every hour. What is the probability that two
cars pass the intersection this hour? Assume a Poisson distribution.
This problem can be framed as: what is P (x = 2)?
We know the expected value is 9 and that we have a Poisson distribution, so λ = 9 and:
P(x = 2) = (9² / 2!) e^{−9}
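Evaluating that expression directly (a quick sketch; scipy.stats.poisson.pmf(2, 9) gives the same number if you prefer a library call):

```python
from math import exp, factorial

lam = 9    # average arrival rate (cars per hour)
k = 2      # we want exactly two cars

p = lam**k * exp(-lam) / factorial(k)
print(p)   # ~0.005, so about a 0.5% chance
```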
Some example questions that can be answered with a Poisson distribution:
• How many pennies will I encounter on my walk home?
• How many children will be delivered at the hospital today?
• How many products will I sell after airing a new television commercial?
• How many mosquito bites did you get today after having sprayed with insecticide?
• How many defects will there be per 100 metres of rope sold?
(Source)
5.12.2
Probability density functions
Uniform distribution
With the uniform distribution, every value is equally likely.
It may be constrained to a range of values as well.
Exponential Distribution
A continuous random variable may have an exponential density; such a variable is often described as an exponential random variable:

f_X(x|λ) = λe^{−λx}, x ≥ 0
Here we say X is exponential:
X ∼ Exp(λ)
Like the Poisson random variable, the exponential random variable can only have positive values. But
because it is continuous, it can also take on non-integral values such as 4.25.
For exponential distributions, the expected value of our random variable is equal to the inverse of the parameter λ, that is:

E[X|λ] = 1/λ
An Exponential Distribution
Example
Say we have the random variable y which is the exact amount of rain we will get tomorrow,
in inches. What is the probability that y = 2 ± 0.1? Assume you have the probability
density function f for y .
We’d notate the probability we’re looking for like so:
P (|Y − 2| < 0.1)
Which is the probability that y ≈ 2 within a tolerance (acceptable deviance) of 0.1 (i.e. 1.9 to 2.1).
Then we would just find the integral (area under the curve) of the PDF from 1.9 to 2.1, i.e.

∫_{1.9}^{2.1} f(x) dx
Gamma Distribution
X ∼ Gam(α, β)
This is over positive real numbers.
It is just a generalization of the exponential random variable:
Exp(β) ∼ Gam(1, β)
The PDF is:
f(x|α, β) = (β^α x^{α−1} e^{−βx}) / Γ(α)
Where Γ(α) is the Gamma function.
Normal (Gaussian) Distribution
The normal distribution is perhaps the most common probability distribution, occurring very often in
nature.
For a random variable x, the normal probability density function is (writing exp(x) for e^x):

P(x) = (1 / (σ√(2π))) exp(−(x − µ)² / (2σ²))
The (univariate) Gaussian distribution is parameterized by µ and σ (for multivariate Gaussian distributions, see below).
The peak of the distribution is where x = µ.
Gaussian distributions
The height and width of the distribution varies according to σ. The lower σ is, the higher and thinner
the distribution is; the higher σ is, the lower and wider the distribution is.
The standard normal distribution is just N(0, 1).
The Gaussian distribution can be used to approximate other distributions, such as the binomial
distribution when the number of experiments is large, or the Poisson distribution when the average
arrival rate is high.
A normal random variable X is denoted:
X ∼ N(µ, σ)
Where the parameters are:
• µ = the mean
• σ = the standard deviation
The expected value is:
E[X|µ, σ] = µ
t Distribution
For small sample sizes (n < 30), the distribution of the sample mean deviates slightly from the normal distribution, since the sample mean doesn’t exactly match the population’s mean. This distribution is the t-distribution, which, for large enough sample sizes (≥ 30), converges to the normal distribution, so it may also be used for large sample sizes too.
The t-distribution has thicker tails than the normal distribution, so observations are more likely to be
within two standard deviations of its mean. This allows for more accurate estimations of the standard
error for small sample sizes.
The t-distribution is always centered around zero and is described by one parameter: the degrees of
freedom. The higher the degrees of freedom, the closer the t-distribution is to the standard normal
distribution.
The confidence interval is computed slightly differently for a t distribution. Instead of the Z score we
use a cutoff, tdf , determined by the degrees of freedom for the distribution.
For a single sample with n observations, the degrees of freedom is df = n − 1. For two samples, you
can use a computer to calculate the degrees of freedom, or you can choose the smallest sample size
minus 1.
The t-distribution’s corresponding test is the t-test, sometimes called the “Student t-test”, which is
used to compare the means of two groups.
From the t-distribution we can calculate a t value:

t = (x̄ − µ) / (s / √n)
Then we can use this t value with the t distribution with the degrees of freedom for the sample and
use that to compute a p-value.
Beta Distribution
For an event with two outcomes, the beta distribution is the probability distribution of the probability
of the outcome being positive. The beta distribution’s domain is [0, 1] which makes it appropriate
for this use.
That is, in a beta distribution both the y and the x axes represent probabilities. The x-axis is the
possible probabilities for the events in question, and the y -axis is the probability that that possible
probability is the true probability.
It is notated:
Beta(α, β)
Where α is the number of positive examples and β is the number of negative examples.
Its PDF is:

f(x|α, β) = (x^{α−1} (1 − x)^{β−1}) / B(α, β)
Where B(α, β) is the Beta function.
The Beta distribution is a generalization of the uniform distribution:
Uniform() ∼ Beta(1, 1)
The mean of a beta distribution is just α/(α + β), which is pretty straightforward if you think about it.
If you need to estimate the probability of something happening, the beta distribution can be a good
prior since it is quite easy to calculate its posterior distribution:
Beta(αposterior , βposterior ) = Beta(αlikelihood + αprior , βlikelihood + βprior )
That is, you just use some plausible prior values for α and β such that you have a plausible mean,
then just add your new positive and negative examples to update the beta distribution.
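A minimal sketch of that update rule (the prior counts and observations here are made up for illustration):

```python
# Prior belief: roughly a 20% success rate, encoded as Beta(2, 8)
alpha_prior, beta_prior = 2, 8

# New evidence: 30 successes and 70 failures observed
successes, failures = 30, 70

alpha_post = alpha_prior + successes
beta_post = beta_prior + failures

print(alpha_post / (alpha_post + beta_post))   # posterior mean ≈ 0.29
```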
Weibull Distribution
The Weibull distribution is used for modeling reliability or “survival” data, e.g. for dealing with
failure-rates.
It is defined as:

f_X(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k} for x ≥ 0, and 0 for x < 0
The k parameter is the shape parameter and the λ parameter is the scale parameter of the distribution.
If x is the “time-to-failure”, the Weibull distribution describes the failure rate over time. In this case,
the parameter k influences how the failure rate changes over time: if k < 1, the failure rate decreases
over time (for instance, defective products fail early and are weeded out), if k = 1, the failure rate is
constant, and if k > 1, then the failure rate increases with time (e.g. parts degrade over time).
Chi-square (χ2 ) distribution
The χ2 distribution is closely related to the normal distribution and often used as a sampling distribution. The χ2 distribution with f degrees of freedom, sometimes notated χ2[f ] , is the sum of the
squares of f independent standard normal (e.g. µ = 0, σ = 1) variates, i.e.:
Y = X₁² + X₂² + · · · + X_f²
This distribution has a mean f and a variance of 2f (this comes from the additive property of the mean and the variance). The skewness of the distribution follows the same additive property and is √(8/f). So when f is small, the distribution skews to the right, and the skewness decreases as f increases. When f is very large the distribution approaches the standard normal distribution (by the central limit theorem).
Some χ2 distributions
5.13 Pareto distributions
A Pareto distribution has a CDF with the form:
CDF(x) = 1 − (x/x_m)^{−α}
They are characterized as having a long tail (i.e. many small values, few large ones), but the large
values are large enough that they still make up a disproportionate share of the total (e.g. the large
values take up 80% of the distribution, the rest are 20%).
Such a distribution is described as scale-free since they are not centered around any particular value.
Compare this to Gaussian distributions which are centered around some value.
Such a distribution is said to obey the power law. A distribution P (k) obeys the power law if, as k
gets large, P (k) is asymptotic to k −γ , where γ is a parameter that describes the rate of decay.
Such distributions are (confusingly) sometimes called scaling distributions because they are invariant
to changes of scale, which is to say that you can change the units the quantities are expressed in and
γ doesn’t change.
5.14 Multiple random variables
In the real world you often work with multiple random variables simultaneously - that is, you are
working in higher dimensions. You could describe a group of random variables as a random vector,
i.e. a random vector X ∈ Rd , where d is the number of dimensions (the number of random variables)
you are working in, i.e. X = [X1 , . . . , Xd ].
A distribution over multiple random variables is called a joint distribution.
For a joint distribution P (a, b), the distribution of a subset of variables is called a marginal distribution
(or just marginal) of the joint distribution, and is computed:
P(a) = ∑_b P(a, b)
That is, fix b to each of its possible outcomes and sum those probabilities.
Generally, you can compute the marginal like so:

P(x_1, . . . , x_{i−1}, x_{i+1}, . . . , x_n) = ∑_{x_i} P(x_1, . . . , x_n)
So you take the variable you want to remove and sum over the probabilities with it fixed for each of
its possible outcomes.
The distribution over multiple random variables is called a joint distribution. When we have multiple random variables, the distribution of some subset of those random variables is the marginal
distribution for that subset.
The probability density function for a joint distribution just takes more arguments, i.e.:
P(a_1 ≤ X_1 ≤ b_1, a_2 ≤ X_2 ≤ b_2, . . . , a_n ≤ X_n ≤ b_n) = ∫_{a_1}^{b_1} ∫_{a_2}^{b_2} · · · ∫_{a_n}^{b_n} f(x_1, x_2, . . . , x_n) dx_1 dx_2 . . . dx_n

5.14.1 Conditional distributions
Conditional distributions are distributions in which the value of one or more other random variables
are known.
For random variables X, Y , the conditional probability of X = a given Y = b is:
P(X = a|Y = b) = P(X = a, Y = b) / P(Y = b)

which is undefined if P(Y = b) = 0.
This can be expanded to multiple given random variables:

P(X = a|Y = b, Z = c) = P(X = a, Y = b, Z = c) / P(Y = b, Z = c)
The conditional distribution of X, conditioned on Y = b, is notated P (X|Y = b).
More generally, we can describe the conditional distribution of X conditioned on Y on all its values
as P (X|Y ).
For continuous random variables, the probability of the random variable being a given specific value
is 0 (see the section on probability density functions), so here we have the denominator as 0, which
won’t do. However, it can be shown that the probability density function f (y |x) underlying the
distribution P (Y |X) is given by:
f(y|x) = f(x, y) / f(x)
And thus:

P(a ≤ Y ≤ b|X = c) = ∫_a^b f(y|c) dy = ∫_a^b (f(c, y) / f(c)) dy

5.14.2 Multivariate Gaussian
A random vector X is (multivariate) Gaussian (or “normal”) if any linear combination is (univariate)
Gaussian.
Note that “Gaussian” often implies “multivariate Gaussian”.
That is, the dot product of some vector a transpose with X, which is:

aᵀX = ∑_{i=1}^{n} a_i X_i
is Gaussian for every a ∈ Rn .
We say X is (multivariate) Gaussian distributed with mean µ (where µ ∈ Rn , that is, µ is a vector
as well) and covariance matrix C, notated:
X ∼ N(µ, C)
which means X is Gaussian with E[Xi ] = µi and Cij = Cov(Xi , Xj ) and Cii = Var(Xi ).
µ and C are the parameters of the distribution.
If X is a random vector, X ∈ Rn , i.e. [X1 , . . . , Xn ], and a Gaussian, i.e. X ∼ N(µ, C) where µ
is a vector µ = [µ1 , . . . , µn ], and if the covariance matrix C has the variances on its diagonal and
0 everywhere else, like below, then the components of X are independent and individually normally
distributed, i.e. Xi ∼ N(µi , σi2 ).
Caveat: a random vector’s individual components being Gaussian but not independent does not
necessarily imply that the vector itself is Gaussian.
C = diag(σ₁², . . . , σ_n²)

(that is, the matrix with σ₁², . . . , σ_n² on the diagonal and 0 everywhere else)
Intuitively this makes sense because if Xi and Xj are independent, then their covariance Cov(Xi , Xj ) =
0. So all the i, j entries in C where i ̸= j are 0. This does not necessarily hold for non-Gaussians;
this is a property particular to Gaussians.
Degenerate univariate Gaussian
A degenerate univariate Gaussian distribution is one where X ≡ µ, that is: X ∼ N(µ, 0).
Degenerate multivariate Gaussian
A multivariate Gaussian can also be degenerate, which is when the determinant of its covariance
matrix C is 0.
Examples of Gaussians (and non-examples)
These are some examples of what Gaussians can look like. Drawn over the first two are their level
sets which demarcate where the density is constant (you can think of it like a topographical map).
The last example is a degenerate Gaussian.
Some example Gaussians. The last example is a degenerate Gaussian
Some examples that are not Gaussians
Probability density function
A multivariate Gaussian random variable X ∼ N(µ, C) only has a density function if it is nondegenerate (i.e. det(C) ̸= 0).
The PDF is:
f(x) = (1 / √(|2πC|)) exp(−(1/2)(x − µ)ᵀ C⁻¹ (x − µ))

Note that |A| = det(A), and the term |2πC| can also be written as (2π)ⁿ det(C).
Affine property
An affine transformation is just some function in the form f (x) = Ax + b.
Any affine transformation of a Gaussian random variable is itself a Gaussian. If X ∼ N(µ, C), then
AX + b ∼ N(Aµ + b, ACAT ).
Marginal distributions of a Gaussian
The marginal distributions of a Gaussian are also Gaussian.
More formally, if you have a Gaussian random vector X ∈ Rn , X ∼ N(µ, C) which you decompose
into Xa = [X1 , . . . , Xk ], Xb = [Xk+1 , . . . , Xn ], where 1 ≤ k ≤ n, then Xa ∼ N(µa , Caa ), Xb ∼
N(µb , Cbb ).
Conditional distributions of a Gaussian
The conditional distributions of a Gaussian are also Gaussian.
More formally, if you have a Gaussian random vector X ∈ Rn , X ∼ N(µ, C) which you decompose
into Xa = [X1, . . . , Xk], Xb = [Xk+1, . . . , Xn], where 1 ≤ k ≤ n, then (Xa|Xb = xb) ∼ N(m, D), where m = µa + Cab Cbb⁻¹ (xb − µb) and D = Caa − Cab Cbb⁻¹ Cba.
Sum of independent Gaussians
The sum of independent Gaussians is also Gaussian.
More formally, if you have Gaussian random vectors X ∈ Rn , X ∼ N(µx , Cx ) and Y ∈ Rn , Y ∼
N(µy , Cy ) which are independent, then X + Y ∼ N(µx + µy , Cx + Cy )
5.15 Bayes’ Theorem
5.15.1
Intuition
The probability of both a and b occurring is P(a ∩ b).
This is the same as the probability of a occurring given b and vice versa:
P (a ∩ b) = P (a|b)P (b) = P (b|a)P (a)
This can be rearranged to form Bayes’ Theorem:

P(b|a) = P(a|b)P(b) / P(a)
Bayes’ Theorem is useful for answering questions such as, “How likely is A given B?”. For example,
“How likely is my hypothesis true given the data I have?”
5.15.2 A Visual Explanation
This explanation is adapted from Count Bayesie.
The accompanying figure depicts a 6x10 area (60 pegs total) of lego bricks representing a probability space with the
following probabilities:
P (blue) = 40/60 = 2/3
P (red) = 20/60 = 1/3
Lego Probability Space
Red and blue alone describe the entire set of possible events.
Yellow pegs are conditional upon the red and blue bricks;
that is, their probabilities are conditional upon what color brick is underneath it.
So the following probability properties of yellow should be straightforward:
P (yellow) = 6/60 = 1/10
P (yellow|red) = 4/20 = 1/5
P (yellow|blue) = 2/40 = 1/20
But say you want to figure out P(red|yellow). This is intuitive visually in this example. You’d reason that there are 6 yellow pegs total, 4 of which are on the red space, so there’s a 4/6 probability that we are in the red space for a given yellow peg.
This intuition is Bayes’ Theorem, and can be written more formally as:

P(red|yellow) = P(yellow|red)P(red) / P(yellow)
Step by step, what we did was:

n_yellow = P(yellow) ∗ n_total = 1/10 ∗ 60 = 6
n_red = P(red) ∗ n_total = 1/3 ∗ 60 = 20
n_yellow|red = P(yellow|red) ∗ n_red = 1/5 ∗ 20 = 4
P(red|yellow) = n_yellow|red / n_yellow = 4/6 = 2/3
If you expand out the last equation, you’ll find Bayes’ Theorem:

P(red|yellow) = n_yellow|red / n_yellow
= (P(yellow|red) ∗ n_red) / (P(yellow) ∗ n_total)
= (P(yellow|red) ∗ P(red) ∗ n_total) / (P(yellow) ∗ n_total)
= P(yellow|red)P(red) / P(yellow)
5.15.3
An Example Bayes’ Problem
Consider the following problem:
1% of women at age forty who participate in routine screening have breast cancer. 80%
of women with breast cancer will get positive mammographies. 9.6% of women without
breast cancer will also get positive mammographies. A woman in this age group had
a positive mammography in a routine screening. What is the probability that she
actually has breast cancer?
Intuitively it’s difficult to get the correct answer. Generally, only ~15% of doctors get it right (Casscells, Schoenberger, and Grayboys 1978; Eddy 1982; Gigerenzer and Hoffrage 1995; and many other studies).
You can work through the problem like so:
• 1% of women at age forty have breast cancer. To simplify the problem, assume there are 1000
women total, so 10/1000 have breast cancer.
• 80% of women w/ breast cancer will get positive mammographies. So of the 10 women that
have breast cancer, 8/1000 of them will get positive mammographies.
• 9.6% of women without breast cancer will also get positive mammographies. We have 10/1000
women with breast cancer, which means there are 990 without breast cancer. Of those 990,
9.6% will also get positive mammographies, so ~95/1000 women are false positives.
We can rephrase the problem like so: What is the probability that a woman in this age group has
breast cancer, if she gets a positive mammography?
In total, the number of positives we have are 95 + 8 = 103. Then we can just use simple probability:
there’s an 8/103 chance (7.8%) that she has breast cancer, and a 95/103 chance (92.2%) that she’s
a false positive.
One way to interpret these results is that, in general, women of age forty have a 1% chance of having
breast cancer. Getting a positive mammography does not indicate that you have breast cancer, it
just “slides up” your probability of having it to 7.8%.
We could break up the group of 1000 women into:
• True positives: 8
• False positives: 95
• True negatives: 990 − 95 = 895
• False negatives: 10 − 8 = 2
Which totals to 1000, so everyone is accounted for.
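The same arithmetic in a few lines of Python, once by counting and once directly via Bayes’ Theorem:

```python
# Counting version (out of 1000 women)
true_pos = 1000 * 0.01 * 0.80            # 8 women with cancer and a positive test
false_pos = 1000 * 0.99 * 0.096          # ~95 women without cancer but a positive test
print(true_pos / (true_pos + false_pos))   # ~0.078

# Bayes' Theorem version: P(cancer | positive)
p_cancer = 0.01
p_pos_given_cancer = 0.80
p_pos = p_cancer * p_pos_given_cancer + (1 - p_cancer) * 0.096
print(p_pos_given_cancer * p_cancer / p_pos)   # the same ~7.8%
```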
5.15.4
Solving the problem with Bayes’ Theorem
The original proportion of patients w/ breast cancer is the prior probability.
The probability of a true positive and the probability of a false positive are the conditional
probabilities.
Collectively, this information is known as the priors. The priors are required to solve a Bayesian
problem.
The final answer - the estimated probability that a patient has breast cancer given a positive mammography - is the revised probability, better known as the posterior probability.
If the two conditional probabilities are equal, the posterior probability equals the prior probability (i.e. if there’s an equal chance of getting a true positive and a false positive, then the test really tells you nothing).
5.15.5
Another Example
Your friend reads you a study which found that only 10% of happy people are rich. Your
friend concludes that money can’t buy happiness. How could you show them otherwise?
Rather than asking “What percent of happy people are rich?”, it is probably better to ask “What
percent of rich people are happy?” to determine if money buys happiness.
With the statistic from the study, statistics about the overall rate of happy people (say 40% of people
are happy) and rich people (say 5% of people are rich), and Bayes’ Theorem, you can calculate this
value:
10% × (40% / 5%) = 80%
So it seems like a lot of rich people are happy.
5.15.6
Naive Bayes
Bayes’ rule:
P(a|b) = P(b|a)P(a) / P(b)
Say a is a class and b is some evidence.
We’ll notate the class as c and the evidence as e. We are interested in: what’s the probability of a
class c given some evidence e? We can write this question out as Bayes’ rule:
P(c|e) = P(e|c)P(c) / P(e)
Our evidence may actually be multiple pieces of evidence: e_1, . . . , e_n. So instead we can re-write the equation as:

P(c|e_1, . . . , e_n) = P(e_1, . . . , e_n|c)P(c) / P(e_1, . . . , e_n)
If we can assume that each piece of evidence is independent given the class c, then we can further
write this as:
P(c|e_1, . . . , e_n) = [∏_{i=1}^{n} P(e_i|c)] P(c) / P(e_1, . . . , e_n)
Example
In practice: say I have two coins. One is a fair coin (P (heads) = 0.5) and one is a trick coin
(P (heads) = 0.8). I pick one of the coins at random and flip it twice, getting heads and then tails.
Which coin did I pick?
The head and tail outcomes are our evidence. So we can take the product of the probabilities of
these outcomes given a particular class.
The probability of picking either coin was uniform, i.e. there was a 50% chance of picking either. So
we can ignore that probability.
For a fair coin, the probability of getting heads and then tails is P (H|fair) × P (T |fair) = 0.5 × 0.5 =
0.25. For the trick coin, the probability is P (H|trick) × P (T |trick) = 0.8 × 0.2 = 0.16. So it’s more
likely that I picked the fair coin.
If we flip again and get a heads, things change a bit:
For a fair coin: P (H|fair) × P (T |fair) × P (H|fair) = 0.5 × 0.5 × 0.5 = 0.125. For the trick coin:
P (H|trick) × P (T |trick) × P (H|trick) = 0.8 × 0.2 × 0.8 = 0.128. So now it’s slightly more likely
that I picked the trick coin.
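The coin example as code (a sketch; the uniform prior over the two coins is omitted since it cancels out, as noted above):

```python
from math import prod

p_heads = {"fair": 0.5, "trick": 0.8}

def likelihood(coin, flips):
    # product of per-flip probabilities, assuming flips are independent given the coin
    return prod(p_heads[coin] if f == "H" else 1 - p_heads[coin] for f in flips)

print(likelihood("fair", "HT"), likelihood("trick", "HT"))     # ~0.25  ~0.16
print(likelihood("fair", "HTH"), likelihood("trick", "HTH"))   # ~0.125 ~0.128
```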
5.16 The log trick
When working with many independent probabilities, which is often the case in machine learning, you
have to multiply many probabilities which can result in underflow. So it’s often easier to work with
the logarithm of probability functions, which is fine because when optimizing, the max (or min) will
be at the same location in the logarithm form (though their actual values will be different). Using
logarithms will allow us to sum terms instead of multiplying them.
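A small illustration of why this matters (the 10,000 factors of 0.5 are arbitrary; any long product of probabilities behaves the same way):

```python
import math

probs = [0.5] * 10000

product = 1.0
for p in probs:
    product *= p
print(product)                       # 0.0 -- the product underflows to zero

log_sum = sum(math.log(p) for p in probs)
print(log_sum)                       # ~-6931.47, still perfectly representable
```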
5.17 Information Theory
Information, measured in bits, answers questions - the more initial uncertainty there is about the
answer, the more information the answer contains.
The amount of bits needed to encode an answer depends on the distribution over the possible answers
(i.e., the uncertainty about the answer).
Examples:
• the answer to a boolean question with a prior (0.5, 0.5) requires 1 bit to encode (i.e. just 0 or
1)
• the answer to a 4-way question with a prior (0.25, 0.25, 0.25, 0.25) requires 2 bits to encode
• the answer to a 4-way question with a prior (0, 0, 0, 1) requires 0 bits to encode, since the
answer is already known (no uncertainty)
• the answer to a 3-way question with prior (0.5, 0.25, 0.25) requires, on average, 1.5 bits to
encode
More formally, we can compute the average number of bits required to encode uncertain information as follows:

∑_i p_i log₂(1/p_i)
This quantity is called the entropy of the distribution (H), and is sometimes written as the equivalent:

H(p_1, . . . , p_n) = ∑_i −p_i log₂ p_i
If you do something such that the answer distribution changes (e.g. observe new evidence), the
difference between the entropy of the new distribution and the entropy of the old distribution is called
the information gain.
The basic intuition behind information theory is that learning that an unlikely event has
occurred is more informative than learning that a likely event has occurred. A message
saying “the sun rose this morning” is so uninformative as to be unnecessary to send, but
a message saying “there was a solar eclipse this morning” is very informative.
The self-information of an event is:

I(X) = − ln P(X)

and, when using the natural log, is measured in nats (when using log₂, it is measured in bits or shannons). One nat is the information gained by observing an event of probability 1/e.
Self-information only measures a single event; to measure the amount of uncertainty in a complete
probability distribution we can instead use the Shannon entropy, which tells us the expected information of an event drawn from that distribution:
H(X) = EX∼P [I(X)] = −EX∼P [ln P (X)]
When X is continuous, Shannon entropy is called the differential entropy.
5.17.1
Entropy
Broadly, entropy is the measure of disorder in a system.
In the case of probability, it is the measure of uncertainty that is associated with the distribution of
a random variable.
If there are a few outcomes which are fairly certain, the system has low entropy. A point-mass
distribution has the lowest entropy. We know exactly what value we’ll get from it.
If there are many outcomes which are equiprobable, the system has high entropy. A uniform distribution has the highest entropy. We don’t really have any idea of what value we’ll draw from
it.
To put it another way: with high entropy, it is very hard to guess the value of the random variable
(because all values are equally or similarly likely); with low entropy it easy to guess its value (because
there are some values which are much more likely than the others).
The entropy of a random variable X is notated H(X), must satisfy H(X) ≥ 0, and is calculated:

H(X) = −E[lg(P(X))]
= −∑_x P(x) lg(P(x)) (discrete)
= −∫_{−∞}^{∞} P(x) lg(P(x)) dx (continuous)
Where lg(x) = log2 (x).
This does not say anything about the value of the random variable, only the spread of its distribution.
For example: what is the entropy of a roll of a six-sided die?
−(1/6 lg(1/6) + 1/6 lg(1/6) + 1/6 lg(1/6) + 1/6 lg(1/6) + 1/6 lg(1/6) + 1/6 lg(1/6)) = lg 6 ≈ 2.58
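The same computation in code (using log base 2, so the result is in bits):

```python
from math import log2

def entropy(probabilities):
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([1/6] * 6))          # ~2.585 bits for a fair six-sided die
print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits, as in the 3-way question above
print(entropy([0, 0, 0, 1]))       # 0.0 bits -- no uncertainty
```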
The Maximum Entropy Principle says that, all else being equal, we should prefer distributions
that maximize the entropy. That is, you should be conservative in your confidence about how much
you know - if you don’t have any good reason for something to be more likely than something else,
err on the side of them being equiprobable.
5.17.2
Specific Conditional Entropy
The specific conditional entropy H(Y |X = v ) is the entropy of some random variable conditioned on
another random variable taking some value.
5.17.3
Conditional Entropy
The conditional entropy H(Y|X) is the entropy of some random variable conditioned on another random variable, i.e. it is the average specific conditional entropy of Y, that is:

H(Y|X) = ∑_v P(X = v) H(Y|X = v)
5.17.4
Information Gain
Say you must transmit the random variable Y . How many bits on average would be saved if both
the sender and the recipient knew X?
To put it more concretely:
IG(Y |X) = H(Y ) − H(Y |X)
The bigger the difference, the more X tells us about Y (because it decreases the entropy, i.e. it makes
it easier to guess Y ).
5.17.5 Kullback-Leibler (KL) divergence
We can measure the difference between two probability distributions P (X), Q(X) over the same
random variable X with the KL divergence:
D_{KL}(P \| Q) = E_{X \sim P}\left[\ln \frac{P(X)}{Q(X)}\right] = E_{X \sim P}[\ln P(X) - \ln Q(X)]
The KL divergence has the following properties:
• It is non-negative
• It is 0 if and only if:
– P and Q are the same distribution (for discrete variables)
– P and Q are equal “almost everywhere” (for continuous variables)
• It is not symmetric, i.e. D_{KL}(P \| Q) \neq D_{KL}(Q \| P), so it is not a true distance metric
The KL divergence is related to cross entropy H(P, Q):
H(P, Q) = H(P) + D_{KL}(P \| Q) = -E_{X \sim P}[\log Q(X)]
Given some discrete random variable X with possible outcomes {x1 , x2 , . . . , xm }, and the probability
distribution of these outcomes, p(X), we can compute the entropy H(X):
H(X) = -\sum_{i=1}^{m} P(x_i) \log_b P(x_i)
Usually b = 2 (i.e. we use bits).
The mutual information between two discrete random variables X and Y can be computed:
I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log\left(\frac{p(x, y)}{p(x) p(y)}\right)
For continuous random variables, it is instead computed:
I(X; Y) = \int_Y \int_X p(x, y) \log\left(\frac{p(x, y)}{p(x) p(y)}\right) \, dx \, dy
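For illustration, a minimal sketch of the discrete mutual information formula; the joint probability table here is made up:

```python
import numpy as np

def mutual_information(joint, base=2):
    """I(X;Y) from a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = joint > 0
    ratio = joint[mask] / (px @ py)[mask]
    return np.sum(joint[mask] * np.log(ratio)) / np.log(base)

# hypothetical joint distribution of two binary variables
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(mutual_information(joint))  # > 0: X and Y share information
```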
The variation of information between two random variables is computed:
VI(X; Y) = H(X) + H(Y) - 2I(X; Y)
The Kullback-Leibler divergence tells us the difference between two probability distributions P
and Q. It is non-symmetric.
For discrete probability distributions, it is calculated:
D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}
For continuous probability distributions, it is computed:
D_{KL}(P \| Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx
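A small sketch of the discrete case in Python (made-up distributions); scipy.stats.entropy(p, q) computes the same quantity:

```python
import numpy as np

def kl_divergence(p, q, base=np.e):
    """D_KL(P || Q) for discrete distributions over the same outcomes."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(i) = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base)

p = [0.5, 0.4, 0.1]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))   # D_KL(P||Q)
print(kl_divergence(q, p))   # generally different: not symmetric
```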
5.18 References
• Probabilistic Programming and Bayesian Methods for Hackers. Cam Davidson-Pilon.
• Parameter Estimation - The PDF, CDF and Quantile Function. Count Bayesie. Will Kurt.
• What is the intuition behind beta distribution?. David Robinson, KerrBer.
• Distributions of One Variable. An Introduction to Statistics with Python. Thomas Haslwanter.
• Probability Theory Review for Machine Learning. Samuel Ieong. November 6, 2006.
• MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT.
• Principles of Statistics. M.G. Bulmer. 1979.
• OpenIntro Statistics, Second Edition. David M Diez, Christopher D Barr, Mine Çetinkaya-Rundel.
• A Beginner's Guide to Eigenvectors, PCA, Covariance and Entropy. Deeplearning4j. Skymind.
• An Intuitive Explanation of Bayes' Theorem. Eliezer S. Yudkowsky.
• Why so Square? Jensen's Inequality and Moments of a Random Variable. Count Bayesie. Will Kurt.
• What is an intuitive explanation of Bayes' Rule?. Mike Kayser.
• Bayes' Theorem with Lego. Count Bayesie. Will Kurt.
• Probability, Paradox, and the Reasonable Person Principle. Peter Norvig. October 3, 2015.
• Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX).
• Think Complexity. Version 1.2.3. Allen B. Downey. 2012.
• Mining Massive Datasets (Coursera & Stanford, 2014). Jure Leskovec, Anand Rajaraman, Jeff Ullman.
6 Statistics
Broadly, statistics is concerned with collecting and analyzing data. It seeks to describe rigorous
methods for collecting data (samples), for describing the data, and for inferring conclusions from the
data. There are processes out there in the world that generate observable data, but these processes
are often black boxes and we want to gain some insight into how they work.
We can crudely cleave statistics into two main practices: descriptive statistics, which provides
tools for describing data, and inferential statistics, which provides tools for learning (inferring or
estimating) from data.
This section is focused on frequentist (or classical) statistics, which is distinguished from Bayesian
statistics (covered in another chapter).
6.0.1 Notation
• Regular letters, e.g. X, Y , typically denote observed (known) variables
• Greek letters, e.g. µ, σ, typically denote unknown variables which we are trying to estimate
• Hats over letters, e.g. θ̂, denote estimators (an estimator is a rule for calculating an estimate
given some observed data), e.g. an estimated value for a parameter.
6.1 Descriptive Statistics
Descriptive statistics involves computing values which summarize a set of data. This typically includes
statistics like the mean, standard deviation, median, min, max, etc, which are called summary
statistics.
6.1.1 Scales of Measurement
In statistics, numbers and variables are categorized in certain ways.
Variables may be categorical (also called qualitative), in which they represent discrete values (numbers here are arbitrarily assigned to represent categories of qualities), or numerical (also called
quantitative) in which they represent continuous values.
These variables are further categorized into scales of measurement.
• Nominal: Includes qualitative variables that can only be counted; they have no order or
intervals.
– Example: Gender, marital status
• Ordinal: Includes qualitative variables that have a concept of order, so they can be arranged
into some sequence accordingly and meaningfully ranked. But they are without any measure
of magnitude between items in that sequence. So some object A may come after some object
B but there is no measurement of interval between the two (we can’t, for instance, say that A
is 10 more than B).
– Example: Education level (some high school, high school, college, etc)
• Interval: Interval variables are quantitative variables; in some sense they are like ordinal variables that do have a measure of interval between items. But they do not have an absolute zero
point, so we can’t compare values as ratios (we can’t, for instance, say A is twice of B).
– Example: Dates (we can say how many days there are between two dates, but, for example,
we can’t say one date is twice of another)
• Ratio: Ratio variables are like interval variables (also quantitative variables) but have a fixed
and meaningful zero point, so they can be compared as ratios.
– Example: Age, length
6.1.2 Averages
The average of a set of data can be described as its central tendency, which gives some sense of a
typical or common value for a variable. There are three types:
Arithmetic mean
Often just called the “mean” and notated µ (mu).
For a dataset {x1 , . . . , xn }, the arithmetic mean is:
\mu = \frac{\sum_{i=1}^{n} x_i}{n}
The mean can be sensitive to extreme values (outliers), which is one reason the median is sometimes
used instead. Which is to say, the median is a more robust statistic (meaning that it is less sensitive
to outliers).
Note that there are other types of means, but the arithmetic mean is by far the most common.
Median
The central value in the dataset, e.g. for the dataset 1 1 2 3 4, the median is 2.
If there is an even number of values, you take the value halfway between the two central values: for the dataset 1 1 2 3 4 4, the median is (2 + 3)/2 = 2.5.
Mode
The most frequently occurring value in the dataset, e.g. for the dataset 1 2 3 3 2 3 4 3, the mode is 3.
[Figure: central tendencies for a distribution]
6.1.3 Population vs Sample
With statistics we take a sample of a broader population or already have data which is a sample
from a population. We use this limited sample in order to learn things about the whole population.
The mean of the population is denoted µ and consists of N items, whereas the mean of the sample
(i.e. the sample mean, sometimes called the empirical mean) is notated x̄ or µ̂ and consists of n
items.
The sample mean is:
\hat{\mu} = \frac{1}{n} \sum_i x^{(i)}

The sample variance is:

\hat{\sigma}^2 = \frac{1}{n - 1} \sum_i (x^{(i)} - \hat{\mu})^2

The sample covariance matrix is:

\hat{\Sigma} = \frac{1}{n - 1} \sum_i (x^{(i)} - \hat{\mu})(x^{(i)} - \hat{\mu})^T
These estimators are unbiased, i.e.:
E[µ̂] = µ
E[σ̂ 2 ] = σ 2
E[Σ̂] = Σ
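A minimal sketch with numpy (made-up data); the 1/(n-1) factor corresponds to ddof=1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=3.0, size=(500, 2))  # n samples, 2 features

mu_hat = X.mean(axis=0)              # sample mean
var_hat = X.var(axis=0, ddof=1)      # unbiased sample variance (divides by n - 1)
cov_hat = np.cov(X, rowvar=False)    # unbiased sample covariance matrix
print(mu_hat, var_hat, cov_hat, sep="\n")
```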
6.1.4 Independent and Identically Distributed
Often in statistics we assume that a sample is independent and identically distributed (iid); that is, that the data points are independent from one another (the outcome of one has no influence over
the outcome of any of the others) and that they share the same distribution.
We say that X1 , . . . , Xn are iid if they are independent and drawn from the same distribution, that
is P (X1 ) = · · · = P (Xn ). This can also be stated:
P(X_1, \ldots, X_n) = \prod_i P(X_i)
In this case, they all share the same mean (expected value) and variance.
This assumption makes computing statistics for the sample much easier.
For instance, if a sample was not identically distributed, each datapoint might come from a different
distribution, in which case there are different means and variances for each datapoint which must be
computed from each of those datapoints alone. They can’t really be treated as a group since the
datapoints aren’t quite equivalent to each other (in a statistical sense).
Or, if the sample was not independent, then we lose all the conveniences that come with independence.
The IID assumption doesn’t always hold (i.e. it may be violated), of course, so there are other ways
of approaching such situations while minimizing complexity, such as Hidden Markov Models.
6.1.5 The Law of Large Numbers (LLN)
Let X1 , . . . Xn be iid with mean µ.
The law of large numbers essentially states that as a sample size approaches infinity, its mean will
approach the population (“true”) mean:
\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \mu

6.1.6 Regression to the mean
P(Y < x|X = x) gets bigger as x approaches very large values. That is, given a very large X (an extreme), it is unlikely that Y's value will be as large as or larger than X.
P(Y > x|X = x) gets bigger as x approaches very small values. That is, given a very small X (an extreme), it is unlikely that Y's value will be as small as or smaller than X.
6.1.7 Central Limit Theorem (CLT)
Say you have a set of data. Even if the distribution of that data is not normal, you can divide the data
into groups (samples) and then average the values of those groups. Those averages will approach
the form of a normal curve as you increase the size of those groups (i.e. increase the sample size).
Let X1 , . . . Xn be iid with mean µ and variance σ 2 .
Then the central limit theorem can be formalized as:
\sqrt{\frac{n}{\sigma^2}} \left(\left(\frac{1}{n} \sum_{i=1}^{n} X_i\right) - \mu\right) \xrightarrow{D} N(0, 1)
That is, the left side converges in distribution to a normal distribution with mean 0 and variance 1
as n increases.
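A small simulation sketch (all numbers made up): sample means of a skewed distribution, once standardized, look approximately standard normal:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100                                   # size of each sample
means = rng.exponential(scale=2.0, size=(10_000, n)).mean(axis=1)

# standardize the sample means; they should look approximately N(0, 1)
mu, sigma = 2.0, 2.0                      # exponential(scale=2): mean 2, sd 2
z = (means - mu) / (sigma / np.sqrt(n))
print(z.mean(), z.std())                  # close to 0 and 1
```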
6.1.8 Dispersion (Variance and Standard Deviation)
Dispersion is the "spread" of a distribution - how spread out its values are.
The main measures of dispersion are the variance and the standard deviation.
Standard Deviation
Standard deviation is represented by σ (sigma) and describes the variation from the mean (µ, i.e. the
expected value), calculated:
\sigma = \sqrt{E[(X - \mu)^2]} = \sqrt{E[X^2] - (E[X])^2}
Variance
The square of the standard deviation, that is, E[(X − µ)2 ], is the variance of X, usually notated σ 2 .
It can also be written:
\mathrm{Var}(X) = \sigma^2 = E(X^2) - E(X)^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}
For a population of size N of datapoints x.
That is, variance is the difference between the expected value of the square of the variable and the square of its expected value.
Coefficient of variation (CV)
Variance depends on the units of measurement, but this can be controlled for by computing the
coefficient of variation:
CV = \frac{\sigma}{\bar{x}} \times 100
This allows us to compare variability across variables measured in different units.
Variance of a linear combination of random variables
The variance of a linear combination of (independent) random variables, e.g. aX + bY , can be
computed:
\mathrm{Var}(\beta_1 X_1 + \cdots + \beta_n X_n) = \beta_1^2 \mathrm{Var}(X_1) + \cdots + \beta_n^2 \mathrm{Var}(X_n)
Range
The range can also be used to get a sense of dispersion. The range is the difference between the
highest and lowest values, but very sensitive to outliers.
As an alternative to the range, you can look at the interquartile range, which is the range of the
middle 50% of the values (that is, the difference of the 75th and 25th percentile values). This is less
sensitive to outliers.
Z score
A Z score is just the number of standard deviations a value is from the mean. It is defined:
Z = \frac{x - \mu}{\sigma}
The Empirical Rule
The empirical rule describes that, for a normal distribution, there is:
• a 68% chance that a value falls within one standard deviation
• a 95% chance that something falls within two standard deviations
• a 99.7% chance that something falls within three standard deviations
Pooled standard deviation estimates
If you have reason to expect that the standard deviations of two populations are practically identical,
you can use the pooled standard deviation of the two groups to obtain a more accurate estimate
of the standard deviation and standard error:
s_{\text{pooled}}^2 = \frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2}
Where n1 , n2 , s1 , s2 are the sample sizes and standard deviations of the sample groups. We must
update the degrees of freedom as well, df = n1 + n2 − 2, which we can use for a new t-distribution.
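A direct translation of this formula into Python (the sample values are hypothetical):

```python
import numpy as np

def pooled_sd(s1, s2, n1, n2):
    """Pooled standard deviation of two groups assumed to have similar spread."""
    pooled_var = (s1**2 * (n1 - 1) + s2**2 * (n2 - 1)) / (n1 + n2 - 2)
    return np.sqrt(pooled_var)

print(pooled_sd(s1=2.1, s2=1.9, n1=30, n2=25))  # df = 30 + 25 - 2 = 53
```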
6.1.9 Moments
The kth moment, mk (X), where k ∈ 1, 2, 3, . . . (i.e. it is a positive integer), is E[X k ]. So the first
moment is just the mean, E[X].
The kth central moment is E[(X − E[X])k ]. So the second central moment is just the variance,
E[(X − E[X])2 ].
[Figure: the empirical rule]
The third moment is the skewness, and the fourth moment is the kurtosis; they all share the same
form (with different normalization terms):
\text{skewness} = E[(X - \mu)^3] \frac{1}{\sigma^3}

\text{kurtosis} = E[(X - \mu)^4] \frac{1}{\sigma^4}
Moments have different units, e.g. the first moment might be in meters (m), the second moment would be in m², and so on, so it is typical to standardize moments by taking their k-th root, e.g. \sqrt{m_2}.
6.1.10 Covariance
The covariance describes how two random variables vary together.
For random variables x and y, the covariance is (remember, E(x) denotes the expected value of random variable x):

\mathrm{Cov}(x, y) = E[(x - E(x))(y - E(y))] = \frac{1}{n} \sum_i (x_i - \bar{x})(y_i - \bar{y})

There must be the same number of values n for each.
This is simplified to:
\mathrm{Cov}(x, y) = E[xy] - E[y]E[x] \approx \overline{xy} - \bar{y}\bar{x}
A positive covariance means that as x goes up, y goes up. A negative covariance means that as x
goes up, y goes down.
Note that variance is just the covariance of a random variable with itself:
Var(X) = E(XX) − E(X)E(X) = Cov(X, X)
6.1.11 Correlation
Correlation gives us a measure of relatedness between two variables. Alone it does not imply
causation, but it can help guide more formal inquiries (e.g. experiments) into causal relationships.
A good way to visually intuit correlation is through scatterplots.
We can measure correlation with correlation coefficients. These measure the strength and sign of a relationship (but not the slope; linear regression, detailed later, does that).
Some of the more common correlation coefficients include:
• Pearson product-moment (used where both variables are on an interval or ratio scale)
• Spearman rank-order (where both variables are on ordinal scales)
• Phi (where both variables are on nominal/categorical/dichotomous/binary scales)
• Point biserial (where one variable is on a nominal/categorical/dichotomous/binary scale and the other is on an interval or ratio scale)
The Pearson and Spearman coefficients are the most commonly used ones, but sometimes the latter two are used in special cases (e.g. with categorical data).
Pearson product-moment correlation coefficient (Pearson’s correlation)
r = \frac{\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}

Note: this is sometimes denoted as a capital R.

You may recognize this as:

\frac{\mathrm{Cov}(X, Y)}{s_X s_Y}
Here we convert our values to standard scores, i.e. \frac{x_i - \bar{x}}{s_x}. This standardizes the values such that their mean is 0 and their variance is 1 (and so they are unitless).
For a population, r is notated ρ (rho).
This value can range from [−1, 1], where 1 and -1 mean complete correlation and 0 means no
correlation.
To test the statistical significance of the Pearson correlation coefficient, you can use the t statistic.
For instance, if you believe there is a relationship between two variables, you set your null hypothesis
as ρ = 0 and then with your estimate of r , calculate the t statistic:
t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}
Then look up the value in a t table.
The Pearson correlation coefficient tells you the strength and direction of a relationship, but it doesn’t
tell you how much variance of one variable is explained by the other.
For that, you can use the coefficient of determination which is just r 2 . So for instance, if you
have r = 0.9, then r 2 = 0.81 which means 81% of the variation of one variable is explained by the
other.
Note that Pearson’s correlation only accurately measures linear relationships; so even if you have a
Pearson correlation near 0, it is still possible that there may be a strong nonlinear relationship. It’s
worthwhile to look at a scatter plot to verify.
It is also not robust in the presence of outliers.
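A sketch with made-up data: scipy computes r and its p-value directly, and the manual t statistic from the formula above should agree with the same conclusion:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 0.7 * x + rng.normal(scale=0.5, size=50)   # made-up linear relationship

r, p = stats.pearsonr(x, y)
t = r / np.sqrt((1 - r**2) / (len(x) - 2))      # t statistic from the formula above
print(r, r**2, p, t)   # r, coefficient of determination, p-value, t
```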
Spearman rank-order correlation coefficient (Spearman’s rank correlation)
Here you compute ranks (i.e. the indices in the sorted sample) rather than standard scores.
For example, for the dataset [1, 10, 100], the rank of the value 10 is 2 because it is second in the
sorted list.
Then you can compute the Spearman correlation:
r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}
Where d is the difference in ranks for each datapoint.
Generally, you can interpret rs in the following ways:
• 0.9 ≤ rs ≤ 1 - very strong correlation
• 0.7 ≤ rs ≤ 0.9 - strong correlation
• 0.5 ≤ rs ≤ 0.7 - moderate correlation
You can test its statistical significance using a z test, where the null hypothesis is that rs = 0.
z = r_s \sqrt{n - 1}
Spearman’s correlation is more robust to outliers and skewed distributions.
Point-Biserial correlation coefficient
This correlation coefficient is useful when comparing a categorical (binary) variable with an interval
or ratio scale variable:
r_{pbi} = \frac{M_p - M_q}{S_t} \sqrt{pq}
Where Mp is the mean for the datapoints categorized as 1 and Mq is the mean for the datapoints
categorized as 0. St is the standard deviation for the interval/ratio variable, p is the proportion of
datapoints categorized as 1, and q is the proportion of datapoints categorized as 0.
Phi correlation coefficient
This allows you to measure the correlation between two categorical (binary) variables.
It is calculated like so:
A = f (0, 1)
B = f (1, 1)
C = f (0, 0)
D = f (1, 0)
r_\phi = \frac{AD - BC}{\sqrt{(A + B)(C + D)(A + C)(B + D)}}
Where f (a, b) is the frequency of label a and label b occurring together in the data.
6.1.12 Degrees of Freedom
Degrees of freedom describes the number of variables that are “free” in what value they can take.
Often a given variable must be a particular value because of the values the other variables take on
and some constraint(s).
For example: say we have four unknown quantities x1 , x2 , x3 , x4 . We know that their mean is 5. In this
case we have three degrees of freedom - this is because three of the variables are free to take arbitrary
values, but once those three are set, the fourth value must be equal to x4 = 20−x1 −x2 −x3 in order for
the mean to be 5 (that is, in order to satisfy the constraint). So for instance, if x1 = 2, x2 = 4, x3 = 6,
then x4 must equal 8. It is not “free” to take on any other value.
6.1.13 Time Series Analysis
Often data has a temporal component; e.g. you are looking for patterns over time.
Generally, time series data may have the following parts: a trend, which is some function reflecting persistent changes; seasonality, that is, periodic variation; and of course there is going to be some noise - random variation - as well.
Moving averages
To extract a trend from a series, you can use regression, but sometimes you will be better off
with some kind of moving average. This divides the series into overlapping regions, windows,
of some size, and takes the averages of each window. The rolling mean just takes the mean of
each window. There is also the exponentially-weighted moving average (EWMA) which gives
a weighted average, such that more recent values have the highest weight, and values before that
have weights which drop off exponentially. The EWMA takes an additional span parameter which
determines how fast the weights drop off.
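A sketch with pandas on a hypothetical series; the span parameter controls how quickly the weights decay:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
series = pd.Series(np.cumsum(rng.normal(size=200)))   # made-up time series

rolling_mean = series.rolling(window=20).mean()        # simple moving average
ewma = series.ewm(span=20).mean()                      # exponentially-weighted moving average
print(ewma.tail())
```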
Serial correlation (autocorrelation)
In time series data you may expect to see patterns. For example, if a value is low, it may stay low
for a bit; if it's high, it may stay high for a bit. These types of patterns are serial correlations, also called autocorrelation (so-called because the dataset is, in some sense, correlated with itself), because the values correlate in their sequence.
You can compute serial correlation by shifting the time series by some interval, called a lag, and then
compute the correlation of the shifted series with the original, unshifted series.
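A minimal sketch of that idea (made-up series): shift by a lag and correlate with the original; pandas' built-in autocorr does the same thing:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
series = pd.Series(np.cumsum(rng.normal(size=200)))   # made-up series

lag = 1
autocorr = series.corr(series.shift(lag))             # serial correlation at this lag
print(autocorr, series.autocorr(lag=lag))
```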
6.1.14 Survival Analysis
Survival analysis describes how long something lasts. It can refer to the survival of, for instance, a
person - in the context of disease, a 5-year survival rate is the probability of surviving 5 years after
diagnosis, for example - or a mechanical component, and so on. More broadly it can be seen as
looking at how long something lasts until something happens - for instance, how long until someone
gets married.
A survival curve is a function S(t) which computes the probability of surviving longer than duration
t. Such a duration is called a lifetime.
The survival curve ends up just being the complement of the CDF:
S(t) = 1 − CDF(t)
Looking at it this way, the CDF is the probability of a lifetime less than or equal to t.
Hazard function
A hazard function tells you the fraction of cases that continue until t and then end at t. It can be
computed from the survival curve:
\lambda(t) = \frac{S(t) - S(t + 1)}{S(t)}
Hazard functions are also used for estimating survival curves.
Estimating survival curves: Kaplan-Meier estimation
Often we do not have the CDF of lifetimes so we can’t easily compute the survival curve. We
often have non-survival cases alongside survival cases, for which we don't yet know what their final
lifetime will be. Often, as is the case in the medical context, we don’t want to wait to learn what
these unknown lifetimes will be. So we need to estimate the survival curve with the data we do have.
The Kaplan-Meier estimation allows us to do this. We can use the data we have to estimate the
hazard function, and then convert that into a survival curve.
We can convert a hazard function into an estimate of the survival curve, where each point at time t
is computed by taking the product of complementary hazard functions through that time t, like so:
\prod_t (1 - \lambda(t))
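A rough sketch of the idea (not a full Kaplan-Meier implementation; the durations and event flags are made up): estimate the hazard at each observed time from the cases still at risk, then take the running product of the complements:

```python
import numpy as np

# hypothetical data: observed durations and whether the event was observed (1)
# or the case is still ongoing / censored (0)
durations = np.array([2, 3, 3, 5, 6, 8, 8, 9, 12, 15])
observed  = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1])

survival = 1.0
curve = {}
for t in np.unique(durations):
    at_risk = np.sum(durations >= t)                   # cases that lasted at least t
    ended = np.sum((durations == t) & (observed == 1))
    hazard = ended / at_risk                           # estimated lambda(t)
    survival *= (1 - hazard)                           # S(t) as a running product
    curve[t] = survival

print(curve)
```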
6.2 Inferential Statistics
Statistical inference is the practice of using statistics to infer some conclusion about a population
based on only a sample of that population. This can be the population’s distribution - we want
to infer from the sample data what the “true” distribution (the population distribution) is and the
unknown parameters that define it.
Generally, data is generated by some process; this data-generating process is also noisy; that is, there
is a relatively small degree of imprecision or fluctuation in values due to randomness. In inferential
statistics, we try to uncover the particular function that describes this process as closely as possible.
We do so by choosing a model (e.g. if we believe it can be modeled linearly, we might choose linear
regression, otherwise we might choose a different kind of model such as a probability distribution;
modeling is covered in greater detail in the machine learning part). Once we have chosen the model,
then we need to determine the parameters (linear coefficients, for example, or mean and variance for
a probability distribution) for that model.
Broadly, the two paradigms of inference are frequentist, which relies on long-run repetitions of an
event, that is, it is empirical (and could be termed the “conventional” or “traditional” framework,
though there’s a lot of focus on Bayesian inference now) and Bayesian, which is about generating a
hypothesis distribution (the prior) and updating it as more evidence is acquired. Bayesian inference is
valuable because there are many events which we cannot repeat, but we still want to learn something
about.
The frequentist believes these unknown parameters have precise “true” values which can be (approximately) uncovered. In frequentist statistics, we can estimate these exact values. When we estimate
a single value for an unknown, that estimation is called a point estimate. This is in contrast to
describing a value estimate as a probability distribution, which is the Bayesian method. The Bayesian
believes that we cannot express these parameters as single values and we should rather describe them
as a distributions of possible values to be explicit about their uncertainty.
Here we focus on frequentist inference; Bayesian inference is covered in a later chapter.
In frequentist statistics, the factor of noise means that we may see relationships (and thus come up
with non-zero parameters) where they don’t exist, just because of the random noise. This is what
p-values are meant to compensate for - if the relationship truly did not exist, what’s the probability,
given the data, that we’d see the non-zero parameter estimate that we computed? Generally if this
probability is less than 0.05 (i.e. p < 0.05) then we accept the result.
Often with statistical inference you are trying to quantify some difference between groups (which can
be framed as measuring an effect size) or testing if some data supports or refutes some hypothesis,
and then trying to determine whether or not this difference or effect can be attributed to chance (this
is covered in the section on experimental statistics).
A word of caution: many statistical tools work only under certain conditions, e.g. assumptions of
independence, or for a particular distribution, or a large enough sample size, or lack of skew, and
so on - so before applying statistical methods and drawing conclusions, make sure the tools are
appropriate for the data. And of course you must always be cautious of potential biases involved in
the data collection process.
6.2.1 Error
Dealing with error is a big part of statistics and some error is unavoidable (noise is natural).
There are three kinds of error:
• Systemic error (systemic flaws in the data collection, e.g. sampling bias)
• Measurement error (due to imprecise instruments, for instance)
• Random error (natural noise, due to chance, uncontrollable, but in theory its effect is minimized
if many measurements are taken)
We never know the true value of something, only what we observe by imprecise means, so we always
must grapple with error.
6.2.2 Estimates and estimators
We can think of the population as representing the underlying data generating process and consider
these parameters as functions of the population. To estimate these parameters from the sample data,
we use estimators, which are functions of the sample data that return an estimate for some unknown
value. Essentially, any statistic is an estimator. For instance, we may estimate the population mean
by using the sample mean as our estimator. Or we may estimate the population variance as the
sample variance. And so on.
Bias
Estimators may be biased for small sample sizes; that is, they tend to have more error for small sample sizes.
Say we are estimating a parameter θ. The bias of an estimator θ̂_m is:

\mathrm{bias}(\hat{\theta}_m) = E[\hat{\theta}_m] - \theta

Where m is the number of samples.
There are unbiased estimators as well, which have an expected mean error (against the population
parameter) of 0. That is, bias(θ̂) = 0, which can also be stated as E[θ̂] = θ.
For example, an unbiased estimator for population variance σ 2 is:
\frac{1}{n - 1} \sum_i (x_i - \bar{x})^2
An estimator may be asymptotically unbiased if \lim_{m \to \infty} \mathrm{bias}(\hat{\theta}_m) = 0, that is, if \lim_{m \to \infty} E[\hat{\theta}_m] = \theta.
Generally, unbiased estimators are preferred, but sometimes biased estimators have other properties
which make them useful.
For an estimate, we can measure its standard error (SE), which describes how much we expect the
estimate to be off by, on average. It can also be stated as:
SE(\hat{\theta}) = \sqrt{\mathrm{Var}(\hat{\theta})}
“Standard error” sometimes refers to the standard error of the mean, which is the standard deviation
of the mean:
SE(\hat{\mu}_m) = \sqrt{\mathrm{Var}\left(\frac{1}{m} \sum_{i=1}^{m} x^{(i)}\right)} = \frac{\sigma}{\sqrt{m}}
Much of statistical inference is concerned with measuring the quality of these estimates.
6.2.3 Consistency
Even when we use a biased estimator, we generally still want our point estimates to converge to the true value of the parameter. This property is called consistency. For some error ϵ > 0, we want \lim_{m \to \infty} P(|\hat{\theta}_m - \theta| > \epsilon) = 0. That is, as we increase our sample size, we want the probability of the absolute difference between the estimate and the true value being greater than ϵ to approach 0.
6.2.4 Point Estimation
Given an unknown population parameter, we may want to estimate a single value for it - this estimate
is called a point estimate. Ideally, the estimate is as close to the true value as possible.
The estimation formula (the function which yields an estimate) is called an estimator and is a
random variable (so there is some underlying distribution). A particular value of the estimator is the
estimate.
A simple example: we have a series of trials with some number of successes. We want an estimate for
the probability of success of the event we looked at. Here an obvious estimate is the number of successes over the total number of trials, so our estimator would be x/N and - say we had 40 successes out of 100 trials - our estimate would be 0.4.
We consider a “good” estimator one whose distribution is concentrated as closely as possible around
the parameter’s true value (that is, it has a small variance). Generally this becomes the case as more
data is collected.
We can take multiple samples (of a fixed size) from a population and compute a point estimate
(e.g. for the mean) from each. Then we can consider the distribution of these point estimates - this
distribution is called a sampling distribution. The standard deviation of the sampling distribution
describes the typical error of a point estimate, so this standard deviation is known as the standard
error (SE) of the estimate.
Alternatively, if you have only one sample, the standard error of the sample mean x̄ can be computed
(where n is the size of the sample):
SE_{\bar{x}} = \sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}}
This however requires the population standard deviation, σx , which probably isn’t known - but we can
also use a point estimate for that as well; that is, you can just use s, the sample standard deviation,
instead (provided that the sample size is at least 30, as a rule of thumb, and the population distribution
is not strongly skewed).
Also remember that the distribution of sample means approximates a normal distribution, with better
approximation as sample size increases, as described by the central limit theorem. Some other point
estimates’ sampling distribution also approximate a normal distribution. Such point estimates are
called normal point estimates.
There are other such computations for the standard error of other estimates as well.
We say a point estimate is unbiased if the sampling distribution of the estimate is centered at the
parameter it estimates.
6.2.5 Nuisance Parameters
Nuisance parameters are values we are not directly interested in, but still need to be dealt with in
order to get at what we are interested in.
6.2.6 Confidence Intervals
Rather than provide a single value estimate of a population parameter, that is, a point estimate, it can
be better to provide a range of values for the estimate instead. This range of values is a confidence
interval. The confidence interval is the range of values where an estimate is likely to fall with some
percent probability.
Confidence intervals are expressed in percentages, e.g. the “95% confidence interval”, which describes
the plausibility that the parameter is in that interval. It does not imply a probability (that is, it does
not mean that the true parameter has a 95% chance of being in that interval), however. Rather, the
95% confidence interval is the range of values in which, over repeated experimentation, in 95% of
the experiments, that confidence interval will contain the true value. To put it another way, for the
95% confidence interval, out of every 100 experiments, at least 95 of their confidence intervals will
contain the true parameter value. You would say “We are 95% confident the population parameter
is in this interval”.
Confidence intervals are a tool for frequentist statistics, and in frequentist statistics, unknown parameters are considered fixed (we don’t express them in terms of probability as we do in Bayesian
statistics). So we do not associate a probability with the parameter. Rather, the confidence interval
itself is the random variable, not the parameter. To put it another way, we are saying that 95% of
the intervals we would generate from repeated experimentation would contain the real parameter but we aren’t saying anything about the parameter’s value changing, just that the intervals will vary
across experiments.
The mathematical definition of the 95% confidence interval is (where θ is the unknown parameter):
P (a(Y ) < θ < b(Y )|θ) = 0.95
Where a, b are the endpoints of the interval, calculated according to the sampling distribution of Y .
We condition on θ because, as just mentioned, in frequentist statistics, the parameters are fixed and
the data Y is random.
We can compute the 95% confidence interval by taking the point estimate (which is the best estimate
for the value) and ±2 SE, that is build an interval within two standard errors of the point estimate.
The interval we add or subtract to the point estimate (here it is 2 SE) is called the margin of error.
The value we multiply the SE with is essentially a Z score, so we can more generally describe the
margin of error as z SE.
For the confidence interval of the mean, we can be more precise and look within ±1.96 SE (that
is, z = 1.96) of the point estimate x̄ for the 95% confidence interval (that is, our 95% confidence
interval would be x̄ ± 1.96 SE). This is because we know that the sampling distribution of the sample means approximates a normal distribution (for sufficiently large sample sizes, n ≥ 30 is a rule of thumb) according to the central limit theorem.
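A sketch of the 95% confidence interval of a mean on a made-up sample, using z = 1.96 and the sample standard deviation for the standard error:

```python
import numpy as np

rng = np.random.default_rng(5)
sample = rng.normal(loc=10, scale=2, size=100)   # hypothetical sample, n >= 30

x_bar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean
margin = 1.96 * se                               # margin of error for 95%
print(x_bar - margin, x_bar + margin)
```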
6.2.7 Kernel Density Estimates
Sometimes we don’t want the parameters of our data’s distribution, but just a smoothed representation of it. Kernel density estimation allows us to get this representation. It is a nonparametric
method because it makes no assumptions about the form of the underlying distribution (i.e. no
assumptions about its parameters).
[Figure: kernel density estimation example]
Some kernel function (which generates symmetric densities) is applied to each data point, then the
density estimate is formed by summing the densities. The kernel function determines the shape of
these densities and the bandwidth parameter, h > 0, determines their spread and the smoothing of
the estimate. Typically, a Gaussian kernel function is used, so the bandwidth is equivalent to the
variance.
In this figure, the grey curve is the true density, the red curve is the KDE with h = 0.05, the black
curve is the KDE with h = 0.337, and the green curve is the KDE with h = 2.
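A sketch using scipy's Gaussian KDE on made-up data; the bandwidth can be set explicitly to see the smoothing effect described above:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(6)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])  # bimodal

kde = gaussian_kde(data)                        # bandwidth chosen automatically
kde_smooth = gaussian_kde(data, bw_method=0.8)  # larger bandwidth: smoother estimate

xs = np.linspace(-6, 6, 200)
print(kde(xs)[:5], kde_smooth(xs)[:5])          # estimated densities at some points
```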
6.3 Experimental Statistics
Experimental statistics is concerned with hypothesis testing, where you have a hypothesis and want
to learn if your data supports it. That is, you have some sample data and an apparent effect, and
you want to know if there is any reason to believe that the effect is genuine and not just by chance.
Often you are comparing two or more groups; more specifically, you are typically comparing statistics
across these groups, such as their means. For example, you want to see if the difference of their
means is statistically significant; which is to say, likely that it is a real effect and not just chance.
The “classical” approach to hypothesis testing, null hypothesis significance testing (NHST),
follows this general structure:
1. Quantify the size of the apparent effect by choosing some test statistic, which is just a
summary statistic which is useful for hypothesis testing or identifying p-values. For example, if
you have two populations you’re looking at, this could be the difference in means (of whatever
you are measuring) between the two groups.
2. Define a null hypothesis, which is usually that the apparent effect is not real.
3. Compute a p-value, which is the probability of seeing the effect if the null hypothesis is true.
4. Determine the statistical significance of the result. The lower the p-value, the more significant
the result is, since the less likely it is to have just occurred by chance.
Broadly speaking, there are two types of scientific studies: observational and experimental.
In observational studies, the researcher cannot interfere while recording data; as the name implies, the
involvement is merely as an observer.
Experimental studies, however, are deliberately structured and executed. They must be designed to
minimize error, both at a low level (e.g. imprecise instruments or measurements) and at a high-level
(e.g. researcher biases).
6.3.1 Statistical Power
The power of a study is the likelihood that it will distinguish an effect of a certain size
from pure luck. - Statistical power and underpowered statistics, Alex Reinhart
Statistical power, sometimes called sensitivity, can be defined as the probability of rejecting the
null hypothesis when it is false.
If β is the probability of a type II error (i.e. failing to reject the null hypothesis when it’s false), then
power = 1 − β.
Power…
• Increases as n (sample size) increases
• Increases as σ decreases (less variability)
• Is higher for a one-sided test than for its associated two-sided test
6.3.2 Sample Selection
Bias can enter studies primarily in two ways:
• in the process of selecting the objects to study (sampling and retention)
• in the process of collecting information about the objects
To prevent selection bias (selecting samples in such a way that it encourages a particular outcome,
whether done consciously or not), sample selection may be random.
In the case of medical trials and similar studies, random allocation is ideally double blind, so that
neither the patient nor the researchers know which treatment a patient is receiving.
Another sample selection technique is stratified sampling, in which the population is divided into
categories (e.g. male and female) and samples are selected from those subgroups. If the variable used
for stratification is strongly related to the variable being studied, there may be better accuracy from
the sample size.
You need large sample sizes because with small sample sizes, you’re more sensitive to the effects of
chance. e.g. if I flip a coin 10 times, it’s feasible that I get heads 6/10 times (60% of the time).
With that result I couldn’t conclusively say whether or not that coin is rigged. If I flip that coin 1000
times, it’s extremely unlikely that I will get heads 60% of the time (600/1000 times) if it were a fair
coin.
Sometimes to increase sample size, a researcher may use a technique called "replication", which is simply repeating the measurements with new samples. But some researchers really only "pseudoreplicate": samples should be as independent from each other as possible - otherwise you have too many confounding factors. In medical research, researchers may sample a single patient multiple times, every week for instance, and treat each week's sample as a distinct sample. This is pseudoreplication - you begin to inflate other factors particular to that patient in your results. Another example: say you wanted to measure pH levels in soil samples across the US. You can't just sample soil 15ft apart, because those samples are too dependent on each other.
Operationalization
Operationalization is the practice of coming up with some way of measuring something which cannot
be directly measured, such as intelligence. This may be accomplished via proxy measurements.
6.3.3 The Null Hypothesis
In an experiment, the null hypothesis, notated H0 , is the “status quo”. For example, in testing
whether or not a drug has an impact on a disease, the null hypothesis would be that the drug has no
effect.
When running an experiment, you do it under the assumption that the null hypothesis is true. Then
you ask: what’s the probability of getting the results you got, assuming the null hypothesis is true?
If that probability is very small, the null hypothesis is likely false. This probability - of getting your
results if the null hypothesis were true - is called the P value.
6.3.4 Type 1 Errors
A type 1 error is one where the null hypothesis is rejected, even though it is true.
Type 1 errors are usually presented as a probability of them occurring, e.g. a “0.5% chance of a type
1 error” or a “type 1 error with probability of 0.01”.
6.3.5 P Values
P values are central to null hypothesis significance testing (NHST), but they are commonly misunderstood.
P values do not:
• tell you the probability of the null hypothesis being true
• tell you the probability of any hypothesis being true
• prove or disprove hypotheses
There’s no mathematical tool to tell you if your hypothesis is true; you can only see
whether it is consistent with the data, and if the data is sparse or unclear, your conclusions
are uncertain. - Statistics Done Wrong, Alex Reinhart
So what is it then? The P value is the probability of seeing your results or data if the null hypothesis
were true.
That is, given data D and a hypothesis H, where H0 is the null hypothesis, the P value is merely:
P (D|H0 )
If instead we want to find the probability of our hypothesis given the data, that is, P (H|D), we have
to use Bayesian inference instead:
P(H | D) = \frac{P(D | H) P(H)}{P(D | H) P(H) + P(D | \neg H) P(\neg H)}
Note that P values are problematic when testing multiple hypotheses (multiple testing or multiple
comparisons) because any "significant" results (as determined by P value comparisons, e.g. p < 0.05) may be deceptively so, since that result may still have just been chance, as the following comic illustrates.
[Figure: xkcd, "Significant"]
That is, the more significance tests you conduct, the more likely you are to make a Type 1
Error.
In this comic, 20 hypotheses are tested, so with a significance level at 5%, it’s expected that at least
one of those tests will come out significant by chance. In the real world this may be problematic in
that multiple research groups may be testing the same hypothesis and chances may be such that one
of them gets significant results.
6.3.6 The Base Rate Fallacy
A very important shortcoming to be aware of is the base rate fallacy. A P value cannot be considered
in isolation. The base rate of whatever occurrence you are looking at must also be taken into account.
Say you are testing 100 treatments for a disease, and it’s a very difficult disease to treat, so there’s
a low chance (say 1%) that a treatment will actually be successful. This is your base rate. A low
base rate means a higher probability of false positives - treatments which, during the course of your
testing, may appear to be successful but are in reality not (i.e. their success was a fluke). A good
example is the mammogram test example (see The p value and the base rate fallacy).
A p value is calculated under the assumption that the medication does not work and tells
us the probability of obtaining the data we did, or data more extreme than it. It does
not tell us the chance the medication is effective. (The p value and the base rate fallacy,
Alex Reinhart)
6.3.7 False Discovery Rate
The false discovery rate is the expected proportion of false positives (Type 1 errors) amongst
hypothesis tests.
For example, if we have a maximum FDR of 0.10 and we have 1000 observations which seem to
indicate a significant hypothesis, then we can expect 100 of those observations to be false positives.
The q value for an individual hypothesis is the minimum FDR at which the test may be called
significant.
Say you run multiple comparisons and have the following values:
• m = the total number of hypotheses tested (number of comparisons)
• m0 = the number of true null hypotheses (H0)
• m − m0 = the number of true alternative hypotheses (Hi)
• V = the number of false positives (Type 1 errors)
• S = the number of true positives
• T = the number of false negatives (Type 2 errors)
• U = the number of true negatives
• R = V + S = the number of hypotheses declared significant
We can calculate the FDR as:
FDR = E\left[\frac{V}{V + S}\right] = E\left[\frac{V}{R}\right]

Note that \frac{V}{R} = 0 if R = 0.

6.3.8 Alpha Level
The value that you select to compare the p-value to, e.g. 0.05 in the comic, is the alpha level ᾱ, also
called the significance level, of an experiment. Your alpha level should be selected according to the
number of tests you’ll be conducting in an experiment.
There are some approaches to help adjust the alpha level.
The Bonferroni Correction
The highly conservative Bonferroni Correction can be used as a safeguard.
You divide whatever your significance level ᾱ is by the number of statistical tests t you’re doing:
\alpha_p = \frac{\bar{\alpha}}{t}
αp is the per-comparison significance level which you apply for each individual test, and ᾱ is the
maximum experiment-wide significance level, called the maximum familywise error rate (FWER).
The Sidak Correction
A more sensitive correction, the Sidak Correction, can also be used:
\alpha_p = 1 - (1 - \bar{\alpha})^{\frac{1}{n}}
For n independent comparisons, α, the experiment-wide significance level (the FWER) is:
α = 1 − (1 − αp )n
For n dependent comparisons, use:
α ≤ nαp
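A small sketch of both corrections, using 20 tests at an experiment-wide level of 0.05:

```python
def bonferroni(alpha, t):
    """Per-comparison level given an experiment-wide level and t tests."""
    return alpha / t

def sidak(alpha, n):
    """Per-comparison level for n independent comparisons."""
    return 1 - (1 - alpha) ** (1 / n)

print(bonferroni(0.05, 20))   # 0.0025
print(sidak(0.05, 20))        # ~0.00256, slightly less conservative
```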
6.3.9 The Benjamini-Hochberg Procedure
Approaches like the Bonferroni correction lower the alpha level, which ends up decreasing your statistical power - that is, you fail to detect true effects as well as false ones.
And with such an approach, you are still susceptible to the base rate fallacy, and may still have false
positives. So how can you calculate the false discovery rate? That is, what fraction of the statistically
significant results are false positives?
You can use the Benjamini-Hochberg procedure, which tells you which P values to consider statistically
significant:
1. Perform your statistical tests and get the P value for each. Make a list and sort it in ascending
order.
2. Choose a false-discovery rate q. The number of statistical tests is m.
3. Find the largest P value such that p ≤ iq/m, where i is the P value's place in the sorted list.
4. Call that P value and all smaller than it statistically significant.
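A sketch of the procedure on hypothetical p-values:

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.10):
    """Return a boolean mask of which hypotheses to call significant."""
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p)
    m = len(p)
    thresholds = (np.arange(1, m + 1) / m) * q      # i*q/m for each rank i
    below = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.max(np.where(below)[0])         # largest i with p_(i) <= iq/m
        significant[order[:cutoff + 1]] = True      # that p-value and all smaller
    return significant

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(p_values, q=0.10))
```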
The procedure guarantees that out of all statistically significant results, no more than q
percent will be false positives.
The Benjamini-Hochberg procedure is fast and effective, and it has been widely adopted
by statisticians and scientists in certain fields. It usually provides better statistical power
than the Bonferroni correction and friends while giving more intuitive results. It can
be applied in many different situations, and variations on the procedure provide better
statistical power when testing certain kinds of data.
Of course, it’s not perfect. In certain strange situations, the Benjamini-Hochberg procedure gives silly results, and it has been mathematically shown that it is always possible
to beat it in controlling the false discovery rate. But it’s a start, and it’s much better
than nothing. - (Controlling the false discovery rate, Alex Reinhart)
6.3.10 Sum of Squares
The sum of squares within (SSW)
SSW = \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \bar{x}_i)^2
• This shows how much of SST is due to variation within each group, i.e. variation from within
that group’s mean.
• The degrees of freedom here is calculated m(n − 1).
The sum of squares between (SSB)
SSB = \sum_{i=1}^{m} n_i (\bar{x}_i - \bar{\bar{x}})^2
• This shows how much of SST is due to variation between the group means
• The degrees of freedom here is calculated m − 1.
The total sum of squares (SST)
SST = \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \bar{\bar{x}})^2
SST = SSW + SSB
• Note: \bar{\bar{x}} is the mean of means, or the "grand mean".
• This is the total variation for the groups
• The degrees of freedom here is calculated mn − 1.
6.3.11 Statistical Tests
Two-sided tests
Asks “What is the chance of seeing an effect as big as the observed effect, without regard to its
sign?” That is, you are looking for any effect, increase or decrease.
One-sided tests
Asks “What is the chance of seeing an effect as big as the observed effect, with the same sign?”
That is, you are looking for either only an increase or decrease.
Unpaired t-test
The most basic statistical test, used when comparing the means from two groups. Used for small
sample sizes. The t-test returns a p-value.
Paired t-test
The paired t-test is a t-test used when each datapoint in one group corresponds to one datapoint in
the other group.
Chi-squared test
When comparing proportions of two populations, it is common to use the chi-squared statistic:
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}

Where O_i are the observed frequencies and E_i are the expected frequencies.
Say for example you want to test if a coin is fair. You expect that, if it is fair, you should see about
50/50 heads and tails - this describes your expected frequencies. You flip the coin and observe the
actual resulting frequencies - these are your observed frequencies.
The Chi squared test allows you to determine if these frequencies differ significantly.
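A sketch of the coin example with scipy (the counts are made up):

```python
from scipy import stats

observed = [62, 38]        # heads, tails out of 100 flips
expected = [50, 50]        # what a fair coin would give

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p)             # small p suggests the coin may not be fair
```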
ANOVA (Analysis of Variance)
ANOVA, ANCOVA, MANOVA, and MANCOVA are various ways of comparing different groups.
• ANOVA - group A is given a placebo and group B is given the actual medication and the
outcome variable to compare is how many pounds were lost
• ANCOVA - same as ANOVA but now there is an additional covariate we consider, e.g. hours
of exercise per day
• MANOVA and MANCOVA are multivariate counterparts to the above, for instance we may
consider cholesterol levels in addition to weight loss
ANOVA is used to compare three or more groups. It uses a single test to compare the means across
multiple groups simultaneously, which avoids using multiple tests to make multiple comparisons (which
can lead to differences across groups resulting from chance).
There are a few requirements:
• the observations are independent within and across groups
• the data within each group are nearly normal
• the variance in the groups is about equal across groups
ANOVA tests the null hypothesis that the means across groups are the same (that is, that µ1 = · · · =
µk , if there are k groups), with the alternate hypothesis being that at least one mean is different. We
look at the variability in the sample means and see if it is so large that it is unlikely to have been due
to chance. The variability we use is the mean square between groups (MSG) which has degrees
of freedom df G = k − 1.
The MSG is calculated:
MSG = \frac{1}{df_G} SSG = \frac{1}{k - 1} \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2
Where the SSG is the sum of squares between groups and ni is the sample size of group i out of
k total groups. x̄ is the mean of outcomes across all groups.
We need a value to compare the MSG to, which is the mean square error (MSE), which measures
the variability within groups and has degrees of freedom df E = n − k.
The MSE is calculated:
MSE = \frac{1}{df_E} SSE
Where the SSE is the sum of squared errors and is computed as:
SSE = SST − SSG
Where the SSG is the same as before and the SST is the sum of squares total:

SST = \sum_{i=1}^{n} (x_i - \bar{x})^2

SSG = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2
ANOVA uses a test statistic called F , which is computed:
F = \frac{MSG}{MSE}
When the null hypothesis is true, difference in variability across sample means should be due only to
chance, so we expect MSG and MSE to be about equal (and thus F to be close to 1).
We take this F statistic and use it with a test called the F test, where we compute a p-value from
the F statistic, using the F distribution, which has the parameters df 1 and df 2 . We expect ANOVA’s
F statistic to follow an F distribution with parameters df 1 = df G , df 2 = df E if the null hypothesis is
true.
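A sketch of this with scipy's one-way ANOVA on made-up groups, which computes the F statistic and its p-value from the F distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(5.0, 1.0, 30)
group_b = rng.normal(5.5, 1.0, 30)
group_c = rng.normal(6.0, 1.0, 30)

f_stat, p = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p)    # large F / small p: at least one group mean likely differs
```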
One-Way ANOVA
Similar to a t-test but used to compare three or more groups. With ANOVA, you calculate the F statistic, assuming the null hypothesis:

F = \frac{SSB / (m - 1)}{SSW / (m(n - 1))}

Remember that SSB is the "sum of squares between" and SSW is the "sum of squares within".
Two-Way ANOVA
Allows you to compare the means of two or more groups when there are multiple variables or factors
to be considered.
One-tailed & two-tailed tests
In a two-tailed test, both tails of a distribution are considered. For example, with a drug where
you’re looking for any effect, positive or negative.
In a one-tailed test, only one tail is considered. For example, you may be looking only for a positive or
only for a negative effect.
6.3.12 Effect Size
A big part of statistical inference is measuring effect size, which more generally is trying to quantify
differences between groups, but typically just referred to as “effect size”.
There are a few ways of measuring effect size:
Difference in means
The difference in means, e.g. µ1 − µ2
But this has a few problems:
• Must be expressed in the units of measure of the mean (e.g. ft, kg, etc), so it can be difficult
to compare to other studies
• Needs more context about the distributions (e.g. standard deviation) to understand if the
difference is large or not
Distribution overlap
The overlap between the two distributions:
Choose some threshold between the two means, e.g.
• The midpoint between the means: \frac{\mu_1 + \mu_2}{2}
• Where the PDFs cross: \frac{\sigma_1 \mu_2 + \sigma_2 \mu_1}{\sigma_1 + \sigma_2}
Count how many in the first group are below the threshold, call it m_1. Count how many in the second group are above the threshold, call it m_2.
The overlap then is:

\frac{m_1}{n_1} + \frac{m_2}{n_2}
Where n1 , n2 are the sample sizes of the first and second groups, respectively.
This overlap can also be framed as a misclassification rate, which is just overlap/2.
These measures are unitless, which makes them easy to compare across studies.
Probability of superiority
The “probability of superiority” is the probability that a randomly chosen datapoint from group 1 is
greater than a randomly chosen datapoint from group 2.
This measure is also unitless.
Cohen’s d
Cohen’s d is the difference in means, divided by the standard deviation, which is computed from the
pooled variance, σp2 , of the groups:
\sigma_p^2 = \frac{n_1 \sigma_1^2 + n_2 \sigma_2^2}{n_1 + n_2}

d = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_p^2}}
This measure is also unitless.
Different fields have different intuitions about how big a d value is; it’s something you have to learn.
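A sketch translating the formulas above (the two groups are made up):

```python
import numpy as np

def cohens_d(group1, group2):
    """Difference in means scaled by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1), np.var(group2)
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    return (np.mean(group1) - np.mean(group2)) / np.sqrt(pooled_var)

rng = np.random.default_rng(8)
a = rng.normal(10.0, 2.0, 100)
b = rng.normal(9.0, 2.0, 100)
print(cohens_d(a, b))   # roughly 0.5 for these made-up parameters
```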
6.3.13 Reliability
Reliability refers to how consistent or repeatable a measurement is (for continuous data).
There are three main approaches:
Multiple-occasions reliability
Aka test-retest reliability. This is how a test holds up over repeated testing, e.g. “temporal stability”.
This assumes the underlying metric does not change.
Multiple-forms reliability
Aka parallel-forms reliability. This asks: how consistent are different tests at measuring the same
thing?
Internal consistency reliability
This asks: do the items on a test all measure the same thing?
6.3.14 Agreement
Agreement is similar to reliability, but used more for discrete data.
Percent agreement
\frac{\text{number of cases where tests agreed}}{\text{all cases}}
Note that a high percent agreement may be obtained by chance.
Cohen’s kappa
Often just called kappa, this corrects for the possibility of chance agreement:
κ = (po − pe)/(1 − pe)

Where po is the observed agreement, that is, (num. agreements)/(total cases), and pe is the expected agreement. Kappa
ranges from -1 to 1, where 1 is perfect agreement.
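A minimal sketch of computing kappa, assuming two equal-length arrays of discrete ratings (here the chance agreement pe is estimated from each rater's marginal category frequencies):

import numpy as np

def cohens_kappa(rater1, rater2):
    rater1, rater2 = np.asarray(rater1), np.asarray(rater2)
    # observed agreement: fraction of cases where the raters agree
    p_o = np.mean(rater1 == rater2)
    # expected (chance) agreement: probability both raters independently
    # pick the same category, summed over categories
    p_e = 0.0
    for category in np.union1d(rater1, rater2):
        p_e += np.mean(rater1 == category) * np.mean(rater2 == category)
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 0, 1]))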
6.4
Handling Data
6.4.1
Transforming data
Occasionally you may find data easier to work with if you apply a transformation to it; that is,
rescale it in some way. For instance, you might take the natural log of your values, or the square
root, or the inverse. This can reduce skew and the effect of outliers or make linear modeling easier.
The function which applies this transformation is called a link function.
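For instance, a quick sketch with some hypothetical right-skewed values:

import numpy as np

# hypothetical right-skewed data (e.g. incomes); the log transform pulls in
# the long right tail, reducing skew and the influence of outliers
values = np.random.lognormal(mean=3.0, sigma=1.0, size=1000)
logged = np.log(values)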
6.4.2
Dealing with missing data
Data can be missing for a few reasons:
• Missing completely at random (MCAR) - missing cases are identical to non-missing cases, on
average.
• Missing at random (MAR) - Missing data depends on measured values, so they can be modeled
by other observed variables.
• Missing not at random (MNAR) - Missing data depends on unmeasured/unknown variables, so
there is no way to account for them.
There are a few strategies for dealing with missing data.
The worst you can do is to ignore the missing data and try to run your analysis, missing data and all
(it likely won’t and probably shouldn’t work).
Alternatively, you can delete all datapoints which have missing data, leaving only complete data points
- this is called complete case analysis. Complete case analysis makes the most sense with MCAR
missing data - you will have a reduction in sample size, and thus a reduction in statistical power, as
a result, but your inference will not be biased. The possibly systemic nature of missing data in MAR
and MNAR means that complete case analysis may overlook important details for your model.
You also have the option of filling in missing values - this is called imputation (you “impute” the
missing values). You can, for instance, fill in missing values with the mean of that variable. You
don’t gain any of the information that was missing, and you end up ignoring the uncertainty associated
with the fill-in value (and the resulting variances will be artificially reduced), but you at least get to
maintain your sample size. Again, bias may be introduced in MAR and MNAR situations since the
missing data may be due to some systemic cause.
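A minimal sketch of mean imputation, assuming missing values are encoded as NaN in a numpy array:

import numpy as np

x = np.array([2.1, np.nan, 3.5, 4.0, np.nan, 2.9])

# fill each missing value with the mean of the observed values
x_imputed = np.where(np.isnan(x), np.nanmean(x), x)

# note the artificially reduced variance relative to the observed values alone
print(np.nanvar(x), np.var(x_imputed))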
One of the better approaches is multiple imputation, which produces unbiased parameter estimates
and accounts for the uncertainty of imputed values. A regression model is used to generate the
imputed values, and does well especially under MAR conditions - the regression model may be able
to exploit info in the dataset about the missing data. If some known values correlate with the missing
values, they can be of use in this way.
Then, instead of using the regression model to produce one value for each missing value, multiple
values are produced, so that the end result is multiple copies of your dataset, each with different
imputed values for the missing values. You perform your analysis across all datasets and average the
produced estimates.
6.4.3
Resampling
Resampling involves repeatedly drawing subsamples from an existing sample.
Resampling is useful for assessing and selecting models and for estimating the precision of parameter
estimates.
A common resampling method is bootstrapping.
Bootstrapping
Bootstrapping is a resampling method to approximate the true sampling distribution of a dataset,
which can then be used to estimate the mean and the variance of the distribution. The advantage
with bootstrapping is that there is no need to compute derivatives or make assumptions about the
distribution’s form.
You take R samples Si∗, with replacement, each of size n (i.e. each resample is the same size as the
original sample), from your dataset. These samples, S∗ = {S1∗, . . . , SR∗}, are called replicate bootstrap samples. Then you can compute the
statistic t for each of the bootstrap samples, Ti∗ = t(Si∗).
Then you can estimate the mean and variance:
T̄∗ = Ê[T∗] = (∑_i Ti∗)/R

V̂ar(T∗) = ∑_i (Ti∗ − T̄∗)²/(R − 1)
With bootstrap estimates, there are two possible sources of error. You may have the sampling error
from your original sample S in addition to the bootstrap error, from failing to be comprehensive in
your sampling of bootstrap samples. To avoid the latter, you should try to choose a large R, such
as R = 1000.
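A minimal sketch of the procedure (here the statistic is the mean and the sample data is hypothetical):

import numpy as np

def bootstrap(sample, statistic, R=1000):
    # draw R resamples with replacement, each the same size as the original
    # sample, and compute the statistic on each one
    n = len(sample)
    estimates = np.array([statistic(np.random.choice(sample, size=n, replace=True))
                          for _ in range(R)])
    # bootstrap estimates of the statistic's mean and variance
    return estimates.mean(), estimates.var(ddof=1)

data = np.random.normal(10, 3, size=50)   # a hypothetical original sample
print(bootstrap(data, np.mean, R=1000))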
6.5
References
• Review of fundamentals, IFT725. Hugo Larochelle. 2012.
• Statistical Inference Course Notes. Xing Su.
• Regression Models Course Notes. Xing Su.
• Statistics in a Nutshell. Second Edition. Sarah Boslaugh.
• What is the difference between descriptive and inferential statistics? Jeromy Anglim.
• Understanding Variance, Co-Variance, and Correlation. Count Bayesie. Will Kurt.
• Think Stats: Exploratory Data Analysis in Python. Version 2.0.27. Allen B Downey.
• Principles of Statistics. M.G. Bulmer. 1979.
• OpenIntro Statistics. Second Edition. David M Diez, Christopher D Barr, Mine Çetinkaya-Rundel.
• Computational Statistics I. Allen Downey. SciPy 2015.
• Computational Statistics II (code). Chris Fonnesbeck. SciPy 2015.
• Bayesian Statistical Analysis. Chris Fonnesbeck. SciPy 2014.
• Lecture Notes from CS229 (Stanford).
• Data Analysis Using Regression and Multilevel/Hierarchical Models. First edition. Andrew Gelman and Jennifer Hill.
• Frequentism and Bayesianism: A Practical Introduction. Jake Vanderplas.
• Machine Learning. 2014. Andrew Ng. Stanford University/Coursera.
• Introduction to Artificial Intelligence (Udacity CS271). Peter Norvig and Sebastian Thrun.
• Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
• Controlling the false discovery rate. Alex Reinhart.
• The p value and the base rate fallacy. Alex Reinhart.
• Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy. Steven N. Goodman, MD, PhD.
• Misinterpretations of Significance: A Problem Students Share with Their Teachers? Heiko Haller & Stefan Krauss.
• Statistics Done Wrong. Alex Reinhart.
• Stevens, S. S. (1946). On the theory of scales of measurement.
7
Bayesian Statistics
Bayesian statistics is an approach to statistics contrasted with frequentist approaches.
As with frequentist statistical inference, Bayesian inference is concerned with estimating parameters
from some observed data. However, whereas frequentist inference returns point estimates - that is,
single values - for these parameters, Bayesian inference instead expresses these parameters themselves
as probability distributions. This is intuitively appealing as we are uncertain about the parameters
we’ve inferred; with Bayesian inference we can represent this uncertainty.
This is to say that in Bayesian inference, we don’t assign an explicit value to an unknown parameter.
Rather, we define it over a probability distribution as well: what values is the parameter likely to take
on? That is, we treat the parameter itself as a random variable.
We may say for instance that an unknown parameter θ is drawn from an exponential distribution:
θ ∼ Exp(α)
Here α is a hyperparameter, that is, it is a parameter for our parameter θ.
Fundamentally, this is Bayesian inference:
P (θ|X)
Where the parameters θ are the unknown, so we express them as a probability distribution, given the
observations X. This probability distribution is the posterior distribution.
So we must decide (specify) probability distributions for both the data sample and for the unknown
parameters. These decisions involve making a lot of assumptions. Then you must compute a posterior
distribution, which often cannot be calculated analytically - so other methods are used (such as
simulations, described later).
CHAPTER 7. BAYESIAN STATISTICS
195
7.1. BAYES’ THEOREM
196
From the posterior distribution, you can calculate point estimates, credible intervals, quantiles, and
make predictions.
Finally, because of the assumptions which go into specifying the initial distributions, you must test
your model and see if it fits the data and seems reasonable.
Thus Bayesian inference amounts to:
1. Specifying a sampling model for the observed data X, conditioned on the unknown parameter
θ (which we treat as a random variable), such that X ∼ f (X|θ), where f (X|θ) is either the
PDF or the PMF (as appropriate).
2. Specifying a marginal distribution π(θ) for θ, which is the prior distribution (“prior” for
short): θ ∼ π(θ)
3. From this we wish to compute the posterior, that is, uncover the distribution for θ given the
observed data X, like so: π(θ|X) = π(θ)L(θ|X) / ∫ π(θ)L(θ|X)dθ, where L(θ|X) ∝ f (X|θ), viewed as a function of θ, is called the
likelihood of θ given X. More often than not, the posterior must be approximated through
Markov Chain Monte Carlo (detailed later).
7.0.1
Frequentist vs Bayesian approaches
For frequentists, probability is thought of in terms of frequencies, i.e. the probability of an event is
the number of times it happened over the total number of times it could have happened.
In frequentist statistics, the observed data is considered random; if you gathered more observations
they would be different according to the underlying distribution. The parameters of the model,
however, are considered fixed.
For Bayesians, probability is belief or certainty about an event. Observed data is considered fixed,
but the model parameters are random (uncertain) instead and considered to be drawn from some
probability distribution.
Another way of phrasing this is that frequentists are concerned with uncertainty in the data, whereas
Bayesians are concerned with uncertainty in the parameters.
7.1
Bayes’ Theorem
In frequentist statistics, many different estimators may be used, but in Bayesian statistics the only
estimator is Bayes’ Formula (aka Bayes’ Rule or Bayes’ Theorem).
Bayes’ Theorem, aka Bayes’ Rule:
• H is the hypothesis (more commonly represented as the parameters θ)
• D is the data

P (H|D) = P (H)P (D|H) / P (D)
• P (H) = the probability of the hypothesis before seeing the data. The prior.
• P (H|D) = the probability of the hypothesis, given the data. The posterior.
• P (D|H) = the probability of the data under the hypothesis. The likelihood.
• P (D) = the probability of the data under any hypothesis. The normalizing constant.
For an example of likelihood:
If I want to infer the number of sides of the die I rolled, and I rolled an 8, then P (D|a six-sided die) = 0.
That is, it is impossible to have my observed data under the hypothesis of having a six sided die.
A key insight to draw from Bayes’ Rule is that P (H|D) ∝ P (H)P (D|H), that is, the posterior is
proportional to the product of the prior and the likelihood.
Note that the normalizing constant P (D) usually cannot be directly computed and is equivalent to
∫ P (D|H)P (H)dH (which is usually intractable since there are usually multiple parameters of interest,
resulting in a multidimensional integration problem. If θ, the parameters, is one dimensional, then
you could integrate it rather easily).
One workaround is to do approximate inference with non-normalized posteriors, since we know that
the posterior is proportional to the numerator term:
P (H|D) ∝ P (H)P (D|H)
Another workaround is to approximate the posterior using simulation methods such as Monte Carlo.
Given a set of hypotheses H0 , H1 , . . . , Hn , the distribution for the priors of these hypotheses is the
prior distribution, i.e. P (H0 ), P (H1 ), . . . , P (Hn ).
The distribution of the posterior probabilities is the posterior distribution, i.e. P (H0 |D), P (H1 |D), . . . , P (Hn |D).
7.1.1
Likelihood
Likelihood is not the same as probability (thus it does not have to sum to 1), but it is proportional
to probability. More specifically, the likelihood of a hypothesis H given some data D is proportional
to the probability of D given that H is true:
L(H|D) = kP (D|H)
Where k is a constant such that k > 0.
With the probability P (D|H), we fix H and allow D to vary. In the case of likelihood, this is reversed:
we fix D and allow the hypotheses to vary.
The Law of Likelihood states that the hypothesis for which the probability of the data is greater is
the more likely hypothesis. For example, H1 is a better hypothesis than H2 if P (D|H1 ) > P (D|H2 ).
We can also quantify how much better H1 is than H2 with the ratio of their likelihoods, i.e.
L(H1 |D)/L(H2 |D), which is proportional to P (D|H1 )/P (D|H2 ).
Likelihoods are meaningless in isolation (because of the constant k), they must be compared to other
likelihoods, such that the constants cancel out, i.e. as ratios like the example above, to be meaningful.
A Bayes factor is an extension of likelihood ratios: it is a weighted average likelihood ratio based
on the prior distribution of hypotheses. So we have some prior bias as to what hypotheses we expect,
i.e. how probable we expect some hypotheses to be, and we weigh the likelihood ratios by these
expected probabilities.
7.2
Choosing a prior distribution
With Bayesian inference, we must choose a prior distribution, then apply data to get our posterior
distribution. The prior is chosen based on domain knowledge or intuition or perhaps from the results
of previous analysis; that is, it is chosen subjectively - there is no prescribed formula for picking a
prior. If you have no idea what to pick, you can just pick a uniform distribution as your prior.
Your choice of prior will affect the posterior that you get, and the subjectivity of this choice is
what makes Bayesian statistics controversial - but it’s worth noting that all of statistics, whether
frequentist or Bayesian, involves many subjective decisions (e.g. frequentists must decide on an
estimator to use, what data to collect and how, and so on) - what matters most is that you are
explicit about your decisions and why you made them.
Say we perform a Bayesian analysis and get a posterior. Then we get some new data for the same
problem. We can re-use the posterior from before as our prior, and when we run Bayesian analysis
on the new data, we will get a new posterior which reflects the additional data. We don’t have to
re-do any analysis on the data from before, all we need is the posterior generated from it.
For any unknown quantity we want to model, we say it is drawn from some prior of our choosing.
This is usually some parameter describing a probability distribution, but it could be other values as
well. This is central to Bayesian statistics - all unknowns are represented as distributions of possible
values. In Bayesian statistics: if there’s a value and you don’t know what it is, come up with a prior
for it and add it to your model!
If you think of distributions as landscapes or surfaces, then the data deforms the prior surface to mold
it into the posterior distribution.
The surface’s “resistance” to this shaping process depends on the selected prior distribution.
When it comes to selecting Bayesian priors, there are two broad categories:
• objective priors - these let the data influence the posterior the most
• subjective priors - these allow the practitioner to assert their own views into the prior. This
prior can be the posterior from another problem or just come from domain knowledge.
An example objective prior is a uniform (flat) prior where every value has equal weighting. Using
a uniform prior is called The Principle of Indifference. Note that a uniform prior restricted within a
range is not objective - it has to be over all possibilities.
Note that the more data you have (as N increases), the choice of prior becomes less important.
7.2.1
Conjugate priors
Conjugate priors are priors which, when combined with the likelihood, result in a posterior which
is in the same family. These are very convenient because the posterior can be calculated analytically,
so there is no need to use approximation such as Markov Chain Monte Carlo (see below).
For example, a binomial likelihood is a conjugate with a beta prior - their combination results in a
beta-binomial posterior.
For example, the Gaussian family of distributions are conjugate to itself (self conjugate) - a Gaussian
likelihood with a Gaussian prior results in a Gaussian posterior.
For example, when working with count data you will probably use the Poisson distribution for your
likelihood, which is conjugate with gamma distribution priors, resulting in a gamma posterior.
Unfortunately, conjugate priors only really show up in simple one-dimensional models.
More generally, we can define a conjugate prior like so:
Say random variable X comes from a well-known distribution, fα where α are the possibly unknown
parameters of f . It could be a normal, binomial, etc distribution.
For the given distribution fα, there may exist a prior distribution pβ such that

pβ · fα(X) = pβ′

(prior · data = posterior), i.e. the resulting posterior pβ′ belongs to the same family as the prior.
Beta-Binomial Model
The Beta-Binomial model is a useful Bayesian model because it provides values between 0 and 1,
which is useful for estimating probabilities or percentages.
It involves, as you might expect, a beta and a binomial distribution.
So say we have N trials and observe n successes. We describe these observations by a binomial
distribution, n ∼ Bin(N, p) for which p is unknown. So we want to come up with some distribution
for p (remember, with Bayesian inference, you do not produce point estimates, that is, a single value,
but a distribution for your unknown value to describe the uncertainty of its true value).
For frequentist inference we’d estimate p̂ = n/N, which isn’t quite good for low numbers of N.
This being Bayesian inference, we first must select a prior. p is a probability and therefore is bound
to [0, 1]. So we could choose a uniform prior over that interval; that is p ∼ Uniform(0, 1).
However, Uniform(0, 1) is equivalent to a beta distribution where α = 1, β = 1, i.e. Beta(1, 1).
The beta distribution is bound between 0 and 1 so it’s a good choice for estimating probabilities.
We prefer a beta prior over a uniform prior because, given binomial observations, the posterior will
also be a beta distribution.
It works out nicely mathematically:
p ∼ Beta(α, β)
n ∼ Bin(N, p)
p | n, N ∼ Beta(α + n, β + N − n)
So with these two distributions, we can directly compute the posterior with no need for simulation
(e.g. MCMC).
How do you choose the parameters for a Beta prior? Well, it depends on the particular problem, but
a conservative one, for when you don’t have a whole lot of info to go on, is Beta(1/2, 1/2), known as
the Jeffreys prior.
Example
We run 100 trials and observe 10 successes. What is the probability p of a successful trial?
Our knowns are N = 100, n = 10. A binomial distribution describes these observations, but we have
the unknown parameter p.
For our prior for p we choose Beta(1, 1) since it is equivalent to a uniform prior over [0, 1] (i.e. it is
an objective prior).
We can directly compute the posterior now:
p | n, N ∼ Beta(α + n, β + N − n)
p ∼ Beta(11, 91)
Then we can draw samples from the distribution and compute its mean or other descriptive statistics
such as the credible interval.
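A minimal sketch of this example with numpy (drawing from the Beta(11, 91) posterior rather than working with it analytically):

import numpy as np

alpha, beta, N, n = 1, 1, 100, 10   # flat Beta(1, 1) prior; 10 successes in 100 trials

# draw samples from the Beta(alpha + n, beta + N - n) posterior
samples = np.random.beta(alpha + n, beta + N - n, size=100000)

print(samples.mean())                        # posterior mean of p
print(np.percentile(samples, [2.5, 97.5]))   # a 95% credible interval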
7.2.2
Sensitivity Analysis
The strength of the prior affects the posterior - the stronger your prior beliefs, the more difficult it
is to change those beliefs (it requires more data/evidence). You can conduct sensitivity analysis
to try your approach with various different priors to get an idea of how different priors affect your
resulting posterior.
7.2.3
Empirical Bayes
Empirical Bayes is a method which combines frequentist and Bayesian approaches by using frequentist
methods to select the hyperparameters.
For instance, say you want to estimate the µ parameter for a normal distribution.
You could use the empirical sample mean from the observed data:

µp = (1/N) ∑_i Xi

Where µp denotes the prior µ.
Though if you’re working with little data, this ends up being something like double-counting your data.
7.3
Markov Chain Monte Carlo (MCMC)
With Bayesian inference, in order to describe your posterior, you often must evaluate complex multidimensional integrals (i.e. from very complex, multidimensional probability distributions), which can
be computationally intractable.
Instead you can generate sample points from the posterior distribution and use those samples to
compute whatever descriptions you need. This technique is called Monte Carlo integration, and
the process of drawing repeated random samples in this way is called Monte Carlo simulation. In
particular, we can use a family of techniques known as Markov Chain Monte Carlo, which combine
Monte Carlo integration and simulation with Markov chains, to generate samples for us.
7.3.1
Monte Carlo Integration
Monte Carlo integration is a way to approximate complex integrals using random number generation.
Say we have a complex integral:
∫ h(x)dx
If we can decompose h(x) into the product of a function f (x) and a probability density function
P (x) describing the probabilities of the inputs x, then:
∫ h(x)dx = ∫ f (x)P (x)dx = E_P(x) [f (x)]
That is, the result of this integral is the expected value of f (x) over the density P (x).
We can approximate this expected value by taking the mean of many, many samples (n samples):
∫ h(x)dx = E_P(x) [f (x)] ≈ (1/n) ∑_{i=1..n} f (xi)
This process of approximating the integral is Monte Carlo integration.
For very simple cases of known distributions, we can sample directly, e.g.
import numpy as np

# Say we think the distribution is a Poisson distribution
# and the parameter of our distribution, lambda,
# is unknown and what we want to discover.
lam = 5

# Collect 100000 samples
sim_vals = np.random.poisson(lam, size=100000)

# Get whatever descriptions we want, e.g. the sample mean
mean = sim_vals.mean()

# For a Poisson distribution the mean is lambda, so we expect them to be
# approximately equal given a large enough sample size (the standard error
# of the mean here is sqrt(lam/100000), roughly 0.007)
abs(lam - mean) < 0.05
7.3.2
Markov Chains
Markov chains are a stochastic process in which the next state depends only on the current state.
Consider a random variable X and a time index t. The state of X at time t is notated Xt .
For a Markov chain, the state Xt+1 depends only on the current state Xt , that is:
P (Xt+1 = xt+1 |Xt = xt , Xt−1 = xt−1 , . . . , X0 = x0 ) = P (Xt+1 = xt+1 |Xt = xt )
Where P (Xt+1 = xt+1 ) is the transition probability of Xt+1 = xt+1 . The collection of transition
probabilities is called a transition matrix (for discrete states); more generally it is called a transition
kernel.
If we consider t going to infinity, the Markov chain settles on a stationary distribution, where
P (Xt ) = P (Xt−1 ). The stationary distribution does not depend on the initial state of the network.
Markov chains are ergodic, i.e. they “mix”, which means that the influence of the initial state weakens
with time (the rate at which it mixes is its mixing speed).
If we call the k × k transition matrix P and the marginal probability of a state at time t is a k × 1
vector π, then the distribution of the state at time t + 1 is π ′ P . If π ′ P = π ′ , then π is the stationary
distribution of the Markov chain.
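A minimal sketch of finding a stationary distribution numerically, for a hypothetical 3-state transition matrix:

import numpy as np

# row i gives the transition probabilities out of state i
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])

# start from an arbitrary marginal distribution and repeatedly apply pi'P
pi = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    pi = pi @ P

print(pi)                        # the stationary distribution
print(np.allclose(pi @ P, pi))   # pi'P = pi'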
7.3.3
Markov Chain Monte Carlo
MCMC is useful because often we may encounter distributions which aren’t easily expressed mathematically (e.g. their functions may have very strange shapes), but we still want to compute some
descriptive statistics (or make other computations) from them. MCMC allows us to work with such
distributions without needing precise mathematical formulations of them.
More generally, MCMC is really useful if you don’t want to (or can’t) find the underlying function
describing something. As long as you can simulate that process in some way, you don’t need to know
the exact function - you can just generate enough sample data to work with in its stead. So MCMC
is a brute force but effective method.
Rather than directly compute the integral for posterior distributions in Bayesian analysis, we can
instead use MCMC to draw several (thousands, millions, etc) samples from the probability distribution,
then use these samples to compute whatever descriptions we’d like about the distribution (often this is
some expected value of a function, E[f (x)], where its inputs are drawn from distribution, i.e. x ∼ p,
where p is some probability distribution).
You start with some random initial sample and, based on that sample, you pick a new sample. This is
the Markov Chain aspect of MCMC - the next sample you choose depends only on the current sample.
This works out so that you spend most of your time with high probability samples (because they have higher
transition probabilities) but occasionally jump out to lower probability samples. Eventually the MCMC
chain converges so that its draws behave like samples from the target distribution.
So we can take all these N samples and, for example, compute the expected value:
E[f (x)] ≈ (1/N) ∑_{i=1..N} f (xi)
Because of the random initialization, there is a “burn-in” phase in which the sampling model needs
to be “warmed up” until it reaches an equilibrium sampling state, the stationary distribution. So
you discard the first hundred or thousand or so samples as part of this burn-in phase. You can
(eventually) arrive at this stationary distribution independent of where you started which is why the
random initialization is ok - this is an important feature of Markov Chains.
MCMC is a general technique of which there are several algorithms.
Rejection Sampling
Monte Carlo integration allows us to draw samples from a posterior distribution with a known parametric form. It does not, however, enable us to draw samples from a posterior distribution without
a known parametric form. We may instead use rejection sampling in such cases.
We can take our function f (x) and if it has bounded/finite support (“support” is the x values where
f (x) is non-zero, and can be thought of the range of meaningful x values for f (x)), we can calculate
its maximum and then define a bounding rectangle with it, encompassing all of the support values.
This envelope function should contain all possible values of f (x). Then we can randomly generate
points from within this box and check if they are under the curve (that is, less than f (x) for the
point’s x value). If a point is not under the curve, we reject it. Thus we approximate the integral
like so:
lim_{n→∞} (points under curve / points generated) × box area = ∫_A^B f (x)dx
In the case of unbounded support (i.e. infinite tails), we instead choose some majorizing or enveloping
function g(x) (g(x) is typically a probability density itself and is called a proposal density) such that
cg(x) ≥ f (x) , ∀x ∈ (−∞, ∞), where c is some constant. This functions like the bounding box from
before. It completely encloses f . Ideally we choose g(x) so that it is close to the target distribution,
that way most of our sampled points can be accepted.
Then, for each xi we draw (i.e. sample), we also draw a uniform random value ui . Then if ui < f (xi)/(cg(xi)),
we accept xi , otherwise, we reject it.
The intuition here is that the probability of a given point being accepted is proportional to the function
f at that point, so when there is greater density in f for that point, that point is more likely to be
accepted.
In multidimensional cases, you draw candidates from every dimension simultaneously.
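A minimal sketch of the bounded-support case, for a hypothetical unnormalized target f on [0, 5] (the bound f_max = 0.5 is an assumption chosen to envelope this particular f):

import numpy as np

# an unnormalized target density with support on [0, 5]
f = lambda x: np.exp(-x) * np.sin(x) ** 2

A, B, f_max = 0.0, 5.0, 0.5   # bounding box: x in [A, B], y in [0, f_max]

n = 100000
xs = np.random.uniform(A, B, size=n)
ys = np.random.uniform(0, f_max, size=n)

accepted = xs[ys < f(xs)]   # keep only the points under the curve

# the accepted points behave (approximately) like samples from the normalized f;
# the acceptance rate also estimates the integral of f over [A, B]
integral = (accepted.size / n) * (B - A) * f_max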
Metropolis-Hastings
The Metropolis-Hastings algorithm uses Markov chains with rejection sampling.
The proposal density g(θt ) is chosen as in rejection sampling, but it depends on θt−1 , i.e. g(θt |θt−1 ).
First select some initial θ, θ1 .
Then for n iterations:
• Draw a candidate θtc ∼ g(θt |θt−1 )
• Compute the Metropolis-Hastings ratio: R = [f (θtc )g(θt−1 |θtc )] / [f (θt−1 )g(θtc |θt−1 )]
• Draw u ∼ Uniform(0, 1)
• If u < R, accept θt = θtc ; otherwise, θt = θt−1
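A minimal sketch of this loop for a one-dimensional unnormalized target f, using a symmetric Gaussian random-walk proposal (in which case the g terms in R cancel); the target and step size here are just illustrative assumptions:

import numpy as np

# an unnormalized target density (a made-up mixture of two Gaussians)
f = lambda x: np.exp(-0.5 * (x - 1) ** 2) + 0.5 * np.exp(-0.5 * ((x + 2) / 0.7) ** 2)

n, step = 50000, 1.0
theta = 0.0                 # arbitrary initial value
samples = np.empty(n)

for t in range(n):
    # symmetric proposal, so g(a|b) = g(b|a) and R reduces to f(candidate)/f(current)
    candidate = theta + np.random.normal(0, step)
    R = f(candidate) / f(theta)
    if np.random.uniform() < R:
        theta = candidate
    samples[t] = theta

samples = samples[1000:]    # discard the burn-in draws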
There are a few required properties of the Markov chain for this to work properly:
• The stationary distribution of the chain must be the target density:
– The chain must be recurrent - that is, for all θ ∈ Θ in the target density (the density we
wish to approximate), the probability of returning to any state θi ∈ Θ is 1. That is,
it must be possible eventually for any state in the state space to be reached.
– The chain must be non-null for all θ ∈ Θ in the target density; that is, the expected time
to recurrence is finite.
– The chain must have a stationary distribution equal to the target density.
• The chain must be ergodic, that is:
– The chain must be irreducible - that is, any state θi can be reached from any other state
θj in a finite number of transitions (i.e. the chain should not get stuck in any infinite
loops)
– The chain must be aperiodic - that is, there should not be a fixed number of transitions
to get from any state θi to any state θj . For instance, it should not always take three
steps to get from one place to another - that would be a period. Another way of putting
this - there are no fixed cycles in the chain.
It can be proven that the stationary distribution of the Metropolis-Hastings algorithm is the target
density (proof omitted).
The ergodic property (whether or not the chain “mixes” well) can be validated with some convergence
diagnostics. A common method is to plot the chain’s values as they’re drawn and see if the values tend
to concentrate around a constant; if not, you should try a different proposal density.
Alternatively, you can look at an autocorrelation plot, which measures the internal correlation (from
-1 to 1) over time, called “lag”. We expect that the greater the lag, the less the points should be
autocorrelated - that is, we expect autocorrelation to smoothly decrease to 0 with increasing lag. If
autocorrelation remains high, then the chain is not fully exploring the space. Autocorrelation can
be improved by thinning, which is a technique where only every kth draw is kept and others are
discarded.
Finally, you also have the option of running multiple chains, each with different starting values, and
combining those samples.
You should also use burn-in.
Gibbs Sampling
It is easy to sample from simple distributions. For example, for a binomial distribution, you can
basically just flip a coin. For a multinomial distribution, you can basically just roll a die.
If you have a multinomial, multivariate distribution, e.g. P (x1 , x2 , . . . , xn ), things get more complicated. If the variables are independent, you can factorize the multivariate distribution as a product of
univariate distributions, treating each as a univariate multinomial distribution, i.e. P (x1 , x2 , . . . , xn ) =
P (x1 ) × P (x2 ) × · · · × P (xn ). Then you can just sample from each distribution individually, i.e. as
a dice roll.
However - what if these aren’t independent, and we want to sample from the joint distribution
P (x1 , x2 , . . . , xn )? We can’t factorize it into simpler distributions like before.
With Gibbs sampling we can approximate this joint distribution under the condition that we can easily
sample from the conditional distribution for each variable, i.e. P (xi |x1 , . . . , xi−1 , xi+1 , . . . , xn ). (This
condition is satisfied on Bayesian networks.)
We take advantage of this and iteratively sample from these conditional distributions, using the
most recent value for each of the other variables (starting with random values at first). For example,
sampling x1 |x2 , . . . , xn , then fixing this value for x1 while sampling x2 |x1 , x3 , . . . , xn , then fixing both
x1 and x2 while sampling x3 |x1 , x2 , x4 , . . . , xn , and so on.
If you iterate through this a large number of times you get an approximation of samples taken from
the actual joint distribution.
Another way to look at Gibbs sampling:
CHAPTER 7. BAYESIAN STATISTICS
205
7.3. MARKOV CHAIN MONTE CARLO (MCMC)
206
Say you have random variables c, r, t (cloudy, raining, thundering) and you have the following probability tables:

c   P(c)
0   0.5
1   0.5

c   r   P(r|c)
0   0   0.9
0   1   0.1
1   0   0.1
1   1   0.9

c   r   t   P(t|c,r)
0   0   0   0.9
0   0   1   0.1
0   1   0   0.5
0   1   1   0.5
1   0   0   0.6
1   0   1   0.4
1   1   0   0.1
1   1   1   0.9
We can first pick some starting sample, e.g. c = 1, r = 0, t = 1.
Then we fix r = 0, t = 1 and randomly pick another c value according to the probabilities in the
table (here it is equally likely that we get c = 0 or c = 1). Say we get c = 0. Now we have a new
sample c = 0, r = 0, t = 1.
Now we fix c = 0, t = 1 and randomly pick another r value. Here r is dependent only on c.
c = 0 so we have a 0.9 probability of picking r = 0. Say that we do. We have another sample
c = 0, r = 0, t = 1, which happens to be the same as the previous sample.
Now we fix c = 0, r = 0 and pick a new t value. t is dependent on both c and r . c = 0, r = 0, so we
have a 0.9 chance of picking t = 0. Say that we do. Now we have another sample c = 0, r = 0, t = 0.
Then we repeat this process until convergence (or for some specified number of iterations).
Your samples will reflect the actual joint distribution of these values, since more likely samples are,
well, more likely to be generated.
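A minimal sketch of Gibbs sampling for this example. Note that, strictly, resampling a variable uses its full conditional, which folds in every table that mentions it, e.g. P (c|r, t) ∝ P (c)P (r|c)P (t|c, r); the walkthrough above simplifies that step:

import numpy as np

# the probability tables above, keyed by variable values
P_c = {0: 0.5, 1: 0.5}
P_r = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.9}              # P(r|c), keyed (c, r)
P_t = {(0, 0, 0): 0.9, (0, 0, 1): 0.1, (0, 1, 0): 0.5, (0, 1, 1): 0.5,
       (1, 0, 0): 0.6, (1, 0, 1): 0.4, (1, 1, 0): 0.1, (1, 1, 1): 0.9}  # P(t|c,r), keyed (c, r, t)

def sample(weights):
    # draw 0 or 1 in proportion to the unnormalized weights [w0, w1]
    w0, w1 = weights
    return int(np.random.uniform() < w1 / (w0 + w1))

c, r, t = 1, 0, 1   # arbitrary starting sample
samples = []
for _ in range(10000):
    # resample each variable from its conditional given the others' current values
    c = sample([P_c[cc] * P_r[(cc, r)] * P_t[(cc, r, t)] for cc in (0, 1)])
    r = sample([P_r[(c, rr)] * P_t[(c, rr, t)] for rr in (0, 1)])
    t = sample([P_t[(c, r, tt)] for tt in (0, 1)])
    samples.append((c, r, t))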
7.4
Variational Inference
MCMC can take a long time to get good answers - in theory if you run it infinitely it will generate
enough samples to get a perfectly accurate distribution, but that’s not a fair criterion (many algorithms
do well if they have infinite time).
With variational inference we don’t need to take samples - instead we fit an approximate joint
distribution Q(x; θ) to approximate the true joint posterior P (x), turning it into an optimization
problem (we try to get them as close as possible according to the KL divergence KL[Q(x; θ)||P (x)]).
So we are interested in the parameters θ.
The mean-field form of variational inference assumes that Q factorizes into independent single-variable
factors, i.e. Q(x) = ∏_i Qi (xi |θi ).
7.5
Bayesian point estimates
Bayesian inference returns a distribution (the posterior) but we often need a single value (or a vector
in multivariate cases). So we choose a value from the posterior. This value is a Bayesian point
estimate.
Selecting the MAP (maximum a posteriori) value is insufficient because it neglects the shape of the
distribution.
Suppose P (θ|X) is the posterior distribution of θ after observing data X.
The expected loss of choosing estimate θ̂ to estimate θ (the true parameter), also known as the risk
of estimate θ̂ is:
l(θ̂) = Eθ [L(θ, θ̂)]
Where L(θ, θ̂) is some loss function.
You can approximate the expected loss using the Law of Large Numbers, which just states that as
sample size grows, the sample average approaches the expected value. That is, as N grows, the
approximation below gets closer to the true expected loss.
For approximating expected loss, it looks like:

(1/N) ∑_{i=1..N} L(θi , θ̂) ≈ Eθ [L(θ, θ̂)] = l(θ̂)
You want to select the estimate θ̂ which minimizes this expected loss:
argmin_θ̂ Eθ [L(θ, θ̂)]
7.6
Credible Intervals (Credible Regions)
In Bayesian statistics, the closest analog to confidence intervals in frequentist statistics is the credible
interval. It is much easier to interpret than the confidence interval because it is exactly what most
people confuse the confidence interval to be. For instance, the 95% credible interval is the interval
in which we expect to find θ 95% of the time.
Mathematically this is expressed as:
P (a(y ) < θ < b(y )|Y = y ) = 0.95
We condition on Y because in Bayesian statistics, the data is fixed and the parameters are random.
7.7
Bayesian Regression
The Bayesian methodology can be applied to regression as well. In conventional regression the
parameters are treated as fixed values that we uncover. In Bayesian regression, the parameters are
treated as random variables, as they are elsewhere in Bayesian statistics. We define prior distributions
for each parameter - in particular, normal priors, so that for each parameter we define a prior mean
as well as a covariance matrix for all the parameters.
So we specify:
• b0 - a vector of prior means for the parameters
• B0 - a covariance matrix such that σ²B0 is the prior covariance matrix of β
• v0 > 0 - the degrees of freedom for the prior
• σ0² > 0 - the variance for the prior (which essentially functions as your strength of belief in the
prior - the lower the variance, the more concentrated your prior is around the mean, thus the
stronger your belief)
So the prior for your parameters then is a normal distribution parameterized by (b0 , B0 ).
Then v0 and σ02 give a prior for σ 2 , which is an inverse gamma distribution parameterized by (v0 , σ02 v0 ).
Then there are a few formulas:
b1 = (B0⁻¹ + X′X)⁻¹ (B0⁻¹ b0 + X′X β̂)
B1 = (B0⁻¹ + X′X)⁻¹
v1 = v0 + n
v1 σ1² = v0 σ0² + S + r
S = sum of squared errors of the regression
r = (b0 − β̂)′ (B0 + (X′X)⁻¹)⁻¹ (b0 − β̂)
f (β | σ², y, x) = Φ(b1 , σ²B1 )
f (σ² | y, x) = inv.gamma(v1 /2, v1 σ1²/2)
f (β | y, x) = ∫ f (β | σ², y, x) f (σ² | y, x) dσ² = t(b1 , σ1²B1 , degrees of freedom = v1 )
So the resulting distribution of parameters is a multivariate t distribution.
7.8
A Bayesian example
Let’s say we have a coin. We are uncertain whether or not it’s a fair coin. What can we learn about
the coin’s fairness from a Bayesian approach?
Let’s restate the problem. We can represent the outcome of a coin flip with a random variable,
X. If the coin is not fair, we expect to see heads 100% of the time. That is, if the coin is unfair,
P (X = heads) = 1. Otherwise, we expect it to be around P (X = heads) = 0.5.
It’s reasonable to assume that X is drawn from a binomial distribution, so we’ll use that. The binomial
distribution is parameterized by n, the number of trials, and p, the probability of a “success” (in this
case, a heads), on a given flip. We can restate our previous statements about the coin’s fairness in
terms of this parameter p. That is, if the coin is unfair, we expect p = 1, otherwise, we expect it to
be around p = 0.5.
Thus p is the unknown parameter we are interested in, and with the Bayesian approach, we consider
it a random variable as well; i.e. drawn from some distribution. First we must state what we believe
this distribution to be prior to any evidence (i.e. decide on a prior to use). Because p is a probability,
the beta distribution seems like a good choice since it is bound to [0, 1] like a probability. The beta
distribution has the additional advantage of being a conjugate prior, so the posterior is analytically
derived and requires no simulation.
The beta distribution is parameterized by α and β (i.e. they are our hyperparameters, Beta(α, β)).
Here we can choose values for α and β depending on how we choose to proceed. Let’s be conservative
and use an uninformative prior, that is, a uniform/flat prior, acting as if we don’t feel strongly about
the coin’s bias either way prior to flipping the coin. The beta distribution Beta(1, 1) is flat.
The posterior for a beta prior will not be derived here, but it is Beta(α + k, β + (n − k)), where k is
the number of successes (heads) in our evidence, and n is the total number of trials in our evidence.
Now we can flip the coin a few times to gather our evidence.
7.8. A BAYESIAN EXAMPLE
210
Below are some illustrations of possible evidence with the prior and the resulting posterior.
Some possible outcomes with a flat prior
A few things to note here:
• When the evidence has even amounts of tails and heads, the posterior centers around p = 0.5.
• When the evidence has even one tail, the possibility of p = 1 drops to nothing.
• When the evidence has no tails, the posterior places more weight on an unfair coin, but there
is still some possibility of p = 0.5. As the number of evidence increases, however, and still no
tails show up, the posterior will have even more weight pushed towards p = 1.
• When there is a lot of evidence containing even amounts of tails and heads, there is greater
confidence that p = 0.5 (that is, there’s smaller variance around it).
What if instead of a flat prior, we had assumed that the coin was fair to begin with? In this scenario,
the α and β values function like counts for heads and tails. So to assume a fair coin we could say,
α = 10, β = 10. If we have a really strong belief that it is a fair coin, we could say α = 100, β = 100.
The higher these values are, the stronger our belief.
Some possible outcomes with an informative prior
Since our prior belief is stronger than it was with a flat prior, the same amount of evidence doesn’t
change the prior belief as much. For instance, now if we see a streak of heads, we are less convinced
it is unfair.
In either case, we could take the expected value of p’s posterior distribution as our estimate for p,
and then use that as evidence for a fair or unfair coin.
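A small sketch of how prior strength plays out here, using a hypothetical streak of 8 heads in 8 flips and drawing from each resulting Beta posterior:

import numpy as np

heads, flips = 8, 8   # hypothetical evidence: a streak of heads, no tails

for a, b in [(1, 1), (10, 10), (100, 100)]:   # flat, fair-ish, and strongly-fair priors
    posterior = np.random.beta(a + heads, b + (flips - heads), size=100000)
    # the stronger the prior belief in fairness, the less this streak of heads
    # moves the posterior mean away from p = 0.5
    print((a, b), posterior.mean())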
7.8.1
References
• POLS 506: Simple Bayesian Models. Justin Esarey.
• POLS 506: Basic Monte Carlo Procedures and Sampling. Justin Esarey.
• POLS 506: Metropolis-Hastings, the Gibbs Sampler, and MCMC. Justin Esarey.
• Markov chain Monte Carlo (MCMC) introduction. mathematicalmonk.
• Markov Chain Monte Carlo Without all the Bullshit. Jeremy Kun.
• http://homepages.dcc.ufmg.br/~assuncao/pgm/aulas2014/mcmc-gibbs-intro.pdf
• Markov Chain Monte Carlo and Gibbs Sampling. B. Walsh.
• Computational Methods in Bayesian Analysis. Chris Fonnesbeck.
• Think Bayes. Version 1.0.6. Allen Downey.
• Computational Statistics II (code). Chris Fonnesbeck. SciPy 2015.
• Bayesian Statistical Analysis. Chris Fonnesbeck. SciPy 2014.
• Probabilistic Programming and Bayesian Methods for Hackers. Cam Davidson Pilon.
• Frequentism and Bayesianism V: Model Selection. Jake Vanderplas.
• Understanding Bayes: A Look at the Likelihood. Alex Etz.
• High-Level Explanation of Variational Inference. Jason Eisner.
• Bayesian Deep Learning. Thomas Wiecki.
• A Tutorial on Variational Bayesian Inference. Charles Fox, Stephen Roberts.
• Variational Inference. David M. Blei.
• Probabilistic Programming Data Science with PyMC3. Thomas Wiecki.
8
Graphs
A graph consists of vertices (nodes) connected by edges (arcs).
The two main categories of graphs are undirected graphs, in which edges don’t have any particular
direction, and directed graphs, where edges have direction - for example, there may be an edge
from node A to node B, but no edge from node B to node A.
A directed graph
Generally, we use the following notation for a graph G: n is the number of vertices, i.e. |V |, m is the
number of edges, i.e. |E|.
Within undirected graphs there are other distinctions:
• a simple graph is a graph in which there are no loops and only single edges are allowed
• a regular graph is one in which each vertex has the same number of neighbors (i.e. the same
degree)
• a complete graph is a simple graph where every pair of vertices is connected by an edge
• a connected graph is a graph where there exists a path between every pair of vertices
An undirected graph
For a connected graph with no parallel edges (i.e. each pair of vertices has only zero or one edge
between it), m is somewhere between Ω(n) and O(n^2).
Generally, a graph is said to be sparse if m is O(n) or close to it (that is, it has the lower end of
number of edges). If m is closer to O(n^2), this is generally said to be a dense graph.
An adjacency matrix requires Θ(n^2) space. If the graph is sparse, this is a waste of space, and an
adjacency list is more appropriate - you have an array of vertices and an array of edges. Each edge
points to its endpoints, and each vertex points to edges incident on it. This requires Θ(m + n)
space (because the array of vertices takes Θ(n) space and the arrays of edges, edge-to-endpoints,
and vertex-to-edges each take Θ(m), for Θ(n + 3m) = Θ(m + n)), so it is better for sparse graphs.
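As a small sketch, here are the two representations for one hypothetical undirected graph with 4 vertices and 5 edges:

# vertices 0..3, edges (0,1), (0,2), (1,2), (1,3), (2,3)

# adjacency matrix: Theta(n^2) space regardless of how many edges exist
adj_matrix = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]

# adjacency list: Theta(m + n) space, better for sparse graphs
adj_list = {
    0: [1, 2],
    1: [0, 2, 3],
    2: [0, 1, 3],
    3: [1, 2],
}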
Consider the accompanying example graph.
Example graph
A path A → B is the sequence of nodes connecting A to B, including A and B.
Here the path A → B is A, C, E, B.
A, C, E are the ancestors of B. C, E, B are the descendants of A.
A cycle is a directed path which ends where it starts. Here, A, D, C, B form a cycle.
Example of a cyclic graph
A loop is any path (directed or not) which ends where it starts.
A graph with no cycles is said to be acyclic.
A chord is any edge which connects two non-adjacent nodes in a loop.
Directed acyclic graphs (DAGs) are used often. In a DAG, parents are the nodes which point to
a given node; the nodes that a given node points to are its children. A family of a node is the node
and its parents.
The Markov blanket of a node is its parents, children, and the parents of its children (including
itself).
For an undirected graph, a node’s neighbors are nodes directly connected to it.
In an undirected graph, a clique is a fully-connected subset of nodes. All members of a clique are
neighbors.
A maximal clique is a clique which is not a subset of another clique.
A graph demonstrating cliques
In this graph:
• {A, B, C, D} is a maximal clique
• {B, C, E} is a maximal clique
• {A, B, C} is a non-maximal clique, contained in {A, B, C, D}
An undirected graph is connected if there is a path between every pair of nodes. That is, there is
no isolated subgraph.
For a non-connected graph, its connected components are its subgraphs which are connected.
A singly connected graph is a connected graph (directed or not) where there is only one path
between each pair of nodes. This is equivalent to a tree.
A non-singly connected graph is said to be multiply connected.
A spanning tree of an undirected graph is a subgraph that is a tree and includes every vertex; a maximum spanning tree is one where the sum of all edge weights is at least as large
as any other spanning tree’s.
Spanning tree example. The graph on the right is a spanning tree of the graph on the left.
Graphs can be represented as an adjacency matrix:

A =
[0 1 1 0]
[1 0 1 1]
[1 1 0 1]
[0 1 1 0]
Where a non-zero value at Aij indicates that node i is connected to node j.
A clique matrix represents the maximal cliques in a graph.
For example, this clique matrix describes the following maximal cliques:

C =
[1 0]
[1 1]
[1 1]
[0 1]

A maximal clique
A clique matrix containing only 2-node cliques is an incidence matrix:

C =
[1 1 0 0 0]
[1 0 1 1 0]
[0 1 1 0 1]
[0 0 0 1 1]

8.1
References
• Algorithms: Design and Analysis, Part 1. Tim Roughgarden. Stanford/Coursera.
• Bayesian Reasoning and Machine Learning. David Barber.
• Probabilistic Graphical Models. Daphne Koller. Stanford University/Coursera.
9
Probabilistic Graphical Models
The key tool for probabilistic inference is the joint probability table. Each row in a joint probability
table describes a combination of values for a set of random variables. That is, say you have n events
which have a binary outcome (T/F). A row would describe a unique configuration of these events,
e.g. if n = 4 then one row might be 0, 0, 0, 0 and another might be 1, 0, 0, 0 and so on. Consider
the simpler case of n = 2, with binary random variables X, Y :
X   Y   P(X,Y)
0   0   0.25
1   0   0.45
0   1   0.15
1   1   0.15
Using a joint probability table you can learn a lot about how those events are related probabilistically.
The problem is, however, that joint probability tables can get very big, which is another way of
saying that models (since joint probability tables are a representation of probabilistic models) can get
complex very quickly.
Typically, we have a set of random variables x1 , . . . , xn and we want to compute their probability for
certain states together; that is, the joint distribution P (x1 , . . . , xn ).
Even in the simple case where each random variable is binary, you would still have a distribution over
2^n states.
We can use probabilistic graphical models (PGMs) to reduce this space. Probabilistic graphical
models allow us to represent complex networks of interrelated and independent events efficiently and
with sparse parameters. All graphical models have some limitations in their ability to graphically
express conditional (in)dependence statements but are nevertheless very useful.
There are two main types of graphical models:
• Bayesian models: aka Bayesian networks, sometimes called Bayes nets or belief networks.
These use directed graphs and are used when there are causal relationships between the random
variables.
• Markov models: These use undirected graphs and are used when there are noncausal relationships between the random variables.
9.1
Factors
The concept of factors is important to PGMs.
A factor is a function ϕ(X1 , . . . , Xk ) which takes all possible combinations of outcomes (assignments)
for these random variables X1 , . . . , Xk and gives a real value for each combination.
The set of random variables {X1 , . . . , Xk } is called the scope of the factor.
A joint distribution is a factor which returns a number which is the probability of a given combination
of assignments.
An unnormalized measure is also a factor, e.g. P (I, D, g 1 ).
A conditional probability distribution (CPD) is also a factor, e.g. P (G|I, D).
A common operation on factors is a factor product. Say we have the factors ϕ1 (A, B) and ϕ2 (B, C).
Their factor product would yield a new factor ϕ3 (A, B, C). The result for a given combo ai , bj , ck is
just ϕ1 (ai , bj ) · ϕ2 (bj , ck ).
Another operation is factor marginalization. This is the same as marginalization for probability
distributions but generalized for all factors. For example, ϕ(A, B, C) → ϕ(A, B).
Another operation is factor reduction which is similarly is a generalization of probability distribution
reduction.
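A minimal sketch of these operations, representing a factor as a dict from assignments (tuples of values, in scope order) to real numbers; the factor values here are made up purely for illustration:

import itertools

vals = (0, 1)   # assume binary variables

phi1 = {(a, b): 0.5 + 0.1 * a + 0.2 * b for a, b in itertools.product(vals, repeat=2)}   # phi1(A, B)
phi2 = {(b, c): 1.0 - 0.3 * b * c for b, c in itertools.product(vals, repeat=2)}         # phi2(B, C)

# factor product: phi3(A, B, C) = phi1(A, B) * phi2(B, C)
phi3 = {(a, b, c): phi1[(a, b)] * phi2[(b, c)]
        for a, b, c in itertools.product(vals, repeat=3)}

# factor marginalization: sum C out of phi3, leaving a factor over (A, B)
phi_ab = {(a, b): sum(phi3[(a, b, c)] for c in vals)
          for a, b in itertools.product(vals, repeat=2)}

# factor reduction: fix C = 1 in phi3, leaving a factor over (A, B)
phi_reduced = {(a, b): phi3[(a, b, 1)] for a, b in itertools.product(vals, repeat=2)}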
9.2
Belief (Bayesian) Networks
Say we are looking at five events:
• a dog barking (D)
• a raccoon being present (R)
• a burglar being present (B)
• a trash can is heard knocked over (T)
• the police are called (P)
Belief Network
We can encode some assumptions about how these events are related in a belief net (also called a
Bayesian net):
Every node is dependent on its parent and nothing else that is not a descendant. To put it another
way: given its parent, a node is independent of all its non-descendants.
For instance, the event P is dependent on its parent D but not B or R or T because their causality
flows through D.
D depends on B and R because they are its parents, but not T because it is not a descendant or a
parent. But D may depend on P because it is a descendant.
We can then annotate the graph with probabilities:
Belief Network
The B and R nodes have no parents so they have singular probabilities.
The others depend on the outcome of their parents.
With the belief net, we only needed to specify 10 probabilities.
If we had just constructed a joint probability table, we would have had to specify 2^5 = 32 probabilities
(rows).
If we expand out the conditional probability of this system using the chain rule, it would look like:
P (p, d, b, t, r ) = P (p|d, b, t, r )P (d|b, t, r )P (b|t, r )P (t|r )P (r )
But we can bring in our belief net’s conditional independence assumptions to simplify this:
P (p, d, b, t, r ) = P (p|d)P (d|b, r )P (b)P (t|r )P (r )
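A minimal sketch of evaluating this factorized joint; the conditional probability values below are made up purely for illustration (the real ones would come from the annotated graph):

# hypothetical conditional probability tables for the network
P_b = 0.001                                                    # P(burglar)
P_r = 0.2                                                      # P(raccoon)
P_d = {(0, 0): 0.01, (0, 1): 0.7, (1, 0): 0.8, (1, 1): 0.9}    # P(dog barks | b, r)
P_t = {0: 0.01, 1: 0.5}                                        # P(trash can heard | r)
P_p = {0: 0.001, 1: 0.3}                                       # P(police called | d)

def bern(p1, x):
    # probability that a binary variable with P(x = 1) = p1 takes the value x
    return p1 if x == 1 else 1 - p1

def joint(p, d, b, t, r):
    # P(p, d, b, t, r) = P(p|d) P(d|b, r) P(b) P(t|r) P(r)
    return (bern(P_p[d], p) * bern(P_d[(b, r)], d) * bern(P_b, b)
            * bern(P_t[r], t) * bern(P_r, r))

print(joint(1, 1, 1, 1, 1))   # e.g. the probability that everything happens at once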
Belief networks are acyclical, that is, they cannot have any loops (a node cannot have a path back
to itself). In particular, they are a directed acyclic graph (DAG).
Two nodes (variables) in a Bayes net are on an active trail if a change in one node affects the other.
This includes cases where the two nodes have a causal relationship, an evidential relationship, or have
some common cause.
Formally, a belief network is a distribution of the form:
P (x1 , . . . , xD ) = ∏_{i=1..D} P (xi |pa(xi ))
where pa(xi ) are the parental variables of variable x (that is, x’s parents in the graph).
When you factorize a joint probability, you have a number of options for doing so.
For instance:
P (x1 , x2 , x3 ) = P (xi1 |xi2 , xi3 )P (xi2 |xi3 )P (xi3 )
where (i1 , i2 , i3 ) is any permutation of (1, 2, 3).
Without any conditional independence assumptions, all factorizations produce an equivalent DAG.
However, once you begin dropping edges (i.e. making conditional independence assumptions), the
graphs are not necessarily equivalent anymore.
Some of the graphs are equivalent; they can be converted amongst each other via Bayes’ rule. Others
cannot be bridged in this way, and thus are not equivalent.
Note that belief networks encode conditional independences but do not necessarily encode dependences.
For instance, the graph a → b appears to mean that a and b are dependent. But there may be an
instance of the belief network distribution such that p(b|a) = p(b); that is, a and b are independent.
So although the DAG may seem to imply dependence, there may be cases where it in fact does not.
In these cases, we call this implied dependence graphical dependence.
The following belief network triple represents the conditional independence of X and Y given Z, that
is P (X, Y |Z) = P (X|Z)P (Y |Z).
Represents conditional independence between X and Y given Z
The following belief network triple also represents the conditional independence of X and Y given Z,
in particular, P (X, Y |Z) ∝ P (Z|X)P (X)P (Y |Z).
Also represents conditional independence between X and Y given Z
The following belief network triple represents the graphical conditional dependence of X and Y , that
is P (X, Y |Z) ∝ P (Z|X, Y )P (X)P (Y ).
Represents graphical conditional dependence of X and Y
Here Z is a collider, since its neighbors are pointing to it.
Generally, if there is a path between X and Y which contains a collider, and this collider is not in
the conditioning set, nor are any of its descendants, we cannot induce dependence between X and
Y from this path. We say such a path is blocked.
Similarly, if there is a non-collider along the path which is in the conditioning set, we cannot induce
dependence between X and Y from this path - such a path is also said to be blocked.
If all paths between X and Y are blocked, we say they are d-separated.
However, if there are no colliders, or the colliders that are there are in the conditioning set or their
descendants, and no non-collider conditioning variables in the path, we say this path d-connects X
and Y and we say they are graphically dependent.
Note that colliders are relative to a path.
For example, in the accompanying figure, C is a collider for the path A − B − C − D but not for the
path A − B − C − E.
Collider example
Consider the belief network A → B ← C. Here A and C are (marginally) independent. However,
if we condition on B, i.e. P (A, C|B), then they become graphically dependent. That is, we
believe the root “causes” of A and C to be independent, but given B we learn something about both
the causes of A and C, which couples them, making them (graphically) dependent.
Note that the term “causes” is used loosely here; belief networks really only make independence
statements, not necessarily causal ones.
(TODO the below is another set of notes for bayes’ nets, incorporate these two)
Independence allows us to more compactly represent joint probability distributions, in that independent
random variables can be represented as smaller, separate probability distributions.
For example, if we have binary random variables A, B, C, D, we would have a joint probability table
of 2^4 entries. However, if we know that A, B is independent of C, D, then we only need two joint
probability tables of 2^2 entries each.
Typically, independent is too strong an assumption to make for real-world applications, but we can
often make the weaker, yet still useful assumption of conditional independence.
Conditional independence is when one variable makes another variable irrelevant (because the other
variable adds no additional information), i.e. P (A|B, C) = P (A|B); knowing C adds no more
information when we know B.
For example, if C causes B and B causes A, then knowledge of B already implies C, so knowing
about C is kind of useless for learning about A if we already know B.
As a more concrete example, given random variables traffic T , umbrella U, and raining R, we could
reasonably assume that U is conditionally independent of T given R, because rain is the common
cause of the two and there is no direct relationship between U and T ; the relationship is through R.
Similarly, given fire, smoke and an alarm, we could say that fire and alarm are conditionally independent given smoke.
As mentioned earlier, we can apply conditional independence to simplify joint distributions.
Take the traffic/umbrella/rain example from before. Their joint distribution is P (T, R, U), which we
can decompose using the chain rule:
P (T, R, U) = P (R)P (T |R)P (U|R, T )
If we make the conditional independence assumption from before (U and T are conditionally independent given R), then we can simplify this:
P (T, R, U) = P (R)P (T |R)P (U|R)
That is, we simplified P (U|R, T ) to P (U|R).
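To make this concrete, here is a minimal Python sketch of computing the joint from this factorization; the numbers for P(R), P(T|R), and P(U|R) are made up for illustration, since the text does not specify them.

```python
# Hypothetical CPTs (not from the text): rain, traffic-given-rain, umbrella-given-rain.
p_r = {True: 0.2, False: 0.8}            # P(R)
p_t_given_r = {True: 0.7, False: 0.1}    # P(T = 1 | R)
p_u_given_r = {True: 0.9, False: 0.05}   # P(U = 1 | R)

def joint(t, r, u):
    """P(T, R, U) = P(R) P(T|R) P(U|R), using the CI of U and T given R."""
    pt = p_t_given_r[r] if t else 1 - p_t_given_r[r]
    pu = p_u_given_r[r] if u else 1 - p_u_given_r[r]
    return p_r[r] * pt * pu

# Sanity check: the eight joint entries should sum to 1.
print(sum(joint(t, r, u) for t in (True, False)
                          for r in (True, False)
                          for u in (True, False)))
```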
We can describe complex joint distributions more simply with these conditional independence assumptions, and we can do so with Bayes’ nets (i.e. graphical models), which provide additional insight into
the structure of these distributions (in particular, how variables interact locally, and how these local
interactions propagate to more distant indirect interactions).
A Bayes’ net is a directed acyclic graph.
The nodes in the graph are the variables (with domains). They may be assigned (observed) or
unassigned (unobserved).
The arcs in the graph are interactions between variables (similar to constraints in CSPs). They
indicate "direct influence" between variables (note that this is not necessarily the same as causation;
it's about the information that observation of one variable gives about the other, which can reflect
causation, but not necessarily, e.g. it could simply be a hidden common underlying cause), which is
to say that they encode conditional independences.
For each node, we have a conditional distribution over the variable that node represents, conditioned
on its parents’ values.
Bayes’ nets implicitly encode joint distributions as a product of local conditional distributions:
P(x1, x2, . . . , xn) = ∏_{i=1}^{n} P(xi | parents(Xi))
This simply comes from the chain rule:
P(x1, x2, . . . , xn) = ∏_{i=1}^{n} P(xi | x1, . . . , x_{i−1})
And then applying conditional independence assumptions.
The graph must be acyclic so that we can come up with a consistent ordering when we apply the
chain rule (that is, decide the order for expanding the distributions). If the graph has cycles, we can’t
come up with a consistent ordering because we will have loops.
Note that arcs can be “reversed” (i.e. parent and children can be swapped) and encode the same
joint distribution - so joint distributions can be represented by multiple Bayes’ nets. But some Bayes’
nets are better representations than others - some will be easier to work with; in particular, if the
arcs do represent causality, the network will be easier to work with.
Bayes’ nets are much smaller than representing such joint distributions without conditional independence assumptions.
A joint distribution over N boolean variables takes 2n space (as demonstrated earlier).
A Bayes’ net, on the other hand, where the N nodes each have at most k parents, only requires size
O(N ∗ 2k+1 ).
The Bayes’ net also encodes additional conditional independence assumptions in its structure.
For example, the Bayes’ net X → Y → Z → W encodes the joint distribution:
P (X, Y, Z, W ) = P (X)P (Y |X)P (Z|Y )P (W |Z)
This structure implies other conditional independence assumptions, e.g. that Z is conditionally independent of X given Y , i.e. P (Z|Y ) = P (Z|X, Y ).
More generally we might ask: given two nodes, are they independent given certain evidence and the
structure of the graph (i.e. assignments of intermediary nodes)?
We can use the d-separation algorithm to answer this question.
First, we consider three configurations of triples as base cases, which we can use to deal with more
complex networks. That is, any Bayes’ net can be decomposed into these three triple configurations.
A simple configuration of nodes in the form of X → Y → Z is called a causal chain and encodes the
joint distribution P (x, y , z ) = P (x)P (y |x)P (z|y ).
X is not guaranteed to be (unconditionally) independent of Z.
However, is X guaranteed to be conditionally independent of Z given Y ?
From the definition of conditional probability, we know that:
P(z|x, y) = P(x, y, z) / P(x, y)
With the Bayes’ net, we can simplify this (the numerator comes from the joint distribution the graph
encodes, as demonstrated previously, and the denominator comes from applying the product rule):
P (z|x, y ) =
P (x)P (y |x)P (z|y )
P (x)P (y |x)
Then, canceling a few things out:
P (z|x, y ) = P (z|y )
So yes, X is guaranteed to be conditionally independent of Z given Y (i.e. once Y is observed). We
say that evidence along the chain “blocks” the influence.
Another configuration of nodes is a common cause configuration:
Common cause configuration
The encoded joint distribution is P (x, y , z) = P (y )P (x|y )P (z|y ).
Again, X is not guaranteed to be (unconditionally) independent of Z.
Is X guaranteed to be conditionally independent of Z given Y ?
Again, we start with the definition of conditional probability:
P(z|x, y) = P(x, y, z) / P(x, y)
Apply the product rule to the denominator and replace the numerator with the Bayes’ net’s joint
distribution:
P(z|x, y) = [P(y)P(x|y)P(z|y)] / [P(y)P(x|y)]
Yielding:
P (z|x, y ) = P (z|y )
So again, yes, X is guaranteed to be conditionally independent of Z given Y .
Another triple configuration is the common effect configuration (also called v-structures):
Common effect configuration
X and Y are (unconditionally) independent here.
However, is X guaranteed to be conditionally independent of Y given Z?
No - observing Z puts X and Y in competition as the explanation for Z (this is called causal
competition). That is, having observed Z, we think that X or Y was the cause, but not both, so
now they are dependent on each other (if one happened, the other didn’t, and vice versa).
Consider the following Bayes’ net:
Example Bayes’ Net
Where our random variables are rain R, dripping roof D, low pressure L, traffic T , baseball game B.
The relationships assumed here are: low pressure fronts cause rain, rain or a baseball game causes
traffic, and rain causes your friend’s roof to drip.
Given that you observe traffic, the probability that your friend's roof is dripping goes up - since
perhaps the traffic is caused by rain, which would cause the roof to drip. This relationship is encoded
in the graph by the path between T and D.
However, if we observe that it is raining, then observing traffic has no further effect on D. Intuitively,
this makes sense - we already know it's raining, so seeing traffic doesn't tell us more about
the roof dripping. In this sense, observing R "blocks" the path between T and D.
One exception here is the v-structure with R, B, T . Observing that a baseball game is happening
affects our belief about it raining only if we have observed T . Otherwise, they are independent. So
v-structures are “reversed” in some sense.
That is, we must observe T to activate the path between R and B.
Thus we make the distinction between active triples, in which information “flows” as it did with the
path between T and D and between R and B when T is observed, and inactive triples, in which
this information is “blocked”.
Active triples are chain and common cause configurations in which the central node is not observed and
common effect configurations in which the central node is observed, or common effect configurations
in which some child node of the central node is observed.
An example for the last case:
Triple example
If Z, A, B or C are observed, then the triple is active.
Inactive triples are chain and common cause configurations in which the central node is observed and
common effect configurations in which the central node is not observed.
So now, if we want to know if two nodes X and Y are conditionally independent given some evidence
variables {Z}, we check all undirected paths from X to Y and see if there are any active paths (by
checking all its constituent triples). If there are none, then they are conditionally independent, and
we say that they are d-separated. Otherwise, conditional independence is not guaranteed. This is
the d-separation algorithm.
You can apply d-separation to a Bayes net and get a complete list of conditional independences that
are necessarily true given certain evidence. This tells you the set of probability distributions that can
be represented.
9.2.1 Conditional independence assumptions

• Sally comes home and hears the alarm (A = 1)
• Has she been burgled? (B = 1)
• Or was the alarm triggered by an earthquake? (E = 1)
• She hears on the radio that there was an earthquake (R = 1)
We start with P (A, B, E, R) and apply the chain rule of probability:
P (A, B, E, R) = P (A|B, E, R)P (R|B, E)P (E|B)P (B)
Then we can make some conditional independence assumptions:
• The radio report has no effect on the alarm: P (A|B, E, R) → P (A|B, E)
• A burglary has no effect on the radio report: P (R|B, E) → P (R|E)
• A burglary would have no effect on the earthquake: P (E|B) → P (E)
Thus we have simplified the computation of the joint probability distribution:
P (A, B, E, R) = P (A|B, E)P (R|E)P (E)P (B)
We can also construct a belief network out of these conditional independence assumptions:
A simple belief network
Say we are given the following probabilities:
P (B = 1) = 0.01
P (E = 1) = 0.0000001
P (A = 1|B = 1, E = 1) = 0.9999
P (A = 1|B = 0, E = 1) = 0.99
P (A = 1|B = 1, E = 0) = 0.99
P (A = 1|B = 0, E = 0) = 0.0001
P (R = 1|E = 1) = 1
P (R = 1|E = 0) = 0
First consider if Sally has not yet heard the radio; that is, she has only heard the alarm (so the
only evidence she has is A = 1). Sally wants to know if she’s been burgled, so her question is
P (B = 1|A = 1):
P(B = 1|A = 1) = P(B = 1, A = 1) / P(A = 1)    (Bayes' rule)
= [Σ_{E,R} P(B = 1, A = 1, E, R)] / [Σ_{B,E,R} P(B, E, A = 1, R)]    (marginal prob to joint prob)
= [Σ_{E,R} P(A = 1|B = 1, E)P(B = 1)P(E)P(R|E)] / [Σ_{B,E,R} P(A = 1|B, E)P(B)P(E)P(R|E)]    (chain rule w/ our indep. assumps)
≈ 0.99
Now consider that Sally has also heard the report, i.e. R = 1. Now her question is P(B = 1|A = 1, R = 1):
P(B = 1|A = 1, R = 1) = P(B = 1, A = 1, R = 1) / P(A = 1, R = 1)    (Bayes' rule)
= [Σ_E P(B = 1, A = 1, R = 1, E)] / [Σ_{B,E} P(A = 1, R = 1, B, E)]    (marginal prob to joint prob)
= [Σ_E P(A = 1|B = 1, E)P(B = 1)P(E)P(R = 1|E)] / [Σ_{B,E} P(A = 1|B, E)P(B)P(E)P(R = 1|E)]    (chain rule w/ our indep. assumps)
≈ 0.01
So hearing the report and learning that there was an earthquake makes the burglary much less likely.
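As a check on these numbers, here is a small Python sketch that answers both queries by brute-force enumeration over the joint distribution, using the CPTs given above (enumeration is just one way to do this, and only practical for tiny networks).

```python
from itertools import product

# CPTs from the text (all variables are 0/1).
P_B = {1: 0.01, 0: 0.99}
P_E = {1: 0.0000001, 0: 1 - 0.0000001}
P_A1_BE = {(1, 1): 0.9999, (0, 1): 0.99, (1, 0): 0.99, (0, 0): 0.0001}  # P(A=1 | B, E)
P_R1_E = {1: 1.0, 0: 0.0}                                               # P(R=1 | E)

def joint(a, b, e, r):
    """P(A=a, B=b, E=e, R=r) = P(A|B,E) P(R|E) P(E) P(B)."""
    pa = P_A1_BE[(b, e)] if a == 1 else 1 - P_A1_BE[(b, e)]
    pr = P_R1_E[e] if r == 1 else 1 - P_R1_E[e]
    return pa * pr * P_E[e] * P_B[b]

def prob_burglary(evidence):
    """P(B=1 | evidence), where evidence maps variable names to 0/1 values."""
    num = den = 0.0
    for a, b, e, r in product((0, 1), repeat=4):
        assignment = {"A": a, "B": b, "E": e, "R": r}
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        p = joint(a, b, e, r)
        den += p
        if b == 1:
            num += p
    return num / den

print(prob_burglary({"A": 1}))            # ~0.99
print(prob_burglary({"A": 1, "R": 1}))    # ~0.01
```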
We may, however, only have soft or uncertain evidence.
For instance, say Sally is only 70% sure that she heard the alarm.
We denote our soft evidence of the alarm’s ringing as à = (0.7, 0.3), which is to say P (A = 1) = 0.7
and P (A = 0) = 0.3.
We’re ignoring the case with the report (R = 1) for simplicity, but with this uncertain evidence we
would calculate:
P(B = 1|Ã) = Σ_A P(B = 1|A)P(A|Ã) = 0.7 · P(B = 1|A = 1) + 0.3 · P(B = 1|A = 0)
Unreliable evidence is distinct from uncertain evidence.
Say we represent Sally’s uncertainty of hearing the alarm, as described before, as P (S|A) = 0.7.
Now say we feel that Sally is unreliable for other reasons (maybe she lies a lot). We
would then replace the term P (S|A) with our own interpretation P (H|A). For example, if Sally tells
us her alarm went off, maybe we think that means there's a 60% chance that the alarm actually went
off.
This new term P (H|A) is our virtual evidence, also called likelihood evidence.
9.2.2 Properties of belief networks
A note on the following graphics: the top part shows the belief network, where a faded node means
it has been marginalized out, and a filled node means it has been observed/conditioned on. The
bottom part shows the relationship between A and B after the marginalization/conditioning.
P (A, B, C) = P (C|A, B)P (A)P (B)
A and B are independent and determine C.
If we marginalize over C (thus "removing" it), A and B remain independent. That
is, P (A, B) = P (A)P (B).
If we instead condition on C, A and B become graphically dependent. Although A and B are a priori
independent, knowing something about C tells us a bit about A and B.
If we introduce D as a child to C, i.e. D is a descendant of a collider C, then conditioning on D also
makes A and B graphically dependent.
In this arrangement, C is the “cause” and A and B are independent effects: P (A, B, C) =
P (A|C)P (B|C)P (C).
Here, marginalizing over C makes A and B graphically dependent. In general, P (A, B) ̸= P (A)P (B)
because they share the same cause.
Conditioning on C makes A and B independent: P (A, B|C) = P (A|C)P (B|C). This is because if
you know the “cause” C then you know how the effects A and B occur independent of each other.
The same applies for this arrangement - here A “causes” C and C “causes” B. Conditioning on C
blocks A’s ability to influence B.
These graphs all encode the same conditional independence assumptions.
For both directed and undirected graphs, two graphs are Markov equivalent if they both represent
the same set of conditional independence statements.
9.2.3 Example
Consider a joint distribution over the following random variables:
• G, grade: g1 for A, g2 for B, g3 for C
• I, intelligence, binary: −i for low, +i for high
• D, difficulty of the course, binary: −d for easy, +d for hard
• S, SAT score, binary: −s for low, +s for high
• L, reference letter, binary: −l for not received, +l for received
We can encode some conditional independence assumptions about these random variables into a
belief net:
An example belief network for this scenario
• the grade depends on the student’s intelligence and difficulty of the course
• the student’s SAT score seems dependent on only their intelligence
• whether or not a student receives a recommendation letter depends on their grade
Note that we could add the assumption that intelligent students are likely to take more difficult
courses, if we felt strongly about it:
To turn this graph into a probability distribution, we can represent each node as a CPD:
Then we can apply the chain rule of Bayesian networks which just multiplies all the CPDs:
P (D, I, G, S, L) = P (D)P (I)P (G|I, D)P (S|I)P (L|G)
A Bayesian network (BN) is a directed acyclic graph where its nodes represent the random variables
X1, . . . , Xn. For each node Xi we have a CPD P(Xi | ParG(Xi)), where ParG(Xi) refers to the parents
of Xi in the graph G.
In whole, the BN represents a joint distribution via the chain rule for BNs:
An alternative belief network for this scenario
The belief network annotated with nodes’ distributions
P(X1, . . . , Xn) = ∏_i P(Xi | ParG(Xi))
We say a probability distribution P factorizes over a BN graph G if the BN chain rule holds for P.
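A minimal sketch of this chain rule for the student network, in Python; the CPD values below are illustrative stand-ins, since the actual numbers live in the figure.

```python
# Illustrative CPDs for the student network (made-up numbers).
P_D = {"easy": 0.6, "hard": 0.4}                                   # P(D)
P_I = {"low": 0.7, "high": 0.3}                                    # P(I)
P_G = {("low", "easy"): {"g1": 0.3, "g2": 0.4, "g3": 0.3},
       ("low", "hard"): {"g1": 0.05, "g2": 0.25, "g3": 0.7},
       ("high", "easy"): {"g1": 0.9, "g2": 0.08, "g3": 0.02},
       ("high", "hard"): {"g1": 0.5, "g2": 0.3, "g3": 0.2}}        # P(G | I, D)
P_S = {"low": {"low_s": 0.95, "high_s": 0.05},
       "high": {"low_s": 0.2, "high_s": 0.8}}                      # P(S | I)
P_L = {"g1": 0.9, "g2": 0.6, "g3": 0.01}                           # P(L = received | G)

def joint(d, i, g, s, l):
    """P(D, I, G, S, L) = P(D) P(I) P(G|I,D) P(S|I) P(L|G)."""
    pl = P_L[g] if l == "received" else 1 - P_L[g]
    return P_D[d] * P_I[i] * P_G[(i, d)][g] * P_S[i][s] * pl

print(joint("hard", "high", "g1", "high_s", "received"))
```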
There are three types of reasoning that occur with a BN:
• Causal reasoning includes conditioning on an ancestor to determine a descendant’s probability,
e.g. P (L = 1|I = 0).
• Evidential reasoning goes the other way: given a state for a descendant, get the probability
for an ancestor, e.g. P (I = 0|G = 3).
• Intercausal reasoning - consider P (I = 1|G = 3, D = 1). The D node is not directly
connected to the I node, yet conditioning on it does affect the probability.
As the simplest example of intercausal reasoning, consider an OR gate:
An OR gate as a belief network
Knowing Y and X1 (or X2 ) tells you the value of X2 (or X1 ) even though X1 and X2 are not directly
linked. Knowing Y alone does not tell you anything about X1 or X2 ’s values.
There are a few different structures in which a variable X can influence a variable Y , i.e. change
beliefs in Y when conditioned on X:
• X → Y
• X ← Y
• X → W → Y
• X ← W ← Y
• X ← W → Y
Which the different reasonings described above capture.
The one structure which “blocks” influence is X → W ← Y . That is, where two causes have a joint
effect. This is called a v-structure.
A trail is a sequence of nodes that are connected to each other by single edges in the graph. A trail
X1 − · · · − Xk is active (if there is no evidence) if it has no v-structures Xi−1 → Xi ← Xi+1 , where
Xi is the block.
When can a variable X influence a variable Y given evidence Z?
• X→Y
• X←Y
X may influence Y given evidence Z under certain conditions, depending on whether or not node W
is part of the evidence Z:
• X → W → Y, if W ∉ Z
• X ← W ← Y, if W ∉ Z
• X ← W → Y, if W ∉ Z
• X → W ← Y, if either W ∈ Z or one of W's descendants ∈ Z (intercausal reasoning)
A trail X1 − · · · − Xk is active given evidence Z if, for any v-structure Xi−1 → Xi ← Xi+1 we have
that Xi or one of its descendants is in Z and no other Xi (not in v-structures) is in Z.
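Here is a rough Python sketch of checking d-separation by enumerating undirected paths and applying the triple rules above; it assumes a small graph given as parent/child adjacency dictionaries, and the example at the end (rain, ballgame, traffic, dripping roof) is the network from the earlier discussion.

```python
def descendants(node, children):
    """All descendants of a node, by walking the child relation."""
    out, stack = set(), [node]
    while stack:
        for c in children.get(stack.pop(), []):
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def is_active_triple(a, b, c, parents, children, Z):
    collider = a in parents.get(b, set()) and c in parents.get(b, set())
    if collider:   # active iff b or one of its descendants is observed
        return b in Z or bool(descendants(b, children) & Z)
    return b not in Z   # chain / common cause: active iff b is unobserved

def d_separated(x, y, Z, parents, children):
    """True if every undirected path between x and y is blocked given Z."""
    children_sets = {k: set(v) for k, v in children.items()}

    def paths(cur, target, visited):
        if cur == target:
            yield visited
            return
        for n in parents.get(cur, set()) | children_sets.get(cur, set()):
            if n not in visited:
                yield from paths(n, target, visited + [n])

    for path in paths(x, y, [x]):
        if all(is_active_triple(path[i - 1], path[i], path[i + 1],
                                parents, children, Z)
               for i in range(1, len(path) - 1)):
            return False   # found an active trail
    return True

# Example graph: R -> T, B -> T, R -> D (rain, ballgame, traffic, dripping roof).
parents = {"T": {"R", "B"}, "D": {"R"}}
children = {"R": ["T", "D"], "B": ["T"]}
print(d_separated("T", "D", set(), parents, children))    # False: active chain via R
print(d_separated("T", "D", {"R"}, parents, children))    # True: observing R blocks it
print(d_separated("R", "B", set(), parents, children))    # True: collider T unobserved
print(d_separated("R", "B", {"T"}, parents, children))    # False: observing T activates it
```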
9.2.4 Independence
For events α, β, we say P satisfies the independence of α and β, notated P ⊨ α ⊥ β if:
• P (α, β) = P (α)P (β)
• P (α|β) = P (α)
• P (β|α) = P (β)
This can be generalized to random variables:
X, Y, P ⊨ X ⊥ Y if:
• P (X, Y ) = P (X)P (Y )
• P (X|Y ) = P (X)
• P (Y |X) = P (Y )
9.2.5 Conditional independence
For (sets of) random variables X, Y, Z, P ⊨ (X ⊥ Y |Z) if:
• P(X, Y |Z) = P(X|Z)P(Y |Z)
• P(X|Y, Z) = P(X|Z)
• P(Y |X, Z) = P(Y |Z)
• P(X, Y, Z) ∝ ϕ1(X, Z)ϕ2(Y, Z); that is, the probability of the joint distribution P(X, Y, Z) is proportional to a product of the two factors ϕ1(X, Z) and ϕ2(Y, Z)
For example:
There are two coins, one is fair and one is biased to show heads 90% of the time.
Coin toss example
You pick a coin, toss it, and it comes up heads.
The probability of heads is higher in the second toss. You don't know which coin you have, but heads
on the first toss makes it more likely that you have the biased coin, and thus a higher chance of heads on
the second toss. So X1 and X2 are not independent. But if you know which coin you have, the tosses
are then independent; the first toss doesn’t tell you anything about the second anymore. That is,
X1 ⊥ X2 |C.
But note that conditioning can also lose you independence. For example, using the previous student
example, I ⊥ D, but if we condition on grade G, they are no longer independent (this is the same as
the OR gate example).
Student example
We say that X and Y are d-separated in G given Z if there is no active trail in G between X and
Y given Z. This is notated d-sepG (X, Y |Z).
If P factorizes over G and d-sepG (X, Y |Z), then P satisfies (X ⊥ Y |Z).
Any node is d-separated from its non-descendants given its parents.
So if a distribution P factorizes over G, then in P , any variable is independent of its non-descendants
given its parents.
We can notate the set of independencies implicit in a graph G, that is, all of the independence
statements that correspond to d-separation statements in the graph G, as I(G):
I(G) = {(X ⊥ Y |Z)|d-sepG (X, Y |Z)}
If P satisfies I(G), then we say that G is an I-map (independency map) of P .
This does not mean G must imply all independencies in P , just that those that it does imply are in
fact present in P .
So if P factorizes over G, then G is an I-map for P. The converse also holds: if G is an I-map for
P, then P factorizes over G.
9.2.6 Template models
Within a model you may have structures which repeat throughout or you may want to reuse common
structures between/across models.
In these cases we may use template variables.
A template variable X(U1 , . . . , Uk ) is instantiated multiple times. U1 , . . . , Uk are the arguments.
A template model is a language which specifies how “ground” variables inherit dependency models
from templates.
9.2.7 Temporal models
A common example of template models are temporal models, used for systems which evolve over
time.
When representing a distribution over continuous time, you typically want to discretize time so that
it is not continuous. To do this, you pick a time granularity ∆.
We also have a set of template variables. X (t) describes an instance of a template variable X at time
t∆.
X^(t:t′) = {X^(t), . . . , X^(t′)}, where t ≤ t′

That is, X^(t:t′) denotes the set of random template variables that spans these time points.

We want to represent P(X^(t:t′)) for any t, t′.
To simplify this, we can use the Markov assumption, a type of conditional independence assumption.
Without this assumption, we have:
P(X^(0:T)) = P(X^(0)) ∏_{t=0}^{T−1} P(X^(t+1) | X^(0:t))
(this is just using the chain rule for probability)
Then the Markov assumption is (X (t+1) ⊥ X (0:t−1) |X (t) ). That is, any time point is independent of
the past, given the present.
So then we can simplify our distribution:
P(X^(0:T)) = P(X^(0)) ∏_{t=0}^{T−1} P(X^(t+1) | X^(t))
The Markov assumption isn’t always appropriate, or it may be too strong.
You can make it a better approximation by adding other variables about the state, in addition to
X (t) .
The second assumption we make is of time invariance.
We use a template probability model P (X ′ |X) where X ′ denotes the next time point and X denotes
the current time point. We assume that this model is replicated for every single time point.
That is, for all t:
P (X (t+1) |X (t) ) = P (X ′ |X)
That is, the probability distribution is not influenced by the time t.
Again, this is an approximation and is not always appropriate. Traffic, for example, has a different
dynamic depending on what time of day it is.
Again, you can include extra variables to capture other aspects of the state of the world to improve
the approximation.
Temporal model example (transition model)
Temporal model example
• W = weather
• V = velocity
• L = location
• F = failure
• O = observation
The left column of the graph is at time slice t, and the right side is at time slice t + 1.
The edges connecting the nodes at t to the nodes at t + 1, e.g. F → F′, are inter-time-slice edges,
and the edges connecting nodes at t + 1 to the observation, e.g. F′ → O′, are intra-time-slice edges.
We can describe a conditional probability distribution (CPD) for our prime variables as such:
P (W ′ , V ′ , L′ , F ′ , O′ |W, V, L, F )
We don’t need a CPD for the non-prime variables because they have already “happened”.
We can rewrite this distribution with the independence assumptions in the graph:
P (W ′ , V ′ , L′ , F ′ , O ′ |W, V, L, F ) = P (W ′ |W )P (V ′ |W, V )P (L′ |L, V )P (F ′ |F, W )P (O′ |L′ , F ′ )
Here the observation O′ is conditioned on variables in the same time slice (L′ , F ′ ) because we assume
the observation is “immediate”. This is a relation known as an intra-time-slice.
All the other variables are conditioned on the previous time slice, i.e. they are inter-time-slice relations.
Now we start with some initial state (time slice 0, t0 ):
Temporal model example initial state
Then we add on the next time slice, t1 :
And we can repeatedly do this to represent all subsequent time slices t2 , . . . , where each is conditioned
on the previous time slice.
So we have a 2-time-slice Bayesian network (2TBN). A transition model (2TBN) over X1 , . . . , Xn
is specified as a BN fragment such that:
• the nodes include X1′ , . . . , Xn′ (next time slice t + 1) and a subset of X1 , . . . , Xn (time slice t).
• only the nodes X1′ , . . . , Xn′ have parents and a CPD
Temporal model example at t1
The 2TBN defines a conditional distribution using the chain rule:
P(X′|X) = ∏_{i=1}^{n} P(X′i | Pa(X′i))
9.2.8 Markov Models
We can consider a Markov model as a chain-structured Bayes’ Net, so our reasoning there applies
here as well.
Each node is a state in the sequence and each node is identically distributed (stationary) and depends on the previous state, i.e. P (Xt |Xt−1 ) (except for the initial state P (X1 )). This is essentially just a conditional independence assumption (i.e. that P (Xt ) is conditionally independent of
Xt−2 , Xt−3 , . . . , X1 given Xt−1 ).
The parameters of a Markov model are the transition probabilities (or dynamics) and the initial
state probabilities (i.e. the initial distribution P(X1)).
Say we want to know P (X) at time t. A Markov model algorithm for solving this is the forward
algorithm, which is just an instance of variable elimination (in the order X1 , X2 , . . . ). A simplified
version:
P(x_t) = Σ_{x_{t−1}} P(x_t | x_{t−1}) P(x_{t−1})
Assuming P (x1 ) is known.
P (Xt ) converges as t → ∞, and it converges to the same values regardless of the initial state. This
converged distribution, independent of the initial state, is called the stationary distribution. The
influence of the initial state fades away as t → ∞.
The key insight for a stationary distribution is that P (Xt ) = P (Xt−1 ), and that this is independent
of the initial distribution.
Formally, the stationary distribution satisfies:
P_∞(X) = P_{∞+1}(X) = Σ_x P_{t+1|t}(X|x) P_∞(x)
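A small sketch of the mini-forward update converging to a stationary distribution, using an assumed two-state weather chain (the transition probabilities below are made up):

```python
# Hypothetical transition model P(X_t | X_{t-1}).
P_trans = {"sun": {"sun": 0.9, "rain": 0.1},
           "rain": {"sun": 0.3, "rain": 0.7}}

def step(p_prev):
    """P(x_t) = sum over x_{t-1} of P(x_t | x_{t-1}) P(x_{t-1})."""
    return {x: sum(P_trans[xp][x] * p_prev[xp] for xp in p_prev) for x in p_prev}

p = {"sun": 1.0, "rain": 0.0}   # initial distribution P(X_1)
for _ in range(50):
    p = step(p)
print(p)   # ~{'sun': 0.75, 'rain': 0.25}, regardless of the initial state
```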
9.2.9 Dynamic Bayes Networks (DBNs)
A dynamic Bayes’ net (DBN) is a Bayes’ net replicated through time, i.e. variables at time t can be
conditioned on those from time t − 1 (the structure is reminiscent of a recurrent neural network).
A dynamic Bayesian network over X1 , . . . , Xn is defined by a:
• 2TBN BN→ over X1 , . . . , Xn
• a Bayesian network BN(0) over X1(0) , . . . , Xn(0) (time 0, i.e. the initial state)
Ground network
For a trajectory over 0, . . . , T , we define a ground (unrolled network) such that:
• the dependency model for X1(0) , . . . , Xn(0) is copied from BN(0)
• the dependency model for X1(t) , . . . , Xn(t) for all t > 0 is copied from BN→
That is, it is just an aggregate (“unrolled”) of the previously shown network up to time slice tT .
Hidden Markov Models (HMMs)
Often we have a sequence of observations and
we want to use these observations to learn something about the underlying process that generated them. As such we need to introduce time
or space to our models.
Hidden Markov Model
A Hidden Markov Model (HMM) is a simple dynamic Bayes' net. In particular, it is a
Markov model in which we don't directly observe the state. That is, there is a Markov chain where
we don't see St but rather we see some evidence/observations/emissions/outputs/effects/etc. Ot.
The actual observations are stochastic (e.g. an underlying state may produce one of many observations
with some probability). We try to infer the state based on these observations.
For example, imagine we are in a windowless room and we want to know if it’s raining. We can’t
directly observe whether it’s raining, but we can see if people have brought umbrellas with them.
It is also a 2TBN.
HMMs are used to analyze or to predict time series involving noise or uncertainty.
There is a sequence of states s1 → s2 → s3 → · · · → sN. This sequence is a Markov chain (each
state depends only on the previous state).
Each state emits a measurement/observation, e.g. s1 emits z1 (s1 → z1), s2 emits z2 (s2 → z2),
and so on. We don't observe the states directly; we only observe these measurements (hence, the
underlying Markov model is "hidden").
Together, these define a Bayes network that is at the core of HMMs.
An HMM is defined by:
• a state variable S and an observation (sometimes called emission) variable O
• the initial distribution P(S0)
• the transition model P(S′|S)
• the observation model P(O|S) (the probability of seeing evidence given the hidden state, also called an emissions model)
We introduce an additional conditional independence assumption - that the current observation is
independent of everything else given the current state.
Basic HMM
You can unroll this:
Basic HMM, unrolled
HMMs, however, may also have internal structures, more commonly in the transition model, but
sometimes in the observation model as well.
TODO in the following X is switched with S, make it consistent
Example
Say we have the following HMM:
Hidden Markov Model Example
We don’t know the starting state, but we know the probabilities:
P(R0) = 1/2
P(S0) = 1/2
Say on the first day we see that this person is happy and we want to know whether or not it is raining.
That is:
P (R1 |H1 )
We can use Bayes’ rule to compute this posterior:
P(R1|H1) = P(H1|R1)P(R1) / P(H1)
We can compute these values by hand:
P (R1 ) = P (R1 |R0 )P (R0 ) + P (R1 |S0 )P (S0 )
P (H1 ) = P (H1 |R1 )P (R1 ) + P (H1 |S1 )P (S1 )
P (H1 |R1 ) can be pulled directly from the graph.
Then you can just run the numbers.
Inference base cases in an HMM
The first base case: consider the start of an HMM:
P (X1 ) → P (E1 |X1 )
Inferring P (X1 |e1 ), that is, P (X1 ) given we observe a piece of evidence e1 , is straightforward:
P(x1|e1) = P(x1, e1) / P(e1)
         = P(e1|x1)P(x1) / P(e1)
         ∝_{X1} P(e1|x1)P(x1)
That is, we applied the definition of conditional probability and then expanded the numerator with
the product rule.
For an HMM, P (E1 |X1 ) and P (X1 ) are specified, so we have the information needed to compute
this. We just compute P (e1 |X1 )P (X1 ) and normalize the resulting vector.
The second base case:
Say we want to infer P (X2 ), and we just have the HMM:
X1 → X2
That is, rather than observing evidence, time moves forward one step.
For an HMM, P (X1 ) and P (X2 |X1 ) are specified.
So we can compute P (X2 ) like so:
P(x2) = Σ_{x1} P(x2, x1)
      = Σ_{x1} P(x2|x1)P(x1)
From these two base cases we can do all that we need with HMMs.
Passage of time
Assume that we have the current belief P (X|evidence to date):
B(Xt ) = P (Xt |e1:t )
After one time step passes, we have:
P(X_{t+1}|e_{1:t}) = Σ_{x_t} P(X_{t+1}|x_t) P(x_t|e_{1:t})
Which can be written compactly as:
B′(X_{t+1}) = Σ_{x_t} P(X′|x_t) B(x_t)
Intuitively, what is happening here is: we look at each place we could have been, xt , consider how
likely it was that we were there to begin with, B(xt ), and multiply it by the probability of getting to
X ′ had you been there.
Observing evidence
Assume that we have the current belief P (X|previous evidence):
B ′ (Xt+1 ) = P (Xt+1 |e1:t )
Then:
P (Xt+1 |e1:t+1 ) ∝ P (et+1 |Xt+1 )P (Xt+1 |e1:t )
See the above base case for observing evidence - this is just that, and remember, renormalize afterwards.
Another way of putting this:
B(Xt+1 ) ∝ P (e|X)B ′ (Xt+1 )
The Forward Algorithm
Now we can consider the forward algorithm (the one presented previously was a simplification).
We are given evidence at each time and want to know:
Bt (X) = P (Xt |e1:t )
We can derive the following updates:
P(x_t|e_{1:t}) ∝_X P(x_t, e_{1:t})
= Σ_{x_{t−1}} P(x_{t−1}, x_t, e_{1:t})
= Σ_{x_{t−1}} P(x_{t−1}, e_{1:t−1}) P(x_t|x_{t−1}) P(e_t|x_t)
= P(e_t|x_t) Σ_{x_{t−1}} P(x_t|x_{t−1}) P(x_{t−1}, e_{1:t−1})
Which we can normalize at each step (if we want P (x|e) at each time step) or all together at the
end.
This is just variable elimination with the order X1 , X2 , . . . .
This computation is proportional to the square of the number of states.
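A sketch of this forward recursion in Python, on an assumed umbrella-style HMM (the probabilities below are illustrative, not from the text); it normalizes once at the end to recover P(X_t | e_{1:t}):

```python
# Hypothetical HMM parameters.
states = ["rain", "sun"]
P_init = {"rain": 0.5, "sun": 0.5}
P_trans = {"rain": {"rain": 0.7, "sun": 0.3},
           "sun": {"rain": 0.3, "sun": 0.7}}          # P(X_t | X_{t-1})
P_obs = {"rain": {"umbrella": 0.9, "none": 0.1},
         "sun": {"umbrella": 0.2, "none": 0.8}}       # P(e_t | X_t)

def forward(observations):
    """Belief P(X_t | e_{1:t}) after processing all observations."""
    # Base case: fold the first observation into P(X_1).
    f = {x: P_obs[x][observations[0]] * P_init[x] for x in states}
    for e in observations[1:]:
        # Passage of time, then observation, as in the derivation above.
        f = {x: P_obs[x][e] * sum(P_trans[xp][x] * f[xp] for xp in states)
             for x in states}
    z = sum(f.values())
    return {x: v / z for x, v in f.items()}   # normalize at the end

print(forward(["umbrella", "umbrella", "none"]))
```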
Most Likely Explanation
With Most Likely Explanation, the concern is not the state at time t, but the most likely sequence
of states that led to time t, given observations.
For MLE, we use an HMM and instead we want to know:
argmax_{x_{1:t}} P(x_{1:t}|e_{1:t})
We can use the Viterbi algorithm to solve this, which is essentially just the forward algorithm where
the Σ is changed to a max:

m_t[x_t] = max_{x_{1:t−1}} P(x_{1:t−1}, x_t, e_{1:t})
         = P(e_t|x_t) max_{x_{t−1}} P(x_t|x_{t−1}) m_{t−1}[x_{t−1}]
In contrast, the forward algorithm:
f_t[x_t] = P(x_t, e_{1:t})
         = P(e_t|x_t) Σ_{x_{t−1}} P(x_t|x_{t−1}) f_{t−1}[x_{t−1}]
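And a matching sketch of the Viterbi recursion on the same illustrative HMM, keeping backpointers so the argmax sequence can be recovered:

```python
# Same hypothetical HMM as in the forward-algorithm sketch.
states = ["rain", "sun"]
P_init = {"rain": 0.5, "sun": 0.5}
P_trans = {"rain": {"rain": 0.7, "sun": 0.3}, "sun": {"rain": 0.3, "sun": 0.7}}
P_obs = {"rain": {"umbrella": 0.9, "none": 0.1}, "sun": {"umbrella": 0.2, "none": 0.8}}

def viterbi(observations):
    """Most likely state sequence x_{1:T} given the observations e_{1:T}."""
    m = {x: P_obs[x][observations[0]] * P_init[x] for x in states}
    backpointers = []
    for e in observations[1:]:
        prev, m, bp = m, {}, {}
        for x in states:
            best = max(states, key=lambda xp: P_trans[xp][x] * prev[xp])
            bp[x] = best
            m[x] = P_obs[x][e] * P_trans[best][x] * prev[best]
        backpointers.append(bp)
    # Follow the backpointers from the best final state.
    seq = [max(states, key=lambda x: m[x])]
    for bp in reversed(backpointers):
        seq.append(bp[seq[-1]])
    return list(reversed(seq))

print(viterbi(["umbrella", "umbrella", "none"]))  # ['rain', 'rain', 'sun'] with these numbers
```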
9.2.10 Plate models
A common template model is a plate model.
Say we are repeatedly flipping a coin.
The surrounding box is the plate. The idea is that these are “stacked”, one for each toss t. That is,
they are indexed by t.
The θ node denotes the CPD parameters. This is outside the plate, i.e. it is not indexed by t.
Simple plate model
Simple plate model, alternative representation
Another way of visualizing this:
Where o(ti ) is the outcome at time ti . This representation makes it more obvious that each of these
plates is a copy of a template.
Another example:
A plate model for students
Plates may be nested:
If we were to draw this out for two courses and two students:
One oddity here is that now intelligence depends on both the student s and the course c, whereas
before it depended only on the student s. Maybe this is desired, but let's say we want what we had
before. That is, we want intelligence to be independent of the course c.
Instead, we can use overlapping plates:
Plate models allow for collective inference, i.e. they allow us to look at the aggregate of these
individual instances in order to find broader patterns.
More formally, a plate dependency model:
For a template variable A(U1 , . . . , Uk ) we have template parents B1 (U1 ), . . . , Bm (Um ); that is, an
index cannot appear in the parent which does not appear in the child. This is a particular limitation
of plate models.
We get the following template CPD: P (A|B1 , . . . , Bm ).
9.2.11 Structured CPDs
We can represent CPDs in tables, e.g.
A nested plate model for students
Unrolled student plate model
An overlapping plate model for students
          g1   g2   g3
i0, d0    ·    ·    ·
i0, d1    ·    ·    ·
i1, d0    ·    ·    ·
i1, d1    ·    ·    ·

(the numeric entries of this CPD table are given in the accompanying figure)
But as we start to have more variables, this table can explode in size.
More generally, we can just represent a CPD P (X|Y1 , . . . , Yk ), which specifies a distribution over
X for each assignment Y1 , . . . , Yk using any function which specifies a factor ϕ(X, Y1 , . . . , Yk ) such
that:
Σ_x ϕ(x, y1, . . . , yk) = 1
for all y1 , . . . , yk .
There are many models for representing CPDs, including:
• deterministic CPDs
• tree-structured CPDs
• logistic CPDs and generalizations
• noisy OR/AND
• linear Gaussians and generalizations
Context-specific independence shows up in some CPD representations. It is a type of independence
where, for a particular assignment c of some set of variables C, we have P ⊨ (X ⊥c Y |Z, c).
Which is to say this independence holds only for particular values of c, rather than all values of c.
For example, consider:
• X ⊥ Y1 | y2^0. When Y2 is false, X just takes on the value of Y1, so there's no context-specific independence here.
• X ⊥ Y1 | y2^1. When Y2 is true, then it doesn't matter what value Y1 takes, since X will be true too. Thus we have context-specific independence.
• Y1 ⊥ Y2 | x^0. If we know X is false, we already know Y1, Y2 are false, independent of each other. So we have context-specific independence here.
• Y1 ⊥ Y2 | x^1. We don't have context-specific independence here.
Tree-structured CPDs
Say we have the following model:
Simple model
That is, whether or not a student gets a job J depends on:
• A - if they applied (+a, −a)
• L - if they have a letter of recommendation (+l, −l)
• S - if they scored well on the SAT (+s, −s)
We can represent the CPD as a tree structure.
Note that the notation at the leaf nodes is the probability of not getting the job and of getting it, i.e.
(P(−j), P(+j)).
A bit more detail: we’re assuming its possible that the student gets the job without applying, e.g. via
a recruiter, in which case the SAT score and letter aren’t important.
We also assume that if the student scored well on the SAT, the letter is unimportant.
We have three binary random variables. If we represented this CPD as a table, it would have 2^3 = 8
conditional probability distributions. However, in certain contexts we only need 4 distributions since
we have some context-specific independences:
• J ⊥c L| + a, +s
• J ⊥c L, S| − a
• J ⊥c L| + s, A
This last one is just a compact representation of:
Tree-structured CPD
• J ⊥c L| + s, +a
• J ⊥c L| + s, −a
Consider another model:
Another model
Where the student chooses only one letter to submit.
The tree might look like:
Here the choice variable C determines the dependence of one set of circumstances on another set of
circumstances.
This scenario has context-specific independence but also non-context-specific independence:
L1 ⊥ L2 |J, C
Because, if you break it down into its individual cases:
Tree-structured CPD
• L1 ⊥c L2 |J, c1
• L1 ⊥c L2 |J, c2
both are true.
This scenario relates to a class of CPDs called multiplexer CPDs:
Multiplexer CPD
Y has two lines around it to indicate deterministic dependence.
Here we have some variables Z1 , . . . , Zk and A is a copy of one of these variables.
A is the multiplexer, i.e. the “selector variable”, taking a value from {1, . . . , k}.
For a multiplexer CPD, we have:

P(Y | A, Z1, . . . , Zk) =
    1 if Y = Za
    0 otherwise
That is, the value of A just determines which Z value Y takes on.
Noisy OR CPD
Noisy OR CPDs
In a noisy OR CPD we introduce intermediary variables between xi and Y. Each intermediary
variable zi takes on 1 if its parent value satisfies its criterion. Y becomes an OR variable which is true
if any of the z variables are true.
That is:

P(zi = 1|xi) =
    0   if xi = 0
    λi  if xi = 1
Where λi ∈ [0, 1].
So if xi = 0, zi never gets turned on. If xi = 1, zi gets turned on with probability λi.
z0 is a “leak” probability which is the probability that Y gets turned on by itself. P (z0 = 1) = λ0 .
We can write this as a probability and consider the CPD of Y = 0 given our x variables. That is,
what is the probability that all the x variables fail to turn on their corresponding z variables?
P(Y = 0|X1, . . . , Xk) = (1 − λ0) ∏_{i: xi=1} (1 − λi)
where (1 − λ0 ) is the probability that Y doesn’t get turned on by the leak.
Thus:
P (Y = 1|X1 , . . . , Xk ) = 1 − P (Y = 0|X1 , . . . , Xk )
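A tiny sketch of the noisy OR CPD in Python; the leak probability and per-cause parameters below are made-up examples:

```python
def noisy_or(x, lambdas, leak=0.0):
    """P(Y = 1 | x_1, ..., x_k) for a noisy OR with per-cause strengths lambdas."""
    p_y0 = 1 - leak                     # probability Y is not turned on by the leak
    for xi, lam in zip(x, lambdas):
        if xi == 1:
            p_y0 *= 1 - lam             # each active cause fails with prob (1 - lambda_i)
    return 1 - p_y0

# Two possible causes with strengths 0.8 and 0.6, and a 0.05 leak (illustrative).
print(noisy_or([1, 0], [0.8, 0.6], leak=0.05))   # 1 - 0.95*0.2  = 0.81
print(noisy_or([1, 1], [0.8, 0.6], leak=0.05))   # 1 - 0.95*0.2*0.4 = 0.924
```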
A noisy OR CPD demonstrates independence of causal influence. We are assuming that we have
a bunch of causes x1, . . . , xk for a variable Y, which each act independently to affect the truth of
Y. That is, there is no interaction between the causes.
Other CPDs for independence of causal influence include noisy AND, noisy MAX, etc.
Continuous variables
Consider:
Continuous variable example (simple)
We have the temperature in a room and a sensor which measures the temperature.
The sensor is not perfect, so it usually reads around the right temperature, but not exactly.
We can represent this by saying the sensor reading S is a normal distribution around the true temperature T with some standard deviation, i.e.:
S ∼ N(T ; σ_S^2)
This model is a linear Gaussian.
We can make it more complex, assuming that the outside temperature will also affect the room
temperature:
Continuous variable example (more complex)
Where T ′ is the temperature in a few moments and O is the outside temperature. We may say that
T ′ is also a linear Gaussian:
T′ ∼ N(αT + (1 − α)O; σ_T^2)
The αT + (1 − α)O term is just a mixture of the current temperature and the outside temperature.
We can take it another step. Say there is a door D in the room which is either opened or closed
(i.e. it is a binary random variable).
Now T ′ is described as:
Continuous variable example (even more complex)
T′ ∼ N(α0·T + (1 − α0)·O; σ_{0T}^2)  if D = 0
T′ ∼ N(α1·T + (1 − α1)·O; σ_{1T}^2)  if D = 1
This is a conditional linear Gaussian model since its parameters are conditioned on the discrete variable
D.
Generally, a linear Gaussian model looks like:
Basic linear Gaussian
Y ∼ N(w0 + Σ_i wi·Xi; σ^2)

Where w0 + Σ_i wi·Xi is the mean (a linear function of the parents) and σ^2 does not depend on the parents.
Then, conditional linear Gaussians introduce one or more discrete parents (only one, A, is depicted
below), and this is just a linear gaussian whose parameters depend on the value of A:
Y ∼ N(w_{a0} + Σ_i w_{ai}·Xi; σ_a^2)
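A small sketch of sampling from these (conditional) linear Gaussian CPDs; the weights, variances, and door-dependent mixing coefficients are illustrative assumptions:

```python
import random

def sample_temp_next(T, O, door_open):
    """T' ~ N(alpha*T + (1 - alpha)*O ; sigma^2), parameters chosen by the discrete parent D."""
    if door_open:
        alpha, sigma = 0.4, 1.5    # door open: outside temperature matters more (assumed values)
    else:
        alpha, sigma = 0.9, 0.5
    mean = alpha * T + (1 - alpha) * O
    return random.gauss(mean, sigma)

def sample_sensor(T, sigma_s=0.3):
    """S ~ N(T ; sigma_S^2): a noisy reading of the true temperature."""
    return random.gauss(T, sigma_s)

T_next = sample_temp_next(T=21.0, O=10.0, door_open=True)
print(T_next, sample_sensor(T_next))
```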
9.2.12 Querying Bayes' nets
Conditional probability queries
PGMs can be used to answer many queries, but the most common is probably conditional probability
queries:
Given evidence e about some variables E, we have a query which is a subset of variables Y , and our
task is to compute P (Y |E = e).
Unfortunately, the problem of inference on graphical models is NP-Hard. In particular, the following
are NP-Hard:
• exact inference
• given a PGM PΦ, a variable X and a value x ∈ Val(X), compute PΦ(X = x)
• even just deciding if PΦ(X = x) > 0 is NP-hard
• approximate inference
• let ϵ < 0.5. Given a PGM PΦ, a variable X, a value x ∈ Val(X), and an observation e ∈ Val(E), find a number p that has |PΦ(X = x|E = e) − p| < ϵ
However, NP-Hard is the worst-case result, and there are algorithms that perform well for most common
cases.
Some conditional probability inference algorithms:
• variable elimination
• message passing over a graph
• belief propagation
• variational approximations
• random sampling instantiations
• Markov Chain Monte Carlo (MCMC)
• importance sampling
MAP (maximum a posteriori) queries
PGMs can also answer MAP queries:
We have a set of evidence E = e, the query is all other variables Y , i.e. Y = {X1 , . . . , Xn } − E.
Our task is to compute MAP(Y |E = e) = argmaxy P (Y = y |E = e). There may be more than one
possible solution.
This is also an NP-hard problem, but there are likewise many algorithms that solve it efficiently for most
cases.
Some MAP inference algorithms:
• variable elimination
• message passing over a graph
• max-product belief propagation
• using methods for integer programming
• for some networks, graph-cut methods
• combinatorial search
9.2.13 Inference in Bayes' nets
Given a query, i.e. a joint probability distribution we are interested in getting a value for, we can infer
an answer for that query from a Bayes’ net.
The simplest approach is inference by enumeration in which we extract the conditional probabilities
from the Bayes’ net and appropriately combine them together.
But this is very inefficient, especially because variables that aren’t in the query require us to enumerate
over all possible values for them. We lose most of the benefit of having this compact representation
of joint distributions.
An alternative approach is variable elimination, which is still NP-hard, but faster than enumeration.
Variable elimination requires the notion of factors. Here are some factors:
• a joint distribution: P (X, Y ), which is just all entries P (x, y ) for all x, y and sums to 1.
Example:
P(T, W)

T     W     P
hot   sun   0.4
hot   rain  0.1
cold  sun   0.2
cold  rain  0.3
• a selected joint: P (x, Y ), i.e. we fix X = x, then look at all entries P (x, y ) for all y , and sums
to P (x). This is a “slice” of the joint distribution.
Example:
P (cold, W )
T     W     P
cold  sun   0.2
cold  rain  0.3
• a single conditional: P (Y |x), i.e. we fix X = x, then look at all entries P (y |x) for all y , and
sums to 1.
Example:
P (W |cold)
T     W     P
cold  sun   0.4
cold  rain  0.6
• a family of conditionals: P(X|Y), i.e. we have multiple conditionals, all entries P(x|y) for all x, y, and it sums to |Y|.
Example:
P (W |T )
T     W     P
hot   sun   0.8
hot   rain  0.2
cold  sun   0.4
cold  rain  0.6
• a specified family: P (y |X), i.e. we fix y and look at all entries P (y |x) for all x. Can sum to
anything;
Example:
P (rain|T )
T     W     P
hot   rain  0.2
cold  rain  0.6
In general, when we write P (Y1 , . . . , YN |X1 , . . . , XM ), we have a factor, i.e. a multi-dimensional array
for which the values are all instantiations P (y1 , . . . , yN |x1 , . . . , xM ).
Any assigned/instantiated X or Y is a dimension missing (selected) from the array, which leads to
smaller factors - when we fix values, we don't have to consider every possible instantiation of that
variable anymore, so we have fewer possible combinations of variable values to consider.
For example, if X and Y are both binary random variables, if we don’t fix either of them we have
four to consider ((X = 0, Y = 0), (X = 1, Y = 0), (X = 0, Y = 1), (X = 1, Y = 1)) . If we fix,
say X = 1, then we only have two to consider ((X = 1, Y = 0), (X = 1, Y = 1)).
Consider a simple Bayes’ net:
R→T →L
Where R is whether or not it is raining, T is whether or not there is traffic, and L is whether or not
we are late for class.
We are given the following factors for this Bayes’ net:
P(R)

R    P
+r   0.1
-r   0.9
P(T|R)

R    T    P
+r   +t   0.8
+r   -t   0.2
-r   +t   0.1
-r   -t   0.9
P (L|T )
T    L    P
+t   +l   0.3
+t   -l   0.7
-t   +l   0.1
-t   -l   0.9
For example, if we observe L = +l, we can fix that value and shrink the last factor P(L|T):
P(+l|T)

T    L    P
+t   +l   0.3
-t   +l   0.1
We can join factors, which gives us a new factor over the union of the variables involved.
For example, we can join on R, which involves picking all factors involving R, i.e. P (R) and P (T |R),
giving us P (R, T ). The join is accomplished by computing the entry-wise products, e.g. for each r, t,
compute P (r, t) = P (r )P (t|r ):
P(R, T)

R    T    P
+r   +t   0.08
+r   -t   0.02
-r   +t   0.09
-r   -t   0.81
After completing this join, the resulting factor P (R, T ) replaces P (R) and P (T |R), so our Bayes’
net is now:
(R, T ) → L
We can then join on T , which involves P (L|T ) and P (R, T ), giving us P (R, T, L):
P(R, T, L)

R    T    L    P
+r   +t   +l   0.024
+r   +t   -l   0.056
+r   -t   +l   0.002
+r   -t   -l   0.018
-r   +t   +l   0.027
-r   +t   -l   0.063
-r   -t   +l   0.081
-r   -t   -l   0.729
Now we have this joint distribution, and we can use the marginalization operation (also called
elimination) on this factor - that is, we can sum out a variable to shrink the factor. We can only
do this if the variable appears in only one factor.
For example, say we still had our factor P (R, T ) and we wanted to get P (T ). We can do so by
summing out R:
P(T)

T    P
+t   0.17
-t   0.83
So we can take our full joint distribution P (R, T, L) and get P (T, L) by elimination (in particular,
by summing out R):
P(T, L)

T    L    P
+t   +l   0.051
+t   -l   0.119
-t   +l   0.083
-t   -l   0.747
Then we can further sum out T to get P (L):
P(L)

L    P
+l   0.134
-l   0.866
This approach is equivalent to inference by enumeration (building up the full joint distribution, then
taking it apart to get to the desired quantity).
However, we can use these operations (join and elimination) to find “shortcuts” to the desired
quantity (i.e. marginalize early without needing to build the entire joint distribution first). This
method is variable elimination.
For example, we can compute P(L) by a shorter route:
• join on R, as before, to get P(R, T)
• then eliminate (sum out) R from P(R, T) to get P(T)
• then join on T, i.e. with P(T) and P(L|T), giving us P(T, L)
• then eliminate T, giving us P(L)
In contrast, the enumeration method required:
• join on R to get P(R, T)
• join on T to get P(R, T, L)
• eliminate R to get P(T)
• eliminate T to get P(L)
The advantage of variable elimination is that we never build a factor of more than two variables
(i.e. the full joint distribution P (R, T, L)), thus saving time and space. The largest factor typically
has the greatest influence over the computation complexity.
In this case, we had no evidence (i.e. no fixed values) to work with. If we had evidence, we would
first shrink the factors involving the observed variable, and the evidence would be retained in the final
factor (since we can’t sum it out once it’s observed).
For example, say we observed R = +r .
We would take our initial factors and shrink those involving R:
P(+r)

R    P
+r   0.1

P(T| + r)

R    T    P
+r   +t   0.8
+r   -t   0.2
And we would eventually end up with:
P(+r, L)

R    L    P
+r   +l   0.026
+r   -l   0.074
And then we could get P (L| + r ) by normalizing P (+r, L):
P(L| + r)

L    P
+l   0.26
-l   0.74
More concretely, the general variable elimination algorithm is such:
• start with a query P(Q|E1 = e1, . . . , Ek = ek), where Q are your query variables
• start with initial factors (i.e. local conditional probability tables instantiated by the evidence E1, . . . , Ek, i.e. shrink factors involving the evidence)
• while there are still hidden variables (i.e. those in the net that are not Q or any of the evidence E1, . . . , Ek):
  – pick a hidden variable H
  – join all factors mentioning H
  – eliminate (sum out) H
• then join all remaining factors and normalize. The resulting distribution will be P(Q|e1, . . . , ek). (A code sketch of this procedure is given below.)
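Here is a minimal sketch of that procedure on the R → T → L network, using the CPTs given earlier; the factor representation (binary +/− variables, dictionaries keyed by assignments) is just one simple way to implement join and sum-out:

```python
from itertools import product

def make_factor(variables, table):
    """A factor is a pair (variables, table mapping full assignments to numbers)."""
    return (tuple(variables), dict(table))

def join(f1, f2):
    v1, t1 = f1
    v2, t2 = f2
    vs = tuple(dict.fromkeys(v1 + v2))   # union of scopes, order-preserving
    table = {}
    for assign in product(("+", "-"), repeat=len(vs)):   # assumes binary +/- variables
        a = dict(zip(vs, assign))
        table[assign] = t1[tuple(a[v] for v in v1)] * t2[tuple(a[v] for v in v2)]
    return (vs, table)

def sum_out(f, var):
    vs, t = f
    keep = tuple(v for v in vs if v != var)
    out = {}
    for assign, p in t.items():
        key = tuple(val for v, val in zip(vs, assign) if v != var)
        out[key] = out.get(key, 0.0) + p
    return (keep, out)

# Factors from the CPTs given earlier.
P_R = make_factor(["R"], {("+",): 0.1, ("-",): 0.9})
P_T_R = make_factor(["R", "T"], {("+", "+"): 0.8, ("+", "-"): 0.2,
                                 ("-", "+"): 0.1, ("-", "-"): 0.9})
P_L_T = make_factor(["T", "L"], {("+", "+"): 0.3, ("+", "-"): 0.7,
                                 ("-", "+"): 0.1, ("-", "-"): 0.9})

# Eliminate R, then T, to get P(L).
f = sum_out(join(P_R, P_T_R), "R")    # P(T)
f = sum_out(join(f, P_L_T), "T")      # P(L)
print(f)   # {('+',): 0.134, ('-',): 0.866}
```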
The order in which you eliminate variables affects computational complexity in that some orderings
generate larger factors than others. Again, the factor size is what influences complexity, so you want
to use orderings that produce small factors.
For example, if a variable is mentioned in many factors, you generally want to avoid computing that
until later on (usually last). This is because a variable mentioned in many factors means joining over
many factors, which will probably produce a very large factor.
We can encode this in the algorithm by telling it to choose the next hidden variable that would
produce the smallest factor (since factor sizes are relatively easy to compute without needing to
actually produce the factor, just look at the number and sizes of tables that would have to be
joined).
Unfortunately there isn’t always an ordering with small factors, so variable elimination is great in
many situations, but not all.
Sampling
Another method for Bayes’ net inference is sampling. This is an approximate inference method, but
it can be much faster. Here, “sampling” essentially means “repeated simulation”.
The basic idea:
• draw N samples from a sampling distribution S
• compute an approximate posterior probability
• with enough samples, this converges to the true probability P
Sampling from a given distribution:
1. Get sample u from a uniform distribution over [0, 1]
2. Convert this sample u into an outcome for the given distribution by having each outcome
associated with a sub-interval of [0, 1) with sub-interval size equal to the probability of the
outcome
For example, if we have the following distribution:
C      P(C)
red    0.6
green  0.1
blue   0.3
Then we can map u to C in this way:

c =
    red    if 0 ≤ u < 0.6
    green  if 0.6 ≤ u < 0.7
    blue   if 0.7 ≤ u < 1
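A short sketch of this sub-interval trick in Python, using the red/green/blue distribution above:

```python
import random

def sample(dist):
    """dist: mapping outcome -> probability. Returns one sampled outcome."""
    u = random.random()                 # uniform sample in [0, 1)
    cumulative = 0.0
    for outcome, p in dist.items():     # assign each outcome a sub-interval of size p
        cumulative += p
        if u < cumulative:
            return outcome
    return outcome                      # guard against floating-point round-off

dist = {"red": 0.6, "green": 0.1, "blue": 0.3}
counts = {c: 0 for c in dist}
for _ in range(10000):
    counts[sample(dist)] += 1
print(counts)   # roughly proportional to 0.6 / 0.1 / 0.3
```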
There are many different sampling strategies for Bayes’ nets:
• prior sampling
• rejection sampling
• likelihood weighting
• Gibbs sampling
In practice, you typically want to use either likelihood weighting or Gibbs sampling.
Prior sampling
We have a Bayes’ net, and we want to sample the full joint distribution it encodes, but we don’t want
to have to build the full joint distribution.
Imagine we have the following Bayes’ net:
P (C) → P (R|C)
P (C) → P (S|C)
P (R|C) → P (W |S, R)
P (S|C) → P (W |S, R)
Where C, R, S, W are binary variables (i.e. C can be +c or −c).
We start from P (C) and sample a value c from that distribution. Then we sample r from P (R|C)
and s from P (S|C) conditioned on the value c we sampled from P (C). Then we sample from
P (W |S, R) conditioned on the sampled r, s values.
Basically, we walk through the graph, sampling from the distribution at each node, and we choose a
path through the graph such that we can condition on previously-sampled variables. This generates
one final sample across the different variables. If we want more samples, we have to repeat this
process.
Prior sampling (S_PS) generates samples with probability:

S_PS(x1, . . . , xn) = ∏_{i=1}^{n} P(xi | Parents(Xi)) = P(x1, . . . , xn)
That is, it generates samples from the actual joint distribution the Bayes’ net encodes, which is to say
that this sampling procedure is consistent. This is worth mentioning because this isn’t always the
case; some sampling strategies sample from a different distribution and compensate in other ways.
Then we can use these samples to estimate P (W ) or other quantities we may be interested in, but
we need many samples to get good estimates.
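A sketch of prior sampling on the C → {S, R} → W network above; the CPT values here are illustrative assumptions, since the text does not give numbers for this network:

```python
import random

def bern(p):
    """Return '+' with probability p, otherwise '-'."""
    return "+" if random.random() < p else "-"

def prior_sample():
    # Walk the graph in topological order, conditioning on values already sampled.
    c = bern(0.5)                                          # P(+c), assumed
    s = bern({"+": 0.1, "-": 0.5}[c])                      # P(+s | C), assumed
    r = bern({"+": 0.8, "-": 0.2}[c])                      # P(+r | C), assumed
    w = bern({("+", "+"): 0.99, ("+", "-"): 0.9,
              ("-", "+"): 0.9, ("-", "-"): 0.01}[(s, r)])  # P(+w | S, R), assumed
    return c, s, r, w

# Estimate P(W = +w) from many samples.
samples = [prior_sample() for _ in range(100000)]
print(sum(1 for _, _, _, w in samples if w == "+") / len(samples))
```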
Rejection sampling
Prior sampling can be overkill, since we typically keep samples which are irrelevant to the problem at
hand. We can instead use the same approach but discard irrelevant samples.
For instance, if we want to compute P (W ), we only care about values that W takes on, so we don’t
need to keep the corresponding values for C, S, R. Similarly, maybe we are interested in P (C| + s) so we should only be keeping samples where S = +s.
This method is called rejection sampling because we are rejecting samples that are irrelevant to
our problem. This method is also consistent.
Likelihood Weighting
A problem with rejection sampling is that if the evidence is unlikely, we have to reject a lot of samples.
For example, if we wanted to estimate P (C| + s) and S = +s is generally very rare, then many of
our samples will be rejected.
We could instead fix the evidence variables, i.e. when it comes to sample S, just say S = +s. But
then our sample distribution is not consistent.
We can fix this by weighting each sample by the probability of the evidence (e.g. S = +s) given its
parents (e.g. P (+s|Parents)).
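A sketch of likelihood weighting for estimating P(C | S = +s) on the same illustrative network as the prior-sampling sketch: the evidence variable is fixed rather than sampled, and each sample is weighted by the probability of the evidence given its parents.

```python
import random

def bern(p):
    return "+" if random.random() < p else "-"

P_S_given_C = {"+": 0.1, "-": 0.5}     # P(+s | C), assumed as before

def weighted_sample():
    c = bern(0.5)
    s = "+"                            # evidence: fixed, not sampled
    weight = P_S_given_C[c]            # weight by P(evidence | its parents)
    r = bern({"+": 0.8, "-": 0.2}[c])
    w = bern({("+", "+"): 0.99, ("+", "-"): 0.9,
              ("-", "+"): 0.9, ("-", "-"): 0.01}[(s, r)])
    return (c, s, r, w), weight

totals = {"+": 0.0, "-": 0.0}
for _ in range(100000):
    (c, _, _, _), weight = weighted_sample()
    totals[c] += weight
z = sum(totals.values())
print({c: v / z for c, v in totals.items()})   # estimate of P(C | +s)
```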
Gibbs sampling
With likelihood weighting, we consider the evidence only for variables sampled after we fixed the
evidence (that is, that come after the evidence node in our walk through the Bayes’ net). Anything
we sampled before did not take the evidence into account. It’s possible that what we sample before
we get to our evidence is very inconsistent with the evidence, i.e. makes it very unlikely and gives us
a very low weight for our sample.
With Gibbs sampling, we fix our evidence and then instantiate all of our other variables, x1, . . . , xn.
This instantiation is arbitrary but it must be consistent with the evidence.
Then, we sample a new value for one variable at a time, conditioned on the rest, though we keep the
evidence fixed. We repeat this many times.
If we repeat this infinitely many times, the resulting sample comes from the correct distribution, and
it is conditioned on both the upstream (pre-evidence) and downstream (post-evidence) variables.
Gibbs sampling is essentially a Markov model (hence it is a Markov Chain Monte Carlo method) in
which the stationary distribution is the conditional distribution we are interested in.
9.3 Markov Networks
Markov networks are also called Markov random fields.
The simplest subclass is pairwise Markov networks.
Say we have the following scenario:
Simple pairwise Markov network
An idea is floating around and when, for example, Alice & Bob are hanging out, they may share the
idea - they influence each other. We don’t use a directed graph because the influence flows in both
directions.
But how do you parametrize an undirected graph? We no longer have a notion of a conditional - that is, one variable conditioning another.
Well, we can just use factors:
Simple pairwise Markov network
| A, B | ϕ1[A, B] |
|--------|------|
| −a, −b | 30 |
| −a, +b | 5 |
| +a, −b | 1 |
| +a, +b | 10 |
These factors are sometimes called affinity functions or compatibility functions or soft constraints.
What do these numbers mean?
They indicate the “local happiness” of the variables A and B to take a particular joint assignment.
Here A and B are “happiest” when −a, −b.
We can define factors for the other edges as well:
| B, C | ϕ2[B, C] |
|--------|------|
| −b, −c | 100 |
| −b, +c | 1 |
| +b, −c | 1 |
| +b, +c | 100 |

| C, D | ϕ3[C, D] |
|--------|------|
| −c, −d | 1 |
| −c, +d | 100 |
| +c, −d | 100 |
| +c, +d | 1 |

| D, A | ϕ4[D, A] |
|--------|------|
| −d, −a | 100 |
| −d, +a | 1 |
| +d, −a | 1 |
| +d, +a | 100 |
Then we have:
P̃ (A, B, C, D) = ϕ1 (A, B)ϕ2 (B, C)ϕ3 (C, D)ϕ4 (A, D)
This isn’t a probability distribution because its numbers aren’t in [0, 1] (hence the tilde over P , which
indicates an unnormalized measure).
We can normalize it to get a probability distribution:
P(A, B, C, D) = (1/Z) · P̃(A, B, C, D)
Z is known as a partition function.
There unfortunately is no natural mapping between the pairwise factors and the marginal probabilities of the distribution they generate.
For instance, say we are given the marginal probabilities of PΦ (A, B) (the Φ indicates the probability
was computed using a set of factors Φ = {ϕ1 , . . . , ϕn }):
| A | B | PΦ(A, B) |
|---|---|------|
| −a | −b | 0.13 |
| −a | +b | 0.69 |
| +a | −b | 0.14 |
| +a | +b | 0.04 |

| A, B | ϕ1[A, B] |
|--------|------|
| −a, −b | 30 |
| −a, +b | 5 |
| +a, −b | 1 |
| +a, +b | 10 |
The most likely joint assignment is −a, +b, which doesn’t seem to correspond to the factor. This is
a result of the other factors in the network.
This is unlike Bayesian networks where the nodes were just conditional probabilities.
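Because this network is tiny, we can verify these numbers by brute force: multiply the factor tables, sum over all joint assignments to get Z, and sum out C and D to recover the marginal shown above. A minimal sketch (variable names are just illustrative):

```python
from itertools import product

# factor tables from the example above
phi1 = {('-a', '-b'): 30, ('-a', '+b'): 5, ('+a', '-b'): 1, ('+a', '+b'): 10}
phi2 = {('-b', '-c'): 100, ('-b', '+c'): 1, ('+b', '-c'): 1, ('+b', '+c'): 100}
phi3 = {('-c', '-d'): 1, ('-c', '+d'): 100, ('+c', '-d'): 100, ('+c', '+d'): 1}
phi4 = {('-d', '-a'): 100, ('-d', '+a'): 1, ('+d', '-a'): 1, ('+d', '+a'): 100}

def p_tilde(a, b, c, d):
    """Unnormalized measure: the product of the pairwise factors."""
    return phi1[(a, b)] * phi2[(b, c)] * phi3[(c, d)] * phi4[(d, a)]

A, B, C, D = ['-a', '+a'], ['-b', '+b'], ['-c', '+c'], ['-d', '+d']

# partition function: sum the unnormalized measure over all joint assignments
Z = sum(p_tilde(a, b, c, d) for a, b, c, d in product(A, B, C, D))

# marginal P(A, B): sum out C and D, then normalize by Z
for a, b in product(A, B):
    marginal = sum(p_tilde(a, b, c, d) for c, d in product(C, D)) / Z
    print(a, b, round(marginal, 2))
```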
Formally, a pairwise Markov network is an undirected graph whose nodes are X_1, ..., X_n and each edge X_i − X_j is associated with a factor (aka potential) ϕ_ij(X_i, X_j).
Pairwise Markov networks cannot represent all of the probability distributions we may be interested in. A pairwise Markov network with n random variables, each with d values, has O(n²d²) parameters. On the other hand, if we consider a probability distribution over n random variables, each with d values, it has O(d^n) parameters, which is far greater than O(n²d²).
Thus we generalize beyond pairwise Markov networks.
9.3.1 Gibbs distribution
A Gibbs distribution is parameterized by a set of general factors Φ = {ϕ1 (D1 ), . . . , ϕk (Dk )} which
can have a scope of ≥ 2 variables (whereas pairwise Markov networks were limited to two variable
scopes). As a result, this can express any probability distribution because we can just define a factor
over all the random variables.
We also have:
P̃_Φ(X_1, ..., X_n) = ∏_{i=1}^k ϕ_i(D_i)

Z_Φ = ∑_{X_1,...,X_n} P̃_Φ(X_1, ..., X_n)
Where ZΦ is the partition function, i.e. the normalizing constant.
Thus we have:
P_Φ(X_1, ..., X_n) = (1/Z_Φ) · P̃_Φ(X_1, ..., X_n)
We can generate an induced Markov network HΦ from a set of factors Φ. For each factor in the
set, we connect any variables which are in the same scope.
For example, ϕ1 (A, B, C), ϕ2 (B, C, D) leads to:
Example induced Markov network
So multiple sets of factors can induce the same graph. We can go from a set of factors to a graph, but we can't go the other way.
We say a probability distribution P factorizes over a Markov network H if there exists a set of factors Φ such that P = P_Φ and H is the induced graph for Φ.
We have active trails in Markov networks as well: a trail X1 − · · · − Xn is active given the set of
observed variables Z if no Xi is in Z.
9.3.2 Conditional Random Fields
A commonly-used variant of Markov networks is conditional random fields (CRFs).
This kind of model is used to deal with task-specific prediction, where we have a set of input/observed
variables X and a set of target variables Y that we are trying to predict.
Using the graphical models we have seen so far is not the best because we don’t want to model
P (X, Y ) - we are already given X. Instead, we just want to model P (Y |X). That way we don’t have
to worry about how features of X are correlated or independent, and we don’t have to model their
distributions.
In this scenario, we can use a conditional random field representation:
Φ = {ϕ_1(D_1), ..., ϕ_k(D_k)}

P̃_Φ(X, Y) = ∏_{i=1}^k ϕ_i(D_i)
This looks just like a Gibbs distribution. The difference is in the partition function:
Z_Φ(X) = ∑_Y P̃_Φ(X, Y)
So a CRF is parameterized the same as a Gibbs distribution, but it is normalized differently.
The end result is:
P_Φ(Y | X) = (1/Z_Φ(X)) · P̃_Φ(X, Y)
Which is a family of conditional distributions, one for each possible value of X.
In a Markov network, we have the concept of separation, which is like d-separation in Bayesian
networks but we drop the “d” because they are not directed.
X and Y are separated in H given observed evidence Z if there is no active trail in H (that is, no
node along the trail is in Z).
For example:
Markov network separation example
We can separate A and E in a few ways:
• A and E are separated given B and D
• A and E are separated given D
• A and E are separated given B and C
Like with Bayesian networks, we have a theorem: if P factorizes over H and sepH (X, Y |Z), then P
satisfies (X ⊥ Y |Z).
We can say the independences induced by the graph H, I(H), is:
I(H) = {(X ⊥ Y |Z)|sepH (X, Y |Z)}
If P satisfies I(H), we say that H is an I-map (independency map) of P (this is similar to I-maps in
the context of Bayesian networks).
We can also say that if P factorizes over H, then H is an I-map of P .
The converse is also true: for a positive distribution P , if H is an I-map for P , then P factorizes over
H.
If a graph G is an I-map of P, it does not necessarily encode all independences of P; it only guarantees that those it does encode are in fact in P.
How well can we capture a distribution P ’s independences in a graphical model?
We can denote all independences that hold in P as:
I(P ) = {(X ⊥ Y |Z)|P ⊨ (X ⊥ Y |Z)}
We know that if P factorizes over G, then G is an I-map for P :
I(G) ⊆ I(P )
The converse doesn’t hold; P may have some independences not in G.
We want graphs which encode more independences because they are sparser (fewer parameters) and more informative.
So for sparsity, we want a minimal I-map; that is, an I-map without redundant edges. But it is still
not sufficient for capturing I(P ).
Ideally, we want a perfect map, which is an I-map such that I(G) = I(P). Unfortunately, not every distribution has a perfect map, although sometimes a distribution may have a perfect map as a Markov network and not as a Bayesian network, and vice versa.
It is possible that a perfect map for a distribution is not unique; that is, there may be other graphs
which model the same set of independence assumptions and thus are also perfect maps.
When graphs model the same independence assumptions, we say they are I-equivalent. Most graphs
have many I-equivalent variants.
9.3.3 Log-linear models
Log-linear models allow us to incorporate local structure into undirected models.
In the original representation of unnormalized density, we had:
P̃ = ∏_i ϕ_i(D_i)
We turn this into a linear form:
P̃ = exp(−∑_j w_j f_j(D_j))
Hence the name “log-linear”, because the log is a linear function.
Each feature fj has a scope Dj . Different features can have the same scope.
We can further write it in the form:
P̃ = ∏_j exp(−w_j f_j(D_j))
which effectively turns the exp(−wj fj (Dj )) term into a factor with one parameter wj .
For example, say we have binary variables X1 and X2 :
ϕ(X_1, X_2) = [ a_00  a_01
                a_10  a_11 ]
We must define the following features using indicator functions (1 if true, else 0):
f_12^00 = ⊮{X_1 = 0, X_2 = 0}
f_12^01 = ⊮{X_1 = 0, X_2 = 1}
f_12^10 = ⊮{X_1 = 1, X_2 = 0}
f_12^11 = ⊮{X_1 = 1, X_2 = 1}
So we have the log-linear model:
ϕ(X_1, X_2) = exp(−∑_{kl} w_kl f_12^kl(X_1, X_2))
So we can represent any factor as a log-linear model by including the appropriate features.
For example, say you want to develop a language model for labeling entities in text.
You have target labels Y = {PERSON, LOCATION, . . . } and input words X.
You could, for instance, have the following features:
f (Yi , Xi ) = ⊮{Yi = PERSON, Xi is capitalized}
f (Yi , Xi ) = ⊮{Yi = LOCATION, Xi appears in an atlas}
and so on.
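A small sketch of how such indicator features might be turned into a factor value using the exp(−∑_j w_j f_j) form above; the feature names, weights, and word list here are invented for illustration:

```python
import math

# hypothetical weights, one per feature (these would normally be learned)
weights = {'person_capitalized': -1.5, 'location_in_atlas': -2.0}

# a toy "atlas" for the location feature; purely illustrative
atlas = {'paris', 'nairobi', 'osaka'}

def features(y, x):
    """Indicator features over a (label, word) pair."""
    return {
        'person_capitalized': 1 if y == 'PERSON' and x[:1].isupper() else 0,
        'location_in_atlas': 1 if y == 'LOCATION' and x.lower() in atlas else 0,
    }

def factor(y, x):
    """Unnormalized factor value exp(-sum_j w_j * f_j(y, x))."""
    f = features(y, x)
    return math.exp(-sum(weights[name] * f[name] for name in weights))

print(factor('PERSON', 'Alice'))    # indicator fires; factor > 1 since its weight is negative
print(factor('LOCATION', 'Alice'))  # no indicator fires; factor = 1
```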
9.4 References

• Bayesian Reasoning and Machine Learning. David Barber.
• Probabilistic Graphical Models. Daphne Koller. Stanford University/Coursera.
• MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT.
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX).
• Artificial Intelligence Planning. Dr. Gerhard Wickler, Prof. Austin Tate. The University of Edinburgh (Coursera). 2015.
• Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity.
10 Optimization
Optimization is the task of finding the arguments to a function which yield its minimum or maximum value. An optimal argument is denoted with an asterisk, e.g. x*.
In the context of machine learning, we are typically dealing with minimization. If necessary, a maximization problem can be reframed as minimization: to maximize a function f (x), you can instead
minimize −f (x).
The function we want to optimize is called the objective function. When the particular optimization
is minimization, the function may also be referred to as a cost, loss, or error function.
Optimization problems can be thought of as a topology where you are looking for the global peak (if you are maximizing) or the globally lowest point (if you are minimizing). For simplicity, minimizing will be the assumed goal here, as you are often trying to minimize some error function.
Consider a very naive approach: a greedy random algorithm which starts at some position in this topology, then randomly tries moving to a new position and checks if it is better. If it is, it sets that as the current solution. It continues until it has some reason to stop, usually because it has found a minimum. This is a local minimum; that is, it is a minimum relative to its immediately surrounding points, but it is not necessarily the global minimum, which is the minimum of the entire function.

Local vs global minima

This algorithm is greedy in that it will always prefer a better scoring position, even if it is only marginally better. Thus it can be easy to get stuck in local optima - since any step away from it seems worse, even if the global optimum is right around the corner, so to speak.
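A minimal sketch of such a greedy random search (the step size and iteration count here are arbitrary choices):

```python
import random

def greedy_random_search(f, x0, step=0.1, max_iters=10000):
    """Greedy random local search: propose a random nearby point, keep it only if it scores better."""
    x, best = x0, f(x0)
    for _ in range(max_iters):
        candidate = [xi + random.uniform(-step, step) for xi in x]
        score = f(candidate)
        if score < best:  # greedy: only accept strict improvements
            x, best = candidate, score
    return x, best

# example: minimize a simple quadratic; on multimodal functions this can get stuck in a local minimum
f = lambda v: (v[0] - 3) ** 2 + (v[1] + 1) ** 2
print(greedy_random_search(f, [0.0, 0.0]))
```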
Optimization may be accomplished numerically or analytically. The analytic approach involves
computing derivatives and then identifying critical points (e.g. the second derivative test). These
provide the exact optima.
However, this analytic approach is infeasible for many functions. In such cases we resort to numerical
optimization, which involves, in a sense, guessing your way to the optima. The methods covered here
are all numerical optimization methods.
Within optimization there are certain special cases:
10.0.1 Convex optimization
As far as optimization goes, convex optimization is easier to deal with - convex functions have only
global minima (any local minimum is a global minimum) without any saddle points.
10.0.2 Constrained optimization
In optimization we are typically looking to find the optimum across all points. However, we may
only be interested in finding the optimum across a subset S of these points - in this case, we have a
constrained optimization problem. The points that are in S are called feasible points.
One method for constrained optimization is the Karush-Kuhn-Tucker (KKT) method.
It uses the generalized Lagrangian (sometimes called the generalized Lagrange function).
We must describe S as equations and inequalities; in particular, with m functions g_i, called equality constraints, and n functions h_j, called inequality constraints, such that S = {x | ∀i, g_i(x) = 0 and ∀j, h_j(x) ≤ 0}.
For each constraint we also have the variables λi , αi , called KKT multipliers.
Then we can define the generalized Lagrangian as:
L(x, λ, α) = f(x) + ∑_i λ_i g_i(x) + ∑_j α_j h_j(x)
This reframes the constrained minimization problem as an unconstrained optimization of the generalized Lagrangian.
That is, so long as at least one feasible point exists and f (x) cannot be ∞, then
min_x max_λ max_{α, α≥0} L(x, λ, α)

has the same objective function value and set of optimal points as min_{x∈S} f(x).
So long as the constraints are satisfied,

max_λ max_{α, α≥0} L(x, λ, α) = f(x)

and if a constraint is violated, then

max_λ max_{α, α≥0} L(x, λ, α) = ∞
10.1 Gradient vs non-gradient methods
Broadly we can categorize optimization methods into those that use gradients and those that do not:
Non-gradient methods include:
• hill climbing
• simplex/amoeba/Nelder Mead
• genetic algorithms
Gradient methods include:
• gradient descent
• conjugate gradient
• quasi-newton
Gradient methods tend to be more efficient, but are not always possible to use (you don’t always
have the gradient).
10.2 Gradient Descent
Gradient descent (GD) is perhaps the most common minimizing optimization method in machine learning (for maximizing, its equivalent is gradient ascent).
Say we have a function C(v ) which we want to minimize. For simplicity, we will use v ∈ R2 . An
example C(v ) is visualized in the accompanying figure.
In this example, the global minimum is visually obvious, but most of the time it is not (especially
when dealing with far more dimensions). But we can apply the model of a ball rolling down a hill
and expand it to any arbitrary n dimensions. The ball will “roll” down to a minimum, though not
necessarily the global minimum.
The position the ball is at is a potential solution; here it is some values for v_1 and v_2. We want to move the ball such that ∆C, the change in C(v) from the ball's previous position to the new position, is negative (i.e. the cost function's output is smaller, since we're minimizing).

More formally, ∆C is defined as:

∆C ≈ (∂C/∂v_1)∆v_1 + (∂C/∂v_2)∆v_2

We define the gradient of C, denoted ∇C, to be the vector of partial derivatives (transposed to be a column vector):

∇C ≡ (∂C/∂v_1, ∂C/∂v_2)^T

Sphere function

So we can rewrite ∆C as:

∆C ≈ ∇C · ∆v

We can choose ∆v to make ∆C negative:

∆v = −η∇C

Where η is a small, positive parameter (the learning rate), which controls the step size.

Finally we have:

∆C ≈ −η∇C · ∇C = −η∥∇C∥²
We can use this to compute a value for ∆v , which is really the change in position for our “ball” to a
new position v ′ :
v → v ′ = v − η∇C.
And repeat until we hit a global (or local) minimum.
This process is known in particular as batch gradient descent because each step is computed over
the entire batch of data.
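A minimal sketch of this update rule, assuming we already have a function that computes ∇C (here, for the sphere function C(v) = v_1² + v_2², whose gradient is (2v_1, 2v_2)):

```python
def gradient_descent(grad_C, v, eta=0.1, iters=100):
    """Batch gradient descent: repeatedly step against the gradient, v -> v - eta * grad_C(v)."""
    for _ in range(iters):
        g = grad_C(v)
        v = [vi - eta * gi for vi, gi in zip(v, g)]
    return v

grad_C = lambda v: [2 * v[0], 2 * v[1]]
print(gradient_descent(grad_C, [3.0, -4.0]))  # converges toward the minimum at (0, 0)
```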
10.2.1 Stochastic gradient descent (SGD)
With batch gradient descent, the cost function is evaluated on all the training inputs for each step.
This can be quite slow.
With stochastic gradient descent (SGD), you randomly shuffle your examples and look at only
one example for each iteration of gradient descent (sometimes this is called online gradient descent
to contrast with minibatch gradient descent, described below). Ultimately it is less direct than batch
gradient descent but gets you close to the global minimum - the main advantage is that you’re not
iterating over your entire training set for each step, so though its path is more wandering, it ends up
taking less time on large datasets.
The reason we randomly shuffle examples is to avoid “forgetting”. For instance, say you have time
series data where there are somewhat different patterns later in the data than earlier on. If that
training data is presented in sequence, the algorithm will “forget” the patterns earlier on in favor of
those it encounters later on (since the parameter updates learned from the later-on data will effectively
erase the updates from the earlier-on data).
In fact, stochastic gradient descent can help with finding the global minimum because instead of computing over a single error surface, you are working with many different error surfaces varying with the example you are currently looking at. So it is possible that in one of these surfaces a local minimum does not exist or is less pronounced than in others, which makes it easier to escape.
There’s another form of stochastic gradient descent called minibatch gradient descent. Here b
random examples are used for each iteration, where b is your minibatch size. It is usually in the range
of 2-100; a typical choice might be 10 (minibatch gradient descent where b = 1 is just regular SGD).
Note that a minibatch size can be too large, resulting in greater time for convergence. But generally
it is faster than SGD and has the benefit of aiding in local minima avoidance.
Minibatches also need to be properly representative of the overall dataset (e.g. be balanced for
classes).
When the stochastic variant is used, a 1/b term is sometimes included:

v → v′ = v − (η/b)∇C
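A sketch of minibatch gradient descent along these lines, assuming a hypothetical grad_example function that returns the gradient contributed by a single training example; the shuffling and the per-batch averaging (the 1/b term) are the important parts:

```python
import random

def minibatch_sgd(grad_example, theta, data, eta=0.01, b=10, epochs=5):
    """Minibatch SGD: shuffle the data each epoch, then update on the averaged gradient of each batch."""
    for _ in range(epochs):
        random.shuffle(data)  # avoids "forgetting" patterns that only appear early or late in the data
        for i in range(0, len(data), b):
            batch = data[i:i + b]
            # average the per-example gradients over the batch (the 1/b term)
            grads = [grad_example(theta, ex) for ex in batch]
            avg = [sum(g[j] for g in grads) / len(batch) for j in range(len(theta))]
            theta = [t - eta * a for t, a in zip(theta, avg)]
    return theta

# example: fit theta in y ≈ theta * x; the gradient of (theta*x - y)^2 w.r.t. theta is 2*(theta*x - y)*x
data = [(x, 2 * x) for x in range(1, 20)]
grad_example = lambda theta, ex: [2 * (theta[0] * ex[0] - ex[1]) * ex[0]]
print(minibatch_sgd(grad_example, [0.0], data, eta=0.001, b=5, epochs=50))  # approaches [2.0]
```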
10.2.2 Epochs vs iterations
An important point of clarification: an “epoch” and a training “iteration” are not necessarily the
same thing.
One training iteration is one step of your optimization algorithm (i.e. one update pass). In the case
of something like minibatch gradient descent, one training iteration will only look at one batch of
examples.
An epoch, on the other hand, consists of enough training iterations to look at all your training
examples.
So if you have a total of 1000 training examples and a batch size of 100, one epoch will consist of
10 training iterations.
10.2.3 Learning rates
The learning rate η is typically held constant. It can be slowly decreased over time if you want θ to converge on the global minimum in stochastic gradient descent (otherwise, it just gets close). For instance, you can divide it by the iteration number plus some constant, but this can be overkill.
10.2.4 Conditioning
Conditioning describes how much the output of a function varies with small changes in input.
In particular, if we have a function f (x) = A−1 x, A ∈ Rn×n , where A has an eigenvalue decomposition,
we can compute its condition number as follows:
max_{i,j} |λ_i / λ_j|
Which is the ratio of the magnitudes of the largest and smallest eigenvalues; when this is large, we say we have a poorly conditioned matrix since it is overly sensitive to small changes in input. This has the practical implication of slow convergence.
In the context of gradient descent, if the Hessian is poorly conditioned, then gradient descent does
not perform as well. This can be alleviated with Newton’s method, where a Taylor series expansion
is used to approximate f (x) near some point x0 , going up to only the second-order derivatives:
f(x) ≈ f(x_0) + (x − x_0)^T ∇_x f(x_0) + ½(x − x_0)^T H(f)(x_0)(x − x_0)
Solving for the critical point gives:
x = x_0 − H(f)(x_0)^{−1} ∇_x f(x_0)
(As a reminder, H(f ) is the Hessian of f )
Such methods which also use second-order derivatives (i.e. the Hessian) are known as second-order
optimization algorithms; those that use only the gradient are called first-order optimization algorithms.
10.3 Simulated Annealing
Simulated annealing is similar to the greedy random approach but it has some randomness which
can “shake” it out of local optima.
Annealing is a process in metal working where the metal starts at a very high temperature and
gradually cools down. Simulated annealing uses a similar process to manage its randomness.
A simulated annealing algorithm starts with a high “temperature” (or “energy”) which “cools” down
(becomes less extreme) as progress is made. Like the greedy random approach, the algorithm tries a
random move. If the move is better, it is accepted as the new position. If the move is worse, then
there is a chance it still may be accepted; the probability of this is based on the current temperature,
the current error, and the previous error:
P(e, e′, T) = exp(−(e′ − e) / T)
Each random move, whether accepted or not, is considered an iteration. After each iteration, the
temperature is decreased according to a cooling schedule. An example cooling schedule is:
T(k) = T_init · (T_final / T_init)^(k / k_max)
where

• T_init = the starting temperature
• T_final = the minimum/ending temperature
• k = the current iteration
• k_max = the maximum number of iterations
For this particular schedule, you probably don't want to set T_final to 0 since otherwise it would rapidly decrease to 0. Set it to something close to 0 instead.
The algorithm terminates when the temperature is at its minimum.
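A minimal sketch of simulated annealing using the acceptance probability and cooling schedule described above (the objective, step size, and temperatures are arbitrary illustrations):

```python
import math
import random

def simulated_annealing(f, x0, step=0.5, t_init=10.0, t_final=0.01, k_max=10000):
    """Accept worse moves with probability exp(-(e' - e)/T), cooling T each iteration."""
    x, e = x0, f(x0)
    for k in range(k_max):
        t = t_init * (t_final / t_init) ** (k / k_max)  # the cooling schedule described above
        candidate = [xi + random.uniform(-step, step) for xi in x]
        e_new = f(candidate)
        # always accept improvements; sometimes accept worse moves, more often at high temperature
        if e_new < e or random.random() < math.exp(-(e_new - e) / t):
            x, e = candidate, e_new
    return x, e

# example: a bumpy one-dimensional function with many local minima
f = lambda v: v[0] ** 2 + 3 * math.sin(5 * v[0]) ** 2
print(simulated_annealing(f, [4.0]))
```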
10.4 Nelder-Mead (aka Simplex or Amoeba optimization)
For a problem of n dimensions, create a shape of n + 1 vertices. This shape is a simplex.
One vertex of the simplex is initialized with your best educated guess of the solution vector. That
guess could be the output of some other optimization approach, even a previous Nelder-Mead run. If
you have nothing to start with, a random vector can be used.
The other n vertices are created by moving in one of the n dimensions by some set amount.
Then at each step of the algorithm, you want to (illustrations are for n = 2, thus 3 vertices):
• Find the worst, second worst, and best scoring vertices
• Reflect the worst vertex to some point p ′ through the best side
• If p ′ is better, expand by setting the worst vertex to a new point p ′′ , a bit further than p ′ but
in the same direction
• If p ′ is worse, then contract by setting the worst vertex to a new point p ′′ , in the same direction
as p ′ but before crossing the best side
Nelder-Mead
The algorithm terminates when one of the following occurs:
• The maximum number of iterations is reached
• The score is “good enough”
• The vertices have become close enough together
Then the best vertex is considered the solution.
This optimization method is very sensitive to how it is initialized; whether or not a good solution is
found depends a great deal on its starting points.
10.5 Particle Swarm Optimization
Particle swarm optimization is similar to Nelder-Mead, but instead of three points, many more
points are used. These points are called “particles”.
Each particle has a position (a potential solution) and a velocity which indicates where the particle
moves to in the next iteration. Particles also keep track of their current error on the training examples
and their best position so far.
Globally, we also track the best position overall and the lowest error overall.
The velocity for each particle is computed according to:
• its inertia (i.e. the current direction it is moving in)
• its historic best position (i.e. the best position it has found so far)
• the global best position
The influence of these components is controlled by:
• inertia weight
• cognitive weight (for historic best position)
• social weight (for global best position)
These weights are parameters that must be tuned, but this method is quite robust to them (that is,
they are not sensitive to these changes so you don’t have to worry too much about getting them just
right).
More particles are better, of course, but more intensive.
You can specify the number of epochs to run.
You can also incorporate a death-birth cycle in which low-performing particles (those that seem to
be stuck, for instance) get destroyed and a new randomly-placed particle is initialized in its place.
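A minimal sketch of the velocity update described above; the inertia, cognitive, and social weights here are typical-looking but arbitrary values:

```python
import random

def pso(f, dim, n_particles=30, epochs=100, w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm: velocity blends inertia, pull toward personal best, and pull toward global best."""
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]            # each particle's best position so far
    pbest_err = [f(p) for p in pos]
    gbest = min(pbest, key=f)              # best position seen by any particle
    for _ in range(epochs):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]                          # inertia
                             + c1 * r1 * (pbest[i][d] - pos[i][d])  # cognitive term
                             + c2 * r2 * (gbest[d] - pos[i][d]))    # social term
                pos[i][d] += vel[i][d]
            err = f(pos[i])
            if err < pbest_err[i]:
                pbest[i], pbest_err[i] = pos[i][:], err
                if err < f(gbest):
                    gbest = pos[i][:]
    return gbest, f(gbest)

print(pso(lambda v: sum(x ** 2 for x in v), dim=2))
```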
10.6 Evolutionary Algorithms
Evolutionary algorithms are a type of algorithm which uses concepts from evolution - e.g. individuals,
populations, fitness, reproduction, mutation - to search a solution space.
10.6.1 Genetic Algorithms
Genetic algorithms are the most common class of evolutionary algorithms.
• You have a population of “chromosomes” (e.g. possible solutions or parameters, which are also
called “object variables”). These chromosomes are interchangeably referred to as “individuals”
• There may be some mutation in the chromosomes (e.g. with binary chromosomes, sometimes
0s become 1s and vice versa or with continuous values, changes happen according to some step
size)
• Parents have children, in which their chromosomes crossover - the front part of one chromosome combines with the back part of another. This is also called recombination.
• The genotype (the chromosome composition) is expressed as some phenotype (i.e. some
genetically-determined properties) in some individuals
• Then each of these individuals has some fitness value resulting from their phenotypes
• These fitnesses are turned into some probability of survival (this selection pressure is what
pushes the system towards an optimal individual)
• Then the individuals are selected randomly based on their survival probabilities
• These individuals form the new chromosome population for the next generation
Each of these steps requires some decisions by the implementer.
For instance, how do you translate a fitness score into a survival probability?
Well, the simplest way is:
P_i = f_i / ∑_i f_i
Where fi is the fitness of some individual i .
However, depending on how you calculate fitness, this may not be appropriate.
You could alternatively use a ranking method, in which you just look at the relative fitness rankings
and not their actual values. So the most fit individual is most likely to survive, the second fit is a bit
less likely, and so on.
You pick a probability constant P_C, and the survival probability of the top-ranked individual is P_C, that of the second is (1 − P_C)P_C, that of the third is (1 − P_C)²P_C, and so on. So P_{n−1} = (1 − P_C)^{n−2} P_C and P_n = (1 − P_C)^{n−1}.
If you get stuck on local maxima you can try increasing the step size. When your populations start to
get close to the desired value, you can decrease the step size so the changes are less sporadic (i.e. use
simulated annealing).
When selecting a new population, you can incorporate a diversity rank in addition to a fitness rank.
This diversity ranking tries to maximize the diversity of the new population. You select one individual
for the new population, and then as you select your next individual, you try and find one which is
distinct from the already selected individuals.
The general algorithm is as follows:

1. randomly initialize a population of µ individuals
2. compute fitness scores for each individual
3. randomly choose µ/2 pairs of parents, weighted by fitness (see above for an example), to reproduce
4. with probability P_c (a hyperparameter, e.g. 0.8), perform crossover on the parents to form two children, which replace the old population (you may also choose to keep some of the old population, rather than having two children per pair of parents, as per above - there is no universal genetic algorithm; you typically need to adjust it for a particular task)
5. randomly apply mutation to some of the population with probability P_m (a hyperparameter, e.g. 0.01)
6. repeat
The specifics of how crossover and mutation work depend on your particular problem.
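For illustration, here is a minimal sketch of the loop above for binary chromosomes, with single-point crossover and bit-flip mutation (fitness is assumed non-negative; the "one-max" objective is just a toy example):

```python
import random

def genetic_algorithm(fitness, chrom_len, mu=50, p_c=0.8, p_m=0.01, generations=100):
    """Minimal genetic algorithm over binary chromosomes, following the steps above."""
    population = [[random.randint(0, 1) for _ in range(chrom_len)] for _ in range(mu)]
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]  # assumed non-negative
        total = sum(scores)

        def pick():
            # fitness-proportional selection (fall back to uniform if all scores are zero)
            return random.choices(population, weights=scores, k=1)[0] if total > 0 else random.choice(population)

        children = []
        for _ in range(mu // 2):
            p1, p2 = pick(), pick()
            if random.random() < p_c:  # single-point crossover
                point = random.randint(1, chrom_len - 1)
                c1, c2 = p1[:point] + p2[point:], p2[:point] + p1[point:]
            else:
                c1, c2 = p1[:], p2[:]
            children += [c1, c2]
        for child in children:          # bit-flip mutation
            for j in range(chrom_len):
                if random.random() < p_m:
                    child[j] = 1 - child[j]
        population = children
    return max(population, key=fitness)

# example: maximize the number of 1 bits ("one-max")
print(genetic_algorithm(fitness=sum, chrom_len=20))
```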
10.6.2 Evolution Strategies
With evolution strategies, each individual includes not only object variables but also "strategy" parameters, which are variances and (optionally) covariances of the object variables. These strategy parameters control mutation.
From each population of size µ, λ offspring are generated (e.g. λ = 7µ).
All of the object variables, as with genetic algorithms, are derived from the same parents, though each
strategy parameter may be derived from a different pair of parents, selected at random (without any
selection pressure). However, the best approach is to copy an object variable from one of the parents
and set each strategy parameter to be the mean of its parents’ corresponding strategy parameter.
Then mutation mutates both the strategy parameters and the object variables, starting with the
strategy parameters. The mutation of the strategy parameters is called self-adaptation. The object
variables are mutated according to the probability distribution specified by the (mutated) strategy
parameters.
There are two approaches to selection for evolutionary strategy:
• (µ, λ) selection just involves taking the best µ individuals from the λ offspring.
• (µ + λ) selection involves selecting the best µ individuals from the union of the λ offspring
and the µ parents.
(µ, λ) selection is recommended because (µ + λ) selection can interfere with self-adaptation.
10.6.3 Evolutionary Programming
Evolutionary programming does not include recombination; changes to individuals rely solely on
mutation. Mutations are based on a Gaussian distribution, where the standard deviation is the square
root of a linear transform (parametrized according to the user) of the parent’s fitness score. Each of
the µ parents yields one offspring.
Note that in meta-evolutionary programming, the variances are also part of the individual, i.e. subject
to mutation (this is self-adaptation).
The next generation is selected from the union of the parents and the offspring via a process called
q-tournament selection. Each candidate is paired with q (a hyperparameter) randomly selected
opponents and receives a score which is the number of these q opponents that have a worse fitness
score than the candidate. The top-scoring µ candidates are kept as the new generation.
Increasing q causes the selection pressure to be both higher and more deterministic.
10.7 Derivative-Free Optimization
Note that Nelder-Mead, Particle Swarm, and genetic algorithm optimization methods are sometimes
known as “derivative-free” because they do not involve computing derivatives in order to optimize.
10.8 Hessian optimization
Also known as the “Hessian technique”.
Given a function f (X), where X = [x1 , x2 , . . . , xn ], we can approximate f near a point X using
Taylor’s theorem:
f(X + ∆X) = f(X) + ∑_j (∂f/∂X_j) ∆X_j + (1/2) ∑_{jk} ∆X_j (∂²f/∂X_j∂X_k) ∆X_k + ...
          = f(X) + ∇f · ∆X + (1/2) ∆X^T H ∆X + ...

Where H is the Hessian matrix (the jk-th entry is ∂²f/∂X_j∂X_k).
We can approximate f by dropping the higher-order terms (the ellipsis, . . . , terms):
f(X + ∆X) ≈ f(X) + ∇f · ∆X + (1/2) ∆X^T H ∆X
Assuming the Hessian matrix is positive definite, we can show using calculus that the right-hand
expression can be minimized to:
∆X = −H −1 ∇f
If f is a cost function C and X are the parameters θ to the cost function, we can minimize the cost by updating θ according to the following algorithm (where η is the learning rate):

• Initialize θ.
• Update θ to θ′ = θ − ηH^{−1}∇C, computing H and ∇C at θ.
• Update θ′ to θ′′ = θ′ − ηH′^{−1}∇′C, computing H′ and ∇′C at θ′.
• And so on.
The second derivatives in the Hessian tell us how the gradient is changing, which provides some advantages (such as convergence speed) over traditional gradient descent.
The Hessian matrix has n2 elements, where n is the number of parameters, so it can be extremely
large. In practice, computing the Hessian can be quite difficult.
10.9 Advanced optimization algorithms
There are other advanced optimization algorithms, such as:
• Conjugate gradient
• BFGS
• L-BFGS
These shouldn’t be implemented on your own since they require an advanced understanding of numerical computing, even just to understand what they’re doing.
They are more complex, but (in the context of machine learning) there’s no need to manually pick a
learning rate α and they are often faster than gradient descent. So you can take advantage of them
via some library which has them implemented (though some implementations are better than others).
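For example, if SciPy is available, scipy.optimize.minimize exposes several of these methods; a sketch (the objective here is just an arbitrary test function):

```python
import numpy as np
from scipy.optimize import minimize

# a Rosenbrock-style objective and its gradient
f = lambda v: (1 - v[0]) ** 2 + 100 * (v[1] - v[0] ** 2) ** 2
grad = lambda v: np.array([
    -2 * (1 - v[0]) - 400 * v[0] * (v[1] - v[0] ** 2),
    200 * (v[1] - v[0] ** 2),
])

# L-BFGS chooses its own step sizes, so there is no learning rate to tune
result = minimize(f, x0=np.array([-1.0, 2.0]), jac=grad, method='L-BFGS-B')
print(result.x)  # should be close to (1, 1)
```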
10.10 References

• Swarm Intelligence Optimization using Python. James McCaffrey. PyData 2015.
• Machine Learning. 2014. Andrew Ng. Stanford University/Coursera.
• Neural Networks and Deep Learning. Michael A. Nielsen. Determination Press, 2015.
• Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
• Genetic and Evolutionary Algorithms. Gareth Jones.
• Advanced Topics: Reinforcement Learning, Lecture 7. David Silver.
• An Interactive Tutorial on Numerical Optimization. Ben Frederickson.
11 Algorithms
We measure the performance of algorithms in terms of time complexity (how many steps are taken,
as a function of input size) and space complexity (how much memory is used, as a function of
input size).
There may be trade-offs for better time and space complexity - for instance, there may be situations
where speed is more important than correctness (a correct algorithm is one that always terminates in
the correct answer), in which case we may opt for a faster algorithm that provides only an approximate
solution or a local optimum.
There may be trade-offs between time and space complexity as well - for instance, we might have the
choice between a fast algorithm that uses a lot of memory, or a slower one that has lower memory
requirements.
It’s important to be able to evaluate and compare algorithms in order to determine which is most
appropriate for a particular problem.
11.1 Algorithm design paradigms
Algorithms generally fall into a few categories of “design paradigms”:
• decrease-and-conquer: recursively reduce the problem into a smaller and smaller problem
until it can no longer be reduced, then solve that problem.
• divide-and-conquer: break up the problem into subproblems which can be recursively solved,
then combine the results of the subproblems in some way to form the solution for the original
problem.
• greedy: take the solution that looks best at the moment; i.e. always go for the local optimum, in the hope that it may be the global optimum, even though it may not be.
• dynamic programming: the divide-and-conquer paradigm can lead to redundant computations if the same subproblem appears multiple times; dynamic programming involves memoizing ("remembering", i.e. storing in memory) results of computations so that when an identical subproblem is encountered, the solution can just be retrieved from memory.
• brute force: try everything until you get a solution
11.2 Algorithmic Analysis
The number of operations for an algorithm is typically a function of its inputs, typically denoted n.
There are a few kinds of algorithmic analysis:
• worst-case analysis: the upper bound running time that is true for any arbitrary input of
length n
• average-case analysis: assuming that all input are equally likely, the average running time
• benchmarks: runtime on an agreed-upon set of “typical” inputs
Average-case analysis and benchmarks require some domain knowledge about what inputs to expect.
When you want to do a more “general purpose” analysis, worst-case analysis is preferable.
When comparing the performance of algorithms:
• we measure the algorithm in terms of the number of steps (operations) required, which provides
a consistent measurement across machines (otherwise, some machines are more powerful and
naturally perform faster)
• algorithms may perform differently depending on characteristics of input data, e.g. if it is already
partially sorted and so on. So we look at the worst case scenario to compensate for this.
• performance can change depending on the size of the input as well (for example, an algorithm
A which appears slower on a smaller dataset than B may in fact be faster than B on larger
datasets), so we look at algorithmic performance as a function of input size, and look at the
asymptotic performance as the problem size increases.
We focus on asymptotic analysis; that is, we focus on performance for large input sizes.
Note, however, that algorithms which are inefficient for large n may be better on small n when
compared to algorithms that perform well for large n. For example, insertion sort has an upper bound runtime of n²/2, which, for small n (e.g. n < 90), is better than merge sort. This is because constant
factors are more meaningful with small inputs. Anyways, with small n, it often doesn’t really matter
what algorithm you use, since the input is so small, there are unlikely to be significant performance
differences, so analysis of small input sizes is not very valuable (or interesting).
Thus we define a “fast” algorithm as one in which the worst-case running time grows slowly with
input size.
11.2.1 Asymptotic Analysis
With asymptotic analysis, we suppress constant factors and lower-order terms, since they don’t matter
much for large inputs, and because the constant factors can vary quite a bit depending on the
architecture, compiler, programmer, etc.
For example, if we have an algorithm with the upper bound runtime of 6n log2 n + 6n, we would
rewrite it as just n log n (note that log typically implies log2 ).
Then we say the running time is O(n log n), said "big-oh of n log n"; the O implies that we have dropped the constant factors and lower-order terms.
Generally, we categorize big-oh performance by order of growth, which define a set of algorithms that
grow equivalently:
| Order of growth | Name |
|-----------------|------|
| O(1) | constant |
| O(log_b n) | logarithmic (for any b) |
| O(n) | linear |
| O(n log_b n) | n log n |
| O(n²) | quadratic |
| O(n³) | cubic |
| O(c^n) | exponential (for any c) |
The order of growth is determined by the leading term, that is, the term with the highest exponent.
Note for log, the base doesn’t matter because it is equivalent to multiplying by a constant, which we
ignore anyways.
11.2.2 Loop examples
Consider the following algorithm for finding an element in an array:
```python
def func(i, arr):
    for el in arr:
        if el == i:
            return True
    return False
```
This has the running time of O(n) since, in the worst case, it checks every item.
Now consider the following:
```python
def func2(i, arr):
    return func(i, arr), func(i, arr)
```
This still has the running time of O(n), although it has twice the number of operations (i.e. ∼ 2n
operations total), we drop the constant factor 2.
Now consider the following algorithm for checking if two arrays have a common element:
```python
def func3(arr1, arr2):
    for el1 in arr1:
        for el2 in arr2:
            if el1 == el2:
                return True
    return False
```
This has a runtime of O(n2 ), which is called a quadratic time algorithm.
The following algorithm for checking duplicates in an array also has a runtime of O(n2 ), again due
to dropping constant factors:
```python
def func4(arr):
    for i, el1 in enumerate(arr):
        for el2 in arr[i + 1:]:  # only compare against later elements
            if el1 == el2:
                return True
    return False
```
11.2.3 Big-Oh formal definition
Say we have a function T (n), n ≥ 0, which is usually the worst-case running time of an algorithm.
We say that T (n) = O(f (n)) if and only if there exist constants c, n0 > 0 such that T (n) ≤ cf (n)
for all n ≥ n0 .
That is, we can multiply f (n) by some constant c such that there is some value n0 , after which T (n)
is always below cf (n).
For example: we demonstrated that 6n log2 n + 6n is the worst-case running time for merge sort. For
merge sort, this is T (n). We described merge sort’s running time in big-oh notation with O(n log n).
This is appropriate because there exists some constant c we can multiply n log n by such that, after
some input size n0 , cf (n) is always larger than T (n). In this sense, n0 defines a sufficiently large
input.
As a simple example, we can prove that 2^{n+10} = O(2^n).

So the inequality is:

2^{n+10} ≤ c · 2^n

We can re-write this:

2^{10} · 2^n ≤ c · 2^n

Then it's clear that if we set c = 2^{10}, this inequality holds, and it happens to hold for all n, so we can just set n_0 = 1. Thus 2^{n+10} = O(2^n) is in fact true.
11.2.4 Big-Omega notation
T (n) = Ω(f (n)) if and only if there exist constants c, n0 > 0 such that T (n) ≥ cf (n) for all n ≥ n0 .
That is, we can multiply f (n) by some constant c such that there is some value n0 , after which T (n)
is always above cf (n).
11.2.5 Big-Theta notation
T (n) = Θ(f (n)) if and only if T (n) = O(f (n)) and T (n) = Ω(f (n)). That is, T (n) eventually
stays sandwiched between c1 f (n) and c2 f (n) after some value n0 .
11.2.6 Little-Oh notation
Stricter than big-oh, in that this must be true for all positive constants.
T (n) = o(f (n)) if and only if for all constants c > 0, there exists a constant n0 such that T (n) ≤
cf (n) for all n ≥ n0 .
11.3 The divide-and-conquer paradigm
This consists of:
• Divide the problem into smaller subproblems. You do not have to literally divide the problem
in the algorithm’s implementation; this may just be a conceptual step.
• Compute the subproblems using recursion.
• Combine the subproblem solutions into the solution for the original problem.
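Merge sort is a standard example of this paradigm; a sketch:

```python
def merge_sort(arr):
    """Divide-and-conquer: split the array in half, recursively sort each half, then merge."""
    if len(arr) <= 1:  # base case: already sorted
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])
    # combine step: merge the two sorted halves in linear time
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 5, 6]))
```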
11.3.1 The Master Method/Theorem
The Master Method provides an easy way of computing the upper-bound runtime for a divide-and-conquer algorithm, so long as it satisfies the assumptions:
Assume that subproblems have equal size and the recurrence has the format:
• Base case: T(n) ≤ c, where c is a constant, for all sufficiently small n.
• For all larger n: T(n) ≤ a·T(n/b) + O(n^d), where a is the number of recursive calls, a ≥ 1, each subproblem has input size n/b, b > 1, and outside each recursive call, we do some additional O(n^d) of work (e.g. a combine step), parameterized by d, d ≥ 0.

Then the master method states:

T(n) = O(n^d log n)      if a = b^d
T(n) = O(n^d)            if a < b^d
T(n) = O(n^{log_b a})    if a > b^d
Note that the logarithm base does not matter in the first case, since it just changes the leading
constant (which doesn’t matter in Big-Oh notation), whereas in the last case the base does matter,
because it has a more significant effect.
To use the master method, we just need to determine a, b, d.
For example:
With merge sort, there are two recursive calls (thus a = 2), and the input size to each recursive
call is half the original input (thus b = 2), and the combine step only involves the merge operation
(d = 1).
Thus, working out the master method inequality, we get 2 = 2^1, i.e. a = b^d, thus:

T(n) = O(n log n)
Proof of the Master Method
(Carrying over the previously-stated assumptions)
For simplicity, we’ll also assume that n is a power of b, but this proof holds in the general case as
well.
In the recursion tree, at each level j = 0, 1, 2, ..., log_b n, there are a^j subproblems, each of size n/b^j.

At a level j, the total work, not including the work in recursive calls, is:

≤ a^j · c · (n / b^j)^d
Note that the a and b terms are dependent on the level j, but the c, n, d terms are not. We can
rearrange the expression to separate those terms:
≤ c · n^d · (a / b^d)^j
To get the total work, we can sum over all the levels:
≤ c · n^d · ∑_{j=0}^{log_b n} (a / b^d)^j
We can think of a as the rate of subproblem proliferation (i.e. how the number of subproblems grow
with level depth) and bd as the rate of work shrinkage per subproblem.
There are three possible scenarios, corresponding to the master method’s three cases:
• If a < bd , then the amount of work decreases with the recursion level.
• If a > bd , then the amount of work increases with the recursion level.
• If a = bd , then the amount of work stays the same with the recursion level.
If a = b^d, then the summation term in the total work expression, ∑_{j=0}^{log_b n} (a/b^d)^j, simply becomes log_b n + 1, thus the total work upper bound in that case is just c·n^d·(log_b n + 1), which is just O(n^d log n).
A geometric sum for r ≠ 1:

1 + r + r² + r³ + ··· + r^k

can be expressed in the following closed form:

(r^{k+1} − 1) / (r − 1)

If r < 1 is constant, then this is ≤ 1/(1 − r) (i.e. it is some constant independent of k). If r > 1 is constant, then this is ≤ r^k · (1 + 1/(r − 1)), where the last term (1 + 1/(r − 1)) is a constant independent of k.

We can bring this back to the master method by setting r = a/b^d.
If a < b^d, then the summation term in the total work expression becomes a constant (as demonstrated with the geometric sum); thus in Big-Oh, that summation term drops, and we are left with O(n^d).

If a > b^d, then the summation term in the total work becomes a constant times r^k, where k = log_b n, i.e. the summation term becomes a constant times (a/b^d)^{log_b n}. So we get O(n^d (a/b^d)^{log_b n}), which simplifies to O(a^{log_b n}), which ends up being the number of leaves in the recursion tree. This is equivalent to O(n^{log_b a}).
11.4 Data Structures
Data structures are particular ways of organizing data that support certain operations. Structures
have strengths and weaknesses in what kinds of operations they can perform, and how well they can
perform them.
For example: lists, stacks, queues, heaps, search trees, hashtables, bloom filters, etc.
Different data structures are appropriate for different operations (and thus different kinds of problems).
11.4.1 Heaps
A heap (sometimes called a priority queue) is a container for objects that have keys; these keys are
comparable (e.g. we can say that one key is bigger than another).
Supported operations:
• insert: add a new object to the heap, runtime O(log n)
• extract-min: remove the object with the minimum key value (ties broken arbitrarily), runtime
O(log n)
Alternatively, there are max-heaps which return the maximum key value (this can be emulated by a
heap by negating key values such that the max becomes the min, etc).
Sometimes there are additional operations supported:
• heapify: initialize a heap in linear time (i.e. O(n) time, faster than inserting them one-by-one)
• delete: delete an arbitrary element from the middle of the heap in O(log n) time
For example, you can have a heap where events are your objects and their keys are a scheduled time
to occur. Thus when you extract-min you always get the next event scheduled to occur.
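For instance, Python's heapq module provides a binary min-heap over a list; a minimal sketch of this event-scheduling example (the event names and times are made up):

```python
import heapq

# events keyed by their scheduled time; heapq maintains a min-heap over the first tuple element
events = []
heapq.heappush(events, (14.5, 'backup'))       # insert: O(log n)
heapq.heappush(events, (9.0, 'send report'))
heapq.heappush(events, (11.25, 'sync'))
# (heapq.heapify(events) would initialize an existing list in linear time)

while events:
    time, name = heapq.heappop(events)         # extract-min: O(log n)
    print(time, name)                          # events come out in scheduled order
```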
11.4.2 Balanced Binary Search Tree
A balanced binary search tree can be thought of as a dynamic sorted array (i.e. a sorted array which
supports insert and delete operations).
First, consider sorted arrays. They support the following operations:
• search: binary search, runtime O(log n)
• select: select an element by index, runtime O(1)
• min and max: return first and last element of the array (respectively), runtime O(1)
• predecessor and successor: return next smallest and next largest element of the array (respectively), runtime O(1)
• rank: the number of elements less than or equal to a given value, runtime O(log n) (search for the given value and return the position)
• output: output elements in sorted order, runtime O(n) (since they are already sorted)
With sorted arrays, insertion and deletions have O(n) runtime, which is too slow.
If you want logarithmic-time insertions and deletions, we can use a balanced binary search tree.
This supports the same operations as sorted arrays (though some are slower) in addition to faster
insertions and deletions:
• search: runtime O(log n)
• select: runtime O(log n) (slower than sorted arrays)
• min and max: runtime O(log n) (slower than sorted arrays)
• predecessor and successor: runtime O(log n) (slower than sorted arrays)
• rank: runtime O(log n)
• output: runtime O(n)
• insert: runtime O(log n)
• delete: runtime O(log n)
To understand how balanced binary search trees work, first consider binary search trees.
Binary Search Tree
A binary search tree (BST) is a data structure for efficient searching.
The keys that are stored are the nodes of the tree.
Each node has three pointers: one to its parent, one to its left child, and one to its right child.
These pointers can be null (e.g. the root node has no parent).
The search tree property asserts that for every node, all the keys stored in its left subtree should be
less than its key, and all the keys in its right subtree should be greater than its key.
You can also have a convention for handling equal keys (e.g. just put it on the left or the right
subtree).
This search tree property is what makes it very easy to search for particular values.
Note that there are many possible binary search trees for a given set of keys. The same set of keys
could be arranged as a very deep and narrow BST, or as a very shallow and wide one. The worst
case is a depth of about n, which is more of a chain than a tree; the best case is a depth of about
log2 n, which is perfectly balanced.
This search tree property also makes insertion simple. You search for the key to be inserted, which
will fail since the key is not in the tree yet, and you get a null pointer - you just assign that pointer
to point to the new key. In the case of duplicates, you insert it according to whatever convention
you decided on (as mentioned previously). This insert method maintains the search tree property.
Search and insert performance are dependent on the depth of the tree, so at worst the runtime is
O(height).
The min and max operations are simple: go down the leftmost branch for the min key and go down
the rightmost branch for the max key.
The predecessor operation is a bit more complicated. First you search for the key in question. Then,
if the key’s node has a left subtree, just take the max of that subtree. However, if the key does not
have a left subtree, move up the tree through its parent and ancestors until you find a node with a
key less than the key in question. (The successor operation is accomplished in a similar way.)
The deletion operation is tricky. First we must search for the key we want to delete. Then there are
three possibilities:
• the node has no children, so we can just delete it and be done
• the node has one child; we can just delete the node and replace it with its child
• the node has two children; we first compute the predecessor of the node, then swap it with the
node, then delete the node
For the select and rank operations, we can augment our search tree by including additional information
at each node: the size of its subtree, including itself (i.e. number of descendants + 1). Augmenting
the data structure in this way does add some overhead, e.g. we have to maintain the data/keep it
updated whenever we modify the tree.
For select, we want to find the i th value of the data structure. Starting at a node x, say a is the size
of its left subtree. If a = i − 1, return x. If a ≥ i , recursively select the i th value of the left subtree.
If a < i − 1, recursively select the (i − a − 1)th value of the right subtree.
Balanced Binary Search Trees (Red-Black Trees)
Balanced binary search trees have the “best” depth of about log2 n.
There are different kinds of balanced binary search trees (which are all quite similar), here we will
talk about red-black trees (other kinds are AVL trees, splaytrees, B trees, etc).
Red-Black trees maintain some invariants/constraints which are what guarantee that the tree is
balanced:
1. each node stores an additional bit indicating if it is a "red" or "black" node
2. the root is always black
3. never allow two reds in a row (e.g. all of a red node's children are black)
4. every path from the root node to a null pointer (e.g. an unsuccessful search) must go through the same number of black nodes
Consider the following: for a binary search tree, if every root-null path has ≥ k nodes, then the tree includes at the top a perfectly balanced search tree of depth k − 1. Thus there must be at least 2^k − 1 nodes in the tree, i.e. n ≥ 2^k − 1. We can restate this as k ≤ log₂(n + 1).
In a red-black tree, there is a root-null path with at most log2 (n + 1) black nodes (e.g. it can have
a root-null path composed of only black nodes).
The fourth constraint on red-black trees means that every root-null path has ≤ log₂(n + 1) black nodes. The third constraint means that we can never have more red nodes than black nodes in a path (because red nodes can never come one after the other). So at most a root-null path will have ≤ 2·log₂(n + 1) nodes, which gives us a balanced tree.
11.4.3 Hash Tables
Hash tables (also called dictionaries) allow us to maintain a (possibly evolving) set of stuff.
The core operations include:
• insert using a key
• delete using a key
• lookup using a key
When implemented properly, and on non-pathological data, these operations all run in O(1) time.
Hash tables do not maintain any ordering.
Basically, hash tables use some hash function to produce a hash for an object (some number); this
hash is the “address” of the object in the hash table.
More specifically, we have a hash function which gives us a value in some range [0, n]; we have an
array of length n, so the hash function tells us and what index to place some object.
There is a chance of collisions in which two different objects produce the same hash.
There are two main solutions for resolving collisions:

• (separate) chaining: if there is a collision, store the objects together at that index as a list
• open addressing: here, a hash function specifies a sequence (called a probe sequence) instead of a single value. Try the first value; if it's occupied, try the next, and so on.
• one open addressing strategy, linear probing, just has you try the hash value + 1 and keep incrementing by one until an empty bucket is found
• another is double hashing, in which you have two hash functions; you look at the first hash, and if occupied, offset by the second hash until you find an empty bucket
Each is more appropriate in different situations.
The performance of a hash table depends a lot on the particular hash function. Ideally, we want a
hash function that:
• has good performance (i.e. low collisions)
• should be easy to store
• should be fast to evaluate (constant time)
Designing hash functions is as much an art as it is a science. They are quite difficult to design.
A hash table has a load factor (sometimes just called load), denoted α = (num. objects in hash table) / (num. buckets in hash table).
For hash table operations to run constant time, it is necessary that α = O(1). Ideally, it is less than
1, especially with open addressing.
So for good performance, we need to control load. For example, if α passes some threshold, e.g. 0.75,
then we may want to expand the hash table to lower the load.
Every hash function has a “pathological” data set which it performs poorly on. There is no hash
function which is guaranteed to spread out every data set evenly (i.e. have low collisions on any
arbitrary data set). You can often reverse engineer this pathological data set by analyzing the hash
function.
However, for some hash functions, it is “infeasible” to figure out its pathological data set (as is the
case with cryptographic hash functions).
One approach is to define a family of hash functions, rather than just one, and randomly choose
a hash function to use at runtime. This has the property that, on average, you do well across all
datasets.
11.4.4 Bloom Filters
Bloom filters are a variant on hash tables - they are more space efficient, but they allow for some
errors (that is, there’s a chance of a false positive). In some contexts, this is tolerable.
They are more space efficient because they do not actually store the objects themselves. They are
more commonly used to keep track of what objects have been seen so far. They typically do not
support deletions (there are variants that do incorporate deletions but they are more complicated).
There is also a small chance that it will say that it’s seen an object that it hasn’t (i.e. false positives).
Like a hash table, a bloom filter consists of an array A, but each entry in the array is just one bit.
Say we have a set of objects S and a total number of bits n - a bloom filter will use only n/|S| bits per object in S.
We also have k hash functions h1 , . . . , hk (usually k is small).
The insert operation is defined (in pseudocode):

hash_funcs = [...]  # the k hash functions h_1, ..., h_k
for i in range(k):
    A[hash_funcs[i](x)] = 1
That is, we just set the values of those bits to 1.
Thus lookup is just to check that all the corresponding bits for an object are 1.
We can’t have any false negatives because those bits will not have been set and bits are never reset
back to zero.
False positives are possible, however, because some other objects may have in aggregate set the bits
corresponding to another object.
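To make this concrete, here is a minimal bloom filter sketch; deriving the k hash functions from two base hashes is just one reasonable choice among many, and the class is purely illustrative:

import hashlib

class BloomFilter:
    def __init__(self, n_bits=1024, k=3):
        self.n_bits = n_bits
        self.k = k
        self.bits = [0] * n_bits  # each entry is just one bit

    def _hashes(self, obj):
        # derive k hash values from two base hashes
        h1 = int(hashlib.md5(str(obj).encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(str(obj).encode()).hexdigest(), 16)
        return [(h1 + i * h2) % self.n_bits for i in range(self.k)]

    def insert(self, obj):
        for idx in self._hashes(obj):
            self.bits[idx] = 1

    def lookup(self, obj):
        # True may be a false positive; False is always correct
        return all(self.bits[idx] for idx in self._hashes(obj))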
11.5 P vs NP
Consider the following problem:
7 * 13 = ?
This is solved very quickly by a computer (it gives 91).
Now consider the following factoring problem:
? * ? = 91
This is a bit more difficult for a computer to solve, though it will yield the correct answers (7 and 13).
If we consider extremely large numbers, a computer can still very quickly compute their product.
But, given a product, it will take a computer a very, very long time to compute their factors.
In fact, modern cryptography is based on the fact that computers are not good at finding factors for
a number (in particular, prime factors).
This is because computers basically have to use brute force search to identify a factor; with very large
numbers, this search space is enormous (it grows exponentially).
However, once we find a possible solution, it is easy to check that we are correct (e.g. just multiply
the factors and compare the product).
There are many problems which are characterized in this way - they require brute force search to
identify the exact answer (there are often faster ways of getting approximate answers), but once an
answer is found, it can be easily checked if it is correct.
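A small sketch of this asymmetry: brute-force trial division to find a factor, versus a single multiplication to verify a proposed answer.

def find_factor(n):
    # brute force search: try every candidate divisor (slow for large n)
    for d in range(2, n):
        if n % d == 0:
            return d
    return None

def verify_factors(n, a, b):
    # verifying a proposed answer is just one multiplication (fast)
    return a * b == n

print(find_factor(91))            # 7
print(verify_factors(91, 7, 13))  # True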
There are other problems, such as multiplication, where we can easily “jump” directly to the correct
exact answer.
For the problems that require search, it is not known whether or not there is also a method that can
“jump” to the correct answer.
Consider the “needle in the haystack” analogy. We could go through each piece of hay until we find
the needle (brute force search). Or we could use a magnet to pull it out immediately. The question
is open: does this magnet exist for problems like factorization?
Problems which we can quickly solve, like multiplication, are in a family of problems called “P”, which
stands for “polynomial time” (referring to the relationship between the number of inputs and how
computation time increases).
Problems which can be quickly verified to be correct are in a family of problems called “NP”, which
stands for “nondeterministic polynomial time”.
P is a subset of NP, since their answers are quickly verified, but NP also includes the aforementioned
search problems. Thus, a major question is whether or not P = NP, i.e. we have the “magnet” for P
problems, but is there also one for the rest of the NP problems? Or is searching the only option?
11.5.1 NP-hard
Problems which are NP-hard are at least as hard as NP problems, so this includes problems which
may not even be in NP.
11.5.2 NP-completeness
There are some NP problems which all NP problems can be reduced to. Such an NP problem is called
NP-complete.
For example, any NP problem can be reduced to a clique problem (e.g. finding a clique of some arbitrary size in a graph); thus the clique problem is NP-complete. Since any other NP problem can be reduced to the clique problem, if we can find a way of solving the clique problem quickly, we can also solve all of those related problems quickly as well.
To show that a problem is NP-complete, you must:
• Show that it is in NP: that is, show that there is a polynomial time algorithm that can verify
whether or not an answer is correct
• Show that it is NP-hard: that is, reduce some known NP-complete problem to your problem in
polynomial time.
11.6 References
• Algorithms: Design and Analysis, Part 1. Tim Roughgarden. Stanford/Coursera.
• Think Complexity. Version 1.2.3. Allen B. Downey. 2012.
• Beyond Computation: The P vs NP Problem. Michael Sipser, MIT. October 3, 2006.
• Algorithmic Puzzles. Anany Levitin, Maria Levitin. 2011.
Part II
Machine Learning
12 Overview
12.1 Representation vs Learning
• Representation: whether or not a function can be simulated by the model; i.e. is the model
capable of representing a given function?
• Learning: whether or not there exists an algorithm with which the weights can be adjusted to
represent a particular function
12.2 Types of learning
• supervised learning - the learning algorithm is provided with pre-labeled training examples to
learn from.
• unsupervised learning - the learning algorithm is provided with unlabeled examples. Generally,
unsupervised learning is used to uncover some structure of or pattern in the data.
• semi-supervised learning - the learning algorithm is provided with a mixture of labeled and
unlabeled data.
• active learning - similar to semi-supervised learning, but the algorithm can “ask” for extra
labeled data based on what it needs to improve on.
• reinforcement learning - actions are taken and rewarded or penalized in some way and the
goal is maximizing lifetime/long-term reward (or minimizing lifetime/long-term penalty).
12.3 References
• Neural Computing: Theory and Practice (1989). Philip D. Wasserman
13 Supervised Learning
In supervised learning, the learning algorithm is provided some pre-labeled examples (a training set)
to learn from.
In regression problems, you try to predict some continuous valued output (i.e. a real number).
In classification problems, you try to predict some discrete valued output (e.g. categories).
Typical notation:
• m = number of training examples
• x's = input variables or features
• y's = output variables or the “target” variable
• $(x^{(i)}, y^{(i)})$ = the i-th training example
• h = the hypothesis, that is, the function that the learning algorithm learns, taking x's as input and outputting y's
The typical process is:
• Feed training set data into the learning algorithm
• The learning algorithm learns the hypothesis h
• Input new data into h
• Get output from h
The hypothesis can be thought of as the model that you try to learn for a particular task. You then use this model on new inputs, e.g. to make predictions - generalization is how the model performs on new examples; this is the most important consideration in machine learning.
13.1 Basic concepts
• Capacity: the flexibility of a model - that is, the variety of functions it can fit.
– Representational capacity - the functions which the model can learn
– Effective capacity - in practice, a learning algorithm is not likely to find the best function
out of the possible functions it can learn, though it can learn one that performs exceptionally well - those functions that the learning algorithm is capable of finding defines the
model’s effective capacity.
• Hypothesis space: the set of functions the model is limited to learning. For instance, linear regression can be limited to linear functions as its hypothesis space, or it can be expanded to learn polynomials as well, e.g. by introducing an $x^2$ term.
• Hyperparameter: a parameter of a model that is not learned (that is, you specify it yourself)
• Underfitting: when the model could achieve better generalization with more training or capacity. Characterized by a high training error.
• Overfitting: when the model is too tuned to the idiosyncrasies of the training data (for instance, it may fit to sampling error, which we don't want). Too much capacity can lead to overfitting in that the model may be able to learn functions too specific to the data. Characterized by a large gap between the training error and the test error.
• Model selection: the process of choosing the best hyperparameters on a validation set
If the true function is in your hypothesis space H, we say it is realizable in H.
In machine learning, there are generally two kinds of problems: regression and classification problems.
Machine learning algorithms are typically designed for one or the other.
13.1.1 Regression
Regression involves fitting a model to data. The goal is to understand the relationship between one
set of variables - the dependent or response or target or outcome or explained variables (e.g. y )
- and another set - the independent or explanatory or predictor or regressor variables (e.g. X
or x). In cases of just one dependent and one explanatory variable, we have simple regression. In
scenarios with more than one explanatory variable, we have multiple regression. In scenarios with
more than one dependent variable, we have multivariate regression.
With linear regression we expect that the dependent and explanatory variables have a linear relationship; that is, the dependent variable can be expressed as a linear combination of the explanatory variables (plus noise), i.e.:
$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n + \varepsilon$$
for some dependent variable y and explanatory variables $x_1, \dots, x_n$, where ε is the residual due to random variation or other noisy factors.
Of course, we do not know the true values for these β parameters (also called regression coefficients)
so they end up being point estimates as well. We can estimate them as follows.
When given data, one technique we can use is ordinary least squares, sometimes just called least squares regression, which looks for parameters $\beta_0, \dots, \beta_n$ such that the sum of the squared residuals (i.e. the SSE, $e_1^2 + \dots + e_n^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$) is minimized (this minimization requirement is called the least squares criterion). The resulting line is called the least squares line.
Note that linear model does not mean the model is necessarily a straight line. It can be polynomial
as well - but you can think of the polynomial terms as additional explanatory variables; looking at it
this way, the line (or curve, but for consistency, they are all called “lines”) still follows the form above.
And of course, in higher-dimensions (that is, for multiple regression) we are not dealing with lines
but planes, hyperplanes, and so on. But again, for the sake of simplicity, they are all just referred to
as “lines”.
For example, the line:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \varepsilon$$
can be re-written:
$$x_2 = x_1^2$$
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$
When we use a regression model to predict a dependent variable, e.g. y, we denote it as an estimate by putting a hat over it, e.g. ŷ.
13.1.2 Classification
Classification problems are where your target variables are discrete, so they represent categories or
classes.
For binary classification, there are only two classes (that is, y ∈ {0, 1}). We call the 0 class the
negative class, and the 1 class the positive class.
Otherwise, the classification problem is called a multiclass classification problem - there are more
than two classes.
13.2 Optimization
Much of machine learning can be framed as optimization problems - there is some kind of objective
(also called loss or cost) function which we want to optimize (e.g. minimize classification error on the
training set). Typically you are trying to find some parameters for your model, θ, which minimizes
this objective or loss function.
Generally this framework for machine learning is called empirical risk minimization and can be formulated:
$$\arg\min_\theta \frac{1}{n} \sum_i l(f(x^{(i)}; \theta), y^{(i)}) + \lambda \Omega(\theta)$$
Where:
• $f(x^{(i)}; \theta)$ is your model, which outputs some predicted value for the input $x^{(i)}$, and θ are the parameters for the model
• $y^{(i)}$ is the training label (i.e. the ground truth) for the input $x^{(i)}$
• l is the loss function
• Ω(θ) is a regularizer to penalize certain values of θ, and λ is the regularization parameter (see below on regularization)
Some optimization terminology:
• Critical points: $\{x \in \mathbb{R}^n \mid \nabla_x f(x) = 0\}$
• Curvature in direction v: $v^T \nabla_x^2 f(x) v$
• Types of critical points:
– local minima: $v^T \nabla_x^2 f(x) v > 0\ \forall v$, that is, $\nabla_x^2 f(x)$ is positive definite
– local maxima: $v^T \nabla_x^2 f(x) v < 0\ \forall v$, that is, $\nabla_x^2 f(x)$ is negative definite
– saddle point: curvature is positive in some directions and negative in others
13.2.1 Cost functions
So we have our training set $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$ where $y \in \{0, 1\}$, and we have the hypothesis function from before.
Here is the cost function for linear regression:
$$J(\theta) = \frac{1}{m} \sum_{i=1}^m \frac{1}{2} (h_\theta(x^{(i)}) - y^{(i)})^2$$
Note that the $\frac{1}{2}$ is introduced for convenience, so that the square exponent cancels out when we differentiate. Introducing an extra constant doesn't affect the result.
Note that now the $\frac{1}{2m}$ has been split into $\frac{1}{m}$ and $\frac{1}{2}$.
We can extract $\frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2$ and call it $\text{Cost}(h_\theta(x), y)$.
The cost function for logistic regression is different than that used for linear regression because the
hypothesis function of logistic regression causes J(θ) to be non-convex, that is, look something like
the following with many local optima, making it hard to converge on the global minimum.
A saddle point
A non-convex function
So we want to find a way to define $\text{Cost}(h_\theta(x), y)$ such that it gives us a convex $J(\theta)$. We will use:
$$\text{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$
Some properties of this function are that if $h_\theta(x) = y$ then $\text{Cost} = 0$, and (when $y = 1$) as $h_\theta(x) \to 0$, $\text{Cost} \to \infty$.
We can rewrite the cost in a form more conducive to gradient descent:
$$\text{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))$$
So our entire cost function is:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)})) \right]$$
You could use other cost functions for logistic regression, but this one is derived from the principle
of maximum likelihood estimation and has the nice property of being convex, so this is the one that
basically everyone uses for logistic regression.
Then we can calculate $\min_\theta J(\theta)$ with gradient descent by repeating and simultaneously updating:
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}$$
This looks exactly the same as the linear regression gradient descent algorithm, but it is different because $h_\theta(x)$ is now the nonlinear $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$. Still, the previous methods for gradient descent (feature scaling and learning rate adjustment) apply here.
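A minimal numpy sketch of this cost function and gradient update, assuming X already includes the x0 = 1 column (the function names here are just illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    # J(theta) = -1/m * sum[y*log(h) + (1-y)*log(1-h)]
    m = y.size
    h = sigmoid(X.dot(theta))
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_step(theta, X, y, alpha=0.1):
    # simultaneous update of all theta_j
    m = y.size
    h = sigmoid(X.dot(theta))
    grad = (1.0 / m) * X.T.dot(h - y)
    return theta - alpha * grad

# toy usage
X = np.array([[1, 0.5], [1, 1.5], [1, 3.0]])  # first column is x0 = 1
y = np.array([0, 0, 1])
theta = np.zeros(2)
for _ in range(1000):
    theta = gradient_step(theta, X, y)
print(logistic_cost(theta, X, y))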
13.2.2 Gradient Descent
Gradient descent is an optimization algorithm for finding parameter values which minimize a cost
function.
Gradient descent is perhaps the most common optimization algorithm in machine learning.
So we have some cost function J(θ0 , θ1 , . . . , θn ) and we want to minimize it.
The general approach is:
• Start with some θ0 , θ1 , . . . , θn .
• Change θ0 , θ1 , . . . , θn in some increment/step to reduce J(θ0 , θ1 , . . . , θn ) as much as possible.
• Repeat the previous step until convergence on a minimum (hopefully)
Gradient descent algorithm
Repeat the following until convergence (note that := is the assignment operator):
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1, \dots, \theta_n)$$
for each j in 0, . . . , n.
Every θj is updated simultaneously. So technically, you'd calculate this value for each j, and only after they have all been computed would you actually update each θj.
For example, if the right-hand side of that equation was a function func(j, t0, t1), you would
implement it like so (example is n = 2):
temp_0 = func(0, theta_0, theta_1)
temp_1 = func(1, theta_0, theta_1)
theta_0 = temp_0
theta_1 = temp_1
α is the learning rate and tells how large a step/increment to change the parameters by.
Learning rates which are too small cause the gradient descent to go slowly. Learning rates which are
too large can cause the gradient descent to overshoot the minimum, and in those cases it can fail to
converge or even diverge.
The partial derivative on the right is just the rate of change from the current value.
13.2.3 Normal Equation
The normal equation is an approach which allows for the direct determination of an optimal θ without
the need for an iterative approach like gradient descent.
With calculus, you find the optimum of a function by calculating where its derivatives equal 0 (the
intuition is that derivatives are rates of change, when the rate of change is zero, the function is
“turning around” and is at a peak or valley).
So we can take the same cost function we’ve been using for linear regression and take the partial
derivatives of the cost function J with respect to every parameter of θ and then set each of these
partial derivatives to 0:
$$J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$$
And for each j:
$$\frac{\partial}{\partial \theta_j} J(\theta) = \dots = 0$$
Then solve for θ0 , θ1 , . . . , θn .
The fast way to do this is to construct a matrix out of your features, including a column for x0 = 1
(so it ends up being an m × (n + 1) dimensional matrix) and then construct a vector out of your
target variables y (which is an m-dimensional vector):
If you have m examples, (x (1) , y (1) ), . . . , (x (m) , y (m) ), and n features and then include x0 = 1, you
have the following feature vectors:
$$x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1}$$
From which we can construct X, known as the design matrix:
$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}$$
That is, the design matrix is composed of the transposes of the feature vectors for all the training
examples. Thus a column in the design matrix corresponds to a feature, and each row corresponds
to an example.
Typically, all examples have the same length, but this may not necessarily be the case. You may have,
for instance, images of different dimensions you wish to classify. This kind of data is heterogeneous.
And then the vector y is just all of the labels from your training data:
$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$
Then you can calculate the θ vector which minimizes your cost function like so:
$$\theta = (X^T X)^{-1} X^T y$$
With this method, feature scaling isn’t necessary.
Note that it’s possible that X T X is not invertible (that is, it is singular, also called degenerate), but
this is usually due to redundant features (e.g. having a feature in feet and in meters; they communicate
the same information) or having too many features (e.g. m ≤ n), in which case you should delete
some features or use regularization.
Libraries which calculate matrix inverses often provide a pseudoinverse routine that can compute the optimal θ vector even if $X^T X$ is not invertible.
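A minimal numpy sketch of the normal equation, using the pseudoinverse (np.linalg.pinv) so it still works when X^T X is singular; the toy data here is purely illustrative:

import numpy as np

# toy data: m = 4 examples, 1 feature, plus a column of 1s for x0
X = np.array([[1, 1.0], [1, 2.0], [1, 3.0], [1, 4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

# theta = (X^T X)^-1 X^T y, computed with the pseudoinverse for robustness
theta = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
print(theta)         # learned [intercept, slope]
print(X.dot(theta))  # predictions on the training inputs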
13.2.4 Deciding between Gradient Descent and the Normal Equation
• Gradient Descent
– requires that you choose α
– needs many iterations
– works well when n is large
• Normal Equation
– don’t need to choose α
– don’t need to iterate
– slow if n is very large (computing $(X^T X)^{-1}$ has a complexity of $O(n^3)$), but is usually ok up until around n = 10000
Also note that for some learning algorithms, the normal equation is not applicable, whereas gradient
descent still works.
13.2.5 Advanced optimization algorithms
There are other advanced optimization algorithms, such as:
• Conjugate gradient
• BFGS
• L-BFGS
These shouldn’t be implemented on your own since they require an advanced understanding of numerical computing, even just to understand what they’re doing.
They are more complex, but (in the context of machine learning) there’s no need to manually pick a
learning rate α and they are often faster than gradient descent. So you can take advantage of them
via some library which has them implemented (though some implementations are better than others).
13.3 Preprocessing
Prior to applying a machine learning algorithm, data almost always must be preprocessed, i.e. prepared
in a way that helps the algorithm perform well (or in a way necessary for the algorithm to work at
all).
Preprocessing
13.3.1 Feature selection
Good features:
• lead to data compression
• retain relevant information
• are based on expert domain knowledge
Common mistakes:
• trying to automate feature selection
• not paying attention to data-specific quirks
• throwing away information unnecessarily
The model where you include all available features is called the full model. But sometimes including
all features can hurt prediction accuracy.
There are a few feature selection strategies that can be used.
One class of selection strategies is called stepwise selection because they iteratively remove or add
one feature at a time, measuring the goodness of fit for each. The two approaches here are the
backward-elimination strategy which begins with the full model and removes one feature at a time,
and the forward-selection strategy which is the reverse of backward-elimination, starting with one
feature and adding the rest one at a time. These two strategies don’t necessarily lead to the same
model.
13.3.2 Feature engineering
Your data may have features explicitly present, e.g. a column in a database. But you can also design
or engineer new features by combining these explicit features or through observing patterns on your
own in the data that haven’t yet been explicitly encoded. We’re doing a form of this in polynomial
regression above by encoding the polynomials as new features.
Representation
A very important choice in machine learning is how you represent the data. What are its salient
features, and in what form is it best presented? Each field in the data (e.g. column in the table)
is a feature and a great deal of time is spent getting this representation right. The best machine
learning algorithms can’t do much if the data isn’t represented in a way suited to the task at hand.
Sometimes it’s not clear how to represent data. For instance, in identifying an image of a car, you
may want to use a wheel as a feature. But how do you define a wheel in terms of pixel values?
Representation learning is a kind of machine learning in which representations themselves can be
learned.
An example representation learning algorithm is the autoencoder. It’s a combination of an encoder
function that converts input data into a different representation and a decoder function which converts
the new representation back into its original format.
Successful representations separate the factors of variations (that is, the contributors to variability)
in the observed data. These may not be explicit in the data, “they may exist either as unobserved
objects or forces in the physical world that affect the observable quantities, or they are constructs in
the human mind that provide useful simplifying explanations or inferred causes of the observed data.”
(Deep Learning).
Deep Learning
Deep learning builds upon representation learning. It involves having the program learn some hierarchy
of concepts, such that simpler concepts are used to construct more complicated ones. This hierarchy
of concepts forms a deep (many-layered) graph, hence “deep learning”.
With deep learning we can have simpler representations aggregate into more complex abstractions.
A basic example of a deep learning model is the multilayer perceptron (MLP), which is essentially a
function composed of simpler functions (layers); each function (i.e. layer) can be thought of as taking
the input and outputting a new representation of it.
For example, if we trained an MLP for image recognition, the first layer may end up learning representations of edges, the next may see corners and contours, and the next may identify higher-level features like faces, etc.
13.3.3 Scaling (normalization)
If you design your features such that they are on a similar scale, gradient descent can converge more
quickly.
For example, say you are developing a model for predicting the price of a house. Your first feature
may be the area, ranging from 0-2000 sqft, and your second feature may be the number of bedrooms,
ranging from 1-5.
These two ranges are very disparate, causing the contours of the cost function to be such that the
gradient descent algorithm jumps around a lot trying to find an optimum.
CHAPTER 13. SUPERVISED LEARNING
321
13.3. PREPROCESSING
322
If you scale these features such that they share the same (or at least a similar) range, you avoid this
problem.
More formally, with feature scaling you want to get every feature into approximately a $-1 \le x_i \le 1$ range (it doesn't necessarily have to be between -1 and 1, just so long as there is a consistent range across your features).
With feature scaling, you could also apply mean normalization, where you replace $x_i$ with $x_i - \mu_i$ (that is, replace the value of the i-th feature with its value minus the mean value for that feature) such that the mean of that feature is shifted to be about zero (note that you wouldn't apply this to $x_0 = 1$).
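A small numpy sketch of mean normalization plus scaling by the feature range (dividing by the standard deviation is another common choice):

import numpy as np

def scale_features(X):
    # X is (m examples) x (n features), without the x0 = 1 column
    mu = X.mean(axis=0)                  # per-feature mean
    rng = X.max(axis=0) - X.min(axis=0)  # per-feature range
    return (X - mu) / rng, mu, rng

X = np.array([[2000.0, 3], [1200.0, 2], [850.0, 1]])  # e.g. sqft, bedrooms
X_scaled, mu, rng = scale_features(X)
print(X_scaled)  # features now centered near 0 with comparable ranges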
13.3.4 Mean subtraction
Mean subtraction centers the data around the origin (i.e. it “zero-centers” it), simply by subtracting
each feature’s mean from itself.
13.3.5 Dimensionality Reduction
Sometimes some of your features may be redundant. You can combine these features in such a way
that you project your higher dimension representation into a lower dimension representation while
minimizing information loss. With the reduction in dimensionality, your algorithms will run faster.
The most common technique for dimensionality reduction is principal component analysis (PCA),
although other techniques, such as non-negative matrix factorization (NMF) can be used.
Principal Component Analysis (PCA)
Say you have some data. This data has two dimensions, but you could more or less capture it in one
dimension:
Reducing data dimensionality with PCA
Most of the variability of the data happens along that axis.
This is basically what PCA does.
PCA is the most commonly used algorithm for dimensionality reduction. PCA tries to identify a lower-dimensional surface to project the data onto such that the squared projection error is minimized.
PCA example
PCA might project the data points onto the green line on the left. The projection errors are the blue lines. Compare to the line on the right - PCA would not project the data onto that line since the projection error is much larger for that line.
This example is going from 2D to 1D, but you can use PCA to project from any n-dimension to a
lower k-dimension. Using PCA, we find some k vectors and project our data onto the linear subspace
spanned by this set of k vectors.
Note that this is different than linear regression, though the example might look otherwise. In PCA,
the projection error is orthogonal to the line in question. In linear regression, it is vertical to the line.
Linear regression also favors the target variable y whereas PCA makes no such distinction.
Prior to PCA you should perform mean normalization (i.e. ensure every feature has zero mean) on
your features and scale them.
First you compute the covariance matrix, which is denoted Σ (the same symbol as summation, unfortunately):
$$\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)})(x^{(i)})^T$$
Then, you compute the eigenvectors of the matrix Σ using singular value decomposition:
$$[U, S, V] = \text{svd}(\Sigma)$$
The resulting U matrix will be an n × n orthogonal matrix which provides the projected vectors you’re
looking for, so take the first k column vectors of U. This n × k matrix can be called Ureduce , which
you then transpose to get these vectors as rows, resulting in a k × n matrix which you then multiply
by your feature matrix.
So how do you choose k, the number of principal components?
One way to choose k is so that most of the variance is retained.
If the average squared projection error (which is what PCA tries to minimize) is:
$$\frac{1}{m} \sum_{i=1}^m \|x^{(i)} - x_{\text{approx}}^{(i)}\|^2$$
And the total variation in the data is given by:
$$\frac{1}{m} \sum_{i=1}^m \|x^{(i)}\|^2$$
Then you would choose the smallest value of k such that:
$$\frac{\frac{1}{m} \sum_{i=1}^m \|x^{(i)} - x_{\text{approx}}^{(i)}\|^2}{\frac{1}{m} \sum_{i=1}^m \|x^{(i)}\|^2} \le 0.01$$
That is, so that 99% of variance is retained.
This procedure for selecting k is made much simpler if you use the S matrix from the svd(Σ) function. The S matrix's only non-zero values are along its diagonal, $S_{11}, S_{22}, \dots, S_{nn}$. Using this you can instead just calculate:
$$1 - \frac{\sum_{i=1}^k S_{ii}}{\sum_{i=1}^n S_{ii}} \le 0.01$$
Or, to put it another way:
$$\frac{\sum_{i=1}^k S_{ii}}{\sum_{i=1}^n S_{ii}} \ge 0.99$$
In practice, you can often reduce the dimensionality quite drastically, such as by 5 or 10 times (e.g. from 10,000 features to 1,000), and still retain most of the variance.
But you should not use PCA prematurely - first try an approach without it, then later you can see if
it helps.
The process of using principal component analysis (PCA) to reduce dimensionality of data is called
factor analysis.
In factor analysis, the retained principal components are called common factors and their correlations
with the input variables are called factor loadings.
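A minimal numpy sketch of this whole procedure (covariance, SVD, choosing k by retained variance, and projecting), following the formulas above and assuming X has already been mean-normalized and scaled; the function and variable names are just illustrative:

import numpy as np

def pca(X, retained=0.99):
    # X is (m examples) x (n features), already mean-normalized/scaled
    m = X.shape[0]
    Sigma = (1.0 / m) * X.T.dot(X)    # covariance matrix
    U, S, V = np.linalg.svd(Sigma)    # S holds the diagonal values S_ii

    # choose the smallest k that retains the desired fraction of variance
    ratios = np.cumsum(S) / np.sum(S)
    k = int(np.searchsorted(ratios, retained)) + 1

    U_reduce = U[:, :k]               # n x k
    Z = X.dot(U_reduce)               # project the data onto k dimensions
    return Z, U_reduce, k

X = np.random.randn(100, 10)
X = X - X.mean(axis=0)
Z, U_reduce, k = pca(X)
print(k, Z.shape)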
PCA becomes more reliable the more data you have. The number of examples must be larger than
the number of variables in the input matrix. The assumptions of linear correlation must hold as well
(i.e. that the variables must be linearly related).
PCA Whitening
You can go a step further with the resulting U matrix (with only the k chosen components) with
PCA whitening, which can improve the training process.
PCA whitening is used to decorrelate features and equalize the variance of the features.
Thus the first step is to decorrelate the original data X, which is accomplished by rotating it:
$$X_{\text{rotated}} = U \cdot X$$
Then the data is normalized to have a variance of 1 for all of its components. To do so we just divide each component by the square root of its eigenvalue. An epsilon value is included to prevent division by zero:
$$X_{\text{whitened}} = \frac{X_{\text{rotated}}}{\sqrt{S + \epsilon}}$$
13.3.6 Bagging (“Bootstrap aggregating”)
Basic idea: Generate more data from your existing data by resampling
Bagging (stands for Bootstrap Aggregation) is the way decrease the variance of your
prediction by generating additional data for training from your original dataset using
combinations with repetitions to produce multisets of the same cardinality/size as your
original data. By increasing the size of your training set you can’t improve the model
predictive force, but just decrease the variance, narrowly tuning the prediction to expected
outcome. - http://stats.stackexchange.com/a/19053/55910
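A small sketch of the resampling idea: draw bootstrap samples (sampling with replacement) and average the predictions of a model trained on each. Here fit_and_predict is a hypothetical stand-in for whatever base model you use:

import numpy as np

def bootstrap_sample(X, y):
    # sample m indices with replacement (a multiset of the same size)
    idx = np.random.randint(0, len(y), size=len(y))
    return X[idx], y[idx]

def bagged_predictions(X, y, X_new, fit_and_predict, n_models=50):
    # train one model per bootstrap sample and average their predictions
    preds = []
    for _ in range(n_models):
        Xb, yb = bootstrap_sample(X, y)
        preds.append(fit_and_predict(Xb, yb, X_new))
    return np.mean(preds, axis=0)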
13.4 Linear Regression
13.4.1 Univariate (simple) Linear Regression
Univariate linear regression or simple linear regression (SLR) is linear regression with a single variable.
In univariate linear regression, we have one input variable x.
The hypothesis takes the form:
$$h_\theta(x) = \theta_0 + \theta_1 x$$
Where the θi s are the parameters that the learning algorithm learns.
This should look familiar: it’s just a line.
13.4.2 How are the parameters determined?
The general idea is that you want to choose your parameters so that $h_\theta(x)$ is close to y for your training examples (x, y). This can be written:
$$\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$$
To make the math easier, you multiply everything by $\frac{1}{2m}$ (this won't affect the resulting parameters):
$$\frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$$
This is the cost function (or objective function). In this case, we call it J, which looks like:
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2$$
Here J is the squared error function - it is probably the most commonly used cost function for regression problems.
The squared error loss function is not the only loss function available. There are a variety you can
use, and you can even come up with your own if needed. Perhaps, for instance, you want to weigh
positive errors more than negative errors.
We want to find (θ0 , θ1 ) to minimize J(θ0 , θ1 ).
Gradient Descent for Univariate Linear Regression
For univariate linear regression, the derivatives are:
$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})$$
$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)}$$
so overall, the algorithm involves repeatedly updating:
An example cost function with two parameters
The same cost function, visualized as a contour plot
$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})$$
$$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)}$$
Remember that the θ parameters are updated simultaneously.
Note that because we are summing over all training examples for each step, this particular type of
gradient descent is known as batch gradient descent. There are other approaches which only sum
over a subset of the training examples for each step.
Univariate linear regression’s cost function is always convex (“bowl-shaped”), which has only one
optimum, so gradient descent int his case will always find the global optimum.
A convex function
13.4.3 Multivariate linear regression
Multivariate linear regression is simply linear regression with multiple variables. This technique is for
using multiple features with linear regression.
Say we have:
• n = number of features
• $x^{(i)}$ = the input features of the i-th training example
• $x_j^{(i)}$ = the value of feature j in the i-th training example
Instead of the simple linear regression model we can use a generalized linear model (GLM). That
is, the hypothesis h will take the form of:
$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$
For convenience of notation, you can define $x_0 = 1$ and notate your features and parameters as zero-indexed (n + 1)-dimensional vectors:
$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$$
And the hypothesis can be re-written as:
$$h_\theta(x) = \theta^T x$$
Sometimes in multiple regression you may have predictor variables which are correlated with one
another; we say that these predictors are collinear.
Gradient descent with Multivariate Linear Regression
The previous gradient descent algorithm for univariate linear regression is just generalized (this is still
repeated and simultaneously updated):
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}$$
13.4.4 Example implementation of linear regression with gradient descent
"""
- X = feature vectors
- y = labels/target variable
- theta = parameters
- hyp = hypothesis (actually, the vector computed from the hypothesis function)
"""
import numpy as np

def cost_function(X, y, theta):
    """
    This isn't used, but shown for clarity
    """
    m = y.size
    hyp = np.dot(X, theta)
    sq_err = sum(pow(hyp - y, 2))
    return (0.5 / m) * sq_err

def gradient_descent(X, y, theta, alpha=0.01, iterations=10000):
    m = y.size
    for _ in range(iterations):
        # hypothesis computed once per iteration, so all theta_j use the
        # same (old) parameter values, i.e. a simultaneous update
        hyp = np.dot(X, theta)
        for j in range(theta.size):
            err = (hyp - y) * X[:, j]
            cost_function_derivative = (1.0 / m) * err.sum()
            theta[j] = theta[j] - alpha * cost_function_derivative
    return theta

if __name__ == '__main__':
    def true_function(X):
        # Create random parameters for X's dimensions, plus one for x0.
        true_theta = np.random.rand(X.shape[1] + 1)
        return true_theta[0] + np.dot(true_theta[1:], X.T), true_theta

    # Create some random data
    n_samples = 20
    n_dimensions = 5
    X = np.random.rand(n_samples, n_dimensions)
    y, true_theta = true_function(X)

    # Add a column of 1s for x0
    ones = np.ones((n_samples, 1))
    X = np.hstack([ones, X])

    # Initialize parameters
    theta = np.zeros(n_dimensions + 1)

    # Split data
    X_train, y_train = X[:-1], y[:-1]
    X_test, y_test = X[-1:], y[-1:]

    # Estimate parameters
    theta = gradient_descent(X_train, y_train, theta, alpha=0.01, iterations=10000)

    # Predict
    print('true', y_test)
    print('pred', np.dot(X_test, theta))
    print('true theta', true_theta)
    print('pred theta', theta)
13.4.5 Outliers
Outliers can pose a problem for fitting a regression line. Outliers that fall horizontally away from the
rest of the data points can influence the line more, so they are called points with high leverage.
Any such point that actually does influence the line’s slope is called an influential point. You can
examine this effect by removing the point and then fitting the line again and seeing how it changes.
Outliers should only be removed with good reason - they can still be useful and informative and a
good model will be able to capture them in some way.
13.4.6 Polynomial Regression
Your data may not fit a straight line and might be better described by a polynomial function, e.g. $\theta_0 + \theta_1 x + \theta_2 x^2$ or $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$.
A trick to this is that you can write this in the form of plain old multivariate linear regression. You would, for example, just treat x as a feature $x_1$, $x^2$ as another feature $x_2$, $x^3$ as another feature $x_3$, and so on:
$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n$$
Note that in situations like this, feature scaling is very important because these features’ ranges differ
by a lot due to the exponents.
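A small sketch of this trick: expand x into polynomial features and then reuse any multivariate linear regression routine (such as the gradient_descent function above); the helper name here is just illustrative:

import numpy as np

def polynomial_features(x, degree):
    # turn a single feature x into columns [x, x^2, ..., x^degree]
    return np.column_stack([x ** d for d in range(1, degree + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0])
X_poly = polynomial_features(x, degree=3)
print(X_poly)  # each row: [x, x^2, x^3]; remember to scale these features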
13.5 Logistic Regression
Logistic regression is a common approach to classification. The name “regression” may be a bit
confusing - it is a classification algorithm, though it returns a continuous value. In particular, it
returns the probability of the positive class; if that probability is ≥ 0.5, then the positive label is
returned.
Logistic regression outputs a value between zero and one (that is, 0 ≤ hθ (x) ≤ 1).
Say we have our hypothesis function
$$h_\theta(x) = \theta^T x$$
With logistic regression, we apply an additional function g:
$$h_\theta(x) = g(\theta^T x)$$
where
$$g(z) = \frac{1}{1 + e^{-z}}$$
This function is known as the sigmoid function, also known as the logistic function, with the form:
The sigmoid or logistic function
So in logistic regression, the hypothesis ends up being:
$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$
The output of the hypothesis is interpreted as the probability of the given input belonging to the positive class, that is:
$$h_\theta(x) = P(y = 1 \mid x; \theta)$$
Which is read: “the probability that y = 1 given x, as parameterized by θ”.
Since we are classifying input, we want to output a label, not a continuous value. So we might say
y = 1 if hθ (x) ≥ 0.5 and y = 0 if hθ (x) < 0.5. The line that forms this divide is an example of
a decision boundary. Note that decision boundaries can be non-linear as well (e.g. they could be a
circle or something).
more on Logistic Regression
Logistic regression is also a GLM - you’re fitting a line which models the probability of being in
the positive class. We can use the Bernoulli distribution since it models events with two possible
outcomes and is parameterized by only the probability of the positive outcome, p. Thus our line
would look something like:
$$p_i = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n + \epsilon$$
But to represent a probability, y values must be bound to [0, 1]. Currently, our model can be linear or polynomial and thus can output any continuous value. So we have to apply a transformation to constrain y; we do so by applying a logit transformation:
$$\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = x$$
The $\frac{p}{1-p}$ term (the odds) maps probabilities to positive values, and the log operation then maps those onto the whole real line, so the transformed value can be modeled by an unconstrained linear predictor.
The inverse of the logit transformation is:
$$p = \frac{1}{1 + \exp(-x)}$$
So the model is now:
$$\text{logit}(p) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n + \epsilon$$
So the likelihood here is:
$$L(y \mid p) = \prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1 - y_i}$$
And the log likelihood then is:
$$l(y \mid p) = \sum_{i=1}^n y_i \log(p_i) + (1 - y_i) \log(1 - p_i)$$
even more on Logistic Regression
Linear regression is good for explaining continuous dependent variables. But for discrete variables,
linear regression gives ambiguous results - what does a fractional result mean? It can’t be interpreted
as a probability because linear regression models are not bound to [0, 1] as probability functions must
be.
When dealing with boolean/binary dependent variables you can use logistic regression. When
dealing with non-binary discrete dependent variables, you can use Poisson regression (which is a
GLM that uses the log link function).
So we expect the logistic regression function to output a probability. In linear regression, the model
can output any value, not bound to [0, 1]. So for logistic regression we apply a transformation, most
commonly the logit transformation, so that our resulting values can be interpreted as probability:
$$\text{transformation}(p) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$
$$\text{logit}(p) = \log_e\left(\frac{p}{1-p}\right)$$
So if we solve the original regression equation for p, we end up with:
$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n}}$$
Logistic regression does not have a closed form solution - that is, it can’t be solved in a finite number of
operations, so we must estimate its parameters using other methods, more specifically, we use iterative
methods. Generally the goal is to find the maximum likelihood estimate (MLE), which is the set
of parameters that maximizes the likelihood of the data. So we might start with random guesses for the parameters, then compute the likelihood of our data based on these parameters (that is, we compute the probability of each data point; the likelihood of the data is the product of these individual probabilities). We iterate until we find the parameters which maximize this likelihood.
13.5.1 One-vs-All
The technique of one-vs-all (or one-vs-rest) involves dividing your training set into multiple binary
classification problems, rather than as a single multiclass classification problem.
For example, say you have three classes 1, 2, 3. Your first binary classifier will distinguish between class 1 and classes 2 and 3, your second binary classifier will distinguish between class 2 and classes 1 and 3, and your final binary classifier will distinguish between class 3 and classes 1 and 2.
Then to make the prediction, you pick the class i which maximizes $h_{\theta^{(i)}}(x)$.
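A small sketch of one-vs-all on top of any binary classifier that outputs probabilities; fit_binary and predict_proba are hypothetical stand-ins for whatever binary learner you use (e.g. the logistic regression functions sketched earlier):

import numpy as np

def one_vs_all_fit(X, y, classes, fit_binary):
    # train one binary classifier per class: class c vs. everything else
    return {c: fit_binary(X, (y == c).astype(int)) for c in classes}

def one_vs_all_predict(X_new, models, predict_proba):
    classes = list(models.keys())
    # probability of the positive class from each binary classifier
    probs = np.column_stack([predict_proba(models[c], X_new) for c in classes])
    # pick the class whose classifier is most confident
    return np.array(classes)[np.argmax(probs, axis=1)]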
13.6 Softmax regression
Softmax regression generalizes logistic regression to beyond binary classification (i.e. multinomial
classification; that is, there are more than just two possible classes). Logistic regression is the reduced
form of softmax regression where k = 2 (thus logistic regression is sometimes called a “binary Softmax
classifier”). As is with logistic regression, softmax regression outputs probabilities for each class. As
a generalization of logistic regression, softmax regression can also be expressed as a generalized linear
model. It generally uses a cross-entropy loss function.
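A minimal numpy sketch of the softmax function itself and the cross-entropy loss for one example (subtracting the max before exponentiating is a standard numerical-stability trick, not part of the definition):

import numpy as np

def softmax(scores):
    # scores: one raw score per class; output: probabilities summing to 1
    exp = np.exp(scores - np.max(scores))  # shift for numerical stability
    return exp / exp.sum()

def cross_entropy(probs, true_class):
    # negative log probability assigned to the correct class
    return -np.log(probs[true_class])

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, cross_entropy(probs, true_class=0))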
13.6.1 Hierarchical Softmax
In the case of many, many classes, the hierarchical variant of softmax may be preferred. In hierarchical
softmax, the labels are structured as a hierarchy (a tree). A Softmax classifier is trained for each
node of the tree, to distinguish the left and right branches.
13.7 Generalized linear models (GLMs)
There is a class of machine learning models known as generalized linear models (GLMs) because they
are expressed as a linear combination of parameters, i.e.
$$\hat{y} = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$$
We can use linear models for non-regression situations, as we do with logistic regression - that is,
when the output variable is not an unbounded continuous value directly computed from the inputs
(that is, the output variable is not a linear function of the inputs), such as with binary or other kinds
of classification. In such cases, the linear models we used are called generalized linear models. Like
any linear function, we get some value from our inputs, but we then also apply a link function which
transforms the resulting value into something we can use. Another way of putting it is that these
link functions allow us to generalize linear models to other situations.
Linear regression also assumes homoscedasticity; that is, that the variance of the error is uniform
along the line. GLMs do not need to make this assumption; the link function transforms the data to
satisfy this assumption.
For example, say you want to predict whether or not someone will buy something - this is a binary
classification and we want either a 0 or a 1. We might come up with some linear function based on
income and number of items purchased in the last month, but this won’t give us a 0/no or a 1/yes,
it will give us some continuous value. So then we apply some link function of our choosing which
turns the resulting value to give us the probability of a 1/yes.
Linear regression is also a GLM, where the link function is the identity function.
Logistic regression uses the logit link function.
Logistic regression is a type of model called a generalized linear model (GLM), which involves two
steps:
1. Model the response variable with a probability distribution.
2. Model the distribution’s parameters using the predictor variables and a special form of multiple
regression.
This probability distribution is taken from the exponential family of probability distributions, which includes the normal, Bernoulli, beta, gamma, Dirichlet, and Poisson distributions (among others). A distribution is in the exponential family if it can be written in the form:
$$P(y \mid \eta) = b(y) \exp(\eta^T T(y) - a(\eta))$$
η is known as the natural parameter or the canonical parameter of the distribution, T(y) is the sufficient statistic, which is often just T(y) = y, and a(η) is the log partition function.
We can set T, a, b to define a family of distributions; this family is parameterized by η, with different
values giving different distributions within the family.
For instance, the Bernoulli distribution is in the exponential family, where
$$\eta = \log\left(\frac{p}{1-p}\right), \quad T(y) = y, \quad a(\eta) = -\log(1-p), \quad b(y) = 1$$
Same goes for the Gaussian distribution, where
$$\eta = \mu, \quad T(y) = y, \quad a(\eta) = \frac{\mu^2}{2}, \quad b(y) = \frac{1}{\sqrt{2\pi}} \exp\left(\frac{-y^2}{2}\right)$$
Note that with linear models, you should avoid extrapolation, that is, estimating values which are outside the original data's range. For example, if you have data in some range $[x_1, x_n]$, you have no guarantee that your model behaves correctly at $x < x_1$ and $x > x_n$.
13.7.1 Linear Mixed Models (Mixed Models/Hierarchical Linear Models)
In a linear model there may be mixed effects, which includes fixed and random effects. Fixed effects
are variables in your model where their coefficients are fixed (non-random). Random effects are
variables in your model where their coefficients are random.
For example, say you want to create a model for crop yields given a farm and amount of rainfall. We
have data from several years and the same farms are represented multiple times throughout. We could
consider that some farms may be better at producing greater crop yields given the same amount of
rainfall as another farm. So we expect that samples from different farms will have different variances
- e.g. if we look at just farm A’s crop yields, that sample would have different variance than if we just
looked at farm B’s crop yields. In this regard, we might expect that models for farm A and farm B
will be somewhat different.
The naive approach would be to just ignore differences between farms and consider only rainfall as a
fixed effect (i.e. with a fixed/constant coefficient). This is sometimes called “pooling” because we’ve
lumped everything (in our case, all the farms) together.
We could create individual models for each farm (“no pooling”) but perhaps for some farms we only
have one or two samples. For those farms, we’d be building very dubious models since their sample
sizes are so small. The information from the other farms are still useful for giving us more data to
work with in these cases, so no pooling isn’t necessarily a good approach either.
We can use a mixed model (“partial pooling”) to capture this and make it so that the rainfall coefficient is random, varying by farm.
more…from another source
We may run into situations like the following:
A situation where an HLM might be better
Where our data seems to encompass multiple models (the red, green, blue, and black ones going up from left to right), but if we try to model them all simultaneously, we get a completely incorrect model (the dark grey line going down from left to right).
Each of the true lines (red, green, blue, black) may come from distinct units, i.e. each could represent
a different part of the day or a different US state, etc. When there are different effects for each unit,
we say that there is unit heterogeneity in the data.
In the example above, each line has a different intercept. But the slopes could be different, or both
the intercepts and slopes could be different:
Varying slopes and intercepts
In this case, we use a random-effects model because some of the coefficients are random.
For instance, in the first example above, the intercepts varied, in which case the intercept coefficient
would be replaced with a random variable $\alpha_i$ drawn from the normal distribution:
$$y = \alpha_i + \beta_i x + \epsilon$$
Or in the case of the slopes varying, we’d say that β i is a random variable drawn from the normal
distribution. In each case, α is the mean intercept and β is the mean slope.
When both slope and intercept vary, we draw them together from a multivariate normal distribution since they may have some relation, i.e.
$$\begin{bmatrix} \alpha_i \\ \beta_i \end{bmatrix} \sim \Phi\left(\begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \Sigma\right)$$
Now consider when there are multiple levels of these effects that we want to model. For instance,
perhaps there are differences across US states but also differences across US regions.
In this case, we will have a hierarchy of effects. Let's say only the intercept is affected - if we wanted to model the effects of US regions and US states on separate levels, then $\alpha_i$ will be drawn from a distribution according to the US region, $\alpha_i \sim \Phi(\mu_{\text{region}}, \sigma_\alpha^2)$, and then the regional mean which parameterizes $\alpha_i$'s distribution is drawn from a distribution of regional means, $\mu_{\text{region}} \sim \Phi(\mu, \sigma_r^2)$.
13.8 Support Vector Machines
SVMs can be powerful for learning non-linear functions and are widely-used.
With SVMs, the optimization objective is:
$$\min_\theta \sum_{i=1}^m \left[ y^{(i)} \text{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \text{cost}_0(\theta^T x^{(i)}) \right] + \frac{\lambda}{2} \sum_{j=1}^n \theta_j^2$$
Where the term at the end is the regularization term. Note that this is quite similar to the objective function for logistic regression; we have just removed the $\frac{1}{m}$ term (removing it does not make a difference to our result because it is a constant) and substituted the log hypothesis terms for two new cost functions.
If we break up the logistic regression objective function into terms (that is, the first sum and the
regularization term), we might write it as A + λB.
The SVM objective is often instead notated by convention as CA + B. You can think of C as $\frac{1}{\lambda}$.
That is, where increasing λ brings your parameters closer to zero, the regularization parameter C has
the opposite effect - as it grows, so do your parameters, and vice versa.
With that representation in mind, we can rewrite the objective by replacing the λ with C on the first
term:
$$\min_\theta C \sum_{i=1}^m \left[ y^{(i)} \text{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \text{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^n \theta_j^2$$
The SVM hypothesis is:
$$h_\theta(x) = \begin{cases} 1 & \text{if } \theta^T x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
SVMs are sometimes called large margin classifiers.
Take the following data:
On the left, a few different lines separating the data are drawn. The optimal one found by SVM is
the one in orange. It is the optimal one because it has the largest margins, illustrated by the red lines
on the right (technically, the margin is orthogonal from the decision boundary to those red lines).
When C is very large, SVM tries to maximize these margins.
However, outliers can throw SVM off if your regularization parameter C is too large, so in those cases,
you may want to try a smaller value for C.
13.8.1 Kernels
Kernels are the main technique for adapting SVMs to do complex non-linear classification.
SVM and margins
A note on notation. Say your hypothesis looks something like:
$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \dots$$
We can instead notate each non-parameter term as a feature f, like so:
$$\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 + \theta_4 f_4 + \dots$$
For SVMs, how do we choose these features?
What we can do is compute features based on x's proximity to landmarks $l^{(1)}, l^{(2)}, l^{(3)}, \dots$. For each landmark, we get a feature:
$$f_i = \text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$$
Here, the $\text{similarity}(x, l^{(i)})$ function is the kernel, sometimes just notated $k(x, l^{(i)})$.
We have a choice in what kernel function we use; here we are using Gaussian kernels. In the Gaussian kernel we have a parameter σ.
If x is close to $l^{(i)}$, then we expect $f_i \approx 1$. Conversely, if x is far from $l^{(i)}$, then we expect $f_i \approx 0$.
With this approach, classification becomes based on distances to the landmarks - points that are far
away from certain landmarks will be classified 0, points that are close to certain landmarks will be
classified 1. And thus we can get some complex decision boundaries like so:
An example SVM decision boundary
So how do you choose the landmarks?
You can take each training example and place a landmark there. So if you have m training examples,
you will have m landmarks.
So given $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})$, choose $l^{(1)} = x^{(1)}, l^{(2)} = x^{(2)}, \dots, l^{(m)} = x^{(m)}$.
Then given a training example $(x^{(i)}, y^{(i)})$, we can compute a feature vector $f^{(i)}$, where $f_0 = 1$, like so:
$$f^{(i)} = \begin{bmatrix} f_0^{(i)} = 1 \\ f_1^{(i)} = \text{sim}(x^{(i)}, l^{(1)}) \\ f_2^{(i)} = \text{sim}(x^{(i)}, l^{(2)}) \\ \vdots \\ f_i^{(i)} = \text{sim}(x^{(i)}, l^{(i)}) \\ \vdots \\ f_m^{(i)} = \text{sim}(x^{(i)}, l^{(m)}) \end{bmatrix}$$
Then instead of x we use our feature vector f. So our objective function becomes:
$$\min_\theta C \sum_{i=1}^m \left[ y^{(i)} \text{cost}_1(\theta^T f^{(i)}) + (1 - y^{(i)}) \text{cost}_0(\theta^T f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^n \theta_j^2$$
Note that here n = m because we have a feature for each of our m training examples.
Of course, using a landmark for each of your training examples makes SVM difficult on large datasets.
There are some implementation tricks to make it more efficient, though.
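A small numpy sketch of computing these Gaussian kernel features, using every training example as a landmark as described above (the function names are just illustrative; real SVM packages compute this far more efficiently):

import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    # similarity between a point x and a landmark l
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

def kernel_features(X, landmarks, sigma=1.0):
    # one row per example, one feature per landmark, plus f_0 = 1
    F = np.array([[gaussian_kernel(x, l, sigma) for l in landmarks] for x in X])
    return np.hstack([np.ones((X.shape[0], 1)), F])

X = np.random.randn(5, 2)
F = kernel_features(X, landmarks=X)  # landmarks are the training examples
print(F.shape)  # (5, 6): m examples, m landmark features plus 1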
When choosing the regularization parameter C, note that:
• A large C means lower bias, high variance
• A small C means higher bias, low variance
For the Gaussian kernel, we also have to choose the parameter $\sigma^2$.
• A large $\sigma^2$ means that features $f_i$ vary more smoothly. Higher bias, lower variance.
• A small $\sigma^2$ means that features $f_i$ vary less smoothly. Lower bias, higher variance.
When using SVM, you also need to choose a kernel, which could be the Gaussian kernel, or it could
be no kernel (i.e. a linear kernel), or it could be one of many others. The Gaussian and linear kernels
are by far the most commonly used.
You may want to use a linear kernel if n is very large, but you don’t have many training examples (m
is small). Something more complicated may overfit if you only have a little data.
The Gaussian kernel is appropriate if n is small and/or m is large. Note that you should perform
feature scaling before using the Gaussian kernel.
Not all similarity functions make valid kernels - they must satisfy a condition called Mercer’s Theorem
which allows the optimizations that most SVM implementations provide and also so they don’t diverge.
Other off-the-shelf kernels include:
• Polynomial kernel: $k(x, l) = (x^T l)^2$, or $k(x, l) = (x^T l)^3$, or $k(x, l) = (x^T l + 1)^3$, etc. (there are many variations); the general form is $(x^T l + \text{constant})^{\text{degree}}$. It usually performs worse than the Gaussian kernel.
• More esoteric ones: String kernel, chi-square kernel, histogram intersection kernel, …
But these are seldom, if ever, used.
Some SVM packages have built-in multi-class classification functionality. Otherwise, you can use the one-vs-all method. That is, train K SVMs, one to distinguish y = i from the rest, for i = 1, 2, \ldots, K, then get \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(K)}, and pick the class i with the largest (\theta^{(i)})^T x.
If n is large relative to m, e.g. n = 10000, m ∈ [10, 1000], then it may be better to use logistic
regression, or SVM without a kernel (linear kernel).
If n is small (1-1000) and m is intermediate (10-50000), then you can try SVM with the Gaussian
kernel.
If n is small (1-1000) but m is large (50000+), then you can create more features and then use
logistic regression or SVM without a kernel, since otherwise SVMs struggle at large training sizes.
SVM without a kernel works out to be similar to logistic regression for the most part.
Neural networks are likely to work well for most of these situations, but may be slower to train.
The SVM's optimization problem turns out to be convex, so good SVM packages will find the global minimum or something close to it (so there is no need to worry about local optima).
Other rules of thumb:
• Use linear kernel when number of features is larger than number of observations.
• Use gaussian kernel when number of observations is larger than number of features.
• If number of observations is larger than 50,000 speed could be an issue when using gaussian
kernel; hence, one might want to use linear kernel. Source
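For reference, a minimal scikit-learn sketch of the two common choices might look like the following (the hyperparameter values are placeholders; as noted above, feature scaling matters for the Gaussian/RBF kernel):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Linear kernel: a good default when n (features) is large relative to m (examples)
linear_svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0))

# Gaussian (RBF) kernel: a good default when n is small and m is intermediate;
# gamma plays the role of 1/(2 sigma^2)
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
```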
Also:
Usually, the decision is whether to use linear or an RBF (aka Gaussian) kernel. There are
two main factors to consider:
• Solving the optimisation problem for a linear kernel is much faster, see e.g. LIBLINEAR.
• Typically, the best possible predictive performance is better for a nonlinear kernel (or at least as good as the linear one).
It’s been shown that the linear kernel is a degenerate version of RBF, hence the linear
kernel is never more accurate than a properly tuned RBF kernel. Quoting the abstract
from the paper I linked:
The analysis also indicates that if complete model selection using the Gaussian
kernel has been conducted, there is no need to consider linear SVM.
A basic rule of thumb is briefly covered in NTU’s practical guide to support vector
classification (Appendix C).
If the number of features is large, one may not need to map data to a higher
dimensional space. That is, the nonlinear mapping does not improve the performance. Using the linear kernel is good enough, and one only searches for
the parameter C.
Your conclusion is more or less right but you have the argument backwards. In practice,
the linear kernel tends to perform very well when the number of features is large (e.g. there
is no need to map to an even higher dimensional feature space). A typical example of
this is document classification, with thousands of dimensions in input space.
In those cases, nonlinear kernels are not necessarily significantly more accurate than the
linear one. This basically means nonlinear kernels lose their appeal: they require way
more resources to train with little to no gain in predictive performance, so why bother.
TL;DR
Always try linear first since it is way faster to train (AND test). If the accuracy suffices,
pat yourself on the back for a job well done and move on to the next problem. If not, try
a nonlinear kernel. Source
13.8.2 More on support vector machines
Support vector machines is another way of coming up with decision boundaries to divide a space.
Here the decision boundary is positioned so that its margins are as wide as possible.
We can consider some vector \vec{w} which is perpendicular to the decision boundary and has an unknown length. Then we can consider an unknown vector \vec{u} that we want to classify. We can compute their dot product, \vec{w} \cdot \vec{u}, and see if it is greater than or equal to some constant c.
To make things easier to work with mathematically, we set b = −c and rewrite this as:
\vec{w} \cdot \vec{u} + b \geq 0
This is our decision rule: if this inequality is true, we have a positive example.
Now we will define a few things about this system:
\vec{w} \cdot \vec{x}_+ + b \geq 1
\vec{w} \cdot \vec{x}_- + b \leq -1
Where \vec{x}_+ is a positive training example and \vec{x}_- is a negative training example. So we will insist that these inequalities hold.
Support Vector Machines
For mathematical convenience, we will define another variable yi like so:
y_i = \begin{cases} +1 & \text{if positive example} \\ -1 & \text{if negative example} \end{cases}
So we can rewrite our constraints as:
y_i (\vec{w} \cdot \vec{x}_+ + b) \geq 1
y_i (\vec{w} \cdot \vec{x}_- + b) \geq 1
Which ends up just collapsing into:
y_i (\vec{w} \cdot \vec{x} + b) \geq 1
Or:
y_i (\vec{w} \cdot \vec{x} + b) - 1 \geq 0
We then add an additional constraint for an x_i in the gutter (that is, lying exactly on the margin of the decision boundary):
y_i (\vec{w} \cdot \vec{x} + b) - 1 = 0
So how do you compute the total width of the margins?
You can take a negative example \vec{x}_- and a positive example \vec{x}_+ and compute their difference \vec{x}_+ - \vec{x}_-. This resulting vector is not orthogonal to the decision boundary, so we can project it onto the unit vector \hat{w} (the unit vector of \vec{w}, which is orthogonal to the decision boundary):
\text{width} = (\vec{x}_+ - \vec{x}_-) \cdot \frac{\vec{w}}{\|\vec{w}\|}
Using our previous constraints we get \vec{w} \cdot \vec{x}_+ = 1 - b and -\vec{w} \cdot \vec{x}_- = 1 + b, so the end result is:
\text{width} = \frac{2}{\|\vec{w}\|}
We want to maximize the margin, that is, we want to maximize this width. Dropping the constant factor of 2 still leaves a meaningful maximum, and maximizing \frac{1}{\|\vec{w}\|} can in turn be interpreted as minimizing the length of \vec{w}, which we can rewrite in a more mathematically convenient form (and still have the same meaningful minimum):
Support Vector Machines
\max\left(\frac{2}{\|\vec{w}\|}\right) \rightarrow \max\left(\frac{1}{\|\vec{w}\|}\right) \rightarrow \min(\|\vec{w}\|) \rightarrow \min\left(\frac{1}{2}\|\vec{w}\|^2\right)
Let’s turn this into something we can maximize, incorporating our constraints. We have to use
Lagrange multipliers which provide us with this new function we can maximize without needing to
think about our constraints anymore:
L = \frac{1}{2}\|\vec{w}\|^2 - \sum_i \alpha_i \left[ y_i (\vec{w} \cdot \vec{x}_i + b) - 1 \right]
(Note that the Lagrangian is an objective function which includes equality constraints).
Where L is the function we want to maximize, and the sum is the sum of the constraints, each with
a multiplier αi .
So then to get the maximum, we just compute the partial derivatives and look for zeros:
\frac{\partial L}{\partial \vec{w}} = \vec{w} - \sum_i \alpha_i y_i \vec{x}_i = 0 \rightarrow \vec{w} = \sum_i \alpha_i y_i \vec{x}_i
\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0 \rightarrow \sum_i \alpha_i y_i = 0
Let’s take these partial derivatives and re-use them in the original Lagrangian:
L = \frac{1}{2}\left(\sum_i \alpha_i y_i \vec{x}_i\right) \cdot \left(\sum_j \alpha_j y_j \vec{x}_j\right) - \sum_i \alpha_i y_i \vec{x}_i \cdot \left(\sum_j \alpha_j y_j \vec{x}_j\right) - \sum_i \alpha_i y_i b + \sum_i \alpha_i
Which simplifies to:
L = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \vec{x}_i \cdot \vec{x}_j
We see that this depends on \vec{x}_i \cdot \vec{x}_j.
Similarly, we can rewrite our decision rule, substituting for \vec{w}:
\vec{w} = \sum_i \alpha_i y_i \vec{x}_i
\vec{w} \cdot \vec{u} + b \geq 0 \rightarrow \sum_i \alpha_i y_i \vec{x}_i \cdot \vec{u} + b \geq 0
And similarly we see that this depends on \vec{x}_i \cdot \vec{u}.
The nice thing here is that this works in a convex space (proof not shown) which means that it cannot
get stuck on a local maximum.
Sometimes you may have some training data \vec{x} which is not linearly separable. What you need is a transformation \phi(\vec{x}) to take the data from its current space to a space where it is linearly separable.
Since the maximization and the decision rule depend only on the dot products of vectors, we can just
substitute the transformation, so that:
• we want to maximize using \phi(\vec{x}_i) \cdot \phi(\vec{x}_j)
• for the decision rule, we have \phi(\vec{x}_i) \cdot \phi(\vec{u})
Since these are just dot products between the transformed vectors, we really only need a function
which gives us that dot product:
K(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i) \cdot \phi(\vec{x}_j)
This function K is called the kernel function.
So if you have the kernel function, you don’t even need to know the specific transformation - you
just need the kernel function.
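To make the "you only need the kernel" point concrete, here is a small NumPy check (my own illustration) that the degree-2 polynomial kernel (u · v)^2 equals the dot product of an explicit quadratic feature map, so the map \phi never has to be computed:

```python
import numpy as np

def phi(v):
    """Explicit quadratic feature map for 2D input: all degree-2 monomials."""
    x1, x2 = v
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

def k(u, v):
    """Polynomial kernel of degree 2."""
    return np.dot(u, v) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(u), phi(v)))  # 16.0, computed via the explicit mapping
print(k(u, v))                 # 16.0, computed without ever forming phi
```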
Some popular kernels:
• polynomial kernel: K(\vec{u}, \vec{v}) = (\vec{u} \cdot \vec{v} + 1)^n (with n = 1 this is essentially a linear kernel)
• radial basis kernel: K(\vec{x}_i, \vec{x}_j) = e^{-\frac{\|\vec{x}_i - \vec{x}_j\|}{\sigma}}
More on kernels: The Kernel Trick
Many machine learning algorithms can be written in the form:
w^T x + b = b + \sum_{i=1}^{m} \alpha_i x^T x^{(i)}
Where α is a vector of coefficients.
We can substitute x with the output of a feature function \phi(x) and the dot product x^T x^{(i)} with a function k(x, x^{(i)}) = \phi(x)^T \phi(x^{(i)}). This function k is called a kernel.
Thus we are left with:
f(x) = b + \sum_{i=1}^{m} \alpha_i k(x, x^{(i)})
\phi maps x into a space where f is linear in \phi(x) (and in \alpha); f can still be a nonlinear function of x itself.
Machine learning algorithms which use this trick are called kernel methods or kernel machines.
13.9 Decision Trees
Basic algorithm:
1. Start with data all in one group
2. Find some criteria which best splits the outcomes
3. Divide the data into two groups (which become the leaves) on that split (which becomes a
node)
4. Within each split, repeat
5. Repeat until the groups are too small or are sufficiently “pure” (homogeneous)
Classification trees are non-linear models:
• They use interactions b/w variables
• Data transformations may be less important (monotone transformations probably won’t affect
how data is split)
• Trees can be used for regression problems (continuous outcome)
13.9.1 Measures of impurity
\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in \text{Leaf}_m} \mathbb{1}(y_i = k)
That is, within leaf m you have N_m objects to consider; you count the number of a particular class k in that set of objects and divide by N_m to get the probability \hat{p}_{mk}.
• misclassification error: 1 - \hat{p}_{m k(m)}, where k(m) is the most common class k in leaf m
– 0 = perfect purity
– 0.5 = no purity
• Gini index: \sum_{k \neq k'} \hat{p}_{mk} \hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}) = 1 - \sum_{k=1}^{K} \hat{p}_{mk}^2
– 0 = perfect purity
– 0.5 = no purity
• Deviance/information gain: -\sum_{k=1}^{K} \hat{p}_{mk} \log_2 \hat{p}_{mk}
– 0 = perfect purity
– 1 = no purity
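A small sketch (my own, not from the cited sources) of these impurity measures for a single leaf, given its class counts:

```python
import numpy as np

def impurities(class_counts):
    """Misclassification error, Gini index, and entropy for one leaf."""
    p = np.asarray(class_counts, dtype=float)
    p = p / p.sum()                                  # \hat{p}_{mk} for each class k
    misclass = 1 - p.max()                           # 1 minus the most common class's share
    gini = 1 - np.sum(p ** 2)
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return misclass, gini, entropy

print(impurities([10, 10]))  # maximally impure leaf: (0.5, 0.5, 1.0)
print(impurities([20, 0]))   # pure leaf:             (0.0, 0.0, 0.0)
```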
13.9.2 Random forests
Random forests are the ensemble model version of decision trees.
Basic idea:
1. Bootstrap samples (i.e. resample)
2. At each split in the tree, bootstrap the variables (i.e. only a subset of the variables is considered
at each split)
3. Grow multiple trees
4. Each tree votes on a classification
This can be very accurate but slow, prone to overfitting (cross-validation helps though), and not easy
to interpret. However, they generally perform very well.
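As a quick reference, a scikit-learn random forest along these lines might look like this (the dataset and hyperparameter values are arbitrary placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# n_estimators = number of trees grown; max_features = size of the random
# subset of variables considered at each split
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

# cross-validation helps guard against the overfitting noted above
print(cross_val_score(forest, X, y, cv=5).mean())
```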
13.9.3 Classification loss functions
Hinge Loss (aka Max-Margin Loss)
The hinge loss function takes the form ℓ(y ) = max(0, 1 − t · y ) and is typically used for SVMs
(sometimes squared hinge loss is used, which is just the previous equation squared).
Cross-entropy loss
L(y, \hat{y}) = -\frac{1}{N} \sum_{n \in N} \sum_{i \in C} y_{n,i} \log \hat{y}_{n,i}
where
• N = number of samples
• C = number of classes
Typically used for Softmax classifiers.
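A minimal NumPy sketch (my own illustration) of these two losses, where t is the true label in {-1, +1} for the hinge loss, and Y / Y_hat are one-hot targets and predicted probabilities for the cross-entropy loss:

```python
import numpy as np

def hinge_loss(y, t):
    """max(0, 1 - t*y): zero once the example is on the correct side with margin >= 1."""
    return np.maximum(0.0, 1.0 - t * y)

def cross_entropy_loss(Y, Y_hat, eps=1e-12):
    """Average over N samples of -sum_i y_{n,i} log(yhat_{n,i})."""
    return -np.mean(np.sum(Y * np.log(Y_hat + eps), axis=1))

print(hinge_loss(y=0.3, t=1))  # 0.7: correct side, but inside the margin
Y = np.array([[1, 0, 0], [0, 1, 0]])
Y_hat = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(cross_entropy_loss(Y, Y_hat))
```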
L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)
13.10 Ensemble models

13.10.1 Boosting
Basically, taking many models and combining their outputs as a weighted average.
Basic idea:
1. Take lots of (possibly) weak predictors h1 , . . . , hk , e.g. a bunch of different trees or regression
models or different cutoffs.
2. Weight them and combine them by creating a classifier which combines the predictors: f(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)
• Goal is to minimize error on training set
• Iteratively select a classifier h at each step
• Calculate weights based on errors
• Increase the weight of missed classifications and select the next classifier
• The sign of the result tells you the class
Adaboost is a popular boosting algorithm.
One class of boosting is gradient boosting.
Boosting typically does very well.
more on boosting
Here we focus on binary classification.
Say we have a classifier h which produces +1 or −1.
We have some error rate, which ranges from 0 to 1. A weak classifier is one where the error is just
less than 0.5 (that is, it works slightly better than chance). A stronger classifier has an error rate
closer to 0.
Let’s say we have several weak classifiers, h1 , . . . , hn .
We can combine them into a bigger classifier, H(x), where x is some input, which is the sum of the
individual weak classifiers, and take the sign of the result. In this sense, the weak classifiers vote on
the classification:
H(x) = \text{sign}\left(\sum_i h_i(x)\right)
How do we generate these weak classifiers?
• We can create one by taking the data, training classifiers on it, and selecting the one with the smallest error rate (this will be classifier h_1).
• We can create another by taking the data and giving it some exaggeration of h_1's errors (e.g. pay more attention to the samples that h_1 has trouble on). Training a new classifier on this gives us h_2.
• We can create another by taking the data and giving it some exaggeration to the samples where
the results of h1 ̸= h2 . Training a new classifier on this gives us h3 .
This process can be recursive. That is, h1 could be made up of three individual classifiers as well,
and so could h2 and h3 .
For our classifiers we could use decision tree stumps, which is just a single test to divide the data
into groups (i.e. just a part of a fuller decision tree). Note that boosting doesn’t have to use decision
tree (stumps), it can be used with any classifier.
We can assign a weight to each training example, wi , where to start, all weights are uniform. These
weights can be adjusted to exaggerate certain examples. For convenience, we keep it so that all
weights sum to 1, \sum_i w_i = 1, thus enforcing a distribution.
We can compute the error ϵ of a given classifier as the sum of the weights of the examples it got
wrong.
For our aggregate classifier, we may want to weight the classifiers with the weights α1 , . . . , αn .
H(x) = \text{sign}\left(\sum_i \alpha_i h_i(x)\right)
The general algorithm is:
• We can set the starting weights w_i^t for our training examples to be \frac{1}{N}, where N is the number of examples, with t = 1 representing the time (or the iteration).
• Then we pick a classifier h^t which minimizes the error rate.
• Then we can pick \alpha^t.
• And we can calculate w^{t+1}.
• Then repeat.
Now suppose w_i^{t+1} = \frac{w_i^t}{Z} e^{-\alpha^t h^t(x) y(x)}, where y(x) gives you the right classification (the right sign) for a given training example. So if h^t(x) correctly classifies a sample, then it and y(x) have the same sign and the exponent is negative, so that sample's weight shrinks; if h^t(x) gives the incorrect sign, the exponent is positive and the weight grows. Z is some normalizing value so that we get a distribution.
It can be shown that the error bound for H(x) is minimized if \alpha^t = \frac{1}{2} \ln \frac{1 - \epsilon^t}{\epsilon^t}.
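A compact sketch of this loop (my own paraphrase of AdaBoost with decision stumps, not code from the lecture), assuming labels in {-1, +1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=10):
    """y must be in {-1, +1}. Returns the stumps and their weights alpha."""
    N = len(y)
    w = np.full(N, 1.0 / N)                        # start with uniform example weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.sum(w[pred != y])                 # weighted error of this stump
        alpha = 0.5 * np.log((1 - eps) / (eps + 1e-10))
        w = w * np.exp(-alpha * y * pred)          # shrink correct, grow incorrect
        w = w / w.sum()                            # renormalize to a distribution
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)
```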
13.10.2 Stacking
Stacking is similar to boosting, except that you also learn the weights for the weighted average by
wrapping the ensemble of models into another model.
13.11 Overfitting
Overfitting is a problem where your hypothesis describes the training data too well, to the point
where it cannot generalize to new examples. It is a high variance problem. In contrast, underfitting
is a high bias problem.
To clarify, if your model has no bias, it means that it makes no errors on your training data (i.e. it
does not underfit). If your model has no variance, it means your model generalizes well on your test
data (i.e. it does not overfit). It is possible to have bias and variance problems simultaneously.
Another way to think of this is that:
• variance = how much does the model vary if the training data changes? I.e. what space of
possible models does this cover? High variance implies that the model is too sensitive to the
particular training examples it looked at, and thus will not adapt well to other examples.
• bias = is the average model close to the “true” solution/model? High bias means that the
model is systematically incorrect.
Bias and Variance
There is a bias-variance trade-off, in which improvement of one is typically at the detriment of the
other.
You can think of generalization error as the sum of bias and variance. You want to keep both low, if
possible.
Overfitting can happen if your hypothesis is too complex, which can happen if you have too many
features. So you will want to go through a feature selection phase and pick out features which seem to
provide the most value.
Alternatively, you can use the technique of regularization, in which you keep all your features, but
reduce the magnitudes/values of parameters θj . This is a good option if all of your features are
informative and you don’t want to get rid of any.
13.12 Regularization
Regularization can be defined as any method which aims to reduce the generalization error of a
model though it may not have the same effect on the training error. Since good generalization error
is the main goal of machine learning, regularization is essential to success.
Perhaps the most common form of regularization aims to favor smaller parameters. The intuition is
that, if you have small values for your parameters θ0 , θ1 , . . . , θn , then you have a “simpler” hypothesis
which is less prone to overfitting.
In practice, there may be many combinations of parameters which fit your data well. However, some
may overfit/not generalize well. We want to introduce some means of valuing these simpler hypotheses
over more complex ones (i.e. with larger parameters). We can do so with regularization.
So generally regularization is about shrinking your parameters to make them smaller; for this reason
it is sometimes called weight decay. For linear regression, you accomplish this by modifying the cost
function to include the term \lambda \sum_{j=1}^{n} \theta_j^2 at the end:
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2
Note that we are not shrinking θ0 . In practice, it does not make much of a difference if you include
it or not; standard practice is to leave it out.
λ here is called the regularization parameter. It tunes the balance between fitting the data and
keeping the parameters small (i.e. each half of the cost function). If you make λ too large for your
problem, you may make your parameters too close to 0 for them to be meaningful - large values of
lambda can lead to underfitting problems (since the parameters get close to 0).
The additional \lambda \sum_{j=1}^{n} \theta_j^2 term is called the regularization loss, and the rest of the loss function is called the data loss.
13.12.1 Ridge regression
A regularization method used in linear regression; the L2 norm of the parameters is constrained so that it is less than or equal to some specified value (that is, this is L2 regularization):
\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \left( \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right)
Where:
• λ ≥ 0 is a hyperparameter controlling the amount of shrinkage.
• N is the number of data points
• p is the number of dimensions
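A small NumPy sketch (my own) of the closed-form ridge solution for the non-intercept coefficients, assuming the data is centered so that β_0 can be recovered separately:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge: beta = (Xc^T Xc + lam * I)^{-1} Xc^T yc on centered data."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean     # centering handles the (unpenalized) intercept
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - X_mean @ beta
    return beta0, beta
```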
13.12.2 LASSO
LASSO (Least Absolute Shrinkage and Selection Operator) is a regularization method which constrains the L1 norm of the parameters such that it is less than or equal to some specified value:
\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \left( \frac{1}{2} \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right)
(These two regularization methods are sometimes called shrinkage methods)
13.12.3 Regularized Linear Regression
We can update gradient descent to work with our regularization term:
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)}
\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} + \frac{\lambda}{m} \theta_j \right], \quad j = 1, 2, 3, \ldots, n
The θj part can be re-written as:
\theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}
If we are using the normal equation, we can update it to a regularized form as well:
\theta = (X^T X + \lambda M)^{-1} X^T y
Where M is an (n+1) \times (n+1) matrix whose diagonal is all ones, except for the element at (0, 0), which is 0; all off-diagonal elements are also 0.
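A minimal NumPy sketch (my own) of one such regularized gradient descent update for linear regression, leaving θ_0 unregularized as described above:

```python
import numpy as np

def regularized_gd_step(theta, X, y, alpha=0.01, lam=0.1):
    """One gradient descent step for regularized linear regression.

    X is assumed to have a leading column of ones, so theta[0] is the intercept.
    """
    m = len(y)
    error = X @ theta - y            # h_theta(x^(i)) - y^(i) for every example
    grad = (X.T @ error) / m         # unregularized gradient
    reg = (lam / m) * theta
    reg[0] = 0.0                     # do not shrink theta_0
    return theta - alpha * (grad + reg)
```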
13.12.4 Regularized Logistic Regression
We can also update the logistic regression cost function with the regularization term:
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
Then we can update gradient descent with the new derivative of this cost function for the parameters
θj where j ̸= 0
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)}
\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} + \frac{\lambda}{m} \theta_j \right], \quad j = 1, 2, 3, \ldots, n
It looks the same as the one for linear regression, but again, the actual hypothesis function hθ is
different.
13.13 Probabilistic modeling
Fundamentally, machine learning is all about data:
• Stochastic, chaotic, and/or complex generative processes
• Noisily observed
• Partially observed
So there is a lot of uncertainty - we can use probability theory to express this uncertainty in the form
of probabilistic models. Generally, learning probabilistic models is known as probabilistic machine learning; here we are primarily concerned with non-Bayesian machine learning.
We have some data x1 , x2 , . . . , xn and some latent variables y1 , y2 , . . . , yn we want to uncover, which
correspond to each of our data points.
We have a parameter θ.
A probabilistic model is just a parameterized joint distribution over all the variables:
P (x1 , . . . , xn , y1 , . . . , yn |θ)
We usually interpret such models as a generative model - how was our observed data generated by
the world?
So the problem of inference is about learning about our latent variables given the observed data,
which we can get via the posterior distribution:
P(y_1, \ldots, y_n | x_1, \ldots, x_n, \theta) = \frac{P(x_1, \ldots, x_n, y_1, \ldots, y_n | \theta)}{P(x_1, \ldots, x_n | \theta)}
Learning is typically posed as a maximum likelihood problem; that is, we try to find θ which maximizes
the probability of our observed data:
\theta_{ML} = \underset{\theta}{\operatorname{argmax}}\, P(x_1, \ldots, x_n | \theta)
Then, to make a prediction we want to compute the conditional distribution of some future data:
P (xn+1 , yn+1 |x1 , . . . , xn , θ)
Or, for classification, if we have some classes, each parameterizing a joint distribution, we want to pick the class which maximizes the probability of the observed data:
\underset{c}{\operatorname{argmax}}\, P(x_{n+1} | \theta_c)
13.13.1 Discriminative vs Generative learning algorithms
Discriminative learning algorithms include algorithms like logistic regression, decision trees, kNN, and
SVM. Discriminative approaches try to find some way of separating data (discriminating them), such
as in logistic regression which tries to find a dividing line and then sees where new data lies in relation
to that line. They are unconcerned with how the data was generated.
Say our input features are x and y is the class.
Discriminative learning algorithms learn P(y|x) directly; that is, they try to learn the probability of y directly as a function of x. To put it another way: what is the probability this new data is of class y, given the features x?
Generative learning algorithms instead try to develop a model for each class and see which model new data conforms to.
Generative learning algorithms learn P(x|y) and P(y) instead (that is, they model the joint distribution P(x, y)). So instead they ask: if this were class y, what is the probability of seeing these new features x? You're basically trying to figure out which class is most likely to have generated the given features x.
P (y ) is the class prior/the prior probability of seeing the class y , that is the probability of class y if
you don’t have any other information.
It is easier to estimate the conditional distribution P(y|x) than it is the joint distribution P(x, y), though generative models can be much stronger. With P(x, y), it is easy to calculate the conditional: P(y|x) = \frac{P(x, y)}{P(x)}.
For both discriminative and generative approaches, you will have parameters and latent variables θ
which govern these distributions. We treat θ as a random variable.
13.13.2 Maximum Likelihood Estimation (MLE)
Say we have some observed values x1 , x2 , . . . , xn , generated by some latent model parameterized by
θ, i.e. f (x1 , x2 , . . . , xn ; θ), where θ represents a single unknown parameter or a vector of unknown
parameters. If we flip this we get the likelihood of \theta, L(\theta; x_1, x_2, \ldots, x_n), viewed as a function of \theta given the observed data.
Likelihood is just the name for the probability of observed data as a function of the parameters.
The maximum likelihood estimation is the θ (parameter) which maximizes this likelihood. That
is, the value of θ which generates the observed values with the highest probability. MLE can be
done by computing the derivative of the likelihood and solving for zero. It is a very common way of
estimating parameters.
(The Expectation-Maximization (EM) algorithm is a way of computing a maximum likelihood estimate
for situations where some variables may be hidden.)
If the random variables associated with the values, i.e. X1 , X2 , . . . , Xn , are iid, then the likelihood is
just:
L(\theta; x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i | \theta)
Sometimes this is just notated L(θ).
So we are looking to estimate the θ which maximizes this likelihood (this estimate is often notated
θ̂, the hat typically indicates an estimator):
\hat{\theta} = \underset{\theta}{\operatorname{argmax}}\, L(\theta; x_1, x_2, \ldots, x_n)
Logarithms are used, however, for convenience (i.e. dealing with sums rather than products), so
instead we are often maximizing the log likelihood (which has its maximum at the same value
(i.e. the same argmax) as the regular likelihood, though the actual maximum value may be different):
\ell(\theta) = \sum_{i=1}^{n} \log(f(x_i | \theta))
Another way of explaining MLE:
We have some data X = {x (1) , x (2) , . . . , x (m) } and a parametric probability distribution p(x; θ). The
maximum likelihood estimate for θ is:
\theta_{ML} = \underset{\theta}{\operatorname{argmax}}\, p(X; \theta) = \underset{\theta}{\operatorname{argmax}} \prod_{i=1}^{m} p(x^{(i)}; \theta)
(Notation note: p(X; θ) is read “the probability density of X as parameterized by θ”)
Though the logarithm version mentioned above is typically preferred to avoid underflow.
Typically, we are more interested in the conditional probability P (y |x; θ) (i.e. the probability of y
given x, parameterized by θ), in which case, given all our inputs X and all our targets Y , we have
the conditional maximum likelihood estimator:
\theta_{ML} = \underset{\theta}{\operatorname{argmax}}\, P(Y|X; \theta)
Assuming the examples are iid, this is:
\theta_{ML} = \underset{\theta}{\operatorname{argmax}} \sum_{i=1}^{m} \log P(y^{(i)} | x^{(i)}; \theta)
Example
Say we have a coin which may be unfair. We flip it ten times and get HHHHTTTTTT (we’ll call this
observed data X). We are interested in the probability of heads, π, for this coin, so we can determine
if it’s unfair or not.
Here we just have a binomial distribution so the parameters here are n and p (or π as we are referring
to it here). We know n as it is the sample size, so that parameter is easy to “estimate” (i.e. we
already know it). All that’s left is the parameter p to estimate. So we can just use MLE to make this
estimation; for binomial distributions it is rather trivial because p is the probability of a successful
trial, and it’s intuitive that the most likely p just reflects the number of observed successes over the
total number of observed trials:
\tilde{\pi}_{MLE} = \underset{\pi}{\operatorname{argmax}}\, P(X|\pi)
P(y|X) \approx P(y|\tilde{\pi}_{MLE})
Where y is the outcome of the next coin flip.
For this case our MLE would be \tilde{\pi}_{MLE} = 0.4 because that is most likely to have generated our observed data (we saw \frac{4}{10} heads).
Also formulated as:
\hat{\theta} = \underset{\theta}{\operatorname{argmax}}\, p(x^{(1)}, \ldots, x^{(n)}; \theta)
For a Gaussian distribution, the sample mean is the MLE.
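A quick numerical check (my own) that the MLE for the coin example really is 0.4: evaluate the binomial likelihood on a grid of candidate values for π and take the argmax:

```python
import numpy as np

heads, flips = 4, 10
pis = np.linspace(0.01, 0.99, 99)

# binomial likelihood up to the constant binomial coefficient
likelihood = pis ** heads * (1 - pis) ** (flips - heads)
print(pis[np.argmax(likelihood)])  # ~0.4, the fraction of observed heads
```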
13.13.3 Expectation Maximization
The expectation maximization (EM) algorithm is a two-staged iterative algorithm.
Say you have a dataset which is missing some values. How can you complete your data?
The EM algorithm allows you to do so.
The two stages work as such:
1. Begin with initial parameters θ̂(t) , t = 0.
2. The “E-step”
1. Using the current parameters θ̂(t) , compute probabilities for each possible completion of
the missing data.
2. Use these probabilities to create a weighted training set of these possible completions.
3. The “M-step”
1. Use a modified version of MLE (one which can deal with weighted samples) to derive new
parameter estimates, θ̂(t+1) .
4. Repeat the E and M steps until convergence.
Intuitively, what EM does is try to find the parameters \hat{\theta} which maximize the log probability \log P(x|\theta) of the observed data x, much like MLE, except it does so under the conditions of incomplete data. EM will converge on a local optimum (maximum) for this log probability.
Example
Say we have two coins A, B, which may not be fair coins.
We conduct 5 experiments in which we randomly choose one of the coins (with equal probability)
and flip it 10 times.
We have the following results:
1. HTTTHHTHTH
2. HHHHTHHHHH
3. HTHHHHHTHH
4. HTHTTTHHTT
5. THHHTHHHTH
We are still interested in learning a parameter for each coin, θ̂A , θ̂B , describing the probability of heads
for each.
If we knew which coin we flipped during each experiment, this would be a simple MLE problem. Say
we did know which coin was picked for each experiment:
1. B: HTTTHHTHTH
2. A: HHHHTHHHHH
3. A: HTHHHHHTHH
4. B: HTHTTTHHTT
5. A: THHHTHHHTH
Then we just use MLE and get:
\hat{\theta}_A = \frac{24}{24 + 6} = 0.8
\hat{\theta}_B = \frac{9}{9 + 11} = 0.45
That is, for each coin we just compute \frac{\text{num heads}}{\text{total trials}}.
But, alas, we are missing the data of which coin we picked for each experiment. We can instead
apply the EM algorithm.
Say we initially guess that θ̂A(0) = 0.60, θ̂B(0) = 0.50. For each experiment, we’ll compute the
probability that coin A produced those results and the same probability for coin B. Here we’ll just
show the computation for the first experiment.
We're dealing with a binomial distribution here, so we are using:
P(x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad p = \hat{\theta}
The binomial coefficient (the \binom{n}{x} term) is the same for both coins and cancels out in normalization, so we only care about the remaining factors. So we will instead just use:
P(x) = p^x (1 - p)^{n - x}, \quad p = \hat{\theta}
For the first experiment we have 5 heads (and n = 10). Using our current estimates for θ̂A , θ̂B , we
compute:
\theta_A^5 (1 - \theta_A)^{10-5} \approx 0.0008
\theta_B^5 (1 - \theta_B)^{10-5} \approx 0.0010
\frac{0.0008}{0.0008 + 0.0010} \approx 0.44
\frac{0.0010}{0.0008 + 0.0010} \approx 0.56
So for this first iteration and for the first experiment, we estimate that the chance of the picked coin
being coin A is about 0.44, and about 0.56 for coin B.
Then we generate the weighted set of these possible completions by computing how much each of
these coins, as weighted by the probabilities we just computed, contributed to the results for this
experiment ((5H, 5T )):
0.44(5H, 5T ) = (2.2H, 2.2T ), (coin A)
0.56(5H, 5T ) = (2.8H, 2.8T ), (coin B)
Then we repeat this for the rest of the experiments, getting the following weighted values for each
coin for each experiment:
| coin A | coin B |
| --- | --- |
| 2.2H, 2.2T | 2.8H, 2.8T |
| 7.2H, 0.8T | 1.8H, 0.2T |
| 5.9H, 1.5T | 2.1H, 0.5T |
| 1.4H, 2.1T | 2.6H, 3.9T |
| 4.5H, 1.9T | 2.5H, 1.1T |
and sum up the weighted values for each coin:
| coin A | coin B |
| --- | --- |
| 21.3H, 8.6T | 11.7H, 8.4T |
Then we use these weighted values and MLE to update θ̂A , θ̂B , i.e.:
\hat{\theta}_A^{(1)} \approx \frac{21.3}{21.3 + 8.6} \approx 0.71
\hat{\theta}_B^{(1)} \approx \frac{11.7}{11.7 + 8.4} \approx 0.58
And repeat until convergence.
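Here is a short NumPy sketch (my own re-implementation of the worked example above, using the same drop-the-binomial-coefficient shortcut) that simply iterates the E and M steps:

```python
import numpy as np

heads = np.array([5, 9, 8, 4, 7])   # heads observed in each 10-flip experiment
n = 10
theta_a, theta_b = 0.60, 0.50       # initial guesses

for _ in range(100):
    # E-step: probability that each experiment used coin A vs coin B
    like_a = theta_a ** heads * (1 - theta_a) ** (n - heads)
    like_b = theta_b ** heads * (1 - theta_b) ** (n - heads)
    resp_a = like_a / (like_a + like_b)
    resp_b = 1 - resp_a

    # M-step: weighted MLE (weighted heads over weighted flips)
    theta_a = np.sum(resp_a * heads) / np.sum(resp_a * n)
    theta_b = np.sum(resp_b * heads) / np.sum(resp_b * n)

print(theta_a, theta_b)  # converges to roughly 0.80 and 0.52
```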
Expectation Maximization as a Generalization of K-Means
In K-Means we make hard assignments of datapoints to clusters (that is, they belong to only one
cluster at a time, and that assignment is binary).
EM is similar to K-Means, but we use soft assignments instead - datapoints can belong to multiple
clusters in varying strengths (the “strengths” are probabilities of assignment to each cluster). When
the centroids are updated, they are updated against all points, weighted by assignment strength
(whereas in K-Means, centroids are updated only against their members).
EM converges to approximately the same clusters as K-Means, except datapoints still have some
membership to other clusters (though they may be very weak memberships).
In EM, we consider that each datapoint is generated from a mixture of classes.
For each of the K classes, we have the prior probability of that class P(C = i) and the probability of the
datapoint given that class P (x|C = i ).
P(x) = \sum_{i=1}^{K} P(C = i)\, P(x | C = i)
These terms may be notated:
\pi_i = P(C = i), \quad \mathcal{N}(\mu_i, \Sigma_i) = P(x | C = i)
What this is modeling here is that each centroid is the center of a Gaussian distribution, and we try
to fit these centroids and their distributions to the data.
13.14 References
• Review of fundamentals, IFT725. Hugo Larochelle. 2012.
• Exploratory Data Analysis Course Notes. Xing Su.
• Mining Massive Datasets (Coursera & Stanford, 2014). Jure Leskovec, Anand Rajaraman, Jeff
Ullman.
• Machine Learning. 2014. Andrew Ng. Stanford University/Coursera.
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley
(edX).
• Evaluating Machine Learning Models. Alice Zheng. 2015.
• Computational Statistics II (code). Chris Fonnesbeck. SciPy 2015.
• Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity.
• MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT.
• Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
• CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Neural Networks
Part 2: Setting up the Data and the Loss. Andrej Karpathy.
• POLS 509: Hierarchical Linear Models. Justin Esarey.
• Bayesian Inference with Tears. Kevin Knight, September 2009.
• Learning to learn, or the advent of augmented data scientists. Simon Benhamou.
• Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo
Larochelle, Ryan P. Adams.
• What is the expectation maximization algorithm?. Chuong B Do & Serafim Batzoglou.
• Gibbs Sampling for the Uninitiated. Philip Resnik, Eric Hardisty. June 2010.
• Maximum Likelihood Estimation. Penn State Eberly College of Science.
• Data Science Specialization. Johns Hopkins (Coursera). 2015.
• Practical Machine Learning. Johns Hopkins (Coursera). 2015.
• Elements of Statistical Learning. 10th Edition. Trevor Hastie, Robert Tibshirani, Jerome
Friedman.
• CS231n Convolutional Neural Networks for Visual Recognition, Linear Classification. Andrej
Karpathy.
• How does expectation maximization work?. joriki.
• How does expectation maximization work in coin flipping problem. joriki.
• Deep Learning Tutorial - PCA and Whitening. Chris McCormick.
14 Neural Nets
When it comes down to it, a neural net is just a very sophisticated way of fitting a curve.
Neural networks with at least one hidden layer are universal approximators, which means that they
can approximate any (continuous) function. This approximation can be improved by increasing the
number of hidden neurons in the network (but increases the risk of overfitting).
A key advantage to neural networks is that they are capable of learning features independently without much human involvement.
Neural networks can also learn distributed representations. Say we have animals of some type and
color. We could learn representations for each of them, e.g. with a neuron for a red cat, one for a blue
dog, one for a blue cat, etc. But this would mean learning many, many representations (for instance,
with three types and three colors, we learn nine representations). With distributed representations,
we instead have neurons that learn the different colors and other neurons which learn the different
types (for instance, with three types and three colors, we have six neurons). Thus the representation
of a given case, such as a blue dog, is distributed across the neurons, and ultimately much more
compact.
One challenge with neural networks is that it is often hard to understand what a neural network has
learned - they may be black boxes which do what we want, but we can’t peer in and clearly see how
it’s doing it.
14.1 Biological basis
Artificial neural networks (ANNs) are (conceptually) based off of biological neural networks such as
the human brain. Neural networks are composed of neurons which send signals to each other in
response to certain inputs.
A single neuron takes in one or more inputs (via dendrites), processes it, and fires one output (via its
axon).
Source: http://ulcar.uml.edu/~iag/CS/Intro-to-ANN.html
Note that the term “unit” is often used instead of “neuron” when discussing artificial neural networks
to dissociate these from the biological version - while there is some basis in biological neural networks,
there are vast differences, so it is misleading to present them as analogous.
14.2 Perceptrons
A perceptron, first described by Frank Rosenblatt in 1957, is an artificial neuron (a computational
model of a biological neuron, first introduced in 1943 by Warren McCulloch and Walter Pitts).
Like a biological neuron, it has multiple inputs, processes them, and returns one output.
Each input has a weight associated with it.
In the simplest artificial neuron, a “binary” or “classic spin” neuron, the neuron “fires” an output of
“1” if the weighted sum of these inputs is above some threshold, or “-1” if otherwise.
A single-layer perceptron can’t learn XOR:
A line can’t be drawn to separate the As from the Bs; that is, this is not a linearly separable problem.
Single-layer perceptrons cannot represent linearly inseparable functions.
14.3 Sigmoid (logistic) neurons
A sigmoid neuron is another artificial neuron, similar to a perceptron. However, while the perceptron
has a binary output, the sigmoid neuron has a continuous output, σ(w · x + b), defined by a special
activation function known as the sigmoid function σ (also known as the logistic function):
\sigma(z) \equiv \frac{1}{1 + e^{-z}}
which can also be written:
\frac{1}{1 + \exp\left(-\sum_j w_j x_j - b\right)}
Note that if z = w · x + b is a large positive number, then e −z ≈ 0 and thus σ(z) ≈ 1. If z is
a large negative number, then e −z → ∞ and thus σ(z) ≈ 0. So at these extremes, the sigmoid
neuron behaves like a perceptron.
Here is the sigmoid function visualized:
Which is a smoothed out step function (which is how a perceptron operates):
Sigmoid neurons are useful because small changes in weights and biases will only produce small
changes in output from a given neuron (rather than switching between binary output values as is the
case with the step function, which is typically too drastic).
14.4 Activation functions
The function that determines the output of a neuron is known as the activation function. In the
binary/classic spin case, it might look like:
The sigmoid function
The step function
weights = [...]
inputs = [...]

# weighted sum of inputs (the dot product w . x)
sum_w = sum(w * x for w, x in zip(weights, inputs))

def activate(sum_w, threshold):
    # fire +1 if the weighted sum clears the threshold, otherwise -1
    return 1 if sum_w > threshold else -1
Or:
output = \begin{cases} -1 & \text{if } \sum_j w_j x_j \leq \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases}
Note that w \cdot x = \sum_j w_j x_j, so it can be notated as a dot product where the weights and inputs are vectors.
In some interpretations, the “binary” neuron returns “0” or “1” instead of “-1” or “1”.
An activation function can generally be described as some function:
output = f (w · x + b)
where b is the bias (see below).
14.4.1 Common activation functions
A common activation function is the sigmoid function, which takes input and squashes it to be in
[0, 1], it has the form:
\sigma(x) = \frac{1}{1 + e^{-x}}
However, the sigmoid activation function has some problems. If the activation yields values at the
tails of 0 or 1, the gradient ends up being almost 0. In backpropagation, this local gradient is
multiplied with the gradient of the node’s output against the total error - if this local gradient is
near 0, it “kills” the gradient preventing any signal from going further backwards in the network. For
this reason, when using the sigmoid activation function you must be careful of how you initialize the
weights - if they are too large, you will “saturate” the network and kill the gradient in this way.
Furthermore, sigmoid outputs are not zero-centered:
This is undesirable since neurons in later layers of processing in a Neural Network (more
on this soon) would be receiving data that is not zero-centered. This has implications
on the dynamics during gradient descent, because if the data coming into a neuron is
always positive (e.g. x>0 elementwise in f=wTx+b)), then the gradient on the weights
Sigmoid activation function
w will during backpropagation become either all be positive, or all negative (depending
on the gradient of the whole expression f). This could introduce undesirable zig-zagging
dynamics in the gradient updates for the weights. However, notice that once these
gradients are added up across a batch of data the final update for the weights can have
variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but
it has less severe consequences compared to the saturated activation problem above.
(CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Modeling
one neuron, Andrej Karpathy)
The tanh activation function is another option; it squishes values to be in [−1, 1]. However, while its
output is zero-centered, it suffers from the same activation saturation issue that the sigmoid does.
\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
Note that tanh is really just a rescaled sigmoid function:
\sigma(x) = \frac{1 + \tanh(\frac{x}{2})}{2}
The Rectified Linear Unit (ReLU) is f (x) = max(0, x), that is, it just thresholds at 0. Compared to
the sigmoid/tanh functions, it converges with stochastic gradient descent quickly. Though there is
tanh activation function
ReLU activation function
not the same saturation issue as with the sigmoid/tanh functions, ReLUs can still “die” in a different
sense - their weights can be updated such that the neuron never activates again, which causes the
gradient through that neuron to be zero from then on, thus resulting in the same “killing” of the
gradient as with sigmoid/tanh. In practice, lowering the learning rate can avoid this.
Leaky ReLUs are an attempt to fix this problem. Rather than outputting 0 when x < 0, there will
instead be a small negative slope (∼ 0.01) when x < 0. That is, yi = αi x when xi < 0, and αi is
some fixed value. However, it does not always work well.
Note that αi in this case can also be a parameter to learn instead of a fixed value. These are called
parametric ReLUs.
Another alternative is a randomized leaky ReLU, where αi is a random variable during training and
fixed afterwards.
There are also some units which defy the conventional activation form of output = f (w · x + b).
One is the Maxout neuron. Its function is \max(w_1^T x + b_1, w_2^T x + b_2), which is a generalization of the ReLU and the leaky ReLU (both are special forms of Maxout). It has the benefits of ReLU but does not suffer the dying ReLU problem; its main drawback is that it doubles the number of parameters for each neuron (since there are two weight vectors and two bias units).
Karpathy suggests:
Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the
fraction of “dead” units in a network. If this concerns you, give Leaky ReLU or Maxout
a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU/Maxout.
source
| Activation Function | Propagation | Backpropagation |
| --- | --- | --- |
| Sigmoid | y_s = \frac{1}{1 + e^{-x_s}} | [\frac{\partial E}{\partial x}]_s = [\frac{\partial E}{\partial y}]_s \frac{1}{(1 + e^{x_s})(1 + e^{-x_s})} |
| Tanh | y_s = \tanh(x_s) | [\frac{\partial E}{\partial x}]_s = [\frac{\partial E}{\partial y}]_s \frac{1}{\cosh^2 x_s} |
| ReLU | y_s = \max(0, x_s) | [\frac{\partial E}{\partial x}]_s = [\frac{\partial E}{\partial y}]_s \mathbb{1}\{x_s > 0\} |
| Ramp | y_s = \min(1, \max(-1, x_s)) | [\frac{\partial E}{\partial x}]_s = [\frac{\partial E}{\partial y}]_s \mathbb{1}\{-1 < x_s < 1\} |
There is also the hard-tanh activation function, which is an approximation of tanh that is faster to
compute and take derivatives of:
\text{hardtanh}(x) = \begin{cases} -1 & x < -1 \\ 1 & x > 1 \\ x & \text{otherwise} \end{cases}
14.4.2 Softmax function
The softmax function (called such because it is like a “softened” maximum function) may be used
as the output layer’s activation function. It takes the form:
f_i(\text{NET}_i^N) = \frac{e^{\text{NET}_i^N}}{\sum_j^{k} e^{\text{NET}_j^N}}
To clarify, we are summing over all the output neurons in the denominator.
This function has the properties that it sums to 1 and that all of its outputs are positive, which are
useful for modeling probability distributions.
The cost function to use with softmax is the (categorical) cross-entropy loss function. It has the nice
property of having a very big gradient when the target value is 1 and the output is almost 0.
The categorical cross-entropy loss:
L_i = -\sum_j t_{i,j} \log(p_{i,j})
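A small NumPy sketch (my own) of the softmax and its categorical cross-entropy loss, with the usual max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(scores):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    exps = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exps / np.sum(exps)

def categorical_cross_entropy(targets, probs, eps=1e-12):
    """L_i = -sum_j t_{i,j} log(p_{i,j})"""
    return -np.sum(targets * np.log(probs + eps))

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())                                         # probabilities that sum to 1
print(categorical_cross_entropy(np.array([1, 0, 0]), p))  # large when the true class gets low probability
```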
14.4.3 Radial basis functions
You can base your activation function off of radial basis functions (RBFs):
f(X) = \sum_{i=1}^{N} a_i \, p(\|b_i X - c_i\|)
where
• X = input vector of attributes
• p = the RBF
• c = vector center (peak) of the RBF
• a = the vector coefficient/weight for each RBF
• b = the vector coefficient/weight for each input attribute
A radial basis function (RBF) is a function which is:
• symmetric about its center, which is its peak (with a value of 1)
• can be in n dimensions, but always returns a single scalar value r , the distance (usually Euclidean) b/w the input vector and the RBF’s peak:
r = ||x − xi ||
\phi is used to denote an RBF.
A neural network that uses RBFs as its activation functions is known as radial basis function neural
network.
The 1D Gaussian RBF is an example:
1D Gaussian RBF
Defined as:
\phi(r) = e^{-r^2}
The Ricker Wavelet is another example:
The Ricker Wavelet
Defined as:
\phi(r) = (1 - r^2) \cdot e^{-\frac{r^2}{2}}
14.5 Feed-forward neural networks
A feed-forward neural network is a simple neural network with an input layer, an output layer, and one or more intermediate layers of neurons.
These layers are fully-connected in that every neuron of the previous layer is connected to every
neuron in the following layer. Such layers are also called affine or dense layers.
When we describe the network in terms of layers as an "N-layer" neural network, we leave out the input layer (i.e. a 1-layer neural network has an input and an output layer; a 2-layer one has an input, a hidden, and an output layer). ANNs may also be described by their number of nodes (units), or,
more commonly, by the number of parameters in the entire network. (CS231n Convolutional Neural
Networks for Visual Recognition, Module 1: Neural Networks Part 1: Setting up the Architecture.
Andrej Karpathy. https://cs231n.github.io/neural-networks-1/)
This model is often called “feed-forward” because values go into the input layer and are fed into
subsequent layers.
Different learning algorithms can train such a network so that its weights are adjusted appropriately
for a given task. It’s worth emphasizing that the structure of the network is distinct from the learning
algorithm which tunes its weights and biases. The most popular learning algorithm for neural networks
is backpropagation.
14.6 Training neural networks
14.6.1 Backpropagation
The most common algorithm for adjusting a neural network’s weights and biases is backpropagation.
Backpropagation is just the calculation of partial derivatives (the gradient) by moving backwards
through the network (from output to input), accumulating them by applying the chain rule. In
particular, it computes the gradient of the loss function with respect to the weights in the network
(i.e. the derivatives of the loss function with respect to each weight in the network) in order to update
the weights.
We compute the total error for the network on the training data and then want to know how a change
in an individual weight within the network affects this total error (i.e. the result of our cost function),
e.g. \frac{\partial E_{\text{total}}}{\partial w_i}.
“Backpropagation” is almost just a special term for the chain rule in the context of training neural
networks. This is because a neural network can be thought of as a composition of functions, in which
case to compute the derivative of the overall function, you just apply the chain rule for computing
derivatives.
To elaborate on thinking of neural network as a “composition of functions”: each layer represents
a function taking in the inputs of the previous layer’s output, e.g. if the previous layer is a function
that outputs a vector, g(x), then the next layer, if we call it a function f , is f (g(x)).
The general procedure for training a neural network with backpropagation is:
• Initialize the neural network’s weights and biases.
• Training data is input into the NN to the output neurons, in feed-forward style.
• The error of the output is then propagated backwards (from the output layer back to the input
layer).
• As the error is propagated, weights and biases are adjusted (according to a learning rate, detailed
below) to minimize the remaining error between the actual and desired outputs.
Consider the following simple neural net:
Simple neural network
Here’s a single neuron expanded:
Single sigmoid neuron
Remember that a neuron processes its inputs by computing the dot product of its weights and inputs
(i.e. the sum of its weight-input products) and then passes this resulting net input into its activation
function (in this case, it is the sigmoid function).
376
CHAPTER 14. NEURAL NETS
377
14.6. TRAINING NEURAL NETWORKS
Say we have passed some training data through the network and computed the total error as E_{\text{total}}. To update the weight w_{2,1}, for example, we are looking for the partial derivative \frac{\partial E_{\text{total}}}{\partial w_{2,1}}, which by the chain rule is equal to:
\frac{\partial E_{\text{total}}}{\partial w_{2,1}} = \frac{\partial E_{\text{total}}}{\partial o_{2,1}} \times \frac{\partial o_{2,1}}{\partial i_{2,1}} \times \frac{\partial i_{2,1}}{\partial w_{2,1}}
Then we take this value and subtract it, multiplied by a learning rate η (sometimes notated α), from
the current weight w2,1 to get w2,1 ’s updated weight, though updates are only actually applied after
these update values have been computed for all of the network’s weights.
If we wanted to calculate the update value for w1,1 , we do something similar:
\frac{\partial E_{\text{total}}}{\partial w_{1,1}} = \frac{\partial E_{\text{total}}}{\partial o_{1,1}} \times \frac{\partial o_{1,1}}{\partial i_{1,1}} \times \frac{\partial i_{1,1}}{\partial w_{1,1}}
Any activation function can be used with backprop; it just needs to be differentiable everywhere.
The chain rule of derivatives (refresher)
(adapted from the CS231n notes cited below)
Refresher on derivatives: say you have a function f(x, y, z). The derivative of f with respect to x is called a partial derivative, since it is only with respect to one of the variables; it is notated \frac{\partial f}{\partial x} and is just a function that tells you how much f changes due to x at any point. The gradient is just a vector of these partial derivatives, so that there is a partial derivative for each variable (i.e. here it would be a vector of the partial derivative of f wrt x, and then wrt y, and then wrt z).
As a simple example, consider the function f(x, y) = xy. The derivatives here are just \frac{\partial f}{\partial x} = y, \frac{\partial f}{\partial y} = x.
What does this mean? Well, take \frac{\partial f}{\partial x} = y. This means that, at any given point, increasing x by an infinitesimal amount will change the output of the function by y times the amount that x changed. So if y = -3, then any small change in x will change f by -3 times that amount.
Now consider the function f(x, y, z) = (x + y)z. We can derive this by declaring q = x + y and then re-writing f to be f = qz. We can compute the gradient of f in this form (note that it is the same as f(x, y) = xy from before): \frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q. The gradient of q is also simple: \frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1.
We can combine these gradients to get the gradient of f wrt x, y, z instead of wrt q, z as we have now. We can get the missing partial derivatives wrt x and y by using the chain rule, which just requires that we multiply the appropriate partials:
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q}\frac{\partial q}{\partial x}, \quad \frac{\partial f}{\partial y} = \frac{\partial f}{\partial q}\frac{\partial q}{\partial y}
In code (adapted from the CS231 notes cited below)
# set some inputs
x = -2; y = 5; z = -4
# perform the forward pass
q = x + y # q becomes 3
f = q * z # f becomes -12
# perform the backward pass (backpropagation) in reverse order:
# first backprop through f = q * z
dfdz = q # df/dz = q, so gradient on z becomes 3
dfdq = z # df/dq = z, so gradient on q becomes -4
dqdx = 1.
dqdy = 1.
# now backprop through q = x + y
dfdx = dqdx * dfdq # dq/dx = 1. And the multiplication here is the chain rule!
dfdy = dqdy * dfdq # dq/dy = 1
So essentially you can decompose any function into smaller, simpler functions, compute the gradients
for those, then use the chain rule to aggregate them into the original function’s gradient.
The details
Consider the neural network:
A neural network
Where:
• L_i is layer i
• N_j^i is the jth node in layer i
• N is the number of layers
• k is the number of outputs, e.g. classes, or k = 1 for regression
• n_i is the number of nodes in layer i
Given training data:
(X^{(1)}, Y^{(1)}), (X^{(2)}, Y^{(2)}), \ldots, (X^{(m)}, Y^{(m)})
Where X^{(i)} \in \mathbb{R}^{1 \times d} and
• d is the dimensionality of the input
• m is the number of training examples
Thus X is the input matrix, X \in \mathbb{R}^{m \times d}.
For a node N_j^i:
• b_j^i is the bias for the node (scalar), i.e. b_j^i \in \mathbb{R}
• w_j^i is the weights for the node, w_j^i \in \mathbb{R}^{1 \times n_{i-1}}
• f_j^i is the activation function for the node
• \text{NET}_j^i is the net input for the node, \text{NET}_j^i = W_j^i \cdot \text{OUT}^{i-1} + b_j^i, \text{NET}_j^i \in \mathbb{R}
• \text{OUT}_j^i is the output for the node, \text{OUT}_j^i = f_j^i(\text{NET}_j^i), \text{OUT}_j^i \in \mathbb{R}
For a layer L_i:
• b^i is the bias vector for the layer, b^i \in \mathbb{R}^{n_i \times 1}, i.e.
b^i = \begin{bmatrix} b_1^i \\ b_2^i \\ \vdots \\ b_{n_i}^i \end{bmatrix}
• W^i is the weight matrix for the layer, W^i \in \mathbb{R}^{n_i \times n_{i-1}}, i.e.
W^i = \begin{bmatrix} W_1^i \\ W_2^i \\ \vdots \\ W_{n_i}^i \end{bmatrix}
• f^i is the activation function for the layer, since generally f^i = f_1^i = f_2^i = \cdots = f_{n_i}^i
• \text{NET}^i is the net input for the layer: \text{NET}^i = W^i \cdot \text{OUT}^{i-1} + b^i, \text{NET}^i \in \mathbb{R}^{n_i \times 1}
• \text{OUT}^i is the output for the layer: \text{OUT}^i = f^i(\text{NET}^i), \text{OUT}^i \in \mathbb{R}^{n_i \times 1}
Note that \text{OUT}^N = h_\theta(X).
The feed-forward step is a straightforward algorithm:
OUT = X                              # OUT^0 = X
for i in range(1, N + 1):
    OUT = f[i](W[i] @ OUT + b[i])    # OUT^i = f^i(W^i · OUT^{i-1} + b^i)
With backpropagation, we are interested in the gradient \nabla J for each iteration. \nabla J includes the components \frac{\partial J}{\partial W_j^i} and \frac{\partial J}{\partial b_j^i} (that is, how the cost function changes with respect to the weights and biases in the network).
The main advantage of backpropagation is that it allows us to compute this gradient efficiently.
There are other ways we could do it. For instance, we could manually calculate the partial derivatives
of the cost function with respect to each individual weight, which, if we had w weights, would
require computing the cost function w times, which requires w forward passes. Naturally, given a
complex network with many, many weights, this becomes extremely costly to compute. The beauty
of backpropagation is that we can compute these partial derivatives (that is, the gradient), with just
a single forward pass.
Some notation:
• J is the cost function for the network (what it is depends on the use case)
• δ i is the error for layer i
• ⊙ is the elementwise product (“Hadamard” or “Schur” product)
The error for a layer i is how much the cost function changes wrt that layer's net input, i.e. \delta^i = \frac{\partial J}{\partial \text{NET}^i}.
For the output layer, this is straightforward (by applying the chain rule):
\delta^N = \frac{\partial J}{\partial \text{NET}^N} = \frac{\partial J}{\partial \text{OUT}^N} \frac{\partial \text{OUT}^N}{\partial \text{NET}^N}
Since \text{OUT}^N = f^N(\text{NET}^N), then \frac{\partial \text{OUT}^N}{\partial \text{NET}^N} = (f^N)'(\text{NET}^N).
Thus we have:
\delta^N = \frac{\partial J}{\partial \text{OUT}^N} (f^N)'(\text{NET}^N)
Note that for \frac{\partial J}{\partial \text{OUT}^N}, we are computing the derivative of the cost function J with respect to each training example's corresponding output, and then we average them (in some situations, such as when the total number of training examples is not fixed, their sum is used). It is costly to do this across all training examples if you have a large training set, in which case the minibatch stochastic variant of gradient descent may be more appropriate. (TODO this may need clarification/revision)
For the hidden layer prior to the output, L_{N−1}, we need to connect that layer's net input, NET^{N−1}, to the cost function J:

δ^{N−1} = ∂J/∂NET^{N−1} = (∂J/∂OUT^N)(∂OUT^N/∂NET^N)(∂NET^N/∂OUT^{N−1})(∂OUT^{N−1}/∂NET^{N−1})
We have already calculated the term (∂J/∂OUT^N)(∂OUT^N/∂NET^N) as δ^N, so this can be restated:

δ^{N−1} = δ^N (∂NET^N/∂OUT^{N−1})(∂OUT^{N−1}/∂NET^{N−1})

Since OUT^{N−1} = f^{N−1}(NET^{N−1}), then ∂OUT^{N−1}/∂NET^{N−1} = (f^{N−1})'(NET^{N−1}).
Similarly, since NET^N = W^N · OUT^{N−1} + b^N, then ∂NET^N/∂OUT^{N−1} = W^N.

Thus:

δ^{N−1} = ((W^N)^T δ^N) ⊙ (f^{N−1})'(NET^{N−1})

This is how we compute δ^i for all i ≠ N, i.e. we push back (backpropagate) the next ("next" in the forward sense) layer's error, δ^{i+1}, to L_i to get δ^i. So we can generalize the previous equation:

δ^i = ((W^{i+1})^T δ^{i+1}) ⊙ (f^i)'(NET^i), for i ≠ N
We are most interested in updating weights and biases, rather than knowing the errors themselves.
That is, we are most interested in the quantities ∂J/∂W^i and ∂J/∂b^i for any layer L_i. These are relatively easy to derive.
We want to update the weights such that the error is lowered with the new weights. Thus we compute
the gradient of the error with respect to the weights and biases to learn in which way the error is
increasing and by how much. Then we move in the opposite direction by that amount (typically
weighted by a learning rate).
∂J/∂b^i = (∂J/∂NET^i)(∂NET^i/∂b^i)
∂J/∂W^i = (∂J/∂NET^i)(∂NET^i/∂W^i)

We previously showed that δ^i = ∂J/∂NET^i, so here we have:

∂J/∂b^i = δ^i (∂NET^i/∂b^i)
∂J/∂W^i = δ^i (∂NET^i/∂W^i)
Then, knowing that NET^i = W^i · OUT^{i−1} + b^i, we get:

∂NET^i/∂b^i = 1
∂NET^i/∂W^i = OUT^{i−1}

Thus:

∂J/∂b^i = δ^i
∂J/∂W^i = δ^i (OUT^{i−1})^T
Then we can use these for gradient descent.
A quick bit of notation: δ^{j,i} refers to layer i's error for the jth training example; similarly, OUT^{j,i} refers to layer i's output for the jth training example.

W^i → W^i − (η/m) Σ_j δ^{j,i} (OUT^{j,i−1})^T
b^i → b^i − (η/m) Σ_j δ^{j,i}

So, to clarify, we compute ∂J/∂b^i = δ^i and ∂J/∂W^i = δ^i (OUT^{i−1})^T for each training example and take their average. As mentioned before, in some cases you may take only their sum, which just means dropping the 1/m term, so you are effectively just scaling the change.
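Putting the above together, here is a minimal single-example sketch of backpropagation in NumPy. It assumes sigmoid activations in every layer and the quadratic cost; the function and variable names are illustrative only:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    # Forward pass, caching NET^i and OUT^i for each layer.
    outs, nets = [x], []
    for W_i, b_i in zip(weights, biases):
        nets.append(W_i @ outs[-1] + b_i)
        outs.append(sigmoid(nets[-1]))

    # Output layer: delta^N = dJ/dOUT^N * (f^N)'(NET^N), quadratic cost.
    s = sigmoid(nets[-1])
    delta = (outs[-1] - y) * s * (1 - s)

    grads_W, grads_b = [None] * len(weights), [None] * len(weights)
    for i in reversed(range(len(weights))):
        grads_W[i] = delta @ outs[i].T   # dJ/dW^i = delta^i (OUT^{i-1})^T
        grads_b[i] = delta               # dJ/db^i = delta^i
        if i > 0:
            s = sigmoid(nets[i - 1])
            delta = (weights[i].T @ delta) * s * (1 - s)  # push the error back a layer
    return grads_W, grads_b

These per-example gradients would then be averaged over the (mini-)batch and plugged into the update rules above.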
Considerations
Backpropagation guarantees neither that training will succeed nor that it will be fast.
Possible issues include:
• network paralysis: if the weights become very large, the neurons' net inputs become very large, pushing their OUTs into regions where the derivative of the activation function is very small, so the weights are barely updated and get "stuck" at large values.
• local minima: statistical training methods (such as simulated annealing) can be used to avoid local minima, but they increase training time.
• step size: if it is too small, training is too slow; if it is too large, paralysis or instability (no convergence) are possible.
• stability: the network should not undo its learning of one thing in order to learn another. For instance, it may learn good weights for one input, but to learn good weights for another input, it "overwrites" or "forgets" what it learned about the prior input.
Note that in high dimensions, local minima are usually not a problem. More typically, there are saddle points, which slow down training but can be escaped in time. This is because with many dimensions, it is unlikely that a point is a minimum in all dimensions (if a point is a minimum in one dimension with probability p, then it is a minimum in all n dimensions with probability p^n); it is, however, likely to be a minimum in some of the dimensions.

As training nears the global minimum, p increases, so if you do end up at a local minimum, it will likely be close enough to the global minimum.
14.6.2
Statistical (stochastic) training
Statistical (or "stochastic") training methods, contrasted with deterministic training methods (such as backpropagation as described above), involve some randomness to avoid local minima. They generally work by randomly leaving local minima to possibly find the global minimum. The severity of this randomness decreases over time so that a solution is "settled" upon (this gradual "cooling" of the randomness is the key part of simulated annealing).
Simulated annealing applied as a training method to a neural network is called Boltzmann training
(neural networks trained in this way are called Boltzmann machines):
1. Set T (the artificial temperature) to a large value.
2. Apply inputs, then calculate outputs and the objective function.
3. Make random weight changes, then recalculate the network output and the change in the objective function.
4. If the objective function improves, keep the weight changes. If it worsens, accept the change with a probability drawn from the Boltzmann distribution, P(c): select a random variable r from a uniform distribution on [0, 1]; if P(c) > r, keep the change, otherwise don't.
P(c) = exp(−c / (kT))

Where:
• c is the change in the objective function
• k is a constant analogous to Boltzmann's constant in simulated annealing, specific to the current problem
• T is the artificial temperature
• P(c) is the probability of the change c in the objective function
Steps 3 and 4 are repeated for each of the weights in the network as T is gradually decreased.
The random weight change can be selected in a few ways; one is to choose it from a Gaussian distribution, P(w) = exp(−w²/T²), where P(w) is the probability of a weight change of size w and T is the artificial temperature. Then you can use Monte Carlo simulation to generate the actual weight change, Δw.
Boltzmann training uses the following cooling rate, which is necessary for convergence to a global
minimum:
T(t) = T_0 / log(1 + t)
Where T0 is the initial temperature, and t is the artificial time.
The problem with Boltzmann training is that it can be very slow (the cooling rate as computed above
is very low).
This can be resolved by using the Cauchy distribution instead of the Boltzmann distribution; the
former has fatter tails so has a higher probability of selecting large step sizes. Thus the cooling rate
can be much quicker:
T(t) = T_0 / (1 + t)
The Cauchy distribution is:
P(x) = T(t) / (T(t)² + x²)

where P(x) is the probability of a step of size x.
This can be integrated, which makes selecting random weights much easier:
x_c = ρ T(t) tan(P(x))
Where ρ is the learning rate coefficient and xc is the weight change.
Here we can just select a random number from a uniform distribution in (−π/2, π/2), then substitute this for P(x) and solve for x in the above, using the current temperature.
Cauchy training still may be slow so we can also use a method based on artificial specific heat (in
annealing, there are discrete energy levels where phase changes occur, at which abrupt changes in the
“specific heat” occur). In the context of artificial neural networks, we define the (pseudo)specific heat
to be the average rate of change of temperature with the objective function. The idea is that there
are parts where the objective function is sensitive to small changes in temperature, where the average
value of the objective function makes an abrupt change, so the temperature must be changed slowly
here so as not to get stuck in a local minimum. Where the average value of the objective function
changes little with temperature, large changes in temperature can be used to quicken things.
Still, Cauchy training may be much slower than backprop, and can have issues of network paralysis (because very large random weight changes are possible), especially if a saturating nonlinearity such as the sigmoid is used as the activation function (see the bit on network paralysis and the sigmoid function above).
Cauchy training may be combined with backprop to get the best of both worlds - it simply involves
computing both the backprop and Cauchy weight updates and applying their weighted sum as the
update. Then, the objective function’s change is computed, and like with Cauchy training, if there
is an improvement, the weight change is kept, otherwise, it is kept with a probability determined by
the Boltzmann distribution.
The weighted sum of the individual weight updates is controlled by a coefficient η, such that the sum
is η[α∆Wt−1 + (1 − α)δOUT] + (1 − η)xc , so that if η = 0, the training is purely Cauchy, and if
η = 1, it becomes purely backprop.
There is still the issue of the possibility of retaining a massive weight change due to the Cauchy
distribution’s infinite variance, which creates the possibility of network paralysis. The recommended
approach here is to detect saturated neurons by looking at their OUT values - if it is approaching
the saturation point (positive or negative), apply some squashing function to its weights (note that
this squashing function is not restricted to the range [−1, 1] and in fact may work better with a
larger range). This potently reduces large weights while only attenuating smaller ones, and maintains
symmetry across weights.
14.6.3
Learning rates
The amount by which weights and biases are adjusted is determined by a learning rate η (the update rule itself is sometimes called a delta rule). This often involves some constant which manages the momentum of learning, µ (see below for more on momentum). This learning constant can help "jiggle" the network out of local optima, but you want to take care that it isn't set so high that the network also jiggles out of the global optimum. As a simple example:
# LEARNING_CONSTANT is defined elsewhere
def adjust_weight(weight, error, input):
    return weight + error * input * LEARNING_CONSTANT
In some cases, a simulated annealing approach is used, where the learning constant may be tempered
(made smaller, less drastic) as the network evolves, to avoid jittering the network out of the global
optima.
Adaptive learning rates
Over the course of training, it is often better to gradually decrease the learning rate as you approach an optimum so that you don't "overshoot" it.
Separate adaptive learning rates
The appropriate learning rate can vary across parameters, so it can help to have different adaptive
learning rates for each parameter.
For example, the magnitudes of gradients are often very different across layers (starting small early
on, growing larger further on).
The fan-in of a neuron (number of inputs) also has an effect, determining the size of “overshoot”
effects (the more inputs there are, the more weights are changed simultaneously, all to adjust the
same error, which is what can cause the overshooting).
So what you can do is manually set a global learning rate, then for each weight multiply this global
learning rate by a local gain, determined empirically per weight.
One way to determine these learning rates is as follows:
• start with a local gain gij = 1 for each weight wij
• increase the local gain if the gradient for that weight does not change sign
• use small additive increases and multiplicative decreases:

  g_ij(t) = g_ij(t−1) + 0.05   if (∂E/∂w_ij)(t) · (∂E/∂w_ij)(t−1) > 0
  g_ij(t) = 0.95 g_ij(t−1)     otherwise
This ensures that big gains decay rapidly when oscillations start.
Another tip: limit the gains to lie in some reasonable range, e.g. [0.1, 10] or [0.01, 100].
Note that these adaptive learning rates are meant for full batch learning or for very big mini-batches.
Otherwise, you may encounter gradient sign changes that are just due to sampling error of a minibatch.
These adaptive learning rates can also be combined with momentum by using agreement in sign
between the current gradient for a weight and the velocity for that weight.
Note that adaptive learning rates deal only with axis-aligned effects.
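As a rough sketch of this scheme (assuming full-batch gradients stored as NumPy arrays of the same shape as the weights; the function names and constants are illustrative only):

import numpy as np

def update_gains(gains, grad, prev_grad, inc=0.05, dec=0.95, lo=0.1, hi=10.0):
    agree = (grad * prev_grad) > 0                     # gradient kept its sign for this weight
    gains = np.where(agree, gains + inc, gains * dec)  # additive increase, multiplicative decrease
    return np.clip(gains, lo, hi)                      # keep gains in a reasonable range

def apply_update(W, grad, gains, global_lr=0.01):
    return W - global_lr * gains * grad                # per-weight effective learning rate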
14.6.4
Training algorithms
There are a variety of gradient descent algorithms used for training neural networks. Some of the
more popular ones include the following.
Momentum
Momentum is a technique that can be combined with gradient descent to improve its performance.
Conceptually, it applies the idea of velocity and friction to the error surface (imagine a ball rolling
around the error surface to find a minimum).
We incorporate a matrix of velocity values V, with the same shape as the matrix of weights and biases for the network (for simplicity, we will roll the weights and biases together into a single matrix W).
To do so, we break our gradient descent update rule (W → W ′ = W − η∇J) into two separate
update rules; one for updating the velocity matrix, and another for updating the weights and biases:
V → V ′ = µV − η∇J
W → W′ = W + V ′
Another hyperparameter, µ ∈ [0, 1], is also introduced - this controls the “friction” of the system
(µ = 1 is no friction). It is known as the momentum coefficient, and it is tuned to prevent
“overshooting” the minimum.
You can see that if µ is set to 0, we get the regular gradient descent update rule.
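A minimal sketch of one momentum step (illustrative names; W, V, and the gradient are NumPy arrays of the same shape):

def momentum_step(W, V, grad, lr=0.01, mu=0.9):
    V = mu * V - lr * grad   # V -> V' = mu*V - eta * grad(J)
    W = W + V                # W -> W' = W + V'
    return W, V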
Nesterov momentum
A variation on regular momentum is Nesterov momentum:
V → V′ = µV − η∇J(W + µV)
W → W′ = W + V′
Here we add the velocity to the parameters before computing the gradient.
Using gradient descent with this momentum is also called the Nesterov accelerated gradient.
Adagrad
Adagrad is an adaptive learning rate algorithm that keeps track of squared gradients over time.
This allows it to identify frequently updated parameters and infrequently updated parameters - as a
result, learning rates can be adapted per parameter over time (e.g. higher learning rates are assigned
to infrequently updated parameters). This makes it quite useful for sparse data, and also means that
the learning rate does not need to be manually tuned.
More formally, for each parameter θ_i we have g_{t,i} = ∇J(θ_i), so we re-write the update for a parameter θ_i at time t to be:

θ_{t+1,i} = θ_{t,i} − η · g_{t,i}
The learning rate η is then modified at each time step t using a diagonal matrix G_t ∈ R^{d×d}. Each diagonal element (i, i) in G_t is the sum of squared gradients of parameter θ_i up to time t. In particular, we divide η by the square root of this matrix (empirically, the square root improves performance):

θ_{t+1,i} = θ_{t,i} − (η / √(G_{t,ii} + ϵ)) · g_{t,i}
An additional smoothing term ϵ is included to prevent division by zero (e.g. 1e-8).
As vector operations, this is written:

θ_{t+1} = θ_t − (η / √(G_t + ϵ)) ⊙ g_t
Note that as training progresses, the denominator term Gt will grow very large (the sum of squared
gradients accumulate), such that the learning rate eventually becomes very small and learning virtually
ceases.
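A minimal sketch of an Adagrad step, keeping only the diagonal of G_t as an array G of accumulated squared gradients (names are illustrative):

import numpy as np

def adagrad_step(theta, grad, G, lr=0.01, eps=1e-8):
    G = G + grad ** 2                              # accumulate squared gradients per parameter
    theta = theta - lr * grad / np.sqrt(G + eps)   # per-parameter scaled update
    return theta, G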
Adadelta
Adadelta is an improvement on Adagrad designed to deal with its halting learning rate. Whereas
Adagrad takes the sum of all past squared gradients for a parameter, Adadelta takes only the most recent w squared gradients.
This is implemented as follows. We keep a running average E[g²]_t (γ is like a momentum term, usually around 0.9):

E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g_t²

And then update the Adagrad equation to replace the matrix G_t with this running (exponentially decaying) average:

Δθ_t = −(η / √(E[g²]_t + ϵ)) g_t
Note that this running average is the same as the root mean squared (RMS) error criterion of the gradient, so this can be re-written:

Δθ_t = −(η / RMS[g]_t) g_t
There is one more enhancement that is part of Adadelta. The learning rate η's units (as in "units of measurement", not as in "hidden units") do not match the parameters' units; this is true for all training methods shown up until now. This is resolved by defining another exponentially decaying average, this one of squared parameter updates:

E[Δθ²]_t = γ E[Δθ²]_{t−1} + (1 − γ) Δθ_t²

The RMS error of this term is also taken:

RMS[Δθ]_t = √(E[Δθ²]_t + ϵ)

Thus the Adadelta update is:

Δθ_t = −(RMS[Δθ]_{t−1} / RMS[g]_t) g_t
Note that Adadelta without the numerator RMS term, that is:

Δθ_t = −(η / RMS[g]_t) g_t

is known as RMSprop; typical RMSprop values are γ = 0.9, η = 0.001.
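A minimal RMSprop-style sketch of this update, where avg_sq holds E[g²] (names are illustrative):

import numpy as np

def rmsprop_step(theta, grad, avg_sq, lr=0.001, gamma=0.9, eps=1e-8):
    avg_sq = gamma * avg_sq + (1 - gamma) * grad ** 2   # decaying average of squared gradients
    theta = theta - lr * grad / np.sqrt(avg_sq + eps)
    return theta, avg_sq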
Adam
Adam, short for “Adaptive Moment Estimation”, is, like Adagrad, Adadelta, and RMSprop, an
adaptive learning rate algorithm. Like Adadelta and RMSprop, Adam keeps track of an exponentially
decaying average of past squared gradients (here, it is vt ), but it also keeps track of an exponentially
decaying average of past (non-squared) gradients mt , similar to momentum:
mt = β1 mt−1 + (1 − β1 )gt
vt = β2 vt−1 + (1 − β2 )gt2
mt and vt are estimates of the first moment (the mean) and second moment (the uncentered variance)
of the gradients, respectively.
Note that the β terms are decay rates (typically β1 = 0.9, β2 = 0.999) and that m_t, v_t are initialized to zero vectors. As such, m_t and v_t tend to be biased towards zero, so bias-corrected versions of each are computed:

m̂_t = m_t / (1 − β1^t)
v̂_t = v_t / (1 − β2^t)
Then the Adam update rule is simply:

θ_{t+1} = θ_t − (η / (√v̂_t + ϵ)) m̂_t
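A minimal sketch of an Adam step (t is the 1-based timestep; m and v start as zero arrays shaped like θ; names are illustrative):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v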
14.6.5
Batch Normalization
Batch normalization is a normalization method for mini-batch training, which can improve training
time (it allows for higher learning rates), act as a regularizer, and reduce the importance of proper
parameter initialization. It is applied to intermediate representations in the network.
For a mini-batch x (of size m), the sample mean and variance for each feature k is computed:
x̄_k = (1/m) Σ_{i=1}^{m} x_{i,k}
σ_k² = (1/m) Σ_{i=1}^{m} (x_{i,k} − x̄_k)²
Each feature k is then standardized as follows:
x̂_k = (x_k − x̄_k) / √(σ_k² + ϵ)
where ϵ is a small positive constant to improve numerical stability.
Standardizing intermediate representations in this way can weaken the representational power of the
layer, so two additional learnable parameters γ and β are introduced to scale and/or shift the data.
Altogether, the batch normalization function is as follows:
BN(x_k) = γ_k x̂_k + β_k
When γk = σk and βk = x̄k , we recover the original representation.
Given a layer with some activation function ϕ, which would typically be defined as ϕ(W x + b), we
can redefine it with batch normalization:
ϕ(BN(W x))
The bias is dropped because its effect is cancelled by the standardization.
During test time, we must use x̄_k and σ_k² as computed over the training data; the final values are usually obtained by keeping a running average of these statistics over the mini-batches during training.
Refer to Batch Normalized Recurrent Neural Networks (César Laurent, Gabriel Pereyra, Philémon
Brakel, Ying Zhang, Yoshua Bengio) for more details.
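A minimal sketch of the batch normalization forward pass at training time (x is a mini-batch of shape (m, num_features); gamma and beta are the learnable per-feature parameters; names are illustrative):

import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                      # per-feature sample mean over the mini-batch
    var = x.var(axis=0)                        # per-feature sample variance
    x_hat = (x - mean) / np.sqrt(var + eps)    # standardize each feature
    return gamma * x_hat + beta, mean, var     # batch stats feed the running averages used at test time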
14.6.6 Cost (loss/objective/error) functions
We have some cost function, which is a function of our parameters, typically notated J(θ).
For regression, this is often the mean squared error (MSE), also known as the quadratic cost:
J(θ) = (1/m) Σ_{i=1}^{m} (y^(i) − h_θ(X^(i)))²
Note that hθ represents the output of the entire network.
So for a single example, the cost function is (y^(i) − h_θ(X^(i)))².

For convenience when differentiating, we'll include a ½ term. Including this term just scales the cost function, which doesn't impact the outcome, and for clarity, we'll substitute f^N(NET^N) for h_θ(X^(i)), since they are equivalent:

½ (y^(i) − f^N(NET^N))²
Differentiating with respect to W^N and b^N gives us the following for individual examples:

∂J/∂W^N = (f^N(NET^N) − y)(f^N)'(NET^N) X^(i)
∂J/∂b^N = (f^N(NET^N) − y)(f^N)'(NET^N)
Note that these are dependent on the derivative of the output layer’s activation function,
(f N )′ (NET N ). This can cause training to become slow in the case of activation functions like the
sigmoid function. This is because the derivative near the sigmoid’s tails (i.e. where it outputs values
close to 0 or 1) is very low (the sigmoid flattens out at its tails). Thus, when the output layer’s
function has this property, and outputs values near 0 and 1 (in the case of sigmoid), this reduces the
entire partial derivative, leading to small updates, which has the effect of slow learning. When slow
learning of this sort occurs (that is, the kind caused by the activation functions outputting at their
minimum or maximum), it is called saturation, and it is a common problem with neural networks.
For binary classification, a common cost function is the cross-entropy cost, also known as “log loss”
or “logistic loss”:
J(θ) = −(1/m) Σ_{i=1}^{m} Σ_{j=1}^{k} [y^(i) ln h_θ(X^(i)) + (1 − y^(i)) ln(1 − h_θ(X^(i)))]
where m is the total number of training examples and k is the number of output neurons.
The partial derivatives of the cross-entropy cost with respect to W^N and b^N are (for brevity, we'll write f^N(NET^N) as simply f(n)):

∂J/∂W^N = Σ_j^k [(y − f(n)) / (f(n)(1 − f(n)))] f'(n) X
∂J/∂b^N = Σ_j^k [(y − f(n)) / (f(n)(1 − f(n)))] f'(n)
This has the advantage that for some activation functions f , such as the sigmoid function, the
activation function’s derivative f ′ cancels out, thus avoiding the training slowdown that can occur
with the MSE.
However, as mentioned before, this saturation occurs with only some activation functions (like the
sigmoid function). This isn’t a problem, for instance, with linear activation functions, in which case
quadratic cost is appropriate (though neural nets with linear activation functions are limited in what
they can learn).
Thus we have, as before: δ^N = (∂J/∂OUT^N)(f^N)'(NET^N).
Log-likelihood cost function
The log-likelihood cost function is defined, for a single training example, as:

−ln f_y^N(NET_y^N)
That is, given an example that belongs to class y, we take the natural log of the value output by the output node corresponding to the class y (typically this is the yth node, since you'd have an output node for each class). If f_y^N(NET_y^N) is close to 1, then the resulting cost is low; the further it is from 1, the larger the cost.
This is assuming that the output node’s activation function outputs probability-like values (such as
is the case with the softmax function).
This cost function's partial derivatives with respect to W^N and b^N work out to be:

∂J/∂W^N = f^{N−1}(n)(f^N(n) − y)
∂J/∂b^N = f^N(n) − y
For brevity, we've written f^N(NET^N) as simply f^N(n), and similarly for f^{N−1}(n); for the latter, n = NET^{N−1}.
Note that for softmax activation functions, we avoid the saturation problem with this cost function.
Thus softmax output activations and the log-likelihood cost functions are a good pairing for problems
requiring probability-like outputs (such as with classification problems).
Common loss functions
Loss function — propagation — backpropagation:

• Square: y = ½(x − d)²; ∂E/∂x = (x − d)^T ∂E/∂y
• Log (c = ±1): y = log(1 + e^{−cx}); ∂E/∂x = (−c / (1 + e^{cx})) ∂E/∂y
• Hinge (c = ±1): y = max(0, m − cx); ∂E/∂x = −c·1{cx < m} ∂E/∂y
• LogSoftMax (c = 1…k): y = log(Σ_k e^{x_k}) − x_c; [∂E/∂x]_s = (e^{x_s} / Σ_k e^{x_k} − δ_{sc}) ∂E/∂y
• MaxMargin (c = 1…k): y = [max_{k≠c}{x_k + m} − x_c]_+; [∂E/∂x]_s = (δ_{sk*} − δ_{sc})·1{E > 0} ∂E/∂y

14.6.7 Weight initialization
What are the best values to initialize weights and biases to?
Given normalized data, we could reasonably estimate that roughly half the weights will be negative
and roughly half will be positive.
As a result, it may seem intuitive to initialize all weights to zero. But you should not - this causes
every neuron to have the same output, which causes them to have the same gradients during backpropagation, which causes them to all have the same parameter updates. Thus none of the neurons
will differentiate.
Alternatively, we could set each neuron’s initial weights to be a random vector from a standard
multidimensional normal distribution (mean of 0, standard deviation of 1), scaled by some value,
e.g. 0.001 so that they are kept very small, but still non-zero. This process is known as symmetry
breaking. The random initializations allow the neurons to differentiate themselves during training.
However, this can become problematic.
Consider that the net input to a neuron is:
NET = W · X + b
The following extends to the general case, but for simplicity, consider an input X that is all ones,
with dimension d.
Then NET is a sum of d +1 (plus one for the bias) standard normally distributed independent random
variables.
The sum of n normally distributed independent random variables is:

N(Σ_i^n µ_i, Σ_i^n σ_i²)
That is, it is also a normal distribution.
Thus NET will still have a mean of 0, but its standard deviation will be √(d + 1).
If for example, d = 100, this leaves us with a standard deviation of ∼ 10. This is quite large, and
implies that NET may take on large values due to how we initialized our weights. If NET takes on
large values, we may run into saturation problems given an activation function such as sigmoid, which
then leads to slow training. Thus, poor weight initialization can lead to slow training.
This is most problematic for deep networks, since they may reduce the gradient signal that flows
backwards by too much (in a weaker version of the gradient “killing” effect).
As the number of inputs to a neuron grows, so too will its output's variance. This can be controlled for (calibrated) by scaling its weight vector by the square root of its "fan-in" (its number of inputs): divide the random vector sampled from the standard multidimensional normal distribution by √n, where n is the number of the neuron's inputs. For ReLUs, it is recommended you instead use a standard deviation of √(2/n). (Karpathy's CS231n notes provide more detail on why this is.)
An alternative to this fan-in scaling for the uncalibrated variances problem is sparse initialization,
which is to set all weights to 0, and then break symmetry by randomly connecting every neuron to
some fixed number (e.g. 10) of neurons below it by setting those weights to ones randomly sampled
from the standard normal distribution like mentioned previously.
Biases are commonly initialized to be zero, though if using ReLUs, then you can set them to a small
value like 0.01 so all the ReLUs fire at the start and are included in the gradient backpropagation
update.
Elsewhere it is recommended that ReLU weights be sampled from a zero-mean Gaussian distribution with standard deviation √(2/d_in).
Elsewhere it is recommended that you sample your weights uniformly from [−b, b], where:

b = √(6 / (H_k + H_{k+1}))

where H_k and H_{k+1} are the sizes of the hidden layers before and after the weight matrix.
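A minimal sketch combining the fan-in-scaled Gaussian initializations described above (the function name and the relu flag are illustrative):

import numpy as np

def init_layer(n_in, n_out, relu=False):
    # std = sqrt(2/n_in) for ReLU layers, 1/sqrt(n_in) otherwise
    std = np.sqrt(2.0 / n_in) if relu else 1.0 / np.sqrt(n_in)
    W = np.random.randn(n_out, n_in) * std
    b = np.zeros((n_out, 1))
    return W, b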
14.6.8
Shuffling & curriculum learning
Generally you should shuffle your data every training epoch so the network does not become biased
towards a particular ordering.
However, there are cases in which your network may benefit from a meaningful ordering of input data;
this approach is called curriculum learning.
14.6.9 Gradient noise
Adding noise from a Gaussian distribution to each update, i.e.
g_{t,i} = g_{t,i} + N(0, σ_t²)

with the variance annealed according to the following schedule:

σ_t² = η / (1 + t)^γ
has been shown to make “networks more robust to poor initialization and helps training particularly
deep and complex networks. They suspect that the added noise gives the model more chances to
escape and find new local minima, which are more frequent for deeper models.” (An overview of
gradient descent optimization algorithms, Sebastian Ruder)
14.6.10
Adversarial examples
Adding noise to input, such as in the accompanying figure, can throw off a classifier. Few strategies
are robust against these tricks, but one approach is to generate these adversarial examples and include
them as part of the training set.
(Figure: an adversarial example.)
14.6.11
Gradient Checking
When you write code to compute the gradient, it can be very difficult to debug. Thus it is often useful
to check the gradient by numerically approximating the gradient and comparing it to the computed
gradient.
Say our implemented gradient function is g(θ). We want to check that g(θ) = ∂J(θ)/∂θ.
We choose some ϵ, e.g. ϵ = 0.0001. It should be a small value, but not so small that we run into
floating point precision errors.
Then we can numerically approximate the gradient at some scalar value θ:

(J(θ + ϵ) − J(θ − ϵ)) / (2ϵ)

When θ is a vector, as is more often the case, we instead compute, for each component i:

(J(θ^(i+)) − J(θ^(i−))) / (2ϵ)

Where:
• θ^(i+) = θ + (ϵ × e_i)
• θ^(i−) = θ − (ϵ × e_i)
• e_i is the ith basis vector (i.e. it is 0 everywhere except at the ith element, where it is 1)
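A minimal sketch of this check, where J is a function of the parameter vector (names and the comparison tolerance are illustrative):

import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    # Central-difference approximation of dJ/dtheta, one component at a time.
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e.flat[i] = eps
        grad.flat[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad

# Then compare against the implemented gradient g(theta), e.g.:
# assert np.allclose(numerical_gradient(J, theta), g(theta), atol=1e-6)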
14.6.12
Training tips
Start training with small, unequal weights to avoid saturating the network with large weights. If all the weights start equal, the network won't learn anything.
• Normalize real-valued data (subtract mean, divide by standard deviation (see part on data
preprocessing))
• Decrease the learning rate during training
• Use minibatches for a more stable gradient (e.g. use stochastic gradient descent)
• Use momentum to get through plateaus
14.6.13
Transfer Learning
The practice of transfer learning involves taking a neural net trained for one task and applying it to a different task. For instance, given an image classification net trained for one classification task, you can reuse that network for another by truncating the output layer: take the vectors from the second-to-last layer and use them as feature vectors for the new task.
14.7 Network architectures
The architecture of a neural network describes how its layers are structured - e.g. how many layers
there are, how many neurons in each, and how they are connected.
Neural networks are distinguished by their architecture.
The general structure of a neural network is input layer -> 0 or more hidden layers ->
output layer.
Neural networks always have one input layer, and the size of that input layer is equal to the input
dimensions (i.e. one node per feature), though sometimes you may have an additional bias node.
Neural networks always have one output layer, and the size of that output layer depends on what
you’re doing. For instance, if your neural network will be a regressor (i.e. for a regression problem),
then you’d have a single output node (unless you’re doing multivariate regression). Same for binary
classification. However with softmax (more than just two classes) you have one output node per class
label, with each node outputting the probability the input is of the class associated with the node.
If your data is linearly separable, then you don’t need any hidden layers (and you probably don’t need
a neural network either and a linear or generalized linear model may be plenty).
Neural networks with additional hidden layers become difficult to train; networks with multiple hidden
layers are the subject of deep learning (detailed below). For many problems, one hidden layer suffices,
and you may not see any performance improvement from adding additional hidden layers.
A rule of thumb for deciding the size of the hidden layer is that it should be between the input size and the output size (for example, the mean of the two).
14.8 Overfitting
Because neural networks can have so many parameters, it can be quite easy for them to overfit. Thus
it is something to always keep an eye out for. This is especially a problem for large neural networks,
which have huge amounts of parameters.
As the network grows in number of layers and size, the network capacity increases, which is to say it
is capable of representing more complex functions.
Simpler networks have fewer local minima; these minima are easier to converge to but tend to perform worse (they have higher loss). There is a great deal of variance across these local minima, so the outcome is quite sensitive to the random initialization: sometimes you land in a good local minimum, sometimes not. More complex networks have more local minima, but those minima tend to perform better, and there is less variance in how they perform.
Higher-capacity networks run a greater risk of overfitting, but this overfitting can be (preferably) mitigated by other methods such as L2 regularization, dropout, and input noise. So don’t let overfitting
be the sole reason for going with a simpler network if a larger one seems appropriate.
Here are regularization examples for the same data as in the previous figure, using a neural net with 20 hidden neurons:
As you can see, regularization is effective at counteracting overfitting.
Another simple, but possibly expensive way of reducing overfitting is by increasing the amount of
training data - it’s unlikely to overfit many, many examples. However, this is seldom a practical
option.
Generally, the methods for preventing overfitting include:
• Get more data, if possible
• Limit your model's capacity so that it can't fit the idiosyncrasies of the data you have. With neural networks, this can be accomplished by:
  – limiting the number of hidden layers and/or number of units per layer
  – starting with small weights and stopping learning early (so the weights can't get too large)
  – weight decay: penalize large weights using penalties on their squared values (L2) or absolute values (L1)
  – adding Gaussian noise (i.e. x_i + N(0, σ_i²)) to inputs
• Average many different models:
  – use different models with different forms, or
  – train the model on different subsets of the training data ("bagging"), or
  – use a single neural network architecture, but learn different sets of weights, and average the predictions across these different sets of weights
14.8.1
Regularization
Regularization techniques are used to prevent neural networks from overfitting.
L2 Regularization
L2 regularization is the most common form of regularization. We penalize the squared magnitude of all parameters (weights) as part of the objective function, i.e. we add Σ λw² to the objective function (this additional term is called the regularization term, and λ is an additional hyperparameter, the regularization parameter). It is common to include a factor of ½, i.e. use Σ ½λw², so that the gradient of this term with respect to w is just λw instead of 2λw. This discourages the network from relying heavily on a few weights and encourages it to use all weights a little.
L2 regularization is sometimes called weight decay since the added regularization term penalizes
large weights, favoring smaller weights.
So a regularized cost function J, from the original unregularized cost function J0 , is simply:
J = J0 +
λ ∑ 2
w
2m w
This affects the partial derivative of the cost function with respect to weights in a simple way (again,
biases are not included, so it does not change that partial derivative):
∂J/∂w = ∂J_0/∂w + (λ/m) w
So your update rule would be:

w → w′ = w − (ηλ/m) w − (η/m) Σ_i ∂J_i/∂w
Note that biases are typically not included by convention; regularizing them usually does not have an
impact on the network’s generalizability.
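A minimal sketch of the resulting L2-regularized update, where avg_grad is the mini-batch average of ∂J_i/∂w (the names are illustrative):

def l2_regularized_update(w, avg_grad, lr, lam, m):
    # w -> w - (eta*lambda/m) * w - eta * avg_grad, i.e. weight decay plus the usual gradient step
    return w - (lr * lam / m) * w - lr * avg_grad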
L1 Regularization
Similar to L2 regularization, except that the regularization term added to the objective function is Σ λ|w|; that is, the sum of the absolute values of the weights, scaled by a regularization parameter λ.
The main difference between L1 and L2 regularization is that L1 regularization shrinks weights by a
constant amount, whereas L2 regularization shrinks weights by an amount proportional to the weights
themselves. This is made clearer by considering the derived update rules from gradient descent.
L1 regularization has the effect of causing weight vectors to become sparse, such that neurons only
use a few of their inputs and ignore the rest as “noise”. Generally L2 regularization is preferred to L1.
For L1, the partial derivative of the cost function with respect to the weights is:

∂J/∂w = ∂J_0/∂w + (λ/m) sign(w)
This ends up leading to the following update rule:

w → w′ = w − (ηλ/m) sign(w) − (η/m) Σ_i ∂J_i/∂w
Note that we say that sign(0) = 0.
Compare this with the update rule for L2 regularization:

w → w′ = w − (ηλ/m) w − (η/m) Σ_i ∂J_i/∂w
In L2 regularization, we subtract a term weighted by w , whereas in L1 regularization, the subtracted
term is affected only by the sign of w .
Elastic net regularization
This is just the combination of L1 and L2 regularization, such that the term introduced to the objective function is Σ (λ_1|w| + λ_2 w²).
Max norm constraints
This involves setting an absolute upper bound on the magnitude of the weight vectors; that is, after
updating the parameters/weights, clamp every weight vector so that it satisfies ||w ||2 < c, where c
is some constant (the maximum magnitude).
Dropout
Dropout is a regularization method which works well with the others mentioned so far (L1, L2,
maxnorm). It does not involve modifying cost functions. Rather, the network itself is modified.
During training, we specify a probability p. At the start of each training iteration, we keep a neuron active with that probability p, and otherwise set its output to zero. If the neuron's output is set to 0, that has the effect of temporarily "removing" that neuron for that training iteration. At the end of the iteration, all neurons are restored.
This dropout is applied only at training time and applied per-layer (that is, it is applied after each
layer, see the code example below). This prevents the network from relying too much on certain
neurons.
One way to think about this is that, for each training step, a sub-network is sampled from the full
network, and only those parameters are updated. Then on the next step, a different sub-sample is
taken and updated, and so on.
To put it another way, dropping out neurons in this way has the effect of training multiple neural
networks simultaneously. If we have multiple networks overfit to different training data, they are
unlikely to all overfit in the same way. So their average should provide better results.
This has the additional advantage that neurons must learn to operate in the absence of other neurons,
which can have the effect of the network learning more robust features. That is, the neurons of the
network should be more resilient to the absence of some information.
(Figure: a network after dropout is applied to each layer in a training iteration.)
At test time, all neurons are active (i.e. we don't use dropout at test time). With p = 0.5, there will be twice as many hidden neurons active as there were during training, so all weights are halved to compensate.
We must scale the activation functions by p to maintain the same expected output for each neuron.
Say x is the output of a neuron without dropout. With dropout, the neuron's output has a chance 1 − p of being set to 0, so its expected output becomes px (more verbosely, it has a 1 − p chance of becoming 0, so its expected output is px + (1 − p)·0, which simplifies to px). Thus we must scale the outputs (i.e. the activation functions) by p to keep the expected output consistent.
This scaling can be applied at training time, which is more efficient - this technique is called inverted
dropout.
For comparison, here is an implementation of regular dropout and an implementation of inverted
dropout (source from: https://cs231n.github.io/neural-networks-2/)
# Dropout
p = 0.5 # probability of keeping a unit active. higher = less dropout

def train_step(X):
    """ X contains the data """
    # forward pass for example 3-layer neural network
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = np.random.rand(*H1.shape) < p # first dropout mask
    H1 *= U1 # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = np.random.rand(*H2.shape) < p # second dropout mask
    H2 *= U2 # drop!
    out = np.dot(W3, H2) + b3

    # backward pass: compute gradients... (not shown)
    # perform parameter update... (not shown)

def predict(X):
    # ensembled forward pass
    H1 = np.maximum(0, np.dot(W1, X) + b1) * p # NOTE: scale the activations
    H2 = np.maximum(0, np.dot(W2, H1) + b2) * p # NOTE: scale the activations
    out = np.dot(W3, H2) + b3
# Inverted dropout
p = 0.5 # probability of keeping a unit active. higher = less dropout

def train_step(X):
    # forward pass for example 3-layer neural network
    H1 = np.maximum(0, np.dot(W1, X) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p # first dropout mask. Notice /p!
    H1 *= U1 # drop!
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    U2 = (np.random.rand(*H2.shape) < p) / p # second dropout mask. Notice /p!
    H2 *= U2 # drop!
    out = np.dot(W3, H2) + b3

    # backward pass: compute gradients... (not shown)
    # perform parameter update... (not shown)

def predict(X):
    # ensembled forward pass
    H1 = np.maximum(0, np.dot(W1, X) + b1) # no scaling necessary
    H2 = np.maximum(0, np.dot(W2, H1) + b2)
    out = np.dot(W3, H2) + b3
Regularization recommendations
It is most common to use a single, global L2 regularization strength that is cross-validated. It is also common to combine this with dropout applied after all layers. The value of p = 0.5 is a reasonable default, but this can be tuned on validation data. (https://cs231n.github.io/neural-networks-2/)
14.8.2
Artificially expanding the training set
In addition to regularization, training on more data can help prevent overfitting. This, unfortunately,
is typically not a practical option. However, the training set can be artificially expanded by taking
existing training data and modifying it in a way we’d expect to see in the real world.
For instance, if we were training a network to recognize handwritten digits, we may take our examples
and rotate them slightly, since this could plausibly happen naturally.
A related technique is training on adversarial examples (detailed elsewhere), in which training examples
are modified to be deliberately hard for the network to classify, so that it can be trained on more
ambiguous/difficult examples.
The most common approach to dealing with overfitting is to apply some kind of regularization.
14.9 Hyperparameters
There are many hyperparameters to set with neural networks, such as:
• architecture decisions
  – number of layers
  – number of units per layer
  – type of unit
  – etc.
• weight penalty
• weight penalty
• learning rate
• momentum
404
CHAPTER 14. NEURAL NETS
405
14.9. HYPERPARAMETERS
• whether or not to use dropout
• etc
14.9.1
Choosing hyperparameters
TODO See: https://cs231n.github.io/neural-networks-3/#anneal
Not only are there many hyperparameters for neural networks; it can also be very difficult to choose
good ones.
You could do a naive grid search and just try all possible combinations of hyperparameters, which is
infeasible because it blows up in size.
You could randomly sample combinations as well, but this still has the problem of repeatedly trying
hyperparameter values which may have no effect.
Instead, we can apply machine learning to this problem and try and learn what hyperparameters
may perform well based on the attempts thus far. In particular, we can try and predict regions in
the hyperparameter space that might do well. We’d want to also be able to be explicit about the
uncertainty in our prediction.
We can use Gaussian process models to do so. The basic assumption of these models is that similar
inputs give similar outputs.
However, what does “similar” mean? Is 200 hidden units “similar” to 300 hidden units or not?
Fortunately, such models can also learn this scale of similarity for each hyperparameter.
These models predict a Gaussian distribution of values for each hyperparameter (hence the name).
A method for applying this:
• keep track of the best hyperparameter combination so far
• pick a new combination of hyperparameters such that the expected improvement of the best
combination is big
So we might try a new combination, and it might not do that well, but we won’t have replaced our
current best.
This method for selecting hyperparameters is called Bayesian (hyperparameter) optimization, and is
a better approach than by picking hyperparameters by hand (less prone to human error).
14.9.2
Tweaking hyperparameters
A big challenge in designing a neural network is calibrating its hyperparameters. From the start, it
may be difficult to intuit what hyperparameters need tuning. There are so many to choose from:
network architecture, number of epochs, cost function, weight initialization, learning rate, etc.
There are a few heuristics which may help.
When the learning rate η is set too high, you typically see constant oscillation in the error rate as
the network trains. This is because with too large a learning rate, you may miss the minimum in
the error surface by “jumping” too far. Thus once you see this occurring, it’s a hint to try a lower
learning rate.
Learning rates which are too low tend to have a slow decrease in error over training. You can try
higher learning rates if this seems to be the case.
The learning rate does not need to be fixed. When starting out training, you may want a high
learning rate to quickly get close to a minimum. But once you get closer, you may want to decrease
the learning rate to carefully identify the best minimum.
The specification of how the learning rate decreases is called the learning rate schedule.
Some places recommend using a learning rate in the form:
ηt = η0 (1 + η0 λt)−1
Where η0 is the initial learning rate, ηt is the learning rate for the tth example, and λ is another
hyperparameter.
For the number of epochs, we can use a strategy called "early stopping", where we stop once some performance metric (e.g. classification accuracy) appears to stop improving. More precisely, "stop improving" can mean that the performance metric doesn't improve for some n epochs.
However, neural networks sometimes plateau for a little bit and then keep on improving. In which
case, adopting an early stopping strategy can be harmful. You can be somewhat conservative and
set n to a higher value to play it safe.
14.10
Deep neural networks
A deep neural network is simply a neural network with more than one hidden layer. Deep learning
is the field related to deep neural networks. These deep networks can perform much better than
shallow networks (networks with just one hidden layer) because they can embody a complex hierarchy
of concepts.
Many problems can be broken down into subproblems, each of which can be addressed by a separate
neural network.
Say for example we want to know whether or not a face is in an image. We could break that down
(decompose it) into subproblems like:
• is there an eye?
• is there an ear?
• is there a nose?
• etc.
We could train a neural network on each of these subproblems. We could even break these subproblems
further (e.g. “Is there an eyelash?”, “Is there an iris?”, etc) and train neural networks for those, and
so on.
Then if we want to identify a face, we can aggregate these networks into a larger network.
This kind of multi-layered neural net is a deep neural network.
Multilayer neural networks must have nonlinear activation functions; otherwise they are equivalent to a single-layer network. For example, suppose a 2-layer linear network has weight matrices W_1 and W_2 and input X. The network computes (XW_1)W_2, which is equivalent to X(W_1 W_2), so the network is equivalent to a single-layer network with weight matrix W_1 W_2.
Training deep neural networks (that is, neural networks with more than one hidden layer) is not as
straightforward as it is with a single hidden layer - a simple stochastic gradient descent + backpropagation approach is not as effective or quick.
This is because of unstable gradients. This has two ways of showing up:
• Vanishing gradients, in which the gradient gets smaller moving backwards through the hidden
layers, such that earlier layers learn very slowly (and may not learn at all).
• Exploding gradients, in which the gradient gets much larger moving backwards through the
hidden layers, such that earlier layers cannot find good parameters.
These unstable gradients occur because gradients in earlier layers are the products of the later layers
(refer to backpropagation for details, but remember that the δ i for layer i is computed from δ i+1 ).
Thus if these later terms are mostly < 1, we will have a vanishing gradient. If these later terms are
> 1, they can get very large and lead to an exploding gradient.
14.10.1
Unstable gradients
Certain neural networks, such as RNNs, can have unstable gradients, in which gradients may grow
exponentially (an exploding gradient) or shrink exponentially until it reaches zero (a vanishing
gradient).
With exploding gradients, the minimum is not found because, with such a large gradient, the steps
don’t effectively search the space.
With vanishing gradients, the minimum is not found because a gradient of zero means the space isn’t
searched at all.
Unstable gradients can occur as a result of drastic changes in the cost surface, as illustrated in the accompanying figure (from Pascanu et al via http://peterroelants.github.io/posts/rnn_implementation_part01/).
In the figure, the large jump in cost leads to a large gradient which causes the optimizer to make an
exaggerated step.
There are methods for dealing with unstable gradients, including:
• Gradient clipping (e.g. if ||g||₂ > t, rescale g to g · t/||g||₂, where t is some clipping threshold)
• Hessian-Free Optimization
• Momentum
• Resilient backpropagation (Rprop)
Resilient backpropagation (Rprop)
Normally, weights are updated by the size of the gradient (typically scaled by some learning rate).
However, as demonstrated above, this can lead to an unstable gradient.
Resilient backpropagation ignores the size of the gradient and considers only its sign, then uses two hyperparameters, η⁻ and η⁺ (with η⁻ < 1 < η⁺), to determine the size of the update.

If the sign of the gradient changes in an iteration, the weight update Δ is multiplied by η⁻, i.e. Δ = Δη⁻. If the gradient's sign doesn't change, the weight update Δ is multiplied by η⁺, i.e. Δ = Δη⁺.

If the gradient's sign changes, this usually indicates that we have passed through a local minimum.
Then the weight is updated by this computed value in the opposite direction of its gradient:
W → W′ = W − sign(∂J/∂W) · Δ
Typically, η + = 1.2, η − = 0.5.
This is essentially separate adaptive learning rates, but ignoring the size of the gradient and looking only at its sign. That is, we increase the step size multiplicatively by η⁺ if the last two gradient signs agree; otherwise, we decrease the step size multiplicatively by η⁻. As with separate adaptive learning rates, we generally want to limit the range of step sizes so that they can't be too small or too large.
Rprop is meant for full batch learning or for very large mini-batches. To use this technique with
mini-batches, see Rmsprop.
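A minimal sketch of a simplified Rprop step, ignoring the common refinement of skipping the update right after a sign change (arrays are all the same shape; the names and bounds are illustrative):

import numpy as np

def rprop_step(W, grad, prev_grad, delta, eta_plus=1.2, eta_minus=0.5,
               delta_min=1e-6, delta_max=50.0):
    sign_change = grad * prev_grad
    delta = np.where(sign_change > 0, delta * eta_plus, delta)    # sign agreed: grow the step
    delta = np.where(sign_change < 0, delta * eta_minus, delta)   # sign flipped: shrink the step
    delta = np.clip(delta, delta_min, delta_max)                  # bound the step sizes
    W = W - np.sign(grad) * delta                                 # move against the gradient's sign
    return W, delta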
14.10.2
Rmsprop
Rmsprop is the mini-batch version of Rprop. It computes a moving average, MA, of the squared
gradient for each parameter:
MA = λ · MA + (1 − λ)(∂J/∂W)²
Then normalizes the gradient by dividing by the square root of this moving average:
(∂J/∂W) / √MA
Rmsprop can be used with momentum as well (i.e. update the velocity with this modified gradient).
The basic idea behind rmsprop is to adjust the learning rate per-parameter according to the (smoothed) sum of the previous gradients. Intuitively this means that frequently occurring features get a smaller learning rate (because the sum of their gradients is larger), and rare features get a larger learning rate. (http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/)
14.11
Convolutional Neural Networks (CNNs)
In a regular neural network, the relationship between a pixel and one that is next to it is the same
as its relationship with a pixel far away - the structural information of the image is totally lost.
Convolutional nets are capable of encoding this structural information about the image; as a result,
they are especially effective with image-based tasks.
Convolutional nets are based on three ideas:
• local receptive fields
• shared weights
• pooling
14.11.1
Local receptive fields
A regular neural network is fully-connected in that every node from a layer i is connected to each
node in the layer i + 1.
This is not the case with convolutional nets.
Typically we think of a layer as a line of neurons. With convolutional nets, it is more useful to think
of the neurons arranged in a grid.
(Note: the following images are from http://neuralnetworksanddeeplearning.com/chap6.html TODO
replace the graphics)
We do not fully connect this input layer to the hidden layer (which we’ll call a convolutional layer).
Rather, we connect regions of neurons to neurons in the hidden layer. These regions are local
receptive fields, local to the neuron at their center (they may more simply be called windows).
We can move across local receptive fields one neuron at a time, or in greater movements. These
movements are called the stride length.
These windows end up learning to detect salient features, but are less sensitive to where exactly they
occur. For instance, for recognizing a human face, it may be important that we see an eye in one
region, but it doesn’t have to be in a particular exact position. A filter (also called a kernel) function
is applied to each window to transform it into another vector (which is then passed to a pooling layer,
see below).
(Figures: a layer as a grid; local receptive fields map to a neuron in the hidden layer; moving across fields at a stride length of 1.)
One architectural decision with CNNs is the use of wide convolution or narrow convolution. When
you reach the edges of your input (say, the edges of an image), do you stop there or do you pad
the input with zeros (or some other value) so we can fit another window? Padding the input is wide
convolution, not padding is narrow convolution. Note that, as depicted above, narrow convolution
will yield a smaller feature map of size: input shape - filter shape + 1.
Note that this hyperparameter is sometimes called “border mode”. A border mode of “valid” is
equivalent to a narrow convolution.
There are a few different ways of handling padding for a wide convolution. Border modes of “half”
(also called “same”) and “full” correspond to different padding strategies.
Say we have a filter of size r × c (where r is rows and c is columns). For a border mode of
“half”/“same”, we pad the input with a symmetric border of r //2 rows and c//2 columns (where
// indicates integer division). When r and c are both odd, the feature map has the same shape as
the input.
There is also a “full” border mode which pads the input with a symmetric border of r − 1 rows and
c − 1 columns. This is equivalent to applying the filter anywhere it overlaps with a pixel and yields
a feature map of size: input shape + filter shape - 1.
For example, say we have the following image:
xxx
xxx
xxx
Say we have a 3x3 filter.
For a border mode of "half"/"same", the padded image would look like this (padding is indicated with o):
ooooo
oxxxo
oxxxo
oxxxo
ooooo
For a border mode of “full”, the padded image would instead be:
ooooooo
ooooooo
ooxxxoo
ooxxxoo
ooxxxoo
ooooooo
ooooooo
14.11.2
Shared weights
Another change here is that the hidden layer has one set of weights and a bias that is shared across
the entire layer (these weights and biases are accordingly referred to as shared weights and the shared
bias, and together, they define a filter or a kernel).
As a result, if we have receptive fields of m × m size, the output of the i, jth neuron in the hidden
layer looks like:
f(b + Σ_{k=0}^{m−1} Σ_{l=0}^{m−1} W_{k,l} OUT⁰_{i+k, j+l})

Where W ∈ R^{m×m} is the array of shared weights and OUT⁰_{x,y} is the output of the input neuron at position x, y.
Another way of writing the above is:
f (b + W ∗ OUT0 )
Where ∗ is the convolution operator, which is like a blurring/mixing of functions. In this context, it
is basically a weighted sum.
The consequence of this sharing of weights and biases is that this layer detects the same feature across different receptive fields. For example, this layer could detect vertical edges anywhere in the image. If an edge shows up in the upper-right part of the image, the hidden neuron corresponding to that receptive field will fire. If an edge shows up in the lower-left part of the image, the hidden neuron corresponding to that receptive field will also fire, because all the hidden neurons share the same weights and bias.
As a result of this property, this mapping between layers is often called a feature map. Technically, the kernel/filter outputs a feature map, which is to say they are not the same thing, but in practice the terms “kernel” and “filter” are often used interchangeably with “feature map”.
For example, say we have a 3x3 filter:

\begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
Each position in the filter is a weight to be learned; here we have initialized them to 0 (not necessarily the best choice, but this is just an example). Each position in the filter lines up with a pixel as the filter slides across the image (as depicted above).
Let’s say that the weights learned by the filter end up being the following:

\begin{bmatrix} -1 & -1 & -1 \\ -1 & 10 & -1 \\ -1 & -1 & -1 \end{bmatrix}
Then say we place the filter over the following patch of pixels:

\begin{bmatrix} 0 & 0 & 0 \\ 0 & 255 & 0 \\ 0 & 0 & 0 \end{bmatrix}
We want to combine (i.e. mix) the filter and the pixel values to produce a single pixel value (which
will be a single pixel in the resulting feature map). We do so by convolving them as the sum of the
element-wise product:
(-1 \times 0)+(-1 \times 0)+(-1 \times 0)+(-1 \times 0)+(10 \times 255)+(-1 \times 0)+(-1 \times 0)+(-1 \times 0)+(-1 \times 0) = 2550
Pixel-by-pixel the feature map is produced in this way.
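Here is a minimal NumPy sketch of this sliding computation (a naive loop, not an optimized implementation; as is conventional for CNNs, this is really a cross-correlation, i.e. the kernel is not flipped):

import numpy as np

def apply_filter(image, kernel):
    # slide the kernel over every valid position and take the sum of the element-wise product
    rows = image.shape[0] - kernel.shape[0] + 1
    cols = image.shape[1] - kernel.shape[1] + 1
    feature_map = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            window = image[i:i + kernel.shape[0], j:j + kernel.shape[1]]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

kernel = np.array([[-1, -1, -1],
                   [-1, 10, -1],
                   [-1, -1, -1]])
patch = np.array([[0, 0, 0],
                  [0, 255, 0],
                  [0, 0, 0]])
print(apply_filter(patch, kernel))  # [[2550.]]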
We may include multiple feature maps/filters, i.e. have the input connect to many hidden layers of
this kind (this is typically how it’s done in practice). Each layer would learn to detect a different
feature.
Another benefit to sharing weights and biases across the layer is that it introduces some resilience
to overfitting - the sharing of weights means that the layer cannot favor peculiarities in particular
parts of the training data; it must take the whole example into account. As a result, regularization
methods are seldom necessary for these layers.
14.11.3 Pooling layers
In addition to convolutional layers there are also pooling layers, which often accompany convolutional layers (often one per convolutional layer) and follow after them. Pooling layers produce a condensed version of the feature map they are given (for this reason, this process is also known as subsampling, so pooling layers are sometimes called subsampling layers). For example, a 2 × 2 neuron region of the feature map may be represented with only one neuron in the pooling layer.
[Figure: Mapping from a feature map to a pooling layer]
There are a few different strategies for how this compression works. A common one is max-pooling, in which the pooling neuron just outputs the maximum value of its inputs. In some sense, max-pooling asks its region: was your feature present? And it activates if it was. It isn’t concerned with where in that region the feature was, since in practice its precise location doesn’t matter so much as its relative positioning to other features (especially with images).
Another pooling technique is L2 pooling. Say a pooling neuron has an m × m input region of neurons coming from layer i. Then its output is:

\sqrt{\sum_{j=0}^{m-1} \sum_{k=0}^{m-1} (\mathrm{OUT}^i_{j,k})^2}
Another pooling technique is average-pooling in which the average value of the input is output.
There is also the k-max pooling method, which takes the top k values in each dimension, instead
of just the top value as is with max-pooling. The result is a matrix rather than a vector.
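A minimal NumPy sketch of 2x2 max-pooling (assuming the feature map's dimensions are divisible by the pooling size):

import numpy as np

def max_pool(feature_map, size=2):
    # non-overlapping size x size max-pooling
    rows, cols = feature_map.shape
    blocks = feature_map.reshape(rows // size, size, cols // size, size)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]])
print(max_pool(fm))
# [[4 2]
#  [2 8]]

Average-pooling and L2 pooling are the same sketch with blocks.mean(axis=(1, 3)) or np.sqrt((blocks ** 2).sum(axis=(1, 3))) in place of the max.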
14.11.4 Network architecture

Generally, we have many feature maps (convolutional layers) and pooling layer pairs grouped together; conceptually it is often easier to think of these groups themselves as layers (called “convolutional-pooling layers”).
[Figure: An example convolutional network]
The output layer is fully-connected (i.e. every neuron from the convolutional-pooling layer is connected to every neuron in the output layer).
Often it helps to include another (or more) fully-connected layer just prior to the output layer. This can
be thought of as aggregating and considering all the features coming from the convolutional-pooling
layer.
It is also possible to insert additional convolutional-pooling layers (this practice is called hierarchical
pooling). Conceptually, these take the features output by the previous convolutional-pooling layer
and extract higher-level features. The way these convolutional-pooling layers connect to each other
is a little different. Each of this new layer’s input neurons (that is, the neurons in its first set of
convolutional layers) takes as its input all of the outputs (within its local receptive field) from the
preceding convolutional-pooling layer.
For example, if the preceding convolutional-pooling layer has 20 feature maps in it, and we have receptive fields of size 5 × 5, then each of the input neurons for the new convolutional-pooling layer would have 20 × 5 × 5 inputs.
14.11.5 Training CNNs
Backpropagation is slightly different for a convolutional net because the typical backpropagation
assumes fully-connected layers.
TODO add this
14.11.6 Convolution kernels
CNNs learn a convolution kernel and (for images) apply it to every pixel across the image:
[Figure: Convolution kernel example (source)]
14.12 Recurrent Neural Networks (RNNs)
A recurrent neural network is a feedback neural network, that is, it is a neural net where the outputs
of neurons are fed back into their inputs. They have properties which give them advantages over
feed-forward NNs for certain problems. In particular, RNNs are well-suited for handling sequences as
input.
With machine learning, data is typically represented in vector form. This works for certain kinds of
data, such as numerical data, but not necessarily for other kinds of data, like text. We usually end up
coercing text into some vector representation (e.g. TF-IDF) and end up losing much of its structure
(such as the order of words). This is ok for some tasks (such as topic detection), but for many others
we are throwing out important information. We could use bigrams or trigrams and so on to preserve some structure, but this becomes unmanageably large (we end up with very high-dimensional vectors).
Recurrent neural networks are able to take sequences as input, i.e. iterate over a sequence, instead
of fixed-size vectors, and as such can preserve the sequential structure of things like text and have a
stronger concept of “context”.
Basically, an RNN takes in each item in the sequence and updates a hidden representation (its state)
based on that item and the hidden representation from the previous time step. If there is no previous
hidden representation (i.e. we are looking at the first item in the sequence), we can initialize it as
either all zeros or treat the initial hidden representation as another parameter to be learned.
Another way of putting this is that the core difference of an RNN from a regular feedforward network
is that the output of a neuron is a function of its inputs and of its past state, e.g.

\mathrm{OUT}_t = f(\mathrm{OUT}_{t-1} W_r + X_t W_x)

where W_r is the matrix of recurrent weights.
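A minimal NumPy sketch of this recurrence, with tanh as f (the shapes and values here are arbitrary toy choices, and a bias is included for completeness):

import numpy as np

def rnn_step(x_t, out_prev, Wx, Wr, b):
    # the new output depends on the current input and the previous output
    return np.tanh(out_prev @ Wr + x_t @ Wx + b)

rng = np.random.default_rng(0)
Wx = rng.normal(size=(4, 3))   # input size 4 -> hidden size 3
Wr = rng.normal(size=(3, 3))   # recurrent weights
b = np.zeros(3)

out = np.zeros(3)                     # initial hidden state: all zeros
for x_t in rng.normal(size=(5, 4)):   # a sequence of 5 input vectors
    out = rnn_step(x_t, out, Wx, Wr, b)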
14.12.1 Network architecture
In the most basic RNN, the hidden layer has two inputs: the input from the previous layer, and the layer’s own output from the previous time step (so it loops back onto itself):
[Figure: Simple RNN network, with hidden nodes looping (source)]
This simple network can be visualized over time as well:
[Figure: Simple RNN network, with hidden nodes looping, unrolled over time (source)]
Say we have a hidden layer L1 of size 3 and another hidden layer L2 of size 2. In a regular NN, the
input to L2 is of size 3 (because that’s the output size of L1 ). In an RNN, L2 would have 3+2 inputs,
3 from L1 , and 2 from its own previous output.
This simple feedback mechanism offers a kind of short-term memory - the network “remembers” the
output from the previous time step.
It also allows for variable-sized inputs and outputs - the inputs can be fed in one at a time and
combined by this feedback mechanism.
14.12.2 RNN inputs
The input item can be represented with one-hot encoding, i.e. each term is mapped to a vector of all zeroes and a single 1. For example, if we had the vocabulary {the, mad, cat}, the terms might be respectively represented as [1, 0, 0], [0, 1, 0], [0, 0, 1].
Another way to represent these terms is with an embedding matrix, in which each term is mapped
to some index of the matrix which points to some n-dimensional vector representation. So the RNN
learns vector representations for each term.
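A minimal sketch of these two input representations (the vocabulary and the embedding dimension are made up for illustration; in practice the embedding matrix is a learned parameter):

import numpy as np

vocab = {"the": 0, "mad": 1, "cat": 2}

def one_hot(term):
    v = np.zeros(len(vocab))
    v[vocab[term]] = 1.0
    return v

# embedding matrix: one n-dimensional vector per term (here n = 4, randomly initialized)
E = np.random.default_rng(0).normal(size=(len(vocab), 4))

def embed(term):
    return E[vocab[term]]

print(one_hot("mad"))  # [0. 1. 0.]
print(embed("cat"))    # the 4-dimensional vector for "cat"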
Convolutional neural networks, and feed-forward neural networks in general, treat an input the same
no matter when they are given it. For RNNs, the hidden representation is like (short-term) “memory”
for the network, so context is taken into account for inputs; that is, an input will be treated differently
depending on what the previous input(s) was/were.
14.12.3 Training RNNs
(Note that RNNs train very slowly on CPUs; they train significantly faster on GPUs.)
RNNs are trained using a variant of backpropagation called backpropagation through time, which
just involves unfolding the RNN a certain number of time steps, which results in what is essentially
a regular feedforward network, and then applying backpropagation:
\frac{\partial E}{\partial \mathrm{OUT}_{t-1}} = \frac{\partial E}{\partial \mathrm{OUT}_t} \frac{\partial \mathrm{OUT}_t}{\partial \mathrm{OUT}_{t-1}} = \frac{\partial E}{\partial \mathrm{OUT}_t} W_r
which starts with:

\frac{\partial E}{\partial y} = \frac{\partial E}{\partial \mathrm{OUT}_n}
Where OUTn is the output of the last layer.
The gradients of the cost function with respect to the weights are computed by summing the weight gradients at each time step:

\frac{\partial E}{\partial W_x} = \sum_{t=0}^{n} \frac{\partial E}{\partial \mathrm{OUT}_t} X_t

\frac{\partial E}{\partial W_r} = \sum_{t=1}^{n} \frac{\partial E}{\partial \mathrm{OUT}_t} \mathrm{OUT}_{t-1}
This summing of the weight gradients at each time step is the main difference from regular feedforward networks; aside from that, BPTT is basically just backpropagation on an RNN unrolled up to some time step t.
However, if working with long sequences, this is effectively like training a deep network with many hidden layers (i.e. this is equivalent to an unrolled RNN), which can be difficult (due to vanishing or exploding gradients). In practice, it’s common to truncate the backpropagation by running it only a few time steps back.
The vanishing gradient problem in RNNs means long-term dependencies won’t be learned - the effect
of earlier steps “vanish” over time steps (this is the same problem of vanishing gradients in deep
feedforward networks, given that an RNN is basically a deep neural net).
Exploding gradients are more easily dealt with - it’s obvious when they occur (you’ll see NaNs, for
instance), and you can clip them at some maximum value, which can be quite effective (refer to this
paper)
Some strategies for dealing with vanishing gradients:
• vanishing gradients are sensitive to weight initialization, so proper weight initialization can help
avoid them
• ReLUs can work better as the nonlinear activation functions since they are not bounded by 1
as the sigmoid and tanh nonlinearities are
Generally, however, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures are used instead of vanilla RNNs; these were designed to mitigate vanishing gradients (for the purpose of better learning long-range dependencies).
14.12.4 LSTMs
This short-term memory of (vanilla) RNNs may be too short. RNNs may incorporate long short-term memory (LSTM) units instead, which just compute hidden states in a different way.
With an LSTM unit, we have memory stored and passed through a more involved series of steps.
This memory is modified in each step, with something being added and something being removed at
each step. The result is a neural network that can handle longer-term context.
These LSTM units have three gates (in contrast to the single activation function vanilla RNNs have):
• write (input) - controls the amount of current input to be remembered
• read (output) - controls the amount of memory given as output to the next stage
• erase (forget) - controls what part of the memory is erased or kept in the current time step
TODO include Chris Olah’s LSTM diagrams: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
These gates are sigmoid functions combined with a pointwise multiplication operation. They are
called gates because they tune how much of their input is passed on (i.e. sigmoids give a value in
[Figure: An LSTM unit (source)]
[0, 1], which can be thought of as the percent of input to pass on). The parameters for these gates are learned.
The input gate determines how much of the input is let through, the forget gate determines how much
of the previous state is let through. We compute a new “memory” (i.e. the LSTM unit’s internal
state) from the outputs of these gates. The output gate determines how much of this new memory
to output as the hidden state.
In more detail:
The forget gate controls what is removed (“forgotten”) from the cell state. The input to the forget
gate is the concatenation of the cell’s output from the previous step, OUTt−1 and the current input
to the cell, Xt . The gate computes a value in [0, 1] (with the sigmoid function) for each value in
the previous cell state Ct−1 ; the resulting value determines how much of that value to keep (1 means
keep it all, 0 means forget all of it). So we are left with a vector of values in [0, 1], which we then
pointwise multiply with the existing cell state to get the updated cell state.
The output of the forget gate f at step t is:

f_t = \mathrm{sigmoid}(W_f \cdot [\mathrm{OUT}_{t-1}, X_t] + b_f)

where W_f and b_f are the forget gate’s weights and bias, respectively. Then our intermediate value of C_t is C'_t = f_t C_{t-1}.
The input gate controls what information gets stored in the cell state. This gate also takes as input
the concatenation of OUTt−1 and Xt . We will denote its output at step t as it . Like the forget gate,
this is a vector of values in [0, 1] which determine how much information gets through - 0 means
none, 1 means all of it.
A tanh function takes the same input and outputs a vector of candidate values, C̃t .
We pointwise multiply this candidate value vector with the input gate’s output vector to get the vector that is passed to the cell state. This resulting vector is pointwise added to the updated cell state.

i_t = \mathrm{sigmoid}(W_i \cdot [\mathrm{OUT}_{t-1}, X_t] + b_i)

\tilde{C}_t = \tanh(W_C \cdot [\mathrm{OUT}_{t-1}, X_t] + b_C)

Thus our final updated value of C_t is C_t = C'_t + i_t \tilde{C}_t.
We don’t output this cell state Ct directly. Rather, we have yet another gate, the output gate
(sometimes called a read gate) that outputs another vector with values in [0, 1], ot , which determines
how much of the cell state is outputted. This gate again takes in as input the concatenation of OUTt−1
and Xt .
So the output of the output gate is just:

o_t = \mathrm{sigmoid}(W_o \cdot [\mathrm{OUT}_{t-1}, X_t] + b_o)

To get the final output of the cell, we pass the cell state C_t through tanh and then pointwise multiply that with the output of the output gate:

\mathrm{OUT}_t = o_t \tanh(C_t)
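Putting these equations together, here is a minimal NumPy sketch of a single LSTM step (toy shapes, randomly initialized weights; not a trained or optimized implementation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, out_prev, c_prev, W, b):
    # W and b hold the parameters of the forget (f), input (i), candidate (C), and output (o) gates
    concat = np.concatenate([out_prev, x_t])
    f_t = sigmoid(W["f"] @ concat + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ concat + b["i"])       # input gate
    c_tilde = np.tanh(W["C"] @ concat + b["C"])   # candidate values
    c_t = f_t * c_prev + i_t * c_tilde            # updated cell state
    o_t = sigmoid(W["o"] @ concat + b["o"])       # output gate
    out_t = o_t * np.tanh(c_t)
    return out_t, c_t

# toy sizes: input size 4, hidden size 3 (so the concatenation has size 7)
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 7)) for k in "fiCo"}
b = {k: np.zeros(3) for k in "fiCo"}

out, c = np.zeros(3), np.zeros(3)
for x_t in rng.normal(size=(5, 4)):
    out, c = lstm_step(x_t, out, c, W, b)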
An RNN can be thought of as an LSTM in which all input and output gates are 1 and all forget gates are 0, with an additional activation function (e.g. tanh) afterwards (LSTMs do not have this additional activation function).
There are many variations of LSTMs (see this paper for empirical comparisons between some of
them), the most common of which is the Gated Recurrent Unit (GRU).
GRUs
A gated recurrent unit (GRU) is a simpler LSTM unit; it includes only two gates (also both sigmoid
functions) - the reset gate r and the update gate z. The reset gate determines how to mix the current
input and the previous state and the update gate determines how much of the previous state to retain.
A vanilla RNN is a GRU architecture in which all reset gates are 1 and all update gates are 0 (plus an additional activation function, which, like LSTMs, GRUs don’t have). GRUs don’t
have internal states like LSTM units do; there is no output gate so there is no need for an internal
state. The cell state and its output are also merged as its hidden state, ht :
h_{t-1} = \mathrm{OUT}_{t-1}

z_t = \mathrm{sigmoid}(W_z \cdot [h_{t-1}, X_t])

r_t = \mathrm{sigmoid}(W_r \cdot [h_{t-1}, X_t])

\tilde{h}_t = \tanh(W \cdot [r_t h_{t-1}, X_t])

h_t = (1 - z_t) h_{t-1} + z_t \tilde{h}_t

\mathrm{OUT}_t = h_t
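A minimal NumPy sketch of a single GRU step following these equations (weights are assumed to be given; biases are omitted, as in the equations above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, W):
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ concat)                                    # update gate
    r_t = sigmoid(Wr @ concat)                                    # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))    # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                     # new hidden state (= the output)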
Peephole connections
This LSTM variant just passes on the previous cell state, C_{t-1}, to the forget and input gates, and the new cell state, C_t, to the output gate; that is, all that changes is:

f_t = \mathrm{sigmoid}(W_f \cdot [C_{t-1}, \mathrm{OUT}_{t-1}, X_t] + b_f)

i_t = \mathrm{sigmoid}(W_i \cdot [C_{t-1}, \mathrm{OUT}_{t-1}, X_t] + b_i)

o_t = \mathrm{sigmoid}(W_o \cdot [C_t, \mathrm{OUT}_{t-1}, X_t] + b_o)
Update gates
In this LSTM variant, the forget and input gates are combined into a single update gate. The value f_t is computed the same way, but then the input gate is just:

i_t = 1 - f_t
Essentially, we just update enough information to replace what was forgotten.
14.12.5 BI-RNNs
Bidirectional RNNs (BI-RNNs) are a variation on RNNs in which the RNN can not only look into the past, but can also look into the “future”. The BI-RNN has two states, s_i^f (the forward state) and s_i^b (the backward state). The forward state s_i^f is based on x_1, x_2, \dots, x_i, whereas the backward state s_i^b is based on x_n, x_{n-1}, \dots, x_i. These states are managed by two different RNNs, one of which is given the sequence x_{1:n} and the other of which is fed x_{n:1} (that is, the input in reverse).

The output at position i is the concatenation of these RNNs’ output vectors, i.e. y_i = [y_i^f ; y_i^b].
14.12.6 Attention mechanisms
In people, “attention” is a mechanism by which we focus on one particular element of our environment,
such that our perception of the focused element is in high-fidelity/resolution, whereas surrounding
elements are at a lower resolution.
Attention mechanisms in recurrent neural networks emulate this behavior. This amounts to a weighted
sum across input states (typically weights are normalized to sum to 1); higher weights indicate more
“focus” or attention.
For instance, consider neural machine translation models. Their basic form consists of two RNNs, one
which takes an input sentence (the encoder) and one which produces the translated output sentence
(the decoder). The encoder takes the input sentence, produces a sentence embedding (i.e. a single
vector meant to encapsulate the sentence’s meaning), then the decoder takes that embedding and
outputs the translated sentence.
Representing a sentence as a single embedding is challenging, especially since earlier parts of the
sentence may be forgotten. There are some architectures such as the bidirectional variant that help
with this, but attention mechanisms can help so that the decoder has access to the full spread of
inputs and can “focus” more on translating individual parts when appropriate.
This means that instead of taking a single sentence embedding, each output word is produced through
this weighted combination of all input states.
Note that these attention weights are stored for each step, since at each step the model distributes its attention differently. This can add up quickly.
Attention mechanisms can be thought of as an addressing system for selecting locations in memory
(e.g. an array) in a weighted fashion.
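A minimal NumPy sketch of this weighted sum: given a set of input states and some raw attention scores (how the scores are computed - e.g. by comparing each encoder state to the decoder's current state - is left unspecified here), softmax the scores into weights and take the weighted combination:

import numpy as np

def attend(states, scores):
    # normalize the scores into attention weights that sum to 1 (softmax)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # the context vector is the weighted sum of the input states
    return weights @ states, weights

states = np.random.default_rng(0).normal(size=(4, 3))  # 4 input states of dimension 3
scores = np.array([0.1, 2.0, -1.0, 0.5])
context, weights = attend(states, scores)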
14.13 Unsupervised neural networks
The most basic one is probably the autoencoder, which is a feed-forward neural net which
tries to predict its own input. While this isn’t exactly the world’s hardest prediction task,
one makes it hard by somehow constraining the network. Often, this is done by introducing a bottleneck, where one or more of the hidden layers has much lower dimensionality
than the inputs. Alternatively, one can constrain the hidden layer activations to be
sparse (i.e. each unit activates only rarely), or feed the network corrupted versions of its
inputs and make it reconstruct the clean ones (this is known as a denoising autoencoder).
[https://www.metacademy.org/roadmaps/rgrosse/deep_learning]
14.13.1 Autoencoders
An autoencoder is a feedforward neural network used for unsupervised learning. Autoencoders extract meaningful features by trying to output a reproduction of their input. That is, the output layer is the same size as the input layer, and the network tries to reconstruct its input at the output layer.
Generally the output of an autoencoder is notated x̂.
The first half (i.e. from the input layer up to the hidden layer) of the autoencoder architecture is
called the encoder, and the latter half (i.e. from the hidden layer to the output layer) is called the
decoder.
Often the weights of the decoder, W^*, are just the transpose of the weights of the encoder W, i.e. W^* = W^T. We refer to such weights as tied weights.
Essentially what happens is the hidden layer learns a compressed representation of the input (given
that it is a smaller size than the input/output layers, this is called an undercomplete hidden layer, the
learned representation is called an undercomplete representation), since it needs to be reconstructed
by the decoder back to its original form. That is, the network needs to find some way of representing
the input with less information. In some sense, we do this already with language, where we may
represent a photo with a word (or a thousand words with a photo).
Undercomplete hidden layers do a good job compressing data similar to the training set, but do a poor job for other inputs.
On the other hand, the hidden layer may be larger than the input/output layers, in which case it is called an overcomplete hidden layer and the learned representation of the input is an overcomplete representation. There’s no compression as a result, and there’s no guarantee that anything meaningful will be learned (since the network can essentially just copy the input).

However, overcomplete representations are appealing as a concept because, if we are using the autoencoder to learn features for us, we may want to learn many features. So how can we learn useful overcomplete representations?
Sparse autoencoders
Using a hidden layer size smaller than your input is tricky - encoding a lot of information into fewer
bits is quite challenging.
Rather counterintuitively, a larger hidden layer helps, where some hidden units are randomly turned
off during a training iteration - that way, the output isn’t a mere copy of the input, and learning is
easier since there is more “room” to represent the input. Such an autoencoder is called a sparse
autoencoder.
In effect, what an autoencoder is learning is some higher-level representation of its input. In the case
of an image, it may go from pixels to edges.
We can stack these sparse autoencoders on top of each other, so that higher and higher-level representations are learned. The sparse autoencoder that goes from pixels to edges can feed into another one that learns how to go from edges to shapes, for example.
Denoising autoencoders
A denoising autoencoder is a way of learning useful overcomplete representations. The general idea
is that we want the encoder to be robust to noise (that is, to be able to reconstruct the original
input even in the presence of noise). So instead of inputting x, we input x̃, which is just x with noise
added (sometimes called a corrupted input), and the network tries to reconstruct the noiseless x as
its output.
There are many ways this noise can be added; two popular approaches (both sketched below) are:

• for each component of the input, set it to 0 with probability v
• add Gaussian noise (mean 0, and some variance; this variance is a hyperparameter)
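A minimal NumPy sketch of these two corruption strategies (the probability v and the noise standard deviation below are placeholder values):

import numpy as np

rng = np.random.default_rng(0)

def mask_noise(x, v=0.25):
    # set each component to 0 with probability v
    return x * (rng.random(x.shape) >= v)

def gaussian_noise(x, sigma=0.1):
    # add zero-mean Gaussian noise with standard deviation sigma
    return x + rng.normal(scale=sigma, size=x.shape)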
Loss functions for autoencoders
Say our neural network is f (x) = x̂.
For binary inputs, we can use cross-entropy (more precisely, the sum of Bernoulli cross-entropies):

l(f(x)) = -\sum_k \left[ x_k \log(\hat{x}_k) + (1 - x_k) \log(1 - \hat{x}_k) \right]
For real-valued inputs, we can use the sum of squared differences (i.e. the squared euclidean distance):

l(f(x)) = \frac{1}{2} \sum_k (\hat{x}_k - x_k)^2
And we use a linear activation function at the output.
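A minimal NumPy sketch of these two reconstruction losses (the clipping in the cross-entropy version is just to avoid log(0) for illustration):

import numpy as np

def cross_entropy_loss(x, x_hat, eps=1e-12):
    # sum of Bernoulli cross-entropies, for binary inputs
    x_hat = np.clip(x_hat, eps, 1 - eps)
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

def squared_error_loss(x, x_hat):
    # half the squared euclidean distance, for real-valued inputs
    return 0.5 * np.sum((x_hat - x) ** 2)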
Loss function gradient in autoencoders
Note that if you are using tied weights, the gradient \nabla_W l(f(x^{(t)})) is the sum of two gradients; that is, it is the sum of the gradients for W^* and W^T.
Contractive autoencoders
A contractive autoencoder is another way of learning useful overcomplete representations. We do so
by adding an explicit term in the loss that penalizes uninteresting solutions (i.e. that penalizes just
copying the input).
Thus we have a new loss function, extended from an existing loss function:
l(f(x^{(t)})) + \lambda \|\nabla_{x^{(t)}} h(x^{(t)})\|_F^2

where \lambda is a hyperparameter, \nabla_{x^{(t)}} h(x^{(t)}) is the Jacobian of the encoder h(x^{(t)}), and \|A\|_F is the Frobenius norm:

\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2}
where A is an m × n matrix. To put it another way, the Frobenius norm is the square root of the sum of the absolute squares of a matrix’s elements; in this case, the matrix is the Jacobian of the encoder.
Intuitively, the term we’re adding to the loss (the squared Frobenius norm of the Jacobian) increases the loss if the encoder h(x^{(t)}) has non-zero partial derivatives with respect to the input; this essentially means we want to encourage the encoder to throw away information (i.e. we don’t want the encoder’s output to change with changes to the input; i.e. we want the encoder to be invariant to the input).
We balance this out with the original loss function which, as usual, encourages the encoder to keep
good information (information that is useful for reconstructing the original input).
By combining these two conflicting priorities, the result is that the encoder keeps only the good
information (the latter term encourages it to throw all information away, the former term encourages
it to keep only the good stuff). The λ hyperparameter lets us tweak which of these terms to prioritize.
Contractive vs denoising autoencoders
Both perform well and each has their own advantages.
Denoising autoencoders are simpler to implement in that they are a simple extension of regular
autoencoders and do not require computing the Jacobian of the hidden layer.
Contractive autoencoders have a deterministic gradient (since no sampling is involved, i.e. no random noise), which means second-order optimizers can be used (conjugate gradient, L-BFGS, etc.), and they can be more stable than denoising autoencoders.
Deep autoencoders
Autoencoders can have more than one hidden layer but they can be quite difficult to train (e.g. with
small initial weights, the gradient dies).
They can be trained with unsupervised layer-by-layer pre-training (stacking RBMs), or care can be
taken in weight initialization.
Shallow autoencoders for pre-training
A shallow autoencoder is just an autoencoder with one hidden layer.
In particular, we can create a deep autoencoder by stacking (shallow) denoising autoencoders.
This typically works better than pre-training with RBMs.
Alternatively, (shallow) contractive autoencoders can be stacked, and they also work very well for
pre-training.
14.13.2 Sparse Coding
The sparse coding model is another unsupervised neural network.
The general problem is that for each input x (t) , we want to find a latent representation h(t) such
that:
• h(t) is sparse (has many zeros)
• we can reconstruct the original input x (t) as well as possible
Formally:

\min_D \frac{1}{T} \sum_{t=1}^{T} \min_{h^{(t)}} \frac{1}{2} \|x^{(t)} - D h^{(t)}\|_2^2 + \lambda \|h^{(t)}\|_1
Note that Dh^{(t)} is the reconstruction \hat{x}^{(t)}, so the term \|x^{(t)} - Dh^{(t)}\|_2^2 is the reconstruction error. D is the matrix of weights; in the context of sparse coding it is called a dictionary matrix, and it is equivalent to an autoencoder’s output weight matrix.

The term \|h^{(t)}\|_1 is a sparsity penalty, which encourages h^{(t)} to be sparse by penalizing its L1 norm.
We constrain the columns of D to have norm 1 because otherwise D could just grow large, allowing h^{(t)} to become small (i.e. sparse). Sometimes the columns of D are constrained to have norm no greater than 1 instead of exactly 1.
14.13.3 Restricted Boltzmann machines
Restricted Boltzmann machines (RBMs) are a type of neural network used for unsupervised learning; they try to extract meaningful features.

Such methods are useful when we have a small supervised training set but abundant unlabeled data. We can train an RBM (or another unsupervised learning method) on the unlabeled data to learn useful features to use with the supervised training set - this approach is called semi-supervised learning.
14.13.4 Deep Belief Nets
Deep belief networks (DBNs) are generative neural networks. Given some feature values, a deep belief net can be run “backwards” to generate plausible inputs. For example, if you train a DBN on handwritten digits, it can be used to generate new images of handwritten digits.
Deep belief nets are also capable of unsupervised and semi-supervised learning. In an unsupervised
setting, DBNs can still learn useful features.
14.14 Other neural networks

14.14.1 Modular Neural Networks
So say we have trained a neural net which has learned our function W: given a word as input, it outputs the word’s high-dimensional vector representation.
We can re-use this network in a modular fashion so that we construct a larger neural net which can
take a fixed-size set of words as input. For example, the following network takes in five words, from
which we get their representations, which are then passed into another network R to yield some
output s.
[Figure: A modular neural network (Bottou, 2011)]
14.14.2 Recursive Neural Networks
Using modular neural networks like the above is limited by the fact that we can only accept a fixed number of inputs.

We can get around this by adding an association module A, which takes two representations and merges them.

As you can see, it can take either a representation from a word (via a W module) or from a phrase (via another A module).
We probably don’t want to merge words linearly though. Instead we might want to group words in
some way:
[Figure: Using association modules (Bottou, 2011)]
[Figure: A recursive neural network (Bottou, 2011)]
This kind of model is a “recursive neural network” (sometimes “tree-structured neural network”)
because it has modules feeding into modules of the same type.
14.14.3 Nonlinear neural nets
In typical NNs, the architecture of the network is specified beforehand and is static - neurons don’t change connections. In a nonlinear neural net, however, the connections between neurons become dynamic, so that new connections may form and old connections may break. This is more like how the human brain operates. But so far, at least, these are very complex and difficult to train.
14.14.4 Neural Turing Machines
A Neural Turing Machine is a neural network enhanced with external addressable memory (and a
means of interfacing with it). Like a Turing machine, it can simulate any arbitrary procedure - in fact,
given an input sequence and a target output sequence, it can learn a procedure to map between the
two on its own, trainable via gradient descent (as the entire thing is differentiable).
The basic architecture of NTMs is that there is a controller (which is a neural network, typically an
RNN, e.g. LSTM, or a standard feedforward network), read/write heads (the write “head” actually
consists of two heads, an erase and an add head, but referred to as a single head), and a memory
matrix Mt ∈ RN×M .
Each row (of which there are N, each of size M) in the memory matrix is referred to as a memory
“location”.
Unlike a normal Turing machine, the read and write operations are “blurry” in that they interact
in some way with all elements in memory (normal Turing machines address one element at a time).
There is an attentional “focus” mechanism that constrains the memory interaction to a smaller portion
- each head outputs a weighting vector which determines how much it interacts (i.e. reads or writes)
with each location.
At time t, the read head emits a (normalized) weighting vector over the N locations, wt .
From this we get the M-length read vector r_t:

r_t = \sum_i w_t(i) M_t(i)
At time t, the write head emits a weighting vector w_t (note that the write and read heads each emit their own w_t, which is used in the context of that head) and an erase vector e_t whose M elements lie in the range (0, 1).
Using these vectors, the memory vectors M_{t-1}(i) (i.e. locations) from the previous time step are updated:

\tilde{M}_t(i) = M_{t-1}(i) [\mathbf{1} - w_t(i) e_t]
where \mathbf{1} is a row vector of all ones and the multiplication against the memory location is pointwise. Thus a memory location is erased (all elements set to zero) if w_t(i) and e_t are all ones, and if either is all zeros, then the memory is unchanged.
The write head also produces an M-length add vector a_t, which is added to the memory after the erase step:

M_t(i) = \tilde{M}_t(i) + w_t(i) a_t
So, how are these weight vectors wt produced for each head?
For each head, two addressing mechanisms are combined to produce its weighting vectors:
• content-based addressing: focus attention on locations similar to the controller’s outputted
values
• location-based addressing: conventional lookup by location
Content-based addressing
Each head produces a length M key vector kt .
kt functions as a lookup key; we want to find an entry in Mt most similar to kt . A similarity function
K (e.g. cosine similarity) is applied to kt against all entries in Mt . The similarity value is multiplied
by a “key strength” βt > 0, which can attenuate the focus of attention. Then the resulting vector
of similarities is normalized by applying softmax. The resulting weighting vector is wtc :
w_t^c(i) = \frac{\exp(\beta_t K(k_t, M_t(i)))}{\sum_j \exp(\beta_t K(k_t, M_t(j)))}
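A minimal NumPy sketch of content-based addressing, using cosine similarity for K followed by a softmax (toy memory contents; the small epsilon only guards against division by zero):

import numpy as np

def content_addressing(k_t, M_t, beta_t):
    # cosine similarity between the key and each memory location (row of M_t)
    sims = M_t @ k_t / (np.linalg.norm(M_t, axis=1) * np.linalg.norm(k_t) + 1e-8)
    # softmax over the strength-scaled similarities gives the weighting w^c_t
    scaled = beta_t * sims
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

M = np.random.default_rng(0).normal(size=(8, 4))  # N = 8 locations, each of size M = 4
w_c = content_addressing(M[3], M, beta_t=5.0)     # key on (a copy of) an existing row
print(w_c.argmax())  # 3: attention focuses on the most similar location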
Location-based addressing
The location-based addressing mechanism is used to move across memory locations iteratively
(i.e. given a current location, move to this next location; this is called a rotational shift) and for
random-access jumps.
Each head outputs a scalar interpolation gate gt in the range (0, 1). This is used to blend the old
weighting outputted by the head, wt−1 , with the new weighting from the content-based addressing
system, wtc . The result is the gated weighting wtg :
w_t^g = g_t w_t^c + (1 - g_t) w_{t-1}
If the gate is zero, the content weighting is ignored and only the previous weighting is used.
(TODO not totally clear on this part) Next, the head also emits a shift weighting st which specifies
a normalized distribution over the allowed integer shifts. For example, if shifts between -1 and 1 are
allowed, st has three elements describing how much the shifts of -1, 0, and 1 are performed. One
way of doing this is by adding a softmax layer of the appropriate size to the controller.
Then we apply the rotation specified by s_t to w_t^g:

\tilde{w}_t(i) = \sum_{j=0}^{N-1} w_t^g(j) s_t(i - j)
Over time, the shift weighting, if it isn’t “sharp”, can cause weightings to disperse. For example, with permitted shifts of -1, 0, 1 and s_t = [0.1, 0.8, 0.1], a single point gets slightly blurred across three points. To counter this, each head also emits a scalar \gamma_t \ge 1 that is used to (re)sharpen the final weighting:

w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}
Refer to the paper for example uses.
14.15 Neuroevolution
Neuroevolution is the process of applying evolutionary algorithms to neural networks to learn their
parameters (weights) and/or architecture (topology).
Neuroevolution is flexible in its application; it may be used for supervised, unsupervised, and reinforcement learning tasks. An example application is state or action value evaluation, e.g. for game
playing.
With neuroevolution, an important choice is the genetic representation (genotype) of the neural
network. For instance, if the architecture is fixed by the user, the weights can just be genetically
represented as a vector of real numbers. Then the standard genetic algorithm (i.e. fitness, mutation,
crossover, etc) can be applied.
This simple representation of weights as a vector is called conventional neuroevolution (CNE).
However, because the performance of a neural net is so dependent on topology, evolving the topology
in addition to the weights can lead to better performance. One such method is NeuroEvolution
of Augmenting Topologies (NEAT), of which there are many variations (e.g. RBF-NEAT, CascadeNEAT).
With direct encoding, the parameters are mapped one-to-one onto the vector; that is, each weight is mapped to one number in the vector. However, there may be an advantage to using indirect encodings, in which information in one part of the vector may be linked to another part. This compacts the genetic representation in that not every value must be represented (some are shared, mapping to multiple connections).
A Compositional Pattern Producing Network (CPPN) is a neural network which functions as a pattern generator. CPPNs typically include different activation functions (such as sine, for repeating patterns, or Gaussian, to create symmetric patterns). Although they were originally designed to produce two-dimensional patterns (e.g. images), CPPNs may be used to evolve indirectly encoded neural networks - they “exploit geometric domain properties to compactly describe the connectivity pattern of a large-scale ANN” (Risi & Togelius). The CPPN itself may be evolved using NEAT - this approach is called HyperNEAT.
One form of indirect encoding is the class of developmental approaches, in which the network develops new connections as the game is being played.
In non-deterministic games, the fitness function may be noisy (since the same action can lead to
different scores). One way around this is to average the performance over many independent plays.
For complex problems, it sometimes is too difficult to evolve the network directly to that problem.
Instead, staging (also called incremental evolution) may be preferred, where the network is evolved
on simpler problems that gradually increase towards the original complex task. Similarly, transfer
learning may be useful here as well.
A challenge in evolving competitive AI is that there may not be a good enough opponent to play
against and learn from. A method called competitive coevolution can be used, in which the fitness
of one AI player depends on how it performs against another AI player drawn from the same or from
another population.
A similar method called cooperative coevolution, where fitness is instead based on its performance in
collaboration with other players, may make more sense in other contexts. It may be adapted more
generally by applying it at the individual neuron level - that is, each neuron’s fitness depends on how
well it works with the other neurons in the network. The CoSyNE neuroevolution algorithm is based
on this.
In many cases, there is no single performance metric that can be used; rather, performance is evaluated
based on many different dimensions. The simplest way around this is to combine these various
metrics in some way - e.g. as a linear combination - but another way is cascading elitism, where
“each generation contains separate selection events for each fitness function, ensuring equal selection
pressure” (Risi & Togelius).
There is another class of algorithms called multiobjective evolutionary algorithms (MOEAs) in which multiple fitness functions are specified. These algorithms try to satisfy all their given objectives (fitness functions) and can also manage conflicts between objectives by identifying (mapping) them and deciding on tradeoffs. When a solution is found where no objective can be further improved without worsening another, the solution is said to be on the Pareto front. One such MOEA is NSGA-II (Non-dominated Sorting Genetic Algorithm).
There exist interactive evolution approaches in which a human can set or modify objectives during
evolution, or even act as the fitness function themselves. Other ways humans can intervene include
shaping, where the human can shape the environment to influence training, and demonstration, in
which the human takes direct control and the network learns from that example.
14.16 Generative Adversarial Networks
Generative models are typically trained with maximum-likelihood estimation, which can become intractable (due to the normalization/partition term).
Generative adversarial networks (GAN) are a method for training generative models with neural
networks, trained with stochastic gradient descent instead of MLE.
Sampling from the model is achieved by inputting noise; the outputs of the networks are the samples.
A conditional generative adversarial network (cGAN) is an extension which allows the model to
condition on external information.
Note that denoising autoencoders have been used to achieve something similar. Denoising autoencoders learn to reconstruct empirical data X from noised inputs X̃ and can be sampled from by using a Markov chain, alternating between sampling reconstructed values P(X|X̃) and noise C(X̃|X), which eventually reaches a stationary distribution matching the empirical density model established by the training data (this method falls under the category of generative stochastic networks). GANs, in contrast, have a much simpler sampling procedure (they don’t require a Markov chain); they require only noise input.
A GAN has two components:
• the generator G, which attempts to generate fraudulent, but convincing, samples
• the discriminator D, which tries to distinguish fraudulent samples from genuine ones
These two are pitted against each other in an adversarial game. As such, the objective function here
is a minimax value function:
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
Breaking this down:
1. Train the discriminator to maximize the probability of the training data
2. Train the discriminator to minimize the probability of the data sampled from the generator. At
the same time, train the generator on the opposite objective (maximize the probability that the
discriminator assigns to its own samples).
They are trained in alternation using stochastic gradient descent.
This paper incorporates conditioning into this general GAN framework. Some condition y is established for generation; this restricts the generator in its output and the discriminator in its expected
input.
• Z is the noise space used to seed the generative model. Z = \mathbb{R}^{d_z}, where d_z is a hyperparameter. Values z \in Z are sampled from a noise distribution p_z(z) (it can be, for example, a simple Gaussian noise model).
• Y is an embedding space used to condition the generative model on some external information,
drawn from the training data. Y = RdY where dY is a hyperparameter. Using condition
information provided in the training data, we can define a density model py (y ).
• X is the data space which represents an image output from the generator or input to the
discriminator. Each input is associated with some conditional data y , so we have a density
model pdata (x, y ).
We have two functions:
• G : (Z × Y) → X is the generative model/generator, which takes noise data z \in Z along with an embedding y \in Y and produces an output x \in X.
• D : (X × Y ) → [0, 1] is the discriminative model/discriminator which takes an input x and
condition y and predicts the probability under condition y that x came from the empirical data
distribution rather than from the generative model.
The generator G implicitly defines a conditional density model pg (x|y ). We combine this density
model with the existing conditional density py (y ) to yield the joint model pg (x, y ). The task is to
parameterize G so that it replicates the empirical density model pdata (x, y ).
The conditional GAN objective function becomes:
\min_G \max_D \; \mathbb{E}_{x, y \sim p_{\mathrm{data}}(x, y)}[\log D(x, y)] + \mathbb{E}_{y \sim p_y, z \sim p_z(z)}[\log(1 - D(G(z, y), y))]
The conditional data y is sampled from either the training data or an independent distribution.
In terms of cost functions: we have a batch of training data {(xi , yi )}ni=1 and zi drawn from the noise
prior.
The cost equation for the discriminator D is a simple logistic cost expression (to give a positive label
to input truly from the data distribution and a negative label to counterfeit examples):
J_D = -\frac{1}{2n} \left( \sum_{i=1}^{n} \log D(x_i, y_i) + \sum_{i=1}^{n} \log(1 - D(G(z_i, y_i), y_i)) \right)
The cost equation for G is (to maximize the probability the discriminator assigns to samples from G,
i.e. to trick the discriminator):
J_G = -\frac{1}{n} \sum_{i=1}^{n} \log D(G(z_i, y_i), y_i)
Note that a “maximally confused” discriminator would output 0.5 for both true and counterfeit
examples.
Note that we have to be careful how we draw the conditional data y . We can’t just use conditional
samples from the data itself because the generator may just learn to reproduce true input based on
the conditional input.
Instead, we build a kernel density estimate py (y ) (called a Parzen window estimate) using the conditional values in the training data. We use a Gaussian kernel and cross-validate the kernel width σ
using a held-out validation set. Then we draw samples from this density model to use as conditional
inputs.
14.16.1 Training generative adversarial networks
We have:

• x, the data
• p_z(z), a prior for drawing noise samples
• p_g, the generator’s distribution that we learn
• G(z; θ_g), the generator function (i.e. the generator neural network), which takes as input a noise sample z, is parametrized by θ_g, and maps to the space of x (that is, it outputs a fraudulent sample resembling x)
• D(x; θ_d), the discriminator function (i.e. the discriminator neural network), which takes as input either a real example or the output from G, and outputs a scalar which is the estimated probability that the input came from x rather than from p_g.
Together, D and G play a two-player minimax game with the value function V (G, D):
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
We simultaneously train D to maximize the probability of assigning the correct labels and train G to
minimize log(1 − D(G(z))). In particular, we want to train D more quickly to be a more discerning
discriminator, which causes G to be a better counterfeiter.
However, we don’t want to train D to completion first because it would result in overfitting (and is
computationally prohibitive). Rather, we train D for k steps (k is a hyperparameter), then train G
for one step, and repeat.
Another problem is that early on G is bad at creating counterfeits, and D can recognize them as such easily - this causes log(1 − D(G(z))) to saturate. So instead of training G to minimize log(1 − D(G(z))), we can train it to maximize log(D(G(z))).
The basic algorithm is:

m = minibatch_size
pz = noise_prior
px = data_distribution

for i in range(epochs):
    for j in range(k):
        # sample m noise samples from the noise prior
        z = sample_minibatch(m, pz)
        # sample m examples from the data
        x = sample_minibatch(m, px)
        # update the discriminator by ascending its stochastic gradient
        update_discriminator(m, z, x)
    # sample m noise samples from the noise prior
    z = sample_minibatch(m, pz)
    # update the generator by descending its stochastic gradient
    update_generator(m, z)
Where update_discriminator has the gradient:
\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log(1 - D(G(z^{(i)}))) \right]
and update_generator has the gradient:

\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log(1 - D(G(z^{(i)})))
The paper used momentum for the gradient updates.
14.17 References
• Neural Computing: Theory and Practice (1989). Philip D. Wasserman.
• MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT.
• CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Neural Networks
Part 2: Setting up the Data and the Loss. Andrej Karpathy.
• Understanding LSTM Networks. Chris Olah. August 27, 2015.
• Crash Introduction to Artificial Neural Networks. Ivan Galkin.
• Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
• The Nature of Code. Daniel Shiffman.
• Neural Networks and Deep Learning, Michael A Nielsen. Determination Press, 2015.
• Neural Networks. Christos Stergiou & Dimitrios Siganos.
• A Step by Step Backpropagation Example. Matt Mazur. March 17, 2015.
• Gradient Descent with Backpropagation. July 31, 2015. Brandon B.
• A Primer on Neural Network Models for Natural Language Processing. Yoav Goldberg. October
5, 2015.
• Neural Networks for Machine Learning. Geoff Hinton. 2012. University of Toronto (Coursera).
• CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Neural Networks
Part 1: Setting up the Architecture. Andrej Karpathy.
• CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Backpropagation,
Intuitions. Andrej Karpathy.
• Neural Turing Machines. Alex Graves, Greg Wayne, Ivo Danihelka. 2014.
• Composing Music with Recurrent Neural Networks. Daniel Johnson. August 3, 2015.
• Neural Networks. Hugo Larochelle. 2013. Université de Sherbrooke.
• General Sequence Learning using Recurrent Neural Networks. Alec Radford.
• Recurrent Neural Networks Tutorial, Part 3 – Backpropagation Through Time and Vanishing
Gradients. Denny Britz. October 8, 2015.
• Recurrent Neural Network Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python
and Theano. Denny Britz. October 27, 2015.
• How to implement a recurrent neural network Part 1. Peter Roelants.
• Debugging: Gradient Checking. Stanford UFLDL.
• A Basic Introduction to Neural Networks. ai-junkie.
• Neural Networks in Plain English.
• Understanding Natural Language with Deep Neural Networks Using Torch. Soumith Chintala.
• 26 Things I Learned in the Deep Learning Summer School. Marek Rei.
• Conv Nets: A Modular Perspective. Chris Olah.
• Understanding Convolutions. Chris Olah.
• Deep Learning, NLP, and Representations. Chris Olah.
• How to choose the number of hidden layers and nodes in feedforward neural network. gung,
doug.
• comp.ai.neural-nets FAQ. Warren S. Sarle.
• Fundamentals of Deep Learning. Nikhil Buduma. 2015.
• CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Setting up the data
and the model. Andrej Karpathy.
• CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Modeling one neuron.
Andrej Karpathy.
• Deep Learning Glossary. WildML (Denny Britz).
• Batch Normalized Recurrent Neural Networks. César Laurent, Gabriel Pereyra, Philémon Brakel,
Ying Zhang, Yoshua Bengio.
• An overview of gradient descent optimization algorithms. Sebastian Ruder.
• Neuroevolution in Games: State of the Art and Open Challenges. Sebastian Risi, Julian Togelius.
November 3, 2015.
• Neuroevolution: from architectures to learning. Dario Floreano, Peter Dürr, Claudio Mattiussi.
• Conditional generative adversarial nets for convolutional face generation. Jon Gauthier.
• Generative Adversarial Nets. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio.
• Must Know Tips/Tricks in Deep Neural Networks. Xiu-Shen Wei.
• Understanding Convolution in Deep Learning. Tim Dettmers.
• Theano convolution documentation.
• Attention and Memory in Deep Learning and NLP. Denny Britz.
• Attention and Augmented Recurrent Neural Networks. Chris Olah & Shan Carter. Distill, 2016.
15 Model Selection
Model selection is the process of choosing between different machine learning approaches - e.g. SVM,
logistic regression, etc - or choosing between different hyperparameters or sets of features for the
same machine learning approach - e.g. deciding between the polynomial degrees/complexities for
linear regression.
The choice of the actual machine learning algorithm (e.g. SVM or logistic regression) is less important
than you’d think - there may be a “best” algorithm for a particular problem, but often its performance
is not much better than other well-performing approaches for that problem.
There may be certain qualities you look for in a model:

• Interpretable - can we see or understand why the model is making the decisions it makes?
• Simple - easy to explain and understand
• Accurate
• Fast (to train and test)
• Scalable (it can be applied to a large dataset)
Though there are generally trade-offs amongst these qualities.
15.1 Model evaluation
In order to select amongst models, we need some way of evaluating their performance.
You can’t evaluate a model’s hypothesis function with its training cost alone, because minimizing the training error can lead to overfitting.
A good approach is to take your data and split it randomly into a training set and a test set (e.g. a
70%/30% split). Then you train your model on the training set and see how it performs on the test
set.
For linear regression, you might do things this way:
• Learn parameter θ from training data by minimizing training error J(θ).
• Compute test set error (using the squared error), where m_{test} is the test set size:

J_{test}(\theta) = \frac{1}{2 m_{test}} \sum_{i=1}^{m_{test}} (h_\theta(x^{(i)}_{test}) - y^{(i)}_{test})^2
For logistic regression, you might do things this way:

• Learn parameter θ from training data by minimizing training error J(θ).
• Compute test set error, where m_{test} is the test set size:

J_{test}(\theta) = -\frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \left[ y^{(i)}_{test} \log h_\theta(x^{(i)}_{test}) + (1 - y^{(i)}_{test}) \log(1 - h_\theta(x^{(i)}_{test})) \right]
• Alternatively, you can use the misclassification error (“0/1 misclassification error”, read “zero-one”), which is just the fraction of examples that your hypothesis has mislabeled:

err(h_\theta(x), y) = \begin{cases} 1 & \text{if } h_\theta(x) \ge 0.5, y = 0 \text{ or } h_\theta(x) < 0.5, y = 1 \\ 0 & \text{otherwise} \end{cases}

\text{test error} = \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} err(h_\theta(x^{(i)}_{test}), y^{(i)}_{test})
A better way of splitting the data is to not split it only into training and testing sets, but to also
include a validation set. A typical ratio is 60% training, 20% validation, 20% testing.
So instead of just measuring the test error, you would also measure the validation error.
Validation is used mainly to tune hyperparameters - you don’t want to tune them on the training set
because that can result in overfitting, nor do you want to tune them on your test set because that
results in an overly optimistic estimation of generalization. Thus we keep a separate set of data for
the purpose of validation, that is, for tuning the hyperparameters - the validation set.
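A minimal sketch of such a split (assuming X and y are NumPy arrays; the 60/20/20 ratio is the one mentioned above):

import numpy as np

def train_val_test_split(X, y, rng=np.random.default_rng(0)):
    # shuffle, then split 60% / 20% / 20% into training / validation / test sets
    idx = rng.permutation(len(X))
    n_train, n_val = int(0.6 * len(X)), int(0.2 * len(X))
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])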
You can use these errors to identify what kind of problem you have if your model isn’t performing
well:
• If your training error is large and your validation/test set error is large, then you have a high
bias (underfitting) problem.
• If your training error is small and your validation/test set error is large, then you have a high
variance (overfitting) problem.
Because the test set is used to estimate the generalization error, it should not be used for “training”
in any sense - this includes tuning hyperparameters. You should not evaluate on the test set and then
go back and tweak things - this will give an overly optimistic estimation of generalization error.
Some ways of evaluating a model’s performance on (some of) your known data are:
• hold out (just set aside some portion of the data for validation; this is less reliable if the amount
of data is small such that the held out portion is very small)
• k-fold cross-validation (better than hold out for small datasets; a minimal sketch follows this list)
  – the training set is divided into k folds
  – iteratively take k − 1 folds for training and validate on the remaining fold
  – average the results
  – there is also “leave-one-out” cross-validation, which is k-fold cross-validation where k = n (n is the number of datapoints)
• bootstrapping
  – new datasets are generated by sampling with replacement (uniformly at random) from the original dataset
  – then train on the bootstrapped dataset and validate on the unselected data
• jackknife resampling
  – essentially equivalent to leave-one-out cross-validation, since leave-one-out is basically sampling without replacement
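A minimal sketch of k-fold cross-validation, assuming `train` and `evaluate` are placeholders for whatever fitting and scoring functions are in use:

```python
import numpy as np

# A minimal sketch of k-fold cross-validation: shuffle, split into k folds,
# hold each fold out once for validation, and average the scores.

def k_fold_cv(X, y, train, evaluate, k=5, seed=0):
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        valid_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train(X[train_idx], y[train_idx])
        scores.append(evaluate(model, X[valid_idx], y[valid_idx]))
    return np.mean(scores)   # average the per-fold validation scores
```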
15.1.1 Validation vs Testing
Validation refers to the phase where you are tuning your model and its hyperparameters. Once you
do that, you want to test this model on a new set of data it has not seen yet (i.e. data which has not
been used in cross-validation or bootstrapping or whatever method you used). This is to simulate
the model’s performance on completely new data and see how it does, which is the most important
quality of a model.
15.2 Evaluating regression models
The main techniques for evaluating regression models are:
• mean absolute error
• median absolute error
• (root) mean squared error
• coefficient of determination ($R^2$)
15.2.1 Residuals
A residual ei is the difference between the observed and predicted outcome, i.e.:
ei = yi − ŷi
This can also be thought of as the vertical distance between an observed data point and the regression
line.
Fitting a line by least squares minimizes $\sum_{i=1}^n e_i^2$; that is, it minimizes the mean squared error (MSE) between the line and the data. But some error from the fit line always remains; this remaining error is the residual.
Alternatively, the mean absolute error or median absolute error can be used instead of the mean
squared error.
The $e_i$ can be interpreted as estimates of the regression errors $\epsilon_i$, since we can only compute the true errors if we know the true model parameters.
We can measure the quality of a linear model, which is called goodness of fit. One approach is
to look at the variation of the residuals. You can also use the coefficient of determination (R2 ),
explained previously, which measures the variance explained by the least squares line.
Residual (error) variation
Residual variation measures how well a regression line fits the data points.
The average squared residual (the estimated residual variance) is the same as the mean squared error, i.e. $\sigma^2 = \frac{1}{n}\sum_{i=1}^n e_i^2$.

However, to make this estimator unbiased, you're more likely to see:

$$\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2$$
That is, with the degrees of freedom taken into account (here for intercept and slope, which both
have to be estimated).
The square root of this estimated variance, $\hat{\sigma}$, is the root mean squared error (RMSE).
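A minimal sketch of these quantities for a fitted line $y = mx + b$ (the helper below is an assumption for illustration, not a standard function):

```python
import numpy as np

# A minimal sketch: residual variation and RMSE for a fitted line y = m*x + b.

def residual_stats(x, y, m, b):
    residuals = y - (m * x + b)
    mse = np.mean(residuals ** 2)                          # biased estimate of residual variance
    unbiased_var = np.sum(residuals ** 2) / (len(x) - 2)   # degrees of freedom: slope and intercept
    rmse = np.sqrt(unbiased_var)
    return mse, unbiased_var, rmse
```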
Coefficient of determination
The total variation is equal to the residual variation (variation after removing the predictor) plus
the systematic/regression variation (the variation explained by the regression model):
$$\sum_{i=1}^n (Y_i - \bar{Y})^2 = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2$$
$R^2$ ($0 \leq R^2 \leq 1$) is the percent of total variability that is explained by the regression model, that is:

$$R^2 = \frac{\text{regression variation}}{\text{total variation}} = \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2} = 1 - \frac{\text{residual variation}}{\text{total variation}} = 1 - \frac{\sum_{i=1}^n (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}$$

$R^2$ can be a misleading summary of model fit, since deleting data or adding terms will inflate it.
TODO combine the below
15.2.2 Coefficient of determination
Example of error
For a line $y = mx + b$, the error of a point $(x_n, y_n)$ against that line is:

$$y_n - (mx_n + b)$$

Intuitively, this is the vertical difference between the point on the line at $x_n$ and the actual point at $x_n$.
The squared error of the line is the sum of the squares of all of these errors:

$$SE_{line} = \sum_{i=1}^n (y_i - (mx_i + b))^2$$

To get the best fit line, you want to minimize this squared error. That is, you want to find $m$ and $b$ which minimize $SE_{line}$. This works out as¹:
$$m = \frac{\bar{x}\,\bar{y} - \overline{xy}}{\bar{x}^2 - \overline{x^2}}$$

$$b = \bar{y} - m\bar{x}$$

¹ Reminder: a bar over a variable ($\bar{x}$) means the mean of those values, so $\overline{x^2} = \frac{x_1^2 + x_2^2 + \cdots + x_n^2}{n}$.

Note that you can alternatively calculate the regression line slope $m$ using the covariance and variance:

$$m = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$$
The line that these values yield is the regression line.
We can calculate the total variation in $y$, $SE_{\bar{y}}$, as:

$$SE_{\bar{y}} = \sum_{i=1}^n (y_i - \bar{y})^2$$
And then we can calculate the percentage of total variation in $y$ described by the regression line:

$$1 - \frac{SE_{line}}{SE_{\bar{y}}}$$
This is known as the coefficient of determination or R-squared.
The closer R-squared is to 1, the better a fit the line is.
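A minimal sketch putting the above together: fit the line using the covariance/variance form of the slope, then compute R-squared from the two squared-error sums:

```python
import numpy as np

# A minimal sketch: least squares fit of y = m*x + b plus R-squared.

def fit_line_and_r2(x, y):
    m = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope: Cov(x, y) / Var(x)
    b = np.mean(y) - m * np.mean(x)                 # intercept
    se_line = np.sum((y - (m * x + b)) ** 2)        # residual variation
    se_ybar = np.sum((y - np.mean(y)) ** 2)         # total variation
    r2 = 1 - se_line / se_ybar
    return m, b, r2
```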
15.3 Evaluating classification models
Important quantities:
• Sensitivity: $\frac{TP}{TP + FN}$
• Specificity: $\frac{TN}{TN + FP}$
• Positive predictive value: $\frac{TP}{TP + FP}$
• Negative predictive value: $\frac{TN}{TN + FN}$
• Accuracy: $\frac{TP + TN}{TP + FP + TN + FN}$
15.3.1 Area under the curve (AUC)
This method is for binary classification and multilabel classification. In binary classification you may
choose some cutoff above which you assign a sample to one class, and below which you assign a
sample to the other class.
Depending on your cutoff, you will get different results - there is a trade off between the true and
false positive rates.
You can plot a Receiver Operating Characteristic (ROC) curve, which has for its y-axis P (T P ) and
for its x-axis P (F P ). Every point on the curve corresponds to a cutoff value. That is, the ROC curve
visualizes a sweep through all the cutoff thresholds so you can see the performance of your classifier
across all cutoff thresholds, whereas other metrics (such as the F-score and so on) only tell you the
performance for one particular cutoff. By looking at all thresholds at once, you get a more complete
and honest picture of how your classifier is performing, in particular, how well it is separating the
classes. It is insensitive to the bias of the data’s classes - that is, if there are way more or way less
of the positive class than there are of the negative class (other metrics may be deceptively favorable
or punishing in such unbalanced circumstances).
The area under the curve (AUC) is used to quantify how good the classification algorithm is. In
general, an AUC of above 0.8 is considered “good”. An AUC of 0.5 (a straight line) is equivalent to
random guessing.
ROC curves
So ROC curves (and the associated AUC metric) are very useful for evaluating binary classification.
Note that ROC curves can be extended to classification of three or more classes by using the one-vs-all
approach (see section on classification).
TODO incorporate the explanation below as well:
AUC is a metric for binary classification and is especially useful when dealing with high-bias data,
that is, where one class is much more common than the other. Using accuracy as a metric falls apart
in high-bias datasets: for example, say you have 100 training examples, one of which is positive and
the rest of which are negative. You could develop a model which just labels everything negative, and
it would have 99% accuracy. So accuracy doesn't really tell you enough here.
Many binary classifiers output some continuous value (0-1), rather than class labels; there is some
threshold (usually 0.5) above which one label is assigned, and below which the other label is assigned.
Some models may work best with a different threshold. Changing this threshold leads to a trade off
between true positives and false positives - for example, decreasing the threshold will yield more true
positives, but also more false positives.
AUC runs over all thresholds and plots the true vs false positive rates. This curve is called a
receiver operating characteristic curve, or ROC curve. A random classifier would give you equal false
and true positive rates, which leads to an AUC of 0.5; the curve in this case would be a straight line. The
better the classifier is, the more area under the curve there is (so the AUC approaches 1).
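A minimal sketch of building an ROC curve by sweeping thresholds and approximating the AUC with the trapezoidal rule (a from-scratch illustration, not a library call):

```python
import numpy as np

# A minimal sketch: compute (FPR, TPR) points by sweeping a threshold over the
# predicted probabilities, then approximate the area under the curve.

def roc_curve_points(probs, y):
    # an infinite threshold gives the (0, 0) point; the smallest gives (1, 1)
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(probs))[::-1]))
    positives, negatives = np.sum(y == 1), np.sum(y == 0)
    tprs, fprs = [], []
    for t in thresholds:
        preds = (probs >= t).astype(int)
        tprs.append(np.sum((preds == 1) & (y == 1)) / positives)
        fprs.append(np.sum((preds == 1) & (y == 0)) / negatives)
    return np.array(fprs), np.array(tprs)

def auc(fprs, tprs):
    order = np.argsort(fprs)
    return np.trapz(tprs[order], fprs[order])   # trapezoidal approximation
```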
15.3.2 Confusion Matrices
This method is suitable for binary or multiclass classification.
For classification, evaluation often comes in the form of a confusion matrix.
The core values are:
• True positives (TP): samples classified as positive which were labeled positive
• True negatives (TN): samples classified as negative which were labeled negative
• False positives (FP): samples classified as positive which were labeled negative
• False negatives (FN): samples classified as negative which were labeled positive
A few other metrics are computed from these values:
• Accuracy: How often is the classifier correct? $\frac{TP + TN}{\text{total}}$
• Misclassification rate (or “error rate”): How often is the classifier wrong? $\frac{FP + FN}{\text{total}} = 1 - \text{accuracy}$
• Recall (or “sensitivity” or “true positive rate”): How often are positive-labeled samples predicted as positive? $\frac{TP}{\text{num positive-labeled examples}}$
• False positive rate: How often are negative-labeled samples predicted as positive? $\frac{FP}{\text{num negative-labeled examples}}$
• Specificity (or “true negative rate”): How often are negative-labeled samples predicted as negative? $\frac{TN}{\text{num negative-labeled examples}}$
• Precision: How many of the predicted positive samples are correctly predicted? $\frac{TP}{TP + FP}$
• Prevalence: How many labeled-positive samples are there in the data? $\frac{\text{num positive-labeled examples}}{\text{num examples}}$
Some other values:
• Positive predictive value (PPV): precision but takes prevalence into account. With a perfectly
balanced dataset (i.e. equal positive and negative examples, that is prevalence is 0.5), the PPV
equals the precision.
• Null error rate: how often you would be wrong if you just predicted positive for every example.
This is a good starting baseline metric to compare your classifier against.
• F-score: The weighted average of recall and precision
• Cohen's Kappa: a measure of how well the classifier performs compared with random guessing; that is, a high Kappa score happens when there is a big difference between the accuracy and the null error rate.
• ROC Curve: (see the section on this)
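A minimal sketch computing several of the quantities above directly from predicted and true binary labels (labels assumed to be in {0, 1}):

```python
import numpy as np

# A minimal sketch: confusion-matrix counts and the derived metrics.

def confusion_metrics(preds, y):
    tp = np.sum((preds == 1) & (y == 1))
    tn = np.sum((preds == 0) & (y == 0))
    fp = np.sum((preds == 1) & (y == 0))
    fn = np.sum((preds == 0) & (y == 1))
    accuracy = (tp + tn) / len(y)
    recall = tp / (tp + fn)        # sensitivity / true positive rate
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)   # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "specificity": specificity, "f1": f1}
```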
15.3.3 Log-loss
This method is suitable for binary, multiclass, and multilabel classification.
Log-loss is an accuracy metric that can be used when the classifier output is not a class but a
probability, as is the case with logistic regression. It penalizes the classifier based on how far off it is,
e.g. if it predicts 1 with probability of 0.51 but the correct class is 0, it is less “wrong” than if it had
predicted class 1 with probability 0.95.
For a binary classifier, log-loss is computed:

$$-\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
Log-loss is the cross-entropy between the distribution of the true labels and the predictions. It is related
to relative entropy (that is, Kullback-Leibler divergence).
Intuitively, the way this works is the yi terms “turn on” the appropriate parts, e.g. when yi = 1 then
the term yi log(ŷi ) is activated and the other is 0. The reverse is true when yi = 0.
Because log(1) = 0, we get the best loss (0) when the term within the log operation is 1; i.e. when
yi = 1 we want ŷi to equal 1, so the loss comes down to log(ŷi ), but when yi = 0, we want ŷi = 0,
so the loss in that case comes down to log(1 − ŷi ).
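A minimal sketch of binary log-loss; the clipping of probabilities is an added assumption to avoid taking log(0):

```python
import numpy as np

# A minimal sketch: binary log-loss of predicted probabilities.

def log_loss(probs, y, eps=1e-15):
    p = np.clip(probs, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```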
15.3.4 F1 score
The F1 score, also called the balanced F-score or F-measure, is the weighted average of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
The best score is 1 and the worst is 0.
It can be used for binary, multiclass, and multilabel classification (for the latter two, use the weighted
average of the F1 score for each class).
15.4 Metric selection
When it comes to skewed classes (or high bias data), metric selection is more nuanced.
For instance, say you have a dataset where only 0.5% of the data is in category 1 and the rest is
in category 0. You run your model and find that it categorized 99.5% of the data correctly! But
because of the skew in the data, your model could simply classify every example as category 0 and
it would achieve that accuracy.
Note that the convention is to set the rare class to 1 and the other class to 0. That is, we try to
predict the rare class.
Instead, you may want to use precision/recall as your evaluation metric.
          1T                 0T
1P        True positive      False positive
0P        False negative     True negative

Where 1T/0T indicates the actual class and 1P/0P indicates the predicted class.
Precision is the number of true positives over the total number predicted as positive. That is, what fraction of the examples predicted as positive actually are positive?

$$\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$$
Recall is the number of true positives over the number of actual positives. That is, what fraction of the positive examples in the data were identified?

$$\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$
So in the previous example, our simple classifier would have a recall of 0.
There is a trade-off between precision and recall.
Say you are using a logistic regression model for this classification task. Normally, the category
threshold in logistic regression is 0.5, that is, predict class 1 if hθ (x) ≥ 0.5 and predict class 0 if
hθ (x) < 0.5.
But you may want to classify an example as 1 only if you're very confident. So you may change the
threshold to 0.9 to be stricter about your classifications. In this case, you would increase precision,
but lower recall since the model may not be confident enough about some of the more ambiguous
positive examples.
Conversely, you may want to lower the threshold to avoid false negatives, in which case recall increases,
but precision decreases.
So how do you compare precision/recall values across algorithms to determine which is best? You
can condense precision and recall into a single metric: the F1 score (also just called the F-score), which is the harmonic mean of the precision and recall:

$$F_1 = 2\frac{PR}{P + R}$$
Although more data doesn’t always help, it generally does. Many algorithms perform significantly
better as they get more and more data. Even relatively simple algorithms can outperform more
sophisticated ones, solely on the basis of having more training data.
If your algorithm doesn’t perform well, here are some things to try:
• Get more training examples (can help with high variance problems)
• Try smaller sets of features (can help with high variance problems)
• Try additional features (can help with high bias problems)
• Try adding polynomial features ($x_1^2$, $x_2^2$, $x_1 x_2$, etc.) (can help with high bias problems)
• Try decreasing the regularization parameter λ (can help with high bias problems)
• Try increasing the regularization parameter λ (can help with high variance problems)
15.5 Hyperparameter selection
Another part of model selection is hyperparameter selection.
Hyperparameter tuning is often treated as an art, i.e. done without a reliable, practical, and systematic process. However, there are some automated methods that can be useful, including:
• grid search
• random search
• evolutionary algorithms
• Bayesian optimization
Random search and grid search don’t perform particularly well but are worth being familiar with.
15.5.1 Grid search
Just searching through combinations of different hyperparameters and seeing which combination performs the best. Generally hyperparameters are searched over specific intervals or scales, depending on the particular hyperparameter. It may be 10, 20, 30, etc., or 1e-5, 1e-4, 1e-3, etc. It is easy to parallelize but quite brute-force.
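A minimal sketch of grid search, where `evaluate(params)` is a hypothetical placeholder that trains a model with the given hyperparameters and returns a validation score:

```python
import itertools

# A minimal sketch: exhaustively try every combination in the grid and keep the
# best-scoring one. `evaluate` is an assumed placeholder, not a real library call.

def grid_search(param_grid, evaluate):
    best_score, best_params = float("-inf"), None
    keys = list(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# e.g. grid_search({"learning_rate": [1e-5, 1e-4, 1e-3], "n_trees": [10, 20, 30]}, evaluate)
```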
15.5.2 Random search
Surprisingly, randomly sampling from the full grid often works just as well as a complete grid search,
but in much less time.
Intuitively: if we want the hyperparameter combination leading to the top 5% of performance, then
any random hyperparameter combination from the grid has a 5% chance of leading to that result. If
we want to successfully find such a combination 95% of the time, how many random combinations
do we need to run through?
If we take n hyperparameter combinations, the probability that all n are outside of this 5% of top
combinations is (1 − 0.05)n , so the probability that at least one is in the 5% is just 1 − (1 − 0.05)n .
If we want to find one of these combinations 95% of the time, that is, we want the probability that at
least one of them to be what we’re looking for to be 95%, then we just set 1 − (1 − 0.05)n = 0.95,
and thus n ≥ 60, so we need to try only 60 random hyperparameter combinations at minimum to have
a 95% chance of finding at least one hyperparameter combination that yields top 5% performance
for the model.
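As a quick sanity check of this argument (a sketch, not part of the original derivation), the probability of hitting at least one top-5% combination in 60 random draws can be computed directly:

```python
# Probability that at least one of 60 random hyperparameter combinations lands
# in the top 5% of the grid.
n = 60
p_at_least_one = 1 - (1 - 0.05) ** n
print(p_at_least_one)   # ~0.954, i.e. above the 95% target
```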
15.5.3 Bayesian Hyperparameter Optimization
We can use Bayesian optimization to select good hyperparameters for us. We place a Gaussian process prior on the objective (e.g. validation performance as a function of the hyperparameters), evaluate some hyperparameter settings, and use the results as observations to compute a posterior distribution. Then we select the next hyperparameters to try by optimizing the expected improvement over the current best result or the Gaussian process upper confidence bound (UCB). In particular, we choose an acquisition function to construct a utility function from the model posterior - this is what we use to decide which set of hyperparameters to try next.
Basic idea: Model the generalization performance of an algorithm as a smooth function of its hyperparameters and then try to find the maxima.
It has two parts:
• Exploration: evaluate this function on sets of hyperparameters where the outcome is most
uncertain
• Exploitation: evaluate this function on sets of hyperparameters which seem likely to output
high values
These steps repeat until convergence.
This is faster than grid search by making “educated” guesses as to where the optimal set of hyperparameters might be, as opposed to brute-force searching through the entire space.
One problem is that computing the results of a hyperparameter sample can be very expensive (for
instance, if you are training a large neural network).
We use a Gaussian process because its properties allow us to compute marginals and conditionals in
closed form.
Some notation for the following:
• $f(x)$ is the function drawn from the Gaussian process prior, where $x$ is the set of hyperparameters
• observations are in the form $\{x_n, y_n\}_{n=1}^N$, where $y_n \sim \mathcal{N}(f(x_n), \nu)$ and $\nu$ is the variance of noise introduced into the function observations
• the acquisition function is $a : \mathcal{X} \to \mathbb{R}^+$, where $\mathcal{X}$ is the hyperparameter space
• the next set of hyperparameters to try is $x_{next} = \operatorname{argmax}_x a(x)$
• the current best set of hyperparameters is $x_{best}$
• $\Phi(\cdot)$ denotes the cumulative distribution function of the standard normal
A few popular choices of acquisition functions include:

• probability of improvement: with a Gaussian process, this can be computed analytically as:

$$a_{PI}(x; \{x_n, y_n\}, \theta) = \Phi(\gamma(x)), \qquad \gamma(x) = \frac{f(x_{best}) - \mu(x; \{x_n, y_n\}, \theta)}{\sigma(x; \{x_n, y_n\}, \theta)}$$

• expected improvement: under a Gaussian process, this also has a closed form:

$$a_{EI}(x; \{x_n, y_n\}, \theta) = \sigma(x; \{x_n, y_n\}, \theta)\left(\gamma(x)\Phi(\gamma(x)) + \mathcal{N}(\gamma(x); 0, 1)\right)$$

• Gaussian process upper confidence bound: use upper confidence bounds (when maximizing; otherwise, lower confidence bounds) to construct acquisition functions that minimize regret over the course of their optimization:

$$a_{LCB}(x; \{x_n, y_n\}, \theta) = \mu(x; \{x_n, y_n\}, \theta) - \kappa\,\sigma(x; \{x_n, y_n\}, \theta)$$
Where κ is tunable to balance exploitation against exploration.
Some difficulties with Bayesian optimization of hyperparameters include:
• it is often unclear what the appropriate choice is for the covariance function and its associated hyperparameters (these hyperparameters are distinct from the ones the method is optimizing; i.e. they are in some sense “hyper-hyperparameters”)
• the function evaluation can be a time-consuming optimization procedure. One method is to
optimize expected improvement per second, thereby taking wall clock time into account. That
way, we prefer to evaluate points that are not only likely to be good, but can also be evaluated
quickly. However, we don’t know the duration function c(x) : X → R+ , but we can use this
same Gaussian process approach to model c(x) alongside f (x).
Furthermore, we can parallelize these Bayesian optimization procedures (refer to the Snoek et al. paper in the references).
15.5.4 Choosing the Learning Rate α
You can plot out a graph with the number of gradient descent iterations on the x-axis and the values
of minθ J(θ) on the y-axis and visualize how the latter changes with the number of iterations. At
some point, that curve will flatten out; that’s about the number of iterations it took for gradient
descent to converge on your particular problem.
Decreasing error over number of iterations
You could use an automatic convergence test which just declares convergence if J(θ) decreases by
less than some threshold value in an iteration, but in practice that threshold value may be difficult
to determine.
You would expect this curve to be similar to the one above. $\min_\theta J(\theta)$ should decrease with the
number of iterations, if gradient descent is working correctly. If not, then you should probably be
using a smaller learning rate (α). But again, don’t make it too small or convergence will be slow.
15.6 CASH
The combined problem of choosing both an algorithm and its hyperparameters is called the CASH problem (Combined Algorithm Selection and Hyperparameter optimization problem).
It can be formalized as such:
• $\mathcal{A} = \{A^{(1)}, \ldots, A^{(R)}\}$ is a set of algorithms
  – algorithm $A^{(j)}$'s hyperparameters have the domain $\Lambda^{(j)}$
• $D_{train} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ is a training set
  – it is split into $K$ cross-validation folds $D_{valid}^{(1)}, \ldots, D_{valid}^{(K)}$ and $D_{train}^{(1)}, \ldots, D_{train}^{(K)}$
• the loss $L(A_\lambda^{(j)}, D_{train}^{(i)}, D_{valid}^{(i)})$ is the loss an algorithm $A^{(j)}$ achieves on $D_{valid}^{(i)}$ when trained on $D_{train}^{(i)}$ with hyperparameters $\lambda$

We want to find the joint algorithm and hyperparameter settings that minimize this loss:

$$A^*, \lambda^* \in \operatorname*{argmin}_{A^{(j)} \in \mathcal{A},\, \lambda \in \Lambda^{(j)}} \frac{1}{K} \sum_{i=1}^K L(A_\lambda^{(j)}, D_{train}^{(i)}, D_{valid}^{(i)})$$
Approaches to this problem include the aforementioned Bayesian optimization methods.
Meta-learning is another approach, in which machine learning is applied to machine learning itself, that is, to algorithm and hyperparameter selection (and additionally feature preprocessing). The input data are different machine learning tasks and datasets, and the output is a well-performing algorithm and hyperparameter combination. In meta-learning we learn “meta-features” to identify similar problems for which an algorithm and hyperparameter combination works well.
These meta-features can include things like the number of datapoints, features, and classes, the data
skewness, the entropy of the targets, etc.
Meta-learning can be combined with Bayesian optimization - it can be used to roughly identify
good algorithm and hyperparameter choices, and Bayesian optimization can be used to fine-tune
these choices. This approach of using meta-learning to support Bayesian optimization is called
“warmstarting”.
As Bayesian optimization searches for hyperparameters it may come across many well-performing
hyperparameters that it discards because they are not the best. However, they can be saved to
construct a (weighted) ensemble model, which usually outperforms individual models. The ensemble
selection method seems to work best for constructing the ensemble:
• start with an empty ensemble
• iteratively, up to a specified ensemble size
– add a model that maximizes ensemble validation performance
Models are unweighted, but models can be added multiple times, so the end result is a weighted
ensemble.
15.7 References
• Review of fundamentals, IFT725. Hugo Larochelle. 2012.
• Exploratory Data Analysis Course Notes. Xing Su.
• Mining Massive Datasets (Coursera & Stanford, 2014). Jure Leskovec, Anand Rajaraman, Jeff
Ullman.
• Machine Learning. 2014. Andrew Ng. Stanford University/Coursera.
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley
(edX).
• Evaluating Machine Learning Models. Alice Zheng. 2015.
• Computational Statistics II (code). Chris Fonnesbeck. SciPy 2015.
• Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity.
• MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT.
• Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
• CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Neural Networks
Part 2: Setting up the Data and the Loss. Andrej Karpathy.
• POLS 509: Hierarchical Linear Models. Justin Esarey.
• Bayesian Inference with Tears. Kevin Knight, September 2009.
• Learning to learn, or the advent of augmented data scientists. Simon Benhamou.
• Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle, Ryan P. Adams.
• What is the expectation maximization algorithm?. Chuong B Do & Serafim Batzoglou.
• Gibbs Sampling for the Uninitiated. Philip Resnik, Eric Hardisty. June 2010.
• Maximum Likelihood Estimation. Penn State Eberly College of Science.
• Data Science Specialization. Johns Hopkins (Coursera). 2015.
• Practical Machine Learning. Johns Hopkins (Coursera). 2015.
• Elements of Statistical Learning. 10th Edition. Trevor Hastie, Robert Tibshirani, Jerome
Friedman.
• Model evaluation: quantifying the quality of predictions. scikit-learn.
• Efficient and Robust Automated Machine Learning. Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, Frank Hutter.
15.8 Bayes nets
16 Bayesian Learning
Bayesian machine learning is a particular set of approaches to probabilistic machine learning (for other
probabilistic models, see Supervised Learning).
Bayesian learning treats model parameters as random variables - in Bayesian learning, parameter
estimation amounts to computing posterior distributions for these random variables based on the
observed data.
Bayesian learning typically involves generative models - one notable exception is Bayesian linear
regression, which is a discriminative model.
16.1 Bayesian models
Bayesian modeling treats inference and learning as one problem.
We first have a prior distribution over our parameters (i.e. what are the likely parameters?) P (θ).
From this we compute a posterior distribution which combines both inference and learning:
$$P(y_1, \ldots, y_n, \theta \mid x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n, y_1, \ldots, y_n \mid \theta)\, P(\theta)}{P(x_1, \ldots, x_n)}$$
Then prediction is to compute the conditional distribution of the new data point given our observed
data, which is the marginal of the latent variables and the parameters:
$$P(x_{n+1} \mid x_1, \ldots, x_n) = \int P(x_{n+1} \mid \theta)\, P(\theta \mid x_1, \ldots, x_n)\, d\theta$$
Classification then is to predict the distributions of the new datapoint given data from other classes,
then finding the class which maximizes it:
$$P(x_{n+1} \mid x_1^c, \ldots, x_n^c) = \int P(x_{n+1} \mid \theta^c)\, P(\theta^c \mid x_1^c, \ldots, x_n^c)\, d\theta^c$$

16.1.1 Hidden Markov Models
HMMs can be thought of as clustering over time; that is, each state is a “cluster”.
The data points and latent variables are sequences, and πk becomes the transition probability given
the state (cluster) k. θk∗ becomes the emission distribution for x given state k.
16.1.2 Model-based clustering
• model data from heterogeneous unknown sources
• K unknown sources (clusters)
• each cluster/source is modeled using a parametric model (e.g. a Gaussian distribution)
For a given data point $i$, we have:

$$z_i \mid \pi \sim \text{Discrete}(\pi)$$

Where $z_i$ is the cluster label to which data point $i$ belongs. This is the latent variable we want to discover.

$\pi$ is the mixing proportions, i.e. the vector of probabilities for each class $k$, that is:

$$\pi = (\pi_1, \ldots, \pi_K) \mid \alpha \sim \text{Dirichlet}\left(\frac{\alpha}{K}, \ldots, \frac{\alpha}{K}\right)$$
That is, πk = P (zi = k).
We also model each data point $x_i$ as being drawn from a source (cluster) like so, where $F$ is however we are modeling the cluster (e.g. a Gaussian), parameterized by $\theta_{z_i}^*$, that is, the parameters for the $z_i$-labeled cluster:

$$x_i \mid z_i, \theta_k^* \sim F(\theta_{z_i}^*)$$
(Note that the star, as in θ∗ , is used to denote the optimal solution for θ.)
For this approach we have two priors over parameters of the model:
• For the mixing proportions, we typically use a Dirichlet prior (above) because it has the nice
property of being a conjugate prior with multinomial distributions.
• For each cluster k we use some prior H, that is θk∗ |H ∼ H.
Graphically, this is:
Model-based clustering plate model
16.1.3 Naive Bayes
The main assumption of Naive Bayes is that all features are independent effects of the label. This is
a really strong simplifying assumption but nevertheless in many cases Naive Bayes performs well.
Naive Bayes is also statistically efficient which means that it doesn’t need a whole lot of data to learn
what it needs to learn.
If we were to draw it out as a Bayes’ net:
Y → F1
Y → F2
...
Y → Fn
Where Y is the label and F1 , F2 , . . . , Fn are the features.
The model is simply:

$$P(Y \mid F_1, \ldots, F_n) \propto P(Y) \prod_i P(F_i \mid Y)$$
This just comes from the Bayes’ net described above.
The Naive Bayes model learns $P(Y, f_1, f_2, \ldots, f_n)$, which we can normalize (divide by $P(f_1, \ldots, f_n)$) to get the conditional probability $P(Y \mid f_1, \ldots, f_n)$:

$$P(Y, f_1, \ldots, f_n) = \begin{pmatrix} P(y_1, f_1, \ldots, f_n) \\ P(y_2, f_1, \ldots, f_n) \\ \vdots \\ P(y_k, f_1, \ldots, f_n) \end{pmatrix} = \begin{pmatrix} P(y_1)\prod_i P(f_i \mid y_1) \\ P(y_2)\prod_i P(f_i \mid y_2) \\ \vdots \\ P(y_k)\prod_i P(f_i \mid y_k) \end{pmatrix}$$
So the parameters of Naive Bayes are P (Y ) and P (Fi |Y ) for each feature.
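A minimal sketch of this model with categorical (integer-coded) features; add-one (Laplace) smoothing is an assumption added here so rarely seen feature values don't zero out the product, and the prediction step assumes feature values seen at prediction time also appeared during training:

```python
import numpy as np

# A minimal sketch of Naive Bayes with categorical features: estimate P(Y) and
# P(F_i | Y) by counting, then classify with argmax of log P(Y) + sum_i log P(F_i | Y).

def train_naive_bayes(X, y):
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}                     # P(Y)
    likelihoods = {}                                                   # P(F_i = v | Y = c)
    for c in classes:
        Xc = X[y == c]
        for i in range(X.shape[1]):
            n_values = len(np.unique(X[:, i]))
            for v in np.unique(X[:, i]):
                count = np.sum(Xc[:, i] == v)
                likelihoods[(c, i, v)] = (count + 1) / (len(Xc) + n_values)
    return classes, priors, likelihoods

def predict_naive_bayes(x, classes, priors, likelihoods):
    def log_score(c):
        return np.log(priors[c]) + sum(
            np.log(likelihoods[(c, i, v)]) for i, v in enumerate(x))
    return max(classes, key=log_score)
```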
16.2 Inference in Bayesian models
16.2.1 Maximum a posteriori (MAP) estimation
A Bayesian alternative to MLE, we can estimate probabilities using maximum a posteriori estimation,
where we instead choose a probability (a point estimate) that is most likely given the observed data:
$$\tilde{\pi}_{MAP} = \operatorname*{argmax}_\pi P(\pi \mid X) = \operatorname*{argmax}_\pi \frac{P(X \mid \pi)P(\pi)}{P(X)} = \operatorname*{argmax}_\pi P(X \mid \pi)P(\pi)$$

$$P(y \mid X) \approx P(y \mid \tilde{\pi}_{MAP})$$
So unlike MLE, MAP estimation uses Bayes’ Rule so the estimate can use prior knowledge (P (π))
about what we expect π to be.
Again, this may be done with log-likelihoods:
$$\theta_{MAP} = \operatorname*{argmax}_\theta p(\theta \mid x) = \operatorname*{argmax}_\theta \left[\log p(x \mid \theta) + \log p(\theta)\right]$$
16.3 Maximum A Posteriori (MAP)
The likelihood function $L(\theta)$ is the probability of the data $D$ as a function of the parameters $\theta$. This often has very small values, so typically we work with the log-likelihood function instead:

$$\ell(\theta) = \log L(\theta)$$
The maximum likelihood criterion simply involves choosing the parameter θ to maximize ℓ(θ). This
can (sometimes) be done analytically by computing the derivative and setting it to zero and yields
the maximum likelihood estimate.
MLE’s weakness is that if you have only a little training data, it can overfit. This problem is known
as data sparsity. For example, you flip a coin twice and it happens to land on heads both times. Your
maximum likelihood estimate for θ (probability that the coin lands on heads) would be 1! We can
then try to generalize this estimate to another dataset and test it by measuring the log-likelihood on
the test set. If a tails shows up at all in the test set, we will have a test log-likelihood of −∞.
We can instead use Bayesian techniques for parameter estimation. In Bayesian parameter estimation,
we treat the parameters θ as a random variable as well, so we learn a joint distribution p(θ, D).
We first require a prior distribution p(θ) and the likelihood p(D|θ) (as with maximum likelihood).
We want to compute p(θ|D), which is accomplished using Bayes’ rule:
$$p(\theta \mid D) = \frac{p(\theta)\, p(D \mid \theta)}{\int p(\theta')\, p(D \mid \theta')\, d\theta'}$$
Though we work with only the numerator for as long as possible (i.e. we delay normalization until
it’s necessary):
p(θ|D) ∝ p(θ)p(D|θ)
The more data we observe, the less uncertainty there is around the parameter, and the likelihood
term comes to dominate the prior - we say that the data overwhelm the prior.
We also have the posterior predictive distribution p(D′ |D), which is the distribution over future
observables given past observations. This is computed by computing the posterior over θ and then
marginalizing out θ:
$$p(D' \mid D) = \int p(\theta \mid D)\, p(D' \mid \theta)\, d\theta$$
The normalization step is often the most difficult, since we must compute an integral over potentially
many, many parameters.
We can instead formulate Bayesian learning as an optimization problem, allowing us to avoid this
integral. In particular, we can use maximum a-posteriori (MAP) approximation.
Whereas with the previous Bayesian approach (the “full Bayesian” approach) we learn a distribution
over θ, with MAP approximation we simply get a point estimate (that is, a single value rather than
a full distribution). In particular, we get the parameters that are most likely under the posterior:
$$\hat{\theta}_{MAP} = \operatorname*{argmax}_\theta p(\theta \mid D) = \operatorname*{argmax}_\theta p(\theta, D) = \operatorname*{argmax}_\theta p(\theta)p(D \mid \theta) = \operatorname*{argmax}_\theta \left[\log p(\theta) + \log p(D \mid \theta)\right]$$
Maximizing log p(D|θ) is equivalent to MLE, but now we have an additional prior term log p(θ). This
prior term functions somewhat like a regularizer. In fact, if p(θ) is a Gaussian distribution centered
at 0, we have L2 regularization.
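A minimal sketch contrasting MLE and MAP for the coin-flip example above, using a Beta(a, b) prior on the heads probability (the conjugate Beta prior is an assumption chosen for illustration):

```python
# MLE vs MAP for estimating a coin's heads probability.
# With a Beta(a, b) prior, the posterior after observing the flips is
# Beta(a + heads, b + tails), and its mode is the MAP estimate.

def mle_heads_prob(heads, flips):
    return heads / flips

def map_heads_prob(heads, flips, a=2, b=2):
    tails = flips - heads
    return (a + heads - 1) / (a + b + heads + tails - 2)

print(mle_heads_prob(2, 2))   # 1.0  -- overfits two lucky flips
print(map_heads_prob(2, 2))   # 0.75 -- the prior pulls the estimate toward 0.5
```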
16.3.1 Markov Chain Monte Carlo (MCMC)
Motivation
With MLE and MAP estimation we get only a single value for π, and this collapsing into a single
value loses information - what if we instead considered the entire distribution of values for π, i.e.
P (π|X)?
As it stands, with MLE and MAP we only get an approximation of P (y |X). But with the distribution
P (π|X) we could directly compute its expected value:
$$E[P(y \mid X)] = \int P(y \mid \pi)\, P(\pi \mid X)\, d\pi$$
And with Bayes’ Rule we have:
$$P(\pi \mid X) = \frac{P(X \mid \pi)P(\pi)}{P(X)} = \frac{P(X \mid \pi)P(\pi)}{\int_\pi P(X \mid \pi)P(\pi)\, d\pi}$$
So we have two integrals here, and unfortunately integrals can be hard (sometimes impossible) to
compute.
With MCMC we can get the values we need without needing to calculate the integrals.
Monte Carlo methods
Monte Carlo methods are algorithms which perform probabilistic simulations to give you some value.
For example:
Say you have a square and a circle inscribed within it, so that they are co-centric and
the circle’s diameter is equal to the length of a side of the square. You take some rice
and uniformly scatter it in the shapes at random. You can count the total number of
grains of rice in the circle (C) and do the same for rice in the square (S). The ratio C/S
approximates the ratio of the area of the circle to the area of the square. The area of
the circle and for the square can be thought of as integrals (adding an infinite number of
infinitesimally small points), so what you have effectively done is approximate the value
of integrals.
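A minimal sketch of this idea with sampled points in place of rice: the ratio of the circle's area to the square's is π/4, so the fraction of points landing in the circle estimates π/4:

```python
import numpy as np

# A minimal sketch of the rice-scattering example: sample points uniformly in a
# square and count how many fall inside the inscribed circle.

def estimate_pi(n_samples=1_000_000, seed=0):
    rng = np.random.default_rng(seed)
    points = rng.uniform(-1, 1, size=(n_samples, 2))
    in_circle = np.sum(points[:, 0] ** 2 + points[:, 1] ** 2 <= 1)
    return 4 * in_circle / n_samples

print(estimate_pi())   # ~3.14
```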
MCMC
In the example the samples were uniformly distributed, but in practice they can be drawn from
other distributions. If we collect enough samples from the distribution we can compute pretty much
anything we would want to know about the distribution - mean, standard deviation, etc.
For example, we can compute the expected value:
$$E[f(z)] = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^N f(z^{(t)})$$
Since we don’t sample infinite points, we sample as many as we can for an approximation:
$$E[f(z)] \approx \frac{1}{T} \sum_{t=1}^T f(z^{(t)})$$
How exactly then is the sampling of z (0) , . . . , z (T ) according to a given distribution accomplished?
We treat the sampling process as a walk around a sample space and the walk proceeds as a Markov
chain; that is, the choice of the next sample depends on only the current state, based on a transition
probability Ptrans (z (t+1) |z (t) ).
So the general walking algorithm is:
• Randomly initialize z (0)
• for t = 1 to T do:
– z (t+1) := g(z (t) )
Where g is just a function which returns the next sample based on Ptrans and the current sample.
Gibbs Sampling
Gibbs sampling is an MCMC algorithm, where z is a point/vector [z1 , . . . , zk ] and k > 1. So here
the samples are vectors of at least two terms. You don’t select an entire sample at once, what you
do is make a separate probabilistic choice for each dimension, where the choice is dependent on the
other k − 1 dimensions, using the newest values for each.
For example, say k = 3 so you have vectors in the form [z1 , z2 , z3 ].
• First you pick a new value z1(t+1) based on z2(t) and z3(t) .
• Then you pick a new value z2(t+1) based on z1(t+1) and z3(t) .
• Then you pick a new value z3(t+1) based on z1(t+1) and z2(t+1) .
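To make this concrete, here is a minimal sketch of Gibbs sampling with k = 2, targeting a bivariate standard normal with correlation ρ (an assumed example, chosen because its conditionals are known in closed form):

```python
import numpy as np

# A minimal sketch of Gibbs sampling with k = 2. For a bivariate standard
# normal with correlation rho, the conditionals are Gaussian:
#   z1 | z2 ~ N(rho * z2, 1 - rho^2)   and vice versa.

def gibbs_bivariate_normal(T=10_000, rho=0.8, seed=0):
    rng = np.random.default_rng(seed)
    z1, z2 = 0.0, 0.0                                        # initialize z^(0)
    samples = []
    for _ in range(T):
        z1 = rng.normal(rho * z2, np.sqrt(1 - rho ** 2))     # update dim 1 given dim 2
        z2 = rng.normal(rho * z1, np.sqrt(1 - rho ** 2))     # update dim 2 given the new dim 1
        samples.append((z1, z2))
    return np.array(samples)

samples = gibbs_bivariate_normal()
print(np.corrcoef(samples[:, 0], samples[:, 1])[0, 1])       # ~0.8
```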
Gibbs Sampling (more)
Now that we have the generative model, we can use it to calculate the probability of some set of
group assignments for our data points. But how do we learn what a good set of group assignments
is?
We can use Gibbs Sampling, that is:
• Take the set of data points, and randomly initialize group assignments.
• Pick a point. Fix the group assignments of all the other points, and assign the chosen point a
new group (which can be either an existing cluster or a new cluster) with a CRP-ish probability
(as described in the models above) that depends on the group assignments and values of all
the other points.
• We will eventually converge on a good set of group assignments, so repeat the previous step
until happy.
16.4 Nonparametric models
First: a parametric model is one in which the capacity is fixed and does not increase with the amount
of training data. For example, a linear classifier, a neural network with fixed number of hidden units,
etc. The amount of parameters is finite, and the particular amount is determined before any data is
observed (e.g. with linear regression, we decide the number of parameters that will be used, rather
than learning it from the data).
Another way of thinking of it is: a parametric model tries to come up with some function from the
data, then the data is thrown out. You use that learned function in place of the data for future
predictions.
A nonparametric model doesn’t throw out the data, it keeps it around for later predictions; as a result,
as more data becomes available, you don’t need to create an updated model like you would with the
parametric approach.
16.4.1 What is a nonparametric model?
• counterintuitively, it does not mean a model without parameters. Rather, it means a model
with a very large number of parameters (e.g. infinite). Here, “nonparametric” refers more to
“not a parametric model”, not “without parameters”.
• could also be defined as a parametric model where the number of parameters increases with
the data, instead of fixing the number of parameters (that is, the number of things we can
learn) as is the case with parametric models. I.e. the capacity of the model increases with the
amount of training data.
• can also be defined as a family of distributions that is dense in some large space relevant to
the problem at hand.
– For example, with a regression problem, the space of possible solutions may be all continuous functions, which is infinite-dimensional (if you have infinite cardinality). A nonparametric model can span this infinite space.
To expand and visualize the last point, consider the regression problem example.
This is the space of continuous functions, where f ∗ is the function we are looking for.
With a parametric model, we have a finite number of parameters, so we can only cover a fraction of
this space (the square).
Space of continuous functions
Space of continuous functions w/ parametric model
Space of continuous functions w/ nonparametric model
However, with a nonparametric model, we can have infinite parameters and cover the entire space.
We apply some assumptions, e.g. favoring simpler functions over complex ones, so we can apply a
prior to the space which assigns more mass to simpler functions (the darker parts in the accompanying
figure). But every part of the space still has some mass.
It is possible to create a nonparametric model by nesting a parametric learning algorithm inside another
parametric learning algorithm. The outer learning algorithm learns the number of parameters, whereas
the inner learning algorithm performs as it normally would (learning the parameters themselves).
16.4.2 An example
An example of a nonparametric model is nearest neighbor regression, in which we simply store the
training set, then, for a given new point, identify the closest point and return its associated target
value.
That is:

$$\hat{y} = y_i, \qquad i = \operatorname*{argmin}_i \|X_i - x\|_2^2$$
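A minimal sketch of this: keep the training data around and answer a query with the target of the closest stored point:

```python
import numpy as np

# A minimal sketch of nearest neighbor regression.

def nn_regress(X_train, y_train, x):
    distances = np.sum((X_train - x) ** 2, axis=1)   # squared L2 distance to each stored point
    return y_train[np.argmin(distances)]
```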
Another example is wrapping a parametric algorithm inside another parametric algorithm - where
the number of parameters of the inner algorithm is a parameter that the outer parametric algorithm
learns.
16.4.3 Parametric models vs nonparametric models
Parametric models are relatively rigid; once you choose a model, there are some limitations to what
forms that model can take (i.e. how it can fit to the data), and the only real flexibility is in the
parameters which can be adjusted. For instance, with linear regression, the model must take the
form of y = β0 + β1 x1 + · · · + βn xn ; you can only adjust the βi values. If the “true” model
does not take this form, we probably won’t be able to estimate it well because the model we chose
fundamentally does not conform to it.
Nonparametric models, on the other hand, offer greater freedom of fit.
As an example, a histogram can be considered a nonparametric representation of a probability density
- it “lets the data speak for itself”, so to speak (you may hear nonparametric models described in this
way). The density that forms in the histogram is determined directly by the data. You don’t make
any assumptions about what the distribution is beforehand - e.g. you don’t have to say, “I think this
might be a normal distribution”, and then try to force the normal probability density function onto
the data.
Nonparametric models don’t actually mean there are no parameters, but it is perhaps better described
as not having a fixed set of parameters.
16.4.4 Why use a Bayesian nonparametric approach?
1. Model selection
• e.g. clustering - you have to specify the number of clusters. Too many and you overfit,
too few and you underfit.
• with a Bayesian approach you are not doing any optimizing (such as finding a maximum
likelihood), you are just computing a posterior distribution. So there is no “fitting” happening, so you cannot overfit.
• If you have a large model or one which grows with the amount of data, you can avoid
underfitting too.
• (of course, you can still specify an incorrect model and get poor performance)
2. Useful properties of Bayesian nonparametric models
• Exchangeability - you can permute your data without affecting learning (i.e. order of your
data doesn’t matter)
• Can model Zipf, Heap, and other power laws
• Flexible ways of building complex models from simpler parts
Nonparametric models still make modeling assumptions, they are just less constrained than most
parametric models.
There are also semiparametric models in which they are nonparametric in some ways and parametric
in others.
16.5 The Dirichlet Process
The Dirichlet process is “the cornerstone of Bayesian nonparametrics”.
It is a stochastic process - a model over an infinite collection of random variables.
There are a few ways to think about Dirichlet processes:
• the infinite limit of a Gibbs sampler for finite mixture models
• the Chinese Restaurant Process
• The stick-breaking construction
16.5.1 Dirichlet Distribution
The Dirichlet distribution is a probability distribution over all possible multinomial distributions.
For example, say we have some data which we want to classify into three classes A, B, C. Maybe
the data has 0.25 probability of being in class A, 0.5 probability of being in B, and 0.25 of being in
C. Or maybe it has 0.1 probability of being in class A, then 0.6 and 0.3 for B and C respectively.
Or it could be another distribution - we don’t know. The Dirichlet distribution is the probability
distribution representing these possible multinomial distributions across our classes.
The Dirichlet distribution is formalized as:
$$P(p \mid \alpha) = \frac{\Gamma\left(\sum_{k=0}^{K-1} \alpha_k\right)}{\prod_{k=0}^{K-1} \Gamma(\alpha_k)} \prod_{k=0}^{K-1} p_k^{\alpha_k - 1}$$
Γ(
where:
• p = a multinomial distribution
• α = the parameters of the dirichlet (a K-dimensional vector)
• K = the number of categories
Note that this term:

$$\frac{\Gamma\left(\sum_{k=0}^{K-1} \alpha_k\right)}{\prod_{k=0}^{K-1} \Gamma(\alpha_k)}$$

is just a normalizing constant so that we get a distribution. So if you're just comparing ratios of these distributions you can ignore it.
You begin with some prior which can be derived from other data or from domain knowledge or
intuition.
As more data comes in, we update the Dirichlet (i.e. with Bayesian updates):

$$P(p \mid \text{data}) = \frac{P(\text{data} \mid p)\, P(p \mid \alpha)}{P(\text{data})}$$
This can be done simply as updating the column in α which corresponds to a new data point, e.g. if
we have three classes and α = [2, 4, 1] and we encounter a new data point which belongs to the
class α1 , we just add one to that column in α, so it becomes [2, 5, 1].
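A minimal sketch of this conjugate update, along with the normalized α, which is the mean of the updated Dirichlet:

```python
import numpy as np

# A minimal sketch: observing a data point from class k just increments the
# corresponding entry of alpha.

def update_dirichlet(alpha, k):
    alpha = np.array(alpha, dtype=float)
    alpha[k] += 1
    return alpha

alpha = update_dirichlet([2, 4, 1], k=1)
print(alpha)                   # [2. 5. 1.]
print(alpha / alpha.sum())     # expected value (mean) of the updated Dirichlet
```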
Entropy
Also known as information content, energy, log likelihood, or − ln(p)
It can be thought of as the amount of “surprise” for an event.
If an event is totally certain, it has zero entropy.
A coin flip has some entropy since there are only two equally-probable possibilities.
If you have a pair of dice, there is some entropy for rolling a 6 (because there are multiple combinations
which can lead to 6) but much higher entropy for rolling a 12 (because there is only one combination
which leads to a 12).
We can look at the entropy of the Dirichlet function:
$$E(p \mid \alpha) = -\ln\left(\prod_{k=0}^{K-1} p_k^{\alpha_k - 1}\right) = \sum_{k=0}^{K-1} (\alpha_k - 1)(-\ln(p_k))$$
We’ll break out the entropy of a given multinomial distribution p into its own term:
ek = − ln(pk )
Interpreting α
We can take the α vector and normalize it. The normalized α vector is the expected value of the
dirichlet, that is, it is its mean.
The sum of the unnormalized α vector is the weight of the distribution, which can be thought of as its precision. In a normal distribution, the precision is $\frac{1}{\text{variance}}$; a higher precision means a narrower normal distribution, which means that values are likely to be near the mean. A lower precision means a wider distribution in which points are less likely to be near the mean.
So a dirichlet with a higher weight means that the multinomial distribution is more likely to be close
to the expected value.
Dirichlet distributions can be visualized over a simplex, which is a generalization of a triangle to arbitrary dimensions (e.g. in 2D it is a 2-simplex, a triangle; in 3D it is a 3-simplex, a pyramid; etc.). Some examples are below with their corresponding α vectors:
16.5.2 Finite Mixture Models
This is a continuation of the model-based clustering approach mentioned earlier.
We want to learn, via inference, values for π, zi , and θk∗ .
We can use a form of MCMC sampling - Gibbs sampling.
(to do: this is incomplete)
16.5.3 Chinese Restaurant Process
Partitions
Given a set S, a partition ϱ is a disjoint family of non-empty subsets (clusters) of S whose union is
S. So a partition is some configuration of clusters which encompasses the members of S.
E.g.
S = {A, B, C, D, E, F }
ϱ = {{A, D}, {B, C, E}, {F }}
The set of all partitions of S is denoted PS .
Random partitions are random variables taking value in PS .
The Chinese Restaurant Process (CRP)
The CRP is an example of random partitions and involves a sequence of customers coming into a
restaurant. Each customer decides whether or not to sit at a new (empty) table or join a table with
other customers. The customers are sociable so prefer to join tables with more customers, but there
is still some probability that they will sit at a new table:
$$P(\text{sit at new table}) = \frac{\alpha}{\alpha + \sum_{c \in \varrho} n_c}$$

$$P(\text{sit at table } c) = \frac{n_c}{\alpha + \sum_{c \in \varrho} n_c}$$
Where nc is the number of customers at a table c and α is a parameter.
Here the customers correspond to members of the set S, and tables are the clusters in a partition ϱ
of S.
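A minimal sketch simulating the process: each arriving customer joins an existing table with probability proportional to its size, or starts a new table with probability proportional to α:

```python
import numpy as np

# A minimal sketch of the Chinese Restaurant Process.

def crp(n_customers, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    tables = []                                  # tables[c] = number of customers at table c
    assignments = []
    for _ in range(n_customers):
        total = alpha + sum(tables)
        probs = [n_c / total for n_c in tables] + [alpha / total]
        choice = rng.choice(len(tables) + 1, p=probs)
        if choice == len(tables):
            tables.append(1)                     # sit at a new table
        else:
            tables[choice] += 1                  # join an existing table
        assignments.append(choice)
    return assignments, tables

assignments, tables = crp(100, alpha=2.0)
print(len(tables), tables)                       # number of clusters and their sizes
```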
This process has a rich-get-richer property, in that large clusters are more likely to attract more
customers, thus growing larger, and so on.
If you multiply all the conditional probabilities together, the overall probability of the partition ϱ,
called the exchangeable partition probability function (EPPF), is:
$$P(\varrho \mid \alpha) = \frac{\alpha^{|\varrho|}\, \Gamma(\alpha)}{\Gamma(n + \alpha)} \prod_{c \in \varrho} \Gamma(|c|)$$
This probability ends up not depending on the sequence in which customers arrive - so this is an
exchangeable random partition.
The α parameter affects the number of clusters in the partition - the larger the α, the more clusters
we expect to see.
Model-based Clustering with the Chinese Restaurant Process
Given a dataset S, we want to partition it into clusters of similar items.
Each cluster c ∈ ϱ is described by a model F (θc∗ ), for example a Gaussian, parameterized by θc∗ .
We model each item in each cluster as drawn from that cluster’s model.
We are going to use a Bayesian approach, so we introduce a prior over $\varrho$ and $\theta_c^*$ and then compute
posteriors over both. We use a CRP mixture model; that is, we use a Chinese Restaurant Process
for the prior over ϱ and an independent and identically distributed (iid) prior H over the cluster
parameters θc∗ .
So the CRP mixture model in more detail:
• ϱ ∼ CRP (α)
• θc∗ |ϱ ∼ H for c ∈ ϱ
• xi |θc∗ , ϱ ∼ F (θc∗ ) for c ∈ ϱ with i ∈ c
16.6 Infinite Mixture Models and the Dirichlet Process
(this is basically a paraphrasing of this post by Edwin Chen.)
Many clustering methods require the specification of a fixed number of clusters. However, in real-world problems there may
be Italian or Chinese or fast-food or vegetarian food and so on. Nonparametric Bayesian methods
allow parameters to change with the data; e.g. as we get more data we can let the number of clusters
grow.
Say we have some data, where each data point is some vector.
We can view our data from a generative perspective: we can assume that the true clusters in the data
are each defined by some model with some parameters, such as Gaussians with µi and σi parameters.
We further assume that these parameters themselves come from a distribution $G_0$. Then we assume
the data is generated by selecting a cluster, then taking a sample from that cluster.
Ok, how then do we assign the data points to groups?
16.6.1 Chinese Restaurant Process
(see explanation above)
(As a side note, the Indian Buffet Process is an extension of the CRP in which customers can sample
food from multiple tables, that is, they can belong to multiple clusters.)
More formally:
• Generate table assignments $g_1, \ldots, g_n \sim CRP(\alpha)$, that is, according to a Chinese Restaurant Process. $g_i$ is the table assigned to datapoint $i$.
• Generate table parameters $\phi_1, \ldots, \phi_m \sim G_0$ according to the base distribution $G_0$, where $\phi_k$ is the parameter for the $k$th distinct group.
• Given the table assignments and table parameters, generate each datapoint $p_i \sim F(\phi_{g_i})$ from a distribution $F$ with the specified table parameters. For example, $F$ could be a Gaussian and $\phi_{g_i}$ might be a vector specifying the mean and standard deviation.
16.6.2 Polya Urn Model
Basically the same as the Chinese Restaurant Process, except that while the CRP specifies a distribution over partitions (see above), the Polya Urn model does that and also assigns parameters to each
group.
Say we have an urn containing αG0 (x) balls of some color x for each possible value of x. G0 is our
base distribution and G0 (x) is the probability of sampling x from G0 .
Then we iteratively pick a ball at random from the urn, place it back and also place an additional
new ball of the same color of the one we drew.
As α increases (that is, we draw more new ball colors from the base distribution, which is the same
as placing more weight on our prior), the colors in the urn tend towards the base distribution.
More formally:
• Generate colors $\phi_1, \ldots, \phi_n \sim \text{Polya}(G_0, \alpha)$, that is, according to a Polya Urn Model. $\phi_i$ is the color of the $i$th ball.
• Given the ball colors, generate each datapoint $p_i \sim F(\phi_i)$ (where we are using $F$ in a way like in the Chinese Restaurant Process above).
16.6.3 Stick-Breaking Process
The stick-breaking process is also very similar to the CRP and the Polya Urn model.
We start with a “stick” of length one, then generate a random variable $\beta_1 \sim \text{Beta}(1, \alpha)$. Since we're drawing from the Beta distribution, $\beta_1$ will be a real number between 0 and 1 with expected value $\frac{1}{1+\alpha}$.
Then break off the stick at β1 . We define w1 to be the length of the left stick.
Then we take the right piece (the one we broke off) and generate $\beta_2 \sim \text{Beta}(1, \alpha)$.
Then break off the stick at β2 , set w2 to be the length of the stick to the right, and so on.
Here α again functions as a dispersion parameter; when it is low there are few, denser clusters, when
it is high, there are more clusters.
More formally:
• Generate group probabilities (stick lengths) $w_1, \ldots, w_\infty \sim \text{Stick}(\alpha)$, that is, according to a Stick-Breaking process.
• Generate group parameters $\phi_1, \ldots, \phi_\infty \sim G_0$, where $\phi_k$ is the parameter for the $k$th distinct group.
• Generate group assignments $g_1, \ldots, g_n \sim \text{Multinomial}(w_1, \ldots, w_\infty)$ for each datapoint.
• Given group assignments and group parameters, generate each datapoint $p_i \sim F(\phi_{g_i})$ (where we are using $F$ in a way like in the Chinese Restaurant Process above).
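A minimal sketch of (truncated) stick breaking, drawing $\beta_k \sim \text{Beta}(1, \alpha)$ and breaking off that fraction of whatever stick remains (truncating the infinite process at a finite number of weights is an added simplification):

```python
import numpy as np

# A minimal sketch of stick breaking: each weight is a Beta(1, alpha) fraction
# of the stick that remains after the previous breaks.

def stick_breaking(alpha, n_weights=20, seed=0):
    rng = np.random.default_rng(seed)
    remaining = 1.0
    weights = []
    for _ in range(n_weights):           # truncate the infinite process
        beta = rng.beta(1, alpha)
        weights.append(beta * remaining)
        remaining *= (1 - beta)
    return np.array(weights)

print(stick_breaking(alpha=0.5).round(3))   # low alpha: a few large weights
print(stick_breaking(alpha=5.0).round(3))   # high alpha: many small weights
```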
16.6.4 Dirichlet Process
The CRP, Polya Urn Model, and Stick-Breaking Process are all connected to the Dirichlet Process.
Suppose we have a Dirichlet process DP (G0 , α) where G0 is the base distribution and α is the
dispersion parameter. Say we want to sample xi ∼ G, where G is a distribution sampled from our
Dirichlet Process, G ∼ DP (G0 , α).
We could generate these xi values by taking a Polya Urn Model with color distribution G0 and
dispersion α - then xi could be the color of the i th ball in the urn.
Or we could generate these $x_i$ by assigning customers to tables via a CRP with dispersion α. Then all the customers at a table are given the same value (e.g. color) sampled from $G_0$. $x_i$ is the value/color given to the $i$th customer; here $x_i$ can be thought of as the parameters of the table customer $i$ is assigned to.
Or we could generate weights wk via a Stick-Breaking Process with dispersion α. Then we give each
weight wk a value/color vk sampled from G0 . We assign xi to vk with probability wk .
More formally:
• Generate a distribution G ∼ DP (G0 , α) from a Dirichlet process with base distribution G0 and
a dispersion parameter α.
• Generate group-level parameters xi ∼ G where xi is the group parameter for the i th datapoint.
Note that xi is not the same as ϕi ; xi is the parameter associated to the group that the i th
data point belongs to whereas ϕk is the parameter of the kth distinct group.
• Given group-level parameters $x_i$, generate each datapoint $p_i \sim F(x_i)$ (where we are using $F$ in a way like in the Chinese Restaurant Process above).
16.7 Model selection
16.7.1 Model fitting vs Model selection
Model fitting is just about fitting a particular model to data, e.g. minimizing error against it. Say we use a high-degree polynomial as our model (i.e. use more than one predictor variable). The resulting fit model might not actually be appropriate for the data - it may overfit it, for instance, or be overly complex.

Now say we fit a straight line (i.e. use just one predictor variable). We might find that the straight line is a better model for the data. The process of choosing between these models is called model selection.
So we need some way of quantifying the quality of models in order to compare them.
A naive approach is to use the likelihood (the product of the probabilities of each datapoint), or more
commonly, the log-likelihood (the sum of the log probabilities of each datapoint) and then select
the model with the greatest likelihood (this is the maximum likelihood approach). This method is
problematic, however, because more complicated (higher-degree) polynomial models will always have
a higher likelihood, though they are not necessarily better in the sense that we mean (they overfit
the data).
[Figure: a more complex model yields a greater data likelihood]
16.7.2
Model fitting
Say you have datapoints x1 , . . . , xn and errors for those datapoints e1 , . . . , en . Say there is some true
value for x, we’ll call it xtrue , that we want to learn.
A frequentist approach assumes this true value is fixed and that the data is random. So in this case,
we consider the distribution P (xi , ei |xtrue ) and want to identify a point estimate - that is, a single
value - for xtrue . This distribution tells us the probability of a point xi with its error ei .
For instance, if we assume that x is normally distributed:
$$P(x_i, e_i \mid x_{\text{true}}) = \frac{1}{\sqrt{2\pi e_i^2}} \exp\left(-\frac{(x_i - x_{\text{true}})^2}{2 e_i^2}\right)$$
Then we can consider the likelihood of the data overall by taking the product of the probabilities of
each individual datapoint:
$$L(X, E) = \prod_{i=1}^{n} P(x_i, e_i \mid x_{\text{true}})$$
Though typically we work with the log likelihood to avoid underflow errors:
$$\log L(X, E) = -\frac{1}{2} \sum_{i=1}^{n} \left[ \log(2\pi e_i^2) + \frac{(x_i - x_{\text{true}})^2}{e_i^2} \right]$$
A common frequentist approach to fitting a model is to use maximum likelihood. That is, find an
estimate for xtrue which maximizes this log likelihood:
$$\operatorname*{argmax}_{x_{\text{true}}} \log L$$
Equivalently, we could minimize the loss (e.g. the squared error).
For simple cases, we can compute the maximum likelihood estimate analytically, by solving
$$\frac{d \log L}{d x_{\text{true}}} = 0$$
When all the errors ei are equal, this ends up reducing to:
$$x_{\text{true}} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
That is, the mean of the datapoints.
For more complex situations, we instead use numerical optimization (i.e. we approximate the estimate).
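As a toy sketch of the frequentist fit (assuming the Gaussian noise model above; for unequal errors the analytic answer is the inverse-variance-weighted mean, which reduces to the plain mean when all ei are equal, and a generic numerical optimizer recovers the same estimate):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x_true = 10.0
e = rng.uniform(0.5, 2.0, size=50)              # per-point errors e_i
x = rng.normal(x_true, e)                       # observed x_i ~ N(x_true, e_i^2)

def neg_log_likelihood(mu):
    return 0.5 * np.sum(np.log(2 * np.pi * e**2) + (x - mu)**2 / e**2)

analytic = np.sum(x / e**2) / np.sum(1 / e**2)  # inverse-variance-weighted mean
numeric = minimize_scalar(neg_log_likelihood).x # numerical MLE; agrees with the analytic value
```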
The Bayesian approach instead involves looking at P (xtrue |xi , ei ), that is, we look at a probability
distribution for the unknown value based on fixed data. We aren’t looking for a point estimate (a
single value) any more, but rather describe xtrue as a probability distribution. If we do want a point
estimate (often you have to have a concrete value to work with), we can take the expected value
from the distribution.
P (xtrue |xi , ei ) is computed:
$$P(x_{\text{true}} \mid x_i, e_i) = \frac{P(x_i, e_i \mid x_{\text{true}})\, P(x_{\text{true}})}{P(x_i, e_i)}$$
Which is to say, it is the posterior distribution. For simple cases, the posterior can be computed
analytically, but more often you will need Markov Chain Monte Carlo to approximate it.
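For this one-dimensional toy problem the posterior can even be approximated on a grid rather than with MCMC. A minimal sketch, reusing x and e from the sketch above and assuming a flat prior:

```python
import numpy as np

grid = np.linspace(5, 15, 1001)                 # candidate values of x_true
log_prior = np.zeros_like(grid)                 # flat prior (an assumption)
# sum over datapoints of log P(x_i, e_i | x_true), evaluated at every grid value
log_lik = -0.5 * np.sum(np.log(2 * np.pi * e[:, None]**2)
                        + (x[:, None] - grid[None, :])**2 / e[:, None]**2, axis=0)
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, grid)                    # normalize so the posterior integrates to 1
point_estimate = np.trapz(grid * post, grid)    # expected value, if a point estimate is needed
```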
16.7.3 Model Selection
Just as model fitting differs between frequentist and Bayesian approaches, so does model selection.
Frequentists compare model likelihood, e.g., for two models M1 , M2 , they would compare
P (D|M1 ), P (D|M2 ).
Bayesians compare the model posterior, e.g. P (M1 |D), P (M2 |D).
The parameters are left out in both cases here since we aren’t concerned with how good the fit of
the model is, but rather, how appropriate the model itself is as a “type” of model.
We can use Bayes theorem to turn the posterior into something we can compute:
$$P(M \mid D) = P(D \mid M) \frac{P(M)}{P(D)}$$
Using conditional probability, we know that P (D | M) can be computed as the integral over the
parameter space of the likelihood:
$$P(D \mid M) = \int_{\Omega} P(D \mid \theta, M)\, P(\theta \mid M)\, d\theta$$
Computing P (D) - the probability of seeing your data at all - is really hard, impossible even. But we
can avoid dealing with it by comparing P (M1 | D) and P (M2 | D) as an odds ratio:
$$O_{21} \equiv \frac{P(M_2 \mid D)}{P(M_1 \mid D)} = \frac{P(D \mid M_2)}{P(D \mid M_1)} \frac{P(M_2)}{P(M_1)}$$
We still have to deal with $\frac{P(M_2)}{P(M_1)}$, which is known as the prior odds ratio (because $P(M_1)$, $P(M_2)$ are priors). This ratio is assumed to equal 1 if there’s no reason to believe or no prior evidence that one model will do better than the other.
The remaining ratio $\frac{P(D \mid M_2)}{P(D \mid M_1)}$ is known as the Bayes factor and is the most important part here. The integrals needed to compute the Bayes factor can be approximated using MCMC.
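For a one-dimensional toy comparison the evidence integrals can also be approximated on a grid instead of with MCMC. A sketch (the two models, the prior on µ, and all numbers are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0.3, 1.0, size=30)                     # toy data with unit-variance Gaussian noise

def log_lik(mu):                                         # log P(D | mu), known sigma = 1
    return -0.5 * np.sum(np.log(2 * np.pi) + (data - mu) ** 2)

# M1: mu fixed at 0 (no free parameters), so its evidence is just the likelihood at mu = 0
log_evidence_m1 = log_lik(0.0)

# M2: mu unknown with prior mu ~ N(0, 2^2); its evidence integrates the likelihood over the prior
mus = np.linspace(-10, 10, 4001)
log_prior = -0.5 * (np.log(2 * np.pi * 4.0) + mus ** 2 / 4.0)
log_lik_grid = np.array([log_lik(m) for m in mus])
integrand = np.exp(log_lik_grid + log_prior - log_evidence_m1)   # rescaled for numerical stability
bayes_factor_21 = np.trapz(integrand, mus)               # approximates P(D | M2) / P(D | M1)
```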
16.7.4
Model averaging
We aren’t required to choose just one model - rather, with Bayesian model averaging we can
combine as many as we’d like.
The basic approach is to define a prior over our models, compute a posterior over the models given
the data, and then combine the outputs of the models as a weighted average, using models’ posterior
probabilities as weights.
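As a toy illustration of that weighted average (the posterior probabilities and predictions below are made-up numbers):

```python
import numpy as np

model_posteriors = np.array([0.7, 0.2, 0.1])    # hypothetical P(M_k | D); must sum to 1
model_predictions = np.array([2.1, 2.6, 1.8])   # each model's prediction for the same new input
averaged_prediction = float(np.sum(model_posteriors * model_predictions))
```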
16.8 References
• Review of fundamentals, IFT725. Hugo Larochelle. 2012.
• Gibbs Sampling for the Uninitiated. Philip Resnik, Eric Hardisty. June 2010.
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX).
• Bayesian Machine Learning. Roger Grosse.
• Kernel Density Estimation and Kernel Regression. Justin Esarey.
• Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
• Frequentism and Bayesianism V: Model Selection. Jake Vanderplas.
• Bayesian Nonparametrics 1. Machine Learning Summer School 2013. Max Planck Institute for Intelligent Systems, Tübingen. Yee Whye Teh.
• Lecture 15: Learning probabilistic models. Roger Grosse, Nitish Srivastava.
• Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process. Edwin Chen.
17 NLP
17.1 Problems
Some (higher-level) problems that fall under NLP include:
• machine translation
• (structured) information extraction
• summarization
• natural language interfaces
• speech recognition
At a lower-level, these include the following problems:
• part-of-speech tagging
• parsing
• word-sense disambiguation
• named entity recognition
• etc
17.2 Challenges
Ambiguity is one of the greatest challenges in NLP. For example:
Fed raises interest rates, where “raises” is the verb and “Fed” is the noun phrase
Fed raises interest rates, where “interest” is the verb and “Fed raises” is the noun phrase
This ambiguity occurs at many levels:
• the acoustic level: e.g. mixing up similar-sounding words
• the syntactic level: e.g. multiple plausible grammatical parsings of a sentence
• the semantic level: e.g. some words can mean multiple things (“bank” as in a river or a financial
institution); this is called word sense ambiguity
• the discourse (multi-clause) level: e.g. unclear what a pronoun is referring to
Other challenges include:
• non-standard English: for instance, text shorthand, phrases such as “SOOO PROUD” as opposed to “so proud”, hashtags, etc
• segmentation issues: [the] [New] [York-New] [Haven] [Railroad] vs. [the] [New York]-[New
Haven] [Railroad]
• idioms (e.g. “get cold feet”, doesn’t literally mean what it says)
• neologisms (e.g. “unfriend”, “retweet”, “bromance”)
• world knowledge (e.g. “Mary and Sue are sisters” vs “Mary and Sue are mothers.”)
• tricky entity names: “Where is A Bug’s Life playing”, or “a mutation on the for gene”
The typical approach is to codify knowledge about language & knowledge about the world and find
some way to combine them to build probabilistic models.
17.3 Terminology
• synset: a synset is a set of synonyms that represent a single sense of a word.
• wordform: the full inflected surface form: e.g. “cat” and “cats” are different wordforms.
• lemma: the same stem, part of speech, rough word sense; e.g. “cat” and “cats” are the same
lemma.
– One lemma can have many meanings. For example:
a bank can hold investments… agriculture on the east bank…
• sense: a discrete representation of an aspect of a word’s meaning. The usages of bank in the
previous example have a different sense.
• homonyms: words that share form but have unrelated, distinct meanings (such as “bank”).
– Homographs: bank/bank, bat/bat
– Homophones: write/right, piece/peace
• polysemy:
– A polysemous word has related meanings, for example:
* “the bank was built in 1875 (”bank” = a building belonging to a financial institution)”
* “I withdrew money from the bank (”bank” = a financial institution)”
– Systematic polysemy, or metonymy, is when the meanings have a systematic relationship.
– For example, “school”, “university”, “hospital” - all can mean the institution or the building, so the systematic relationship here is building <=> organization.
– Another example is author <=> works of author, e.g. “Jane Austen wrote Emma” and
“I love Jane Austen”.
• synonyms: different words that have the same propositional meaning in some or all contexts.
However, there may be no examples of perfect synonymy since even if propositional meaning
is identical, they may vary in notions of politeness or other usages and so on.
– For example, “water” and “H2O” - each are more appropriate in different contexts.
– As another example, “big” and “large” - sometimes they can be swapped, sometimes they
cannot:
That’s a big plane. / How large is that plane? (acceptable)
Miss Nelson became kind of a big sister to Benjamin. / Miss Nelson became kind of a large sister to Benjamin. (not as acceptable)
The latter works less well because “big” has multiple senses, one of which does not correspond to “large”.
• antonyms: senses which are opposite with respect to one feature of meaning, but otherwise are similar, such as dark/light, short/long, fast/slow, etc.
• hyponym: one sense is a hyponym of another if the first sense is more specific (i.e. denotes a
subclass of the other).
– car is a hyponym of vehicle
– mango is a hyponym of fruit
• hypernym/superordinate:
– vehicle is a hypernym of car
– fruit is a hypernym of mango
• token: an instance of a type in running text; N = number of tokens, i.e. counting every word in the sentence, regardless of uniqueness.
• type: an element of the vocabulary; V = vocabulary = set of types (|V | = the size of the
vocabulary), i.e. counting every unique word in the sentence.
17.4 Data preparation
17.4.1
Sentence segmentation
“!” and “?” are pretty reliable indicators that we’ve reached the end of a sentence. Periods are more ambiguous: they can mark the end of a sentence, an abbreviation (e.g. Inc. or Dr.), or part of a number (e.g. 4.3).
17.4.2 Tokenization
Tokenization is the process of breaking up text into discrete units for analysis - this is typically into
words or phrases.
The best approach for tokenization varies widely depending on the particular problem and language.
German, for example, has many long compound words which you may want to split up. Chinese has
no spaces (no easy way for word segmentation), Japanese has no spaces and multiple alphabets.
17.4.3
Normalization
Once you have your tokens you need to determine how to normalize them. For example, “USA” and “U.S.A.” could be collapsed into a single token. But what about “Windows”, “window”, and “windows”?
Some common approaches include:
• case folding - reducing all letters to lower case (but sometimes case may be informative)
• lemmatization - reduce inflections or variant forms to base form.
• stemming - reducing terms to their stems; a crude chopping of affixes; a simplified version of lemmatization. The Porter stemmer is the most common English stemmer.
17.4.4
Term Frequency-Inverse Document Frequency (tf-idf) Weighting
Using straight word counts may not be the best approach in many cases.
Rare terms are typically more informative than frequent terms, so we want to bias our numerical
representations of tokens to give rarer words higher weights. We do this via inverse document
frequency weighting (idf):
$$\text{idf}_t = \log\left(\frac{N}{\text{df}_t}\right)$$
For a term t which appears in dft documents (dft = document frequency for t), where N is the total number of documents.
log is used here to “dampen” the effect of idf.
This can be combined with t’s term frequency tft,d for a particular document d to produce tf-idf weighting, which is the best known weighting scheme for text information retrieval:
$$w_{t,d} = (1 + \log \text{tf}_{t,d}) \times \log\left(\frac{N}{\text{df}_t}\right)$$
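A small sketch of computing these tf-idf weights over a toy corpus (the documents and names are illustrative):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]

N = len(docs)
df = Counter(term for doc in docs for term in set(doc))   # document frequency per term

def tfidf(term, doc):
    tf = doc.count(term)
    if tf == 0 or df[term] == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(N / df[term])

weight = tfidf("cat", docs[0])
```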
The Vector Space Model (VSM)
This representation of text data - that is, some kind of numerical feature for each word, such as the tf-idf weight or frequency - defines a |V |-dimensional vector space (where |V | is the vocabulary size).
• terms are the axes of the space
• documents are points (vectors) in this space
• this space is very high-dimensional when dealing with large vocabularies
• these vectors are very sparse - most entries are zero
17.4.6
Normalizing vectors
This is a different kind of normalization than the previously mentioned one, which was about normalizing the language. Here, we are normalizing vectors in a more mathematical sense.
Vectors can be length-normalized by dividing each of their components by the vector’s length. We can use the L2 norm, which makes it a unit vector (“unit” means it is of length 1):
$$\|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}$$
This means that if we have, for example, a document and copy of that document with every word
doubled, length normalization causes each to have identical vectors (without normalization, the copy
would have been twice as long).
17.5 Measuring similarity between text
17.5.1
Minimum edit distance
The minimum edit distance between two strings is the minimum number of editing operations
(insertion/deletion/substitution) needed to transform one into the other. Each editing operation has
a cost of 1, although in Levenshtein minimum edit distance substitutions cost 2 because they are
composed of a deletion and an insertion.
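A sketch of the standard dynamic-programming computation of this distance (using substitution cost 2, as in the Levenshtein variant described above):

```python
def min_edit_distance(source, target, sub_cost=2):
    """Levenshtein minimum edit distance (substitution cost 2 = deletion + insertion)."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                       # delete all of source
    for j in range(1, m + 1):
        d[0][j] = j                       # insert all of target
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[n][m]

assert min_edit_distance("intention", "execution") == 8   # classic textbook example
```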
17.5.2
Jaccard coefficient
The Jaccard coefficient is a commonly-used measure of overlap for two sets A and B.
$$\text{jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
A set has a Jaccard coefficient of 1 against itself: jaccard(A, A) = 1.
If A and B have no overlapping elements, jaccard(A, B) = 0.
The Jaccard coefficient does not consider term frequency, just set membership.
17.5.3 Euclidean Distance
Using the vector space model above, the similarity between two documents can be measured by the
euclidean distance between their two vectors.
However, euclidean distance can be problematic since longer vectors end up farther away.
For instance, there could be one document vector a and another document vector b which is just a scalar multiple of a (e.g. the same document with every word doubled). Intuitively they should be considered very similar since they lie along the same line, but their euclidean distance can be large simply because b is longer.
[Figure: euclidean distances between document vectors]
17.5.4
Cosine similarity
In cases like the euclidean distance example above, using angles between vectors can be a better
metric for similarity.
For length-normalized vectors, cosine similarity is just their dot product:
$$\cos(\vec{q}, \vec{d}) = \vec{q} \cdot \vec{d} = \sum_{i=1}^{|V|} q_i d_i$$
Where q and d are length-normalized vectors and qi is the tf-idf weight of term i in document q and
di is the tf-idf weight of term i in document d.
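A minimal sketch of cosine similarity over sparse term-weight dictionaries (e.g. tf-idf weights; the example weights are made up):

```python
import math

def cosine_similarity(q, d):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * d.get(term, 0.0) for term, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)   # equivalent to length-normalizing then taking the dot product

sim = cosine_similarity({"cat": 1.2, "sat": 0.7}, {"cat": 0.9, "dog": 1.1})
```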
17.6 (Probabilistic) Language Models
The approach of probabilistic language models involves generating some probabilistic understanding
of language - what is likely or unlikely. For example, given sentence A and sentence B, we want to
be able to say whether or not sentence A is more probable than sentence B.
We have some finite vocabulary V . There is an infinite set of strings (“sentences”) that can be
produced from V , notated V † (these strings have zero or more words from V , ending with the STOP
symbol). These sentences may make sense, or they may not (e.g. they might be grammatically
incorrect).
Say we have a training sample of N example sentences in English. We want to learn a probability
distribution p over the possible set of sentences V † ; that is, p is a function that satisfies:
$$\sum_{x \in V^{\dagger}} p(x) = 1, \qquad p(x) \geq 0 \text{ for all } x \in V^{\dagger}$$
The goal is for likely English sentences (i.e. “correct” sentences) to be more probable than nonsensical
sentences.
These probabilistic models have applications in many areas:
• Machine translation: P (high winds tonight) > P (large winds tonight).
• Spelling correction: P (about fifteen minutes from) > P (about fifteen minuets from).
• Speech recognition: P (I saw a van) > P (eyes awe of an).
So generally you are asking: what is the probability of this given sequence of words?
17.6.1
A naive method
For any sentence x1 , . . . , xn , we notate the count of that sentence in the training corpus as
c(x1 , . . . , xn ).
Then we might simply say that:
$$p(x_1, \ldots, x_n) = \frac{c(x_1, \ldots, x_n)}{N}$$
However, this method assigns 0 probability to sentences that are not in the training corpus, thus
leaving many plausible sentences unaccounted for.
17.6.2
A less naive method
You could use the chain rule here:
P (the water is so transparent) =
P (the) × P (water|the) × P (is|the water)
×P (so|the water is) × P (transparent|the water is so)
Formally, the above would be expressed:
$$P(w_1 w_2 \ldots w_n) = \prod_i P(w_i \mid w_1 w_2 \ldots w_{i-1})$$
Note that probabilities are usually done in log space to avoid underflow, which occurs if you’re
multiplying many small probabilities together, and because then you can just add the probabilities,
which is faster than multiplying:
log(p1 × p2 × p3 ) = log p1 + log p2 + log p3
To make estimating these probabilities manageable, we use the Markov assumption and assume that
a given word’s conditional probability only depends on the immediately preceding k words, not the
entire preceding sequence (that is, that any random variable depends only on the previous random
variable, and is conditionally independent of all the random variables before that):
$$P(X_1 = x_1) \prod_{i=2}^{n} P(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}) = P(X_1 = x_1) \prod_{i=2}^{n} P(X_i = x_i \mid X_{i-1} = x_{i-1})$$
That is, for any i ∈ 2 . . . n, for any x1 , . . . , xi :
P (Xi = xi |X1 = x1 , . . . , Xi−1 = xi−1 ) = P (Xi = xi |Xi−1 = xi−1 )
In particular, this is the first-order Markov assumption; if it seems appropriate, we could instead use
the second-order Markov assumption, where we instead assume that any random variable depends
only on the previous two random variables:
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(X_1 = x_1)\, P(X_2 = x_2 \mid X_1 = x_1) \prod_{i=3}^{n} P(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})$$
Though this is usually condensed to:
$$P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})$$
This can be extended to the third-order Markov assumption and so on.
In the context of language models, we define x−1 , x0 as the special “start” symbol, ∗, indicating the
start of a sentence.
We also remove the assumption that n is fixed and instead consider it as a random variable. We can just define Xn = STOP, where STOP is a special symbol, STOP ∉ V .
17.6.3
n-gram Models
The unigram model treats each word as if it has an independent probability:
$$P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i)$$
The bigram model conditions on the previous word:
$$P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i \mid w_{i-1})$$
We estimate bigram probabilities using the maximum likelihood estimate (MLE):
$$P_{\text{MLE}}(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}$$
Which is just the count of word i occurring after word i − 1 over all of the occurrences of word i − 1.
This can be extended to trigrams, 4-grams, 5-grams, etc.
Though language has long-distance dependencies, i.e. the probability of a word can depend on
another word much earlier in the sentence, n-grams work well in practice.
Trigram Models
With a trigram model, we have a parameter q(w |u, v ) for each trigram (sequence of three words)
u, v , w such that w ∈ V ∪ {STOP} and u, v ∈ V ∪ {∗}.
For any sentence x1 , . . . , xn , where xi ∈ V for i = 1 . . . (n − 1) and xn = STOP, the probability of
the sentence under the trigram language model is:
$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-2}, x_{i-1})$$
With x−1 , x0 as the special “start” symbol, ∗.
(This is just a second-order Markov process)
So then, how do we estimate the q(wi |wi−2 , wi−1 ) parameters?
We could use the maximum likelihood estimate:
$$q_{\text{ML}}(w_i \mid w_{i-2}, w_{i-1}) = \frac{\text{Count}(w_{i-2}, w_{i-1}, w_i)}{\text{Count}(w_{i-2}, w_{i-1})}$$
However, this still has the problem of assigning 0 probability to trigrams that were not encountered
in the training corpus.
There are also still many, many parameters to learn: if we have a vocabulary size N = |V |, then we
have N 3 parameters in the model.
Dealing with zeros
Zeroes occur if some n-gram occurs in the testing data which didn’t occur in the training set.
Say we had the following training set:
… denied the reports … denied the claims … denied the request
And the following test set:
… denied the offer
Here P (offer|denied the) = 0 since the model has not encountered that term.
We can get around this using Laplace smoothing, also known as add-one smoothing: simply pretend that we saw each word once more than we actually did (i.e. add one to all counts).
With add-one smoothing, our MLE becomes:
$$P_{\text{Add-1}}(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i) + 1}{\text{count}(w_{i-1}) + |V|}$$
Note that this smoothing can be very blunt and may drastically change your counts.
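A small sketch of estimating an add-one-smoothed bigram model from a toy corpus (the corpus and the treatment of the STOP symbol in |V| are illustrative choices):

```python
import math
from collections import Counter

corpus = [["the", "dog", "barks"], ["the", "cat", "meows"], ["the", "dog", "sleeps"]]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["*"] + sentence + ["STOP"]
    unigrams.update(tokens[:-1])                          # counts of contexts (including "*")
    bigrams.update(zip(tokens[:-1], tokens[1:]))

vocab_size = len(set(w for s in corpus for w in s)) + 1   # |V|, counting STOP as a possible outcome

def p_add1(word, prev):
    """Add-one smoothed bigram probability P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_log_prob(sentence):
    tokens = ["*"] + sentence + ["STOP"]
    return sum(math.log(p_add1(w, prev)) for prev, w in zip(tokens[:-1], tokens[1:]))
```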
Interpolation
Above we defined the trigram maximum-likelihood estimate. We can do the same for bigram and
unigram estimates:
$$q_{\text{ML}}(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})} \qquad q_{\text{ML}}(w_i) = \frac{\text{Count}(w_i)}{\text{Count}()}$$
These various estimates demonstrate the bias-variance trade-off - the trigram maximum-likelihood
converges to a better estimate but requires a lot more data to do so; the unigram maximum-likelihood
estimate converges to a worse estimate but does so with a lot less data.
With linear interpolation, we try to combine the strengths and weaknesses of each of these estimates:
q(wi |wi−2 , wi−1 ) = λ1 qML (wi |wi−2 , wi−1 ) + λ2 qML (wi |wi−1 ) + λ3 qML (wi )
Where λ1 + λ2 + λ3 = 1 and λi ≥ 0 for all i.
That is, we compute a weighted average of the estimates.
For a vocabulary V ′ = V ∪ {STOP}, $\sum_{w \in V'} q(w \mid u, v)$ defines a distribution, since it sums to 1.
How do we estimate the λ values?
We can take out some of our training data as validation data (say ~5%). We train the maximum-likelihood estimates on the training data, then we define c′(w1 , w2 , w3 ) as the count of a trigram in
the validation set.
Then we define:
$$L(\lambda_1, \lambda_2, \lambda_3) = \sum_{w_1, w_2, w_3} c'(w_1, w_2, w_3) \log q(w_3 \mid w_1, w_2)$$
And choose λ1 , λ2 , λ3 to maximize L (this ends up being the same as choosing λ1 , λ2 , λ3 to minimize
the perplexity).
In practice, however, the λ values are allowed to vary.
We define a function Π that partitions histories, e.g.
$$\Pi(w_{i-2}, w_{i-1}) = \begin{cases} 1 & \text{if } \text{Count}(w_{i-2}, w_{i-1}) = 0 \\ 2 & \text{if } 1 \leq \text{Count}(w_{i-2}, w_{i-1}) \leq 2 \\ 3 & \text{if } 3 \leq \text{Count}(w_{i-2}, w_{i-1}) \leq 5 \\ 4 & \text{otherwise} \end{cases}$$
These partitions are usually chosen by hand.
Then we vary the λ values based on the partition:
$$q(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1^{\Pi(w_{i-2}, w_{i-1})} q_{\text{ML}}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2^{\Pi(w_{i-2}, w_{i-1})} q_{\text{ML}}(w_i \mid w_{i-1}) + \lambda_3^{\Pi(w_{i-2}, w_{i-1})} q_{\text{ML}}(w_i)$$
Where $\lambda_1^{\Pi(w_{i-2}, w_{i-1})} + \lambda_2^{\Pi(w_{i-2}, w_{i-1})} + \lambda_3^{\Pi(w_{i-2}, w_{i-1})} = 1$ and each is ≥ 0.
Discounting methods
Generally, these maximum likelihood estimates can be high, so we can define “discounted” counts,
e.g. Count ∗ (x) = Count(x) − 0.5 (the value to discount by can be determined on a validation set,
like the λ values from before). As a result of these discounted counts, we will have some probability
mass left over, which is defined as:
$$\alpha(w_{i-1}) = 1 - \sum_{w} \frac{\text{Count}^*(w_{i-1}, w)}{\text{Count}(w_{i-1})}$$
We can assign this leftover probability mass to words we have not yet seen.
We can use a Katz Back-Off model. First we will consider the bigram model.
We define two sets:
A(wi−1 ) = {w : Count(wi−1 , w ) > 0}
B(wi−1 ) = {w : Count(wi−1 , w ) = 0}
Then the bigram model:
$$q_{\text{BO}}(w_i \mid w_{i-1}) = \begin{cases} \frac{\text{Count}^*(w_{i-1}, w_i)}{\text{Count}(w_{i-1})} & \text{if } w_i \in A(w_{i-1}) \\[2mm] \alpha(w_{i-1}) \frac{q_{\text{ML}}(w_i)}{\sum_{w \in B(w_{i-1})} q_{\text{ML}}(w)} & \text{if } w_i \in B(w_{i-1}) \end{cases}$$
Where
$$\alpha(w_{i-1}) = 1 - \sum_{w \in A(w_{i-1})} \frac{\text{Count}^*(w_{i-1}, w)}{\text{Count}(w_{i-1})}$$
Basically, this assigns the leftover probability mass to bigrams that were not previously encountered.
The Katz Back-Off model can be extended to trigrams as well:
A(wi−2 , wi−1 ) = {w : Count(wi−2 , wi−1 , w ) > 0}
B(wi−2 , wi−1 ) = {w : Count(wi−2 , wi−1 , w ) = 0}
$$q_{\text{BO}}(w_i \mid w_{i-2}, w_{i-1}) = \begin{cases} \frac{\text{Count}^*(w_{i-2}, w_{i-1}, w_i)}{\text{Count}(w_{i-2}, w_{i-1})} & \text{if } w_i \in A(w_{i-2}, w_{i-1}) \\[2mm] \alpha(w_{i-2}, w_{i-1}) \frac{q_{\text{BO}}(w_i \mid w_{i-1})}{\sum_{w \in B(w_{i-2}, w_{i-1})} q_{\text{BO}}(w \mid w_{i-1})} & \text{if } w_i \in B(w_{i-2}, w_{i-1}) \end{cases}$$
$$\alpha(w_{i-2}, w_{i-1}) = 1 - \sum_{w \in A(w_{i-2}, w_{i-1})} \frac{\text{Count}^*(w_{i-2}, w_{i-1}, w)}{\text{Count}(w_{i-2}, w_{i-1})}$$
17.6.4 Log-Linear Models
When it comes to language models, the trigram model may be insufficient. There may be more
information than just the previous two words that we want to take into account - for instance, the
author of a paper, whether or not a particular word occurs in an earlier context, the part of speech
of the preceding word, etc.
We may want to do something similar when it comes to tagging, e.g. condition on that a previous
word is a particular word, or that it has a particular ending (“ing”, “e”, etc), and so on.
We can use log-linear models to capture this extra information (encoded as numerical features,
e.g. 1 if the preceding word is “foo”, and 0 otherwise.).
With log-linear models, we frame the problem as such: We have some input domain X and a finite
label set Y . We want to produce a conditional probability p(y |x) for any x, y where x ∈ X, y ∈ Y .
For example, in language modeling, x would be a “history” of words, i.e. w1 , w2 , . . . , wi−1 and y is
an “outcome” wi (i.e. the predicted following word).
We represent our features as vectors (applying indicator functions and so on where necessary). We’ll
denote a feature vector for an input/output pair (x, y ) as f (x, y ).
We also have a parameter vector equal in length to our feature vectors (e.g. if we have m features,
then the parameter vector v ∈ Rm ).
We can compute a “score” for a pair (x, y ) as just the dot product of these two: v · f (x, y ), which we can turn into the desired conditional probability p(y |x):
$$p(y \mid x; v) = \frac{e^{v \cdot f(x, y)}}{\sum_{y' \in Y} e^{v \cdot f(x, y')}}$$
Read as “the probability of y given x under the parameters v ”.
This can be re-written as:
$$\log p(y \mid x; v) = v \cdot f(x, y) - \log \sum_{y' \in Y} e^{v \cdot f(x, y')}$$
This is why such models are called “log-linear”: the v · f (x, y ) term is the linear term and we calculate a log probability (and then there is the normalization term $\log \sum_{y' \in Y} e^{v \cdot f(x, y')}$).
So how do we estimate the parameters v ?
We assume we have training examples (x (i) , y (i) ) for i = 1, . . . , n and that each (x (i) , y (i) ) ∈ X × Y .
We can use maximum-likelihood estimates to estimate v , i.e.
$$v_{\text{ML}} = \operatorname*{argmax}_{v \in \mathbb{R}^m} L(v)$$
$$L(v) = \sum_{i=1}^{n} \log p(y^{(i)} \mid x^{(i)}; v) = \sum_{i=1}^{n} v \cdot f(x^{(i)}, y^{(i)}) - \sum_{i=1}^{n} \log \sum_{y' \in Y} e^{v \cdot f(x^{(i)}, y')}$$
i.e. L(v ) is the log-likelihood of the data under the parameters v , and it is concave so we can
optimize it fairly easily with gradient ascent.
We can add regularization to improve generalization.
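A minimal sketch of the core log-linear computation, p(y|x; v) as a normalized exponential of scores (the feature vectors and weights below are made-up numbers):

```python
import numpy as np

def log_linear_probs(v, feature_vectors):
    """p(y | x; v) for every candidate y, given f(x, y) stacked as rows of feature_vectors."""
    scores = feature_vectors @ v                       # v . f(x, y) for each candidate y
    scores -= scores.max()                             # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Toy example: 3 candidate outputs y, 4 binary features
f_xy = np.array([[1.0, 0.0, 1.0, 0.0],
                 [0.0, 1.0, 0.0, 1.0],
                 [1.0, 1.0, 0.0, 0.0]])
v = np.array([0.5, -0.2, 1.0, 0.3])
probs = log_linear_probs(v, f_xy)                      # sums to 1 over the candidates
```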
17.6.5
History-based models
The models that have been presented so far are called history-based models, in the following sense:
• We break structures down into a derivation (a sequence of decisions)
• Each decision has an associated conditional probability
• The probability of a structure is just the product of the decision probabilities that created it
• The parameter values are estimated using some variant of maximum-likelihood estimation
• We choose y such that it maximizes either a joint probability p(x, y ; θ) (e.g. in the case of HMMs or PCFGs) or a conditional probability p(y |x; θ) (in the case of log-linear models).
17.6.6
Global Linear Models
GLMs extend log-linear models, though they are different from history-based models (there are no “derivations” or probabilities for “decisions”).
In GLMs, we have feature vectors for entire structures, i.e. “global features”. This allows us to
incorporate features that are difficult to include in history-based models.
GLMs have three components:
• f (x, y ) ∈ Rd which maps a structure (x, y ) (e.g. a sentence and a parse tree) to a feature vector
• GEN which is a function that maps an input x to a set of candidates GEN(x). For example, it
could return the set of all possible English translations for a French sentence x.
• v ∈ Rd is a parameter vector; it is learned from training data
So the final output is a function F : X → Y , which ends up being:
$$F(x) = \operatorname*{argmax}_{y \in \text{GEN}(x)} f(x, y) \cdot v$$
17.6.7
Evaluating language models: perplexity
Perplexity is a measure of the quality of a language model.
Assume we have a set of m test sentences, s1 , s2 , . . . , sm .
We can compute the probability of these sentences under our learned model p:
$$\prod_{i=1}^{m} p(s_i)$$
Though typically we look at log probability instead:
Though typically we look at log probability instead:
$$\sum_{i=1}^{m} \log p(s_i)$$
The perplexity is computed:
$$\text{perplexity} = 2^{-l}, \qquad l = \frac{1}{M} \sum_{i=1}^{m} \log p(s_i)$$
Where M is the total number of words in the test data. Note that log is log2 .
Lower perplexity is better (because a high log probability is better, which causes perplexity to be
low).
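A small sketch of this perplexity computation, assuming some model exposes a function returning log2 p(sentence):

```python
def perplexity(sentences, log2_prob):
    """Perplexity of a test set; log2_prob(sentence) must return log2 p(sentence) under the model."""
    total_log2 = sum(log2_prob(s) for s in sentences)
    total_words = sum(len(s) for s in sentences)          # M = total number of words in the test data
    return 2 ** (-total_log2 / total_words)
```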
17.7 Parsing
The parsing problem takes some input sentence and outputs a parse tree which describes the syntactic
structure of the sentence.
The leaf nodes of the tree are the words themselves, which are each tagged with a part-of-speech.
Then these are grouped into phrases, such as noun phrases (NP) and verb phrases (VP), up to
sentences (S) (these are sometimes called constituents).
These parse trees can describe grammatical relationships such as subject-verb, verb-object, and so
on.
[Example of a parse tree, from Tjo3ya]
We can treat it as a supervised learning problem by using sentences annotated with parse trees (such
data is usually called a “treebank”).
17.7.1
Context-free grammars (CFGs)
A formalism for the parsing problem.
A context-free grammar is a four-tuple G = (N, Σ, R, S) where:
• N is a set of non-terminal symbols
• Σ is a set of terminal symbols
• R is a set of rules of the form X → Y1 Y2 . . . Yn for n ≥ 0, X ∈ N, Yi ∈ (N ∪ Σ)
• S ∈ N is a distinguished start symbol
An example CFG:
• N = {S, NP, VP, PP, DT, Vi, Vt, NN, IN}
• S = S
• Σ = {sleeps, saw, man, woman, telescope, the, with, in}
• R is the following set of rules:
– S → NP VP
– VP → Vi
– VP → Vt NP
– VP → VP PP
– NP → DT NN
– NP → NP PP
– PP → IN NP
– Vi → sleeps
– Vt → saw
– NN → man
– NN → woman
– NN → telescope
– DT → the
– IN → with
– IN → in
Note: S = sentence, VP = verb phrase, NP = noun phrase, PP = prepositional phrase, DT = determiner, Vi = intransitive verb, Vt = transitive verb, NN = noun, IN = preposition.
We can derive sentences from this grammar.
A left-most derivation is a sequence of strings s1 , . . . , sn where:
• s1 = S, the start symbol
• sn ∈ Σ∗ ; that is, sn consists only of terminal symbols
• each si for i = 2, . . . , n is derived from si−1 by picking the left-most non-terminal X in si−1
and replacing it with some β where X → β is a rule in R.
Using the example grammar, we could do:
1. “S”
2. expand “S” to “NP VP”
3. expand “NP” (since it is the left-most symbol) to “D N”, yielding “D N VP”
4. expand “D” (again, it is left-most) to “the”, yielding “the N VP”
5. expand “N” (since the left-most symbol “the” is a terminal symbol) to “man”, yielding “the man VP”
6. expand “VP” to “Vi” (since it is the last non-terminal symbol), yielding “the man Vi”
7. expand “Vi” to “sleeps”, yielding “the man sleeps”
8. the sentence consists only of terminal symbols, so we are done.
Thus a CFG defines a set of possible derivations, which can be infinite.
We say that a string s ∈ Σ∗ is in the language defined by the CFG if we can derive it from the CFG.
A string in a CFG may have multiple derivations - this property is called “ambiguity”.
For instance, “fruit flies like a banana” is ambiguous in that “fruit flies” may be a noun phrase or it
may be a noun and a verb.
17.7.2
Probabilistic Context-Free Grammars (PCFGs)
PCFGs are CFGs in which each rule is assigned a probability, which helps with the ambiguity problem.
We can compute the probability of a particular derivation as the product of the probability of its rules.
We notate the probability of a rule as q(α → β). Note that we have individual probability distributions for the left side of each rule, e.g. $\sum_{\beta} q(\text{VP} \to \beta) = 1$, $\sum_{\beta} q(\text{NP} \to \beta) = 1$, and so on. Another way of saying this is that these distributions are conditioned on the left side of the rule.
These probabilities can be learned from data as well, simply by counting all the rules in a treebank
and using maximum likelihood estimates:
$$q_{\text{ML}}(\alpha \to \beta) = \frac{\text{Count}(\alpha \to \beta)}{\text{Count}(\alpha)}$$
Given a PCFG, a sentence s, and a set of trees which yield s as T(s), we want to compute
argmaxt∈T(s) p(t). That is, given a sentence, what is the most likely parse tree to have produced
this sentence?
A challenge here is that |T(s)| may be very large, so brute-force search is not an option. We can use
the CKY algorithm instead.
First we will assume the CFG is in Chomsky normal form. A CFG is in Chomsky normal form if the
rules in R take one of two forms:
• X → Y1 Y2 for X, Y1 , Y2 ∈ N
• X → Y for X ∈ N, Y ∈ Σ
In practice, any PCFG can be converted to an equivalent PCFG in Chomsky normal form by combining
multiple symbols into single symbols (e.g. you can convert VP → Vt NP PP by defining a new symbol
Vt-NP → Vt NP and then redefining VP → Vt-NP PP).
First, let’s consider the problem maxt∈T(s) p(t).
Notation:
• n = number of words in the sentence
CHAPTER 17. NLP
497
17.7. PARSING
498
• wi = the i th word in the sentence
We define a dynamic programming table π[i, j, X] which is the maximum probability of a constituent
with non-terminal X spanning the words i , . . . , j inclusive. We set i, j ∈ 1, . . . , n and i ≤ j.
We want to calculate maxt∈T(s)p(t) = π[1, n, S], i.e. the max probability for a parse tree spanning
the first through the last word of the sentence with the S symbol.
We will use a recursive definition of π.
The base case is: for all i = 1, . . . , n for X ∈ N, π[i, i , X] = q(X → wi ). If X → wi is not in the
grammar, then q(X → wi ) = 0.
The recursive definition is: for all i = 1, . . . , (n − 1) and j = (i + 1), . . . , n and X ∈ N:
$$\pi(i, j, X) = \max_{X \to YZ \in R,\; s \in \{i, \ldots, j-1\}} q(X \to YZ)\, \pi(i, s, Y)\, \pi(s+1, j, Z)$$
s is called the “split point” because it determines where the word sequence from i to j (inclusive) is
split.
The full CKY algorithm:
Initialization: for all i ∈ {1, . . . , n}, for all X ∈ N:
$$\pi(i, i, X) = \begin{cases} q(X \to x_i) & \text{if } X \to x_i \in R \\ 0 & \text{otherwise} \end{cases}$$
Then:
• For l = 1, . . . , (n − 1)
– For i = 1, . . . , (n − l)
* Set j = i + l
* For all X ∈ N, calculate:
$$\pi(i, j, X) = \max_{X \to YZ \in R,\; s \in \{i, \ldots, j-1\}} q(X \to YZ)\, \pi(i, s, Y)\, \pi(s+1, j, Z)$$
$$bp(i, j, X) = \operatorname*{argmax}_{X \to YZ \in R,\; s \in \{i, \ldots, j-1\}} q(X \to YZ)\, \pi(i, s, Y)\, \pi(s+1, j, Z)$$
This has runtime O(n³|N|³): the l and i loops each run at most n times, giving us n²; the inner-most loop (for all X ∈ N) runs |N| times; and X → Y Z ∈ R has |N|² combinations to search through (|N| choices for Y and |N| choices for Z). There are also up to n choices to search through for the split point s.
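A compact sketch of this CKY recursion, working in log space; the toy grammar, its probabilities, and all names are made up for illustration:

```python
import math
from collections import defaultdict

def cky(words, unary_rules, binary_rules, start="S"):
    """CKY parsing under a PCFG in Chomsky normal form.

    unary_rules:  dict mapping (X, word) -> q(X -> word)
    binary_rules: dict mapping (X, Y, Z) -> q(X -> Y Z)
    Returns the max log probability of a parse rooted at `start`, plus backpointers.
    """
    n = len(words)
    pi = defaultdict(lambda: float("-inf"))   # pi[(i, j, X)] = best log prob of X spanning i..j
    bp = {}                                   # backpointers (best rule split)

    for i, w in enumerate(words):             # base case: X -> w_i
        for (X, word), q in unary_rules.items():
            if word == w:
                pi[(i, i, X)] = math.log(q)

    for l in range(1, n):                     # span length
        for i in range(n - l):
            j = i + l
            for (X, Y, Z), q in binary_rules.items():
                for s in range(i, j):         # split point
                    score = math.log(q) + pi[(i, s, Y)] + pi[(s + 1, j, Z)]
                    if score > pi[(i, j, X)]:
                        pi[(i, j, X)] = score
                        bp[(i, j, X)] = (Y, Z, s)
    return pi[(0, n - 1, start)], bp

# Tiny illustrative grammar (the probabilities are made up)
unary = {("DT", "the"): 1.0, ("NN", "man"): 0.5, ("NN", "dog"): 0.5, ("Vt", "saw"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("NP", "DT", "NN"): 1.0, ("VP", "Vt", "NP"): 1.0}
log_prob, backpointers = cky("the man saw the dog".split(), unary, binary)
```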
Weaknesses of PCFGs
PCFGs (as described above) don’t perform very well; they have two main shortcomings:
• Lack of sensitivity to lexical information
– that is, attachment is completely independent of the words themselves
• Lack of sensitivity to structural frequencies
– for example, with the phrase “president of a company in Africa”, “in Africa” can be
attached to either “president” or “company”. If we were to parse this phrase, we might
come up with two trees described by exactly the same rule sets, the only difference is
where the PP “in Africa” is attached to. Since they are exactly the same rule sets, they
have the same probability, so the PCFG can’t distinguish the two. However, statistically,
the “close attachment” structure (i.e. generally the PP would attach to the closer object,
in this case, “company”) is more frequent, so it should be preferred.
Lexicalized PCFGs
Lexicalized PCFGs deal with the above weaknesses.
For a non-terminal rule, we specify one of its children as the “head” of the rule, which is essentially the most “important” part of the rule (e.g. for the rule VP → Vt NP, the verb Vt is the most important semantic part and thus the head).
We define another set of rules which identifies the heads of our grammar’s rules, e.g. “If the rule
contains NN, NNS, or NNP, choose the rightmost NN, NNS, or NNP as the head”.
Now when we construct the tree, we annotate each node with its headword (that is, the word that
is in the place of the head of a rule).
For instance, say we have the following tree:
VP
├── Vt
│   └── questioned
└── NP
    ├── DT
    │   └── the
    └── NN
        └── witness
We annotate each node with its headword:
VP(questioned)
├── Vt(questioned)
│   └── questioned
└── NP(witness)
    ├── DT(the)
    │   └── the
    └── NN(witness)
        └── witness
We can revise our Chomsky Normal Form for lexicalized PCFGs by defining the rules in R to have
one of the following three forms:
• X(h) →1 Y1 (h)Y2 (w ) for X, Y1 , Y2 ∈ N and h, w ∈ Σ
• X(h) →2 Y1 (w )Y2 (h) for X, Y1 , Y2 ∈ N and h, w ∈ Σ
• X(h) → h for X ∈ N, h ∈ Σ
Note the subscripts on →1 , →2 which indicate which of the children is the head.
Parsing lexicalized PCFGs
That is, we consider rules with words, e.g. NN(dog) is a different rule than NN(cat). By doing so,
we increase the number of possible rules to O(|Σ|2 |N|3 ), which is a lot.
However, given a sentence w1 , w2 , . . . , wn , at most O(n²|N|³) rules are applicable, because we can disregard any rule that does not contain one of w1 , w2 , . . . , wn ; this makes parsing lexicalized PCFGs a bit easier (it can be done in O(n⁵|N|³) time rather than the O(n³|Σ|²|N|³) time we would get if we considered all possible rules).
Parameter estimation in lexicalized PCFGs
In a lexicalized PCFGs, our parameters take the form:
q(S(saw) →2 NP(man)VP(saw))
We decompose this parameter into a product of two parameters:
q(S →2 NP VP|S, saw)q(man|S →2 NP VP, saw)
The first term describes: given S(saw), what is the probability that it expands →2 NP VP?
The second term describes: given the rule S →2 NP VP and the headword saw, what is the probability
that man is the headword of NP?
Then we use smoothed estimation for the two parameter estimates (we’re using linear interpolation):
q(S →2 NP VP|S, saw) = λ1 qML (S →2 NP VP|S, saw) + λ2 qML (S →2 NP VP|S)
Again, λ1 , λ2 ≥ 0, λ1 + λ2 = 1.
To clarify:
$$q_{\text{ML}}(\text{S} \to_2 \text{NP VP} \mid \text{S}, \text{saw}) = \frac{\text{Count}(\text{S(saw)} \to_2 \text{NP VP})}{\text{Count}(\text{S(saw)})}$$
$$q_{\text{ML}}(\text{S} \to_2 \text{NP VP} \mid \text{S}) = \frac{\text{Count}(\text{S} \to_2 \text{NP VP})}{\text{Count}(\text{S})}$$
Here is the linear interpolation for the second parameter:
q(man|S →2 NP VP, saw) = λ3 qML (man|S →2 NP VP, saw)+λ4 qML (man|S →2 NP VP)+λ5 qML (man|NP)
Again, λ3 , λ4 , λ5 ≥ 0, λ3 + λ4 + λ5 = 1.
To clarify, qML (man|NP) describes: given NP, what is the probability that its headword is man?
This presentation of PCFGs does not deal with the close attachment issue as described earlier, though there are modified forms which do.
17.8 Text Classification
The general text classification problem is: given an input document d and a fixed set of classes C = {c1 , c2 , . . . , cj }, output a predicted class c ∈ C.
17.8.1
Naive Bayes
This supervised approach to classification is based on Bayes’ rule. It relies on a very simple representation of the document called “bag of words”, which is ignorant of the sequence or order of word
occurrence (and other things), and only pays attention to their counts/frequency.
So you can represent the problem with Bayes’ rule:
$$P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}$$
And the particular problem at hand is finding the class which maximizes P (c|d), that is:
CMAP = argmaxc∈C P (c|d) = argmaxc∈C P (d|c)P (c)
Where CMAP is the maximum a posteriori class.
Using our bag of words assumption, we represent a document as features x1 , . . . xn without concern
for their order:
CMAP = argmaxc∈C P (x1 , x2 , . . . , xn |c)P (c)
We additionally assume conditional independence, i.e. that the presence of one word doesn’t have
any impact on the probability of any other word’s occurrence:
P (x1 , x2 , . . . , xn |c) = P (x1 |c) · P (x2 |c) · · · · · P (xn |c)
And thus we have the multinomial naive bayes classifier:
$$C_{\text{NB}} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{x \in X} P(x \mid c_j)$$
To calculate the prior probabilities, we use the maximum likelihood estimates approach:
$$P(c_j) = \frac{\text{doccount}(C = c_j)}{N_{\text{doc}}}$$
That is, the prior probability for a given class is the count of documents in that class over the total
number of documents.
Then, for words:
$$P(w_i \mid c_j) = \frac{\text{count}(w_i, c_j)}{\sum_{w \in V} \text{count}(w, c_j)}$$
That is, the count of a word in documents of a given class, over the total count of words in that
class.
To get around the problem of zero probabilities (for words encountered in test input but not in
training, which would cause a probability of a class to be zero since the probability of a class is the
joint probability of the words encountered), you can use Laplace smoothing (see above):
$$P(w_i \mid c_j) = \frac{\text{count}(w_i, c_j) + 1}{\left(\sum_{w \in V} \text{count}(w, c_j)\right) + |V|}$$
Note that to avoid underflow (from multiplying lots of small probabilities), you may want to work
with log probabilities (see above).
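A minimal sketch of training and applying this classifier in log space with add-one smoothing (the toy documents and the choice to skip unseen words at prediction time are illustrative):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Multinomial Naive Bayes with add-one smoothing; docs are lists of tokens."""
    vocab = set(w for d in docs for w in d)
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    priors = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    likelihoods = {
        c: {w: math.log((word_counts[c][w] + 1) / (sum(word_counts[c].values()) + len(vocab)))
            for w in vocab}
        for c in class_counts
    }
    return priors, likelihoods, vocab

def predict_nb(doc, priors, likelihoods, vocab):
    scores = {c: priors[c] + sum(likelihoods[c][w] for w in doc if w in vocab) for c in priors}
    return max(scores, key=scores.get)

docs = [["good", "great", "fun"], ["bad", "boring"], ["great", "plot"], ["bad", "acting"]]
labels = ["pos", "neg", "pos", "neg"]
priors, likelihoods, vocab = train_nb(docs, labels)
label = predict_nb(["fun", "plot"], priors, likelihoods, vocab)
```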
In practice, even with all these assumptions, Naive Bayes can be quite good:
• Very fast, low storage requirements
• Robust to irrelevant features (they tend to cancel each other out)
• Very good in domains with many equally important features
• Optimal if independence assumptions hold
• A good, dependable baseline for text classification
17.8.2
Evaluating text classification
The possible outcomes are:
• true positive: correctly identifying something as true
• false positive: incorrectly identifying something as true
• true negative: correctly identifying something as false
• false negative: incorrectly identifying something as false
The accuracy of classification is calculated as:
$$\text{accuracy} = \frac{tp + tn}{tp + fp + tn + fn}$$
Though as a metric it isn’t very useful if you are dealing with situations where the correct class is
sparse and most words you encounter are not in the correct class:
Say you’re looking for a word that only occurs 0.01% of the time. You have a classifier you run on 100,000 docs and the word appears in 10 docs (so 10 docs are correct, 99,990 are not correct). But you could have that classifier classify all docs as not correct and get an amazing accuracy of 99,990/100,000 = 99.99%, even though the classifier didn’t actually do anything!
So other metrics are needed.
Precision measures the percent of selected items that are correct:
$$\text{precision} = \frac{tp}{tp + fp}$$
Recall measures the percent of correct items that are selected:
$$\text{recall} = \frac{tp}{tp + fn}$$
Typically, there is a trade off between recall and precision - the improvement of one comes at the
sacrifice of the other.
The F measure combines both precision and recall into a single metric:
$$F = \frac{(\beta^2 + 1) P R}{\beta^2 P + R} = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$$
Where α is a weighting value so you can assign more importance to either precision or recall.
People usually use the balanced F1 measure, where β = 1 (that is, α = 1/2):
$$F = \frac{2 P R}{P + R}$$
17.9 Tagging
A class of NLP problems in which we want to assign a tag to each word in an input sentence.
• Part-of-speech tagging: Given an input sentence, output a POS tag for each word. Like in
many NLP problems, ambiguity makes this a difficult task.
• Named entity recognition: Given an input sentence, identify the named entities in the
sentence (e.g. a company, or location, or person, etc) and what type the entity is (other words
are tagged as non-entities). Entities can span multiple words, so there will often be “start” and
“continue” tags (e.g. for “Wall Street”, “Wall” is tagged as “start company”, and “Street” is
tagged as “continue company”).
There are two types of constraints in tagging problems:
• local: words with multiple meanings can have a bias (a “local preference”) towards one meaning
(i.e. one meaning is more likely than the others)
• contextual: certain meanings of a word are more likely in certain contexts
These constraints can sometimes conflict.
17.9.1
Generative models
One approach to tagging problems (and supervised learning in general) is to use a conditional model
(often called a discriminative model), i.e. to learn the distribution p(y |x) and select argmaxy p(y |x)
as the label.
Alternatively, we can use a generative model which instead learns the distribution p(x, y ). We often
have p(x, y ) = p(y )p(x|y ), where p(y ) is the prior and p(x|y ) is the conditional generative model.
This is generative because we can use this to generate new sentences by sampling the distribution
given the words we have so far.
We can apply Bayes’ Rule as well to derive the conditional distribution as well:
504
CHAPTER 17. NLP
505
17.9. TAGGING
$$p(y \mid x) = \frac{p(y)\, p(x \mid y)}{p(x)}$$
Where $p(x) = \sum_y p(y)\, p(x \mid y)$.
Again, we can select argmaxy p(y |x) as the label, but we can apply Bayes’ Rule to equivalently get $\operatorname{argmax}_y \frac{p(y)\, p(x \mid y)}{p(x)}$. But note that p(x) does not vary with y (i.e. it is constant), so it does not affect the argmax, and we can just drop it to get $\operatorname{argmax}_y p(y)\, p(x \mid y)$.
17.9.2
Hidden Markov Models (HMM)
An example of a generative model.
We have an input sentence x = x1 , x2 , . . . , xn where xi is the i th word in the sentence.
We also have a tag sequence y = y1 , y2 , . . . , yn where yi is the tag for the i th word in the sentence.
We can use a HMM to define the joint distribution p(x1 , x2 , . . . , xn , y1 , y2 , . . . , yn ).
Then the most likely tag sequence for x is argmaxy1 ,...,yn p(x1 , x2 , . . . , xn , y1 , y2 , . . . , yn ).
Trigram HMMs
For any sentence x1 , . . . , xn where xi ∈ V for i = 1, . . . , n and any tag sequence y1 , . . . , yn+1 where
yi ∈ S for i = 1, . . . , n and yn+1 = STOP (where S is the set of possible tags, e.g. DT, NN, VB, P,
ADV, etc), the joint probability of the sentence and tag sequence is:
$$p(x_1, \ldots, x_n, y_1, \ldots, y_{n+1}) = \prod_{i=1}^{n+1} q(y_i \mid y_{i-2}, y_{i-1}) \prod_{i=1}^{n} e(x_i \mid y_i)$$
Again we assume that y0 = y−1 = ∗.
The parameters for this model are:
• q(s|u, v ) for any s ∈ S ∪ {STOP}, u, v ∈ S ∪ {∗}
• e(x|s) for any s ∈ S, x ∈ V , sometimes called “emission parameters”
The first product is the (second-order) Markov chain, quite similar to the trigram Markov chain
used before for language modeling, and the e(xi |yi ) terms of the second product are what we have
observed. Combined, these produce a hidden Markov model (the Markov chain is “hidden”, since we
don’t observe the tag sequences, we only observe the xi s).
Parameter estimation in HMMs
For the q(yi |yi−2 , yi−1 ) parameters, we can again use a linear interpolation with maximum likelihood
estimates approach as before with the trigram language model.
For the emission parameters, we can also use a maximum likelihood estimate:
$$e(x \mid y) = \frac{\text{Count}(y, x)}{\text{Count}(y)}$$
However, we again have the issue that e(x|y ) = 0 for all y if we have never seen x in the training
data. This will cause the entire joint probability p(x1 , . . . , xn , y1 , . . . , yn+1 ) to become 0.
How do we deal with low-frequency words then?
We can split the vocabulary into two sets:
• frequent words: occurring ≥ t times in the training data, where t is some threshold (e.g.
t = 5)
• low-frequency words: all other words, including those not seen in the training data
Then map low-frequency words into a small, finite set depending on textual features, such as prefixes,
suffixes, etc. For example, we may map all all-caps words (e.g. IBM, MTA, etc) to a word class “allCaps”, and we may map all four-digit numbers (e.g. 1988, 2010, etc) to a word class “fourDigitNum”,
or all first words of sentences to a word class “firstWord”, and so on.
17.9.3
The Viterbi algorithm
We want to compute argmaxy1 ,...,yn p(x1 , x2 , . . . , xn , y1 , y2 , . . . , yn ), but we don’t want to do so via
brute-force search. The search space is far too large, growing exponentially with n (the search space’s
size is |S|n ).
A more efficient way of computing this is to use the Viterbi algorithm:
Define Sk for k = −1, . . . , n to be the set of possible tags at position k:
S−1 = S0 = {∗}
Sk = S for all k ∈ {1, . . . , n}
Then we define:
$$r(y_{-1}, y_0, y_1, \ldots, y_k) = \prod_{i=1}^{k} q(y_i \mid y_{i-2}, y_{i-1}) \prod_{i=1}^{k} e(x_i \mid y_i)$$
This computes the probability from our HMM for a given sequence of tags, y−1 , y0 , y1 , . . . , yk , but
only up to the kth position.
We define a dynamic programming table: π(k, u, v ) as the maximum probability of a tag sequence
ending in tags u, v at position k, i.e:
$$\pi(k, u, v) = \max_{(y_{-1}, y_0, y_1, \ldots, y_k) : y_{k-1} = u,\, y_k = v} r(y_{-1}, y_0, y_1, \ldots, y_k)$$
To clarify: k ∈ {1, . . . , n}, u ∈ Sk−1 , v ∈ Sk .
For example: say we have the sentence “The man saw the dog with the telescope”, which we re-write
as “START START The Man saw the dog with the telescope”. We’ll set Sk = {D, N, V, P } for
k ≥ 1 and S−1 = S0 = {∗}.
If we want to compute π(7, P, D), then k = 7, so we fix the 7th term with the D tag and the (k − 1)th term with the P tag. Then we consider all possible tag sequences (ending with P, D) up to the 7th term (e.g. ∗, D, N, V, P, P, P, D and so on) and take the probability of the most likely sequence.
We can re-define the above recursively.
The base case is π(0, ∗, ∗) = 1 since we always have the two START tokens tagged as ∗ at the
beginning.
Then, for any k ∈ {1, . . . , n} for any u ∈ Sk−1 and v ∈ Sk :
$$\pi(k, u, v) = \max_{w \in S_{k-2}} \left( \pi(k-1, w, u)\, q(v \mid w, u)\, e(x_k \mid v) \right)$$
The Viterbi algorithm is just the application of this recursive definition while keeping backpointers
to the tag sequences with max probability:
• For k = 1, . . . , n
– For u ∈ Sk−1 , v ∈ Sk
* π(k, u, v ) = maxw ∈Sk−2 (π(k − 1, w , u)q(v |w , u)e(xk |v ))
* bp(k, u, v ) = argmaxw ∈Sk−2 (π(k − 1, w , u)q(v |w , u)e(xk |v ))
• Set (yn−1 , yn ) = argmax(u,v ) (π(n, u, v )q(STOP|u, v ))
• For k = (n − 2), . . . , 1, yk = bp(k + 2, yk+1 , yk+2 )
• Return the tag sequence y1 , . . . , yn
It has runtime O(n|S|³): the outer loop over k runs n times, the inner loops over u ∈ Sk−1 and v ∈ Sk give |S|² combinations, and for each of these the max searches over |S| possible values of w.
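A sketch of this Viterbi recursion in log space, assuming the q and e parameters are supplied as plain dictionaries of probabilities (all names are illustrative):

```python
import math

def viterbi(words, tags, q, e):
    """Trigram-HMM Viterbi decoding.

    q[(w, u, v)] = q(v | w, u) and e[(tag, word)] = e(word | tag), as plain probability dicts.
    Returns the most likely tag sequence for `words`.
    """
    n = len(words)
    S = {-1: {"*"}, 0: {"*"}}
    for k in range(1, n + 1):
        S[k] = set(tags)

    pi = {(0, "*", "*"): 0.0}                           # log probabilities
    bp = {}
    for k in range(1, n + 1):
        for u in S[k - 1]:
            for v in S[k]:
                best_score, best_w = float("-inf"), None
                for w in S[k - 2]:
                    prev = pi.get((k - 1, w, u), float("-inf"))
                    qp, ep = q.get((w, u, v), 0.0), e.get((v, words[k - 1]), 0.0)
                    if qp > 0 and ep > 0 and prev > float("-inf"):
                        score = prev + math.log(qp) + math.log(ep)
                        if score > best_score:
                            best_score, best_w = score, w
                pi[(k, u, v)], bp[(k, u, v)] = best_score, best_w

    # Pick the best final tag pair, including the transition to STOP
    best, (u_n1, v_n) = float("-inf"), (None, None)
    for u in S[n - 1]:
        for v in S[n]:
            stop = q.get((u, v, "STOP"), 0.0)
            if stop > 0 and pi[(n, u, v)] > float("-inf"):
                score = pi[(n, u, v)] + math.log(stop)
                if score > best:
                    best, (u_n1, v_n) = score, (u, v)

    y = [None] * (n + 1)
    y[n], y[n - 1] = v_n, u_n1
    for k in range(n - 2, 0, -1):                       # follow backpointers
        y[k] = bp[(k + 2, y[k + 1], y[k + 2])]
    return y[1:]
```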
17.10
Named Entity Recognition (NER)
Named entity recognition is the extraction of entities - people, places, organizations, etc - from a
text.
Many systems use a combination of statistical techniques, linguistic parsing, and gazetteers to maximize detection recall & precision. Distant supervision and unsupervised techniques can also help with
training, limiting the amount of gold-standard data necessary to build a statistical model.
Boundary errors are common in NER:
First Bank of Chicago announced earnings…
Here, the extractor extracted “Bank of Chicago” when the correct entity is the “First Bank of
Chicago”.
A general NER approach is to use supervised learning:
1. Collect a set of training documents
2. Label each entity with its entity type or O for “other”.
3. Design feature extractors
4. Train a sequence classifier to predict the labels from the data.
17.11
Relation Extraction
International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…
From such a text you could extract the following relation triples:
Founder-year(IBM,1911)
Founding-location(IBM,New York)
These relations may be represented as resource description framework (RDF) triples in the form of
subject predicate object.
Golden Gate Park location San Francisco
17.11.1
Ontological Relations
• IS-A describes a subsumption between classes, also called hypernymy:
Giraffe IS-A ruminant IS-A ungulate IS-A mammal IS-A vertebrate IS-A animal…
• instance-of relation between individual and class
San Francisco instance-of city
There may be many domain-specific ontological relations as well, such as founded (between a PERSON
and an ORGANIZATION), cures (between a DRUG and a DISEASE), etc.
17.11.2
Methods
Relation extractors can be built using:
• handwritten patterns
• supervised machine learning
• semi-supervised and unsupervised
– bootstrapping (using seeds)
– distant supervision
– unsupervised learning from the web
Handwritten patterns
• Advantages:
– can take advantage of domain expertise
– human patterns tend to be high-precision
• Disadvantages:
– human patterns are often low-recall
– hard to capture all possible patterns
Supervised
• Advantages:
– can get high accuracy if…
* there’s enough hand-labeled training data
* if the test is similar enough to training
• Disadvantages:
– labeling a large training set is expensive
– don’t generalize well
You could use classifiers: find all pairs of named entities, then use a classifier to determine if the two
are related or not.
Unsupervised
If you have no training set and either only a few seed tuples or a few high-precision patterns, you can
bootstrap and use the seeds to accumulate more data.
The general approach is:
1. Gather a set of seed pairs that have a relation R
2. Iterate:
1. Find sentences with these pairs
2. Look at the context between or around the pair
3. Generalize the context to create patterns
4. Use these patterns to find more pairs
For example, say we have the seed tuple < Mark Twain, Elmira >. We could use Google or some
other set of documents to search based on this tuple. We might find:
• “Mark Twain is buried in Elmira, NY”
• “The grave of Mark Twain is in Elmira”
• “Elmira is Mark Twain’s final resting place”
which gives us the patterns:
• “X is buried in Y”
• “The grave of X is in Y”
• “Y is X’s final resting place”
Then we can use these patterns to search and find more tuples, then use those tuples to find more
patterns, etc.
Two algorithms for this bootstrapping are the Dipre algorithm and the Snowball algorithm; Snowball is a version of Dipre which requires the strings to be named entities rather than arbitrary strings.
Another semi-supervised algorithm is distant supervision, which mixes bootstrapping and supervised
learning. Instead of a few seeds, you use a large database to extract a large number of seed examples
and go from there:
1. For each relation R
2. For each tuple in a big database
3. Find sentences in a large corpus with both entities of the tuple
4. Extract frequent contextual features/patterns
5. Train a supervised classifier using the extracted patterns
17.12
Sentiment Analysis
In general, sentiment analysis involves trying to figure out if a sentence/doc/etc is positive/favorable
or negative/unfavorable; i.e. detecting attitudes in a text.
The attitude may be
• a simple weighted polarity (positive, negative, neutral), which is more common
• from a set of types (like, love, hate, value, desire, etc)
When using multinomial Naive Bayes for sentiment analysis, it’s often better to use binarized multinomial Naive Bayes under the assumption that word occurrence matters more than word frequency:
seeing “fantastic” five times may not tell us much more than seeing it once. So in this version, you
would cap word frequencies at one.
An alternate approach is to use log(freq(w )) instead of 1 for the count.
However, sometimes raw word counts don’t work well either. In the case of IMDB ratings, the word
“bad” appears in more 10-star reviews than it does in 2-star reviews!
Instead, you’d calculate the likelihood of that word occurring in an n-star review:
$$P(w \mid c) = \frac{f(w, c)}{\sum_{w \in C} f(w, c)}$$
And then you’d use the scaled likelihood to make these likelihoods comparable between words:
$$\frac{P(w \mid c)}{P(w)}$$
17.12.1
Sentiment Lexicons
Certain words have specific sentiment; there are a variety of sentiment lexicons which specify those
relationships.
17.12.2
Challenges
Negation
“I didn’t like this movie” vs “I really like this movie.”
One way to handle negation is to prefix every word following a negation word with NOT_, e.g. “I
didn’t NOT_like NOT_this NOT_movie”.
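A tiny sketch of that NOT_ prefixing (the negation word list and the choice to end the negation scope at punctuation are illustrative assumptions):

```python
import re

NEGATIONS = {"not", "no", "never", "didn't", "don't", "isn't"}   # illustrative, not exhaustive
CLAUSE_END = re.compile(r"[.,;!?]$")

def mark_negation(tokens):
    """Prefix tokens following a negation word with NOT_ until the next punctuation mark."""
    out, negating = [], False
    for tok in tokens:
        out.append("NOT_" + tok if negating and tok.lower() not in NEGATIONS else tok)
        if tok.lower() in NEGATIONS:
            negating = True
        elif CLAUSE_END.search(tok):
            negating = False
    return out

tokens = mark_negation("I didn't like this movie".split())
# ['I', "didn't", 'NOT_like', 'NOT_this', 'NOT_movie']
```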
“Thwarted Expectations” problem
For example, a film review which talks about how great a film should be, but fails to live up to those
expectations:
This film should be brilliant. It sounds like a great plot, the actors are first grade, and the
supporting cast is good as well, and Stallone is attempting to deliver a good performance.
However, it can’t hold up.
17.13 Summarization
Generally, summarization is about producing an abridged version of a text with little or no loss of important information.
There are a few ways to categorize summarization problems.
• Single-document vs multi-document summarization: summarizing a single document, yielding
an abstract or outline or headline, or producing a gist of the content of multiple documents?
• Generic vs query-focused summarization: give a general summary of the document, or a summary tailored to a particular user query?
• Extractive vs abstractive: create a summary from sentences pulled from the document, or
generate new text for the summary?
Here, extractive summarization will be the focus (abstractive summarization is really hard).
The baseline used in summarization, which often works surprisingly well, is just to take the first
sentence of a document.
17.13.1
The general approach
Summarization usually uses this process:
1. Content Selection: choose what sentences to use from the document.
• You may weight salient words based on tf-idf, their presence in the query (if there is one), or based on topic signature.
– For the latter, you can use the log-likelihood ratio (LLR):

$$weight(w_i) = \begin{cases} 1 & \text{if } -2 \log \lambda(w_i) > 10 \\ 0 & \text{otherwise} \end{cases}$$
• Weight a sentence (or a part of a sentence, i.e. a window) by the average weight of its words (see the sketch after this list):

$$weight(s) = \frac{1}{|S|} \sum_{w \in S} weight(w)$$
• You can combine LLR with maximal marginal relevance (MMR), which is a greedy algorithm which selects sentences by their similarity to the query and by their dissimilarity
(novelty) to already-selected sentences to avoid redundancy.
2. Information Ordering: choose the order for the sentences in the summary.
• If you are summarizing documents with some chronological order to them (for example, a set of news articles), then it makes sense to order sentences chronologically.
• You can also use topical ordering, and order sentences by the order of topics in the source
documents.
• You can also use coherence:
– Choose orderings that make neighboring sentences (cosine) similar.
– Choose orderings in which neighboring sentences discuss the same entity.
3. Sentence Realization: clean up the sentences so that the summary is coherent or remove
unnecessary content. You could remove:
• appositives: “Rajam[, an artist living in Philadelphia], found inspiration in the back of city
magazines.”
• attribution clauses: “Sources said Wednesday”
• initial adverbials: “For example”, “At this point”
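Here is the sentence-weighting sketch referred to above. The word_weight dictionary is a stand-in for whatever per-word weights (tf-idf, LLR topic-signature weights, query overlap) were computed during content selection:

```python
def sentence_weight(sentence, word_weight):
    """Average the weights of a sentence's words (the weight(s) formula above)."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    return sum(word_weight.get(w, 0.0) for w in words) / len(words)

def select_sentences(sentences, word_weight, n=3):
    """Greedily pick the n highest-weighted sentences for the summary."""
    return sorted(sentences, key=lambda s: sentence_weight(s, word_weight), reverse=True)[:n]

# word_weight might be, e.g., {w: 1.0 for w in topic_signature_words}
```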
17.14 Machine Translation

17.14.1 Challenges in machine translation
• lexical ambiguity (e.g. “bank” as financial institution, or as in a “river bank”)
• differing word orders (e.g. English is subject-verb-object and Japanese is subject-object-verb)
• syntactic structure can vary across languages (e.g. “The bottle floated into the cave” when
translated into Spanish has the literal meaning “the bottle entered the cave floating”; the verb
“floated” becomes an adverb “floating” modifying “entered”)
• syntactic ambiguity (e.g. “John hit the dog with the stick” can have two different translations
depending on whether “with the stick” attaches to “John” or to “hit the dog”)
• pronoun resolution (e.g. “The computer outputs the data; it is stored in ASCII” - what is
“it” referring to?)
17.14.2
Classical machine translation methods
Early machine translation methods used direct machine translation, which involved translating word-by-word by using a set of rules for translating particular words. Once the words are translated, reordering rules are applied.
But such rule-based systems quickly become unwieldy and fail to encompass the variety of ways words
can be used in languages.
There are also transfer-based approaches, which have three phases:
1. Analysis: analyze the source language sentence (e.g. a syntactic analysis to generate a parse
tree)
2. Transfer: convert the source-language parse tree to a target-language parse tree based on a set
of rules
3. Generation: convert the target-language parse tree to an output sentence
Another approach is interlingua-based translation, which involves two phases:
1. Analysis: analyze the source language sentence into a language-independent representation of
its meaning
2. Generation: convert the meaning representation into an output sentence
17.14.3
Statistical machine translation methods
If we have parallel corpora (parallel meaning that they "line up") for the source and target languages, we can use these as training sets for translation (that is, use a supervised learning approach rather than a rule-based one).
The Noisy Channel Model
The noisy channel model has two components:
• p(e), the language model (trained from just the target corpus, could be, for example, a trigram
model)
• p(f |e), the translation model
Where e is a target language sentence (e.g. English) and f is a source language sentence (e.g. French).
We want to generate a model p(e|f ) which estimates the conditional probability of a target sentence
e given the source sentence f .
So we have the following, using Bayes’ Rule:
$$p(e|f) = \frac{p(e,f)}{p(f)} = \frac{p(e)\,p(f|e)}{\sum_e p(e)\,p(f|e)}$$

$$\arg\max_e p(e|f) = \arg\max_e p(e)\,p(f|e)$$
IBM translation models
IBM Model 1
We want to model p(f |e), where e is the source language sentence with l words, and f is the target
language sentence with m words.
We say that an alignment a identifies which source word each target word originated from; that is,
a = {a1 , . . . , am } where each aj ∈ {0, . . . , l}, and if aj = 0 then it does not align to any word.
There are $(l+1)^m$ possible alignments.
Then we define models for p(a|e, m) (the distribution of possible alignments) and p(f |a, e, m), giving:
$$p(f, a|e, m) = p(a|e, m)\, p(f|a, e, m)$$

$$p(f|e, m) = \sum_{a \in A} p(a|e, m)\, p(f|a, e, m)$$
Where A is the set of all possible alignments.
We can also use the model p(f , a|e, m) to get the distribution of alignments given two sentences:
$$p(a|f, e, m) = \frac{p(f, a|e, m)}{\sum_{a \in A} p(f, a|e, m)}$$

Which we can then use to compute the most likely alignment for a sentence pair f, e:

$$a^* = \arg\max_a p(a|f, e, m)$$
When we start, we assume that all alignments a are equally likely:
$$p(a|e, m) = \frac{1}{(l+1)^m}$$
Which is a big simplification but provides a starting point.
We want to estimate p(f |a, e, m), which is:
$$p(f|a, e, m) = \prod_{j=1}^{m} t(f_j | e_{a_j})$$
Where t(fj |eaj ) is the probability of the source word eaj being aligned with fj . These are the parameters
we are interested in learning.
So the general generative process is as follows:
1. Pick an alignment a with probability $\frac{1}{(l+1)^m}$
2. Pick the target language words with probability:

$$p(f|a, e, m) = \prod_{j=1}^{m} t(f_j | e_{a_j})$$
Then we get our final model:
$$p(f, a|e, m) = p(a|e, m)\, p(f|a, e, m) = \frac{1}{(l+1)^m} \prod_{j=1}^{m} t(f_j | e_{a_j})$$
IBM Model 2
An extension of IBM Model 1; it introduces alignment (also called distortion) parameters q(i|j, l, m), giving the probability that the jth target word is connected to the ith source word. That is, we no longer assume alignments have uniform probability.
We define:
$$p(a|e, m) = \prod_{j=1}^{m} q(a_j | j, l, m)$$
where a = {a1 , . . . , am }.
This now gives us the following as our final model:
$$p(f, a|e, m) = \prod_{j=1}^{m} q(a_j | j, l, m)\, t(f_j | e_{a_j})$$
In overview, the generative process for IBM model 2 is:
1. Pick an alignment $a = \{a_1, a_2, \ldots, a_m\}$ with probability:

$$\prod_{j=1}^{m} q(a_j | j, l, m)$$

2. Pick the target language words with probability:

$$p(f|a, e, m) = \prod_{j=1}^{m} t(f_j | e_{a_j})$$
Which is equivalent to the final model described above.
Then we can use this model to get the most likely alignment for any sentence pair:
Given a sentence pair e1 , e2 , . . . , el and f1 , f2 , . . . , fm :
$$a_j = \arg\max_{a \in \{0, \ldots, l\}} q(a | j, l, m)\, t(f_j | e_a)$$

for $j = 1, \ldots, m$.
Estimating the q and t parameters
We need to estimate our q(i |j, l, m) and t(f |e) parameters. We have a parallel corpus of sentence
pairs, a single example of which is notated (e (k) , f (k) ) for k = 1, . . . , n.
Our training examples do not have alignments annotated (if we did, we could just use maximum likelihood estimates, e.g. $t_{ML}(f|e) = \frac{\text{Count}(e,f)}{\text{Count}(e)}$ and $q_{ML}(j|i, l, m) = \frac{\text{Count}(j|i,l,m)}{\text{Count}(i,l,m)}$).
We can use the Expectation Maximization algorithm to estimate these parameters.
We initialize our q and t parameters to random values. Then we iteratively do the following until
convergence:
1. Compute “counts” (called expected counts) based on the data and our current parameter
estimates
2. Re-estimate the parameters with these counts
The amount we increment counts by is:
$$\delta(k, i, j) = \frac{q(j|i, l_k, m_k)\, t(f_i^{(k)} | e_j^{(k)})}{\sum_{j=0}^{l_k} q(j|i, l_k, m_k)\, t(f_i^{(k)} | e_j^{(k)})}$$
The algorithm for updating counts c is:
• For $k = 1, \ldots, n$
• For $i = 1, \ldots, m_k$, for $j = 0, \ldots, l_k$:
  – $c(e_j^{(k)}, f_i^{(k)}) \mathrel{+}= \delta(k, i, j)$
  – $c(e_j^{(k)}) \mathrel{+}= \delta(k, i, j)$
  – $c(j|i, l, m) \mathrel{+}= \delta(k, i, j)$
  – $c(i, l, m) \mathrel{+}= \delta(k, i, j)$
Then recalculate the parameters:
$$t(f|e) = \frac{c(e, f)}{c(e)} \qquad q(j|i, l, m) = \frac{c(j|i, l, m)}{c(i, l, m)}$$
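As a concrete (toy) illustration of this count-and-renormalize loop, here is a minimal sketch of EM for the simpler IBM Model 1 setting, where alignment probabilities are uniform and only the t(f|e) parameters are estimated; a NULL token stands in for the empty alignment. It is not an efficient implementation.

```python
from collections import defaultdict

def ibm1_em(sentence_pairs, iterations=10):
    """sentence_pairs: list of (e_words, f_words) pairs.
    Returns t[(f, e)], the estimated translation probabilities p(f | e)."""
    t = defaultdict(float)
    # initialize t uniformly over co-occurring word pairs
    for e_words, f_words in sentence_pairs:
        for f in f_words:
            for e in ["NULL"] + e_words:
                t[(f, e)] = 1.0
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(e, f)
        total = defaultdict(float)   # expected counts c(e)
        for e_words, f_words in sentence_pairs:
            e_all = ["NULL"] + e_words
            for f in f_words:
                norm = sum(t[(f, e)] for e in e_all)
                for e in e_all:
                    delta = t[(f, e)] / norm      # posterior probability of this alignment
                    count[(e, f)] += delta
                    total[e] += delta
        for (e, f) in count:                       # re-estimate the parameters
            t[(f, e)] = count[(e, f)] / total[e]
    return t

pairs = [(["the", "dog"], ["le", "chien"]), (["the", "cat"], ["le", "chat"])]
t = ibm1_em(pairs)
print(round(t[("le", "the")], 2))  # increases over iterations: "le" aligns with "the"
```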
How does this method work?
First we define the log-likelihood function as a function of our t and q parameters:
$$L(t, q) = \sum_{k=1}^{n} \log p(f^{(k)} | e^{(k)}) = \sum_{k=1}^{n} \log \sum_{a} p(f^{(k)}, a | e^{(k)})$$
Which quantifies how well our current parameter estimates fit the data.
So the maximum likelihood estimates are just:
$$\arg\max_{t,q} L(t, q)$$
Though the EM algorithm will converge only to a local maximum of the log-likelihood function.
17.14.4
Phrase-Based Translation
Phrase-based models require a phrase-based (PB) lexicon, which consists of pairs of matching phrases (each consisting of one or more words), one from the source language and one from the target language. This phrase lexicon can be learned from alignments.
However, alignments are many-to-one; that is, multiple words in the target language can map to a single word in the source language, but the reverse cannot happen. A workaround is to learn alignments in both directions (i.e. from source to target and from target to source), then take the intersection of these alignments as a starting point for the phrase lexicon.
This phrase lexicon can be expanded (“grown”) through some heuristics (not covered here).
This phrase lexicon can be noisy, so we want to apply some heuristics to clean it up. In particular,
we want phrase pairs that are consistent. A phrase pair (e, f ) is consistent if:
1. There is at least one word in e aligned to a word in f
2. There are no words in f aligned to words outside e
3. There are no words in e aligned to words outside f
We discard any phrase pairs that are not consistent.
We can use these phrases to estimate the parameter t(f |e) easily:
$$t(f|e) = \frac{\text{Count}(f, e)}{\text{Count}(e)}$$

We give each phrase pair (f, e) a score g(f, e). For example:

$$g(f, e) = \log\left(\frac{\text{Count}(f, e)}{\text{Count}(e)}\right)$$
A phrase-based model consists of:
• a phrase-based lexicon, with a way of computing a score for each phrase pair
• a trigram language model with parameters q(w |u, v )
• a distortion parameter η, which is typically negative
Given an input (source language) sentence x1 , . . . , xn , a phrase is a tuple (s, t, e) which indicates
that the subsequence xs , . . . , xt can be translated to the string e in the target language using a
phrase pair in the lexicon.
We denote P as the set of all phrases for a sentence.
For any phrase p, s(p), t(p), e(p) correspond to its components in the tuple. g(p) is the score for
the phrase.
A derivation y is a finite sequence of phrases p1 , p2 , . . . , pL where each phrase is in P . The underlying
translation defined by y is denoted e(y ) (that is, e(y ) just represents the combined string of y ’s
phrases).
For an input sentence x = x1 , . . . , xn , we refer to the set of valid derivations for x as Y (x). It is a
set of all finite length sequences of phrases p1 , p2 , . . . , pL such that:
• Each phrase pk , k ∈ {1, . . . , L} is a member of the set of phrases P
• Each word in x is translated exactly once
• For all $k \in \{1, \ldots, (L-1)\}$, $|t(p_k) + 1 - s(p_{k+1})| \leq d$, where $d \geq 0$ is a parameter of the model. We must also have $|1 - s(p_1)| \leq d$. $d$ is the distortion limit, which constrains how far phrases can move (a typical value is $d = 4$). Empirically, this results in better translations, and it also reduces the search space of possible translations.
Y (x) is exponential in size (it grows exponentially with sentence length), so it gets quite large.
Now we want to score these derivations and select the highest-scoring one as the translation, i.e.
$$\arg\max_{y \in Y(x)} f(y)$$
Where f (y ) is the scoring function. It typically involves a product of a language model and a
translation model.
In particular, we have the scoring function:
$$f(y) = h(e(y)) + \sum_{k=1}^{L} g(p_k) + \sum_{k=0}^{L-1} \eta\, |t(p_k) + 1 - s(p_{k+1})|$$
Where e(y ) is the sequence of words in the translation, h(e(y )) is the score of the sequence of words
under the language model (e.g. a trigram language model), g(pk ) is the score for the phrase pk ,
and the last summation is the distortion score, which penalizes distortions (so that we favor smaller
distortions).
We also define t(p0 ) = 0.
Because Y(x) is exponential in size, we want to avoid a brute-force method for identifying the highest-scoring derivation. In fact, this is an NP-hard problem, so we apply a heuristic method - in particular, beam search.
For this algorithm, called the Decoding Algorithm, we keep a state as a tuple (e1 , e2 , b, r, α) where
e1 , e2 are target words, b is a bit-string of length n (that is, the same length of the input sentence)
which indicates which words in the source sentence have been translated, r is the integer specifying
the endpoint of the last phrase in the state, and α is a score for the state.
The initial state is q0 = (∗, ∗, 0n , 0, 0), where 0n is a bit-string of length n with all zeros.
We can represent the state of possible translations as a graph of these states, e.g. the source sentence
has many initial possible translation states, which each also lead to many other possible states, etc.
As mentioned earlier, this graph becomes far too large to brute-force search through.
We define ph(q) as a function which returns the set of phrases that can follow state q.
For a phrase p to be a member of ph(q), it must satisfy the following:
• p must not overlap with the bit-string b, i.e. bi = 0 for i ∈ {s(p), . . . , t(p)}. This formalizes
the fact that we don’t want to translate the same word twice.
• The distortion limit must not be violated (i.e. |r + 1 − s(p)| ≤ d)
We also define next(q, p) to be the state formed by combining the state q with the phrase p (i.e. it
is a transition function for the state graph).
Formally, we have a state $q = (e_1, e_2, b, r, \alpha)$ and a phrase $p = (s, t, \epsilon_1, \ldots, \epsilon_M)$, where $\epsilon_i$ is a word in the phrase. The transition function next(q, p) yields the state $q' = (e_1', e_2', b', r', \alpha')$, defined as follows:

• Define $\epsilon_{-1} = e_1$, $\epsilon_0 = e_2$
• Define $e_1' = \epsilon_{M-1}$, $e_2' = \epsilon_M$
• Define $b_i' = 1$ for $i \in \{s, \ldots, t\}$. Define $b_i' = b_i$ for $i \notin \{s, \ldots, t\}$.
• Define $r' = t$
• Define:

$$\alpha' = \alpha + g(p) + \sum_{i=1}^{M} \log q(\epsilon_i | \epsilon_{i-2}, \epsilon_{i-1}) + \eta\, |r + 1 - s|$$
We also define a simple equality function, eq(q, q ′ ) which returns true or false if the two states are
equal, ignoring scores (that is, if all their components are equal, without requiring that their scores
are equal).
The final decoding algorithm:
• Inputs:
• a sentence x1 , . . . , xn
• a phrase-based model (L, h, d, η), where L is the lexicon, h is the language model, d is the
distortion limit, and η is the distortion parameter. This model defines the functions ph(q) and
next(q, p).
• Initialization: set Q0 = {q0 }, Qi = ∅ for i = 1, . . . , n, where q0 is the initial state as defined
earlier. Each Qi contains possible states in which i words are translated.
• For i = 0, . . . , n − 1
• For each state q ∈ beam(Qi ), for each phrase p ∈ ph(q):
– q ′ = next(q, p)
– Add(Qi , q ′ , q, p) where i = len(q ′ )
• Return: highest scoring state in Qn . Backpointers can be used to find the underlying sequence
of phrases.
Add(Q, q′, q, p) is defined:

• If there is some q′′ ∈ Q such that eq(q′′, q′) is true:
  – if α(q′) > α(q′′):
    * Q = {q′} ∪ Q \ {q′′} (remove the lower scoring state, add the higher scoring one)
    * set bp(q′) = (q, p)
  – else return
• Else:
  – Q = Q ∪ {q′}
  – set bp(q′) = (q, p)
That is, if we already have an equivalent state, keep the higher scoring of the two, and we keep a
backpointer of how we got there.
beam(Q) is defined:
First define $\alpha^* = \max_{q \in Q} \alpha(q)$. We define $\beta \geq 0$ to be the beam-width parameter. Then

$$\text{beam}(Q) = \{q \in Q : \alpha(q) \geq \alpha^* - \beta\}$$
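A minimal sketch of just the beam pruning step (states are represented here as (state, score) pairs; the rest of the decoder is omitted):

```python
def beam(Q, beta):
    """Keep only the states whose score alpha is within beta of the best score in Q."""
    if not Q:
        return []
    alpha_star = max(alpha for _, alpha in Q)
    return [(q, alpha) for q, alpha in Q if alpha >= alpha_star - beta]
```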
17.15
Word Clustering
The Brown clustering algorithm is an unsupervised method which takes as input some large quantity of sentences, and from that, learns useful representations of words, outputting a hierarchical
word clustering (e.g. weekdays and weekends might be clustered together, months may be clustered
together, family relations may be clustered, etc).
The general intuition is that similar words appear in similar contexts - that is, they have similar
distributions of words to their immediate left and right.
We have a set of all words seen in the corpus V = {w1 , w2 , . . . , wT }. Say C : V → {1, 2, . . . , k} is
a partition of the vocabulary into k classes (that is, C maps each word to a class label).
The model is as follows, where C(w0 ) is a special start state:
$$p(w_1, w_2, \ldots, w_T) = \prod_{i=1}^{n} e(w_i | C(w_i))\, q(C(w_i) | C(w_{i-1}))$$
Which can be restated:
$$\log p(w_1, w_2, \ldots, w_T) = \sum_{i=1}^{n} \log e(w_i | C(w_i))\, q(C(w_i) | C(w_{i-1}))$$
So we want to learn the parameters e(v |c) for every v ∈ V, c ∈ {1, . . . , k} and q(c ′ |c) for every
c ′ , c ∈ {1, . . . , k}.
We first need to measure the quality of a partition C:
$$\text{Quality}(C) = \sum_{i=1}^{n} \log e(w_i | C(w_i))\, q(C(w_i) | C(w_{i-1})) = \sum_{c=1}^{k} \sum_{c'=1}^{k} p(c, c') \log \frac{p(c, c')}{p(c)\, p(c')} + G$$
Where G is some constant. This basically computes the likelihood of this corpus under C.
Here:
$$p(c, c') = \frac{n(c, c')}{\sum_{c, c'} n(c, c')}, \qquad p(c) = \frac{n(c)}{\sum_{c} n(c)}$$
Where n(c) is the number of times class c occurs in the corpus, n(c, c ′ ) is the number of times c ′
is seen following c, under the function C.
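A minimal sketch of computing the mutual-information part of Quality(C) from class bigram and unigram counts; the two count dictionaries are assumed to have been built under the current clustering C:

```python
import math

def quality(bigram_counts, unigram_counts):
    """bigram_counts[(c, c2)]: number of times class c2 follows class c.
    unigram_counts[c]: number of times class c occurs.
    Returns sum over class pairs of p(c, c') * log(p(c, c') / (p(c) p(c')))."""
    total_bigrams = sum(bigram_counts.values())
    total_unigrams = sum(unigram_counts.values())
    q = 0.0
    for (c, c2), n in bigram_counts.items():
        p_cc = n / total_bigrams
        p_c = unigram_counts[c] / total_unigrams
        p_c2 = unigram_counts[c2] / total_unigrams
        q += p_cc * math.log(p_cc / (p_c * p_c2))
    return q
```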
The basic algorithm for Brown clustering is as follows:
• Start with |V| clusters (each word gets its own cluster, but by the end we will find k clusters)
• We run |V| − k merge steps:
• At each merge step, we pick two clusters $c_i$, $c_j$ and merge them into a single cluster
• We greedily pick merges such that Quality(C) for the clustering C after the merge step is maximized at each stage
This approach is inefficient: $O(|V|^5)$, though it can be improved to $O(|V|^3)$, which is still quite slow. There is a better way based on this approach:

• We specify a parameter m, e.g. m = 1000
• We take the top m most frequent words and put each into its own cluster, $c_1, c_2, \ldots, c_m$
• For $i = (m+1) \ldots |V|$:
  – Create a new cluster $c_{m+1}$ for the ith most frequent word. We now have m + 1 clusters
  – Choose two clusters from $c_1, \ldots, c_{m+1}$ to be merged, picking the merge that gives a max value for Quality(C) (now we just have m clusters again)
• Carry out (m − 1) final merges to create a full hierarchy.

This has a run time of $O(|V|m^2 + n)$, where n is the corpus length.
17.16 Neural Networks and NLP
Typically when words are represented as vectors, it is as a one-hot representation, that is, a vector
of length |V | where V is the vocabulary, with all elements 0 except for the one corresponding to the
particular word being represented (that is, it is a sparse representation).
This can be quite unwieldy as it has dimensionality of |V |, which is typically quite large.
We can instead use neural networks to learn dense representations of words ("word embeddings") of a fixed dimension (the particular dimensionality is specified as a hyperparameter; there is not, as of this writing, a theoretical understanding of how to choose this value). These embeddings can also capture other properties of words (such as analogies).
Representing a sentence can be accomplished by concatenating the embeddings of its words, but this
can be problematic in that typically fixed-size vectors are required, and sentences are variable in their
word length.
A way around this is to use the continuous bag of words (CBOW) representation, in which, like
the traditional bag-of-words representation, we throw out word order information and combine the
embeddings by summing or averaging them, e.g. given a set of word embeddings v1 , . . . , vk :
$$\text{CBOW}(v_1, \ldots, v_k) = \frac{1}{k} \sum_{i=1}^{k} v_i$$
An extension of this method is the weighted CBOW (WCBOW) which is just a weighted average of
the embeddings.
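A minimal sketch of the CBOW and weighted CBOW combinations (numpy vectors assumed; the weights could be, for example, tf-idf values):

```python
import numpy as np

def cbow(vectors):
    """Average a list of word embedding vectors into a single fixed-size vector."""
    return np.mean(vectors, axis=0)

def wcbow(vectors, weights):
    """Weighted average of word embeddings."""
    return np.average(vectors, axis=0, weights=np.asarray(weights, dtype=float))

vs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(cbow(vs))           # [0.5 0.5]
print(wcbow(vs, [3, 1]))  # [0.75 0.25]
```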
How are these word embeddings learned? Typically, it is by training a neural network (specifically
for learning the embeddings) on an auxiliary task. For instance, context prediction is a common
embedding training task, in which we try to predict a word given its surrounding context (under the
assumption that words which appear in similar contexts are similar in other important ways).
17.16.1
Word Embeddings
A word embedding W : words → Rn is a parameterized function that maps words to high-dimensional
vectors (typically 200-500 dimensions).
This function is typically a lookup table parameterized by a matrix θ, where each row represents a
word. That is, the function is often Wθ (wn ) = θn . θ is initialized with random vectors for each word.
So given a task involving words, we want to learn W so that we have good representations for each
word.
You can visualize a word embedding space using t-SNE (a technique for visualizing high-dimensional
data):
As you can see, words that are similar in meaning tend to be closer together. Intuitively this makes
sense - if words have similar meaning, they are somewhat interchangeable, so we expect that their
vectors be similar too.
Visualizing a word embedding space with t-SNE (Turian et al (2010))
We’ll also see the vectors capture notions of analogy, for example “Paris” is to “France” as “Tokyo”
is to “Japan”. These kinds of analogies can be represented as vector addition: “Paris” - “France” +
“Japan” = “Tokyo”.
The best part is the neural network is not explicitly told to learn representations with these properties
- it is just a side effect. This is one of the remarkable properties of neural networks - they learn good
ways of representing the data more or less on their own.
And these representations can be portable. That is, maybe you learn W for one natural language
task, but you may be able to re-use W for another natural language task (provided it’s using a similar
vocabulary). This practice is sometimes called “pretraining” or “transfer learning” or “multi-task
learning”.
You can also map multiple words to a single representation, e.g. if you are doing a multilingual task.
For example, the English and French words for “dog” could map to the same representation since
they mean the same thing (in which case we could call this a “bilingual word embedding”).
Here’s an example visualization of a Chinese and English bilingual word embedding:
You can even go a step further and learn image and word representations together, so that vectors
representing images of horses are close to the vector for the word “horse”.
Two main techniques for learning word embeddings are:
• CBOW: predicting the probability of a word given its context words
• Skip-gram: predicting the probability of context words given a word
17.16.2
CNNs for NLP
CBOW representations lose word-ordering information, which can be important for some tasks
(e.g. sentiment analysis).
CNNs are useful in such situations because they avoid the need to resort to, for instance, bigram methods: they can automatically learn important local structures (much as they do in image recognition).
A Chinese and English word embedding (Socher et al (2013a))
17.16.3 References
• A Primer on Neural Network Models for Natural Language Processing. Yoav Goldberg. October
5, 2015.
• Natural Language Processing. Dan Jurafsky, Christopher Manning, Stanford (Coursera).
• Natural Language Processing. Michael Collins. Columbia University (Coursera).
18
Unsupervised Learning
In unsupervised learning, our data does not have any labels. Unsupervised learning algorithms try to
find some structure in the data.
An example is a clustering algorithm. We don't tell the algorithm in advance anything about the structure of the data; it discovers it on its own by figuring out how to group the data points.
Some other examples are dimensionality reduction, in which you try to reduce the dimensionality
of the data representation, density estimation, in which you estimate the probability distribution of
the data, p(x), and feature extraction, in which you try to learn meaningful features automatically.
18.1 k-Nearest Neighbors (kNN)
A very simple nonparametric classification algorithm in which you take the k closest neighbors to a
point (“closest” depends on the distance metric you choose) and each neighbor constitutes a “vote”
for its label. Then you assign the point the label with the most votes.
Because this is essentially predicting an input’s label based on similar instances, kNN is a case-based
approach. The key with case-based approaches is how you define similarity - a common way is feature
dot products:
$$\text{sim}(x, x') = x \cdot x' = \sum_{i} x_i x_i'$$
k can be chosen heuristically: generally you don’t want it to be so high that the votes become noisy
(in the extreme, if you have n datapoints and set k = n, you will just choose the most common label
in the dataset), and you want to choose it so that it is coprime with the number of classes (that is,
they share no common divisors except for 1). This prevents ties.
Alternatively, you can apply an optimization algorithm to choose k.
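A minimal sketch of kNN classification with Euclidean distance and a majority vote (brute force; in practice you would use something like a k-d tree, as noted below):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y = np.array(["a", "a", "b", "b"])
print(knn_predict(X, y, np.array([0.2, 0.4])))  # "a"
```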
Some distances that you can use include Euclidean distance, Manhattan distance (also known as the
city block distance or the taxicab distance), Minkowski distance (a generalization of the Manhattan
and Euclidean distances), and Mahalanobis distance.
Minkowski-type distances treat all dimensions symmetrically; that is, they assume distance is on the same scale in every dimension. Mahalanobis distance, on the other hand, takes into account the standard deviation of each dimension.
kNN can work quite quickly when implemented with something like a k-d tree.
kNN and other case-based approaches are examples of nonparametric models. With nonparametric
models, there is not a fixed set of parameters (which isn’t to say that there are no parameters, though
the name “nonparametric” would have you think otherwise). Rather, the complexity of the classifier
increases with the data. Nonparametric models typically require a lot of data before they start to be
competitive with parametric models.
18.2 Clustering
18.2.1
K-Means Clustering Algorithm
First, randomly initialize K points, called the cluster centroids.
Then iterate:
• Cluster assignment step: go through each data point and assign it to the closest of the K
centroids.
• Move centroid step: move the centroids to the average of their points.
Closeness is computed by some distance metric, e.g. euclidean.
More formally, there are two inputs:
• K - the number of clusters
• The training set
x (1) , x (2) , . . . , x (m)
Where x (i) ∈ Rn (we drop the x0 = 1 convention).
Randomly initialize K cluster centroids µ1 , µ2 , . . . , µK ∈ Rn .
Repeat:
• For i = 1 to m
– $c^{(i)}$ := index (from 1 to K) of the cluster centroid closest to $x^{(i)}$. That is, $c^{(i)} := \arg\min_k ||x^{(i)} - \mu_k||$.
• For k = 1 to K
– µk := average (mean) of points assigned to cluster k
If you have an empty cluster, it is common to just eliminate it entirely.
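A minimal sketch of the two alternating steps in numpy (a fixed number of iterations rather than a convergence check, and no handling of empty clusters):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centroids as K randomly chosen training examples
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(iters):
        # cluster assignment step: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)
        # move centroid step: move each centroid to the mean of its assigned points
        for k in range(K):
            if np.any(assign == k):
                centroids[k] = X[assign == k].mean(axis=0)
    return assign, centroids
```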
We can notate the cluster centroid of the cluster to which example x (i) has been assigned as µc (i) .
In K-means, the optimization objective is:
$$J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} ||x^{(i)} - \mu_{c^{(i)}}||^2$$

$$\min_{c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K} J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K)$$
This cost function is sometimes called the distortion cost function or the distortion of the K-means
algorithm.
The algorithm outlined above is minimizing the cost: the first step minimizes J with respect to $c^{(1)}, \ldots, c^{(m)}$ (holding the centroids fixed) and the second step minimizes J with respect to $\mu_1, \ldots, \mu_K$.
Note that we randomly initialize the centroids, so different runs of K-means could lead to (very)
different clusterings.
One question is - what’s the best way to initialize the initial centroids to avoid local minima of the
cost function?
First of all, you should have K < m (i.e. fewer clusters than training examples).
Then randomly pick K training examples. Use these as your initialization points (i.e. set µ1 , . . . , µk
to these K examples).
Then, to better avoid local optima, just rerun K-means several times (e.g. 50-1000 times) with new
initializations of points. Keep track of the resulting cost function and then pick the clustering that
gave the lowest cost.
So, how do you choose a good value for K?
Unfortunately, there is no good way of doing this automatically. The most common way is to just
choose it manually by looking at the output. If you plot out the data and look at it - even among
people it is difficult to come to a consensus on how many clusters there are.
One method that some use is the Elbow method. In this approach, you vary K, run K-means, and
compute the cost function for each value. If you plot out K vs the cost functions, there may be a
clear “elbow” in the graph and you pick the K at the elbow. However, most of the time there isn’t
a clear elbow, so the method is not very effective.
One drawback of K-means (which many other clustering algorithms share) is that every point has a
cluster assignment, which is to say K-means has no concept of “noise”.
Furthermore, K-means expects clusters to be globular, so it can’t handle more exotic cluster shapes
(such as moon-shaped clusters).
There are still many situations where K-means is quite useful, especially since it scales well to large
datasets.
18.2.2 Hierarchical Agglomerative Clustering
Hierarchical agglomerative clustering (HAC) is a bottom-up clustering process which is fairly simple:
1. Find two closest data points or clusters, merge into a cluster (and remove the original points
or clusters which formed the new cluster)
2. Repeat
This results in a hierarchy (e.g. a tree structure) describing how the data can be grouped into clusters
and clusters of clusters. This structure can be visualized as a dendrogram:
A dendrogram
Two things which must be specified for HAC are:
• the distance metric: Euclidean, cosine, etc
• the merging approach - that is, how is the distance between two clusters measured?
  – complete linkage - use the distance between the two furthest points
  – average linkage - take the average distance over all pairs between the clusters
  – single linkage - use the distance between the two nearest points
  – (there are others as well)
Unlike K-means, HAC is deterministic (since there are no randomly-initialized centroids) but it can
be unstable: changing a few points or the presence of some outliers can vastly change the result.
Scaling of variables/features can also affect clustering.
HAC does not assume globular clusters, although it does not have a concept of noise.
18.2.3
Affinity Propagation
In affinity propagation, data points “vote” on their preferred “exemplar”, which yields a set of exemplars as the initial cluster points. Then we just assign each point to the nearest exemplar.
Affinity Propagation is one of the few clustering algorithms which supports non-metric dissimilarities
(i.e. the dissimilarities do not need to be symmetric or obey the triangle inequality).
Like K-means, affinity propagation also does not have a concept of noise and also assumes that
clusters are globular. Unlike K-means, however, it is deterministic, and it does not scale very well
(mostly because its support for non-metric dissimilarities precludes it from many optimizations that
other algorithms can take advantage of).
18.2.4
Spectral Clustering
With spectral clustering, datapoints are clustered by affinity - that is, by nearby points - rather than
by centroids (as is with K-Means). Using affinity instead of centroids, spectral clustering can identify
clusters where K-Means fails to.
In spectral clustering, an affinity matrix is produced which, for a set of n datapoints, is an n × n
matrix. Pairwise affinities are computed for the dataset. Affinity is some distance metric.
Then, from this affinity matrix, PCA is used to extract the eigenvectors with the largest eigenvalues
and the data is then projected to the new space defined by PCA. The data will be more clearly
separated in this new representation such that conventional clustering methods (e.g. K-Means) can
be applied.
More formally: spectral clustering generates a graph of the datapoints, with edges as the distances
between the points. Then the Laplacian of the graph is produced:
Given the adjacency matrix A and the degree matrix D of a graph G of n vertices, the Laplacian
matrix Ln×n is simply L = D − A.
As a reminder:
• the adjacency matrix A is an n × n matrix where the element Ai,j is 1 if an edge exists between
vertices i and j and 0 otherwise.
• the degree matrix D is an n × n diagonal matrix where the element Di,i is the degree of vertex
i.
Then the eigenvectors of the Laplacian are computed to find an embedding of the graph into Euclidean
space. Then some clustering algorithm (typically K-Means) is run on the data in this transformed
space.
Spectral clustering enhances clustering algorithms which assume globular clusters in that its space
transformation of the data causes non-globular data to be globular in the transformed space. However,
the graph transformation slows things down.
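A minimal sketch of this pipeline using the unnormalized Laplacian, numpy's eigendecomposition, and scikit-learn's K-Means; using the eigenvectors with the smallest Laplacian eigenvalues for the embedding is a common choice, and a simplifying assumption here:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, n_clusters):
    """A: symmetric affinity/adjacency matrix (n x n). Returns cluster labels."""
    D = np.diag(A.sum(axis=1))           # degree matrix
    L = D - A                             # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)  # eigh returns eigenvalues in ascending order
    embedding = eigvecs[:, :n_clusters]   # embed the points in the new space
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)
```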
18.2.5
Mean Shift Clustering
Mean shift clustering extends kernel density estimation (KDE) one step further: the data points iteratively hill-climb to the nearest peak of the KDE surface.
As a parameter to the kernel density estimates, you need to specify a bandwidth - this will affect the
KDEs and their peaks, and thus it will affect the clustering results. You do not, however, need to
specify the number of clusters.
You also need to make the choice of what kernel to use. Two commonly used kernels are:
• Flat kernel:
$$K(x) = \begin{cases} 1 & \text{if } ||x|| \leq 1 \\ 0 & \text{otherwise} \end{cases}$$
• Gaussian kernel
Mean shift is slow ($O(N^2)$).
18.2.6
Non-Negative Matrix Factorization (NMF)
NMF is a particular matrix factorization in which each element of V is ≥ 0 (a non-negativity constraint), and it results in factor matrices W and H such that each of their elements is also ≥ 0.
Non-negative matrix factorization (By Qwertyus, CC BY-SA 3.0, via Wikimedia Commons)
CHAPTER 18. UNSUPERVISED LEARNING
533
18.2. CLUSTERING
534
Each column vi in V can be calculated from W and H like so (where hi is a column in H):
$$v_i = W h_i$$
NMF can be used for clustering; it has the consequence of naturally clustering the columns of V. It is also useful for reducing (i.e. compressing) the dimensionality of a dataset; in particular, it reduces the data to a linear combination of bases.
If you add an orthogonality constraint, i.e. $HH^T = I$, then if the value $H_{kj} > 0$, the jth column of V, that is, $v_j$, belongs to cluster k.
Matrix factorization
Let V be an m × n matrix of rank r . Then there is an m × r matrix W and an r × n matrix H such
that V = W H. So we can factorize (or decompose) V into W and H.
This matrix factorization can be seen as a form of compression (for low rank matrices, at least) - if
we were to store V on its own, we have to store m × n elements, but if we store W and H separately,
we only need to store m × r + r × n elements, which will be smaller than m × n for low rank matrices.
Note that this kind of factorization can’t be solved analytically, so it is usually approximated numerically (there are a variety of algorithms for doing so).
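A minimal sketch using scikit-learn's NMF implementation (the rank r = 2 here is an arbitrary choice):

```python
import numpy as np
from sklearn.decomposition import NMF

V = np.array([[1.0, 1.0, 2.0],
              [2.0, 1.0, 3.0],
              [3.0, 1.2, 4.1]])

model = NMF(n_components=2, init="random", random_state=0, max_iter=1000)
W = model.fit_transform(V)   # m x r, all elements >= 0
H = model.components_        # r x n, all elements >= 0
print(np.round(W @ H, 1))    # approximately reconstructs V
```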
18.2.7
DBSCAN
DBSCAN transforms the space according to density, then identifies dense regions as clusters using single linkage clustering. Sparse points are considered noise - not all points are forced to have a cluster assignment.
DBSCAN handles non-globular clusters well, provided they have consistent density - it has some
trouble with variable density clusters (they may be split up into multiple clusters).
18.2.8
HDBSCAN
HDBSCAN is an improvement upon DBSCAN which can handle variable density clusters, while
preserving the scalability of DBSCAN. DBSCAN’s epsilon parameter is replaced with a “min cluster
size” parameter.
HDBSCAN uses single-linkage clustering, and a concern with single-linkage clustering is that some
errant point between two clusters may accidentally act as a bridge between them, such that they are
identified as a single cluster. HDBSCAN avoids this by first transforming the space in such a way
that sparse points (these potentially troublesome noise points) are pushed further away.
To do this, we first define a distance called the core distance, corek (x), which is point x’s distance
from its kth nearest neighbor.
Then we define a new distance metric based on these core distances, called the mutual reachability distance. The mutual reachability distance $d_{\text{mreach-}k}$ between points a and b is the largest of the following distances: $\text{core}_k(a)$, $\text{core}_k(b)$, and d(a, b), where d(a, b) is the regular distance metric between a and b. More formally:

$$d_{\text{mreach-}k}(a, b) = \max(\text{core}_k(a), \text{core}_k(b), d(a, b))$$
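A minimal sketch of computing core distances and the mutual reachability distance matrix, using brute-force pairwise Euclidean distances:

```python
import numpy as np

def mutual_reachability(X, k=5):
    """Return the n x n matrix of mutual reachability distances d_mreach-k."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    core = np.sort(d, axis=1)[:, k]   # distance to the k-th nearest neighbor (column 0 is self)
    return np.maximum(d, np.maximum(core[:, None], core[None, :]))
```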
For example, suppose k = 5, and consider three points: a blue point b, a green point g, and a red point r. Say we want to compute the mutual reachability distance between b and g. We first compute d(b, g), which is larger than $\text{core}_k(b)$, but both are smaller than $\text{core}_k(g)$. So the mutual reachability distance between b and g is $\text{core}_k(g)$.
On the other hand, the mutual reachability distance between the red and green points is equal to
d(r, g) because that is larger than either of their core distances.
We build a distance matrix out of these mutual reachability distances; this is the transformed space.
We can use this distance matrix to represent a graph of the points.
We want to construct a minimum spanning tree out of this graph.
As a reminder, a spanning tree of a graph is any subgraph which contains all vertices and is a tree
(a tree is a graph where vertices are connected by only one path; i.e. it is a connected graph - all
vertices are connected - but there are no cycles).
The weight of a tree is the sum of its edges’ weights. A minimum spanning tree is a spanning tree
with the least (or equal to least) weight.
The minimum spanning tree of this graph can be constructed using Prim’s algorithm.
From this spanning tree, we then want to create the cluster hierarchy. This can be accomplished by
sorting edges from closest to furthest and iterating over them, creating a merged cluster for each
edge.
(A note from the original post which I don’t understand yet: “The only difficult part here is to identify
the two clusters each edge will join together, but this is easy enough via a union-find data structure.”)
Given this hierarchy, we want a set of flat clusters. DBSCAN asks you to specify the number of
clusters, but HDBSCAN can independently discover them. It does require, however, that you specify
a minimum cluster size.
In the produced hierarchy, it is often the case that a cluster splits into one large subcluster and a
few independent points. Other times, the cluster splits into two good-sized clusters. The minimum
cluster size makes explicit what a “good-sized” cluster is.
If a cluster splits into clusters which are at or above the minimum cluster size, we consider them
to be separate clusters. Otherwise, we don’t split the cluster (we treat the other points as having
“fallen out of” the parent cluster) and just keep the parent cluster intact. However, we keep track
of which points have “fallen out” and at what distance that happened. This way we know at which
distance cutoffs the cluster “sheds” points. We also keep track at what distances a cluster split into
its children clusters.
Using this approach, we “clean up” the hierarchy.
We use the distances at which a cluster breaks up into subclusters to measure the persistence of a cluster. Formally, we think in terms of $\lambda = \frac{1}{\text{distance}}$.
We define for each cluster a λbirth , which is the distance at which this cluster’s parent split to yield
this cluster, and a λdeath , which is the distance at which this cluster itself split into subclusters (if it
does eventually split into subclusters).
Then, for each point p within a cluster, we define λp to be when that point “fell out” of the cluster,
which is either somewhere in between λbirth , λdeath , or, if the point does not fall out of the cluster, it
is just λdeath (that is, it falls out when the cluster itself splits).
The stability of a cluster is simply:
$$\sum_{p \in \text{cluster}} (\lambda_p - \lambda_{\text{birth}})$$
Then we start with all the leaf nodes and select them as clusters. We move up the tree and sum the
stabilities of each cluster’s child clusters. Then:
• If the sum of a cluster's child stabilities is greater than its own stability, then we set its stability to be the sum of its child stabilities.
• If the sum of a cluster’s child stabilities is less than its own stability, then we select the cluster
and unselect its descendants.
When we reach the root node, return the selected clusters. Points not in any of the selected clusters
are considered noise.
As a bonus: each λp in the selected clusters can be treated as membership strength to the cluster if
we normalize them.
18.2.9
CURE (Clustering Using Representatives)
If you are dealing with more data than can fit into memory, you may have issues clustering it.
A flexible clustering algorithm (there are no restrictions about the shape of the clusters it can find)
which can handle massive datasets is CURE.
CURE uses Euclidean distance and generates a set of k representative points for each cluster. It uses these points to represent clusters, thereby avoiding the need to store every datapoint in memory.
CURE works in two passes.
For the first pass, a random sample of points from the dataset are chosen. The more samples the
better, so ideally you choose as many samples as can fit into memory. Then you apply a conventional
clustering algorithm, such as hierarchical clustering, to this sample. This creates an initial set of
clusters to work with.
For each of these generated clusters, we pick k representative points, such that these points are as
dispersed as possible within the cluster.
For example, say k = 4. For each cluster, pick a point at random, then pick the furthest point from
that point (within the same cluster), then pick the furthest point (within the same cluster) from
those two points, and repeat one more time to get the fourth representative point.
Then copy each representative point and move that copy some fixed fraction (e.g. 0.2) closer to
the cluster’s centroid. These copied points are called “synthetic points” (we use them so we don’t
actually move the datapoints themselves). These synthetic points are the representatives we end up
using for each cluster.
For the second pass, we then iterate over each point p in the entire dataset. We assign p to its
closest cluster, which is the cluster that has the closest representative point to p.
18.3 References
• How HDBSCAN Works. Leland McInnes.
• Thoughtful Machine Learning. Matthew Kirk. 2015.
• Comparing Python Clustering Algorithms. Leland McInnes.
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley
(edX).
• Mean Shift Clustering. Matt Nedrich.
• Mining Massive Datasets (Coursera & Stanford, 2014). Jure Leskovec, Anand Rajaraman, Jeff
Ullman.
• Example of matrix factorization. MH1200.
• Non-negative matrix factorization. Wikipedia.
19
In Practice
19.1 Machine Learning System Design
Before you start building your machine learning system, you should:
• Be explicit about the problem.
– Start with a very specific and well-defined question: what do you want to predict, and
what do you have to predict it with?
• Brainstorm some possible strategies.
– What features might be useful?
– Do you need to collect more data?
• Try and find good input data
– Randomly split data into:
* training sample
* testing sample
* if enough data, a validation sample too
• Use features of or features built from the data that may help with prediction
Then to start:
• Start with a simple algorithm which can be implemented quickly.
– Apply a machine learning algorithm
– Estimate the parameters for the algorithm on your training data
• Test the simple algorithm on your validation data, evaluate the results
• Plot learning curves to decide where things need work:
– Do you need more data?
– Do you need more features?
– And so on.
• Error analysis: manually examine the examples in the validation set that your algorithm made
errors on. Try to identify patterns in these errors. Are there categories of examples that the
model is failing on in particular? Are there any other features that might help?
If you have an idea for a feature which may help, it’s best to just test it out. This process is much
easier if you have a single metric for your model’s performance.
19.2 Machine learning diagnostics
In machine learning, a diagnostic is:
A test that you can run to gain insight [about] what is/isn’t working with a learning
algorithm, and gain guidance as to how best to improve its performance.
They take time to implement but can save you a lot of time by preventing you from going down
fruitless paths.
19.2.1
Learning curves
To generate a learning curve, you deliberately shrink the size of your training set and see how the
training and validation errors change as you increase the training set size. This way you can see how
your model improves (or doesn’t, if something unexpected is happening) with more training data.
With smaller training sets, we expect the training error will be low because it will be easier to fit to
less data. So as training set size grows, the average training set error is expected to grow. Conversely,
we expect the average validation error to decrease as the training set size increases.
If it seems like the training and validation error curves are flattening out at a high error as training
set size increases, then you have a high bias problem. The curves flattening out indicates that getting
more training data will not (by itself) help much.
On the other hand, high variance problems are indicated by a large gap between the training and
validation error curves as training set size increases. You would also see a low training error. In this
case, the curves are converging and more training data would help.
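A minimal sketch using scikit-learn's learning_curve helper; the estimator and the synthetic dataset are just placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=500, random_state=0)
sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# compare average training vs validation error as the training set grows
print(1 - train_scores.mean(axis=1))  # training error
print(1 - valid_scores.mean(axis=1))  # validation error
```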
19.2.2 Important training figures

Training figures (from Xiu-Shen Wei)
19.3 Large Scale Machine Learning
19.3.1
Map Reduce
You can distribute the workload across computers to reduce training time.
For example, say you’re running batch gradient descent with b = 400.
$$\theta_j := \theta_j - \alpha \frac{1}{400} \sum_{i=1}^{400} (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)}$$
You can divide up (map) your batch so that different machines calculate the error of a subset
(e.g. with 4 machines, each machine takes 100 examples) and then those results are combined
(reduced/summed) back on a single machine. So the summation term becomes distributed.
Map Reduce can be applied wherever your learning algorithm can be expressed as a summation over
your training set.
Map Reduce also works across multiple cores on a single computer.
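A minimal sketch of the map/reduce split for this gradient computation, simulated with a plain loop over chunks standing in for machines (a linear hypothesis h_theta(x) = theta^T x is assumed):

```python
import numpy as np

def partial_gradient(theta, X_chunk, y_chunk):
    """Map step: one machine computes the gradient sum over its chunk of examples."""
    errors = X_chunk @ theta - y_chunk   # h_theta(x) - y for a linear hypothesis
    return X_chunk.T @ errors            # sum_i (h(x_i) - y_i) * x_i

def distributed_gradient_step(theta, X, y, alpha, n_machines=4):
    chunks = zip(np.array_split(X, n_machines), np.array_split(y, n_machines))
    grad = sum(partial_gradient(theta, Xc, yc) for Xc, yc in chunks)  # reduce step
    return theta - alpha * grad / len(X)
```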
19.4 Online (live/streaming) machine learning
19.4.1
Distribution Drift
Say you train a model on some historical data and then deploy your model in a production setting
where it is working with live data.
It is possible that the distribution of the live data starts to drift from the distribution your model
learned. This change may be due to factors in the real world that influence the data.
Ideally you will be able to detect this drift as it happens, so you know whether or not your model needs
adjusting. A simple way to do it is to continually evaluate the model by computing some validation
metric on the live data. If the distribution is stable, then this validation metric should remain stable;
if the distribution drifts, the model starts to become a poor fit for the new incoming data, and the
validation metric will worsen.
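A minimal sketch of this kind of monitoring: keep a rolling window of per-example correctness on live labeled data and flag drift when the rolling metric falls past a threshold below the baseline (the window size and threshold are arbitrary choices):

```python
from collections import deque

class DriftMonitor:
    def __init__(self, baseline, window=500, threshold=0.05):
        self.baseline = baseline             # e.g. validation accuracy at deployment time
        self.scores = deque(maxlen=window)   # rolling window of per-example correctness
        self.threshold = threshold

    def update(self, y_true, y_pred):
        self.scores.append(1.0 if y_true == y_pred else 0.0)

    def drifting(self):
        if len(self.scores) < self.scores.maxlen:
            return False                     # not enough live data yet
        current = sum(self.scores) / len(self.scores)
        return current < self.baseline - self.threshold
```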
19.5 References
• Review of fundamentals, IFT725. Hugo Larochelle. 2012.
• Exploratory Data Analysis Course Notes. Xing Su.
19.5. REFERENCES
• Mining Massive Datasets (Coursera & Stanford, 2014). Jure Leskovec, Anand Rajaraman, Jeff
Ullman.
• Machine Learning. 2014. Andrew Ng. Stanford University/Coursera.
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley
(edX).
• Evaluating Machine Learning Models. Alice Zheng. 2015.
• Computational Statistics II (code). Chris Fonnesbeck. SciPy 2015.
• Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity.
• MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT.
• Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
• CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Neural Networks
Part 2: Setting up the Data and the Loss. Andrej Karpathy.
• POLS 509: Hierarchical Linear Models. Justin Esarey.
• Bayesian Inference with Tears. Kevin Knight, September 2009.
• Learning to learn, or the advent of augmented data scientists. Simon Benhamou.
• Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo
Larochelle, Ryan P. Adams.
• What is the expectation maximization algorithm?. Chuong B Do & Serafim Batzoglou.
• Gibbs Sampling for the Uninitiated. Philip Resnik, Eric Hardisty. June 2010.
• Maximum Likelihood Estimation. Penn State Eberly College of Science.
• Data Science Specialization. Johns Hopkins (Coursera). 2015.
• Practical Machine Learning. Johns Hopkins (Coursera). 2015.
• Elements of Statistical Learning. 10th Edition. Trevor Hastie, Robert Tibshirani, Jerome
Friedman.
• CS231n Convolutional Neural Networks for Visual Recognition, Linear Classification. Andrej
Karpathy.
• Must Know Tips/Tricks in Deep Neural Networks. Xiu-Shen Wei.
Part III
Artificial Intelligence
Search
19.6 State-space and situation-space representations
In artificial intelligence, problems are often represented using the state-space representation (sometimes called a state-transition system), in which the possible states of the problem and the operations that move between them are represented as a graph or a tree:
• Nodes are (abstracted) world configurations (states)
• Arcs represent successors (action results)
• A goal test is a set of goal nodes (which may just include a single goal)
• Each state occurs only once as a node
More formally, we consider a problem to have a set of possible starting states S, a set of operators
F which can be applied to the states, and a set of goal states G. A solution to a problem formalized
in this way, called a procedure, consists of a starting state s ∈ S and a sequence of operators that
define a path from s to a state in G. Typically a problem is represented as a tuple of these values,
(S, F, G).
The distinction between state-space and situation-space is as follows: if the relevant parts of the problem are fully specified (fully known), then we work with states and operators, and have a state-space problem. If there is missing information (i.e., the problem is partially specified), then we work with situations and actions (note that operators are often referred to as actions in state-space as well), and we have a situation-space problem. Most of what is said for state-space problems is applicable to situation-space problems.
For now we will focus on state-space problems.
This state-space model can be applied to itself, in that a given problem can be decomposed into
subproblems (also known as subgoals); the relationships between the problem and its subproblems
(and their subproblems’ subproblems, etc) are also represented as a graph. Successor relationships
can be grouped by AND or OR arcs which group edges together. A problem node with subproblems
linked by AND edges must have all of the grouped subproblems resolved; a problem with subproblems
linked by OR edges must have only one of the subproblems resolved. Using this graph, you can
identify a path of subproblems which can be used to solve the primary problem. This process is
known as problem reduction.
We take this state-space representation as the basis for a search problem.
19.6.1
Search problems (planning)
A search problem consists of:
• an initial state
• a set of possible actions/applicability conditions
• a successor function: from a state to a set of (action, state)
• the successor function plus the initial state is the state space (which is a directed graph as
described before)
• a path (i.e. a solution)
• a goal (a goal state or a goal test function)
• a path cost function (for optimality, generally it is the sum of the step costs)
To clarify some terminology:
• if node A leads to node B, then node A is a parent of B and B is a successor or child of A
– if the edge connecting A to B is due to an operator q, we say that “B is a successor to A
under the operator q”.
• if a node has no successors, it is a terminal
• if there is a path between node A and node C such that node A is a parent of a parent … of a
parent of C, then A is an ancestor of C and C is a descendant of A.
– if the graph is cyclical, e.g. there is a path from A through C back to A, then A is both
an ancestor and a descendant of C.
Practically, we may use a data structure for nodes that encapsulates the following information:

• state - a state in the state space
• parent node - the immediate predecessor in the search tree (only the root node has no parent)
• action - the action that, when performed in the parent node's state, leads to this node's state
• path cost - the path cost leading to this node
• depth - the depth of this node in the search tree
In the context of artificial intelligence, a path through state-space is called a plan - search is fundamental to planning (other aspects of planning are covered in more detail later).
19.6.2
Problem formulation
Problem formulation can itself be a problem, as it typically is with real-world problems. We have to
consider how granular/abstract we want to be and what actions and states to include. To make this
a bit easier, we typically make the following assumptions about the environment:
• finite and discrete
• fully observable
• deterministic
• static (no events)
And other assumptions are typically included as well:
• restricted goals
• sequential plans (no parallel activity in plans)
• implicit time (activities do not have a duration)
• offline planning (the state transition system is not changing while we plan)
19.6.3
Trees
In practice, we rarely build the full state-space graph in memory (because it is often way too big).
Rather, we work with trees.
Trees have a few constraints:
• only one node does not have a parent: the root node.
• every other node in the tree is a descendant of the root node
• every other node has only one parent
An additional term relevant to trees is depth, which is the number of ancestors a node has.
The root node is the current state and branches out into possible future states (i.e. the children are
successors).
Given a tree with branching factor b and maximum depth m, there are O(b^m) nodes in the tree.
These trees can get quite big, so often we can’t build the full tree either (it would be infinite if there
are circular paths in the state space graph). Thus we only build sections that we are immediately
concerned with.
To build out parts of the tree we are interested in, we take a node and apply a successor function
(sometimes called a generator function) to expand the node, which gives us all of that node’s
successors (children).
There is also often a lot of repetition in search trees, which some search algorithm enhancements
take advantage of.
19.7 Search algorithms
We apply algorithms to this tree representation in order to identify paths (ideally the optimal path)
from the root node (start state) to a goal node (a goal state, of which there may be many).
Most search algorithms share some common components:
• a fringe (sometimes called a frontier) of unexplored nodes is maintained
• some process for deciding which nodes to expand
The general tree search algorithm is as follows (a minimal code sketch follows the list):
• initialize the fringe with a search node for the initial state
• iteratively:
• if the fringe is empty, return a failure
• otherwise, select a node from the fringe based on the current search strategy
• if this node's state passes the goal test (or is the goal state), return the path to this node
• otherwise, expand the fringe with this node's children (successors)
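A minimal Python sketch of this loop, assuming the problem supplies a successors(state) function yielding (action, next_state) pairs and a goal_test(state) predicate; pop_from_fringe is where concrete algorithms differ:

def tree_search(initial_state, successors, goal_test, pop_from_fringe):
    # each fringe entry is a path: a list of states starting from the initial state
    fringe = [[initial_state]]
    while fringe:
        path = pop_from_fringe(fringe)  # select a path per the current search strategy
        if goal_test(path[-1]):
            return path                 # success: return the path to this node
        for action, successor in successors(path[-1]):
            fringe.append(path + [successor])  # expand with this node's children
    return None                         # failure: the fringe is empty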
Most search algorithms are based on this general structure, varying in how they choose which node
to expand from the fringe.
When considering search algorithms, we care about:
• completeness - is it guaranteed to find a solution if one exists?
• optimal - is it guaranteed to find the optimal solution, if one exists?
• size complexity - how much space does the algorithm need? Basically, how big can the fringe
get?
• time complexity - how does runtime change with input size? Basically, how many nodes get
expanded?
These search algorithms discussed here are unidirectional, since they only expand in one direction
(from the start state down, or, in some cases, from the terminal nodes up). However, there are also
bidirectional search procedures which start from both the start state and from the goal state. They
can be difficult to use, however.
19.8 Uninformed search
Uninformed search algorithms, sometimes called blind search algorithms, vary in how they decide
which node to expand.
Consider the following search space, where S is our starting point and G is our goal:
Example search space
19.8.1 Exhaustive ("British Museum") search
Exhaustively search all paths (without revisiting any previously visited points) - it doesn’t really matter
how you decide which node to expand because they will all be expanded.
"British Museum" search
19.8.2 Depth-First Search (DFS)
• time complexity: expands O(b^m) nodes (if m is finite)
• size complexity: the fringe takes O(bm) space
• complete if m is not infinite (i.e. if there are no cycles)
• optimal: no, it finds the "leftmost" solution
Go down the left branch of the tree (by convention) until you can’t go any further.
If that is not your target, then backtrack - go up to the closest branching node and take the other
leftmost path. Backtracking is a technique that appears in almost every search algorithm, where we
try extending a path, and if the extension fails or is otherwise unsatisfactory, we take a step back and
try a different successor.
Repeat until you reach your target.
Depth-first search
It stops at the first complete path it finds, which may not be the optimal path.
Another way to think about depth-first search is with a LIFO queue (i.e. a stack) which holds your candidate
paths as you construct them.
Your starting “path” includes just the starting point:
[(S)]
Then on each iteration, you take the left-most path (which is always the first in the queue) and check
if it reaches your goal.
If it does not, you extend it to build new paths, and replace it with those new paths.
[(SA), (SB)]
On this next iteration, you again take the left-most path. It still does not reach your goal, so you
extend it. And so on:
[(SABC), (SAD), (SB)]
[(SABCE), (SAD), (SB)]
You can no longer extend the left-most path, so just remove it from the queue.
[(SAD), (SB)]
Then keep going.
19.8.3 Breadth-First Search (BFS)
• time complexity: expands O(b^s) nodes, where s is the depth of the shallowest solution
• size complexity: the fringe takes O(b^s) space
• complete: yes
• optimal: yes, if all costs are 1; otherwise, a deeper path could have a cheaper cost
Build out the tree level-by-level until you reach your target.
Breadth-first search
In the queue representation, the only thing that is different from depth-first is that instead of placing
new paths at the front of the queue, you place them at the back. Another way of putting this is that
instead of a LIFO data structure for its fringe (as is used with DFS), BFS uses a FIFO data structure
for its fringe.
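To make the LIFO/FIFO contrast concrete, a hedged sketch reusing the generic tree_search loop sketched earlier (the successors and goal_test functions are assumed to come from the problem):

def depth_first_pop(fringe):
    return fringe.pop()     # LIFO: take the most recently added path (a stack)

def breadth_first_pop(fringe):
    return fringe.pop(0)    # FIFO: take the oldest path (a queue)

# e.g. dfs_path = tree_search(start, successors, goal_test, depth_first_pop)
#      bfs_path = tree_search(start, successors, goal_test, breadth_first_pop)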
19.8.4 Uniform Cost Search
We can make breadth-first search sensitive to path cost with uniform cost search (also known as
Dijkstra’s algorithm), in which we simply prioritize paths by their cost g(n) (that is, the distance
from the root node to n) rather than by their depth.
• time complexity: if we say the solution costs C* and arcs cost at least ε, then the "effective depth" is roughly C*/ε, so the time complexity is O(b^{C*/ε})
• size complexity: the fringe takes O(b^{C*/ε}) space
• complete: yes, if the best solution has finite cost and the minimum arc cost is positive
• optimal: yes
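A minimal sketch of uniform cost search, keeping the fringe as a priority queue keyed on the path cost g(n). The successors(state) function, assumed here, yields (action, next_state, step_cost) tuples:

import heapq

def uniform_cost_search(start, successors, goal_test):
    # fringe entries: (path cost g, tie-breaker, path)
    fringe = [(0, 0, [start])]
    counter = 1
    while fringe:
        g, _, path = heapq.heappop(fringe)   # cheapest path so far
        if goal_test(path[-1]):
            return path, g
        for action, nxt, step_cost in successors(path[-1]):
            heapq.heappush(fringe, (g + step_cost, counter, path + [nxt]))
            counter += 1
    return None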
19.8.5 Branch & Bound
On each iteration, extend the shortest cumulative path. Once you reach your goal, extend every other
extendible path to check that its length ends up being longer than your current path to the goal.
The fringe is kept sorted so that the shortest path is first.
Branch and bound search
This approach can be quite exhaustive, but it can be improved by using extended list filtering.
19.8.6 Iterative deepening DFS
The general idea is to combine depth-first search’s space advantage with breadth-first search’s
time/shallow-solution advantages.
• Run depth-first search with depth limit 1
• If no solution:
  – Run depth-first search with depth limit 2
• If no solution:
  – Run depth-first search with depth limit 3
• (etc)
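A hedged sketch of this loop, assuming a depth-limited DFS helper over the same successors/goal_test interface used earlier (max_limit is an arbitrary cap for illustration):

def depth_limited_dfs(start, successors, goal_test, limit):
    def recurse(path, depth):
        if goal_test(path[-1]):
            return path
        if depth == limit:
            return None                 # hit the depth limit; back up
        for action, nxt in successors(path[-1]):
            result = recurse(path + [nxt], depth + 1)
            if result is not None:
                return result
        return None
    return recurse([start], 0)

def iterative_deepening_dfs(start, successors, goal_test, max_limit=50):
    for limit in range(1, max_limit + 1):   # depth limit 1, 2, 3, ...
        result = depth_limited_dfs(start, successors, goal_test, limit)
        if result is not None:
            return result
    return None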
19.9 Search enhancements
19.9.1 Extended list filtering
Extended list filtering involves maintaining a list of visited nodes and only expanding nodes in the
fringe if they have not already been expanded - it would be redundant to search again from that
node.
For example, branch and bound search can be combined with extended list filtering to make it less
exhaustive.
Branch and bound search w/ extended list
19.10 Informed (heuristic) search
Informed search algorithms improve on uninformed search by incorporating heuristics which tell us
whether or not we’re getting closer to the goal. With heuristics, we can search less of the search
space.
In particular, we want admissible heuristics, which are simply heuristics that never overestimate the distance to the goal.
Formally, we can define the admissible heuristic as:
H(x, G) ≤ D(x, G)
That is, a heuristic is admissible if the estimated distance H(x, G) between a node x and the goal G is less than or equal to the actual distance D(x, G) between the node and the goal.
Note that sometimes inadmissible heuristics (i.e. those that sometimes overestimate the distance to
the goal) can still be useful.
The specific heuristic function is chosen depending on the particular problem (i.e. we estimate the
distance to the goal state differently in different problems, for instance, with a travel route, we might
estimate the cost with linear distance to the target city).
The typical trade-off with heuristics is between simplicity/efficiency and accuracy.
The question of finding good heuristics, and doing so automatically, has been a big topic in AI
planning recently.
19.10.1 Greedy best-first search
Best-first search algorithms are those that select the next node from the fringe by argmin_n f(n), where f(n) is some evaluation function.
With greedy best-first search, the fringe is kept sorted by heuristic distance to the goal; that is,
f (n) = h(n).
This often ends up with a suboptimal path, however.
19.10.2 Beam Search
Beam search is essentially breadth-first search, but we set a beam width w which is the limit to the
number of paths you will consider at any level. This is typically a low number like 2, but can be
iteratively expanded (similar to iterative deepening) if necessary.
Beam search
The fringe is the same as in breadth-first search, but we keep only the w best paths as determined
by the heuristic distance.
Beam search is not complete, unless the iterative approach is used.
19.10.3 A* Search
A* is an extension of branch & bound search which includes (admissible) heuristic distances in its
sorting.
We define g(n) as the known distance from the root to the node n (this is what we sort the fringe
by in branch & bound search). We additionally define h(n) as the admissible heuristic distance from
the node n to a goal node.
With A* search, we simply sort the fringe by g(n) + h(n). That is, A* search is a best-first search
algorithm where f (n) = g(n) + h(n).
A* search is optimal if h(n) is admissible; that is, it never overestimates the distance to the goal. It
is complete as well; i.e. if a solution exists, A* will find it.
A* is also optimally efficient (with respect to the number of expanded nodes) for a given heuristic
function. That is, no other optimal algorithm is guaranteed to expand fewer nodes than A*.
Uniform cost search is a special case of A* where h(n) = 0, i.e. f (n) = g(n).
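A minimal A* sketch, sorting the fringe by f(n) = g(n) + h(n); the successors and heuristic functions are assumed to be problem-specific, and this is a sketch rather than an optimized implementation:

import heapq

def a_star(start, successors, goal_test, heuristic):
    # fringe entries: (f = g + h, tie-breaker, g, path)
    fringe = [(heuristic(start), 0, 0, [start])]
    counter = 1
    while fringe:
        f, _, g, path = heapq.heappop(fringe)
        node = path[-1]
        if goal_test(node):
            return path, g
        for action, nxt, step_cost in successors(node):
            g2 = g + step_cost
            heapq.heappush(fringe, (g2 + heuristic(nxt), counter, g2, path + [nxt]))
            counter += 1
    return None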
The downside of A* vs greedy best-first search is that it can be slower since it explores the space more thoroughly - it has worst case time and space complexity of O(b^l), where b is the branching factor (the number of successors per node on average) and l is the length of the path we're looking for.
Typically we are dealing with the worst case; the fringe usually grows exponentially. Sometimes the
time complexity is permissible, but the space complexity is problematic because there may simply not
be enough memory for some problems.
There is a variation of A* called iterative deepening A* (IDA*) which uses significantly less memory.
19.10.4 Iterative Deepening A* (IDA*)
Iterative deepening A* is an extension of A* which uses an iterative approach, searching up to a limit on the estimated total cost f(x) = g(x) + h(x) and increasing that limit until a solution is found.
19.11 Local search
Local search algorithms do not maintain a fringe; that is, we don’t keep track of unexplored alternatives. Rather, we continuously try to improve a single option until we can’t improve it anymore.
Instead of extending a plan, the successor function in local search takes an existing plan and just
modifies a part of it.
Local search is generally much faster and more memory efficient, but because it does not keep track
of unexplored alternatives, it is incomplete and suboptimal.
19.11.1 Hill-Climbing
A basic method in local search is hill climbing - we choose a starting point, move to the best neighboring state (i.e. closest as determined by the heuristic), and repeat until there are no better positions to move to - we've reached the top of the hill. As mentioned, this is incomplete and suboptimal, as it can end up in local maxima.
Hill-climbing search
The difference between hill climbing and greedy search is that with greedy search, the entire fringe is
sorted by heuristic distance to the goal. With hill climbing, we only sort the children of the currently
expanded node, choosing the one closest to the goal.
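A minimal hill-climbing sketch; neighbors(state) and score(state) are assumed to be problem-specific, with score being something we want to maximize (e.g. the negative heuristic distance to the goal):

def hill_climbing(start, neighbors, score):
    current = start
    while True:
        candidates = list(neighbors(current))
        if not candidates:
            return current
        best = max(candidates, key=score)   # best neighboring state
        if score(best) <= score(current):
            return current                  # no better neighbor: a (possibly local) maximum
        current = best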
19.11.2 Other local search algorithms
You can also use simulated annealing (detailed elsewhere) to try to escape local maxima - and this
helps, and has a theoretical guarantee that it will converge to the optimal state given infinite time,
but of course, this is not a practical guarantee for real-world applications. So simulated annealing in
practice can do better but still can end up in local optima.
You can also use genetic algorithms (detailed elsewhere).
19.12 Graph search
Up until now we have considered search algorithms in the context of trees.
With search trees, we often end up with states repeated throughout the tree, which will have redundant
subtrees, and thus end up doing (potentially a lot of) redundant computation.
Instead, we can consider the search space as a graph.
Graph search algorithms are typically just slight modifications of tree search algorithms. One main
modification is the introduction of a list of explored (or expanded) nodes, so that we only expand
states which have not already expanded.
Completeness is not affected by graph search, but optimality is. We may close off a branch because we have already expanded that state elsewhere, but it's possible that the shortest path still goes through that state. Graph search algorithms (such as the graph search version of A*) can be made optimal through an additional constraint on admissible heuristics: consistency.
19.12.1 Consistent heuristics
The main idea of consistency is that the estimated heuristic costs should be less than or equal to
the actual costs for each arc between any two nodes, not just between any node and the goal state:
|H(x, G) − H(y, G)| ≤ D(x, y)
That is, the absolute value of the difference between the estimated distance between a node x and
the goal and the estimated distance between a node y and the goal is less than or equal to the
distance between the nodes x and y .
Consistency enforces this for any two nodes, which includes the goal node, so consistency implies
admissibility.
19.13 Adversarial search (games)
Adversarial search is essentially search for games involving two or more players.
There are many kinds of games - here we primarily consider games that are:
• deterministic (sometimes called "non-chance")
• two-player
• turn-based
• zero-sum: agents have opposite utilities (one's gain is another's loss), also known as a "pure competition" or "strictly competitive" game
• perfect information: every player has full knowledge of what state the game is in and what actions are possible
Note that while we are only considering zero-sum games, in general games agents have independent
utilities, so there is opportunity for cooperation, indifference, competition, and so on.
One way of formulating games (there are many) is as a tree:
• states S, starting with s_0
• players P = {1, …, n}, usually taking turns
• actions A (may depend on player/state)
• a transition function (analogous to a successor function), S × A → S
• a terminal test (analogous to a goal test): S → {t, f}
• a terminal utility function (computes how much an end/terminal state is worth to each player): S × P → R. For example, we may assign a utility of 100 for terminal states where we win, and -100 for terminal states where we lose.
We want our adversarial search algorithm to return a strategy (a policy), which is essentially a function which returns an action given a state. That is, a policy tells us what action to take in a state - this is contrasted with a plan, which details a step-by-step procedure from start to finish. This is because we can't plan on opponents acting in a particular way, so we need a strategy to respond to their actions.
The solution then, for an adversarial search algorithm for a player is a policy S → A.
19.13.1 Minimax
In minimax, we start at the bottom of the tree (where we have utilities computed for terminal nodes), moving upwards. We propagate the terminal utilities through the graph up to the root node, propagating the utility at each depth that satisfies a particular criterion.
For nodes at depths that correspond to the opponent’s turns, we assume that the opponent chooses
their best move (that is, we assume they are a perfect adversary), which means we propagate the
minimum utility for us.
For nodes at depths that correspond to our turn, we want to choose our best move; that is, we
propagate the maximum utility for us.
The propagated utility is known as the backed-up evaluation.
Minimax
At the end, this gives us a utility for the root node, which gives us a value for the current state.
Minimax is just like exhaustive depth-first search, so its time complexity is O(b^m) and space complexity is O(bm).
Minimax is optimal against a perfect adversarial player (that is, an opponent that always takes their best action), but not otherwise.
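A minimal recursive minimax sketch for a two-player zero-sum game, assuming the game provides is_terminal(state), utility(state) (from the maximizing player's point of view), and successors(state) yielding (action, next_state) pairs:

def minimax_value(state, is_terminal, utility, successors, maximizing):
    if is_terminal(state):
        return utility(state)
    values = [
        minimax_value(child, is_terminal, utility, successors, not maximizing)
        for _, child in successors(state)
    ]
    # our turn: propagate the maximum; opponent's turn: propagate the minimum
    return max(values) if maximizing else min(values)

def minimax_decision(state, is_terminal, utility, successors):
    # choose the action whose resulting state has the highest backed-up value
    return max(
        successors(state),
        key=lambda pair: minimax_value(pair[1], is_terminal, utility, successors, False),
    )[0]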
Depth-limited minimax
Most interesting games have game trees far too deep to expand all the way to the terminal nodes.
Instead, we can use depth-limited search to only go down a few levels. However, since we don’t reach
the terminal nodes, their values never propagate up the tree. How will we compute the utility of any
given move?
We can introduce an evaluation function which computes a utility for non-terminal positions, i.e. it estimates the value of an action. For instance, with chess, you could just take the difference between the number of your units and the number of the opponent's units. Generally, moves that lower your opponent's unit count are better, but not always.
Iterative deepening minimax
Often in games there are some time constraints - for instance, the computer opponent should respond
within a reasonable amount of time.
Iterative deepening can be applied to minimax, running for a set amount of time and returning the best policy found thus far.
This type of algorithm is called an anytime algorithm because it has an answer ready at any time.
Generalizing minimax
If the game is not zero-sum or has multiple players, we can generalize minimax as such:
• terminal nodes have utility tuples
• node values are also utility tuples
• each player maximizes their own component
This can model cooperation and competition dynamically.
19.13.2 Alpha-Beta
We can further improve minimax by pruning the game tree; i.e. removing branches we know won’t
be worthwhile. This variation is known as alpha-beta search.
Alpha-Beta Minimax
Here we can look at branching and figure out a bound for describing its score.
First we look at the left-most branch and see the value 2 in its left-most terminal node. Since we are
looking for the min here, we know that the score for this branch node will be at most 2. If we then
look at the other terminal node, we see that it is 7 and we know the branch node’s score is 2.
At this point we can apply a similar logic to the next node up (where we are looking for the max).
We know that it will be at least 2.
So then we look at the next branch node and see that it will be at most 1. We don’t have to look
at the very last terminal node because now we know that the max node can only be 2. So we have
saved ourselves a little trouble.
In larger trees this approach becomes very valuable, since you are effectively discounting entire
branches and saving a lot of unnecessary computation. This allows you to compute deeper trees.
Note that with alpha-beta, the minimax value computed for the root is always correct, but the values
of intermediate nodes may be wrong, and as such, (naive) alpha-beta is not great for action selection.
Good ordering of child nodes improves upon this. With a "perfect ordering", time complexity drops to O(b^{m/2}).
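A hedged sketch of alpha-beta pruning over the same interface as the minimax sketch above; alpha tracks the best value the maximizer can already guarantee, beta the best the minimizer can guarantee:

def alpha_beta_value(state, is_terminal, utility, successors, maximizing,
                     alpha=float("-inf"), beta=float("inf")):
    if is_terminal(state):
        return utility(state)
    if maximizing:
        value = float("-inf")
        for _, child in successors(state):
            value = max(value, alpha_beta_value(child, is_terminal, utility,
                                                successors, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # prune: the minimizer will never let play reach this branch
        return value
    else:
        value = float("inf")
        for _, child in successors(state):
            value = min(value, alpha_beta_value(child, is_terminal, utility,
                                                successors, True, alpha, beta))
            beta = min(beta, value)
            if beta <= alpha:
                break  # prune
        return value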
Ordering
Generally you want to generate game trees so that successors to each node are ordered left-to-right
in descending order of their eventual backed-up evaluations (such an ordering is called the “correct”
ordering). Naturally, it is quite difficult to generate this ordering before these evaluations have been
computed.
Thus a plausible ordering must suffice. These are a few techniques for generating plausible orderings
of nodes:
• Generators first produce the most immediately desirable choices (though without regard to
possible consequences further on)
• Shallow search first generates some of the tree and then uses some static evaluation function to compute backed-up evaluations upwards to order the results.
• Dynamic generation, in which alpha-beta is applied to identify plausible branches of the game tree, then each branch is evaluated, which can cause the ordering to change.
19.14 Non-deterministic search
In many situations the outcomes of actions are uncertain. Another way of phrasing this is that actions
may be noisy.
Like adversarial search, non-deterministic search solutions take the form of policies.
19.14.1 Expectimax search
We can model uncertainty as a “dumb” adversary in a game.
Whereas in minimax we assume a “smart” adversary, and thus consider worst-case outcomes (i.e. that
the opponent plays their best move), with non-deterministic search, we instead consider average-case
outcomes (i.e. expected utilities). This is called expectimax search.
So instead of minimax’s min nodes, we have “chance” nodes, though we still keep max nodes. For a
chance node, we compute its expected utility as the weighted (by probability) average of its children.
Because we take the weighted average of children for a chance node’s utility, we cannot use alpha-beta
pruning as we could with minimax. There could conceivably be an unexplored child which increases
the expected utility enough to make that move ideal, so we have to explore all child nodes to be sure.
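A minimal expectimax sketch where chance nodes take a probability-weighted average of their children; the chance_successors(state) function assumed here yields (probability, next_state) pairs, and max/chance layers are assumed to alternate:

def expectimax_value(state, is_terminal, utility, max_successors, chance_successors, maximizing):
    if is_terminal(state):
        return utility(state)
    if maximizing:
        # max node: take the best value over our possible actions
        return max(
            expectimax_value(child, is_terminal, utility, max_successors, chance_successors, False)
            for _, child in max_successors(state)
        )
    # chance node: probability-weighted average of the children's values
    return sum(
        p * expectimax_value(child, is_terminal, utility, max_successors, chance_successors, True)
        for p, child in chance_successors(state)
    )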
Expectiminimax
We can have games that involve adversaries and chance, in which case we would have both minimax
layers and expectimax layers. This approach is called expectiminimax.
19.14.2 Monte Carlo Tree Search
Say you are at some arbitrary position in your search tree (it could be the start or somewhere further
along). You can treat the problem of what node to move to next as a multi-armed bandit problem
and apply the Monte Carlo search technique.
Multi-armed bandit
Say you have multiple options with uncertain payouts. You want to maximize your overall payout,
and it seems the most prudent strategy would be to identify the one option which consistently yields
better payouts than the other options.
However - how do you identify the best option, and do so quickly?
This problem is known as the multi-armed bandit problem, and a common strategy is based on
upper confidence bounds (UCB).
To start, you randomly try the options and compute confidence intervals for each option's payout:

x̄_i ± √(2 ln(n) / n_i)
where:
• x̄i is the mean payout for option i
• ni is the number of times option i was chosen
• n is the total number of trials
You take the upper bound of these confidence intervals and continue to choose the option with the highest upper bound. As you use this option more, its confidence interval will narrow (since you have collected more data on it), and eventually another option's confidence interval upper bound will be higher, at which point you switch to that option.
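A minimal sketch of UCB-style selection matching the formula above (the per-option payout and play counts are assumed to be tracked elsewhere, and each option is assumed to have been tried at least once):

import math

def ucb_score(mean_payout, n_i, n):
    # upper confidence bound: mean payout plus an exploration bonus
    return mean_payout + math.sqrt(2 * math.log(n) / n_i)

def pick_option(stats):
    # stats maps option -> (total_payout, times_chosen), with times_chosen >= 1
    n = sum(times for _, times in stats.values())  # total number of trials
    return max(
        stats,
        key=lambda opt: ucb_score(stats[opt][0] / stats[opt][1], stats[opt][1], n),
    )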
Monte Carlo Tree Search
At first, you have no statistical information about the child nodes to compute confidence intervals. So
you randomly choose a child and run Monte Carlo simulations down that branch to see the outcomes.
For each simulation run, you go along each node in the branch that was walked and increment its play count (i.e. number of trials) by 1, and if the outcome is a win, you increment its win count by 1 as well (this explanation assumes a game, but it generalizes to other cases).
You repeat this until you have enough statistics for the direct child nodes of your current position to
make a UCB choice as to where to move next.
You will need to run fewer simulations over time because you accumulate these statistics for the search tree.
First-Play Urgency (FPU)
A variation of MCTS where fixed scores are assigned to unvisited nodes.
19.14.3 Markov Decision Processes (MDPs)
MDPs are another way of modeling non-deterministic search.
MDPs are essentially Markov models, but there’s a choice of action.
In MDPs, there may be two types of rewards (which can be positive or negative):
• terminal rewards (i.e. those that come at the end, these aren’t always present)
• “living” rewards, which are given for each step (these are always present)
For instance, you could imagine a maze arranged on a grid. The desired end of the maze has a
positive terminal reward and a dead end of the maze has a negative terminal reward. Every nonterminal position in the maze also has a reward (“living” rewards) associated with it. Often these
living rewards are negative so that each step is penalized, thus encouraging the agent to find the
desired end in as few steps as possible.
The agent doesn’t have complete knowledge of the maze so every action has an uncertain outcome.
It can try to move north - sometimes it will successfully do so, but sometimes it will hit a wall
and remain in its current position. Sometimes our agent may even move in the wrong direction
(e.g. maybe a wheel gets messed up or something).
This kind of scenario can be modeled as a Markov Decision Process, which includes:
• a set of states s ∈ S
• a set of actions a ∈ A
• a transition function T(s, a, s′), sometimes called a state transition matrix
  – gives the probability that a from s leads to s′, i.e. P(s′|s, a)
  – also called the "model" or the "dynamics"
• a reward function R(s, a, s′) (sometimes just R(s) or R(s′)), sometimes called a utility function, which associates a reward (or penalty) with each state
• a discount γ
• a start state
• maybe a terminal state
MDPs, as non-deterministic search problems, can be solved with expectimax search.
MDPs are so named because we make the assumption that action outcomes depend only on the
current state (i.e. the Markov assumption).
The solution of an MDP is an optimal policy π∗ : S → A:
• gives us an action to take for each state
• an optimal policy maximizes expected utility if followed
• an explicit policy defines a reflex agent
In contrast, expectimax does not give us entire policies. Rather, it gives us an action for a single
state only. It’s similar to a policy, but requires re-computing at each step. Sometimes this is fine
because a problem may be too complicated to compute an entire policy anyways.
The objective of an MDP is to maximize the expected sum of all future rewards, i.e.

max E[∑_{t=0}^{∞} R_t]

Sometimes a discount factor γ ∈ [0, 1] is included, e.g. γ = 0.9, which decays future reward:

max E[∑_{t=0}^{∞} γ^t R_t]
Using this, we can define a value function V(s) for each state:

V^π(s) = E[∑_{t=0}^{∞} γ^t R_t | s_0 = s]
That is, it is the expected sum of future discounted reward provided we start in state s with policy
π.
This can be computed empirically via simulations. In particular, we can use the value iteration
algorithm.
With value iteration, we recursively calculate the value function, starting from the goal states, to get
the optimal value function, from which we can derive the optimal policy.
More formally - we want to recursively estimate the value V (s) of a state s. We do this by estimating
the value of possible successor states s ′ , discounting by γ, and incorporating the reward/cost of the
state R(s ′ ), across possible actions from s. We take the maximum of these estimates.
V(s) = max_a [γ ∑_{s′} P(s′|s, a) V(s′)] + R(s)
This method is called back-up.
In terminal states, we just set V (s) = R(s).
We estimate these values over all our states - these estimates eventually converge.
This function essentially defines the optimal policy - that is:
π(s) = argmax_a ∑_{s′} P(s′|s, a) V(s′)
(since it’s maximization we can drop γ and R(s))
Example: Grid World
Note that the X square is a wall. Every movement has an uncertain outcome, e.g. if the agent moves
to the east, it may only successfully do so with an 80% chance.
For R(s) = −0.01:

    A   B   C   D
0   →   →   →   +1
1   ↑   X   ←   -1
2   ↑   ←   ←   ↓
At C1 the agent plays very conservatively and moves in the opposite direction of the negative terminal
position because it can afford doing so many times until it accidentally randomly moves to another
position.
Similar reasoning is behind the policy at D2.
For R(s) = −0.03:

    A   B   C   D
0   →   →   →   +1
1   ↑   X   ↑   -1
2   ↑   ←   ←   ←
With a stronger step penalty, the agent finds it better to take a risk and move upwards at C1, since
it’s too expensive to play conservatively.
Similar reasoning is behind the change in policy at D2.
For R(s) = −2:

    A   B   C   D
0   →   →   →   +1
1   ↑   X   →   -1
2   →   →   →   ↑
With such a large movement penalty, the agent decides it’s better to “commit suicide” by diving into
the negative terminal node and end the game as soon as possible.
q-states
Each MDP state projects an expectimax-like search tree; that is, we build a search tree from the
current state detailing what actions can be taken and the possible outcomes for each action.
We can describe actions and states together as a q-state (s, a). When you’re in a state s and you
take an action a, you end up in this q-state (i.e. you are committed to action a in state s) and
the resolution of this q-state is described by the transition (s, a, s ′ ), described by the probability
which is given by transition function T (s, a, s ′ ). There is also a reward associated with a transition,
R(s, a, s ′ ), which may be positive or negative.
Utility sequences
How should we encode preferences for sequences of utilities? For example, should the agent prefer
the reward sequence [0, 0, 1] or [1, 0, 0]? It’s reasonable to prefer rewards closer in time, e.g. to
prefer [1, 0, 0] over [0, 0, 1].
We can model this by discounting, that is, decaying reward value exponentially. If a reward is worth 1 now, it is worth γ one step later, and worth γ^2 two steps later (γ is called the "discount" or "decay rate").
Stationary preferences are those which are invariant to the inclusion of another reward which delays
the others in time, i.e.:
[a1 , a2 , . . . ] ≻ [b1 , b2 , . . . ] ⇔ [r, a1 , a2 , . . . ] ≻ [r, b1 , b2 , . . . ]
Nonstationary preferences are possible, e.g. if the delay of a reward changes its value relative to other
rewards (maybe it takes a greater penalty for some reason).
With stationary preferences, there are only two ways to define utilities:
• Additive utility: U([r_0, r_1, r_2, …]) = r_0 + r_1 + r_2 + …
• Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ r_1 + γ^2 r_2 + …
Note that additive utility is just discounted utility where γ = 1.
For now we will assume stationary preferences.
If a game lasts forever, do we have infinite rewards? Infinite rewards make it difficult to come up with a good policy.
We can specify a finite horizon (like depth-limited search) and just consider only up to some fixed
number of steps. This gives us nonstationary policies, since π depends on the time left.
Alternatively, we can just use discounting, where 0 < γ < 1:
U([r_0, …, r_∞]) = ∑_{t=0}^{∞} γ^t r_t ≤ R_max / (1 − γ)
A smaller γ means a shorter-term focus (a smaller horizon).
Another way is to use an absorbing state. That is, we guarantee that for every policy, a terminal
state will eventually be reached.
Usually we use discounting.
Solving MDPs
We say that the value (utility) of a state s is V ∗ (s), which is the expected utility of starting in s and
acting optimally. This is equivalent to running expectimax from s.
While a reward is for a state in a single time step, a value is the expected utility over all paths from
that state.
The value (utility) of a q-state (s, a) is Q∗ (s, a), called a Q-value, which is the expected utility
starting out taking action a from state s and subsequently acting optimally. This is equivalent to
running expectimax from the chance node that follows from s when taking action a.
The optimal policy π ∗ (s) gives us the optimal action from a state s.
So the main objective is to compute (expectimax) values for the states, since this gives us the expected
utility (i.e. average sum of discounted rewards) under optimal action.
More concretely, we can define value recursively:
V*(s) = max_a Q*(s, a)

Q*(s, a) = ∑_{s′} T(s, a, s′)[R(s, a, s′) + γ V*(s′)]

These are the Bellman equations.
They can be more compactly written as:

V*(s) = max_a ∑_{s′} T(s, a, s′)[R(s, a, s′) + γ V*(s′)]
Again, because these trees can go on infinitely (or may just be very deep), we want to limit how far
we search (that is, how far we do this recursive computation). We can specify time-limited values,
i.e. define Vk (s) to be the optimal value of s if the game ends in k more time steps. This is equivalent
to depth-k expectimax from s.
To clarify, k = 0 is the bottom of the tree, that is, k = 0 is the last time step (since there are 0
more steps to the end).
We can use this with the value iteration algorithm to efficiently compute these Vk (s) values in our
tree:
• start with V0 (s) = 0 (i.e. with no time steps left, we have an expected reward sum of zero).
Note that this is a zero vector over all states.
• given a vector of Vk (s) values, do one ply of expectimax from each state:
V_{k+1}(s) = max_a ∑_{s′} T(s, a, s′)[R(s, a, s′) + γ V_k(s′)]
Note that since we are starting at the last time step k = 0 and moving up, when we compute Vk+1 (s)
we have already computed Vk (s ′ ), so this saves us extra computation.
Then we simply repeat until convergence. This converges if the discount is less than 1.
With the value iteration algorithm, each iteration has complexity O(S^2 A). There's no penalty for depth here, but the more states you have, the slower this gets.
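A hedged sketch of value iteration over an explicit MDP; here states, actions(s), a T(s, a) function returning (probability, s′) pairs, and R(s, a, s′) are assumed to be given, and the loop runs a fixed number of sweeps rather than testing convergence:

def value_iteration(states, actions, T, R, gamma, iterations=100):
    V = {s: 0.0 for s in states}   # V_0(s) = 0 for all states
    for _ in range(iterations):
        V_next = {}
        for s in states:
            acts = actions(s)
            if not acts:               # terminal state: no actions available
                V_next[s] = 0.0
                continue
            # one ply of expectimax: best action by expected discounted value
            V_next[s] = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for p, s2 in T(s, a))
                for a in acts
            )
        V = V_next
    return V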
The approximations get refined towards optimal values the deeper you go into the tree. However, the
policy may converge long before the values do - so while you may not have a close approximation of
values, the policy/strategy they convey early on may already be optimal.
Partially-Observable MDPs (POMDPs)
Partially-observed MDPs are MDPs in which the states are not (fully) observed. They include observations O and an observation function P (o|s) (sometimes notated O(s, o); it gives a probability
for an observation given a state).
When we take an action, we get an observation which puts us in a new belief state (a distribution
of possible states).
Partially-observable environments may require information-gathering actions in addition to goal-oriented actions. Such information-gathering actions may require detours from goals but may be
worth it in the long run. See the section on reinforcement learning for more.
With POMDPs the state space becomes very large because there are many (infinite) probability
distributions over a set of states.
As a result, you can’t really run value iteration on POMDPs, but you can use approximate Q-learning
(see the section on reinforcement learning) or a truncated (limited lookahead) expectimax approach
to approximate the value of actions.
In general, however, POMDPs are very hard/expensive to solve.
19.14.4 Decision Networks
Decision networks are a generalization of Bayes’ networks. Some nodes are random variables (these
are essentially embedded Bayes’ networks), some nodes are action variables, in which a decision is
made, and some nodes are utility functions, which computes a utility for its parent nodes.
For instance, an action node could be “bring (or don’t bring) an umbrella”, and a random variable
node could be “it is/isn’t raining”. These nodes may feed into a utility node which computes a utility
based on the values of these nodes. For instance, if it is raining and we don’t bring an umbrella, we
will have a very low utility, compared to when it isn’t raining and we don’t bring an umbrella, for
which we will have a high utility.
We want to choose actions that maximize the expected utility given observed evidence.
The general process for action selection is:
• instantiate all evidence
• set action node(s) each possible way
• calculate the posterior for all parents of the utility node, given the evidence
• calculate the expected utility for each action
• choose the maximizing action (it will vary depending on the observed evidence)
This is quite similar to expectimax/MDPs, except now we can incorporate evidence we observe.
An example decision network. Rectangles are action nodes, ellipses are chance nodes, and diamonds are utility nodes.
From Artificial Intelligence: Foundations of Computational Agents
Value of information
More evidence helps, but typically there is a cost to acquiring it. We can quantify the value of acquiring evidence as the value of information, to determine whether or not the additional evidence is worth the cost. We can compute this with a decision network.
The value of information is simply the expected gain in the maximum expected utility given the new
evidence.
For example, say someone hides 100 dollars behind one of two doors, and if we can correctly guess
which door it is behind, we get the money.
There is a 0.5 chance that the money is behind either door.
In this scenario, we can use the following decision network:
choose door → U
money door → U
Where choose door is the action variable, money door is the random variable, and U is the utility
node.
The utility function at U is as follows:
choose door   money door   utility
a             a            100
a             b            0
b             a            0
b             b            100
In this current scenario, our maximum expected utility is 50. That is, choosing either door a or b
gives us 100 × 0.5 = 50 expected utility.
How valuable is knowing which door the money is behind?
We can consider that if we know which door the money is behind, our maximum expected utility
becomes 100, so we can quantify the value of that information as 100 − 50 = 50, which is what
we’d be willing to pay for that information.
In this scenario, we get perfect information, because we observe the evidence “perfectly” (that is, our
friend tells us the truth and there’s no chance that we misheard them).
More formally, the value of perfect information of evidence E ′ , given existing evidence e (of which
there might be none), is:
VPI(E′|e) = (∑_{e′} P(e′|e) MEU(e, e′)) − MEU(e)
Properties of VPI:
• nonnegative: ∀E′, e : VPI(E′|e) ≥ 0, i.e. it is not possible for VPI to be negative (proof not shown)
• nonadditive: VPI(E_j, E_k|e) ≠ VPI(E_j|e) + VPI(E_k|e) (e.g. consider observing the same evidence twice - no more information is added)
• order-independent: VPI(E_j, E_k|e) = VPI(E_j|e) + VPI(E_k|e, E_j) = VPI(E_k|e) + VPI(E_j|e, E_k)
Also: generally, if the parents of the utility node are conditionally independent of another node Z given the current evidence e, then VPI(Z|e) = 0. Evidence has to affect the utility node's parents to actually affect the utility.
What’s the value of imperfect information? Well, we just say that “imperfect” information is perfect
information of a noisy version of the variable in question.
For example, say we have a “light level” random variable that we observe through a sensor. Sensors
always have some noise, so we add an additional random variable to the decision network (connected
to the light level random variable) which corresponds to the sensor’s light level measurement. Thus
the sensor’s observations are “perfect” in the context of the sensor random variable, because they are
exactly what the sensor observed, though technically they are noisy in the context of the light level
random variable.
19.15 Policies
19.15.1 Policy evaluation
How do we evaluate policies?
We can compute the values under a fixed policy. That is, we construct a tree based on the policy (it
is a much simpler tree because for any given state, we only have one action - the action the policy
says to take from that state), and then compute values from that tree.
More specifically, we compute the value of applying a policy π from a state s:
V^π(s) = ∑_{s′} T(s, π(s), s′)[R(s, π(s), s′) + γ V^π(s′)]

Again, since we only have one action to choose from, the max_a term has been removed.
We can use an approach similar to value iteration to compute these values, i.e.
V_0^π(s) = 0

V_{k+1}^π(s) = ∑_{s′} T(s, π(s), s′)[R(s, π(s), s′) + γ V_k^π(s′)]

This approach is sometimes called simple value iteration since we've dropped max_a.
This has complexity O(S^2) per iteration.
19.15.2 Policy extraction
Policy extraction is the problem opposite to policy evaluation - that is, given values, how do we
extract the policy which yields these values?
Say we have optimal values V ∗ (s). We can extract the optimal policy π ∗ (s) like so:
π*(s) = argmax_a ∑_{s′} T(s, a, s′)[R(s, a, s′) + γ V*(s′)]
That is, we do one step of expectimax.
What if we have optimal Q-values instead?
With Q-values, it is trivial to extract the policy, since the hard work is already captured by the Q-value:
π*(s) = argmax_a Q*(s, a)
19.15.3 Policy iteration
Value iteration is quite slow - O(S^2 A) per iteration. However, you may notice that the maximum
value calculated for each state rarely changes. The result of this is that the policy often converges
long before the values.
Policy iteration is another way of solving MDPs (an alternative to value iteration) in which we start
with a given policy and improve on it iteratively:
• First, we evaluate the policy (calculate utilities for the given policy until the utilities converge).
• Then we update the policy using one-step look-ahead (one-step expectimax) with the resulting
converged utilities as the future (given) values (i.e. policy extraction).
• Repeat until the policy converges.
Policy iteration is optimal and, under some conditions, can converge much faster.
More formally:
Evaluation: iterate values until convergence:
πi
Vk+1
(s) =
∑
s′
T (s, πk (s), s ′ )[R(s, πk (s), s ′ ) + γVkπi (s ′ )]
Improvement: compute the new policy with one-step lookahead:
π_{i+1}(s) = argmax_a ∑_{s′} T(s, a, s′)[R(s, a, s′) + γ V^{π_i}(s′)]
Policy iteration and value iteration are two ways of solving MDPs, and they are quite similar - they
are just variations of Bellman updates that use one-step lookahead expectimax.
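A hedged sketch of policy iteration using the same MDP interface as the value iteration sketch above (fixed evaluation sweeps stand in for "iterate until the utilities converge"):

def policy_iteration(states, actions, T, R, gamma, eval_sweeps=50):
    # start with an arbitrary policy: the first available action in each state
    policy = {s: (actions(s)[0] if actions(s) else None) for s in states}
    while True:
        # policy evaluation: iterate the fixed-policy Bellman update
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            V = {
                s: (sum(p * (R(s, policy[s], s2) + gamma * V[s2]) for p, s2 in T(s, policy[s]))
                    if policy[s] is not None else 0.0)
                for s in states
            }
        # policy improvement: one-step lookahead with the evaluated values
        changed = False
        for s in states:
            if not actions(s):
                continue
            best = max(
                actions(s),
                key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2]) for p, s2 in T(s, a)),
            )
            if best != policy[s]:
                policy[s], changed = best, True
        if not changed:
            return policy, V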
19.16 Constraint satisfaction problems (CSPs)
Search as presented thus far has been concerned with producing a plan or a policy describing how to
act to achieve some goal state. However, there are search problems in which the aim is to identify
the goal states themselves - such problems are called identification problems.
In constraint satisfaction problems, we want to identify states which satisfy a set of constraints.
We have a set of variables Xi , with values from a domain D (sometimes the domain varies according
to i , e.g. X1 may have a different domain than X2 ). We assign each variable Xi with a value from
its corresponding domain, each unique assignment of these variables (which may be partial, i.e. some
may be unassigned) is a state.
We want to satisfy a set of constraints on what combinations of values are allowed on different
subsets of variables. So we want to identify states which satisfy these constraints; that is, we want
to identify variable assignments that satisfy the constraints.
Constraints can be specified using a formal language, e.g. code stating that A ≠ B or something like that.
We can represent constraints as a graph.
In a binary CSP, each constraint relates at most two variables. We can construct a binary constraint
graph in which the nodes are variables, and arcs show constraints. We don’t need to specify what
the constraints are.
If we have constraints that are more than binary (that is, they relate more than just two variables),
we can represent the constraints as square nodes in the graph and link them to the variables they
relate (as opposed to representing constraints as the arcs themselves).
General-purpose CSP algorithms use this graph structure for faster search.
19.16.1 Varieties of CSPs
Variables may be:
• discrete, and come from
  – finite domains
  – infinite domains (integers, strings, etc)
• continuous
Constraints may be:
• unary (involve a single variable, this is essentially reducing a domain, e.g. A ≠ green)
• binary (involve a pair of variables)
• higher-order (involve three or more variables)
We may also have preferences, i.e. soft constraints. We can represent these as costs for each variable
assignment. This gives us a constraint optimization problem.
19.16.2 Search formulation
We can formulate CSPs as search problems using search trees or search graphs (in the context of
CSPs, they are called constraint graphs).
States are defined by the values assigned so far (partial assignments).
The initial state is the empty assignment, {}.
Successor functions assign a value to an unassigned variable (one at a time).
The goal test is to check if the current assignment is complete (all variables have values) and satisfies
all constraints.
Breadth-first search does not work well here because all the solutions will be at the bottom of the
search tree (all variables must have values assigned, and that happens only at the bottom).
Depth-first search does a little better, but it is very naive - it can make a mistake early on in its path,
but not realize it until reaching the end of a branch.
The main shortcoming with these approaches is that we aren’t checking constraints until it’s far too
late.
19.16.3 Backtracking search
Backtracking search is the basic uninformed search algorithm for solving CSPs. It is a simple augmentation of depth-first search.
Backtracking search is the basic uninformed search algorithm for solving CSPs. It is a simple augmentation of depth-first search.
Rather than checking the constraint satisfaction at the very end of a branch, we check constraints
as we go, i.e. we only try values that do not conflict with previous assignments. This is called an
incremental goal test.
Furthermore, we only consider one variable at a time in some order. Variable assignments are commutative (i.e. the order in which we assign them doesn’t matter, e.g. A = 1 and then B = 2 leads
to the same variable assignment as B = 2 then A = 1). So at one level, we consider assignments
for A, at the next, for B, and so on.
The moment we violate a constraint, we backtrack and try a different variable assignment.
Simple backtracking can be improved in a few ways:
• ordering
  – we can be smarter about the order in which we assign variables
  – we can be smarter about what we try for the next value for a variable
• filtering: we can detect failure earlier
• structure: we can exploit the problem structure
Backtracking pseudocode:

def backtracking(csp):
    def backtracking_recursive(assignment):
        if is_complete(assignment):
            return assignment
        var = select_unassigned_variable(csp.variables, assignment)
        for val in csp.order_domain_values(var, assignment):
            # only try values that don't conflict with previous assignments
            if is_consistent_with_constraints(val, assignment, csp.constraints):
                assignment[var] = val
                result = backtracking_recursive(assignment)
                if result is not None:  # if not a failure
                    return result
                else:  # otherwise, remove the assignment and try the next value
                    del assignment[var]
        return None  # failure
    return backtracking_recursive({})
Filtering
Filtering looks ahead to eliminate incompatible variable assignments early on.
With forward checking, when we assign a new variable, we look ahead and eliminate values for
other variables that we know will be incompatible with this new assignment. So when we reach that
variable, we only have to check values we know will not violate a constraint (that is, we only have to
consider a subset of the variable’s domain).
If we reach an empty domain for a variable, we know to back up.
With constraint propagation methods, we can check for failure ahead of time.
One constraint propagation method is arc consistency (AC3).
First, we must consider the consistency of an arc (here, in the context of binary constraints, but this
can be extended to higher-order constraints). In the context of filtering, an arc X → Y is consistent
if and only if for every x in the tail there is some y in the head which could be assigned without
violating a constraint.
An inconsistent arc can be made consistent by deleting values from its tail; that is, by deleting tail
values which lead to constraint-violating head values.
Note that since arcs are directional, a consistency relationship (edge) must be checked in both
directions.
We can re-frame forward checking as just enforcing consistency of arcs pointing to each new assignment.
A simple form of constraint propagation is to ensure all arcs in the CSP graph are consistent. Basically,
we visit each arc, check if its consistent, if not, delete values from its tail until it is consistent. If we
encounter an empty domain (that is, we’ve deleted all values from its tail), then we know we have
failed.
Note that if a value is deleted from the tail of a node, its incoming arcs must be re-checked.
We combine this with backtracking search by applying this filtering after each new variable assignment.
It’s extra work at each step, but it should save us backtracking.
Arc consistency (AC3) pseudocode:

def AC3(csp):
    queue = csp.all_arcs()
    while queue:
        from_node, to_node = queue.pop()
        if remove_inconsistent_values(from_node, to_node):
            # from_node's domain shrank, so arcs pointing at it must be re-checked
            for node in neighbors(from_node):
                queue.append((node, from_node))
    return csp

def remove_inconsistent_values(from_node, to_node):
    removed = False
    for x in list(domain[from_node]):  # iterate over a copy since we may remove values
        # if no value y in domain[to_node] allows (x, y) to satisfy the
        # constraint between from_node and to_node, delete x from the tail
        if not any(satisfies_constraint(x, y, from_node, to_node) for y in domain[to_node]):
            domain[from_node].remove(x)
            removed = True
    return removed
Arc consistency can be generalized to k-consistency:
• 1-consistency is node consistency, i.e. each node’s domain has a value which satisfies its own
unary constraints.
• 2-consistency is arc consistency: for each pair of nodes, any consistent assignment to one can
be extended to the other (“extended” meaning from the tail to the head).
• k-consistency: for each k nodes, any consistent assignment to k − 1 can be extended to the
kth node.
• 3-consistency is called path consistency
Naturally, a higher k consistency is more expensive to compute.
We can extend this further with strong k-consistency which means that all lower orders of consistency
(i.e. k − 1 consistency, k − 2 consistency, etc) are also satisfied. With strong k-consistency, no
backtracking is necessary - but in practice, it’s never practical to compute.
Ordering
One method for selecting the next variable to assign to is called minimum remaining values
(MRV), in which we choose the variable with the fewest legal values left in its domain (hence this
is sometimes called most constrained variable). We know this number if we are running forward
checking. Essentially we decide to try the hardest variables first so if we fail, we fail early on and thus
have to do less backtracking (for this reason, this is sometimes called fail-fast ordering).
For choosing the next value to try, a common method is least constraining value. That is, we try
the value that gives us the most options later on. We may have to re-run filtering to determine what
the least constraining value is.
Problem Structure
Sometimes there are features of the problem structure that we can use to our advantage.
For example, we may have independent subproblems (that is, we may have multiple connected components; i.e. isolated subgraphs), in which case we can divide-and-conquer.
In practice, however, you almost never see independent subproblems.
Tree-Structured CSPs
Some CSPs have a tree structure (i.e. have no loops). Tree-structured CSPs can be solved in O(nd^2) time, much better than the O(d^n) for general CSPs.
The algorithm for solving tree-structured CSPs is as follows:
1. Ordering: in a tree-structured CSP, we first choose a root variable, then order variables such that parents precede children.
2. Backward pass: starting from the end moving backwards, we visit each arc once (the arc
pointing from parent to child) and make it consistent.
3. Forward assignment: starting from the root and moving forward, we assign each variable so
that it is consistent with its parent.
This method has some nice properties:
• after the backward pass, all root-to-leaf arcs are consistent
• if root-to-leaf arcs are consistent, the forward assignment will not backtrack
Unfortunately, in practice you don’t typically encounter tree-structured CSPs.
Rather, we can improve an existing CSP's structure so that it is nearly tree-structured.
Sometimes there are just a few variables which prevent the CSP from having a tree structure.
With cutset conditioning, we assign values to these variables such that the rest of the graph is a
tree.
This, for example, turns binary constraints into unary constraints, e.g. if we have a constraint A ̸= B
and we fix B = green, then we can rewrite that constraint as simply A ̸= green.
Cutset conditioning with a cutset size c gives runtime O(d^c (n − c) d^2), so it is fast for a small c.
More specifically, the cutset conditioning algorithm:
1. choose a cutset (the variables to set values for)
2. instantiate the cutset in all possible ways (e.g. produce a graph for each possible combination
of values for the cutset)
3. for each instantiation, compute the residual (tree-structured) CSP by removing the cutset
constraints and replacing them with simpler constraints (e.g. replace binary constraints with
unary constraints as demonstrated above)
4. solve the residual CSPs
Unfortunately, finding the smallest cutset is an NP-hard problem.
There are other methods for improving the CSP structure, such as tree decomposition.
Tree decomposition involves creating “mega-variables” which represent subproblems of the original
problem, such that the graph of these mega-variables has a tree structure. For each of these mega-variables we consider valid combinations of assignments to its variables.
These subproblems must overlap in the right way (the running intersection property) in order to
ensure consistent solutions.
19.16.4 Iterative improvement algorithms for CSPs
Rather than building solutions step-by-step, iterative algorithms start with an incorrect solution and
try to fix it.
Such algorithms are local search methods in that they work with “complete” states (that is, all
variables are assigned, though constraints may be violated/unsatisfied), and there is no fringe.
Then we have operators which reassign variable values.
A very simple iterative algorithm:
• while not solved
• randomly select any conflicted variable
• select a value for it which violates the fewest constraints (the min-conflicts heuristic), i.e. hill climb with h(n) = number of violated constraints (a minimal sketch follows the list)
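A minimal sketch of this min-conflicts loop; here variables, domains, and a num_conflicts(var, value, assignment) counter are assumed to be supplied by the CSP, and max_steps is an arbitrary cap:

import random

def min_conflicts(variables, domains, num_conflicts, max_steps=10000):
    # start with a complete (probably conflicted) random assignment
    assignment = {v: random.choice(domains[v]) for v in variables}
    for _ in range(max_steps):
        conflicted = [v for v in variables if num_conflicts(v, assignment[v], assignment) > 0]
        if not conflicted:
            return assignment            # solved: no constraints violated
        var = random.choice(conflicted)  # randomly select any conflicted variable
        # reassign it to the value violating the fewest constraints
        assignment[var] = min(domains[var], key=lambda val: num_conflicts(var, val, assignment))
    return None                          # no solution found within max_steps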
In practice, this min-conflicts approach tends to perform quickly for randomly-generated CSPs; that
is, there are some particular CSPs which are very hard for it, but for the most part, it can perform in
almost constant time for arbitrarily difficult randomly-generated CSPs.
Though, again, unfortunately many real-world CSPs fall in this difficult domain.
19.17 Online Evolution
Multi-action adversarial games (assuming turn-based) are tricky because they have enormous branching factors. The problem is no longer what the best single action is for a turn - now we need to find the best sequence of actions to take. An evolutionary algorithm can be applied to select these actions in a method called online evolution, because the agent does not learn in advance (offline learning); rather, it learns the best moves while it plays.
Online evolution evolves the actions in a single turn and uses an estimation of the state at the end
of the turn as a fitness function. This is essentially a single iteration of rolling horizon evolution, a
method that evolves a sequence of actions and evolves new action sequences as those actions are
executed. In its application here, we have a horizon of just one turn.
An individual (to be evolved) in this context is a candidate sequence of actions for the turn. A basic genetic algorithm can be applied. The fitness function can include rollouts, e.g. to a depth of one extra turn, to incorporate how an opponent might counter-move, but this may not help performance.
19.18 References
• Introduction to Monte Carlo Tree Search. Jeff Bradberry.
• MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT.
• Introduction to Artificial Intelligence (2nd ed). Philip C. Jackson, Jr. 1985.
• Integrating Learning in a Multi-Scale Agent. Ben G. Weber. 2012.
• Planning Algorithms. Steven M. LaValle. 2006.
• Artificial Intelligence Planning. Dr. Gerhard Wickler, Prof. Austin Tate. The University of
Edinburgh (Coursera). 2015.
• Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity.
• Logical Foundations of Artificial Intelligence (1987) (Chapter 12: Planning)
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley
(edX).
• Artificial Intelligence: Foundations of Computational Agents. David Poole, Alan Mackworth.
• Algorithmic Puzzles. Anany Levitin, Maria Levitin. 2011.
• A way to deal with enormous branching factors. Julian Togelius. March 25, 2016.
• Online Evolution for Multi-Action Adversarial Games. Niels Justesen, Tobias Mahlmann, Julian
Togelius.
20 Planning
Planning is tricky because:
• environmental properties:
– systems may be stochastic
– there may be multiple agents in the system
– there may be partial observability (that is, the state of the system may not be fully known)
• agent properties:
– some information may be unknown
– plans are hierarchical (high level to low level parts)
In planning we represent the world in belief states, in which multiple states are possible (because
of incomplete information, we are not certain what the true state of the world is). Actions that are
taken can either increase or decrease the possible states, in some cases down to one state.
Sequences of actions can be defined as trees, where each branch is an action and each node is a state
or is an observation of the world. Then we search this tree for a plan which will satisfy the goal.
Broadly within planning, there are two kinds:
• domain-specific planning, in which the representations and techniques are tailored for a
particular problem (e.g. path and motion planning, perception planning, manipulation planning,
communication planning)
• domain-independent planning, in which generic representations and techniques are used
The field of planning includes many areas of research and different techniques:
• Domain modeling (HTN, SIPE)
• Domain description (PDDL, NIST PSL)
• Domain analysis (TIMS)
• Search methods (Heuristics, A*)
• Graph planning algorithms (GraphPlan)
• Partial-order planning (Nonlin, UCPOP)
• Hierarchical planning (NOAH, Nonlin, O-Plan)
• Refinement planning (Kambhampati)
• Opportunistic search (OPM)
• Constraint satisfaction (CSP, OR, TMMS)
• Optimization methods (NN, GA, ant colony optimization)
• Issue/flaw handling (O-Plan)
• Plan analysis (NOAH, Critics)
• Plan simulation (QinetiQ)
• Plan qualitative modeling (Excalibur)
• Plan repair (O-Plan)
• Re-planning (O-Plan)
• Plan monitoring (O-Plan, IPEM)
• Plan generalization (Macrops, EBL)
• Case-based planning (CHEF, PRODIGY)
• Plan learning (SOAR, PRODIGY)
• User interfaces (SIPE, O-Plan)
• Plan advice (SRI/Myers)
• Mixed-initiative plans (TRIPS/TRAINS)
• Planning web services (O-Plan, SHOP2)
• Plan sharing & comms (I-X, I-N-C-A)
20.1 An example planning problem
Planning approaches are often presented on toy problems, which can be quite different from real-world
problems. Namely, toy problems have a concise and exact description, but real-world problems seldom,
if ever, have an agreed-upon or unambiguous description. They also have important consequences,
whereas toy problems do not. But toy problems provide a standard way of comparing approaches.
Some example toy problems:
• the farmers, wolves, and the river
• the sliding-block puzzle
• the n-queens problem
• the Dock-Worker Robots (DWR) domain (i.e. the container/block stacking problem)
For the following notes on planning, we will use the Dock-Worker Robots problem as a toy problem:
• we have some containers
• we have some locations, connected by paths
• containers can be stacked onto pallets (no limit to the height of these stacks)
• we have robots (vehicles) that can have a container loaded onto them
• each location can only have one robot at a time
• we have cranes which can pick up and stack containers (one at a time)
The available actions are:
• move robot r from location l to some adjacent and unoccupied location l′
• take container c with empty crane k from the top of pile p, all located at the same location l
• put down container c held by crane k on top of pile p, all located at location l
• load container c held by crane k onto unloaded robot r, all located at location l
• unload container c with empty crane k from loaded robot r, all located at location l
20.2 State-space planning vs plan-space (partial-order) planning
There are two main approaches to planning: state-space planning and plan-space planning,
sometimes called partial-order planning.
• state-space planning:
  – finite search space
  – explicit representation of intermediate states
  – commits to specific action orderings
  – causal structure only implicit
  – search nodes relatively simple and successors are easy to compute
  – not great at handling goal interactions
• plan-space planning:
  – infinite search space
  – no explicit intermediate states
  – choice of actions and order are independent (no commitment to a particular ordering)
  – explicit representation of rationale
  – search nodes are complex and successors are expensive to compute
Nowadays, with good heuristics, state-space planning is the more efficient way of finding solutions.
20.3 State-space planning
With state-space planning, we generate plans by searching through state space.
20.3.1 Representing plans and systems
We can use a state-transition system as a conceptual model for planning. Such a system is
described by a 4-tuple Σ = (S, A, E, γ), where:
• S = {s1, s2, . . . } is a finite or recursively enumerable set of states
• A = {a1, a2, . . . } is a finite or recursively enumerable set of actions
• E = {e1, e2, . . . } is a finite or recursively enumerable set of events
• γ : S × (A ∪ E) → 2^S is a state transition function (note 2^S is the power set of S; an element of 2^S is itself a set of world states)
If a ∈ A and γ(s, a) ≠ ∅, then a is applicable in s. Applying a in s will take the system to some
s′ ∈ γ(s, a).
We can also represent such a state-transition system as a directed labelled graph G = (NG , EG ),
where:
• the nodes correspond to the states in S, i.e. NG = S
• there is an arc from s ∈ NG to s ′ ∈ NG , i.e. s → s ′ ∈ EG , with label u ∈ (A ∪ E) if and only
if s ′ ∈ γ(s, u).
A plan is a structure that gives us appropriate actions to apply in order to achieve some objective
when starting from a given state.
The objective can be:
• a goal state sg or a set of goal states Sg
• to satisfy some conditions over the sequence of states
• to optimize a utility function attached to states
• a task to be performed
A permutation of a solution (a plan) is a case in which some actions in the path to the solution can
have their order changed without affecting the success of the path (that is, the permuted path still
leads to the solution with the same cost). In this case, the actions are said to be independent.
Generally we have a planner which generates a plan and then passes the plan to a controller which
executes the actions in the plan. The execution of the action then changes the state of the system.
The system, however, changes not only via the controller’s actions but also through external events.
So the controller must observe the system using an observation function η : S → O and generate
the appropriate action.
Sometimes, however, there may be parts of the system which we cannot observe, so, given the
observations that could be collected, there may be many possible states of the world - this is the
belief state of the controller.
The system as it actually is is often different from how it was described to the planner (as Σ, which,
as an abstraction, loses some details). In dynamic planning, planning and execution are more closely
linked to compensate for this scenario (which is more the rule than the exception). That is, the
controller must supervise the plan, i.e. it must detect when observations differ from expected results.
The controller can pass this information to the planner as an execution status, and then the planner
can revise its plan to take into account the new state.
20.3.2 STRIPS (Stanford Research Institute Problem Solver)
The STRIPS representation gives us an internal structure to our states, which up until now have
been left as black boxes. It is based on first order predicate logic; that is, we have objects in our
domain, represented by symbols and grouped according to type, and these objects are related (such
relationships are known as predicates) to each other in some way.
For example, in the Dock-Worker Robot domain, one type of object is robot and each robot would
be represented with a unique symbol, e.g. robot1, robot2, etc.
We must specify all of this in a syntax that the planner can understand. The most common syntax
is PDDL (Planning Domain Definition Language). For example:
(define (domain dock-worker-robot)
(:requirements :strips :typing)
(:types
  location   ;there are several connected locations
  pile       ;is attached to a location; it holds a pallet and a stack of containers
  robot      ;holds at most 1 container; only 1 robot per location
  crane      ;belongs to a location to pickup containers
  container
)
(:predicates
  (adjacent ?l1 ?l2 - location)       ;location ?l1 is adjacent to ?l2
  (attached ?p - pile ?l - location)  ;pile ?p attached to location ?l
  (belong ?k - crane ?l - location)   ;crane ?k belongs to location ?l
  (at ?r - robot ?l - location)       ;robot ?r is at location ?l
  (occupied ?l - location)            ;there is a robot at location ?l
  (loaded ?r - robot ?c - container)  ;robot ?r is loaded with container ?c
  (unloaded ?r - robot)               ;robot ?r is empty
  (holding ?k - crane ?c - container) ;crane ?k is holding a container ?c
  (empty ?k - crane)                  ;crane ?k is empty
  (in ?c - container ?p - pile)       ;container ?c is within pile ?p
  (top ?c - container ?p - pile)      ;container ?c is on top of pile ?p
  (on ?c1 ?c2 - container)            ;container ?c1 is on container ?c2
)
)
Let L be a first-order language with finitely many predicate symbols, finitely many constant symbols,
and no function symbols (e.g. as defined with PDDL above).
A state in a STRIPS planning domain is a set of ground atoms of L. An atom is a predicate with
an appropriate number of objects (e.g. those we defined above). An atom is ground if all its objects
are real objects (rather than variables).
• a (ground) atom p holds in state s if and only if p ∈ s (this is the closed world assumption); i.e. it is “true”
• s satisfies a set of (ground) literals g (a literal is an atom that is either positive or negative, e.g. an atom or a negated atom), denoted s ⊨ g, if:
  – every positive literal in g is in s
  – every negative literal in g is not in s
Say we have the symbols loc1, loc2, p1, p2, crane1, r1, c1, c2, c3, pallet. An example
state for the DWR problem:
state = {
adjacent(loc1, loc2), adjacent(loc2, loc1),
attached(p1, loc1), attached(p2, loc1),
belong(crane1, loc1),
occupied(loc2),
empty(crane1),
at(r1,loc2),
unloaded(r1),
in(c1,p1), in(c3,p1),
on(c3,c1), on(c1,pallet),
top(c3,p1),
in(c2,p2),
on(c2,pallet),
top(c2,p2)
}
In STRIPS, a planning operator is a triple o = (name(o), precond(o), effects(o)), where:
• the name of the operator name(o) is a syntactic expression of the form n(x1 , . . . , xk ) where n is
a unique symbol and x1 , . . . , xk are all variables that appear in o (i.e. it is a function signature)
• the preconditions precond(o) and the effects effects(o) of the operator are sets of literals
(i.e. positive or negative atoms)
• the positive effects form the add list
• the negative effects form the delete list
An action in STRIPS is a ground instance of a planning operator (that is, we substitute constant
symbols for the variables, e.g. we are “calling” the operator, as in a function).
For example, we may have an operator named move(r,l,m) with the preconditions adjacent(l,m),
at(r,l), !occupied(m) and the effects at(r,m), occupied(m), !occupied(l), !at(r,l). An
action might be move(robot1, loc1, loc2), since we are specifying specific instances to operate
on.
In the PDDL syntax, this can be written:
(:action move
:parameters (?r - robot ?from ?to - location)
:precondition (and
(adjacent ?from ?to) (at ?r ?from)
(not (occupied ?to)))
:effect (and
(at ?r ?to) (occupied ?to)
(not (occupied ?from)) (not (at ?r ?from)) ))
This is a bit confusing because PDDL does not distinguish “action” from “operator”.
20.3.3 Other representations
Representations other than STRIPS include:
• propositional representation:
  – the world state is a set of propositions (i.e. only symbols, no variables)
  – actions consist of precondition propositions and propositions to be added and removed (i.e. there are no operators because we only have symbols)
  – the STRIPS representation is essentially the propositional representation but with first-order literals instead of propositions (i.e. the preconditions of an operator can be positive or negative)
• state-variable representation:
  – a state is a tuple of state variables {x1, . . . , xn}
  – an action is a partial function over states
These representations, however, can all be translated between each other.
20.3.4 Applicability and state transitions
When is an action applicable in a state?
Let L be the set of literals. L+ is the set of atoms that are positive literals in L and L− is the set of
all atoms whose negations are in L.
Let a be an action and s a state. a is applicable in s if and only if:
• precond+ (a) ⊆ s
• precond− (a) ∩ s = ∅
Which just says all positive preconditions must be true in the current state, and all negative preconditions must be false in the state.
The state transition function γ for an applicable action a in state s is defined as:
γ(s, a) = (s − effects− (a)) ∪ effects+ (a)
That is, we apply the delete list (remove those effects from the state) and apply the add list (add
those effects to the state).
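A minimal sketch of these two definitions in Python, treating a state as a set of ground atoms and an action as an object with precond_pos/precond_neg and effects_pos/effects_neg sets (these attribute names are an assumed convention for illustration, not from the notes):

def applicable(state, action):
    # all positive preconditions hold in s, and no negative precondition does
    return (action.precond_pos <= state and
            not (action.precond_neg & state))

def apply_action(state, action):
    # gamma(s, a) = (s - effects-(a)) | effects+(a)
    assert applicable(state, action)
    return (state - action.effects_neg) | action.effects_pos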
Finding actions applicable for a given state is a non-trivial problem, in particular because there may
be many, many available actions.
We can define an algorithm which will find the applicable actions for a given operator in a given state:
• initialize:
  – A is a set of actions, initially empty
  – op is the operator
  – precs is the list of remaining preconditions to be satisfied
  – v is the substitution for the variables of the operator
  – s is the given state
• function addApplicables(A, op, precs, v, s)
• if no positive preconditions remain:
  – for every negative precondition np in precs
    * if the state falsifies np, return
  – add v(op) to A
• else:
  – select the next positive precondition pp
  – for each proposition sp in s
    * extend v such that pp and sp match; the result is v′
    * if v′ is valid, then:
      · addApplicables(A, op, (precs − pp), v′, s)
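The same procedure as a runnable Python sketch; here propositions are tuples such as ('at', 'r1', 'loc2') and operator preconditions use '?'-prefixed variables (these conventions, and the split into positive/negative precondition lists, are assumptions for illustration):

def is_variable(term):
    return isinstance(term, str) and term.startswith('?')

def match(pattern, prop, subst):
    # try to match a precondition pattern against a ground proposition,
    # extending the substitution; return None if they do not match
    if len(pattern) != len(prop) or pattern[0] != prop[0]:
        return None
    new = dict(subst)
    for pat_term, obj in zip(pattern[1:], prop[1:]):
        if is_variable(pat_term):
            if pat_term in new and new[pat_term] != obj:
                return None
            new[pat_term] = obj
        elif pat_term != obj:
            return None
    return new

def substitute(literal, subst):
    return tuple(subst.get(t, t) for t in literal)

def add_applicables(A, op_name, precs_pos, precs_neg, subst, state):
    if not precs_pos:
        for np in precs_neg:
            if substitute(np, subst) in state:   # a negative precondition is violated
                return
        A.append((op_name, dict(subst)))         # record the applicable ground instance
    else:
        pp, rest = precs_pos[0], precs_pos[1:]
        for sp in state:
            extended = match(pp, sp, subst)
            if extended is not None:
                add_applicables(A, op_name, rest, precs_neg, extended, state)

For example (under the same assumed encoding), add_applicables(A, 'move', [('at','?r','?l'), ('adjacent','?l','?m')], [('occupied','?m')], {}, state) would append every applicable ground instance of the move operator to A.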
We can formally define a planning domain in STRIPS.
Given our function-free first-order language L, a STRIPS planning domain on L is a restricted
(meaning there are no events) state-transition system Σ = (S, A, γ) such that:
• S is a set of STRIPS states, i.e. sets of ground atoms
• A is a set of ground instances of some STRIPS planning operators O (i.e. actions)
• γ : S × A → S where
  – γ(s, a) = (s − effects−(a)) ∪ effects+(a) if a is applicable in s
  – γ(s, a) = undefined otherwise
• S is closed under γ
We can formally define a planning problem as a triple P = (Σ, si , g), where:
• Σ is the STRIPS planning domain (as described above)
• si ∈ S is the initial state
• g is a set of ground literals describing the goal such that the set of goal states is Sg = {s ∈
S|s ⊨ g} (as a reminder, s ⊨ g means s satisfies g)
In PDDL syntax, we can define the initial state like so:
(:init
(adjacent l1 l2)
(adjacent l2 l1)
;etc
)
and the goal like so:
(:goal (and
(in c1 p2) (in c2 p2)
;etc
))
We formally define a plan as any sequence of actions π = a1 , . . . , ak where k ≥ 0:
• The length of a plan π is |π| = k, i.e. the number of actions
• If π1 = a1 , . . . , ak and π2 = a1′ , . . . , aj′ are plans, their concatenation is the plan π1 · π2 =
a1 , . . . , ak , a1′ , . . . , aj′
• The extended state transition function for plans is defined as follows:
• γ(s, π) = s if k = 0 (that is, if π is empty)
• γ(s, π) = γ(γ(s, a1 ), a2 , . . . , ak ) if k > 0 and a1 is applicable in s
• γ(s, π) = undefined otherwise
A plan π is a solution for a planning problem P if γ(si , π) satisfies g.
A solution π is redundant if there is a proper subsequence of π that is also a solution for P.
A solution π is minimal if no other solution for P contains fewer actions than π.
20.3.5 Searching for plans
Forward Search
The basic idea is to apply standard search algorithms (e.g. breadth-first, depth-first, A*, etc.) to the
planning problem.
• the search space is a subset of the state space
• nodes correspond to world states
• arcs correspond to state transitions
• a path in the search space corresponds to a plan
Forward search is sound (if a plan is returned, it will indeed be a solution) and it is complete (if a
solution exists, it will be found).
Backward Search
Alternatively, we can search backwards from a goal state to the initial state.
First we define two new concepts:
An action a ∈ A is relevant for g if:
• g ∩ effects(a) ≠ ∅
• g+ ∩ effects−(a) = ∅
• g− ∩ effects+(a) = ∅
Essentially what this says is that the action must contribute to the goal (the first item) and must not
interfere with the goal (the last two items).
Relevance is the backward-search analogue of applicability.
The regression set of g for a relevant action a ∈ A is:
γ−1(g, a) = (g − effects+(a)) ∪ precond(a)
That is, it is the inverse of the state transition function.
When searching backwards, sometimes we end up with operators rather than actions (i.e. some of
the parameters are still variables). We could in theory branch out to all possible actions from this
operator by just substituting all possible values for the variable, but that will increase the branching
factor by a lot. Instead, we can do lifted backward search, in which we just stick with these partially
instantiated operators instead of actions.
Keeping variables in a plan, such as with lifted backward search, is called least commitment planning.
20.3.6 The FF Planner
The FF planner performs a forward state-space search; the basic strategy can be A* or enforced hill
climbing (EHC, a kind of best-first search where we commit to the first state that looks better than
all previous states we have looked at).
It uses a relaxed-problem heuristic hFF. The relaxed problem is constructed by ignoring the delete
lists of all the operators.
Then we solve this relaxed problem; this can be done in polynomial time:
• chain forward to build a relaxed planning graph
• chain backward to extract a relaxed plan from the graph
Then we use the length (i.e. number of actions) of the relaxed plan as a heuristic value (e.g. for A*).
For example, with the simplified DWR from before:
• move(r,l,l′)
  – precond: at(r,l), adjacent(l,l′)
  – effects: at(r,l′), not at(r,l)
• load(c,r,l)
  – precond: at(r,l), in(c,l), unloaded(r)
  – effects: loaded(r,c), not in(c,l), not unloaded(r)
• unload(c,r,l)
  – precond: at(r,l), loaded(r,c)
  – effects: unloaded(r), in(c,l), not loaded(r,c)
To get the relaxed problem, we drop all delete lists:
• move(r,l,l′)
  – precond: at(r,l), adjacent(l,l′)
  – effects: at(r,l′)
• load(c,r,l)
  – precond: at(r,l), in(c,l), unloaded(r)
  – effects: loaded(r,c)
• unload(c,r,l)
  – precond: at(r,l), loaded(r,c)
  – effects: unloaded(r), in(c,l)
Pseudocode for computing the relaxed planning graph (RPG):
• function computeRPG(A, si, g)
• F0 = si, t = 0
• while g ⊄ Ft do
  – t = t + 1
  – At = {a ∈ A | precond(a) ⊆ Ft−1}
  – Ft = Ft−1
  – for all a ∈ At do
    * Ft = Ft ∪ effects+(a)
  – if Ft = Ft−1 then return failure
• return [F0, A1, F1, . . . , At, Ft]
Pseudocode for extracting a plan from the RPG (in particular, the size of the plan, since this is a
heuristic calculation):
• function extractRPSize([F0, A1, F1, . . . , Ak, Fk], g)
• if g ⊄ Fk then return failure
• M = max{firstlevel(gi, [F0, . . . , Fk]) | gi ∈ g}
• for t = 0 to M do
  – Gt = {gi ∈ g | firstlevel(gi, [F0, . . . , Fk]) = t}
• for t = M down to 1 do
  – for all gt ∈ Gt do
    * select a : firstlevel(a, [A1, . . . , At]) = t and gt ∈ effects+(a)
    * for all p ∈ precond(a) do
      · Gfirstlevel(p,[F0,...,Fk]) = Gfirstlevel(p,[F0,...,Fk]) ∪ {p}
• return the number of selected actions
The firstlevel function tells us which layer (by index) a goal gi first appears in the planning graph.
This heuristic is not admissible (it is not guaranteed to return a minimal plan), but in practice it is
quite accurate, so it, and ideas inspired by it, are frequently used and represent the current state of the art.
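As an illustration of the forward (graph-building) half, here is a Python sketch that chains forward with delete lists ignored and returns the index of the first layer containing all goals. This is a cruder, level-based estimate rather than the full hFF value, which would additionally extract a relaxed plan from the layers and count its actions; the action attribute names follow the same assumed convention as the earlier sketches.

def relaxed_graph_levels(state, goals, actions):
    layers = [set(state)]
    while not goals <= layers[-1]:
        current = layers[-1]
        nxt = set(current)
        for a in actions:
            if a.precond_pos <= current:   # relaxed problem: delete lists ignored
                nxt |= a.effects_pos
        if nxt == current:                 # fixed point reached without the goals
            return float('inf')            # goals unreachable even in the relaxation
        layers.append(nxt)
    return len(layers) - 1                 # first layer where all goals appear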
20.4 Plan-space (partial-order) planning
Partial plans are like plans mentioned thus far (i.e. simple sequences of actions), but we also record
the rationale behind each action, e.g. to achieve the precondition of another action. In aggregate
these partial plans may form the solution to the problem (i.e. they act as component plans).
We also have explicit ordering constraints (i.e. these actions must occur in this order), so we can
have partial plans with partial order, which means that we can execute actions in parallel.
And as with lifted backward search, we may have variables in our actions as well.
We adjust the planning problem a bit - instead of achieving goals, we want to accomplish tasks.
Tasks are high-level descriptions of some activity we want to execute - this is typically accomplished
by decomposing the high-level task into lower-level subtasks. This is the approach for Hierarchical
Task Network (HTN) planning. There is a simpler version called STN planning as well.
Rather than searching through the state-space, we search through plan-space - a graph of partial
plans. The nodes are partially-specified plans, the arcs are plan refinement operations (which is why
this is called refinement planning), and the solutions are partial-order plans.
More concretely, if a plan is a set of actions organized into some structure, then a partial plan is:
• a subset of the actions
• a subset of the organizational structure
• temporal ordering of actions
• rationale: what the action achieves in the plan
• a subset of variable bindings
More formally, we define a partial plan as a tuple π = (A, ≺, B, L) where:
• A = {a1 , . . . , ak } is a set of partially-instantiated planning operators
• ≺ is a set of ordering constraints on A of the form (ai ≺ aj )
• B is a set of binding constraints on the variables of actions in A of the form x = y , x ̸= y or
x ∈ Dx
• L is a set of causal links of the form ai → [p] → aj such that:
• ai , aj are actions in A
• the constraint (ai ≺ aj ) is in ≺
• the proposition p is an effect of ai and a precondition of aj
• the binding constraints for variables in ai and aj appearing in p are in B
Note that for causal links, ai is the producer in the causal link and aj is the consumer.
20.4.1 Plan refinement operations
Adding actions
With least-commitment planning, we only want to add actions (more specifically, we are adding
partially-instantiated operators) if it’s justified:
• to achieve unsatisfied preconditions
• to achieve unsatisfied goal conditions
Actions can be added anywhere in the plan.
Note that each action that we add has its own set of variables, unrelated to those of other actions.
Adding causal links
Causal links link a provider (an effect of an action or an atom that holds in the initial state) to a
consumer (a precondition of an action or a goal condition). There is an ordering constraint here as
well, in that the provider must come before the consumer (but not necessarily directly before).
We add causal links to prevent interference with other actions.
Adding variable bindings
A solution plan must have actions, not partially-instantiated operators. Variable bindings are what
allow us to turn operators into actions.
Variable binding constraints keep track of possible values for variables, and can also specify co-designation (i.e. that certain variables must or must not have the same value). For example, with
causal links, there are variables in the producer that must be the same as the corresponding variables
in the consumer (because they are “carried over”). When two variables must share the same value,
we say they are unified.
Adding ordering constraints
Ordering constraints are just binary relations specifying the temporal order between actions in a plan.
Ordering constraints help us avoid possible interference. Causal links imply ordering constraints, and
some trivial ordering constraints are that all actions must come after the initial state and before the
goal.
20.4.2 The Plan-Space Search Problem
The initial search state includes just the initial state and the goal as “dummy actions”:
• an init action with no preconditions and with the initial state as its effects
• a goal action with the goal conditions as its preconditions and with no effects
We start with the empty plan: π0 = ({init, goal}, {(init ≺ goal)}, {}, {})
It includes just the two dummy actions, one ordering constraint (init before goal), and no variable
bindings or causal links.
We generate successors through one or more plan refinement operators:
• adding an action to A
• adding an ordering constraint to ≺
• adding a binding constraint to B
• adding a causal link to L

20.4.3 Threats and flaws
A threat in a partial plan is when we have an action that might occur in parallel with a causal link
and has an effect that is complementary to the condition we want to protect (that is, it interferes
with a condition we want to protect).
We can often get around this by adding an ordering constraint that requires the conflicting action to
follow the causal link, instead of occurring in parallel.
More formally, an action ak in a partial plan is a threat to a causal link ai → [p] → aj if and only if:
• ak has an effect ¬q that is possibly inconsistent with p, i.e. q and p are unifiable
• the ordering constraints (ai ≺ ak) and (ak ≺ aj) are consistent with ≺
• the binding constraints for the unification of q and p are consistent with B
That is, if we have one action which produces the precondition p, which is what we want, but another
action which may occur simultaneously and produces ¬q, where q and p are unifiable, then we have
a threat.
A flaw in a partial plan is either:
• an unsatisfied subgoal, e.g. a precondition of an action in A without a causal link that supports
it, or
• a threat
20.4.4 Partial order solutions
We consider a plan π = (A, ≺, B, L) a partial-order solution for a planning problem P if:
• its ordering constraints ≺ are not circular
• its binding constraints B are consistent
• it is flawless
20.4.5 The Plan-Space Planning (PSP) algorithm
The main principle is to refine the partial plan π while maintaining ≺ and B consistent until π has
no more flaws.
Basic operations:
• find the flaws of π
• select a flaw
• find ways of resolving the chosen flaw
• choose one of the resolvers for the flaw
• refine π according to the chosen resolver
The PSP procedure is sound and complete: whenever π0 can be refined into a solution plan,
PSP(π0) returns such a plan.
To clarify: soundness means that if PSP returns a plan, it is a solution plan; completeness means that
if there is a solution plan, PSP will return one.
Proof:
• soundness: ≺ and B are consistent at every stage of refinement
• completeness: induction on the number of actions in the solution plan
The general algorithm:
function PSP(plan):
    all_flaws = plan.open_goals() + plan.threats()
    if not all_flaws:
        return plan
    flaw = all_flaws.select_one()
    all_resolvers = flaw.get_resolvers(plan)
    if not all_resolvers:
        return failure
    resolver = all_resolvers.choose_one()
    new_plan = plan.refine(resolver)
    return PSP(new_plan)
Where the initial plan is the empty plan as described previously.
all_resolvers.choose_one is a non-deterministic decision (i.e. this is something we may need to
backtrack to; that is, if one resolver doesn’t work out, we need to try another branch).
all_flaws.select_one is a deterministic selection (we don’t need to backtrack to this because all
flaws must be resolved). The order is not important for completeness, but it is important for efficiency.
Implementing plan.open_goals(): we find unachieved subgoals incrementally:
• the goal conditions of π0 are the initial unachieved subgoals
• when adding an action: all preconditions are unachieved subgoals
• when adding a causal link, the protected proposition is no longer unachieved
Implementing plan.threats(): we find threats incrementally:
• there are no threats in the initial plan π0
• when adding an action anew to π = (A, ≺, B, L):
  – for every causal link (ai → [p] → aj) ∈ L
    * if (anew ≺ ai) or (aj ≺ anew), then move on to the next link
    * else, for every effect q of anew: if ∃σ : σ(p) = σ(¬q) then q of anew threatens (ai → [p] → aj)
• when adding a causal link (ai → [p] → aj) to π = (A, ≺, B, L):
  – for every action aold ∈ A
    * if (aold ≺ ai) or (aj ≺ aold), then move on to the next action
    * else, for every effect q of aold: if ∃σ : σ(p) = σ(¬q) then q of aold threatens (ai → [p] → aj)
Implementing flaw.get_resolvers(plan): for an unachieved precondition p of ag :
• add a causal link from an existing action:
  – for every action aold ∈ A, see if an existing action can be a provider for this precondition:
    * if (ag = aold) or (ag ≺ aold), then move on to the next action (i.e. if the existing action is the consumer or comes after the consumer, then it cannot be a producer)
    * else, for every effect q of aold, check whether the existing action produces an effect that unifies with the unachieved precondition: if ∃σ : σ(p) = σ(q) then adding aold → [σ(p)] → ag is a resolver
• add a new action and a causal link (i.e. create a new provider):
  – for every effect q of every operator o: if ∃σ : σ(p) = σ(q) then adding anew = o.newInstance() and anew → [σ(p)] → ag is a resolver
For an effect q of action at threatening ai → [p] → aj :
• order the action before the threatened link:
  – if (at = ai) or (aj ≺ at), then this is not a resolver
  – else adding (at ≺ ai) is a resolver
• order the threatened link before the action:
  – if (at = ai) or (at ≺ ai), then this is not a resolver
  – else adding (aj ≺ at) is a resolver
• extend the variable bindings such that the unification fails:
  – for every variable v in p or q
    * if v ≠ σ(v) is consistent with B then adding v ≠ σ(v) is a resolver
Implementing plan.refine(resolver): refines a partial plan by adding elements specified in
resolver, i.e.:
• an ordering constraint;
• one or more binding constraints;
• a causal link; and/or
• a new action
This may introduce new flaws, so we must update the flaws (i.e. plan.open_goals() and plan.threats()).
Implementing ordering constraints: ordering constraint management can be implemented as an independent module with two operations:
• querying whether (ai ≺ aj )
• adding (ai ≺ aj )
CHAPTER 20. PLANNING
599
20.5. TASKS
600
Implementing variable binding constraints:
Types of constraints:
• unary constraints: x ∈ Dx
• equality constraints: x = y
• inequalities: x ≠ y
Unary and equality constraints can be dealt with in linear time, but inequality constraints cause
exponential complexity here - with inequalities, this is a general constraint satisfaction problem which
is NP-complete. So these variable binding constraints can become problematic.
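To illustrate why unary and equality constraints are cheap, here is a minimal union-find sketch for variable bindings (an illustrative assumption, not the notes' implementation); inequality constraints are deliberately left out, since handling them is what turns this into a general, NP-complete CSP:

class Bindings:
    def __init__(self):
        self.parent = {}   # union-find parent pointers
        self.domain = {}   # optional domain restriction per representative

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path compression
            x = self.parent[x]
        return x

    def restrict(self, x, values):
        # unary constraint x in Dx; returns False if the domain becomes empty
        r = self.find(x)
        current = self.domain.get(r)
        self.domain[r] = set(values) if current is None else current & set(values)
        return bool(self.domain[r])

    def unify(self, x, y):
        # equality constraint x = y; merge classes and intersect their domains
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return True
        dx, dy = self.domain.pop(rx, None), self.domain.get(ry)
        self.parent[rx] = ry
        if dx is None and dy is None:
            return True
        merged = dx if dy is None else (dy if dx is None else dx & dy)
        self.domain[ry] = merged
        return bool(merged)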
20.4.6 The UC Partial-Order Planning (UCPoP) Planner
This is a slight variation of PSP as outlined above, in which threats are instead dealt with after an
open goal is dealt with (that is, it deals with threats from the resolver that was used to deal with an
open goal, which is to say that threats are resolved as part of the successor generation process).
UCPoP takes in a slightly different input as well. In addition to the partial plan, it also takes in an
agenda, which is a set of (a, p) pairs where a is an action and p is one of its preconditions (i.e. this
is a list of things that still need to be dealt with, that is the remaining open goal flaws).
20.5 Tasks
With task network planning, we still have terms, literals, operators, actions, state transition functions,
and plans.
A few new things are added:
• the tasks to be performed
• methods describing ways in which tasks can be performed
• organized collections of tasks called task networks
Formally:
• we have task symbols Ts = {t1 , . . . , tn } for giving unique names to tasks
• the operator names must be ⊊ Ts (that is, it must be a subset of the task symbols and cannot
be equal to the entire set). Operator names that correspond to task symbols are called primitive
tasks.
• non-primitive task symbols are Ts − operator names, i.e. task symbols with no corresponding
operators.
• task: ti (r1 , . . . , rk )
• ti is the task symbol (primitive or non-primitive)
• r1 , . . . , rk are terms or objects manipulated by the task
600
CHAPTER 20. PLANNING
601
20.5. TASKS
• a ground task is one in which all its parameters are ground (i.e. they are actual objects, not
variables)
• action: a = op(c1 , . . . , ck ), where op is an operator name and c1 , . . . , ck are constants representing parameters for the action, accomplishes ground primitive task ti (r1 , . . . , rk ) in state s
if and only if:
• name(a) = ti , i.e. the name of the action must be the task symbol, and c1 = r1 and . . . and
ck = rk (i.e. the parameters must be the same)
• a is applicable in s
20.5.1
Simple Task Networks (STN)
We can group tasks into task networks.
A simple task network w is an acyclic directed graph (U, E) in which:
• the node set U = {t1, . . . , tn} is a set of tasks
• the edges in E define a partial ordering of the tasks in U
A task network w is ground/primitive if all tasks tu ∈ U are ground/primitive, otherwise it is
unground/non-primitive.
A task network may also be totally ordered or partially ordered. We have an ordering tu ≺ tv in w if
there is a path from tu to tv .
A network w is totally ordered if and only if E defines a total order on U (that is, every node is
ordered with respect to every other node in the network). If w is totally ordered, we can represent it
as a sequence of tasks t1, . . . , tn.
Let w = t1 , . . . , tn be a totally ordered, ground, primitive STN. Then the plan π(w ) is defined as:
π(w ) = a1 , . . . , an
Where ai = ti , 1 ≤ i ≤ n.
Simple task networks are a simplification of the more general hierarchical task networks (HTNs).
Example (DWR)
Tasks:
• t1 = take(crane, loc, c1 , c2 , p1 ) (primitive, because we have an operator of that same name in
the DWR domain, and ground, because all arguments here are objects and not variables)
• t2 = take(crane, loc, c2 , c3 , p1 ) (primitive, ground)
• t3 = move-stack(p1, q) (non-primitive, because we do not have an operator named “move-stack” in the DWR domain, and unground, because q is a variable)
Task networks:
• w1 = ({t1, t2, t3}, {(t1, t2), (t1, t3)}) (partially ordered, because we don’t have an order specified for t2, t3; non-primitive, because the non-primitive task t3 is included; and unground, because the unground task t3 is included)
• w2 = ({t1 , t2 }, {(t1 , t2 )}) (totally ordered, ground, primitive)
• π(w2 ) = t1 , t2
20.5.2 Methods
Methods are plan refinements (i.e. they correspond to state transitions in our search space).
Let MS be a set of method symbols. An STN method is a 4-tuple m = (name(m), task(m), precond(m), network(m))
where:
• name(m) is the name of the method
  – it is a syntactic expression of the form n(x1, . . . , xk) where:
    * n ∈ MS is a unique method symbol
    * x1, . . . , xk are all the variable symbols that occur in m
• task(m) is a non-primitive task (primitive tasks can just be accomplished by an operator) accomplished by this method
• precond(m) is a set of literals: the method’s preconditions
• network(m) is a task network (U, E) whose node set U is the set of subtasks of m
Example (DWR)
Say we want to define a method which involves taking the topmost container of a stack and moving
it. If you recall, we have the operators take and put, which we can use to define this method.
The name of the method could be take-and-put(c, k, l, po, pd, xo, xd). The task this method accomplishes would be move-topmost(po, pd). The preconditions would be top(c, po), on(c, xo), attached(po, l), belong(k, l).
Finally, the subtasks would be take(k, l, c, xo, po), put(k, l, c, xd, pd).
Where:
• c is the container to move
• k is the crane to use
• l is the location
• po is the original pile
• pd is the destination pile
• xo is the container from which we are taking c
• xd is the container on which we are placing c
Applicability and relevance
A method instance m is applicable in a state s if:
• precond+ (m) ⊆ s
• precond− (m) ∩ s = ∅
A method instance m is relevant for a task t if there is a substitution σ such that σ(t) = task(m)
Decomposition of tasks
The decomposition of an individual task t by a relevant method m under σ is either:
• δ(t, m, σ) = σ(network(m)) or
• δ(t, m, σ) = σ(subtasks(m)) if m is totally ordered
δ is called the decomposition method for a task given a method and a substitution.
That is, we break the task down into its subtasks.
The decomposition of tasks in a STN is as follows:
Let:
• w = (U, E) be a STN
• t ∈ U be a task with no predecessors in w
• m be a method that is relevant for t under some substitution σ with network(m) = (Um , Em )
The decomposition of t in w by m under σ is the STN δ(w , t, m, σ) where:
• t is replaced in U by σ(Um )
• edges in E involving t are replaced by edges to appropriate nodes in σ(Um )
That is, we replace the task with its subtasks.
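A rough Python sketch of this replacement step (an illustration under simplifying assumptions: the substitution σ is taken as already applied to the subtasks, and edges that touched t are conservatively rewired to every subtask):

def decompose(network, t, subtasks, subtask_edges):
    nodes, edges = network
    new_nodes = (nodes - {t}) | set(subtasks)
    new_edges = set(subtask_edges)             # the method's internal ordering
    for (u, v) in edges:
        if u == t and v == t:
            continue
        elif u == t:                           # edge out of t: now out of each subtask
            new_edges |= {(st, v) for st in subtasks}
        elif v == t:                           # edge into t: now into each subtask
            new_edges |= {(u, st) for st in subtasks}
        else:
            new_edges.add((u, v))
    return (new_nodes, new_edges)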
20.5.3 Planning Domains and Problems
An STN planning domain is a pair D = (O, M) where:
• O is a set of STRIPS planning operators
• M is a set of STN methods
D is a total-order STN planning domain if every m ∈ M is totally ordered.
An STN planning problem is a 4-tuple P = (si , wi , O, M) where:
• si is the initial state (a set of ground atoms)
• wi is a task network called the initial task network
• D = (O, M) is an STN planning domain
So it is quite similar to a STRIPS planning problem (just the domain also includes STN methods and
we have an initial task network).
P is a total-order STN planning problem if wi and D are both totally ordered.
A plan π = a1 , . . . , an is a solution for an STN planning problem P = (si , wi , O, M) if:
• wi is empty and π is empty (i.e. if we had no tasks, doing nothing is a solution)
or:
• there is a primitive task t ∈ wi that has no predecessors in wi (i.e. it is one of the first tasks)
• a1 = t is applicable in si
• π′ = a2, . . . , an is a solution for P′ = (γ(si, a1), wi − {t}, O, M) (i.e. recurse)
or:
• there is a non-primitive task t ∈ wi that has no predecessors in wi (i.e. it is one of the first
tasks)
• a method m ∈ M is relevant for t, i.e. σ(t) = task(m) and applicable in si
• π is a solution for P ′ = (si , δ(wi , t, m, σ), O, M)
20.5.4 Planning with task networks
The Ground Total order Forward Decomposition (Ground-TFD) procedure
function GroundTFD(s, (t1, ..., tk), O, M):
    if k = 0 then return []
    if t1.is_primitive() then
        actions = {(a, sigma) | a = sigma(t1) and a.applicable_in(s)}
        if not actions then return failure
        (a, sigma) = actions.choose_one()  # non-deterministic choice; we may need to backtrack to here
        plan = GroundTFD(gamma(s, a), sigma((t2, ..., tk)), O, M)
        if plan = failure then return failure
        else return [a] + plan
    else:
        methods = {(m, sigma) | m.relevant_for(sigma(t1)) and m.applicable_in(s)}
        if not methods then return failure
        (m, sigma) = methods.choose_one()
        tasks = subtasks(m) + sigma((t2, ..., tk))
        return GroundTFD(s, tasks, O, M)
TFD considers only applicable actions, much like forward search, and it only considers relevant actions,
much like backward search.
Ground-TFD can be generalized to Lifted-TFD, giving the same advantages as lifted backward search
(e.g. least commitment planning).
The Ground Partial order Forward Decomposition (Ground-PFD) procedure
function GroundPFD(s, w, O, M):
    if w.U = {} then return []
    task = {t in w.U | t.no_predecessors_in(w.E)}.choose_one()
    if task.is_primitive() then
        actions = {(a, sigma) | a = sigma(task) and a.applicable_in(s)}
        if not actions then return failure
        (a, sigma) = actions.choose_one()
        plan = GroundPFD(gamma(s, a), sigma(w - {task}), O, M)
        if plan = failure then return failure
        else return [a] + plan
    else:
        methods = {(m, sigma) | m.relevant_for(sigma(task)) and m.applicable_in(s)}
        if not methods then return failure
        (m, sigma) = methods.choose_one()
        return GroundPFD(s, delta(w, task, m, sigma), O, M)
20.5.5 Hierarchical Task Network (HTN) planning
HTN planning is more general than STN planning, which also means there is no single algorithm for
implementing HTN planning.
In STN planning, we had two types of constraints:
• ordering constraints, which were maintained in the network
• preconditions (constraints on a state before a method or action is applied):
  – enforced as part of the planning procedure
  – we must know the state to test for applicability
  – we must perform forward search
HTN planning has the flexibility to use these constraints or other arbitrary constraints as needed; that
is, it maintains more general constraints explicitly (in contrast, with STN planning the constraints
are embedded as part of the network or the planning). For instance, we could include constraints on
the resources used by the tasks.
HTN methods are different from STN methods. Formally:
Let MS be a set of method symbols. An HTN method is a 4-tuple m = (name(m), task(m), subtasks(m), constr(m))
where:
• name(m), task(m) are the same as with STN methods
• (subtasks(m), constr(m)) is a hierarchical task network, with subtasks (similar to STN methods) but also arbitrary constraints.
So the main difference between HTN and STN is that HTN can handle arbitrary constraints, which
makes it more powerful, but also more complex.
HTN vs STRIPS planning
The STN/HTN formalism is more expressive; you can encode more problems with this formalism
than you can with STRIPS. For example, STN/HTN planning can encode undecidable problems.
However, if you leave out the recursive aspects of STN planning, you can translate such a non-recursive
STN problem into an equivalent STRIPS problem, though the size of the problem may become
exponentially larger.
There is also a set of STN domains called “regular” STN domains which are equivalent to STRIPS.
It is important to note that STN/HTN and STRIPS planning are meant to solve different kinds of
problems - STN/HTN for task-based planning, and STRIPS for goal-based planning.
20.6 Graphplan
Like STRIPS, a planning problem for Graphplan consists of a set of operators constituting a domain,
an initial state, and set of goals that need to be achieved.
A major difference is that Graphplan works on a propositional representation, i.e. the atoms that
make up the world are no longer structured - they don’t consist of objects and their relationships
but of individual symbols (facts about the world) which can be either true or false. Actions are also
individual symbols, not parameterized actions. However, note that every STRIPS planning problem
can be translated into an equivalent propositional problem, so long as its operators have no negative
preconditions.
The Graphplan algorithm creates a data structure called a planning graph. The algorithm has two
major steps:
1. the planning graph is expanded with two new layers, an action layer and a proposition layer
2. the graph is searched for a plan
The initial layer is a layer of propositions that are true in the initial state. Then a layer of actions
applicable given this initial layer is added, followed by a proposition layer of those propositions that
would be true after these actions (including those that were true before which have not been altered
by the actions).
This expansion step runs in polynomial time (so it is quite fast).
The search step searches backwards, from the last proposition layer in the graph towards the initial
state. The search itself can be accomplished with something like A*. If a plan is not found, the
algorithm goes back to the first step.
An example Graphplan planning graph, from Jiří Iša
Example: Simplified DWR problem:
• location 1
• robot r
• container a
• location 2
• robot q
• container b
• robots can load and unload autonomously
• locations may contain unlimited number of robots and containers
• problem: swap locations of containers (i.e. we want container a at location 2 and container b
at location 1)
Here are the STRIPS operators we could use in this domain:
• move(r,l,l′)
  – precond: at(r,l), adjacent(l,l′)
  – effects: at(r,l′), not at(r,l)
• load(c,r,l)
  – precond: at(r,l), in(c,l), unloaded(r)
  – effects: loaded(r,c), not in(c,l), not unloaded(r)
• unload(c,r,l)
  – precond: at(r,l), loaded(r,c)
  – effects: unloaded(r), in(c,l), not loaded(r,c)
There are no negative preconditions here so we can translate this into a propositional representation.
Basically for each operator, we have to consider every possible configuration (based on the preconditions), and each configuration is given a symbol. For example, for move(r,l,l’), we have two
robots and two locations. So there are four possibilities for at(r,l) (i.e. at(robot r, location
1), at(robot q, location 1), at(robot r, location 2), at(robot q, location 2)), and
because we have only two locations, there is only one possibility for adjacent(l,l’), so we will
have eight symbols for that STRIPS operator in the propositional representation.
So for instance, we could use the symbol r1 for at(robot r, location 1), the symbol r2 for
at(robot r, location 2) , the symbol ur for unloaded(robot r), and so on.
Then we’d represent the initial state like so: {r1, q2, a1, b2, ur, uq}.
Our actions are also symbolically represented. For instance, we could use the symbol Mr12 for
move(robot r, location 1, location 2).
A propositional planning problem P = (Σ, si, g) has a solution if and only if Sg ∩ Γ>({si}) ≠ ∅,
where Γ>({si}) is the set of all states reachable from the initial state.
We can identify reachable states by constructing a reachability tree, where:
• the root is the initial state si
• the children of a node s are Γ({s})
• arcs are labeled with actions
All nodes in the reachability tree are denoted Γ>({si}).
All nodes up to depth d are Γd ({si }).
These trees are usually very large: there are O(k^d) nodes, where k is the number of applicable actions
per state. So we cannot simply traverse the entire tree.
Instead, we can construct a planning graph.
A planning graph is a layered (layers as mentioned earlier) directed graph G = (N, E), where:
• N = P0 ∪ A1 ∪ P1 ∪ A2 ∪ . . .
• P0 , P1 , . . . are state proposition layers
• A1 , A2 , . . . are action layers
The first proposition layer P0 has the propositions in the initial state si .
An action layer Aj has all actions a where precond(a) ⊆ Pj−1 .
A proposition layer Pj has all propositions p where p ∈ Pj−1 or ∃a ∈ Aj : p ∈ effects+ (a). Note that
we do not look at negative effects; we never remove negative effects from a layer.
As a result, both proposition layers and action layers increase (grow larger) monotonically as we move
forward through the graph.
We create arcs throughout the graph like so:
• from proposition p ∈ Pj−1 to action a ∈ Aj if p ∈ precond(a)
• from action a ∈ Aj to proposition p ∈ Pj:
  – a positive arc if p ∈ effects+(a)
  – a negative arc if p ∈ effects−(a)
• no arcs between other layers
If a goal g is reachable from the initial state si , then there will be some proposition layer Pg in the
planning graph such that g ⊆ Pg .
This is a necessary condition, but not sufficient because the planning graph’s proposition layers
contain propositions that may be true depending on the selected actions in the previous action layer;
furthermore, these proposition layers may contain inconsistent propositions (e.g. a robot cannot be
in two different locations simultaneously). Similarly, actions in an action layer may not be applicable
at the same time (e.g. a robot cannot move to two different locations simultaneously).
The advantage of the planning graph is that it is of polynomial size and we can evaluate this necessary
condition in polynomial time.
20.6.1
Action independence
Actions which cannot be executed simultaneously/in parallel are dependent, otherwise, they are
independent.
More formally:
Two actions a1 and a2 are independent if and only if:
• effects− (a1 ) ∩ (precond(a2 ) ∪ effects+ (a2 )) = ∅ and
• effects− (a2 ) ∩ (precond(a1 ) ∪ effects+ (a1 )) = ∅
(that is, they don’t interfere with each other)
A set of actions π is independent if and only if every pair of actions a1 , a2 ∈ π is independent.
In pseudocode:
function independent(a1, a2):
    for each p in neg_effects(a1):
        if p in precond(a2) or p in pos_effects(a2):
            return false
    for each p in neg_effects(a2):
        if p in precond(a1) or p in pos_effects(a1):
            return false
    return true
A set π of independent actions is applicable to a state s if and only if ∪a∈π precond(a) ⊆ s.
The result of applying the set π in s is defined as γ(s, π) = (s − effects−(π)) ∪ effects+(π), where:
• precond(π) = ∪a∈π precond(a)
• effects+(π) = ∪a∈π effects+(a)
• effects−(π) = ∪a∈π effects−(a)
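A one-step application of such a set in Python, with the same assumed attribute convention as the earlier sketches (only meaningful when the actions are pairwise independent and their joint preconditions hold in s):

def apply_action_set(state, pi):
    # gamma(s, pi) = (s - effects-(pi)) | effects+(pi)
    neg = set().union(*(a.effects_neg for a in pi))
    pos = set().union(*(a.effects_pos for a in pi))
    return (state - neg) | pos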
20.6.2 Independent action execution order
To turn a set of independent actions into a sequential plan:
If a set π of independent actions is applicable in state s, then for any permutation a1 , . . . , ak of the
elements of π:
• the sequence a1 , . . . , ak is applicable to s
• the state resulting from the application of π to s is the same as from the application of
a1 , . . . , ak , i.e. γ(s, π) = γ(s, (a1 , . . . , ak )).
Which is to say that the execution order doesn’t matter for a set of independent actions, we can
execute these actions in any order we like (intuitively, this makes sense, because none of them interfere
with each other, so they can happen in any order).
20.6.3 Layered plans
Let P = (A, si , g) be a statement of a propositional planning problem and let G = (N, E) be the
planning graph as previously defined.
A layered plan over G is a sequence of sets of actions Π = π1, . . . , πk where:
• πi ⊆ Ai ⊆ A
• πi is applicable in state Pi−1
• the actions of πi are independent
A layered plan Π is a solution to a planning problem P if and only if:
• π1 is applicable in si
• for j ∈ {2, . . . , k}, πj is applicable in state γ(. . . γ(γ(si, π1), π2), . . . , πj−1)
• g ⊆ γ(. . . γ(γ(si, π1), π2), . . . , πk)
20.6.4 Mutual exclusivity (mutex)
In addition to the nodes and edges thus far mentioned, there are also mutual exclusivity (mutex)
propositions and actions.
Two propositions in a layer may be incompatible if:
• the only actions which produce them are dependent actions
• they are the positive and negative effects of the same action
We introduce a No-Op operation for a proposition p, notated Ap. They carry p from one proposition
layer to the next. Thus their only precondition is p, and their only effect is p. These were implied
previously when we said that all propositions are carried over to the next proposition layer; we are
just making them explicit through these No-Op actions.
With these no-op actions, we can now encode the second reason for incompatible propositions (i.e. if
they are positive and negative effects of the same action) as the first (i.e. they are produced by
dependent actions), the dependent actions now being the no-op action and the original action.
We say that two propositions p and q in proposition layer Pj are mutex (mutually exclusive) if:
• every action in the preceding action layer Aj that has p as a positive effect (including no-op
actions) is mutex with every action in Aj that has q as a positive effect
• there is no single action in Aj that has both p and q as positive effects
Notation: µPj = {(p, q)|p, q ∈ Pj are mutex}
In pseudocode, this would be:
function mutex_proposition(p1, p2, mu_a_j):
    for each a1 in p1.producers:
        for each a2 in p2.producers:
            if (a1, a2) not in mu_a_j:
                return false
    return true
See below for the definition of mu_a_j.
Two actions a1 and a2 in action layer Aj are mutex if:
• a1 and a2 are not independent (i.e. they are dependent), or
• a precondition of a1 is mutex with a precondition of a2
Notation: µAj = {(a1 , a2 )|a1 , a2 ∈ Aj are mutex}
In pseudocode, this would be:
function mutex_action(a1, a2, mu_P):
    if not independent(a1, a2):
        return true
    for each p1 in precond(a1):
        for each p2 in precond(a2):
            if (p1, p2) in mu_P:
                return true
    return false
How do mutex relations propagate through the planning graph?
If p, q ∈ Pj−1 and (p, q) ∉ µPj−1 then (p, q) ∉ µPj.
Proof:
• if p, q ∈ Pj−1 then Ap, Aq ∈ Aj (reminder: Ap, Aq are the no-op actions for p and q respectively)
• if (p, q) ∉ µPj−1 then (Ap, Aq) ∉ µAj
• since Ap, Aq ∈ Aj and (Ap, Aq) ∉ µAj, (p, q) ∉ µPj must hold
If a1, a2 ∈ Aj−1 and (a1, a2) ∉ µAj−1 then (a1, a2) ∉ µAj.
Proof:
• if a1, a2 ∈ Aj−1 and (a1, a2) ∉ µAj−1 then:
  – a1, a2 are independent
  – their preconditions in Pj−1 are not mutex
• both properties remain true for Pj
• hence a1, a2 ∈ Aj and (a1, a2) ∉ µAj
So mutex relations decrease in some sense further down the planning graph.
An action whose preconditions p and q are mutex is impossible, and as such, we can remove it
from the graph.
20.6.5 Forward planning graph expansion
This is the process of growing the planning graph. Theoretically, the planning graph is infinite, but
we can set a limit on it given a planning problem P = (A, si , g).
If g is reachable from si, then there is a proposition layer Pg such that g ⊆ Pg and ∄g1, g2 ∈ g :
(g1, g2) ∈ µPg (that is, there is no pair of goal propositions that is mutually exclusive in the
proposition layer Pg).
The basic idea behind the Graphplan algorithm:
• expand the planning graph, one action layer and one proposition layer at a time
• stop expanding when we reach the first graph for which Pg is the last proposition layer such that:
  – g ⊆ Pg
  – ∄g1, g2 ∈ g : (g1, g2) ∈ µPg
• search backwards from the last proposition layer Pg for a solution
Pseudocode for the expand step:
• function expand(Gk−1)
• Ak = {a ∈ A | precond(a) ⊆ Pk−1 and {(p1, p2) | p1, p2 ∈ precond(a)} ∩ µPk−1 = ∅}
• µAk = {(a1, a2) | a1, a2 ∈ Ak, a1 ≠ a2, and mutex(a1, a2, µPk−1)}
• Pk = {p | ∃a ∈ Ak : p ∈ effects+(a)}
• µPk = {(p1, p2) | p1, p2 ∈ Pk, p1 ≠ p2, and mutex(p1, p2, µAk)}
• for all a ∈ Ak
  – prek = prek ∪ ({p | p ∈ Pk−1 and p ∈ precond(a)} × {a})
  – ek+ = ek+ ∪ ({a} × {p | p ∈ Pk and p ∈ effects+(a)})
  – ek− = ek− ∪ ({a} × {p | p ∈ Pk and p ∈ effects−(a)})
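A Python sketch of one such expansion step, under the same assumed action representation as earlier (ground propositional actions with precond/effects_pos/effects_neg sets, and with no-op actions already included in actions); mutex pairs are stored as frozensets:

def mutex_pair(x, y):
    return frozenset((x, y))

def expand_layer(P_prev, muP_prev, actions):
    def preconds_ok(a):
        if not a.precond <= P_prev:
            return False
        return not any(mutex_pair(p, q) in muP_prev
                       for p in a.precond for q in a.precond if p != q)

    A_k = [a for a in actions if preconds_ok(a)]      # applicable, non-mutex preconditions

    def dependent(a1, a2):
        return bool(a1.effects_neg & (a2.precond | a2.effects_pos) or
                    a2.effects_neg & (a1.precond | a1.effects_pos))

    def action_mutex(a1, a2):
        if dependent(a1, a2):
            return True
        return any(mutex_pair(p, q) in muP_prev
                   for p in a1.precond for q in a2.precond)

    muA_k = {mutex_pair(a1, a2) for i, a1 in enumerate(A_k)
             for a2 in A_k[i + 1:] if action_mutex(a1, a2)}

    P_k = set().union(*(a.effects_pos for a in A_k))  # negative effects are never recorded

    def prop_mutex(p, q):
        prod_p = [a for a in A_k if p in a.effects_pos]
        prod_q = [a for a in A_k if q in a.effects_pos]
        if any(a1 is a2 for a1 in prod_p for a2 in prod_q):
            return False                              # a single action achieves both
        return all(mutex_pair(a1, a2) in muA_k
                   for a1 in prod_p for a2 in prod_q)

    muP_k = {mutex_pair(p, q) for p in P_k for q in P_k
             if p != q and prop_mutex(p, q)}
    return A_k, muA_k, P_k, muP_k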
The size of a planning graph up to level k and the time required to expand it to that level are both
polynomial in the size of the planning problem.
Proof: given a problem size of n propositions and m actions, |Pj | ≤ n, |Aj | ≤ n + m, including no-op
actions. The algorithms for generating each layer and all link types are polynomial in the size of the
layer and we have a linear number of layers k.
Eventually a planning graph G will reach a fixed-point level, which is the kth level such that for all
i , i > k, level i of G is identical to level k, i.e. Pi = Pk , µPi = µPk , Ai = Ak , µAi = µAk .
This is because as Pi grows monotonically, µPi shrinks monotonically, and Ai , Pi depend only on
Pi−1 , µPi−1 .
20.6.6 Backward graph search
This is just a depth-first graph search, starting from the last proposition layer Pk , where the search
nodes are subsets of nodes from the different layers.
The general idea:
• let g be the set of goal propositions that need to be achieved at a given proposition layer Pj
(starting with the last layer)
• find a set of actions πj ⊆ Aj such that these actions are not mutex and together achieve g
• take the union of the preconditions of πj as the new (sub)goal set to be achieved in the
proposition layer Pj−1
When implementing this search, we want to keep track of, for each proposition layer, which sets of
subgoals have failed the search (i.e. are unachievable). The motivation is that if we search and run
into a failure state, we have to backtrack, and we don’t want to end up in the same failure state
some other way.
We use a nogood table ∇ to keep track of what nodes we’ve seen before. Up to layer k, the nogood
table is an array of k sets of sets of goal propositions. That is, for each layer we have a set of sets.
The inner sets represent a single combination of propositions that cannot be achieved. The outer set,
then, contains all combinations of propositions that cannot be achieved for that layer.
So before searching for the set g in a proposition layer Pj, we first check whether or not g ∈ ∇(j);
that is, we check whether g has already been determined unachievable for Pj.
Otherwise, if we do search for g in Pj and find that it is unachievable, we add g to ∇(j).
For this backward search, we define a function extract:
• function extract(G, g, i)
• if i = 0 then return ()
• if g ∈ ∇(i) then return failure
• Π = gpSearch(G, g, {}, i)
• if Π ≠ failure then return Π
• ∇(i) = ∇(i) + g
• return failure
The function gpSearch is defined as:
• function gpSearch(G, g, π, i )
• if g = {} then
– Π = extract(G, ∪a∈π precond(a), i − 1)
– if Π = failure then return failure
– return concat(Π, (π))
• p = g.selectOneSubgoal()
• providers = {a ∈ Ai | p ∈ effects+ (a) and ∄a′ ∈ π : (a, a′) ∈ µAi}
• if providers = {} then return failure
• a = providers.chooseOne() (we may need to backtrack to here)
• return gpSearch(G, g − effects+ (a), π + a, i)
We can combine everything into the complete Graphplan algorithm:
• function graphplan(A, si, g)
• i = 0, ∇ = [], P0 = si, G = (P0, {})
• while (g ⊈ Pi or g² ∩ µPi ≠ ∅) and not fixedPoint(G) do
– i = i + 1; expand(G)
• if g ⊈ Pi or g² ∩ µPi ≠ ∅ then return failure
• η = fixedPoint(G) ? |∇(k)| : 0 (ternary operator; k is the fixed-point level)
• Π = extract(G, g, i)
• while Π = failure do
– i = i + 1; expand(G)
– Π = extract(G, g, i)
– if Π = failure and fixedPoint(G) then
* if η = |∇(k)| then return failure
* η = |∇(k)|
• return Π
The Graphplan algorithm is sound, complete, and always terminates.
The plan that is returned (if the planning problem has a solution, otherwise, no plan is returned) will
have a minimal number of layers, but not necessarily a minimal number of actions.
It is orders of magnitude faster than the previously-discussed techniques due to the planning graph
structure (the backwards search still takes exponential time though).
20.7 Other considerations
20.7.1 Planning under uncertainty
Thus far all the approaches have assumed that the outcomes of actions are deterministic.
However, the outcomes of actions are often uncertain, so the resulting state is uncertain.
One approach is belief state search. A belief state is a set of world states, one of which is the true
state, but we don’t know which one. The resulting solution plan is a sequence of actions.
Another approach is contingency planning. The possible outcomes of an action are called contingencies. The resulting solution plan is a tree that branches according to contingencies. At branching
points, observation actions are included to see which branch is actually happening.
Both of these approaches are naturally way more complex due to multiple possible outcomes for
actions.
If we can quantify the degree of uncertainty (i.e. we know the probabilities of different outcomes given
an action) we have, we can use probabilistic planning. Instead of simple state transition systems,
we use a partially observable Markov decision process (POMDP) as our model:
• a set of world states S
• a set of actions A, actions are applicable in certain states: s ∈ S : A(s) ⊆ A
• cost function, gives the cost of an action in a given state: c(a, s) > 0 for s ∈ S and a ∈ A
• transition probabilities: Pa (s ′ |s) for s, s ′ ∈ S and a ∈ A (probability of state s ′ when executing
action a in state s)
• initial belief state (probability distribution over all states in S)
• final belief state (corresponds to the goal)
• solution (called a “policy”): a function from states to actions, i.e. given a state, this is the
action we should execute
• we want the optimal policy, i.e. the policy with the minimal expected cost
20.7.2 Planning with time
So far we have assumed actions are instantaneous but in reality, they take time. That is, we should
assume that our actions are durative (they take time). We can assume that actions take a known
amount of time, with a start time point and an end time point.
With A* we can include time as part of an action’s cost.
With partial plans (e.g. HTN) we can use a temporal constraint manager, which could be:
• time point networks: associates all time points in a given plan, where we assert relations
between different time points (e.g. that t1 < t2 )
• interval algebra: instead of relating time points, we relate the intervals that correspond to
the action execution (e.g. we assert that interval i1 must occur before i2 or that i3 must occur
during i4 )
One way of handling temporal constraints:
We can specify an earliest start time ES(s) and a latest start time LS(s) for each task/state in the
network.
We define ES(s0 ) = 0. For any other state:
ES(s) = maxA→s (ES(A) + duration(A))
Where A → s just denotes each predecessor state A of s.
We define LS(sg ) = ES(sg ). For any other state:
LS(s) = mins→B LS(B) − duration(s)
Where s → B just denotes each successor state of s.
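A minimal sketch of these two passes in Python, assuming the task network is a small DAG with known durations (the network, names, and numbers below are hypothetical):

durations = {"s0": 0, "a": 3, "b": 2, "sg": 0}
successors = {"s0": ["a", "b"], "a": ["sg"], "b": ["sg"], "sg": []}
predecessors = {"s0": [], "a": ["s0"], "b": ["s0"], "sg": ["a", "b"]}
order = ["s0", "a", "b", "sg"]  # a topological ordering of the network

# forward pass: ES(s) = max over predecessors A of ES(A) + duration(A); ES(s0) = 0
ES = {}
for s in order:
    ES[s] = max((ES[A] + durations[A] for A in predecessors[s]), default=0)

# backward pass: LS(sg) = ES(sg); LS(s) = min over successors B of LS(B) - duration(s)
LS = {}
for s in reversed(order):
    if not successors[s]:
        LS[s] = ES[s]
    else:
        LS[s] = min(LS[B] for B in successors[s]) - durations[s]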
20.7.3 Multi-agent planning
So far we have assumed planners working in isolation, with control over all the other agents in the
plan.
Planners may need to work in concert with other planners or other agents (e.g. people); such scenarios
introduce a few new concerns, such as how plans and outcomes are represented and communicated
across agents.
Multi-agent planning does away with the assumption of an isolated planner, and results in a much
more complex problem:
• agents with different beliefs
• agents with different capabilities
• agents with joint goals
• agents with individual/conflicting goals
• joint actions (multiple coordinating agents required to accomplish an action)
Other things that must be considered during execution of multi-agent plans:
• coordination (ordering constraints, sharing resources, joint actions)
• communication (e.g. communicating results)
• execution failure recovery (local plan repair, propagating changes to the plan across agents)
20.7.4 Scheduling: Dealing with resources
Actions need resources (time can also be thought of as a resource).
Planning which deals with resources is known as scheduling.
A resource is an entity needed to perform an action; resources are described by resource variables.
A distinction between state and resource variables:
• state variables: modified by actions in absolute ways
• resource variables: modified by actions in relative ways
Some resource types include:
• reusable vs consumable
• discrete vs continuous
• unary (only one available)
• shareable
• resources with states
Planning approaches can be arranged in a table:
                         Deterministic                      Stochastic
Fully observable         A*, depth-first, breadth-first     MDP
Partially observable                                        POMDP
20.8 Learning plans
There are a few ways an agent can learn plans. Presented here are some simpler ones; see the section
on Reinforcement Learning for more sophisticated methods.
20.8.1 Apprenticeship
We can use machine learning to learn how to navigate a state space (i.e. to plan) by “watching”
another agent (an “expert”) perform. For example, if we are talking about a game, the learning
agent can watch a skilled player play, and based on that learn how to play on its own.
In this case, the examples are states s, the candidates are pairs (s, a), and the “correct” actions are
those taken by the expert.
We define features over (s, a) pairs: f (s, a).
The score of a q-state (s, a) is given by w · f (s, a).
This is basically classification, where the inputs are states and the labels are actions.
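As a rough sketch (the feature function and weights here are hypothetical placeholders), the learned policy just picks the action whose q-state score w · f(s, a) is highest:

import numpy as np

def features(state, action):
    # hypothetical features over a (state, action) pair
    return np.array([1.0, float(action == "forward"), state.get("distance", 0.0)])

w = np.array([0.1, 0.5, -0.2])  # weights fit to match the expert's choices

def choose_action(state, actions):
    # score each candidate (s, a) with w . f(s, a) and pick the best
    return max(actions, key=lambda a: w.dot(features(state, a)))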
20.8.2 Case-Based Goal Formulation
With case-based goal formulation, a library of cases relevant to the problem is maintained (e.g. with
RTS games, this could be a library of replays for that game). Then the agent uses this library to
select a goal to pursue, given the current world state. That is, the agent finds the state case q (from
a case library L) most similar to the current world state s:
q = argminc∈L distance(s, c)
Where the distance metric may be domain independent or domain specific.
Then, the goal state g is formulated by looking ahead n actions from q to a future state in that case
q ′ , finding that difference, and adding that to the current world state s:
g = s + (q ′ − q)
The number of actions n is called the planning window size. A small planning window is better for
domains where plans are invalidated frequently.
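A minimal sketch of this lookup, assuming states are numeric vectors, the case library is a list of state trajectories, and a simple Euclidean (domain-independent) distance; all names here are hypothetical:

import numpy as np

def formulate_goal(s, library, n):
    # find the case state q most similar to the current world state s,
    # considering only states with at least n more steps left in their case
    best = None
    for traj in library:
        for i in range(max(0, len(traj) - n)):
            d = np.linalg.norm(s - traj[i])
            if best is None or d < best[0]:
                best = (d, traj, i)
    _, traj, i = best
    q, q_future = traj[i], traj[i + n]   # look ahead n actions within the case
    return s + (q_future - q)            # g = s + (q' - q)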
20.9 References
• Integrating Learning in a Multi-Scale Agent. Ben G. Weber. 2012.
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley
(edX).
• Logical Foundations of Artificial Intelligence (1987) (Chapter 12: Planning)
• Planning Algorithms. Steven M. LaValle. 2006.
• Artificial Intelligence Planning. Dr. Gerhard Wickler, Prof. Austin Tate. The University of
Edinburgh (Coursera). 2015.
• Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity.
• Comparison of State-Space Planning and Plan-Space Planning. Planning (Fall 2001). Carnegie
Mellon University. Manuela Veloso.
21
Reinforcement learning
A quick refresher: a Markov Decision Process (MDP) involves a set of states and actions. Actions
connect states with some uncertainty described by the dynamics P (s ′ |a, s) (i.e. the transition probabilities
which underlie the transition function T (s, a, s ′ )). Additionally, there is a reward function R(s, a, s ′ ) that
associates a reward or penalty with each transition. When these are known or learned, the result is a policy
π which prescribes actions to take given a state. We can then value a given state s and a policy
π in terms of expected future rewards with a value function V π (s). So the ultimate goal here is to
identify an optimal policy.
Markov Decision Processes as described so far have been fully-observed in the sense that we knew all
of their parameters (transition probabilities and so on). Because everything was known in advance,
we could conduct offline planning, that is, formulate a plan without needing to interact with the
world.
MDP parameters aren’t always known from the outset - we may not know the reward function R
or even the transition model T , and then we must engage in online planning, in which we must
interact with the world to learn more about it to better formulate a plan.
Online planning involves reinforcement learning, where agents can learn in what states rewards or
goals are located without needing to know from the start.
Reinforcement learning in summary:
• the agent interacts with its environment and receives feedback in the form of rewards
• the agent’s utility is defined by the reward function
• the agent must learn to act so as to maximize expected rewards
• learning is based on observed samples of outcomes
The ultimate goal of reinforcement learning is to learn a policy which returns an action to take given
a state. To form a good policy we need to know the value of a given state; we do so by learning a
value function which is the sum of rewards from the current state to some terminal state, following a
fixed policy. This value function can be learned and approximated by any learning and approximation
approach, e.g. neural networks.
One challenge with reinforcement learning is the credit assignment problem - a much earlier action
could be responsible for the current outcome, but how is that responsibility assigned? And how is it
quantified?
With reinforcement learning, we still assume an MDP, it’s just not fully specified - that is, we don’t
know R and we might not know T .
If we do know T , a utility-based agent can learn R and thus V , which we can then use for MDP.
If T and R are both unknown, a Q-learning agent can learn Q(s, a) without needing either. Where
V (s) is the value over states, Q(s, a) is the value over state-action pairs and can also be used with
MDP.
A reflex agent can also directly learn the policy π(s) without needing to know T or R.
Reinforcement learning agents can be passive, which means the agent has a fixed policy and learns
R and T (if necessary) while executing that policy.
Alternatively, an active reinforcement learning agent changes its policy as it goes and learns.
Passive learning has the drawbacks that it can take awhile to converge on good estimates for the
unknown quantities, and it may limit how much of the space is actually explored, and as such, there
may be little or no information about some states and better paths may remain unknown.
Critic, Actor, and Actor-Critic methods
We can broadly categorize various RL methods into three groups:
• critic-only methods, which first learn a value function and then use that to define a policy,
e.g. TD learning.
• actor-only methods, which directly search the policy space. An example is an evolutionary
approach where different policies are evolved.
• actor-critic methods, where a critic and an actor are both included and learned separately. The
critic observes the actor and evaluates its policy, determining when it needs to change.
21.1 Model-based learning
Model-based learning is a simple approach to reinforcement learning.
The basic idea:
• learn an approximate model (i.e. P (s ′ |s, a), that is, T (s, a, s ′ ), and R(s, a, s ′ )) based on
experiences
• solve for values (i.e. using value iteration or policy iteration) as if the learned model were correct
In more detail:
1. learn an empirical MDP model
• count outcomes s ′ for each s, a
• normalize to get an estimate of T̂ (s, a, s ′ )
• discover each R̂(s, a, s ′ ) when we experience (s, a, s ′ )
2. solve the learned MDP (e.g. value iteration or policy iteration)
21.1.1 Temporal Difference Learning (TDL or TD Learning)
In temporal difference learning, the agent moves from one state s to the next s ′ , looks at the
reward difference between the states, then backs up (propagates) the values (as in value iteration)
from one state to the next.
We run this many times, reaching a terminal state, then restarting, and so on, to get better estimates
of the rewards (utilities) for each state.
We keep track of rewards for visited states as U(s) and also the number of times we have visited
each state as N(s).
The main part of the algorithm is:
• if s ′ is new then U[s ′ ] = r ′
• if s is not null then
– increment Ns [s]
– U[s] = U[s] + α(Ns [s])(r + γU[s ′ ] − U[s])
Where α is the learning rate and γ is the discount factor.
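A minimal sketch of this update in Python, assuming the caller feeds in observed transitions and a learning rate that decays with the visit count (both assumptions, not part of the notes above):

from collections import defaultdict

gamma = 0.9
U = defaultdict(float)   # utility estimates U(s)
N = defaultdict(int)     # visit counts N(s)

def td_update(s, r, s_next):
    # back up the observed reward plus the discounted estimate of the
    # successor state toward the current estimate for s
    N[s] += 1
    alpha = 1.0 / N[s]   # learning rate that decays with visits
    U[s] += alpha * (r + gamma * U[s_next] - U[s])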
Another way of thinking of TD learning:
Consider a sequence of values v1, v2, v3, . . . , vt. We want to estimate the value vt+1. We might do
so by averaging the observed values, e.g. v̂t+1 = (v1 + · · · + vt)/t.
We can rearrange these terms to give us:
t v̂t+1 = v1 + · · · + vt−1 + vt
t v̂t+1 = (t − 1)v̂t + vt
v̂t+1 = (1 − 1/t)v̂t + vt/t
v̂t+1 = v̂t + (vt − v̂t)/t
The term vt − v̂t is the temporal difference error. Basically our estimate v̂t+1 is derived by updating
the previous estimate v̂t proportionally to this error. For instance, if vt > v̂t then the next estimate
is increased.
This approach treats all values with equal weight, though we may want to decay older values.
Greedy TDL
One method of active reinforcement learning is the greedy approach.
Given new estimates for the rewards, we recompute a new optimal policy and then use that policy
to guide exploration of the space. This gives us new estimates for the rewards, and so on, until
convergence.
Thus it is greedy in the sense that it always tries to go for policies that seem immediately better,
although in the end that doesn’t necessarily guarantee the overall optimal policy (this is the exploration
vs exploitation problem).
One alternate approach is to randomly try a non-optimal action, thus exploring more of the space.
This works, but can be slow to converge.
21.1.2 Exploration agent
An approach better than greedy TDL is to use an exploration agent, which favors exploring more when it
is uncertain. More specifically, we can use the same TDL algorithm, but while Ns < ϵ, where ϵ is
some exploration threshold, we set U[s] = R, where R is the largest reward we expect to get. When
Ns > ϵ, we start using the learned reward as with regular TDL.
21.2 Model-free learning
With model-free learning, instead of trying to estimate T or R, we take actions and compare the actual
outcome to what we expected the outcome would be.
With passive reinforcement learning, the agent is given an existing policy and just learns from
the results of that policy’s execution (that is, learns the state values; i.e. this is essentially just policy
evaluation, except this is not offline, this involves interacting with the environment).
To compute the values for each state under π, we can use direct evaluation:
• act according to π
• every time we visit a state, record what the sum of discounted rewards turned out to be
• average those samples
Direct evaluation is simple, doesn’t require any knowledge of T or R, and eventually gets the correct
average values. However, it throws out information about state connections, since each state is
learned separately - for instance, if we have a state si with a positive reward, and another state sj
that leads into it, it’s possible that direct evaluation assigns state sj a negative reward, which doesn’t
make sense - since it leads to a state with a positive reward, it should also have some positive reward.
Given enough time/samples, this will eventually resolve, but that can require a long time.
Policy evaluation, on the other hand, does take into account the relationship between states, since the
value of each state is a function of its child states, i.e.
Vπk+1(s) = ∑s′ T (s, π(s), s ′ )[R(s, π(s), s ′ ) + γVπk(s ′ )]
However, we don’t know T and R. Well, we could just try actions and take samples of outcomes s ′
and average:
Vπk+1(s) = (1/n) ∑i samplei
Where each samplei = R(s, π(s), si′ ) + γVkπ (si′ ). R(s, π(s), si′ ) is just the observed reward from
taking the action.
This is called sample-based policy evaluation.
One challenge here: when you try an action, you end up in a new state - how do you get back to the
original state to try another action? We don’t know anything about the MDP so we don’t necessarily
know what action will do this.
So really, we only get one sample, and then we’re off to another state.
With temporal difference learning, we learn from each experience (“episode”); that is, we update
V (s) each time we experience a transition (s, a, s ′ , r ). The likely outcomes s ′ will contribute updates
more often. The policy is still fixed (given), and we’re still doing policy evaluation.
Basically, we have an estimate V (s), and then we take an action and get a new sample. We update
V (s) like so:
V π (s) = (1 − α)V π (s) + (α)sample
So we specify a learning rate α (usually small, e.g. α = 0.1) which controls how much of the old
estimate we keep. This learning rate can be decreased over time.
This is an exponential moving average.
This update can be re-written as:
V π (s) = V π (s) + α(sample − V π (s))
The term (sample − V π (s)) can be interpreted as an error, i.e. how off our current estimate V π (s)
was from the observed sample.
So we still never learn T or R, we just keep running sample averages instead; hence temporal difference
learning is a model-free method for doing policy evaluation.
However, it doesn’t help with coming up with a new policy, since we need Q-values to do so.
21.2.1 Q-Learning
With active reinforcement learning, the agent is actively trying new things rather than following a
fixed policy.
The fundamental trade-off in active reinforcement learning is exploitation vs exploration. When
you land on a decent strategy, do you just stick with it? What if there’s a better strategy out there?
How do you balance using your current best strategy and searching for an even better one?
Remember that value iteration requires us to look at maxa over the set of possible actions from a
state:
Vk+1(s) = maxa ∑s′ T (s, a, s ′ )[R(s, a, s ′ ) + γVk(s ′ )]
However, we can’t compute maximums from samples since the maximum is always unknown (there’s
always the possibility of a new sample being larger; we can only compute averages from samples).
We can instead iteratively compute Q-values (Q-value iteration):
Qk+1(s, a) = ∑s′ T (s, a, s ′ )[R(s, a, s ′ ) + γ maxa′ Qk(s ′ , a′ )]
Remember that while a value V (s) is the value of a state, a Q-value Q(s, a) is the value of an action
(from a particular state).
Here the max term is pushed inside, and we are ultimately just computing an average, so we can
compute this from samples.
This is the basis of the Q-Learning algorithm, which is just sample-based Q-value iteration.
We learn Q(s, a) values as we go:
• take an action a from a state s and see the outcome as a sample (s, a, s ′ , r ).
• consider the old estimate Q(s, a)
• consider the new sample estimate: sample = R(s, a, s ′ )+γ maxa′ Q(s ′ , a′ ), where R(s, a, s ′ ) =
r , i.e. the reward we just received
• incorporate this new estimate into a running average:
Q(s, a) = (1 − α)Q(s, a) + (α)sample
This can also be written:
Q(s, a) =α R(s, a, s ′ ) + γ maxa′ Q(s ′ , a′ )
These updates emulate Bellman updates as we do in known MDPs.
Q-learning converges to an optimal policy, even if you’re acting suboptimally. When an optimal policy
is still learned from suboptimal actions, it is called off-policy learning. Another way of saying this
is that with off-policy learning, Q-values are updated not according to the current policy (i.e. the
current actions), but according to a greedy policy (i.e. the greedy/best actions).
We still, however, need to explore and decrease the learning rate (but not too quickly or you’ll stop
learning things).
In Q-Learning, we don’t need P or the reward/utility function. We directly learn the rewards/utilities
of state-action pairs, Q(s, a).
With this we can just choose our optimal policy as:
π(s) = argmaxa Q(s, a)
The Q update formula is simply:
Q(s, a) = Q(s, a) + α(R(s) + γ maxa′ Q(s ′ , a′ ) − Q(s, a))
Where α is the learning rate and γ is the discount factor.
Again, we can back up (as with value iteration) to propagate these values through the network.
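A minimal tabular sketch of this update (the environment interface is assumed; only the update itself follows the formula above):

from collections import defaultdict

alpha, gamma = 0.1, 0.9
Q = defaultdict(float)   # Q[(s, a)]

def q_update(s, a, r, s_next, actions):
    # the target uses the greedy value of the next state (off-policy)
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])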
Note that a simpler version of Q-learning is SARSA, (“State Action Reward State Action”, because
the quintuple (st , at , rt , st+1 , at+1 ) is central to this method), which uses the update:
Q(s, a) = Q(s, a) + α(R(s) + γQ(s ′ , a′ ) − Q(s, a))
SARSA, in contrast to Q-learning, is on-policy learning; that is, it updates states based on the current
policy’s actions, so Q-values are learned according to the current policy and not a greedy policy.
n-step Q-learning
An action may be responsible for a reward later on, so we want to be able to learn that causality,
i.e. propagate rewards. The default one-step Q-learning and SARSA algorithms only associate reward
with the direct state-action pair s, a that immediately led to it. We can instead propagate these
rewards further with n-step variations, e.g. n-step Q-learning updates Q(s, a) with:
rt + γrt+1 + · · · + γn−1 rt+n−1 + γn maxa Q(st+n, a)
21.2.2 Exploration vs exploitation
Up until now we have not considered how we select actions. So how do we? That is, how do we
explore?
One simple method is to sometimes take random actions (ϵ-greedy). With a small probability ϵ, act
randomly, with probability 1 − ϵ, act on the current best policy.
After the space is thoroughly explored, you don’t want to keep moving randomly - so you can decrease
ϵ over time.
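A minimal sketch of ϵ-greedy selection with a decaying ϵ (the Q table and action set are assumed to exist):

import random

def epsilon_greedy(Q, s, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)              # explore
    return max(actions, key=lambda a: Q[(s, a)])   # exploit the current best

epsilon = 1.0
for episode in range(1000):
    epsilon = max(0.05, epsilon * 0.995)  # decay exploration over time
    # ... inside the episode, act with epsilon_greedy(Q, s, actions, epsilon)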
A simple modification of ϵ-greedy action selection is soft-max action selection, where actions are
chosen based on their estimated Q(s, a) values. One specific method is to use a Gibbs or Boltzmann
distribution, where selecting action a in state s is proportional to e Q(s,a)/T, where T > 0 is a temperature which influences how randomly actions are chosen. The higher the temperature, the more
random; as T → 0, the best-valued action is always chosen. More specifically, in state s, action
a is chosen with the probability:
exp(Q(s, a)/T ) / ∑a′ exp(Q(s, a′ )/T )
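A minimal sketch of soft-max selection from Q-values (again, the Q table is assumed to be a dict keyed by (state, action)):

import math, random

def boltzmann(Q, s, actions, T=1.0):
    prefs = [math.exp(Q[(s, a)] / T) for a in actions]
    total = sum(prefs)
    # draw an action with probability proportional to exp(Q(s, a) / T)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]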
Alternatively, we can use exploration functions. Generally, we want to explore areas we have high
uncertainty for. More specifically, an exploration function takes a value estimate u and a visit count
n and returns an optimistic utility. For example: f (u, n) = u + k/n.
We can modify our Q-update to incorporate an exploration function:
Q(s, a) =α R(s, a, s ′ ) + γ maxa′ f (Q(s ′ , a′ ), N(s ′ , a′ ))
This encourages the agent not only to try unknown states, but to also try states that lead to unknown
states.
In addition to exploration and exploitation, we also introduce a concept of regret. Naturally, mistakes
are made as the space is explored - regret is a measure of the total mistake cost. That is, it is the
difference between your expected rewards and optimal expected rewards.
We can try to minimize regret - to do so, we must not only learn to be optimal, but we must optimally
learn how to be optimal.
For example: both random exploration and exploration functions eventually learn an optimal policy, but
random exploration has higher regret.
21.2.3 Approximate Q-Learning
Sometimes state spaces are far too large to satisfactorily explore. This can be a limit of memory
(since Q-learning keeps a table of Q-values) or simply that there are too many states to visit in a
reasonable time. In fact, this is the rule rather than the exception. So in practice we cannot learn
about every state.
The general idea of approximate Q-learning is to transfer learnings from one state to other similar
states. For example, if we learn from exploring one state that a fire pit is bad, then we can generalize
that all fire pit states are probably bad.
This is an approach like machine learning - we want to learn general knowledge from a few training
states; the states are represented by features (for example, we could have a binary feature has fire pit).
Then we describe q-states in terms of features, e.g. as linear functions (called a Q-function; this
method is called linear function approximation):
Q(s, a) = w1 f1 (s, a) + w2 f2 (s, a) + · · · + wn fn (s, a)
Note that we can do the same for value functions as well, i.e.
V (s) = w1 f1 (s) + w2 f2 (s) + · · · + wn fn (s)
So we observe a transition (s, a, r, s ′ ) and then we compute the difference of this observed transition
from what we expected, i.e:
difference = [r + γ maxa′ Q(s ′ , a′ )] − Q(s, a)
With exact Q-learning, we would update Q(s, a) like so:
Q(s, a) = Q(s, a) + α[difference]
With approximate Q-learning, we instead update the weights, and we do so in proportion to their
feature values:
wi = wi + α[difference]fi (s, a)
This is the same as least-squares regression.
That is, given a point x, with features f (x) and target value y , the error is:
error(w ) = ½ (y − ∑k wk fk (x))²
The derivative of the error with respect to a weight wm is:
∂error(w )/∂wm = −(y − ∑k wk fk (x))fm (x)
CHAPTER 21. REINFORCEMENT LEARNING
629
21.2. MODEL-FREE LEARNING
630
Then we update the weight:
wm = wm + α(y − ∑k wk fk (x))fm (x)
In terms of approximate Q-learning, the target y is r + γ maxa′ Q(s ′ , a′ ) and our prediction is Q(s, a):
wm = wm + α[r + γ maxa′ Q(s ′ , a′ ) − Q(s, a)]fm (s, a)
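A minimal sketch of this weight update for linear approximate Q-learning (the feature function here is a hypothetical placeholder):

import numpy as np

def features(s, a):
    return np.array([1.0, s[0], float(a == "right")])   # hypothetical features

w = np.zeros(3)
alpha, gamma = 0.1, 0.9

def q_value(s, a):
    return w.dot(features(s, a))

def update(s, a, r, s_next, actions):
    global w
    target = r + gamma * max(q_value(s_next, a2) for a2 in actions)
    difference = target - q_value(s, a)
    w = w + alpha * difference * features(s, a)   # w_i += alpha * difference * f_i(s, a)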
21.2.4 Policy Search
Q-learning tries to model the states by learning q-values. However, a feature-based Q-learning model
that models the states well does not necessarily translate to a good feature-based policy (and vice versa).
Instead of trying to model the unknown states, we can directly try to learn the policies that maximize
rewards.
So we can use Q-learning and learn a decent solution, then fine-tune by hill climbing on feature
weights. That is, we learn an initial linear Q-function, then nudge each feature weight up and
down to see if the resulting policy is better.
We test whether or not a policy is better by running many sample episodes.
If we have many features, we have to test many new policies, and this hill climbing approach becomes
impractical. There are better methods (not discussed here).
21.2.5 Summary
A helpful table:
For a known MDP, we can compute an offline solution:

Goal                           Technique
Compute V ∗ , Q∗ , π ∗         Value or policy iteration
Evaluate a fixed policy π      Policy evaluation

For an unknown MDP, we can use model-based approaches:

Goal                           Technique
Compute V ∗ , Q∗ , π ∗         Value or policy iteration on the approximated MDP
Evaluate a fixed policy π      Policy evaluation on the approximated MDP

Or we can use model-free approaches:

Goal                           Technique
Compute V ∗ , Q∗ , π ∗         Q-learning
Evaluate a fixed policy π      Value learning
21.3 Deep Q-Learning
The previous Q-learning approach was tabular in that we essentially kept a table of mappings from
(s, a) to some value. However, we’d like to be a bit more flexible and not have to map exact states
to values, but map similar states to similar values.
The general idea behind deep Q-learning is using a deep neural network to learn Q(s, a), which gives
us this kind of mapping.
This is essentially a regression problem, since Q-values are continuous. So we can use a squared error
loss in the form of a Bellman equation:
L = ½ [r + maxa′ Q(s ′ , a′ ) − Q(s, a)]²
Where the r + maxa′ Q(s ′ , a′ ) term is the target value and Q(s, a) is the predicted value.
Approximating Q-values using nonlinear functions is not very stable, so tricks are needed to get good
performance.
One problem is catastrophic forgetting, in which similar states may lead to drastically different
outcomes. For instance, there may be a state which is a single move away from winning, and
then another similar state where that same move leads to failure. When the agent wins from that
first state, it will assign a high value to it. Then, when it loses from the similar state, it revises its
value negatively, and in doing so it “overwrites” its assessment of the other state.
So catastrophic forgetting occurs when similar states lead to very different outcomes, and when this
happens, the agent is unable to properly learn.
One trick for this is experience replay, in which each experience tuple (s, a, r, s ′ ) is saved (this
collection of saved experiences is called “replay memory”). Memory size is often limited to keep only
the last n experiences.
Then the network is trained using random minibatches sampled from the replay memory. This
essentially turns the task into a supervised learning task.
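A minimal sketch of such a replay memory (capacity and batch size are arbitrary choices here):

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)   # old experiences fall off the end

    def store(self, s, a, r, s_next, terminal):
        self.memory.append((s, a, r, s_next, terminal))

    def sample(self, batch_size=32):
        # random minibatch used to train the Q network
        return random.sample(self.memory, min(batch_size, len(self.memory)))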
A deep Q-learning algorithm that includes experience replay and ϵ-greedy exploration follows (source):
• initialize replay memory D
• initialize action-value function Q (with random weights)
• observe initial state s
• repeat
– select an action a
* with probability ϵ select a random action
* otherwise select a = argmaxa′ Q(s, a′ )
– carry out action a
– observe reward r and new state s ′
– store experience < s, a, r, s ′ > in replay memory D
– sample random transitions < ss, aa, rr, ss ′ > from replay memory D
– calculate target for each minibatch transition
* if ss ′ is a terminal state then tt = rr
* otherwise tt = rr + γ maxa′ Q(ss ′ , aa′ )
– train the Q network using (tt − Q(ss, aa))² as loss
– s = s′
21.4 References
• Reinforcement Learning - Part 1. Brandon B.
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley
(edX).
• Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity.
• 11.3 Reinforcement Learning. Artificial Intelligence: Foundations of Computational Agents.
David Poole and Alan Mackworth. Cambridge University Press, 2010.
• Reinforcement Learning in a Nutshell. V. Heidrich-Meisner, M. Lauer, C. Igel, M. Riedmiller.
• Asynchronous Methods for Deep Reinforcement Learning. Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver,
Koray Kavukcuoglu.
• Demystifying Deep Reinforcement Learning. Tambet Matiisen.
• Reinforcement Learning: A Tutorial. Mance E. Harmon & Stephanie S. Harmon.
• Q-learning with Neural Networks. Brandon B.
22
Filtering
Often an agent is uncertain what state the world is in. Filtering (or monitoring) is the task of
tracking and updating the belief state (the distribution) Bt (X) = Pt (Xt |e1 , . . . , et ) as new evidence
is observed.
We start with B1 (X) with some initial setting (typically uniform), and update as new evidence is
observed/time passes.
22.1 Particle filters
Sometimes we have state spaces which are too large for exact inference (i.e. too large for the forward
algorithm) or just to hold in memory. For example, if the state space X is continuous.
Instead, we can use particle filtering, which provides an approximate solution.
With particle filters, possible states are represented as particles (vectors); the density of these vectors
in state space represents the posterior probability of being in a certain state (that is, higher density
means the true state is more likely in that region), and the set of all these vectors represents the
belief state.
Another way of putting this: Particles are essentially samples of possible states. Each particle can be
thought of as a hypothesis that we are in the state it represents. The more particles there are for a
state, the more likely it is that we are in that state.
So to start, these particles may be very diffuse, spread out across the space somewhat uniformly. As
more data (measurements/observations) is collected, the particles are resampled and placed according
to these observations, and they start to concentrate in more likely regions.
More formally, our representation of P (X) is now a list of N particles (generally N ≪ |X|, and we
don’t need to store X in memory anymore, just the particles).
P (x) is approximated by the number of particles with value x (i.e. the more particles that have value
x, the more likely state x is).
Particles have weights, and they all start with a weight of 1.
As time passes, we “move” each particle by sampling its next position from the transition model:
x ′ = sample(P (X ′ |x))
As we gain evidence, we fix the evidence and downweight samples based on the evidence:
w (x) = P (e|x)
B(X) ∝ P (e|x)B ′ (X)
These particle weights reflect how likely the evidence is from that particle’s state. A result of this is
that the probabilities don’t sum to one anymore.
This is similar to likelihood weighting.
Rather than tracking the weighted samples, we resample.
That is, we sample N times from the weighted sample distribution (i.e. we draw with replacement).
This is essentially renormalizing the distribution and has the effect of “moving” low-weight (unlikely)
samples to where high-weight samples are (i.e. to likely states), so they become more “useful”.
The particle filter algorithm:
# s is a set of particles with importance weights
# u is a control vector
# z is a measurement vector
def particle_filter(s, u, z):
    # a new particle set
    s_new = []
    n = len(s)
    for i in range(n):
        # sample a particle (with replacement)
        # based on the importance weights
        p = sample(s)
        # sample a possible successor state (i.e. a new particle)
        # according to the state transition probability
        # and the sampled particle: p' ~ p(p'|u, p)
        p_new = sample_next_state(u, p)
        # use the measurement probability p(z|p') as the importance weight
        w_new = measurement_prob(z, p_new)
        # save to new particle set
        s_new.append([p_new, w_new])
    # normalize the importance weights
    # so they act as a probability distribution
    eta = sum(w for p, w in s_new)
    for i in range(n):
        s_new[i][1] /= eta
    return s_new
Particle filters do not scale to high-dimensional spaces because the number of particles you need to
fill a high-dimensional space grows exponentially with the dimensionality. Though there are some
particle filter methods that can handle this better.
But they work well for many applications. They are easy to implement, computationally efficient, and
can deal well with complex posterior distributions.
22.1.1 DBN particle filters
There are also DBN particle filters in which each particle represents a full assignment to the world
(i.e. a full assignment of all variables in the Bayes’ net). Then at each time step, we sample a
successor for each particle.
When we observe evidence, we weight each entire sample by the likelihood of the evidence conditioned
on the sample.
Then we resample - select prior samples in proportion to their likelihood.
Basically, a DBN particle filter is a particle filter where each particle represents multiple assigned
variables rather than just one.
22.2 Kalman Filters
(note: the images below are all sourced from How a Kalman filter works, in pictures, Tim Babb)
Kalman filters can provide an estimate for the current state of a system and from that, provide
an estimate about the next state of the system. They make the approximation that everything is
Gaussian (i.e. transmissions and emissions).
We have some random variables about the current state (in the example below, they are position p
and velocity v ); we can generalize this as a random variable vector S. We are uncertain about the
current state but (we assume) they can be expressed as Gaussian distributions, parameterized by a
mean value and a variance (which reflects the uncertainty).
These random variables may be uncorrelated, as they are above (knowing the state of one tells us
nothing about the other), or they may be correlated like below.
This correlation is described by a covariance matrix, Σ, where the element σij describes the correlation
between the i th and jth random variables. Covariance matrices are symmetric.
We say the current state is at time t − 1, so the random variables describing it are notated St−1, and
the next state (which we want to predict) is at time t, so the random variables we predict are
notated St.
The Kalman filter basically takes the random variable distributions for the current state and gives us
new random variable distributions for the next state:
In essence it moves each possible point for the current state to a new predicted point.
We then have to come up with some function for making the prediction. In the example of position
p and velocity v , we can just use pt = pt−1 + ∆tvt−1 to update the position and assume the velocity
is kept constant, i.e. vt = vt−1 .
We can represent these functions collectively as a matrix applied to the state vector:
Correlated Gaussian random variables
From current distributions to predicted distributions for the next state
St = [[1, ∆t], [0, 1]] St−1 = Ft St−1
We call this matrix Ft our prediction matrix. With this, we can transform the means of each random
variable (we notate the vector of these means as Ŝt−1 since these means are our best estimates) to
the predicted means in the next state, Ŝt.
We can similarly apply the prediction matrix to determine the covariance at time t, using the property
Cov (Ax) = AΣAT , so that:
Σt = Ft Σt−1 FtT
It’s possible we also want to model external influences on the system. In the position and velocity
example, perhaps some acceleration is being applied. We can capture these external influences in a
vector ut , which is called the control vector.
For the position and velocity example, this control vector would just have acceleration a, i.e. ut = [a].
We then need to update our prediction functions for each random variable in S to incorporate it, i.e.
pt = pt−1 + ∆tvt−1 + 12 ∆t 2 a and vt = vt−1 + a∆t.
Again, we can pull out the coefficients for the control vector terms into a matrix. For this example,
it would be:
Ut = [∆t²/2, ∆t]ᵀ
This matrix is called the control matrix, which we’ll notate as Ut .
We can then update our prediction function:
St = Ft St−1 + Ut ut
These control terms capture external influences we are certain about, but we also want to model
external influences we are uncertain about. To model this, instead of moving each point from the
distributions of St−1 exactly to where the prediction function says it should go, we also describe these
new predicted points as Gaussian distributions with covariance matrices Qt .
We can incorporate the uncertainty modeled by Qt by including it when we update the predicted
covariance at time t:
Σt = Ft Σt−1 FtT + Qt
Modeling uncertainty in the predicted points
Now consider that we have sensors which measure the current state for us, though there is some
measurement error (noise). We can model these sensors with the matrix Ht (which would include
measured values for each of our state random variables) and incorporate them:
µexpected = Ht Ŝt
Σexpected = Ht Σt HtT
This gives us the final equation for our predicted state values.
Now say we’ve come to the next state and we get in new sensor values. This allows us to observe
the new state (with some noise/uncertainty) and combine it to our predicted state values to get a
more accurate estimate of the new current state.
The readings we get for our state random variables (e.g. position and velocity) are represented by a
vector zt , and the uncertainty/noise (covariance) in these measurements is described by the covariance
matrix Rt . Basically, these sensors are also described as Gaussian distributions, where the values the
sensor gave us, zt , is considered the vector of the means for each random variable.
Uncertainty in sensor readings
We are left with two Gaussians - one describing the sensor readings and their uncertainty, and another
describing the predicted values and their uncertainty. We can multiply the distributions to get their
overlap, which describes the space of values likely for both distributions.
Overlap of the two Gaussians
The resulting overlap is, yet again, also a Gaussian distribution with its own mean and covariance
matrix.
We can compute this new mean and covariance from the two distributions that formed it.
First, consider the product of two 1D Gaussian distributions:
N (x, µ0, σ0) · N (x, µ1, σ1) =? N (x, µ, σ)
As a reminder, the Gaussian distribution is formalized as:
N (x, µ, σ) = (1 / (σ√(2π))) e^(−(x−µ)² / (2σ²))
The product of two 1D Gaussians
We can solve for both µ and σ² to get:
µ = µ0 + σ0² (µ1 − µ0) / (σ0² + σ1²)
σ² = σ0² − σ0⁴ / (σ0² + σ1²)
To make this more readable, we can factor out k, such that:
k = σ0² / (σ0² + σ1²)
µ = µ0 + k(µ1 − µ0)
σ² = σ0² − kσ0²
In dimensions higher than 1, we can re-write the above with matrices (µ are now vectors here):
K = Σ0 (Σ0 + Σ1)⁻¹
µ = µ0 + K(µ1 − µ0)
Σ = Σ0 − KΣ0
This matrix K is the Kalman gain.
So we have the two following distributions:
• The predicted state: (µ0, Σ0) = (Ht Ŝt, Ht Σt HtT)
• The observed state: (µ1, Σ1) = (zt, Rt)
And using the above, we compute their overlap to get a new best estimate:
Ht Ŝt′ = Ht Ŝt + K(zt − Ht Ŝt)
Ht Σt′ HtT = Ht Σt HtT − K Ht Σt HtT
K = Ht Σt HtT (Ht Σt HtT + Rt)⁻¹
Simplifying a bit (dividing Ht out of each term), we get:
Ŝt′ = Ŝt + K′(zt − Ht Ŝt)
Σt′ = Σt − K′ Ht Σt
K′ = Σt HtT (Ht Σt HtT + Rt)⁻¹
These are the equations for the update step, which give us the new best estimate Ŝt′ (along with its covariance Σt′).
Kalman filters work for modeling linear systems; for nonlinear systems you instead need to use the
extended Kalman filter.
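A minimal numpy sketch of the predict and update steps above for the position/velocity example (the specific matrices and noise values here are made up):

import numpy as np

dt = 1.0
F = np.array([[1, dt], [0, 1]])   # prediction matrix
U = np.array([dt**2 / 2, dt])     # control matrix (acceleration)
H = np.eye(2)                     # sensors read position and velocity directly
Q = np.eye(2) * 0.01              # uncertainty added by the prediction
R = np.eye(2) * 0.1               # sensor noise covariance

def predict(s_hat, sigma, a):
    s_hat = F @ s_hat + U * a            # S_t = F S_{t-1} + U u_t
    sigma = F @ sigma @ F.T + Q          # Sigma_t = F Sigma F^T + Q
    return s_hat, sigma

def update(s_hat, sigma, z):
    K = sigma @ H.T @ np.linalg.inv(H @ sigma @ H.T + R)   # Kalman gain
    s_hat = s_hat + K @ (z - H @ s_hat)
    sigma = sigma - K @ H @ sigma
    return s_hat, sigma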
22.3 References
• How a Kalman filter works, in pictures. August 11, 2015. Tim Babb.
• Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity.
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley
(edX).
23
In Practice
In practice, there is no one-size-fits-all solution for AI problems. Generally, some combination of
techniques is required.
23.1 Starcraft
Starcraft is hard for AI because:
• adversarial
• long horizon
• partially observable (fog-of-war)
• realtime (i.e. 24fps, one action per frame)
• huge branching factor
• concurrent (i.e. players move simultaneously)
• resource-rich
There is no single algorithm (e.g. minimax) that will solve it off-the-shelf.
The Berkeley Overmind won AIIDE 2010 (a Starcraft AI competition). It used:
• search: for path planning for troops
• CSPs: for base layout (i.e. buildings/facilities)
• minimax: for targeting of opponent’s troops and facilities
• reinforcement learning (potential fields): for micro control (i.e. troop control)
• inference: for tracking opponent’s units
• scheduling: for managing/prioritizing resources
• hierarchical control: high-level to low-level plans
23.2 References
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley
(edX).
Part IV
Simulation
24
Agent-Based Models
Agent-based models typically include the following features:
• individual agents model intelligent behavior, usually with a simple set of rules
• the agents are situated in some space or a network and interact with each other locally
• the agents usually have imperfect, local information
• there is usually variability between agents
• often there are random elements, either among the agents or in the world
24.1 Agents
• An agent is an entity that perceives and acts.
• A rational agent selects actions that maximize its (expected) utility.
• Characteristics of the percepts, environment, and action space dictate techniques for selecting
rational actions.
Reflex agents choose actions based on the current percept (and maybe memory). They are concerned
almost exclusively with the current state of the world - they do not consider the future consequences
of their actions, and they don’t have a goal that they are working towards. Rather, they just operate
off of simple “reflexes”.
Agents that plan consider long(er) term consequences of their actions, have a model of how the
world changes based on their actions, and work towards a particular goal (or goals), and can find an
optimal solution (plan) for achieving their goal or goals.
24.1.1 Brownian agents
A Brownian agent is described by a set of state variables ui(k) where i ∈ [1, . . . , N] refers to the
individual agent i and k indicates the different variables.
These state variables may be external, which are observable from outside the agent, or internal degrees
of freedom that must be inferred from observable actions.
The state variables can change over time due to the environment or internal dynamics. We can
generally express the dynamics of the state variables as follows:
dui(k)
= fi (k) + Fistoch
dt
The principle of causality is represented here: any effect such as a temporal change of variable u has
some causes on the right-hand side of the equation; such causes are described as a superposition of
deterministic and stochastic influences imposed on the agent i .
In this formulation, fi (k) is a deterministic term representing influences that can be specified on the
time and length scale of the agent, whereas Fistoch is a stochastic term which represents influences
that exist but are not observable on the time and length scale of the agent.
The deterministic term fi (k) captures all specified influences that cause changes to the state variable
ui(k) , including interactions with other agents j ∈ N, so it could be a function of the state variables
of other agents in addition to external conditions.
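A minimal sketch of integrating one such state variable with an Euler step; the particular drift term (relaxation toward the mean of the other agents' values) and noise scale are hypothetical choices, not part of the general formulation:

import numpy as np

def step(u, others, dt=0.01, noise=0.1):
    drift = -(u - np.mean(others))            # deterministic term f_i
    stochastic = noise * np.random.randn()    # stochastic term F_i^stoch
    return u + drift * dt + stochastic * np.sqrt(dt)

u = 1.0
for _ in range(100):
    u = step(u, others=[0.2, -0.1, 0.4])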
24.2 Multi-task and multi-scale problems
A multi-task domain is an environment where an agent performs two or more separate tasks.
A multi-scale domain is a multi-task domain that satisfies the following:
• multiple structural scales: actions are performed across multiple levels of coordination
• interrelated tasks: there is not a strict separation across tasks, and the performance in each
task impacts other tasks
• actions are performed in real-time
More generally, multi-scale problems involve working at many different levels of detail.
For example, an AI for an RTS game must manage many simultaneous goals at the micro and macro
level, and these goals and their tasks are often interwoven, and all this must be done in real-time.
24.3 Utilities
We encode preferences for an agent, e.g. A ≻ B means the agent prefers A over B (on the other
hand, A ∼ B means the agent is indifferent about either).
A lottery represents these preferences under uncertainty, e.g. [p, A; 1 − p, B].
Rational preferences must obey the axioms of rationality:
• orderability: (A ≻ B) ∨ (B ≻ A) ∨ (A ∼ B). You either have to like A better than B, B
better than A, or be indifferent.
• transitivity: (A ≻ B) ∧ (B ≻ C) =⇒ (A ≻ C)
• continuity: A ≻ B ≻ C =⇒ ∃p[p, A; 1 − p, C] ∼ B. That is, if B is somewhere between A
and C, there is some lottery between A and C that is equivalent to B.
• substitutability: A ∼ B =⇒ [p, A; 1 − p, C] ∼ [p, B; 1 − p, C]. If you’re indifferent to A
and B, you are indifferent to them in lotteries.
• monotonicity: A ≻ B =⇒ (p ≥ q ⇔ [p, A; 1 − p, B] ⪰ [q, A; 1 − q, B]). If you prefer A
over B, when given lotteries between A and B, you prefer the lottery that is biased towards A.
When preferences are rational, they imply behavior that maximizes expected utility, which implies we
can come up with a utility function to represent these preferences.
That is, there exists a real-valued function U such that:
U(A) ≥ U(B) ⇔ A ⪰ B
U([p1, S1; . . . ; pn, Sn]) = ∑i pi U(Si)
The second equation says that the utility of a lottery is the expected value of that lottery.
24.4 References
• Think Complexity. Version 1.2.3. Allen B. Downey. 2012.
• An Agent-Based Model of Collective Emotions in Online Communities. Frank Schweitzer, David
Garcia. Swiss Federal Institute of Technology Zurich. 2008.
• Integrating Learning in a Multi-Scale Agent. Ben G. Weber. 2012.
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley
(edX).
25
Nonlinear Dynamics
Two approaches to science:
• mathematical: defined by equations, using proofs (“classical” models)
– typically deterministic
– generally involve many simplifying assumptions
– use linear approximations to model non-linear systems
• computational: typically defined by simple rules, using simulations (“complex” models)
– often stochastic
– often also involve simplifying assumptions, but fewer
– deal better with non-linear systems
(Chapter 1 of Think Complexity provides a good overview of these two approaches).
Some systems may be very hard to accurately model, even though they may be deterministic.
Complex behavior arises from deterministic nonlinear dynamic systems; such systems exhibit two special properties:
• sensitive dependence on initial conditions
• characteristic structure
Most nonlinear dynamic systems are chaotic, and nonlinear dynamic systems constitute most of the
dynamic systems we encounter. In general, systems involving flows (heat, fluid, etc) demonstrate
nonlinear dynamics, but they also show up in classical mechanics (e.g. the three-body problem, the
double-jointed pendulum).
The equations that describe chaotic systems can’t be solved analytically - they are solved with
computers instead.
25.1 Maps
Maps describe systems that operate in discrete time intervals.
In particular, a map is a mathematical operator that advances the system one time step (i.e. the
next step). We describe them using a difference equation (not to be confused with differential
equations, which come up later):
xn+1 = f (xn )
Where f is the map and xi is the state of the system at time step i .
The states (i.e. each of x0 , x1 , . . . , also called iterates) of a map may converge to a fixed point
where they no longer change as the map is further applied (i.e. it is invariant to the dynamics of the
system), which is notated as x ∗ .
There are different kinds of fixed points:
• attracting fixed points, which the system tends towards when perturbed (stable)
• unstable fixed points, in which the dynamics are stationary, but the system is not “naturally
drawn” to (they are repelling). If they are perturbed from this point, they do not settle back
into it.
For example, if you drop the double pendulum, eventually it settles to a stationary position:
_____
|
|
0
This is an attracting fixed point.
There are other fixed points in this system. For example:
0
|
|
__|__
That is, the pendulum could be balanced on top, in which it would remain stationary, but easily
disturbed, and this is not one that the system would settle into.
The time steps leading to a fixed point is called the transient.
The sequence of iterates is called an orbit or a trajectory of the dynamical system.
The first state x0 is the initial condition.
A common map is the logistic map, L(xn ) (often used to model populations):
xn+1 = r xn (1 − xn )
It includes a parameter r ∈ (0, 4) and x ∈ (0, 1).
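A minimal sketch of iterating the logistic map to see where a trajectory settles once the transient dies out (the particular r values and starting point below are arbitrary):

def logistic_orbit(r, x0=0.2, steps=1000):
    x, orbit = x0, []
    for _ in range(steps):
        x = r * x * (1 - x)
        orbit.append(x)
    return orbit

print(logistic_orbit(2.0)[-3:])   # settles to a fixed point (0.5)
print(logistic_orbit(3.1)[-4:])   # settles into a 2-cycle
print(logistic_orbit(3.8)[-4:])   # chaotic: no repeating pattern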
Attractors are the states that remain after the transient dies out (i.e. after the system “settles”).
Attracting fixed points are one kind of attractor, but there are also periodic (oscillating) and chaotic
(strange) attractors.
A basin of attraction is the set of initial conditions which eventually converge on the same attractor.
One attractor is the (fixed) periodic orbit (also called a limit cycle) which is just a sequence of
iterates that repeats indefinitely. A particular cycle may be referred to as an n-cycle, where n refers
to the period of the cycle, i.e. the number of time steps that the cycle repeats over.
25.1.1 Bifurcations
A bifurcation refers to a qualitative change in the topology of an attractor. For example, in the
logistic map, one value of r may give a fixed point attractor, but changing it to another value may
change it to a (fixed) periodic orbit attractor (here r would be called a bifurcation parameter).
Note that “qualitative” change means it changes the kind of attractor, i.e. if changing r just shifts
the fixed point, that is not a bifurcation.
For example, with the logistic map: when r = 3.6, we have a chaotic attractor (also known as a
strange attractor), when r = 3.1 we have a periodic attractor, and when r = 2 we have a fixed
point attractor.
25.1.2 Return maps
Often (1D/scalar) maps are plotted as a time domain plot, in which the horizontal axis is time n and the vertical axis is the state value x_n.
Another way of plotting them is using the (first) return map, also known as a correlation plot or a cobweb diagram, in which the horizontal axis is x_n and the vertical axis is x_{n+1}. This is known as the first return map because we correlate x_n with x_{n+1}. A second return map, for example, would correlate x_n with x_{n+2}.
On a return map, we also often include the line x_{n+1} = x_n, which is the line on which any fixed points must lie (by definition).
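As a sketch of how a cobweb diagram is constructed (Python, with NumPy and matplotlib assumed; the notes don't prescribe a tool), we plot the map's curve, the line x_{n+1} = x_n, and the path that alternates between the curve and that line:

import numpy as np
import matplotlib.pyplot as plt

def cobweb_points(x0, r, n_steps):
    """Vertices of the cobweb path: (x0,x0) -> (x0,x1) -> (x1,x1) -> (x1,x2) -> ..."""
    pts = [(x0, x0)]
    x = x0
    for _ in range(n_steps):
        x_next = r * x * (1 - x)
        pts.append((x, x_next))       # vertical move to the map's curve
        pts.append((x_next, x_next))  # horizontal move to the diagonal
        x = x_next
    return np.array(pts)

r, x0 = 3.6, 0.2
xs = np.linspace(0, 1, 400)
plt.plot(xs, r * xs * (1 - xs), label="x_{n+1} = r x_n (1 - x_n)")
plt.plot(xs, xs, "--", label="x_{n+1} = x_n")  # fixed points lie on this line
path = cobweb_points(x0, r, n_steps=50)
plt.plot(path[:, 0], path[:, 1], lw=0.8)
plt.xlabel("x_n"); plt.ylabel("x_{n+1}")
plt.legend(); plt.show()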
25.1.3 Bifurcation diagrams
A bifurcation diagram has as its horizontal axis r (i.e. from the logistic map) and x_n as its vertical axis. We also remove the transient from the front of the trajectory (knowing how to remove the transient, i.e. how many points to throw away, takes a bit of trial and error). That way we only see the points the system settles on for any given value of r (if indeed it settles).

A cobweb plot for the logistic map, from Wikipedia
So for each value of r that leads to a fixed point attractor, there is only one value of x_n. For each value of r that leads to a periodic attractor, we may have two or a few values of x_n. For each value of r that leads to a chaotic attractor, we will have many, many values of x_n.
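A sketch of that procedure (Python, with NumPy and matplotlib assumed; the transient length is chosen by trial and error, as noted above): sweep r, iterate the map, discard the transient, and plot the remaining iterates.

import numpy as np
import matplotlib.pyplot as plt

n_transient = 500   # iterates to discard (found by trial and error)
n_keep = 200        # settled iterates to plot for each value of r

for r in np.linspace(2.5, 4.0, 1000):
    x = 0.2
    for _ in range(n_transient):       # let the transient die out
        x = r * x * (1 - x)
    xs = np.empty(n_keep)
    for i in range(n_keep):            # record the attractor
        x = r * x * (1 - x)
        xs[i] = x
    plt.plot(np.full(n_keep, r), xs, ",k", alpha=0.25)

plt.xlabel("r"); plt.ylabel("x_n")
plt.show()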
Bifurcation diagram of the logistic map, from Wikipedia
When looking at a bifurcation diagram, you may notice some interesting structures. In particular, you may notice that some periodic attractors bifurcate into periodic attractors of double the period (e.g. a 2-cycle that turns into a 4-cycle at some value of r). This is called a period-doubling cascade.
You may also notice “dark veils” (they can look like thin lines cutting through) in the chaotic parts
of the bifurcation diagram - they are the result of unstable periodic orbits.
The bifurcation diagram can also be a fractal object in that it can contain copies of itself within
itself. For example, you can “zoom in” on the diagram and find its own structure repeated at smaller
levels.
Note that many, but not all, chaotic systems have a fractal state-space structure.
25.1.4 Feigenbaum number
If you look at the parts between bifurcations (the “pitchforks”) in a bifurcation diagram, you may
notice that their widths and heights decrease at a constant ratio.
Feigenbaum numbers from a bifurcation diagram, source
If we take ∆_i to be the width of bifurcation i, we can frame this as (for widths) ∆_2/∆_1 = ∆_3/∆_2. We can look at the limit of this as n → ∞ to figure out this ratio. For the logistic map:

\lim_{n \to \infty} \frac{\Delta_n}{\Delta_{n+1}} = 4.66

This value is called the Feigenbaum number, and it holds (as 4.66) for any 1D map with a quadratic maximum (i.e. it looks like a parabola near its maximum).
For the heights of these pitchforks there’s a different value that’s computed in a similar way.
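As a rough numerical illustration (a Python sketch; the bifurcation values of r below are commonly cited approximations for the logistic map, assumed here rather than derived), the ratios of successive pitchfork widths approach the Feigenbaum number:

# Approximate parameter values r_k at which the logistic map's period doubles
# (1 -> 2, 2 -> 4, 4 -> 8, 8 -> 16); commonly cited values, assumed here.
r = [3.0, 3.44949, 3.54409, 3.56441]

# Widths of successive "pitchforks" and the ratios of consecutive widths.
widths = [r[k + 1] - r[k] for k in range(len(r) - 1)]
ratios = [widths[k] / widths[k + 1] for k in range(len(widths) - 1)]
print(ratios)  # roughly [4.75, 4.66], approaching the Feigenbaum number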
25.1.5 Sensitivity to initial conditions
One way of describing maps' sensitivity to initial conditions is that they bring far-apart points close together and push close-together points far away from each other.
One analogy is the kneading of dough. As you knead dough, parts that were close together end up
far apart, and parts that were far apart end up close together (although if we consider the kneading
as a continuous process, technically, this is a flow, but we can imagine it as discrete time steps).
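A minimal sketch of this sensitivity (Python; r = 3.6 is the chaotic value mentioned earlier): start two logistic-map orbits a tiny distance apart and watch the separation grow.

def logistic(x, r=3.6):
    """One step of the logistic map in its chaotic regime (r = 3.6, see above)."""
    return r * x * (1 - x)

x, y = 0.2, 0.2 + 1e-10   # two nearly identical initial conditions
for n in range(1, 141):
    x, y = logistic(x), logistic(y)
    if n % 20 == 0:
        print(f"n = {n:3d}   |x_n - y_n| = {abs(x - y):.3e}")
# The separation grows (roughly exponentially in n) until the two orbits
# are effectively unrelated, despite starting only 1e-10 apart.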
25.2 Flows
Flows describe systems that operate in continuous time (e.g. the double-jointed pendulum). They
are modeled using differential equations rather than difference equations.
All the concepts that apply to maps also apply to flows.
In the double-jointed pendulum, the state has four components: the angle of the top joint θ1 , the
angle of the lower joint θ2 , the angular velocity of the top joint ω1 , and the angular velocity of the
lower joint ω2 .
A system without any friction (more generally, friction is known as dissipation) is called a conservative system or a Hamiltonian system or a non-dissipative system. These systems do not
have attracting fixed points because there is nothing to cause the transient to die out. They do,
however, still have fixed points - just not attracting ones - and they still have chaos - just not chaotic
attractors.
Conversely, dissipative systems are those that have attractors.
25.2.1 Ordinary differential equations
An ODE expresses relationships between derivatives of an unknown function.
For example:

\frac{d}{dt} x(t) = 1

The unknown function here is x(t), and the derivative is with respect to time t.
To solve this, we know that the derivative of x(t) is equal to 1, so we ask - what x(t) - that is, what
set of functions - would make this true? Here, x(t) can be any function of time that has a slope of
1, i.e. x(t) = t + C.
Another example: say we have the ODE x''(t) = −x(t) and the initial condition x(t = 0) = 1.
This is asking: what function is the negative of its own second derivative? This could be sin or cos.
The initial condition restricts this to cos because only cos(0) = 1.
This is an analytic solution (closed-form, i.e. it can be written out finitely). In most cases, we will not be solving ODEs analytically, but numerically. This is because ODEs that can be solved analytically are, by definition, not chaotic.
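As a quick symbolic check of the x''(t) = −x(t) example (a sketch using SymPy, which the notes don't mention; note that pinning down a unique solution of a second-order ODE actually requires a second condition, so x'(0) = 0 is added here as an assumption):

import sympy as sp

t = sp.symbols("t")
x = sp.Function("x")

# x''(t) = -x(t), with x(0) = 1 and (added assumption) x'(0) = 0
sol = sp.dsolve(
    sp.Eq(x(t).diff(t, 2), -x(t)),
    x(t),
    ics={x(0): 1, x(t).diff(t).subs(t, 0): 0},
)
print(sol)  # Eq(x(t), cos(t))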
ODEs may be linear, having the form of a sum of "constant times variable" terms, e.g. y = ax + b, where a, b are constants and y, x are variables. Or ODEs may be nonlinear, involving powers of the variables, products of variables, or transcendental functions of the variables.
Note that nonlinearity is a necessary condition for chaos. If the ODE is nonlinear, it's possible there may not be an analytic solution - this property is known as nonintegrability, and it is a necessary and sufficient condition for chaos. Nonintegrable ODEs must be solved numerically.
Technically, “nonintegrability” only applies to Hamiltonian systems, but here we will use it as a
shorthand for “has no analytic solution” more generally.
With flows we can think of fixed points in terms of a “dynamics landscape”, i.e. the topography of
the system. We can think of stable fixed points as some kind of “bowl” that values “roll down”. In
contrast, an unstable fixed point could be either an upside-down bowl or a saddle.
Sidebar on some linear algebra: matrices can be applied to transform space, i.e. for rotations, scaling, translations, etc. A matrix of eigenvectors allows us to express the fundamental features of a landscape. A point that starts on an eigenvector stays on that eigenvector ("eigen" is German for "own" or "characteristic"). An eigenvalue tells you how fast a state travels along an eigenvector and in what direction - specifically, the movement is exponential: e^{st}, where s is the eigenvalue. Each eigenvalue is associated with an eigenvector.
Say we have two crossing eigenvectors, one of which is associated with eigenvalue s_1 and one associated with eigenvalue s_2. Both s_1, s_2 are negative, which means that e^{st} shrinks, meaning that both eigenvectors "point" inwards (note that the fixed point is marked with *):
s_1
|
v
|
--->-*-<---s_2
|
^
|
That is, we have a bowl shape.
If instead both eigenvalues were positive, we’d have an upside-down bowl.
If one were positive and one were negative, we’d have a saddle.
Of course in practice, the forms (bowl, upside-down bowl, saddle) are rarely this neat and tidy, but
often we use these as (linear) approximations when looking locally (i.e. “zoomed in” on a particular
region). When looking at a larger scale, we instead must resort to nonlinear mathematics - the
eigenvectors typically aren’t “straight” at larger scales; they may become curvy.
When a fixed point’s unstable eigenvector (that is, the one moving away from the fixed point) connects
to the stable eigenvector of another fixed point (that is, the eigenvector moving into the other fixed
point), that is called a heteroclinic orbit. For example (the relevant part has double-arrows, the
weird hump is meant to be a curve to show that these eigenvectors are linear only locally around each
fixed point):
|
|
v
|
-->>-/
---<-*->>-/
v
\
|
\-->>-*-<--
|
|
^
^
|
|
On the other hand, if a fixed point’s unstable eigenvector, in the large scale, loops back and connects
to its stable eigenvector, that is called a homoclinic orbit.
/-<<--\
|
|
|
|
|
^
v
|
|
/
---<-*->>--/
|
^
|
We call these larger structures (i.e. when looking beyond just the local eigenvectors, but rather the full
curves that connect them) the stable or unstable manifolds of a fixed point. They are like nonlinear
generalizations of eigenvectors in that they are invariant manifolds; that is, a state that starts on
one of these manifolds stays on the manifold. They start out tangent to the eigenvectors (which is
why we just use eigenvectors locally), but as mentioned before, they “curve” out depending on the
dynamics landscape.
Growth/movement along these manifolds is also exponential, like it is for eigenvectors.
If all of a fixed point's manifolds are stable, it is an attracting fixed point (some kind of bowl, roughly speaking, but nonlinear). If all of its manifolds are unstable, it is a repelling fixed point (some kind of upside-down bowl, roughly speaking).
Also note: a nonlinear system can have any number of attractors, of all types (fixed points, limit
cycles/periodic orbits, quasiperiodic orbits [not discussed in this class], chaotic attractors) scattered
throughout its state space, but there is no way of knowing a priori where they are and what type
they are (or even how many there are).
Every point in the state space is in the basin of attraction of some attractor. The basins of attraction
and the basin boundaries partition the state space.
25.2.2 More on ODEs
An nth-order ODE can be broken up into n 1st-order ODEs.
For example, take the ODE for a simple harmonic oscillator (a mass on a spring):
m x'' + \beta x' + k x - m g = 0
This is a 2nd-order ODE. We can break it down into 1st-order ODEs like so:
1. Isolate the highest-order term:

x'' = \frac{mg - \beta x' - kx}{m}
2. Then define a helper variable:
x′ = v
3. Rewrite the whole equation using the helper variable:
v' = g - \frac{\beta}{m} v - \frac{k}{m} x
We have actually defined two first-order ODEs (that is, it is a 2D ODE system), which we can
represent as a vector:
\begin{bmatrix} x' \\ v' \end{bmatrix} = \begin{bmatrix} v \\ g - \frac{\beta}{m} v - \frac{k}{m} x \end{bmatrix}
There are no derivatives on the right-hand side, which is how we want things to be. The derivatives
are isolated and the right-hand side just captures the dynamics of the system. The vector on the
left-hand side is called a state vector.
Here we started with a 2nd-order ODE so we only required one helper variable. More generally, for
an nth-order ODE, you require n − 1 helper variables.
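A minimal sketch of this system in code (Python; the parameter values are made up for illustration): the dynamics are captured by a function that takes the state vector [x, v] and returns its derivative [x', v'].

m, beta, k, g = 1.0, 0.1, 2.0, 9.8   # illustrative parameter values

def oscillator_deriv(state):
    """Right-hand side of the 2D ODE system for the harmonic oscillator above."""
    x, v = state                      # state vector [x, v]
    x_dot = v                         # x' = v (the helper variable)
    v_dot = g - (beta / m) * v - (k / m) * x
    return [x_dot, v_dot]

print(oscillator_deriv([0.0, 0.0]))   # slope of the state at x = 0, v = 0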
Note that at least 3 dimensions are necessary for a chaotic system.
The general form for an nth-order ODE system is as follows:
\dot{x}_1 = f_1(x_1, \dots, x_n)
\dot{x}_2 = f_2(x_1, \dots, x_n)
\vdots
\dot{x}_n = f_n(x_1, \dots, x_n)
(As a reminder, ẋ is another notation for the derivative of x.)
The state variables can be represented as a state vector:

\vec{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
This system defines a vector field. For every value of \vec{x}, we can compute \dot{\vec{x}} = \vec{f}(\vec{x}), which tells us the slope at that point (i.e. which way is downhill, and how steep it is).
For linear systems, matrices can describe how a “ball rolls in a landscape” (e.g. bowls, saddles, etc).
The description is only good locally for nonlinear systems, as mentioned earlier.
For example, consider the following 2D linear system expressed with ODEs:
\dot{x}_1 = a x_1 + b x_2
\dot{x}_2 = c x_1 + d x_2
This can be rewritten as:

\dot{\vec{x}} = A \vec{x}, \qquad \vec{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}
So the matrix A describes the dynamics of the system.
But with a nonlinear system, we cannot write down such a matrix A and have only numbers in it.
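Tying this back to the earlier sidebar on eigenvectors, here is a sketch (Python with NumPy; the entries of A are made up) that classifies the fixed point at the origin of a 2D linear system from the signs of the eigenvalues of A:

import numpy as np

A = np.array([[-1.0, 0.5],
              [ 0.0, -2.0]])   # made-up dynamics matrix for x_dot = A x

eigvals, eigvecs = np.linalg.eig(A)
print("eigenvalues:", eigvals)

# Simple classification by the signs of the (real parts of the) eigenvalues;
# complex eigenvalues (spirals/oscillation) are ignored in this sketch.
real_parts = eigvals.real
if np.all(real_parts < 0):
    print("both negative: a 'bowl' (attracting fixed point)")
elif np.all(real_parts > 0):
    print("both positive: an 'upside-down bowl' (repelling fixed point)")
else:
    print("mixed signs: a saddle")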
25.2.3 Reminder on the distinction between difference and differential equations
A differential equation \vec{f} takes a state vector \vec{x} and gives us \dot{\vec{x}}, that is, the derivative of \vec{x}.
A difference equation \vec{f} takes a state vector \vec{x}_n and gives us the state vector at the next (discrete) time step, \vec{x}_{n+1}.
25.2.4 ODE Solvers
An ODE solver takes as input:
• an ODE
• initial conditions, \vec{x}(t = t_0)
• a time difference ∆t
and gives as output an estimate of \vec{x}(t_0 + ∆t).
There are different methods of doing this, but a common one is Forward Euler, sometimes just
called Euler’s method or “follow the slope” - as it says, you just follow the slope to the next
point. But how far do you follow the slope? There may be a lot of “bumps” in the landscape in
which case following the slope at one point may become inaccurate after some distance (e.g. it may
“overstep”). Shorter steps are computationally more expensive, since you must re-calculate the slope
more frequently, but give greater accuracy. For an ODE solver, this step size is controlled via the ∆t input. These two factors - the shape of the landscape and the time step - are the main contributors to error here.
For Forward Euler, the estimate of \vec{x}(t_0 + ∆t) is computed as follows:

\vec{x}(t_0 + \Delta t) = \vec{x}(t_0) + \Delta t \cdot \vec{x}'(t_0)
A related method is Backward Euler:

\vec{x}(t_0 + \Delta t) = \vec{x}(t_0) + \Delta t \cdot \vec{x}'_{FE}(t_0 + \Delta t)

where \vec{x}'_{FE}(t_0 + \Delta t) is not the derivative at the original point, but rather the derivative at the point reached after one time step of Forward Euler.
Intuitively, this is like taking a "test step", computing the derivative there, moving back to the start, and then moving based on the derivative computed from the test step.
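A sketch of both update rules as described here (Python with NumPy; note that the "Backward Euler" rule is implemented exactly as the text defines it, via a Forward Euler test step, rather than as the usual implicit method). The oscillator_deriv function from the earlier sketch could be used as the right-hand side:

import numpy as np

def forward_euler_step(f, x, dt):
    """x(t0 + dt) ~= x(t0) + dt * x'(t0): follow the slope at the current point."""
    x = np.asarray(x, dtype=float)
    return x + dt * np.asarray(f(x))

def backward_euler_step(f, x, dt):
    """As defined in the text: take a Forward Euler 'test step', evaluate the
    slope there, then step from the original point using that slope."""
    x = np.asarray(x, dtype=float)
    x_fe = x + dt * np.asarray(f(x))        # test step
    return x + dt * np.asarray(f(x_fe))     # step using the slope at the test point

# Usage with the oscillator sketch from above (hypothetical helper):
# x1 = forward_euler_step(oscillator_deriv, [0.0, 0.0], dt=0.01)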
Note that Forward Euler and Backward Euler have numerical damping effects. For Backward Euler,
it is positive damping, so it acts sort of like friction; for Forward Euler it is negative. The results of
these computational precision errors, however, are indistinguishable from natural effects, which makes
them difficult to deal with.
Note that Forward Euler is equivalent to the first part of a Taylor series, which is also used to
approximate a point locally:
f(x_0 + \Delta x) = f(x_0) + \Delta x \, f'(x_0) + \frac{1}{2} (\Delta x)^2 f''(x_0) + \dots + \frac{1}{n!} (\Delta x)^n f^{(n)}(x_0)
There are also other errors such as floating point errors - e.g. truncation or roundoff errors, depending on how they are handled. This is common with sensors. These